# Brief Review of Reinforcement Learning

What makes reinforcement learning different from other machine learning paradigms?
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non-i.i.d. data)
- Agent’s actions affect the subsequent data it receives

## Rewards

A reward $R_{t}$ is a scalar feedback signal

It indicates how well the agent is doing at step $t$

The agent’s job is to maximise cumulative reward

Reinforcement learning is based on the **reward hypothesis**

*: All goals can be described by the maximisation of expected cumulative reward*

## Sequential Decision Making

- It means making a series of decisions or taking a sequence of actions over time to achieve a long-term goal

- Goal: select actions to maximise total future reward

*E.g.*

Let's say you're playing a game where you're trying to reach a treasure chest at the end of a maze. You start off in a room with multiple doors, and you have to choose which door to go through to get to the next room. Each door leads to a different room, and some rooms have traps that will cause you to lose a life.

In this scenario, sequential decision making means that you have to make a series of decisions, one after the other, in order to reach your goal of getting to the treasure chest. Each decision you make affects your chances of success, because some rooms have traps and some don't, and you won't know which is which until you enter them.

- Actions may have long term consequences

: Actions having long-term consequences means that the decision you make at each stage of the game can have an impact on the rest of the game.

- Reward may be delayed

: The agent may not receive a reward immediately after taking an action. Instead, the reward may be delayed and only received after some time has passed or after a sequence of actions has been taken.

- It may be better to sacrifice immediate reward to gain more long-term reward
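
The last point can be made concrete with a small, made-up example in the spirit of the maze above. The reward numbers and the names `greedy_rewards` and `patient_rewards` are assumptions chosen purely for illustration: one path grabs a small reward immediately, the other pays a cost first and reaches the treasure later.

```python
# Hypothetical reward sequences for two ways through the maze.
greedy_rewards = [1.0, 0.0, 0.0, 0.0]    # pick up a coin in the first room, then nothing
patient_rewards = [-1.0, 0.0, 0.0, 5.0]  # lose a little at first, reach the treasure at the end

greedy_total = sum(greedy_rewards)
patient_total = sum(patient_rewards)

# The agent's job is to maximise cumulative reward, not the first reward.
print(greedy_total, patient_total)  # 1.0 4.0
```

Judged only by the first step, the greedy path looks better (1.0 versus -1.0), but the cumulative reward favours the patient path.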

## Agent and Environment

- Agent

An agent is an entity that learns to perform actions in an environment to maximize a cumulative reward signal. The agent interacts with the environment by taking actions and observing the resulting state and reward, and uses this information to improve its decision-making process over time. The goal of the agent is to learn a policy that maps states to actions in a way that maximizes the expected cumulative reward.

- Environment

The environment is the problem that the agent is trying to solve, and the agent's goal is to learn how to interact with the environment in a way that maximizes its rewards.

- Observation: $O_{t}$

Observation refers to the current state of the environment as perceived by the agent. Observations can include information about the agent's position, velocity, sensory input, and any other relevant features of the environment that are necessary for the agent to make decisions and take actions. The agent uses its observations to learn about the environment and determine the best actions to take in order to achieve its goals. 

- action: $A_{t}$

An action refers to the decision made by an agent at a particular time step in response to the observation it receives from the environment. It is the agent's way of influencing the environment in order to achieve a certain goal or maximize its cumulative reward. The action can take various forms depending on the specific problem, such as moving a robot, playing a move in a game, or selecting an advertisement to display to a user.

The time step $t$ increments at each environment step.

At each step $t$ the agent:
- Executes $A_{t}$
- Receives $O_{t}$
- Receives $R_{t}$

The environment:
- Receives $A_{t}$
- Emits $O_{t+1}$
- Emits $R_{t+1}$
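
The interaction loop above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: `CoinFlipEnv`, its 0.7 bias, and `run_episode` are hypothetical names and values invented for illustration.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses the next flip of a biased coin."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        # The environment receives A_t and emits O_{t+1} and R_{t+1}.
        coin = 1 if self.rng.random() < 0.7 else 0
        reward = 1.0 if action == coin else 0.0
        return coin, reward


def run_episode(env, policy, num_steps):
    observation = 0          # arbitrary initial observation O_1 (an assumption)
    total_reward = 0.0
    history = []
    for _ in range(num_steps):
        action = policy(observation)                 # agent executes A_t
        next_observation, reward = env.step(action)  # environment emits O_{t+1}, R_{t+1}
        history.append((observation, action, reward))
        total_reward += reward
        observation = next_observation
    return total_reward, history


# A constant policy that always guesses 1.
total, history = run_episode(CoinFlipEnv(seed=0), lambda obs: 1, num_steps=100)
print(total)
```

The agent only ever sees what `step` returns. The coin's bias lives inside the environment and is never exposed directly.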

<img src="RL1.png" alt="RL1"/>


## History and State

- History: $H_{t}$

: The history is the sequence of observations, actions and rewards

$H_{t} = O_{1}, R_{1}, A_{1}, ..., A_{t-1}, O_{t}, R_{t}$

- State: $S_{t} = f(H_{t})$

: The information used to determine what happens next

## Environment state

The environment state $S_{t}^{e}$ is the environment's private representation

It is whatever data the environment uses to produce the next observation and reward. It is private because the agent does not have direct access to it and must infer it from its own observations and actions.

The environment state is not usually visible to the agent

*Q. The agent acts in the environment, yet the environment state is usually not visible to it. How can the agent still choose good actions?*

- The agent's interaction with the environment (through observations and actions) can provide information that allows the agent to make inferences about the environment state.

Even if $S_{t}^{e}$ is visible, it may contain irrelevant information



## Agent state

The agent state $S_{t}^{a}$ is the agent's internal representation

- Whatever information the agent uses to pick the next action
- It is the information used by RL algorithms
- It can be any function of history : the agent can use any combination of past observations and actions in order to determine its next action.


$S_{t}^{a} = f(H_{t})$
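
Since the agent state can be any function of the history, several choices of $f$ are possible. The sketch below shows three of them on a made-up history; the tuple layout `(observation, reward, action)` and the function names are assumptions for illustration only.

```python
# Hypothetical history: one (O, R, A) tuple per step.
# The last action is None because A_t has not been chosen yet.
history = [("o1", 0.0, "a1"), ("o2", 1.0, "a2"), ("o3", 0.0, None)]


def last_observation(history):
    # f(H_t) = O_t: cheap, but forgets everything else.
    return history[-1][0]


def last_k_observations(history, k=2):
    # f(H_t) = (O_{t-k+1}, ..., O_t): a short window of recent observations.
    return tuple(step[0] for step in history[-k:])


def full_history(history):
    # f(H_t) = H_t: loses nothing, but grows with every step.
    return tuple(history)


print(last_observation(history))        # o3
print(last_k_observations(history))     # ('o2', 'o3')
print(len(full_history(history)))       # 3
```

Which $f$ is appropriate depends on the problem: a richer state keeps more information but is harder to learn from.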

## Information state (Markov state)

An information state (a.k.a. Markov state) contains all useful information from the history

- Definition

A state $S_{t}$ is Markov if and only if $\mathbb{P}[S_{t+1}|S_{t}] = \mathbb{P}[S_{t+1}| S_{1}, ... , S_{t}]$

The statement means that a state at time t is Markovian if and only if the probability of transitioning to the next state at time t+1, given the current state at time t, is equal to the probability of transitioning to the next state at time t+1, given all the previous states from time 1 up to time t. In other words, the current state contains all the relevant information necessary to predict the future, and there is no additional information from the past that is needed to make accurate predictions.

So if the current state already captures everything from the past that matters for the future, we say it is *Markovian*.

If the probability of transitioning to the next state at time t+1, given the current state at time t, is not equal to the probability of transitioning to the next state at time t+1, given all the previous states from time 1 up to time t, then the state is *not Markovian.*

The future is independent of the past given the present

: the future state depends only on the present state and not on the history of states.

$H_{1:t} \rightarrow S_{t} \rightarrow H_{t+1:\infty}$

Once the state is known, the history may be thrown away
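
A small deterministic example may help to see when a state is or is not Markov. The sketch assumes a ball moving at constant velocity along a line; the setup and all names (`next_position`, `markov_state`, `step`) are hypothetical.

```python
def next_position(positions):
    """Ball moving at constant velocity: the next position needs the last two positions."""
    velocity = positions[-1] - positions[-2]
    return positions[-1] + velocity


history_a = [0, 1, 2]  # moving right, currently at position 2
history_b = [4, 3, 2]  # moving left, currently at position 2

# State = position only: both histories give the same state ...
assert history_a[-1] == history_b[-1]
# ... but different futures, so position alone is not a Markov state.
print(next_position(history_a), next_position(history_b))  # 3 1


def markov_state(positions):
    # State = (position, velocity): a sufficient summary of the history.
    return (positions[-1], positions[-1] - positions[-2])


def step(state):
    # The next state is computed from the current state alone.
    position, velocity = state
    return (position + velocity, velocity)


print(step(markov_state(history_a)))  # (3, 1)
print(step(markov_state(history_b)))  # (1, -1)
```

With `(position, velocity)` as the state, the history can be discarded: `step` never looks at it.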

The environment state $S_{t}^{e}$ is Markov

: Because it satisfies the Markov property. Specifically, the environment state at time t contains all the information necessary to predict the future evolution of the environment, given the current observation and action of the agent. In other words, the environment state at time t summarizes all relevant information from the past that is necessary to make accurate predictions about the future. This means that the environment state satisfies the Markov property, which is a key assumption in many reinforcement learning algorithms.

E.g 

Suppose you have a robot vacuum cleaner that can move around a room and clean up dirt. The state of the environment at time t could be the positions of the robot and the dirt, as well as other relevant information such as the location of obstacles in the room. The robot's sensors can detect the current position and any obstacles or dirt nearby, and its actions are to move in a certain direction or to clean the dirt at its current position.

The environment state is Markov in this case because the future state of the environment (i.e. the positions of the robot and dirt at time t+1) only depends on the current state of the environment (i.e. the positions of the robot and dirt at time t) and the action taken by the robot at time t. There is no additional information from past states that is necessary to predict the future state.

The history $H_{t}$ is Markov

: Because conditioning on the full history is conditioning on everything that has happened so far, so nothing from the past is left that could add information. Taking $S_{t} = H_{t}$ in the definition, the right-hand side conditions on $H_{1}, ..., H_{t}$, and every earlier history is a prefix of $H_{t}$, so both sides are equal. The history is therefore trivially Markov. Its drawback is practical: it grows with every step, which is why a compact state is usually preferred.

E.g

Suppose a robot is navigating a room and has sensors that can detect walls and obstacles, as well as motors to control its movement. The robot's history $H_{t}$ includes all of the sensor readings and motor commands up to time $t$. Anything about the next sensor readings that can be predicted from the past can be predicted from $H_{t}$, because $H_{t}$ is the entire past. Whether the current sensor readings alone are sufficient is a separate question: if they are, the latest observation is itself a Markov state and the rest of the history can be discarded.

## Fully Observable Environments

- Full observability

: agent directly observes environment state

$O_{t} = S_{t}^{a} = S_{t}^{e}$ (Agent state = environment state = information state)

Formally, this is a Markov decision process (MDP)

## Partially Observable Environments

- Partial observability: agent indirectly observes environment

E.g
- A robot with camera vision isn't told its absolute location
- A trading agent only observes current prices
- A poker playing agent only observes public cards


Now the agent state is not equal to the environment state: $S_{t}^{a} \neq S_{t}^{e}$

Formally this is a partially observable Markov decision process (POMDP)

Agent must construct its own state representation $S_{t}^{a}$, e.g.
- complete history: $S_{t}^{a} = H_{t}$
- Beliefs of environment state: $S_{t}^{a} = (\mathbb{P}[S_{t}^{e} = s^{1}], ... , \mathbb{P}[S_{t}^{e} = s^{n}])$
- RNN: $S_{t}^{a} = \sigma(S_{t-1}^{a}W_{s} + O_{t}W_{o})$
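
As an illustration of the belief-state option, the following is a minimal sketch of a Bayes-filter update over two hidden environment states. The `transition` and `likelihood` tables are made-up numbers, and the dependence on the agent's action is left out for brevity; both are assumptions, not part of the notes.

```python
def update_belief(belief, observation, transition, likelihood):
    """One belief update: predict with the transition model, then correct with the observation."""
    n = len(belief)
    # Predict: probability of each state after one environment step.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correct: weight by how likely the observation is in each state, then normalise.
    unnormalised = [predicted[j] * likelihood[j][observation] for j in range(n)]
    total = sum(unnormalised)
    return [p / total for p in unnormalised]


# Hypothetical two-state environment.
transition = [[0.9, 0.1], [0.1, 0.9]]   # transition[s][s_next]
likelihood = [[0.8, 0.2], [0.3, 0.7]]   # likelihood[s][observation]

belief = [0.5, 0.5]                     # start with no idea which state we are in
for obs in [0, 0, 1]:
    belief = update_belief(belief, obs, transition, likelihood)
print(belief)
```

The belief vector is the agent state $S_{t}^{a}$: it is updated from the previous belief and the new observation, without storing the full history.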

## Policy

: A policy is the agent's behaviour

- Deterministic policy: $a = \pi(s)$
- Stochastic policy: $\pi(a|s) = \mathbb{P}[A_{t} = a|S_{t} = s]$

![RL2](RL2.png)
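
The two kinds of policy can be sketched as follows. The states, actions and probabilities are invented for illustration, and `deterministic_policy`, `STOCHASTIC_POLICY` and `sample_action` are hypothetical names.

```python
import random


def deterministic_policy(state):
    # a = pi(s): the same state always yields the same action.
    return "right" if state < 3 else "left"


# pi(a|s): for each state, a probability distribution over actions.
STOCHASTIC_POLICY = {
    0: {"left": 0.2, "right": 0.8},
    1: {"left": 0.5, "right": 0.5},
}


def sample_action(policy, state, rng):
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
samples = [sample_action(STOCHASTIC_POLICY, 0, rng) for _ in range(1000)]
frequency_right = samples.count("right") / len(samples)
print(deterministic_policy(0), frequency_right)  # "right" and a value close to 0.8
```

A deterministic policy is the special case of a stochastic policy that puts probability 1 on a single action in every state.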

## Value Function

: A value function is a prediction of future reward, used to evaluate the goodness / badness of states

- $v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + ... |S_{t} = s]$

$v_{\pi}(s)$: the state-value function for policy $\pi$ at state $s$, which represents the expected return (i.e., the sum of discounted rewards) when starting from state $s$ and following policy $\pi$ thereafter.

$\mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + ... | S_{t} = s]$: the expected return for the current state $s$, which is the sum of the expected immediate reward $R_{t+1}$ and the expected return for the next state $S_{t+1}$ under the same policy $\pi$, which is discounted by a factor of $\gamma$ (i.e., the discount rate). This process continues until the end of the episode or termination of the environment.

![RL3](RL3.png)
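
The quantity inside the expectation is the discounted return of a single trajectory. A minimal sketch of computing it for one made-up reward sequence is shown below; `discounted_return` and the reward values are assumptions for illustration. The value $v_{\pi}(s)$ would be the average of this quantity over many trajectories that start in $s$ and follow $\pi$.

```python
def discounted_return(rewards, gamma):
    """Return R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for one trajectory."""
    g = 0.0
    # Work backwards: return = immediate reward + gamma * (return from the next step).
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```

The backward loop uses the same recursive structure as the definition: the return from a step is the immediate reward plus the discounted return from the next step.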

## Model

: A model predicts what the environment will do next

- $P$ predicts the next state
- $R$ predicts the next (immediate) reward, e.g.

$P_{ss'}^{a} = \mathbb{P}[S_{t+1} = s' | S_{t} = s, A_{t} = a]$

$R_{s}^{a} = \mathbb{E}[R_{t+1} | S_{t} = s, A_{t} = a]$

![RL4](RL4.png)
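
For a small problem, a model can be stored as two lookup tables, one for $P_{ss'}^{a}$ and one for $R_{s}^{a}$. The sketch below uses a made-up two-state problem; the state and action names and `simulate_step` are hypothetical.

```python
import random

# P[(s, a)][s_next] = probability of landing in s_next after taking a in s.
P = {
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "go"): {"s1": 1.0},
}
# R[(s, a)] = expected immediate reward for taking a in s.
R = {
    ("s0", "go"): -1.0,
    ("s1", "go"): 0.0,
}


def simulate_step(state, action, rng):
    """Use the model, not the real environment, to imagine one step."""
    next_states = list(P[(state, action)])
    probs = [P[(state, action)][s] for s in next_states]
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, R[(state, action)]


rng = random.Random(0)
print(simulate_step("s0", "go", rng))
```

An agent that has such tables can generate imagined experience without acting in the real environment.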

## Categorizing RL agents

- Value Based: No Policy (implicit, e.g. acting greedily with respect to the values), Value Function
- Policy Based: Policy, No Value Function
- Actor Critic: Policy, Value Function
- Model Free: Policy and/or Value Function, No Model
- Model Based: Policy and/or Value Function, Model

![RL5](RL5.png)

## Learning and Planning

: Two fundamental problems in sequential decision making

Reinforcement Learning:
- The environment is initially unknown
- The agent interacts with the environment
- The agent improves its policy


Planning:
- A model of the environment is known
- The agent performs computations with its model (without any external interaction)
- The agent improves its policy
- a.k.a. deliberation, reasoning, introspection, pondering, thought, search


In reinforcement learning, the agent does not have any prior knowledge about the environment it is interacting with. The agent learns about the environment by taking actions and receiving feedback in the form of rewards or punishments. Based on this feedback, the agent updates its policy to maximize the total reward it can receive from the environment. In simple terms, the agent learns by trial and error.

On the other hand, in planning, the agent already has a model of the environment, which means it knows how the environment works and what the outcomes of different actions will be. The agent can use this model to perform computations and simulations without any external interaction with the environment. Based on the results of these simulations, the agent can improve its policy to achieve the desired goals.

To summarize, the key difference between reinforcement learning and planning is that in reinforcement learning, the agent learns from its interaction with the environment, while in planning, the agent uses its knowledge of the environment to make decisions without any interaction.
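
To make the planning side concrete, here is a minimal sketch that improves a policy purely by computing with a known model, using value iteration as one possible planning method. The notes do not name a specific planning algorithm, so this choice is an assumption, as are the two-state MDP, its rewards and all names in the code.

```python
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

# Known model (hypothetical): deterministic transitions and immediate rewards.
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): -1.0, ("s1", "stay"): 1.0, ("s1", "go"): 0.0}


def q_value(state, action, values):
    # Immediate reward plus discounted value of the predicted next state.
    expected_next = sum(p * values[s2] for s2, p in P[(state, action)].items())
    return R[(state, action)] + GAMMA * expected_next


def value_iteration(num_sweeps=200):
    values = {s: 0.0 for s in STATES}
    for _ in range(num_sweeps):
        # No interaction with an environment: only lookups in the model.
        values = {s: max(q_value(s, a, values) for a in ACTIONS) for s in STATES}
    return values


values = value_iteration()
policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, values)) for s in STATES}
print(policy)  # {'s0': 'go', 's1': 'stay'}
```

A reinforcement learning agent facing the same problem would not be given `P` and `R`; it would have to estimate values or a policy from sampled transitions instead.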

## Exploration and Exploitation

- RL is like trial-and-error learning
- The agent should discover a good policy
- From its experiences of the environment
- Without losing too much reward along the way

*Exploration* finds more information about the environment

*Exploitation* exploits known information to maximise reward

**It is usually important to explore as well as exploit**

E.g

1. Restaurant Selection:
- Exploitation: Go to your favorite restaurant
- Exploration: Try a new restaurant

2. Online Banner Advertisements:
- Exploitation: Show the most successful advert
- Exploration: Show a different advert

3. Oil Drilling
- Exploitation: Drill at the best known location
- Exploration: Drill at a new location

4. Game Playing
- Exploitation: Play the move you believe is best
- Exploration: Play an experimental move
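
One simple way to balance the two is an epsilon-greedy rule: with a small probability the agent explores a random option, otherwise it exploits the option with the best current estimate. The sketch below applies it to a made-up version of the banner advertisement example; epsilon-greedy is not named in these notes, and the click probabilities and function name are assumptions.

```python
import random


def epsilon_greedy_bandit(success_probs, epsilon, num_steps, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(success_probs)       # how often each advert was shown
    estimates = [0.0] * len(success_probs)  # running average reward per advert
    total_reward = 0.0
    for _ in range(num_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(success_probs))                       # explore
        else:
            arm = max(range(len(estimates)), key=lambda i: estimates[i])  # exploit
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update of the estimate for the chosen advert.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


# Hypothetical click probabilities for three adverts.
estimates, counts, total = epsilon_greedy_bandit([0.2, 0.5, 0.8], epsilon=0.1, num_steps=5000)
print(estimates, counts, total)
```

With `epsilon=0` the agent never deliberately tries alternatives and can get stuck on a poor advert; with `epsilon=1` it never uses what it has learned.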



## Prediction and Control

- Prediction: evaluate the future (Given a policy)
- Control: optimise the future (Find the best policy)
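
In symbols, using the value function $v_{\pi}$ from the Value Function section, the two problems can be written as follows. The optimal value function $v_{*}$ and optimal policy $\pi_{*}$ are standard notation that these notes have not introduced, so treat them as assumptions here.

Prediction: for a given policy $\pi$, compute

$v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + ... | S_{t} = s]$

Control: over all policies, find

$v_{*}(s) = \max_{\pi} v_{\pi}(s)$

and a policy $\pi_{*}$ that achieves it, i.e. $v_{\pi_{*}}(s) = v_{*}(s)$ for every state $s$.

Control therefore contains prediction as a subproblem: to find the best policy, the agent needs some way to evaluate candidate policies.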