## Policy Search

### Reinforcement Learning

One of the "big topics" in today's machine learning: an agent performs actions that **modify the environment** and receives a **reward** for each action. At each step the agent obtains a "new" state of the environment and a reward.  
RL is considered a subfield of machine learning and AI.  
The main idea behind RL is to formalize sequential decision making, where the goal is to maximize the **sum of rewards**.
At each step $ t $ the agent :
- receives an observation $ O_t $
- receives a reward $ R_t $
- chooses an action $ A_t $

At each step $ t $ the environment:
- receives an action $ A_t $
- emits an observation $ O_{t+1} $
- emits a reward $ R_{t+1} $  
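
The two lists above describe a loop, which can be sketched in a few lines of Python. This is only a minimal illustration: `CorridorEnv`, its reward rule and `run_episode` are hypothetical names invented for this example, not part of any library.

```python
class CorridorEnv:
    """Hypothetical toy environment: a corridor of `length` cells, goal at the far end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # first observation

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # assumed reward: 1 only when the goal is reached
        return self.position, reward, done  # O_{t+1}, R_{t+1}, end-of-episode flag


def run_episode(env, choose_action, max_steps=100):
    """Run one agent-environment loop and return the rewards received."""
    observation = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = choose_action(observation)           # agent chooses A_t
        observation, reward, done = env.step(action)  # environment emits O_{t+1}, R_{t+1}
        rewards.append(reward)
        if done:
            break
    return rewards


rewards = run_episode(CorridorEnv(), lambda observation: 1)  # always move right
print(rewards)  # [0.0, 0.0, 0.0, 1.0]
```

Here the agent is just a function from observation to action; everything the agent cannot see (the internals of `CorridorEnv`) plays the role of the environment state.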

The environment state $ S_t^E $ is a function of the history of the environment, but this "real" environment state might not be fully visible to the agent.
The agent state $ S_t^A $ is a function of the history of the agent.  
An environment is **fully observable** if $ O_t = S_t^E = S_t^A $; otherwise it is **partially observable**, and the agent perceives the environment only indirectly, as in poker.  
The history of the system is the sequence of observations, actions and rewards that happened over time : $$ H_n = \lbrace S_0 , A_0 \rbrace , \lbrace S_1 , A_1 , R_1 \rbrace , \dots $$ 

The *Expected Return* is the sum of all the rewards that we expect to receive from the current time $ t $ to the "final step" $ T $, based on our knowledge of the environment and the current history: $$ G_t = \sum_{u=t}^{T-1} R_{u+1} $$  
The goal of the agent can be cast as **maximizing the expected return**.
**Scarcity of rewards** : the agent may not receive a reward at every step, but only after some time or after a certain number of steps.  
There are two main types of tasks:  
- **Continuing tasks** : there is no "final step", so the undiscounted return can be infinite and we need to introduce a "discount rate" $ \gamma \in [0, 1) $ that defines how much we value future rewards. For example $ \gamma = 0.9 $ means that we value future rewards almost as much as the current reward; on the other hand $ \gamma = 0.1 $ means that we value the current reward much more. The problem can now be cast as **maximizing the discounted rewards**
$$ G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$
- **Episodic tasks** : the interaction is split into episodes, each with a "final step"; the agent may receive its reward only at the end of each episode. The same discounted return can be used, but the sum now stops at the final step, so the return is finite.
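
As a small worked example of the discounted return, the sketch below computes it for a finite list of rewards. The function name `discounted_return` is an assumption made for this illustration; `rewards[k]` plays the role of the reward received $ k+1 $ steps after time $ t $.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for reward in reversed(rewards):
        # G_t = R_{t+1} + gamma * G_{t+1}
        g = reward + gamma * g
    return g


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=1.0))  # 3.0
```

With $ \gamma = 0.5 $ the three rewards contribute $ 1 + 0.5 + 0.25 $; with $ \gamma = 1 $ they are simply summed, which is only safe when the episode is finite.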

Reward hypothesis : all goals can be described by the maximization of the expected cumulative reward. Under this view, each action may have long-term consequences.  
When rewards are delayed, it can be better to sacrifice a short-term reward to obtain a bigger reward in the future, for example putting more fuel in an airplane to prevent it from crashing if it needs more time to reach its destination.  

There are two main types of rewards:
- **Dense rewards** : the agent receives a reward at each step. For example in chess we can give a reward for each move that is not a losing move.
- **Sparse rewards** : the agent receives a reward only at the end of the episode. For example in chess we can give a reward only if the agent wins the game.

Time-delayed labeling (aka semi-supervised learning) : the agent receives the labels only after a certain amount of time, so it sometimes needs to learn from unlabeled data.  
Also known as the "credit assignment problem" (Minsky, 1963) : how to assign the correct credit to each action that led to the reward when the reward is known only at the end of a sequence of actions. We may not know which actions in the sequence were "really" responsible for the reward.
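
One simple heuristic for spreading credit over a sequence of actions is to give each step the discounted sum of the rewards that came after it. The sketch below illustrates that idea only; it does not solve the credit assignment problem, and the name `returns_to_go` is an assumption for this example.

```python
def returns_to_go(rewards, gamma):
    """For each step, the discounted sum of the rewards from that step onwards."""
    credits = []
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
        credits.append(g)
    credits.reverse()
    return credits


# Sparse reward: only the last action is rewarded directly.
print(returns_to_go([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

Earlier actions receive less credit than later ones, whether or not they were actually the decisive ones, which is exactly why the problem is hard.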

Examples of rewards:
- Managing a portfolio : $ \pm r $ for each € gained or lost while managing the portfolio
- Controlling a power station : $ +r $ for power produced while remaining in the safe zone, $ -r $ if we go out of the safe zone

**Agent's Policy** : The definition of an Agent's behavior. It is a mapping from the agent's state to the agent's action.   
Defined as :  
- Deterministic Policy : $ \pi (s) = a $
- Stochastic Policy : $ \pi (a|s) = P[A_t = a | S_t = s] $  
where $ \pi $ is the policy, $ a $ is the action and $ s $ is the state.
  
Note that the state should contain also some information about the history of the agent.  
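
The two kinds of policy can be contrasted with a short sketch. The states (integers), the actions (`"left"` / `"right"`) and the probabilities below are hypothetical values chosen only for illustration.

```python
import random


def deterministic_policy(state):
    """pi(s) = a : every state maps to exactly one action."""
    return "right" if state < 3 else "left"


def stochastic_policy_probabilities(state):
    """pi(a|s) : a probability for each action, given the state."""
    p_right = 0.9 if state < 3 else 0.2
    return {"left": 1.0 - p_right, "right": p_right}


def sample_action(state, rng):
    """Draw one action from pi(.|s)."""
    probabilities = stochastic_policy_probabilities(state)
    actions = list(probabilities)
    weights = [probabilities[action] for action in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded generator, so runs are repeatable
print(deterministic_policy(1))                     # right
print([sample_action(1, rng) for _ in range(5)])   # mostly "right", sometimes "left"
```

The deterministic policy always gives the same answer for the same state, while the stochastic one can return different actions on different calls.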

**Markov Hypothesis** : given enough information in the current state, the future can be predicted without knowing all the past.
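
Stated more formally (a standard formulation, assuming the history of states starts at $ S_0 $), a state $ S_t $ is Markov if and only if:
$$ P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_0 , \dots , S_t] $$
In other words, once the current state is known, the rest of the history adds nothing to the prediction of the next state.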

**Sample efficiency** : how many samples are needed to learn a good policy. Very important in reinforcement learning. 

There are several main approaches to optimizing a policy :  
- Differential Programming
- Monte Carlo Methods : not very sample-efficient
- Temporal Difference Learning
- Bellman Equations 
- Exploration vs Exploitation
- Gradient Descent : EA & co. 
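
To give a feeling for why Monte Carlo methods need many samples, the sketch below estimates an expected return by averaging the returns of sampled episodes. The episode model (each of 10 steps pays a reward of 1 with probability 0.3, so the true expected return is 3) is a hypothetical toy chosen for this illustration.

```python
import random


def sample_return(rng, success_probability=0.3, horizon=10):
    """Return of one hypothetical episode: each step pays 1 with a fixed probability."""
    return sum(1.0 for _ in range(horizon) if rng.random() < success_probability)


def monte_carlo_estimate(num_episodes, seed=0):
    """Average the returns of `num_episodes` independently sampled episodes."""
    rng = random.Random(seed)
    returns = [sample_return(rng) for _ in range(num_episodes)]
    return sum(returns) / num_episodes


for n in (10, 100, 10_000):
    print(n, monte_carlo_estimate(n))  # estimates get closer to 3.0 as n grows
```

For independent samples the error of the average shrinks only as $ 1 / \sqrt{N} $, so halving the error takes roughly four times as many episodes.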

Assignment : find an (easy) example where reinforcement learning does not work.