# Reinforcement Learning Foundation
This module is based on the DeepMind lectures given by David Silver. The source material is linked below:

- [DeepMind RL Course](https://www.davidsilver.uk/teaching/)
- [Algorithms for Reinforcement Learning](https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf)

#### Table of Contents
1. Introduction (Basic Ideas behind RL)
2. Markov Decision Processes
3. Planning by Dynamic Programming
4. Model-Free Prediction
5. Model-Free Control
6. Value Function Approximation
7. Policy Gradient Methods
8. Integrating Learning and Planning
9. Exploration and Exploitation

### A Brief Introduction

Reinforcement learning is a third category of machine learning, separate from supervised and unsupervised learning.

**Sequential Decision Making**
The goal of the agent is to pick a sequence of actions that maximizes total future reward. Actions can have long-term consequences, so the reward for an action may arrive many steps later.

At each time step t, the agent:
1. Executes an action A<sub>t</sub>
2. Receives an observation O<sub>t</sub>
3. Receives a reward R<sub>t</sub>
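
The three steps above can be sketched as a simple interaction loop. This is a minimal illustration, not code from the lectures: `CoinFlipEnv`, `run_episode`, and `always_heads` are hypothetical names invented for the example.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a fair coin flip."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        # The environment receives the action and emits an observation and a reward.
        outcome = self.rng.choice([0, 1])
        reward = 1.0 if action == outcome else 0.0
        return outcome, reward


def run_episode(env, policy, num_steps):
    """Run the interaction loop and return the total reward collected."""
    total_reward = 0.0
    observation = None  # nothing has been observed before the first step
    for _ in range(num_steps):
        action = policy(observation)            # 1. execute action A_t
        observation, reward = env.step(action)  # 2-3. receive O_t and R_t
        total_reward += reward
    return total_reward


def always_heads(observation):
    """Hypothetical policy that ignores the observation and always guesses 1."""
    return 1


total = run_episode(CoinFlipEnv(seed=0), always_heads, num_steps=10)
```

Here the agent is just a function from the latest observation to an action; a learning agent would also update itself inside the loop.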

**History and State**

The history is a sequence of all prior observations, actions, and rewards.

H<sub>t</sub> = O<sub>1</sub>, R<sub>1</sub>, A<sub>1</sub>, ..., A<sub>t-1</sub>, O<sub>t</sub>, R<sub>t</sub>

#### Definitions of State
**Environment State (S<sub>t</sub><sup>e</sup>)**

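The environment's own internal representation, i.e. whatever it uses to produce the next observation and reward (e.g. the physics and differential equations that explain the physical world, or the rules of an Atari game).

**The environment state is usually not visible to the agent, so the agent must choose its actions based on its observations.**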


**Agent State (S<sub>t</sub><sup>a</sup>)**

The agent's internal representation of the world. The agent state is used to pick the actions that lead to the largest cumulative reward.

**Information State (Markov State)**

Contains all useful information from the history.

Can be expressed as:

P(S<sub>t+1</sub> | S<sub>t</sub>) = P(S<sub>t+1</sub> | S<sub>1</sub>, S<sub>2</sub>, ..., S<sub>t</sub>)

*In a Markov decision process the current state captures everything relevant from the previous states, so we can discard the earlier states and condition only on the latest one.*
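
To make the Markov property concrete, here is a small sketch of a two-state chain in which the next state is sampled from the current state alone. The weather states and the probabilities in `TRANSITIONS` are made up for illustration.

```python
import random

# Hypothetical two-state chain; each row is P(next state | current state).
TRANSITIONS = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def next_state(current, rng):
    """Sample S_{t+1} using only S_t; no earlier states are consulted."""
    states = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def sample_trajectory(start, num_steps, seed=0):
    """Roll the chain forward, passing only the latest state to next_state."""
    rng = random.Random(seed)
    trajectory = [start]
    for _ in range(num_steps):
        trajectory.append(next_state(trajectory[-1], rng))
    return trajectory


trajectory = sample_trajectory("sunny", num_steps=5)
```

Note that `next_state` never receives the trajectory, only its last element, which is exactly what the equation above states.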

#### Types of Environments

**Fully Observable**

The agent directly observes the environment state, so nothing about the system is hidden from it.

O<sub>t</sub> = S<sub>t</sub><sup>a</sup> = S<sub>t</sub><sup>e</sup>

The observation is equal to the state of the agent which is equal to the state of the environment

**Partially Observable**

The agent indirectly observes the environment

For example, a robot with a camera is not told its absolute location in the environment, so it has to explore the area to build up an understanding.


#### Architecture of Agents

**Policy:** The agent's behavior. It is a map from a state to an action. A policy can be either:

1. Deterministic: a = pi(s)
2. Stochastic: pi(a|s) = P[A<sub>t</sub> = a | S<sub>t</sub> = s]
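
The difference between the two kinds of policy can be sketched as follows. The integer state, the two actions, and the 0.8/0.2 probabilities are arbitrary choices for this example.

```python
import random

ACTIONS = ["left", "right"]


def deterministic_policy(state):
    """a = pi(s): the same state always maps to the same action."""
    return "right" if state >= 0 else "left"


def stochastic_policy_probs(state):
    """pi(a | s): a probability for each action, conditioned on the state."""
    p_right = 0.8 if state >= 0 else 0.2
    return {"left": 1.0 - p_right, "right": p_right}


def sample_action(state, rng):
    """Draw one action from the stochastic policy's distribution."""
    probs = stochastic_policy_probs(state)
    weights = [probs[a] for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


action = sample_action(0, random.Random(0))
```

A deterministic policy is the special case of a stochastic one that puts probability 1 on a single action.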

**Value Function:** How good is each state or action? It is a prediction of expected future reward and is used to evaluate the goodness/badness of states.

V<sub>pi</sub>(s) = E<sub>pi</sub>[R<sub>t</sub> + gamma R<sub>t+1</sub> + gamma<sup>2</sup> R<sub>t+2</sub> + ... | S<sub>t</sub> = s]

where gamma is a discount factor between 0 and 1. A reward n steps ahead is weighted by gamma<sup>n</sup>, so immediate rewards count for more than distant ones.
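
As a rough illustration of the discounted sum inside the expectation, the sketch below computes it for one finite list of rewards. The function name `discounted_return` and the sample rewards are assumptions for the example; the value function itself is the average of this quantity over many trajectories generated by following the policy.

```python
def discounted_return(rewards, gamma):
    """Compute R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... for a finite list."""
    total = 0.0
    # Work backwards so each step needs only one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With gamma = 0 only the immediate reward counts, and with gamma = 1 all rewards count equally.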

**Model:** The agent's representation of the environment. Predicts what the environment will do next. 

1. Transitions: P predicts next state (e.g. dynamics)
2. Rewards: R predicts next immediate reward

A dumb way to think about it: "If we take this transition, what is the reward?"
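
For a small problem, the two parts of a model can be written as lookup tables. This is a minimal sketch: the states, actions, probabilities, and rewards below are invented for the example.

```python
# Hypothetical model of a battery-powered robot, written as lookup tables.
# P[(state, action)] maps each possible next state to its probability.
P = {
    ("low", "wait"): {"low": 1.0},
    ("low", "charge"): {"high": 1.0},
    ("high", "wait"): {"high": 1.0},
    ("high", "search"): {"high": 0.7, "low": 0.3},
}

# R[(state, action)] is the predicted immediate reward.
R = {
    ("low", "wait"): 0.0,
    ("low", "charge"): -1.0,
    ("high", "wait"): 0.0,
    ("high", "search"): 2.0,
}


def predict(state, action):
    """Return the model's prediction: (next-state distribution, immediate reward)."""
    return P[(state, action)], R[(state, action)]


next_state_probs, reward = predict("high", "search")
```

Larger problems usually replace the tables with learned functions, but the interface stays the same: state and action in, predicted next state and reward out.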

#### Categories of Agents

**Value based:** value function, no explicit policy (the policy is implicit in the value function)

**Policy based:** policy, no value function

**Actor-Critic:** both a policy and a value function

**Model Free:** policy and/or value function, no model (the agent does not try to learn how the environment works)

**Model Based:** policy and/or value function, plus a model of the environment

#### General Problems with RL

Sequential decision making involves two fundamental problems:

- **RL:** The environment is initially unknown, and the agent must interact with it through trial and error to gain an understanding.
- **Planning:** A full model of the environment is already known; instead of interacting with the environment, the agent performs computations with its model.
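
The contrast between the two problems can be sketched on a two-armed bandit. This is only an illustration: the arm names, payout probabilities, and function names are hypothetical, and the learning routine is a plain sample average rather than any specific algorithm from the course.

```python
import random

# Hypothetical two-armed bandit: probability that each arm pays a reward of 1.
TRUE_PAYOUT = {"arm_a": 0.3, "arm_b": 0.7}


def plan(model):
    """Planning: the model is known, so pick the best arm by computation alone."""
    return max(model, key=model.get)


def pull(arm, rng):
    """The environment; a learner only ever sees the sampled rewards."""
    return 1.0 if rng.random() < TRUE_PAYOUT[arm] else 0.0


def learn(pull_fn, arms, pulls_per_arm, rng):
    """RL: the model is unknown, so estimate each arm's value by trial and error."""
    estimates = {}
    for arm in arms:
        rewards = [pull_fn(arm, rng) for _ in range(pulls_per_arm)]
        estimates[arm] = sum(rewards) / len(rewards)
    return max(estimates, key=estimates.get), estimates


best_by_planning = plan(TRUE_PAYOUT)
best_by_learning, estimates = learn(pull, list(TRUE_PAYOUT), 2000, random.Random(0))
```

`plan` reads the payout table directly, while `learn` never looks at `TRUE_PAYOUT` and has to spend pulls on both arms to find out which one is better.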

### Markov Decision Processes

A Markov Decision Process essentially says that all prior steps can be summarized by the current state; in short, the future depends on the past only through the present state.