Reinforcement Learning
======================
Reinforcement learning (RL) is a type of machine learning in which an agent learns how to behave in an environment by performing actions and receiving rewards. The environment may be uncertain and complex, and the agent is never told which action is correct: it learns by trial and error, from the consequences of its own actions, to achieve a clear objective.
The core ingredients are states, actions, and rewards. The agent interacts with the environment by taking actions, receives rewards as a result, and tries to maximize the reward it collects over time. Depending on the method, the agent may learn a policy that maps states to actions, a value function that maps states or state-action pairs to the expected long-term reward (the return, defined below) rather than just the immediate reward, and a model of the environment that predicts the next state and reward given the current state and action.
Some applications include:
- controlling robots
- factory optimization
- financial (stock) trading
- playing games (including video games)

Mars Rover Example
==================
Let's consider the example of a Mars rover. The rover is the agent, and it learns to navigate the Martian surface. Its states are locations on the surface, its actions are movements in different directions, and its rewards are scientific discoveries. By exploring so as to maximize those discoveries, the rover can learn a policy that tells it where to go next, a value function that tells it how valuable each location is, and a model of the environment that predicts what it will find at each location.
Each transition is written as a 4-element tuple: (state, action, reward, next_state).
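As a minimal sketch of that tuple, the snippet below stores transitions in a `NamedTuple`. The integer cell numbers, the action names `"left"` and `"right"`, and the reward values are hypothetical and chosen only for illustration; the notes do not fix a concrete layout.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    state: int       # location of the rover before acting (hypothetical cell index)
    action: str      # movement chosen in that state (hypothetical names)
    reward: float    # reward recorded for this step
    next_state: int  # location of the rover after acting


# Hypothetical episode: the rover starts in cell 4 and drives left to cell 1.
episode = [
    Transition(state=4, action="left", reward=0.0, next_state=3),
    Transition(state=3, action="left", reward=0.0, next_state=2),
    Transition(state=2, action="left", reward=100.0, next_state=1),
]

for t in episode:
    print(t)
```

Consecutive transitions chain together: the `next_state` of one step is the `state` of the following step.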

The Return in RL
-----------------
The return is the sum of the rewards the agent receives over time, with each reward weighted by a power of the discount factor $\gamma$, a number between 0 and 1 that is usually chosen a little less than 1.
The formula for the return is:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
where:
- $G_t$ is the return at time $t$
- $R_t$ is the reward at time $t$
- $\gamma$ is the discount factor
The discount factor makes rewards that lie further in the future count less, so the agent prefers to collect reward sooner rather than later.
The return measures how good a state or an action is, which is what lets the agent decide what to do at each step.
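The formula can be checked numerically with a short sketch. The reward sequence and the value $\gamma = 0.5$ below are assumptions picked to keep the arithmetic easy to follow; the function name `discounted_return` is likewise just illustrative.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite list of rewards."""
    g = 0.0
    # Work backwards so each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
rewards = [0.0, 0.0, 0.0, 100.0]
print(discounted_return(rewards, gamma=0.5))  # 12.5
```

Here the only non-zero reward arrives three steps late, so it is scaled by $0.5^3 = 0.125$.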

Policy
-------
A policy is a function that maps states to actions: it tells the agent which action to take in each state. A deterministic policy, written $\pi(s) = a$, always chooses the same action in a given state. A stochastic policy chooses among the actions with given probabilities in each state.
The goal in RL is to find a policy that maximizes the expected return.
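The difference between the two kinds of policy can be sketched with plain dictionaries. The state numbers, action names, and probabilities are hypothetical, and `act` is an illustrative helper rather than part of any library.

```python
import random

# Deterministic policy: exactly one action per state.
deterministic_policy = {2: "left", 3: "left", 4: "left", 5: "right"}

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {
    2: {"left": 0.9, "right": 0.1},
    3: {"left": 0.8, "right": 0.2},
    4: {"left": 0.6, "right": 0.4},
    5: {"left": 0.1, "right": 0.9},
}


def act(policy, state, rng=random):
    """Pick an action for `state` under either kind of policy."""
    choice = policy[state]
    if isinstance(choice, dict):
        # Sample an action according to its probability.
        actions = list(choice)
        weights = list(choice.values())
        return rng.choices(actions, weights=weights, k=1)[0]
    return choice


print(act(deterministic_policy, 4))   # always "left"
print(act(stochastic_policy, 4))      # "left" or "right", varies between calls
```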

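Note: The framework introduced above (states, actions, rewards, a discount factor, and a policy that chooses actions) is called a 'Markov Decision Process' (MDP).
'Markov' means that the future is independent of the past given the present: the next state and reward depend only on the current state and action, not on how the agent got there.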

State-Action Value Function
----------------------------
The state-action value function (also called the Q-function) maps state-action pairs to values: it tells us how good it is to take a particular action in a particular state. In general it is defined as the expected return of taking an action in a state and then following a given policy. These notes use the optimal version, in which the agent behaves optimally afterwards:
Q(s, a) = Return, if you:
- start in state s
- take action a
- then behave optimally after that
-> The best possible return from state s is $\max_a Q(s, a)$.
-> The best possible action in state s is the action a that attains $\max_a Q(s, a)$.

So if we can compute the Q-function, we can find the best policy by choosing the action that maximizes Q in each state.
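A minimal sketch of this step, assuming the Q-values are already known and stored in a nested dictionary. The numbers in `q_table` are hypothetical and serve only to show the lookup.

```python
# Hypothetical Q-values: q_table[state][action]
q_table = {
    2: {"left": 50.0, "right": 12.5},
    3: {"left": 25.0, "right": 6.25},
    4: {"left": 12.5, "right": 10.0},
    5: {"left": 6.25, "right": 20.0},
}


def best_value(q, state):
    """Best possible return from `state`: max over actions of Q(s, a)."""
    return max(q[state].values())


def best_action(q, state):
    """Action that attains the maximum Q-value in `state`."""
    return max(q[state], key=q[state].get)


greedy_policy = {s: best_action(q_table, s) for s in q_table}
print(greedy_policy)           # {2: 'left', 3: 'left', 4: 'left', 5: 'right'}
print(best_value(q_table, 5))  # 20.0
```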

Bellman Equation
----------------
The Bellman equation is a fundamental equation in RL that decomposes the Q-function into two parts: the immediate reward and the discounted future reward (return from behaving optimally starting from state s').
The Bellman equation for the Q-function is:
$$Q(s, a) = R + \gamma \max_{a'} Q(s', a')$$
where:
- Q(s, a) is the Q-function for state s and action a
- R is the immediate reward for taking action a in state s
- $\gamma$ is the discount factor
- s' is the next state after taking action a in state s
- a' ranges over the actions available in the next state s'; the max picks the best one
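One way to see the equation at work is to apply it repeatedly as an update until the Q-values stop changing. The sketch below does this for a hypothetical rover world that is not specified in these notes: six cells in a row, cells 1 and 6 terminal with rewards 100 and 40, zero reward elsewhere, $\gamma = 0.5$, and the reward R taken to depend only on the current state. In a terminal state the episode ends, so Q is just the reward there.

```python
GAMMA = 0.5
# Hypothetical rewards per cell; cells 1 and 6 end the episode.
REWARD = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
TERMINAL = {1, 6}
ACTIONS = {"left": -1, "right": +1}


def q_iteration(sweeps=50):
    """Apply Q(s, a) = R + gamma * max_a' Q(s', a') repeatedly to every (s, a)."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in REWARD}
    for _ in range(sweeps):
        new_q = {}
        for s in REWARD:
            new_q[s] = {}
            for a, step in ACTIONS.items():
                if s in TERMINAL:
                    new_q[s][a] = REWARD[s]  # no future reward after the end
                else:
                    s_next = s + step        # deterministic move
                    new_q[s][a] = REWARD[s] + GAMMA * max(q[s_next].values())
        q = new_q
    return q


q = q_iteration()
print(q[4])  # {'left': 12.5, 'right': 10.0}
```

From cell 4, going left reaches the reward of 100 after three steps ($0.5^3 \cdot 100 = 12.5$), while going right reaches 40 after two steps ($0.5^2 \cdot 40 = 10$), so left is the better action under these assumptions.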

Random (Stochastic) Environment
-------------------------------
In reality, most of the time the environment is stochastic, meaning that the outcome of an action is not deterministic. In a stochastic environment, the agent cannot predict with certainty what will happen when it takes an action. There is some randomness in the environment. For example, a robot might slip on a wet floor, or a self-driving car might encounter unexpected traffic.

$$\text{Expected return} = \text{Average}(R_1 + \gamma R_2 + \gamma^2 R_3 + ...) = E[R_1 + \gamma R_2 + \gamma^2 R_3 + ...]$$
                
So the goal of RL is to find a policy that maximizes the expected return.
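The "average" reading of the expected return can be illustrated by simulating many episodes and averaging their discounted returns. Everything concrete in this sketch is an assumption: the six-cell world with terminal rewards 100 and 40, $\gamma = 0.5$, a 10% chance that the rover moves opposite to its command, and a fixed "always go left" policy.

```python
import random

GAMMA = 0.5
REWARD = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
TERMINAL = {1, 6}
SLIP = 0.1  # hypothetical probability of moving right when commanded left


def sample_return(start, rng, max_steps=1000):
    """Discounted return of one episode under an 'always go left' policy."""
    s, g, discount = start, 0.0, 1.0
    for _ in range(max_steps):
        g += discount * REWARD[s]
        if s in TERMINAL:
            break
        # Intended move is left (-1); with probability SLIP the rover goes right.
        step = -1 if rng.random() >= SLIP else +1
        s += step
        discount *= GAMMA
    return g


def expected_return(start, episodes=10_000, seed=0):
    """Monte Carlo estimate: average the sampled returns."""
    rng = random.Random(seed)
    return sum(sample_return(start, rng) for _ in range(episodes)) / episodes


print(expected_return(start=4))
```

Without any slipping, every episode from cell 4 would return exactly 12.5; with slipping, individual returns vary and the average is what the agent tries to maximize.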
The Bellman equation becomes the following, where the expectation is taken over the random next state s':
$$Q(s, a) = R + E[\gamma \max_{a'} Q(s', a')]$$
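The expectation can be made concrete by weighting each possible next state by its probability. The sketch below reuses a hypothetical six-cell rover world (terminal rewards 100 and 40, $\gamma = 0.5$) and assumes the rover moves in the intended direction with probability `1 - slip` and in the opposite direction with probability `slip`. These dynamics are an illustration, not something these notes specify.

```python
GAMMA = 0.5
REWARD = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
TERMINAL = {1, 6}
ACTIONS = {"left": -1, "right": +1}


def outcomes(s, a, slip):
    """Possible next states with their probabilities for action a in state s."""
    intended = s + ACTIONS[a]
    slipped = s - ACTIONS[a]
    return [(intended, 1.0 - slip), (slipped, slip)]


def q_iteration(slip, sweeps=100):
    """Apply Q(s, a) = R + gamma * E[max_a' Q(s', a')] repeatedly."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in REWARD}
    for _ in range(sweeps):
        new_q = {}
        for s in REWARD:
            new_q[s] = {}
            for a in ACTIONS:
                if s in TERMINAL:
                    new_q[s][a] = REWARD[s]
                else:
                    # Expectation over s': probability-weighted best next value.
                    expected_next = sum(
                        p * max(q[s2].values()) for s2, p in outcomes(s, a, slip)
                    )
                    new_q[s][a] = REWARD[s] + GAMMA * expected_next
        q = new_q
    return q


q_sure = q_iteration(slip=0.0)
q_slippery = q_iteration(slip=0.1)
print(q_sure[2]["left"], q_slippery[2]["left"])
```

With `slip=0.0` the result matches the deterministic Bellman equation. With a non-zero slip probability, Q(2, left) drops below its deterministic value because some of the probability mass leads away from the large reward.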

Note: In many applications the state space is not discrete, but continuous.