# Reinforcement Learning

Reinforcement learning (RL) is a framework where an agent learns to make decisions by interacting with an environment. The agent receives feedback through rewards, and the ultimate goal is to develop a strategy that maximizes the total accumulated reward.

## Fundamental Components

### State ($s$)
The **state** represents the current situation or configuration of the system. It may include various factors such as:
- **Position and Orientation:** For an autonomous vehicle or robot, the state includes its current position, orientation, and speed.
- **Environment Details:** In a game, the state could be the arrangement of pieces on a board or the configuration of a game level.

### Action ($a$)
An **action** is any decision or move that the agent can make. Actions affect the state of the system. For example:
- **Control Inputs:** For an autonomous helicopter, actions might involve adjusting the control sticks.
- **Movement Decisions:** For a Mars rover, actions could include moving left or right to reach a desired location.
- **Game Moves:** In chess, an action is one of the legal moves available from a given board position.

### Reward ($R(s)$)
The **reward** is a numerical value that provides feedback on the outcome of an action. It indicates how favorable a particular state or action is:
- **Positive Reward:** Signals that the agent is performing well (e.g., maintaining stable flight or achieving a winning move in a game).
- **Negative Reward:** Indicates poor performance (e.g., crashing an aircraft or making a losing move).

The reward function helps guide the learning process by reinforcing behaviors that lead to higher rewards.

### Discount Factor ($\gamma$)
The **discount factor** ($\gamma$) is a number between 0 and 1 that determines the importance of future rewards relative to immediate rewards. A discount factor close to 1 means future rewards are nearly as valuable as immediate ones, while a lower value makes the agent focus more on immediate rewards.

The idea is captured mathematically when computing the **return**.

### Return ($G$)
The **return** is the total accumulated reward over time, but with future rewards discounted to reflect their delayed benefit. It is given by the formula:

$$
G = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \cdots
$$

This formula shows that rewards received later are worth less than rewards received immediately, depending on the value of $\gamma$. For example, if $\gamma = 0.9$, the reward received two steps later is multiplied by $0.9^2$.
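
As a small illustration of the formula, the sketch below sums a list of rewards with increasing powers of $\gamma$. The function name `discounted_return` is a hypothetical helper chosen for this example, and the rewards are assumed to be given as a finite list $R_1, R_2, \dots$

```python
def discounted_return(rewards, gamma):
    """Sum rewards R_1, R_2, ... weighted by gamma**0, gamma**1, ..."""
    total = 0.0
    for k, reward in enumerate(rewards):
        # The k-th reward (0-based) is discounted by gamma**k.
        total += (gamma ** k) * reward
    return total


# Three rewards of 1 with gamma = 0.9 give 1 + 0.9 + 0.81.
print(discounted_return([1, 1, 1], gamma=0.9))
```

With $\gamma = 1$ this reduces to a plain sum; smaller values shrink the contribution of later rewards.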

### Policy ($\pi$)
A **policy** is a function that maps each state to an action. It defines the agent’s strategy for decision-making. In mathematical terms, a policy is written as:

$$
\pi(s) = a
$$

The objective in reinforcement learning is to find an optimal policy that maximizes the expected return for every state.

## The Markov Decision Process (MDP)

Reinforcement learning problems are commonly formulated as a **Markov decision process (MDP)**, which provides a formal framework consisting of:

- **States:** All possible situations the agent can encounter.
- **Actions:** All available decisions or moves.
- **Rewards:** Immediate feedback for actions taken.
- **Transition Dynamics:** Rules that determine how actions change the state.
- **Markov Property:** The principle that the future state depends only on the current state and the action taken, not on the history of past states.

This framework allows for a clear mathematical treatment of decision-making problems and helps in designing algorithms that learn effective policies.

## Examples

### Autonomous Systems (e.g., Helicopter)
- **State:** The helicopter’s position, speed, orientation, and sensor data (such as GPS and accelerometer readings).
- **Action:** Adjustments to control sticks.
- **Reward:** A small positive reward (e.g., +1 per second) when flying stably, and a large negative reward (e.g., -1000) if it crashes.
- **Goal:** To learn a policy that keeps the helicopter flying safely and performing maneuvers like aerobatic stunts.

### Mars Rover Navigation
Imagine a simplified rover that can be in one of six positions (states 1 through 6):
- **States:** Six positions with specific rewards assigned to some states (e.g., state 1 gives a reward of 100, state 6 gives 40, while states 2–5 provide zero reward).
- **Actions:** The rover can choose to move left or right.
- **Return Calculation:**  
  For instance, if the rover starts at state 4 and moves left with a discount factor of $\gamma = 0.5$, the sequence of rewards (assuming zero rewards for intermediate states until reaching state 1) is weighted as follows:

$$
G = 0 + 0.5 \times 0 + 0.5^2 \times 0 + 0.5^3 \times 100 = 12.5
$$

- **Policy:** The strategy might involve choosing actions that lead to state 1 faster to maximize the return.
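
The return calculation above can be reproduced with a short sketch. It assumes that states 1 and 6 are terminal (the episode ends once the rover arrives there) and that the rover repeats the same action until then; `rollout_return` and the constant names are hypothetical.

```python
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}  # assumption: the episode ends in states 1 and 6


def rollout_return(start, action, gamma):
    """Discounted return when the same action is repeated until a terminal state."""
    step = -1 if action == "left" else 1
    state, discount, total = start, 1.0, 0.0
    while True:
        total += discount * REWARDS[state]
        if state in TERMINAL:
            return total
        state += step
        discount *= gamma


print(rollout_return(4, "left", 0.5))   # 12.5
print(rollout_return(4, "right", 0.5))  # 10.0
```

From state 4, heading left is worth slightly more than heading right under these assumptions, even though state 1 is farther away.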

### Game Playing (e.g., Chess)
- **State:** The configuration of the chess board.
- **Action:** A legal move from the current board position.
- **Reward:** Typically defined as +1 for winning, -1 for losing, and 0 for a draw.
- **Discount Factor:** Often chosen very close to 1 (e.g., $\gamma = 0.99$) to emphasize long-term outcomes.
- **Policy:** A function that selects the best move from a given board state to maximize the chance of winning the game.

## Summary

In reinforcement learning, an agent learns to maximize its accumulated reward through:
- Observing the **state** of the environment.
- Taking an **action** based on a **policy**.
- Receiving a **reward** that provides feedback.
- Considering future rewards using a **discount factor**.
- Evaluating the overall success using the **return**.

The problem is typically modeled as a **Markov decision process (MDP)**, where the next state depends solely on the current state and the chosen action. This framework is versatile and can be applied to robotics, autonomous systems, game playing, financial trading, and many other complex decision-making tasks.

## The State-Action Value Function (Q Function)

The **state-action value function** is denoted as $Q(s, a)$. 
It represents the **expected return** when:
- Starting in state $s$,
- Taking action $a$ once,
- And then following an **optimal policy** thereafter.

In other words:

$$
Q(s, a) = \text{Return obtained from } s \text{ after taking } a \text{ and acting optimally henceforth.}
$$

### Return and Discounting

The **return** is the sum of discounted rewards:

$$
 G_t = R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots
$$

where:
- $R_1$ is the immediate reward,
- $R_2, R_3, \dots$ are future rewards,
- $\gamma$ (gamma) is the **discount factor** with $0 < \gamma \leq 1$.
  
> A high $\gamma$ (close to 1) means the agent is more patient, valuing future rewards nearly as much as immediate ones. A low $\gamma$ causes the agent to be more short-sighted.

### Example: Mars Rover
Imagine a Mars Rover that can take two actions (e.g., **left** or **right**) in each state:

**Rewards:**
- Terminal states yield a reward (e.g., 100 or 40).
- Intermediate states often yield a reward of 0.

**Numerical Example:**
- From **state 2**, taking **left** might lead directly to a terminal state with a reward of 100. If $\gamma = 0.5$, then:

$$
Q(2, \text{left}) = 0 + 0.5 \times 100 = 50.
$$

- Alternatively, taking **right** from state 2 may lead to state 3, and following the optimal policy from there might yield:

$$
Q(2, \text{right}) = 0 + 0.5 \times 25 = 12.5.
$$

### Determining the Optimal Policy

The optimal action $\pi(s)$ in any state $s$ is the one that maximizes the Q value:

$$
\pi(s) = \arg\max_{a} Q(s, a).
$$

In the above example, since $50 > 12.5$, the best action in state 2 is to go **left**.
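
In code, picking the optimal action is just an argmax over the Q values of one state. The sketch below stores the two values from the example in a dictionary; the names `Q` and `greedy_action` are hypothetical.

```python
# Q values for state 2, taken from the numerical example above.
Q = {
    (2, "left"): 50.0,
    (2, "right"): 12.5,
}


def greedy_action(q_table, state, actions=("left", "right")):
    """Return the action with the largest Q value in the given state."""
    return max(actions, key=lambda a: q_table[(state, a)])


print(greedy_action(Q, 2))  # left
```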

---

## The Bellman Equation

The **Bellman equation** breaks down the Q function into two parts:
- **Immediate reward:** The reward $R(s)$ received in the current state.
- **Future rewards:** The discounted optimal return from the next state.

The equation is written as:

$$
Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')
$$

where:
- $s'$ is the state reached after taking action $a$ in state $s$.
- $a'$ ranges over the actions available in state $s'$.

### Intuition
- **Immediate Reward ($R(s)$):** What you gain right away.
- **Future Return:** $\gamma \max_{a'} Q(s', a')$ is the best discounted return possible from the next state.
- The **total return** from taking action $a$ in state $s$ is the sum of these two parts.

### Detailed Examples
1. **State 2, Action Right:**
   - Suppose $R(2) = 0$, and taking right leads to state 3 with $\max_{a'} Q(3, a') = 25$.
   - Then:

$$
Q(2, \text{right}) = 0 + 0.5 \times 25 = 12.5.
$$

2. **State 4, Action Left:**
   - With $R(4) = 0$, if going left takes you to state 3 (again with $\max_{a'} Q(3, a') = 25$):

$$
Q(4, \text{left}) = 0 + 0.5 \times 25 = 12.5.
$$

3. **Terminal States:**
   - In terminal states, no future rewards exist. Hence, the equation simplifies to:

$$
Q(s,a) = R(s).
$$

### Alternative View of Returns
The return can be re-written as:

$$
G_t = R_1 + \gamma (R_2 + \gamma R_3 + \gamma^2 R_4 + \cdots),
$$

and the term in parentheses represents the optimal future return:

$$
\max_{a'} Q(s', a').
$$

Thus, the Bellman equation encapsulates this recursive structure.
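
One way to turn this recursion into numbers is to apply the Bellman equation repeatedly until the Q values stop changing (often called value iteration). The sketch below does this for the six-state rover with $\gamma = 0.5$, assuming deterministic moves and terminal states 1 and 6; `solve_q` and the constants are hypothetical names.

```python
GAMMA = 0.5
REWARDS = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
TERMINAL = {1, 6}  # assumption: no future rewards after these states
ACTIONS = {"left": -1, "right": +1}


def solve_q(gamma=GAMMA, sweeps=50):
    """Repeatedly apply Q(s,a) = R(s) + gamma * max_a' Q(s',a')."""
    q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}
    for _ in range(sweeps):
        for s in REWARDS:
            for a, step in ACTIONS.items():
                if s in TERMINAL:
                    q[(s, a)] = REWARDS[s]  # Q(s,a) = R(s)
                else:
                    s_next = s + step
                    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
                    q[(s, a)] = REWARDS[s] + gamma * best_next
    return q


q = solve_q()
print(q[(2, "left")], q[(2, "right")], q[(4, "left")])  # 50.0 12.5 12.5
```

The printed values match the worked examples above; the remaining entries of `q` follow from the same update.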

---

## Stochastic Environments

In many real-world problems, the effect of an action is **stochastic** (random).

**Example:** A Mars Rover commanded to go left might:
- Successfully move left with probability 0.9.
- Slip and move right with probability 0.1.

### Expected Return in Stochastic Settings

Since the outcome is uncertain, we focus on the **expected return** rather than a single deterministic return.

Imagine letting a Mars Rover perform the same mission many times. Because of randomness in the environment, each run (or trajectory) may give a slightly different sum of discounted rewards. Instead of one fixed value, you record the return of each run and take the average of all of them; the expected return is what this average approaches as the number of runs grows:

$$
\mathbb{E}[G_t] = \mathbb{E}\left[ R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots \right]
$$

- In statistics, the expectation operator $\mathbb{E}$ denotes an average; here it is the average return.
- The average is taken over all possible outcomes, each weighted by how likely it is to occur. In practice, if you repeated the mission 1000 times, you would add up the discounted return from each run and divide by 1000. That sample average approximates the expected return that the $\mathbb{E}$ operator represents.
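
The averaging procedure can be sketched as a small simulation. It assumes the six-state rover from the earlier examples, terminal states 1 and 6, and a rover that is always commanded to go left but slips right with probability `misstep_prob`; all function names are hypothetical.

```python
import random

REWARDS = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
TERMINAL = {1, 6}  # assumption: the episode ends in states 1 and 6


def sample_return(start, gamma, misstep_prob, rng):
    """One run: always command 'left', but slip right with misstep_prob."""
    state, discount, total = start, 1.0, 0.0
    while True:
        total += discount * REWARDS[state]
        if state in TERMINAL:
            return total
        step = +1 if rng.random() < misstep_prob else -1
        state += step
        discount *= gamma


def estimate_expected_return(start, gamma, misstep_prob, runs=1000, seed=0):
    """Average the discounted return over many simulated runs."""
    rng = random.Random(seed)
    samples = [sample_return(start, gamma, misstep_prob, rng) for _ in range(runs)]
    return sum(samples) / runs


print(estimate_expected_return(4, 0.5, misstep_prob=0.0))  # 12.5 (no randomness)
print(estimate_expected_return(4, 0.5, misstep_prob=0.1))  # an estimate; varies with the seed
```

With `misstep_prob=0.0` every run is identical, so the average equals the deterministic return; with slipping enabled the result is only an estimate of the expectation.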

The **Bellman equation** is modified to account for this expectation:

$$
Q(s, a) = R(s) + \gamma \, \mathbb{E}\left[ \max_{a'} Q(s', a') \right].
$$

### Impact of Stochasticity
- **Increased Uncertainty:** As the misstep (or error) probability increases, the control over the state transitions decreases.
- **Effect on Q Values:** Higher misstep probabilities generally lead to lower Q values because the expected return is reduced.
- **Lab Experiments:** Modifying parameters such as the terminal reward, discount factor $\gamma$, or the misstep probability in simulation notebooks can help build intuition on how these values and the optimal policy change.
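
To get a feel for these effects without a notebook, the expectation in the Bellman equation can be written out for the rover: with probability `1 - misstep_prob` the rover moves as commanded, otherwise it moves the opposite way. This is a minimal sketch under the same assumptions as the earlier rover examples (six states, terminal states 1 and 6); `solve_q_stochastic` is a hypothetical name.

```python
REWARDS = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
TERMINAL = {1, 6}  # assumption: no future rewards after these states
ACTIONS = {"left": -1, "right": +1}


def solve_q_stochastic(gamma, misstep_prob, sweeps=200):
    """Iterate Q(s,a) = R(s) + gamma * E[max_a' Q(s',a')] until it settles."""
    q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}
    for _ in range(sweeps):
        for s in REWARDS:
            for a, step in ACTIONS.items():
                if s in TERMINAL:
                    q[(s, a)] = REWARDS[s]
                    continue
                v_intended = max(q[(s + step, a2)] for a2 in ACTIONS)
                v_slipped = max(q[(s - step, a2)] for a2 in ACTIONS)
                # Expectation over the two possible next states.
                expected_next = (1 - misstep_prob) * v_intended + misstep_prob * v_slipped
                q[(s, a)] = REWARDS[s] + gamma * expected_next
    return q


q_det = solve_q_stochastic(gamma=0.5, misstep_prob=0.0)
q_noisy = solve_q_stochastic(gamma=0.5, misstep_prob=0.1)
print(q_det[(2, "left")], q_noisy[(2, "left")])  # the second value is lower
```

Changing `gamma`, `misstep_prob`, or the terminal rewards in `REWARDS` and rerunning shows how the Q values shift.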

---

## Summary and Key Takeaways

**Q Function ($Q(s,a)$):** Represents the expected return starting from state $s$, taking action $a$, and then following the optimal policy. Enables deriving the optimal policy:

$$
\pi(s) = \arg\max_{a} Q(s,a).
$$

**Bellman Equation:** Decomposes the return into an immediate reward plus the discounted future return. Provides a recursive way to compute $Q(s,a)$:

$$
Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a').
$$

**Stochastic Environments:** Require considering the **expected value** of returns due to uncertainty in action outcomes. Modify the Bellman equation with an expectation:

$$
Q(s, a) = R(s) + \gamma \, \mathbb{E}\left[ \max_{a'} Q(s', a') \right].
$$

**Parameter Effects:**
- **Discount Factor ($\gamma$):**
    - High $\gamma$ (closer to 1): Agent is more patient; future rewards have higher weight.
    - Low $\gamma$: Agent is more focused on immediate rewards.
- **Reward Structure:** Directly influences the computed Q values.
- **Misstep Probability:** Higher probability of action failure lowers the expected returns.

## Continuous State Spaces

### Discrete vs. Continuous States
**Discrete State Spaces:**  
- Example: A simplified Mars rover that can only occupy one of six fixed positions.

**Continuous State Spaces:**  
- Most robots operate in environments where the state is defined by a continuous range of values.
- **Examples:**
    1. A Mars rover moving along a line with positions from 0 to 6 kilometers can be at 2.7 km, 4.8 km, etc.
    2. A self-driving car or truck:
        - **State vector components:**  
            - Position: $x$, $y$
            - Orientation: $\theta$
            - Velocities: $\dot{x}$, $\dot{y}$
            - Angular velocity: $\dot{\theta}$  
        - The state is a vector of six continuous numbers.
    3. An autonomous helicopter:
        - **State representation:**  
            - Position: $x$, $y$, $z$
            - Orientation (using Euler angles):
                - Roll ($\phi$)
                - Pitch ($\theta$)
                - Yaw ($\omega$)
            - Velocities: linear velocities $\dot{x}$, $\dot{y}$, $\dot{z}$ and angular velocities $\dot{\phi}$, $\dot{\theta}$, $\dot{\omega}$
        - The helicopter's full state might be a vector of 12 numbers.

---

### The Lunar Lander Environment

- **Task:**  
  Land a simulated lunar lander safely on the moon by firing appropriate thrusters.
- **Actions:**  
  The agent can take one of four actions at every time step:
  - `nothing` – No thrust; gravity and inertia act.
  - `left` – Fire the left thruster to push right.
  - `main` – Fire the main engine, which thrusts downward and pushes the lander up.
  - `right` – Fire the right thruster to push left.
  
- **State Variables:**  
  The state vector contains:
  - **Position:** $x$, $y$
  - **Velocity:** $\dot{x}$, $\dot{y}$
  - **Orientation:** Angle $\theta$ and angular velocity $\dot{\theta}$
  - **Leg Contact Indicators:**  
    - $l$: Whether the left leg is touching the ground (binary: 0 or 1)
    - $r$: Whether the right leg is touching the ground (binary: 0 or 1)

#### Reward Function
- **Landing Reward:**  
  - A reward between +100 and +140 is given when the lander reaches the pad.
- **Additional Rewards:**  
  - Positive reward for moving toward the pad.
  - Negative reward for drifting away.
  - Crash penalty: $-100$ for crashing.
  - Soft landing bonus: $+100$ for a soft landing.
  - Leg contact bonus: $+10$ for each leg that touches the ground.
- **Fuel Penalties:**  
  - Main engine fire: $-0.3$ per time step in which it fires.
  - Side thrusters (left/right): $-0.03$ per time step in which one fires.

#### Objective

Learn a policy $\pi$ that maps states $s$ to actions $a$ (i.e., $a = \pi(s)$) in order to maximize the **return**, defined as the sum of discounted rewards:
  
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k}$$

**Discount Factor:**  For Lunar Lander, a high discount factor (e.g., $\gamma = 0.985$) is typically used.

---

## Deep Q-Network (DQN) Algorithm

**Basic Architecture (Inefficient Approach):**
- **Input:** Concatenation of state ($8$ numbers for Lunar Lander) and action (one-hot encoding of $4$ possible actions) resulting in a $12$-dimensional vector.
- **Hidden Layers:** Two hidden layers with, e.g., 64 units each.
- **Output:** A single scalar value approximating $Q(s, a)$.


**Improved Architecture:**
- **Input:** Only the state vector (8 numbers).
- **Output:**  A vector of 4 values, each corresponding to $Q(s, \texttt{nothing})$, $Q(s, \texttt{left})$, $Q(s, \texttt{main})$, and $Q(s, \texttt{right})$.
- **Benefit:**  
    - Inference is performed once per state rather than separately for each action.
    - Easily computes $\max_{a'} Q(s', a')$ during the Bellman update.
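
The improved architecture can be sketched as a plain NumPy forward pass. This is only an illustration of the shapes involved: the ReLU activations, the weight scale, and the helper names `init_params` and `q_values` are assumptions, and a real implementation would normally use a deep learning framework.

```python
import numpy as np

STATE_DIM, HIDDEN, NUM_ACTIONS = 8, 64, 4


def init_params(rng):
    """Random weights and zero biases for an 8 -> 64 -> 64 -> 4 network."""
    sizes = [(STATE_DIM, HIDDEN), (HIDDEN, HIDDEN), (HIDDEN, NUM_ACTIONS)]
    return [(rng.normal(0.0, 0.1, size=shape), np.zeros(shape[1])) for shape in sizes]


def q_values(params, states):
    """Map a batch of states (N, 8) to Q values (N, 4), one column per action."""
    h = states
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers only (assumed)
    return h


rng = np.random.default_rng(0)
params = init_params(rng)
batch = rng.normal(size=(5, STATE_DIM))
q = q_values(params, batch)
print(q.shape)              # (5, 4)
print(q.max(axis=1).shape)  # (5,)
```

A single forward pass yields all four Q values for each state, so $\max_{a'} Q(s', a')$ is one `max` over the last axis.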

### Training via Bellman's Equation

**Bellman Equation:**

$$
Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')
$$

**Training Examples:**
From experiences collected as tuples $(s, a, R(s), s')$, form training pairs:
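- **Input:** $x = [s; a]$ for the basic architecture; with the improved architecture the input is just $s$, and the target below applies to the output for the action $a$ that was taken.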
- **Target Value:** $y = R(s) + \gamma \max_{a'} Q(s', a')$
- **Supervised Learning:** Use mean squared error (MSE) loss to update network parameters so that the network approximates the Q-function.
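
A minimal sketch of how the targets and the loss might be computed for a batch is shown below. It assumes the network's Q values for the next states are already available as an array, and it adds a `dones` flag (an assumption, not part of the tuple above) that marks experiences after which no further reward is collected, so the future term is dropped there. The function names are hypothetical.

```python
import numpy as np


def bellman_targets(rewards, next_q_values, dones, gamma):
    """y = R(s) + gamma * max_a' Q(s', a'); the future term is zeroed when done."""
    best_next = next_q_values.max(axis=1)
    return rewards + gamma * (1.0 - dones) * best_next


def mse_loss(predicted_q, targets):
    """Mean squared error between predicted Q(s, a) and the targets y."""
    return np.mean((predicted_q - targets) ** 2)


rewards = np.array([0.0, 100.0])
next_q = np.array([[25.0, 6.25], [0.0, 0.0]])  # Q(s', a') for each action a'
dones = np.array([0.0, 1.0])
y = bellman_targets(rewards, next_q, dones, gamma=0.5)
print(y)  # targets 12.5 and 100.0
```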

### Experience Replay

**Replay Buffer:**  
- Store the most recent (e.g., 10,000) experiences.
- Helps break correlations between sequential data and improves stability.
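
A replay buffer that keeps only the most recent experiences can be sketched with a bounded `deque`, which discards the oldest entry once the capacity is reached. The `ReplayBuffer` class and `Experience` tuple are hypothetical names for this illustration.

```python
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size store of the most recent (s, a, R(s), s') tuples."""

    def __init__(self, capacity=10_000):
        # A deque with maxlen drops the oldest item automatically when full.
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(state=i, action=0, reward=0.0, next_state=i + 1)
print(len(buffer))             # 3
print(buffer.memory[0].state)  # 2
```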

---

## Epsilon-Greedy Policy (Exploration vs. Exploitation)

### Policy Definition
- **Greedy Action:** With probability $1 - \epsilon$ (e.g., 95%), select the action that maximizes $Q(s, a)$.
- **Exploration:** With a small probability $\epsilon$ (e.g., 5%), choose an action at random.
  
### Purpose
- **Avoid Local Optima:** If the randomly initialized network happens to assign a low Q value to a good action, a purely greedy agent would never try that action and never learn that the estimate is wrong. Occasional random actions give it the chance to find out.
- **Adaptive $\epsilon$:** Often, $\epsilon$ starts high (e.g., 1.0) to encourage exploration and is gradually decreased to a low value (e.g., 0.01) as learning improves.
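
A minimal sketch of epsilon-greedy action selection with a decaying $\epsilon$ follows. The multiplicative decay rate of 0.995 is an assumption chosen for illustration, and the function names are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, rate=0.995, floor=0.01):
    """Shrink epsilon by a constant factor, never going below the floor."""
    return max(floor, rate * epsilon)


q_for_state = [0.1, 2.5, -1.0, 0.7]  # one value per action
print(epsilon_greedy(q_for_state, epsilon=0.0))  # 1 (always greedy)

epsilon = 1.0
for _ in range(3):
    epsilon = decay_epsilon(epsilon)
print(epsilon)  # a little below 1.0 after three decays
```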

---

## Refinements to the Learning Algorithm

### Mini-Batching

Instead of training on the full replay buffer (e.g., 10,000 examples) each time, select a smaller batch (e.g., 1,000 examples).

**Benefits:**
- Speeds up each training iteration.
- Introduces noise which may help in escaping local minima.

**Analogy with Supervised Learning:**
- Similar to mini-batch gradient descent in linear regression where a subset of the full dataset is used to compute parameter updates.
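
Drawing a mini-batch from the stored experiences can be as simple as the sketch below. Uniform sampling without replacement is an assumption here, and `sample_minibatch` is a hypothetical helper.

```python
import random


def sample_minibatch(buffer, batch_size, rng=random):
    """Draw up to batch_size distinct experiences uniformly at random."""
    experiences = list(buffer)
    return rng.sample(experiences, min(batch_size, len(experiences)))


stored = list(range(10_000))  # stand-ins for stored experience tuples
batch = sample_minibatch(stored, batch_size=1_000)
print(len(batch))  # 1000
```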

### Soft Updates

**Problem with Abrupt Updates:** 

Directly copying parameters from a newly trained network can lead to instability if the new network performs worse.

**Solution – Soft Update Rule:**

Instead of setting parameters $\theta \leftarrow \theta_{new}$, update gradually:

$$
\theta \leftarrow \tau \theta_{new} + (1 - \tau) \theta
$$

Example with $\tau = 0.01$:

For weights $W$:

$$W \leftarrow 0.01 \, W_{new} + 0.99 \, W$$

For biases $B$:

$$
B \leftarrow 0.01 \, B_{new} + 0.99 \, B
$$

- **Benefit:** Increases the stability of the learning process and helps the algorithm converge more reliably.
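
The soft update rule is a per-parameter weighted average. A minimal NumPy sketch, assuming the parameters are kept as a list of arrays and using the hypothetical name `soft_update`:

```python
import numpy as np


def soft_update(params, new_params, tau=0.01):
    """Return tau * new + (1 - tau) * old for every parameter array."""
    return [tau * new + (1.0 - tau) * old for old, new in zip(params, new_params)]


W, B = np.zeros((2, 2)), np.zeros(2)
W_new, B_new = np.ones((2, 2)), np.ones(2)
W, B = soft_update([W, B], [W_new, B_new], tau=0.01)
print(W[0, 0], B[0])  # 0.01 0.01
```

Each call moves the parameters only 1% of the way toward the new values, so a single poor training step has a limited effect.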

---

## Summary and Practical Considerations

### Overview of the Learning Process
1. **Initialization:**
   - Randomly initialize the neural network parameters.
2. **Experience Collection:**
   - Interact with the environment (e.g., Lunar Lander) using an epsilon-greedy policy.
   - Collect tuples $(s, a, R(s), s')$ and store them in the replay buffer.
3. **Training Loop:**
   - Periodically sample mini-batches from the replay buffer.
   - Form training pairs $(x, y)$ using the Bellman equation.
   - Train the neural network to minimize the MSE between $Q(s, a)$ and $y$.
   - Update network parameters using either a direct copy or a soft update strategy.
4. **Policy Improvement:**
   - As the network improves its estimate of the Q-function, use it to select actions that maximize the expected return.

### Context in Machine Learning
- **Reinforcement Learning vs. Supervised/Unsupervised Learning:**
  - RL focuses on learning policies to maximize cumulative reward, whereas supervised learning maps inputs to outputs and unsupervised learning finds structure in data.
- **Challenges:**
  - RL can be more sensitive to hyperparameter choices (e.g., learning rate, $\epsilon$ scheduling) and may require extensive tuning.
- **Real-World Application:**
  - While RL shows impressive results in simulations (e.g., video games, simulated robotics), transferring these methods to real-world robotics remains challenging.