# Chapter 18: Reinforcement Learning

## 1. Chapter Overview
**Goal:** Reinforcement Learning (RL) is the technology behind AlphaGo, self-driving cars, and robots that can walk. In this chapter, we will build agents that can learn to balance a pole on a cart (CartPole) and play Atari games by using Deep Q-Learning (DQN).

**Key Concepts:**
* **RL Components:** Agent, Environment, Action, State, Reward.
* **Policy Search:** Finding the best strategy (policy) to maximize rewards.
* **Credit Assignment Problem:** Determining which action caused the reward (was it the move 10 steps ago?).
* **Markov Decision Processes (MDPs):** The mathematical framework for RL.
* **Q-Learning:** Learning the "quality" (value) of every action in every state.
* **Deep Q-Networks (DQN):** Using a neural network to approximate Q-values for complex environments.
* **Exploration vs. Exploitation:** Balancing trying new things (random actions) vs. using known best actions.
* **Replay Buffer:** Storing past experiences to train the network stably.

**Practical Skills:**
* Using **OpenAI Gym** to render and interact with environments.
* Implementing a hard-coded policy for CartPole.
* Building a **DQN** Agent with Keras.
* Implementing a **Custom Training Loop** for RL (collect data -> train -> repeat).

In [None]:
# Setup
import sys
assert sys.version_info >= (3, 5)

import sklearn
assert sklearn.__version__ >= "0.20"

import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(42)
tf.random.set_seed(42)

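# To run this chapter, you need OpenAI Gym.
# The code below uses the classic Gym API (gym < 0.26), where env.reset()
# returns only the observation and env.step() returns 4 values.
# pip install "gym<0.26"
import gym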

## 2. Theoretical Explanation (In-Depth)

### 1. The RL Cycle
1.  **Observation:** The Agent sees the current state $S_t$ of the environment (e.g., pixel image of a game).
2.  **Action:** The Agent selects an action $A_t$ based on its Policy $\pi$.
3.  **Step:** The Environment processes the action, transitions to a new state $S_{t+1}$, and returns a Reward $R_{t+1}$ (e.g., +1 for survival, -1 for crashing).
4.  **Repeat:** The goal is to maximize the expected sum of future rewards (Return).
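
The four steps above can be sketched as a plain interaction loop. This is a minimal illustration, not the Gym API itself: `CoinFlipEnv` and `run_episode` are hypothetical names invented here, and the environment only mimics the classic `reset()` / `step()` convention used later in this chapter.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: the state is a hidden bit shown to the agent.
    Matching the bit earns +1, anything else earns -1. Episodes last 5 steps."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.state = self.rng.randint(0, 1)
        return self.state                      # 1. Observation

    def step(self, action):
        reward = 1.0 if action == self.state else -1.0
        self.t += 1
        self.state = self.rng.randint(0, 1)    # transition to the next state
        done = self.t >= 5
        return self.state, reward, done, {}    # 3. Step: new state and reward

def run_episode(env, policy):
    state = env.reset()
    total, done = 0.0, False
    while not done:                            # 4. Repeat until the episode ends
        action = policy(state)                 # 2. Action chosen by the policy
        state, reward, done, info = env.step(action)
        total += reward
    return total

total = run_episode(CoinFlipEnv(), policy=lambda s: s)
print("Return:", total)
```

Here the policy simply copies the observed state, which happens to be the rewarded action in this toy environment, so it collects +1 on each of the 5 steps.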

### 2. Markov Decision Processes (MDP)
RL assumes the environment satisfies the **Markov Property**: The future depends only on the current state, not the history. 
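We use the **Bellman Optimality Equation** to define the optimal Q-Value $Q^*(s, a)$. For a deterministic environment, where taking action $a$ in state $s$ always leads to the same next state $s'$, it reads:
$$ Q^*(s, a) = R(s, a) + \gamma \max_{a'} Q^*(s', a') $$
In a stochastic environment, the right-hand side becomes an expectation over the possible next states $s'$. Here $\gamma$ (gamma) is the discount factor (between 0 and 1). If $\gamma=0$, the agent is short-sighted and values only the immediate reward. If $\gamma=1$, future rewards count exactly as much as immediate ones.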
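
To make the effect of $\gamma$ concrete, the sketch below computes the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ for a short reward sequence. The helper name `discounted_return` is an assumption made for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    g = 0.0
    # Work backwards so each step applies: g = r + gamma * (return after this step)
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]
short_sighted = discounted_return(rewards, gamma=0.0)  # only the first reward counts
balanced = discounted_return(rewards, gamma=0.5)       # 1 + 0.5 + 0.25
far_sighted = discounted_return(rewards, gamma=1.0)    # plain sum of all rewards
print(short_sighted, balanced, far_sighted)
```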

### 3. Deep Q-Learning (DQN)
In simple games (Tic-Tac-Toe), we can store a table of Q-values for every state. In complex games (Pacman, Go), the number of states is far too large to enumerate, let alone store in a table.
We use a **Deep Neural Network** (DQN) to *approximate* the Q-value function: $Q(s, a) \approx \mathrm{DNN}(s)_a$, the network's output for action $a$.
Input: State (Image/Vector). Output: Q-values for each possible action.
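
For contrast, the tabular case mentioned above can be sketched in a few lines. This is a minimal sketch of the standard Q-learning update with a learning rate `alpha` (a hyperparameter not discussed in this chapter). The function name and the tiny 3-state table are assumptions for illustration. A DQN replaces the table lookup `Q[s]` with a forward pass through the network.

```python
def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.95):
    """Move Q[s][a] a fraction alpha toward the Bellman target."""
    best_next = 0.0 if done else max(Q[s_next])  # no future value after a terminal step
    target = r + gamma * best_next
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

# Hypothetical table: 3 states x 2 actions, all zeros except state 1
Q = [[0.0, 0.0] for _ in range(3)]
Q[1] = [2.0, 4.0]

# Took action 1 in state 0, got reward 1.0, landed in state 1
updated = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, done=False,
                            alpha=0.5, gamma=0.5)
# Took action 0 in state 2, got reward -1.0, and the episode ended
terminal = q_learning_update(Q, s=2, a=0, r=-1.0, s_next=1, done=True,
                             alpha=0.5, gamma=0.5)
print(Q)
```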

### 4. Experience Replay
Neural networks hate correlated data (sequential frames of a game are highly correlated). To fix this, we store the agent's experiences $(S, A, R, S')$, plus a flag marking whether the episode ended, in a large buffer (replay memory) and train the network on small **random batches** sampled from this buffer. This breaks correlation and stabilizes training.

## 3. Code Reproduction

### 3.1 The CartPole Environment
The goal is to balance a pole on a moving cart.

In [None]:
env = gym.make("CartPole-v1")
obs = env.reset()
print("Observation (Cart Pos, Cart Vel, Pole Angle, Pole Vel):", obs)

# Render the environment (Note: Rendering might not work in headless cloud notebooks like Colab without extra setup)
# env.render()

# Hard-coded policy: If pole tilts left, push cart left.
def basic_policy(obs):
    angle = obs[2]
    return 0 if angle < 0 else 1 # 0 = Left, 1 = Right

totals = []
for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

print("Mean Reward (Hard-coded):", np.mean(totals), "(Max is 200.0)")

### 3.2 Building a DQN Agent
We will build a neural network that takes the observation (4 floats) and outputs 2 Q-values (Left, Right).

In [None]:
input_shape = [4] # 4 observations
n_outputs = 2 # 2 actions

model = keras.models.Sequential([
    keras.layers.Dense(32, activation="elu", input_shape=input_shape),
    keras.layers.Dense(32, activation="elu"),
    keras.layers.Dense(n_outputs)
])

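def epsilon_greedy_policy(state, epsilon=0):
    if np.random.rand() < epsilon:
        return np.random.randint(n_outputs) # Exploration: random action
    else:
        # verbose=0 suppresses the progress bar printed on every call
        Q_values = model.predict(state[np.newaxis], verbose=0)
        return np.argmax(Q_values[0]) # Exploitation: best-known action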

### 3.3 Replay Buffer
We use a `deque` to store the last 2000 steps.

In [None]:
from collections import deque

replay_memory = deque(maxlen=2000)

def sample_experiences(batch_size):
    indices = np.random.randint(len(replay_memory), size=batch_size)
    batch = [replay_memory[index] for index in indices]
    states, actions, rewards, next_states, dones = [
        np.array([experience[field_index] for experience in batch])
        for field_index in range(5)]
    return states, actions, rewards, next_states, dones

def play_one_step(env, state, epsilon):
    action = epsilon_greedy_policy(state, epsilon)
    next_state, reward, done, info = env.step(action)
    replay_memory.append((state, action, reward, next_state, done))
    return next_state, reward, done, info

### 3.4 Custom Training Loop (The DQN Algorithm)
We train the model to predict the Target Q-Value, which is calculated using the Bellman Equation: `Reward + gamma * max(Next_Q_Value)`.

In [None]:
batch_size = 32
discount_rate = 0.95
optimizer = keras.optimizers.Adam(learning_rate=1e-3) # `lr` is a deprecated alias
loss_fn = keras.losses.mean_squared_error

def training_step(batch_size):
    experiences = sample_experiences(batch_size)
    states, actions, rewards, next_states, dones = experiences
    
    # Compute target Q values
    next_Q_values = model.predict(next_states, verbose=0)
    max_next_Q_values = np.max(next_Q_values, axis=1)
    target_Q_values = (rewards + 
                       (1 - dones) * discount_rate * max_next_Q_values)
    target_Q_values = target_Q_values.reshape(-1, 1)
    mask = tf.one_hot(actions, n_outputs)
    
    with tf.GradientTape() as tape:
        all_Q_values = model(states)
        # We only care about the Q-value of the action we actually took
        Q_values = tf.reduce_sum(all_Q_values * mask, axis=1, keepdims=True)
        loss = tf.reduce_mean(loss_fn(target_Q_values, Q_values))
    
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

### 3.5 Running the Training
We run 600 episodes. Epsilon starts at 1 (fully random actions) and decays linearly to a floor of 0.01, which it reaches just before episode 500.

In [None]:
rewards = []
best_score = 0

for episode in range(600):
    obs = env.reset()
    epsilon = max(1 - episode / 500, 0.01) # constant within an episode
    for step in range(200):
        obs, reward, done, info = play_one_step(env, obs, epsilon)
        if done:
            break
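    # `step` is 0-based, and CartPole gives +1 per step, so the
    # episode's total reward is the number of steps survived: step + 1
    total_reward = step + 1
    rewards.append(total_reward)
    if total_reward >= best_score:
        best_score = total_reward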
    
    # Train only after we have enough data
    if episode > 50:
        training_step(batch_size)

    if episode % 50 == 0:
        print(f"Episode: {episode}, Best Score: {best_score}, Mean Reward: {np.mean(rewards[-50:])}")

print("Training Finished.")

## 4. Step-by-Step Explanation

### 1. Epsilon-Greedy Policy
At the beginning of training, the model knows nothing (random weights). If we just follow the model's predictions (Exploitation), the agent will repeat the same stupid mistakes forever.
We need **Exploration**: forcing the agent to take random actions to discover new states. 
* `epsilon=1.0`: 100% random actions.
* `epsilon=0.01`: 1% random, 99% best action.

We decay epsilon over time as the model gets smarter.
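
As a rough sketch, the schedule and the action choice can be written without any network. `epsilon_for` mirrors the linear decay used in this chapter's training loop, while `choose_action` is a hypothetical helper that takes precomputed Q-values instead of calling a model.

```python
import random

def epsilon_for(episode, decay_episodes=500, floor=0.01):
    """Linear decay from 1.0 down to `floor`."""
    return max(1 - episode / decay_episodes, floor)

def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a list of Q-values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

schedule = [epsilon_for(e) for e in (0, 250, 600)]
greedy_action = choose_action([0.2, 1.5], epsilon=0.0)
print(schedule, greedy_action)
```

With `epsilon=0.0` the random branch is never taken, so the helper always returns the action with the highest Q-value.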

### 2. The Training Step (Bellman Logic)
1.  **Predict:** The model estimates Q-values for the *next state* ($S_{t+1}$).
2.  **Target:** We assume the best future action will be taken. So the "Ground Truth" value of the current action is: *Immediate Reward + Discounted Future Reward*.
3.  **Mask:** The model outputs Q-values for ALL actions (Left, Right). But we only executed ONE action. We use a mask (one-hot vector) to zero out the Q-value of the action we didn't take, so it doesn't affect the loss calculation.
4.  **Backprop:** We minimize the difference (MSE) between the Model's guess for the current Q-value and the Target Q-value calculated from the actual reward received.
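
The target and mask steps can be checked by hand on a tiny batch. The sketch below uses NumPy only and made-up numbers (two experiences, two actions, `gamma = 0.5`). It follows the same arithmetic as `training_step` but is not a replacement for it.

```python
import numpy as np

gamma = 0.5
# Made-up batch of two experiences
actions = np.array([1, 0])                       # action taken in each experience
rewards = np.array([1.0, 1.0])
dones = np.array([False, True])                  # second experience ended the episode
all_q = np.array([[1.0, 3.0], [2.0, 0.5]])       # model output for the current states
next_q = np.array([[0.0, 2.0], [5.0, 1.0]])      # model output for the next states

# Target: reward + discounted best next Q-value (zeroed when the episode is over)
targets = rewards + (1 - dones) * gamma * next_q.max(axis=1)
targets = targets.reshape(-1, 1)

# Mask: one-hot rows keep only the Q-value of the action actually taken
mask = np.eye(2)[actions]
q_taken = (all_q * mask).sum(axis=1, keepdims=True)

loss = np.mean((targets - q_taken) ** 2)         # mean squared error
print(targets.ravel(), q_taken.ravel(), loss)
```

Note how the terminal experience ignores `next_q` entirely: its target is just the immediate reward.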

### 3. Stability Issues
RL is notoriously unstable. If the model updates its weights, the "Target" values (which are calculated using the model itself!) shift immediately. It's like a dog chasing its own tail. 
* **Solution (Target Network):** In production DQN (like DeepMind's), we use two networks: one that is trained and chooses actions (Online) and a frozen copy used only for calculating targets (Target). We copy the weights from Online to Target at a fixed interval, for example every 1000 steps. The interval is a hyperparameter.
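
The periodic copy can be sketched independently of any deep learning framework. In the illustration below, weights are plain NumPy arrays and `TargetSync` is a hypothetical helper name. With Keras models, the copy step would typically be `target.set_weights(model.get_weights())` on a clone of the online model.

```python
import numpy as np

class TargetSync:
    """Holds a frozen copy of the online weights, refreshed every `sync_every` steps."""
    def __init__(self, online_weights, sync_every):
        self.sync_every = sync_every
        self.steps = 0
        self.target_weights = [w.copy() for w in online_weights]

    def step(self, online_weights):
        """Call once per training step. Returns True when a copy happened."""
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target_weights = [w.copy() for w in online_weights]
            return True
        return False

online = [np.zeros(2)]                 # stand-in for the online network's weights
sync = TargetSync(online, sync_every=3)
synced = []
for _ in range(6):
    online[0] += 1.0                   # stand-in for a gradient update
    synced.append(sync.step(online))

print(synced, sync.target_weights[0])
```

Between copies the target weights stay fixed while the online weights keep moving, which is what keeps the training targets from shifting after every update.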

## 5. Chapter Summary

* **RL** is about learning from consequences (Rewards).
* **Policy:** The strategy the agent follows.
* **Q-Learning:** Finds the optimal policy by learning the value of state-action pairs.
* **DQN:** Adapts Q-Learning to complex environments by using a Neural Network to approximate the Q-function instead of storing a Q-table.
* **Replay Buffer:** Essential for breaking data correlations and stabilizing training.
* **Epsilon-Greedy:** Essential for balancing exploration and exploitation.