## Exercise 1 (30 points)

### Function Approximation in Reinforcement Learning:
- 1.1 (5 points): What is the difference between tabular Q-learning and Q-learning with function approximation?
- 1.2 (5 points): Why are TD updates by Gradient Descent called Semi-Gradient updates?
- 1.3 (10 points): What challenges may tabular Q-learning meet when learning to play the game of Chess? Provide at least 2 answers.
- 1.4 (10 points): What challenges may value function approximation meet in a Reinforcement Learning context compared to offline regression in a Supervised Learning context? Provide at least 3 answers.

<span style="color: lightgreen;" >

- 1.1 **Tabular Q-learning** stores a Q-value in a lookup table for every state-action pair, making it simple and effective for small, discrete environments but impractical for large or continuous spaces. **Q-learning with function approximation** replaces the table with a parameterized function (typically a neural network) that estimates Q-values, allowing it to handle large or continuous spaces and to generalize to unseen states, at the cost of increased complexity and potential instability.
- 1.2 The TD target $r_{t+1} + \gamma v_{w}(s_{t+1})$ itself depends on the parameters $w$. The true gradient of the squared TD error $\frac{1}{2}\left(r_{t+1} + \gamma v_{w}(s_{t+1}) - v_{w}(s_{t})\right)^{2}$ would therefore contain both $\nabla_{w}v_{w}(s_{t+1})$ and $\nabla_{w}v_{w}(s_{t})$. The semi-gradient approach treats the target as a constant and keeps only $\nabla_{w}v_{w}(s_{t})$, giving the update $w \leftarrow w + \alpha\left(r_{t+1} + \gamma v_{w}(s_{t+1}) - v_{w}(s_{t})\right)\nabla_{w}v_{w}(s_{t})$. Because only part of the full gradient is used, the update is called semi-gradient.
- 1.3 Answer:
    - Chess has an enormous number of possible board states, so storing and updating a table entry for every state-action pair is infeasible in both computation and memory.
    - Each $(s, a)$ pair is treated independently in the table. Even if the agent learns a good strategy in one situation, it cannot transfer that knowledge to similar board positions.
- 1.4 Answer:
    - In TD value function approximation, the target values depend on the model's own predictions (i.e. bootstrapping): the target $r_{t+1} + \gamma v_{w}(s_{t+1})$ changes whenever $w$ is updated. In supervised learning, by contrast, the target values are typically fixed and do not change during training.
    - The target values are often biased and correlated samples of the ground truth $v_{\pi}(s_{t})$. In other words, the training data used to fit the predictive function is gathered online and is often non-stationary, which can lead to local optima or slow convergence, while in the supervised learning setting the samples are usually assumed to be i.i.d.
    - The environment in RL may continuously evolve during exploration and exploitation, or be too big for exhaustive sampling, and thus may never reach equilibrium, whereas in supervised learning the environment is fixed, the data is already collected, and no exploration is needed.
</span>
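
The semi-gradient idea from 1.2 can be sketched for a linear value function $v_{w}(s) = w^{\top}x(s)$. This is a minimal illustration, not part of the graded answer: the function name `semi_gradient_td0_update`, the feature vectors, and the step size are all assumptions chosen for the example.

```python
import numpy as np

def semi_gradient_td0_update(w, x_s, x_next, reward, alpha, gamma, done=False):
    """One semi-gradient TD(0) step for a linear value function v_w(s) = w . x(s)."""
    v_s = w @ x_s
    v_next = 0.0 if done else w @ x_next
    # The bootstrapped target is treated as a constant: no gradient
    # is taken through v_next, even though it depends on w.
    td_error = reward + gamma * v_next - v_s
    # For a linear v_w, the gradient of v_w(s) w.r.t. w is the feature vector x(s).
    return w + alpha * td_error * x_s

# Hypothetical two-feature example
w = np.zeros(2)
x_s = np.array([1.0, 0.0])
x_next = np.array([0.0, 1.0])
w = semi_gradient_td0_update(w, x_s, x_next, reward=1.0, alpha=0.5, gamma=0.9)
```

Only the weight tied to the current state's features moves; the weight that produced the bootstrapped target is left untouched by this step.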

## Exercise 2 (20 points)

### Deep Q-Network (DQN):
- 2.1 (10 points): What is Experience Replay? How does it benefit DQN?
- 2.2 (10 points): Provide a complete pseudo-code for the DQN algorithm.

<span style="color: lightgreen;" >

- 2.1 Answer:
    - Experience Replay is a method where the agent stores its past experiences (transitions) $(s, a, r, s^{'})$ in a replay buffer. During training, instead of learning only from the most recent transition, the agent randomly samples a batch of past experiences from this buffer to update its Q-network.
    - Random sampling from the replay buffer makes the training data closer to i.i.d. and the value targets more stationary, which can speed up convergence. Since each experience is used multiple times, the agent learns more efficiently and thoroughly from past experience without needing new interactions with the environment for every update.
- 2.2 The pseudocode is as follows:

    Initialize:
    - Q-network parameters $w$ arbitrarily
    - Target Q-network parameters $w_{target} \leftarrow w$
    - Policy $\pi_{s}$ (e.g. $\epsilon$-greedy based on $q_{w}(s, a)$)
    - Replay buffer $D \leftarrow$ empty
    - Hyperparameters: learning rate $\alpha$, discount factor $\gamma$, batch size $B$, target network update frequency $C$
    - Step counter $t\leftarrow 0$

    Loop forever:

    - Initialize environment and state $s$

    - Loop until episode ends:
        - Select action $a$ from $s$ using policy $\pi_{s}$
        - Execute $a$, observe reward $r$ and next state $s'$
        - Store the transition $(s, a, r, s')$ in the replay buffer $D$
        - Sample a random mini-batch of $B$ transitions $(s_{i}, a_{i}, r_{i}, s'_{i})$ from $D$
        - For each sampled transition, compute the target $y_{i} = r_{i}$ if $s'_{i}$ is terminal, otherwise $y_{i} = r_{i} + \gamma * \text{max}_{a'}q_{w_{target}}(s'_{i}, a')$
        - Update parameters $w$:
            $w \leftarrow w + \alpha * \frac{1}{B}\sum_{i}(y_{i} - q_{w}(s_{i}, a_{i}))\nabla_{w}q_{w}(s_{i}, a_{i})$
        - Update policy $\pi_{s}$ based on $q_{w}(s, a)$ (e.g. with epsilon-greedy)
        - Set $s \leftarrow s'$
        - Increment $t \leftarrow t + 1$

        - If $t \mod C == 0$:
            - Update target network: $w_{target} \leftarrow w$
            


</span>
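
The replay buffer described in 2.1 can be sketched as a small Python class. This is an illustrative sketch only: the class name `ReplayBuffer`, its method names, and the capacity are assumptions, not part of the original DQN specification.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are discarded first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal
        # correlation between consecutive transitions.
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Hypothetical usage: integers stand in for real states
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(t, 0, 1.0, t + 1, False)
states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
```

With a capacity of 3, only the three most recent transitions remain available for sampling, which bounds memory while still mixing experience from different time steps.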


## Exercise 3 (50 points)
### For 3.1 and 3.2, please produce the exact code (not pseudo-code):
- 3.1 (15 points): Write a custom function that implements Experience Replay updates for
an arbitrary minibatch size (use the TD(0) target as in the original DQN algorithm). The
required input to this function must be listed as arguments and/or defined in the docstring.

In [None]:
import random
import numpy as np

def experience_replay(replay_buffer, mini_batch_size, model, target_model, gamma):
    """
    Perform Experience Replay update using a minibatch of transitions.

    Args:
        replay_buffer (list): List of stored transitions (s, a, r, s', done),
                              where each transition is a tuple.
        mini_batch_size (int): Number of transitions to sample in each minibatch.
        model (object): Q-network used to predict current Q-values (online network).
        target_model (object): Target Q-network used to compute TD targets.
        gamma (float): Discount factor (0 < gamma <= 1).

    Returns:
        None
    """
    # Do nothing if the buffer has too few samples
    if len(replay_buffer) < mini_batch_size:
        return

    # Sample a random minibatch
    minibatch = random.sample(replay_buffer, mini_batch_size)
    states, actions, rewards, next_states, dones = zip(*minibatch)

    # Convert to NumPy arrays
    states = np.array(states)
    next_states = np.array(next_states)

    # Predict Q-values for current states and next states
    q_values = model.predict(states, verbose=0)   
    next_q_values = target_model.predict(next_states, verbose=0)

    # Prepare target batch
    target_batch = q_values.copy()

    for i in range(mini_batch_size):
        if dones[i]:
            target = rewards[i]
        else:
            target = rewards[i] + gamma * np.max(next_q_values[i])

        # Update only the action taken
        target_batch[i][actions[i]] = target

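    # Perform one step of gradient descent on the whole minibatch
    # (batch_size is set explicitly; the Keras default of 32 would split larger minibatches)
    model.fit(states, target_batch, batch_size=mini_batch_size, epochs=1, verbose=0)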
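
The per-sample loop in `experience_replay` can also be written in vectorized form. The sketch below is a possible alternative, not the submitted solution; the helper name `td0_targets` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def td0_targets(rewards, next_q_values, dones, gamma):
    """Compute TD(0) targets for a whole minibatch at once."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # not_done is 0 for terminal transitions, which removes the bootstrap term
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    return rewards + gamma * not_done * np.max(next_q_values, axis=1)

# Hypothetical minibatch of two transitions with two actions each
rewards = [1.0, 1.0]
next_q = np.array([[0.5, 2.0], [3.0, 1.0]])
dones = [False, True]
targets = td0_targets(rewards, next_q, dones, gamma=0.9)

# Overwrite only the Q-value of the action actually taken in each row
q_values = np.zeros((2, 2))
actions = np.array([1, 0])
target_batch = q_values.copy()
target_batch[np.arange(len(actions)), actions] = targets
```

The entries for actions that were not taken keep the network's own predictions, so they contribute zero error to the squared loss.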


- 3.2 (20 points): Implement both, the Gym environment and an agent to learn to control a
cart pole. The agent should be the PPO algorithm from the stable-baselines Python library.
It should start from scratch, train for 50,000 steps, and be evaluated on 50 episodes.


In [None]:
import gym
from pyglet.window import key 
import tensorflow as tf
from keras import __version__
tf.keras.__version__ = __version__
from stable_baselines3 import PPO 
from stable_baselines3.common.evaluation import evaluate_policy


# Answer begins here:
def create_cartpole_env():
    env_cartpole = gym.make("CartPole-v1")   
    env_cartpole.reset()

    return env_cartpole


def create_agent(env):
    """
    Create a PPO agent whose MLP policy maps each state to an action
    distribution and a state-value estimate.
    Args:
        env (gym.Env): The environment to use for the agent.
    """
    ppo_agent = PPO("MlpPolicy", env, verbose=1) # MlpPolicy suits low-dimensional vector observations
    return ppo_agent


def train_agent(agent, nb_steps=50000):
    """
    Train the agent for a given number of steps.
    Args:
        agent (stable_baselines3.PPO): The agent to train.
        nb_steps (int): The number of steps to train the agent.
    """
    agent.learn(total_timesteps=nb_steps)


def evaluate_agent(agent, env, nb_episodes=50):
    """
    Evaluate the agent for a given number of episodes.
    Args:
        agent (stable_baselines3.PPO): The agent to evaluate.
        env (gym.Env): The environment to use for the evaluation.
        nb_episodes (int): The number of episodes to evaluate the agent.
    """
    mean_reward, std_reward = evaluate_policy(agent, env, n_eval_episodes=nb_episodes)
    print(f"Mean reward: {mean_reward} +/- {std_reward}")

# Triggers
env_cartpole = create_cartpole_env()
agent = create_agent(env_cartpole)
train_agent(agent, nb_steps=50000)
evaluate_agent(agent, env_cartpole, nb_episodes=50)


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 20.4     |
|    ep_rew_mean     | 20.4     |
| time/              |          |
|    fps             | 4372     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 23.6        |
|    ep_rew_mean          | 23.6        |
| time/                   |             |
|    fps                  | 2006        |
|    iterations           | 2           |
|    time_elapsed         | 2           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.009306562 |
|    clip_fraction        | 0.0906      |
|    clip_range           | 0.2         |
|    entropy_loss   



Mean reward: 500.0 +/- 0.0
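
The training log above reports `clip_range` and `clip_fraction`. As a rough illustration of what these quantities refer to, the sketch below computes PPO's clipped surrogate objective and the fraction of clipped samples in NumPy. It is an assumption-laden sketch, not the stable-baselines3 implementation; the function name `clipped_surrogate` and the sample ratios and advantages are made up for the example.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Return the mean clipped surrogate objective and the clip fraction."""
    ratio = np.asarray(ratio, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    # PPO maximizes the pessimistic (minimum) of the two terms
    objective = np.mean(np.minimum(unclipped, clipped))
    # Share of samples whose probability ratio left the trust interval
    clip_fraction = np.mean(np.abs(ratio - 1.0) > clip_range)
    return objective, clip_fraction

# Hypothetical probability ratios pi_new / pi_old and advantages
objective, clip_fraction = clipped_surrogate(
    ratio=[1.0, 1.5, 0.5, 1.1],
    advantage=[1.0, 1.0, -1.0, -1.0],
)
```

In this toy batch two of the four ratios fall outside $[0.8, 1.2]$, so half of the samples count as clipped.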


- 3.3 (15 points): List at least 3 examples of RL agent components that you can import
ready-to-use from the keras-rl Python library. For each component, provide both the exact
method name (list of arguments is not necessary) and a description of what it is.

<span style="color: lightgreen;" >

- `SequentialMemory()` in `rl.memory`: This is a replay buffer used to store the agent's past experiences in the form of transitions. It enables experience replay by randomly sampling past experiences to break correlation and improve sample efficiency.
- `DQNAgent()` in `rl.agents`: This is the main Deep Q-Learning agent class. It combines a neural network model, memory buffer, and policy to learn Q-values and interact with the environment using the DQN algorithm.
- `LinearAnnealedPolicy()` in `rl.policy`: This is a policy wrapper that gradually reduces a parameter (typically ε in ε-greedy) linearly over time to balance exploration and exploitation during training.
</span>
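
The linear annealing schedule described for `LinearAnnealedPolicy` can be sketched in plain Python. This is an illustrative sketch under assumptions, not the keras-rl source: the function name `linear_anneal` is hypothetical, and the parameter names and default values are chosen only to resemble a typical ε-greedy setup.

```python
def linear_anneal(step, value_max=1.0, value_min=0.1, nb_steps=10000):
    """Linearly decay a value from value_max to value_min over nb_steps."""
    slope = -(value_max - value_min) / nb_steps
    # After nb_steps the value stays at the floor value_min
    return max(value_min, value_max + slope * step)

# Hypothetical checkpoints of the exploration rate
eps_start = linear_anneal(0)
eps_mid = linear_anneal(5000)
eps_end = linear_anneal(20000)
```

Early in training the agent explores almost uniformly; once the schedule reaches its floor, it acts greedily most of the time while keeping a small amount of exploration.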

In [3]:
from rl.agents import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy


