# RL Final Project

Now it's finally time to put what we have learned so far in this course to use!

The aim of this project is to assess your practical knowledge in Reinforcement Learning.

Your project consists of 2 parts. You will get the chance to work with 2 different environments.


## 2. Atari Game Pong



**[Pong](https://www.gymlibrary.dev/environments/atari/pong/)** is a famous Atari game that almost all of us have played at least once!
The goal of this task is to get familiar with the **gym** library and use Deep Reinforcement Learning to train an agent that can actually play this game!

In [1]:
# !pip install ALE
# !pip install gym
# !pip install opencv-python
#
# !pip install "tensorflow==2.10"
# !pip install "tensorflow-gpu==2.10"
#
# !pip install tqdm
# !pip install jdc
#
!pip list

Package                      Version
---------------------------- -----------
absl-py                      1.4.0
Ale                          0.8.4
ale-py                       0.8.1
anyio                        3.5.0
appdirs                      1.4.4
argon2-cffi                  21.3.0
argon2-cffi-bindings         21.2.0
asttokens                    2.0.5
astunparse                   1.6.3
attrs                        22.1.0
AutoROM                      0.4.2
AutoROM.accept-rom-license   0.6.1
Babel                        2.11.0
backcall                     0.2.0
beautifulsoup4               4.11.1
bleach                       4.1.0
brotlipy                     0.7.0
cachetools                   5.3.0
certifi                      2023.5.7
cffi                         1.15.1
charset-normalizer           2.0.4
click                        8.1.3
cloudpickle                  2.2.1
colorama                     0.4.6
comm                         0.1.2
cryptography                 38.0.4
de

**Importing Libraries:** The necessary libraries and modules are imported, including gym for the game environment, random for random actions, warnings for suppressing warnings, numpy for numerical operations, tensorflow for building and training neural networks, PIL for image preprocessing, collections for deque (double-ended queue) data structure, IPython for capturing output, and tqdm for displaying progress bars.

In [2]:
import gym
import random
import warnings

import numpy as np
import tensorflow as tf

from PIL import Image
from collections import deque
from IPython.utils import io
from tqdm.notebook import tqdm

**Ignoring Warnings:** The warnings.filterwarnings('ignore') statement is used to suppress warnings.

In [3]:
warnings.filterwarnings('ignore')

**Listing GPU Devices:** tf.config.list_physical_devices('GPU') lists the available GPU devices. This line is likely used to check if a GPU is available for acceleration but does not store or use the output.

In [4]:
tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

**Training Configuration:** Several constants and configuration parameters are defined, such as TRAIN (a boolean flag indicating whether to train the agent), GAME_NAME (the name of the game environment), MODEL_PATH (the path to save the trained model), INPUT_SHAPE (the shape of input frames), BATCH_SIZE (the size of training batches), MEMORY_SIZE (the size of the replay buffer), TARGET_UPDATE_FREQ (the frequency of updating the target network), GAMMA (the discount factor), EPSILON (the exploration rate), MIN_EPSILON (the minimum exploration rate), and EPSILON_DECAY (the decay rate for exploration).

In [6]:
TRAIN = True
GAME_NAME = 'ALE/Pong-v5'
MODEL_PATH = './models/pong_model.h5'
INPUT_SHAPE = (84, 84, 1)

BATCH_SIZE = 32
MEMORY_SIZE = 10000
TARGET_UPDATE_FREQ = 1000

GAMMA = 0.95
EPSILON = 1.0
MIN_EPSILON = 0.1
EPSILON_DECAY = 0.999
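
To get a feel for the exploration schedule, the sketch below counts how many multiplicative decay steps it takes for epsilon to fall from `EPSILON` to `MIN_EPSILON` with the values above. It assumes epsilon is multiplied by `EPSILON_DECAY` once per training update while it is still above the minimum. `steps_to_min_epsilon` is a hypothetical helper written for illustration, not part of the notebook.

```python
import math


def steps_to_min_epsilon(epsilon=1.0, min_epsilon=0.1, decay=0.999):
    """Count decay steps until epsilon is no longer above min_epsilon."""
    steps = 0
    while epsilon > min_epsilon:
        epsilon *= decay
        steps += 1
    return steps


steps = steps_to_min_epsilon()
# Closed form for comparison: smallest n with decay**n <= min_epsilon.
closed_form = math.ceil(math.log(0.1) / math.log(0.999))
print(steps, closed_form)
```

Both counts come out at roughly 2,300 decay steps, so how quickly exploration fades depends mostly on how often the decay is applied.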

**Prioritized Replay Buffer:** The PrioritizedReplayBuffer class implements a prioritized replay buffer for storing and sampling experiences. Experiences are stored in a deque with limited size, and priorities are stored in a separate deque. Experiences can be appended, sampled, and priorities can be updated.

In [7]:
class PrioritizedReplayBuffer:
    def __init__(self, size, alpha=0.6, beta_start=0.4, beta_frames=100000):
        self.alpha = alpha
        self.beta_start = beta_start
        self.beta_frames = beta_frames
        self.frame = 1
        self.buffer = deque(maxlen=size)
        self.priorities = deque(maxlen=size)
        self.max_priority = 1.0

    def append(self, experience):
        self.buffer.append(experience)
        self.priorities.append(self.max_priority)

    def sample(self, batch_size):
        # Sampling probability of each experience: p_i^alpha / sum_k p_k^alpha
        probs = np.array(self.priorities) ** self.alpha
        probs /= probs.sum()

        # Anneal beta linearly from beta_start towards 1.0 and never past it
        beta = min(1.0, self.beta_start + (1 - self.beta_start) * (self.frame / self.beta_frames))
        self.frame += 1

        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        experiences = [self.buffer[i] for i in indices]
        # Importance-sampling weights, normalized so the largest weight is 1
        weights = (len(self.buffer) * probs[indices]) ** (-beta)
        weights /= weights.max()

        return experiences, indices, np.array(weights, dtype=np.float32)

    def update_priorities(self, indices, errors, absolute_error=1e-5):
        priorities = self.priorities
        max_priority = self.max_priority
        for i, error in zip(indices, errors):
            priority = np.max(np.abs(error)) + absolute_error
            priorities[i] = priority
            max_priority = max(max_priority, priority)
        self.max_priority = max_priority

    def __len__(self):
        return len(self.buffer)
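
The two formulas behind `sample` can be tried out in isolation. The sketch below is a plain-Python illustration (no NumPy, no deque) of how priorities become sampling probabilities and importance-sampling weights. The helper names and the priority values are made up for the example.

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i**alpha / sum_k p_k**alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, indices, beta=0.4):
    """w_i = (N * P(i))**(-beta), normalized by the largest weight."""
    n = len(probs)
    raw = [(n * probs[i]) ** (-beta) for i in indices]
    largest = max(raw)
    return [w / largest for w in raw]


priorities = [1.0, 0.5, 2.0, 0.1]  # illustrative values only
probs = sampling_probabilities(priorities)

rng = random.Random(0)  # seeded so the draw is repeatable
indices = rng.choices(range(len(priorities)), weights=probs, k=3)
weights = importance_weights(probs, indices)
print(probs, indices, weights)
```

Experiences with a high priority are drawn more often, and the importance weights shrink their contribution to the loss to compensate for that bias.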

**DQN Model:** The DQNModel class defines the architecture of the DQN model using convolutional and dense layers. The call method specifies the forward pass of the model.

In [8]:
class DQNModel(tf.keras.Model):
    def __init__(self, action_size, act='relu'):
        super(DQNModel, self).__init__()
        self.conv1 = tf.keras.layers.Conv2D(32, 8, 4, activation=act)
        self.conv2 = tf.keras.layers.Conv2D(64, 4, 2, activation=act)
        self.conv3 = tf.keras.layers.Conv2D(64, 3, 1, activation=act)
        self.flatten = tf.keras.layers.Flatten()
        self.dense = tf.keras.layers.Dense(512, activation=act)
        self.outputs = tf.keras.layers.Dense(action_size)

    def call(self, inputs):
        x = self.conv1(inputs)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.flatten(x)
        x = self.dense(x)
        return self.outputs(x)
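
As a sanity check on the architecture, the sketch below traces the spatial size of an 84x84 frame through the three convolutional layers. It assumes Keras' default `padding='valid'` for `Conv2D`, which is what the model uses since no padding argument is passed. `conv_output_size` is a hypothetical helper for this illustration.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a 'valid' (no padding) convolution."""
    return (size - kernel) // stride + 1


# (filters, kernel, stride) for the three Conv2D layers in DQNModel.
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size = 84  # height == width for an 84x84 input frame
channels = 1
for filters, kernel, stride in layers:
    size = conv_output_size(size, kernel, stride)
    channels = filters
    print(f"{size}x{size}x{channels}")

# Number of features that Flatten passes to the 512-unit dense layer.
flat_features = size * size * channels
print(flat_features)
```

The feature map shrinks from 84 to 20, 9, and finally 7 pixels per side, so the flatten layer hands 7 * 7 * 64 = 3136 features to the dense layer.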

**DQN Agent:** The DQNAgent class represents the DQN agent. It initializes the agent with the state and action sizes, creates a replay buffer, sets the exploration rate, builds the model and target model using the DQNModel class, and initializes an optimizer. It also defines methods for updating the target model, choosing actions, running a training step, performing a replay update on a sampled batch (`run_episode`), remembering experiences, and loading and saving model weights.

In [9]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = PrioritizedReplayBuffer(MEMORY_SIZE)
        self.epsilon = EPSILON
        self.model = DQNModel(self.action_size)
        self.target_model = DQNModel(self.action_size)
        self.update_target_model()
        self.optimizer = tf.keras.optimizers.Adam()

    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())

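    def choose_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        # Calling the model directly avoids the per-call overhead of predict()
        q_values = self.model(state, training=False)
        return int(np.argmax(q_values[0]))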

    @tf.function
    def train_step(self, states, target_f, weights):
        # States are already scaled to [0, 1] by preprocess_frame
        with tf.GradientTape() as tape:
            predictions = self.model(states, training=True)
            loss = tf.keras.losses.MSE(target_f, predictions)
            # Weight each sample's loss by its importance-sampling weight
            loss = tf.reduce_mean(loss * weights)
        grads = tape.gradient(loss, self.model.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_weights))

    def run_episode(self):
        # One training step on a batch sampled from the prioritized replay buffer
        experiences, indices, weights = self.memory.sample(BATCH_SIZE)
        states, actions, rewards, next_states, dones = zip(*experiences)
        states = np.concatenate(states, axis=0)
        next_states = np.concatenate(next_states, axis=0)

        # Double DQN: the online model selects the action, the target model evaluates it
        target_q_values = self.target_model.predict(next_states)
        online_q_values = self.model.predict(next_states)
        best_actions = np.argmax(online_q_values, axis=1)

        targets = np.array(rewards) + (1 - np.array(dones)) * GAMMA * target_q_values[
            np.arange(BATCH_SIZE), best_actions]

        q_values = self.model.predict(states)
        target_f = q_values.copy()
        target_f[np.arange(BATCH_SIZE), np.array(actions)] = targets

        # The TD error is non-zero only for the action that was actually taken
        errors = np.abs(q_values - target_f)
        self.memory.update_priorities(indices, errors)

        if self.epsilon > MIN_EPSILON:
            self.epsilon *= EPSILON_DECAY

        self.train_step(states, target_f, weights)
        # The target model is synced by the training loop every TARGET_UPDATE_FREQ episodes

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)
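
The target computation in `run_episode` follows the Double DQN pattern: the online model picks the next action and the target model scores it. The plain-Python sketch below shows that step on a toy batch. The function name and all Q-values are invented for the example, and it uses lists rather than NumPy arrays for readability.

```python
def double_dqn_targets(rewards, dones, online_next_q, target_next_q, gamma=0.95):
    """Double DQN: the online net picks the action, the target net scores it."""
    targets = []
    for r, done, q_online, q_target in zip(rewards, dones, online_next_q, target_next_q):
        # Greedy action according to the online network
        best_action = max(range(len(q_online)), key=q_online.__getitem__)
        # No bootstrapping from terminal transitions
        bootstrap = 0.0 if done else gamma * q_target[best_action]
        targets.append(r + bootstrap)
    return targets


rewards = [0.0, 1.0]
dones = [False, True]
online_next_q = [[0.2, 0.9, 0.1], [0.5, 0.4, 0.3]]  # made-up values
target_next_q = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # made-up values

targets = double_dqn_targets(rewards, dones, online_next_q, target_next_q)
print(targets)
```

For the first transition the online network prefers action 1, so the target uses the target network's value 2.0 for that action rather than its own maximum of 3.0. The second transition is terminal, so its target is just the reward.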

**Frame Preprocessing:** The preprocess_frame function takes a frame as input, rescales it, converts it to grayscale, resizes it to the desired input shape, and returns the preprocessed frame.

In [10]:
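def preprocess_frame(frame):
    # env.reset() returns (observation, info); env.step() returns the observation itself
    if isinstance(frame, tuple):
        frame = frame[0]
    frame = np.asarray(frame, dtype=np.float32) / 255.0
    # Convert RGB to grayscale using luminance weights
    if frame.ndim > 2 and frame.shape[-1] == 3:
        frame = np.dot(frame[..., :3], [0.2989, 0.5870, 0.1140])
    # PIL's resize expects (width, height)
    frame = Image.fromarray(frame.astype(np.float32)).resize((INPUT_SHAPE[1], INPUT_SHAPE[0]))
    # Add the channel axis: (84, 84) -> (84, 84, 1)
    return np.expand_dims(np.asarray(frame, dtype=np.float32), axis=2)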

**Training Function:** The train function trains the DQN agent for a specified number of episodes. It creates the game environment, initializes the agent, and loads model weights (if available). For each episode it resets the environment, preprocesses the initial state, and then repeatedly chooses an action, takes a step, and remembers the experience. After each episode it runs a replay update, periodically updates the target model, and displays progress.

In [11]:
import os


def train(episodes):
    env = gym.make(GAME_NAME)
    state_size = INPUT_SHAPE
    action_size = env.action_space.n
    agent = DQNAgent(state_size, action_size)

    # Build both networks with a dummy forward pass so weights can be loaded and saved
    dummy_data = np.zeros((1,) + INPUT_SHAPE)
    with io.capture_output() as captured:
        agent.model.predict(dummy_data)
        agent.target_model.predict(dummy_data)

    # Resume from saved weights only if a checkpoint already exists
    if os.path.exists(MODEL_PATH):
        agent.load(MODEL_PATH)
        agent.update_target_model()
    os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)

    bar_format = 'Training: {percentage:3.0f}% |{bar}| Elapsed: {elapsed} Remaining: {remaining}{postfix}'
    training_pbar = tqdm(total=episodes, bar_format=bar_format, unit='episode')

    best_total_reward = -np.inf

    for e in range(episodes):
        state = preprocess_frame(env.reset())
        state = np.expand_dims(state, axis=0)
        done = False
        total_reward = 0
        while not done:
            with io.capture_output() as captured:
                action = agent.choose_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            # The episode ends when the game is over or the time limit is hit
            done = terminated or truncated
            total_reward += reward

            next_state = preprocess_frame(next_state)
            next_state = np.expand_dims(next_state, axis=0)
            agent.remember(state, action, reward, next_state, done)
            state = next_state

        if total_reward > best_total_reward:
            print(f"New best total reward {total_reward}, saving model weights.")
            best_total_reward = total_reward
            agent.save(MODEL_PATH)

        # One replay update per episode, once the buffer holds at least one full batch
        if len(agent.memory) >= BATCH_SIZE:
            with io.capture_output() as captured:
                agent.run_episode()

        if e % TARGET_UPDATE_FREQ == 0:
            agent.update_target_model()

        training_pbar.set_postfix_str(f'Reward: {total_reward}')
        training_pbar.update(1)

    training_pbar.close()
    env.close()

**Training Execution:**

In [12]:
if TRAIN:
    train(episodes=1000)

Training:   0% |          | Elapsed: 00:00 Remaining: ?

New best total reward -21.0, saving model weights.
New best total reward -20.0, saving model weights.
New best total reward -16.0, saving model weights.


Exception ignored in: <function WeakKeyDictionary.__init__.<locals>.remove at 0x000002308A5193F0>
Traceback (most recent call last):
  File "C:\Users\Admin\anaconda3\envs\RL\lib\weakref.py", line 371, in remove
    self = selfref()
KeyboardInterrupt: 

KeyboardInterrupt



**Playing the game:** The play_with_model function uses the trained DQN agent to play the game. It creates the Gym environment, initializes the agent, loads the model weights, and interacts with the environment based on the agent's actions.

In [None]:
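def play_with_model():
    env = gym.make(GAME_NAME, render_mode='human')
    state_size = INPUT_SHAPE
    action_size = env.action_space.n
    agent = DQNAgent(state_size, action_size)

    # Build the network with a dummy forward pass before loading the weights
    with io.capture_output() as captured:
        agent.model.predict(np.zeros((1,) + INPUT_SHAPE))
    agent.load(MODEL_PATH)
    # Act greedily: a fresh agent starts with epsilon = EPSILON and would play randomly
    agent.epsilon = 0.0

    state = preprocess_frame(env.reset())
    state = np.expand_dims(state, axis=0)
    done = False
    while not done:
        # With render_mode='human' the window is refreshed on every step
        with io.capture_output() as captured:
            action = agent.choose_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        state = preprocess_frame(next_state)
        state = np.expand_dims(state, axis=0)
    env.close()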

**Playing execution:**

In [None]:
play_with_model()

**Note**: Keep in mind that the observations for this environment are frames from the game. Each observation is an image of size (210, 160, 3), so you will need to implement an agent that can process images (a CNN-based agent).

Make sure to preprocess the frames. For example, you can convert the RGB image to grayscale. You can use the [OpenCV](https://docs.opencv.org/4.x/d6/d00/tutorial_py_root.html) library to perform resizing, blurring, or any other applicable filtering on the frames.

## Grading criteria
Project: 35 points

* Final Viva: 10 points
* Implementation: 10 points
* Final Report: 15 points

For the viva, you will need to explicitly mention each team member's contribution.

You can write your report in this notebook. The report must include visualizations of your results. Train your model with at least 2 different sets of hyperparameters and compare their outputs in the visualization section.


### Good Luck!