# RL Final Project

Now it's finally time to put what we have learned in this course to use!

The aim of this project is to assess your practical knowledge of Reinforcement Learning.

Your project consists of two parts, and you will get the chance to work with two different environments.


## 2. Atari Game Pong


<img src="zzzzzzzzzzzzzzzzzc"/>

**[Pong](https://www.gymlibrary.dev/environments/atari/pong/)** is a famous Atari game that almost all of us have played at least once!
The goal of this task is to get hands-on with the **gym** library and use Deep Reinforcement Learning to train an agent that can actually play this game!

In [25]:
# !pip install ALE
# !pip install gym
# !pip install opencv-python
#
# !pip install "tensorflow==2.10"
# !pip install "tensorflow-gpu==2.10"
#
# !pip install tqdm
# !pip install jdc
#
# !pip list

Imports the libraries and modules used below: Gym (for the RL environment), OpenCV (for image processing), NumPy (for numerical operations), TensorFlow with its bundled Keras API (for building and training the DQN model), jdc (for the `%%add_to` cell magic that attaches methods to a class across cells), and tqdm (for progress bars).

In [26]:
import gym
import cv2
import jdc
import random
import warnings
import numpy as np
import tensorflow as tf

from IPython.utils import io
from tqdm.notebook import tqdm

Suppress unimportant warnings, define the constants, and check whether TensorFlow found a GPU.

In [27]:
warnings.filterwarnings('ignore')

INPUT_SHAPE = (84, 84, 1)
TRAIN = False
GAME_NAME = 'ALE/Pong-v5'

tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Defines the DQNAgent class, which represents the DQN (Deep Q-Network) agent. The constructor creates the environment, an online model, and a target model initialized with the same weights. The class has methods to build the model, get an action, store experiences, replay experiences for training, and run episodes. The model is built with the Keras Sequential API and consists of three convolutional layers, a 512-unit fully connected layer, and a linear output layer with one Q-value per action. It is compiled with the Adam optimizer and mean squared error (MSE) loss.

In [28]:
class DQNAgent:
    def __init__(self, learning_rate=0.0001, batch_size=32, memory_size=20000, update_frequency=1000):
        self.env = gym.make(GAME_NAME)  # , render_mode='human')
        self.action_space = self.env.action_space.n
        self.model = self.build_model(learning_rate)
        self.target_model = self.build_model(learning_rate)
        self.target_model.set_weights(self.model.get_weights())
        self.batch_size = batch_size
        self.memory = []
        self.memory_size = memory_size
        self.update_frequency = update_frequency
        self.steps = 0

    def build_model(self, learning_rate, activation='relu'):
        model = tf.keras.models.Sequential()
        model.add(tf.keras.layers.Conv2D(32, kernel_size=(8, 8), strides=(4, 4), activation=activation,
                                         input_shape=INPUT_SHAPE))
        model.add(tf.keras.layers.Conv2D(64, kernel_size=(4, 4), strides=(2, 2), activation=activation))
        model.add(tf.keras.layers.Conv2D(64, kernel_size=(3, 3), strides=(1, 1), activation=activation))
        model.add(tf.keras.layers.Flatten())
        model.add(tf.keras.layers.Dense(512, activation=activation))
        model.add(tf.keras.layers.Dense(self.action_space))
        optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
        model.compile(optimizer=optimizer, loss='mse')
        return model

Adds a method get_action to the DQNAgent class. It takes a state and an epsilon value for epsilon-greedy exploration. With probability epsilon, it selects a random action. Otherwise, it uses the model to predict the Q-values for the given state and selects the action with the highest Q-value.

In [29]:
%%add_to DQNAgent

def get_action(self, state, epsilon):
    if np.random.rand() <= epsilon:
        return np.random.randint(self.action_space)
    with io.capture_output() as captured:
        q_values = self.model.predict(state)
    return np.argmax(q_values[0])
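
The epsilon-greedy rule can be checked in isolation, without a network or an environment. The snippet below is a minimal sketch: `epsilon_greedy` is a hypothetical stand-alone helper (not part of the notebook's class), and the Q-values are made up.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])  # made-up Q-values for three actions

# epsilon = 0 is always greedy: index 1 holds the largest Q-value
greedy_action = epsilon_greedy(q_values, 0.0, rng)

# epsilon = 1 is always random: over many draws every action shows up
random_actions = {epsilon_greedy(q_values, 1.0, rng) for _ in range(200)}
```

With epsilon between these two extremes, the agent mixes both behaviors, which is what the decay schedule in training relies on.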

Adds a method remember to the DQNAgent class. It stores the agent's experience tuple (state, action, reward, next_state, done) in the memory, after dropping the leading batch dimension from the states and clipping the reward to [-1, 1]. If the memory exceeds the specified memory size, the oldest experiences are removed to maintain the memory size.

In [30]:
%%add_to DQNAgent

def remember(self, state, action, reward, next_state, done):
    state = state[0]
    next_state = next_state[0] if next_state is not None else None
    reward = np.clip(reward, -1, 1)
    self.memory.append((state, action, reward, next_state, done))
    if len(self.memory) > self.memory_size:
        self.memory = self.memory[-self.memory_size:]
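
Trimming a list by slicing copies the kept elements every time the limit is exceeded. One possible alternative, shown here only as a sketch and not as the notebook's implementation, is `collections.deque` with `maxlen`, which discards the oldest entry automatically. The transitions below are placeholders.

```python
import random
from collections import deque

memory = deque(maxlen=3)  # tiny capacity so the eviction is visible

for step in range(5):
    # placeholder transitions: (state, action, reward, next_state, done)
    memory.append((step, 0, 0.0, step + 1, False))

# Only the three most recent transitions remain; the two oldest were evicted
stored_states = [transition[0] for transition in memory]

# Uniform minibatch without replacement, as used by experience replay
batch = random.sample(list(memory), 2)
```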

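Adds a method replay to the DQNAgent class. It performs the model training using the experience replay technique. It samples a batch of experiences from the memory and calculates the target Q-values using the Bellman equation, with the next-state values taken from the target model. The online model is then trained on the states and target Q-values, and after every update_frequency calls the target model is synchronized with the online model.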

In [31]:
%%add_to DQNAgent

def replay(self, gamma):
    if len(self.memory) < self.batch_size:
        return
    batch = random.sample(self.memory, self.batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = np.array(states)
    actions = np.array(actions)
    rewards = np.array(rewards)
    next_states = np.array(next_states) if any(x is not None for x in next_states) else None
    dones = np.array(dones)

    not_dones = 1 - dones
    with io.capture_output() as captured:
        targets = rewards + gamma * np.max(self.target_model.predict(next_states), axis=1) * not_dones
        target_f = self.model.predict(states)
    target_f[np.arange(self.batch_size), actions] = targets
    self.model.train_on_batch(states, target_f)

    self.steps += 1
    if self.steps % self.update_frequency == 0:
        self.target_model.set_weights(self.model.get_weights())
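
To make the target computation concrete, here is a small worked example with hand-picked numbers. The arrays stand in for the outputs of `target_model.predict` and `model.predict`; all values are illustrative.

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, -1.0])
dones = np.array([False, False, True])
actions = np.array([1, 0, 1])

# Stand-in for target_model.predict(next_states): one row of Q-values per sample
next_q = np.array([[0.5, 2.0],
                   [1.0, 0.0],
                   [3.0, 4.0]])

# y = r + gamma * max_a' Q_target(s', a'); the bootstrap term is zeroed at terminal states
targets = rewards + gamma * next_q.max(axis=1) * (1 - dones)

# Stand-in for model.predict(states): only the entry of the action taken is overwritten,
# so the MSE loss is zero for every other action
target_f = np.zeros((3, 2))
target_f[np.arange(3), actions] = targets
```

The third sample is terminal, so its target is just the reward of -1 even though the next-state Q-values are large.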

Adds a method train to the DQNAgent class. It trains the DQN agent for a specified number of episodes and uses a progress bar to track the progress of the training. Each iteration calls the run_episode method to play one episode and store its transitions, decays epsilon multiplicatively (never below epsilon_end) to gradually shift from exploration to exploitation, and then calls replay once to take a single training step on a sampled minibatch. After training, the model is saved to a file and the environment is closed.

In [32]:
%%add_to DQNAgent

def train(
        self,
        model_name,
        episodes,
        epsilon_decay,
        epsilon_start=1.0,
        epsilon_end=0.1,
        gamma=0.99,
        max_episode_length=1000
):
    epsilon = epsilon_start
    bar_format = 'Training: {percentage:3.0f}% |{bar}| Elapsed: {elapsed} Remaining: {remaining}{postfix}'
    training_pbar = tqdm(total=episodes, bar_format=bar_format, unit='episode')

    for episode in range(episodes):
        total_reward = self.run_episode(epsilon, gamma, max_episode_length)
        training_pbar.set_postfix_str(f'Reward: {int(total_reward)}')
        training_pbar.update(1)
        epsilon = max(epsilon_end, epsilon * epsilon_decay)
        self.replay(gamma)

    training_pbar.close()
    print('Training completed.')
    self.model.save(model_name)
    print('Model saved.')
    self.env.close()
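
It helps to check how far a multiplicative schedule actually gets within the training budget. The sketch below uses the default values of train together with the decay rate and episode count from the training cell further down; the variable names are only for illustration.

```python
import math

epsilon_start, epsilon_end, epsilon_decay = 1.0, 0.1, 0.995
episodes = 150

# Value of epsilon after `episodes` multiplicative decays
final_epsilon = max(epsilon_end, epsilon_start * epsilon_decay ** episodes)

# Smallest number of decays after which the floor epsilon_end is reached
episodes_to_floor = math.ceil(
    math.log(epsilon_end / epsilon_start) / math.log(epsilon_decay)
)

print(f'epsilon after {episodes} episodes: {final_epsilon:.2f}')
print(f'episodes needed to reach the floor: {episodes_to_floor}')
```

With these settings epsilon is still around 0.47 after 150 episodes, and the floor of 0.1 would only be reached after roughly 460 episodes, so the agent acts randomly about half the time at the end of this run.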

Adds a method run_episode to the DQNAgent class. It plays a single episode of the environment with the current epsilon-greedy policy: at each step it selects an action, observes the next frame and reward, and stores the transition in the replay memory. It does not update the weights itself; that happens in replay, which train calls after each episode. The episode stops when the game ends or after max_episode_length steps, and the method returns the total reward.

In [33]:
%%add_to DQNAgent

def run_episode(self, epsilon, gamma, max_episode_length):
    observation = self.env.reset()
    state = preprocess_frame(observation)
    state = np.reshape(state, (1, *INPUT_SHAPE))
    done = False
    total_reward = 0
    episode_length = 0

    while not done:
        # if 'render_fps' in self.env.metadata:
        #     self.env.render()
        action = self.get_action(state, epsilon)
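        # step() returns (observation, reward, terminated, truncated, info);
        # the episode is over when either flag is set
        next_observation, reward, terminated, truncated, _ = self.env.step(action)
        done = terminated or truncated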
        next_state = preprocess_frame(next_observation)
        if next_state is not None:
            next_state = np.reshape(next_state, (1, *INPUT_SHAPE))
            self.remember(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward # + 0.01
            episode_length += 1
        if episode_length >= max_episode_length or next_state is None:
            done = True
    return total_reward

Defines the preprocess_frame function, which takes an observation from the environment and preprocesses the frame. It converts the frame to grayscale, resizes it to (84, 84) pixels, and normalizes the pixel values to the range [0, 1].

In [34]:
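def preprocess_frame(frame):
    # env.reset() returns (observation, info), while env.step() returns the bare frame.
    # Unwrap only the tuple case; otherwise frame[0] would take the first pixel row.
    if isinstance(frame, tuple):
        frame = frame[0]
    if frame.ndim > 2:
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    # Resize to the network input size and scale pixel values to [0, 1]
    frame = cv2.resize(frame, (84, 84)) / 255.0
    return frame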

Loads a trained model and lets it play one game with rendering.

In [35]:
def open_model(model_name):
    model = tf.keras.models.load_model(model_name)
    env = gym.make(GAME_NAME, render_mode='human')
    observation = env.reset()
    state = preprocess_frame(observation)
    state = np.reshape(state, (1, *INPUT_SHAPE))
    done = False

    while not done:
        if 'render_fps' in env.metadata:
            env.render()
        # Always act greedily: no exploration when showing the trained agent
        with io.capture_output() as captured:
            action = np.argmax(model.predict(state))
        next_observation, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = preprocess_frame(next_observation)
        state = np.reshape(next_state, (1, *INPUT_SHAPE))

    env.close()

If TRAIN is True, creates an instance of the DQNAgent class with its default hyperparameters (learning rate, batch size, memory size, and target-update frequency) and trains it by calling the train method. The call specifies the model name for saving, the number of episodes, the epsilon decay rate, and the maximum episode length. The train method closes the environment when it finishes. Finally, open_model loads the saved model and plays a game.

In [None]:
if TRAIN:
    agent = DQNAgent()
    agent.train(
        model_name='trained_modell.h5',
        episodes=150,
        epsilon_decay=0.995,
        max_episode_length=1000
    )

open_model('trained_modell.h5')

**Note**: Keep in mind that the observations in this environment are frames from the game. Each observation is an image of size (210, 160, 3), so you will need to implement an agent that can process images (a CNN-based agent).

Make sure to preprocess the frames. For example, you can convert the RGB image to grayscale. You can use the [OpenCV](https://docs.opencv.org/4.x/d6/d00/tutorial_py_root.html) library to perform resizing, blurring, or any other applicable filtering on the frames.
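
As one possible variation on the preprocessing, the following NumPy-only sketch crops the frame, converts it to grayscale, and downsamples it. `preprocess_numpy` is a hypothetical helper, the crop rows are an assumption about where the playing field sits and should be verified on a real frame, and the channel average is a simplification of OpenCV's weighted grayscale formula. A random array stands in for a game frame.

```python
import numpy as np

def preprocess_numpy(frame):
    """Crop, grayscale, and downsample a (210, 160, 3) uint8 RGB frame."""
    cropped = frame[35:195]                          # assumed playfield rows; verify on a real frame
    gray = cropped.astype(np.float32).mean(axis=2)   # plain channel average, not cv2's weighted formula
    small = gray[::2, ::2]                           # keep every second pixel: (160, 160) -> (80, 80)
    return small / 255.0                             # scale to [0, 1]

# Random stand-in for an Atari frame
fake_frame = np.random.default_rng(0).integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
processed = preprocess_numpy(fake_frame)
print(processed.shape)
```

The result is 80 by 80 rather than 84 by 84, so INPUT_SHAPE would have to be changed to match if this variant were used.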

## Grading criteria
Project: 35 points

* Final Viva: 10 points
* Implementation: 10 points
* Final Report: 15 points

For the viva, you will need to explicitly mention each team member's contribution.

You can write your report in this notebook. The report must include visualizations of your results. Train your model with at least two different sets of hyperparameters and compare their results in the visualization section.
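
Per-episode rewards are noisy, so smoothing them makes two runs easier to compare. The sketch below assumes you have collected the episode rewards of a run in a list (the train method above only shows them in the progress bar). `moving_average` is a hypothetical helper, and the numbers are placeholders, not real training results.

```python
import numpy as np

def moving_average(values, window):
    """Mean over a sliding window; the output is shorter by window - 1."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')

# Placeholder numbers for illustration only, not real training results
rewards_run_a = [-21, -20, -21, -19, -18, -20]
smoothed = moving_average(rewards_run_a, window=3)
```

Plotting the smoothed curve of each hyperparameter set on the same axes then gives a direct comparison.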


### Good Luck!