# RL Final Project

Now it's finally time to put what we have learned so far in this course into practice!

The aim of this project is to assess your practical knowledge of Reinforcement Learning.

Your project consists of 2 parts, and you will get the chance to work with 2 different environments.



## 1. Resource Allocation

The first environment is a simplified simulation of a real-world problem called **Resource Allocation**, more specifically computational resource allocation. Computational resource allocation is the process of assigning and distributing computing resources, such as processing power, memory, and storage, to different tasks and applications. This allocation is typically managed by an operating system or a resource manager, which monitors the system's resource usage and ensures that each task or application receives the resources it needs to operate efficiently. The goal is to maximize the utilization of available resources while minimizing the impact on other tasks and ensuring that critical tasks are prioritized. Effective resource allocation is essential for the performance of a computing system and can have a significant impact on overall productivity and efficiency.




### 1.1 Simulation:
The architecture used in this simulation is the **[Edge Server](https://www.cloudflare.com/learning/cdn/glossary/edge-server/)** architecture. It consists of 3 entities arranged hierarchically.
* The **Root** (first) layer is a ***cloud*** node with high computation resources.
* The second layer is the **Edge server** layer, which has moderate computation power (higher than that of the end users).
* The last layer (access layer) is where the **endpoints** are defined. These nodes have relatively low computational power. Workloads are generated at this level.

**Note**: Only [uplink](https://www.everythingrf.com/community/what-is-uplink) transmission is considered in this simulation.


---


<img src="https://drive.google.com/uc?id=1G7UwG9bBQEYlisPSE21PZNvgx_Nm1fdj"/>

* The goal here is to minimize the computational delay for users.
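
The delay to be minimized can be sketched as one transmission term per link on the path plus a processing term at the node that runs the task. This follows the structure of the optional `latency_calculator` helper in the environment code; the function name `offloading_delay` and all numbers in the example are illustrative assumptions, not part of the environment.

```python
def offloading_delay(workload, link_bandwidths, gains, processing_power):
    """Delay of one workload: transmission over each link plus processing.

    workload         -- size W of the task
    link_bandwidths  -- bandwidth B of every link on the path (empty if local)
    gains            -- network gain applied to each link (same length)
    processing_power -- processing power P of the node that runs the task
    """
    transmission = sum(workload / (b * g) for b, g in zip(link_bandwidths, gains))
    processing = workload / processing_power
    return transmission + processing


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
W = 6000.0
local = offloading_delay(W, [], [], processing_power=100.0)          # 60.0
edge = offloading_delay(W, [200.0], [0.5], processing_power=1000.0)  # 60.0 + 6.0
print(local, edge)
```

With these made-up values, offloading is slower than local processing because the link delay outweighs the faster processor. This is the kind of trade-off the agent has to learn.
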
### 1.2 Action space: n + 2
where n is the number of **edge** servers.

The ***2*** accounts for the two other possible actions:
  * Processing the workload **locally**
  * Offloading the workload to the **cloud** server

Your task is to train an agent using an algorithm of your choice.
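
As a minimal sketch, the n + 2 actions can be indexed as below, assuming the convention stated in the environment's `n_actions` comment (index 0 is local processing, indices 1 to n are edge servers, the last index is the cloud). The helpers `describe_action` and `epsilon_greedy` are hypothetical names used only for illustration, and epsilon-greedy is just one possible exploration strategy.

```python
import random


def describe_action(action, n_edge_servers):
    """Map an action index in {0, ..., n + 1} to a processing location."""
    if not 0 <= action <= n_edge_servers + 1:
        raise ValueError(f"action must be in [0, {n_edge_servers + 1}]")
    if action == 0:
        return "local"
    if action == n_edge_servers + 1:
        return "cloud"
    return f"edge server {action}"


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


n = 3  # number of edge servers, so there are n + 2 = 5 actions
print([describe_action(a, n) for a in range(n + 2)])
# With epsilon = 0 the choice is purely greedy: index 1 has the largest value.
print(epsilon_greedy([-4.0, -1.5, -3.0, -2.0, -9.0], epsilon=0.0))
```
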

In [None]:
import random
import itertools

import numpy as np
import matplotlib.pyplot as plt
import networkx as nx



class Env():
    def __init__(self,endpoints, edgeservers,):
        self.endpoints = endpoints
        self.edgeservers = edgeservers
        self.G = nx.Graph()
        self.edges = []
        self.nodes = np.zeros(1  + edgeservers + endpoints)  #this is the processing power
        self.base_index = self.edgeservers + 1
        self.resources = [np.zeros(1000)]+[np.zeros(r_edgeserver) for _ in range(self.edgeservers)] + [np.zeros(r_endpoint) for _ in range(self.endpoints)]
        self.n_actions = 1 + self.edgeservers + 1    #  action[0] for local processing , action[1 : num_edgeservers] for processing in one of the edgeservers , action[-1] for processing in cloud
        self.workloads = []
        self.costs = np.zeros(self.endpoints)  # this array represents computation delay for enduser u. the bigger the delay, the bigger the cost.



    def configure_network(self):

        # declaring the edges between the core cloud and the edge servers
        for j in range(1, self.edgeservers + 1):
          # self.edges.append((0,j ))
          self.G.add_edge(j,0)
          self.G[j][0]['weight'] = random.uniform(400,500)


        # declaring the edges between edgeservers
        for i in range(1, self.edgeservers):
          self.G.add_edge(i,i+1)
          self.G.add_edge(i+1,i)
          self.G[i][i+1]['weight'] = random.uniform(250,300)

        # declaring the edges between edgeservers and endpoints
        for i,j in zip(range(1 + self.edgeservers ,self.endpoints + 1 + self.edgeservers), np.resize(np.arange(1,self.edgeservers + 1), self.endpoints)):
          self.G.add_edge(i,j)
          self.G[i][j]['weight'] = random.uniform(180,250)

        # print(self.G[0])

        # declaring the processing power
        self.nodes[0] = random.uniform(5000,7000) # Core cloud processing power
        for i in range(1,self.base_index):
          self.nodes[i] = random.uniform(500,2500) # Edge servers' processing power
        for i in range(self.base_index, len(self.nodes)):
          self.nodes[i] = random.uniform(50,300) # Endpoints' processing power



    def generate_task(self):
        # generating workload values (fixed workloads): each entry is (endpoint index, workload size, emergency level)
        self.workloads = [(i, random.uniform(1e3 * 3.0, 1e4 * 3.0), np.random.choice([0,1], p=[0.7, 0.3]))  for i in range(self.endpoints)]
        # return self.workloads

    def clean_resources(self):
        self.resources = [np.zeros(1000)]+[np.zeros(r_edgeserver) for _ in range(self.edgeservers)] + [np.zeros(r_endpoint) for _ in range(self.endpoints)]

    def random_action(self):
        self.actions =[random.randint(0,1 + self.edgeservers) for _ in range(self.endpoints)]
        # return self.actions


     ####################################################################################################################################   AUXILLARY FUNCTIONS
     # The functions in this section
     # are not mandatory to use. These are just
     # auxillary functions that might accelerate
     # your implementation

    def initalize_Qtable(self):       #### it's not mandatory to use this function in your implementation
        self.Qtable = np.zeros((self.endpoints, 1 + self.edgeservers + 1))
        # print(self.Qtable)

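    def latency_calculator(self, node1, node2, W):        #### it's not mandatory to use this function in your implementation
      latency = 0
      # a target equal to the last action index is mapped back to the source node
      if node2 == self.n_actions - 1 :
        node2 = node1
      path = nx.shortest_path(self.G, source = node1,  target = node2)
      # transmission delay: add W / (bandwidth * gain) for every link on the path
      for i in range(len(path) - 1):
        B = self.G[path[i]][path[i + 1]]['weight']
        latency += W / (B * self.network_gain())
      # processing delay at the node that executes the workload
      latency += W / self.nodes[int(node2)]
      return latency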

    def network_gain(self):         #### it's not mandatory to use this function in your implementation
      return np.random.choice([0.5,0.25], p = [0.5,0.5])


    def capacity_check(self, resource):       #### it's not mandatory to use this function in your implementation
      c = False
      for elem in resource:
        if elem == 0:
          c = True
          break
      return c

    def allocate_task(self, resource, W, P):    #### it's not mandatory to use this function in your implementation
      for i, slot in enumerate(resource):
        if slot == 0:
          resource[i] = W / P
          return resource



    def update_resource(self):      #### it's not mandatory to use this function in your implementation
      for r in self.resources:
        r -= 50
        for i in range(len(r)):

          if r[i] < 0:
            r[i] = 0



        #######################################################################################################################################




    def policy_action(self):
      pass   #### Your implementation here


    def update_values(self):
      pass    #### Your implementation here


    def step(self):
      pass    #### Your implementation here


In [None]:
# hyper parameters
K = 25
r_endpoint  = 2 ### processing slots
r_edgeserver = 5 ### processing slots
discount_factor = 0.1
learning_rate = 0.001
epsilon = 0.99
num_endusers = 10
num_edgeservers = 3

#### do not change this number
time_slot_duration = 50 #seconds


In [None]:
def Algorithm(plot = False):

  env = Env(num_endusers, num_edgeservers)
  env.configure_network()


  #### your implementation here

These are the results that I got from implementing the SARSA algorithm, where *avg cost* is the average computational delay over all users.

<img src="https://drive.google.com/uc?id=1UIRkMEcb-9sAwOXDEpnuQV93WeFqJH-x"/>
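
For reference, a single tabular SARSA update can be sketched as follows. This is not the code that produced the plot above. It assumes a Q-table with one row per endpoint and one column per action (the shape used by the optional `initalize_Qtable` helper) and a reward equal to the negative delay, which is one possible choice. The function name and the numbers are hypothetical.

```python
import numpy as np


def sarsa_update(q_table, state, action, reward, next_state, next_action,
                 learning_rate, discount_factor):
    """One on-policy SARSA step: Q(s,a) += lr * (r + gamma * Q(s',a') - Q(s,a))."""
    td_target = reward + discount_factor * q_table[next_state, next_action]
    td_error = td_target - q_table[state, action]
    q_table[state, action] += learning_rate * td_error
    return td_error


# Assumption: 10 endpoints and 3 edge servers, so 5 actions per endpoint.
q = np.zeros((10, 5))
delay = 66.0  # hypothetical computation delay observed for this action
err = sarsa_update(q, state=0, action=2, reward=-delay,
                   next_state=1, next_action=0,
                   learning_rate=0.5, discount_factor=0.1)
print(err, q[0, 2])  # -66.0 -33.0
```

Unlike Q-learning, the bootstrap term uses the action that the policy actually selects in the next state, not the greedy maximum.
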

## 2. Atari Game Pong


<img src="https://drive.google.com/uc?id=1FrWbdg-A30j7FAxT4zQacbeemCTFRPeh"/>

**[Pong](https://www.gymlibrary.dev/environments/atari/pong/)** is a famous Atari game that almost all of us have played at least once!
The goal of this task is to get familiar with the **gym** library and use Deep Reinforcement Learning to train an agent that can actually play this game!

In [111]:
# conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0

!pip install gym ale-py
!pip install "gym[accept-rom-license, atari]"
!pip install opencv-python

!pip install "tensorflow<2.11"
!pip install "tensorflow-gpu==2.10"

!pip install tqdm
!pip install jdc

Collecting jdc
  Downloading jdc-0.0.9-py2.py3-none-any.whl (2.1 kB)
Installing collected packages: jdc
Successfully installed jdc-0.0.9


Imports the necessary libraries and modules for the code, including Gym (for the RL environment), OpenCV (for image processing), NumPy (for numerical operations), TensorFlow (for deep learning), and Keras (for building and training the DQN model).

In [133]:
import gym
import cv2
import numpy as np
import tensorflow as tf
import jdc
import warnings

from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten
from keras.optimizers import Adam
from IPython.utils import io
from tqdm.notebook import tqdm

Suppress unimportant warnings.

In [134]:
warnings.filterwarnings("ignore")

Sets up the TensorFlow session to run on the GPU. It configures the session with the GPU options and sets it as the backend for Keras.

In [135]:
with io.capture_output() as captured:
    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True
    session = tf.compat.v1.Session(config=config)
    tf.compat.v1.keras.backend.set_session(session)

Defines the preprocess_frame method, which takes an observation from the environment and preprocesses the frame. It converts the frame to grayscale, resizes it to (84, 84) pixels, and normalizes the pixel values to the range [0, 1].

In [136]:
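def preprocess_frame(observation):
    # env.reset() returns (frame, info) in gym >= 0.26, while env.step()
    # returns the frame itself as its first value, so accept both forms.
    frame = observation[0] if isinstance(observation, tuple) else observation
    if frame.ndim > 2:
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)  # (210, 160, 3) -> (210, 160)
    frame = cv2.resize(frame, (84, 84))
    frame = frame / 255.0  # scale pixel values to [0, 1]
    frame = np.expand_dims(frame, axis=-1)  # add the channel axis: (84, 84, 1)
    return frame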

Defines the DQNAgent class, which represents the DQN agent. It has methods to build the model, get an action, train the model, and run an episode. The model is built using Keras, consisting of convolutional layers, fully connected layers, and an output layer. The model is compiled with the Adam optimizer and mean squared error (MSE) loss.
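
As a quick sanity check on the architecture, the spatial size after each convolution can be worked out by hand. The sketch below assumes Keras' default `'valid'` padding (no padding is specified in the model) and an 84x84 input. `conv_output_size` is a hypothetical helper, not a Keras function.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of a convolution with 'valid' padding."""
    return (size - kernel) // stride + 1


size = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)
    print(size)  # 20, then 9, then 7

flattened = size * size * 64  # 64 filters in the last convolution
print(flattened)  # 3136 inputs to the 512-unit dense layer
```
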

In [137]:
class DQNAgent:
    def __init__(self, input_shape, action_space, learning_rate):
        self.input_shape = input_shape
        self.action_space = action_space
        self.model = self.build_model(learning_rate)

    def build_model(self, alpha, activation='relu'):
        model = Sequential()
        model.add(Conv2D(32, kernel_size=(8, 8), strides=4, activation=activation, input_shape=self.input_shape))
        model.add(Conv2D(64, kernel_size=(4, 4), strides=2, activation=activation))
        model.add(Conv2D(64, kernel_size=(3, 3), strides=1, activation=activation))
        model.add(Flatten())
        model.add(Dense(512, activation=activation))
        model.add(Dense(self.action_space))

        optimizer = Adam(learning_rate=alpha)
        model.compile(optimizer=optimizer, loss='mse')
        return model

Defines the get_action method of the DQNAgent class. It takes a state and an epsilon value for epsilon-greedy exploration. With probability epsilon, it selects a random action. Otherwise, it uses the model to predict the Q-values for the given state and selects the action with the highest Q-value.

In [138]:
%%add_to DQNAgent
def get_action(self, state, epsilon):
    if np.random.rand() <= epsilon:
        return np.random.randint(self.action_space)
    with io.capture_output() as captured:
        q_values = self.model.predict(state)
    return np.argmax(q_values[0])

Defines the train method of the DQNAgent class. It performs the training of the DQN agent for a specified number of episodes. It uses a progress bar to track the progress of the training. In each episode, it calls the run_episode method to run a single episode and update the model's weights. The epsilon value is decayed over episodes to gradually shift from exploration to exploitation. After training, the model is saved to a file.
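
To see what a given decay rate means in practice, the schedule can be computed ahead of training. The sketch below reproduces the multiplicative decay with a floor that `train` uses; the helper names are hypothetical. With `epsilon_decay=0.99` and 50 episodes, epsilon stays above 0.6, and roughly 230 episodes would be needed to reach the floor of 0.1.

```python
import math


def epsilon_schedule(episodes, epsilon_decay, epsilon_start=1.0, epsilon_end=0.1):
    """Epsilon used in each episode under multiplicative decay with a floor."""
    epsilon = epsilon_start
    values = []
    for _ in range(episodes):
        values.append(epsilon)
        epsilon = max(epsilon_end, epsilon * epsilon_decay)
    return values


def episodes_to_floor(epsilon_decay, epsilon_start=1.0, epsilon_end=0.1):
    """Smallest number of decay steps after which epsilon sits at the floor."""
    return math.ceil(math.log(epsilon_end / epsilon_start) / math.log(epsilon_decay))


schedule = epsilon_schedule(episodes=50, epsilon_decay=0.99)
print(round(schedule[-1], 3))   # epsilon used in the last of 50 episodes
print(episodes_to_floor(0.99))  # decay steps needed to reach 0.1
```

If the agent should act mostly greedily by the end of a short run, a smaller decay rate or more episodes is worth considering.
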

In [139]:
%%add_to DQNAgent
def train(self, model_name, episodes, epsilon_decay, epsilon_start=1.0, epsilon_end=0.1, gamma=0.97,max_episode_length=100):
    epsilon = epsilon_start

    bar_format = 'Training: {percentage:3.0f}% |{bar}| Elapsed: {elapsed}, Remaining: {remaining}{postfix}'
    training_pbar = tqdm(total=episodes, bar_format=bar_format, unit='episode')

    for episode in range(episodes):
        total_reward = self.run_episode(epsilon, gamma, max_episode_length)
        training_pbar.set_postfix_str(f'Reward: {total_reward}')
        training_pbar.update(1)
        epsilon = max(epsilon_end, epsilon * epsilon_decay)

    training_pbar.close()
    print("Training completed.")

    self.model.save(model_name)
    print("Model saved.")

Defines the run_episode method of the DQNAgent class. It runs a single episode of the environment using the current policy and updates the model's weights. It iteratively selects actions, observes the next state and reward, and performs a model update using the Q-learning algorithm. The progress is tracked using a separate progress bar.
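
The Q-learning target for a network with one output per action can be sketched with plain NumPy: only the entry of the action that was taken is replaced, so the other outputs contribute zero error. The function name and the numbers are hypothetical, and this single-transition form is an assumption that leaves out extras such as a replay buffer or a target network.

```python
import numpy as np


def q_learning_target(q_values, next_q_values, action, reward, gamma, terminal):
    """Return a copy of q_values with only the taken action's entry replaced.

    The replaced entry is r + gamma * max_a' Q(s', a'), or just r at a
    terminal state where there is nothing left to bootstrap from.
    """
    target = np.array(q_values, dtype=float, copy=True)
    bootstrap = 0.0 if terminal else gamma * np.max(next_q_values)
    target[action] = reward + bootstrap
    return target


q_s = np.array([0.2, -0.1, 0.4])    # hypothetical Q(s, .)
q_next = np.array([0.0, 1.0, 0.5])  # hypothetical Q(s', .)
print(q_learning_target(q_s, q_next, action=1, reward=-1.0, gamma=0.5, terminal=False))
# entry 1 becomes -1.0 + 0.5 * 1.0 = -0.5, the other entries are unchanged
```
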

In [140]:
%%add_to DQNAgent
def run_episode(self, epsilon, gamma, max_episode_length):
        observation = env.reset()
        state = preprocess_frame(observation)
        state = np.reshape(state, (1, *self.input_shape))
        done = False
        total_reward = 0
        episode_length = 0

        bar_format = 'Episode: {percentage:3.0f}% |{bar}| Speed: {rate_fmt}{postfix}'
        episode_pbar = tqdm(total=max_episode_length, bar_format=bar_format, unit='step')

        while not done:
            action = self.get_action(state, epsilon)
            # gym >= 0.26 returns (obs, reward, terminated, truncated, info)
            next_observation, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_state = preprocess_frame(next_observation)
            next_state = np.reshape(next_state, (1, *self.input_shape))
            with io.capture_output() as captured:
                # Q-learning target: only the Q-value of the action taken is updated
                target = self.model.predict(state)
                next_q = 0.0 if terminated else np.max(self.model.predict(next_state))
                target[0, action] = reward + gamma * next_q
                self.model.fit(state, target, verbose=0)
            state = next_state
            total_reward += reward
            episode_length += 1
            episode_pbar.set_postfix_str(f'Reward: {total_reward}')
            episode_pbar.update(1)

            if episode_length >= max_episode_length:
                done = True

        episode_pbar.close()
        return total_reward

Sets up the Pong environment using Gym. It creates an instance of the environment and resets it to obtain the initial observation.

In [141]:
# env = gym.make("ALE/Pong-v5", render_mode='human')
env = gym.make("ALE/Pong-v5")
observation = env.reset()

Creates an instance of the DQNAgent class. It specifies the input shape, action space size, and learning rate for the agent.

In [142]:
agent = DQNAgent(
    input_shape=(84, 84, 1),
    action_space=env.action_space.n,
    learning_rate=0.00025
)

Trains the agent by calling the train method. It specifies the model name for saving, the number of episodes, epsilon decay rate, and maximum episode length. After training, it closes the environment.

In [None]:
agent.train(
    model_name='trained_model.h5',
    episodes=50,
    epsilon_decay=0.99,
    max_episode_length=500
)

env.close()


**Note**: Keep in mind that the observations for this environment are frames from the game. Each observation is an image of size (210, 160, 3), so you will need to implement an agent that can process images (a CNN-based agent).

Make sure to preprocess the frames. For example, you can convert the RGB image to grayscale. You can use the [OpenCV](https://docs.opencv.org/4.x/d6/d00/tutorial_py_root.html) library to perform resizing, blurring, or any other applicable filtering on the frames.
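
As a dependency-free illustration of the idea, the sketch below converts a frame to grayscale and downsamples it using NumPy only. It is a stand-in for the OpenCV calls, not a replacement for them: keeping every second pixel is a crude form of resizing, and the function name and test frame are made up for the example.

```python
import numpy as np


def to_gray_and_downsample(frame, factor=2):
    """Convert an RGB uint8 frame to grayscale in [0, 1], keeping every factor-th pixel."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # luma weights for R, G, B
    gray = frame.astype(np.float32) @ weights
    return gray[::factor, ::factor] / 255.0


frame = np.zeros((210, 160, 3), dtype=np.uint8)  # blank stand-in for a Pong frame
frame[100:110, 80:84] = 255                      # a white patch, like the ball
small = to_gray_and_downsample(frame)
print(small.shape)  # (105, 80)
```
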

## Grading criteria
Project: 35 points

* Final Viva: 10 points
* Implementation: 10 points
* Final Report: 15 points

For the viva, you will need to explicitly mention each team member's contribution.

You can write your report in this notebook. The report must include visualizations of your results. Train your model with at least 2 different sets of hyperparameters and compare their outputs in the visualization section.
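
One possible way to build the comparison plot is sketched below. It assumes you have recorded one value per episode (for example average cost or total reward) for each hyperparameter set. `moving_average` and `plot_comparison` are hypothetical helper names, and the smoothing window is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt


def moving_average(values, window):
    """Smooth a 1-D sequence with a sliding mean of the given window."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


def plot_comparison(runs, window=10):
    """Plot smoothed per-episode metrics; runs maps a label to a sequence."""
    fig, ax = plt.subplots()
    for label, values in runs.items():
        ax.plot(moving_average(values, window), label=label)
    ax.set_xlabel("episode")
    ax.set_ylabel(f"metric (moving average, window={window})")
    ax.legend()
    return fig
```

A call such as `plot_comparison({"lr=0.001": costs_a, "lr=0.01": costs_b})` would then draw both curves on the same axes, where `costs_a` and `costs_b` are the lists you recorded during your own training runs.
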


### Good Luck!