# <ins>Deep Q Learning</ins>:
- This notebook serves as a step-by-step illustration of the **Deep Q Learning** code used to train agents in a given environment. The code below extends the **Deep Q Learning** method by adding a second, target network; we shall call this method **Double Deep Q Learning**, or **Double DQL** for short. (Note that the Double DQN algorithm from the literature additionally selects the next action with the main network; the code here takes the maximum over the target network's outputs.)

- This algorithm uses classes and the Bellman Equation, both of which are introduced below.

**Let us start by importing the necessary libraries for the following code**

In [4]:
import torch as T
import torch.nn as nn
import numpy as np
import torch.nn.functional as F
import torchvision.transforms as transforms
import torch.optim as optimum
# FrameStack requires gymnasium < 1.0 (it is named FrameStackObservation in 1.0 and later)
from gymnasium.wrappers import AtariPreprocessing, FrameStack, RecordVideo
import gymnasium as gym
import matplotlib.pyplot as plt
import pickle as pk
import sys

We imported many libraries but, in short, we mainly rely on **PyTorch** and **gymnasium** and their submodules. **gymnasium.wrappers** is a gymnasium submodule that provides the preprocessing utilities used to resize, grayscale and frame-stack the observations, and occasionally to record how the agent behaves in our environment.

Now we can start constructing our classes, which will contain many useful functions following the DQL algorithm.

## <ins>Neural Network Class</ins>

In [3]:
class DeepQNetwork(nn.Module):
    def __init__(self, input_dims,n_actions):
        super(DeepQNetwork,self).__init__()

        self.input_dims = input_dims
        self.conv1 = nn.Conv2d(4,16,8,stride=4)
        self.conv2 = nn.Conv2d(16,32,4,stride=2)     
        self.conv3 = nn.Conv2d(32,32,3,stride=1)
        self.fc1 = nn.Linear(32*7*7, 256)
        self.fc2 = nn.Linear(256, n_actions)
        self.relu = nn.ReLU()


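    def forward(self, x):
        # Accept NumPy arrays as well as tensors
        x = T.as_tensor(x)
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        # Flatten the (32, 7, 7) feature maps into one vector per sample
        x = self.relu(self.conv3(x)).flatten(start_dim=1)
        x = self.relu(self.fc1(x))
        # No activation on the output layer: Q values may be negative
        return self.fc2(x)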


**<ins>NOTE</ins>**: The kernel sizes and strides used here are the ones found in the 2015 DQL paper. We chose these values because, according to the paper, they give promising results with minimal processing time.

**<ins>How does the depth of a neural network affect training?</ins>**
A deeper network may approximate the Q values at the output of the network much more accurately. However, the computation time increases significantly. In the beginning our layers used **64** channels, but we reduced this to **32** for the sake of computation time and because the environments aren't that complex.
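To see where the `32*7*7` input size of `fc1` comes from, the spatial size after each convolution can be computed with the usual no-padding formula `floor((size - kernel) / stride) + 1`. The sketch below is plain Python; the helper name `conv_out` is ours, not part of PyTorch, and it assumes the 84x84 frames used later in this notebook.

```python
def conv_out(size, kernel, stride):
    """Output width/height of a convolution without padding."""
    return (size - kernel) // stride + 1

size = 84                           # input frames are 84x84
layers = [(8, 4), (4, 2), (3, 1)]   # (kernel, stride) of conv1, conv2, conv3
for kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    print(size)                     # 20, then 9, then 7

flat_features = 32 * size * size    # conv3 outputs 32 channels
print(flat_features)                # 1568, i.e. 32*7*7
```

Changing the input resolution or any kernel/stride therefore requires changing the first argument of `fc1` accordingly.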

## <ins>Memory/Experience Replay</ins>:
Now let us define our Memory class, which implements experience replay, a key ingredient of **Deep Q Learning**. The agent learns from previous iterations: they are kept in a buffer from which we randomly sample memories to compute the gradient and finally make the Q values converge.
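The buffer works as a ring: once it is full, the oldest memories are overwritten. Here is a minimal sketch of that indexing with a tiny capacity, using only the standard library (the names `capacity`, `buffer` and `counter` are illustrative and are not the ones used in the class):

```python
import random

capacity = 5
buffer = [None] * capacity
counter = 0

# Store 8 transitions in a buffer that only holds 5
for transition in range(8):
    buffer[counter % capacity] = transition   # wrap around when full
    counter += 1

print(buffer)   # [5, 6, 7, 3, 4]: slots 0-2 were overwritten

# Sample a batch of distinct memories among the filled slots
filled = min(counter, capacity)
batch = random.sample(range(filled), 3)
print([buffer[i] for i in batch])
```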

In [4]:
class Memory():
    def __init__(self,input_dims,max_mem_size = 20000):
        self.mem_size = max_mem_size
        self.mem_cntr = 0
        
        # initialise the memory
        
        self.state_memory = np.zeros((self.mem_size, *input_dims), dtype = np.float32)
        self.new_state_mem = np.zeros((self.mem_size,*input_dims),dtype = np.float32)
        self.action_memory = np.zeros(self.mem_size,dtype= np.int32)
        self.reward_memory = np.zeros(self.mem_size, dtype= np.float32)
        self.terminal_memory = np.zeros(self.mem_size,dtype = bool)

    def store_transition(self, state, action, reward, state_neu, done):
        #storing the memories
        
        index = self.mem_cntr % self.mem_size
        self.state_memory[index] = state
        self.new_state_mem[index] = state_neu
        self.reward_memory [index] = reward
        self.action_memory[index] = action
        self.terminal_memory[index] = done
        self.mem_cntr +=1


## <ins>Agent</ins>:
Let us now construct the agent, the **fundamental** and **core** part of the **DQL algorithm**. We instantiate the fundamental parts needed to train a neural network: **the optimizer (Adam in our case)**, **the loss function (Huber here)**, **the replay memory**, and the discount factor **gamma** that will be used later.

We now enter the **learning section**, where we use **the Bellman Equation** to make the **Q values** converge toward values that are optimal for the environment.
## <ins>Epsilon Greedy Approach</ins>:
We follow the $\epsilon$-greedy algorithm used in DQL. This approach makes the agent explore as many possibilities as it can in a given environment, which helps avoid situations where the agent settles on a locally optimal behaviour and, well, stays there. We had this happen: the agent found itself collecting the most reward it knew how to get by just holding a certain position in Breakout and Pong.
Furthermore, $\epsilon$ decays as time goes by, meaning that the agent becomes more and more independent and has to pick the best action in its environment according to its Q values.
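To get a feel for how slow this decay is, one can compute how many steps it takes for $\epsilon$ to reach its floor under geometric decay, $\epsilon_t = \epsilon_0 \cdot d^t$. The helper below is a sketch of ours (`steps_to_floor` is not part of the agent); the values 1.0, 0.1, 0.9999 and 0.99999 are the start value, floor and decay rates that appear in the code of this notebook.

```python
import math

def steps_to_floor(eps_start, eps_min, eps_dec):
    """Number of decay steps until eps_start * eps_dec**t <= eps_min."""
    return math.ceil(math.log(eps_min / eps_start) / math.log(eps_dec))

print(steps_to_floor(1.0, 0.1, 0.9999))    # roughly 23 thousand steps
print(steps_to_floor(1.0, 0.1, 0.99999))   # roughly 230 thousand steps
```

Adding one more 9 to the decay rate multiplies the length of the exploration phase by about ten, which is worth keeping in mind when choosing `eps_dec`.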
 
## <ins>Learning</ins>

We start off by filling up the buffer so that we have at least a minimum number of randomly collected states, actions, rewards, terminal flags and new states.
The code below shows that while `mem_cntr < min_mem` (4000 by default), the agent **will not learn**.
Once the buffer is filled, we draw random indices and the corresponding **(states, new states, rewards, terminal flags, actions)**, which are used to compute the gradient so that **Q** converges.

We then use a second Q matrix, the target network (`target_res` in the code). Its outputs are used to build the target values, which the code calls **q_target**.

## <ins>Q Target</ins>
### But why add another Q matrix? Doesn't it complicate the code? Why not evaluate the new state and the current state with a single Q matrix?

Well, there are many reasons, but essentially:\
the problem with Q-Learning is that the same samples are used both to decide which action is the best (highest expected reward) and to estimate that action's value, as shown in the Bellman Equation:

$$Q(s, a) = R(s, a) + \gamma \cdot \max_{a'} Q(s', a')$$

If an action's value is overestimated, it will be chosen as the best action, and its overestimated value is then used as the target.

The target Q value helps to avoid overestimating an action in a given state. The weights of the Q_target matrix are periodically updated to match the weights of the Q matrix so that the two do not diverge.
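The overestimation effect is easy to reproduce in isolation. In the toy experiment below (our own illustration, not part of the training code), every action has a true value of 0 and the estimates are just Gaussian noise. Taking the maximum over the noisy estimates gives a clearly positive value on average, while using a second, independent set of estimates to evaluate the chosen action does not.

```python
import random

random.seed(0)
n_actions, n_trials = 4, 10_000

max_same, max_independent = 0.0, 0.0
for _ in range(n_trials):
    # Two independent noisy estimates of the same true values (all 0)
    q_a = [random.gauss(0, 1) for _ in range(n_actions)]
    q_b = [random.gauss(0, 1) for _ in range(n_actions)]
    best = max(range(n_actions), key=lambda a: q_a[a])
    max_same += q_a[best]          # select and evaluate with the same estimates
    max_independent += q_b[best]   # select with q_a, evaluate with q_b

print(max_same / n_trials)         # clearly above 0: biased upward
print(max_independent / n_trials)  # close to 0
```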

**NOTE**: We follow the 2015 paper and compute the following equation for the Q matrix. The discounted value of the next state is only added if the state is not terminal; this helps the agent converge.

$$Q(s, a) = R(s, a) + \gamma \cdot (1 - \text{terminated}) \cdot \max_{a'} Q(s', a')$$
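
As a worked example of this equation, the small function below computes the target for a single transition. It is a plain-Python sketch (the name `td_target` and the numbers are made up for illustration); the agent's `learn` method applies the same formula to batches of tensors.

```python
def td_target(reward, next_q_values, terminated, gamma=0.99):
    """R(s, a) + gamma * (1 - terminated) * max_a' Q(s', a')."""
    return reward + gamma * (1 - int(terminated)) * max(next_q_values)

# Non-terminal transition: the best next Q value (2.0) is discounted and added
print(td_target(1.0, [0.5, 2.0, -1.0], terminated=False))   # 1.0 + 0.99 * 2.0
# Terminal transition: only the immediate reward remains
print(td_target(-1.0, [0.5, 2.0, -1.0], terminated=True))   # -1.0
```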

In [5]:
class Agents():
    #initialising agent
    def __init__(self, gamma, lr, batch_size, n_actions, input_dims, epsilon, eps_end=0.1, eps_dec = 0.9999, max_mem_size = 20000, min_mem = 4000):
        
        self.device = T.device('cuda' if T.cuda.is_available() else 'cpu')
        self.gamma= gamma
        self.epsilon = epsilon
        self.eps_min = eps_end
        self.eps_dec = eps_dec
        self.lr = lr
        self.min_mem = min_mem
        self.action_space  = [i for i in range(n_actions)]
        self.batch_size = batch_size

        self.memory = Memory(input_dims,max_mem_size)

        self.Q_eval = DeepQNetwork(input_dims, n_actions).to(self.device)
        self.optimizer = optimum.Adam(self.Q_eval.parameters(),lr=self.lr,eps=1e-7)
        self.loss_fn = nn.HuberLoss()

    # choose a random action with probability epsilon, otherwise the greedy action
    def choose_action(self, observation):
        if np.random.random() > self.epsilon:
            state = np.array([observation],dtype=np.float32)
            state = T.tensor(state).to(self.device)
            actions = self.Q_eval(state)
            return T.argmax(actions).item()
        else:
            return np.random.choice(self.action_space)
    
    def learn(self, target_res, terminated):
        # `terminated` is kept for compatibility with existing calls;
        # the terminal flags stored in the replay memory are used instead.
        if self.memory.mem_cntr < self.min_mem:
            return

        # Randomly select a batch of stored transitions
        max_mem = min(self.memory.mem_cntr, self.memory.mem_size)
        batch = np.random.choice(max_mem, self.batch_size, replace=False)
        batch_index = T.arange(self.batch_size, device=self.device)

        state_batch = T.tensor(self.memory.state_memory[batch]).to(self.device)
        new_state_batch = T.tensor(self.memory.new_state_mem[batch]).to(self.device)
        reward_batch = T.tensor(self.memory.reward_memory[batch]).to(self.device)
        terminal_batch = T.tensor(self.memory.terminal_memory[batch]).to(self.device)
        action_batch = T.tensor(self.memory.action_memory[batch], dtype=T.long).to(self.device)

        # Q values of the actions that were actually taken
        q_eval = self.Q_eval(state_batch)[batch_index, action_batch]

        # Compute the target value without tracking gradients
        with T.no_grad():
            q_next = target_res(new_state_batch).max(dim=1)[0]
            # Terminal transitions do not bootstrap from the next state
            q_next[terminal_batch] = 0.0
            q_target = reward_batch + self.gamma * q_next

        # Backpropagate the loss
        self.optimizer.zero_grad()
        loss = self.loss_fn(q_eval, q_target)
        loss.backward()
        self.optimizer.step()

### <ins>How does the learning rate affect the network?</ins>
The learning rate used in the Adam optimizer plays a major role in Q Learning. Large values may make the network extremely unstable, while small learning rates may lead to no learning at all!
We had this problem: with large values the network was unstable, so we kept decreasing the learning rate (down to 0.00025), where we had better results.
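
The same trade-off shows up even in the simplest setting. The toy example below (plain gradient descent on $f(x) = x^2$, not Adam and not the network above) is only meant to illustrate the three regimes: too large a step diverges, a reasonable step converges, and a tiny step barely moves.

```python
def gradient_descent(lr, steps=50, x=1.0):
    """Minimise f(x) = x**2 with plain gradient descent, starting from x."""
    for _ in range(steps):
        x = x - lr * 2 * x     # the gradient of x**2 is 2*x
    return x

print(abs(gradient_descent(lr=1.1)))    # grows at every step: unstable
print(abs(gradient_descent(lr=0.1)))    # shrinks toward 0: converges
print(abs(gradient_descent(lr=1e-6)))   # still almost 1: no visible learning
```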

## <ins>Training</ins>:
Let us now start the agent's training. We import the **Pong environment** and *wrap* it with some preprocessing steps. **The Atari preprocessing** basically **rescales the image to 84x84, grayscales it, and can skip frames (only every fourth frame is seen by the agent, so each action covers more game time)**.\
Finally we stack four frames so that the network can see how the game evolves over time; we end up with a **(4, 84, 84) tensor**.

**NOTE:** You may have noticed that we have *frame_skip = 1*. This is because frame skipping is already done by the ALE library, and we would not want to skip frames twice!\
Removing *frame_skip = 1* will result in an error!
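
Frame stacking itself is a simple idea: keep the last four frames in a fixed-length queue and hand all of them to the network as one observation. The class below is a simplified stand-in written for illustration; it is not the gymnasium `FrameStack` implementation, and strings are used in place of 84x84 images.

```python
from collections import deque

class TinyFrameStack:
    """Keeps the last `num_stack` frames (hypothetical helper)."""

    def __init__(self, num_stack=4):
        self.frames = deque(maxlen=num_stack)

    def reset(self, first_frame):
        # Fill the stack with copies of the first frame
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # The oldest frame is dropped automatically
        self.frames.append(new_frame)
        return list(self.frames)

stack = TinyFrameStack()
print(stack.reset("f0"))   # ['f0', 'f0', 'f0', 'f0']
print(stack.step("f1"))    # ['f0', 'f0', 'f0', 'f1']
print(stack.step("f2"))    # ['f0', 'f0', 'f1', 'f2']
```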

In [5]:
env = gym.make("ALE/Pong-v5", render_mode=None) 
# frame_skip=1 because the ALE environment already applies a frame skip of 4
env = AtariPreprocessing(env, frame_skip=1)
# Stack the last four frames so the network can perceive motion
env = FrameStack(env, 4)
n_episodes = 10000

We then create the two Q matrices (the one inside the agent and the target one) and initialise an agent with the following values.

In [7]:

agent = Agents(gamma= 0.99, lr = 0.00025, batch_size=64, n_actions=env.action_space.n, input_dims=(4,84,84), epsilon=1, eps_dec=0.99999)
target_res = DeepQNetwork(env.observation_space.shape, env.action_space.n).to(T.device('cuda' if T.cuda.is_available() else 'cpu'))

We check whether we are running on a GPU rather than a CPU. These computations take a **LONG** time, so make sure to have one.

In [None]:
if T.cuda.is_available():
    print("Running on GPU :", T.cuda.get_device_name(0))
else:
    print("Running on CPU")

Following the 2015 paper, we clip the rewards between -1 and 1 since, according to the authors, it helps the agent converge and avoids divergence. Furthermore, the reward scales of some environments vary widely, and clipping keeps them manageable.

We can now effectively start training. This takes a **LONG LONG** time, so sit back and watch the agent accumulate rewards over time! We update the Q Target network every 10 episodes.\
We also save the network's weights every 30 episodes, or when training is interrupted, so that training can be continued later on.\
We finally plot the rewards after finishing the training or when interrupting.

In [None]:

rewards_per_episode = []
try:
    for i in range(n_episodes+1):
        rewards = 0
        time = 0
        terminated = False
        tronc = False
        
        ## We initialise the environment using env.reset()
        
        observation = env.reset()[0]
        
        ## We run a while loop where, for the current observation, we choose an action and then call:
        ## observation_neu, reward, terminated, tronc, info = env.step(action), which gives us the values described above
        ## we store these values in our memory
        ## we execute the learning function
        ## we finally replace the observation with the new observation
        
        # stop when the episode terminates or is truncated (e.g. by a time limit)
        while not (terminated or tronc):
            action = agent.choose_action(observation)
            
            # new observation state
            observation_neu, reward, terminated, tronc, info = env.step(action)
            
            # clipping the rewards
            reward = np.clip(reward, a_min=-1, a_max=1)
            
            # storing the results and learning
            agent.memory.store_transition(observation, action, reward, observation_neu, terminated)
            agent.learn(target_res, terminated)
            
            # updating the observation
            observation = observation_neu
            rewards += reward

            ## We decay epsilon geometrically, down to eps_min
            agent.epsilon = max(agent.epsilon*agent.eps_dec, agent.eps_min)

        rewards_per_episode.append(rewards)
        mean_rewards = np.mean(rewards_per_episode[-100:])

        # Update target weights
        
        if i % 10 == 0:
            target_res.load_state_dict(agent.Q_eval.state_dict())
        
        if i%30==0:
            text = "save_w_checkpoint_"+str(0)+".pth"
            T.save(agent.Q_eval.state_dict(),"./"+text)
            print(f'Episode: {i}, Rewards: {rewards},  Epsilon: {agent.epsilon:0.2f}, Mean Rewards: {mean_rewards:0.1f}')
    
    #Plotting the rewards and saving the results for future iterations
    env.close()
    T.save(agent.Q_eval.state_dict(), "save_w_break.pth")
    plt.plot(rewards_per_episode)
    plt.xlabel("episodes")
    plt.ylabel("rewards")
    plt.savefig("./rewardplot.png")

except KeyboardInterrupt:
    #Plotting the rewards and saving the results for future iterations
    env.close()
    T.save(agent.Q_eval.state_dict(), "save_w_interrupt.pth")
    plt.plot(rewards_per_episode)
    plt.xlabel("episodes")
    plt.ylabel("rewards")
    plt.savefig("./rewardplot.png")
    sys.exit()


**Congratulations**, you have finished training. Now let us see the agent's results. If everything goes according to plan, you will get a figure roughly like this:
![reward plot](./rewardplot.png)
## <ins>Rendering</ins>:

Simply change the default **render mode** from **None** to **human** to see your agent playing the game

In [None]:
env = gym.make("ALE/Pong-v5", render_mode="human") 
env = AtariPreprocessing(env,frame_skip=1) 
env = FrameStack(env, 4) 
recorder = False

if recorder:
    env = RecordVideo(env, video_folder="./videos", name_prefix="Breakout_best",
                      episode_trigger=lambda x: x %100 == 0)
n_episodes = 100

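agent = Agents(gamma= 0.99, lr = 0.00025, batch_size=32, n_actions=env.action_space.n, input_dims=env.observation_space.shape, epsilon=0.001)
# Load the weights saved at the end of training; without this the agent plays with random weights
# (use "save_w_interrupt.pth" instead if training was interrupted)
agent.Q_eval.load_state_dict(T.load("save_w_break.pth", map_location=agent.device))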

rewards_per_episode = []

for i in range(n_episodes):
    rewards = 0
    terminated = False
    observation = env.reset()[0]

    ## We run a while loop where, for the current observation, we choose an action and then call:
    ## new_observation, reward, terminated, truncated, info = env.step(action)
    ## nothing is stored in memory and no learning happens here, the agent only plays
    ## we finally replace the observation with the new observation
    
    truncated = False
    while not (terminated or truncated):
        action = agent.choose_action(observation)
        new_observation, reward, terminated, truncated, info = env.step(action)
        rewards += reward
        observation = new_observation
        

    rewards_per_episode.append(rewards)
    mean_rewards = np.mean(rewards_per_episode[-100:])
    
    print(f'Episode: {i}, Rewards: {rewards},  Epsilon: {agent.epsilon:0.2f}, Mean Rewards: {mean_rewards:0.1f}')

env.close()


In [5]:
from IPython.display import Video
Video('./eval-episode-0.mp4', width=450, height=450)