### CDS NYU
### DS-GA 3001 | Reinforcement Learning
### Lab 05
### February 27, 2025


# Deep Q-learning algorithm (from scratch...)

<br>

---

## Section Leaders


Akshitha Kumbam – ak11071@nyu.edu

Kushagra Khatwani – kk5395@nyu.edu


## Goal of Today's Lab 

In this lab, we will implement a deep reinforcement learning method called Deep Q-Network (DQN). We will again use a simple environment from Gymnasium (the maintained fork of OpenAI Gym), and you will see how much we gain by switching from tabular Q-learning to deep Q-learning.

## Resources

* https://gymnasium.farama.org/


## Please activate a cloned version of your Anaconda environment for this lab, as some packages may require a different version of gym.

### Do not forget to change kernel to the cloned env.

# 1. Solve *Cart Pole*  with Deep Q-Network (DQN)

The `CartPole` environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in “*Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems*”. A pendulum ("pole") is attached by an un-actuated joint to a cart that can move along a frictionless track.

The pole starts upright on the cart, and the goal is to balance it (i.e., keep it upright within -12 to +12 degrees) by pushing the cart to the left or to the right, using two possible actions: <br>
`0`: Push cart to the left <br>
`1`: Push cart to the right

**Challenges**: The episode terminates if the cart x-position leaves the range [-2.4, 2.4] or the pole angle leaves the range [-12°, 12°]. What makes this problem non-trivial is that the change in velocity produced by the applied force is not fixed: it depends on the angle at which the pole is pointing, because the pole's center of gravity changes the amount of energy needed to move the cart underneath it.


Details can be found in the Gym/Gymnasium doc: https://gymnasium.farama.org/environments/classic_control/cart_pole/
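
The termination conditions described above can be written down as a small check. This is a minimal sketch, not the Gymnasium implementation: the helper name `is_terminal` and its constants are hypothetical and simply mirror the thresholds quoted in this section (±2.4 for the cart position and ±12° for the pole angle, converted to radians because the observation reports the angle in radians).

```python
import math

X_LIMIT = 2.4                      # cart position threshold quoted above
THETA_LIMIT = math.radians(12.0)   # 12 degrees expressed in radians (about 0.2095)


def is_terminal(cart_position, pole_angle):
    """Return True if either threshold is violated (hypothetical helper)."""
    return abs(cart_position) > X_LIMIT or abs(pole_angle) > THETA_LIMIT


print(is_terminal(0.0, 0.05))   # centered cart, nearly upright pole
print(is_terminal(2.5, 0.0))    # cart has left the track
print(is_terminal(0.0, 0.25))   # pole has tilted too far
```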


## Imports

We will use TensorFlow (through its Keras API) to build the neural network.

In [None]:
from collections import deque
import random

import numpy as np
import gymnasium as gym  
from tensorflow.keras.models import Sequential  # To compose multiple Layers
from tensorflow.keras.layers import Dense       # Fully-Connected layer
from tensorflow.keras.layers import Activation  # Activation functions
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import clone_model

## Execute random actions just to get familiar with the environment

In [None]:
# Load the CartPole Gym environment with graphical rendering to visualize the environment
env = gym.make("CartPole-v1", render_mode="human")   

# Set to initial state
env.reset()  

# Loop over 200 steps
for _ in range(200):
    env.render()                                                 # Render on the screen
    action = env.action_space.sample()                           # Choose a random action
    new_state, reward, done, truncated, info = env.step(action)  # Carry out the action
    if done or truncated:
        env.reset()
            
env.close()


## Implement an Artificial Neural Network
To build the Q-network (referred to as *model* in the code), we first need to find out how many actions and observations our environment has.
We can get this information either from the source code (https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py) or via the following commands:

In [None]:
env = gym.make("CartPole-v1") 

num_actions = env.action_space.n
num_observations = env.observation_space.shape[0]  
print(f"There are {num_actions} possible actions and {num_observations} observations")


So our network needs to have an input dimension of 4 and an output dimension of 2.
In between, we are free to choose.

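Let's use three fully connected (Dense) layers on top of the 4-dimensional input:

1. The first hidden layer has 16 neurons
2. The second hidden layer has 32 neurons
3. The third layer (output layer) has 2 neurons

This yields 690 trainable parameters (weights plus biases for each layer):
$$ \underbrace{(4 \times 16) + 16}_{\text{layer 1}} + \underbrace{(16 \times 32) + 32}_{\text{layer 2}} + \underbrace{(32 \times 2) + 2}_{\text{output layer}} = 80 + 544 + 66 = 690 $$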
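
The parameter count can be double-checked without building the network. The following is a small sketch; `dense_parameter_count` is a hypothetical helper (not part of Keras) that adds up weights and biases for a stack of fully connected layers given their sizes.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases of consecutive fully connected layers.

    layer_sizes lists the input dimension followed by the number of
    neurons in each layer (hypothetical helper for illustration).
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix
        total += n_out         # one bias per neuron
    return total


# 4 observations -> 16 neurons -> 32 neurons -> 2 actions
print(dense_parameter_count([4, 16, 32, 2]))  # 690
```

The same number should appear as the total in the output of `model.summary()`.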

In [None]:
model = Sequential()

model.add(Dense(16, input_shape=(1, num_observations)))
model.add(Activation('relu'))

model.add(Dense(32))
model.add(Activation('relu'))


model.add(Dense(num_actions))
model.add(Activation('linear'))

print(model.summary())

Now we have our model, which takes an observation as input and outputs one value (the estimated Q-value) for each action.
The higher the value, the better the corresponding action is expected to be for the current observation.

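As stated in the lecture, deep Q-learning works better when using a target network. So let's copy the above network to define a separate target network. Note that `clone_model` only copies the architecture and creates freshly initialized weights, so we also copy the weights explicitly.

In [None]:
#model.load_weights("34.ckt")
target_model = clone_model(model)
target_model.set_weights(model.get_weights())  # Start the target network with the same weights as the model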


## Set up DQN hyperparameters

In [None]:
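EPOCHS = 300            # Number of training episodes
epsilon = 1.0           # Initial exploration rate
EPSILON_REDUCE = 0.995  # Multiplicative decay of epsilon after each episode
LEARNING_RATE = 0.001   # Learning rate of the Adam optimizer
GAMMA = 0.95            # Discount factor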
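
To get a feeling for these values, it can help to look at how epsilon evolves over training. The sketch below only assumes that epsilon is multiplied by `EPSILON_REDUCE` once per episode, starting from 1.0; the list name `schedule` is introduced here for illustration.

```python
EPOCHS = 300
EPSILON_REDUCE = 0.995

epsilon = 1.0
schedule = []
for epoch in range(EPOCHS):
    schedule.append(epsilon)    # value used during this episode
    epsilon *= EPSILON_REDUCE   # decay applied at the end of the episode

print(f"epsilon in episode 0:   {schedule[0]:.3f}")
print(f"epsilon in episode 150: {schedule[150]:.3f}")
print(f"epsilon after training: {epsilon:.3f}")
```

With these settings epsilon is still about 0.22 after 300 episodes, so roughly one action in five remains random at the end of training.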


Let us use the epsilon greedy action selection function once again:

In [None]:
def epsilon_greedy_action_selection(model, epsilon, observation):
    if np.random.random() > epsilon:
        # Greedy action: the model expects inputs of shape (batch, 1, 4)
        observation = observation.reshape([1, 1, 4])
        prediction = model.predict(observation, verbose=0)  # Predict the Q-values for the observation
        action = np.argmax(prediction)                      # Choose the action with the highest Q-value
    else:
        # Random action, selected with probability epsilon
        action = np.random.randint(0, env.action_space.n)
    return action
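
The selection rule itself does not depend on Keras. As a minimal sketch, the function below applies the same rule to a plain list of Q-values; the name `epsilon_greedy` and the `rng` parameter are assumptions made for this illustration and do not appear in the lab code.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Greedy action with probability 1 - epsilon, otherwise a uniformly random one."""
    if rng.random() > epsilon:
        # Index of the largest Q-value
        return max(range(len(q_values)), key=lambda a: q_values[a])
    return rng.randrange(len(q_values))


rng = random.Random(0)   # seeded generator so the run is repeatable
q_values = [0.2, 1.3]    # made-up Q-values for "left" and "right"
picks_of_best = sum(epsilon_greedy(q_values, 0.1, rng) == 1 for _ in range(1000))
print(f"Best action chosen in {picks_of_best} of 1000 draws")
```

With `epsilon = 0.1` and two actions, the best action is expected in about 95% of the draws: 90% from greedy choices plus half of the 10% random ones.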

As shown in the lecture, we need a replay buffer.
We can use the Python `deque` data structure for this. The *maxlen* argument specifies the number of elements the buffer can store; once it is full, each newly appended element discards the oldest one at the other end of the queue.

The following cell shows an example using `deque`. In the first example, all values fit into the deque, so nothing is discarded.

In the second example, the deque is printed in each iteration. It can hold all values in the first five iterations, but then has to drop the oldest value to make room for each new one.

In [None]:
# [OPTIONAL CODE | EXAMPLE OF DEQUE]
# In this example all values fit into the deque, no overwriting
deque_1 = deque(maxlen=5)
for i in range(5):    
    deque_1.append(i)
print(deque_1)
print("---------------------")

# In this example, after the first 5 values are stored, it needs to overwrite the oldest value to store the new one
deque_2 = deque(maxlen=5)
for i in range(10):  
    deque_2.append(i)
    print(deque_2)

Let's allow our replay buffer a maximum size of 20000 transitions and update the target network every 10 episodes.

In [None]:
replay_buffer = deque(maxlen=20000)
update_target_model = 10

As mentioned in the lecture, experience replay is very useful to stabilize training in deep Q-learning. <br />
The following cell implements one version of the experience replay algorithm. <br />
It uses the `zip` function together with the `*` operator (argument unpacking) to regroup the sampled transitions into mini-batches, one per field.<br />
`zip(*samples)` returns tuples that collect the first entries of all samples, then the second entries, and so on. <br />
It might look confusing, but the following example should clarify it:

In [None]:
test_tuple = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
zipped_list = list(zip(*test_tuple))
a, b, c = zipped_list
print(a, b, c)
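
The same trick applies to transitions drawn from a replay buffer. The sketch below uses made-up integer "states" instead of real observations; the buffer contents and the variable names are purely illustrative.

```python
import random
from collections import deque

# Hypothetical transitions of the form (state, action, reward, next_state, done)
buffer = deque(maxlen=100)
for t in range(10):
    buffer.append((t, t % 2, 1.0, t + 1, t == 9))

random.seed(0)
batch = random.sample(buffer, 4)  # 4 transitions, drawn without replacement

# One tuple per field, each of length 4
states, actions, rewards, next_states, dones = zip(*batch)
print("states:     ", states)
print("actions:    ", actions)
print("next_states:", next_states)
```

Each position `i` across the five tuples still belongs to the same transition, which is what the replay function relies on when it indexes `actions[i]`, `rewards[i]` and `dones[i]`.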

In [None]:
def replay(replay_buffer, batch_size, model, target_model):
    
    # As long as the buffer does not hold enough elements, we do nothing
    if len(replay_buffer) < batch_size: 
        return
    
    # Take a random sample of size batch_size from the buffer
    samples = random.sample(replay_buffer, batch_size)  
    
    # List to collect the training targets
    target_batch = []  
    
    # Unzip the sampled transitions into one tuple per field
    states, actions, rewards, new_states, dones = zip(*samples)
    
    # Current Q-value estimates of the trained model for the sampled states.
    # Only the entry of the action actually taken is overwritten below,
    # so the other actions contribute no error to the loss.
    targets = model.predict(np.array(states), verbose=0)
    
    # Q-values of the new states, predicted by the (slowly updated) target network
    q_values = target_model.predict(np.array(new_states), verbose=0)  
    
    # Now we loop over all samples to compute the actual targets
    for i in range(batch_size):  
        
        # Maximum Q-value over the actions in the new state
        q_value = max(q_values[i][0])  
        
        # Update the target of the ith sample according to the formula
        target = targets[i].copy()  
        if dones[i]:
            # Terminal transition: no bootstrapping
            target[0][actions[i]] = rewards[i]
        else:
            target[0][actions[i]] = rewards[i] + q_value * GAMMA
        target_batch.append(target)

    # Fit the model on the states and the updated targets for 1 epoch
    model.fit(np.array(states), np.array(target_batch), epochs=1, verbose=0)  
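
The core of the replay function is the target written into the entry of the action that was taken. As a minimal sketch of that formula on plain Python numbers, the function below takes the Q-value estimates for the next state as a list; `td_target` is a hypothetical name, and which network produces `next_q_values` is left to the caller.

```python
GAMMA = 0.95


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Target for the taken action: r if the episode ended, else r + gamma * max Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


print(td_target(1.0, [0.5, 2.0], done=False))  # 1.0 + 0.95 * 2.0
print(td_target(1.0, [0.5, 2.0], done=True))   # no bootstrapping on terminal steps
```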


We need to update the target network periodically. <br />
Keras provides the *set_weights()* and *get_weights()* methods, which do the work for us, so we only need to check whether we have reached an update epoch.

In [None]:
def update_model_handler(epoch, update_target_model, model, target_model):
    if epoch > 0 and epoch % update_target_model == 0:
        target_model.set_weights(model.get_weights())
        #print(f"*** Debug: Updating model")
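
To see which epochs trigger a target update, the condition used in the handler can be evaluated on its own. This is a small sketch; `is_target_update_epoch` is a hypothetical helper that repeats the handler's condition without touching any model.

```python
def is_target_update_epoch(epoch, update_target_model):
    """Same condition as in the handler: every update_target_model epochs, except epoch 0."""
    return epoch > 0 and epoch % update_target_model == 0


update_epochs = [e for e in range(35) if is_target_update_epoch(e, 10)]
print(update_epochs)  # [10, 20, 30]
```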


## Train the DQN agent

In [None]:
model.compile(loss='mse', optimizer=Adam(learning_rate=LEARNING_RATE))


In [None]:
# Can use provided checkpoints as starting point

In [None]:
best_so_far = 0

for epoch in range(EPOCHS):
    
    observation, info = env.reset()  # Get initial state
    
    # The model expects inputs of shape (batch, 1, 4), so each stored observation is reshaped to (1, 4)
    # [Jeremy] Original state is an array of shape (4): [Cart Position, Cart Velocity, Pole Angle, Pole Angular Velocity]
    observation = observation.reshape([1, 4])  
    #print(f"*** Debug, observation shape: {observation.shape}")
    done = False       # True once the pole falls or the cart leaves the track
    truncated = False  # True once the time limit of the episode is reached
    
    points = 0
    while not (done or truncated):  # As long as the current episode is active
        # Select action according to strategy
        action = epsilon_greedy_action_selection(model, epsilon, observation)
        
        # Perform action and get next state
        next_observation, reward, done, truncated, info = env.step(action)

        next_observation = next_observation.reshape([1, 4])  # Reshape!
        
        # Update the replay buffer (only `done` marks a terminal state; truncation does not)
        replay_buffer.append((observation, action, reward, next_observation, done))
        
        observation = next_observation  # Update the observation
        
        points += 1

        # Train the model by replaying
        #print(f"*** Debug: Done = {done}")
        replay(replay_buffer, 32, model, target_model)
    epsilon *= EPSILON_REDUCE  # Reduce epsilon
    
    # Check if we need to update the target model
    update_model_handler(epoch, update_target_model, model, target_model)
    
    if points > best_so_far:
        best_so_far = points
    if epoch % 25 == 0:
        print(f"========== {epoch}: Points reached: {points} - epsilon: {epsilon} - Best: {best_so_far}")
env.close()
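
The reshapes in the training code are easier to follow when the array shapes are printed. The sketch below uses a zero vector as a stand-in for a real CartPole observation (an assumption for illustration) and mimics how the stored observations are stacked into a mini-batch of 32.

```python
import numpy as np

observation = np.zeros(4, dtype=np.float32)     # placeholder for one CartPole observation
stored = observation.reshape([1, 4])            # shape stored in the replay buffer
batch = np.array([stored for _ in range(32)])   # stacking 32 stored observations
single = stored.reshape([1, 1, 4])              # shape used for a single prediction

print(stored.shape, batch.shape, single.shape)  # (1, 4) (32, 1, 4) (1, 1, 4)
```

Both `batch` and `single` have the form (batch size, 1, 4), which matches the `input_shape=(1, num_observations)` declared for the first layer of the model.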


## Exploit learned Q values in test simulations

Finally, let's showcase how the trained agent performs by graphically visualizing its behavior in the environment.

In [None]:
env.close()

In [None]:
env = gym.make("CartPole-v1", render_mode="human") 

observation, info = env.reset()

for counter in range(500):
    
    env.render()
    
    # Choose action from predicted Q-values
    action = np.argmax(model.predict(observation.reshape([1,1,4]), verbose=0)) 
    
    # Perform the action 
    observation, reward, done, truncated, info = env.step(action)
    
    # clear_output(wait=True)
    
    if done or truncated:
        print(f"Test episode done")
        observation, info = env.reset()
        #break
        
env.close()

## Thank you everyone!