<a href="https://colab.research.google.com/github/Pranav-Reddy-Pedaballe/Reinforcement-Learning/blob/main/Q_Learning_for_Lunar_Landing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Q-Learning for Lunar Landing

## Part 0 - Installing the required packages and importing the libraries

### Installing Gymnasium

In [None]:
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!apt-get install -y swig
!pip install "gymnasium[box2d]"

Collecting gymnasium
  Downloading gymnasium-1.0.0-py3-none-any.whl.metadata (9.5 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl.metadata (558 bytes)
Downloading gymnasium-1.0.0-py3-none-any.whl (958 kB)
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, gymnasium
Successfully installed farama-notifications-0.0.4 gymnasium-1.0.0
Collecting ale-py>=0.9 (from gymnasium[accept-rom-license,atari])
  Downloading ale_py-0.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.6 kB)
Downloading ale_py-0.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Installing collec

### Importing the libraries

In [None]:
import os
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.autograd as autograd
from torch.autograd import Variable
from collections import deque, namedtuple

## Part 1 - Building the AI

### Creating the architecture of the Neural Network

In [None]:
class Network(nn.Module):

   def __init__(self,state_size,action_size,seed=42):
       super(Network,self).__init__()
       self.seed = torch.manual_seed(seed)
       self.fc1= nn.Linear(state_size,64)
       self.fc2= nn.Linear(64,64)
       self.fc3= nn.Linear(64,action_size)

   def forward(self,state):
       x=self.fc1(state)
       x=F.relu(x)
       x=self.fc2(x)
       x=F.relu(x)
       return self.fc3(x)


1. We define a new class called `Network` that inherits from `nn.Module`.
2. `def __init__(...)` defines the constructor method.
3. `self` refers to the current instance of the class.
4. A seed is a starting value used to initialize a random number generator, so that random processes in a program (such as weight initialization or data shuffling) produce the same results on every run, which makes experiments reproducible.
5. `super(Network, self).__init__()` calls the constructor of the parent class, `nn.Module`. This is required so that PyTorch can register and manage the model's parameters and provide its other functionality.
6. `torch.manual_seed(seed)` sets the seed of PyTorch's random number generator, so the weights are initialized the same way on every run. It returns a `torch.Generator` object, and that object (not the integer seed) is what gets stored in `self.seed`.
7. `self.fc1` is the first fully connected layer of the neural network.
8. `nn.Linear(state_size, 64)` defines a fully connected (linear) layer that takes `state_size` inputs and produces 64 outputs. The number 64 is the size of the hidden layer and can be adjusted to the complexity of the problem.
9. `self.fc2` is a second fully connected hidden layer (64 to 64), and `self.fc3` is the output layer, which maps the 64 hidden units to `action_size` outputs, one Q-value per action.












`self.fc1(state)`: the state is passed to the first fully connected layer. The output is a linear (affine) transformation of the input, computed as x·Wᵀ + b, where x is the input, W is the layer's weight matrix (stored by `nn.Linear` with shape `(out_features, in_features)`), and b is the bias vector.
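To make the linear transformation concrete, here is a minimal NumPy sketch of what a fully connected layer computes. The random weights and the names `W`, `b`, and `x` are illustrative assumptions, not the values PyTorch would initialize.

```python
import numpy as np

rng = np.random.default_rng(0)
state_size, hidden_size = 8, 64

# Illustrative random parameters; nn.Linear stores its weight as (out_features, in_features).
W = rng.standard_normal((hidden_size, state_size))
b = rng.standard_normal(hidden_size)

# A batch containing one 8-dimensional state.
x = rng.standard_normal((1, state_size))

# The affine map applied by a fully connected layer.
y = x @ W.T + b
print(y.shape)  # (1, 64)
```

Each of the 64 outputs is a weighted sum of the 8 inputs plus its own bias term.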

x = F.relu(x)

F: Refers to the functional API in PyTorch (torch.nn.functional), which provides various activation functions.

relu: A rectified linear unit activation function. It replaces all negative values in x with 0 and keeps positive values unchanged.

(x): The input to the relu function is the result of the previous layer (fc1). This introduces non-linearity to the model.


ReLU (Rectified Linear Unit) is used in neural networks to introduce non-linearity, enabling the network to learn complex patterns in data. It also mitigates the vanishing gradient problem, because its gradient is exactly 1 for positive inputs instead of shrinking toward 0 as it does with saturating activations such as sigmoid or tanh. ReLU is computationally cheap and promotes sparse representations by outputting zero for negative inputs. Despite its simplicity, it is highly effective for deep learning tasks.
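As a quick illustration of the rule described above, this NumPy sketch applies ReLU to a handful of made-up values. The helper name `relu` is an assumption for this example; in the notebook the same operation is done by `F.relu`.

```python
import numpy as np

def relu(x):
    """Replace negative entries with 0 and keep the rest unchanged."""
    return np.maximum(x, 0.0)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(x)
print(activated)  # [0.  0.  0.  1.5 3. ]
```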

## Part 2 - Training the AI

### Setting up the environment

In [None]:
import gymnasium as gym
env = gym.make('LunarLander-v3') # The Lunar Lander environment was upgraded to v3
state_shape = env.observation_space.shape
state_size = env.observation_space.shape[0]
number_actions = env.action_space.n
print('State shape: ', state_shape)
print('State size: ', state_size)
print('Number of actions: ', number_actions)

State shape:  (8,)
State size:  8
Number of actions:  4


`state_shape` is the shape of the observation array. Here it is `(8,)`, a 1D vector with 8 entries.

`state_size` is the number of inputs to the network, here 8.

The state is an 8-dimensional vector: the coordinates of the lander in x & y, its linear velocities in x & y, its angle, its angular velocity, and two booleans that represent whether each leg is in contact with the ground or not.
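One way to keep the eight components straight is to give them names. The sketch below uses a `namedtuple` with hypothetical field names that follow the order listed above; the observation values are made up, and the naming of the two leg flags as `leg_1_contact` and `leg_2_contact` is an assumption.

```python
from collections import namedtuple

# Hypothetical field names, in the order the components are described above.
LanderState = namedtuple(
    "LanderState",
    ["x", "y", "vx", "vy", "angle", "angular_velocity",
     "leg_1_contact", "leg_2_contact"],
)

# Made-up observation, used only to show the unpacking.
observation = [0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 0.0]
state = LanderState(*observation)
print(state.y, state.vy)  # 1.4 -0.5
```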

`env.action_space` describes the available actions, and `.n` gives their number, which is 4 here:

0: do nothing

1: fire left orientation engine

2: fire main engine

3: fire right orientation engine
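For readability, the four action indices can be wrapped in an `IntEnum`. This is only a sketch: the class and member names are assumptions for illustration, while the integer values follow the list above.

```python
from enum import IntEnum

class LanderAction(IntEnum):
    """Hypothetical names for the four discrete actions."""
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3

# An IntEnum member behaves like its integer, so it can be passed wherever an int is expected.
chosen = LanderAction.FIRE_MAIN_ENGINE
print(int(chosen), LanderAction(3).name)  # 2 FIRE_RIGHT_ENGINE
```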

### Initializing the hyperparameters

In [None]:
learning_rate = 5e-4
minibatch_size = 100
discount_factor = 0.99
replay_buffer_size = int(1e5)
interpolation_parameter = 1e-3


Minibatch size: the number of experiences sampled from the replay buffer and used in one training step to update the model parameters.

Efficiency: Using a minibatch size is more efficient than processing one sample at a time, especially on hardware like GPUs.

Regularization: Small minibatches introduce noise to the gradients, helping prevent overfitting.

Memory Constraints: A smaller minibatch size can fit into memory, while the full dataset might not.

How Experience Replay Works:

***Replay Buffer:***

The agent stores its past experiences (transitions) in a memory buffer.
An experience is typically represented as a tuple:

(s,a,r,s′,d), where:
s: Current state.
a: Action taken.
r: Reward received.
s′: Next state.
d: Done flag (indicates if the episode ended).

***Sampling Mini-Batches:***

Instead of using the most recent experience for learning, the agent randomly samples a mini-batch of past experiences from the buffer.

These experiences are used to update the model (e.g., the Q-network).

***Update and Replace:***

The oldest experiences in the buffer are replaced as the buffer reaches its capacity, ensuring the agent learns from a diverse set of experiences.
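The three steps above (store, sample, replace) can be sketched with the standard library alone. This is a simplified illustration, not the `ReplayMemory` class used in this notebook: the names `Experience` and `TinyReplayBuffer` are assumptions, and it relies on `deque(maxlen=...)` to drop the oldest entry automatically.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class TinyReplayBuffer:
    """Hypothetical minimal buffer: a bounded deque plus uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest item when a new one overflows it.
        self.memory = deque(maxlen=capacity)

    def push(self, experience):
        self.memory.append(experience)

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = TinyReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(Experience(state=step, action=0, reward=1.0, next_state=step + 1, done=False))

oldest_state = buffer.memory[0].state  # experiences 0 and 1 were evicted
batch = buffer.sample(2)
print(len(buffer), oldest_state, len(batch))  # 3 2 2
```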


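Learning rate: the step size the optimizer uses when it updates the Q-network's weights. Larger values learn faster but can make training unstable.

Discount factor: how strongly future rewards count relative to immediate ones. A value of 0.99 makes the agent value long-term reward almost as much as immediate reward.

Replay buffer size: the maximum number of experiences kept in memory, here 100,000.

Interpolation parameter (τ): controls how fast the target network's parameters move toward the local (main) network's parameters at each soft update.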

### Implementing Experience Replay

In [None]:
class ReplayMemory(object):

    def __init__(self, capacity):
      self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
      self.capacity = capacity
      self.memory = []

    def push(self,event):
      self.memory.append(event)
      if len(self.memory)>self.capacity:
        del self.memory[0]

    def sample(self,batch_size):
      experiences = random.sample(self.memory,k = batch_size)
      states = torch.from_numpy(np.vstack([e[0] for e in experiences if e is not None ])).float().to(self.device)
      actions = torch.from_numpy(np.vstack([e[1] for e in experiences if e is not None ])).long().to(self.device)
      rewards = torch.from_numpy(np.vstack([e[2] for e in experiences if e is not None ])).float().to(self.device)
      next_states = torch.from_numpy(np.vstack([e[3] for e in experiences if e is not None ])).float().to(self.device)
      dones = torch.from_numpy(np.vstack([e[4] for e in experiences if e is not None ]).astype(np.uint8)).float().to(self.device)
      return states,next_states,actions,rewards,dones



`capacity`: the maximum number of experiences the memory buffer can hold.

`memory`: the list that stores the experiences. Each one contains the state, the action, the reward, the next state, and a flag saying whether the episode is done.

The `push` method adds an experience to the replay memory buffer: it appends the event to the `memory` list and, if the list then exceeds `capacity`, deletes the oldest event.

The `sample` method randomly selects a batch of experiences from the memory buffer.

For each component, a list comprehension extracts that element from every sampled experience (for example `e[0]` for the states), and `np.vstack` stacks the extracted values into one array with one row per experience.

`torch.from_numpy()` then converts each stacked array into a PyTorch tensor, `.float()` sets the data type to float (`.long()` for the actions, which are later used as indices), and `.to(self.device)` moves the tensor to the GPU when one is available.

The `done` flags are booleans, so they are first cast to `uint8` (0 or 1) and then to float so that they can be used in arithmetic.

Note that `sample` returns the tensors in the order `states, next_states, actions, rewards, dones`; the `learn` method unpacks them in the same order.
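The stacking step is easier to see on a tiny example. The sketch below uses three made-up experiences with 2-dimensional states (instead of 8) and stops at the NumPy arrays, leaving out the conversion to tensors.

```python
import numpy as np

# Made-up experiences: (state, action, reward, next_state, done).
experiences = [
    (np.array([0.0, 1.0]), 2, -0.5, np.array([0.1, 0.9]), False),
    (np.array([0.2, 0.8]), 0, 1.0, np.array([0.3, 0.7]), False),
    (np.array([0.4, 0.6]), 3, 10.0, np.array([0.5, 0.5]), True),
]

# One row per experience: vectors stack into (3, 2), scalars into (3, 1).
states = np.vstack([e[0] for e in experiences])
actions = np.vstack([e[1] for e in experiences])
dones = np.vstack([e[4] for e in experiences]).astype(np.uint8)

print(states.shape, actions.shape, dones.ravel())  # (3, 2) (3, 1) [0 0 1]
```

Because every component ends up with the batch along the first axis, the later element-wise arithmetic on rewards, dones, and Q-values lines up row by row.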

### Implementing the DQN class

In [None]:
class Agent():

    def __init__(self, state_size, action_size):
      self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
      self.state_size = state_size
      self.action_size = action_size
      self.local_qnetwork = Network(state_size, action_size).to(self.device)
      self.target_qnetwork = Network(state_size, action_size).to(self.device)
      self.optimizer = optim.Adam(self.local_qnetwork.parameters(), lr = learning_rate)
      self.memory = ReplayMemory(replay_buffer_size)
      self.t_step = 0

    def step(self, state, action, reward, next_state, done):
      self.memory.push((state, action, reward, next_state, done)) #to store in the replay memory
      self.t_step = (self.t_step + 1)%4                           #to learn every 4 steps.
      if self.t_step == 0 :
        if len(self.memory.memory) > minibatch_size:
          experiences = self.memory.sample(minibatch_size)
          self.learn(experiences, discount_factor)

    def act(self, state, epsilon = 0.):
      state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
      self.local_qnetwork.eval()
      with torch.no_grad():
        action_values = self.local_qnetwork(state)
      self.local_qnetwork.train()
      if random.random() > epsilon :
        return np.argmax(action_values.cpu().data.numpy())
      else:
        return random.choice(np.arange(self.action_size))

    def learn(self, experiences, discount_factor):
      states, next_states, actions, rewards, dones = experiences
      next_q_targets = self.target_qnetwork(next_states).detach().max(1)[0].unsqueeze(1)
      q_targets = rewards + (discount_factor * next_q_targets * (1 - dones))
      q_expected = self.local_qnetwork(states).gather(1, actions)
      loss = F.mse_loss(q_expected, q_targets)
      self.optimizer.zero_grad()
      loss.backward()
      self.optimizer.step()
      self.soft_update(self.local_qnetwork, self.target_qnetwork, interpolation_parameter)

    def soft_update(self, local_model, target_model, interpolation_parameter):
      for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(interpolation_parameter * local_param.data + (1.0 - interpolation_parameter) * target_param.data)




The concepts of the local Q-network and the target Q-network arise in the context of Deep Q-Learning, an extension of Q-Learning that uses deep neural networks to approximate the Q-value function. These two networks play a crucial role in stabilizing the learning process. Here's a breakdown of their purposes and why they are used:

***Local Q-Network (or simply Q-Network):***

***Purpose:*** This network is responsible for learning the Q-value function. It is updated continuously during the training process.

**Operation:** At each step of training, this network takes the current state as input and outputs Q-values for all possible actions in that state.

**Updating:** The weights of this network are updated frequently, typically at every step or every few steps of training, using a technique like backpropagation. The updates are guided by a loss function that measures the difference between the predicted Q-values and the target Q-values (which come from the target Q-network).

***Target Q-Network:***

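**Purpose:** The target Q-network is used to generate stable target values for the updates of the Q-network. It helps in stabilizing the learning process.

**Operation:** Like the local Q-network, it takes the state as input and outputs Q-values. However, its weights change much more slowly.

**Updating:** The weights of the target Q-network are moved toward those of the local Q-network more slowly than the local network itself changes. The classic approach copies the local weights into the target network every few hundred or thousand training steps (a hard update). This notebook instead uses a soft update: after every learning step, the target weights are nudged a small fraction (`interpolation_parameter`) of the way toward the local weights.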

The reason for having two separate networks is a significant challenge in training deep Q-networks: the moving target problem. When a single network both produces the predicted Q-values and computes the targets those predictions are trained against, every weight update also shifts the targets. This coupling can make training unstable and inefficient, as the network is effectively chasing a constantly moving target (its own continuously updated estimates).

By separating the networks, the target Q-network provides more slowly changing target values for the local Q-network to learn from. This weakens the coupling between predictions and targets, leading to more stable and reliable learning. A loose analogy from supervised learning is training against fixed labels rather than labels that change every time the model does.

In this implementation, the local and target Q-networks start with identical weights, because both are built by `Network` with the same default seed. During training, the weights of the local Q-network are updated by gradient descent at every learning step, while the target Q-network only follows them slowly through the soft update. This reduces oscillations and divergence by providing a more stable learning target.

The optimizer minimizes the loss function. Minimizing the loss moves the local Q-network's predictions closer to the target Q-values, which is how the model learns.

The `optim` module provides optimization algorithms, such as SGD and Adam, that adjust the model's weights based on the gradients of the loss. Here we use Adam with the learning rate defined above.


The `step` method stores each experience in the replay memory and decides when to learn from it: every 4 steps, once the memory holds more than `minibatch_size` experiences, it samples a minibatch and calls `learn`.

The `act` method selects an action for a given state using an epsilon-greedy policy.

We first convert the state, which is a NumPy array, into a torch tensor.

We then use `unsqueeze` to add an extra dimension that corresponds to the batch. The state vector has shape `(8,)`; the argument 0 inserts the new dimension at index 0, so the tensor becomes shape `(1, 8)`, a batch containing a single state.

`.unsqueeze(0):` Adds a batch dimension at the start because the model expects input in a batch format (even if it's a single state).
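The effect on the shape can be checked with a NumPy analogue, where `np.expand_dims(state, 0)` plays the same role as `.unsqueeze(0)` does for a tensor. The zero-filled state is a placeholder.

```python
import numpy as np

state = np.zeros(8, dtype=np.float32)  # placeholder observation, shape (8,)

# Insert a new axis at position 0: a batch that contains one state.
batched_state = np.expand_dims(state, 0)

print(state.shape, batched_state.shape)  # (8,) (1, 8)
```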

We then set the local Q-network to evaluation mode with `.eval()`. Evaluation mode switches layers such as dropout and batch normalization to their inference behavior. This network has neither, so the call does not change the outputs here, but it is a good habit during action selection.

Because we are only predicting Q-values and not training, we wrap the forward pass in `with torch.no_grad()`. This disables gradient computation, which saves memory and speeds up inference.

`action_values = self.local_qnetwork(state):` Passes the state through the local Q-network to get the predicted Q-values for all possible actions in that state.

`.train():` Puts the Q-network back in training mode after evaluation, so that it is ready for future updates.

Epsilon-greedy policy: we generate a random number between 0 and 1. If it is greater than epsilon, the agent selects the action with the highest Q-value (exploitation); otherwise it selects a random action (exploration). With the default epsilon of 0, the agent always takes the greedy action.

`random.random():` Calls the `random()` function from Python's `random` module, which returns a random float in the range [0, 1).

`action_values:` The Q-values computed by the Q-network.

`.cpu():` Moves the tensor from the GPU to the CPU.

`.data.numpy():` Converts the tensor to a NumPy array for processing.

`np.argmax:` Returns the index of the action with the highest Q-value, representing the best action to take in this state.

`random.choice:` Randomly selects an action.

`np.arange(self.action_size):` Creates an array of all possible action indices (0 to self.action_size - 1), from which the random action is chosen.

The default epsilon of 0 is what we use for evaluation and testing. During training we pass a larger epsilon that decays over time.
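The selection rule can be isolated in a few lines. The sketch below is a standalone illustration with made-up Q-values; the function name `epsilon_greedy` is an assumption, and it skips the network and tensor handling that `act` performs.

```python
import random
import numpy as np

def epsilon_greedy(action_values, epsilon):
    """Pick the greedy action with probability 1 - epsilon, else a random one."""
    if random.random() > epsilon:
        return int(np.argmax(action_values))       # exploit
    return random.randrange(len(action_values))    # explore

q_values = np.array([0.1, 2.5, -1.0, 0.7])  # made-up Q-values for 4 actions

greedy_action = epsilon_greedy(q_values, epsilon=0.0)  # greedy: index of the largest value
random_action = epsilon_greedy(q_values, epsilon=1.0)  # always a random index
print(greedy_action, random_action in range(4))
```

With `epsilon=1.0` the comparison `random.random() > epsilon` can never be true, so the agent explores on every call; this is the situation at the start of training.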

LEARN:-

We first unpack the experiences into local variables.

`states, next_states, actions, rewards, dones = experiences:` extracts each component of the experience batch (states, next states, actions, rewards, terminal flags) for processing.

`next_q_targets = self.target_qnetwork(next_states).detach().max(1)[0].unsqueeze(1)`

`self.target_qnetwork(next_states):` Passes `next_states` through the target Q-network to get its predicted Q-values for all actions in each next state.

`.detach():` Detaches the result from the computation graph, so no gradients flow back into the target network.

`.max(1)[0]`: `max(1)` takes the maximum along dimension 1, the action dimension, and returns a pair `(values, indices)`. Index `[0]` selects the values, that is, the best Q-value for each next state.

`.unsqueeze(1):` Reshapes the result from `(batch_size,)` to `(batch_size, 1)` so that it matches the shape of `rewards` and `dones` in the calculations that follow.

Together, this gives the maximum predicted Q-value for each next state.


`q_targets = rewards + (discount_factor * next_q_targets * (1 - dones))`

Now we compute the target Q-values for the current states.

`rewards:` Immediate rewards received from the environment.

`discount_factor * next_q_targets:` Adds the discounted maximum Q-value for the next state, representing the expected future reward.

`(1 - dones):` Multiplies by 1 - dones to zero out the future rewards for terminal states (where done = 1).
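A small worked example may help here. The sketch below reproduces the target computation in NumPy for a batch of three transitions with made-up numbers; `max(axis=1, keepdims=True)` stands in for `.max(1)[0].unsqueeze(1)`.

```python
import numpy as np

discount_factor = 0.99

# Made-up target-network Q-values for 3 next states and 4 actions.
next_q_values = np.array([
    [1.0, 2.0, 0.5, 0.0],
    [0.0, -1.0, 3.0, 1.0],
    [4.0, 4.5, 2.0, 1.0],
])
rewards = np.array([[1.0], [-0.5], [100.0]])
dones = np.array([[0.0], [0.0], [1.0]])  # the third transition ended the episode

# Best Q-value per next state, kept as a column of shape (3, 1).
next_q_targets = next_q_values.max(axis=1, keepdims=True)

# Reward plus discounted future value; the future term vanishes for terminal states.
q_targets = rewards + discount_factor * next_q_targets * (1 - dones)
print(q_targets.ravel())  # [  2.98   2.47 100.  ]
```

The third row shows the effect of `(1 - dones)`: the next-state value of 4.5 is ignored and the target is just the reward.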

`q_expected = self.local_qnetwork(states).gather(1, actions)`

Now we get the predicted Q-values from the local Q-network.

`self.local_qnetwork(states):` Passes states through the local Q-network to get predicted Q-values for all actions.

`.gather(1, actions):` Selects, for each state, the Q-value of the action that was actually taken. The first argument is 1 because `dim=1` is the action dimension.

The Q-network outputs Q-values for all actions in each state, but we only care about the Q-value of the specific action the agent actually took in each state.
Using gather(1, actions) ensures that we focus only on the Q-values for the agent's chosen actions during training.
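The selection performed by `gather` can be mimicked in NumPy with `np.take_along_axis`, which is used here as a stand-in. The Q-values and actions are made up.

```python
import numpy as np

# Made-up Q-values for 3 states and 4 actions.
q_values = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [1.0, 2.0, 3.0, 4.0],
    [-1.0, -2.0, -3.0, -4.0],
])
# The action taken in each state, as a column of shape (3, 1).
actions = np.array([[2], [0], [3]])

# For each row, pick the column given by the action index.
q_expected = np.take_along_axis(q_values, actions, axis=1)
print(q_expected.ravel())  # [ 0.3  1.  -4. ]
```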




`loss = F.mse_loss(q_expected, q_targets)`

`F.mse_loss:` Calculates the mean squared error (MSE) between the predicted Q-values (q_expected) and the target Q-values (q_targets).

The loss measures how far the current Q-network is from the target values.
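As a worked example of the arithmetic, the sketch below computes the mean squared error by hand in NumPy for three made-up pairs of predicted and target Q-values.

```python
import numpy as np

q_expected = np.array([[2.0], [1.0], [90.0]])    # made-up predictions
q_targets = np.array([[2.98], [2.47], [100.0]])  # made-up targets

# Mean of the squared differences over all elements.
squared_errors = (q_expected - q_targets) ** 2
loss = squared_errors.mean()
print(round(float(loss), 4))  # 34.3738
```

The third pair dominates the loss because squaring penalizes large errors much more than small ones.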


We now backpropagate the loss to update the model parameters, which improves the Q-value estimates and in turn the action selection policy.

`self.optimizer.zero_grad()` : Clears any previously accumulated gradients.

`loss.backward()` : Computes gradients of the loss with respect to the network parameters using backpropagation.

These gradients are stored for the optimizer, which uses them to update the parameters in the next step:

`self.optimizer.step()` : Updates the Q-network parameters using the optimizer (here Adam).


***The soft_update function*** is used to gradually update the parameters of the target Q-network using the parameters of the local Q-network. This helps stabilize training in reinforcement learning.

`zip(target_model.parameters(), local_model.parameters()):`

Iterates over corresponding parameters of the target and local models (e.g., weights and biases for each layer).

`target_param.data.copy_():`

Updates the data of each parameter in the target model by combining its current value with the corresponding parameter in the local model.

`interpolation_parameter * local_param.data:`

Takes a fraction (τ) of the local model's parameter.

`(1.0 - interpolation_parameter) * target_param.data:`

Retains a fraction (1−τ) of the target model's parameter.

Result:

The target model parameter is updated as a weighted average:

new_target_param = τ · local_param + (1 − τ) · target_param

This gradually moves the target model's parameters closer to the local model's parameters.

Imagine you’re trying to hit a moving target while running. If the target keeps moving unpredictably as you adjust your aim, it’s much harder to hit. The target network acts like a stable guide, allowing you to adjust gradually, while the local network actively learns and tries to hit the target.

By the end of training, the local and target networks ideally converge, both representing the optimal Q-function.

In short, the soft update replaces each target parameter with a weighted average of the local and target parameters, with weight τ on the local one.
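To see how slowly the target follows the local network, here is a NumPy sketch of the same weighted average applied to plain arrays. The function operates on lists of arrays rather than PyTorch parameters, and the all-ones and all-zeros starting values are chosen only to make the numbers easy to read.

```python
import numpy as np

def soft_update(local_params, target_params, tau):
    """Move each target array a fraction tau of the way toward its local counterpart."""
    for local, target in zip(local_params, target_params):
        target[...] = tau * local + (1.0 - tau) * target  # in-place update

tau = 1e-3
local_params = [np.ones((2, 2)), np.ones(2)]     # stand-in for the local network
target_params = [np.zeros((2, 2)), np.zeros(2)]  # stand-in for the target network

soft_update(local_params, target_params, tau)
after_one = target_params[0][0, 0]

for _ in range(999):
    soft_update(local_params, target_params, tau)
after_thousand = target_params[0][0, 0]

print(after_one, round(float(after_thousand), 4))
```

With the local value held fixed, the remaining gap shrinks by a factor of (1 − τ) per update, so after n updates the target has covered 1 − (1 − τ)^n of the distance. In real training the local network keeps changing, so the target tracks a smoothed version of it.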

### Initializing the DQN agent

In [None]:
agent = Agent(state_size, number_actions)

We are creating an instance of the Agent class.

### Training the DQN agent

In [None]:
number_episodes = 2000
maximum_number_timesteps_per_episode = 1000
epsilon_starting_value  = 1.0
epsilon_ending_value  = 0.01
epsilon_decay_value  = 0.995
epsilon = epsilon_starting_value
scores_on_100_episodes = deque(maxlen = 100)

for episode in range(1, number_episodes + 1):
  state, _ = env.reset()
  score = 0
  for t in range(maximum_number_timesteps_per_episode):
    action = agent.act(state, epsilon)
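    next_state, reward, terminated, truncated, _ = env.step(action)
    agent.step(state, action, reward, next_state, terminated)  # only a true termination zeroes the future value
    state = next_state
    score += reward
    if terminated or truncated:  # stop on landing/crash or when the time limit is hit
      break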
  scores_on_100_episodes.append(score)
  epsilon = max(epsilon_ending_value, epsilon_decay_value * epsilon)
  print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)), end = "")
  if episode % 100 == 0:
    print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)))
  if np.mean(scores_on_100_episodes) >= 200.0:
    print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode - 100, np.mean(scores_on_100_episodes)))
    torch.save(agent.local_qnetwork.state_dict(), 'checkpoint.pth')
    break

Episode 100	Average Score: -101.70
Episode 200	Average Score: -23.82
Episode 300	Average Score: 73.93
Episode 400	Average Score: 198.05
Episode 403	Average Score: 201.41
Environment solved in 303 episodes!	Average Score: 201.41


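First we initialize the training parameters: the number of episodes, the maximum number of timesteps per episode, and the epsilon schedule, which starts at 1.0 and is multiplied by 0.995 after every episode until it reaches 0.01. In each episode the agent acts, stores the experience, and learns until the episode ends. We track the scores of the last 100 episodes in a `deque`, and once their average reaches 200 we consider the environment solved and save the network weights to `checkpoint.pth`.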
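The epsilon schedule and the rolling score window can be examined without running the environment. The sketch below replays only the bookkeeping from the training loop; the window size of 3 and the scores are made up to keep the example short.

```python
from collections import deque

epsilon_starting_value = 1.0
epsilon_ending_value = 0.01
epsilon_decay_value = 0.995

# Replay the per-episode epsilon update for 1000 episodes.
epsilon = epsilon_starting_value
schedule = []
for episode in range(1, 1001):
    epsilon = max(epsilon_ending_value, epsilon_decay_value * epsilon)
    schedule.append(epsilon)

# First episode at which epsilon is clamped to its floor.
first_floor_episode = next(
    i for i, value in enumerate(schedule, start=1) if value == epsilon_ending_value
)

# A bounded deque keeps only the most recent scores.
scores = deque(maxlen=3)
for score in [10, 20, 30, 40]:
    scores.append(score)
rolling_mean = sum(scores) / len(scores)

print(schedule[0], schedule[-1], first_floor_episode, rolling_mean)
```

Multiplicative decay keeps exploration high in the early episodes and lowers it gradually, while the floor guarantees the agent never stops exploring entirely.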


## Part 3 - Visualizing the results

In [None]:
import glob
import io
import base64
import imageio
from IPython.display import HTML, display

def show_video_of_model(agent, env_name):
    env = gym.make(env_name, render_mode='rgb_array')
    state, _ = env.reset()
    done = False
    frames = []
    while not done:
        frame = env.render()
        frames.append(frame)
        action = agent.act(state)
        state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated  # also stop when the time limit truncates the episode
    env.close()
    imageio.mimsave('video.mp4', frames, fps=30)

show_video_of_model(agent, 'LunarLander-v3')

def show_video():
    mp4list = glob.glob('*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

show_video()

