<a href="https://colab.research.google.com/github/Pranav-Reddy-Pedaballe/Reinforcement-Learning/blob/main/Deep_Convolutional_Q_Learning_for_Pac_Man.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Convolutional Q-Learning for Pac-Man

## Part 0 - Installing the required packages and importing the libraries

### Installing Gymnasium

In [None]:
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!pip install ale-py
!apt-get install -y swig
!pip install gymnasium[box2d]

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig is already the newest version (4.0.2-1ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


### Importing the libraries

In [None]:
import os
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import deque
from torch.utils.data import DataLoader, TensorDataset

## Part 1 - Building the AI

### Creating the architecture of the Neural Network

In [None]:
class Network(nn.Module):

  def __init__(self, action_size, seed = 42):
    super(Network, self).__init__()
    self.seed = torch.manual_seed(seed)
    self.conv1 = nn.Conv2d(3, 32, kernel_size = 8, stride = 4)
    self.bn1 = nn.BatchNorm2d(32)
    self.conv2 = nn.Conv2d(32, 64, kernel_size = 4, stride = 2)
    self.bn2 = nn.BatchNorm2d(64)
    self.conv3 = nn.Conv2d(64, 64, kernel_size = 3, stride = 1)
    self.bn3 = nn.BatchNorm2d(64)
    self.conv4 = nn.Conv2d(64, 128, kernel_size = 3, stride = 1)
    self.bn4 = nn.BatchNorm2d(128)
    self.fc1 = nn.Linear(10 * 10 * 128, 512)
    self.fc2 = nn.Linear(512, 256)
    self.fc3 = nn.Linear(256, action_size)

  def forward(self, state):
    x = F.relu(self.bn1(self.conv1(state)))
    x = F.relu(self.bn2(self.conv2(x)))
    x = F.relu(self.bn3(self.conv3(x)))
    x = F.relu(self.bn4(self.conv4(x)))
    x = x.view(x.size(0), -1)
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    return self.fc3(x)

```
self.conv1 = nn.Conv2d(3, 32, kernel_size=8, stride=4)
self.bn1 = nn.BatchNorm2d(32)
self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
self.bn2 = nn.BatchNorm2d(64)
self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
self.bn3 = nn.BatchNorm2d(64)
self.conv4 = nn.Conv2d(64, 128, kernel_size=3, stride=1)
self.bn4 = nn.BatchNorm2d(128)
```


`nn.Conv2d:` Defines convolutional layers for image-like input (e.g., frames in Atari environments).

`3:` Number of input channels (RGB image).

`32, 64, 128`: Number of output channels (filters).

`kernel_size:` Size of the convolution filter.

`stride:` Controls how much the filter moves with each step.

`Batch Normalization (nn.BatchNorm2d):`
Normalizes the output of each convolution layer to stabilize training and improve performance.

The convolutional features are then passed through three fully connected (linear) layers, which output one Q-value per action.

The forward method defines how the data would flow through the network during a forward pass.


### Passing through the convolutional layers

`F.relu:` Applies the ReLU activation function after each convolution and batch normalization. This introduces non-linearity, helping the network learn complex patterns.

`x = x.view(x.size(0), -1)`: Flattens the output from the convolutional layers into a vector so it can be fed into the fully connected layers.

`x.size(0)` is the batch size, and `-1` lets PyTorch infer the remaining dimension (here 10 * 10 * 128 = 12800 values, matching the input size of `fc1`).

The flattened vector is then passed through the fully connected layers.
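The `10 * 10 * 128` input size of `fc1` follows from the 128x128 frames that `preprocess_frame` produces. A minimal sketch of that arithmetic, using the standard output-size formula for a convolution without padding (the helper name `conv_output_size` is hypothetical and not part of the notebook):

```python
def conv_output_size(size, kernel_size, stride):
    """Spatial output size of a convolution with no padding and no dilation."""
    return (size - kernel_size) // stride + 1

# (kernel_size, stride) of conv1..conv4, copied from the Network class
layers = [(8, 4), (4, 2), (3, 1), (3, 1)]

size = 128  # height and width after preprocessing (square input assumed)
sizes = []
for kernel_size, stride in layers:
    size = conv_output_size(size, kernel_size, stride)
    sizes.append(size)

flattened = size * size * 128  # conv4 has 128 output channels
print(sizes, flattened)
```

If the resize target in the preprocessing step changes, the first argument of `fc1` has to be recomputed the same way.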




## Part 2 - Training the AI

### Setting up the environment

In [None]:
import ale_py
import gymnasium as gym
env = gym.make('MsPacmanDeterministic-v0', full_action_space = False)
state_shape = env.observation_space.shape
state_size = env.observation_space.shape[0]
number_actions = env.action_space.n
print('State shape: ', state_shape)
print('State size: ', state_size)
print('Number of actions: ', number_actions)



State shape:  (210, 160, 3)
State size:  210
Number of actions:  9


### Initializing the hyperparameters

In [None]:
learning_rate = 5e-4
minibatch_size = 64
discount_factor = 0.99

### Preprocessing the frames

In [None]:
from PIL import Image
from torchvision import transforms

def preprocess_frame(frame):
  frame = Image.fromarray(frame)
  preprocess = transforms.Compose([transforms.Resize((128, 128)), transforms.ToTensor()])
  return preprocess(frame).unsqueeze(0)

`preprocess_frame:` Takes a raw frame from the environment (a NumPy array) and converts it into a tensor suitable as input to the neural network.

`Image.fromarray(frame):` Converts a NumPy array (the typical format of video/game frames) into a PIL image.
This step is necessary because torchvision transforms generally expect a PIL image.

### Define Preprocessing Pipeline:
`transforms.Compose:` Combines multiple transformation steps into a single pipeline.

`transforms.Resize((128, 128)):` Resizes the image to 128x128 pixels. This standardizes the input size for the neural network.

Resizing to 128x128 reduces the computational cost without losing essential visual features.

`transforms.ToTensor():` Converts the PIL image into a PyTorch tensor, reorders it from (height, width, channels) to (channels, height, width), and scales pixel values to the range [0, 1].

Neural networks in PyTorch require input as tensors, not images or arrays.

### Apply Transformations and Add Batch Dimension:

`preprocess(frame):` Applies the resizing and conversion to tensor on the input frame.

`.unsqueeze(0):` Adds a batch dimension at position 0.
Neural networks expect input of shape (batch_size, channels, height, width), and a single frame lacks the batch dimension.

After `unsqueeze(0)` the frame is a batch of size 1, with shape [1, 3, 128, 128] for RGB images, which makes it compatible with models that expect batched input.
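The layout and shape changes described above can be traced with NumPy alone. This sketch skips the resize step and uses a synthetic random frame as a stand-in for an environment observation; it only mimics what `ToTensor()` and `.unsqueeze(0)` do to the layout, value range, and shape, and is not the torchvision implementation.

```python
import numpy as np

# Synthetic stand-in for an Atari frame: height x width x channels, uint8
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)

# ToTensor-like step: HWC -> CHW and scale uint8 values to [0, 1]
chw = frame.transpose(2, 0, 1).astype(np.float32) / 255.0

# unsqueeze(0)-like step: add a leading batch dimension
batched = chw[np.newaxis, ...]

print(frame.shape, chw.shape, batched.shape)
```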

### Implementing the DCQN class

In [None]:
class Agent():

  def __init__(self, action_size):
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.action_size = action_size
    self.local_qnetwork = Network(action_size).to(self.device)
    self.target_qnetwork = Network(action_size).to(self.device)
    self.optimizer = optim.Adam(self.local_qnetwork.parameters(), lr = learning_rate)
    self.memory = deque(maxlen = 10000)

  def step(self, state, action, reward, next_state, done):
    state = preprocess_frame(state)
    next_state = preprocess_frame(next_state)
    self.memory.append((state, action, reward, next_state, done))
    if len(self.memory) > minibatch_size:
      experiences = random.sample(self.memory, k = minibatch_size)
      self.learn(experiences, discount_factor)

  def act(self, state, epsilon = 0.):
    state = preprocess_frame(state).to(self.device)
    self.local_qnetwork.eval()
    with torch.no_grad():
      action_values = self.local_qnetwork(state)
    self.local_qnetwork.train()
    if random.random() > epsilon:
      return np.argmax(action_values.cpu().data.numpy())
    else:
      return random.choice(np.arange(self.action_size))

  def learn(self, experiences, discount_factor):
    states, actions, rewards, next_states, dones = zip(*experiences)
    states = torch.from_numpy(np.vstack(states)).float().to(self.device)
    actions = torch.from_numpy(np.vstack(actions)).long().to(self.device)
    rewards = torch.from_numpy(np.vstack(rewards)).float().to(self.device)
    next_states = torch.from_numpy(np.vstack(next_states)).float().to(self.device)
    dones = torch.from_numpy(np.vstack(dones).astype(np.uint8)).float().to(self.device)
    next_q_targets = self.target_qnetwork(next_states).detach().max(1)[0].unsqueeze(1)
    q_targets = rewards + discount_factor * next_q_targets * (1 - dones)
    q_expected = self.local_qnetwork(states).gather(1, actions)
    loss = F.mse_loss(q_expected, q_targets)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

`local_qnetwork:` The main network used to estimate Q-values during action selection.

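`target_qnetwork:` A second network with the same architecture, used to compute the target Q-values for stability. Note that this notebook never copies or blends the local weights into it, so it keeps its initial weights throughout training (identical to the local network's initial weights, since both are built with the same seed).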

Optimizer (Adam): Updates the network's weights by minimizing the loss function during learning.

`learning_rate:` Controls how much the weights are updated during backpropagation.

`self.memory = deque(maxlen=10000)`

> `Replay Buffer (memory):` Stores experiences (state, action, reward, next_state, done).

>`maxlen=10000:` Limits buffer size to the most recent 10,000 experiences.




### 2. Storing Experiences – step:
This method handles experience storage and learning.
```
state = preprocess_frame(state)
next_state = preprocess_frame(next_state)
```
***Frame Preprocessing:*** Resizes and converts raw frames into tensors. Ensures that the network gets consistent input.

***`self.memory.append((state, action, reward, next_state, done))`*** :

**Experience Storage:** Appends the experience to the replay buffer (deque).

Learning starts only after the buffer has enough experiences for a full minibatch.

```
experiences = random.sample(self.memory, k=minibatch_size)
self.learn(experiences, discount_factor)
```
**Random Sampling (Experience Replay):**
Randomly samples a batch of experiences to break correlation between consecutive states and improve learning.
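The buffer behaviour described above can be sketched with the standard library only. The experiences here are placeholder strings rather than frames, and the buffer size of 5 is chosen just to keep the demo readable (the notebook uses `maxlen=10000`).

```python
import random
from collections import deque

memory = deque(maxlen=5)  # small on purpose; the notebook uses maxlen=10000

# Store placeholder experiences: (state, action, reward, next_state, done)
for t in range(8):
    memory.append((f"s{t}", t % 2, 1.0, f"s{t + 1}", False))

# Only the 5 most recent experiences remain; the oldest were dropped
stored_states = [experience[0] for experience in memory]

# Uniform sampling without replacement, as in Agent.step
random.seed(0)
minibatch = random.sample(list(memory), k=3)
print(stored_states, len(minibatch))
```

Because the sample is drawn uniformly from the whole buffer, consecutive frames rarely end up in the same minibatch.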



### 3. Action Selection – act:

***Epsilon-Greedy Strategy :***
Selects actions based on predicted Q-values or explores random actions with probability epsilon (to encourage exploration).

```
state = preprocess_frame(state).to(self.device)
self.local_qnetwork.eval()
```
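`eval():` Sets the network to evaluation mode, so the batch normalization layers use their running statistics instead of the statistics of the current single-frame batch, and do not update them.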

```
with torch.no_grad():
    action_values = self.local_qnetwork(state)
```
`torch.no_grad():` Disables gradient computation during action selection (for efficiency).

**Why?**

Action selection is inference; we don’t need gradients. This reduces computation time.

```
self.local_qnetwork.train()
if random.random() > epsilon:
    return np.argmax(action_values.cpu().data.numpy())
else:
    return random.choice(np.arange(self.action_size))
```
`train():` Puts the network back in training mode.

`np.argmax():` Chooses the action with the highest Q-value (exploitation).

`random.choice():` Picks an action uniformly at random (exploration); this branch is taken with probability epsilon.
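The selection rule can be isolated from the network. A minimal sketch with made-up Q-values; the function name `epsilon_greedy` and its `rng` parameter are hypothetical and only mirror the branch structure of `Agent.act`:

```python
import random

def epsilon_greedy(action_values, epsilon, rng=random):
    """Exploit with probability 1 - epsilon, otherwise explore uniformly."""
    if rng.random() > epsilon:
        # Exploit: index of the largest Q-value
        return max(range(len(action_values)), key=lambda a: action_values[a])
    # Explore: uniform choice over all actions
    return rng.randrange(len(action_values))

q_values = [0.1, 0.7, 0.3, 0.2]  # made-up Q-values for four actions

greedy_action = epsilon_greedy(q_values, epsilon=0.0)
random_actions = {epsilon_greedy(q_values, epsilon=1.0) for _ in range(200)}
print(greedy_action, sorted(random_actions))
```

With `epsilon=0.0` the call behaves like the default `agent.act(state)` used for evaluation; with `epsilon=1.0` the Q-values are ignored entirely.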



### 4. Learning – learn:
Learning from Experiences: Updates the network by computing the loss between predicted Q-values and target Q-values.

`states, actions, rewards, next_states, dones = zip(*experiences)`

Unpacks the list of experience tuples into five tuples, one per field (states, actions, rewards, next states, done flags).

***The `np.vstack` lines:*** Stack each field into a single array, convert it to a tensor, and move it to the GPU (if available).

```
next_q_targets = self.target_qnetwork(next_states).detach().max(1)[0].unsqueeze(1)
```
***Target Q-values:***

Predicts Q-values for next_states using the target network.

`.detach():` Ensures gradients are not computed for the target network (target is fixed).

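`max(1)[0]:` `max(1)` takes the maximum over the action dimension and returns (values, indices); `[0]` keeps the values, i.e. the highest Q-value for each next state.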

`unsqueeze(1):` Reshapes the tensor to align with reward tensor dimensions.

***Bellman Equation:***
Calculates target Q-values by incorporating rewards and discounted future Q-values.
If done = 1 (episode ends), future Q-values are ignored.
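Written out for a single transition, the line `q_targets = rewards + discount_factor * next_q_targets * (1 - dones)` corresponds to the target below. The symbols are notation introduced here, not taken from the notebook: $r$ is the reward, $\gamma$ the discount factor, $d$ the done flag, $s'$ the next state, and $Q_{\text{target}}$ the target network.

$$
y = r + \gamma \,(1 - d)\, \max_{a'} Q_{\text{target}}(s', a')
$$

With $d = 1$ the second term vanishes and the target reduces to $y = r$. The MSE loss over a minibatch of $N$ transitions then compares these targets with the local network's Q-values for the actions taken:

$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \bigl( Q_{\text{local}}(s_i, a_i) - y_i \bigr)^2
$$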

***Q-values for Current Actions:***
Extracts Q-values for the actions actually taken during training.

***Loss Function (MSE):*** Measures the difference between predicted and target Q-values.
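The three steps above (targets, Q-values of the taken actions, loss) can be checked by hand on a tiny batch. This sketch uses NumPy stand-ins for the PyTorch calls and made-up numbers for two transitions and three actions; it illustrates the arithmetic only and is not the training code.

```python
import numpy as np

discount_factor = 0.99

# Made-up values for a minibatch of 2 transitions and 3 actions
rewards = np.array([[10.0], [0.0]])
dones = np.array([[0.0], [1.0]])           # second transition ends the episode
actions = np.array([[2], [0]])             # actions actually taken
next_q = np.array([[1.0, 3.0, 2.0],        # target-network output for next_states
                   [5.0, 4.0, 6.0]])
current_q = np.array([[0.5, 0.2, 12.0],    # local-network output for states
                      [1.0, 0.0, 0.3]])

# Max over the action axis, kept as a column: like .max(1)[0].unsqueeze(1)
next_q_targets = next_q.max(axis=1, keepdims=True)
q_targets = rewards + discount_factor * next_q_targets * (1 - dones)

# Q-value of the taken action in each row: like .gather(1, actions)
q_expected = np.take_along_axis(current_q, actions, axis=1)

loss = np.mean((q_expected - q_targets) ** 2)  # like F.mse_loss
print(q_targets.ravel(), q_expected.ravel(), loss)
```

The second row shows the effect of `(1 - dones)`: the future value is dropped and the target equals the reward alone.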

```
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
```
***Backpropagation:***
Clears gradients (zero_grad), computes gradients (backward), and updates weights (step) to minimize loss.











### Initializing the DCQN agent

In [None]:
agent = Agent(number_actions)

### Training the DCQN agent

In [None]:
number_episodes = 2000
maximum_number_timesteps_per_episode = 10000
epsilon_starting_value  = 1.0
epsilon_ending_value  = 0.01
epsilon_decay_value  = 0.995
epsilon = epsilon_starting_value
scores_on_100_episodes = deque(maxlen = 100)

for episode in range(1, number_episodes + 1):
  state, _ = env.reset()
  score = 0
  for t in range(maximum_number_timesteps_per_episode):
    action = agent.act(state, epsilon)
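    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated  # Gymnasium reports termination and truncation separately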
    agent.step(state, action, reward, next_state, done)
    state = next_state
    score += reward
    if done:
      break
  scores_on_100_episodes.append(score)
  epsilon = max(epsilon_ending_value, epsilon_decay_value * epsilon)
  print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)), end = "")
  if episode % 100 == 0:
    print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)))
  if np.mean(scores_on_100_episodes) >= 350.0:
    print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode - 100, np.mean(scores_on_100_episodes)))
    torch.save(agent.local_qnetwork.state_dict(), 'checkpoint.pth')
    break

Episode 100	Average Score: 332.60
Episode 116	Average Score: 352.10
Environment solved in 16 episodes!	Average Score: 352.10


`number_episodes:` Total episodes for training (2000).

`maximum_number_timesteps_per_episode:` Max steps per episode (limits how long the agent can interact with the environment).

`epsilon_starting_value:` Initial exploration rate (100% random actions at start).

`epsilon_ending_value:` Floor for the exploration rate (epsilon never drops below 0.01, i.e. 1% random actions).

`epsilon_decay_value:` Epsilon decay rate (multiplies epsilon by 0.995 each episode).

`epsilon:` Current exploration rate (starts at 1.0 and decays).

`scores_on_100_episodes:` Stores the last 100 scores to track performance.
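To get a feel for how slowly exploration is reduced, the decay rule from the training loop can be run on its own. A small sketch using the same three hyperparameter values; the list `schedule` and the variable `first_floor_episode` are introduced here for illustration:

```python
epsilon_starting_value = 1.0
epsilon_ending_value = 0.01
epsilon_decay_value = 0.995

epsilon = epsilon_starting_value
schedule = []  # epsilon used in episode 1, 2, 3, ...
for episode in range(1, 2001):
    schedule.append(epsilon)
    epsilon = max(epsilon_ending_value, epsilon_decay_value * epsilon)

# First episode that runs with epsilon at its floor
first_floor_episode = schedule.index(epsilon_ending_value) + 1
print(schedule[0], schedule[99], first_floor_episode)
```

In the recorded run, training stopped at episode 116, when epsilon was still above 0.5, so more than half of the steps were still taking the random branch.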


`for episode in range(1, number_episodes + 1):`
Loops over episodes (each episode is one complete interaction with the environment until done).

`env.reset():` Resets the environment to start a new episode.

`state:` The initial state of the environment.

`score:` Tracks the total reward for the episode.

`for t in range(maximum_number_timesteps_per_episode):`
The agent interacts with the environment for up to 10,000 timesteps per episode (or until done).

`action = agent.act(state, epsilon)`:
The agent selects an action with an epsilon-greedy policy. Epsilon is high at the start, which gives more random actions (exploration), and low later, so the agent mostly picks the action with the highest Q-value (exploitation).

**The environment returns:**

`next_state:` New state after the action.

`reward:` Immediate reward for the action.

`done:` Whether the episode is over (goal reached or failure).

```
agent.step(state, action, reward, next_state, done)
```
Stores the experience (state, action, reward, next_state, done) in replay memory.

If enough experiences are stored, the agent samples a minibatch and learns from it.



```
state = next_state
score += reward
```
Updates the state for the next timestep.
Accumulates the reward for tracking performance.

If the environment returns done = True, the episode terminates early (agent either succeeded or failed).

`scores_on_100_episodes.append(score)`: Stores the episode score (total reward) so that the average over the last 100 episodes can be monitored.
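A bounded `deque` makes the running average a sliding window. The sketch below uses a window of 3 and made-up scores so the effect is easy to follow; the names `scores_window` and `averages` are illustrative only.

```python
from collections import deque

scores_window = deque(maxlen=3)  # the notebook uses maxlen=100
averages = []
for score in [100.0, 200.0, 300.0, 400.0, 500.0]:  # made-up episode scores
    scores_window.append(score)
    # Early on the mean covers fewer scores than the window size
    averages.append(sum(scores_window) / len(scores_window))

print(averages)
```

The same applies to the training loop: before 100 episodes have been played, the reported average covers only the episodes seen so far.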

```
print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)), end="")
```
Prints the average score over the last 100 episodes after each episode; `\r` and `end=""` make each print overwrite the previous line to keep the output clean.

```
if episode % 100 == 0:
    print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)))
```
Every 100 episodes, the average score is printed. This helps track learning progress.

```
if np.mean(scores_on_100_episodes) >= 350.0:
    print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode - 100, np.mean(scores_on_100_episodes)))
    torch.save(agent.local_qnetwork.state_dict(), 'checkpoint.pth')
    break
```
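***Goal:***
If the average score over the last 100 episodes reaches 350 or more, the environment is considered solved. The message reports `episode - 100`, which is why the run above prints "solved in 16 episodes" at episode 116.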

`torch.save()` saves the model weights to 'checkpoint.pth' for future use.
Training stops (break) once the environment is solved.

## Part 3 - Visualizing the results

In [None]:
import glob
import io
import base64
import imageio
from IPython.display import HTML, display

def show_video_of_model(agent, env_name):
    env = gym.make(env_name, render_mode='rgb_array')
    state, _ = env.reset()
    done = False
    frames = []
    while not done:
        frame = env.render()
        frames.append(frame)
        action = agent.act(state)
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # stop on either end-of-episode signal
    env.close()
    imageio.mimsave('video.mp4', frames, fps=30)

show_video_of_model(agent, 'MsPacmanDeterministic-v0')

def show_video():
    mp4list = glob.glob('*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

show_video()

