#STUDENT DETAILS

STUDENT NAME : Aayush Thakar

STUDENT ID : 24041785

#Why is Reinforcement Learning the ML paradigm of choice for this task?

Reinforcement Learning (RL) is the best fit because it trains an agent to make sequential decisions in an environment that keeps changing. In the Atari game **[ALE/DemonAttack-v5]**, the agent cannot simply react in the moment; it has to anticipate, learn from the outcome of each action, and adjust toward what works. Every action it takes, such as dodging a shot or firing at a demon, produces feedback in the form of a reward. As it plays more, it learns how to play better and raise its score.

Unlike supervised learning, which trains on labelled data, RL does not require labels. The agent is not told what to do from a knowledge base; it learns by trying actions and observing what follows. That is why RL, and Deep Q-Networks (DQNs) in particular, is the method of choice for games like this.

INSTALLING NECESSARY PACKAGES

In [1]:
%pip install gym[atari] gym[accept-rom-license] torch torchvision numpy opencv-python




In [2]:
%pip install matplotlib tqdm ale-py




In [3]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118
Note: you may need to restart the kernel to use updated packages.



IMPORTS

In [4]:
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
from collections import deque
import cv2
import matplotlib.pyplot as plt

NVIDIA GPU - GeForce GTX 1650

In [5]:
# True if a CUDA-capable GPU is available, otherwise False
print(torch.cuda.is_available())

# Name of the first GPU (only queried when one is present)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))

True
NVIDIA GeForce GTX 1650


In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [7]:
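import os
# Debugging aids for CUDA errors. CUDA_LAUNCH_BLOCKING makes kernel launches
# synchronous so an error surfaces at the failing call; it needs to be set
# before CUDA is initialised to take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["TORCH_USE_CUDA_DSA"] = "1"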

Hyperparameters

In [8]:
#Hyperparameters
num_episodes = 1000  
epsilon_start = 1.0
epsilon_end = 0.01
epsilon_decay = 50000  
gamma = 0.99
batch_size = 32  
lr = 1e-4
memory_capacity = 5000  
update_target_every = 1000
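
To get a feel for how these values interact, the exponential schedule used later in the training loop, `epsilon_end + (epsilon_start - epsilon_end) * exp(-steps_done / epsilon_decay)`, can be evaluated on its own. This is a minimal sketch; the helper name `epsilon_at` is hypothetical and not part of the notebook.

```python
import math

# Values copied from the hyperparameter cell
epsilon_start = 1.0
epsilon_end = 0.01
epsilon_decay = 50000

def epsilon_at(step):
    """Exploration rate after `step` environment steps (hypothetical helper)."""
    return epsilon_end + (epsilon_start - epsilon_end) * math.exp(-step / epsilon_decay)

# Print the exploration rate at a few points in training
for step in (0, 10_000, 50_000, 200_000):
    print(f"step {step:>7}: epsilon = {epsilon_at(step):.4f}")
```

With `epsilon_decay = 50000`, epsilon is still roughly 0.37 after 50,000 steps, so the agent keeps exploring heavily for many episodes.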

In [9]:
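def preprocess_frame(frame):
    # Convert the RGB frame to grayscale and downsample to 84x84
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    resized = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    # Scale to [0, 1] as float32 (plain division would produce float64)
    return (resized / 255.0).astype(np.float32)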

def stack_frames(stacked_frames, state, is_new_episode, stack_size=4):
    frame = preprocess_frame(state)

    if is_new_episode:
        # Start of an episode: fill the whole stack with the first frame
        stacked_frames = deque([frame] * stack_size, maxlen=stack_size)
    else:
        # Otherwise the deque drops the oldest frame as the newest is appended
        stacked_frames.append(frame)

    stacked_state = np.stack(stacked_frames, axis=0)  
    return stacked_state, stacked_frames
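
The rolling behaviour of the frame stack can be seen without any image data. The sketch below uses string labels in place of the 84x84 arrays (an assumption made purely for illustration) and relies only on `deque(maxlen=...)`, the same container `stack_frames` uses.

```python
from collections import deque

stack_size = 4

# New episode: every slot holds the first frame (labels stand in for 84x84 arrays)
frames = deque(["f0"] * stack_size, maxlen=stack_size)
print(list(frames))

# Later steps: each append pushes the oldest frame out on the left
for label in ("f1", "f2"):
    frames.append(label)
print(list(frames))
```

Because the deque has a fixed `maxlen`, the stack always holds exactly `stack_size` frames, which is what gives the network its 4-channel input.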

ATARI Environment : DemonAttack-v5

You are facing waves of demons on the ice planet of Krybor. Points are accumulated by destroying demons. You begin with 3 reserve bunkers and can increase their number (up to 6) by avoiding enemy attacks. Each attack wave you survive without taking a hit grants you a new bunker. Every time an enemy hits you, a bunker is destroyed. When the last bunker falls, the next enemy hit will destroy you and the game ends.

In [16]:
env = gym.make("ALE/DemonAttack-v5", render_mode='human')
n_actions = env.action_space.n

In [11]:
#DQN class
class DQN(nn.Module):
    def __init__(self, action_size):
        super(DQN, self).__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),  
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512),
            nn.ReLU(),
            nn.Linear(512, action_size)
        )

    def forward(self, x):
        return self.net(x)
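
The `7 * 7 * 64` input size of the first linear layer follows from the three convolutions applied to an 84x84 input. As a quick check, assuming the `nn.Conv2d` defaults of no padding and dilation 1 (the helper `conv_out` is a hypothetical name used only here):

```python
def conv_out(size, kernel, stride):
    """Output width of a convolution with no padding and dilation 1."""
    return (size - kernel) // stride + 1

size = 84
# (kernel_size, stride) of the three Conv2d layers in the DQN class
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    print(size)

# The last convolution has 64 output channels
flat_features = size * size * 64
print(flat_features)
```

The spatial size shrinks from 84 to 20, then 9, then 7, so `nn.Flatten()` hands `7 * 7 * 64 = 3136` features to the linear layer.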


In [12]:
def train(memory, policy_net, target_net, optimizer, batch_size, gamma, use_double_dqn, device):
    if len(memory) < batch_size:
        return

    states, actions, rewards, next_states, dones = memory.sample(batch_size)

    # Verify tensor shapes
    assert states.shape[1:] == (4, 84, 84), f"Expected shape [batch, 4, 84, 84], got {states.shape}"
    assert next_states.shape[1:] == (4, 84, 84), f"Expected shape [batch, 4, 84, 84], got {next_states.shape}"

    states = torch.FloatTensor(states).to(device)
    actions = torch.LongTensor(actions).unsqueeze(1).to(device)
    rewards = torch.FloatTensor(rewards).unsqueeze(1).to(device)
    next_states = torch.FloatTensor(next_states).to(device)
    dones = torch.FloatTensor(dones).unsqueeze(1).to(device)

    # Check for NaN or Inf values in next_states
    if torch.isnan(next_states).any() or torch.isinf(next_states).any():
        print("Warning: NaN or Inf detected in next_states tensor")
        return

    q_values = policy_net(states).gather(1, actions)

    with torch.no_grad():
        if use_double_dqn:
            next_actions = policy_net(next_states).max(1)[1].unsqueeze(1)
            next_q_values = target_net(next_states).gather(1, next_actions)
        else:
            next_q_values = target_net(next_states).max(1)[0].unsqueeze(1)

        expected_q_values = rewards + (gamma * next_q_values * (1 - dones))

    loss = F.mse_loss(q_values, expected_q_values)

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy_net.parameters(), max_norm=1.0)
    optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()

    # Explicitly free tensors
    del states, actions, rewards, next_states, dones, q_values, next_q_values, expected_q_values, loss
    if device.type == "cuda":
        torch.cuda.empty_cache()
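
The only difference between the two branches of `train` is how the target value for the next state is formed. A small worked example with made-up Q-values (the numbers are assumptions chosen for illustration, not taken from a training run) shows the effect on a single transition:

```python
gamma = 0.99
reward, done = 1.0, 0.0

# Made-up Q-values for the next state, one entry per action
policy_next = [1.0, 3.0, 2.0]   # from the policy network
target_next = [2.5, 1.5, 4.0]   # from the target network

# Vanilla DQN: the target network both picks and scores the action
vanilla_target = reward + gamma * max(target_next) * (1 - done)

# Double DQN: the policy network picks the action, the target network scores it
best_action = max(range(len(policy_next)), key=policy_next.__getitem__)
double_target = reward + gamma * target_next[best_action] * (1 - done)

print(vanilla_target, double_target)
```

Here the vanilla target uses the largest target-network value, while the Double DQN target uses the target-network value of the action the policy network prefers, which is never larger than the vanilla target.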

In [13]:
#Replay Buffer
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = map(np.stack, zip(*batch))
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)
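
Since every transition stores two stacked states, the buffer's memory footprint depends heavily on the dtype of the frames. The estimate below is a rough sketch: it counts only the raw array data in host RAM and ignores Python object overhead as well as the action, reward and done fields.

```python
capacity = 5000             # memory_capacity from the hyperparameter cell
values_per_state = 4 * 84 * 84
states_per_transition = 2   # state and next_state

def buffer_bytes(bytes_per_value):
    """Raw array bytes held by a full buffer (hypothetical helper)."""
    return capacity * states_per_transition * values_per_state * bytes_per_value

for name, width in [("float64", 8), ("float32", 4), ("uint8", 1)]:
    print(f"{name}: {buffer_bytes(width) / 1e9:.2f} GB")
```

Storing frames in a narrower dtype is one way to fit a larger `memory_capacity` in the same amount of RAM.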

In [14]:
def select_action(state, steps_done, epsilon, policy_net, device, env):
    if random.random() > epsilon:
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)  
            q_values = policy_net(state_tensor)
            action = q_values.max(1)[1].item()
            # Free memory
            del state_tensor, q_values
            if device.type == "cuda":
                torch.cuda.empty_cache()
            return action
    else:
        return env.action_space.sample()
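
The same epsilon-greedy rule can be tried in isolation, without a network or an environment. In this sketch `epsilon_greedy` is a hypothetical stand-in for `select_action`, and the Q-values are made up:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the argmax with probability 1 - epsilon, else a uniformly random action."""
    if rng.random() > epsilon:
        return max(range(len(q_values)), key=q_values.__getitem__)
    return rng.randrange(len(q_values))

q_values = [0.1, 0.9, 0.3]  # made-up values for three actions
rng = random.Random(0)      # seeded so the run is repeatable

greedy = [epsilon_greedy(q_values, 0.0, rng) for _ in range(5)]
explore = [epsilon_greedy(q_values, 1.0, rng) for _ in range(5)]
print(greedy, explore)
```

With `epsilon = 0.0` the highest-valued action is chosen every time; with `epsilon = 1.0` every choice is random.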

In [17]:
#Vanilla DQN
def train_vanilla_dqn(env, num_episodes, device, epsilon_start, epsilon_end, epsilon_decay, 
                      batch_size, gamma, update_target_every, memory_capacity, lr, save_path='vanilla_best.pt'):
    policy_net = DQN(env.action_space.n).to(device)
    target_net = DQN(env.action_space.n).to(device)
    target_net.load_state_dict(policy_net.state_dict())
    optimizer = optim.Adam(policy_net.parameters(), lr=lr)
    memory = ReplayBuffer(memory_capacity)

    all_rewards = []
    epsilon_values = []
    steps_done = 0
    best_reward = -float('inf')

    for episode in range(num_episodes):
        # Reset once and unpack; newer gym versions return (obs, info)
        reset_result = env.reset()
        obs = reset_result[0] if isinstance(reset_result, tuple) else reset_result
        state, stacked_frames = stack_frames(None, obs, True)
        total_reward = 0
        episode_epsilon = []
        done = False

        while not done:
            env.render()
            epsilon = epsilon_end + (epsilon_start - epsilon_end) * np.exp(-1. * steps_done / epsilon_decay)
            episode_epsilon.append(epsilon)
            action = select_action(state, steps_done, epsilon, policy_net, device, env)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            next_state, stacked_frames = stack_frames(stacked_frames, next_obs, False)
            done = terminated or truncated

            memory.push(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
            steps_done += 1

            train(memory, policy_net, target_net, optimizer, batch_size, gamma, use_double_dqn=False, device=device)

            if steps_done % update_target_every == 0:
                target_net.load_state_dict(policy_net.state_dict())

        all_rewards.append(total_reward)
        epsilon_values.append(np.mean(episode_epsilon))
        print(f"Vanilla DQN | Episode {episode}, Reward: {total_reward}, Avg Epsilon: {epsilon_values[-1]:.4f}")

        if total_reward > best_reward:
            best_reward = total_reward
            torch.save(policy_net.state_dict(), save_path)

        if episode % 50 == 0 and episode > 0:
            policy_net.load_state_dict(torch.load(save_path))
            target_net.load_state_dict(torch.load(save_path))
            print(f"Vanilla DQN | Episode {episode}, Loaded best model from {save_path}")

        if device.type == "cuda":
            torch.cuda.empty_cache()

    return all_rewards, epsilon_values

vanilla_rewards, vanilla_epsilons = train_vanilla_dqn(
    env=env, num_episodes=num_episodes, device=device,
    epsilon_start=epsilon_start, epsilon_end=epsilon_end, epsilon_decay=epsilon_decay,
    batch_size=batch_size, gamma=gamma, update_target_every=update_target_every,
    memory_capacity=memory_capacity, lr=lr
)


Vanilla DQN | Episode 0, Reward: 205.0, Avg Epsilon: 0.9885
Vanilla DQN | Episode 1, Reward: 80.0, Avg Epsilon: 0.9683
Vanilla DQN | Episode 2, Reward: 150.0, Avg Epsilon: 0.9518
Vanilla DQN | Episode 3, Reward: 110.0, Avg Epsilon: 0.9388
Vanilla DQN | Episode 4, Reward: 160.0, Avg Epsilon: 0.9226
Vanilla DQN | Episode 5, Reward: 130.0, Avg Epsilon: 0.9053
Vanilla DQN | Episode 6, Reward: 235.0, Avg Epsilon: 0.8882
Vanilla DQN | Episode 7, Reward: 90.0, Avg Epsilon: 0.8706
Vanilla DQN | Episode 8, Reward: 90.0, Avg Epsilon: 0.8593
Vanilla DQN | Episode 9, Reward: 130.0, Avg Epsilon: 0.8489
Vanilla DQN | Episode 10, Reward: 205.0, Avg Epsilon: 0.8293
Vanilla DQN | Episode 11, Reward: 265.0, Avg Epsilon: 0.8047
Vanilla DQN | Episode 12, Reward: 120.0, Avg Epsilon: 0.7872
Vanilla DQN | Episode 13, Reward: 80.0, Avg Epsilon: 0.7757
Vanilla DQN | Episode 14, Reward: 80.0, Avg Epsilon: 0.7667
Vanilla DQN | Episode 15, Reward: 220.0, Avg Epsilon: 0.7557
Vanilla DQN | Episode 16, Reward: 190.0

  policy_net.load_state_dict(torch.load(save_path))
  target_net.load_state_dict(torch.load(save_path))


Vanilla DQN | Episode 50, Loaded best model from vanilla_best.pt
Vanilla DQN | Episode 51, Reward: 340.0, Avg Epsilon: 0.4539
Vanilla DQN | Episode 52, Reward: 110.0, Avg Epsilon: 0.4443
Vanilla DQN | Episode 53, Reward: 110.0, Avg Epsilon: 0.4385
Vanilla DQN | Episode 54, Reward: 100.0, Avg Epsilon: 0.4336
Vanilla DQN | Episode 55, Reward: 100.0, Avg Epsilon: 0.4283
Vanilla DQN | Episode 56, Reward: 100.0, Avg Epsilon: 0.4228
Vanilla DQN | Episode 57, Reward: 150.0, Avg Epsilon: 0.4175
Vanilla DQN | Episode 58, Reward: 150.0, Avg Epsilon: 0.4121
Vanilla DQN | Episode 59, Reward: 130.0, Avg Epsilon: 0.4067
Vanilla DQN | Episode 60, Reward: 30.0, Avg Epsilon: 0.4027
Vanilla DQN | Episode 61, Reward: 100.0, Avg Epsilon: 0.3993
Vanilla DQN | Episode 62, Reward: 205.0, Avg Epsilon: 0.3930
Vanilla DQN | Episode 63, Reward: 175.0, Avg Epsilon: 0.3868
Vanilla DQN | Episode 64, Reward: 90.0, Avg Epsilon: 0.3825
Vanilla DQN | Episode 65, Reward: 205.0, Avg Epsilon: 0.3772
Vanilla DQN | Episode 

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

In [None]:
def train_double_dqn(env, num_episodes, device, epsilon_start, epsilon_end, epsilon_decay, 
                     batch_size, gamma, update_target_every, memory_capacity, lr, save_path='double_best.pt'):
    # Initialize networks
    policy_net = DQN(env.action_space.n).to(device)
    target_net = DQN(env.action_space.n).to(device)
    target_net.load_state_dict(policy_net.state_dict())
    optimizer = optim.Adam(policy_net.parameters(), lr=lr)
    memory = ReplayBuffer(memory_capacity)
    
    all_rewards = []
    epsilon_values = []
    steps_done = 0
    best_reward = -float('inf')

    for episode in range(num_episodes):
        # Reset once and unpack; newer gym versions return (obs, info)
        reset_result = env.reset()
        obs = reset_result[0] if isinstance(reset_result, tuple) else reset_result
        state, stacked_frames = stack_frames(None, obs, True)
        total_reward = 0
        episode_epsilon = []
        done = False

        while not done:
            env.render()
            epsilon = epsilon_end + (epsilon_start - epsilon_end) * np.exp(-1. * steps_done / epsilon_decay)
            episode_epsilon.append(epsilon)

            action = select_action(state, steps_done, epsilon, policy_net, device, env)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            next_state, stacked_frames = stack_frames(stacked_frames, next_obs, False)
            done = terminated or truncated

            memory.push(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
            steps_done += 1

            train(memory, policy_net, target_net, optimizer, batch_size, gamma, use_double_dqn=True, device=device)

            if steps_done % update_target_every == 0:
                target_net.load_state_dict(policy_net.state_dict())

        all_rewards.append(total_reward)
        epsilon_values.append(np.mean(episode_epsilon))
        print(f"Double DQN | Episode {episode}, Reward: {total_reward}, Avg Epsilon: {epsilon_values[-1]:.4f}")

        if total_reward > best_reward:
            best_reward = total_reward
            torch.save(policy_net.state_dict(), save_path)

        if episode % 50 == 0 and episode > 0:
            policy_net.load_state_dict(torch.load(save_path))
            target_net.load_state_dict(torch.load(save_path))
            print(f"Double DQN | Episode {episode}, Loaded best model from {save_path}")

        # Clear GPU memory after each episode
        if device.type == "cuda":
            torch.cuda.empty_cache()

    return all_rewards, epsilon_values

double_rewards, double_epsilons = train_double_dqn(
    env=env, num_episodes=num_episodes, device=device,
    epsilon_start=epsilon_start, epsilon_end=epsilon_end, epsilon_decay=epsilon_decay,
    batch_size=batch_size, gamma=gamma, update_target_every=update_target_every,
    memory_capacity=memory_capacity, lr=lr
)

env.close()

In [None]:
# Build both panels in one cell so they end up on the same figure
plt.figure(figsize=(14, 6))

# Left panel: reward per episode
plt.subplot(1, 2, 1)
plt.plot(vanilla_rewards, label='Vanilla DQN', color='blue')
plt.plot(double_rewards, label='Double DQN', color='orange')
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Training Rewards: Vanilla vs. Double DQN')
plt.legend()
plt.grid(True)

# Right panel: average epsilon per episode
plt.subplot(1, 2, 2)
plt.plot(vanilla_epsilons, label='Vanilla DQN', color='blue')
plt.plot(double_epsilons, label='Double DQN', color='orange')
plt.xlabel('Episode')
plt.ylabel('Average Epsilon')
plt.title('Epsilon Decay: Vanilla vs. Double DQN')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()