# Navigation Project

## Section 1: Setup and Installation

This section sets up the environment by configuring paths and installing necessary packages.
It ensures compatibility with the Unity ML-Agents environment and checks the NumPy version.

In [1]:
import os
os.environ['PATH'] = f"{os.environ['PATH']}:/home/student/.local/bin"
os.environ['PATH'] = f"{os.environ['PATH']}:/opt/conda/lib/python3.10/site-packages"
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

In [2]:
# Check NumPy version
!python -m pip freeze | grep numpy

numpy @ file:///work/mkl/numpy_and_numpy_base_1682953417311/work


In [3]:
# Install required packages quietly
!pip -q install .

## Section 2: Import Libraries

Imports all necessary libraries for the reinforcement learning task, including PyTorch for neural networks,
NumPy for numerical operations, and Matplotlib for visualization.

In [4]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import random
from collections import deque
from unityagents import UnityEnvironment
import time
import matplotlib.pyplot as plt

## Section 3: Initialize Environment

Initializes the Unity ML-Agents Banana environment and extracts state and action space sizes.
The environment is set to training mode for faster simulation.

In [5]:
env = UnityEnvironment(file_name="/data/Banana_Linux_NoVis/Banana.x86_64")
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
state_size = 37  # Ray perceptions + velocity
action_size = 4  # Forward, backward, left, right

Found path: /data/Banana_Linux_NoVis/Banana.x86_64
Mono path[0] = '/data/Banana_Linux_NoVis/Banana_Data/Managed'
Mono config path = '/data/Banana_Linux_NoVis/Banana_Data/MonoBleedingEdge/etc'
Preloaded 'libgrpc_csharp_ext.x64.so'
Unable to preload the following plugins:
	libgrpc_csharp_ext.x86.so
Logging to /home/student/.config/unity3d/Unity Technologies/Unity Environment/Player.log


INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


## Section 4: State and Action Space Explanation

State Space (37 dimensions):
- 35 dimensions: Ray-based perception from 7 rays at angles [20, 90, 160, 45, 135, 70, 110].
  Each ray returns 5 values: [Yellow Banana, Wall, Blue Banana, Agent, Distance].
- 2 dimensions: Agent's velocity (left/right and forward/backward).
- Fully observable, capturing all relevant information about obstacles and agent dynamics.

Action Space (4 discrete actions):
- 0: Move forward
- 1: Move backward
- 2: Turn left
- 3: Turn right

Objective:
- Collect yellow bananas (+1 reward) while avoiding blue bananas (-1 reward).
- Goal: Achieve an average score of +13 over 100 consecutive episodes.

## Section 5: Dueling DQN Network

Defines the Dueling DQN architecture, which splits Q-values into value and advantage streams for
improved value estimation.

In [7]:
class DuelingQNetwork(nn.Module):
    def __init__(self, state_size, action_size, seed):
        super(DuelingQNetwork, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc_value = nn.Linear(64, 1)          # Value stream
        self.fc_advantage = nn.Linear(64, action_size)  # Advantage stream

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        value = self.fc_value(x)
        advantage = self.fc_advantage(x)
        return value + advantage - advantage.mean()  # Combine streams

## Section 6: Prioritized Replay Buffer

Implements a Prioritized Experience Replay buffer, which prioritizes experiences with higher
TD errors for more efficient learning.

In [8]:
class PrioritizedReplayBuffer:
    def __init__(self, buffer_size, batch_size, seed, alpha=0.7, beta=0.5):
        self.buffer_size = buffer_size
        self.batch_size = batch_size
        # self.alpha = alpha  # Priority exponent (increased for more focus on high-error samples)
        self.alpha = 0.6  # Reduced for less extreme prioritization
        # self.beta = beta    # Importance sampling exponent (increased for stronger correction)
        self.beta = 0.4   # Starting value for annealing
        self.seed = random.seed(seed)
        self.memory = []
        self.priorities = []
        self.pos = 0

    def add(self, state, action, reward, next_state, done, priority):
        if len(self.memory) < self.buffer_size:
            self.memory.append(None)
            self.priorities.append(None)
        self.memory[self.pos] = (state, action, reward, next_state, done)
        self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.buffer_size

    def sample(self):
        if len(self.memory) < self.batch_size:
            return None
        priorities = np.array(self.priorities[:len(self.memory)], dtype=np.float32)
        probabilities = priorities ** self.alpha
        probabilities /= probabilities.sum()
        indices = np.random.choice(len(self.memory), self.batch_size, p=probabilities)
        experiences = [self.memory[idx] for idx in indices]
        states = torch.from_numpy(np.vstack([e[0] for e in experiences])).float()
        actions = torch.from_numpy(np.vstack([e[1] for e in experiences])).long()
        rewards = torch.from_numpy(np.vstack([e[2] for e in experiences])).float()
        next_states = torch.from_numpy(np.vstack([e[3] for e in experiences])).float()
        dones = torch.from_numpy(np.vstack([e[4] for e in experiences]).astype(np.uint8)).float()
        weights = (len(self.memory) * probabilities[indices]) ** (-self.beta)
        weights /= weights.max()
        return (states, actions, rewards, next_states, dones, indices, torch.from_numpy(weights).float())

    def update_priorities(self, indices, priorities):
        for idx, priority in zip(indices, priorities):
            self.priorities[idx] = priority + 1e-5  # Small constant to avoid zero priority

    def __len__(self):
        return len(self.memory)

## Section 7: Agent Class

Defines the Agent, which uses Dueling DQN with PER, Double DQN, and multi-step learning for
improved performance.

In [9]:
class Agent:
    def __init__(self, state_size, action_size, seed):
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(seed)
        self.qnetwork_local = DuelingQNetwork(state_size, action_size, seed)
        self.qnetwork_target = DuelingQNetwork(state_size, action_size, seed)
        # self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=0.0005)  # Reduced learning rate
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=0.0001)  # Reduced for stability
        self.memory = PrioritizedReplayBuffer(buffer_size=100000, batch_size=128, seed=seed)
        self.t_step = 0
        self.n_step = 3  # Multi-step horizon
        self.gamma = 0.99  # Discount factor
        self.n_step_buffer = deque(maxlen=self.n_step)  # Buffer for n-step learning
        # new
        self.beta = 0.4  # Initial beta
        self.beta_increment = (1.0 - 0.4) / 1000  # Increment over 1000 episodes

    def step(self, state, action, reward, next_state, done):
        # Add single-step transition immediately
        priority = 1.0 if len(self.memory) == 0 else max(self.memory.priorities)
        self.n_step_buffer.append((state, action, reward, next_state, done))

        if len(self.n_step_buffer) == self.n_step or done:
            state, action, r, ns, d = self.n_step_buffer[0]
            total_reward = r
            if not d:
                for i in range(1, min(self.n_step, len(self.n_step_buffer))):
                    _, _, r, _, d = self.n_step_buffer[i]
                    total_reward += (self.gamma ** i) * r
                    if d:
                        ns = self.n_step_buffer[i][3]
                        break
                if not d and len(self.n_step_buffer) == self.n_step:
                    ns = self.n_step_buffer[-1][3]
            self.memory.add(state, action, total_reward, ns, d, priority)

        self.t_step = (self.t_step + 1) % 4
        if self.t_step == 0 and len(self.memory) >= 128:
            experiences = self.memory.sample()
            if experiences is not None:
                self.learn(experiences, self.gamma)

    def act(self, state, eps=0.):
        state = torch.from_numpy(state).float().unsqueeze(0)
        self.qnetwork_local.eval()
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()
        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))

    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones, indices, weights = experiences
        # Double DQN: Use local network to select actions, target network to evaluate
        _, max_actions = self.qnetwork_local(next_states).detach().max(1)
        Q_targets_next = self.qnetwork_target(next_states).detach().gather(1, max_actions.unsqueeze(1))
        Q_targets = rewards + (gamma ** self.n_step * Q_targets_next * (1 - dones))
        Q_expected = self.qnetwork_local(states).gather(1, actions)
        td_errors = Q_targets - Q_expected
        
        # Update beta each learning step
        self.beta = min(1.0, self.beta + self.beta_increment)
        self.memory.beta = self.beta  # Update buffer’s beta
        
        loss = (weights * td_errors ** 2).mean()
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.qnetwork_local.parameters(), 1.0)  # Clip gradients
        self.optimizer.step()
        # Update priorities based on TD error
        priorities = td_errors.detach().abs().squeeze().cpu().numpy()
        self.memory.update_priorities(indices, priorities)
        # Soft update target network with slightly higher tau
        #for target_param, local_param in zip(self.qnetwork_target.parameters(), self.qnetwork_local.parameters()):
        #   target_param.data.copy_(0.99 * target_param.data + 0.01 * local_param.data)
        tau = 0.001  # Slower update for stability
        for target_param, local_param in zip(self.qnetwork_target.parameters(), self.qnetwork_local.parameters()):
            target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)

## Section 8: Training the Agent

Trains the agent using the improved Dueling DQN with Double DQN and multi-step learning.
Saves model weights when the environment is solved and generates periodic progress plots.

In [10]:
# Initialize the agent
agent = Agent(state_size=state_size, action_size=action_size, seed=0)

# Training parameters
max_episodes = 2000
scores = []
avg_scores = []
scores_window = deque(maxlen=100)
eps = 1.0
eps_min = 0.01
eps_decay = 0.99  # Faster decay for quicker exploitation
start_time = time.time()

# Function to plot and save training progress
def plot_training_progress(scores, avg_scores, episode):
    plt.figure(figsize=(10, 5))
    episodes = range(1, len(scores) + 1)
    plt.plot(episodes, scores, label='Score per episode', alpha=0.5)
    plt.plot(episodes, avg_scores, label='Average over last 100 episodes', color='red')
    plt.axhline(y=13, color='green', linestyle='--', label='Target (13)')
    plt.xlabel('Episode')
    plt.ylabel('Score')
    plt.title(f'Training Progress up to Episode {episode}')
    plt.legend()
    plt.grid(True)
    plt.savefig(f'training_progress_{episode}.png')
    plt.close()

In [11]:
# Training loop
for i_episode in range(1, max_episodes + 1):
    env_info = env.reset(train_mode=True)[brain_name]
    state = env_info.vector_observations[0]
    score = 0
    while True:
        action = agent.act(state, eps)
        env_info = env.step(action)[brain_name]
        next_state = env_info.vector_observations[0]
        reward = env_info.rewards[0]
        done = env_info.local_done[0]
        agent.step(state, action, reward, next_state, done)
        state = next_state
        score += reward
        if done:
            break
    scores.append(score)
    scores_window.append(score)
    # eps = max(eps_min, eps_decay * eps)
    #linear decay :
    eps = max(eps_min, 1.0 - (1.0 - eps_min) * (i_episode / 500))
    avg_score = np.mean(scores_window)
    avg_scores.append(avg_score)
    elapsed_time = time.time() - start_time
    print(f"Episode {i_episode}\tAverage Score: {avg_score:.2f}\tTime Elapsed: {elapsed_time:.2f} seconds")
    
    # Periodic plotting and checkpointing
    if i_episode % 100 == 0:
        print(f"\nEpisode {i_episode}\tAverage Score: {avg_score:.2f}\n")
        plot_training_progress(scores, avg_scores, i_episode)
        torch.save(agent.qnetwork_local.state_dict(), f'model_checkpoint_{i_episode}.pt')
    
    # Check if the environment is solved
    if avg_score >= 13.0 and len(scores_window) == 100:
        print(f"\nEnvironment solved in {i_episode} episodes!\tAverage Score: {avg_score:.2f}")
        torch.save(agent.qnetwork_local.state_dict(), 'model.pt')
        plot_training_progress(scores, avg_scores, i_episode)
        break

Episode 1	Average Score: 0.00	Time Elapsed: 6.87 seconds
Episode 2	Average Score: 0.00	Time Elapsed: 15.56 seconds
Episode 3	Average Score: 0.00	Time Elapsed: 24.17 seconds
Episode 4	Average Score: 0.00	Time Elapsed: 33.16 seconds
Episode 5	Average Score: 0.00	Time Elapsed: 42.17 seconds
Episode 6	Average Score: -0.17	Time Elapsed: 51.26 seconds
Episode 7	Average Score: -0.43	Time Elapsed: 59.86 seconds
Episode 8	Average Score: -0.50	Time Elapsed: 69.08 seconds
Episode 9	Average Score: -0.56	Time Elapsed: 78.18 seconds
Episode 10	Average Score: -0.40	Time Elapsed: 86.97 seconds
Episode 11	Average Score: -0.27	Time Elapsed: 95.38 seconds
Episode 12	Average Score: -0.17	Time Elapsed: 104.38 seconds
Episode 13	Average Score: -0.08	Time Elapsed: 113.67 seconds
Episode 14	Average Score: -0.14	Time Elapsed: 122.58 seconds
Episode 15	Average Score: -0.07	Time Elapsed: 131.87 seconds
Episode 16	Average Score: 0.00	Time Elapsed: 140.68 seconds
Episode 17	Average Score: 0.06	Time Elapsed: 149.56

Episode 137	Average Score: 0.93	Time Elapsed: 1306.66 seconds
Episode 138	Average Score: 0.96	Time Elapsed: 1316.87 seconds
Episode 139	Average Score: 0.97	Time Elapsed: 1327.18 seconds
Episode 140	Average Score: 0.96	Time Elapsed: 1337.79 seconds
Episode 141	Average Score: 0.95	Time Elapsed: 1348.16 seconds
Episode 142	Average Score: 0.97	Time Elapsed: 1358.19 seconds
Episode 143	Average Score: 0.99	Time Elapsed: 1368.56 seconds
Episode 144	Average Score: 1.01	Time Elapsed: 1378.78 seconds
Episode 145	Average Score: 1.05	Time Elapsed: 1389.09 seconds
Episode 146	Average Score: 1.06	Time Elapsed: 1399.26 seconds
Episode 147	Average Score: 1.06	Time Elapsed: 1409.69 seconds
Episode 148	Average Score: 1.09	Time Elapsed: 1420.36 seconds
Episode 149	Average Score: 1.08	Time Elapsed: 1430.68 seconds
Episode 150	Average Score: 1.10	Time Elapsed: 1440.98 seconds
Episode 151	Average Score: 1.14	Time Elapsed: 1451.49 seconds
Episode 152	Average Score: 1.13	Time Elapsed: 1461.79 seconds
Episode 

Episode 269	Average Score: 2.99	Time Elapsed: 2761.56 seconds
Episode 270	Average Score: 3.04	Time Elapsed: 2773.28 seconds
Episode 271	Average Score: 3.06	Time Elapsed: 2785.00 seconds
Episode 272	Average Score: 3.09	Time Elapsed: 2796.96 seconds
Episode 273	Average Score: 3.09	Time Elapsed: 2808.60 seconds
Episode 274	Average Score: 3.11	Time Elapsed: 2820.46 seconds
Episode 275	Average Score: 3.17	Time Elapsed: 2832.36 seconds
Episode 276	Average Score: 3.18	Time Elapsed: 2843.99 seconds
Episode 277	Average Score: 3.14	Time Elapsed: 2855.39 seconds
Episode 278	Average Score: 3.15	Time Elapsed: 2866.80 seconds
Episode 279	Average Score: 3.21	Time Elapsed: 2878.76 seconds
Episode 280	Average Score: 3.24	Time Elapsed: 2890.60 seconds
Episode 281	Average Score: 3.28	Time Elapsed: 2902.00 seconds
Episode 282	Average Score: 3.31	Time Elapsed: 2913.56 seconds
Episode 283	Average Score: 3.28	Time Elapsed: 2924.86 seconds
Episode 284	Average Score: 3.35	Time Elapsed: 2936.26 seconds
Episode 

Episode 401	Average Score: 6.94	Time Elapsed: 4382.66 seconds
Episode 402	Average Score: 7.07	Time Elapsed: 4394.46 seconds
Episode 403	Average Score: 7.15	Time Elapsed: 4407.26 seconds
Episode 404	Average Score: 7.21	Time Elapsed: 4418.89 seconds
Episode 405	Average Score: 7.18	Time Elapsed: 4432.36 seconds
Episode 406	Average Score: 7.21	Time Elapsed: 4445.36 seconds
Episode 407	Average Score: 7.30	Time Elapsed: 4457.86 seconds
Episode 408	Average Score: 7.33	Time Elapsed: 4470.06 seconds
Episode 409	Average Score: 7.36	Time Elapsed: 4482.26 seconds
Episode 410	Average Score: 7.41	Time Elapsed: 4494.26 seconds
Episode 411	Average Score: 7.43	Time Elapsed: 4506.38 seconds
Episode 412	Average Score: 7.49	Time Elapsed: 4518.66 seconds
Episode 413	Average Score: 7.49	Time Elapsed: 4532.26 seconds
Episode 414	Average Score: 7.47	Time Elapsed: 4546.19 seconds
Episode 415	Average Score: 7.51	Time Elapsed: 4560.16 seconds
Episode 416	Average Score: 7.50	Time Elapsed: 4573.76 seconds
Episode 

Episode 533	Average Score: 10.85	Time Elapsed: 6061.96 seconds
Episode 534	Average Score: 10.94	Time Elapsed: 6074.96 seconds
Episode 535	Average Score: 11.11	Time Elapsed: 6087.77 seconds
Episode 536	Average Score: 11.22	Time Elapsed: 6099.40 seconds
Episode 537	Average Score: 11.23	Time Elapsed: 6113.46 seconds
Episode 538	Average Score: 11.18	Time Elapsed: 6127.56 seconds
Episode 539	Average Score: 11.11	Time Elapsed: 6141.28 seconds
Episode 540	Average Score: 11.24	Time Elapsed: 6154.56 seconds
Episode 541	Average Score: 11.23	Time Elapsed: 6168.86 seconds
Episode 542	Average Score: 11.18	Time Elapsed: 6182.06 seconds
Episode 543	Average Score: 11.21	Time Elapsed: 6195.00 seconds
Episode 544	Average Score: 11.28	Time Elapsed: 6208.66 seconds
Episode 545	Average Score: 11.36	Time Elapsed: 6222.86 seconds
Episode 546	Average Score: 11.32	Time Elapsed: 6236.66 seconds
Episode 547	Average Score: 11.43	Time Elapsed: 6249.09 seconds
Episode 548	Average Score: 11.48	Time Elapsed: 6262.56 

In [14]:
# Final plot
plot_training_progress(scores, avg_scores, i_episode)

## Section 9: Closing the Environment

In [13]:
# Closes the Unity environment to free resources.
env.close()