## <center>CSE 546: Reinforcement Learning</center>
### <center>Prof. Alina Vereshchaka</center>
<!-- ### <center>Fall 2022</center> -->

Welcome to Assignment 2, Part 1: Introduction to Deep Reinforcement Learning and Neural Networks! The goal of this assignment is to make you comfortable with applying different neural network structures depending on how the reinforcement learning environment is set up.

We will be working with an implementation of the Wumpus World environment. The environment comes from the book "Artificial Intelligence: A Modern Approach" by Stuart J. Russell and Peter Norvig. 

### ENVIRONMENT DETAILS:

The environment is a 6 x 6 grid world containing a total of 36 grid blocks. 

#### ENVIRONMENT OBJECTS:
The environment consists of the following objects:

1. **Agent** - The agent starts in the grid block at the bottom left corner whose co-ordinates are [0, 0]. The goal of our agent is to collect the Gold while avoiding the Wumpus and the pits. 

2. **Wumpus** - The monster which would eat the agent if they are in the same grid block.

3. **Pit** - The agent must avoid falling into the pits. 

4. **Gold** - The agent must collect the Gold.

5. **Breeze** - Breeze surrounds the Pits and warns the agent of a Pit in an adjacent grid block.

6. **Stench** - Stench surrounds the Wumpus and warns the agent of the Wumpus in an adjacent grid block.

#### ENVIRONMENT OBSERVATIONS:

Our implementation of the environment provides you with four different types of observations:

1. **Integer** - Integer in the range [0 - 35]. This represents the grid block the agent is in. E.g., if the agent is in the bottom left grid block (starting position) the observation would be 0, if the agent is in the grid block containing the Gold the observation would be 34, if the agent is in the top right grid block the observation would be 35.

2. **Vector** - 

    **2.1.** A vector of length 2 representing the agent co-ordinates. The first entry represents the x co-ordinate and the second entry represents the y co-ordinate. E.g., if the agent is in the bottom left grid block (starting position) the observation would be [0, 0], if the agent is in the grid block containing the Gold the observation would be [4, 5], if the agent is in the top right grid block the observation would be [5, 5].
    
    **2.2.** A vector of length 36 representing the one-hot encoding of the integer observation (refer to type 1 above). E.g., if the agent is in the bottom left grid block (starting position) the observation would be [1, 0, ..., 0, 0], if the agent is in the grid block containing the Gold the observation would be [0, 0, ..., 1, 0], if the agent is in the top right grid block the observation would be [0, 0, ..., 0, 1].


3. **Image** - Image render of the environment returned as a NumPy array. The image size is 84 * 84 (same size used in the DQN paper). E.g., if the agent is in the bottom right grid block the observation is:

    Observation: (84 * 84)

     [[255 255 255 ... 255 255 255]

     [255 255 255 ... 255 255 255]

     [255 255 255 ... 255 255 255]

     ...

     [255 255 255 ... 255 255 255]

     [255 255 255 ... 255 255 255]

     [255 255 255 ... 255 255 255]]

    Observation type: <class 'numpy.ndarray'>

    Observation Shape: (84, 84)

    Visually, it looks like:
    <img src="./images/environment_render.png" width="500" height="500">
    

4. **Float** - Float in the range $[0, \infty)$ representing the time elapsed in seconds.
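
The integer, co-ordinate, and one-hot observations all describe the same grid block. The sketch below shows how they could relate, assuming the blocks are numbered row by row starting from the bottom left corner (index = x + 6 * y). That numbering matches the examples given above ([4, 5] and 34 for the Gold, [5, 5] and 35 for the top right block), but it is an assumption, not something confirmed by the environment's source. The helper names are hypothetical.

```python
GRID_SIZE = 6  # the environment is a 6 x 6 grid


def coords_to_index(x, y, grid_size=GRID_SIZE):
    """Assumed row-by-row numbering starting at the bottom left block."""
    return x + grid_size * y


def index_to_coords(index, grid_size=GRID_SIZE):
    """Inverse of coords_to_index under the same assumption."""
    return [index % grid_size, index // grid_size]


def index_to_one_hot(index, num_blocks=GRID_SIZE * GRID_SIZE):
    """Length-36 vector with a single 1 at the agent's block."""
    one_hot = [0] * num_blocks
    one_hot[index] = 1
    return one_hot


print(coords_to_index(0, 0))   # 0  (starting position)
print(coords_to_index(4, 5))   # 34 (Gold)
print(index_to_coords(35))     # [5, 5] (top right block)
```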

#### ENVIRONMENT ACTIONS:

Our implementation of the environment provides you with three different types of actions:

1. **Discrete** - Integer in the range [0 - 3] representing the four actions possible in the environment as follows: 0 - Right, 1 - Left, 2 - Up, 3 - Down.

2. **Multi-Discrete** - Array of length four where each element takes the binary value 0 or 1 and indicates whether the corresponding action is taken. Index 0 corresponds to the right action, index 1 to the left action, index 2 to the up action, and index 3 to the down action. E.g.,
   - action = [1, 0, 0, 0] would result in the agent moving right.
   - action = [1, 0, 1, 0] would result in the agent moving right and up.
   - action = [0, 1, 0, 1] would result in the agent moving left and down.

3. **Continuous** - Float in the range [-1, 1] determining whether the agent will go left, right, up, or down as follows:

    if -1 <= action <= -0.5:
        Go Right.
    elif -0.5 < action <= 0:
        Go Left.
    elif 0 < action <= 0.5:
        Go Up.
    elif 0.5 < action <= 1:
        Go Down.
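
The multi-discrete and continuous encodings described above can be decoded with a few lines of plain Python. This is an illustrative sketch only. The function names are hypothetical and the environment performs its own decoding internally.

```python
ACTION_NAMES = ["right", "left", "up", "down"]  # indices 0-3, as listed above


def multi_discrete_to_directions(action):
    """Return the names of all actions whose flag is set to 1."""
    return [name for name, flag in zip(ACTION_NAMES, action) if flag == 1]


def continuous_to_direction(action):
    """Map a float in [-1, 1] to a direction using the thresholds listed above."""
    if not -1.0 <= action <= 1.0:
        raise ValueError("action must lie in [-1, 1]")
    if action <= -0.5:
        return "right"
    if action <= 0.0:
        return "left"
    if action <= 0.5:
        return "up"
    return "down"


print(multi_discrete_to_directions([1, 0, 1, 0]))  # ['right', 'up']
print(continuous_to_direction(-0.75))              # right
print(continuous_to_direction(0.13))               # up
```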
        
### YOUR TASK IS TO USE A NEURAL NETWORK TO WORK WITH ALL FOUR TYPES OF OBSERVATIONS AND ALL THREE TYPES OF ACTIONS.
### Note: You don't have to train your agent/neural network. You just have to build the neural network structure that takes the observation as input and produces the desired output with the initial weights.

#### You can use libraries such as PyTorch/TensorFlow/Keras to build your neural networks.

#### <span style="color:red">You cannot use RL libraries that already provide the neural network to you, such as Stable-Baselines3, Keras-RL, TF-Agents, Ray RLlib, etc.</span>

In [2]:
# Imports
from environment import WumpusWorldEnvironment
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import random

# Environment

<img src="./images/wumpus_world_environment.jpg" width="600" height="600">

# START COMPLETING YOUR ASSIGNMENT HERE

## Observation Type - Integer, Action Type - Discrete

This part of the assignment requires you to create a sequential dense neural network with 1 hidden layer having 64 neurons and the output layer having 4 neurons. The input to the neural network is an integer (refer to environment observations type 1). The output of the neural network is an array representing the Q-values from which you will choose an action (refer to environment actions type 1).

The following figure shows the network structure you will have to use:

<img src="./images/neural_network_structures/neural_network_1_64_4.png">

In [3]:
"""TO DO: Create a neural network, pass it the observation from the environment
and get the predicted Q-values for the four actions. Print the observation and the Q-values."""

environment = WumpusWorldEnvironment(observation_type='integer', action_type='discrete')
observation, info = environment.reset()

# Convert integer observation to tensor
obs_tensor = torch.tensor([observation], dtype=torch.float32)

# Define the neural network
model = torch.nn.Sequential(
    torch.nn.Linear(1, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 4)
)

q_values = model(obs_tensor)
print("Observation (Integer):", observation)
print("Q-values:", q_values.detach().numpy())


Observation (Integer): 0
Q-values: [-0.10651372  0.20428684 -0.39606595 -0.19233286]
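
The printed Q-values turn into an action once one index is selected. Below is a minimal sketch of greedy and epsilon-greedy selection in plain Python, using the Q-values printed above. They come from randomly initialized weights, so they will differ from run to run. The helper names are hypothetical.

```python
import random

ACTION_NAMES = ["Right", "Left", "Up", "Down"]  # discrete action indices 0-3


def greedy_action(q_values):
    """Index of the largest Q-value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


q_values = [-0.10651372, 0.20428684, -0.39606595, -0.19233286]
action = greedy_action(q_values)
print(action, ACTION_NAMES[action])  # 1 Left
```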


## Observation Type - Vector (2.1), Action Type - Discrete

This part of the assignment requires you to create a sequential dense neural network with 1 hidden layer having 64 neurons and the output layer having 4 neurons. The input to the neural network is a vector of length 2 (refer to environment observations type 2.1). The output of the neural network is an array representing the Q-values from which you will choose an action (refer to environment actions type 1).

The following figure shows the network structure you will have to use:

<img src="./images/neural_network_structures/neural_network_2_64_4.png">

In [3]:
"""TO DO: Create a neural network, pass it the observation from the environment
and get the predicted Q-values for the four actions. Print the observation and the Q-values."""

environment = WumpusWorldEnvironment(observation_type='vector', action_type='discrete')
observation, info = environment.reset()

# Convert vector observation to tensor
obs_tensor = torch.tensor(observation, dtype=torch.float32)

# Define the neural network
model = torch.nn.Sequential(
    torch.nn.Linear(2, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 4)
)

q_values = model(obs_tensor)
print("\nObservation (Vector 2.1):", observation)
print("Q-values:", q_values.detach().numpy())



Observation (Vector 2.1): [0 0]
Q-values: [-0.19993767 -0.05581397  0.18185377 -0.13431422]


## Observation Type - Vector (2.2), Action Type - Discrete

This part of the assignment requires you to create a sequential dense neural network with 1 hidden layer having 64 neurons and the output layer having 4 neurons. The input to the neural network is a vector of length 36 (refer to environment observations type 2.2). The output of the neural network is an array representing the Q-values from which you will choose an action (refer to environment actions type 1).

**HINT:** Use the integer observation and convert it to a one-hot encoded vector.

The following figure shows the network structure you will have to use:

<img src="./images/neural_network_structures/neural_network_36_64_4.png">

In [4]:
"""TO DO: Create a neural network, pass it the observation from the environment
and get the predicted Q-values for the four actions. Print the observation and the Q-values."""

# Observation Type - Vector (2.2), Action Type - Discrete
environment = WumpusWorldEnvironment(observation_type='integer', action_type='discrete')
observation, info = environment.reset()

# Convert integer to one-hot encoded vector
one_hot = torch.zeros(36)
one_hot[observation] = 1.0

# Define the neural network
model = torch.nn.Sequential(
    torch.nn.Linear(36, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 4)
)

q_values = model(one_hot)
print("\nObservation (One-Hot 36):", one_hot.numpy())
print("Q-values:", q_values.detach().numpy())



Observation (One-Hot 36): [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Q-values: [ 0.07193527  0.01945292 -0.0960912  -0.00920804]


## Observation Type - Image, Action Type - Discrete

This part of the assignment requires you to create a convolutional neural network with one convolutional layer having 128 filters of size 3 x 3, one hidden layer having 64 neurons, and the output layer having 4 neurons. The input to the neural network is an image of size 84 * 84 (refer to environment observations type 3). The output of the neural network is an array representing the Q-values from which you will choose an action (refer to environment actions type 1).

The following figure shows the network structure you will have to use:

<img src="./images/neural_network_structures/convolutional_neural_network_84x84_128_64_4.png">

In [5]:
"""TO DO: Create a neural network, pass it the observation from the environment
and get the predicted Q-values for the four actions. Print the observation and the Q-values."""

# Observation Type - Image, Action Type - Discrete
environment = WumpusWorldEnvironment(observation_type='image', action_type='discrete')
observation, info = environment.reset()

# Convert image to tensor and add batch and channel dimensions
obs_tensor = torch.tensor(observation, dtype=torch.float32).unsqueeze(0).unsqueeze(0)

# Define the neural network
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 128, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Flatten(),
    torch.nn.Linear(128 * 84 * 84, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 4)
)

q_values = model(obs_tensor)
print("\nObservation (Image) Shape:", observation.shape)
print("Q-values:", q_values.detach().numpy())


Observation (Image) Shape: (84, 84)
Q-values: [[13.330366   3.3391213 20.957369   6.784277 ]]
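
The `128 * 84 * 84` input size of the first linear layer follows from the convolution settings: with a 3 x 3 kernel, stride 1, and padding 1, the spatial size stays 84 x 84. The arithmetic can be checked with the standard convolution output-size formula. The helper below is a hypothetical illustration and not part of the assignment code.

```python
def conv2d_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a 2D convolution."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


side = conv2d_output_size(84, kernel_size=3, padding=1)
flat_features = 128 * side * side  # 128 feature maps after Flatten

# Parameter counts (weights + biases) for the three trainable layers
conv_params = 128 * (1 * 3 * 3) + 128
hidden_params = flat_features * 64 + 64
output_params = 64 * 4 + 4
total_params = conv_params + hidden_params + output_params

print(side)           # 84
print(flat_features)  # 903168
print(total_params)   # 57804356
```

Without padding the same formula gives 82, so the linear layer would need `128 * 82 * 82` inputs instead. Almost all of the parameters sit in the first linear layer. The comparatively large Q-values printed above are probably a consequence of feeding raw 0-255 pixel values into the network; dividing the image by 255.0 first is a common choice.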


## Observation Type - Float, Action Type - Discrete

This part of the assignment requires you to create a sequential dense neural network with 1 hidden layer having 64 neurons and the output layer having 4 neurons. The input to the neural network is a float (refer to environment observations type 4). The output of the neural network is an array representing the Q-values from which you will choose an action (refer to environment actions type 1).

The following figure shows the network structure you will have to use:

<img src="./images/neural_network_structures/neural_network_1_64_4.png">

In [6]:
"""TO DO: Create a neural network, pass it the observation from the environment
and get the predicted Q-values for the four actions. Print the observation and the Q-values."""

# Observation Type - Float, Action Type - Discrete
environment = WumpusWorldEnvironment(observation_type='float', action_type='discrete')
observation, info = environment.reset()

# Convert float observation to tensor
obs_tensor = torch.tensor([observation], dtype=torch.float32)

# Define the neural network
model = torch.nn.Sequential(
    torch.nn.Linear(1, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 4)
)

q_values = model(obs_tensor)
print("\nObservation (Float):", observation)
print("Q-values:", q_values.detach().numpy())



Observation (Float): 2.3126602172851562e-05
Q-values: [ 0.03201687 -0.2761972   0.52950704 -0.09354447]


## Observation Type - Vector (2.2), Action Type - Multi-Discrete

This part of the assignment requires you to create a sequential dense neural network with 1 hidden layer having 64 neurons and the output layer having 4 neurons. The input to the neural network is a vector of length 36 (refer to environment observations type 2.2). The output of the neural network is an array representing the probability of choosing each action; if the value of an array element is >= 0.5, you will perform that action (refer to environment actions type 2).

**HINT:** Use the integer observation and convert it to a one-hot encoded vector.

The following figure shows the network structure you will have to use:

<img src="./images/neural_network_structures/neural_network_36_64_4_sigmoid.png">

In [7]:
"""TO DO: Create a neural network, pass it the observation from the environment
and get the predicted action probabilities for the four actions. Print the observation and the action probabilities."""
# Observation Type - Vector (2.2), Action Type - Multi-Discrete
environment = WumpusWorldEnvironment(observation_type='integer', action_type='multi_discrete')
observation, info = environment.reset()

# Convert integer to one-hot encoded vector
one_hot = torch.zeros(36)
one_hot[observation] = 1.0

# Define the neural network
model = torch.nn.Sequential(
    torch.nn.Linear(36, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 4),
    torch.nn.Sigmoid()
)

probs = model(one_hot)
print("\nObservation (One-Hot 36):", one_hot.numpy())
print("Action Probabilities:", probs.detach().numpy())


# END_YOUR_CODE


Observation (One-Hot 36): [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Action Probabilities: [0.47893918 0.48120052 0.4785952  0.47566387]
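
To turn these probabilities into a multi-discrete action, each element is compared against the 0.5 threshold stated in the task description. A minimal sketch in plain Python, using the probabilities printed above (the function name is hypothetical):

```python
def probabilities_to_multi_discrete(probs, threshold=0.5):
    """Set each action bit to 1 when its probability is >= threshold."""
    return [1 if p >= threshold else 0 for p in probs]


probs = [0.47893918, 0.48120052, 0.4785952, 0.47566387]
action = probabilities_to_multi_discrete(probs)
print(action)  # [0, 0, 0, 0]
```

With these particular initial weights every probability is below 0.5, so no action bit is set. A different random initialization can give a different result.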


## Observation Type - Vector (2.2), Action Type - Continuous

This part of the assignment requires you to create a sequential dense neural network with 1 hidden layer having 64 neurons and the output layer having 1 neuron. The input to the neural network is a vector of length 36 (refer to environment observations type 2.2). The output of the neural network is a float in the range [-1, 1] determining the action which will be taken (refer to environment actions type 3).

**HINT:** Use the integer observation, convert it to a one-hot encoded vector, and use the Tanh activation function to get the output in the range [-1, 1].

The following figure shows the network structure you will have to use:

<img src="./images/neural_network_structures/neural_network_36_64_1.png">

In [5]:
"""TO DO: Create a neural network, pass it the observation from the environment
and get the predicted action. Print the observation and the predicted action."""


# Observation Type - Vector (2.2), Action Type - Continuous
environment = WumpusWorldEnvironment(observation_type='integer', action_type='continuous')
observation, info = environment.reset()


# Convert integer to one-hot encoded vector
one_hot = torch.zeros(36)
one_hot[observation] = 1.0

# Define the neural network
model = torch.nn.Sequential(
    torch.nn.Linear(36, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
    torch.nn.Tanh()
)

action = model(one_hot)
print("\nObservation (One-Hot 36):", one_hot.numpy())
print("Continuous Action:", action.item())


Observation (One-Hot 36): [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Continuous Action: 0.12994205951690674


In [10]:
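import gym

# No render_mode is passed to gym.make, so env.render() would only emit a
# warning here; the call is therefore omitted.
env = gym.make('CartPole-v1')

def Random_game():
    """Play 10 episodes with uniformly random actions and print every transition."""
    for episode in range(10):
        env.reset()
        for t in range(500):
            action = env.action_space.sample()
            next_state, reward, done, truncated, info = env.step(action)

            print(t, next_state, reward, done, truncated, info, action)
            # Stop on termination (pole fell or cart left the track)
            # or on time-limit truncation
            if done or truncated:
                break

Random_game()
env.close()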



0 [-0.02920395  0.16895795  0.03616807 -0.3146741 ] 1.0 False False {} 1
1 [-0.02582479  0.36354652  0.02987459 -0.595735  ] 1.0 False False {} 1
2 [-0.01855386  0.16801949  0.01795989 -0.2937935 ] 1.0 False False {} 0
3 [-0.01519347  0.36288086  0.01208402 -0.58075845] 1.0 False False {} 1
4 [-7.9358565e-03  5.5783141e-01  4.6884661e-04 -8.6961031e-01] 1.0 False False {} 1
5 [ 0.00322077  0.36270308 -0.01692336 -0.57678   ] 1.0 False False {} 0
6 [ 0.01047483  0.16782238 -0.02845896 -0.28947607] 1.0 False False {} 0
7 [ 0.01383128  0.36333835 -0.03424848 -0.59099704] 1.0 False False {} 1
8 [ 0.02109805  0.55892265 -0.04606842 -0.8942686 ] 1.0 False False {} 1
9 [ 0.0322765   0.3644547  -0.06395379 -0.61641544] 1.0 False False {} 0
10 [ 0.03956559  0.56040907 -0.07628211 -0.9285357 ] 1.0 False False {} 1
11 [ 0.05077378  0.75647336 -0.09485282 -1.2441821 ] 1.0 False False {} 1
12 [ 0.06590325  0.56268775 -0.11973646 -0.9826552 ] 1.0 False False {} 0
13 [ 0.077157    0.7591928  -0.13938



In [9]:
import numpy as np
# Patch to handle environments expecting np.bool8
if not hasattr(np, 'bool8'):
    np.bool8 = np.bool_

import gym
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import random

# Define the neural network
class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Replay buffer class
class ReplayBuffer:
    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def size(self):
        return len(self.buffer)

# Hyperparameters
state_size = 4
action_size = 2
episodes = 500
batch_size = 32
gamma = 0.95
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.01
learning_rate = 0.001
target_update_frequency = 10
buffer_size = 2000

# Create the environment
env = gym.make("CartPole-v1", render_mode="rgb_array")

# Initialize the policy and target networks
policy_net = DQN(state_size, action_size)
target_net = DQN(state_size, action_size)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

# Optimizer and loss function
optimizer = optim.Adam(policy_net.parameters(), lr=learning_rate)
criterion = nn.MSELoss()

# Replay buffer
replay_buffer = ReplayBuffer(buffer_size)

# Training loop
scores = []

for e in range(episodes):
    state, _ = env.reset()  # Unpack the tuple from env.reset()
    state = np.reshape(state, [1, state_size])
    state = torch.tensor(state, dtype=torch.float32)
    score = 0

    while True:
        # Select an action using epsilon-greedy strategy
        if np.random.rand() <= epsilon:
            action = np.random.choice(action_size)
        else:
            with torch.no_grad():
                action = torch.argmax(policy_net(state)).item()

        # Take action in the environment
        next_state, reward, done, truncated, info = env.step(action)
        next_state = np.reshape(next_state, [1, state_size])
        next_state = torch.tensor(next_state, dtype=torch.float32)

        # Convert done and truncated to Python booleans
        done = bool(done)
        truncated = bool(truncated)

        # Store experience in the replay buffer
        replay_buffer.add((state, action, reward, next_state, done))

        # Update state and score
        state = next_state
        score += reward

        # Train the policy network if enough samples are available
        if replay_buffer.size() > batch_size:
            minibatch = replay_buffer.sample(batch_size)
            for state_mb, action_mb, reward_mb, next_state_mb, done_mb in minibatch:
                target = reward_mb
                if not done_mb:
                    target += gamma * torch.max(target_net(next_state_mb)).item()
                # Get current prediction and update the Q-value for the taken action
                target_f = policy_net(state_mb).detach()
                target_f[0, action_mb] = target  # Fix: index into the second dimension
                output = policy_net(state_mb)
                loss = criterion(output, target_f)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        if done or truncated:
            print(f"Episode {e + 1}/{episodes}, Score: {score}")
            break

    scores.append(score)
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Update target network periodically
    if e % target_update_frequency == 0:
        target_net.load_state_dict(policy_net.state_dict())

# Record a video of the trained agent
env = gym.wrappers.RecordVideo(env, "videos", episode_trigger=lambda x: True)

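# Play a trained episode and record video
state, _ = env.reset()
state = np.reshape(state, [1, state_size])
state = torch.tensor(state, dtype=torch.float32)
done = False
truncated = False

# Stop on termination or on time-limit truncation; checking only `done`
# would keep stepping an environment whose episode has already ended
while not (done or truncated):
    with torch.no_grad():  # no gradients are needed for evaluation
        action = torch.argmax(policy_net(state)).item()
    state, _, done, truncated, _ = env.step(action)
    state = np.reshape(state, [1, state_size])
    state = torch.tensor(state, dtype=torch.float32)

env.close()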




Episode 1/500, Score: 17.0
Episode 2/500, Score: 10.0
Episode 3/500, Score: 26.0
Episode 4/500, Score: 17.0
Episode 5/500, Score: 35.0
Episode 6/500, Score: 17.0
Episode 7/500, Score: 17.0
Episode 8/500, Score: 40.0
Episode 9/500, Score: 17.0
Episode 10/500, Score: 27.0
Episode 11/500, Score: 18.0
Episode 12/500, Score: 22.0
Episode 13/500, Score: 22.0
Episode 14/500, Score: 35.0
Episode 15/500, Score: 14.0
Episode 16/500, Score: 28.0
Episode 17/500, Score: 35.0
Episode 18/500, Score: 19.0
Episode 19/500, Score: 12.0
Episode 20/500, Score: 29.0
Episode 21/500, Score: 21.0
Episode 22/500, Score: 33.0
Episode 23/500, Score: 51.0
Episode 24/500, Score: 19.0
Episode 25/500, Score: 38.0
Episode 26/500, Score: 23.0
Episode 27/500, Score: 71.0
Episode 28/500, Score: 31.0
Episode 29/500, Score: 12.0
Episode 30/500, Score: 18.0
Episode 31/500, Score: 46.0
Episode 32/500, Score: 37.0
Episode 33/500, Score: 24.0
Episode 34/500, Score: 31.0
Episode 35/500, Score: 11.0
Episode 36/500, Score: 40.0
E

DependencyNotInstalled: MoviePy is not installed, run `pip install moviepy`
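
The low scores in the first episodes of the log are consistent with the exploration schedule: epsilon starts at 1.0 and is multiplied by 0.995 after every episode, so most early actions are still random. The sketch below only re-evaluates the schedule from the hyperparameters in the cell above. The helper function is hypothetical.

```python
import math

epsilon_start = 1.0
epsilon_decay = 0.995
epsilon_min = 0.01


def epsilon_after(episodes_done):
    """Exploration rate after a given number of completed episodes."""
    return max(epsilon_min, epsilon_start * epsilon_decay ** episodes_done)


# Number of episodes needed before epsilon is clamped at epsilon_min
episodes_to_min = math.ceil(
    math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay)
)

print(round(epsilon_after(36), 3))
print(round(epsilon_after(500), 3))
print(episodes_to_min)
```

Under these settings epsilon is still about 0.83 after 36 episodes and about 0.08 after all 500, and it would take 919 episodes to reach `epsilon_min`. A faster decay is one option if less exploration is wanted within 500 episodes.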

In [10]:
!pip install moviepy
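
The training loop in the DQN cell above performs one optimizer step per sampled transition. A common alternative is a single batched update per minibatch. The following is a minimal sketch of that idea and not the notebook's method: `batched_td_loss` and the small stand-in networks are hypothetical, and the resulting loss is not numerically identical to the per-sample version because it averages the squared error over the taken actions only.

```python
import torch
import torch.nn as nn


def batched_td_loss(policy_net, target_net, states, actions, rewards,
                    next_states, dones, gamma=0.95):
    """Mean squared TD error over a minibatch.

    states, next_states: float tensors of shape (batch, state_size)
    actions: int64 tensor of shape (batch,)
    rewards, dones: float tensors of shape (batch,); dones is 1.0 at terminal steps
    """
    # Q(s, a) for the actions that were actually taken
    q_taken = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap from the target network; no bootstrap after terminal steps
        next_max = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_max * (1.0 - dones)
    return nn.functional.mse_loss(q_taken, targets)


# Stand-in networks and a random minibatch, only to show the shapes involved
torch.manual_seed(0)
policy_net = nn.Sequential(nn.Linear(4, 24), nn.ReLU(), nn.Linear(24, 2))
target_net = nn.Sequential(nn.Linear(4, 24), nn.ReLU(), nn.Linear(24, 2))
target_net.load_state_dict(policy_net.state_dict())

batch = 32
states = torch.randn(batch, 4)
next_states = torch.randn(batch, 4)
actions = torch.randint(0, 2, (batch,))
rewards = torch.ones(batch)
dones = torch.zeros(batch)

optimizer = torch.optim.Adam(policy_net.parameters(), lr=0.001)
loss = batched_td_loss(policy_net, target_net, states, actions, rewards,
                       next_states, dones)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```

With the replay buffer used above, each stored state has shape `(1, state_size)`, so concatenating the sampled states with `torch.cat` would produce the `(batch, state_size)` tensor this function expects.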

