#### Question 1: Train a CartPole Agent
#### Dataset Problem: Use the CartPole-v1 environment from OpenAI Gym to train an agent with a simple reinforcement learning algorithm. Choose hyperparameters as required and develop the model, applying the concepts discussed in class.


In [1]:
pip install gym torch



In [6]:
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
from collections import deque

# Define the neural network architecture for the Q-network
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_dim)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Set up hyperparameters
gamma = 0.99          # Discount factor for future rewards
epsilon = 1.0         # Initial exploration rate (probability of a random action)
epsilon_min = 0.01    # Minimum value for epsilon
epsilon_decay = 0.995 # Decay rate of epsilon per episode
learning_rate = 0.001 # Learning rate for Q-network
batch_size = 64       # Batch size for experience replay
memory_size = 10000   # Size of experience replay buffer
num_episodes = 500    # Total number of episodes for training

# Initialize environment and agent
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

# Initialize DQN and target networks
q_network = DQN(state_dim, action_dim)
target_network = DQN(state_dim, action_dim)
target_network.load_state_dict(q_network.state_dict())
target_network.eval()

# Set up optimizer and replay memory
optimizer = optim.Adam(q_network.parameters(), lr=learning_rate)
replay_memory = deque(maxlen=memory_size)

# Function to select action based on epsilon-greedy policy
def select_action(state, epsilon):
    if random.random() < epsilon:
        return env.action_space.sample()  # Random action
    else:
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = q_network(state_tensor)
            return int(torch.argmax(q_values).item())

# Function to train the Q-network
def train_q_network():
    if len(replay_memory) < batch_size:
        return
    
    # Sample a mini-batch from the replay memory
    minibatch = random.sample(replay_memory, batch_size)
    states, actions, rewards, next_states, dones = zip(*minibatch)

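    # Stack the sampled NumPy arrays first; building a tensor from a tuple of arrays is slow
    states = torch.FloatTensor(np.array(states))
    actions = torch.LongTensor(actions)
    rewards = torch.FloatTensor(rewards)
    next_states = torch.FloatTensor(np.array(next_states))
    dones = torch.FloatTensor(dones)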

    # Compute the current Q values
    q_values = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Compute the target Q values using the target network
    with torch.no_grad():
        max_next_q_values = target_network(next_states).max(1)[0]
        target_q_values = rewards + (1 - dones) * gamma * max_next_q_values

    # Calculate loss and optimize the Q-network
    loss = nn.MSELoss()(q_values, target_q_values)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In [9]:
# Reset environment and initialize state
state_dict = env.reset()
if isinstance(state_dict, tuple):  # Handle cases where reset returns a tuple
    state = state_dict[0]
else:
    state = state_dict  # Use directly if it's already a simple observation array

# Training loop with modified reset handling
for episode in range(num_episodes):
    # Reset environment and initialize state
    state_dict = env.reset()
    if isinstance(state_dict, tuple):
        state = state_dict[0]
    else:
        state = state_dict
    
    total_reward = 0
    done = False
    
    while not done:
        # Select an action
        action = select_action(state, epsilon)
        
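        # gym >= 0.26 returns (obs, reward, terminated, truncated, info);
        # older versions return (obs, reward, done, info)
        result = env.step(action)
        if len(result) == 5:
            next_state, reward, terminated, truncated, _ = result
        else:
            next_state, reward, terminated, _ = result
            truncated = False
        # End the episode on failure or when the 500-step time limit is hit
        done = terminated or truncated
        
        # Append experience to replay memory; only a true failure stops bootstrapping
        replay_memory.append((state, action, reward, next_state, terminated))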
        
        # Train the Q-network
        train_q_network()
        
        # Update state
        state = next_state
        total_reward += reward
    
    # Decay epsilon and update target network
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
    if episode % 10 == 0:
        target_network.load_state_dict(q_network.state_dict())
    
    # Print progress
    print(f"Episode {episode+1}/{num_episodes}, Total Reward: {total_reward}, Epsilon: {epsilon:.2f}")

env.close()


Episode 1/500, Total Reward: 15.0, Epsilon: 0.99
Episode 2/500, Total Reward: 14.0, Epsilon: 0.99
Episode 3/500, Total Reward: 17.0, Epsilon: 0.99
Episode 4/500, Total Reward: 23.0, Epsilon: 0.98
Episode 5/500, Total Reward: 10.0, Epsilon: 0.98
Episode 6/500, Total Reward: 15.0, Epsilon: 0.97
Episode 7/500, Total Reward: 11.0, Epsilon: 0.97
Episode 8/500, Total Reward: 36.0, Epsilon: 0.96
Episode 9/500, Total Reward: 13.0, Epsilon: 0.96
Episode 10/500, Total Reward: 54.0, Epsilon: 0.95
Episode 11/500, Total Reward: 15.0, Epsilon: 0.95
Episode 12/500, Total Reward: 11.0, Epsilon: 0.94
Episode 13/500, Total Reward: 12.0, Epsilon: 0.94
Episode 14/500, Total Reward: 22.0, Epsilon: 0.93
Episode 15/500, Total Reward: 31.0, Epsilon: 0.93
Episode 16/500, Total Reward: 24.0, Epsilon: 0.92
Episode 17/500, Total Reward: 32.0, Epsilon: 0.92
Episode 18/500, Total Reward: 22.0, Epsilon: 0.91
Episode 19/500, Total Reward: 14.0, Epsilon: 0.91
Episode 20/500, Total Reward: 18.0, Epsilon: 0.90
...

Episode 165/500, Total Reward: 112.0, Epsilon: 0.44
Episode 166/500, Total Reward: 98.0, Epsilon: 0.44
Episode 167/500, Total Reward: 264.0, Epsilon: 0.43
Episode 168/500, Total Reward: 147.0, Epsilon: 0.43
Episode 169/500, Total Reward: 237.0, Epsilon: 0.43
Episode 170/500, Total Reward: 87.0, Epsilon: 0.43
Episode 171/500, Total Reward: 233.0, Epsilon: 0.42
Episode 172/500, Total Reward: 164.0, Epsilon: 0.42
Episode 173/500, Total Reward: 230.0, Epsilon: 0.42
Episode 174/500, Total Reward: 182.0, Epsilon: 0.42
Episode 175/500, Total Reward: 232.0, Epsilon: 0.42
Episode 176/500, Total Reward: 363.0, Epsilon: 0.41
Episode 177/500, Total Reward: 172.0, Epsilon: 0.41
Episode 178/500, Total Reward: 234.0, Epsilon: 0.41
Episode 179/500, Total Reward: 252.0, Epsilon: 0.41
Episode 180/500, Total Reward: 209.0, Epsilon: 0.41
Episode 181/500, Total Reward: 214.0, Epsilon: 0.40
Episode 182/500, Total Reward: 178.0, Epsilon: 0.40
Episode 183/500, Total Reward: 197.0, Epsilon: 0.40
...

Episode 324/500, Total Reward: 99.0, Epsilon: 0.20
Episode 325/500, Total Reward: 98.0, Epsilon: 0.20
Episode 326/500, Total Reward: 102.0, Epsilon: 0.20
Episode 327/500, Total Reward: 38.0, Epsilon: 0.19
Episode 328/500, Total Reward: 41.0, Epsilon: 0.19
Episode 329/500, Total Reward: 103.0, Epsilon: 0.19
Episode 330/500, Total Reward: 100.0, Epsilon: 0.19
Episode 331/500, Total Reward: 96.0, Epsilon: 0.19
Episode 332/500, Total Reward: 63.0, Epsilon: 0.19
Episode 333/500, Total Reward: 113.0, Epsilon: 0.19
Episode 334/500, Total Reward: 21.0, Epsilon: 0.19
Episode 335/500, Total Reward: 120.0, Epsilon: 0.19
Episode 336/500, Total Reward: 98.0, Epsilon: 0.19
Episode 337/500, Total Reward: 104.0, Epsilon: 0.18
Episode 338/500, Total Reward: 110.0, Epsilon: 0.18
Episode 339/500, Total Reward: 102.0, Epsilon: 0.18
Episode 340/500, Total Reward: 106.0, Epsilon: 0.18
Episode 341/500, Total Reward: 37.0, Epsilon: 0.18
Episode 342/500, Total Reward: 111.0, Epsilon: 0.18
...

Episode 483/500, Total Reward: 30.0, Epsilon: 0.09
Episode 484/500, Total Reward: 103.0, Epsilon: 0.09
Episode 485/500, Total Reward: 106.0, Epsilon: 0.09
Episode 486/500, Total Reward: 101.0, Epsilon: 0.09
Episode 487/500, Total Reward: 14.0, Epsilon: 0.09
Episode 488/500, Total Reward: 107.0, Epsilon: 0.09
Episode 489/500, Total Reward: 104.0, Epsilon: 0.09
Episode 490/500, Total Reward: 107.0, Epsilon: 0.09
Episode 491/500, Total Reward: 105.0, Epsilon: 0.09
Episode 492/500, Total Reward: 106.0, Epsilon: 0.08
Episode 493/500, Total Reward: 37.0, Epsilon: 0.08
Episode 494/500, Total Reward: 107.0, Epsilon: 0.08
Episode 495/500, Total Reward: 102.0, Epsilon: 0.08
Episode 496/500, Total Reward: 104.0, Epsilon: 0.08
Episode 497/500, Total Reward: 104.0, Epsilon: 0.08
Episode 498/500, Total Reward: 108.0, Epsilon: 0.08
Episode 499/500, Total Reward: 104.0, Epsilon: 0.08
Episode 500/500, Total Reward: 101.0, Epsilon: 0.08


#### Deep Q-Learning with a Neural Network

Q-learning is a model-free reinforcement learning algorithm that learns a Q-value function Q(s, a): the expected cumulative discounted reward for taking action a in state s and following the optimal policy thereafter.
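
Written as an equation, this definition corresponds to the Bellman optimality condition. The symbols r (immediate reward), γ (the discount factor, `gamma` in the code), s′ (next state), and a′ (next action) are notation assumed here for illustration:

$$
Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \,\right]
$$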

Traditional Q-learning stores Q-values in a table with one entry per state-action pair. This is impractical for environments like CartPole, whose state is continuous (cart position, cart velocity, pole angle, and pole angular velocity).
In Deep Q-Learning, a neural network approximates the Q-value function: it takes the state as input and outputs one Q-value per action.

Neural Network Design:

- The input layer matches the dimension of the state space (4 for CartPole).
- Two hidden layers of 24 units each use ReLU activations.
- The output layer produces one Q-value per action (2 for CartPole: push left and push right).
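
For comparison, the tabular update that the network replaces might be sketched as below. This is a minimal illustration, not part of the notebook's agent: the learning rate `alpha`, the integer state labels, and the helper name `q_learning_update` are all assumptions made for the example.

```python
from collections import defaultdict

def q_learning_update(q_table, state, action, reward, next_state, done,
                      n_actions, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Best value reachable from the next state (zero if the episode ended)
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    # Move the stored estimate a fraction alpha toward the target
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]

# Unseen (state, action) pairs default to 0.0
q_table = defaultdict(float)
new_value = q_learning_update(q_table, state=0, action=1, reward=1.0,
                              next_state=1, done=False, n_actions=2)
```

A table like this needs a hashable, discrete state, which is why a continuous observation such as CartPole's would first have to be discretized or handled by a function approximator.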

#### Exploration-Exploitation Trade-Off via Epsilon-Greedy Policy

Exploration means trying new actions to discover potentially better rewards.
Exploitation means using current knowledge (the Q-values) to take the action with the highest estimated value.

Epsilon-Greedy Policy:

The agent chooses:
- a random action with probability ϵ (exploration);
- the action with the highest Q-value with probability 1−ϵ (exploitation).

This lets the agent explore new strategies early on while gradually focusing on the best-known actions.

Epsilon Decay: Initially, ϵ is high (1.0 here), which encourages exploration. After each episode, ϵ is multiplied by 0.995 (ϵ ← ϵ × 0.995) until it reaches the floor of 0.01, reducing exploration as the Q-value estimates improve.
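
The selection rule and the decay schedule can be tried out without the environment. The sketch below uses plain Python lists in place of network outputs; `epsilon_greedy` and `decay_schedule` are hypothetical helper names, and the default values mirror the hyperparameters set in the notebook.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay_schedule(start=1.0, floor=0.01, decay=0.995, episodes=500):
    """Replay the per-episode multiplicative decay and return epsilon after each episode."""
    epsilon, history = start, []
    for _ in range(episodes):
        if epsilon > floor:
            epsilon *= decay
        history.append(epsilon)
    return history

history = decay_schedule()
final_epsilon = history[-1]
# Number of episodes before epsilon first drops to the floor or below
episodes_to_floor = next(
    i + 1 for i, e in enumerate(decay_schedule(episodes=2000)) if e <= 0.01
)
greedy_action = epsilon_greedy([0.2, 0.7], epsilon=0.0)
```

With these settings ϵ is still about 0.08 after 500 episodes and needs a little over 900 episodes to reach the 0.01 floor, so a faster decay or a longer run would be required for a near-greedy final policy.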


#### Replay Memory

Replay memory is a buffer of past experiences stored as tuples (state, action, reward, next_state, done).
Mini-batches are sampled uniformly at random from this buffer to update the Q-network.

- Breaks correlation: consecutive experiences are highly correlated. Random sampling decorrelates the training batch and keeps the network from overfitting to the most recent transitions.
- Efficient data usage: each experience can be reused in many updates, which improves sample efficiency.
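
The buffer behavior described above can be wrapped in a small class. This is one possible sketch, assuming a hypothetical `ReplayBuffer` name and dummy integer states; the notebook itself uses a bare `deque` with `random.sample`.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that evicts the oldest transition when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement
        batch = random.sample(list(self.buffer), batch_size)
        # Transpose a list of transitions into per-field tuples
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # Dummy transitions: the state is just the step index
    buffer.push(step, 0, 1.0, step + 1, False)
states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
```

Because the capacity is 3, the two oldest transitions are discarded and only the last three remain available for sampling.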

#### Target Network

A DQN uses two neural networks:
- The online network is updated at every training step.
- The target network provides stable Q-value targets for training the online network.

The Q-value target involves the next state's maximum Q-value. If that value came from the same network being trained, every update would also shift its own target, creating a feedback loop that can destabilize training.
Freezing the target network's weights between periodic syncs (every 10 episodes in the code above) keeps the targets stable and improves convergence.
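
The target computation and the periodic sync can be illustrated with plain numbers. In this sketch `td_target` and `should_sync` are hypothetical helpers, and the list passed as `next_q_values` stands in for the output of the frozen target network.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target; next_q_values come from the frozen target network."""
    if done:
        # No future reward after a terminal state
        return reward
    return reward + gamma * max(next_q_values)

def should_sync(counter, sync_every):
    """True when the target network should copy the online weights."""
    return counter % sync_every == 0

target_mid_episode = td_target(1.0, [0.5, 2.0], done=False)
target_terminal = td_target(1.0, [0.5, 2.0], done=True)
sync_points = [c for c in range(25) if should_sync(c, sync_every=10)]
```

Between sync points the values in `next_q_values` do not change as the online network learns, which is what keeps the regression target from chasing itself.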

#### Early Episodes (Episode 1/500, Total Reward: 15.0, Epsilon: 0.99)

##### Agent’s Performance
In the early episodes, the agent performs poorly and earns low rewards (e.g., 15.0).
It has not yet learned a good policy and is primarily exploring the environment.

##### High Epsilon (0.99)
The agent selects random actions almost all of the time, which exposes it to the environment dynamics but often leads to suboptimal decisions.

#### Midpoint of Training (Episode 250/500, Total Reward: 129.0, Epsilon: 0.29)

##### Improvement in Performance
As episodes progress, the agent learns a better policy through experience replay and Q-value updates.
Total rewards rise well above the early-episode level, indicating that the agent balances the pole for longer. The rise is not monotonic: the log shows rewards above 200 around episodes 167 to 183, followed by lower values.

##### Reduced Epsilon
The exploration rate (ϵ) decreases, so the agent increasingly relies on its learned Q-values instead of random exploration.

#### Later Episodes (Episode 500/500, Total Reward: 101.0, Epsilon: 0.08)

##### Agent’s Performance
By the end of training, the agent earns much higher rewards than at the start (around 100 per episode versus about 15), so it has learned a policy that balances the pole for roughly 100 steps.
This is far from the maximum of 500 and is below the 200+ rewards logged around episodes 167 to 183, so the policy regressed during training, a common instability in DQN.

##### Low Epsilon (0.08)
The agent now primarily exploits its learned policy, selecting the greedy action most of the time.
It still takes a random action about 8% of the time; ϵ has not reached its floor of 0.01 because 0.995^500 ≈ 0.08.
Occasional exploration helps the agent avoid getting stuck in a suboptimal policy.
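
Single episodes are noisy, so averaging over a window gives a fairer picture of the three phases. The sketch below uses rewards copied from the training log above; the window choices (10 episodes each) and the variable names are assumptions made for illustration.

```python
from statistics import mean

# Rewards copied from the training log above
first_10 = [15, 14, 17, 23, 10, 15, 11, 36, 13, 54]            # episodes 1-10
peak_10 = [182, 232, 363, 172, 234, 252, 209, 214, 178, 197]   # episodes 174-183
last_10 = [105, 106, 37, 107, 102, 104, 104, 108, 104, 101]    # episodes 491-500

averages = {
    "first_10": mean(first_10),
    "peak_10": mean(peak_10),
    "last_10": mean(last_10),
}

for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

The window means make the regression visible: the final average is several times the initial one but less than half of the average around episodes 174 to 183.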
