## About Mountain Car 

- The objective is to drive an underpowered car up a steep mountain road to reach the goal at the top of the hill. The challenge arises from the car’s engine being too weak to climb the hill directly. Therefore, the car must learn to leverage potential energy by building momentum from swinging back and forth between the two hills.

### State Space
The state of the environment is typically described by two variables:

- **Position (x):** The horizontal position of the car along the track, which varies between specific limits, like **-1.2 to 0.6**.
- **Velocity (v):** The speed of the car, which also has limits, such as between -0.07 to 0.07.

### Action Space
The actions that can be taken at any state are typically discrete:

- **Accelerate to the Right (0):** Trying to increase the velocity towards the right.
- **No Acceleration (1):** No change in the velocity.
- **Accelerate to the Left (2):** Trying to increase the velocity towards the left.

### Reward Structure
The reward structure is simple and is designed to encourage the car to reach the goal as quickly as possible:

- **-1 for each time step:** This reward is given until the car reaches the target. The goal is to **minimize the total number of steps taken**, thereby **maximizing the total reward** (or minimizing penalty).

### Dynamics of the Game

- The dynamics of the game are governed by the **physics of the car’s motion**. The car's **velocity and position** at each **timestep** depend on its **previous velocity**, its **current action**, and **gravitational pull**. 

- To succeed, the agent must **learn to balance the gravitational pull and the momentum required to reach the peak.**

### Link Between Rewards and Action Steps
- **Immediate Feedback:** The agent receives a reward (or penalty) after each action, providing immediate feedback about the quality of its decision.

- **Cumulative Goal:** The agent receives a negative reward (like -1) for every time step until the goal is reached. This setup encourages the agent to reach the goal in the fewest possible steps since each additional step incurs further penalty.

### Game Configrations

- **Force** is the action clipped to the range [-1,1] and **Power** is a constant **0.0015**.

- **Reward :** A negative reward of **-0.1 * action2** is received at each timestep to penalise for taking actions of large magnitude. If the mountain car reaches the goal then a **positive reward of +100** is added to the negative reward for that timestep.

- **Starting State :** The position of the car is assigned a uniform random value in **[-0.6 , -0.4]**. The starting **velocity** of the car is always assigned to **0**.

- **Episode End** : The episode ends if either of the following happens:

    - 1. **Termination:** The position of the car is greater than or equal to **0.45**. (the goal position on top of the right hill)

    - 2. **Truncation:** The length of the episode is **999**.

### DQN
Deep Q-Networks (DQNs) are an advancement in the field of reinforcement learning that blend traditional Q-learning with deep neural networks. 

#### WHY TO CHOOSE DQN?
- 1. **Function Approximation:** DQN uses deep neural networks to approximate the Q-function, which is the expected rewards for taking an action in a given state. This is useful in environments with a large number of states and actions where traditional methods that require a table for each state-action pair (like Q-learning) become impractical due to memory and computation limitations.

- 2. **Handling High-Dimensional Spaces:** DQN is effective in environments with high-dimensional input spaces, such as raw pixel data from video games. The deep learning aspect helps in extracting features and making sense of complex inputs without manual feature engineering. 

# TRAIN DQN

1. **Experience Replay:** Experiences (or transitions), which are tuples of (state, action, reward, next_state), are stored in a replay buffer. 

    - Collecting Data: As the agent interacts with the environment, each experience (s, a, r, s') is stored in the Replay Buffer. The buffer has a maximum capacity, so when it's full, older experiences are discarded to make room for new ones.

    - Batch Learning: During training, instead of updating the network with the latest experience only, the agent samples a mini-batch of experiences randomly from the Replay Buffer. This batch is used to perform a training update on the Primary Network.

    - Updating the Primary Network: For each sampled experience, the Primary Network calculates the predicted Q-value for the action taken in the given state. Simultaneously, the Target Network provides the Q-values for the next state, which are used to calculate the target Q-value for training. The Primary Network’s weights are then adjusted to reduce the difference between its predictions and these target Q-values, using a loss function (typically mean squared error).

2. **Networks:** DQN employs two networks 

    - **Primary network:**  
        - Function: The Primary Network is the main neural network that actively learns and updates its weights through training. It directly interacts with the environment, deciding which action to take in each state based on the predicted Q-values it generates.
        - Updates: As the agent explores the environment and receives feedback (rewards), this network's weights are updated continuously to better predict the Q-values for each action given a state.

    - **Target network**
        - Function: The Target Network has the same architecture as the Primary Network, but its purpose is different. It helps provide stable target Q-values during the training process. Instead of actively learning, it serves as a somewhat static benchmark against which the Primary Network's predictions are compared.
        - Updates: The Target Network's weights are not updated as frequently as those of the Primary Network. Instead, its weights are updated periodically (e.g., every few thousand steps) by directly copying the weights from the Primary Network. This periodic update helps prevent the learning process from becoming unstable.
    
    - The primary network is updated frequently, while the target network's weights are updated less frequently (every few thousand steps) to provide stable targets during learning.The target network helps in stabilizing the learning updates. If only one network was used, the continuously updating Q-values could lead to significant oscillations or divergence in the learning process.

3. **Loss Function:** The loss function used is typically the mean squared error between the predicted Q-value and the target Q-value.



### DQN IMPLEMENTATION 

**Environment Setup:** OpenAI Gym, which offers a pre-built Mountain Car environment.

### Algorithm Summary
- 1. Initialize the replay buffer.
- 2. Initialize the primary and target networks with random weights.
- 3. For each episode:
    - For each time step:
        - Select an action using an ε-greedy policy (to balance exploration and exploitation).
        - Execute the action and observe the reward and next state.
        - Store the transition in the replay buffer.
        - Sample a random batch of transitions from the replay buffer.
        - Compute the loss between predicted Q-values and target Q-values.
        - Perform a gradient descent step to update the network's weights.
        - Every fixed number of steps, update the target network's weights to match the       primary network.


In [1]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
from collections import deque

In [2]:
# Replay Buffer
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
        return np.array(state), np.array(action), np.array(reward), np.array(next_state), np.array(done)
    
    def __len__(self):
        return len(self.buffer)

In [3]:
# Neural Network
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, output_dim)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

In [12]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
from collections import deque

# Replay Buffer
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
        return np.array(state), np.array(action), np.array(reward), np.array(next_state), np.array(done)
    
    def __len__(self):
        return len(self.buffer)

# Neural Network
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, output_dim)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# DQN Agent
class DQNAgent:
    def __init__(self, env, buffer_capacity=10000, batch_size=64, gamma=0.99, epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.995, target_update=10, lr=0.001):
        self.env = env
        self.batch_size = batch_size
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_min = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.target_update = target_update
        self.replay_buffer = ReplayBuffer(buffer_capacity)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        
        self.primary_network = DQN(env.observation_space.shape[0], env.action_space.n).to(self.device)
        self.target_network = DQN(env.observation_space.shape[0], env.action_space.n).to(self.device)
        self.optimizer = optim.Adam(self.primary_network.parameters(), lr=lr)
        
        self.target_network.load_state_dict(self.primary_network.state_dict())
        self.target_network.eval()
        
    def select_action(self, state, epsilon=0.0):
        if random.random() < epsilon:
            return self.env.action_space.sample()
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        with torch.no_grad():
            q_values = self.primary_network(state)
        return q_values.cpu().numpy().argmax()

    def update(self):
        if len(self.replay_buffer) < self.batch_size:
            return
        
        states, actions, rewards, next_states, dones = self.replay_buffer.sample(self.batch_size)
        
        states = torch.FloatTensor(states).to(self.device)
        actions = torch.LongTensor(actions).to(self.device)
        rewards = torch.FloatTensor(rewards).to(self.device)
        next_states = torch.FloatTensor(next_states).to(self.device)
        dones = torch.FloatTensor(dones).to(self.device)
        
        q_values = self.primary_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        next_q_values = self.target_network(next_states).max(1)[0]
        target_q_values = rewards + (self.gamma * next_q_values * (1 - dones))
        
        loss = nn.MSELoss()(q_values, target_q_values)
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
    
    def train(self, num_episodes):
        for episode in range(num_episodes):
            state, _ = self.env.reset()
            done = False
            total_reward = 0
            steps = 0
            while not done:
                action = self.select_action(state, self.epsilon)
                next_state, reward, terminated, truncated, _ = self.env.step(action)
                self.replay_buffer.push(state, action, reward, next_state, terminated or truncated)
                state = next_state
                total_reward += reward
                steps += 1
                self.update()

                if terminated:
                    print(f"Episode {episode}: Goal reached in {steps} steps with total reward {total_reward}")
                    break

            if not terminated and truncated:
                print(f"Episode {episode}: Truncated after {steps} steps with total reward {total_reward}")
            
            if self.epsilon > self.epsilon_min:
                self.epsilon *= self.epsilon_decay
            
            if episode % self.target_update == 0:
                self.target_network.load_state_dict(self.primary_network.state_dict())
            
            print(f"Episode {episode}, Total Reward: {total_reward}, Steps: {steps}, Epsilon: {self.epsilon:.2f}")

    def test(self, num_episodes):
        for episode in range(num_episodes):
            state, _ = self.env.reset()
            done = False
            total_reward = 0
            steps = 0
            while not done:
                action = self.select_action(state)
                next_state, reward, terminated, truncated, _ = self.env.step(action)
                state = next_state
                total_reward += reward
                steps += 1

                if terminated:
                    print(f"Test Episode {episode}: Goal reached in {steps} steps with total reward {total_reward}")
                    break

            if not terminated and truncated:
                print(f"Test Episode {episode}: Truncated after {steps} steps with total reward {total_reward}")

    def save_model(self, path):
        torch.save(self.primary_network.state_dict(), path)
    
    def load_model(self, path):
        self.primary_network.load_state_dict(torch.load(path))
        self.primary_network.eval()

In [13]:
# Example Usage
env = gym.make("MountainCar-v0")
agent = DQNAgent(env)
print("Training...")
agent.train(500)
agent.save_model("dqn_model.pth")
print("Testing...")
agent.load_model("dqn_model.pth")
agent.test(10)

Training...
Episode 0: Goal reached in 109566 steps with total reward -109566.0
Episode 0, Total Reward: -109566.0, Steps: 109566, Epsilon: 0.99
Episode 1: Goal reached in 84809 steps with total reward -84809.0
Episode 1, Total Reward: -84809.0, Steps: 84809, Epsilon: 0.99
Episode 2: Goal reached in 34541 steps with total reward -34541.0
Episode 2, Total Reward: -34541.0, Steps: 34541, Epsilon: 0.99
Episode 3: Goal reached in 24659 steps with total reward -24659.0
Episode 3, Total Reward: -24659.0, Steps: 24659, Epsilon: 0.98
Episode 4: Goal reached in 30118 steps with total reward -30118.0
Episode 4, Total Reward: -30118.0, Steps: 30118, Epsilon: 0.98
Episode 5: Goal reached in 61457 steps with total reward -61457.0
Episode 5, Total Reward: -61457.0, Steps: 61457, Epsilon: 0.97
Episode 6: Goal reached in 11918 steps with total reward -11918.0
Episode 6, Total Reward: -11918.0, Steps: 11918, Epsilon: 0.97
Episode 7: Goal reached in 6327 steps with total reward -6327.0
Episode 7, Total 

KeyboardInterrupt: 

In [3]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
from collections import deque

# Replay Buffer
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
        return np.array(state), np.array(action), np.array(reward), np.array(next_state), np.array(done)
    
    def __len__(self):
        return len(self.buffer)

# Neural Network
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, output_dim)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# DQN Agent
class DQNAgent:
    def __init__(self, env, buffer_capacity=10000, batch_size=64, gamma=0.99, epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.995, target_update=10, lr=0.001):
        self.env = env
        self.batch_size = batch_size
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_min = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.target_update = target_update
        self.replay_buffer = ReplayBuffer(buffer_capacity)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        
        self.primary_network = DQN(env.observation_space.shape[0], env.action_space.shape[0]).to(self.device)
        self.target_network = DQN(env.observation_space.shape[0], env.action_space.shape[0]).to(self.device)
        self.optimizer = optim.Adam(self.primary_network.parameters(), lr=lr)
        
        self.target_network.load_state_dict(self.primary_network.state_dict())
        self.target_network.eval()
        
    def select_action(self, state):
        if random.random() < self.epsilon:
            return self.env.action_space.sample()
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        with torch.no_grad():
            q_values = self.primary_network(state)
        return q_values.cpu().numpy()[0]

    def update(self):
        if len(self.replay_buffer) < self.batch_size:
            return
        
        states, actions, rewards, next_states, dones = self.replay_buffer.sample(self.batch_size)
        
        states = torch.FloatTensor(states).to(self.device)
        actions = torch.FloatTensor(actions).to(self.device)
        rewards = torch.FloatTensor(rewards).to(self.device)
        next_states = torch.FloatTensor(next_states).to(self.device)
        dones = torch.FloatTensor(dones).to(self.device)
        
        q_values = self.primary_network(states)
        next_q_values = self.target_network(next_states)
        max_next_q_values = next_q_values.max(1)[0]
        target_q_values = rewards + (self.gamma * max_next_q_values * (1 - dones))
        
        loss = nn.MSELoss()(q_values, target_q_values.unsqueeze(1))
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
    
    def train(self, num_episodes):
        for episode in range(num_episodes):
            state, _ = self.env.reset()
            done = False
            total_reward = 0
            steps = 0
            while not done:
                action = self.select_action(state)
                next_state, reward, done, _, _ = self.env.step(action)
                self.replay_buffer.push(state, action, reward, next_state, done)
                state = next_state
                total_reward += reward
                steps += 1
                self.update()
            
            if self.epsilon > self.epsilon_min:
                self.epsilon *= self.epsilon_decay
            
            if episode % self.target_update == 0:
                self.target_network.load_state_dict(self.primary_network.state_dict())
            
            print(f"Episode {episode}, Total Reward: {total_reward}, Steps: {steps}, Epsilon: {self.epsilon:.2f}")

# Example Usage
env = gym.make("MountainCarContinuous-v0")
agent = DQNAgent(env)
agent.train(500)

Episode 0, Total Reward: -677.0983350800102, Steps: 23312, Epsilon: 0.99
Episode 1, Total Reward: -2289.4405299698055, Steps: 72117, Epsilon: 0.99
Episode 2, Total Reward: -533.1453041588754, Steps: 19133, Epsilon: 0.99
Episode 3, Total Reward: 15.796586472902149, Steps: 2557, Epsilon: 0.98
Episode 4, Total Reward: -872.0823157868286, Steps: 29682, Epsilon: 0.98
Episode 5, Total Reward: -2439.001657126396, Steps: 77285, Epsilon: 0.97
Episode 6, Total Reward: -578.6790567485673, Steps: 18063, Epsilon: 0.97
Episode 7, Total Reward: -270.0510298789522, Steps: 11577, Epsilon: 0.96
Episode 8, Total Reward: -2425.1689193061766, Steps: 78237, Epsilon: 0.96
Episode 9, Total Reward: -157.87936639274682, Steps: 8066, Epsilon: 0.95
Episode 10, Total Reward: -685.9014909146756, Steps: 24516, Epsilon: 0.95
Episode 11, Total Reward: -31.12261644414386, Steps: 4004, Epsilon: 0.94
Episode 12, Total Reward: -99.40518521013021, Steps: 6113, Epsilon: 0.94
Episode 13, Total Reward: -358.0181566752662, Ste

### Discrete Action Space
In the discrete version of the Mountain Car game:

- **Action Space:** The action space consists of a small, finite set of possible actions. Typically, these are:
    - 0: Accelerate to the Right
    - 1: No Acceleration
    - 2: Accelerate to the Left
- **Control:** The agent selects one of these discrete actions at each time step.
- **Implementation:** This simplification makes it easier to implement traditional RL algorithms like Q-learning or SARSA, which are well-suited for discrete spaces.

### Continuous Action Space
In the continuous version, called Mountain Car Continuous:

- **Action Space:** The action space is continuous, typically represented by a real number within a range. For instance, the action might be a single real value indicating the amount of force applied in either direction.
    - The action can range, for example, from -1 (full power to the left) to +1 (full power to the right).
- **Control:** The agent can choose any value within this range, allowing for finer control over the car's acceleration.
- **Implementation:** This requires different types of RL algorithms that can handle continuous action spaces, such as Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), DNQ or other policy gradient methods that work directly with continuous outputs.

### Key Differences
- **Action Granularity:** The main difference is in the granularity of the actions that can be taken. Continuous action spaces allow for more precise and varied actions, whereas discrete action spaces are limited to a few predefined options.
- **Complexity:** Continuous action spaces generally increase the complexity of the problem, as the agent needs to learn to choose optimally from an infinite set of possible actions.
- **Algorithms:** Continuous spaces often require more sophisticated RL algorithms that are specifically designed to handle continuous domains.