
# Deep Q-Network (DQN) Implementation from Scratch

This notebook shows a **step-by-step implementation of a Deep Q-Network (DQN)** for solving the **CartPole environment**.  
We avoid complex class structures and keep the code **simple and easy to understand**.


In [11]:
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np

# 1. Build the Q-Network (small MLP with two hidden layers of 24 units)
def build_model(state_dim, action_dim):
    return nn.Sequential(
        nn.Linear(state_dim, 24), nn.ReLU(),
        nn.Linear(24, 24), nn.ReLU(),
        nn.Linear(24, action_dim)
    )

# 2. Choose action (epsilon-greedy)
def choose_action(state, epsilon, q_network, action_dim):
    if random.random() < epsilon:
        return random.randrange(action_dim)  # random action
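    state_tensor = torch.FloatTensor(state).unsqueeze(0)  # add a batch dimension
    with torch.no_grad():  # no gradients are needed to pick an action
        q_values = q_network(state_tensor)
    return torch.argmax(q_values).item()  # greedy action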

# 3. Training step (on one batch from memory)
def train(q_network, target_network, memory, optimizer, gamma, batch_size=32):
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)

    states = torch.FloatTensor(np.array(states))
    actions = torch.LongTensor(actions).unsqueeze(1)
    rewards = torch.FloatTensor(rewards)
    next_states = torch.FloatTensor(np.array(next_states))
    dones = torch.FloatTensor(dones)

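    # Current Q values for the actions that were actually taken
    q_values = q_network(states).gather(1, actions).squeeze(1)

    # Next Q values (from target network); the targets carry no gradient
    with torch.no_grad():
        next_q_values = target_network(next_states).max(1)[0]
        # (1 - dones) removes the bootstrap term for terminal transitions
        target = rewards + gamma * next_q_values * (1 - dones)

    # Loss (MSE)
    loss = nn.MSELoss()(q_values, target)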
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# 4. Main Training Loop
env = gym.make("CartPole-v1")

state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

q_network = build_model(state_dim, action_dim)
target_network = build_model(state_dim, action_dim)
target_network.load_state_dict(q_network.state_dict())

optimizer = optim.Adam(q_network.parameters(), lr=0.001)

# Hyperparameters
episodes = 200
gamma = 0.99
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995
target_update = 10
memory = []

for ep in range(episodes):
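    # Classic Gym (< 0.26) returns only the state; Gym >= 0.26 and Gymnasium return (state, info)
    reset_result = env.reset()
    state = reset_result[0] if isinstance(reset_result, tuple) else reset_result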
    total_reward = 0

    for t in range(200):
        action = choose_action(state, epsilon, q_network, action_dim)
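        # Classic Gym returns 4 values; Gym >= 0.26 and Gymnasium return 5
        step_result = env.step(action)
        if len(step_result) == 5:
            next_state, reward, terminated, truncated, info = step_result
            done = terminated or truncated
        else:
            next_state, reward, done, info = step_result
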
        # Store in memory
        memory.append((state, action, reward, next_state, done))
        if len(memory) > 10000:
            memory.pop(0)

        # Train on random batch
        train(q_network, target_network, memory, optimizer, gamma)

        state = next_state
        total_reward += reward
        if done:
            break

    # Decay epsilon
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

    # Update target network
    if ep % target_update == 0:
        target_network.load_state_dict(q_network.state_dict())

    print(f"Episode {ep}, Reward: {total_reward}, Epsilon: {epsilon:.2f}")


Episode 0, Reward: 15.0, Epsilon: 0.99
Episode 1, Reward: 18.0, Epsilon: 0.99
Episode 2, Reward: 17.0, Epsilon: 0.99
Episode 3, Reward: 20.0, Epsilon: 0.98
Episode 4, Reward: 16.0, Epsilon: 0.98
Episode 5, Reward: 25.0, Epsilon: 0.97
Episode 6, Reward: 15.0, Epsilon: 0.97
Episode 7, Reward: 15.0, Epsilon: 0.96
Episode 8, Reward: 12.0, Epsilon: 0.96
Episode 9, Reward: 25.0, Epsilon: 0.95
Episode 10, Reward: 10.0, Epsilon: 0.95
Episode 11, Reward: 13.0, Epsilon: 0.94
Episode 12, Reward: 19.0, Epsilon: 0.94
Episode 13, Reward: 12.0, Epsilon: 0.93
Episode 14, Reward: 13.0, Epsilon: 0.93
Episode 15, Reward: 21.0, Epsilon: 0.92
Episode 16, Reward: 13.0, Epsilon: 0.92
Episode 17, Reward: 16.0, Epsilon: 0.91
Episode 18, Reward: 12.0, Epsilon: 0.91
Episode 19, Reward: 19.0, Epsilon: 0.90
Episode 20, Reward: 12.0, Epsilon: 0.90
Episode 21, Reward: 9.0, Epsilon: 0.90
Episode 22, Reward: 18.0, Epsilon: 0.89
Episode 23, Reward: 9.0, Epsilon: 0.89
Episode 24, Reward: 23.0, Epsilon: 0.88
Episode 25, 
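
The epsilon values in the log follow directly from the decay rule in the training loop. The sketch below replays that rule on its own, with no environment or network, to show how far epsilon gets in 200 episodes. The function name `epsilon_schedule` is hypothetical; the default values are the hyperparameters from the code above.

```python
import math

def epsilon_schedule(episodes, epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Replay the per-episode decay rule and return epsilon after each episode."""
    values = []
    for _ in range(episodes):
        if epsilon > epsilon_min:
            epsilon *= epsilon_decay
        values.append(epsilon)
    return values

schedule = epsilon_schedule(200)
print(f"{schedule[0]:.2f}")    # 0.99, matches "Episode 0" in the log
print(f"{schedule[24]:.2f}")   # 0.88, matches "Episode 24" in the log
print(f"{schedule[199]:.2f}")  # 0.37, the value after the last of 200 episodes

# Number of decay steps needed before epsilon first reaches epsilon_min
episodes_to_min = math.ceil(math.log(0.01) / math.log(0.995))
print(episodes_to_min)         # 919
```

Under these settings the agent still acts randomly about a third of the time when training stops, so the floor of 0.01 is never reached in a 200-episode run. A faster decay or more episodes would be needed if near-greedy behaviour is wanted by the end.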

## Explanation in Plain English

1. Environment:
CartPole game → keep the pole balanced on the cart by pushing the cart left or right.

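2. Q-Network:
Small neural net with two hidden layers of 24 units each:

    * Input = state (4 numbers: cart position, cart velocity, pole angle, pole angular velocity)
    * Output = Q-values for each action (2 numbers: push left, push right)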

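3. Replay Memory:
Stores past experiences → (state, action, reward, next_state, done). It keeps the most recent 10,000 transitions, and each training step samples a random batch of 32 from them.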

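4. Epsilon-Greedy:
With probability ε the agent takes a random action; otherwise it takes the action with the highest Q-value. ε starts at 1.0 and is multiplied by 0.995 after each episode until it falls to about 0.01.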

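5. Training:

    * Predict Q(s, a) with the Q-network

    * Compute target = reward + γ * max(Q_target(s', a')) * (1 - done), so terminal transitions add no future value

    * Update the Q-network by minimizing the MSE loss between Q(s, a) and the target.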

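6. Target Network:
Copy of the Q-network that is used only to compute the targets. Its weights are refreshed from the Q-network every 10 episodes, so the targets do not shift after every update, which makes learning more stable.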
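
To make the target computation in step 5 concrete, here is a minimal pure-Python sketch that works through it for a two-transition batch. The helper name `td_targets` and the Q-values are made up for illustration; the notebook itself does the same arithmetic on PyTorch tensors inside `train`.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Return reward + gamma * max_a' Q_target(s', a') * (1 - done) per transition."""
    targets = []
    for reward, next_q, done in zip(rewards, next_q_values, dones):
        best_next = max(next_q)            # greedy value of the next state
        not_done = 1.0 - float(done)       # 0.0 for terminal transitions
        targets.append(reward + gamma * best_next * not_done)
    return targets

# Hypothetical batch: the second transition ends the episode.
rewards = [1.0, 1.0]
next_q_values = [[0.5, 2.0], [3.0, 1.0]]   # stand-ins for target-network outputs
dones = [False, True]

targets = td_targets(rewards, next_q_values, dones)
print([round(t, 2) for t in targets])      # [2.98, 1.0]
```

The first target bootstraps from the best next-state value (1 + 0.99 * 2.0), while the second is just the reward, because the `done` flag masks out the future term.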