<a href="https://colab.research.google.com/github/EngrIBGIT/Reinforcment_Learning/blob/main/Introduction_to_Deep_Q_Networks_(DQN).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning in More Complex Environments:

**Introduction to Deep Q-Networks (DQN)**

As environments become more complex, the state space (the number of possible states the agent can encounter) grows, and traditional tabular Q-Learning, which stores a separate value for every state-action pair, becomes impractical.

This is where Deep Q-Networks (DQN) come in, which use a neural network to approximate the Q-values for each action in a given state.

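DQN allows reinforcement learning to scale to environments with large or continuous state spaces, such as video games and some robotic control tasks. Standard DQN still assumes a discrete set of actions, since the network outputs one Q-value per action.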

## Overview of the DQN Process:

`Agent:` Learns by interacting with the environment.

`Environment:` Provides feedback in the form of rewards or penalties based on the agent's actions.

`Action Space:` All possible actions the agent can take.

`Q-Network:` A deep neural network that approximates the Q-values (expected future rewards for each action).

`Epsilon-Greedy Exploration:` A strategy to balance exploration and exploitation.

`Replay Buffer:` A memory of previous experiences (state, action, reward, next state, and whether the episode ended) that is sampled at random to train the Q-network.

`Target Network:` A copy of the Q-network whose weights are held fixed between syncs, which stabilizes training because the target Q-values do not shift on every update.
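
The way the Q-network and the target network work together can be written compactly. As a sketch, let $\theta$ denote the Q-network weights and $\theta^-$ the target-network weights (symbols introduced here for illustration, not taken from the notebook). For a stored transition $(s, a, r, s', \text{done})$ with discount factor $\gamma$, the training target and squared-error loss are:

$$
y =
\begin{cases}
r & \text{if done} \\
r + \gamma \max_{a'} Q(s', a'; \theta^-) & \text{otherwise}
\end{cases}
\qquad
L(\theta) = \bigl(y - Q(s, a; \theta)\bigr)^2
$$

Only $\theta$ is adjusted by gradient descent; $\theta^-$ changes only when the target network is synced with the Q-network.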

## Key Differences Between Q-Learning and DQN

`Q-Learning:` Stores all Q-values in a table, which works well for small state spaces but becomes impractical for complex environments.

`DQN:` Uses a neural network to estimate Q-values, allowing it to handle much larger state spaces by generalizing over similar states.
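
For contrast, here is a minimal sketch of the tabular update that DQN replaces. The function name `q_learning_update`, the learning rate `alpha=0.1`, and the integer state labels are illustrative assumptions, not part of the notebook. It shows why the table approach needs one entry per state-action pair:

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, done,
                      n_actions, alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Value of the best action in the next state (0 if the episode ended)
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    # Move the stored value a fraction alpha toward the target
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Hypothetical example: two discrete states (0 and 1) and two actions
q_table = defaultdict(float)  # unseen (state, action) pairs default to 0.0
new_value = q_learning_update(q_table, state=0, action=1, reward=1.0,
                              next_state=1, done=False, n_actions=2)
print(new_value)  # 0.1
```

With continuous observations such as CartPole's, almost every state is unique, so a table like this would never revisit an entry. A network that generalizes across similar states avoids that problem.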

## Setting Up a DQN for a Complex Environment (OpenAI Gym: CartPole)

This walkthrough uses the CartPole environment from OpenAI Gym, where the agent moves a cart left or right to keep a pole balanced on top of it. The goal is to keep the pole upright for as long as possible.

### Step 1: Install Required Libraries

* `gym:` To access environments (like CartPole).

* `tensorflow and keras:` To build the deep neural network for DQN.

* `matplotlib:` To visualize the agent's learning progress.

In [None]:
pip install gym numpy tensorflow keras matplotlib



### Step 2: Create the Environment

In CartPole, the state consists of 4 continuous values (cart position, cart velocity, pole angle, and pole angular velocity), and there are 2 discrete actions (push the cart left or right).

In [None]:
import gym
import numpy as np
import random
import tensorflow as tf
from tensorflow.keras import layers, models
from collections import deque

# Create the CartPole environment
env = gym.make("CartPole-v1")

# Environment parameters
state_size = env.observation_space.shape[0]  # 4 continuous state variables
action_size = env.action_space.n  # 2 actions: left or right

# Display the action and state space
print(f"State Size: {state_size}")
print(f"Action Size: {action_size}")


State Size: 4
Action Size: 2




### Step 3: Build the Q-Network

We’ll create a neural network that takes the state as input and outputs the Q-values for each action.

* `Input Layer:` Takes the state (4 values in CartPole).

* `Hidden Layers:` Two fully connected layers with ReLU activation (helps the network learn complex patterns).

* `Output Layer:` Outputs Q-values for each action (left or right).
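
To get a feel for how small this network is, the following sketch counts its trainable parameters. It assumes hidden layers of 24 units each, as in the code for this step. The helper `dense_param_count` is a hypothetical name introduced for illustration:

```python
def dense_param_count(layer_sizes):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


# 4 state inputs -> 24 hidden -> 24 hidden -> 2 Q-value outputs
layer_sizes = [4, 24, 24, 2]
total_params = dense_param_count(layer_sizes)
print(total_params)  # 770
```

The same figure should appear in the output of `model.summary()` for a network with these layer sizes.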

In [None]:
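def build_model(state_size, action_size):
    """Build a small fully connected network mapping a state to one Q-value per action."""
    model = models.Sequential()
    model.add(layers.Input(shape=(state_size,)))  # explicit input layer instead of `input_dim`
    model.add(layers.Dense(24, activation="relu"))
    model.add(layers.Dense(24, activation="relu"))
    model.add(layers.Dense(action_size, activation="linear"))  # linear output: Q-values are unbounded
    model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
    return model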

# Build the Q-network
q_network = build_model(state_size, action_size)

# Build the target network (to stabilize training)
target_network = build_model(state_size, action_size)
target_network.set_weights(q_network.get_weights())  # Initialize with the same weights



### Step 4: Implement Replay Buffer and Exploration Strategy

A `replay buffer` stores experiences that the agent encounters during training. The agent uses `epsilon-greedy` exploration to balance exploration (trying new actions) with exploitation (choosing the best-known actions).

* `Epsilon-greedy:` The agent starts by exploring (choosing random actions) but gradually shifts to exploiting its knowledge (choosing actions with the highest Q-values).

* `Replay Buffer:` Stores past experiences for training the Q-network.

* `Q-Network Update:` Updates the Q-values by training the network on samples from the replay buffer.
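
The two mechanisms above can be tried out without a neural network. The sketch below uses only the standard library. The decay values (start at 1.0, multiply by 0.995, floor at 0.01) mirror those in the agent code for this step, while the function names and the tiny buffer size of 3 are illustrative assumptions:

```python
import random
from collections import deque


def steps_until_min_epsilon(epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Count multiplicative decay steps until epsilon stops decaying."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        steps += 1
    return steps


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


decay_steps = steps_until_min_epsilon()
print(decay_steps)  # 919

# A bounded deque silently drops the oldest experience once it is full
buffer = deque(maxlen=3)
for i in range(5):
    buffer.append(i)
print(list(buffer))  # [2, 3, 4]
```

Under these settings, exploration is almost fully switched off after roughly 900 decay steps. That is worth keeping in mind if decay is applied on every training step rather than once per episode.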

In [None]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.q_network = build_model(state_size, action_size)
        self.target_network = build_model(state_size, action_size)
        self.target_network.set_weights(self.q_network.get_weights())
        self.memory = deque(maxlen=2000)  # Replay buffer
        self.epsilon = 1.0  # Exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.gamma = 0.95  # Discount factor
        self.batch_size = 32  # Number of samples to train at a time

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

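    def choose_action(self, state):
        # Explore: with probability epsilon, pick a random action
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        # Exploit: pick the action with the highest predicted Q-value
        q_values = self.q_network.predict(state, verbose=0)  # verbose=0 silences the per-call progress bar
        return np.argmax(q_values[0])

    def replay(self):
        # Wait until the buffer holds at least one full batch
        if len(self.memory) < self.batch_size:
            return
        batch = random.sample(self.memory, self.batch_size)
        for state, action, reward, next_state, done in batch:
            # Target: reward plus the discounted best next-state value from the target network
            target = reward
            if not done:
                target += self.gamma * np.amax(self.target_network.predict(next_state, verbose=0)[0])
            # Only the taken action's Q-value changes; the other outputs keep their current predictions
            target_f = self.q_network.predict(state, verbose=0)
            target_f[0][action] = target
            self.q_network.fit(state, target_f, epochs=1, verbose=0)
        # Decay epsilon once per replay call
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay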

    def update_target_network(self):
        self.target_network.set_weights(self.q_network.get_weights())
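
To make the target computed inside `replay` concrete, here is a worked example with plain numbers. The helper `dqn_target` and the sample Q-values are hypothetical; the discount factor 0.95 matches `self.gamma` above:

```python
def dqn_target(reward, next_q_values, done, gamma=0.95):
    """Return the regression target for the action that was taken."""
    if done:
        return reward  # no future reward after a terminal step
    return reward + gamma * max(next_q_values)


# Target network predicts Q-values [0.5, 2.0] for the next state
non_terminal = dqn_target(1.0, [0.5, 2.0], done=False)
terminal = dqn_target(1.0, [0.5, 2.0], done=True)
print(non_terminal)  # 2.9
print(terminal)      # 1.0
```

In the non-terminal case the target is 1.0 + 0.95 × 2.0 = 2.9. In the terminal case the next-state estimate is ignored entirely.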

### Step 5: Train the Agent

Let’s train the DQN agent to balance the pole by interacting with the CartPole environment.

* `Training Episodes:` The agent interacts with the environment for a set number of episodes.

* `Timestep Count:` Measures how long the agent can balance the pole in each episode (higher is better).

* `Epsilon Decay:` As the agent learns, epsilon decreases, so it gradually explores less and exploits more.

In [None]:
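agent = DQNAgent(state_size, action_size)

episodes = 1000
timesteps_list = []

for episode in range(episodes):
    # gym < 0.26 returns the observation; gym >= 0.26 returns (observation, info)
    reset_out = env.reset()
    state = reset_out[0] if isinstance(reset_out, tuple) else reset_out
    state = np.reshape(state, [1, state_size])
    total_reward = 0
    for time in range(500):  # CartPole-v1 caps an episode at 500 steps
        action = agent.choose_action(state)
        # gym < 0.26 returns 4 values; gym >= 0.26 splits `done` into terminated and truncated
        step_out = env.step(action)
        next_state, reward = step_out[0], step_out[1]
        done = step_out[2] if len(step_out) == 4 else (step_out[2] or step_out[3])
        next_state = np.reshape(next_state, [1, state_size])
        agent.store(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        if done:
            agent.update_target_network()  # sync the target network once per episode
            timesteps_list.append(time + 1)  # `time` is zero-based, so add 1 for the step count
            print(f"Episode: {episode}, Timesteps: {time + 1}, Epsilon: {agent.epsilon:.2f}")
            break
        agent.replay()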


### Step 6: Visualize the Learning Progress

We can visualize how well the agent learns over time by plotting the number of timesteps it survived in each episode.

If training goes well, the plot should trend upward as the agent learns to balance the pole for longer, although DQN learning curves are typically noisy and individual episodes can dip sharply.

In [None]:
import matplotlib.pyplot as plt

plt.plot(timesteps_list)
plt.xlabel('Episode')
plt.ylabel('Timesteps Survived')
plt.title('DQN Agent Learning Progress in CartPole')
plt.show()
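
Because per-episode results are noisy, a smoothed curve often makes the trend easier to read. The following is a minimal sketch of a trailing moving average in plain Python. The helper name `moving_average` and the sample numbers are assumptions for illustration only:

```python
def moving_average(values, window):
    """Return the trailing mean over `window` items; early entries use fewer."""
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the value that left the window
        averages.append(running_sum / min(i + 1, window))
    return averages


smoothed = moving_average([10, 20, 30, 40], window=2)
print(smoothed)  # [10.0, 15.0, 25.0, 35.0]
```

In the notebook, something like `plt.plot(moving_average(timesteps_list, window=20))` could be drawn on top of the raw curve. The window of 20 is an arbitrary choice, not a tuned value.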

### Summary of Steps:

1. `Set Up Environment:` Used CartPole from OpenAI Gym.

2. `Build Q-Network:` Created a deep neural network to estimate Q-values for each state-action pair.

3. `Replay Buffer and Exploration:` Stored past experiences for learning and used epsilon-greedy strategy to balance exploration and exploitation.

4. `Training:` Trained the agent using experiences sampled from the replay buffer and synced the target network with the Q-network at the end of each episode.

5. `Visualization:` Showed the agent’s learning progress using a line plot of timesteps survived.

**Conclusion:**

This is a simplified version of DQN. For more advanced techniques, we could introduce improvements like Double DQN, Prioritized Experience Replay, or use convolutional networks for image-based environments like Atari games.
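
As a taste of the first of these, the sketch below shows how a Double DQN target differs from the one used in this notebook: the online Q-network chooses the next action, and the target network evaluates it. The function name and sample Q-values are hypothetical, and this is only an illustration of the idea, not a drop-in change to the agent above:

```python
def double_dqn_target(reward, online_next_q, target_next_q, done, gamma=0.95):
    """Select the next action with the online network, evaluate it with the target network."""
    if done:
        return reward
    best_action = max(range(len(online_next_q)), key=lambda a: online_next_q[a])
    return reward + gamma * target_next_q[best_action]


# Online network prefers action 1; target network rates that action at 1.0
ddqn_value = double_dqn_target(1.0, [0.2, 0.9], [3.0, 1.0], done=False)
print(ddqn_value)  # 1.95
```

With these numbers, taking the plain maximum over the target network's outputs would have used 3.0 instead of 1.0. Decoupling selection from evaluation is intended to reduce that kind of overestimation.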