In [3]:
# Import gym to create the CartPole environment
import gym

# Create the CartPole environment
env = gym.make('CartPole-v1')

# Check the observation and action spaces to understand the environment
print("Observation space:", env.observation_space)  # 4 continuous values (state)
print("Action space:", env.action_space)            # 2 discrete actions: 0 (left), 1 (right)



Observation space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Action space: Discrete(2)


### Code Explanation

- **`import gym`**:  
    Imports the **OpenAI Gym** library, which provides standard environments for reinforcement learning experiments.
    
- **`env = gym.make('CartPole-v1')`**:  
    Creates an instance of the **CartPole-v1** environment where the agent will attempt to balance a pole on a moving cart.
    
- **`print("Observation space:", env.observation_space)`**:  
    Displays the range and shape of the observation space.
    
    - The observation space consists of **4 continuous values**:
        
        - Cart position (within ±4.8 units)
            
        - Cart velocity (effectively unbounded; `±3.4028235e+38` is the largest finite `float32` value)
            
        - Pole angle (approximately ±0.418 radians, about ±24°)
            
        - Pole angular velocity (also effectively unbounded, shown with the same `float32` limit)
            
- **`print("Action space:", env.action_space)`**:  
    Displays the action space.
    
    - The environment has **2 discrete actions**:
        
        - `0` = push cart to the left
            
        - `1` = push cart to the right
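
The printed bounds can be cross-checked without creating the environment. The following is a minimal sketch using only NumPy; the constants are copied from the output above and the variable names are illustrative.

```python
import numpy as np

# Bounds copied from the printed observation space.
velocity_bound = np.float32(3.4028235e+38)
angle_bound_rad = 0.41887903

# The velocity "bound" is simply the largest finite float32 value,
# so the two velocity components are effectively unbounded.
float32_max = np.finfo(np.float32).max
velocities_unbounded = bool(velocity_bound == float32_max)

# The pole-angle bound is expressed in radians; convert to degrees.
angle_bound_deg = float(np.degrees(angle_bound_rad))

print(velocities_unbounded)
print(round(angle_bound_deg, 1))
```

If normalisation is needed later, only the position and angle have usable finite bounds; the velocities would need an empirically chosen scale.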
            

---

### Why this step

- It is critical to understand the environment’s **input (state)** and **output (action)** spaces before building a neural network.
    
- The **number of input features** (4) determines the input layer size.
    
- The **number of actions** (2) determines the number of outputs in the final layer of the Q-network.
    
- Understanding value ranges helps normalisation if needed and informs learning dynamics.
    

---

### Result

- The environment’s state consists of **4 continuous values**:
    
    - cart position, cart velocity, pole angle, and pole angular velocity.
        
- The environment allows **2 discrete actions**: move left (`0`) or right (`1`).
    

In [4]:
# Import TensorFlow and Keras layers for building the neural network
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Build the Q-network model
# Input: 4 features from the environment
# Output: 2 Q-values (one for each possible action)
model = Sequential([
    # First hidden layer with 24 neurons and ReLU activation
    Dense(24, input_shape=(4,), activation='relu'),
    
    # Second hidden layer with 24 neurons and ReLU activation
    Dense(24, activation='relu'),
    
    # Output layer with 2 neurons (Q-values for each action)
    Dense(2, activation='linear')
])

# Compile the model with mean squared error loss (for predicting Q-values)
# and Adam optimizer (adaptive, stable convergence)
model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))

# Display model summary to verify structure
model.summary()




### Code Explanation

- **`import tensorflow as tf` and layers**:  
    Imports TensorFlow and the necessary components from Keras for building the neural network.
    
- **`model = Sequential([...])`**:  
    Creates a **sequential neural network**, meaning layers are added one after the other.
    
- **`Dense(24, input_shape=(4,), activation='relu')`**:  
    The first fully connected (dense) layer with:
    
    - 24 neurons
        
    - `input_shape=(4,)` means it expects 4 input features (cart position, cart velocity, pole angle, pole angular velocity)
        
    - `relu` activation introduces non-linearity so the network can learn complex patterns.
        
- **Second Dense layer**:
    
    - Another hidden layer with 24 neurons and `relu` activation, allowing the network to learn more complex relationships between inputs and Q-values.
        
- **Output Dense layer**:
    
    - 2 neurons (each representing the Q-value for one of the two actions: move left or move right)
        
    - Uses `linear` activation (default), as Q-values are continuous numbers without bounds.
        
- **`model.compile(loss='mse', optimizer=Adam(...))`**:
    
    - **Loss**: mean squared error (MSE) is used because we are predicting continuous values (Q-values) and minimising the difference between predicted and target values.
        
    - **Adam optimizer**: an adaptive gradient-based optimisation method that adjusts the step size per parameter and often converges faster than plain stochastic gradient descent.
        
    - Learning rate is set to **0.001**, a standard starting point.
        
- **`model.summary()`**:  
    Displays the structure and parameter counts for each layer.
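
To make the layer shapes concrete, here is a minimal NumPy sketch of the same 4 → 24 → 24 → 2 forward pass. It is not the Keras model: the weights are random placeholders, and the helper names `init_layer` and `forward` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Return a (weights, bias) pair with placeholder random weights."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

# Same layer sizes as the Sequential model: 4 -> 24 -> 24 -> 2.
layers = [init_layer(4, 24), init_layer(24, 24), init_layer(24, 2)]

def forward(state):
    """Dense + ReLU on the hidden layers, linear output layer."""
    x = state
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:  # no activation on the output layer
            x = np.maximum(x, 0.0)
    return x

state = rng.normal(size=(1, 4))  # one state with four features
q_values = forward(state)
print(q_values.shape)            # one row holding two Q-values
```

The leading dimension of 1 is the batch axis, which is why the notebook later reshapes each state to `[1, 4]` before passing it to the model.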
    

---

### Why this step

- I need a function approximator (Q-network) that takes the environment state and predicts Q-values for both actions.
    
- Having two hidden layers of reasonable size (24 neurons) gives enough complexity to capture the patterns without making the model too large or prone to overfitting.
    
- Using **MSE loss** and **Adam** helps the model learn stable and accurate predictions of Q-values over many episodes.
    

---

### Result

- **120 parameters in the first layer**:
    
    - `(4 input features + 1 bias) * 24 neurons = 120`
        
- **600 parameters in the second layer**:
    
    - `(24 input neurons + 1 bias) * 24 neurons = 600`
        
- **50 parameters in the output layer**:
    
    - `(24 input neurons + 1 bias) * 2 output neurons = 50`
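
The same arithmetic can be checked in a few lines of plain Python. The helper `dense_params` is a hypothetical name used only for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_inputs + 1) * n_units

layer_sizes = [4, 24, 24, 2]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # [120, 600, 50]
print(total)   # 770
```

The total of 770 trainable parameters is what `model.summary()` should report for this architecture.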
        


In [5]:
# Import deque for creating a memory buffer
from collections import deque

# Set up replay memory to store past experiences (state, action, reward, next_state, done)
# This allows the agent to learn from past experiences and break correlation between steps
replay_memory = deque(maxlen=2000)

print("Replay memory created with capacity:", replay_memory.maxlen)

Replay memory created with capacity: 2000


### Code Explanation

- **`from collections import deque`**  
    Imports the `deque` data structure from Python’s `collections` module.
    
    - A `deque` (double-ended queue) allows for efficient addition and removal of elements from both ends.
        
- **`replay_memory = deque(maxlen=2000)`**  
    Creates a **replay memory buffer** with a maximum length of **2000 experiences**.
    
    - Each experience will later be stored as a tuple: `(state, action, reward, next_state, done)`
        
    - Once the buffer reaches 2000 entries, the oldest experiences are automatically removed to make room for new ones.
        
- **`print("Replay memory created with capacity:", replay_memory.maxlen)`**  
    Prints out the capacity of the replay memory to confirm that it has been created correctly.
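
The eviction behaviour is easiest to see with a tiny buffer. This sketch uses a capacity of 3 instead of 2000 and placeholder experience tuples; only the standard library is needed.

```python
import random
from collections import deque

# Small capacity so the eviction is easy to see (the notebook uses 2000).
memory = deque(maxlen=3)

for step in range(5):
    # Placeholder experience: (state, action, reward, next_state, done)
    memory.append((step, 0, 1.0, step + 1, False))

# Only the three most recent experiences remain.
stored_states = [experience[0] for experience in memory]
print(stored_states)  # [2, 3, 4]

# Uniform sampling without replacement, as used for replay training.
random.seed(0)
batch = random.sample(list(memory), 2)
print(len(batch))     # 2
```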
    

---

### Why this step

- The **replay memory** is crucial for Deep Q-Learning.
    
- Instead of learning from consecutive steps, the model will randomly sample experiences from this buffer.
    
- This random sampling breaks correlations, stabilises learning, and allows the neural network to generalise better.
    
- Using a fixed-size `deque` makes memory management automatic and efficient.
    

---

### Result

- Confirms that the replay memory is set up and ready to store up to 2000 past experiences.
    
- This buffer will be filled during training as the agent interacts with the environment.

In [6]:
# Set key hyperparameters for training

# Discount factor (gamma) — determines how much future rewards are valued
gamma = 0.95  

# Initial exploration rate (epsilon) — probability of choosing a random action
epsilon = 1.0  

# Minimum exploration rate — ensures some exploration continues throughout training
epsilon_min = 0.01  

# Decay rate for epsilon — reduces exploration over time as the agent learns
epsilon_decay = 0.995  

# Batch size for training from replay memory
batch_size = 64  

# Number of episodes (full runs of the environment) to train over
episodes = 500  

print("Hyperparameters set:")
print(f"gamma={gamma}, epsilon_start={epsilon}, epsilon_min={epsilon_min}, epsilon_decay={epsilon_decay}, batch_size={batch_size}, episodes={episodes}")


Hyperparameters set:
gamma=0.95, epsilon_start=1.0, epsilon_min=0.01, epsilon_decay=0.995, batch_size=64, episodes=500


### Code Explanation

- **`gamma = 0.95`**
    
    - The **discount factor**.
        
    - It determines how much future rewards are valued compared to immediate rewards.
        
    - A gamma of **0.95** means the agent values future rewards highly but slightly less than immediate rewards.
        
    - A higher gamma (close to 1) makes the agent plan long-term, while a lower gamma makes it short-sighted.
        
- **`epsilon = 1.0`**
    
    - The **initial exploration rate**.
        
    - This means the agent will start by taking actions completely at random (100%) to explore the environment.
        
    - High exploration at the start matters because the untrained network's Q-values are essentially arbitrary, so acting greedily on them would lock the agent into poorly informed choices.
        
- **`epsilon_min = 0.01`**
    
    - The **minimum exploration rate**.
        
    - Even after training for a while, the agent will still take random actions 1% of the time to avoid becoming too rigid and getting stuck in local optima.
        
- **`epsilon_decay = 0.995`**
    
    - The rate at which epsilon gradually decreases after each episode.
        
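    - A decay factor of **0.995** means that after each episode, epsilon is multiplied by 0.995, so it shrinks gradually from 1.0 towards the floor of 0.01. After 500 episodes it is about 0.08 (`0.995 ** 500`); reaching 0.01 would take roughly 920 episodes, so the floor is not hit in this run.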
        
    - This allows the agent to shift from exploration (random moves) to exploitation (taking the best-known actions).
        
- **`batch_size = 64`**
    
    - When training from the replay memory, the agent will randomly sample 64 experiences at a time to update the model.
        
    - This balances training stability and speed: smaller batches make each update cheaper but noisier, while larger batches give smoother gradient estimates at a higher cost per update.
        
- **`episodes = 500`**
    
    - The number of complete runs (episodes) of the environment the agent will train for.
        
    - Each episode runs until the pole falls or the time limit is reached.
        
    - 500 episodes is a reasonable budget for CartPole: usually enough to see clear improvement, although DQN training can be noisy and convergence is not guaranteed.
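
To see how quickly the exploration rate falls under these settings, the decay rule can be simulated on its own. This is a small sketch that applies one decay step per episode, the same rule the training loop uses; the 2000-episode horizon is an arbitrary choice for the simulation.

```python
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

episodes_to_floor = None   # first episode at which epsilon <= epsilon_min
epsilon_after_500 = None   # value at the end of the planned training run

for episode in range(1, 2001):
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
    if episode == 500:
        epsilon_after_500 = epsilon
    if episodes_to_floor is None and epsilon <= epsilon_min:
        episodes_to_floor = episode

print(round(epsilon_after_500, 3))  # 0.082
print(episodes_to_floor)            # 919
```

A faster decay (for example 0.99) or a longer run would be needed if the agent is meant to spend a meaningful part of training near the 1% floor.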
        

---

### Why this step

- Hyperparameters control how the agent learns.
    
- They balance exploration vs. exploitation, weigh long-term against short-term rewards, and affect learning speed and stability.
    
- Without carefully chosen hyperparameters, the agent can:
    
    - Fail to learn (too little exploration or short-sighted gamma).
        
    - Overexplore and not exploit (high epsilon without decay).
        
    - Become stuck or slow (poor batch sizes).
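
The effect of the discount factor can be made concrete with a little arithmetic. The sketch below assumes a reward of +1 per step, as CartPole gives, and an illustrative episode length of 200 steps.

```python
gamma = 0.95

# Weight applied to a reward received k steps in the future.
weights = {k: gamma ** k for k in (1, 10, 50)}

# Discounted return of an episode that earns +1 on each of 200 steps.
discounted_return = sum(gamma ** k for k in range(200))

# Limit of that sum for an infinitely long episode: 1 / (1 - gamma).
return_limit = 1.0 / (1.0 - gamma)

print({k: round(w, 3) for k, w in weights.items()})
print(round(discounted_return, 2), round(return_limit, 2))
```

With `gamma = 0.95` a reward 10 steps ahead counts for about 60% of an immediate one and a reward 50 steps ahead for under 8%, so Q-values in this setup stay below roughly 20 no matter how long the pole is balanced.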
        

---

### Result

- Confirms that all learning parameters are correctly set.
    
- The agent will start exploring randomly, slowly learn to exploit actions that lead to high rewards, and value future rewards while training in batches of 64.

In [7]:
import numpy as np

# Function to choose an action based on the current state
def choose_action(state, epsilon):
    # With probability epsilon, take a random action
    if np.random.rand() <= epsilon:
        return env.action_space.sample()
    else:
        # Predict the Q-values for the state and choose the action with the highest Q-value (exploitation)
        q_values = model.predict(state, verbose=0)
        return np.argmax(q_values[0])  # Return the action with the highest predicted Q-value


### Code Explanation

- **`def choose_action(state, epsilon):`**  
    Defines a function that takes in the current environment state and the current exploration rate `epsilon`.
    
- **`if np.random.rand() <= epsilon:`**
    
    - Generates a random number between 0 and 1.
        
    - If it’s less than or equal to `epsilon`, the agent takes a **random action** — this is exploration.
        
    - High `epsilon` means more frequent random moves at the beginning.
        
- **`env.action_space.sample()`**
    
    - This randomly picks an action from the environment's action space.
        
    - In CartPole, this is either **0 (move left)** or **1 (move right)**.
        
- **`else:`**
    
    - If the random number is greater than `epsilon`, the agent uses the model to predict the best action — exploitation.
        
- **`q_values = model.predict(state, verbose=0)`**
    
    - The Q-network predicts the Q-values, i.e. the expected discounted cumulative reward, for both actions given the current state.
        
    - Output will be an array with two values: one for action 0, one for action 1.
        
- **`return np.argmax(q_values[0])`**
    
    - The action corresponding to the highest Q-value is chosen, meaning the model selects the action it believes will lead to the best future reward.
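
The selection rule can be tried in isolation from the network. In this sketch a fixed array stands in for the output of `model.predict`, and `epsilon_greedy` is a hypothetical helper that takes its own random generator so the behaviour is reproducible.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.2, 0.7])  # placeholder Q-values for actions 0 and 1

# epsilon = 0: always the action with the highest Q-value (action 1 here).
greedy_actions = [epsilon_greedy(q_values, 0.0, rng) for _ in range(100)]

# epsilon = 1: purely random, so both actions appear.
random_actions = [epsilon_greedy(q_values, 1.0, rng) for _ in range(1000)]

print(set(greedy_actions))
print(sorted(set(random_actions)))
```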
        

---

### Why this step

- The agent needs to balance **exploration** (trying new actions) and **exploitation** (choosing the best-known action).
    
- This function applies the **epsilon-greedy strategy**, the standard method for balancing these two.
    
- By wrapping it in a function, the main loop stays clean, and it’s easy to tweak or replace the action-selection logic if needed.
    

In [8]:
# Function to train the model using random samples from the replay memory
def replay(batch_size):
    # Don't train until there are enough experiences to fill a batch
    if len(replay_memory) < batch_size:
        return
    
    # Randomly sample a batch of experiences from the memory
    minibatch = np.random.choice(len(replay_memory), batch_size, replace=False)
    
    # Prepare lists for states and target Q-values
    states = []
    targets = []
    
    # Process each sampled experience
    for index in minibatch:
        state, action, reward, next_state, done = replay_memory[index]
        
        # Predict current Q-values for the state
        target = model.predict(state, verbose=0)[0]
        
        # If the episode is done, set Q-value for that action to reward only
        if done:
            target[action] = reward
        else:
            # Predict Q-values for the next state and use max Q-value for target update
            t = model.predict(next_state, verbose=0)[0]
            target[action] = reward + gamma * np.amax(t)
        
        states.append(state[0])
        targets.append(target)
    
    # Train the model on the entire batch in one go
    model.fit(np.array(states), np.array(targets), epochs=1, verbose=0)


### Code Explanation

- `def replay(batch_size):`  
    This function takes a `batch_size` and trains the model by using random experiences from the replay memory.
    
- `if len(replay_memory) < batch_size: return`  
    It first checks if there are enough experiences in memory. If not, it returns.
    
- `minibatch = np.random.choice(len(replay_memory), batch_size, replace=False)`  
    Randomly selects a set of indices from the replay memory without repetition, creating a mini-batch of experiences to learn from.
    
- `states = []` and `targets = []`  
    These empty lists will store input states and the corresponding target Q-values for training.
    
- `for index in minibatch:`  
    Iterates over each selected experience.
    
    - `state, action, reward, next_state, done = replay_memory[index]` extracts each experience tuple.
        
    - `target = model.predict(state, verbose=0)[0]` gets the current Q-values for the given state.
        
    - If the episode has ended (`done` is `True`), the Q-value for that action is set to the immediate reward.
        
    - Otherwise, it is set to the reward plus the discounted value of the best next action: `reward + gamma * max predicted Q-value of the next state`.
        
    - The state and calculated target Q-values are stored for batch training.
        
- `model.fit(np.array(states), np.array(targets), epochs=1, verbose=0)`  
    After processing the entire batch, the model is trained on these states and updated Q-values all at once for one epoch.
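
The per-experience target update can also be written for the whole batch at once. The sketch below is not the notebook's `replay` function: it uses small hand-picked arrays in place of the two `model.predict` calls so that the arithmetic can be followed by eye.

```python
import numpy as np

gamma = 0.95

# Placeholder batch of two transitions (values chosen for easy arithmetic).
actions = np.array([0, 1])
rewards = np.array([1.0, 1.0])
dones = np.array([False, True])

# Stand-ins for model.predict(states) and model.predict(next_states).
q_current = np.array([[0.1, 0.2], [0.3, 0.4]])
q_next = np.array([[0.5, 2.0], [3.0, 1.0]])

# reward + gamma * max Q(next_state), with the future term dropped
# for transitions that ended the episode.
td_targets = rewards + gamma * q_next.max(axis=1) * (~dones)

# Only the Q-value of the action actually taken is overwritten.
targets = q_current.copy()
targets[np.arange(len(actions)), actions] = td_targets
print(targets)
```

The first row becomes `[2.9, 0.2]` (1 + 0.95 × 2.0 for action 0) and the second `[0.3, 1.0]` (reward only, because the episode ended). Predicting on the stacked states in two calls instead of two calls per experience is a common way to speed up this step.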
    

---

### Why this step

- This function allows the neural network to learn from many past experiences, not just the most recent action.
    
- It helps **break the correlation** between consecutive experiences by training on random samples.
    
- This stabilises learning and prevents the model from overfitting to patterns in sequential data.
    
- The logic for updating Q-values follows the **Bellman Equation**, using the reward and the estimated value of the best action in the next state.
    
- By training on batches, we allow more stable, generalised learning rather than on single data points.
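
Written out, the target that `replay` computes for a sampled transition $(s, a, r, s', \text{done})$ is the one-step Q-learning target, where $\gamma$ is the discount factor `gamma` and $Q$ is the current network:

$$
y =
\begin{cases}
r & \text{if done} \\
r + \gamma \max_{a'} Q(s', a') & \text{otherwise}
\end{cases}
$$

The MSE loss then pulls $Q(s, a)$ towards $y$. The output for the other action is left at its current prediction, so it contributes no error for that sample.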

In [None]:
# Main training loop
for episode in range(episodes):
    state = env.reset()[0]  # Reset environment at start of episode
    state = np.reshape(state, [1, 4])  # Reshape for model input
    total_reward = 0  # Track total reward for the episode
    
    for step in range(500):  # Limit steps to avoid infinite loops
        # Choose action based on current state and epsilon
        action = choose_action(state, epsilon)
        
        # Take action in environment
        next_state, reward, done, _, _ = env.step(action)
        next_state = np.reshape(next_state, [1, 4])
        
        # Store experience in replay memory
        replay_memory.append((state, action, reward, next_state, done))
        
        # Update state
        state = next_state
        total_reward += reward
        
        # Train the network from memory
        replay(batch_size)
        
        if done:
            break
    
    # Decay epsilon to reduce exploration over time
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
    
    # Print progress every 50 episodes
    if (episode + 1) % 50 == 0:
        print(f"Episode: {episode + 1}/{episodes}, Total Reward: {total_reward}, Epsilon: {epsilon:.3f}")


### Code Explanation

- `for episode in range(episodes):`  
    Starts the main training loop, running for a set number of episodes. Each episode represents one full run of the environment until the pole falls or the time limit is reached.
    
- `state = env.reset()[0]`  
    Resets the environment to its starting conditions and returns the initial state.  
    The `[0]` index is used because `env.reset()` returns a tuple (state, info).
    
- `state = np.reshape(state, [1, 4])`  
    Reshapes the state array to the correct shape for feeding into the neural network (1 row, 4 features).
    
- `total_reward = 0`  
    Initialises a counter to track the total reward earned during the episode.
    

---

- `for step in range(500):`  
    Limits each episode to 500 steps, which matches the 500-step time limit of CartPole-v1, so an agent that keeps the pole balanced still finishes the episode.
    
- `action = choose_action(state, epsilon)`  
    Uses the previously defined epsilon-greedy function to choose the next action.
    
    - Either a random action (exploration) or the model’s predicted best action (exploitation).
        
- `next_state, reward, done, _, _ = env.step(action)`  
    Executes the chosen action in the environment.
    
    - `next_state`: the new environment state.
        
    - `reward`: the reward given for the action.
        
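    - `done`: receives the environment's `terminated` flag, which is `True` when the pole has fallen or the cart has left the track.
        
    - `_`: placeholders for the two remaining return values (`truncated` and `info`), which are not used here.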
        
- `next_state = np.reshape(next_state, [1, 4])`  
    Reshapes the next state to be compatible with the model input format.
    
- `replay_memory.append((state, action, reward, next_state, done))`  
    Stores the experience in the replay buffer for future training.
    
- `state = next_state`  
    Updates the current state for the next step.
    
- `total_reward += reward`  
    Adds the reward from this step to the running total.
    
- `replay(batch_size)`  
    Calls the replay function to train the model on a random batch of past experiences.
    
- `if done: break`  
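    Ends the episode once `done` is `True`, i.e. when the pole has fallen or the cart has left the track. The time limit is not signalled through `done` because the `truncated` flag is discarded; it is enforced by the 500-step bound on the inner loop.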
    

---

- `if epsilon > epsilon_min: epsilon *= epsilon_decay`  
    Reduces the value of epsilon after each episode, gradually shifting from exploration to exploitation.
    
- `if (episode + 1) % 50 == 0:`  
    Every 50 episodes, prints out the progress:
    
    - Episode number
        
    - Total reward earned in that episode
        
    - Current value of epsilon (exploration rate).
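
The loop above relies on the five-value return of `env.step`. As a sketch of how the `terminated` and `truncated` flags could both be handled explicitly, the example below runs the same loop structure against `StubEnv`, a hypothetical stand-in class written for this illustration (it is not part of Gym and has no physics).

```python
class StubEnv:
    """Hypothetical stand-in that mimics the five-value step() API."""

    def __init__(self, fail_at=None, time_limit=5):
        self.fail_at = fail_at        # step at which the "pole falls", if any
        self.time_limit = time_limit  # step at which the episode is cut off
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0], {}

    def step(self, action):
        self.t += 1
        terminated = self.fail_at is not None and self.t >= self.fail_at
        truncated = not terminated and self.t >= self.time_limit
        return [0.0, 0.0, 0.0, 0.0], 1.0, terminated, truncated, {}


def run_episode(env):
    state, _info = env.reset()
    total_reward, transitions = 0.0, []
    while True:
        action = 0  # fixed action; a real agent would call choose_action
        next_state, reward, terminated, truncated, _info = env.step(action)
        # Store `terminated` only: a time-limit cut-off is not a failure,
        # so the target for that transition should still bootstrap.
        transitions.append((state, action, reward, next_state, terminated))
        total_reward += reward
        state = next_state
        if terminated or truncated:
            break
    return total_reward, transitions


fell_reward, fell = run_episode(StubEnv(fail_at=3))
timed_reward, timed = run_episode(StubEnv())
print(fell_reward, fell[-1][4])    # 3.0 True
print(timed_reward, timed[-1][4])  # 5.0 False
```

Ending the episode on `terminated or truncated` removes the need for a hard-coded step bound, while storing only `terminated` keeps time-limit cut-offs from being learned as failures.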
        

---

### Why this step

- This loop is the core of reinforcement learning, where the agent:
    
    - Observes the state
        
    - Takes actions
        
    - Collects rewards
        
    - Stores experience
        
    - Learns from past experiences via replay
        
    - Gradually becomes more confident (reducing exploration with epsilon decay)
        
- Without this loop, the neural network would not learn how to maximise the reward or improve its Q-value predictions.
    
- Limiting episode length bounds the cost of each episode, and decaying epsilon moves the agent from exploration towards exploitation as its Q-value estimates improve.