# Implementing Q-Learning in Keras

- Implement a Q-Learning algorithm using Keras
- Define and train a neural network to approximate the Q-values
- Evaluate the performance of the trained Q-Learning agent

#### Step 1: Setting Up the Environment 

We will set up the environment using Gymnasium, the maintained successor to the OpenAI Gym library. We will use the 'CartPole-v1' environment, a common benchmark for reinforcement learning algorithms.

In [179]:
%pip install gym
%pip install gymnasium


Note: you may need to restart the kernel to use updated packages.
Collecting gymnasium
  Downloading gymnasium-1.1.1-py3-none-any.whl.metadata (9.4 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl.metadata (558 bytes)
Downloading gymnasium-1.1.1-py3-none-any.whl (965 kB)
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, gymnasium


##### Set the Recursion Limit

We can also try setting the recursion limit explicitly, although this is generally more of a workaround than a solution.

In [180]:
import sys
sys.setrecursionlimit(1500)

import gymnasium as gym
import numpy as np

# Create the environment 
env = gym.make('CartPole-v1')

# Set random seed for reproducibility
np.random.seed(42)
env.action_space.seed(42)
env.observation_space.seed(42)

42

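- `gymnasium` (imported here as `gym`) is a toolkit for developing and comparing reinforcement learning algorithms.
- `CartPole-v1` is an environment where a pole is balanced on a cart, and the goal is to prevent the pole from falling over.
- Setting random seeds makes runs easier to reproduce. The seeds above cover NumPy and the environment's action and observation spaces; Python's `random` module, TensorFlow, and the environment's initial state are not seeded here, so results can still differ between runs.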
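The replay buffer later samples with Python's `random` module, so that generator matters for reproducibility as well. The sketch below seeds the two generators it can reach with the standard library and NumPy alone. `seed_everything` is a hypothetical helper name, not part of the notebook, and the commented lines only name the analogous TensorFlow and Gymnasium calls.

```python
import random
import numpy as np

def seed_everything(seed: int) -> None:
    """Seed the Python and NumPy global generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    # In the notebook you would likely also want:
    #   tf.random.set_seed(seed)          # Keras weight initialization
    #   state, _ = env.reset(seed=seed)   # initial CartPole state

seed_everything(42)
first = (random.randrange(2), np.random.rand())
seed_everything(42)
second = (random.randrange(2), np.random.rand())
print(first == second)  # True: same seed, same draws
```

Calling the helper once at the top of the notebook, before building the model, is the usual placement.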

#### Step 2: Define the Q-Learning Model

We will define a neural network using Keras to approximate the Q-values. The network will take the state as input and output Q-values for each action.

In [187]:
# Suppress warnings for a cleaner notebook experience
import warnings 
warnings.filterwarnings('ignore')

# Override the default warning function
def warn(*args, **kwargs):
    pass
warnings.warn = warn

# Import necessary libraries for the Q-Learning model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

# Define the model building function 
def build_model(state_size, action_size):
    model = Sequential()
    model.add(Input(shape=(state_size,))) # Input layer; the trailing comma makes the shape a tuple
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
    return model

# Create the environment and set up the model
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
model = build_model(state_size, action_size)



- Sequential model: a linear stack of layers in Keras
- Dense layers: fully connected layers
- Input layer: declares the shape of the input, which corresponds to the state size.
- activation='relu': Rectified Linear Unit activation function.
- activation='linear': linear activation function for the output layer, as we are predicting continuous Q-values.
- Adam optimizer: an optimization algorithm that adapts the step size for each weight using running averages of the gradients.
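To make the shapes concrete, here is a minimal NumPy sketch of the same 4 → 24 → 24 → 2 architecture. It is not the Keras model itself: the weights are random stand-ins, and `forward` and `params` are names chosen for illustration only.

```python
import numpy as np

state_size, action_size, hidden = 4, 2, 24  # CartPole-v1 sizes, 24 hidden units as above
rng = np.random.default_rng(0)

# One (weights, bias) pair per Dense layer; random values stand in for trained weights.
layer_shapes = [(state_size, hidden), (hidden, hidden), (hidden, action_size)]
params = [(rng.normal(scale=0.1, size=shape), np.zeros(shape[1])) for shape in layer_shapes]

def forward(states: np.ndarray) -> np.ndarray:
    """Map a batch of states, shape (batch, 4), to Q-values, shape (batch, 2)."""
    x = states
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:   # ReLU on the hidden layers only
            x = np.maximum(x, 0.0)
    return x                      # linear output: unbounded Q-values

q_values = forward(np.zeros((1, state_size)))
n_params = sum(w.size + b.size for w, b in params)
print(q_values.shape, n_params)  # (1, 2) 770
```

The count of 770 trainable parameters (120 + 600 + 50) should match what `model.summary()` reports for the Keras model.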

#### Step 3: Implement the Q-Learning Algorithm

Now, we will implement the Q-Learning algorithm, which involves interacting with the environment, updating the Q-values, and training the neural network.

In [191]:
import random
import numpy as np
from collections import deque
import tensorflow as tf

# Define epsilon and epsilon_decay
epsilon = 1.0 # Starting with a high exploration rate
epsilon_min = 0.01 # Minimum exploration rate
epsilon_decay = 0.99 # Multiply epsilon by this factor after each training step

# Replay memory
memory = deque(maxlen=2000)

# our agent stores experiences like "I was in situation X, took action Y, got reward Z, and ended up in situation W"
def remember(state, action, reward, next_state, done):
    # Store the experience in memory.
    memory.append((state, action, reward, next_state, done))

def replay(batch_size=128):
    # Train the model using a random sample of experiences from memory.
    if len(memory) < batch_size:
        return # Skip replay if there's not enough experience
    
    minibatch = random.sample(memory, batch_size) # Sample a random batch from memory

    # Extract information for batch processing 
    states = np.vstack([x[0] for x in minibatch])
    actions = np.array([x[1] for x in minibatch])
    rewards = np.array([x[2] for x in minibatch])
    next_states = np.vstack([x[3] for x in minibatch])
    dones = np.array([x[4] for x in minibatch])

    # Predict Q-values for the next states in batch
    q_next = model.predict(next_states, verbose=0)
    # Predict Q-values for the current states in batch
    q_target = model.predict(states, verbose=0)

    # Build the target for each sampled experience
    for i in range(batch_size):
        target = rewards[i]
        if not dones[i]:
            target += 0.95 * np.amax(q_next[i]) # Add the discounted future reward
        q_target[i][actions[i]] = target  # Update only the taken action's Q-value

    # Train the model on the updated targets in one batch
    model.fit(states, q_target, epochs=1, verbose=0)

    # Reduce exploration rate (epsilon) after each training step
    global epsilon
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

# High epsilon = mostly random actions (exploration)
# Low epsilon = mostly "smart" actions based on what it learned    
def act(state):
    # Choose an action based on the current state and exploration rate.
    if np.random.rand() <= epsilon:
        return random.randrange(action_size) # Explore: choose a random action
    act_values = model.predict(state, verbose=0) # Exploit: predict Q-values for the current state
    return np.argmax(act_values[0]) # Return the action with the highest Q-value

# Define the number of episodes we want to train the model for
episodes = 100 # We can set this to any number we prefer
train_frequency = 5 # Train the model every 5 steps

for e in range(episodes):
    state, _ = env.reset() # Unpack the tuple returned by env.reset()
    state = np.reshape(state, [1, state_size])
    for time in range(200): # Limit to 200 time steps per episode
        action = act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        remember(state, action, reward, next_state, done) # Store experience
        state = next_state

        if done:
            print(f"episode: {e+1}/{episodes}, score: {time}, e: {epsilon:.2}")
            break

        # Train the model every 'train_frequency' steps 
        if time % train_frequency == 0:
            replay(batch_size=64) # Train on a random minibatch of 64 experiences

env.close()
     

episode: 1/100, score: 13, e: 1.0
episode: 2/100, score: 15, e: 1.0
episode: 3/100, score: 11, e: 1.0
episode: 4/100, score: 14, e: 1.0
episode: 5/100, score: 9, e: 1.0
episode: 6/100, score: 40, e: 0.92
episode: 7/100, score: 83, e: 0.78
episode: 8/100, score: 88, e: 0.65
episode: 9/100, score: 30, e: 0.61
episode: 10/100, score: 36, e: 0.56
episode: 11/100, score: 147, e: 0.42
episode: 12/100, score: 143, e: 0.31
episode: 17/100, score: 188, e: 0.043
episode: 27/100, score: 184, e: 0.0099
episode: 32/100, score: 117, e: 0.0099
episode: 35/100, score: 101, e: 0.0099
episode: 36/100, score: 111, e: 0.0099
episode: 38/100, score: 148, e: 0.0099
episode: 40/100, score: 86, e: 0.0099
episode: 41/100, score: 91, e: 0.0099
episode: 42/100, score: 128, e: 0.0099
episode: 44/100, score: 181, e: 0.0099
episode: 45/100, score: 138, e: 0.0099
episode: 47/100, score: 94, e: 0.0099
episode: 49/100, score: 93, e: 0.0099
episode: 52/100, score: 175, e: 0.0099
episode: 53/100, score: 91, e: 0.0099

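The scores here are very volatile. In the first episodes epsilon is high, so most actions are random. Later epsilon sits at its floor of about 0.01, so the remaining swings likely come from the network itself, which is still being retrained every few steps on targets it produces. Note also that training episodes are capped at 200 steps and a score is printed only when the pole falls, so no score above 199 can appear. An episode that survives all 200 steps prints nothing, which would explain the gaps in the episode numbers.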
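A moving average makes the trend easier to see through the noise. The sketch below smooths the first 12 logged scores; the window of 5 is an assumption chosen for illustration, not a value from the notebook.

```python
import numpy as np

# Scores of the first 12 episodes, copied from the training log above.
scores = np.array([13, 15, 11, 14, 9, 40, 83, 88, 30, 36, 147, 143])

window = 5  # assumed smoothing window; any small value works
moving_avg = np.convolve(scores, np.ones(window) / window, mode="valid")

print(moving_avg.round(1))  # rises from 12.4 to 88.8
```

Even though individual scores drop from 88 to 30 between episodes 8 and 9, the smoothed curve rises at every step over this stretch.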

##### Setup & Hyperparameters

    epsilon = 1.0          # Start with 100% random actions
    epsilon_min = 0.01     # Never go below 1% randomness  
    epsilon_decay = 0.99   # Reduce epsilon by 1% each training step
    memory = deque(maxlen=2000)  # Store last 2000 experiences
This sets up our exploration strategy and experience storage.
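The training log prints `e: 0.0099` rather than `0.01` once exploration bottoms out. A short sketch of the decay rule, using the same three values, shows why: the floor is checked before the multiplication, so the final decay step lands slightly below `epsilon_min`.

```python
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.99

steps = 0
while epsilon > epsilon_min:   # same condition replay() checks before decaying
    epsilon *= epsilon_decay
    steps += 1

print(steps, f"{epsilon:.2}")  # 459 0.0099
```

So exploration is effectively over after 459 training steps, which at one training step every 5 environment steps is reached within the first few dozen episodes.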

##### Experience Storage
    def remember(state, action, reward, next_state, done):
        memory.append((state, action, reward, next_state, done))

Every time our agent does something, it saves "I was in state X, did action Y, got reward Z, ended up in state W, and the episode ended (True/False)."
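Because the buffer is a `deque` with `maxlen`, old experiences fall off the front once it is full. The sketch below uses a capacity of 3 instead of 2000 and placeholder strings instead of real states, purely to make the behavior visible.

```python
from collections import deque

memory = deque(maxlen=3)  # tiny capacity for illustration; the notebook uses 2000

for step in range(5):
    # Placeholder experience tuples: (state, action, reward, next_state, done)
    memory.append((f"s{step}", 0, 1.0, f"s{step + 1}", False))

print([experience[0] for experience in memory])  # ['s2', 's3', 's4']
```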

##### Learning Function 
    def replay(batch_size=128):

This is where the magic happens:
1. Sample Random Experiences: Takes `batch_size` random memories to learn from (128 by default; the training loop passes 64)
2. Predict the Q-Values:

    - q_next = model.predict(next_states) -> "How good are actions in the next state?"
    - q_target = model.predict(states) -> "How good did we think actions were?"
3. Update Targets: For each experience:

        target = rewards[i]  # Immediate reward
        if not dones[i]:
            target += 0.95 * np.amax(q_next[i])  # + discounted future reward
        q_target[i][actions[i]] = target  # Only the taken action changes

    This implements the Bellman update: Q(state, action) = reward + 0.95 × max Q(next_state, any action). If the episode ended on this step, the target is just the reward.

4. Train the Network: Updates the neural network to predict these better target values.
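The per-experience loop can also be written without a Python loop. The following is a sketch of that vectorized form on a made-up minibatch of three transitions; the arrays stand in for the outputs of `model.predict`, and none of the numbers come from the notebook.

```python
import numpy as np

gamma = 0.95  # discount factor used in replay()

# Made-up minibatch of three transitions; the last one ended the episode.
rewards = np.array([1.0, 1.0, -10.0])
dones = np.array([False, False, True])
actions = np.array([0, 1, 1])
q_current = np.zeros((3, 2))                             # stand-in for model.predict(states)
q_next = np.array([[0.5, 2.0], [1.0, 0.2], [3.0, 4.0]])  # stand-in for model.predict(next_states)

# Bellman target: reward, plus the discounted best next Q-value unless the episode ended.
targets = rewards + gamma * q_next.max(axis=1) * (~dones)

# Overwrite only the Q-value of the action actually taken in each row.
q_target = q_current.copy()
q_target[np.arange(len(actions)), actions] = targets

print(targets)  # 2.9, 1.95 and -10.0
```

The third target stays at -10 because `~dones` zeroes out the future term for a finished episode, exactly as the `if not dones[i]` branch does.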

##### Action Selection
    def act(state):
        if np.random.rand() <= epsilon:
            return random.randrange(action_size)  # Explore
        act_values = model.predict(state)         # Get Q-values
        return np.argmax(act_values[0])          # Pick best action

Epsilon-greedy: flip a weighted coin to decide explore vs. exploit.
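The two extremes of that coin are easy to check in isolation. This sketch replaces the Keras model with a fixed, made-up array of Q-values and uses NumPy's `Generator` API rather than the notebook's mix of `np.random` and `random`; `epsilon_greedy` is a name chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit

q_values = np.array([0.2, 1.3])  # made-up Q-values for CartPole's two actions

greedy_only = [epsilon_greedy(q_values, epsilon=0.0) for _ in range(5)]
all_random = [epsilon_greedy(q_values, epsilon=1.0) for _ in range(1000)]

print(greedy_only)          # [1, 1, 1, 1, 1]
print(np.mean(all_random))  # close to 0.5: both actions are picked about equally
```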

##### Main Training Loop 
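    for e in range(episodes):
        state, _ = env.reset()  # Start new episode
        state = np.reshape(state, [1, state_size])
        for time in range(200):  # Max 200 steps
            action = act(state)  # Choose action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            reward = reward if not done else -10  # Penalty for failing
            next_state = np.reshape(next_state, [1, state_size])
            remember(state, action, reward, next_state, done)  # Store experience
            state = next_state

            if done:
                break

            if time % train_frequency == 0:
                replay(batch_size=64)  # Learn every 5 steps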

Key Insight: Our agent is constantly doing two things:
1. Acting in the environment (playing the game)
2. Learning from past experiences (getting better at the game)

The brilliant part is that it learns from random past experiences, not just recent ones, which helps prevent it from forgetting old lessons.
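The interplay between the loop, the buffer, and epsilon can be checked against the training log. The sketch below replays only the bookkeeping of the first 12 logged episodes (buffer length and decay steps, no environment and no model) and reproduces the printed epsilon values, including the five episodes at `e: 1.0` while the buffer still holds fewer than 64 experiences.

```python
# Scores and epsilon values of the first 12 episodes, copied from the training log.
logged_scores = [13, 15, 11, 14, 9, 40, 83, 88, 30, 36, 147, 143]
logged_eps = ["1.0", "1.0", "1.0", "1.0", "1.0", "0.92",
              "0.78", "0.65", "0.61", "0.56", "0.42", "0.31"]

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.99
batch_size, train_frequency = 64, 5
memory_size = 0   # only the buffer length matters here (it stays far below 2000)
replayed_eps = []

for score in logged_scores:
    for t in range(score + 1):        # the episode ends at step index `score`
        memory_size += 1              # remember() runs on every step
        if t == score:
            break                     # done: the loop breaks before replay()
        if t % train_frequency == 0 and memory_size >= batch_size:
            if epsilon > epsilon_min: # replay() decays epsilon after training
                epsilon *= epsilon_decay
    replayed_eps.append(f"{epsilon:.2}")

print(replayed_eps == logged_eps)  # True
```

Learning therefore does not start until partway through episode 5, and epsilon decays with the number of training steps rather than the number of episodes.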

#### Step 4: Evaluate the Performance

Finally, we will evaluate the performance of the trained Q-Learning agent.

In [192]:
for e in range(10):
    state, _ = env.reset() # Unpack the state from the tuple
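    state = np.reshape(state, [1, state_size]) # Reshape the state to (1, state_size)
    for time in range(500):
        env.render() # Has no visible effect here because env was created without a render_mode
        action = np.argmax(model.predict(state, verbose=0)[0]) # Always exploit: no epsilon here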
        next_state, reward, terminated, truncated, _ = env.step(action) # Unpack the five return values
        done = terminated or truncated # Check if the episode is done
        next_state = np.reshape(next_state, [1, state_size])
        state = next_state
        if done:
            print(f"episode: {e+1}/10, score: {time}")
            break

env.close()


episode: 1/10, score: 445
episode: 2/10, score: 429
episode: 3/10, score: 104
episode: 4/10, score: 98
episode: 5/10, score: 396
episode: 6/10, score: 499
episode: 7/10, score: 447
episode: 8/10, score: 83
episode: 9/10, score: 89
episode: 10/10, score: 499


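Six of the ten episodes last roughly 400 steps or more, and two of them reach the 500-step limit. The distribution is bimodal, though: the other four end after about 80 to 100 steps. This looks concerning, but it is common in CartPole. Our agent has learned a solid policy but:
- Sometimes it starts in a tricky state or makes one bad early decision
- Once it recovers balance, it can maintain it almost indefinitely
- The 499s hit the episode time limit (step index 499 is the 500th step, where CartPole-v1 truncates the episode), so our agent could probably have kept going.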
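A few summary statistics make the two clusters explicit. The sketch uses only the ten scores printed above; the cut at 200 steps is an arbitrary threshold chosen to separate the clusters.

```python
from statistics import mean, median

# Evaluation scores copied from the output above.
eval_scores = [445, 429, 104, 98, 396, 499, 447, 83, 89, 499]

short_runs = [s for s in eval_scores if s < 200]    # 200 is an arbitrary cut between the clusters
capped_runs = [s for s in eval_scores if s == 499]  # last zero-based step index before the limit

print(mean(eval_scores), median(eval_scores))  # 308.9 412.5
print(len(short_runs), len(capped_runs))       # 4 2
```

The gap between the mean and the median reflects the four short runs. With only ten episodes, these numbers are a rough indication rather than a reliable estimate.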