# Reinforcement Learning: OpenAI Gym Practice  

As a data science student, I am excited to share this short project where I focus on a first experience with reinforcement learning using OpenAI Gym. This project aims to provide hands-on practice with training reinforcement learning models using deep learning techniques.  

## What is Reinforcement Learning  

Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards. Unlike supervised learning, RL does not rely on labeled data but instead learns through trial and error using feedback from its own actions. RL is widely used in robotics, game playing, finance, and autonomous systems.  
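
To make "trial and error" concrete, here is a minimal sketch of an epsilon-greedy agent on a two-armed bandit. Nothing in it comes from the project itself: the function name `run_bandit`, the payout probabilities, and the exploration rate `epsilon` are all assumptions chosen for illustration.

```python
import random

def run_bandit(true_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn which arm pays best purely from reward feedback (hypothetical toy example)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_probs)  # running estimate of each arm's average reward
    counts = [0] * len(true_probs)       # how many times each arm was pulled
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm
            arm = rng.randrange(len(true_probs))
        else:
            # Exploit: pick the arm with the highest current estimate
            arm = max(range(len(true_probs)), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update of the estimate for the pulled arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward

# Arm 1 pays out more often, but the agent is never told this
estimates, total_reward = run_bandit([0.2, 0.8])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print(f'Estimated values: {estimates}, best arm: {best_arm}')
```

The agent never sees labels; it only sees the reward that follows each of its own choices, which is the key difference from supervised learning.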

## What is OpenAI Gym  

OpenAI Gym is a toolkit developed by OpenAI for developing and comparing reinforcement learning algorithms. It provides a collection of pre-built environments, such as the popular CartPole and Atari games, that allow users to experiment with different RL techniques. Gym offers a simple API to interact with environments, making it a great starting point for RL research and development. The library is now maintained under the name Gymnasium: [Gymnasium: OpenAI Gym's Successor](https://gymnasium.farama.org/)  
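
The API boils down to two calls. In Gymnasium, `reset()` returns `(observation, info)` and `step(action)` returns `(observation, reward, terminated, truncated, info)`. The sketch below imitates that contract with a hypothetical `CountdownEnv` class so it runs without installing anything; the class and the `rollout` helper are illustrative assumptions, not part of Gymnasium.

```python
import random

class CountdownEnv:
    """Hypothetical toy environment that mimics the Gymnasium call signatures."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.remaining = horizon

    def reset(self, seed=None):
        self.remaining = self.horizon
        return self.remaining, {}  # (observation, info)

    def step(self, action):
        self.remaining -= 1
        reward = 1.0                       # one point per step survived
        terminated = False                 # this toy task has no failure state
        truncated = self.remaining == 0    # the time limit has been reached
        return self.remaining, reward, terminated, truncated, {}

def rollout(env, policy):
    """Run one episode and return the total reward."""
    observation, info = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total += reward
        done = terminated or truncated
    return total

score = rollout(CountdownEnv(horizon=5), policy=lambda obs: random.choice([0, 1]))
print(score)
```

Any environment that follows the same two signatures can be dropped into `rollout` unchanged, which is what makes the interface convenient for comparing algorithms.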

## Learning Objectives  

1. Create OpenAI Gym environments like CartPole  
2. Build a Deep Learning model for Reinforcement Learning using TensorFlow and Keras  
3. Train a Reinforcement Learning model using Deep Q-Network (DQN) based learning with Keras-RL

## What is CartPole

This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in “Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems”. A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pole starts upright on the cart, and the goal is to keep it balanced by applying forces to the cart in the left and right directions. Source: [Gymnasium Website](https://gymnasium.farama.org/environments/classic_control/cart_pole/)
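
As a rough sketch of when an episode ends, the function below encodes the limits listed on the Gymnasium CartPole page as I understand them: the episode terminates when the pole tilts more than 12 degrees or the cart moves more than 2.4 units from the center, and CartPole-v1 is truncated after 500 steps. The function name `cartpole_status` and the constants are my own naming, not part of the library.

```python
import math

ANGLE_LIMIT_RAD = 12 * math.pi / 180  # 12 degrees, about 0.2095 radians
POSITION_LIMIT = 2.4                  # cart distance from the center of the track
MAX_STEPS_V1 = 500                    # time limit assumed for CartPole-v1

def cartpole_status(observation, step_count):
    """Return (terminated, truncated) for a 4-value CartPole observation."""
    cart_position, cart_velocity, pole_angle, pole_angular_velocity = observation
    terminated = (abs(cart_position) > POSITION_LIMIT
                  or abs(pole_angle) > ANGLE_LIMIT_RAD)
    truncated = step_count >= MAX_STEPS_V1
    return terminated, truncated

print(cartpole_status((0.0, 0.0, 0.05, 0.0), step_count=10))  # pole nearly upright
print(cartpole_status((0.0, 0.0, 0.30, 0.0), step_count=10))  # pole tilted too far
```

Because the agent earns a reward of 1 for every step it survives, the episode score is simply the number of steps before one of these conditions is met.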

## Notice

This project requires a Python 3.8 environment, because the pinned `tensorflow==2.3.0` release does not support newer Python versions.

## Lecture Resources for Reinforcement Learning  

- [Reinforcement Learning from Scratch](https://www.youtube.com/watch?v=vXtfdGphr3c&list=PLhBFZf0L5I7oIFTNTclyvWRciXaVb76Yt&index=4)  
- [An Introduction to Reinforcement Learning](https://www.youtube.com/watch?v=JgvyzIkgxF0&list=PLhBFZf0L5I7oIFTNTclyvWRciXaVb76Yt&index=5)  
- [Reinforcement Learning: Machine Learning Meets Control Theory](https://www.youtube.com/watch?v=0MNVhXEX9to&list=PLhBFZf0L5I7oIFTNTclyvWRciXaVb76Yt&index=6)

## Credit for Educational Objectives of this Project
- [Deep Reinforcement Learning Tutorial for Python in 20 Minutes](https://www.youtube.com/watch?v=cO5g5qLrLSo&list=PLhBFZf0L5I7oIFTNTclyvWRciXaVb76Yt&index=2)

---

#### 1) Install required libraries

Note: We are installing 'gymnasium' instead of 'gym' but later importing it as 'gym' for simplicity.

---

In [None]:
!pip install tensorflow==2.3.0
!pip install gymnasium
!pip install keras
!pip install keras-rl2

#### 2) Test Random Environment with OpenAI Gym

In this code, an agent interacts with the CartPole-v1 environment using purely random actions; no learning happens yet. The agent runs a series of episodes, each starting from a random initial state. During each episode, the agent takes random actions (either 0 or 1, representing left or right movements of the cart) and receives rewards based on the environment's response. The environment is rendered after every action to provide visual feedback, allowing us to observe the agent's behavior. The agent continues acting until the episode ends, which happens when the pole falls down or the episode reaches its maximum time limit. The total score for each episode, which is the cumulative reward received, is tracked and printed after every episode to show how well random actions perform.

---

In [None]:
# Importing required libraries
import gymnasium as gym
import random

In [None]:
# Building the Cartpole environment using Gym
env = gym.make('CartPole-v1')  # Create an instance of the CartPole environment

# Get the number of states (features) in the observation space
states = env.observation_space.shape[0]  # The shape of the observation space (4 features for CartPole)

# Get the number of possible actions (discrete action space)
actions = env.action_space.n  # The number of possible actions (2: left or right)

In [None]:
# There are 2 possible actions: 0 (push the cart left) and 1 (push the cart right)
actions

In [None]:
# Number of episodes to run with random actions (no training happens here)
episodes = 10

# Loop through the number of episodes
for episode in range(1, episodes + 1):  
    # Reset the environment at the start of each episode
    state, info = env.reset()
    
    done = False  # This flag keeps track of whether the episode is finished
    score = 0  # Initialize the score for this episode

    # Keep interacting with the environment until the episode is done
    while not done:
        # Render the environment (useful for visual feedback)
        # Note: Gymnasium only opens a window if the environment was created with render_mode='human'
        env.render()
        
        # Randomly select an action (either 0 or 1 for CartPole)
        action = random.choice([0, 1])
        
        # Perform the chosen action and observe the new state, reward, end-of-episode flags, and additional info
        state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  # The episode ends on failure or when the time limit is reached
        
        # Accumulate the score from the rewards
        score += reward
    
    # Print the episode number and its score after the episode ends
    print(f'Episode {episode}: Score = {score}')
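
The loop above relies on Gymnasium's five-value `step()`. Libraries written against the older Gym API expect `reset()` to return only the observation and `step()` to return four values, and as far as I can tell Keras-RL is one of them. If that applies to your setup, a thin adapter along the following lines may help. `LegacyGymAdapter` and `_FixedLengthEnv` are hypothetical names introduced here, and this is a sketch I have not checked against every Keras-RL version.

```python
class LegacyGymAdapter:
    """Hypothetical wrapper exposing the older Gym-style reset/step signatures."""

    def __init__(self, env):
        self.env = env
        # Pass the spaces through so code that inspects them keeps working
        self.observation_space = getattr(env, 'observation_space', None)
        self.action_space = getattr(env, 'action_space', None)

    def reset(self):
        observation, _info = self.env.reset()
        return observation  # old API: observation only

    def step(self, action):
        observation, reward, terminated, truncated, info = self.env.step(action)
        # Old API: a single 'done' flag covers both ways an episode can end
        return observation, reward, terminated or truncated, info

    def render(self, mode='human'):
        return self.env.render()

    def close(self):
        return self.env.close()


class _FixedLengthEnv:
    """Stand-in with Gymnasium-style signatures, used only for this demo."""

    def __init__(self):
        self.remaining = 3

    def reset(self):
        self.remaining = 3
        return 0.0, {}

    def step(self, action):
        self.remaining -= 1
        return 0.0, 1.0, False, self.remaining == 0, {}


legacy_env = LegacyGymAdapter(_FixedLengthEnv())
observation = legacy_env.reset()
done, score = False, 0.0
while not done:
    observation, reward, done, info = legacy_env.step(0)
    score += reward
print(score)
```

With a real environment, the usage would be `env = LegacyGymAdapter(gym.make('CartPole-v1'))`, after which the wrapped `env` is what gets passed to the agent.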

#### 3) Create a Deep Learning Model with Keras

In this section, we build a neural network model using TensorFlow and Keras to approximate the Q-function for a reinforcement learning task. We define a function `build_model` that constructs a sequential neural network with two hidden layers, each containing 24 neurons with ReLU activation, and an output layer matching the number of possible actions with a linear activation. The model takes the environment's state as input and estimates the Q-value (the expected cumulative reward) of each action. After building the model, we display its architecture using `model.summary()`, which gives an overview of the layers and parameters in the network.

---

In [None]:
# Importing necessary libraries
import numpy as np  # NumPy is used for numerical operations and array handling

# Importing components for creating and training a neural network
from tensorflow.keras.models import Sequential  # Sequential model is used to create a linear stack of layers
from tensorflow.keras.layers import Dense, Flatten  # Dense layer is a fully connected layer, Flatten flattens the input for the next layer
from tensorflow.keras.optimizers import Adam  # Adam is an optimization algorithm used to train the model

In [None]:
# Function to build a neural network model for reinforcement learning
def build_model(states, actions):
    # Initialize a Sequential model, which allows stacking layers linearly
    model = Sequential()
    
    # Flatten the input layer to reshape each observation into a 1D vector
    # input_shape=(1, states): Keras-RL passes observations with a leading window dimension
    # (window_length=1 in the agent's memory), so each sample has shape (1, states); the batch dimension is implicit
    model.add(Flatten(input_shape=(1, states)))  
    
    # Add a Dense hidden layer with 24 neurons and ReLU activation function
    # ReLU (Rectified Linear Unit) introduces non-linearity to the model, allowing it to learn complex patterns
    model.add(Dense(24, activation='relu'))
    
    # Add another Dense hidden layer with 24 neurons and ReLU activation function
    model.add(Dense(24, activation='relu'))
    
    # Add the output layer with a number of neurons equal to the number of possible actions
    # 'linear' activation function is used to output raw values (i.e., Q-values)
    model.add(Dense(actions, activation='linear'))
    
    # Return the model
    return model

In [None]:
# Create the model using the build_model function with the given number of states and actions
model = build_model(states, actions)

# Display the summary of the model architecture, including the layers, their output shapes, and the number of parameters
model.summary()
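
As a sanity check on the summary, the parameter count can be worked out by hand. A Dense layer has one weight per input-unit pair plus one bias per unit, and Flatten has no parameters. The sketch below assumes the CartPole sizes used above (4 state features, 2 actions); `dense_params` is a helper name introduced here.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input features -> hidden 24 -> hidden 24 -> one output per action
layer_sizes = [4, 24, 24, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [120, 600, 50] 770
```

If `model.summary()` reports a different total, the state or action counts passed to `build_model` are probably not what you expect.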

#### 4) Build Agent with Keras-RL

This code sets up a Deep Q-Network (DQN) agent and trains it on the environment. It imports the necessary components: DQNAgent, BoltzmannQPolicy, and SequentialMemory. The DQN agent is built with a Boltzmann policy for action selection, which probabilistically favors actions with higher Q-values. The agent's experiences are stored in a memory buffer, which is later sampled for training. After defining the agent, the code compiles it with the Adam optimizer and sets the learning rate. Finally, it trains the agent by fitting it to the environment for a specified number of steps, allowing the agent to learn from its interactions.
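
To illustrate what "probabilistically favors actions with higher Q-values" means, here is a minimal sketch of Boltzmann (softmax) action selection. It is not the Keras-RL implementation: the function names and the temperature parameter `tau` are assumptions for illustration, and subtracting the maximum before exponentiating is my own choice for numerical stability.

```python
import math
import random

def boltzmann_probs(q_values, tau=1.0):
    """Turn Q-values into action probabilities with a softmax at temperature tau."""
    scaled = [q / tau for q in q_values]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]  # shift by the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(q_values, tau=1.0, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

probs = boltzmann_probs([1.0, 2.0])
print(probs)                      # the action with the higher Q-value gets the larger share
print(sample_action([1.0, 2.0]))  # usually 1, sometimes 0
```

A lower temperature makes the choice closer to greedy, while a higher one makes it closer to uniform random, so `tau` controls the balance between exploitation and exploration.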

---

In [None]:
# Importing necessary components for reinforcement learning
# DQNAgent: This is the Deep Q-Network agent which will be used to perform reinforcement learning using a neural network model.
# BoltzmannQPolicy: This is the policy used for action selection, where the agent selects actions based on a probability distribution influenced by Q-values.
# SequentialMemory: This is the memory used to store the agent's experiences during the training process, which can be replayed for training the model.
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

# Function to build the DQN agent
def build_agent(model, actions):
    policy = BoltzmannQPolicy()  # The action selection policy
    memory = SequentialMemory(limit=50000, window_length=1)  # Memory for storing experiences
    dqn = DQNAgent(model=model, memory=memory, policy=policy,
                   nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
    return dqn

# Build the agent
dqn = build_agent(model, actions)

# Compile the agent with the Adam optimizer and learning rate
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])

# Fit the agent to the environment
dqn.fit(env, nb_steps=50000, visualize=False, verbose=1)
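
For intuition about what happens inside `fit`, the sketch below shows two ingredients of DQN training in plain Python: the standard Q-learning target and a soft update of the target network. It is a simplified illustration, not Keras-RL's code. The function names and `gamma=0.99` are assumptions, and the soft update reflects my understanding that a `target_model_update` value below 1 is treated as a blending rate.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """Standard DQN target: reward plus the discounted best next Q-value."""
    if done:
        return reward  # no future reward after the episode has ended
    return reward + gamma * max(next_q_values)

def soft_update(target_weights, online_weights, tau=0.01):
    """Move each target weight a small step (tau) toward the online weight."""
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_weights, online_weights)]

# One transition: reward 1.0, and the target network rates the next state's actions 0.5 and 2.0
print(td_target(1.0, [0.5, 2.0], done=False, gamma=0.9))
print(td_target(1.0, [0.5, 2.0], done=True, gamma=0.9))

# Target weights drift slowly toward the online weights
print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.01))
```

The slow-moving target network keeps the regression targets from shifting at every step, which tends to make training more stable.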

In [None]:
# Test the performance of the trained DQN agent over 100 episodes
scores = dqn.test(env, nb_episodes=100, visualize=False)

# Calculate and print the average reward over all test episodes
# 'scores.history' contains the rewards for each episode
# 'episode_reward' is a list of rewards for each episode in the test
# np.mean() computes the average of these rewards
print(np.mean(scores.history['episode_reward']))

In [None]:
# This will visualize our model, where we can see the pole being balanced
_ = dqn.test(env, nb_episodes=5, visualize=True)

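#### 5) Reloading Agent from Saved Weights

In this section, we save the trained agent's weights to disk, delete the model, agent, and environment, then rebuild them and reload the weights to test the agent again.
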
---

In [None]:
# We can save the weights and reload them later, to test them out
dqn.save_weights('dqn_weights.h5f', overwrite=True)

In [None]:
# Delete the model, agent, and environment to simulate starting from a fresh session
del model
del dqn
del env

In [None]:
# Rebuild the environment, model, and agent from scratch
env = gym.make('CartPole-v1')
actions = env.action_space.n
states = env.observation_space.shape[0]
model = build_model(states, actions)
dqn = build_agent(model, actions)
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])

In [None]:
# Reload our weights into the model for testing
dqn.load_weights('dqn_weights.h5f')

In [None]:
dqn.test(env, nb_episodes=5, visualize=True)