# AI Reinforcement Learning Crash Course - Overview

## Introduction to AI
Artificial Intelligence (AI) is the field of computer science that focuses on creating systems capable of performing tasks that typically require human intelligence. These include problem-solving, pattern recognition, decision-making, and learning. 

AI is classified into:
- **Narrow AI**: Specialized in one task, like recommendation systems.
- **General AI**: Hypothetical systems with human-like cognitive abilities.

## Neural Networks
Neural networks are computational models inspired by the human brain. They consist of layers of nodes (neurons) connected by weighted edges, which adjust during training. 

While they can be represented as **network flow graphs**, they are more than just an application of them—they involve **non-linear transformations** and **backpropagation** for learning.

## Types of Training
### **1. Supervised Learning**
- Trained on **labeled** data (input-output pairs).
- Learns by minimizing the error between predictions and actual outcomes.

### **2. Unsupervised Learning**
- Trained on **unlabeled** data.
- Tries to find **patterns**, such as clustering similar data points.

### **3. Reinforcement Learning (RL)**
- The model (**agent**) learns by interacting with an **environment**.
- Receives **rewards** for its actions and learns to choose the actions that maximize long-term cumulative reward.

## Difference Between Supervised and Unsupervised Learning
| **Type**        | **Training Data** | **Goal** |
|----------------|----------------|---------|
| Supervised    | Labeled data (input-output pairs) | Learn to map inputs to outputs |
| Unsupervised  | Unlabeled data | Find hidden patterns or structures |

## Reinforcement Learning Terms (**AREA 51**) 
- **Agent**: The learner or decision-maker.
- **Environment**: The world the agent interacts with.
- **Action**: Choices the agent can make.
- **Reward**: Feedback from the environment indicating the quality of an action.

- **State**: The current situation of the agent.
- **Policy**: The strategy the agent uses to decide actions.
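
To make these terms concrete, here is a minimal sketch of the agent-environment loop on a made-up one-dimensional world. `LineWorld`, `random_policy`, `always_right_policy`, and `run_episode` are hypothetical names invented for this illustration; none of them come from Gym.

```python
import random


class LineWorld:
    """Toy environment (hypothetical): walk along positions 0..goal."""

    def __init__(self, goal=4, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        """Apply an action (0 = left, 1 = right); return (state, reward, done)."""
        move = 1 if action == 1 else -1
        self.position = max(0, min(self.goal, self.position + move))
        self.steps += 1
        reached_goal = self.position == self.goal
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done


def random_policy(state):
    """Policy: maps a state to an action. This one ignores the state."""
    return random.choice([0, 1])


def always_right_policy(state):
    """Policy that always moves right."""
    return 1


def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = policy(state)                  # the agent chooses an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
    return total_reward
```

Calling `run_episode(LineWorld(), always_right_policy)` walks straight to the goal, while `random_policy` may or may not get there before the step limit.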

## What is a Heuristic?
A **heuristic** is a problem-solving approach that relies on practical shortcuts or **rules of thumb** to make decisions more efficiently. In reinforcement learning, heuristics are sometimes used to define rewards in a way that encourages desirable behavior even if an optimal policy is not yet learned.
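
As one illustration, a rule of thumb such as "a more upright pole is better" can be folded into the reward. The function below is a hypothetical shaping sketch, not something Gym provides. It assumes the pole angle is given in radians and that `max_angle` is whatever angle ends the episode in your environment.

```python
def shaped_reward(pole_angle, max_angle):
    """Heuristic reward: 1.0 when the pole is vertical, falling linearly
    to 0.0 as the tilt approaches max_angle (both in radians)."""
    tilt = min(abs(pole_angle), max_angle)  # clamp so the reward never goes negative
    return 1.0 - tilt / max_angle
```

A shaped reward like this gives the agent a graded signal instead of a flat +1 per step, at the cost of baking the designer's assumptions into the task.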

## Tools You'll Be Using
This is based on [Nicholas Renotte's workshop](https://www.youtube.com/watch?v=cO5g5qLrLSo&ab_channel=NicholasRenotte) on Deep Learning. We'll be creating an agent that can balance a stick. Then we'll design our own

### **1. OpenAI Gym**
- A toolkit for developing and testing RL algorithms with various simulated environments.

### **2. TensorFlow**
- A machine learning framework optimized for deep learning and neural networks.

### **3. Keras-RL Agents**
- A library for reinforcement learning that integrates with **Keras** and **TensorFlow**, providing pre-built RL algorithms.


# **0. Install Dependencies**
This is an up-to-date workshop based on [Nicholas Renotte's workshop](https://www.youtube.com/watch?v=cO5g5qLrLSo&ab_channel=NicholasRenotte) on Deep Learning. We'll be creating an agent that can balance a stick. Let's first install our dependencies.

In [4]:
!pip install gym==0.25.2
!pip install keras==2.10.0
!pip install keras-rl2==1.0.5
!pip install protobuf==3.20.0
!pip install numpy
!pip install gym[classic_control]



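**If you have any problems installing TensorFlow**, make sure `pip` points to `pip3`, because TensorFlow requires Python 3 (in particular, Python 3.8). Because `keras==2.10.0` is pinned above, a matching TensorFlow 2.10.x release is the safer choice if the latest version causes import errors.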

`pip --version`

`pip3 install --upgrade tensorflow`

https://stackoverflow.com/questions/63073711/how-to-install-tensorflow-2-3-0

You can also try running this notebook in VS Code.

# **1. Introduction to CartPole in OpenAI Gym**

## **What is CartPole?**
CartPole is a classic reinforcement learning environment provided by **OpenAI Gym**. It is a simple physics-based control problem where an agent must balance a pole on top of a moving cart.

## **Objective**
The goal is to keep the pole balanced for as long as possible by moving the cart **left** or **right**. The agent receives a **reward** for each time step the pole remains upright.

## **What is an Episode in Reinforcement Learning?**
An **episode** is a single run of the environment from **start to termination**. In CartPole, an episode:
- **Starts** with the cart and pole in a random initial state.
- **Continues** as the agent takes actions and receives rewards.
- **Ends** when the termination conditions are met (e.g., the pole falls or the cart moves out of bounds).

Each episode provides training data for the reinforcement learning agent to improve its policy.
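
A short sketch of how an episode's rewards are usually summarized into a single number. With `gamma=1.0` this is the plain sum, which matches the `score` printed by the loop later in this notebook. The discount factor `gamma` is an assumption added here to illustrate weighting near-term rewards more heavily; it is not set anywhere in the text above.

```python
def episode_return(rewards, gamma=1.0):
    """Total reward for one episode, with each later step discounted by gamma."""
    total = 0.0
    # Work backwards so each step adds its reward plus the discounted future
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For example, `episode_return([1.0] * 18)` is the score of an episode that lasted 18 steps in CartPole.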

## **State Variables (Observation Space)**
The environment has **4 continuous state variables**:
1. **Cart Position** - Horizontal position of the cart.
2. **Cart Velocity** - Speed of the cart.
3. **Pole Angle** - Angle of the pole relative to vertical.
4. **Pole Angular Velocity** - Rotational speed of the pole.

## **Action Space**
The agent has **2 discrete actions**:
- `0`: Push the cart **left**.
- `1`: Push the cart **right**.

## **Rewards**
- The agent **receives +1 reward** for each time step the pole stays balanced.
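- The episode **ends when**:
  - The pole tilts **more than 12 degrees** from vertical.
  - The cart moves **more than 2.4 units** from the center.
  - The episode reaches the step limit (200 steps for `CartPole-v0`, 500 for `CartPole-v1`).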

## **Official OpenAI Gym Documentation**
For more details, visit:
[🔗 CartPole-v1 Documentation](https://gymnasium.farama.org/environments/classic_control/cart_pole/)

Let's begin setting up the environment. Keep in mind that you may need up-to-date versions of `pip`, `pygame`, `numpy`, or other dependencies if errors arise.

The environment opens in a new window. If you don't see it, check your other open windows or displays.

In [5]:
import gym    # OpenAI's gym
import random # To start we'll take random steps

In [27]:
env = gym.make("CartPole-v0")
states = env.observation_space.shape[0]
actions = env.action_space.n

In [7]:
print("States: ", states)
print("Actions: ", actions)

States:  4
Actions:  2


In [28]:
episodes = 10
for episode in range(1, episodes + 1):
    state = env.reset()  # With gym 0.25.2, reset() returns only the initial observation
    done = False
    score = 0

    while not done:
        env.render()
        action = random.choice([0, 1])  # Randomly choose between 0 & 1
        n_state, reward, done, info = env.step(action)  
        score += reward

    print('Episode: {} Score: {}'.format(episode, score))

Episode: 1 Score: 18.0
Episode: 2 Score: 13.0
Episode: 3 Score: 13.0
Episode: 4 Score: 22.0
Episode: 5 Score: 19.0
Episode: 6 Score: 12.0
Episode: 7 Score: 23.0
Episode: 8 Score: 12.0
Episode: 9 Score: 23.0
Episode: 10 Score: 22.0
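
These scores give a baseline to compare the trained agent against. The snippet below simply averages the ten values printed above; the numbers are copied from that output, so a fresh run will differ.

```python
from statistics import mean

# Scores copied from the ten random-action episodes printed above
random_scores = [18.0, 13.0, 13.0, 22.0, 19.0, 12.0, 23.0, 12.0, 23.0, 22.0]

baseline = mean(random_scores)
print("Random baseline:", baseline)
```

A random policy keeps the pole up for fewer than 20 steps on average here, far below the 200-step cap of `CartPole-v0`.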


# **Gym API Guide - Key Functions**

## **1. `gym.make()` - Create an Environment**
```python
env = gym.make("CartPole-v1", render_mode="human")
```
- Initializes an environment.
- `render_mode="human"` → Displays the environment.
- `render_mode="rgb_array"` → Returns an image instead.

---

## **2. State and Action Spaces**
```python
states = env.observation_space.shape[0]
actions = env.action_space.n
```
- **`env.observation_space.shape[0]`** → Number of state variables.
- **`env.action_space.n`** → Number of possible actions.

---

## **3. `env.reset()` - Reset the Environment**
```python
state, info = env.reset()
```
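- **Returns:** `(state, info)` in gym 0.26+ and Gymnasium. With the `gym==0.25.2` pin used in this notebook, `env.reset()` returns only `state` by default, which is why the loop above assigns a single value.
  - `state`: The initial observation (environment state).
  - `info`: Extra environment metadata (not always needed).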

---

## **4. `env.step(action)` - Take an Action**
```python
next_state, reward, terminated, truncated, info = env.step(action)
```
- **Expects:** `action` (an integer representing the chosen action).
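- **Returns** (gym 0.26+ and Gymnasium; `gym==0.25.2` returns the four values `next_state, reward, done, info` used in the loop above):
  - `next_state`: The new state after the action.
  - `reward`: Reward received for taking the action.
  - `terminated`: `True` if the episode ended because of the environment's own rules (e.g., pole fell).
  - `truncated`: `True` if the episode was cut off externally (e.g., a time limit was reached).
  - `info`: Additional debugging info.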
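
Since the cells earlier in this notebook unpack four values from `env.step()` while this guide shows five, a small adapter can keep a training loop working with either form. `unpack_step` is a hypothetical convenience helper written for this guide, not part of Gym.

```python
def unpack_step(result):
    """Normalize env.step() output to (next_state, reward, done, info).

    Accepts either the 4-value form (state, reward, done, info) or the
    5-value form (state, reward, terminated, truncated, info).
    """
    if len(result) == 5:
        next_state, reward, terminated, truncated, info = result
        done = terminated or truncated  # either condition ends the episode
        return next_state, reward, done, info
    if len(result) == 4:
        return tuple(result)
    raise ValueError(f"Unexpected step() result of length {len(result)}")
```

Inside a loop this would be used as `n_state, reward, done, info = unpack_step(env.step(action))`.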

---

## **5. `env.render()` - Display Environment**
```python
env.render()
```
- Renders the environment in a new window when `render_mode="human"`.

---

## **6. `random.choice()` - Select Random Action**
```python
import random
action = random.choice([0, 1])
```
- Randomly selects an action from a given list.

---

## **Summary Table**
| Function               | Purpose |
|------------------------|---------|
| `gym.make()`          | Creates the environment. |
| `env.observation_space.shape[0]` | Gets the number of state variables. |
| `env.action_space.n`     | Gets the number of available actions. |
| `env.reset()`         | Resets the environment and returns the initial state. |
| `env.step(action)`    | Takes an action and returns the new state, reward, and termination info. |
| `env.render()`        | Displays the environment visually. |
| `random.choice()`     | Randomly selects an action. |


# **2. Creating a Deep Learning Model with Keras**
## **What Does "Deep" Mean?**
- "Deep Learning" refers to models that consist of **multiple layers** of artificial neurons.
- The term "deep" comes from **stacking many layers** in a neural network.

## **Deep vs. Shallow Learning**
| Type | Description |
|------|-------------|
| **Shallow Learning** | Uses fewer layers, relies on manual feature selection (e.g., decision trees, logistic regression). |
| **Deep Learning** | Uses multiple hidden layers, learns complex features automatically. |

## **Understanding Neural Network Architectures**

### **1. Sequential Neural Networks**

A **Sequential Neural Network** is a type of artificial neural network where data flows in one direction, from input to output, through a series of layers. Each layer's output serves as the input for the next layer. This architecture is straightforward and commonly used when the model is a simple chain of layers with no branching. Note that "linear" below describes the layout of the layers, not the math: each layer can still apply a non-linear activation.

**Key Characteristics:**
- **Linear Stack of Layers:** Each layer has one input and one output.
- **Simplified Design:** Easy to implement and understand.
- **Limitations:** Not suited to models with multiple inputs or outputs, shared layers, or branching topologies (e.g., skip connections), and not ideal for tasks requiring memory of previous inputs.
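
To illustrate how each layer's output becomes the next layer's input, here is a dependency-free sketch of a forward pass through a small stack of dense layers. The weights are made-up numbers chosen so the arithmetic is easy to follow, and the helper names (`relu`, `dense`, `forward`) are illustrative; Keras performs the same kind of computation with learned weights.

```python
def relu(values):
    """ReLU activation: negative values become 0."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output neuron.

    weights[j] holds the input weights for output neuron j.
    """
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass inputs through each (weights, biases, activation) layer in order."""
    outputs = inputs
    for weights, biases, activation in layers:
        outputs = dense(outputs, weights, biases)
        if activation is not None:
            outputs = activation(outputs)
    return outputs


# Two inputs -> hidden layer of two ReLU neurons -> one linear output
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], relu),
    ([[1.0, 1.0]], [0.5], None),
]
result = forward([3.0, 1.0], layers)
```

The hidden layer produces `[2.0, -2.0]` before ReLU and `[2.0, 0.0]` after it, so the single linear output is `2.0 + 0.0 + 0.5`.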

### **Choosing or Designing a Model**

When selecting a neural network architecture, one must first consider the specific use case. Different tasks, such as image recognition, natural language processing, or time-series prediction, require different model architectures. In some cases, an existing model may be sufficient, while in others, a **custom model** may need to be developed.

Popular architectures include **Convolutional Neural Networks (CNNs)** for images, **Recurrent Neural Networks (RNNs)** for sequences, **Transformers** for NLP (Natural Language Processing), and **Generative Adversarial Networks (GANs)** for data generation. Each model type is specialized for different tasks.

### **Creating a Custom Model**
Designing a neural network from scratch involves:
- **Defining the Problem:** Understanding the type of data and the expected outputs.
- **Choosing the Right Layers:** Deciding how many layers and what types (e.g., convolutional, recurrent, dense) are needed.
- **Selecting an Activation Function:** Using appropriate functions like ReLU, sigmoid, or softmax based on the task.
- **Tuning Hyperparameters:** Adjusting learning rates, batch sizes, and number of neurons to optimize performance.
- **Testing and Iterating:** Running experiments, analyzing results, and refining the architecture based on performance.

Developers often use **transfer learning** (leveraging pre-trained models) to save time and computational resources rather than building models from scratch.


In [29]:
import numpy as np

# Import Keras modules for building a deep learning model
from tensorflow.keras.models import Sequential  # Sequential model for stacking layers
from tensorflow.keras.layers import Dense, Flatten  # Layers for building the neural network
from tensorflow.keras.optimizers import Adam  # Adam optimizer for efficient training

In [30]:
def build_model(states, actions):
    """
    Builds a deep learning model for reinforcement learning.

    Parameters:
    states (int): Number of state variables (input features).
    actions (int): Number of possible actions (output size).

    Returns:
    model (Sequential): An uncompiled Keras model (the agent compiles it later).
    """
    model = Sequential()  # Initialize a sequential neural network
    
    # Layer 1: Flatten layer (no trainable weights, but still a layer)
    # Keras-RL feeds observations with shape (window_length, states); with
    # window_length=1 this flattens (1, states) into a 1D vector for the Dense layers.
    model.add(Flatten(input_shape=(1, states)))  
    
    # Layer 2: First hidden layer with 24 neurons
    # Dense (fully connected) means every neuron receives input from every output of the previous layer
    # ReLU (Rectified Linear Unit) activation introduces non-linearity:
    #   - f(x) = max(0, x), meaning negative values are set to 0
    #   - Helps prevent vanishing gradients and improves training stability
    model.add(Dense(24, activation='relu'))  
    
    # Layer 3: Second hidden layer with 24 neurons and ReLU activation
    # Stacking layers allows the network to learn more complex patterns
    model.add(Dense(24, activation='relu'))  
    
    # Layer 4: Output layer with 'actions' neurons, using a linear activation function
    # Linear activation is used because this is a regression task (Q-values)
    # Q-values represent the expected future reward for each action, so no constraints (like softmax) are applied
    model.add(Dense(actions, activation='linear'))  
    
    return model  # Return the uncompiled model (compiled later via dqn.compile)

More layers **do not always mean better performance**—it's something that must be **experimented with and fine-tuned** based on the problem and dataset.

### **When More Layers Help:**
- **Complex Patterns**: Deep networks capture hierarchical features (e.g., CNNs for images).
- **Large Datasets**: More data allows deeper models to generalize better.
- **Sequential Data**: Tasks like NLP or time-series forecasting benefit from deep architectures.

### **When More Layers Hurt:**
- **Overfitting**: Too many layers can cause memorization instead of generalization.
- **Vanishing/Exploding Gradients**: Training deep networks can be unstable without proper techniques (e.g., batch normalization, skip connections).
- **Computational Cost**: More layers require more training time and computational power.

### **How to Find the Right Depth?**
- **Start Simple**: Use a shallow network and gradually increase layers.
- **Experiment**: Adjust the number of layers based on validation performance.
- **Use Regularization**: Techniques like dropout or weight decay can help prevent overfitting in deep networks.


In [31]:
model = build_model(states, actions)

In [32]:
model.summary() # Displays a summary of our model

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_6 (Dense)              (None, 24)                120       
_________________________________________________________________
dense_7 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_8 (Dense)              (None, 2)                 50        
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________


# **3. Build Agent with Keras-RL**
Now that we have our model, we breathe life into it via our Agent.

## **Policy-Based vs. Value-Based Reinforcement Learning**

### **Value-Based Methods**

- **Focus on learning**: The value (expected return) of being in a state or taking an action in a state
- **Decision making**: Choose actions with the highest estimated value
- **Examples**: Q-Learning, DQN (Deep Q-Networks)
- **Key idea**: "How good is this state or action?"
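
As a rough sketch of the value-based idea, here is the tabular Q-learning update that DQN approximates with a neural network. The learning rate `alpha`, discount factor `gamma`, and the state labels `"s0"` and `"s1"` are illustrative assumptions, not values taken from this notebook.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Move Q(state, action) toward reward + gamma * max_a Q(next_state, a)."""
    # No future value is added once the episode has ended
    best_next = 0.0 if done else max(q_table[next_state])
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


def greedy_action(q_table, state):
    """Value-based decision: pick the action with the highest estimate."""
    values = q_table[state]
    return max(range(len(values)), key=values.__getitem__)


# Two actions per state, all estimates start at 0.0
q_table = defaultdict(lambda: [0.0, 0.0])
q_learning_update(q_table, state="s0", action=1, reward=1.0,
                  next_state="s1", done=False)
```

After this single update the estimate for action `1` in `"s0"` rises above action `0`, so `greedy_action(q_table, "s0")` now prefers it.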

### **Policy-Based Methods**
- **Focus on learning**: The policy (mapping from states to actions) directly
- **Decision making**: Sample actions from the learned probability distribution
- **Examples**: REINFORCE, Proximal Policy Optimization (PPO)
- **Key idea**: "What action should I take in this state?"

## **Agents in Keras-RL**

In the context of our model, an **agent** is the decision-making entity that:
1. Observes the environment state
2. Chooses actions based on a strategy
3. Learns from experience to improve its decision-making

Keras-RL provides several agent implementations:

- **DQNAgent**: Deep Q-Network agent. Uses neural networks to approximate Q-values (state-action values), combining Q-learning with deep learning.
- **DDPGAgent**: Deep Deterministic Policy Gradient agent. Useful for continuous action spaces.
- **NAFAgent** (also importable as `ContinuousDQNAgent`): Normalized Advantage Functions, an adaptation of DQN for continuous action spaces.
- **CEMAgent**: Cross-Entropy Method agent. Uses an evolution strategy for optimization.
- **SARSAAgent**: State-Action-Reward-State-Action agent. Similar to Q-learning but uses the current policy to select the next action.

## **Why We're Using DQNAgent**

For our CartPole problem, we're using the **DQNAgent** because:
- It's well-suited for discrete action spaces (like our left/right choices)
- It combines deep learning (our neural network) with Q-learning
- It includes experience replay (SequentialMemory) to improve sample efficiency
- It's stable and has proven effective for control tasks like CartPole
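
The idea behind experience replay can be sketched without Keras-RL. The class below is a simplified stand-in written for illustration, not the `SequentialMemory` implementation: it keeps the most recent transitions in a bounded buffer and hands back random minibatches.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, limit):
        # deque drops the oldest transition once the limit is reached
        self.transitions = deque(maxlen=limit)

    def append(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a random minibatch, drawn without replacement."""
        return random.sample(list(self.transitions), batch_size)

    def __len__(self):
        return len(self.transitions)


buffer = ReplayBuffer(limit=3)
for step in range(5):
    buffer.append(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buffer.sample(batch_size=2)
```

Sampling at random breaks up the correlation between consecutive steps, and reusing stored transitions lets each interaction with the environment contribute to several updates.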


In [33]:
from rl.agents import DQNAgent # Import DQN agent for reinforcement learning
from rl.policy import BoltzmannQPolicy  # Import Boltzmann policy for action selection based on Q-values
from rl.memory import SequentialMemory # Sequential memory for experience replay

In [34]:
def build_agent(model, actions):
    policy = BoltzmannQPolicy()  # Boltzmann policy for action selection
    memory = SequentialMemory(limit=50000, window_length=1)  # Sequential memory for experience replay
    dqn = DQNAgent(model=model, memory=memory, policy=policy,
                   nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
    return dqn
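
Boltzmann (softmax) exploration turns Q-values into action probabilities, so higher-valued actions are chosen more often but not always. The functions below sketch that idea in plain Python. They are not Keras-RL's code, and the temperature parameter `tau` and the function names are illustrative.

```python
import math
import random


def boltzmann_probabilities(q_values, tau=1.0):
    """Softmax over Q-values scaled by the temperature tau."""
    # Subtract the max before exponentiating for numerical stability
    highest = max(q_values)
    exp_values = [math.exp((q - highest) / tau) for q in q_values]
    total = sum(exp_values)
    return [value / total for value in exp_values]


def boltzmann_action(q_values, tau=1.0, rng=random):
    """Sample an action index in proportion to its Boltzmann probability."""
    probabilities = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probabilities, k=1)[0]


probs = boltzmann_probabilities([1.0, 2.0])
```

A larger `tau` flattens the probabilities toward uniform (more exploration), while a smaller `tau` concentrates them on the best-looking action.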

In [35]:
# Check the observation space shape
print("Observation space shape:", env.observation_space.shape[0])

Observation space shape: 4


In [37]:
dqn = build_agent(model, actions)  # Build the DQN agent
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])  # Compile the DQN agent (`lr` is a deprecated alias)
dqn.fit(env, nb_steps=50000, visualize=False, verbose=1)  # Fit the DQN agent to the environment

Training for 50000 steps ...
Interval 1 (0 steps performed)
    1/10000 [..............................] - ETA: 8:11 - reward: 1.0000


50 episodes - episode_reward: 198.680 [134.000, 200.000] - loss: 17.741 - mae: 47.677 - mean_q: 95.463

Interval 2 (10000 steps performed)
63 episodes - episode_reward: 159.063 [40.000, 200.000] - loss: 17.816 - mae: 44.624 - mean_q: 89.194

Interval 3 (20000 steps performed)
53 episodes - episode_reward: 187.925 [38.000, 200.000] - loss: 13.905 - mae: 40.726 - mean_q: 81.458

Interval 4 (30000 steps performed)
50 episodes - episode_reward: 197.920 [96.000, 200.000] - loss: 16.084 - mae: 42.316 - mean_q: 84.747

Interval 5 (40000 steps performed)
done, took 346.620 seconds


<tensorflow.python.keras.callbacks.History at 0x1332d522e50>

In [38]:
scores = dqn.test(env, nb_episodes=100, visualize=False)  # Test the DQN agent
print(np.mean(scores.history['episode_reward']))  # Print the average score

Testing for 100 episodes ...
Episode 1: reward: 200.000, steps: 200
Episode 2: reward: 200.000, steps: 200
Episode 3: reward: 200.000, steps: 200
Episode 4: reward: 200.000, steps: 200
Episode 5: reward: 200.000, steps: 200
Episode 6: reward: 200.000, steps: 200
Episode 7: reward: 200.000, steps: 200
Episode 8: reward: 200.000, steps: 200
Episode 9: reward: 200.000, steps: 200
Episode 10: reward: 200.000, steps: 200
Episode 11: reward: 200.000, steps: 200
Episode 12: reward: 200.000, steps: 200
Episode 13: reward: 200.000, steps: 200
Episode 14: reward: 200.000, steps: 200
Episode 15: reward: 200.000, steps: 200
Episode 16: reward: 200.000, steps: 200
Episode 17: reward: 200.000, steps: 200
Episode 18: reward: 200.000, steps: 200
Episode 19: reward: 200.000, steps: 200
Episode 20: reward: 200.000, steps: 200
Episode 21: reward: 200.000, steps: 200
Episode 22: reward: 200.000, steps: 200
Episode 23: reward: 200.000, steps: 200
Episode 24: reward: 200.000, steps: 200
Episode 25: reward: 

In [39]:
_ = dqn.test(env, nb_episodes=5, visualize=True)  # Test the DQN agent with visualization

Testing for 5 episodes ...


Episode 1: reward: 200.000, steps: 200
Episode 2: reward: 200.000, steps: 200
Episode 3: reward: 200.000, steps: 200
Episode 4: reward: 200.000, steps: 200
Episode 5: reward: 200.000, steps: 200


# **4. Reloading the Agent from Saved Weights**

We can save the trained weights to disk and reload them later.


In [40]:
dqn.save_weights('dqn_weights.h5f', overwrite=True)

In [41]:
# Delete the model, agent, and environment to prove we can rebuild them from the saved weights
del model
del dqn
del env

In [42]:
env = gym.make('CartPole-v0')  # Create a new environment
actions = env.action_space.n  # Get the number of possible actions
states = env.observation_space.shape[0]  # Get the number of state variables
model = build_model(states, actions)  # Build a new model
dqn = build_agent(model, actions)  # Build a new agent
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])  # Compile the agent



In [43]:
dqn.load_weights('dqn_weights.h5f')  # Load the saved weights

In [44]:
dqn.test(env, nb_episodes=5, visualize=True)  # Test the agent with visualization

Testing for 5 episodes ...
Episode 1: reward: 200.000, steps: 200
Episode 2: reward: 200.000, steps: 200
Episode 3: reward: 200.000, steps: 200
Episode 4: reward: 200.000, steps: 200
Episode 5: reward: 200.000, steps: 200


<tensorflow.python.keras.callbacks.History at 0x1332ee3f8e0>