![Introduction to RL for Game AI](https://i.imgur.com/FFiOMJo.jpeg)

## LINKS: https://tinyurl.com/UUAIRL

## What Is Reinforcement Learning?

Imagine teaching someone to play a video game without being able to tell them the rules. You can only give them a thumbs up when they do something good and a thumbs down when they do something bad. Over time, they'd figure out what works and what doesn't through trial and error.

That's essentially what reinforcement learning (RL) is - a way for AI to learn by interacting with an environment and receiving feedback.

### Today we will use Pong as a case study

In our case, we're teaching an AI to play Pong by letting it:
- Try different paddle movements
- See what happens in the game
- Get rewards for hitting the ball
- Get penalties for missing the ball
- Gradually improve its strategy through experience

## The Key Components

Let's break down the essential parts of our reinforcement learning system:

1. **Agent**: The AI that controls the paddle
2. **Environment**: The Pong game
3. **State**: What our agent can observe about the game
   - Ball x-position
   - Ball y-position
   - Paddle y-position
   - Ball x-velocity
   - Ball y-velocity
4. **Actions**: What our agent can do
   - Move paddle up
   - Stay in place
   - Move paddle down
5. **Reward**: The feedback our agent receives
   - Positive reward (+1) for hitting the ball
   - Negative reward (-1) for missing the ball
   - Small "shaping" rewards to guide learning


![RL](https://upload.wikimedia.org/wikipedia/commons/1/1b/Reinforcement_learning_diagram.svg)
(Image from Wikimedia Commons)

## The Learning Loop

Here's how the learning process works:

1. The agent observes the current state of the game
2. Based on this state, it chooses an action (move up, stay, or move down)
3. The game updates (the ball and paddle move)
4. The agent receives a reward
5. The agent observes the new state
6. Repeat until the game ends
7. After the game ends, the agent learns from what happened

This cycle happens over and over - thousands of times - as the agent gradually improves.
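The loop above can be sketched in a few lines of Python. This is only an illustration: `ToyEnv`, `run_episode`, and the random action picker are hypothetical stand-ins, not the Pong environment or the agent built later in this notebook.

```python
import random

class ToyEnv:
    """Hypothetical stand-in environment: every episode ends after 5 steps."""
    def reset(self):
        self.t = 0
        return [0.0]                      # initial state

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0   # made-up reward rule
        done = self.t >= 5
        return [float(self.t)], reward, done

def run_episode(env, choose_action):
    """Play one episode and return the recorded (state, action, reward) steps."""
    trajectory = []
    state = env.reset()                   # step 1: observe the current state
    done = False
    while not done:                       # step 6: repeat until the game ends
        action = choose_action(state)     # step 2: choose an action
        next_state, reward, done = env.step(action)   # steps 3-4: game updates, reward arrives
        trajectory.append((state, action, reward))
        state = next_state                # step 5: observe the new state
    return trajectory                     # step 7: learning would use this record

random.seed(0)
trajectory = run_episode(ToyEnv(), lambda state: random.choice([0, 1, 2]))
print(len(trajectory))
```

The recorded trajectory is what the agent would learn from once the episode is over.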

## BUT HOW DO WE DO THIS?

There are many ways to do reinforcement learning, and most of the differences come down to the algorithm used for training.

- Do we know how to calculate the rewards? 
- Or the expected rewards for all possible actions? 
- Is it even possible?
- What is the thing that learns? A genetic algorithm? A Neural Network? ... 

![banner](https://i.imgur.com/O5UU2no.jpeg)

# BEHOLD AN ARTIFICIAL NEURON!!

In [1]:
import numpy as np
from ipywidgets import interact, FloatSlider
import matplotlib.pyplot as plt

def plot_neuron(input_value=1.0, weight=1.0, bias=0.0):
    # Compute output using the neuron formula
    output = input_value * weight + bias
    
    # Create figure
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.set_xlim(-1.5, 5.5)
    ax.set_ylim(-2, 4.5)
    ax.axis('off')
    ax.set_aspect('equal', 'box') # Ensure circles are not squished

    # Add formula text
    formula_text = f"Formula: output = (input × weight) + bias\n" \
                   f"         = ({input_value:.1f} × {weight:.1f}) + {bias:.1f}\n" \
                   f"         = {output:.2f}"
    ax.text(2.5, 4.2, formula_text, ha='center', va='top', fontsize=12, 
            bbox=dict(facecolor='white', alpha=0.9))

    # Draw input node
    ax.text(-1.2, 1.3, f"Input\n{input_value:0.2f}", fontsize=12, ha="center")
    plt.plot([-1.4, -0.7], [1.0, 1.0], color='gray', lw=2, linestyle='--')
    
    # Draw neuron
    circle = plt.Circle((2.0, 1.0), 1, color='skyblue', ec='k', zorder=2)
    ax.add_patch(circle)
    ax.text(2.0, 1.0, "Neuron", fontsize=12, ha="center", va="center")
    
    # Draw bias
    ax.annotate("", xy=(2.0, 2), xytext=(2.0, 2.7),
                arrowprops=dict(arrowstyle="->", color="red", lw=2))
    ax.text(2.0, 2.8, f"Bias: {bias:0.2f}", color="red", ha="center", fontsize=12)
    
    # Draw weight
    ax.annotate("", xy=(1.0, 1.0), xytext=(-0.7, 1.0),
                arrowprops=dict(arrowstyle="->", color="blue", lw=2))
    ax.text(-0.1, 1.1, f"Weight: {weight:0.2f}", color="blue", ha="center", fontsize=12)
    
    # Draw output
    ax.annotate("", xy=(3, 1.0), xytext=(5.2, 1.0),
                arrowprops=dict(arrowstyle="<-", color="green", lw=2))
    ax.text(4.2, 1.1, f"Output: {output:0.2f}", color="green", ha="center", fontsize=12)
    
    ax.set_title("Single Neuron with Linear Activation", fontsize=16)
    plt.show()

# Create interactive widget
interact(plot_neuron,
         input_value=FloatSlider(min=-5, max=5, step=0.1, value=1.0, description="Input"),
         weight=FloatSlider(min=-5, max=5, step=0.1, value=1.0, description="Weight"),
         bias=FloatSlider(min=-5, max=5, step=0.1, value=0.0, description="Bias"))


# A network

Many of these neurons are wired together into a network (the largest modern AI systems have billions of adjustable weights). They are usually arranged in layers that connect to each other: each neuron multiplies its inputs by its weights, adds the results, and sends its output forward to more neurons.

In [None]:
# Imports
from ipywidgets import FloatSlider, VBox, HBox, interactive_output
from IPython.display import display

def plot_network(x1, x2, w1_00, w1_10, w1_01, w1_11, w2_0, w2_1):
    # Compute the forward pass
    h1 = x1 * w1_00 + x2 * w1_10
    h2 = x1 * w1_01 + x2 * w1_11
    output = h1 * w2_0 + h2 * w2_1

    # Set up the figure
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.set_xlim(-1.5, 5)
    ax.set_ylim(-1.5, 3)
    ax.axis('off') 
    ax.set_aspect('equal', 'box') # Ensure circles are not squished

    # Define positions for nodes in each layer
    positions = {
        "x1": (0, 1.5),
        "x2": (0, 0.5),
        "h1": (2, 1.5),
        "h2": (2, 0.5),
        "output": (4, 1)
    }

    # Function to draw each node: a circle, with the label and node value inside
    def draw_node(pos, value, label, color):
        circle = plt.Circle(pos, 0.2, color=color, ec='k', zorder=5)
        ax.add_patch(circle)
        ax.text(pos[0], pos[1], f"{label}\n{value:.2f}", 
                ha='center', va='center', fontsize=10, zorder=6)

    # Draw nodes for each layer
    draw_node(positions["x1"], x1, "x₁", 'lightyellow')
    draw_node(positions["x2"], x2, "x₂", 'lightyellow')
    draw_node(positions["h1"], h1, "h₁", 'skyblue')
    draw_node(positions["h2"], h2, "h₂", 'skyblue')
    draw_node(positions["output"], output, "ŷ", 'lightgreen')

    # Draw layer labels above the nodes
    ax.text(positions["x1"][0], positions["x1"][1] + 0.6, "Input Layer",
            ha='center', va='center', fontsize=12, fontweight='bold')
    ax.text(positions["h1"][0], positions["h1"][1] + 0.6, "Hidden Layer",
            ha='center', va='center', fontsize=12, fontweight='bold')
    ax.text(positions["output"][0], positions["output"][1] + 0.6, "Output Layer",
            ha='center', va='center', fontsize=12, fontweight='bold')

    # Also add a clear summary of input and output values on the sides
    ax.text(-1.3, 1, f"Inputs:\n x₁ = {x1:.2f}\n x₂ = {x2:.2f}",
            fontsize=11, ha='center', va='center',
            bbox=dict(facecolor='white', alpha=0.9, edgecolor='gray'))
    ax.text(4.8, 1, f"Output:\n ŷ = {output:.2f}",
            fontsize=11, ha='center', va='center',
            bbox=dict(facecolor='white', alpha=0.9, edgecolor='gray'))

    # Define the connections with their labels. Each connection is a tuple:
    # (start_node, end_node, current weight value, weight label)
    connections = [
        ("x1", "h1", w1_00, "w₁₀₀"),
        ("x2", "h1", w1_10, "w₁₁₀"),
        ("x1", "h2", w1_01, "w₁₀₁"),
        ("x2", "h2", w1_11, "w₁₁₁"),
        ("h1", "output", w2_0, "w₂₀"),
        ("h2", "output", w2_1, "w₂₁"),
    ]
    
    # Function to draw an arrow (connection) with the connection label and weight value
    def draw_arrow(start, end, weight, wt_label):
        start_pos = np.array(positions[start])
        end_pos = np.array(positions[end])
        vector = end_pos - start_pos
        length = np.linalg.norm(vector)
        direction = vector / length
        
        # Adjust start and end positions so the arrow doesn't overlap the node circles
        start_adjust = start_pos + direction * 0.25
        end_adjust = end_pos - direction * 0.25
        
        # Draw arrow between nodes
        ax.annotate("",
                    xy=end_adjust,
                    xytext=start_adjust,
                    arrowprops=dict(arrowstyle="->", color="gray", lw=1.5),
                    zorder=3)
        # Place a label for the connection: show the weight variable and value
        midpoint = (start_adjust + end_adjust) / 2.0
        # Use a slight offset for clarity
        offset = np.array([0.0, 0.15])
        ax.text(midpoint[0] + offset[0], midpoint[1] + offset[1],
                f"{wt_label}\n{weight:.2f}", fontsize=9, color="red",
                ha='center', va='center', bbox=dict(facecolor='white', alpha=0.8, edgecolor='none'))

    # Draw all connection arrows with labels
    for start, end, weight, wt_label in connections:
        draw_arrow(start, end, weight, wt_label)

    # Place an explanation text block on the upper right, if desired
    explanation_text = (
        "Feedforward Computation:\n"
        "1. Inputs x₁ and x₂ are each multiplied by their connection weights.\n"
        "2. Hidden neurons sum these weighted inputs (h₁, h₂).\n"
        "3. Hidden outputs are multiplied by output weights and summed to form ŷ."
    )
    ax.text(4.2, 2.7, explanation_text, fontsize=10,
            bbox=dict(facecolor='white', edgecolor='gray', alpha=0.8),
            ha='left', va='top')

    plt.show()

### Create Interactive Widgets ###

# Input sliders for x1 and x2
slider_x1 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="x₁")
slider_x2 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="x₂")

# Sliders for weights connecting inputs to the hidden layer
slider_w1_00 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="w₁₀₀")
slider_w1_10 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="w₁₁₀")
slider_w1_01 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="w₁₀₁")
slider_w1_11 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="w₁₁₁")

# Sliders for weights connecting the hidden layer to the output
slider_w2_0 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="w₂₀")
slider_w2_1 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="w₂₁")

# Organize the slider layout
inputs_box = HBox([slider_x1, slider_x2])
weights_input_hidden = HBox([slider_w1_00, slider_w1_10, slider_w1_01, slider_w1_11])
weights_hidden_output = HBox([slider_w2_0, slider_w2_1])
ui = VBox([inputs_box, weights_input_hidden, weights_hidden_output])

# Set up the interactive output
out = interactive_output(plot_network, {
    "x1": slider_x1,
    "x2": slider_x2,
    "w1_00": slider_w1_00,
    "w1_10": slider_w1_10,
    "w1_01": slider_w1_01,
    "w1_11": slider_w1_11,
    "w2_0": slider_w2_0,
    "w2_1": slider_w2_1,
})

# Display the interactive UI and plot
display(ui, out)

## What is an Activation Function?

An activation function is applied to a neuron's weighted sum and determines how strongly the neuron is activated ("fires"), based on the input it receives.

## Why are Activation Functions Important?

### The Key Problem: Linear Limitations

Without activation functions, neural networks can only perform linear operations (multiplying and adding). Here's why this is a problem:

- **Linear operations can only create linear solutions**: No matter how many layers you stack, if each layer only does multiplication and addition, the entire network can only learn straight-line relationships between inputs and outputs.

- **Real-world problems aren't linear**: Most interesting problems (image recognition, language understanding, etc.) involve complex, curved relationships that can't be solved with just straight lines.
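A quick way to see the first point is to stack two layers that only multiply and add, then compare them with a single layer. The sketch below uses made-up random weights (`W1`, `W2`) and leaves out biases for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))   # first "layer": 2 inputs -> 3 hidden values
W2 = rng.normal(size=(3, 1))   # second "layer": 3 hidden values -> 1 output
x = rng.normal(size=(5, 2))    # five example inputs

two_layers = (x @ W1) @ W2     # pass through both layers, no activation
one_layer = x @ (W1 @ W2)      # one layer using the combined weights

# Both give the same outputs, so the extra layer added nothing new.
print(np.allclose(two_layers, one_layer))
```

Because the two results match, the two-layer network here is just a single linear layer in disguise.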

### How Activation Functions Solve This:

Activation functions introduce "bends" into the system. When we add an activation function:

1. The neuron can now respond differently to different input ranges
2. When combined with other neurons, these "bends" allow the network to approximate any curved shape
3. This enables the network to learn complex patterns that simple linear models cannot capture

**Simple example**: Imagine two groups of data points arranged along the two diagonals of an X. A single straight line can never separate the groups correctly, but with activation functions creating "bends," the network can learn the right boundary.

## The ReLU Activation Function

ReLU (Rectified Linear Unit) creates this crucial non-linearity in a very simple way:

- For negative inputs → output is 0
- For positive inputs → output is the same as the input

This simple "bend" at zero is enough to allow neural networks to learn incredibly complex patterns when many neurons work together.
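As a minimal sketch, ReLU can be written in one line of NumPy. The function name `relu` and the sample inputs are chosen here for illustration.

```python
import numpy as np

def relu(x):
    """Return x where x is positive, and 0 everywhere else."""
    return np.maximum(0.0, x)

inputs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
outputs = relu(inputs)
print(outputs)   # negative inputs become 0, positive inputs pass through
```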

![LET'S PONG](https://i.imgur.com/Tl3V3NE.jpeg)

## Training vs. Playing: Two Different Modes

It's important to understand the two modes of our agent:

### Training Mode
- Agent samples each action from the probabilities the network outputs, so its early choices look almost random
- It records everything that happens (states, actions, rewards)
- After each game, it updates its neural network to improve
- This involves exploration (trying new things)

### Playing Mode
- Agent always chooses the action with highest probability
- No more randomness or exploration
- No more learning or updates
- Just using what it has learned

We spend most of our time in training mode, then switch to playing mode when the agent is ready.
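The difference between the two modes can be sketched as one small function. The name `choose_action` and the probabilities are assumptions for illustration; the agent built later in the notebook may structure this differently.

```python
import numpy as np

def choose_action(probs, training, rng):
    """Pick an action index (0 = up, 1 = stay, 2 = down) from network probabilities."""
    if training:
        # Training mode: sample, so less likely actions still get tried.
        return int(rng.choice(len(probs), p=probs))
    # Playing mode: always take the most probable action.
    return int(np.argmax(probs))

rng = np.random.default_rng(0)
probs = np.array([0.3, 0.5, 0.2])   # hypothetical network output

sampled = [choose_action(probs, True, rng) for _ in range(1000)]
greedy = [choose_action(probs, False, rng) for _ in range(1000)]
print(np.bincount(sampled, minlength=3) / 1000)   # roughly follows probs
print(np.bincount(greedy, minlength=3) / 1000)    # all choices fall on action 1
```

Sampling is what gives the agent its exploration during training; taking the argmax removes it.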

There is no "one way" to do reinforcement learning. We don't have time to go through all cases, but we will look at one that is particularly useful for games.

## The REINFORCE Algorithm: Learning from Success and Failure

Now let's understand how our agent actually learns. We're using an algorithm called REINFORCE, which we'll explain step-by-step:

### Step 1: Play a Complete Game

The agent plays a full game of Pong until it misses the ball (game over). During this game, we record:
- Each state it observed
- Each action it took
- Each reward it received

Let's say our agent played a game that lasted 50 moves before missing the ball. We now have 50 (state, action, reward) tuples stored in memory.

### Step 2: Calculate the "Returns"

We need to know which actions were actually good in the long run. This is tricky because sometimes an action might look good immediately but lead to failure later.

To solve this, we calculate the "return" for each step - essentially the total future reward from that point onwards, with future rewards discounted (valued less than immediate rewards).

For each step t, we calculate:
$$
\text{Return}(t) = \text{Reward}(t) + \gamma \cdot \text{Reward}(t+1) + \gamma^2 \cdot \text{Reward}(t+2) + \cdots
$$

Where $\gamma$ is a number between 0 and 1 that determines how much we care about future rewards.


#### Example:
If our rewards were [0, 0, 0, 1, 0, 0, -1] (steps 0 to 6) and gamma is 0.9, we work backwards from the last step:
- Return at step 6 = -1
- Return at step 5 = 0 + 0.9 * (-1) = -0.9
- Return at step 4 = 0 + 0.9 * (-0.9) = -0.81
- Return at step 3 = 1 + 0.9 * (-0.81) = 0.271
- ...and so on

This gives us a better measure of how good each action really was.
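A minimal sketch of this calculation, assuming the rewards of one episode are stored in a plain Python list. The helper name `discounted_returns` is chosen here for illustration. Working backwards lets each return reuse the one after it.

```python
def discounted_returns(rewards, gamma):
    """Compute Return(t) for every step by working backwards from the end."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Return(t) = Reward(t) + gamma * Return(t+1)
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [0, 0, 0, 1, 0, 0, -1]
returns = discounted_returns(rewards, gamma=0.9)
print([round(g, 4) for g in returns])
```

The last four entries match the worked example: 0.271, -0.81, -0.9, and -1.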



### Step 3: Update the Policy Network

Now comes the crucial part - we need to adjust our neural network to make good actions more likely and bad actions less likely in the future.

*Roughly*, for each (state, action, return) tuple:
1. Feed the state into the network to get the current probabilities
2. Increase the probability of the action taken if the return was positive
3. Decrease the probability of the action taken if the return was negative

#### How Weights Actually Change

This is where we need to understand how neural networks learn:

1. Each connection in our neural network has a "weight" - just a number that determines how strong that connection is.
2. These weights determine the final probabilities output by the network.
3. To make an action more likely, we need to adjust the weights that led to that action.

Let's break this down with a simple example:

Imagine our network gave these probabilities for a particular state:
- UP: 30%
- STAY: 50%
- DOWN: 20%

The agent selected STAY (based on these probabilities), and this eventually led to a positive return of 0.8.

We want to adjust our network to make STAY even more likely in this situation next time. The math works out such that:
- Weights that contributed to the STAY probability get increased
- The larger the return (0.8 in this case), the larger the increase
- Weights that didn't contribute to STAY don't change much


The technical term for this process is "gradient ascent on the policy parameters" - but you can think of it as "tweak the weights to make good actions more likely."
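To make the STAY example concrete, here is a simplified sketch of one gradient ascent step. It assumes the probabilities come from a softmax over three raw scores, and it nudges those scores directly instead of the weights of a full network. The learning rate of 0.5 is an arbitrary choice for illustration.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)     # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.log(np.array([0.3, 0.5, 0.2]))   # scores giving UP 30%, STAY 50%, DOWN 20%
action = 1                                    # the agent picked STAY
G = 0.8                                       # the return that followed
learning_rate = 0.5                           # assumed step size

probs_before = softmax(logits)
one_hot = np.zeros(3)
one_hot[action] = 1.0
grad_log_prob = one_hot - probs_before        # gradient of log(prob[action]) w.r.t. the scores
logits = logits + learning_rate * G * grad_log_prob   # gradient ascent step, scaled by the return
probs_after = softmax(logits)

print(probs_before.round(3), probs_after.round(3))
```

In this sketch the STAY probability goes up, and because the probabilities must still sum to 1, UP and DOWN shrink a little. With a negative return the same step would push STAY down instead.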


### Step 4: Repeat
This process is repeated for many episodes (typically thousands), each one nudging the policy in the direction of higher rewards.

# Let's write the game engine

In [None]:
# -------------------------------------------------------------------------
# GAME CONSTANTS (these would typically be in a header file in C++)
# -------------------------------------------------------------------------
WIDTH = 400           # Game screen width in pixels
HEIGHT = 250          # Game screen height in pixels
PADDLE_HEIGHT = 70    # Paddle height in pixels
PADDLE_WIDTH = 15     # Paddle width in pixels
PADDLE_MOVE_SPEED = 5 # How fast the paddle moves when a key is pressed
BALL_RADIUS = 7       # Ball radius in pixels
BALL_SPEED = 5        # Base ball speed in pixels per frame


In [None]:
import time
import random
import threading
import numpy as np
from ipycanvas import Canvas
import ipywidgets as widgets
from ipyevents import Event
from IPython.display import display

def pong_step(state, action):
    """
    Update ball and right-paddle state.
    
    state: [ball_x, ball_y, paddle_y, ball_dx, ball_dy]
    action (for right paddle): 0 = up, 1 = none, 2 = down.
    
    Returns a list of native Python floats.
    """
    ball_x, ball_y, paddle_y, ball_dx, ball_dy = state

    # Update AI paddle position (right paddle) based on action.
    if action == 0:
        paddle_y = max(0.0, paddle_y - PADDLE_MOVE_SPEED)
    elif action == 2:
        paddle_y = min(HEIGHT - PADDLE_HEIGHT, paddle_y + PADDLE_MOVE_SPEED)

    # Update ball position.
    ball_x += ball_dx
    ball_y += ball_dy

    # Bounce off top and bottom walls.
    if ball_y - BALL_RADIUS < 0:
        ball_y = BALL_RADIUS
        ball_dy = abs(ball_dy)
    elif ball_y + BALL_RADIUS > HEIGHT:
        ball_y = HEIGHT - BALL_RADIUS
        ball_dy = -abs(ball_dy)

    # Handle collision with the right (AI) paddle.
    if ball_x + BALL_RADIUS >= WIDTH - PADDLE_WIDTH:
        # If ball hits paddle, bounce back.
        if paddle_y <= ball_y <= (paddle_y + PADDLE_HEIGHT):
            ball_x = WIDTH - PADDLE_WIDTH - BALL_RADIUS
            ball_dx = -abs(ball_dx)
        else:
            # If the paddle missed, reset the ball and the right paddle.
            ball_x = WIDTH / 2.0
            ball_y = random.uniform(BALL_RADIUS, HEIGHT - BALL_RADIUS)
            ball_dx = BALL_SPEED
            ball_dy = BALL_SPEED
            paddle_y = (HEIGHT - PADDLE_HEIGHT) / 2.0

    # (Left wall collision handled in game loop)
    return [float(ball_x), float(ball_y), float(paddle_y), float(ball_dx), float(ball_dy)]


class PongGame:
    def __init__(self, ai_function):
        """
        Initialize the Pong game.
        
        ai_function(ball_x, ball_y, paddle_y, ball_dx, ball_dy)
            should return one of: "up", "none", or "down" for the right paddle.
        """
        # Left (player) paddle and ball state.
        self.left_paddle_y = (HEIGHT - PADDLE_HEIGHT) / 2.0
        self.right_paddle_y = (HEIGHT - PADDLE_HEIGHT) / 2.0
        self.ball_x = WIDTH / 2.0
        self.ball_y = random.uniform(BALL_RADIUS, HEIGHT - BALL_RADIUS)
        self.ball_dx = BALL_SPEED
        self.ball_dy = BALL_SPEED

        self.ai_function = ai_function

        # Movement flags for the left paddle.
        self.left_up_active = False
        self.left_down_active = False

        self.running = False

        self._create_widgets()

    def _create_widgets(self):
        # Create the game canvas.
        self.canvas = Canvas(width=WIDTH, height=HEIGHT)
        display(self.canvas)
        
        # Create control buttons.
        self.btn_left_up = widgets.Button(
            description="UP▲", layout=widgets.Layout(width='100px'), button_style='info')
        self.btn_left_down = widgets.Button(
            description="▼DOWN", layout=widgets.Layout(width='100px'), button_style='info')
        self.btn_stop = widgets.Button(
            description="STOP GAME", layout=widgets.Layout(width='100px', height='40px'),
            button_style='danger')
        
        # Set up ipyevents on the left paddle buttons for mousedown/up/leave.
        event_up = Event(source=self.btn_left_up, watched_events=['mousedown', 'mouseup', 'mouseleave'])
        event_up.on_dom_event(self._handle_left_up)
        event_down = Event(source=self.btn_left_down, watched_events=['mousedown', 'mouseup', 'mouseleave'])
        event_down.on_dom_event(self._handle_left_down)

        # Stop button uses normal on_click.
        self.btn_stop.on_click(self._stop_game)
        
        # Display control buttons.
        controls = widgets.VBox([widgets.HBox([self.btn_left_up, self.btn_left_down]), self.btn_stop])
        display(controls)
    
    def _handle_left_up(self, event):
        # When the up button is pressed, set the flag; released/leave clears it.
        if event['type'] == 'mousedown':
            self.left_up_active = True
        elif event['type'] in ['mouseup', 'mouseleave']:
            self.left_up_active = False
    
    def _handle_left_down(self, event):
        # When the down button is pressed, set the flag; released/leave clears it.
        if event['type'] == 'mousedown':
            self.left_down_active = True
        elif event['type'] in ['mouseup', 'mouseleave']:
            self.left_down_active = False
    
    def _draw(self):
        # Draw ball first (helps with flicker)
        self.canvas.fill_style = 'black'
        self.canvas.fill_circle(self.ball_x, self.ball_y, BALL_RADIUS)

        # Clear the canvas and redraw all elements in the correct order.
        self.canvas.clear()
        
        # Draw background first
        self.canvas.fill_style = 'white'
        self.canvas.fill_rect(0, 0, WIDTH, HEIGHT)
        
        # Draw paddles
        self.canvas.fill_style = 'blue'
        self.canvas.fill_rect(0, self.left_paddle_y, PADDLE_WIDTH, PADDLE_HEIGHT)
        
        self.canvas.fill_style = 'red'
        self.canvas.fill_rect(WIDTH - PADDLE_WIDTH, self.right_paddle_y, PADDLE_WIDTH, PADDLE_HEIGHT)
        
        # Draw ball last (on top of everything else)
        self.canvas.fill_style = 'black'
        self.canvas.fill_circle(self.ball_x, self.ball_y, BALL_RADIUS)
    
    def _reset_ball(self):
        # Reset the ball to the center with a random vertical position.
        self.ball_x = WIDTH / 2.0
        self.ball_y = random.uniform(BALL_RADIUS, HEIGHT - BALL_RADIUS)
        self.ball_dx = BALL_SPEED
        self.ball_dy = BALL_SPEED
        self.right_paddle_y = (HEIGHT - PADDLE_HEIGHT) / 2.0

    def game_loop(self):
        fps_delay = 1.0 / 30.0  # approximately 30 FPS
        mapping = {"up": 0, "none": 1, "down": 2}
        while self.running:
            # Move the left paddle based on button flags.
            if self.left_up_active:
                self.left_paddle_y = max(0.0, self.left_paddle_y - PADDLE_MOVE_SPEED)
            if self.left_down_active:
                self.left_paddle_y = min(HEIGHT - PADDLE_HEIGHT, self.left_paddle_y + PADDLE_MOVE_SPEED)
            
            # Build the game state for the ball and right paddle.
            state = [self.ball_x, self.ball_y, self.right_paddle_y, self.ball_dx, self.ball_dy]

            # Get the AI action for the right paddle.
            ai_action = self.ai_function(self.ball_x, self.ball_y, self.right_paddle_y, self.ball_dx, self.ball_dy)
            action_int = mapping.get(ai_action, 1)

            # Update ball position and the right paddle using pong_step.
            new_state = pong_step(state, action_int)
            self.ball_x, self.ball_y, self.right_paddle_y, self.ball_dx, self.ball_dy = new_state
            
            # Check collision with the left (player) paddle.
            if self.ball_x - BALL_RADIUS <= PADDLE_WIDTH:
                if self.left_paddle_y <= self.ball_y <= (self.left_paddle_y + PADDLE_HEIGHT):
                    # Bounce the ball off the player's paddle.
                    self.ball_x = PADDLE_WIDTH + BALL_RADIUS
                    self.ball_dx = abs(self.ball_dx)
                else:
                    # The player missed: reset the ball.
                    self._reset_ball()
            
            self._draw()
            time.sleep(fps_delay)
    
    def start(self):
        self.running = True
        # Run the game loop in a separate thread to free the UI thread.
        self.thread = threading.Thread(target=self.game_loop, daemon=True)
        self.thread.start()
    
    def _stop_game(self, _):
        self.running = False
        self.btn_stop.description = "Stopped"
        self.btn_stop.disabled = True
        self.left_up_active = False
        self.left_down_active = False

def start_game(ai_function):
    """
    Initialize and start the Pong game.
    
    Provide an ai_function(ball_x, ball_y, paddle_y, ball_dx, ball_dy)
    that returns "up", "none", or "down" for controlling the right paddle.
    """
    game = PongGame(ai_function)
    game.start()
    return game


In [None]:
# This is how we use it.

# Some useful things for you to use in your implementation
# paddle_center = paddle_y + PADDLE_HEIGHT/2 
# ball_center = ball_y   (ball_y is already the vertical center of the ball)


# --- Example AI Function ---
def simple_ai(ball_x, ball_y, paddle_y, ball_dx, ball_dy):
    """
    A basic AI: move the paddle up or down so that its center follows the ball.
    It returns "up" if the paddle should move up, "down" if it should move down,
    and "none" if it should stay still.
    """
    #TASK 1: YOUR CODE HERE

    # Move the paddle up if the ball is above the center of the paddle and the ball dx is positive
    if ball_y < paddle_y + PADDLE_HEIGHT/2 and ball_dx > 0:
        return "up"
    # Move the paddle down if the ball is below the center of the paddle and the ball dx is positive
    elif ball_y > paddle_y + PADDLE_HEIGHT/2 and ball_dx > 0:
        return "down"
    # Otherwise (ball moving away, or already level with the paddle center), stay still
    return "none"

# --- Start the Game ---
# Pass the AI function you want to use.
start_game(simple_ai)

# Implementing Reinforcement Learning 
Now that we've set up our Pong game environment, we're ready to create and train an AI that can learn to play. We'll be using a reinforcement learning approach called REINFORCE (a type of policy gradient method).

## What was Reinforcement Learning again?

Reinforcement learning works by trial and error:
- The agent (our AI) takes actions in the environment
- It receives feedback in the form of rewards
- It learns to take actions that maximize its total reward

Think of it like training a dog: we don't tell it exactly how to catch a frisbee, but we reward it when it does, and over time it figures out the best strategy.

## Our Implementation Plan

Here's what we'll do next:

1. **Build the Agent**: Create a class that:
   - Contains a neural network (the "brain" of our AI)
   - Can choose actions based on the game state
   - Keeps track of its experiences (states, actions, rewards)
   - Can learn from these experiences

2. **Train the Agent**: Run many games where:
   - The agent observes the state and chooses actions
   - We record what happens (rewards received)
   - After each game, the agent updates its neural network to improve

3. **Use the Trained Agent**: Once training is complete, we can use our AI to play the game based on what it has learned.

## Key Components

Our implementation will include:

- **Neural Network**: A simple model with one hidden layer that takes the game state as input and outputs probabilities for each action (up, stay, down)

- **Action Selection**: During training, actions will be chosen probabilistically to encourage exploration. After training, the agent will choose the most likely action.

- **REINFORCE Algorithm**: This is how our agent will learn. After each game:
  - It calculates the cumulative rewards from each time step
  - It adjusts its neural network to make actions that led to good outcomes more likely in the future

- **Reward Shaping**: To help speed up learning, we'll provide small intermediate rewards for keeping the paddle near the ball.

The code we're about to implement will transform our Pong environment into a learning playground for our AI. By the end of training, we should have an agent that can effectively track and hit the ball.
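Before bringing in a machine learning library, it may help to see what such a network computes. The sketch below is a plain NumPy forward pass with assumed details: a hidden layer of 16 ReLU neurons, randomly initialised weights, and a state that has been scaled to small values. None of these choices are taken from the implementation that follows.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16                                   # assumed hidden-layer size

# Randomly initialised weights; training would adjust these.
W1 = rng.normal(scale=0.1, size=(5, HIDDEN))  # 5 state values -> hidden layer
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, 3))  # hidden layer -> 3 actions
b2 = np.zeros(3)

def policy(state):
    """Map a game state to probabilities for (up, stay, down)."""
    hidden = np.maximum(0.0, state @ W1 + b1)   # ReLU hidden layer
    logits = hidden @ W2 + b2                   # one raw score per action
    exp = np.exp(logits - logits.max())         # softmax, shifted for stability
    return exp / exp.sum()

# Hypothetical scaled state: ball x, ball y, paddle y, ball dx, ball dy.
state = np.array([0.5, 0.4, 0.36, 1.0, 1.0])
probs = policy(state)
print(probs)
```

With untrained weights the three probabilities come out close to equal, which is why early training games look random.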

In [None]:
# %% Pong Reinforcement Learning Tutorial
# This file demonstrates how to create an AI for Pong using reinforcement learning
# Intended for game design students with C++ background who are new to Python and ML

import time
import numpy as np              # NumPy handles arrays and math operations (like C++ vectors but more powerful)
import tensorflow as tf         # TensorFlow is a machine learning library
from tensorflow import keras    # Keras is a high-level neural network API
from tensorflow.keras import layers  # Layers are the building blocks of neural networks (imported from the same Keras that ships with TensorFlow)


# -------------------------------------------------------------------------
# GAME PHYSICS AND REWARD SYSTEM
# -------------------------------------------------------------------------
def rl_step(state, action):
    """
    Simulates one step of the Pong game physics and calculates rewards.
    
    This is similar to the Update() or Step() function you might have in a C++ game loop.
    
    Parameters:
    - state: [ball_x, ball_y, paddle_y, ball_dx, ball_dy] - Current game state
    - action: What the paddle should do (0 = move up, 1 = stay still, 2 = move down)
    
    Returns:
    - new_state: Updated game state after this step
    - reward: Positive or negative feedback based on the agent's performance
    - done: Whether the game is over (ball passed the paddle)
    """
    # Unpack the state values for readability - similar to struct access in C++
    ball_x, ball_y, paddle_y, ball_dx, ball_dy = state

    # -------------------------------------------------------------------------
    # 1. UPDATE PADDLE POSITION BASED ON ACTION
    # -------------------------------------------------------------------------
    if action == 0:  # Move paddle up
        paddle_y = max(0.0, paddle_y - PADDLE_MOVE_SPEED)  # Prevent going above the screen
    elif action == 2:  # Move paddle down
        paddle_y = min(HEIGHT - PADDLE_HEIGHT, paddle_y + PADDLE_MOVE_SPEED)  # Prevent going below the screen
    # If action == 1, the paddle doesn't move (stays in place)
    
    # -------------------------------------------------------------------------
    # 2. UPDATE BALL POSITION
    # -------------------------------------------------------------------------
    ball_x += ball_dx  # Move ball horizontally
    ball_y += ball_dy  # Move ball vertically
    
    # -------------------------------------------------------------------------
    # 3. HANDLE BALL COLLISIONS WITH TOP AND BOTTOM WALLS
    # -------------------------------------------------------------------------
    if ball_y - BALL_RADIUS < 0:  # Ball hits top wall
        ball_y = BALL_RADIUS  # Reposition to prevent getting stuck in wall
        ball_dy = abs(ball_dy)  # Flip vertical direction to positive (downward)
    
    if ball_y + BALL_RADIUS > HEIGHT:  # Ball hits bottom wall
        ball_y = HEIGHT - BALL_RADIUS  # Reposition to prevent getting stuck in wall
        ball_dy = -abs(ball_dy)  # Flip vertical direction to negative (upward)
    
    # -------------------------------------------------------------------------
    # 4. CALCULATE REWARD FOR THE AI
    # -------------------------------------------------------------------------
    # "Shaping" rewards guide the AI toward better behavior before it succeeds
    # This is like giving hints rather than just win/lose feedback
    
    # Calculate how well the paddle is positioned relative to the ball
    paddle_center = paddle_y + PADDLE_HEIGHT / 2.0
    # This gives higher rewards when paddle is closer to ball's height
    shaping_factor = 0.8
    shaping_reward = (1 - abs(paddle_center - ball_y)/HEIGHT) * shaping_factor
    
    # Small penalty for not moving, to encourage active play
    if action == 1:  # If the paddle didn't move
        shaping_reward -= 0.1  # Apply small penalty
        
    # Start with the shaping reward
    reward = shaping_reward 
    done = False  # Game continues by default
    
    # -------------------------------------------------------------------------
    # 5. HANDLE BALL COLLISION WITH RIGHT PADDLE (AI's paddle)
    # -------------------------------------------------------------------------
    if ball_x + BALL_RADIUS >= WIDTH - PADDLE_WIDTH:  # Ball reaches the right edge where paddle is
        if paddle_y <= ball_y <= (paddle_y + PADDLE_HEIGHT):  # Ball hits paddle
            # Successful hit!
            reward = shaping_reward + 1.0  # Big reward for hitting the ball
            # Push ball back a bit so it doesn't get stuck inside paddle
            ball_x = WIDTH - PADDLE_WIDTH - BALL_RADIUS
            # Reverse horizontal direction
            ball_dx = -abs(ball_dx)
        else:
            # Ball missed the paddle - game over!
            reward = shaping_reward - 1.0  # Penalty for missing
            done = True  # End the game
    
    # -------------------------------------------------------------------------
    # 6. HANDLE BALL COLLISION WITH LEFT WALL (where an opponent would be)
    # -------------------------------------------------------------------------
    if ball_x - BALL_RADIUS <= 0:  # Ball hits left wall
        ball_x = BALL_RADIUS  # Reposition
        ball_dx = abs(ball_dx)  # Reverse direction to positive (rightward)

    # -------------------------------------------------------------------------
    # 7. PREPARE AND RETURN THE NEW GAME STATE
    # -------------------------------------------------------------------------
    new_state = np.array([ball_x, ball_y, paddle_y, ball_dx, ball_dy], dtype=np.float32)
    return new_state, reward, done
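
To get a feel for the shaping term computed in step 4 of `rl_step`, here is a small standalone sketch. The values of `HEIGHT` and `PADDLE_HEIGHT` are assumptions chosen for illustration, and `shaping_reward` is a hypothetical helper that mirrors the formula above; it is not part of the notebook's game code.

```python
# Standalone sketch of the shaping term from step 4 of rl_step.
# HEIGHT and PADDLE_HEIGHT are assumed values for illustration only.
HEIGHT = 400.0
PADDLE_HEIGHT = 80.0
SHAPING_FACTOR = 0.8


def shaping_reward(paddle_y, ball_y, action):
    """Return the per-step shaping reward that nudges the paddle toward the ball."""
    paddle_center = paddle_y + PADDLE_HEIGHT / 2.0
    # 1.0 when the paddle center is level with the ball, falling linearly with distance
    closeness = 1.0 - abs(paddle_center - ball_y) / HEIGHT
    reward = closeness * SHAPING_FACTOR
    if action == 1:  # staying still costs a little
        reward -= 0.1
    return reward


aligned = shaping_reward(paddle_y=160.0, ball_y=200.0, action=0)   # paddle center at 200
far_away = shaping_reward(paddle_y=0.0, ball_y=380.0, action=0)    # paddle center at 40
idle = shaping_reward(paddle_y=160.0, ball_y=200.0, action=1)      # aligned but not moving

print(f"aligned={aligned:.2f}, far_away={far_away:.2f}, idle={idle:.2f}")
```

Under these assumptions the shaping term can reach 0.8 on every step, so over a long rally it can add up to more than the single +1.0 or -1.0 for a hit or a miss. That is worth keeping in mind when tuning `shaping_factor`.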



# How Neural Network Weights Change in REINFORCE

## Weights as Sensitivity Knobs
Neural network weights act like **tunable dials** that determine how the agent interprets game states. These numbers evolve to prioritize actions that maximize rewards.

---

## Neural Network Architecture
| Component       | Description                                                                 |
|-----------------|-----------------------------------------------------------------------------|
| **Input Layer** | Receives game state (ball/paddle positions, velocities)                    |
| **Weights**     | Numerical values controlling signal strength between neurons               |
| **Output Layer**| Produces probability distribution over actions (UP/STAY/DOWN)              |
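
As a rough sketch of what these components compute, the NumPy snippet below runs one forward pass through a 5 → 8 → 3 network. It assumes a single hidden layer of 8 ReLU units and a softmax output, as in the `RLAgent` class later in this notebook. The weights are random and the example state is made up, so the printed probabilities carry no meaning beyond showing the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized weights for a 5 -> 8 -> 3 network (illustrative, not trained)
W1 = rng.normal(scale=0.5, size=(5, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 3))
b2 = np.zeros(3)


def forward(state):
    """Map a normalized 5-value game state to probabilities for UP/STAY/DOWN."""
    hidden = np.maximum(0.0, state @ W1 + b1)   # ReLU: negative values become 0
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())         # subtract max for numerical stability
    return exp / exp.sum()                      # softmax: positive values summing to 1


# Hypothetical normalized state: [ball_x, ball_y, paddle_y, ball_dx, ball_dy]
state = np.array([0.5, 0.3, 0.6, 1.0, -1.0])
probs = forward(state)
print(dict(zip(["UP", "STAY", "DOWN"], probs.round(3))))
```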

---

## Learning Process: Step-by-Step
1. **Forward Pass**  
   - Pass the current game state through the network → get action probabilities  
   - *Example Output:* UP (20%), STAY (30%), DOWN (50%)

2. **Action Selection**  
   - Randomly sample from probabilities (e.g., chooses DOWN)

3. **Reward Calculation**  
   - Compute discounted return for trajectory segment  
   - *Example Return:* +0.7 (accounts for future rewards)

4. **Backpropagation**  

   The gradient is a mathematical concept that tells us the direction and magnitude of the steepest increase of a function. In neural networks, it's essentially a collection of partial derivatives that indicate how a small change in each weight would affect the output.

   Key Points About Gradients:

   - **Derivative Connection:** The gradient is built from partial derivatives – these measure how much the network's output changes when you slightly adjust a specific weight, while keeping all other weights constant.
   - **Direction of Improvement:** When maximizing rewards, the gradient points in the direction where weights should change to increase the probability of beneficial actions.
   - **Visualization:** Think of the gradient as a compass pointing "uphill" on a landscape where elevation represents better performance. The steeper the hill, the larger the gradient magnitude.
   
   ![gradient illustration](https://ds100.org/course-notes/feature_engineering/images/loss_surface.png)
   Image from: https://ds100.org/course-notes/feature_engineering/feature_engineering.html

   When the REINFORCE algorithm multiplies this gradient by the return value, it strengthens connections that led to good outcomes (positive returns) and weakens those that led to poor ones (negative returns), proportional to how much each weight influenced the chosen action.

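   ```python
   # Pseudocode for the weight update logic.
   # grad_log_pi[w] is the partial derivative of log π(chosen_action | state) w.r.t. weight w.
   # Its sign already says whether increasing w makes the chosen action (DOWN) more or less likely,
   # so the same rule applies to every weight:
   for w in network_weights:
       w += learning_rate * G * grad_log_pi[w]
   # Positive return G: the chosen action becomes more likely. Negative G: less likely.
   ```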
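
The four steps above can be sketched in a few lines of NumPy. To keep it short, this example assumes the simplest possible policy, where the three logits themselves are the trainable parameters (no hidden layer). For a softmax policy the gradient of log π(a) with respect to the logits is `one_hot(a) - probs`. The function name `reinforce_step` and the learning rate are illustrative choices, not part of the notebook's agent.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def reinforce_step(logits, action, G, learning_rate=0.1):
    """One REINFORCE update where the logits themselves are the parameters."""
    probs = softmax(logits)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0   # d log pi(action) / d logits = one_hot(action) - probs
    return logits + learning_rate * G * grad_log_pi


UP, STAY, DOWN = 0, 1, 2
logits = np.log(np.array([0.2, 0.3, 0.5]))   # starts at UP 20%, STAY 30%, DOWN 50%

after_good = softmax(reinforce_step(logits, DOWN, G=+0.7))
after_bad = softmax(reinforce_step(logits, DOWN, G=-0.7))

print("before:    ", softmax(logits).round(3))
print("G = +0.7 ->", after_good.round(3))
print("G = -0.7 ->", after_bad.round(3))
```

With a positive return the probability of DOWN rises above 50%, and with a negative return it falls below 50%. The same update expression is used in both cases; only the sign of `G` differs.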
# REINFORCE Algorithm: Formula and Code Implementation

## Standard Expression vs. Code Implementation

The standard REINFORCE formula is typically written for a single timestep:

$$\Large \theta_{t+1} = \theta_t + \alpha \nabla_\theta \log \pi_\theta(a_t \mid s_t) G_t$$

However, in practical implementation, we update across an entire episode of multiple timesteps at once.

## Episode-Based Approach 

In practice, this means averaging the per-timestep terms over the episode:

$$\Large \theta_{new} = \theta_{old} + \alpha \nabla_\theta \left( \frac{1}{T} \sum\limits_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) G_t \right)$$

Or equivalently, written as minimizing a loss function:

$$\Large \theta_{new} = \theta_{old} - \alpha \nabla_\theta \left( -\frac{1}{T} \sum\limits_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) G_t \right)$$

Where:
- $T$ is the number of timesteps in the episode
- $\frac{1}{T} \sum_{t=0}^{T-1}$ represents the averaging operation (implemented as `reduce_mean`)
- The negative sign in the second formula corresponds to the negative in `loss = -tf.reduce_mean(weighted_log_pi)`
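
A small worked example may make the averaged objective concrete. The sketch below uses a made-up three-step episode: the rewards, the discount factor, and the probabilities the policy assigned to the chosen actions are all hypothetical. It computes the discounted returns $G_t$ and then the loss exactly as written in the second formula, without a baseline.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G


# Hypothetical 3-step episode: rewards, and the probability the policy
# assigned to the action it actually took at each step
rewards = [0.0, 0.0, 1.0]
prob_taken = np.array([0.5, 0.4, 0.8])

G = discounted_returns(rewards, gamma=0.5)
objective = np.mean(np.log(prob_taken) * G)   # (1/T) * sum_t log pi(a_t|s_t) * G_t
loss = -objective                             # what a minimizing optimizer sees

print("G_t  =", G)                            # G_t  = [0.25 0.5  1.  ]
print("loss =", round(float(loss), 4))
```

Maximizing `objective` and minimizing `loss` move the parameters in the same direction, which is why the code can hand the negated value to a standard optimizer.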


## Benefits of Episode-Based Updates

Averaging over timesteps keeps the size of each update from growing with episode length and smooths out the noise from any single timestep. Rather than making a separate update for every timestep, the policy is updated once per episode using the average over all of its timesteps.

This matches what the code does: one batch update per episode, computed from the average gradient across all of its timesteps.


---

## Weight Change Visualization
**Scenario:** Successful DOWN action (return = +0.7), illustrative values

| Weight Connection           | Initial | Updated | Change Direction |
|-----------------------------|---------|---------|-------------------|
| Ball_Y < 0.5 → DOWN         | 0.50    | 0.56    | ↑ Reinforcement   |
| Paddle_Y_diff > 0 → UP      | 0.30    | 0.26    | ↓ Penalization    |
| Ball_X_velocity → STAY      | -0.15   | -0.21   | ↓ Penalization    |

---


## Long-Term Evolution
| Training Stage | Weight Behavior                     | Agent Performance               |
|----------------|-------------------------------------|---------------------------------|
| Early          | Large random fluctuations           | Frequent misses                |
| Mid            | Pattern-specific boosting           | Consistent returns             |
| Late           | Fine-tuned precision adjustments    | Strategic positioning          |

---

In [None]:

# ===========================================================================
# REINFORCEMENT LEARNING AGENT
# ===========================================================================
# This is the "brain" of our AI paddle that learns to play pong
class RLAgent:
    def __init__(self, learning_rate=5e-3, gamma=0.76):
        """
        Initialize the AI agent.
        
        Parameters:
        - learning_rate: How quickly the model adapts to new information (like step size)
        - gamma: Discount factor - how much future rewards matter compared to immediate ones
        """
        self.gamma = gamma  # Store the discount factor for future rewards
        
        # -------------------------------------------------------------------------
        # CREATE THE NEURAL NETWORK MODEL
        # -------------------------------------------------------------------------
        # This is similar to creating a class with methods in C++, but using a
        # pre-built system for machine learning
        self.model = keras.Sequential([
            # Input layer takes 5 values (the game state)
            layers.Input(shape=(5,)),
            # Hidden layer with 8 neurons and ReLU activation
            # ReLU simply means "if value < 0, output 0, else output the value"
            layers.Dense(8, activation='relu'),
            # Output layer with 3 neurons (one for each possible action)
            # Softmax makes the outputs into probabilities that sum to 1
            layers.Dense(3, activation='softmax')
        ])
        
        # Initialize the optimizer which adjusts the neural network
        # Think of this as the "learning algorithm"
        self.optimizer = tf.keras.optimizers.Adam(learning_rate)
        
        # -------------------------------------------------------------------------
        # SAVING BUFFERS
        # -------------------------------------------------------------------------
        # These store the agent's experiences to learn from
        # Like recording gameplay to study later
        self.states = []    # Game states we've seen
        self.actions = []   # Actions we took
        self.rewards = []   # Rewards we received
    
    def _normalize_state(self, state):
        """
        Scale positions to the range [0, 1] and velocities to the range [-1, 1].
        
        This helps the neural network learn more efficiently,
        similar to how you'd normalize a 3D model's coordinates.
        """
        return np.array([
            state[0] / WIDTH,         # x position relative to screen width
            state[1] / HEIGHT,        # y position relative to screen height
            state[2] / HEIGHT,        # paddle position relative to screen height
            state[3] / BALL_SPEED,    # x velocity relative to maximum
            state[4] / BALL_SPEED,    # y velocity relative to maximum
        ], dtype=np.float32)
    
    def choose_action(self, state):
        """
        Decide what action to take based on the current game state.
        
        This is like the AI's "think" function that runs every frame.
        
        Parameters:
        - state: Current game state [ball_x, ball_y, paddle_y, ball_dx, ball_dy]
        
        Returns:
        - action: 0 (move up), 1 (stay), or 2 (move down)
        """
        # Normalize the state values to help the neural network
        # Here, normalization means scaling each value to a common range (it does not make the vector unit length)
        norm_state = self._normalize_state(state).reshape(1, -1)
        
        # Ask the neural network what to do
        # It returns probabilities for each possible action
        probs = self.model(norm_state).numpy().flatten()
        
        # Choose an action based on the probabilities
        # This adds randomness for exploration (trying new strategies)
        action = np.random.choice(3, p=probs)
        
        # Remember what we saw and what we did for learning later
        self.states.append(norm_state)
        self.actions.append(action)
        
        return action
    
    def store_reward(self, reward):
        """
        Store the reward received after taking an action.
        
        Parameters:
        - reward: The feedback value received from the environment
        """
        self.rewards.append(reward)
    
    def finish_episode(self):
        """
        Perform the REINFORCE update on the policy network.

        The update rule is:
            θₜ₊₁ = θₜ + α · ∇θ log π₍θ₎(aₜ | sₜ) · Gₜ
            
        This function implements each step explicitly.
        """

        # -------------------------------------------------------------------------
        # 1. COMPUTE THE DISCOUNTED RETURNS (Gₜ)
        # -------------------------------------------------------------------------
        # For each timestep t, compute the return G_t = r_t + γ * r_{t+1} + γ² * r_{t+2} + ...
        G_t = np.zeros_like(self.rewards, dtype=np.float32)  # Gₜ
        cumulative_return = 0.0
        for t in reversed(range(len(self.rewards))):
            cumulative_return = self.rewards[t] + self.gamma * cumulative_return  # Gₜ = rₜ + γ · Gₜ₊₁
            G_t[t] = cumulative_return

        # Subtract the mean return as a baseline to reduce the variance of the update.
        # (This centers the returns around zero; it does not rescale them.)
        baseline = np.mean(G_t)
        G_t = G_t - baseline  # Baseline-adjusted Gₜ

        # -------------------------------------------------------------------------
        # 2. PREPARE DATA: STATES (sₜ), ACTIONS (aₜ), RETURN (Gₜ)
        # -------------------------------------------------------------------------
        states = np.concatenate(self.states, axis=0)  # States: sₜ
        actions = np.array(self.actions)              # Actions: aₜ
        returns = G_t                                 # Returns: Gₜ

        # -------------------------------------------------------------------------
        # 3. COMPUTE THE POLICY OBJECTIVE AND GRADIENT (∇θ log π₍θ₎(aₜ|sₜ) · Gₜ)
        # -------------------------------------------------------------------------
        with tf.GradientTape() as tape:
            # Forward pass: Compute the action probabilities π₍θ₎(a | s) for all states.
            action_probs = self.model(states, training=True)  # π₍θ₎(·|s)

            # Create a one-hot vector for actions, so we can select the probability of the executed action.
            one_hot_actions = tf.one_hot(actions, depth=3)  # Assume 3 actions. This is our mask.

            # Select the probability for the taken action: π₍θ₎(aₜ|sₜ)
            prob_taken = tf.reduce_sum(action_probs * one_hot_actions, axis=1)

            # Compute log probability: log π₍θ₎(aₜ|sₜ)
            log_pi = tf.math.log(prob_taken + 1e-8)

            # Multiply by the return Gₜ: The term inside the gradient is log π₍θ₎(aₜ|sₜ) * Gₜ
            weighted_log_pi = log_pi * returns

            # Our objective (to be maximized) is the average policy "score":
            #   Objective = E[log π₍θ₎(aₜ|sₜ) * Gₜ]
            # We minimize the negative of this objective:
            loss = -tf.reduce_mean(weighted_log_pi)

        # -------------------------------------------------------------------------
        # 4. COMPUTE GRADIENTS AND UPDATE THE MODEL PARAMETERS
        # -------------------------------------------------------------------------
        # Compute the gradient: ∇θ [ - (log π₍θ₎(aₜ | sₜ) * Gₜ) ]
        gradients = tape.gradient(loss, self.model.trainable_variables)

        # The optimizer applies the update using the learning rate (α) set during its initialization.
        # Plain gradient ascent would be: θₜ₊₁ = θₜ + α · ∇θ log π₍θ₎(aₜ|sₜ) · Gₜ
        # (Adam additionally rescales each step using running statistics of past gradients.)
        self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))

        # -------------------------------------------------------------------------
        # 5. RESET EPISODE MEMORY FOR THE NEXT EPISODE
        # -------------------------------------------------------------------------
        self.states, self.actions, self.rewards = [], [], []
    
    def get_action(self, state):
        """
        Choose the best action without randomness (for actual gameplay).
        
        This is used after training when we want the AI to play its best.
        
        Parameters:
        - state: Current game state
        
        Returns:
        - Best action (0, 1, or 2)
        """
        norm_state = self._normalize_state(state).reshape(1, -1)
        probs = self.model(norm_state).numpy().flatten()
        return int(np.argmax(probs))  # Choose the action with highest probability (as a plain int)


# ===========================================================================
# TRAINING LOOP
# ===========================================================================
def train_agent(num_episodes=1000):
    """
    Train the agent by playing many games and learning from them.
    
    Parameters:
    - num_episodes: Number of games to play for training
    
    Returns:
    - trained_agent: The agent after training
    """
    # Create a new agent
    agent = RLAgent()
    total_rewards = []  # Track rewards for analysis

    max_steps_reached = 0  # Track the longest game since the last progress report
    
    # Play multiple games to train
    for i in range(num_episodes):
        # -------------------------------------------------------------------------
        # 1. SET UP A NEW GAME WITH RANDOM STARTING CONDITIONS
        # -------------------------------------------------------------------------
        # Randomize the ball and paddle positions for varied training
        ball_y_random = np.random.uniform(BALL_RADIUS, HEIGHT - BALL_RADIUS)
        paddle_y_random = np.random.uniform(0, HEIGHT - PADDLE_HEIGHT)
        
        # Initialize the game state
        state = np.array([
            WIDTH / 2.0,        # Ball starts in the middle horizontally
            ball_y_random,      # Random vertical position
            paddle_y_random,    # Random paddle position
            BALL_SPEED,         # Ball initially moves right
            BALL_SPEED          # Ball initially moves down
        ], dtype=np.float32)
        
        # -------------------------------------------------------------------------
        # 2. PLAY THE GAME UNTIL COMPLETION OR MAX STEPS
        # -------------------------------------------------------------------------
        episode_reward = 0.0  # Total reward for this game
        done = False          # Game not finished yet
        step = 0              # Step counter
        
        max_steps = 500  # Maximum steps per game (to prevent infinite games)
        
        # Game loop - similar to your C++ game loop
        while not done and step < max_steps:
            # AI chooses an action
            action = agent.choose_action(state)
            
            # Update the game state based on the action
            state, reward, done = rl_step(state, action)
            
            # Store the reward for learning
            agent.store_reward(reward)
            
            # Keep track of total reward
            episode_reward += reward
            
            # Increment step counter
            step += 1
        
        # -------------------------------------------------------------------------
        # 3. LEARN FROM THIS GAME
        # -------------------------------------------------------------------------
        agent.finish_episode()
        
        # Store results for analysis
        total_rewards.append(episode_reward)
        max_steps_reached = max(max_steps_reached, step)
       
        if (i+1) % 100 == 0:
            print(f"Episode {i+1}/{num_episodes}: Steps= {step}, Total Reward= {episode_reward:.2f}, Max Steps reached= {max_steps_reached}")
            max_steps_reached = 0
    
    return agent

# Train the agent.
trained_agent = train_agent(num_episodes=500)

# Wrap the trained agent into an AI function for gameplay.
def trained_ai_function(ball_x, ball_y, paddle_y, ball_dx, ball_dy):
    state = np.array([ball_x, ball_y, paddle_y, ball_dx, ball_dy], dtype=np.float32)
    action_idx = trained_agent.get_action(state)
    mapping = {0: "up", 1: "none", 2: "down"}
    return mapping[action_idx]


In [None]:
# Save the trained agent.
#trained_agent.model.save("trained_pong_agent.h5")

In [None]:
# Load an agent that was pre-trained for 5000 episodes (training took about 20 minutes).
trained_agent = RLAgent()
trained_agent.model = keras.models.load_model("trained_pong_agent.h5")

# Wrap the trained agent into an AI function for gameplay.
def trained_ai_function(ball_x, ball_y, paddle_y, ball_dx, ball_dy):
    state = np.array([ball_x, ball_y, paddle_y, ball_dx, ball_dy], dtype=np.float32)
    action_idx = trained_agent.get_action(state)
    mapping = {0: "up", 1: "none", 2: "down"}
    return mapping[action_idx]


In [None]:
start_game(trained_ai_function)

In [None]:
#@markdown Run to visualize the full trained network

import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import interactive, HBox, VBox
from IPython.display import display

# --- Ensure that your trained model is built ---
# (This dummy call forces the model’s graph to be built.)
_dummy = np.zeros((1, 5), dtype=np.float32)
_ = trained_agent.model(_dummy)

# --- Determine the hidden dense layer ---
# Depending on your Keras version, the explicit Input layer may or may not appear in model.layers.
# If it does, model.layers[0] is the InputLayer and model.layers[1] is Dense(8);
# in Keras 3 the InputLayer is often omitted, so model.layers[0] is Dense(8).
#
# Check the number of layers and adjust accordingly:
if len(trained_agent.model.layers) == 2:
    # Only the Dense layers are present.
    hidden_layer = trained_agent.model.layers[0]  # Dense(8)
elif len(trained_agent.model.layers) >= 3:
    # If the Input layer is included.
    hidden_layer = trained_agent.model.layers[1]  # Dense(8)
else:
    hidden_layer = trained_agent.model.layers[0]  # Fallback

print("Extracting hidden layer:", hidden_layer.name)

def visualize_trained_network(ball_x, ball_y, paddle_y, ball_dx, ball_dy):
    # Retrieve network weights.
    # Assumed order: [kernel_hidden, bias_hidden, kernel_output, bias_output]
    weights = trained_agent.model.get_weights()
    w1, b1 = weights[0], weights[1]
    final_w, final_b = weights[2], weights[3]

    # --- Build a sub-model to get hidden activations ---
    # Instead of using trained_agent.model.input (which may not be defined),
    # we create a new input tensor and pass it to our extracted hidden layer.
    input_tensor = keras.Input(shape=(5,))
    hidden_output = hidden_layer(input_tensor)
    hidden_model = keras.Model(inputs=input_tensor, outputs=hidden_output)

    # Create the figure.
    fig, ax = plt.subplots(figsize=(14, 8))
    ax.set_xlim(-1, 7)
    ax.set_ylim(-1, 5)
    ax.axis('off')
    ax.set_aspect('equal')

    # Define node sizes.
    node_radius_input = 0.2
    node_radius_hidden = 0.15   # hidden nodes are drawn a bit smaller.
    node_radius_output = 0.2

    # Get the number of hidden neurons.
    num_hidden = hidden_model.output_shape[-1]

    # Define fixed positions for nodes.
    layer_positions = {
        "input": [(0, 4), (0, 3), (0, 2), (0, 1), (0, 0)],  # five inputs
        "hidden": [(3, i * (4/(num_hidden-1))) for i in range(num_hidden)],
        "output": [(6, 2), (6, 1), (6, 0)]  # three outputs
    }

    # Build the normalized state from current slider values.
    state = np.array([ball_x, ball_y, paddle_y, ball_dx, ball_dy], dtype=np.float32)
    norm_state = trained_agent._normalize_state(state).reshape(1, -1)

    # Get full network prediction.
    probs = trained_agent.model(norm_state, training=False).numpy().flatten()

    # Compute hidden layer activations.
    hidden_activations = hidden_model(norm_state, training=False).numpy().flatten()
    max_act = hidden_activations.max() if hidden_activations.max() > 0 else 1.0
    norm_activations = hidden_activations / max_act  # Normalize to [0, 1]

    # Draw input nodes.
    for pos in layer_positions['input']:
        circle = plt.Circle(pos, node_radius_input, facecolor='lightyellow', ec='k', zorder=5)  # facecolor keeps the black edge
        ax.add_patch(circle)

    # Draw hidden nodes using a blue colormap based on activation.
    cmap = plt.get_cmap("Blues")
    for i, pos in enumerate(layer_positions['hidden']):
        activation = norm_activations[i]
        face_color = cmap(0.3 + 0.7 * activation)  # shift so that even low activations are visible.
        circle = plt.Circle(pos, node_radius_hidden, facecolor=face_color, ec='k', zorder=5)
        ax.add_patch(circle)
        # Optionally, display raw activation value.
        ax.text(pos[0], pos[1], f"{hidden_activations[i]:.2f}",
                fontsize=7, ha='center', va='center', zorder=6)

    # Draw output nodes.
    for pos in layer_positions['output']:
        circle = plt.Circle(pos, node_radius_output, facecolor='lightgreen', ec='k', zorder=5)
        ax.add_patch(circle)

    # Normalize connection line alpha by maximum absolute weight.
    max_weight = max(np.abs(w1).max(), np.abs(final_w).max())

    # Draw connections from input to hidden using w1.
    for i, start_pos in enumerate(layer_positions['input']):
        for j, end_pos in enumerate(layer_positions['hidden']):
            weight = w1[i, j]
            color = 'red' if weight < 0 else 'blue'
            alpha = np.abs(weight) / max_weight
            ax.plot([start_pos[0] + node_radius_input, end_pos[0] - node_radius_hidden],
                    [start_pos[1], end_pos[1]], color=color, alpha=alpha, lw=1)

    # Draw connections from hidden to output using final_w.
    for j, start_pos in enumerate(layer_positions['hidden']):
        for k, end_pos in enumerate(layer_positions['output']):
            weight = final_w[j, k]
            color = 'red' if weight < 0 else 'blue'
            alpha = np.abs(weight) / max_weight
            ax.plot([start_pos[0] + node_radius_hidden, end_pos[0] - node_radius_output],
                    [start_pos[1], end_pos[1]], color=color, alpha=alpha, lw=1)

    # Label the layers.
    ax.text(0, 4.5, "Input Layer\n(Ball X, Ball Y,\nPaddle Y,\nBall DX, Ball DY)",
            ha='center', va='bottom', fontsize=10)
    ax.text(3, 4.5, f"Hidden Layer\n({num_hidden} Neurons)",
            ha='center', va='bottom', fontsize=10)
    ax.text(6, 4.5, "Output Layer\n(Up, Stay, Down)",
            ha='center', va='bottom', fontsize=10)

    # Display network prediction probabilities.
    pred_text = (f"Network Prediction:\n"
                 f"  Up: {probs[0]*100:.1f}%\n"
                 f"  Stay: {probs[1]*100:.1f}%\n"
                 f"  Down: {probs[2]*100:.1f}%")
    ax.text(6, -0.5, pred_text, ha='center', va='top',
            bbox=dict(facecolor='white', alpha=0.9), fontsize=12)

    plt.title("Network Architecture and Hidden Neuron Activations", fontsize=14)
    plt.tight_layout()
    plt.show()

# --- Create slider widgets (ensure that WIDTH, HEIGHT, BALL_SPEED, PADDLE_HEIGHT are defined) ---
slider_ball_x = widgets.FloatSlider(min=0, max=WIDTH, value=WIDTH/2, description="Ball X",
                                    layout=widgets.Layout(width='300px'))
slider_ball_y = widgets.FloatSlider(min=0, max=HEIGHT, value=HEIGHT/2, description="Ball Y",
                                    layout=widgets.Layout(width='300px'))
slider_paddle_y = widgets.FloatSlider(min=0, max=HEIGHT-PADDLE_HEIGHT, value=160, description="Paddle Y",
                                      layout=widgets.Layout(width='300px'))
slider_ball_dx = widgets.FloatSlider(min=-BALL_SPEED, max=BALL_SPEED, value=BALL_SPEED,
                                     description="Ball DX", layout=widgets.Layout(width='300px'))
slider_ball_dy = widgets.FloatSlider(min=-BALL_SPEED, max=BALL_SPEED, value=BALL_SPEED,
                                     description="Ball DY", layout=widgets.Layout(width='300px'))

sliders_box = VBox([slider_ball_x, slider_ball_y, slider_paddle_y, slider_ball_dx, slider_ball_dy])

# --- Create the interactive widget ---
interactive_plot = interactive(visualize_trained_network,
                               ball_x=slider_ball_x,
                               ball_y=slider_ball_y,
                               paddle_y=slider_paddle_y,
                               ball_dx=slider_ball_dx,
                               ball_dy=slider_ball_dy)

display(HBox([sliders_box, interactive_plot.children[-1]]))

# Task II: Can you think of a different reward function/mechanism?

# Resources:

### Backpropagation, step-by-step | DL3, 3Blue1Brown
https://www.youtube.com/watch?v=Ilg3gGewQ5U

### MIT 6.S191 (2024): Reinforcement Learning, Alexander Amini
https://www.youtube.com/watch?v=8JVRbHAVCws&t=1504s

### RLlib: Industry-Grade, Scalable Reinforcement Learning
https://docs.ray.io/en/latest/rllib/index.html

### TensorFlow Playground, an interactive tool for understanding neural networks
https://playground.tensorflow.org/