# Introduction to Reinforcement Learning for Game AI

## What Is Reinforcement Learning?

Imagine teaching someone to play a video game without being able to tell them the rules. You can only give them a thumbs up when they do something good and a thumbs down when they do something bad. Over time, they'd figure out what works and what doesn't through trial and error.

That's essentially what reinforcement learning (RL) is - a way for AI to learn by interacting with an environment and receiving feedback.

### Today we will use Pong as a case study

In our case, we're teaching an AI to play Pong by letting it:
- Try different paddle movements
- See what happens in the game
- Get rewards for hitting the ball
- Get penalties for missing the ball
- Gradually improve its strategy through experience

## The Key Components

Let's break down the essential parts of our reinforcement learning system:

1. **Agent**: The AI that controls the paddle
2. **Environment**: The Pong game
3. **State**: What our agent can observe about the game
   - Ball x-position
   - Ball y-position
   - Paddle y-position
   - Ball x-velocity
   - Ball y-velocity
4. **Actions**: What our agent can do
   - Move paddle up
   - Stay in place
   - Move paddle down
5. **Reward**: The feedback our agent receives
   - Positive reward (+1) for hitting the ball
   - Negative reward (-1) for missing the ball
   - Per-step "shaping" rewards for keeping the paddle close to the ball, to guide learning
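
As a rough illustration, the state, actions, and reward listed above could be written down in Python like this. The names `State`, `Action`, and `sparse_reward` are made up for this sketch and are not used elsewhere in the notebook; the field order and the 0/1/2 action encoding follow the notebook's game code, and the shaping rewards are left out.

```python
from enum import IntEnum
from typing import NamedTuple


class Action(IntEnum):
    """The three things the agent can do (same 0/1/2 encoding as the game code)."""
    UP = 0
    STAY = 1
    DOWN = 2


class State(NamedTuple):
    """What the agent observes, in the order used by the game code."""
    ball_x: float
    ball_y: float
    paddle_y: float
    ball_dx: float
    ball_dy: float


def sparse_reward(hit: bool, missed: bool) -> float:
    """+1 for a hit, -1 for a miss, 0 otherwise (shaping rewards omitted)."""
    if hit:
        return 1.0
    if missed:
        return -1.0
    return 0.0


# Example observation: ball in the middle of a 600x400 field, moving down-right.
s = State(ball_x=300.0, ball_y=200.0, paddle_y=160.0, ball_dx=9.0, ball_dy=9.0)
```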


![RL](https://upload.wikimedia.org/wikipedia/commons/1/1b/Reinforcement_learning_diagram.svg)


## The Learning Loop

Here's how the learning process works:

1. The agent observes the current state of the game
2. Based on this state, it chooses an action (move up, stay, or move down)
3. The game updates (the ball and paddle move)
4. The agent receives a reward
5. The agent observes the new state
6. Repeat until the game ends
7. After the game ends, the agent learns from what happened

This cycle happens over and over - thousands of times - as the agent gradually improves.
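
The loop above can be sketched in a few lines of Python. This is only a skeleton under simplifying assumptions: `toy_step` is a made-up stand-in environment (a counter that ends after five steps) and `random_policy` is a placeholder for the agent, so neither is the Pong code used later in the notebook.

```python
import random


def toy_step(state, action):
    """Hypothetical stand-in environment: a counter that ends after 5 steps."""
    new_state = state + 1
    reward = 1.0 if action == 1 else 0.0   # arbitrary rule, for illustration only
    done = new_state >= 5
    return new_state, reward, done


def random_policy(state):
    """Placeholder agent: ignores the state and picks one of three actions."""
    return random.choice([0, 1, 2])


def run_episode(policy, step_fn, initial_state=0):
    state = initial_state
    trajectory = []
    done = False
    while not done:                                        # step 6: repeat until the game ends
        action = policy(state)                             # steps 1-2: observe, choose an action
        new_state, reward, done = step_fn(state, action)   # steps 3-4: game updates, reward arrives
        trajectory.append((state, action, reward))
        state = new_state                                  # step 5: observe the new state
    return trajectory                                      # step 7: the agent would learn from this


episode = run_episode(random_policy, toy_step)
```

The returned list of `(state, action, reward)` tuples is the raw material a learning algorithm works with once the game is over.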

## BUT HOW DO WE DO THIS?

There are many ways to do reinforcement learning, and the main difference between them is the algorithm used for training.

- Do we know how to calculate the rewards?
- Or the expected rewards for all possible actions?
- Is it even possible?
- What is the thing that learns? A genetic algorithm? A neural network? ...

![banner](https://i.imgur.com/SlupuVC.jpeg)

# BEHOLD AN ARTIFICIAL NEURON!!

In [1]:
import numpy as np
from ipywidgets import interact, FloatSlider
import matplotlib.pyplot as plt

def plot_neuron(input_value=1.0, weight=1.0, bias=0.0):
    # Compute output using the neuron formula
    output = input_value * weight + bias
    
    # Create figure
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.set_xlim(-1.5, 5.5)
    ax.set_ylim(-2, 4.5)
    ax.axis('off')
    ax.set_aspect('equal', 'box') # Ensure circles are not squished

    # Add formula text
    formula_text = f"Formula: output = (input × weight) + bias\n" \
                   f"         = ({input_value:.1f} × {weight:.1f}) + {bias:.1f}\n" \
                   f"         = {output:.2f}"
    ax.text(2.5, 4.2, formula_text, ha='center', va='top', fontsize=12, 
            bbox=dict(facecolor='white', alpha=0.9))

    # Draw input node
    ax.text(-1.2, 1.3, f"Input\n{input_value:0.2f}", fontsize=12, ha="center")
    plt.plot([-1.4, -0.7], [1.0, 1.0], color='gray', lw=2, linestyle='--')
    
    # Draw neuron
    circle = plt.Circle((2.0, 1.0), 1, color='skyblue', ec='k', zorder=2)
    ax.add_patch(circle)
    ax.text(2.0, 1.0, "Neuron", fontsize=12, ha="center", va="center")
    
    # Draw bias
    ax.annotate("", xy=(2.0, 2), xytext=(2.0, 2.7),
                arrowprops=dict(arrowstyle="->", color="red", lw=2))
    ax.text(2.0, 2.8, f"Bias: {bias:0.2f}", color="red", ha="center", fontsize=12)
    
    # Draw weight
    ax.annotate("", xy=(1.0, 1.0), xytext=(-0.7, 1.0),
                arrowprops=dict(arrowstyle="->", color="blue", lw=2))
    ax.text(-0.1, 1.1, f"Weight: {weight:0.2f}", color="blue", ha="center", fontsize=12)
    
    # Draw output
    ax.annotate("", xy=(3, 1.0), xytext=(5.2, 1.0),
                arrowprops=dict(arrowstyle="<-", color="green", lw=2))
    ax.text(4.2, 1.1, f"Output: {output:0.2f}", color="green", ha="center", fontsize=12)
    
    ax.set_title("Single Neuron with Linear Activation", fontsize=16)
    plt.show()

# Create interactive widget
interact(plot_neuron,
         input_value=FloatSlider(min=-5, max=5, step=0.1, value=1.0, description="Input"),
         weight=FloatSlider(min=-5, max=5, step=0.1, value=1.0, description="Weight"),
         bias=FloatSlider(min=-5, max=5, step=0.1, value=0.0, description="Bias"))


# A network

Many of these neurons are combined into a network (large modern AI systems have billions of weights). They are usually arranged in layers that connect to each other: each neuron multiplies its inputs by weights, adds them up, and sends its output forward to the neurons in the next layer.
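
Written with matrices, a layer is just a matrix product. Here is a minimal NumPy sketch of a tiny network with two inputs, two hidden neurons, and one output, with no bias and no activation function. The names `forward`, `w_hidden`, and `w_out` and all the numbers are illustrative choices for this sketch.

```python
import numpy as np


def forward(x, w_hidden, w_out):
    """Forward pass of a tiny 2-2-1 network (no bias, no activation)."""
    h = x @ w_hidden      # hidden layer: one weighted sum of the inputs per hidden neuron
    return h @ w_out      # output: weighted sum of the hidden values


x = np.array([1.0, 2.0])                 # inputs x1, x2
w_hidden = np.array([[0.5, -1.0],
                     [1.5,  2.0]])       # rows: inputs, columns: hidden neurons
w_out = np.array([1.0, 0.5])             # one weight per hidden neuron

y = forward(x, w_hidden, w_out)
print(y)  # hidden values are [3.5, 3.0], so the output is 3.5 * 1.0 + 3.0 * 0.5 = 5.0
```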

In [2]:
# Imports
from ipywidgets import FloatSlider, VBox, HBox, interactive_output
from IPython.display import display

def plot_network(x1, x2, w1_00, w1_10, w1_01, w1_11, w2_0, w2_1):
    # Compute the forward pass
    h1 = x1 * w1_00 + x2 * w1_10
    h2 = x1 * w1_01 + x2 * w1_11
    output = h1 * w2_0 + h2 * w2_1

    # Set up the figure
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.set_xlim(-1.5, 5)
    ax.set_ylim(-1.5, 3)
    ax.axis('off') 
    ax.set_aspect('equal', 'box') # Ensure circles are not squished

    # Define positions for nodes in each layer
    positions = {
        "x1": (0, 1.5),
        "x2": (0, 0.5),
        "h1": (2, 1.5),
        "h2": (2, 0.5),
        "output": (4, 1)
    }

    # Function to draw each node: a circle, with the label and node value inside
    def draw_node(pos, value, label, color):
        circle = plt.Circle(pos, 0.2, color=color, ec='k', zorder=5)
        ax.add_patch(circle)
        ax.text(pos[0], pos[1], f"{label}\n{value:.2f}", 
                ha='center', va='center', fontsize=10, zorder=6)

    # Draw nodes for each layer
    draw_node(positions["x1"], x1, "x₁", 'lightyellow')
    draw_node(positions["x2"], x2, "x₂", 'lightyellow')
    draw_node(positions["h1"], h1, "h₁", 'skyblue')
    draw_node(positions["h2"], h2, "h₂", 'skyblue')
    draw_node(positions["output"], output, "ŷ", 'lightgreen')

    # Draw layer labels above the nodes
    ax.text(positions["x1"][0], positions["x1"][1] + 0.6, "Input Layer",
            ha='center', va='center', fontsize=12, fontweight='bold')
    ax.text(positions["h1"][0], positions["h1"][1] + 0.6, "Hidden Layer",
            ha='center', va='center', fontsize=12, fontweight='bold')
    ax.text(positions["output"][0], positions["output"][1] + 0.6, "Output Layer",
            ha='center', va='center', fontsize=12, fontweight='bold')

    # Also add a clear summary of input and output values on the sides
    ax.text(-1.3, 1, f"Inputs:\n x₁ = {x1:.2f}\n x₂ = {x2:.2f}",
            fontsize=11, ha='center', va='center',
            bbox=dict(facecolor='white', alpha=0.9, edgecolor='gray'))
    ax.text(4.8, 1, f"Output:\n ŷ = {output:.2f}",
            fontsize=11, ha='center', va='center',
            bbox=dict(facecolor='white', alpha=0.9, edgecolor='gray'))

    # Define the connections with their labels. Each connection is a tuple:
    # (start_node, end_node, current weight value, weight label)
    connections = [
        ("x1", "h1", w1_00, "w₁₀₀"),
        ("x2", "h1", w1_10, "w₁₁₀"),
        ("x1", "h2", w1_01, "w₁₀₁"),
        ("x2", "h2", w1_11, "w₁₁₁"),
        ("h1", "output", w2_0, "w₂₀"),
        ("h2", "output", w2_1, "w₂₁"),
    ]
    
    # Function to draw an arrow (connection) with the connection label and weight value
    def draw_arrow(start, end, weight, wt_label):
        start_pos = np.array(positions[start])
        end_pos = np.array(positions[end])
        vector = end_pos - start_pos
        length = np.linalg.norm(vector)
        direction = vector / length
        
        # Adjust start and end positions so the arrow doesn't overlap the node circles
        start_adjust = start_pos + direction * 0.25
        end_adjust = end_pos - direction * 0.25
        
        # Draw arrow between nodes
        ax.annotate("",
                    xy=end_adjust,
                    xytext=start_adjust,
                    arrowprops=dict(arrowstyle="->", color="gray", lw=1.5),
                    zorder=3)
        # Place a label for the connection: show the weight variable and value
        midpoint = (start_adjust + end_adjust) / 2.0
        # Use a slight offset for clarity
        offset = np.array([0.0, 0.15])
        ax.text(midpoint[0] + offset[0], midpoint[1] + offset[1],
                f"{wt_label}\n{weight:.2f}", fontsize=9, color="red",
                ha='center', va='center', bbox=dict(facecolor='white', alpha=0.8, edgecolor='none'))

    # Draw all connection arrows with labels
    for start, end, weight, wt_label in connections:
        draw_arrow(start, end, weight, wt_label)

    # Place an explanation text block on the upper right, if desired
    explanation_text = (
        "Feedforward Computation:\n"
        "1. Inputs x₁ and x₂ are each multiplied by their connection weights.\n"
        "2. Hidden neurons sum these weighted inputs (h₁, h₂).\n"
        "3. Hidden outputs are multiplied by output weights and summed to form ŷ."
    )
    ax.text(4.2, 2.7, explanation_text, fontsize=10,
            bbox=dict(facecolor='white', edgecolor='gray', alpha=0.8),
            ha='left', va='top')

    plt.show()

### Create Interactive Widgets ###

# Input sliders for x1 and x2
slider_x1 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="x₁")
slider_x2 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="x₂")

# Sliders for weights connecting inputs to the hidden layer
slider_w1_00 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="w₁₀₀")
slider_w1_10 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="w₁₁₀")
slider_w1_01 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="w₁₀₁")
slider_w1_11 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="w₁₁₁")

# Sliders for weights connecting the hidden layer to the output
slider_w2_0 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="w₂₀")
slider_w2_1 = FloatSlider(min=-2, max=2, step=0.1, value=1.0, description="w₂₁")

# Organize the slider layout
inputs_box = HBox([slider_x1, slider_x2])
weights_input_hidden = HBox([slider_w1_00, slider_w1_10, slider_w1_01, slider_w1_11])
weights_hidden_output = HBox([slider_w2_0, slider_w2_1])
ui = VBox([inputs_box, weights_input_hidden, weights_hidden_output])

# Set up the interactive output
out = interactive_output(plot_network, {
    "x1": slider_x1,
    "x2": slider_x2,
    "w1_00": slider_w1_00,
    "w1_10": slider_w1_10,
    "w1_01": slider_w1_01,
    "w1_11": slider_w1_11,
    "w2_0": slider_w2_0,
    "w2_1": slider_w2_1,
})

# Display the interactive UI and plot
display(ui, out)


## Training vs. Playing: Two Different Modes

It's important to understand the two modes of our agent:

### Training Mode
- Agent samples each action from the probabilities output by the network, so early on, with an untrained network, its choices are essentially random
- It records everything that happens (states, actions, rewards)
- After each game, it updates its neural network to improve
- This involves exploration (trying new things)

### Playing Mode
- Agent always chooses the action with the highest probability
- No more randomness or exploration
- No more learning or updates
- Just using what it has learned

We spend most of our time in training mode, then switch to playing mode when the agent is ready.
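
The difference between the two modes comes down to how an action is picked from the network's output probabilities. A small NumPy sketch, where the probabilities and the helper names `training_action` and `playing_action` are made up for illustration:

```python
import numpy as np

probs = np.array([0.3, 0.5, 0.2])   # example probabilities for UP, STAY, DOWN
rng = np.random.default_rng(0)      # seeded so the sketch is repeatable


def training_action(probs, rng):
    """Training mode: sample an action, so less likely actions still get tried."""
    return int(rng.choice(len(probs), p=probs))


def playing_action(probs):
    """Playing mode: always take the most probable action."""
    return int(np.argmax(probs))


samples = [training_action(probs, rng) for _ in range(1000)]
counts = np.bincount(samples, minlength=3)   # how often each action was sampled

print(playing_action(probs))  # 1 (STAY), every time
```

In training mode, `counts` ends up roughly proportional to the probabilities, which is the exploration described above.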

## The REINFORCE Algorithm: Learning from Success and Failure

Now let's understand how our agent actually learns. We're using an algorithm called REINFORCE, which we'll explain step-by-step:

### Step 1: Play a Complete Game

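The agent plays one full game (an "episode") of Pong, which ends when it misses the ball (game over) or, in the training code in this notebook, when a cap of 500 steps is reached. During this game, we record: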
- Each state it observed
- Each action it took
- Each reward it received

Let's say our agent played a game that lasted 50 moves before missing the ball. We now have 50 (state, action, reward) tuples stored in memory.

### Step 2: Calculate the "Returns"

We need to know which actions were actually good in the long run. This is tricky because sometimes an action might look good immediately but lead to failure later.

To solve this, we calculate the "return" for each step - essentially the total future reward from that point onwards, with future rewards discounted (valued less than immediate rewards).

For each step t, we calculate:
```
Return(t) = Reward(t) + gamma * Reward(t+1) + gamma² * Reward(t+2) + ...
```

Where `gamma` is a number between 0 and 1 that determines how much we care about future rewards.

#### Example:
If our rewards were [0, 0, 0, 1, 0, 0, -1] (steps numbered from 0) and gamma is 0.9, we work backwards from the last step:
- Return at step 6 = -1
- Return at step 5 = 0 + 0.9 * (-1) = -0.9
- Return at step 4 = 0 + 0.9 * (-0.9) = -0.81
- Return at step 3 = 1 + 0.9 * (-0.81) = 0.271
- ...and so on

This gives us a better measure of how good each action really was.
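
The backward calculation in the example can be written as a short function. This is a plain-Python sketch; the name `discounted_returns` is chosen here for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return(t) = Reward(t) + gamma * Return(t+1), computed from the last step backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


returns = discounted_returns([0, 0, 0, 1, 0, 0, -1], gamma=0.9)
print([round(g, 4) for g in returns])
# [0.1976, 0.2195, 0.2439, 0.271, -0.81, -0.9, -1.0]
```

Notice that the steps leading up to the hit at step 3 get positive returns, while the steps just before the miss get negative ones.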



### Step 3: Update the Policy Network

Now comes the crucial part - we need to adjust our neural network to make good actions more likely and bad actions less likely in the future.

For each (state, action, return) tuple:
1. Feed the state into the network to get the current probabilities
2. Increase the probability of the action taken if the return was positive
3. Decrease the probability of the action taken if the return was negative

#### How Weights Actually Change

This is where we need to understand how neural networks learn:

1. Each connection in our neural network has a "weight" - just a number that determines how strong that connection is.
2. These weights determine the final probabilities output by the network.
3. To make an action more likely, we need to adjust the weights that led to that action.

Let's break this down with a simple example:

Imagine our network gave these probabilities for a particular state:
- UP: 30%
- STAY: 50%
- DOWN: 20%

The agent selected STAY (based on these probabilities), and this eventually led to a positive return of 0.8.

We want to adjust our network to make STAY even more likely in this situation next time. The math works out such that:
- Weights that contributed to the STAY probability get increased
- The larger the return (0.8 in this case), the larger the increase
- Weights that didn't contribute to STAY don't change much, although UP and DOWN together become less likely because the three probabilities must sum to 100%


The technical term for this process is "gradient ascent on the policy parameters" - but you can think of it as "tweak the weights to make good actions more likely."
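
To see the direction of the update with actual numbers, here is a simplified NumPy sketch of one gradient-ascent step for the STAY example. It makes two assumptions that differ from a real network: it nudges the three pre-softmax scores ("logits") directly instead of the network weights, and it uses a made-up learning rate of 0.5.

```python
import numpy as np


def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


# Scores chosen so that softmax gives UP 30%, STAY 50%, DOWN 20%.
logits = np.log(np.array([0.3, 0.5, 0.2]))
probs = softmax(logits)

action = 1           # the agent picked STAY
ret = 0.8            # return that followed
learning_rate = 0.5  # hypothetical step size

# For a softmax policy, the gradient of log(prob of the chosen action)
# with respect to the scores is: one_hot(action) - probs.
grad_log_prob = np.eye(3)[action] - probs

# Gradient ascent: move the scores in the direction that makes the action
# more likely, scaled by how good the outcome was.
new_logits = logits + learning_rate * ret * grad_log_prob
new_probs = softmax(new_logits)

print(np.round(probs, 3), "->", np.round(new_probs, 3))
```

STAY becomes more likely and the other two actions less likely. With a negative return, the same formula would push the probability of STAY down instead.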

In [3]:
# Global settings.
WIDTH = 600.0
HEIGHT = 400.0
PADDLE_WIDTH = 20.0
PADDLE_HEIGHT = 80.0
BALL_RADIUS = 10.0
BALL_SPEED = 9.0
PADDLE_MOVE_SPEED = 9.0

In [11]:
import time
import random
import threading
import numpy as np
from ipycanvas import Canvas
import ipywidgets as widgets
from ipyevents import Event
from IPython.display import display

def pong_step(state, action):
    """
    Update ball and right-paddle state.
    
    state: [ball_x, ball_y, paddle_y, ball_dx, ball_dy]
    action (for right paddle): 0 = up, 1 = none, 2 = down.
    
    Returns a list of native Python floats.
    """
    ball_x, ball_y, paddle_y, ball_dx, ball_dy = state

    # Update AI paddle position (right paddle) based on action.
    if action == 0:
        paddle_y = max(0.0, paddle_y - PADDLE_MOVE_SPEED)
    elif action == 2:
        paddle_y = min(HEIGHT - PADDLE_HEIGHT, paddle_y + PADDLE_MOVE_SPEED)

    # Update ball position.
    ball_x += ball_dx
    ball_y += ball_dy

    # Bounce off top and bottom walls.
    if ball_y - BALL_RADIUS < 0:
        ball_y = BALL_RADIUS
        ball_dy = abs(ball_dy)
    elif ball_y + BALL_RADIUS > HEIGHT:
        ball_y = HEIGHT - BALL_RADIUS
        ball_dy = -abs(ball_dy)

    # Handle collision with the right (AI) paddle.
    if ball_x + BALL_RADIUS >= WIDTH - PADDLE_WIDTH:
        # If ball hits paddle, bounce back.
        if paddle_y <= ball_y <= (paddle_y + PADDLE_HEIGHT):
            ball_x = WIDTH - PADDLE_WIDTH - BALL_RADIUS
            ball_dx = -abs(ball_dx)
        else:
            # If the paddle missed, reset the ball and the right paddle.
            ball_x = WIDTH / 2.0
            ball_y = random.uniform(BALL_RADIUS, HEIGHT - BALL_RADIUS)
            ball_dx = BALL_SPEED
            ball_dy = BALL_SPEED
            paddle_y = (HEIGHT - PADDLE_HEIGHT) / 2.0

    # (Left wall collision handled in game loop)
    return [float(ball_x), float(ball_y), float(paddle_y), float(ball_dx), float(ball_dy)]


class PongGame:
    def __init__(self, ai_function):
        """
        Initialize the Pong game.
        
        ai_function(ball_x, ball_y, paddle_y, ball_dx, ball_dy)
            should return one of: "up", "none", or "down" for the right paddle.
        """
        # Left (player) paddle and ball state.
        self.left_paddle_y = (HEIGHT - PADDLE_HEIGHT) / 2.0
        self.right_paddle_y = (HEIGHT - PADDLE_HEIGHT) / 2.0
        self.ball_x = WIDTH / 2.0
        self.ball_y = random.uniform(BALL_RADIUS, HEIGHT - BALL_RADIUS)
        self.ball_dx = BALL_SPEED
        self.ball_dy = BALL_SPEED

        self.ai_function = ai_function

        # Movement flags for the left paddle.
        self.left_up_active = False
        self.left_down_active = False

        self.running = False

        self._create_widgets()

    def _create_widgets(self):
        # Create the game canvas.
        self.canvas = Canvas(width=WIDTH, height=HEIGHT)
        display(self.canvas)
        
        # Create control buttons.
        self.btn_left_up = widgets.Button(
            description="Left UP", layout=widgets.Layout(width='80px'))
        self.btn_left_down = widgets.Button(
            description="Left DOWN", layout=widgets.Layout(width='80px'))
        self.btn_stop = widgets.Button(
            description="STOP GAME", layout=widgets.Layout(width='100px', height='40px'),
            button_style='danger')
        
        # Set up ipyevents on the left paddle buttons for mousedown/up/leave.
        event_up = Event(source=self.btn_left_up, watched_events=['mousedown', 'mouseup', 'mouseleave'])
        event_up.on_dom_event(self._handle_left_up)
        event_down = Event(source=self.btn_left_down, watched_events=['mousedown', 'mouseup', 'mouseleave'])
        event_down.on_dom_event(self._handle_left_down)

        # Stop button uses normal on_click.
        self.btn_stop.on_click(self._stop_game)
        
        # Display control buttons.
        controls = widgets.VBox([widgets.HBox([self.btn_left_up, self.btn_left_down]), self.btn_stop])
        display(controls)
    
    def _handle_left_up(self, event):
        # When the up button is pressed, set the flag; released/leave clears it.
        if event['type'] == 'mousedown':
            self.left_up_active = True
        elif event['type'] in ['mouseup', 'mouseleave']:
            self.left_up_active = False
    
    def _handle_left_down(self, event):
        # When the down button is pressed, set the flag; released/leave clears it.
        if event['type'] == 'mousedown':
            self.left_down_active = True
        elif event['type'] in ['mouseup', 'mouseleave']:
            self.left_down_active = False
    
    def _draw(self):
        # Clear the canvas and redraw the game objects.
        self.canvas.clear()

        # Draw background first so it does not paint over the other objects.
        self.canvas.fill_style = 'white'
        self.canvas.fill_rect(0, 0, WIDTH, HEIGHT)
        
        # Draw left (player) paddle.
        self.canvas.fill_style = 'blue'
        self.canvas.fill_rect(0, self.left_paddle_y, PADDLE_WIDTH, PADDLE_HEIGHT)
        
        # Draw right (AI) paddle.
        self.canvas.fill_style = 'red'
        self.canvas.fill_rect(WIDTH - PADDLE_WIDTH, self.right_paddle_y, PADDLE_WIDTH, PADDLE_HEIGHT)
        
        # Draw ball.
        self.canvas.fill_style = 'black'
        self.canvas.fill_circle(self.ball_x, self.ball_y, BALL_RADIUS)
    
    def _reset_ball(self):
        # Reset the ball to the center with a random vertical position.
        self.ball_x = WIDTH / 2.0
        self.ball_y = random.uniform(BALL_RADIUS, HEIGHT - BALL_RADIUS)
        self.ball_dx = BALL_SPEED
        self.ball_dy = BALL_SPEED
        self.right_paddle_y = (HEIGHT - PADDLE_HEIGHT) / 2.0

    def game_loop(self):
        fps_delay = 1.0 / 30.0  # approximately 30 FPS
        mapping = {"up": 0, "none": 1, "down": 2}
        while self.running:
            # Move the left paddle based on button flags.
            if self.left_up_active:
                self.left_paddle_y = max(0.0, self.left_paddle_y - PADDLE_MOVE_SPEED)
            if self.left_down_active:
                self.left_paddle_y = min(HEIGHT - PADDLE_HEIGHT, self.left_paddle_y + PADDLE_MOVE_SPEED)
            
            # Build the game state for the ball and right paddle.
            state = [self.ball_x, self.ball_y, self.right_paddle_y, self.ball_dx, self.ball_dy]

            # Get the AI action for the right paddle.
            ai_action = self.ai_function(self.ball_x, self.ball_y, self.right_paddle_y, self.ball_dx, self.ball_dy)
            action_int = mapping.get(ai_action, 1)

            # Update ball position and the right paddle using pong_step.
            new_state = pong_step(state, action_int)
            self.ball_x, self.ball_y, self.right_paddle_y, self.ball_dx, self.ball_dy = new_state
            
            # Check collision with the left (player) paddle.
            if self.ball_x - BALL_RADIUS <= PADDLE_WIDTH:
                if self.left_paddle_y <= self.ball_y <= (self.left_paddle_y + PADDLE_HEIGHT):
                    # Bounce the ball off the player's paddle.
                    self.ball_x = PADDLE_WIDTH + BALL_RADIUS
                    self.ball_dx = abs(self.ball_dx)
                else:
                    # The player missed: reset the ball.
                    self._reset_ball()
            
            self._draw()
            time.sleep(fps_delay)
    
    def start(self):
        self.running = True
        # Run the game loop in a separate thread to free the UI thread.
        self.thread = threading.Thread(target=self.game_loop, daemon=True)
        self.thread.start()
    
    def _stop_game(self, _):
        self.running = False
        self.btn_stop.description = "Stopped"
        self.btn_stop.disabled = True
        self.left_up_active = False
        self.left_down_active = False

def start_game(ai_function):
    """
    Initialize and start the Pong game.
    
    Provide an ai_function(ball_x, ball_y, paddle_y, ball_dx, ball_dy)
    that returns "up", "none", or "down" for controlling the right paddle.
    """
    game = PongGame(ai_function)
    game.start()
    return game


In [12]:
# This is how we use it.

# --- Example AI Function ---
def simple_ai(ball_x, ball_y, paddle_y, ball_dx, ball_dy):
    """
    A basic AI: move the paddle up or down so that its center follows the ball.
    """
    paddle_center = paddle_y + PADDLE_HEIGHT / 2.0  # vertical center of the paddle
    if ball_y < paddle_center:  
        return "up"
    elif ball_y > paddle_center:
        return "down"
    else:
        return "none"


# --- Start the Game ---
# Pass the AI function you want to use.
start_game(simple_ai)


In [19]:
# %% Cell B: RL Training Environment
import time
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def rl_step(state, action):
    """
    Update the physics and compute reward and done.
    State: [ball_x, ball_y, paddle_y, ball_dx, ball_dy]
    Action: 0 = up, 1 = no move, 2 = down.
    
    Reward shaping and done determination are present:
      - A reward of +1 (plus some shaping) is given upon a paddle hit.
      - A reward of -1 is given and done=True when the paddle misses.
    """
    ball_x, ball_y, paddle_y, ball_dx, ball_dy = state

    # Update right paddle.
    if action == 0:
        paddle_y = max(0.0, paddle_y - PADDLE_MOVE_SPEED)
    elif action == 2:
        paddle_y = min(HEIGHT - PADDLE_HEIGHT, paddle_y + PADDLE_MOVE_SPEED)
    
    # Update ball position.
    ball_x += ball_dx
    ball_y += ball_dy
    
    # Bounce off top and bottom.
    if ball_y - BALL_RADIUS < 0:
        ball_y = BALL_RADIUS
        ball_dy = abs(ball_dy)
    if ball_y + BALL_RADIUS > HEIGHT:
        ball_y = HEIGHT - BALL_RADIUS
        ball_dy = -abs(ball_dy)
    
    # Reward shaping: reward is a function of how close the paddle is to ball center.
    paddle_center = paddle_y + PADDLE_HEIGHT / 2.0
    shaping_factor = 0.8
    shaping_reward = (1 - abs(paddle_center - ball_y)/HEIGHT) * shaping_factor
    if action == 1:
        shaping_reward -= 0.1  # encourage movement.
    reward = shaping_reward 

    done = False
    
    # Collision with right paddle.
    if ball_x + BALL_RADIUS >= WIDTH - PADDLE_WIDTH:
        if paddle_y <= ball_y <= (paddle_y + PADDLE_HEIGHT):
            # Successful hit.
            reward = shaping_reward + 1.0
            ball_x = WIDTH - PADDLE_WIDTH - BALL_RADIUS  # reposition
            ball_dx = -abs(ball_dx)
        else:
            reward = shaping_reward - 1.0
            done = True

    # Bounce off left wall.
    if ball_x - BALL_RADIUS <= 0:
        ball_x = BALL_RADIUS
        ball_dx = abs(ball_dx)

    new_state = np.array([ball_x, ball_y, paddle_y, ball_dx, ball_dy], dtype=np.float32)
    return new_state, reward, done

# --------------------------------------------------
# RL Agent (Policy Network using REINFORCE)
# --------------------------------------------------
class RLAgent:
    def __init__(self, learning_rate=5e-3, gamma=0.76):
        self.gamma = gamma
        
        self.model = keras.Sequential([
            layers.Input(shape=(5,)),
            layers.Dense(8, activation='relu'),
            layers.Dense(3, activation='softmax')
        ])
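        # Note: `optimizers.legacy` only exists in some TensorFlow 2.x releases; on versions
        # without it, `tf.keras.optimizers.Adam(learning_rate)` is the usual replacement.
        self.optimizer = tf.keras.optimizers.legacy.Adam(learning_rate)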
        
        # Buffers to store transitions.
        self.states = []
        self.actions = []
        self.rewards = []
    
    def _normalize_state(self, state):
        # Normalize each component for training.
        return np.array([
            state[0] / WIDTH,
            state[1] / HEIGHT,
            state[2] / HEIGHT,
            state[3] / BALL_SPEED,
            state[4] / BALL_SPEED,
        ], dtype=np.float32)
    
    def choose_action(self, state):
        norm_state = self._normalize_state(state).reshape(1, -1)
        probs = self.model(norm_state).numpy().flatten()
        action = np.random.choice(3, p=probs)
        self.states.append(norm_state)
        self.actions.append(action)
        return action
    
    def store_reward(self, reward):
        self.rewards.append(reward)
    
    def finish_episode(self):
        """Use a REINFORCE update."""
        discounted = np.zeros_like(self.rewards, dtype=np.float32)
        cumulative = 0.0
        for i in reversed(range(len(self.rewards))):
            cumulative = self.rewards[i] + self.gamma * cumulative
            discounted[i] = cumulative

        baseline = np.mean(discounted)
        discounted -= baseline
        
        states = np.concatenate(self.states, axis=0)
        actions = np.array(self.actions)
        rewards = discounted
        
        with tf.GradientTape() as tape:
            probs = self.model(states, training=True)
            action_mask = tf.one_hot(actions, 3)
            log_probs = tf.math.log(tf.reduce_sum(probs * action_mask, axis=1) + 1e-8)
            loss = -tf.reduce_mean(log_probs * rewards)
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
        
        # Clear the buffers.
        self.states, self.actions, self.rewards = [], [], []
    
    def get_action(self, state):
        norm_state = self._normalize_state(state).reshape(1, -1)
        probs = self.model(norm_state).numpy().flatten()
        return np.argmax(probs)

def train_agent(num_episodes=1000):
    agent = RLAgent()
    total_rewards = []

    max_steps_reached = 0
    for i in range(num_episodes):
        # Randomize initial conditions for each episode.
        ball_y_random = np.random.uniform(BALL_RADIUS, HEIGHT - BALL_RADIUS)
        paddle_y_random = np.random.uniform(0, HEIGHT - PADDLE_HEIGHT)
        
        state = np.array([
            WIDTH / 2.0, 
            ball_y_random,
            paddle_y_random,
            BALL_SPEED,
            BALL_SPEED
        ], dtype=np.float32)
        
        episode_reward = 0.0
        done = False
        step = 0
        
        max_steps = 500  # or set any desired number of iterations
        while not done and step < max_steps:
            action = agent.choose_action(state)
            state, reward, done = rl_step(state, action)
            agent.store_reward(reward)
            episode_reward += reward
            step += 1
        
        agent.finish_episode()
        total_rewards.append(episode_reward)
        

        max_steps_reached = max(max_steps_reached, step)
        
        if (i+1) % 100 == 0:
            print(f"Episode {i+1}/{num_episodes}: Steps= {step}, Total Reward= {episode_reward:.2f}, Max Steps reached= {max_steps_reached}")
            max_steps_reached = 0
    
    return agent

# Train the agent.
trained_agent = train_agent(num_episodes=500)

# Wrap the trained agent into an AI function for gameplay.
def trained_ai_function(ball_x, ball_y, paddle_y, ball_dx, ball_dy):
    state = np.array([ball_x, ball_y, paddle_y, ball_dx, ball_dy], dtype=np.float32)
    action_idx = trained_agent.get_action(state)
    mapping = {0: "up", 1: "none", 2: "down"}
    return mapping[action_idx]


Episode 100/500: Steps= 30, Total Reward= 11.53, Max Steps reached= 408
Episode 200/500: Steps= 30, Total Reward= 17.86, Max Steps reached= 282
Episode 300/500: Steps= 30, Total Reward= 13.06, Max Steps reached= 156
Episode 400/500: Steps= 156, Total Reward= 77.11, Max Steps reached= 156
Episode 500/500: Steps= 30, Total Reward= 10.53, Max Steps reached= 408


In [7]:
# Save the trained agent.
# RLAgent itself has no save(); the Keras network lives in trained_agent.model.
#trained_agent.model.save("trained_pong_agent")

In [8]:
# Read the trained agent: create an RLAgent and swap in the saved network.
#trained_agent = RLAgent()
#trained_agent.model = keras.models.load_model("trained_pong_agent")
#def trained_ai_function(ball_x, ball_y, paddle_y, ball_dx, ball_dy):
#    state = np.array([ball_x, ball_y, paddle_y, ball_dx, ball_dy], dtype=np.float32)
#    action_idx = trained_agent.get_action(state)  # normalizes the state, as in training
#    mapping = {0: "up", 1: "none", 2: "down"}
#    return mapping[action_idx]

In [20]:
start_game(trained_ai_function)


In [21]:
#@markdown Run to visualize the full trained network

import matplotlib.pyplot as plt
import numpy as np
import ipywidgets as widgets
from ipywidgets import interactive, HBox, VBox
from IPython.display import display
import tensorflow as tf
from tensorflow import keras

# --- Ensure that your trained model is built ---
# (This dummy call forces the model’s graph to be built.)
_dummy = np.zeros((1, 5), dtype=np.float32)
_ = trained_agent.model(_dummy)

# --- Determine the hidden dense layer ---
# Depending on your Keras version, the explicit Input layer might not appear in model.layers.
# If the RLAgent model keeps all three layers [Input, Dense, Dense], then
#    model.layers[0] is the InputLayer and model.layers[1] is Dense(8);
# in Keras 3, however, the InputLayer is often omitted from model.layers.
#
# Check the number of layers and adjust accordingly:
if len(trained_agent.model.layers) == 2:
    # Only the Dense layers are present.
    hidden_layer = trained_agent.model.layers[0]  # Dense(8)
elif len(trained_agent.model.layers) >= 3:
    # If the Input layer is included.
    hidden_layer = trained_agent.model.layers[1]  # Dense(8)
else:
    hidden_layer = trained_agent.model.layers[0]  # Fallback

print("Extracting hidden layer:", hidden_layer.name)

def visualize_trained_network(ball_x, ball_y, paddle_y, ball_dx, ball_dy):
    # Retrieve network weights.
    # Assumed order: [kernel_hidden, bias_hidden, kernel_output, bias_output]
    weights = trained_agent.model.get_weights()
    w1, b1 = weights[0], weights[1]
    final_w, final_b = weights[2], weights[3]

    # --- Build a sub-model to get hidden activations ---
    # Instead of using trained_agent.model.input (which may not be defined),
    # we create a new input tensor and pass it to our extracted hidden layer.
    input_tensor = keras.Input(shape=(5,))
    hidden_output = hidden_layer(input_tensor)
    hidden_model = keras.Model(inputs=input_tensor, outputs=hidden_output)

    # Create the figure.
    fig, ax = plt.subplots(figsize=(14, 8))
    ax.set_xlim(-1, 7)
    ax.set_ylim(-1, 5)
    ax.axis('off')
    ax.set_aspect('equal')

    # Define node sizes.
    node_radius_input = 0.2
    node_radius_hidden = 0.15   # hidden nodes are drawn a bit smaller.
    node_radius_output = 0.2

    # Get the number of hidden neurons.
    num_hidden = hidden_model.output_shape[-1]

    # Define fixed positions for nodes.
    layer_positions = {
        "input": [(0, 4), (0, 3), (0, 2), (0, 1), (0, 0)],  # five inputs
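        # Spread hidden nodes over y in [0, 4]; max(..., 1) avoids dividing by zero for a single neuron.
        "hidden": [(3, i * (4 / max(num_hidden - 1, 1))) for i in range(num_hidden)],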
        "output": [(6, 2), (6, 1), (6, 0)]  # three outputs
    }

    # Build the normalized state from current slider values.
    state = np.array([ball_x, ball_y, paddle_y, ball_dx, ball_dy], dtype=np.float32)
    norm_state = trained_agent._normalize_state(state).reshape(1, -1)

    # Get full network prediction.
    probs = trained_agent.model(norm_state, training=False).numpy().flatten()

    # Compute hidden layer activations.
    hidden_activations = hidden_model(norm_state, training=False).numpy().flatten()
    max_act = hidden_activations.max() if hidden_activations.max() > 0 else 1.0
    # Scale to [0, 1]; clipping covers activations that can go negative (e.g. tanh).
    norm_activations = np.clip(hidden_activations / max_act, 0.0, 1.0)

    # Draw input nodes.
    for pos in layer_positions['input']:
        # Use facecolor rather than color: color would override the black edge (ec).
        circle = plt.Circle(pos, node_radius_input, facecolor='lightyellow', ec='k', zorder=5)
        ax.add_patch(circle)

    # Draw hidden nodes using a blue colormap based on activation.
    cmap = plt.get_cmap("Blues")
    for i, pos in enumerate(layer_positions['hidden']):
        activation = norm_activations[i]
        face_color = cmap(0.3 + 0.7 * activation)  # shift so that even low activations are visible.
        # Use facecolor rather than color: color would override the black edge (ec).
        circle = plt.Circle(pos, node_radius_hidden, facecolor=face_color, ec='k', zorder=5)
        ax.add_patch(circle)
        # Optionally, display raw activation value.
        ax.text(pos[0], pos[1], f"{hidden_activations[i]:.2f}",
                fontsize=7, ha='center', va='center', zorder=6)

    # Draw output nodes.
    for pos in layer_positions['output']:
        # Use facecolor rather than color: color would override the black edge (ec).
        circle = plt.Circle(pos, node_radius_output, facecolor='lightgreen', ec='k', zorder=5)
        ax.add_patch(circle)

    # Normalize connection line alpha by maximum absolute weight.
    max_weight = max(np.abs(w1).max(), np.abs(final_w).max())

    # Draw connections from input to hidden using w1.
    for i, start_pos in enumerate(layer_positions['input']):
        for j, end_pos in enumerate(layer_positions['hidden']):
            weight = w1[i, j]
            color = 'red' if weight < 0 else 'blue'
            alpha = np.abs(weight) / max_weight
            ax.plot([start_pos[0] + node_radius_input, end_pos[0] - node_radius_hidden],
                    [start_pos[1], end_pos[1]], color=color, alpha=alpha, lw=1)

    # Draw connections from hidden to output using final_w.
    for j, start_pos in enumerate(layer_positions['hidden']):
        for k, end_pos in enumerate(layer_positions['output']):
            weight = final_w[j, k]
            color = 'red' if weight < 0 else 'blue'
            alpha = np.abs(weight) / max_weight
            ax.plot([start_pos[0] + node_radius_hidden, end_pos[0] - node_radius_output],
                    [start_pos[1], end_pos[1]], color=color, alpha=alpha, lw=1)

    # Label the layers.
    ax.text(0, 4.5, "Input Layer\n(Ball X, Ball Y,\nPaddle Y,\nBall DX, Ball DY)",
            ha='center', va='bottom', fontsize=10)
    ax.text(3, 4.5, f"Hidden Layer\n({num_hidden} Neurons)",
            ha='center', va='bottom', fontsize=10)
    ax.text(6, 4.5, "Output Layer\n(Up, Stay, Down)",
            ha='center', va='bottom', fontsize=10)

    # Display network prediction probabilities.
    pred_text = (f"Network Prediction:\n"
                 f"  Up: {probs[0]*100:.1f}%\n"
                 f"  Stay: {probs[1]*100:.1f}%\n"
                 f"  Down: {probs[2]*100:.1f}%")
    ax.text(6, -0.5, pred_text, ha='center', va='top',
            bbox=dict(facecolor='white', alpha=0.9), fontsize=12)

    plt.title("Network Architecture and Hidden Neuron Activations", fontsize=14)
    plt.tight_layout()
    plt.show()

# --- Create slider widgets (ensure that WIDTH, HEIGHT, BALL_SPEED, PADDLE_HEIGHT are defined) ---
slider_ball_x = widgets.FloatSlider(min=0, max=WIDTH, value=WIDTH/2, description="Ball X",
                                    layout=widgets.Layout(width='300px'))
slider_ball_y = widgets.FloatSlider(min=0, max=HEIGHT, value=HEIGHT/2, description="Ball Y",
                                    layout=widgets.Layout(width='300px'))
slider_paddle_y = widgets.FloatSlider(min=0, max=HEIGHT-PADDLE_HEIGHT, value=160, description="Paddle Y",
                                      layout=widgets.Layout(width='300px'))
slider_ball_dx = widgets.FloatSlider(min=-BALL_SPEED, max=BALL_SPEED, value=BALL_SPEED,
                                     description="Ball DX", layout=widgets.Layout(width='300px'))
slider_ball_dy = widgets.FloatSlider(min=-BALL_SPEED, max=BALL_SPEED, value=BALL_SPEED,
                                     description="Ball DY", layout=widgets.Layout(width='300px'))

sliders_box = VBox([slider_ball_x, slider_ball_y, slider_paddle_y, slider_ball_dx, slider_ball_dy])

# --- Create the interactive widget ---
interactive_plot = interactive(visualize_trained_network,
                               ball_x=slider_ball_x,
                               ball_y=slider_ball_y,
                               paddle_y=slider_paddle_y,
                               ball_dx=slider_ball_dx,
                               ball_dy=slider_ball_dy)

display(HBox([sliders_box, interactive_plot.children[-1]]))

Extracting hidden layer: dense_14
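
The visualization reads four arrays from `get_weights()`, assumed to be in the order hidden kernel, hidden bias, output kernel, output bias. The numbers it displays (hidden activations and action probabilities) can be reproduced by hand from those arrays. The sketch below is a NumPy-only illustration. It assumes a ReLU hidden layer and a softmax output, which the notebook section shown here does not confirm, and it uses random stand-in weights with the 5 -> 8 -> 3 shapes rather than the trained ones. `forward_pass` is a hypothetical helper name.

```python
import numpy as np

def forward_pass(norm_state, w1, b1, w2, b2):
    """Manual forward pass of a 5 -> hidden -> 3 dense network.

    Assumes ReLU in the hidden layer and softmax at the output;
    the notebook's actual activations may differ.
    """
    hidden = np.maximum(0.0, norm_state @ w1 + b1)  # ReLU (assumed)
    logits = hidden @ w2 + b2
    exp = np.exp(logits - logits.max())             # numerically stable softmax
    probs = exp / exp.sum()
    return hidden, probs

# Random stand-in weights with the same shapes as the arrays read above.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(5, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

# An already-normalized example state: ball x, ball y, paddle y, ball dx, ball dy.
norm_state = np.array([0.5, 0.5, 0.4, 1.0, -1.0])

hidden, probs = forward_pass(norm_state, w1, b1, w2, b2)
print("hidden activations:", np.round(hidden, 2))
print("action probabilities (up, stay, down):", np.round(probs, 3))
```

With the trained weights substituted for the random ones, and if the activation assumptions hold, `hidden` should match the values printed inside the hidden nodes and `probs` the percentages in the prediction box.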

