# 🐍 Reinforcement Learning for Snake: Zero to Hero

Welcome to this interactive tutorial on training an AI agent to play Snake using Reinforcement Learning (RL). This notebook aligns with our project blog post: **Scaling Snake AI: From Random Wiggles to Strategic Mastery**.

## 0. A Quick Primer: What is RL?
Reinforcement Learning is simply **learning by trial and error**. Imagine training a dog: you don't give it a manual on how to sit; you give it a treat when it accidentally sits. Eventually, the dog learns that **Actions** in certain **States** lead to **Rewards**.

### The Two Main 'Brains'
There are two primary ways an AI can 'think' about this problem:

**1. Value-Based (The Accountant)**
- *Method*: The agent estimates how much long-term reward each move is worth. "Is turning left worth 10 points or 2?"
- *The Famous Name*: **DQN** (stands for **Deep Q-Network**). It's 'Deep' because it uses a neural network brain, and 'Q' is just the math symbol for 'Quality' or value.
- *Analogy*: It's like having a map of a city that tells you exactly how much gold is at every corner.

**2. Policy-Based (The Athlete)**
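- *Method*: The agent learns general instincts. "If a wall is in front of me, turn right." Instead of comparing the value of each move, it outputs a preference for each action directly.
- *The Famous Name*: **PPO** (stands for **Proximal Policy Optimization**). 'Proximal' basically means 'don't change too much at once', so it stays stable while learning. In practice PPO also learns a value estimate on the side to judge how well its actions turned out, but the policy is what picks the move.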
- *Analogy*: It's like a professional athlete. They don't calculate the 'value' of a pass; they have a trained instinct that tells them where to throw the ball.
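
To make the two 'brains' concrete, here is a minimal sketch of how each one might pick a move for a single state. The numbers in `q_values` and `logits` are made up for illustration, and the helper names are ours rather than anything from the project code.

```python
import math
import random

ACTIONS = ["straight", "right", "left"]  # illustrative labels for the 3 moves

def greedy_action(q_values):
    """Value-based: pick the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax(logits):
    """Turn raw preferences into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng):
    """Policy-based: sample an action from the policy's probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

q_values = [2.0, 10.0, -1.0]  # made-up value estimates for one state
logits = [0.1, 2.0, -1.0]     # made-up policy preferences for the same state

rng = random.Random(0)
print("value-based picks:", ACTIONS[greedy_action(q_values)])
print("policy-based samples:", ACTIONS[sample_action(logits, rng)])
```

The Accountant always takes the single best-looking number, while the Athlete samples from a distribution, which gives it some built-in exploration.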

---
## 1. The Environment

We use a custom `SnakeGame` environment. The agent observes the state and outputs one of 3 actions: [Straight, Right Turn, Left Turn].
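
The three actions are relative to where the snake is currently heading. The sketch below shows one way such relative actions could be turned into a move on the grid. The action indices and the clockwise direction order are assumptions for illustration; the real `SnakeGame` class may encode them differently.

```python
# Directions listed clockwise as (row_delta, col_delta); rows grow downward.
DIRECTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # up, right, down, left

# Assumed action indices (not confirmed by snake_game).
STRAIGHT, RIGHT_TURN, LEFT_TURN = 0, 1, 2

def turn(direction_index, action):
    """Return the new heading after applying a relative action."""
    if action == RIGHT_TURN:
        return (direction_index + 1) % 4  # one step clockwise
    if action == LEFT_TURN:
        return (direction_index - 1) % 4  # one step counter-clockwise
    return direction_index                # straight: keep heading

def step_head(head, direction_index, action):
    """Move the head one cell in the heading chosen by the action."""
    new_direction = turn(direction_index, action)
    dr, dc = DIRECTIONS[new_direction]
    return (head[0] + dr, head[1] + dc), new_direction

# A snake at (2, 2) heading up that turns right moves one cell to the right.
print(step_head((2, 2), 0, RIGHT_TURN))
```

One nice property of relative actions is that the snake can never reverse straight into its own neck, so every action is at least a legal move.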

In [None]:
from snake_game import SnakeGame
import numpy as np
import matplotlib.pyplot as plt

game = SnakeGame(board_size=5)
state = game.reset()
print(f"Initial Board State (Shape {state.shape}):")
print(state)

## 2. Phase 0: The Baseline (Value-Based)

We started with **Classic Tabular Q-Learning**. This is the ultimate 'Accountant' approach, where we store a literal table with one value for every (state, action) pair.

- **Tabular Q**: Simple memorization.
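- **Double Q**: A smarter version that keeps two tables and uses one to double-check the other's estimates, which reduces the over-confidence (overestimated values) that plain Q-learning is prone to.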
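
As a rough illustration of the two update rules, here is a minimal sketch using plain dictionaries as tables. This is the textbook form of Q-learning and Double Q-learning, not necessarily what `train_tabular_q.py` does; the state keys, `alpha`, and `gamma` values are placeholders.

```python
import random
from collections import defaultdict

N_ACTIONS = 3

def new_table():
    """A Q-table that starts every unseen state at zero for all actions."""
    return defaultdict(lambda: [0.0] * N_ACTIONS)

def q_update(q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """Tabular Q: nudge Q[s][a] toward reward plus best next value."""
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])

def double_q_update(q_a, q_b, s, a, r, s_next, done, rng, alpha=0.1, gamma=0.9):
    """Double Q: one table picks the best next action, the other scores it."""
    if rng.random() < 0.5:
        q_a, q_b = q_b, q_a  # randomly choose which table gets updated
    if done:
        target = r
    else:
        best = max(range(N_ACTIONS), key=lambda i: q_a[s_next][i])
        target = r + gamma * q_b[s_next][best]
    q_a[s][a] += alpha * (target - q_a[s][a])

q = new_table()
q_update(q, "s0", 1, 1.0, "s1", done=False)
print(q["s0"])
```

Because the table that chooses the action is not the one that evaluates it, a lucky overestimate in one table is less likely to be reinforced.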

**Result**: Perfect on small 5x5 boards, but it failed as soon as the grid grew because the "table" became too large to fit in memory.

In [None]:
# Try running the tabular agents:
# !python train_tabular_q.py --type double_q --board_size 5

## 3. The Wall: Scaling Challenges

On a 10x10 board, the tabular approach breaks down. Rewards also become **sparse**: the snake may wander for 1000 steps without reaching a single piece of food. Random exploration is no longer enough.
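
To get a feel for how sparsity grows with board size, here is a small simulation of a purely random walker looking for one piece of food. It is a deliberately simplified stand-in for the real game: the walker has no body, bumping into a wall just wastes a step instead of ending the episode, and none of this uses `SnakeGame`.

```python
import random

MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def steps_to_food(board_size, rng, max_steps=100_000):
    """Count random steps until the walker lands on the food cell."""
    cells = [(r, c) for r in range(board_size) for c in range(board_size)]
    head, food = rng.sample(cells, 2)  # two distinct cells
    for step in range(1, max_steps + 1):
        dr, dc = rng.choice(MOVES)
        r, c = head[0] + dr, head[1] + dc
        if 0 <= r < board_size and 0 <= c < board_size:
            head = (r, c)  # moves off the board are ignored
        if head == food:
            return step
    return max_steps

def mean_steps(board_size, episodes=1000, seed=0):
    """Average number of random steps needed to reach the food."""
    rng = random.Random(seed)
    total = sum(steps_to_food(board_size, rng) for _ in range(episodes))
    return total / episodes

for size in (5, 10):
    print(f"{size}x{size}: about {mean_steps(size):.0f} random steps per food")
```

A real snake has it harder than this walker, since it can die on walls or its own body long before it stumbles onto the food.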

## 4. The Solution: A Triple Threat

To master the 10x10 board, we combined three advanced techniques:

### 1. Imitation Learning (The Instinct)
We taught the agent to mimic experts. This gave it an "instinct" for survival immediately.
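
One common form of imitation learning is behavior cloning: treat the expert's moves as labels and train the policy like a classifier. The sketch below does this with a tiny linear softmax policy in plain Python. The three "food ahead / right / left" features and the `expert_action` rule are hypothetical stand-ins, not the experts or features used in the project.

```python
import math

N_FEATURES, N_ACTIONS = 3, 3  # hypothetical features: food ahead / right / left

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_probs(weights, features):
    """Linear policy: one row of weights per action, then softmax."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return softmax(logits)

def expert_action(features):
    """Stand-in expert: head toward the food (0=straight, 1=right, 2=left)."""
    return features.index(1.0)

def clone(dataset, epochs=200, lr=0.5):
    """Fit the policy to the expert's choices with cross-entropy loss."""
    weights = [[0.0] * N_FEATURES for _ in range(N_ACTIONS)]
    for _ in range(epochs):
        for features, action in dataset:
            probs = policy_probs(weights, features)
            for a in range(N_ACTIONS):
                # Gradient of cross-entropy w.r.t. logit a is (prob - one_hot).
                grad = probs[a] - (1.0 if a == action else 0.0)
                for j in range(N_FEATURES):
                    weights[a][j] -= lr * grad * features[j]
    return weights

states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
dataset = [(s, expert_action(s)) for s in states]
weights = clone(dataset)
for s in states:
    print(s, [round(p, 2) for p in policy_probs(weights, s)])
```

The cloned policy never sees a reward; it only learns to agree with the expert, which is why it works as a warm start before RL fine-tuning.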

### 2. PPO (Proximal Policy Optimization)
Because PPO learns high-level strategies ("move toward food") rather than specific board values, it generalizes much better to different board sizes.
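
The 'don't change too much at once' idea shows up in PPO's clipped objective. Here is a minimal sketch of that objective for a single sample; `clip_eps=0.2` is a commonly used default and is an assumption here, not a setting taken from `train_ppo_curriculum.py`.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantage, clip_eps=0.2):
    """PPO's per-sample objective: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(new_logp - old_logp)  # new_prob / old_prob
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# The new policy makes a good action 1.5x more likely; credit is capped at 1.2x.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))

# The new policy makes a bad action half as likely; credit is capped at 0.8x.
print(clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0))
```

Once the ratio leaves the `[1 - clip_eps, 1 + clip_eps]` band in the direction that improves the objective, the gradient for that sample vanishes, so a single batch cannot drag the policy far from where it started.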

### 3. Curriculum Learning (The Growth)
We didn't start at 10x10. We mastered 5x5, then transferred that brain to 8x8, and finally to 10x10.
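
The curriculum itself is just a loop that carries the learned parameters from one board size to the next. The sketch below shows that shape with a placeholder training function; `run_curriculum` and `dummy_train_stage` are hypothetical names, and the real logic lives in `train_ppo_curriculum.py`.

```python
def run_curriculum(board_sizes, train_stage, initial_params=None):
    """Train on each board size in order, reusing the previous stage's params."""
    params = initial_params
    for size in board_sizes:
        params = train_stage(board_size=size, params=params)
    return params

def dummy_train_stage(board_size, params):
    """Placeholder for real PPO training; just records which boards were seen."""
    seen = list(params["seen"]) if params else []
    return {"seen": seen + [board_size]}

final_params = run_curriculum([5, 8, 10], dummy_train_stage)
print(final_params["seen"])
```

For the hand-off to work with a real network, the model's parameters cannot depend on the board size, for example by using convolutional layers or a fixed-size feature vector as input.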

In [None]:
import torch
print("To see the full scaling pipeline, check train_ppo_curriculum.py")
# !python train_ppo_curriculum.py

## 5. Visualizing the Journey

Check out `snake_learning_journey.html` to see the interactive evolution from Stage 0 (Classic Tabular) to Stage 8 (Master PPO).