# Reinforcement Learning Lab: Snake Game & Exploration vs. Exploitation  
<img src="../images/SnakeMaze.png" alt="Snake Game Example" width="250"/>


**What is this lab about?**  

In this lab, we’ll explore how a Reinforcement Learning (RL) agent can learn to play the **Snake game**.  
We will focus on the classic RL dilemma:  

- **Exploration**: trying new moves, even if they might fail.  
- **Exploitation**: choosing the best-known move to maximize score.  

This lab connects directly to the **$\epsilon$-greedy policy**.  
By adjusting $\epsilon$, we can control how much the agent explores versus exploits.  



## Table of Contents  

- [1 - Packages](#1)  
- [2 - Minimal Snake Environment](#2)  
- [3 - Agent with $\epsilon$-Greedy Policy](#3)  
- [4 - Running Episodes](#4)  
- [5 - Plotting Results](#5)  
- [6 - Exercises](#6)  


<a name='1'></a>
## 1 - Packages
In this section, we import the core Python libraries for the lab: `numpy` for array handling, `matplotlib` for plotting, and one more module for drawing random numbers (the hint in Section 3 uses `random.random()`).


# === YOUR CODE HERE ===
import _____ as np
import _____
import matplotlib.pyplot as plt



<a name='2'></a>
## 2 - Minimal Snake Environment  
We represent the Snake game on an $8 \times 8$ grid.  

<img src="../images/snake-grid.png" width="300"/>  

### How the Environment Updates  

<img src="../images/snake-env-diagram.png" width="400"/>  

- Snake moves based on actions (`up`, `down`, `left`, `right`).  
- If the snake eats food → reward = +10.  
- If it crashes into wall/body → reward = -10 (game over).  
- Otherwise → reward = -0.1 (penalty for wasting time). 

We’ll build a very simple **Snake environment** with these conventions:  
- The snake starts at position `(0,0)`, the top-left corner.  
- Food is placed uniformly at random on the grid.  
- Actions: 0 = up, 1 = down, 2 = left, 3 = right.  
- Rewards: +10 for eating food, -10 for dying (which ends the episode), -0.1 for every other step.  
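
The action encoding above can be turned into grid moves with a small lookup table. The sketch below is one possible convention, not the lab's reference solution: it assumes positions are `(row, col)` tuples with row 0 at the top, so `up` decreases the row. The names `DELTAS`, `move_head`, and `in_bounds` are hypothetical helpers introduced only for illustration.

```python
GRID_SIZE = 8  # matches the 8 x 8 grid used in this lab

# Assumption: positions are (row, col) with (0, 0) in the top-left corner.
DELTAS = {
    0: (-1, 0),  # up
    1: (1, 0),   # down
    2: (0, -1),  # left
    3: (0, 1),   # right
}

def move_head(head, action):
    """Return the cell the head would occupy after taking `action`."""
    d_row, d_col = DELTAS[action]
    return (head[0] + d_row, head[1] + d_col)

def in_bounds(cell, grid_size=GRID_SIZE):
    """True if `cell` lies inside the grid, False if the move hits a wall."""
    return 0 <= cell[0] < grid_size and 0 <= cell[1] < grid_size

start = (0, 0)
print(move_head(start, 3), in_bounds(move_head(start, 3)))  # right: stays on the grid
print(move_head(start, 0), in_bounds(move_head(start, 0)))  # up: leaves the grid
```

Because the snake starts in a corner, two of the four actions hit a wall immediately, so an agent that picks uniformly at random has a 50% chance of ending the episode on its first move.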

**Exercise:** Implement the `SnakeEnv` class with:  
- `reset()` → resets snake and food.  
- `get_state()` → returns head + food positions.  
- `step(action)` → updates state and returns `(state, reward, done)`.  

 *Hint: Use a list `self.snake` to store snake body segments.*  


class SnakeEnv:
    GRID_SIZE = 8
    
    def __init__(self):
        self.reset()
    
    def reset(self):
        # === YOUR CODE HERE ===
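        self.snake = [(0,0)]  # body segments, head first
        # Resample until the food does not sit on the snake's start cell
        self.food = self.snake[0]
        while self.food in self.snake:
            self.food = (np.random.randint(0,self.GRID_SIZE),
                         np.random.randint(0,self.GRID_SIZE))
        self.done = False
        self.score = 0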
        return self.get_state()
    
    def get_state(self):
        # return head + food positions as numpy array
        head = self.snake[0]
        return np.array([head[0], head[1], self.food[0], self.food[1]])
    
    def step(self, action):
        # === STUDENT TODO === implement movement + reward rules
        pass
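
If you get stuck on `step`, the following is a minimal sketch of one way the rules could be implemented. It is not the lab's reference solution. To stay self-contained it uses the standard-library `random` module instead of `numpy` and returns the state as a plain tuple. It also makes several assumptions: cells are `(row, col)`, the snake grows by one segment when it eats, food respawns on a free cell, and moving into any current body segment counts as a crash. The class name `SnakeEnvSketch` and the `seed` parameter are hypothetical.

```python
import random

class SnakeEnvSketch:
    """Hypothetical stand-in for SnakeEnv; assumptions are noted inline."""
    GRID_SIZE = 8
    # Assumption: cells are (row, col); action ids follow the lab's encoding.
    DELTAS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.reset()

    def _place_food(self):
        # Pick uniformly among the cells not covered by the snake.
        free = [(r, c) for r in range(self.GRID_SIZE)
                for c in range(self.GRID_SIZE) if (r, c) not in self.snake]
        return self.rng.choice(free)

    def reset(self):
        self.snake = [(0, 0)]  # head first
        self.food = self._place_food()
        self.done = False
        self.score = 0
        return self.get_state()

    def get_state(self):
        head = self.snake[0]
        return (head[0], head[1], self.food[0], self.food[1])

    def step(self, action):
        d_row, d_col = self.DELTAS[action]
        head = self.snake[0]
        new_head = (head[0] + d_row, head[1] + d_col)
        hit_wall = not (0 <= new_head[0] < self.GRID_SIZE
                        and 0 <= new_head[1] < self.GRID_SIZE)
        # Crash into a wall or the body: -10 and the episode ends.
        if hit_wall or new_head in self.snake:
            self.done = True
            return self.get_state(), -10, True
        self.snake.insert(0, new_head)
        if new_head == self.food:
            self.score += 1  # snake grows: the tail stays in place
            if len(self.snake) == self.GRID_SIZE ** 2:
                self.done = True  # grid is full: nothing left to eat
                return self.get_state(), 10, True
            self.food = self._place_food()
            return self.get_state(), 10, False
        self.snake.pop()  # no food: drop the tail to keep the length
        return self.get_state(), -0.1, False

env = SnakeEnvSketch(seed=0)
print(env.reset())
print(env.step(0))  # moving up from (0, 0) hits the wall, so the episode ends
```

Treating the tail cell as occupied is the stricter of two common conventions; a variant could remove the tail before the collision check so the head may follow it.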


<a name='3'></a>
## 3 - Agent with $\epsilon$-Greedy Policy  
The agent balances **exploration vs exploitation**:  

<img src="../images/epsilon-greedy.png" width="400"/>  

- With probability $\epsilon$, choose a random action (exploration).  
- With probability $1-\epsilon$, choose the action with the highest Q-value, i.e. the argmax (exploitation).  

The agent also decays $\epsilon$ after each episode, so it explores heavily at first and exploits more as training goes on.  

**Exercise:** Implement the `Agent` class with:  
- `get_action(q_values)` → returns chosen action.  
- `decay_epsilon()` → decreases $\epsilon$ each episode.  

 *Hint: Use `random.random()` to compare against epsilon.*


class Agent:
    def __init__(self, n_actions=4, epsilon=1.0, epsilon_min=0.01, decay=0.995):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.decay = decay
    
    def get_action(self, q_values):
        # === YOUR CODE HERE ===
        pass
    
    def decay_epsilon(self):
        # === YOUR CODE HERE ===
        pass
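
As a reference point for the two methods, here is a minimal sketch of an $\epsilon$-greedy agent. It assumes a multiplicative decay floored at `epsilon_min`, which matches the constructor's `decay` and `epsilon_min` parameters but is not stated explicitly in the lab. The class name `AgentSketch` is hypothetical, and the greedy step uses plain Python instead of `np.argmax` so the block runs without extra packages.

```python
import random

class AgentSketch:
    """Hypothetical epsilon-greedy agent; one way to fill in the blanks."""

    def __init__(self, n_actions=4, epsilon=1.0, epsilon_min=0.01, decay=0.995):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.decay = decay

    def get_action(self, q_values):
        # Explore: random.random() is uniform on [0, 1), so this branch
        # fires with probability epsilon.
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        # Exploit: index of the largest Q-value (the first one wins on ties).
        return max(range(self.n_actions), key=lambda a: q_values[a])

    def decay_epsilon(self):
        # Assumption: multiplicative decay, never dropping below epsilon_min.
        self.epsilon = max(self.epsilon_min, self.epsilon * self.decay)

greedy = AgentSketch(epsilon=0.0)
print(greedy.get_action([0.1, 0.9, 0.3, 0.2]))  # always 1: epsilon is 0, so no exploration
```

The `max(..., key=...)` expression works on lists and NumPy arrays alike; with `numpy` imported, `int(np.argmax(q_values))` would select the same action.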


<a name='4'></a>
## 4 - Running Episodes  

Now, let’s simulate games of Snake.  

**Steps:**  
1. Reset the environment.  
2. While not done:  
   - Choose an action with the agent.  
   - Step the environment.  
   - Accumulate reward.  
3. After each episode, decay epsilon.  

**Exercise:** Write the training loop for 50 episodes.  


env = SnakeEnv()
agent = Agent()

scores = []
epsilons = []

for episode in range(50):
    state = env.reset()
    total_reward = 0
    done = False
    
    while not done:
        q_values = np.random.random(4)  # placeholder Q-values: random, so even the "best" action is arbitrary
        action = agent.get_action(q_values)
        next_state, reward, done = env.step(action)
        
        # === YOUR CODE HERE ===
        total_reward += reward
        state = next_state
    
    agent.decay_epsilon()
    scores.append(total_reward)
    epsilons.append(agent.epsilon)
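
A quick calculation shows how far $\epsilon$ moves within 50 episodes. The sketch assumes the multiplicative schedule $\epsilon \leftarrow \max(\epsilon_{min}, \text{decay} \cdot \epsilon)$ applied once per episode; that schedule is an assumption, since `decay_epsilon()` is left for you to implement.

```python
epsilon, epsilon_min, decay = 1.0, 0.01, 0.995  # Agent defaults from Section 3

# Assumption: epsilon is multiplied by `decay` once per episode.
after_50 = epsilon * decay ** 50
print(round(after_50, 3))  # 0.778: still exploring on roughly 78% of moves

# Count the episodes needed before epsilon reaches the floor epsilon_min.
episodes = 0
while epsilon > epsilon_min:
    epsilon = max(epsilon_min, epsilon * decay)
    episodes += 1
print(episodes)  # 919
```

Under these defaults the agent is still mostly exploring when the 50-episode loop ends, which is worth keeping in mind when reading the plots in the next section.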


<a name='5'></a>
## 5 - Plotting Results  
### Epsilon Decay  

We gradually reduce ε over time to shift from exploration to exploitation:  

<img src="../images/epsilon-decay-curve.png" width="400"/> 

We want to see:  
- **Episode rewards** (how well the agent performed).  
- **Exploration rate $\epsilon$** (how it decayed).  

**Exercise:** Create a plot with two curves:  
1. `scores` over episodes.  
2. `epsilons` over episodes.  


plt.figure(figsize=(10,4))
plt.plot(scores, label="Score per Episode")
plt.plot(epsilons, label="Epsilon (Exploration Rate)")
plt.xlabel("Episode")
plt.ylabel("Value")
plt.title("Snake Game: Exploration vs Exploitation")
plt.legend()
plt.show()
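
Per-episode scores can be noisy because food placement and exploratory moves are random, which makes a trend hard to see over only 50 episodes. One optional aid is a trailing moving average. The helper `moving_average` below is a hypothetical addition, not part of the lab's required code.

```python
def moving_average(values, window=10):
    """Trailing mean over up to `window` most recent values."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed

print(moving_average([1, 2, 3, 4, 5], window=2))  # [1.0, 1.5, 2.5, 3.5, 4.5]
```

It could be added to the figure with a line such as `plt.plot(moving_average(scores), label="Score (10-episode mean)")` before `plt.legend()`.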


<a name='6'></a>
## 6 - Exercises  

1. Modify the decay schedule: make $\epsilon$ decrease **slower** or **faster**.  
2. Try a **larger grid size** for the snake. What changes?  
3. Fix $\epsilon=1.0$ (always random). What happens to the score trend?  
4. Fix $\epsilon=0.0$ (always exploit). Why does the agent fail?  
5. Change the step penalty from -0.1 to -1 per move. Does the agent play differently?  
