# Reinforcement Learning Lab: Tic-Tac-Toe with Q-Learning
<img src="../images/tictactoe.png" alt="Tic Tac Toe Example" width="200"/>

**What is this lab about?**  

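In this lab, we’ll train an **RL agent** to play **Tic-Tac-Toe** using the **Q-learning algorithm**.  
Q-learning is an **off-policy** algorithm, meaning it updates Q-values using the best available next action, not necessarily the action the agent actually takes next. (SARSA, its **on-policy** counterpart, updates with the action actually taken.)  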

We will focus on:  

- Understanding the **Q-learning update rule**.  
- Implementing the agent.  
- Training against a random opponent.  
- Visualizing results.  

---

## Table of Contents  

- [1 - Packages](#1)  
- [2 - Tic-Tac-Toe Environment](#2)  
- [3 - Q-Learning Agent](#3)  
- [4 - Running Episodes](#4)  
- [5 - Plotting Results](#5)  
- [6 - Exercises](#6)  
 


<a name='1'></a>
## 1 - Packages

We start by importing the necessary Python libraries.  

- **numpy** → for handling states and arrays.  
- **random** → for exploration and the opponent’s random moves.  
- **matplotlib** → for plotting training results later.  
- **collections** → for tracking performance statistics (wins, draws, losses).  

These will be the main dependencies throughout the notebook.


```python
import numpy as np
import random
import matplotlib.pyplot as plt
from collections import defaultdict, deque
```


<a name='2'></a>
## 2 - Tic-Tac-Toe Environment  

We now create the **game environment** where our RL agent will play.  

### Board Representation  
- The board is a flat array of 9 cells.  
- Values:  
  - `0` → empty cell  
  - `1` → agent (X)  
  - `-1` → opponent (O)  

<img src="../images/tictactoe-board.png" width="250"/>  
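
To make the encoding concrete, the sketch below renders a flat board as a 3×3 grid. The `render` helper and the `SYMBOLS` mapping are illustrative names that do not come from the lab, and a plain Python list stands in for the array to keep the example dependency-free.

```python
# Hypothetical helper for illustration: maps the cell encoding to symbols.
SYMBOLS = {0: ".", 1: "X", -1: "O"}


def render(board):
    """Return a 3x3 text view of a flat, 9-cell board."""
    rows = [board[i:i + 3] for i in range(0, 9, 3)]
    return "\n".join(" ".join(SYMBOLS[cell] for cell in row) for row in rows)


# Agent (X) holds the centre, opponent (O) holds the top-left corner.
board = [-1, 0, 0,
          0, 1, 0,
          0, 0, 0]
print(render(board))
# O . .
# . X .
# . . .
```

Cell index `i` maps to row `i // 3` and column `i % 3`, so index `4` is the centre.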

### Environment Mechanics  
- `reset()` → clears the board.  
- `available_actions()` → returns indices of empty cells.  
- `step(action, player)` → executes a move, returns `(new_state, reward, done)`.  
- `check_winner()` → checks rows, columns, diagonals.  

### Rewards  
- `+1` → agent wins  
- `-1` → agent loses  
- `0` → draw  
- `-10` → illegal move (placing a mark in an occupied cell)  

This environment will be used for both Q-Learning and SARSA.
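
One way the environment described in this section could look is sketched below. The class name `TicTacToe`, the tuple state, and the choice to end the episode on an illegal move are assumptions made for illustration; the method names and rewards follow the lists above.

```python
class TicTacToe:
    """Minimal sketch of the environment described above (details are assumptions)."""

    # Index triples that form a line: 3 rows, 3 columns, 2 diagonals.
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

    def __init__(self):
        self.reset()

    def reset(self):
        """Clear the board and return the initial state."""
        self.board = [0] * 9
        return tuple(self.board)

    def available_actions(self):
        """Return the indices of empty cells."""
        return [i for i, cell in enumerate(self.board) if cell == 0]

    def check_winner(self):
        """Return 1 or -1 if that player has three in a line, else 0."""
        for a, b, c in self.LINES:
            total = self.board[a] + self.board[b] + self.board[c]
            if abs(total) == 3:
                return total // 3
        return 0

    def step(self, action, player):
        """Place `player` (1 or -1) at `action`; return (new_state, reward, done)."""
        if self.board[action] != 0:
            # Illegal move: penalise; ending the episode here is an assumption.
            return tuple(self.board), -10, True
        self.board[action] = player
        winner = self.check_winner()
        if winner != 0:
            # Reward is from the agent's point of view: +1 win, -1 loss.
            return tuple(self.board), winner, True
        if not self.available_actions():
            return tuple(self.board), 0, True   # board full, nobody won: draw
        return tuple(self.board), 0, False


env = TicTacToe()
env.reset()
for move in (0, 1, 2):                          # agent fills the top row
    state, reward, done = env.step(move, player=1)
print(state, reward, done)
```

Returning the state as a tuple keeps it hashable, which is convenient if states are later used as keys in a Q-table.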


<a name='4'></a>
## 4 - Running Episodes  

Now we combine the **environment** and the **Q-learning agent** to simulate games.  

### Training Loop  

At each episode:  
1. Reset the environment to get the initial state.  
2. Let the agent play until the game ends (`done=True`).  
3. At each step:  
   - Choose action $a$ using $\epsilon$-greedy.  
   - Apply action → get $(s', r, \text{done})$.  
   - Update $Q(s,a)$ using the Q-learning update rule.  
4. Track results: win, loss, or draw.  
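
For reference, the standard tabular Q-learning update applied in step 3 can be written as follows, with learning rate $\alpha$ and discount factor $\gamma$. The maximum over next actions $a'$ is what makes the method off-policy.

$$
Q(s,a) \gets Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]
$$

When $s'$ is terminal (`done=True`), there is no next action and the target reduces to $r$.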

<img src="../images/episode-loop.png" width="400"/>  

This loop allows the agent to **learn from repeated games** and gradually improve.  
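
The steps above can be sketched as a compact, self-contained loop. Everything here is illustrative instead of the lab's actual implementation: the helper names (`winner`, `empty_cells`, `play_turn`, `train`), the hyperparameter values, and the decision to fold the random opponent's reply into a single turn are assumptions. The agent only picks from empty cells, so the illegal-move penalty never triggers in this sketch.

```python
import random
from collections import Counter, defaultdict

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 1 or -1 if that player has three in a line, else 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


def empty_cells(board):
    """Return the indices of empty cells."""
    return [i for i, cell in enumerate(board) if cell == 0]


def play_turn(board, action, rng):
    """Agent (1) moves, then a random opponent (-1) replies.

    Returns (next_state, reward, done) from the agent's point of view.
    """
    board = list(board)
    board[action] = 1
    if winner(board) == 1:
        return tuple(board), 1, True
    if not empty_cells(board):
        return tuple(board), 0, True            # board full: draw
    board[rng.choice(empty_cells(board))] = -1
    if winner(board) == -1:
        return tuple(board), -1, True
    return tuple(board), 0, False


def train(episodes=100, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning against a random opponent (illustrative settings)."""
    rng = random.Random(seed)
    Q = defaultdict(float)                      # Q[(state, action)], default 0.0
    results = Counter()
    for _ in range(episodes):
        state, done = (0,) * 9, False           # reset to an empty board
        while not done:
            actions = empty_cells(state)
            if rng.random() < epsilon:          # explore
                action = rng.choice(actions)
            else:                               # exploit the current estimates
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = play_turn(state, action, rng)
            # Q-learning target: best next action value, or 0 if terminal.
            best_next = 0.0 if done else max(
                Q[(next_state, a)] for a in empty_cells(next_state))
            Q[(state, action)] += alpha * (
                reward + gamma * best_next - Q[(state, action)])
            state = next_state
        results[{1: "win", -1: "loss", 0: "draw"}[reward]] += 1
    return Q, results


Q, results = train(episodes=100)
print(results)                                  # counts of wins, losses, draws
```

Because the reward is nonzero only on the final move, `reward` at the end of each episode directly identifies the outcome.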


### Test Run  

Let’s train the agent for a few episodes and see if it starts to improve.  

We’ll:  
- Train for 100 episodes.  
- Print how many games the agent **won**.  
- Later, we’ll visualize the results with plots.  


<a name='5'></a>
## 5 - Plotting Results  

To understand how our agent is improving, we need to **visualize the training progress**.  

We’ll plot:  
- **Rewards per Episode** → shows how the agent’s returns evolve.  
- **Rolling Average Rewards** → smooths out randomness and shows overall trends.  
- **Win Rate Curve** → how often the agent wins as training progresses.  

These plots help us verify if the Q-learning agent is actually **learning** or just playing randomly.  

<img src="../images/qlearning-training-curve.png" width="400"/>  
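
A possible way to produce these three curves is sketched below. It assumes training yields a list `rewards` holding the final reward of each episode (`1` for a win); the names `rolling_mean` and `plot_training` and the window size are hypothetical.

```python
from collections import deque


def rolling_mean(values, window=50):
    """Mean of the last `window` values at each step (shorter at the start)."""
    recent, total, out = deque(), 0.0, []
    for v in values:
        recent.append(v)
        total += v
        if len(recent) > window:
            total -= recent.popleft()
        out.append(total / len(recent))
    return out


def plot_training(rewards, window=50):
    """Plot raw rewards, their rolling mean, and the rolling win rate."""
    import matplotlib.pyplot as plt  # imported here so rolling_mean works without it

    wins = [1.0 if r == 1 else 0.0 for r in rewards]
    fig, (ax_reward, ax_win) = plt.subplots(1, 2, figsize=(10, 4))
    ax_reward.plot(rewards, alpha=0.3, label="reward per episode")
    ax_reward.plot(rolling_mean(rewards, window), label=f"rolling mean ({window})")
    ax_reward.set_xlabel("episode")
    ax_reward.set_ylabel("reward")
    ax_reward.legend()
    ax_win.plot(rolling_mean(wins, window))
    ax_win.set_xlabel("episode")
    ax_win.set_ylabel("win rate")
    fig.tight_layout()
    plt.show()


print(rolling_mean([1, -1, 1, 1], window=2))    # [1.0, 0.0, 0.0, 1.0]
```

Calling `plot_training(rewards)` after training draws both panels. A win rate that rises over episodes suggests the agent is learning and not merely playing randomly.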


<a name='6'></a>
## 6 - Exercises 🎯  

Now that you have built and trained a **Q-Learning agent** for Tic-Tac-Toe, try the following extensions:

1. **Vary Hyperparameters**  
   - Change $\alpha$, $\gamma$, or $\epsilon$ and see how training performance changes.  
   - Plot and compare learning curves.  

2. **Play Against the Agent**  
   - Implement a simple text-based or GUI interface.  
   - Try playing against your trained agent. Can you beat it?  

3. **Implement SARSA**  
   - Replace the Q-learning update with the SARSA update:  
   $$
   Q(s,a) \gets Q(s,a) + \alpha \Big[ r + \gamma Q(s',a') - Q(s,a) \Big]
   $$  
   - Compare results with Q-Learning.  

4. **Add a Deep Q-Network (DQN)**  
   - Replace the Q-table with a neural network.  
   - Train it using the same environment.  

5. **Experiment with Exploration**  
   - Try different exploration strategies (e.g., decaying $\epsilon$, softmax).  
   - Observe how agent behavior changes.  

6. **Deployment Idea 💡**  
   - Wrap your trained agent in an API.  
   - Build a small GUI so humans can play against it.  
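
As one possible starting point for exercises 3 and 5, the sketch below puts the two update rules side by side and adds a simple decaying $\epsilon$ schedule. The function names, the dictionary-based Q-table, and the default values are assumptions chosen for illustration, not a reference solution.

```python
def q_learning_update(Q, s, a, r, s_next, next_actions, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the best action available in s_next.

    Pass an empty `next_actions` when s_next is terminal so the target is just r.
    """
    best = max((Q.get((s_next, a2), 0.0) for a2 in next_actions), default=0.0)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best - old)


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from a_next, the action actually chosen in s_next.

    Pass a_next=None when s_next is terminal so the target is just r.
    """
    nxt = 0.0 if a_next is None else Q.get((s_next, a_next), 0.0)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * nxt - old)


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Exponentially decay epsilon per episode, never going below `end`."""
    return max(end, start * decay ** episode)


# Toy Q-table: in state "s1", action 1 looks better than action 0.
Q = {("s1", 0): 0.2, ("s1", 1): 0.8}
q_learning_update(Q, "s0", 0, 0.0, "s1", [0, 1])   # target uses max(0.2, 0.8)
sarsa_update(Q, "s0", 1, 0.0, "s1", a_next=0)      # target uses Q[("s1", 0)]
print(Q[("s0", 0)], Q[("s0", 1)])                  # roughly 0.36 and 0.09
```

The only difference between the two updates is which next-state value enters the target, which is why SARSA's estimates reflect the exploratory moves the agent actually makes.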
