<a href="https://colab.research.google.com/github/Mudasir24/AIML/blob/main/NIM_RL_Teaching_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 Reinforcement Learning with the NIM Game
Let's teach our AI how to win a simple game using Q-learning.

## 🎮 The NIM Game Rules
- Start with 21 sticks.
- Each player takes 1, 2, or 3 sticks on their turn.
- The player who takes the **last stick loses**.

We'll train an AI to get smarter over time!

In [1]:
MAX_STICKS = 21
ACTIONS = [1, 2, 3, 4]

## 🧠 Step 1: Create a Q-table
We’ll use a dictionary to store the AI’s knowledge — the expected value (Q) of taking each action in every possible state.

In [2]:
Q = {}

## 🎲 Step 2: Action Choice
Let’s write a function that chooses an action. We’ll use **epsilon-greedy** — random at first, smarter later.

In [4]:

import random

def choose_action(state, epsilon):
    if state not in Q:
        Q[state] = {a: 0 for a in ACTIONS} #Initialize Q state
    if random.random() < epsilon: #Epsilon is to do random moves
        return random.choice([a for a in ACTIONS if a <= state])
    return max(Q[state], key=Q[state].get)


## 💡 Step 3: Q-Value Update Rule
We’ll update the Q-values using this formula:
```
Q(s,a) = Q(s,a) + alpha * (reward + gamma * max(Q(s') - Q(s,a))
```

In [5]:

def update_q(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    if state not in Q:
        Q[state] = {a: 0 for a in ACTIONS}
    if next_state not in Q:
        Q[next_state] = {a: 0 for a in ACTIONS}
    max_q_next = max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * max_q_next - Q[state][action])


## 🔁 Step 4: Training Loop
Now we’ll play lots of games where the AI learns from experience.

In [12]:
# Training the model to lose
def train(episodes=10000, epsilon=0.3, alpha=0.1, gamma=0.9):
    for _ in range(episodes):
        state = MAX_STICKS
        last_state, last_action = None, None

        while state > 0:
            action = choose_action(state, epsilon)
            next_state = state - action

            if last_state is not None:
                update_q(last_state, last_action, 0, state, alpha, gamma)

            last_state = state
            last_action = action

            if next_state == 0:
                update_q(state, action, 1, next_state, alpha, gamma)
                break

            valid_opponent_actions = [a for a in ACTIONS if a <= next_state]
            if not valid_opponent_actions:
                update_q(last_state, last_action, 0, next_state, alpha, gamma)
                break

            opponent_action = random.choice(valid_opponent_actions)
            state = next_state - opponent_action

            if state <= 0:
                update_q(last_state, last_action, -1, next_state, alpha, gamma)
                break


## 🚀 Train the AI!

In [13]:
train()

In [14]:
print(Q)

{21: {1: 0.6682042767651922, 2: 0.6729797731063363, 3: 0.694283945461997, 4: 0.6987805124470369}, 18: {1: 0.7017120023218394, 2: 0.723750808469379, 3: 0.7282083945830897, 4: 0.7129055454010246}, 15: {1: 0.7246803302645219, 2: 0.7592197761977229, 3: 0.7667075564193486, 4: 0.7597252878106576}, 10: {1: 0.6808617552276041, 2: 0.7469158222376785, 3: 0.7883302519506252, 4: 0.7067072100010221}, 5: {1: 0.5759332671724873, 2: 0.3026895283458933, 3: 0.23811153458402945, 4: -0.09984408361385029}, 1: {1: 0.9999999999999996, 2: 0.0, 3: 0, 4: 0}, 0: {1: 0, 2: 0, 3: 0, 4: 0}, 14: {1: 0.7284221154472285, 2: 0.7679122381762034, 3: 0.7833219010439156, 4: 0.8099999999999987}, 9: {1: 0.8077075708431524, 2: 0.7755049575741563, 3: 0.861316996107124, 4: 0.899999999999999}, 6: {1: 0.899999999999999, 2: 0.7203493184489667, 3: 0.7287238779770342, 4: 0.43461273454763333}, 3: {1: 0.4436063048612124, 2: -0.09999999999969517, 3: 0.9999999999999996, 4: 0.0}, -1: {1: 0, 2: 0, 3: 0, 4: 0}, 17: {1: 0.7226548964585155, 

## 🧪 Let’s play against the AI!

In [15]:

def play():
    state = MAX_STICKS
    while state > 0:
        print(f"Sticks left: {state}")
        move = int(input("Your move (1–3): "))
        state -= move
        if state <= 0:
            print("You took the last stick. You lose!")
            return
        if state in Q:
            ai_move = max(Q[state], key=Q[state].get)
        else:
            ai_move = random.choice([a for a in ACTIONS if a <= state])
        print(f"AI takes {ai_move} stick(s).")
        state -= ai_move
        if state <= 0:
            print("AI took the last stick. You win!")
            return


In [17]:
play()

Sticks left: 21
Your move (1–3): 5
AI takes 3 stick(s).
Sticks left: 13
Your move (1–3): 3
AI takes 3 stick(s).
Sticks left: 7
Your move (1–3): 4
AI takes 3 stick(s).
AI took the last stick. You win!


## 🎉 Summary
You just trained an agent to play a game using trial-and-error. That’s the magic of Reinforcement Learning!