<a href="https://colab.research.google.com/github/MohammedAbraar302/aiml.ipynb/blob/main/Copy_of_NIM_RL_Teaching_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 Reinforcement Learning with the NIM Game
Let's teach our AI how to win a simple game using Q-learning.

## 🎮 The NIM Game Rules
- Start with 21 sticks.
- Each player takes 1, 2, or 3 sticks on their turn.
- The player who takes the **last stick loses**.

We'll train an AI to get smarter over time!

In [37]:
MAX_STICKS = 21
ACTIONS = [1, 2, 3, 4]

## 🧠 Step 1: Create a Q-table
We’ll use a dictionary to store the AI’s knowledge — the expected value (Q) of taking each action in every possible state.

In [38]:
Q = {}

## 🎲 Step 2: Action Choice
Let’s write a function that chooses an action. We’ll use **epsilon-greedy** — random at first, smarter later.

In [39]:

import random

def choose_action(state, epsilon):
    if state not in Q:
        Q[state] = {a: 0 for a in ACTIONS}
    if random.random() < epsilon:
        return random.choice([a for a in ACTIONS if a <= state])
    return max(Q[state], key=Q[state].get)


## 💡 Step 3: Q-Value Update Rule
We’ll update the Q-values using this formula:
```
Q(s,a) = Q(s,a) + alpha * (reward + gamma * max(Q(s') - Q(s,a))
```

In [40]:

def update_q(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    if state not in Q:
        Q[state] = {a: 0 for a in ACTIONS}
    if next_state not in Q:
        Q[next_state] = {a: 0 for a in ACTIONS}
    max_q_next = max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * max_q_next - Q[state][action])


## 🔁 Step 4: Training Loop
Now we’ll play lots of games where the AI learns from experience.

In [41]:
def train(episodes=10000, epsilon=0.3, alpha=0.1, gamma=0.9):
    for _ in range(episodes):
        state = MAX_STICKS
        last_state, last_action = None, None

        while state > 0:
            action = choose_action(state, epsilon)
            next_state = state - action

            if last_state is not None:
                update_q(last_state, last_action, 0, state, alpha, gamma)

            last_state = state
            last_action = action

            if next_state == 0:
                update_q(state, action, -1, next_state, alpha, gamma)
                break

            valid_opponent_actions = [a for a in ACTIONS if a <= next_state]
            if not valid_opponent_actions:
                update_q(last_state, last_action, 0, next_state, alpha, gamma)
                break

            opponent_action = random.choice(valid_opponent_actions)
            state = next_state - opponent_action

            if state <= 0:
                update_q(last_state, last_action, 1, next_state, alpha, gamma)
                break


## 🚀 Train the AI!

In [42]:
train()

In [43]:
print(Q)

{21: {1: 0.6949143256485893, 2: 0.7272320476749335, 3: 0.7326054095980893, 4: 0.7498745346339808}, 19: {1: 0.7396839740365225, 2: 0.4281240019659795, 3: 0.45526162097351014, 4: 0.5333557013845269}, 16: {1: 0.7692375450321817, 2: 0.7866406225030995, 3: 0.79626159651319, 4: 0.8270502820495111}, 11: {1: 0.8634540641588887, 2: 0.8645265566855302, 3: 0.8610683415746269, 4: 0.9297580955763165}, 9: {1: 0.8775716290136331, 2: 0.9171777940792544, 3: 0.9145278397640869, 4: 0.677880752295148}, 7: {1: 0.9076593056284399, 2: 0.5859005237391894, 3: 0.9785521105911106, 4: 0.7092741564284064}, 2: {1: 0.9999999999999996, 2: -0.9999999999999608, 3: 0, 4: 0}, 1: {1: -0.9999999999999996, 2: 0.0, 3: 0, 4: 0}, 12: {1: 0.8306826711252127, 2: 0.8458334498733349, 3: 0.8482133543151322, 4: 0.862835887151084}, 18: {1: 0.7556815647336532, 2: 0.7156525996961419, 3: 0.7251536550563824, 4: 0.7974056646823325}, 15: {1: 0.8053850106369571, 2: 0.8064362607379042, 3: 0.8173363776063043, 4: 0.8284721351391358}, 3: {1: 0.

## 🧪 Let’s play against the AI!

In [46]:
def play():
    state = MAX_STICKS
    while state > 0:
        print(f"Sticks left: {state}")
        move = int(input("Your move (1–3): "))
        state -= move
        if state <= 0:
            print("You took the last stick. You lose!")
            return

        valid_ai_moves = [a for a in ACTIONS if a <= state]
        if valid_ai_moves:
            if state in Q:
                # Find the action with the minimum Q-value (the "worst" move)
                ai_move = min(Q[state], key=Q[state].get)

                # Make sure the chosen "worst" move is actually valid from the current state
                if ai_move not in valid_ai_moves:
                     ai_move = random.choice(valid_ai_moves) # Fallback to random if worst move isn't valid
            else:
                # If the state is not in Q, just choose a valid move randomly
                ai_move = random.choice(valid_ai_moves)

            print(f"AI takes {ai_move} stick(s).")
            state -= ai_move
            if state <= 0:
                print("AI took the last stick. You win!")
                return
        else:
            print("AI has no valid moves left. You win!")
            return

In [47]:
play()

Sticks left: 21
Your move (1–3): 3
AI takes 2 stick(s).
Sticks left: 16
Your move (1–3): 2
AI takes 2 stick(s).
Sticks left: 12
Your move (1–3): 2
AI takes 1 stick(s).
Sticks left: 9
Your move (1–3): 3
AI takes 1 stick(s).
Sticks left: 5
Your move (1–3): 2
AI takes 3 stick(s).
AI took the last stick. You win!


## 🎉 Summary
You just trained an agent to play a game using trial-and-error. That’s the magic of Reinforcement Learning!