<a href="https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Machine%20Learning%20Interview%20Prep%20Questions/Reinforcement%20Learning%20Algorithms/Q-Learning/q_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Q-Learning from Scratch

This notebook demonstrates how to implement the Q-Learning algorithm **from scratch** without using any external reinforcement learning libraries. It uses a simple 4x4 grid world to teach the core concepts behind Q-Learning.

---

## What is Q-Learning?

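Q-Learning is a **model-free, off-policy, value-based** reinforcement learning algorithm. The agent learns an action-value function `Q(s, a)`, an estimate of the expected discounted return from taking action `a` in state `s` and acting greedily afterwards. It then picks **optimal actions** by choosing the highest-valued action in each state.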

### Q-Learning Update Rule (Bellman Equation)

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$


- `s`: current state
- `a`: action taken
- `r`: reward received
- `s'`: next state
- `a'`: candidate action in the next state (the max is taken over all of them)
- `α`: learning rate
- `γ`: discount factor
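
To make the update concrete, here is a minimal sketch of a single update step with made-up numbers. The helper name `q_update` and the example Q-values are illustrative assumptions; `α = 0.1` and `γ = 0.99` match the hyperparameters this notebook sets for training.

```python
import numpy as np

def q_update(q_row, action, reward, next_q_row, alpha=0.1, gamma=0.99):
    """Return the updated Q(s, a) after one Q-learning step (hypothetical helper)."""
    td_target = reward + gamma * np.max(next_q_row)   # r + γ max_a' Q(s', a')
    td_error = td_target - q_row[action]              # the bracketed term in the rule
    return q_row[action] + alpha * td_error

# Illustrative values, not taken from the trained table
q_s = np.array([0.0, 0.5, 0.2, 0.1])       # Q(s, ·)
q_next = np.array([0.3, 0.8, 0.4, 0.0])    # Q(s', ·)
new_value = q_update(q_s, action=1, reward=-0.01, next_q_row=q_next)
print(round(new_value, 4))                 # 0.5282
```

The target here is `-0.01 + 0.99 * 0.8 = 0.782`, so the old estimate of `0.5` moves one tenth of the way toward it.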


## Define Environment
We'll use a 4x4 grid. The agent starts at the top-left cell `(0,0)` and the goal is the bottom-right cell `(3,3)`.



In [1]:
import numpy as np

# Grid size
n_rows, n_cols = 4, 4
state_space = n_rows * n_cols
action_space = 4  # up, down, left, right

# Define actions
actions = {
    0: (-1, 0),  # up
    1: (1, 0),   # down
    2: (0, -1),  # left
    3: (0, 1)    # right
}

# Goal state position
goal_state = (3, 3)

# Convert (row, col) to state index
def pos_to_state(row, col):
    return row * n_cols + col

# Convert state index to (row, col)
def state_to_pos(state):
    return divmod(state, n_cols)
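
As a quick check that the two conversion helpers are inverses of each other, the sketch below round-trips every cell of the grid. It redefines the helpers so that it runs on its own; the variable `round_trip_ok` is an illustrative name.

```python
n_rows, n_cols = 4, 4

def pos_to_state(row, col):
    return row * n_cols + col

def state_to_pos(state):
    return divmod(state, n_cols)

# Every cell should survive the (row, col) -> index -> (row, col) round trip
round_trip_ok = all(
    state_to_pos(pos_to_state(r, c)) == (r, c)
    for r in range(n_rows)
    for c in range(n_cols)
)
print(round_trip_ok)           # True
print(pos_to_state(3, 3))      # 15, the goal state index
print(state_to_pos(6))         # (1, 2)
```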


## Define Environment Dynamics
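The `step` function moves the agent, clamps the new position to the grid bounds (so moving into a wall leaves the agent in place), and returns a reward of `1` for reaching the goal and `-0.01` for every other step.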


In [2]:
def step(state, action):
    row, col = state_to_pos(state)
    dr, dc = actions[action]

    # Move
    new_row = min(max(row + dr, 0), n_rows - 1)
    new_col = min(max(col + dc, 0), n_cols - 1)

    new_state = pos_to_state(new_row, new_col)

    # Reward function
    reward = 1 if (new_row, new_col) == goal_state else -0.01
    done = (new_row, new_col) == goal_state

    return new_state, reward, done
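
The three cases handled by `step` (bumping into a wall, an ordinary move, and reaching the goal) can be exercised directly. This is a self-contained sketch that repeats the environment in compact form; the result names `wall_bump`, `normal_move`, and `reach_goal` are illustrative.

```python
n_rows, n_cols = 4, 4
actions = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right
goal_state = (3, 3)

def step(state, action):
    row, col = divmod(state, n_cols)
    dr, dc = actions[action]
    new_row = min(max(row + dr, 0), n_rows - 1)   # clamp to the grid
    new_col = min(max(col + dc, 0), n_cols - 1)
    done = (new_row, new_col) == goal_state
    reward = 1 if done else -0.01
    return new_row * n_cols + new_col, reward, done

wall_bump = step(0, 0)      # move up from the top-left corner
normal_move = step(0, 3)    # move right from the top-left corner
reach_goal = step(14, 3)    # move right from (3, 2) into the goal
print(wall_bump)            # (0, -0.01, False)
print(normal_move)          # (1, -0.01, False)
print(reach_goal)           # (15, 1, True)
```

Note that a wall bump still costs `-0.01`, so the agent is mildly penalized for wasting a move.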

## Initialize Q-Table
Each row is a state (indexed as `row * n_cols + col`) and each column is an action (up, down, left, right). All values start at zero.


In [3]:
q_table = np.zeros((state_space, action_space))
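
A short sketch of how the table is indexed, assuming the same 4x4 layout. It also shows one consequence of the all-zero start: `np.argmax` returns the first index on ties, so the greedy choice is "up" until some value has been updated.

```python
import numpy as np

n_rows, n_cols = 4, 4
q_table = np.zeros((n_rows * n_cols, 4))

state = 2 * n_cols + 3            # cell (2, 3) -> state index 11
print(q_table.shape)              # (16, 4)
print(q_table[state])             # Q-values for up, down, left, right
greedy_action = int(np.argmax(q_table[state]))
print(greedy_action)              # 0: argmax returns the first index on ties
```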

## Hyperparameters for Training


In [4]:
alpha = 0.1          # learning rate
gamma = 0.99         # discount factor
epsilon = 1.0        # exploration probability
epsilon_decay = 0.995
epsilon_min = 0.01
episodes = 500
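
It can help to see what these settings imply for exploration. The sketch below computes the exploration rate left after the last episode and how many episodes the multiplicative decay would need to reach `epsilon_min`. The names `final_epsilon` and `episodes_to_floor` are illustrative.

```python
import math

epsilon, epsilon_decay, epsilon_min, episodes = 1.0, 0.995, 0.01, 500

# Exploration rate after the last training episode
final_epsilon = max(epsilon_min, epsilon * epsilon_decay ** episodes)

# Episodes needed before the decayed value reaches the floor
episodes_to_floor = math.ceil(math.log(epsilon_min / epsilon) / math.log(epsilon_decay))

print(round(final_epsilon, 3))
print(episodes_to_floor)
```

With these values the agent still explores about 8% of the time at the end of training, and the floor of `0.01` would only be reached after a little over 900 episodes.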

## Q-Learning Training Loop
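The agent follows an ε-greedy policy: with probability `epsilon` it takes a random action, otherwise it takes the action with the highest Q-value. After each episode `epsilon` is decayed, so the agent gradually shifts from exploration to exploitation.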


In [5]:
for episode in range(episodes):
    state = pos_to_state(0, 0)  # Start from top-left
    done = False

    while not done:
        # ε-greedy action selection
        if np.random.rand() < epsilon:
            action = np.random.choice(action_space)
        else:
            action = np.argmax(q_table[state])

        # Take step
        next_state, reward, done = step(state, action)

        # Q-value update
        td_target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (td_target - q_table[state, action])

        state = next_state

    # Decay exploration
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
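
The action-selection branch of the loop can also be pulled out into a small function, which makes it easy to test in isolation. This is a sketch, not the notebook's code: the helper `epsilon_greedy` is hypothetical, and it uses `np.random.default_rng` with an explicit seed instead of the global `np.random` functions used in the loop.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # explore: uniform over actions
    return int(np.argmax(q_row))               # exploit: highest Q-value

rng = np.random.default_rng(0)                 # seed chosen only for reproducibility
q_row = np.array([0.1, 0.7, 0.3, 0.2])         # illustrative Q-values

greedy_picks = {epsilon_greedy(q_row, 0.0, rng) for _ in range(100)}
random_picks = {epsilon_greedy(q_row, 1.0, rng) for _ in range(1000)}
print(greedy_picks)    # only action 1 when epsilon is 0
print(random_picks)    # all four actions appear when epsilon is 1
```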

## Final Q-Table
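Row `i` holds the learned values for state index `i`, with columns ordered up, down, left, right. No random seed is set, so the exact numbers differ from run to run.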


In [6]:
np.set_printoptions(precision=2)
q_table

array([[0.87, 0.85, 0.87, 0.9 ],
       [0.85, 0.92, 0.86, 0.89],
       [0.67, 0.94, 0.37, 0.31],
       [0.16, 0.62, 0.1 , 0.22],
       [0.36, 0.92, 0.46, 0.54],
       [0.87, 0.94, 0.81, 0.91],
       [0.63, 0.96, 0.75, 0.72],
       [0.26, 0.92, 0.55, 0.42],
       [0.49, 0.62, 0.63, 0.94],
       [0.87, 0.94, 0.91, 0.96],
       [0.93, 0.98, 0.94, 0.96],
       [0.51, 1.  , 0.6 , 0.73],
       [0.38, 0.5 , 0.43, 0.89],
       [0.65, 0.73, 0.59, 0.98],
       [0.95, 0.98, 0.94, 1.  ],
       [0.  , 0.  , 0.  , 0.  ]])
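
As a sanity check on the table above, the value of the best action can be computed by hand for this grid, because the shortest path to the goal has a known length (the Manhattan distance). The sketch assumes the rewards and discount used in this notebook (`1` at the goal, `-0.01` per other step, `γ = 0.99`); `optimal_value` is a hypothetical helper.

```python
def optimal_value(steps_to_goal, gamma=0.99, step_reward=-0.01, goal_reward=1.0):
    """Discounted return of a shortest path that needs `steps_to_goal` moves."""
    value = goal_reward                      # the final move earns the goal reward
    for _ in range(steps_to_goal - 1):       # each earlier move pays the step penalty
        value = step_reward + gamma * value
    return value

for steps in (1, 2, 3, 6):
    print(steps, round(optimal_value(steps), 2))
```

This prints `1.0`, `0.98`, `0.96`, and `0.9`, which agree with the learned entries for state 14 (right), state 10 (down), state 9 (right), and state 0 (right) respectively.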

## Derive Optimal Policy
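Take the `argmax` over each row of the Q-table and print it as an arrow to see which action the agent prefers in each state. The arrow in the goal cell (bottom-right) carries no meaning: the goal is terminal and never updated, so its Q-values are all zero and `argmax` returns index 0 (up).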


In [7]:
policy = np.array([np.argmax(q_table[s]) for s in range(state_space)])
policy = policy.reshape((n_rows, n_cols))

# Map to arrows
arrow_map = {0: '↑', 1: '↓', 2: '←', 3: '→'}
for row in policy:
    print(' '.join([arrow_map[a] for a in row]))

→ ↓ ↓ ↓
↓ ↓ ↓ ↓
→ → ↓ ↓
→ → → ↑
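
Another way to inspect a policy is to follow it from the start cell and record the visited states. The sketch below is self-contained and uses a hand-built table (`demo_q`) as a stand-in for a trained one. The helper `greedy_rollout` and its `max_steps` cap are assumptions added for illustration; the cap guards against policies that loop.

```python
import numpy as np

n_rows, n_cols = 4, 4
actions = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right
goal_index = n_rows * n_cols - 1

def greedy_rollout(q_table, start=0, max_steps=50):
    """Follow argmax actions from `start`; stop at the goal or after max_steps."""
    path, state = [start], start
    for _ in range(max_steps):
        if state == goal_index:
            break
        row, col = divmod(state, n_cols)
        dr, dc = actions[int(np.argmax(q_table[state]))]
        row = min(max(row + dr, 0), n_rows - 1)
        col = min(max(col + dc, 0), n_cols - 1)
        state = row * n_cols + col
        path.append(state)
    return path

# Hand-built stand-in for a trained table: prefer "right", but "down" in the last column
demo_q = np.zeros((n_rows * n_cols, 4))
demo_q[:, 3] = 0.5
demo_q[n_cols - 1::n_cols, 1] = 1.0
path = greedy_rollout(demo_q)
print(path)              # [0, 1, 2, 3, 7, 11, 15]
print(len(path) - 1)     # 6 moves, the Manhattan distance from (0,0) to (3,3)
```

Passing the trained `q_table` instead of `demo_q` would trace the path implied by the arrows printed above.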


## Summary

- Implemented Q-Learning **without classes**
- Built a custom grid environment from scratch
- Visualized the learned policy
- Covered ε-greedy exploration, a simple reward design (goal reward plus a small step penalty), and the Q-Learning update

Use this as a foundational notebook for understanding **how agents learn** through trial, error, and reward feedback!
