<a href="https://colab.research.google.com/github/Jacobgokul/ML-Playground/blob/main/Reinforcement_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is Reinforcement Learning (RL)?

Reinforcement learning is a type of machine learning in which an agent learns to make decisions by interacting with an environment and receiving rewards or penalties for its actions.

## Key Idea:

Just like a kid learning to ride a cycle:

- Try something → fall → learn

- Try again → better balance → rewarded

In RL:

- Agent = the learner (e.g., AI bot)

- Environment = the world it interacts with

- State = the agent's current situation in the environment

- Action = what the agent does

- Reward = feedback (positive or negative)

- Policy = the strategy used by the agent

- Episode = one complete run of the task, from the starting state until the task ends



## Real Example – Game AI

Let’s say an AI is learning to play a car racing game:

- Every time it stays on track = +1 point

- If it goes off road = -5 points

- If it completes a lap = +10 points

The AI will try different strategies to maximize total reward. Over time, it learns what actions lead to better results.
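As a rough illustration of what "total reward" means here, the sketch below scores one made-up sequence of events using the point values above. The event names and the `REWARDS` table are hypothetical labels chosen for this example, not part of any real game API.

```python
# Hypothetical event names; the point values come from the racing example above.
REWARDS = {"on_track": 1, "off_road": -5, "lap_complete": 10}

def total_reward(events):
    """Sum the reward earned for each event in one episode."""
    return sum(REWARDS[event] for event in events)

# One made-up episode: a few good steps, one mistake, then a finished lap.
episode = ["on_track", "on_track", "off_road", "on_track", "lap_complete"]
print(total_reward(episode))  # 1 + 1 - 5 + 1 + 10 = 8
```

An agent that avoids the single off-road event would score 13 on the same lap, which is the kind of difference it learns to exploit.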


## How RL Works (Simplified Flow):
- Agent observes the state of the environment

- Takes an action

- Environment gives a reward and updates the state

- Agent learns and improves its policy

This loop repeats over many episodes until the agent's policy stops improving or the training budget runs out.
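The four steps above can be sketched as a small loop. This is a minimal illustration on a hypothetical one-dimensional corridor (positions 0 to 3, goal at the right end); the `step` and `run_episode` names are assumptions made for this example, and the learning step is only marked with a comment.

```python
import random

GOAL = 3  # hypothetical corridor: positions 0..3, goal at the right end

def step(state, action):
    """Environment: apply the action, return (new_state, reward, done)."""
    new_state = min(max(state + action, 0), GOAL)
    done = new_state == GOAL
    reward = 10 if done else -1
    return new_state, reward, done

def run_episode(policy, max_steps=100):
    """Run one episode with the given policy and return the total reward."""
    state, total = 0, 0
    for _ in range(max_steps):
        action = policy(state)                     # observe the state, pick an action
        state, reward, done = step(state, action)  # environment returns reward and new state
        total += reward                            # a learning agent would update its policy here
        if done:
            break
    return total

rng = random.Random(0)
print(run_episode(lambda s: rng.choice([-1, 1])))  # random policy: result depends on the seed
print(run_episode(lambda s: 1))                    # always move right: -1 - 1 + 10 = 8
```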


## 📂 Types of Reinforcement Learning

### ✅ 1. Model-Free (no knowledge of how the environment works)

- Learns only from experience

- Most common in games & robotics

🧠 Algorithms:

- Q-Learning (classic)

- Deep Q-Networks (DQN) – Q-learning with neural networks

- SARSA
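To make the difference between Q-learning and SARSA concrete, here is a small sketch of the two update targets. The function names and the Q-values are made up for illustration: Q-learning bootstraps from the best next action, while SARSA bootstraps from the action the agent actually takes next.

```python
import numpy as np

def q_learning_target(reward, next_q_values, gamma):
    """Off-policy target: assumes the best next action will be taken."""
    return reward + gamma * np.max(next_q_values)

def sarsa_target(reward, next_q_values, next_action, gamma):
    """On-policy target: uses the action the agent actually takes next."""
    return reward + gamma * next_q_values[next_action]

next_q = np.array([1.0, 4.0, 2.0])  # made-up Q-values for the next state

# Q-learning uses the maximum: -1 + 0.9 * 4.0 = 2.6
print(q_learning_target(-1, next_q, gamma=0.9))
# SARSA uses the chosen action (index 2 here): -1 + 0.9 * 2.0 = 0.8
print(sarsa_target(-1, next_q, next_action=2, gamma=0.9))
```

The two targets coincide whenever the agent's next action happens to be the greedy one.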

### ✅ 2. Model-Based (learns or knows the environment’s rules)

- Can simulate or plan future actions

🧠 Algorithms:
- Dyna-Q

- Monte Carlo Tree Search (used in AlphaGo)
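As a rough sketch of what "simulate or plan" can look like, the snippet below shows only the planning half of a Dyna-Q style agent. It replays transitions stored in a remembered model and applies the Q-learning update to them, without touching the real environment. The `dyna_q_planning` function, the dictionary-based deterministic model, and the toy numbers are all assumptions made for this illustration.

```python
import random
import numpy as np

def dyna_q_planning(q_table, model, n_planning, alpha=0.1, gamma=0.9, rng=random):
    """Replay n_planning remembered transitions, updating q_table in place."""
    seen = list(model.keys())
    for _ in range(n_planning):
        state, action = rng.choice(seen)             # pick a previously seen (state, action)
        reward, next_state = model[(state, action)]  # ask the model what happened
        target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (target - q_table[state, action])

# Toy setup: 3 states, 2 actions; the model holds the transitions seen so far.
q = np.zeros((3, 2))
model = {(0, 1): (-1, 1), (1, 1): (10, 2)}  # (state, action) -> (reward, next_state)
dyna_q_planning(q, model, n_planning=50, rng=random.Random(0))
print(q)
```

A full Dyna-Q agent would interleave these simulated updates with real steps in the environment, adding each real transition to `model` as it goes.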




---

# Q-learning Code Example

## Problem Setup: Grid World (4x4)

- The agent starts at position (0,0)

- The goal is to reach the bottom-right corner (3,3)

- The agent can move: up, down, left, right

- Each move gives a reward of -1

- A move that reaches the goal gives +10 instead

- Moving into a wall keeps the agent in place (the move still costs -1)

In [None]:
import numpy as np
import random

# Grid size (4x4 matrix)
n_rows = 4
n_cols = 4

# Actions
actions = ['up', 'down', 'left', 'right']
action_dict = {'up': 0, 'down': 1, 'left': 2, 'right': 3}

# Q-table [state_row][state_col][action]
q_table = np.zeros((n_rows, n_cols, len(actions)))

# Parameters
alpha = 0.1       # learning rate
gamma = 0.9       # discount factor
epsilon = 0.2     # exploration factor
episodes = 500

# Reward function
def get_reward(state):
    if state == (3, 3):
        return 10
    else:
        return -1

# Environment transition
def take_action(state, action):
    row, col = state

    if action == 'up':
        row = max(row - 1, 0)
    elif action == 'down':
        row = min(row + 1, n_rows - 1)
    elif action == 'left':
        col = max(col - 1, 0)
    elif action == 'right':
        col = min(col + 1, n_cols - 1)

    return (row, col)

# Training loop
for episode in range(episodes):
    state = (0, 0)

    while state != (3, 3):  # Until it reaches the goal
        # Explore: with probability epsilon, try a random action
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            # Pick best action from Q-table
            action = actions[np.argmax(q_table[state[0], state[1]])]

        new_state = take_action(state, action)
        reward = get_reward(new_state)

        old_q = q_table[state[0], state[1], action_dict[action]]
        next_max = np.max(q_table[new_state[0], new_state[1]])

        # Q-learning formula
        new_q = old_q + alpha * (reward + gamma * next_max - old_q)
        q_table[state[0], state[1], action_dict[action]] = new_q

        state = new_state

print("Training complete! ✅")


Training complete! ✅


In [None]:
state

(3, 3)

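## What’s Going On?

- q_table: Stores the estimated value of each (state, action) pair

- epsilon: Balances exploration vs exploitation; with probability epsilon (0.2 here) the agent takes a random action instead of the best-known one

- gamma: Discount factor; controls how much future rewards count compared with immediate ones (0 ignores the future, values near 1 weigh it heavily)

- alpha: Learning rate; how far each update moves the old Q-value toward the new estimate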
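To see how alpha and gamma enter a single update, here is a small worked sketch of the Q-learning formula from the training cell, applied to a fresh table where every estimate is still 0. The `q_update` helper is a name introduced only for this example.

```python
alpha, gamma = 0.1, 0.9  # same values as in the training cell

def q_update(old_q, reward, next_max):
    """One Q-learning update: move old_q toward reward + gamma * next_max."""
    return old_q + alpha * (reward + gamma * next_max - old_q)

# Ordinary move on a fresh table: 0 + 0.1 * (-1 + 0.9 * 0 - 0) = -0.1
print(q_update(old_q=0.0, reward=-1, next_max=0.0))

# Move that reaches the goal: 0 + 0.1 * (10 + 0.9 * 0 - 0) = 1.0
print(q_update(old_q=0.0, reward=10, next_max=0.0))
```

With alpha = 0.1, each update moves the stored value only 10% of the way toward the new target, which is why training needs many episodes.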

In [None]:
# Follow the greedy (highest-Q) action from (0,0) and record the path
state = (0, 0)
path = [state]
max_steps = n_rows * n_cols  # a loop-free path visits each cell at most once

while state != (3, 3) and len(path) <= max_steps:
    # Choose the best action (greedy)
    best_action_idx = np.argmax(q_table[state[0], state[1]])
    best_action = actions[best_action_idx]

    # Move to the next state
    new_state = take_action(state, best_action)

    # Break if stuck against a wall (safety condition)
    if new_state == state:
        print("Agent is stuck! 🚧")
        break

    path.append(new_state)
    state = new_state

# Print the path
if state == (3, 3):
    print("🏁 Optimal path from (0,0) to (3,3):")
else:
    print("Greedy path did not reach the goal:")
for step in path:
    print(step)


🏁 Optimal path from (0,0) to (3,3):
(0, 0)
(1, 0)
(1, 1)
(2, 1)
(3, 1)
(3, 2)
(3, 3)
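Besides a single path, it can help to look at the greedy action in every cell. The sketch below is one possible way to render that as a grid of arrows. `policy_grid` and the 2x2 `demo_q` table are hypothetical and exist only to show the output format; passing the trained `q_table` with the default goal of (3,3) should work the same way, assuming the action order up, down, left, right used above.

```python
import numpy as np

ARROWS = ['↑', '↓', '←', '→']  # same order as actions: up, down, left, right

def policy_grid(q_table, goal=(3, 3)):
    """Return one string per row showing the greedy action for each cell."""
    best = np.argmax(q_table, axis=2)  # index of the highest-Q action per cell
    lines = []
    for r in range(q_table.shape[0]):
        cells = ['G' if (r, c) == goal else ARROWS[best[r, c]]
                 for c in range(q_table.shape[1])]
        lines.append(' '.join(cells))
    return lines

# Made-up 2x2 Q-table, just to show the output format.
demo_q = np.zeros((2, 2, 4))
demo_q[0, 0, 3] = 1.0  # (0,0): right
demo_q[0, 1, 1] = 1.0  # (0,1): down
demo_q[1, 0, 3] = 1.0  # (1,0): right
print('\n'.join(policy_grid(demo_q, goal=(1, 1))))
```

Because training is random, different runs can produce different arrows while still yielding shortest paths of the same length.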
