Reinforcement Learning (RL) is a branch of machine learning that focuses on how agents can learn to make decisions through trial and error to maximize cumulative rewards. 

RL allows machines to learn by interacting with an environment and receiving feedback based on their actions. 

This feedback comes in the form of rewards or penalties.

![Reinforcement Learning in ML](https://media.geeksforgeeks.org/wp-content/uploads/20250903150649221420/Reinforecement-Learning-in-ML.webp)


Reinforcement Learning is based on the concept of a learner, called an agent, exploring and acting within a setting known as the environment to accomplish a specific objective.

As the agent takes steps or makes choices, it receives signals from the environment that indicate how well it’s doing.

Over repeated interactions, the agent uses this feedback to refine its strategy and make better decisions.

**Agent:** The intelligent entity that makes decisions or takes actions.  

**Environment:** The external world or system the agent interacts with.  

**State:** The current situation or context the agent finds itself in at any moment.  

**Action:** A move or decision the agent can carry out in a given state.  

**Reward:** A numerical signal the environment sends back to the agent after each action. It reflects the immediate consequence of that action and guides the agent toward desirable behavior.

Core components of Reinforcement Learning

1. Policy

    Defines the agent’s behavior, i.e., maps states to actions.
    Can be simple rules or complex computations.
    Example: An autonomous car's policy maps the state "pedestrian detected" to the action "stop".

2. Reward Signal

    Represents the goal of the RL problem.
    Guides the agent by providing feedback (positive/negative rewards).
    Example: For self-driving cars, rewards can be given for fewer collisions, shorter travel time and good lane discipline.

3. Value Function

    Evaluates long-term benefits, not just immediate rewards.
    Measures desirability of a state considering future outcomes.
    Example: A vehicle may avoid reckless maneuvers (short-term gain) to maximize overall safety and efficiency.

4. Model

    Simulates the environment to predict outcomes of actions.
    Enables planning and foresight.
    Example: Predicting other vehicles’ movements to plan safer routes.
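
To make the four components concrete, here is a minimal sketch on a made-up problem: a corridor of five cells with the goal at the right end. The corridor, the reward values and the discount of 0.9 are assumptions chosen for illustration, not part of any library.

```python
# Hypothetical toy problem: a corridor of 5 cells (0..4), goal at the right end.
GOAL = 4
GAMMA = 0.9  # assumed discount applied to future rewards

def policy(state: int) -> str:
    """Policy: maps a state to an action (here, a fixed rule)."""
    return "right"

def reward(state: int, action: str, next_state: int) -> float:
    """Reward signal: +1 for reaching the goal, a small cost for every other step."""
    return 1.0 if next_state == GOAL else -0.1

def model(state: int, action: str) -> int:
    """Model: predicts the next state for a given state and action."""
    step = 1 if action == "right" else -1
    return min(max(state + step, 0), GOAL)

def value(state: int) -> float:
    """Value: discounted sum of rewards when following the policy from `state`."""
    total, discount = 0.0, 1.0
    while state != GOAL:
        action = policy(state)
        next_state = model(state, action)
        total += discount * reward(state, action, next_state)
        discount *= GAMMA
        state = next_state
    return total

for s in range(GOAL + 1):
    print(f"state {s}: value {value(s):.3f}")
```

States closer to the goal get a higher value even though every non-goal step has the same immediate reward, which is the difference between the reward signal and the value function.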

Working of Reinforcement Learning

The agent interacts iteratively with its environment in a feedback loop:

    The agent observes the current state of the environment.

    It chooses and performs an action based on its policy.
    
    The environment responds by transitioning to a new state and providing a reward (or penalty).
    
    The agent updates its knowledge (policy, value function) based on the reward received and the new state.
    
    This cycle repeats with the agent balancing exploration (trying new actions) and exploitation (using known good actions) to maximize the cumulative reward over time.
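
The loop above can be sketched in a few lines. This is an illustrative example only: `TwoArmedBandit` is a hypothetical one-state environment (so the "new state" is always the same), and the payoff probabilities, step count and exploration rate are assumed values.

```python
import random

class TwoArmedBandit:
    """Hypothetical one-state environment: action 1 pays off more often than action 0."""
    PAYOFF_PROB = (0.3, 0.7)

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def reset(self) -> int:
        return 0  # the only state

    def step(self, action: int):
        reward = 1.0 if self.rng.random() < self.PAYOFF_PROB[action] else 0.0
        next_state = 0
        return next_state, reward

def run(steps: int = 2000, epsilon: float = 0.1, seed: int = 0):
    env = TwoArmedBandit(seed)
    rng = random.Random(seed + 1)
    estimates = [0.0, 0.0]  # the agent's knowledge: mean reward seen per action
    counts = [0, 0]
    total_reward = 0.0
    state = env.reset()                      # 1. observe the current state
    for _ in range(steps):
        if rng.random() < epsilon:           # 2. choose an action: explore ...
            action = rng.randrange(2)
        else:                                #    ... or exploit the best estimate
            action = max(range(2), key=lambda a: estimates[a])
        state, reward = env.step(action)     # 3. environment returns new state and reward
        counts[action] += 1                  # 4. update knowledge (incremental mean)
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward

estimates, counts, total_reward = run()
print("estimated payoff per action:", [round(e, 2) for e in estimates])
print("times each action was chosen:", counts)
```

Without the exploration branch the agent would never try action 1, because both estimates start at zero and ties go to action 0.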

This process is mathematically framed as a **Markov Decision Process (MDP)** where future states depend only on the current state and action, not on the prior sequence of events.

Markov Decision Process

A **Markov Decision Process (MDP)** is a way to describe how a decision-making agent, such as a robot or game character, moves through different situations while trying to achieve a goal. An MDP describes the problem in terms of the environment's states, the agent’s actions and the rewards that follow, so that the best next action can be worked out. It helps us answer questions like:

    What actions should the agent take?
    What happens after an action?
    Is the result good or bad?

In artificial intelligence, Markov Decision Processes (MDPs) are used to model situations where decisions are made one after another and the results of actions are uncertain.

They help in designing agents that need to work in environments where each action might lead to different outcomes.

An MDP has five main parts:

1. **States (S)** – All possible situations the environment can be in. The agent observes the current state to decide what to do next.

2. **Actions (A)** – The set of choices or moves available to the agent in each state.

3. **Transition Probabilities (P)** – The likelihood of ending up in a new state after taking an action in a given state. Formally:  
   \( P(s' \mid s, a) = \text{probability of moving to state } s' \text{ from state } s \text{ after action } a \).

4. **Reward Function (R)** – The immediate feedback (a number) the agent receives after performing an action in a state. It tells the agent how good or bad the outcome was. Often written as \( R(s, a) \) or \( R(s, a, s') \).

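5. **Discount Factor (γ)** – A value between 0 and 1 that determines how much the agent values future rewards compared to immediate ones: a reward received k steps in the future is weighted by γ^k. A γ close to 1 makes the agent plan for the long term, while a γ close to 0 makes it focus on immediate rewards.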

Together, these form the MDP tuple: **⟨S, A, P, R, γ⟩**.  
This framework allows an agent to learn an optimal **policy**, a rule for selecting actions that maximizes the expected cumulative (discounted) reward over time.
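
As a rough illustration of how the tuple leads to a policy, the sketch below writes a tiny MDP as a dictionary and solves it with value iteration. Both the MDP (three states, actions `go` and `stay`, an 80% chance that `go` succeeds) and the choice of value iteration are assumptions made for this example. It is one standard way to solve a small, fully known MDP, not the method used later in this article.

```python
# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
# States 0 and 1 are ordinary states; state 2 is terminal (no actions).
P = {
    0: {"go": [(0.8, 1, 0.0), (0.2, 0, 0.0)], "stay": [(1.0, 0, 0.0)]},
    1: {"go": [(0.8, 2, 1.0), (0.2, 1, 0.0)], "stay": [(1.0, 1, 0.0)]},
    2: {},
}
GAMMA = 0.9  # discount factor

def action_value(V, outcomes, gamma):
    """Expected return of one action: sum over outcomes of p * (r + gamma * V[s'])."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)

def value_iteration(P, gamma, tol=1e-10):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:  # terminal state keeps value 0
                continue
            best = max(action_value(V, o, gamma) for o in actions.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values
    policy = {
        s: max(actions, key=lambda a: action_value(V, actions[a], gamma))
        for s, actions in P.items() if actions
    }
    return V, policy

V, policy = value_iteration(P, GAMMA)
print({s: round(v, 3) for s, v in V.items()})
print(policy)
```

Here `go` is optimal in both non-terminal states, and state 1 is worth more than state 0 because it is one step closer to the reward.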

Applications

Markov Decision Processes are useful in many real-life situations where decisions must be made step-by-step under uncertainty. Here are some applications:

**Robots and Machines:** Robots use MDPs to decide how to move safely and efficiently in places like factories or warehouses and avoid obstacles.

**Game Strategy:** In board games or video games, MDPs help characters choose the best moves to win or complete tasks even when outcomes are not certain.

**Healthcare:** Doctors can use MDPs to plan treatments for patients, choosing actions that improve health while accounting for uncertain effects.

**Traffic and Navigation:** Self-driving cars or delivery vehicles use MDPs to find safe routes and avoid accidents on unpredictable roads.

**Inventory Management:** Stores and warehouses use MDPs to decide when to order more stock so that they neither run out nor hold too much, even when demand changes.

**Implementing Reinforcement Learning**

Let's see how reinforcement learning works on a small grid-world example, FrozenLake, using Q-learning:

In [22]:
# Import libraries
import gymnasium as gym
import numpy as np
import time

In [23]:
# Create FrozenLake environment (4x4 grid)
env = gym.make('FrozenLake-v1', is_slippery=False)  # Deterministic for simplicity

# Initialize Q-table: Q[state, action] = value
num_states = env.observation_space.n
num_actions = env.action_space.n
Q = np.zeros((num_states, num_actions))

# Hyperparameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 0.9    # Initial exploration rate (decayed after each episode)
episodes = 10000


In [24]:
# Q-Learning algorithm
for episode in range(episodes):
    state, _ = env.reset()
    done = False
    
    while not done:
        # Epsilon-greedy action selection
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state])        # Exploit
        
        # Take action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        
        # Q-learning update
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        
        state = next_state

    # Decay epsilon (less exploration over time)
    epsilon = max(0.01, epsilon * 0.999)

# Show the training results
print("Training completed!")
print("\nLearned Q-table:")
print(Q.round(2))

# Test the learned policy by always taking the greedy (highest-Q) action
# (env was created without a render_mode, so env.render() would draw nothing here)
print("\nTesting policy:")
state, _ = env.reset()
done = False
total_reward = 0

while not done:
    action = np.argmax(Q[state])
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"\nTotal reward: {total_reward}")
env.close()


Training completed!

Learned Q-table:
[[0.94 0.95 0.93 0.94]
 [0.94 0.   0.51 0.85]
 [0.75 0.26 0.03 0.28]
 [0.19 0.   0.01 0.  ]
 [0.95 0.96 0.   0.94]
 [0.   0.   0.   0.  ]
 [0.   0.91 0.   0.09]
 [0.   0.   0.   0.  ]
 [0.96 0.   0.97 0.95]
 [0.96 0.98 0.98 0.  ]
 [0.88 0.99 0.   0.75]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.98 0.99 0.97]
 [0.98 0.99 1.   0.98]
 [0.   0.   0.   0.  ]]

Testing policy:

Total reward: 1
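
The training loop above multiplies `epsilon` by 0.999 after every episode and never lets it drop below 0.01. A quick sketch, using only those three numbers from the training cell, shows how long the agent keeps exploring heavily:

```python
# Values taken from the training cell: start, floor and per-episode decay
epsilon, floor, decay = 0.9, 0.01, 0.999

episodes_to_floor = 0
while epsilon > floor:
    epsilon = max(floor, epsilon * decay)  # same update as in the training loop
    episodes_to_floor += 1

print(f"epsilon reaches {floor} after {episodes_to_floor} episodes")
```

The floor is reached after roughly 4,500 of the 10,000 episodes, so the first half of training is mostly exploration and the second half is almost entirely exploitation.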


In [25]:
# Test with SLOW animation
env_test = gym.make('FrozenLake-v1', is_slippery=False, render_mode='human')
state, _ = env_test.reset()
done = False
total_reward = 0
path = [state]

while not done:
    action = np.argmax(Q[state])
    state, reward, terminated, truncated, _ = env_test.step(action)
    path.append(state)
    total_reward += reward
    done = terminated or truncated
    
    time.sleep(0.9)  # Pause for 0.9 seconds between steps

env_test.close()
print(f"\nTotal reward: {total_reward}")
print(f"Path taken: {path}")


Total reward: 1
Path taken: [0, 4, 8, 9, 13, 14, 15]
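
The path can be cross-checked directly against the Q-table printed earlier. The sketch below copies those rounded values, picks the highest-valued action in every state and follows it across the grid without Gymnasium. It assumes the default 4x4 FrozenLake layout, where states 5, 7, 11 and 12 are holes and state 15 is the goal, which matches the all-zero rows of the table. The helper names `greedy` and `move` are made up for this example.

```python
# Q-table as printed above (rounded to 2 decimals); columns: Left, Down, Right, Up
Q = [
    [0.94, 0.95, 0.93, 0.94], [0.94, 0.00, 0.51, 0.85],
    [0.75, 0.26, 0.03, 0.28], [0.19, 0.00, 0.01, 0.00],
    [0.95, 0.96, 0.00, 0.94], [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.91, 0.00, 0.09], [0.00, 0.00, 0.00, 0.00],
    [0.96, 0.00, 0.97, 0.95], [0.96, 0.98, 0.98, 0.00],
    [0.88, 0.99, 0.00, 0.75], [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00], [0.00, 0.98, 0.99, 0.97],
    [0.98, 0.99, 1.00, 0.98], [0.00, 0.00, 0.00, 0.00],
]
ARROWS = "←↓→↑"
TERMINAL = {5, 7, 11, 12, 15}  # holes and goal (assumed default 4x4 map)

def greedy(state: int) -> int:
    """Index of the first maximum in the row, like np.argmax."""
    row = Q[state]
    return row.index(max(row))

def move(state: int, action: int) -> int:
    """Deterministic grid move (is_slippery=False); walls keep the agent in place."""
    r, c = divmod(state, 4)
    if action == 0:
        c = max(c - 1, 0)
    elif action == 1:
        r = min(r + 1, 3)
    elif action == 2:
        c = min(c + 1, 3)
    else:
        r = max(r - 1, 0)
    return r * 4 + c

# Print the greedy policy as a grid of arrows ("·" marks holes and the goal)
for r in range(4):
    cells = []
    for c in range(4):
        s = r * 4 + c
        cells.append("·" if s in TERMINAL else ARROWS[greedy(s)])
    print(" ".join(cells))

# Follow the greedy policy from the start state
state, path = 0, [0]
while state not in TERMINAL:
    state = move(state, greedy(state))
    path.append(state)
print(path)  # → [0, 4, 8, 9, 13, 14, 15]
```

In state 9 the rounded values for Down and Right are tied at 0.98; taking the first maximum picks Down, which agrees with the recorded path.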


What This Does:
1. **Environment**: `FrozenLake-v1`  
   - 4x4 grid with **S** (start), **F** (frozen), **H** (hole), **G** (goal)  
   - Agent must reach **G** without falling into **H**  
   - Actions: `0=Left`, `1=Down`, `2=Right`, `3=Up`

2. **Algorithm**: **Q-Learning**  
   - Learns a `Q-table` mapping `(state, action) → expected future reward`
   - Uses **epsilon-greedy** exploration
   - Updates Q-values using:  
     \( Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \)

3. **Output**:  
   - Trained Q-table  
   - Total reward and the list of states visited during the test run, with an animated replay in the last cell (created with `render_mode='human'`)
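
To see the update rule in numbers, here is a small worked sketch that uses the same `alpha = 0.1` and `gamma = 0.99` as the training cell. The function name `q_update` is made up for this example. The first case is the step from state 14 into the goal: the reward is 1 and the goal is terminal, so the best next Q-value is 0.

```python
ALPHA, GAMMA = 0.1, 0.99  # learning rate and discount factor from the training cell

def q_update(q_sa: float, reward: float, max_q_next: float) -> float:
    """One Q-learning update for a single (state, action) pair."""
    td_target = reward + GAMMA * max_q_next
    return q_sa + ALPHA * (td_target - q_sa)

# Repeatedly stepping into the goal: Q moves 10% of the way toward 1.0 each time
q = 0.0
history = []
for _ in range(3):
    q = q_update(q, reward=1.0, max_q_next=0.0)
    history.append(round(q, 3))
print(history)  # → [0.1, 0.19, 0.271]

# One step earlier (state 13, Right): no reward, but the next state is worth 1.0,
# so 0.99 is a fixed point of the update
print(q_update(0.99, reward=0.0, max_q_next=1.0))
```

This is why the learned Q-table shows 1.0 for the last move into the goal and values shrinking by a factor of 0.99 (0.99, 0.98, 0.97, ...) for each extra step of distance along the path.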
