The parameters of the Q-learning algorithm are also taken as user input.

• Exploration rate (ϵ): user input (example values: 0.3, 0.5, 0.7)

• Learning rate (α): user input (example: 0.3)

• Discount factor (γ): user input (example: 0.99)

• Number of training episodes: user input (example: 10,000 or more)
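
For reference, the parameters above enter the standard tabular Q-learning update, written here to match the update line in the `training` function of this notebook. The symbols $s'$ (next state) and $r$ (reward received for the move) are notation introduced here for illustration; ϵ does not appear in the update itself, it only controls how the action $a$ is chosen (random with probability ϵ, greedy otherwise).

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```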

1. Implement a Q-learning agent in an n × n grid world with user-defined
obstacles.

  • Example: 5 × 5 grid with obstacles at (1, 1), (1, 3), (3, 1), (3, 3)

2. Train the agent for 10,000 episodes (or as per user input) for each of the given values of ϵ (example: 0.3, 0.5, 0.7).

3. Track and compute the cumulative reward obtained in each case of ϵ during
training.

In [11]:
import numpy as np
import random
import builtins

In [14]:
# Validate the user-supplied start, goal, and obstacle positions
def Validation(start_s, goal_s, obstacle, grid):
    def in_bounds(pos):
        return 0 <= pos[0] < grid and 0 <= pos[1] < grid

    # Start and goal must lie inside the grid
    if not in_bounds(start_s) or not in_bounds(goal_s):
        raise ValueError("Wrong input: start or goal is outside the grid")

    # Start and goal must be different cells
    if start_s == goal_s:
        raise ValueError("Wrong input: start and goal are the same cell")

    # No obstacle may sit on the start or goal cell
    for obs in obstacle:
        if obs == start_s or obs == goal_s:
            raise ValueError("Wrong input: obstacle overlaps start or goal")

    return True

In [6]:
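def training(n, start, goal, obs, alpha, gamma, epsilon, eps, goal_reward, obs_pen, step_pen):
    """Train a tabular Q-learning agent for `eps` episodes.

    Returns the Q-table and the reward summed over all episodes.
    Relies on the global `actions` mapping defined in a later cell.
    """
    # Q-table: one value per (row, column, action)
    q_table = np.zeros((n, n, 4))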
    cumulative_reward = 0
    obs_set = set(obs)
    for _ in range(eps):
        state = start
        episode_reward = 0
        steps = 0
        while state != goal:
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit
            if random.random() < epsilon:
                action = random.randint(0, 3)
            else:
                action = int(np.argmax(q_table[state[0], state[1]]))

            dr, dc = actions[action]
            next_r = max(0, min(n - 1, state[0] + dr))
            next_c = max(0, min(n - 1, state[1] + dc))
            next_state = (next_r, next_c)

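            if next_state in obs_set:
                # Obstacle: apply the penalty and stay in the current cell
                reward = obs_pen
                next_state = state
            elif next_state == state:
                # Wall: the move was clipped, so the agent did not move
                reward = obs_pen
            elif next_state == goal:
                reward = goal_reward
            else:
                reward = step_pen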

            episode_reward += reward

            # Q-learning update: move Q(s, a) toward reward + gamma * max Q(s', a')
            best_next_q = np.max(q_table[next_state[0], next_state[1]])
            q_table[state[0], state[1], action] += alpha * (
                reward + gamma * best_next_q - q_table[state[0], state[1], action]
            )

            state = next_state
            steps += 1
        cumulative_reward += episode_reward

    return q_table, cumulative_reward

In [7]:
def action_grid(q_table, n, goal, obs):
    grid = [[' ' for i in range(n)] for j in range(n)]
    obs_set = set(obs)
    for i in range(n):
        for j in range(n):
            pos = (i, j)
            if pos == goal:
                grid[i][j] = 'G'
            elif pos in obs_set:
                grid[i][j] = 'X'
            else:
                best_action = np.argmax(q_table[i, j])
                grid[i][j] = action_symbols[best_action]
    return grid

In [8]:
def print_q_table(q_table, n):
    for i in range(n):
        for j in range(n):
            print(f"State ({i},{j}): {q_table[i, j]}")

In [18]:
# Action mappings
actions = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # Up, Right, Down, Left
action_symbols = {0: '↑', 1: '→', 2: '↓', 3: '←'}
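
To illustrate how this mapping is applied, the following minimal sketch reproduces the clipping rule used inside `training`. The helper name `move` and the constant `ACTIONS` are hypothetical and exist only for this example; the notebook itself performs the same arithmetic inline.

```python
# Hypothetical helper (not part of the notebook): apply one action with
# the same boundary clipping that `training` uses.
ACTIONS = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # Up, Right, Down, Left

def move(state, action, n):
    """Return the cell reached from `state`, clipped to an n x n grid."""
    dr, dc = ACTIONS[action]
    row = max(0, min(n - 1, state[0] + dr))
    col = max(0, min(n - 1, state[1] + dc))
    return (row, col)

print(move((0, 0), 0, 5))  # Up from the top row: the agent stays at (0, 0)
print(move((0, 0), 1, 5))  # Right: the agent moves to (0, 1)
print(move((4, 4), 2, 5))  # Down from the bottom row: the agent stays at (4, 4)
```

A move that leaves the state unchanged is how `training` detects a wall hit and applies the obstacle/wall penalty.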

In [19]:
import numpy as np
import random

# Read user inputs; a blank answer falls back to the default value
n_input = builtins.input("Enter grid size n : ").strip()
n = int(n_input) if n_input else 5

start_input = builtins.input("Enter start state (row col): ").strip()
start = tuple(map(int, start_input.split())) if start_input else (0, 0)

goal_input = builtins.input("Enter goal state (row col): ").strip()
goal = tuple(map(int, goal_input.split())) if goal_input else (n - 1, n - 1)

obstacles_input = builtins.input("Enter obstacles (row1 col1,row2 col2,... or  blank): ").strip()
obs = [tuple(map(int, o.strip().split())) for o in obstacles_input.split(',') if o.strip()] if obstacles_input else [(1, 1), (1, 3), (3, 1), (3, 3)]

goal_reward_input = builtins.input("Reward for reaching the goal : ").strip()
goal_reward = float(goal_reward_input) if goal_reward_input else 100.0

obs_pen_input = builtins.input("Penalty for hitting an obstacle/wall : ").strip()
obs_pen = float(obs_pen_input) if obs_pen_input else -100.0

step_pen_input = builtins.input("Penalty for every normal step : ").strip()
step_pen = float(step_pen_input) if step_pen_input else -1.0

alpha_input = builtins.input("Learning rate : ").strip()
alpha = float(alpha_input) if alpha_input else 0.3

gamma_input = builtins.input("Discount factor : ").strip()
gamma = float(gamma_input) if gamma_input else 0.99

eps_input = builtins.input("Num of training episodes : ").strip()
eps = int(eps_input) if eps_input else 10000

epsilons_input = builtins.input("Enter exploration rates separated by commas : ").strip()
epsilons = [float(e.strip()) for e in epsilons_input.split(',') if e.strip()] if epsilons_input else [0.3, 0.5, 0.7]

# Validation stops execution on invalid input, so its return value needs no check
Validation(start, goal, obs, n)
eps_l = []
cumulative_rew = []
results = {}
for ep in epsilons:
    eps_l.append(ep)
    q_table, cumulative_reward = training(n, start, goal, obs, alpha, gamma, ep, eps, goal_reward, obs_pen, step_pen)
    cumulative_rew.append(cumulative_reward)
    best_grid = action_grid(q_table, n, goal, obs)
    results[ep] = {
        "q_table": q_table,
        "best_action_grid": best_grid,
        "cumulative_reward": cumulative_reward,
    }

# Output
for ep, data in results.items():
    print(f"\n Results for epsilon = {ep} ")
    print("Final Q-Value Table:")
    print_q_table(data["q_table"], n)
    print("\nBest Action Grid:")
    for row in data["best_action_grid"]:
        print(' '.join(row))
    print(f"\nTotal Cumulative Reward: {data['cumulative_reward']}")

Enter grid size n : 4 
Enter start state (row col): 0 0
Enter goal state (row col): 3 3
Enter obstacles (row1 col1,row2 col2,... or  blank): 1 1,2 3
Reward for reaching the goal : 150
Penalty for hitting an obstacle/wall : -20
Penalty for every normal step : -1
Learning rate : 0.4
Discount factor : 0.99
Num of training episodes : 10000
Enter exploration rates separated by commas : 0.3,0.4,0.5,0.6,0.7

 Results for epsilon = 0.3 
Final Q-Value Table:
State (0,0): [116.37003735 137.74751247 137.74751247 116.37003735]
State (0,1): [118.74751247 140.1490025  118.74751247 135.37003735]
State (0,2): [121.1490025  137.74751247 142.57475    137.74751247]
State (0,3): [118.74751247 118.74751247 140.1490025  140.1490025 ]
State (1,0): [135.37003735 118.74751247 140.1490025  118.74751247]
State (1,1): [0. 0. 0. 0.]
State (1,2): [140.1490025 140.1490025 145.025     123.57475  ]
State (1,3): [137.74751247 121.1490025  121.1490025  142.57475   ]
State (2,0): [137.74751247 142.57475    142.57475    1

In [20]:
import pandas as pd

# Use a separate loop variable so the comprehension does not shadow `data`
data = {'Epsilon': list(results.keys()), 'Cumulative Reward': [r['cumulative_reward'] for r in results.values()]}

df_results = pd.DataFrame(data)

display(df_results)

   Epsilon  Cumulative Reward
0      0.3          1252120.0
1      0.4          1135920.0
2      0.5           972630.0
3      0.6           738414.0
4      0.7           321872.0
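
The totals are easier to compare as an average reward per episode. The sketch below simply divides each reported total by the 10,000 training episodes entered in the session; the variable names `totals` and `avg_reward` are chosen for this example only.

```python
# Totals copied from the reported results; 10,000 episodes per run
# (the value entered in the example session).
episodes = 10_000
totals = {0.3: 1252120.0, 0.4: 1135920.0, 0.5: 972630.0, 0.6: 738414.0, 0.7: 321872.0}

# Average reward collected per training episode for each exploration rate
avg_reward = {eps: total / episodes for eps, total in totals.items()}

for eps, avg in avg_reward.items():
    print(f"epsilon = {eps}: average reward per episode = {avg:.2f}")
```

Under these numbers the average drops from roughly 125 per episode at ϵ = 0.3 to roughly 32 per episode at ϵ = 0.7.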


# • **How does the exploration rate (ϵ) affect learning?**

# • **Does higher exploration find better paths or slow down convergence?**

Ans 1: The exploration rate (ε) controls how often the agent tries a random move instead of taking the action it currently rates best. In these results, the lowest value (ε = 0.3) gives the highest cumulative reward (1,252,120), because the agent follows its learned path most of the time, bumps into obstacles and walls less often, and reaches the goal in fewer steps. Higher values such as ε = 0.7 mean more random moves, which leads to more collisions and a much lower cumulative reward (321,872) because of the accumulated penalties.



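Ans 2: Higher exploration (ε = 0.7) makes the agent more prone to hitting obstacles and walls, which results in a low cumulative reward and suggests slower convergence during training. Lower exploration (ε = 0.3) suggests quicker convergence on a solid path, though in general it risks missing the best route. The middle value (ε = 0.5) balances the two but still lags behind the lowest ε, which hints that for this 4 × 4 grid less exploration works better. Note that the cumulative reward is collected while the agent is still exploring, so it measures the cost of exploration during training rather than the quality of the final greedy policy.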
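
Whether a given ε actually found a better or worse path can be checked by following the learned greedy policy with exploration switched off. The following is a minimal sketch under that assumption; `greedy_path` and `ACTIONS` are hypothetical names that are not part of the notebook, and the tiny 2 × 2 Q-table is hand-made purely to show the call.

```python
import numpy as np

# Same action encoding as the notebook: Up, Right, Down, Left
ACTIONS = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

def greedy_path(q_table, start, goal, n, obs, max_steps=None):
    """Follow argmax actions from `start`.

    Returns the list of visited cells, or None if the goal is not reached.
    """
    obs_set = set(obs)
    max_steps = max_steps or n * n
    state, path = start, [start]
    for _ in range(max_steps):
        if state == goal:
            return path
        action = int(np.argmax(q_table[state[0], state[1]]))
        dr, dc = ACTIONS[action]
        nxt = (max(0, min(n - 1, state[0] + dr)),
               max(0, min(n - 1, state[1] + dc)))
        if nxt in obs_set or nxt == state:
            return None  # the greedy move is blocked, so the policy is stuck
        state = nxt
        path.append(state)
    return path if state == goal else None

# Hand-made Q-table for a 2 x 2 grid: go Right from (0, 0), then Down from (0, 1)
demo_q = np.zeros((2, 2, 4))
demo_q[0, 0, 1] = 1.0
demo_q[0, 1, 2] = 1.0
print(greedy_path(demo_q, (0, 0), (1, 1), 2, []))
```

In the notebook, a call such as `greedy_path(results[0.3]["q_table"], start, goal, n, obs)` could be repeated for each ε, and `len(path) - 1` would give the number of moves on the learned route, which allows a direct comparison of path length across exploration rates.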

# **Write a short explanation (2–3 paragraphs) discussing how the exploration rate influenced the learning outcomes.**

Ans: The exploration rate (ε) determines how the agent balances trying new moves against reusing what it has already learned. With ε = 0.3 earning the highest cumulative reward, the agent mostly follows its greedy path and avoids obstacles. This quick lock-on to a solid path shows that low exploration works well here, though it could miss a shorter route if it does not explore enough early on.

On the other hand, raising ε to 0.7 sharply reduces the cumulative reward: the random moves accumulate penalties and slow learning down considerably. High exploration could uncover better paths in a more complex environment, but here it mostly disrupts the agent's progress.

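The middle value, ε = 0.5, sits between the two: its cumulative reward (972,630) is higher than at ε = 0.7 but lower than at ε = 0.3, and the cumulative reward alone does not show whether its learned path is the shortest one. For this 4 × 4 grid, keeping ε low at 0.3 appears to be the best choice for quick learning and high cumulative reward.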
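
As a sanity check on the ε = 0.3 run, the printed Q-values for state (0, 0) can be compared with the discounted return of a shortest route. The sketch below uses the rewards and discount factor from the example session and assumes that a shortest route from (0, 0) to (3, 3) takes 6 moves (for instance straight down the left column and then along the bottom row, which avoids both obstacles). The variable names are illustrative only.

```python
# Values taken from the example session: step penalty -1, goal reward 150,
# obstacle/wall penalty -20, discount factor 0.99.
gamma, step_pen, goal_reward, obs_pen = 0.99, -1.0, 150.0, -20.0

# Assumption: a shortest route takes 6 moves, i.e. 5 ordinary steps
# followed by the move that enters the goal.
moves = 6
optimal_return = (
    sum(step_pen * gamma**k for k in range(moves - 1))
    + goal_reward * gamma ** (moves - 1)
)
print(optimal_return)  # compare with the printed Q-value 137.74751247 (Right/Down)

# Bumping into a wall from (0, 0) costs obs_pen and leaves the agent in place.
wall_bump = obs_pen + gamma * optimal_return
print(wall_bump)       # compare with the printed Q-value 116.37003735 (Up/Left)
```

Both figures agree with the printed table to the displayed precision, which suggests that at ε = 0.3 the Q-values for the start state correspond to a 6-move route rather than a longer detour.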