**Practical 5: Implement Q-learning for a 4×4 GridWorld where the agent starts at the top-left corner (0,0) and the goal is at the bottom-right corner (3,3). Use an ε-greedy policy for exploration, update the Q-table over multiple episodes, and display the learned optimal policy as a grid of arrows with the goal marked as G.**

Procedure:
1. Define the environment as a 4×4 grid with 16 states and four actions (up, down, left, right).
2. Initialize a Q-table with zeros for all state-action pairs.
3. Set the Q-learning parameters:
   * Learning rate (α)
   * Discount factor (γ)
   * Exploration rate (ε)
   * Number of episodes
4. Start each episode from the initial state (0,0) and choose actions with the ε-greedy policy.
5. Transition to the next state and collect the reward (1 on reaching the goal, 0 otherwise).
6. Update the Q-value of the chosen state-action pair using the Q-learning update rule.
7. Repeat for multiple episodes so that the Q-values along the path to the goal converge.
8. Analyze the learned Q-values to determine the best path to the goal.
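
The update rule named in the procedure can be sketched in isolation. This is a minimal illustration, not the notebook's own code: `q_update` is a hypothetical helper, and its default `alpha` and `gamma` simply mirror the values the notebook sets in its parameter cell.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.9):
    """One Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    (hypothetical helper for illustration)."""
    td_target = reward + gamma * max_next_q
    return q_sa + alpha * (td_target - q_sa)

# Stepping into the goal: reward is 1 and the goal's own Q-values stay 0.
first = q_update(0.0, reward=1, max_next_q=0.0)     # first time the step is taken
second = q_update(first, reward=1, max_next_q=0.0)  # second time
print(round(first, 4), round(second, 4))  # prints: 0.1 0.19
```

Each repetition closes 10% of the remaining gap to the target, which is why the Q-value for the final step approaches 1 only after many episodes.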


In [None]:
import numpy as np
import random

In [None]:
# Grid size
n_rows, n_cols = 4, 4
n_states = n_rows * n_cols
n_actions = 4 # up, down, left, right

In [None]:
#Q-table initialization
Q = np.zeros((n_states, n_actions))

In [None]:
# Parameters
alpha = 0.1 # Learning rate
gamma = 0.9 # Discount factor
epsilon = 0.2 # Exploration rate
episodes = 500 # Number of training episodes

In [None]:
# Action mapping: 0=up, 1=down, 2=left, 3=right
action_map = [(-1,0),(1,0),(0,-1),(0,1)]

In [None]:
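# Helper functions
def state_to_index(row, col):
  # Flatten (row, col) into the row index used by the Q-table
  return row*n_cols + col

def index_to_state(index):
  # Inverse of state_to_index: returns (row, col)
  return divmod(index, n_cols)

def is_terminal(state):
  # The goal is the bottom-right cell, (3,3) on the 4x4 grid
  return state == (n_rows-1, n_cols-1)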
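
As a quick check of the index helpers, the sketch below repeats the two conversion functions so that it runs on its own and confirms that they invert each other on every cell of the 4×4 grid.

```python
n_rows, n_cols = 4, 4

def state_to_index(row, col):
    # Row-major flattening: cell (row, col) -> Q-table row index
    return row * n_cols + col

def index_to_state(index):
    # divmod returns (index // n_cols, index % n_cols), i.e. (row, col)
    return divmod(index, n_cols)

print(state_to_index(0, 0), state_to_index(1, 2), state_to_index(3, 3))  # prints: 0 6 15
print(index_to_state(6))  # prints: (1, 2)

# Round trip over the whole grid
round_trip_ok = all(
    index_to_state(state_to_index(r, c)) == (r, c)
    for r in range(n_rows)
    for c in range(n_cols)
)
print(round_trip_ok)  # prints: True
```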

In [None]:
# Training
for ep in range(episodes):
  row, col = 0, 0 # Start state
  while not is_terminal((row, col)):
    state_idx = state_to_index(row, col)
    # Epsilon-greedy action selection
    if random.random() < epsilon:
      action = random.randrange(n_actions) # Explore: uniform random action
    else:
      action = int(np.argmax(Q[state_idx])) # Exploit: ties go to the lowest action index
    # Next state: a move that would leave the grid is clipped, so the agent stays in place
    d_row, d_col = action_map[action]
    next_row = min(max(row+d_row, 0), n_rows-1)
    next_col = min(max(col+d_col, 0), n_cols-1)
    next_state_idx = state_to_index(next_row, next_col)
    # Reward: 1 on reaching the goal, 0 otherwise
    reward = 1 if is_terminal((next_row, next_col)) else 0
    # Q-learning update; the goal's Q-values are never updated, so they stay 0
    td_target = reward + gamma * np.max(Q[next_state_idx])
    Q[state_idx, action] += alpha * (td_target - Q[state_idx, action])
    # Move to next state
    row, col = next_row, next_col
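
Because `np.argmax` returns the first maximal index, a Q-table row that is still all zeros always yields action 0 (up) in the greedy branch. One common alternative, which the training loop above does not use, is to break ties at random. The sketch below shows that variant; `epsilon_greedy` and its `rng` parameter are hypothetical names introduced for illustration.

```python
import random
import numpy as np

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one Q-table row, breaking ties at random
    (hypothetical helper; not part of the notebook)."""
    if rng.random() < epsilon:
        # Explore: any action with equal probability
        return rng.randrange(len(q_row))
    # Exploit: choose uniformly among all actions sharing the maximum value
    best = np.flatnonzero(q_row == np.max(q_row))
    return int(rng.choice(list(best)))

# With epsilon=0 the choice is purely greedy.
print(epsilon_greedy(np.array([0.0, 0.9, 0.1, 0.0]), epsilon=0.0))  # prints: 1
# On an all-zero row every action is tied, so any of 0..3 may be returned.
print(epsilon_greedy(np.zeros(4), epsilon=0.0))
```

Random tie-breaking changes which states get explored early on, so a run using it would not reproduce the Q-table printed in this practical.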

In [None]:
# Show learned policy
directions = ['↑', '↓', '←', '→']
policy = []
for s in range(n_states):
  row, col = index_to_state(s)
  if is_terminal((row,col)):
    policy.append('G')
  else:
    action = np.argmax(Q[s])
    policy.append(directions[action])

In [None]:
# Display policy
for i in range(n_rows):
  print(policy[i*n_cols:(i+1)*n_cols])

['→', '→', '→', '↓']
['↑', '→', '↑', '↓']
['↑', '↑', '↑', '↓']
['↑', '↑', '↑', 'G']


In [None]:
Q

array([[0.49201478, 0.38082955, 0.51000289, 0.59049   ],
       [0.49501654, 0.34556443, 0.5014202 , 0.6561    ],
       [0.6202932 , 0.52759547, 0.56381745, 0.729     ],
       [0.71013169, 0.81      , 0.62697507, 0.72012899],
       [0.50804763, 0.00768908, 0.08985511, 0.05562385],
       [0.05904899, 0.        , 0.09865821, 0.53871502],
       [0.65607661, 0.12250694, 0.16654828, 0.27845624],
       [0.67894872, 0.9       , 0.55955454, 0.757393  ],
       [0.09711749, 0.        , 0.        , 0.        ],
       [0.01370322, 0.        , 0.        , 0.        ],
       [0.55422089, 0.        , 0.        , 0.        ],
       [0.77323475, 1.        , 0.39050539, 0.87481726],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.0491152 , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ]])

In [None]:
Q.max()

np.float64(0.9999999999999996)
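
The largest entries in the printed Q-table follow a simple pattern. With a reward of 1 only on entering the goal and γ = 0.9, the converged value of an action that leaves the agent k further steps from the goal is γ^k. The sketch below computes these values for the greedy path; the list of (state index, action index) pairs is read off the printed policy and Q-table, and the variable names are illustrative.

```python
gamma = 0.9  # discount factor used in the notebook

# (state index, action index) pairs along the greedy path,
# from the start state to the cell just above the goal
greedy_pairs = [(0, 3), (1, 3), (2, 3), (3, 1), (7, 1), (11, 1)]

# The k-th pair from the end has k steps still to go after the move
steps_left = range(len(greedy_pairs) - 1, -1, -1)
expected = [gamma**k for k in steps_left]

for (state_idx, action), value in zip(greedy_pairs, expected):
    print(f"Q[{state_idx}, {action}] -> {value:.5f}")
```

The computed values 0.59049, 0.65610, 0.72900, 0.81000, 0.90000, and 1.00000 agree with the corresponding entries of the Q-table above.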

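Observation:
*   The agent starts with no prior knowledge: all Q-values are initialized to zero.
*   With ε-greedy action selection (ε = 0.2), the agent takes a random action 20% of the time and otherwise exploits the highest Q-value for the current state.
*   Over the 500 episodes, the Q-values for the state-action pairs along the greedy path rise toward powers of γ: 1 for the final step into the goal, then 0.9, 0.81, 0.729, 0.6561, and 0.59049 at the start state.
*   The learned policy moves right along the top row and then down the last column to the terminal goal state (3,3), a shortest path of six moves.
*   States far from this path were rarely or never visited, so their Q-values remain at or near zero. For rows that are still all zero, the ↑ arrow comes from np.argmax returning the first index on ties, not from learning.
*   The practical illustrates how Q-learning lets an agent learn a path to the goal through repeated interaction with the environment. Reliable estimates for every state require enough exploration to visit them.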
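
For the last step of the procedure, the best path can be read off a Q-table by following the greedy action from the start state. The sketch below is one way to do this; `greedy_path` and `max_steps` are hypothetical names, and `demo_Q` is a hand-built stand-in table rather than the trained one, so that the block runs on its own.

```python
import numpy as np

ACTION_MAP = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 0=up, 1=down, 2=left, 3=right

def greedy_path(Q, n_rows=4, n_cols=4, start=(0, 0), max_steps=50):
    """Follow argmax actions from start and return the visited cells.
    max_steps guards against loops in a poorly trained table
    (hypothetical helper, not part of the notebook)."""
    goal = (n_rows - 1, n_cols - 1)
    row, col = start
    path = [(row, col)]
    for _ in range(max_steps):
        if (row, col) == goal:
            break
        action = int(np.argmax(Q[row * n_cols + col]))
        d_row, d_col = ACTION_MAP[action]
        # Same boundary clipping as in the training loop
        row = min(max(row + d_row, 0), n_rows - 1)
        col = min(max(col + d_col, 0), n_cols - 1)
        path.append((row, col))
    return path

# Stand-in table: prefer "right" on the top row and "down" in the last column
demo_Q = np.zeros((16, 4))
for s in (0, 1, 2):
    demo_Q[s, 3] = 1.0
for s in (3, 7, 11):
    demo_Q[s, 1] = 1.0

path = greedy_path(demo_Q)
print(path)
# prints: [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
```

Passing the trained `Q` instead of `demo_Q` would trace the path the agent actually learned. If that path does not end at (3,3) within `max_steps`, the table has not yet converged along it.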



