<a href="https://colab.research.google.com/github/2003Yash/RL-Q-Learning-from-scratch/blob/main/RL_Q_learning_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import random            # functions for generating random numbers
from typing import List  # List type hint for annotations, e.g. List[int] means a list of integers

In [22]:
class SampleEnvironment:
  def __init__(self):
    self.steps_left = 20  # number of steps remaining before the game ends

  def get_observation(self) -> List[float]:
    return [0.0, 0.0, 0.0] # current position in the environment (placeholder values; any values would work here)

  def get_actions(self) -> List[int]:
    return [0, 1] # actions available to the agent, i.e. the entity that implements the policy

  def is_done(self) -> bool:
    return self.steps_left == 0 # the game is done once no steps are left

  def action_reward(self, action: int) -> float:
    if self.is_done():                  # refuse to act once the game is over
      raise Exception("Game is over")
    self.steps_left -= 1                # otherwise consume one step
    return random.random()              # and return a random reward in [0, 1); the action itself is ignored

In [23]:
class Agent:
  def __init__(self):         # a new agent starts with zero accumulated reward
    self.total_reward = 0.0

  def step(self, env: SampleEnvironment):
    current_obs = env.get_observation()          # observe the environment, i.e. the agent's position in it
    print("Observation {}".format(current_obs))  # the observation is hard-coded here; a real environment would return live values
    actions = env.get_actions()                  # actions available in this environment
    print(actions)
    reward = env.action_reward(random.choice(actions))   # random policy: pick an action at random (a real policy would use the observation) and collect its reward
    self.total_reward += reward                          # add it to the running total
    print("Total reward {}".format(self.total_reward))   # .format() replaces {} with the value of self.total_reward

In [24]:
env = SampleEnvironment() # environment instance
agent = Agent()           # agent instance
i = 0


while not env.is_done():   # keep stepping until the environment has no steps left
  i += 1
  print("Step {}".format(i))
  agent.step(env)

Step 1
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 0.6872604108496245
Step 2
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 1.5467820931071996
Step 3
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 2.53854296927009
Step 4
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 2.7969451047873064
Step 5
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 3.3554758935715103
Step 6
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 4.094555586877653
Step 7
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 4.108216790035494
Step 8
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 5.0755677359436895
Step 9
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 5.344112106510347
Step 10
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 5.637863780567108
Step 11
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 5.9738660843981615
Step 12
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 6.036383115339015
Step 13
Observation [0.0, 0.0, 0.0]
[0, 1]
Total reward 6.5095555384568655
Step 14
Observation [0.0, 0.0, 0.0]
[0, 1

The code above runs the basic reinforcement learning loop (observe, act, collect a reward), but with a random policy.
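Since every reward is drawn uniformly from [0, 1), a 20-step episode has an expected total reward of about 20 × 0.5 = 10. The sketch below checks that baseline empirically by averaging many silent episodes. The helper `random_episode_total`, the seed, and the episode count are illustrative assumptions and are not part of the notebook.

```python
import random


def random_episode_total(rng: random.Random, steps: int = 20) -> float:
    """Total reward of one episode in which every step pays a uniform reward in [0, 1)."""
    total = 0.0
    for _ in range(steps):
        action = rng.choice([0, 1])  # random policy; the reward below does not depend on it
        total += rng.random()
    return total


rng = random.Random(0)  # seeded only so the sketch is repeatable
episodes = 1000
mean_total = sum(random_episode_total(rng) for _ in range(episodes)) / episodes
print("Mean total reward over {} episodes: {:.2f}".format(episodes, mean_total))
```

Any single run, such as the one printed above, can land noticeably above or below this average.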

# Let's explore the same code with a learned policy: Q-learning

In [17]:
import random
from typing import List
import numpy as np

In [None]:
class SampleEnvironment:
  def __init__(self):
    self.steps_left = 20

  def get_observation(self) -> List[float]:
    return [0.0, 0.0, 0.0]

  def get_actions(self) -> List[int]: # the agent picks one action from this binary set [0, 1]
    return [0, 1]

  def is_done(self) -> bool:
    return self.steps_left == 0

  def action_reward(self, action: int) -> float:
    if self.is_done():
      raise Exception("Game is over")
    self.steps_left -= 1
    return random.random()

In [None]:
class Agent:
  def __init__(self, learning_rate=0.1, discount_factor=0.9):
    self.total_reward = 0.0
    self.q_table = {}  # Initialize Q-table as a dictionary
    self.learning_rate = learning_rate
    self.discount_factor = discount_factor

  def get_action(self, env: SampleEnvironment, epsilon=0.1):
    """Choose an action using an epsilon-greedy policy, which balances exploration (trying random actions) with exploitation (taking the best-known action). Explained in depth at the end."""
    actions = env.get_actions() # Get possible actions: [0,1]

    # Explore: choose a random action with probability epsilon (10% by default)
    if random.uniform(0, 1) < epsilon: # draw a random float between 0 and 1 and check whether it is below epsilon
      return random.choice(actions)

    else:
      # Exploit: Choose the action with the highest Q-value
      current_obs = tuple(env.get_observation()) # Convert state to tuple (for dictionary key)

      if current_obs not in self.q_table:
        # If the state is not in Q-table, initialize Q-values for all actions
        self.q_table[current_obs] = {action: 0.0 for action in actions}

      action_values = self.q_table[current_obs] # Get Q-values for current state
      return max(action_values, key=action_values.get) # pick the action with the highest Q-value (the first one on ties)

  def update_q_table(self, current_obs, action, reward, next_obs):
    """Update Q-table using the Q-learning algorithm."""
    current_obs = tuple(current_obs) # Convert state to tuple (dictionary key)
    next_obs = tuple(next_obs)  # Convert next state to tuple

    # Initialize Q-values for unseen states (actions are hard-coded to match get_actions())
    if current_obs not in self.q_table:
      self.q_table[current_obs] = {a: 0.0 for a in [0, 1]}

    if next_obs not in self.q_table:
      self.q_table[next_obs] = {a: 0.0 for a in [0, 1]}

    # Find the best action for the next state
    best_next_action = max(self.q_table[next_obs], key=self.q_table[next_obs].get)

    # Q-learning formula:
    td_target = reward + self.discount_factor * self.q_table[next_obs][best_next_action]
    td_error = td_target - self.q_table[current_obs][action]
    self.q_table[current_obs][action] += self.learning_rate * td_error

    # A sample Q-table is displayed at the end.

  def step(self, env: SampleEnvironment):
    current_obs = env.get_observation()
    action = self.get_action(env)
    print("Action {}".format(action))
    reward = env.action_reward(action)
    print("Reward {}".format(reward))
    self.total_reward += reward
    next_obs = env.get_observation()

    self.update_q_table(current_obs, action, reward, next_obs)
    print("Total reward {}".format(self.total_reward))

In [20]:
env = SampleEnvironment()
agent = Agent()
i = 0

while not env.is_done():
  i += 1
  print("Step {}".format(i))
  agent.step(env)

Step 1
Action 0
Reward 0.9792935513129654
Total reward 0.9792935513129654
Step 2
Action 0
Reward 0.8511365243821553
Total reward 1.8304300756951206
Step 3
Action 0
Reward 0.7520841381419586
Total reward 2.5825142138370794
Step 4
Action 0
Reward 0.3649034309418704
Total reward 2.94741764477895
Step 5
Action 0
Reward 0.39611983546424845
Total reward 3.3435374802431985
Step 6
Action 0
Reward 0.5722373211830409
Total reward 3.9157748014262395
Step 7
Action 0
Reward 0.04433717598054765
Total reward 3.960111977406787
Step 8
Action 0
Reward 0.536458008804654
Total reward 4.4965699862114406
Step 9
Action 0
Reward 0.7365899582090493
Total reward 5.23315994442049
Step 10
Action 0
Reward 0.8784716480873132
Total reward 6.111631592507803
Step 11
Action 0
Reward 0.891256359907843
Total reward 7.002887952415646
Step 12
Action 0
Reward 0.07701312755748069
Total reward 7.079901079973127
Step 13
Action 0
Reward 0.8997641043901781
Total reward 7.979665184363306
Step 14
Action 0
Reward 0.257586386942543


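Result: in these runs the random policy collected a total reward of about 9.2 and the Q-learning agent about 11 (values vary on every run). Because `action_reward` returns `random.random()` regardless of the action taken, this gap reflects run-to-run variation rather than a learned advantage; both policies average about 10 over 20 steps. Q-learning pays off once the reward depends on the action.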
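In `SampleEnvironment`, `action_reward` ignores its `action` argument, so a variant whose reward depends on the action makes the effect of learning easier to see. The sketch below assumes a hypothetical `BiasedEnvironment` in which action 1 always pays 1.0 and action 0 pays 0.0. The helper `run_episode`, the seed, and the episode counts are also illustrative choices and are not part of the notebook.

```python
import random
from typing import Dict, List, Tuple

QTable = Dict[Tuple[float, ...], Dict[int, float]]


class BiasedEnvironment:
    """Hypothetical variant of SampleEnvironment whose reward depends on the action."""

    def __init__(self, steps: int = 20):
        self.steps_left = steps

    def get_observation(self) -> List[float]:
        return [0.0, 0.0, 0.0]

    def get_actions(self) -> List[int]:
        return [0, 1]

    def is_done(self) -> bool:
        return self.steps_left == 0

    def action_reward(self, action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return 1.0 if action == 1 else 0.0  # assumption: action 1 is strictly better


def run_episode(q_table: QTable, rng: random.Random, epsilon: float,
                learn: bool, alpha: float = 0.1, gamma: float = 0.9) -> float:
    """Play one episode epsilon-greedily; update q_table in place if learn is True."""
    env = BiasedEnvironment()
    total = 0.0
    while not env.is_done():
        state = tuple(env.get_observation())
        actions = env.get_actions()
        q_values = q_table.setdefault(state, {a: 0.0 for a in actions})
        if rng.random() < epsilon:
            action = rng.choice(actions)              # explore
        else:
            action = max(q_values, key=q_values.get)  # exploit
        reward = env.action_reward(action)
        total += reward
        if learn:
            next_state = tuple(env.get_observation())
            next_q = q_table.setdefault(next_state, {a: 0.0 for a in actions})
            td_target = reward + gamma * max(next_q.values())
            q_values[action] += alpha * (td_target - q_values[action])
    return total


rng = random.Random(0)  # seeded only so the sketch is repeatable

# Baseline: epsilon=1.0 means every action is random and nothing is learned.
random_mean = sum(run_episode({}, rng, 1.0, learn=False) for _ in range(50)) / 50

# Train with epsilon=0.1, then evaluate greedily (epsilon=0.0) without learning.
q_table: QTable = {}
for _ in range(50):
    run_episode(q_table, rng, 0.1, learn=True)
learned_total = run_episode(q_table, rng, 0.0, learn=False)

print("Random policy, mean total reward: {:.2f}".format(random_mean))
print("Greedy policy after training, total reward: {:.2f}".format(learned_total))
```

Under these assumptions the random policy averages about 10 per episode. The trained agent collects the maximum of 20 once exploration has tried action 1 at least once, because from then on its Q-value stays above that of action 0.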

-----------------------------------------------------------------------------------------------------------------------------------------------

## Q-Learning Algorithms

Q-learning is a model-free reinforcement learning algorithm in which an agent learns how good each action is by updating a Q-table from the rewards it observes. The agent follows an ε-greedy policy: with probability ε it explores by taking a random action, and otherwise it exploits by selecting the action with the highest Q-value. The Q-value update rule is based on the Bellman equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Here s and a are the current state and action, s' is the next state, α (learning rate) controls how far each update moves the estimate, γ (discount factor) determines how much future rewards count, and R is the immediate reward. The agent updates its Q-values after every observed state-action-reward transition, refining its decision-making. Given enough exploration of every state-action pair and a suitably decreasing learning rate, the Q-values converge to the optimal ones, and acting greedily on them maximizes cumulative reward.
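To make the roles of α, γ, and R concrete, here is a minimal sketch that applies the update twice by hand, using the same defaults as the `Agent` class (α = 0.1, γ = 0.9). The function name `q_update` and the reward of 0.5 are illustrative assumptions.

```python
from typing import Dict, Tuple

State = Tuple[float, ...]
QTable = Dict[State, Dict[int, float]]


def q_update(q_table: QTable, state: State, action: int, reward: float,
             next_state: State, alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Apply one Q-learning update in place."""
    td_target = reward + gamma * max(q_table[next_state].values())
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error


state = (0.0, 0.0, 0.0)
q_table: QTable = {state: {0: 0.0, 1: 0.0}}

# First update: 0.0 + 0.1 * (0.5 + 0.9 * 0.0 - 0.0), about 0.05
q_update(q_table, state, action=0, reward=0.5, next_state=state)
first = q_table[state][0]

# Second update: 0.05 + 0.1 * (0.5 + 0.9 * 0.05 - 0.05), about 0.0995
q_update(q_table, state, action=0, reward=0.5, next_state=state)
second = q_table[state][0]

print(first, second)
```

The second update is slightly smaller than the first because the estimate has already moved toward its target. The Q-value of the untried action 1 stays at 0.0.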

-------------------------------------------------------------------------------------------------------------------------------------

## ε-Greedy Policy in Reinforcement Learning

The ε-greedy policy is a reinforcement learning strategy that balances exploration (trying new actions) and exploitation (choosing the best-known action). At each step, with probability ε, the agent selects a random action to explore, and with probability 1 - ε, it picks the action with the highest Q-value. This keeps the agent from getting stuck in suboptimal strategies while still letting it learn the best actions over time. Decaying ε (gradually reducing exploration) shifts the agent from learning toward acting on what it has learned, so it can keep refining its strategy while collecting more long-term reward.
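The selection rule described above can be sketched as a standalone function. The name `epsilon_greedy`, the seed, and the Q-values below are illustrative assumptions. Note that a random draw can also land on the best action, so with two actions and ε = 0.1 the best action is chosen roughly 95% of the time (0.9 + 0.1 × 0.5).

```python
import random
from typing import Dict


def epsilon_greedy(q_values: Dict[int, float], epsilon: float, rng: random.Random) -> int:
    """Return a random action with probability epsilon, else the highest-valued action."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))     # explore
    return max(q_values, key=q_values.get)    # exploit


rng = random.Random(42)  # seeded only so the sketch is repeatable
q_values = {0: 0.15, 1: 0.23}
picks = [epsilon_greedy(q_values, 0.1, rng) for _ in range(10_000)]
best_fraction = picks.count(1) / len(picks)
print("Fraction of steps choosing the best action: {:.3f}".format(best_fraction))
```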

Why not a fixed ε throughout?

Keeping ε fixed at a high value for the whole run results in excessive random exploration, preventing the agent from fully exploiting what it has learned. Conversely, setting ε too low from the start leads to premature exploitation, causing the agent to get stuck in suboptimal strategies without discovering better actions.
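One common way to avoid both extremes is to start with a high ε and shrink it as training progresses. The sketch below uses an exponential decay with a floor. The function `decayed_epsilon` and its `start`, `end`, and `decay` values are illustrative assumptions, and the notebook's `Agent` itself uses a fixed ε of 0.1.

```python
def decayed_epsilon(step: int, start: float = 1.0, end: float = 0.05,
                    decay: float = 0.99) -> float:
    """Exponentially decay epsilon from start toward zero, never going below end."""
    return max(end, start * decay ** step)


for step in (0, 10, 100, 1000):
    print("step {:>4}: epsilon = {:.3f}".format(step, decayed_epsilon(step)))
```

The value returned for the current step could then be passed as the `epsilon` argument of `get_action`.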

## Sample Q-Table for this Code

In [None]:
{
  (0.0, 0.0, 0.0): {  # State (Observation)
    0: 0.15,  # Q-value for taking action 0
    1: 0.23   # Q-value for taking action 1
  },
  (1.0, 0.0, 0.0): {  # Another state (appears only if the observation changes; in SampleEnvironment it never does)
    0: 0.10,
    1: 0.19
  }
}
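Reading a policy off such a table means taking, for each state, the action with the highest Q-value. A minimal sketch using the sample values above (the name `greedy_policy` is an illustrative assumption):

```python
q_table = {
    (0.0, 0.0, 0.0): {0: 0.15, 1: 0.23},
    (1.0, 0.0, 0.0): {0: 0.10, 1: 0.19},
}

# For every state, keep the action whose Q-value is largest.
greedy_policy = {
    state: max(q_values, key=q_values.get)
    for state, q_values in q_table.items()
}

for state, action in greedy_policy.items():
    print("In state {} take action {}".format(state, action))
```

With these sample values, action 1 has the higher Q-value in both states, so the greedy policy picks action 1 everywhere.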