# Q-Learning

Q-learning is temporal-difference (TD) learning applied to the $Q$-function:

$Q^{new}(s_k,a_k)=Q^{old}(s_k,a_k)+\alpha\left(r_k+\gamma \max_a Q^{old}(s_{k+1},a)-Q^{old}(s_k,a_k)\right)$
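As a rough numeric illustration of the update rule above, the sketch below applies a single update to a hypothetical table with two states and two actions. The function name `q_update` and the values of $\alpha$, $\gamma$, and the reward are assumptions chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error."""
    td_target = r + gamma * np.max(Q[s_next])  # bootstrap from the best next action
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical table: 2 states x 2 actions (illustrative values only).
Q = np.array([[0.0, 1.0],
              [2.0, 4.0]])

td_error = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.5)
# target = 1 + 0.5 * 4 = 3, error = 3 - 1 = 2, updated Q[0, 1] = 1 + 0.5 * 2 = 2
print(td_error, Q[0, 1])
```

Only the entry for the visited pair $(s_k, a_k)$ changes; every other entry of the table is left as it was.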

This is off-policy TD(0) learning of the quality function $Q$.

**Off-policy** means that the action $a_k$ actually taken may be **sub-optimal**, yet the update still **maximizes** over the actions available in $s_{k+1}$. The agent therefore keeps learning about the greedy policy even when it is **not taking the best** $a_k$.

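This can be **confusing**: we say the agent may take **sub-optimal** actions, yet the update rule contains the **term** $\max_a Q(s_{k+1},a)$. The two do not conflict, because the sub-optimal choice concerns the action that generates the experience, while the $\max$ only sets the value the update bootstraps from.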

**Many** policies can generate the **experiments**. At the **experience replay** step we iterate through the recorded actions, even the sub-optimal ones, while the update **assumes** that the **best** action is taken from the next state onward. The replayed experiments can be ones run **by us** or **imported** from others, so the same $Q$-function learns from data produced by **different** policies.
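To make the replay idea concrete, here is a minimal sketch under stated assumptions: a hypothetical three-state chain (state 2 is the goal, action 1 moves right, action 0 stays put) and a hand-written buffer of logged transitions. The buffer contents, the `replay` helper, and the hyperparameters are all invented for illustration; none of them come from the notebook below.

```python
import numpy as np

# Hypothetical logged transitions: (state, action, reward, next_state, terminated).
# They could come from random play or from someone else's runs.
replay_buffer = [
    (0, 0, 0.0, 0, False),  # stayed in place: a sub-optimal move
    (0, 1, 0.0, 1, False),  # moved right
    (1, 0, 0.0, 1, False),  # stayed in place again
    (1, 1, 1.0, 2, True),   # reached the goal
]

def replay(Q, buffer, alpha=0.5, gamma=0.9, sweeps=200, seed=0):
    """Repeatedly apply the Q-learning update to stored transitions."""
    rng = np.random.default_rng(seed)
    for _ in range(sweeps):
        for i in rng.permutation(len(buffer)):
            s, a, r, s_next, terminated = buffer[i]
            # The update assumes the best action is taken from s_next onward,
            # regardless of what the logging policy actually did there.
            bootstrap = 0.0 if terminated else np.max(Q[s_next])
            Q[s, a] += alpha * (r + gamma * bootstrap - Q[s, a])
    return Q

Q = replay(np.zeros((3, 2)), replay_buffer)
greedy = np.argmax(Q[:2], axis=1)  # greedy action in the two non-terminal states
print(greedy)
```

Even though half of the logged moves are wasted steps, the greedy policy read off the learned table prefers moving right in both non-terminal states.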

**Exploration vs. exploitation: $\epsilon$-greedy actions**

A **random** exploration element is introduced into $Q$-learning; a popular technique is **$\epsilon$-greedy**. With probability $1-\epsilon$, where $\epsilon \in[0,1]$, the action $a_k$ is the greedy one under the current $Q$-function; with probability $\epsilon$ it is drawn at random. For example, with $\epsilon=0.05$ there is a 95% **probability** of taking the best-known action and a 5% **chance** of exploring with a random one.
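The selection rule can be sketched in a few lines. The helper name `epsilon_greedy` and the Q-values in `q_row` are hypothetical. Because the random draw can also land on the greedy action, the greedy action ends up being chosen with probability $1-\epsilon+\epsilon/|A|$, which the empirical share below should approximate.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # explore
    return int(np.argmax(q_row))              # exploit

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.7, 0.2, 0.0])  # hypothetical Q-values for one state

picks = [epsilon_greedy(q_row, epsilon=0.05, rng=rng) for _ in range(10_000)]
greedy_share = np.mean(np.array(picks) == 1)  # action 1 has the highest value
print(f"greedy action chosen in {greedy_share:.1%} of draws")
```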

The value of $\epsilon$ can be decayed over the iterations so that the behavior moves closer to the greedy policy, i.e. more **on-policy**, once a good $Q$-function has been learned.
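As a small worked example of such a schedule, assume a multiplicative decay: after every episode $\epsilon$ is multiplied by a constant factor and clipped at a floor. The helper names below are hypothetical; the values 1, 0.999, and 0.05 match the hyperparameters set in the training cell of this notebook.

```python
import math

def episodes_until_floor(initial_epsilon, decay, final_epsilon):
    """Smallest n such that initial_epsilon * decay**n <= final_epsilon."""
    return math.ceil(math.log(final_epsilon / initial_epsilon) / math.log(decay))

def epsilon_at(episode, initial_epsilon, decay, final_epsilon):
    """Epsilon after `episode` multiplicative decays, clipped at the floor."""
    return max(final_epsilon, initial_epsilon * decay ** episode)

n_floor = episodes_until_floor(1.0, 0.999, 0.05)
print(n_floor)
print(epsilon_at(n_floor - 1, 1.0, 0.999, 0.05), epsilon_at(n_floor, 1.0, 0.999, 0.05))
```

With these values the floor is reached after roughly 3,000 episodes, so in a 10,000-episode run about the last 70% of training would use the constant final $\epsilon$.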

Tabular $Q$-learning applies to **discrete** action spaces $A$ and state spaces $S$ governed by a **finite** MDP. A table of $Q$ values represents the $Q$ function, so it doesn't **scale** well to **large** state spaces. For those, function **approximation** is typically used to represent the $Q$ function, such as a **neural network** in deep $Q$-learning.
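A minimal sketch of the function-approximation idea, assuming a linear model as a stand-in for a neural network: $Q(s,a)$ is the dot product of a per-action weight vector with a feature vector of the state, and the update is a semi-gradient step on the TD error. The class `LinearQ`, the `one_hot` helper, and all numbers are hypothetical. With one-hot features, as used here, the model is equivalent to the table.

```python
import numpy as np

class LinearQ:
    """Q(s, a) = weights[a] @ features(s)."""

    def __init__(self, n_features, n_actions):
        self.weights = np.zeros((n_actions, n_features))

    def values(self, features):
        """Q-values of all actions for one state."""
        return self.weights @ features

    def update(self, features, action, reward, next_features, terminated,
               alpha=0.1, gamma=0.95):
        bootstrap = 0.0 if terminated else np.max(self.values(next_features))
        td_error = reward + gamma * bootstrap - self.values(features)[action]
        # The gradient of Q(s, a) with respect to weights[action] is the feature vector.
        self.weights[action] += alpha * td_error * features
        return td_error

def one_hot(state, n_states):
    """Feature vector that makes the linear model behave like a table."""
    vec = np.zeros(n_states)
    vec[state] = 1.0
    return vec

q = LinearQ(n_features=16, n_actions=4)
err = q.update(one_hot(0, 16), action=2, reward=-1.0,
               next_features=one_hot(1, 16), terminated=False)
print(err, q.values(one_hot(0, 16)))
```

Swapping `one_hot` for a compact feature vector, for example normalized grid coordinates, is what lets the number of parameters stay small when the state space grows.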

> Because $Q$-learning is off-policy, it is possible to learn from action-state sequences that do not use the current optimal policy. For example, it is possible to store past experiences, such as previously played games, and replay these experiences to further improve the Q function.
>

In [1]:
import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import FrozenLakeEnv
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import numpy as np
import seaborn as sns
from tqdm import tqdm
from collections import defaultdict # allows access to undefined keys
matplotlib.use('TkAgg')  # or 'Qt5Agg' if you prefer Qt

In [2]:


class CustomFrozenLake(FrozenLakeEnv):
    def __init__(self, goal_reward=100, hole_penalty=-50, step_penalty=-1,stuck_penalty=-1, **kwargs):
        super().__init__(**kwargs)
        self.goal_reward = goal_reward
        self.hole_penalty = hole_penalty
        self.step_penalty = step_penalty
        self.stuck_penalty = stuck_penalty

    def step(self, action):
        # Remember the state before the move so a blocked step can be detected
        prev_state = self.unwrapped.s

        # The built-in reward is discarded and replaced by the custom scheme below
        state, _, terminated, truncated, info = super().step(action)

        # Look up the tile the agent now stands on (desc stores single bytes)
        row, col = divmod(state, self.ncol)
        current_tile = self.desc[row, col]

        if current_tile == b'H':
            reward = self.hole_penalty  # fell into a hole
        elif current_tile == b'G':
            reward = self.goal_reward  # reached the goal
        elif prev_state == state:
            reward = self.stuck_penalty  # stayed in place, e.g. moved against the edge of the map
        else:
            reward = self.step_penalty  # ordinary move onto a frozen tile

        return state, reward, terminated, truncated, info

In [3]:
class FrozenLakeAgent():
    def __init__(self,
                 learning_rate:float,
                 initial_epsilon:float,
                 epsilon_decay:float,
                 final_epsilon:float,
                 discount_factor:float = 0.95,
                 ):
        
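        # Initialize the agent with an empty dictionary of state-action values (q_values),
        # a learning rate and an epsilon.
        # discount_factor is gamma in the Q-value update.
        # Note: `env` below is the module-level environment created in a later cell;
        # it is only looked up when a state is visited for the first time.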
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))
    
        self.lr = learning_rate
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon
        self.discount_factor = discount_factor
        
        self.training_error = []
    
    def choose_action(self, obs:int)->int:
        # With probability epsilon explore with a random action;
        # otherwise (probability 1 - epsilon) return the best known action.
        if np.random.random() < self.epsilon:
            return env.action_space.sample()
        else:
            return int(np.argmax(self.q_values[obs]))
    
    def update_q_values(self,
                        obs:int,
                        action:int,
                        reward:float,
                        terminated:bool,
                        next_obs:int):
        # A terminal state has no future value, so the bootstrap term is zeroed out
        future_q_value = (not terminated) * np.max(self.q_values[next_obs])

        # TD error: target (reward + gamma * max Q of next state) minus current estimate
        temporal_difference = (reward + (self.discount_factor * future_q_value))- self.q_values[obs][action]
        
        self.q_values[obs][action] = (
            self.q_values[obs][action] + self.lr * temporal_difference
        )
        self.training_error.append(temporal_difference)
        
    def decay_epsilon(self):
        self.epsilon = max(self.final_epsilon, self.epsilon * self.epsilon_decay)

In [4]:
learning_rate = .1
n_episodes = 10000
start_epsilon = 1
epsilon_decay = 0.999
final_epsilon = 0.05

agent = FrozenLakeAgent(
    learning_rate=learning_rate,
    initial_epsilon=start_epsilon,
    final_epsilon=final_epsilon,
    epsilon_decay=epsilon_decay,
)

In [9]:
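# FrozenLakeEnv only ships the built-in maps "4x4" and "8x8"; any other map_name
# (such as "24x24") raises a KeyError. A larger lake has to be passed as a custom
# layout through the `desc` argument instead.
env = CustomFrozenLake(map_name="8x8", is_slippery=False, render_mode='rgb_array')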

In [None]:
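# Apply the step limit first so that truncated episodes are also recorded in the statistics
env = gym.wrappers.TimeLimit(env, max_episode_steps=60)
# Note: in Gymnasium 1.0 and later this argument is named buffer_length instead of deque_size
env = gym.wrappers.RecordEpisodeStatistics(env, deque_size=n_episodes)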

rewards = 0 
for episode in tqdm(range(n_episodes)):
    
    obs, info = env.reset()
    done = False
    
    # play one episode
    while not done:
        action = agent.choose_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        rewards += reward
        # update the agent
        agent.update_q_values(obs, action, reward, terminated, next_obs)

        # update if the environment is done and the current obs
        done = terminated or truncated
        obs = next_obs

    agent.decay_epsilon()

In [None]:
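# One row of Q-values per visited state, ordered by state index
q_values = np.array([agent.q_values[state] for state in sorted(agent.q_values)])
# Greedy action for each visited state
print(np.argmax(q_values,axis=1))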


In the Blackjack environment, the state space is defined by three components:

- The player's current sum (ranges from 4 to 21)
- The dealer's visible card (ranges from 1 to 10, where 1 represents an Ace)
- Whether the player has a usable Ace (True or False)

So, the total number of possible states is:

$(21 - 4 + 1) \times 10 \times 2 = 18 \times 10 \times 2 = 360$

However, the `q_values` dictionary holds 380 entries instead of 360. This is because the environment also includes some terminal states that can occur when the player's sum exceeds 21 (bust states). These additional states account for the extra 20 entries in the dictionary.

In [None]:
rolling_length = 500
fig, axs = plt.subplots(ncols=3, figsize=(12, 5))
axs[0].set_title("Episode rewards")
# compute and assign a rolling average of the data to provide a smoother graph
reward_moving_average = (
    np.convolve(
        np.array(env.return_queue).flatten(), np.ones(rolling_length), mode="valid"
    )
    / rolling_length
)
axs[0].plot(range(len(reward_moving_average)), reward_moving_average)
axs[1].set_title("Episode lengths")
length_moving_average = (
    np.convolve(
        np.array(env.length_queue).flatten(), np.ones(rolling_length), mode="valid"
    )
    / rolling_length
)
axs[1].plot(range(len(length_moving_average)), length_moving_average)
axs[2].set_title("Training Error")
training_error_moving_average = (
    np.convolve(np.array(agent.training_error), np.ones(rolling_length), mode="valid")
    / rolling_length
)
axs[2].plot(range(len(training_error_moving_average)), training_error_moving_average)
plt.tight_layout()
plt.show()

In [None]:
print(f'total rewards = {rewards}')


In [None]:


# env = gym.make("FrozenLake-v1", render_mode="rgb_array")
# env = gym.wrappers.TimeLimit(env, max_episode_steps=100)
obs, info = env.reset()

plt.ion()
fig, ax = plt.subplots(figsize=(8,8))
action_text = ax.text(510, 20, '', color='white', fontsize=12, bbox=dict(facecolor='blue', alpha=0.8))
img = ax.imshow(env.render())
# FrozenLake action indices: 0 = left, 1 = down, 2 = right, 3 = up
actions = ['Move Left','Move Down','Move Right','Move Up']
rewards = 0
num_epochs= 3
for episode in range(num_epochs):
    obs, info = env.reset()
    done = False
    while not done:
        action = agent.choose_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        rewards += reward
        
        print(f'episode {episode}:  obs = {next_obs} , reward = {reward}')
        frame = env.render()
        img.set_data(frame)
        action_text.set_text(f'Action: {actions[action]}')

        fig.canvas.draw()
        fig.canvas.flush_events()
        plt.pause(.05)
        done = terminated or truncated
        obs = next_obs

plt.ioff()  # Turn off interactive mode
# plt.show()  # Keep the window open after the animation finishes
plt.close()
env.close()

In [None]:
print(f'total rewards = {rewards}')

In [None]:
# print(f'action space shape : {env.action_space.n}') # Number of possible actions is 4
# print(f'observation space shape : {env.observation_space}') 
# #-------------- observation is a tuple of 3 values : --------------
# #1) player cards value
# #2) dealer's face up card
# #3) usable ace for player, equal 1 if ace is considered an 11 without busting
# 
# print(f'reward range : {env.reward_range}') # default reward range is set to -inf +inf
# # print(f'\nEnv specs : {env.spec}') 
# print(f'\nEnv metadata : {env.metadata}') # render_modes and render_fps