# Q-Learning

Q-learning is temporal-difference (TD) learning applied to the $Q$-function:

$Q^{new}(s_k,a_k)=Q^{old}(s_k,a_k)+\alpha\left(r_k+\gamma \max_a Q^{old}(s_{k+1},a)-Q^{old}(s_k,a_k)\right)$

In other words, it is off-policy TD(0) learning of the quality function $Q$.
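
As a worked instance of the update rule above, the sketch below applies a single step to a tiny hand-made table. The table size, the numbers, and the helper name `q_update` are illustrative assumptions and do not come from the notes.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Apply one Q-learning update in place and return the TD error."""
    td_target = r + gamma * np.max(Q[s_next])  # r_k + gamma * max_a Q(s_{k+1}, a)
    td_error = td_target - Q[s, a]             # the bracketed term of the update
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical table with 2 states (rows) and 2 actions (columns)
Q = np.array([[0.0, 1.0],
              [2.0, 4.0]])
td_error = q_update(Q, s=0, a=0, r=1.0, s_next=1)
print(td_error, Q[0, 0])
```

Here the target is $1 + 0.95 \cdot 4 = 4.8$, so $Q(0,0)$ moves a tenth of the way from 0 toward it. Only the entry for the visited state-action pair changes.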

**Off-policy** means that the action $a_k$ actually taken may be **sub-optimal**, while the update still **maximizes** over the next action in $s_{k+1}$. This lets the agent keep learning even when it is **not taking the best** action $a_k$.

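The off-policy idea can be **confusing**: we say the agent may take **sub-optimal** actions, yet the update rule contains the **term** $\max_a Q(s_{k+1},a)$. The two fit together because the sub-optimal action only decides which transition is observed, while the $\max$ evaluates the next state as if the best action were taken from there on.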

**Many** policies can be used to generate the **experiments**. At the **experience replay** step we iterate through the recorded actions even if they are sub-optimal, but we **assume** that the **best** actions will be taken in the following steps. The replayed experiments can be ones run **by us** or **imported** from others, which ensures the agent learns from **different** policies.

**Exploration vs. exploitation: $\epsilon$-greedy actions**

A **random** exploration element is introduced into $Q$-learning; the most popular technique is **$\epsilon$-greedy**. The action $a_k$ is chosen greedily from the current $Q$-function with probability $1-\epsilon$, where $\epsilon \in[0,1]$, and at random otherwise. For example, with $\epsilon=0.05$ there is a 95% **probability** of taking the best action and a 5% **chance** of exploring a random, possibly sub-optimal, one.
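
The selection rule can be sketched as below. The function name `epsilon_greedy`, the example $Q$ row, and the seed are assumptions made for illustration. Because the random draw is uniform over all actions, it can also land on the greedy action, so the greedy action ends up chosen slightly more often than $1-\epsilon$.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # explore: uniform over all actions
    return int(np.argmax(q_row))              # exploit: best action under current Q

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.7, 0.2, 0.0])        # hypothetical Q(s, .) for one state
picks = [epsilon_greedy(q_row, 0.05, rng) for _ in range(10_000)]
greedy_share = np.mean(np.array(picks) == 1)
print(f"greedy action chosen in {greedy_share:.1%} of draws")
```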

This $\epsilon$ value can be decayed as training proceeds, so that the behaviour moves closer to **on-policy** once a good $Q$-function has been learned.
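
One common schedule, assumed here for illustration, multiplies $\epsilon$ by a constant factor after every episode and clamps it at a floor. The helper name `decayed_epsilon` and the particular values are hypothetical; the sketch only shows how slowly such a schedule can shrink.

```python
def decayed_epsilon(initial, decay, final, n_episodes):
    """Return epsilon after each episode under multiplicative decay with a floor."""
    eps, history = initial, []
    for _ in range(n_episodes):
        eps = max(final, eps * decay)
        history.append(eps)
    return history

history = decayed_epsilon(initial=1.0, decay=0.999, final=0.05, n_episodes=1000)
print(f"epsilon after 1000 episodes: {history[-1]:.3f}")
```

With a factor of 0.999, $\epsilon$ is still around 0.37 after 1000 episodes, so more than a third of the actions remain random at that point. Reaching the 0.05 floor would take roughly 3000 episodes.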

$Q$-learning in this form applies to **discrete** action spaces $A$ and state spaces $S$ governed by a **finite** MDP. A table of $Q$ values represents the $Q$-function, so it doesn't **scale** well to **large** state spaces. For those, function **approximation** is typically used to represent the $Q$-function, such as a **neural network** in deep $Q$-learning.
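
To make the scaling problem concrete, the sketch below counts the rows a table would need if a continuous state were binned on a uniform grid. The six bounded dimensions, the two binary flags, the 0.1 bin width, and the four actions are all assumptions chosen for illustration.

```python
import math
import numpy as np

# Hypothetical bounds for six continuous state dimensions (assumed values)
low = np.array([-1.5, -1.5, -5.0, -5.0, -3.14, -5.0])
high = np.array([1.5, 1.5, 5.0, 5.0, 3.14, 5.0])
bin_width = 0.1

# Grid points per dimension after rounding the bounds to the bin width
n_points = np.round((np.round(high, 1) - np.round(low, 1)) / bin_width).astype(int) + 1

n_binary_flags = 2  # assumed extra dimensions that only take the values 0 or 1
n_actions = 4
n_states = math.prod(int(n) for n in n_points) * 2 ** n_binary_flags
print(n_points, f"{n_states:.2e} states, {n_states * n_actions:.2e} table entries")
```

Even this coarse grid gives on the order of $10^{11}$ distinct states, far more than any training run will visit. A sparse table only stores the states actually seen, but most of them are then seen too rarely to learn good values.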

> Because $Q$-learning is off-policy, it is possible to learn from action-state sequences that do not use the current optimal policy. For example, it is possible to store past experiences, such as previously played games, and replay these experiences to further improve the Q function.
>
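
A minimal sketch of such a replay step, under assumed data: the transitions, the state names, and the hyperparameters below are invented for illustration. The stored actions may come from any behaviour policy, since the update only needs the tuple (state, action, reward, next state, terminated).

```python
import random
from collections import defaultdict

import numpy as np

n_actions, alpha, gamma = 2, 0.5, 0.9
Q = defaultdict(lambda: np.zeros(n_actions))

# Hypothetical stored transitions: (state, action, reward, next_state, terminated)
replay_buffer = [
    ("A", 0, 0.0, "B", False),
    ("B", 1, 1.0, "end", True),
    ("A", 1, -1.0, "end", True),
]

random.seed(0)
for _ in range(300):
    s, a, r, s_next, terminated = random.choice(replay_buffer)
    # Bootstrap with the best next action, whatever action was recorded next
    target = r if terminated else r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

print({s: q.round(2) for s, q in Q.items()})
```

After enough replays, $Q(A, 0)$ approaches $0.9$, the discounted value of acting optimally in `B`, even though no single stored trajectory was required to be optimal.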

In [49]:
import gymnasium as gym
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import numpy as np
import seaborn as sns
from tqdm import tqdm
from collections import defaultdict # allows access to undefined keys
matplotlib.use('TkAgg')  # or 'Qt5Agg' if you prefer Qt

In [50]:
class LunarLanderAgent:
    def __init__(self,
                 learning_rate: float,
                 initial_epsilon: float,
                 epsilon_decay: float,
                 final_epsilon: float,
                 discount_factor: float = 0.95,
                 discrete_actions: int = 4):
        
        self.lr = learning_rate
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon
        self.discount_factor = discount_factor
        self.discrete_actions = discrete_actions
        
        # Initialize Q-table
        self.q_values = defaultdict(lambda: np.zeros(self.discrete_actions))
        
        self.training_error = []
    
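    def discretize_state(self, state):
        # Round each observation value to 1 decimal place to bin the continuous state.
        # Casting to float64 first keeps the keys identical whether the input is a raw
        # float32 observation or an already-discretized tuple.
        rounded_state = np.round(np.asarray(state, dtype=np.float64), 1)
        # The observation already ends with the two leg-contact flags, so nothing is appended.
        # Convert to a tuple so the state is hashable and can index the Q-table.
        return tuple(rounded_state)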
    
    def choose_action(self, state):
        discretized_state = self.discretize_state(state)
        
        if np.random.random() < self.epsilon:
            return np.random.randint(self.discrete_actions)
        else:
            return int(np.argmax(self.q_values[discretized_state]))
    
    def update_q_values(self, state, action, reward, terminated, next_state):
        state = self.discretize_state(state)
        next_state = self.discretize_state(next_state)
        
        if not terminated:          
            future_q_value = np.max(self.q_values[next_state])
        else:
            future_q_value = 0
        temporal_difference = (reward + (self.discount_factor * future_q_value)) - self.q_values[state][action]
        self.q_values[state][action] += self.lr * temporal_difference
        self.training_error.append(temporal_difference)
        
    def decay_epsilon(self):
        self.epsilon = max(self.final_epsilon, self.epsilon * self.epsilon_decay)

In [51]:
def train_td_n(agent, env, n_steps, n_episodes):
    """Train with n-step Q-learning targets over a sliding window of transitions."""
    gamma = agent.discount_factor

    def update_oldest(states, actions, rewards, bootstrap, next_state):
        # n-step return for the oldest transition still in the window
        G = sum(gamma**i * r for i, r in enumerate(rewards)) + bootstrap
        # G already contains the bootstrap term, so pass terminated=True to stop
        # update_q_values from adding gamma * max Q a second time.
        agent.update_q_values(states[0], actions[0], G, True, next_state)
        states.pop(0)
        actions.pop(0)
        rewards.pop(0)

    for episode in tqdm(range(n_episodes)):
        state, _ = env.reset()
        done = False
        states, actions, rewards = [], [], []

        while not done:
            action = agent.choose_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Store raw observations; the agent discretizes them itself
            states.append(state)
            actions.append(action)
            rewards.append(reward)

            # Value of the state that follows the window (zero after termination)
            if terminated:
                next_value = 0.0
            else:
                next_value = np.max(agent.q_values[agent.discretize_state(next_state)])

            if done:
                # Flush the window: each remaining transition gets a shorter return
                while rewards:
                    update_oldest(states, actions, rewards,
                                  gamma**len(rewards) * next_value, next_state)
            elif len(rewards) == n_steps:
                update_oldest(states, actions, rewards,
                              gamma**n_steps * next_value, next_state)

            state = next_state

        agent.decay_epsilon()

In [52]:
learning_rate = 0.1
n_episodes = 1_000
start_epsilon = 1.0
epsilon_decay = 0.999  # multiplicative decay applied once per episode
final_epsilon = 0.05

agent = LunarLanderAgent(
    learning_rate=learning_rate,
    initial_epsilon=start_epsilon,
    epsilon_decay=epsilon_decay,
    final_epsilon=final_epsilon,
)

In [53]:
env = gym.make("LunarLander-v2", render_mode='rgb_array')


In [54]:
env = gym.wrappers.RecordEpisodeStatistics(env, deque_size=n_episodes)
env = gym.wrappers.TimeLimit(env, max_episode_steps=60)

In [55]:
train_td_n(agent, env, n_steps=10, n_episodes=n_episodes)

100%|██████████| 1000/1000 [00:30<00:00, 32.43it/s]


In [56]:
q_values = np.array([value for key, value in agent.q_values.items()])
print(np.argmax(q_values,axis=1))


[0 3 1 ... 0 1 0]


In [57]:

# Create and wrap the environment
env = gym.make("LunarLander-v2",render_mode='rgb_array')
# env = CustomRewardWrapper(env)

obs, info = env.reset()

plt.ion()
fig, ax = plt.subplots(figsize=(8,8))
action_text = ax.text(510, 20, '', color='white', fontsize=12, bbox=dict(facecolor='blue', alpha=0.8))
img = ax.imshow(env.render())
# LunarLander's discrete actions, in index order 0..3
actions = ['Do nothing', 'Fire left engine', 'Fire main engine', 'Fire right engine']
rewards = 0
num_epochs= 2
for step in range(num_epochs):
    obs, info = env.reset()
    done = False
    while not done:
        action = agent.choose_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        rewards += reward
        if reward >10 : 
            print(f'step {step}:  obs = {next_obs} , reward = {reward}')

        frame = env.render()
        img.set_data(frame)
        action_text.set_text(f'Action: {actions[action]}')

        fig.canvas.draw()
        fig.canvas.flush_events()
        done = terminated or truncated
        if done: 
            print(f' reward = {reward}')

        obs = next_obs

plt.ioff()  # Turn off interactive mode
plt.show()  # Keep the window open after the animation finishes
plt.close()
# env.close()

 reward = -100
step 1:  obs = [-0.5081767  -0.20327936 -0.77846485 -1.8432485   0.23200004 -0.314769
  1.          1.        ] , reward = 11.997922958570541
 reward = -100


In [58]:
print(f'mean episode rewards = {rewards/num_epochs}')

mean episode rewards = -138.86829438994795


In [59]:
print(f'action space shape : {env.action_space.n}') # Number of possible actions is 4
print(f'observation space shape : {env.observation_space}') 
#-------------- observation is a vector of 8 values: --------------
# 1-2) x and y position of the lander
# 3-4) x and y linear velocity
# 5-6) angle and angular velocity
# 7-8) flags equal to 1 when the left / right leg touches the ground

print(f'reward range : {env.reward_range}') # default reward range is set to -inf +inf
# print(f'\nEnv specs : {env.spec}') 
print(f'\nEnv metadata : {env.metadata}') # render_modes and render_fps

action space shape : 4
observation space shape : Box([-1.5       -1.5       -5.        -5.        -3.1415927 -5.
 -0.        -0.       ], [1.5       1.5       5.        5.        3.1415927 5.        1.
 1.       ], (8,), float32)
reward range : (-inf, inf)

Env metadata : {'render_modes': ['human', 'rgb_array'], 'render_fps': 50}
