# Task / Motivation

For the last week or so I have been working through MIT's YouTube courses on deep learning to get a better understanding of how it works. I have seen things like RNNs, CNNs, VAEs, and my personal favorite, deep reinforcement learning.


This project is for educational purposes. I used videos, articles, and other resources to better understand how these things work. I haven't had a lot of exposure to ML/AI, and I thought this would be an interesting project for learning more about what is happening under the hood.


While it may be difficult to reconstruct everything exactly, the lessons I learned from deep diving into this subject give me a high-level understanding of how some of these models actually work, and how the math and statistics drive the solutions.


I use the gym API to make this all possible: https://www.gymlibrary.dev/content/basic_usage/


# Action Space
0: LEFT   
1: DOWN   
2: RIGHT   
3: UP

The agent receives a reward of 1 if it reaches the goal, and 0 otherwise.
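
To make the action codes concrete, here is a minimal sketch of how they could move the agent on the grid. It assumes the default non-slippery 4x4 map with states numbered row by row from 0 to 15; `ACTION_NAMES`, `GRID_SIZE`, and `move` are illustrative names and not part of the gym API.

```python
# Illustrative only: mirrors the action codes listed above on an assumed 4x4 grid.
ACTION_NAMES = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}
GRID_SIZE = 4  # assumption: default 4x4 map, states numbered row by row


def move(state: int, action: int) -> int:
    """Return the next state for a deterministic (non-slippery) move.

    Moving into a wall leaves the agent where it is.
    """
    row, col = divmod(state, GRID_SIZE)
    if action == 0:      # LEFT
        col = max(col - 1, 0)
    elif action == 1:    # DOWN
        row = min(row + 1, GRID_SIZE - 1)
    elif action == 2:    # RIGHT
        col = min(col + 1, GRID_SIZE - 1)
    elif action == 3:    # UP
        row = max(row - 1, 0)
    else:
        raise ValueError(f"unknown action: {action}")
    return row * GRID_SIZE + col


for state, action in [(0, 2), (1, 3), (1, 0)]:
    print(state, ACTION_NAMES[action], "->", move(state, action))
```

These three transitions are the same ones that appear in the training log further down: RIGHT from state 0 lands in state 1, UP from state 1 hits the top wall and stays in state 1, and LEFT from state 1 returns to state 0.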


# Q-Learning

*This material is for learning purposes, and I understand it better thanks to resources such as: https://towardsdatascience.com/q-learning-algorithm-from-explanation-to-implementation-cdbeda2ea187*

The goal of any reinforcement learning task is to maximize the total reward an agent gets from its environment through a trial and error process. At each step the agent is in some state and must choose an action, aiming to maximize the reward it collects over the whole episode rather than just at that step. In the frozen lake example, at each step we choose an action (left, down, right, up) and are rewarded only once we reach the goal. Our model should learn to tell a good step from a bad one.

To pick the best action in a state *s*, the agent needs an estimate of how good each available action is in that state. That estimate is the Q-value *Q(s, a)*: the expected discounted reward for taking action *a* in state *s* and acting greedily afterwards.

Q-learning is a method that learns the optimal Q-value for every state-action pair. To make this possible, it stores all the Q-values in a table and updates one entry at each step by blending the old Q-value with the newly learned value.

$\alpha$ is the learning rate.    
*r* is the reward for taking action *a* in state *s*.    
$\gamma$ is the discount factor.    
*s'* is the next state reached after taking action *a*.    
*a'* ranges over the actions available in *s'*; the update uses the one with the largest Q-value.

$$
Q(s,a) \leftarrow (1 - \alpha) \cdot Q(s,a) + \alpha \cdot \left( r + \gamma \cdot \max_{a'} Q(s',a') \right)
$$
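
As a small worked example of the update rule above, the sketch below applies it to a single table entry. The helper names `q_update` and `q_update_incremental` are hypothetical; the values 0.8 and 0.95 are the learning rate and discount rate used in the training code later on. The second helper is the algebraically equivalent "old value plus a step towards the target" form, which is how the training loop writes it.

```python
def q_update(q_old, reward, max_q_next, alpha, gamma):
    """One Q-learning update for a single (state, action) entry."""
    target = reward + gamma * max_q_next           # newly learned value
    return (1 - alpha) * q_old + alpha * target    # blend old and new


def q_update_incremental(q_old, reward, max_q_next, alpha, gamma):
    """Same update, written as a step of size alpha towards the target."""
    return q_old + alpha * (reward + gamma * max_q_next - q_old)


# Stepping into the goal for the first time: reward 1, nothing learned yet,
# and the terminal state has no future value.
new_q = q_update(q_old=0.0, reward=1.0, max_q_next=0.0, alpha=0.8, gamma=0.95)
print(new_q)  # 0.8
```

Repeating the same update on that entry moves it from 0.8 closer and closer to 1, which is why the learning rate controls how quickly old estimates are overwritten.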


<img src="training.gif" alt="Elf training in environment" style="width:200px;"/>



# Exploration or Exploitation 

https://gist.github.com/aamrani-dev/fe00597615dae967f5ca1909a6ecf1d2#file-q_learning-py

As I learn more about reinforcement learning and Q-learning, it becomes clear that knowing when to explore and when to exploit is essential. In the early episodes, the agent knows almost nothing about the environment, so it prioritizes exploration over exploitation. As it learns, the exploration probability (epsilon) decays a little after every episode, and the agent shifts towards exploiting the knowledge it has accumulated.
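
The decay can be sketched as an exponential schedule. The formula and default values below mirror the ones in the training code further down (`min_epsilon`, `max_epsilon`, `decay_rate`); `epsilon_at` is a hypothetical helper added only for illustration.

```python
import math


def epsilon_at(episode, min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.005):
    """Exploration probability after a given number of episodes."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


# Print the schedule at a few points to see how quickly exploration fades.
for episode in (0, 100, 500, 1000, 5000):
    print(f"episode {episode:>5}: epsilon = {epsilon_at(episode):.3f}")
```

The schedule starts at `max_epsilon` and approaches `min_epsilon` without ever reaching zero, so the agent keeps a small amount of exploration even late in training.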

To choose whether to explore or exploit, we draw a random number from a uniform distribution between 0 and 1. If the number is less than the exploration probability, the agent selects a random action (explores); otherwise it exploits what it has learned by picking the action with the highest Q-value in the current state. The Q-values themselves are updated with the Bellman equation, as explained in the article. (This is very similar to the Markov chain Monte Carlo scenarios from my Data in the Cosmos course.)
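
A minimal sketch of that decision rule, assuming NumPy's `Generator` API for the random draws. The function name `choose_action` and the example Q-values are made up for illustration.

```python
import numpy as np


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.uniform(0, 1) < epsilon:
        return int(rng.integers(len(q_row)))   # explore: uniform random action
    return int(np.argmax(q_row))               # exploit: best known action


rng = np.random.default_rng(seed=0)
q_row = np.array([0.1, 0.5, 0.2, 0.0])  # hypothetical Q-values for one state

greedy = choose_action(q_row, epsilon=0.0, rng=rng)   # never explores
random_pick = choose_action(q_row, epsilon=1.0, rng=rng)  # always explores
print(greedy, random_pick)
```

With `epsilon=0.0` the function always returns the index of the largest Q-value (action 1 here), and with `epsilon=1.0` it ignores the Q-values entirely.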

In [6]:
#!pip install --upgrade gym
#!pip install pygame

import numpy as np
import gym
import pygame

env = gym.make('FrozenLake-v1', is_slippery=False)

# Q Table Init
state_size = env.observation_space.n
action_size = env.action_space.n
q_table = np.zeros((state_size, action_size))

# Params
learning_rate = 0.8
discount_rate = 0.95
num_episodes = 10000
max_steps_per_episode = 100
epsilon = 1.0
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.005

# Q-learning algorithm
print_count = 0
for episode in range(num_episodes):
    state = env.reset()[0]
    done = False
    total_rewards = 0
    
    for step in range(max_steps_per_episode):
        exp_exp_tradeoff = np.random.uniform(0, 1)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(q_table[state, :])   # EXPLOITATION
        else:
            action = env.action_space.sample()      # EXPLORATION

        new_state, reward, done, truncated, info = env.step(action)

        q_table[state, action] = q_table[state, action] + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]) - q_table[state, action])

        if print_count < 3:
            print(f"Episode: {episode}, Step: {step}")
            print(f"State: {state}, Action: {action}, Reward: {reward}, New State: {new_state}, Done: {done}")
            print(f"Q-value[{state}, {action}]: {q_table[state, action]}")
        elif print_count == 3:
            print("----\nStopping printing for space on github")
        print_count += 1
        state = new_state
        total_rewards += reward

        if done:
            break

    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)


Episode: 0, Step: 0
State: 0, Action: 2, Reward: 0.0, New State: 1, Done: False
Q-value[0, 2]: 0.0
Episode: 0, Step: 1
State: 1, Action: 3, Reward: 0.0, New State: 1, Done: False
Q-value[1, 3]: 0.0
Episode: 0, Step: 2
State: 1, Action: 0, Reward: 0.0, New State: 0, Done: False
Q-value[1, 0]: 0.0
----
Stopping printing for space on github


In [7]:
print(q_table)

[[0.73509189 0.77378094 0.6983373  0.73509189]
 [0.73509189 0.         0.         0.        ]
 [0.55863408 0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.77378094 0.81450625 0.         0.73509189]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.81450625 0.         0.857375   0.77378094]
 [0.81450625 0.9025     0.9025     0.        ]
 [0.82308    0.95       0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.9025     0.95       0.857375  ]
 [0.9025     0.95       1.         0.9025    ]
 [0.         0.         0.         0.        ]]
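
One way to read the table is to follow the highest-valued action from the start state. The sketch below copies the printed Q-table and walks the greedy path without gym, assuming the default 4x4 map with states numbered row by row and the goal at state 15; `greedy_path` and `DELTAS` are illustrative names.

```python
import numpy as np

# Q-table copied from the printed output above (rows = states, columns = actions).
q_table = np.array([
    [0.73509189, 0.77378094, 0.6983373, 0.73509189],
    [0.73509189, 0.0, 0.0, 0.0],
    [0.55863408, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.77378094, 0.81450625, 0.0, 0.73509189],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.81450625, 0.0, 0.857375, 0.77378094],
    [0.81450625, 0.9025, 0.9025, 0.0],
    [0.82308, 0.95, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9025, 0.95, 0.857375],
    [0.9025, 0.95, 1.0, 0.9025],
    [0.0, 0.0, 0.0, 0.0],
])

GRID_SIZE = 4    # assumption: default 4x4 map
GOAL_STATE = 15  # assumption: goal in the bottom-right corner
DELTAS = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # LEFT, DOWN, RIGHT, UP


def greedy_path(q_table, start=0, max_steps=20):
    """Follow the argmax action from each state on a deterministic grid."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state == GOAL_STATE:
            break
        action = int(np.argmax(q_table[state]))
        d_row, d_col = DELTAS[action]
        row, col = divmod(state, GRID_SIZE)
        row = min(max(row + d_row, 0), GRID_SIZE - 1)  # walls block movement
        col = min(max(col + d_col, 0), GRID_SIZE - 1)
        state = row * GRID_SIZE + col
        path.append(state)
    return path


path = greedy_path(q_table)
print(path)  # [0, 4, 8, 9, 13, 14, 15]
```

The path reaches the goal in six moves, and the values along it are powers of the discount rate: the last move is worth 1, the one before it 0.95, and the first move from state 0 is worth 0.95^5 ≈ 0.7738, which matches the largest entry in the first row.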


# Visualization

In [None]:
import time

def play_game(env, q_table):
    state = env.reset()[0]
    done = False
    total_rewards = 0
    step_count = 0

    while not done:
        env.render() 
        time.sleep(1)  
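        action = np.argmax(q_table[state, :])  # greedy action from the learned Q-table
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  # also stop on the time limit so the loop cannot run forever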
        state = new_state
        total_rewards += reward
        step_count += 1

        if done:
            env.render()  
            print(f"Episode finished in {step_count} steps with total rewards: {total_rewards}")
            break

env = gym.make('FrozenLake-v1', is_slippery=False, render_mode="human")
play_game(env, q_table)
env.close()

# Final trained model

<img src="trained.gif" alt="Trained elf in environment" style="width:200px;"/>