# Course: Application of AI, Data Science and Machine Learning
# Lab 8: Reinforcement Learning (Implementing Q-Table)

### This lab requires OpenAI Gym.

Gym has a ton of environments, ranging from simple text-based games to Atari games like Breakout and Space Invaders. The library is intuitive to use and simple to install.

Just run `pip install gym`, and you're good to go!

The link to Gym's installation instructions, requirements, and documentation is: https://gym.openai.com/docs/

Go ahead and get that installed now because we'll need it in just a moment. 

In [None]:
pip install gym

Collecting gym
  Downloading gym-0.23.0.tar.gz (624 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting gym-notices>=0.0.4
  Downloading gym_notices-0.0.6-py3-none-any.whl (2.7 kB)
Building wheels for collected packages: gym
  Building wheel for gym (PEP 517) ... done
  Created wheel for gym: filename=gym-0.23.0-py3-none-any.whl size=697659 sha256=6e5d4270845c56cea2db36e2d8dbe513c44f629d64a82436056b3ccabe990c9f
  Stored in directory: /Users/macbook/Library/Caches/pip/wheels/e7/2f/ab/68bf956c5dde73c1856d981e54292cf58385fb60bca10b7acd
Successfully built gym
Installing collected packages: gym-notices, gym
Successfully installed gym-0.23.0 gym-notices-0.0.6
Note: you may need to restart the kernel to use updated packages.


In [None]:
import gym

### Frozen Lake problem description: 
Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend. The surface is described using a grid like the following:

![image.png](attachment:image.png)
This grid is our environment where S is the agent's starting point, and it's safe. F represents the frozen surface and is also safe. H represents a hole, and if our agent steps in a hole in the middle of a frozen lake, well, that's not good. Finally, G represents the goal, which is the space on the grid where the prized frisbee is located.
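
To make the grid concrete, here is a small sketch that encodes a 4x4 layout as strings and converts between state indices and grid coordinates. The layout is assumed to be Gym's default 4x4 FrozenLake map, and the names `LAKE_MAP`, `state_to_cell`, and `cells_of` are illustrative helpers, not part of Gym. States are numbered row by row, from 0 in the top-left corner to 15 in the bottom-right.

```python
# Assumed layout: Gym's default 4x4 FrozenLake map (S=start, F=frozen, H=hole, G=goal).
LAKE_MAP = ["SFFF",
            "FHFH",
            "FFFH",
            "HFFG"]
N_COLS = len(LAKE_MAP[0])

def state_to_cell(state):
    """Convert a state index (0..15) to a (row, col) pair."""
    return divmod(state, N_COLS)

def cells_of(kind):
    """Return the state indices whose tile equals `kind`."""
    return [r * N_COLS + c
            for r, row in enumerate(LAKE_MAP)
            for c, tile in enumerate(row)
            if tile == kind]

holes = cells_of("H")
goal = cells_of("G")[0]
print("holes:", holes, "goal:", goal, "goal cell:", state_to_cell(goal))
```

Holes and the goal end the episode, so the agent never acts from those states.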

The agent can navigate left, right, up, and down, and the episode ends when the agent reaches the goal or falls in a hole. It receives a reward of one if it reaches the goal, and zero otherwise.
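
The slipperiness can be sketched without Gym. The snippet below assumes the behaviour documented for the slippery FrozenLake: the agent moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3. The action encoding (0 = left, 1 = down, 2 = right, 3 = up) is assumed to match Gym's, while `slippery_move` and the other names are hypothetical helpers, not Gym API.

```python
import random

# Action encoding assumed to match Gym's FrozenLake: 0=left, 1=down, 2=right, 3=up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}
SIZE = 4  # 4x4 grid

def slippery_move(state, action, rng):
    """Return the next state index after one slippery step."""
    # The actual direction is the intended one or one of its two perpendicular neighbours.
    actual = rng.choice([(action - 1) % 4, action, (action + 1) % 4])
    row, col = divmod(state, SIZE)
    d_row, d_col = MOVES[actual]
    # Moves that would leave the grid keep the agent at the edge.
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    return row * SIZE + col

rng = random.Random(0)
# From the start state (0), try to move right ten times and see where we land.
outcomes = [slippery_move(0, 2, rng) for _ in range(10)]
print(outcomes)
```

From the start state, an attempted move to the right can end in state 1 (right), state 4 (slipped down), or state 0 (slipped up into the edge), which is why a good policy has to account for unintended moves.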

![image-2.png](attachment:image-2.png)

### Step 1: Import Libraries

In [None]:
# Note: this code runs inside a TensorFlow environment
import numpy as np
import gym
import random
import time
from IPython.display import clear_output

### Step 2: Set up the Environment

In [None]:
env = gym.make("FrozenLake-v1")

In [None]:
pip install pygame

Collecting pygame
  Downloading pygame-2.1.2-cp38-cp38-macosx_10_9_x86_64.whl (8.9 MB)
Installing collected packages: pygame
Successfully installed pygame-2.1.2
Note: you may need to restart the kernel to use updated packages.


### Step 3: Create the Q-Table

We're now going to construct our Q-table, and initialize all the Q-values to zero for each state-action pair.

Remember, the number of rows in the table is equal to the size of the state space in the environment, and the number of columns is equal to the size of the action space. We can get this information using `env.observation_space.n` and `env.action_space.n`.

In [None]:
action_size = env.action_space.n
state_size = env.observation_space.n

In [None]:
qtable = np.zeros((state_size, action_size))
print(qtable)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


### Step 4: Initializing Q-Learning Parameters

First, with `total_episodes`, we define the total number of episodes we want the agent to play during training. Then, with `max_steps`, we define the maximum number of steps that our agent is allowed to take within a single episode. So, if the agent hasn't reached the frisbee or fallen through a hole by the 99th step, the episode terminates with the agent receiving zero points.

Next, we set our `learning_rate`, which is written mathematically with the symbol $\alpha$. We also set our discount rate, `gamma`, which is represented with the symbol $\gamma$.
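
As a worked example of how these two parameters enter the update, the sketch below applies the Q-learning rule by hand, using the same formula as the training loop in this lab: Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]. The function name `q_update` and the sample numbers are illustrative.

```python
def q_update(q_sa, reward, max_q_next, learning_rate, gamma):
    """Return the updated Q(s, a) after one Q-learning step."""
    td_target = reward + gamma * max_q_next      # estimated return from (s, a)
    return q_sa + learning_rate * (td_target - q_sa)

# Reaching the goal (reward 1) from a state-action pair that is still at zero;
# the goal is terminal, so its best Q-value is 0.
first = q_update(q_sa=0.0, reward=1.0, max_q_next=0.0, learning_rate=0.8, gamma=0.95)

# One step earlier: no immediate reward, but the next state is now worth `first`.
second = q_update(q_sa=0.0, reward=0.0, max_q_next=first, learning_rate=0.8, gamma=0.95)

print(first, second)
```

The first update moves the value 80% of the way towards the reward of 1 (about 0.8). The second shows the discount at work: the value one step further from the goal is smaller (about 0.608), because the future reward is scaled by `gamma` before the learning rate is applied.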

We initialize the exploration rate, `epsilon`, to 1, and set `max_epsilon` to 1 and `min_epsilon` to 0.01. The max and min are just bounds on how large or small our exploration rate can be. The exploration rate is represented with the symbol $\epsilon$.

Lastly, we set `decay_rate` to 0.005 to determine the rate at which `epsilon` will decay.
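
To get a feel for how quickly exploration fades, the sketch below evaluates the exponential decay schedule used by the training loop in this lab, `min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)`, at a few episode numbers. The helper `epsilon_at` is a hypothetical name; its default values are the ones this lab uses.

```python
import math

def epsilon_at(episode, min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.005):
    """Exploration rate after `episode` episodes of exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

for episode in (0, 100, 500, 1000, 5000):
    print(episode, round(epsilon_at(episode), 4))
```

With these settings the exploration rate starts at 1.0, drops to about 0.61 after 100 episodes, and is below 0.02 by episode 1000, so most of the 15000 training episodes are spent almost entirely exploiting the Q-table.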

In [None]:
total_episodes = 15000        # Total episodes
learning_rate = 0.8           # Learning rate
max_steps = 99                # Max steps per episode
gamma = 0.95                  # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.005            # Exponential decay rate for exploration prob

### Step 5: Coding The Q-Learning Algorithm Training Loop

First, we create this list to hold all of the rewards we'll get from each episode. This will be so we can see how our game score changes over time. We'll discuss this more in a bit.

rewards = []

#### Q-learning algorithm
for episode in range(total_episodes):
    # initialize new episode params

    for step in range(max_steps):
        # Exploration-exploitation trade-off
        # Take new action
        # Update Q-table
        # Set new state
        # Add new reward
        pass  # placeholder so the outline is valid Python

    # Exploration rate decay
    # Add current episode reward to total rewards list


# Print Q-table
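
The exploration-exploitation step of the outline can be sketched on its own. This is a minimal standard-library version that assumes one row of a Q-table stored as a plain list; `choose_action` is a hypothetical helper, and the lab's own implementation uses NumPy and `env.action_space.sample()` instead.

```python
import random

def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy choice over one row of a Q-table."""
    if rng.uniform(0, 1) > epsilon:
        # Exploit: pick the action with the largest Q-value (first one on ties).
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Explore: pick any action uniformly at random.
    return rng.randrange(len(q_row))

rng = random.Random(42)
q_row = [0.1, 0.7, 0.2, 0.0]  # made-up Q-values for a single state

greedy = choose_action(q_row, epsilon=0.0, rng=rng)   # never explores
mixed = [choose_action(q_row, epsilon=0.5, rng=rng) for _ in range(1000)]
print("greedy action:", greedy)
print("share of action 1 with epsilon=0.5:", mixed.count(1) / len(mixed))
```

With `epsilon=0.5`, the best action is still chosen more than half of the time, because random exploration also picks it occasionally.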

In [None]:
# List of rewards
rewards = []

# Train for a fixed number of episodes
for episode in range(total_episodes):
    # Reset the environment (with Gym 0.23, reset() returns only the initial state)
    state = env.reset()
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        # Choose an action a in the current world state (s)
        ## First we draw a random number between 0 and 1
        exp_exp_tradeoff = random.uniform(0, 1)
        
        ## If this number is greater than epsilon --> exploitation (taking the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])

        # Else doing a random choice --> exploration
        else:
            action = env.action_space.sample()

        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        
        total_rewards += reward
        
        # The new state becomes the current state
        state = new_state
        
        # If done (the agent fell in a hole or reached the goal): finish episode
        if done:
            break
        
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    rewards.append(total_rewards)

print("Score over time: " + str(sum(rewards) / total_episodes))
print(qtable)

Score over time: 0.4744
[[6.15910507e-02 6.08926641e-02 8.07013382e-02 4.34393078e-02]
 [3.66897341e-03 8.03684328e-03 8.84262341e-04 6.71879405e-02]
 [1.58279246e-03 1.17460596e-02 1.90034468e-02 5.44301622e-02]
 [8.69807421e-03 4.14833519e-03 4.62091832e-03 2.84615198e-02]
 [9.96132507e-02 6.48231358e-03 7.12031536e-03 2.26356728e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [6.25127488e-05 4.92529916e-11 2.19829023e-02 4.06091784e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.75823010e-03 2.14211208e-02 7.53239464e-02 1.04658560e-01]
 [1.48605100e-02 6.61831596e-02 5.42979605e-01 7.26192956e-02]
 [1.45864894e-02 7.42437824e-04 5.32588139e-03 3.16073972e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.34225383e-03 6.87463745e-02 8.75693769e-01 1.87911767e-02]
 [3.54744381e-01 9.91700257e-01 8.29983613e-02 3.48101319e-01]
 [0.00000000e+00 0.00000000e+00
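
A single overall average hides how the score evolves during training. One way to look at the trend is to average the per-episode rewards in fixed-size chunks. The sketch below uses a made-up reward list so that it runs on its own; in the lab, the `rewards` list filled by the training loop could be passed instead. `average_per_chunk` is a hypothetical helper.

```python
def average_per_chunk(rewards, chunk_size):
    """Average reward of each consecutive block of `chunk_size` episodes."""
    return [sum(rewards[i:i + chunk_size]) / len(rewards[i:i + chunk_size])
            for i in range(0, len(rewards), chunk_size)]

# Made-up rewards: no wins at first, then wins every other episode, then 3 of 4.
toy_rewards = [0.0] * 4 + [0.0, 1.0] * 2 + [1.0, 1.0, 0.0, 1.0]

for index, avg in enumerate(average_per_chunk(toy_rewards, chunk_size=4)):
    print(f"episodes {index * 4}-{index * 4 + 3}: {avg}")
```

For the 15000 training episodes in this lab, a chunk size of 1000 would be a reasonable starting point.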

### Step 6: The Code To Watch The Agent Play The Game

Watch our agent play Frozen Lake by playing the best action from each state according to the Q-table.

for episode in range(5):
    # initialize new episode params

    for step in range(max_steps):
        # Show current state of environment on screen
        # Choose action with highest Q-value for current state
        # Take new action

        if done:
            if reward == 1:
                pass  # Agent reached the goal and won episode
            else:
                pass  # Agent stepped in a hole and lost episode

        # Set new state

env.close()
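
Before watching full episodes, it can help to read the greedy policy straight off the Q-table. The sketch below assumes Gym's FrozenLake action encoding (0 = left, 1 = down, 2 = right, 3 = up) and uses a small made-up Q-table; `greedy_policy` is a hypothetical helper. Rows that are all zero (for example terminal states) are shown as a dot because no action has been learned for them.

```python
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}  # assumed encoding: left, down, right, up

def greedy_policy(qtable, n_cols):
    """Render the best action per state as rows of arrow characters."""
    symbols = []
    for row in qtable:
        if not any(row):
            symbols.append(".")  # nothing learned for this state
        else:
            best = max(range(len(row)), key=lambda a: row[a])
            symbols.append(ARROWS[best])
    return ["".join(symbols[i:i + n_cols]) for i in range(0, len(symbols), n_cols)]

# Made-up Q-table for a 2x2 grid (4 states, 4 actions each).
toy_qtable = [
    [0.0, 0.1, 0.5, 0.0],   # state 0: right is best
    [0.0, 0.9, 0.1, 0.0],   # state 1: down is best
    [0.0, 0.0, 0.8, 0.2],   # state 2: right is best
    [0.0, 0.0, 0.0, 0.0],   # state 3: terminal, never updated
]
for line in greedy_policy(toy_qtable, n_cols=2):
    print(line)
```

For the NumPy Q-table trained in this lab, `qtable.tolist()` with `n_cols=4` could be passed in the same way.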

In [None]:
for episode in range(5):
    # Start each episode from the initial state
    state = env.reset()
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            # Here, we decide to only print the last state (to see if our agent reached the goal or fell into a hole)
            env.render()
            
            # We print the number of step it took.
            print("Number of steps", step)
            break
        state = new_state
env.close()

****************************************************
EPISODE  0
Number of steps 71
****************************************************
EPISODE  1
Number of steps 33
****************************************************
EPISODE  2
Number of steps 58
****************************************************
EPISODE  3
Number of steps 9
****************************************************
EPISODE  4


## Code Credit:
https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/FrozenLake/Q%20Learning%20with%20FrozenLake.ipynb


## Useful videos related to the code. You can refer to videos 8, 9, and 10:
Link: https://deeplizard.com/learn/playlist/PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv
