# Reinforcement Learning - Developing Intelligent Agents

<h4 style='color:CornflowerBlue;'>Deep Learning Course 5 of 6 - Level: Advanced</h4>
<p style='color:DeepSkyBlue;'>Follow <a href='https://deeplizard.com/'>deeplizard</a> for more information</p>

1. <a href='https://deeplizard.com/learn/video/QK_PP_2KgGE'>OpenAI Gym and Python for Q-learning - Reinforcement Learning Code Project</a>
2. <a href='https://deeplizard.com/learn/video/HGeI30uATws'>Train Q-learning Agent with Python - Reinforcement Learning Code Project</a>
3. <a href='https://deeplizard.com/learn/video/ZaILVnqZFCg'>Watch Q-learning Agent Play Game with Python - Reinforcement Learning Code Project</a>


# <font size='5' color='CornflowerBlue'>OpenAI Gym and Python for Q-learning - Reinforcement Learning Code Project</font>

# <font size='4' color='DarkSlateBlue'><b>OpenAI Gym</b></font>

We'll be using Python and OpenAI Gym to develop our reinforcement learning algorithm. The Gym library is a collection of environments that we can use with the reinforcement learning algorithms we develop.

Gym has a ton of environments, ranging from simple text-based games to Atari games like Breakout and Space Invaders. The library is intuitive to use and simple to install: just run `pip install gym`, and you're good to go! See [Gym's installation instructions, requirements, and documentation](https://gym.openai.com/docs/) for details. Go ahead and get that installed now because we'll need it in just a moment.

<center>
<img src='https://deeplizard.com/assets/svg/ac9a374b.svg'/>
</center>

We'll use Gym to provide the environment for a simple game called Frozen Lake. We'll then train an agent to play the game using Q-learning, and we'll watch a playback of how the agent does after being trained.

So, let's jump into the details for **Frozen Lake**! 

# <font size='4' color='DeepSkyBlue'><b>Overview</b></font>

This grid is our environment where `S` is the agent's starting point, and it's safe. `F` represents the frozen surface and is also safe. `H` represents a hole, and if our agent steps in a hole in the middle of a frozen lake, well, that's not good. Finally, `G` represents the goal, which is the space on the grid where the prized frisbee is located.

The agent can navigate left, right, up, and down, and the episode ends when the agent reaches the goal or falls in a hole. It receives a reward of one if it reaches the goal, and zero otherwise. 

# <font size='4' color='DeepSkyBlue'><b>Table Dictionary</b></font>

<table>
<thead>
  <tr>
    <th>State</th>
    <th>Description</th>
    <th>Reward</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>S</td>
    <td>Agent's starting point - safe</td>
    <td>0</td>
  </tr>
  <tr>
    <td>F</td>
    <td>Frozen surface - safe</td>
    <td>0</td>
  </tr>
  <tr>
    <td>H</td>
    <td>Hole - game over</td>
    <td>0</td>
  </tr>
  <tr>
    <td>G</td>
    <td>Goal - game over</td>
    <td>1</td>
  </tr>
</tbody>
</table>
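
To make the legend concrete, here is a minimal sketch of the default 4x4 map as plain Python data. The layout matches the board printed in the playback at the end of this notebook. The names `GRID`, `tile_at`, `reward_for`, and `is_terminal` are illustrative and not part of Gym, and the sketch assumes states are numbered row by row, from 0 in the top-left corner to 15 in the bottom-right.

```python
# Default 4x4 Frozen Lake layout, written row by row (illustrative helper names).
GRID = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]

def tile_at(state: int) -> str:
    """Return the tile letter for a flat state index (row-major numbering)."""
    n_cols = len(GRID[0])
    row, col = divmod(state, n_cols)
    return GRID[row][col]

def reward_for(state: int) -> int:
    """Reward received on entering a state: 1 on the goal, 0 everywhere else."""
    return 1 if tile_at(state) == "G" else 0

def is_terminal(state: int) -> bool:
    """An episode ends on a hole (H) or on the goal (G)."""
    return tile_at(state) in ("H", "G")

holes = [s for s in range(16) if tile_at(s) == "H"]
print(holes)            # states that end the episode with no reward
print(reward_for(15))   # the goal sits in the bottom-right corner
```

The hole states found this way are 5, 7, 11, and 12. The agent never acts from a terminal state, so those rows of the Q-table stay at zero throughout training.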

---

# <font size='4' color='DeepSkyBlue'><b>Configurations</b></font>

Install **Gym**

In [1]:
pip install gym -q

Note: you may need to restart the kernel to use updated packages.


Turn off Jedi so that tab auto-completion works reliably in the notebook

In [2]:
%config Completer.use_jedi = False

# <font size='4' color='DeepSkyBlue'><b>Libraries</b></font>

In [3]:
import random
import time

import numpy as np
import gym

from IPython.display import clear_output

# <font size='4' color='DeepSkyBlue'><b>Create the Environment</b></font>

In [4]:
env = gym.make('FrozenLake-v1')
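
The cells below call `env.reset()` and `env.step(action)` in the classic Gym style, where `reset` returns just the state and `step` returns a 4-tuple. Gym 0.26 and later (and Gymnasium) instead return `(state, info)` from `reset` and a 5-tuple `(state, reward, terminated, truncated, info)` from `step`. If you hit unpacking errors, a small adapter along these lines may help. `unpack_reset` and `unpack_step` are hypothetical helper names, not Gym functions, and plain tuples stand in for real environment output so the sketch runs without Gym.

```python
def unpack_reset(result):
    """Return only the initial state from either API style."""
    # Newer API: (state, info) tuple; classic API: the bare state.
    if isinstance(result, tuple):
        return result[0]
    return result

def unpack_step(result):
    """Normalize a step result to the classic (state, reward, done, info) shape."""
    if len(result) == 5:
        state, reward, terminated, truncated, info = result
        return state, reward, terminated or truncated, info
    return result

# Plain tuples stand in for real environment return values here.
print(unpack_reset(0))                           # classic reset
print(unpack_reset((0, {"prob": 1})))            # newer reset
print(unpack_step((4, 0.0, False, False, {})))   # newer step
```

In the training loop this would look like `state = unpack_reset(env.reset())` and `new_state, reward, done, info = unpack_step(env.step(action))`.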

# <font size='4' color='DeepSkyBlue'><b>Create the Q-table</b></font>

In [5]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

In [6]:
q_table = np.zeros((state_space_size, action_space_size))
q_table

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

# <font size='4' color='DarkTurquoise'><b>Initializing Q-learning parameters</b></font>

In [7]:
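num_episodes = int(1e4)        # total number of training episodes
max_steps_per_episode = 100    # step limit for a single episode

lr = 1e-1    # learning rate (alpha)
dr = 0.99    # discount rate (gamma)

exploration_rate = 1     # epsilon; start fully exploratory
max_exp_rate = 1         # upper bound for epsilon
min_exp_rate = 1e-3      # lower bound for epsilon
exp_decay_rate = 1e-3    # exponential decay rate of epsilon per episode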

# building a reward list for all the episodes
rewards_all_episodes = []
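
To get a feel for how the exploration parameters interact, here is a small sketch of the exponential decay schedule that the training loop applies after each episode. It reuses the same three values as above. `exploration_rate_at` is an illustrative helper name, not something the notebook defines.

```python
import numpy as np

max_exp_rate = 1
min_exp_rate = 1e-3
exp_decay_rate = 1e-3

def exploration_rate_at(episode: int) -> float:
    """Exploration rate after `episode` episodes of exponential decay."""
    return min_exp_rate + (max_exp_rate - min_exp_rate) * np.exp(-exp_decay_rate * episode)

for episode in (0, 1_000, 5_000, 10_000):
    print(f"episode {episode:>6,}: epsilon = {exploration_rate_at(episode):.4f}")
```

With a decay rate of `1e-3`, epsilon drops to roughly 37% of its starting range after 1,000 episodes (since `exp(-1)` is about 0.37) and sits very close to `min_exp_rate` by episode 10,000, so the agent explores heavily early on and mostly exploits late in training.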

# <font size='4' color='DarkTurquoise'><b>Building the Q-learning Algorithm</b></font>

In [8]:
# Q-learning algorithm
for episode in range(num_episodes):
    # reset the environment and get the starting state
    state = env.reset()
    # flag that becomes True once the episode has ended
    done = False
    
    # total reward collected during the current episode
    rewards_current_episode = 0
    
    # for loop for each step for the agent
    for step in range(max_steps_per_episode):
        
        # apply the epsilon-greedy strategy
        random_number = random.uniform(0, 1)
        # exploration vs. exploitation trade-off
        if random_number > exploration_rate:
            # exploitation: pick the action with the highest Q-value for the current state
            action = np.argmax(q_table[state, :])
        else:
            # exploration: pick a random action
            action = env.action_space.sample()
        
        # take the action and observe the new state, the reward, and whether the episode has ended
        new_state, reward, done, info = env.step(action)
        
        # update Q(s, a) with the Q-learning rule (based on the Bellman equation):
        # new value = (1 - lr) * old Q-value + lr * learned value
        q_table[state, action] = (1 - lr) * q_table[state, action] + \
                                 lr * (reward + dr * np.max(q_table[new_state, :]))

        # transition to the next state
        state = new_state
        rewards_current_episode += reward
        
        # check to see if our last action ended the episode for us,
        # meaning, did our agent step in a hole or reach the goal?
        if done:
            break
        # If the action did end the episode, then we jump out of this loop and move on to the next episode.
        # Otherwise, we transition to the next time-step.
    
    # Exploration Rate Decay
    # https://en.wikipedia.org/wiki/Exponential_decay
    exploration_rate = min_exp_rate + \
                      (max_exp_rate - min_exp_rate) * np.exp(-exp_decay_rate * episode)
    
    # append the current rewards in the list of rewards
    rewards_all_episodes.append(rewards_current_episode)
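
To see what a single update does, it can help to pull the rule out of the loop into a tiny function and run it on hand-picked numbers. This is only a sketch with the same `lr` and `dr` values as above. `q_update` and the toy two-state table are made up for illustration.

```python
import numpy as np

lr = 0.1   # learning rate
dr = 0.99  # discount rate

def q_update(q_table, state, action, reward, new_state):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    old_value = q_table[state, action]
    learned_value = reward + dr * np.max(q_table[new_state, :])
    q_table[state, action] = (1 - lr) * old_value + lr * learned_value
    return q_table[state, action]

# Toy table: 2 states, 2 actions, all zeros to start (illustrative only).
toy_q = np.zeros((2, 2))

# Reaching the goal (reward 1) from state 0 with action 1, landing in state 1.
first = q_update(toy_q, state=0, action=1, reward=1, new_state=1)
# Repeating the same transition moves the estimate further toward 1.
second = q_update(toy_q, state=0, action=1, reward=1, new_state=1)
print(first, second)
```

Each update moves the estimate a fraction `lr` of the way from the old value toward the learned value, so the entry goes from 0 to 0.1 and then to 0.19 rather than jumping straight to 1.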

# <font size='4' color='DarkTurquoise'><b>Examining the rewards</b></font>

In [9]:
# Calculate the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000

print("Average rewards per thousand episodes".center(100, '*'))
for reward in rewards_per_thousand_episodes:
    print(f'Count No. {count:,}: {sum(reward/1000)}')
    count += 1000

*******************************Average rewards per thousand episodes********************************
Count No. 1,000: 0.04300000000000003
Count No. 2,000: 0.20700000000000016
Count No. 3,000: 0.44700000000000034
Count No. 4,000: 0.6160000000000004
Count No. 5,000: 0.6450000000000005
Count No. 6,000: 0.7080000000000005
Count No. 7,000: 0.7040000000000005
Count No. 8,000: 0.7410000000000005
Count No. 9,000: 0.7120000000000005
Count No. 10,000: 0.7520000000000006


From the printout, we can see that the average reward per thousand episodes did indeed improve over time. When the algorithm first started training, the first thousand episodes averaged a reward of only `0.043`, but by the last thousand episodes the average had improved to `0.752`.

# <font size='4' color='DarkTurquoise'><b><a href='https://deeplizard.com/learn/video/HGeI30uATws#:~:text=%20interpreting%20the%20training%20results%20'>Interpreting the training results</a></b></font>

Our agent played `10,000` episodes. At each time step within an episode, the agent received a reward of `1` if it reached the frisbee; otherwise, it received a reward of `0`. If the agent did reach the frisbee, the episode finished at that time step.

That means the total reward the agent receives for an entire episode is either `1` or `0`. So, for the first thousand episodes, we can interpret the score as meaning that about **4%** of the time, the agent received a reward of `1` and won the episode. By the last thousand of the 10,000 episodes, the agent was winning about **75%** of the time.
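
Because each episode's total reward is `0` or `1`, the mean over a block of episodes is simply the fraction of wins in that block. Here is a small sketch of that idea using made-up rewards (not results from this notebook). `win_rate_per_block` is a hypothetical helper name.

```python
import numpy as np

def win_rate_per_block(rewards, block_size):
    """Mean reward per consecutive block; with 0/1 rewards this is the win rate."""
    rewards = np.asarray(rewards, dtype=float)
    n_blocks = len(rewards) // block_size
    # Drop any leftover episodes that do not fill a whole block.
    trimmed = rewards[: n_blocks * block_size]
    return trimmed.reshape(n_blocks, block_size).mean(axis=1)

# Made-up rewards for 8 episodes, grouped into blocks of 4.
toy_rewards = [0, 0, 1, 0, 1, 1, 0, 1]
rates = win_rate_per_block(toy_rewards, block_size=4)
print(rates)
```

In the toy data, one win out of four episodes gives 0.25 for the first block, and three out of four gives 0.75 for the second. Calling `win_rate_per_block(rewards_all_episodes, 1000)` would apply the same idea to the training run.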

In [10]:
print("Q-Table".center(100, '*'))
print()

for row in q_table:
    print(' '* 25, row)

**********************************************Q-Table***********************************************

                          [0.54820119 0.49994904 0.47970801 0.51061614]
                          [0.36958845 0.33922171 0.23301015 0.54016008]
                          [0.40585435 0.40038556 0.39276924 0.48140859]
                          [0.28167693 0.30772348 0.3291846  0.45667489]
                          [0.56209941 0.3012319  0.34741145 0.36623447]
                          [0. 0. 0. 0.]
                          [0.15210576 0.16754297 0.34680018 0.14607937]
                          [0. 0. 0. 0.]
                          [0.37211351 0.310166   0.29906663 0.60830448]
                          [0.42174115 0.66645983 0.32512315 0.39979986]
                          [0.66244579 0.36336012 0.397786   0.33473133]
                          [0. 0. 0. 0.]
                          [0. 0. 0. 0.]
                          [0.48416014 0.58513452 0.75380664 0.60579725]
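
One way to read this table is to take the highest-valued action in each row, which gives the greedy policy the agent follows during playback. The sketch below does that for the first four rows printed above (the top row of the grid). It assumes FrozenLake's action numbering of 0 = Left, 1 = Down, 2 = Right, 3 = Up. `ACTION_NAMES` and `q_rows` are illustrative names.

```python
import numpy as np

ACTION_NAMES = ["Left", "Down", "Right", "Up"]  # FrozenLake action indices 0..3

# First four rows of the Q-table printed above (states 0-3).
q_rows = np.array([
    [0.54820119, 0.49994904, 0.47970801, 0.51061614],
    [0.36958845, 0.33922171, 0.23301015, 0.54016008],
    [0.40585435, 0.40038556, 0.39276924, 0.48140859],
    [0.28167693, 0.30772348, 0.3291846, 0.45667489],
])

# Index of the largest Q-value in each row, then its human-readable name.
greedy_actions = np.argmax(q_rows, axis=1)
policy = [ACTION_NAMES[a] for a in greedy_actions]
print(policy)
```

For these rows the greedy choices are Left, Up, Up, Up. That can look counterintuitive. The default Frozen Lake is slippery, though, so the action chosen is not always the move that actually happens, and the learned values reflect that uncertainty.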
                  

# <font size='4' color='DarkTurquoise'><b>Building the Q-learning Interface</b></font>

Let's watch the trained agent play **Frozen Lake** step by step

In [11]:
for episode in range(5):
    state = env.reset()
    done = False
    print(f'Episode: {episode+1}'.center(50, '='))
    time.sleep(1)
    
    for step in range(max_steps_per_episode):
        # clear the previously drawn board
        clear_output(wait=True)
        # allows you to check the agent's environment
        env.render()
        time.sleep(0.4)
        
        # invoke the action with the highest Q-value from the Q-Table for the current state
        action = np.argmax(q_table[state, :])
        
        # take the action and move to the new state
        new_state, reward, done, info = env.step(action)
        
        # if the episode ended, show the final board and the outcome
        if done:
            clear_output(wait=True)
            env.render()
            if reward == 1:
                print('You reach the goal!'.center(50, '*'))
                time.sleep(3)
            else:
                print('You fall through a hole!'.center(50, '-'))
                time.sleep(3)
                clear_output(wait=True)
            break

        # transition to the new state
        state = new_state

# close the environment
env.close()

  (Down)
SFFF
FHFH
FFFH
HFFG
***************You reach the goal!****************
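
Watching five episodes gives only a rough impression of how good the learned policy is. As a hedged sketch of a numeric check, the function below runs the greedy policy for many episodes without rendering and counts the wins. `success_rate` and `CorridorEnv` are illustrative names. The tiny stand-in environment is there only so the snippet runs without Gym, and the function assumes the classic `reset`/`step` return values used in this notebook.

```python
import numpy as np

def success_rate(env, q_table, episodes=100, max_steps=100):
    """Fraction of episodes in which the greedy policy earns a reward of 1.

    Assumes the classic Gym API: reset() -> state, step() -> 4-tuple.
    """
    wins = 0
    for _episode in range(episodes):
        state = env.reset()
        for _step in range(max_steps):
            action = np.argmax(q_table[state, :])
            state, reward, done, _info = env.step(action)
            if done:
                wins += int(reward == 1)
                break
    return wins / episodes

class CorridorEnv:
    """Hypothetical 3-state stand-in for a Gym env: action 1 moves right, goal is state 2."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state += 1
        done = self.state == 2
        return self.state, float(done), done, {}

# Toy Q-table that always prefers action 1 (move right) in states 0 and 1.
toy_q = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
print(success_rate(CorridorEnv(), toy_q, episodes=10))
```

With the real environment, the call would be `success_rate(env, q_table, episodes=1000)`, made before `env.close()`.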



***Applied by [Ahmed](https://www.linkedin.com/in/ai-ahmed/) – Environment [Gradient](https://console.paperspace.com/ai-ahmed/notebook/r1v841exffrzbek)***
- Github: [AI-Ahmed](https://github.com/AI-Ahmed)
- Kaggle: [Ahmed](https://www.kaggle.com/dsxavier)