### Practicing RL with OpenAI Gym
1. Example of importing an environment, here the CartPole model:

In [1]:
import gym
#import Box2D
env = gym.make('CartPole-v0')
env.reset()
for _ in range(100):
    env.render()
    env.step(env.action_space.sample()) # take a random action
env.close()



2. Now we will apply the RL skills we have learned to play the FrozenLake game.

First, import dependencies:

In [2]:
import numpy as np
import gym
import random
import time
from IPython.display import clear_output

Create the environment.
Note: with this new env object we can:
1. Query for information about the environment,
2. Sample states and actions, retrieve rewards, and
3. Have our agent navigate the environment (FrozenLake).


In [3]:
env = gym.make("FrozenLake-v0")

Now we construct the Q-table and initialize all of its values to 0.
In the Q-table,

- number of rows = size of the state space (i.e. all possible states)

- number of columns = size of the action space (i.e. all possible actions).

We get these sizes from env.observation_space.n and env.action_space.n.


In [4]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size,action_space_size))
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


Now we initialize the variables that our Q-learning algorithm needs.

Tune these to see how the algorithm behaves.

In [14]:
num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.2 # alpha: how quickly we adopt the newly learned Q-value
discount_rate = 0.99 # gamma: how much we discount future rewards

# The following variables control the epsilon-greedy strategy:
exploration_rate = 1 
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.01
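
To get a feel for the epsilon-greedy settings, the sketch below evaluates the decay formula that the training loop later applies to exploration_rate, using the values from this cell. The helper name exploration_rate_after is hypothetical and only serves the illustration; it uses math.exp in place of np.exp so that it runs without NumPy.

```python
import math

# Values copied from the hyperparameter cell above.
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.01


def exploration_rate_after(episode):
    """Exploration rate (epsilon) given by the decay formula for `episode`.

    Hypothetical helper, not part of the original notebook.
    """
    span = max_exploration_rate - min_exploration_rate
    return min_exploration_rate + span * math.exp(-exploration_decay_rate * episode)


# Print epsilon at a few points of the training run.
for episode in (0, 100, 500, 1000):
    print(episode, round(exploration_rate_after(episode), 4))
```

With exploration_decay_rate = 0.01 the rate falls to roughly 0.37 after 100 episodes and is within 0.01 of min_exploration_rate after 500, so most of the 10000 episodes are spent exploiting. A smaller decay rate keeps the agent exploring for longer.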


Important built-in functions:

### 1. step( ) function

The *environment’s* step function returns exactly what we need. In the Gym release used here, step returns four values:
#### observation (object):
an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
#### reward (float):
amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
#### done (boolean):
whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
#### info (dict):
diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.
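
As a minimal sketch of this interface, the toy environment below mimics reset() and the four-value return of step() without depending on Gym. TinyCorridorEnv, env_demo and the contents of info are hypothetical and are not part of Gym; only the shape of the returned tuple is meant to match the description above.

```python
class TinyCorridorEnv:
    """Hypothetical 4-cell corridor: the agent starts in cell 0, the goal is cell 3."""

    def __init__(self):
        self.n_states = 4
        self.state = 0

    def reset(self):
        # Put the agent back at the start and return the initial observation.
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clamp the move.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        info = {"position": self.state}  # made-up diagnostic payload
        return self.state, reward, done, info


env_demo = TinyCorridorEnv()
observation = env_demo.reset()
trajectory = []
done = False
while not done:
    observation, reward, done, info = env_demo.step(1)  # always move right
    trajectory.append((observation, reward, done))
print(trajectory)
```

Walking right from cell 0 gives three transitions, with reward 1.0 and done set to True only on the last one. The FrozenLake training loop relies on the same pattern.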


In [15]:
t0 = time.time()
# One entry per episode (10000 in total), holding that episode's total reward.
# Shows how our game scores (rewards) change over time.
rewards_all_episodes = []

# Q-learning Algorithm
for episode in range(num_episodes):
    state = env.reset() # The process gets started by calling reset(), 
                        #  which returns an initial observation. 
    done = False
    rewards_current_episode = 0
    
    for step in range(max_steps_per_episode):
        
        # Exploration VS Exploitation trade-off
        exploration_rate_threshold = random.uniform(0,1) # choosing r
        
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state,:]) # exploit: argmax() returns the index of the largest Q-value
        else:
            action = env.action_space.sample() # sample() = Randomly sample an element of this space.
        
        new_state, reward, done, info = env.step(action)
        
        # Update Q-table for Q(s,a)
        q_table[state,action] = q_table[state,action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state,:]))
        
        state = new_state
        rewards_current_episode += reward
        
        if done:
            break
            
    # Exploration rate decay
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)
    
    rewards_all_episodes.append(rewards_current_episode)

t1 = time.time()
time_taken = t1 - t0
# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000
print("****** Average reward of 1000 episodes ******\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count+=1000
    
# Print Updated Q-table
print("*** Q-table ***\n")
print(q_table)
print("time taken: ", time_taken)


****** Average reward of 1000 episodes ******

1000 :  0.5190000000000003
2000 :  0.6760000000000005
3000 :  0.6400000000000005
4000 :  0.6750000000000005
5000 :  0.6910000000000005
6000 :  0.6830000000000005
7000 :  0.6680000000000005
8000 :  0.6680000000000005
9000 :  0.6480000000000005
10000 :  0.6660000000000005
*** Q-table ***

[[0.56334964 0.47729182 0.49517438 0.48374302]
 [0.23954653 0.24099427 0.25940169 0.46358531]
 [0.33653257 0.36447568 0.40102227 0.43001606]
 [0.28009886 0.3390946  0.33329062 0.42043122]
 [0.56942563 0.31245055 0.31881583 0.34389535]
 [0.         0.         0.         0.        ]
 [0.27001621 0.09030798 0.11085526 0.10945093]
 [0.         0.         0.         0.        ]
 [0.34061865 0.49539219 0.3976969  0.59916974]
 [0.4153361  0.68267982 0.48971283 0.45717819]
 [0.70021702 0.0992978  0.37185644 0.21903668]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.40355408 0.34558618 0.74172148 0.35737937]
 [0.760
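
To see what a single Q-table update does, here is a small worked sketch of the update rule from the training loop, using the same learning_rate and discount_rate. The function q_update and the reward and Q-value inputs are hypothetical, chosen for easy arithmetic; they are not taken from the run above.

```python
learning_rate = 0.2   # alpha, as in the hyperparameter cell
discount_rate = 0.99  # gamma, as in the hyperparameter cell


def q_update(old_q, reward, best_next_q):
    """One Q-learning update: blend the old estimate with the new target.

    Hypothetical helper mirroring the update line in the training loop.
    """
    target = reward + discount_rate * best_next_q
    return old_q * (1 - learning_rate) + learning_rate * target


# A state-action pair that steps onto the goal (reward 1, terminal next state).
first = q_update(old_q=0.0, reward=1.0, best_next_q=0.0)
second = q_update(old_q=first, reward=1.0, best_next_q=0.0)

# A state one step earlier: no reward yet, but its successor is now worth `first`.
upstream = q_update(old_q=0.0, reward=0.0, best_next_q=first)

print(first, second, upstream)
```

The first update moves the estimate from 0 to about 0.2 and the second to about 0.36. A state one step earlier picks up about 0.0396 through the discounted maximum of its successor. This is how the reward at the goal propagates backwards through the table over many episodes.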

In [11]:
# Visualize
for episode in range(3):
    state = env.reset()
    done = False
    print("**** Episode ", episode + 1, "***\n\n")
    time.sleep(1)
    
    for step in range(max_steps_per_episode):
        clear_output(wait = True)
        env.render()
        time.sleep(0.3)
        
        action = np.argmax(q_table[state,:])
        new_state, reward, done, info = env.step(action)
        
        if done:
            clear_output(wait = True)
            env.render()
            if reward == 1:
                print("You reached the goal")
                time.sleep(3)
            else:
                print("You fell into a hole")
                time.sleep(3)
            clear_output(wait = True)
            break
            
        state = new_state
        
env.close()

  (Left)
SFFF
FHFH
FFFH
HFFG
You fell into a hole
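
The visualization loop picks np.argmax(q_table[state,:]) at every step, so the learned behaviour can also be read off the Q-table directly. The sketch below does this for three complete rows copied from the Q-table output shown earlier (states 0, 9 and 13). The helper greedy_action and the ACTION_NAMES list are assumptions added for illustration; the index-to-direction mapping follows FrozenLake's convention of 0 = Left, 1 = Down, 2 = Right, 3 = Up.

```python
# Action indices used by FrozenLake: 0 = Left, 1 = Down, 2 = Right, 3 = Up.
ACTION_NAMES = ["Left", "Down", "Right", "Up"]

# Three rows copied from the learned Q-table printed earlier (states 0, 9, 13).
q_rows = {
    0: [0.56334964, 0.47729182, 0.49517438, 0.48374302],
    9: [0.4153361, 0.68267982, 0.48971283, 0.45717819],
    13: [0.40355408, 0.34558618, 0.74172148, 0.35737937],
}


def greedy_action(q_values):
    """Index of the largest Q-value, which is what np.argmax returns for a 1-D row.

    Hypothetical pure-Python stand-in so the example runs without NumPy.
    """
    return max(range(len(q_values)), key=lambda a: q_values[a])


policy = {state: ACTION_NAMES[greedy_action(row)] for state, row in q_rows.items()}
print(policy)
```

For these rows the greedy choices are Left in state 0, Down in state 9 and Right in state 13.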
