<h1><center>Module 10 - Reinforcement Learning</center>
    Case Study 2: Frozen Lake Environment</h1>

Goals of this project:
<li>Teach an AI agent to solve the Frozen Lake environment using reinforcement learning.
<li>Use a pre-existing simulation environment, OpenAI Gym, and begin with the non-slippery version.

<b>Frozen Lake environment </b> 
In reinforcement learning, Frozen Lake is a problem in which an agent navigates a grid of icy terrain.
The Frozen Lake environment is a 4×4 grid containing four kinds of tiles: Start (S), Frozen (F), Hole (H), and Goal (G). The AI, or agent, moves around the grid until it reaches the goal or falls into a hole, at which point the episode ends. Reaching the goal gives a reward of 1; falling into a hole gives a reward of 0, and the agent starts again from the beginning in the next episode. Training repeats over many episodes until the agent learns from its mistakes and reaches the goal reliably. The agent has 4 possible actions: go LEFT, DOWN, RIGHT, or UP. It must learn to avoid holes in order to reach the goal in a minimal number of actions.
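
As a rough illustration of the layout described above, the sketch below writes out the default 4×4 map that Gym uses for Frozen Lake and converts a state index into a (row, column) position. States are numbered row by row from 0 to 15. The names `GRID`, `state_to_cell`, and `tile_at` are assumptions introduced here for illustration; they are not part of the Gym API.

```python
# Default 4x4 Frozen Lake map: S = start, F = frozen, H = hole, G = goal.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]

def state_to_cell(state, n_cols=4):
    """Convert a row-major state index (0..15) to a (row, col) pair."""
    return divmod(state, n_cols)

def tile_at(state):
    """Return the tile letter found at a given state index."""
    row, col = state_to_cell(state)
    return GRID[row][col]

# Collect the state indices of every hole and of the goal.
holes = [s for s in range(16) if tile_at(s) == "H"]
goal = [s for s in range(16) if tile_at(s) == "G"]
print("holes:", holes)
print("goal:", goal)
```

Under this map the holes sit at states 5, 7, 11, and 12, and the goal at state 15.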



In [None]:
import gym
import numpy as np
import matplotlib.pyplot as plt


In [None]:
# Non-slippery Frozen Lake from the gym library.
# new_step_api=True makes step() return (obs, reward, terminated, truncated, info).
env = gym.make('FrozenLake-v1', new_step_api=True, is_slippery=False)

Our agent can be found in 16 different positions, called <b>states</b>. For each state, there are 4 possible <b>actions</b>: LEFT, DOWN, RIGHT, or UP. Learning how to play Frozen Lake means learning which action to choose in every state. To know which action is best in a given state, we assign a quality value to each action. We have 16 states and 4 actions, so we have to calculate 16×4 = 64 values.

The values are stored in a table, known as a Q-table, where rows list every state (s) and columns list every action (a). Each cell contains a value <b>Q(s,a)</b>, the quality of taking action a in state s. In this environment the values lie between 0 and 1: close to 1 for an action that leads quickly to the goal, and 0 for an action that leads to no reward. When our agent is in a particular state, it just has to check this table to see which action has the highest value.
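
The table lookup described above can be sketched as follows. The Q-values in this example are made up purely for illustration, and `ACTION_NAMES` and `best_action` are hypothetical helpers, not part of Gym. The action numbering (0 = LEFT, 1 = DOWN, 2 = RIGHT, 3 = UP) follows Gym's Frozen Lake encoding.

```python
import numpy as np

ACTION_NAMES = ["LEFT", "DOWN", "RIGHT", "UP"]  # indices 0, 1, 2, 3

# Hypothetical Q-table: 16 states x 4 actions, with invented values for state 14 only.
q_table = np.zeros((16, 4))
q_table[14] = [0.2, 0.1, 0.9, 0.3]

def best_action(q_table, state):
    """Return the index of the highest-valued action in the given state."""
    return int(np.argmax(q_table[state]))

print(ACTION_NAMES[best_action(q_table, 14)])
```

Note that `np.argmax` returns the first index on ties, so a row that is still all zeros always yields action 0 (LEFT). This is one reason the agent needs some random exploration early in training.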

In [None]:
 
state = env.observation_space.n   # get the number of states
action = env.action_space.n       # get the number of actions
print(state)
print(action)

16
4


In [None]:
env.reset()      # reset the environment to default state
# env.render()   # render the GUI for the environment

0

Let's create a Q-table and fill it with zeros since we still have no idea of the value of each action in each state.

In [None]:
Q = np.zeros((state,action))
Q

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [None]:
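episodes = 500         # total number of training episodes
max_steps = 100        # step cap per episode (not used by the loop below)
learning_rate = 0.5    # how far each update moves Q(s,a) toward its target
gamma = 0.9            # discount factor applied to future rewards
render = False         # rendering toggle (not used by the loop below)
epsilon = 1.0          # initial probability of taking a random action
epsilon_decay = 0.001  # amount subtracted from epsilon after each episode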
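
To make the roles of these hyperparameters more concrete, here is a minimal sketch, assuming the same update rule and linear epsilon schedule as the training loop below. The function `q_update` is a hypothetical helper introduced for illustration. It first applies the update three times to a move that reaches the goal (reward 1, next state worth 0), then runs the epsilon schedule for 500 episodes.

```python
def q_update(q_sa, reward, max_q_next, learning_rate=0.5, gamma=0.9):
    """One tabular Q-learning update for a single (state, action) pair."""
    return q_sa + learning_rate * (reward + gamma * max_q_next - q_sa)

# Repeatedly take a move that reaches the goal: reward 1, terminal next state worth 0.
q = 0.0
history = []
for _ in range(3):
    q = q_update(q, reward=1.0, max_q_next=0.0)
    history.append(q)
print(history)  # [0.5, 0.75, 0.875]

# Linear epsilon schedule: subtract 0.001 once per episode, never going below 0.
epsilon = 1.0
for _ in range(500):
    epsilon = max(epsilon - 0.001, 0)
print(round(epsilon, 3))  # 0.5
```

With a learning rate of 0.5, each update closes half of the remaining gap to the target value. With these settings epsilon only falls to about 0.5 by the end of training, so roughly half of the actions in the last episodes are still random.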


In [None]:
outcomes = []

# Training
for i in range(episodes):
    state = env.reset()
    done = False

    # By default, we consider our outcome to be a failure
    outcomes.append("Failure")
    
    # Until the agent gets stuck in a hole or reaches the goal, keep training it
    while not done:
        # Generate a random number between 0 and 1
        rnd = np.random.random()

        # If random number < epsilon, take a random action
        if rnd < epsilon:
            action = env.action_space.sample()
        # Else, take the action with the highest value in the current state
        else:
            action = np.argmax(Q[state])
             
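        # Implement this action and move the agent in the desired direction.
        # With new_step_api=True, step() returns (obs, reward, terminated, truncated, info).
        new_state, reward, terminated, truncated, info = env.step(action)
        # End the episode on a hole or the goal (terminated) or on the time limit (truncated)
        done = terminated or truncated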
        
        # Update Q(s,a)
        Q[state, action] = Q[state, action] + learning_rate * (reward + gamma * np.max(Q[new_state]) - Q[state, action])
                                
        # Update our current state
        state = new_state

        # If we have a reward, it means that our outcome is a success
        if reward:
            outcomes[-1] = "Success"

    # Update epsilon
    epsilon = max(epsilon - epsilon_decay, 0)

print('Q-table after training:')
print(Q)


Q-table after training:
[[0.531441   0.59049    0.59049    0.531441  ]
 [0.531441   0.         0.6561     0.59049   ]
 [0.59049    0.729      0.59049    0.6561    ]
 [0.6561     0.         0.59048353 0.59048998]
 [0.59048999 0.6561     0.         0.531441  ]
 [0.         0.         0.         0.        ]
 [0.         0.81       0.         0.6561    ]
 [0.         0.         0.         0.        ]
 [0.65609999 0.         0.729      0.59049   ]
 [0.6561     0.80999994 0.81       0.        ]
 [0.729      0.9        0.         0.729     ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.80896108 0.9        0.72893905]
 [0.81       0.9        1.         0.81      ]
 [0.         0.         0.         0.        ]]
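
One way to read the trained table is to follow the highest-valued action from the start state. The sketch below does this with the Q-values copied from the output above. It assumes the default 4×4 map (holes at states 5, 7, 11, 12, goal at 15) and non-slippery movement. The `move` function is a hypothetical stand-in for the environment's deterministic transitions, written here so the example runs without Gym.

```python
import numpy as np

# Q-table copied from the training output above.
Q = np.array([
    [0.531441,   0.59049,    0.59049,    0.531441],
    [0.531441,   0.0,        0.6561,     0.59049],
    [0.59049,    0.729,      0.59049,    0.6561],
    [0.6561,     0.0,        0.59048353, 0.59048998],
    [0.59048999, 0.6561,     0.0,        0.531441],
    [0.0,        0.0,        0.0,        0.0],
    [0.0,        0.81,       0.0,        0.6561],
    [0.0,        0.0,        0.0,        0.0],
    [0.65609999, 0.0,        0.729,      0.59049],
    [0.6561,     0.80999994, 0.81,       0.0],
    [0.729,      0.9,        0.0,        0.729],
    [0.0,        0.0,        0.0,        0.0],
    [0.0,        0.0,        0.0,        0.0],
    [0.0,        0.80896108, 0.9,        0.72893905],
    [0.81,       0.9,        1.0,        0.81],
    [0.0,        0.0,        0.0,        0.0],
])

N = 4                   # grid width and height
HOLES = {5, 7, 11, 12}  # assumed default map
GOAL = 15

def move(state, action):
    """Deterministic transition: 0 = LEFT, 1 = DOWN, 2 = RIGHT, 3 = UP."""
    row, col = divmod(state, N)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, N - 1)
    elif action == 2:
        col = min(col + 1, N - 1)
    else:
        row = max(row - 1, 0)
    return row * N + col

# Follow the greedy action from the start until the goal or a hole is reached.
state, path = 0, [0]
while state != GOAL and state not in HOLES and len(path) <= 16:
    state = move(state, int(np.argmax(Q[state])))
    path.append(state)
print(path)  # [0, 4, 8, 9, 10, 14, 15]
```

The greedy path reaches the goal in 6 moves, the minimum for this map. The values along it are powers of gamma: the final move is worth 1, the one before it 0.9, then 0.81, and so on down to 0.9^5 = 0.59049 for the first move. In state 0, DOWN and RIGHT are tied, and `np.argmax` picks the first of the two (DOWN).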
