# Solving FrozenLake with Q-Learning

Install the OpenAI `gym` package to get the FrozenLake environment.

In [33]:
!pip install gym



Import the following packages, as we'll need them later.

In [0]:
import numpy as np
import gym
import random
# For the plotting
from IPython.display import clear_output
from time import sleep

Load the FrozenLake Environment

In [35]:
env = gym.make("FrozenLake-v0").env



For the curious, we'll display the state and action space size for the game

In [36]:
print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

Action Space Discrete(4)
State Space Discrete(16)


This is the graphical representation of the game:
  - The highlighted tile is your character.
  - `S` is the starting tile.
  - `F` is a frozen tile, which is safe to stand on.
  - `H` is a hole.
  - `G` is the goal.
  

The goal of this game is to navigate from the starting position to the goal without falling into any of the holes. The catch is that this environment is stochastic: the agent does not always move in the direction it chooses. Sometimes it slips and moves in a different direction instead.
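
To make the slipping concrete, here is a minimal sketch of how a single move could be resolved. It assumes the default slippery dynamics of FrozenLake, where the intended direction and the two perpendicular directions are each taken with probability 1/3, and the action encoding 0=Left, 1=Down, 2=Right, 3=Up. The helper names `possible_outcomes` and `sample_actual_direction` are made up for this illustration and are not part of `gym`.

```python
import random

# Action encoding used by FrozenLake: 0=Left, 1=Down, 2=Right, 3=Up
ACTION_NAMES = ["Left", "Down", "Right", "Up"]

def possible_outcomes(action):
    """Return the three directions a slippery step may actually take.

    Assumption: the intended direction and its two perpendicular
    neighbours are equally likely; the opposite direction never happens.
    """
    return [(action - 1) % 4, action, (action + 1) % 4]

def sample_actual_direction(action, rng=random):
    # Pick one of the three candidate directions uniformly at random
    return rng.choice(possible_outcomes(action))

intended = 2  # the agent asks to go Right
print("Intended:", ACTION_NAMES[intended])
print("Could end up going:", [ACTION_NAMES[a] for a in possible_outcomes(intended)])
print("This time:", ACTION_NAMES[sample_actual_direction(intended)])
```

Under these assumptions the agent only moves where it intended about a third of the time, which is why even a well-trained policy can take many steps or still fall into a hole.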

In [37]:
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG
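
The 16 states reported by the observation space correspond to the 16 tiles of this 4x4 grid, numbered row by row from the top-left corner. A small sketch of the conversion, with hypothetical helper names `state_to_position` and `position_to_state`:

```python
N_COLS = 4  # the default map is a 4x4 grid

def state_to_position(state, n_cols=N_COLS):
    """Convert a state index into (row, column), counted from the top-left."""
    return divmod(state, n_cols)

def position_to_state(row, col, n_cols=N_COLS):
    """Convert (row, column) back into a state index."""
    return row * n_cols + col

print(state_to_position(0))     # the start tile S, top-left corner
print(state_to_position(15))    # the goal tile G, bottom-right corner
print(position_to_state(1, 1))  # the first hole H in the second row
```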


## Q Learning Agent

The `QAgent` class houses all the logic for the Q-Learning agent. Since there's a lot going on here, this section is longer than the others.

### Parameters

Several parameters are hard-coded into the model. When applying it to a different problem, they should be tweaked to see how they affect performance. We describe each parameter briefly here.



* Epsilon: The exploration rate, i.e. the probability that the agent chooses a random move during training instead of relying on information it already has. Random moves take the agent down paths it normally wouldn't try, in the hope of finding higher long-term rewards.
  - Epsilon Decay: The factor epsilon is multiplied by after each update, so exploration shrinks over time.
  - Epsilon Min: The lowest rate of exploration we'll allow during training.
* Gamma: The discount rate. It sets how much we prioritize immediate rewards versus future rewards: values near 0 favor immediate rewards, values near 1 weigh future rewards almost as heavily.
* Alpha: Controls how far we shift our existing knowledge toward new information on each update.
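
As a rough illustration of how the decay and minimum interact, the sketch below counts how many updates it takes before epsilon stops shrinking. The default values (0.1, 0.001, 0.9995) are the ones hard-coded in the agent in this notebook; the function name `updates_until_min` is made up for this example.

```python
def updates_until_min(epsilon=0.1, epsilon_min=0.001, epsilon_decay=0.9995):
    """Count the updates needed before epsilon reaches its minimum."""
    n = 0
    # Same rule the agent applies: only decay while above the minimum
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        n += 1
    return n, epsilon

n_updates, final_epsilon = updates_until_min()
print(f"Epsilon stops decaying after {n_updates} updates at {final_epsilon:.6f}")
```

Because the decay is applied on every update rather than every episode, exploration reaches its floor fairly early in a long training run. A slower decay or a higher minimum would keep the agent exploring for longer.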



### Fixed Q-Targets

In Q-Learning we update our Q_Table through the following function:

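$Q_{TableEntry}(state, action) = Reward + \gamma \max_{action'} Q_{TableEntry}(state', action')$

Here $state'$ is the state we land in after taking the action, and $\gamma$ is the discount rate (Gamma) from the parameter list.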

Because each update reads from the same table it writes to, the values we learn toward keep shifting as we learn, and the entries become correlated. This can cause oscillations during training.

To combat this, we implemented a target model. It is essentially a copy of the original model whose values update more slowly. How quickly the target model tracks the original depends on `Alpha` in our parameter list.
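
The gradual update can be sketched in isolation. This is a minimal illustration, assuming the model's values stay fixed while the target catches up; in real training the model keeps changing as well. The value `alpha = 0.2` matches the one set in the agent in this notebook.

```python
import numpy as np

alpha = 0.2                    # blending rate, same value as the agent's alpha
model = np.array([1.0, 0.0])   # pretend the learned values stay fixed
target_model = np.zeros(2)     # the target starts out empty

for step in range(10):
    # Move the target a fraction alpha of the way toward the model
    target_model = (1 - alpha) * target_model + alpha * model

print(target_model)
```

After `n` such updates the remaining gap between the target and a fixed model is `(1 - alpha) ** n` of the original gap, so a smaller `alpha` makes the target lag further behind but also makes it more stable.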

### Agent Workflow

1. Create an empty q-table for both the model and target model.
2. Given a starting state, perform an action.
3. Once you have performed the action on the environment, collect the information it returns: the reward, the next state, whether the episode is finished, etc.
4. Then calculate the value of the state.
  - If the game is finished, then the reward is the value of that state.
  - If not, then take the current reward and add the discounted reward of future states.
5. Update the model with the new value.
6. Decay the epsilon value as described in the parameters.
7. Gradually update the target model.
8. Perform an action for the new state, either randomly through epsilon, or by choosing the best action based on what we currently know.
9. Repeat steps 3-8.
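
Step 4 is the heart of the workflow, so here is a small worked sketch of it on its own. The function name `q_target` and the example table values are made up for illustration; `gamma = 0.6` is the discount rate used by the agent in this notebook.

```python
import numpy as np

def q_target(reward, next_state, done, q_table, gamma=0.6):
    """Value to store for the (state, action) pair that was just played."""
    if done:
        # Terminal step: nothing follows, so the reward is the whole value
        return reward
    # Otherwise add the discounted value of the best action in the next state
    return reward + gamma * np.max(q_table[next_state])

q_table = np.zeros((16, 4))           # 16 states x 4 actions
q_table[14] = [0.0, 0.5, 0.2, 0.1]    # hypothetical values for state 14

print(q_target(0.0, 14, False, q_table))  # 0.0 + 0.6 * 0.5
print(q_target(1.0, 15, True, q_table))   # episode over, value is the reward
```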

In [0]:
class QAgent:
  def __init__(self, state_size, action_size):
    self.state_size = state_size
    self.action_size = action_size
    self.gamma = 0.6 # Discount Rate
    self.epsilon = 0.1 # Exploration Rate
    self.epsilon_min = 0.001
    self.epsilon_decay = 0.9995
    self.model = self._build_model()
    ## Additional components for Fixed Q-Targets
    self.target_model = self._build_model()
    # Blend 20% of the model into the target model on each update
    self.alpha = 0.2
    
  def _build_model(self):
    # Assumes both self.state_size and self.action_size are lists, so `+`
    # concatenates them into the table shape, e.g. [16] + [4] -> [16, 4]
    model = np.zeros(self.state_size + self.action_size)
    return model
  
  def update_target_model(self):
    self.target_model = (1 - self.alpha) * self.target_model + self.alpha * self.model
    
  def act_random(self):
    return random.randrange(self.action_size[0])
  
  def best_act(self, state):
    # Choose the best action based on what we know
    # If all the action values are the same, then choose randomly
    q_values = self.target_model[state]
    if np.all(q_values == q_values[0]):
      return self.act_random()
    return np.argmax(q_values)
  
  def act(self, state):
    # Act randomly epsilon percent of time, otherwise act greedily
    action = self.act_random() if np.random.rand() <= self.epsilon else self.best_act(state)
    return action
  
  def update(self, state, action, reward, next_state, done):
    target = reward
    if not done:
      target = reward + self.gamma * np.amax(self.target_model[next_state])
    self.model[state,action] = target
    self.update_target_model()
    if self.epsilon > self.epsilon_min:
      self.epsilon *= self.epsilon_decay
      
  
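  def load(self, name):
    # The Q-table is a plain NumPy array, so read it back with np.load
    # (pass the full file name, including the .npy extension)
    self.model = np.load(name)
    # best_act reads from the target model, so keep it in sync
    self.target_model = self.model.copy()
    
  def save(self, name):
    # np.save appends ".npy" if name does not already end with it
    np.save(name, self.model)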

## Training

Now that we have defined our agent, let us train it by playing a lot of games. By the end, we hope the agent has been through a variety of situations and has learned the best way to handle each one.

In [39]:
%%time
"""Training the agent"""
state_size =  [ env.observation_space.n ]
action_size = [ env.action_space.n ]
agent = QAgent(state_size, action_size)
EPISODES = 100001
for i in range(1, EPISODES):
  state = env.reset()
  done = False
  while not done:
    action = agent.act(state)
    next_state, reward, done, info = env.step(action) 
    agent.update(state, action, reward, next_state, done)
    state = next_state
       
  if i % 100 == 0:
    clear_output(wait=True)
    print(f"Episode: {i}")

print("Training finished.\n")

Episode: 100000
Training finished.

CPU times: user 2min 55s, sys: 2.96 s, total: 2min 58s
Wall time: 2min 56s


## Evaluate Performance of agent

The code below animates the recorded frames of an attempt.

In [0]:
def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'].getvalue())
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.2)
        

Now onto the simulated trials.

In [41]:
env.reset()


for episode in range(5):
    state = env.reset()
    frames = []
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    while not done:
      step = step + 1
      action = agent.best_act(state)
       
      new_state, reward, done, info = env.step(action)
        
      frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
      })
        
      if done:
        # Show the last state
        env.render()

        # Print the number of steps it took.
        print("Number of steps", step)
        break
      state = new_state


****************************************************
EPISODE  0
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
Number of steps 124
****************************************************
EPISODE  1
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
Number of steps 48
****************************************************
EPISODE  2
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
Number of steps 52
****************************************************
EPISODE  3
  (Right)
SFFF
FHF[41mH[0m
FFFH
HFFG
Number of steps 33
****************************************************
EPISODE  4
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
Number of steps 21


In [42]:
print_frames(frames)

  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m

Timestep: 21
State: 14
Action: 1
Reward: 1.0
