# ENGR-E 221 Intelligent Systems I Fall 2023

## Homework 12 Reinforcement learning - Frozen Lake - 50 points

**Due December 6, 2023 at 11:59 pm**

OpenAI Gym is a free Python toolkit that provides environments for developing and testing reinforcement learning agents. It is useful for trying out new agent ideas, running training simulations, and speeding up the development of your learning algorithm.

During class, you were introduced to the Frozen Lake environment. Then, during lab, you saw different ways to implement reinforcement learning (RL) algorithms to optimize agent success in other environments (Cart Pole and Blackjack). This homework gives you the opportunity to explore Frozen Lake and understand several RL techniques.

## Part 1. Load Frozen Lake and Explore (5 pts)
You may either use the script version provided in class or the Jupyter Notebook code here. Recall that if you are trying to render the environment, it will need to be done differently depending on whether you use Jupyter Notebook or another IDE.

In [None]:
import os
import gym
import matplotlib.pyplot as plt
import time
import numpy as np
%matplotlib inline
os.environ["SDL_VIDEODRIVER"] = "dummy"
from IPython.display import clear_output

env = gym.make("FrozenLake-v1",render_mode="rgb_array")

env.reset()
#env.render()

plt.imshow(env.render())
plt.show()

## Environment
What observations can be made about the environment? What does env.P give you?

In [None]:
#Code and/or discussion

## Actions
What actions are available in this environment?

In [None]:
#Code and discussion

## Perform an Action
Perform a single action. Render the elf's movement. What variables are returned? What do they mean?

In [None]:
#Code and discussion

## Part 2. Random Action (10 pts)

Using randomly chosen actions, iterate through 1000 steps. How well did your elf perform? Use the code block below to help you assess it.

In [None]:
def get_score(env, pol_func, episodes):
  misses = 0
  steps_list = []      # steps taken in episodes that reached the frisbee
  all_steps_list = []  # steps taken in every episode
  for episode in range(episodes):
    # env.reset() returns (observation, info) in gym >= 0.26
    observation, info = env.reset()
    steps = 0
    while True:
      action = pol_func(observation)
      # env.step() returns five values:
      #   observation (object)  - the state after the action
      #   reward                - the result of taking the action
      #   terminated (bool)     - is it a terminal state (hole or goal)
      #   truncated (bool)      - the step limit was reached
      #   info (dictionary)     - auxiliary information
      observation, reward, terminated, truncated, info = env.step(action)
      steps += 1
      if terminated and reward == 1:
        #print('You have got the Frisbee after {} steps'.format(steps))
        steps_list.append(steps)
        break
      elif terminated:
        #print("You fell in a hole!")
        misses += 1
        break
      elif truncated:
        # Ran out of steps without reaching the frisbee or a hole
        break
    all_steps_list.append(steps)
  # Avoid a warning from np.mean([]) when the frisbee was never reached
  mean_goal_steps = np.mean(steps_list) if steps_list else float('nan')
  print(steps_list)
  print(misses)
  print('----------------------------------------------')
  print('You took an average of {:.0f} steps each episode'.format(np.mean(all_steps_list)))
  print('You took an average of {:.0f} steps to get the frisbee'.format(mean_goal_steps))
  print('And you fell in the hole {:.2f} % of the times'.format((misses/episodes) * 100))
  print('----------------------------------------------')


In [None]:
def random_policy(obs):
    #Your code here

#Call the function above with this line. Try 100,1000,10000 episodes. How did you do?
get_score(env,random_policy,1000)

Discussion here.

## Part 3. Not So Random Policy (10 pts)

Write another function called nonRandom_policy(obs). Come up with a rule that might choose one action more than another. You may use observation data if you would like. Assess it similarly to the code block above.

In [None]:
def nonRandom_policy(obs):
    #Your code here

#Call the function above with this line. Try 100,1000,10000 episodes. How did you do?
get_score(env,nonRandom_policy,1000)

Discussion here.

## Part 4. Q-Learning Exploration (25 pts)

Below is some code that will eventually train our elf to navigate the Frozen Lake as well as possible. Please work through each code block and "reverse-engineer" this solution. What is Q-Learning? Explain, in your own words, what these code blocks are doing.

Source: https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/FrozenLake/Q%20Learning%20with%20FrozenLake.ipynb
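
Before working through the cells below, it may help to trace one update by hand. The following is a minimal sketch in plain Python and is not part of the source notebook: the helper name `q_update` and the two hand-picked transitions are illustrative assumptions, while the learning rate (0.8) and discounting rate (0.95) are the values set in the hyperparameter cell.

```python
# Illustrative sketch: the tabular Q-learning update, written without gym or numpy.
# q_update and the example transitions are hypothetical, not from the notebook.

def q_update(qtable, state, action, reward, new_state, learning_rate, gamma):
    """Apply Q(s,a) := Q(s,a) + lr * [r + gamma * max_a' Q(s',a') - Q(s,a)] in place."""
    best_next = max(qtable[new_state])             # value of the best action in s'
    td_target = reward + gamma * best_next         # what Q(s,a) "should" be
    td_error = td_target - qtable[state][action]   # how far off the current estimate is
    qtable[state][action] += learning_rate * td_error
    return qtable[state][action]

# 16 states x 4 actions, all zeros, matching the table built for the 4x4 lake
qtable = [[0.0] * 4 for _ in range(16)]

# Pretend the agent stepped from state 14 into state 15 and received reward 1
first = q_update(qtable, 14, 2, 1.0, 15, learning_rate=0.8, gamma=0.95)

# Later it steps from state 13 into state 14 with reward 0;
# the value already learned for state 14 now flows back one step.
second = q_update(qtable, 13, 2, 0.0, 14, learning_rate=0.8, gamma=0.95)

print(first, second)
```

The second update is nonzero even though its reward is 0, because the discounted value of the next state is part of the target. That is how reward information propagates backward through the table over many episodes.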

In [None]:
import os
import gym
import matplotlib.pyplot as plt
import time
import numpy as np
import random


env = gym.make("FrozenLake-v1",render_mode="rgb_array")
env.reset()


In [None]:
action_size = env.action_space.n
state_size = env.observation_space.n

qtable = np.zeros((state_size, action_size))
print(qtable)

total_episodes = 15000        # Total episodes
learning_rate = 0.8           # Learning rate
max_steps = 99                # Max steps per episode
gamma = 0.95                  # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.005             # Exponential decay rate for exploration prob

In [None]:
# List of rewards
rewards = []

# Train for total_episodes episodes
for episode in range(total_episodes):
    # Reset the environment
    state_init = env.reset()
    state = state_init[0]
    step = 0
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        # Choose an action a in the current world state (s)
        ## First we randomize a number
        exp_exp_tradeoff = random.uniform(0, 1)
        
        ## If this number is greater than epsilon --> exploitation (taking the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])

        # Else doing a random choice --> exploration
        else:
            action = env.action_space.sample()

        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, trunc, info = env.step(action)

        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        
        total_rewards += reward
        
        # The new state becomes the current state
        state = new_state
        
        # If done (we fell in a hole or reached the goal): finish the episode
        if done:
            break
        
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    rewards.append(total_rewards)

print ("Score over time: " +  str(sum(rewards)/total_episodes))
print(qtable)
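
To get a feel for how quickly exploration fades, the decay formula from the training loop can be evaluated on its own. This is a small sketch rather than part of the source notebook: the function name `epsilon_at` and the sampled episode numbers are assumptions for illustration, and the constants are copied from the hyperparameter cell.

```python
import math

# Same values as the exploration parameters in the hyperparameter cell
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.005

def epsilon_at(episode):
    """Exploration rate produced by the exponential decay after the given episode index."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

# Sample the schedule at a few (arbitrarily chosen) points of the 15000-episode run
for episode in (0, 100, 500, 1000, 5000):
    print(f"episode {episode:5d}: epsilon = {epsilon_at(episode):.4f}")
```

Epsilon starts at `max_epsilon` and approaches `min_epsilon` without ever dropping below it, so the agent keeps a small chance of taking a random action even late in training.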

In [None]:
env.reset()

for episode in range(5):
    state_init = env.reset()
    state = state_init[0]
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, trunc,info = env.step(action)
        
        if done:
            # Here, we decide to only print the last state (to see if our agent is on the goal or fell into a hole)
            #env.render()
            
            # We print the number of step it took.
            print("Number of steps", step)
            break
        state = new_state
env.close()
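
The evaluation loop above always picks the action with the largest Q-value, so the learned behavior can also be read straight off the table. The sketch below uses a made-up three-state table to show the idea. `greedy_policy` and `toy_qtable` are hypothetical names and values, not output from the training cell.

```python
# Illustrative sketch: read a greedy policy from a Q-table (one row per state).
# greedy_policy and toy_qtable are hypothetical, for demonstration only.

def greedy_policy(qtable):
    """Return, for each state (row), the index of the action with the largest Q-value."""
    policy = []
    for row in qtable:
        # The first maximum wins ties, the same tie-breaking as np.argmax
        best_action = max(range(len(row)), key=lambda a: row[a])
        policy.append(best_action)
    return policy

toy_qtable = [
    [0.1, 0.5, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],   # a state that was never updated
    [0.3, 0.1, 0.9, 0.2],
]

print(greedy_policy(toy_qtable))  # → [1, 0, 2]
```

A row of all zeros resolves to action 0, which is worth keeping in mind when interpreting rows of the trained `qtable` that never received an update. With a NumPy table, `qtable.argmax(axis=1)` gives the same result.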

Your thoughts/discussion here.