# Deep Learning Module
## Activity 2: Reinforcement Learning: **Frozen Lake problem**

Esther Sanz

Antonio Vargas

Pedro Rincón


# Reinforcement Learning Activity

Solve the Frozen Lake problem from OpenAI Gym. Documentation: https://www.gymlibrary.dev/environments/toy_text/frozen_lake/

## Objectives
- Move randomly until the goal is reached
- Get the agent to learn with Q-learning
- (Optional) Try other hyperparameters
- (Optional) Modify the reward

## Considerations
- There are no penalties
- If the agent falls into a "hole", then done = True and it gets stuck with no way out (the same happens when it reaches the "goal")

## Rules to follow

- A **SINGLE GOOGLE COLAB notebook** (.ipynb file) must be submitted, including these instructions and their **EXECUTION!!!**.
- Put the group name in the file name and the names of all group members at the beginning of the notebook.

## Evaluation criteria

- Compliance with the rules established for the activity.
- Correct use of algorithms, models, and idiomatic Python.
- The code must run without any modification in Google Colaboratory.

## **Installing libraries**

In [1]:
!pip install gym==0.17.3
!pip install numpy==1.23.5



In [2]:
import gym
import numpy as np
import pandas as pd
from time import sleep
from IPython.display import clear_output
import random as rd
import pickle

## **Environment definition**

In [3]:
# Define the environment
env = gym.make('FrozenLake-v0', desc=None, map_name="4x4", is_slippery=False)

In [4]:
# Set a seed
seed_value = 42
env.seed(seed_value)
np.random.seed(seed_value)

In [5]:
env.reset() # The agent always starts from the same initial position
env.render()


SFFF
FHFH
FFFH
HFFG


In [6]:
print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

Action Space Discrete(4)
State Space Discrete(16)


Possible actions:
* 0: left
* 1: down
* 2: right
* 3: up

In [None]:
# State identifier
state = env.s
print("State:", state)

State: 0


## **Moving randomly!**

Reaching the goal manually:

In [None]:
steps = 0
env.reset()

print("State:", env.s)

env.render()

print(f"Step: {steps}")

State: 0

SFFF
FHFH
FFFH
HFFG
Step: 0


In [None]:
# Actions: 0=left, 1=down, 2=right, 3=up
action = 1
state, reward, done, info = env.step(action)

print("State:", state)
print(state, reward, done, info)

env.s = state
env.render()

steps += 1

print(f"Step: {steps}")

State: 4
4 0.0 False {'prob': 1.0}
  (Down)
SFFF
FHFH
FFFH
HFFG
Step: 1


In [None]:
# Actions: 0=left, 1=down, 2=right, 3=up
action = 1
state, reward, done, info = env.step(action)

print("State:", state)
print(state, reward, done, info)

env.s = state
env.render()

steps += 1

print(f"Step: {steps}")

State: 8
8 0.0 False {'prob': 1.0}
  (Down)
SFFF
FHFH
FFFH
HFFG
Step: 2


In [None]:
# Actions: 0=left, 1=down, 2=right, 3=up
action = 2
state, reward, done, info = env.step(action)

print("State:", state)
print(state, reward, done, info)

env.s = state
env.render()

steps += 1

print(f"Step: {steps}")

State: 9
9 0.0 False {'prob': 1.0}
  (Right)
SFFF
FHFH
FFFH
HFFG
Step: 3


In [None]:
# Actions: 0=left, 1=down, 2=right, 3=up
action = 1
state, reward, done, info = env.step(action)

print("State:", state)
print(state, reward, done, info)

env.s = state
env.render()

steps += 1

print(f"Step: {steps}")

State: 13
13 0.0 False {'prob': 1.0}
  (Down)
SFFF
FHFH
FFFH
HFFG
Step: 4


In [None]:
# Actions: 0=left, 1=down, 2=right, 3=up
action = 2
state, reward, done, info = env.step(action)

print("State:", state)
print(state, reward, done, info)

env.s = state
env.render()

steps += 1

print(f"Step: {steps}")

State: 14
14 0.0 False {'prob': 1.0}
  (Right)
SFFF
FHFH
FFFH
HFFG
Step: 5


In [None]:
# Actions: 0=left, 1=down, 2=right, 3=up
action = 2
state, reward, done, info = env.step(action)

print("State:", state)
print(state, reward, done, info)

env.s = state
env.render()

steps += 1

print(f"Step: {steps}")

State: 15
15 1.0 True {'prob': 1.0}
  (Right)
SFFF
FHFH
FFFH
HFFG
Step: 6


- The goal was reached manually in 6 steps.

Falling into a hole manually:

In [None]:
steps = 0
env.reset()
env.render()


SFFF
FHFH
FFFH
HFFG


In [None]:
# Actions: 0=left, 1=down, 2=right, 3=up
action = 2
state, reward, done, info = env.step(action)

print("State:", state)
print(state, reward, done, info)

env.s = state
env.render()

steps += 1

print(f"Step: {steps}")

State: 1
1 0.0 False {'prob': 1.0}
  (Right)
SFFF
FHFH
FFFH
HFFG
Step: 1


In [None]:
# Actions: 0=left, 1=down, 2=right, 3=up
action = 1
state, reward, done, info = env.step(action)

print("State:", state)
print(state, reward, done, info)

env.s = state
env.render()

steps += 1

print(f"Step: {steps}")

State: 5
5 0.0 True {'prob': 1.0}
  (Down)
SFFF
FHFH
FFFH
HFFG
Step: 2


- When the agent reaches the goal or falls into a hole, env.step(action) returns done=True and the player can no longer move; the environment has to be reset.

- The observation space covers the 16 possible positions of the player.

In [None]:
states_dict = {
    0: 'S',
    5: 'H',
    7: 'H',
    11: 'H',
    12: 'H',
    15: 'G'
}

for i in range(env.observation_space.n):
    s = states_dict.get(i, 'F')  # every other cell is frozen surface
    print(f"{i}:\t{s}", end='\t|\t')
    if (i + 1) % 4 == 0:
        print("\n" + "-"*90)

0:	S	|	1:	F	|	2:	F	|	3:	F	|	
------------------------------------------------------------------------------------------
4:	F	|	5:	H	|	6:	F	|	7:	H	|	
------------------------------------------------------------------------------------------
8:	F	|	9:	F	|	10:	F	|	11:	H	|	
------------------------------------------------------------------------------------------
12:	H	|	13:	F	|	14:	F	|	15:	G	|	
------------------------------------------------------------------------------------------
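The state numbering above is the row-major flattening of the 4x4 map, so the special cells can be read off the map itself. The following is a minimal sketch, assuming the map rows shown by `env.render()`; the list `desc_rows` and the helper `to_row_col` are illustrative names, copied or defined by hand here rather than taken from `env`.

```python
# Map rows copied by hand from the rendered environment (assumption: default 4x4 map)
desc_rows = ["SFFF", "FHFH", "FFFH", "HFFG"]
n_cols = len(desc_rows[0])

# Row-major flattening: state = row * n_cols + col
cells = "".join(desc_rows)

holes = [s for s, c in enumerate(cells) if c == "H"]
goal = cells.index("G")

def to_row_col(state):
    """Convert a state index into its (row, col) position on the grid."""
    return divmod(state, n_cols)

print("Holes:", holes)
print("Goal:", goal, "at", to_row_col(goal))
```

Deriving the positions from the map avoids having to keep a hand-written dictionary in sync with the layout.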


**Reward table**

In [None]:
print(f"Reward matrix dimensions: {env.observation_space.n} (states) x {env.action_space.n} (actions)")

Reward matrix dimensions: 16 (states) x 4 (actions)


In [None]:
print("\t\tLEFT  \t\t\t\tDOWN  \t\t\t\tRIGHT  \t\t\t\tUP\n")
for i in range(env.observation_space.n):
  print(f"{i}: \t{env.P[i][0]}     \t{env.P[i][1]}     \t{env.P[i][2]}     \t{env.P[i][3]}")

		LEFT  				DOWN  				RIGHT  				UP

0: 	[(1.0, 0, 0.0, False)]     	[(1.0, 4, 0.0, False)]     	[(1.0, 1, 0.0, False)]     	[(1.0, 0, 0.0, False)]
1: 	[(1.0, 0, 0.0, False)]     	[(1.0, 5, 0.0, True)]     	[(1.0, 2, 0.0, False)]     	[(1.0, 1, 0.0, False)]
2: 	[(1.0, 1, 0.0, False)]     	[(1.0, 6, 0.0, False)]     	[(1.0, 3, 0.0, False)]     	[(1.0, 2, 0.0, False)]
3: 	[(1.0, 2, 0.0, False)]     	[(1.0, 7, 0.0, True)]     	[(1.0, 3, 0.0, False)]     	[(1.0, 3, 0.0, False)]
4: 	[(1.0, 4, 0.0, False)]     	[(1.0, 8, 0.0, False)]     	[(1.0, 5, 0.0, True)]     	[(1.0, 0, 0.0, False)]
5: 	[(1.0, 5, 0, True)]     	[(1.0, 5, 0, True)]     	[(1.0, 5, 0, True)]     	[(1.0, 5, 0, True)]
6: 	[(1.0, 5, 0.0, True)]     	[(1.0, 10, 0.0, False)]     	[(1.0, 7, 0.0, True)]     	[(1.0, 2, 0.0, False)]
7: 	[(1.0, 7, 0, True)]     	[(1.0, 7, 0, True)]     	[(1.0, 7, 0, True)]     	[(1.0, 7, 0, True)]
8: 	[(1.0, 8, 0.0, False)]     	[(1.0, 12, 0.0, True)]     	[(1.0, 9, 0.0, False)]     	[(1.0, 4, 0.0, 
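Each entry of `env.P[state][action]` is a list of `(probability, next_state, reward, done)` tuples. As a rough sketch of how such a table arises for the deterministic map, the code below rebuilds it without Gym. The names `DESC`, `DELTAS` and `build_transitions` are hypothetical helpers, and the map is copied by hand from the render.

```python
DESC = ["SFFF", "FHFH", "FFFH", "HFFG"]  # map copied from the render (assumption)
N = 4
# (row delta, col delta) per action: 0=left, 1=down, 2=right, 3=up
DELTAS = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def build_transitions():
    """Return P[state][action] = [(probability, next_state, reward, done)]."""
    cells = "".join(DESC)
    P = {}
    for s in range(N * N):
        P[s] = {}
        for a, (dr, dc) in DELTAS.items():
            if cells[s] in "HG":
                # Terminal cells loop onto themselves with no reward
                P[s][a] = [(1.0, s, 0, True)]
                continue
            row, col = divmod(s, N)
            # Moves that would leave the grid keep the agent in place
            row = min(max(row + dr, 0), N - 1)
            col = min(max(col + dc, 0), N - 1)
            nxt = row * N + col
            reward = 1.0 if cells[nxt] == "G" else 0.0
            done = cells[nxt] in "HG"
            P[s][a] = [(1.0, nxt, reward, done)]
    return P

P = build_transitions()
prob, next_state, reward, done = P[0][1][0]   # state 0, action "down"
print(prob, next_state, reward, done)
```

With `is_slippery=False` every list holds a single tuple with probability 1.0; a slippery lake would list several possible outcomes per action.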

Random movement for 20 steps

In [None]:
env.reset()

timestep, victories, falls = 0, 0, 0

for n_step in range(20):

  action = env.action_space.sample() # choose a random action
  state, reward, done, info = env.step(action) # take the chosen action

  if done:
    if reward == 1:
      victories += 1
    else:
      falls += 1
    env.reset()

  timestep += 1

print(f"Steps: {timestep}")
print(f"Falls: {falls}")
print(f"Victories: {victories}")
env.render()

Steps: 20
Falls: 2
Victories: 0
  (Up)
SFFF
FHFH
FFFH
HFFG


- In this run, moving randomly, after 20 steps the agent had fallen twice and never reached the goal, ending on state 3 (the top-right cell).

Random movement until the goal is reached

In [None]:
env.reset()

timestep, victories, falls = 0, 0, 0

while victories == 0:

  action = env.action_space.sample() # choose a random action
  state, reward, done, info = env.step(action) # take the chosen action

  if done:
    if reward == 1:
      victories += 1
    else:
      falls += 1
      env.reset()

  timestep += 1

print(f"Steps: {timestep}")
print(f"Falls: {falls}")
print(f"Victory: {victories>0}")
env.render()

Steps: 1100
Falls: 132
Victory: True
  (Right)
SFFF
FHFH
FFFH
HFFG


- In this run, it took 1100 steps to reach the goal, with 132 falls along the way.
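A single run like this is one sample of a random process. As a sanity check, the expected number of steps under a uniformly random policy can be obtained from a small linear system. This sketch assumes the same deterministic map (copied by hand) and the same bookkeeping as the loop above: every call to `env.step` counts as one step, and a fall sends the agent back to state 0. The helper `next_cell` and the matrix names are illustrative.

```python
import numpy as np

DESC = ["SFFF", "FHFH", "FFFH", "HFFG"]  # map copied from the render (assumption)
N = 4
cells = "".join(DESC)
DELTAS = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up

def next_cell(state, delta):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    row, col = divmod(state, N)
    row = min(max(row + delta[0], 0), N - 1)
    col = min(max(col + delta[1], 0), N - 1)
    return row * N + col

# Q[s, s2] = probability of continuing from s2 after one random step from s
Q = np.zeros((N * N, N * N))
transient = [s for s in range(N * N) if cells[s] in "SF"]
for s in transient:
    for delta in DELTAS:
        nxt = next_cell(s, delta)
        if cells[nxt] == "G":
            continue            # goal reached: nothing left to walk
        if cells[nxt] == "H":
            nxt = 0             # fall: the loop resets the agent to the start
        Q[s, nxt] += 0.25

# Expected steps t satisfy t = 1 + Q t on the non-terminal states
idx = np.array(transient)
A = np.eye(len(idx)) - Q[np.ix_(idx, idx)]
t = np.linalg.solve(A, np.ones(len(idx)))
expected = dict(zip(transient, t))
print(f"Expected steps from the start: {expected[0]:.1f}")
```

Averaging the step count over many random runs should approach this value, which gives a baseline for judging how much the trained agent improves.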

# Q-learning: Training

In [17]:
def print_qtable(q_table):
  print("\tLEFT\t\t\tDOWN\t\t\tRIGHT\t\t\tUP")
  for i in range(env.observation_space.n):
    print(f"{i}:", end='\t')
    for j in q_table[i]:
      print(round(j, 3), end='\t\t\t')
    print()

In [18]:
# Q-table initialized to 0

q_table = np.zeros([env.observation_space.n, env.action_space.n])

print_qtable(q_table)

	LEFT			DOWN			RIGHT			UP
0:	0.0			0.0			0.0			0.0			
1:	0.0			0.0			0.0			0.0			
2:	0.0			0.0			0.0			0.0			
3:	0.0			0.0			0.0			0.0			
4:	0.0			0.0			0.0			0.0			
5:	0.0			0.0			0.0			0.0			
6:	0.0			0.0			0.0			0.0			
7:	0.0			0.0			0.0			0.0			
8:	0.0			0.0			0.0			0.0			
9:	0.0			0.0			0.0			0.0			
10:	0.0			0.0			0.0			0.0			
11:	0.0			0.0			0.0			0.0			
12:	0.0			0.0			0.0			0.0			
13:	0.0			0.0			0.0			0.0			
14:	0.0			0.0			0.0			0.0			
15:	0.0			0.0			0.0			0.0			


Defining the epsilon-greedy policy:

In [19]:
def greedy_trade_off(epsilon, q_table, state, env):

  if rd.random() < epsilon:
    action = env.action_space.sample()  # explore: choose a random action

  else:
    action = np.argmax(q_table[state])  # exploit: take the action with the highest value in the Q-table

  return action
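Because the random branch can also land on the best action, the greedy action is chosen with probability 1 - ε + ε/4 when there are four actions, which is 0.8875 for ε = 0.15. The sketch below checks this empirically. It swaps the `random` module and `env.action_space.sample()` for a seeded NumPy generator so that it runs without Gym; `epsilon_greedy` and the Q-values in `q_row` are illustrative.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # explore
    return int(np.argmax(q_row))               # exploit

rng = np.random.default_rng(0)
q_row = np.array([0.118, 0.49, 0.7, 0.343])    # illustrative Q-values, best action is 2
epsilon = 0.15

picks = [epsilon_greedy(q_row, epsilon, rng) for _ in range(100_000)]
greedy_share = np.mean(np.array(picks) == 2)

print(f"Empirical share of greedy picks: {greedy_share:.3f}")
print(f"Theoretical share: {1 - epsilon + epsilon / len(q_row):.4f}")
```

Note that `np.argmax` returns the first index on ties, so with an all-zero Q-table the greedy branch always picks action 0 (left) and early progress depends entirely on exploration.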

Training for 100000 episodes (each episode ends only when the goal is reached; a fall sends the agent back to the start and the episode continues)

In [None]:
%%time
# Hyperparameters
alpha = 0.2       # learning rate
gamma = 0.7       # discount factor
epsilon = 0.15    # exploration rate of the epsilon-greedy policy

# dictionary that records the training data of each episode
episode_log = {
    "timesteps": [],
    "falls": []
}

episodes = 100000

q_table = np.zeros([env.observation_space.n, env.action_space.n])

for i in range(episodes):
  state = env.reset()

  goal = False

  timesteps, falls = 0, 0

  while not goal:

    timesteps += 1

    action = greedy_trade_off(epsilon, q_table, state, env) # apply the epsilon-greedy policy

    next_state, reward, done, info = env.step(action)       # take the chosen action

    # Apply the Q-learning update rule to the Q-value

    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])

    new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

    q_table[state, action] = new_value

    state = next_state

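    if done:
      if reward == 1:
        goal = True
        episode_log['timesteps'].append(timesteps)
        episode_log['falls'].append(falls)
      else:
        falls += 1
        # The environment goes back to the start, but `state` still holds the
        # hole's index, so the next update is written to that row of the Q-table
        env.reset()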

  if (i+1) % 100 == 0:
    clear_output(wait=True)
    print(f"Episode {i+1}")

clear_output(wait=True)
print(f"Episode: {i+1} \nTraining finished!")

Episode: 100000 
Training finished!
CPU times: user 1min 20s, sys: 9.97 s, total: 1min 30s
Wall time: 1min 24s
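The update used in the loop is Q(s, a) ← (1 - α) Q(s, a) + α (r + γ max Q(s', ·)). To see how it behaves, the sketch below applies it repeatedly to the last move into the goal, where the reward is 1 and the goal row of the Q-table stays at 0. Under those assumptions the value after n updates is 1 - (1 - α)^n, so it approaches 1. The function name `q_update` is illustrative.

```python
alpha, gamma = 0.2, 0.7   # same values as the training cell

def q_update(old_value, reward, next_max):
    """One Q-learning update, written as in the training loop."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

# Last move into the goal: reward 1, and the goal row of the Q-table stays at 0
q = 0.0
history = []
for _ in range(30):
    q = q_update(q, reward=1.0, next_max=0.0)
    history.append(q)

print([round(v, 3) for v in history[:3]])   # first three updates
print(round(history[-1], 3))                # value after 30 updates
```

A larger α would reach the limit in fewer updates, at the cost of reacting more strongly to each individual transition.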


Training log:

In [None]:
df_train = pd.DataFrame(episode_log)
df_train.head()

| | timesteps | falls |
|---|---|---|
| 0 | 2344897 | 41444 |
| 1 | 151707 | 2677 |
| 2 | 2979 | 55 |
| 3 | 47 | 0 |
| 4 | 176 | 2 |


In [None]:
df_train.tail()

| | timesteps | falls |
|---|---|---|
| 99995 | 17 | 3 |
| 99996 | 8 | 1 |
| 99997 | 6 | 0 |
| 99998 | 6 | 0 |
| 99999 | 8 | 0 |


In [None]:
df_train.describe()

| | timesteps | falls |
|---|---|---|
| count | 100000.0 | 100000.0 |
| mean | 32.55414 | 0.63278 |
| std | 7430.694 | 131.330808 |
| min | 6.0 | 0.0 |
| 25% | 6.0 | 0.0 |
| 50% | 6.0 | 0.0 |
| 75% | 8.0 | 0.0 |
| max | 2344897.0 | 41444.0 |


In [None]:
print_qtable(q_table)

	LEFT			DOWN			RIGHT			UP
0:	0.118			0.168			0.168			0.118			
1:	0.118			0.118			0.24			0.168			
2:	0.168			0.343			0.168			0.24			
3:	0.24			0.055			0.097			0.113			
4:	0.168			0.24			0.118			0.118			
5:	0.118			0.168			0.168			0.118			
6:	0.118			0.49			0.082			0.24			
7:	0.118			0.113			0.09			0.093			
8:	0.24			0.118			0.343			0.168			
9:	0.24			0.49			0.49			0.118			
10:	0.343			0.7			0.118			0.343			
11:	0.117			0.168			0.161			0.117			
12:	0.118			0.168			0.168			0.118			
13:	0.118			0.49			0.7			0.343			
14:	0.49			0.7			1.0			0.49			
15:	0.0			0.0			0.0			0.0			
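Since the only reward is the 1 received on the final move, an action that still has k moves to go along a shortest path should be worth roughly γ^k. The sketch below compares that prediction with values read from the table printed above. The path is one of the shortest routes, and the list `printed` is copied by hand.

```python
gamma = 0.7

# (state, action) pairs along one shortest path: down, down, right, down, right, right
path = [(0, 1), (4, 1), (8, 2), (9, 1), (13, 2), (14, 2)]

# Values read from the printed Q-table for those pairs
printed = [0.168, 0.24, 0.343, 0.49, 0.7, 1.0]

for k, ((state, action), seen) in enumerate(zip(path, printed)):
    steps_left = len(path) - 1 - k          # moves remaining after this one
    predicted = gamma ** steps_left
    print(f"Q[{state}, {action}]: predicted {predicted:.3f}, printed {seen}")
```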


# Q-learning: Evaluation

In [None]:
%%time

state = env.reset()

timestep, falls = 0, 0
goal = False

while not goal:

  timestep += 1

  action = np.argmax(q_table[state]) # choose the best action according to the Q-table

  state, reward, done, info = env.step(action)  # take the chosen action

  # Print each step
  sleep(1)
  clear_output(wait=True)
  env.render()

  if done:
    if reward == 1:
      goal = True
    else:
      falls += 1
      state = env.reset()  # restart from the initial state after a fall

  print(f"\nNumber of steps: {timestep}")
  print(f"Number of falls: {falls}")

print("\nGoal reached!")

  (Right)
SFFF
FHFH
FFFH
HFFG

Number of steps: 6
Number of falls: 0

Goal reached!
CPU times: user 45.3 ms, sys: 10.9 ms, total: 56.2 ms
Wall time: 6.02 s
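The greedy walk can also be reproduced offline from the Q-table printed after training. This sketch assumes the rounded values and the map, both copied by hand. Ties are broken by `np.argmax` taking the first index, so the route may differ from the one the trained agent followed if the unrounded values break a tie differently.

```python
import numpy as np

DESC = "SFFFFHFHFFFHHFFG"   # 4x4 map flattened row by row (copied by hand)
N = 4
DELTAS = [(0, -1), (1, 0), (0, 1), (-1, 0)]   # left, down, right, up

# Rounded Q-values copied from the printed table (rows are states 0 to 15)
q_table = np.array([
    [0.118, 0.168, 0.168, 0.118], [0.118, 0.118, 0.24, 0.168],
    [0.168, 0.343, 0.168, 0.24],  [0.24, 0.055, 0.097, 0.113],
    [0.168, 0.24, 0.118, 0.118],  [0.118, 0.168, 0.168, 0.118],
    [0.118, 0.49, 0.082, 0.24],   [0.118, 0.113, 0.09, 0.093],
    [0.24, 0.118, 0.343, 0.168],  [0.24, 0.49, 0.49, 0.118],
    [0.343, 0.7, 0.118, 0.343],   [0.117, 0.168, 0.161, 0.117],
    [0.118, 0.168, 0.168, 0.118], [0.118, 0.49, 0.7, 0.343],
    [0.49, 0.7, 1.0, 0.49],       [0.0, 0.0, 0.0, 0.0],
])

state, path = 0, [0]
# Follow the greedy action until a terminal cell (with a cap to avoid loops)
while DESC[state] not in "HG" and len(path) <= 16:
    dr, dc = DELTAS[int(np.argmax(q_table[state]))]
    row, col = divmod(state, N)
    row = min(max(row + dr, 0), N - 1)
    col = min(max(col + dc, 0), N - 1)
    state = row * N + col
    path.append(state)

print("Visited states:", path)
print("Steps:", len(path) - 1)
```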


In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Export q_table
with open(r"/content/drive/MyDrive/Nuclio Digital School/ReinforcementLearning/q_table_defaultParams.pickle", "wb") as output_file:
  pickle.dump(q_table, output_file)
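To reuse the table in a later session, it can be read back with `pickle.load`. The sketch below shows the round trip using a temporary file in place of the Google Drive path and a placeholder array in place of the trained table; both are assumptions made so that the example runs anywhere.

```python
import os
import pickle
import tempfile

import numpy as np

q_table = np.zeros((16, 4))      # placeholder standing in for the trained table
q_table[14, 2] = 1.0

# Temporary file used here instead of the Google Drive path (assumption)
path = os.path.join(tempfile.mkdtemp(), "q_table_example.pickle")

with open(path, "wb") as output_file:
    pickle.dump(q_table, output_file)

with open(path, "rb") as input_file:
    restored = pickle.load(input_file)

print(np.array_equal(q_table, restored))
```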