# ARTIFICIAL INTELLIGENCE (INF371)

Dr. Edwin Villanueva (evillatal@gmail.com)

## Reinforcement Learning: Q-learning

This notebook covers experiments with Q-learning reinforcement learning agents in grid environments. The environment class GridEnvironment and the Q-learning agent are already implemented. At the end of the notebook you will have to answer the questions posed.

### <b>GridEnvironment</b> class

The GridEnvironment class defines an MDP (Markov Decision Process) for grid environments (mazes), like the example used in class. The agent moves in the intended direction with probability 0.8 and slips to each of the two lateral directions with probability 0.1. The constructor receives:

- grid: a list of lists of numbers defining the rewards of the environment grid. None values mark an obstacle
- terminals: list of terminal states
- initial: initial state
- gamma: discount factor

The class keeps track of the current state (current_state), which is initialised to the "initial" state and changes with every step taken in the environment (a call to step()). Each call to step() returns the new state, its reward and a 'done' flag signalling that the episode has reached a terminal state. The transition model of each state is available through the function T(s,a), which returns a list of (prob, s') tuples, one per neighbouring state s' of state s when executing action a (prob is the probability of moving from s to s' with action a). A move that would lead into an obstacle or off the grid leaves the agent in its current state.
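The transition model described above can be illustrated outside the class. The block below is a minimal sketch, assuming a hypothetical 2x2 grid without obstacles (`toy_states`) and helper names (`move`, `transition`) that are not part of the notebook; it only mirrors the 0.8 / 0.1 / 0.1 rule and the "stay in place when blocked" behaviour.

```python
import operator

EAST, NORTH, WEST, SOUTH = (1, 0), (0, 1), (-1, 0), (0, -1)
HEADINGS = [EAST, NORTH, WEST, SOUTH]  # counter-clockwise order


def move(state, direction, states):
    """Deterministic move; the agent stays put if the target is not a valid state."""
    target = tuple(map(operator.add, state, direction))
    return target if target in states else state


def transition(state, action, states):
    """Return [(prob, next_state), ...] for the intended move and the two lateral slips."""
    i = HEADINGS.index(action)
    left = HEADINGS[(i + 1) % len(HEADINGS)]   # 90 degrees counter-clockwise
    right = HEADINGS[(i - 1) % len(HEADINGS)]  # 90 degrees clockwise
    return [(0.8, move(state, action, states)),
            (0.1, move(state, right, states)),
            (0.1, move(state, left, states))]


# Hypothetical toy grid: four free cells, no obstacles
toy_states = {(0, 0), (1, 0), (0, 1), (1, 1)}
outcomes = transition((0, 0), EAST, toy_states)
print(outcomes)
```

From the bottom-left corner, slipping to the right of EAST means heading SOUTH, which leaves the grid, so that 0.1 of probability mass stays on the starting cell.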

In [1]:
from collections import defaultdict
import random
import operator
import numpy as np

EAST, NORTH, WEST, SOUTH = (1, 0), (0, 1), (-1, 0), (0, -1)
LEFT, RIGHT = +1, -1
        
class GridEnvironment:
    def __init__(self, grid, terminals, initial=(0, 0), gamma=.9):
        grid.reverse()     # so that row 0 is the bottom row, not the top one (note: reverses the caller's list in place)
        self.rows = len(grid)
        self.cols = len(grid[0])
        self.grid = grid
        self.initial_state = initial
        self.current_state = initial
        self.terminals = terminals
        self.gamma = gamma
        self.actionlist = [EAST, NORTH, WEST, SOUTH] 

        self.rewards = {}        # reward dictionary
        self.states = set()     # set of distinct states
        for x in range(self.cols):   # collect every state and reward in the grid
            for y in range(self.rows):
                if grid[y][x] is not None:  # skip None cells (obstacles); a reward of 0 is still a valid state
                    self.states.add((x, y))
                    self.rewards[(x, y)] = grid[y][x]
            
        self.transition_probs = {}  # transition-probability dictionaries, indexed by state
        for s in self.states:
            self.transition_probs[s] = {}  # transition probabilities towards the neighbours of state s, indexed by action
            for a in self.actionlist:
                self.transition_probs[s][a] = self.get_transition_probs(s, a)
                
    def get_transition_probs(self, state, action): 
        # Probability 0.8 of moving in the intended direction and 0.1 of slipping to each side.
        if action:
            return [(0.8, self.go(state, action)),
                    (0.1, self.go(state, self.turn_right(action))),
                    (0.1, self.go(state, self.turn_left(action)))]
        else:
            return [(0.0, state)]
        
    def go(self, state, direction):
        """Return the state that would result from moving in the given direction if the environment were deterministic."""
        state1 = tuple(map(operator.add, state, direction))
        return state1 if state1 in self.states else state    
    
    def turn_heading(self, heading, inc, headings=[EAST, NORTH, WEST, SOUTH]):
        return headings[(headings.index(heading) + inc) % len(headings)]

    def turn_right(self, heading):
        return self.turn_heading(heading, RIGHT)

    def turn_left(self, heading):
        return self.turn_heading(heading, LEFT) 
    
    def T(self, s, a):  # Returns the neighbouring states and their transition probabilities, as (prob, s') tuples, for state s and action a
        return self.transition_probs[s][a] if a else [(0.0, s)]

    def R(self, state): # returns the reward of a state
        return self.rewards[state]    
    
    def actions(self, state): # returns the list of actions available in a state
        if state in self.terminals:
            return [None]
        else:
            return self.actionlist    
    
    def reset(self):  # Resets the environment
        self.current_state = self.initial_state
        return self.current_state, self.rewards[self.current_state]
    
    def step(self, action): # Advances the environment one step. Returns the new state, its reward and the 'done' flag
        previous_state = self.current_state   # state this step starts from
        x = random.uniform(0, 1)
        cumulative_probability = 0.0
        for probability_state in self.T(self.current_state, action):
            probability, next_state = probability_state
            cumulative_probability += probability
            if x < cumulative_probability:
                break
        self.current_state = next_state
        # 'done' is evaluated on the state the step started from: the episode ends on the
        # step taken from a terminal state, so the agent perceives the terminal state once
        done = previous_state in self.terminals
        return self.current_state, self.rewards[self.current_state], done
    
    def to_grid(self, mapping):
        """Convert a mapping from (x, y) to v into a [[..., v, ...]] grid."""
        return list(reversed([[mapping.get((x, y), None)
                               for x in range(self.cols)]
                               for y in range(self.rows)]))

    def to_arrows(self, policy):
        chars = {(1, 0): '>', (0, 1): '^', (-1, 0): '<', (0, -1): 'v', None: '.'}
        return self.to_grid({s: chars[a] for (s, a) in policy.items()})
    
    def print_policy(self, policy):
        """Print the policy as a grid of arrows."""
        header=None
        sep='   '
        numfmt='{}'
        table = self.to_arrows(policy)
        justs = ['rjust' if hasattr(x, '__int__') else 'ljust' for x in table[0]]

        if header:
            table.insert(0, header)

        table = [[numfmt.format(x) if hasattr(x, '__int__') else x for x in row]
                 for row in table]

        sizes = list(
            map(lambda seq: max(map(len, seq)),
                list(zip(*[map(str, row) for row in table]))))

        for row in table:
            print(sep.join(getattr(
                str(x), j)(size) for (j, size, x) in zip(justs, sizes, row)))
            

### Environment for experimentation
For the experiments we use the MDP environment defined below. The discount factor is $\gamma = 0.9$ (the class examples used $\gamma = 1$). Rewards are **-0.1** in non-terminal states and **+5** and **-5** in the terminal states.
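To see what $\gamma = 0.9$ does to these rewards, the following minimal sketch computes a discounted return. The helper name `discounted_return`, the sample reward sequence, and the convention that the reward collected at step $t$ is weighted by $\gamma^t$ are assumptions made for illustration. The training loop in this notebook reports the plain undiscounted sum instead.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward collected at step t is weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical episode: two non-terminal cells, then the +5 terminal cell
sample_rewards = [-0.1, -0.1, 5.0]
discounted = discounted_return(sample_rewards, gamma=0.9)
undiscounted = discounted_return(sample_rewards, gamma=1.0)
print(discounted, undiscounted)
```

With $\gamma = 0.9$ the terminal reward of +5 reached two steps later is worth only $0.81 \times 5 = 4.05$, so longer paths to the goal are penalised both by the -0.1 step cost and by the discount.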

In [2]:
# the grid seen in class
#grid = [[-0.04, -0.04, -0.04, +1],
#        [-0.04,  None, -0.04, -1],
#        [-0.04, -0.04, -0.04, -0.04]]

# the grid for this challenge
grid = [
    [None, None, None, None, None, None, None, None, None, None, None], 
    [None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None, +5.0, None], 
    [None, -0.1, None, None, None, None, None, None, None, -0.1, None], 
    [None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None], 
    [None, -0.1, None, None, None, None, None, None, None, None, None], 
    [None, -0.1, None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None], 
    [None, -0.1, None, None, None, None, None, -0.1, None, -0.1, None], 
    [None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None, -0.1, None], 
    [None, None, None, None, None, -0.1, None, -0.1, None, -0.1, None], 
    [None, -5.0, -0.1, -0.1, -0.1, -0.1, None, -0.1, None, -0.1, None], 
    [None, None, None, None, None, None, None, None, None, None, None]
]


## <b>QLearningAgent</b> class

This class defines an exploratory Q-learning agent. It avoids learning the transition model because the Q-value of a state-action pair can be related directly to the Q-values of the neighbouring state-action pairs.
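The relation between neighbouring Q-values is the temporal-difference update $Q(s,a) \leftarrow Q(s,a) + \alpha\,(r + \gamma \max_{a'} Q(s',a') - Q(s,a))$, where $r$ is the reward of state $s$, following this notebook's convention. The block below is a minimal sketch of a single update; the function name `q_update`, the string action labels and the numeric values are assumptions chosen for illustration, not part of the agent class.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s1, actions_s1, alpha, gamma):
    """Apply one Q-learning update to Q[s, a] and return the new value.

    r is the reward associated with state s, s1 is the observed successor state.
    """
    best_next = max(Q[s1, a1] for a1 in actions_s1)       # greedy estimate of the successor's value
    Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])  # move Q[s, a] towards the TD target
    return Q[s, a]


Q = defaultdict(float)   # unseen pairs default to 0.0
Q[(1, 0), "E"] = 2.0     # hypothetical value already learned for the successor state
new_value = q_update(Q, s=(0, 0), a="E", r=-0.1, s1=(1, 0),
                     actions_s1=["E", "N"], alpha=0.5, gamma=0.9)
print(new_value)
```

No transition probabilities appear in the update: the sampled successor `s1` stands in for the expectation over T(s,a), which is why the agent does not need the model.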

In [3]:
class QLearningAgent:
    
    def __init__(self, mdp, Ne, Rplus, alpha=None):

        self.gamma = mdp.gamma    # discount factor (defined in the MDP)
        self.terminals = mdp.terminals   # terminal states (defined in the MDP)
        self.all_act = mdp.actionlist  # available actions
        self.Ne = Ne        # visit threshold used by the exploration function
        self.Rplus = Rplus  # optimistic reward assigned to state-action pairs visited fewer than Ne times
        self.Q = defaultdict(float)   # stores the Q-values
        self.Nsa = defaultdict(float) # stores the state-action visit counts
        self.s = None    # previous state
        self.a = None    # last action executed
        self.r = None    # reward of the previous state

        if alpha:
            self.alpha = alpha   # alpha is the learning rate. It must decrease with the number of visits for the utilities to converge
        else:
            self.alpha = lambda n: 1./(1+n)  # udacity video

    def f(self, u, n): 
        """ Exploration function. Returns a fixed utility value (Rplus) until the agent has visited the state-action pair Ne times """
        if n < self.Ne:
            return self.Rplus
        else:
            return u

    def actions_in_state(self, state):
        """ Returns the set of actions available in the given state. Useful for max and argmax. """
        if state in self.terminals:
            return [None]
        else:
            return self.all_act

    # Q-learning agent program
    def __call__(self, percept):    
        """ Agent program called at every step: receives a percept and returns an action """
        s1, r1 = self.update_state(percept)
        Q, Nsa, s, a, r = self.Q, self.Nsa, self.s, self.a, self.r
        alpha, gamma, terminals = self.alpha, self.gamma, self.terminals
        actions_in_state = self.actions_in_state

        if s in terminals:
            Q[s, None] = r1
        if s is not None:
            Nsa[s, a] += 1
            Q[s, a] += alpha(Nsa[s, a]) * (r + gamma * max(Q[s1, a1] for a1 in actions_in_state(s1)) - Q[s, a])
        if s in terminals:
            self.s = self.a = self.r = None
        else:
            self.s, self.r = s1, r1
            self.a = max(actions_in_state(s1), key=lambda a1: self.f(Q[s1, a1], Nsa[s1, a1])) # acts as argmax: returns the action with the largest f
        return self.a

    def update_state(self, percept):
        ''' To be overridden in most cases. The default case
        assumes the percept to be of type (state, reward)'''
        return percept

## Testing the <b>Q-learning</b> agent

We now instantiate a Q-learning agent to learn a policy in our test environment "grid". The agent parameters are: **Ne = 10**, **Rplus = 2**, and **alpha** as given in the footnote on **page 837** of the textbook:
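The effect of these parameters can be checked in isolation. The sketch below assumes the learning-rate schedule $\alpha(n) = 60/(59+n)$ and the exploration rule used by the agent (return Rplus while the visit count is below Ne, the learned utility afterwards); the standalone names `alpha` and `explore` and the sample visit counts are introduced only for illustration.

```python
def alpha(n):
    """Learning rate after n visits to a state-action pair: starts at 1 and decays slowly."""
    return 60.0 / (59 + n)


def explore(u, n, Ne=10, Rplus=2):
    """Optimistic exploration: pretend the value is Rplus until the pair was tried Ne times."""
    return Rplus if n < Ne else u


rates = [alpha(n) for n in (1, 10, 100, 1000)]
print(rates)

# A pair with a poor learned utility is still attractive before its 10th visit
print(explore(-3.0, n=9), explore(-3.0, n=10))
```

The schedule keeps the learning rate close to 1 for the first few dozen visits and only then lets it shrink, while the exploration function forces each action to be tried Ne times in every state the agent reaches before the learned Q-values take over.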

In [4]:
# Instantiate the grid environment
#environment = GridEnvironment(grid, terminals=[(3, 2), (3, 1)], initial=(0, 0), gamma=0.9) # grid from class
environment = GridEnvironment(grid, terminals=[(1, 1), (9, 9)], initial=(3, 1), gamma=0.9) # grid for this challenge

# Instantiate a Q-learning agent
agent = QLearningAgent(environment, Ne=10, Rplus=2, alpha=lambda n: 60./(59+n)) 

# Run 10000 episodes of the agent in the environment
TRIALS = 10000      
for e in range(TRIALS):   # For each trial
    current_state, current_reward = environment.reset()
    score_trial = current_reward   # the episode score is the cumulative sum of rewards in the episode
    while True:  # step the environment until a terminal state is reached
        percept = (current_state, current_reward)  # the agent's percept is the tuple (state, reward)
        action  = agent(percept)  # call the agent program with the percept and get the action to execute
        current_state, current_reward, done = environment.step(action) # execute the action in the environment
        score_trial += current_reward
        if done:
            print("Trial: {}/{}, score: {}".format(e, TRIALS, score_trial))
            break

Trial: 0/10000, score: -124.09999999999779
Trial: 1/10000, score: -10.3
Trial: 2/10000, score: -10.8
Trial: 3/10000, score: -10.4
Trial: 4/10000, score: -10.3
Trial: 5/10000, score: -10.4
Trial: 6/10000, score: -10.3
Trial: 7/10000, score: -10.4
Trial: 8/10000, score: -10.4
Trial: 9/10000, score: -10.3
Trial: 10/10000, score: -10.3
Trial: 11/10000, score: -10.5
Trial: 12/10000, score: -18.300000000000132
Trial: 13/10000, score: 2.9000000000000092
Trial: 14/10000, score: 3.800000000000006
Trial: 15/10000, score: 5.100000000000001
Trial: 16/10000, score: 6.599999999999998
Trial: 17/10000, score: 6.299999999999998
Trial: 18/10000, score: 4.400000000000004
Trial: 19/10000, score: -62.20000000000017
Trial: 20/10000, score: 4.700000000000003
Trial: 21/10000, score: 5.699999999999999
Trial: 22/10000, score: 3.300000000000008
Trial: 23/10000, score: -43.80000000000021
Trial: 24/10000, score: 2.0000000000000124
Trial: 25/10000, score: 3.9000000000000057
Trial: 26/10000, score: 3.700000000000006

Trial: 365/10000, score: 6.899999999999999
Trial: 366/10000, score: 7.099999999999999
Trial: 367/10000, score: 7.199999999999999
Trial: 368/10000, score: 7.199999999999999
Trial: 369/10000, score: 7.299999999999999
Trial: 370/10000, score: 6.699999999999998
Trial: 371/10000, score: 7.299999999999999
Trial: 372/10000, score: 6.599999999999998
Trial: 373/10000, score: 6.499999999999998
Trial: 374/10000, score: 6.799999999999999
Trial: 375/10000, score: 7.399999999999999
Trial: 376/10000, score: 6.899999999999999
Trial: 377/10000, score: 7.099999999999999
Trial: 378/10000, score: 6.699999999999998
Trial: 379/10000, score: 7.199999999999999
Trial: 380/10000, score: 7.6
Trial: 381/10000, score: 7.099999999999999
Trial: 382/10000, score: 6.699999999999998
Trial: 383/10000, score: 7.199999999999999
Trial: 384/10000, score: 7.299999999999999
Trial: 385/10000, score: 6.999999999999998
Trial: 386/10000, score: 6.1999999999999975
Trial: 387/10000, score: 7.099999999999999
Trial: 388/10000, score:

Trial: 594/10000, score: 7.299999999999999
Trial: 595/10000, score: 7.399999999999999
Trial: 596/10000, score: 5.799999999999999
Trial: 597/10000, score: 6.599999999999998
Trial: 598/10000, score: 7.199999999999999
Trial: 599/10000, score: 6.999999999999998
Trial: 600/10000, score: 7.299999999999999
Trial: 601/10000, score: 3.6000000000000068
Trial: 602/10000, score: 6.499999999999998
Trial: 603/10000, score: 7.299999999999999
Trial: 604/10000, score: 7.6
Trial: 605/10000, score: 7.199999999999999
Trial: 606/10000, score: 7.099999999999999
Trial: 607/10000, score: 7.099999999999999
Trial: 608/10000, score: 7.099999999999999
Trial: 609/10000, score: 7.499999999999999
Trial: 610/10000, score: 6.299999999999998
Trial: 611/10000, score: 7.299999999999999
Trial: 612/10000, score: 6.999999999999998
Trial: 613/10000, score: 6.399999999999999
Trial: 614/10000, score: 7.6
Trial: 615/10000, score: 7.699999999999999
Trial: 616/10000, score: 6.599999999999998
Trial: 617/10000, score: 7.39999999999

Trial: 821/10000, score: -24.100000000000215
Trial: 822/10000, score: 6.699999999999998
Trial: 823/10000, score: 7.199999999999999
Trial: 824/10000, score: 7.099999999999999
Trial: 825/10000, score: 6.899999999999999
Trial: 826/10000, score: 7.6
Trial: 827/10000, score: 7.199999999999999
Trial: 828/10000, score: 7.199999999999999
Trial: 829/10000, score: 7.499999999999999
Trial: 830/10000, score: 7.199999999999999
Trial: 831/10000, score: 6.899999999999999
Trial: 832/10000, score: 7.099999999999999
Trial: 833/10000, score: 7.499999999999999
Trial: 834/10000, score: 6.699999999999998
Trial: 835/10000, score: 7.099999999999999
Trial: 836/10000, score: 6.999999999999998
Trial: 837/10000, score: 7.099999999999999
Trial: 838/10000, score: 6.799999999999999
Trial: 839/10000, score: 6.999999999999998
Trial: 840/10000, score: 7.6
Trial: 841/10000, score: 6.499999999999998
Trial: 842/10000, score: 7.199999999999999
Trial: 843/10000, score: 7.6
Trial: 844/10000, score: 7.099999999999999
Trial: 8

Trial: 1087/10000, score: 7.099999999999999
Trial: 1088/10000, score: 6.799999999999999
Trial: 1089/10000, score: 7.6
Trial: 1090/10000, score: 6.299999999999998
Trial: 1091/10000, score: 6.799999999999999
Trial: 1092/10000, score: 7.199999999999999
Trial: 1093/10000, score: 6.499999999999998
Trial: 1094/10000, score: 6.899999999999999
Trial: 1095/10000, score: 7.6
Trial: 1096/10000, score: 7.499999999999999
Trial: 1097/10000, score: 6.599999999999998
Trial: 1098/10000, score: 6.799999999999999
Trial: 1099/10000, score: 7.499999999999999
Trial: 1100/10000, score: 7.199999999999999
Trial: 1101/10000, score: 7.099999999999999
Trial: 1102/10000, score: 6.899999999999999
Trial: 1103/10000, score: 7.099999999999999
Trial: 1104/10000, score: 7.099999999999999
Trial: 1105/10000, score: 6.999999999999998
Trial: 1106/10000, score: 7.299999999999999
Trial: 1107/10000, score: 7.399999999999999
Trial: 1108/10000, score: 6.699999999999998
Trial: 1109/10000, score: 6.999999999999998
Trial: 1110/1000

Trial: 1458/10000, score: 7.299999999999999
Trial: 1459/10000, score: 7.399999999999999
Trial: 1460/10000, score: 6.799999999999999
Trial: 1461/10000, score: 6.399999999999999
Trial: 1462/10000, score: 6.999999999999998
Trial: 1463/10000, score: 6.999999999999998
Trial: 1464/10000, score: 6.999999999999998
Trial: 1465/10000, score: 7.399999999999999
Trial: 1466/10000, score: 6.599999999999998
Trial: 1467/10000, score: 6.699999999999998
Trial: 1468/10000, score: 6.399999999999999
Trial: 1469/10000, score: 7.199999999999999
Trial: 1470/10000, score: 5.899999999999999
Trial: 1471/10000, score: 7.399999999999999
Trial: 1472/10000, score: 6.899999999999999
Trial: 1473/10000, score: 6.999999999999998
Trial: 1474/10000, score: 7.299999999999999
Trial: 1475/10000, score: 6.399999999999999
Trial: 1476/10000, score: 7.199999999999999
Trial: 1477/10000, score: 6.899999999999999
Trial: 1478/10000, score: 6.499999999999998
Trial: 1479/10000, score: 7.399999999999999
Trial: 1480/10000, score: 6.4999

Trial: 1681/10000, score: 7.099999999999999
Trial: 1682/10000, score: 7.499999999999999
Trial: 1683/10000, score: 6.799999999999999
Trial: 1684/10000, score: 6.699999999999998
Trial: 1685/10000, score: 7.099999999999999
Trial: 1686/10000, score: 7.099999999999999
Trial: 1687/10000, score: 6.999999999999998
Trial: 1688/10000, score: 7.499999999999999
Trial: 1689/10000, score: 6.699999999999998
Trial: 1690/10000, score: 6.499999999999998
Trial: 1691/10000, score: 7.499999999999999
Trial: 1692/10000, score: 6.799999999999999
Trial: 1693/10000, score: 6.699999999999998
Trial: 1694/10000, score: 7.199999999999999
Trial: 1695/10000, score: 6.999999999999998
Trial: 1696/10000, score: 6.999999999999998
Trial: 1697/10000, score: 6.499999999999998
Trial: 1698/10000, score: 7.499999999999999
Trial: 1699/10000, score: 7.399999999999999
Trial: 1700/10000, score: 6.999999999999998
Trial: 1701/10000, score: 5.899999999999999
Trial: 1702/10000, score: 7.199999999999999
Trial: 1703/10000, score: 7.4999

Trial: 1880/10000, score: 6.799999999999999
Trial: 1881/10000, score: 7.299999999999999
Trial: 1882/10000, score: 6.899999999999999
Trial: 1883/10000, score: 7.399999999999999
Trial: 1884/10000, score: 7.299999999999999
Trial: 1885/10000, score: 7.399999999999999
Trial: 1886/10000, score: 7.199999999999999
Trial: 1887/10000, score: 6.799999999999999
Trial: 1888/10000, score: 7.199999999999999
Trial: 1889/10000, score: 6.899999999999999
Trial: 1890/10000, score: 7.199999999999999
Trial: 1891/10000, score: 6.899999999999999
Trial: 1892/10000, score: 7.099999999999999
Trial: 1893/10000, score: 7.499999999999999
Trial: 1894/10000, score: 6.899999999999999
Trial: 1895/10000, score: 6.599999999999998
Trial: 1896/10000, score: 7.299999999999999
Trial: 1897/10000, score: 7.499999999999999
Trial: 1898/10000, score: 6.999999999999998
Trial: 1899/10000, score: 6.899999999999999
Trial: 1900/10000, score: 7.499999999999999
Trial: 1901/10000, score: 7.399999999999999
Trial: 1902/10000, score: 7.1999

Trial: 2108/10000, score: 6.999999999999998
Trial: 2109/10000, score: 7.699999999999999
Trial: 2110/10000, score: 6.999999999999998
Trial: 2111/10000, score: 7.399999999999999
Trial: 2112/10000, score: 7.6
Trial: 2113/10000, score: 7.399999999999999
Trial: 2114/10000, score: 7.199999999999999
Trial: 2115/10000, score: 7.199999999999999
Trial: 2116/10000, score: 5.999999999999998
Trial: 2117/10000, score: 6.999999999999998
Trial: 2118/10000, score: 7.099999999999999
Trial: 2119/10000, score: 6.599999999999998
Trial: 2120/10000, score: 7.099999999999999
Trial: 2121/10000, score: 7.099999999999999
Trial: 2122/10000, score: 7.299999999999999
Trial: 2123/10000, score: 6.999999999999998
Trial: 2124/10000, score: 7.499999999999999
Trial: 2125/10000, score: 7.099999999999999
Trial: 2126/10000, score: 6.999999999999998
Trial: 2127/10000, score: 6.499999999999998
Trial: 2128/10000, score: 7.199999999999999
Trial: 2129/10000, score: 6.799999999999999
Trial: 2130/10000, score: 7.199999999999999
Tr

Trial: 2348/10000, score: 6.999999999999998
Trial: 2349/10000, score: 6.899999999999999
Trial: 2350/10000, score: 7.199999999999999
Trial: 2351/10000, score: 7.399999999999999
Trial: 2352/10000, score: 7.499999999999999
Trial: 2353/10000, score: 7.199999999999999
Trial: 2354/10000, score: 6.699999999999998
Trial: 2355/10000, score: 7.299999999999999
Trial: 2356/10000, score: 7.199999999999999
Trial: 2357/10000, score: 7.199999999999999
Trial: 2358/10000, score: 5.999999999999998
Trial: 2359/10000, score: 7.199999999999999
Trial: 2360/10000, score: 7.199999999999999
Trial: 2361/10000, score: 7.299999999999999
Trial: 2362/10000, score: 6.799999999999999
Trial: 2363/10000, score: 6.799999999999999
Trial: 2364/10000, score: 6.799999999999999
Trial: 2365/10000, score: 7.199999999999999
Trial: 2366/10000, score: 6.899999999999999
Trial: 2367/10000, score: 7.499999999999999
Trial: 2368/10000, score: 5.4
Trial: 2369/10000, score: 6.799999999999999
Trial: 2370/10000, score: 7.299999999999999
Tr

Trial: 2730/10000, score: 7.499999999999999
Trial: 2731/10000, score: 7.699999999999999
Trial: 2732/10000, score: 6.999999999999998
Trial: 2733/10000, score: 7.499999999999999
Trial: 2734/10000, score: 7.299999999999999
Trial: 2735/10000, score: 7.299999999999999
Trial: 2736/10000, score: 6.999999999999998
Trial: 2737/10000, score: 6.999999999999998
Trial: 2738/10000, score: 7.199999999999999
Trial: 2739/10000, score: 6.399999999999999
Trial: 2740/10000, score: 7.299999999999999
Trial: 2741/10000, score: 6.899999999999999
Trial: 2742/10000, score: 7.299999999999999
Trial: 2743/10000, score: 6.599999999999998
Trial: 2744/10000, score: 6.999999999999998
Trial: 2745/10000, score: 7.099999999999999
Trial: 2746/10000, score: 6.999999999999998
Trial: 2747/10000, score: 7.399999999999999
Trial: 2748/10000, score: 7.399999999999999
Trial: 2749/10000, score: 7.199999999999999
Trial: 2750/10000, score: 6.299999999999998
Trial: 2751/10000, score: 6.899999999999999
Trial: 2752/10000, score: 7.0999

Trial: 2942/10000, score: 7.099999999999999
Trial: 2943/10000, score: 5.899999999999999
Trial: 2944/10000, score: 7.199999999999999
Trial: 2945/10000, score: 7.099999999999999
Trial: 2946/10000, score: 6.499999999999998
Trial: 2947/10000, score: 6.899999999999999
Trial: 2948/10000, score: 6.899999999999999
Trial: 2949/10000, score: 6.899999999999999
Trial: 2950/10000, score: 6.699999999999998
Trial: 2951/10000, score: 7.099999999999999
Trial: 2952/10000, score: 7.099999999999999
Trial: 2953/10000, score: 7.199999999999999
Trial: 2954/10000, score: 7.399999999999999
Trial: 2955/10000, score: 7.199999999999999
Trial: 2956/10000, score: 6.799999999999999
Trial: 2957/10000, score: 7.199999999999999
Trial: 2958/10000, score: 7.6
Trial: 2959/10000, score: 7.099999999999999
Trial: 2960/10000, score: 7.199999999999999
Trial: 2961/10000, score: 7.199999999999999
Trial: 2962/10000, score: 6.699999999999998
Trial: 2963/10000, score: 7.499999999999999
Trial: 2964/10000, score: 6.799999999999999
Tr

Trial: 3150/10000, score: 6.499999999999998
Trial: 3151/10000, score: 7.099999999999999
Trial: 3152/10000, score: 7.199999999999999
Trial: 3153/10000, score: 6.899999999999999
Trial: 3154/10000, score: 6.799999999999999
Trial: 3155/10000, score: 7.399999999999999
Trial: 3156/10000, score: 7.299999999999999
Trial: 3157/10000, score: 6.899999999999999
Trial: 3158/10000, score: 6.799999999999999
Trial: 3159/10000, score: 6.799999999999999
Trial: 3160/10000, score: 7.399999999999999
Trial: 3161/10000, score: 7.399999999999999
Trial: 3162/10000, score: 7.299999999999999
Trial: 3163/10000, score: 7.099999999999999
Trial: 3164/10000, score: 6.999999999999998
Trial: 3165/10000, score: 7.399999999999999
Trial: 3166/10000, score: 7.499999999999999
Trial: 3167/10000, score: 7.199999999999999
Trial: 3168/10000, score: 7.199999999999999
Trial: 3169/10000, score: 6.899999999999999
Trial: 3170/10000, score: 7.299999999999999
Trial: 3171/10000, score: 7.099999999999999
Trial: 3172/10000, score: 6.3999

Trial: 3562/10000, score: 6.799999999999999
Trial: 3563/10000, score: 6.399999999999999
Trial: 3564/10000, score: 6.999999999999998
Trial: 3565/10000, score: 6.899999999999999
Trial: 3566/10000, score: 7.399999999999999
Trial: 3567/10000, score: 7.199999999999999
Trial: 3568/10000, score: 6.399999999999999
Trial: 3569/10000, score: 7.299999999999999
Trial: 3570/10000, score: 6.999999999999998
Trial: 3571/10000, score: 7.199999999999999
Trial: 3572/10000, score: 6.799999999999999
Trial: 3573/10000, score: 6.799999999999999
Trial: 3574/10000, score: 7.299999999999999
Trial: 3575/10000, score: 7.099999999999999
Trial: 3576/10000, score: 6.999999999999998
Trial: 3577/10000, score: 6.699999999999998
Trial: 3578/10000, score: 6.999999999999998
Trial: 3579/10000, score: 6.999999999999998
Trial: 3580/10000, score: 7.099999999999999
Trial: 3581/10000, score: 6.999999999999998
Trial: 3582/10000, score: 7.299999999999999
Trial: 3583/10000, score: 7.499999999999999
Trial: 3584/10000, score: 6.5999

Trial: 3830/10000, score: 6.999999999999998
Trial: 3831/10000, score: 7.199999999999999
Trial: 3832/10000, score: 6.999999999999998
Trial: 3833/10000, score: 6.999999999999998
Trial: 3834/10000, score: 7.499999999999999
Trial: 3835/10000, score: 7.299999999999999
Trial: 3836/10000, score: 7.399999999999999
Trial: 3837/10000, score: 7.399999999999999
Trial: 3838/10000, score: 6.699999999999998
Trial: 3839/10000, score: 6.799999999999999
Trial: 3840/10000, score: 6.899999999999999
Trial: 3841/10000, score: 6.599999999999998
Trial: 3842/10000, score: 6.899999999999999
Trial: 3843/10000, score: 6.499999999999998
Trial: 3844/10000, score: 7.299999999999999
Trial: 3845/10000, score: 7.399999999999999
Trial: 3846/10000, score: 6.999999999999998
Trial: 3847/10000, score: 7.299999999999999
Trial: 3848/10000, score: 6.999999999999998
Trial: 3849/10000, score: 7.299999999999999
Trial: 3850/10000, score: 7.099999999999999
Trial: 3851/10000, score: 6.699999999999998
Trial: 3852/10000, score: 6.8999

Trial: 4148/10000, score: 6.499999999999998
Trial: 4149/10000, score: 7.099999999999999
Trial: 4150/10000, score: 6.999999999999998
Trial: 4151/10000, score: 7.299999999999999
Trial: 4152/10000, score: 6.899999999999999
Trial: 4153/10000, score: 6.699999999999998
Trial: 4154/10000, score: 6.599999999999998
Trial: 4155/10000, score: 7.299999999999999
Trial: 4156/10000, score: 7.6
Trial: 4157/10000, score: 7.099999999999999
Trial: 4158/10000, score: 6.899999999999999
Trial: 4159/10000, score: 6.799999999999999
Trial: 4160/10000, score: 6.999999999999998
Trial: 4161/10000, score: 7.299999999999999
Trial: 4162/10000, score: 6.999999999999998
Trial: 4163/10000, score: 6.899999999999999
Trial: 4164/10000, score: 6.599999999999998
Trial: 4165/10000, score: 7.099999999999999
Trial: 4166/10000, score: 7.099999999999999
Trial: 4167/10000, score: 7.299999999999999
Trial: 4168/10000, score: 6.1999999999999975
Trial: 4169/10000, score: 7.299999999999999
Trial: 4170/10000, score: 6.899999999999999
T

Trial: 4468/10000, score: 6.799999999999999
Trial: 4469/10000, score: 7.099999999999999
Trial: 4470/10000, score: 6.899999999999999
Trial: 4471/10000, score: 6.899999999999999
Trial: 4472/10000, score: 7.199999999999999
Trial: 4473/10000, score: 6.999999999999998
Trial: 4474/10000, score: 6.699999999999998
Trial: 4475/10000, score: 7.499999999999999
Trial: 4476/10000, score: 7.099999999999999
Trial: 4477/10000, score: 7.099999999999999
Trial: 4478/10000, score: 6.399999999999999
Trial: 4479/10000, score: 7.399999999999999
Trial: 4480/10000, score: 6.999999999999998
Trial: 4481/10000, score: 7.299999999999999
Trial: 4482/10000, score: 7.199999999999999
Trial: 4483/10000, score: 7.099999999999999
Trial: 4484/10000, score: 6.999999999999998
Trial: 4485/10000, score: 6.999999999999998
Trial: 4486/10000, score: 6.999999999999998
Trial: 4487/10000, score: 7.099999999999999
Trial: 4488/10000, score: 7.099999999999999
Trial: 4489/10000, score: 7.299999999999999
Trial: 4490/10000, score: 6.9999

Trial: 4777/10000, score: 6.999999999999998
Trial: 4778/10000, score: 6.799999999999999
Trial: 4779/10000, score: 6.999999999999998
Trial: 4780/10000, score: 6.899999999999999
Trial: 4781/10000, score: 7.199999999999999
Trial: 4782/10000, score: 6.399999999999999
Trial: 4783/10000, score: 6.999999999999998
Trial: 4784/10000, score: 6.799999999999999
Trial: 4785/10000, score: 7.499999999999999
Trial: 4786/10000, score: 7.199999999999999
Trial: 4787/10000, score: 6.999999999999998
Trial: 4788/10000, score: 6.999999999999998
Trial: 4789/10000, score: 7.199999999999999
Trial: 4790/10000, score: 6.999999999999998
Trial: 4791/10000, score: 6.999999999999998
Trial: 4792/10000, score: 6.399999999999999
Trial: 4793/10000, score: 7.099999999999999
Trial: 4794/10000, score: 7.499999999999999
Trial: 4795/10000, score: 6.499999999999998
Trial: 4796/10000, score: 7.199999999999999
Trial: 4797/10000, score: 6.499999999999998
Trial: 4798/10000, score: 7.499999999999999
Trial: 4799/10000, score: 7.1999

Trial: 5151/10000, score: 6.899999999999999
Trial: 5152/10000, score: 6.699999999999998
Trial: 5153/10000, score: 7.299999999999999
Trial: 5154/10000, score: 7.299999999999999
Trial: 5155/10000, score: 7.299999999999999
Trial: 5156/10000, score: 7.099999999999999
Trial: 5157/10000, score: 7.6
Trial: 5158/10000, score: 6.899999999999999
Trial: 5159/10000, score: 7.399999999999999
Trial: 5160/10000, score: 7.199999999999999
Trial: 5161/10000, score: 6.899999999999999
Trial: 5162/10000, score: 7.199999999999999
Trial: 5163/10000, score: 7.099999999999999
Trial: 5164/10000, score: 6.899999999999999
Trial: 5165/10000, score: 7.299999999999999
Trial: 5166/10000, score: 6.799999999999999
Trial: 5167/10000, score: 7.6
Trial: 5168/10000, score: 7.199999999999999
Trial: 5169/10000, score: 6.699999999999998
Trial: 5170/10000, score: 6.999999999999998
Trial: 5171/10000, score: 7.199999999999999
Trial: 5172/10000, score: 7.199999999999999
Trial: 5173/10000, score: 6.699999999999998
Trial: 5174/1000

Trial: 5424/10000, score: 6.999999999999998
Trial: 5425/10000, score: 7.199999999999999
Trial: 5426/10000, score: 5.899999999999999
Trial: 5427/10000, score: 6.599999999999998
Trial: 5428/10000, score: 6.499999999999998
Trial: 5429/10000, score: 6.799999999999999
Trial: 5430/10000, score: 6.499999999999998
Trial: 5431/10000, score: 6.699999999999998
Trial: 5432/10000, score: 7.499999999999999
Trial: 5433/10000, score: 5.699999999999999
Trial: 5434/10000, score: 6.799999999999999
Trial: 5435/10000, score: 7.499999999999999
Trial: 5436/10000, score: 7.199999999999999
Trial: 5437/10000, score: 7.199999999999999
Trial: 5438/10000, score: 6.999999999999998
Trial: 5439/10000, score: 7.299999999999999
Trial: 5440/10000, score: 7.299999999999999
Trial: 5441/10000, score: 7.099999999999999
Trial: 5442/10000, score: 7.299999999999999
Trial: 5443/10000, score: 6.899999999999999
Trial: 5444/10000, score: 6.699999999999998
Trial: 5445/10000, score: 6.699999999999998
Trial: 5446/10000, score: 6.6999

Trial: 5685/10000, score: 7.399999999999999
Trial: 5686/10000, score: 6.899999999999999
Trial: 5687/10000, score: 7.299999999999999
Trial: 5688/10000, score: 6.899999999999999
Trial: 5689/10000, score: 6.799999999999999
Trial: 5690/10000, score: 6.999999999999998
Trial: 5691/10000, score: 7.199999999999999
Trial: 5692/10000, score: 7.199999999999999
Trial: 5693/10000, score: 7.499999999999999
Trial: 5694/10000, score: 7.199999999999999
Trial: 5695/10000, score: 7.099999999999999
Trial: 5696/10000, score: 7.399999999999999
Trial: 5697/10000, score: 7.099999999999999
Trial: 5698/10000, score: 7.099999999999999
Trial: 5699/10000, score: 7.099999999999999
Trial: 5700/10000, score: 7.399999999999999
Trial: 5701/10000, score: 6.499999999999998
Trial: 5702/10000, score: 7.099999999999999
Trial: 5703/10000, score: 6.099999999999998
Trial: 5704/10000, score: 6.899999999999999
Trial: 5705/10000, score: 6.899999999999999
Trial: 5706/10000, score: 7.299999999999999
Trial: 5707/10000, score: 7.3999

Trial: 5946/10000, score: 7.099999999999999
Trial: 5947/10000, score: 7.299999999999999
Trial: 5948/10000, score: 7.299999999999999
Trial: 5949/10000, score: 7.199999999999999
Trial: 5950/10000, score: 7.6
Trial: 5951/10000, score: 6.599999999999998
Trial: 5952/10000, score: 6.699999999999998
Trial: 5953/10000, score: 7.299999999999999
Trial: 5954/10000, score: 7.499999999999999
Trial: 5955/10000, score: 6.999999999999998
Trial: 5956/10000, score: 6.799999999999999
Trial: 5957/10000, score: 7.099999999999999
Trial: 5958/10000, score: 6.599999999999998
Trial: 5959/10000, score: 7.099999999999999
Trial: 5960/10000, score: 6.999999999999998
Trial: 5961/10000, score: 6.799999999999999
Trial: 5962/10000, score: 6.899999999999999
Trial: 5963/10000, score: 7.299999999999999
Trial: 5964/10000, score: 7.499999999999999
Trial: 5965/10000, score: 7.199999999999999
Trial: 5966/10000, score: 7.6
Trial: 5967/10000, score: 7.099999999999999
Trial: 5968/10000, score: 6.799999999999999
Trial: 5969/1000

Trial: 6220/10000, score: 6.899999999999999
Trial: 6221/10000, score: 7.299999999999999
Trial: 6222/10000, score: 7.199999999999999
Trial: 6223/10000, score: 7.099999999999999
Trial: 6224/10000, score: 7.099999999999999
Trial: 6225/10000, score: 7.199999999999999
Trial: 6226/10000, score: 7.199999999999999
Trial: 6227/10000, score: 7.6
Trial: 6228/10000, score: 7.099999999999999
Trial: 6229/10000, score: 6.899999999999999
Trial: 6230/10000, score: 6.799999999999999
Trial: 6231/10000, score: 7.499999999999999
Trial: 6232/10000, score: 7.099999999999999
Trial: 6233/10000, score: 7.099999999999999
Trial: 6234/10000, score: 6.599999999999998
Trial: 6235/10000, score: 6.999999999999998
Trial: 6236/10000, score: 6.399999999999999
Trial: 6237/10000, score: 7.199999999999999
Trial: 6238/10000, score: 6.599999999999998
Trial: 6239/10000, score: 7.299999999999999
Trial: 6240/10000, score: 7.299999999999999
Trial: 6241/10000, score: 7.399999999999999
Trial: 6242/10000, score: 6.999999999999998
Tr

Trial: 6565/10000, score: 6.799999999999999
Trial: 6566/10000, score: 7.399999999999999
Trial: 6567/10000, score: 6.699999999999998
Trial: 6568/10000, score: 6.899999999999999
Trial: 6569/10000, score: 6.899999999999999
Trial: 6570/10000, score: 7.099999999999999
Trial: 6571/10000, score: 7.199999999999999
Trial: 6572/10000, score: 7.099999999999999
Trial: 6573/10000, score: 6.699999999999998
Trial: 6574/10000, score: 7.399999999999999
Trial: 6575/10000, score: 7.6
Trial: 6576/10000, score: 6.899999999999999
Trial: 6577/10000, score: 7.199999999999999
Trial: 6578/10000, score: 6.599999999999998
Trial: 6579/10000, score: 6.1999999999999975
Trial: 6580/10000, score: 6.799999999999999
Trial: 6581/10000, score: 6.999999999999998
Trial: 6582/10000, score: 7.099999999999999
Trial: 6583/10000, score: 7.199999999999999
Trial: 6584/10000, score: 6.599999999999998
Trial: 6585/10000, score: 6.899999999999999
Trial: 6586/10000, score: 7.399999999999999
Trial: 6587/10000, score: 7.6
Trial: 6588/100

Trial: 6904/10000, score: 6.999999999999998
Trial: 6905/10000, score: 7.099999999999999
Trial: 6906/10000, score: 7.499999999999999
Trial: 6907/10000, score: 7.199999999999999
Trial: 6908/10000, score: 6.899999999999999
Trial: 6909/10000, score: 6.799999999999999
Trial: 6910/10000, score: 7.299999999999999
Trial: 6911/10000, score: 7.099999999999999
Trial: 6912/10000, score: 6.999999999999998
Trial: 6913/10000, score: 7.099999999999999
Trial: 6914/10000, score: 7.299999999999999
Trial: 6915/10000, score: 6.799999999999999
Trial: 6916/10000, score: 6.799999999999999
Trial: 6917/10000, score: 7.399999999999999
Trial: 6918/10000, score: 6.099999999999998
Trial: 6919/10000, score: 6.899999999999999
Trial: 6920/10000, score: 7.199999999999999
Trial: 6921/10000, score: 7.099999999999999
Trial: 6922/10000, score: 6.499999999999998
Trial: 6923/10000, score: 7.299999999999999
Trial: 6924/10000, score: 7.199999999999999
Trial: 6925/10000, score: 7.299999999999999
Trial: 6926/10000, score: 7.2999

Trial: 7239/10000, score: 6.699999999999998
Trial: 7240/10000, score: 7.199999999999999
Trial: 7241/10000, score: 6.499999999999998
Trial: 7242/10000, score: 7.099999999999999
Trial: 7243/10000, score: 7.099999999999999
Trial: 7244/10000, score: 6.999999999999998
Trial: 7245/10000, score: 6.599999999999998
Trial: 7246/10000, score: 7.299999999999999
Trial: 7247/10000, score: 6.699999999999998
Trial: 7248/10000, score: 7.299999999999999
Trial: 7249/10000, score: 7.399999999999999
Trial: 7250/10000, score: 6.999999999999998
Trial: 7251/10000, score: 7.399999999999999
Trial: 7252/10000, score: 6.399999999999999
Trial: 7253/10000, score: 6.599999999999998
Trial: 7254/10000, score: 7.099999999999999
Trial: 7255/10000, score: 6.999999999999998
Trial: 7256/10000, score: 7.099999999999999
Trial: 7257/10000, score: 7.099999999999999
Trial: 7258/10000, score: 6.999999999999998
Trial: 7259/10000, score: 6.999999999999998
Trial: 7260/10000, score: 7.199999999999999
Trial: 7261/10000, score: 6.6999

Trial: 7514/10000, score: 7.399999999999999
Trial: 7515/10000, score: 7.399999999999999
Trial: 7516/10000, score: 7.299999999999999
Trial: 7517/10000, score: 6.499999999999998
Trial: 7518/10000, score: 7.399999999999999
Trial: 7519/10000, score: 6.599999999999998
Trial: 7520/10000, score: 6.999999999999998
Trial: 7521/10000, score: 6.599999999999998
Trial: 7522/10000, score: 7.399999999999999
Trial: 7523/10000, score: 6.999999999999998
Trial: 7524/10000, score: 7.299999999999999
Trial: 7525/10000, score: 7.099999999999999
Trial: 7526/10000, score: 6.999999999999998
Trial: 7527/10000, score: 7.6
Trial: 7528/10000, score: 6.799999999999999
Trial: 7529/10000, score: 7.199999999999999
Trial: 7530/10000, score: 7.6
Trial: 7531/10000, score: 7.099999999999999
Trial: 7532/10000, score: 7.399999999999999
Trial: 7533/10000, score: 6.899999999999999
Trial: 7534/10000, score: 6.799999999999999
Trial: 7535/10000, score: 7.199999999999999
Trial: 7536/10000, score: 6.899999999999999
Trial: 7537/1000

Trial: 7831/10000, score: 6.999999999999998
Trial: 7832/10000, score: 7.299999999999999
Trial: 7833/10000, score: 7.099999999999999
Trial: 7834/10000, score: 6.899999999999999
Trial: 7835/10000, score: 7.499999999999999
Trial: 7836/10000, score: 7.099999999999999
Trial: 7837/10000, score: 7.099999999999999
Trial: 7838/10000, score: 7.299999999999999
Trial: 7839/10000, score: 6.1999999999999975
Trial: 7840/10000, score: 7.199999999999999
Trial: 7841/10000, score: 6.899999999999999
Trial: 7842/10000, score: 7.099999999999999
Trial: 7843/10000, score: 6.699999999999998
Trial: 7844/10000, score: 7.6
Trial: 7845/10000, score: 6.899999999999999
Trial: 7846/10000, score: 6.899999999999999
Trial: 7847/10000, score: 6.999999999999998
Trial: 7848/10000, score: 6.999999999999998
Trial: 7849/10000, score: 6.999999999999998
Trial: 7850/10000, score: 7.499999999999999
Trial: 7851/10000, score: 7.099999999999999
Trial: 7852/10000, score: 7.399999999999999
Trial: 7853/10000, score: 6.899999999999999
T

Trial: 8129/10000, score: 7.199999999999999
Trial: 8130/10000, score: 6.799999999999999
Trial: 8131/10000, score: 7.199999999999999
Trial: 8132/10000, score: 7.6
Trial: 8133/10000, score: 6.799999999999999
Trial: 8134/10000, score: 7.399999999999999
Trial: 8135/10000, score: 7.199999999999999
Trial: 8136/10000, score: 7.099999999999999
Trial: 8137/10000, score: 7.099999999999999
Trial: 8138/10000, score: 7.499999999999999
Trial: 8139/10000, score: 6.999999999999998
Trial: 8140/10000, score: 7.099999999999999
Trial: 8141/10000, score: 7.199999999999999
Trial: 8142/10000, score: 7.199999999999999
Trial: 8143/10000, score: 6.999999999999998
Trial: 8144/10000, score: 7.299999999999999
Trial: 8145/10000, score: 7.299999999999999
Trial: 8146/10000, score: 6.799999999999999
Trial: 8147/10000, score: 6.899999999999999
Trial: 8148/10000, score: 6.699999999999998
Trial: 8149/10000, score: 7.099999999999999
Trial: 8150/10000, score: 7.399999999999999
Trial: 8151/10000, score: 6.399999999999999
Tr

Trial: 8453/10000, score: 6.299999999999998
Trial: 8454/10000, score: 6.999999999999998
Trial: 8455/10000, score: 6.899999999999999
Trial: 8456/10000, score: 6.899999999999999
Trial: 8457/10000, score: 7.199999999999999
Trial: 8458/10000, score: 7.299999999999999
Trial: 8459/10000, score: 7.399999999999999
Trial: 8460/10000, score: 7.099999999999999
Trial: 8461/10000, score: 7.6
Trial: 8462/10000, score: 6.699999999999998
Trial: 8463/10000, score: 7.099999999999999
Trial: 8464/10000, score: 7.199999999999999
Trial: 8465/10000, score: 6.499999999999998
Trial: 8466/10000, score: 6.999999999999998
Trial: 8467/10000, score: 6.499999999999998
Trial: 8468/10000, score: 7.099999999999999
Trial: 8469/10000, score: 7.6
Trial: 8470/10000, score: 6.499999999999998
Trial: 8471/10000, score: 6.299999999999998
Trial: 8472/10000, score: 6.799999999999999
Trial: 8473/10000, score: 6.799999999999999
Trial: 8474/10000, score: 6.099999999999998
Trial: 8475/10000, score: 6.899999999999999
Trial: 8476/1000

Trial: 8711/10000, score: 7.299999999999999
Trial: 8712/10000, score: 7.299999999999999
Trial: 8713/10000, score: 6.799999999999999
Trial: 8714/10000, score: 7.399999999999999
Trial: 8715/10000, score: 7.099999999999999
Trial: 8716/10000, score: 7.299999999999999
Trial: 8717/10000, score: 6.799999999999999
Trial: 8718/10000, score: 6.699999999999998
Trial: 8719/10000, score: 6.799999999999999
Trial: 8720/10000, score: 7.299999999999999
Trial: 8721/10000, score: 6.999999999999998
Trial: 8722/10000, score: 6.799999999999999
Trial: 8723/10000, score: 6.399999999999999
Trial: 8724/10000, score: 7.199999999999999
Trial: 8725/10000, score: 7.299999999999999
Trial: 8726/10000, score: 6.999999999999998
Trial: 8727/10000, score: 7.199999999999999
Trial: 8728/10000, score: 6.899999999999999
Trial: 8729/10000, score: 7.399999999999999
Trial: 8730/10000, score: 6.799999999999999
Trial: 8731/10000, score: 6.499999999999998
Trial: 8732/10000, score: 6.899999999999999
Trial: 8733/10000, score: 7.4999

Trial: 9014/10000, score: 7.099999999999999
Trial: 9015/10000, score: 7.399999999999999
Trial: 9016/10000, score: 7.299999999999999
Trial: 9017/10000, score: 7.199999999999999
Trial: 9018/10000, score: 7.399999999999999
Trial: 9019/10000, score: 5.999999999999998
Trial: 9020/10000, score: 6.799999999999999
Trial: 9021/10000, score: 6.999999999999998
Trial: 9022/10000, score: 7.499999999999999
Trial: 9023/10000, score: 7.499999999999999
Trial: 9024/10000, score: 6.699999999999998
Trial: 9025/10000, score: 6.499999999999998
Trial: 9026/10000, score: 7.499999999999999
Trial: 9027/10000, score: 6.799999999999999
Trial: 9028/10000, score: 7.199999999999999
Trial: 9029/10000, score: 7.299999999999999
Trial: 9030/10000, score: 7.499999999999999
Trial: 9031/10000, score: 7.099999999999999
Trial: 9032/10000, score: 6.999999999999998
Trial: 9033/10000, score: 7.399999999999999
Trial: 9034/10000, score: 6.399999999999999
Trial: 9035/10000, score: 7.299999999999999
Trial: 9036/10000, score: 7.2999

Trial: 9307/10000, score: 7.299999999999999
Trial: 9308/10000, score: 6.699999999999998
Trial: 9309/10000, score: 6.599999999999998
Trial: 9310/10000, score: 7.299999999999999
Trial: 9311/10000, score: 7.099999999999999
Trial: 9312/10000, score: 7.099999999999999
Trial: 9313/10000, score: 7.299999999999999
Trial: 9314/10000, score: 7.199999999999999
Trial: 9315/10000, score: 7.399999999999999
Trial: 9316/10000, score: 6.899999999999999
Trial: 9317/10000, score: 7.399999999999999
Trial: 9318/10000, score: 7.299999999999999
Trial: 9319/10000, score: 7.099999999999999
Trial: 9320/10000, score: 7.299999999999999
Trial: 9321/10000, score: 6.699999999999998
Trial: 9322/10000, score: 6.999999999999998
Trial: 9323/10000, score: 6.999999999999998
Trial: 9324/10000, score: 6.999999999999998
Trial: 9325/10000, score: 7.499999999999999
Trial: 9326/10000, score: 6.699999999999998
Trial: 9327/10000, score: 7.199999999999999
Trial: 9328/10000, score: 6.899999999999999
Trial: 9329/10000, score: 7.1999

Trial: 9690/10000, score: 6.599999999999998
Trial: 9691/10000, score: 7.499999999999999
Trial: 9692/10000, score: 7.099999999999999
Trial: 9693/10000, score: 7.399999999999999
Trial: 9694/10000, score: 6.599999999999998
Trial: 9695/10000, score: 6.999999999999998
Trial: 9696/10000, score: 7.199999999999999
Trial: 9697/10000, score: 7.499999999999999
Trial: 9698/10000, score: 7.399999999999999
Trial: 9699/10000, score: 6.899999999999999
Trial: 9700/10000, score: 7.399999999999999
Trial: 9701/10000, score: 7.199999999999999
Trial: 9702/10000, score: 7.199999999999999
Trial: 9703/10000, score: 6.899999999999999
Trial: 9704/10000, score: 7.099999999999999
Trial: 9705/10000, score: 6.799999999999999
Trial: 9706/10000, score: 6.899999999999999
Trial: 9707/10000, score: 6.699999999999998
Trial: 9708/10000, score: 7.099999999999999
Trial: 9709/10000, score: 7.299999999999999
Trial: 9710/10000, score: 6.799999999999999
Trial: 9711/10000, score: 7.199999999999999
Trial: 9712/10000, score: 7.1999

Now let's inspect the dictionary of learned Q-values. Its keys are (state, action) pairs, where each action is one of the following direction vectors:

NORTH = (0, 1)  
SOUTH = (0,-1)  
WEST = (-1, 0)  
EAST = (1, 0)

In [6]:
Qvalues = agent.Q
print(Qvalues)

defaultdict(<class 'float'>, {((3, 1), (1, 0)): -0.9503311697459681, ((3, 1), (0, 1)): -0.9516610366533239, ((3, 1), (-1, 0)): -0.9518847562395456, ((3, 1), (0, -1)): -0.9518832648587032, ((4, 1), (1, 0)): -0.9435459511360863, ((4, 1), (0, 1)): -0.9464292117751305, ((4, 1), (-1, 0)): -0.9449186099835187, ((4, 1), (0, -1)): -0.9467637747288986, ((5, 1), (1, 0)): -0.9384039471248806, ((5, 1), (0, 1)): -0.9358949249962897, ((5, 1), (-1, 0)): -0.9432314173678052, ((5, 1), (0, -1)): -0.9387857397732663, ((5, 2), (1, 0)): -0.9341501235229096, ((5, 2), (0, 1)): -0.9265252758673215, ((5, 2), (-1, 0)): -0.9272391422092627, ((5, 2), (0, -1)): -0.9271477893724792, ((5, 3), (1, 0)): -0.9163446129676152, ((5, 3), (0, 1)): -0.9204519075333434, ((5, 3), (-1, 0)): -0.9160611147714801, ((5, 3), (0, -1)): -0.9267792563818309, ((6, 3), (1, 0)): -0.9160803943658296, ((6, 3), (0, 1)): -0.9159229068593028, ((6, 3), (-1, 0)): -0.9177907613959303, ((6, 3), (0, -1)): -0.9148882382723134, ((7, 3), (1, 0)): -0.9
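The raw dictionary is hard to read because every key mixes a state with an action vector. As a minimal sketch, the entries could be regrouped per state with named actions. The helper `group_qvalues_by_state` and the toy table are hypothetical and not part of the notebook; only the four action vectors come from it.

```python
from collections import defaultdict

# Action vectors as listed in the notebook.
ACTION_NAMES = {(1, 0): "EAST", (0, 1): "NORTH", (-1, 0): "WEST", (0, -1): "SOUTH"}

def group_qvalues_by_state(Q):
    """Return {state: {action_name: q_value}} from a {(state, action): q_value} mapping."""
    grouped = defaultdict(dict)
    # Iterate over a snapshot of the items; Q itself is only read, never modified.
    for (state, action), value in list(Q.items()):
        grouped[state][ACTION_NAMES.get(action, str(action))] = value
    return dict(grouped)

# Toy Q-table with made-up values, for illustration only.
toy_Q = {
    ((0, 0), (1, 0)): 0.5,
    ((0, 0), (0, 1)): -0.2,
    ((1, 0), (-1, 0)): 0.1,
}
toy_grouped = group_qvalues_by_state(toy_Q)
for state in sorted(toy_grouped):
    print(state, toy_grouped[state])
```

With the trained agent, the same call would presumably be `group_qvalues_by_state(agent.Q)`.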

## CHALLENGE:


<b>1) Write a function that extracts the state utilities (U) from the Q-values learned by the agent. Name it get_utilities_from_qvalues(Qvalues) and test it on the previous result (agent.Q).</b>


Answer:

In [7]:
def get_utilities_from_qvalues(mdp, Q):
    """Given an MDP and a Q-value table Q, return the utility of every non-terminal state."""
    U = {}
    for s in mdp.states:
        if s not in mdp.terminals:
            # U(s) is the largest Q-value over the actions available in s
            U[s] = max(Q[(s, a)] for a in mdp.actionlist)
    return U

In [8]:
U = get_utilities_from_qvalues(environment, agent.Q)
print(U)

{(5, 9): -0.7470211424830404, (4, 7): -0.6286340665114191, (1, 3): -0.8562083272637346, (6, 9): -0.7511682847955876, (7, 3): -0.9175505996795595, (9, 1): -0.9077531296639262, (9, 8): -0.18579908294695432, (7, 7): -0.45622826607629885, (2, 1): -0.9520636719049956, (1, 6): -0.7858080413058445, (9, 4): -0.9209108944643444, (3, 7): -0.6734144320257237, (5, 1): -0.9358949249962897, (8, 5): -0.917395023752628, (7, 2): -0.9065472631987467, (4, 9): -0.7486708997588435, (3, 3): -0.8892469118558004, (2, 9): -0.7687627478631729, (5, 5): -0.9236896868984616, (6, 7): -0.5208028903295077, (6, 3): -0.9148882382723134, (1, 5): -0.8113466371138591, (4, 1): -0.9435459511360863, (9, 7): -0.29539729569291306, (7, 1): -0.9095056559594192, (4, 5): -0.9234261675625893, (9, 3): -0.9087513986768697, (1, 4): -0.8341618609197332, (3, 9): -0.75011981663061, (2, 3): -0.8739268202166111, (1, 9): -0.7859814894514331, (7, 5): -0.9187425349677043, (8, 7): -0.3844594944320013, (6, 5): -0.9182630778462919, (3, 5): -0.92
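The rule applied above is U(s) = max over a of Q(s, a). One detail is worth checking on a toy table: since `agent.Q` is a `defaultdict(float)`, any (state, action) pair that was never updated reads as 0.0, and that default can win the maximum when all learned values are negative. The sketch below makes the difference visible; `utilities_from_q`, its `visited_only` flag, and the toy values are hypothetical illustrations, not part of the notebook.

```python
def utilities_from_q(Q, states, actions, visited_only=False):
    """Compute U(s) = max_a Q(s, a) for each state in `states`.

    If visited_only is True, only (s, a) pairs actually stored in Q are used;
    otherwise missing pairs count as 0.0, mimicking a defaultdict(float).
    """
    U = {}
    for s in states:
        if visited_only:
            values = [Q[(s, a)] for a in actions if (s, a) in Q]
        else:
            values = [Q.get((s, a), 0.0) for a in actions]
        if values:  # skip states with no stored value at all
            U[s] = max(values)
    return U

toy_actions = [(1, 0), (0, 1), (-1, 0), (0, -1)]
toy_states = [(0, 0), (1, 0)]
# Made-up values: state (1, 0) has only two of its four actions stored.
toy_Q = {
    ((0, 0), (1, 0)): 0.5,
    ((0, 0), (0, 1)): -0.2,
    ((0, 0), (-1, 0)): 0.1,
    ((0, 0), (0, -1)): 0.0,
    ((1, 0), (1, 0)): -0.1,
    ((1, 0), (-1, 0)): -0.3,
}

toy_U_all = utilities_from_q(toy_Q, toy_states, toy_actions)
toy_U_visited = utilities_from_q(toy_Q, toy_states, toy_actions, visited_only=True)
print(toy_U_all)
print(toy_U_visited)
```

For state (1, 0) the two variants disagree: the default 0.0 dominates in the first case, while the second reports the best learned value.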

<b>2) Write a function that extracts the policy from the Q-values learned by the agent. Name it get_policy_from_qvalues(Qvalues) and test it on the previous result (agent.Q).</b>


Answer:

In [9]:
def get_policy_from_qvalues(mdp, Q):
    """Given an MDP and a Q-value table Q, return the greedy (best) policy."""
    pi = {}
    for s in mdp.states:
        if s not in mdp.terminals:
            # Pick the action with the highest Q-value in state s
            pi[s] = max(mdp.actionlist, key=lambda a: Q[(s, a)])
        else:
            # No action is taken in a terminal state
            pi[s] = None
    return pi

In [10]:
pi_qlearning = get_policy_from_qvalues(environment, Qvalues)
environment.print_policy(pi_qlearning)

None   None   None   None   None   None   None   None   None   None   None
None   >      >      v      >      >      ^      v      None   .      None
None   v      None   None   None   None   None   None   None   ^      None
None   >      >      >      >      >      >      >      >      ^      None
None   ^      None   None   None   None   None   None   None   None   None
None   ^      None   v      v      >      <      ^      v      <      None
None   ^      None   None   None   None   None   ^      None   >      None
None   ^      <      <      <      <      v      >      None   v      None
None   None   None   None   None   ^      None   <      None   <      None
None   .      >      >      >      ^      None   <      None   v      None
None   None   None   None   None   None   None   None   None   None   None
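To make the greedy rule pi(s) = argmax over a of Q(s, a) concrete, here is a small self-contained sketch on a one-row toy grid. The function `greedy_policy`, the toy Q-table, and the arrow mapping are hypothetical; the arrows are only inferred from the symbols in the printed policy above, where terminal states appear as `.`.

```python
# Assumed symbol per action vector, inferred from the printed policy.
ARROWS = {(1, 0): ">", (0, 1): "^", (-1, 0): "<", (0, -1): "v", None: "."}

def greedy_policy(Q, states, actions, terminals=()):
    """Return {state: best_action}, with None for terminal states.

    Missing (state, action) pairs count as 0.0, mimicking a defaultdict(float).
    Ties are broken in favor of the action listed first in `actions`.
    """
    pi = {}
    for s in states:
        if s in terminals:
            pi[s] = None
        else:
            pi[s] = max(actions, key=lambda a: Q.get((s, a), 0.0))
    return pi

toy_actions = [(1, 0), (0, 1), (-1, 0), (0, -1)]
toy_states = [(0, 0), (1, 0), (2, 0)]
# Made-up Q-values that favor moving east toward the terminal state (2, 0).
toy_Q = {
    ((0, 0), (1, 0)): 0.5,
    ((0, 0), (0, 1)): -0.2,
    ((1, 0), (1, 0)): 0.9,
    ((1, 0), (-1, 0)): 0.1,
}

toy_pi = greedy_policy(toy_Q, toy_states, toy_actions, terminals=[(2, 0)])
toy_row = " ".join(ARROWS[toy_pi[(x, 0)]] for x in range(3))
print(toy_row)
```

Because `max` returns the first maximal element, the order of the action list decides ties; in states the agent rarely visited, the printed arrow may reflect that order rather than a learned preference.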
