### Documentation

Interesting problems for reinforcement learning
 * Gymnasium: https://gymnasium.farama.org/environments/box2d/

## Installation

%pip install gymnasium  
%pip install gymnasium[box2d] 

## Additional steps

These may be needed *before* installing gymnasium[box2d].

### On macOS

pip uninstall swig  
xcode-select --install (installs the developer tools if they are not already present)  
pip install swig  / sudo port install swig-python  
pip install 'gymnasium[box2d]' # in zsh the quotes are required  

### On Windows

If the installation fails with an error, the cause is a missing or incorrect version of Microsoft C++ Build Tools, which Box2D depends on. To fix it:
 * Download Microsoft C++ Build Tools from https://visualstudio.microsoft.com/visual-cpp-build-tools/.
 * In the installer, select the "Desktop development with C++" option.
 * Restart your Jupyter Notebook or Visual Studio session.
 * Run !pip install gymnasium[box2d] again from a notebook cell.

### On Linux (Colab)
  * pip install swig
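
Before moving on, it can help to confirm that the pieces mentioned above are visible from the interpreter the notebook is using. The following is a minimal sketch using only the standard library; the helper name `check_box2d_prerequisites` is hypothetical, and it assumes the Box2D bindings are importable under the module name `Box2D`.

```python
import importlib.util
import shutil


def check_box2d_prerequisites():
    """Report which parts of the Box2D toolchain this interpreter can see."""
    return {
        # SWIG is a command-line tool, so look for it on PATH.
        "swig": shutil.which("swig") is not None,
        # find_spec returns None when a top-level package is not installed.
        "gymnasium": importlib.util.find_spec("gymnasium") is not None,
        # Assumption: the Box2D bindings install a module named "Box2D".
        "Box2D": importlib.util.find_spec("Box2D") is not None,
    }


if __name__ == "__main__":
    for name, found in check_box2d_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" entry for `swig` only means the executable is not on PATH for this process; it may still be installed elsewhere.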

In [1]:
import numpy as np
import random
import matplotlib.pyplot as plt
import sys
import gymnasium as gym
import pygame
import gymnasium.utils.play
from MLP import MLP
from tqdm import tqdm
import concurrent.futures

from loky import get_reusable_executor


EXECUTOR = get_reusable_executor()

def get_architecture():
    return architecture

def set_architecture(arch):
    global architecture
    architecture = arch
    
set_architecture([8, 6, 4])
architecture = get_architecture()
print(architecture)


[8, 6, 4]
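
The architecture `[8, 6, 4]` matches LunarLander's 8 observation values and 4 discrete actions, with one hidden layer of 6 neurons. As a rough sketch of how long the resulting chromosome would be, the helper below counts weights and biases of a fully connected network. The function `chromosome_length` is hypothetical, and it assumes one bias per non-input neuron; the actual `MLP` module may lay out its parameters differently.

```python
def chromosome_length(architecture):
    """Count weights plus biases of a fully connected network.

    Assumes one bias per non-input neuron; the real MLP module may differ.
    """
    total = 0
    for n_in, n_out in zip(architecture, architecture[1:]):
        total += n_in * n_out  # weight matrix between consecutive layers
        total += n_out         # one bias per neuron in the receiving layer
    return total


print(chromosome_length([8, 6, 4]))  # (8*6 + 6) + (6*4 + 4) = 82
```

Under that assumption each individual is a vector of 82 real numbers, which is the search space the genetic operators work on.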


In [2]:
def simulated_binary_crossover(ind1, ind2, pcross, eta=2):
    ind1_copy, ind2_copy = [*ind1], [*ind2]
    for i in range(len(ind1)):
        if random.random() < pcross:
            u = random.random()
            beta = (2 * u) ** (1 / (eta + 1)) if u <= 0.5 else (1 / (2 * (1 - u))) ** (1 / (eta + 1))
            ind1_copy[i] = 0.5 * ((1 + beta) * ind1[i] + (1 - beta) * ind2[i])
            ind2_copy[i] = 0.5 * ((1 - beta) * ind1[i] + (1 + beta) * ind2[i])
    return ind1_copy, ind2_copy

def polynomial_mutation(ind, pmut, eta=2):
    ind_copy = [*ind]
    for i in range(len(ind)):
        if random.random() < pmut:
            u = random.random()
            delta = (2 * u) ** (1 / (eta + 1)) - 1 if u < 0.5 else 1 - (2 * (1 - u)) ** (1 / (eta + 1))
            ind_copy[i] += delta
    return ind_copy


def gaussian_mutation(ind, pmut, sigma=0.2):  # Try 0.05
    ind_copy = [*ind]
    for i in range(len(ind)):
        if random.random() < pmut:
            ind_copy[i] += random.gauss(0, sigma)
    return ind_copy

def random_mutation(ind, pmut):
    options = [polynomial_mutation, gaussian_mutation]

    return random.choice(options)(ind, pmut)

def fitness(ch):
    """Mean accumulated reward of chromosome `ch` over 3 LunarLander episodes."""
    env = gym.make("LunarLander-v3", render_mode=None)

    # Build the network once: its weights do not change between steps.
    model = MLP(get_architecture())
    model.from_chromosome(ch)

    rewards_list = []
    # The slides say *3* episodes
    for _ in range(3):
        observation, _ = env.reset()
        racum = 0
        while True:
            action = policy(model, observation)
            observation, reward, terminated, truncated, _ = env.step(action)
            # reward = custom_reward(observation, action, terminated)
            racum += reward

            if terminated or truncated:
                rewards_list.append(racum)
                break

    env.close()  # release the environment's resources
    return sum(rewards_list) / len(rewards_list)

def show(ind):
    """Render a single episode played by individual `ind`."""
    env = gym.make("LunarLander-v3", render_mode="human")

    # Build the network once: its weights do not change between steps.
    model = MLP(get_architecture())
    model.from_chromosome(ind)

    observation, _ = env.reset()
    while True:
        action = policy(model, observation)
        observation, _, terminated, truncated, _ = env.step(action)

        if terminated or truncated:
            break

    env.close()

"""
def policy (model, observation):
    s = model.forward(observation)
    action = np.argmax(s)
    return action
"""

def policy(model, observation, epsilon=0.01):
    """
    ε-greedy policy: selects the optimal action with probability (1 - epsilon)
    and a random action with probability epsilon.
    
    Args:
    - model: the model with a forward method to predict action values.
    - observation: the current input (observed state).
    - epsilon: exploration probability (between 0 and 1).

    Returns:
    - action: the selected action.
    """
    # Copied from another group
    s = model.forward(observation) 
    if np.random.rand() < epsilon:  
        action = np.random.randint(len(s))
    else: 
        action = np.argmax(s)
    return action


def select(pop, T):  # tournament selection; returns a copy of the winner to avoid side effects
    # pop is assumed to be already sorted by fitness (best first)
    selected = [random.randint(0, len(pop)-1) for _ in range(T)]
    return [*pop[min(selected)]]

# Returns a tuple: the fitness values sorted best-first, and the population in that same order.
def sort_pop(pop, fit):
    fitness_list = EXECUTOR.map(fit, pop)
    sorted_pop_fitness = sorted(zip(pop, fitness_list), key=lambda x: x[1], reverse=True)
    return [x[1] for x in sorted_pop_fitness], [x[0] for x in sorted_pop_fitness]

def evolve_himmelblau (pop, fit, pmut, pcross=0.7, ngen=100, T=2, trace=0):
    initial_pop = [*pop]
    historical_best = []
    best_fitness = sys.maxsize * -1
    pbar = tqdm(range(ngen), desc="Processing")
    for i in pbar:
        sorted_fitnesses, sorted_pop = sort_pop(initial_pop, fit)
        current_best = sorted_pop[0]
        selected_pop = [select(sorted_pop, T) for _ in range(len(initial_pop))]

        crossed_pop = []
        for j in range(0, len(selected_pop)-1, 2):
            crossed_pop.extend(simulated_binary_crossover(selected_pop[j], selected_pop[j+1], pcross))
        if len(selected_pop) % 2 != 0:
            crossed_pop.append(selected_pop[-1])
        
        mutated_pop = [random_mutation(ind, pmut) for ind in crossed_pop]
        
        if  sorted_fitnesses[0] > best_fitness:
            show(current_best)
            historical_best = current_best
            best_fitness = sorted_fitnesses[0]
            np.save("current_best_chromosome.npy", historical_best)
            np.save("current_best_architecture.npy", architecture)
            # print(f"[{i:>4}] New Best: {best_fitness:>5.2f}")

        initial_pop = mutated_pop
        # if trace and i % trace == 0:
            # print(f"[{i:>4}] Best:     {sorted_fitnesses[0]:>5.2f}")
        
        pbar.set_postfix(current_best=sorted_fitnesses[0], best_fitness=best_fitness)


    initial_pop.insert(0, historical_best)
    return initial_pop
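
Because `select` draws `T` random indices from a population sorted best-first and keeps the smallest, the selection pressure follows directly from `T` and the population size. The sketch below works out the probability that the tournament winner has a given rank and checks it by simulation. The helper names are hypothetical and the code is independent of the cell above.

```python
import random


def rank_probability(k, n, t):
    """P(tournament winner has rank k), with t indices drawn uniformly with
    replacement from a population of n individuals sorted best-first."""
    # P(min >= k) = ((n - k) / n) ** t, so subtract P(min >= k + 1).
    return ((n - k) / n) ** t - ((n - k - 1) / n) ** t


def empirical_rank_frequency(k, n, t, trials=20_000, seed=0):
    """Monte Carlo check using the same min-of-random-indices rule as select."""
    rng = random.Random(seed)
    hits = sum(
        min(rng.randint(0, n - 1) for _ in range(t)) == k for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    n, t = 100, 8  # population size and tournament size used in this notebook
    print(f"exact:     {rank_probability(0, n, t):.4f}")
    print(f"simulated: {empirical_rank_frequency(0, n, t):.4f}")
```

With 100 individuals and `T=8`, the best individual wins a given tournament with probability `1 - 0.99**8`, roughly 7.7%, while with `T=2` it drops to about 2%. Larger `T` therefore concentrates reproduction on the top of the ranking.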


In [7]:
population_size = 100

pop = [MLP(architecture).to_chromosome() for _ in range(population_size)]

pop = evolve_himmelblau(pop, fitness, 0.1, pcross=0.9, ngen=750, T=8, trace=0)

  return 1.0 / (1.0 + np.exp(-neta))
Processing:  99%|█████████▉| 745/750 [14:18<00:03,  1.28it/s, best_fitness=311, current_best=301] 
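
The `return 1.0 / (1.0 + np.exp(-neta))` line echoed in the output above looks like the source line NumPy prints alongside an overflow warning, which would happen when `neta` is a large negative number. That is an interpretation, since the `MLP` module and the warning text itself are not shown here. If so, the result still saturates towards 0, and the warning can be avoided by never exponentiating a large positive value. A minimal scalar sketch (the name `stable_sigmoid` is hypothetical):

```python
import math


def stable_sigmoid(x):
    """Logistic function that never calls exp() on a large positive argument."""
    if x >= 0:
        # -x <= 0, so exp(-x) is in (0, 1] and cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # x < 0, so exp(x) is in (0, 1) and cannot overflow.
    z = math.exp(x)
    return z / (1.0 + z)


if __name__ == "__main__":
    for value in (-1000.0, -1.0, 0.0, 1.0, 1000.0):
        print(value, stable_sigmoid(value))
```

For array inputs, `scipy.special.expit` computes the same function.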

In [None]:
import numpy as np
from real_numbers import policy

# Save the best chromosome (pop[0]) and the architecture to disk
np.save("best_chromosome.npy", pop[0])
np.save("architecture.npy", architecture)

env = gym.make("LunarLander-v3", render_mode="human")

# The weights are fixed, so build the model once instead of on every step.
model = MLP(get_architecture())
model.from_chromosome(pop[0])

observation, _ = env.reset()
iters = 0
while True:
    action = policy(model, observation)
    observation, _, terminated, truncated, _ = env.step(action)

    if any([truncated, terminated]):
        observation, _ = env.reset()
        iters += 1

    if iters == 10:  # stop after 10 episodes
        break

env.close()


### How do we build the fitness function for a genetic algorithm?

 * The MLP module already implements the multilayer perceptron. Build it with MLP(architecture).
 * architecture is a tuple (inputs, layer1, layer2, ...).
 * The fitness function takes the individual's chromosome and converts it into MLP weights with model.from_chromosome(ch).
 * It uses run over N episodes (which gives stability) and computes the mean reward.
 * This mean reward is the individual's fitness.
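
The steps above can be sketched independently of Gymnasium. In the example below, `run_episode` is a hypothetical stand-in for whatever plays one episode with a given chromosome and returns its accumulated reward; `make_fitness` is likewise an illustrative name, not part of the MLP module.

```python
import random
from statistics import mean


def make_fitness(run_episode, n_episodes=3):
    """Build a fitness function that averages run_episode(ch) over several episodes.

    run_episode is a hypothetical callable: it takes a chromosome, plays one
    episode, and returns the accumulated reward.
    """
    def fitness(ch):
        return mean(run_episode(ch) for _ in range(n_episodes))
    return fitness


if __name__ == "__main__":
    rng = random.Random(0)

    def noisy_episode(ch):
        # Toy stand-in for an environment: a fixed score plus episode noise.
        return sum(ch) + rng.gauss(0, 1)

    print(make_fitness(noisy_episode, n_episodes=1)([1.0, 2.0]))
    print(make_fitness(noisy_episode, n_episodes=30)([1.0, 2.0]))
```

Averaging over more episodes reduces the variance of the fitness estimate, at the cost of proportionally more simulation time per individual.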

#### Still want more?

Try controlling Flappy Bird: https://github.com/markub3327/flappy-bird-gymnasium

pip install flappy-bird-gymnasium

import flappy_bird_gymnasium  
env = gym.make("FlappyBird-v0")

State (12 variables):
  * the last pipe's horizontal position
  * the last top pipe's vertical position
  * the last bottom pipe's vertical position
  * the next pipe's horizontal position
  * the next top pipe's vertical position
  * the next bottom pipe's vertical position
  * the next next pipe's horizontal position
  * the next next top pipe's vertical position
  * the next next bottom pipe's vertical position
  * player's vertical position
  * player's vertical velocity
  * player's rotation

  Actions:
  * 0 -> do nothing
  * 1 -> flap
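
To reuse the same genetic approach here, the network would need 12 inputs and 2 outputs instead of LunarLander's 8 and 4. The sketch below builds such an architecture list and shows the argmax rule over two action scores. The helper names and the hidden layer size are illustrative assumptions, not tuned choices.

```python
FLAPPY_STATE_SIZE = 12  # the twelve state variables listed above
FLAPPY_ACTIONS = 2      # 0 = do nothing, 1 = flap


def flappy_architecture(hidden_layers=(6,)):
    """Layer sizes for an MLP mapping the 12-value state to 2 action scores.

    The default hidden layer size is an arbitrary starting point.
    """
    return [FLAPPY_STATE_SIZE, *hidden_layers, FLAPPY_ACTIONS]


def greedy_action(scores):
    """Index of the largest score, i.e. an argmax over the network outputs."""
    return max(range(len(scores)), key=lambda i: scores[i])


if __name__ == "__main__":
    print(flappy_architecture())      # [12, 6, 2]
    print(greedy_action([0.2, 0.7]))  # 1, i.e. flap
```

The resulting list could be passed to `set_architecture` before creating the initial population.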