# Activity - Practical project


> The activity will be carried out in predefined groups of 2-3 students. List the names in alphabetical order (by surname). Remember that this activity counts for 30% of the final grade for the course. The work must be submitted in this notebook.
*   Student 1:
*   Student 2:
*   Student 3:






---
## **PART 1** - Installation and prerequisites

> The exercises have been prepared to run in Google Colab. However, that platform has some incompatibilities when displaying gym rendering, so obtaining these visualizations requires moving the working environment to a local machine. This notebook therefore includes instructions for working in both environments. Follow these steps for a correct setup:
1.   **LOCAL:** Prepare the environment, following the instructions in section *1.1. Prepare the environment*.
2.  **BOTH:** If you are on Colab, set the variables "mount" and "drive_root" to your working folder in Drive, then run the cell in *1.2. Locate the working environment*.
3. **COLAB:** Run the cells that mount the working folder in Drive, found in section *1.3. Mount the local data folder*.
4.  **BOTH:** Install the required libraries, following section *1.4. Install the required libraries*.


---
### 1.1. Prepare the environment (local only)



> To prepare the local working environment, follow these steps:
1. On Windows, you may need to install the C++ Build Tools. To do so, follow the steps at: https://towardsdatascience.com/how-to-install-openai-gym-in-a-windows-environment-338969e24d30.
2. Install Anaconda.
3. Using the commands shown in the next cell: create an environment, change the working directory, and install the basic libraries.


```
conda create --name miar_rl python=3.8
conda activate miar_rl
cd "PATH_TO_FOLDER"
conda install git
pip install jupyter
```


4. Open the notebook with *jupyter-notebook*.



```
jupyter-notebook
```


---
### 1.2. Locate the working environment: Google Colab or local

In [None]:
# WARNING! Change the path to the assignment folder (drive_root) if yours is different
mount = '/content/gdrive'
drive_root = mount + "/My Drive/08_MIAR/actividades/proyecto practico"

try:
  from google.colab import drive
  IN_COLAB = True
except ImportError:
  IN_COLAB = False
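
The import-based check above is one way to detect Colab. A minimal alternative sketch looks the package up in `sys.modules` instead. It assumes the Colab kernel has already loaded `google.colab` at startup, which is an assumption about the runtime and not something this notebook states. The helper name `running_in_colab` is hypothetical.

```python
import sys

def running_in_colab(modules=None):
    """Return True if the google.colab package is already loaded.

    `modules` defaults to sys.modules; it is a parameter only so the
    check can be exercised without a real Colab runtime.
    """
    modules = sys.modules if modules is None else modules
    return "google.colab" in modules

# Assumption: on Colab this is True; in a plain local kernel it is False.
IN_COLAB_ALT = running_in_colab()
print("Running in Colab:", IN_COLAB_ALT)
```

The try/except version remains the safer choice when `drive` is needed right away, since it imports the module it then uses.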

---
### 1.3. Mount the local data folder (Colab only)

In [None]:
# Switch to the directory on the Google Drive that you want to use
import os
if IN_COLAB:
  print("We're running Colab")

  # Mount the Google Drive at mount
  print("Colab: mounting Google drive on ", mount)
  drive.mount(mount)

  # Create drive_root if it doesn't exist
  create_drive_root = True
  if create_drive_root:
    print("\nColab: making sure ", drive_root, " exists.")
    os.makedirs(drive_root, exist_ok=True)

  # Change to the directory
  print("\nColab: Changing directory to ", drive_root)
  %cd $drive_root
# Verify we're in the correct working directory
%pwd
print("Files in the directory: ")
print(os.listdir())

---
### 1.4. Install the required libraries

In [None]:
if IN_COLAB:
  %pip install gym==0.17.3
  %pip install git+https://github.com/Kojoley/atari-py.git
  %pip install keras-rl2==1.0.5
  %pip install tensorflow==2.8
else:
  %pip install gym==0.17.3
  %pip install git+https://github.com/Kojoley/atari-py.git
  %pip install pyglet==1.5.0
  %pip install h5py==3.1.0
  %pip install Pillow==9.5.0
  %pip install keras-rl2==1.0.5
  %pip install Keras==2.2.4
  %pip install tensorflow==2.5.3
  %pip install torch==2.0.1
  %pip install agents==1.4.0

---
## **PART 2**. Assignment statement

Points to keep in mind:

- The environment we will work on is _SpaceInvaders-v0_ and the algorithm we will use is _DQN_.

- For this exercise, the minimum requirement is met when the agent achieves a **mean reward above 20 points in test mode**. This mean reward is computed by the test code in the last cell of the notebook.
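
The pass criterion reduces to a simple check on the per-episode test rewards. The following is a minimal sketch of that check; the helper `meets_minimum` and the reward values are hypothetical, and the real rewards come from the test code at the end of the notebook.

```python
from statistics import mean

MIN_MEAN_REWARD = 20.0  # threshold stated in the assignment

def meets_minimum(test_rewards, threshold=MIN_MEAN_REWARD):
    """Return True when the mean test reward is strictly above the threshold."""
    return mean(test_rewards) > threshold

# Hypothetical per-episode rewards from a test run (made-up numbers)
example_rewards = [18.0, 25.0, 22.0, 19.0, 26.0]
print("mean reward:", mean(example_rewards))
print("requirement met:", meets_minimum(example_rewards))
```

Individual episodes may fall below 20 as long as the mean stays above it, so it is worth reporting the spread alongside the mean.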

This practical project consists of three parts:

1.   Implement the neural network used in the solution
2.   Implement the different pieces of the DQN solution
3.   Justify your answers in relation to the results obtained

**Rubric**: Originality of the proposed solution will be valued, as will the ability to discuss the results in detail. Meeting the minimum requirement is enough to pass the activity, provided that the discussion of the results is adequate.

IMPORTANT:

* If you do not reach an optimal score, base your answers on the best score obtained.
* For long training runs, remember that you can use checkpoints of your models to resume training. In that case, remember to adjust the parameters accordingly (especially those related to the exploration process).
* Submit only the notebook and the weights of the best model, organized in a .zip file.
* Each student must upload the solution individually.
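
To illustrate why the exploration parameters matter when resuming from a checkpoint, here is a minimal sketch of a linear epsilon schedule. The function `linear_epsilon` is hypothetical. Its defaults (start at 1.0, end at 0.05, annealed over the first 10% of training) are assumptions loosely modelled on common DQN settings, not values prescribed by this assignment.

```python
def linear_epsilon(step, total_timesteps, initial_eps=1.0,
                   final_eps=0.05, exploration_fraction=0.1):
    """Anneal epsilon linearly from initial_eps to final_eps over the first
    `exploration_fraction` of training, then hold it at final_eps."""
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / exploration_fraction

# Epsilon at a few points of a 100,000-step run
for step in (0, 5_000, 10_000, 50_000):
    print(step, round(linear_epsilon(step, 100_000), 3))

# When resuming from a checkpoint saved at step 5,000, starting again from
# epsilon = 1.0 would throw away most of the learned behaviour. One option
# is to restart the schedule from the value already reached.
resume_eps = linear_epsilon(5_000, 100_000)
print("epsilon to resume from:", round(resume_eps, 3))
```

If the checkpoint was saved after exploration had finished, the resumed run would simply start at the final epsilon.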

---
## **PART 3**. Development and questions

#### Import libraries

In [11]:
from PIL import Image
import numpy as np
import mlflow
from stable_baselines3 import DQN, PPO, A2C
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
import torch.nn as nn
import torch
import os
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack, SubprocVecEnv
from stable_baselines3.common.callbacks import ProgressBarCallback, BaseCallback
from stable_baselines3.common.monitor import Monitor

import gymnasium as gym
import ale_py


from torchvision.models import mobilenet_v2

from tqdm.notebook import tqdm
import torchvision
from torchvision import transforms
import torch.nn.functional as F
from stable_baselines3.common.atari_wrappers import AtariWrapper
from gymnasium.wrappers import AtariPreprocessing

from sb3_contrib import QRDQN


#### Base configuration

In [17]:
INPUT_SHAPE = (84, 84)
WINDOW_LENGTH = 4
gym.register_envs(ale_py)
class CustomPenaltyWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.last_action = None
        self.same_action_count = 0
        self.last_lives = None
        self.no_shoot_count = 0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.last_action = None
        self.same_action_count = 0
        self.no_shoot_count = 0

        if "lives" in info:
            self.last_lives = info["lives"]
        else:
            self.last_lives = 3

        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)

        # --- Penalty for repeated no-op
        if action == 0:
            if self.last_action == 0:
                self.same_action_count += 1
            else:
                self.same_action_count = 1
        else:
            self.same_action_count = 0
        self.last_action = action

        if self.same_action_count >= 3:
            reward -= 1.0

        # --- Penalty for getting hit
        if "lives" in info and self.last_lives is not None:
            if info["lives"] < self.last_lives:
                reward -= 0.5
            self.last_lives = info["lives"]

        # --- Penalty for not shooting for too long
        if int(action) in [1, 4, 5]:  # shooting actions
            self.no_shoot_count = 0
        else:
            self.no_shoot_count += 1
            if self.no_shoot_count >= 10:
                reward -= 0.7

        reward = np.clip(reward, -1.0, 1.0)

        return obs, reward, terminated, truncated, info


class NormalizeInput(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        self.observation_space = gym.spaces.Box(
            low=0.0, high=1.0,
            shape=self.observation_space.shape,
            dtype=np.float32
        )

    def observation(self, observation):
        return observation.astype(np.float32) / 255.0

env_name = 'SpaceInvadersNoFrameskip-v4'
env = gym.make(env_name)
# In this case AtariWrapper already clips rewards (as ClipReward would) and also adds the image preprocessing
# Images are normalized to the 0-1 range further down
# env = NormalizeInput(env)
env = AtariWrapper(env, frame_skip=4)
obs, _ = env.reset()
print("Initial observation:", obs.shape)

normal_env = NormalizeInput(env)
pen_env = CustomPenaltyWrapper(normal_env)  # Add the penalty wrapper

# parallel_env = SubprocVecEnv([normal_env for _ in range(8)])

# Training env, with penalties
# (separate names avoid the lambda capturing a variable that is rebound later)
train_env = Monitor(pen_env)
env = DummyVecEnv([lambda: train_env])  # Convert to a vectorized environment
# Stack the last 4 frames on the vectorized environment
env = VecFrameStack(env, 4)

# Test env, without penalties
# Note: it wraps the same underlying emulator instance as the training env
test_env = Monitor(normal_env)
normal_env = DummyVecEnv([lambda: test_env])  # Convert to a vectorized environment
# Stack the last 4 frames on the vectorized environment
normal_env = VecFrameStack(normal_env, 4)

np.random.seed(123)
obs = env.reset()
nb_actions = env.action_space.n
# print(env.shape)
print(obs.shape)
print(nb_actions)
print("max along height", max(obs[0, :, 0, :].flatten()))
print("max along width", max(obs[0, 0:, :].flatten()))

Initial observation: (84, 84, 1)
(1, 84, 84, 4)
6
max along height 0.30980393
max along width 0.52156866


In [81]:
obs, info = pen_env.reset()

In [67]:
obs, info = pen_env.reset()
done = False
total_reward = 0

while not done:
    action = 0
    obs, reward, terminated, truncated, info = pen_env.step(action)
    total_reward += reward
    if truncated or terminated:
        done = True

    print("Reward:", reward)
print(f"Total reward: {total_reward}")

Reward: 0.0
Reward: 0.0
Reward: -1.0
Reward: -1.0
[... 73 more `Reward: -1.0` lines; the original output is truncated at this point]
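
The pattern in this output (two zeros, then a constant -1.0) follows from the penalty counters in `CustomPenaltyWrapper`. The sketch below replays only that bookkeeping in plain Python. The helper `shaped_rewards` is hypothetical; it assumes a base reward of zero unless one is given and leaves out the life-loss penalty, which needs the emulator's `info` dictionary.

```python
def shaped_rewards(actions, base_rewards=None):
    """Replay the wrapper's penalty bookkeeping for a fixed action sequence."""
    if base_rewards is None:
        base_rewards = [0.0] * len(actions)
    last_action = None
    same_action_count = 0
    no_shoot_count = 0
    shaped = []
    for action, reward in zip(actions, base_rewards):
        # Repeated no-op penalty: action 0 three or more times in a row
        if action == 0:
            same_action_count = same_action_count + 1 if last_action == 0 else 1
        else:
            same_action_count = 0
        last_action = action
        if same_action_count >= 3:
            reward -= 1.0
        # Penalty for ten or more consecutive steps without a firing action
        if action in (1, 4, 5):
            no_shoot_count = 0
        else:
            no_shoot_count += 1
            if no_shoot_count >= 10:
                reward -= 0.7
        # Final clipping to [-1, 1], as in the wrapper
        shaped.append(max(-1.0, min(1.0, reward)))
    return shaped

noop_rewards = shaped_rewards([0] * 12)   # always NOOP
left_rewards = shaped_rewards([3] * 12)   # always LEFT, never firing
print(noop_rewards)
print(left_rewards)
```

From the tenth step on, the no-op run accumulates both penalties (-1.7), but the clipping hides the difference, so the agent cannot tell one penalty from two.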

1. Neural network implementation

In [7]:
class MobileNetFeatureExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space, features_dim=256):
        super().__init__(observation_space, features_dim)

        # Pretrained MobileNetV2 (the classifier head is kept, so the backbone outputs 1000 ImageNet logits)
        weights = torchvision.models.MobileNet_V2_Weights.DEFAULT
        self.backbone = torchvision.models.mobilenet_v2(weights=weights)

        # Freeze weights (optional)
        for param in self.backbone.parameters():
            param.requires_grad = False

        # Compute shape by doing one forward pass
        with torch.no_grad():
            sample = torch.as_tensor(observation_space.sample()[None]).float()
            if sample.shape[1] != 3:  # Convert grayscale to 3 channels
                sample = sample.repeat(1, 3, 1, 1)
            n_flatten = self.backbone(sample).view(sample.shape[0], -1).shape[1]

        self.projector = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flatten, n_flatten // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(n_flatten // 2, n_flatten // 3),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(n_flatten // 3, features_dim),
            nn.ReLU()
        )

    def forward(self, obs):
        # Convert 1-channel grayscale to 3 channels if needed
        if obs.shape[1] == 1:
            obs = obs.repeat(1, 3, 1, 1)
        features = self.backbone(obs)
        return self.projector(features)

### Converting the images to 3 channels

In [5]:
class VitB16FeaturesExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space, features_dim=256):
        super().__init__(observation_space, features_dim)

        # Load a pretrained ViT model
        weights = torchvision.models.ViT_B_16_Weights.DEFAULT
        self.backbone = torchvision.models.vit_b_16(weights=weights).cuda()

        # Freeze weights (optional)
        for param in self.backbone.parameters():
            param.requires_grad = False

        # Compute shape by doing one forward pass
        with torch.no_grad():
            sample = torch.as_tensor(observation_space.sample()[None]).float()
            if sample.shape[1] != 3:  # Convert grayscale to 3 channels
                sample = sample.repeat(1, 3, 1, 1)
            sample = self._preprocess(sample).cuda()
            n_flatten = self.backbone(sample).view(sample.shape[0], -1).shape[1]

        self.projector = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flatten, features_dim),
            nn.ReLU()
        )

    def _preprocess(self, observation):
        # Preprocess the observation to match the input requirements of ViT
        batch_resize = F.interpolate(
            observation, size=(224, 224), mode='bilinear', align_corners=False
        )
        return batch_resize


    def forward(self, obs):
        # Convert 1-channel grayscale to 3 channels if needed
        obs = self._preprocess(obs).cuda()
        if obs.shape[1] == 1:
            obs = obs.repeat(1, 3, 1, 1)
        features = self.backbone(obs)
        return self.projector(features)

### Grayscale

In [23]:
class VitB16FeaturesExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space, features_dim=256):
        super().__init__(observation_space, features_dim)

        self.chanel_mapper = nn.Conv2d(
            in_channels=1, out_channels=3, kernel_size=1, stride=1, padding=0
        )
        self.uppscaler = nn.Upsample(size=(224, 224), mode='bilinear', align_corners=False)
        # Load a pretrained ViT model
        weights = torchvision.models.ViT_B_16_Weights.DEFAULT
        self.backbone = torchvision.models.vit_b_16(weights=weights).cuda()

        # Freeze weights (optional)
        for param in self.backbone.parameters():
            param.requires_grad = False

        # Compute shape by doing one forward pass
        with torch.no_grad():
            sample = torch.as_tensor(observation_space.sample()[None]).float()
            if sample.shape[1] != 3:  # Convert grayscale to 3 channels
                sample = sample.repeat(1, 3, 1, 1)
            sample = self._preprocess(sample).cuda()
            n_flatten = self.backbone(sample).view(sample.shape[0], -1).shape[1]

        self.projector = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flatten, features_dim),
            nn.ReLU()
        )

    def _preprocess(self, observation):
        # Preprocess the observation to match the input requirements of ViT
        batch_resize = F.interpolate(
            observation, size=(224, 224), mode='bilinear', align_corners=False
        )
        return batch_resize


    def forward(self, obs):
        # Convert 1-channel grayscale to 3 channels if needed
        # obs = self._preprocess(obs).cuda()
        obs = self.chanel_mapper(obs).cuda()
        obs = self.uppscaler(obs).cuda()
        # if obs.shape[1] == 1:
        #     obs = obs.repeat(1, 3, 1, 1)
        features = self.backbone(obs)
        return self.projector(features)


In [8]:
class ResNet152FeaturesExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space, features_dim=256):
        super().__init__(observation_space, features_dim)

        # Load a pretrained ResNet152 model
        weights = torchvision.models.ResNet152_Weights.DEFAULT
        self.backbone = torchvision.models.resnet152(weights=weights)

        # Freeze weights (optional)
        for param in self.backbone.parameters():
            param.requires_grad = False

        # Compute shape by doing one forward pass
        with torch.no_grad():
            sample = torch.as_tensor(observation_space.sample()[None]).float()
            if sample.shape[1] != 3:  # Convert grayscale to 3 channels
                sample = sample.repeat(1, 3, 1, 1)
            n_flatten = self.backbone(sample).view(sample.shape[0], -1).shape[1]

        self.projector = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flatten, features_dim),
            nn.ReLU()
        )

    def forward(self, obs):
        # Convert 1-channel grayscale to 3 channels if needed
        if obs.shape[1] == 1:
            obs = obs.repeat(1, 3, 1, 1)
        features = self.backbone(obs)
        return self.projector(features)

### CNNConnectedDeep

- Initial experiment with rescaling and grayscale conversion.

In [24]:
class CNNConnectedDeep(BaseFeaturesExtractor):
    def __init__(self, observation_space, features_dim=256):
        super().__init__(observation_space, features_dim)

        # First layers
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)

        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(32)

        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(64)

        self.pool = nn.MaxPool2d(2, 2)
        self.dropout = nn.Dropout(0.4)

        # Concatenation of conv1 and conv3
        # 32 (resized conv1) + 64 = 96
        # Additional compression blocks with skip connections
        self.conv4 = nn.Conv2d(96, 128, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(128)

        self.conv5 = nn.Conv2d(128, 160, kernel_size=3, padding=1)
        self.bn5 = nn.BatchNorm2d(160)

        self.conv6 = nn.Conv2d(160, 192, kernel_size=3, padding=1)
        self.bn6 = nn.BatchNorm2d(192)

        self.conv7 = nn.Conv2d(192 + 128, 224, kernel_size=3, padding=1)  # concatenated with out4
        self.bn7 = nn.BatchNorm2d(224)

        self.conv8 = nn.Conv2d(224 + 160, 256, kernel_size=3, padding=1)  # concatenated with out5
        self.bn8 = nn.BatchNorm2d(256)

        self.global_pool = nn.AdaptiveAvgPool2d((4, 4))  # reduces to [B, 256, 4, 4]

        # Flatten: 256 * 4 * 4 = 4096 → too large → reduce it
        self.fc1 = nn.Linear(256 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, features_dim)

    def _preprocess(self, observation):
        r, g, b = observation[:, 0:1, :, :], observation[:, 1:2, :, :], observation[:, 2:3, :, :]
        gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
        gray3 = gray.repeat(1, 3, 1, 1)  # Convert to 3 channels

        batch_resize = F.interpolate(
            gray3, size=(224, 224), mode='bilinear', align_corners=False
        )
        return batch_resize

    def forward(self, x):
        x = self._preprocess(x)  # Convert grayscale to 3 channels and resize to 224x224
        out1 = F.relu(self.bn1(self.conv1(x)))  # [B, 32, 224, 224]
        out2 = F.relu(self.bn2(self.conv2(out1)))
        out2 = out1 + out2  # Residual connection
        out2 = self.pool(out2)  # [B, 32, 112, 112]

        out3 = F.relu(self.bn3(self.conv3(out2)))
        out3 = self.pool(out3)  # [B, 64, 56, 56]

        out1_resized = F.interpolate(out1, size=out3.shape[2:])
        concat1 = torch.cat((out3, out1_resized), dim=1)  # [B, 96, 56, 56]

        # Block 4
        out4 = F.relu(self.bn4(self.conv4(concat1)))
        out4 = self.pool(out4)  # [B, 128, 28, 28]

        # Block 5
        out5 = F.relu(self.bn5(self.conv5(out4)))
        out5 = self.pool(out5)  # [B, 160, 14, 14]

        # Block 6
        out6 = F.relu(self.bn6(self.conv6(out5)))
        out6 = self.pool(out6)  # [B, 192, 7, 7]

        # Concatenate out4 (resized) with out6
        out4_resized = F.interpolate(out4, size=out6.shape[2:])
        concat2 = torch.cat((out6, out4_resized), dim=1)  # [B, 192+128=320, 7, 7]
        out7 = F.relu(self.bn7(self.conv7(concat2)))

        # Concatenate out5 (resized) with out7
        out5_resized = F.interpolate(out5, size=out7.shape[2:])
        concat3 = torch.cat((out7, out5_resized), dim=1)  # [B, 224+160=384, 7, 7]
        out8 = F.relu(self.bn8(self.conv8(concat3)))

        x = self.global_pool(out8)  # [B, 256, 4, 4]
        x = x.view(x.size(0), -1)   # Flatten → [B, 4096]

        x = self.dropout(F.relu(self.fc1(x)))  # 4096 → 512
        out = self.fc2(x)  # 512 → features_dim

        return out


- Adaptation to the new 84×84 input size

In [15]:
class CNNConnectedDeep(BaseFeaturesExtractor):
    def __init__(self, observation_space, features_dim=256):
        super().__init__(observation_space, features_dim)

        # First layers
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)

        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(32)

        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(64)

        self.pool = nn.MaxPool2d(2, 2)
        self.dropout = nn.Dropout(0.4)

        # Concatenation of conv1 and conv3
        # 32 (resized conv1) + 64 = 96
        # Additional compression blocks with skip connections
        self.conv4 = nn.Conv2d(96, 128, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(128)

        self.conv5 = nn.Conv2d(128, 160, kernel_size=3, padding=1)
        self.bn5 = nn.BatchNorm2d(160)

        self.conv6 = nn.Conv2d(160, 192, kernel_size=3, padding=1)
        self.bn6 = nn.BatchNorm2d(192)

        self.conv7 = nn.Conv2d(192 + 128, 224, kernel_size=3, padding=1)  # concatenated with out4
        self.bn7 = nn.BatchNorm2d(224)

        self.conv8 = nn.Conv2d(224 + 160, 256, kernel_size=3, padding=1)  # concatenated with out5
        self.bn8 = nn.BatchNorm2d(256)

        self.global_pool = nn.AdaptiveAvgPool2d((4, 4))  # output is [B, 256, 4, 4]

        # Flatten: 256 * 4 * 4 = 4096 → too large → reduce it
        self.fc1 = nn.Linear(256 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, features_dim)

    def forward(self, x):
        # Shape comments assume a single-channel 84x84 input: [B, 1, 84, 84]
        out1 = F.relu(self.bn1(self.conv1(x)))  # [B, 32, 84, 84]
        out2 = F.relu(self.bn2(self.conv2(out1)))
        out2 = out1 + out2  # Residual connection
        out2 = self.pool(out2)  # [B, 32, 42, 42]

        out3 = F.relu(self.bn3(self.conv3(out2)))
        out3 = self.pool(out3)  # [B, 64, 21, 21]

        out1_resized = F.interpolate(out1, size=out3.shape[2:])
        concat1 = torch.cat((out3, out1_resized), dim=1)  # [B, 96, 21, 21]

        # Block 4
        out4 = F.relu(self.bn4(self.conv4(concat1)))
        out4 = self.pool(out4)  # [B, 128, 10, 10]

        # Block 5
        out5 = F.relu(self.bn5(self.conv5(out4)))
        out5 = self.pool(out5)  # [B, 160, 5, 5]

        # Block 6
        out6 = F.relu(self.bn6(self.conv6(out5)))
        out6 = self.pool(out6)  # [B, 192, 2, 2]

        # Concatenate out4 (resized) with out6
        out4_resized = F.interpolate(out4, size=out6.shape[2:])
        concat2 = torch.cat((out6, out4_resized), dim=1)  # [B, 192+128=320, 2, 2]
        out7 = F.relu(self.bn7(self.conv7(concat2)))

        # Concatenate out5 (resized) with out7
        out5_resized = F.interpolate(out5, size=out7.shape[2:])
        concat3 = torch.cat((out7, out5_resized), dim=1)  # [B, 224+160=384, 2, 2]
        out8 = F.relu(self.bn8(self.conv8(concat3)))

        x = self.global_pool(out8)  # [B, 256, 4, 4] (adaptive pooling expands the 2x2 map)
        x = x.view(x.size(0), -1)   # Flatten → [B, 4096]

        x = self.dropout(F.relu(self.fc1(x)))  # 4096 → 512
        out = self.fc2(x)  # 512 → features_dim

        return out
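
To trace how the spatial size shrinks through the five pooling stages of `CNNConnectedDeep`, a small sketch is enough. It relies on two facts from the code: the 3×3 convolutions use `padding=1`, so they keep the size, and each `MaxPool2d(2, 2)` halves it, rounding down. The helper name `pooled_sizes` is hypothetical.

```python
def pooled_sizes(input_size, n_pools=5):
    """Spatial size after each 2x2 max-pool with stride 2 (floor division)."""
    sizes = [input_size]
    for _ in range(n_pools):
        sizes.append(sizes[-1] // 2)
    return sizes

sizes_224 = pooled_sizes(224)  # input used by the first version (after resizing)
sizes_84 = pooled_sizes(84)    # input used by the adapted version
print(sizes_224)
print(sizes_84)
```

With an 84×84 input the last feature map is only 2×2, so the `AdaptiveAvgPool2d((4, 4))` layer receives far less spatial information than in the 224×224 version.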


### Dueling

In [25]:
from stable_baselines3.dqn.policies import DQNPolicy
class DuelingCnnExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space, features_dim=512):
        super().__init__(observation_space, features_dim)
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten()
        )
        with torch.no_grad():
            n_flatten = self.cnn(torch.zeros(1, *observation_space.shape)).shape[1]

        self.linear = nn.Sequential(
            nn.Linear(n_flatten, features_dim),
            nn.ReLU()
        )

    def forward(self, obs):
        return self.linear(self.cnn(obs))

class DuelingDQNPolicy(DQNPolicy):
    def __init__(self, *args, **kwargs):
        super().__init__(
            *args,
            features_extractor_class=DuelingCnnExtractor,
            **kwargs
        )
        # Rebuild Q network with dueling architecture
        features_dim = self.q_net.q_net[0].in_features
        action_dim = self.action_space.n

        self.q_net = nn.Sequential(
            nn.Linear(features_dim, 512),
            nn.ReLU(),
        )

        self.q_net_adv = nn.Linear(512, action_dim)
        self.q_net_val = nn.Linear(512, 1)

    def forward(self, obs, deterministic=False):
        features = self.extract_features(obs, features_extractor=DuelingCnnExtractor)
        x = self.q_net(features)
        adv = self.q_net_adv(x)
        val = self.q_net_val(x)
        q_values = val + adv - adv.mean(dim=1, keepdim=True)
        return q_values
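
The last line of `forward` combines the two heads as Q(s, a) = V(s) + A(s, a) − mean over a′ of A(s, a′). A worked example in plain Python, with made-up numbers and a hypothetical helper `dueling_q_values`, shows what that aggregation does for a single state.

```python
def dueling_q_values(value, advantages):
    """Combine a state value and per-action advantages into Q-values,
    subtracting the mean advantage as in the dueling aggregation."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]

# Made-up numbers for one state with three actions
state_value = 2.0
advantages = [3.0, 1.0, 2.0]
q_values = dueling_q_values(state_value, advantages)
print(q_values)

# Subtracting the mean makes the Q-values average back to the state value
print(sum(q_values) / len(q_values))
```

Subtracting the mean advantage keeps the decomposition identifiable: adding a constant to every advantage leaves the Q-values unchanged.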

In [3]:
class DeepMindCNN(BaseFeaturesExtractor):
    """
    DeepMind-style CNN used in the original DQN paper (Mnih et al., 2015).
    Input shape: (n_stack, 84, 84) → (4, 84, 84)
    """

    def __init__(self, observation_space, features_dim=512):
        # features_dim is the output of the last linear layer (fc1)
        super().__init__(observation_space, features_dim)

        # Check input shape
        n_input_channels = observation_space.shape[2]  # e.g., 4 stacked grayscale frames

        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4),  # (32, 20, 20)
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),                 # (64, 9, 9)
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),                 # (64, 7, 7)
            nn.ReLU(),
            nn.Flatten()
        )

        with torch.no_grad():
            sample_input = torch.as_tensor(observation_space.sample()[None]).float()
            sample_input = self._preprocess(sample_input)  # Preprocess the input
            n_flatten = self.cnn(sample_input).shape[1]

        self.linear = nn.Sequential(
            nn.Linear(n_flatten , n_flatten // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(n_flatten // 2, n_flatten // 4),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(n_flatten // 4, features_dim),
            nn.ReLU()
        )

        self._features_dim = features_dim

    def _preprocess(self, x):
        x = x.permute(0, 3, 1, 2)
        return x

    def forward(self, observations: torch.Tensor) -> torch.Tensor:
        x = self._preprocess(observations)
        x = self.cnn(x)
        return self.linear(x)
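
The shape comments in `DeepMindCNN` can be checked with the standard output-size formula for a convolution without padding, `(size - kernel) // stride + 1`. The sketch below applies it to an 84×84 input; the helper `conv_out` is hypothetical, while the layer parameters are taken from the class.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel_size, stride) of the three convolutional layers
layers = [(8, 4), (4, 2), (3, 1)]

size = 84
sizes = []
for kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

n_flatten = 64 * size * size  # 64 output channels in the last conv layer
hidden_widths = [n_flatten // 2, n_flatten // 4]
print(sizes, n_flatten, hidden_widths)
```

The projector therefore narrows from 3136 to 1568, then 784, before reaching `features_dim`.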

2. DQN solution implementation

**Note**: The first experiments did not use AtariWrapper; the preprocessing was done inside the network layers instead.

In [4]:
class TQDMProgressCallback(BaseCallback):
    def __init__(self, total_timesteps: int, verbose=0):
        super().__init__(verbose)
        self.total_timesteps = total_timesteps
        self.progress_bar = None
        self.last_timesteps = 0

    def _on_training_start(self):
        self.progress_bar = tqdm(total=self.total_timesteps, desc="Training Progress", unit="step")

    def _on_step(self):
        steps_since_last = self.num_timesteps - self.last_timesteps
        self.progress_bar.update(steps_since_last)
        self.last_timesteps = self.num_timesteps  # stay in sync even if several steps are taken per call

        # Optional: log latest reward if available
        infos = self.locals.get("infos", [])
        if infos and isinstance(infos[0], dict) and "episode" in infos[0]:
            self.progress_bar.set_postfix(reward=infos[0]["episode"]["r"])
        return True  # Return True to continue training

    def _on_training_end(self):
        self.progress_bar.close()

In [5]:

class MLflowCallback(BaseCallback):
    def __init__(self, best_model_path, experiment_name="SB3_Experiment", run_name=None, log_freq=1000, verbose=0):
        super().__init__(verbose)
        self.experiment_name = experiment_name
        self.log_freq = log_freq
        self.step_count = 0
        self.best_mean_reward = -np.inf
        self.best_model_path = best_model_path

    def _on_step(self) -> bool:
        self.step_count += 1
        if self.step_count % self.log_freq == 0:
            rewards = [ep_info['r'] for ep_info in self.model.ep_info_buffer] if self.model.ep_info_buffer else []
            lengths = [ep_info['l'] for ep_info in self.model.ep_info_buffer] if self.model.ep_info_buffer else []

            mean_reward = np.mean(rewards) if rewards else 0.0
            max_reward = np.max(rewards) if rewards else 0.0
            min_reward = np.min(rewards) if rewards else 0.0
            mean_length = np.mean(lengths) if lengths else 0.0
            std_reward = np.std(rewards) if rewards else 0.0

            step = self.num_timesteps
            mlflow.log_metric("timesteps", step, step=step)
            mlflow.log_metric("episode_reward_mean", mean_reward, step=step)
            mlflow.log_metric("episode_reward_max", max_reward, step=step)
            mlflow.log_metric("episode_reward_min", min_reward, step=step)
            mlflow.log_metric("episode_length_mean", mean_length, step=step)
            mlflow.log_metric("episode_reward_std", std_reward, step=step)

            if mean_reward > self.best_mean_reward:
                self.best_mean_reward = mean_reward
                # Save the best model
                self.model.save(self.best_model_path)
        return True

    def _on_training_end(self):
        # Optionally save the model as artifact
        mlflow.log_param("num_episodes", len(self.model.ep_info_buffer))
        mlflow.end_run()
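
The save logic in `MLflowCallback` comes down to comparing a windowed mean reward against the best one seen so far. The sketch below isolates that idea without MLflow or Stable-Baselines3. The class `BestMeanTracker` is hypothetical, and the bounded window is an assumption about how the episode buffer behaves.

```python
from collections import deque
from statistics import mean

class BestMeanTracker:
    """Track the mean of the most recent episode rewards and flag new bests."""

    def __init__(self, window=100):
        self.rewards = deque(maxlen=window)  # oldest episodes drop out
        self.best_mean = float("-inf")

    def update(self, episode_reward):
        """Add one episode reward; return True if the windowed mean improved."""
        self.rewards.append(episode_reward)
        current = mean(self.rewards)
        improved = current > self.best_mean
        if improved:
            self.best_mean = current  # this is where a model would be saved
        return improved

tracker = BestMeanTracker(window=3)
flags = [tracker.update(r) for r in [1.0, 3.0, 2.0, 8.0]]
print(flags, tracker.best_mean)
```

Because the mean is taken over training episodes, it includes exploration noise and any reward shaping, so the "best" checkpoint by this measure is not necessarily the best one in test mode.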

In [9]:
from collections import Counter

class TestCallBack(BaseCallback):
    def __init__(self, env, n_episodes=100, verbose=0, test_timesteps=10000):
        super().__init__(verbose)
        self.env = env
        self.n_episodes = n_episodes
        self.rewards = []
        self.test_timesteps = test_timesteps
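    def _on_step(self) -> bool:
        # Run an evaluation every `test_timesteps` training steps
        if self.num_timesteps % self.test_timesteps == 0:
            action_counter = Counter()
            episode_rewards = []  # rewards of this evaluation round only
            for _ in range(self.n_episodes):
                ep_reward = 0.0
                obs = self.env.reset()
                done = False
                while not done:
                    with torch.no_grad():
                        action, _ = self.model.predict(obs)
                    # Assumes a vectorized env with a single sub-environment
                    obs, reward, dones, _ = self.env.step(action)
                    done = bool(dones[0])
                    action_counter[int(action[0])] += 1
                    ep_reward += float(reward[0])
                episode_rewards.append(ep_reward)
            self.rewards.extend(episode_rewards)  # keep the full history
            # Report statistics for the current model, not the cumulative history
            mean_reward = np.mean(episode_rewards)
            std_reward = np.std(episode_rewards)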
            mlflow.log_metric("test_reward", mean_reward, step=self.num_timesteps)
            mlflow.log_metric("test_reward_std", std_reward, step=self.num_timesteps)
            total_actions = sum(action_counter.values())
            for action, count in action_counter.items():
                mlflow.log_metric(f"action_{action}_count", count, step=self.num_timesteps)
                mlflow.log_metric(f"action_{action}_percentage", count / total_actions, step=self.num_timesteps)
        return True
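
The callback above appends every test episode to `self.rewards`, so the logged `test_reward` is a running average over all evaluations so far. If a per-evaluation figure is preferred, the statistics can be computed from the current batch only. A minimal sketch using only the standard library; `summarize_evaluation` is a hypothetical helper, not part of the notebook or of Stable-Baselines3:

```python
from collections import Counter
from statistics import mean, pstdev


def summarize_evaluation(episode_rewards, actions):
    """Summarize ONE evaluation round (hypothetical helper).

    episode_rewards: list of total rewards, one per test episode.
    actions: flat list of the integer actions taken during those episodes.
    """
    counts = Counter(actions)
    total = sum(counts.values())
    return {
        "reward_mean": mean(episode_rewards),
        # Population std, matching np.std's default (ddof=0).
        "reward_std": pstdev(episode_rewards),
        "action_share": {a: c / total for a, c in sorted(counts.items())},
    }


# Illustrative numbers only, not results from the notebook.
summary = summarize_evaluation([10.0, 20.0, 30.0], [0, 1, 1, 3])
print(summary)
```

Each entry of the returned dictionary could then be passed to `mlflow.log_metric` with `step=self.num_timesteps`, as the callback already does.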


## ResNet152

In [11]:
policy_kwargs = dict(
    features_extractor_class=ResNet152FeaturesExtractor,
    features_extractor_kwargs=dict(features_dim=256),
)
total_timesteps = 100000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    "models/dqn_resnet152_weights.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=500
)
experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="DQN_Run_ResNet152_finetuned"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("buffer_size", 100000)

    model = DQN("CnnPolicy", env, verbose=1, learning_rate=1e-4, buffer_size=100000, policy_kwargs=policy_kwargs)
    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(env)])


Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.


Training Progress:   0%|          | 0/100000 [00:00<?, ?step/s]

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 762      |
|    ep_rew_mean      | 10       |
|    exploration_rate | 0.711    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 22       |
|    time_elapsed     | 136      |
|    total_timesteps  | 3046     |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.0128   |
|    n_updates        | 736      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 907      |
|    ep_rew_mean      | 13.5     |
|    exploration_rate | 0.311    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 28       |
|    time_elapsed     | 253      |
|    total_timesteps  | 7254     |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.009    |
|    n_updates      
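
The `exploration_rate` values in the log above are consistent with a linear epsilon decay. A sketch of that schedule, assuming the Stable-Baselines3 DQN defaults (`exploration_initial_eps=1.0`, `exploration_final_eps=0.05`, `exploration_fraction=0.1`), since the cell does not override them; `linear_epsilon` is a name chosen here for illustration:

```python
def linear_epsilon(step, total_timesteps, initial_eps=1.0, final_eps=0.05,
                   exploration_fraction=0.1):
    """Epsilon after `step` timesteps under a linear decay.

    Epsilon falls from initial_eps to final_eps over the first
    `exploration_fraction` of training and then stays at final_eps.
    The defaults are assumed, not read from the notebook.
    """
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        return final_eps
    return initial_eps + (final_eps - initial_eps) * progress / exploration_fraction


# Timesteps taken from the two log tables above (total_timesteps = 100000).
for step in (3046, 7254):
    print(step, round(linear_epsilon(step, 100_000), 3))
```

Under these assumptions the two timesteps give values that round to the logged 0.711 and 0.311, and epsilon stays at 0.05 for the remaining 90% of the run.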

### MobileNetV2

In [None]:
policy_kwargs = dict(
    features_extractor_class=MobileNetFeatureExtractor,
    features_extractor_kwargs=dict(features_dim=256),
)
total_timesteps = 2_000_000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    "models/dqn_mobilenet_v2_weights_new_data.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=10_000
)
experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="DQN_Run_MobileNetv2_finetuned_new_data"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("buffer_size", 500_000)

    model = DQN("CnnPolicy", env, verbose=0, learning_rate=1e-4, buffer_size=500_000, policy_kwargs=policy_kwargs)
    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(env, test_timesteps=100_000)])

### ViT B-16

#### Load pretrained weights

In [None]:
total_timesteps = 100000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/dqn_vit_b_16_weights.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=500
)
experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="DQN_Run_Vit_b_16_finetuned"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("buffer_size", 100000)
    model = DQN.load("models/dqn_vit_b_16_weights.zip", env=env)
    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback])


In [None]:
policy_kwargs = dict(
    features_extractor_class=VitB16FeaturesExtractor,
    features_extractor_kwargs=dict(features_dim=256),
)
total_timesteps = 100000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/dqn_vit_b_16_weights_upscaled.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=500
)
experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="DQN_Run_Vit_b_16_gray_upscaled"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("buffer_size", 100000)

    model = DQN("CnnPolicy", env, verbose=0, learning_rate=1e-4, buffer_size=100000, policy_kwargs=policy_kwargs)
    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(env)])


### CNNConnectedDeep

So far this is the best model, judging by its training behaviour, its initial results, and its speed.

In [21]:
policy_kwargs = dict(
    features_extractor_class=CNNConnectedDeep,
    features_extractor_kwargs=dict(features_dim=256)
)
total_timesteps = 100000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/dqn_cnn_connected_deep_weights.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=1000
)
experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)
with mlflow.start_run(run_name="DQN_Run_CNNConnectedDeep"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("buffer_size", 100000)

    model = DQN("CnnPolicy", env, verbose=0, learning_rate=1e-4, buffer_size=100000, policy_kwargs=policy_kwargs)
    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(env, test_timesteps=5000)])

Training Progress:   0%|          | 0/100000 [00:00<?, ?step/s]
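
`TestCallBack` plays `n_episodes` full episodes every `test_timesteps` training steps, which can dominate the wall-clock time of short runs. A rough estimate, assuming an average episode length of about 800 steps (the ResNet152 log earlier shows `ep_len_mean` between 762 and 907; the true value for this model is not known):

```python
def evaluation_overhead(total_timesteps, test_timesteps, n_episodes, mean_episode_len):
    """Return (number of evaluation rounds, environment steps spent evaluating)."""
    rounds = total_timesteps // test_timesteps
    eval_steps = rounds * n_episodes * mean_episode_len
    return rounds, eval_steps


# Settings of the cell above: 100000 training steps, a test every 5000 steps,
# and the callback default of 100 episodes. mean_episode_len=800 is an assumption.
rounds, eval_steps = evaluation_overhead(100_000, 5_000, 100, 800)
print(rounds, eval_steps, eval_steps / 100_000)
```

Under these assumptions evaluation would consume roughly 16 times as many environment steps as training itself, so a smaller `n_episodes` or a larger `test_timesteps` may be worth considering.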

### 3. Fine-tuning the models

### CNNConnectedDeep
With new input dimensions of 84x84.

In [None]:
policy_kwargs = dict(
    features_extractor_class=CNNConnectedDeep,
    features_extractor_kwargs=dict(features_dim=256)
)
total_timesteps = 5_000_000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/dqn_cnn_connected_deep_weights_finetuning.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=25000
)

experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)
with mlflow.start_run(run_name="DQN_Run_CNNConnectedDeep_finetuned"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    # Logged values mirror the arguments passed to DQN below
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("buffer_size", 300_000)
    mlflow.log_param("exploration_fraction", 0.3)
    mlflow.log_param("exploration_final_eps", 0.01)
    mlflow.log_param("exploration_initial_eps", 1.0)
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("gamma", 0.99)
    mlflow.log_param("learning_starts", 100_000)
    mlflow.log_param("target_update_interval", 20000)

    model = DQN("CnnPolicy", env, verbose=0,
                learning_rate=1e-4, buffer_size=300_000,
                policy_kwargs=policy_kwargs,
                exploration_fraction=0.3, exploration_initial_eps=1.0,
                exploration_final_eps=0.01, target_update_interval=20000,
                batch_size=32, learning_starts=100_000, gamma=0.99,seed=23)
    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(env, test_timesteps=50_000)])

Training Progress:   0%|          | 0/5000000 [00:00<?, ?step/s]
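
Logging hyperparameters by hand makes it easy for the logged values to drift from the ones passed to the constructor. One way to avoid this is to keep them in a single dictionary that feeds both. A minimal sketch; `log_param` and `build_model` are stand-ins for `mlflow.log_param` and the `DQN` constructor so that the example runs without either library:

```python
def log_and_build(hparams, log_param, build_model):
    """Log every hyperparameter, then build the model from the same dict."""
    for name, value in hparams.items():
        log_param(name, value)
    return build_model(**hparams)


# Hyperparameters as passed to the constructor in the cell above.
hparams = dict(
    learning_rate=1e-4, buffer_size=300_000,
    exploration_fraction=0.3, exploration_initial_eps=1.0,
    exploration_final_eps=0.01, target_update_interval=20_000,
    batch_size=32, learning_starts=100_000, gamma=0.99, seed=23,
)

logged = {}
model = log_and_build(
    hparams,
    log_param=logged.__setitem__,           # stand-in for mlflow.log_param
    build_model=lambda **kwargs: kwargs,    # stand-in for DQN("CnnPolicy", env, ...)
)
print(logged == model == hparams)
```

In the notebook the same pattern would amount to calling `mlflow.log_param` in the loop and `DQN("CnnPolicy", env, policy_kwargs=policy_kwargs, **hparams)` afterwards.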

### DeepMind

- This architecture is already the default in Stable-Baselines3
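
The default `CnnPolicy` feature extractor in Stable-Baselines3 is the three-layer convolutional network from the DeepMind DQN paper (`NatureCNN`). The sketch below traces how the spatial size shrinks through those layers, assuming an 84x84 input and unpadded convolutions; `conv_out` is a helper defined here for illustration:

```python
def conv_out(size, kernel, stride):
    """Output width/height of an unpadded convolution."""
    return (size - kernel) // stride + 1


# (out_channels, kernel, stride) of the three NatureCNN convolutions.
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size = 84  # assumed input resolution
for channels, kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    print(f"{channels} channels, {size}x{size}")

flat_features = layers[-1][0] * size * size
print("flattened features:", flat_features)
```

The flattened vector is then projected to `features_dim` (512 by default) before any `net_arch` layers are applied.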

In [None]:
policy_kwargs = dict(
    net_arch = [256, 256],
)
total_timesteps = 100000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/deep_mind_data.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=1000
)
experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)
with mlflow.start_run(run_name="deep_mind"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("buffer_size", 100000)

    model = DQN("CnnPolicy", env, verbose=0, learning_rate=1e-4, buffer_size=100000, policy_kwargs=policy_kwargs)
    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(env, test_timesteps=5000)])


In [15]:
old_model = DQN.load("models/deep_mind_data.zip", env=env)

policy_kwargs = dict(
    net_arch = [256, 256],
)

total_timesteps = 5_000_000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/deep_mind_finetuning.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=25000
)

experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)
with mlflow.start_run(run_name="DQN_Run_deep_mind_finetuned"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    # Logged values mirror the arguments passed to DQN below
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("buffer_size", 300_000)
    mlflow.log_param("exploration_fraction", 0.2)
    mlflow.log_param("exploration_final_eps", 0.01)
    mlflow.log_param("exploration_initial_eps", 1.0)
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("gamma", 0.99)
    mlflow.log_param("learning_starts", 100_000)
    mlflow.log_param("target_update_interval", 1000)

    model = DQN("CnnPolicy", env, verbose=0,
                policy_kwargs=policy_kwargs,
                learning_rate=1e-4, buffer_size=300_000,
                exploration_fraction=0.2, exploration_initial_eps=1.0,
                exploration_final_eps=0.01, target_update_interval=1000,
                batch_size=32, learning_starts=100_000, gamma=0.99,seed=23)
    model.policy.load_state_dict(old_model.policy.state_dict())

    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(env, test_timesteps=50_000)])


Training Progress:   0%|          | 0/5000000 [00:00<?, ?step/s]

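Tuning with QR-DQN (the runs are named "duel", although `QRDQN` is Quantile Regression DQN rather than a dueling architecture)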

In [8]:
policy_kwargs = dict(
    net_arch = [256, 256],
)
total_timesteps = 5_000_000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/duel.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=25_000
)

experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)
with mlflow.start_run(run_name="duel"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 1e-3)
    model = QRDQN("CnnPolicy", env, verbose=0, learning_rate=1e-3, buffer_size=1_000_000, policy_kwargs=policy_kwargs, seed=23)

    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(env, test_timesteps=50_000)])




Training Progress:   0%|          | 0/5000000 [00:00<?, ?step/s]



In [None]:
total_timesteps = 20_000_000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/duel_episodes.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=100_000
)

experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)
with mlflow.start_run(run_name="duel_more_episodes"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 25e-5)
    mlflow.log_param("gamma", 0.95)
    mlflow.log_param("buffer_size", 1_000_000)
    mlflow.log_param("exploration_fraction", 0.005)
    mlflow.log_param("file_name", "models/duel_episodes.zip")

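    # Override the learning rate at load time: assigning model.learning_rate after
    # loading does not rebuild the learning-rate schedule used by the optimizer.
    model = QRDQN.load("models/duel.zip", env=env, custom_objects={"learning_rate": 25e-5})
    model.gamma = 0.95
    model.learning_starts = 100_000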
    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(env, test_timesteps=200_000)])

Training Progress:   0%|          | 0/20000000 [00:00<?, ?step/s]
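
Lowering `gamma` from 0.99 to 0.95 shortens how far ahead the agent effectively plans. A common rule of thumb is an effective horizon of 1 / (1 - gamma) steps; the snippet below compares the two values used in this notebook and the weight each gives to a reward 50 steps away:

```python
def effective_horizon(gamma):
    """Rule-of-thumb planning horizon, 1 / (1 - gamma), in environment steps."""
    return 1.0 / (1.0 - gamma)


def discount_weight(gamma, steps):
    """Weight given to a reward received `steps` steps in the future."""
    return gamma ** steps


for gamma in (0.99, 0.95):
    print(gamma, round(effective_horizon(gamma)), round(discount_weight(gamma, 50), 3))
```

With frame skipping, each environment step covers several game frames, so the horizon measured in frames is correspondingly longer.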

In [11]:
total_timesteps = 20_000_000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/duel_episodes_v2.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=100_000
)

experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)
with mlflow.start_run(run_name="duel_more_episodes"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 25e-5)
    mlflow.log_param("gamma", 0.95)
    mlflow.log_param("buffer_size", 1_000_000)
    mlflow.log_param("exploration_fraction", 0.005)
    mlflow.log_param("file_name", "models/duel_episodes_v2.zip")

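    # Override the learning rate at load time: assigning model.learning_rate after
    # loading does not rebuild the learning-rate schedule used by the optimizer.
    model = QRDQN.load("models/duel_episodes.zip", env=env, custom_objects={"learning_rate": 25e-5})
    model.gamma = 0.95
    model.learning_starts = 100_000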
    t_model = model.learn(total_timesteps=total_timesteps,
                          callback=[progress_bar_callback, ml_callback, TestCallBack(env, test_timesteps=200_000)])




Training Progress:   0%|          | 0/20000000 [00:00<?, ?step/s]

KeyboardInterrupt: 

## DeepMind

- The model is loaded after 5M training steps and trained further with more context

In [None]:
old_model = DQN.load("models/deep_mind_finetuning.zip", env=env)

policy_kwargs = dict(
    net_arch = [256, 256],
)

total_timesteps = 5_000_000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/deep_mind_finetuning_more.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=25000
)

experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)
with mlflow.start_run(run_name="DQN_Run_deep_mind_finetuned_more"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_param("buffer_size", 1_000_000)
    mlflow.log_param("exploration_fraction", 0.1)  # matches the DQN arguments below
    mlflow.log_param("exploration_final_eps", 0.01)
    mlflow.log_param("exploration_initial_eps", 0.05)  # matches the DQN arguments below
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("gamma", 0.99)
    mlflow.log_param("learning_starts", 100_000)
    mlflow.log_param("target_update_interval", 1_000)

    model = DQN("CnnPolicy", env, verbose=0,
                policy_kwargs=policy_kwargs,
                learning_rate=1e-3, buffer_size=1_000_000,
                exploration_fraction=0.1, exploration_initial_eps=0.05,
                exploration_final_eps=0.01, target_update_interval=1000,
                batch_size=32, learning_starts=100_000, gamma=0.99,seed=23)
    model.policy.load_state_dict(old_model.policy.state_dict())

    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(env, test_timesteps=50_000)])


Training Progress:   0%|          | 0/5000000 [00:00<?, ?step/s]
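
A replay buffer of one million transitions can require a large amount of RAM for image observations. A back-of-the-envelope estimate, assuming `uint8` observations of shape 4x84x84 (four stacked grayscale frames; the actual shape depends on the environment wrappers, which are defined elsewhere in the notebook) and a buffer that stores observations and next observations separately:

```python
def replay_buffer_gib(buffer_size, obs_shape, bytes_per_value=1, store_next_obs=True):
    """Approximate memory of the observation arrays of a replay buffer, in GiB.

    Actions, rewards and done flags are ignored because they are tiny
    compared with image observations.
    """
    values_per_obs = 1
    for dim in obs_shape:
        values_per_obs *= dim
    copies = 2 if store_next_obs else 1
    return buffer_size * values_per_obs * bytes_per_value * copies / 2**30


# obs_shape is an assumption, not taken from the notebook.
for size in (100_000, 300_000, 1_000_000):
    print(size, round(replay_buffer_gib(size, (4, 84, 84)), 1))
```

Stable-Baselines3 offers `optimize_memory_usage=True` on its off-policy algorithms to avoid storing the next observation separately, which roughly halves this figure.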

### PPO

In [8]:
policy_kwargs = dict(
    features_extractor_class=DeepMindCNN,
    features_extractor_kwargs=dict(features_dim=512),
    net_arch=[512, 512, 216, 102, 51],
)

total_timesteps = 5_000_000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/deep_mind_PPO_v2_pen.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=25000
)

experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="PPO_Run_DeepMind_v2_pen"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 25e-4)
    mlflow.log_param("n_steps", 2048 * 3)
    mlflow.log_param("frame_skip", 12)

    # model = PPO("CnnPolicy", env, verbose=0, learning_rate=25e-4, policy_kwargs=policy_kwargs, n_steps=2048 * 3, seed=23)
    model = PPO.load("models/deep_mind_PPO_v2_pen.zip", env=env)

    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(normal_env, test_timesteps=100_000)])

Training Progress:   0%|          | 0/5000000 [00:00<?, ?step/s]

KeyboardInterrupt: 
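
With `n_steps = 2048 * 3`, PPO collects 6144 transitions before each policy update. The arithmetic below shows how many rollouts a 5M-step run implies, assuming a single environment and the Stable-Baselines3 PPO defaults `batch_size=64` and `n_epochs=10` (the loaded model may have been saved with different values):

```python
import math


def ppo_schedule(total_timesteps, n_steps, n_envs=1, batch_size=64, n_epochs=10):
    """Return (rollouts, timesteps actually collected, gradient steps per rollout)."""
    rollout_size = n_steps * n_envs
    # Training proceeds in whole rollouts, so the step budget is rounded up.
    rollouts = math.ceil(total_timesteps / rollout_size)
    grad_steps_per_rollout = (rollout_size // batch_size) * n_epochs
    return rollouts, rollouts * rollout_size, grad_steps_per_rollout


print(ppo_schedule(5_000_000, 2048 * 3))
```

Because the policy is only updated once per rollout, a larger `n_steps` means fewer but larger updates for the same step budget.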

### A2C

In [18]:
policy_kwargs = dict(
    features_extractor_class=DeepMindCNN,
    features_extractor_kwargs=dict(features_dim=512),
    net_arch=dict(pi=[216, 216, 216], vf=[216, 216, 216])  # pi: policy, vf: value

)


total_timesteps = 5_000_000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/deep_mind_A2C.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=25000
)

experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)


with mlflow.start_run(run_name="PPO_Run_DeepMind_A2C"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 7e-4)
    mlflow.log_param("frame_skip", 4)
    mlflow.log_param("gamma", 0.95)
    mlflow.log_param("stats_window_size", 1_000)
    mlflow.log_param("ent_coef", 0.05)

    model = A2C("CnnPolicy", normal_env, stats_window_size=1000,verbose=0, ent_coef=0.05, gamma=0.95, seed=23, policy_kwargs=policy_kwargs, device="cuda")

    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(normal_env, test_timesteps=100_000)])


Training Progress:   0%|          | 0/5000000 [00:00<?, ?step/s]

  action_scalar = int(action)


KeyboardInterrupt: 
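
The `net_arch=dict(pi=[...], vf=[...])` argument builds two separate MLP branches on top of the 512-dimensional features, one for the policy and one for the value function. A small sketch that counts the trainable parameters of one such branch (weights plus biases of fully connected layers; the final action and value output layers are left out because the action-space size is not shown here):

```python
def mlp_param_count(input_dim, hidden_sizes):
    """Number of weights and biases in a stack of fully connected layers."""
    total = 0
    previous = input_dim
    for width in hidden_sizes:
        total += previous * width + width  # weight matrix + bias vector
        previous = width
    return total


features_dim = 512                 # features_extractor_kwargs in the cell above
branch = [216, 216, 216]           # same layout for pi and vf
per_branch = mlp_param_count(features_dim, branch)
print(per_branch, 2 * per_branch)
```

The first layer of each branch accounts for more than half of its parameters, so `features_dim` has a larger effect on head size than the number of hidden layers.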

In [23]:
policy_kwargs = dict(
    net_arch=[512, 512, 216, 102, 51],
)

old_model = PPO.load("models/deep_mind_PPO.zip", env=env)

total_timesteps = 5_000_000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/deep_mind_PPO_enhance.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=25000
)

experiment_name = "DQN_SpaceInvaders"
exist_experiment = mlflow.get_experiment_by_name(experiment_name)
if not exist_experiment:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="PPO_Run_DeepMind"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 25e-5)

    model = PPO("CnnPolicy", env, verbose=0, n_steps=2048 * 3,learning_rate=25e-5, clip_range=0.1, ent_coef=0.01, policy_kwargs=policy_kwargs, seed=23)
    model.policy.load_state_dict(old_model.policy.state_dict())

    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback, TestCallBack(env, test_timesteps=100_000)], reset_num_timesteps=True)

Training Progress:   0%|          | 0/5000000 [00:00<?, ?step/s]

KeyboardInterrupt: 

In [None]:
# Testing part to calculate the mean reward
weights_filename = 'dqn_{}_weights.h5f'.format(env_name)
dqn.load_weights(weights_filename)
dqn.test(env, nb_episodes=10, visualize=False)

3. Justification of the selected parameters and of the results obtained

---