# Activity - Practical project


> This activity will be carried out in predefined groups of 2-3 students. List the names in alphabetical order (by surname). Remember that this activity accounts for 30% of the final grade for the course. The work must be submitted in this notebook.
*   Student 1:
*   Student 2:
*   Student 3:






---
## **PART 1** - Installation and prerequisites

> The practical sessions have been prepared to run in Google Colab. However, that platform has some incompatibilities when displaying gym rendering, so to obtain these visualizations the working environment must be moved to a local machine. This guide therefore includes instructions for working in both environments. Follow these steps to get everything working correctly:
1.   **LOCAL:** Prepare the environment, following the instructions detailed in section *1.1. Prepare the environment*.
2.  **BOTH:** If you are in Colab, set the variables "mount" and "drive_root" to the working folder in Drive, then run the cell in *1.2. Locate the working environment*.
3. **COLAB:** Run the cells that mount the working folder in Drive. These are in section *1.3. Mount the local data folder*.
4.  **BOTH:** Install the required libraries, following section *1.4. Install the required libraries*.


---
### 1.1. Prepare the environment (local only)



> To prepare the local working environment, the following steps were followed:
1. On Windows, it may be necessary to install the C++ Build Tools. To do so, follow these steps: https://towardsdatascience.com/how-to-install-openai-gym-in-a-windows-environment-338969e24d30.
2. Install Anaconda.
3. Following the commands shown in the next cell: create an environment, change the working directory, and install the basic libraries.


```
conda create --name miar_rl python=3.8
conda activate miar_rl
cd "PATH_TO_FOLDER"
conda install git
pip install jupyter
```


4. Open the notebook with *jupyter-notebook*.



```
jupyter-notebook
```


---
### 1.2. Locate the working environment: Google Colab or local

In [None]:
# WARNING: change the path to the assignment folder (drive_root) if yours is different
mount='/content/gdrive'
drive_root = mount + "/My Drive/08_MIAR/actividades/proyecto practico"

try:
  from google.colab import drive
  IN_COLAB=True
except ImportError:
  IN_COLAB=False

---
### 1.3. Mount the local data folder (Colab only)

In [None]:
# Switch to the directory on the Google Drive that you want to use
import os
if IN_COLAB:
  print("We're running Colab")

  # Mount the Google Drive at mount
  print("Colab: mounting Google drive on ", mount)
  drive.mount(mount)

  # Create drive_root if it doesn't exist
  create_drive_root = True
  if create_drive_root:
    print("\nColab: making sure ", drive_root, " exists.")
    os.makedirs(drive_root, exist_ok=True)

  # Change to the directory
  print("\nColab: Changing directory to ", drive_root)
  %cd $drive_root
# Verify we're in the correct working directory
%pwd
print("Files in the directory: ")
print(os.listdir())

---
### 1.4. Install the required libraries

In [None]:
if IN_COLAB:
  %pip install gym==0.17.3
  %pip install git+https://github.com/Kojoley/atari-py.git
  %pip install keras-rl2==1.0.5
  %pip install tensorflow==2.8
else:
  %pip install gym==0.17.3
  %pip install git+https://github.com/Kojoley/atari-py.git
  %pip install pyglet==1.5.0
  %pip install h5py==3.1.0
  %pip install Pillow==9.5.0
  %pip install keras-rl2==1.0.5
  %pip install Keras==2.2.4
  %pip install tensorflow==2.5.3
  %pip install torch==2.0.1
  %pip install agents==1.4.0
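
The install cell above pins gym and keras-rl2, while the cells in Part 3 import a different set of packages. As a quick sanity check, the sketch below reports which distributions are present in the current environment. The distribution names in `REQUIRED` are an assumption inferred from the Part 3 imports, not part of the original instructions.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names, inferred from the imports used in Part 3.
REQUIRED = ["stable-baselines3", "gymnasium", "ale-py", "mlflow",
            "torch", "torchvision", "tqdm"]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Print one line per package so missing ones are easy to spot.
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:20s} {ver or 'MISSING'}")
```

Any package reported as missing would need its own `%pip install` line before the Part 3 cells can run.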

---
## **PART 2**. Assignment statement

Points to keep in mind:

- The environment we will work with is _SpaceInvaders-v0_ and the algorithm we will use is _DQN_.

- For this exercise, the minimum requirement is met when the agent achieves a **mean reward above 20 points in test mode**. This mean reward is computed by the test code in the last cell of the notebook.

This practical project consists of three parts:

1.   Implement the neural network used in the solution
2.   Implement the different pieces of the DQN solution
3.   Justify the answer in light of the results obtained

**Rubric**: Originality of the proposed solution will be valued, as well as the ability to discuss the results in detail. Meeting the minimum requirement is enough to pass the activity, provided that the discussion of the results is appropriate.

IMPORTANT:

* If an optimal score is not reached, answer with respect to the best score obtained.
* For long training runs, remember that you can use checkpoints of your models to resume training. In that case, remember to adjust the parameters accordingly (especially those related to the exploration process).
* Submit only the notebook and the weights of the best model in a .zip file, in an organized way.
* Each student must upload the solution individually.

---
## **PART 3**. Development and questions

#### Import libraries

In [1]:
from PIL import Image
import numpy as np
import mlflow
from stable_baselines3 import DQN
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
import torch.nn as nn
import torch
import os
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import ProgressBarCallback, BaseCallback

import gymnasium as gym
import ale_py


from torchvision.models import mobilenet_v2

from tqdm.notebook import tqdm
import torchvision
from torchvision import transforms
import torch.nn.functional as F


#### Base configuration

In [2]:

INPUT_SHAPE = (1, 84, 84)
WINDOW_LENGTH = 4
gym.register_envs(ale_py)

class ClipRewardWrapper(gym.RewardWrapper):
    def __init__(self, env):
        super().__init__(env)

    def reward(self, reward):
        return np.clip(reward, -1.0, 1.0)

env_name = 'SpaceInvaders-v0'
env = gym.make(env_name)
env = ClipRewardWrapper(env)

np.random.seed(123)
obs, info = env.reset(seed=123)
nb_actions = env.action_space.n


A.L.E: Arcade Learning Environment (version 0.11.1+2750686)
[Powered by Stella]
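
Because `ClipRewardWrapper` is applied to `env`, every reward the agent sees is bounded to [-1, 1]. The sketch below mirrors that clipping in plain Python on a hypothetical sequence of raw per-step rewards (the values are illustrative, not taken from a real episode). It shows that the clipped return counts rewarding events rather than game points, which matters when comparing against a score threshold.

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Bound a scalar reward to [low, high], like np.clip in ClipRewardWrapper."""
    return max(low, min(high, reward))

# Hypothetical raw per-step rewards for one short episode.
raw_rewards = [0, 5, 0, 10, 0, 0, 30, 15]

raw_return = sum(raw_rewards)
clipped_return = sum(clip_reward(r) for r in raw_rewards)

print("raw return:    ", raw_return)
print("clipped return:", clipped_return)
```

If the test cell runs on the wrapped `env`, the reported mean is a clipped return; evaluating on an environment without the wrapper would report game points instead.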


1. Neural network implementation

In [3]:
class MobileNetFeatureExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space, features_dim=256):
        super().__init__(observation_space, features_dim)

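        # Pretrained MobileNetV2 without the classifier
        # (weights= replaces the deprecated pretrained=True; IMAGENET1K_V1 is the same checkpoint)
        self.backbone = mobilenet_v2(weights=torchvision.models.MobileNet_V2_Weights.IMAGENET1K_V1).features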

        # Freeze weights (optional)
        for param in self.backbone.parameters():
            param.requires_grad = False

        # Compute shape by doing one forward pass
        with torch.no_grad():
            sample = torch.as_tensor(observation_space.sample()[None]).float()
            if sample.shape[1] != 3:  # Convert grayscale to 3 channels
                sample = sample.repeat(1, 3, 1, 1)
            n_flatten = self.backbone(sample).view(sample.shape[0], -1).shape[1]

        self.projector = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flatten, features_dim),
            nn.ReLU()
        )

    def forward(self, obs):
        # Convert 1-channel grayscale to 3 channels if needed
        if obs.shape[1] == 1:
            obs = obs.repeat(1, 3, 1, 1)
        features = self.backbone(obs)
        return self.projector(features)
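
The dummy forward pass in `__init__` exists because the flattened feature size depends on the observation resolution. As a rough cross-check, the sketch below estimates that size by hand. It assumes a 210x160 RGB frame (the usual SpaceInvaders observation), five stride-2 3x3 convolutions with padding 1 along the MobileNetV2 feature stack, and 1280 output channels. These figures are assumptions about the backbone and should be confirmed against the value produced by the real forward pass.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def downsample(size, n_stride2_layers=5):
    """Apply the assumed number of stride-2 convolutions in sequence."""
    for _ in range(n_stride2_layers):
        size = conv_out(size)
    return size

height, width = 210, 160      # assumed frame size
out_channels = 1280           # assumed channels of the last feature map

out_h, out_w = downsample(height), downsample(width)
n_flatten = out_channels * out_h * out_w
print(out_h, out_w, n_flatten)
```

Under these assumptions the `nn.Linear` projector maps 44800 inputs to `features_dim` outputs, roughly 11.5 million weights for `features_dim=256`.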

In [4]:
class VitB16FeaturesExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space, features_dim=256):
        super().__init__(observation_space, features_dim)

        # Load a pretrained ViT model
        weights = torchvision.models.ViT_B_16_Weights.DEFAULT
        self.backbone = torchvision.models.vit_b_16(weights=weights).cuda()

        # Freeze weights (optional)
        for param in self.backbone.parameters():
            param.requires_grad = False

        # Compute shape by doing one forward pass
        with torch.no_grad():
            sample = torch.as_tensor(observation_space.sample()[None]).float()
            if sample.shape[1] != 3:  # Convert grayscale to 3 channels
                sample = sample.repeat(1, 3, 1, 1)
            sample = self._preprocess(sample).cuda()
            n_flatten = self.backbone(sample).view(sample.shape[0], -1).shape[1]

        self.projector = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flatten, features_dim),
            nn.ReLU()
        )

    def _preprocess(self, observation):
        # Preprocess the observation to match the input requirements of ViT
        batch_resize = F.interpolate(
            observation, size=(224, 224), mode='bilinear', align_corners=False
        )
        return batch_resize


    def forward(self, obs):
        # Resize to the 224x224 input expected by ViT and move to the GPU
        obs = self._preprocess(obs).cuda()
        # Convert 1-channel grayscale to 3 channels if needed
        if obs.shape[1] == 1:
            obs = obs.repeat(1, 3, 1, 1)
        features = self.backbone(obs)
        return self.projector(features)
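
`_preprocess` resizes every frame to 224x224 because ViT-B/16 splits its input into fixed 16x16 patches. The arithmetic below shows the resulting sequence length. One related point worth checking: the full torchvision `vit_b_16` model returns classification logits (1000 ImageNet classes), so `n_flatten` in the class above is expected to be 1000 rather than the 768-dimensional token embedding.

```python
image_size = 224   # resolution produced by _preprocess
patch_size = 16    # the "16" in ViT-B/16

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2
sequence_length = num_patches + 1   # plus one class token

print(patches_per_side, num_patches, sequence_length)
```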

2. DQN solution implementation

In [5]:
class TQDMProgressCallback(BaseCallback):
    def __init__(self, total_timesteps: int, verbose=0):
        super().__init__(verbose)
        self.total_timesteps = total_timesteps
        self.progress_bar = None
        self.last_timesteps = 0

    def _on_training_start(self):
        self.progress_bar = tqdm(total=self.total_timesteps, desc="Training Progress", unit="step")

    def _on_step(self):
        steps_since_last = self.num_timesteps - self.last_timesteps
        self.progress_bar.update(steps_since_last)
        # Remember the total seen so far so the next delta is correct
        self.last_timesteps = self.num_timesteps

        # Optional: log latest reward if available
        infos = self.locals.get("infos", [])
        if infos and isinstance(infos[0], dict) and "episode" in infos[0]:
            self.progress_bar.set_postfix(reward=infos[0]["episode"]["r"])
        return True  # Return True to continue training

    def _on_training_end(self):
        self.progress_bar.close()
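
The bookkeeping a step-based progress callback needs can be checked without running any training. The sketch below uses a hypothetical `FakeBar` in place of tqdm and feeds it a sequence of cumulative timestep totals, advancing the bar by the difference from the last total seen. The classes and the sample totals are illustrative assumptions.

```python
class FakeBar:
    """Stand-in for a tqdm bar that only accumulates the updates it receives."""
    def __init__(self):
        self.n = 0

    def update(self, delta):
        self.n += delta

class DeltaTracker:
    """Advance a bar by the change in a cumulative counter."""
    def __init__(self, bar):
        self.bar = bar
        self.last_total = 0

    def on_step(self, total_so_far):
        self.bar.update(total_so_far - self.last_total)
        self.last_total = total_so_far  # store the total, not a +1 increment

bar = FakeBar()
tracker = DeltaTracker(bar)
# Cumulative totals as they might look with 4 parallel environments.
for total in (4, 8, 12, 16):
    tracker.on_step(total)
print(bar.n)
```

Incrementing the stored value by one instead would give deltas of 4, 7, 10 and 13 here, so the bar would overshoot the real step count.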

In [6]:

class MLflowCallback(BaseCallback):
    def __init__(self, best_model_path, experiment_name="SB3_Experiment", run_name=None, log_freq=1000, verbose=0):
        super().__init__(verbose)
        self.experiment_name = experiment_name
        self.log_freq = log_freq
        self.step_count = 0
        self.best_mean_reward = -np.inf
        self.best_model_path = best_model_path

    def _on_step(self) -> bool:
        self.step_count += 1
        if self.step_count % self.log_freq == 0:
            rewards = [ep_info['r'] for ep_info in self.model.ep_info_buffer] if self.model.ep_info_buffer else []
            lengths = [ep_info['l'] for ep_info in self.model.ep_info_buffer] if self.model.ep_info_buffer else []

            mean_reward = np.mean(rewards) if rewards else 0.0
            max_reward = np.max(rewards) if rewards else 0.0
            min_reward = np.min(rewards) if rewards else 0.0
            mean_length = np.mean(lengths) if lengths else 0.0
            std_reward = np.std(rewards) if rewards else 0.0

            step = self.num_timesteps
            mlflow.log_metric("timesteps", step, step=step)
            mlflow.log_metric("episode_reward_mean", mean_reward, step=step)
            mlflow.log_metric("episode_reward_max", max_reward, step=step)
            mlflow.log_metric("episode_reward_min", min_reward, step=step)
            mlflow.log_metric("episode_length_mean", mean_length, step=step)
            mlflow.log_metric("episode_reward_std", std_reward, step=step)

            if mean_reward > self.best_mean_reward:
                self.best_mean_reward = mean_reward
                # Save the best model
                self.model.save(self.best_model_path)
        return True

    def _on_training_end(self):
        # Optionally save the model as artifact
        model_path = "final_model.zip"
        self.model.save(model_path)
        mlflow.log_artifact(model_path)
        mlflow.end_run()
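
The metrics logged by `MLflowCallback` are summary statistics over the most recent episodes held in `ep_info_buffer`. The sketch below reproduces that computation with the standard library on a bounded deque. The window size of 100 and the sample rewards are assumptions for illustration; `pstdev` is used because `np.std` computes the population standard deviation by default.

```python
from collections import deque
from statistics import mean, pstdev

def summarize(episodes):
    """Return reward/length statistics for a sequence of {'r': ..., 'l': ...} dicts."""
    if not episodes:
        return {"mean": 0.0, "max": 0.0, "min": 0.0, "std": 0.0, "len_mean": 0.0}
    rewards = [ep["r"] for ep in episodes]
    lengths = [ep["l"] for ep in episodes]
    return {
        "mean": mean(rewards),
        "max": max(rewards),
        "min": min(rewards),
        "std": pstdev(rewards),
        "len_mean": mean(lengths),
    }

# Bounded buffer: old episodes fall out once the window is full.
buffer = deque(maxlen=100)
for reward, length in [(5.0, 600), (10.0, 650), (8.0, 620), (10.0, 626)]:
    buffer.append({"r": reward, "l": length})

stats = summarize(buffer)
print(stats)
```

Returning 0.0 for an empty buffer mirrors the callback above. With that default, the first comparison against `best_mean_reward` succeeds even before any episode has finished, so an early checkpoint may be saved with no real evidence behind it.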

In [None]:
policy_kwargs = dict(
    features_extractor_class=MobileNetFeatureExtractor,
    features_extractor_kwargs=dict(features_dim=256),
)
total_timesteps = 100000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    "models/dqn_mobilenet_v2_weights.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=500
)
experiment_name = "DQN_SpaceInvaders"
# set_experiment creates the experiment if needed and makes it the active one
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="DQN_Run_MobileNetv2_finetuned"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("buffer_size", 100000)

    model = DQN("CnnPolicy", env, verbose=1, learning_rate=1e-4, buffer_size=100000, policy_kwargs=policy_kwargs)
    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback])

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.




Training Progress:   0%|          | 0/100000 [00:00<?, ?step/s]



In [None]:
total_timesteps = 100000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/dqn_vit_b_16_weights.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=500
)
experiment_name = "DQN_SpaceInvaders"
# set_experiment creates the experiment if needed and makes it the active one
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="DQN_Run_Vit_b_16_finetuned"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("buffer_size", 100000)
    model = DQN.load("models/dqn_vit_b_16_weights.zip", env=env)
    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback])


Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.




Training Progress:   0%|          | 0/100000 [00:00<?, ?step/s]

In [None]:
policy_kwargs = dict(
    features_extractor_class=VitB16FeaturesExtractor,
    features_extractor_kwargs=dict(features_dim=256),
)
total_timesteps = 100000
progress_bar_callback = TQDMProgressCallback(total_timesteps=total_timesteps)
ml_callback = MLflowCallback(
    best_model_path="models/dqn_vit_b_16_weights.zip",
    experiment_name="DQN_SpaceInvaders",
    run_name="DQN_Run",
    log_freq=500
)
experiment_name = "DQN_SpaceInvaders"
# set_experiment creates the experiment if needed and makes it the active one
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="DQN_Run_Vit_b_16_finetuned"):
    mlflow.log_param("env_name", env_name)
    mlflow.log_param("total_timesteps", total_timesteps)
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("buffer_size", 100000)

    model = DQN("CnnPolicy", env, verbose=1, learning_rate=1e-4, buffer_size=100000, policy_kwargs=policy_kwargs)
    t_model = model.learn(total_timesteps=total_timesteps, callback=[progress_bar_callback, ml_callback])


Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.


Training Progress:   0%|          | 0/100000 [00:00<?, ?step/s]

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 624      |
|    ep_rew_mean      | 8.25     |
|    exploration_rate | 0.763    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 19       |
|    time_elapsed     | 127      |
|    total_timesteps  | 2496     |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.0159   |
|    n_updates        | 598      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 653      |
|    ep_rew_mean      | 9.25     |
|    exploration_rate | 0.504    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 18       |
|    time_elapsed     | 277      |
|    total_timesteps  | 5226     |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.0147   |
|    n_updates      
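
The `exploration_rate` values in the log above are consistent with a linear epsilon schedule. The sketch below assumes the Stable-Baselines3 DQN defaults (initial epsilon 1.0, final epsilon 0.05, annealed over the first 10% of `total_timesteps`), which the training cells do not override. The function is a plain-Python illustration, not the library's code.

```python
def linear_epsilon(step, total_timesteps, start=1.0, end=0.05, fraction=0.1):
    """Linearly anneal epsilon from start to end over the first `fraction` of training."""
    progress = step / total_timesteps
    if progress >= fraction:
        return end
    return start + progress * (end - start) / fraction

total = 100_000
# The first two step counts are the total_timesteps values shown in the log.
for step in (2496, 5226, 10_000, 50_000):
    print(step, round(linear_epsilon(step, total), 3))
```

Under these assumptions exploration bottoms out after 10,000 steps of a 100,000-step run. When resuming from a checkpoint, a fresh `learn` call may restart this schedule from the initial epsilon unless the exploration parameters are adjusted, which is the situation the IMPORTANT note in Part 2 warns about.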

In [None]:
# Testing part to calculate the mean reward
weights_filename = 'dqn_{}_weights.h5f'.format(env_name)
dqn.load_weights(weights_filename)
dqn.test(env, nb_episodes=10, visualize=False)
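
The test cell above follows the keras-rl interface (`dqn.load_weights`, `dqn.test`). For an agent that only exposes an action-selection function, the mean test reward can be computed with a plain evaluation loop over the Gymnasium `reset`/`step` interface. In the sketch below, `StubEnv` and `always_fire` are hypothetical stand-ins so the example runs on its own; in practice they would be replaced by the real environment and the trained model's prediction function.

```python
class StubEnv:
    """Hypothetical environment following the Gymnasium reset/step signatures."""
    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return 0, {}                      # observation, info

    def step(self, action):
        self.t += 1
        reward = 5.0 if action == 1 else 0.0
        terminated = self.t >= self.episode_length
        return self.t, reward, terminated, False, {}

def mean_test_reward(env, policy, n_episodes=10):
    """Average undiscounted return of `policy` over `n_episodes` episodes."""
    returns = []
    for episode in range(n_episodes):
        obs, _ = env.reset(seed=episode)
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)

def always_fire(obs):
    """Hypothetical policy that always picks action 1."""
    return 1

score = mean_test_reward(StubEnv(), always_fire, n_episodes=10)
print(f"mean test reward: {score:.1f} (minimum requirement: above 20)")
```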

3. Justification of the selected parameters and the results obtained

---