### Run in Colab
<a href="https://colab.research.google.com/github/racousin/data_science_practice/blob/master/website/public/modules/module13/exercise/module13_exercise4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!pip install swig==4.2.1
!pip install gymnasium==0.29.1
!pip install gymnasium[box2d]  # Install Box2D dependency for LunarLander-v2

Collecting gymnasium==0.29.1
  Downloading gymnasium-0.29.1-py3-none-any.whl.metadata (10 kB)
Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
Installing collected packages: gymnasium
  Attempting uninstall: gymnasium
    Found existing installation: gymnasium 1.1.1
    Uninstalling gymnasium-1.1.1:
      Successfully uninstalled gymnasium-1.1.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dopamine-rl 4.1.2 requires gymnasium>=1.0.0, but you have gymnasium 0.29.1 which is incompatible.
Successfully installed gymnasium-0.29.1




In [5]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt

# module13_exercise4: ML-Arena <a href="https://ml-arena.com/viewcompetition/1" target="_blank">LunarLander</a>

### Objective
Get at least one agent running on ML-Arena <a href="https://ml-arena.com/viewcompetition/1" target="_blank">LunarLander</a> with a mean reward above 50.


You should submit an agent file named `agent.py` with a class `Agent` that includes at least the following attributes:

In [2]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

import gymnasium as gym  # requires gymnasium[box2d] for LunarLander-v2

# Neural network that approximates Q(s, a)
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(QNetwork, self).__init__()
        # Fully connected network with 2 hidden layers of 128 units
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc_out = nn.Linear(128, action_dim)
        # Optional weight initialization can be added here if desired

    def forward(self, state):
        # Forward pass: ReLU on the hidden layers
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc_out(x)  # Q-values (one per action)

# Experience buffer that stores and samples transitions
class ReplayBuffer:
    def __init__(self, capacity, state_dim):
        self.capacity = capacity
        self.memory = []        # list of transitions
        self.position = 0       # current index, used to overwrite the oldest experiences
        self.state_dim = state_dim

    def add(self, state, action, reward, next_state, done):
        # If the memory is not full yet, append a new slot
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        # Store the transition (arrays are copied to avoid keeping references)
        self.memory[self.position] = (
            np.array(state, copy=True),
            action,
            reward,
            np.array(next_state, copy=True),
            done
        )
        # Circular increment of the position
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Draw batch_size transitions at random, without replacement
        indices = np.random.choice(len(self.memory), batch_size, replace=False)
        states, actions, rewards, next_states, dones = zip(*(self.memory[i] for i in indices))
        # Convert to PyTorch tensors
        states      = torch.tensor(np.array(states), dtype=torch.float32)
        actions     = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)    # action indices
        rewards     = torch.tensor(rewards, dtype=torch.float32).unsqueeze(1)  # rewards
        next_states = torch.tensor(np.array(next_states), dtype=torch.float32)
        dones       = torch.tensor(dones, dtype=torch.float32).unsqueeze(1)    # done flags (0.0 or 1.0)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)

# DQN agent with a local network and a target network
class DQNAgent:
    def __init__(self, state_dim, action_dim):
        self.state_dim = state_dim
        self.action_dim = action_dim
        # Initialize both networks (policy and target) and the optimizer
        self.q_network = QNetwork(state_dim, action_dim)
        self.target_network = QNetwork(state_dim, action_dim)
        self.target_network.load_state_dict(self.q_network.state_dict())  # identical initialization
        self.target_network.eval()  # the target network is not trained by gradient descent
        self.optimizer = torch.optim.Adam(self.q_network.parameters(), lr=5e-4)
        # Initialize the replay memory
        self.memory = ReplayBuffer(capacity=100000, state_dim=state_dim)
        # Step counter for managing updates
        self.learn_step_counter = 0

    def select_action(self, state, epsilon):
        """Return an action chosen by an epsilon-greedy policy."""
        if np.random.rand() < epsilon:
            # Random exploration
            return np.random.randint(self.action_dim)
        else:
            # Exploitation (choose the action with the highest Q-value)
            state_t = torch.tensor(state, dtype=torch.float32).unsqueeze(0)  # shape (1, state_dim)
            self.q_network.eval()  # evaluation mode
            with torch.no_grad():
                q_values = self.q_network(state_t)
            self.q_network.train()  # back to training mode
            action = int(torch.argmax(q_values, dim=1).item())
            return action

    def train_step(self, batch_size=64, gamma=0.99, tau=1e-3):
        """Run one learning step (one update of the Q-network)."""
        if len(self.memory) < batch_size:
            return  # do not train until there are enough samples
        # Sample a mini-batch of transitions
        states, actions, rewards, next_states, dones = self.memory.sample(batch_size)
        # Compute the Q-targets with the target network (no gradients needed here)
        with torch.no_grad():
            # Max Q-value of the next state according to the target network
            q_next = self.target_network(next_states).max(dim=1, keepdim=True)[0]
            # Q-target: r + gamma * max(Q_next) * (1 - done)
            q_target = rewards + gamma * q_next * (1 - dones)
        # Current Q-value predicted by the main network for the (state, action) pairs in the batch
        q_current = self.q_network(states).gather(1, actions)  # Q(s, a) for each transition in the batch
        # Compute the loss (mean squared error)
        loss = F.mse_loss(q_current, q_target)
        # Backpropagate the loss and update the Q-network weights
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # Soft update of the target network toward the Q-network (tau)
        for target_param, local_param in zip(self.target_network.parameters(), self.q_network.parameters()):
            target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)

# LunarLander-v2 environment
env = gym.make("LunarLander-v2")  # v2 is the version available in gymnasium 0.29.1
state_dim = env.observation_space.shape[0]   # state dimension (8)
action_dim = env.action_space.n             # number of actions (4)
agent = DQNAgent(state_dim, action_dim)

# Training parameters
num_episodes = 1000         # maximum number of episodes
max_steps = 1000            # max steps per episode (to avoid infinite loops)
target_score = 200          # target average score
print_interval = 10         # interval for printing progress
best_avg_reward = -float("inf")
best_model_path = "best_model.pth"

# Variables for tracking performance
scores = []                 # per-episode scores
scores_window = []          # sliding window of the last 100 scores

# Main training loop
epsilon = 1.0               # initial epsilon (epsilon-greedy policy)
epsilon_decay = 0.995       # exponential decay factor for epsilon
epsilon_min = 0.01          # minimum epsilon
for episode in range(1, num_episodes + 1):
    state, _ = env.reset()  # reset the environment
    episode_reward = 0
    for t in range(max_steps):
        # Select an action with the epsilon-greedy policy
        action = agent.select_action(state, epsilon)
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        episode_reward += reward
        # Store the transition in the replay memory. Only `terminated` is stored as
        # the done flag, so a time-limit truncation still bootstraps from next_state.
        agent.memory.add(state, action, reward, next_state, terminated)
        # Update the current state
        state = next_state
        # Train the network (every 4 steps)
        if t % 4 == 0:
            agent.train_step(batch_size=64, gamma=0.99, tau=1e-3)
        # Stop at the end of the episode
        if done:
            break
    # Update epsilon (exponential decay per episode)
    epsilon = max(epsilon * epsilon_decay, epsilon_min)
    # Record the episode score
    scores.append(episode_reward)
    scores_window.append(episode_reward)
    if len(scores_window) > 100:
        # keep a sliding window of the last 100 episodes
        scores_window.pop(0)
    # Compute the mean reward over the last 100 episodes
    avg_reward_100 = np.mean(scores_window)
    # Save the model if it is the best so far
    if avg_reward_100 > best_avg_reward:
        best_avg_reward = avg_reward_100
        torch.save(agent.q_network.state_dict(), best_model_path)
    # Periodically print training statistics
    if episode % print_interval == 0:
        print(f"Episode {episode}/{num_episodes} - Mean score (last 100): {avg_reward_100:.1f} - eps={epsilon:.3f}")
    # Early stopping once the 100-episode mean reaches the target
    if avg_reward_100 >= target_score and episode >= 100:
        print(f"Environment solved in {episode} episodes 🎉  (mean score over 100 eps = {avg_reward_100:.1f})")
        break

# End of training
print("Best mean over 100 episodes:", best_avg_reward)
# Load the best saved model
best_model = QNetwork(state_dim, action_dim)
best_model.load_state_dict(torch.load(best_model_path))
best_model.eval()

# Evaluate the trained model over 100 episodes to check that the mean reward reaches 200
eval_episodes = 100
eval_rewards = []
for i in range(eval_episodes):
    state, _ = env.reset()
    episode_sum = 0
    while True:
        # Select the action deterministically (epsilon=0, purely greedy policy)
        state_t = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            q_vals = best_model(state_t)
        action = int(torch.argmax(q_vals, dim=1).item())
        # Act in the environment
        next_state, reward, terminated, truncated, info = env.step(action)
        episode_sum += reward
        state = next_state
        if terminated or truncated:
            eval_rewards.append(episode_sum)
            break

avg_eval_reward = np.mean(eval_rewards)
print(f"Mean reward over {eval_episodes} evaluation episodes: {avg_eval_reward:.2f}")
if avg_eval_reward >= 200:
    print(">>> Target performance reached! The agent averages at least 200 ✅")
else:
    print(">>> Target performance NOT reached. Retrain or adjust the hyperparameters ⚠️")


Episode 10/1000 - Mean score (last 100): -235.8 - eps=0.951
Episode 20/1000 - Mean score (last 100): -210.6 - eps=0.905
Episode 30/1000 - Mean score (last 100): -211.2 - eps=0.860
Episode 40/1000 - Mean score (last 100): -201.9 - eps=0.818
Episode 50/1000 - Mean score (last 100): -204.5 - eps=0.778
Episode 60/1000 - Mean score (last 100): -202.1 - eps=0.740
Episode 70/1000 - Mean score (last 100): -192.8 - eps=0.704
Episode 80/1000 - Mean score (last 100): -189.7 - eps=0.670
Episode 90/1000 - Mean score (last 100): -181.9 - eps=0.637
Episode 100/1000 - Mean score (last 100): -177.8 - eps=0.606
Episode 110/1000 - Mean score (last 100): -167.6 - eps=0.576
Episode 120/1000 - Mean score (last 100): -163.2 - eps=0.548
Episode 130/1000 - Mean score (last 100): -154.8 - eps=0.521
Episode 140/1000 - Mean score (last 100): -148.5 - eps=0.496
Episode 150/1000 - Mean score (last 100): -137.2 - eps=0.471
Epi
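
The `eps` column in the log above can be checked against the decay rule used in the training loop, `epsilon = max(epsilon * epsilon_decay, epsilon_min)`, applied once per episode. The sketch below replays that rule with the same values (start 1.0, decay 0.995, minimum 0.01); the helper name `epsilon_after` is hypothetical and only serves this illustration.

```python
def epsilon_after(n_episodes, start=1.0, decay=0.995, floor=0.01):
    """Epsilon after n_episodes updates of eps = max(eps * decay, floor)."""
    eps = start
    for _ in range(n_episodes):
        eps = max(eps * decay, floor)
    return eps


# Compare with the eps values printed at episodes 10 and 100 in the log.
for n in (10, 100):
    print(n, round(epsilon_after(n), 3))
# 10 0.951
# 100 0.606
```

With this schedule the agent still acts randomly about 60% of the time after 100 episodes, which is consistent with the slow early improvement of the mean score.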

### Description

This environment is a classic rocket trajectory optimization problem. According to Pontryagin's maximum principle, it is optimal to fire the engine at full throttle or turn it off. This is the reason why this environment has discrete actions: engine on or off.
There are two environment versions: discrete or continuous. The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector. Landing outside of the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt.

### Action Space

There are four discrete actions available:
- 0: do nothing
- 1: fire left orientation engine
- 2: fire main engine
- 3: fire right orientation engine
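
For readability, the four action indices listed above can be given names. This is a minimal sketch: the enum `LunarAction` is a hypothetical helper and not part of Gymnasium; only the integer values come from the list above.

```python
from enum import IntEnum


class LunarAction(IntEnum):
    """Names for the four discrete actions (hypothetical helper, not a Gymnasium API)."""
    NOOP = 0        # do nothing
    FIRE_LEFT = 1   # fire left orientation engine
    FIRE_MAIN = 2   # fire main engine
    FIRE_RIGHT = 3  # fire right orientation engine


# An IntEnum member compares equal to its integer value.
action = LunarAction.FIRE_MAIN
print(int(action), action.name)
# 2 FIRE_MAIN
```

Passing `int(action)` to `env.step` keeps the call identical to using the raw index.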

### Observation Space
The state is an 8-dimensional vector: the coordinates of the lander in x & y, its linear velocities in x & y, its angle, its angular velocity, and two booleans that represent whether each leg is in contact with the ground or not.
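
To make the layout of the state vector concrete, the sketch below unpacks an 8-element observation into named fields in the order given above. `LanderState` and `unpack_observation` are hypothetical helpers; the names `leg1_contact` and `leg2_contact` are an assumption, since the text does not say which leg comes first.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Named view of the 8-dimensional observation (hypothetical helper)."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    leg1_contact: bool
    leg2_contact: bool


def unpack_observation(obs: Sequence[float]) -> LanderState:
    """Split an 8-element observation into named fields."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, angular_velocity, leg1, leg2 = (float(v) for v in obs)
    # The two leg flags are stored as 0.0 / 1.0 in the vector.
    return LanderState(x, y, vx, vy, angle, angular_velocity, leg1 > 0.5, leg2 > 0.5)


state = unpack_observation([0.0, 1.4, 0.1, -0.2, 0.05, 0.0, 0.0, 1.0])
print(state.y, state.leg2_contact)
# 1.4 True
```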

### Rewards
After every step a reward is granted. The total reward of an episode is the sum of the rewards for all the steps within that episode.
For each step, the reward:
- is increased/decreased the closer/further the lander is to the landing pad.
- is increased/decreased the slower/faster the lander is moving.
- is decreased the more the lander is tilted (angle not horizontal).
- is increased by 10 points for each leg that is in contact with the ground.
- is decreased by 0.03 points each frame a side engine is firing.
- is decreased by 0.3 points each frame the main engine is firing.

The episode receives an additional reward of -100 or +100 points for crashing or landing safely, respectively.
An episode is considered a solution if it scores at least 200 points.
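
As a small worked example of the engine costs and the 200-point threshold stated above, the sketch below adds up the firing penalty for a sequence of actions and checks whether an episode counts as a solution. The function names are hypothetical, and only the engine penalties are modeled, not the full reward.

```python
MAIN_ENGINE_COST = 0.3   # points per frame the main engine fires (action 2)
SIDE_ENGINE_COST = 0.03  # points per frame a side engine fires (actions 1 and 3)
SOLVED_THRESHOLD = 200   # minimum episode score counted as a solution


def fuel_penalty(actions):
    """Total engine-firing penalty for a sequence of discrete actions."""
    main_frames = sum(1 for a in actions if a == 2)
    side_frames = sum(1 for a in actions if a in (1, 3))
    return MAIN_ENGINE_COST * main_frames + SIDE_ENGINE_COST * side_frames


def is_solution(step_rewards):
    """An episode is a solution if its summed step rewards reach the threshold."""
    return sum(step_rewards) >= SOLVED_THRESHOLD


# Two main-engine frames and two side-engine frames: 2 * 0.3 + 2 * 0.03
print(round(fuel_penalty([0, 1, 2, 3, 2]), 2))
# 0.66
```

Firing the main engine for 100 frames therefore costs 30 points, a noticeable share of the 200 needed.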

### Starting State
The lander starts at the top center of the viewport with a random initial force applied to its center of mass.

### Episode Termination
The episode finishes if:
- the lander crashes (the lander body gets in contact with the moon);
- the lander gets outside of the viewport (x coordinate is greater than 1);
- the lander is not awake. From the Box2D docs, a body which is not awake is a body which doesn't move and doesn't collide with any other body.

### Before submitting
Test that your agent has the right attributes.

In [None]:
env = gym.make("LunarLander-v2")
agent = Agent(env)

observation, _ = env.reset()
reward, terminated, truncated, info = None, False, False, None
rewards = []
while not (terminated or truncated):
    action = agent.choose_action(observation, reward=reward, terminated=terminated, truncated=truncated, info=info)
    observation, reward, terminated, truncated, info = env.step(action)
    rewards.append(reward)
print(f'Cumulative Reward: {sum(rewards)}')
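
The test loop above calls `Agent(env)` and `agent.choose_action(observation, reward=..., terminated=..., truncated=..., info=...)`. A minimal sketch of a class with that interface is shown below, inferred only from those two calls; it is a random baseline for checking the plumbing, not the trained DQN policy, and ML-Arena may expect more than this.

```python
class Agent:
    """Random baseline exposing the interface used by the test loop (a sketch)."""

    def __init__(self, env):
        # Keep the action space so choose_action can sample valid actions.
        self.action_space = env.action_space

    def choose_action(self, observation, reward=None, terminated=False,
                      truncated=False, info=None):
        # A trained agent would map `observation` to an action here, for example
        # with the argmax of a Q-network; this baseline ignores its inputs.
        return self.action_space.sample()
```

Once this runs end to end, the body of `choose_action` can be replaced by a greedy lookup in the saved Q-network.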