# Problem Set 1: Markov Decision Processes and Dynamic Programming

#### Course: Reinforcement Learning
#### Instructor: Luiz Chaimowicz
#### Teaching Assistants: Marcelo Lemos and Ronaldo Vieira

---

## Instructions

- ***SUBMISSIONS THAT DO NOT FOLLOW THE INSTRUCTIONS BELOW WILL NOT BE GRADED.***
- Read the entire problem set carefully and familiarize yourself with the provided code before you start implementing.
- The places where you must write your solutions are marked with `# YOUR CODE HERE` or `YOUR ANSWER HERE` comments.
- **Do not change the code outside the indicated areas, and do not add or remove cells. The name of this file must not be changed either.**
- Before submitting, make sure the code runs from start to finish without errors.
- Submit only this notebook (*ps1.ipynb*) with your solutions on Moodle.
- Deadline: September 23, 2025. Late submissions are penalized 20% of the final grade per day of delay.
- Use the [Gymnasium documentation](https://gymnasium.farama.org/) to help with your implementation.
- If you have questions, ask in the "Dúvidas com relação aos exercícios e trabalho de curso" forum on the course Moodle.

---

## Frozen Lake

The Frozen Lake environment is a classic simulation used to train reinforcement learning agents. In this environment, the agent navigates a frozen lake represented by a grid of size $n \times m$, with the goal of reaching a target. The lake contains two types of cells: (1) cells with solid ice, which are safe for the agent to move on, and (2) cells with holes, into which the agent falls and fails the mission. Although Gymnasium already provides an implementation of Frozen Lake, in this exercise we will implement it from scratch.

At the start of each episode, the agent is placed in cell $[0, 0]$, while the target is placed in the cell farthest from the agent, at position $[n-1, m-1]$ of the $n \times m$ map. At each step, the agent receives an observation indicating its current position on the lake and chooses one of four actions: move up, down, left, or right. However, because the surface of the lake is slippery, the agent does not always move in the intended direction and may end up moving in a direction perpendicular to the chosen one. The agent receives a reward of 1 if it reaches the target and zero in all other states. An episode ends when the agent reaches the target or falls into the water.

In this exercise, we will always work with the same $4 \times 4$ map, shown in the figure below.

![Frozen Lake Map](https://gymnasium.farama.org/_images/frozen_lake.gif)
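
To make the map concrete, the sketch below writes the $4 \times 4$ layout from the figure as a list of strings and numbers the cells in row-major order, so that cell $[r, c]$ gets index $4r + c$. The row-major numbering and the names `LAKE_MAP`, `to_state`, `to_position`, and `is_terminal` are illustrative assumptions, not part of the starter code.

```python
# Assumed layout of the 4x4 map in the figure:
# S = start, F = frozen (safe), H = hole, G = goal.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_ROWS, N_COLS = len(LAKE_MAP), len(LAKE_MAP[0])


def to_state(row: int, col: int) -> int:
    """Row-major index of the cell at (row, col)."""
    return row * N_COLS + col


def to_position(state: int) -> tuple[int, int]:
    """Inverse of to_state: (row, col) of a state index."""
    return divmod(state, N_COLS)


def is_terminal(state: int) -> bool:
    """Holes and the goal end the episode."""
    row, col = to_position(state)
    return LAKE_MAP[row][col] in "HG"


holes = [s for s in range(N_ROWS * N_COLS) if LAKE_MAP[s // N_COLS][s % N_COLS] == "H"]
print(to_state(0, 0), to_state(3, 3))  # 0 15
print(holes)                           # [5, 7, 11, 12]
```

With this numbering the start is state 0 and the goal is state 15, so a single `Discrete(16)` value is enough to describe the agent's position.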

Your first task is to implement the Frozen Lake environment using the framework provided by Gymnasium. Below you will find starter code that must be used in your implementation. Follow these instructions to make sure your code behaves as expected:

1. In the `__init__` function, we have already defined the map that will be used and stored it in the `_description` variable. In this map, the letter 'S' represents the agent's starting position, the letter 'G' marks the target, the letters 'F' represent solid ice (which is safe), and the letters 'H' mark the holes. However, the environment still needs some additional information, specifically about how observations and actions are represented. Although there are several ways to represent the observation and action spaces, in this exercise you must use the simplest possible form, which can be represented by a single discrete value. In the `__init__` function, define the observation space and the action space, assigning them to `self.observation_space` and `self.action_space`, respectively. Use only the `gymnasium.spaces.Discrete` class for this task.

2. Before moving on to the Gymnasium functionality, let's implement some helper functions that will make the next steps easier. Implement the `_get_obs` function, which returns the current observation of the environment. Also implement the `_set_state` function, which receives an integer corresponding to a position on the lake and places the agent at that location.

3. The `reset` function must reset the environment and start a new episode, placing the agent in cell $[0, 0]$ and making all necessary internal adjustments. This function must return a tuple containing the initial observation and the environment information. In this exercise, we will return an empty dictionary `{}` as the information. Remember that the observation must be a single discrete value, as defined in item 1. Implement the `reset` function.

4. The `step()` function is responsible for updating the environment based on the action executed by the agent. It receives the action chosen by the agent, a `seed` parameter, and an `options` variable, and computes the new current state according to the transition function defined earlier. You can ignore `seed` and `options` in this exercise, since we will not need them (in the standard Gymnasium API they are arguments of `reset`, and `step` takes only the action). In the environment we are developing, the agent has an 80% chance of moving in the intended direction and a 20% chance of moving in a direction perpendicular to the chosen one, split equally between the two possible directions (10% each). **The agent's actions must be represented by the values 0 (move left), 1 (move down), 2 (move right), and 3 (move up)**. If the agent tries to move off the map, it stays in the same position. In addition, the function assigns a reward to the agent and checks whether the episode has ended. Implement the `step()` function so that it returns the observation of the current state, the reward received, a boolean indicating whether the state is terminal, a boolean indicating whether the episode was truncated, and the environment information. These last two values are required by the interface established by Gymnasium, but do not worry about them; always return `False` and `{}` for them.
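
The slippery dynamics described in item 4 can be written down as an explicit probability distribution over next cells. The sketch below is one possible way to do that in plain Python, not the reference solution: it assumes row-major state indices on a $4 \times 4$ grid, uses the hypothetical helpers `move` and `transition_distribution`, and ignores holes and the goal, which a full implementation must treat as terminal.

```python
from collections import defaultdict

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
# (row delta, column delta) for each action index.
DELTAS = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def move(state, action, n_rows=4, n_cols=4):
    """Deterministic move; the agent stays put if it would leave the map."""
    row, col = divmod(state, n_cols)
    d_row, d_col = DELTAS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < n_rows and 0 <= new_col < n_cols:
        return new_row * n_cols + new_col
    return state


def transition_distribution(state, action, n_rows=4, n_cols=4):
    """Map each reachable next state to its probability (80/10/10 slip model)."""
    outcomes = [
        (0.8, action),            # intended direction
        (0.1, (action - 1) % 4),  # one perpendicular direction
        (0.1, (action + 1) % 4),  # the other perpendicular direction
    ]
    dist = defaultdict(float)
    for prob, actual_action in outcomes:
        dist[move(state, actual_action, n_rows, n_cols)] += prob
    return dict(dist)


# From the start cell, RIGHT reaches state 1, slips DOWN to state 4,
# or slips UP into the wall and stays in state 0.
print(transition_distribution(0, RIGHT))
```

With the action order 0 = left, 1 = down, 2 = right, 3 = up, the two perpendicular directions of action `a` are `(a - 1) % 4` and `(a + 1) % 4`, so no lookup table is needed. Note that Gymnasium's built-in `FrozenLake-v1` uses a different slip model by default (probability 1/3 for each of the three directions).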

In [1]:
import sys
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt

In [3]:
# Install required packages for visual rendering
import subprocess
import sys

def install_packages():
    """Install required packages for visual rendering"""
    packages = [
        "pygame",
        "gymnasium[classic-control]",
        "gymnasium[box2d]",
        "gymnasium[toy-text]"
    ]

    for package in packages:
        try:
            print(f"Installing {package}...")
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
            print(f"✅ {package} installed successfully")
        except subprocess.CalledProcessError as e:
            print(f"❌ Failed to install {package}: {e}")

# Run installation
print("🔧 Installing packages for visual rendering...")
install_packages()
print("✅ Installation complete!")

🔧 Installing packages for visual rendering...
Installing pygame...
Collecting pygame
  Downloading pygame-2.6.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading pygame-2.6.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB)
Installing collected packages: pygame
Successfully installed pygame-2.6.1
✅ pygame installed successfully
Installing gymnasium[classic-control]...
✅ gymnasium[classic-control] installed successfully
Installing gymnasium[box2d]...
Collecting box2d-py==2.3.5 (from gymnasium[box2d])
  Downloading box2d-py-2.3.5.tar.gz (374 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting swig==4.* (from gymnasium[box2d])
  Downloading swig-4.3.1.post0-py3-none-manylinux_2_12_x86_64.manylinu

  DEPRECATION: Building 'box2d-py' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'box2d-py'. Discussion can be found at https://github.com/pypa/pip/issues/6334

  Building wheel for box2d-py (setup.py): finished with status 'error'
  Running setup.py clean for box2d-py


  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [29 lines of output]
      Using setuptools (version 78.1.1).
      !!

              ********************************************************************************
              Please consider removing the following classifiers in favor of a SPDX license expression:

              License :: OSI Approved :: zlib/libpng License

              See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
              ********************************************************************************

      !!
        self._finalize_license_expression()
      running bdist_wheel
      running build

Failed to build box2d-py
❌ Failed to install gymnasium[box2d]: Command '['/home/danielterra/miniconda3/envs/rl-exercise/bin/python', '-m', 'pip', 'install', 'gymnasium[box2d]']' returned non-zero exit status 1.
Installing gymnasium[toy-text]...


error: failed-wheel-build-for-install

× Failed to build installable wheels for some pyproject.toml based projects
╰─> box2d-py


✅ gymnasium[toy-text] installed successfully
✅ Installation complete!


## Testing Gymnasium Installation and Basic Usage

Before we start implementing the FrozenLake environment, let's verify that Gymnasium is working correctly and understand the latest API changes.

According to the [Gymnasium documentation](https://gymnasium.farama.org/introduction/basic_usage/), the API has evolved from the original OpenAI Gym. Key changes include:
- The `step()` method now returns 5 values: `observation, reward, terminated, truncated, info`
- `done` has been split into `terminated` (task completion/failure) and `truncated` (time limits)
- The package is imported as `gymnasium` rather than `gym`; with the common alias `import gymnasium as gym`, environments are still created with `gym.make()`
- The reset method returns `(observation, info)` instead of just `observation`

In [3]:
# Test 1: Basic Gymnasium installation and import
print("=" * 60)
print("🧪 TESTING GYMNASIUM INSTALLATION AND API")
print("=" * 60)
print(f"✅ Gymnasium version: {gym.__version__}")

# Test 2: Create a simple environment to verify API
print(f"\n📋 Testing CartPole environment creation...")
test_env = gym.make("CartPole-v1")  # Removed render_mode="human" for better compatibility
print(f"   Action space: {test_env.action_space}")
print(f"   Observation space: {test_env.observation_space}")
print(f"   Action meanings: 0=Push Left, 1=Push Right")

# Test 3: Test the new API with reset and step
print(f"\n🔄 Testing new Gymnasium API...")
obs, info = test_env.reset(seed=42)
print(f"   Initial observation: {obs}")
print(f"   Initial info: {info}")

action = test_env.action_space.sample()
obs, reward, terminated, truncated, info = test_env.step(action)
print(f"   After step - Action: {action}, Reward: {reward}")
print(f"   New observation: {obs}")
print(f"   Terminated: {terminated}, Truncated: {truncated}")

test_env.close()
print("\n✅ CartPole environment test PASSED!")

# Test 4: Test FrozenLake environment (which we'll implement)
print(f"\n🧊 Testing official FrozenLake environment...")
frozen_env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
print(f"   Action space: {frozen_env.action_space}")
print(f"   Observation space: {frozen_env.observation_space}")
print(f"   Action meanings: 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP")

obs, info = frozen_env.reset(seed=42)
print(f"   Initial observation (state): {obs}")

# Test transitions to understand the environment structure
print(f"   P[0][0] (state 0, action LEFT): {frozen_env.unwrapped.P[0][0]}")
frozen_env.close()
print("✅ FrozenLake environment test PASSED!")

print(f"\n🎉 ALL TESTS COMPLETED SUCCESSFULLY!")

🧪 TESTING GYMNASIUM INSTALLATION AND API
✅ Gymnasium version: 1.2.0

📋 Testing CartPole environment creation...
   Action space: Discrete(2)
   Observation space: Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)
   Action meanings: 0=Push Left, 1=Push Right

🔄 Testing new Gymnasium API...
   Initial observation: [ 0.0273956  -0.00611216  0.03585979  0.0197368 ]
   Initial info: {}
   After step - Action: 0, Reward: 1.0
   New observation: [ 0.02727336 -0.20172954  0.03625453  0.32351476]
   Terminated: False, Truncated: False

✅ CartPole environment test PASSED!

🧊 Testing official FrozenLake environment...
   Action space: Discrete(4)
   Observation space: Discrete(16)
   Action meanings: 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP
   Initial observation (state): 0
   P[0][0] (state 0, action LEFT): [(1.0, 0, 0.0, False)]
✅ FrozenLake environment test PASSED!

🎉 ALL TESTS COMPLETED SUCCESSFULLY!


In [4]:
# Test 5: Additional Environment Example - MountainCar
print("=" * 60)
print("🚗 TESTING ANOTHER CLASSIC ENVIRONMENT: MOUNTAIN CAR")
print("=" * 60)

mountain_env = gym.make("MountainCar-v0")
print(f"Environment: {mountain_env.spec.id}")
print(f"Action space: {mountain_env.action_space}")
print(f"Observation space: {mountain_env.observation_space}")
print(f"Action meanings: 0=Push Left, 1=No Push, 2=Push Right")

# Reset and show initial state
obs, info = mountain_env.reset(seed=42)
print(f"\nInitial observation: {obs}")
print(f"   Position: {obs[0]:.4f} (range: -1.2 to 0.6)")
print(f"   Velocity: {obs[1]:.4f} (range: -0.07 to 0.07)")

# Take a few random actions
print(f"\nTaking 5 random actions:")
for i in range(5):
    action = mountain_env.action_space.sample()
    obs, reward, terminated, truncated, info = mountain_env.step(action)
    action_names = ["Push Left", "No Push", "Push Right"]
    print(f"   Step {i+1}: Action={action} ({action_names[action]}), "
          f"Position={obs[0]:.4f}, Velocity={obs[1]:.4f}, Reward={reward}")

    if terminated or truncated:
        print(f"   Episode ended! Terminated={terminated}, Truncated={truncated}")
        break

mountain_env.close()
print("✅ MountainCar environment test PASSED!")

print(f"\n📊 COMPARISON OF ENVIRONMENT TYPES:")
print(f"   CartPole: Discrete actions, Continuous observations, Episode-based")
print(f"   FrozenLake: Discrete actions, Discrete observations, Grid-world")
print(f"   MountainCar: Discrete actions, Continuous observations, Physics-based")

🚗 TESTING ANOTHER CLASSIC ENVIRONMENT: MOUNTAIN CAR
Environment: MountainCar-v0
Action space: Discrete(3)
Observation space: Box([-1.2  -0.07], [0.6  0.07], (2,), float32)
Action meanings: 0=Push Left, 1=No Push, 2=Push Right

Initial observation: [-0.4452088  0.       ]
   Position: -0.4452 (range: -1.2 to 0.6)
   Velocity: 0.0000 (range: -0.07 to 0.07)

Taking 5 random actions:
   Step 1: Action=1 (No Push), Position=-0.4458, Velocity=-0.0006, Reward=-1.0
   Step 2: Action=0 (Push Left), Position=-0.4480, Velocity=-0.0022, Reward=-1.0
   Step 3: Action=0 (Push Left), Position=-0.4517, Velocity=-0.0037, Reward=-1.0
   Step 4: Action=0 (Push Left), Position=-0.4569, Velocity=-0.0053, Reward=-1.0
   Step 5: Action=0 (Push Left), Position=-0.4637, Velocity=-0.0068, Reward=-1.0
✅ MountainCar environment test PASSED!

📊 COMPARISON OF ENVIRONMENT TYPES:
   CartPole: Discrete actions, Continuous observations, Episode-based
   FrozenLake: Discrete actions, Discrete observations, Grid-world
  

In [5]:
# Test 6: Interactive Episode Demonstration
print("=" * 60)
print("🎮 INTERACTIVE EPISODE DEMONSTRATION")
print("=" * 60)

def run_episode_demo(env_name, max_steps=20, seed=42):
    """
    Run a complete episode with detailed step-by-step output.
    """
    env = gym.make(env_name)
    obs, info = env.reset(seed=seed)

    print(f"🎬 Starting episode in {env_name}")
    print(f"📊 Action space: {env.action_space}")
    print(f"📊 Observation space: {env.observation_space}")
    print(f"🎯 Initial state: {obs}")

    total_reward = 0
    step_count = 0

    print(f"\n🎮 Episode progression:")
    print("-" * 50)

    while step_count < max_steps:
        # Take random action
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)

        step_count += 1
        total_reward += reward

        # Format output based on environment type
        if env_name == "FrozenLake-v1":
            # Convert state to grid coordinates
            row, col = divmod(obs, 4)
            print(f"Step {step_count:2d}: Action={action} → State={obs} [{row},{col}] | Reward={reward} | Total={total_reward}")
        else:
            # For continuous observations, show abbreviated version
            if hasattr(obs, '__len__') and len(obs) > 1:
                obs_str = f"[{obs[0]:.3f}, {obs[1]:.3f}]"
            else:
                obs_str = f"{obs}"
            print(f"Step {step_count:2d}: Action={action} → Obs={obs_str} | Reward={reward} | Total={total_reward}")

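        # `terminated` is True for any terminal state: in FrozenLake that means
        # the goal or a hole, so "Task completed" does not imply success.
        if terminated or truncated:
            print(f"🏁 Episode ended at step {step_count}!")
            print(f"   Reason: {'Task completed' if terminated else 'Time limit reached'}")
            break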

    env.close()
    print(f"📈 Final results: {step_count} steps, Total reward: {total_reward}")
    return step_count, total_reward

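# Demo 1: FrozenLake episode (short and visual)
# Note: gym.make("FrozenLake-v1") defaults to is_slippery=True, so this run is
# stochastic despite the "Deterministic" label printed below.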
print("Demo 1: FrozenLake Deterministic")
frozen_steps, frozen_reward = run_episode_demo("FrozenLake-v1", max_steps=50, seed=123)

print(f"\n" + "=" * 60)

# Demo 2: CartPole episode
print("Demo 2: CartPole Balancing")
cartpole_steps, cartpole_reward = run_episode_demo("CartPole-v1", max_steps=10, seed=456)

print(f"\n🎯 SUMMARY:")
print(f"   FrozenLake: {frozen_steps} steps, reward {frozen_reward}")
print(f"   CartPole: {cartpole_steps} steps, reward {cartpole_reward}")
print(f"\n💡 Notice how different environments have different reward structures!")

🎮 INTERACTIVE EPISODE DEMONSTRATION
Demo 1: FrozenLake Deterministic
🎬 Starting episode in FrozenLake-v1
📊 Action space: Discrete(4)
📊 Observation space: Discrete(16)
🎯 Initial state: 0

🎮 Episode progression:
--------------------------------------------------
Step  1: Action=3 → State=1 [0,1] | Reward=0.0 | Total=0.0
Step  2: Action=2 → State=5 [1,1] | Reward=0.0 | Total=0.0
🏁 Episode ended at step 2!
   Reason: Task completed
📈 Final results: 2 steps, Total reward: 0.0

Demo 2: CartPole Balancing
🎬 Starting episode in CartPole-v1
📊 Action space: Discrete(2)
📊 Observation space: Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)
🎯 Initial state: [-0.00303268 -0.00523447 -0.03759432  0.025485  ]

🎮 Episode progression:
--------------------------------------------------
Step  1: Action=1 → Obs=[-0.003, 0.190] | Reward=1.0 | Total=1.0
Step  2: Action=0 → Obs=[0.001, -0.004] | Reward=1.0 | Total=2.0
Step  3: Action=0 → Obs=

## ✅ Gymnasium Testing Complete

### **What we've verified:**

1. **🔧 Installation**: Gymnasium 1.2.0 is properly installed and working
2. **📡 API Compatibility**: New API structure confirmed:
   - `reset()` returns `(observation, info)`
   - `step()` returns `(observation, reward, terminated, truncated, info)`
   - Proper separation of `terminated` vs `truncated`

3. **🎮 Environment Examples Tested:**
   - **CartPole-v1**: Continuous observations, discrete actions, balancing task
   - **FrozenLake-v1**: Discrete observations, discrete actions, grid navigation  
   - **MountainCar-v0**: Continuous observations, discrete actions, physics simulation

4. **📊 Transition Structure**: Verified access to `env.unwrapped.P[state][action]` for FrozenLake

### **Key Insights for Implementation:**
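- ✅ FrozenLake's observation and action spaces are both `gym.spaces.Discrete` (16 states, 4 actions)
- ✅ State transitions follow the expected format: `[(prob, next_state, reward, terminated), ...]`
- ✅ Passing `seed` to `reset()` seeds the environment's random number generator; `action_space.sample()` draws from a separate generator, seeded with `action_space.seed()`
- ✅ Reward structures differ between environments: +1 per step in CartPole, -1 per step in MountainCar, and 1 only at the goal in FrozenLake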
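
Given transitions in the `[(prob, next_state, reward, terminated), ...]` format, the expected return of an action is a probability-weighted sum over its outcomes, which is the basic operation of dynamic programming on an MDP. The sketch below is a minimal illustration: the function name `q_value`, the discount factor, the value estimates in `V`, and the single hand-written entry of `P` (state 14, action RIGHT, under an 80/10/10 slip model) are all assumptions made for the example.

```python
def q_value(P, V, state, action, gamma=0.9):
    """One-step lookahead: expected return of taking `action` in `state`."""
    total = 0.0
    for prob, next_state, reward, terminated in P[state][action]:
        # A terminal next state contributes no future value.
        future = 0.0 if terminated else V[next_state]
        total += prob * (reward + gamma * future)
    return total


# Hypothetical fragment of a transition table: from state 14, RIGHT reaches
# the goal (15) with 0.8, slips UP to 10 with 0.1, or slips DOWN into the
# wall and stays in 14 with 0.1.
P = {14: {2: [(0.8, 15, 1.0, True), (0.1, 10, 0.0, False), (0.1, 14, 0.0, False)]}}
V = {10: 0.5, 14: 0.6, 15: 0.0}  # made-up value estimates

print(round(q_value(P, V, 14, 2), 3))  # 0.899
```

Here the result is $0.8 \cdot 1 + 0.1 \cdot (0.9 \cdot 0.5) + 0.1 \cdot (0.9 \cdot 0.6) = 0.899$.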

### **Ready for Implementation:**
We now have a solid foundation for implementing our custom FrozenLake environment following the established patterns and API conventions. The tests above show the shape of the state transitions, rewards, and termination flags that our implementation should reproduce.

## 🎬 Visual Interactive Examples - Pop-up Windows

The following examples will open visual windows where you can see the environments in action! Each environment will display in a separate pop-up window showing real-time interactions.
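
When no display is available, a grid world can still be inspected as text. The sketch below is a hypothetical helper, `render_text`, that draws the assumed $4 \times 4$ layout with the agent's cell marked by brackets; it is independent of Gymnasium's own renderers and assumes row-major state indices.

```python
# Assumed layout of the 4x4 map: S = start, F = frozen, H = hole, G = goal.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]


def render_text(state: int, lake_map=LAKE_MAP) -> str:
    """Return the grid as text, with the agent's cell shown as [X]."""
    n_cols = len(lake_map[0])
    agent = divmod(state, n_cols)  # (row, col) of the agent
    lines = []
    for r, row in enumerate(lake_map):
        cells = [
            f"[{ch}]" if (r, c) == agent else f" {ch} "
            for c, ch in enumerate(row)
        ]
        lines.append("".join(cells))
    return "\n".join(lines)


print(render_text(4))  # agent in row 1, column 0
```

Printing this after every step gives a rough trace of an episode in a headless session.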

In [4]:
# Visual Example 1: CartPole with Real-time Display
import time

print("🎪 VISUAL CARTPOLE DEMONSTRATION")
print("=" * 50)
print("🎬 Attempting to open CartPole window...")

try:
    # Create environment with human rendering (pop-up window)
    cartpole_visual = gym.make("CartPole-v1", render_mode="human")
    print("✅ CartPole environment created successfully!")

    # Reset and start episode
    obs, info = cartpole_visual.reset(seed=42)
    print(f"🎯 Starting CartPole visual episode...")
    print("📺 A window should have opened showing the CartPole simulation!")

    total_reward = 0
    step_count = 0
    max_steps = 20  # Reduced for better performance

    print("🎮 Taking random actions - watch the window!")

    for step in range(max_steps):
        # Take a random action
        action = cartpole_visual.action_space.sample()
        obs, reward, terminated, truncated, info = cartpole_visual.step(action)

        total_reward += reward
        step_count += 1

        # Print progress every 5 steps
        if step % 5 == 0:
            action_name = "Push Left" if action == 0 else "Push Right"
            print(f"   Step {step:2d}: {action_name} | Pole angle: {obs[2]:.3f} | Total reward: {total_reward}")

        # Add small delay to see the action
        time.sleep(0.2)

        # Check if episode ended
        if terminated or truncated:
            print(f"🏁 Episode ended at step {step}!")
            print(f"   Reason: {'Pole fell down' if terminated else 'Time limit reached'}")
            break

    cartpole_visual.close()
    print(f"📊 Final Results: {step_count} steps, Total reward: {total_reward}")
    print("✅ CartPole visual demonstration complete!")

except Exception as e:
    print(f"❌ Error running visual demonstration: {e}")
    print("💡 This might be due to display/GUI limitations in the current environment")
    print("🔄 Running text-only version instead...")

    # Fallback to text-only version
    cartpole_text = gym.make("CartPole-v1")
    obs, info = cartpole_text.reset(seed=42)

    print("📊 Text-only CartPole demonstration:")
    total_reward = 0

    for step in range(10):
        action = cartpole_text.action_space.sample()
        obs, reward, terminated, truncated, info = cartpole_text.step(action)
        total_reward += reward

        action_name = "Push Left" if action == 0 else "Push Right"
        print(f"   Step {step:2d}: {action_name} | Pole angle: {obs[2]:.3f} | Reward: {reward}")

        if terminated or truncated:
            break

    cartpole_text.close()
    print(f"✅ Text demonstration complete! Total reward: {total_reward}")

🎪 VISUAL CARTPOLE DEMONSTRATION
🎬 Attempting to open CartPole window...
✅ CartPole environment created successfully!
🎯 Starting CartPole visual episode...
📺 A window should have opened showing the CartPole simulation!
🎮 Taking random actions - watch the window!
   Step  0: Push Right | Pole angle: 0.036 | Total reward: 1.0
   Step  5: Push Left | Pole angle: 0.000 | Total reward: 6.0
   Step 10: Push Left | Pole angle: -0.027 | Total reward: 11.0
   Step 15: Push Right | Pole angle: -0.053 | Total reward: 16.0
📊 Final Results: 20 steps, Total reward: 20.0
✅ CartPole visual demonstration complete!


In [None]:
# Visual Example 2: MountainCar with Real-time Display
print("\n" + "=" * 50)
print("🏔️  VISUAL MOUNTAIN CAR DEMONSTRATION")
print("=" * 50)
print("🎬 Opening MountainCar window... Watch the car trying to reach the flag!")
print("⏱️  Episode will run for 50 steps to show the physics")

# Create environment with human rendering (pop-up window)
mountain_visual = gym.make("MountainCar-v0", render_mode="human")

# Reset and start episode
obs, info = mountain_visual.reset(seed=123)
print(f"🎯 Starting MountainCar visual episode...")
print(f"🚗 Initial position: {obs[0]:.3f}, velocity: {obs[1]:.3f}")

total_reward = 0
step_count = 0
max_steps = 50

print("🎮 Taking strategic actions - watch the car build momentum!")

for step in range(max_steps):
    # In "human" mode step() already refreshes the window, so this explicit
    # render() call is optional.
    mountain_visual.render()

    # Take a strategic action: push in the direction of motion to build
    # momentum; when stationary (velocity 0), push right to start moving.
    action = 0 if obs[1] < 0 else 2

    obs, reward, terminated, truncated, info = mountain_visual.step(action)

    total_reward += reward
    step_count += 1

    # Print progress every 10 steps
    if step % 10 == 0:
        action_names = ["Push Left", "No Push", "Push Right"]
        print(f"   Step {step:2d}: {action_names[action]} | Pos: {obs[0]:.3f} | Vel: {obs[1]:.3f} | Reward: {reward}")

    # Add small delay to see the movement
    time.sleep(0.1)

    # Check if episode ended (reached the flag)
    if terminated or truncated:
        print(f"🏁 Episode ended at step {step}!")
        if terminated:
            print("🎉 SUCCESS! Car reached the flag!")
        else:
            print("⏰ Time limit reached")
        break

mountain_visual.close()
print(f"📊 Final Results: {step_count} steps, Total reward: {total_reward}")
print("✅ MountainCar visual demonstration complete!")

In [5]:
# Visual Example 3: FrozenLake with Real-time Display
print("\n" + "=" * 50)
print("🧊 VISUAL FROZEN LAKE DEMONSTRATION")
print("=" * 50)
print("🎬 Attempting to open FrozenLake window...")

try:
    # Create environment with human rendering (pop-up window)
    frozen_visual = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True, render_mode="human")
    print("✅ FrozenLake environment created successfully!")

    # Reset and start episode
    obs, info = frozen_visual.reset(seed=456)
    print(f"🎯 Starting FrozenLake visual episode...")
    print("📺 A window should have opened showing the FrozenLake grid!")
    print(f"🧊 Initial state: {obs} (position [0,0])")

    total_reward = 0
    step_count = 0
    max_steps = 15  # Reduced for better visibility

    action_names = ["LEFT ←", "DOWN ↓", "RIGHT →", "UP ↑"]
    map_symbols = {0: 'S', 1: 'F', 2: 'F', 3: 'F', 4: 'F', 5: 'H', 6: 'F', 7: 'H',
                   8: 'F', 9: 'F', 10: 'F', 11: 'H', 12: 'H', 13: 'F', 14: 'F', 15: 'G'}

    print("🎮 Taking random actions - watch the agent slip on the ice!")
    print("Legend: S=Start, F=Frozen(safe), H=Hole(danger), G=Goal")

    for step in range(max_steps):
        # Take a random action
        action = frozen_visual.action_space.sample()

        # Show intended action
        row, col = divmod(obs, 4)
        print(f"   Step {step+1:2d}: At [{row},{col}] ({map_symbols[obs]}) → Action: {action_names[action]}", end="")

        obs, reward, terminated, truncated, info = frozen_visual.step(action)

        total_reward += reward
        step_count += 1

        # Show result
        new_row, new_col = divmod(obs, 4)
        print(f" → Landed at [{new_row},{new_col}] ({map_symbols[obs]}) | Reward: {reward}")

        # Add delay to see the movement
        time.sleep(1.0)

        # Check if episode ended
        if terminated or truncated:
            print(f"🏁 Episode ended at step {step+1}!")
            if obs == 15:  # Goal state
                print("🎉 SUCCESS! Agent reached the goal!")
            elif map_symbols[obs] == 'H':
                print("💀 FAILED! Agent fell into a hole!")
            break

    frozen_visual.close()
    print(f"📊 Final Results: {step_count} steps, Total reward: {total_reward}")
    print("✅ FrozenLake visual demonstration complete!")

except Exception as e:
    print(f"❌ Error running visual demonstration: {e}")
    print("💡 This might be due to display/GUI limitations in the current environment")
    print("🔄 Running text-only version instead...")

    # Fallback to text-only version
    frozen_text = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
    obs, info = frozen_text.reset(seed=456)

    print("📊 Text-only FrozenLake demonstration:")
    total_reward = 0

    for step in range(10):
        action = frozen_text.action_space.sample()
        row, col = divmod(obs, 4)
        action_names = ["LEFT", "DOWN", "RIGHT", "UP"]

        print(f"   Step {step+1}: At [{row},{col}] → {action_names[action]}", end="")

        obs, reward, terminated, truncated, info = frozen_text.step(action)
        total_reward += reward

        new_row, new_col = divmod(obs, 4)
        print(f" → [{new_row},{new_col}] | Reward: {reward}")

        if terminated or truncated:
            if obs == 15:
                print("🎉 SUCCESS! Reached goal!")
            elif terminated:
                # Terminated without reaching the goal means a hole.
                print("💀 Fell in hole!")
            break

    frozen_text.close()
    print(f"✅ Text demonstration complete! Total reward: {total_reward}")


🧊 VISUAL FROZEN LAKE DEMONSTRATION
🎬 Attempting to open FrozenLake window...
✅ FrozenLake environment created successfully!
🎯 Starting FrozenLake visual episode...
📺 A window should have opened showing the FrozenLake grid!
🧊 Initial state: 0 (position [0,0])
🎮 Taking random actions - watch the agent slip on the ice!
Legend: S=Start, F=Frozen(safe), H=Hole(danger), G=Goal
   Step  1: At [0,0] (S) → Action: DOWN ↓ → Landed at [1,0] (F) | Reward: 0.0
   Step  2: At [1,0] (F) → Action: RIGHT → → Landed at [2,0] (F) | Reward: 0.0
   Step  3: At [2,0] (F) → Action: LEFT ← → Landed at [3,0] (H) | Reward: 0.0
🏁 Episode ended at step 3!
💀 FAILED! Agent fell into a hole!
📊 Final Results: 3 steps, Total reward: 0.0
✅ FrozenLake visual demonstration complete!


In [8]:
# Final Visual Demo: Multiple Environments
print("=" * 60)
print("COMPREHENSIVE VISUAL DEMONSTRATION")
print("=" * 60)
print("This will demonstrate multiple environments with pop-up windows!")
print("You should see separate windows opening for each environment.")
print("Each demo runs for a short time to show the key features.")

def run_visual_environment(env_name, config, demo_name, steps=8):
    """Run a visual demonstration of an environment."""
    print(f"\n--- {demo_name.upper()} DEMO ---")

    try:
        env = gym.make(env_name, **config)
        print(f"SUCCESS: {demo_name} window should now be open!")

        obs, info = env.reset(seed=42)
        total_reward = 0

        for step in range(steps):
            action = env.action_space.sample()
            obs, reward, terminated, truncated, info = env.step(action)
            total_reward += reward

            if step % 2 == 0:  # Print every 2nd step
                print(f"  Step {step}: Action {action} | Reward: {reward}")

            time.sleep(0.4)  # Pause to see the action

            if terminated or truncated:
                print(f"  Episode ended at step {step}!")
                break

        env.close()
        print(f"COMPLETE: {demo_name} demo finished! Total reward: {total_reward}")

    except Exception as e:
        print(f"ERROR: Could not run visual demo for {demo_name}: {e}")

# Environment configurations
print(f"\nStarting visual demonstrations...")
print(f"Make sure your display is ready to show pop-up windows!")

# Demo 1: CartPole
print(f"\nDemo 1/2: CartPole")
run_visual_environment("CartPole-v1", {"render_mode": "human"}, "CartPole", 8)

time.sleep(1)

# Demo 2: FrozenLake
print(f"\nDemo 2/2: FrozenLake")
run_visual_environment("FrozenLake-v1",
                      {"map_name": "4x4", "is_slippery": False, "render_mode": "human"},
                      "FrozenLake", 10)

print(f"\nALL VISUAL DEMONSTRATIONS COMPLETE!")
print(f"Summary of what you should have seen:")
print(f"  1. CartPole: Cart with pole balancing simulation")
print(f"  2. FrozenLake: Grid navigation game")
print(f"\nIf windows didn't appear, it might be due to:")
print(f"  - Headless environment (no display)")
print(f"  - System restrictions")
print(f"  - Missing display drivers")
print(f"\nThe environments are working correctly!")

COMPREHENSIVE VISUAL DEMONSTRATION
This will demonstrate multiple environments with pop-up windows!
You should see separate windows opening for each environment.
Each demo runs for a short time to show the key features.

Starting visual demonstrations...
Make sure your display is ready to show pop-up windows!

Demo 1/2: CartPole

--- CARTPOLE DEMO ---
SUCCESS: CartPole window should now be open!
  Step 0: Action 0 | Reward: 1.0
  Step 2: Action 0 | Reward: 1.0
  Step 4: Action 0 | Reward: 1.0
  Step 6: Action 1 | Reward: 1.0
COMPLETE: CartPole demo finished! Total reward: 8.0

Demo 2/2: FrozenLake

--- FROZENLAKE DEMO ---
SUCCESS: FrozenLake window should now be open!
  Step 0: Action 2 | Reward: 0.0
  Step 2: Action 3 | Reward: 0.0
  Step 4: Action 1 | Reward: 0.0
  Episode ended at step 5!
COMPLETE: FrozenLake demo finished! Total reward: 0.0

ALL VISUAL DEMONSTRATIONS COMPLETE!
Summary of what you should have seen:
  1. CartPole: Cart with pole balancing simulation
  2. FrozenLake: 

In [None]:
# Detailed analysis of FrozenLake environment structure
print("=== ANALYSIS OF FROZENLAKE ENVIRONMENT STRUCTURE ===\n")

# Create both deterministic and slippery versions for comparison
deterministic_env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
slippery_env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)

print("1. ENVIRONMENT DIMENSIONS:")
print(f"   Map description shape: {deterministic_env.unwrapped.desc.shape}")
print(f"   Total states: {deterministic_env.observation_space.n}")
print(f"   Total actions: {deterministic_env.action_space.n}")
print(f"   Action meanings: 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP")

print("\n2. MAP LAYOUT:")
desc = deterministic_env.unwrapped.desc
for i, row in enumerate(desc):
    print(f"   Row {i}: {[cell.decode() for cell in row]}")

print("\n3. TRANSITION PROBABILITIES STRUCTURE:")
print("   Format: [(probability, next_state, reward, terminated), ...]")
print(f"   Deterministic env, state 0, action 0 (LEFT): {deterministic_env.unwrapped.P[0][0]}")
print(f"   Deterministic env, state 0, action 1 (DOWN): {deterministic_env.unwrapped.P[0][1]}")
print(f"   Slippery env, state 0, action 1 (DOWN): {slippery_env.unwrapped.P[0][1]}")

print("\n4. STATE-TO-COORDINATE MAPPING:")
print("   States are numbered 0-15 in row-major order:")
for state in range(16):
    row, col = divmod(state, 4)
    cell_type = desc[row, col].decode()
    print(f"   State {state:2d} -> ({row},{col}) = '{cell_type}'")

print("\n5. KEY INSIGHTS FOR IMPLEMENTATION:")
print("   - States: 0-15 (single discrete value)")
print("   - Actions: 0-3 (LEFT, DOWN, RIGHT, UP)")
print("   - Transitions: P[state][action] gives list of (prob, next_state, reward, terminated)")
print("   - Rewards: 0 everywhere except goal state (reward=1)")
print("   - Terminal states: holes and goal")
print("   - Slippery: 1/3 intended direction, 1/3 each perpendicular direction")

deterministic_env.close()
slippery_env.close()
print("\n✅ Environment structure analysis complete!")

=== ANALYSIS OF FROZENLAKE ENVIRONMENT STRUCTURE ===

1. ENVIRONMENT DIMENSIONS:
   Map description shape: (4, 4)
   Total states: 16
   Total actions: 4
   Action meanings: 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP

2. MAP LAYOUT:
   Row 0: ['S', 'F', 'F', 'F']
   Row 1: ['F', 'H', 'F', 'H']
   Row 2: ['F', 'F', 'F', 'H']
   Row 3: ['H', 'F', 'F', 'G']

3. TRANSITION PROBABILITIES STRUCTURE:
   Format: [(probability, next_state, reward, terminated), ...]
   Deterministic env, state 0, action 0 (LEFT): [(1.0, 0, 0.0, False)]
   Deterministic env, state 0, action 1 (DOWN): [(1.0, 4, 0.0, False)]
   Slippery env, state 0, action 1 (DOWN): [(0.3333333333333333, 0, 0.0, False), (0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False)]

4. STATE-TO-COORDINATE MAPPING:
   States are numbered 0-15 in row-major order:
   State  0 -> (0,0) = 'S'
   State  1 -> (0,1) = 'F'
   State  2 -> (0,2) = 'F'
   State  3 -> (0,3) = 'F'
   State  4 -> (1,0) = 'F'
   State  5 -> (1,1) = 'H'
   State  6 

## Implementation Strategy and Best Practices

Based on the analysis above and following RL engineering best practices, our implementation should:

### 🔍 **Key Design Principles:**
1. **Correctness over Convenience**: Ensure exact adherence to Gymnasium API
2. **Reproducibility**: Use proper random seeding and deterministic state management
3. **Safety**: Validate inputs and handle edge cases gracefully
4. **Maintainability**: Write clean, well-documented code with clear separation of concerns

### 📋 **Implementation Checklist:**

#### **Environment Structure (Priority: High)**
- [ ] Proper inheritance from `gym.Env`
- [ ] Correct space definitions using `gym.spaces.Discrete`
- [ ] State representation: single integer 0-15
- [ ] Action representation: single integer 0-3 (LEFT, DOWN, RIGHT, UP)

#### **Core Methods (Priority: High)**
- [ ] `__init__()`: Initialize spaces and internal state
- [ ] `reset()`: Return `(observation, info)` tuple, proper seeding
- [ ] `step()`: Return `(observation, reward, terminated, truncated, info)` tuple
- [ ] Transition function with proper slippery mechanics (1/3 intended direction, 1/3 each perpendicular direction)

#### **Helper Methods (Priority: Medium)**
- [ ] `_get_obs()`: Convert internal state to observation
- [ ] `_set_state()`: Validate and set internal state
- [ ] Boundary checking for attempted moves outside grid
- [ ] Proper reward calculation (1 for goal, 0 elsewhere)

#### **Validation & Testing (Priority: High)**
- [ ] Input validation for actions and states
- [ ] Edge case handling (boundaries, terminal states)
- [ ] Consistency checks with reference implementation
- [ ] Deterministic behavior for testing

### ⚠️ **Common Pitfalls to Avoid:**
1. **API Inconsistency**: Not returning correct tuple structures
2. **State Management**: Forgetting to update internal state properly
3. **Transition Logic**: Incorrect slippery movement implementation
4. **Boundary Handling**: Allowing invalid moves or state transitions
5. **Seeding Issues**: Not properly handling random state for reproducibility

### 🧪 **Testing Strategy:**
1. **Unit Tests**: Test each method individually
2. **Integration Tests**: Full episode runs
3. **Consistency Tests**: Compare with official implementation
4. **Edge Case Tests**: Boundary conditions and invalid inputs
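
The slippery mechanics and boundary handling listed above can be illustrated with a small standalone sketch. It assumes the action encoding from the analysis (0=LEFT, 1=DOWN, 2=RIGHT, 3=UP) and the 4x4 row-major state layout. The helper names `candidate_actions`, `move`, and `slippery_step` are hypothetical and not part of the provided skeleton. Inside a `gym.Env` subclass, the generator passed as `rng` would typically be `self.np_random`, which is seeded by `super().reset(seed=seed)`.

```python
import numpy as np

# Row/column offsets per action, assuming 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP.
DELTAS = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def candidate_actions(action):
    """Return the intended action together with its two perpendicular actions."""
    return [(action - 1) % 4, action, (action + 1) % 4]


def move(row, col, action, n_rows=4, n_cols=4):
    """Apply one move, clamping to the grid so that hitting a wall keeps the agent in place."""
    d_row, d_col = DELTAS[action]
    new_row = min(max(row + d_row, 0), n_rows - 1)
    new_col = min(max(col + d_col, 0), n_cols - 1)
    return new_row, new_col


def slippery_step(state, action, rng, n_rows=4, n_cols=4):
    """Sample the next state, choosing uniformly among the three candidate actions."""
    row, col = divmod(state, n_cols)
    actual_action = int(rng.choice(candidate_actions(action)))
    new_row, new_col = move(row, col, actual_action, n_rows, n_cols)
    return new_row * n_cols + new_col


# Usage: from state 0 with action DOWN, only states 0, 1 and 4 are reachable,
# which agrees with the slippery transition list printed in the analysis above.
rng = np.random.default_rng(42)
reachable = sorted({slippery_step(0, 1, rng) for _ in range(200)})
print(reachable)
```

This sketch covers only the movement rule. Rewards, terminal states, and the Gymnasium return tuples are left to the actual implementation.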

In [None]:
class FrozenLake(gym.Env):
    def __init__(self):
        self._description = np.asarray([
            "SFFF",
            "FHFH",
            "FFFH",
            "HFFG"
        ], dtype='c')

        # YOUR CODE HERE
        raise NotImplementedError()

    def _get_obs(self):
        # YOUR CODE HERE
        raise NotImplementedError()

    def _set_state(self, state):
        # YOUR CODE HERE
        raise NotImplementedError()

    def reset(self, seed = None, options = None):
        super().reset(seed=seed)

        # YOUR CODE HERE
        raise NotImplementedError()

    def step(self, action):
        # YOUR CODE HERE
        raise NotImplementedError()

Make sure your environment works with the cell below.

**Attention:** the provided tests do not cover every possible case. Run additional tests to ensure your implementation is correct.

In [None]:
env = FrozenLake()

obs, info = env.reset()
assert obs == 0, f"Expected initial observation 0, got {obs}"

env._set_state(5)
obs = env._get_obs()
assert obs == 5, f"Expected state 5, got {obs}"

for _ in range(30):
    action = env.action_space.sample()
    assert 0 <= action < 4, f"Action outside the expected range: {action}"

    obs, reward, terminated, truncated, info = env.step(action)

    assert 0 <= obs < 16, f"Observation outside the expected range: {obs}"
    assert reward in [0, 1], f"Invalid reward: {reward}"
    assert isinstance(terminated, bool), f"'terminated' must be bool, but got {type(terminated)}"
    assert truncated is False, f"'truncated' must be False, but got {truncated}"
    assert isinstance(info, dict), f"'info' must be dict, but got {type(info)}"

In [None]:
# Do not modify or remove this cell

In [None]:
# Do not modify or remove this cell

## Policy Iteration

Now that we are familiar with the Frozen Lake environment, our goal is to find an optimal policy for it. This time we will use the official Frozen Lake implementation provided by Gymnasium, which has some properties that will simplify the upcoming implementations. Your task is to implement the *Policy Iteration* algorithm, as illustrated below.

![Policy Iteration](policy_iteration.png)

5. The implementation will be done in stages. Start by implementing the function `init_policy_iteration`, which initializes and returns two arrays. The first array stores the expected value of each state, $V(s)$, while the second holds the agent's policy: for each state, it indicates the action the agent should take. Both arrays must be initialized with zeros.

In [None]:
def init_policy_iteration(env: gym.Env) -> tuple[np.ndarray[float], np.ndarray[int]]:
    # YOUR CODE HERE
    raise NotImplementedError()

6. Now, let's compute the expected value $V(s) = \sum_{s', r}p(s',r|s, a)[r + \gamma V(s')]$ of taking action $a$ in state $s$. Implement the function `compute_expected_value`, which receives as parameters the environment, the vector $V$, a state, an action, and the value of $\gamma$ (discount factor), and returns the expected value. Do not modify the values of $V$ in this function.

**Important:** The variable `env.unwrapped.P[state][action]` contains the environment's transitions, returning a list of all possible transitions for the pair (state, action). Each element of this list includes, in the following order: the transition probability, the reached state $s'$, the reward received, and a terminal-state indicator.
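
To make the transition format concrete, here is a minimal sketch of the sum above over a hand-written transition list. The list mimics the shape of one `env.unwrapped.P[state][action]` entry, but the values of `V` are made up for illustration, and the name `expected_backup` is hypothetical.

```python
def expected_backup(transitions, V, gamma):
    """Sum prob * (reward + gamma * V[next_state]) over a list of transitions."""
    total = 0.0
    for prob, next_state, reward, terminated in transitions:
        total += prob * (reward + gamma * V[next_state])
    return total


# Same structure as the slippery entry for state 0, action DOWN shown earlier.
transitions = [
    (1 / 3, 0, 0.0, False),
    (1 / 3, 4, 0.0, False),
    (1 / 3, 1, 0.0, False),
]

# Illustrative values only; index 4 must exist because state 4 is reachable.
V = [0.0, 0.3, 0.0, 0.0, 0.6]

value = expected_backup(transitions, V, gamma=0.9)
print(round(value, 4))  # 0.27
```

The `terminated` flag is unpacked but unused here. This relies on the assumption that terminal states keep $V = 0$, which holds when $V$ starts at zero and terminal states only transition to themselves with zero reward.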

In [None]:
def compute_expected_value(env: gym.Env, V: np.ndarray[float], state: int, action: int, gamma: float) -> float:
    # YOUR CODE HERE
    raise NotImplementedError()

7. The next step is to evaluate the agent's policy. Implement the policy evaluation loop of policy iteration in the function `evaluate_policy`. It receives the environment, the agent's policy, the vector $V$, the value of $\gamma$, and the value of $\theta$. This function does not need to return anything.
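
As an illustration of the stopping rule based on $\theta$, the sketch below runs in-place policy evaluation on a hypothetical three-state chain, not on Frozen Lake. The dictionary `P` imitates the `P[state][action]` format described above, and all names and numbers are assumptions made for this example.

```python
import numpy as np

# Hypothetical chain 0 -> 1 -> 2, where state 2 is terminal and absorbing.
# Each entry is (probability, next_state, reward, terminated).
P = {
    0: {0: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)]},
}

policy = np.zeros(3, dtype=int)  # the only available action is 0
V = np.zeros(3)
gamma, theta = 0.9, 1e-8

while True:
    delta = 0.0
    for state in P:
        old_value = V[state]
        # One-step lookahead under the action chosen by the policy.
        V[state] = sum(
            prob * (reward + gamma * V[next_state])
            for prob, next_state, reward, _ in P[state][policy[state]]
        )
        delta = max(delta, abs(old_value - V[state]))
    # Stop once the largest change in a full sweep falls below theta.
    if delta < theta:
        break

print(V)  # [0.9 1.  0. ]
```

Because `V` is updated in place, later states in a sweep already see the new values of earlier ones. This is one common variant; a two-array version would converge to the same values.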

In [None]:
def evaluate_policy(env: gym.Env, policy: np.ndarray[int], V: np.ndarray[float], gamma: float, theta: float) -> None:
    # YOUR CODE HERE
    raise NotImplementedError()

8. Next, let's implement the policy update. In the function `improve_policy`, implement one iteration of the policy update. It receives the environment, the agent's policy, the vector $V$, and the value of $\gamma$. It must return a boolean indicating whether the policy is stable.

In [None]:
def improve_policy(env: gym.Env, policy: np.ndarray[int], V: np.ndarray[float], gamma: float) -> bool:
    # YOUR CODE HERE
    raise NotImplementedError()

The cell below implements the structure of the *Policy Iteration* algorithm using the functions developed in the previous steps. No implementation is required in this part.

In [None]:
def policy_iteration(env: gym.Env, gamma: float, theta: float) -> tuple[np.ndarray[float], np.ndarray[int]]:
    V, policy = init_policy_iteration(env)

    while True:
        evaluate_policy(env, policy, V, gamma, theta)
        policy_stable = improve_policy(env, policy, V, gamma)
        if policy_stable:
            break
    return V, policy

In [None]:
def print_policy(env:gym.Env, policy: np.ndarray[int]):
    """
    Display the policy of a FrozenLake environment visually.

    Parameters:
    -----------
    env : gym.Env
        A FrozenLake environment.
    policy : np.ndarray
        1D array containing the action to take in each state.

    Actions are mapped to arrows:
        0: '←', 1: '↓', 2: '→', 3: '↑'

    Special map symbols:
        'H': hole → '▢'
        'G': goal → '◎'
    """

    ACTION_MAP = ['←', '↓', '→', '↑']
    HOLE_SYMBOL = '▢'
    GOAL_SYMBOL = '◎'

    n_rows, n_cols = env.unwrapped.desc.shape
    policy_grid = np.full((n_rows, n_cols), "", dtype=str)

    for index, action in enumerate(policy):
        # Use the map width rather than a hard-coded 4 so non-4x4 maps also work
        row, col = divmod(index, n_cols)
        cell = env.unwrapped.desc[row, col]
        if cell == b'H':
            policy_grid[row, col] = HOLE_SYMBOL
        elif cell == b'G':
            policy_grid[row, col] = GOAL_SYMBOL
        else:
            policy_grid[row, col] = ACTION_MAP[action]

    np.savetxt(sys.stdout, policy_grid, fmt='%s', delimiter=' ')

The cell below runs your *Policy Iteration* algorithm on a deterministic Frozen Lake environment, that is, one where the agent does not risk slipping in unintended directions. The resulting policy is stored in the variable `policy_iteration_deterministic`, which we will use in another task. Make sure the algorithm works correctly and that the generated policy matches the expected behavior in this environment.

In [None]:
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
V, policy_iteration_deterministic = policy_iteration(env, gamma=0.99, theta=1e-8)
print_policy(env, policy_iteration_deterministic)
env.close()

assert np.array_equal(policy_iteration_deterministic, [1, 2, 1, 0, 1, 0, 1, 0, 2, 1, 1, 0, 0, 2, 2, 0]), "Policy differs from the expected one"

In [None]:
# Do not modify or remove this cell

## Value Iteration

In this exercise we will find an optimal policy for Frozen Lake using the *Value Iteration* algorithm, as described below.

![Value Iteration](value_iteration.png)

9. Again, we will split this exercise into smaller steps. The first step is to initialize the vector $V$, which stores the expected value of each state. To do so, implement the function `init_value_iteration`, which receives an environment as a parameter and returns the vector $V$. This vector must be initialized with zeros.

In [None]:
def init_value_iteration(env: gym.Env) -> np.ndarray[float]:
    # YOUR CODE HERE
    raise NotImplementedError()

10. Now, let's generate a deterministic policy from a vector $V$, as defined by the equation $\pi(s)= \textrm{argmax}_a \sum_{s', r}p(s', r|s, a)[r + \gamma V(s')]$. Implement the function `generate_policy`, which receives an environment, a vector $V$, and the value of $\gamma$, and returns the resulting deterministic policy.

**Hint:** Use the function `compute_expected_value` from the previous exercise to simplify your implementation.
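
The argmax step can be illustrated in isolation. The sketch below assumes the one-step lookahead values have already been collected into a hypothetical table `q_values` with one row per state and one column per action. The numbers are invented for the example.

```python
import numpy as np

# Hypothetical lookahead values: rows are states, columns are actions 0..3.
q_values = np.array([
    [0.0, 0.5, 0.5, 0.1],  # tie between actions 1 and 2
    [0.0, 0.0, 0.0, 0.0],  # all actions equal, e.g. a terminal state
    [0.2, 0.1, 0.9, 0.3],  # action 2 is strictly best
])

# np.argmax returns the first index of the maximum along the given axis,
# so ties are resolved in favor of the lowest action index.
greedy_policy = np.argmax(q_values, axis=1)
print(greedy_policy)  # [1 0 2]
```

Tie-breaking matters when comparing a policy against a fixed expected array, since several actions can be equally good in the same state.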

In [None]:
def generate_policy(env: gym.Env, V: np.ndarray[float], gamma: float) -> np.ndarray[int]:
    # YOUR CODE HERE
    raise NotImplementedError()

11. Finally, implement the main loop of *Value Iteration* in the function `value_iteration`. It must return, in this order, the array of values $V$ and the resulting policy.
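
For intuition about how this loop differs from policy evaluation, here is a minimal sketch on a hypothetical three-state MDP with two actions. The update takes the maximum over actions instead of following a fixed policy. The dictionary `P`, the helper `backup`, and all numbers are assumptions for illustration only.

```python
import numpy as np

# Hypothetical MDP: action 0 moves back or stays, action 1 moves forward.
# State 2 is terminal and absorbing; entering it from state 1 yields reward 1.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
gamma, theta = 0.9, 1e-8
V = np.zeros(3)


def backup(state, action):
    """One-step lookahead for a state-action pair using the current V."""
    return sum(
        prob * (reward + gamma * V[next_state])
        for prob, next_state, reward, _ in P[state][action]
    )


while True:
    delta = 0.0
    for state in P:
        best = max(backup(state, action) for action in P[state])
        delta = max(delta, abs(best - V[state]))
        V[state] = best
    if delta < theta:
        break

# Extract the greedy policy once the values have converged.
policy = np.array([
    int(np.argmax([backup(state, action) for action in sorted(P[state])]))
    for state in sorted(P)
])
print(V, policy)  # [0.9 1.  0. ] [1 1 0]
```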

In [None]:
def value_iteration(env: gym.Env, gamma:float, theta: float) -> tuple[np.ndarray[float], np.ndarray[int]]:
    # YOUR CODE HERE
    raise NotImplementedError()

The cell below runs your *Value Iteration* algorithm on a deterministic Frozen Lake environment. The resulting policy is stored in the variable `value_iteration_deterministic`, which we will use in another task. Make sure the algorithm works correctly and that the generated policy matches the expected behavior in this environment.

In [None]:
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
V, value_iteration_deterministic = value_iteration(env, gamma=0.99, theta=1e-8)
print_policy(env, value_iteration_deterministic)
env.close()

assert np.array_equal(value_iteration_deterministic, [1, 2, 1, 0, 1, 0, 1, 0, 2, 1, 1, 0, 0, 2, 2, 0]), "Policy differs from the expected one"

In [None]:
# Do not modify or remove this cell

## Analysis

Now we will run your algorithms on the same Frozen Lake environment, but with slippery ice. The resulting policies are stored in the variables `policy_iteration_slippery` and `value_iteration_slippery`, which we will use in task 14. In this scenario, the agent has only a 1/3 chance of moving in the intended direction and a 2/3 chance of moving in a perpendicular direction (1/3 for each of the two). Observe the resulting policies.

In [None]:
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
V, policy_iteration_slippery = policy_iteration(env, gamma=0.99, theta=1e-8)
print("Policy Iteration")
print_policy(env, policy_iteration_slippery)
env.close()

In [None]:
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
V, value_iteration_slippery = value_iteration(env, gamma=0.99, theta=1e-8)
print("Value Iteration")
print_policy(env, value_iteration_slippery)
env.close()

12. Implement the function `execute_policy` below, which must run a previously obtained policy in a Frozen Lake environment for $N$ episodes, returning the accumulated reward of each episode and the episode lengths.

In [None]:
def execute_policy(env: gym.Env, policy: np.ndarray[int], n_episodes):
    episode_returns = []
    episode_lengths = []

    for episode in range(n_episodes):
        total_reward = 0
        step_count = 0

        state, _ = env.reset()
        done = False

        while not done:
            # YOUR CODE HERE
            raise NotImplementedError()

        episode_returns.append(total_reward)
        episode_lengths.append(step_count)
    return episode_returns, episode_lengths

13. Use the function `execute_policy` to evaluate the policy obtained by Policy Iteration on the **deterministic** Frozen Lake (`policy_iteration_deterministic`) in a slippery Frozen Lake environment for 10 episodes. Store the accumulated rewards of the episodes in the variable `agent_1_returns` and the episode lengths in the variable `agent_1_lengths`. Observe the agent's behavior during execution.

In [None]:
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True, render_mode="human")

# YOUR CODE HERE
raise NotImplementedError()

env.close()

14. Repeat the procedure from the previous task, this time using the policy obtained by Policy Iteration on the **slippery** Frozen Lake (`policy_iteration_slippery`). Store the accumulated rewards of the episodes in the variable `agent_2_returns` and the episode lengths in the variable `agent_2_lengths`. Observe the agent's behavior during execution.

In [None]:
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True, render_mode="human")

# YOUR CODE HERE
raise NotImplementedError()

env.close()

Analyze the following comparison between the rewards and episode lengths obtained by each of these two runs on the slippery Frozen Lake.

In [None]:
def compare_policy(
    rewards_run1, lengths_run1,
    rewards_run2, lengths_run2,
    label_run1="Agent 1", label_run2="Agent 2"
):
    """
    Compare two policy runs using mean return and mean episode length.

    Args:
        rewards_run1 (list): Episode rewards for run 1.
        lengths_run1 (list): Episode lengths for run 1.
        rewards_run2 (list): Episode rewards for run 2.
        lengths_run2 (list): Episode lengths for run 2.
        label_run1 (str): Label for run 1.
        label_run2 (str): Label for run 2.
    """

    mean_rewards = [np.mean(rewards_run1), np.mean(rewards_run2)]
    std_rewards  = [np.std(rewards_run1), np.std(rewards_run2)]

    mean_lengths = [np.mean(lengths_run1), np.mean(lengths_run2)]
    std_lengths  = [np.std(lengths_run1), np.std(lengths_run2)]

    labels = [label_run1, label_run2]
    x = np.arange(len(labels))
    width = 0.6

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Mean Rewards
    axes[0].bar(x, mean_rewards, yerr=std_rewards, capsize=5, width=width, color=['skyblue', 'salmon'])
    axes[0].set_xticks(x)
    axes[0].set_xticklabels(labels)
    axes[0].set_ylabel("Mean Total Reward")
    axes[0].set_title("Mean Episode Return ± Std")
    axes[0].grid(True, axis="y", linestyle="--", alpha=0.4)

    # Mean Episode Lengths
    axes[1].bar(x, mean_lengths, yerr=std_lengths, capsize=5, width=width, color=['skyblue', 'salmon'])
    axes[1].set_xticks(x)
    axes[1].set_xticklabels(labels)
    axes[1].set_ylabel("Mean Episode Length")
    axes[1].set_title("Mean Episode Length ± Std")
    axes[1].grid(True, axis="y", linestyle="--", alpha=0.4)

    plt.tight_layout()
    plt.show()

compare_policy(agent_1_returns, agent_1_lengths, agent_2_returns, agent_2_lengths)

15. Explain which factors led to the differences observed between the policies obtained in the deterministic environment and in the slippery environment.

YOUR ANSWER HERE

16. What strategies could be adopted to make the agent's behavior less conservative when it is trained in the slippery environment?

YOUR ANSWER HERE