In [5]:
import sys
import os
import numpy as np

# Add the src directory to the path
sys.path.append(os.path.abspath('../src'))
sys.path.append(os.path.abspath('..'))

# Import the agent classes
from src.agents.mes_agents import MonAgent1, SARSAAgent, ExpectedSARSAAgent, DQNAgent

In [10]:
from env_sailing import SailingEnv
from initial_windfields import get_initial_windfield

### Testing the Agent's Validity

Let's have the agent take a few steps to check that everything works.

In [None]:
# Instead of validating the agent here, we'll just demonstrate it on a simple task
from src.env_sailing import SailingEnv

# Create a simple environment
env = SailingEnv()
observation, info = env.reset(seed=42)
state_dim = len(observation)

# Initialize our agents
mon_agent = MonAgent1()
mon_agent.seed(42)

sarsa_agent = SARSAAgent()
sarsa_agent.seed(42)

expectedsarsa_agent = ExpectedSARSAAgent()
expectedsarsa_agent.seed(42)

dqn_agent = DQNAgent(state_dim=state_dim)


# Run the agent for a few steps
print("Running the minimal agent for 5 steps:")
for i in range(5):
    action = mon_agent.act(observation)
    observation, reward, done, truncated, info = env.step(action)
    print(f"Step {i+1}: Action={action}, Position={info['position']}, Reward={reward}")

### Analysis of the Agent

`MonAgent1` is extremely simple but illustrates the key requirements for a valid agent:

1. **Inheritance**: It inherits from `BaseAgent`
2. **Required Methods**: It implements all required methods (`act`, `reset`, `seed`)
3. **Action Selection**: It always returns action `0` (North)
4. **Simplicity**: It maintains no internal state and requires no complex logic

This agent provides a good baseline, but it has obvious limitations:

- It ignores wind conditions completely
- It will struggle when the wind is coming from the North
- It doesn't adapt its strategy based on the environment
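The requirements listed above can be sketched in a few lines. `BaseAgent` below is only a stand-in so that the example is self-contained; the real class lives in the project and its exact interface is assumed from the list above. `AlwaysNorthAgent` is a hypothetical name, not a class from this repository.

```python
import numpy as np


class BaseAgent:
    """Stand-in for the project's BaseAgent (assumed interface: act, reset, seed)."""

    def act(self, observation):
        raise NotImplementedError

    def reset(self):
        raise NotImplementedError

    def seed(self, seed=None):
        raise NotImplementedError


class AlwaysNorthAgent(BaseAgent):
    """Naive agent: ignores the observation and always picks action 0 (North)."""

    def __init__(self):
        self.np_random = np.random.default_rng()

    def act(self, observation):
        # The observation (position, velocity, wind) is deliberately unused.
        return 0

    def reset(self):
        # No internal state to clear between episodes.
        pass

    def seed(self, seed=None):
        # Kept for interface compatibility; this agent is deterministic.
        self.np_random = np.random.default_rng(seed)
```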

Let's test the naive agent to see how well it performs:

In [None]:
from src.env_sailing import SailingEnv
from src.initial_windfields import get_initial_windfield

# Create an environment with a simple test initial windfield
env = SailingEnv(**get_initial_windfield('simple_static'))
mon_agent = MonAgent1()

# Run a single episode
observation, info = env.reset(seed=42)
total_reward = 0
done = False
truncated = False
step_count = 0

print("Running the naive agent on the simple_static initial windfield:")
while not (done or truncated) and step_count < 1000:  # Limit to 1000 steps
    action = mon_agent.act(observation)
    observation, reward, done, truncated, info = env.step(action)
    total_reward += reward
    step_count += 1
    
    # Print every 10 steps to avoid too much output
    if step_count % 10 == 0:
        print(f"Step {step_count}: Position={info['position']}, Reward={reward}")

print(f"\nEpisode finished after {step_count} steps with reward: {total_reward}")
print(f"Final position: {info['position']}")
print(f"Goal reached: {done}")

## Improving on the Naive Agent

The naive agent provides a good starting point, but there are many ways to improve it:

1. **Wind-Aware Agent**: Consider wind direction when choosing actions
2. **Goal-Directed Agent**: Calculate the direction to the goal and choose actions accordingly
3. **Physics-Based Agent**: Use sailing physics equations to determine the most efficient action

The key insight for sailing is that certain directions relative to the wind are more efficient than others:

- The sailing efficiency is highest when moving perpendicular to the wind (beam reach)
- It's difficult to sail directly into the wind (the "no-go zone" - less than 45° to the wind)
- The boat maintains momentum (inertia) between steps

Before diving into reinforcement learning, consider implementing a simple rule-based agent that incorporates these physics principles.
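As a rough illustration of such a rule-based agent, the sketch below scores each heading by a toy efficiency curve multiplied by its progress toward the goal. Everything except "action 0 = North" is an assumption made for the example: the 8-heading action encoding, North as +y, the wind vector pointing where the air moves, and the `toy_efficiency` curve, which is not the environment's actual physics.

```python
import numpy as np

# Assumed action encoding: 8 compass headings, clockwise from North (+y).
# Only "action 0 = North" comes from the text above; the rest is hypothetical.
HEADINGS = np.array(
    [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)],
    dtype=float,
)
HEADINGS /= np.linalg.norm(HEADINGS, axis=1, keepdims=True)


def angle_to_wind_source(heading, wind):
    """Angle in degrees between a heading and the direction the wind comes from.

    Assumes `wind` is the vector the air moves along, so the source is -wind.
    """
    upwind = -np.asarray(wind, dtype=float)
    upwind /= np.linalg.norm(upwind)
    cos_angle = np.clip(np.dot(heading, upwind), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))


def toy_efficiency(angle_deg):
    """Illustrative curve only: zero in the no-go zone, peak at a beam reach."""
    if angle_deg < 45.0 - 1e-6:  # small tolerance for floating-point error
        return 0.0
    return 0.5 + 0.5 * np.sin(np.radians(angle_deg))


def choose_action(position, goal, wind):
    """Pick the heading with the best efficiency-weighted progress to the goal."""
    to_goal = np.asarray(goal, dtype=float) - np.asarray(position, dtype=float)
    to_goal /= np.linalg.norm(to_goal)
    scores = [
        toy_efficiency(angle_to_wind_source(h, wind)) * max(0.0, np.dot(h, to_goal))
        for h in HEADINGS
    ]
    return int(np.argmax(scores))
```

With the wind coming from the North and the goal due North, this picks a diagonal heading instead of action 0, which is the tacking behaviour the naive agent lacks. The sketch does not handle a zero wind vector or a position equal to the goal.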

### Implementing a Q-Learning Agent

Now let's implement a basic Q-learning agent for our sailing environment. Q-learning is a model-free reinforcement learning algorithm that learns to make decisions by estimating the value of state-action pairs.

Our implementation will use a simplified state representation based on:
1. Agent's current position
2. Agent's current velocity 
3. Local wind at the agent's position

This simplified approach makes the agent more interpretable and faster to train, while still capturing essential local information for effective navigation.
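To make the idea concrete, here is a minimal tabular Q-learner with the same method names the training cells use (`discretize_state`, `act`, `learn`, `q_table`). It is a sketch, not the project's `QLearningAgent`: the class name `TinyQLearner`, the observation layout `[x, y, vx, vy, wind_x, wind_y, ...]`, the cell size, and the number of direction bins are all assumptions.

```python
import numpy as np
from collections import defaultdict


def direction_bin(dx, dy, n_bins=8):
    """Map a 2D vector to one of n_bins angular sectors."""
    angle = np.arctan2(dy, dx) % (2 * np.pi)
    return int(angle / (2 * np.pi) * n_bins) % n_bins


class TinyQLearner:
    """Hypothetical minimal tabular Q-learner, for illustration only."""

    def __init__(self, n_actions, learning_rate=0.1, discount_factor=0.99,
                 exploration_rate=0.2, seed=None):
        self.n_actions = n_actions
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_rate = exploration_rate
        self.rng = np.random.default_rng(seed)
        # Unseen states start with all-zero action values.
        self.q_table = defaultdict(lambda: np.zeros(n_actions))

    def discretize_state(self, observation, cell_size=4.0):
        # Assumed layout: [x, y, vx, vy, wind_x, wind_y, ...]
        x, y, vx, vy, wx, wy = observation[:6]
        return (int(x // cell_size), int(y // cell_size),
                direction_bin(vx, vy), direction_bin(wx, wy))

    def act(self, observation):
        # Epsilon-greedy: explore with probability exploration_rate.
        if self.rng.random() < self.exploration_rate:
            return int(self.rng.integers(self.n_actions))
        return int(np.argmax(self.q_table[self.discretize_state(observation)]))

    def learn(self, state, action, reward, next_state):
        # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
        td_target = reward + self.discount_factor * np.max(self.q_table[next_state])
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += self.learning_rate * td_error
```

Coarser bins shrink the table and speed up learning at the cost of lumping together states that may need different actions.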

In [None]:
from src.agents.mon_agent import QLearningAgent, MonAgent1, SARSAAgent, ExpectedSARSAAgent, DQNAgent, MonAgent5

### Training the Q-Learning Agent

Now let's train our Q-learning agent on a simple initial windfield. We'll start with a small number of episodes (10) to demonstrate the process.

In [None]:
# Create our Q-learning agent
ql_agent = QLearningAgent(learning_rate=0.1, discount_factor=0.99, exploration_rate=0.2)

# Set fixed seed for reproducibility
np.random.seed(42)
ql_agent.seed(42)

# Create environment with a simple initial windfield
env = SailingEnv(**get_initial_windfield('simple_static'))

# Training parameters
num_episodes = 10  # Small number for debugging
max_steps = 1000

# Training loop
print("Starting training with 10 episodes (debug run)...")
for episode in range(num_episodes):
    # Reset environment and get initial state
    observation, info = env.reset(seed=episode)  # Different seed each episode
    state = ql_agent.discretize_state(observation)
    
    total_reward = 0
    
    for step in range(max_steps):
        # Select action and take step
        action = ql_agent.act(observation)
        next_observation, reward, done, truncated, info = env.step(action)
        next_state = ql_agent.discretize_state(next_observation)
        
        # Update Q-table
        ql_agent.learn(state, action, reward, next_state)
        
        # Update state and total reward
        state = next_state
        observation = next_observation
        total_reward += reward
        
        # Break if episode is done
        if done or truncated:
            break
    
    print(f"Episode {episode+1}: Steps={step+1}, Reward={total_reward}, " +
          f"Position={info['position']}, Goal reached={done}")
    
    # Update exploration rate (optional: decrease exploration over time)
    ql_agent.exploration_rate = max(0.05, ql_agent.exploration_rate * 0.95)

print("\nDebug training completed!")
print(f"Q-table size: {len(ql_agent.q_table)} states")

### Full Training Run

Now let's train our agent for more episodes to get better performance. This will take longer but should result in a more effective agent.

*Note: You might want to adjust the number of episodes based on your available time. More episodes generally lead to better performance.*
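When choosing the number of episodes, it helps to know how quickly exploration decays. The training loops in this notebook multiply the exploration rate by a constant after each episode and clamp it at a floor. The helper below (a hypothetical name, `epsilon_schedule`) replays that rule so you can see when the floor is reached for a given setting.

```python
def epsilon_schedule(start, decay, floor, num_episodes):
    """Exploration rate used in each episode under multiplicative decay with a floor."""
    eps, schedule = start, []
    for _ in range(num_episodes):
        schedule.append(eps)
        eps = max(floor, eps * decay)
    return schedule


# Values taken from the full training run: start 0.3, decay 0.98, floor 0.05.
full = epsilon_schedule(start=0.3, decay=0.98, floor=0.05, num_episodes=100)
first_at_floor = next(i for i, eps in enumerate(full) if eps == 0.05)
print(f"Floor first used in episode {first_at_floor + 1} of {len(full)}")
```

With these values the floor is only reached near the end of a 100-episode run, so most of the training happens with noticeably more exploration than the final 5%.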

In [None]:
# Create our Q-learning agent for full training
ql_agent_full = QLearningAgent(learning_rate=0.1, discount_factor=0.99, exploration_rate=0.3)

# Set fixed seed for reproducibility
np.random.seed(42)
ql_agent_full.seed(42)

# Create environment with the 'training_1' initial windfield
env = SailingEnv(**get_initial_windfield('training_1'))

# Training parameters
num_episodes = 100  # More episodes for better learning
max_steps = 1000

# Progress tracking
rewards_history = []
steps_history = []
success_history = []

# Training loop
print("Starting full training with 100 episodes...")
import time
start_time = time.time()

for episode in range(num_episodes):
    # Reset environment and get initial state
    observation, info = env.reset(seed=episode)  # Different seed each episode
    state = ql_agent_full.discretize_state(observation)
    
    total_reward = 0
    
    for step in range(max_steps):
        # Select action and take step
        action = ql_agent_full.act(observation)
        next_observation, reward, done, truncated, info = env.step(action)
        next_state = ql_agent_full.discretize_state(next_observation)
        
        # Update Q-table
        ql_agent_full.learn(state, action, reward, next_state)
        
        # Update state and total reward
        state = next_state
        observation = next_observation
        total_reward += reward
        
        # Break if episode is done
        if done or truncated:
            break
    
    # Record metrics
    rewards_history.append(total_reward)
    steps_history.append(step+1)
    success_history.append(done)
    
    # Update exploration rate (decrease over time)
    ql_agent_full.exploration_rate = max(0.05, ql_agent_full.exploration_rate * 0.98)
    
    # Print progress every 10 episodes
    if (episode + 1) % 10 == 0:
        success_rate = sum(success_history[-10:]) / 10 * 100
        print(f"Episode {episode+1}/100: Success rate (last 10): {success_rate:.1f}%")

training_time = time.time() - start_time

# Calculate overall success rate
success_rate = sum(success_history) / len(success_history) * 100

print(f"\nTraining completed in {training_time:.1f} seconds!")
print(f"Success rate: {success_rate:.1f}%")
print(f"Average reward: {np.mean(rewards_history):.2f}")
print(f"Average steps: {np.mean(steps_history):.1f}")
print(f"Q-table size: {len(ql_agent_full.q_table)} states")

### Training Agent1

### Training SARSA

In [None]:
# Create the SARSA agent
sarsa_agent = SARSAAgent(learning_rate=0.1, discount_factor=0.99, exploration_rate=0.3)

# Set a fixed seed for reproducibility
np.random.seed(42)
sarsa_agent.seed(42)

# Environment
env = SailingEnv(**get_initial_windfield('training_1'))

# Training parameters
num_episodes = 100
max_steps = 1000

# Progress tracking
rewards_history = []
steps_history = []
success_history = []

print("Starting SARSA training with 100 episodes...")
start_time = time.time()

for episode in range(num_episodes):
    observation, info = env.reset(seed=episode)
    state = sarsa_agent.discretize_state(observation)
    action = sarsa_agent.act(observation)

    total_reward = 0

    for step in range(max_steps):
        next_observation, reward, done, truncated, info = env.step(action)
        next_state = sarsa_agent.discretize_state(next_observation)
        next_action = sarsa_agent.act(next_observation)

        # SARSA update
        sarsa_agent.learn(state, action, reward, next_state, next_action)

        state = next_state
        action = next_action
        observation = next_observation
        total_reward += reward

        if done or truncated:
            break

    rewards_history.append(total_reward)
    steps_history.append(step + 1)
    success_history.append(done)

    # Gradually decrease exploration
    sarsa_agent.exploration_rate = max(0.05, sarsa_agent.exploration_rate * 0.98)

    if (episode + 1) % 10 == 0:
        success_rate = sum(success_history[-10:]) / 10 * 100
        print(f"Episode {episode + 1}/100: Success rate (last 10): {success_rate:.1f}%")

training_time = time.time() - start_time
success_rate = sum(success_history) / len(success_history) * 100

print(f"\nTraining completed in {training_time:.1f} seconds!")
print(f"Success rate: {success_rate:.1f}%")
print(f"Average reward: {np.mean(rewards_history):.2f}")
print(f"Average steps: {np.mean(steps_history):.1f}")
print(f"Q-table size: {len(sarsa_agent.q_table)} states")

# Save the model (create the output directory first in case it does not exist)
os.makedirs("outputs", exist_ok=True)
sarsa_agent.save("outputs/sarsa_agent.pkl")


### Training Expected SARSA
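In their textbook forms, Q-learning, SARSA, and Expected SARSA differ only in the bootstrap target used for the update. The functions below sketch those three targets for a single transition; they are illustrative and not taken from the project's agent classes, and they assume an epsilon-greedy policy and a non-terminal next state.

```python
import numpy as np


def q_learning_target(reward, q_next, gamma):
    """Off-policy: bootstrap from the best action in the next state."""
    return reward + gamma * np.max(q_next)


def sarsa_target(reward, q_next, next_action, gamma):
    """On-policy: bootstrap from the action actually taken next."""
    return reward + gamma * q_next[next_action]


def expected_sarsa_target(reward, q_next, gamma, epsilon):
    """Bootstrap from the expectation over an epsilon-greedy policy."""
    q_next = np.asarray(q_next, dtype=float)
    n_actions = len(q_next)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_next)] += 1.0 - epsilon
    return reward + gamma * np.dot(probs, q_next)
```

This is why the Expected SARSA loop calls `learn` without a `next_action` argument, while the SARSA loop has to select the next action before updating.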

In [None]:
# Create the Expected SARSA agent
expected_agent = ExpectedSARSAAgent(learning_rate=0.1, discount_factor=0.99, exploration_rate=0.3)

# Set a fixed seed for reproducibility
np.random.seed(42)
expected_agent.seed(42)

# Environment
env = SailingEnv(**get_initial_windfield('training_1'))

# Training parameters
num_episodes = 100
max_steps = 1000

rewards_history = []
steps_history = []
success_history = []

print("Starting Expected SARSA training with 100 episodes...")
start_time = time.time()

for episode in range(num_episodes):
    observation, info = env.reset(seed=episode)
    state = expected_agent.discretize_state(observation)

    total_reward = 0

    for step in range(max_steps):
        action = expected_agent.act(observation)
        next_observation, reward, done, truncated, info = env.step(action)
        next_state = expected_agent.discretize_state(next_observation)

        expected_agent.learn(state, action, reward, next_state)

        state = next_state
        observation = next_observation
        total_reward += reward

        if done or truncated:
            break

    rewards_history.append(total_reward)
    steps_history.append(step + 1)
    success_history.append(done)

    # Gradually decrease epsilon
    expected_agent.exploration_rate = max(0.05, expected_agent.exploration_rate * 0.98)

    if (episode + 1) % 10 == 0:
        success_rate = sum(success_history[-10:]) / 10 * 100
        print(f"Episode {episode + 1}/100: Success rate (last 10): {success_rate:.1f}%")

training_time = time.time() - start_time
success_rate = sum(success_history) / len(success_history) * 100

print(f"\nTraining completed in {training_time:.1f} seconds!")
print(f"Success rate: {success_rate:.1f}%")
print(f"Average reward: {np.mean(rewards_history):.2f}")
print(f"Average steps: {np.mean(steps_history):.1f}")
print(f"Q-table size: {len(expected_agent.q_table)} states")

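# Save the model (create the output directory first in case it does not exist)
os.makedirs("outputs", exist_ok=True)
expected_agent.save("outputs/expected_sarsa_agent.pkl")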

### Training Deep Q-Learning
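The DQN training loop stores each transition with `remember` and then learns from a random mini-batch. A minimal experience replay buffer could look like the sketch below. `ReplayBuffer`, its capacity, and its `sample` method are assumptions for illustration; how `DQNAgent` stores transitions internally is not shown in this notebook.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size buffer of (obs, action, reward, next_obs, done)."""

    def __init__(self, capacity=10_000, seed=None):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def remember(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        batch = self.rng.sample(list(self.buffer), batch_size)
        obs, actions, rewards, next_obs, dones = zip(*batch)
        return obs, actions, rewards, next_obs, dones

    def __len__(self):
        return len(self.buffer)
```

A learner built on this would typically skip the update while `len(buffer) < batch_size`, since `sample` raises `ValueError` when asked for more items than are stored.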

In [None]:
!pip install torch


In [None]:
# Training parameters
num_episodes = 100
max_steps = 1000
batch_size = 64

# Create the environment
env = SailingEnv(**get_initial_windfield("training_1"))

# State dimension: the raw observation is used directly, no discretization needed
example_obs, _ = env.reset(seed=0)
state_dim = len(example_obs)

# Create the DQN agent
agent = DQNAgent(state_dim=state_dim)

# Set fixed seeds for reproducibility
np.random.seed(42)
agent.seed(42)

# Progress tracking
rewards_history = []
steps_history = []
success_history = []

print("🚀 Starting DQN training with 100 episodes...")
start_time = time.time()

for episode in range(num_episodes):
    obs, _ = env.reset(seed=episode)
    total_reward = 0

    for step in range(max_steps):
        action = agent.act(obs)
        next_obs, reward, done, truncated, info = env.step(action)

        agent.remember(obs, action, reward, next_obs, done)
        agent.learn(batch_size=batch_size)

        obs = next_obs
        total_reward += reward

        if done or truncated:
            break

    rewards_history.append(total_reward)
    steps_history.append(step + 1)
    success_history.append(done)

    if (episode + 1) % 10 == 0:
        recent_success = sum(success_history[-10:]) / 10 * 100
        print(f"Episode {episode + 1}: success rate (last 10) = {recent_success:.1f}%")

# Summary
training_time = time.time() - start_time
success_rate = sum(success_history) / len(success_history) * 100

print(f"\n✅ Training completed in {training_time:.1f}s")
print(f"Success rate: {success_rate:.1f}%")
print(f"Avg reward: {np.mean(rewards_history):.2f}")
print(f"Avg steps: {np.mean(steps_history):.1f}")

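# Save the model (create the output directory first in case it does not exist)
os.makedirs("outputs", exist_ok=True)
agent.save("outputs/dqn_agent.pth")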

### Visualizing Training Results

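Let's visualize how the agent improved during training. Note that `rewards_history`, `steps_history`, and `success_history` are reassigned by every training cell, so the plots show whichever agent was trained last (the DQN agent if the cells are run in order):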

In [None]:
import matplotlib.pyplot as plt

# Calculate rolling averages
window_size = 10
rolling_rewards = np.convolve(rewards_history, np.ones(window_size)/window_size, mode='valid')
rolling_steps = np.convolve(steps_history, np.ones(window_size)/window_size, mode='valid')
rolling_success = np.convolve([1 if s else 0 for s in success_history], np.ones(window_size)/window_size, mode='valid') * 100

# Create the plots
# fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(10, 12), sharex=True)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 12), sharex=True)

# Plot rewards
ax1.plot(rolling_rewards)
ax1.set_ylabel('Average Reward')
ax1.set_title('Training Progress (10-episode rolling average)')

# Plot steps
ax2.plot(rolling_steps)
ax2.set_ylabel('Average Steps')

# Plot success rate
#ax3.plot(rolling_success)
#ax3.set_ylabel('Success Rate (%)')
#ax3.set_xlabel('Episode')

plt.tight_layout()
plt.show()
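The rolling averages in the plotting cell rely on `np.convolve` with `mode='valid'`, which only returns positions where the window fits entirely inside the data. A small worked example (the helper name `rolling_mean` is hypothetical) shows the effect on the output length:

```python
import numpy as np


def rolling_mean(values, window_size):
    """Moving average over full windows only (no padding at the edges)."""
    kernel = np.ones(window_size) / window_size
    return np.convolve(values, kernel, mode='valid')


smoothed = rolling_mean([1, 2, 3, 4, 5], window_size=2)
print(smoothed)        # [1.5 2.5 3.5 4.5]
print(len(smoothed))   # 4, i.e. len(values) - window_size + 1
```

With 100 episodes and a window of 10, the curves therefore contain 91 points, and point `i` on the x-axis summarises episodes `i` through `i + 9`.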

### Testing the Trained Agent

Now let's evaluate the trained Q-learning agent (`ql_agent_full`) with exploration turned off to see how well it performs on unseen seeds:

In [None]:
# Turn off exploration for evaluation
ql_agent_full.exploration_rate = 0

# Create test environment
test_env = SailingEnv(**get_initial_windfield('training_1'))

# Test parameters
num_test_episodes = 5
max_steps = 1000

print("Testing the trained agent on 5 new episodes...")
# Testing loop
for episode in range(num_test_episodes):
    # Reset environment
    observation, info = test_env.reset(seed=1000 + episode)  # Different seeds from training
    
    total_reward = 0
    
    for step in range(max_steps):
        # Select action using learned policy
        action = ql_agent_full.act(observation)
        observation, reward, done, truncated, info = test_env.step(action)
        
        total_reward += reward
        
        # Break if episode is done
        if done or truncated:
            break
    
    print(f"Test Episode {episode+1}: Steps={step+1}, Reward={total_reward}, " +
          f"Position={info['position']}, Goal reached={done}")

## Validating Your Agent

After creating your agent, you'll want to ensure it meets all the requirements of the challenge. There are two ways to validate your agent:

1. **Using the `validate_agent.ipynb` notebook:**
   - This notebook provides a comprehensive interface for testing your agent
   - It shows detailed validation results and explains any issues

2. **Using the command line:**
   ```bash
   cd src
   python test_agent_validity.py path/to/your_agent.py
   ```

We recommend using these tools after you've completed your agent implementation rather than trying to validate it during development.

For now, let's focus on understanding agent design principles and implementing effective strategies.

## Visualizing Your Agent's Behavior

## Save Your Agent For Submission

## Training and Evaluating

.....

# Next Steps for Developing Your Own Agent

Now it's your turn to develop your own agent. Here are some suggestions:

1. **Enhance the Q-Learning Agent**:
   - Extend the state representation to incorporate the full wind field (not just local wind)
   - This would allow the agent to anticipate wind changes and plan better routes
   - Hint: Modify the `discretize_state` method to extract and process relevant features from the flattened wind field

2. **Algorithmic Improvements**:
   - Implement function approximation to handle continuous state spaces better
   - Explore other RL algorithms like SARSA, Expected SARSA, or Deep Q-Networks
   - Experiment with different exploration strategies that adapt over time

3. **Physics-Based Approaches**:
   - Leverage your understanding of sailing physics (from the `challenge_walkthrough` notebook)
   - Implement rule-based algorithms or path planning (A*, etc.) that take advantage of domain knowledge
   - Create hybrid approaches that combine RL with domain-specific rules
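As a starting point for the first suggestion, the sketch below compresses a flattened wind field into a handful of discrete features that could be appended to the tuple returned by `discretize_state`. The function name `coarse_wind_features` and the layout of the wind field (row-major, shape `(height, width, 2)` once reshaped) are assumptions; check the environment's observation format before relying on them.

```python
import numpy as np


def coarse_wind_features(flat_wind, grid_shape, blocks=2, n_bins=8):
    """Summarise a wind field as one direction bin per coarse block.

    Assumes `flat_wind` reshapes to (height, width, 2) with (wx, wy) per cell.
    """
    height, width = grid_shape
    field = np.asarray(flat_wind, dtype=float).reshape(height, width, 2)
    features = []
    for rows in np.array_split(np.arange(height), blocks):
        for cols in np.array_split(np.arange(width), blocks):
            # Average wind vector over this block of the grid.
            mean_wind = field[np.ix_(rows, cols)].mean(axis=(0, 1))
            angle = np.arctan2(mean_wind[1], mean_wind[0]) % (2 * np.pi)
            features.append(int(angle / (2 * np.pi) * n_bins) % n_bins)
    return tuple(features)
```

Each extra feature multiplies the number of possible states by `n_bins`, so with `blocks=2` the Q-table can grow by a factor of up to 8**4. Keeping the summary coarse is what makes it usable in a tabular agent.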