# RFP: Maze Solvers

## Project Overview
You are invited to submit a proposal that answers the following question:

### What path will your elf take?

*Please submit your proposal by **2/11/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, use [Gymnasium](https://gymnasium.farama.org/) to set up a [Frozen Lake maze](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) for your project. When you are done with the setup, describe the reward system you plan to use.

*Note, a level 5 maze is at least 10 x 10 cells large and contains at least five lake cells.*
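
The size and lake-cell requirement in the note above can be checked mechanically. The sketch below is a minimal helper, assuming the map is a list of equal-length strings in which `H` marks a lake cell (the layout `generate_random_map` returns). The name `meets_level5` and the hand-written `example_map` are hypothetical and exist only for illustration.

```python
def meets_level5(desc, min_size=10, min_holes=5):
    """Return True if the map is at least min_size x min_size with at least min_holes 'H' cells."""
    rows = [str(row) for row in desc]
    n_rows = len(rows)
    n_cols = min(len(row) for row in rows) if rows else 0
    holes = sum(row.count("H") for row in rows)
    return n_rows >= min_size and n_cols >= min_size and holes >= min_holes


# Hypothetical hand-written 10 x 10 map (S start, F frozen, H lake, G gift)
example_map = [
    "SFFFFFFFFF",
    "FHFFFFFHFF",
    "FFFFHFFFFF",
    "FFFFFFFFFF",
    "FFHFFFFFFF",
    "FFFFFFFHFF",
    "FFFFFFFFFF",
    "FHFFFFFFFF",
    "FFFFFHFFFF",
    "FFFFFFFFFG",
]

print(meets_level5(example_map))
```

The same call should work on a generated map, for example `meets_level5(generate_random_map(size=10, seed=259))`.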

In [1]:
import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map
import random
import numpy as np

In [2]:
# Make maze

env = gym.make('FrozenLake-v1', desc=generate_random_map(size=10, seed=259), is_slippery=False)
#initial_state = env.reset()

#env.render()

# Take a step (0: LEFT, 1: DOWN, 2: RIGHT, 3: UP)
#action = 2
#new_state, reward, terminated, truncated, info = env.step(action)

#env.render()

In [5]:

# Get state and action sizes
state_size = env.observation_space.n
action_size = env.action_space.n

# Initialize Q-table
qtable = np.zeros((state_size, action_size))

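# Training parameters
total_episodes = 20000  # number of training episodes
max_steps = 200         # step cap per episode
learning_rate = 0.9     # alpha in the Q-learning update
gamma = 0.85            # discount factor
epsilon = 1.0           # initial exploration rate
min_epsilon = 0.05      # floor for the exploration rate
decay_rate = 0.0005     # not used below; epsilon is multiplied by 0.999 after each episode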

# Training loop
rewards = []
for episode in range(total_episodes):
    state, _ = env.reset()
    total_rewards = 0
    done = False

    for step in range(max_steps):
        # Exploration-exploitation trade-off
        if random.uniform(0, 1) > epsilon:
            action = np.argmax(qtable[state])  # Exploit
        else:
            action = env.action_space.sample()  # Explore

        # Take action
        new_state, reward, done, truncated, _ = env.step(action)

        # Reward shaping: every non-terminal step earns 0.1;
        # terminal steps keep the environment's reward
        reward = 0.1 if not done else reward

        # Q-learning update
        qtable[state, action] = qtable[state, action] + learning_rate * (
            reward + gamma * np.max(qtable[new_state]) - qtable[state, action]
        )

        total_rewards += reward
        state = new_state

        if done or truncated:
            break

    # Decay epsilon by a factor of 0.999 per episode, never below min_epsilon
    epsilon = max(min_epsilon, epsilon * 0.999)
    rewards.append(total_rewards)

# Print results
print("Score over time:", sum(rewards) / total_episodes)
print("Final Q-Table:")
print(qtable)


Score over time: 9.840264999999999
Final Q-Table:
[[0.66666667 0.66666667 0.66666667 0.66666667]
 [0.66666667 0.66666667 0.66666667 0.66666667]
 [0.66666667 0.66666667 0.66666667 0.66666667]
 [0.66666667 0.         0.66666667 0.66666667]
 [0.66666667 0.66666667 0.         0.66666667]
 [0.         0.         0.         0.        ]
 [0.         0.         0.55019013 0.        ]
 [0.30621726 0.59588939 0.09       0.52192643]
 [0.52192643 0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.66666667 0.66666667 0.66666667 0.66666667]
 [0.66666667 0.66666667 0.66666667 0.66666667]
 [0.66666667 0.66666667 0.         0.66666667]
 [0.         0.         0.         0.        ]
 [0.         0.66666667 0.66666615 0.66666662]
 [0.66666662 0.6666107  0.49403567 0.        ]
 [0.66441501 0.60935955 0.57573245 0.0999    ]
 [0.61516955 0.         0.         0.53092643]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.666666

In [None]:
env.close()

#### Describe your reward system here.
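
One way to read the reward system in the training cell above: Frozen Lake's default reward is 1 for reaching the goal and 0 otherwise, and the cell replaces the reward of every non-terminal step with 0.1. The sketch below assumes a state-action pair whose successor has the same value as itself, which is a simplification. Under that assumption the Q-learning update settles at `step_reward / (1 - gamma)`, which matches the many 0.66666667 entries in the printed Q-table. The variable names `step_reward` and `fixed_point` are introduced here for illustration.

```python
learning_rate = 0.9  # same value as the training cell
gamma = 0.85         # same value as the training cell
step_reward = 0.1    # bonus paid on every non-terminal step

# Closed form: q = step_reward + gamma * q  =>  q = step_reward / (1 - gamma)
fixed_point = step_reward / (1 - gamma)

# Apply the Q-learning update repeatedly to a single value,
# assuming the best successor value equals the value itself
q = 0.0
for _ in range(200):
    q = q + learning_rate * (step_reward + gamma * q - q)

print(round(fixed_point, 8), round(q, 8))
```

Because the goal reward of 1 is larger than this value, reaching the gift still pays more than wandering. A table dominated by this value may therefore indicate that training rarely reached the goal, though that reading is not verified here.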

### 2. Training Your Model
In the cell below, write the code that trains a Q-Learning model. Display your final Q-table once training is done.

*Note, level 5 work uses only the Python standard library and Pandas to train your Q-Learning model. Level 4 work uses external libraries like Stable-Baselines3.*
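
As a rough illustration of training with only the standard library, the sketch below runs tabular Q-learning on a small deterministic grid. The 4 x 4 `GRID`, the `step` function, and the hyperparameter values are assumptions chosen for a short example, not the project's 10 x 10 maze or its reward system. Ties between equal Q-values are broken at random during training so that an all-zero table does not lock the agent into one action.

```python
import random

# Hypothetical 4 x 4 map used only for this sketch (S start, F frozen, H lake, G gift)
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
SIZE = len(GRID)
# Action encoding follows Frozen Lake: 0 LEFT, 1 DOWN, 2 RIGHT, 3 UP
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def step(state, action):
    """Deterministic move; returns (new_state, reward, terminated)."""
    row, col = divmod(state, SIZE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    cell = GRID[row][col]
    return row * SIZE + col, (1.0 if cell == "G" else 0.0), cell in "GH"


def train(episodes=5000, max_steps=100, learning_rate=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    qtable = [[0.0] * 4 for _ in range(SIZE * SIZE)]
    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            if rng.random() < epsilon:
                action = rng.randrange(4)  # explore
            else:
                best = max(qtable[state])  # exploit, breaking ties at random
                action = rng.choice([a for a in range(4) if qtable[state][a] == best])
            new_state, reward, terminated = step(state, action)
            # Terminal transitions do not bootstrap from the next state
            target = reward if terminated else reward + gamma * max(qtable[new_state])
            qtable[state][action] += learning_rate * (target - qtable[state][action])
            state = new_state
            if terminated:
                break
    return qtable


def greedy_path(qtable, max_steps=100):
    """Follow the highest-valued action from the start and return the visited states."""
    state, path = 0, [0]
    for _step in range(max_steps):
        action = max(range(4), key=lambda a: qtable[state][a])
        state, _reward, terminated = step(state, action)
        path.append(state)
        if terminated:
            break
    return path


qtable = train()
for row in qtable:
    print([round(value, 3) for value in row])
print(greedy_path(qtable))
```

If Pandas is available, `pandas.DataFrame(qtable, columns=["LEFT", "DOWN", "RIGHT", "UP"])` is one way to display the final table.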

In [None]:
# Train model here.
# Don't forget to display your final Q-table!

### 3. Testing Your Model
In the cell below, write the code that tests your Q-Learning model for **1000 episodes**. Testing for exactly 1000 episodes lets us all compare our results.

*Note, level 5 testing evaluates your model with both a success rate and an average number of steps taken. Level 4 testing uses one or the other.*
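
The two metrics named above can be computed in a single evaluation loop. The sketch below assumes a deterministic `step` function on a hypothetical 4 x 4 map and a hand-written `route` policy, both made up for illustration. It also assumes that average steps are taken over successful episodes only, which is one possible definition among several.

```python
# Hypothetical 4 x 4 map used only for this sketch (S start, F frozen, H lake, G gift)
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
SIZE = len(GRID)
# Action encoding follows Frozen Lake: 0 LEFT, 1 DOWN, 2 RIGHT, 3 UP
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def step(state, action):
    """Deterministic move; returns (new_state, reward, terminated)."""
    row, col = divmod(state, SIZE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    cell = GRID[row][col]
    return row * SIZE + col, (1.0 if cell == "G" else 0.0), cell in "GH"


def evaluate(policy, step_fn, start_state=0, episodes=1000, max_steps=100):
    """Return (success_rate, average steps over successful episodes or None)."""
    successes = 0
    steps_in_successes = 0
    for _episode in range(episodes):
        state = start_state
        for n_steps in range(1, max_steps + 1):
            state, reward, terminated = step_fn(state, policy(state))
            if terminated:
                if reward > 0:  # reached the gift rather than a lake cell
                    successes += 1
                    steps_in_successes += n_steps
                break
    success_rate = successes / episodes
    avg_steps = steps_in_successes / successes if successes else None
    return success_rate, avg_steps


# Hand-written policy for the example map: DOWN, DOWN, RIGHT, RIGHT, DOWN, RIGHT
route = {0: 1, 4: 1, 8: 2, 9: 2, 10: 1, 14: 2}
success_rate, avg_steps = evaluate(lambda state: route[state], step)
print(f"Success rate: {success_rate:.1%}, average steps: {avg_steps}")
```

With a Gymnasium environment, the same loop would call `env.reset()` and `env.step(action)` and would also end an episode when `truncated` is true. A trained policy would be the action with the highest Q-value for the current state.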

In [None]:
# Test model here.

### 4. Final Answer
In the first cell below, describe the path your elf takes to get to the gift. *Note, a level 5 answer includes a GIF of the path your elf takes to reach the gift.*
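
A written path description can be generated from the list of actions the elf takes. The sketch below is a hypothetical helper, assuming states are numbered row by row from the top-left corner and actions use the Frozen Lake encoding (0 LEFT, 1 DOWN, 2 RIGHT, 3 UP). The name `describe_path` and the `example_actions` list are made up for illustration.

```python
# Action encoding from the Frozen Lake documentation: 0 LEFT, 1 DOWN, 2 RIGHT, 3 UP
ACTION_NAMES = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def describe_path(actions, size, start=0):
    """Turn a list of action codes into readable steps on a size x size grid."""
    row, col = divmod(start, size)
    lines = []
    for action in actions:
        d_row, d_col = MOVES[action]
        # Clamp at the walls: a move into a wall leaves the elf in place
        row = min(max(row + d_row, 0), size - 1)
        col = min(max(col + d_col, 0), size - 1)
        lines.append(f"{ACTION_NAMES[action]} to ({row}, {col})")
    return lines


example_actions = [1, 1, 2]  # hypothetical actions, not taken from a trained model
for line in describe_path(example_actions, size=4):
    print(line)
```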

In the second cell below, describe how well your Q-Learning model performed. Make sure that you explicitly name the **learning rate**, the **discount factor**, and the **reward system** that you used when training your final model. *Note, a level 5 description describes the model's performance using two types of quantitative evidence.*

![example image](https://gymnasium.farama.org/_images/frozen_lake.gif)

#### Describe the path your elf takes here.

#### Describe how well your Q-Learning model performed here.