# RFP: Maze Solvers

## Project Overview
You are invited to submit a proposal that answers the following question:

### What path will your elf take?

*Please submit your proposal by **2/11/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, use [Gymnasium](https://gymnasium.farama.org/) to set up a [Frozen Lake maze](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) for your project. When you are done with the setup, describe the reward system you plan to use.

*Note, a level 5 maze is at least 10 x 10 cells large and contains at least five lake cells.*

In [1]:
import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map
import random
import numpy as np

In [21]:
# Make maze

env = gym.make(
    'FrozenLake-v1',
    render_mode='human',
    desc=generate_random_map(size=10, seed=259),
    is_slippery=False,
)
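
The level 5 note above (at least 10 x 10 cells, at least five lake cells) can be checked before training. The following is a minimal sketch, assuming the map is a list of equal-length row strings using the letters `S`, `F`, `H`, and `G`, which is the format `generate_random_map` is expected to return. The names `meets_level_5` and `sample_map` are hypothetical and introduced here only for illustration.

```python
def meets_level_5(desc, min_size=10, min_holes=5):
    """Return True if the map is at least min_size x min_size with enough holes."""
    n_rows = len(desc)
    n_cols = min(len(row) for row in desc)
    n_holes = sum(row.count("H") for row in desc)
    return n_rows >= min_size and n_cols >= min_size and n_holes >= min_holes


# Hand-written 10 x 10 example map (hypothetical, not the seed-259 map).
sample_map = [
    "SFFFFFFFFF",
    "FFFHFFFFFF",
    "FFFFFFHFFF",
    "FHFFFFFFFF",
    "FFFFFFFFFF",
    "FFFFHFFFFF",
    "FFFFFFFFHF",
    "FFFFFFFFFF",
    "FFHFFFFFFF",
    "FFFFFFFFFG",
]

print(meets_level_5(sample_map))
```

With the notebook's imports, the same check could be applied as `meets_level_5(generate_random_map(size=10, seed=259))`.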

In [3]:
# Get state and action sizes
state_size = env.observation_space.n
action_size = env.action_space.n

# Extract hole ('H'), frozen ('F'), and goal ('G') positions as sets of state indices
desc = env.unwrapped.desc
flat_desc = desc.flatten()
hole_states = {i for i, cell in enumerate(flat_desc) if cell == b'H'}
empty_states = {i for i, cell in enumerate(flat_desc) if cell == b'F'}
goal_state = {i for i, cell in enumerate(flat_desc) if cell == b'G'}

# Initialize Q-table with small random values to encourage exploration
qtable = np.random.uniform(low=-0.5, high=0.5, size=(state_size, action_size))

In [4]:
env.close()

#### Describe your reward system here.

### 2. Training Your Model
In the cell seen below, write the code you need to train a Q-Learning model. Display your final Q-table once you are done training your model.

*Note, level 5 work uses only the Python standard library and Pandas to train your Q-Learning model. Level 4 work uses external libraries like Stable-Baselines3.*

In [5]:
# Training parameters
total_episodes = 10000  # Number of training episodes
max_steps = 200  # Step cap per episode
learning_rate = 0.95  # High learning rate: each update mostly overwrites the old estimate
gamma = 0.75  # Discount factor applied to future rewards
epsilon = 0.9  # Initial exploration probability
min_epsilon = 0.01  # Floor on the exploration probability
decay_rate = 0.005  # Decay constant used in the epsilon update
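
To make the roles of `learning_rate` and `gamma` concrete, a single Q-learning update can be worked through by hand. This is a minimal sketch using the values above. The function name `q_update` and the `terminal` flag are assumptions added for illustration: the flag drops the bootstrap term when the new state ends the episode, which is a common variant and not something the training loop in this notebook does.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.95, gamma=0.75, terminal=False):
    """One tabular Q-learning update for a single (state, action) entry."""
    # On a terminal transition there is no future value to bootstrap from.
    future = 0.0 if terminal else gamma * max_next_q
    td_target = reward + future
    return q_sa + alpha * (td_target - q_sa)


# Stepping onto the goal from a fresh entry: 0 + 0.95 * (1.0 - 0) = 0.95
goal_step = q_update(q_sa=0.0, reward=1.0, max_next_q=0.0, terminal=True)

# Stepping onto a frozen cell next to it: target = 0.001 + 0.75 * 0.95 = 0.7135
frozen_step = q_update(q_sa=0.0, reward=0.001, max_next_q=goal_step)

print(round(goal_step, 6), round(frozen_step, 6))
```

With a learning rate this close to 1, each update almost replaces the old estimate with the new target, which tends to suit a deterministic maze (`is_slippery=False`) better than a noisy one.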

In [6]:
# Training loop
rewards = []
for episode in range(total_episodes):
    state, _ = env.reset()
    total_rewards = 0
    done = False

    for step in range(max_steps):
        # Choose action using epsilon-greedy strategy
        if random.uniform(0, 1) > epsilon:
            action = np.argmax(qtable[state])  # Exploit
        else:
            action = env.action_space.sample()  # Explore

        # Take action
        new_state, reward, done, truncated, _ = env.step(action)

        # Modify rewards
        if new_state in hole_states:
            reward = -1.1  # Higher penalty for falling in a hole
        elif new_state in empty_states:
            reward = 0.001  # Small reward for moving forward
        elif new_state in goal_state:
            reward = 1.0  # Higher reward for reaching goal

        # Q-learning update
        qtable[state, action] = qtable[state, action] + learning_rate * (
            reward + gamma * np.max(qtable[new_state]) - qtable[state, action]
        )

        total_rewards += reward
        state = new_state

        if done or truncated:
            break

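    # Decay epsilon, floored at min_epsilon. The current epsilon is multiplied by a
    # factor that shrinks as the episode index grows, so the decay compounds and
    # epsilon reaches the floor after roughly 40 episodes.
    epsilon = max(min_epsilon, epsilon * np.exp(-decay_rate * episode))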
    rewards.append(total_rewards)
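
The epsilon update at the end of the loop multiplies the current epsilon by `exp(-decay_rate * episode)`, so the decay compounds from one episode to the next. The sketch below replays that update next to a commonly used alternative in which epsilon is recomputed from its starting value each episode. Both function names are hypothetical, and the alternative schedule is an assumption offered for comparison, not the schedule used to produce the results in this notebook.

```python
import math


def compounding_schedule(start, floor, decay_rate, episodes):
    """Epsilon seen at the start of each episode under the loop's update."""
    eps, history = start, []
    for episode in range(episodes):
        history.append(eps)
        eps = max(floor, eps * math.exp(-decay_rate * episode))
    return history


def exponential_schedule(start, floor, decay_rate, episodes):
    """Alternative: recompute epsilon from the starting value every episode."""
    return [floor + (start - floor) * math.exp(-decay_rate * episode)
            for episode in range(episodes)]


compounding = compounding_schedule(0.9, 0.01, 0.005, 1000)
exponential = exponential_schedule(0.9, 0.01, 0.005, 1000)

# First episode at which each schedule has dropped below 0.05.
first_low_compounding = next(i for i, e in enumerate(compounding) if e < 0.05)
first_low_exponential = next(i for i, e in enumerate(exponential) if e < 0.05)
print(first_low_compounding, first_low_exponential)
```

Under these settings the compounding update reaches the floor within a few dozen episodes, while the alternative keeps a meaningful amount of exploration for several hundred.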

### 3. Testing Your Model
In the cell seen below, write the code you need to test your Q-Learning model for **1000 episodes**. It is important to test your model for 1000 episodes so that we are all able to compare our results.

*Note, level 5 testing uses both a success rate and an average steps taken metric to evaluate your model. Level 4 uses one or the other.*

In [7]:
# Print training results (average shaped reward per training episode)
print("Score over time:", sum(rewards) / total_episodes)
print("Q-Table:")
print(qtable)

Score over time: 0.9743121999998908
Q-Table:
[[ 8.09287046e-03  1.07904939e-02  1.07904939e-02  8.09287046e-03]
 [ 8.09287046e-03  1.30539919e-02  1.30539919e-02  1.07904939e-02]
 [ 4.68565210e-03  1.60719892e-02  4.82483759e-03  4.70432434e-03]
 [ 4.92285205e-03 -8.16934419e-01  4.90164931e-03  4.94086349e-03]
 [ 4.95097457e-03  5.04924130e-03 -7.49422320e-01  5.09224698e-03]
 [ 4.67248518e-01  2.90479170e-01 -1.56293299e-01 -1.79164487e-02]
 [-7.02277144e-01  6.88322615e-03 -3.26684847e-01 -2.98588105e-01]
 [-2.54494353e-01  6.63659108e-03 -3.50789966e-01 -4.25983583e-01]
 [ 3.09454435e-01  4.12372468e-01  4.16536738e-01 -2.60526553e-01]
 [-2.55239656e-01 -4.79328242e-01  4.41883818e-01  5.29131929e-02]
 [ 1.07904939e-02  1.30539919e-02  1.30539919e-02  8.09287046e-03]
 [ 1.07904939e-02  1.60719892e-02  1.60719892e-02  1.07904939e-02]
 [ 1.30539919e-02  2.00959856e-02 -8.19343912e-01  1.30539919e-02]
 [-3.11071451e-01 -1.88980093e-01  3.74208118e-01 -4.18002369e-01]
 [-8.16864570e-01

In [22]:
state = env.reset()[0]  # Reset environment and get initial state
done = False
truncated = False

while not (done or truncated):  # Stop at a terminal state or at the step limit
    action = np.argmax(qtable[state, :])  # Pick the best action based on Q-values
    new_state, reward, done, truncated, _ = env.step(action)  # Take action
    env.render()  # Display the environment (optional)
    state = new_state  # Move to the next state

env.close()  # Close the environment
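
The section prompt asks for 1000 test episodes and, for level 5, both a success rate and an average step count, whereas the cell above plays a single episode. A minimal sketch of such an evaluation follows. It assumes an environment with the Gymnasium-style `reset()` and `step()` return values used elsewhere in this notebook, and that a positive reward on termination means the goal was reached. The names `evaluate` and `greedy_action` and the tiny `CorridorEnv` stand-in are hypothetical, introduced so the example runs on its own.

```python
def greedy_action(qtable, state):
    """Index of the highest Q-value in the row for this state."""
    row = qtable[state]
    return max(range(len(row)), key=lambda a: row[a])


def evaluate(env, qtable, episodes=1000, max_steps=200):
    """Return (success rate, average steps over successful episodes)."""
    successes = 0
    steps_on_success = []
    for _ in range(episodes):
        state, _ = env.reset()
        for step in range(1, max_steps + 1):
            action = greedy_action(qtable, state)
            state, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                if terminated and reward > 0:
                    successes += 1
                    steps_on_success.append(step)
                break
    success_rate = successes / episodes
    if steps_on_success:
        avg_steps = sum(steps_on_success) / len(steps_on_success)
    else:
        avg_steps = float("nan")
    return success_rate, avg_steps


class CorridorEnv:
    """Hypothetical stand-in: a 4-cell corridor with the goal in the last cell."""

    def __init__(self, length=4):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos, {}

    def step(self, action):
        # Action 1 moves right, any other action moves left.
        if action == 1:
            self.pos = min(self.length - 1, self.pos + 1)
        else:
            self.pos = max(0, self.pos - 1)
        terminated = self.pos == self.length - 1
        return self.pos, (1.0 if terminated else 0.0), terminated, False, {}


toy_q = [[0.0, 1.0] for _ in range(4)]  # Always prefers moving right
success_rate, avg_steps = evaluate(CorridorEnv(), toy_q, episodes=1000)
print(success_rate, avg_steps)
```

With the notebook's objects this would be called as `evaluate(env, qtable)` before `env.close()`. Creating the environment without `render_mode='human'` is likely to make 1000 episodes run much faster.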

### 4. Final Answer
In the first cell below, describe the path your elf takes to get to the gift. *Note, a level 5 answer includes a gif of the path your elf takes in order to reach the gift.*

In the second cell seen below, describe how well your Q-Learning model performed. Make sure that you explicitly name the **learning rate**, **the discount factor**, and the **reward system** that you used when training your final model. *Note, a level 5 description describes the model's performance using two types of quantitative evidence.*

![example image](https://gymnasium.farama.org/_images/frozen_lake.gif)

#### Describe the path your elf takes here.


![My GIF](ezgif.com-video-to-gif-converter.gif)

#### Describe how well your Q-Learning model performed here.

My Q-Learning model performed very well: it gets the elf to the gift in only 18 steps. I used a learning rate of 0.95, which worked well because the elf appears to have learned from exploring all over the frozen lake, and a discount factor of 0.75 so that the elf still values long-term rewards. My reward system consists of a 0.001 reward for stepping onto an empty space, a -1.1 reward for falling into a hole, and a +1 reward for reaching the gift. As the gif shows, this is the path the elf now takes consistently, and it is a solid path to the gift. As a next step, I could put the elf on a slippery frozen lake to see whether it still performs just as well. In my code I used an epsilon-greedy strategy: on each step, a random draw decides whether the elf explores or exploits. I also lowered the decay rate, with the aim of letting the elf explore for longer before exploiting what it knows.