## Required Proposal Components

### 1. Data Description
In the code cell below, use [Gymnasium](https://gymnasium.farama.org/) to set up a [Frozen Lake maze](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) for your project. When you are done with the setup, describe the reward system you plan to use.

*Note, a level 5 maze is at least 10 x 10 cells and contains at least five lake cells.*
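
A candidate maze can be checked against these requirements before the environment is built. The sketch below assumes that "lake cells" means hole cells (`H`). The helper name `check_maze` is hypothetical. It also runs a breadth-first search to confirm that the goal is reachable from the start, which the note does not require but which any learnable maze needs.

```python
from collections import deque

# Same layout as the maze used in this notebook
maze_rows = [
    "SFFHFHFFFF", "FHFFFHFFFH", "FHFFHFHHFF", "FFFHHFFFFH", "HFHFFFHHFF",
    "FFFFFFHFFF", "HFFHFFHHFH", "FFFHFFFHFF", "FFHFFHHFFH", "FFHHFFHFFG",
]

def check_maze(rows, min_size=10, min_holes=5):
    """Return (size_ok, holes_ok, goal_reachable) for a list of row strings."""
    n_rows, n_cols = len(rows), len(rows[0])
    size_ok = n_rows >= min_size and all(len(r) >= min_size for r in rows)
    holes_ok = sum(r.count("H") for r in rows) >= min_holes

    # Breadth-first search from S, never stepping onto a hole
    start = next((r, c) for r in range(n_rows) for c in range(n_cols)
                 if rows[r][c] == "S")
    seen, queue = {start}, deque([start])
    reachable = False
    while queue:
        r, c = queue.popleft()
        if rows[r][c] == "G":
            reachable = True
            break
        for dr, dc in ((0, -1), (1, 0), (0, 1), (-1, 0)):  # left, down, right, up
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < n_rows and 0 <= nc < n_cols
            if in_bounds and rows[nr][nc] != "H" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return size_ok, holes_ok, reachable

print(check_maze(maze_rows))
```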

In [None]:
import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map
import matplotlib.pyplot as plt
import pandas as pd
import random
cellTypes = "SFFHFHFFFFFHFFFHFFFHFHFFHFHHFFFFFHHFFFFHHFHFFFHHFFFFFFFFHFFFHFFHFFHHFHFFFHFFFHFFFFHFFHHFFHFFHHFFHFFG"


In [None]:
maze = ["SFFHFHFFFF","FHFFFHFFFH", "FHFFHFHHFF", "FFFHHFFFFH","HFHFFFHHFF","FFFFFFHFFF","HFFHFFHHFH", "FFFHFFFHFF", "FFHFFHHFFH", "FFHHFFHFFG"]
maze = [list(row) for row in maze]
env = gym.make('FrozenLake-v1', desc=maze, render_mode='rgb_array', is_slippery=False)

numStates = env.observation_space.n
numActions = env.action_space.n
Q = {state: [0] * numActions for state in range(numStates)}


In [None]:
def getReward(state):
    """Map a flat state index to the reward for entering that cell."""
    row = state // len(maze[0])
    col = state % len(maze[0])
    cell = maze[row][col]  # Cell type at (row, col): S, F, H, or G

    if cell == "G":
        return 100
    elif cell == "H":
        return -100
    else:
        return -1


In [None]:
def updateQTable(q, alpha, gamma, state, next_state, action):
    current_q = q[state][action]
    reward = getReward(next_state)

    # Make sure next state has valid Q values
    if next_state in q:
        next_max_q = max(q[next_state])  
    else:
        next_max_q = 0  # Default to 0 if the next state is unknown

    # Update formula
    new_q = (1 - alpha) * current_q + alpha * (reward + gamma * next_max_q)
    
    q[state][action] = new_q  # Store updated Q value
    
    return new_q  # Return the updated value so callers can log it


In [None]:
import random  # Used for occasional random exploration

def chooseAction(q, state, epsilon=0.1):
    """Select the best-known action, with a small chance of exploring a random one."""
    if random.random() < epsilon:  # Explore with probability epsilon (10% by default)
        return random.randrange(len(q[state]))  # Any valid action index

    return q[state].index(max(q[state]))  # Otherwise pick the action with the highest Q-value


**The reward system is as follows:** Goal = +100, Frozen Lake = -1, Hole = -100. These are the values returned by `getReward`. The -1 cost on every frozen cell rewards shorter paths to the goal, and the large hole penalty teaches the agent to avoid holes. Entering a hole also ends the episode.
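
To see how these rewards feed the Q-table, here is a small worked example of a single Q-learning update. It assumes the learning rate `alpha = 0.1` and discount factor `gamma = 0.9` from the training cell, and a Q-table that is still all zeros. The function name `q_update` is illustrative and mirrors the formula in `updateQTable`.

```python
def q_update(current_q, reward, next_max_q, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update for a single (state, action) pair."""
    return (1 - alpha) * current_q + alpha * (reward + gamma * next_max_q)

# First visit to each kind of cell, starting from Q = 0 everywhere
for label, reward in [("goal", 100), ("hole", -100), ("frozen", -1)]:
    print(label, round(q_update(0.0, reward, 0.0), 3))
```

With these settings, each update moves the estimate one tenth of the way toward its target, so the first step into the goal yields a Q-value of about 10 and later visits keep closing the gap toward 100.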

### 2. Training Your Model
In the cell seen below, write the code you need to train a Q-Learning model. Display your final Q-table once you are done training your model.

*Note, level 5 work uses only the standard Python library and Pandas to train your Q-Learning model. Level 4 work uses external libraries like Stable-Baselines3.*

In [None]:
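alpha = 0.1  # Learning rate
gamma = 0.9  # Discount factor

for episode in range(10000):
    current_state, _ = env.reset()
    terminated = truncated = False

    # Stop on a terminal cell (goal or hole) or when the step limit truncates the episode
    while not (terminated or truncated):
        action = chooseAction(Q, current_state)  # Mostly greedy, sometimes random
        # The environment's own reward is unused; updateQTable calls getReward instead
        new_state, reward, terminated, truncated, info = env.step(action)

        new_q = updateQTable(Q, alpha, gamma, current_state, new_state, action)

        current_state = new_state

    # Log progress occasionally instead of printing every step
    if (episode + 1) % 1000 == 0:
        print(f"Finished episode {episode + 1}")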



In [None]:
data = pd.DataFrame.from_dict(Q, orient="index", columns=["Left", "Down", "Right", "Up"])
print(data.head())  # Show Q-table
data

### 3. Testing Your Model
In the cell seen below, write the code you need to test your Q-Learning model for **1000 episodes**. It is important to test your model for 1000 episodes so that we are all able to compare our results.

*Note, level 5 testing uses both a success rate and an average steps taken metric to evaluate your model. Level 4 uses one or the other.*
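
Both metrics can be collected in one pass over the test episodes. The sketch below is one possible approach, not a required one. The function name `evaluate` and the `max_steps` safety cap are hypothetical. It assumes the environment follows the Gymnasium `reset`/`step` API and pays a positive reward only on reaching the goal, as Frozen Lake does by default. Actions are chosen greedily, with no exploration, and average steps are computed over successful episodes only.

```python
def evaluate(env, q, episodes=1000, max_steps=1000):
    """Run greedy episodes and return (success_rate, avg_steps_on_success)."""
    successes = 0
    steps_on_success = 0
    for _ in range(episodes):
        state, _info = env.reset()
        terminated = truncated = False
        steps = 0
        reward = 0
        # max_steps guards against a greedy policy that loops forever
        while not (terminated or truncated) and steps < max_steps:
            action = q[state].index(max(q[state]))  # Greedy action from the Q-table
            state, reward, terminated, truncated, _info = env.step(action)
            steps += 1
        # Assumption: a positive final reward means the goal was reached
        if terminated and reward > 0:
            successes += 1
            steps_on_success += steps
    success_rate = successes / episodes
    avg_steps = steps_on_success / successes if successes else float("nan")
    return success_rate, avg_steps
```

With the names used in this notebook, a call such as `evaluate(env, Q)` would run on the non-rendering training environment. A `render_mode='human'` environment would work too, but drawing 1000 episodes is slow.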

In [None]:
# Initialize test environment (render_mode='human' opens a window for a visual check)
envTest = gym.make('FrozenLake-v1', desc=maze, render_mode='human', is_slippery=False)
current_state, _ = envTest.reset()
terminated = truncated = False

# Play one rendered episode; stop at the goal, a hole, or the step limit
while not (terminated or truncated):
    action = Q[current_state].index(max(Q[current_state]))  # Greedy action, no exploration
    new_state, reward, terminated, truncated, info = envTest.step(action)
    envTest.render()
    current_state = new_state

envTest.close()
