## Frozen Lake Domain Description

Frozen Lake is a simple grid-world environment where an agent navigates a frozen lake to reach a goal while avoiding falling into holes. The environment is represented as a grid, with each cell being one of the following:

* **S**: Starting position of the agent
* **F**: Frozen surface, safe to walk on
* **H**: Hole, falling into one ends the episode with a reward of 0
* **G**: Goal, reaching it ends the episode with a reward of 1

The agent can take four actions:

* **0: Left**
* **1: Down**
* **2: Right**
* **3: Up**

However, due to the slippery nature of the ice, the agent might not always move in the intended direction. There's a chance it moves perpendicular to the intended direction.




In [5]:
import gym

# Create the environment
env = gym.make('FrozenLake-v1', render_mode='ansi')  # 'ansi' mode for text-based rendering

# Reset the environment to the initial state
observation, _= env.reset()

# Take a few actions and observe the results
for _ in range(5):
    action = env.action_space.sample()  # Choose a random action
    observation, reward, done, trunicated ,info = env.step(action)
    # Render the environment to see the agent's movement (text-based)
    if done:
        observation, _= env.reset()
    else:
      rendered = env.render()
      if len(rendered) > 1:  # Check if there's a second element
         print(rendered[1])  # Print the second element
# Close the environment
env.close()

 
 
 
 


The transition model for the Frozen Lake world describes how the agent's actions affect its movement and the resulting state transitions. Here's a breakdown of the key components:

**Actions:**

* The agent can choose from four actions:
    * 0: Left
    * 1: Down
    * 2: Right
    * 3: Up

**State Transitions:**

* **Intended Movement:** Ideally, the agent moves one cell in the chosen direction.
* **Slippery Ice:** Due to the slippery nature of the ice, there's a probability that the agent will move in a perpendicular direction instead of the intended one. The exact probabilities depend on the specific Frozen Lake environment configuration, but typically:
    * **Successful Move:** The agent moves in the intended direction with a high probability.
    * **Perpendicular Move:** The agent moves 90 degrees to the left or right of the intended direction with a lower probability.
* **Boundaries:** If the intended movement would take the agent outside the grid boundaries, it remains in its current position.
* **Holes:** If the agent lands on a hole ("H"), the episode ends, and it receives a reward of 0.
* **Goal:** If the agent reaches the goal ("G"), the episode ends, and it receives a reward of 1.




In [8]:
import gym

# Create the environment
env = gym.make('FrozenLake-v1', render_mode='ansi')  # 'ansi' mode for text-based rendering

# Reset the environment to the initial state
observation = env.reset()

# Take a few actions and observe the results
for _ in range(5):
    action = env.action_space.sample()  # Choose a random action
    observation, reward, done, trunicated, info = env.step(action)
    # Render the environment to see the agent's movement (text-based)
    if done:
        observation= env.reset()
    else:
      rendered = env.render()
      if len(rendered) > 1:  # Check if there's a second element
         print(rendered[1])  # Print the second element
# Close the environment
env.close()
print ("State 14 Going Right: (s, a, r, Done)", env.P[14][2])

 
 
 
 
State 14 Going Right: (s, a, r, Done) [(0.3333333333333333, 14, 0.0, False), (0.3333333333333333, 15, 1.0, True), (0.3333333333333333, 10, 0.0, False)]


In [19]:
import gym
import numpy as np

# Create FrozenLake environment
env = gym.make("FrozenLake-v1")

def value_iteration(env, gamma=0.9, num_iterations=1000):
    """
    Implements the Value Iteration algorithm.

    Args:
        env: The OpenAI Gym environment.
        gamma: Discount factor.
        num_iterations: Number of iterations to run.

    Returns:
        The optimal value function and policy.
    """

    # Initialize value function and policy
    V = np.zeros(env.observation_space.n)
    policy = np.zeros(env.observation_space.n, dtype=int) 

    for i in range(num_iterations):
        # For each state in the environment
        for state in range(env.observation_space.n):
            # Initialize an array to store Q-values for all actions
            Q_values = np.zeros(env.action_space.n)

            # For each action in the current state, calculate the Q-value
            for action in range(env.action_space.n):
                for prob, next_state, reward, done in env.P[state][action]:
                    # Implement the Bellman update to compute the Q-value for the current action
                    Q_values[action] += prob * (reward + gamma * V[next_state])

            # Update V[state] with the maximum Q-value (best action to take in this state)
            V[state] = np.max(Q_values)

            # Update the policy to take the action with the highest Q-value
            policy[state] = np.argmax(Q_values)

    return V, policy


# Apply Value Iteration
optimal_V, optimal_policy = value_iteration(env)

# Evaluate the optimal policy
def evaluate_policy(env, policy, num_episodes=100):
    total_reward = 0
    for _ in range(num_episodes):
        state, _ = env.reset()  
        done = False
        while not done:
            action = policy[state]
            state, reward, done, _, _ = env.step(action)  
            total_reward += reward
    return total_reward / num_episodes

average_reward = evaluate_policy(env, optimal_policy)
print(f'Optimal Policy:\n {optimal_policy}')
print(f'Optimal V: \n{optimal_V}')

print(f'Average Reward: \n {average_reward}')


Optimal Policy:
 [0 3 0 3 0 0 0 0 3 1 0 0 0 2 1 0]
Optimal V: 
[0.0688909  0.06141457 0.07440976 0.05580732 0.09185454 0.
 0.11220821 0.         0.14543635 0.24749695 0.29961759 0.
 0.         0.3799359  0.63902015 0.        ]
Average Reward: 
 0.77


  if not isinstance(terminated, (bool, np.bool8)):


In [60]:
import gym
import numpy as np

# Create FrozenLake environment
env = gym.make("FrozenLake-v1")


def Q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.9, epsilon=1.0, epsilon_decay=0.9, min_epsilon=0.1):
  
    """
    Implements the Q-Learning algorithm.

    Args:
        env: The OpenAI Gym environment.
        num_episodes: Number of iterations to run.
        gamma: Discount factor.
        Alpha: Learning rate
        Epsilon: Exploration factor (1 = Agent's first step will always be in favor of exploration)
        Epsilon_decay: Decay factor for epsilon (Decay factor closer to 1, means the agent will favor exploation early on, but explotation later down the line)
        Min_epsilon: Even with decay, this ensures the agent will always explore sometimes (0.1 = 10% exploration after epsilon is decayed below 0.1)

    Returns:
        Q-Table, and the learned policy.
    """


    # Initialize value function and policy
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for episode in range(num_episodes):
        state,_ = env.reset()
        done=False

        while not done:
            if np.random.uniform(0,1) < epsilon:
                # Exploration
                action=env.action_space.sample()
            else:
                # Exploitation
                action = np.argmax(Q[state, :])
    
            # Step
            next_state, reward, done, _, _ = env.step(action)

            best_next_action = np.argmax(Q[next_state, :])

            # Q-learning update rule
            Q[state][action] = Q[state][action] + alpha*(reward + gamma*(Q[next_state][best_next_action] - Q[state][action]))

            # Update state
            state = next_state

        # Update epsilon with decay
        epsilon = max(min_epsilon, (epsilon*epsilon_decay))



    # Policy takes action with highest Q-value
    policy = np.argmax(Q, axis=1)

    return Q, policy

# Apply Value Iteration
Q, learned_policy = Q_learning(env)

# Evaluate the optimal policy
def evaluate_policy(env, policy, num_episodes=100):
    total_reward = 0
    for _ in range(num_episodes):
        state, _ = env.reset()  
        done = False
        while not done:
            action = policy[state]
            state, reward, done, _, _ = env.step(action)  
            total_reward += reward
    return total_reward / num_episodes

average_reward = evaluate_policy(env, learned_policy)
print(f'learned Policy: {learned_policy}')
print(f'Q Table:\n {Q}')
print(f'Average Reward:  {average_reward}')


learned Policy: [0 3 0 0 0 0 2 0 3 1 0 0 0 2 1 0]
Q Table:
 [[0.78235308 0.76323944 0.74521576 0.74639802]
 [0.56605467 0.31096697 0.43277967 0.70974904]
 [0.56767806 0.40492429 0.32293506 0.47018553]
 [0.39244805 0.06911799 0.02250614 0.07206191]
 [0.78731757 0.53722653 0.44319856 0.43830237]
 [0.         0.         0.         0.        ]
 [0.29681539 0.12781937 0.40549626 0.14848382]
 [0.         0.         0.         0.        ]
 [0.61442485 0.38835044 0.42355092 0.7981231 ]
 [0.43180419 0.78353947 0.42479198 0.65756888]
 [0.69833016 0.39997187 0.54790561 0.39005607]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.68530748 0.63629705 0.94108966 0.6004221 ]
 [0.88035729 1.00079521 0.98584869 0.93634979]
 [0.         0.         0.         0.        ]]
Average Reward:  0.76
