# **Introduction**

This notebook implements a Monte Carlo reinforcement learning method on the Frozen Lake environment provided by Gymnasium. Gymnasium is an open-source Python library for developing and comparing reinforcement learning algorithms through a standardized API. Its four key functions are ```make()```, ```Env.reset()```, ```Env.step()```, and ```Env.render()```.

As per its [introductory documentation](https://gymnasium.farama.org/introduction/basic_usage/), the core of Gymnasium lies in the high-level Python class ```Env```, which approximately represents a Markov Decision Process (MDP) from reinforcement learning theory. This class allows users of Gymnasium to start new episodes, take actions, and visualize the agent's current state. 

# **Import Packages**

This section imports the necessary packages.

In [81]:
# inclusions:
import gymnasium as gym
import numpy as np
from collections import defaultdict

# **Environment Setup**

This section sets up the environment and defines the relevant functions needed for this implementation. 
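
For reference, the epsilon-greedy rule that `get_action_probs` in the cell below appears to implement can be written as follows, where $m$ is the number of actions and $\epsilon$ is the agent's `epsilon` parameter:

$$
\pi(a \mid s) =
\begin{cases}
\dfrac{\epsilon}{m} + 1 - \epsilon, & a = \arg\max_{a'} Q(s, a') \\
\dfrac{\epsilon}{m}, & \text{otherwise}
\end{cases}
$$

With $\epsilon = 0.5$ and $m = 4$, the greedy action gets $0.125 + 0.5 = 0.625$ and every other action gets $0.125$, so the probabilities sum to 1.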

In [None]:
# MC-Agent Class:
class GLIE_MC_Agent:
        # constructor:
        def __init__(self, env: gym.Env, epsilon: float, gamma: float):
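                """
                constructor for the agent. this is a Monte Carlo agent: it estimates each Q(s,a) by averaging the
                returns observed for that state-action pair, updating at the end of each episode.

                env: a gymnasium environment with discrete observation and action spaces
                epsilon: the exploration rate of the epsilon-greedy policy (each action gets probability epsilon/nA,
                         and the greedy action gets an additional 1 - epsilon)
                gamma: the discount factor applied to future rewards

                attributes created here:
                Q: the estimate of the action-value function q, initialized to zeros over all states and actions
                returns_count: N(s,a), the number of times each state-action pair has been visited
                """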
                # object parameters:
                self.env = env
                self.epsilon = epsilon
                self.gamma = gamma

                # get the number of states, number of actions:
                nS, nA = env.observation_space.n, env.action_space.n

                # tabular Q-values, and counter N(s,a):
                self.Q = np.zeros((nS, nA))
                self.returns_count = np.zeros((nS, nA), dtype = int)         # how many times I have been to a state, and taken an action         

                # report the action and observation spaces to the user:
                print(f"Action Space is: {env.action_space}")
                print(f"Observation Space is: {env.observation_space}\n")

        def get_action_probs(self, q_values):
                # q_values is the 1-D array of Q-values for a single state, e.g. self.Q[state]

                # get the number of available actions:
                m = len(q_values)

                # assign each action a base probability of e/m:
                p = np.ones(m)*(self.epsilon/m)

                # find the index of the best Q-value (ties go to the lowest index):
                best = np.argmax(q_values)

                # give that action additional probability equal to (1 - e):
                p[best] += 1.0 - self.epsilon

                # return the probability of selecting each action:
                return p
        
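        def policy(self, state):
                # compute the epsilon-greedy action probabilities for this state:
                probs = self.get_action_probs(self.Q[state])

                # sample an action according to those probabilities:
                return np.random.choice(len(probs), p = probs)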

# create training environment:
env = gym.make("FrozenLake-v1", render_mode = "human")
agent = GLIE_MC_Agent(env = env, epsilon = 0.5, gamma = 0.99)


Action Space is: Discrete(4)
Observation Space is: Discrete(16)



array([0.625, 0.125, 0.125, 0.125])
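
The constructor's docstring describes averaging the returns for each Q(s,a) at the end of an episode, but the class shown here does not yet include that step. The sketch below is one possible way to do it, using an every-visit incremental mean. The function name `mc_update` and the three-step episode are hypothetical, and the tables are plain `defaultdict` objects so the example runs on its own.

```python
from collections import defaultdict

def mc_update(Q, N, episode, gamma):
    """Every-visit Monte Carlo update from one finished episode.

    episode: list of (state, action, reward) tuples in the order they occurred
    Q, N: tables indexed by [state, action] holding value estimates and visit counts
    """
    G = 0.0
    # walk the episode backwards so the return G = reward + gamma * G accumulates in one pass
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        N[state, action] += 1
        # incremental mean: move Q toward G by 1/N of the difference
        Q[state, action] += (G - Q[state, action]) / N[state, action]
    return Q, N

Q = defaultdict(float)
N = defaultdict(int)

# hypothetical episode: the reward of 1 arrives only on the last step
episode = [(0, 2, 0.0), (1, 1, 0.0), (5, 2, 1.0)]
mc_update(Q, N, episode, gamma=0.5)

print(Q[5, 2], Q[1, 1], Q[0, 2])  # 1.0 0.5 0.25
```

Because the function indexes with `Q[state, action]`, it should also accept the agent's NumPy arrays `self.Q` and `self.returns_count` in place of the dictionaries. Decaying `epsilon` between episodes, as the GLIE in the class name suggests, is not shown.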