# Monte-Carlo FrozenLake
This section explores the Monte Carlo method for reinforcement learning on the FrozenLake environment.




## Background Theory

## Code

In [27]:
import numpy as np
import gymnasium as gym

# Initialize the environment
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True, render_mode="ansi")  # is_slippery=True makes transitions stochastic
state, _ = env.reset()  # reset() returns (observation, info)

t = 0
# Simulate several steps by following a random policy
for i in range(100):
    action = env.action_space.sample()
    next_state, reward, term, trunc, info = env.step(action)
    print(f"Time {t:2d}  |  s_t {state:3d}  |  a_t {action:2d}  |  s_t+1 {next_state:3d}  |  reward {reward:.2f}  |  terminated {term:2}  |  {info}")
    # print(env.render())
    state = next_state; t += 1
    if term or trunc:
        break

Time  0  |  s_t   0  |  a_t  2  |  s_t+1   4  |  reward 0.00  |  terminated  0  |  {'prob': 0.3333333333333333}
Time  1  |  s_t   4  |  a_t  1  |  s_t+1   5  |  reward 0.00  |  terminated  1  |  {'prob': 0.3333333333333333}


In [18]:
state_size = env.observation_space.n
action_size = env.action_space.n
q_table = np.zeros((state_size,action_size))

In Monte Carlo RL, the agent plays out complete episodes and records the resulting trajectory of states, actions, and rewards. The return $G_t$ observed after each visited state-action pair is then used to update the state-action value $Q(S_t, A_t)$.

The following pieces need to be implemented:
1. A function that plays out one episode (trajectory) by stepping the agent through the environment until a termination condition is reached. The trajectory is returned as three lists: states, rewards, and actions.
2. A function that computes the cumulative discounted return $G_t$ at every time step, given the list of rewards along a trajectory generated by the function above.
3. An array that stores the tabular Q-values, mapping each state-action pair $(s_t, a_t)$ to $Q(s_t, a_t)$.
4. An array that stores the number of occurrences of each state-action pair, $N(s_t, a_t)$.
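
The pieces listed above can be tied together in a single update step. The following is a minimal sketch, not the notebook's own implementation: `update_q` and `n_table` are hypothetical names, and the update shown is the every-visit incremental mean $Q(s,a) \leftarrow Q(s,a) + \frac{1}{N(s,a)}\,\big(G_t - Q(s,a)\big)$. It assumes `states`, `actions`, and `returns` are aligned so that index `t` holds $s_t$, $a_t$, and $G_t$.

```python
import numpy as np

def update_q(q_table, n_table, states, actions, returns):
    """Every-visit Monte Carlo update (hypothetical helper, for illustration)."""
    # zip stops at the shortest input, so the trailing terminal state is skipped
    for s, a, g in zip(states, actions, returns):
        n_table[s, a] += 1
        # Incremental mean: move Q towards the observed return by 1 / N
        q_table[s, a] += (g - q_table[s, a]) / n_table[s, a]
    return q_table, n_table

# Toy trajectory, assuming 16 states and 4 actions as in the 4x4 map
q_table = np.zeros((16, 4))
n_table = np.zeros((16, 4))
states, actions, returns = [0, 0, 4, 8], [1, 1, 1], [0.81, 0.9, 1.0]
q_table, n_table = update_q(q_table, n_table, states, actions, returns)
```

In this toy run the pair $(0, 1)$ is visited twice, so its Q-value ends up as the average of the two observed returns.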



### Question to self
1. How can I specify a policy to control the agent? Thus far, the actions passed to env.step() have been chosen at random.
- Earlier on, the action was sampled at random from the environment's action space with env.action_space.sample(). This action was then passed to env.step() to move the agent by one step.
- With MC control, the Q-values of all state-action pairs are updated gradually, and the action is chosen according to an epsilon-greedy policy with respect to those Q-values.
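
As a rough illustration of the epsilon-greedy choice described above, the sketch below picks a random action with probability `epsilon` and a greedy action otherwise. The function name `epsilon_greedy_action`, the default `epsilon=0.1`, and the random tie-breaking are assumptions made for this example, not part of the notebook.

```python
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon=0.1, rng=None):
    """Pick a random action with probability epsilon, else a greedy one."""
    rng = np.random.default_rng() if rng is None else rng
    n_actions = q_table.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    # Exploit; break ties at random so an all-zero Q-table does not always pick action 0
    best = np.flatnonzero(q_table[state] == q_table[state].max())
    return int(rng.choice(best))

# Toy Q-table in which action 2 is clearly best in state 0
q_demo = np.zeros((16, 4))
q_demo[0, 2] = 1.0
action = epsilon_greedy_action(q_demo, state=0, epsilon=0.0)
```

In an episode loop, the returned action would take the place of `env.action_space.sample()`.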

2. How can I display the state-action values on the environment grid to visualize the policy?
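
One possible answer is a plain-text grid showing the greedy action in each cell. The sketch below is only an illustration: `render_greedy_policy` is a hypothetical helper, the default 4x4 map is hard-coded as an assumption rather than read from `env`, and states are assumed to be numbered row by row (state = row * 4 + column). The action ids follow FrozenLake's convention (0 = left, 1 = down, 2 = right, 3 = up).

```python
import numpy as np

ARROWS = ["←", "↓", "→", "↑"]                 # action ids 0=left, 1=down, 2=right, 3=up
MAP_4X4 = ["SFFF", "FHFH", "FFFH", "HFFG"]    # default 4x4 layout, hard-coded here

def render_greedy_policy(q_table, desc=MAP_4X4):
    """Return a text grid with the greedy action per cell; holes (H) and goal (G) keep their letter."""
    n_cols = len(desc[0])
    lines = []
    for r, row in enumerate(desc):
        cells = []
        for c, tile in enumerate(row):
            if tile in "HG":
                cells.append(tile)            # terminal cells have no meaningful action
            else:
                cells.append(ARROWS[int(np.argmax(q_table[r * n_cols + c]))])
        lines.append(" ".join(cells))
    return "\n".join(lines)

# Toy Q-table in which "down" is best everywhere
q_demo = np.zeros((16, 4))
q_demo[:, 1] = 1.0
grid = render_greedy_policy(q_demo)
print(grid)
```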



3. Are there alternatives to the epsilon-greedy policy?
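
One commonly used alternative is softmax (Boltzmann) exploration, where actions are sampled with probability proportional to $\exp(Q(s,a)/\tau)$. The sketch below is an assumption-laden illustration rather than something this notebook uses: `softmax_action` and the `temperature` parameter $\tau$ are names chosen for the example.

```python
import numpy as np

def softmax_action(q_table, state, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    rng = np.random.default_rng() if rng is None else rng
    prefs = q_table[state] / temperature
    prefs = prefs - prefs.max()               # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs)), probs

# With an all-zero Q-table every action is equally likely
q_demo = np.zeros((16, 4))
action, probs = softmax_action(q_demo, state=0)
```

A high temperature makes the choice close to uniform, while a low temperature makes it close to greedy.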


In [296]:
def get_returns(rewards, gamma=0.9):
    ''' Calculate the discounted return at each time step, given a list of rewards.

    For each step t of the trajectory (of length T):
    - Extract the rewards from that step onward
    - Multiply the i-th of these rewards by gamma ** i:
        the first reward, received on leaving the state, is not discounted
        the last reward of the trajectory is discounted by gamma ** (T - 1 - t)
    - Sum these values to obtain the return at step t
    '''
    returns = np.zeros(len(rewards))
    
    for step in range(len(rewards)):
        step_rewards = rewards[step:]            # rewards from the current step onward

        # Discount the i-th remaining reward by gamma ** i, then sum
        returns[step] = sum(gamma ** i * r for i, r in enumerate(step_rewards))
    return returns
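
The return also satisfies the recursion $G_t = R_{t+1} + \gamma\, G_{t+1}$, so the same values can be obtained in one backward pass over the rewards instead of re-summing the tail at every step. The sketch below shows that variant; `get_returns_backward` is a hypothetical name, and it assumes the same convention as above, where `rewards[t]` is the reward received on leaving the state at step `t`.

```python
import numpy as np

def get_returns_backward(rewards, gamma=0.9):
    """Compute discounted returns in a single backward pass (illustrative variant)."""
    returns = np.zeros(len(rewards))
    running = 0.0                              # return after the final step is zero
    for step in reversed(range(len(rewards))):
        running = rewards[step] + gamma * running
        returns[step] = running
    return returns

# Toy trajectory: a single reward of 1 at the last step
demo = get_returns_backward([0.0, 0.0, 1.0])
```

For long episodes this takes time linear in the episode length rather than quadratic.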

In [None]:
states, rewards, actions = get_one_episode()

print(get_returns(rewards))
print(states)
print(rewards)

[0.03090315 0.03433684 0.03815204 0.04239116 0.04710129 0.05233476
 0.05814974 0.06461082 0.0717898  0.07976644 0.08862938 0.09847709
 0.10941899 0.12157665 0.13508517 0.15009464 0.16677182 0.18530202
 0.20589113 0.22876792 0.25418658 0.28242954 0.3138106  0.34867844
 0.38742049 0.43046721 0.4782969  0.531441   0.59049    0.6561
 0.729      0.81       0.9        1.        ]
[0, 4, 0, 4, 4, 0, 0, 1, 0, 0, 1, 1, 0, 4, 0, 0, 0, 4, 8, 8, 8, 9, 13, 13, 14, 13, 14, 14, 14, 14, 10, 14, 10, 14, 15]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]


In [51]:
def get_one_episode():
    ''' Play one full episode with a random policy and return the trajectory.

    Episode end conditions:
    - terminated: the agent falls into a hole or reaches the goal
    - truncated: the environment's time limit is hit

    Returns:
    - states (including the initial and final state), rewards, actions
    '''
    states = []
    rewards = []
    actions = []

    initial_state, _ = env.reset()
    states.append(initial_state)

    term = False
    trunc = False

    while (not term) and (not trunc):
        action = env.action_space.sample()
        next_state, reward, term, trunc, _ = env.step(action)
        
        rewards.append(reward)
        actions.append(action)
        states.append(next_state)


    return states, rewards, actions

    