What is Monte Carlo:
Monte Carlo methods in reinforcement learning (RL) work by sampling multiple episodes, or trajectories, of the agent's interactions with the environment. An episode is a sequence of states, actions, and rewards from the initial state until the task terminates. By sampling these episodes, Monte Carlo methods can estimate the expected return (the cumulative, discounted reward) for a state or for a state-action pair.
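
The return that Monte Carlo methods average can be computed from the rewards of a single episode. The following is a minimal sketch, assuming the rewards are stored in a plain list and discounted with a factor `gamma`; the function name `discounted_returns` is hypothetical and not part of the original notes.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, G_1, ...] where G_t = rewards[t] + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    G = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Walking the episode backwards keeps the computation linear in the episode length.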

To estimate the value of a state, Monte Carlo methods average the observed returns obtained from all episodes in which the state was encountered. Similarly, to estimate the value of an action, the observed returns for episodes in which the action was taken from a particular state are averaged. These estimates provide valuable insights into the quality or desirability of states or actions, which enable the agent to make informed decisions.
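
The averaging described above can be sketched in a few lines. This is an illustrative every-visit estimator under stated assumptions: each episode is a list of `(state, reward)` pairs, where `reward` is what the agent received after leaving `state`, and the name `mc_state_values` is hypothetical.

```python
from collections import defaultdict


def mc_state_values(episodes, gamma=0.99):
    """Every-visit Monte Carlo estimate of V(s) from recorded episodes."""
    returns = defaultdict(list)  # state -> list of observed returns
    for episode in episodes:
        G = 0.0
        # Accumulate the discounted return backwards through the episode.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)
    # The value estimate is the plain average of the observed returns.
    return {state: sum(gs) / len(gs) for state, gs in returns.items()}


episodes = [
    [("A", 1.0), ("B", 0.0)],
    [("A", 0.0), ("B", 2.0)],
]
values = mc_state_values(episodes, gamma=1.0)
# With gamma = 1: V(A) = (1 + 2) / 2 = 1.5 and V(B) = (0 + 2) / 2 = 1.0
```

Estimating action values works the same way, with `(state, action)` pairs as the dictionary keys instead of states.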

By iteratively updating the value estimates using Monte Carlo methods, RL algorithms can converge towards better policies. The estimated values guide the agent in selecting actions that lead to higher expected returns over time. This iterative process of estimating values and improving policies is a fundamental aspect of RL.

Monte Carlo methods in RL offer several advantages. They are model-free, meaning that they do not require a model of the environment's dynamics. Instead, they learn directly from interaction. Monte Carlo methods are also well suited to episodic tasks, where episodes have a natural termination point, because a return can only be computed once the episode has ended.

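import numpy as np

def on_policy_mc_control(policy, action_values, episodes, gamma=0.99, epsilon=0.2):
    # Assumes a global `env` with the classic Gym API: reset() returns a hashable
    # state and step() returns (next_state, reward, done, info).
    # `policy` is assumed to read `action_values`, so updating them improves it.
    sa_returns = {}  # (state, action) -> list of observed returns
    for episode in range(1, episodes + 1):
        state = env.reset()
        done = False
        transitions = []
        # Generate one full episode by following the current policy.
        while not done:
            action = policy(state, epsilon)
            next_state, reward, done, _ = env.step(action)
            transitions.append((state, action, reward))
            state = next_state
        # Walk backwards so that G accumulates the discounted return of each step.
        G = 0
        for state_t, action_t, reward_t in reversed(transitions):
            G = reward_t + gamma * G
            if (state_t, action_t) not in sa_returns:
                sa_returns[(state_t, action_t)] = []
            sa_returns[(state_t, action_t)].append(G)
            # Every-visit update: Q(s, a) is the mean of all returns seen so far.
            action_values[state_t][action_t] = np.mean(sa_returns[(state_t, action_t)])
    return action_values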
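
The function above stores every observed return and recomputes the mean on each update. The same average can be maintained without storing the list by keeping a running mean and a visit count. The following is a minimal sketch of that alternative, not the method used in the notes; the helper name `incremental_mean_update` is hypothetical.

```python
def incremental_mean_update(q, n, G):
    """Return the updated mean and count after observing one more return G."""
    n += 1
    q += (G - q) / n  # move the estimate towards G by a 1/n step
    return q, n


observed_returns = [4.0, 2.0, 9.0]
q, n = 0.0, 0
for G in observed_returns:
    q, n = incremental_mean_update(q, n, G)
# q now equals the plain average of observed_returns
```

This trades the per-pair list of returns for two numbers per state-action pair, which matters when many episodes are sampled.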

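Advantages of Monte Carlo:

The estimate for one state does not depend on the estimates of the other states. The cost of estimating a single state's value is therefore independent of the total number of states, so irrelevant states can be skipped, which makes Monte Carlo more efficient than dynamic programming (DP) when only some states matter. There is also no need to know the dynamics of the environment.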

How does it operate:
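Monte Carlo control follows the paths that it currently perceives as "optimal" based on its q-value estimates. However, an entire path may exist that is more advantageous than the chosen one, yet it is never tried because one of its actions currently has a lower q-value estimate. The agent therefore needs a mechanism that keeps it exploring.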

Exploring Starts:
Whenever the agent interacts with the environment to gather experience, it begins in a randomly selected initial state and takes an initial random action.
Every state-action pair therefore has a chance of starting an episode, so all q-values eventually get updated.
This approach is often unrealistic, since there are numerous tasks where we cannot choose the initial state and action.
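
Exploring Starts can be sketched as an episode generator that draws the first state and action at random and follows the policy afterwards. This is an illustration under stated assumptions: the environment is given as a hypothetical `step(state, action)` function returning `(next_state, reward, done)`, and the chain task below is a toy invented for the example.

```python
import random


def episode_with_exploring_start(states, actions, step, policy, rng, max_steps=100):
    """Generate one episode whose first state and action are chosen at random."""
    state = rng.choice(states)
    action = rng.choice(actions)  # random first action, regardless of the policy
    transitions = []
    for _ in range(max_steps):
        next_state, reward, done = step(state, action)
        transitions.append((state, action, reward))
        if done:
            break
        state = next_state
        action = policy(state)  # after the first step, follow the policy
    return transitions


# Toy chain (hypothetical): states 0..2, action 1 moves right, action 0 moves left.
# Reaching position 3 ends the episode with reward 1.
def chain_step(state, action):
    next_state = state + 1 if action == 1 else max(state - 1, 0)
    done = next_state == 3
    return next_state, (1.0 if done else 0.0), done


rng = random.Random(0)
sampled = [
    episode_with_exploring_start([0, 1, 2], [0, 1], chain_step, lambda s: 1, rng)
    for _ in range(200)
]
starts = {(ep[0][0], ep[0][1]) for ep in sampled}
```

Even though the policy here always moves right, the random starts mean that episodes beginning with the left action are sampled too.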

Stochastic Policies:
These policies assign a non-zero probability to every action. This ensures that the agent occasionally takes an action that it doesn't perceive as optimal, thereby improving its understanding of the task.
Implementing stochastic policies is simpler than implementing Exploring Starts.
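
A common stochastic policy of this kind is epsilon-greedy, which matches the `policy(state, epsilon)` call signature used in the control function earlier. The following is a minimal sketch, assuming `action_values` maps each state to an array of q-values; the factory name `make_epsilon_greedy_policy` is hypothetical.

```python
import numpy as np


def make_epsilon_greedy_policy(action_values, rng=None):
    """Build a policy(state, epsilon) that reads q-values from `action_values`."""
    if rng is None:
        rng = np.random.default_rng()

    def policy(state, epsilon=0.2):
        q = np.asarray(action_values[state])
        # With probability epsilon, explore: pick any action uniformly at random.
        if rng.random() < epsilon:
            return int(rng.integers(len(q)))
        # Otherwise exploit: pick a best action, breaking ties at random.
        best = np.flatnonzero(q == q.max())
        return int(rng.choice(best))

    return policy


q_table = {0: np.array([0.0, 1.0, 0.5])}
policy = make_epsilon_greedy_policy(q_table, np.random.default_rng(0))
greedy_action = policy(0, epsilon=0.0)  # always the best-valued action, here 1
```

Because the policy reads `q_table` on every call, later updates to the q-values change its behaviour without rebuilding it.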

There are two ways of learning with stochastic policies:

On-policy learning:
Generates samples using the same policy that we aim to optimize.

Off-policy learning:
Generates samples using an exploratory policy that differs from the one we intend to optimize.
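
The notes do not say how samples from the exploratory policy are reused for the policy being optimized. One standard technique for this is importance sampling, which reweights each return by how much more (or less) likely the trajectory is under the target policy. The following is a hedged sketch of the weight computation only; `importance_ratio`, `target_prob`, and `behavior_prob` are hypothetical names introduced for this example.

```python
def importance_ratio(transitions, target_prob, behavior_prob):
    """Product of target/behavior action probabilities along one trajectory."""
    rho = 1.0
    for state, action, _reward in transitions:
        rho *= target_prob(state, action) / behavior_prob(state, action)
    return rho


# Target policy: always takes action 1. Behavior policy: uniform over two actions.
def target_prob(state, action):
    return 1.0 if action == 1 else 0.0


def behavior_prob(state, action):
    return 0.5


on_target = [("s0", 1, 0.0), ("s1", 1, 1.0)]   # the target policy could produce this
off_target = [("s0", 0, 0.0), ("s1", 1, 1.0)]  # the target policy never takes action 0

ratio_on = importance_ratio(on_target, target_prob, behavior_prob)    # (1 / 0.5) ** 2
ratio_off = importance_ratio(off_target, target_prob, behavior_prob)  # zero weight
```

The behavior policy must give a non-zero probability to every action the target policy might take, otherwise the ratio is undefined.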

Interesting Stuff: Monte Carlo methods share aspects with dynamic programming, as both alternate between policy evaluation and policy improvement.