# 10.3.2 Monte Carlo Control
### Explanation of Monte Carlo Control

Monte Carlo Control is an extension of Monte Carlo methods in reinforcement learning that focuses on optimizing the policy to maximize the expected reward. Unlike Monte Carlo Prediction, which estimates the value of a given policy, Monte Carlo Control iteratively improves the policy itself by exploring the environment and adjusting actions to achieve better outcomes.

The core idea behind Monte Carlo Control is to use the results from simulated episodes to not only evaluate the value function but also to derive an improved policy. This is achieved through policy iteration, where the policy is updated based on the value estimates, and then more episodes are run under the new policy to further refine i.

### Scenarios Where Monte Carlo Control is Beneficial

Monte Carlo Control is particularly beneficial in scenarios where:

- **Model-Free Environments**: The environment's dynamics are unknown, and it is not feasible to derive an analytical solution. Monte Carlo methods work well because they rely solely on experience.
- **Episodic Tasks**: Tasks that naturally break into episodes, such as games, where each episode has a clear beginning and end.
- **High Variance in Rewards**: When the reward structure is complex and varies significantly between episodes, Monte Carlo methods can accurately estimate the average value over time.
- **Limited State Space**: Although Monte Carlo Control can be used with large state spaces, it is most effective when the state space is relatively small, making it feasible to explore horoughly.

### Methods for Implementing Monte Carlo Control

Monte Carlo Control can be implemented using different strategies. The most common methods are:

- **Exploring Starts**: Ensuring that all state-action pairs are visited by starting episodes with a random action.
- **$\epsilon$-Greedy Policy**: Balancing exploration and exploitation by selecting the best-known action most of the time but occasionally choosing a random action.
- **Policy Iteration**: Alternating between policy evaluation (estimating the value function under the current policy) and policy improvement (updating the policy based on the value estimates).

These methods can be used to incrementally improve the policy, eventually converging to an optimal policy that maximizes the expected reward over time.


___
___
### Readings:
- [Monte Carlo Methods (Part 2: Monte Carlo Control)](https://medium.com/@numsmt2/reinforcement-learning-chapter-5-monte-carlo-methods-part-2-monte-carlo-control-b1ea0d4ec2b4)
- [Reinforcement Learning, Part 4: Monte Carlo Control](https://towardsdatascience.com/reinforcement-learning-part-4-monte-carlo-control-ae0a7f29920b)
- [Monte Carlo Control](https://medium.com/@aminakeldibek/monte-carlo-control-6e3b70f173a8)
- [Monte Carlo Methods](https://medium.com/towards-data-science/introduction-to-reinforcement-learning-rl-part-5-monte-carlo-methods-25067003bb0f)
- [Solving Racetrack in Reinforcement Learning using Monte Carlo Control](https://towardsdatascience.com/solving-racetrack-in-reinforcement-learning-using-monte-carlo-control-bdee2aa4f04e)
___
___

In [1]:
import numpy as np
from collections import defaultdict

In [2]:
# Define the grid world environment
n_states = 6   # Number of states
n_actions = 2  # Number of actions (0 = left, 1 = right)
gamma = 0.9    # Discount factor
epsilon = 0.1  # Exploration rate

In [3]:
# Define rewards
rewards = np.zeros(n_states)
rewards[-1] = 1.0         # Reward at the terminal state

In [4]:
# Initialize Q-value table
Q = defaultdict(lambda: np.zeros(n_actions))

max_steps = 100

def generate_episode(policy):
    state = np.random.randint(0, n_states - 1)
    episode = []
    for _ in range(max_steps):
        if state == n_states - 1:
            break
        action = np.random.choice(np.arange(n_actions), p=policy[state])
        next_state = state + 1 if action == 1 else max(0, state - 1)
        reward = rewards[next_state]
        episode.append((state, action, reward))
        state = next_state
    return episode

In [5]:
# Define a function to create an epsilon-greedy policy
def create_policy(Q, epsilon, n_actions):
    policy = np.ones((n_states, n_actions)) * epsilon / n_actions
    for state in range(n_states):
        best_action = np.argmax(Q[state])
        policy[state][best_action] += (1.0 - epsilon)
    return policy

In [6]:
# Monte Carlo Control with epsilon-greedy policy
n_episodes = 100  
returns = defaultdict(list)

for _ in range(n_episodes):
    policy = create_policy(Q, epsilon, n_actions)
    episode = generate_episode(policy)
    G = 0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        if not any([(s == state and a == action) for (s, a, r) in episode[:-1]]):
            returns[(state, action)].append(G)
            Q[state][action] = np.mean(returns[(state, action)])

In [7]:
# Derive the optimal policy by iterating over each state
optimal_policy = np.zeros(n_states, dtype=int)
for state in range(n_states):
    optimal_policy[state] = np.argmax(Q[state])

In [8]:
# Display the optimal policy
print("Optimal Policy:", optimal_policy)

Optimal Policy: [0 0 0 0 1 0]


## Conclusion

In this implementation, we successfully demonstrated the Monte Carlo Control method with an epsilon-greedy policy. By iterating over each state and selecting the action with the highest Q-value, we derived the optimal policy for a simple grid world environment. This process highlights the essential aspects of Monte Carlo Control, including policy improvement through exploration and exploitation. The example serves as a foundational understanding of how Monte Carlo methods can be applied to reinforcement learning tasks, even in more complex environments.
