# Reinforcement Learning
Prof. Milica Gašić

### Monte Carlo prediction

The idea of Monte Carlo prediction is very simple: estimate the value (or the action value) by averaging the observed returns from collected episodes. In this notebook we apply Monte Carlo prediction to the game of tic-tac-toe.
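
As a minimal sketch of the averaging step, the snippet below computes the mean of a sequence of returns one sample at a time and compares it with the batch mean. The helper name `incremental_mean` and the randomly generated returns are illustrative assumptions, not part of the notebook or of `rl_env`.

```python
import numpy as np

def incremental_mean(returns):
    """Average a sequence of returns one sample at a time."""
    estimate = 0.0
    for n, g in enumerate(returns, start=1):
        # Move the estimate toward the new return by a step of 1/n
        estimate += (g - estimate) / n
    return estimate

# Made-up returns for illustration: 1 = win, 0 = loss or draw
rng = np.random.default_rng(seed=0)
sample_returns = rng.integers(0, 2, size=1000).astype(float)

mc_estimate = incremental_mean(sample_returns)
print(mc_estimate, sample_returns.mean())
```

Both numbers agree up to floating-point error, so the incremental form can be used when storing all returns is inconvenient.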

#### Implementation

Make sure that the file `rl_env.py` is in the same folder as the notebook.

In [10]:
%load_ext autoreload
%autoreload 2

import numpy as np
import rl_env



We already implemented tic-tac-toe in `TicTacToeEnv`:
- The environment has $3^9 = 19683$ states (9 fields with 3 values: empty, player 1, player 2).
- There are $9$ possible actions, which determine the next move of the current player (i.e. the actions control both players).
- The final reward is $1$ if player 1 wins, and $0$ if player 2 wins or the game ends in a draw. The reward is $0$ at all other time steps.
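
To see where the count $3^9$ comes from, one possible way to index boards is to read the 9 fields as digits of a base-3 number. This is only a sketch: the function `encode_board` and the cell coding (0 = empty, 1 = player 1, 2 = player 2) are assumptions for illustration, and `TicTacToeEnv` may represent states differently.

```python
def encode_board(board):
    """Map 9 cells (0 = empty, 1 = player 1, 2 = player 2) to an integer in [0, 3**9)."""
    if len(board) != 9:
        raise ValueError("a tic-tac-toe board has exactly 9 fields")
    index = 0
    for cell in board:
        # Shift the previous digits by one base-3 place and append the new cell
        index = index * 3 + cell
    return index

empty_index = encode_board([0] * 9)
largest_index = encode_board([2] * 9)
print(empty_index, largest_index, 3 ** 9)
```

Note that this count includes boards that can never occur in a real game, such as a board filled entirely by one player.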

In [11]:
# Create an instance of the tic-tac-toe environment
env = rl_env.TicTacToeEnv()

We already implemented the random policy for the tic-tac-toe environment:

In [12]:
def random_policy(state):
    # Obtain the list of empty fields
    valid_actions = rl_env.TicTacToeEnv.get_valid_actions(state)
    # Select one of the empty fields randomly
    # For non-empty fields, the action does not have an effect
    action = np.random.choice(valid_actions)
    return action

Your task is to implement Monte Carlo prediction of the action value for the **initial state**, i.e. you don't need to compute the action values for all states, but only the $9$ action values for the initial state.  
We don't need a discount factor, so the initial return is equal to the final reward.
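
Written out, using the common notation $G_0$ for the return from the initial state, $R_t$ for the reward at step $t$, and $T$ for the final time step (notation assumed here, not defined in the notebook), the statement above reads:

$$
G_0 = \sum_{t=0}^{T-1} \gamma^{t} R_{t+1} = R_T \qquad \text{for } \gamma = 1 \text{ and } R_1 = \dots = R_{T-1} = 0.
$$

The Monte Carlo estimate for an initial action $a$ is then the average return over the $N(a)$ episodes that started with $a$:

$$
Q(s_0, a) = \frac{1}{N(a)} \sum_{i=1}^{N(a)} G_0^{(i)}.
$$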

You don't need an `Agent` object for this implementation, just generate episodes and estimate the action values.

In [13]:
#######################################################################
# TODO: Implement Monte Carlo prediction of the action value function #
# for the initial state as described above. Generate at least 10000   #
# episodes to estimate the action values.                             #
#######################################################################

def generate_episodes(num_episodes, env, policy):
    """Estimate the action values of the initial state by Monte Carlo prediction.

    Returns an array with one estimate per initial action.
    """
    q_estimates = np.zeros(env.action_space.n)
    action_counts = np.zeros(env.action_space.n)

    for _ in range(num_episodes):
        state, info = env.reset()
        initial_action = None
        total_reward = 0

        while True:
            action = policy(state)
            if initial_action is None:
                # Remember the first action of the episode and count it
                initial_action = action
                action_counts[initial_action] += 1

            state, reward, terminated, truncated, info = env.step(action)
            total_reward += reward

            if terminated or truncated:
                # Incremental mean update: Q <- Q + (G - Q) / N
                step_size = 1 / action_counts[initial_action]
                q_estimates[initial_action] += step_size * (total_reward - q_estimates[initial_action])
                break

    return q_estimates

#######################################################################
# End of your code.                                                   #
#######################################################################

Since the reward is $1$ only if player 1 wins, the value of the initial state is equal to the probability that player 1 wins when both players follow the random policy.  
Use this to answer the following questions:
- What is the probability that the first player wins?
- Which initial action has the highest chance of winning?

In [14]:
#######################################################################
# TODO: Answer the questions by using the computed action values.     #
#######################################################################

q_estimates = generate_episodes(10000, env, random_policy)

#######################################################################
# End of your code.                                                   #
#######################################################################

In [15]:
print(q_estimates)

[0.59688581 0.52014652 0.61369863 0.52359347 0.67958656 0.55658627
 0.60356139 0.52874564 0.62034514]
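
Before comparing these numbers, it helps to gauge how noisy they are. The sketch below computes the standard error of a win-rate estimate, assuming the 10000 episodes are spread roughly evenly over the 9 initial actions and treating each episode as an independent win/no-win trial. The helper `bernoulli_standard_error` and the example rate of 0.6 are illustrative assumptions.

```python
import math

def bernoulli_standard_error(p, n):
    """Standard error of the mean of n independent 0/1 outcomes with success rate p."""
    return math.sqrt(p * (1 - p) / n)

# Assumption: about 10000 / 9 episodes start with each initial action
samples_per_action = 10000 / 9
standard_error = bernoulli_standard_error(0.6, samples_per_action)
print(round(standard_error, 4))
```

With a standard error of roughly 0.015 per estimate, differences of one or two percentage points between actions are within sampling noise. More episodes would be needed to separate such actions reliably.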


In [16]:
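# The random policy picks each of the 9 initial actions with equal probability,
# so the value of the initial state is the plain average of the action values
print(q_estimates.sum()/9)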

0.5825721596092679
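
One way to read both answers off the estimates is sketched below, using the values printed in this run (a rerun will give slightly different numbers). The mapping from action index to board position assumes row-major numbering of the fields, which is not confirmed by the notebook.

```python
import numpy as np

# Action-value estimates copied from the output of this run
q_estimates = np.array([0.59688581, 0.52014652, 0.61369863,
                        0.52359347, 0.67958656, 0.55658627,
                        0.60356139, 0.52874564, 0.62034514])

# Question 1: winning probability of player 1 under the random policy
win_probability = q_estimates.mean()

# Question 2: initial action with the highest estimated winning probability
best_action = int(np.argmax(q_estimates))
row, col = divmod(best_action, 3)  # assumption: fields are numbered row by row

print(f"P(player 1 wins) is about {win_probability:.3f}")
print(f"Best initial action: {best_action} (row {row}, column {col})")
```

In this run, player 1 wins with a probability of about 0.58, and the best opening is action 4, which is the center field under the assumed numbering.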
