# Reinforcement Learning
Prof. Milica Gašić

### Monte Carlo prediction

The idea of Monte Carlo prediction is simple: estimate the value (or the action value) by averaging the returns observed in collected episodes. In this notebook, we apply Monte Carlo prediction to the game of tic-tac-toe.
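
As a minimal sketch of the averaging idea, independent of the tic-tac-toe environment: the function `sample_return` below is hypothetical and stands in for one episode, returning 1 with an assumed probability of 0.6. Averaging many sampled returns should approach that expected value.

```python
import random

def sample_return(rng, win_probability=0.6):
    # Hypothetical stand-in for one episode: return 1 on a "win", else 0
    return 1.0 if rng.random() < win_probability else 0.0

def mc_estimate(num_episodes, seed=0):
    # Monte Carlo estimate: the average of the observed returns
    rng = random.Random(seed)
    returns = [sample_return(rng) for _ in range(num_episodes)]
    return sum(returns) / len(returns)

estimate = mc_estimate(10000)
print(estimate)  # should be close to the assumed probability 0.6
```

With more episodes the estimate fluctuates less around the true expectation, which is why the task below asks for at least 10000 episodes.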

#### Implementation

Make sure that the file `rl_env.py` is in the same folder as the notebook.

In [11]:
%load_ext autoreload
%autoreload 2

import numpy as np
import rl_env


We already implemented tic-tac-toe in `TicTacToeEnv`:
- The environment has $3^9 = 19683$ states (9 fields with 3 values: empty, player 1, player 2).
- There are $9$ possible actions, which determine the next move of the current player (i.e. the actions control both players).
- The final reward is $1$ if player 1 wins and $0$ otherwise (player 2 wins or the game ends in a draw). The reward is $0$ at all other time steps.
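
To illustrate where the count $3^9$ comes from, here is a sketch of one possible way to map a board to a single integer by reading the 9 fields as base-3 digits. The functions `board_to_index` and `index_to_board` are hypothetical, and the digit order actually used by `TicTacToeEnv` is an assumption and may differ.

```python
def board_to_index(board):
    # board: 9 values in {0, 1, 2} (0 = empty, 1 = player 1, 2 = player 2)
    index = 0
    for value in board:
        index = index * 3 + value
    return index

def index_to_board(index):
    # Inverse mapping: extract the 9 base-3 digits again
    board = []
    for _ in range(9):
        board.append(index % 3)
        index //= 3
    return board[::-1]

empty = [0] * 9
print(board_to_index(empty))          # the empty board maps to index 0
print(board_to_index([2] * 9) + 1)    # number of distinct indices: 3**9 = 19683
```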

In [12]:
# Create an instance of the tic-tac-toe environment
env = rl_env.TicTacToeEnv()

We already implemented the random policy for the tic-tac-toe environment:

In [13]:
def random_policy(state):
    # Obtain the list of empty fields
    valid_actions = rl_env.TicTacToeEnv.get_valid_actions(state)
    # Select one of the empty fields randomly
    # For non-empty fields, the action does not have an effect
    action = np.random.choice(valid_actions)
    return action

Your task is to implement Monte Carlo prediction of the action value for the **initial state**, i.e. you don't need to compute the action values for all states, but only the $9$ action values for the initial state.  
We don't need a discount factor, so the initial return is equal to the final reward.
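
Written out, with discount factor $\gamma = 1$ and zero reward before the end of the episode, the return from the initial state reduces to the final reward. The symbols $G_0$, $R_t$ and the terminal time step $T$ are notation assumed here, following the usual convention:

$$G_0 = \sum_{t=0}^{T-1} \gamma^{t} R_{t+1} = R_1 + R_2 + \dots + R_T = R_T$$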

You don't need an `Agent` object for this implementation, just generate episodes and estimate the action values.

In [24]:
#######################################################################
# TODO: Implement Monte Carlo prediction of the action value function #
# for the initial state as described above. Generate at least 10000   #
# episodes to estimate the action values.                             #
#######################################################################
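def mc_prediction(num_episodes, env, policy):
    """Estimate the action values of the initial state from sampled episodes."""
    num_states = env.observation_space.n
    num_actions = env.action_space.n

    # Running-mean estimates and visit counts; only the initial state's row is updated
    q = np.zeros((num_states, num_actions))
    counter = np.zeros((num_states, num_actions))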

    for _ in range(num_episodes):
        state, info = env.reset()
        init_state = state
        init_action = policy(init_state)
        action = init_action

        while True:
            state, reward, terminated, truncated, info = env.step(action)
            if terminated or truncated:
                # The undiscounted return equals the final reward, so update
                # the running mean for the initial state-action pair with it
                counter[init_state, init_action] += 1
                q[init_state, init_action] += (
                    reward - q[init_state, init_action]
                ) / counter[init_state, init_action]
                break  # finish this episode
            action = policy(state)

    return q
#######################################################################
# End of your code.                                                   #
#######################################################################
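
The update used above is an incremental mean: each new return moves the estimate by a step of size $1/n$ towards that return. As a small sketch with made-up returns (the list `returns` is purely illustrative), the incremental form gives the same result as averaging all returns at once, up to floating-point rounding:

```python
def incremental_mean(values):
    # Running mean: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
    mean, count = 0.0, 0
    for value in values:
        count += 1
        mean += (value - mean) / count
    return mean

returns = [1, 0, 0, 1, 1, 0, 1, 1]  # illustrative final rewards of 8 episodes
batch = sum(returns) / len(returns)
print(incremental_mean(returns), batch)
```

The incremental form avoids storing all returns, which matters once estimates are kept for many state-action pairs.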

Since the reward is $1$ only if player 1 wins (and $0$ otherwise), the value of the initial state equals the winning probability of player 1 under the given policy.  
Use this to answer the following questions:
- What is the probability that the first player wins?
- Which initial action has the highest chance of winning?

In [25]:
#######################################################################
# TODO: Answer the questions by using the computed action values.     #
#######################################################################
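# Fewer episodes give noisier estimates (higher variance)
q = mc_prediction(10000, env, random_policy)
# Reset so that the step below starts from the initial state
env.reset()
# Row 0 holds the action values of the initial state (the empty board)
print(q[0])
# 1. What is the probability that the first player wins?
# Under the uniformly random policy, the state value is the mean of the action values
print('probability of winning:', np.mean(q[0]))

# 2. Which initial action has the highest chance of winning?
best_action = np.argmax(q[0])
print('in the middle')
print(best_action)
# Play the best action and show the resulting board
next_state = env.step(best_action)[0]
print(env.index_to_state(next_state).reshape(3, 3))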
#######################################################################
# End of your code.                                                   #
#######################################################################

[0.59399142 0.53946147 0.60846085 0.54414414 0.69771863 0.55760774
 0.59459459 0.5298988  0.63509991]
probability of winning: 0.5889975062424209
in the middle
4
[[0 0 0]
 [0 1 0]
 [0 0 0]]
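
As a rough sanity check of these numbers, the sketch below recomputes the answers from the printed action values and adds an approximate standard error per action. The per-action episode count is an assumption (about 10000 / 9, since the random policy picks each of the 9 initial actions uniformly and the actual counts were not printed).

```python
import math

# Action values for the initial state, copied from the output above
q0 = [0.59399142, 0.53946147, 0.60846085, 0.54414414, 0.69771863,
      0.55760774, 0.59459459, 0.5298988, 0.63509991]

# Under a uniformly random policy, the state value is the mean of the action values
v0 = sum(q0) / len(q0)
best_action = max(range(len(q0)), key=lambda a: q0[a])

# Assumption: roughly 10000 / 9 episodes per initial action
n_per_action = 10000 / 9
# Each return is 0 or 1, so the standard error of a mean p is sqrt(p * (1 - p) / n)
std_errors = [math.sqrt(p * (1 - p) / n_per_action) for p in q0]

print(round(v0, 4), best_action)  # 0.589 4
print(max(std_errors))
```

Under this assumption the standard error per action is roughly 0.015, so the gap between the center field and the next-best initial action (about 0.06) is noticeably larger than the sampling noise, while the differences among some of the remaining actions are not.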
