# Reinforcement Learning
Prof. Milica Gašić

### Monte Carlo prediction

The idea of Monte Carlo prediction is very simple: estimate the value (or the action value) by averaging the observed returns from collected episodes. In this notebook we apply Monte Carlo prediction to the game of tic-tac-toe.
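
As a minimal sketch of "averaging the observed returns", the snippet below keeps a running mean that is updated after every episode. The toy return (a reward of 1 with probability 0.3) and the names `mc_estimate` and `toy_return` are assumptions made for illustration; they are not part of `rl_env.py`.

```python
import numpy as np

def mc_estimate(sample_return, num_episodes):
    """Average sampled returns using an incremental mean update."""
    estimate = 0.0
    for n in range(1, num_episodes + 1):
        g = sample_return()              # return observed in episode n
        estimate += (g - estimate) / n   # running mean of the first n returns
    return estimate

rng = np.random.default_rng(0)

def toy_return():
    # Hypothetical one-step "episode": reward 1 with probability 0.3, else 0
    return float(rng.random() < 0.3)

estimate = mc_estimate(toy_return, 10_000)
print(estimate)  # should be close to the true expected return of 0.3
```

The incremental update gives the same result as summing all returns and dividing by the number of episodes, but it does not need to store the individual returns.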

#### Implementation

Make sure that the file `rl_env.py` is in the same folder as the notebook.

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import rl_env

We already implemented tic-tac-toe in `TicTacToeEnv`:
- The environment has $3^9 = 19683$ states (9 fields with 3 values: empty, player 1, player 2).
- There are $9$ possible actions, which determine the next move of the current player (i.e. the actions control both players).
- The final reward is $1$ if player 1 wins, and $0$ if player 2 wins or the game ends in a draw. The reward is $0$ at all other time steps.
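
One way to see where the $3^9$ states come from is to read the board as a 9-digit number in base 3. The sketch below is only an illustration: the helper names `encode_board`, `decode_index` and `valid_actions` are hypothetical, and the digit order used inside `rl_env.py` may differ. What both have in common is that the empty board maps to index $0$.

```python
import numpy as np

def encode_board(board):
    """Map a board with 9 entries in {0, 1, 2} to an integer in [0, 3**9)."""
    flat = np.asarray(board).reshape(-1)
    return int(sum(int(v) * 3**i for i, v in enumerate(flat)))

def decode_index(index):
    """Inverse of encode_board: recover the 9 fields from the integer index."""
    board = np.zeros(9, dtype=int)
    for i in range(9):
        index, board[i] = divmod(index, 3)
    return board

def valid_actions(index):
    """Empty fields (value 0) are the only moves that change the board."""
    return np.flatnonzero(decode_index(index) == 0)

board = np.array([[0, 0, 0],
                  [0, 1, 0],
                  [0, 0, 0]])  # player 1 has taken the center
index = encode_board(board)
print(index, decode_index(index).reshape(3, 3), valid_actions(index))
```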

In [9]:
# Create an instance of the tic-tac-toe environment
env = rl_env.TicTacToeEnv()

We already implemented the random policy for the tic-tac-toe environment:

In [31]:
def random_policy(state):
    # Obtain the list of empty fields
    valid_actions = rl_env.TicTacToeEnv.get_valid_actions(state)
    # Select one of the empty fields randomly
    # For non-empty fields, the action does not have an effect
    action = np.random.choice(valid_actions)
    return action

In [49]:
state = env.reset()[0]
state

np.int64(0)

Your task is to implement Monte Carlo prediction of the action value for the **initial state**, i.e. you don't need to compute the action values for all states, but only the $9$ action values for the initial state.  
We don't need a discount factor, so the initial return is equal to the final reward.
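
Written out, with $\gamma = 1$, all intermediate rewards equal to $0$, and $T$ denoting the final time step (the notation $G_0$, $R_t$, $N(a)$ is assumed here, following the usual convention), the return and the Monte Carlo estimate of the action value are:

$$G_0 = \sum_{k=0}^{T-1} \gamma^{k} R_{k+1} = R_1 + R_2 + \dots + R_T = R_T, \qquad Q(s_0, a) \approx \frac{1}{N(a)} \sum_{i=1}^{N(a)} G_0^{(i)},$$

where $N(a)$ is the number of episodes that started with action $a$ and $G_0^{(i)}$ is the return of the $i$-th such episode.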

You don't need an `Agent` object for this implementation, just generate episodes and estimate the action values.

In [None]:
# This is prediction (policy evaluation), not control:
# both players follow the fixed random policy.

In [63]:
#######################################################################
# TODO: Implement Monte Carlo prediction of the action value function #
# for the initial state as described above. Generate at least 10000   #
# episodes to estimate the action values.                             #
#######################################################################
rewards = np.zeros(9)

for i in range(100000):
    zero_state = env.reset()[0]
    value_index = random_policy(zero_state)  # the first action; also the index into rewards
    state, reward, terminated = env.step(value_index)[0:3]
    while not terminated:
        action = random_policy(state)
        state, reward, terminated = env.step(action)[0:3]
    rewards[value_index] += reward

env.reset()
print(rewards)


#######################################################################
# End of your code.                                                   #
#######################################################################

[6713. 5977. 6757. 5903. 7831. 6011. 6673. 5957. 6701.]
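
Note that the array printed above holds the sum of returns per first action, not the average. To turn the sums into action-value estimates, each entry has to be divided by the number of episodes that started with that action. The sketch below shows this with a small standalone tic-tac-toe simulator so that it runs without `rl_env.py`; the simulator and the names `winner`, `play_random_episode` and `first_action_values` are assumptions for illustration, with the reward convention taken from the description above.

```python
import numpy as np

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 1 or 2 if that player has three in a line, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def play_random_episode(rng):
    """Play one game with both players moving randomly.

    Returns (first action, return), where the return is 1 if player 1 wins.
    """
    board = [0] * 9
    player = 1
    first_action = None
    while True:
        empty = [i for i, v in enumerate(board) if v == 0]
        if not empty:
            return first_action, 0.0          # draw
        action = empty[rng.integers(len(empty))]
        if first_action is None:
            first_action = action
        board[action] = player
        if winner(board) == player:
            return first_action, 1.0 if player == 1 else 0.0
        player = 3 - player                   # switch between player 1 and 2

def first_action_values(num_episodes, seed=0):
    """Monte Carlo estimate of the 9 action values of the initial state."""
    rng = np.random.default_rng(seed)
    return_sums = np.zeros(9)
    counts = np.zeros(9, dtype=int)
    for _ in range(num_episodes):
        action, g = play_random_episode(rng)
        return_sums[action] += g
        counts[action] += 1
    # Average return per first action; guard against unvisited actions
    return return_sums / np.maximum(counts, 1), counts

q_values, counts = first_action_values(20_000)
print(np.round(q_values, 3).reshape(3, 3))
```

Because the first action is drawn uniformly, the counts are roughly equal and the ranking of the sums usually matches the ranking of the averages, but only the averages are probabilities.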


Since the reward is $1$ only if player 1 wins, the value of the initial state equals the probability that player 1 wins when both players move randomly.  
Use this to answer the following questions:
- What is the probability that the first player wins?
- Which initial action has the highest chance of winning?

In [65]:
#######################################################################
# TODO: Answer the questions by using the computed action values.     #
#######################################################################

# 1. What is the probability that the first player wins?
print(np.sum(rewards)/100000)

# 2. Which initial action has the highest chance of winning?
print('in the middle')
print(np.argmax(rewards))
print(env.index_to_state(env.step(np.argmax(rewards))[0]).reshape(3,3))

#######################################################################
# End of your code.                                                   #
#######################################################################

0.58523
in the middle
4
[[0 0 0]
 [0 1 0]
 [0 0 0]]
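
As a rough sanity check on these answers, the sketch below recomputes them from the return sums printed earlier and adds a standard error for the win probability, treating each episode as an independent Bernoulli trial. The variable names are chosen for illustration. The per-action counts were not recorded, so the best action is read from the sums, which is only a proxy for the per-action averages.

```python
import numpy as np

# Return sums per first action, copied from the output printed earlier
reward_sums = np.array([6713, 5977, 6757, 5903, 7831, 6011, 6673, 5957, 6701],
                       dtype=float)
num_episodes = 100_000

# Fraction of all episodes won by player 1
win_prob = reward_sums.sum() / num_episodes

# Standard error of a Bernoulli mean estimated from num_episodes samples
std_error = np.sqrt(win_prob * (1 - win_prob) / num_episodes)

# First action with the largest return sum (proxy for the largest average)
best_action = int(np.argmax(reward_sums))

print(f"win probability: {win_prob:.5f} +/- {std_error:.5f}")
print(f"best first action: {best_action}")
```

The standard error shrinks with the square root of the number of episodes, so ten times more episodes narrow the estimate by a factor of about three.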
