In [16]:
import gym
import numpy as np
from collections import defaultdict
from operator import getitem, setitem

## Overview

Generally speaking, Monte Carlo control approximates the value of each state-action pair by averaging the returns observed in sampled episodes.

In particular, the Monte Carlo approach doesn't bootstrap the value function; instead it observes a full episode before updating its estimates.

## Learning the value function

A pseudo-algorithm for learning the value function using Monte Carlo methods is roughly:  

- Initialise value and state_history
- for each episode:
    - follow the policy until the episode ends, recording each (state, reward)
    - rewards = 0
    - for each recorded step, last to first:
        - rewards = gamma*rewards + reward
        - append rewards to state_history[state]
        - value[state] = mean(state_history[state])

A key point here is that the Monte Carlo approach can only reflect on the merits of its decisions once an episode has finished and its states have been observed, unless of course the agent visits the same state more than once within that episode.
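
The pseudo-algorithm above can be sketched on a hand-written episode. This is a minimal every-visit illustration rather than the notebook's implementation: the function name `mc_update`, the toy states `'A'` and `'B'`, and the reward values are all made up for the example.

```python
from collections import defaultdict

import numpy as np


def mc_update(episode, state_history, value, gamma=0.8):
    """Every-visit Monte Carlo update from one finished episode.

    `episode` is a list of (state, reward) pairs in the order they occurred,
    where `reward` is the reward received after acting in `state`.
    """
    g = 0.0
    # Walk backwards so that g is the discounted return from each step onwards.
    for state, reward in reversed(episode):
        g = gamma * g + reward
        state_history[state].append(g)
        value[state] = float(np.mean(state_history[state]))
    return value


state_history = defaultdict(list)
value = {}

# Hypothetical episode: state 'A' is visited twice, then 'B', then the episode ends.
episode = [('A', 0.0), ('A', 0.0), ('B', 1.0)]
mc_update(episode, state_history, value)

print(value)
```

Nothing is updated until the whole episode is available, and a state that appears twice in the episode receives two return samples from it.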

There is some merit to updating the value function as often as possible when many agents run concurrently and asynchronously while updating the same value function.

Key traits:  
- the value function is not used when estimating improvements, we only average past experience  
- mostly assumes the state space is discrete and of reasonable size (as it explicitly stores state values, at least in its simplest form)  
- assumes a soft policy with coverage, that is, one that enters all states infinitely many times 
- the step size (not to be confused with the discount factor gamma) can be set to 1/n if we wish to use the sample mean (X bar) estimate for state returns; a constant step size in (0,1] suits non-stationary distributions
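
The 1/n weighting in the last bullet can be checked numerically. The sketch below is an illustration only: the helper `incremental_update` and the sample returns are assumptions, not part of the notebook. It shows that a weight of 1/n reproduces the sample mean, while a constant weight leans towards recent samples.

```python
import numpy as np


def incremental_update(estimate, sample, step_size):
    """Move the estimate a fraction `step_size` of the way towards the sample."""
    return estimate + step_size * (sample - estimate)


returns = [1.0, -1.0, 1.0, 1.0]  # made-up returns observed for a single state

# A step size of 1/n reproduces the running sample mean.
mean_estimate = 0.0
for n, g in enumerate(returns, start=1):
    mean_estimate = incremental_update(mean_estimate, g, 1.0 / n)

# A constant step size weights recent returns more heavily.
recency_estimate = 0.0
for g in returns:
    recency_estimate = incremental_update(recency_estimate, g, 0.5)

print(mean_estimate, np.mean(returns))
print(recency_estimate)
```

The incremental form also avoids storing every past return, at the cost of keeping a visit count per state when the 1/n weighting is used.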

Let's start by defining some basic helper objects;

In [54]:
class Scorer:
    
    def __init__(self, score=0, gamma=0.8):
        self.score = score
        self.gamma = gamma

    def observe(self, reward, gamma=None):
        # an explicit gamma overrides the default, e.g. when annealing it over time
        if gamma is None:
            gamma = self.gamma
        self.score = self.score * gamma + reward
        
    def show_score(self):
        return self.score
    
    def reset(self):
        self.score = 0
 
class MCValueFunction:
    def __init__(self):
        # history[observation][action] holds the list of scores seen for that pair
        self.history = defaultdict(lambda: defaultdict(list))
        
    def update(self, observation, action, score):
        # observations are hashable tuples, so they can be used directly as keys
        self.history[observation][action].append(score)
        
    def predict(self, observation, action):
        # look up without inserting new keys; unseen pairs yield an empty list
        returns = self.history.get(observation, {}).get(action, [])
        
        # np.mean of an empty list is nan, so fall back to 0 in that case
        value = np.mean(returns) if returns else 0
        
        return value
    
class GreedyPolicy:
    def __init__(self):
        pass

    def choose(self, observation, actions, value_function):
        # score every available action and pick the one with the highest estimate
        values = np.array([value_function.predict(observation, action) for action in actions])
        greedy_choice = np.argmax(values)
        
        return actions[greedy_choice]

class RandomPolicy:
    def __init__(self):
        pass
        
    def choose(self, env):
        # sample uniformly from the environment's action space
        return env.action_space.sample()

## BlackJack value function

To learn a value function of the blackjack game we need to learn state action pair values, because we don't have a reliable model for the states those actions will lead to.  
If we did, we could simply infer:  
    new_state(action) = model(current_state, action)  
    values = [value[new_state(action)] for action in action_space]  
    
To learn the state-action values we will need to explore all the possibilities. If we don't manually set a unique starting point for the game, we must use a soft policy that guarantees coverage of all states. The simplest way to do this is a random choice policy: with enough episodes, every reachable state-action pair will eventually be visited.
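
As a rough sketch of what storing state-action values looks like without a model, the example below keeps a list of returns per (observation, action) pair and reads a greedy or random action from it. The table `q`, the helper names and the return values are hypothetical; the observation layout (player sum, dealer card, usable ace) and the action coding (0 = stick, 1 = hit) follow Gym's Blackjack environment.

```python
import random
from collections import defaultdict

# q[observation][action] -> list of sampled returns (made-up numbers below)
q = defaultdict(lambda: defaultdict(list))

observation = (9, 10, False)  # (player sum, dealer card, usable ace)
q[observation][0].extend([-1.0, -1.0, 1.0])  # action 0: stick
q[observation][1].extend([1.0, -1.0, 1.0])   # action 1: hit


def mean_return(observation, action):
    """Average the stored returns for a pair, or 0.0 if it was never visited."""
    returns = q[observation][action]
    return sum(returns) / len(returns) if returns else 0.0


def greedy_action(observation, actions):
    """Pick the action with the highest average return; no model is required."""
    return max(actions, key=lambda a: mean_return(observation, a))


def random_action(actions, rng=random):
    """Uniform random choice: a soft policy that can reach every action."""
    return rng.choice(actions)


actions = [0, 1]
print(greedy_action(observation, actions))
print(random_action(actions))
```

Keying on the whole observation tuple works because the tuple is hashable, so no nesting over its individual elements is needed.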

In [60]:
observation[0:2]

(9, 10)

In [55]:
env = gym.make('Blackjack-v0')

episodes = 100
gamma = 0.8

scorer = Scorer(gamma=gamma)
policy = RandomPolicy()
value_function = MCValueFunction()

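for _ in range(episodes):
    observation = env.reset()
    done = False
    trajectory = []
    
    # play one full episode, remembering each state, action and reward
    while not done:
        action = env.action_space.sample()
        next_observation, reward, done, info = env.step(action)
        trajectory.append((observation, action, reward))
        observation = next_observation
    
    # walk the episode backwards so the score is the discounted return from each step
    for state, action, reward in reversed(trajectory):
        scorer.observe(reward)
        value_function.update(state, action, scorer.show_score())
    
    scorer.reset()
    print('episode ends')
        
env.close()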

In [9]:
env = gym.make('Blackjack-v0')
observation = env.reset()
for _ in range(10):
    print(observation)
    print(reward)  # on the first iteration this relies on `reward` already being defined
    
#     env.render() #no render implementation for blackjack
    action = env.action_space.sample() # your agent here (this takes random actions)
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
        print('episode ends')
        
env.close()

(18, 4, False)
-1.0
episode ends
(14, 6, True)
1.0
episode ends
(15, 3, False)
-1.0
episode ends
(21, 6, True)
-1.0
(15, 6, False)
0.0
episode ends
(11, 9, False)
1.0
episode ends
(21, 4, True)
1.0
(21, 4, False)
0.0
episode ends
(18, 10, False)
-1.0
episode ends
(20, 6, False)
-1.0
episode ends


In [5]:
np.empty()

TypeError: empty() missing required argument 'shape' (pos 1)

In [7]:
x = np.mean([])

In [8]:
x

nan

In [11]:
np.isnan(x)

True

In [15]:
0 if np.isnan(x) else x

0
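
The cells above show that `np.mean([])` evaluates to `nan`, which then has to be mapped to 0. A possible alternative, sketched here with a hypothetical helper `mean_or_zero`, is to check for an empty list first, which also avoids the `RuntimeWarning` NumPy emits for an empty mean.

```python
import warnings

import numpy as np


def mean_or_zero(samples):
    """Return the mean of `samples`, or 0.0 when there are none."""
    if len(samples) == 0:
        return 0.0
    return float(np.mean(samples))


# Taking the mean of an empty list yields nan and triggers a RuntimeWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    x = np.mean([])

print(np.isnan(x), len(caught) > 0)
print(mean_or_zero([]), mean_or_zero([1.0, -1.0, 1.0]))
```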