In [2]:
import gym
import numpy as np

In [3]:
env=gym.make('Blackjack-v0')

[2019-08-29 17:44:45,395] Making new env: Blackjack-v0


Blackjack is a card game where the goal is to obtain cards that sum to as
near as possible to 21 without going over.  
The player plays against a fixed dealer.
Face cards (Jack, Queen, King) have point value 10.
An ace can count as either 11 or 1; it is called 'usable' when it counts as 11.  

This game is played with an infinite deck (that is, cards are drawn with replacement).
The game starts with each (player and dealer) having one face-up and one face-down card.
The player can request additional cards (hit=1) until they decide to stop (stick=0) or exceed 21 (bust).
After the player sticks, the dealer reveals their face-down card and draws until their sum is 17 or greater. If the dealer goes bust, the player wins.
If neither player nor dealer busts, the outcome (win, lose, draw) is decided by whose sum is closer to 21. The reward for winning is +1, drawing is 0, and losing is -1.

![single-deck-blackjack-netent-online.png](attachment:single-deck-blackjack-netent-online.png)
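
The ace rule above can be made concrete with a small sketch. The helper names `usable_ace` and `sum_hand` are hypothetical and chosen for illustration; cards are assumed to be encoded as integers from 1 (ace) to 10 (tens and face cards).

```python
def usable_ace(hand):
    """True if the hand holds an ace that can count as 11 without busting."""
    return 1 in hand and sum(hand) + 10 <= 21


def sum_hand(hand):
    """Total of a hand, counting one ace as 11 whenever that does not bust."""
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)


print(sum_hand([1, 6]))      # ace counted as 11
print(sum_hand([1, 6, 9]))   # ace counted as 1, since 11 would bust
print(sum_hand([10, 9, 5]))  # no ace, total exceeds 21 (bust)
```

At most one ace can ever count as 11, because two aces at 11 already sum to 22.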

## <a id="mc"></a>Reminder: Monte-Carlo evaluation

Monte Carlo methods are intuitive, episode-based methods. 

The **online Monte-Carlo** algorithm does the following: start from a state $s$, run the policy until termination (or for a large number of steps), then update the values of all encountered states before starting a new episode.

Let $(s_0, r_0, s_1, \ldots, s_T)$ be the sequence of transitions of such an episode. Then, this procedure provides an estimate $G_t$ of the value of the state $s_t$ encountered at time step $t$:
<div class="alert-success">**Monte Carlo return:**
$$G_t = \sum_{i=t}^{T-1} \gamma^{i-t} r_i$$
</div>
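
As an illustration, the returns of a whole episode can be computed in one backward pass. This is a minimal sketch, assuming $r_t$ is the reward collected on the transition from $s_t$ to $s_{t+1}$, so that $G_t = r_t + \gamma G_{t+1}$ with a return of zero after the terminal state. The function name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return [G_0, ..., G_{T-1}] for rewards [r_0, ..., r_{T-1}]."""
    returns = [0.0] * len(rewards)
    g = 0.0  # return after the terminal state is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = r_t + gamma * G_{t+1}
        returns[t] = g
    return returns


print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

Working backwards avoids recomputing the sum for every time step.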

Let $G^\pi(s)$ be the random variable corresponding to the sum of discounted rewards one can obtain from $s$ when following $\pi$. Then $V^\pi(s) = \mathbb{E}(G^\pi(s))$. Since $G_t$ is a realization of $G^\pi(s_t)$, one can design a **stochastic approximation** procedure that converges to $V^\pi(s_t)$:
<div class="alert-success">**Stochastic approximation of $V^\pi$:**
$$V(s_t) \leftarrow V(s_t) + \alpha \left[ G_t - V(s_t) \right]$$
</div>

For those unfamiliar with stochastic approximation procedures, the previous update can be read as follows: the $G_t$ are sample estimates of $V^\pi(s_t)$. If I already have an estimate $V(s_t)$ of $V^\pi(s_t)$ and I receive a new sample $G_t$, I should "pull" my previous estimate towards $G_t$. But $G_t$ is noisy, so I should be cautious and only take a small step $\alpha$ in the direction of $G_t$. This type of stochastic approximation procedure converges if the step sizes satisfy the Robbins-Monro conditions:
<div class="alert-success">**Robbins-Monro convergence conditions:**
$$\sum\limits_{t=0}^\infty \alpha_t = \infty \quad \textrm{  and  } \quad \sum\limits_{t=0}^\infty \alpha_t^2 < \infty.$$
</div>
Intuitive explanation. These conditions simply say that any value $V^\pi(s)$ should be reachable from any initial guess $V(s)$, no matter how far $V^\pi(s)$ is from this first guess; hence $\sum\limits_{t=0}^\infty \alpha_t = \infty$. However, we still need the step size to decrease so that we don't keep oscillating around $V^\pi(s)$ as we get closer; so to ensure convergence we impose $\sum\limits_{t=0}^\infty \alpha_t^2 < \infty$.
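
One step-size schedule that satisfies both conditions is $\alpha_n = 1/n$: the harmonic series diverges while $\sum 1/n^2$ converges. The sketch below applies the update with this schedule to synthetic noisy samples. The sample distribution (+1 with probability 0.3, -1 otherwise) and the function name `running_estimate` are assumptions made for illustration only.

```python
import random


def running_estimate(samples):
    """Apply V <- V + alpha_n * (G - V) with alpha_n = 1 / n."""
    value = 0.0
    for n, g in enumerate(samples, start=1):
        alpha = 1.0 / n  # sum of 1/n diverges, sum of 1/n^2 converges
        value += alpha * (g - value)
    return value


random.seed(0)
# Hypothetical noisy returns whose true mean is 0.3 * 1 + 0.7 * (-1) = -0.4.
samples = [1.0 if random.random() < 0.3 else -1.0 for _ in range(10_000)]
estimate = running_estimate(samples)
print(estimate)
```

With $\alpha_n = 1/n$, the update reduces to the running average of the samples seen so far, so the estimate approaches the true mean as more samples arrive.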

### action space: 2 possible actions
0: stick (stop)
1: hit (draw a new card)

In [11]:
print(env.action_space.n)

2


### observation space
The observation is a 3-tuple: the player's current sum, the dealer's one showing card (1-10, where 1 is an ace),
and whether or not the player holds a usable ace (0 or 1).

In [15]:
print(env.observation_space)

Tuple(Discrete(32), Discrete(11), Discrete(2))


### rewards
If neither player nor dealer busts, the outcome (win, lose, draw) is decided by whose sum is closer to 21.  
The reward for winning is +1, drawing is 0, and losing is -1.

### policy to evaluate

In [4]:
def sample_policy(observation):
    # Stick (0) when the player's sum is 20 or 21, otherwise hit (1).
    score, dealer_score, usable_ace = observation
    return 0 if score >= 20 else 1

### Function to generate an episode following a policy
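The original code for this cell is not shown, so the following is only a sketch of what such a function could look like. It assumes the classic Gym interface, where `env.reset()` returns an observation and `env.step(action)` returns a 4-tuple `(observation, reward, done, info)`; newer Gym and Gymnasium releases use different return signatures. The name `generate_episode` is hypothetical.

```python
def generate_episode(policy, env):
    """Play one episode with `policy` and return (states, actions, rewards).

    rewards[t] is the reward received after taking actions[t] in states[t].
    Assumes the classic Gym API: reset() -> obs, step() -> (obs, reward, done, info).
    """
    states, actions, rewards = [], [], []
    observation = env.reset()
    done = False
    while not done:
        action = policy(observation)
        next_observation, reward, done, _info = env.step(action)
        states.append(observation)
        actions.append(action)
        rewards.append(reward)
        observation = next_observation
    return states, actions, rewards
```

With the environment and policy defined earlier, a call would look like `states, actions, rewards = generate_episode(sample_policy, env)`. The terminal observation is not stored, since no action is taken from it.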

### MC prediction algorithm

![MC-prediction-algo.png](attachment:MC-prediction-algo.png)
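
The algorithm in the figure can be sketched in code as follows. This is an illustrative version that assumes the first-visit variant and a running-average update (step size $1/N(s)$, which satisfies the Robbins-Monro conditions); the figure itself may differ in details. The function name `first_visit_mc_prediction` and the hand-written episodes are hypothetical.

```python
from collections import defaultdict


def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns.

    `episodes` is an iterable of (states, rewards) pairs, where rewards[t]
    is the reward received after leaving states[t].
    """
    value = defaultdict(float)  # V(s), 0 for states never visited
    visits = defaultdict(int)   # N(s), number of first visits to s
    for states, rewards in episodes:
        # Time step of the first visit to each state in this episode.
        first_visit = {}
        for t, s in enumerate(states):
            first_visit.setdefault(s, t)
        g = 0.0
        for t in reversed(range(len(states))):
            g = rewards[t] + gamma * g  # return from time step t
            s = states[t]
            if first_visit[s] == t:
                visits[s] += 1
                value[s] += (g - value[s]) / visits[s]  # alpha = 1 / N(s)
    return dict(value)


# Two hand-written episodes (hypothetical data, not produced by the environment):
# hit, hit, stick and win; then hit and bust.
episodes = [
    ([(13, 10, False), (19, 10, False), (20, 10, False)], [0.0, 0.0, 1.0]),
    ([(13, 10, False)], [-1.0]),
]
V = first_visit_mc_prediction(episodes)
print(V[(13, 10, False)])  # 0.0, the average of returns +1 and -1
```

In practice the episodes would be generated by running the policy in the environment many times rather than written by hand.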