## Blackjack

Here we will implement **Monte Carlo Policy Evaluation (MCPE)** to learn the state-value function $V(s)$ for a given policy in the game of [blackjack](https://en.wikipedia.org/wiki/Blackjack).

### The game

**Rules.** We will use the version of the game discussed in the lectures where a single player (the agent) plays against the dealer. The player's objective is to obtain cards whose sum is as large as possible without exceeding 21. All face cards count as 10; an ace can count as either 1 or 11.

The game begins with two cards dealt to both the dealer and the player. The first of the dealer’s cards is face down and the second is face up. If the player has 21 immediately (for example, an ace and a face card), it is called a "blackjack". The player then wins unless the dealer also has a blackjack, in which case the game is a draw. If the player does not have a blackjack, then she can request additional cards, one by one (_hits_), until she either stops (_sticks_) or exceeds 21 (_goes bust_). If the player goes bust, she loses; if she sticks, then it becomes the dealer’s turn. 

The dealer hits or sticks according to a fixed strategy without choice: he sticks on any sum of 17 or greater, hits otherwise. If the dealer goes bust, then the player wins; otherwise, the outcome (win, lose, or draw) is determined by whose final sum is closer to 21.
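The card-counting rules and the dealer's fixed strategy can be sketched in a few lines. This is only an illustration of the rules described above, not the code of the provided `blackjack` module. The helper names `hand_value` and `dealer_action` are hypothetical, and the sketch assumes aces are stored as 1 in the list of cards.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values, with aces given as 1."""
    total = sum(cards)
    # At most one ace can count as 11, because two aces at 11 would already give 22.
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False


def dealer_action(cards):
    """Fixed dealer strategy: stick on a sum of 17 or greater, hit otherwise."""
    total, _ = hand_value(cards)
    return "stick" if total >= 17 else "hit"


print(hand_value([1, 10]))     # (21, True)
print(hand_value([1, 6, 10]))  # (17, False)
print(dealer_action([10, 6]))  # hit
print(dealer_action([10, 7]))  # stick
```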

**MDP formulation.** Playing blackjack is naturally formulated as an episodic finite MDP. Each game of blackjack is an episode. Rewards of +1, −1, and 0 are given for winning, losing, and drawing, respectively. All rewards until the end of the game are zero. We do not discount ($\gamma = 1$); therefore these terminal rewards are also the returns. The player’s actions are to `"hit"` or to `"stick"`. 
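To spell out why the terminal reward equals the return, write $G_t$ for the return from time step $t$, $R_{t+1}$ for the reward received after step $t$, and $T$ for the final time step of the episode (this notation is an assumption made here for illustration). With $\gamma = 1$ and all rewards before the end of the game equal to zero, we get

$$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} = R_T \in \{-1, 0, +1\}.$$

Every state visited during a game therefore has the same return, namely the outcome of that game.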

The states depend on the player’s cards and the dealer’s showing card. Assume that cards are dealt from an infinite deck (that is, with replacement) so that there is no advantage to keeping track of the cards already dealt. If the player holds an ace that she could count as 11 without going bust, then the ace is said to be _usable_. In this case it is always counted as 11 because counting it as 1 would make the sum 11 or less, in which case there is no decision to be made because, obviously, the player should always hit. Thus, the player makes decisions on the basis of three variables: 
- the player's current sum (an integer between 12 and 21);
- the dealer’s one showing card (an integer between 1 and 10; note that the ace is counted as 1 here); and
- whether or not the player holds a usable ace (a boolean). 

This makes for a total of $10 \times 10 \times 2 = 200$ states. We represent the state as a NumPy array of length 3 that combines these three variables in the given order. For example, if the player is dealt a 6 and a _jack_, and the dealer's showing card is an ace, the corresponding state is the NumPy array `[16, 1, False]`. The terminal state of the game is denoted by the NumPy array `[-1, -1, -1]`.
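As a quick check of the state count and the array representation, the following sketch enumerates all combinations of the three variables. The variable names are chosen for illustration only. It also shows that NumPy stores the boolean as an integer when it is mixed with integers, which is why states are printed with `0` or `1` in the last position.

```python
import itertools
import numpy as np

player_sums = range(12, 22)   # 12, ..., 21
dealer_cards = range(1, 11)   # 1 (ace), ..., 10
usable_ace = (False, True)

states = [np.array(s) for s in itertools.product(player_sums, dealer_cards, usable_ace)]
print(len(states))            # 200

# Mixing integers and a boolean in one array promotes the boolean to an integer.
example = np.array([16, 1, False])
print(example)                # [16  1  0]
```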

We use a `Blackjack` class to simulate the blackjack games.

We import the blackjack module and create a blackjack environment called `env`. The constructor method has one argument called `verbose`. If `verbose=True`, the blackjack object will regularly print the progress of the game. This is useful for getting to know the game and the provided code or if you just want to play around. You may want to set `verbose=False` when you run thousands of episodes to complete the exercise below.

In [1]:
import blackjack
env = blackjack.Blackjack(verbose=True)

We can interact with the blackjack environment using the `make_step()` method. This method takes an `action` as input and computes the response of the environment. Specifically, it returns the resulting `new_state` and the corresponding `reward` signal.

Before the player can perform actions, we have to start the game (that is, deal the starting hands). To start or reset a blackjack game, call `make_step()` without an action or with `action="reset"`.

We will now walk through several example games. We will specify a [random seed](https://en.wikipedia.org/wiki/Random_seed) for the NumPy pseudo random number generator every time before we reset the game. This allows us to keep these examples reproducible.
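The effect of seeding can be seen in isolation with the following sketch. It assumes that the environment draws its cards from NumPy's global pseudo-random number generator. The call to `np.random.randint` is only a stand-in for drawing cards and is not taken from the `blackjack` module.

```python
import numpy as np

np.random.seed(8)
first = np.random.randint(1, 11, size=5)

np.random.seed(8)  # resetting the seed restarts the same pseudo-random sequence
second = np.random.randint(1, 11, size=5)

print(np.array_equal(first, second))  # True
```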

In [2]:
import numpy as np
np.random.seed(8)
new_state, reward = env.make_step(action="reset")
print("Initial state:", new_state)
print("Reward:", reward)

The game is reset.
Player's cards: [10, 10]
Dealer's showing card: [7]
Initial state: [20  7  0]
Reward: 0


The player drew two cards with face value 10 each. The dealer also drew two cards, but we can only see the second card, a 7. The player can now choose to "hit" or "stick". Most players would stick with 20 in their hand. We call `make_step()` again and specify `action="stick"`.

In [3]:
new_state, reward = env.make_step(action = "stick")
print("The player obtains a reward of", reward)
print("The new (terminal) state is:", new_state)

The dealer's cards are: [10, 7]
The dealer has 17 points.
PLAYER WINS!
The player obtains a reward of 1
The new (terminal) state is: [-1 -1 -1]


The player won and received a reward of 1. Whenever an episode ends, the environment object sets its internal variable `self.active` to `False`. This variable is set to `True` again when we _reset_ the game. You can read it from outside the class as `env.active` to check whether an episode has ended.
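The interface described so far is enough to play a whole game in a loop. The sketch below is a minimal illustration, not part of the provided code: `play_episode` and `example_policy` are hypothetical names, and `ToyEnv` is a made-up stand-in that only mimics the `make_step()` / `active` interface so that the example runs without the `blackjack` module. It assumes that `active` is already `False` after a reset that ends the game immediately.

```python
import numpy as np


def play_episode(env, policy):
    """Play one episode and return the visited states and the final reward."""
    visited = []
    state, reward = env.make_step(action="reset")
    while env.active:
        visited.append(state)
        state, reward = env.make_step(action=policy(state))
    return visited, reward


class ToyEnv:
    """Hypothetical stand-in with the same interface (not the real Blackjack class)."""

    def __init__(self):
        self.active = False
        self.total = 0

    def make_step(self, action="reset"):
        if action == "reset":
            self.active = True
            self.total = 12
            return np.array([self.total, 5, 0]), 0
        if action == "hit" and self.total < 20:
            self.total += 4  # deterministic "card" to keep the toy predictable
            return np.array([self.total, 5, 0]), 0
        # Any other step ends the toy episode with a win.
        self.active = False
        return np.array([-1, -1, -1]), 1


def example_policy(state):
    """Arbitrary illustrative policy: stick at 16 or more, hit otherwise."""
    return "stick" if state[0] >= 16 else "hit"


visited, reward = play_episode(ToyEnv(), example_policy)
print([s.tolist() for s in visited])  # [[12, 5, 0], [16, 5, 0]]
print(reward)                         # 1
```

With the real environment, the same loop would be called as `play_episode(env, example_policy)`.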

In [4]:
np.random.seed(9)
new_state, reward = env.make_step(action="reset")
print("New state:", new_state)
print('reward',reward)

The game is reset.
Player's cards: [11, 7]
Dealer's showing card: [2]
New state: [18  2  1]
reward 0


The player already has 18 points but holds a _usable ace_, which she can count as 1 instead of 11 whenever she would otherwise _go bust_. The player can thus "hit" and hope to get closer to 21.

In [5]:
new_state, reward = env.make_step(action = "hit")
print("New state:", new_state)
print('reward',reward)

Player draws card: [2]
New sum of player's cards: [20]
New state: [20  2  1]
reward 0


Great! The player gained another 2 points and now has 20 points. The player would probably want to "stick" again...

In [6]:
new_state, reward = env.make_step(action = "stick")
print("New state:", new_state)
print('reward',reward)

The dealer's cards are: [7, 2]
The dealer has 9 points.
Dealer draws card: [3]
New dealer sum [12]
Dealer draws card: [6]
New dealer sum [18]
PLAYER WINS!
New state: [-1 -1 -1]
reward 1


The player won again! Let's play a last one.

In [7]:
np.random.seed(10)
new_state, reward = env.make_step()
env.make_step()
print("New state:", new_state)
print("Reward:", reward)

The game is reset.
Player's cards: [10, 11]
Dealer's showing card: [10]
Player has Blackjack!
The dealer's cards are: [9, 10]
PLAYER WINS!
The game is reset.
Player's cards: [7, 3, 3]
Dealer's showing card: [3]
New state: [-1 -1 -1]
Reward: 1


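The player drew a "blackjack", that is, an ace and a 10. The dealer's cards summed to 19, so the dealer had no blackjack. The player won again and received a reward of 1 without having performed an action. Note that the second `make_step()` call in the cell immediately reset the game, which is why a new hand appears in the output before the state and reward of the finished game are printed. Try out some more games to get familiar with the code!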

The task is to learn the state-value function for the policy **"Stick if the player's sum is 19 or higher, and hit otherwise."** We will compute these state values using **Monte Carlo Policy Evaluation (MCPE)**. The pseudo-code for MCPE is reproduced below from the textbook (Reinforcement Learning, Sutton & Barto, 1998, Section 5.1).
<img src="images/MCPE.png" style="width: 400px;"/>
The provided pseudo-code shows _first-visit_ MCPE. No state occurs twice during one game (episode) of blackjack: each hit either increases the player's sum or turns a usable ace into a non-usable one. In this case, first-visit MCPE and every-visit MCPE are identical.
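The averaging step of first-visit MCPE can be sketched independently of the game. The function name `mc_policy_evaluation` is hypothetical and the three episodes are made-up illustrative data, not output of the environment. States are written as tuples here so that they can serve as dictionary keys.

```python
from collections import defaultdict


def mc_policy_evaluation(episodes):
    """Average the returns observed after the first visit to each state.

    episodes: iterable of (visited_states, episode_return) pairs. The return
    is shared by all states of an episode because gamma = 1 and only the
    final reward is non-zero.
    """
    returns_sum = defaultdict(float)
    visits = defaultdict(int)
    for visited_states, episode_return in episodes:
        # dict.fromkeys keeps only the first occurrence of each state.
        for state in dict.fromkeys(visited_states):
            returns_sum[state] += episode_return
            visits[state] += 1
    return {state: returns_sum[state] / visits[state] for state in visits}


# Made-up episodes: (states visited by the player, outcome of the game).
episodes = [
    ([(13, 2, False), (18, 2, False)], -1),
    ([(18, 2, False), (20, 2, False)], 1),
    ([(20, 2, False)], 1),
]
values = mc_policy_evaluation(episodes)
print(values[(13, 2, False)])  # -1.0
print(values[(18, 2, False)])  # 0.0
print(values[(20, 2, False)])  # 1.0
```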

In [8]:
# This cell computes the state values 'v' using MCPE.

env = blackjack.Blackjack(verbose=False)

# Maps str(state) -> [sum of returns, number of visits].
valDict = {}

np.random.seed(7)
sim = 1
while sim < 100000:

    # States visited by the player during this episode.
    stateList = []
    new_state, reward = env.make_step(action="reset")

    # Policy: hit while the sum is below 19 and the game is not over.
    # If the game already ended at reset (terminal state [-1, -1, -1]),
    # no state is recorded and no action is taken.
    while new_state[0] != -1 and new_state[0] < 19:
        stateList.append(new_state)
        new_state, reward = env.make_step(action="hit")

    # Policy: stick once the sum is 19 or higher.
    if new_state[0] != -1:
        stateList.append(new_state)
        new_state, reward = env.make_step(action="stick")

    # With gamma = 1 and zero intermediate rewards,
    # the final reward is the return for every visited state.
    gameReward = reward

    for state in stateList:
        key = str(state)
        if key in valDict:
            valDict[key][0] += gameReward  # sum of returns
            valDict[key][1] += 1           # number of visits
        else:
            valDict[key] = [gameReward, 1]

    sim += 1

In [9]:
# Return the Monte Carlo estimate of the value of state s.
# v maps str(state) to [sum of returns, number of visits].
def get_state_value(s, v):
    s = np.array(s)
    sum_of_returns, visits = v[str(s)]
    return float(sum_of_returns) / float(visits)


In [10]:
print(get_state_value([17,10,0],valDict))

-0.6442792036459583


In [11]:
# This is a TEST CELL. We will use it to mark your solution. 
# All of your code must be written above this cell. 
