# Creating the Environment
In Blackjack, the object is to win as many hands as possible over time, hopefully ending up with more wins than losses. As in any kind of gambling, the better one's chance of winning a hand, the better chance one has of continuing and therefore winning more hands.

The starting state of any hand is the result of the deal. For the player, we know the total value of the hand; the dealer's hand is dealt with the first card face up and the second face down. The player can take one of two actions:
1. she can ask for another card (hit) and risk going bust (over 21 points), or 
2. accept her hand as it is (stand).

The turn for the player ends when she either goes bust or chooses to stand.

The dealer's rules for action are strict, allowing for no choice: 
1. As long as his hand totals 16 points or less, he must hit.
1. He must stand if his hand totals 17 points or more.
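
Because the dealer has no choice, his turn can be written as a fixed rule. The following is a minimal sketch, not the notebook's own code: `dealer_action`, `play_dealer`, and the `draw_card` callable are hypothetical names, and totals are treated as plain integers (soft aces are ignored).

```python
def dealer_action(total):
    """Return the dealer's forced move for a given hand total."""
    return 'HIT' if total <= 16 else 'STAY'


def play_dealer(total, draw_card):
    """Hit until the total is 17 or more, then return the final total.

    draw_card is a hypothetical callable that returns the point value
    of the next card; it stands in for a real shoe.
    """
    while dealer_action(total) == 'HIT':
        total += draw_card()
    return total


# Usage with a fixed sequence of card values instead of a real shoe
cards = iter([5, 4, 10])
print(play_dealer(10, lambda: next(cards)))  # 10 -> 15 -> 19, prints 19
```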

A player who goes bust loses the hand immediately. Otherwise, the hand ends after the dealer has a final score based on these rules, and the player finds out the consequences of her choice:
* If the dealer goes bust, the player wins.
* If the player's total is higher than the dealer's, the player wins.
* If the dealer's total is higher than the player's, the player loses.
* If the totals of the player and dealer are the same, the hand is a draw.
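
The outcome rules, together with the bust rule (over 21 points), can be sketched as a single function. This is an illustration under stated assumptions: `settle` is a hypothetical name, both arguments are final point totals, and a player bust is checked first because it ends the hand before the dealer plays.

```python
def settle(player_total, dealer_total):
    """Return 'WIN', 'LOSE', or 'DRAW' from the player's point of view."""
    if player_total > 21:
        return 'LOSE'   # player bust: the dealer's total does not matter
    if dealer_total > 21:
        return 'WIN'    # dealer bust
    if player_total > dealer_total:
        return 'WIN'
    if player_total < dealer_total:
        return 'LOSE'
    return 'DRAW'


print(settle(19, 18), settle(17, 20), settle(18, 18), settle(16, 23))
# prints: WIN LOSE DRAW WIN
```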

## The Model
* We can treat one hand as an episode for our learning algorithm.
* Each state is a combination of the player's total points and the dealer's face-up card.
* There are two actions: hit and stand (written `HIT` and `STAY` in the code).
* Reward is assigned as follows:
    * If the hand is over, and the player has won, then reward is +1
    * If the hand is over, and the player has lost, then reward is -1
    * Reward is 0 for all other states

In Q-Learning, an optimal policy can be found by trying actions and keeping a record of performance. Moreover, because the environment is not completely known, the algorithm must explore possibilities and anticipate the unknown. We'll need some data structures for keeping track of actions with different states and their results.

Python's dictionary structure is a good fit for that. We can use tuples to represent the state/action combinations. These tuples serve as keys into the dictionary, which stores the current estimate of Q for each combination.

In [3]:
from collections import defaultdict
# Sample Q function implementation using tuples
Q = defaultdict(float)
# A couple of state/action keys from basic strategy.
# Looking up a missing key in a defaultdict creates it with the default value 0.0.
Q[14,5,'STAY']
Q[14,7,'HIT']
Q[14,7,'STAY']
for k in sorted(Q.keys()):
    print("Q{}: {}".format(k,Q[k]))

Q(14, 5, 'STAY'): 0.0
Q(14, 7, 'HIT'): 0.0
Q(14, 7, 'STAY'): 0.0
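
The need to explore mentioned above is commonly handled with an epsilon-greedy rule: with a small probability the player picks a random action, otherwise the action with the highest Q value recorded so far. The sketch below is one possible way to do this with the tuple-keyed dictionary, not necessarily the approach this notebook ends up using; `choose_action`, `ACTIONS`, and the `epsilon` parameter are assumed names.

```python
import random
from collections import defaultdict

ACTIONS = ('HIT', 'STAY')


def choose_action(Q, state, epsilon=0.1, rng=random):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    # Q.get() reads a value without inserting a new key into the defaultdict
    return max(ACTIONS, key=lambda a: Q.get(state + (a,), 0.0))


Q = defaultdict(float)
Q[14, 7, 'HIT'] = 0.2     # made-up values, for illustration only
Q[14, 7, 'STAY'] = -0.3
print(choose_action(Q, (14, 7), epsilon=0.0))  # epsilon=0 is purely greedy: HIT
```

Setting `epsilon` to 1.0 gives fully random play, which is what the single-step test in the next section does.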


## Test Drive
Let's try a simple implementation of the basic Q-Learning algorithm. This algorithm builds a utility function Q from which, once training is finished, the best action to take in a given state can be read off. As an action is taken in a state, the reward for the action plus the estimated future reward is stored as the value of Q(s,a). As the state and action repeat during the training phase, the value of Q(s,a) is updated, using a learning rate $\alpha$ and a discount factor $\gamma$:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(R(s) + \gamma \max_{a'} Q(s',a') - Q(s,a)\right)$$
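
To make the update concrete, here is a small sketch of it as a function, followed by one worked step. The function name `q_update`, its parameters, and the `terminal` flag are assumptions made for illustration; the values `alpha=0.08` and `gamma=1.0` are example settings, not tuned ones.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.08, gamma=1.0, terminal=False):
    """Apply one Q-Learning update to Q[state + (action,)] and return the new value."""
    key = state + (action,)
    # No future reward can follow a terminal state, so the max term is 0 there
    future = 0.0 if terminal else max(Q[next_state + (a,)] for a in actions)
    Q[key] += alpha * (reward + gamma * future - Q[key])
    return Q[key]


Q = defaultdict(float)
# The player stays on 14 against a dealer 5 and wins: reward +1, hand over.
# New value: 0 + 0.08 * (1 + 0 - 0) = 0.08
print(q_update(Q, (14, 5), 'STAY', 1, (14, 5), ('HIT', 'STAY'), terminal=True))  # 0.08
```

Repeating the same winning experience moves the value further toward 1, by 8% of the remaining gap each time.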

Without getting too far ahead of ourselves, let's try one step through the algorithm. Very simply, we'll do the initialization, start the hand, and let the player take one action, chosen at random. A stay will signal the dealer's turn, and we can see if the player wins the hand or not. The Q function gets updated once.

In [None]:
from player import Player
from shoe import Shoe
from utilities import hit, newHand, deal
import random
from IPython.display import clear_output

def getAction():
    # Pick HIT or STAY at random, with equal probability
    return random.choice(('HIT', 'STAY'))

# Initialize
shoe = Shoe(1)
dealer = Player()
player = Player()
allActions=('HIT','STAY',)
Q = defaultdict(float)

In [57]:
# Starting state for a hand/episode
newHand(dealer,player,shoe)

# 1. Choose an action
action = getAction()

# 2. Observe the state and pair it with the chosen action (this tuple is the key into Q)
currentState=(player.getPoints(),dealer.hand[0],action)
print("Current state: {}".format(currentState))

# 3. Do the action
if (action == 'HIT'):
    print("Player {}: ".format(action), end=' ')
    hit(player,shoe)
newState = (player.getPoints(),dealer.hand[0])

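# Calculate reward
if (action == 'STAY'):
    # The player's turn is over: the dealer hits until reaching 17 or more
    while dealer.getPoints() < 17:
        print("Dealer HIT: ", end=' ')
        hit(dealer,shoe)

    if (player.getPoints() > dealer.getPoints()):
        reward = 1    # player has the higher total
    elif dealer.getPoints() > 21:
        reward = 1    # dealer went bust
    elif dealer.getPoints() == player.getPoints():
        reward = 0    # draw
    else:
        reward = -1   # dealer has the higher total
else:
    if player.getPoints() > 21:
        reward = -1   # player went bust
    else:
        reward = 0    # hand is still in progress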

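# 4. Update Q(s,a) with learning rate alpha = 0.08 and discount factor gamma = 1
# Looking up Q[newState + ...] in a defaultdict creates those entries with value 0.0.
# After STAY or a bust the future term should be 0; on a first pass it is,
# because every entry of Q is still 0.0.
alpha = 0.08
maxQ = max(Q[newState + (a,)] for a in allActions)
Q[currentState] = Q[currentState] + alpha*(reward + maxQ - Q[currentState])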

Q



The formula appears to be working as expected. If the pass ended the player's turn, either by staying or by busting, the Q value for that state and action was updated with the resulting reward.

Time to refactor and finish the loop.

TODO:
- Refactor
  - Separate maxQa calculation into its own function
  - Separate reward calculation into its own function
  - Create a terminal state test
- New features
  - Terminal state flag
  - Episode loop
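
As a rough preview of where the TODO list is heading, the sketch below puts the planned pieces (a max-Q helper, a reward function, a terminal test, and an episode loop) into one self-contained script. It is not the refactored notebook code: it does not use the `Player` and `Shoe` classes, every function name is hypothetical, and the deck is simplified to an infinite one where an ace always counts as 1 point. Actions are chosen at random, as in the test drive above.

```python
import random
from collections import defaultdict

ACTIONS = ('HIT', 'STAY')


def draw(rng):
    # Simplified infinite deck: ace = 1, face cards = 10
    return rng.choice([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10])


def max_qa(Q, state):
    """Highest Q value over all actions available in state."""
    return max(Q[state + (a,)] for a in ACTIONS)


def is_terminal(action, player_total):
    """The player's turn ends on STAY or on a bust."""
    return action == 'STAY' or player_total > 21


def reward_for(player_total, dealer_total):
    """+1 for a win, -1 for a loss, 0 for a draw."""
    if player_total > 21:
        return -1
    if dealer_total > 21 or player_total > dealer_total:
        return 1
    return 0 if player_total == dealer_total else -1


def run_episode(Q, rng, alpha=0.08):
    """Play one hand with random actions, updating Q after every action."""
    player = draw(rng) + draw(rng)
    upcard = draw(rng)
    dealer = upcard + draw(rng)
    done = False
    while not done:
        state = (player, upcard)
        action = rng.choice(ACTIONS)
        if action == 'HIT':
            player += draw(rng)
        done = is_terminal(action, player)
        reward, future = 0, 0.0
        if done:
            if player <= 21:
                while dealer < 17:      # dealer plays only if the player stood
                    dealer += draw(rng)
            reward = reward_for(player, dealer)
        else:
            future = max_qa(Q, (player, upcard))
        key = state + (action,)
        Q[key] += alpha * (reward + future - Q[key])


rng = random.Random(0)
Q = defaultdict(float)
for _ in range(1000):
    run_episode(Q, rng)
print("state/action pairs seen:", len(Q))
```

Keeping the terminal test separate from the update makes it explicit that the future term is dropped once the hand is over.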
