# Reinforcement Learning in Blackjack

## August Garibay

### Abstract

This project explores machine learning methods for the game of blackjack in a simulated environment. Card counting methods are abundant, and many are simplified for human use. Some card counting methods are too computationally demanding for humans, but computers can perform them easily. Although these computer counting techniques are more robust, they are still extensions of an idea originally created for humans.

It seems likely that the true utility of a computer is wasted on these classical card counting methods and that a better strategy might be found. This project will explore the use of reinforcement learning to converge on such a strategy.
The next section lists the goals of this project. Some of them may not be attainable given the project's limitations, but I will attempt to achieve as many as possible.

### Objectives

Listed below are my prospective goals for this assignment, roughly in the order I intend to execute them. The earlier goals have higher priority, while later ones depend on factors such as time and resources. Some are only ideas and may not be achieved within the scope of this project.

* Train networks to understand basic mechanics of hit/stay in a single player environment.
* Train networks to play against a (dealer) opponent.
* Train networks to play in a group with a dealer, allowing for examination of the network's ability to intuit probability information from the cards dealt to other players
* Incorporate betting by changing the reinforcement function to maximize winnings
* Compare performance to classical methods of card counting
* Compare performance of networks trained with a bet-driven reinforcement to those that are first rewarded for playing well and later rewarded for betting.
* Use auto-encoding to compress information related to card frequency probability

### Blackjack

Let's start by simulating the game of blackjack so that we have an environment to work in. We will use a dictionary to encode the cards' identities and a stack structure to simulate a deck.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas
import copy
import neuralnetworks as nn #will be using this given code from A6

In [2]:
# This line is needed to make the A6 neuralnetworks code work on my computer
import os

os.environ['KMP_DUPLICATE_LIB_OK']='True'

In [3]:
cards = {
    'Ace': 11,
    'King': 10,
    'Queen': 10,
    'Jack': 10,
    '10': 10,
    '9': 9,
    '8': 8,
    '7': 7,
    '6': 6,
    '5': 5,
    '4': 4,
    '3': 3,
    '2': 2
}

import random

class Shoe:
    def __init__(self, num_decks):
        self.num_decks = num_decks
        self.cards = []
        # Named discard_pile so it does not shadow the discard() method below
        self.discard_pile = []
        self.reset()

    # Resets the shoe to the original state
    # Creates a list of cards and shuffles them
    def reset(self):
        self.cards = []
        for _ in range(self.num_decks*4):
            for card in cards:
                self.cards.append(card)
        random.shuffle(self.cards)

    def draw(self):
        # When the shoe runs out, shuffle the discard pile back in
        if self.cards == []:
            self.cards = self.discard_pile
            self.discard_pile = []
            random.shuffle(self.cards)
        return self.cards.pop()

    def discard(self, card):
        self.discard_pile.append(card)

    def __len__(self):
        return len(self.cards)
    
    def get_max_size(self):
        return self.num_decks*52

This should suffice to simulate the cards. Next we define some useful functions for the game.

In [4]:
def bust(hand):
    return hand_value(hand) > 21

def hit(hand, shoe):
    hand.append(shoe.draw())

def hand_value(hand):
    hand_value = 0
    for card in hand:
        hand_value += cards[card]
    
    if hand_value > 21:
        for card in hand:
            if card == 'Ace':
                hand_value -= 10
                if hand_value <= 21:
                    break
    return hand_value
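
To make the ace handling concrete, here is a small self-contained check of the same rule on a few hands. It restates the card values and the scoring logic under the hypothetical names `CARD_VALUES` and `soft_hand_value` so that it can run on its own; it is an illustration, not a replacement for `hand_value`.

```python
# Card values as in the `cards` dictionary above; aces start at 11.
CARD_VALUES = {'Ace': 11, 'King': 10, 'Queen': 10, 'Jack': 10,
               **{str(n): n for n in range(2, 11)}}

def soft_hand_value(hand):
    """Score a hand, demoting aces from 11 to 1 one at a time while over 21."""
    total = sum(CARD_VALUES[card] for card in hand)
    aces = hand.count('Ace')
    while total > 21 and aces > 0:
        total -= 10
        aces -= 1
    return total

for hand in (['Ace', 'King'], ['Ace', 'Ace', '9'], ['Ace', '7', '8'], ['King', '9', '5']):
    print(hand, soft_hand_value(hand))
```

Only as many aces as needed are demoted, so two aces and a nine still score 21 rather than 11.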


For the first part we will simply ask the Qnet to decide whether to hit or stay. The reward function gives the value of the agent's hand, or -1 if the agent busts.
For the state vector, we will count how many of each card the agent has in its hand.
In this context it might be preferable to give only the score of the hand and the number of aces.
The reason for giving complete information about the hand will become apparent in later parts of the project, where the goal is to give the agent information about the probability of cards remaining in the shoe.

This section is really to get the setup working before complicating things with a dealer and other players.

In [5]:
import numpy as np

In [6]:
# 0 = hit and 1 = stay
actions = [0, 1]

def solo_reinforcement(hand):
    if bust(hand):
        return -1
    return hand_value(hand)

def valid_actions(hand):
    if bust(hand):
        return []
    return actions

The following was taken from assignment 6 and will prove useful throughout the project.

In [7]:
def stack_sa(s, a):
    return np.hstack((s, a)).reshape(1, -1)
    
#small alteration made to the code given in A6 i.e. the addition of the hand parameter which is used to find valid actions
def epsilon_greedy(Qnet, state, epsilon, hand):
    
    # Copy the list so the shuffle below does not reorder the global `actions`
    actions = list(valid_actions(hand))
    
    if np.random.uniform() < epsilon:
        # Random Move
        action = np.random.choice(actions)
        
    else:
        # Greedy Move
        np.random.shuffle(actions)
        Qs = np.array([Qnet.use(stack_sa(state, a)) for a in actions])
        action = actions[np.argmax(Qs)]
        
    return action

def setup_standardization(Qnet, Xmeans, Xstds, Tmeans, Tstds):
    Qnet.Xmeans = np.array(Xmeans)
    Qnet.Xstds = np.array(Xstds)
    Qnet.Tmeans = np.array(Tmeans)
    Qnet.Tstds = np.array(Tstds)
    Qnet.Xconstant = Qnet.Xstds == 0
    Qnet.Tconstant = Qnet.Tstds == 0
    Qnet.XstdsFixed = copy.copy(Qnet.Xstds)
    Qnet.TstdsFixed = copy.copy(Qnet.Tstds)

The following function turns the raw hand information into a vector to be used as an input for the Qnet.

In [8]:
def make_hand_vector(hand):
    hand_vector = np.zeros(11)
    for card in hand:
        if card == 'Ace':
            hand_vector[0] += 1
        elif card == 'King' or card == 'Queen' or card == 'Jack':
            hand_vector[1] += 1
        else:
            hand_vector[int(card)] += 1
    return hand_vector
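
As an illustration of the layout, index 0 counts aces, index 1 counts face cards, and indices 2 through 10 count the numbered cards by rank. The sketch below re-implements the encoding with plain lists (the name `encode_hand` is hypothetical) and prints the vector for one example hand.

```python
def encode_hand(hand):
    """Count cards per slot: 0 = aces, 1 = King/Queen/Jack, 2..10 = pip value."""
    vector = [0] * 11
    for card in hand:
        if card == 'Ace':
            vector[0] += 1
        elif card in ('King', 'Queen', 'Jack'):
            vector[1] += 1
        else:
            vector[int(card)] += 1
    return vector

print(encode_hand(['Ace', 'Queen', '7', '10', '7']))
# [1, 1, 0, 0, 0, 0, 0, 2, 0, 0, 1]
```

Note that a `'10'` and a face card land in different slots even though both are worth 10 points.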

In [9]:
#Play a single hand of blackjack
#adapted from the code given in A6 to work with the new reinforcement function
def solo_hand(Qnet, shoe, reinforcement_f, epsilon):
    Samples = {'SA': [], 'R': [], 'Qnext': []}
    
    hand = []
    hit(hand, shoe)
    hit(hand, shoe)
    
    while True:
        state = make_hand_vector(hand)
        action = epsilon_greedy(Qnet, state, epsilon, hand)
        if action == 1:
            reward = reinforcement_f(hand)
            Samples['SA'].append(stack_sa(state, action))
            Samples['R'].append(reward)
            Samples['Qnext'].append(Qnet.use(stack_sa(state, action)))
            break
        hit(hand, shoe)

        reward = reinforcement_f(hand)
        Samples['SA'].append(stack_sa(state, action))
        Samples['R'].append(reward)
        Samples['Qnext'].append(Qnet.use(stack_sa(state, action)))
        if bust(hand):
            break

    Samples['SA'] = np.vstack(Samples['SA'])
    Samples['R'] = np.array(Samples['R']).reshape(-1, 1)
    Samples['Qnext'] = np.array(Samples['Qnext']).reshape(-1, 1)
    return Samples
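
In `solo_hand`, each `Qnext` entry is the network's output for the state-action pair that was just taken. For comparison, the usual one-step temporal-difference target bootstraps from the following step and uses zero once the hand is over. The sketch below shows that bookkeeping on plain lists; `td_targets` and its arguments are hypothetical names for illustration and are not part of the A6 code.

```python
def td_targets(rewards, q_values, gamma=1.0):
    """Build one-step TD targets for a single episode.

    rewards[t]  : reward received after step t
    q_values[t] : current estimate Q(s_t, a_t) for the pair taken at step t
    The target for step t is r_t + gamma * Q(s_{t+1}, a_{t+1}); the last step
    of the episode has no successor, so its target is just r_t.
    """
    targets = []
    for t, reward in enumerate(rewards):
        is_last = (t == len(rewards) - 1)
        q_next = 0.0 if is_last else q_values[t + 1]
        targets.append(reward + gamma * q_next)
    return targets

# A hand with two hits and then a stand; the numbers are made up.
print(td_targets(rewards=[0.0, 0.0, 19.0], q_values=[15.0, 17.0, 18.5]))
# [17.0, 18.5, 19.0]
```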

From here, we are ready to set up the agent environment for this simple case. 
Each round will be given a new shoe, so the agent will not have much probability information to work with.
The agent will need to learn basic principles of when to hit and stay based on its own hand.
The agent is not anticipated to perform very well for this reason. 
The agent will be trained for 1000 rounds, and then tested for 1000 rounds.

The approach takes inspiration from A6.
Here, however, the agent and its environment are wrapped in a class so that they can be easily modified for our later cases.

In [10]:
class solo_play_runner:
    def __init__(self, shoe, reinforcement_f, epsilon_final):
        self.shoe = shoe
        self.reinforcement_f = reinforcement_f
        self.epsilon_final = epsilon_final
        self.epsilon = 1
        self.epsilon_decay = np.exp(np.log(self.epsilon_final/self.epsilon))
        self.gamma = 1
        self.outcomes = []
        self.Qnets = []
    
    ''' 
    Runs the experiment for nn's of the given parameters and returns the results
    This function is the main grid search loop
    This method should not be overridden
    '''
    def run(self, n_hiddens_list, n_epochs_list, learning_rate_list, repetitions = 5, n_trials = 1000):
        self.results = []
        for n_hiddens in n_hiddens_list:
            for n_epoch in n_epochs_list:
                for learning_rate in learning_rate_list:
                    for rep in range(repetitions):
                        self.results.append(self.run_repetition(n_hiddens, n_epoch, learning_rate, n_trials))
                        print(self.results[-1])
                        self.Qnets.append(self.Qnet)

    """
    The following 3 methods should be overridden by the subclass
    This should be sufficient for most experiments
    """

    #info about the input and output vectors
    def get_signatures(self):
        ace = 1
        K_Q_J = 1
        num_cards = 9
        num_action = 1
        #the first three sum to 11 (the hand vector); adding the action gives 12
        signatures = {
            "input_length": ace + K_Q_J + num_cards + num_action,
            "output_length": 1 
        }
        return signatures
    
    #converts the arguments to the input vector
    #should be overridden by the subclass
    #args is a dictionary with any kind of protocol as long as you follow it when calling get_input_vector
    #Note this is not needed for the solo_play_runner
    def get_input_vector(self, args):
        hand = args['hand']
        s = make_hand_vector(hand)
        a = args['action']
        return stack_sa(s, a)

    # this is the game logic of a trial of the experiment
    # this should be overridden by the subclass
    # this method should return the result of the trial
    def get_samples(self):
        return solo_hand(self.Qnet, Shoe(1), self.reinforcement_f, self.epsilon)

    """
    Runs a single repetition of the experiment for the given parameters
    This method should not be overridden
    """
    def run_repetition(self, n_hiddens, n_epoch, learning_rate, n_trials):
        signatures = self.get_signatures()
        self.Qnet = nn.NeuralNetwork(signatures["input_length"], n_hiddens, signatures["output_length"])

        setup_standardization(self.Qnet, [0]*signatures["input_length"], [1]*signatures["input_length"], [0]*signatures["output_length"], [1]*signatures["output_length"])

        self.epsilon = 1
        self.outcomes = []

        for trial in range(n_trials):
            result = self.run_trial(n_epoch, learning_rate)

        return [n_hiddens, n_epoch, learning_rate ,np.mean(self.outcomes)]

    """
    Runs a single trial of the experiment for the given parameters
    This method should not be overridden
    """
    def run_trial(self, n_epoch, learning_rate):
        Samples = self.get_samples()
        SA = Samples['SA']
        R = Samples['R']
        Qn = Samples['Qnext']
        T = R + self.gamma * Qn
        self.Qnet.train(Samples['SA'], T, n_epoch, method='sgd', learning_rate=learning_rate)
        self.epsilon *= self.epsilon_decay
        self.outcomes.append(Samples['R'][-1])

    def get_results(self):
        return self.results
    
    def get_Qnet(self):
        return self.Qnet
    
    def get_Qnets(self):
        return self.Qnets
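
A note on the exploration schedule: `np.exp(np.log(epsilon_final / epsilon))` simplifies to `epsilon_final / epsilon`, so with the settings used here epsilon is multiplied by 0.1 after every trial. If the intent is for epsilon to reach `epsilon_final` only at the end of training, the exponent can be divided by the number of trials. The sketch below compares the two multipliers; `decay_factor` is a hypothetical helper, not part of the runner.

```python
import math

def decay_factor(epsilon_start, epsilon_final, n_trials):
    """Per-trial multiplier so that epsilon_start * factor**n_trials == epsilon_final."""
    return math.exp(math.log(epsilon_final / epsilon_start) / n_trials)

n_trials = 8000                               # example run length
per_trial = decay_factor(1.0, 0.1, n_trials)  # decay spread over the whole run
one_shot = decay_factor(1.0, 0.1, 1)          # equals epsilon_final / epsilon_start

# Epsilon after a few trials under each multiplier
for trial in (1, 2, 3):
    print(trial, per_trial ** trial, one_shot ** trial)

# Only the per-trial schedule is still exploring halfway through the run
print(per_trial ** (n_trials // 2))
```

With the one-shot multiplier the agent is almost purely greedy after a handful of hands, which limits how much of the state space it ever sees.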


        



### Training Single Player Qnets
We will now try this simple agent/environment setup with a few architectures to see how it performs.
Then we will play a few hands with the best performing Qnet to see how it behaves.

To interpret the output of the runner, look at the final value in each row, which is the average reinforcement from the last step of each hand, taken over all hands played in that run.
Since we will be reusing the runner class throughout the project, this interpretation will persist.

For this particular case, the reinforcement is the final value of the hand (or -1 for a bust). We would like to see this as close to 21 as possible.

In [36]:
runner = solo_play_runner(Shoe(1), solo_reinforcement, 0.1)
runner.run([[10,50,10], [20, 20]], 
           [100, 50], 
           [0.1, .01],
            1, 8000
           )

[[10, 50, 10], 100, 0.1, 11.057375]
[[10, 50, 10], 100, 0.01, 14.005375]
[[10, 50, 10], 50, 0.1, 8.339]
[[10, 50, 10], 50, 0.01, 14.49375]




[[20, 20], 100, 0.1, 10.297625]
[[20, 20], 100, 0.01, 14.5885]
[[20, 20], 50, 0.1, 10.17525]
[[20, 20], 50, 0.01, 14.54275]


The `.01` learning rate outperforms `.1` in every configuration, and there is very little difference among the runs that share the `.01` learning rate.
We will pick one of these architectures and try to improve it further.

In [37]:
runner = solo_play_runner(Shoe(1), solo_reinforcement, 0.1)
runner.run([[20, 20]],
           [100, 200],
           [.01, .001],
            1, 8000
            )

[[20, 20], 100, 0.01, 14.500625]
[[20, 20], 100, 0.001, 13.991875]
[[20, 20], 200, 0.01, 14.624]
[[20, 20], 200, 0.001, 14.572875]


In [14]:
#Best performer
runner = solo_play_runner(Shoe(1), solo_reinforcement, 0.1)
runner.run([[20, 20]],[200], [.01], 1, 8000)

[[20, 20], 200, 0.01, 14.479]


Judging by this result, the Qnet does not seem inclined to bust very often, or the average would be lower.
While promising, this leaves a bit to be desired, and it isn't clear yet how the decisions are being made.

### Playing a few hands solo with the best Qnet

We will simulate a few hands with the best performing Qnet to see how it performs.
We will state the events of the game in plain English print statements.

In [11]:
#print out the game as a narrative
def play_solo(Qnet):
    hand = []
    shoe = Shoe(1)
    hit(hand, shoe)
    hit(hand, shoe)
    print("The agent starts with the hand: ", hand)
    while True:
        state = make_hand_vector(hand)
        action = epsilon_greedy(Qnet, state, 0, hand)
        if action == 1:
            print('stand')
            break
        hit(hand, shoe)
        print('hit', hand)
        if bust(hand):
            print('bust')
            break
    print('Final score: ', hand_value(hand))


In [15]:
for i in range(20):
    np.random.seed(i)
    play_solo(runner.get_Qnet())
    print('')

The agent starts with the hand:  ['Ace', '10']
stand
Final score:  21

The agent starts with the hand:  ['Ace', 'Jack']
stand
Final score:  21

The agent starts with the hand:  ['7', '9']
stand
Final score:  16

The agent starts with the hand:  ['4', 'Ace']
stand
Final score:  15

The agent starts with the hand:  ['3', '9']
stand
Final score:  12

The agent starts with the hand:  ['8', '9']
stand
Final score:  17

The agent starts with the hand:  ['8', 'Queen']
stand
Final score:  18

The agent starts with the hand:  ['Jack', '10']
stand
Final score:  20

The agent starts with the hand:  ['9', '6']
stand
Final score:  15

The agent starts with the hand:  ['9', '2']
stand
Final score:  11

The agent starts with the hand:  ['Jack', '5']
stand
Final score:  15

The agent starts with the hand:  ['6', '2']
stand
Final score:  8

The agent starts with the hand:  ['King', '7']
stand
Final score:  17

The agent starts with the hand:  ['9', '2']
stand
Final score:  11

The agent starts with the

Here we see that the best performing agent simply learns to stand in every situation.
This is not the best strategy, but it is reasonable for the Qnet to arrive at it.
Earlier experimentation, not included above, landed on the simpler architecture below, which plays in a much more natural way.

In [42]:
runner = solo_play_runner(Shoe(1), solo_reinforcement, 0.1)
runner.run([[10]],[100], [.1], 1, 1000)

[[10], 100, 0.1, 13.157]


In [43]:
for i in range(20):
    np.random.seed(i)
    play_solo(runner.get_Qnet())
    print('')

The agent starts with the hand:  ['10', 'Ace']
hit ['10', 'Ace', '4']
stand
Final score:  15

The agent starts with the hand:  ['8', 'Queen']
stand
Final score:  18

The agent starts with the hand:  ['8', 'King']
stand
Final score:  18

The agent starts with the hand:  ['Queen', '7']
stand
Final score:  17

The agent starts with the hand:  ['5', '2']
stand
Final score:  7

The agent starts with the hand:  ['5', 'King']
stand
Final score:  15

The agent starts with the hand:  ['2', '4']
stand
Final score:  6

The agent starts with the hand:  ['Jack', '7']
stand
Final score:  17

The agent starts with the hand:  ['King', '6']
stand
Final score:  16

The agent starts with the hand:  ['9', '6']
hit ['9', '6', '2']
hit ['9', '6', '2', 'Queen']
bust
Final score:  27

The agent starts with the hand:  ['Queen', 'Ace']
hit ['Queen', 'Ace', 'Jack']
stand
Final score:  21

The agent starts with the hand:  ['6', '2']
hit ['6', '2', '5']
stand
Final score:  13

The agent starts with the hand:  ['4'

To address the issue of playing too conservatively we will modify the reward function.
Without any competition, it is reasonable that the Qnet learns to be conservative, since most points come for free and many are lost in a bust.
To fix this I will try transforming the hand value with a non-linear growth function.
This will give the Qnet more incentive to play for higher scores.

By using an exponential we guarantee that the reward for a single extra point outweighs the penalty for losing all points.
Also, by treating a bust slightly differently, we can give the agent a little extra fear of flat out losing.
Hopefully this will encourage riskier behavior.
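
To see why the exponential makes an extra point attractive, compare the gain from ending on $v + 1$ instead of $v$, which is $e^{v+1} - e^{v}$, with the loss from busting instead of standing on $v$, which is $e^{v} + e^{0.5}$ when a bust is rewarded with $-e^{0.5}$ as in the function below. The following is a quick numeric check of that comparison; the helper names are hypothetical.

```python
import math

BUST_REWARD = -math.exp(0.5)  # reward assigned to a bust under the exponential scheme

def gain_from_one_more_point(v):
    """Extra reward from ending on v + 1 instead of v."""
    return math.exp(v + 1) - math.exp(v)

def loss_from_busting(v):
    """Reward given up by busting instead of standing on v."""
    return math.exp(v) - BUST_REWARD

for v in (4, 12, 16, 20):
    print(v, gain_from_one_more_point(v) > loss_from_busting(v))
```

The gain exceeds the loss for every hand value from 4 to 20, so this reward leans toward hitting much more than the linear one does.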

In [12]:
def nonlinear_solo_reward(hand):
    hand_val = hand_value(hand)
    if bust(hand):
        hand_val = -.5
    abs_val = np.abs(hand_val)
    hand_sign = np.sign(hand_val)
    hand_val = np.exp(abs_val) * hand_sign
    return hand_val

In [13]:
def convert_result(runner):
    raw = runner.get_results()[-1][-1]
    return np.log(np.abs(raw)) * np.sign(raw)
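
Because `convert_result` takes the logarithm of the *mean* exponential reward, the number it returns is weighted toward the highest hands and is in general not the mean hand value. A small sketch with made-up hand values (busts left out for simplicity) shows the gap; `log_mean_exp` is a hypothetical helper.

```python
import math

def log_mean_exp(values):
    """log of the average of exp(v): what taking log of the mean reward computes."""
    return math.log(sum(math.exp(v) for v in values) / len(values))

hands = [12, 14, 15, 17, 21]           # hypothetical final hand values
plain_mean = sum(hands) / len(hands)   # 15.8
print(plain_mean, log_mean_exp(hands))
```

The second number is pulled up toward the single 21, so converted scores are best compared only with other converted scores.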

In [41]:
runner = solo_play_runner(Shoe(1), nonlinear_solo_reward, 0.1)
runner.run([[20, 20]],[200], [.01], 1, 8000)
print(convert_result(runner))

[[20, 20], 200, 0.01, 124574388.82746418]
18.640413596058814


In [42]:
qnet = runner.get_Qnet()
print(qnet.get_training_time())
for i in range(20):
    np.random.seed(i)
    play_solo(runner.get_Qnet())
    print('')

0.04695892333984375
The agent starts with the hand:  ['King', 'Jack']
stand
Final score:  20

The agent starts with the hand:  ['10', '9']
hit ['10', '9', 'King']
bust
Final score:  29

The agent starts with the hand:  ['9', '3']
hit ['9', '3', '6']
hit ['9', '3', '6', '2']
stand
Final score:  20

The agent starts with the hand:  ['4', '9']
stand
Final score:  13

The agent starts with the hand:  ['2', '6']
stand
Final score:  8

The agent starts with the hand:  ['8', '6']
stand
Final score:  14

The agent starts with the hand:  ['2', 'Jack']
stand
Final score:  12

The agent starts with the hand:  ['9', '8']
stand
Final score:  17

The agent starts with the hand:  ['10', 'Ace']
stand
Final score:  21

The agent starts with the hand:  ['Jack', '6']
hit ['Jack', '6', '6']
bust
Final score:  22

The agent starts with the hand:  ['8', '10']
stand
Final score:  18

The agent starts with the hand:  ['Jack', '2']
hit ['Jack', '2', '4']
hit ['Jack', '2', '4', '9']
bust
Final score:  25

The a

This result is markedly better than the previous one.
The agent still makes dubious decisions, but it is more versatile in its play.
The converted score is also much higher than the earlier average hand value, although it is the log of the mean exponential reward and so weights high hands more heavily than a plain average would.
These strategies will need to be adjusted, however, for the cases that follow.

### Casino Blackjack with a Dealer and other Players but no Betting

Now we will add a dealer and other players to the game.
The approach for this is as follows:

* The `agent hand` is handled as before
* The `dealer hand` is treated as a separate hand
* The `other players hand` (including the dealer) is treated as a single hand, i.e. this hand is all non-agent cards dealt.

For the Qnet the state vector will be much larger than before in order to keep track of all this information.
The agent's hand will be vectorized and stacked with the hand vector for the `other players hand`.
The dealer's hand will be turned into a simpler vector before stacking, which contains only the number of visible aces and the visible card score.
Finally we provide a single element describing the shoe; in the code below this is the number of cards already drawn from it.

Together with the action this is a *26* dimensional input vector, which is much larger than the previous *12* dimensional one.

The objective here is simply for the agent to know when to hit and stay with better confidence from the extra probability information and win more often.
Winning in this context is redefined to be beating the dealer's hand without busting.

The dealer will play by the following rules:

* If the dealer has a score of 17 or more, they will stay
* If the dealer has a score of 16 or less, they will hit

The other players will be dealt cards from the same shoe as everyone else, but we don't care about their individual hands.
So we will deal a card to the `other players hand` for each player and simply let them play by the same rules as the dealer.

We will carry on over several hands like this until the shoe gets too low on cards.
After each hand the information about the `other players hand` will be kept as an initial state for the next hand.
The `other players hand` will reset when the shoe does.

Let's start with some useful functions for this game.

In [14]:
def does_dealer_hit(hand):
    return hand_value(hand) < 17

#When the dealer hits we need to add the card to the `other_hand` list too so that we can keep track of the contents of the shoe.
def hit_dealer(dealer_hand, other_hand, shoe):
    hit(dealer_hand, shoe)
    other_hand.append(dealer_hand[-1])

def dealers_play(dealer_hand, other_hand, shoe):
    while does_dealer_hit(dealer_hand):
        hit_dealer(dealer_hand, other_hand, shoe)

def other_players_play(other_hand, shoe):
    while True:
        if hand_value(other_hand) >= 17:
            break
        hit(other_hand, shoe)

def deal(shoe, player_hand, other_hand, dealer_hand, n_others):
    for _ in range(2):
        hit(player_hand, shoe)
        for _ in range(n_others):
            hit(other_hand, shoe)
        hit(dealer_hand, shoe)

def reinforcement_group(hand, dealer_hand, end):
    if bust(hand):
        return -1
    if bust(dealer_hand):
        return 1
    if hand_value(hand) > hand_value(dealer_hand) and end:
        return 1
    if hand_value(hand) < hand_value(dealer_hand) and end:
        return -1
    return 0

def is_shoe_done(shoe):
    return len(shoe) < shoe.get_max_size() * 0.25

def get_num_of_other_players(shoe):
    return shoe.get_max_size() // 52 - 1

def make_dealer_hand_vector(hand):
    visible_part = hand[1:]
    vis_vector = make_hand_vector(visible_part)
    vis_value = hand_value(visible_part)
    num_aces_visible = vis_vector[0]
    result = np.array([num_aces_visible, vis_value])
    return result


Now on to the experiment runner, which will inherit from the previous with a few modifications.
The new runner must reflect the new input space in the `get_signatures` method and change the way the input state vector is initialized.
The `get_samples` method will now need to reflect the new rules for the game and gameplay flow.
Changes of this kind recur throughout the project, so I will not comment on them in detail.

In [15]:
class group_play_runner(solo_play_runner):
    def __init__(self, shoe, reinforcement_f, epsilon_final):
        super().__init__(shoe, reinforcement_f, epsilon_final)
        self.player_hand = []
        self.dealer_hand = []
        self.other_hands = []

    def get_signatures(self):
        hand_vector_size = 11
        dealer_hand_vector_size = 2
        shoe_information = 1 
        num_actions = 1 #stand or hit is only 1 type of action
        return {
            "input_length": hand_vector_size*2 + dealer_hand_vector_size + shoe_information + num_actions,
            "output_length": 1
        }
    
    def get_input_vector(self, args):
        hand = args['hand']
        dealer_hand = args['dealer_hand']
        shoe = args['shoe']
        others_hand = args['others']
        player = make_hand_vector(hand)
        others = make_hand_vector(others_hand)
        dealer = make_dealer_hand_vector(dealer_hand)
        out = np.hstack([player, others, dealer, shoe.get_max_size() - len(shoe)])
        return out
    
    def get_samples(self):
        Samples = { "SA": [], "R": [], "Qnext": [] }
        self.player_hand = []
        self.dealer_hand = []

        if is_shoe_done(self.shoe):
            #number of decks between 1 and 6
            self.shoe = Shoe(np.random.randint(1, 7))
            self.other_hands = []
        #count the other players after any reshuffle so it matches the current shoe
        n_others = get_num_of_other_players(self.shoe)

        deal(self.shoe, self.player_hand, self.other_hands, self.dealer_hand, n_others)

        for _ in range(n_others):
            other_players_play(self.other_hands, self.shoe)

        while True:
            state = self.get_input_vector({
                "hand": self.player_hand,
                "others": self.other_hands,
                "dealer_hand": self.dealer_hand,
                "shoe": self.shoe,
            })
            action = epsilon_greedy(self.Qnet, state, self.epsilon, self.player_hand)
            
            if action == 1:
                reward = self.reinforcement_f(self.player_hand, self.dealer_hand, False)
                Samples['SA'].append(stack_sa(state, action))
                Samples['R'].append(reward)
                Samples['Qnext'].append(self.Qnet.use(stack_sa(state, action)))
                break
        
            hit(self.player_hand, self.shoe)

            reward = self.reinforcement_f(self.player_hand, self.dealer_hand, False)
            Samples['SA'].append(stack_sa(state, action))
            Samples['R'].append(reward)
            Samples['Qnext'].append(self.Qnet.use(stack_sa(state, action)))
            if bust(self.player_hand):
                break

        dealers_play(self.dealer_hand, self.other_hands, self.shoe)

        reward = self.reinforcement_f(self.player_hand, self.dealer_hand, True)
        Samples['R'][-1] = reward
        Samples['Qnext'][-1] = self.Qnet.use(stack_sa(state, -1))
        
        Samples['SA'] = np.vstack(Samples['SA'])
        Samples['R'] = np.array(Samples['R']).reshape(-1, 1)
        Samples['Qnext'] = np.array(Samples['Qnext']).reshape(-1, 1)
        return Samples

### Training Casino Blackjack Qnets

Much like before, we will train a few Qnets to see how they perform.

Interpreting the output is similar to before, but now we have a different desired reinforcement value, since the function is fundamentally different.
0 means that the agent wins as often as it loses. The range of outcomes is in the interval [-1, 1] with -1 being always losing and 1 being always winning.
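
Since each hand contributes +1 for a win, -1 for a loss and 0 for a push, the mean reward equals the win rate minus the loss rate. Assuming a known push rate (zero by default, which is a simplification), a mean reward can be translated into a win rate as in this sketch; `win_rate_from_mean` is a hypothetical helper.

```python
def win_rate_from_mean(mean_reward, push_rate=0.0):
    """Solve win - loss = mean_reward and win + loss + push = 1 for the win rate."""
    return (1.0 - push_rate + mean_reward) / 2.0

for mean_reward in (-0.5, 0.0, 0.5):
    print(mean_reward, win_rate_from_mean(mean_reward))
```

For example, a mean reward of -0.5 with no pushes corresponds to winning one hand in four.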

In [159]:
runner = group_play_runner(Shoe(1), reinforcement_group, 0.1)
runner.run([[5,5],[10],[10,5,10]], [200,100], [.01,.001], 1, 8000)

[[5, 5], 200, 0.01, -0.48]
[[5, 5], 200, 0.001, -0.5935]
[[5, 5], 100, 0.01, -0.512875]
[[5, 5], 100, 0.001, -0.403125]
[[10], 200, 0.01, -0.366375]
[[10], 200, 0.001, -0.514125]
[[10], 100, 0.01, -0.549625]
[[10], 100, 0.001, -0.76925]
[[10, 5, 10], 200, 0.01, -0.50475]
[[10, 5, 10], 200, 0.001, -0.51]
[[10, 5, 10], 100, 0.01, -0.574]
[[10, 5, 10], 100, 0.001, -0.564375]


As noted above, a result of 0 under this reinforcement function means an equal number of wins and losses.
These results show that the models lose considerably more often than they win.
The dealer does have an advantage in the game, but it is typically only a few percent, which suggests that the Qnets are not learning the game very well.

There are a few considerations as to why this might be happening.
First, the input space is rather large and some of the information is closely related, like the separate hands.
Other information has a very different meaning.
The Qnet might have an easier time if we could organize the data in a more meaningful way.

### Autoencoding the input space

Aces play a special role in the game, so they should be treated differently.
The next idea for improving on this is to reduce the dimensionality of the input space by creating an autoencoder for the non-ace cards in a hand.
The Qnet's input space is then reduced by passing both the player's hand and the others' hand through the autoencoder and stacking the results with the other parameters.

Typical card counting methods boil hand information down into a single numerical value. We will allow the Qnet 3 parameters in the hope that it finds a more meaningful way to represent the information.

This will use a given file `neuralnetworks_torch.py` to speed up training and to utilize the `use_to_middle` method for the autoencoder.

In [16]:
import torch
import neuralnetworks_torch as nntorch

Now to build an autoencoder for the non-ace cards.
Its inputs are in $\mathbb{N}^{10}$, one count for each non-ace slot of the hand vector.
We also need a random generator to produce vectors for the autoencoder to train on.

In [17]:
def random_10_vector():
    spread = np.random.randint(0, 12, 10)
    return np.abs(spread)

To train the encoder we will generate batches of hands using our `random_10_vector` function.
The function below takes the number of such vectors per batch and the number of batches to iterate over.

In [18]:
def train_auto(n_hidden, n_epoch, learning_rate, batch_size, num_batches):
    net = nntorch.NeuralNetwork(10, n_hidden, 10, ['relu']*len(n_hidden), device='cuda')
    for b in range(num_batches):
        X = torch.from_numpy(np.vstack([random_10_vector() for _ in range(batch_size)])).float().to(net.device)
        T = X
        if b % 100 == 0:
            print('batch: ', b)
            error = net.use(X) - T.detach().cpu().numpy()
            print('RMSE: ', np.sqrt(np.mean(error**2)))
        net.fit(X, T, n_epoch, method='adam', learning_rate=learning_rate, verbose=False)
    return net


In [180]:
auto = train_auto([3], 150, .01, 1000, 2000)

batch:  0
RMSE:  6.6893597
batch:  100
RMSE:  2.8917954
batch:  200
RMSE:  2.868177
batch:  300
RMSE:  2.9151592
batch:  400
RMSE:  2.8944354
batch:  500
RMSE:  2.870629
batch:  600
RMSE:  2.9048617
batch:  700
RMSE:  2.881488
batch:  800
RMSE:  2.9062338
batch:  900
RMSE:  2.8914313
batch:  1000
RMSE:  2.9289382
batch:  1100
RMSE:  2.9032214
batch:  1200
RMSE:  2.876069
batch:  1300
RMSE:  2.8815217
batch:  1400
RMSE:  2.847389
batch:  1500
RMSE:  2.854869
batch:  1600
RMSE:  2.8949654
batch:  1700
RMSE:  2.8776548
batch:  1800
RMSE:  2.928284
batch:  1900
RMSE:  2.8965008


Many other architectures were tried as well, and most of them performed no better than this.
A version of the blackjack runner that used this autoencoder was also tried, and no significant improvement was seen.
I regret that those runs are not included in this report.
Instead I will simply state that the results were not significant enough to warrant further investigation.
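
One way to read the plateau near 2.89: the training vectors have ten independent entries drawn uniformly from 0 to 11, so there is no shared structure for a bottleneck to exploit. For data like this, the best a *linear* encoder with k bottleneck units can do is keep k coordinates' worth of variance and predict the mean for the rest. The sketch below computes that baseline with hypothetical helper names; it is a back-of-the-envelope estimate under the linear assumption, not a bound on what a nonlinear network can reach.

```python
import math

def uniform_int_variance(n_values):
    """Variance of an integer drawn uniformly from 0 .. n_values - 1."""
    return (n_values ** 2 - 1) / 12

def linear_bottleneck_rmse(n_inputs, n_bottleneck, n_values=12):
    """RMSE when only n_bottleneck of n_inputs independent coordinates are kept."""
    lost_fraction = (n_inputs - n_bottleneck) / n_inputs
    return math.sqrt(lost_fraction * uniform_int_variance(n_values))

print(linear_bottleneck_rmse(10, 3))  # about 2.89
```

The estimate lands very close to the RMSE at which the training above levels off, which suggests the network is limited by the randomness of the training data rather than by its architecture.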

It is notable that the number of combinations of just 0 and 1 in a vector of ten is very large.
So to try to improve this I will attempt to do a few layers of autoencoding.
2 three dimensional encodings and a a 4 dimensional encoding, each with two outputs will give us 6 outputs.
We can then use the three dimensional autoencoders to reduce this to 4 outs.

There is little reason to believe this will work better than the previous technique in the final product.
Each encoder should be more robust than the one above, but whether the dilution of information will be worth it remains to be seen.
A reasonable batch size can be calculated, since the number of possible input vectors is small enough to be tractable.
If we allow for 4 decks in the shoe, each card can appear up to 16 times.
So the 3- and 4-dimensional autoencoders will have roughly $16^3 = 4096$ and $16^4 = 65536$ possible configurations, respectively.
To allow for this we will use batches of 100,000.
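
The counts quoted here are easy to verify. The sketch below follows the text's estimate of 16 levels per slot; `num_configurations` is a hypothetical helper name. Note that a batch of 100,000 random draws is larger than either count, but random sampling does not by itself guarantee that every configuration appears in a batch.

```python
def num_configurations(levels, n_slots):
    """Number of distinct vectors when each of n_slots takes one of `levels` values."""
    return levels ** n_slots


batch_size = 100_000
for n_slots in (3, 4):
    total = num_configurations(16, n_slots)
    print(n_slots, total, total <= batch_size)
# 3 4096 True
# 4 65536 True

# For comparison, the full 10-slot vector:
print(num_configurations(2, 10))   # 1024 (entries restricted to 0/1)
print(num_configurations(16, 10))  # 1099511627776 (16 levels per slot)
```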

In [19]:
def random_n_vector(n):
    spread = np.random.randint(0, 12, n)
    return np.abs(spread)

In [20]:
def train_n_auto(n_hidden, n_epoch, learning_rate, batch_size, num_batches, n_inputs):
    net = nntorch.NeuralNetwork(n_inputs, n_hidden, n_inputs, ['relu']*len(n_hidden), device='cuda')
    for b in range(num_batches):
        X = torch.from_numpy(np.vstack([random_n_vector(n_inputs) for _ in range(batch_size)])).float().to(net.device)
        T = X
        if b % 100 == 0:
            print('batch: ', b)
            error = net.use(X) - T.detach().cpu().numpy()
            print('RMSE: ', np.sqrt(np.mean(error**2)))
        net.fit(X, T, n_epoch, method='adam', learning_rate=learning_rate, verbose=False)
    return net

In [198]:
#Note this is incorrectly named, but to avoid retraining the network, I'm leaving it as is
auto_3_to_1 = train_n_auto([2], 200, .01, 100000, 1000, 3)

batch:  0
RMSE:  6.4573255
batch:  100
RMSE:  1.9946792
batch:  200
RMSE:  1.994986
batch:  300
RMSE:  1.9992826
batch:  400
RMSE:  1.9920868
batch:  500
RMSE:  1.9896367
batch:  600
RMSE:  1.9925561
batch:  700
RMSE:  1.9958317
batch:  800
RMSE:  1.9936018
batch:  900
RMSE:  1.993729


In [201]:
auto_4_2 = train_n_auto([4,3,2,3,4], 200, .01, 100000, 1000, 4)

batch:  0
RMSE:  6.401102
batch:  100
RMSE:  1.9265873
batch:  200
RMSE:  1.9319617
batch:  300
RMSE:  1.9015895
batch:  400
RMSE:  1.8568105
batch:  500
RMSE:  1.5665952
batch:  600
RMSE:  1.5576278
batch:  700
RMSE:  1.561325
batch:  800
RMSE:  1.559476
batch:  900
RMSE:  1.5583557


These smaller autoencoders are performing better than the large one.
I recognize here that the error in encoding may be incurred on each layer of encoding, so the reduction in error in these smaller ones may not be as significant as it seems.

In [21]:
def hand_vector_reduce_by_encoding(hand):
    ace = hand[0]
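    # NOTE: hand[-1:-4] is an empty slice in Python (the start lies after the
    # stop with a positive step), so only hand[1] is stacked by this line.
    face_through_8 = np.hstack([hand[1], hand[-1:-4]])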
    two_through_4 = hand[4:7]
    five_through_7 = hand[7:10]

    encoding_layer0 = auto_4_2.use_to_middle(torch.from_numpy(face_through_8).float().to(auto_4_2.device))
    encoding_layer1 = auto_3_to_1.use_to_middle(torch.from_numpy(two_through_4).float().to(auto_3_to_1.device))
    encoding_layer2 = auto_3_to_1.use_to_middle(torch.from_numpy(five_through_7).float().to(auto_3_to_1.device))

    full_encoding_layer = np.hstack([encoding_layer0, encoding_layer1, encoding_layer2])
    
    encoding_outs1 = auto_3_to_1.use_to_middle(full_encoding_layer[:3])
    encoding_outs2 = auto_3_to_1.use_to_middle(full_encoding_layer[3:])

    return np.hstack([ace, encoding_outs1, encoding_outs2])


In [22]:
class group_play_auto_runner(group_play_runner):
    def __init__(self, shoe, reinforcement_f, gamma):
        super().__init__(shoe, reinforcement_f, gamma)

    def get_input_vector(self, args):
        hand = args['hand']
        dealer_hand = args['dealer_hand']
        shoe = args['shoe']
        others_hand = args['others']
        player = make_hand_vector(hand)
        player = hand_vector_reduce_by_encoding(player)
        others = make_hand_vector(others_hand)
        others = hand_vector_reduce_by_encoding(others)
        dealer = make_dealer_hand_vector(dealer_hand)
        
        out = np.hstack([player, others, dealer, shoe.get_max_size() - len(shoe)])
        return out

    def get_signatures(self):
        reduced_hand = 4
        dealer_hand = 2
        shoe_information = 1
        ace = 1
        num_actions = 1
        return {
            "input_length": reduced_hand*2 + dealer_hand + shoe_information + ace*2 + num_actions,
            "output_length": 1
        }

In [229]:
runner = group_play_auto_runner(Shoe(1), reinforcement_group, 0.1)
runner.run([[10,10],[5],[5,10,5]], [200], [.01], 1, 8000)

[[10, 10], 200, 0.01, -0.794125]
[[5], 200, 0.01, -0.503]
[[5, 10, 5], 200, 0.01, -0.6075]


This is not much better than the regular Qnet, and is possibly slightly worse.
The Qnets with the autoencoders also took significantly longer to train.
We might expect a better result from this type of approach if the autoencoders could be tuned more precisely to reduce their error.

A potentially more fruitful approach would be to apply convolutional layers to the portions of the input that are meant to be grouped together.
Here the approach would simply be to take the `player hand` and `other players hand` and treat them independently for the first few layers.
We could then combine these outputs with the remaining parameters and feed the result into a fully connected portion.
In the interest of time, this venture is eschewed in this project.

### Playing a few hands with the best Qnet

To interpret the reinforcement values, we treat them the same as before: numbers closer to 1 are preferable.

In [231]:
runner = group_play_runner(Shoe(1), reinforcement_group, 0.1)
runner.run([[5,5]], [100], [.001], 1, 8000)

[[5, 5], 100, 0.001, -0.392875]


In [232]:
Qnet = runner.get_Qnet()

For this one we will have a similar-looking output.
We won't announce the other players' hands, since they don't matter to us.
We will, however, announce the player's hand and the dealer's hand.

In [23]:
shoe_group = Shoe(2)
other_hands = []


def play_group(Qnet, shoe, get_input_vector):
        player_hand = []
        dealer_hand = []

        n_others = get_num_of_other_players(shoe)

        deal(shoe, player_hand, other_hands, dealer_hand, n_others)
        print('After dealing the player has the hand: ', player_hand)

        for _ in range(n_others):
            other_players_play(other_hands, shoe)

        while True:
            state = get_input_vector({
                "hand": player_hand,
                "others": other_hands,
                "dealer_hand": dealer_hand,
                "shoe": shoe,
            })
            action = epsilon_greedy(Qnet, state, 0, player_hand)
            
            if action == 1:
                print('stand')
                print('Final score: ', hand_value(player_hand))
                break
        
            hit(player_hand, shoe)

            print('hit', player_hand)

            if bust(player_hand):
                print('bust')
                break

        dealers_play(dealer_hand, other_hands, shoe)
        print('Dealer hand: ', dealer_hand)
        print('Dealer score: ', hand_value(dealer_hand))


In [242]:
for _ in range(10):
    play_group(Qnet, shoe_group, runner.get_input_vector)
    print('------------------')

After dealing the player has the hand:  ['King', '2']
hit ['King', '2', '8']
hit ['King', '2', '8', 'Queen']
bust
Dealer hand:  ['Jack', 'King']
Dealer score:  20
------------------
After dealing the player has the hand:  ['King', '3']
stand
Final score:  13
Dealer hand:  ['5', '2', '7', '5']
Dealer score:  19
------------------
After dealing the player has the hand:  ['10', 'Jack']
stand
Final score:  20
Dealer hand:  ['Ace', '7']
Dealer score:  18
------------------
After dealing the player has the hand:  ['10', '4']
stand
Final score:  14
Dealer hand:  ['Queen', '3', 'Ace', '3']
Dealer score:  17
------------------
After dealing the player has the hand:  ['9', '10']
stand
Final score:  19
Dealer hand:  ['2', 'Queen', '10']
Dealer score:  22
------------------
After dealing the player has the hand:  ['9', 'King']
stand
Final score:  19
Dealer hand:  ['5', '9', '10']
Dealer score:  24
------------------
After dealing the player has the hand:  ['King', '10']
stand
Final score:  20
Deal

Here we see that the behavior of the model does not vary much from our solo blackjack model.

### Quick aside:

At this point it occurs to me that the runners were generating samples from only a single hand at a time, which likely led to suboptimal learning.
In response to this I adapted the solo runner to generate a batch of hands at a time.
After running some preliminary experiments with both the linear and nonlinear reward functions, the results did not appear to be significantly different.
For that reason I will leave the above section as is.

In the case of the group runner, below we will see a marked improvement in the results.
This is accomplished with the following slight alteration to the runner to allow for bigger sample batches.

In [24]:
class group_play_runner_exta_samples(group_play_runner):
    def get_samples(self):
        sample_hands = []
        for _ in range(20):
            sample_hands.append(super().get_samples())
        
        out = {
            "SA": [],
            "R": [],
            "Qnext": []
        }
        for key in out.keys():
            out[key] = np.vstack([sample[key] for sample in sample_hands])
        return out
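
The merge step in `get_samples` can be checked in isolation. The sketch below uses plain Python lists as a stand-in for the NumPy arrays and the `np.vstack` call used by the runner; `merge_samples` and the toy sample values are hypothetical.

```python
def merge_samples(sample_hands):
    """Concatenate per-hand sample dicts row by row, key by key."""
    keys = ("SA", "R", "Qnext")
    return {key: [row for hand in sample_hands for row in hand[key]]
            for key in keys}


# Two toy hands: the first produced two decisions, the second produced one.
hand_a = {"SA": [[1, 0], [1, 1]], "R": [[0.0], [1.0]], "Qnext": [[0.2], [0.0]]}
hand_b = {"SA": [[0, 1]], "R": [[-1.0]], "Qnext": [[0.0]]}

batch = merge_samples([hand_a, hand_b])
print(len(batch["SA"]), len(batch["R"]), len(batch["Qnext"]))  # 3 3 3
```

Every key ends up with the same number of rows, which is what the training step needs in order to pair each state-action row with its reward and next-state value.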
            

In [288]:
runner1 = group_play_runner_exta_samples(Shoe(1), reinforcement_group, 0.1)
runner1.run([[10,10],[5],[5,10,5]], [100, 50], [.01], 1, 8000)

[[10, 10], 100, 0.01, -0.529125]
[[10, 10], 50, 0.01, -0.319375]
[[5], 100, 0.01, -0.393625]
[[5], 50, 0.01, -0.228]
[[5, 10, 5], 100, 0.01, -0.445625]
[[5, 10, 5], 50, 0.01, -0.574125]


In [290]:
runner = group_play_runner_exta_samples(Shoe(1), reinforcement_group, 0.1)
runner.run([[5]], [50], [.01], 1, 8000)

[[5], 50, 0.01, -0.42475]


In [291]:
Qnet = runner.get_Qnet()
shoe_group = Shoe(2)
other_hands = []

for _ in range(10):
    play_group(Qnet, shoe_group, runner.get_input_vector)
    print('------------------')

After dealing the player has the hand:  ['5', '8']
stand
Final score:  13
Dealer hand:  ['Queen', '7']
Dealer score:  17
------------------
After dealing the player has the hand:  ['10', 'Jack']
stand
Final score:  20
Dealer hand:  ['Jack', '9']
Dealer score:  19
------------------
After dealing the player has the hand:  ['7', 'Queen']
stand
Final score:  17
Dealer hand:  ['6', 'Queen', '8']
Dealer score:  24
------------------
After dealing the player has the hand:  ['Ace', '5']
hit ['Ace', '5', '6']
hit ['Ace', '5', '6', 'King']
bust
Dealer hand:  ['Ace', 'Jack']
Dealer score:  21
------------------
After dealing the player has the hand:  ['King', 'Queen']
hit ['King', 'Queen', '4']
bust
Dealer hand:  ['Jack', '9']
Dealer score:  19
------------------
After dealing the player has the hand:  ['8', '2']
hit ['8', '2', '7']
hit ['8', '2', '7', '10']
bust
Dealer hand:  ['8', '7', '10']
Dealer score:  25
------------------
After dealing the player has the hand:  ['4', 'Jack']
hit ['4', 'J

We can make this alteration to the autoencoder runner as well.

In [25]:
class group_play_auto_extra_runner(group_play_auto_runner):
    def get_samples(self):
        sample_hands = []
        for _ in range(20):
            sample_hands.append(super().get_samples())
        
        out = {
            "SA": [],
            "R": [],
            "Qnext": []
        }
        for key in out.keys():
            out[key] = np.vstack([sample[key] for sample in sample_hands])
        return out

In [293]:
runner = group_play_auto_extra_runner(Shoe(1), reinforcement_group, 0.1)
runner.run([[5],[5,5]], [50], [.01], 1, 8000)

[[5], 50, 0.01, -0.379375]
[[5, 5], 50, 0.01, -0.436625]


In [294]:
qnet = runner.get_Qnet()

In [295]:
shoe_group = Shoe(2)
other_hands = []

for _ in range(10):
    play_group(qnet, shoe_group, runner.get_input_vector)
    print('------------------')

After dealing the player has the hand:  ['9', 'Queen']
stand
Final score:  19
Dealer hand:  ['6', 'Jack', 'King']
Dealer score:  26
------------------
After dealing the player has the hand:  ['9', 'Queen']
stand
Final score:  19
Dealer hand:  ['Ace', '10']
Dealer score:  21
------------------
After dealing the player has the hand:  ['5', '2']
stand
Final score:  7
Dealer hand:  ['8', '6', '10']
Dealer score:  24
------------------
After dealing the player has the hand:  ['5', '2']
stand
Final score:  7
Dealer hand:  ['3', 'Ace', '8', '3', 'Ace', 'Ace']
Dealer score:  17
------------------
After dealing the player has the hand:  ['Queen', '6']
stand
Final score:  16
Dealer hand:  ['Jack', 'Jack']
Dealer score:  20
------------------
After dealing the player has the hand:  ['3', 'Jack']
stand
Final score:  13
Dealer hand:  ['5', '4', '4', '4']
Dealer score:  17
------------------
After dealing the player has the hand:  ['King', '7']
stand
Final score:  17
Dealer hand:  ['7', '8', '5']
De

Here we see that the autoencoded version returns to the hyper-conservative strategy, which has been very hard to keep from emerging.
Still, the results are superior to those of the versions trained on a single hand at a time.
While only a few hands are analysed here, the expectation given this result is that the model has learned decent strategies in certain conditions, while in many others it has not.
Perhaps the better performance ratio results from averaging over circumstances in which its amount of knowledge varies.

### Betting

This portion of the project has been given substantial consideration.
In order to achieve the betting behavior, two approaches were considered.
The first was to train a separate network to watch the agent and place bets based on its behavior.
The second was to incorporate an extra output into the existing network.

While the two-network approach would reduce the internal complexity of the function approximation, it would substantially increase training time.
The single-network approach may have the advantage of giving information about the risk associated with a given hand.
While the running bet could be given to the player agent as an input even with a separate betting network, having all the information embedded in the weights of a single network may give an advantage in processing the meaning of this parameter.

With these considerations in mind, the single-network approach was chosen.
We will add an extra parameter to the input space, which will be the running bet.
On the initial deal, this value will be zero.
This is sensible because in reality the agent could not play without an ante of some kind, so a bet of zero unambiguously marks the state before betting.
On the first query of a round, the agent will be probed for a bet as a second output.
After this, the bet will remain as an input for all following queries in the round, and the possible actions will be adjusted to reflect the impossibility of changing that bet.

Again we will need to adapt our reward function for this new environment.
Two separate approaches will be compared.
1. The reward will be the final amount of money the agent has after several hands, and the agent will be trained to get the most winnings over several games.
2. The reward will be taken on a per-hand basis and will simply be the amount of money won or lost in that hand.

At this point, I do not have any reasonable way to predict which of these will be better.
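
To make the difference between the two reward definitions concrete, here is a minimal sketch of each as described in the list above. It is not the notebook's implementation: the function names and the sequence of net gains are hypothetical, chosen only to show that one approach yields a signal per hand while the other yields a single number for the whole session.

```python
def final_purse_reward(starting_purse, net_gains):
    """Approach 1: a single reward, the purse after all hands are played."""
    return starting_purse + sum(net_gains)


def per_hand_rewards(net_gains):
    """Approach 2: each hand's reward is its own net gain."""
    return list(net_gains)


# Hypothetical net gains for five hands (positive = win, negative = loss).
gains = [20, -40, 60, -20, 20]

print(final_purse_reward(1000, gains))  # 1040
print(per_hand_rewards(gains))          # [20, -40, 60, -20, 20]
```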


We have to adapt our epsilon greedy function to handle the larger action space.
Both action parameters are essentially orthogonal, because the net will never be asked to bet and move at the same time.
Therefore the valid actions are only checked along one axis at a time, depending on whether it is the first move.
The chosen action is placed into a vector of size 2 using the `get_effective_action` function.
This allows the epsilon greedy function to be agnostic to the action being queried.
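
The layout of the two-slot action vector can be illustrated with plain lists. This sketch is a stand-in for the NumPy version used in this notebook: the slot indices (0 for a move, 1 for a bet) and the use of 1 as the stand move follow the notebook's code, while the names `MOVE_SLOT`, `BET_SLOT` and `effective_action` are hypothetical.

```python
MOVE_SLOT = 0  # index used while a hand is being played
BET_SLOT = 1   # index used for the opening bet


def effective_action(value, slot):
    """Return a length-2 action vector with `value` in `slot` and 0 elsewhere."""
    action = [0.0, 0.0]
    action[slot] = float(value)
    return action


print(effective_action(60, BET_SLOT))  # [0.0, 60.0]  (opening bet of 60)
print(effective_action(1, MOVE_SLOT))  # [1.0, 0.0]   (stand)
```

Because only one slot is ever non-zero, the same greedy search can score candidate bets or candidate moves without knowing which of the two it is choosing.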

In [90]:
# The hand parameter is a dictionary here so that we can account for the purse
# and for whether it is time to bet.
def valid_betting_actions(hand):
    purse = hand['purse']
    first_move = hand['first_move']
    player_hand = hand['player_hand']

    if first_move:
        # Bets are multiples of 20, up to the whole purse.
        return list(range(20, int(purse + 1), 20))
    # Once the bet is placed, only the usual moves are valid.
    return valid_actions(player_hand)
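
The bet sizes this produces can be inspected directly. The sketch below isolates the `range` expression from the function above; `bet_ladder` is a hypothetical name used only for illustration.

```python
def bet_ladder(purse, step=20):
    """Bets from `step` up to the purse, in multiples of `step`."""
    return list(range(step, int(purse + 1), step))


print(bet_ladder(100))        # [20, 40, 60, 80, 100]
print(len(bet_ladder(1000)))  # 50
print(bet_ladder(10))         # []
```

A purse smaller than one step has no valid bet. Starting from 1000 with bets in multiples of 20 keeps the purse on a multiple of 20, so in that setting the empty case only arises when the purse is exactly zero.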

In [77]:
def get_effective_action(action_of_interest, location_of_action):
    action = np.zeros(2)
    action[location_of_action] = action_of_interest
    return action

# epsilon greedy needs to be adapted slightly to have multiple actions
def epsilon_greedy_multi(Qnet, state, epsilon, hand):

    first_move = hand['first_move']
    location_of_action = 0
    if first_move:
        location_of_action = 1

    actions = valid_betting_actions(hand)
    
    if np.random.uniform() < epsilon:
        # Random Move
        action_of_interest = np.random.choice(actions)
        action = get_effective_action(action_of_interest, location_of_action)
        
    else:
        # Greedy Move
        np.random.shuffle(actions)
        effective_actions = [get_effective_action(action_of_interest, location_of_action) for action_of_interest in actions]
        Qs = np.array([Qnet.use(stack_sa(state, a)) for a in effective_actions])
        action = actions[np.argmax(Qs)]
        action = get_effective_action(action, location_of_action)
        
    return action

To accommodate the two reinforcement approaches, we will use the following `get_bet_reinforcement` function, which returns a closure.
It can be given a string to specify whether we are using the "short term" or "long term" reinforcement.
We can then share the related code and choose the actual return value by fetching the appropriate function.

Note: the reinforcement functions produced here set the `new_purse` attribute of the given `runner` object.
This is useful because the calculation of this value is needed for gameplay flow control.
The runner may then reset its `purse` to the `new_purse` value when desired, provided the reinforcement was just calculated.

In [144]:
def win_loss(player_hand, dealer_hand):
    if bust(player_hand):
        return False
    elif bust(dealer_hand):
        return True
    else:
        return hand_value(player_hand) > hand_value(dealer_hand)
    

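def get_bet_reinforcement(strategy):
    def bet_reinforcement(runner):
        bet = runner.bet
        # Even-money payout: win the bet or lose it (win_loss treats a tie as a loss).
        if win_loss(runner.player_hand, runner.dealer_hand):
            net_gain = bet
        else:
            net_gain = -bet

        # Side effect: record the purse after this hand so the runner can adopt it.
        runner.new_purse = runner.purse + net_gain

        if strategy == 'short term':
            return net_gain
        elif strategy == 'long term':
            # This is the purse before this hand's result is applied.
            return runner.purse

    return bet_reinforcement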

In [None]:
reinforment_short = get_bet_reinforcement('short term')
reinforment_long = get_bet_reinforcement('long term')

As an extra consideration in the runner, we learned that generating several hands at a time was beneficial to the learning process.
A problem with this is that the final evaluation of the net can be skewed if the agent goes broke and is handed a full purse again before the end.
To fix this we will attempt 20 hands per training session but stop generating samples any time the agent goes broke.
Not only does this fix our evaluation metric, it also avoids deluding the network into thinking it can get free money.
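
The stopping rule can be sketched independently of the blackjack code. This is a simplified illustration under stated assumptions, not the runner itself: `collect_hands` and `always_lose` are hypothetical, and a "hand" is reduced to a function that returns a net gain.

```python
def collect_hands(play_hand, purse, max_hands=20):
    """Play up to `max_hands` hands, stopping early once the purse is empty.

    `play_hand` is a hypothetical callable that takes the current purse and
    returns the net gain of one hand.
    """
    gains = []
    for _ in range(max_hands):
        if purse <= 0:
            break  # broke: end the batch instead of refilling the purse mid-batch
        gain = play_hand(purse)
        purse += gain
        gains.append(gain)
    return gains, purse


def always_lose(purse):
    """Toy agent that bets 400 (or whatever is left) and loses every hand."""
    return -min(400, purse)


gains, final_purse = collect_hands(always_lose, 1000)
print(gains, final_purse)  # [-400, -400, -200] 0
```

The batch ends after three hands rather than twenty, so the evaluation reflects the agent going broke instead of averaging in hands played with a refilled purse.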

In [113]:
class bet_runner(group_play_runner):
    def __init__(self, shoe, reinforcement_f, gamma):
        super().__init__(shoe, reinforcement_f, gamma)
        self.bet = 0
        self.purse = 1000
        self.new_purse = 1000

    def get_input_vector(self, args):
        hand = args['hand']
        dealer_hand = args['dealer_hand']
        shoe = args['shoe']
        others_hand = args['others']
        bet = args['bet']
        current_purse = args['purse']

        player = make_hand_vector(hand)
        others = make_hand_vector(others_hand)
        dealer = make_dealer_hand_vector(dealer_hand)
        
        out = np.hstack([player, others, dealer, shoe.get_max_size() - len(shoe), bet, current_purse])
        return out

    def get_signatures(self):
        hand = 11
        dealer_hand = 2
        shoe_information = 1
        current_bet = 1
        purse = 1
        num_actions = 2

        return {
            "input_length": hand*2 + dealer_hand + shoe_information + current_bet + purse + num_actions,
            "output_length": 1
        }
    
    def get_bet(self):
        # The dealer and player hands are not known yet, so we pass empty lists
        # to get zero vectors of the right size. The running count is still
        # available through the other players' hand data.
        state = self.get_input_vector(
            {
                "hand": [],
                "others": self.other_hands,
                "dealer_hand": [],
                "shoe": self.shoe,
                "bet": self.bet,
                "purse": self.purse
            }
        )

        detailed_hand = {
            "player_hand": self.player_hand,
            "purse": self.purse,
            "first_move": True,
        }

        # The bet occupies slot 1 of the two-element action vector.
        bet = epsilon_greedy_multi(self.Qnet, state, self.epsilon, detailed_hand)
        return bet[1]


    def get_samples_hand(self):
        Samples = { "SA": [], "R": [], "Qnext": [] }
        self.player_hand = []
        self.dealer_hand = []

        if self.purse <= 0:
            self.purse = 1000
            self.shoe = Shoe(np.random.randint(1, 7))
            self.other_hands = []
            return {
                'samples': Samples,
                'broke': True
            }

        n_others = get_num_of_other_players(self.shoe)
        if is_shoe_done(self.shoe):
            #number of decks between 1 and 6
            self.shoe = Shoe(np.random.randint(1, 7))
            self.other_hands = []

        self.bet = self.get_bet() 

        deal(self.shoe, self.player_hand, self.other_hands, self.dealer_hand, n_others)

        for _ in range(n_others):
            other_players_play(self.other_hands, self.shoe)

        while True:
            state = self.get_input_vector({
                "hand": self.player_hand,
                "others": self.other_hands,
                "dealer_hand": self.dealer_hand,
                "shoe": self.shoe,
                "bet": self.bet,
                "purse": self.purse
            })

            detailed_hand = {
                "player_hand": self.player_hand,
                "purse": self.purse,
                "first_move": False
            }

            action = epsilon_greedy_multi(self.Qnet, state, self.epsilon, detailed_hand)
            
            if action[0] == 1:
                reward = self.reinforcement_f(self)
                Samples['SA'].append(stack_sa(state, action))
                Samples['R'].append(reward)
                Samples['Qnext'].append(self.Qnet.use(stack_sa(state, action)))
                break
        
            hit(self.player_hand, self.shoe)

            reward = self.reinforcement_f(self)
            Samples['SA'].append(stack_sa(state, action))
            Samples['R'].append(reward)
            Samples['Qnext'].append(self.Qnet.use(stack_sa(state, action)))
            if bust(self.player_hand):
                break

        dealers_play(self.dealer_hand, self.other_hands, self.shoe)

        reward = self.reinforcement_f(self)
        self.purse = self.new_purse
        Samples['R'][-1] = reward
        Samples['Qnext'][-1] = self.Qnet.use(stack_sa(state, [-1, -1]))
        
        Samples['SA'] = np.vstack(Samples['SA'])
        Samples['R'] = np.array(Samples['R']).reshape(-1, 1)
        Samples['Qnext'] = np.array(Samples['Qnext']).reshape(-1, 1)
        return {
            'samples': Samples,
            'broke': False
        }
    
    def get_samples(self):
        sample_hands = []
        for _ in range(20):
            sample_hand = self.get_samples_hand()
            if sample_hand['broke']:
                if sample_hands:
                    break
                # Broke before any hand was collected (the purse was just
                # reset), so keep going: np.vstack fails on an empty list.
                continue
            sample_hands.append(sample_hand['samples'])
        
        out = {
            "SA": [],
            "R": [],
            "Qnext": []
        }
        for key in out.keys():
            out[key] = np.vstack([sample[key] for sample in sample_hands])
        return out
        

### Training the betting Qnet

*One Hand Reinforcement*

For starters, let's focus on the one-hand-at-a-time reinforcement function.
The number of rounds is decreased here from previous experiments because the loop now generates several hands per round.
This works out to around the same number of games in a training session.

To evaluate the output, we interpret the reinforcement as the amount of money made on average per hand.
Ideally this would be positive, indicating that the agent is making money.
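
As a small illustration of how this metric reads, the sketch below averages a list of per-hand net gains. The helper name and the four results are hypothetical and are not taken from any run in this notebook.

```python
def mean_gain_per_hand(net_gains):
    """Average net gain over a list of per-hand results."""
    return sum(net_gains) / len(net_gains)


# Hypothetical results for four hands with a fixed bet of 20: one win, three losses.
print(mean_gain_per_hand([20, -20, -20, -20]))  # -10.0
```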

In [114]:
runner = bet_runner(Shoe(6), reinforment_short, .6)
runner.run([[5],[5,5], [10,5,10]], [50,30], [.01], 1, 450)

[[5], 50, 0.01, -152.88888888888889]
[[5], 30, 0.01, -248.93333333333334]
[[5, 5], 50, 0.01, -106.35555555555555]
[[5, 5], 30, 0.01, -242.44444444444446]
[[10, 5, 10], 50, 0.01, -294.8]
[[10, 5, 10], 30, 0.01, -228.62222222222223]


The Qnet is definitely not performing in a way that someone would want to use in Vegas.
On average it loses money every hand.
For two of the three architectures the higher epoch count did better, so I will do a few more experiments to try to get a better result.

In [115]:
runner = bet_runner(Shoe(6), reinforment_short, .6)
runner.run([[5],[5,5], [10], [10,10]], [50,80], [.01], 1, 450)

[[5], 50, 0.01, -235.37777777777777]
[[5], 80, 0.01, -272.0]
[[5, 5], 50, 0.01, -260.0]
[[5, 5], 80, 0.01, -217.15555555555557]
[[10], 50, 0.01, -346.5777777777778]
[[10], 80, 0.01, -198.22222222222223]
[[10, 10], 50, 0.01, -445.06666666666666]
[[10, 10], 80, 0.01, -161.55555555555554]


The previous best did not perform as well this time.
Still, higher epochs seem to be better for three of the four architectures.
I will try a few more experiments with higher epochs using the [5,5] architecture, since it has been the most consistent.

In [117]:
runner = bet_runner(Shoe(6), reinforment_short, .6)
runner.run([[5,5]], [80, 100, 150, 200], [.01], 1, 450)

[[5, 5], 80, 0.01, -264.1333333333333]
[[5, 5], 100, 0.01, -234.48888888888888]
[[5, 5], 150, 0.01, -191.6]
[[5, 5], 200, 0.01, -208.35555555555555]


The larger epoch systems seem to work better.
Since the [10,10] architecture worked better with more epochs too, we will do a quick experiment to consider this.

In [118]:
runner = bet_runner(Shoe(6), reinforment_short, .6)
runner.run([[10,10]], [150, 200, 300, 500], [.01], 1, 450)

[[10, 10], 150, 0.01, -286.5777777777778]
[[10, 10], 200, 0.01, -186.53333333333333]
[[10, 10], 300, 0.01, -209.06666666666666]
[[10, 10], 500, 0.01, -143.95555555555555]


Ultimately, we don't see much difference in these results versus the other architectures.
For simplicity we will stick with our best performer.
This is [10,10] with 500 epochs.
We can use this as a basis for our next experiments.

*Multi Hand Reinforcement*

Next we'll look at how the Qnets perform with reinforcement based on final purse size.
We can interpret the evaluation metric as the average amount of money the agent has at the end of the game.
Since the agent starts with 1000, anything above 1000 is a profit.

In [119]:
runner = bet_runner(Shoe(6), reinforment_long, .6)
runner.run([[10,10],[20,20],[10,8,10]], [500,200], [.01], 1, 450)

[[10, 10], 500, 0.01, 752.8888888888889]
[[10, 10], 200, 0.01, 696.9333333333333]
[[20, 20], 500, 0.01, 644.9333333333333]
[[20, 20], 200, 0.01, 707.2]
[[10, 8, 10], 500, 0.01, 796.0]
[[10, 8, 10], 200, 0.01, 551.4222222222222]
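
Read against the starting purse of 1000, the figures above convert to an average profit or loss as follows. The values are copied from the output above; `STARTING_PURSE` and the dictionary layout are only illustrative.

```python
STARTING_PURSE = 1000

# (architecture, epochs) -> average final purse, copied from the run above.
final_purses = {
    ("[10, 10]", 500): 752.8888888888889,
    ("[10, 10]", 200): 696.9333333333333,
    ("[20, 20]", 500): 644.9333333333333,
    ("[20, 20]", 200): 707.2,
    ("[10, 8, 10]", 500): 796.0,
    ("[10, 8, 10]", 200): 551.4222222222222,
}

for (arch, epochs), purse in final_purses.items():
    print(arch, epochs, round(purse - STARTING_PURSE, 2))

best = max(final_purses, key=final_purses.get)
print(best)  # ('[10, 8, 10]', 500)
```

Every configuration ends below the starting purse on average, so none is profitable under this metric; the [10, 8, 10] network with 500 epochs loses the least.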


### Playing a few hands with the betting Qnet

As before, we can code up a function to play some hands with the betting Qnet.

In [138]:
def play_betting(Qnet, game_runner):
        game_runner.player_hand = []
        game_runner.dealer_hand = []

        if game_runner.purse <= 0:
            print('Player is broke')
            game_runner.purse = 1000
            game_runner.shoe = Shoe(np.random.randint(1, 7))
            game_runner.other_hands = []

        n_others = get_num_of_other_players(game_runner.shoe)
        if is_shoe_done(game_runner.shoe):
            #number of decks between 1 and 6
            game_runner.shoe = Shoe(np.random.randint(1, 7))
            game_runner.other_hands = []

        game_runner.bet = game_runner.get_bet() 
        print('Player purse: ', game_runner.purse, ' this round')
        print('Player bet: ', game_runner.bet, ' this round')

        deal(game_runner.shoe, game_runner.player_hand, game_runner.other_hands, game_runner.dealer_hand, n_others)
        print('Player initial hand: ', game_runner.player_hand)
        print('Dealer initial hand: ', game_runner.dealer_hand)

        for _ in range(n_others):
            other_players_play(game_runner.other_hands, game_runner.shoe)

        while True:
            state = game_runner.get_input_vector({
                "hand": game_runner.player_hand,
                "others": game_runner.other_hands,
                "dealer_hand": game_runner.dealer_hand,
                "shoe": game_runner.shoe,
                "bet": game_runner.bet,
                "purse": game_runner.purse
            })

            detailed_hand = {
                "player_hand": game_runner.player_hand,
                "purse": game_runner.purse,
                "first_move": False
            }

            action = epsilon_greedy_multi(game_runner.Qnet, state, 0, detailed_hand)
            
            if action[0] == 1:
                print('Player stands')
                break
        
            hit(game_runner.player_hand, game_runner.shoe)
            print('Player hand: ', game_runner.player_hand)

            if bust(game_runner.player_hand):
                print('Player busts')
                break

        dealers_play(game_runner.dealer_hand, game_runner.other_hands, game_runner.shoe)
        print('Final dealer hand: ', game_runner.dealer_hand)

        game_runner.reinforcement_f(game_runner)
        game_runner.purse = game_runner.new_purse
        print('Player purse: ', game_runner.purse, ' after round')
        print('--------------------------------------')

We will train our chosen Qnets with more rounds to try to improve the result.

*One Hand Reinforcement*

In [149]:
runner = bet_runner(Shoe(6), reinforment_short, .6)
runner.run([[10,10]], [500], [.01], 1, 900)

[[10, 10], 500, 0.01, -610.0888888888888]


In [150]:
short_qnet = runner.Qnet
for _ in range(10):
    play_betting(short_qnet, runner)

Player purse:  1000  this round
Player bet:  1000.0  this round
Player initial hand:  ['7', '4']
Dealer initial hand:  ['5', '5']
Player stands
Final dealer hand:  ['5', '5', 'Jack']
0.0
Player purse:  0.0  after round
--------------------------------------
Player is broke
Player purse:  1000  this round
Player bet:  600.0  this round
Player initial hand:  ['Jack', 'Jack']
Dealer initial hand:  ['10', '6']
Player stands
Final dealer hand:  ['10', '6', '8']
1600.0
Player purse:  1600.0  after round
--------------------------------------
Player purse:  1600.0  this round
Player bet:  1600.0  this round
Player initial hand:  ['5', '10']
Dealer initial hand:  ['3', 'Jack']
Player hand:  ['5', '10', '9']
Player busts
Final dealer hand:  ['3', 'Jack', 'Queen']
0.0
Player purse:  0.0  after round
--------------------------------------
Player is broke
Player purse:  1000  this round
Player bet:  100.0  this round
Player initial hand:  ['5', 'Ace']
Dealer initial hand:  ['5', 'Ace']
Player stan

The short term reinforcement turns out to produce an overly aggressive bettor.
The agent does not seem to see a problem with going all in, or betting most of its money, on any hand.
As we can see clearly here, this is a horrible strategy.
One hypothesis is that the short term reinforcement motivates the agent to maximize its potential earnings on each hand, with no regard for the hands that follow.
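
One way to make this hypothesis concrete: with an even-money payout, the expected gain of a single hand is linear in the bet, so an agent that scores each bet by its expected immediate gain prefers the largest bet whenever it believes its chance of winning exceeds one half. The sketch below only illustrates that arithmetic and does not describe what the trained network actually computes; `expected_gain`, `greedy_bet` and the probabilities are hypothetical.

```python
def expected_gain(p_win, bet):
    """Expected net gain of one even-money hand: win +bet, otherwise -bet."""
    return p_win * bet - (1 - p_win) * bet


def greedy_bet(p_win, bets):
    """Bet with the highest expected one-hand gain."""
    return max(bets, key=lambda b: expected_gain(p_win, b))


bets = list(range(20, 1001, 20))
print(greedy_bet(0.55, bets))  # 1000
print(greedy_bet(0.45, bets))  # 20
```

Under this view there is no middle ground: a per-hand objective pushes the bet to one extreme or the other, and nothing in it accounts for the cost of going broke.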

*Long Term Reinforcement*

In [152]:
runner_l = bet_runner(Shoe(6), reinforment_long, .6)
runner_l.run([[10,8,10]], [200], [.01], 1, 900)

[[10, 8, 10], 200, 0.01, 716.4888888888889]


In [153]:
long_qnet = runner_l.Qnet
for _ in range(10):
    play_betting(long_qnet, runner_l)

Player purse:  1000  this round
Player bet:  20.0  this round
Player initial hand:  ['3', '9']
Dealer initial hand:  ['8', 'Jack']
Player hand:  ['3', '9', '10']
Player busts
Final dealer hand:  ['8', 'Jack']
980.0
Player purse:  980.0  after round
--------------------------------------
Player purse:  980.0  this round
Player bet:  980.0  this round
Player initial hand:  ['Queen', 'Jack']
Dealer initial hand:  ['8', '5']
Player hand:  ['Queen', 'Jack', 'Ace']
Player hand:  ['Queen', 'Jack', 'Ace', 'Jack']
Player busts
Final dealer hand:  ['8', '5', 'King']
0.0
Player purse:  0.0  after round
--------------------------------------
Player is broke
Player purse:  1000  this round
Player bet:  320.0  this round
Player initial hand:  ['10', 'Ace']
Dealer initial hand:  ['10', 'Jack']
Player stands
Final dealer hand:  ['10', 'Jack']
1320.0
Player purse:  1320.0  after round
--------------------------------------
Player purse:  1320.0  this round
Player bet:  1280.0  this round
Player initial

Here we see that the long-term reinforcement does ease the overbetting problem.
This fits the hypothesis about why the short-term reinforcement was overzealous.
The long-term reinforcement has the opportunity to see how actions in one hand affect outcomes down the road.
Even so, the agent still makes large, questionable bets, such as wagering its entire purse of 980 in the second hand shown.
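
The difference between the two views of reward can be sketched with a standard discounted return. This is an illustration only, not necessarily how `reinforment_long` is defined; the function names and the `gamma` parameter are assumptions. Each hand's reward is its change in purse, and the long-term view credits a hand with its own reward plus the discounted rewards of the hands that follow it.

```python
def per_hand_rewards(purse_history):
    """Short-term view: each hand is credited only with its own change in purse."""
    return [after - before for before, after in zip(purse_history, purse_history[1:])]


def returns_to_go(purse_history, gamma=1.0):
    """Long-term view: each hand is credited with its own reward plus the
    discounted rewards of all later hands in the same episode."""
    returns = []
    running = 0.0
    # Walk backwards so each return can reuse the one that follows it.
    for reward in reversed(per_hand_rewards(purse_history)):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Purse trajectory from the one-hand run above: 1000 -> 1600 -> 0.
history = [1000.0, 1600.0, 0.0]
print(per_hand_rewards(history))  # [600.0, -1600.0]
print(returns_to_go(history))     # [-1000.0, -1600.0]
```

In isolation the winning 600 bet looks good, but once later hands are counted it is charged with the all-in loss that followed, which is the kind of signal that could discourage overbetting.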

### Conclusion

We have analyzed reinforcement learning for the game of blackjack with various approaches.
In general, none of the agents were able to play in a way that would be advisable to adopt in a casino.
Several factors likely contributed to this.
The input space is large and varied, so many games must be played before the network sees every state.
Beyond that, the random nature of the game means the agents must see each state many times to accurately assess the situation it presents.
I expect that with significantly more resources for training, the networks presented here would see a marked increase in performance.
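
A rough back-of-the-envelope calculation shows why so many hands are needed. The sketch below uses the normal approximation for the error of a sample mean; it assumes per-hand outcomes of at most one betting unit in either direction (so a standard deviation of at most 1), and the state count passed in is purely illustrative, not the size of the input space used in this project. The function names are hypothetical.

```python
import math


def visits_needed(tolerance, outcome_std=1.0, z=1.96):
    """Visits to one state so that its mean outcome is within `tolerance`
    of the true value at roughly 95% confidence (normal approximation)."""
    return math.ceil((z * outcome_std / tolerance) ** 2)


def hands_needed(n_states, tolerance, outcome_std=1.0, z=1.96):
    """Optimistic lower bound: assumes every state is visited equally often."""
    return n_states * visits_needed(tolerance, outcome_std, z)


# Illustrative only: 200 states, values estimated to within 0.1 betting units.
print(visits_needed(0.1))
print(hands_needed(200, 0.1))
```

Because the required visits scale with the inverse square of the tolerance, halving the acceptable error quadruples the training needed, and real play visits states far from uniformly.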

Further fine-tuning could come from several directions as well.
With more resources, more extensive tuning of hyperparameters could be done.
Further, the networks have been shown to be very sensitive to the reward function.
Designing more thoughtful reward functions could also lead to more profitable play.
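
As one example of what a different reward might look like, the sketch below rewards the logarithm of the purse growth factor instead of the raw winnings. This is a candidate idea only; it was not implemented or evaluated in this project, and the function name and the `floor` parameter are assumptions.

```python
import math


def log_growth_reward(purse_before, purse_after, floor=1.0):
    """Reward = log of the factor by which the purse changed in one hand.

    `floor` is a hypothetical lower clamp that keeps the logarithm finite
    when the purse reaches zero.
    """
    return math.log(max(purse_after, floor) / max(purse_before, floor))


print(log_growth_reward(1000, 1600))  # winning a 600 bet
print(log_growth_reward(1000, 0))     # losing an all-in bet
```

Under this reward, losing the whole purse is penalized far more heavily than doubling it is rewarded, so an all-in bet is no longer attractive on a one-hand basis.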

Despite the apparent limitations of this approach, it is clear that the agents were learning something.
The strategies they adopted ranged from overly aggressive to overly conservative, but the agents sometimes made reasonable decisions.

The effects of dimensionality reduction were also considered.
These approaches turned out to perform poorly and to vastly increase training time.
The autoencoders used were not perfect and could easily be improved, as they could be made to converge with complete information.
While the use of CNNs was not explored here in practice, the hypothetical architectures that were presented could serve as a starting point for future improvement.

Finally, the goal of comparing the networks to traditional card counting methods was not achieved. Given the poor performance of the agents, this comparison did not seem necessary: watching the agents' gameplay clearly demonstrated the shortcomings in their strategies.