This was written using Python 3.11.3 as kernel. 
Installation of the Open AI Gym package is necessary (run "pip install gym; pip install gym[atari]").

In [1]:
import numpy as np
import gym
import pickle #allows to serialize and deserialize Python objects.

In [2]:
env = gym.make('Blackjack-v1')

Test the environment:
The next two cells can be used to manually play the game. env.reset() resets the game and gives a tuple containing the player hand, the dealers card up, and if the player has a usable ace. env.step Lets you take an action. Using it with argument '1' makes you hit a new card, argument '0' makes you stick. It's output is the same tuple as env.reset(), followed by the reward for last action, i.e. -1 for an action that made you lose (busting or sticking but having less than the dealer), 0 if you tie after sticking or not busting after hitting, and +1 for winning (sticking without busting but still having more than the dealer).

In [3]:
env.reset() #start new game

((12, 8, False), {})

In [4]:
env.step(1)

  if not isinstance(terminated, (bool, np.bool8)):


((14, 8, False), 0.0, False, False, {})

From here on, the Reinforcement Learning (RL) agent is implemented.

In [5]:
# Define the state space
player_sum_space = range(4, 22)  # possible player hand values (4, 5, ..., 20, 21)
dealer_card_space = range(1, 11)  # possible dealer up card values (2, 3, ..., 10, 11)
usable_ace_space = [False, True]  # whether the player has a usable ace

state_space = []
for player_sum in player_sum_space:
    for dealer_card in dealer_card_space:
        for usable_ace in usable_ace_space:
            state_space.append((player_sum, dealer_card, usable_ace))

# This creates an array of length 306, which is the number of possible states in the game of blackjack. 
# Each state is a tuple of the player’s sum, the dealer’s card, and whether the player has a usable ace.

Pre trainings phase to make sure our q table isn't empty before starting the real training of our model.
The decicions aren't made using the Q table, but he will hit or stick at random.

In [8]:

# Define the action space
action_space = [0, 1]  # hit or stick
# Get the size of the state and action spaces
num_states = len(state_space)
num_actions = len(action_space)

# Initialize the Q-table with initial values. 
Q_table = np.zeros((num_states, num_actions))

for i in range(100000): #filling the
    # Initialize the state
    state = env.reset()[0]
    done = False

    while not done:
        # Choose an action
        action = np.random.choice(action_space)

        # Take the action
        next_state, reward, done, terminal, dic = env.step(action)

        # Update Q-table
        Q_table[state_space.index(state)][action] += reward

        # Update state
        state = next_state

Real training loop using the initial Q-table made in the previous part.

In [None]:

# Hyperparameters
num_episodes = 1000000  # Total number of episodes
alpha = 0.05  # Learning rate
gamma = 0.99  # Discount factor
epsilon = 0.1  # Epsilon-greedy parameter

# Training loop
for episode in range(num_episodes):
    state = env.reset()[0]
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = np.random.choice(action_space)
        else:
            action = np.argmax(state_space.index(state))

        # Take the action
        next_state, reward, done, terminal, dic = env.step(action)

        # Update Q-value if you busted, necessary because Q_table(next_state) does not exist if you bust
        if next_state not in state_space:
            Q_table[state_space.index(state)][action] += alpha * (reward - Q_table[state_space.index(state)][action])
            break

        # Update Q-value for current state-action pair
        Q_table[state_space.index(state)][action] += alpha * (reward + gamma * np.max(Q_table[state_space.index(next_state)]) - Q_table[state_space.index(state)][action])

        state = next_state

In [None]:
# Save the Q-table to a file using pickle
#filename = f"episodes{num_episodes}_q_table_alpha{alpha}_gamma{gamma}_epsilon{epsilon}.pkl"
#with open(filename, 'wb') as f:
#    pickle.dump(Q, f)

# Load the Q-table from a file using pickle
#with open('q_table.pkl', 'rb') as f:
#    Q_table = pickle.load(f)

In [None]:
# Make the agent play 10000 games and check winnning rate
num_games = 100000
num_wins = 0
num_draws = 0
num_losses = 0

for i in range(num_games):
    state = env.reset()[0]
    done = False
    while not done:
        action = np.argmax(Q_table[state_space.index(state)])
        next_state, reward, done, terminal, dic = env.step(action)
        state = next_state
        if done and reward == 1:
            num_wins += 1
        elif done and reward == 0:
            num_draws += 1
        elif done and reward == -1:
            num_losses += 1

In [None]:
num_wins/(num_games-num_draws)

0.4318488237895441

Winrate random action policy => Nothing to do with the agent playing a game. 
This is just as a reference

In [None]:
# Check winnning rate of random action policy

num_games = 100000
num_wins = 0
num_draws = 0
num_losses = 0

for i in range(num_games):
    state = env.reset()[0]
    done = False
    while not done:
        action = np.random.choice(action_space)
        next_state, reward, done, terminal, dic = env.step(action)
        state = next_state
        if done and reward == 1:
            num_wins += 1
        elif done and reward == 0:
            num_draws += 1
        elif done and reward == -1:
            num_losses += 1

In [None]:
num_wins/(num_games-num_draws)

0.29474606685306515