# BLACKJACK NAIVE AGENT


In this notebook we will implement a naive agent for the Blackjack environment. The agent will be able to play Blackjack, but it will not learn anything. We will use this agent to understand the Blackjack environment and to test our implementation of the environment. Its policy will be to stick if the sum of the cards is 20 or 21, and to hit otherwise.

In [1]:
#Uncomment the following lines to install the required packages

'''
%pip install wandb
%pip install matplotlib
%pip install numpy
%pip install tqdm
%matplotlib inline
%pip install gymnasium==0.29.1
'''


In [4]:
#Importing the required packages

import matplotlib.pyplot as plt
from tqdm import tqdm
import gymnasium as gym

Let's first create the environment.
We'll use Gymnasium's Blackjack environment with natural blackjacks enabled, so the settings won't follow the approach in Sutton & Barto's book.

In [5]:
env = gym.make('Blackjack-v1', sab=False, natural=True, render_mode='rgb_array') #We are not following the Sutton & Barto book settings (sab=True); natural=True pays 1.5 for a win with a natural blackjack, and render_mode='rgb_array' makes env.render() return an image array

### Understanding and Observing the Environment

In [6]:
#observation space is a tuple of 3 elements:
#1. player's current sum (1-31)
#2. dealer's face up card (1-10)
#3. whether or not the player has a usable ace (0 or 1)

done = False
observation, info = env.reset() #get the first observation
print("Observation space:", env.observation_space)
print("\nAction space:", env.action_space) #0: stick, 1: hit
print("\nObservation:", observation) #observation[0] is player's current sum, observation[1] is dealer's face up card, observation[2] is whether or not the player has a usable ace
print("\n info:", info)



Observation space: Tuple(Discrete(32), Discrete(11), Discrete(2))

Action space: Discrete(2)

Observation: (7, 2, 0)

 info: {}
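To make the three fields easier to read, the observation tuple can be unpacked into named fields. The following is a minimal sketch that needs no environment: `BlackjackObs` and `parse_observation` are hypothetical helper names introduced here for illustration, and the sample tuple is the observation printed above.

```python
from typing import NamedTuple


class BlackjackObs(NamedTuple):
    """Named view of the 3-element Blackjack observation (hypothetical helper)."""
    player_sum: int    # observation[0]: player's current sum
    dealer_card: int   # observation[1]: dealer's face up card (1 = ace)
    usable_ace: bool   # observation[2]: whether the player has a usable ace


def parse_observation(obs):
    """Convert a raw (sum, dealer card, usable ace) tuple into named fields."""
    player_sum, dealer_card, usable_ace = obs
    return BlackjackObs(int(player_sum), int(dealer_card), bool(usable_ace))


parsed = parse_observation((7, 2, 0))
print(parsed)  # BlackjackObs(player_sum=7, dealer_card=2, usable_ace=False)
```

Python indexes from 0, so the player's sum is `observation[0]`, not `observation[1]`.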


### Now let's see what happens when the agent takes a step

**env.step(action)** returns: observation, reward, terminated, truncated, info

**observation**: tuple of 3 elements (player's current sum, dealer's face up card, whether or not the player has a usable ace)

**reward**: +1 for a win, 0 for a draw, -1 for a loss, and +1.5 if the player wins with a natural blackjack (because we set natural=True). While the episode is still running the reward is 0.

**terminated**: boolean (True if the episode is over)

**truncated**: boolean (True if the episode is over because it reached the maximum number of steps)

**info**: dictionary with additional information. We will not use this.
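The values returned by a step can be read as in the sketch below. `describe_step` is a hypothetical helper, not part of Gymnasium, and it assumes the reward scheme listed above (+1.5, +1, 0, -1). The point is that a reward of 0 only means a draw once the episode has ended.

```python
def describe_step(reward, terminated, truncated):
    """Return a short label for the result of one env.step call.

    Hypothetical helper: assumes rewards of +1.5 (natural win), +1 (win),
    0 (draw or episode still running) and -1 (loss).
    """
    if not (terminated or truncated):
        return "in progress"   # reward is 0 here, but the game is not over
    if reward == 1.5:
        return "natural win"
    if reward > 0:
        return "win"
    if reward < 0:
        return "loss"
    return "draw"


print(describe_step(0.0, False, False))  # in progress
print(describe_step(-1.0, True, False))  # loss
print(describe_step(1.0, True, False))   # win
```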

In [7]:
#sample random actions from the action space
print("Random actions:")
for i in range(5):
    env.reset() # reset the environment at the beginning of each iteration
    action = env.action_space.sample()
    print("Action:", action)
    observation, reward, terminated, truncated, info = env.step(action) #take a random action and observe the results of the action taken
    print("Observation:", observation) #observation[0] is player's current sum, observation[1] is dealer's face up card, observation[2] is whether or not the player has a usable ace
    print("Reward:", reward) #reward is 1 if the player wins, 1.5 if the player wins with a natural blackjack (a usable ace and a 10), -1 if the player loses, and 0 if the game is a draw or still in progress
    print("Terminated:", terminated)
    print("Truncated:", truncated)
    print("info:", info)
    print("")



Random actions:
Action: 1
Observation: (19, 10, 0)
Reward: 0.0
Terminated: False
Truncated: False
info: {}

Action: 1
Observation: (23, 7, 0)
Reward: -1.0
Terminated: True
Truncated: False
info: {}

Action: 0
Observation: (7, 10, 0)
Reward: -1.0
Terminated: True
Truncated: False
info: {}

Action: 0
Observation: (20, 10, 0)
Reward: 1.0
Terminated: True
Truncated: False
info: {}

Action: 1
Observation: (11, 6, 0)
Reward: 0.0
Terminated: False
Truncated: False
info: {}



Let's create a simple agent. Its policy is very naive: if its own sum is 20 or more, it sticks with its cards; otherwise it hits for another card.

In [8]:
class NaiveBlackjackAgent:
    def __init__(self):
        pass

    def play(self, obs):
        return 0 if obs[0] >= 20 else 1 #stick if player's current sum is 20 or more, else hit
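The same rule can be checked by hand on a few observations without running the environment. The sketch below restates the policy as a standalone function; `naive_policy` and its `stick_threshold` parameter are hypothetical names added for illustration, with the threshold of 20 taken from the agent above.

```python
STICK, HIT = 0, 1  # action encoding used by the environment: 0 = stick, 1 = hit


def naive_policy(obs, stick_threshold=20):
    """Stick once the player's sum reaches the threshold, otherwise hit."""
    player_sum = obs[0]
    return STICK if player_sum >= stick_threshold else HIT


# Hand-written observations: (player sum, dealer card, usable ace)
examples = [(7, 2, 0), (19, 10, 0), (20, 10, 0), (21, 6, 1)]
actions = [naive_policy(obs) for obs in examples]
print(actions)  # [1, 1, 0, 0]
```

The policy ignores the dealer's card and the usable ace entirely, which is what makes it naive.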


Now we will evaluate the agent

In [9]:
#defining the hyperparameters
n_episodes = 100

#initialize the agent
agent = NaiveBlackjackAgent()


In [10]:
from IPython.display import clear_output
import wandb
import pygame

# Initialize wandb
wandb.init(project="blackjack_naive_100000", entity="ai42")
pygame.init()


n_episodes = 100000  # Define the number of episodes you want to run


# Outcome counters; they are divided by n_episodes after the loop to obtain rates
win_rate = 0.0
loss_rate = 0.0
draw_rate = 0.0
natural_rate = 0.0

for episode in tqdm(range(n_episodes)):
    obs, info = env.reset()
    terminated, truncated = False, False
    clear_output(wait=True)
    step = 0
    episode_rewards = 0  # Initialize total rewards for the episode

    while not terminated and not truncated:
        action = agent.play(obs)  # Agent's policy
        obs, reward, terminated, truncated, info = env.step(action)

        # Render the current state as an RGB array (requires render_mode='rgb_array')
        frame = env.render()
        step += 1
        episode_rewards += reward  # Accumulate rewards

        # Plot frame
        plt.imshow(frame)
        plt.axis('off')
        plt.title(f"Episode: {episode}, Step: {step}")
        plt.savefig('frame.png')
        plt.close()

        # Log the frame and rewards to wandb
        wandb.log({
            "episode": episode,
            "step": step,
            "frame": wandb.Image('frame.png'),
            "reward": reward,
            "cumulative_reward": episode_rewards
        })
    if reward == 1 or reward == 1.5:
        win_rate += 1
    elif reward == -1:
        loss_rate += 1
    elif reward == 0:
        draw_rate += 1
    if reward == 1.5:
        natural_rate += 1


env.close()

# Let's log the overall statistics of the evaluation (the agent does not train)
wandb.log({"Win_rate": win_rate / n_episodes, "Loss_rate": loss_rate / n_episodes, "Draw_rate": draw_rate / n_episodes, "Natural_win_rate": natural_rate / n_episodes}) # Log the episode statistics to wandb



  2%|▏         | 2052/100000 [03:02<1:32:01, 17.74it/s]
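The win, loss, draw and natural rates computed above can also be derived from a plain list of final episode rewards. The sketch below shows that bookkeeping in isolation: `summarize_outcomes` is a hypothetical helper, and the reward list is made-up illustrative data, not results from this run.

```python
from collections import Counter


def summarize_outcomes(final_rewards):
    """Turn a list of final episode rewards into outcome rates.

    Hypothetical helper: assumes rewards of +1.5, +1, 0 and -1.
    A natural win (+1.5) counts both as a win and as a natural.
    """
    n = len(final_rewards)
    if n == 0:
        raise ValueError("final_rewards must not be empty")
    counts = Counter()
    for r in final_rewards:
        if r > 0:
            counts["win"] += 1
        elif r < 0:
            counts["loss"] += 1
        else:
            counts["draw"] += 1
        if r == 1.5:
            counts["natural"] += 1
    return {key: counts[key] / n for key in ("win", "loss", "draw", "natural")}


# Made-up rewards for eight episodes, for illustration only
stats = summarize_outcomes([1.0, -1.0, -1.0, 0.0, 1.5, -1.0, 1.0, -1.0])
print(stats)  # {'win': 0.375, 'loss': 0.5, 'draw': 0.125, 'natural': 0.125}
```

Collecting only the final reward of each episode is enough for these statistics, so an evaluation along these lines would not need to render or log a frame at every step.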