# KEN4157 - Reinforcement Learning - Blackjack Gymnasium
If you opened this notebook in Google Colab, we recommend that you start by saving a copy of it to your own Google Drive, so that your changes and experiments are preserved.

## Installing & Importing Modules
We will start by installing and importing some modules that will likely be useful for your assignment(s). This includes [Gymnasium](https://gymnasium.farama.org/), a framework containing many popular RL environments (and the successor to the original Gym API from OpenAI).

In [None]:
!pip install gymnasium

import gymnasium as gym
import math
import numpy as np

from tqdm import tqdm



## Numpy Cheat Sheet
In case you are not already familiar with `numpy`, here are some examples of functions which may come in useful:

In [None]:
# Creates a fixed-size array of 5 entries, each initialised to 0.0
a = np.zeros(5)
# Creates a 5 x 6 array (i.e., matrix) initialised to all-zeros
b = np.zeros((5, 6))
# Creates a 2 x 3 x 5 (i.e., 3-dimensional) array initialised to all-zeros
c = np.zeros((2, 3, 5))

# Samples a number uniformly at random from [0, 1)
d = np.random.random()
# Generates an array with 5 numbers, each sampled uniformly at random from [0, 1)
e = np.random.random(5)

# Gives you the maximum number from the array `e`
f = np.max(e)
# Gives you the index that holds the maximum number in the array `e`
g = np.argmax(e)

# Clear all the variables we created above, purely for the purpose of examples, from memory
del a, b, c, d, e, f, g

## Setting up the Blackjack Environment
Here, we'll set up the Blackjack environment, and have a first look at how to interact with it according to the Gym API.

**Optional**:
- For a description and documentation of the environment, see: https://gymnasium.farama.org/environments/toy_text/blackjack/
- For the implementation of the environment, see: https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/blackjack.py

In [None]:
# Load the environment. sab=True means we follow the rules exactly as described
# in the Sutton and Barto book
env = gym.make("Blackjack-v1", sab=True)

action_space = env.action_space
obs_space = env.observation_space

print("action space =", action_space)
# action space = Discrete(2): this means that we have two actions (0 and 1)

# Let's create an array with nicer names for the actions (taken from documentation)
ACTION_NAMES = ["Stick", "Hit"]

print("observation space =", obs_space)
# observation space = Tuple(Discrete(32), Discrete(11), Discrete(2)):
#  - The player's current sum can be any from a discrete set of 32 values
#  - Value of the dealer's face-up card can be any from a discrete set of 11 values
#  - Whether or not the player holds a usable ace is boolean (two possible values)

print(f"The first variable of observation space has {obs_space[0].n} possible values.")
print(f"The second variable of observation space has {obs_space[1].n} possible values.")
print(f"The third variable of observation space has {obs_space[2].n} possible values.")

action space = Discrete(2)
observation space = Tuple(Discrete(32), Discrete(11), Discrete(2))
The first variable of observation space has 32 possible values.
The second variable of observation space has 11 possible values.
The third variable of observation space has 2 possible values.
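
Because all three components of the observation are discrete, an observation can be used directly as an index into a numpy array. The sketch below illustrates this with a hypothetical `q_table` whose shape mirrors the spaces printed above. The name `q_table`, the hand-written observation, and the stored value `0.5` are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical table: one axis per observation component, plus one axis for the action.
# The sizes mirror Tuple(Discrete(32), Discrete(11), Discrete(2)) and Discrete(2).
q_table = np.zeros((32, 11, 2, 2))

# Observation written by hand: player sum 14, dealer shows 10, no usable ace.
obs = (14, 10, False)

# Cast every component to int. A Python bool inside an index tuple is treated by
# numpy as a boolean mask, not as position 0 or 1, so the cast avoids surprises.
state = tuple(int(x) for x in obs)

action_values = q_table[state]      # view of shape (2,): one entry per action
q_table[state + (1,)] = 0.5         # store an arbitrary value for action 1 ("Hit")
greedy_action = int(np.argmax(action_values))  # index of the largest action value
```

Indexing with `state` alone returns the values of both actions, while appending the action to the tuple addresses a single entry.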


## Episodes with Random Policy
We'll show how to run a few episodes under a random policy. This will demonstrate how to interact with the Gym API and its most important functions and return values.

In [None]:
for _ in range(5):  # Run 5 episodes under random policy
    obs, info = env.reset()     # Resets environment to initial state, gives us observation of initial state
    done = False

    returns = 0.0   # accumulates the undiscounted sum of rewards (the return) for this episode

    print(f"The initial game state is: {obs}")

    # The following loop will run one complete episode
    while not done:
        print(f"value of cards I'm holding = {obs[0]}")
        print(f"value of dealer's face-up card = {obs[1]}")
        if obs[2]:
            print("I have a usable ace")
        else:
            print("I do not have a usable ace")

        action = action_space.sample()  # This randomly selects one action from the action space

        print("I randomly chose to: ", ACTION_NAMES[action])

        next_obs, reward, terminated, truncated, info = env.step(action)    # Execute the action, observe successor state and reward

        done = terminated or truncated
        obs = next_obs
        returns = returns + reward

    print(f"The episode ended with returns = {returns}")
    print()
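
The interaction loop above can also be wrapped in a small helper, which may be convenient later for evaluating a policy. This is a minimal sketch: the function name `run_episode` and the `policy` argument (any function mapping an observation to an action) are assumptions, not part of Gymnasium. It relies only on `reset()` and `step()` returning the same tuples as in the cell above.

```python
def run_episode(env, policy):
    """Run one complete episode and return the undiscounted sum of rewards.

    `policy` is a hypothetical callable that maps an observation to an action.
    """
    obs, info = env.reset()
    done = False
    episode_return = 0.0

    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        episode_return += reward
        # The episode is over if the game ended or a time limit was hit
        done = terminated or truncated

    return episode_return
```

With the environment created earlier, `run_episode(env, lambda obs: env.action_space.sample())` would play one episode under the random policy without the printing.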

## Table-Based Q-Learning
Below, you can write your own table-based Q-learning implementation.

In [None]:
# TODO: Q-learning
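
As a starting point, a single tabular Q-learning update could look like the sketch below. It is one possible shape for the update step only, not a complete solution: the function name, the table layout (observation axes followed by an action axis), and the default values of `alpha` and `gamma` are all assumptions. Action selection and the training loop are left out.

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state, terminated,
                      alpha=0.1, gamma=1.0):
    """Apply one Q-learning update in place.

    `state` and `next_state` are tuples of ints that index the observation axes
    of `q_table`; the last axis of `q_table` indexes the action.
    """
    # The target bootstraps from the best next action, unless the episode terminated
    target = reward
    if not terminated:
        target += gamma * np.max(q_table[next_state])

    index = state + (action,)
    # Move the current estimate a fraction alpha towards the target
    q_table[index] += alpha * (target - q_table[index])
```

The function modifies `q_table` in place, so the same array can be passed in at every step of an episode.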