<a href="https://colab.research.google.com/github/TheanLim/ReinforcementLearning/blob/master/BlackJack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!pip install gym
!apt-get install python-opengl -y
!apt install xvfb -y

# Brief Intro to Blackjack in OpenAI Gym

The `Blackjack-v0` environment is defined here: https://github.com/openai/gym/blob/master/gym/envs/toy_text/blackjack.py

> Blackjack is a card game where the goal is to obtain cards that sum to as
    near as possible to 21 without going over.  They're playing against a fixed
    dealer.
    Face cards (Jack, Queen, King) have point value 10.
    Aces can either count as 11 or 1, and it's called 'usable' at 11.
    This game is played with an infinite deck (or with replacement).
    The game starts with *dealer having one face up and one face down card*, while
    player having two face up cards.



In [0]:
import gym
env = gym.make('Blackjack-v0') # Create the Blackjack environment
obs = env.reset()  # Use .reset() to initialize and get the first observation
print(obs)

(9, 5, False)


The observation is a 3-tuple of: 
* the player's current sum,
* the dealer's one showing card (1-10 where 1 is ace),
* and whether or not the player holds a usable ace (0 or 1, printed as `False` or `True`).
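
As a small illustration, the tuple can be unpacked into named fields. The helper name `describe_observation` is hypothetical and is not part of Gym; it only shows how the three elements are read.

```python
def describe_observation(obs):
    """Turn a Blackjack observation tuple into a readable string."""
    player_sum, dealer_card, usable_ace = obs
    # The dealer's card is encoded as 1-10, where 1 stands for an ace.
    dealer = "Ace" if dealer_card == 1 else str(dealer_card)
    ace = "a usable ace" if usable_ace else "no usable ace"
    return f"Player sum {player_sum}, dealer shows {dealer}, {ace}"


print(describe_observation((9, 5, False)))
# Player sum 9, dealer shows 5, no usable ace
```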


In [0]:
env.action_space

Discrete(2)

`Discrete(2)` means that the possible actions are integers 0 and 1.
  * The player can request additional cards (hit=1) until they decide to stop
    (stick=0) or exceed 21 (bust).
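
To make the action encoding concrete, here is a minimal sketch of a policy that maps an observation to one of the two actions. The names `STICK`, `HIT`, and `threshold_policy` are hypothetical, and the threshold of 17 is an arbitrary illustrative choice rather than anything prescribed by the environment.

```python
STICK, HIT = 0, 1  # the two integers in Discrete(2)


def threshold_policy(obs, threshold=17):
    """Hit while the player's sum is below `threshold`, otherwise stick."""
    player_sum, _dealer_card, _usable_ace = obs
    return HIT if player_sum < threshold else STICK


print(threshold_policy((9, 5, False)))  # 1 (hit)
print(threshold_policy((20, 5, True)))  # 0 (stick)
```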

In [0]:
# Assuming we choose to hit and we take a step
action = 1
obs, rewards, done, info = env.step(action)

In [0]:
print(obs)
print(rewards)
print(done)
print(info)

(20, 5, True)
0.0
False
{}


* `obs` -- The original observation was `(9, 5, False)` and it changed to `(20, 5, True)` after we decided to hit.
  * We drew an ace, and it is usable (counted as 11, so 9 + 11 = 20), which is why the third element is now `True`.
  * The dealer's showing card remains unchanged at 5. This makes sense because the observation only includes the dealer's face-up card, which does not change while the player is drawing.
* `rewards` -- The reward for winning is +1, drawing is 0, and losing is -1. Here it is `0.0` because the game has not finished yet.
* `done` -- The game is not done yet because (1) we have not gone bust and (2) we did not stick.
* `info` -- Not relevant in this environment.
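
The jump from a sum of 9 to a sum of 20 can be reproduced by hand. The sketch below is modelled on the `usable_ace` and `sum_hand` helpers in Gym's `blackjack.py`; the starting hand `[4, 5]` is an assumption, since the observation only reveals the sum and not the individual cards.

```python
def usable_ace(hand):
    """True if the hand holds an ace that can count as 11 without busting."""
    return 1 in hand and sum(hand) + 10 <= 21


def sum_hand(hand):
    """Total of the hand, counting one ace as 11 when that is usable."""
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)


hand = [4, 5]  # assumed cards; any ace-free pair summing to 9 works
print(sum_hand(hand), usable_ace(hand))  # 9 False

hand.append(1)  # hitting draws an ace, which is stored as 1
print(sum_hand(hand), usable_ace(hand))  # 20 True
```

If a later card pushed the total past 21 with the ace at 11, `usable_ace` would return `False` and the ace would fall back to counting as 1.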



# Intro to Reinforcement Learning

https://www.tensorflow.org/agents/tutorials/0_intro_rl 

# tf Agents

## System Overview

![System Overview](https://drive.google.com/uc?id=1bOWE4DAiAcJDZ19NusM3juLyVUrmQT3H)

A TF-Agents training program is usually split into two parts that run in parallel:


> On the left, a `driver` explores the `environment` (task) using a `collect policy` to choose actions, and it collects `trajectories` (i.e., experiences), sending them to an observer, which saves them to a `replay buffer`.

> On the right, an `agent` pulls batches of `trajectories` from the `replay buffer` and trains some `networks`, which the `collect policy` uses.

In short, the left part explores the `environment` and collects `trajectories`, while the right part learns and updates the `collect policy`.
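
To illustrate the role the `replay buffer` plays between the two parts, here is a toy pure-Python sketch. It is not the TF-Agents replay buffer API: the class `ToyReplayBuffer` and the dictionary "trajectories" are hypothetical stand-ins that only show the idea of storing experiences on one side and sampling batches on the other.

```python
import random
from collections import deque


class ToyReplayBuffer:
    """Fixed-capacity buffer; the oldest items are dropped when it is full."""

    def __init__(self, max_length):
        self._items = deque(maxlen=max_length)

    def add(self, trajectory):
        self._items.append(trajectory)

    def sample(self, batch_size):
        """Return `batch_size` distinct stored items chosen uniformly at random."""
        return random.sample(list(self._items), batch_size)

    def __len__(self):
        return len(self._items)


buffer = ToyReplayBuffer(max_length=3)
for step in range(5):        # collection side: store experiences
    buffer.add({"step": step})
batch = buffer.sample(2)     # training side: pull a batch
print(len(buffer), batch)
```

Because the capacity is 3, only the three most recent items (steps 2, 3, and 4) remain available for sampling.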

1.   **Multiple Environments** - You'd want to explore multiple copies of the environment in parallel to (1) use all of the available resources (CPU and GPU) and (2) create trajectories (experiences) that are less correlated during training.
2.   A **trajectory** is a sequence of consecutive transitions from time step *n* to time step *n + t*.
3.   An **observer** is just any function that takes a trajectory as an argument. It may seem redundant but it allows flexibility. You can:
  1. Use an observer to save trajectories into replay buffer or to a file
  2. Compute metrics using trajectories
  3. Pass multiple observers to the driver and broadcast trajectories to all of them
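
The observer idea above can be sketched in plain Python. This is a toy illustration, not TF-Agents code: `CountingObserver` and `toy_driver` are hypothetical names, and the strings stand in for real trajectories.

```python
class CountingObserver:
    """Counts how many trajectories it has seen (a very simple metric)."""

    def __init__(self):
        self.count = 0

    def __call__(self, trajectory):
        self.count += 1


def toy_driver(trajectories, observers):
    """Broadcast each trajectory to every observer, as a driver would."""
    for trajectory in trajectories:
        for observer in observers:
            observer(trajectory)


saved = []  # stands in for a replay buffer or a file
counter = CountingObserver()
toy_driver(["t0", "t1", "t2"], observers=[saved.append, counter])
print(saved, counter.count)  # ['t0', 't1', 't2'] 3
```

Since an observer is just a callable, a bound method such as `saved.append` and an object with `__call__` can be mixed freely in the same list.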


## Basic Components

Let's look at a Deep Q-Network (DQN) first.

The components are (in sequence): 
1. `Deep Q-Network`
2. `DQN agent` (which will take care of creating the `collect policy`)
3. `Replay Buffer` and the `observer`
4. `Training Metrics`
5. `Driver`
6. `Dataset` - `tf.data.Dataset`
7. Populate the `replay buffer`
8. `train`

```
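import tensorflow as tf
from tf_agents.networks import q_network
from tf_agents.agents.dqn import dqn_agent
from tf_agents.utils import common  # provides element_wise_squared_loss

# `train_env` (a TF-Agents environment) and `optimizer` must be defined
# before this code runs.

# Q-network with a single hidden layer of 100 units
q_net = q_network.QNetwork(
  train_env.observation_spec(),
  train_env.action_spec(),
  fc_layer_params=(100,))

# The DQN agent wraps the Q-network and creates the collect policy
agent = dqn_agent.DqnAgent(
  train_env.time_step_spec(),
  train_env.action_spec(),
  q_network=q_net,
  optimizer=optimizer,
  td_errors_loss_fn=common.element_wise_squared_loss,
  train_step_counter=tf.Variable(0))

agent.initialize()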
```
An example of an optimizer is the AdamOptimizer:

`optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)`

# Ignore

To activate a virtual display, we need to run the following script once before training an agent:





In [0]:
# For rendering Environment
!pip install pyvirtualdisplay
!pip install piglet
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1400, 900))
display.start()

xdpyinfo was not found, X start can not be checked! Please install xdpyinfo!


<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1001'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1001'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

In [0]:
# This code creates a virtual display to draw game images on. 
# If you are running locally, just ignore it
import os
if not os.environ.get("DISPLAY"):  # DISPLAY is unset or empty
    !bash ../xvfb start
    %env DISPLAY=:1

In [0]:
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) # error only
import tensorflow as tf
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML

from IPython import display as ipythondisplay

In [0]:
"""
Utility functions to enable video recording of gym environment and displaying it
To enable video, just do "env = wrap_env(env)"
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    # Read-only binary mode is enough here; the file is closed automatically
    with io.open(mp4, 'rb') as f:
      video = f.read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env