# Black Jack with Monte Carlo
In this task you will find an optimal policy for a Black Jack game using the OpenAI Gym [Black Jack environment](https://gym.openai.com/envs/Blackjack-v0/). [OpenAI Gym](https://gym.openai.com/) is a toolkit for developing and comparing reinforcement learning algorithms. Among other things, it provides ready-made RL environments that make it easier to study and develop new reinforcement learning algorithms.

The main purposes of this notebook are to introduce:
- OpenAI Gym environments
- Monte Carlo methods
- the `Exploring starts` exploration algorithm 



<a target="_blank" href="https://colab.research.google.com/github/PrzemekSekula/ReinforcementLearningClasses/blob/main/MonteCarlo/BlackJackMC-Empty.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>



In [None]:
#!pip install gym
#!pip install pygame

import sys
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    !pip install --upgrade gym

In [None]:
import gym 
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from tqdm import tqdm

### Black Jack Environment
- `states` - tuples (`score`, `dealer score`, `useable ace`)
    - `score` - the total value of your cards (4-21)
    - `dealer score` - the value of the dealer's visible card (1-10)
    - `useable ace` - True/False, indicates whether you hold a useable ace
- `actions`
    - `0` - stick, i.e. stop taking cards (called `draw` in the code below)
    - `1` - hit, i.e. take another card
- `rewards`
    - `1` - you won the game
    - `0` - the game ended in a draw (tie)
    - `-1` - you lost the game
    
    
Let's create an environment and see how to use it.
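A minimal sketch of how an episode can be played with random actions. It assumes the `reset` / `step` interface used later in this notebook (`reset` returns `(state, info)` and `step` returns five values, as in gym 0.26 and newer). The helper name `play_random_episode` is hypothetical, and the environment id `Blackjack-v1` in the usage comment is an assumption about the installed gym version.

```python
def play_random_episode(env):
    """Plays one episode with random actions and returns its transitions."""
    state, info = env.reset()
    transitions = []
    terminal, truncated = False, False
    while not (terminal or truncated):
        # Sample a random action: 0 (stick) or 1 (hit)
        action = env.action_space.sample()
        next_state, reward, terminal, truncated, info = env.step(action)
        # Store the state in which the action was taken, not the next one
        transitions.append((state, action, reward))
        state = next_state
    return transitions

# Usage in this notebook (assumed environment id, requires gym):
# env = gym.make("Blackjack-v1")
# for state, action, reward in play_random_episode(env):
#     print(state, action, reward)
```

Each printed line shows the state tuple, the action taken in that state, and the reward received for it. Only the last transition of an episode should carry a non-zero reward.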

## Task 1
Fill in the placeholders to complete the `update_qest` method.  You are supposed
to compute an updated state-action value according to the formula:


$q_{n+1} = q_{n} + \frac{1}{n}(G_n-q_n)$

where:
- $q_{n}$ - current estimated state-action value 
- $q_{n+1}$ - new estimated state-action value 
- $G_n$ - return obtained for the explored state and action
- $n$ - number of returns observed so far for this state-action pair, including $G_n$ (counted separately for each state and each action)
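The formula is an incremental way of computing the mean of all returns seen so far. A minimal sketch on a plain list of returns, independent of the `Policy` class (the function name `incremental_mean` and the sample returns are made up for illustration):

```python
def incremental_mean(returns, q_initial=0.0):
    """Applies q <- q + (g - q) / n for each return g, with n counted from 1."""
    q = q_initial
    for n, g in enumerate(returns, start=1):
        q = q + (g - q) / n
    return q

returns = [1, -1, -1, 1, 1]
q = incremental_mean(returns, q_initial=1.0)
print(q)                             # incremental estimate
print(sum(returns) / len(returns))   # ordinary mean, for comparison
```

Because the first update uses $n = 1$, the initial estimate is fully replaced by the first return, so the result does not depend on `q_initial`.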

In [None]:
class Policy:
    """
    This class is used to learn and maintain the policy. It does so by
    learning state-action values. 
    Properties: 
        q_est - a dictionary that stores estimated state-action values for 
                the states
                { state : [value for action0, value for action1] }
        n     - a dictionary that stores how many times each state-action 
                value was updated
                { state : [no. updates for action0, no. updates for action 1]}
    Methods:
        act         - returns a greedy action according to the current policy.
                      The initial policy hits while the score is below 19.
                      It is then gradually updated during learning.
        update_qest - updates a specific state-action value using a given 
                      return
        plot        - visualizes the policy.
    """
    
    def __init__(self):
        
        # Dictionary with { state : [state action value for 0, state action value for 1]  }
        self.q_est = {}        
        self.n = {}
        self.__initialize_states()
        
    def __initialize_states(self):
        """
        Initializes all possible states by setting q_est and n values
        - q_est is set to hit if score < 19, otherwise it is set to draw
          (the value of the preferred action is set to 1, the value of the 
          other action is set to 0)
        - n is set to 0 for both actions
        """
        for score in range(4, 22):
            for dealer_score in range(1, 11):
                for usable_ace in [True, False]:
                    should_hit = int(score < 19)
                    state = (score, dealer_score, usable_ace)
                    self.q_est[state] = [1 - should_hit, should_hit]
                    self.n[state] = [0, 0]
        
        
    def act(self, state):
        """
        Returns a greedy action according to the current policy
        Arguments:
            state - a state obtained from OpenAI Gym Black Jack environment
        Returns:
            action - 0 for 'draw', 1 for 'hit'
        """
        return np.argmax(self.q_est[state])
                
    def update_qest(self, state, action, g):
        """
        Updates state-action value for a specific state and a specific action.
        State-action values are computed as a mean of all returns
        Arguments:
            state  - a state obtained from OpenAI Gym Black Jack environment
            action - 0 for 'draw', 1 for 'hit'
            g      - return that should be used for updating        
        """
        
        # ENTER YOUR CODE HERE. 
        # Update self.n[state][action] and self.q_est[state][action]

        
        # END OF YOUR CODE
        
    def plot(self, useable_ace = True):
        """
        Plots a visualization of the current greedy policy for every state stored
        in q_est. Cells that correspond to no state (scores below 4) are plotted
        as 'unknown'.
        Arguments:
            useable_ace - True / False. Selects whether to plot the policy for states
                          with or without a useable ace.        
        """
        states = [x for x in self.q_est.keys() if x[2] == useable_ace]
        rows = max(x[0] for x in states)
        cols = max(x[1] for x in states)
        
        res = -1 * np.ones((rows, cols))
        for state in states:
            res[rows-state[0], state[1]-1] = self.act(state)
            
        fig = plt.figure(figsize = (12, 8))

        #cbar_kws = {'ticks' : [0, 0.5, 1]}
        cmap = matplotlib.colors.ListedColormap(('white', 'r', 'g'), name = 'My Cmap')

        ax = sns.heatmap(res, linewidth=0.5, cmap = cmap)
        cbar = ax.collections[0].colorbar
        cbar.set_ticks([1, 0, -1])
        cbar.set_ticklabels(['hit', 'draw', 'unknown'])
        ax.set_xticks(np.arange(10) + 1)
        ticks = np.arange(10)
        xticklabels = [f'       {x}' for x in list(ticks+1)]

        plt.xticks(ticks, xticklabels, ha = 'left')
        ticks = np.arange(21)
        yticklabels = [f'{x} ' + (' ' if x < 10 else '') for x in list(ticks+1)[::-1]]
        plt.yticks(ticks, yticklabels, rotation=90, va='top')
        plt.title('Black Jack policy with' + ('' if useable_ace else 'out') + ' a useable ace')

        plt.show()        


In [None]:
policy = Policy()
policy.plot()

## Task 2
Complete the `generate_episode` function by filling in the code placeholders.
You are supposed to:
- randomly select the initial action and execute this action 
- select every other action according to the policy and execute it
- update policy for each state-action pair generated during the episode
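The last step needs a return for every state-action pair of the episode. A minimal sketch of computing returns by walking the rewards backwards, assuming one reward is recorded per step. The function name `returns_from_rewards` and the discount parameter `gamma` are not part of this notebook and are introduced only for illustration.

```python
def returns_from_rewards(rewards, gamma=1.0):
    """Returns a list where element t is the discounted return from step t."""
    g = 0.0
    returns = []
    # Walk backwards so each return reuses the one computed after it
    for reward in reversed(rewards):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# A three-step episode that ends in a loss
print(returns_from_rewards([0, 0, -1]))             # no discounting
print(returns_from_rewards([0, 0, -1], gamma=0.5))  # with discounting
```

With no discounting and zero rewards on all non-final steps, every step of the episode gets the same return, namely the final reward. In that case the final `reward` can be passed directly to `update_qest` for each pair.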

In [None]:
def generate_episode(policy, env):
    """
    Generates one episode by following the given policy after a random
    first action, then updates the policy's state-action values.
    Arguments:
        policy - a policy object
        env    - an OpenAI Gym Black Jack environment
    """
    states = []
    actions = []
    
    state, info = env.reset()
    
    # ENTER YOUR CODE HERE
    # Choose the first action randomly
    action = None
    # END OF YOUR CODE
    
    states.append(state)
    actions.append(action)
    
    # ENTER YOUR CODE HERE
    # Execute the action you chose above
    state, reward, terminal, truncated, info = None
    # END OF YOUR CODE
    
    while not terminal:
        # ENTER YOUR CODE HERE
        # Choose an action according to the current policy
        action = None
        # END OF YOUR CODE
        
        states.append(state)
        actions.append(action)
                
        # ENTER YOUR CODE HERE
        # Execute the action you chose above
        state, reward, terminal, truncated, info = None
        # END OF YOUR CODE
        
    for state, action in zip(states[::-1], actions[::-1]):
        # ENTER YOUR CODE HERE
        # Update qest     
        pass 
        # END OF YOUR CODE
    

## Task 3
Generate episodes and learn from them until you learn the optimal policy. You may use `policy.plot()` to visualize your policies.

In [None]:
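# ENTER YOUR CODE HERE
# Generate enough episodes to obtain the optimal policy
# (10 episodes is only a placeholder, increase it)
env = gym.make('Blackjack-v1')  # assumed id for gym >= 0.26; skip if env already exists
for i in tqdm(range(10)):
    generate_episode(policy, env)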


In [None]:
policy.plot(useable_ace = False)
policy.plot(useable_ace = True)