# COURSE:   PGP [AI&ML]

## Learner :  Chaitanya Kumar Battula
## Module  : RNN
## Topic   :  Train the agent to learn and win the card game, Blackjack.



## **Environment**

* Game is played against a fixed dealer.
* Cards are drawn with replacement, i.e. from an infinite deck.
* Terms:
  * Hit = Player asks for an additional card
  * Stick = Player stops asking for additional cards
  * Bust = The sum of the cards in a hand exceeds 21
* Score of the cards:
  * Number cards count at face value, and each face card (Jack, Queen, King) counts as 10.
  * An Ace counts as either 11 or 1. It is called usable when it can count as 11 without the hand going bust.
* Goal: Acquire cards whose sum is as close to 21 as possible without exceeding it.
* Rules:
  * The game starts with two cards dealt to both the player and the dealer. One of the dealer's cards is face up and the other is face down.
  * The player can ask for additional cards until the sum of the cards exceeds 21 or the player stops voluntarily.
  * After the player sticks, the dealer reveals the face-down card and draws cards from the deck until the sum is 17 or greater.
  * If the player goes bust, the player loses. If the dealer goes bust, the player wins.
  * If neither of them busts, whoever has the sum nearer to 21 wins. Equal sums are a draw.
* Action:
  * STICK = 0
  * HIT = 1
* Reward:
  * Win = +1
  * Draw = 0
  * Loss = -1
* Observation:
  * Player's current sum
  * Dealer's one showing card
  * Whether the player holds a usable ace

Environment courtesy: This environment corresponds to the version of the Blackjack problem described in Example 5.1 of Reinforcement Learning: An Introduction by Sutton and Barto (1998), as implemented in OpenAI Gym.
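
The scoring rules above can be sketched in a few lines of plain Python. This is only an illustration of the rules as listed, not the Gym implementation itself. The names `DECK`, `draw_card`, `usable_ace`, `hand_total`, `is_bust`, `dealer_play` and `outcome` are assumptions introduced for this sketch.

```python
import random

# Card values: Ace = 1, 2-9 at face value, and 10/J/Q/K all count as 10.
DECK = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]

def draw_card(rng):
    """Draw one card with replacement (infinite deck)."""
    return rng.choice(DECK)

def usable_ace(hand):
    """True if the hand holds an ace that can count as 11 without busting."""
    return 1 in hand and sum(hand) + 10 <= 21

def hand_total(hand):
    """Best total for the hand, counting one ace as 11 when it is usable."""
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)

def is_bust(hand):
    """True if the hand total exceeds 21."""
    return hand_total(hand) > 21

def dealer_play(hand, rng):
    """Fixed dealer strategy: keep drawing until the total is 17 or more."""
    hand = list(hand)
    while hand_total(hand) < 17:
        hand.append(draw_card(rng))
    return hand

def outcome(player, dealer):
    """Reward from the player's point of view: +1 win, 0 draw, -1 loss."""
    if is_bust(player):
        return -1
    if is_bust(dealer):
        return 1
    p, d = hand_total(player), hand_total(dealer)
    return (p > d) - (p < d)

print(hand_total([1, 6]))             # 17: the ace counts as 11
print(hand_total([1, 6, 10]))         # 17: the ace falls back to 1
print(outcome([10, 10], [10, 7]))     # 1: player's 20 beats dealer's 17
```

The observation tuple used later, such as `(18, 10, False)`, corresponds here to `(hand_total(player), dealer's showing card, usable_ace(player))`.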

## **Import Libraries and Environment**

In [1]:
import matplotlib
import numpy as np
import sys
import collections
from collections import defaultdict
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

import gym
env = gym.make("Blackjack-v0")


## **Solution**

**Arguments:**

* policy: Maps an observation to action probabilities
* env: OpenAI Gym environment
* num_episodes: Number of episodes
* discount_factor: Gamma discount factor
* Q: A dictionary that maps state -> action-values. Each value is a numpy array of length nA (see below)
* epsilon: Probability of selecting a random action (float between 0 and 1)
* nA: Number of actions in the environment

**Returns:**

* A function that takes an observation as its argument and returns the probability of each action as a numpy array of length nA


In [25]:
probs = [0.95, 0.05]
np.random.choice(np.arange(len(probs)), p=probs)

0

In [6]:
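best_action = 1
A = np.ones(2, dtype=float) * 0.1 / 2  # assumed setup: epsilon / nA for each of the 2 actions, with epsilon = 0.1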

In [8]:
A[best_action] += (1.0 - .1)
A

array([0.05, 0.95])
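
The scratch cells above exercise the epsilon-greedy rule by hand. Written out, with $n_A$ actions and exploration rate $\varepsilon$ (notation chosen here for illustration), the probability of selecting action $a$ in state $s$ is:

$$
\pi(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{n_A} & \text{if } a = \arg\max_{a'} Q(s, a') \\
\dfrac{\varepsilon}{n_A} & \text{otherwise}
\end{cases}
$$

With $\varepsilon = 0.1$ and $n_A = 2$, the greedy action gets $1 - 0.1 + 0.05 = 0.95$ and the other action gets $0.05$, which matches `array([0.05, 0.95])` for `best_action = 1`.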

### **Monte Carlo Control**

In [3]:
#Creating epsilon greedy policy for Q-function and epsilon

def make_epsilon_greedy_policy(Q, epsilon, nA):
    def policy_fn(observation):
        A = np.ones(nA, dtype=float) * epsilon / nA
        best_action = np.argmax(Q[observation])
        A[best_action] += (1.0 - epsilon)
        return A
    return policy_fn
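
One detail worth checking is that `policy_fn` reads `Q` at call time, so later updates to `Q` change the action probabilities without rebuilding the policy. The sketch below repeats the function so that it runs on its own. The state and the value written into `Q` are made up for illustration.

```python
from collections import defaultdict
import numpy as np

def make_epsilon_greedy_policy(Q, epsilon, nA):
    # Same definition as in the cell above, repeated so this sketch is self-contained.
    def policy_fn(observation):
        A = np.ones(nA, dtype=float) * epsilon / nA
        best_action = np.argmax(Q[observation])
        A[best_action] += (1.0 - epsilon)
        return A
    return policy_fn

nA = 2
Q = defaultdict(lambda: np.zeros(nA))
policy = make_epsilon_greedy_policy(Q, epsilon=0.1, nA=nA)

state = (18, 10, False)   # illustrative (player sum, dealer card, usable ace)
before = policy(state)    # all-zero Q: argmax breaks the tie in favour of action 0 (STICK)
Q[state][1] = 0.5         # made-up value: pretend HIT now looks better
after = policy(state)     # same policy object, but the greedy action has switched to HIT

print(before)             # roughly [0.95, 0.05]
print(after)              # roughly [0.05, 0.95]
```

Note that `np.argmax` returns the first maximum, so unvisited states (all zeros) default to action 0 as the greedy choice.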

**Arguments:**
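* num_episodes = Number of episodes to sample
* discount_factor = Gamma discount factor
* Returns:
  * A tuple (Q, policy): Q maps state -> action-values, and policy is a function that takes an observation and returns action probabilities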

In [4]:
#Finding an optimal epsilon-greedy policy

def mc_control_epsilon_greedy(env, num_episodes, discount_factor=1.0, epsilon=0.1):
    
    # Keeps track of sum and count of returns for each (state, action) pair
    # to calculate an average. We could use an array to save all
    # returns (like in the book) but that's memory inefficient.
    returns_sum = defaultdict(float)
    returns_count = defaultdict(float)
    
    # The final action-value function (Q).
    # A dictionary that maps state -> numpy array of action-values (one entry per action).
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    
    # The policy we're following
    policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)
   
    for i_episode in range(1, num_episodes + 1):
        # Print out which episode we're on, useful for debugging.
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()

        # Generate an episode.
        # An episode is an array of (state, action, reward) tuples
        episode = []
        state = env.reset()
        for t in range(100):
            probs = policy(state)
            action = np.random.choice(np.arange(len(probs)), p=probs)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            if done:
                break
            state = next_state

        # Find all (state, action) pairs we've visited in this episode
        # We convert each state to a tuple so that we can use it as a dict key
        sa_in_episode = set([(tuple(x[0]), x[1]) for x in episode])
        for state, action in sa_in_episode:
            sa_pair = (state, action)
            # Find the first occurrence of the (state, action) pair in the episode
            first_occurrence_idx = next(i for i, x in enumerate(episode)
                                        if x[0] == state and x[1] == action)
            # Sum up all (discounted) rewards since the first occurrence
            G = sum([x[2] * (discount_factor ** i) for i, x in enumerate(episode[first_occurrence_idx:])])
            # Calculate average return for this state over all sampled episodes
            returns_sum[sa_pair] += G
            returns_count[sa_pair] += 1.0
            Q[state][action] = returns_sum[sa_pair] / returns_count[sa_pair]
        
        # The policy is improved implicitly by changing the Q dictionary
    
    return Q, policy
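
To make the first-visit return concrete, the following sketch isolates that step from the function above and applies it to a short, hypothetical episode. The helper name `first_visit_return` and the episode itself are assumptions made for illustration.

```python
def first_visit_return(episode, state, action, discount_factor=1.0):
    """Discounted sum of rewards from the first visit to (state, action) onward."""
    first_idx = next(i for i, (s, a, _) in enumerate(episode)
                     if s == state and a == action)
    return sum(r * discount_factor ** k
               for k, (_, _, r) in enumerate(episode[first_idx:]))

# Hypothetical episode of (state, action, reward) tuples:
# hit on 12, hit on 15, then stick on 19 and win.
episode = [
    ((12, 10, False), 1, 0),
    ((15, 10, False), 1, 0),
    ((19, 10, False), 0, 1),
]

# With discount_factor = 1.0 every visited pair is credited with the final reward.
print(first_visit_return(episode, (12, 10, False), 1))        # 1.0
# With discounting, earlier decisions receive a smaller share: 0.9 ** 2 here.
print(first_visit_return(episode, (12, 10, False), 1, 0.9))
```

Because Blackjack only pays out at the end of an episode, the default `discount_factor=1.0` gives every (state, action) pair in an episode the same return.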

#### **Episodes**

In [5]:
Q, policy = mc_control_epsilon_greedy(env, num_episodes=50, epsilon=0.1)
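
Once `Q` has been estimated, a greedy action and a state-value estimate can be read off each entry. The sketch below shows one way to do that, using three entries copied from the 50-episode output shown in this section. The helper `greedy_summary` and the `ACTION_NAMES` mapping are names introduced here, not part of the notebook.

```python
import numpy as np

ACTION_NAMES = {0: "STICK", 1: "HIT"}

def greedy_summary(Q):
    """Map each state to (greedy action, state value), where the value is max_a Q[s][a]."""
    return {state: (int(np.argmax(values)), float(np.max(values)))
            for state, values in Q.items()}

# Three entries copied from the 50-episode run in this section.
Q_sample = {
    (18, 10, False): np.array([0.0, -1.0]),
    (21, 10, True): np.array([1.0, 0.0]),
    (12, 6, False): np.array([-1.0, 0.0]),
}

for state, (action, value) in greedy_summary(Q_sample).items():
    print(state, ACTION_NAMES[action], value)
```

With only 50 episodes, most (state, action) pairs have probably been visited at most once, which would explain why nearly every value is exactly 0, 1 or -1. A much larger `num_episodes` is needed before these estimates become meaningful averages.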

In [6]:
Q

defaultdict(<function __main__.mc_control_epsilon_greedy.<locals>.<lambda>()>,
            {(18, 10, False): array([ 0., -1.]),
             (21, 10, True): array([1., 0.]),
             (20, 10, False): array([1., 0.]),
             (21, 2, True): array([0., 0.]),
             (17, 5, False): array([1., 0.]),
             (12, 6, False): array([-1.,  0.]),
             (20, 3, False): array([0., 0.]),
             (7, 7, False): array([-1.,  0.]),
             (9, 10, False): array([-1.,  0.]),
             (16, 7, True): array([1., 0.]),
             (17, 10, True): array([-1.,  0.]),
             (14, 10, False): array([ 0., -1.]),
             (20, 9, False): array([ 0., -1.]),
             (20, 5, True): array([1., 0.]),
             (6, 8, False): array([1., 0.]),
             (10, 1, False): array([-1.,  0.]),
             (20, 5, False): array([1., 0.]),
             (8, 7, False): array([1., 0.]),
             (14, 4, False): array([-1.,  0.]),
             (12, 9, False): arr