# What is Monte Carlo Method ?

It's a method of estimating the value function by taking the mean return instead of the expected return.The Monte Carlo method for reinforcement learning learns directly from episodes of experience without any prior knowledge of MDP transitions.But how we can solve the MDP withount any prior knowledge of MDP transitions ? The idea based on the law of the large numbers.


#### Recall, 

\begin{equation}
\large
V^{\pi}(S_{t} = s) = \operatorname{\mathbb{E}}[G_{t} | S_{t} = s]
\end{equation}


<strong>By the law of large numbers,</strong> integrals described by the expected value of some random variable can be approximated by taking the empirical mean.so we can take the mean of the return rather than the expected value, and therefore we can solve the problem without any knowledge of MDP.

# What is Black Jack Game ?

<img src = "https://i.imgur.com/jCD3ciR.jpg" >

* Blackjack is a card game that pits player versus dealer. 


* It is played with one or more decks of cards. 


* Cards are counted as their respective numbers, face cards as ten, and ace as either eleven or one (in our game it will show on the counter as an 11 unless you are over 21). 



* The object of Blackjack is the beat the dealer. This can be accomplished by getting Blackjack (first two cards equal 21) without dealer Blackjack, having your final card count be higher than the dealers without exceeding 21, or by not exceeding 21 and dealer busting by exceeding their card count of 21.



* The Black Jack games is a type of episodic tasks since it has terminal state 
  when the agent wins or loses. 

<br>

## Notes:
* The player has to decide the value of an ace. If the player's sum of cards is 10 and the player
  gets an ace after a hit, he can consider it as 11, and 10 + 11 = 21. But if the player's sum of
  cards is 15 and the player gets an ace after a hit,if he considers it as 11 and 15+11 = 26, then
  it's a bust.
  

*  If the player has an ace we can call it <strong>a usable ace;</strong> the player can consider it as 11 without being bust. 



* If the player is bust by considering the ace as 11, then it is called
<strong> nonusable ace.</strong>

## The Rewards in The Game: 

* $+1 :$ if the player win the game
* $-1 :$ if the player loses the game
* $\hspace{1.5mm}0\hspace{1.5mm}:$ if the game is a draw 


<br>


## The Actions in The Game:

* <strong>Hit :</strong> if the player needs a card
* <strong>Stand:</strong> if the player doesn't need a card

<br>

## The Observations in The Game:
The observation of a 3-tuple of: 
* The player's current sum
* The dealer's one showing card (1-10 where 1 is ace),
* A Boolean Value represents  whether or not the player holds a usable ace (0 or 1).

## The agent-enivronment interaction 

<img src = "https://gym.openai.com/assets/docs/aeloop-138c89d44114492fd02822303e6b4b07213010bb14ca5856d2d49d6b62d88e53.svg" >

The agent and the environment interact at each of a sequence of discrete time steps $t = 0,1,2,\ldots.$ Each timestep, the agent chooses an action, and the environment returns an observation and a reward.

<br>




<strong>Now,</strong> we are going to solve Black Jack problem using First Visit Monte Carlo Method.


In [83]:
#import the necessary items 
import gym 
from collections import defaultdict

In [73]:
#Create the Blackjack environment using OpenAI's Gym 
env = gym.make('Blackjack-v0')

[2020-08-26 15:02:14,618] Making new env: Blackjack-v0
  result = entry_point.load(False)


# Helper Functions

In [74]:
def sample_policy(observation):
    '''
    Usage:
      #sample_policy --> used to perform an action based on the agent's score
  
    Arguments:
      #observation --> represents the current state
    
    Returns:
      #0 --> if the score >= 20 (stand)
      #1 --> otherwise (hit)
      
    '''
    
    score, dealer_score, usable_ace = observation 
    
    return 0 if score >= 20 else 1

In [75]:
def generate_episode(policy, env):
    '''
    Usage:
      #generate_episode --> used to generate an episode which is a single round of a game
  
    Arguments:
      #policy --> reprensents the way to behave (perform an action) at a certain state 
      #env --> represents the environment that the agent interact with
    
    Returns:
      #states --> the states we reached during the interaction between the agent and environment
      
      #rewards --> the reward the agent got as a result of the interaction between 
                   the agent and environment   
                   
      #actions --> the actions that the agent performed during the interaction 
                   between the agent and environment
    
    Notes:
     #Each timestep, the agent chooses an action, and the environment returns 
      an observation and a reward.The process gets started by calling reset(),
      which returns an initial observation
      
    '''
    
    #First, define states,actions,rewards as empty list 
    states, actions, rewards = [], [], []
    
    
    #Second, initiate the environment using env.reset()
    observation = env.reset()
    
    #At the end of every episode , we do the following
    while True:
        
        #1-Append the observation to the states list
        states.append(observation)
        
        #2-Create an action using sample_policy function, and append the action to action list
        action = sample_policy(observation)
        actions.append(action)
        
        #3-Each timestep, the agent chooses an action, and 
        #the environment returns an observation and a reward.
        #so we will return observation,reward, info(dict for diagnostic information useful for debugging)
        #Also, we return, done --> which is a flag used to check if we reached terminal state or not
        observation, reward, done, info = env.step(action)
        rewards.append(reward)
        
        #if we reached the terminal state, then we break 
        if done: 
            break
            
    return states, actions, rewards

In [78]:
def first_visit_mc_prediction(policy, env, n_episodes):
    '''
    Usage:
      #first_visit_mc_predicition --> used to aproximate the value function using 
                                      First Visit Monte Carlo  Method
  
    Arguments:
      #policy --> reprensents the way to behave (perform an action) at a certain state 
      #env --> represents the environment that the agent interact with
      #n_episodes --> the number of episodes
    
    Returns:
      #value_table --> represents the approximated value of the value function for every state
      
    '''
    
    #Define empty value table as a dic for storing the values at each state
    value_table = defaultdict(float)
    
    #Initialize N(S) as dic for storing the every state (keys) , and 
    #the number of times we visit every state (Values)
    N = defaultdict(int)
    
    #For a ceatrain number of episodes , we do the following 
    for episode in range(n_episodes):
        
        #1-generate an episode and store the result (states, and rewards)
        states, _, rewards = generate_episode(policy, env)
        
        #2-initialize the return  (sum of the rewards)
        G = 0

        #3-Each time step of an episode, we store the reward to R and state to S obtained 
        #from the choosen action, Then we calcuatle the return as a sum of the reward
        for t in range(len(states) - 1, -1, -1):
            R = rewards[t]
            S = states[t]
            G += R
            
            #Apply First Visit Monte Carlo Method
            #check if that is the first time the state is visited in an episode
            if S not in states[:t]:
                #increment N(S)
                N[S] += 1
                #Estimated the value function of a state by mean return
                value_table[S] += G / N[S]
                
    return value_table

# Playing The Game

In [81]:
#Get the approximated value function for all states
value = first_visit_mc_prediction(sample_policy, env, n_episodes=500000)

In [82]:
#Explore the value for all states
value

defaultdict(float,
            {(20, 10, False): 5.452824238110242,
             (19, 10, False): -6.687224264614749,
             (19, 1, False): -7.478889808807915,
             (17, 1, False): -7.507231463335668,
             (11, 1, False): -3.443660816827226,
             (18, 10, False): -7.922466907095889,
             (14, 10, False): -6.84847558632649,
             (19, 7, False): -7.591882026271605,
             (12, 7, False): -5.840202634774576,
             (16, 5, False): -6.7015653422770605,
             (16, 5, True): -3.7774017490359,
             (15, 5, True): -2.7516882813377896,
             (18, 2, False): -7.331498171677159,
             (8, 2, False): -4.784198340535718,
             (4, 2, False): -4.098782965125565,
             (11, 7, False): -1.8791069126891629,
             (21, 2, False): 7.609305321171586,
             (17, 2, False): -4.947870175713576,
             (16, 10, False): -5.882068263123497,
             (19, 9, False): -6.895629475405596,
  

# Congratulations!