
In [48]:
import numpy as np
import gym

env = gym.make('Blackjack-v0')
env.action_space, env.action_space.sample()  # two actions: 0: stick, 1: hit

(Discrete(2), 0)

## How to Play Blackjack

* Blackjack is a card game where the goal is to obtain cards that sum to as near as possible to 21 without going over. The player plays against a fixed dealer.

* Face cards (Jack, Queen, King) have point value 10.

* An ace can count as either 11 or 1; it is called 'usable' when it counts as 11.

* This game is played with an infinite deck (that is, cards are drawn with replacement).

* The game starts with each (player and dealer) having one face up and one face down card.

* The player can request additional cards (hit=1) until they decide to stop
 (stick=0) or exceed 21 (bust).

* After the player sticks, the dealer reveals their facedown card, and draws until their sum is 17 or greater.  If the dealer goes bust the player wins.

* If neither player nor dealer busts, the outcome (win, lose, draw) is
 decided by whose sum is closer to 21.  The reward for winning is +1,
  drawing is 0, and losing is -1.
 
* The observation is a 3-tuple of: the player's current sum,
 the dealer's one showing card (1-10 where 1 is ace),
  and whether or not the player holds a usable ace (0 or 1).
  
* This environment corresponds to the version of the blackjack problem
 described in Example 5.1 in Reinforcement Learning: An Introduction
 by Sutton and Barto (1998).
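
The scoring rules above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the environment's own code: the helper names `hand_value`, `dealer_should_hit`, and `outcome_reward` are hypothetical, and cards are assumed to be integers 1-10 with 1 standing for an ace.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values 1-10 (1 = ace)."""
    total = sum(cards)
    # An ace is 'usable' if counting it as 11 (adding 10) does not bust the hand.
    usable = 1 in cards and total + 10 <= 21
    return (total + 10 if usable else total), usable


def dealer_should_hit(cards):
    """The fixed dealer keeps drawing until their sum is 17 or greater."""
    total, _ = hand_value(cards)
    return total < 17


def outcome_reward(player_cards, dealer_cards):
    """+1 for a player win, 0 for a draw, -1 for a loss (final hands assumed)."""
    player, _ = hand_value(player_cards)
    dealer, _ = hand_value(dealer_cards)
    if player > 21:   # player busts
        return -1
    if dealer > 21:   # dealer busts
        return 1
    # Otherwise the sum closer to 21 wins.
    return (player > dealer) - (player < dealer)


print(hand_value([1, 6]))      # ace counted as 11
print(hand_value([1, 6, 10]))  # ace falls back to 1
```

Returning the usable-ace flag together with the total mirrors the last element of the observation tuple described above.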


In [53]:
"""
Observation tuple for Blackjack:
1. Sum of the player's cards (a usable ace counts as 11)
2. The dealer's one showing card (1-10, where 1 is an ace)
3. Whether the player holds a usable ace
"""

observation_space = env.observation_space.spaces
observation_space

(Discrete(32), Discrete(11), Discrete(2))

In [54]:
# Resets the state of the environment and returns an initial observation.

observation = env.reset()
observation

(16, 4, False)

In [55]:
action = 0
observation, reward, done, info = env.step(action)
observation, reward, done, info

((16, 4, False), -1.0, True, {})

## Policy
For each observation we need an action:

In [56]:
# Some indices in this table correspond to states that cannot occur; the table is oversized for convenience.
state_space_size = (33, 12, 2)

policy = np.zeros(state_space_size, dtype=int)
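
The all-zeros table above encodes a policy that always sticks. As a hedged illustration of a less trivial table, the sketch below builds the policy considered in Sutton and Barto's Example 5.1, which sticks only on a player sum of 20 or 21 and hits otherwise. The name `threshold_policy` is hypothetical and not part of this notebook.

```python
import numpy as np

# Same table shape as in the notebook: (player sum, dealer card, usable ace).
state_space_size = (33, 12, 2)

# Start from "always stick" (0), then hit (1) whenever the player's sum is below 20.
threshold_policy = np.zeros(state_space_size, dtype=int)
threshold_policy[:20, :, :] = 1

print(threshold_policy[19, 5, 0])  # action for a player sum of 19
print(threshold_policy[20, 5, 0])  # action for a player sum of 20
```

Because the table is indexed by the cleaned observation tuple, such a policy could be passed anywhere the all-zeros `policy` is used.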

In [57]:
def observation_clean(observation):
  return (observation[0], observation[1], int(observation[2]))

observation = observation_clean(observation)
policy[observation]

0

## Monte Carlo Policy Evaluation


In [58]:
def run_episode(policy, env=env):
  """Play one episode following `policy`; return a list of (s, a, s', R) tuples."""
  steps = []
  observation = observation_clean(env.reset())
  done = False
  # Dummy first entry so the initial state also appears in the list.
  steps.append((None, None, observation, 0))  # State, Action, Next State, Reward

  while not done:
    action = policy[observation]
    observation_action = (observation, action)
    observation, reward, done, info = env.step(action)
    observation = observation_clean(observation)
    steps.append(observation_action + (observation, int(reward)))

  return steps  # list of tuples: (s, a, s', R)


In [59]:
run_episode(policy)

[(None, None, (16, 6, 0), 0), ((16, 6, 0), 0, (16, 6, 0), 1)]

### Side Note: Python `reversed` Function

Python's built-in `reversed()` function returns an iterator that accesses the given sequence in reverse order.

#### Code Example:

```Python
seqTuple = ('g', 'e', 'e', 'k', 's')
print(list(reversed(seqTuple)))

# Output:
# ['s', 'k', 'e', 'e', 'g']
```
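
Iterating backwards is convenient for Monte Carlo methods because each discounted return can be built from the one that follows it, G_t = r_{t+1} + gamma * G_{t+1}. A minimal sketch, assuming a plain list of rewards; the helper name `discounted_returns` is hypothetical:

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where G_t = r_{t+1} + gamma * G_{t+1}."""
    returns = []
    G = 0.0
    # Start from the last reward and fold earlier rewards in one at a time.
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()  # restore chronological order
    return returns


print(discounted_returns([0, 0, 1], gamma=0.5))  # → [0.25, 0.5, 1.0]
```

A single backward pass computes every return in linear time, whereas summing forward from each time step would repeat work.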

In [60]:
gamma = 0.99

N = np.zeros(state_space_size, dtype=int)
S = np.zeros(state_space_size)

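# Every-visit Monte Carlo
nb_of_episodes = 100
for e in range(nb_of_episodes):
  observations_reward = run_episode(policy)
  G = 0.
  # Walk the episode backwards so each return builds on the one after it.
  for o0, a, o, r in reversed(observations_reward):
    if o0 is None:
      continue  # dummy first entry: no state or action to credit
    G = r + gamma * G
    N[o0] += 1  # credit the state the action was taken from
    S[o0] += G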
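
Once visit counts and return sums have been accumulated, the Monte Carlo estimate of a state's value is its average return, S/N, for every state visited at least once. A minimal sketch, using small made-up arrays as stand-ins for the notebook's `N` and `S` tables; the name `V` and the toy numbers are assumptions for illustration only:

```python
import numpy as np

# Toy stand-ins for the notebook's tables (made-up numbers, 3 states).
N = np.array([2, 0, 4])         # visit counts
S = np.array([1.0, 0.0, -2.0])  # summed returns

# Average return per state; unvisited states (N == 0) keep the default value 0.
V = np.divide(S, N, out=np.zeros_like(S), where=N > 0)

print(V)
```

Passing `where=N > 0` avoids dividing by zero for states that never occurred, which matters here because the table is larger than the set of reachable states.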

In [66]:
observations_reward = run_episode(policy)
print(observations_reward)
print(list(reversed(observations_reward)))
(o0, a, o, r) = list(reversed(observations_reward))[0]
print(o0, a, o, r, sep='\n')

[(None, None, (20, 7, 0), 0), ((20, 7, 0), 0, (20, 7, 0), 0)]
[((20, 7, 0), 0, (20, 7, 0), 0), (None, None, (20, 7, 0), 0)]
(20, 7, 0)
0
(20, 7, 0)
0