# Lecture: Implementation of Monte Carlo Policy Evaluation for Blackjack

We want to implement Monte Carlo policy evaluation for the `Blackjack-v1` environment provided by Gymnasium.

Blackjack is a card game in which the aim is to beat the dealer by holding cards whose total is closer to 21 than the dealer's, without exceeding 21. The game begins with the dealer having one face-up and one face-down card, while the player has two face-up cards. All cards are drawn from an infinite deck (i.e. with replacement).

- Face cards (Jack, Queen, King) have a point value of 10.
- Aces can count either as 11 (a so-called "usable ace") or as 1.
- Numerical cards (2-10) have a value corresponding to their number.

The player sees the sum of the cards held and can request more cards (hit) until deciding to stop (stick) or exceeding 21 (bust, an immediate loss).
After the player has stopped, the dealer reveals his face-down card and draws cards until the total is 17 or more. If the dealer goes bust, the player wins.
If neither the player nor the dealer busts, the result (win, loss, tie) is determined by whose total is closer to 21.
To analyse different strategies, we use the gymnasium environment `Blackjack-v1`, the description of which can be found [here](https://gymnasium.farama.org/environments/toy_text/blackjack/).
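The ace rule above can be made concrete with a small helper. This is only an illustrative sketch and is not taken from the environment's source: the function name `hand_value` and the card encoding (ace as `1`, face cards as `10`) are assumptions made for this example.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values.

    Assumed encoding for this sketch: ace = 1, face cards = 10.
    """
    total = sum(cards)
    # An ace is "usable" if counting one ace as 11 does not bust the hand.
    usable_ace = 1 in cards and total + 10 <= 21
    if usable_ace:
        total += 10
    return total, usable_ace


print(hand_value([1, 6]))      # ace counted as 11
print(hand_value([1, 6, 10]))  # ace must count as 1
```

At most one ace can count as 11, since two aces counted as 11 would already total 22.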

In [None]:
!git clone https://github.com/Fjoelsak/RL.git
!cp RL/04_Mode_free_prediction/mc_eval_agent.py ./

### Exercises

#### Task 1: Getting to know the environment

Go to the [Farama Foundation documentation](https://gymnasium.farama.org/environments/toy_text/blackjack/) and determine how the state and action spaces are defined, how the reward function is implemented and under which conditions an episode terminates. Which actions are available and how are they encoded, and what information does the agent receive through the observations?

In [None]:
import gymnasium as gym

env = gym.make('Blackjack-v1')

# Checking action and state space

#### Task 2

Play ten time steps as an agent that selects random actions and inspect the action, observation and reward at each time step.
Also indicate when an episode ends.
Try to retrace the individual games using the rules described above.

In [None]:
# YOUR CODE HERE

#### Task 3

Implement the policy presented in the lecture, in which you always draw a card as long as the sum of your cards is less than or equal to 19.
Test your new policy with the setup from Task 2 and follow the individual steps.
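One possible shape for such a threshold policy is sketched below. It assumes the observation is the tuple `(player_sum, dealer_card, usable_ace)` and that the actions are encoded as `0` (stick) and `1` (hit), as described in the Gymnasium documentation. The names `threshold_policy`, `STICK` and `HIT` are chosen for this example only.

```python
STICK, HIT = 0, 1  # action encoding assumed from the Blackjack-v1 documentation


def threshold_policy(observation):
    """Hit while the player's sum is at most 19, otherwise stick."""
    player_sum, _dealer_card, _usable_ace = observation
    return HIT if player_sum <= 19 else STICK


print(threshold_policy((14, 10, False)))  # sum 14 is at most 19, so hit
print(threshold_policy((20, 10, False)))  # sum 20 exceeds 19, so stick
```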

In [None]:
def simple_policy(observation):
    # YOUR CODE HERE
    pass

#### Task 4

Now refactor your code. We want a separate class, `MCEvalAgent`. In `mc_eval_agent.py`, implement the function `gen_eps(env, policy)`, which runs a single episode in the given environment using the policy defined above. While the episode runs, record all `states`, `actions` and `rewards` and return them as three lists. Test your refactored code by playing three episodes of blackjack and printing the corresponding states, actions and rewards.

Note: remember to reset the environment at the start of each episode.
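The episode loop could look roughly like the sketch below. It is written as a standalone function named `generate_episode` (a name chosen for this example, not the contents of `mc_eval_agent.py`) and relies only on the Gymnasium conventions that `env.reset()` returns `(observation, info)` and `env.step(action)` returns `(observation, reward, terminated, truncated, info)`.

```python
def generate_episode(env, policy):
    """Run one episode and return the visited states, actions and rewards."""
    states, actions, rewards = [], [], []

    # Reset first so that every episode starts from a fresh deal.
    observation, _info = env.reset()
    done = False

    while not done:
        action = policy(observation)
        next_observation, reward, terminated, truncated, _info = env.step(action)

        # Store the state in which the action was taken, not the successor.
        states.append(observation)
        actions.append(action)
        rewards.append(reward)

        observation = next_observation
        done = terminated or truncated

    return states, actions, rewards
```

With this convention, `rewards[t]` is the reward received after taking `actions[t]` in `states[t]`, so the three lists always have the same length.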

In [None]:
import gymnasium as gym
import time
from mc_eval_agent import MCEvalAgent

env = gym.make("Blackjack-v1", render_mode="human")

n_eps = 3   # number of episodes

agent = ...

for _ in range(n_eps):
    states, actions, rewards = agent.gen_eps(env, simple_policy)
    print("States: ", states)
    print("Actions: ", actions)
    print("Rewards: ", rewards)
    print("")

# Close the env
env.close()

#### Task 5

Then implement the first-visit Monte Carlo policy evaluation algorithm in the function `eval(env, n_episodes, policy)`. It estimates the value of each state as the mean of the returns observed following the first visit to that state in each episode, averaged over the specified number of episodes. Use a discount factor of $\gamma = 0.9$.
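The core of first-visit Monte Carlo evaluation can be sketched independently of the environment. The function below is an illustration under stated assumptions: the name `first_visit_mc` and the input format (a list of `(states, rewards)` pairs, where `rewards[t]` follows the action taken in `states[t]`) are chosen for this example. It computes returns backwards via $G_t = R_{t+1} + \gamma G_{t+1}$ and averages only the return from the first occurrence of each state per episode.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=0.9):
    """Estimate state values from recorded episodes (first-visit Monte Carlo)."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for states, rewards in episodes:
        # Time index of the first occurrence of each state in this episode.
        first_visit = {}
        for t, state in enumerate(states):
            first_visit.setdefault(state, t)

        # Walk backwards so the discounted return can be updated incrementally.
        g = 0.0
        for t in reversed(range(len(states))):
            g = rewards[t] + gamma * g
            if first_visit[states[t]] == t:
                returns_sum[states[t]] += g
                returns_count[states[t]] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# Toy episodes with hypothetical states "a" and "b".
toy_episodes = [(["a", "b", "a"], [0.0, 0.0, 1.0]), (["b"], [-1.0])]
print(first_visit_mc(toy_episodes))
```

In the toy data, the second occurrence of `"a"` in the first episode is ignored, which is exactly what distinguishes first-visit from every-visit evaluation.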

In [None]:
import gymnasium as gym
import time
from mc_eval_agent import MCEvalAgent

env = gym.make("Blackjack-v1")

agent = ...
value = agent.eval(env, 500000, simple_policy)
env.close()

Use the provided plotting function `plot_blackjack(V)` to reproduce the figures from the lecture, i.e. the estimated value functions for `n_episodes` = 10,000 and 500,000.

In [None]:
agent.plot_blackjack(value)

#### Task 6

We consider the following policy: if the sum of the player's cards is greater than 18, we choose the action stick with 80% probability and the action hit with 20% probability. If the sum of the player's cards is less than or equal to 18, we choose the action stick with 20% probability and the action hit with 80% probability. What does the state-value function look like?
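A stochastic policy of this kind can be sketched by drawing one uniform random number per decision. As before, the action encoding `0` (stick) and `1` (hit) follows the Gymnasium documentation, while the name `stochastic_policy_sketch` and the optional `rng` parameter (added so the behaviour can be seeded) are assumptions for this example.

```python
import random

STICK, HIT = 0, 1  # action encoding assumed from the Blackjack-v1 documentation


def stochastic_policy_sketch(observation, rng=random):
    """Stick with probability 0.8 above a sum of 18, otherwise with probability 0.2."""
    player_sum = observation[0]
    p_stick = 0.8 if player_sum > 18 else 0.2
    return STICK if rng.random() < p_stick else HIT


# Empirical check of the action frequencies with a seeded generator.
rng = random.Random(0)
samples = [stochastic_policy_sketch((20, 10, False), rng) for _ in range(10_000)]
print("fraction of stick at sum 20:", samples.count(STICK) / len(samples))
```

Because the policy is random, the same state can yield different actions across episodes, so the value estimates average over both the card draws and the action choices.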

In [None]:
def stoch_policy(observation):
    pass