# **Reinforcement Learning**
<img align="right" src="https://vitalflux.com/wp-content/uploads/2020/12/Reinforcement-learning-real-world-example.png">

- In reinforcement learning, your system learns how to interact intuitively with the environment by basically doing stuff and watching what happens.

if you need the last version of gym use block of code below:
```
!pip uninstall gym -y
!pip install gym
```

In [None]:
# !pip install -U gym==0.25.2
!pip install gym[atari]
!pip install autorom[accept-rom-license]
!pip install swig
!pip install gym[box2d]

In [20]:
import random
import numpy as np
import matplotlib.pyplot as plt
import gym
from IPython.core.display import HTML
from base64 import b64encode
from gym.wrappers import record_video, record_episode_statistics
from gym.wrappers import RecordVideo, RecordEpisodeStatistics
import torch
import os
os.environ["SDL_VIDEODRIVER"] = "dummy"

import warnings
warnings.filterwarnings('ignore')

In [3]:
def display_video(episode=0, video_width=600, video_dir= "/content/video"):
    video_path = os.path.join(video_dir, f"rl-video-episode-{episode}.mp4")
    video_file = open(video_path, "rb").read()
    decoded = b64encode(video_file).decode()
    video_url = f"data:video/mp4;base64,{decoded}"
    return HTML(f"""<video width="{video_width}"" controls><source src="{video_url}"></video>""")

def create_env(name, render_mode=None, video_folder='/content/video'):
    # render mode: "human", "rgb_array", "ansi")
    env = gym.make(name, new_step_api=True, render_mode=render_mode)
    env = RecordVideo(env, video_folder=video_folder, episode_trigger=lambda x: x % 50 == 0)
    env = RecordEpisodeStatistics(env)
    return env

def show_reward(total_rewards):
    plt.plot(total_rewards)
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.show()

## **Monte Carlo Learning**

In the previous Notebook, we evaluated and solved a **Markov Decision Process (MDP)** using **dynamic programming (DP)**. **Model-based** methods such as DP have some drawbacks. They require the environment to be fully known, including the transition matrix and reward matrix. They also have limited scalability, especially for environments with plenty of states

**model-free approach**, the Monte Carlo (MC) methods, which have no requirement of prior knowledge of the environment and are much more scalable than DP.

The term **Monte Carlo** is often used more broadly for any estimation method. Monte Carlo methods require only experience, sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.

It is a method for estimating Value-action(Value|State, Action) or Value function(Value|State) using some sample runs from the environment for which we are estimating the Value function

<br>

- **types of Monte Carlo learning:**
1. $\textit{First Visit Monte Carlo}$: First visit estimates (Value|State: S1) as the average of the returns following the first visit to the state S1
2. $\textit{Every Visit Monte Carlo}$: It estimates (Value|State: S1) as the average of returns for every visit to the State S1.
- **Example:**

$$
\textit{First iteration: }A+3 \rightarrow A+2 \rightarrow B-4 \rightarrow A+4 \rightarrow
B-3 \rightarrow terminated \\
\textit{SEcond iteration: }B-2 \rightarrow A+3 \rightarrow B-3 \rightarrow terminated
$$
<br>

<table width='1200'>
    <tr>
        <td><font color="Red" size='4'>$$\textit{First Visit}$$<td>
        <td><font color="Red" size='4'>$$\textit{V(a)}$$<td>
        <td><font color="Red" size='4'>$$\textit{V(b)}$$<td>
    <tr>
    <tr>
        <td><font color="Olive" size='4'> $$\textit{First iteraion}$$<td>
        <td><font color="Olive" size='4'>$$3+2-4+4-3=2$$<td>
        <td><font color="Olive" size='4'>$$-4+4–3=-3$$<td>
    <tr>
    <tr>
        <td><font color="Blue" size='4'>$$\textit{Second iteraion}$$<td>
        <td><font color="Blue" size='4'>$$3-3=0$$<td>
        <td><font color="Blue" size='4'>$$-2+3+-3=-2$$<td>
    <tr>
    <tr>
        <td><font color="black" size='4'>$$\textit{Sum}$$<td>
        <td><font color="black" size='4'>$$\frac{2+0}{2}=1$$<td>
        <td><font color="black" size='4'>$$\frac{-3-2}{2}=-2.5$$<td>
    <tr>
<table>

<br>


<table width='1200'>
    <tr>
        <td><font color="red" size='4'>$$\textit{Every Visit}$$<td>
        <td><font color="red" size='4'>$$\textit{V(a)}$$<td>
        <td><font color="red" size='4'>$$\textit{V(b)}$$<td>
    <tr>
    <tr>
        <td><font color="olive" size='4'> $$\textit{First iteraion}$$<td>
        <td><font color="olive" size='4'>$$(3+2-4+4-3)\\+(2-4+4-3)\\+(4-3)\\=2-1+1$$<td>
        <td><font color="olive" size='4'>$$(-4+4-3)+(-3)=-3+-3$$<td>
    <tr>
    <tr>
        <td><font color="blue" size='4'>$$\textit{Second iteraion}$$<td>
        <td><font color="blue" size='4'>$$3-3=0$$<td>
        <td><font color="blue" size='4'>$$(-2+3–3)+(-3)=-2+-3$$<td>
    <tr>
    <tr>
        <td><font color="black" size='4'>$$\textit{Sum}$$<td>
        <td><font color="black" size='4'>$$\frac{2+-1+1+0}{4}=0.5$$<td>
        <td><font color="black"size='4'>$$\frac{-3+-3+-2-3}{4}=-2.75$$<td>
    <tr>
<table>

<br>

**Note**:As we have been given 2 different iterations, we will be summing all the rewards coming after A (including that of A) after the first visit to ‘A’. It must be noted that if an episode doesn’t have an occurence of ‘A’, it won’t be considered in the average.


## **Performing Monte Carlo policy evaluation**

- A reinforcement learning algorithm that needs a known MDP is categorized as a **model-based algorithm**.
- On the other hand, one with no requirement of prior knowledge of transitions and rewards is called a **model-free algorithm**. Monte Carlo-based reinforcement learning is a model-free approach.

we will evaluate the value function using the Monte Carlo method. assuming we don't have access to both of environment transition and reward matrices. You will recall that the returns of a process, which are the total rewards over the long run, are as follows:

$$
\large G_t=\sum_{k}^{\infty}γ^kR_{t+k+1}
$$

MC policy evaluation uses **empirical mean return** instead of **expected return** (as in DP) to estimate the value function.

- **Note**: in the Monte Carlo setting, we need to keep track of the states and rewards for all steps, since we don't have access to the full environment, including the transition probabilities and reward matrix.

<center><img width="500" src="https://i.stack.imgur.com/Q8YCg.png">

In [5]:
env = create_env("FrozenLake-v1")

In [6]:
def run_episode(env, policy):
    state = env.reset()
    rewards = []
    states = [state]
    is_done = False
    while not is_done:
        action = policy[state].item()
        state, reward, is_done, info = env.step(action)
        states.append(state)
        rewards.append(reward)
        if is_done:
            break
    states = torch.tensor(states)
    rewards = torch.tensor(rewards)
    return states, rewards

In [7]:
def mc_first_visit(env, policy, gamma, n_episode):
    n_state = env.observation_space.n
    V = torch.zeros(n_state)
    N = torch.zeros(n_state)
    for episode in range(n_episode):
        states, rewards = run_episode(env, policy)
        G = torch.zeros(n_state)
        first_visit = torch.zeros(n_state)
        return_t = 0
        for state_t, reward_t in zip(reversed(states[1:]), reversed(rewards)):
            return_t = reward_t + gamma * return_t
            G[state_t] = return_t
            first_visit[state_t] = 1
        for state in range(n_state):
            if first_visit[state] > 0:
                V[state] += G[state]
                N[state] += 1

    for state in range(n_state):
        if N[state] > 0:
            V[state] = V[state] / N[state]
    return V


def mc_every_visit(env, policy, gamma, n_episode):
    n_state = env.observation_space.n
    V = torch.zeros(n_state)
    N = torch.zeros(n_state)
    G = torch.zeros(n_state)
    for episode in range(n_episode):
        states, rewards = run_episode(env, policy)
        return_t = 0
        for state_t, reward_t in zip(reversed(states[1:]), reversed(rewards)):
            return_t = reward_t + gamma * return_t
            G[state_t] = return_t
            N[state_t] += 1

    for state in range(n_state):
        if N[state] > 0:
            V[state] = G[state] / N[state]
    return V


In [8]:
gamma = 1
# use policy from previous notebook
policy = torch.tensor([0., 3., 3., 3., 0., 3., 2., 3., 3., 1., 0., 3., 3., 2., 1., 3.]).long()
n_episode = 1000

first_visit = mc_first_visit(env, policy, gamma, n_episode)
print(f"The value function calculated by first visit MC:\n{first_visit}\n ")

The value function calculated by first visit MC:
tensor([0.7110, 0.4236, 0.4150, 0.3333, 0.7280, 0.0000, 0.3483, 0.0000, 0.7280,
        0.7295, 0.6433, 0.0000, 0.0000, 0.7789, 0.8761, 1.0000])
 


In [9]:
every_visit = mc_every_visit(env, policy, gamma, n_episode)
print(f"The value function calculated by every visit MC:\n{first_visit}\n ")

The value function calculated by every visit MC:
tensor([0.7110, 0.4236, 0.4150, 0.3333, 0.7280, 0.0000, 0.3483, 0.0000, 0.7280,
        0.7295, 0.6433, 0.0000, 0.0000, 0.7789, 0.8761, 1.0000])
 


## **BlackJack**
<img align='right' width='400' src="https://www.gymlibrary.dev/_images/blackjack.gif">

Card Values:

- Face cards (Jack, Queen, King) have a point value of 10.
- Aces can either count as 11 (called a ‘usable ace’) or 1.
- Numerical cards (2-9) have a value equal to their number.

Action Space:

- There are two actions: stick (0), and hit (1).

Observation Space:

- The observation consists of a 3-tuple containing: the player’s current sum, the value of the dealer’s one showing card (1-10 where 1 is ace), and whether the player holds a usable ace (0 or 1).

In [10]:
env = create_env("Blackjack-v1")

In [11]:
from collections import defaultdict

def run_episode(env, hold_score):
    state = env.reset()
    rewards = []
    states = [state]
    is_done = False
    while not is_done:
        action = 1 if state[0] < hold_score else 0
        state, reward, is_done, info = env.step(action)
        states.append(state)
        rewards.append(reward)
        if is_done:
            break
    states = torch.tensor(states)
    rewards = torch.tensor(rewards)
    return states, rewards

def mc_first_visit_blackjack(env, hold_score, gamma, n_episode):
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in range(n_episode):
        states, rewards = run_episode(env, hold_score)
        G = {}
        return_t = 0
        for state_t, reward_t in zip(reversed(states[1:]), reversed(rewards)):
            return_t = reward_t + gamma * return_t
            G[state_t] = return_t

        for state, return_t in G.items():
            if state[0] <= 21:
                V[state] += return_t
                N[state] += 1

    for state in V:
        V[state] = V[state] / N[state]
    return V

In [12]:
env = create_env("Blackjack-v1", render_mode = "human")
hold_score = 18
gamma = 1
n_episode = 500
value = mc_first_visit_blackjack(env, hold_score, gamma, n_episode)

In [13]:
len(value)

647

## **Performing on-policy Monte Carlo control**

Monte Carlo prediction is used to evaluate the value for a given policy, while Monte Carlo control (MC control) is for finding the optimal policy when such a policy is not given. There are basically categories of MC control: **on-policy** and **off-policy**.
- On-policy methods learn about the optimal policy by executing the policy and evaluating and improving it
- off-policy methods learn about the optimal policy using data generated by another policy.

**Note:** The way on-policy MC control works is quite similar to policy iteration in dynamic programming, which has two phases, evaluation and improvement:

- **In the evaluation phase**, instead of evaluating the value function (also called the state value, or utility), it evaluates the action-value. The action-value is more frequently called the **Q-function**, which is the utility of a state-action pair (s, a) by taking action a in state s under a given policy. Again, the evaluation can be conducted in a first-visit manner or an every-visit manner.

- **In the improvement phase**, the policy is updated by assigning the optimal action to each state:

<br>

$$
\large\pi(s) = argmax_(a)Q(s, a)
$$

In [31]:
from collections import defaultdict

def run_episode(env, Q, n_action):
    state = env.reset()
    rewards = []
    actions =[]
    states = []
    is_done = False
    action = torch.randint(0, n_action, [1]).item()
    while not is_done:
        actions.append(action)
        states.append(state)
        state, reward, is_done, info= env.step(action)
        rewards.append(reward)
        if is_done:
            break
        action = torch.argmax(Q[state]).item()
    return states, actions, rewards

def mc_on_policy(env, gamma, n_episode):
    n_action = env.action_space.n
    G_sum = defaultdict(float)
    N = defaultdict(int)
    Q = defaultdict(lambda: torch.empty(env.action_space.n))
    for episode in range(n_episode):
        if episode % 100 == 0 and episode > 0:
            print(f"-- Episode: {episode}")

        states, actions, rewards = run_episode(env, Q, n_action)
        G = {}
        return_t = 0
        for state_t, action_t, reward_t in zip(states[::-1], actions[::-1], rewards[::-1]):
            return_t = reward_t + gamma * return_t
            G[state_t, action_t] = return_t

        for state_action, return_t in G.items():
            state, action = state_action
            if state[0] <= 21:
                G_sum[state_action] += return_t
                N[state_action] += 1
                Q[state][action] = G_sum[state_action] / N[state_action]

    policy = {}
    for state, actions in Q.items():
        policy[state] = torch.argmax(actions).item()
    return Q, policy

In [32]:
env = create_env("Blackjack-v1", render_mode = "human")
gamma = 1
n_episode = 500
optimal_Q, optimal_policy = mc_on_policy(env, gamma, n_episode)

-- Episode: 100
-- Episode: 200
-- Episode: 300
-- Episode: 400


In [33]:
optimal_value = defaultdict(float)

for state, action_values in optimal_Q.items():
    optimal_value[state] = torch.max(action_values).item()

print(optimal_value)

defaultdict(<class 'float'>, {(19, 10, False): 0.0, (18, 10, False): 0.375, (18, 3, False): 0.0, (17, 4, False): 0.4000000059604645, (12, 4, False): 0.5, (16, 6, False): 1.0, (18, 4, True): 0.0, (15, 5, False): 0.5, (15, 5, True): 1.0, (12, 10, False): 0.0, (16, 7, False): 1.0, (17, 10, False): -1.0, (5, 1, False): 4.4373517171309657e-41, (17, 2, True): 0.0, (6, 2, False): -1.1774649297931685e-24, (14, 10, False): -0.3333333432674408, (21, 10, True): 1.0, (17, 7, False): -1.0, (12, 7, False): 1.0, (20, 10, False): 0.4285714328289032, (16, 3, False): -1.0, (16, 3, True): 1.0, (20, 2, False): 1.0, (13, 2, False): 0.0, (16, 2, False): -1.0, (18, 8, False): -1.0, (12, 6, False): 0.0, (20, 5, False): 0.5, (21, 5, True): 0.5, (5, 10, False): 0.0, (16, 5, False): 1.0, (19, 5, True): -1.0, (9, 3, False): 4.4373517171309657e-41, (6, 3, False): 3.6552925943597346e+21, (17, 1, False): -1.0, (19, 2, False): 1.0, (14, 2, False): 1.0, (20, 3, False): 0.3333333432674408, (17, 3, False): -1.0, (21, 8,

In [34]:
def simulate_episode(env, policy):
     state = env.reset()
     done = False
     while not done:
        if state in policy:
            action = policy[state]
        else:
            action = torch.randint(2, [1]).item()
        state, reward, done, info = env.step(action)
        if done:
            return reward

In [None]:
win = 0
for i in range(100):
    if i % 10 == 0:
        print(f"-- {i}")
    R = simulate_episode(env, optimal_policy)
    if R > 0:
        win += int(R)

In [36]:
print(f"{win} times win which means {win:.2f} % winning chanse")

35 times win which means 35.00 % winning chanse


### **Developing MC control with epsilon-greedy policy**

In MC control with **epsilon-greedy** policy, we no longer exploit the best action all the time, but choose an action randomly under certain probabilities. As the name implies, the algorithm has two folds:

Epsilon: given a parameter, $ε$, with a value from 0 to 1, each action is taken with a probability calculated as follows:

$$
\large \pi(s, a) = \frac{ε}{|A|}
$$
- Here, |A| is the number of possible actions.

Greedy: the action with the highest state-action value is favored, and its probability of being chosen is increased by $1-ε$:

$$
\large \pi(s, a) = 1 - ε + \frac{ε}{|A|}
$$

Epsilon-greedy policy exploits the best action most of the time and also keeps exploring different actions from time to time.

<br>

<center><img width="600" src="https://www.baeldung.com/wp-content/ql-cache/quicklatex.com-5b10393cf0c6395ae5fb22260220c574_l3.svg">

In [48]:
def take_action(state, Q, epsilon, n_action):
    p = np.random.random()
    if p < epsilon:
        return torch.randint(0, n_action, (1,)).item()
    else:
        return torch.argmax(Q[state]).item()


def take_action2(state, Q, epsilon, n_action):
    probs = torch.ones(n_action) * epsilon / n_action
    best_action = np.argmax(Q[state])
    probs[best_action] += (1.0 - epsilon)
    action = torch.multinomial(probs, 1).item()
    return action

def run_episode(env, Q, epsilon, n_action):
    state = env.reset()
    rewards = []
    actions =[]
    states = []
    is_done = False
    while not is_done:
        action = take_action2(state, Q, epsilon, n_action)
        actions.append(action)
        states.append(state)
        state, reward, is_done, info= env.step(action)
        rewards.append(reward)
        if is_done:
            break
        action = torch.argmax(Q[state]).item()
    return states, actions, rewards

In [43]:
def mc_epsilon_greedy(env, gamma, n_episode, epsilon):
    n_action = env.action_space.n
    G_sum = defaultdict(float)
    N = defaultdict(int)
    Q = defaultdict(lambda: torch.empty(env.action_space.n))
    for episode in range(n_episode):
        if episode % 100 == 0 and episode > 0:
            print(f"-- Episode: {episode}")

        states, actions, rewards = run_episode(env, Q, epsilon, n_action)
        G = {}
        return_t = 0
        for state_t, action_t, reward_t in zip(states[::-1], actions[::-1], rewards[::-1]):
            return_t = reward_t + gamma * return_t
            G[state_t, action_t] = return_t

        for state_action, return_t in G.items():
            state, action = state_action
            if state[0] <= 21:
                G_sum[state_action] += return_t
                N[state_action] += 1
                Q[state][action] = G_sum[state_action] / N[state_action]

    policy = {}
    for state, actions in Q.items():
        policy[state] = torch.argmax(actions).item()
    return Q, policy

In [44]:
env = create_env("Blackjack-v1", render_mode = "human")
gamma = 1
epsilon = 0.1
n_episode = 500
optimal_Q, optimal_policy = mc_epsilon_greedy(env, gamma, n_episode, epsilon)

-- Episode: 100
-- Episode: 200
-- Episode: 300
-- Episode: 400


In [None]:
def simulate_episode(env, policy):
     state = env.reset()
     done = False
     while not done:
        if state in policy:
            action = policy[state]
        else:
            action = torch.randint(2, [1]).item()
        state, reward, done, info = env.step(action)
        if done:
            return reward

win = 0
for i in range(100):
    if i % 10 == 0:
        print(f"-- {i}")
    R = simulate_episode(env, optimal_policy)
    if R > 0:
        win += int(R)

In [46]:
print(f"{win} times win which means {win:.2f} % winning chanse")

39 times win which means 39.00 % winning chanse


## **Performing off-policy Monte Carlo control**

The off-policy method optimizes the **target policy**, $\pi$, using data generated by another policy, called the **behavior policy**, $b$. The target policy performs **exploitation** all the time while the behavior policy is for **exploration** purposes. This means that the target policy is greedy with respect to its current Q-function, and the behavior policy generates behavior so that the target policy has data to learn from.

We start with the latest step whose action taken under the behavior policy is different from the action taken under the greedy policy. And to learn about the target policy with another policy, we use a technique called **importance sampling**, which is commonly used to estimate the expected value under a distribution, given samples generated from a different distribution. The weighted importance for a state-action pair is calculated as follows:

$$
\omega_t= \sum_{k=t}\frac{\pi(a_k|s_k)}{b{a_k|s_k}}$$

- Here, $π(a_k | s_k)$ is the probability of taking action $a_k$ in state $s_k$ under the target policy; $b(a_k | s_k)$ is the probability under the behavior policy; and the weight, $w_t$, is the multiplication of ratios between those two probabilities from step $t$ to the end of the episode. The weight, $w_t$, is applied to the return at step $t$.

In [67]:
def creat_random_policy(n_action):
    probs = torch.ones(n_action) / n_action
    def policy_fn(observation):
        return probs
    return policy_fn

def run_episode(env, random_policy):
    state = env.reset()
    rewards = []
    actions =[]
    states = []
    is_done = False
    while not is_done:
        probs = random_policy(state)
        action = torch.multinomial(probs, 1).item()
        actions.append(action)
        states.append(state)
        state, reward, is_done, info= env.step(action)
        rewards.append(reward)
        if is_done:
            break
    return states, actions, rewards

In [70]:
def mc_off_policy(env, gamma, n_episode, epsilon, behavior_policy):
    n_action = env.action_space.n
    G_sum = defaultdict(float)
    N = defaultdict(int)
    Q = defaultdict(lambda: torch.empty(env.action_space.n))
    for episode in range(n_episode):
        w = 1
        if episode % 100 == 0 and episode > 0:
            print(f"-- Episode: {episode}")

        states, actions, rewards = run_episode(env, behavior_policy)
        G = {}
        return_t = 0
        for state_t, action_t, reward_t in zip(states[::-1], actions[::-1], rewards[::-1]):
            return_t = reward_t + gamma * return_t
            G[state_t, action_t] = return_t
            if action_t != torch.argmax(Q[state_t]).item():
                break

            w *= 1. / behavior_policy(state_t)[action_t]
        for state_action, return_t in G.items():
            state, action = state_action
            if state[0] <= 21:
                G_sum[state_action] += return_t * w
                N[state_action] += 1
                Q[state][action] = G_sum[state_action] / N[state_action]

    policy = {}
    for state, actions in Q.items():
        policy[state] = torch.argmax(actions).item()
    return Q, policy

In [71]:
env = create_env("Blackjack-v1", render_mode = "human")
gamma = 1
epsilon = 0.1
n_episode = 500
random_policy = creat_random_policy(env.action_space.n)
optimal_Q, optimal_policy = mc_off_policy(env, gamma, n_episode, epsilon, random_policy)

-- Episode: 100
-- Episode: 200
-- Episode: 300
-- Episode: 400


In [None]:
def simulate_episode(env, policy):
     state = env.reset()
     done = False
     while not done:
        if state in policy:
            action = policy[state]
        else:
            action = torch.randint(2, [1]).item()
        state, reward, done, info = env.step(action)
        if done:
            return reward

win = 0
for i in range(100):
    if i % 10 == 0:
        print(f"-- {i}")
    R = simulate_episode(env, optimal_policy)
    if R > 0:
        win += int(R)

In [73]:
print(f"{win} times win which means {win:.2f} % winning chanse")

41 times win which means 41.00 % winning chanse
