# Reinforcement Learning

Reinforcement learning (RL) uses training information that evaluates the actions taken rather than instructing by giving the correct actions.
This is what creates the need for active exploration.

### Imports

We will use:
- NumPy for calculations
- Gymnasium for a reproducible RL environment
- Plotly for plotting graphs

In [85]:
import numpy as np
import gymnasium as gym

import plotly.graph_objects as go
from plotly.subplots import make_subplots

## K-armed Bandit Problem

A simple version of the k-armed bandit problem is useful because of its nonassociative nature. It is a good environment for learning basic reinforcement learning methods, since it avoids much of the complexity of the full reinforcement learning problem.

So let's create a simple Gymnasium environment that reproduces the k-armed bandit problem.

### The k-armed bandit problem

You are faced repeatedly with a choice among
k different options, or actions. After each choice you receive a numerical reward chosen
from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example,
over 1000 action selections, or time steps.
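
To make the objective concrete, the value of an action can be written as the expected reward given that this action is selected. The symbols below ($q_*$, $A_t$, $R_t$) follow the usual bandit notation and are an assumption here, since the description above is given only in words:

$$
q_*(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]
$$

The agent does not know $q_*(a)$. It can only estimate it from the rewards it observes, and maximizing the expected total reward amounts to selecting the action with the largest $q_*(a)$ as often as possible.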

Here we use Gymnasium to create what we call an environment. Gymnasium is a framework that standardizes the environment API so that people can easily reproduce experiments.

### Gym

### K-armed bandit problem implementation

In [87]:
class KArmedBandit(gym.Env):
    """Stationary k-armed bandit with Gaussian arm values and Gaussian reward noise."""

    def __init__(self, nb_arms=10, nb_steps=1000, mean=0, variance=1, noise_variance=1):
        self._nb_arms = nb_arms
        self._nb_steps = nb_steps

        self._mean = mean
        self._noise_mean = 0
        self._variance = variance
        self._noise_variance = noise_variance

        self.action_space = gym.spaces.Discrete(nb_arms)
        # A bandit has a single, constant state
        self.observation_space = gym.spaces.Discrete(1)
    
    def step(self, action):
        self._step += 1
    
        reward = self._arms[action]
        # normal() expects a standard deviation, hence the square root of the variance
        reward_noise = self.np_random.normal(self._noise_mean, np.sqrt(self._noise_variance))
        terminated = self._step >= self._nb_steps

        info = { "is_optimal_action": int(action == np.argmax(self._arms)) }

        # Simplified return value: the standard Gymnasium step() returns
        # (observation, reward, terminated, truncated, info)
        return reward + reward_noise, terminated, info

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)

        self._step = 0
        # Draw the true value q*(a) of each arm
        self._arms = self.np_random.normal(self._mean, np.sqrt(self._variance), size=self._nb_arms)

### Plotting function

In [154]:
def plot_results(main_title, titles, results):
    fig = make_subplots(rows=1, cols=2, subplot_titles=titles)

    for result_name in results:
        x = np.arange(len(results[result_name]["mean_reward"]))
        fig.add_trace(go.Scatter(x=x, y=results[result_name]["mean_reward"], line_color=results[result_name]["color"], name=result_name), row=1, col=1)
        fig.add_trace(go.Scatter(x=x, y=results[result_name]["optimal_action"], line_color=results[result_name]["color"], showlegend=False), row=1, col=2)

    fig.update_layout(
        title=main_title,
        legend_title="Parameters",
    )

    fig.show()

### Testing gym env

In [88]:
nb_arms  = 10
nb_steps = 1000

env = KArmedBandit(nb_arms=nb_arms, nb_steps=nb_steps)

Let's check that the implementation of the k-armed bandit is correct:

In [89]:
env.reset()

# Sample the reward distribution of each arm to check that it is correct
data = np.array([[env.step(i)[0] for _ in range(2000)] for i in range(len(env._arms))])

fig = go.Figure()

for i in range(len(env._arms)):
    fig.add_trace(go.Violin(x=[i] * len(data[i]), y=data[i], name="q*(" + str(i) + ") = " + str(env._arms[i])[:4], meanline_visible=True))

fig.update_layout(
    title="K-Armed Bandit Problem Distribution",
    xaxis_title="Actions",
    yaxis_title="Reward Distributions",
    legend_title="True Value of q*(a)",
)

fig.show()

The true value q*(a) of each action is close to the mean of its sampled distribution, so the implementation looks correct.

### Epsilon Greedy Algorithm

In [90]:
class EpsilonGreedy():
    """Epsilon-greedy agent with sample-average action-value estimates."""

    def __init__(self, nb_actions, epsilon):
        self.nb_actions = nb_actions
        self.epsilon = epsilon

        self.q = np.zeros(self.nb_actions)
        # Counts start at zero so that the first update sets q[action] to the first reward
        self.nb_action_taken = np.zeros(self.nb_actions)

    def action(self):
        take_random_action_prob = np.random.uniform(0, 1)

        # Explore with probability epsilon, otherwise exploit the current estimates
        if take_random_action_prob < self.epsilon:
            return np.random.randint(0, self.nb_actions)
        else:
            return np.argmax(self.q)
    
    def observe(self, action, reward):
        # Incremental sample average: Q <- Q + (R - Q) / N
        self.nb_action_taken[action] += 1
        self.q[action] += (reward - self.q[action]) / self.nb_action_taken[action]
    
    def reset(self):
        self.q = np.zeros(self.nb_actions)
        self.nb_action_taken = np.zeros(self.nb_actions)
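
The `observe` method relies on the incremental form of the sample average, which avoids storing every past reward. As a minimal sketch, assuming the counter starts at zero and using plain Python rather than NumPy to stay dependency-free (the helper name `incremental_mean` is hypothetical), the incremental update can be checked against the ordinary mean:

```python
import random


def incremental_mean(rewards):
    """Apply Q += (R - Q) / n to each reward in turn and return the final Q."""
    q = 0.0
    n = 0
    for reward in rewards:
        n += 1
        q += (reward - q) / n
    return q


# Noisy rewards around a true value of 1.0 (seeded for reproducibility)
rng = random.Random(0)
rewards = [rng.gauss(1.0, 1.0) for _ in range(1000)]

batch_mean = sum(rewards) / len(rewards)
running_mean = incremental_mean(rewards)

# Both estimates agree up to floating-point error
print(abs(batch_mean - running_mean) < 1e-9)
```

The two estimates should match up to rounding, which is why the agent only needs to keep `q` and a counter per action.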

In ε-greedy action selection, for the case of two actions and ε = 0.5, what is
the probability that the greedy action is selected?

With probability 1 - ε = 0.5 the agent exploits and takes the greedy action directly. With probability ε = 0.5 it picks an action uniformly at random, and that random pick is the greedy action half of the time.
So the answer is 0.5 + (0.5 * 0.5) = 0.75.
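
The same reasoning generalizes to any ε and any number of actions. The small helper below is a sketch (the function name is hypothetical, and it assumes there is a single greedy action, so ties are ignored):

```python
def greedy_selection_probability(epsilon, nb_actions):
    """Probability that epsilon-greedy selects the (unique) greedy action."""
    exploit = 1.0 - epsilon                  # greedy action chosen on purpose
    explore_hits_greedy = epsilon / nb_actions  # greedy action chosen by chance
    return exploit + explore_hits_greedy


print(greedy_selection_probability(epsilon=0.5, nb_actions=2))  # 0.75
```

With 10 actions and ε = 0.1, the same formula gives a probability of about 0.91.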

In [91]:
agent = EpsilonGreedy(nb_actions=10, epsilon=0.01)

In [92]:
def run_env(env, agent):
    list_of_reward = []
    list_of_optimal_action = []

    env.reset()
    agent.reset()

    terminated = False

    while not terminated:
        action = agent.action()

        reward, terminated, info = env.step(action)

        agent.observe(action, reward)

        list_of_reward.append(reward)
        list_of_optimal_action.append(info["is_optimal_action"])
    
    return np.array(list_of_reward), np.array(list_of_optimal_action)

### Running the first experiment

In [93]:
list_of_reward, list_of_optimal_action = run_env(env, agent)

In [94]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(len(list_of_reward)), y=list_of_reward, mode='lines'))
fig.show()

In [95]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(len(list_of_optimal_action)), y=list_of_optimal_action, mode='lines', name='Optimal action'))
fig.show()

It's hard to see any result here: there is too much noise in a single run.

Repeating this for 2000 independent runs,
each with a different bandit problem, we obtained measures of the learning algorithm’s
average behavior.
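
Averaging helps because, for independent noise, the spread of an average over n runs shrinks roughly as 1/sqrt(n). The sketch below illustrates this with unit-variance Gaussian noise only. It does not run the bandit itself, the helper name is hypothetical, and it uses 400 runs instead of 2000 to keep it fast:

```python
import random
import statistics


def spread_of_averaged_reward(nb_runs, nb_trials=200, seed=0):
    """Standard deviation of a reward averaged over nb_runs noisy samples."""
    rng = random.Random(seed)
    averages = []
    for _ in range(nb_trials):
        samples = [rng.gauss(0.0, 1.0) for _ in range(nb_runs)]
        averages.append(sum(samples) / nb_runs)
    return statistics.pstdev(averages)


single_run = spread_of_averaged_reward(nb_runs=1)
many_runs = spread_of_averaged_reward(nb_runs=400)

# The averaged curve is far less noisy than a single run
print(single_run, many_runs)
```

The second number should be roughly 20 times smaller than the first, since sqrt(400) = 20.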

In [96]:
def run_exp(nb_exps, env, agent):
    list_rewards, list_optimal_action = run_env(env, agent)

    for _ in range(nb_exps - 1):
        list_rewards_tmp, list_optimal_action_tmp = run_env(env, agent)

        list_rewards += list_rewards_tmp
        list_optimal_action += list_optimal_action_tmp

    return list_rewards / nb_exps, (list_optimal_action / nb_exps) * 100

In [97]:
nb_exps = 2000

In [98]:
agent_01 = EpsilonGreedy(nb_actions=env.action_space.n, epsilon=0.01)
mean_rewards_01, percent_optimal_action_01 = run_exp(nb_exps, env, agent_01)

In [99]:
agent_1 = EpsilonGreedy(nb_actions=env.action_space.n, epsilon=0.1)
mean_rewards_1, percent_optimal_action_1 = run_exp(nb_exps, env, agent_1)

In [100]:
agent_greedy = EpsilonGreedy(nb_actions=env.action_space.n, epsilon=0.0)
mean_rewards_0, percent_optimal_action_0 = run_exp(nb_exps, env, agent_greedy)

In [178]:
results_exp_1 = {
    'Egreedy 0.01': {
        "mean_reward": mean_rewards_01,
        "optimal_action": percent_optimal_action_01,
        "color": "red"
    },
    'Egreedy 0.1': {
        "mean_reward": mean_rewards_1,
        "optimal_action": percent_optimal_action_1,
        "color": "blue"
    },
    'Greedy (0.0)': {
        "mean_reward": mean_rewards_0,
        "optimal_action": percent_optimal_action_0,
        "color": "green"
    },
}

plot_results(
    "Comparison of epsilon-greedy with different epsilon values (0.01, 0.1 and 0)",
    ["Average Reward / Steps", "Optimal Action / Steps"],
    results_exp_1
)

### Optimistic greedy

In [172]:
class EpsilonGreedy():

    def __init__(self, nb_actions, epsilon=0.1, alpha=0.1, optimistic_value=0):
        self.nb_actions = nb_actions
        self.epsilon = epsilon
        self.alpha = alpha
        self.optimistic_value = optimistic_value

        self.q = np.ones(self.nb_actions) * optimistic_value

    def action(self):
        take_random_action_prob = np.random.uniform(0, 1)

        if take_random_action_prob < self.epsilon:
            return np.random.randint(0, self.nb_actions)
        else:
            return np.argmax(self.q)
    
    def observe(self, action, reward):
        self.q[action] += self.alpha * (reward - self.q[action])
    
    def reset(self):
        self.q = np.ones(self.nb_actions) * self.optimistic_value
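
With a constant step size `alpha`, the influence of the initial estimate decays geometrically: after n updates of the same action it is weighted by (1 - alpha)^n. The sketch below checks this with hypothetical helper names and the simplifying assumption that every reward is 0, so that whatever remains in the estimate comes from the optimistic initial value alone:

```python
def constant_step_estimate(initial_value, rewards, alpha):
    """Apply Q += alpha * (R - Q) to each reward and return the final Q."""
    q = initial_value
    for reward in rewards:
        q += alpha * (reward - q)
    return q


def initial_value_weight(alpha, nb_updates):
    """Weight still carried by the initial estimate after nb_updates updates."""
    return (1.0 - alpha) ** nb_updates


alpha = 0.1
optimistic_value = 5.0

for nb_updates in (1, 10, 50):
    remaining = constant_step_estimate(optimistic_value, [0.0] * nb_updates, alpha)
    predicted = optimistic_value * initial_value_weight(alpha, nb_updates)
    print(nb_updates, remaining, predicted)
```

This is why optimism only drives exploration early on: after about 50 pulls of an arm, less than 1% of its initial value is left in the estimate.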

In [173]:
nb_exps = 2000

In [174]:
agent_optimistic_greedy = EpsilonGreedy(nb_actions=env.action_space.n, epsilon=0, alpha=0.1, optimistic_value=5)
mean_rewards_optimistic, percent_optimal_action_optimistic = run_exp(nb_exps, env, agent_optimistic_greedy)

In [175]:
agent_non_optimistic = EpsilonGreedy(nb_actions=env.action_space.n, epsilon=0.1, alpha=0.1, optimistic_value=0)
mean_rewards_non_optimistic, percent_optimal_action_non_optimistic = run_exp(nb_exps, env, agent_non_optimistic)

In [209]:
results_exp_2 = {
    'Greedy Optimistic': {
        "mean_reward": mean_rewards_optimistic,
        "optimal_action": percent_optimal_action_optimistic,
        "color": "blue"
    },
    'Egreedy Non Optimistic': {
        "mean_reward": mean_rewards_non_optimistic,
        "optimal_action": percent_optimal_action_non_optimistic,
        "color": "red"
    }
}

plot_results("Optimistic greedy vs non-optimistic epsilon-greedy (0.1)", ["Average Reward / Steps", "Optimal Action / Steps"], results_exp_2)

## Upper-Confidence-Bound

In [197]:
class UpperConfidenceBound():
    """UCB agent with sample-average estimates (alpha is accepted but not used)."""

    def __init__(self, nb_actions, confidence, alpha=0.1):
        self.nb_actions = nb_actions
        self.confidence = confidence
        self.alpha = alpha

        self.q = np.zeros(self.nb_actions)
        self.nb_action_taken = np.zeros(self.nb_actions)
        # Infinite bonus so that every action is tried once first
        self.upper_confidence = np.ones(self.nb_actions) * np.inf

    def action(self):
        return np.argmax(self.q + self.upper_confidence)
    
    def observe(self, action, reward):
        self.nb_action_taken[action] += 1

        self.q[action] += (reward - self.q[action]) / self.nb_action_taken[action]

        if 0 not in self.nb_action_taken:
            # Exploration bonus: c * sqrt(ln(t) / N(a)), with t the total number of pulls
            self.upper_confidence = self.confidence * np.sqrt(np.log(np.sum(self.nb_action_taken)) / self.nb_action_taken)
        else:
            # Some actions are still untried: drop the bonus of the one just taken
            self.upper_confidence[action] = 0
    
    def reset(self):
        self.q = np.zeros(self.nb_actions)
        self.nb_action_taken = np.zeros(self.nb_actions)
        self.upper_confidence = np.ones(self.nb_actions) * np.inf
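
The selection rule used by this class can be isolated as Q(a) + c * sqrt(ln(t) / N(a)), where N(a) is the number of times action a was taken and t is taken here to be the total number of pulls, as in the class above. The standalone sketch below uses plain Python and a hypothetical function name to show how the bonus favors rarely tried actions:

```python
import math


def ucb_scores(q_values, counts, confidence):
    """Return Q(a) + c * sqrt(ln(t) / N(a)) per action, with t = sum of counts.

    Untried actions (N(a) == 0) get an infinite score so they are picked first.
    """
    total = sum(counts)
    scores = []
    for q, n in zip(q_values, counts):
        if n == 0:
            scores.append(math.inf)
        else:
            scores.append(q + confidence * math.sqrt(math.log(total) / n))
    return scores


# Action 0 looks best but is well explored, action 1 was tried only twice,
# and action 2 was never tried.
scores = ucb_scores(q_values=[0.5, 0.4, 0.0], counts=[100, 2, 0], confidence=2)
print(scores)
```

Action 1 ends up with a higher score than action 0 despite its lower estimate, because its estimate is much less certain.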

In [198]:
nb_exps = 2000

In [199]:
upper_confidence_agent = UpperConfidenceBound(nb_actions=env.action_space.n, confidence=2, alpha=0.1)
mean_rewards_upper_confidence, percent_optimal_action_upper_confidence = run_exp(nb_exps, env, upper_confidence_agent)

In [195]:
egreedy_agent = EpsilonGreedy(nb_actions=env.action_space.n, epsilon=0.1, alpha=0.1, optimistic_value=0)
mean_rewards_egreedy, percent_optimal_action_egreedy = run_exp(nb_exps, env, egreedy_agent)

In [200]:
results_exp_3 = {
    'UCB': {
        "mean_reward": mean_rewards_upper_confidence,
        "optimal_action": percent_optimal_action_upper_confidence,
        "color": "blue"
    },
    'Egreedy 0.1': {
        "mean_reward": mean_rewards_egreedy,
        "optimal_action": percent_optimal_action_egreedy,
        "color": "red"
    }
}

plot_results("Upper Confidence Bound vs Epsilon Greedy", ["Average Reward / Steps", "Optimal Action / Steps"], results_exp_3)

## Gradient Bandit Algorithms

In [111]:
def softmax(state_action_value):
    e_x = np.exp(state_action_value - np.max(state_action_value))
    probs = e_x / e_x.sum(axis=0)
    return probs

In [129]:
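class GradientBandit():
    """Gradient bandit agent: self.q holds action preferences, not value estimates."""

    def __init__(self, nb_actions, alpha):
        self.nb_actions = nb_actions
        self.alpha = alpha
        self.soft_probs = None
        self.nb_action_taken = 0

        self.mean_reward = 0
        self.q = np.zeros(self.nb_actions)

    def action(self):
        # Sample an action from the softmax distribution over preferences
        self.soft_probs = softmax(self.q)
        return np.random.choice(self.nb_actions, 1, p=self.soft_probs)[0]
    
    def observe(self, action, reward):
        # The running mean of all rewards is used as the baseline
        self.nb_action_taken += 1
        self.mean_reward += (reward - self.mean_reward) / self.nb_action_taken

        # Raise the preference of the taken action and lower the others
        # when the reward is above the baseline (and the opposite below it)
        one_hot = np.zeros(self.nb_actions)
        one_hot[action] = 1
        self.q += self.alpha * (reward - self.mean_reward) * (one_hot - self.soft_probs)
    
    def reset(self):
        self.nb_action_taken = 0
        self.mean_reward = 0
        self.q = np.zeros(self.nb_actions)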
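
A single preference update can be followed by hand. The sketch below is a plain-Python illustration with hypothetical names (`softmax_list`, `gradient_bandit_step`) and an explicitly given baseline, rather than the running mean used by the class above:

```python
import math


def softmax_list(preferences):
    """Numerically stable softmax over a list of preferences."""
    shift = max(preferences)
    exps = [math.exp(h - shift) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_bandit_step(preferences, action, reward, baseline, alpha):
    """Return updated preferences: H(a) += alpha * (R - baseline) * (1[a == A] - pi(a))."""
    probs = softmax_list(preferences)
    updated = []
    for a, (h, p) in enumerate(zip(preferences, probs)):
        indicator = 1.0 if a == action else 0.0
        updated.append(h + alpha * (reward - baseline) * (indicator - p))
    return updated


before = [0.0, 0.0, 0.0]
after = gradient_bandit_step(before, action=1, reward=2.0, baseline=0.5, alpha=0.1)
print(after)
```

Since the reward is above the baseline, the preference of the chosen action goes up and the others go down. The changes sum to zero, so only the relative preferences move.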

In [130]:
nb_exps = 2000

In [131]:
gb_agent = GradientBandit(nb_actions=env.action_space.n, alpha=0.1)
mean_rewards_gb, percent_optimal_action_gb = run_exp(nb_exps, env, gb_agent)

In [115]:
egreedy_agent = EpsilonGreedy(nb_actions=env.action_space.n, epsilon=0.1, alpha=0.1, optimistic_value=0)
mean_rewards_egreedy, percent_optimal_action_egreedy = run_exp(nb_exps, env, egreedy_agent)

In [151]:
results_exp_4 = {
    'GradientBandit Alpha=0.1': {
        "mean_reward": mean_rewards_gb,
        "optimal_action": percent_optimal_action_gb,
        "color": "blue"
    },
    'Egreedy 0.1': {
        "mean_reward": mean_rewards_egreedy,
        "optimal_action": percent_optimal_action_egreedy,
        "color": "green"
    }
}

plot_results("Gradient Bandit vs Epsilon Greedy", ["Average Reward / Steps", "Optimal Action / Steps"], results_exp_4)

## Parameter study

In [133]:
def run_parameter_study(nb_exps, env, agents_parameters):
    results_mean_reward = {}
    results_percent_optimal_action = {}

    for agent_name in agents_parameters:
        results_mean_reward[agent_name] = []
        results_percent_optimal_action[agent_name] = []

    for agent_name in agents_parameters:
        print(agent_name)

        for parameter in agents_parameters[agent_name]["parameters"]:
            print("    running parameters:", parameter)

            agent = agents_parameters[agent_name]["class"](nb_actions=env.action_space.n, **parameter)

            mean_reward_over_steps, percent_optimal_action_over_steps = run_exp(nb_exps, env, agent)

            mean_reward = np.mean(mean_reward_over_steps)
            mean_optimal_action_percent = np.mean(percent_optimal_action_over_steps)

            results_mean_reward[agent_name].append(mean_reward)
            results_percent_optimal_action[agent_name].append(mean_optimal_action_percent)

    return results_mean_reward, results_percent_optimal_action

In [None]:
def plot_parameter_study_results(agents_parameters, results_mean_reward, results_percent_optimal_action):
    fig = make_subplots(rows=1, cols=2, subplot_titles=["Mean Reward / Parameters", "Mean Optimal Action / Parameters"])

    x = []

    for agent_name in results_mean_reward:
        parameter = agents_parameters[agent_name]["variable"]
        x += [p[parameter] for p in agents_parameters[agent_name]["parameters"]]

        fig.add_trace(
            go.Scatter(x=[p[parameter] for p in agents_parameters[agent_name]["parameters"]],
                       y=results_mean_reward[agent_name], line_color=agents_parameters[agent_name]["color"],
                       name=agent_name)
        , row=1, col=1)

        fig.add_trace(
            go.Scatter(x=[p[parameter] for p in agents_parameters[agent_name]["parameters"]],
                       y=results_percent_optimal_action[agent_name], line_color=agents_parameters[agent_name]["color"],
                       showlegend=False)
        , row=1, col=2)

    fig.update_layout(
        title="Parameter Study",
        legend_title="Parameters",
    )

    fig.update_xaxes(
        type='category',
        tickmode= 'array',
        categoryorder= 'array',
        categoryarray= sorted(x))

    fig.show()

In [205]:
agents = {
    "EpsilonGreedy": {
        "class": EpsilonGreedy,
        "color": "red",
        "variable": "epsilon",
        "parameters": [
            {"epsilon": 1 / 128},
            {"epsilon": 1 / 64},
            {"epsilon": 1 / 32},
            {"epsilon": 1 / 16},
            {"epsilon": 1 / 8},
            {"epsilon": 1 / 4}
        ],
    },

    "Greedy Optimistic": {
        "class": EpsilonGreedy,
        "color": "black",
        "variable": "optimistic_value",
        "parameters": [
            {"epsilon": 0, "optimistic_value": 1 / 4},
            {"epsilon": 0, "optimistic_value": 1 / 2},
            {"epsilon": 0, "optimistic_value": 1},
            {"epsilon": 0, "optimistic_value": 2},
            {"epsilon": 0, "optimistic_value": 4},
        ],
    },

    "UCB": {
        "class": UpperConfidenceBound,
        "color": "blue",
        "variable": "confidence",
        "parameters": [
            {"confidence": 1 / 16},
            {"confidence": 1 / 8},
            {"confidence": 1 / 4},
            {"confidence": 1 / 2},
            {"confidence": 1},
            {"confidence": 2},
            {"confidence": 4},
        ],
    },

    "Gradient Bandit": {
        "class": GradientBandit,
        "color": "green",
        "variable": "alpha",
        "parameters": [
            {"alpha": 1 / 32},
            {"alpha": 1 / 16},
            {"alpha": 1 / 8},
            {"alpha": 1 / 4},
            {"alpha": 1 / 2},
            {"alpha": 1},
            {"alpha": 2},
        ],
    }
}

Running the cell below takes about 20 minutes on my computer (Ryzen 5 5500U).

In [206]:
nb_exps = 2000
results_mean_reward, results_percent_optimal_action = run_parameter_study(nb_exps, env, agents)

EpsilonGreedy
    running parameters: {'epsilon': 0.0078125}
    running parameters: {'epsilon': 0.015625}
    running parameters: {'epsilon': 0.03125}
    running parameters: {'epsilon': 0.0625}
    running parameters: {'epsilon': 0.125}
    running parameters: {'epsilon': 0.25}
Greedy Optimistic
    running parameters: {'epsilon': 0, 'optimistic_value': 0.25}
    running parameters: {'epsilon': 0, 'optimistic_value': 0.5}
    running parameters: {'epsilon': 0, 'optimistic_value': 1}
    running parameters: {'epsilon': 0, 'optimistic_value': 2}
    running parameters: {'epsilon': 0, 'optimistic_value': 4}
UCB
    running parameters: {'confidence': 0.0625}
    running parameters: {'confidence': 0.125}
    running parameters: {'confidence': 0.25}
    running parameters: {'confidence': 0.5}
    running parameters: {'confidence': 1}
    running parameters: {'confidence': 2}
    running parameters: {'confidence': 4}
Gradient Bandit
    running parameters: {'alpha': 0.03125}
    running par

In [208]:
plot_parameter_study_results(agents, results_mean_reward, results_percent_optimal_action)