<a href="https://colab.research.google.com/github/ShreyJais/RL/blob/main/2348558_RL_Lab1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Armed Bandit Problem
> The Multi-Armed Bandit (MAB) Problem is a classic problem in machine learning and reinforcement learning. It involves a decision-maker who faces a set of options (arms) and must choose one at a time. Each arm has an unknown reward distribution, and the goal is to maximize the total reward over a series of trials.

Imagine you are in a casino. There are many slot machines, each with a different chance of winning. You want to win as much money as possible.

The problem: You don't know which machines pay out the most.

The solution: You have to try different machines to figure out which ones are the best. But you also want to play the machines you know are good to win more money.

This is the Multi-Armed Bandit Problem. It's about finding the right balance between trying new things (exploring) and sticking with what works (exploiting).
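
As a rough illustration of the casino story, the sketch below simulates three hypothetical slot machines and plays them with an ε-greedy rule, which is one common way (not the only one) to balance exploring and exploiting. The win probabilities, the helper names `pull` and `choose_arm`, the seed, and ε = 0.1 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical win probabilities for three slot machines (illustrative values only).
win_probs = np.array([0.2, 0.5, 0.7])

def pull(arm):
    """Return 1.0 for a win and 0.0 for a loss on the chosen machine."""
    return float(rng.random() < win_probs[arm])

def choose_arm(estimates, epsilon):
    """Explore a random arm with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return int(rng.integers(len(estimates)))
    return int(np.argmax(estimates))

estimates = np.zeros(len(win_probs))  # running average payout per machine
pulls = np.zeros(len(win_probs))      # number of times each machine was played

for _ in range(2000):
    arm = choose_arm(estimates, epsilon=0.1)
    reward = pull(arm)
    pulls[arm] += 1
    # Incremental update of the running average for the machine just played.
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]

print("pulls per machine:", pulls)
print("estimated win rates:", estimates.round(2))
```

The player never sees `win_probs` directly. It only sees the outcomes of its own pulls, which is why some exploration is needed before the estimates can be trusted.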



**Key Elements of the MAB Problem:**

* Arms: The set of options available to the decision-maker.
* Rewards: The unknown reward distribution associated with each arm.
* Exploration-Exploitation Trade-off: The decision-maker must balance between exploring new arms to learn their rewards and exploiting the arms that are known to have high rewards.
* Regret: The difference between the reward that always pulling the best arm would have earned over the same number of trials and the reward the decision-maker actually collected.
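
Regret can only be computed in a simulation, where the true mean reward of every arm is known. One common formalization is the expected cumulative regret $\sum_{t=1}^{T} (\mu^* - \mu_{a_t})$, where $\mu^*$ is the best arm's mean and $a_t$ is the arm pulled at step $t$. The sketch below computes it. The function name `cumulative_regret` and the example values are hypothetical.

```python
import numpy as np

def cumulative_regret(true_means, chosen_arms):
    """Expected cumulative regret after each step.

    true_means  : expected reward of every arm (known only in simulation)
    chosen_arms : index of the arm pulled at each step
    """
    true_means = np.asarray(true_means, dtype=float)
    chosen_arms = np.asarray(chosen_arms, dtype=int)
    # Per-step gap between the best arm's mean and the chosen arm's mean.
    gaps = true_means.max() - true_means[chosen_arms]
    return np.cumsum(gaps)

# Hypothetical example: three arms, five pulls.
# Per-step gaps are 0.5, 0, 0.25, 0, 0.
regret = cumulative_regret([0.25, 0.5, 0.75], [0, 2, 1, 2, 2])
print(regret)
```

A strategy that settles on the best arm has a regret curve that flattens out, while one that keeps pulling worse arms has a curve that keeps climbing.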

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

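def multi_armed_bandit(num_arms, num_steps, epsilon, true_means=None):
    """Run one epsilon-greedy simulation; return per-step rewards and pull counts."""
    if true_means is None:
        # Assumed arm means: evenly spaced on [0, 1], so arms differ in expected reward.
        true_means = np.linspace(0.0, 1.0, num_arms)
    rewards = np.zeros(num_arms)  # sum of rewards collected from each arm
    counts = np.zeros(num_arms)   # number of pulls of each arm
    all_rewards = []
    for _ in range(num_steps):
        if np.random.rand() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = np.random.randint(num_arms)
        else:
            # Exploit: pick the arm with the highest average reward so far.
            arm = np.argmax(rewards / np.maximum(counts, 1))
        # The reward is noisy and its mean depends on the chosen arm.
        reward = np.random.normal(loc=true_means[arm], scale=1)
        rewards[arm] += reward
        counts[arm] += 1
        all_rewards.append(reward)
    return all_rewards, counts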

num_arms = 10
num_steps = 1000
epsilons = [0.1, 0.3, 0.5]
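
For reference, the selection rule used in `multi_armed_bandit` can be written out as follows. The symbols $Q_t(a)$, $N_t(a)$, $R_i$ and $A_i$ are notation introduced here for illustration and do not appear in the code: $R_i$ is the reward at step $i$, $A_i$ the arm pulled at step $i$, and $N_t(a)$ the number of times arm $a$ was pulled before step $t$.

$$Q_t(a) = \frac{\sum_{i < t} R_i \, \mathbb{1}[A_i = a]}{\max(N_t(a),\, 1)}$$

$$A_t = \begin{cases} \text{an arm drawn uniformly at random} & \text{with probability } \varepsilon \\ \arg\max_a Q_t(a) & \text{with probability } 1 - \varepsilon \end{cases}$$

The $\max(N_t(a), 1)$ term mirrors `np.maximum(counts, 1)` and keeps the estimate of an unpulled arm at 0 instead of dividing by zero. Ties in the $\arg\max$ go to the lowest arm index, since that is how `np.argmax` behaves.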

In [None]:
results = [multi_armed_bandit(num_arms, num_steps, eps) for eps in epsilons]

# Create DataFrame for rewards
df_rewards = pd.DataFrame({
    'Step': np.tile(np.arange(1, num_steps + 1), len(epsilons)),
    'Reward': np.concatenate([res[0] for res in results]),
    'Epsilon': np.repeat(epsilons, num_steps)
})

df_rewards['Cumulative Average Reward'] = df_rewards.groupby('Epsilon')['Reward'].transform(lambda x: x.cumsum() / (np.arange(len(x)) + 1))

# Visualize Cumulative Average Reward
fig_rewards = px.line(df_rewards, x='Step', y='Cumulative Average Reward', color='Epsilon',
                      title='Multi-Armed Bandit: Cumulative Average Reward over Time',
                      labels={'Cumulative Average Reward': 'Cumulative Average Reward',
                              'Step': 'Step', 'Epsilon': 'ε (Epsilon)'},
                      hover_data=['Reward'])
fig_rewards.update_layout(legend_title_text='ε (Epsilon)')
fig_rewards.show()

In [None]:
# Create DataFrame for arm-specific counts
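# Epsilon is stored as a string so Plotly treats it as a category
# (one grouped bar per epsilon) rather than a continuous color scale.
df_counts = pd.DataFrame([
    {'Arm': arm, 'Count': count, 'Epsilon': str(eps)}
    for eps, (_, counts) in zip(epsilons, results)
    for arm, count in enumerate(counts)
])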

# Visualize Arm-specific Counts
fig_counts = px.bar(df_counts, x='Arm', y='Count', color='Epsilon',
                    title='Multi-Armed Bandit: Total Arm Pulls per Arm',
                    labels={'Count': 'Total Arm Pulls',
                            'Arm': 'Arm Index',
                            'Epsilon': 'ε (Epsilon)'},
                    barmode='group')
fig_counts.update_layout(legend_title_text='ε (Epsilon)')
fig_counts.show()