Ashwin Saji (240984006)

A) MAB Agent Formulation

The Multi-Armed Bandit (MAB) problem is a classical decision-making problem where a set of actions (in this case, advertisements) is available, each providing a certain reward, but the actual reward distribution of each action is initially unknown. The goal is to choose actions in a way that maximizes cumulative rewards over time. At each time step, the agent must decide which action to take, balancing the trade-off between exploring new actions to learn their reward distributions (exploration) and exploiting the current best-known action to maximize immediate rewards (exploitation).

In the context of the advertising dataset:
- There are 10 ads (arms) to choose from.
- Each time step represents a user interacting with one ad, either clicking (reward = 1) or not clicking (reward = 0).
- The agent must decide which ad to present, aiming to maximize the total clicks (reward) over a set period.
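
The setup above can be written compactly as follows. The symbols $A_t$ (ad shown at step $t$), $R_t$ (reward), $p_a$ (true click probability of ad $a$), $Q_t(a)$ and $N_t(a)$ are notation introduced here for illustration, following common bandit conventions; they are not part of the original problem statement.

```latex
\begin{aligned}
&\text{Arms: } a \in \{0, 1, \dots, 9\}, \qquad R_t \mid (A_t = a) \sim \mathrm{Bernoulli}(p_a), \\
&\text{Goal: } \max \; \mathbb{E}\!\left[\sum_{t=1}^{T} R_t\right], \\
&\text{Sample-average estimate: } Q_t(a) = \frac{1}{N_t(a)} \sum_{s=1}^{t-1} R_s \,\mathbf{1}[A_s = a],
\qquad N_t(a) = \sum_{s=1}^{t-1} \mathbf{1}[A_s = a].
\end{aligned}
```

The optimal action is the arm with the largest $p_a$, and the agent only ever sees $Q_t(a)$, which is defined once $N_t(a) > 0$.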

In [4]:
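import numpy as np

# Fix the seed so the simulated click probabilities are reproducible.
np.random.seed(42)
true_ctr = np.random.rand(10)  # 10 ads with random click probabilities

def simulate_click(ad_index):
    """Return 1 (click) with probability true_ctr[ad_index], else 0."""
    return 1 if np.random.rand() < true_ctr[ad_index] else 0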

In [7]:
print(true_ctr)

[0.37454012 0.95071431 0.73199394 0.59865848 0.15601864 0.15599452
 0.05808361 0.86617615 0.60111501 0.70807258]
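
As a reference point for judging any agent, the sketch below computes which ad is optimal and how many clicks an agent would expect if it always showed that ad. It hard-codes the probabilities printed above (rounded to 8 decimals) so that it does not disturb the notebook's random state; `true_ctr_printed` and `expected_clicks` are hypothetical names introduced for this illustration.

```python
import numpy as np

# Click probabilities copied from the printed output above (rounded to 8 decimals).
true_ctr_printed = np.array([
    0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864,
    0.15599452, 0.05808361, 0.86617615, 0.60111501, 0.70807258,
])

best_ad = int(np.argmax(true_ctr_printed))   # index of the optimal arm
best_ctr = float(true_ctr_printed[best_ad])  # its true click probability
mean_ctr = float(true_ctr_printed.mean())    # expected CTR of a uniformly random choice

def expected_clicks(ctr, steps):
    """Expected total clicks when an ad with click probability `ctr` is shown every step."""
    return ctr * steps

print(f"Best ad: {best_ad} (CTR {best_ctr:.4f})")
print(f"Always-best expectation over 2000 steps: {expected_clicks(best_ctr, 2000):.1f}")
print(f"Always-best expectation over 1000 steps: {expected_clicks(best_ctr, 1000):.1f}")
print(f"Uniform-random expectation over 2000 steps: {expected_clicks(mean_ctr, 2000):.1f}")
```

The always-best figure is an upper reference for expected reward, and the uniform-random figure is what pure exploration would earn on average; a learning agent's total should normally fall between the two.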


B) Compute the total rewards after 2000 time steps using the ε-greedy action method, for ε = 0.01 and ε = 0.3

In [5]:
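# ε-Greedy Algorithm
def epsilon_greedy(epsilon, time_steps):
    """Run ε-greedy for `time_steps` steps and return the total number of clicks."""
    n_ads = len(true_ctr)
    ad_counts = np.zeros(n_ads)   # times each ad has been shown
    ad_rewards = np.zeros(n_ads)  # total clicks collected per ad
    total_reward = 0

    for t in range(time_steps):
        if np.random.rand() < epsilon:
            # Explore: pick an ad uniformly at random.
            ad_selected = np.random.choice(n_ads)
        else:
            # Exploit: pick the ad with the highest average reward so far.
            # The 1e-5 avoids division by zero; unshown ads get an estimate of 0.
            ad_selected = np.argmax(ad_rewards / (ad_counts + 1e-5))

        reward = simulate_click(ad_selected)
        ad_counts[ad_selected] += 1
        ad_rewards[ad_selected] += reward
        total_reward += reward

    return total_reward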

time_steps_egreedy = 2000

total_reward_eps_0_01 = epsilon_greedy(0.01, time_steps_egreedy)
total_reward_eps_0_3 = epsilon_greedy(0.3, time_steps_egreedy)

print(f"ε-Greedy (ε=0.01) Total Reward: {total_reward_eps_0_01}")
print(f"ε-Greedy (ε=0.3) Total Reward: {total_reward_eps_0_3}")

ε-Greedy (ε=0.01) Total Reward: 1507
ε-Greedy (ε=0.3) Total Reward: 1608


C) Compute the total rewards after 1000 time steps using the Upper-Confidence-Bound action method, for c = 1.5 and c = 2

In [6]:
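# UCB Algorithm
def upper_confidence_bound(c, time_steps):
    """Run UCB with exploration weight `c` and return the total number of clicks."""
    n_ads = len(true_ctr)
    ad_counts = np.zeros(n_ads)   # times each ad has been shown
    ad_rewards = np.zeros(n_ads)  # total clicks collected per ad
    total_reward = 0

    for t in range(time_steps):
        if t < n_ads:
            # Show every ad once so each has an initial estimate.
            ad_selected = t
        else:
            # Average reward plus an uncertainty bonus that shrinks as an ad is shown more.
            ucb_values = (ad_rewards / (ad_counts + 1e-5)) + c * np.sqrt(np.log(t + 1) / (ad_counts + 1e-5))
            ad_selected = np.argmax(ucb_values)

        reward = simulate_click(ad_selected)
        ad_counts[ad_selected] += 1
        ad_rewards[ad_selected] += reward
        total_reward += reward

    return total_reward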

time_steps_ucb = 1000
total_reward_ucb_1_5 = upper_confidence_bound(1.5, time_steps_ucb)
total_reward_ucb_2 = upper_confidence_bound(2, time_steps_ucb)
print(f"UCB (c=1.5) Total Reward: {total_reward_ucb_1_5}")
print(f"UCB (c=2.0) Total Reward: {total_reward_ucb_2}")

UCB (c=1.5) Total Reward: 824
UCB (c=2.0) Total Reward: 758


D) For all approaches, explain how the estimated action value compares to the optimal action.

ε-greedy:

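- ε-greedy estimates the value of each ad as the average reward observed from that ad so far. Because ε is fixed, the agent explores at a constant rate throughout the run: at every step it picks a random ad with probability ε and the ad with the highest current estimate otherwise. Even once the best ad has been identified, it is therefore exploited only roughly a 1 − ε fraction of the time.
- The optimal action is the ad with the highest true click probability, which here is ad index 1 (true CTR ≈ 0.95). An ad's estimate converges to its true value only if that ad keeps being sampled. With ε = 0.01 the rarely chosen ads are sampled so seldom that their estimates stay noisy, and the agent can remain on a suboptimal ad for a long time. With ε = 0.3 the estimates become accurate quickly, but about 30% of steps are spent on random ads. In the single run above, ε = 0.3 collected more clicks (1608) than ε = 0.01 (1507), which is consistent with ε = 0.01 under-exploring within 2000 steps.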
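
To compare estimated and true values directly, the sketch below runs a variant of ε-greedy that returns the per-ad estimates and counts. It is an illustration under stated assumptions, not the code used for the results above: `epsilon_greedy_estimates` and `true_ctr_printed` are hypothetical names, the function uses its own `np.random.default_rng` generator (so its totals will differ from the reported ones), and it updates the average incrementally instead of dividing totals by counts.

```python
import numpy as np

def epsilon_greedy_estimates(true_ctr, epsilon, time_steps, seed=0):
    """Run ε-greedy and return (estimates, counts, total_reward).

    Hypothetical helper for illustration; it uses its own random generator,
    so totals will not match the run reported in the notebook.
    """
    rng = np.random.default_rng(seed)
    n_ads = len(true_ctr)
    counts = np.zeros(n_ads, dtype=int)  # times each ad has been shown
    estimates = np.zeros(n_ads)          # sample-average reward per ad
    total_reward = 0

    for _ in range(time_steps):
        if rng.random() < epsilon:
            ad = int(rng.integers(n_ads))    # explore: uniform random ad
        else:
            ad = int(np.argmax(estimates))   # exploit: best current estimate
        reward = int(rng.random() < true_ctr[ad])  # Bernoulli click
        counts[ad] += 1
        # Incremental mean: Q <- Q + (r - Q) / n
        estimates[ad] += (reward - estimates[ad]) / counts[ad]
        total_reward += reward
    return estimates, counts, total_reward

# Click probabilities copied from the printed output earlier in the notebook.
true_ctr_printed = np.array([
    0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864,
    0.15599452, 0.05808361, 0.86617615, 0.60111501, 0.70807258,
])
best = int(np.argmax(true_ctr_printed))

for eps in (0.01, 0.3):
    est, cnt, total = epsilon_greedy_estimates(true_ctr_printed, eps, 2000)
    print(f"eps={eps}: total={total}, most-shown ad={int(np.argmax(cnt))}, "
          f"estimate of best ad={est[best]:.3f} (true {true_ctr_printed[best]:.3f}), "
          f"times best ad shown={cnt[best]}")
```

Printing `cnt` alongside `est` shows how many samples each estimate rests on, which is what determines how far it can be trusted.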


Upper-Confidence-Bound (UCB):

- UCB balances exploration and exploitation more systematically by adding an uncertainty bonus to each estimate. Ads that have been selected less often carry a larger bonus and are therefore tried again, while the bonus shrinks as an ad accumulates selections. Exploration is thus directed at the ads whose values are least certain rather than spread uniformly at random, and it tapers off as the estimates firm up.
- As a result, the UCB estimate for the best ad tends to approach its true value with less wasted exploration than ε-greedy. The c parameter controls the size of the bonus: a larger c means more exploration, which makes the estimates of all ads more accurate but costs reward in a short run. In the single run above, c = 1.5 earned more (824) than c = 2 (758) over 1000 steps. These are single runs, so part of the difference may be due to chance.
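
The selection rule implemented in `upper_confidence_bound` can be written as below. Here $Q_t(a)$ is the average reward of ad $a$ so far and $N_t(a)$ is the number of times it has been shown; both symbols are notation introduced for this illustration. The code counts steps from 0 and uses `np.log(t + 1)`, which corresponds to $\ln t$ with steps counted from 1, and it adds a small constant (1e-5) to $N_t(a)$ that is omitted here.

```latex
A_t = \arg\max_{a} \left[ \, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \, \right]
```

As a worked example, at $t = 1000$ an ad shown $N_t(a) = 100$ times has $\sqrt{\ln 1000 / 100} \approx 0.263$, so its bonus is about $0.39$ for $c = 1.5$ and about $0.53$ for $c = 2$. The larger bonus keeps less-sampled ads competitive for longer, which is why a higher $c$ explores more.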



