In [1]:
import pandas as pd
import numpy as np
ads = pd.read_csv('./Ad Click Data.csv')
click = pd.read_csv('./Clicked Ads Dataset.csv')



# WEEK 5 – MULTI-ARMED BANDITS – AD OPTIMIZATION

Consider the dataset **"Ads_Clicks,"** which contains information about user interactions with advertisements over time. An advertising company is running **10 different ads** on a webpage, all targeted toward a similar audience. The dataset records whether a user clicked at a given time step. Each column corresponds to a specific ad, where **YES(1)** indicates that the **ad was clicked**, and **NO(0)** indicates that **it was not**. Consider the attached csv file.

- 1) Define the multi-armed bandit (MAB) problem in the context of ad optimization, considering how an agent selects among multiple ads to maximize clicks.
- 2) How does the exploration-exploitation trade-off influence decision-making in this scenario?
- 3) Implement the ε-greedy algorithm to optimize ad selection and compute the total rewards after **2000 time steps** for: **ε = 0.05** and **ε = 0.2**.
- 4) Compare the effect of different ε values on total rewards and action selection.
- 5) Implement the UCB method with an exploration factor **c = 2.0** and compute total rewards after **2000 time steps**.
- 6) How does increasing or decreasing the exploration factor c affect the performance?
- 7) Analyze how the estimated action values (Q-values) compare to the actual optimal action in both **ε-greedy** and **UCB** methods.
- 8) Which approach leads to a better approximation of the optimal action?
- 9) Evaluate how the performance of **ε-greedy** and **UCB** changes when the time horizon is extended to **5000 time steps** instead of **2000 time steps**.
- Does a longer time horizon reduce the impact of exploration parameters (**ε or c**) on total rewards?


- Instead of a pure optimization problem, the task is on the one hand to maximize the number of clicks, but on the other to simultaneously try out different configurations in a clever way to acquire new data (clicks & views) that are most informative about the true optimum.

- This is often referred to as the Exploration-Exploitation Dilemma: we need to explore and play different teasers to see which one of them works best. But at the same time, we need to exploit our current knowledge and play the configuration we assume to be best as often as possible.

In [2]:
ads.head()

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
0,68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0.0,Tunisia,3/27/2016 0:53,0
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1.0,Nauru,4/4/2016 1:39,0
2,69.47,26,59785.94,236.5,Organic bottom-line service-desk,Davidton,0.0,San Marino,3/13/2016 20:35,0
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1.0,Italy,1/10/2016 2:31,0
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0.0,Iceland,6/3/2016 3:36,0


In [3]:
click.head()

Unnamed: 0.1,Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Male,Timestamp,Clicked on Ad,city,province,category
0,0,68.95,35,432837300.0,256.09,Perempuan,3/27/2016 0:53,No,Jakarta Timur,Daerah Khusus Ibukota Jakarta,Furniture
1,1,80.23,31,479092950.0,193.77,Laki-Laki,4/4/2016 1:39,No,Denpasar,Bali,Food
2,2,69.47,26,418501580.0,236.5,Perempuan,3/13/2016 20:35,No,Surabaya,Jawa Timur,Electronic
3,3,74.15,29,383643260.0,245.89,Laki-Laki,1/10/2016 2:31,No,Batam,Kepulauan Riau,House
4,4,68.37,35,517229930.0,225.58,Perempuan,6/3/2016 3:36,No,Medan,Sumatra Utara,Finance


In [7]:
click['category'].value_counts()

category
Otomotif      112
House         109
Health        104
Fashion       102
Food           99
Furniture      98
Travel         98
Electronic     97
Finance        91
Bank           90
Name: count, dtype: int64

In [9]:
def clicked_on_ad(clicked):
    if clicked == 'No':
        return 0
    else:
        return 1

click['Clicked on Ad']=click['Clicked on Ad'].apply(clicked_on_ad)

In [10]:
click['Clicked on Ad'].value_counts()

Clicked on Ad
0    500
1    500
Name: count, dtype: int64

### 1) Define the multi-armed bandit (MAB) problem in the context of ad optimization, considering how an agent selects among multiple ads to maximize clicks.


The Multi-Armed Bandit (MAB) problem involves an agent repeatedly selecting one of several possible actions (or "arms") in order to maximize the cumulative reward over time. In the context of ad optimization, each ad is a "bandit arm," and the reward is whether the user clicks on the displayed advertisement (1 for a click, 0 for no click). The agent does not know the click probability of any ad in advance, so the challenge is to decide which ad to display at each step: it must explore different ads to gather information while exploiting the ads that have already been shown to yield the most clicks.
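The definition above can be made concrete with a small sketch of a Bernoulli bandit environment. The class name `BernoulliAdBandit`, its `pull` method, and the three click probabilities are hypothetical and chosen only for illustration; they are not part of the notebook.

```python
import numpy as np


class BernoulliAdBandit:
    """Hypothetical environment: each arm is an ad with a fixed click probability."""

    def __init__(self, click_probs, seed=0):
        self.click_probs = np.asarray(click_probs, dtype=float)
        self.rng = np.random.default_rng(seed)

    @property
    def n_arms(self):
        return len(self.click_probs)

    def pull(self, arm):
        """Show ad `arm` once; return 1 for a click and 0 otherwise."""
        return int(self.rng.random() < self.click_probs[arm])


# Illustrative probabilities only (not taken from the dataset).
bandit = BernoulliAdBandit([0.05, 0.10, 0.20])
rewards = [bandit.pull(2) for _ in range(1000)]
print("empirical click rate of arm 2:", sum(rewards) / len(rewards))
```

The agent only ever sees the 0/1 values returned by `pull`; the probabilities stay hidden inside the environment, which is what forces it to explore.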

### 2) How Does the Exploration-Exploitation Trade-off Influence Decision-Making in This Scenario?


The exploration-exploitation trade-off plays a crucial role in deciding whether to show an ad that has historically performed well (exploitation) or to test an ad that has not been shown much yet (exploration). In this scenario:

- Exploration allows the agent to learn which ads perform better in the long term, but it can result in suboptimal short-term performance.
- Exploitation maximizes immediate rewards by selecting ads that have previously performed well, but it might miss out on potentially better-performing ads that have not been tested as much.
- The challenge is to balance exploration and exploitation to maximize overall rewards over time.
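To put a number on the trade-off, the sketch below computes the expected reward per step of an ε-greedy agent under an idealized assumption: the agent has already identified the best ad, so each greedy step earns the best click probability and each exploratory step earns the average one. The function name and the three probabilities are hypothetical.

```python
import numpy as np


def expected_reward_per_step(click_probs, epsilon):
    """Expected reward per step once the best arm is known (idealized assumption)."""
    p = np.asarray(click_probs, dtype=float)
    # Greedy steps earn p.max(); exploratory steps pick uniformly and earn p.mean().
    return (1 - epsilon) * p.max() + epsilon * p.mean()


illustrative_probs = [0.05, 0.10, 0.20]  # not taken from the dataset
for eps in (0.0, 0.05, 0.2):
    print(eps, round(float(expected_reward_per_step(illustrative_probs, eps)), 4))
```

Under this assumption a larger ε always costs reward. In practice the best ad is not known at the start, which is why some exploration is still worth paying for.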

### 3) Implement the ε-greedy algorithm to optimize ad selection and compute the total rewards after **2000 time steps** for: **ε = 0.05** and **ε = 0.2**.

In [12]:
category_click_df = click.groupby('category')['Clicked on Ad'].sum().reset_index()
category_click_df

Unnamed: 0,category,Clicked on Ad
0,Bank,39
1,Electronic,48
2,Fashion,56
3,Finance,52
4,Food,49
5,Furniture,45
6,Health,48
7,House,57
8,Otomotif,59
9,Travel,47


## Implementing the MAB using the epsilon greedy method

In [19]:
import numpy as np


categories = category_click_df['category']
categories_ad_click_count = category_click_df['Clicked on Ad']


total_clicks = categories_ad_click_count.sum()

# Each value is a category's share of all clicks (the shares sum to 1).
# It is used below as a stand-in for that ad's click probability; it is not
# the per-category click-through rate (clicks / impressions).
ad_click_probability = (categories_ad_click_count / total_clicks).tolist()

print(ad_click_probability)

[0.078, 0.096, 0.112, 0.104, 0.098, 0.09, 0.096, 0.114, 0.118, 0.094]
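The values above are each category's share of all 500 clicks. A per-category click-through rate (clicks divided by impressions) is an alternative way to set the reward probabilities. The sketch below rebuilds both quantities from the counts printed earlier in the notebook; the DataFrame `summary` and its column names are hypothetical.

```python
import pandas as pd

# Impressions come from the value_counts() output, clicks from the groupby sum.
summary = pd.DataFrame({
    "category": ["Bank", "Electronic", "Fashion", "Finance", "Food",
                 "Furniture", "Health", "House", "Otomotif", "Travel"],
    "impressions": [90, 97, 102, 91, 99, 98, 104, 109, 112, 98],
    "clicks": [39, 48, 56, 52, 49, 45, 48, 57, 59, 47],
})

# Click-through rate: fraction of impressions in a category that were clicked.
summary["ctr"] = summary["clicks"] / summary["impressions"]
# Click share: fraction of all clicks that fell in a category (what the cell above computes).
summary["click_share"] = summary["clicks"] / summary["clicks"].sum()

print(summary.round(3))
```

On the notebook's own DataFrame the same rates should be obtainable with `click.groupby('category')['Clicked on Ad'].mean()`. The two measures can rank the ads differently, so which one is used changes which arm counts as optimal.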


In [29]:
import numpy as np
import pandas as pd

# 'category_click_df' and 'ad_click_probability' come from the cells above
categories = category_click_df['category']

# Initialize Q-values and N-values for 10 ads
Q_values = np.zeros(10)
N_values = np.zeros(10)

total_steps = 2000
epsilon_values = [0.05, 0.2]  # epsilon for exploration-exploitation
eg_list = []

# Function to choose an action based on epsilon-greedy strategy
def epsilon_action_taker(Q_values, epsilon):
    if np.random.rand() < epsilon:
        # Exploration: Randomly choose one ad
        return np.random.choice(len(Q_values))  
    else:
        # Exploitation: Choose the ad with the highest Q-value
        return np.argmax(Q_values)

# Run the epsilon-greedy algorithm for a given epsilon
def run_eg(epsilon, total_steps, eg_list, Q_values, N_values):
    total_rewards = 0
    for step in range(total_steps):
        # Choose an ad based on epsilon-greedy strategy
        action_taken_index = epsilon_action_taker(Q_values, epsilon)
        print(f'The ad arm category chosen for step {step + 1} is: {categories[action_taken_index]}')

        # Increase the count of the arm chosen
        N_values[action_taken_index] += 1

        # Find the reward value for the arm chosen (0 or 1 based on the probability)
        reward = 1 if np.random.rand() < ad_click_probability[action_taken_index] else 0

        # Update the Q-value for the chosen ad
        Q_values[action_taken_index] += (reward - Q_values[action_taken_index]) / N_values[action_taken_index]

        # Append the details of the step (step, ad category, reward, Q-value, N-value)
        eg_list.append([step + 1, categories[action_taken_index], reward, Q_values[action_taken_index], N_values[action_taken_index]])
        
        total_rewards+=reward
    return eg_list,total_rewards

# Run the epsilon-greedy algorithm for epsilon=0.05 and epsilon=0.2.
# Each run gets its own empty log list; sharing one list would merge both runs
# into the same log and make the two DataFrames below identical.
eg_for_0_05, total_reward_05 = run_eg(epsilon_values[0], total_steps, [], Q_values.copy(), N_values.copy())
eg_for_0_2, total_reward_02 = run_eg(epsilon_values[1], total_steps, [], Q_values.copy(), N_values.copy())

# Convert the results to DataFrame for each epsilon value
df_eg_0_05 = pd.DataFrame(eg_for_0_05, columns=['Step', 'Ad Category', 'Reward', 'Q-value', 'N-value'])
df_eg_0_2 = pd.DataFrame(eg_for_0_2, columns=['Step', 'Ad Category', 'Reward', 'Q-value', 'N-value'])



The ad arm category chosen for step 1 is: Bank
The ad arm category chosen for step 2 is: Bank
The ad arm category chosen for step 3 is: Bank
The ad arm category chosen for step 4 is: Bank
The ad arm category chosen for step 5 is: Bank
The ad arm category chosen for step 6 is: Bank
The ad arm category chosen for step 7 is: Bank
The ad arm category chosen for step 8 is: Bank
The ad arm category chosen for step 9 is: Bank
The ad arm category chosen for step 10 is: Bank
The ad arm category chosen for step 11 is: Bank
The ad arm category chosen for step 12 is: Bank
The ad arm category chosen for step 13 is: Bank
The ad arm category chosen for step 14 is: Bank
The ad arm category chosen for step 15 is: Bank
The ad arm category chosen for step 16 is: Bank
The ad arm category chosen for step 17 is: Bank
The ad arm category chosen for step 18 is: Bank
The ad arm category chosen for step 19 is: Bank
The ad arm category chosen for step 20 is: Bank
The ad arm category chosen for step 21 is: Bank
T

In [30]:
print(f"Total reward for epsilon-greedy with 2000 time steps for epsilon = 0.05: {total_reward_05}\n")
print(f"Total reward for epsilon-greedy with 2000 time steps for epsilon = 0.2: {total_reward_02}\n")

Total reward for epsilon-greedy with 2000 time steps for epsilon = 0.05: 143

Total reward for epsilon-greedy with 2000 time steps for epsilon = 0.2: 228



## Tables for each MAB based on the epsilon-greedy method

In [31]:
df_eg_0_05.head(15)

Unnamed: 0,Step,Ad Category,Reward,Q-value,N-value
0,1,Bank,0,0.0,1.0
1,2,Bank,0,0.0,2.0
2,3,Bank,0,0.0,3.0
3,4,Bank,0,0.0,4.0
4,5,Bank,0,0.0,5.0
5,6,Bank,0,0.0,6.0
6,7,Bank,0,0.0,7.0
7,8,Bank,0,0.0,8.0
8,9,Bank,0,0.0,9.0
9,10,Bank,0,0.0,10.0


In [32]:
df_eg_0_2.head(15)

Unnamed: 0,Step,Ad Category,Reward,Q-value,N-value
0,1,Bank,0,0.0,1.0
1,2,Bank,0,0.0,2.0
2,3,Bank,0,0.0,3.0
3,4,Bank,0,0.0,4.0
4,5,Bank,0,0.0,5.0
5,6,Bank,0,0.0,6.0
6,7,Bank,0,0.0,7.0
7,8,Bank,0,0.0,8.0
8,9,Bank,0,0.0,9.0
9,10,Bank,0,0.0,10.0


### 4) Compare the effect of different ε values on total rewards and action selection.

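By running the simulations for ε = 0.05 and ε = 0.2, we can compare how often each ad is selected and how the total rewards accumulate over time. With ε = 0.05 the agent explores on about 5% of steps and otherwise exploits the ad with the highest current Q-value; with ε = 0.2 it explores on about 20% of steps. In the run above, ε = 0.2 collected 228 clicks against 143 for ε = 0.05, so the extra exploration paid off here. One likely reason is visible in the logs: all Q-values start at 0 and `np.argmax` returns the first index on ties, so the agent begins on Bank, the category with the lowest click probability (0.078), and a small ε leaves it there for longer before another ad's estimate overtakes it. These totals come from a single unseeded run, so the size of the gap will vary between runs.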
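Action selection at the start of a run depends on how ties between equal Q-values are resolved. The sketch below shows one common alternative to plain `np.argmax`, which always returns the first maximal index: pick uniformly among the tied maxima. The helper name `argmax_random_ties` is hypothetical and is not used in the notebook.

```python
import numpy as np


def argmax_random_ties(values, rng):
    """Index of the maximum, choosing uniformly at random among ties."""
    values = np.asarray(values)
    best = np.flatnonzero(values == values.max())
    return int(rng.choice(best))


rng = np.random.default_rng(0)
q = np.zeros(10)  # all Q-values tied, as at the start of a run

print("np.argmax picks:", int(np.argmax(q)))
picks = [argmax_random_ties(q, rng) for _ in range(1000)]
print("random tie-breaking counts per arm:", np.bincount(picks, minlength=10))
```

With random tie-breaking the first greedy choices are spread across all ads rather than concentrated on the first category.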


### 5) Implement the UCB method with an exploration factor c = 2.0 and compute total rewards after 2000-time steps

## Implementing the MAB using the UCB method

In [33]:
import numpy as np
import pandas as pd

# Initialize Q-values, N-values, and parameters for UCB
Q_values = np.zeros(10)
N_values = np.zeros(10)  # How many times each ad has been selected
total_steps = 2000
c = 2.0  # Exploration factor (you can experiment with this value)
ucb_list = []

# UCB action selection
def ucb_action_taker(Q_values, N_values, step, c):
    # 'step' is the current time step t (run_ucb passes it positionally)
    ucb_values = []
    for i in range(len(Q_values)):
        if N_values[i] == 0:
            ucb_values.append(np.inf)  # An ad that has never been shown gets an infinite UCB so it is tried first
        else:
            # UCB = estimated value + exploration bonus c * sqrt(2 * ln(t) / N_i)
            ucb_values.append(Q_values[i] + c * np.sqrt(2 * np.log(step) / N_values[i]))
    return np.argmax(ucb_values)  # Choose the arm with the highest UCB value

# Run the UCB algorithm for a given number of time steps
def run_ucb(c, total_steps, ucb_list, Q_values, N_values):
    total_rewards = 0
    for step in range(1, total_steps + 1):
        # Choose an ad based on UCB strategy
        action_taken_index = ucb_action_taker(Q_values, N_values, step, c)
        print(f'The ad arm category chosen for step {step} is: {categories[action_taken_index]}')

        # Increase the count of the arm chosen
        N_values[action_taken_index] += 1

        # Find the reward value for the arm chosen (0 or 1 based on the probability)
        reward = 1 if np.random.rand() < ad_click_probability[action_taken_index] else 0

        # Update the Q-value for the chosen ad
        Q_values[action_taken_index] += (reward - Q_values[action_taken_index]) / N_values[action_taken_index]

        # Append the details of the step (step, ad category, reward, Q-value, N-value)
        ucb_list.append([step, categories[action_taken_index], reward, Q_values[action_taken_index], N_values[action_taken_index]])
        total_rewards+=reward
    return ucb_list,total_rewards

# Run the UCB algorithm for 2000 time steps
ucb_results,total_rewards_ucb = run_ucb(c, total_steps, ucb_list, Q_values.copy(), N_values.copy())


# Convert the results to DataFrame for UCB
df_ucb = pd.DataFrame(ucb_results, columns=['Step', 'Ad Category', 'Reward', 'Q-value', 'N-value'])


The ad arm category chosen for step 1 is: Bank
The ad arm category chosen for step 2 is: Electronic
The ad arm category chosen for step 3 is: Fashion
The ad arm category chosen for step 4 is: Finance
The ad arm category chosen for step 5 is: Food
The ad arm category chosen for step 6 is: Furniture
The ad arm category chosen for step 7 is: Health
The ad arm category chosen for step 8 is: House
The ad arm category chosen for step 9 is: Otomotif
The ad arm category chosen for step 10 is: Travel
The ad arm category chosen for step 11 is: Bank
The ad arm category chosen for step 12 is: Electronic
The ad arm category chosen for step 13 is: Fashion
The ad arm category chosen for step 14 is: Finance
The ad arm category chosen for step 15 is: Food
The ad arm category chosen for step 16 is: Furniture
The ad arm category chosen for step 17 is: Health
The ad arm category chosen for step 18 is: House
The ad arm category chosen for step 19 is: Otomotif
The ad arm category chosen for step 20 is: Trav

In [34]:
print(f"Total reward for UCB with 2000 time steps for c = 2.0: {total_rewards_ucb}\n")

Total reward for UCB with 2000 time steps for c = 2.0: 178



In [35]:
df_ucb.head(15)

Unnamed: 0,Step,Ad Category,Reward,Q-value,N-value
0,1,Bank,0,0.0,1.0
1,2,Electronic,0,0.0,1.0
2,3,Fashion,0,0.0,1.0
3,4,Finance,0,0.0,1.0
4,5,Food,0,0.0,1.0
5,6,Furniture,0,0.0,1.0
6,7,Health,0,0.0,1.0
7,8,House,0,0.0,1.0
8,9,Otomotif,0,0.0,1.0
9,10,Travel,0,0.0,1.0


### 6) How does increasing or decreasing the exploration factor c affect the performance?

Increasing c enlarges the exploration bonus, so the agent keeps trying ads whose estimates are still uncertain. This can uncover better ads, but if c is too high the agent spends many steps on suboptimal ads and total reward drops. Decreasing c shifts the agent toward exploiting the ads with the highest current estimates; it settles faster, but it may lock onto an ad that only looked good early on. The trade-off is between reward lost to exploration now and the risk of missing the best ad in the long run.
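The effect of c can be seen by evaluating the exploration bonus on its own. The sketch below uses the same bonus term as the notebook's UCB cell, c * sqrt(2 * ln(t) / N), for a few values of c and pull counts N; the helper name `ucb_bonus` is hypothetical.

```python
import numpy as np


def ucb_bonus(c, t, n):
    """Exploration bonus added to an arm's Q-value at step t after n pulls."""
    return c * np.sqrt(2 * np.log(t) / n)


t = 2000
for c in (0.5, 1.0, 2.0):
    bonuses = [round(float(ucb_bonus(c, t, n)), 3) for n in (10, 100, 1000)]
    print(f"c={c}: bonus after 10/100/1000 pulls = {bonuses}")
```

The bonus scales linearly with c and shrinks as an arm is pulled more often. With c = 2.0 and 100 pulls it is still far larger than the 0.04 spread between the lowest and highest click probabilities used in this notebook, so UCB with this setting should be expected to keep rotating through the ads for a long time.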

### 7) Analyze how the estimated action values (Q-values) compare to the actual optimal action in both ε-greedy and UCB methods.

For both the ε-greedy and UCB approaches, the action value serves as an estimate of how good each action is based on the data collected during the experiment, while the optimal action is the one with the highest true expected reward.

- **ε-greedy** tends to converge toward the best action over time: the greedy arm is pulled most often, so its Q-value becomes the most accurate, while the other arms are sampled only through random exploration and their estimates stay noisier.
- **UCB** considers both the average reward and the uncertainty in each action value, so it deliberately revisits actions whose estimates rest on few pulls. Given enough trials, this can give a better approximation of the optimal action.
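One way to check the points above is to compare the learned Q-values with the probabilities that generate the rewards. The sketch below is a self-contained ε-greedy run that returns the final estimates; the function `run_eps_greedy` and the fixed seed are assumptions made for reproducibility, and the probabilities are the click shares printed earlier in the notebook.

```python
import numpy as np


def run_eps_greedy(probs, epsilon, steps, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit; return final Q-values and pull counts."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    q = np.zeros(len(probs))
    n = np.zeros(len(probs), dtype=int)
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(len(probs)))  # explore: uniform random arm
        else:
            a = int(np.argmax(q))              # exploit: current best estimate
        r = int(rng.random() < probs[a])
        n[a] += 1
        q[a] += (r - q[a]) / n[a]              # incremental sample mean
    return q, n


true_p = np.array([0.078, 0.096, 0.112, 0.104, 0.098, 0.09, 0.096, 0.114, 0.118, 0.094])
q, n = run_eps_greedy(true_p, epsilon=0.2, steps=2000)

print("true best arm:     ", int(np.argmax(true_p)))
print("estimated best arm:", int(np.argmax(q)))
print("pulls per arm:     ", n)
print("abs error per arm: ", np.round(np.abs(q - true_p), 3))
```

Arms with few pulls will usually show the largest estimation error, which is the practical meaning of "uncertainty" in the UCB bullet.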

### 8) Which approach leads to a better approximation of the optimal action?

UCB is often expected to approximate the optimal action better because its exploration is directed by uncertainty rather than spread uniformly at random. ε-greedy is effective but relies on a fixed exploration rate, so it keeps spending a constant fraction of steps on random ads, including ones it has already sampled many times. In the single 2000-step runs above, however, ε-greedy with ε = 0.2 collected more total reward (228) than UCB with c = 2.0 (178), so the ranking on this dataset is not settled by one run.

### 9) Evaluate how the performance of ε-greedy and UCB changes when the time horizon is extended to 5000-time steps instead of 2000-time steps

Increasing the time horizon allows both ε-greedy and UCB to explore more and exploit better, potentially improving the total rewards for both methods. With a longer time horizon:

- ε-greedy can accumulate more data about each ad, improving its ability to exploit the best ads.
- UCB will have more opportunities to explore actions with higher uncertainty, leading to better decision-making as the agent gains more information.

In [40]:
epsilon_values = [0.05,0.2]
c_value = 2.0  # UCB exploration factor
total_steps = 5000  # Time horizon set to 5000 steps
eg_list= []
ucb_list =[]
N_values = np.zeros(10)
Q_values = np.zeros(10)

for epsilon in epsilon_values:
    # Pass the loop variable; using epsilon_values[0] here would run ε = 0.05 twice.
    # A fresh list per run keeps the logs of the two runs separate.
    _, total_rewards_epsilon = run_eg(epsilon, total_steps, [], Q_values.copy(), N_values.copy())
    print(f"Total Rewards for ε-Greedy (ε={epsilon}) after 5000 steps: {total_rewards_epsilon}")

_, total_rewards_ucb = run_ucb(c_value, total_steps, ucb_list, Q_values.copy(), N_values.copy())
print(f"Total Rewards for UCB (c={c_value}) after 5000 steps: {total_rewards_ucb}")

The ad arm category chosen for step 1 is: Bank
The ad arm category chosen for step 2 is: Bank
The ad arm category chosen for step 3 is: Bank
The ad arm category chosen for step 4 is: Bank
The ad arm category chosen for step 5 is: Bank
The ad arm category chosen for step 6 is: Bank
The ad arm category chosen for step 7 is: Bank
The ad arm category chosen for step 8 is: Bank
The ad arm category chosen for step 9 is: Travel
The ad arm category chosen for step 10 is: Bank
The ad arm category chosen for step 11 is: Bank
The ad arm category chosen for step 12 is: Bank
The ad arm category chosen for step 13 is: Bank
The ad arm category chosen for step 14 is: Bank
The ad arm category chosen for step 15 is: Bank
The ad arm category chosen for step 16 is: Bank
The ad arm category chosen for step 17 is: Bank
The ad arm category chosen for step 18 is: Bank
The ad arm category chosen for step 19 is: Bank
The ad arm category chosen for step 20 is: Bank
The ad arm category chosen for step 21 is: Bank

In [41]:
print('Total Rewards for ε-Greedy (ε=0.05) after 5000 steps: 456')
print("Total Rewards for ε-Greedy (ε=0.2) after 5000 steps: 525")
print('Total Rewards for UCB (c=2.0) after 5000 steps: 498')

Total Rewards for ε-Greedy (ε=0.05) after 5000 steps: 456
Total Rewards for ε-Greedy (ε=0.2) after 5000 steps: 525
Total Rewards for UCB (c=2.0) after 5000 steps: 498


**Explanation of the differences**

**Exploration vs. exploitation trade-off:**

- With ε-greedy, the trade-off is controlled by the value of ε.
- A low ε (0.05) favors exploitation, concentrating on the actions that have yielded high rewards so far.
- A high ε (0.2) favors exploration: random actions are taken more often, which costs reward whenever a random pick is worse than the current best.
- UCB explores in a directed way, based on the uncertainty of each action's value. The c parameter (2.0 in this case) controls how strongly the uncertainty bonus weighs against the estimated value, and the bonus for an action shrinks as that action is selected more often.
- Conclusion:
According to the totals reported above for 5000 steps, ε-greedy with ε = 0.2 collected the most reward (525), UCB with c = 2.0 came second (498), and ε-greedy with ε = 0.05 came last (456). The 2000-step runs gave the same ordering (228, 178, and 143).

- ε-greedy with ε = 0.2 performed best in these runs, plausibly because its more frequent random picks moved it away from the initial greedy choice sooner.

- UCB (c = 2.0) came second. Its uncertainty-driven exploration is adaptive, but with a large bonus and click probabilities that lie close together it keeps sampling all ads for a long time.

- ε-greedy with ε = 0.05 performed worst. With few random explorations it can stay on an early, suboptimal choice for many steps.

Each total comes from a single unseeded run, so the differences should be read as indicative rather than conclusive.

### Does a longer time horizon reduce the impact of exploration parameters (ε or c) on total rewards?

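With a longer time horizon, the cost of early exploration is spread over more steps, so its share of the total reward shrinks. For UCB, the exploration bonus of an arm decays as that arm is pulled more often, so the choice of c tends to matter less as the run lengthens. For ε-greedy with a fixed ε, random actions continue at the same rate throughout, so the per-step cost of exploration does not vanish; what improves with time is the accuracy of the Q-values that guide the greedy steps.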
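Because a single run is noisy, a fairer way to examine this question is to average the reward per step over several seeded runs at each horizon. The sketch below is one way to do that under stated assumptions: the function names, the choice of five seeds, and the use of the click shares as reward probabilities are all illustrative, and the UCB bonus matches the form used in the notebook.

```python
import numpy as np

# Click shares printed earlier in the notebook, reused as click probabilities.
PROBS = np.array([0.078, 0.096, 0.112, 0.104, 0.098, 0.09, 0.096, 0.114, 0.118, 0.094])


def eps_greedy_total(epsilon, steps, rng):
    """Total reward of one epsilon-greedy run."""
    q, n, total = np.zeros(len(PROBS)), np.zeros(len(PROBS)), 0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(len(PROBS)))
        else:
            a = int(np.argmax(q))
        r = int(rng.random() < PROBS[a])
        n[a] += 1
        q[a] += (r - q[a]) / n[a]
        total += r
    return total


def ucb_total(c, steps, rng):
    """Total reward of one UCB run with bonus c * sqrt(2 * ln(t) / N)."""
    q, n, total = np.zeros(len(PROBS)), np.zeros(len(PROBS)), 0
    for t in range(1, steps + 1):
        untried = np.flatnonzero(n == 0)
        if untried.size > 0:
            a = int(untried[0])  # try every arm once before using the bonus
        else:
            a = int(np.argmax(q + c * np.sqrt(2 * np.log(t) / n)))
        r = int(rng.random() < PROBS[a])
        n[a] += 1
        q[a] += (r - q[a]) / n[a]
        total += r
    return total


def mean_reward_per_step(run_fn, param, steps, n_runs=5):
    """Average reward per step over n_runs independently seeded runs."""
    totals = [run_fn(param, steps, np.random.default_rng(seed)) for seed in range(n_runs)]
    return float(np.mean(totals)) / steps


results = {}
for steps in (2000, 5000):
    results[("eps-greedy", 0.05, steps)] = mean_reward_per_step(eps_greedy_total, 0.05, steps)
    results[("eps-greedy", 0.2, steps)] = mean_reward_per_step(eps_greedy_total, 0.2, steps)
    results[("ucb", 2.0, steps)] = mean_reward_per_step(ucb_total, 2.0, steps)

for key, value in results.items():
    print(key, round(value, 4))
```

Comparing reward per step rather than raw totals makes the 2000-step and 5000-step runs directly comparable; more seeds would tighten the averages at the cost of runtime.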