# Divansh Prasad (210968140) Week-5 AI Lab

## Exercises
Consider the given dataset "Ads_clicks", which records which ad was clicked at each time step. Suppose an advertising company is running 10 different ads targeted at a similar population on a webpage. We have the results for which ads a user clicked. Each column index represents a different ad: the entry is 1 if the ad was clicked by a user and 0 if it was not.

a. Write down the MAB agent problem formulation in your own words. \
b. Compute the total rewards after 2000 time steps using the ε-greedy action for ε=0.01 and ε=0.3. \
c. Compute the total rewards after 2000 time steps using the Upper-Confidence-Bound action method for c=1.5. \
d. For all approaches, explain how the action value estimated compares to the optimal action.

### a. Write down the MAB agent problem formulation in your own words.

The MAB agent problem formulation involves determining the best ad to display to a user at each time step so as to maximize the number of clicks on the webpage.
The problem can be defined as follows:

-- There are 10 different ads to choose from, and at each time step, the MAB agent must decide which ad to display to the user.
-- Each ad has an unknown click-through rate (CTR) that represents the probability of a user clicking on that ad.

-- The MAB agent must balance the exploration of less-known ads to learn their CTRs with the exploitation of the ads that are known to have higher CTRs to maximize the total number of clicks.

-- The MAB agent's objective is to learn the true CTR of each ad while minimizing the regret, which is the difference between the expected number of clicks obtained by displaying the best ad and the expected number of clicks obtained by displaying the chosen ad at each time step.

An advertising agency has 10 different ads. They want to find the ad that will get the most clicks from users and is thus the most profitable. We need to help the agency find the best-suited ad to maximize the conversions through it.
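The regret described in the formulation above can be made concrete with a small sketch. The helper name `expected_regret` and the click-through rates in `true_ctr` are hypothetical, chosen only for illustration; they are not taken from the dataset.

```python
import numpy as np

def expected_regret(true_ctr, chosen_ads):
    """Expected regret of a sequence of ad choices versus always showing the best ad."""
    true_ctr = np.asarray(true_ctr, dtype=float)
    chosen_ads = np.asarray(chosen_ads, dtype=int)
    best = true_ctr.max()
    # Per-step regret is the gap between the best CTR and the chosen ad's CTR
    return float(np.sum(best - true_ctr[chosen_ads]))

# Hypothetical CTRs for three ads (assumed values, not from Ads_Clicks)
true_ctr = [0.10, 0.25, 0.05]
regret_best = expected_regret(true_ctr, [1, 1, 1, 1])    # always the best ad
regret_mixed = expected_regret(true_ctr, [0, 1, 2, 1])   # two suboptimal choices
```

Always showing the best ad gives zero regret, while every suboptimal choice adds the CTR gap for that step.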

In [68]:
import matplotlib.pyplot as plt
from operator import itemgetter
from statistics import mean
import time
import pandas as pd
import random
import numpy as np
from numpy import array
import gym_bandits
import gym

### Creating Dataframe

In [75]:
ads_clicks=pd.read_csv("Week-5/Ads_Clicks.csv")
ads_clicks

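| Unnamed: 0 | Ad 1 | Ad 2 | Ad 3 | Ad 4 | Ad 5 | Ad 6 | Ad 7 | Ad 8 | Ad 9 | Ad 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 9996 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9997 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9998 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |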


### Checking for null values

In [37]:
ads_clicks.isna().sum()

Ad 1     0
Ad 2     0
Ad 3     0
Ad 4     0
Ad 5     0
Ad 6     0
Ad 7     0
Ad 8     0
Ad 9     0
Ad 10    0
dtype: int64

### b. Compute the total rewards after 2000 time steps using the ε-greedy action for ε=0.01 and ε=0.3

In [88]:
# Initialize the cumulative reward for each ad to 0 and create one empty list per ε
# to store the running total at each time step
num_ads = 10
rewards = [0] * num_ads
total_rewards_01 = []
total_rewards_03 = []

In [84]:
# ε-greedy algorithm
def epsilon_greedy(epsilon, rewards):
    if random.uniform(0, 1) < epsilon:
        # Explore: choose a random ad (randint is inclusive at both ends)
        ad = random.randint(0, num_ads - 1)
    else:
        # Exploit: choose the ad with the highest cumulative reward so far
        ad = np.argmax(rewards)
    return ad

In [85]:
# Iterating the ε-greedy algorithm for 2000 time steps using ε=0.01 and ε=0.3

for t in range(2000):

    # Choose an ad for each ε; both calls read the same shared `rewards` list
    ad_01 = epsilon_greedy(0.01, rewards)
    ad_03 = epsilon_greedy(0.3, rewards)

    # ε = 0.01: observe the click for the chosen ad and update its cumulative reward
    reward = ads_clicks.iloc[t, ad_01]
    rewards[ad_01] = rewards[ad_01] + reward
    # Running total over the shared `rewards` list (clicks from both ε values combined)
    total_rewards_01.append(sum(rewards))

    # ε = 0.3: same update, again on the shared `rewards` list
    reward = ads_clicks.iloc[t, ad_03]
    rewards[ad_03] = rewards[ad_03] + reward
    total_rewards_03.append(sum(rewards))



In [87]:
print("Total rewards for ε=0.01: ", total_rewards_01[-1])
print("Total rewards for ε=0.3: ", total_rewards_03[-1])

Total rewards for ε=0.01:  643
Total rewards for ε=0.3:  643
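Both ε values above read and update one shared `rewards` list, which is why the two totals coincide. A minimal sketch of a variant that keeps separate state per ε and estimates each ad's value by its sample average is given here. The function name `run_epsilon_greedy`, the seeds, and the synthetic click matrix with its assumed CTRs are illustrative assumptions, not part of the original notebook; with the real data, the 0/1 matrix of the ten ad columns could be passed instead.

```python
import numpy as np

def run_epsilon_greedy(clicks, epsilon, steps, seed=0):
    """Replay `clicks` (steps x ads, 0/1) with an epsilon-greedy policy.

    Returns the total reward and the per-ad sample-average estimates.
    """
    rng = np.random.default_rng(seed)
    num_ads = clicks.shape[1]
    counts = np.zeros(num_ads)   # times each ad was shown
    values = np.zeros(num_ads)   # sample-average click estimate per ad
    total = 0
    for t in range(steps):
        if rng.random() < epsilon:
            ad = int(rng.integers(num_ads))   # explore: uniform random ad
        else:
            ad = int(np.argmax(values))       # exploit: best estimate so far
        reward = int(clicks[t, ad])
        counts[ad] += 1
        # Incremental mean: Q <- Q + (r - Q) / n
        values[ad] += (reward - values[ad]) / counts[ad]
        total += reward
    return total, values

# Synthetic stand-in for the dataset: 2000 steps, 10 ads, assumed CTRs
data_rng = np.random.default_rng(42)
assumed_ctr = np.linspace(0.05, 0.30, 10)
clicks = (data_rng.random((2000, 10)) < assumed_ctr).astype(int)

total_001, values_001 = run_epsilon_greedy(clicks, epsilon=0.01, steps=2000)
total_03, values_03 = run_epsilon_greedy(clicks, epsilon=0.3, steps=2000)
```

Each call starts from fresh `counts` and `values`, so the two totals reflect only the clicks collected under that ε.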


UCB is a smart exploration technique: instead of picking ads uniformly at random, it favours ads whose estimates are still uncertain, which can significantly improve our strategy for solving a MAB problem. For this run, the total reward for UCB with c=1.5 over 2000 time steps comes out to 323 (see part c).

After 1500 trials, UCB is already favouring Ad #5 (index 4), which happens to be the optimal ad and gets the maximum return for the given problem.

### c. Compute the total rewards after 2000 time steps using the Upper-Confidence-Bound action method for c=1.5

In [89]:
# Initialize the cumulative reward and selection count for each ad to 0,
# and create an empty list to store the running total at each time step:
rewards = np.zeros(num_ads)
n = np.zeros(num_ads)
total_rewards = []

In [90]:
# Upper-Confidence-Bound algorithm
def ucb(rewards, n, t, c=1.5):
    # Show every ad once first; this avoids dividing by zero while n[ad] == 0
    untried = np.flatnonzero(n == 0)
    if untried.size > 0:
        return untried[0]
    # Calculate the average reward for each ad
    average_rewards = rewards / n
    # Calculate the upper confidence bound for each ad
    ucb_values = average_rewards + c * np.sqrt(np.log(t + 1) / n)
    # Choose the ad with the highest UCB value
    ad = np.argmax(ucb_values)
    return ad

In [91]:
# Iterating over Upper-Confidence-Bound algorithm for 2000 time steps using c=1.5:
for t in range(2000):
    # Choose the ad using the UCB algorithm
    ad = ucb(rewards, n, t, c=1.5)

    # Get the reward for the chosen ad
    reward = ads_clicks.iloc[t, ad]
    # Update the rewards for the chosen ad
    rewards[ad] = rewards[ad] + reward
    # Update the number of times the ad has been selected
    n[ad] = n[ad] + 1
    # Add the reward to the total rewards list
    total_rewards.append(sum(rewards))

# Print the total rewards for c=1.5
print("Total rewards for c=1.5: ", total_rewards[-1])


Total rewards for c=1.5:  323.0
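To see why UCB keeps revisiting rarely shown ads, the exploration bonus can be evaluated on its own. This is a small sketch: the helper `ucb_bonus` is a hypothetical name, and it takes `t` as a 1-based step count, whereas the loop above passes a 0-based `t` and uses `log(t + 1)`.

```python
import math

def ucb_bonus(t, n, c=1.5):
    """Exploration bonus c * sqrt(ln(t) / n) for an ad shown n times by step t."""
    return c * math.sqrt(math.log(t) / n)

# At step t = 1000, an ad shown 10 times gets a larger bonus than one shown 500 times
bonus_rare = ucb_bonus(1000, 10)
bonus_common = ucb_bonus(1000, 500)
```

The bonus shrinks as `n` grows and grows only slowly with `t`, so an ad's UCB value approaches its average reward once it has been shown often.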




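### d. For all approaches, explain how the action value estimated compares to the optimal action
With c=1.5, the UCB action-value estimates are likely to converge to the optimal action faster than those of the ε-greedy algorithm with ε=0.01 or ε=0.3. ε-greedy explores uniformly at random: with ε=0.01 it rarely tries alternatives, so it can lock onto a suboptimal ad early, while with ε=0.3 it keeps spending about 30% of its steps on randomly chosen ads even after the best ad is apparent. UCB instead directs exploration toward ads that have been shown few times, and its exploration bonus shrinks as an ad is selected more often.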
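One way to check this comparison is to set an agent's estimates against the empirical click-through rates of the replayed data. The sketch below uses a hypothetical helper `compare_to_optimal` and a tiny hand-made click matrix with made-up estimates; it is an illustration under those assumptions, not a result from the notebook.

```python
import numpy as np

def compare_to_optimal(clicks, values):
    """Compare per-ad value estimates with the empirical CTRs of the replayed data."""
    empirical_ctr = clicks.mean(axis=0)
    optimal_ad = int(np.argmax(empirical_ctr))
    greedy_ad = int(np.argmax(values))
    return {
        "optimal_ad": optimal_ad,
        "greedy_ad": greedy_ad,
        "agrees": optimal_ad == greedy_ad,
        # Gap between the best ad's empirical CTR and its current estimate
        "estimate_error": float(abs(empirical_ctr[optimal_ad] - values[optimal_ad])),
    }

# Tiny hand-made example: ad 1 is clicked most often
clicks = np.array([[0, 1, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])
estimates = np.array([0.2, 0.7, 0.3])   # hypothetical estimates from some agent
report = compare_to_optimal(clicks, estimates)
```

Here `report["agrees"]` tells whether the agent's greedy choice matches the empirically best ad, and `estimate_error` shows how far its estimate for that ad is from the observed CTR.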