<h4>About</h4>

This dataset provides insights into user behavior and online advertising, specifically focusing on predicting whether a user will click on an online advertisement. It contains user demographic information, browsing habits, and details related to the display of the advertisement. This dataset is ideal for building binary classification models to predict user interactions with online ads.


<h4>Features</h4>

<b>id:</b> Unique identifier for each user.

<b>full_name:</b> User's name formatted as "UserX" for anonymity.

<b>age:</b> Age of the user (ranging from 18 to 64 years).

<b>gender:</b> The gender of the user (categorized as Male, Female, or Non-Binary).

<b>device_type:</b> The type of device used by the user when viewing the ad (Mobile, Desktop, Tablet).

<b>ad_position:</b> The position of the ad on the webpage (Top, Side, Bottom).

<b>browsing_history:</b> The user's browsing activity prior to seeing the ad (Shopping, News, Entertainment, Education, Social Media).

<b>time_of_day:</b> The time when the user viewed the ad (Morning, Afternoon, Evening, Night).

<b>click:</b> The target label indicating whether the user clicked on the ad (1 for a click, 0 for no click).
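
The feature descriptions above translate directly into a few sanity checks. The following is a minimal sketch, assuming the categories and the age range are exactly as listed; the helper name `check_schema` and the two-row sample frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Allowed values, copied from the feature descriptions above
ALLOWED = {
    "gender": {"Male", "Female", "Non-Binary"},
    "device_type": {"Mobile", "Desktop", "Tablet"},
    "ad_position": {"Top", "Side", "Bottom"},
    "browsing_history": {"Shopping", "News", "Entertainment", "Education", "Social Media"},
    "time_of_day": {"Morning", "Afternoon", "Evening", "Night"},
}

def check_schema(df):
    """Return a list of human-readable problems; missing values are ignored."""
    problems = []
    for col, allowed in ALLOWED.items():
        unexpected = set(df[col].dropna().unique()) - allowed
        if unexpected:
            problems.append(f"{col}: unexpected values {sorted(unexpected)}")
    ages = df["age"].dropna()
    if not ages.between(18, 64).all():
        problems.append("age: values outside 18-64")
    if not df["click"].isin([0, 1]).all():
        problems.append("click: values other than 0/1")
    return problems

# Hypothetical two-row sample (not taken from the dataset); the second row is deliberately invalid
sample = pd.DataFrame({
    "age": [22.0, 70.0],
    "gender": ["Female", None],
    "device_type": ["Desktop", "Phone"],
    "ad_position": ["Top", "Side"],
    "browsing_history": ["Shopping", "News"],
    "time_of_day": ["Afternoon", "Night"],
    "click": [1, 0],
})
print(check_schema(sample))
```

Running a check like this before any imputation can help separate genuinely missing values from values that are present but invalid.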

<h4>Goal</h4>

The objective is to predict whether a user will click on an online ad based on their demographics, browsing behavior, the context of the ad's display, and the time of day.
The workflow is to clean the data, explore it, and then train and evaluate machine learning models. This is a challenging task for data of this kind. The resulting models can be used to improve ad targeting strategies, optimize ad placement, and better understand user interaction with online advertisements.
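
The workflow described above (clean, encode, model, evaluate) can be sketched as a single scikit-learn pipeline. This is only an illustration under stated assumptions: the data is synthetic with random labels, it uses a reduced set of columns, and the names `toy`, `preprocess`, and `model` are hypothetical. It is not the approach taken in the cells that follow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (NOT the real dataset): random features and random labels
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n).astype(float),
    "device_type": rng.choice(["Mobile", "Desktop", "Tablet"], size=n),
    "ad_position": rng.choice(["Top", "Side", "Bottom"], size=n),
    "click": rng.integers(0, 2, size=n),
})
# Knock out some values to mimic the gaps seen in the real data
toy.loc[rng.choice(n, size=40, replace=False), "age"] = np.nan
toy.loc[rng.choice(n, size=40, replace=False), "device_type"] = np.nan

numeric = ["age"]
categorical = ["device_type", "ad_position"]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

X = toy.drop(columns=["click"])
y = toy["click"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
model.fit(X_tr, y_tr)  # imputers and encoder are fitted on the training rows only
accuracy = model.score(X_te, y_te)
print(f"toy accuracy: {accuracy:.3f}")  # labels are random, so expect roughly chance level
```

Keeping the imputation and encoding inside the pipeline means their statistics are learned from the training rows only, so no information from the test rows reaches the model.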

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import xgboost as xgb

from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

In [2]:
# Load the data
df = pd.read_csv('ad_click_dataset.csv')

# Inspect the data
df.head()

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,,Desktop,Top,Shopping,Afternoon,1
1,3044,User3044,,Male,Desktop,Top,,,1
2,5912,User5912,41.0,Non-Binary,,Side,Education,Night,1
3,5418,User5418,34.0,Male,,,Entertainment,Evening,1
4,9452,User9452,39.0,Non-Binary,,,Social Media,Morning,0


In [3]:
# Drop the 'full_name' column as it's not useful for prediction
df_cleaned = df.drop(columns=['full_name'])

# Handle missing values
# Impute 'age' with median
df_cleaned['age'] = df_cleaned['age'].fillna(df_cleaned['age'].median())

# Impute 'gender', 'device_type', 'ad_position', 'browsing_history', 'time_of_day' with their mode
for column in ['gender', 'device_type', 'ad_position', 'browsing_history', 'time_of_day']:
    df_cleaned[column] = df_cleaned[column].fillna(df_cleaned[column].mode()[0])

# Check if the missing values are handled
missing_values_after = df_cleaned.isnull().sum()

df_cleaned.head(), missing_values_after


(     id   age      gender device_type ad_position browsing_history  \
 0   670  22.0      Female     Desktop         Top         Shopping   
 1  3044  39.5        Male     Desktop         Top    Entertainment   
 2  5912  41.0  Non-Binary     Desktop        Side        Education   
 3  5418  34.0        Male     Desktop      Bottom    Entertainment   
 4  9452  39.0  Non-Binary     Desktop      Bottom     Social Media   
 
   time_of_day  click  
 0   Afternoon      1  
 1     Morning      1  
 2       Night      1  
 3     Evening      1  
 4     Morning      0  ,
 id                  0
 age                 0
 gender              0
 device_type         0
 ad_position         0
 browsing_history    0
 time_of_day         0
 click               0
 dtype: int64)

In [4]:
print(df_cleaned.isnull().sum())

id                  0
age                 0
gender              0
device_type         0
ad_position         0
browsing_history    0
time_of_day         0
click               0
dtype: int64
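
Mode imputation, as used above, makes originally missing entries indistinguishable from observed ones: in the first five rows, every missing `device_type` became `Desktop`. One alternative, sketched here on a hypothetical toy frame rather than the real data, is to record the missingness explicitly. The column name `age_missing` and the label `"Unknown"` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the real data) with gaps in both column types
toy = pd.DataFrame({
    "age": [22.0, np.nan, 41.0, 34.0],
    "device_type": ["Desktop", "Desktop", None, "Mobile"],
})

filled = toy.copy()
# Flag which ages were missing before filling them with the median
filled["age_missing"] = filled["age"].isna().astype(int)
filled["age"] = filled["age"].fillna(filled["age"].median())
# Treat a missing category as its own level instead of merging it into the mode
filled["device_type"] = filled["device_type"].fillna("Unknown")

print(filled)
```

Whether this helps depends on whether the missingness itself carries signal, which would need to be checked on the actual data.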


In [5]:
# Save the cleaned data to a new CSV file
df_cleaned.to_csv('cleaned_ad_click_dataset.csv', index=False)

In [6]:
# Separate the target label from the input features (note: 'id' remains among the inputs)
target = df_cleaned['click']
inputs = df_cleaned.drop(columns=['click'])

In [7]:
inputs

Unnamed: 0,id,age,gender,device_type,ad_position,browsing_history,time_of_day
0,670,22.0,Female,Desktop,Top,Shopping,Afternoon
1,3044,39.5,Male,Desktop,Top,Entertainment,Morning
2,5912,41.0,Non-Binary,Desktop,Side,Education,Night
3,5418,34.0,Male,Desktop,Bottom,Entertainment,Evening
4,9452,39.0,Non-Binary,Desktop,Bottom,Social Media,Morning
...,...,...,...,...,...,...,...
9995,8510,39.5,Female,Mobile,Top,Education,Morning
9996,7843,39.5,Female,Desktop,Bottom,Entertainment,Morning
9997,3914,39.5,Male,Mobile,Side,Entertainment,Morning
9998,7924,39.5,Female,Desktop,Bottom,Shopping,Morning


In [8]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; no random_state is set, so the split changes between runs
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2)

In [9]:
print(np.isnan(y_train).sum())
print(np.isnan(y_test).sum())

0
0


In [10]:
# Drop rows whose target is missing (a no-op here, since the check above found none)
X_train = X_train[~pd.isna(y_train)]
y_train = y_train[~pd.isna(y_train)]
X_test = X_test[~pd.isna(y_test)]
y_test = y_test[~pd.isna(y_test)]
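
The split above does not fix a random seed or stratify on the target, so the exact rows in each part, and therefore the reported score, can vary between runs. A minimal sketch of a reproducible, stratified split, using hypothetical toy data (`toy_inputs`, `toy_target`) in place of `inputs` and `target`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data (not the real dataset): 100 rows, 70% positive labels
toy_inputs = pd.DataFrame({"age": range(100)})
toy_target = pd.Series([1] * 70 + [0] * 30, name="click")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_inputs, toy_target,
    test_size=0.2,
    random_state=42,      # fixes the shuffle so reruns give the same split
    stratify=toy_target,  # keeps the click rate equal in both parts
)

print(y_tr.mean(), y_te.mean())
```

The value 42 is arbitrary; any fixed integer gives a repeatable split.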

In [11]:
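from sklearn.preprocessing import LabelEncoder

def encode_categorical(X_train, X_test):
    """Label-encode object columns, fitting each encoder on the training data only."""
    # Work on copies so the caller's frames (slices of `inputs`) are not modified in place
    X_train, X_test = X_train.copy(), X_test.copy()
    label_encoders = {}
    for col in X_train.select_dtypes(include=['object']).columns:
        le = LabelEncoder()
        X_train[col] = le.fit_transform(X_train[col])
        # transform() raises ValueError if the test set holds a category unseen in training
        X_test[col] = le.transform(X_test[col])
        label_encoders[col] = le
    return X_train, X_test, label_encoders

# Apply encoding
X_train, X_test, label_encoders = encode_categorical(X_train, X_test)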


In [12]:
label_encoders

{'gender': LabelEncoder(),
 'device_type': LabelEncoder(),
 'ad_position': LabelEncoder(),
 'browsing_history': LabelEncoder(),
 'time_of_day': LabelEncoder()}
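
The repr above does not show which integer each category received. `LabelEncoder` stores the sorted categories in `classes_`, and a category's code is its position in that array. A small sketch using the `time_of_day` values from the feature list; the standalone encoder `le` and the `mapping` dictionary are illustrative, not part of the notebook:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["Morning", "Afternoon", "Evening", "Night"])

# classes_ is sorted, so codes follow alphabetical order, not time order
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels
print(le.inverse_transform([0, 3]))
```

In the notebook, the same information should be available from the fitted encoders, for example `label_encoders['time_of_day'].classes_`. The integer codes carry an arbitrary order, which tree-based models can usually work with but which may mislead linear models.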

In [13]:
# from sklearn.naive_bayes import GaussianNB
# model = GaussianNB()
# model.fit(X_train, y_train)

In [14]:
# model.score(X_test, y_test)

In [15]:
from sklearn.ensemble import RandomForestClassifier
# Train a random forest with default hyperparameters and report accuracy on the held-out test set
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))

0.817
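
Accuracy alone can be hard to interpret when the classes are imbalanced, since always predicting the majority class may already score well. A hedged sketch of additional checks follows; the label and prediction arrays are made up for illustration and are unrelated to the 0.817 result above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up labels and predictions for illustration only
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 1, 1, 0, 0])

# Accuracy of always predicting the most frequent class
majority = np.bincount(y_true).argmax()
baseline = accuracy_score(y_true, np.full_like(y_true, majority))

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print(f"baseline={baseline:.2f} accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
print(cm)
```

In the notebook, `y_test` and `rf.predict(X_test)` would take the place of the made-up arrays; `accuracy_score` is already imported in the first cell.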


In [16]:
# from sklearn.model_selection import cross_val_score
# scores = cross_val_score(rf, X_train, y_train, cv=5)
# print(scores.mean())
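
A single train/test split gives one noisy estimate of performance. The commented-out cell above points toward cross-validation; the sketch below shows one way it might look with stratified folds. It runs on synthetic stand-in data (`X_toy`, `y_toy`), so the printed numbers say nothing about the ad click dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in features and labels (NOT the real data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Five stratified folds, shuffled with a fixed seed for repeatability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(clf, X_toy, y_toy, cv=cv, scoring="accuracy")

print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the score depends on the particular split.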