## Ad Click Prediction: What, why, and how?

The online advertising industry has become a multi-billion-dollar industry, and predicting ad CTR (click-through rate) is now central to it. Advertisers and search engines alike rely on modelling to predict ad CTR accurately.

We will predict the ad click-through rate using a machine learning approach. Before that, let us first understand a few essential concepts and the general practice search engines follow to decide which ads to display.

CTR: It is the metric used to measure the percentage of impressions that resulted in a click.

## CTR = (Number of Clicks / Number of Impressions) * 100%
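As a quick worked example of the definition (clicks divided by impressions, expressed as a percentage), here is a minimal sketch. The function name `click_through_rate` and the numbers are made up for illustration only.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return CTR as the percentage of impressions that resulted in a click."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return 100.0 * clicks / impressions


# Hypothetical numbers: 25 clicks out of 1,000 impressions.
example_ctr = click_through_rate(25, 1000)
print(f"CTR = {example_ctr:.1f}%")  # CTR = 2.5%
```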

-> Search ads: Advertisements that get displayed when a user searches for a particular keyword.

-> Paid search advertising is a popular form of pay-per-click (PPC) advertising in which brands or advertisers pay a bid amount to have their ads displayed when users search for specific keywords.

## Relevance of Predicting CTR through a real-life example:

-> Typically, the primary source of income for search engines like Google is advertising. Many companies pay these search engines to display their ads when a user searches for a particular keyword. Our focus is on search ads that are billed per click, i.e. the advertiser pays only when a user clicks on the link and is redirected to the brand’s website.

-> Different advertisers approach these search engines with their ads and the bidding amount to display their ads. The main objective of these search engines is to maximize their revenue. So the question is, how does a search engine decide which ads to display when a user searches for a particular keyword?

-> So far, we have seen what ad click prediction is and why it is important. Let us now explore how to predict ad clicks by fitting a machine learning model to a dataset. We will build a Logistic Regression model that predicts, from the features of a user, whether that user will click on an ad, and hence estimates the probability of a click.

-> Using these probabilities, search engines could decide which ads to display by multiplying each ad's click probability by its bid amount and sorting the ads by the result.
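The ranking idea described above can be sketched as follows. This is only an illustration of "probability times bid, then sort": the `Ad` type, the ad names, the bids and the predicted click probabilities are all hypothetical, and real search engines use more elaborate auction rules.

```python
from typing import NamedTuple


class Ad(NamedTuple):
    name: str
    bid: float            # amount paid per click (hypothetical units)
    predicted_ctr: float  # model's click probability, between 0 and 1


def rank_ads(ads):
    """Sort ads by expected revenue per impression (bid * probability), highest first."""
    return sorted(ads, key=lambda ad: ad.bid * ad.predicted_ctr, reverse=True)


ads = [
    Ad("ad_a", bid=4.0, predicted_ctr=0.10),
    Ad("ad_b", bid=2.0, predicted_ctr=0.30),
    Ad("ad_c", bid=1.0, predicted_ctr=0.50),
]
ranked = rank_ads(ads)
for ad in ranked:
    print(ad.name, round(ad.bid * ad.predicted_ctr, 2))
```

Note that the highest bidder (`ad_a`) does not come first here, because its low click probability gives it the smallest expected revenue.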

## Problem Statement

-> In this project, we will work with the advertising data of a marketing agency to develop a machine learning algorithm that predicts if a particular user will click on an advertisement.

The data consists of 10 variables:

## 'Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage', 'Ad Topic Line', 'City', 'Male', 'Country', 'Timestamp' and 'Clicked on Ad'.

-> The primary variable we are interested in is 'Clicked on Ad'.
This variable can have two possible outcomes: 0 and 1, where 0 refers to a user who didn't click the advertisement, while 1 refers to a user who clicked it.

-> We will see if we can use the other 9 variables to accurately predict the value of the 'Clicked on Ad' variable.

-> We will also perform some exploratory data analysis to see how 'Daily Time Spent on Site' in combination with 'Ad Topic Line' affects the user's decision to click on the ad.
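To make the prediction task concrete, here is a minimal sketch of logistic regression fitted by plain batch gradient descent in NumPy. It is not the project's actual model: the data is synthetic (a single made-up feature, not the advertising dataset), and the function names, learning rate and iteration count are assumptions chosen only for illustration. In practice a library implementation would normally be used instead.

```python
import numpy as np


def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_regression(X, y, lr=0.1, n_iter=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)          # predicted click probabilities
        error = p - y                   # gradient of log-loss w.r.t. the score
        w -= lr * (X.T @ error) / n_samples
        b -= lr * error.mean()
    return w, b


# Synthetic stand-in data (NOT the advertising dataset): one standardized
# feature whose lower values are, by construction, associated with clicks.
rng = np.random.default_rng(0)
x_clicked = rng.normal(loc=-1.0, scale=1.0, size=200)
x_not_clicked = rng.normal(loc=1.0, scale=1.0, size=200)
X = np.concatenate([x_clicked, x_not_clicked]).reshape(-1, 1)
y = np.concatenate([np.ones(200), np.zeros(200)])

w, b = fit_logistic_regression(X, y)
click_probability = sigmoid(X @ w + b)
predictions = (click_probability >= 0.5).astype(int)
accuracy = (predictions == y).mean()
print(f"weight={w[0]:.2f}, bias={b:.2f}, training accuracy={accuracy:.2f}")
```

The fitted model outputs a probability for each user, and thresholding that probability at 0.5 gives the 0/1 prediction for 'Clicked on Ad'.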

## Introduction

-> The goals of this project are to explore the advertising data in depth, perform quantitative analysis, and make predictions from the data using machine learning techniques.

## This data set contains the following features:

-> Daily Time Spent on Site: consumer time on-site in minutes

-> Age: customer age in years

-> Area Income: Avg. Income of geographical area of consumer

-> Daily Internet Usage: Avg. minutes a day consumer is on the internet

-> Ad Topic Line: Headline of the advertisement

-> City: City of consumer

-> Male: Whether or not the consumer was male

-> Country: Country of consumer

-> Timestamp: Time at which consumer clicked on Ad or closed window

-> Clicked on Ad: 0 or 1, indicating whether the consumer clicked on the ad

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set_style('whitegrid')
sns.set_context('notebook')

In [11]:
import os
os.chdir(r'C:\Users\ASUS\MediaEDA\Data')
df = pd.read_csv('advertising.csv')


In [13]:
df

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
0,68.95,35,61833.90,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 01:39:02,0
2,69.47,26,59785.94,236.50,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35:42,0
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,2016-01-10 02:31:19,0
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0,Iceland,2016-06-03 03:36:18,0
...,...,...,...,...,...,...,...,...,...,...
995,72.97,30,71384.57,208.58,Fundamental modular algorithm,Duffystad,1,Lebanon,2016-02-11 21:49:00,1
996,51.30,45,67782.17,134.42,Grass-roots cohesive monitoring,New Darlene,1,Bosnia and Herzegovina,2016-04-22 02:07:01,1
997,51.63,51,42415.72,120.37,Expanded intangible solution,South Jessica,1,Mongolia,2016-02-01 17:24:57,1
998,55.55,19,41920.79,187.95,Proactive bandwidth-monitored policy,West Steven,0,Guatemala,2016-03-24 02:35:54,0


## Data types and lengths of the variables

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       1000 non-null   int64  
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 78.3+ KB
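`df.info()` reports 'Timestamp' as `object`, i.e. the timestamps are stored as plain strings. If time-based features are wanted later, one possible approach is to parse the column with `pd.to_datetime`. The sketch below uses two timestamps copied from the data preview; the derived column names `Hour`, `Month` and `Weekday` are assumptions for illustration and do not exist in the dataset.

```python
import pandas as pd

# Two timestamps copied from the preview of the data.
sample = pd.DataFrame({
    "Timestamp": ["2016-03-27 00:53:11", "2016-04-04 01:39:02"],
})

# Parse the strings into a proper datetime column.
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"])

# Hypothetical derived features.
sample["Hour"] = sample["Timestamp"].dt.hour
sample["Month"] = sample["Timestamp"].dt.month
sample["Weekday"] = sample["Timestamp"].dt.day_name()

print(sample.dtypes)
print(sample)
```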


In [18]:
df.shape

(1000, 10)

## Next, let's see whether the data has duplicates

In [21]:
df.duplicated().sum()

0

The data looks good: there are no duplicate rows.

## Checking for missing values

Let's see whether there are any null values in the data.

In [26]:
for col in df.columns:
    # Count the missing values in each column (a raw count, not a percentage)
    msg = 'Column: {:<10}\t count of all the NAN value: {:.0f}'.format(col, df[col].isnull().sum())
    print(msg)

Column: Daily Time Spent on Site	 count of all the NAN value: 0
Column: Age       	 count of all the NAN value: 0
Column: Area Income	 count of all the NAN value: 0
Column: Daily Internet Usage	 count of all the NAN value: 0
Column: Ad Topic Line	 count of all the NAN value: 0
Column: City      	 count of all the NAN value: 0
Column: Male      	 count of all the NAN value: 0
Column: Country   	 count of all the NAN value: 0
Column: Timestamp 	 count of all the NAN value: 0
Column: Clicked on Ad	 count of all the NAN value: 0


The output shows that there are no null values in any of the columns.
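The same check can also be written without an explicit loop. The sketch below uses a small toy frame with deliberately missing entries (hypothetical values, not the advertising data) to show both the count and the percentage of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (hypothetical values).
toy = pd.DataFrame({
    "Age": [35, np.nan, 26, 29],
    "City": ["Wrightburgh", "West Jodi", None, None],
})

missing_count = toy.isnull().sum()           # number of missing values per column
missing_percent = 100 * toy.isnull().mean()  # share of missing values per column, in %

summary = pd.DataFrame({"count": missing_count, "percent": missing_percent})
print(summary)
```

On the advertising data, `df.isnull().sum()` would be expected to return 0 for every column, matching the output above.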

## Numerical and Categorical Variables Identification

In [35]:
df.columns

Index(['Daily Time Spent on Site', 'Age', 'Area Income',
       'Daily Internet Usage', 'Ad Topic Line', 'City', 'Male', 'Country',
       'Timestamp', 'Clicked on Ad'],
      dtype='object')

In [37]:
df.select_dtypes(include=['object']).columns

Index(['Ad Topic Line', 'City', 'Country', 'Timestamp'], dtype='object')

In [39]:
df.head()

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
0,68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 01:39:02,0
2,69.47,26,59785.94,236.5,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35:42,0
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,2016-01-10 02:31:19,0
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0,Iceland,2016-06-03 03:36:18,0


In [41]:
numerical_cols = ['Daily Time Spent on Site','Age','Area Income', 'Daily Internet Usage']

In [80]:
categorical_cols = ['Ad Topic Line', 'City', 'Male', 'Country', 'Clicked on Ad']

In [53]:
## Statistical description
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Daily Time Spent on Site,1000.0,65.0002,15.853615,32.6,51.36,68.215,78.5475,91.43
Age,1000.0,36.009,8.785562,19.0,29.0,35.0,42.0,61.0
Area Income,1000.0,55000.00008,13414.634022,13996.5,47031.8025,57012.3,65470.635,79484.8
Daily Internet Usage,1000.0,180.0001,43.902339,104.78,138.83,183.13,218.7925,269.96
Male,1000.0,0.481,0.499889,0.0,0.0,0.0,1.0,1.0
Clicked on Ad,1000.0,0.5,0.50025,0.0,0.0,0.5,1.0,1.0


-> Daily Time Spent on Site: Time spent on the site ranges from about 32 minutes to about 91 minutes, with a mean of 65 minutes, so users spend a substantial amount of time on the site. Let's see how this correlates with whether the ad is clicked.

-> Age: Ages range from 19 to 61 with a mean of 36, so the site is used mostly by adults.

-> Area Income: Area income ranges from about 14k to about 79k, suggesting that the users come from different financial and social classes. Let's see how this correlates with whether the ad is clicked.

-> Daily Internet Usage: Daily internet usage ranges from about 104 to about 270 minutes. Let's see how internet usage and time spent on the site together contribute to whether the ad is clicked.

-> Male: 48.1% of the users are male, so non-male users form a slight majority. Let's see how this contributes to whether the ad is clicked.

-> Clicked on Ad: 50% of the users clicked on the ad and the other half did not, so the dataset is balanced. This makes accuracy a meaningful evaluation metric for the model.
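The class balance and the relationship between a feature and the target can be checked directly. The sketch below uses a toy frame with made-up values (not the advertising data, and the direction of the effect in it is arbitrary) to show `value_counts(normalize=True)` for class shares and a `groupby` for per-class means.

```python
import pandas as pd

# Toy frame with hypothetical values, only to illustrate the two calls.
toy = pd.DataFrame({
    "Daily Time Spent on Site": [80.0, 75.0, 70.0, 45.0, 50.0, 40.0],
    "Clicked on Ad": [0, 0, 0, 1, 1, 1],
})

# Share of each class; 0.5 / 0.5 means the classes are balanced.
class_share = toy["Clicked on Ad"].value_counts(normalize=True).sort_index()

# Mean of the feature within each class.
mean_time_by_class = toy.groupby("Clicked on Ad")["Daily Time Spent on Site"].mean()

print(class_share)
print(mean_time_by_class)
```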

## Summarization of the numerical columns

In [68]:
df[numerical_cols].describe()

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage
count,1000.0,1000.0,1000.0,1000.0
mean,65.0002,36.009,55000.00008,180.0001
std,15.853615,8.785562,13414.634022,43.902339
min,32.6,19.0,13996.5,104.78
25%,51.36,29.0,47031.8025,138.83
50%,68.215,35.0,57012.3,183.13
75%,78.5475,42.0,65470.635,218.7925
max,91.43,61.0,79484.8,269.96


-> For 'Daily Time Spent on Site', the mean (65.0) and the median (68.2) are close, which suggests that the distribution is not strongly skewed, so we don't need to perform any data transformation.
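Comparing mean and median is an informal check. A slightly more direct one is the sample skewness, which pandas exposes as `DataFrame.skew()`. The sketch below is an illustration on toy columns with made-up values. The helper name `skew_report` is hypothetical.

```python
import pandas as pd


def skew_report(frame):
    """Return mean, median and sample skewness for each numeric column."""
    return pd.DataFrame({
        "mean": frame.mean(),
        "median": frame.median(),
        "skew": frame.skew(),
    })


# Toy columns with hypothetical values.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 20.0],
})
report = skew_report(toy)
print(report)
```

A skewness near 0 with mean close to median points to a roughly symmetric column. A clearly positive or negative value, with mean and median far apart, points to a skewed one. On the notebook's data the call would be `skew_report(df[numerical_cols])`.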

## Summarization of the categorical columns

In [76]:
# Print the number of unique values in every text (object) column
obj_columns = df.select_dtypes(include=['object']).columns
for col in obj_columns:
    print(col)
    print(df[col].nunique())
    print()

Ad Topic Line
1000

City
969

Country
237

Timestamp
1000



In [82]:
df[categorical_cols].describe(include=['O'])

Unnamed: 0,Ad Topic Line,City,Country
count,1000,1000,1000
unique,1000,969,237
top,Cloned 5thgeneration orchestration,Lisamouth,France
freq,1,3,9


-> The City column has 969 unique values in 1000 rows, and even the most frequent city appears only 3 times, so very few users share a city. This column is therefore unlikely to be a useful feature.

-> The Country column is less diverse (237 unique values, with the most frequent country appearing 9 times), so it needs further investigation.
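One way to quantify how "unique" a text column is, shown here as a sketch on a toy frame, is the ratio of distinct values to rows. The helper name `cardinality_report` and the toy rows are assumptions for illustration. A ratio close to 1 means almost every row has its own value.

```python
import pandas as pd


def cardinality_report(frame, columns):
    """Return the unique-value count and its share of rows for the given columns."""
    n_unique = frame[columns].nunique()
    return pd.DataFrame({
        "n_unique": n_unique,
        "unique_share": n_unique / len(frame),
    })


# Toy rows (hypothetical combinations of values).
toy = pd.DataFrame({
    "City": ["Lisamouth", "Lisamouth", "Wrightburgh", "Davidton"],
    "Country": ["France", "France", "France", "Italy"],
})
report = cardinality_report(toy, ["City", "Country"])
print(report)
```

Applied to the figures reported above, City would have a unique share of 969 / 1000 = 0.969 and Country 237 / 1000 = 0.237.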

## Categorizing qualitative and quantitative variables

-> This should give us a bigger picture of what's happening.