# Ad Click Prediction

The objective of this project is to identify which classification model achieves the highest accuracy score when predicting whether or not a user will click an advertisement.

Dataset source: https://www.kaggle.com/jahnveenarang/cvdcvd-vd

This file contains customer demographics and whether each customer clicked the ad or not. The classification algorithms use the customer demographics as independent variables to predict the click.

This data set contains the following features:

- 'User ID': unique identifier for the consumer
- 'Age': customer age in years
- 'EstimatedSalary': average income of the consumer
- 'Gender': whether the consumer was male or female
- 'Purchased': 0 or 1, indicating whether the consumer clicked on the ad

## Data Pre-processing

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
adverts = pd.read_csv(r'C:\Users\praja\Downloads\Ad Click Prediction\Social_Network_Ads.csv')

In [4]:
adverts.head()

    User ID  Gender  Age  EstimatedSalary  Purchased
0  15624510    Male   19            19000          0
1  15810944    Male   35            20000          0
2  15668575  Female   26            43000          0
3  15603246  Female   27            57000          0
4  15804002    Male   19            76000          0


In [9]:
adverts.isna().any().any()

False

In [5]:
# Assigning X and Y variables

X = adverts.iloc[:, :-1].values
y = adverts.iloc[:, -1].values
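
The slice above keeps `User ID` in `X`, even though it is an identifier rather than a demographic feature. A minimal sketch of one alternative, assuming the column names shown by `adverts.head()`; the small `demo` frame is a hypothetical stand-in for `adverts`, and the rest of this notebook does not use this variant:

```python
import pandas as pd

# Hypothetical stand-in for `adverts`, built from the first rows of adverts.head()
demo = pd.DataFrame({
    'User ID': [15624510, 15810944, 15668575],
    'Gender': ['Male', 'Male', 'Female'],
    'Age': [19, 35, 26],
    'EstimatedSalary': [19000, 20000, 43000],
    'Purchased': [0, 0, 0],
})

# Drop the identifier and the target to obtain the feature matrix
X_demo = demo.drop(columns=['User ID', 'Purchased']).values
y_demo = demo['Purchased'].values

print(X_demo.shape)  # (3, 3)
```

With this variant `Gender` moves to column index 0, so any later step that refers to columns by position would need to be adjusted.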

In [6]:
print(X)

[[15624510 'Male' 19 19000]
 [15810944 'Male' 35 20000]
 [15668575 'Female' 26 43000]
 ...
 [15654296 'Female' 50 20000]
 [15755018 'Male' 36 33000]
 [15594041 'Female' 49 36000]]


In [7]:
print(y)

[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 1
 1 1 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 1 0 0 1 1 0 1 0 0 1
 1 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 0 1 1 1 1 0 0 0
 1 1 0 1 1 1 1 1 0 0 0 1 1 0 0 1 0 1 0 1 1 0 1 0 1 1 0 1 1 0 0 0 1 1 0 1 0
 0 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 1 0 1
 1 1 0 1 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1]


## Encoding the categorical data in the independent variable column

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [11]:
print(X)

[[0.0 1.0 15624510 19 19000]
 [0.0 1.0 15810944 35 20000]
 [1.0 0.0 15668575 26 43000]
 ...
 [1.0 0.0 15654296 50 20000]
 [0.0 1.0 15755018 36 33000]
 [1.0 0.0 15594041 49 36000]]


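Gender has been one-hot encoded into two leading columns: the first is 1 for Female and the second is 1 for Male. Male therefore appears as `0.0 1.0` and Female as `1.0 0.0`.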
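
The column order produced by the encoder can be checked rather than read off the printed array. A small sketch on a hypothetical three-row frame (`toy` is an assumption, not the project data), using the same `ColumnTransformer` and `OneHotEncoder` setup:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with the same kind of Gender column
toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Female'], 'Age': [19, 26, 27]})

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), ['Gender'])],
    remainder='passthrough',
)
encoded = np.asarray(ct.fit_transform(toy))

# Categories are sorted, so the first encoded column is Female, the second Male
categories = list(ct.named_transformers_['encoder'].categories_[0])
print(categories)  # ['Female', 'Male']
print(encoded[0])  # first row is Male: one-hot columns 0 and 1, then Age
```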

## Splitting the dataset into the Training set and Test set

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
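
With `test_size = 0.2`, one fifth of the rows are held out for testing. A small sketch on placeholder arrays (the 400-row size matches the number of labels printed for `y` above; `X_demo` and `y_demo` are hypothetical stand-ins) shows the resulting shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 400 rows, standing in for X and y
X_demo = np.arange(400 * 2).reshape(400, 2)
y_demo = np.arange(400) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print(X_tr.shape, X_te.shape)  # (320, 2) (80, 2)
```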

In [13]:
print(X_train)

[[1.0 0.0 15699284 29 28000]
 [1.0 0.0 15599081 45 22000]
 [0.0 1.0 15747043 46 117000]
 ...
 [0.0 1.0 15706071 51 23000]
 [0.0 1.0 15646227 46 79000]
 [0.0 1.0 15689425 30 49000]]


In [14]:
print(X_test)

[[0.0 1.0 15755018 36 33000]
 [1.0 0.0 15697020 39 61000]
 [0.0 1.0 15796351 36 118000]
 [0.0 1.0 15665760 39 122000]
 [1.0 0.0 15794661 26 118000]
 [1.0 0.0 15717560 38 65000]
 [1.0 0.0 15680243 20 36000]
 [0.0 1.0 15596522 49 89000]
 [0.0 1.0 15669656 31 18000]
 [0.0 1.0 15638646 48 141000]
 [1.0 0.0 15644296 34 72000]
 [1.0 0.0 15629885 39 73000]
 [0.0 1.0 15674206 35 72000]
 [1.0 0.0 15575247 48 131000]
 [1.0 0.0 15611191 53 82000]
 [0.0 1.0 15685346 56 133000]
 [0.0 1.0 15774744 60 83000]
 [0.0 1.0 15728773 27 58000]
 [1.0 0.0 15667265 28 87000]
 [0.0 1.0 15593715 60 102000]
 [1.0 0.0 15724423 40 75000]
 [1.0 0.0 15780572 50 88000]
 [1.0 0.0 15715622 44 139000]
 [0.0 1.0 15622478 47 43000]
 [0.0 1.0 15617482 45 26000]
 [0.0 1.0 15809823 26 15000]
 [1.0 0.0 15574372 58 47000]
 [0.0 1.0 15708196 49 74000]
 [1.0 0.0 15778830 53 34000]
 [1.0 0.0 15794566 52 114000]
 [0.0 1.0 15668385 39 42000]
 [0.0 1.0 15804002 19 76000]
 [1.0 0.0 15578738 18 86000]
 [0.0 1.0 15727467 57 74000]
 ...

In [15]:
print(y_train)

[0 1 1 0 0 0 1 0 1 0 1 1 1 0 0 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0
 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
 1 1 0 0 0 1 0 1 1 0 0 0 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1
 0 1 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1
 0 1 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1
 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1
 0 1 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 1 0 0 1 1
 0 0 1 1 1 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 1 0 0 1 0 0 1
 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0]


In [16]:
print(y_test)

[0 0 1 1 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0
 0 1 0 1 1 0 0 1 1 1 1 0 1 0 0 1 1 1 0 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0
 0 1 0 0 0 0]


## Applying Feature Scaling

In [17]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 2:] = sc.fit_transform(X_train[:, 2:])
X_test[:, 2:] = sc.transform(X_test[:, 2:])
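
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation. A minimal sketch with made-up numbers (the `train` and `test` arrays are assumptions for illustration) confirms that `transform` applies (x - mean) / std using the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up age and salary values, for illustration only
train = np.array([[20.0, 30000.0], [40.0, 50000.0], [60.0, 70000.0]])
test = np.array([[50.0, 40000.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean and std from train
test_scaled = sc.transform(test)        # reuses the training statistics

# Manual standardisation with the training mean and (population) std
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))  # True
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.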

In [18]:
print(X_train)

[[1.0 0.0 0.06699777180367887 -0.8033008104771288 -1.1912179543545012]
 [1.0 0.0 -1.3281673745230116 0.7569799746681295 -1.368598012704171]
 [0.0 1.0 0.7319648109483793 0.8544975237397081 1.4399195778322673]
 ...
 [0.0 1.0 0.161495799289076 1.3420852690976013 -1.3390346696458928]
 [0.0 1.0 -0.6717353716017602 0.8544975237397081 0.31651254161769204]
 [0.0 1.0 -0.07027290050787709 -0.7057832614055501 -0.5703877501306569]]


In [19]:
print(X_test)

[[0.0 1.0 0.8430038221751425 -0.12067796697607829 -1.0434012390631098]
 [1.0 0.0 0.03547522366356702 0.17187468023865762 -0.21562763343131733]
 [0.0 1.0 1.418499176536999 -0.12067796697607829 1.4694829208905458]
 [0.0 1.0 -0.3997698535713695 0.17187468023865762 1.587736293123659]
 [1.0 0.0 1.3949686525278164 -1.0958534576918646 1.4694829208905458]
 [1.0 0.0 0.3214615923905535 0.074357131167079 -0.09737426119820415]
 [1.0 0.0 -0.1981174398287724 -1.6809587521213365 -0.9547112098882748]
 [0.0 1.0 -1.3637973218244897 1.147050170954444 0.6121459722004751]
 [0.0 1.0 -0.3455243378673608 -0.6082657123339715 -1.4868513849372842]
 [0.0 1.0 -0.7772885683553769 1.0495326218828653 2.1494398112309465]
 [1.0 0.0 -0.698621431874974 -0.3157130651192356 0.10956914020974394]
 [1.0 0.0 -0.8992713617544581 0.17187468023865762 0.13913248326802224]
 [0.0 1.0 -0.2821729270734081 -0.21819551604765694 0.10956914020974394]
 [1.0 0.0 -1.6600173799874223 1.0495326218828653 1.8538063806481635]
 ...

## Model 1: Logistic Regression

In [20]:
# Training the Logistic Regression model on the Training set

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [22]:
# Predicting the Test results

y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[0 0]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [1 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 0]
 [0 1]
 [0 0]
 [1 1]
 [1 0]
 [1 1]
 [1 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [1 1]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [0 1]
 [0 1]
 [1 1]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [0 1]
 [1 1]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [0 0]
 [1 0]
 [0 0]
 [0 1]
 [1 1]
 [0 0]
 [0 0]
 [1 0]
 [0 0]
 [1 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]]


In [23]:
# Making Confusion Matrix

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[40  8]
 [ 6 26]]


0.825

Logistic Regression Model has 82.5% accuracy.
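
The accuracy follows directly from the confusion matrix: correct predictions sit on the diagonal. A short worked sketch using the matrix printed above, assuming scikit-learn's layout (rows are true labels, columns are predicted labels); precision and recall are added here for illustration and are not part of the original comparison:

```python
import numpy as np

# Confusion matrix reported for Logistic Regression
cm = np.array([[40, 8],
               [6, 26]])

# scikit-learn layout for binary labels: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # (40 + 26) / 80
precision = tp / (tp + fp)        # share of predicted clicks that were real
recall = tp / (tp + fn)           # share of real clicks that were found

print(accuracy)  # 0.825
```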

## Model 2: K-NN

In [24]:
# Training the K-NN model on the Training set

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2) # these arguments are the defaults; Minkowski with p = 2 is Euclidean distance
classifier.fit(X_train, y_train)

KNeighborsClassifier()
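
The `metric = 'minkowski'` and `p = 2` settings mean neighbours are ranked by Euclidean distance. A small sketch of the Minkowski distance itself (`minkowski` is a hypothetical helper written for illustration, not a scikit-learn function):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(minkowski(a, b, 2))  # 5.0 (Euclidean)
print(minkowski(a, b, 1))  # 7.0 (Manhattan)
```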

In [26]:
# Confusion matrix

from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[41  7]
 [ 3 29]]


0.875

K-NN Model has 87.5% accuracy.

## Model 3: SVM

In [27]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

SVC(kernel='linear', random_state=0)

In [28]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[41  7]
 [ 7 25]]


0.825

SVM Model has 82.5% accuracy.

## Model 4: Kernel SVM

In [29]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

SVC(random_state=0)

In [30]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[39  9]
 [ 2 30]]


0.8625

Kernel SVM Model has 86.25% accuracy.

## Model 5: Naive Bayes

In [31]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB()

In [32]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[41  7]
 [ 5 27]]


0.85

Naive Bayes has 85% accuracy.

## Model 6: Decision Tree Classification

In [33]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=0)

In [34]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[41  7]
 [ 7 25]]


0.825

Decision Tree Classification has 82.5% accuracy.

## Model 7: Random Forest Classification

In [35]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0)

In [36]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[40  8]
 [ 2 30]]


0.875

Random Forest Classification has 87.5% accuracy.

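## Conclusion

K-NN and Random Forest tie for the highest accuracy score (87.5%), followed by Kernel SVM (86.25%) and Naive Bayes (85%). Logistic Regression, linear SVM and Decision Tree each reach 82.5%. On this test split, K-NN and Random Forest are therefore the best models for predicting whether the customer will click the ad.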
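
The seven models can also be compared in a single loop, which makes ties easier to spot. A sketch under stated assumptions: the data is synthetic (`make_classification` is used only as a placeholder for the project's `X` and `y`), so the printed scores will not match the figures reported in this notebook; the hyperparameters are the ones used in the sections above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholder data (assumption); substitute the notebook's X and y
X_demo, y_demo = make_classification(n_samples=400, n_features=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Scale with statistics learned from the training set only
sc = StandardScaler()
X_tr = sc.fit_transform(X_tr)
X_te = sc.transform(X_te)

models = {
    'Logistic Regression': LogisticRegression(random_state=0),
    'K-NN': KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2),
    'SVM': SVC(kernel='linear', random_state=0),
    'Kernel SVM': SVC(kernel='rbf', random_state=0),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
    'Random Forest': RandomForestClassifier(
        n_estimators=10, criterion='entropy', random_state=0
    ),
}

# Fit each model and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

# Print from best to worst
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```

A single 80-row test split gives a coarse estimate, since each row is worth 1.25 percentage points of accuracy, so small gaps between models should be read with caution.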