# Introduction

In this project, I use machine learning to analyze web click data and build a predictive model that helps businesses identify high-potential prospects and target their marketing more effectively. The project focuses on using predictive analytics in real time to decide whether a prospect has a high propensity to convert and, if so, to offer them a live chat with a sales agent.

# Importing Libraries

We'll start by importing the necessary libraries: pandas, numpy, matplotlib, and seaborn. The sklearn modules are imported later, in the cells where they are first used.

In [98]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Loading the Data 

Next, we'll load the training and prediction samples into pandas dataframes.

In [28]:
train_data=pd.read_csv('training_sample.csv')
prediction_data=pd.read_csv('testing_sample.csv')

In [29]:
train_data.head()

Unnamed: 0,UserID,basket_icon_click,basket_add_list,basket_add_detail,sort_by,image_picker,account_page_click,promo_banner_click,detail_wishlist_add,list_size_dropdown,...,saw_sizecharts,saw_delivery,saw_account_upgrade,saw_homepage,device_mobile,device_computer,device_tablet,returning_user,loc_uk,ordered
0,a720-6b732349-a720-4862-bd21-644732,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
1,a0c0-6b73247c-a0c0-4bd9-8baa-797356,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
2,86a8-6b735c67-86a8-407b-ba24-333055,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,1,0
3,6a3d-6b736346-6a3d-4085-934b-396834,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,1,0
4,b74a-6b737717-b74a-45c3-8c6a-421140,0,1,0,1,0,0,0,0,1,...,0,0,0,1,0,0,1,0,1,1


In [30]:
# To check for basic info about the data
print(train_data.describe())
print('\n')
train_data.info()  # info() prints its report directly and returns None, so no print() is needed

       basket_icon_click  basket_add_list  basket_add_detail        sort_by  \
count      455401.000000    455401.000000      455401.000000  455401.000000   
mean            0.099150         0.074521           0.112916       0.036849   
std             0.298864         0.262617           0.316490       0.188391   
min             0.000000         0.000000           0.000000       0.000000   
25%             0.000000         0.000000           0.000000       0.000000   
50%             0.000000         0.000000           0.000000       0.000000   
75%             0.000000         0.000000           0.000000       0.000000   
max             1.000000         1.000000           1.000000       1.000000   

        image_picker  account_page_click  promo_banner_click  \
count  455401.000000       455401.000000       455401.000000   
mean        0.026735            0.003570            0.016208   
std         0.161307            0.059647            0.126274   
min         0.000000            

In [31]:
# to check the dimensions of the dataset
train_data.shape

(455401, 25)

In [32]:
#check for missing values
train_data.isnull().sum()

UserID                     0
basket_icon_click          0
basket_add_list            0
basket_add_detail          0
sort_by                    0
image_picker               0
account_page_click         0
promo_banner_click         0
detail_wishlist_add        0
list_size_dropdown         0
closed_minibasket_click    0
checked_delivery_detail    0
checked_returns_detail     0
sign_in                    0
saw_checkout               0
saw_sizecharts             0
saw_delivery               0
saw_account_upgrade        0
saw_homepage               0
device_mobile              0
device_computer            0
device_tablet              0
returning_user             0
loc_uk                     0
ordered                    0
dtype: int64

# Exploring the Dataset

In [101]:
# checking the distribution of the target
train_data['ordered'].value_counts()

0    436308
1     19093
Name: ordered, dtype: int64
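
The counts above point to a heavily imbalanced target. As a quick check, the share of sessions that ended in an order can be computed directly from those two counts. The variable names below are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
no_order, order = 436308, 19093

total = no_order + order
positive_share = order / total

print(f"Total sessions: {total}")
print(f"Share that ordered: {positive_share:.2%}")
print(f"Non-orders per order: {no_order / order:.1f}")
```

A positive share this small is why accuracy alone can look flattering later on, and why the next step rebalances the classes.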

#### Handling imbalanced data

In [63]:
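from imblearn.over_sampling import RandomOverSampler
# Oversample the minority class in the training split only.
# X_train and y_train come from the train/test split in "Splitting Dataset", so run that split first.
ros=RandomOverSampler()
X_train,y_train=ros.fit_resample(X_train,y_train)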
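
To make the idea behind random oversampling concrete, here is a minimal sketch in plain numpy. It is not imblearn's implementation. The helper name `random_oversample` and the toy arrays are assumptions made for illustration only.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are the same size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    majority_count = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < majority_count:
            cls_idx = np.flatnonzero(y == cls)
            # sample with replacement to fill the gap to the majority size
            extra = rng.choice(cls_idx, size=majority_count - count, replace=True)
            keep.append(extra)
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy data: 8 negatives and 2 positives
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))  # class counts after balancing
```

Resampling is normally applied to the training split only, so that duplicated rows cannot leak into the test set and inflate the evaluation.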

#### Checking correlation between features and target

In [33]:
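# Correlation of each feature with the target, in percent.
# UserID is a string identifier, so drop it before calling corr() (recent pandas versions raise an error on non-numeric columns).
corr=pd.DataFrame(train_data.drop(columns='UserID').corr()['ordered']* 100)
corr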

Unnamed: 0,ordered
basket_icon_click,42.833414
basket_add_list,28.766577
basket_add_detail,41.44197
sort_by,5.463596
image_picker,7.149208
account_page_click,5.727908
promo_banner_click,5.653266
detail_wishlist_add,2.35165
list_size_dropdown,15.486702
closed_minibasket_click,14.00114


# Splitting Dataset

In [36]:
# separating the features (X) from the target (y)
X=train_data.drop(labels=['ordered','UserID'],axis=1)
y=train_data['ordered']

In [64]:
#splitting into train and test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.25,random_state=1)
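
With a target this imbalanced, a plain random split can leave the train and test sets with slightly different positive rates. One option, not used in the original notebook, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a 4% positive rate, similar to the imbalance in this dataset
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array([1] * 40 + [0] * 960)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

print("Positive rate in train:", y_tr.mean())
print("Positive rate in test: ", y_te.mean())
```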

# Building the Machine Learning Model

In [65]:
from sklearn.naive_bayes import GaussianNB
import sklearn
model=GaussianNB()
model=model.fit(X_train,y_train)

predictions=model.predict(X_test)

In [66]:
#Accuracy of predictions
from sklearn.metrics import accuracy_score, confusion_matrix
print('The accuracy of the model (%) is:' ,round(accuracy_score(y_test, predictions)*100,2))
print('\n')
print('Confusion Matrix:',confusion_matrix(y_test,predictions))

The accuracy of the model (%) is: 98.78


Confusion Matrix: [[107769   1326]
 [    59   4697]]


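The model reaches an accuracy of 98.78% on the held-out test set. Because only about 4% of sessions end in an order, accuracy alone is flattering: a model that always predicted "no order" would already score about 95.8% on this test set. The confusion matrix is more informative. Of the 4,756 customers who actually ordered, the model identified 4,697 (a recall of about 98.8%), and of the 6,023 customers it flagged as buyers, 4,697 actually ordered (a precision of about 78%).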
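
Accuracy is only one of several metrics that can be read off the confusion matrix above. The sketch below derives accuracy, precision, recall, and the accuracy of a trivial "always predict no order" baseline from the four printed counts. It assumes scikit-learn's usual layout, where rows are actual classes and columns are predicted classes.

```python
# Confusion matrix copied from the output above.
# scikit-learn's layout is [[TN, FP], [FN, TP]] (rows = actual, columns = predicted).
tn, fp = 107769, 1326
fn, tp = 59, 4697

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)     # of the predicted buyers, how many really ordered
recall = tp / (tp + fn)        # of the real buyers, how many were caught
baseline = (tn + fp) / total   # accuracy of always predicting "no order"

print(f"accuracy : {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall   : {recall:.4f}")
print(f"baseline : {baseline:.4f}")
```

For a live-chat use case, precision indicates how much agent time would be spent on visitors who do not end up ordering, while recall indicates how many real buyers would be reached.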

# Predictions for New customers browsing the website

The business would like to see in real time whether a customer browsing the website will end up buying a product. Consider a new visitor who has not yet done anything on the page: we assign a value of zero (no activity) to all 23 features in our dataset.

In [103]:
Customer2 = np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]).reshape(1, -1)
model.predict_proba(Customer2)[:,1] 

array([0.])

Now let us calculate the propensity for a prospect who has clicked on the account page, added items to the basket, and triggered most of the other events.

In [104]:
Customer3 = np.array([1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,0,1,1,1,0,1,0]).reshape(1, -1)
print('Propensity',model.predict_proba(Customer3)[:,1])

Propensity [1.]
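
Typing 23 positional zeros and ones by hand is error-prone, since it is easy to set the wrong flag or to mark two device types at once. A possible alternative is to build the row by feature name. In this sketch the helper `make_prospect` is hypothetical, and the feature order is assumed to match the column listing printed by `train_data.isnull().sum()` above.

```python
import pandas as pd

# The 23 feature names, in the order listed by train_data.isnull().sum() above
FEATURES = [
    "basket_icon_click", "basket_add_list", "basket_add_detail", "sort_by",
    "image_picker", "account_page_click", "promo_banner_click",
    "detail_wishlist_add", "list_size_dropdown", "closed_minibasket_click",
    "checked_delivery_detail", "checked_returns_detail", "sign_in",
    "saw_checkout", "saw_sizecharts", "saw_delivery", "saw_account_upgrade",
    "saw_homepage", "device_mobile", "device_computer", "device_tablet",
    "returning_user", "loc_uk",
]

def make_prospect(**events):
    """Return a one-row DataFrame: every feature is 0 unless passed as a keyword."""
    unknown = set(events) - set(FEATURES)
    if unknown:
        raise ValueError(f"Unknown feature(s): {sorted(unknown)}")
    row = {name: events.get(name, 0) for name in FEATURES}
    return pd.DataFrame([row], columns=FEATURES)

# A mobile visitor who added an item from the detail page and saw the checkout
prospect = make_prospect(basket_add_detail=1, saw_checkout=1, device_mobile=1)
print(prospect.loc[0, ["basket_add_detail", "saw_checkout", "device_mobile"]])
```

The resulting `prospect` dataframe could then be passed to `model.predict_proba(prospect)[:, 1]` in the same way as the arrays above.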


#### Using the prediction dataset

In [74]:
prediction_data.head(1)

Unnamed: 0,UserID,basket_icon_click,basket_add_list,basket_add_detail,sort_by,image_picker,account_page_click,promo_banner_click,detail_wishlist_add,list_size_dropdown,...,saw_sizecharts,saw_delivery,saw_account_upgrade,saw_homepage,device_mobile,device_computer,device_tablet,returning_user,loc_uk,ordered
0,9d24-25k4-47889d24-25k4-494b-398124,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0


In [76]:
new_prospects=prediction_data.drop(labels=['UserID','ordered'],axis=1)

In [84]:
#Subsetting the prospect data to see if a customer with the attributes will purchase or not
customer1=new_prospects[0:1]
print(customer1)
print('\n')
print('The propensity of customer1 with the above features is:',model.predict_proba(customer1)[:,1])

   basket_icon_click  basket_add_list  basket_add_detail  sort_by  \
0                  0                0                  0        0   

   image_picker  account_page_click  promo_banner_click  detail_wishlist_add  \
0             0                   0                   0                    0   

   list_size_dropdown  closed_minibasket_click  ...  saw_checkout  \
0                   0                        0  ...             0   

   saw_sizecharts  saw_delivery  saw_account_upgrade  saw_homepage  \
0               0             0                    0             0   

   device_mobile  device_computer  device_tablet  returning_user  loc_uk  
0              1                0              0               0       1  

[1 rows x 23 columns]


The propensity of customer1 with the above features is: [0.]


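Because the target variable is binary, model.predict only tells us whether a customer visiting the website is expected to purchase a product or not. The predict_proba method used above returns a continuous propensity score between 0 and 1 instead, which lets us set a threshold: for example, if a customer's propensity score is above 60%, the sales agent can engage that customer in a live chat. Note that the scores shown above are all exactly 0 or 1. Naive Bayes models tend to push probabilities toward these extremes, so calibrating the scores before choosing a threshold may be worthwhile.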
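
The threshold rule described above can be sketched as a small helper. The function name, the user IDs, and the scores below are made up for illustration. In the notebook the scores would come from `model.predict_proba(new_prospects)[:, 1]`.

```python
import numpy as np

CHAT_THRESHOLD = 0.60  # the 60% cut-off used as an example in the text

def prospects_to_engage(user_ids, propensities, threshold=CHAT_THRESHOLD):
    """Return the IDs whose propensity score is strictly above the threshold."""
    propensities = np.asarray(propensities, dtype=float)
    mask = propensities > threshold
    return [uid for uid, keep in zip(user_ids, mask) if keep]

# Made-up IDs and scores
ids = ["u1", "u2", "u3", "u4"]
scores = [0.05, 0.61, 0.60, 0.97]

print(prospects_to_engage(ids, scores))
```

The 60% value is only an example. In practice the cut-off would be tuned against agent capacity and the precision and recall the business is willing to accept.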