 #  <p style="text-align: center;">Predicting Prospect Propensity</p> 

In this example, we show how to predict the propensity of prospects to buy. This code example accompanies the lesson of the same title. We use web click data, that is, the links a user clicks while browsing, to predict that user's propensity to buy the product. Using that propensity, we decide whether to offer the customer a chat with an agent.

## Installing Dependencies

Install all the required packages for the exercises.

In [1]:
!pip install pandas
!pip install scikit-learn
!pip install matplotlib
!pip install apyori


## Loading and Viewing Data
We load the data file for this example and check out its columns and summary statistics.

In [12]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

prospect_data = pd.read_csv("Data-02-05-web-browsing-data.csv")

prospect_data.dtypes

SESSION_ID         int64
IMAGES             int64
REVIEWS            int64
FAQ                int64
SPECS              int64
SHIPPING           int64
BRO_TOGETHER       int64
COMPARE_SIMILAR    int64
VIEW_SIMILAR       int64
WARRANTY           int64
SPONSORED_LINKS    int64
BUY                int64
dtype: object

The data records which links on the website each user clicked during a browsing session. This is historical data that will be used to build the model.

- SESSION_ID : A unique identifier for the web browsing session
- BUY : Whether the prospect ended up buying the product (1) or not (0)
- Other columns : A 0 or 1 indicator showing whether the prospect visited that particular page or performed that activity


In [13]:
# Look at the top records to understand what the data looks like.
prospect_data.head()

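|  | SESSION_ID | IMAGES | REVIEWS | FAQ | SPECS | SHIPPING | BRO_TOGETHER | COMPARE_SIMILAR | VIEW_SIMILAR | WARRANTY | SPONSORED_LINKS | BUY |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1001 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1002 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1003 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1004 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1005 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |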


In [14]:
#Do summary statistics analysis of the data
prospect_data.describe()

|  | SESSION_ID | IMAGES | REVIEWS | FAQ | SPECS | SHIPPING | BRO_TOGETHER | COMPARE_SIMILAR | VIEW_SIMILAR | WARRANTY | SPONSORED_LINKS | BUY |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 1250.5 | 0.51 | 0.52 | 0.44 | 0.48 | 0.528 | 0.5 | 0.58 | 0.468 | 0.532 | 0.55 | 0.37 |
| std | 144.481833 | 0.500401 | 0.5001 | 0.496884 | 0.5001 | 0.499715 | 0.500501 | 0.494053 | 0.499475 | 0.499475 | 0.497992 | 0.483288 |
| min | 1001.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1125.75 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1250.5 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.5 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 75% | 1375.25 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| max | 1500.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### Perform Correlation Analysis

In [15]:
prospect_data.corr()['BUY']

SESSION_ID         0.026677
IMAGES             0.046819
REVIEWS            0.404628
FAQ               -0.095136
SPECS              0.009950
SHIPPING          -0.022239
BRO_TOGETHER      -0.103562
COMPARE_SIMILAR    0.190522
VIEW_SIMILAR      -0.096137
WARRANTY           0.179156
SPONSORED_LINKS    0.110328
BUY                1.000000
Name: BUY, dtype: float64

Looking at the correlations above, REVIEWS, BRO_TOGETHER, COMPARE_SIMILAR, WARRANTY and SPONSORED_LINKS have the strongest correlations with the target variable: each has an absolute correlation above 0.1. Only REVIEWS (0.40) is moderately correlated; the others are weak. We will reduce our feature set to those five variables.
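The selection can also be made programmatically by filtering on absolute correlation. The sketch below is only illustrative: the 0.1 cutoff is an assumption (the text does not state a threshold), chosen because it reproduces the five features listed, and the correlation values are copied from the output above rather than recomputed from the data file.

```python
import pandas as pd

# Correlations with BUY, copied from the output above.
corr_with_buy = pd.Series({
    "SESSION_ID": 0.026677,
    "IMAGES": 0.046819,
    "REVIEWS": 0.404628,
    "FAQ": -0.095136,
    "SPECS": 0.009950,
    "SHIPPING": -0.022239,
    "BRO_TOGETHER": -0.103562,
    "COMPARE_SIMILAR": 0.190522,
    "VIEW_SIMILAR": -0.096137,
    "WARRANTY": 0.179156,
    "SPONSORED_LINKS": 0.110328,
    "BUY": 1.000000,
})

THRESHOLD = 0.1  # assumed cutoff, not taken from the source

# Drop the target itself and the session identifier, then keep the
# features whose absolute correlation exceeds the cutoff.
candidates = corr_with_buy.drop(["BUY", "SESSION_ID"])
selected = candidates[candidates.abs() > THRESHOLD].index.tolist()
print(selected)
```

Note that FAQ and VIEW_SIMILAR fall just below this cutoff, so a slightly lower threshold would change the feature set.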

In [16]:
#Drop columns with low correlation
predictors = prospect_data[['REVIEWS','BRO_TOGETHER','COMPARE_SIMILAR','WARRANTY','SPONSORED_LINKS']]
targets = prospect_data.BUY


##  Training and Testing Split

We now split the data into training and testing sets in a 70:30 ratio.

In [17]:
pred_train, pred_test, tar_train, tar_test  =   train_test_split(predictors, targets, test_size=.3)

print( "Predictor - Training : ", pred_train.shape, "Predictor - Testing : ", pred_test.shape )


Predictor - Training :  (350, 5) Predictor - Testing :  (150, 5)
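Because the split above is random, the confusion matrix and probabilities that follow will differ from run to run. A possible variation, sketched here on synthetic stand-in data (the `demo_` names and the seed values are assumptions, not part of the original notebook), fixes `random_state` for repeatability and uses `stratify` to keep the share of buyers similar in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 500 sessions, 5 binary features,
# and 37% buyers (matching the BUY mean in the summary statistics).
rng = np.random.default_rng(0)
feature_names = ["REVIEWS", "BRO_TOGETHER", "COMPARE_SIMILAR",
                 "WARRANTY", "SPONSORED_LINKS"]
demo_predictors = pd.DataFrame(rng.integers(0, 2, size=(500, 5)),
                               columns=feature_names)
demo_targets = pd.Series(np.repeat([1, 0], [185, 315]), name="BUY")

# random_state makes the split repeatable; stratify preserves the class mix.
demo_pred_train, demo_pred_test, demo_tar_train, demo_tar_test = train_test_split(
    demo_predictors, demo_targets,
    test_size=0.3, random_state=42, stratify=demo_targets,
)

print(demo_pred_train.shape, demo_pred_test.shape)
print("Buyer share, train vs test:",
      round(demo_tar_train.mean(), 3), round(demo_tar_test.mean(), 3))
```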


## Build Model and Check Accuracy

In [27]:
from sklearn.naive_bayes import GaussianNB

classifier=GaussianNB()
classifier=classifier.fit(pred_train.values,tar_train.values)

predictions=classifier.predict(pred_test.values)

#Analyze accuracy of predictions
sklearn.metrics.confusion_matrix(tar_test,predictions)


array([[76, 14],
       [28, 32]], dtype=int64)

In [28]:
sklearn.metrics.accuracy_score(tar_test, predictions)

0.72
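Accuracy alone hides how the errors are distributed. As a worked example, the sketch below derives accuracy, precision and recall for the buyer class directly from the confusion matrix shown above, plus the accuracy of a naive baseline that always predicts "no purchase". It relies on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1).
tn, fp = 76, 14   # true non-buyers: predicted 0, predicted 1
fn, tp = 28, 32   # true buyers:     predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total    # share of all predictions that are correct
precision = tp / (tp + fp)      # of predicted buyers, how many actually bought
recall = tp / (tp + fn)         # of actual buyers, how many were identified
baseline = (tn + fp) / total    # accuracy of always predicting "no purchase"

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
```

On this split the model identifies roughly half of the actual buyers, and its accuracy is 12 points above the always-"no" baseline.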

Instead of making a Yes/No prediction, we can compute the probability that the prospect will buy the product.

In [30]:
pred_prob=classifier.predict_proba(pred_test.values)
pred_prob[0,1]

0.20030537586479427

The probability above can be read as a 20% chance that the first prospect in the test set will buy the product.
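Indexing column 1 works because `predict_proba` returns one column per class, in the order given by the fitted model's `classes_` attribute. The minimal sketch below uses a tiny made-up dataset (the `toy_` names and values are purely illustrative) to show how to look up the buyer column explicitly rather than assume its position.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny made-up dataset: two binary features, label 1 = bought.
toy_X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [1, 1], [0, 0]])
toy_y = np.array([0, 0, 1, 1, 1, 0])

toy_model = GaussianNB().fit(toy_X, toy_y)
toy_proba = toy_model.predict_proba(toy_X)

# Columns of predict_proba follow the order of classes_.
buy_column = list(toy_model.classes_).index(1)
buy_propensity = toy_proba[:, buy_column]

print("classes_:", toy_model.classes_)
print("propensity to buy:", buy_propensity.round(3))
```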

## Real-Time Predictions

Now that the model has been built, let us use it for real-time predictions. As the customer visits pages one by one, we collect the list of pages visited and use it to compute the probability. We recompute it for every new click that comes in.
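One way to picture the click collection is a small helper that turns the pages seen so far into the five-element vector the model expects. This is a sketch only: `clicks_to_features` and the sample click stream are hypothetical, and the feature order is the one used when the predictors were selected.

```python
# Feature order used when the predictors were selected for training.
FEATURE_ORDER = ["REVIEWS", "BRO_TOGETHER", "COMPARE_SIMILAR",
                 "WARRANTY", "SPONSORED_LINKS"]

def clicks_to_features(clicked_pages):
    """Return a 0/1 list marking which model features appear in the clicks."""
    clicked = set(clicked_pages)
    return [1 if name in clicked else 0 for name in FEATURE_ORDER]

# Hypothetical click stream; IMAGES is not a model feature, so it is ignored.
session_clicks = []
snapshots = []
for page in ["COMPARE_SIMILAR", "IMAGES", "REVIEWS"]:
    session_clicks.append(page)
    snapshots.append(clicks_to_features(session_clicks))

for vector in snapshots:
    print(vector)
```

Each printed vector could then be reshaped to a single row and passed to `predict_proba`.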

So let us start. The prospect has just arrived at your website and has made no significant clicks. Let us compute the probability. The array passed to the model holds the values for REVIEWS, BRO_TOGETHER, COMPARE_SIMILAR, WARRANTY and SPONSORED_LINKS, in that order, so it is all zeros to begin with.

In [31]:
browsing_data = np.array([0,0,0,0,0]).reshape(1, -1)
print("New visitor: propensity :",classifier.predict_proba(browsing_data)[:,1] )


New visitor: propensity : [0.04543485]


So the initial probability is about 5%. Now, suppose the customer compares similar products. The array changes to include a 1 for that feature. The new probability is:

In [32]:
browsing_data = np.array([0,0,1,0,0]).reshape(1, -1)
print("After checking similar products: propensity :",classifier.predict_proba(browsing_data)[:,1] )


After checking similar products: propensity : [0.11711883]


It goes up to 12%. Next, the customer checks out reviews.

In [33]:
browsing_data = np.array([1,0,1,0,0]).reshape(1, -1)
print("After checking reviews: propensity :",classifier.predict_proba(browsing_data)[:,1] )


After checking reviews: propensity : [0.562501]


It shoots up to 56%. You can set a threshold for when to offer chat, and keep checking this probability against that threshold to decide whether to pop up a chat window.
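The threshold check itself can be very small. In the sketch below, `should_offer_chat` and the 0.5 cutoff are hypothetical. A real threshold would depend on agent capacity and the value of a conversion. The propensities are the three values printed in the cells above.

```python
CHAT_THRESHOLD = 0.5  # hypothetical cutoff, to be tuned for the business

def should_offer_chat(propensity, threshold=CHAT_THRESHOLD):
    """Return True once the propensity to buy reaches the threshold."""
    return propensity >= threshold

# Propensities observed above: new visitor, after comparing similar
# products, and after also checking reviews.
observed = [0.04543485, 0.11711883, 0.562501]
decisions = [should_offer_chat(p) for p in observed]

for p, offer in zip(observed, decisions):
    print(f"propensity={p:.2f} -> offer chat: {offer}")
```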

This example shows how you can use predictive analytics in real time to decide whether a prospect has a high propensity to convert and, if so, offer a chat with a sales rep or agent.