# Data Science 101 - Lead Scoring Scenario

## Part 1: Problem

Today we are going to look at a very common data science problem found in e-commerce and web-based software or services (like Xero!): **lead scoring!**

### The Sales Process

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

1. Individuals land on the website, where they might browse the courses and watch some videos. If they are interested, they fill out a form for the course.

1. Once individuals fill out a form, they become a potential customer or **lead**.

1. Employees from the sales team contact leads by emailing or calling them, with the goal of getting the lead to sign up for the course. A sign-up is called a **conversion**.

### Situation

- Currently only 30% of leads contacted go on to become paying customers.
- The CEO wants to grow the customer base by 20% over the next year.

### Complication 

- The head of the sales & advertising department has the budget to increase advertising by 20% to generate more leads, but the team would not have enough resources to keep calling every lead that signs up.

### Question

- The department head knows that a conversion rate of 30% means a lot of time and money is wasted on leads who never intend to sign up.
- They have asked you whether there is a way to call only the 'hottest' leads and get a better return on the sales calls they make.

### Answer 

- You tell the department head it is possible to use historical data on previous leads who did and did not become customers to build a model that scores how likely a lead is to convert!

Your idea for lead scoring gets approved - now you need to look at the data you have and decide what model you are going to build!

## Part 2: Tools

1. Python
1. Jupyter notebook
1. Google Colab

### Importing packages 

The data scientist's toolkit in Python usually requires the following:

In [None]:
# package for multi-dimensional arrays and matrices
import numpy as np

# package for data manipulation and analysis
import pandas as pd

# packages for creating plots and graphs 
import matplotlib.pyplot as plt
import seaborn as sns

# visualisation
from matplotlib.pyplot import xticks
%matplotlib inline

## Part 3: Load Data

In [None]:
# read in the data into a pandas dataframe
leads_df = pd.read_csv('Leads - Leads.csv')

In [None]:
# class of the object
type(leads_df)

In [None]:
# how many rows and columns 
leads_df.shape

In [None]:
leads_df.head()

In [None]:
leads_df.tail()

In [None]:
leads_df.describe()

In [None]:
leads_df.describe(include = 'all')

In [None]:
leads_df.columns

In [None]:
leads_df.info()

In [None]:
#checking duplicates
sum(leads_df.duplicated(subset = 'Prospect ID')) == 0

In [None]:
# selecting columns
leads_df.Specialization

In [None]:
leads_df['Specialization']

## Part 5: Data Cleaning

In [None]:
# Some columns contain 'Select' values, which are not meaningful categories (e.g. for occupation).
# They appear because the lead did not pick an option from the list, so the form recorded the default 'Select'.

# Change 'Select' values to NaN.
leads_df = leads_df.replace('Select', np.nan)

In [None]:
leads_df.isnull().sum()

In [None]:
round(100*(leads_df.isnull().sum()/len(leads_df.index)), 2)
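The two steps above (treating `Select` as missing, then measuring how much is missing per column) can be tried on a small made-up table. This is only a sketch: the `toy` dataframe and its values are invented for illustration and are not taken from the leads data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for leads_df (values are made up for illustration)
toy = pd.DataFrame({
    "Specialization": ["Finance", "Select", "Marketing", "Select"],
    "TotalVisits": [3.0, np.nan, 5.0, 2.0],
})

# Treat the placeholder 'Select' as a missing value
toy = toy.replace("Select", np.nan)

# The mean of a boolean mask is the fraction of missing values per column
missing_pct = (toy.isnull().mean() * 100).round(2)
print(missing_pct)
```

Using `isnull().mean()` gives the same percentages as dividing `isnull().sum()` by the number of rows, in a single step.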

In [None]:
# let's look at the Specialization column
leads_df['Specialization'].value_counts(dropna=False)

In [None]:
# It may be the case that the lead did not enter a specialization because their option was not available in the list.
# We can create a category 'Other' for missing values.
leads_df['Specialization'] = leads_df['Specialization'].replace(np.nan, 'Other')

## Part 6: Data Exploration

- `Prospect ID` is the identifier variable for each row of the data

- `Converted` is our **target** variable (also known as the response variable), i.e. the variable that we are interested in predicting.

    - It is binary: the event either happens or it doesn't, i.e. 1 or 0

- We have a mix of 7 categorical and continuous variables we can use as **predictor** variables (also called explanatory variables, features, input variables or independent variables)

We can explore the data to see if there are any correlations between the target and predictor variables.

First, for our categorical variables, let's look at side-by-side bar plots:

In [None]:
# lead Origin
sns.countplot(x = "LeadOrigin", hue = "Converted", data = leads_df)
xticks(rotation = 90)

In [None]:
# lead source
fig, axs = plt.subplots(figsize = (15,7.5))
sns.countplot(x = "LeadSource", hue = "Converted", data = leads_df)
xticks(rotation = 90)

In [None]:
# Specialization
fig, axs = plt.subplots(figsize = (15,7.5))
sns.countplot(x = "Specialization", hue = "Converted", data = leads_df)
xticks(rotation = 90)

In [None]:
# Occupation
fig, axs = plt.subplots(figsize = (15,7.5))
sns.countplot(x = "Occupation", hue = "Converted", data = leads_df)
xticks(rotation = 90)
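Count plots show volumes, so a category with many leads can hide a low conversion rate. As a complement, the conversion rate per category can be tabulated directly. The sketch below uses a small invented table; the column names mirror `LeadSource` and `Converted`, but the rows are made up.

```python
import pandas as pd

# Toy data (made up) with the same column names used in the plots above
toy = pd.DataFrame({
    "LeadSource": ["Google", "Google", "Referral", "Referral", "Referral", "Direct"],
    "Converted":  [1, 0, 1, 1, 0, 0],
})

# Because Converted is 0/1, its mean within a group is that group's conversion rate
rates = (
    toy.groupby("LeadSource")["Converted"]
       .agg(leads="count", conversion_rate="mean")
       .sort_values("conversion_rate", ascending=False)
)
print(rates)
```

Keeping the `leads` count next to the rate helps avoid over-reading rates that are based on only a handful of leads.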

For our continuous variables we can look at box plots:

In [None]:
# Total Visits
sns.boxplot(y = 'TotalVisits', x = 'Converted', data = leads_df)

In [None]:
# Total Time Spent on Website
sns.boxplot(y = 'TotalTime', x = 'Converted', data = leads_df)

In [None]:
# Page Views Per Visit
sns.boxplot(y = 'PageViews', x = 'Converted', data = leads_df)

## Part 7: Model Building 

As we said above, we have a binary target that we want to predict and a mix of 7 categorical and continuous variables we can use as predictor variables.

Because our target is binary (0 or 1) this is a **classification** problem - meaning that we want to predict whether a lead will convert or not.

So we need to pick a model for classification. The first choice for many binary classification problems is a **logistic regression** model because of its simplicity and interpretability.

There are other options like **random forests** or **neural networks** that can provide better accuracy, but depending on the data, logistic regression can perform very well!
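To make the idea concrete: logistic regression computes a weighted sum of the predictors and passes it through the logistic (sigmoid) function, which squeezes any number into a probability between 0 and 1. The sketch below is illustrative only; the intercept, coefficients and feature values are hypothetical and not fitted to the leads data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficients (not fitted to any data)
intercept = -1.0
coef = np.array([0.8, 0.5])

# Hypothetical feature values for a single lead
x = np.array([2.0, 1.0])

z = intercept + coef @ x   # linear score: -1.0 + 1.6 + 0.5 = 1.1
p = sigmoid(z)             # predicted probability of conversion
print(round(float(p), 3))
```

A score of 0 maps to a probability of exactly 0.5, positive scores map above 0.5 and negative scores below it, which is why the sign and size of each coefficient are easy to interpret.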

### Training-testing split

The next part of building a predictive model is to split the data into a **training** set and a **testing** set.
This lets us withhold data from the model when training so that we can test its performance on data it has not seen before - just like it will be doing for our lead scoring.

- Train 80%
- Test 20%

We also need to transform the data to get it ready for modelling:
- Remove the ID label (not useful as a predictor)
- Split out the target column
- For categorical variables with multiple levels, create dummy features (one-hot encoded)
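The dummy-feature step in the list above can be seen on a tiny example. This is a sketch with invented rows; only the column names `Occupation` and `TotalVisits` are borrowed from the notebook.

```python
import pandas as pd

# Toy data (made up) with one categorical and one numeric column
toy = pd.DataFrame({
    "Occupation": ["Student", "Employed", "Unemployed", "Employed"],
    "TotalVisits": [2, 5, 1, 3],
})

# One column per category level; drop_first=True drops the first level
# (alphabetically 'Employed'), which becomes the reference category
dummies = pd.get_dummies(toy[["Occupation"]], drop_first=True)

# Combine the numeric column with the dummy columns
X_toy = pd.concat([toy[["TotalVisits"]], dummies], axis=1)
print(X_toy.columns.tolist())
# ['TotalVisits', 'Occupation_Student', 'Occupation_Unemployed']
```

Dropping one level avoids redundant columns: a row with all dummy columns equal to zero is understood to belong to the reference category.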

In [None]:
# package for training models 
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split

In [None]:
y = leads_df['Converted']

In [None]:
dummy1 = pd.get_dummies(leads_df[['LeadOrigin', 'LeadSource', 'Specialization', 'Occupation']], drop_first=True)
dummy1.head()

In [None]:
# Adding the results to the master dataframe
X = pd.concat([leads_df[['TotalVisits', 'TotalTime', 'PageViews']], dummy1], axis=1)
X.head()

In [None]:
# create training and testing vars
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
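The split above is random, so the numbers change on every run. If reproducible results are wanted, one option is to fix `random_state`, and `stratify` keeps the share of converted leads the same in both sets. The sketch below uses synthetic data; `X_toy` and `y_toy` are stand-ins, not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 30% positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 30 + [0] * 70)

# random_state makes the split repeatable;
# stratify keeps the 30/70 class balance in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```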

In [None]:
# fit a model 
lm = linear_model.LogisticRegression()
model = lm.fit(X_train, y_train)
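One practical caveat: `LogisticRegression` does not accept missing values, so fitting fails if any predictor column still contains `NaN`. A possible way to handle this, sketched below on synthetic data, is to chain imputation, scaling and the model in a scikit-learn pipeline. The median imputation strategy and `max_iter=1000` are assumptions chosen for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the first feature drives the outcome
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Knock out roughly 5% of the values to mimic missing data
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan

# Impute missing values, put features on a common scale, then fit the model
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_toy, y_toy)

# Predicted probability of the positive class for each row
probs_toy = pipe.predict_proba(X_toy)[:, 1]
print(probs_toy[:5])
```

Because the imputer and scaler live inside the pipeline, they are fitted on the training data only and then reapplied unchanged to any new data passed to `predict` or `predict_proba`.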

In [None]:
# in sample accuracy 
metrics.accuracy_score(y_train, model.predict(X_train))

In [None]:
# predictions on testing dataset
y_test_pred = model.predict(X_test)

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_test, y_test_pred)

confusion

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_test, y_test_pred)

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
se = TP / float(TP+FN)
# Let us calculate specificity
sp = TN / float(TN+FP)
# Calculate false positive rate - predicting conversion when the customer has not converted
fpr = FP/ float(TN+FP)
# positive predictive value 
ppv = TP / float(TP+FP)
# Negative predictive value
npv = TN / float(TN+ FN)
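These formulas can be checked by hand on a small example. The sketch below uses ten made-up labels and predictions, and compares the hand-computed values against scikit-learn's `recall_score` and `precision_score`, which compute sensitivity and positive predictive value respectively.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predictions: 4 actual conversions, 6 non-conversions
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary 0/1 labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # 3 / (3 + 1) = 0.75
specificity = tn / (tn + fp)   # 5 / (5 + 1)
precision = tp / (tp + fp)     # 3 / (3 + 1) = 0.75, the positive predictive value

# Cross-check against scikit-learn's built-in metrics
print(sensitivity, metrics.recall_score(y_true, y_pred))
print(precision, metrics.precision_score(y_true, y_pred))
print(specificity)
```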

## Part 8: Lift

The great thing about logistic regression is that it gives us the predicted probability of each observation falling into each class (in our case conversion = 0 or conversion = 1).

We can use these probabilities to rank our leads into deciles, from the 'hottest' to the coldest, and compare the number of actual conversions in each decile.

In [None]:
# getting predicted probabilities 
probs = pd.DataFrame(model.predict_proba(X_test), columns = ['prob_nc', 'prob_c'])

# joining actual outcomes
probs['actual'] = y_test.reset_index(drop=True)

probs.head()

In [None]:
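# cumulative fraction of leads at each decile (0.1, 0.2, ..., 1.0), used later for the lift chart
d = np.linspace(0.1,1,10).round(1)
# cut data into deciles of predicted conversion probability
probs['deciles'] = pd.qcut(probs.prob_c, 10)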

In [None]:
# compare actual conversion in each decile
lift = probs.groupby(probs.deciles)['actual'].agg(["sum", "count"]).reset_index()

# existing conversion rate (computed after `lift` has been built)
x = lift['sum'].sum()/lift['count'].sum()

# calculate conversion probs
lift['prob_con'] = lift['sum']/lift['count']

# get cumulative counts and proportions, starting from the hottest decile
lift['sum_c'] = lift['sum'].iloc[::-1].cumsum()
# divide by the total number of conversions in the test set rather than a hard-coded value
lift['prop_c'] = lift['sum_c']/lift['sum'].sum()

# baseline - conversions expected in each decile if leads were selected at random
lift['old'] = lift['count']*x

In [None]:
lift

In [None]:
ax = plt.gca()

ax.bar(lift.index, lift['sum'].iloc[::-1])
ax.plot(lift.index, lift['old'].iloc[::-1], color = 'red')
plt.xticks(lift.index, labels=lift.index+1)
plt.title("Waterfall Analysis")
plt.xlabel("Deciles")
plt.ylabel("Conversions")

plt.show()

In [None]:
ax = plt.gca()

ax.plot(lift.index, lift['prop_c'].iloc[::-1]*100, marker='o')
ax.plot(lift.index, d*100, color = 'red', marker='o')
plt.xticks(lift.index, labels=(d*100).round().astype(int))
plt.title("Lift Chart")
plt.xlabel("% of Leads")
plt.ylabel("% of Conversions")

plt.show()
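The decile ranking and cumulative steps above can also be wrapped in a small reusable function. The sketch below is one possible arrangement rather than the notebook's own method: `gains_table` is a hypothetical helper name, the data is synthetic, and leads are ranked before binning so that tied probabilities cannot produce duplicate bin edges in `pd.qcut`.

```python
import numpy as np
import pandas as pd

def gains_table(y_true, y_prob, n_bins=10):
    """Cumulative gains and lift per bin, where bin 1 holds the hottest leads."""
    df = pd.DataFrame({"actual": np.asarray(y_true), "prob": np.asarray(y_prob)})
    # Rank from highest to lowest probability; method='first' breaks ties by order
    ranks = df["prob"].rank(method="first", ascending=False)
    df["decile"] = pd.qcut(ranks, n_bins, labels=False) + 1
    out = df.groupby("decile")["actual"].agg(conversions="sum", leads="count")
    out["cum_pct_conversions"] = out["conversions"].cumsum() / out["conversions"].sum() * 100
    out["cum_pct_leads"] = out["leads"].cumsum() / out["leads"].sum() * 100
    # Lift > 1 means the model finds conversions faster than random calling
    out["lift"] = out["cum_pct_conversions"] / out["cum_pct_leads"]
    return out

# Synthetic example: each lead converts with its own predicted probability
rng = np.random.default_rng(1)
prob = rng.random(1000)
actual = (rng.random(1000) < prob).astype(int)

table = gains_table(actual, prob)
print(table)
```

Reading the table row by row answers the department head's question directly: the `cum_pct_conversions` value in row *k* is the share of all conversions that would be captured by calling only the top *k* deciles of leads.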