# Data Science 101 - Lead Scoring Scenario

## Part 1: Problem

Today we are going to look at a very common data science problem in e-commerce and web-based software or services (like Xero!): **lead scoring!**

### The Sales Process

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

1. Individuals land on the website. They might browse the courses and watch some videos, and if they are interested they fill out a form for the course.

1. Once individuals fill out a form, they become a potential customer, or **lead**.

1. Employees from the sales team contact leads by email or phone with the goal of getting the lead to sign up for the course. This is called a **conversion**.

### Situation

- Currently only 38% of leads contacted go on to become paying customers.
- The CEO wants to grow the customer base by 20% over the next year.

### Complication 

- The head of the sales & advertising department has the budget to increase advertising by 20% to generate more leads, but they would not have enough resources to keep calling every lead that signs up.

### Question

- The department head knows that a conversion rate of 38% means a lot of time and money is wasted on leads who never intend to sign up.
- They have asked you whether there is a way to call only the 'hottest' leads and get a better return on the number of sales calls they make.

### Answer 

- You tell the department head that it is possible to use historical data on previous leads, both those who became customers and those who did not, to build a model that scores how likely a lead is to convert!

Your idea for lead scoring gets approved - now you need to look at the data you have and decide what model you are going to build!

## Part 2: Tools

Today we are going to be using a few different tools in our lead scoring scenario:

### Python

In [None]:
# some basic python commands 

# create a list
x = [1, 2, 3, 4, 5]

print(x)

In [None]:
min(x)

In [None]:
max(x)

`min`, `max` and `print` are what we call **functions**. These are pre-programmed commands that carry out common calculations or manipulations.
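
You can also write your own functions with `def`. As a small sketch (the name `value_range` and the list `numbers` are made up for this example and are not built into Python):

```python
# a hypothetical helper function, written just for illustration
def value_range(values):
    """Return the gap between the largest and smallest value."""
    return max(values) - min(values)

numbers = [1, 2, 3, 4, 5]
spread = value_range(numbers)
print(spread)
```

Here our own function simply reuses the built-in `max` and `min` functions inside it.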

### Jupyter Notebook in Google Colab

A Jupyter notebook is made up of **cells**, which let you run one bit of code at a time.

You can run a cell by pressing `ctrl` + `enter`.

Or, because we are hosting the notebook in Google Colab, you can click the `play` button at the top of the cell.

You can add a new text or code cell using the buttons at the top of the notebook.

### Importing packages

To do data science we need a bit more functionality than is available in base Python.

We get this by importing packages that have extra functions for data analysis, statistics and machine learning.

The data scientist's toolkit generally includes the following packages:

In [None]:
# package for multi-dimensional arrays and matrices
import numpy as np

# package for data manipulation and analysis
import pandas as pd

# packages for creating plots and graphs 
import matplotlib.pyplot as plt
import seaborn as sns

# visualisation
from matplotlib.pyplot import xticks
%matplotlib inline

# ignoring any warnings for this tutorial 
import warnings
warnings.filterwarnings("ignore")

We use the `import` statement to load the extra functions into our Python session.

The `as` keyword gives the package a shorter alias, so we can call `pd.function_name()` instead of `pandas.function_name()`.

## Part 3: Load Data

We use functions from the `pandas` package to load our data from a CSV file. Then it's important to check that we have in fact loaded the data correctly!

We can use functions to look at the number of rows and columns and, because the data is generally too big to look at all at once, we can look at just the first and last rows.

It's also very useful to look at the type of data in each column and check for any duplicates or missing/Null values! 

In [None]:
# read in the data into a pandas dataframe
leads_df = pd.read_csv('https://raw.githubusercontent.com/hjamau/ds-leadscore-tutorial/master/Leads.csv')

In [None]:
# class of the object
type(leads_df)

In [None]:
# how many rows and columns 
leads_df.shape

In [None]:
# top 10 rows of the dataframe - head()
leads_df.head(10)

In [None]:
# bottom 5 rows of the dataframe - tail()
leads_df.tail(5)

In [None]:
# describe 
leads_df.describe()

In [None]:
leads_df.describe(include = 'all')

In [None]:
# column names
leads_df.columns

In [None]:
# selecting columns
leads_df.Specialization

In [None]:
# this also selects a column
leads_df['Specialization']

In [None]:
# We can also subset a dataframe to specific columns 
leads_df[['Specialization', 'Converted']]

In [None]:
# We can do multiple functions in one command
# let’s use the converted column, sum and shape functions to check existing conversion rate in data set
sum(leads_df.Converted)/leads_df.shape[0]

In [None]:
# Let's use the sum and duplicated functions to do a very important step!
# checking duplicates!!
sum(leads_df.duplicated())

In [None]:
# look at null/missing values 
leads_df.isnull().sum()
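
The checks above can also be tried on a tiny dataframe where the answers are easy to verify by eye. This is only a sketch: `toy_df` and its values are invented stand-ins for `leads_df`. It also shows `dtypes` for looking at the type of data in each column, and uses `mean()` as a shortcut for the conversion rate.

```python
import pandas as pd

# toy stand-in for leads_df; the values are invented for illustration
toy_df = pd.DataFrame({
    "Specialization": ["Finance", "Marketing", None, "Finance"],
    "TotalVisits": [2.0, 5.0, 1.0, 2.0],
    "Converted": [1, 0, 0, 1],
})

# data type of each column (numeric columns show as float64 or int64)
toy_dtypes = toy_df.dtypes
print(toy_dtypes)

# the mean of a 0/1 column is the share of 1s, i.e. the conversion rate
toy_rate = toy_df["Converted"].mean()
print(toy_rate)

# number of fully duplicated rows, and missing values per column
toy_dupes = toy_df.duplicated().sum()
toy_nulls = toy_df.isnull().sum()
print(toy_dupes)
print(toy_nulls)
```

In this toy data the first and last rows are identical, so one row is flagged as a duplicate, and the `None` in `Specialization` is counted as one missing value.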

## Part 4: Data Cleaning

In the real world we often have imperfect or missing data that we need to 'clean' to get ready for data analysis. 

Using what we learnt above about the data, we will now apply some changes to the dataframe to fill in any gaps:

In [None]:
# Some columns contain the value 'Select', which doesn't make sense as a real answer (for example, as an occupation)
# This happens when the customer did not pick an option from the list, so the form's default 'Select' was stored.

# Change 'Select' values to NaN.
leads_df = leads_df.replace('Select', np.nan)

In [None]:
# let's look at the specialization column
leads_df['Specialization'].value_counts(dropna=False)

In [None]:
# It may be that a lead has not entered a specialization because their option was not available in the list
# We can make a category 'Other' for missing values.
leads_df['Specialization'] = leads_df['Specialization'].replace(np.nan, 'Other')
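
To see what these two cleaning steps do, here is a minimal sketch on a made-up dataframe (`toy_df` and its values are assumptions for illustration). It uses `fillna`, which is an alternative way to fill missing values in a single column.

```python
import numpy as np
import pandas as pd

# invented example data with the 'Select' placeholder in both columns
toy_df = pd.DataFrame({
    "Specialization": ["Finance", "Select", "Marketing", "Select"],
    "Occupation": ["Student", "Select", "Unemployed", "Student"],
})

# 'Select' is a form placeholder, so treat it as missing everywhere
toy_clean = toy_df.replace("Select", np.nan)

# fill the gaps in one column only; Occupation keeps its missing value
toy_clean["Specialization"] = toy_clean["Specialization"].fillna("Other")

print(toy_clean)
print(toy_clean.isnull().sum())
```

Note that `replace` returns a new dataframe, so `toy_df` itself is unchanged until we assign the result to a name.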

## Part 5: Data Exploration

Each row in our data is a single observation made up of a value in each column.

Now we have cleaned the data we can start exploring each of the columns we have as **variables** in the dataset:

- `Prospect ID` is the identifier variable for the data across each row

- `Converted` is our **target** variable (also known as the response variable) i.e. the variable that we are interested in predicting.

    - It is binary and the event either happens or it doesn't i.e. 0 or 1

- We have a mix of 7 categorical and continuous variables we can use as **predictor** variables (also called explanatory variables, features, input variables or independent variables)

We can explore the data to see if there are any correlations between the target and predictor variables.

First for our categorical variables let's look at side by side bar plots:

In [None]:
# lead Origin
fig, axs = plt.subplots(figsize = (15,7.5))
sns.countplot(x = "LeadOrigin", hue = "Converted", data = leads_df)
xticks(rotation = 90)

In [None]:
# lead source
fig, axs = plt.subplots(figsize = (15,7.5))
sns.countplot(x = "LeadSource", hue = "Converted", data = leads_df)
xticks(rotation = 90)

In [None]:
# Specialization
fig, axs = plt.subplots(figsize = (15,7.5))
sns.countplot(x = "Specialization", hue = "Converted", data = leads_df)
xticks(rotation = 90)

In [None]:
# Occupation
fig, axs = plt.subplots(figsize = (15,7.5))
sns.countplot(x = "Occupation", hue = "Converted", data = leads_df)
xticks(rotation = 90)

For our continuous variables we can look at box plots:

In [None]:
# Total Visits
sns.boxplot(y = 'TotalVisits', x = 'Converted', data = leads_df)

In [None]:
# Total Time Spent on Website
sns.boxplot(y = 'TotalTime', x = 'Converted', data = leads_df)

In [None]:
# Page Views Per Visit
sns.boxplot(y = 'PageViews', x = 'Converted', data = leads_df)

What comments can you make about the different variables and their conversion rates? 
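
One way to back up what you see in the bar plots with numbers is to calculate the conversion rate for each category. The sketch below uses an invented dataframe (`toy_df` and its lead sources are assumptions, not the real data); the same `groupby` pattern should work on any categorical column next to a 0/1 target.

```python
import pandas as pd

# invented example: six leads from three sources
toy_df = pd.DataFrame({
    "LeadSource": ["Google", "Google", "Referral", "Referral", "Referral", "Direct"],
    "Converted": [0, 1, 1, 1, 0, 0],
})

# number of leads and conversion rate per source, best source first
toy_rates = (
    toy_df.groupby("LeadSource")["Converted"]
    .agg(leads="count", conversion_rate="mean")
    .sort_values("conversion_rate", ascending=False)
)
print(toy_rates)
```

Looking at the count alongside the rate matters: a category with very few leads can show an extreme conversion rate just by chance.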

## Part 7: Model Building 

Now that we have a reasonably good idea of what we want to predict and how each variable relates to it, we can start to build a predictive model for our lead score.

As we said above, we have a binary target that we want to predict and a mix of 7 categorical and continuous variables we can use as predictor variables.

Because our target is binary (0 or 1)  - this is a **classification** problem, meaning that we want to predict whether a lead will convert or not. 

So we need to pick a model for classification. The first choice for many binary classification problems is a **logistic regression** model because of its simplicity and interpretability. 

There are other options like **random forests** or **neural networks** that can provide better accuracy, but depending on the data logistic regression can perform very well! 

### Training-testing split

The next part of building a predictive model is to split the data into a **training** set and a **testing** set.
This lets us withhold data from the model during training so that we can test its performance on data it has not seen before - just like it will be doing for our lead scoring in real life.

- Train 80%
- Test 20%

We also need to transform the data to get it ready for modelling 
- Remove the ID label (not useful as a predictor)
- Split out the target column
- For categorical variables with multiple levels, create dummy features (one-hot encoded)
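
As a minimal sketch of what one-hot encoding does, here is `pd.get_dummies` applied to a single made-up column (`toy_df` and its values are invented for illustration):

```python
import pandas as pd

# invented example: one categorical column with three levels
toy_df = pd.DataFrame({"Occupation": ["Student", "Unemployed", "Working", "Student"]})

# one indicator column per level
toy_all = pd.get_dummies(toy_df, drop_first=False)

# drop_first=True removes the first level, which becomes the baseline
toy_dropped = pd.get_dummies(toy_df, drop_first=True)

print(list(toy_all.columns))
print(list(toy_dropped.columns))
```

With `drop_first=True`, a row where every remaining indicator is 0 must belong to the dropped level (`Student` here), so no information is lost and the indicator columns are no longer perfectly redundant with each other.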




In [None]:
# package for training models 
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split

In [None]:
# vector to store conversion results
y = leads_df['Converted']

In [None]:
# taking categorical variables and creating dummy variables
# here we have subsetted the dataframe to only the categorical columns
dummy1 = pd.get_dummies(leads_df[['LeadOrigin', 'LeadSource', 'Specialization', 'Occupation']], drop_first=True)
dummy1.head()

In [None]:
# Adding the results of the dummy variables and remaining continuous variables
# in a dataframe with all the predictors 
X = pd.concat([leads_df[['TotalVisits', 'TotalTime', 'PageViews']], dummy1], axis=1)
X.head()

In [None]:
# create training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
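
Because the split is random, the rows in each set (and so the results further down) will change every time the cell is run. The sketch below, on invented data, shows two optional arguments of `train_test_split` that can help: `random_state` makes the split repeatable, and `stratify` keeps the conversion rate the same in both sets. The `toy_` names and the seed value 42 are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# invented data: 100 leads, 40% of which converted
toy_X = pd.DataFrame({"TotalVisits": np.arange(100)})
toy_y = pd.Series([1] * 40 + [0] * 60)

# 80% training, 20% testing, repeatable and stratified on the target
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(toy_X_train.shape, toy_X_test.shape)
print(toy_y_train.mean(), toy_y_test.mean())
```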

Now we are ready to 'fit' our logistic regression model to the data

In [None]:
# fit a model 
lm = linear_model.LogisticRegression()
model = lm.fit(X_train, y_train)

Easy as that! Two lines of code.
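
Part of the interpretability of logistic regression comes from its coefficients: each one describes how a predictor shifts the odds of the positive class. The following is a sketch on synthetic data, not on our leads. The data, the column names and the `toy_` variables are all invented, and conversions are deliberately generated so that they depend on `TotalTime` only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# invented predictor: "minutes on site" between 0 and 10
toy_time = rng.uniform(0, 10, size=500)
# conversions are generated to become more likely as time increases
toy_conv = (toy_time + rng.normal(size=500) > 5).astype(int)

toy_X = pd.DataFrame({
    "TotalTime": toy_time,
    "Unrelated": rng.normal(size=500),  # has no link to the target
})
toy_model = LogisticRegression().fit(toy_X, toy_conv)

# coefficients are on the log-odds scale; exp() turns them into odds ratios
toy_coefs = pd.Series(toy_model.coef_[0], index=toy_X.columns)
toy_odds = np.exp(toy_coefs)
print(toy_odds)
```

An odds ratio above 1 means the odds of conversion go up as that predictor increases, while a value close to 1 means the predictor makes little difference.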

## Part 8: Model Evaluation

Next is the important bit - how good is our model at predicting conversions?

There are many metrics to evaluate a model, and two ways we can estimate them:
- 'in-sample': evaluating on our training data
- 'out-of-sample': evaluating on our testing data

In [None]:
# in sample accuracy 
metrics.accuracy_score(y_train, model.predict(X_train))

In [None]:
# Get predictions on testing dataset
y_test_pred = model.predict(X_test)

In [None]:
# Out-of-sample accuracy
# Let's check the overall accuracy for the test set 
metrics.accuracy_score(y_test, y_test_pred)

The in-sample accuracy tends to be higher, which is why we need to evaluate our model on data it has not seen before to get a realistic estimate of real-life performance!

### Confusion Matrix 

For classification models we can use a tool called a **confusion matrix** to see how the model performed on classifying non-conversions compared to conversions 

In [None]:
# Confusion matrix 
# input is the actual conversions vs the predicted conversions for each lead in the test set
confusion = metrics.confusion_matrix(y_test, y_test_pred)

confusion

One thing we are interested in is how many leads predicted as converters were actually converters, and how many predicted as non-converters were actually non-converters:

In [None]:
# some more things we can calculate from our confusion matrix
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

print("True positive", TP)
print("True negative", TN)
print("False Positives", FP)
print("False Negatives", FN)

And the proportions of each:

- **Sensitivity** (also called the true positive rate, the recall) measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).

- **Specificity** (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).

In [None]:
# Let's see the sensitivity of our logistic regression model
se = TP / float(TP+FN)
# Let us calculate specificity
sp = TN / float(TN+FP)

# Out of all those who were converters - what proportion did our model get right?
print("Sensitivity", se)
# Out of all those who were non-converters - what proportion did our model get right?
print("Specificity", sp)
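
Sensitivity and specificity start from the actual outcome. For a sales team it can be just as useful to start from the prediction and ask: of the leads the model told us to call, how many really converted? That measure is usually called **precision**, and its counterpart for predicted non-converters is the negative predictive value. Here is a worked sketch using a made-up confusion matrix (the numbers and the `toy_` names are invented, not results from our model):

```python
import numpy as np

# made-up confusion matrix in the same layout scikit-learn uses:
# rows = actual (0, 1), columns = predicted (0, 1)
toy_confusion = np.array([[90, 10],
                          [20, 30]])

toy_TN, toy_FP = toy_confusion[0]
toy_FN, toy_TP = toy_confusion[1]

# of all leads predicted to convert, what share actually converted?
toy_precision = toy_TP / (toy_TP + toy_FP)
# of all leads predicted not to convert, what share really did not?
toy_npv = toy_TN / (toy_TN + toy_FN)
# for comparison: of all actual converters, what share did we find?
toy_sensitivity = toy_TP / (toy_TP + toy_FN)

print("Precision", toy_precision)
print("Negative predictive value", toy_npv)
print("Sensitivity", toy_sensitivity)
```

In this made-up example 30 of the 40 leads predicted to convert actually did (precision 0.75), while the model found 30 of the 50 actual converters (sensitivity 0.6).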

## Part 9: Interpreting Results & Lift

The great thing about logistic regression is that it gives us the predicted probability of each observation falling into each class (in our case conversion = 0 or conversion = 1).

We can use this to rank our leads into deciles of the 'hottest' and compare the number of actual conversions in each decile.

Then we can see how this compares to the existing method - or no method. 

In [None]:
# getting predicted probabilities and combining into a dataframe
probs = pd.DataFrame(model.predict_proba(X_test), columns = ['prob_nc', 'prob_c'])

# joining actual outcomes
probs['actual'] = y_test.reset_index(drop=True)

probs.head()

In [None]:
# cut data into deciles
# i.e. putting the leads into buckets of top 10%, 20%, ... based on how likely they are to convert
d = np.linspace(0.1,1,10).round(1)
probs['deciles'] = pd.qcut(probs.prob_c, 10)

In [None]:
# compare actual conversion in each decile
# pandas groupby is a useful function for summarising data by group
lift = probs.groupby(probs.deciles)['actual'].agg(["sum", "count"]).reset_index()

# existing conversion rate in test data
x = sum(lift['sum'])/sum(lift['count'])

# calculate conversion probs
lift['prob_con'] = lift['sum']/lift['count']

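# get cumulative counts and probabilities, starting from the hottest decile
lift['sum_c'] = lift['sum'].iloc[::-1].cumsum()
# cumulative share of all conversions in the test set
lift['prop_c'] = lift['sum_c']/lift['sum'].sum()

# baseline - conversions expected in each decile if leads were picked at random
lift['old'] = lift['count']*x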

In [None]:
# dataframe with our metrics 
lift
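
To make the logic of the lift table easier to follow, here is a self-contained sketch on simulated data. Everything in it is an assumption for illustration: the probabilities are random numbers rather than model output, and the `toy_` names are made up. Unlike the table above, it labels the deciles 1 (hottest) to 10 (coldest) so that no reversing is needed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# simulated scores, and outcomes that follow those scores
toy_prob = rng.uniform(0, 1, size=1000)
toy_actual = (rng.uniform(0, 1, size=1000) < toy_prob).astype(int)
toy = pd.DataFrame({"prob_c": toy_prob, "actual": toy_actual})

# rank leads from hottest (rank 1) to coldest, then cut into 10 equal buckets
toy_rank = toy["prob_c"].rank(method="first", ascending=False)
toy["decile"] = pd.qcut(toy_rank, 10, labels=list(range(1, 11)))

# conversions and number of leads per decile
toy_lift = toy.groupby("decile", observed=True)["actual"].agg(["sum", "count"])

# cumulative share of conversions captured vs share of leads contacted
toy_lift["cum_share"] = toy_lift["sum"].cumsum() / toy_lift["sum"].sum()
toy_lift["leads_share"] = toy_lift["count"].cumsum() / toy_lift["count"].sum()
toy_lift["lift"] = toy_lift["cum_share"] / toy_lift["leads_share"]

print(toy_lift)
```

A lift above 1 in the first rows means the top deciles hold more than their fair share of conversions. By the last row every lead has been contacted, so the lift always comes back to 1.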

### Waterfall Plot

This plot shows us for each decile (ordered 'hottest' to 'not hot' leads) how many conversions we would expect to get by contacting leads in each decile compared to when we just pick leads at random that have a 38% conversion rate. 

In [None]:
# Waterfall analysis plot 
ax = plt.gca()

ax.bar(lift.index, lift['sum'].iloc[::-1])
ax.plot(lift.index, lift['old'].iloc[::-1], color = 'red')
plt.xticks(lift.index, labels=lift.index+1)
plt.title("Waterfall Analysis")
plt.xlabel("Deciles")
plt.ylabel("Conversions")

plt.show()

We can see that our model puts our more likely converters in the top deciles. This means that the sales team has a way of prioritising leads:

1. Top 10% of leads are 98% likely to convert so they should always be at the top of the queue
2. Leads in the middle deciles might still convert but are not a priority
3. Bottom 30% of leads are very unlikely to convert - don't waste time and resource here

*Next idea - can you work out which leads in the top decile might convert without even needing a sales call? - AB testing*

### Lift plot

Here we compare the cumulative gains in converted customers we get from our model compared to what we get by calling leads randomly.

In [None]:
# Lift plot
ax = plt.gca()

ax.plot(lift.index, lift['prop_c'].iloc[::-1]*100, marker='o')
ax.plot(lift.index, d*100, color = 'red', marker='o')
plt.xticks(lift.index, labels=(d*100).round().astype(int))
plt.title("Lift Chart")
plt.xlabel("% of Leads")
plt.ylabel("% of Conversions")

plt.show()

This chart shows us that by contacting just the top 10% of leads we actually capture 24% of all conversions. 

This is compared to having no model where we will only get 10% of all conversions for every 10% of leads we contact. 

We only need to contact the top 70% of leads from our model to capture 95% of all conversions.

This shows that we can generate more leads and still grow the customer base without having to grow the sales team to contact every lead!
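
To put rough numbers on this, here is a back-of-the-envelope calculation. It is a sketch under assumptions: the starting figure of 1000 leads is hypothetical, and it assumes the new leads convert at the same 38% rate and that the model performs on them as it did on our test set (top 70% of leads capturing 95% of conversions).

```python
baseline_leads = 1000      # hypothetical number of leads today
conversion_rate = 0.38     # current conversion rate from the scenario
lead_growth = 0.20         # extra leads from the advertising budget
share_contacted = 0.70     # we only call the top 70% of leads by score
share_captured = 0.95      # share of conversions found in that top 70%

# today: every lead gets a call
baseline_calls = baseline_leads
baseline_conversions = baseline_leads * conversion_rate

# with more leads and the model choosing who to call
new_leads = baseline_leads * (1 + lead_growth)
new_calls = new_leads * share_contacted
new_conversions = new_leads * conversion_rate * share_captured

print("Calls:", baseline_calls, "->", round(new_calls))
print("Conversions:", round(baseline_conversions), "->", round(new_conversions))
```

Under these assumptions the sales team makes about 16% fewer calls (840 instead of 1000) while conversions rise by about 14% (roughly 433 instead of 380).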

## Part 10: Next Steps

You could try to improve model accuracy by:
- Trying a different model
- Transforming variables
- Selecting variables
- Tuning hyperparameters
- Adding new data

Model building is an iterative process that continues until you reach the level of accuracy required.
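
As one possible starting point for trying a different model, the sketch below compares logistic regression with a random forest using 5-fold cross-validation. It runs on synthetic data from `make_classification`, so the scores say nothing about our leads; the `toy_` names and all parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic binary classification data (not the leads data)
toy_X, toy_y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

toy_models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# average accuracy over 5 train/test folds for each model
toy_scores = {
    name: cross_val_score(model, toy_X, toy_y, cv=5, scoring="accuracy").mean()
    for name, model in toy_models.items()
}

for name, score in toy_scores.items():
    print(name, round(score, 3))
```

Cross-validation repeats the training-testing split several times, which gives a steadier comparison between models than a single split.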