# How to profit from Data Science: Lead Scoring

The [2015 McKinsey report](http://www.mckinsey.com/industries/telecommunications/our-insights/telcos-the-untapped-promise-of-big-data?cid=digistrat-eml-alt-mkq-mck-oth-1606) on the use of Big Data in global telecom companies showed that 50% of organisations saw no increase in profitability, while 5% found significant benefit. Why?

A common problem is that the Data Science opportunity is framed in the context of “Big Data”. This risks the project focusing too early on the “who & how” of storing and managing that “Big Data”. This inevitably leads to an IT-centric project to deploy a massive IT infrastructure (e.g., data warehouse, data lake) that becomes disconnected from the original business problem and hence the business benefits.

A better approach is to focus on the “Predictive Applications that enrich critical CRM activities” to ensure the project is driven by the “what & why” of the business outcomes and remains a business, not technology, project.

## Enhancing the CRM Process

A typical (simplified) CRM process looks like this:
![Enhancing the CRM Process](CRM_Lifecycle.png)

Under each CRM phase we have listed touch points where a Predictive Application can increase the effectiveness or efficiency of a critical CRM activity to boost revenue and/or lower cost. Four common starting points are:

* __[Lead Scoring](http://jupyter1.datascienceinstitute.com.au:8888/notebooks/Notebooks/Predictive_Applications/bank_lead_scoring_benefits.ipynb)__: This is the process of ranking leads to prioritise sales resources on the customers who are most likely to buy now.  The benefits are increased sales ROI and conversion rate, which translate to a lower cost of sale and higher revenues. There is also the benefit of a happier, more engaged salesforce. 
* __[Customer Lifetime Value (CLV)](http://jupyter1.datascienceinstitute.com.au:8888/notebooks/Notebooks/Predictive_Applications/Online_Retail_CLV.ipynb)__: This is the present value of the future cash flows attributed to the customer during his/her entire relationship with the company. CLV is considered an essential business metric as it shifts focus from quarterly revenues to long-term profits.  Furthermore, the sum over all CLVs estimates the value of the customer base, which can then be managed as an asset.
* __[Churn Prediction](http://jupyter1.datascienceinstitute.com.au:8888/notebooks/Notebooks/Predictive_Applications/customer-churn-prediction.ipynb)__: This answers the question “Which customers are most likely to leave in the next period?” Given the high cost of acquiring new customers there is a strong incentive to take action to retain a customer. Organisations typically employ retention actions (e.g., targeted phone calls or mailing campaigns), with offers of special benefits or discounts. The difficult question becomes “who, how, and at what cost?”
* __Recommendation Engine__: This predicts which products a prospect is likely to purchase based on past behaviour (e.g., past purchases, activity, ratings). It was made famous by Amazon and Netflix, who provide recommendations on what to buy or view. The benefit is higher conversion rates and an increase in basket size.
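
The CLV definition above (present value of future cash flows) can be sketched in a few lines. This is a minimal illustration only: the function name, the yearly cash flows and the 10% discount rate are assumptions made for the example, not figures from the linked notebook.

```python
def customer_lifetime_value(cash_flows, discount_rate):
    """Present value of expected per-period cash flows.

    cash_flows[0] is assumed to arrive one period from now.
    """
    return sum(cf / (1.0 + discount_rate) ** t
               for t, cf in enumerate(cash_flows, start=1))

# Hypothetical customer: 100 of net cash flow per year for 3 years, 10% discount rate
clv = customer_lifetime_value([100.0, 100.0, 100.0], 0.10)
print('CLV = {0:.2f}'.format(clv))
```

Summing this value over every customer gives the estimate of the customer base's value mentioned above.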

## Lead scoring
This notebook explores the benefits of Lead Scoring by comparing the conversion rate and return on investment for calling 20% of prospects on a contact list. Two approaches are compared:
* Unscored: Take a random selection of 20% of the contact list
* Scored: Score each prospect using a simple machine learning model and call the top 20%

The contact list is taken from a real-world banking example. The [data](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) contains 41k CRM records from a Portuguese bank, collected between May 2008 and November 2010.

## Measuring the benefits of lead scoring
To measure the benefits of lead scoring, we define a function to calculate the Conversion Rate 

$CVR = \frac{sales}{calls}$

and another to calculate Return on Investment 

$ROI = \frac{profit - cost}{cost} $ 

where $profit = unitProfit*sales$ and $cost = unitCost*calls$

We have assumed the unit cost is \$5 and unit profit \$45. These are very rough estimates, and small changes have a big effect on ROI, so they need to be refined.
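
To see why small changes matter, note that the definitions above reduce to $ROI = \frac{unitProfit}{unitCost} \times CVR - 1$, so with the assumed \$5 and \$45 the break-even conversion rate is $5/45 \approx 11.1\%$. The sketch below makes this concrete; the function names and the 12% conversion rate are illustrative assumptions, not part of the notebook.

```python
def roi_from_cvr(cvr, unit_cost=5.0, unit_profit=45.0):
    """ROI implied by a conversion rate: (unit_profit * cvr - unit_cost) / unit_cost."""
    return (unit_profit * cvr - unit_cost) / unit_cost

def break_even_cvr(unit_cost=5.0, unit_profit=45.0):
    """Conversion rate at which ROI is exactly zero."""
    return unit_cost / unit_profit

# Illustrative 12% conversion rate (an assumption, not a result from the data)
for unit_cost in (4.0, 5.0, 6.0):
    print('unit cost {0:.2f}: ROI = {1:.1%}, break-even CVR = {2:.1%}'.format(
        unit_cost, roi_from_cvr(0.12, unit_cost), break_even_cvr(unit_cost)))
```

Moving the unit cost by one dollar in either direction swings the ROI at a 12% conversion rate from roughly +35% to roughly -10%, which is why these estimates deserve care.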

In [1]:
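def calc_call_roi(contactList, leadScore, percentToCall, cost = 5.00, profit = 45.00):
    """Return the ROI of calling the top-scored fraction of the contact list."""
    sales, calls = calc_calls(contactList, leadScore, percentToCall)
    return (sales*profit - calls*cost) / float(calls*cost)

def calc_call_cvr(contactList, leadScore, percentToCall):
    """Return the conversion rate (sales per call) for the top-scored fraction."""
    sales, calls = calc_calls(contactList, leadScore, percentToCall)
    return sales / float(calls)

def calc_calls(contactList, leadScore, percentToCall):
    """Return (sales, calls) when calling the top-scored fraction of the contact list."""
    # Number of prospects we will call
    calls = int(len(contactList)*percentToCall)
    # Replace any score left over from a previous run
    if 'lead_score' in contactList.column_names():
        contactList.remove_column('lead_score')
    contactList = contactList.add_column(leadScore, name='lead_score')

    # Call the highest-scored prospects and count how many bought (y == 'yes')
    callList = contactList.topk('lead_score', k=calls)
    sales = len(callList[callList['y']=='yes'])

    return sales, calls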
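
For readers without GraphLab, the same top-k selection can be sketched in plain Python. This is a dependency-free illustration rather than the notebook's implementation: `calc_calls_plain` and the ten-lead list are hypothetical.

```python
def calc_calls_plain(outcomes, lead_scores, percent_to_call):
    """Return (sales, calls) when calling the highest-scored fraction of leads.

    outcomes: list of 'yes'/'no' strings, one per lead
    lead_scores: list of numeric scores, in the same order as outcomes
    """
    calls = int(len(outcomes) * percent_to_call)
    # Rank leads from highest to lowest score and keep the first `calls`
    ranked = sorted(zip(lead_scores, outcomes), key=lambda pair: pair[0], reverse=True)
    sales = sum(1 for _, outcome in ranked[:calls] if outcome == 'yes')
    return sales, calls

# Hypothetical ten-lead list in which the two buyers also have the two highest scores
outcomes = ['no', 'yes', 'no', 'no', 'yes', 'no', 'no', 'no', 'no', 'no']
scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.05, 0.4, 0.15, 0.25, 0.35]
sales, calls = calc_calls_plain(outcomes, scores, 0.2)
print('sales = {0}, calls = {1}'.format(sales, calls))
```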

First we load the data and break it into training and validation datasets. This is a common Data Science strategy that enables the accuracy of the trained model to be estimated on data that it has not yet seen.

In [2]:
import graphlab as gl
bank = gl.SFrame.read_csv('Data/bank-additional/bank-additional-full.csv', delimiter=';', verbose=False)
train, validate = bank.random_split(0.8)
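
The idea behind the split can be sketched without any library. The helper below is a hypothetical plain-Python stand-in, not GraphLab's implementation; it assigns each record to the training set with probability 0.8 and uses a fixed seed (an assumption added here) so the split is repeatable.

```python
import random

def split_rows(rows, fraction, seed):
    """Randomly assign each row to the first list with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))  # stand-in for CRM records
train_rows, validate_rows = split_rows(rows, 0.8, seed=42)
print('train = {0}, validate = {1}'.format(len(train_rows), len(validate_rows)))
```

Because every record lands in exactly one of the two lists, the model is never evaluated on a record it was trained on.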

For the first approach we phone a random 20% of the prospects. Note the ROI can vary greatly depending on the luck of the draw, so we run this a few times!

In [12]:
import random
meanROI = 0
meanCVR = 0
n = 5  # number of random draws to average over
print('Call random 20%')
for i in range(n):
    # A random score per prospect makes the "top 20%" a random sample
    randLeadScores = gl.SArray([random.random() for _ in validate])
    initROI = calc_call_roi(validate, randLeadScores, 0.2)
    initCVR = calc_call_cvr(validate, randLeadScores, 0.2)
    meanROI += initROI
    meanCVR += initCVR
    print('ROI = {0:.2%}, CVR =  {1:.2%}'.format(initROI, initCVR))

meanROI = meanROI/n
meanCVR = meanCVR/n
print('Mean ROI = {0:.2%}, CVR =  {1:.2%}'.format(meanROI, meanCVR))

Call random 20%
ROI = 15.97%, CVR =  12.89%
ROI = 0.73%, CVR =  11.19%
ROI = -1.45%, CVR =  10.95%
ROI = -5.81%, CVR =  10.47%
ROI = -3.09%, CVR =  10.77%
Mean ROI = 1.27%, CVR =  11.25%
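
The spread in these runs is about what sampling noise alone would produce. The rough check below treats each call as an independent trial with the mean conversion rate of 11.25%; the figure of 1,640 calls is an assumption (20% of a validation set of roughly 8,200 records), and the finite-population correction is ignored.

```python
import math

def cvr_standard_error(p, calls):
    """Binomial standard error of a conversion rate estimated from `calls` calls."""
    return math.sqrt(p * (1.0 - p) / calls)

p = 0.1125     # mean conversion rate from the runs above
calls = 1640   # assumed number of calls per run
se_cvr = cvr_standard_error(p, calls)
# ROI = (45 / 5) * CVR - 1, so noise in CVR is magnified ninefold in ROI
se_roi = (45.0 / 5.0) * se_cvr
print('CVR standard error = {0:.2%}, ROI standard error = {1:.1%}'.format(se_cvr, se_roi))
```

A standard error of roughly 7 percentage points on ROI is consistent with individual runs landing anywhere from about -6% to +16%.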


## Using a machine learning model to score leads

Using the GraphLab machine learning API we can very quickly train a model on the historical won/lost call data from the CRM, then estimate the accuracy of the model against the validation set.

We need to exclude the following features:

* y: This is the outcome we are trying to predict.
* duration: last contact duration, in seconds. This is tightly coupled to the target (when duration=0 then y='no') and is not known until after a call has been made, so it must be discarded.

We also exclude the following economic indicators, as it's not clear how they affect the model, i.e., is everybody more likely to take out an account when the indicators are better?

* emp.var.rate: employment variation rate - quarterly indicator (numeric)
* cons.price.idx: consumer price index - monthly indicator (numeric) 
* cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
* euribor3m: euribor 3 month rate - daily indicator (numeric)
* nr.employed: number of employees - quarterly indicator (numeric)

In [23]:
# Columns excluded from training: the target, the call duration, and the economic indicators
excluded = ['y', 'duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
features = list(set(train.column_names()) - set(excluded))

simple_model = gl.classifier.boosted_trees_classifier.create(train, features=features, target='y', verbose=False,
                                                             max_depth=5, early_stopping_rounds=60, max_iterations=500,
                                                             metric='auc', random_seed=19374)
results = simple_model.evaluate(validate)
print("accuracy: %g, precision: %g, recall: %g" % (results['accuracy'], results['precision'], results['recall']))

accuracy: 0.903823, precision: 0.646865, recall: 0.221719


This model correctly predicts the purchasing decisions of ~90% of the contacts. However, only about 11% of contacts buy, so predicting "no" for everyone would score nearly as well, and there is clear room for improvement: only 65% of its predicted sales actually convert and only 22% of actual sales were predicted by the model. Even so, we shall see that this leads to significant benefits.
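
The three metrics are simple ratios of confusion-matrix counts. The sketch below computes them from scratch on a made-up set of 100 contacts (11 buyers, chosen to resemble the base rate seen earlier); the function name and the data are illustrative assumptions, not the notebook's validation results.

```python
def classification_metrics(actual, predicted, positive='yes'):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / float(len(pairs))
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical data: 11 buyers out of 100; the model flags 4 contacts, 3 of them correctly
actual = ['yes'] * 11 + ['no'] * 89
predicted = ['yes'] * 3 + ['no'] * 8 + ['yes'] * 1 + ['no'] * 88
accuracy, precision, recall = classification_metrics(actual, predicted)

# Baseline that never predicts a sale
base_accuracy, _, _ = classification_metrics(actual, ['no'] * 100)
print('model accuracy = {0:.2f}, always-no accuracy = {1:.2f}'.format(accuracy, base_accuracy))
```

In this toy case the model's accuracy of 0.91 is barely above the 0.89 of a model that never predicts a sale, which is why precision and recall are the more informative numbers for a rare outcome.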

We now use the model to predict the probability of a sale for each contact, then measure the ROI and CVR for the top 20% of opportunities.

In [24]:
toolkitLeadScore = simple_model.predict(validate,output_type='probability')
toolkitROI = calc_call_roi(validate, toolkitLeadScore, 0.2 )
toolkitCVR = calc_call_cvr(validate, toolkitLeadScore, 0.2 )

print('Call top 20%: ROI = {0:.2%}, CVR =  {1:.2%}'.format(toolkitROI, toolkitCVR))

Call top 20%: ROI = 172.23%, CVR =  30.25%


Even with this trivial model we see that the conversion rate is about 2.7 times that of calling random contacts (30.25% vs. a mean of 11.25%), and the ROI rises from a mean of 1.27% to 172.23%, more than 100 times higher.

There are two obvious ways to improve this:
1. Enrich the data with additional demographic and behavioural data
2. Estimate the CLV and use that as a second score
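
One possible way to use CLV as a second score is to rank leads by expected value, i.e. the probability of a sale multiplied by the customer's estimated CLV. This is an assumption about how the two scores might be combined, not something the notebook specifies, and the leads, probabilities and CLV figures below are made up.

```python
def expected_value_scores(probabilities, clvs):
    """Score each lead by probability of sale times estimated CLV."""
    return [p * v for p, v in zip(probabilities, clvs)]

# Hypothetical leads
leads = ['A', 'B', 'C']
probabilities = [0.30, 0.10, 0.20]   # model's predicted probability of a sale
clvs = [100.0, 500.0, 200.0]         # estimated lifetime value of each lead

ev = expected_value_scores(probabilities, clvs)
# Sort lead names from highest to lowest expected value
ranked = [lead for _, lead in sorted(zip(ev, leads), reverse=True)]
print(ranked)
```

Here lead B has the lowest probability of buying but ranks first because of its high estimated value, an ordering that a probability-only score would miss.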