# How to profit from Data Science: Churn Prediction

The [2015 McKinsey report](http://www.mckinsey.com/industries/telecommunications/our-insights/telcos-the-untapped-promise-of-big-data?cid=digistrat-eml-alt-mkq-mck-oth-1606) on the use of Big Data in global telecom companies showed that 50% of organisations saw no increase in profitability, while 5% found significant benefit. Why?

A common problem is that the Data Science opportunity is framed in the context of “Big Data”. This risks the project focusing too early on the “who & how” of storing and managing that “Big Data”. This inevitably leads to an IT-centric project to deploy massive IT infrastructure (e.g., a data warehouse or data lake) that becomes disconnected from the original business problem and hence from the business benefits.

A better approach is to focus on the “Predictive Applications that enrich critical CRM activities”, so that the project is driven by the “what & why” of the business outcomes and remains a business project, not a technology project.

## Enhancing the CRM Process

A typical (simplified) CRM process looks like this:
![Enhancing the CRM Process](CRM_Lifecycle.png)

Under each CRM phase we have listed touch points where a Predictive Application can increase the effectiveness or efficiency of a critical CRM activity to boost revenue and/or lower cost. Four common starting points are:

* __[Lead Scoring](http://jupyter1.datascienceinstitute.com.au:8888/notebooks/Notebooks/Predictive_Applications/bank_lead_scoring_benefits.ipynb)__: This is the process of ranking leads to prioritise sales resources on the customers who are most likely to buy now.  The benefits are increased sales ROI and conversion rate, which translate to a lower cost of sale and higher revenues. There is also the benefit of a happier, more engaged salesforce. 
* __[Customer Lifetime Value (CLV)](http://jupyter1.datascienceinstitute.com.au:8888/notebooks/Notebooks/Predictive_Applications/Online_Retail_CLV.ipynb)__: This is the present value of the future cash flows attributed to the customer during his/her entire relationship with the company. CLV is considered an essential business metric as it shifts focus from quarterly revenues to long-term profits.  Furthermore, the sum over all CLVs estimates the value of the customer base, which can then be managed as an asset.
* __[Churn Prediction](http://jupyter1.datascienceinstitute.com.au:8888/notebooks/Notebooks/Predictive_Applications/customer-churn-prediction.ipynb)__: This answers the question “Which customers are most likely to leave in the next period?” Given the high cost of acquiring new customers, there is a strong incentive to take action to retain existing ones. Organisations typically employ retention actions (e.g., targeted phone calls or mailing campaigns) with offers of special benefits or discounts. The difficult question becomes “who, how, and at what cost?”
* __Recommendation Engine__: This predicts which products a prospect is likely to purchase based on past behaviour (e.g., past purchases, activity, ratings). It was made famous by Amazon and Netflix, who provide recommendations on what to buy or view. The benefit is higher conversion rates and an increase in basket size.
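
The CLV definition in the list above (the present value of a customer's future cash flows) can be made concrete with a small calculation. The sketch below is illustrative only: the function name `customer_lifetime_value`, the yearly cash flows, and the 10% discount rate are assumptions, not figures from the linked notebooks.

```python
def customer_lifetime_value(cash_flows, discount_rate):
    """Present value of a list of expected per-period cash flows.

    cash_flows[0] is received at the end of period 1, cash_flows[1] at the
    end of period 2, and so on. All inputs here are hypothetical.
    """
    return sum(cf / (1.0 + discount_rate) ** t
               for t, cf in enumerate(cash_flows, start=1))


# Hypothetical customer: 100 per year for three years, discounted at 10%.
clv = customer_lifetime_value([100.0, 100.0, 100.0], discount_rate=0.10)
print(round(clv, 2))
```

Summing this quantity over all customers gives the estimate of the value of the customer base mentioned above.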

## Churn analysis of the UCI [Online Retail data](http://archive.ics.uci.edu/ml/datasets/Online+Retail)


In [33]:
import graphlab as gl
import graphlab.aggregate
import datetime
import time
import os

In [2]:
# Load the cached SFrame if it exists; otherwise download the CSV,
# drop the columns we do not use, parse the invoice dates, and cache the result.
if not os.path.exists('Data/online_retail'):
    data = gl.SFrame("https://s3.amazonaws.com/dato-datasets/churn-prediction/online_retail.csv")
    data = data.remove_columns(['InvoiceNo', 'Description'])
    data['InvoiceDate'] = data['InvoiceDate'].str_to_datetime('%m/%d/%y %H:%M')
    data.save('Data/online_retail')
else:
    data = gl.SFrame('Data/online_retail')

[INFO] graphlab.cython.cy_server: GraphLab Create v1.10 started. Logging: /tmp/graphlab_server_1464905664.log

Next, we split the users into a training set and a validation set, making sure that no validation user also appears in the training set, and create a TimeSeries object from each.

In [3]:
(train, valid) = gl.churn_predictor.random_split(data, user_id = 'CustomerID', fraction = 0.9, seed = 12)
train_trial = gl.TimeSeries(train, index = 'InvoiceDate')
valid_trial = gl.TimeSeries(valid, index = 'InvoiceDate')
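
The idea behind a user-based split can be sketched in plain Python. This is not how `gl.churn_predictor.random_split` is implemented; the helper `split_by_user` and the toy rows are hypothetical and only illustrate that every row of a given customer lands on the same side of the split.

```python
import random


def split_by_user(rows, user_key, fraction, seed):
    """Split rows so that all rows of a user go to exactly one side.

    Hypothetical helper: roughly `fraction` of the distinct users go to
    the training side and the rest go to validation.
    """
    users = sorted({row[user_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(users)
    n_train = int(round(fraction * len(users)))
    train_users = set(users[:n_train])
    train = [r for r in rows if r[user_key] in train_users]
    valid = [r for r in rows if r[user_key] not in train_users]
    return train, valid


# Toy purchase log: 10 customers with 3 purchases each (made-up data).
rows = [{"CustomerID": cid, "Quantity": q}
        for cid in range(10) for q in (1, 2, 3)]
train_rows, valid_rows = split_by_user(rows, "CustomerID", fraction=0.9, seed=12)

train_ids = {r["CustomerID"] for r in train_rows}
valid_ids = {r["CustomerID"] for r in valid_rows}
print(train_ids.isdisjoint(valid_ids))
```

Splitting by user rather than by row keeps a customer's purchase history from leaking between the two sets, which would otherwise make the validation results look better than they are.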

Now we can load user information, which can be used to augment the churn prediction model.

In [4]:
# Same caching pattern as above, this time for the per-user side data.
if not os.path.exists('Data/userdata'):
    userdata = gl.SFrame("https://s3.amazonaws.com/dato-datasets/churn-prediction/online_retail_side_data_extended.csv")
    userdata.save('Data/userdata')
else:
    userdata = gl.SFrame('Data/userdata')

## Training the model

Let's now train the model.

### Define the churn period and time boundaries

First, let's look at the data to see what time range it covers.

In [5]:
print("Start date : %s" % train_trial.min_time)
print("End date   : %s" % train_trial.max_time)

Start date : 2010-12-01 08:26:00
End date   : 2011-12-09 12:50:00


In [6]:
# Period of inactivity that defines churn -- meaning that if a user stops purchasing
# items for 30 days, we'll consider them as having churned.
churn_period_trial = datetime.timedelta(days = 30) 

# Time boundaries (starts of months) at which training features are generated
churn_boundary_aug = datetime.datetime(year = 2011, month = 8, day = 1) 
churn_boundary_sep = datetime.datetime(year = 2011, month = 9, day = 1) 
churn_boundary_oct = datetime.datetime(year = 2011, month = 10, day = 1) 
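
To make the churn definition concrete: a customer counts as churned at a boundary if they show no activity during the churn period that follows it. The sketch below applies that rule to made-up purchase dates. `has_churned` is a hypothetical helper, and treating the end of the window as exclusive is an assumption rather than documented GraphLab behaviour.

```python
import datetime


def has_churned(purchase_times, boundary, churn_period):
    """True if there is no purchase in [boundary, boundary + churn_period)."""
    window_end = boundary + churn_period
    return not any(boundary <= t < window_end for t in purchase_times)


churn_period = datetime.timedelta(days=30)
boundary = datetime.datetime(2011, 10, 1)

# Made-up purchase histories for two customers.
active_customer = [datetime.datetime(2011, 9, 20), datetime.datetime(2011, 10, 15)]
lapsed_customer = [datetime.datetime(2011, 8, 3), datetime.datetime(2011, 11, 5)]

active_label = has_churned(active_customer, boundary, churn_period)
lapsed_label = has_churned(lapsed_customer, boundary, churn_period)
print(active_label)
print(lapsed_label)
```

The second customer does buy again on 5 November, but that falls outside the 30-day window starting 1 October, so under this definition they are still labelled as churned at the October boundary.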

In [7]:
model = gl.churn_predictor.create(train_trial,
                                  user_data = userdata,
                                  user_id='CustomerID',
                                  churn_period = churn_period_trial,
                                  time_boundaries = [churn_boundary_aug, churn_boundary_sep, churn_boundary_oct])

PROGRESS: Grouping observation_data by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.


PROGRESS: Generating features for time-boundary.
PROGRESS: --------------------------------------------------
PROGRESS: Features for 2011-08-01 00:00:00.
PROGRESS: Features for 2011-09-01 00:00:00.
PROGRESS: Features for 2011-10-01 00:00:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: --------------------------------------------------
PROGRESS: Training a classifier model.


PROGRESS: --------------------------------------------------
PROGRESS: Model training complete: Next steps
PROGRESS: --------------------------------------------------
PROGRESS: (1) Evaluate the model at various timestamps in the past:
PROGRESS:       metrics = model.evaluate(data, time_in_past)
PROGRESS: (2) Make a churn forecast for a timestamp in the future:
PROGRESS:       predictions = model.predict(data, time_in_future)


### Evaluating the model (post-hoc analysis)

In [8]:
# Evaluate this model in October
evaluation_time = churn_boundary_oct
metrics = model.evaluate(valid_trial, evaluation_time, user_data = userdata)
print(metrics)
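
The contents of `metrics` are not shown in this notebook. As a rough illustration of what evaluating a churn classifier involves, the sketch below computes precision and recall by comparing predicted probabilities with what actually happened. The helper `precision_recall`, the 0.5 threshold, and the data are all hypothetical and are not taken from the GraphLab output.

```python
def precision_recall(probabilities, churned, threshold=0.5):
    """Precision and recall of 'will churn' predictions at a threshold."""
    predicted = [p >= threshold for p in probabilities]
    tp = sum(1 for pr, y in zip(predicted, churned) if pr and y)
    fp = sum(1 for pr, y in zip(predicted, churned) if pr and not y)
    fn = sum(1 for pr, y in zip(predicted, churned) if not pr and y)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up predicted churn probabilities and observed outcomes.
probs = [0.9, 0.8, 0.6, 0.3, 0.1]
actual = [True, False, True, True, False]
precision, recall = precision_recall(probs, actual)
print(precision)
print(recall)
```

Precision tells you how many of the customers flagged for retention would really have left, while recall tells you how many of the true churners were caught.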

### Make predictions in the future

Here the question is whether each customer will churn in the period that follows the forecast time. To validate a prediction, we can check whether the user made a purchase during that period. Note that churn here is defined by inactivity during the churn period, not by an expiration date.

In [11]:
# Make predictions in the future.

predictions_trial = model.predict(valid_trial, user_data = userdata)
predictions_trial.sort('probability', ascending=False).print_rows(20,max_column_width=20)

PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start : 2011-12-09 11:20:00
PROGRESS:  End   : 2012-01-08 11:20:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.


PROGRESS: Generating features for boundary 2011-12-09 11:20:00.
PROGRESS: Joining user_data with aggregated features.
+------------+-----------------+
| CustomerID |   probability   |
+------------+-----------------+
|   13761    |  0.661192655563 |
|   12789    |  0.831865549088 |
|   12377    |  0.929451584816 |
|   13715    |  0.852380812168 |
|   17725    |  0.501834571362 |
|   15437    |  0.89022809267  |
|   12739    |  0.785794794559 |
|   16523    | 0.0530522763729 |
|   14711    |  0.599178552628 |
|   12851    |  0.785794794559 |
+------------+-----------------+
[442 rows x 2 columns]
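
One way to turn these probabilities into an answer to the “who, how, and at what cost?” question is to compare the expected value saved by a retention offer with the cost of making it. The sketch below is only an illustration: `expected_gain` is a hypothetical helper, and the customer value, offer cost, and save rate are invented numbers. Only the three churn probabilities come from the forecast above (rounded).

```python
def expected_gain(churn_probability, customer_value, offer_cost, save_rate):
    """Expected net gain of making a retention offer to one customer.

    save_rate is the assumed chance that the offer keeps a customer who
    would otherwise have left. All monetary inputs are hypothetical.
    """
    return churn_probability * save_rate * customer_value - offer_cost


# Churn probabilities taken from the forecast above (rounded).
forecast = {12377: 0.929, 17725: 0.502, 16523: 0.053}

# Hypothetical economics: each customer is worth 200, an offer costs 20,
# and 30% of would-be churners are retained by the offer.
ranked = sorted(forecast.items(), key=lambda kv: -kv[1])
targets = [cid for cid, p in ranked
           if expected_gain(p, customer_value=200.0,
                            offer_cost=20.0, save_rate=0.3) > 0]
print(targets)
```

Under these assumptions the offer is only worth making to customers whose churn probability is high enough to cover its cost, so the customer with a probability of about 0.05 is left alone.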



In [32]:
view = model.views.explore(train_trial, churn_boundary_oct, user_data = userdata)
view.show()

PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start : 2011-10-01 00:00:00
PROGRESS:  End   : 2011-10-31 00:00:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.


PROGRESS: Generating features for boundary 2011-10-01 00:00:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: Not enough data to make predictions for 644 user(s). 


In [31]:
view = model.views.overview(train_trial, evaluation_time, user_data = userdata)
view.show()

PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start : 2011-10-01 00:00:00
PROGRESS:  End   : 2011-10-31 00:00:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.


PROGRESS: Generating features for boundary 2011-10-01 00:00:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: Not enough data to make predictions for 644 user(s). 


In [28]:
view.show()

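In [21]:
# Serve GraphLab Canvas from a headless server on port 8889 instead of opening a local browser.
gl.canvas.set_target('headless', port=8889)

In [29]:
# Quick check that Canvas renders: show a small SFrame.
sf = gl.SFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
sf.show()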

Canvas is accessible via web browser at the URL: http://localhost:8889/index.html
