# Exploring & Explaining churn prediction models

Churn prediction is the task of identifying users who are likely to stop using a service, product, or website. In this tutorial, you will learn how to:

#### Explore & Evaluate predictions made by the model
* Explore the predictions made by the model to gain confidence in it.
* Understand why the model made the predictions that it did.
* Make a churn report by segmenting users based on their reasons for churn.
* Evaluate the model and compare it with a baseline model.


### Let's get started!

In [1]:
import graphlab as gl
import datetime
gl.canvas.set_target('ipynb') # make sure plots appear inline



### Load previously saved data

In the previous notebooks, we saved the data and models in a binary format. Let's load them back.

In [2]:
interactions_ts = gl.TimeSeries("data/user_activity_data_rocket_2.ts/")
users = gl.SFrame("data/users_rocket_2.sf/")
model = gl.load_model("data/churn_model_rocket_2.mdl")





In [3]:
(train, valid) = gl.churn_predictor.random_split(interactions_ts, user_id = 'user_id', fraction = 0.9, seed = 12)
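The `user_id` argument suggests that the split is made per user rather than per event, so that all of a user's activity ends up on one side. A minimal sketch of that idea in plain Python, assuming this is the intended behavior (the `split_by_user` helper and the toy events are hypothetical, and this is not GraphLab Create's implementation):

```python
import random


def split_by_user(events, user_key="user_id", fraction=0.9, seed=12):
    """Split event rows so that each user's rows land entirely in one set."""
    # Collect the distinct users and shuffle them reproducibly.
    user_ids = sorted({row[user_key] for row in events})
    rng = random.Random(seed)
    rng.shuffle(user_ids)

    # The first `fraction` of the shuffled users go to the training set.
    cutoff = int(round(fraction * len(user_ids)))
    train_users = set(user_ids[:cutoff])

    train = [row for row in events if row[user_key] in train_users]
    valid = [row for row in events if row[user_key] not in train_users]
    return train, valid


# Hypothetical toy data: 10 users with 3 events each.
events = [{"user_id": "u%d" % u, "event": e} for u in range(10) for e in range(3)]
train_rows, valid_rows = split_by_user(events)
```

Splitting by user keeps a user's history from leaking between the training and validation sets, which would make the validation results look better than they are.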

In [4]:
churn_period_apr =  datetime.datetime(year = 2016, month = 4, day = 1)

## Interactive view to explore the model

In [5]:
v = model.views.overview(train, churn_period_apr, valid, user_data=users)
v.show()

PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start : 2016-04-01 00:00:00
PROGRESS:  End   : 2016-04-08 00:00:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.


PROGRESS: Generating features for boundary 2016-04-01 00:00:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: Not enough data to make predictions for 4506 user(s). 
PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start : 2016-04-01 00:00:00
PROGRESS:  End   : 2016-04-08 00:00:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.


PROGRESS: Generating features for boundary 2016-04-01 00:00:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: Not enough data to make predictions for 517 user(s). 
PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start : 2016-04-01 00:00:00
PROGRESS:  End   : 2016-04-08 00:00:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.


PROGRESS: Generating features for boundary 2016-04-01 00:00:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: Not enough data to make predictions for 517 user(s). 
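The progress log above reports a forecast window from 2016-04-01 to 2016-04-08, that is, seven days starting at the time boundary. A rough sketch of how such a window could translate into a churn label, assuming a user counts as churned when they have no events inside the window (`forecast_window` and `has_churned` are hypothetical helpers, and the seven-day period is read off the log rather than taken from the model's configuration):

```python
import datetime

# Assumption: seven-day churn period, inferred from the Start/End lines in the log.
CHURN_PERIOD = datetime.timedelta(days=7)


def forecast_window(time_boundary, churn_period=CHURN_PERIOD):
    """Return the (start, end) of the window in which activity is checked."""
    return time_boundary, time_boundary + churn_period


def has_churned(event_times, time_boundary, churn_period=CHURN_PERIOD):
    """A user is labeled churned if none of their events fall in [start, end)."""
    start, end = forecast_window(time_boundary, churn_period)
    return not any(start <= t < end for t in event_times)


boundary = datetime.datetime(year=2016, month=4, day=1)
window = forecast_window(boundary)

# Hypothetical users: one active inside the window, one only active outside it.
active_user = [datetime.datetime(2016, 4, 3)]
lapsed_user = [datetime.datetime(2016, 3, 30), datetime.datetime(2016, 4, 9)]
```

Here `has_churned(active_user, boundary)` is `False`, while `has_churned(lapsed_user, boundary)` is `True` because neither event falls inside the window.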


### What are the key features that impact churn?

In [7]:
importance = model.get_feature_importance()

In [8]:
print "What are the top 5 factors that impact predictions?"
print "----------------------------------------------------"
print '\n'.join(["%s. %s" % (i+1, x) for i,x in enumerate(importance['description'][0:5])])

What are the top 5 factors that impact predictions?
----------------------------------------------------
1. Days since most recent event
2. Sum of "rank_desc" is  each day in the last 90 days
3. Index 'California in feature 'region'
4. Sum of "rank_desc" is  each day in the last 60 days
5. Sum of "rank_desc" in the last 14 days


### Segmenting groups of users with similar churn explanations

In [11]:
report = model.get_churn_report(interactions_ts, user_data=users, time_boundary=churn_period_apr)
report

PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start : 2016-04-01 00:00:00
PROGRESS:  End   : 2016-04-08 00:00:00
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.


PROGRESS: Generating features for boundary 2016-04-01 00:00:00.
PROGRESS: Joining user_data with aggregated features.
PROGRESS: Not enough data to make predictions for 5023 user(s). 


segment_id,num_users,num_users_percentage,explanation,avg_probability,stdv_probability
0,2328,19.0647776595,"[Sum of ""rank_desc"" is less than 1.50 each day ...",0.977488160133,0.0
1,1824,14.9373515683,"[Sum of ""rank_desc"" is less than 1.50 each day ...",0.977488160133,0.0
2,1662,13.6106788961,"[Sum of ""rank_desc"" is between 1.50 and 6.50 ...",0.836229081547,0.082644345636
3,427,3.49684710507,"[No events in feature ""credits"" in the last 14 ...",0.81698551952,0.0854538316213
4,423,3.46408975514,"[Sum of ""rank_desc"" in the last 21 days greater ...",0.0532949917441,0.036658112064
5,407,3.33306035542,"[Sum of ""rank_desc"" is between 1.50 and 6.50 ...",0.716619795458,0.106947842739
6,327,2.67791335681,"[Less than 2.50 days since most recent event, ...",0.14758957853,0.0815823337296
7,267,2.18655310785,"[Average of ""txns_on_day"" in the last 14 days less ...",0.276421949212,0.100203748211
8,265,2.17017443289,"[Sum of ""rank_desc"" is greater than (or equal ...",0.954297520754,0.0195304310583
9,241,1.97363033331,"[Sum of ""rank_desc"" is greater than (or equal ...",0.683529925297,0.107788700639

users
"[0018750b-9aff-3e46-8169-0a49cae70d6d, ..."
[0009af2b-28ec-39a8-ae38-4b77a92f959 ...
"[00020bdc-2d35-371b-9212-9f348bfeae41, 00341848 ..."
"[00827b92-efe9-3fcd-b481-1212f3ee40fa, 021fd50e- ..."
"[001b4e2c-24a6-3c6c-9ec8-c1c03946a8f6, ..."
"[001e0998-77d7-4db1-9cce-7927ffa09056, ..."
"[01406fb3-5d9a-4e91-9459-89ee820762de, ..."
"[020c2147-08d2-3f7b-809f-40e7546d561c, 03480d2a- ..."
"[00adfd77-cd80-3e65-a30c-0fde245b461e, 00bed819 ..."
"[001ef55f-b67a-3ebb-b737-3906e87ab82e, ..."


In [12]:
report['num_users'].show()
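To make the idea of a segment concrete, the sketch below groups users whose explanation lists are identical and summarizes each group with the same columns as the report. It is a simplified illustration on hypothetical toy data, not the algorithm GraphLab Create uses to build its segments:

```python
from collections import defaultdict


def segment_by_explanation(predictions):
    """Group per-user predictions by their explanation list and summarize each group."""
    groups = defaultdict(list)
    for p in predictions:
        # Lists are not hashable, so use a tuple of the explanation strings as the key.
        groups[tuple(p["explanation"])].append(p)

    total = len(predictions)
    segments = []
    for explanation, members in groups.items():
        probs = [m["probability"] for m in members]
        segments.append({
            "num_users": len(members),
            "num_users_percentage": 100.0 * len(members) / total,
            "explanation": list(explanation),
            "avg_probability": sum(probs) / len(probs),
            "users": [m["user_id"] for m in members],
        })

    # Largest segments first, then assign string ids in that order.
    segments.sort(key=lambda s: s["num_users"], reverse=True)
    for i, s in enumerate(segments):
        s["segment_id"] = str(i)
    return segments


# Hypothetical per-user predictions.
toy_predictions = [
    {"user_id": "a", "probability": 0.5, "explanation": ["No events in the last 14 days"]},
    {"user_id": "b", "probability": 0.75, "explanation": ["No events in the last 14 days"]},
    {"user_id": "c", "probability": 1.0, "explanation": ["No events in the last 14 days"]},
    {"user_id": "d", "probability": 0.1, "explanation": ["Active every day in the last 14 days"]},
]
toy_report = segment_by_explanation(toy_predictions)
```

In this toy report, segment `'0'` holds three of the four users (75%) with an average churn probability of 0.75, which mirrors how `num_users_percentage` and `avg_probability` can be read in the real report.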

### What does a segment look like?

In [13]:
segment = report[report['segment_id'] == '2'][0]

In [14]:
print ""
print "Segment 2"
print "---------------------------------------"
print 'Segment size      : %.2f %% of users' % segment["num_users_percentage"]
print 'Churn probability : %s' % segment["avg_probability"]
print ""
print "Characteristics of users in segment 2?"
print "-----------------------------------------------"
print "\n".join(['%s. %s' % (i + 1, x) for i, x in enumerate(segment["explanation"])])


Segment 2
---------------------------------------
Segment size      : 13.61 % of users
Churn probability : 0.836229081547

Characteristics of users in segment 2?
-----------------------------------------------
1. Sum of "rank_desc" is between 1.50 and 6.50 each day in the last 60 days
2. No events in feature "credits" in the last 14 days
3. No "txns_on_day" events in the last 21 days
4. No events in feature "rank_desc" in the last 14 days
5. No "rank_desc" events in the last 21 days


### Understanding individual predictions: Why did the model make a prediction?

In [22]:
valid['user_id'].head()

dtype: str
Rows: 10
['c796f7b5-69e0-3fa2-b4b5-a68e0877c70d', 'bbb1b843-7485-3cfb-837d-24573410bf3c', 'bbb1b843-7485-3cfb-837d-24573410bf3c', 'f9d1ddbc-0c3c-38c2-bd37-2e4f77c4a1e8', 'a33283b5-1450-398c-b25f-ecf125921d17', 'f9d1ddbc-0c3c-38c2-bd37-2e4f77c4a1e8', 'bbb1b843-7485-3cfb-837d-24573410bf3c', 'bbb1b843-7485-3cfb-837d-24573410bf3c', '8214ccb7-2eb5-45ea-a178-19be0b145268', '6cc11000-0f51-321c-8ba9-c149e8a0f786']

In [23]:
particular_user = valid[valid['user_id'] == 'c796f7b5-69e0-3fa2-b4b5-a68e0877c70d']
particular_user


event_time,user_id,date,rev,e_purchaseamount,e_purchaseprice,hasemail
2016-01-01 00:00:49,c796f7b5-69e0-3fa2-b4b5-a68e0877c70d,2016-01-01,0.99,2500.0,99,True
2016-01-01 07:52:10,c796f7b5-69e0-3fa2-b4b5-a68e0877c70d,2016-01-01,0.99,2500.0,99,True
2016-01-01 21:41:34,c796f7b5-69e0-3fa2-b4b5-a68e0877c70d,2016-01-01,0.99,2500.0,99,True
2016-01-03 03:33:06,c796f7b5-69e0-3fa2-b4b5-a68e0877c70d,2016-01-03,0.99,2500.0,99,True
2016-01-05 04:14:22,c796f7b5-69e0-3fa2-b4b5-a68e0877c70d,2016-01-05,0.99,2500.0,99,True
2016-01-06 12:22:17,c796f7b5-69e0-3fa2-b4b5-a68e0877c70d,2016-01-06,0.99,2500.0,99,True
2016-01-10 20:11:32,c796f7b5-69e0-3fa2-b4b5-a68e0877c70d,2016-01-10,0.99,2500.0,99,True
2016-01-13 08:42:59,c796f7b5-69e0-3fa2-b4b5-a68e0877c70d,2016-01-13,0.99,2500.0,99,True
2016-01-13 09:21:18,c796f7b5-69e0-3fa2-b4b5-a68e0877c70d,2016-01-13,0.99,2500.0,99,True
2016-01-13 11:50:57,c796f7b5-69e0-3fa2-b4b5-a68e0877c70d,2016-01-13,0.99,2500.0,99,True

e_viptier,xrate,e_source,e_vip_boost,e_vip_points,e_creditsbeforepurchase,e_level
6,25.2525252525,VIPDialog,1.5,100,300001-1000000,1000
6,25.2525252525,VIPDialog,1.5,100,300001-1000000,1000
6,25.2525252525,VIPDialog,1.5,100,300001-1000000,1000
6,25.2525252525,VIPDialog,1.5,100,300001-1000000,1000
6,25.2525252525,VIPDialog,1.5,100,300001-1000000,1000
6,25.2525252525,VIPDialog,1.5,100,300001-1000000,1000
6,25.2525252525,VIPDialog,1.5,100,1000001-5000000,1000
6,25.2525252525,VIPDialog,1.5,100,300001-1000000,1000
6,25.2525252525,VIPDialog,1.5,100,300001-1000000,1000
6,25.2525252525,VIPDialog,1.5,100,300001-1000000,1000

e_machine,u_playertenure,u_fbstatus,u_totalcredits,credits,rn,rank,txns,txns_on_day,rank_desc
before_spin,102,True,751968,749468.0,1,1,57,3,27
SimpleWild5x,102,True,885468,882968.0,2,1,57,3,27
SimpleCrazySevens,103,True,812818,810318.0,3,1,57,3,27
SimpleWild5x,104,True,751268,748768.0,4,2,57,1,26
SimpleWild5x,106,True,749568,747068.0,5,3,57,1,25
SimpleRespinRedHot,108,True,798318,795818.0,6,4,57,1,24
SimpleWild5x,112,True,1594318,1591818.0,7,5,57,1,23
before_spin,114,True,772068,769568.0,8,6,57,4,22
before_spin,114,True,778318,775818.0,9,6,57,4,22
MultiWild5x,115,True,760568,758068.0,10,6,57,4,22

next_event_time,previous_event_time,last_event_time,first_event_time
2016-01-01 07:52:10,(null),2016-04-23 04:16:25,2016-01-01 00:00:49
2016-01-01 21:41:34,2016-01-01 00:00:49,2016-04-23 04:16:25,2016-01-01 00:00:49
2016-01-03 03:33:06,2016-01-01 07:52:10,2016-04-23 04:16:25,2016-01-01 00:00:49
2016-01-05 04:14:22,2016-01-01 21:41:34,2016-04-23 04:16:25,2016-01-01 00:00:49
2016-01-06 12:22:17,2016-01-03 03:33:06,2016-04-23 04:16:25,2016-01-01 00:00:49
2016-01-10 20:11:32,2016-01-05 04:14:22,2016-04-23 04:16:25,2016-01-01 00:00:49
2016-01-13 08:42:59,2016-01-06 12:22:17,2016-04-23 04:16:25,2016-01-01 00:00:49
2016-01-13 09:21:18,2016-01-10 20:11:32,2016-04-23 04:16:25,2016-01-01 00:00:49
2016-01-13 11:50:57,2016-01-13 08:42:59,2016-04-23 04:16:25,2016-01-01 00:00:49
2016-01-13 19:26:23,2016-01-13 09:21:18,2016-04-23 04:16:25,2016-01-01 00:00:49


In [24]:
explanations = model.explain(particular_user, user_data=users)

PROGRESS: Making a churn forecast for the time window:
PROGRESS: --------------------------------------------------
PROGRESS:  Start : 2016-04-23 04:16:25
PROGRESS:  End   : 2016-04-30 04:16:25
PROGRESS: --------------------------------------------------
PROGRESS: Grouping dataset by user.
PROGRESS: Resampling grouped observation_data by time-period 1 day, 0:00:00.


PROGRESS: Generating features for boundary 2016-04-23 04:16:25.
PROGRESS: Joining user_data with aggregated features.


In [25]:
print ""
print "Model explanations"
print "---------------------------------------"
print 'Customer ID       : %s' % explanations["user_id"]
print 'Churn probability : %s' % explanations["probability"]
print ""
print "Why did the model make this prediction?"
print "---------------------------------------"
print "\n".join(['%s. %s' % (i + 1, x) for i, x in enumerate(explanations["explanation"][0])])


Model explanations
---------------------------------------
Customer ID       : ['c796f7b5-69e0-3fa2-b4b5-a68e0877c70d']
Churn probability : [0.9774881601333618]

Why did the model make this prediction?
---------------------------------------
1. Sum of "rank_desc" is less than 1.50 each day in the last 60 days
2. Sum of "rank_desc" is less than 4.50 each day in the last 14 days
