# Customer Churn Analysis
Churn rate, when applied to a customer base, refers to the proportion of contractual customers or subscribers who leave a supplier during a given time period. It is a possible indicator of customer dissatisfaction, cheaper and/or better offers from the competition, more successful sales and/or marketing by the competition, or reasons having to do with the customer life cycle.

Churn is closely related to the concept of average customer life time. For example, an annual churn rate of 25 percent implies an average customer life of four years. An annual churn rate of 33 percent implies an average customer life of three years. The churn rate can be minimized by creating barriers which discourage customers to change suppliers (contractual binding periods, use of proprietary technology, value-added services, unique business models, etc.), or through retention activities such as loyalty programs. It is possible to overstate the churn rate, as when a consumer drops the service but then restarts it within the same year. Thus, a clear distinction needs to be made between "gross churn", the total number of absolute disconnections, and "net churn", the overall loss of subscribers or members. The difference between the two measures is the number of new subscribers or members that have joined during the same period. Suppliers may find that if they offer a loss-leader "introductory special", it can lead to a higher churn rate and subscriber abuse, as some subscribers will sign on, let the service lapse, then sign on again to take continuous advantage of current specials. https://en.wikipedia.org/wiki/Churn_rate

In [1]:
%%capture

import numpy as np
import pandas as pd
import h2o
from h2o.automl import H2OAutoML
from __future__ import print_function
import pandas_profiling

# Suppress unwatned warnings
import warnings
warnings.filterwarnings('ignore')
import logging
logging.getLogger("requests").setLevel(logging.WARNING)

In [2]:
# Load our favorite visualization library
import os
import plotly
import plotly.plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
import cufflinks as cf
plotly.offline.init_notebook_mode(connected=True)

# Sign into Plotly with masked, encrypted API key

myPlotlyKey = os.environ['SECRET_ENV_BRETTS_PLOTLY_KEY']
py.sign_in(username='bretto777',api_key=myPlotlyKey)


### Load The Trifacta Prepared Dataset

In [3]:
accessKey = os.environ['BRETT_AWS_ACCESS_KEY']
s3file = 'https://trifactapro.s3.amazonaws.com/trifacta/queryResults/admin%40trifacta.local/churn-prepared.csv?AWSAccessKeyId=' + accessKey + '&Expires=1521559382&Signature=yFubi211G%2BXVFp%2Bdb1tPFZKrnSk%3D'

In [6]:
# Load some data
churnDF = pd.read_csv(s3file, delimiter=',')
churnDF["Churn"] = churnDF["Churn"].replace(to_replace=False, value='Retain')
churnDF["Churn"] = churnDF["Churn"].replace(to_replace=True, value='Churn')
churnDFs = churnDF.sample(frac=0.07) # Sample for speedy viz
churnDF.head(5)

Unnamed: 0,Phone,Phone1,Int_l_Plan,VMail_Plan,VMail_Message,Day_Mins,Day_Calls,Day_Charge,Eve_Mins,Eve_Calls,...,Night_Calls,Night_Charge,Intl_Mins,Intl_Calls,Intl_Charge,CustServ_Calls,Churn,State,Account_Length,Area_Code
0,389-8163,389-8163,False,True,34,192.3,114,32.69,129.3,114,...,102,6.13,6.3,12,1.7,1,Retain,NV,110,408
1,351-1894,351-1894,False,True,38,193.0,106,32.81,153.6,106,...,87,11.72,7.4,5,2.0,2,Retain,ND,84,408
2,386-6408,386-6408,False,False,0,72.5,88,12.33,204.0,112,...,118,5.31,6.6,3,1.78,1,Retain,WV,113,510
3,370-9592,370-9592,False,True,40,105.2,61,17.88,341.3,79,...,97,7.46,6.3,3,1.7,2,Retain,CO,181,510
4,397-9251,397-9251,False,False,0,180.5,88,30.69,134.7,102,...,97,7.68,10.0,3,2.7,2,Retain,TX,51,415


In [7]:
#%%capture
pandas_profiling.ProfileReport(churnDF)

0,1
Number of variables,22
Number of observations,3290
Total Missing (%),0.0%
Total size in memory,520.6 KiB
Average record size in memory,162.0 B

0,1
Numeric,11
Categorical,2
Boolean,2
Date,0
Text (Unique),2
Rejected,5
Unsupported,0

0,1
Distinct count,211
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,101.17
Minimum,1
Maximum,243
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,36
Q1,74
Median,101
Q3,127
95-th percentile,167
Maximum,243
Range,242
Interquartile range,53

0,1
Standard deviation,39.719
Coef of variation,0.39259
Kurtosis,-0.10928
Mean,101.17
MAD,31.733
Skewness,0.090407
Sum,332859
Variance,1577.6
Memory size,25.8 KiB

Value,Count,Frequency (%),Unnamed: 3
105,43,0.0%,
87,41,0.0%,
101,40,0.0%,
90,39,0.0%,
86,38,0.0%,
95,38,0.0%,
93,37,0.0%,
116,37,0.0%,
100,37,0.0%,
107,36,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1,8,0.0%,
2,1,0.0%,
3,5,0.0%,
4,1,0.0%,
5,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
221,1,0.0%,
224,2,0.0%,
225,2,0.0%,
232,1,0.0%,
243,1,0.0%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,437.14
Minimum,408
Maximum,510
Zeros (%),0.0%

0,1
Minimum,408
5-th percentile,408
Q1,408
Median,415
Q3,510
95-th percentile,510
Maximum,510
Range,102
Interquartile range,102

0,1
Standard deviation,42.355
Coef of variation,0.09689
Kurtosis,-0.70077
Mean,437.14
MAD,36.672
Skewness,1.1289
Sum,1438200
Variance,1793.9
Memory size,25.8 KiB

Value,Count,Frequency (%),Unnamed: 3
415,1632,0.0%,
408,830,0.0%,
510,828,0.0%,

Value,Count,Frequency (%),Unnamed: 3
408,830,0.0%,
415,1632,0.0%,
510,828,0.0%,

Value,Count,Frequency (%),Unnamed: 3
408,830,0.0%,
415,1632,0.0%,
510,828,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Retain,2839
Churn,451

Value,Count,Frequency (%),Unnamed: 3
Retain,2839,0.0%,
Churn,451,0.0%,

0,1
Distinct count,10
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.5641
Minimum,0
Maximum,9
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,1
Median,1
Q3,2
95-th percentile,4
Maximum,9
Range,9
Interquartile range,1

0,1
Standard deviation,1.3159
Coef of variation,0.84128
Kurtosis,1.7566
Mean,1.5641
MAD,1.0521
Skewness,1.0964
Sum,5146
Variance,1.7315
Memory size,25.8 KiB

Value,Count,Frequency (%),Unnamed: 3
1,1166,0.0%,
2,751,0.0%,
0,686,0.0%,
3,425,0.0%,
4,161,0.0%,
5,66,0.0%,
6,22,0.0%,
7,9,0.0%,
9,2,0.0%,
8,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,686,0.0%,
1,1166,0.0%,
2,751,0.0%,
3,425,0.0%,
4,161,0.0%,

Value,Count,Frequency (%),Unnamed: 3
5,66,0.0%,
6,22,0.0%,
7,9,0.0%,
8,2,0.0%,
9,2,0.0%,

0,1
Distinct count,119
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,100.48
Minimum,0
Maximum,165
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,67
Q1,87
Median,101
Q3,114
95-th percentile,133
Maximum,165
Range,165
Interquartile range,27

0,1
Standard deviation,20.058
Coef of variation,0.19963
Kurtosis,0.24779
Mean,100.48
MAD,15.94
Skewness,-0.10901
Sum,330567
Variance,402.31
Memory size,25.8 KiB

Value,Count,Frequency (%),Unnamed: 3
102,77,0.0%,
105,73,0.0%,
107,68,0.0%,
104,68,0.0%,
95,68,0.0%,
108,67,0.0%,
110,66,0.0%,
88,66,0.0%,
97,66,0.0%,
101,65,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,2,0.0%,
30,1,0.0%,
35,1,0.0%,
36,1,0.0%,
40,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
157,1,0.0%,
158,3,0.0%,
160,1,0.0%,
163,1,0.0%,
165,1,0.0%,

0,1
Correlation,1

0,1
Distinct count,1627
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,177.97
Minimum,0
Maximum,299.5
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,89.7
Q1,143.3
Median,178.7
Q3,215.25
95-th percentile,264.75
Maximum,299.5
Range,299.5
Interquartile range,71.95

0,1
Standard deviation,52.451
Coef of variation,0.29471
Kurtosis,-0.18266
Mean,177.97
MAD,42.284
Skewness,-0.17446
Sum,585530
Variance,2751.1
Memory size,25.8 KiB

Value,Count,Frequency (%),Unnamed: 3
174.5,8,0.0%,
154.0,8,0.0%,
159.5,8,0.0%,
162.3,7,0.0%,
175.4,7,0.0%,
183.4,7,0.0%,
206.2,6,0.0%,
178.7,6,0.0%,
155.2,6,0.0%,
215.6,6,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,2,0.0%,
2.6,1,0.0%,
7.8,1,0.0%,
7.9,1,0.0%,
12.5,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
297.9,1,0.0%,
298.1,1,0.0%,
298.4,1,0.0%,
299.4,1,0.0%,
299.5,2,0.0%,

0,1
Distinct count,123
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,100.09
Minimum,0
Maximum,170
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,67
Q1,87
Median,100
Q3,114
95-th percentile,133
Maximum,170
Range,170
Interquartile range,27

0,1
Standard deviation,19.943
Coef of variation,0.19925
Kurtosis,0.20248
Mean,100.09
MAD,15.881
Skewness,-0.055302
Sum,329298
Variance,397.74
Memory size,25.8 KiB

Value,Count,Frequency (%),Unnamed: 3
105,80,0.0%,
94,77,0.0%,
108,71,0.0%,
102,70,0.0%,
88,69,0.0%,
97,69,0.0%,
101,68,0.0%,
98,66,0.0%,
111,65,0.0%,
109,64,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1,0.0%,
12,1,0.0%,
36,1,0.0%,
37,1,0.0%,
42,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
157,1,0.0%,
159,1,0.0%,
164,1,0.0%,
168,1,0.0%,
170,1,0.0%,

0,1
Correlation,1

0,1
Distinct count,1604
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,200.95
Minimum,0
Maximum,363.7
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,118.75
Q1,166.53
Median,201.3
Q3,235.3
95-th percentile,284.3
Maximum,363.7
Range,363.7
Interquartile range,68.775

0,1
Standard deviation,50.744
Coef of variation,0.25252
Kurtosis,0.022569
Mean,200.95
MAD,40.492
Skewness,-0.022683
Sum,661130
Variance,2575
Memory size,25.8 KiB

Value,Count,Frequency (%),Unnamed: 3
169.9,9,0.0%,
201.0,7,0.0%,
195.5,7,0.0%,
220.6,7,0.0%,
161.7,7,0.0%,
180.5,7,0.0%,
209.4,7,0.0%,
230.9,7,0.0%,
167.2,7,0.0%,
179.3,6,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1,0.0%,
31.2,1,0.0%,
42.2,1,0.0%,
42.5,1,0.0%,
43.9,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
350.9,1,0.0%,
351.6,1,0.0%,
354.2,1,0.0%,
361.8,1,0.0%,
363.7,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.09696

0,1
True,319
(Missing),2971

Value,Count,Frequency (%),Unnamed: 3
True,319,0.0%,
(Missing),2971,0.0%,

0,1
Distinct count,21
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,4.4723
Minimum,0
Maximum,20
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,1
Q1,3
Median,4
Q3,6
95-th percentile,9
Maximum,20
Range,20
Interquartile range,3

0,1
Standard deviation,2.4573
Coef of variation,0.54944
Kurtosis,3.1225
Mean,4.4723
MAD,1.8763
Skewness,1.3259
Sum,14714
Variance,6.0383
Memory size,25.8 KiB

Value,Count,Frequency (%),Unnamed: 3
3,657,0.0%,
4,614,0.0%,
2,484,0.0%,
5,467,0.0%,
6,332,0.0%,
7,214,0.0%,
1,159,0.0%,
8,113,0.0%,
9,106,0.0%,
10,49,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,18,0.0%,
1,159,0.0%,
2,484,0.0%,
3,657,0.0%,
4,614,0.0%,

Value,Count,Frequency (%),Unnamed: 3
16,2,0.0%,
17,1,0.0%,
18,3,0.0%,
19,1,0.0%,
20,1,0.0%,

0,1
Correlation,0.99999

0,1
Distinct count,162
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,10.238
Minimum,0
Maximum,20
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,5.7
Q1,8.5
Median,10.3
Q3,12.1
95-th percentile,14.7
Maximum,20.0
Range,20.0
Interquartile range,3.6

0,1
Standard deviation,2.7936
Coef of variation,0.27286
Kurtosis,0.62507
Mean,10.238
MAD,2.1841
Skewness,-0.24642
Sum,33684
Variance,7.8043
Memory size,25.8 KiB

Value,Count,Frequency (%),Unnamed: 3
10.0,62,0.0%,
11.3,58,0.0%,
10.9,56,0.0%,
9.8,56,0.0%,
10.2,53,0.0%,
11.0,52,0.0%,
10.6,52,0.0%,
11.1,52,0.0%,
11.4,51,0.0%,
9.5,51,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,18,0.0%,
1.1,1,0.0%,
1.3,1,0.0%,
2.0,2,0.0%,
2.1,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
18.2,2,0.0%,
18.3,1,0.0%,
18.4,1,0.0%,
18.9,1,0.0%,
20.0,1,0.0%,

0,1
Distinct count,119
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,100.09
Minimum,33
Maximum,175
Zeros (%),0.0%

0,1
Minimum,33
5-th percentile,68
Q1,87
Median,100
Q3,113
95-th percentile,132
Maximum,175
Range,142
Interquartile range,26

0,1
Standard deviation,19.572
Coef of variation,0.19554
Kurtosis,-0.077962
Mean,100.09
MAD,15.702
Skewness,0.029911
Sum,329312
Variance,383.07
Memory size,25.8 KiB

Value,Count,Frequency (%),Unnamed: 3
105,83,0.0%,
104,77,0.0%,
91,76,0.0%,
102,69,0.0%,
106,67,0.0%,
98,67,0.0%,
100,67,0.0%,
94,64,0.0%,
95,64,0.0%,
92,64,0.0%,

Value,Count,Frequency (%),Unnamed: 3
33,1,0.0%,
36,1,0.0%,
38,1,0.0%,
42,2,0.0%,
44,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
156,2,0.0%,
157,2,0.0%,
164,1,0.0%,
166,1,0.0%,
175,1,0.0%,

0,1
Correlation,1

0,1
Distinct count,1583
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,200.78
Minimum,23.2
Maximum,395
Zeros (%),0.0%

0,1
Minimum,23.2
5-th percentile,118.0
Q1,167.0
Median,201.25
Q3,235.17
95-th percentile,282.71
Maximum,395.0
Range,371.8
Interquartile range,68.175

0,1
Standard deviation,50.546
Coef of variation,0.25174
Kurtosis,0.09302
Mean,200.78
MAD,40.377
Skewness,0.0096501
Sum,660580
Variance,2554.9
Memory size,25.8 KiB

Value,Count,Frequency (%),Unnamed: 3
210.0,8,0.0%,
191.4,8,0.0%,
214.6,8,0.0%,
188.2,8,0.0%,
197.4,7,0.0%,
214.7,7,0.0%,
206.1,7,0.0%,
194.3,7,0.0%,
231.5,7,0.0%,
221.6,7,0.0%,

Value,Count,Frequency (%),Unnamed: 3
23.2,1,0.0%,
43.7,1,0.0%,
45.0,1,0.0%,
47.4,1,0.0%,
50.1,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
364.9,1,0.0%,
367.7,1,0.0%,
377.5,1,0.0%,
381.9,1,0.0%,
395.0,1,0.0%,

First 3 values
353-1352
411-1530
420-7692

Last 3 values
403-8904
420-6052
373-3959

Value,Count,Frequency (%),Unnamed: 3
327-1058,1,0.0%,
327-1319,1,0.0%,
327-3053,1,0.0%,
327-3587,1,0.0%,
327-3850,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
422-7728,1,0.0%,
422-8268,1,0.0%,
422-8333,1,0.0%,
422-8344,1,0.0%,
422-9964,1,0.0%,

First 3 values
353-1352
411-1530
420-7692

Last 3 values
403-8904
420-6052
373-3959

Value,Count,Frequency (%),Unnamed: 3
327-1058,1,0.0%,
327-1319,1,0.0%,
327-3053,1,0.0%,
327-3587,1,0.0%,
327-3850,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
422-7728,1,0.0%,
422-8268,1,0.0%,
422-8333,1,0.0%,
422-8344,1,0.0%,
422-9964,1,0.0%,

0,1
Distinct count,51
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
WV,105
MN,82
NY,81
Other values (48),3022

Value,Count,Frequency (%),Unnamed: 3
WV,105,0.0%,
MN,82,0.0%,
NY,81,0.0%,
AL,79,0.0%,
OR,77,0.0%,
VA,77,0.0%,
WI,77,0.0%,
WY,77,0.0%,
OH,76,0.0%,
ID,73,0.0%,

0,1
Correlation,0.95665

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.27751

0,1
True,913
(Missing),2377

Value,Count,Frequency (%),Unnamed: 3
True,913,0.0%,
(Missing),2377,0.0%,

Unnamed: 0,Phone,Phone1,Int_l_Plan,VMail_Plan,VMail_Message,Day_Mins,Day_Calls,Day_Charge,Eve_Mins,Eve_Calls,Eve_Charge,Night_Mins,Night_Calls,Night_Charge,Intl_Mins,Intl_Calls,Intl_Charge,CustServ_Calls,Churn,State,Account_Length,Area_Code
0,389-8163,389-8163,False,True,34,192.3,114,32.69,129.3,114,10.99,136.3,102,6.13,6.3,12,1.7,1,Retain,NV,110,408
1,351-1894,351-1894,False,True,38,193.0,106,32.81,153.6,106,13.06,260.4,87,11.72,7.4,5,2.0,2,Retain,ND,84,408
2,386-6408,386-6408,False,False,0,72.5,88,12.33,204.0,112,17.34,117.9,118,5.31,6.6,3,1.78,1,Retain,WV,113,510
3,370-9592,370-9592,False,True,40,105.2,61,17.88,341.3,79,29.01,165.7,97,7.46,6.3,3,1.7,2,Retain,CO,181,510
4,397-9251,397-9251,False,False,0,180.5,88,30.69,134.7,102,11.45,170.7,97,7.68,10.0,3,2.7,2,Retain,TX,51,415


## Scatterplot Matrix

In [9]:
# separate the calls data for plotting

churnDFs = churnDFs[['Account_Length','Day_Calls','Eve_Calls','CustServ_Calls','Churn']]

# Create scatter plot matrix of call data
splom = ff.create_scatterplotmatrix(churnDFs, diag='histogram', index='Churn',  
                                  colormap= dict(
                                      Churn = '#9CBEF1',
                                      Retain = '#04367F'
                                      ),
                                  colormap_type='cat',
                                  height=560, width=650,
                                  size=4, marker=dict(symbol='circle'))
py.iplot(splom)

In [10]:
%%capture

#h2o.connect(ip="35.225.239.147")
h2o.init(nthreads=1, max_mem_size="768m")

In [11]:
%%capture

# Split data into training and testing frames

from sklearn import cross_validation
from sklearn.model_selection import train_test_split

training, testing = train_test_split(churnDF, train_size=0.8, stratify=churnDF["Churn"], random_state=9)
train = h2o.H2OFrame(python_obj=training).drop("State")
test = h2o.H2OFrame(python_obj=testing).drop("State")

# Set predictor and response variables
y = "Churn"
x = train.columns
x.remove(y)

# Automatic Machine Learning

The Automatic Machine Learning (AutoML) function automates the supervised machine learning model training process. The current version of AutoML trains and cross-validates a Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and a Stacked Ensemble of all the models.

In [12]:
%%capture
# Run AutoML until 11 models are built
autoModel = H2OAutoML(max_models = 20)
autoModel.train(x = x, y = y,
          training_frame = train,
          validation_frame = test, 
          leaderboard_frame = test)


## Leaderboard

In [9]:
leaders = autoModel.leaderboard
leaders

model_id,auc,logloss
GBM_grid_0_AutoML_20180222_185551_model_5,0.950331,0.158572
GBM_grid_0_AutoML_20180222_185551_model_0,0.950304,0.142285
GBM_grid_0_AutoML_20180222_185551_model_2,0.950113,0.132571
GBM_grid_0_AutoML_20180222_185551_model_3,0.943027,0.139264
GBM_grid_0_AutoML_20180222_185551_model_1,0.942033,0.138981
StackedEnsemble_AllModels_0_AutoML_20180222_185551,0.941501,0.12717
GBM_grid_0_AutoML_20180222_185551_model_15,0.937618,0.143789
GBM_grid_0_AutoML_20180222_185551_model_4,0.937577,0.190669
GBM_grid_0_AutoML_20180222_185551_model_13,0.936378,0.182239
GBM_grid_0_AutoML_20180222_185551_model_7,0.934402,0.158281




# Variable Importances
Below we plot variable importances as reported by the best performing algo in the ensemble.

In [10]:
importances = h2o.get_model(leaders[2, 0]).varimp(use_pandas=True)
importances = importances.loc[:,['variable','relative_importance']].groupby('variable').mean()
importances.sort_values(by="relative_importance", ascending=False).iplot(kind='bar', colors='#5AC4F2', theme='white')

In [16]:
import matplotlib.pyplot as plt
plt.figure()
bestModel = h2o.get_model(leaders[2, 0])
plt = bestModel.partial_plot(data=test, cols=["Day Mins","CustServ Calls","Day Charge"])


PartialDependencePlot progress: |█████████████████████████████████████████| 100%


# Best Model vs the Base Learners
This plot shows the ROC curves for the Super Model, the Best Base Model, and 9 next best models in the ensemble. 

In [12]:
Model0 = np.array(h2o.get_model(leaders[0,0]).roc(xval=True))
Model1 = np.array(h2o.get_model(leaders[1,0]).roc(xval=True))
Model2 = np.array(h2o.get_model(leaders[2,0]).roc(xval=True))
Model3 = np.array(h2o.get_model(leaders[3,0]).roc(xval=True))
Model4 = np.array(h2o.get_model(leaders[4,0]).roc(xval=True))
Model5 = np.array(h2o.get_model(leaders[5,0]).roc(xval=True))
Model6 = np.array(h2o.get_model(leaders[6,0]).roc(xval=True))
Model7 = np.array(h2o.get_model(leaders[7,0]).roc(xval=True))
Model8 = np.array(h2o.get_model(leaders[8,0]).roc(xval=True))
Model9 = np.array(h2o.get_model(leaders[9,0]).roc(xval=True))

layout = go.Layout(autosize=False, width=725, height=575,  xaxis=dict(title='False Positive Rate', titlefont=dict(family='Arial, sans-serif', size=15, color='grey')), 
                                                           yaxis=dict(title='True Positive Rate', titlefont=dict(family='Arial, sans-serif', size=15, color='grey')))

traceChanceLine = go.Scatter(x = [0,1], y = [0,1], mode = 'lines+markers', name = 'chance', line = dict(color = ('rgb(136, 140, 150)'), width = 4, dash = 'dash'))
Model0Trace = go.Scatter(x = Model0[0], y = Model0[1], mode = 'lines', name = 'Model 0', line = dict(color = ('rgb(26, 58, 126)'), width = 3))
Model1Trace = go.Scatter(x = Model1[0], y = Model1[1], mode = 'lines', name = 'Model 1', line = dict(color = ('rgb(156, 190, 241))'), width = 1))
Model2Trace = go.Scatter(x = Model2[0], y = Model2[1], mode = 'lines', name = 'Model 2', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model3Trace = go.Scatter(x = Model3[0], y = Model3[1], mode = 'lines', name = 'Model 3', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model4Trace = go.Scatter(x = Model4[0], y = Model4[1], mode = 'lines', name = 'Model 4', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model5Trace = go.Scatter(x = Model5[0], y = Model5[1], mode = 'lines', name = 'Model 5', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model6Trace = go.Scatter(x = Model6[0], y = Model6[1], mode = 'lines', name = 'Model 6', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model7Trace = go.Scatter(x = Model7[0], y = Model7[1], mode = 'lines', name = 'Model 7', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model8Trace = go.Scatter(x = Model8[0], y = Model8[1], mode = 'lines', name = 'Model 8', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model9Trace = go.Scatter(x = Model9[0], y = Model9[1], mode = 'lines', name = 'Model 9', line = dict(color = ('rgb(156, 190, 241)'), width = 1))

fig = go.Figure(data=[Model0Trace,Model1Trace,Model2Trace,Model3Trace,Model4Trace,Model5Trace,Model6Trace,Model8Trace,Model9Trace,traceChanceLine], layout=layout)

py.iplot(fig)

# Confusion Matrix

In [13]:
cm = h2o.get_model(leaders[1, 0]).confusion_matrix(xval=True)
cm = cm.table.as_data_frame()
cm
confusionMatrix = ff.create_table(cm)
confusionMatrix.layout.height=300
confusionMatrix.layout.width=800
confusionMatrix.layout.font.size=17
py.iplot(confusionMatrix)

# Business Impact Matrix

Weighting Predictions With a Dollar Value
-   Correctly predicting retain: `+$5`
-   Correctly predicting churn: `+$75`
-   Incorrectly predicting retain: `-$150`
-   Incorrectly predicting churn: `-$1.5`

    

In [14]:
CorrectPredictChurn = cm.loc[0,'Churn']
CorrectPredictChurnImpact = 75
cm1 = CorrectPredictChurn*CorrectPredictChurnImpact

IncorrectPredictChurn = cm.loc[1,'Churn']
IncorrectPredictChurnImpact = -5
cm2 = IncorrectPredictChurn*IncorrectPredictChurnImpact

IncorrectPredictRetain = cm.loc[0,'Retain']
IncorrectPredictRetainImpact = -150
cm3 = IncorrectPredictRetain*IncorrectPredictRetainImpact

CorrectPredictRetain = cm.loc[0,'Retain']
CorrectPredictRetainImpact = 5
cm4 = IncorrectPredictRetain*CorrectPredictRetainImpact


data_matrix = [['Business Impact', '($) Predicted Churn', '($) Predicted Retain', '($) Total'],
               ['($) Actual Churn', cm1, cm3, '' ],
               ['($) Actual Retain', cm2, cm4, ''],
               ['($) Total', cm1+cm2, cm3+cm4, cm1+cm2+cm3+cm4]]

impactMatrix = ff.create_table(data_matrix, height_constant=20, hoverinfo='weight')
impactMatrix.layout.height=300
impactMatrix.layout.width=800
impactMatrix.layout.font.size=17
py.iplot(impactMatrix)

In [24]:
print("Total customers evaluated: 2132")

Total customers evaluated: 2132


In [25]:
print("Total value created by the model: $" + str(cm1+cm2+cm3+cm4))

Total value created by the model: $3955.0


In [26]:
print("Total value per customer: $" +str(round(((cm1+cm2+cm3+cm4)/2132),3)))

Total value per customer: $1.855


In [48]:
%%capture
# Save the best model

path = h2o.save_model(model=h2o.get_model(leaders[0, 0]), force=True)
os.rename(h2o.get_model(leaders[0, 0]).model_id, "AutoML-leader")    

In [49]:
%%capture
LoadedEnsemble = h2o.load_model(path="AutoML-leader")
print(LoadedEnsemble)