# Customer Churn Analysis
Churn rate, when applied to a customer base, refers to the proportion of contractual customers or subscribers who leave a supplier during a given time period. It is a possible indicator of customer dissatisfaction, cheaper and/or better offers from the competition, more successful sales and/or marketing by the competition, or reasons having to do with the customer life cycle.

Churn is closely related to the concept of average customer life time. For example, an annual churn rate of 25 percent implies an average customer life of four years. An annual churn rate of 33 percent implies an average customer life of three years. The churn rate can be minimized by creating barriers which discourage customers to change suppliers (contractual binding periods, use of proprietary technology, value-added services, unique business models, etc.), or through retention activities such as loyalty programs. It is possible to overstate the churn rate, as when a consumer drops the service but then restarts it within the same year. Thus, a clear distinction needs to be made between "gross churn", the total number of absolute disconnections, and "net churn", the overall loss of subscribers or members. The difference between the two measures is the number of new subscribers or members that have joined during the same period. Suppliers may find that if they offer a loss-leader "introductory special", it can lead to a higher churn rate and subscriber abuse, as some subscribers will sign on, let the service lapse, then sign on again to take continuous advantage of current specials. https://en.wikipedia.org/wiki/Churn_rate

In [29]:
%%capture
%load_ext autoreload
%autoreload 2

import sys 
sys.path.append('model_management')

from model_management.sklearn_model import SklearnModel
import numpy as np
import pandas as pd
import h2o
from h2o.automl import H2OAutoML
from __future__ import print_function
import pandas_profiling

# Suppress unwatned warnings
import warnings
warnings.filterwarnings('ignore')
import logging
logging.getLogger("requests").setLevel(logging.WARNING)

# Load our favorite visualization library
import os
import plotly
import plotly.plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
import cufflinks as cf
plotly.offline.init_notebook_mode(connected=True)

# Sign into Plotly with masked, encrypted API key

myPlotlyKey = os.environ['SECRET_ENV_BRETTS_PLOTLY_KEY']
py.sign_in(username='bretto777',api_key=myPlotlyKey)

### Load The Dataset

In [30]:
# Load some data
churnDF = pd.read_csv('https://trifactapro.s3.amazonaws.com/churn.csv', delimiter=',')
churnDF.head(5)

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [31]:
#%%capture
pandas_profiling.ProfileReport(churnDF)

0,1
Number of variables,21
Number of observations,3333
Total Missing (%),0.0%
Total size in memory,524.1 KiB
Average record size in memory,161.0 B

0,1
Numeric,12
Categorical,3
Boolean,1
Date,0
Text (Unique),1
Rejected,4
Unsupported,0

0,1
Distinct count,212
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,101.06
Minimum,1
Maximum,243
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,35
Q1,74
Median,101
Q3,127
95-th percentile,167
Maximum,243
Range,242
Interquartile range,53

0,1
Standard deviation,39.822
Coef of variation,0.39403
Kurtosis,-0.10784
Mean,101.06
MAD,31.821
Skewness,0.096606
Sum,336849
Variance,1585.8
Memory size,26.1 KiB

Value,Count,Frequency (%),Unnamed: 3
105,43,0.0%,
87,42,0.0%,
93,40,0.0%,
101,40,0.0%,
90,39,0.0%,
86,38,0.0%,
95,38,0.0%,
116,37,0.0%,
100,37,0.0%,
112,36,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1,8,0.0%,
2,1,0.0%,
3,5,0.0%,
4,1,0.0%,
5,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
221,1,0.0%,
224,2,0.0%,
225,2,0.0%,
232,1,0.0%,
243,1,0.0%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,437.18
Minimum,408
Maximum,510
Zeros (%),0.0%

0,1
Minimum,408
5-th percentile,408
Q1,408
Median,415
Q3,510
95-th percentile,510
Maximum,510
Range,102
Interquartile range,102

0,1
Standard deviation,42.371
Coef of variation,0.096919
Kurtosis,-0.70563
Mean,437.18
MAD,36.704
Skewness,1.1268
Sum,1457129
Variance,1795.3
Memory size,26.1 KiB

Value,Count,Frequency (%),Unnamed: 3
415,1655,0.0%,
510,840,0.0%,
408,838,0.0%,

Value,Count,Frequency (%),Unnamed: 3
408,838,0.0%,
415,1655,0.0%,
510,840,0.0%,

Value,Count,Frequency (%),Unnamed: 3
408,838,0.0%,
415,1655,0.0%,
510,840,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.14491

0,1
True,483
(Missing),2850

Value,Count,Frequency (%),Unnamed: 3
True,483,0.0%,
(Missing),2850,0.0%,

0,1
Distinct count,10
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.5629
Minimum,0
Maximum,9
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,1
Median,1
Q3,2
95-th percentile,4
Maximum,9
Range,9
Interquartile range,1

0,1
Standard deviation,1.3155
Coef of variation,0.84172
Kurtosis,1.7309
Mean,1.5629
MAD,1.0525
Skewness,1.0914
Sum,5209
Variance,1.7305
Memory size,26.1 KiB

Value,Count,Frequency (%),Unnamed: 3
1,1181,0.0%,
2,759,0.0%,
0,697,0.0%,
3,429,0.0%,
4,166,0.0%,
5,66,0.0%,
6,22,0.0%,
7,9,0.0%,
9,2,0.0%,
8,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,697,0.0%,
1,1181,0.0%,
2,759,0.0%,
3,429,0.0%,
4,166,0.0%,

Value,Count,Frequency (%),Unnamed: 3
5,66,0.0%,
6,22,0.0%,
7,9,0.0%,
8,2,0.0%,
9,2,0.0%,

0,1
Distinct count,119
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,100.44
Minimum,0
Maximum,165
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,67
Q1,87
Median,101
Q3,114
95-th percentile,133
Maximum,165
Range,165
Interquartile range,27

0,1
Standard deviation,20.069
Coef of variation,0.19982
Kurtosis,0.24318
Mean,100.44
MAD,15.945
Skewness,-0.11179
Sum,334752
Variance,402.77
Memory size,26.1 KiB

Value,Count,Frequency (%),Unnamed: 3
102,78,0.0%,
105,75,0.0%,
107,69,0.0%,
95,69,0.0%,
104,68,0.0%,
108,67,0.0%,
97,67,0.0%,
110,66,0.0%,
106,66,0.0%,
88,66,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,2,0.0%,
30,1,0.0%,
35,1,0.0%,
36,1,0.0%,
40,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
157,1,0.0%,
158,3,0.0%,
160,1,0.0%,
163,1,0.0%,
165,1,0.0%,

0,1
Correlation,1

0,1
Distinct count,1667
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,179.78
Minimum,0
Maximum,350.8
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,89.92
Q1,143.7
Median,179.4
Q3,216.4
95-th percentile,270.74
Maximum,350.8
Range,350.8
Interquartile range,72.7

0,1
Standard deviation,54.467
Coef of variation,0.30298
Kurtosis,-0.01994
Mean,179.78
MAD,43.523
Skewness,-0.029077
Sum,599190
Variance,2966.7
Memory size,26.1 KiB

Value,Count,Frequency (%),Unnamed: 3
174.5,8,0.0%,
159.5,8,0.0%,
154.0,8,0.0%,
175.4,7,0.0%,
162.3,7,0.0%,
183.4,7,0.0%,
198.4,6,0.0%,
185.0,6,0.0%,
153.5,6,0.0%,
155.2,6,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,2,0.0%,
2.6,1,0.0%,
7.8,1,0.0%,
7.9,1,0.0%,
12.5,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
335.5,1,0.0%,
337.4,1,0.0%,
345.3,1,0.0%,
346.8,1,0.0%,
350.8,1,0.0%,

0,1
Distinct count,123
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,100.11
Minimum,0
Maximum,170
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,67
Q1,87
Median,100
Q3,114
95-th percentile,133
Maximum,170
Range,170
Interquartile range,27

0,1
Standard deviation,19.923
Coef of variation,0.199
Kurtosis,0.20616
Mean,100.11
MAD,15.86
Skewness,-0.055563
Sum,333681
Variance,396.91
Memory size,26.1 KiB

Value,Count,Frequency (%),Unnamed: 3
105,80,0.0%,
94,79,0.0%,
108,71,0.0%,
97,70,0.0%,
102,70,0.0%,
88,69,0.0%,
101,68,0.0%,
109,67,0.0%,
98,66,0.0%,
111,65,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1,0.0%,
12,1,0.0%,
36,1,0.0%,
37,1,0.0%,
42,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
157,1,0.0%,
159,1,0.0%,
164,1,0.0%,
168,1,0.0%,
170,1,0.0%,

0,1
Correlation,1

0,1
Distinct count,1611
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,200.98
Minimum,0
Maximum,363.7
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,118.8
Q1,166.6
Median,201.4
Q3,235.3
95-th percentile,284.3
Maximum,363.7
Range,363.7
Interquartile range,68.7

0,1
Standard deviation,50.714
Coef of variation,0.25233
Kurtosis,0.02563
Mean,200.98
MAD,40.469
Skewness,-0.023877
Sum,669870
Variance,2571.9
Memory size,26.1 KiB

Value,Count,Frequency (%),Unnamed: 3
169.9,9,0.0%,
230.9,7,0.0%,
209.4,7,0.0%,
201.0,7,0.0%,
220.6,7,0.0%,
180.5,7,0.0%,
161.7,7,0.0%,
167.2,7,0.0%,
195.5,7,0.0%,
194.0,6,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1,0.0%,
31.2,1,0.0%,
42.2,1,0.0%,
42.5,1,0.0%,
43.9,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
350.9,1,0.0%,
351.6,1,0.0%,
354.2,1,0.0%,
361.8,1,0.0%,
363.7,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
no,3010
yes,323

Value,Count,Frequency (%),Unnamed: 3
no,3010,0.0%,
yes,323,0.0%,

0,1
Distinct count,21
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,4.4794
Minimum,0
Maximum,20
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,1
Q1,3
Median,4
Q3,6
95-th percentile,9
Maximum,20
Range,20
Interquartile range,3

0,1
Standard deviation,2.4612
Coef of variation,0.54945
Kurtosis,3.0836
Mean,4.4794
MAD,1.8811
Skewness,1.3215
Sum,14930
Variance,6.0576
Memory size,26.1 KiB

Value,Count,Frequency (%),Unnamed: 3
3,668,0.0%,
4,619,0.0%,
2,489,0.0%,
5,472,0.0%,
6,336,0.0%,
7,218,0.0%,
1,160,0.0%,
8,116,0.0%,
9,109,0.0%,
10,50,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,18,0.0%,
1,160,0.0%,
2,489,0.0%,
3,668,0.0%,
4,619,0.0%,

Value,Count,Frequency (%),Unnamed: 3
16,2,0.0%,
17,1,0.0%,
18,3,0.0%,
19,1,0.0%,
20,1,0.0%,

0,1
Correlation,0.99999

0,1
Distinct count,162
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,10.237
Minimum,0
Maximum,20
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,5.7
Q1,8.5
Median,10.3
Q3,12.1
95-th percentile,14.7
Maximum,20.0
Range,20.0
Interquartile range,3.6

0,1
Standard deviation,2.7918
Coef of variation,0.27271
Kurtosis,0.60918
Mean,10.237
MAD,2.1847
Skewness,-0.24514
Sum,34121
Variance,7.7944
Memory size,26.1 KiB

Value,Count,Frequency (%),Unnamed: 3
10.0,62,0.0%,
11.3,59,0.0%,
9.8,56,0.0%,
10.9,56,0.0%,
10.1,53,0.0%,
10.2,53,0.0%,
10.6,53,0.0%,
11.1,52,0.0%,
11.0,52,0.0%,
9.7,51,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,18,0.0%,
1.1,1,0.0%,
1.3,1,0.0%,
2.0,2,0.0%,
2.1,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
18.2,2,0.0%,
18.3,1,0.0%,
18.4,1,0.0%,
18.9,1,0.0%,
20.0,1,0.0%,

0,1
Distinct count,120
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,100.11
Minimum,33
Maximum,175
Zeros (%),0.0%

0,1
Minimum,33
5-th percentile,68
Q1,87
Median,100
Q3,113
95-th percentile,132
Maximum,175
Range,142
Interquartile range,26

0,1
Standard deviation,19.569
Coef of variation,0.19548
Kurtosis,-0.07202
Mean,100.11
MAD,15.69
Skewness,0.0325
Sum,333659
Variance,382.93
Memory size,26.1 KiB

Value,Count,Frequency (%),Unnamed: 3
105,84,0.0%,
104,78,0.0%,
91,76,0.0%,
102,72,0.0%,
100,69,0.0%,
106,69,0.0%,
98,67,0.0%,
94,66,0.0%,
103,65,0.0%,
108,64,0.0%,

Value,Count,Frequency (%),Unnamed: 3
33,1,0.0%,
36,1,0.0%,
38,1,0.0%,
42,2,0.0%,
44,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
157,2,0.0%,
158,1,0.0%,
164,1,0.0%,
166,1,0.0%,
175,1,0.0%,

0,1
Correlation,1

0,1
Distinct count,1591
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,200.87
Minimum,23.2
Maximum,395
Zeros (%),0.0%

0,1
Minimum,23.2
5-th percentile,118.18
Q1,167.0
Median,201.2
Q3,235.3
95-th percentile,282.84
Maximum,395.0
Range,371.8
Interquartile range,68.3

0,1
Standard deviation,50.574
Coef of variation,0.25177
Kurtosis,0.085816
Mean,200.87
MAD,40.41
Skewness,0.0089213
Sum,669510
Variance,2557.7
Memory size,26.1 KiB

Value,Count,Frequency (%),Unnamed: 3
210.0,8,0.0%,
214.6,8,0.0%,
197.4,8,0.0%,
191.4,8,0.0%,
188.2,8,0.0%,
231.5,7,0.0%,
221.6,7,0.0%,
193.6,7,0.0%,
214.7,7,0.0%,
194.3,7,0.0%,

Value,Count,Frequency (%),Unnamed: 3
23.2,1,0.0%,
43.7,1,0.0%,
45.0,1,0.0%,
47.4,1,0.0%,
50.1,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
364.9,1,0.0%,
367.7,1,0.0%,
377.5,1,0.0%,
381.9,1,0.0%,
395.0,1,0.0%,

First 3 values
385-6952
405-5403
368-6174

Last 3 values
363-5947
346-6941
373-3959

Value,Count,Frequency (%),Unnamed: 3
327-1058,1,0.0%,
327-1319,1,0.0%,
327-3053,1,0.0%,
327-3587,1,0.0%,
327-3850,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
422-7728,1,0.0%,
422-8268,1,0.0%,
422-8333,1,0.0%,
422-8344,1,0.0%,
422-9964,1,0.0%,

0,1
Distinct count,51
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
WV,106
MN,84
NY,83
Other values (48),3060

Value,Count,Frequency (%),Unnamed: 3
WV,106,0.0%,
MN,84,0.0%,
NY,83,0.0%,
AL,80,0.0%,
WI,78,0.0%,
OH,78,0.0%,
OR,78,0.0%,
WY,77,0.0%,
VA,77,0.0%,
CT,74,0.0%,

0,1
Distinct count,46
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,8.099
Minimum,0
Maximum,51
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,20
95-th percentile,36
Maximum,51
Range,51
Interquartile range,20

0,1
Standard deviation,13.688
Coef of variation,1.6901
Kurtosis,-0.051129
Mean,8.099
MAD,11.72
Skewness,1.2648
Sum,26994
Variance,187.37
Memory size,26.1 KiB

Value,Count,Frequency (%),Unnamed: 3
0,2411,0.0%,
31,60,0.0%,
29,53,0.0%,
28,51,0.0%,
33,46,0.0%,
27,44,0.0%,
30,44,0.0%,
24,42,0.0%,
26,41,0.0%,
32,41,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,2411,0.0%,
4,1,0.0%,
8,2,0.0%,
9,2,0.0%,
10,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
47,3,0.0%,
48,2,0.0%,
49,1,0.0%,
50,2,0.0%,
51,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
no,2411
yes,922

Value,Count,Frequency (%),Unnamed: 3
no,2411,0.0%,
yes,922,0.0%,

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [32]:
churnDF["Churn"] = churnDF["Churn"].replace([True, False],[1,0])
churnDF["Int'l Plan"] = churnDF["Int'l Plan"].replace(["no","yes"],[0,1])
churnDF["VMail Plan"] = churnDF["VMail Plan"].replace(["no","yes"],[0,1])
churnDF.drop(["State", "Area Code", "Phone"], axis=1, inplace=True)


In [33]:
%%capture

#h2o.connect(ip="35.225.239.147")
h2o.init(nthreads=1, max_mem_size="768m")

In [34]:
%%capture

# Split data into training and testing frames

from sklearn import cross_validation
from sklearn.model_selection import train_test_split

training, testing = train_test_split(churnDF, train_size=0.8, stratify=churnDF["Churn"], random_state=9)
x_train = training.drop(["Churn"], axis = 1)
y_train = training["Churn"]
x_test = testing.drop(["Churn"], axis = 1)
y_test = testing["Churn"]
train = h2o.H2OFrame(python_obj=training)
test = h2o.H2OFrame(python_obj=testing)

# Set predictor and response variables
y = "Churn"
x = train.columns
x.remove(y)

In [35]:
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values

In [36]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=150, 
                                                           learning_rate=.8,
                                                           max_depth=1, 
                                                           random_state=0).fit(x_train,y_train)


In [37]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression().fit(x_train,y_train)

In [38]:

model = SklearnModel(model=clf,
                     problem_class='binary_classification',
                     description='This is the first churn model',
                     name="GBM 2",
                     y_test = y_test,
                     x_test = x_test)

In [39]:
model.metrics()

[{'key': 'f1', 'value': 0.5375},
 {'key': 'accuracy', 'value': 0.889055472263868},
 {'key': 'precision', 'value': 0.6825396825396826},
 {'key': 'recall', 'value': 0.44329896907216493}]

In [40]:
model.save()

<Response [200]>

# Automatic Machine Learning

The Automatic Machine Learning (AutoML) function automates the supervised machine learning model training process. The current version of AutoML trains and cross-validates a Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and a Stacked Ensemble of all the models.

In [41]:
%%capture
# Run AutoML until 11 models are built
autoModel = H2OAutoML(max_models = 20)
autoModel.train(x = x, y = y,
          training_frame = train,
          validation_frame = test, 
          leaderboard_frame = test)


## Leaderboard

In [42]:
leaders = autoModel.leaderboard
leaders

model_id,mean_residual_deviance,rmse,mae,rmsle
StackedEnsemble_AllModels_0_AutoML_20180625_130332,0.037225,0.192939,0.087473,0.135276
StackedEnsemble_BestOfFamily_0_AutoML_20180625_130332,0.037813,0.194455,0.089393,0.136592
GBM_grid_0_AutoML_20180625_130332_model_2,0.038833,0.197062,0.096418,0.137811
GBM_grid_0_AutoML_20180625_130332_model_1,0.039766,0.199413,0.097265,0.138815
GBM_grid_0_AutoML_20180625_130332_model_3,0.040445,0.201109,0.103606,0.141431
GBM_grid_0_AutoML_20180625_130332_model_0,0.040799,0.201987,0.094744,0.141218
GBM_grid_0_AutoML_20180625_130332_model_7,0.041416,0.20351,0.109479,0.140649
GBM_grid_0_AutoML_20180625_130332_model_11,0.044323,0.21053,0.112005,0.145377
DRF_0_AutoML_20180625_130332,0.044819,0.211706,0.099376,0.149893
XRT_0_AutoML_20180625_130332,0.046319,0.215218,0.107896,0.15181




# Variable Importances
Below we plot variable importances as reported by the best performing algo in the ensemble.

In [43]:
importances = h2o.get_model(leaders[2, 0]).varimp(use_pandas=True)
importances = importances.loc[:,['variable','relative_importance']].groupby('variable').mean()
importances.sort_values(by="relative_importance", ascending=False).iplot(kind='bar', colors='#5AC4F2', theme='white')

In [44]:
import matplotlib.pyplot as plt
plt.figure()
bestModel = h2o.get_model(leaders[2, 0])
plt = bestModel.partial_plot(data=test, cols=["Day Mins","CustServ Calls","Day Charge"])


PartialDependencePlot progress: |█████████████████████████████████████████| 100%


# Best Model vs the Base Learners
This plot shows the ROC curves for the best models

In [56]:
mygbm = h2o.get_model(leaders[2,0])
mygbm.roc(xval=True)
#Model2 = np.array(h2o.get_model(leaders[2,0]).roc(xval=True))

AttributeError: type object 'H2OGradientBoostingEstimator' has no attribute 'roc'

In [46]:
#Model0 = np.array(h2o.get_model(leaders[0,0]).roc(xval=True))
#Model1 = np.array(h2o.get_model(leaders[1,0]).roc(xval=True))
Model2 = np.array(h2o.get_model(leaders[2,0]).roc(xval=True))
Model3 = np.array(h2o.get_model(leaders[3,0]).roc(xval=True))
Model4 = np.array(h2o.get_model(leaders[4,0]).roc(xval=True))
Model5 = np.array(h2o.get_model(leaders[5,0]).roc(xval=True))
Model6 = np.array(h2o.get_model(leaders[6,0]).roc(xval=True))
Model7 = np.array(h2o.get_model(leaders[7,0]).roc(xval=True))
Model8 = np.array(h2o.get_model(leaders[8,0]).roc(xval=True))
Model9 = np.array(h2o.get_model(leaders[9,0]).roc(xval=True))

layout = go.Layout(autosize=False, width=725, height=575,  xaxis=dict(title='False Positive Rate', titlefont=dict(family='Arial, sans-serif', size=15, color='grey')), 
                                                           yaxis=dict(title='True Positive Rate', titlefont=dict(family='Arial, sans-serif', size=15, color='grey')))

traceChanceLine = go.Scatter(x = [0,1], y = [0,1], mode = 'lines+markers', name = 'chance', line = dict(color = ('rgb(136, 140, 150)'), width = 4, dash = 'dash'))
#Model0Trace = go.Scatter(x = Model0[0], y = Model0[1], mode = 'lines', name = 'Model 0', line = dict(color = ('rgb(26, 58, 126)'), width = 3))
#Model1Trace = go.Scatter(x = Model1[0], y = Model1[1], mode = 'lines', name = 'Model 1', line = dict(color = ('rgb(156, 190, 241))'), width = 1))
Model2Trace = go.Scatter(x = Model2[0], y = Model2[1], mode = 'lines', name = 'Model 2', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model3Trace = go.Scatter(x = Model3[0], y = Model3[1], mode = 'lines', name = 'Model 3', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model4Trace = go.Scatter(x = Model4[0], y = Model4[1], mode = 'lines', name = 'Model 4', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model5Trace = go.Scatter(x = Model5[0], y = Model5[1], mode = 'lines', name = 'Model 5', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model6Trace = go.Scatter(x = Model6[0], y = Model6[1], mode = 'lines', name = 'Model 6', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model7Trace = go.Scatter(x = Model7[0], y = Model7[1], mode = 'lines', name = 'Model 7', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model8Trace = go.Scatter(x = Model8[0], y = Model8[1], mode = 'lines', name = 'Model 8', line = dict(color = ('rgb(156, 190, 241)'), width = 1))
Model9Trace = go.Scatter(x = Model9[0], y = Model9[1], mode = 'lines', name = 'Model 9', line = dict(color = ('rgb(156, 190, 241)'), width = 1))

fig = go.Figure(data=[Model0Trace,Model1Trace,Model2Trace,Model3Trace,Model4Trace,Model5Trace,Model6Trace,Model8Trace,Model9Trace,traceChanceLine], layout=layout)

py.iplot(fig)

AttributeError: type object 'H2OGradientBoostingEstimator' has no attribute 'roc'

# Confusion Matrix

In [None]:
cm = h2o.get_model(leaders[1, 0]).confusion_matrix(xval=True)
cm = cm.table.as_data_frame()
cm
confusionMatrix = ff.create_table(cm)
confusionMatrix.layout.height=300
confusionMatrix.layout.width=800
confusionMatrix.layout.font.size=17
py.iplot(confusionMatrix)

# Business Impact Matrix

Weighting Predictions With a Dollar Value
-   Correctly predicting retain: `+$5`
-   Correctly predicting churn: `+$75`
-   Incorrectly predicting retain: `-$150`
-   Incorrectly predicting churn: `-$1.5`

    

In [None]:
CorrectPredictChurn = cm.loc[0,'Churn']
CorrectPredictChurnImpact = 75
cm1 = CorrectPredictChurn*CorrectPredictChurnImpact

IncorrectPredictChurn = cm.loc[1,'Churn']
IncorrectPredictChurnImpact = -5
cm2 = IncorrectPredictChurn*IncorrectPredictChurnImpact

IncorrectPredictRetain = cm.loc[0,'Retain']
IncorrectPredictRetainImpact = -150
cm3 = IncorrectPredictRetain*IncorrectPredictRetainImpact

CorrectPredictRetain = cm.loc[0,'Retain']
CorrectPredictRetainImpact = 5
cm4 = IncorrectPredictRetain*CorrectPredictRetainImpact


data_matrix = [['Business Impact', '($) Predicted Churn', '($) Predicted Retain', '($) Total'],
               ['($) Actual Churn', cm1, cm3, '' ],
               ['($) Actual Retain', cm2, cm4, ''],
               ['($) Total', cm1+cm2, cm3+cm4, cm1+cm2+cm3+cm4]]

impactMatrix = ff.create_table(data_matrix, height_constant=20, hoverinfo='weight')
impactMatrix.layout.height=300
impactMatrix.layout.width=800
impactMatrix.layout.font.size=17
py.iplot(impactMatrix)

In [None]:
print("Total customers evaluated: 2132")

In [None]:
print("Total value created by the model: $" + str(cm1+cm2+cm3+cm4))

In [None]:
print("Total value per customer: $" +str(round(((cm1+cm2+cm3+cm4)/2132),3)))

In [None]:
%%capture
# Save the best model

path = h2o.save_model(model=h2o.get_model(leaders[0, 0]), force=True)
os.rename(h2o.get_model(leaders[0, 0]).model_id, "AutoML-leader")    

In [None]:
%%capture
LoadedEnsemble = h2o.load_model(path="AutoML-leader")
print(LoadedEnsemble)

In [None]:
print "hi"