# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

In [100]:
!python -V

Python 3.11.11


In [83]:
#!pip install --upgrade pycaret --user
import pycaret

In [84]:
'''
Getting our dataset.
We clean it up a little so that it matches the new_churn_data that we
are going to predict later.
'''
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('churn_data.csv')
# Average charge per month of tenure (a tenure of 0 would give inf or NaN).
df['charge_per_tenure'] = df['TotalCharges'] / df['tenure']

# Encode the categorical columns as integers.
# Assigning the result back avoids relying on chained inplace modification.
df['PhoneService'] = df['PhoneService'].replace({'Yes': 1, 'No': 0})
df['Contract'] = LabelEncoder().fit_transform(df['Contract'])
df['PaymentMethod'] = LabelEncoder().fit_transform(df['PaymentMethod'])
df

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
0,7590-VHVEG,1,0,0,2,29.85,29.85,No,29.850000
1,5575-GNVDE,34,1,1,3,56.95,1889.50,No,55.573529
2,3668-QPYBK,2,1,0,3,53.85,108.15,Yes,54.075000
3,7795-CFOCW,45,0,1,0,42.30,1840.75,No,40.905556
4,9237-HQITU,2,1,0,2,70.70,151.65,Yes,75.825000
...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,24,1,1,3,84.80,1990.50,No,82.937500
7039,2234-XADUH,72,1,1,1,103.20,7362.90,No,102.262500
7040,4801-JZAZL,11,0,0,2,29.60,346.45,No,31.495455
7041,8361-LTMKD,4,1,0,3,74.40,306.60,Yes,76.650000


In [85]:
'''
Setting up our auto ML model
'''
#!conda install -c conda-forge pycaret -y
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,1789
1,Target,Churn
2,Target type,Binary
3,Target mapping,"No: 0, Yes: 1"
4,Original data shape,"(7043, 9)"
5,Transformed data shape,"(7043, 9)"
6,Transformed train set shape,"(4930, 9)"
7,Transformed test set shape,"(2113, 9)"
8,Numeric features,7
9,Categorical features,1


In [86]:
'''
After seeing all of the different models, we can choose from either Logistic
Regression, Naive Bayes, or K Neighbors Classifier.

Since Logistic Regression has the highest AUC, let's go with that for now.
'''

best_model = compare_models(sort='AUC')

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7341,0.8338,0.7341,0.5396,0.622,-0.0012,-0.0065,0.024
ridge,Ridge Classifier,0.7347,0.8213,0.7347,0.5398,0.6223,0.0,0.0,0.008
nb,Naive Bayes,0.7604,0.8144,0.7604,0.779,0.7671,0.4246,0.4289,0.008
knn,K Neighbors Classifier,0.7661,0.7425,0.7661,0.752,0.7553,0.3519,0.3577,0.011
svm,SVM - Linear Kernel,0.7235,0.7259,0.7235,0.7565,0.7125,0.3203,0.3443,0.009
rf,Random Forest Classifier,0.7347,0.7051,0.7347,0.5398,0.6223,0.0,0.0,0.033
et,Extra Trees Classifier,0.7347,0.5479,0.7347,0.5398,0.6223,0.0,0.0,0.021
lightgbm,Light Gradient Boosting Machine,0.7347,0.525,0.7347,0.5398,0.6223,0.0,0.0,0.072
dt,Decision Tree Classifier,0.7347,0.5,0.7347,0.5398,0.6223,0.0,0.0,0.008
qda,Quadratic Discriminant Analysis,0.7347,0.5,0.7347,0.5398,0.6223,0.0,0.0,0.008
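
One thing worth checking in the table above: several models share an accuracy of 0.7347 with a Kappa of 0.0. That pattern is what a model that always predicts the majority class would produce, so accuracy alone does not show that a model has learned anything. The sketch below computes Cohen's kappa from confusion-matrix counts. The function name `cohen_kappa` is hypothetical, and the counts are the class totals (5174 "No", 1869 "Yes") reported in the H2O confusion matrix later in this notebook, used here only for illustration.

```python
def cohen_kappa(tn: int, fp: int, fn: int, tp: int) -> float:
    """Cohen's kappa for a binary confusion matrix."""
    n = tn + fp + fn + tp
    observed = (tn + tp) / n  # plain accuracy
    # Agreement expected by chance, from the predicted and actual class totals.
    expected = ((tn + fn) / n) * ((tn + fp) / n) + ((fp + tp) / n) * ((fn + tp) / n)
    return (observed - expected) / (1 - expected)


# A model that answers "No" for every customer: decent accuracy, kappa of 0.
always_no_accuracy = 5174 / 7043
always_no_kappa = cohen_kappa(tn=5174, fp=0, fn=1869, tp=0)
print(round(always_no_accuracy, 4), always_no_kappa)
```

This is one reason to sort by a metric such as AUC rather than accuracy when the classes are imbalanced.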


In [87]:
'''
Let's save our logistic regression model 
'''

save_model(best_model, 'Logistic Regression')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'charge_per_tenure'],
                                     transformer=SimpleImputer(add_indica...
                                                               handle_unknown='value',
                                                               hierarchy=None,
                                                               min_samples_leaf=20,
                                                          

In [88]:
'''
Using our python script to use our saved model on new data.
'''

from IPython.display import Code

%run predict_churn.py

What is the name of the file? new_churn_data.csv


Transformation Pipeline and Model Successfully Loaded


predictions: 
0     No
1     No
2     No
3     No
4    Yes
Name: prediction_label, dtype: object


In [89]:
'''
Below is the prediction score for each row: the model's probability for
the label it predicted (not necessarily the probability of churn).
'''

df_new_data = pd.read_csv('new_churn_data.csv')
prediction_scores = predict_model(load_model('Logistic Regression'), 
                                  df_new_data)['prediction_score']
prediction_scores

Transformation Pipeline and Model Successfully Loaded


0    0.7248
1    0.5527
2    0.9155
3    0.8692
4    0.6411
Name: prediction_score, dtype: float64
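
The assignment asks for the probability of churn, while the scores above belong to whichever label was predicted. A minimal sketch of the conversion follows, assuming `prediction_score` is the probability of the predicted label and that "Yes" marks churn. The helper name `churn_probability` is hypothetical; the labels and scores are the ones printed above.

```python
def churn_probability(label: str, score: float, positive: str = "Yes") -> float:
    """Turn a (predicted label, score) pair into P(churn)."""
    # Assumption: score is the probability of the predicted label.
    return score if label == positive else 1.0 - score


labels = ["No", "No", "No", "No", "Yes"]
scores = [0.7248, 0.5527, 0.9155, 0.8692, 0.6411]
churn_probs = [round(churn_probability(l, s), 4) for l, s in zip(labels, scores)]
print(churn_probs)
```

pycaret's `predict_model` also has a `raw_score` option that returns a score per class, which may make this conversion unnecessary.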

In [90]:
'''
In this code, we find the percentile that each prediction score is in,
to see how confident the prediction is related to all the other predictions.
'''
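predictions_sorted = prediction_scores.sort_values()
n_scores = len(predictions_sorted)
# Rank each score among the new predictions only (lowest = 0th, highest = 100th).
for rank, score in enumerate(predictions_sorted):
    percentile = rank / (n_scores - 1) * 100
    print('The probability of ', score, ' is in the %.0fth percentile.' % percentile)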


The probability of  0.5527  is in the 0th percentile.
The probability of  0.6411  is in the 25th percentile.
The probability of  0.7248  is in the 50th percentile.
The probability of  0.8692  is in the 75th percentile.
The probability of  0.9155  is in the 100th percentile.
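
The loop above ranks the five new scores against each other. The optional challenge asks instead where each probability falls in the distribution of training-set predictions. A sketch of that calculation follows, using only the standard library. `percentile_of` is a hypothetical helper, and `train_probs` is made-up placeholder data standing in for the churn probabilities predicted on the training set.

```python
from bisect import bisect_right


def percentile_of(value: float, sorted_reference: list[float]) -> float:
    """Percent of reference values that are less than or equal to value."""
    return 100.0 * bisect_right(sorted_reference, value) / len(sorted_reference)


# Placeholder values, NOT real model output.
train_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.31, 0.45, 0.52, 0.64, 0.78, 0.91])

for p in [0.08, 0.45, 0.95]:
    print(f"P(churn) = {p:.2f} -> {percentile_of(p, train_probs):.0f}th percentile")
```

In practice the reference list would come from running the saved model on the training data once and storing the sorted probabilities alongside the model.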


In [91]:
'''
Now let's do the same process but for the unmodified data.
We will have to do some extra cleaning in the python script file, 
such as encoding and creating a new variable charge_per_tenure.
'''

%run predict_churn_unmodified.py

What is the name of your file? new_churn_data_unmodified.csv


Transformation Pipeline and Model Successfully Loaded


predictions: 
0    No
1    No
2    No
3    No
4    No
Name: prediction_label, dtype: object


In [92]:
'''
Below is the model's score for each predicted label.
The next cell then shows the percentile of each of those scores.
'''

df_new_data = pd.read_csv('new_churn_data_unmodified.csv')
df_new_data['charge_per_tenure'] = df_new_data['TotalCharges'] / df_new_data['tenure']
df_new_data['PhoneService'] = df_new_data['PhoneService'].replace({'Yes': 1, 'No': 0})
# Note: refitting the encoder on new data only reproduces the training codes
# if every category seen in training also appears in this file.
df_new_data['Contract'] = LabelEncoder().fit_transform(df_new_data['Contract'])
df_new_data['PaymentMethod'] = LabelEncoder().fit_transform(df_new_data['PaymentMethod'])
prediction_scores = predict_model(load_model('Logistic Regression'),
                                  df_new_data)['prediction_score']
prediction_scores

Transformation Pipeline and Model Successfully Loaded


0    0.7312
1    0.5448
2    0.9155
3    0.8728
4    0.6672
Name: prediction_score, dtype: float64
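
One caveat with the cell above: calling `fit_transform` on the new file learns the codes from that file alone. `LabelEncoder` numbers the sorted unique values it sees, so a five-row file that lacks one category can end up with different codes than the training data. The sketch below mimics that behaviour in plain Python and contrasts it with a fixed mapping. The helper `fit_codes` is hypothetical, and the contract names are assumed category values, since the raw categories are not shown in this notebook.

```python
def fit_codes(values):
    """Mimic LabelEncoder: number the sorted unique values 0, 1, 2, ..."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


# Assumed category names for the Contract column.
training_contracts = ["Month-to-month", "One year", "Two year", "One year"]
new_contracts = ["One year", "Two year", "Two year"]  # no month-to-month rows

train_codes = fit_codes(training_contracts)
refit_codes = fit_codes(new_contracts)

# The same category gets a different code when the encoder is refit.
print(train_codes["One year"], refit_codes["One year"])

# Safer: reuse the mapping learned from the training data.
encoded = [train_codes[c] for c in new_contracts]
print(encoded)
```

Saving the fitted encoders (or a plain dictionary of codes) with the model would keep the new data consistent with what the model was trained on.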

In [93]:
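predictions_sorted = prediction_scores.sort_values()
n_scores = len(predictions_sorted)
# Rank each score among the new predictions only (lowest = 0th, highest = 100th).
for rank, score in enumerate(predictions_sorted):
    percentile = rank / (n_scores - 1) * 100
    print('The probability of ', score, ' is in the %.0fth percentile.' % percentile)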

The probability of  0.5448  is in the 0th percentile.
The probability of  0.6672  is in the 25th percentile.
The probability of  0.7312  is in the 50th percentile.
The probability of  0.8728  is in the 75th percentile.
The probability of  0.9155  is in the 100th percentile.


In [94]:
'''
Now let's look at another autoML, H2O and compare its performance to
pycaret.

First we import h2o and read in our dataframe from the top.
'''

#!pip install h2o
import h2o
h2o.init()

df_h2o = h2o.H2OFrame(df)
df_h2o

Checking whether there is an H2O instance running at http://localhost:54321. connected.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html


0,1
H2O_cluster_uptime:,1 hour 3 mins
H2O_cluster_timezone:,America/Chicago
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.7
H2O_cluster_version_age:,6 months and 1 day
H2O_cluster_name:,H2O_from_python_anaughton_urv5et
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.297 Gb
H2O_cluster_total_cores:,10
H2O_cluster_allowed_cores:,10


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
7590-VHVEG,1,0,0,2,29.85,29.85,No,29.85
5575-GNVDE,34,1,1,3,56.95,1889.5,No,55.5735
3668-QPYBK,2,1,0,3,53.85,108.15,Yes,54.075
7795-CFOCW,45,0,1,0,42.3,1840.75,No,40.9056
9237-HQITU,2,1,0,2,70.7,151.65,Yes,75.825
9305-CDSKC,8,1,0,2,99.65,820.5,Yes,102.562
1452-KIOVK,22,1,0,1,89.1,1949.4,No,88.6091
6713-OKOMC,10,0,0,3,29.75,301.9,No,30.19
7892-POOKP,28,1,0,2,104.8,3046.05,Yes,108.787
6388-TABGU,62,1,1,0,56.15,3487.95,No,56.2573


In [95]:
'''
Now we define our feature and target variables.
'''

from h2o.automl import H2OAutoML

# list.remove() returns None, so build the feature list explicitly.
y = 'Churn'
x = [col for col in df_h2o.columns if col != y]
aml = H2OAutoML(seed=0, max_runtime_secs=300)
aml.train(x=x, y=y, training_frame=df_h2o)

AutoML progress: |
13:04:31.738: _train param, Dropping bad and constant columns: [customerID]


13:04:32.773: _train param, Dropping bad and constant columns: [customerID]


13:04:33.392: _train param, Dropping bad and constant columns: [customerID]

██
13:04:34.198: _train param, Dropping unused columns: [customerID]
13:04:34.467: _train param, Dropping bad and constant columns: [customerID]

█
13:04:35.346: _train param, Dropping bad and constant columns: [customerID]

█
13:04:36.509: _train param, Dropping bad and constant columns: [customerID]

█
13:04:37.137: _train param, Dropping bad and constant columns: [customerID]
13:04:37.776: _train param, Dropping bad and constant columns: [customerID]

█
13:04:38.471: _train param, Dropping unused columns: [customerID]
13:04:38.895: _train param, Dropping unused columns: [customerID]

██
13:04:39.179: _train param, Dropping bad and constant columns: [customerID]
13:04:39.835: _train param, Dropping bad and constant columns: [customerID]

key,value
Stacking strategy,cross_validation
Number of base models (used / total),5/6
# GBM base models (used / total),1/1
# XGBoost base models (used / total),1/1
# DeepLearning base models (used / total),1/1
# GLM base models (used / total),1/1
# DRF base models (used / total),1/2
Metalearner algorithm,GLM
Metalearner fold assignment scheme,Random
Metalearner nfolds,5

Unnamed: 0,No,Yes,Error,Rate
No,4088.0,1086.0,0.2099,(1086.0/5174.0)
Yes,378.0,1491.0,0.2022,(378.0/1869.0)
Total,4466.0,2577.0,0.2079,(1464.0/7043.0)

metric,threshold,value,idx
max f1,0.3061736,0.6707152,218.0
max f2,0.1465223,0.7716266,298.0
max f0point5,0.5192717,0.6718958,126.0
max accuracy,0.5192717,0.8189692,126.0
max precision,0.8932107,1.0,0.0
max recall,0.0077226,1.0,389.0
max specificity,0.8932107,1.0,0.0
max absolute_mcc,0.3042648,0.5388794,219.0
max min_per_class_accuracy,0.3085734,0.7920371,217.0
max mean_per_class_accuracy,0.3042648,0.7942998,219.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0100809,0.8305434,3.6621753,3.6621753,0.971831,0.8528083,0.971831,0.8528083,0.0369181,0.0369181,266.2175299,266.2175299,0.0365316
2,0.0200199,0.8033508,3.3914928,3.5277939,0.9,0.8183169,0.9361702,0.8356849,0.0337079,0.070626,239.1492777,252.7793905,0.0688865
3,0.0301008,0.7856387,3.3437253,3.4661483,0.8873239,0.7947578,0.9198113,0.8219782,0.0337079,0.1043339,234.3725273,246.6148278,0.1010482
4,0.0400398,0.7714453,3.2838263,3.4208911,0.8714286,0.7791726,0.9078014,0.8113527,0.0326378,0.1369716,228.382634,242.0891059,0.1319465
5,0.0501207,0.7533219,3.2906503,3.3946953,0.8732394,0.7608449,0.9008499,0.8011939,0.0331728,0.1701445,229.0650269,239.469532,0.1633799
6,0.1000994,0.642599,2.6870729,3.041386,0.7130682,0.6919629,0.8070922,0.7466559,0.1342964,0.3044409,168.7072876,204.1385958,0.2781556
7,0.1500781,0.5510831,2.5157854,2.8663515,0.6676136,0.5931232,0.7606433,0.6955267,0.1257357,0.4301766,151.5785362,186.6351511,0.3812782
8,0.2000568,0.4871604,2.0447447,2.6610956,0.5426136,0.5167527,0.7061746,0.6508649,0.1021937,0.5323703,104.4744698,166.1095586,0.4523548
9,0.3000142,0.3665934,1.7128751,2.3451717,0.4545455,0.4267945,0.6223379,0.5762101,0.1712146,0.7035848,71.287514,134.5171689,0.5493521
10,0.3999716,0.2758105,1.2471872,2.070773,0.3309659,0.3191187,0.5495208,0.5119601,0.1246656,0.8282504,24.7187211,107.0773013,0.5829856

Unnamed: 0,No,Yes,Error,Rate
No,4027.0,1147.0,0.2217,(1147.0/5174.0)
Yes,480.0,1389.0,0.2568,(480.0/1869.0)
Total,4507.0,2536.0,0.231,(1627.0/7043.0)

metric,threshold,value,idx
max f1,0.3231002,0.630647,216.0
max f2,0.1676408,0.7478006,293.0
max f0point5,0.5186045,0.6163107,126.0
max accuracy,0.5186045,0.7975295,126.0
max precision,0.8983032,1.0,0.0
max recall,0.0040238,1.0,398.0
max specificity,0.8983032,1.0,0.0
max absolute_mcc,0.3652421,0.4838722,195.0
max min_per_class_accuracy,0.3063291,0.7592295,224.0
max mean_per_class_accuracy,0.2340432,0.7626845,258.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0100809,0.8358257,3.1845003,3.1845003,0.8450704,0.8589443,0.8450704,0.8589443,0.0321027,0.0321027,218.450026,218.450026,0.0299767
2,0.0200199,0.8022544,3.445326,3.3139882,0.9142857,0.8175342,0.8794326,0.8383861,0.0342429,0.0663456,244.5325996,231.3988214,0.06306
3,0.0301008,0.7852874,2.8129752,3.1461961,0.7464789,0.7936,0.8349057,0.823387,0.0283574,0.094703,181.297523,214.619613,0.0879385
4,0.0400398,0.7677081,3.0146602,3.1135454,0.8,0.7753766,0.8262411,0.8114695,0.0299625,0.1246656,201.4660246,211.3545378,0.1151952
5,0.0501207,0.7465986,3.0252752,3.0957913,0.8028169,0.757334,0.8215297,0.8005811,0.0304976,0.1551632,202.5275247,209.5791329,0.1429869
6,0.1000994,0.6474991,2.6228401,2.8596511,0.6960227,0.6968029,0.7588652,0.7487656,0.1310861,0.2862493,162.2840058,185.965112,0.2533927
7,0.1500781,0.5581814,2.1517994,2.6239238,0.5710227,0.5997137,0.6963103,0.6991287,0.1075441,0.3937935,115.1799394,162.3923771,0.3317525
8,0.2000568,0.4926876,2.0019228,2.4685339,0.53125,0.5250853,0.6550745,0.6556487,0.1000535,0.493847,100.192282,146.8533896,0.3999158
9,0.3000142,0.3816267,1.6968169,2.2114167,0.4502841,0.4359964,0.5868434,0.5824659,0.1696094,0.6634564,69.6816935,121.141665,0.4947281
10,0.3999716,0.2857551,1.2097181,1.9610809,0.3210227,0.3315622,0.5204118,0.5197623,0.1209203,0.7843767,20.9718068,96.1080902,0.5232634

Unnamed: 0,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
accuracy,0.7727238,0.0169816,0.7580314,0.7605245,0.7978724,0.7823944,0.7647963
aic,1191.3438,40.72121,1221.3088,1233.0953,1129.3259,1193.5128,1179.476
auc,0.8419722,0.0131941,0.8458654,0.8457612,0.8498224,0.8497715,0.8186405
err,0.2272762,0.0169816,0.2419686,0.2394755,0.2021277,0.2176056,0.2352037
err_count,320.2,29.269438,354.0,347.0,285.0,309.0,306.0
f0point5,0.5838932,0.0144589,0.5667276,0.5831776,0.5928466,0.6028668,0.5738476
f1,0.6344895,0.0133237,0.6365503,0.6426365,0.6293888,0.6492622,0.6146096
f2,0.6954645,0.0281001,0.7259953,0.7155963,0.6707317,0.703394,0.6616052
lift_top_group,3.3212073,0.4797636,2.7722652,3.3558314,4.0869565,3.2048612,3.1861224
loglikelihood,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [96]:
aml.leaderboard

model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
StackedEnsemble_BestOfFamily_4_AutoML_3_20250929_130431,0.842374,0.418481,0.649731,0.239254,0.369569,0.136581
StackedEnsemble_AllModels_3_AutoML_3_20250929_130431,0.841606,0.419138,0.64958,0.23939,0.370049,0.136936
XGBoost_lr_search_selection_AutoML_3_20250929_130431_select_grid_model_2,0.84122,0.420167,0.646335,0.243719,0.370407,0.137201
StackedEnsemble_AllModels_4_AutoML_3_20250929_130431,0.841,0.419832,0.647053,0.243394,0.370344,0.137155
GBM_lr_annealing_selection_AutoML_3_20250929_130431_select_model,0.840825,0.42084,0.648158,0.241007,0.370507,0.137276
StackedEnsemble_BestOfFamily_3_AutoML_3_20250929_130431,0.840499,0.420681,0.647177,0.237031,0.370601,0.137345
StackedEnsemble_BestOfFamily_1_AutoML_3_20250929_130431,0.840455,0.42063,0.647651,0.246895,0.370486,0.13726
StackedEnsemble_AllModels_1_AutoML_3_20250929_130431,0.840157,0.420877,0.648447,0.247317,0.37068,0.137404
StackedEnsemble_BestOfFamily_2_AutoML_3_20250929_130431,0.840116,0.421068,0.645868,0.247006,0.370646,0.137378
StackedEnsemble_AllModels_2_AutoML_3_20250929_130431,0.839954,0.421084,0.648712,0.237292,0.370774,0.137474


In [98]:
'''
It seems that StackedEnsemble_BestOfFamily_4 is the best performing
model, with an AUC score of 0.8424, which is higher than the 0.8338 of
our logistic regression model from earlier.

Let's save this model and use it on the new churn data
'''

model = h2o.get_model('StackedEnsemble_BestOfFamily_4_AutoML_3_20250929_130431')
model_saved = h2o.save_model(model = model, path = 'h2o model')

In [99]:
'''
Using our model on the new churn data (new_churn_data.csv).
We see that it is more accurate than our previous logistic
regression model, as this h2o model gets 4/5 predictions correct!
'''

# Use a distinct name so pycaret's load_model function is not overwritten.
loaded_model = h2o.load_model(model_saved)
df_h2o = h2o.H2OFrame(pd.read_csv('new_churn_data.csv'))
predictions = loaded_model.predict(df_h2o)
predictions

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%


predict,No,Yes
Yes,0.54301,0.45699
No,0.997777,0.00222321
No,0.905262,0.0947377
No,0.78016,0.21984
No,0.996919,0.00308079
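
To make the comparison concrete, the sketch below scores both sets of printed labels against the true values given in the assignment, [1, 0, 0, 1, 0]. The helper `count_correct` is hypothetical; the label lists are copied from the outputs above (the pycaret run on new_churn_data.csv and the H2O predictions).

```python
def count_correct(predicted, actual):
    """Number of positions where the predicted label matches the true value."""
    as_int = [1 if label == "Yes" else 0 for label in predicted]
    return sum(p == a for p, a in zip(as_int, actual))


true_values = [1, 0, 0, 1, 0]
pycaret_labels = ["No", "No", "No", "No", "Yes"]
h2o_labels = ["Yes", "No", "No", "No", "No"]

print("pycaret:", count_correct(pycaret_labels, true_values), "of", len(true_values))
print("h2o:    ", count_correct(h2o_labels, true_values), "of", len(true_values))
```

With only five rows, a difference of one or two correct predictions is weak evidence; the cross-validated AUC values are a more reliable basis for comparing the two models.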


# Summary

We took our prepared churn data, did a little clean up, and used pycaret to determine the best model for predicting whether a customer will churn or not.  We settled on the logistic regression model since it had the highest AUC score of 0.8338 (the score moved between 0.82 and 0.84 across runs), but we could also try some other models that have high scores in other categories.  We then saved this model to a pickle file and used it to predict some new churn data.  In the run shown above, 2 of the 5 predictions on new_churn_data.csv match the true values, and 3 of 5 on the unmodified file.  There is some randomness in the results each time it is run, but 3 out of 5 correct (about 60%) seems to be the general result.

We also looked at the prediction scores (the model's probability for the label it predicted) and found that they range from about 0.55 to 0.92, so the model is not very confident on some rows.  In addition to the new churn data, we also predicted the unmodified churn data, which took some more work to do.  We had to clean the data slightly in the python script in order to use the model on it.  Some other small things that were done to make the script cleaner were adding an input prompt for the file name and organizing all of the functions in the python script into a class.

We then used an h2o auto ML model to fit our data.  We find that a stacked ensemble model has the highest AUC score (no surprise there), with a score of 0.8424 (pretty consistently too).  We then saved our model and used it to predict the new churn data.  It got 4/5 correct and has higher probability scores as well, which suggests that it is performing better than our logistic regression model, although five rows is a very small test.

Moving forward, it would be prudent to continue feeding in new data to our model that we have.  It could also be useful to save, say, three models total and have the new data be run through each one.  After a little time, we would possibly see which one is producing the best results.