# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

# Imports

In [12]:
import pandas as pd
from pycaret.classification import setup, compare_models, save_model
import warnings
import h2o
from h2o.automl import H2OAutoML

warnings.filterwarnings('ignore')

## Load the data

In [13]:
df = pd.read_csv("data/prepped_churn_data.csv", index_col='customerID')
df

customerID,tenure,MonthlyCharges,TotalCharges,PhoneService_Yes,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Churn_Yes,total_charges_tenure_ratio,monthly_charges_times_tenure
7590-VHVEG,1,29.85,29.85,0,0,0,0,1,0,0,29.850000,29.85
5575-GNVDE,34,56.95,1889.50,1,1,0,0,0,1,0,55.573529,1936.30
3668-QPYBK,2,53.85,108.15,1,0,0,0,0,1,1,54.075000,107.70
7795-CFOCW,45,42.30,1840.75,0,1,0,0,0,0,0,40.905556,1903.50
9237-HQITU,2,70.70,151.65,1,0,0,0,1,0,1,75.825000,141.40
...,...,...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,24,84.80,1990.50,1,1,0,0,0,1,0,82.937500,2035.20
2234-XADUH,72,103.20,7362.90,1,1,0,1,0,0,0,102.262500,7430.40
4801-JZAZL,11,29.60,346.45,0,0,0,0,1,0,0,31.495455,325.60
8361-LTMKD,4,74.40,306.60,1,0,0,0,0,1,1,76.650000,297.60


The pycaret preprocessing was raising an error, so I set `preprocess=False` and declared every feature as numeric, since the categorical columns have already been one-hot encoded.

In [14]:
features = df.drop('Churn_Yes', axis=1).columns.to_list()
automl = setup(df, target='Churn_Yes', preprocess=False, numeric_features=features, fold_shuffle=True)

,Description,Value
0,session_id,8167
1,Target,Churn_Yes
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 12)"
5,Missing Values,False
6,Numeric Features,11
7,Categorical Features,0
8,Transformed Train Set,"(4930, 11)"
9,Transformed Test Set,"(2113, 11)"


I'll sort the model comparison by recall because we'd like to catch as many customers as possible before they churn.

In [15]:
best_model = compare_models(sort='Recall')

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
qda,Quadratic Discriminant Analysis,0.7144,0.8254,0.8065,0.4801,0.6015,0.4007,0.434,0.005
nb,Naive Bayes,0.7318,0.8146,0.7595,0.5003,0.6025,0.4133,0.4344,0.006
svm,SVM - Linear Kernel,0.6943,0.0,0.5344,0.5465,0.4795,0.2948,0.3334,0.009
lda,Linear Discriminant Analysis,0.7874,0.8276,0.5235,0.6217,0.5675,0.4282,0.4315,0.007
lr,Logistic Regression,0.7892,0.8343,0.5128,0.6289,0.5643,0.4274,0.4316,0.019
ada,Ada Boost Classifier,0.7955,0.8366,0.5128,0.6502,0.5716,0.4401,0.4465,0.038
lightgbm,Light Gradient Boosting Machine,0.7852,0.8276,0.5091,0.6216,0.5583,0.4186,0.423,0.078
gbc,Gradient Boosting Classifier,0.7917,0.8396,0.4962,0.6445,0.5598,0.4265,0.4333,0.102
dt,Decision Tree Classifier,0.7256,0.6586,0.4939,0.488,0.4905,0.3029,0.3031,0.007
xgboost,Extreme Gradient Boosting,0.7757,0.8143,0.4924,0.5988,0.5394,0.3932,0.3971,0.095
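
The trade-off in the table (high recall tends to come with lower precision) follows from how the two metrics are defined: recall is TP / (TP + FN), precision is TP / (TP + FP). Here is a minimal sketch of both calculations. The function name and the toy labels are made up for illustration; they are not taken from the churn data or from pycaret.

```python
def recall_and_precision(y_true, y_pred, positive=1):
    """Return (recall, precision) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against division by zero when a class never appears
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Toy labels (hypothetical, for illustration only)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

recall, precision = recall_and_precision(y_true, y_pred)
print(f"recall={recall:.2f}, precision={precision:.2f}")  # recall=0.67, precision=0.50
```

In this toy case the model catches two of the three churners (recall 0.67) at the cost of two false alarms (precision 0.50), which is the kind of trade we accept when sorting by recall.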


In [16]:
best_model

The best model is Quadratic Discriminant Analysis, with a recall much better than most other models but an accuracy a bit lower than the 'no information rate', which is how often we would be correct if we always predicted no churn.
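
For reference, the no information rate is just the share of the most common class. Here is a small sketch of that calculation. The function name and the toy labels are hypothetical; passing in the real target column (for example `df['Churn_Yes']`) should work the same way, since the function only iterates over the labels.

```python
from collections import Counter


def no_information_rate(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / sum(counts.values())


# Toy labels (hypothetical): 73 non-churners and 27 churners
toy_labels = [0] * 73 + [1] * 27
print(no_information_rate(toy_labels))  # 0.73
```

Any model whose accuracy falls below this number is, by accuracy alone, doing worse than always guessing the majority class, which is why accuracy is not the metric I sorted by.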

## Save the model

In [17]:
save_model(best_model, 'model')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=['tenure',
                                                           'MonthlyCharges',
                                                           'TotalCharges',
                                                           'PhoneService_Yes',
                                                           'Contract_One year',
                                                           'Contract_Two year',
                                                           'PaymentMethod_Credit '
                                                           'card (automatic)',
                                                           'Pa

The model is saved to a pickle file, from which it can be loaded to make predictions, as shown in the FTE notebook.
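
As a rough sketch of the prediction step, the function below takes a fitted model and a dataframe and returns the churn probability for each row. It assumes the model exposes a scikit-learn-style `predict_proba` and that column 1 of its output is the churn class. Whether the object loaded from the saved pycaret file can be used this way directly is an assumption, and the function name is made up for illustration. This is not the contents of `predict_churn.py`.

```python
import pandas as pd


def predict_churn_probability(model, df: pd.DataFrame) -> pd.Series:
    """Return the predicted probability of churn for each row of df.

    Assumes `model.predict_proba` returns one row per sample with the
    probability of class 1 (churn) in column 1.
    """
    proba = model.predict_proba(df)
    # Row-wise indexing works for NumPy arrays and for lists of lists
    churn_proba = [float(row[1]) for row in proba]
    return pd.Series(churn_proba, index=df.index, name="Churn_probability")
```

Keeping the model as an argument, rather than loading it inside the function, makes the function easy to test with a stand-in model.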

## Make predictions on new data with python script

In [18]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
            Churn_predictions  Churn_probability  Churn_percentile
customerID                                                        
9305-CKSKC                  0       1.980306e-25               0.4
1452-KNGVK                  1       1.000000e+00               0.8
6723-OKKJM                  0       9.402070e-09               0.6
7832-POPKP                  0       0.000000e+00               0.2
6348-TACGU                  1       1.000000e+00               1.0


The predict_churn.py script loads the unmodified new data, preprocesses it the same way the training data was preprocessed, and correctly predicts every row (at least with my randomly generated model).
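
The optional challenge defines the percentile relative to the distribution of predicted probabilities on the training dataset. Below is a minimal sketch of that calculation, assuming the training-set probabilities are available as a plain sequence of floats. The function name and the toy numbers are hypothetical. The script itself is not shown in this notebook, so this is not necessarily how `predict_churn.py` computes `Churn_percentile`.

```python
from bisect import bisect_right


def percentile_in_training(train_probs, new_probs):
    """Fraction of training probabilities <= each new probability."""
    sorted_train = sorted(train_probs)
    n = len(sorted_train)
    # bisect_right counts how many training values are <= p
    return [bisect_right(sorted_train, p) / n for p in new_probs]


# Toy training-set probabilities (hypothetical)
train_probs = [0.05, 0.10, 0.20, 0.40, 0.78]
print(percentile_in_training(train_probs, [0.0, 0.15, 0.78, 0.99]))
# [0.0, 0.4, 1.0, 1.0]
```

With this definition a new prediction is compared against the training predictions rather than against the other new rows, so the result does not change with how many rows are scored at once.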

## H2O autoML

In [19]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O_cluster_uptime:,3 mins 11 secs
H2O_cluster_timezone:,Asia/Tokyo
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.44.0.3
H2O_cluster_version_age:,1 month and 24 days
H2O_cluster_name:,H2O_from_python_chand_9348le
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,6.953 Gb
H2O_cluster_total_cores:,12
H2O_cluster_allowed_cores:,12


In [21]:
hf = h2o.H2OFrame(pd.read_csv('data/churn_data.csv', index_col='customerID'))

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [22]:
X = hf.columns
y = 'Churn'
X.remove('Churn')

train, test = hf.split_frame(ratios=[.8], seed=101)

train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

aml = H2OAutoML(max_runtime_secs=120, seed=101)
aml.train(x=X, y=y, training_frame=train)

aml.leaderboard

AutoML progress: |█
08:59:25.678: AutoML: XGBoost is not available; skipping it.

██████████████████████████████████████████████████████████████| (done) 100%


model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
StackedEnsemble_BestOfFamily_4_AutoML_1_20240214_85925,0.8407,0.420554,0.651281,0.2446,0.369564,0.136577
GBM_grid_1_AutoML_1_20240214_85925_model_10,0.839994,0.421683,0.649407,0.242857,0.370074,0.136954
StackedEnsemble_AllModels_4_AutoML_1_20240214_85925,0.839745,0.421228,0.651214,0.240541,0.369912,0.136835
StackedEnsemble_AllModels_1_AutoML_1_20240214_85925,0.839655,0.421466,0.653203,0.244386,0.369973,0.13688
StackedEnsemble_BestOfFamily_1_AutoML_1_20240214_85925,0.839447,0.421981,0.64858,0.243358,0.370172,0.137027
StackedEnsemble_AllModels_3_AutoML_1_20240214_85925,0.839146,0.422199,0.647092,0.242707,0.370365,0.13717
GBM_grid_1_AutoML_1_20240214_85925_model_11,0.839012,0.424162,0.645367,0.242128,0.371167,0.137765
StackedEnsemble_BestOfFamily_3_AutoML_1_20240214_85925,0.838951,0.422449,0.648917,0.241058,0.370291,0.137116
StackedEnsemble_BestOfFamily_2_AutoML_1_20240214_85925,0.83894,0.422665,0.647968,0.243601,0.370423,0.137213
StackedEnsemble_AllModels_2_AutoML_1_20240214_85925,0.83863,0.42256,0.64968,0.241908,0.370449,0.137232


# Summary

The pycaret autoML was super neat once I got it to run. I had to downgrade scikit-learn and sktime, and even then the preprocessing didn't work. I eventually got preprocessing to work with Python 3.6 and scikit-learn 0.23.1, but by then I had already done most of the assignment without it, so I continued without it.

I'm not sure which data I was supposed to train the autoML on for new_churn_data.csv to work; none of the data I currently have has that format. Maybe I was supposed to use the output data from the week 2 assignment solution. Instead, I preprocessed the new data myself, as the optional challenges suggest.

Pycaret was very cool to watch as it populated the comparison table with models while it ran, and I was surprised to see QDA as the best model for recall. Without autoML I probably would never have thought to try that model.

My script made the right predictions for all the new data (with my randomly generated model), and the prediction probabilities are quite extreme. I'm not familiar with how QDA calculates probabilities, but it seems to be very sure of its predictions.

H2O was very easy to use, but it would take a lot more experimenting before I could use it the way I used pycaret and build some semblance of a pipeline. It is much harder to interpret what it's doing and which model I should use for the data. It also took much longer than pycaret.