# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
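
The percentile challenge in the list above can be sketched with the standard library alone. This is a minimal illustration, not the method used in `predict_churn.py`: the function name `churn_percentile` and the training probabilities are hypothetical, and "percentile" is taken here to mean the share of training-set probabilities less than or equal to the new prediction.

```python
from bisect import bisect_right


def churn_percentile(train_probs_sorted, prob):
    """Return the percentage of training probabilities <= prob.

    train_probs_sorted must be sorted in ascending order.
    """
    if not train_probs_sorted:
        raise ValueError("need at least one training probability")
    # bisect_right counts how many sorted values are <= prob
    rank = bisect_right(train_probs_sorted, prob)
    return 100.0 * rank / len(train_probs_sorted)


# Hypothetical training-set churn probabilities, for illustration only
train_probs = sorted([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.78, 0.90])

print(churn_percentile(train_probs, 0.78))
```

In a real module, the sorted training probabilities would be computed once from the model's predictions on the training data and stored alongside the saved model.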

# Imports

In [2]:
import pandas as pd
import sklearn
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model
import warnings
import pickle
import h2o
from h2o.automl import H2OAutoML

#warnings.filterwarnings('ignore')

## Load the data

In [3]:
df = pd.read_csv("prepped_churn_data.csv", index_col='customerID')
df

customerID,tenure,MonthlyCharges,TotalCharges,PhoneService_Yes,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Churn_Yes,total_charges_tenure_ratio,monthly_charges_times_tenure
7590-VHVEG,1,29.85,29.85,0,0,0,0,1,0,0,29.850000,29.85
5575-GNVDE,34,56.95,1889.50,1,1,0,0,0,1,0,55.573529,1936.30
3668-QPYBK,2,53.85,108.15,1,0,0,0,0,1,1,54.075000,107.70
7795-CFOCW,45,42.30,1840.75,0,1,0,0,0,0,0,40.905556,1903.50
9237-HQITU,2,70.70,151.65,1,0,0,0,1,0,1,75.825000,141.40
...,...,...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,24,84.80,1990.50,1,1,0,0,0,1,0,82.937500,2035.20
2234-XADUH,72,103.20,7362.90,1,1,0,1,0,0,0,102.262500,7430.40
4801-JZAZL,11,29.60,346.45,0,0,0,0,1,0,0,31.495455,325.60
8361-LTMKD,4,74.40,306.60,1,0,0,0,0,1,1,76.650000,297.60


PyCaret's preprocessing raised an error on this data, so I disabled it with `preprocess=False` and declared every feature column as numeric.

In [4]:
features = df.drop('Churn_Yes', axis=1).columns.to_list()
automl = setup(df, target='Churn_Yes', preprocess=False, numeric_features=features, fold_shuffle=True)

,Description,Value
0,session_id,2856
1,Target,Churn_Yes
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 12)"
5,Missing Values,False
6,Numeric Features,11
7,Categorical Features,0
8,Transformed Train Set,"(4930, 11)"
9,Transformed Test Set,"(2113, 11)"


I'll sort the model comparison by recall, because the goal is to catch as many customers as possible before they churn.

In [5]:
best_model = compare_models(sort='Recall')


,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
qda,Quadratic Discriminant Analysis,0.7256,0.825,0.786,0.4895,0.6029,0.4105,0.4376,0.007
nb,Naive Bayes,0.7312,0.8149,0.7623,0.495,0.6001,0.4112,0.4331,0.006
ada,Ada Boost Classifier,0.8037,0.8398,0.5392,0.6592,0.5925,0.4649,0.4694,0.036
lda,Linear Discriminant Analysis,0.7892,0.8293,0.5246,0.622,0.5685,0.4305,0.4336,0.007
lightgbm,Light Gradient Boosting Machine,0.7931,0.8317,0.5192,0.634,0.5697,0.4355,0.44,0.072
dt,Decision Tree Classifier,0.7333,0.6673,0.5184,0.4973,0.5065,0.3242,0.325,0.009
gbc,Gradient Boosting Classifier,0.8,0.8445,0.5146,0.6572,0.5761,0.4479,0.4544,0.1
lr,Logistic Regression,0.7947,0.8367,0.5061,0.6441,0.5661,0.4343,0.4401,0.478
et,Extra Trees Classifier,0.7714,0.7892,0.4992,0.5799,0.5356,0.3853,0.3878,0.078
catboost,CatBoost Classifier,0.7945,0.8397,0.4992,0.6453,0.5622,0.4309,0.4374,0.673
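
To make the trade-off behind sorting by recall concrete, here is a small sketch of how recall and precision are computed from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not taken from the models in the table above.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual churners that the model flagged: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of flagged customers who actually churned: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts: 100 actual churners, 78 of them caught,
# plus 80 non-churners flagged by mistake.
tp, fn, fp = 78, 22, 80

print(f"recall={recall(tp, fn):.2f}, precision={precision(tp, fp):.2f}")
```

A model tuned for recall tends to flag more customers overall, which is why the high-recall models in the comparison also show lower precision and accuracy.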


In [6]:
best_model

## Save the model

In [7]:
save_model(best_model, 'QDA')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=['tenure',
                                                           'MonthlyCharges',
                                                           'TotalCharges',
                                                           'PhoneService_Yes',
                                                           'Contract_One year',
                                                           'Contract_Two year',
                                                           'PaymentMethod_Credit '
                                                           'card (automatic)',
                                                           'Pa

## Make predictions on new data with a Python script

In [8]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
   Churn_predictions  Churn_probability  Churn_percentile
0                  0       1.695634e-05               0.6
1                  1       1.000000e+00               0.9
2                  0       8.055329e-08               0.4
3                  0       0.000000e+00               0.2
4                  1       1.000000e+00               0.9
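
The assignment lists the true values for the new data as [1, 0, 0, 1, 0], so the predictions above can be checked directly. The sketch below assumes the rows of `new_churn_data.csv` are in the same order as that list; the variable names are illustrative.

```python
# True labels as given in the assignment text
true_labels = [1, 0, 0, 1, 0]
# Churn_predictions column from the script output above
predicted = [0, 1, 0, 0, 1]

pairs = list(zip(true_labels, predicted))

# Confusion-matrix counts for the positive (churn) class
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Under that ordering assumption, only one of the five rows matches and neither actual churner is flagged, so it may be worth confirming that the rows and labels line up before drawing conclusions from a sample this small.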


## H2O autoML

In [12]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O_cluster_uptime:,36 secs
H2O_cluster_timezone:,Asia/Tokyo
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.44.0.3
H2O_cluster_version_age:,1 month and 16 days
H2O_cluster_name:,H2O_from_python_chand_8qhm1m
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,19.82 Gb
H2O_cluster_total_cores:,12
H2O_cluster_allowed_cores:,12


In [13]:
hf = h2o.H2OFrame(pd.read_csv('churn_data.csv', index_col='customerID'))

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [15]:
X = hf.columns
y = 'Churn'
X.remove('Churn')

train, test = hf.split_frame(ratios=[.8], seed=101)

train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

aml = H2OAutoML(max_models=10, seed=101)
aml.train(x=X, y=y, training_frame=train)

aml.leaderboard

AutoML progress: |█
14:19:02.942: AutoML: XGBoost is not available; skipping it.

██████████████████████████████████████████████████████████████| (done) 100%


model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
StackedEnsemble_BestOfFamily_1_AutoML_3_20240206_141902,0.839112,0.422245,0.649574,0.244858,0.370268,0.137099
StackedEnsemble_AllModels_1_AutoML_3_20240206_141902,0.838709,0.422565,0.651825,0.241465,0.370451,0.137234
GBM_1_AutoML_3_20240206_141902,0.836955,0.424106,0.649281,0.239624,0.370828,0.137513
GLM_1_AutoML_3_20240206_141902,0.834314,0.429463,0.620612,0.245348,0.374524,0.140268
GBM_5_AutoML_3_20240206_141902,0.83302,0.427868,0.6512,0.243388,0.372815,0.138991
DeepLearning_1_AutoML_3_20240206_141902,0.832294,0.43106,0.619534,0.246207,0.375564,0.141048
GBM_grid_1_AutoML_3_20240206_141902_model_1,0.830232,0.430665,0.644374,0.245341,0.374016,0.139888
GBM_2_AutoML_3_20240206_141902,0.829987,0.430966,0.644923,0.253923,0.37428,0.140085
GBM_3_AutoML_3_20240206_141902,0.826678,0.434438,0.639438,0.250688,0.376099,0.14145
GBM_4_AutoML_3_20240206_141902,0.821738,0.440302,0.634209,0.254288,0.378788,0.143481


# Summary

Write a short summary of the process and results here.