# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
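
The choice of metric matters because accuracy and AUC can disagree on the same model. The following is a minimal sketch in plain Python (so it runs without pycaret) that computes both on a tiny set of labels and scores. The labels, the scores, and the helper names `roc_auc` and `accuracy` are all made up for illustration; they are not taken from the churn data.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(y_true, y_score, threshold=0.5):
    """Share of rows whose thresholded score matches the true label."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Hypothetical labels and predicted churn probabilities (illustration only)
toy_true = [1, 0, 1, 0, 0, 0]
toy_score = [0.90, 0.40, 0.45, 0.30, 0.20, 0.10]

print("accuracy:", accuracy(toy_true, toy_score))
print("AUC:", roc_auc(toy_true, toy_score))
```

In this toy case one churner scores just below the 0.5 threshold, which lowers accuracy, yet every churner still outranks every non-churner, so the AUC is perfect. AUC rewards a good ranking regardless of the threshold, while accuracy depends on it.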

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
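
For the first optional challenge, one possible way to place a new probability within the training distribution is to sort the training probabilities once and look up the new value with a binary search. This is only a sketch: the training probabilities below are invented, and `percentile_of` is a hypothetical helper, not part of pycaret.

```python
from bisect import bisect_right


def percentile_of(train_probs_sorted, p):
    """Percent (0 to 100) of training probabilities that are <= p.

    `train_probs_sorted` must already be sorted in ascending order.
    """
    if not train_probs_sorted:
        raise ValueError("need at least one training probability")
    return 100.0 * bisect_right(train_probs_sorted, p) / len(train_probs_sorted)


# Hypothetical churn probabilities predicted on the training set (illustration only)
train_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.25, 0.31, 0.40, 0.55, 0.70, 0.85])

for new_prob in (0.78, 0.15):
    print(new_prob, "is at percentile", percentile_of(train_probs, new_prob))
```

With real data, `train_probs` would come from the model's predicted churn probabilities on the training rows rather than from a hand-written list.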

In [39]:
import pandas as pd

In [40]:
df_path = '/Users/kie/Documents/Regis University/MS/01. MSDS 600/02. Week 2/Chi_assignment2_data.csv'
df = pd.read_csv(df_path)
# Drop the ID, leftover index, and derived ratio columns that are not used as features
df = df.drop(columns=['customerID', 'ratio_group', 'Unnamed: 0', 'ratio_totalcharges_tenure'])
df

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1,0,0,1,29.85,29.85,0
1,34,1,1,2,56.95,1889.50,0
2,2,1,0,2,53.85,108.15,1
3,45,0,1,3,42.30,1840.75,0
4,2,1,0,1,70.70,151.65,1
...,...,...,...,...,...,...,...
7038,24,1,1,2,84.80,1990.50,0
7039,72,1,1,4,103.20,7362.90,0
7040,11,0,0,1,29.60,346.45,0
7041,4,1,0,2,74.40,306.60,1


# 1. Use pycaret to find an ML algorithm

In [41]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [42]:
automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,session_id,6342
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 7)"
5,Missing Values,False
6,Numeric Features,3
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [43]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7953,0.8368,0.5223,0.6389,0.5725,0.4402,0.4454,0.283
catboost,CatBoost Classifier,0.7935,0.8386,0.5062,0.6385,0.5636,0.4308,0.4365,0.335
ridge,Ridge Classifier,0.7925,0.0,0.46,0.6518,0.5375,0.4093,0.4206,0.005
ada,Ada Boost Classifier,0.7925,0.8406,0.5031,0.6376,0.5606,0.4275,0.4337,0.036
gbc,Gradient Boosting Classifier,0.7921,0.8411,0.4962,0.6376,0.5573,0.4242,0.4304,0.092
lda,Linear Discriminant Analysis,0.7905,0.8267,0.5169,0.626,0.5643,0.4284,0.433,0.01
lightgbm,Light Gradient Boosting Machine,0.786,0.8284,0.5169,0.6125,0.5595,0.4198,0.4231,0.028
rf,Random Forest Classifier,0.7714,0.8035,0.4931,0.5784,0.5317,0.3819,0.3844,0.097
knn,K Neighbors Classifier,0.77,0.7489,0.4462,0.5832,0.5048,0.3587,0.3645,0.011
et,Extra Trees Classifier,0.7641,0.7785,0.4992,0.5585,0.5264,0.3703,0.3717,0.084


In [44]:
best_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=6342, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [45]:
best_model = compare_models(sort='AUC')

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7921,0.8411,0.4962,0.6376,0.5573,0.4242,0.4304,0.088
ada,Ada Boost Classifier,0.7925,0.8406,0.5031,0.6376,0.5606,0.4275,0.4337,0.037
catboost,CatBoost Classifier,0.7935,0.8386,0.5062,0.6385,0.5636,0.4308,0.4365,0.297
lr,Logistic Regression,0.7953,0.8368,0.5223,0.6389,0.5725,0.4402,0.4454,0.015
lightgbm,Light Gradient Boosting Machine,0.786,0.8284,0.5169,0.6125,0.5595,0.4198,0.4231,0.025
lda,Linear Discriminant Analysis,0.7905,0.8267,0.5169,0.626,0.5643,0.4284,0.433,0.006
nb,Naive Bayes,0.6874,0.8119,0.8477,0.4508,0.5886,0.3725,0.421,0.005
rf,Random Forest Classifier,0.7714,0.8035,0.4931,0.5784,0.5317,0.3819,0.3844,0.095
et,Extra Trees Classifier,0.7641,0.7785,0.4992,0.5585,0.5264,0.3703,0.3717,0.101
knn,K Neighbors Classifier,0.77,0.7489,0.4462,0.5832,0.5048,0.3587,0.3645,0.01


In [46]:
best_model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=6342, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [47]:
automl = setup(df, target='Churn', preprocess=False, numeric_features=['tenure','PhoneService','Contract','PaymentMethod'])

Unnamed: 0,Description,Value
0,session_id,627
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 7)"
5,Missing Values,False
6,Numeric Features,6
7,Categorical Features,0
8,Transformed Train Set,"(4930, 6)"
9,Transformed Test Set,"(2113, 6)"


In [48]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7931,0.8314,0.4822,0.6569,0.5552,0.4247,0.4339,0.08
lr,Logistic Regression,0.7905,0.8309,0.5102,0.6394,0.5664,0.4307,0.4361,0.02
ridge,Ridge Classifier,0.7888,0.0,0.4399,0.6627,0.5276,0.3988,0.4134,0.005
lda,Linear Discriminant Analysis,0.7876,0.8192,0.4973,0.6356,0.5571,0.4202,0.4263,0.006
lightgbm,Light Gradient Boosting Machine,0.7846,0.8186,0.5102,0.6208,0.5594,0.4188,0.4227,0.247
catboost,CatBoost Classifier,0.7846,0.8256,0.4875,0.6282,0.5481,0.4098,0.416,0.32
ada,Ada Boost Classifier,0.7838,0.8293,0.4799,0.6267,0.5427,0.4046,0.4113,0.035
knn,K Neighbors Classifier,0.7629,0.7456,0.4459,0.5757,0.5019,0.3498,0.355,0.118
rf,Random Forest Classifier,0.7621,0.7861,0.4649,0.5707,0.5117,0.3567,0.3604,0.098
et,Extra Trees Classifier,0.7499,0.7609,0.4656,0.5405,0.4999,0.3344,0.3363,0.17


In [49]:
best_model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=627, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

# 2. Save the model to disk

In [50]:
save_model(best_model, 'GBC')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=['tenure',
                                                           'PhoneService',
                                                           'Contract',
                                                           'PaymentMethod'],
                                       target='Churn', time_features=[])),
                 ['trained_model',
                  GradientBoostingClassifier(ccp_alpha=0.0,
                                             criterion='friedman_mse...
                                             learning_rate=0.1, loss='deviance',
                                             max_depth=3, max_feature

In [51]:
import pickle

with open('GBC.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [52]:
with open('GBC.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [53]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1])

In [54]:
loaded_gbc = load_model('GBC')

Transformation Pipeline and Model Successfully Loaded


In [55]:
predict_model(loaded_gbc, new_data)

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Label,Score
7041,4,1,0,2,74.4,306.6,1,0.6362
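
The assignment asks for the probability of churn, while the output above has a `Label` and a `Score` column. A small sketch of one way to turn that pair into a churn probability follows. It assumes `Score` is the probability of the predicted `Label`, which I believe is the behavior of recent PyCaret 2.x releases; in other versions `Score` may already be the probability of class 1, so this should be checked against the installed version. The helper name `churn_probability` is hypothetical.

```python
def churn_probability(label, score):
    """Probability of class 1 (churn).

    Assumption: `score` is the probability of the predicted `label`,
    so a row predicted as 0 has churn probability 1 - score.
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    return score if label == 1 else 1.0 - score


# Row 7041 from the output above: Label 1, Score 0.6362
print(churn_probability(1, 0.6362))

# Hypothetical row predicted as non-churn with Score 0.80 (illustration only)
print(round(churn_probability(0, 0.80), 4))
```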


# 3. Create a Python module and test with the new data

In [56]:
from IPython.display import Code

Code('predict_Churn.py')

In [57]:
%run predict_Churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
   customerID  tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
0  9305-CKSKC      22             1         0              2           97.40   
1  1452-KNGVK       8             0         1              1           77.30   
2  6723-OKKJM      28             1         0              0           28.25   
3  7832-POPKP      62             1         0              2          101.70   
4  6348-TACGU      10             0         0              1           51.15   

   TotalCharges  charge_per_tenure  Churn_prediction  
0        811.70          36.895455                 1  
1       1701.95         212.743750                 0  
2        250.90           8.960714                 0  
3       3106.56          50.105806                 0  
4       3440.97         344.097000                 0  


# Summary

To find an ML algorithm, I first ran `setup` with `Churn` as the target and default settings, then compared models with default settings as well. The best model by Accuracy was LogisticRegression (0.7953). Next, I sorted the comparison by AUC (`sort='AUC'`), and the result changed: the best model was GradientBoostingClassifier, whose Accuracy was slightly lower (0.7921) but whose AUC was higher (0.8411 versus 0.8368).
After that, I reran `setup` with `preprocess=False` and the categorical features declared as numeric, to see how that affected performance before comparing the models again. In that run GradientBoostingClassifier had the highest Accuracy (0.7931) and the highest AUC (0.8314) of all the models compared, so I used it for the predictions.
I was not sure whether the array given in the assignment, [1, 0, 0, 1, 0], is meant to be the expected output of the predictions. I am a little confused, but I think the predictions depend on the model we choose, so the results may differ. My predictions for the new data were [1, 0, 0, 0, 0].
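
The true values listed in the assignment can be compared directly with the `Churn_prediction` column printed by `predict_Churn.py` above. The sketch below does that in plain Python; `classification_metrics` is a hypothetical helper written for this comparison, and five rows are far too few to say much about the model.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted or truly positive
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [1, 0, 0, 0, 0]  # Churn_prediction column from the script output

print(classification_metrics(y_true, y_pred))
```

Four of the five rows match. The one miss is a churner predicted as non-churn, which shows up as lower recall rather than lower precision.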