# DS Automation Assignment - Tom Bukowski

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
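
The percentile challenge above could be approached as in the following minimal sketch. It assumes the churn probabilities predicted for the training data are already available as an array. The names `percentile_of` and `train_probs` are hypothetical, and the numbers are made-up illustration values, not results from this notebook.

```python
import numpy as np


def percentile_of(train_probs, new_prob):
    """Return the percent of training-set probabilities that are <= new_prob."""
    sorted_probs = np.sort(np.asarray(train_probs, dtype=float))
    # With side="right", searchsorted counts the values that are <= new_prob
    rank = np.searchsorted(sorted_probs, new_prob, side="right")
    return 100.0 * rank / len(sorted_probs)


# Made-up training probabilities, for illustration only
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
print(percentile_of(train_probs, 0.78))
```

With real data, `train_probs` would be the churn probabilities that the saved model assigns to the rows it was trained on.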

In [1]:
import pandas as pd

df = pd.read_csv('prepped_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,total_tenure_ratio
7590-VHVEG,1,0,0,0,29.85,29.85,0,29.850000
5575-GNVDE,34,1,1,1,56.95,1889.50,0,55.573529
3668-QPYBK,2,1,0,1,53.85,108.15,1,54.075000
7795-CFOCW,45,0,1,2,42.30,1840.75,0,40.905556
9237-HQITU,2,1,0,0,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0,82.937500
2234-XADUH,72,1,1,3,103.20,7362.90,0,102.262500
4801-JZAZL,11,0,0,0,29.60,346.45,0,31.495455
8361-LTMKD,4,1,0,1,74.40,306.60,1,76.650000


In [2]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [3]:
automl = setup(data=df, target='Churn', preprocess=False, fold_shuffle=True, imputation_type='iterative', numeric_features=['tenure', 'PhoneService', 'Contract', 'PaymentMethod'])

,Description,Value
0,session_id,2016
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 8)"
5,Missing Values,False
6,Numeric Features,7
7,Categorical Features,0
8,Transformed Train Set,"(4922, 7)"
9,Transformed Test Set,"(2110, 7)"


In [4]:
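# Element 6 of the tuple returned by setup() holds the test-set target here
# (2110 rows, matching the test set size above); the position may differ in other pycaret versions
automl[6]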

customerID
6839-ITVZJ    1
9481-IEBZY    0
9514-JDSKI    1
8267-KFGYD    0
0224-HJAPT    1
             ..
5312-UXESG    0
7319-VENRZ    0
9367-OIUXP    0
2475-MROZF    0
1389-WNUIB    0
Name: Churn, Length: 2110, dtype: int64

In [5]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.7836,0.8191,0.4882,0.6224,0.5467,0.4075,0.4129,0.009
catboost,CatBoost Classifier,0.7836,0.8299,0.4747,0.6275,0.5387,0.4014,0.409,1.389
gbc,Gradient Boosting Classifier,0.7834,0.8323,0.4731,0.628,0.5384,0.4008,0.4084,0.147
lr,Logistic Regression,0.783,0.832,0.4988,0.6186,0.5518,0.4108,0.4153,0.932
ada,Ada Boost Classifier,0.782,0.8279,0.4974,0.6151,0.5496,0.408,0.4122,0.061
ridge,Ridge Classifier,0.7818,0.0,0.4275,0.6404,0.5115,0.3783,0.3918,0.012
lightgbm,Light Gradient Boosting Machine,0.7755,0.8222,0.4777,0.6035,0.5314,0.3869,0.3925,0.307
xgboost,Extreme Gradient Boosting,0.766,0.811,0.4572,0.5798,0.5095,0.3592,0.3645,0.197
rf,Random Forest Classifier,0.7623,0.7973,0.4496,0.574,0.5014,0.3491,0.3551,0.182
knn,K Neighbors Classifier,0.7584,0.7367,0.4177,0.5665,0.4798,0.3275,0.3344,0.028
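
The top models in this table are within about 0.002 of each other on accuracy, so the "best" model depends on which metric is used, and possibly on the random split of a given run. The sketch below copies the scores of the first six rows into a pandas DataFrame and reports which model wins under each metric. The names `scores` and `best_per_metric` are only for illustration.

```python
import pandas as pd

# Cross-validated scores copied from the first six rows of the compare_models() table
scores = pd.DataFrame(
    {
        "Accuracy": [0.7836, 0.7836, 0.7834, 0.7830, 0.7820, 0.7818],
        "AUC": [0.8191, 0.8299, 0.8323, 0.8320, 0.8279, 0.0],
        "Recall": [0.4882, 0.4747, 0.4731, 0.4988, 0.4974, 0.4275],
        "Prec.": [0.6224, 0.6275, 0.6280, 0.6186, 0.6151, 0.6404],
        "F1": [0.5467, 0.5387, 0.5384, 0.5518, 0.5496, 0.5115],
    },
    index=["lda", "catboost", "gbc", "lr", "ada", "ridge"],
)

# Model with the highest score under each metric (ties go to the first row)
best_per_metric = scores.idxmax()
print(best_per_metric)

# Accuracy gap between the LDA and the Gradient Boosting Classifier
accuracy_gap = round(scores.loc["lda", "Accuracy"] - scores.loc["gbc", "Accuracy"], 4)
print(accuracy_gap)
```

In pycaret itself, the ranking metric can be chosen with the `sort` argument, for example `compare_models(sort='AUC')`.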


In [6]:
best_model

LinearDiscriminantAnalysis(covariance_estimator=None, n_components=None,
                           priors=None, shrinkage=None, solver='svd',
                           store_covariance=False, tol=0.0001)

In [7]:
df.iloc[-2:-1].shape

(1, 8)

In [8]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,total_tenure_ratio,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,1,76.65,1,0.6111


In [9]:
save_model(best_model, 'saved_GBC_model')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=['tenure',
                                                           'PhoneService',
                                                           'Contract',
                                                           'PaymentMethod'],
                                       target='Churn', time_features=[])),
                 ['trained_model',
                  LinearDiscriminantAnalysis(covariance_estimator=None,
                                             n_components=None, priors=None,
                                             shrinkage=None, solver='svd',
                                             store_covaria

In [11]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)

In [10]:
loaded_gbc = load_model('saved_GBC_model')

Transformation Pipeline and Model Successfully Loaded


In [12]:
predict_model(loaded_gbc, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,total_tenure_ratio,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,76.65,1,0.6111


# Python module

In [22]:
from IPython.display import Code

Code('predict_churn.py')

In [23]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
Predictions: 
customerID
9305-CKSKC    Did not Churn
1452-KNGVK            Churn
6723-OKKJM    Did not Churn
7832-POPKP    Did not Churn
6348-TACGU            Churn
Name: Churn_prediction, dtype: object


# Summary

I had many problems running the pycaret package, specifically `setup`, and at first I was unable to continue the project. I eventually resolved this by reinstalling Anaconda and updating all packages in Jupyter Notebook.

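Once that was done, my first run found that the best model was the Gradient Boosting Classifier (GBC), but when I ran it again the best model was Linear Discriminant Analysis (LDA). I had written everything on the assumption that it was the GBC, which is why the naming conventions use GBC in the file and variable names (for example `saved_GBC_model` and `loaded_gbc`), even though the model that was saved is the LDA.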

I created the Python module following the FTE example, including the `if __name__ == "__main__":` block (the "dunder main") at the bottom.
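
The assignment asks for the probability of churn, while `predict_model` returns a `Label` and a `Score` column. The following is a minimal sketch of one way to turn those two columns into a churn probability. It assumes that `Score` is the probability of the predicted label, which should be checked against the installed pycaret version. The function name `churn_probability` and the second example row are hypothetical.

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Convert Label/Score columns into the probability of churn (label 1).

    Assumption: Score is the probability of the predicted Label.
    """
    label = predictions["Label"].astype(int)
    score = predictions["Score"].astype(float)
    # Keep Score where churn was predicted; otherwise use its complement
    prob = score.where(label == 1, 1.0 - score)
    return prob.rename("churn_probability")


# First row: Label and Score from the prediction shown earlier in this notebook.
# Second row: hypothetical customer and values, for illustration only.
example = pd.DataFrame(
    {"Label": [1, 0], "Score": [0.6111, 0.8]},
    index=pd.Index(["8361-LTMKD", "hypothetical-id"], name="customerID"),
)
print(churn_probability(example))
```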

I had to modify the new_churn_data.csv file because my prepped data has a feature named `total_tenure_ratio`, while the same column in new_churn_data.csv is called `charge_per_tenure`. With the mismatched column name, the `predict_model` function would not run.
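
An alternative to editing the CSV by hand is to rename the column after loading it, which leaves the original file untouched. The sketch below shows the idea on a one-row DataFrame. The helper name `align_columns` is hypothetical, and the row reuses values from the customer shown earlier in this notebook, with only a subset of the columns.

```python
import pandas as pd


def align_columns(new_df: pd.DataFrame) -> pd.DataFrame:
    """Rename the ratio column to the name used in the training data."""
    # rename returns a new DataFrame; the input is not modified
    return new_df.rename(columns={"charge_per_tenure": "total_tenure_ratio"})


# One illustrative row with a subset of the columns
new_df = pd.DataFrame(
    {
        "tenure": [4],
        "MonthlyCharges": [74.4],
        "TotalCharges": [306.6],
        "charge_per_tenure": [76.65],
    },
    index=pd.Index(["8361-LTMKD"], name="customerID"),
)
aligned = align_columns(new_df)
print(list(aligned.columns))
```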

My predictions for the five customers in new_churn_data.csv, using the pycaret package, are as follows:
- Did not churn
- Churn
- Did not churn
- Did not churn
- Churn
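
These predictions can be checked against the true values given in the assignment, [1, 0, 0, 1, 0]. The sketch below assumes the true values are listed in the same row order as the predictions and that 1 means churn. The variable names are only for illustration.

```python
# True values from the assignment text
true_values = [1, 0, 0, 1, 0]

# Predictions in the order printed by predict_churn.py
predicted_labels = [
    "Did not Churn",
    "Churn",
    "Did not Churn",
    "Did not Churn",
    "Churn",
]

# Encode the text labels as 1 (churn) and 0 (no churn)
predicted = [1 if label == "Churn" else 0 for label in predicted_labels]

# Count the rows where prediction and true value agree
matches = sum(t == p for t, p in zip(true_values, predicted))
print(f"{matches} of {len(true_values)} predictions match the true values")
```

Under those assumptions only the third row matches, which suggests the modified new_churn_data.csv and the column mapping are worth re-checking.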