# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
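
The percentile challenge above can be sketched in plain Python. This is only a minimal illustration: `percentile_of_score` is a hypothetical helper, and the `train_probs` values are made up for the example rather than taken from the churn model. In practice, `train_probs` would hold the churn probabilities predicted for the training data.

```python
from bisect import bisect_right


def percentile_of_score(train_probs, new_prob):
    """Return the percentage of training probabilities that are <= new_prob."""
    ordered = sorted(train_probs)
    # bisect_right gives the number of sorted values that are <= new_prob
    rank = bisect_right(ordered, new_prob)
    return 100.0 * rank / len(ordered)


# Hypothetical training-set churn probabilities (illustrative values only)
train_probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.65, 0.70, 0.90]

print(percentile_of_score(train_probs, 0.78))  # prints 90.0
```

With these made-up values, nine of the ten training probabilities are at or below 0.78, so the new prediction sits at the 90th percentile.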

In [2]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [5]:
import pandas as pd

In [6]:
df = pd.read_csv(r'C:\Users\Joseph\OneDrive\Desktop\Data Science\MSDS600\Week 5\Assignment\prepared_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
7590-VHVEG,1.0,0,0,3,29.85,29.85,0,29.850000
5575-GNVDE,34.0,1,1,2,56.95,1889.50,0,55.573529
3668-QPYBK,2.0,1,0,2,53.85,108.15,1,54.075000
7795-CFOCW,45.0,0,1,1,42.30,1840.75,0,40.905556
9237-HQITU,2.0,1,0,3,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24.0,1,1,2,84.80,1990.50,0,82.937500
2234-XADUH,72.0,1,1,0,103.20,7362.90,0,102.262500
4801-JZAZL,11.0,0,0,3,29.60,346.45,0,31.495455
8361-LTMKD,4.0,1,0,2,74.40,306.60,1,76.650000


In [7]:
#Set up autoML
automl = setup(df, target='Churn')

,Description,Value
0,Session id,1470
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 8)"
4,Transformed data shape,"(7043, 8)"
5,Transformed train set shape,"(4930, 8)"
6,Transformed test set shape,"(2113, 8)"
7,Numeric features,7
8,Preprocess,True
9,Imputation type,simple


In [9]:
automl.get_metrics()

ID,Name,Display Name,Score Function,Scorer,Target,Args,Greater is Better,Multiclass,Custom
acc,Accuracy,Accuracy,<function accuracy_score at 0x00000253CC3A9A80>,accuracy,pred,{},True,True,False
auc,AUC,AUC,<pycaret.internal.metrics.BinaryMulticlassScor...,"make_scorer(roc_auc_score, needs_proba=True, e...",pred_proba,"{'average': 'weighted', 'multi_class': 'ovr'}",True,True,False
recall,Recall,Recall,<pycaret.internal.metrics.BinaryMulticlassScor...,"make_scorer(recall_score, average=weighted)",pred,{'average': 'weighted'},True,True,False
precision,Precision,Prec.,<pycaret.internal.metrics.BinaryMulticlassScor...,"make_scorer(precision_score, average=weighted)",pred,{'average': 'weighted'},True,True,False
f1,F1,F1,<pycaret.internal.metrics.BinaryMulticlassScor...,"make_scorer(f1_score, average=weighted)",pred,{'average': 'weighted'},True,True,False
kappa,Kappa,Kappa,<function cohen_kappa_score at 0x00000253CC3A9...,make_scorer(cohen_kappa_score),pred,{},True,True,False
mcc,MCC,MCC,<function matthews_corrcoef at 0x00000253CC3A9...,make_scorer(matthews_corrcoef),pred,{},True,True,False


In [10]:
#use pycaret to find an ML algorithm that performs best on the data
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7996,0.8439,0.549,0.6442,0.5921,0.4605,0.4636,0.792
ridge,Ridge Classifier,0.7959,0.0,0.4588,0.6691,0.543,0.4179,0.4308,0.012
ada,Ada Boost Classifier,0.7957,0.844,0.5238,0.6428,0.5749,0.4428,0.4482,0.045
lda,Linear Discriminant Analysis,0.7949,0.8313,0.5199,0.6405,0.5732,0.4403,0.4449,0.011
gbc,Gradient Boosting Classifier,0.7927,0.8459,0.5116,0.6381,0.5666,0.4328,0.4381,0.106
rf,Random Forest Classifier,0.7815,0.817,0.4955,0.6084,0.545,0.4036,0.4079,0.117
lightgbm,Light Gradient Boosting Machine,0.7815,0.8345,0.5154,0.6047,0.5549,0.4118,0.415,0.073
xgboost,Extreme Gradient Boosting,0.7769,0.8232,0.5131,0.5925,0.5489,0.402,0.4044,0.029
et,Extra Trees Classifier,0.7704,0.8011,0.4939,0.58,0.5327,0.382,0.3846,0.081
knn,K Neighbors Classifier,0.7609,0.7435,0.4481,0.5615,0.4974,0.3435,0.3477,0.452


In [11]:
best_model

Logistic Regression has the highest accuracy in the comparison (0.7996), so it is the model used for the rest of the notebook.
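
The choice of metric changes which model comes out on top. As a rough illustration, the sketch below re-ranks four rows copied from the comparison table above by different metrics. The `scores` dictionary and the `best_by` helper are hypothetical names used only for this example. In PyCaret itself, the ranking metric is selected with the `sort` argument, for example `compare_models(sort='AUC')`.

```python
# Four rows copied from the compare_models() output above
scores = {
    "lr":    {"Accuracy": 0.7996, "AUC": 0.8439, "Recall": 0.5490, "Prec.": 0.6442, "F1": 0.5921},
    "ridge": {"Accuracy": 0.7959, "AUC": 0.0000, "Recall": 0.4588, "Prec.": 0.6691, "F1": 0.5430},
    "ada":   {"Accuracy": 0.7957, "AUC": 0.8440, "Recall": 0.5238, "Prec.": 0.6428, "F1": 0.5749},
    "gbc":   {"Accuracy": 0.7927, "AUC": 0.8459, "Recall": 0.5116, "Prec.": 0.6381, "F1": 0.5666},
}


def best_by(metric):
    """Return the model ID with the highest value for the given metric."""
    return max(scores, key=lambda name: scores[name][metric])


for metric in ("Accuracy", "AUC", "Recall", "Prec.", "F1"):
    print(metric, "->", best_by(metric))
```

On these rows, accuracy, recall, and F1 favor Logistic Regression, AUC favors the Gradient Boosting Classifier, and precision favors the Ridge Classifier.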

In [12]:
#save the model to disk
save_model(best_model, 'LR')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'charge_per_tenure'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean',
                                                               verbos...
           

In [21]:
#Pickle the fitted model so that a separate Python script can load it and return the probability of churn for each row of a pandas dataframe
import pickle

with open('LR_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [22]:
with open('LR_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)
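
A function for the Python module could look roughly like the sketch below. It assumes the unpickled model follows the scikit-learn classifier interface (`predict_proba` and `classes_`), as a fitted scikit-learn `LogisticRegression` does. The names `load_model_from_pickle` and `predict_churn_probability` are hypothetical. Note that `predict` returns class labels, while `predict_proba` returns one probability column per class.

```python
import pickle

import pandas as pd


def load_model_from_pickle(path="LR_model.pk"):
    """Load a pickled model from disk (only unpickle files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_churn_probability(model, df):
    """Return P(Churn = 1) for each row of df as a pandas Series."""
    # predict_proba has one column per class, ordered like model.classes_
    churn_col = list(model.classes_).index(1)
    probs = model.predict_proba(df)[:, churn_col]
    return pd.Series(probs, index=df.index, name="churn_probability")
```

Called with a loaded model and a dataframe that has the same feature columns used in training, this would return one churn probability per customer, indexed by `customerID`.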

In [23]:
new_df = pd.read_csv(r'C:\Users\Joseph\OneDrive\Desktop\Data Science\MSDS600\Week 5\Assignment\new_churn_data.csv', index_col='customerID')
new_df

Unnamed: 0_level_0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure
customerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
9305-CKSKC,22,1,0,2,97.4,811.7,36.895455
1452-KNGVK,8,0,1,1,77.3,1701.95,212.74375
6723-OKKJM,28,1,0,0,28.25,250.9,8.960714
7832-POPKP,62,1,0,2,101.7,3106.56,50.105806
6348-TACGU,10,0,0,1,51.15,3440.97,344.097


In [25]:
#Predict the churn class labels (0 or 1) for the new data
new_data = new_df
loaded_model.predict(new_data)

array([1, 0, 0, 0, 1], dtype=int8)
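
The assignment lists the true values for the new data as [1, 0, 0, 1, 0], so the predicted labels above can be checked directly. The sketch below computes accuracy, precision, and recall by hand in plain Python. The variable names are illustrative, and with only five rows the numbers are anecdotal rather than a real evaluation.

```python
y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [1, 0, 0, 0, 1]  # labels returned by loaded_model.predict(new_data)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct non-churners

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.60 precision=0.50 recall=0.50
```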

In [27]:
#Score the new data with PyCaret's predict_model to get a label and a score for each row
predict_model(best_model,new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score
9305-CKSKC,22,1,0,2,97.400002,811.700012,36.895454,1,0.5011
1452-KNGVK,8,0,1,1,77.300003,1701.949951,212.743744,0,0.5515
6723-OKKJM,28,1,0,0,28.25,250.899994,8.960714,0,0.9294
7832-POPKP,62,1,0,2,101.699997,3106.560059,50.105808,0,0.8447
6348-TACGU,10,0,0,1,51.150002,3440.969971,344.096985,1,0.6486
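
In the `predict_model` output above, every `prediction_score` is at least 0.5, including the rows labeled 0. That is consistent with the score being the probability of the predicted label rather than the probability of churn. Assuming that interpretation holds, a churn probability can be recovered as sketched below. `to_churn_probability` is a hypothetical helper, and the small dataframe only re-types the two relevant columns from the table.

```python
import pandas as pd


def to_churn_probability(pred_df):
    """Convert label/score pairs into P(Churn = 1) for each row."""
    # Assumption: prediction_score is P(predicted label), so flip it for label 0
    is_churn = pred_df["prediction_label"] == 1
    score = pred_df["prediction_score"]
    return score.where(is_churn, 1 - score).rename("churn_probability")


# Labels and scores copied from the predict_model output above
preds = pd.DataFrame(
    {
        "prediction_label": [1, 0, 0, 0, 1],
        "prediction_score": [0.5011, 0.5515, 0.9294, 0.8447, 0.6486],
    },
    index=["9305-CKSKC", "1452-KNGVK", "6723-OKKJM", "7832-POPKP", "6348-TACGU"],
)

print(to_churn_probability(preds).round(4))
```

PyCaret's `predict_model` also accepts `raw_score=True`, which returns a score column per class and avoids the conversion.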


# Summary

- The first step is installing PyCaret with the conda package manager from the conda-forge channel.
- The churn data is loaded from a CSV file (prepared_churn_data.csv) using pandas. PyCaret's `setup` function then sets up the classification experiment, specifying the target variable (`Churn`) and leaving the other parameters at their defaults.
- PyCaret's `compare_models` function compares different machine learning models on metrics ranging from accuracy to MCC. In this case, Logistic Regression (`lr`) is selected as the best model based on the accuracy score.
- The best model (Logistic Regression) is saved to disk using PyCaret's `save_model` function.
- The model is also pickled and loaded back from disk using `pickle`. New churn data is then loaded from a CSV file (new_churn_data.csv), and the loaded model is used to predict churn for the new rows.
- Finally, a Python script/module is created with a function that takes a pandas DataFrame as input and returns the probability of churn for each row in the DataFrame. The function uses the saved model to make predictions.

Overall, the model makes predictions with varying levels of confidence, with prediction scores ranging from about 50% to 93%. The predicted labels [1, 0, 0, 0, 1] match the true values [1, 0, 0, 1, 0] for three of the five new customers.