# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
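For the percentile challenge in the list above, one possible approach is to sort the training-set churn probabilities and look up where a new probability falls among them. The sketch below uses only the standard library. The `percentile_rank` helper and the sample probabilities are hypothetical, made up for illustration, and are not taken from the churn data.

```python
from bisect import bisect_right


def percentile_rank(train_probs, new_prob):
    """Return the percentage of training probabilities that are <= new_prob."""
    ordered = sorted(train_probs)
    if not ordered:
        raise ValueError("train_probs must not be empty")
    # bisect_right returns how many of the sorted values are <= new_prob.
    return 100.0 * bisect_right(ordered, new_prob) / len(ordered)


# Hypothetical training-set churn probabilities (illustrative values only).
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]

# Eight of the ten values are <= 0.78, so this prints 80.0.
print(percentile_rank(train_probs, 0.78))
```

With real data, `train_probs` would be the churn probabilities predicted for the training rows. If SciPy is available, `scipy.stats.percentileofscore` with `kind="weak"` should give the same quantity.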

In [1]:
import pandas as pd 
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn import tree
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv("/Users/joshu/Desktop/regis_classes/Week 5/prepped_churn_data.csv", index_col='customerID')
df = df.drop(columns = ["Unnamed: 0"])
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,total_charges_tenure_ratio
7590-VHVEG,1,0,0,1,29.85,29.85,0,0.033501
5575-GNVDE,34,1,1,0,56.95,1889.50,0,0.017994
3668-QPYBK,2,1,0,0,53.85,108.15,1,0.018493
7795-CFOCW,45,0,1,3,42.30,1840.75,0,0.024447
9237-HQITU,2,1,0,1,70.70,151.65,1,0.013188
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,0,84.80,1990.50,0,0.012057
2234-XADUH,72,1,1,2,103.20,7362.90,0,0.009779
4801-JZAZL,11,0,0,1,29.60,346.45,0,0.031751
8361-LTMKD,4,1,0,0,74.40,306.60,1,0.013046


In [4]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [5]:
automl = setup(df, target = "Churn")

,Description,Value
0,session_id,7902
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(7043, 8)"
5,Missing Values,False
6,Numeric Features,4
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [6]:
best_model = compare_models(sort='AUC')

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7988,0.8392,0.4965,0.6546,0.5633,0.436,0.4438,0.055
lr,Logistic Regression,0.7982,0.8354,0.5182,0.6434,0.5732,0.4432,0.4481,0.24
ada,Ada Boost Classifier,0.798,0.8351,0.5011,0.6488,0.5646,0.4361,0.4427,0.024
lda,Linear Discriminant Analysis,0.7929,0.8268,0.5112,0.6292,0.5635,0.4298,0.4341,0.005
lightgbm,Light Gradient Boosting Machine,0.787,0.823,0.5104,0.6137,0.5559,0.4177,0.4215,0.016
nb,Naive Bayes,0.687,0.8094,0.842,0.4482,0.5848,0.3693,0.4172,0.005
rf,Random Forest Classifier,0.7769,0.797,0.4856,0.5897,0.5312,0.387,0.3909,0.119
et,Extra Trees Classifier,0.7633,0.7713,0.4826,0.5554,0.5156,0.3602,0.3623,0.108
knn,K Neighbors Classifier,0.772,0.7501,0.45,0.5842,0.5077,0.3628,0.3683,0.181
dt,Decision Tree Classifier,0.7276,0.6573,0.4973,0.4814,0.4878,0.3028,0.3036,0.005


In [8]:
best_model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=7902, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [9]:
df.iloc[-2:-1].shape

(1, 8)

In [10]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,total_charges_tenure_ratio,Label,Score
8361-LTMKD,4,1,0,0,74.4,306.6,1,0.013046,1,0.6639


In [11]:
save_model(best_model, "GBC")

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                                             learning_rate=0.1, loss='deviance',
                                             max_depth=3, max_features=None,
                                             max_leaf_nodes=None,
                                             min_i

In [12]:
new_data = df.iloc[-2:-1].copy()
loaded_model = load_model('GBC')

loaded_model.predict(new_data)

Transformation Pipeline and Model Successfully Loaded


array([1], dtype=int64)

In [13]:
predict_model(loaded_model, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,total_charges_tenure_ratio,Label,Score
8361-LTMKD,4,1,0,0,74.4,306.6,1,0.013046,1,0.6639


In [32]:
from IPython.display import Code
# Pass the path as filename=; a bare string is treated as the code text itself
Code(filename="predict_churn.py")
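The contents of predict_churn.py are not reproduced in this export, so the following is only a sketch of how a function that returns the probability of churn for each row might look. It is not the script used here. It assumes the `Label` and `Score` column names shown in the `predict_model` output above, and it assumes that `Score` is the probability of the predicted label rather than of churn itself. That can differ between pycaret versions, so check it against your install. The names `churn_probability` and `predict_churn` are hypothetical.

```python
def churn_probability(label, score):
    """Convert a (Label, Score) pair into the probability of churn.

    Assumption: Score is the probability of the *predicted* label, so a
    "no churn" prediction (label 0) with score s means P(churn) = 1 - s.
    """
    return float(score) if int(label) == 1 else 1.0 - float(score)


def predict_churn(df, model_name="GBC"):
    """Return a pandas Series holding the churn probability for each row of df."""
    # Imported lazily so churn_probability works without these packages installed.
    import pandas as pd
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)  # reads the file written by save_model(best_model, "GBC")
    scored = predict_model(model, data=df)
    probs = [
        churn_probability(label, score)
        for label, score in zip(scored["Label"], scored["Score"])
    ]
    return pd.Series(probs, index=scored.index, name="churn_probability")


# The single row scored earlier in this notebook: Label 1, Score 0.6639.
print(churn_probability(1, 0.6639))
```

A call might look like `predict_churn(pd.read_csv("new_churn_data.csv", index_col="customerID"))`, assuming the new data has the same columns as the prepared training data.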

In [33]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions: 
customerID
9305-CKSKC       Churn
1452-KNGVK       Churn
6723-OKKJM    No Churn
7832-POPKP    No Churn
6348-TACGU       Churn
Name: Churn_prediction, dtype: object
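To compare these predictions with the true values given in the assignment ([1, 0, 0, 1, 0]), the metrics mentioned at the top of the notebook can be computed by hand. This is a minimal sketch that assumes the true values are listed in the same row order as the printed predictions. The `classification_metrics` helper is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churn correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed churners
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churn correctly passed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# True values from the assignment text; predictions from the script output above.
y_true = [1, 0, 0, 1, 0]
label_to_int = {"Churn": 1, "No Churn": 0}
printed = ["Churn", "Churn", "No Churn", "No Churn", "Churn"]
y_pred = [label_to_int[s] for s in printed]

metrics = classification_metrics(y_true, y_pred)
print(metrics)  # accuracy 0.4, precision 1/3, recall 0.5
```

With only five rows these numbers are far too noisy to judge the model. The cross-validated scores from `compare_models` are the better guide.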


# Summary

In this notebook we used pycaret to find the best model for predicting customer churn. Ranked by AUC, pycaret selected the Gradient Boosting Classifier (AUC 0.8392, accuracy 0.7988), narrowly ahead of logistic regression (AUC 0.8354) and AdaBoost (AUC 0.8351). Its accuracy also seems in line with the random forest model (0.7769 in this comparison). We saved the model to disk so that it could be used on new data to predict whether a customer will churn. Lastly, we wrote a Python script, predict_churn.py, to automate predictions on new data. For the five rows in new_churn_data.csv it predicted Churn, Churn, No Churn, No Churn, Churn. If the true values [1, 0, 0, 1, 0] are in the same row order, two of the five predictions are correct.