# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

In [55]:
!python -V

Python 3.11.11


In [56]:
#!pip install --upgrade pycaret --user
import pycaret

In [67]:
'''
Getting our dataset.
We clean it up a little so that it matches the new_churn_data that we
are going to predict later.
'''
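import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('churn_data.csv')
df['charge_per_tenure'] = df['TotalCharges']/df['tenure']

# Map PhoneService to 1/0 and assign the result back to the column,
# rather than calling replace(..., inplace=True) on a column selection.
df['PhoneService'] = df['PhoneService'].map({'Yes':1, 'No':0})

# LabelEncoder assigns integer codes in sorted order of the category names.
label_encoder = LabelEncoder()
df['Contract'] = label_encoder.fit_transform(df['Contract'])
df['PaymentMethod'] = label_encoder.fit_transform(df['PaymentMethod'])
df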

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
0,7590-VHVEG,1,0,0,2,29.85,29.85,No,29.850000
1,5575-GNVDE,34,1,1,3,56.95,1889.50,No,55.573529
2,3668-QPYBK,2,1,0,3,53.85,108.15,Yes,54.075000
3,7795-CFOCW,45,0,1,0,42.30,1840.75,No,40.905556
4,9237-HQITU,2,1,0,2,70.70,151.65,Yes,75.825000
...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,24,1,1,3,84.80,1990.50,No,82.937500
7039,2234-XADUH,72,1,1,1,103.20,7362.90,No,102.262500
7040,4801-JZAZL,11,0,0,2,29.60,346.45,No,31.495455
7041,8361-LTMKD,4,1,0,3,74.40,306.60,Yes,76.650000
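
Reusing one `LabelEncoder` for two columns works, but the integer-to-category mapping for `Contract` is overwritten once `PaymentMethod` is fitted, which makes it harder to encode new data the same way later. A minimal sketch of one alternative: the hypothetical helper `encode_sorted` assigns codes in sorted order of the category names (the same order `LabelEncoder` uses) and returns the mapping alongside the codes. The category values below are made up for illustration and are not taken from churn_data.csv.

```python
import pandas as pd


def encode_sorted(series: pd.Series) -> tuple[pd.Series, dict]:
    """Encode categories as integers in sorted order and return the mapping."""
    categories = sorted(series.dropna().unique())
    mapping = {category: code for code, category in enumerate(categories)}
    return series.map(mapping), mapping


# Illustrative values only; the real columns come from churn_data.csv.
example = pd.Series(['B', 'A', 'B', 'C'])
codes, mapping = encode_sorted(example)

# Reuse the stored mapping so new data is encoded identically.
new_codes = pd.Series(['C', 'A']).map(mapping)
```

Keeping the mapping as a plain dictionary also makes it easy to save next to the model and apply inside the prediction script.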


In [58]:
'''
Setting up our auto ML model
'''
#!conda install -c conda-forge pycaret -y
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,7695
1,Target,Churn
2,Target type,Binary
3,Target mapping,"No: 0, Yes: 1"
4,Original data shape,"(7043, 9)"
5,Transformed data shape,"(7043, 9)"
6,Transformed train set shape,"(4930, 9)"
7,Transformed test set shape,"(2113, 9)"
8,Numeric features,7
9,Categorical features,1


In [59]:
'''
After seeing all of the different models, we can choose from either Logistic
Regression, Naive Bayes, or K Neighbors Classifier.

Since Logistic Regression has the highest AUC, let's go with that for now.
'''

best_model = compare_models(sort='AUC')

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7361,0.8343,0.7361,0.6654,0.6277,0.0119,0.0439,0.214
ridge,Ridge Classifier,0.7347,0.8243,0.7347,0.5398,0.6223,0.0,0.0,0.009
nb,Naive Bayes,0.7633,0.8175,0.7633,0.7807,0.7696,0.4297,0.4336,0.008
knn,K Neighbors Classifier,0.7775,0.7613,0.7775,0.7642,0.7669,0.3821,0.3885,0.012
svm,SVM - Linear Kernel,0.7396,0.6966,0.7396,0.755,0.7073,0.2699,0.3136,0.009
rf,Random Forest Classifier,0.7347,0.6865,0.7347,0.5398,0.6223,0.0,0.0,0.034
et,Extra Trees Classifier,0.7347,0.6138,0.7347,0.5398,0.6223,0.0,0.0,0.023
lightgbm,Light Gradient Boosting Machine,0.7347,0.5237,0.7347,0.5398,0.6223,0.0,0.0,0.073
dt,Decision Tree Classifier,0.7347,0.5,0.7347,0.5398,0.6223,0.0,0.0,0.009
qda,Quadratic Discriminant Analysis,0.7347,0.5,0.7347,0.5398,0.6223,0.0,0.0,0.008
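
Several rows in the table above share an accuracy of 0.7347 with a Kappa of 0.0. One plausible reading (an assumption, not something the table confirms) is that those models predict the majority class for every row. The sketch below uses made-up labels to show why such a predictor can look accurate while its Cohen's kappa is zero, which is one reason to sort by a metric other than accuracy. `accuracy` and `cohen_kappa` are hypothetical helpers written for this illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement beyond what the label frequencies alone would produce."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability that both pick the same label at random.
    expected = sum(
        (true_counts[label] / n) * (pred_counts[label] / n)
        for label in true_counts
    )
    return (observed - expected) / (1 - expected)


# Made-up data: 73 non-churners, 27 churners, and a model that always says 'No'.
y_true = ['No'] * 73 + ['Yes'] * 27
y_pred = ['No'] * 100

majority_accuracy = accuracy(y_true, y_pred)
majority_kappa = cohen_kappa(y_true, y_pred)
```

Here the accuracy equals the share of the majority class, and the kappa is zero because the predictor agrees with the truth no more often than chance would.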


In [60]:
'''
Let's save our logistic regression model 
'''

save_model(best_model, 'Logistic Regression')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'charge_per_tenure'],
                                     transformer=SimpleImputer(add_indica...
                                                               handle_unknown='value',
                                                               hierarchy=None,
                                                               min_samples_leaf=20,
                                                          

In [61]:
'''
Making a python script for our saved model
'''

from IPython.display import Code

%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


predictions: 
0    No
1    No
2    No
3    No
4    No
Name: prediction_label, dtype: object


In [62]:
'''
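Below is the model's confidence in each predicted label. The prediction_score
is the probability of the predicted class, so for a 'No' prediction the
probability of churn is 1 - prediction_score.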
'''

df_new_data = pd.read_csv('new_churn_data.csv')
prediction_scores = predict_model(load_model('Logistic Regression'), 
                                  df_new_data)['prediction_score']
prediction_scores

Transformation Pipeline and Model Successfully Loaded


0    0.6288
1    0.7435
2    0.9123
3    0.8161
4    0.7906
Name: prediction_score, dtype: float64
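
Because `prediction_score` belongs to whichever label was predicted, it is not yet the probability of churn that the assignment asks for. A minimal sketch of the conversion, assuming the output of `predict_model` contains `prediction_label` and `prediction_score` columns as shown above and that 'Yes' is the churn label. `churn_probability` is a hypothetical helper name, and the example frame is a stand-in built from the scores printed above.

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame, churn_label: str = 'Yes') -> pd.Series:
    """Turn predicted-label scores into the probability of churn per row."""
    labels = predictions['prediction_label']
    scores = predictions['prediction_score']
    # Keep the score where churn was predicted; otherwise take its complement.
    return scores.where(labels == churn_label, 1 - scores).rename('churn_probability')


# Stand-in for predict_model output, using the labels and scores printed above.
example = pd.DataFrame({
    'prediction_label': ['No', 'No', 'No', 'No', 'No'],
    'prediction_score': [0.6288, 0.7435, 0.9123, 0.8161, 0.7906],
})
churn_probs = churn_probability(example)
```

With all five rows predicted as 'No', every churn probability in this example falls below 0.5.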

In [63]:
'''
In this code, we rescale each prediction score to the range 0-1 using the
minimum and maximum of the five new scores, to see how confident each
prediction is relative to the others. This is min-max scaling across the new
predictions, not a true percentile against the training predictions.
'''

prediction_score_max = max(prediction_scores)
prediction_score_min = min(prediction_scores)

# enumerate supplies the row number for each score
for y, i in enumerate(prediction_scores):
    percentile = (i - prediction_score_min)/(prediction_score_max-prediction_score_min)
    print('Prediction number ', y, ' is in percentile: %.3f' %percentile)

Prediction number  0  is in percentile: 0.000
Prediction number  1  is in percentile: 0.405
Prediction number  2  is in percentile: 1.000
Prediction number  3  is in percentile: 0.661
Prediction number  4  is in percentile: 0.571
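
The loop above places each score between the minimum and maximum of the five new scores. The optional challenge asks for something different: where a new churn probability falls within the distribution of churn probabilities predicted on the training data. A minimal sketch of that calculation, assuming the training probabilities have already been collected into a list. `percentile_in_training` is a hypothetical helper, and the training values below are made up for illustration.

```python
from bisect import bisect_right


def percentile_in_training(train_probs, new_prob):
    """Percent of training churn probabilities that are <= new_prob."""
    ordered = sorted(train_probs)
    return 100.0 * bisect_right(ordered, new_prob) / len(ordered)


# Made-up training probabilities; in practice these would come from
# running the saved model over the training data.
train_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]
percentiles = [percentile_in_training(train_probs, p) for p in (0.30, 0.78)]
```

Sorting once and reusing the sorted list would be cheaper when scoring many rows; the helper sorts on every call only to keep the sketch short.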


In [68]:
'''
Now let's look at another autoML, H2O and compare its performance to
pycaret.

First we import h2o and read in our dataframe from the top.
'''

#!pip install h2o
import h2o
h2o.init()

df_h2o = h2o.H2OFrame(df)
df_h2o

Checking whether there is an H2O instance running at http://localhost:54321. connected.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html


0,1
H2O_cluster_uptime:,7 mins 19 secs
H2O_cluster_timezone:,America/Chicago
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.7
H2O_cluster_version_age:,6 months
H2O_cluster_name:,H2O_from_python_anaughton_2tcomy
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.539 Gb
H2O_cluster_total_cores:,10
H2O_cluster_allowed_cores:,10


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
7590-VHVEG,1,0,0,2,29.85,29.85,No,29.85
5575-GNVDE,34,1,1,3,56.95,1889.5,No,55.5735
3668-QPYBK,2,1,0,3,53.85,108.15,Yes,54.075
7795-CFOCW,45,0,1,0,42.3,1840.75,No,40.9056
9237-HQITU,2,1,0,2,70.7,151.65,Yes,75.825
9305-CDSKC,8,1,0,2,99.65,820.5,Yes,102.562
1452-KIOVK,22,1,0,1,89.1,1949.4,No,88.6091
6713-OKOMC,10,0,0,3,29.75,301.9,No,30.19
7892-POOKP,28,1,0,2,104.8,3046.05,Yes,108.787
6388-TABGU,62,1,1,0,56.15,3487.95,No,56.2573


In [69]:
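'''
Now we define our feature columns and target variable, then train H2O AutoML.
'''

from h2o.automl import H2OAutoML

# list.remove() returns None, so build the feature list explicitly
y = 'Churn'
x = [col for col in df_h2o.columns if col != y]
aml = H2OAutoML(seed = 0)
aml.train(x=x, y=y, training_frame=df_h2o)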


AutoML progress: |
12:33:16.237: _train param, Dropping bad and constant columns: [customerID]

█
12:33:19.913: _train param, Dropping bad and constant columns: [customerID]

██
12:33:21.808: _train param, Dropping bad and constant columns: [customerID]
12:33:25.274: _train param, Dropping unused columns: [customerID]
12:33:26.51: _train param, Dropping bad and constant columns: [customerID]

█
12:33:27.488: _train param, Dropping bad and constant columns: [customerID]

█
12:33:30.306: _train param, Dropping bad and constant columns: [customerID]

█
12:33:31.487: _train param, Dropping bad and constant columns: [customerID]
12:33:32.388: _train param, Dropping bad and constant columns: [customerID]

█
12:33:33.204: _train param, Dropping unused columns: [customerID]
12:33:33.888: _train param, Dropping unused columns: [customerID]


12:33:34.182: _train param, Dropping bad and constant columns: [customerID]

██
12:33:35.34: _train param, Dropping bad and constant columns: [customerID]


H2OJobCancelled: Job<$03017f00000132d4ffffffff$_8b3aedcf3b376eb5a1b4ec26bd4545f8> was cancelled by the user.

# Summary


We took our churn data from churn_data.csv, did a little cleanup, and used pycaret to determine the best model for predicting whether a customer will churn. We settled on the logistic regression model since it had the highest AUC (0.8343), but we could also try other models that score higher on other metrics, such as Naive Bayes or K Neighbors Classifier. We then saved this model to a pickle file and used it to predict the new churn data. The model predicted "No" for all five new rows, which matches three of the five true values [1, 0, 0, 1, 0], so the new predictions are 60% correct. There is some randomness in the results each time the notebook is run, but 3 out of 5 correct seems to be the general result. The H2O AutoML run was cancelled before it finished, so there is no H2O result to compare against pycaret.

Moving forward, it would be prudent to keep evaluating the model on new data as it arrives. It could also be useful to save, say, three models in total and run the new data through each one. Over time, we could see which one produces the best results.
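
The 60% figure in the summary can be checked directly from the notebook's output: the five predicted labels are all 'No', and the assignment lists the true values as [1, 0, 0, 1, 0]. A small sketch of that check, with the label-to-integer mapping taken from the pycaret setup table (No: 0, Yes: 1):

```python
# Predicted labels printed by predict_churn.py and the true values
# given in the assignment text.
predicted_labels = ['No', 'No', 'No', 'No', 'No']
true_values = [1, 0, 0, 1, 0]

# Mapping reported by pycaret's setup output.
label_to_int = {'No': 0, 'Yes': 1}
predicted_values = [label_to_int[label] for label in predicted_labels]

n_correct = sum(p == t for p, t in zip(predicted_values, true_values))
accuracy = n_correct / len(true_values)
print(f'{n_correct} of {len(true_values)} correct ({accuracy:.0%})')
```

All three correct rows are non-churners; both true churners are missed, so accuracy on these five rows says little about how well the model finds customers who actually churn.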