# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import timeit



## Find an ML algorithm that performs best

In [3]:
df = pd.read_csv('prepped_churn_data.csv')
df.head()

Unnamed: 0,customerID,tenure,MonthlyCharges,TotalCharges,PhoneService_No,PhoneService_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Churn_No,Churn_Yes,LogTotalCharges
0,7590-VHVEG,1,29.85,29.85,1,0,1,0,0,0,0,1,0,1,0,3.396185
1,5575-GNVDE,34,56.95,1889.5,0,1,0,1,0,0,0,0,1,1,0,7.544068
2,3668-QPYBK,2,53.85,108.15,0,1,1,0,0,0,0,0,1,0,1,4.683519
3,7795-CFOCW,45,42.3,1840.75,1,0,0,1,0,1,0,0,0,1,0,7.517928
4,9237-HQITU,2,70.7,151.65,0,1,1,0,0,0,0,1,0,0,1,5.021575


In [4]:
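# Drop both one-hot target columns and the ID so they cannot leak into the features.
# Note: 'Unnamed: 0' (the saved row index) is still among the features here.
features = df.drop(columns=['Churn_No', 'Churn_Yes', 'customerID'])
targets = df['Churn_Yes']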

x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state = 42)

In [27]:
tpot = TPOTClassifier(generations=5, n_jobs=-1, verbosity=2, random_state=42, scoring='f1')

tpot.fit(x_train, y_train)
print(tpot.score(x_test, y_test))

                                                                              
Generation 1 - Current best internal CV score: 0.6186457165242654
Generation 2 - Current best internal CV score: 0.6186457165242654
Generation 3 - Current best internal CV score: 0.6186457165242654
Generation 4 - Current best internal CV score: 0.6222373689455098
Generation 5 - Current best internal CV score: 0.6243012440455719
Best pipeline: GaussianNB(RandomForestClassifier(XGBClassifier(input_matrix, learning_rate=0.001, max_depth=2, min_child_weight=9, n_estimators=100, n_jobs=1, subsample=1.0, verbosity=0), bootstra



In [28]:
predictions = tpot.predict(x_test)
print(predictions)

[0 0 0 ... 0 0 0]


In [29]:
from sklearn.metrics import accuracy_score
print(f'Accuracy of the TPOT predictions: {accuracy_score(y_test,predictions)}')

Accuracy of the TPOT predictions: 0.7768313458262351
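
Accuracy can look reasonable even when a model rarely or never predicts churn, which is one reason the assignment suggests considering other metrics. As a minimal sketch, the hypothetical helper `binary_metrics` below (not part of the original notebook) computes accuracy, precision, recall and F1 by hand. The example uses the true values listed in the assignment for new_churn_data.csv and an all-zero prediction vector as an illustrative worst case.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churn, predicted churn
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # stayed, predicted churn
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churn, predicted stay
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stayed, predicted stay

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


true_values = [1, 0, 0, 1, 0]           # true values given in the assignment
all_zero_predictions = [0, 0, 0, 0, 0]  # a model that never predicts churn
metrics = binary_metrics(true_values, all_zero_predictions)
print(metrics)  # accuracy is 0.6, while precision, recall and F1 are all 0.0
```

With these inputs the accuracy is 60% even though no churner is found, so recall or F1 is probably a more honest metric for this problem.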


## Save the model

In [30]:
tpot.export('tpot_churn_pipeline.py')
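
As far as I understand, `tpot.export` writes Python source code for the best pipeline rather than the fitted model itself, so the exported script has to retrain before it can predict. A possible alternative is to pickle the fitted scikit-learn pipeline that TPOT keeps in `fitted_pipeline_`. The sketch below shows that idea with two hypothetical helpers, `save_model` and `load_model`; the file name in the commented usage line is also an assumption.

```python
import pickle


def save_model(model, path):
    """Serialize a fitted model (any picklable object) to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Assumed usage in this notebook (not run here; file name is hypothetical):
# save_model(tpot.fitted_pipeline_, "tpot_churn_model.pkl")
# model = load_model("tpot_churn_model.pkl")
```

Pickle files should only be loaded from sources you trust, and they are generally tied to the library versions that created them.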

## Create a Python script

In [31]:
from IPython.display import Code

Code(filename='tpot_churn_pipeline.py')
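
The assignment asks for a function that takes a pandas dataframe and returns the probability of churn for each row. A minimal sketch of such a function follows. It assumes the model exposes a scikit-learn style `predict_proba` method and was trained on 0/1 labels, so that column 1 holds the churn probability. The function name `predict_churn_proba`, the `drop_columns` parameter and the `StubModel` class are all hypothetical; `StubModel` only stands in for a fitted pipeline so the example can run on its own.

```python
import numpy as np
import pandas as pd


def predict_churn_proba(df, model, drop_columns=("customerID",)):
    """Return the probability of churn for each row of df as a pandas Series."""
    # Remove non-feature columns if they are present.
    features = df.drop(columns=[c for c in drop_columns if c in df.columns])
    proba = np.asarray(model.predict_proba(features))
    # Assumption: column 1 is the positive (churn) class, as for 0/1 labels.
    return pd.Series(proba[:, 1], index=df.index, name="churn_probability")


class StubModel:
    """Hypothetical stand-in for a fitted pipeline, used only for this demo."""

    def predict_proba(self, X):
        # Made-up rule: shorter tenure means higher churn probability.
        p = 1.0 / (1.0 + X["tenure"].to_numpy() / 12.0)
        return np.column_stack([1.0 - p, p])


new_data = pd.DataFrame(
    {"customerID": ["a", "b"], "tenure": [0, 12], "MonthlyCharges": [70.0, 20.0]}
)
probs = predict_churn_proba(new_data, StubModel())
print(probs)
```

In the real script, `StubModel()` would be replaced by the fitted pipeline, and the dataframe would need the same feature columns that were used for training.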

In [35]:
# Importing the module runs the script, which prints the predictions shown below.
import tpot_churn_pipeline

print(tpot_churn_pipeline)

[0 0 0 0 0]
<module 'tpot_churn_pipeline' from 'c:\\Users\\qwert\\Downloads\\tpot_churn_pipeline.py'>
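
One of the optional challenges is to report where each predicted probability falls in the distribution of probabilities predicted for the training data. A rough sketch of that calculation is below. The function name `probability_percentiles` is hypothetical and the probabilities are made-up numbers for illustration, not output from my model.

```python
import numpy as np


def probability_percentiles(train_probs, new_probs):
    """Percentile rank (0-100) of each new probability among the training ones."""
    sorted_train = np.sort(np.asarray(train_probs, dtype=float))
    # side="right" counts how many training values are <= each new value.
    counts = np.searchsorted(
        sorted_train, np.asarray(new_probs, dtype=float), side="right"
    )
    return 100.0 * counts / len(sorted_train)


# Made-up training probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
new_probs = [0.78, 0.05, 0.95]
percentiles = probability_percentiles(train_probs, new_probs)
print(percentiles)  # 0.78 lands at the 80th percentile of this toy distribution
```

This uses the "share of training values at or below the new value" definition of a percentile; other definitions exist and give slightly different numbers.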


# Summary

The new churn data needed to be processed before the pipeline could be run. In production I would add more checking to make sure that the correct columns are all present before running the pipeline. In a different version of this assignment I ran TPOT with the default number of generations, which was a very bad idea: the default (100) is far more than the 5 generations used here, so the search takes much longer to finish.
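
The column check mentioned above could look roughly like the sketch below. The function name `check_columns` is hypothetical, and the expected column list is shortened to three of the training columns to keep the example small.

```python
import pandas as pd


def check_columns(df, expected_columns):
    """Fail early if any expected column is missing; return columns in training order."""
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Select only the expected columns, in the order used for training.
    return df[list(expected_columns)]


# Shortened, illustrative list; the real one would hold every training feature.
expected = ["tenure", "MonthlyCharges", "TotalCharges"]
incoming = pd.DataFrame(
    {"TotalCharges": [29.85], "tenure": [1], "MonthlyCharges": [29.85], "extra": [0]}
)
aligned = check_columns(incoming, expected)
print(list(aligned.columns))
```

Raising an error is one design choice; filling missing one-hot columns with zeros would be another, but it can hide real problems in the input data.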

My model predicted that none of the five customers in the new data would churn. The assignment gives the true values as [1, 0, 0, 1, 0], so that is 3 of 5 correct, but the model missed both customers who actually churned.