# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
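
The first optional challenge asks for the percentile of each new prediction within the distribution of training-set probabilities. A minimal sketch of that calculation is below, assuming the training probabilities are already available as a plain list. The helper `percentile_of` and all of the numbers are hypothetical and are not results from this notebook.

```python
from bisect import bisect_right


def percentile_of(train_probs, new_prob):
    """Return the percentage of training probabilities that are <= new_prob."""
    ordered = sorted(train_probs)
    # bisect_right counts how many sorted values are <= new_prob
    rank = bisect_right(ordered, new_prob)
    return 100.0 * rank / len(ordered)


# Made-up training probabilities, for illustration only
train_probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.50, 0.60, 0.70, 0.80, 0.90]
new_probs = [0.78, 0.12]

percentiles = [percentile_of(train_probs, p) for p in new_probs]
print(percentiles)
```

For many new rows it would be cheaper to sort the training probabilities once and reuse the sorted list, rather than sorting inside the helper on every call.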

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import timeit

In [4]:
df = pd.read_csv("../Assignment/data/prepped_churn_data.csv", index_col='customerID')
df.head()

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_to_tenure_ratio,TotalCharges_to_tenure_ratio
7590-VHVEG,1,1,0,0,29.85,29.85,1,0.033501,0.033501
5575-GNVDE,34,0,1,1,56.95,1889.5,1,0.597015,0.017994
3668-QPYBK,2,0,0,1,53.85,108.15,0,0.03714,0.018493
7795-CFOCW,45,1,1,2,42.3,1840.75,1,1.06383,0.024447
9237-HQITU,2,0,0,0,70.7,151.65,0,0.028289,0.013188


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 9 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   tenure                          7032 non-null   int64  
 1   PhoneService                    7032 non-null   int64  
 2   Contract                        7032 non-null   int64  
 3   PaymentMethod                   7032 non-null   int64  
 4   MonthlyCharges                  7032 non-null   float64
 5   TotalCharges                    7032 non-null   float64
 6   Churn                           7032 non-null   int64  
 7   MonthlyCharges_to_tenure_ratio  7032 non-null   float64
 8   TotalCharges_to_tenure_ratio    7032 non-null   float64
dtypes: float64(4), int64(5)
memory usage: 549.4+ KB


In [6]:
df.shape

(7032, 9)

In [7]:
features = df.drop('Churn', axis=1)
targets = df['Churn']

x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state=42)

In [8]:
%%time
tpot = TPOTClassifier(generations=5, population_size=50, cv=5, random_state=42, scoring='accuracy', verbosity=2, n_jobs=-1)
tpot.fit(x_train,y_train)
print(tpot.score(x_test,y_test))

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7965504465048519

Generation 2 - Current best internal CV score: 0.7967416387132746

Generation 3 - Current best internal CV score: 0.7967416387132746

Generation 4 - Current best internal CV score: 0.7967416387132746

Generation 5 - Current best internal CV score: 0.7967416387132746

Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=False, criterion=entropy, max_features=0.25, min_samples_leaf=10, min_samples_split=5, n_estimators=100)
0.7986348122866894
CPU times: total: 52 s
Wall time: 2min 16s


In [9]:
predictions = tpot.predict(x_test)
predictions


array([1, 1, 0, ..., 1, 0, 1], dtype=int64)

In [10]:
from sklearn.metrics import accuracy_score
print(f'Accuracy of TPOT predictions: {accuracy_score(y_test,predictions)}')

Accuracy of TPOT predictions: 0.7986348122866894
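
Accuracy is the metric used for the TPOT search in this notebook, but the assignment also mentions precision and recall. As a sketch of how the three differ, the snippet below computes them from scratch for the five true labels given in the assignment. The `y_pred` values are hypothetical and are not output from this notebook.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


y_true = [1, 0, 0, 1, 0]  # true values listed in the assignment
y_pred = [1, 0, 1, 0, 0]  # hypothetical predictions, for illustration only

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Recall is often the more relevant number for churn, because a missed churner (a false negative) usually costs more than a false alarm.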


In [11]:
from IPython.display import Code
Code('tpot_churn_pipeline_processed.py')

In [12]:
print('Churn Predictions')
print(predictions)
print('Actuals')
print(y_test)

Churn Predictions
[1 1 0 ... 1 0 1]
Actuals
customerID
0394-YONDK    1
6933-VLYFX    1
9360-OMDZZ    1
7912-SYRQT    0
7191-ADRGF    1
             ..
3552-CTCYF    1
5915-ANOEI    1
7994-XIRTR    1
9172-ANCRX    0
3551-GAEGL    1
Name: Churn, Length: 1758, dtype: int64


In [13]:
tpot.export('tpot_churn_pipeline_raw.py')
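
The script written by `tpot.export` typically retrains the pipeline from a CSV that includes a target column, so on its own it does not cover the "save the model to disk" step. One possible alternative is to persist the already-fitted estimator with `joblib`. The sketch below uses a small synthetic dataset and a plain `ExtraTreesClassifier` with the hyperparameters of the best pipeline reported above. The file name `churn_model.joblib` and the added `random_state` are assumptions. In this notebook, the object to save would presumably be the fitted pipeline that TPOT exposes as `tpot.fitted_pipeline_`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the training data (NOT the churn dataset)
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Hyperparameters copied from the best pipeline TPOT reported;
# random_state is added here only to make the sketch repeatable
model = ExtraTreesClassifier(
    bootstrap=False, criterion="entropy", max_features=0.25,
    min_samples_leaf=10, min_samples_split=5, n_estimators=100,
    random_state=42,
)
model.fit(X_train, y_train)

# Save the fitted model, then load it back as a prediction script would
model_path = os.path.join(tempfile.mkdtemp(), "churn_model.joblib")
joblib.dump(model, model_path)
loaded = joblib.load(model_path)

# The reloaded model should give the same predictions without retraining
X_new = rng.normal(size=(5, 4))
same_predictions = np.array_equal(model.predict(X_new), loaded.predict(X_new))
print(same_predictions)
```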

In [14]:
predictions

array([1, 1, 0, ..., 1, 0, 1], dtype=int64)

In [15]:
%run try2.py

[1 1]


In [16]:
%run tpot_churn_pipeline_processed.py

[1 1]


# Summary

Write a short summary of the process and results here.

This assignment confused me quite a bit. I believe the overall objective was to build a predictive model and apply it to a new dataset. I used TPOT to build the model and generate a Python program. This is where things fell apart for me, and based on our discussion in class, it sounded like quite a few others struggled as well. Here is a breakdown of what I did and the decisions I made.

First, I modified the raw pipeline code to take in a new churn dataset with a target variable. I assumed that the target variable was supposed to mimic the Churn column. There was a note on the example stating that the new dataset needs to match the format of our existing dataset, and this is where I got confused: our existing dataset had Churn data in it, and the new dataset did not. Because the code would not run without target data, I added the Churn values to the new dataset based on the true values given in the assignment. This did not feel right, and I am not sure it was the proper way to handle it. I look forward to seeing the solution to find out whether I was close and how to do this properly.

I then ran the new Python script against the new dataset and returned the resulting predictions. I did this with two different Python scripts: the first was modeled after the example given in the coursework, and the second closely resembled the script generated by TPOT. Both produced exactly the same result.
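
One way around the missing-target problem described above is to fit the model once and then call `predict_proba` on the new rows, which needs only the feature columns. A minimal sketch follows, assuming a scikit-learn-style classifier. The function `predict_churn_proba`, the toy data, and the choice of two feature columns are hypothetical stand-ins for the real churn data and are not the course solution.

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier


def predict_churn_proba(model, df, feature_names):
    """Return the predicted probability of churn (class 1) for each row."""
    # Select only the training features, in training order; a 'Churn'
    # column is ignored if present and is not required
    features = df[feature_names]
    # Column 1 of predict_proba corresponds to model.classes_[1]
    proba = model.predict_proba(features)[:, 1]
    return pd.Series(proba, index=df.index, name="churn_probability")


# Toy training data (made up); Churn is the target
train = pd.DataFrame({
    "tenure": [1, 2, 3, 40, 50, 60, 2, 55],
    "MonthlyCharges": [70.0, 80.0, 75.0, 30.0, 25.0, 20.0, 90.0, 35.0],
    "Churn": [1, 1, 1, 0, 0, 0, 1, 0],
})
feature_names = ["tenure", "MonthlyCharges"]
model = ExtraTreesClassifier(n_estimators=50, random_state=42)
model.fit(train[feature_names], train["Churn"])

# New data has no Churn column, and none is needed for prediction
new_data = pd.DataFrame(
    {"tenure": [1, 58], "MonthlyCharges": [85.0, 22.0]},
    index=["new-0001", "new-0002"],
)
probabilities = predict_churn_proba(model, new_data, feature_names)
print(probabilities)
```

The true values from the assignment would then be used only to check the predictions afterwards, not as an input to the script.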