# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
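
For the percentile challenge above, one possible approach is to sort the training-set probabilities once and look up where a new probability falls. This is a minimal sketch using only the standard library; the function name `percentile_of` and the sample probabilities are made up for illustration and are not taken from the churn data.

```python
from bisect import bisect_right


def percentile_of(prob, sorted_train_probs):
    """Return the percentage of training probabilities that are <= prob."""
    if not sorted_train_probs:
        raise ValueError("need at least one training probability")
    # bisect_right gives the count of values <= prob in a sorted list.
    rank = bisect_right(sorted_train_probs, prob)
    return 100.0 * rank / len(sorted_train_probs)


# Hypothetical training-set churn probabilities (illustrative values only).
train_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.31, 0.40, 0.55, 0.62, 0.78, 0.91])

for p in (0.78, 0.15):
    print(f"probability {p:.2f} -> percentile {percentile_of(p, train_probs):.0f}")
```

With these ten made-up values, a probability of 0.78 lands at the 90th percentile, which matches the example in the challenge text.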

In [1]:
conda install numpy scipy scikit-learn pandas joblib pytorch



In [2]:
pip install deap update_checker tqdm stopit xgboost

Collecting deap
  Downloading deap-1.4.1.tar.gz (1.1 MB)
  Preparing metadata (setup.py) ... done
Collecting update_checker
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting stopit
  Downloading stopit-1.1.2.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Collecting xgboost
  Obtaining dependency information for xgboost from https://files.pythonhosted.org/packages/6d/d1/3e954de1d492129710e8625349a7b86eb287a4f413c5b5c15522f89a6c04/xgboost-2.0.0-py3-none-macosx_12_0_arm64.whl.metadata
  Downloading xgboost-2.0.0-py3-none-macosx_12_0_arm64.whl.metadata (2.0 kB)
Downloading xgboost-2.0.0-py3-none-macosx_12_0_arm64.whl (1.9 MB)
Building wheels for collected pa

In [3]:
pip install tpot

Collecting tpot
  Obtaining dependency information for tpot from https://files.pythonhosted.org/packages/7b/a7/0060d028906ecd058b1331c3ce6f3f19ba03464b21dc9abbbaf66b0a1091/TPOT-0.12.1-py3-none-any.whl.metadata
  Downloading TPOT-0.12.1-py3-none-any.whl.metadata (2.0 kB)
Downloading TPOT-0.12.1-py3-none-any.whl (87 kB)
Installing collected packages: tpot
Successfully installed tpot-0.12.1
Note: you may need to restart the kernel to use updated packages.


## Small Note

The cells above install the modules required to run the code. It was nice that everything installed cleanly, so I did not need to fix any errors or deprecations.

In [5]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

import timeit 

In [12]:
df = pd.read_csv('/Users/blandrumjeffries/Downloads/churn_data.csv')
df.head()

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,1,No,Month-to-month,Electronic check,29.85,29.85,No
1,5575-GNVDE,34,Yes,One year,Mailed check,56.95,1889.5,No
2,3668-QPYBK,2,Yes,Month-to-month,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,45,No,One year,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,2,Yes,Month-to-month,Electronic check,70.7,151.65,Yes


In [13]:
# Encode the target as integers: 0 = did not churn, 1 = churned.
df['Churn'] = df['Churn'].replace({'No': 0, 'Yes': 1})

# Encode the categorical features as integer codes.
df['Contract'] = df['Contract'].replace({'Month-to-month': 0, 'One year': 1, 'Two year': 2})

df['PaymentMethod'] = df['PaymentMethod'].replace({'Electronic check': 0, 'Mailed check': 1, 'Credit card (automatic)': 2, 'Bank transfer (automatic)': 3})

df['PhoneService'] = df['PhoneService'].replace({'No': 0, 'Yes': 1})

In [20]:
df.drop('customerID', axis=1, inplace=True)


In [31]:
df.to_csv('WEEK5FIXEDDATA.csv', index=False)

## Small Note
Again this recurring issue: I loaded the correct, already edited file, and pandas still complains about the axis every time. I saved the data anyway, so it is no big deal!
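
If the complaint is a `KeyError` saying that `customerID` is not found in the axis (an assumption, since the traceback is not shown here), it usually means the column was already removed by an earlier run of the cell, or is absent from the saved file. Below is a minimal sketch of a drop that is safe to re-run, using a tiny made-up frame in place of the real data.

```python
import pandas as pd

# Tiny stand-in for the churn data (illustrative rows only).
df = pd.DataFrame({
    'customerID': ['7590-VHVEG', '5575-GNVDE'],
    'tenure': [1, 34],
    'Churn': [0, 0],
})

# errors='ignore' skips labels that are missing instead of raising KeyError,
# so running this line a second time is harmless.
df = df.drop(columns=['customerID'], errors='ignore')
df = df.drop(columns=['customerID'], errors='ignore')

print(list(df.columns))
```

Assigning the result back instead of using `inplace=True` also makes it easier to see which version of `df` a cell is working with when cells are run out of order.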

In [14]:
features = df.drop('Churn', axis=1)
targets = df['Churn']

x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state=42)

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   customerID      7043 non-null   object 
 1   tenure          7043 non-null   int64  
 2   PhoneService    7043 non-null   int64  
 3   Contract        7043 non-null   int64  
 4   PaymentMethod   7043 non-null   int64  
 5   MonthlyCharges  7043 non-null   float64
 6   TotalCharges    7032 non-null   float64
 7   Churn           7043 non-null   int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 440.3+ KB
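
The `df.info()` output shows 7032 non-null `TotalCharges` values out of 7043 rows, so 11 values are missing. TPOT fills these in on its own (the "Imputing missing values in feature set" message during fitting). The sketch below shows one way to do the same by hand on a made-up frame; using the median is an assumption for illustration, not something this notebook specifies.

```python
import pandas as pd

# Made-up sample with one missing TotalCharges value.
df = pd.DataFrame({
    'tenure': [1, 34, 2, 45],
    'TotalCharges': [29.85, 1889.5, None, 1840.75],
})

n_missing = int(df['TotalCharges'].isna().sum())
median_charge = df['TotalCharges'].median()  # missing values are skipped

# Replace the missing entries with the column median.
df['TotalCharges'] = df['TotalCharges'].fillna(median_charge)

print(f"filled {n_missing} missing value(s) with median {median_charge}")
```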


In [25]:
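%%time
# Search for the pipeline with the best 5-fold cross-validated precision.
tpot = TPOTClassifier(generations=5, population_size=50, cv=5, random_state=42,
                      scoring='precision', verbosity=2, n_jobs=-1)

tpot.fit(x_train, y_train)
# score() uses the metric passed as `scoring`, so this prints test-set precision.
print(tpot.score(x_test, y_test))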

Imputing missing values in feature set


Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7654634630905817

Generation 2 - Current best internal CV score: 0.7654634630905817

Generation 3 - Current best internal CV score: 0.7654634630905817

Generation 4 - Current best internal CV score: 0.7874905211747317

Generation 5 - Current best internal CV score: 0.7948519948519949

Best pipeline: XGBClassifier(Nystroem(input_matrix, gamma=0.7000000000000001, kernel=additive_chi2, n_components=1), learning_rate=0.01, max_depth=6, min_child_weight=16, n_estimators=100, n_jobs=1, subsample=0.8500000000000001, verbosity=0)
Imputing missing values in feature set
0.7894736842105263
CPU times: user 23.3 s, sys: 9.64 s, total: 33 s
Wall time: 55.8 s
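
The 0.789 printed above is precision on the test set, because `score()` uses the `scoring` metric given to `TPOTClassifier`. Precision and accuracy answer different questions, and a small worked example may make the difference concrete. The labels below are made up for illustration and are not the churn results.

```python
def precision(y_true, y_pred):
    """Of the rows predicted as churn (1), the fraction that really churned."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(y_true, y_pred):
    """Fraction of all rows whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: three real churners, but the model flags only one row.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

print("precision:", precision(y_true, y_pred))
print("accuracy: ", accuracy(y_true, y_pred))
```

In this toy case precision is perfect while accuracy is 0.75, because a model that rarely predicts churn can be right whenever it does and still miss most churners. Recall is the metric that would expose those misses.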


In [26]:
predictions = tpot.predict(x_test)
predictions

Imputing missing values in feature set


array([0, 0, 0, ..., 0, 0, 0])

I used precision as the scoring metric because I felt it fit this problem better than accuracy. The cell above then produces a prediction for each row of the test set.
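
Note that `predict()` returns class labels (0 or 1), while the assignment asks for the probability of churn for each row. Most scikit-learn classifiers expose `predict_proba` for that, and a fitted TPOT pipeline can be used the same way when its final estimator supports it. Here is a minimal sketch of such a function; `LogisticRegression` and the toy data stand in for the fitted TPOT pipeline, and `predict_churn_proba` is a hypothetical name.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def predict_churn_proba(model, df):
    """Return a Series with the probability of churn (class 1) for each row."""
    # Find which predict_proba column corresponds to class 1.
    churn_col = list(model.classes_).index(1)
    proba = model.predict_proba(df)[:, churn_col]
    return pd.Series(proba, index=df.index, name='churn_probability')


# Toy training data (illustrative only, not the churn dataset).
train = pd.DataFrame({
    'tenure': [1, 2, 3, 40, 50, 60],
    'MonthlyCharges': [70.0, 80.0, 75.0, 30.0, 25.0, 20.0],
})
churned = [1, 1, 1, 0, 0, 0]

model = LogisticRegression(max_iter=1000).fit(train, churned)

new_data = pd.DataFrame({'tenure': [2, 55], 'MonthlyCharges': [78.0, 22.0]})
probs = predict_churn_proba(model, new_data)
print(probs)
```

Returning a Series that shares the input's index keeps each probability attached to its row, which helps when the new data is later joined back to customer records.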

In [27]:
print('Predictions for test data set')
print(predictions)
print('Actuals for test data set')
print(y_test)

Predictions for test data set
[0 0 0 ... 0 0 0]
Actuals for test data set
5909    0
3670    0
6220    0
5905    0
6435    0
       ..
476     0
1607    1
6808    0
2962    1
3955    0
Name: Churn, Length: 1761, dtype: int64


In [28]:
from sklearn.metrics import accuracy_score
print(f'Accuracy of the TPOT predictions: {accuracy_score(y_test,predictions)}')

Accuracy of the TPOT predictions: 0.7535491198182851


In [29]:
tpot.export('tpot_churn_pipeline.py')
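
`tpot.export()` writes Python source code for the best pipeline, not the trained model itself. To save a fitted model to disk, one common option is `joblib`. This is a hedged sketch: the small scikit-learn pipeline below stands in for TPOT's fitted pipeline (available as `tpot.fitted_pipeline_` after `fit`), and the file name `churn_model.joblib` is made up.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for tpot.fitted_pipeline_ (toy data, illustrative only).
X = [[1, 70.0], [2, 80.0], [40, 30.0], [60, 20.0]]
y = [1, 1, 0, 0]
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Save into a temporary directory under a hypothetical file name.
model_path = os.path.join(tempfile.mkdtemp(), 'churn_model.joblib')
joblib.dump(pipeline, model_path)

# Later, or in another script: load the model and predict probabilities.
loaded = joblib.load(model_path)
print(loaded.predict_proba([[3, 75.0]])[:, 1])
```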

In [30]:
from IPython.display import Code

Code('tpot_churn_pipeline.py')

In [37]:
%run tpot_churn_pipeline.py

[0 0 0 ... 0 0 0]


In [38]:
predictions

array([0, 0, 0, ..., 0, 0, 0])

# Summary

This assignment was rather interesting. The first part, the data manipulation, is the same as before, which makes it easier to follow now. At first I found it a bit complicated and confusing, but now I feel much better about it. The second part of the assignment was really nice and straightforward. If I am honest, I do not like working out of the notebook setting; I feel it is something that would not be done in a normal work setting. The transition this week into an editor and Git is amazing. The coolest part of the assignment was the move into Python code that brings everything together: we have learned many commands, and they all convert into a Python file whose code we are able to run.