# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
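
The percentile challenge above could be approached roughly as follows. This is a minimal sketch using only the standard library; `percentile_of_score` and the sample `train_probs` values are hypothetical, and in practice the training probabilities would come from the model's predictions on the training data.

```python
from bisect import bisect_right


def percentile_of_score(train_probs, prob):
    """Return the percentage of training probabilities that are <= prob."""
    ordered = sorted(train_probs)
    if not ordered:
        raise ValueError("train_probs must not be empty")
    # bisect_right gives the number of sorted values that are <= prob
    return 100.0 * bisect_right(ordered, prob) / len(ordered)


# Made-up training probabilities, for illustration only
train_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]
print(percentile_of_score(train_probs, 0.78))
```

Sorting once and reusing the sorted list would be cheaper if many new predictions need a percentile.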

In [1]:
#!pip install pandas

In [2]:
#loading our data into a data frame using pandas 
import pandas as pd
df = pd.read_csv('prepped_churn_data.csv')
df

Unnamed: 0.1,Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,totalCharges_tenure_ratio
0,0,5919,1,0,0,2,29.85,29.85,0,29.850000
1,1,38,34,1,2,1,56.95,1889.50,0,55.573529
2,2,1101,2,1,0,1,53.85,108.15,1,54.075000
3,3,2783,45,0,2,3,42.30,1840.75,0,40.905556
4,4,3241,2,1,0,2,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...,...,...
7038,7038,1621,24,1,2,1,84.80,1990.50,0,82.937500
7039,7039,4517,72,1,2,0,103.20,7362.90,0,102.262500
7040,7040,6155,11,0,0,2,29.60,346.45,0,31.495455
7041,7041,5314,4,1,0,1,74.40,306.60,1,76.650000


In [3]:
#use pycaret for autoML
#install the pycaret package with pip (it is also available from conda-forge)
#!pip install pycaret

In [4]:
#import specific functions or classes 
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [5]:
#set up autoML
automl = setup(df, target = 'Churn')

Unnamed: 0,Description,Value
0,session_id,7094
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(7043, 10)"
5,Missing Values,True
6,Numeric Features,6
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [6]:
automl[6]

[<pandas.io.formats.style.Styler at 0x1304a9fad00>]

In [7]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
ada,Ada Boost Classifier,0.797,0.8375,0.5133,0.6514,0.5736,0.4429,0.4487,0.059
gbc,Gradient Boosting Classifier,0.7947,0.8373,0.495,0.6509,0.5617,0.431,0.4383,0.127
lr,Logistic Regression,0.7929,0.8369,0.5216,0.6368,0.5726,0.4379,0.4422,0.651
lightgbm,Light Gradient Boosting Machine,0.7921,0.8269,0.5087,0.6374,0.5654,0.4311,0.4361,0.034
lda,Linear Discriminant Analysis,0.7888,0.8282,0.53,0.6221,0.5719,0.433,0.4358,0.007
ridge,Ridge Classifier,0.788,0.0,0.4645,0.6422,0.5383,0.4054,0.4148,0.005
knn,K Neighbors Classifier,0.7649,0.738,0.4326,0.5793,0.4947,0.3458,0.3524,0.014
rf,Random Forest Classifier,0.7613,0.7869,0.4699,0.5633,0.5119,0.3556,0.3585,0.134
et,Extra Trees Classifier,0.7493,0.7561,0.46,0.5349,0.4944,0.329,0.3308,0.135
dummy,Dummy Classifier,0.7337,0.5,0.0,0.0,0.0,0.0,0.0,0.005
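
The Dummy Classifier row suggests why accuracy alone can mislead on this data: about 0.73 accuracy with zero recall is consistent with a model that never predicts churn. A minimal sketch of how these metrics relate, using hypothetical confusion-matrix counts (not taken from this dataset):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision and recall are undefined when their denominators are zero;
    # report 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts: 1000 customers, 266 of whom churn.
# A model that always predicts "no churn" gets every non-churner right.
always_no = classification_metrics(tp=0, fp=0, fn=266, tn=734)
print(always_no)

# A model that catches half of the churners at the cost of some false alarms.
useful = classification_metrics(tp=133, fp=70, fn=133, tn=664)
print(useful)
```

If missing a churner costs more than a false alarm, recall or AUC may be a better ranking metric than accuracy.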


In [8]:
best_model

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=7094)

In [9]:
df.iloc[-2:-1].shape

(1, 10)

In [10]:
predict_model(best_model, df.iloc[-2:-1])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Ada Boost Classifier,1.0,0,1.0,1.0,1.0,,0.0


Unnamed: 0.1,Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,totalCharges_tenure_ratio,Label,Score
7041,7041,5314,4,1,0,1,74.4,306.6,1,76.65,1,0.5005


In [11]:
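#note: 'LDA' is only the file name; the model being saved is the AdaBoost classifier selected above
save_model(best_model, 'LDA')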

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=['Unnamed: 0', 'customerID'],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numer...
                 ('dummy', Dummify(target='Churn')),
                 ('fix_perfect', Remove_100(target='Churn')),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs', 'passthrough'), ('pca', 'pass

In [12]:
import pickle

with open('LDA_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [15]:
with open('LDA_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [17]:
loaded_lda = load_model('LDA')

Transformation Pipeline and Model Successfully Loaded


In [18]:
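#assumption: new_data is the second-to-last row without the target column, which matches the output below
new_data = df.iloc[-2:-1].drop(columns='Churn')
predict_model(loaded_lda, new_data)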

Unnamed: 0.1,Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,totalCharges_tenure_ratio,Label,Score
7041,7041,5314,4,1,0,1,74.4,306.6,76.65,1,0.5005
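
The load-and-predict steps above could be wrapped in a module function along these lines. This is a sketch, assuming PyCaret 2.x, where `predict_model` adds a `Label` column and a `Score` column that holds the probability of the predicted label rather than of churn. The names `churn_probability` and `predict_churn` are hypothetical, and `'LDA'` is the model name saved earlier.

```python
def churn_probability(labels, scores):
    """Turn PyCaret-style Label/Score pairs into the probability of churn.

    Assumes Score is the probability of the predicted Label, so a row
    predicted as 0 with Score 0.9 has a churn probability of about 0.1.
    """
    return [
        float(score) if int(label) == 1 else 1.0 - float(score)
        for label, score in zip(labels, scores)
    ]


def predict_churn(df, model_name="LDA"):
    """Return the churn probability for each row of a pandas DataFrame."""
    # Imported here so churn_probability can be used without PyCaret installed.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    predictions = predict_model(model, data=df)
    return churn_probability(predictions["Label"], predictions["Score"])


# Quick check of the conversion on literal values (no model needed)
example = churn_probability([1, 0], [0.5005, 0.9])
print(example)
```

Calling `predict_churn(pd.read_csv('new_churn_data.csv'))` would then return one probability per row, assuming the CSV has the same feature columns as the training data.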


In [25]:
from IPython.display import Code

Code('Week_5_Assignment_Gabriel_Moore.ipynb')

ClassNotFound: no lexer for filename 'Week_5_Assignment_Gabriel_Moore.ipynb' found

In [26]:
%run Week_5_Assignment_Gabriel_Moore.ipynb

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7846,0.8332,0.4999,0.6313,0.5575,0.4178,0.423,0.121
ada,Ada Boost Classifier,0.7844,0.8306,0.5022,0.6307,0.5576,0.4179,0.4235,0.059
lda,Linear Discriminant Analysis,0.784,0.8209,0.5238,0.6235,0.5683,0.4259,0.4294,0.007
ridge,Ridge Classifier,0.7828,0.0,0.4634,0.6398,0.5362,0.3994,0.4089,0.005
lr,Logistic Regression,0.7826,0.8295,0.5036,0.6255,0.5565,0.4151,0.4201,0.608
lightgbm,Light Gradient Boosting Machine,0.7817,0.8222,0.5089,0.6213,0.5591,0.4161,0.42,0.032
rf,Random Forest Classifier,0.7611,0.7908,0.4888,0.5727,0.5269,0.3685,0.3709,0.128
knn,K Neighbors Classifier,0.7558,0.7381,0.4299,0.5686,0.4891,0.3328,0.3387,0.014
et,Extra Trees Classifier,0.7513,0.7695,0.4888,0.5495,0.517,0.3504,0.3517,0.129
dt,Decision Tree Classifier,0.7333,0.6713,0.5216,0.5097,0.515,0.3313,0.3317,0.007


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,1.0,0,1.0,1.0,1.0,,0.0


Transformation Pipeline and Model Successfully Saved
Transformation Pipeline and Model Successfully Loaded


At first glance this project looked fairly easy, but it turned out to be the most troublesome of our projects so far, in large part because of the environment issues I faced. PyCaret only works with select versions of Python (for now). I was initially running Python 3.10, which did not work because PyCaret needed Python 3.7 or 3.8, so I had to create a new conda environment.
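
A small guard can surface this kind of version mismatch early. A minimal sketch, assuming the 3.7/3.8 requirement described above (the supported range depends on the PyCaret release, so the `SUPPORTED` set is an assumption to adjust):

```python
import sys

# Assumed supported versions, taken from the experience described above.
SUPPORTED = {(3, 7), (3, 8)}


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter's major.minor version is in SUPPORTED."""
    return tuple(version_info[:2]) in SUPPORTED


if not python_is_supported():
    print(
        "Python %d.%d may not work with this PyCaret version; "
        "consider a separate conda environment." % sys.version_info[:2]
    )
```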

Only a handful of commands were needed to complete this project. However, the complexity of what is happening "under the hood" should not be overlooked. I do admit I must spend more time studying such topics. 

# Summary

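I used PyCaret's `setup` and `compare_models` on the prepared churn data, ranking models by the default metric, accuracy. In the first run the AdaBoost classifier ranked best (accuracy 0.797, AUC 0.8375), narrowly ahead of gradient boosting and logistic regression. I saved the model with `save_model` (under the name 'LDA') and with `pickle`, reloaded it with `load_model`, and used `predict_model` to score a single row from the data, which was labeled as churn with a score of 0.5005. When the notebook was rerun with `%run`, gradient boosting ranked first instead (accuracy 0.7846), so the top models are close enough that the ranking changes between runs.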