# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
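
The percentile challenge in the optional list can be sketched as follows. This is a minimal illustration that uses only the standard library; the helper name `percentile_of` and the sample `training_probs` values are hypothetical and are not taken from the assignment data.

```python
from bisect import bisect_right


def percentile_of(prob, training_probs):
    """Return the percentage of training probabilities that are <= prob."""
    if not training_probs:
        raise ValueError("training_probs must not be empty")
    ordered = sorted(training_probs)
    # bisect_right counts how many sorted values are <= prob
    return 100.0 * bisect_right(ordered, prob) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only
training_probs = [0.0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8, 1.0]
print(percentile_of(0.78, training_probs))
```

In practice `training_probs` would be the churn probabilities predicted for the training rows, computed once and stored alongside the saved model.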

In [None]:
# this is an example

# load the data


In [2]:
import pandas as pd

# raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r'C:\Datascience\prepared_churn_data (1).csv')
df

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
0,7590-VHVEG,1.0,0,0,3,29.85,29.85,0,29.850000
1,5575-GNVDE,34.0,1,1,2,56.95,1889.50,0,55.573529
2,3668-QPYBK,2.0,1,0,2,53.85,108.15,1,54.075000
3,7795-CFOCW,45.0,0,1,1,42.30,1840.75,0,40.905556
4,9237-HQITU,2.0,1,0,3,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,24.0,1,1,2,84.80,1990.50,0,82.937500
7039,2234-XADUH,72.0,1,1,0,103.20,7362.90,0,102.262500
7040,4801-JZAZL,11.0,0,0,3,29.60,346.45,0,31.495455
7041,8361-LTMKD,4.0,1,0,2,74.40,306.60,1,76.650000


# AutoML with PyCaret

In [3]:
from pycaret.classification import *

In [4]:
automl = setup(df, target='Churn')


Unnamed: 0,Description,Value
0,Session id,8020
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 9)"
4,Transformed data shape,"(7043, 9)"
5,Transformed train set shape,"(4930, 9)"
6,Transformed test set shape,"(2113, 9)"
7,Numeric features,7
8,Categorical features,1
9,Preprocess,True


In [5]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
knn,K Neighbors Classifier,0.7716,0.7589,0.4557,0.5915,0.5134,0.3679,0.3739,0.585
nb,Naive Bayes,0.758,0.8184,0.6368,0.5373,0.5823,0.4138,0.4172,0.019
svm,SVM - Linear Kernel,0.7361,0.7114,0.4202,0.5918,0.4518,0.2997,0.3317,0.031
lr,Logistic Regression,0.7355,0.8367,0.0076,0.5167,0.0149,0.0087,0.0417,0.866
dt,Decision Tree Classifier,0.7347,0.5,0.0,0.0,0.0,0.0,0.0,0.021
ridge,Ridge Classifier,0.7347,0.8264,0.0,0.0,0.0,0.0,0.0,0.031
rf,Random Forest Classifier,0.7347,0.6946,0.0,0.0,0.0,0.0,0.0,0.122
qda,Quadratic Discriminant Analysis,0.7347,0.5,0.0,0.0,0.0,0.0,0.0,0.018
ada,Ada Boost Classifier,0.7347,0.5,0.0,0.0,0.0,0.0,0.0,0.019
gbc,Gradient Boosting Classifier,0.7347,0.4884,0.0,0.0,0.0,0.0,0.0,0.081
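
Several rows in the table above share an accuracy of 0.7347 while recall, precision, and F1 are all 0. That pattern is consistent with a model that predicts "no churn" for every row, which is why accuracy alone can be a misleading metric on imbalanced data. The sketch below reproduces the effect with made-up labels (73 non-churners and 27 churners, chosen only for illustration); `binary_metrics` is a hypothetical helper, not a PyCaret function.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when a denominator is zero (no predicted or actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 73 customers who stayed, 27 who churned
y_true = [0] * 73 + [1] * 27
always_no = [0] * 100  # a "model" that never predicts churn
print(binary_metrics(y_true, always_no))
```

The always-no predictor scores 0.73 accuracy without finding a single churner, so recall, F1, or AUC may be a better ranking metric when catching churners matters.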


In [6]:
best_model

In [7]:
df.iloc[-4:-1]

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
7039,2234-XADUH,72.0,1,1,0,103.2,7362.9,0,102.2625
7040,4801-JZAZL,11.0,0,0,3,29.6,346.45,0,31.495455
7041,8361-LTMKD,4.0,1,0,2,74.4,306.6,1,76.65


In [19]:
df.iloc[-2:-1]

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
7041,8361-LTMKD,4.0,1,0,2,74.4,306.6,1,76.65


In [8]:
predict_model(best_model, df.iloc[-2:-1])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,K Neighbors Classifier,1.0,0,1.0,1.0,1.0,,0.0


Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,Churn,prediction_label,prediction_score
7041,8361-LTMKD,4.0,1,0,2,74.400002,306.600006,76.650002,1,1,0.8


In [9]:
predict_model(best_model, df.iloc[-4:-1])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,K Neighbors Classifier,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,Churn,prediction_label,prediction_score
7039,2234-XADUH,72.0,1,1,0,103.199997,7362.899902,102.262497,0,0,0.8
7040,4801-JZAZL,11.0,0,0,3,29.6,346.450012,31.495455,0,0,0.8
7041,8361-LTMKD,4.0,1,0,2,74.400002,306.600006,76.650002,1,1,0.8


# Saving and loading our model

In [10]:
save_model(best_model, 'KNeighborsClassifier')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'charge_per_tenure'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('c...
                                     transformer=TargetEn

In [11]:
import pickle
with open('KNeighborsClassifier', 'wb') as f:
    pickle.dump(best_model, f)

In [12]:
with open('KNeighborsClassifier', 'rb') as f:
    loaded_model = pickle.load(f)

In [13]:

# load the saved PyCaret pipeline (the model is KNN, so name it accordingly)
loaded_knn = load_model('KNeighborsClassifier')

Transformation Pipeline and Model Successfully Loaded


In [14]:
new_data = df.iloc[-2:-1]

In [15]:
predict_model(loaded_knn, new_data)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,K Neighbors Classifier,1.0,0,1.0,1.0,1.0,,0.0


Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,Churn,prediction_label,prediction_score
7041,8361-LTMKD,4.0,1,0,2,74.400002,306.600006,76.650002,1,1,0.8


# Making a Python module to make predictions

In [32]:
from IPython.display import Code

In [47]:

Code('predict_Churn.py')

In [48]:
%run predict_Churn.py

Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,K Neighbors Classifier,0.8166,0.8469,0.5543,0.693,0.6159,0.4974,0.5027


Columns in predictions DataFrame: Index(['customerID', 'tenure', 'PhoneService', 'Contract', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges', 'charge_per_tenure', 'Churn',
       'prediction_label', 'prediction_score'],
      dtype='object')
Predictions:
0        No
1        No
2        No
3        No
4        No
       ... 
7038     No
7039     No
7040     No
7041    Yes
7042     No
Name: Churn_prediction, Length: 7043, dtype: object
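
The module output above prints Yes/No labels, while the assignment asks for the probability of churn for each row. Below is a minimal sketch of one way to derive it. It assumes that PyCaret's `prediction_score` is the probability of the predicted label (so it must be flipped for rows labelled 0) and that label 1 means churn. The helper name `churn_probability` is hypothetical and is not taken from predict_Churn.py.

```python
def churn_probability(labels, scores):
    """Convert (predicted label, score of that label) pairs into P(churn)."""
    probs = []
    for label, score in zip(labels, scores):
        # Assumption: score is the probability of the predicted label,
        # so for a predicted 0 the churn probability is the complement.
        probs.append(score if label == 1 else 1.0 - score)
    return probs


# Labels and scores copied from the three-row prediction shown earlier
labels = [0, 0, 1]
scores = [0.8, 0.8, 0.8]
print([round(p, 2) for p in churn_probability(labels, scores)])
```

With a DataFrame returned by `predict_model`, the `prediction_label` and `prediction_score` columns could be passed in directly, since the function only iterates over both sequences.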


# Summary

For this Week 5 assignment I applied automated machine learning to the prepared churn dataset from week 2.

I loaded the dataset with pandas and used PyCaret's `setup` and `compare_models` to compare classification algorithms. K-Nearest Neighbors (KNN) ranked first, with the highest accuracy in the comparison table (0.7716), although Naive Bayes scored higher on AUC, recall, and F1.

I then saved the fitted KNN pipeline to disk with `save_model`, reloaded it with `load_model`, and confirmed that it still produced predictions. Finally, I wrote a Python module, predict_Churn.py, that loads the saved model and prints churn predictions.