# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
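
For the percentile challenge above, one possible approach is to keep the sorted churn probabilities predicted on the training set and look up where a new probability falls. This is only a sketch: the function name `percentile_of` and the training probabilities below are made up for illustration and are not part of the assignment materials.

```python
from bisect import bisect_right


def percentile_of(prob, sorted_train_probs):
    """Return the percent of training probabilities that are <= prob."""
    rank = bisect_right(sorted_train_probs, prob)
    return 100.0 * rank / len(sorted_train_probs)


# Hypothetical churn probabilities predicted on the training data.
train_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.33, 0.41, 0.55, 0.62, 0.70, 0.91])

# A new prediction of 0.78 sits above 9 of the 10 training probabilities.
print(percentile_of(0.78, train_probs))
```

With real data, `train_probs` would be replaced by the sorted churn probabilities that the saved model assigns to the training rows.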

# <font color = "blue"> Begin Student Submission </font>
## Load data
Here, I load the prepared data produced by my modified week 2 assignment.

In [1]:
import pandas as pd

df = pd.read_csv('NEW_prepped_churn_data.csv', index_col='customerID')
df

customerID,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,0,1,0,0,0,29.85,29.85,No
5575-GNVDE,1,34,1,1,1,56.95,1889.50,No
3668-QPYBK,2,2,1,0,1,53.85,108.15,Yes
7795-CFOCW,3,45,0,1,2,42.30,1840.75,No
9237-HQITU,4,2,1,0,0,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...
6840-RESVB,7038,24,1,1,1,84.80,1990.50,No
2234-XADUH,7039,72,1,1,3,103.20,7362.90,No
4801-JZAZL,7040,11,0,0,0,29.60,346.45,No
8361-LTMKD,7041,4,1,0,1,74.40,306.60,Yes


Now, I use PyCaret for AutoML. Hopefully, this will identify a better algorithm for churn prediction than the ones from the last couple of weeks.

In [2]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model
automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,session_id,8183
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"No: 0, Yes: 1"
4,Original Data,"(7043, 8)"
5,Missing Values,False
6,Numeric Features,4
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [3]:
automl[6]

Pipeline(memory=None,
         steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=[],
                                      display_types=True, features_todrop=[],
                                      id_columns=['Unnamed: 0'],
                                      ml_usecase='classification',
                                      numerical_features=[], target='Churn',
                                      time_features=[])),
                ('imputer',
                 Simple_Imputer(categorical_strategy='not_available',
                                fill_value_categorical=None,
                                fill_value_numerical=None,
                                num...
                ('scaling', 'passthrough'), ('P_transform', 'passthrough'),
                ('binn', 'passthrough'), ('rem_outliers', 'passthrough'),
                ('cluster_all', 'passthrough'),
                ('dummy', Dummify(target='Churn')),
                ('fix_perfect', Remo

In [4]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.7982,0.8301,0.5441,0.6347,0.5854,0.4532,0.4558,0.005
lr,Logistic Regression,0.7976,0.839,0.5317,0.6375,0.5791,0.4474,0.451,0.677
gbc,Gradient Boosting Classifier,0.7972,0.8386,0.4992,0.6475,0.5629,0.4339,0.4406,0.084
ridge,Ridge Classifier,0.7937,0.0,0.4659,0.6477,0.5415,0.4131,0.4227,0.004
ada,Ada Boost Classifier,0.7915,0.8419,0.5,0.6278,0.5563,0.4224,0.4273,0.045
lightgbm,Light Gradient Boosting Machine,0.7826,0.825,0.51,0.601,0.5509,0.4089,0.4118,0.026
xgboost,Extreme Gradient Boosting,0.7746,0.8109,0.4953,0.5834,0.5353,0.3879,0.3905,0.098
rf,Random Forest Classifier,0.7728,0.7994,0.4845,0.5798,0.5273,0.3795,0.3825,0.098
knn,K Neighbors Classifier,0.7696,0.742,0.466,0.5742,0.5137,0.3651,0.3689,0.237
et,Extra Trees Classifier,0.7562,0.7738,0.4852,0.539,0.51,0.3485,0.3498,0.092


In [5]:
best_model

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)
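
The model returned by `compare_models()` is the one ranked first by the default metric, accuracy. As a rough illustration of how the choice of metric changes the winner, the sketch below re-ranks the first five rows of the table above by a few different metrics. The helper `best_by` is a made-up name for this example; PyCaret itself can rank by another metric through the `sort` argument of `compare_models`.

```python
# Cross-validated scores copied from the first five rows of the
# compare_models() table above.
results = {
    "lda":   {"Accuracy": 0.7982, "AUC": 0.8301, "Recall": 0.5441, "Prec.": 0.6347},
    "lr":    {"Accuracy": 0.7976, "AUC": 0.8390, "Recall": 0.5317, "Prec.": 0.6375},
    "gbc":   {"Accuracy": 0.7972, "AUC": 0.8386, "Recall": 0.4992, "Prec.": 0.6475},
    "ridge": {"Accuracy": 0.7937, "AUC": 0.0000, "Recall": 0.4659, "Prec.": 0.6477},
    "ada":   {"Accuracy": 0.7915, "AUC": 0.8419, "Recall": 0.5000, "Prec.": 0.6278},
}


def best_by(metric, table):
    """Return the model id with the highest value for the given metric."""
    return max(table, key=lambda model: table[model][metric])


for metric in ("Accuracy", "AUC", "Recall", "Prec."):
    print(metric, "->", best_by(metric, results))
```

On these rows, LDA leads on accuracy and recall, AdaBoost on AUC, and the ridge classifier on precision, so the "best" model depends on which kind of error matters most for the churn use case.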

It looks like the best model for this data is Linear Discriminant Analysis (LDA), which was also the best model in the FTE notebook! As the notebook asks us to do, I compare the shapes of a 2-D selection (a one-row DataFrame) and a 1-D selection (a Series), then use the best model to predict some values.

In [6]:
print(df.iloc[-2:-1].shape)
print(df.iloc[-1].shape)

(1, 8)
(8,)


In [7]:
predict_model(best_model, df.iloc[-2:-1])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0,0,0,0,0,0,0


customerID,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Label,Score
8361-LTMKD,7041,4,1,0,1,74.4,306.6,Yes,Yes,0.5138


I also save this best machine learning model to a pickle file so it can be used later.

In [8]:
save_model(best_model, 'LDA')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=['Unnamed: 0'],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 num...
                 ('dummy', Dummify(target='Churn')),
                 ('fix_perfect', Remove_100(target='Churn')),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('df

In [9]:
# Rerun from here on
import pickle

with open('LDA_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

To check that the save was successful, I load the pickle file back. Predicting with it directly raised a ValueError, so the following cells use the model loaded with PyCaret's `load_model` instead.

In [10]:
with open('LDA_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)
    
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
# loaded_model.predict(new_data)  # This line was raising a ValueError
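
A minimal round-trip check with `pickle` might look like the sketch below. It uses a plain dictionary as a stand-in for the fitted model, so the names and values are illustrative only. Note that `pickle.dump` stores just the object it is given. Pickling `best_model` therefore probably saves the bare estimator without PyCaret's preprocessing steps, whereas `save_model` reports saving the transformation pipeline together with the model. That difference is a plausible, though unconfirmed, reason for the ValueError on raw data.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: any picklable Python object works here.
model_stand_in = {"name": "lda", "solver": "svd", "tol": 0.0001}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pk")

    # Write the object to disk in binary mode.
    with open(path, "wb") as f:
        pickle.dump(model_stand_in, f)

    # Read it back and confirm that it is an equal but separate object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stand_in)
```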

In [11]:
loaded_lda = load_model('LDA')

Transformation Pipeline and Model Successfully Loaded


In [12]:
predict_model(loaded_lda, new_data)

customerID,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Label,Score
8361-LTMKD,7041,4,1,0,1,74.4,306.6,Yes,0.5138


Alright, it looks like the model was loaded successfully and it works! Now, we can use a Python script to automate the prediction process. The script is based on the FTE example, with some modifications to variables and parameters.

In [13]:
from IPython.display import Code

Code('predict_churn.py')

In [14]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC    Yes
1452-KNGVK     No
6723-OKKJM     No
7832-POPKP     No
6348-TACGU     No
Name: Churn_prediction, dtype: object
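
The script output above reports Yes/No labels, while the assignment asks for a probability of churn. One way to get there from the `Label` and `Score` columns is sketched below. It assumes that `Score` is the probability of the predicted label, which should be checked against the documentation for the installed PyCaret version. The function name `churn_probability` is made up for this example.

```python
def churn_probability(label, score):
    """Convert a (Label, Score) pair into the probability of churn.

    Assumes Score is the probability of the predicted label, so a 'No'
    prediction with score s implies a churn probability of 1 - s.
    """
    if label == "Yes":
        return score
    return 1.0 - score


# The 'Yes' row comes from the predict_model output earlier in this
# notebook; the 'No' row is a made-up example.
rows = [("Yes", 0.5138), ("No", 0.9000)]
print([round(churn_probability(label, score), 4) for label, score in rows])
```

Applied to a predictions DataFrame, the same rule could be used row by row on the `Label` and `Score` columns.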


The last part of this assignment requires me to upload this notebook and Python script to a GitHub repository. As you are currently reading this, this task has been accomplished!

# Summary

This week, I looked into how to automate data analysis. I used PyCaret to automate the process of selecting a machine learning model that fits the dataset well. I also used pickle to save that model to a file, which allows easy access for users who would like to analyze churn data for this telecommunications company. Additionally, I looked at how to run Python scripts within a Jupyter Notebook using 'magic' commands such as `%run`. When I first started my work with Regis's physics department, we used this 'magic' to transition between explanation of code and ROOT-C++ code using PyROOT. I thought it was pretty cool getting to see this applied directly to Python this week.

As for the results, it looks like the new algorithm works better than the model from last week's assignment. Comparing the printed predictions with the true values [1, 0, 0, 1, 0], the best machine learning model classified 4 of the 5 new customers correctly (80%). That is a good rate for such a small sample, though it caught only one of the two customers who actually churned.
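
The arithmetic behind that comparison can be checked with a few lines of plain Python. This sketch assumes the true values from the assignment brief are in the same row order as the printed predictions, and encodes 'Yes' as 1 and 'No' as 0.

```python
# True labels from the assignment brief and the predictions printed by
# predict_churn.py, assumed to be in the same row order (1 = churn).
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners kept

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy}, precision={precision}, recall={recall}")
```

With only five rows these numbers are a rough indication at best, but they show why accuracy alone can hide missed churners.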