# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
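
To make the metric choice concrete, here is a minimal sketch that computes accuracy, precision, and recall by hand. The true labels are the ones listed above for new_churn_data.csv; the predicted labels are made up for illustration and are not output from any model. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct non-churn
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [1, 0, 0, 0, 0]  # hypothetical predictions, for illustration only
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these made-up predictions, accuracy looks good even though one of the two churners is missed, which is one reason recall or AUC may be a better choice than accuracy when churners are the minority class.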

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
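
The percentile challenge above can be sketched as follows. This is one possible definition (the share of training probabilities less than or equal to the new probability), not necessarily the one the assignment intends. The function name `percentile_of_score` and the training probabilities are hypothetical.

```python
from bisect import bisect_right


def percentile_of_score(train_probs, prob):
    """Return the percentage of training probabilities that are <= prob."""
    ordered = sorted(train_probs)
    # bisect_right counts how many sorted values are <= prob.
    return 100.0 * bisect_right(ordered, prob) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.20, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]
pct = percentile_of_score(train_probs, 0.78)
print(pct)
```

In practice, `train_probs` would be the churn probabilities predicted for the training rows, computed once and stored alongside the saved model.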

# Loading the data


In [7]:

import pandas as pd
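# Compatibility shim: pandas 2.0 removed DataFrame.iteritems, which older library code may still call
pd.DataFrame.iteritems = pd.DataFrame.items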
df = pd.read_csv("C:\\Users\\singu\\Downloads\\prepared_churn_data (1).csv")
df

| | customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | Churn | charge_per_tenure |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 1.0 | 0 | 0 | 3 | 29.85 | 29.85 | 0 | 29.850000 |
| 1 | 5575-GNVDE | 34.0 | 1 | 1 | 2 | 56.95 | 1889.50 | 0 | 55.573529 |
| 2 | 3668-QPYBK | 2.0 | 1 | 0 | 2 | 53.85 | 108.15 | 1 | 54.075000 |
| 3 | 7795-CFOCW | 45.0 | 0 | 1 | 1 | 42.30 | 1840.75 | 0 | 40.905556 |
| 4 | 9237-HQITU | 2.0 | 1 | 0 | 3 | 70.70 | 151.65 | 1 | 75.825000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | 6840-RESVB | 24.0 | 1 | 1 | 2 | 84.80 | 1990.50 | 0 | 82.937500 |
| 7039 | 2234-XADUH | 72.0 | 1 | 1 | 0 | 103.20 | 7362.90 | 0 | 102.262500 |
| 7040 | 4801-JZAZL | 11.0 | 0 | 0 | 3 | 29.60 | 346.45 | 0 | 31.495455 |
| 7041 | 8361-LTMKD | 4.0 | 1 | 0 | 2 | 74.40 | 306.60 | 1 | 76.650000 |


In [2]:
#!pip install pycaret

# AutoML with pycaret

In [9]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [10]:
automl = setup(df, target='Churn')

| | Description | Value |
|---|---|---|
| 0 | Session id | 4238 |
| 1 | Target | Churn |
| 2 | Target type | Binary |
| 3 | Original data shape | (7043, 9) |
| 4 | Transformed data shape | (7043, 9) |
| 5 | Transformed train set shape | (4930, 9) |
| 6 | Transformed test set shape | (2113, 9) |
| 7 | Numeric features | 7 |
| 8 | Categorical features | 1 |
| 9 | Preprocess | True |
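
The report above lists one categorical feature. Given the column list, that is presumably `customerID`, which is an identifier rather than a predictor (this is an inference, not something the output states). Below is a minimal sketch of how the `setup()` arguments could be built to exclude it and to fix the random seed so that runs are repeatable. `ignore_features` and `session_id` are existing `setup()` parameters; the helper name `build_setup_kwargs` and the seed value 42 are hypothetical.

```python
def build_setup_kwargs(columns, target, id_columns, seed=42):
    """Build keyword arguments for pycaret.classification.setup().

    Hypothetical helper: lists identifier columns to ignore and fixes the seed.
    """
    # Only ignore identifier columns that are actually present in the data.
    ignored = [c for c in id_columns if c in columns]
    return {"target": target, "ignore_features": ignored, "session_id": seed}


columns = ["customerID", "tenure", "PhoneService", "Contract", "PaymentMethod",
           "MonthlyCharges", "TotalCharges", "Churn", "charge_per_tenure"]
kwargs = build_setup_kwargs(columns, target="Churn", id_columns=["customerID"])
print(kwargs)
# In the notebook this could be used as: automl = setup(df, **kwargs)
```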


In [11]:
print(automl.dataset)

      customerID  tenure  PhoneService  Contract  PaymentMethod  \
5706  5196-SGOAK     1.0             1         0              3   
1755  1803-BGNBD    12.0             1         0              3   
3826  3213-VVOLG    29.0             1         2              2   
5225  5376-PCKNB    72.0             1         1              0   
2279  5334-AFQJB    72.0             1         2              0   
...          ...     ...           ...       ...            ...   
5417  9715-SBVSU    14.0             1         2              1   
510   4332-MUOEZ    20.0             1         1              0   
2795  0709-TVGUR     9.0             1         0              3   
1841  8958-JPTRR    56.0             1         1              3   
7006  0093-XWZFY    40.0             1         0              0   

      MonthlyCharges  TotalCharges  charge_per_tenure  Churn  
5706       75.699997     75.699997          75.699997      1  
1755       54.299999    654.500000          54.541668      0  
3826  

In [12]:
best_model = compare_models()

In [7]:
best_model

In [13]:
df.iloc[-2:-1].shape

(1, 9)

In [14]:
predict_model(best_model, df.iloc[-2:-1])

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | K Neighbors Classifier | 1.0 | 0 | 1.0 | 1.0 | 1.0 | | 0.0 |

| | customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | charge_per_tenure | Churn | prediction_label | prediction_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7041 | 8361-LTMKD | 4.0 | 1 | 0 | 2 | 74.400002 | 306.600006 | 76.650002 | 1 | 1 | 1.0 |
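
The metric row above is computed from a single row, so its values say little about the model. As a minimal sketch of why the Kappa field is blank, the function below computes Cohen's kappa by hand: with one sample the expected agreement equals 1, and the ratio becomes 0/0. This is a hand-written illustration, not PyCaret's implementation, and returning NaN for the undefined case is a choice made here.

```python
import math


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # Fraction of rows where the prediction matches the true label.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance from the label frequencies alone.
    expected = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    if expected == 1.0:
        return math.nan  # 0/0: kappa is undefined
    return (observed - expected) / (1.0 - expected)


single_row = cohen_kappa([1], [1])  # the one-row case from the cell above
# True values from the assignment with hypothetical predictions.
several_rows = cohen_kappa([1, 0, 0, 1, 0], [1, 0, 0, 0, 0])
print(single_row, several_rows)
```

Scoring more rows, ideally rows held out from training, gives metrics that are defined and meaningful.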


# Saving and loading the model

In [15]:

save_model(best_model, 'KNeighborsClassifier')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'charge_per_tenure'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('c...
                                     transformer=TargetEn

In [16]:
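import pickle
# Pickle only the estimator returned by compare_models(); unlike save_model(), this likely omits the preprocessing pipeline
with open('KNeighborsClassifier', 'wb') as f:
    pickle.dump(best_model, f)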

In [17]:
with open('KNeighborsClassifier', 'rb') as f:
    loaded_model = pickle.load(f)

In [18]:
loaded_knn = load_model('KNeighborsClassifier')

Transformation Pipeline and Model Successfully Loaded


In [19]:
new_data = df.iloc[-2:-1]

In [20]:
predict_model(loaded_knn, new_data)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | K Neighbors Classifier | 1.0 | 0 | 1.0 | 1.0 | 1.0 | | 0.0 |

| | customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | charge_per_tenure | Churn | prediction_label | prediction_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7041 | 8361-LTMKD | 4.0 | 1 | 0 | 2 | 74.400002 | 306.600006 | 76.650002 | 1 | 1 | 1.0 |


# Making a Python module to make predictions

In [1]:
from IPython.display import Code


In [21]:

Code('predict_Churn.py')

In [22]:
%run predict_Churn.py

Transformation Pipeline and Model Successfully Loaded


| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | K Neighbors Classifier | 0.8167 | 0.8493 | 0.5356 | 0.7029 | 0.608 | 0.4912 | 0.4989 |


predictions:
0        No
1        No
2       Yes
3        No
4       Yes
       ... 
7038     No
7039     No
7040     No
7041    Yes
7042     No
Name: Churn, Length: 7043, dtype: object
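
The script output above prints Yes/No labels, while the assignment asks for the probability of churn. In the `predict_model()` output, `prediction_score` is the probability of the predicted label, so it has to be flipped for rows predicted as non-churn. Below is a minimal sketch, assuming labels are encoded as 1 (churn) and 0 (no churn) as in the earlier output. The function name `churn_probability` is hypothetical, and this is not the content of `predict_Churn.py`, which is not shown here.

```python
def churn_probability(labels, scores):
    """Convert (prediction_label, prediction_score) pairs to P(churn).

    Assumes label 1 means churn and score is the probability of that label.
    """
    return [s if l == 1 else 1.0 - s for l, s in zip(labels, scores)]


# First pair matches the row predicted earlier; the others are made up.
labels = [1, 0, 0]
scores = [1.0, 0.75, 0.5]
probs = churn_probability(labels, scores)
print(probs)
```

With a DataFrame returned by `predict_model()`, the two arguments would be its `prediction_label` and `prediction_score` columns. Passing `raw_score=True` to `predict_model()` is another option, since it returns a score per class.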


# Summary

Write a short summary of the process and results here.

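The notebook loads the prepared churn dataset (7,043 rows, 9 columns), runs PyCaret's `setup()` and `compare_models()` with the default accuracy metric, and selects a K Neighbors Classifier as the best model. The model is saved twice, with PyCaret's `save_model()` and with Python's `pickle`, so it can be reloaded without retraining. An external script, `predict_Churn.py`, loads the saved model and prints a Yes/No churn prediction for each row. On the 7,043 rows scored by the script, the model reaches an accuracy of 0.8167 and an AUC of 0.8493, with a recall of 0.5356 and a precision of 0.7029, so it misses close to half of the customers who actually churn. These rows appear to include the training data, so the figures likely overstate performance on unseen customers. Feature importance analysis, hyperparameter tuning, and data validation could improve the results. In summary, the workflow covers the steps from loading the data to saving and reusing the model, and it is a reasonable starting point for further development.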