# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
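
To make the choice of metric concrete, the sketch below computes accuracy, precision, recall, and F1 by hand for the five true values listed above. The predictions are hypothetical, chosen only for illustration, and `classification_metrics` is a made-up helper name rather than a pycaret function.

```python
def classification_metrics(y_true, y_pred):
    """Compute basic binary-classification metrics, treating 1 as churn."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [1, 0, 1, 1, 0]  # hypothetical predictions, for illustration only

metrics = classification_metrics(y_true, y_pred)
print(metrics["accuracy"], metrics["recall"])  # 0.8 1.0
```

Here every churner is caught (recall 1.0), but one of the three churn predictions is a false alarm, so precision is lower than accuracy suggests. Which of these matters most depends on the cost of missing a churner versus contacting a customer who would have stayed.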

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
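
For the percentile challenge, one possible approach is to sort the churn probabilities predicted for the training data and count how many fall at or below the new prediction. The sketch below uses only the standard library; `percentile_rank` is a hypothetical helper, and the training probabilities are invented for illustration.

```python
from bisect import bisect_right


def percentile_rank(train_probs, new_prob):
    """Return the percentage of training probabilities <= new_prob."""
    if not train_probs:
        raise ValueError("train_probs must not be empty")
    ordered = sorted(train_probs)
    # bisect_right counts how many sorted values are <= new_prob.
    return 100.0 * bisect_right(ordered, new_prob) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.12, 0.20, 0.31, 0.40, 0.45, 0.52, 0.66, 0.81]

print(percentile_rank(train_probs, 0.78))  # 90.0
```

In practice `train_probs` would come from scoring the training data with the saved model, and the sorted list could be computed once and reused for every new prediction.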

In [1]:
import pandas as pd

df = pd.read_csv('Prepped_Churn_Data.csv')
# Drop the unnamed column (typically an index written out when the CSV was saved)
df = df.drop('Unnamed: 0', axis=1)
df

Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,Churn,One year,Two year,Credit card (automatic),Electronic check,Mailed check
0,1,1,29.85,29.85,1,0,0,0,1,0
1,34,0,56.95,1889.50,1,1,0,0,0,1
2,2,0,53.85,108.15,0,0,0,0,0,1
3,45,1,42.30,1840.75,1,1,0,0,0,0
4,2,0,70.70,151.65,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...
7038,24,0,84.80,1990.50,1,1,0,0,0,1
7039,72,0,103.20,7362.90,1,1,0,1,0,0
7040,11,1,29.60,346.45,1,0,0,0,1,0
7041,4,0,74.40,306.60,0,0,0,0,0,1


In [26]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model, create_model

In [12]:
# fix_imbalance=True resamples the training data so the two classes are balanced
automl = setup(df, target='Churn', fix_imbalance=True)

Unnamed: 0,Description,Value
0,session_id,6948
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 10)"
5,Missing Values,False
6,Numeric Features,3
7,Categorical Features,6
8,Ordinal Features,False
9,High Cardinality Features,False


In [13]:
automl[6]

6948

In [46]:
new_df = df.drop('Churn', axis=1)
new_df

Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,One year,Two year,Credit card (automatic),Electronic check,Mailed check
0,1,1,29.85,29.85,0,0,0,1,0
1,34,0,56.95,1889.50,1,0,0,0,1
2,2,0,53.85,108.15,0,0,0,0,1
3,45,1,42.30,1840.75,1,0,0,0,0
4,2,0,70.70,151.65,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...
7038,24,0,84.80,1990.50,1,0,0,0,1
7039,72,0,103.20,7362.90,1,0,1,0,0
7040,11,1,29.60,346.45,0,0,0,1,0
7041,4,0,74.40,306.60,0,0,0,0,1


In [38]:
gbc = create_model('gbc')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.785,0.863,0.8083,0.8872,0.8459,0.4928,0.4986
1,0.7931,0.8481,0.8472,0.8665,0.8567,0.4846,0.485
2,0.7911,0.8385,0.8333,0.8746,0.8535,0.4903,0.492
3,0.785,0.8575,0.8278,0.8713,0.849,0.4766,0.4785
4,0.7546,0.8234,0.775,0.8746,0.8218,0.4322,0.4405
5,0.7708,0.828,0.8222,0.858,0.8397,0.4382,0.4394
6,0.7708,0.8482,0.7889,0.8847,0.8341,0.4675,0.4756
7,0.7667,0.8261,0.7944,0.8746,0.8326,0.4509,0.4566
8,0.787,0.8528,0.8162,0.8825,0.848,0.4939,0.4982
9,0.7444,0.8234,0.7939,0.8457,0.819,0.386,0.3882


In [39]:
best_model = gbc

In [40]:
import pickle
# save_model writes the preprocessing pipeline and the model to GBC.pkl
save_model(best_model, 'GBC')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                                             learning_rate=0.1, loss='deviance',
                                             max_depth=3, max_features=None,
                                             max_leaf_nodes=None,
                                             min_i

In [41]:
# Plain pickle stores only the estimator, not pycaret's preprocessing pipeline
with open('GBC_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [42]:
with open('GBC_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [48]:
# Take the second-to-last row as a one-row DataFrame to use as "new" data
new_data = new_df.iloc[-2:-1].copy()
loaded_model.predict(new_data)

array([1])

In [49]:
loaded_gbc = load_model('GBC')

Transformation Pipeline and Model Successfully Loaded


In [50]:
predict_model(loaded_gbc, new_data)

Unnamed: 0,tenure,PhoneService,MonthlyCharges,TotalCharges,One year,Two year,Credit card (automatic),Electronic check,Mailed check,Label,Score
7041,4,0,74.4,306.6,0,0,0,0,1,0,0.654
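
In the output above, `Label` is 0 while `Score` is 0.654. Since a score above 0.5 goes with a predicted label of 0, `Score` here appears to be the probability of the predicted label rather than the probability of churn. This is an inference from the output, so it is worth confirming against the installed pycaret version. Under that assumption, a minimal sketch of converting the pair into a churn probability (`churn_probability` is a hypothetical helper, not part of pycaret):

```python
def churn_probability(label, score):
    """Convert a predicted label and its score into P(churn).

    Assumes `score` is the probability of the predicted `label`
    (0 = no churn, 1 = churn), so a score for label 0 must be flipped.
    """
    if label not in (0, 1):
        raise ValueError(f"expected label 0 or 1, got {label!r}")
    return score if label == 1 else 1.0 - score


# The row shown above: Label 0 with Score 0.654.
p_churn = churn_probability(0, 0.654)
print(round(p_churn, 3))  # 0.346
```

For a full prediction dataframe `preds`, the same helper could be applied row by row, for example `preds.apply(lambda r: churn_probability(r['Label'], r['Score']), axis=1)`.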


In [61]:
from IPython.display import Code

Code('predict_churn.py')

In [62]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


AttributeError: 'str' object has no attribute 'predict'
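
The contents of predict_churn.py are not shown in this notebook, so the cause cannot be confirmed here. In general, this AttributeError means that `.predict` was called on a string, such as a model name or file path, instead of on the loaded model object. The sketch below reproduces that failure and adds an early type check; `StubModel` and `predict_rows` are hypothetical stand-ins so the example runs without pycaret.

```python
class StubModel:
    """Hypothetical stand-in for a loaded model object."""

    def predict(self, rows):
        # Always predicts 0 (no churn); only the interface matters here.
        return [0 for _ in rows]


def predict_rows(model, rows):
    """Call model.predict, failing early with a clearer message for strings."""
    if isinstance(model, str):
        raise TypeError(
            f"expected a loaded model object, got the string {model!r}; "
            "load the model first and pass the returned object"
        )
    return model.predict(rows)


rows = [[4, 0, 74.4, 306.6]]
model_name = "GBC"

# Reproduce the error: a model name is a string and has no .predict method.
try:
    model_name.predict(rows)
except AttributeError as err:
    print(err)  # 'str' object has no attribute 'predict'

# Passing the loaded object instead of its name works.
print(predict_rows(StubModel(), rows))  # [0]
```

In the script, the equivalent check would be that whatever is handed to the prediction step is the object returned by `load_model('GBC')`, as in the cells above, and not the string `'GBC'` itself.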

# Summary

This is not an excuse, but I have had a terrible week: a death in my extended family, increasing pressure from work (mandatory overtime), and more. I did not take the proper time this week to learn all I could before sitting down to finish this assignment. I ran into the errors you can see above, and after battling them for over 7 hours I have to throw in the towel.

Before calling it, I did what you can see above. I also had several issues with my data when I tried to model it. Eventually I went back and re-cleaned all of it, converted a few of my categorical columns to numerical ones via one-hot encoding, and dropped a few columns.

Despite the cleaned-up data, I kept running into different errors. My best-performing model continued to be the CatBoost classifier. Following along with the FTE, I tried to make that my best model, but every time I ran it, it gave me an error that basically boiled down to: "The filepath changes your column 1 from PhoneService to PaymentMethod, and that's not what it should be."

I spent a good chunk of time trying to resolve that issue via Google, Stack Overflow, and GitHub. None of my attempts at fixing it worked, so I eventually settled on GBC, which gave me several errors about having 10 features when I should only have 9. Eventually I found a solution in the PyCaret documentation: adding fix_imbalance=True to the setup call, which you can see above.
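
A 10-versus-9 feature mismatch often means that one extra column, such as the target or a saved index column, is still present in the data passed to the model. Whether that was the cause here is not confirmed. As a sketch of how to check, the snippet below compares the nine feature columns shown in this notebook with the ten columns of the full dataframe; `compare_columns` is a hypothetical helper.

```python
def compare_columns(expected, actual):
    """Return (missing, unexpected) column names, preserving their order."""
    missing = [col for col in expected if col not in actual]
    unexpected = [col for col in actual if col not in expected]
    return missing, unexpected


# The nine feature columns shown in new_df above.
feature_columns = [
    'tenure', 'PhoneService', 'MonthlyCharges', 'TotalCharges',
    'One year', 'Two year', 'Credit card (automatic)',
    'Electronic check', 'Mailed check',
]

# The ten columns of df, which still includes the target 'Churn'.
data_columns = feature_columns[:4] + ['Churn'] + feature_columns[4:]

missing, unexpected = compare_columns(feature_columns, data_columns)
print(missing, unexpected)  # [] ['Churn']
```

With real data, `list(df.columns)` can be passed as `actual` to see at a glance which column accounts for the difference.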

With the model seemingly figured out, I moved on to VS Code and wrote the script that you can see above and that is attached to this assignment. I struggled for some time to understand how to really write a script. The book exercises did not click for me, especially compared to the (to me) large scope of this particular script. I eventually managed to make something that made a little sense to me through googling and YouTube tutorials.

Despite all that, I have consistently run into the error seen above, and I no longer have the will to keep fighting it. Hopefully you will be able to make heads or tails of it and can point me in a better direction.

I apologize for not reaching out sooner, but as I said earlier, I didn't know I would struggle this much, and I had neither the time to make it to the Zoom meeting (I work every Monday) nor the free time to schedule another time with you.
