# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
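
A minimal sketch of the percentile challenge above, using only the standard library. The function name `percentile_of` and the training probabilities are made up for illustration; in practice the training probabilities would come from running the saved model on the training data.

```python
from bisect import bisect_right


def percentile_of(train_probs, new_prob):
    """Return the percentage of training probabilities that are <= new_prob."""
    ordered = sorted(train_probs)
    rank = bisect_right(ordered, new_prob)  # count of values <= new_prob
    return 100.0 * rank / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.12, 0.20, 0.33, 0.41, 0.55, 0.62, 0.80, 0.91]

for p in (0.78, 0.15):
    print(f"probability {p:.2f} -> percentile {percentile_of(train_probs, p):.0f}")
```

Sorting once and using a binary search keeps the lookup cheap if the same training distribution is reused for many new predictions.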

### MSDS600_X700 Week 5 Assignment
----------------------------------------------
#### Student: Robert Apple
#### Date: 21-July-2021

### 1a.  Use pycaret to find an ML algorithm that performs best on the data  

#### I'm going to create two dataframes: 1) df_original and 2) df_modified

1.  `df_original` will be the original churn data, with the addition of the `charge_per_tenure` column.
2.  `df_modified` will be modified by hand to clean it up.

In [240]:
import pandas as pd

#--Loading up the original data and making small changes where warranted.
#-----------------------------------------------------------------------

#--Raw string so the backslashes in the Windows path are not treated as escape sequences.
df = pd.read_csv(r'C:\pandas_data\churn_updated_data.csv', index_col = 'customerID')
df['charge_per_tenure'] = df['TotalCharges'] / df['tenure']
df = df[['tenure','PhoneService','Contract','PaymentMethod','MonthlyCharges','TotalCharges','charge_per_tenure','Churn']]


#--Assign the result back instead of calling fillna(inplace=True) on a column selection.
df['TotalCharges'] = df['TotalCharges'].fillna(0)
df['charge_per_tenure'] = df['charge_per_tenure'].fillna(0)

df.to_csv('C:/Users/Rob4H/MSDS600_X70 -- Week 5/week_5_data.csv')


df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   tenure             7043 non-null   int64  
 1   PhoneService       7043 non-null   int64  
 2   Contract           7043 non-null   int64  
 3   PaymentMethod      7043 non-null   int64  
 4   MonthlyCharges     7043 non-null   float64
 5   TotalCharges       7043 non-null   float64
 6   charge_per_tenure  7043 non-null   float64
 7   Churn              7043 non-null   int64  
dtypes: float64(3), int64(5)
memory usage: 495.2+ KB


In [190]:
#--Load the new data as well (it already includes charge_per_tenure)
#-------------------------------------------------------------------
new_data = pd.read_csv('C:/Users/Rob4H/MSDS600_X70 -- Week 5/new_churn_data.csv', index_col='customerID')

new_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 9305-CKSKC to 6348-TACGU
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   tenure             5 non-null      int64  
 1   PhoneService       5 non-null      int64  
 2   Contract           5 non-null      int64  
 3   PaymentMethod      5 non-null      int64  
 4   MonthlyCharges     5 non-null      float64
 5   TotalCharges       5 non-null      float64
 6   charge_per_tenure  5 non-null      float64
dtypes: float64(3), int64(4)
memory usage: 320.0+ bytes


In [192]:
#--Now to choose the metric
#--------------------------

from pycaret.classification import setup, compare_models, predict_model, save_model, load_model, tune_model, create_model

automl = setup(df, target='Churn')  
#------------------------------------
automl

best_model = compare_models(sort='AUC')


| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| gbc | Gradient Boosting Classifier | 0.7917 | 0.8336 | 0.4777 | 0.6406 | 0.5459 | 0.4149 | 0.4231 | 0.189 |
| lr | Logistic Regression | 0.7925 | 0.8334 | 0.5185 | 0.628 | 0.5672 | 0.4327 | 0.4365 | 0.601 |
| ada | Ada Boost Classifier | 0.7917 | 0.8322 | 0.5 | 0.6341 | 0.5585 | 0.4247 | 0.4302 | 0.077 |
| catboost | CatBoost Classifier | 0.7897 | 0.8321 | 0.4769 | 0.6339 | 0.5434 | 0.4107 | 0.418 | 1.474 |
| lda | Linear Discriminant Analysis | 0.7907 | 0.8235 | 0.5254 | 0.6205 | 0.5683 | 0.4317 | 0.4346 | 0.011 |
| lightgbm | Light Gradient Boosting Machine | 0.7848 | 0.8234 | 0.4838 | 0.6178 | 0.5414 | 0.4039 | 0.4097 | 0.303 |
| nb | Naive Bayes | 0.7114 | 0.8153 | 0.7992 | 0.4724 | 0.5936 | 0.392 | 0.4249 | 0.01 |
| xgboost | Extreme Gradient Boosting | 0.7696 | 0.8096 | 0.4823 | 0.5745 | 0.5236 | 0.3735 | 0.3763 | 0.399 |
| rf | Random Forest Classifier | 0.773 | 0.8037 | 0.4692 | 0.587 | 0.5208 | 0.3748 | 0.3791 | 0.172 |
| et | Extra Trees Classifier | 0.7604 | 0.7805 | 0.4823 | 0.5526 | 0.5146 | 0.3567 | 0.3584 | 0.149 |


In [193]:
#--From the table above, 'gbc' is the best choice for the data by AUC -- with the addition of the
#--charge_per_tenure column.  I've run this a gazillion times, and I'm sticking with GBC.
#--
#--79.17% accuracy
#--
#-- NOTE:  I'm getting a different listing each time I run this.  pycaret's setup() draws a random seed
#--        (session_id) when none is passed, so the train/test split and the cross-validation folds change
#--        from run to run.  The top four models are within 0.002 AUC of each other (0.8321 to 0.8336),
#--        so small changes in the split are enough to reorder them.
#--
#--        I would also be willing to bet that if our training data were significantly larger, this
#--        "randomizing" effect would stop being an issue.
#-----------------------------
best_model   #--Note: it was GBC before.  It seems to flip between GBC, CatBoost and LR.  I'm sticking with GBC this time.

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=6737, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
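
The changing leaderboard is consistent with a fresh random seed being drawn on each run: the `random_state=6737` in the model output was not set anywhere in this notebook. pycaret's `setup` takes a `session_id` argument for pinning that seed. The sketch below does not call pycaret; it uses only the standard library to illustrate that a fixed seed gives the same train/test split every time. `split_indices` is a hypothetical helper, and the 30% test fraction is an assumption made for the example.

```python
import random


def split_indices(n_rows, test_fraction=0.3, seed=None):
    """Shuffle row indices and split them into (train, test) lists."""
    rng = random.Random(seed)  # seed=None draws a fresh seed on every call
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# Same seed, same split: results become comparable between runs.
train_a, test_a = split_indices(7043, seed=123)
train_b, test_b = split_indices(7043, seed=123)
print(test_a == test_b)

# A different seed gives a different split, so scores can shift slightly.
_, test_c = split_indices(7043, seed=456)
print(test_a == test_c)
```

When models score as closely as the top few here, a different split can be enough to change which one comes out first.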

### 1b.  Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. 

#### The week 3 FTE has some information on these different metrics

In [143]:
#-- I sorted by AUC and chose GradientBoostingClassifier.
#--
#-- I am a believer that overall accuracy is important.  If someone gets a false positive, they
#-- could undergo additional testing; the same goes for a false negative.  GBC lands best for AUC (0.8336),
#-- and its accuracy (0.7917) is within 0.001 of the top value in the table (LR at 0.7925).
#-------------------------------------------------------------



### 2.  Save the model to disk

In [194]:
save_model(best_model, 'gbc')

#--I confirmed, my model is in here --> "C:\Users\Rob4H\MSDS600_X70 -- Week 5"

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                                             learning_rate=0.1, loss='deviance',
                                             max_depth=3, max_features=None,
                                             max_leaf_nodes=None,
                                             min_i

In [195]:
#--This is for my own debugging.  I don't know how to put a model into GitHub and then retrieve it.
#--------------------------------------------------------------------------------------------------
loaded_model = load_model('C:/Users/Rob4H/MSDS600_X70 -- Week 5/gbc')
loaded_model


Transformation Pipeline and Model Successfully Loaded


Pipeline(memory=None,
         steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=[],
                                      display_types=True, features_todrop=[],
                                      id_columns=[],
                                      ml_usecase='classification',
                                      numerical_features=[], target='Churn',
                                      time_features=[])),
                ('imputer',
                 Simple_Imputer(categorical_strategy='not_available',
                                fill_value_categorical=None,
                                fill_value_numerical=None,
                                numeric_strate...
                                            learning_rate=0.1, loss='deviance',
                                            max_depth=3, max_features=None,
                                            max_leaf_nodes=None,
                                            min_impurity_decrease=

In [196]:
new_data

| customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | charge_per_tenure |
|---|---|---|---|---|---|---|---|
| 9305-CKSKC | 22 | 1 | 0 | 2 | 97.4 | 811.7 | 36.895455 |
| 1452-KNGVK | 8 | 0 | 1 | 1 | 77.3 | 1701.95 | 212.74375 |
| 6723-OKKJM | 28 | 1 | 0 | 0 | 28.25 | 250.9 | 8.960714 |
| 7832-POPKP | 62 | 1 | 0 | 2 | 101.7 | 3106.56 | 50.105806 |
| 6348-TACGU | 10 | 0 | 0 | 1 | 51.15 | 3440.97 | 344.097 |


In [234]:
#--It should be [1,0,0,1,0]
#--I'm getting  [0,0,0,0,0]  --I don't know if I'm close, because the results have been erratic.
#--
#-------------------------------------------------------------------------------------------------
#--predict_model scores the whole dataframe at once, so no row-by-row loop is needed.
output = predict_model(loaded_model, data=new_data)
my_list_of_values = output['Label'].tolist()
#-----------------------------------------------------
print(my_list_of_values)

[0, 0, 0, 0, 0]
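
An all-zero `Label` column together with high `Score` values need not be a contradiction. Assuming a pycaret version in which `Score` is the probability of the predicted label (an assumption about the installed version, not something this notebook confirms), a score of 0.9 next to label 0 means a 10% chance of churn. The helper below, with a hypothetical name and made-up scores, converts such pairs into churn probabilities.

```python
def churn_probability(label, score):
    """Convert a (predicted label, score) pair into P(churn).

    Assumes `score` is the probability of the predicted label,
    so for label 0 the churn probability is 1 - score.
    """
    return score if label == 1 else 1.0 - score


# Made-up (label, score) pairs for illustration; not output from this notebook.
predictions = [(0, 0.90), (0, 0.75), (1, 0.60), (0, 0.55)]

for label, score in predictions:
    print(label, score, round(churn_probability(label, score), 2))
```

If the installed version instead reports the probability of the positive class directly, the conversion is unnecessary, so it is worth checking one row by hand before relying on it.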


# Robin, I'm not pleased with the outcomes I'm getting.  
# My accuracy should be higher than it's showing, but I can't 
# seem to get past it.
#
# Regardless, I'm also going to check the old-fashioned way of training the 
# model.  Since I have GBC as the chosen one, that is what I'm sticking with.

In [242]:
#--I'm going to run this the old-fashioned way.  Load it all up from the beginning
#----------
DV = "Churn"

X_train = df.drop(DV, axis=1) #--features: everything except the dependent variable (DV)
y_train = df[DV]   #--target: the DV itself
y_train


customerID
7590-VHVEG    0
5575-GNVDE    0
3668-QPYBK    1
7795-CFOCW    0
9237-HQITU    1
             ..
6840-RESVB    0
2234-XADUH    0
4801-JZAZL    0
8361-LTMKD    1
3186-AJIEK    0
Name: Churn, Length: 7043, dtype: int64

In [236]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()

gbc.fit(X_train, y_train)


GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [243]:
#--predict accepts the whole dataframe, so the rows don't need to be looped over.
my_list_of_values = gbc.predict(new_data).tolist()

print(my_list_of_values)

[0, 0, 0, 0, 0]


In [244]:
from sklearn.metrics import accuracy_score
y_true = [1, 0, 0, 1, 0]   #--the true values given in the assignment
y_pred = [0, 0, 0, 0, 0]   #--what the model predicted
accuracy = accuracy_score(y_true, y_pred)
print(f"Our accuracy is {accuracy * 100}%")

Our accuracy is 60.0%
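
Accuracy alone hides what went wrong here. The sketch below uses plain Python and the five labels from this notebook to count the confusion-matrix cells and derive precision and recall for the churn class. `confusion_counts` is a helper name introduced for this example, and treating an undefined precision (no positive predictions) as 0.0 is a convention chosen here.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, fp, fn, tn) for binary labels where 1 means churn."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [0, 0, 0, 0, 0]  # predictions obtained in this notebook

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0  # undefined case reported as 0.0
print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

With these labels the model catches neither of the two churners, so recall is 0 even though accuracy is 60%.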


### 3. create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
* Your Python file/function should print out the predictions for new data (new_churn_data.csv)
* The true values for the new data are [1, 0, 0, 1, 0] if you're interested

In [None]:
#--https://github.com/Rob4Hope/MSDS600_X70/blob/main/Week_5_Python_Code.py
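
For reference, a minimal sketch of the kind of function this step asks for. It is not the code in the linked file. It assumes a fitted scikit-learn style classifier exposing `predict_proba` and `classes_`; `predict_churn_probability` and `StubModel` are hypothetical names, and the stub stands in for a real model so the example runs without any third-party packages.

```python
def predict_churn_probability(model, features):
    """Return P(churn) for each row of `features`.

    `model` is any fitted classifier with scikit-learn's `predict_proba`
    and `classes_` attributes; `features` is whatever that model accepts,
    for example a pandas DataFrame with the training columns.
    """
    churn_column = list(model.classes_).index(1)  # column holding P(Churn == 1)
    return [float(row[churn_column]) for row in model.predict_proba(features)]


class StubModel:
    """Stand-in classifier for illustration; returns fixed probabilities."""

    classes_ = [0, 1]

    def predict_proba(self, features):
        # One [P(no churn), P(churn)] pair per input row (rows given as a list).
        return [[0.8, 0.2] for _ in features]


# Two rows copied from new_data, in the training column order.
rows = [
    [22, 1, 0, 2, 97.4, 811.7, 36.895455],
    [8, 0, 1, 1, 77.3, 1701.95, 212.74375],
]
print(predict_churn_probability(StubModel(), rows))
```

Looking up the churn column through `classes_` avoids assuming that the positive class is always the second column of the `predict_proba` output.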



### 4. Test your Python module and function with the new data, new_churn_data.csv

In [None]:
#--Things work.  I'm not happy with the accuracy.  I SERIOUSLY think something is broken in my packages or versions.


### 5. write a short summary of the process and results at the end of this notebook

In [None]:
#--Well...I learned a few things, and I hope to learn a few more when this assignment gets looked at from other perspectives.
#--
#-- 1.  Setting things up wasn't as hard as I supposed, once I started to get a feel for it.
#-- 2.  I ran into bugs in my PyCharm installation, and had to back out pandas to an earlier version.
#-- 3.  I ran into all kinds of problems with pycaret inside PyCharm, and since the assignment didn't say I had to use the
#--     pycaret tools (like predict_model) inside my Python file, I chose to load the thing the old-fashioned way.
#-- 4.  My numbers were consistently bad.  I had high 'score' values, but my 'label' didn't move.  So, I'm missing
#--     something somewhere.
#-- 5.  I understand the process of using pycaret to compare models and make a choice.  The randomness of the model selection
#--     was also interesting.  I actually got some model estimates as high as 82%, but generally they were in the 79% range.
#--
#-- I spent a lot of time on this assignment.  I certainly understand the concept of what is being done.  But I am worried
#-- that somewhere I missed a parameter or something, which caused the low accuracy.  I'm also VERY MUCH AWARE that the
#-- versions I have installed MAY BE RESPONSIBLE for the troubles.  <<sigh>>
#--
#-- Anyway, I'm turning it in.  Feedback is welcome. :-)


### 6. upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox