# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
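
For the percentile challenge above, one option is to rank a new probability against the sorted training probabilities. The following is a minimal sketch in plain Python; the helper `percentile_of` and the `train_probs` values are hypothetical and only illustrate the idea, they are not taken from the churn data.

```python
from bisect import bisect_right


def percentile_of(prob, train_probs):
    """Return the percentage of training probabilities that are <= prob."""
    if not train_probs:
        raise ValueError("train_probs must not be empty")
    ordered = sorted(train_probs)
    # bisect_right counts how many sorted values are <= prob.
    return 100.0 * bisect_right(ordered, prob) / len(ordered)


# Made-up training-set churn probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.75, 0.90]

pct = percentile_of(0.78, train_probs)
print(f"0.78 is at the {pct:.0f}th percentile")  # 0.78 is at the 90th percentile
```

In practice, `train_probs` would be the churn probabilities the saved model produces on the training set.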

In [40]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the prepared churn data from week 2.
df = pd.read_csv('data/modified_churn_data.csv')

# Encode the target and the categorical columns as integer codes.
le = LabelEncoder()
for col in ['Churn', 'PhoneService', 'Contract', 'PaymentMethod']:
    df[col] = le.fit_transform(df[col])

# Note: the continuous columns are label-encoded too, which replaces each
# value with its rank among the distinct values instead of the amount.
for col in ['MonthlyCharges', 'TotalCharges', 'total_monthly_ratio']:
    df[col] = le.fit_transform(df[col])

# Keep only the non-object (numeric) columns.
df = df.select_dtypes(exclude=['object'])
df

Unnamed: 0.1,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,total_monthly_ratio
0,0,1,0,0,2,142,74,0,0
1,1,34,1,1,3,498,3624,0,3207
2,2,2,1,0,3,436,536,1,134
3,3,45,0,1,0,266,3570,0,3856
4,4,2,1,0,2,729,674,1,189
...,...,...,...,...,...,...,...,...,...
7038,7038,24,1,1,3,991,3700,0,2499
7039,7039,72,1,1,1,1340,6304,0,6078
7040,7040,11,0,0,2,137,1265,0,1476
7041,7041,4,1,0,3,795,1157,1,553
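
The `MonthlyCharges`, `TotalCharges`, and `total_monthly_ratio` columns in this output hold integer codes, not amounts. `LabelEncoder` assigns each distinct value its position among the sorted distinct values, so on a continuous column it keeps the ordering but discards the magnitudes. A minimal plain-Python sketch of that mapping, using made-up charges (the helper `label_encode` is hypothetical and only imitates the encoder):

```python
def label_encode(values):
    """Imitate LabelEncoder: map sorted distinct values to 0..n-1."""
    classes = sorted(set(values))
    code = {value: index for index, value in enumerate(classes)}
    return [code[value] for value in values]


# Made-up monthly charges, not taken from the dataset.
charges = [20.0, 55.5, 50.0, 40.0, 70.0, 20.0]
print(label_encode(charges))  # [0, 3, 2, 1, 4, 0]
```

If the original amounts matter to the model, leaving numeric columns unencoded is the usual alternative.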


In [41]:
# Add four identical ratio features (TotalCharges / MonthlyCharges), computed
# from the label-encoded columns above.
for name in ['test1_ratio', 'test2_ratio', 'test3_ratio', 'test4_ratio']:
    df[name] = df['TotalCharges'] / df['MonthlyCharges']
df

Unnamed: 0.1,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,total_monthly_ratio,test1_ratio,test2_ratio,test3_ratio,test4_ratio
0,0,1,0,0,2,142,74,0,0,0.521127,0.521127,0.521127,0.521127
1,1,34,1,1,3,498,3624,0,3207,7.277108,7.277108,7.277108,7.277108
2,2,2,1,0,3,436,536,1,134,1.229358,1.229358,1.229358,1.229358
3,3,45,0,1,0,266,3570,0,3856,13.421053,13.421053,13.421053,13.421053
4,4,2,1,0,2,729,674,1,189,0.924554,0.924554,0.924554,0.924554
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,7038,24,1,1,3,991,3700,0,2499,3.733602,3.733602,3.733602,3.733602
7039,7039,72,1,1,1,1340,6304,0,6078,4.704478,4.704478,4.704478,4.704478
7040,7040,11,0,0,2,137,1265,0,1476,9.233577,9.233577,9.233577,9.233577
7041,7041,4,1,0,3,795,1157,1,553,1.455346,1.455346,1.455346,1.455346


In [29]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [3]:
automl = setup(df, target = 'Churn')

Unnamed: 0,Description,Value
0,session_id,5577
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 9)"
5,Missing Values,False
6,Numeric Features,5
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [4]:
automl[6]

True

In [5]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.7968,0.8337,0.4782,0.6634,0.5553,0.4281,0.438,0.036
ridge,Ridge Classifier,0.7947,0.0,0.4309,0.6791,0.5263,0.4038,0.4213,0.017
catboost,CatBoost Classifier,0.7931,0.8403,0.511,0.6391,0.5676,0.4338,0.4388,1.559
lr,Logistic Regression,0.7925,0.8385,0.4805,0.6484,0.5513,0.4203,0.4287,1.23
gbc,Gradient Boosting Classifier,0.7913,0.8401,0.4904,0.6403,0.5551,0.4219,0.4285,0.769
ada,Ada Boost Classifier,0.7909,0.8382,0.4996,0.6361,0.5591,0.4247,0.4304,0.35
lightgbm,Light Gradient Boosting Machine,0.7807,0.8247,0.5156,0.6037,0.5555,0.4112,0.4139,1.117
xgboost,Extreme Gradient Boosting,0.7722,0.8149,0.4889,0.5861,0.5324,0.3836,0.3868,1.318
knn,K Neighbors Classifier,0.769,0.7649,0.4698,0.5797,0.5187,0.369,0.3727,0.083
rf,Random Forest Classifier,0.7645,0.8012,0.4736,0.5697,0.5168,0.3629,0.3658,0.567
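
The table above is sorted by accuracy, the default. Which model counts as "best" depends on the metric chosen, and PyCaret's `compare_models` accepts a `sort` argument (for example `sort='AUC'`) for that purpose. As a rough illustration that needs no rerun, the sketch below copies a few rows from the table and picks the top model per metric; the `results` dictionary and the helper `best_by` are hypothetical names introduced here.

```python
# Rows copied from the compare_models() output above (subset of columns).
results = {
    "lda":      {"Accuracy": 0.7968, "AUC": 0.8337, "Recall": 0.4782, "F1": 0.5553},
    "ridge":    {"Accuracy": 0.7947, "AUC": 0.0,    "Recall": 0.4309, "F1": 0.5263},
    "catboost": {"Accuracy": 0.7931, "AUC": 0.8403, "Recall": 0.5110, "F1": 0.5676},
    "gbc":      {"Accuracy": 0.7913, "AUC": 0.8401, "Recall": 0.4904, "F1": 0.5551},
    "ada":      {"Accuracy": 0.7909, "AUC": 0.8382, "Recall": 0.4996, "F1": 0.5591},
    "lightgbm": {"Accuracy": 0.7807, "AUC": 0.8247, "Recall": 0.5156, "F1": 0.5555},
}


def best_by(metric):
    """Return the model ID with the highest value for the given metric."""
    return max(results, key=lambda name: results[name][metric])


for metric in ("Accuracy", "AUC", "Recall", "F1"):
    print(metric, "->", best_by(metric))
```

On these rows, LDA leads on accuracy, CatBoost on AUC and F1, and LightGBM on recall, so the choice of metric changes the winner.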


In [6]:
best_model

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)

In [7]:
df.iloc[-2:-1].shape

(1, 9)

In [17]:
predict_model(best_model, df.iloc[-2:-1])

Unnamed: 0.1,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,total_monthly_ratio,Label,Score
7041,7041,4,1,0,3,795,1157,1,553,1,0.5699


In [18]:
save_model(best_model, 'LDA')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=['Unnamed: 0'],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 num...
                 ('dummy', Dummify(target='Churn')),
                 ('fix_perfect', Remove_100(target='Churn')),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('df

In [11]:
import pickle

with open('LDA_model_Churn.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [14]:
with open('LDA_model_Churn.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [42]:
# Use the first seven rows, without the target column, as stand-in "new" data.
new_data = df.iloc[0:7].drop(columns='Churn')
loaded_model.predict(new_data)

array([1, 0, 1, 0, 1, 0, 0])

In [43]:
loaded_lda = load_model('LDA')

Transformation Pipeline and Model Successfully Loaded


In [44]:
predict_model(loaded_lda, new_data)

Unnamed: 0.1,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,total_monthly_ratio,test1_ratio,test2_ratio,test3_ratio,test4_ratio,Label,Score
0,0,1,0,0,2,142,74,0,0.521127,0.521127,0.521127,0.521127,1,0.6437
1,1,34,1,1,3,498,3624,3207,7.277108,7.277108,7.277108,7.277108,0,0.9725
2,2,2,1,0,3,436,536,134,1.229358,1.229358,1.229358,1.229358,0,0.6196
3,3,45,0,1,0,266,3570,3856,13.421053,13.421053,13.421053,13.421053,0,0.9501
4,4,2,1,0,2,729,674,189,0.924554,0.924554,0.924554,0.924554,1,0.786
5,5,8,1,0,2,1274,2173,1075,1.705651,1.705651,1.705651,1.705651,1,0.8705
6,6,22,1,0,1,1075,3673,2372,3.416744,3.416744,3.416744,3.416744,0,0.6793
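
The assignment asks for the probability of churn, but the output above reports a `Label` and a `Score`. The rows with `Label` 0 and scores such as 0.9725 suggest that `Score` is the probability of the predicted label rather than of churn. Under that assumption, a minimal sketch of the conversion looks like this; the function name `churn_probability` and the output column name are hypothetical.

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Convert Label/Score columns into the probability of churn per row.

    Assumes Score is the probability of the predicted Label, so rows
    predicted as 0 (no churn) have churn probability 1 - Score.
    """
    score = predictions["Score"]
    is_churn = predictions["Label"].astype(int) == 1
    # Keep Score where churn was predicted, otherwise use its complement.
    return score.where(is_churn, 1 - score).rename("Churn_probability")


# First three Label/Score pairs from the predict_model output above.
preds = pd.DataFrame({"Label": [1, 0, 0], "Score": [0.6437, 0.9725, 0.6196]})
print(churn_probability(preds).round(4).tolist())
```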


In [45]:
from IPython.display import Code

Code('predict_churn.py')

In [46]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
0    0
1    0
2    0
3    0
4    0
Name: Churn_prediction, dtype: int64
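
The script predicts 0 for every row, while the assignment lists the true values as [1, 0, 0, 1, 0]. A quick check in plain Python, counting the confusion-matrix cells by hand, shows how these predictions score; precision is set to 0.0 here as a convention because no positive predictions were made.

```python
y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [0, 0, 0, 0, 0]  # labels printed by predict_churn.py above

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churn correctly predicted
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-churn correctly predicted
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # missed churners

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if tp + fn else 0.0
precision = tp / (tp + fp) if tp + fp else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
# accuracy=0.60 recall=0.00 precision=0.00
```

Accuracy looks acceptable at 0.60, yet both churners are missed, which is one reason recall or AUC may be a better selection metric than accuracy for churn data.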


# Summary

Write a short summary of the process and results here.


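`compare_models()` ranked Linear Discriminant Analysis (LDA) first by accuracy (0.7968), the default sort metric, so we kept it as the best model. A test prediction on a single-row slice, `df.iloc[-2:-1]` (shape (1, 9)), returned label 1 with a score of 0.5699. We saved the model with `save_model` and with `pickle`, then created `predict_churn.py`, which loads the saved model and prints predictions, and ran it from this notebook. It predicted 0 (no churn) for all five rows, while the true values are [1, 0, 0, 1, 0]. The notebook and the Python file were uploaded to GitHub.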