# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
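
To make the metric choice above more concrete, the following is a minimal sketch of how accuracy, precision, recall, and F1 are derived from true and predicted labels. The true values are the ones listed for the new data; the predicted labels and the helper name `binary_metrics` are hypothetical and only serve as an illustration.

```python
def binary_metrics(y_true, y_pred):
    """Compute basic binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners left alone
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 0, 0, 1, 0]   # true values for new_churn_data.csv
y_pred = [1, 0, 0, 0, 0]   # hypothetical predictions, for illustration only
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this made-up case accuracy looks good while recall shows that one of the two churners was missed, which is why accuracy alone can be misleading on imbalanced churn data.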

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
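
The percentile challenge above can be sketched as follows. This is a minimal illustration, assuming the training-set churn probabilities are already available as a list; the values in `train_probs` and the function name `percentile_rank` are hypothetical.

```python
from bisect import bisect_right


def percentile_rank(train_probs, p):
    """Percentage of training probabilities that are <= p."""
    ordered = sorted(train_probs)
    return 100.0 * bisect_right(ordered, p) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only
train_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.78, 0.90]
rank = percentile_rank(train_probs, 0.78)
print(f"0.78 is at percentile {rank:.0f} of the training predictions")
```

With real data, `train_probs` would be replaced by the model's predicted churn probabilities on the training set.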

In [3]:
# !conda create -n msds python=3.10.14 -y
# !conda activate msds
# !pip install --upgrade pycaret

In [4]:
!pip install pycaret 




In [69]:
import pandas as pd
df = pd.read_csv('churn_data_modified.csv')

In [70]:
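# Encode every text (object) column as integer codes.
# Note: this also encodes customerID, which PyCaret then treats as a numeric feature.
for col in df.select_dtypes(include=['object']).columns:
    df[col] = pd.factorize(df[col])[0]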
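
One caveat with `pd.factorize` is that codes are assigned in order of first appearance, so running it again on new data can give the same label a different code. A minimal sketch of keeping the mapping from the training data and reusing it is shown below; the category labels and the `-1` code for unseen labels are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical text column from the training data
train = pd.Series(["Month-to-month", "Two year", "Month-to-month", "One year"])

# factorize returns the integer codes and the unique labels in order of appearance
codes, uniques = pd.factorize(train)

# Keep the label -> code mapping so new data is encoded the same way
mapping = {label: code for code, label in enumerate(uniques)}

# Hypothetical new data, including a label never seen in training
new = pd.Series(["One year", "Month-to-month", "Unknown"])
new_codes = new.map(mapping).fillna(-1).astype(int)  # -1 marks unseen labels

print(mapping)
print(new_codes.tolist())
```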

In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   customerID      7043 non-null   int64  
 1   tenure          7043 non-null   int64  
 2   PhoneService    7043 non-null   int64  
 3   Contract        7043 non-null   int64  
 4   PaymentMethod   7043 non-null   int64  
 5   MonthlyCharges  7043 non-null   float64
 6   TotalCharges    7043 non-null   float64
 7   Churn           7043 non-null   int64  
dtypes: float64(2), int64(6)
memory usage: 440.3 KB


# Using PyCaret to find the ML algorithm that performs best on the data

In [72]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [73]:
automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,1768
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 8)"
4,Transformed data shape,"(7043, 8)"
5,Transformed train set shape,"(4930, 8)"
6,Transformed test set shape,"(2113, 8)"
7,Numeric features,7
8,Preprocess,True
9,Imputation type,simple


In [74]:
from pycaret.classification import setup, compare_models

# Set up the classification experiment (Churn is a binary target)
automl = setup(df, target='Churn', session_id=5906)

# Compare classification models, ranked by AUC
best_model = compare_models(sort='AUC')

Unnamed: 0,Description,Value
0,Session id,5906
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 8)"
4,Transformed data shape,"(7043, 8)"
5,Transformed train set shape,"(4930, 8)"
6,Transformed test set shape,"(2113, 8)"
7,Numeric features,7
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7959,0.8396,0.4962,0.6533,0.5624,0.4329,0.4408,0.956
ada,Ada Boost Classifier,0.7929,0.8385,0.5084,0.6392,0.565,0.4317,0.4374,0.443
lr,Logistic Regression,0.7929,0.836,0.5146,0.6378,0.5679,0.4341,0.4394,2.92
lightgbm,Light Gradient Boosting Machine,0.7939,0.8259,0.5329,0.6343,0.5774,0.4429,0.4469,0.517
qda,Quadratic Discriminant Analysis,0.7505,0.825,0.7417,0.5213,0.612,0.4364,0.4513,0.066
ridge,Ridge Classifier,0.7925,0.8235,0.4511,0.6613,0.5351,0.4079,0.4209,0.127
lda,Linear Discriminant Analysis,0.7913,0.8235,0.5039,0.6366,0.5613,0.427,0.4327,0.165
nb,Naive Bayes,0.7156,0.8109,0.7715,0.4782,0.5901,0.3904,0.4169,0.069
rf,Random Forest Classifier,0.7791,0.8105,0.4832,0.6055,0.536,0.3939,0.399,1.108
et,Extra Trees Classifier,0.7771,0.8034,0.5008,0.594,0.5424,0.3968,0.3999,1.019
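
The choice of metric changes which model comes out on top. The sketch below copies four columns from the comparison table above into a plain dictionary and picks the top model per metric; the helper name `best_by` is hypothetical.

```python
# Accuracy, AUC, Recall and F1 copied from the compare_models table above
results = {
    "gbc":      {"Accuracy": 0.7959, "AUC": 0.8396, "Recall": 0.4962, "F1": 0.5624},
    "ada":      {"Accuracy": 0.7929, "AUC": 0.8385, "Recall": 0.5084, "F1": 0.5650},
    "lr":       {"Accuracy": 0.7929, "AUC": 0.8360, "Recall": 0.5146, "F1": 0.5679},
    "lightgbm": {"Accuracy": 0.7939, "AUC": 0.8259, "Recall": 0.5329, "F1": 0.5774},
    "qda":      {"Accuracy": 0.7505, "AUC": 0.8250, "Recall": 0.7417, "F1": 0.6120},
    "ridge":    {"Accuracy": 0.7925, "AUC": 0.8235, "Recall": 0.4511, "F1": 0.5351},
    "lda":      {"Accuracy": 0.7913, "AUC": 0.8235, "Recall": 0.5039, "F1": 0.5613},
    "nb":       {"Accuracy": 0.7156, "AUC": 0.8109, "Recall": 0.7715, "F1": 0.5901},
    "rf":       {"Accuracy": 0.7791, "AUC": 0.8105, "Recall": 0.4832, "F1": 0.5360},
    "et":       {"Accuracy": 0.7771, "AUC": 0.8034, "Recall": 0.5008, "F1": 0.5424},
}


def best_by(metric):
    """Return the model ID with the highest value for the given metric."""
    return max(results, key=lambda model_id: results[model_id][metric])


for metric in ("Accuracy", "AUC", "Recall", "F1"):
    print(metric, "->", best_by(metric))
```

Gradient Boosting leads on accuracy and AUC, while Naive Bayes has the highest recall and Quadratic Discriminant Analysis the highest F1, so a different `sort` argument would have selected a different model.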


In [77]:
best_model

In [78]:
df.iloc[-2:-1].shape

(1, 8)

In [79]:
predict_model(best_model, df.iloc[-2:-1])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,1.0,0,1.0,1.0,1.0,,0.0


Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
7041,7041,4,1,0,1,74.400002,297.600006,1,1,0.7569


# Saving the model

In [80]:
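# The file name 'LDA' is only a label; the saved model is the Gradient Boosting Classifier selected above.
save_model(best_model, 'LDA')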

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['customerID', 'tenure',
                                              'PhoneService', 'Contract',
                                              'PaymentMethod', 'MonthlyCharges',
                                              'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('categori...
                                             criterion='f

# Predicting

In [81]:
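import pickle

# Also pickle the bare estimator on its own
# (save_model above stored it together with the preprocessing pipeline)
with open('LDA_model.pk', 'wb') as f:
    pickle.dump(best_model, f)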

In [82]:
with open('LDA_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [83]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1], dtype=int8)

In [84]:
loaded_lda = load_model('LDA')

Transformation Pipeline and Model Successfully Loaded


In [85]:
predict_model(loaded_lda, new_data)

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,prediction_label,prediction_score
7041,7041,4,1,0,1,74.400002,297.600006,1,0.7569


In [86]:
import pandas as pd
df_2 = pd.read_csv('new_churn_data.csv', index_col='customerID')
df_2.info()


<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 9305-CKSKC to 6348-TACGU
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   tenure             5 non-null      int64  
 1   PhoneService       5 non-null      int64  
 2   Contract           5 non-null      int64  
 3   PaymentMethod      5 non-null      int64  
 4   MonthlyCharges     5 non-null      float64
 5   TotalCharges       5 non-null      float64
 6   charge_per_tenure  5 non-null      float64
dtypes: float64(3), int64(4)
memory usage: 320.0+ bytes
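
The columns of the new data differ from the features the model was trained on: `customerID` is the index here rather than a column, and `charge_per_tenure` was not in the training frame. A minimal sketch of a check that makes such differences explicit before predicting is shown below; the helper name `column_mismatch` is hypothetical, and the column lists are copied from the two `df.info()` outputs in this notebook.

```python
def column_mismatch(train_features, new_columns):
    """Return (missing, extra) column names relative to the training features."""
    missing = [c for c in train_features if c not in new_columns]
    extra = [c for c in new_columns if c not in train_features]
    return missing, extra


# Feature columns of the training data (everything except the Churn target)
train_features = ["customerID", "tenure", "PhoneService", "Contract",
                  "PaymentMethod", "MonthlyCharges", "TotalCharges"]

# Columns of new_churn_data.csv after loading with index_col='customerID'
new_columns = ["tenure", "PhoneService", "Contract", "PaymentMethod",
               "MonthlyCharges", "TotalCharges", "charge_per_tenure"]

missing, extra = column_mismatch(train_features, new_columns)
print("missing from new data:", missing)
print("not seen in training:", extra)
```

How to resolve the mismatch (for example, retraining without `customerID`, or adding `charge_per_tenure` to the training data) is a modelling decision that this check does not make.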


In [99]:
from IPython.display import Code

Code('predict_churn.py')

# Summary

Using PyCaret, we:
1. Loaded and prepared the churn data.
2. Compared ML algorithms and selected the best performer based on AUC.
3. Created a Python function to predict churn probability.
4. Tested the function with new data.
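
The contents of the prediction function are not reproduced in this notebook's output. The following is a minimal sketch of what such a function might look like, assuming a fitted binary classifier that exposes scikit-learn's `predict_proba` with class 1 meaning churn. The name `predict_churn_probability` is illustrative and is not taken from `predict_churn.py`.

```python
import numpy as np
import pandas as pd


def predict_churn_probability(model, df: pd.DataFrame) -> pd.Series:
    """Return the probability of churn (class 1) for each row of df.

    Assumes `model` is a fitted binary classifier following the
    scikit-learn interface, with classes ordered as [0, 1].
    """
    proba = np.asarray(model.predict_proba(df))
    # Column 1 holds the probability of the positive (churn) class
    return pd.Series(proba[:, 1], index=df.index, name="churn_probability")
```

Assuming the object returned by `load_model('LDA')` exposes `predict_proba` (PyCaret pipelines follow the scikit-learn interface), it could be passed in as `model`. Note that the `prediction_score` column returned by `predict_model` is the probability of the predicted label, not necessarily of churn; passing `raw_score=True` to `predict_model` returns a score per class instead.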

# Results
The best-performing model, the Gradient Boosting Classifier, achieved a cross-validated AUC of 0.84 and an accuracy of 0.80.
Recall was 0.50, so at the default threshold the model misses about half of the customers who churn.

# Recommendations
- Monitor model performance on new data.
- Continuously collect and incorporate new data to improve model accuracy.
- Explore feature engineering and hyperparameter tuning for further improvements.