# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
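
The percentile challenge in the list above can be sketched with the standard library alone. This is a minimal illustration, not the required solution: the function name `percentile_of_score` and the `train_probs` values are hypothetical, and a real version would use the churn probabilities predicted for the training dataset.

```python
from bisect import bisect_right


def percentile_of_score(train_probs, prob):
    """Return the percentage of training probabilities that are <= prob."""
    ordered = sorted(train_probs)
    if not ordered:
        raise ValueError("train_probs must not be empty")
    # bisect_right gives the number of sorted values that are <= prob
    return 100.0 * bisect_right(ordered, prob) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only
train_probs = [0.05, 0.10, 0.20, 0.30, 0.45, 0.50, 0.60, 0.70, 0.80, 0.95]

# Eight of the ten made-up values are <= 0.78, so this prints 80.0
print(percentile_of_score(train_probs, 0.78))
```

With a real training distribution, the same call would place each new prediction relative to the probabilities the model produced on the training data.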

In [6]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [90]:
df = pd.read_csv('churn_data_modified1.csv')
df

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,TotalCharges_final
0,5375,1,0,0,2,29.85,29.85,0,29.85
1,3962,34,1,1,3,56.95,1936.30,0,1936.30
2,2564,2,1,0,3,53.85,107.70,1,107.70
3,5535,45,0,1,0,42.30,1903.50,0,1903.50
4,6511,2,1,0,2,70.70,141.40,1,141.40
...,...,...,...,...,...,...,...,...,...
7038,4853,24,1,1,3,84.80,2035.20,0,2035.20
7039,1525,72,1,1,1,103.20,7430.40,0,7430.40
7040,3367,11,0,0,2,29.60,325.60,0,325.60
7041,5934,4,1,0,3,74.40,297.60,1,297.60


In [91]:
# Drop the ID column and TotalCharges_final, which repeats TotalCharges in the rows shown above
df = df.drop(columns=['customerID', 'TotalCharges_final'])

In [66]:
df

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1,0,0,2,29.85,29.85,0
1,34,1,1,3,56.95,1936.30,0
2,2,1,0,3,53.85,107.70,1
3,45,0,1,0,42.30,1903.50,0
4,2,1,0,2,70.70,141.40,1
...,...,...,...,...,...,...,...
7038,24,1,1,3,84.80,2035.20,0
7039,72,1,1,1,103.20,7430.40,0
7040,11,0,0,2,29.60,325.60,0
7041,4,1,0,3,74.40,297.60,1


In [9]:
!pip3 install pycaret
from pycaret.classification import *



In [67]:
# from pycaret.regression import setup

# automl = setup(df, target='MonthlyCharges')

Unnamed: 0,Description,Value
0,Session id,8369
1,Target,MonthlyCharges
2,Target type,Regression
3,Original data shape,"(7043, 7)"
4,Transformed data shape,"(7043, 7)"
5,Transformed train set shape,"(4930, 7)"
6,Transformed test set shape,"(2113, 7)"
7,Numeric features,6
8,Preprocess,True
9,Imputation type,simple


In [68]:
# # Checking for class imbalance in the target column
# df['MonthlyCharges'].value_counts()

# # If you find any classes with only one or very few samples, consider merging them or removing them
# # For example, if you want to remove rows with that class:
# df = df[df['MonthlyCharges'] != 'rare_class_value']  # Replace 'rare_class_value' with the actual class

# # Now try setting up pycaret again
# automl = setup(df, target='MonthlyCharges')

Unnamed: 0,Description,Value
0,Session id,4769
1,Target,MonthlyCharges
2,Target type,Regression
3,Original data shape,"(7043, 7)"
4,Transformed data shape,"(7043, 7)"
5,Transformed train set shape,"(4930, 7)"
6,Transformed test set shape,"(2113, 7)"
7,Numeric features,6
8,Preprocess,True
9,Imputation type,simple


In [93]:
from pycaret.classification import setup, compare_models

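# Set up a binary classification experiment with Churn as the target;
# session_id fixes the random seed so the run is reproducible
automl = setup(df, target='Churn', session_id=5906)

# Compare classification models with cross-validation and keep the one with the highest AUC
best_model = compare_models(sort='AUC')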

Unnamed: 0,Description,Value
0,Session id,5906
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 7)"
4,Transformed data shape,"(7043, 7)"
5,Transformed train set shape,"(4930, 7)"
6,Transformed test set shape,"(2113, 7)"
7,Numeric features,6
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7961,0.8391,0.5107,0.6465,0.5696,0.4388,0.4445,0.375
ada,Ada Boost Classifier,0.7947,0.8373,0.51,0.6441,0.5681,0.4361,0.4419,0.133
lr,Logistic Regression,0.7933,0.8344,0.5138,0.6383,0.5677,0.4343,0.4397,0.116
ridge,Ridge Classifier,0.7927,0.8205,0.4618,0.6569,0.5411,0.4127,0.424,0.024
lightgbm,Light Gradient Boosting Machine,0.7895,0.8238,0.5298,0.6225,0.5713,0.4332,0.4363,0.184
lda,Linear Discriminant Analysis,0.7866,0.8205,0.4977,0.6234,0.5525,0.4149,0.42,0.021
rf,Random Forest Classifier,0.7708,0.7897,0.4885,0.5822,0.5301,0.3803,0.3835,0.298
knn,K Neighbors Classifier,0.768,0.7516,0.4733,0.5776,0.5193,0.3686,0.3723,0.073
et,Extra Trees Classifier,0.7588,0.7602,0.4969,0.5502,0.5216,0.3611,0.3623,0.685
qda,Quadratic Discriminant Analysis,0.7493,0.825,0.7371,0.5197,0.6094,0.4329,0.4474,0.02
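
The choice of metric changes which model counts as "best". As a rough check, the sketch below copies five of the metric columns from the table above into a plain dictionary and picks the top model for each one. The `results` dictionary and the `best_by` helper are illustrative names, not part of PyCaret.

```python
# Cross-validated scores copied from the compare_models table above
COLUMNS = ("Accuracy", "AUC", "Recall", "Prec.", "F1")
ROWS = {
    "gbc": (0.7961, 0.8391, 0.5107, 0.6465, 0.5696),
    "ada": (0.7947, 0.8373, 0.5100, 0.6441, 0.5681),
    "lr": (0.7933, 0.8344, 0.5138, 0.6383, 0.5677),
    "ridge": (0.7927, 0.8205, 0.4618, 0.6569, 0.5411),
    "lightgbm": (0.7895, 0.8238, 0.5298, 0.6225, 0.5713),
    "lda": (0.7866, 0.8205, 0.4977, 0.6234, 0.5525),
    "rf": (0.7708, 0.7897, 0.4885, 0.5822, 0.5301),
    "knn": (0.7680, 0.7516, 0.4733, 0.5776, 0.5193),
    "et": (0.7588, 0.7602, 0.4969, 0.5502, 0.5216),
    "qda": (0.7493, 0.8250, 0.7371, 0.5197, 0.6094),
}
results = {name: dict(zip(COLUMNS, scores)) for name, scores in ROWS.items()}


def best_by(metric):
    """Return the model ID with the highest value for the given metric."""
    return max(results, key=lambda name: results[name][metric])


for metric in COLUMNS:
    print(metric, "->", best_by(metric))
# Accuracy -> gbc
# AUC -> gbc
# Recall -> qda
# Prec. -> ridge
# F1 -> qda
```

Sorting by AUC or accuracy selects the Gradient Boosting Classifier, while sorting by recall or F1 would select Quadratic Discriminant Analysis, which catches more churners at the cost of lower precision.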


In [94]:
best_model

In [71]:
df.iloc[-2:-1]

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7041,4,1,0,3,74.4,297.6,1


In [95]:
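# Note: the file is named 'XGBoost2', but the saved model is the Gradient Boosting Classifier selected above
save_model(best_model, 'XGBoost2')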

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('categorical_imputer',...
                                             criterion='friedman_mse', init=None,
                      

In [96]:
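import pickle
# Also pickle the bare estimator; unlike save_model, this file does not include the preprocessing pipeline
with open('XGBoost2.pk', 'wb') as f:
    pickle.dump(best_model, f)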

In [97]:
with open('XGBoost2.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [98]:
# Load the saved pipeline; despite the variable name, it holds the Gradient Boosting Classifier
loaded_lda = load_model('XGBoost2')

Transformation Pipeline and Model Successfully Loaded


In [76]:
# new_data=df.iloc[-2:-1]

In [99]:
# Use a copy of the full dataset (training and test rows) as stand-in "new" data
new_data = df.copy()

In [83]:
# from pycaret.regression import setup, save_model, load_model, predict_model

# # Re-run setup for regression instead of classification
# automl = setup(df, target='MonthlyCharges')

# # # Train a model (e.g., best_model)
# # best_model = compare_models()

# # # Save the trained model
# # save_model(best_model, 'saved_lda_model')

# # # Load the previously saved model
# # loaded_lda = load_model('saved_lda_model')

# # # Make predictions using the loaded model
# # predictions = predict_model(loaded_lda, data=new_data)

# # # Display the predictions
# # print(predictions)

In [100]:
# # Train a model (e.g., best_model)
# best_model = compare_models(sort='AUC')
# # final
# # Save the trained model
# save_model(best_model, 'saved_lda_model')

# # Load the previously saved model
# loaded_lda = load_model('saved_lda_model')

# Make predictions using the loaded model
predictions = predict_model(loaded_lda, data=new_data)

# Display the predictions
print(predictions)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,0.8143,0.8625,0.5388,0.693,0.6063,0.4872,0.4938


      tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
0          1             0         0              2       29.850000   
1         34             1         1              3       56.950001   
2          2             1         0              3       53.849998   
3         45             0         1              0       42.299999   
4          2             1         0              2       70.699997   
...      ...           ...       ...            ...             ...   
7038      24             1         1              3       84.800003   
7039      72             1         1              1      103.199997   
7040      11             0         0              2       29.600000   
7041       4             1         0              3       74.400002   
7042      66             1         2              0      105.650002   

      TotalCharges  Churn  prediction_label  prediction_score  
0        29.850000      0                 0            0.5026  
1      1936.300049 
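
The output above reports a `prediction_label` and a `prediction_score` rather than a churn probability. Assuming PyCaret's default behaviour, where `prediction_score` is the probability of the predicted label, a small helper can convert the pair into P(churn). The function name `churn_probability` is hypothetical.

```python
def churn_probability(prediction_label, prediction_score):
    """Convert a label/score pair into the probability of churn (label 1)."""
    # Assumption: prediction_score is the probability of the predicted label,
    # so it must be flipped when the predicted label is 0 (no churn).
    if prediction_label == 1:
        return prediction_score
    return 1.0 - prediction_score


# Row 0 of the output above: predicted label 0 with score 0.5026
print(round(churn_probability(0, 0.5026), 4))  # prints 0.4974
```

Passing `raw_score=True` to `predict_model` is another way to obtain per-class scores directly, if the installed PyCaret version supports it.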

In [101]:
from IPython.display import Code

Code('predict_churn.py')

In [103]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


predictions:
            Churn_prediction
customerID                  
9305-CKSKC                 1
1452-KNGVK                 0
6723-OKKJM                 0
7832-POPKP                 1
6348-TACGU                 0
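
The five predictions above can be checked against the true values given in the assignment, [1, 0, 0, 1, 0]. The following is a minimal sketch in plain Python; `classification_metrics` is a hypothetical helper, not a function from `predict_churn.py`.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


y_true = [1, 0, 0, 1, 0]  # true values from the assignment
y_pred = [1, 0, 0, 1, 0]  # Churn_prediction column printed above

metrics = classification_metrics(y_true, y_pred)
print(metrics)  # {'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0}
```

All five rows match, but a sample this small says little about how the model will behave on a larger batch of new customers.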


# Summary

Using PyCaret, we:


1. Loaded and prepared the churn data.
2. Compared ML algorithms and selected the best performer based on AUC.
3. Saved the best model, together with its preprocessing pipeline, to disk.
4. Created a Python script (predict_churn.py) with a function that predicts churn for each row of a dataframe.
5. Tested the function with new data.


**Results:**


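The best-performing model, a Gradient Boosting Classifier, achieved a cross-validated AUC of 0.84 (accuracy 0.80, recall 0.51). Scoring the saved model on the full dataset, which includes the rows it was trained on, gave an AUC of 0.86, so that figure is optimistic.

On new_churn_data.csv the model predicted [1, 0, 0, 1, 0], matching the true values [1, 0, 0, 1, 0].

The AUC is solid, but a recall of about 0.51 means the model misses roughly half of the customers who churn, and five new rows are too few to confirm how well it generalizes.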

**Recommendations:**
1. Monitor model performance on new data.
2. Continuously collect and incorporate new data to improve model accuracy.
3. Explore feature engineering and hyperparameter tuning for further improvements.


