Use PyCaret to find the ML algorithm that performs best on the data.

In [44]:
import pandas as pd
from pycaret.classification import setup, compare_models, save_model, load_model, predict_model

Choose the metric you think is best for selecting a model. PyCaret ranks models by accuracy by default, but AUC, precision, recall, and others can be used instead. The week 3 FTE has some information on these metrics.
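To make the trade-off between these metrics concrete, here is a minimal sketch that computes accuracy, precision, recall, and F1 from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical and chosen only for illustration; they do not come from the churn data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical case: 100 customers, 20 of whom churn, and a model that
# predicts "no churn" for everyone.
metrics = classification_metrics(tp=0, fp=0, fn=20, tn=80)
print(metrics)
```

In this made-up case accuracy looks respectable while recall is zero, which is why accuracy alone can be a poor guide when churners are the minority class.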

In [45]:
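def load_and_clean_data(file_path):
    """Load the churn CSV, drop unused columns, and remove rows with missing values."""
    df = pd.read_csv(file_path)
    # Drop the two header-less columns (pandas names them 'Unnamed: N') and the customerID identifier
    df_cleaned = df.drop(columns=['Unnamed: 12', 'Unnamed: 13', 'customerID'])
    df_cleaned = df_cleaned.dropna()
    return df_cleaned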

Save the model to disk. Create a Python script/file/module with a function that takes a pandas DataFrame as input and returns the probability of churn for each row in the DataFrame.

In [46]:
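def train_model(df_cleaned):
    """Compare PyCaret classifiers on df_cleaned and save the top-ranked one to disk."""
    # session_id fixes the random seed so the comparison is reproducible
    setup(data=df_cleaned, target='Churn_Yes', session_id=123)
    # compare_models() ranks by accuracy unless a different sort metric is given
    best_model = compare_models()
    # Writes best_churn_model.pkl (preprocessing pipeline plus model)
    save_model(best_model, 'best_churn_model')
    print("Model trained and saved successfully!")
    return best_model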

In [47]:
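def predict_churn(new_data_path):
    """Return the rows of the CSV at new_data_path that the saved model predicts will churn."""
    model = load_model('best_churn_model')
    new_data = pd.read_csv(new_data_path)
    predictions = predict_model(model, data=new_data)
    # PyCaret 3.x names the predicted-label column 'prediction_label';
    # older releases used 'Label', so accept either.
    label_col = 'prediction_label' if 'prediction_label' in predictions.columns else 'Label'
    churn_predictions = predictions[predictions[label_col] == 1]
    return churn_predictions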
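
The prompt asks for the probability of churn for each row, while the function above returns only the rows predicted to churn. A minimal sketch of the conversion follows. It assumes that the score column returned by `predict_model` holds the probability of the predicted class, so the score has to be flipped for rows predicted as non-churn. The helper name `churn_probability` and the sample values are hypothetical.

```python
def churn_probability(labels, scores, positive_label=1):
    """Convert (predicted label, score of the predicted label) pairs into P(churn).

    Assumption: each score is the probability of the *predicted* class, so rows
    predicted as non-churn need 1 - score to obtain the churn probability.
    """
    return [
        float(score) if label == positive_label else 1.0 - float(score)
        for label, score in zip(labels, scores)
    ]


# Hypothetical predictions, for illustration only
labels = [1, 0, 0]
scores = [0.9, 0.75, 0.5]
probabilities = churn_probability(labels, scores)
print(probabilities)
```

Because the helper only iterates over its arguments, it should also accept two DataFrame columns, for example `predictions['prediction_label']` and `predictions['prediction_score']` (column names assumed for PyCaret 3.x).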

Test your Python module and function with the new data, new_churn_data.csv.

In [54]:
def test_model():
    new_data_path = 'new_churn_data.csv'
    churn_predictions = predict_churn(new_data_path)
    print("Customers predicted to churn:")
    print(churn_predictions)
    true_values = [1, 0, 0, 1, 0]
    print(f"True values: {true_values}")
    return churn_predictions
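
The test function prints the true values next to the predictions but does not score them. One way to quantify the agreement is sketched below. The helper `compare_to_truth` and the predicted labels are hypothetical; only `true_values` is taken from the cell above.

```python
def compare_to_truth(predicted, actual):
    """Return (accuracy, indices where the prediction differs from the true value)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    mismatches = [i for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]
    accuracy = (len(actual) - len(mismatches)) / len(actual)
    return accuracy, mismatches


true_values = [1, 0, 0, 1, 0]
# Hypothetical model output, NOT the result of running the saved model
hypothetical_predictions = [1, 0, 0, 0, 0]

accuracy, mismatches = compare_to_truth(hypothetical_predictions, true_values)
print(f"Accuracy: {accuracy:.2f}, mismatched rows: {mismatches}")
```

This needs one predicted label per row of the new data, so it would be applied to the full `predict_model` output rather than to the filtered churn-only rows.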

In [55]:
df_cleaned = load_and_clean_data("../cleaned_churn_data.csv")

In [56]:
best_model = train_model(df_cleaned)

Index,Description,Value
0,Session id,123
1,Target,Churn_Yes
2,Target type,Binary
3,Original data shape,"(7043, 11)"
4,Transformed data shape,"(7043, 11)"
5,Transformed train set shape,"(4930, 11)"
6,Transformed test set shape,"(2113, 11)"
7,Numeric features,9
8,Categorical features,1
9,Preprocess,True


ID,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.771,0.8142,0.4382,0.5933,0.5028,0.3587,0.3663,0.163
knn,K Neighbors Classifier,0.7637,0.7445,0.4396,0.5706,0.4959,0.3453,0.3506,0.047
ridge,Ridge Classifier,0.7527,0.771,0.3876,0.5482,0.4532,0.2999,0.3078,0.032
et,Extra Trees Classifier,0.7473,0.7429,0.4182,0.5315,0.4671,0.3048,0.3091,0.143
gbc,Gradient Boosting Classifier,0.7462,0.7324,0.4328,0.527,0.4746,0.3097,0.3126,0.188
rf,Random Forest Classifier,0.746,0.7331,0.4167,0.5276,0.465,0.3017,0.3056,0.177
lightgbm,Light Gradient Boosting Machine,0.746,0.7299,0.4358,0.5265,0.476,0.3106,0.3135,0.125
svm,SVM - Linear Kernel,0.7456,0.7266,0.3935,0.6078,0.4266,0.2872,0.3253,0.036
lda,Linear Discriminant Analysis,0.7424,0.7587,0.4182,0.5182,0.4622,0.2956,0.2988,0.03
ada,Ada Boost Classifier,0.742,0.7362,0.4297,0.5172,0.468,0.3001,0.303,0.1


Transformation Pipeline and Model Successfully Saved
Model trained and saved successfully!
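
The ranking above depends on the metric. As a small sketch, the leaderboard values can be copied into plain Python and re-ranked per metric. The names `LEADERBOARD`, `METRICS`, and `best_by` are made up for this illustration; the numbers are the ones printed in the table.

```python
# Rows copied from the compare_models() output: (id, Accuracy, AUC, Recall, Prec., F1)
LEADERBOARD = [
    ("lr", 0.771, 0.8142, 0.4382, 0.5933, 0.5028),
    ("knn", 0.7637, 0.7445, 0.4396, 0.5706, 0.4959),
    ("ridge", 0.7527, 0.771, 0.3876, 0.5482, 0.4532),
    ("et", 0.7473, 0.7429, 0.4182, 0.5315, 0.4671),
    ("gbc", 0.7462, 0.7324, 0.4328, 0.527, 0.4746),
    ("rf", 0.746, 0.7331, 0.4167, 0.5276, 0.465),
    ("lightgbm", 0.746, 0.7299, 0.4358, 0.5265, 0.476),
    ("svm", 0.7456, 0.7266, 0.3935, 0.6078, 0.4266),
    ("lda", 0.7424, 0.7587, 0.4182, 0.5182, 0.4622),
    ("ada", 0.742, 0.7362, 0.4297, 0.5172, 0.468),
]
METRICS = ("Accuracy", "AUC", "Recall", "Prec.", "F1")


def best_by(metric):
    """Return the id of the model with the highest value for the given metric."""
    column = METRICS.index(metric) + 1  # +1 skips the id column
    return max(LEADERBOARD, key=lambda row: row[column])[0]


for metric in METRICS:
    print(f"{metric:>8}: {best_by(metric)}")
```

Logistic Regression leads on accuracy, AUC, and F1, while K Neighbors has marginally higher recall and the linear SVM has the highest precision, so the choice of metric can change which model is selected.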


Summary

In [None]:
In this work, we used the PyCaret machine learning library to develop a predictive model for customer churn from a dataset of customer attributes. We first cleaned the dataset by dropping unneeded columns and removing rows with missing values. We then set up a PyCaret classification environment with Churn_Yes as the target and compared the available models. Ranked by accuracy, the default metric, Logistic Regression came first with an accuracy of 0.771 and an AUC of 0.8142, although its recall was only 0.4382. The best model was saved to disk for future use.

A Python function was created to load the saved model and predict which customers in a new dataset (new_churn_data.csv) are likely to churn. The function returns only the customers predicted to churn, that is, the rows where the model's predicted label is 1. A test function runs this prediction on the new data and prints the result alongside the true values provided in the prompt. Together, these steps cover building, training, saving, and reusing a machine learning model with PyCaret to predict customer churn.

In [None]:
Reference
OpenAI. (2024). ChatGPT (Oct 1 version) [Large language model]. https://chat.openai.com/chat