# Model Building

In this notebook, we build and evaluate machine learning models to predict customer churn. We split the data into training and testing sets, train four classifiers, compare their performance, tune the two most accurate ones, and select the best model.


In [10]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

In [2]:
# Load the preprocessed dataset
data_path = "../data/processed/cleaned_data.csv"
df = pd.read_csv(data_path)

## Train-Test Split

We split the dataset into training and testing sets to evaluate the performance of our models on unseen data. This step is crucial for estimating the generalization ability of our models.

In [3]:
# Separate features and target variable
X = df.drop(columns=['customerID', 'Churn'])
y = df['Churn']

In [4]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Training set size: (5634, 19)
Testing set size: (1409, 19)
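The split above is purely random, so the share of churners can differ slightly between the training and testing sets. One option, not used in this notebook, is to pass `stratify` so both splits keep the same class ratio. The sketch below illustrates this on hypothetical toy data (`X_demo`, `y_demo` are made up for the example and are not the churn dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 25% positive class (not the churn dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 250 + [0] * 750)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive rate (train):", y_tr.mean())
print("Positive rate (test):", y_te.mean())
```

Applying the same idea here would mean adding `stratify=y` to the existing call. That would change the split and therefore the numbers reported below.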


## Model Training and Evaluation

We train four classifiers and evaluate each on the held-out test set using accuracy, precision, recall, F1 score, and the confusion matrix. The aim is to identify the models that perform best on this test set.
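The comparison in this section relies on a single held-out split. As a complementary check, k-fold cross-validation on the training data gives a validation-style score that does not touch the test set. The following is a minimal sketch on hypothetical toy data (`X_demo`, `y_demo`, and the `candidates` dictionary are assumptions for illustration), not the procedure used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

# Hypothetical subset of candidate models for the illustration
candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

# Score each candidate with 5-fold cross-validation on the F1 metric
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='f1')
    cv_scores[name] = scores
    print(f"{name}: mean F1 = {scores.mean():.4f} (+/- {scores.std():.4f})")
```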

In [5]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Support Vector Machine': SVC(random_state=42)
}

In [6]:
# Function to evaluate model performance
def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit the model on the training data and score it on the test data.

    Returns accuracy, precision, recall, F1 score, and the test predictions.
    """
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    return accuracy, precision, recall, f1, y_pred

In [7]:
# Dictionary to store evaluation results
evaluation_results = {}

In [8]:
# Evaluate each model
for model_name, model in models.items():
    accuracy, precision, recall, f1, y_pred = evaluate_model(model, X_train, y_train, X_test, y_test)
    evaluation_results[model_name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1,
        'Confusion Matrix': confusion_matrix(y_test, y_pred)
    }

In [9]:
# Display evaluation results
for model_name, metrics in evaluation_results.items():
    print(f"\n{model_name} Results:")
    print(f"Accuracy: {metrics['Accuracy']:.4f}")
    print(f"Precision: {metrics['Precision']:.4f}")
    print(f"Recall: {metrics['Recall']:.4f}")
    print(f"F1 Score: {metrics['F1 Score']:.4f}")
    print(f"Confusion Matrix:\n{metrics['Confusion Matrix']}")


Logistic Regression Results:
Accuracy: 0.8176
Precision: 0.6824
Recall: 0.5818
F1 Score: 0.6281
Confusion Matrix:
[[935 101]
 [156 217]]

Decision Tree Results:
Accuracy: 0.7246
Precision: 0.4809
Recall: 0.5067
F1 Score: 0.4935
Confusion Matrix:
[[832 204]
 [184 189]]

Random Forest Results:
Accuracy: 0.7970
Precision: 0.6629
Recall: 0.4745
F1 Score: 0.5531
Confusion Matrix:
[[946  90]
 [196 177]]

Support Vector Machine Results:
Accuracy: 0.8112
Precision: 0.6945
Recall: 0.5121
F1 Score: 0.5895
Confusion Matrix:
[[952  84]
 [182 191]]
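To make the link between the confusion matrices and the reported metrics explicit, the sketch below recomputes the Logistic Regression scores by hand from the matrix printed above. It relies on scikit-learn's layout for binary problems, `[[TN, FP], [FN, TP]]`, and assumes churners are the positive class.

```python
# Confusion matrix for Logistic Regression, copied from the output above.
# scikit-learn's layout is [[TN, FP], [FN, TP]].
tn, fp = 935, 101
fn, tp = 156, 217

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted churners who churned
recall = tp / (tp + fn)                      # share of actual churners who were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```

The results match the reported values (0.8176, 0.6824, 0.5818, 0.6281). The recall figure corresponds to 156 of the 373 churners in the test set being missed.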


## Hyperparameter Tuning for Logistic Regression

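We tune the Logistic Regression model with a 5-fold cross-validated grid search over the inverse regularization strength C and the solver, selecting the combination with the highest mean accuracy.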


In [11]:
# Define parameter grid for Logistic Regression
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga']
}

# Initialize GridSearchCV for Logistic Regression
grid_search_lr = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42), param_grid_lr, cv=5, scoring='accuracy', n_jobs=-1)

# Perform Grid Search for Logistic Regression
grid_search_lr.fit(X_train, y_train)

# Best parameters and best score for Logistic Regression
best_params_lr = grid_search_lr.best_params_
best_score_lr = grid_search_lr.best_score_

print("Best Parameters for Logistic Regression:", best_params_lr)
print("Best Cross-Validation Accuracy for Logistic Regression:", best_score_lr)

Best Parameters for Logistic Regression: {'C': 0.1, 'solver': 'saga'}
Best Cross-Validation Accuracy for Logistic Regression: 0.7999648542713093
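The search above optimizes accuracy only. Since recall on churners is also of interest, one possible variation is to track several metrics in the same search and refit on a chosen one. The sketch below shows this on hypothetical toy data (`X_demo`, `y_demo`, and the reduced grid are assumptions for illustration), not the search that produced the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical imbalanced toy data standing in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.75, 0.25], random_state=42
)

search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid={'C': [0.01, 0.1, 1, 10]},
    cv=5,
    scoring={'accuracy': 'accuracy', 'recall': 'recall'},
    refit='recall',  # pick the final model by mean cross-validated recall
)
search.fit(X_demo, y_demo)

print("Best parameters by recall:", search.best_params_)
for params, acc, rec in zip(
    search.cv_results_['params'],
    search.cv_results_['mean_test_accuracy'],
    search.cv_results_['mean_test_recall'],
):
    print(params, f"accuracy={acc:.4f}", f"recall={rec:.4f}")
```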


## Hyperparameter Tuning for Support Vector Machine

We tune the Support Vector Machine (SVM) model with a 5-fold cross-validated grid search over C, gamma, and the kernel type, again selecting the combination with the highest mean accuracy.


In [12]:
# Define parameter grid for SVM
param_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto'],
    'kernel': ['linear', 'rbf']
}

# Initialize GridSearchCV for SVM
grid_search_svm = GridSearchCV(SVC(random_state=42), param_grid_svm, cv=5, scoring='accuracy', n_jobs=-1)

# Perform Grid Search for SVM
grid_search_svm.fit(X_train, y_train)

# Best parameters and best score for SVM
best_params_svm = grid_search_svm.best_params_
best_score_svm = grid_search_svm.best_score_

print("Best Parameters for SVM:", best_params_svm)
print("Best Cross-Validation Accuracy for SVM:", best_score_svm)


Best Parameters for SVM: {'C': 1, 'gamma': 'auto', 'kernel': 'rbf'}
Best Cross-Validation Accuracy for SVM: 0.7933971735269132
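SVMs are sensitive to feature scale. If the preprocessed data is not already standardized (an assumption, since the preprocessing step is not shown here), one common pattern is to put a scaler and the classifier in a `Pipeline` so that scaling is refit inside each cross-validation fold. The sketch below uses hypothetical toy data and a reduced grid; the step names `scaler` and `svc` are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", search.best_score_)
```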


## Final Model Evaluation

We evaluate the tuned models on the test set to estimate their performance on unseen data. Because the same test set was also used to compare the baseline models above, these figures are likely a slightly optimistic estimate of generalization rather than a strictly unbiased one.


In [13]:
# Train final Logistic Regression model with best parameters
final_model_lr = LogisticRegression(**best_params_lr, max_iter=1000, random_state=42)
final_model_lr.fit(X_train, y_train)
y_pred_final_lr = final_model_lr.predict(X_test)

# Evaluate final Logistic Regression model
final_accuracy_lr = accuracy_score(y_test, y_pred_final_lr)
final_precision_lr = precision_score(y_test, y_pred_final_lr)
final_recall_lr = recall_score(y_test, y_pred_final_lr)
final_f1_lr = f1_score(y_test, y_pred_final_lr)
final_confusion_matrix_lr = confusion_matrix(y_test, y_pred_final_lr)

print("\nFinal Logistic Regression Model Results:")
print(f"Accuracy: {final_accuracy_lr:.4f}")
print(f"Precision: {final_precision_lr:.4f}")
print(f"Recall: {final_recall_lr:.4f}")
print(f"F1 Score: {final_f1_lr:.4f}")
print(f"Confusion Matrix:\n{final_confusion_matrix_lr}")


Final Logistic Regression Model Results:
Accuracy: 0.8204
Precision: 0.6911
Recall: 0.5818
F1 Score: 0.6317
Confusion Matrix:
[[939  97]
 [156 217]]


In [14]:
# Train final SVM model with best parameters
final_model_svm = SVC(**best_params_svm, random_state=42)
final_model_svm.fit(X_train, y_train)
y_pred_final_svm = final_model_svm.predict(X_test)

# Evaluate final SVM model
final_accuracy_svm = accuracy_score(y_test, y_pred_final_svm)
final_precision_svm = precision_score(y_test, y_pred_final_svm)
final_recall_svm = recall_score(y_test, y_pred_final_svm)
final_f1_svm = f1_score(y_test, y_pred_final_svm)
final_confusion_matrix_svm = confusion_matrix(y_test, y_pred_final_svm)

print("\nFinal SVM Model Results:")
print(f"Accuracy: {final_accuracy_svm:.4f}")
print(f"Precision: {final_precision_svm:.4f}")
print(f"Recall: {final_recall_svm:.4f}")
print(f"F1 Score: {final_f1_svm:.4f}")
print(f"Confusion Matrix:\n{final_confusion_matrix_svm}")


Final SVM Model Results:
Accuracy: 0.8126
Precision: 0.7041
Recall: 0.5040
F1 Score: 0.5875
Confusion Matrix:
[[957  79]
 [185 188]]


## Conclusion

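After evaluating four models and tuning the two most accurate ones, the tuned Logistic Regression model performed best on the test set, with an accuracy of 82.04%, precision of 69.11%, recall of 58.18%, and F1 score of 63.17%. The tuned SVM was close on accuracy (81.26%) but had lower recall (50.40%). A recall of 58.18% means the Logistic Regression model still misses 156 of the 373 churners in the test set, so it predicts churn reasonably well overall but leaves clear room for improvement in catching churners.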

In future work, we can further improve model performance by exploring more advanced algorithms, applying feature engineering, and incorporating more data if available. Rebalancing the training data with techniques such as SMOTE (Synthetic Minority Over-sampling Technique) could also help, especially for recall.
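As a rough illustration of rebalancing, the sketch below uses plain random oversampling with `sklearn.utils.resample`. This is a simpler stand-in and not SMOTE itself, which synthesizes new minority samples instead of duplicating existing ones. The data (`X_demo`, `y_demo`) is hypothetical.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced toy data: 25 positives, 75 negatives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([1] * 25 + [0] * 75)

X_min, X_maj = X_demo[y_demo == 1], X_demo[y_demo == 0]

# Sample minority rows with replacement until both classes have the same size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([
    np.zeros(len(X_maj), dtype=int),
    np.ones(len(X_min_up), dtype=int),
])

print("Class counts before:", np.bincount(y_demo))
print("Class counts after:", np.bincount(y_bal))
```

Any resampling of this kind should be applied to the training set only, so that the test set keeps the original class distribution.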
