# Model Building

In this notebook, we build and evaluate machine learning models to predict customer churn. We split the data into training and test sets, train several classifiers, compare their performance on the test set, and select the best model.


In [24]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [25]:
# Load the preprocessed dataset
data_path = "../data/processed/cleaned_data.csv"
df = pd.read_csv(data_path)

## Train-Test Split

We split the dataset into training and test sets so that each model is evaluated on data it has not seen during training. This held-out evaluation gives an estimate of how well a model generalizes.

In [27]:
# Separate features and target variable
X = df.drop(columns=['customerID', 'Churn'])
y = df['Churn']

In [28]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Training set size: (5634, 19)
Testing set size: (1409, 19)
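
The split above is purely random. The confusion matrices later in this notebook show 373 churners among the 1,409 test rows (about 26%), so the classes are imbalanced. One option worth considering is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both subsets. The following is a minimal sketch on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins, not the churn dataset, and applying this to the notebook's own split would change the reported results.

```python
# Sketch: stratified split on synthetic data (not the churn dataset).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))              # hypothetical features
y_demo = (rng.random(1000) < 0.26).astype(int)   # roughly 26% positives

# stratify=y_demo preserves the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive rate (full): ", y_demo.mean())
print("Positive rate (train):", y_tr.mean())
print("Positive rate (test): ", y_te.mean())
```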


## Model Training and Evaluation

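We train several machine learning models and evaluate each one on the held-out test set using accuracy, precision, recall, F1 score, and the confusion matrix. Because this notebook does not set aside a separate validation set, we select the model that performs best on the test set.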

In [29]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Support Vector Machine': SVC(random_state=42)
}
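
Support vector machines and regularized logistic regression are sensitive to the scale of the input features. If the cleaned data is not already scaled (this notebook does not say either way), one common approach is to wrap the estimator in a pipeline with `StandardScaler`, so that the scaler is fitted on the training data only. The sketch below uses synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `scaled_svm` are illustrative assumptions.

```python
# Sketch: scaling inside a pipeline, shown on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_demo[:, 0] *= 1000.0  # exaggerate the scale of one feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on X_tr only, then applied to X_te at predict time
scaled_svm = make_pipeline(StandardScaler(), SVC(random_state=42))
scaled_svm.fit(X_tr, y_tr)
print("Test accuracy:", scaled_svm.score(X_te, y_te))
```

A pipeline built this way exposes the same `fit` and `predict` methods as a plain estimator, so it could be placed in a dictionary like `models` without changing the evaluation code.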

In [30]:
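# Function to evaluate model performance
def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit the model on the training set and score its predictions on the test set."""
    model.fit(X_train, y_train)
    # Predict on the held-out test set
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    # precision, recall and F1 use the default pos_label=1,
    # so Churn is assumed to be encoded as 0/1 in the cleaned data
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    return accuracy, precision, recall, f1, y_pred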
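
`evaluate_model` scores each model on a single train/test split. If a validation estimate is wanted without touching the test set, k-fold cross-validation on the training data is one option. The sketch below uses scikit-learn's `cross_validate` with the same four metrics; it runs on synthetic data, and `X_demo`, `y_demo`, and the choice of `cv=5` are assumptions made for illustration.

```python
# Sketch: 5-fold cross-validation with several metrics, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

scoring = ['accuracy', 'precision', 'recall', 'f1']
cv_results = cross_validate(
    LogisticRegression(max_iter=1000, random_state=42),
    X_demo, y_demo, cv=5, scoring=scoring,
)

# cross_validate returns one score per fold under the key 'test_<metric>'
for name in scoring:
    scores = cv_results[f'test_{name}']
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")
```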

In [31]:
# Dictionary to store evaluation results
evaluation_results = {}

In [32]:
# Evaluate each model
for model_name, model in models.items():
    accuracy, precision, recall, f1, y_pred = evaluate_model(model, X_train, y_train, X_test, y_test)
    evaluation_results[model_name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1,
        'Confusion Matrix': confusion_matrix(y_test, y_pred)
    }

In [33]:
# Display evaluation results
for model_name, metrics in evaluation_results.items():
    print(f"\n{model_name} Results:")
    print(f"Accuracy: {metrics['Accuracy']:.4f}")
    print(f"Precision: {metrics['Precision']:.4f}")
    print(f"Recall: {metrics['Recall']:.4f}")
    print(f"F1 Score: {metrics['F1 Score']:.4f}")
    print(f"Confusion Matrix:\n{metrics['Confusion Matrix']}")


Logistic Regression Results:
Accuracy: 0.8176
Precision: 0.6824
Recall: 0.5818
F1 Score: 0.6281
Confusion Matrix:
[[935 101]
 [156 217]]

Decision Tree Results:
Accuracy: 0.7246
Precision: 0.4809
Recall: 0.5067
F1 Score: 0.4935
Confusion Matrix:
[[832 204]
 [184 189]]

Random Forest Results:
Accuracy: 0.7970
Precision: 0.6629
Recall: 0.4745
F1 Score: 0.5531
Confusion Matrix:
[[946  90]
 [196 177]]

Support Vector Machine Results:
Accuracy: 0.8112
Precision: 0.6945
Recall: 0.5121
F1 Score: 0.5895
Confusion Matrix:
[[952  84]
 [182 191]]
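
The reported metrics can be re-derived from the confusion matrices, which scikit-learn lays out as `[[TN, FP], [FN, TP]]` for labels 0 and 1. The sketch below copies the four matrices from the output above, recomputes each metric, and ranks the models. Ranking by F1 score is an assumption made here for illustration; the notebook does not state which metric drives the final selection, and the helper name `metrics_from_confusion` is hypothetical.

```python
# Sketch: recompute metrics from the reported confusion matrices and rank by F1.
reported = {
    'Logistic Regression':    [[935, 101], [156, 217]],
    'Decision Tree':          [[832, 204], [184, 189]],
    'Random Forest':          [[946,  90], [196, 177]],
    'Support Vector Machine': [[952,  84], [182, 191]],
}

def metrics_from_confusion(cm):
    """Return accuracy, precision, recall and F1 for a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        'Accuracy': (tp + tn) / total,
        'Precision': tp / (tp + fp),
        'Recall': tp / (tp + fn),
        'F1 Score': 2 * tp / (2 * tp + fp + fn),
    }

summary = {name: metrics_from_confusion(cm) for name, cm in reported.items()}

# Assumed selection criterion: highest F1 score on the test set
ranked = sorted(summary, key=lambda name: summary[name]['F1 Score'], reverse=True)
for name in ranked:
    m = summary[name]
    print(f"{name}: F1={m['F1 Score']:.4f}, Recall={m['Recall']:.4f}, "
          f"Precision={m['Precision']:.4f}, Accuracy={m['Accuracy']:.4f}")

best_model_name = ranked[0]
print("Best by F1:", best_model_name)
```

Under this criterion Logistic Regression ranks first, and it also has the highest accuracy and recall of the four models. The Support Vector Machine has slightly higher precision, so a different choice of metric could change the ranking.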
