# Model Building

In this notebook, we build and evaluate machine learning models to predict customer churn. We split the data into training and testing sets, train several classifiers, compare their performance, and select the best one.


In [6]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [7]:
# Load the preprocessed dataset
data_path = "../data/processed/cleaned_data.csv"
df = pd.read_csv(data_path)

## Train-Test Split

We split the dataset into training and testing sets to evaluate the performance of our models on unseen data. This step is crucial for estimating the generalization ability of our models.

In [8]:
# Separate features and target variable
X = df.drop(columns=['customerID', 'Churn'])
y = df['Churn']

In [9]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Training set size: (5634, 19)
Testing set size: (1409, 19)
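
The split above is purely random. Churn labels are often imbalanced, so a common variant passes `stratify` to `train_test_split` so that both sets keep the same class ratio. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and the 25% positive rate are illustrative assumptions, not values taken from this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 25% positives (assumed for illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 250 + [0] * 750)

# stratify=y_demo asks the splitter to preserve the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

Applied to this notebook, the equivalent change would be adding `stratify=y` to the existing call; the set sizes would stay the same.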


## Model Training and Evaluation

We train several classifiers and evaluate each one with accuracy, precision, recall, and F1 score, along with a confusion matrix. We aim to select the model that performs best on the held-out test set.
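
As a rough illustration of the evaluation step, the sketch below computes the imported metrics for a single fitted classifier. The helper `evaluate_model` and the synthetic data from `make_classification` are assumptions made for this example; in the notebook the same calls would be applied to `X_test` and `y_test`. If `Churn` is stored as strings such as `'Yes'`/`'No'`, the `pos_label` argument of the precision, recall, and F1 functions would also need to be set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_eval, y_eval):
    """Hypothetical helper: return common binary classification metrics."""
    y_pred = model.predict(X_eval)
    return {
        "accuracy": accuracy_score(y_eval, y_pred),
        "precision": precision_score(y_eval, y_pred, zero_division=0),
        "recall": recall_score(y_eval, y_pred, zero_division=0),
        "f1": f1_score(y_eval, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_eval, y_pred),
    }


# Synthetic, mildly imbalanced binary data standing in for the churn features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit one model and report its metrics on the held-out split
clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_tr)
metrics = evaluate_model(clf, X_te, y_te)
for name, value in metrics.items():
    print(name, value, sep=": ")
```

Returning a dictionary keeps the results easy to collect into a `pandas` DataFrame when several models are compared side by side.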

In [None]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Support Vector Machine': SVC(random_state=42)
}