# Logistic Regression Model

This notebook focuses on building and evaluating a Logistic Regression model for predicting customer churn.

- **Data Source:** `data/processed/train.csv` and `data/processed/test.csv`
- **Preprocessing:** All encoding and feature engineering were done in the `feature_engineering.ipynb` notebook.
- **Objective:** Establish a strong baseline model using logistic regression, evaluate key performance metrics (accuracy, precision, recall, F1, ROC-AUC), and document results for comparison with future models.


In [58]:
import pandas as pd

train_df = pd.read_csv("../../data/processed/train.csv")
test_df = pd.read_csv("../../data/processed/test.csv")

In [59]:
# Inspect the data to confirm it is ready for modeling

# 1. Check for missing values
print("Missing values check:")
print(train_df.isna().sum())
print(test_df.isna().sum())

# 2. Check row counts
print("\n\nRow counts:")
print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

# 3. Check target column balance
print("\n\nTarget column distribution (Train):")
print(train_df['Churn'].value_counts(normalize=True))

print("\n\nTarget column distribution (Test):")
print(test_df['Churn'].value_counts(normalize=True))

# 4. Check data types
print("\n\nData types:")
print(train_df.dtypes)

# 5. Quick sanity check: no duplicated rows
print("\n\nDuplicate rows check:")
print(f"Train duplicates: {train_df.duplicated().sum()} out of {train_df.shape[0]}")
print(f"Test duplicates: {test_df.duplicated().sum()} out of {test_df.shape[0]}")


Missing values check:
PreferredLoginDevice                               0
Gender                                             0
Complain                                           0
CityTier                                           0
SatisfactionScore                                  0
PreferredPaymentMode_Credit Card                   0
PreferredPaymentMode_Debit Card                    0
PreferredPaymentMode_E wallet                      0
PreferredPaymentMode_Unified Payments Interface    0
PreferredOrderCat_Grocery                          0
PreferredOrderCat_Laptop & Accessory               0
PreferredOrderCat_Mobile Phone                     0
PreferredOrderCat_Others                           0
MaritalStatus_Married                              0
MaritalStatus_Single                               0
Tenure(months)                                     0
WarehouseToHome                                    0
HourSpendOnApp                                     0
NumberOfDeviceRegistered

The check above shows duplicate rows in both datasets. <br>
Leaving them in can bias the model: repeated rows give some patterns extra weight simply because they appear more often.
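To make the effect of duplicates concrete, here is a minimal sketch on a toy frame. The column values are hypothetical and are not taken from `train.csv`; the point is only to show how one repeated row shifts the observed churn rate and how `drop_duplicates()` restores it.

```python
import pandas as pd

# Toy frame with hypothetical values; the first row is repeated once
toy = pd.DataFrame({
    "Tenure": [1, 1, 5, 12],
    "Complain": [1, 1, 0, 0],
    "Churn": [1, 1, 0, 0],
})

# duplicated() flags rows identical to an earlier row
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row
deduped = toy.drop_duplicates()

churn_rate_before = toy["Churn"].mean()
churn_rate_after = deduped["Churn"].mean()

print(f"Duplicates: {n_dupes}")
print(f"Churn rate before: {churn_rate_before:.3f}, after: {churn_rate_after:.3f}")
```

In this toy case the repeated churner raises the churn rate from one in three to one in two, which is the kind of distortion the deduplication step below is meant to avoid.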

In [60]:
train_df = train_df.drop_duplicates()
test_df = test_df.drop_duplicates()

print("New train size:", train_df.shape)
print("New test size:", test_df.shape)


New train size: (4149, 26)
New test size: (1107, 26)


Separating the training and test sets into feature and target variables

In [61]:
feature_train = train_df.drop(columns=["Churn"])
target_train = train_df["Churn"]

feature_test = test_df.drop(columns=["Churn"])
target_test = test_df["Churn"]

print(f"Train shape: {feature_train.shape}, Test shape: {feature_test.shape}")

Train shape: (4149, 25), Test shape: (1107, 25)


In [62]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    solver="liblinear",  # alternative: solver="lbfgs"
    # class_weight controls how the model handles imbalanced classes
    # (where one class has far more samples than the other).
    # "balanced" weights each class inversely to its frequency,
    # so the minority class gets more weight and the model pays more attention to it.
    class_weight="balanced",
    random_state=42
)
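As a rough illustration of what `class_weight="balanced"` does, the sketch below computes the weights on a toy label array (an assumed 80/20 split, not this dataset's actual distribution). scikit-learn's balanced heuristic is `n_samples / (n_classes * np.bincount(y))`, and `compute_class_weight` exposes the same calculation.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption for illustration): 8 non-churners, 2 churners
y_toy = np.array([0] * 8 + [1] * 2)
classes = np.unique(y_toy)

# Weights scikit-learn would use for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same heuristic written out by hand
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```

Here the minority class receives four times the weight of the majority class, so each misclassified churner costs the model four times as much during fitting.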



In [63]:
# Fitting/Training our model
log_reg.fit(feature_train, target_train)

In [66]:
target_pred = log_reg.predict(feature_test)

# predict_proba returns the model's estimated probability for each class; column 1 is churn
target_pred_proba = log_reg.predict_proba(feature_test)[:, 1]  # Probabilities for class=1
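The metrics named in the objective (accuracy, precision, recall, F1, ROC-AUC) could be computed along the following lines. This is a self-contained sketch on synthetic data from `make_classification`, so the numbers it prints say nothing about the churn dataset; in the notebook, `y_test`, `y_pred`, and `y_proba` would correspond to `target_test`, `target_pred`, and `target_pred_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: roughly 80/20 class split)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Same model configuration as in the notebook
model = LogisticRegression(
    solver="liblinear", class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard 0/1 labels
y_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    # ROC-AUC is computed from probabilities, not hard labels
    "roc_auc": roc_auc_score(y_test, y_proba),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that ROC-AUC takes the predicted probabilities rather than the hard labels, which is why both `predict` and `predict_proba` outputs are kept.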