In [1]:
import pandas as pd
import sqlite3
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

In [2]:
project_root = Path("..")
db_path = project_root / "data" / "database" / "churn.db"

# Read the feature table, closing the connection even if the query fails.
conn = sqlite3.connect(db_path)
try:
    df = pd.read_sql("SELECT * FROM model_features", conn)
finally:
    conn.close()

df.shape

(7043, 7)

In [3]:
X = df.drop(columns="churn")
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

In [5]:
X = df.drop(columns="churn").copy()
y = df["churn"]

numeric_cols = ["tenure", "MonthlyCharges", "TotalCharges"]

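# Coerce to numeric; values that cannot be parsed become NaN.
for col in numeric_cols:
    X[col] = pd.to_numeric(X[col], errors="coerce")

# Drop rows with missing values and keep the target aligned with X.
X = X.dropna()
y = y.loc[X.index]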

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

X.dtypes

tenure                   int64
MonthlyCharges         float64
TotalCharges           float64
is_monthly_contract      int64
has_internet             int64
has_tech_support         int64
dtype: object

In [6]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

preds = model.predict(X_test)
probs = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, preds))
roc_auc_score(y_test, probs)

              precision    recall  f1-score   support

           0       0.84      0.87      0.86      1291
           1       0.60      0.56      0.58       467

    accuracy                           0.78      1758
   macro avg       0.72      0.71      0.72      1758
weighted avg       0.78      0.78      0.78      1758



0.829351448091465

### Model performance and interpretation

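The logistic regression model achieves an overall accuracy of approximately **78%**
and a ROC-AUC of **~0.83**. Because non-churners make up about 73% of the test set
(1,291 of 1,758), accuracy alone overstates performance. The ROC-AUC is the more
informative figure: it indicates good separation between churned and non-churned
customers despite class imbalance.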
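
To put the accuracy figure in context, the short calculation below derives the majority-class baseline from the support counts in the classification report. The variable names are illustrative and not part of the notebook.

```python
# Support counts copied from the classification report above.
n_non_churn = 1291
n_churn = 467
n_total = n_non_churn + n_churn

# Accuracy of a trivial model that always predicts "no churn".
majority_baseline = n_non_churn / n_total
churn_rate = n_churn / n_total

print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"Churn rate in the test set: {churn_rate:.3f}")
```

The reported 78% accuracy is therefore only a few points above what a constant prediction would score, which is why the per-class metrics and ROC-AUC matter more here.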

Performance differs by class. The model predicts **non-churn** customers
(precision 0.84, recall 0.87) more accurately than **churn** customers
(precision 0.60, recall 0.56), which is expected given that churn represents a
smaller share of the population. At the default 0.5 decision threshold, the model
catches a little over half of actual churners, and about 60% of the customers it
flags do churn.
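
The balance between recall and false positives depends on the decision threshold applied to the predicted probabilities; for a binary problem, `model.predict` corresponds to a threshold of 0.5. The sketch below illustrates that trade-off on a small hand-made set of labels and scores. These values are hypothetical and are not the notebook's `y_test` and `probs`, and `precision_recall_at` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np


def precision_recall_at(y_true, scores, threshold):
    """Precision and recall for the positive class at a given threshold."""
    y_true = np.asarray(y_true)
    predicted = np.asarray(scores) >= threshold
    tp = int(np.sum(predicted & (y_true == 1)))
    fp = int(np.sum(predicted & (y_true == 0)))
    fn = int(np.sum(~predicted & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and predicted churn probabilities, for illustration only.
demo_labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
demo_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.70, 0.90]

for threshold in (0.5, 0.3):
    p, r = precision_recall_at(demo_labels, demo_scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises recall from 0.75 to 1.00 while precision falls from 0.75 to 0.57. The same helper could be applied to `y_test` and `probs` to choose a threshold that matches the cost of a missed churner versus an unnecessary retention offer.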

Importantly, the model is intentionally simple and interpretable. Its purpose is
not to maximize predictive accuracy, but to quantify how operational features
such as tenure, pricing, and contract structure relate to churn risk. This makes
the results directly actionable for business decisions such as targeted retention
offers, contract restructuring, or customer support interventions.
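
One way to read those relationships off a fitted model is to tabulate its coefficients alongside the corresponding odds ratios. The sketch below is an assumption about how that could be done, not a step taken in the notebook: `coefficient_table` is a hypothetical helper, and the data is a synthetic stand-in with two columns named after the real features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def coefficient_table(model, feature_names):
    """Coefficients and odds ratios, sorted by absolute coefficient size."""
    coefs = model.coef_[0]
    table = pd.DataFrame(
        {"coefficient": coefs, "odds_ratio": np.exp(coefs)},
        index=list(feature_names),
    )
    order = table["coefficient"].abs().sort_values(ascending=False).index
    return table.loc[order]


# Synthetic stand-in data (NOT the churn table): churn is made less likely
# with longer tenure and more likely on a monthly contract.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(
    {
        "tenure": rng.integers(0, 72, size=500),
        "is_monthly_contract": rng.integers(0, 2, size=500),
    }
)
logit = -0.05 * demo_X["tenure"] + 1.5 * demo_X["is_monthly_contract"]
demo_y = (rng.random(500) < 1 / (1 + np.exp(-logit))).astype(int)

demo_model = LogisticRegression(max_iter=1000).fit(demo_X, demo_y)
print(coefficient_table(demo_model, demo_X.columns))
```

Calling `coefficient_table(model, X_train.columns)` on the notebook's model would give the equivalent table for the real features. Because the features are not standardized, each odds ratio is per unit of the raw feature (one month of tenure, one dollar of monthly charge), so magnitudes are not directly comparable across columns.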

Overall, the results suggest that a substantial portion of churn risk can be
explained using standard customer attributes available in operational databases,
without requiring complex models or external data sources.