# Baseline Model – Logistic Regression

## Objective
Build an interpretable baseline model to predict credit card default.
Logistic regression is widely used in credit risk due to its transparency and stability.


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
df = pd.read_csv(r"C:\Users\zouta\Desktop\Projets\Default Of Credit Cards Clients\data\processed\credit_default_features.csv")
df.shape

(30000, 29)

In [4]:
target = "default payment next month"

X = df.drop(columns=[target])
y = df[target]

X.shape, y.value_counts(normalize=True)


((30000, 28),
 default payment next month
 0    0.7788
 1    0.2212
 Name: proportion, dtype: float64)

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)


In [6]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [7]:
model = LogisticRegression(
    max_iter=1000,
    class_weight="balanced"
)

model.fit(X_train_scaled, y_train)


| Parameter | Value |
| --- | --- |
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | 'balanced' |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 1000 |


In [8]:
y_pred = model.predict(X_test_scaled)

cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)

# Print both objects so the report renders as a table instead of an escaped string
print(cm)
print(cr)


[[3486 1187]
 [ 490  837]]
              precision    recall  f1-score   support

           0       0.88      0.75      0.81      4673
           1       0.41      0.63      0.50      1327

    accuracy                           0.72      6000
   macro avg       0.65      0.69      0.65      6000
weighted avg       0.77      0.72      0.74      6000

## Baseline Model Interpretation

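- With `class_weight="balanced"`, the model trades precision for recall on the default class: it catches 63% of defaulters (837 of 1,327) at a precision of 0.41, which reduces false negatives (missed defaults).
- Logistic regression provides a transparent, linear decision boundary: each coefficient is the change in the log-odds of default per unit of its standardized feature.
- These results serve as a reference point for more complex models.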

This baseline establishes a risk-aware benchmark for subsequent modeling steps.
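
The transparency point above can be illustrated by reading the fitted coefficients as odds ratios. The following is a minimal sketch on synthetic stand-in data, not the notebook's dataset: `X_demo`, `y_demo`, `demo_model`, and the `feature_i` names are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 5 hypothetical features, roughly 22% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.78, 0.22], random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# Same preprocessing and estimator settings as the baseline cell.
X_demo_scaled = StandardScaler().fit_transform(X_demo)
demo_model = LogisticRegression(max_iter=1000, class_weight="balanced")
demo_model.fit(X_demo_scaled, y_demo)

# With standardized inputs, each coefficient is the change in the log-odds
# of the positive class per one standard deviation of the feature;
# exponentiating it gives the corresponding odds ratio.
coef_table = pd.DataFrame({
    "feature": feature_names,
    "coefficient": demo_model.coef_[0],
    "odds_ratio": np.exp(demo_model.coef_[0]),
}).sort_values("coefficient", key=np.abs, ascending=False)

print(coef_table)
```

On the real data, the same table could presumably be built from `model.coef_[0]` and `X.columns`, since the scaler preserves column order.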


In [9]:
y_proba = model.predict_proba(X_test_scaled)[:, 1]


In [11]:
import numpy as np
from sklearn.metrics import recall_score, precision_score

thresholds = np.arange(0.1, 0.6, 0.05)

results = []

for t in thresholds:
    y_pred_t = (y_proba >= t).astype(int)
    recall = recall_score(y_test, y_pred_t)
    precision = precision_score(y_test, y_pred_t)
    results.append((t, recall, precision))

results


[(np.float64(0.1), 0.9954785229841748, 0.2227655986509275),
 (np.float64(0.15000000000000002), 0.9713639788997739, 0.2250742098830103),
 (np.float64(0.20000000000000004), 0.9434815373021854, 0.23275701803309165),
 (np.float64(0.25000000000000006), 0.920120572720422, 0.2396937573616019),
 (np.float64(0.30000000000000004), 0.8673700075357951, 0.24913419913419912),
 (np.float64(0.3500000000000001), 0.8168801808590807, 0.26739023186975824),
 (np.float64(0.40000000000000013), 0.7648831951770911, 0.3008298755186722),
 (np.float64(0.45000000000000007), 0.6910324039186134, 0.35653188180404355),
 (np.float64(0.5000000000000001), 0.6307460437076111, 0.4135375494071146),
 (np.float64(0.5500000000000002), 0.5945742275810098, 0.44425675675675674)]

## Threshold Selection

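Lowering the decision threshold increases recall at the cost of precision:
moving from 0.50 to 0.30 raises recall on defaulters from 0.63 to 0.87 while precision drops from 0.41 to 0.25.
In credit risk, missing a defaulter is usually more costly than rejecting a good client,
so a lower threshold may be preferred.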
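
As an alternative to a hand-written threshold grid, scikit-learn's `precision_recall_curve` evaluates every distinct score as a threshold. The sketch below uses synthetic stand-in data, and the 0.80 recall floor (`target_recall`) is a hypothetical requirement chosen for illustration, not a value from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), with roughly 22% positives.
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=8, n_informative=4,
    weights=[0.78, 0.22], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
demo_model = LogisticRegression(max_iter=1000, class_weight="balanced")
demo_model.fit(X_tr, y_tr)
proba = demo_model.predict_proba(X_te)[:, 1]

# precision and recall have one more entry than thresholds; the final
# (precision=1, recall=0) point has no threshold, so drop it.
precision, recall, thresholds = precision_recall_curve(y_te, proba)
precision, recall = precision[:-1], recall[:-1]

# Hypothetical requirement: keep recall on the positive class at or above 0.80,
# then take the threshold with the highest precision among those that qualify.
target_recall = 0.80
eligible = recall >= target_recall
idx = int(np.argmax(np.where(eligible, precision, -1.0)))
chosen_threshold = thresholds[idx]

print(f"threshold={chosen_threshold:.3f} "
      f"recall={recall[idx]:.3f} precision={precision[idx]:.3f}")
```

Stating the constraint explicitly (a recall floor here) makes the choice of threshold reproducible rather than read off a table by eye.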

In [12]:
best_threshold = 0.3
y_pred_custom = (y_proba >= best_threshold).astype(int)

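# Print the matrix explicitly; a bare expression is only displayed when it is the last line of a cell
print(confusion_matrix(y_test, y_pred_custom))
print(classification_report(y_test, y_pred_custom))

[[1204 3469]
 [ 176 1151]]
              precision    recall  f1-score   support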

           0       0.87      0.26      0.40      4673
           1       0.25      0.87      0.39      1327

    accuracy                           0.39      6000
   macro avg       0.56      0.56      0.39      6000
weighted avg       0.73      0.39      0.40      6000



In [None]:
# Cost assumptions (illustrative example)
COST_FN = 5000   # defaulting client accepted
COST_FP = 500    # good client rejected

In [15]:
cm = confusion_matrix(y_test, y_pred_custom)

tn, fp, fn, tp = cm.ravel()

total_cost = fn * COST_FN + fp * COST_FP

tn, fp, fn, tp, total_cost

(np.int64(1204),
 np.int64(3469),
 np.int64(176),
 np.int64(1151),
 np.int64(2614500))

In [17]:
y_pred_default = model.predict(X_test_scaled)
cm_default = confusion_matrix(y_test, y_pred_default)

tn_d, fp_d, fn_d, tp_d = cm_default.ravel()

cost_default = fn_d * COST_FN + fp_d * COST_FP

cost_default

np.int64(3043500)

## Business Cost Evaluation

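Lowering the decision threshold from 0.50 to 0.30 cuts false negatives from 490 to 176,
which reduces the estimated cost from 3,043,500 to 2,614,500 (about 14%) under the assumed costs of 5,000 per missed default and 500 per rejected good client.

False positives rise from 1,187 to 3,469, so about 74% of good clients in the test set would be rejected.
The overall business cost is still lower than at the default threshold, but this conclusion depends on the assumed 10:1 cost ratio.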
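
The cost comparison above can be extended from two thresholds to a whole grid. This is a minimal sketch that reuses the notebook's `COST_FN` and `COST_FP` values; the helper `expected_cost`, the threshold grid, and the synthetic labels and scores (`y_true_demo`, `proba_demo`) are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FN = 5000  # defaulting client accepted (value from the notebook)
COST_FP = 500   # good client rejected (value from the notebook)


def expected_cost(y_true, proba, threshold, cost_fn=COST_FN, cost_fp=COST_FP):
    """Total cost of the errors made when classifying at `threshold`."""
    y_hat = (proba >= threshold).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted.
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return fn * cost_fn + fp * cost_fp


# Synthetic stand-in labels and scores (assumption), roughly 22% positives.
rng = np.random.default_rng(0)
y_true_demo = rng.binomial(1, 0.22, size=2000)
proba_demo = np.clip(
    0.35 + 0.25 * y_true_demo + rng.normal(0, 0.15, size=2000), 0, 1
)

# Evaluate the cost on a grid of thresholds and keep the cheapest one.
grid = np.round(np.linspace(0.05, 0.95, 19), 2)
costs = np.array([expected_cost(y_true_demo, proba_demo, t) for t in grid])
best_t = grid[costs.argmin()]

print(f"lowest-cost threshold on the demo data: {best_t:.2f} "
      f"(cost {costs.min():,})")
```

With the notebook's objects, the call would be `expected_cost(y_test, y_proba, t)`. A threshold tuned on the test set gives an optimistic cost estimate, so a separate validation split is the safer place to choose it.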

