In [17]:
import pandas as pd

df = pd.read_csv("../data/processed_credit_card_clients.csv")


In [18]:
X = df.drop("default payment next month", axis=1)
y = df["default payment next month"]

In [19]:
from sklearn.model_selection import train_test_split


In [20]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

**Why `stratify=y`**

* Class imbalance: only about 22% of clients default.

* Stratifying keeps the same class distribution in the train and test sets.
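
The effect of stratification can be checked directly. The sketch below uses synthetic labels with roughly 22% positives (an assumption that mimics the imbalance here, not the real dataset) and compares the positive rate across splits; the names `X_demo`, `y_demo`, and the `*_rate` variables are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 22% positives (illustrative, not the real dataset)
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.22).astype(int)
X_demo = rng.normal(size=(10_000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the positive rate is nearly identical in every split
overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"overall={overall_rate:.4f} train={train_rate:.4f} test={test_rate:.4f}")
```

Without `stratify`, the two rates can drift apart by chance, which matters most for the minority class.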

### Why Pipeline Was Not Used

Pipeline was not used to keep preprocessing steps explicit for learning and debugging purposes.  
Once the workflow is stable, the same logic can be refactored into a pipeline.


In [21]:
for col in df.columns:
    unique_count = df[col].nunique()
    print(f"{col}: {unique_count} unique values, dtype: {df[col].dtype}")

LIMIT_BAL: 81 unique values, dtype: int64
SEX: 2 unique values, dtype: int64
EDUCATION: 4 unique values, dtype: int64
MARRIAGE: 3 unique values, dtype: int64
AGE: 56 unique values, dtype: int64
PAY_0: 11 unique values, dtype: int64
PAY_2: 11 unique values, dtype: int64
PAY_3: 11 unique values, dtype: int64
PAY_4: 11 unique values, dtype: int64
PAY_5: 10 unique values, dtype: int64
PAY_6: 10 unique values, dtype: int64
BILL_AMT1: 22723 unique values, dtype: int64
BILL_AMT2: 22346 unique values, dtype: int64
BILL_AMT3: 22026 unique values, dtype: int64
BILL_AMT4: 21548 unique values, dtype: int64
BILL_AMT5: 21010 unique values, dtype: int64
BILL_AMT6: 20604 unique values, dtype: int64
PAY_AMT1: 7943 unique values, dtype: int64
PAY_AMT2: 7899 unique values, dtype: int64
PAY_AMT3: 7518 unique values, dtype: int64
PAY_AMT4: 6937 unique values, dtype: int64
PAY_AMT5: 6897 unique values, dtype: int64
PAY_AMT6: 6939 unique values, dtype: int64
default payment next month: 2 unique values, dtype

In [22]:
from sklearn.preprocessing import OneHotEncoder

In [23]:
cate_cols = ["SEX", "MARRIAGE"]

In [27]:
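ohe = OneHotEncoder(
    drop="first",             # drop one level per feature to avoid redundant columns
    handle_unknown="ignore",  # unseen categories become all zeros, same as the dropped level
    sparse_output=False       # return a dense NumPy array
)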

In [28]:
ohe.fit(X_train[cate_cols])

| Parameter | Value |
|---|---|
| categories | 'auto' |
| drop | 'first' |
| sparse_output | False |
| dtype | `<class 'numpy.float64'>` |
| handle_unknown | 'ignore' |
| min_frequency | None |
| max_categories | None |
| feature_name_combiner | 'concat' |


In [29]:
X_train_ohe = ohe.transform(X_train[cate_cols])
X_test_ohe = ohe.transform(X_test[cate_cols])

In [32]:
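ohe_cols = ohe.get_feature_names_out(cate_cols)

# Wrap the encoded arrays in DataFrames, keeping the original row index
# so they align with X_train / X_test when concatenated
X_train_ohe = pd.DataFrame(X_train_ohe, columns=ohe_cols, index=X_train.index)
X_test_ohe = pd.DataFrame(X_test_ohe, columns=ohe_cols, index=X_test.index)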

In [33]:
X_train = X_train.drop(columns=cate_cols)
X_test = X_test.drop(columns=cate_cols )

In [36]:
# Concatenate encoded features
X_train = pd.concat([X_train, X_train_ohe], axis=1)
X_test = pd.concat([X_test, X_test_ohe], axis=1)

In [37]:
from sklearn.preprocessing import StandardScaler
num_cols = [
    "LIMIT_BAL", "AGE",
    "BILL_AMT1", "BILL_AMT2", "BILL_AMT3",
    "BILL_AMT4", "BILL_AMT5", "BILL_AMT6",
    "PAY_AMT1", "PAY_AMT2", "PAY_AMT3",
    "PAY_AMT4", "PAY_AMT5", "PAY_AMT6"
]

In [38]:
scaler = StandardScaler()
scaler.fit(X_train[num_cols])

| Parameter | Value |
|---|---|
| copy | True |
| with_mean | True |
| with_std | True |


In [39]:
X_train[num_cols]= scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

In [41]:
X_train[num_cols].describe().loc[["mean","std"]]

| | LIMIT_BAL | AGE | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 8.437695e-17 | -1.669775e-16 | 4.204045e-17 | -9.325873e-18 | 6.306067e-17 | -5.62513e-17 | 5.033011e-17 | 6.306067e-17 | -1.184238e-17 | -1.509903e-17 | -2.960595e-19 | -1.613524e-17 | 7.105427e-18 | 1.776357e-17 |
| std | 1.000021 | 1.000021 | 1.000021 | 1.000021 | 1.000021 | 1.000021 | 1.000021 | 1.000021 | 1.000021 | 1.000021 | 1.000021 | 1.000021 | 1.000021 | 1.000021 |
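
The std of 1.000021 rather than exactly 1 is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces the gap on synthetic data, assuming a training set of 24,000 rows (inferred from the 6,000-row test set at `test_size=0.2`); the column `x` and its values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n_rows = 24_000  # assumed training size: 80% of 30,000 rows
rng = np.random.default_rng(0)
col = pd.DataFrame({"x": rng.normal(loc=50.0, scale=10.0, size=n_rows)})

scaled = pd.DataFrame(StandardScaler().fit_transform(col), columns=["x"])

pop_std = scaled["x"].std(ddof=0)   # what StandardScaler normalises to 1
sample_std = scaled["x"].std()      # pandas default, ddof=1

# The ratio between the two estimators depends only on the row count
expected = np.sqrt(n_rows / (n_rows - 1))
print(f"ddof=0: {pop_std:.6f}  ddof=1: {sample_std:.6f}  expected: {expected:.6f}")
```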


In [42]:
from sklearn.linear_model import LogisticRegression

In [43]:
lr = LogisticRegression(
    solver="liblinear",
    class_weight="balanced",
    random_state=42
)
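
`class_weight="balanced"` weights each class inversely to its frequency, using `n_samples / (n_classes * np.bincount(y))`. A minimal sketch of that weighting on made-up labels (78 negatives and 22 positives, chosen to mimic the imbalance here, not taken from the dataset):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 78 non-defaults, 22 defaults
y_demo = np.array([0] * 78 + [1] * 22)
classes = np.array([0, 1])

# Manual formula: n_samples / (n_classes * count_per_class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

# Same weights via scikit-learn's helper
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(dict(zip(classes.tolist(), manual.round(3).tolist())))
```

The minority class receives the larger weight, so misclassifying a defaulter costs more in the loss than misclassifying a non-defaulter.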

In [44]:
lr.fit(X_train,y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | 'balanced' |
| random_state | 42 |
| solver | 'liblinear' |
| max_iter | 100 |


In [45]:
lr.n_iter_

array([5], dtype=int32)

### Evaluation Metrics

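Because only about 22% of clients default, accuracy alone is misleading: a model that always predicts class 0 would still reach about 78% accuracy.  
Precision, recall, F1-score, and the confusion matrix are therefore used for evaluation.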


In [49]:
y_pred = lr.predict(X_test)


In [50]:
from sklearn.metrics import confusion_matrix

In [51]:
cm = confusion_matrix(y_test,y_pred)

In [52]:
cm

array([[3254, 1419],
       [ 500,  827]])

In [55]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.87      0.70      0.77      4673
           1       0.37      0.62      0.46      1327

    accuracy                           0.68      6000
   macro avg       0.62      0.66      0.62      6000
weighted avg       0.76      0.68      0.70      6000
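
The class-1 figures in the report can be re-derived by hand from the confusion matrix shown earlier. The sketch below hard-codes those four counts; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the notebook: rows = true class, columns = predicted class
cm = np.array([[3254, 1419],
               [500, 827]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # of predicted defaulters, how many really default
recall_1 = tp / (tp + fn)      # of real defaulters, how many were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()

# Accuracy of a trivial model that always predicts class 0
majority_baseline = (tn + fp) / cm.sum()

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f} baseline={majority_baseline:.2f}")
```

The trivial baseline scores higher accuracy than the fitted model while catching no defaulters at all, which is why accuracy is not the deciding metric here.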



### Business Interpretation

In credit default prediction, false negatives are more costly than false positives,
as approving a defaulter leads to direct financial loss.
Therefore, recall for the default class (1) is prioritized over precision.

The same priority applies to fraud detection systems, where high recall is critical to avoid
missing fraudulent transactions.
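
One common way to act on that priority is to lower the decision threshold applied to `predict_proba`, trading precision for recall. The sketch below is an illustration on synthetic data from `make_classification` (not the credit card dataset), and the thresholds 0.5, 0.4, and 0.3 are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 22% positives); not the real dataset
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.78, 0.22], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(solver="liblinear", class_weight="balanced", random_state=42)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

# Lower thresholds flag more clients as likely defaulters
results = {}
for threshold in (0.5, 0.4, 0.3):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (
        recall_score(y_te, pred),
        precision_score(y_te, pred, zero_division=0),
    )
    print(f"threshold={threshold}: recall={results[threshold][0]:.2f} "
          f"precision={results[threshold][1]:.2f}")
```

Recall can only stay the same or rise as the threshold drops, so the choice of threshold comes down to how many false positives the business is willing to accept.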
