# Lab 5 — Supervised Learning – Binary Classification

**Dataset:** Default of Credit Card Clients Dataset  
**Tools:** Python, pandas, scikit-learn, numpy, matplotlib, seaborn

## Objectives
- Understand binary classification and its practical applications.
- Train classification models and evaluate them with accuracy, precision, recall, F1-score, and ROC AUC.
- Handle class imbalance using sampling techniques.
- Visualize decision boundaries and compare classifier performance.

## Dataset Description
This dataset contains information on credit card clients and whether they defaulted on payments. It includes demographic, financial, and repayment history features.

**Key Attributes**
- `LIMIT_BAL`: Credit limit (numeric)  
- `SEX`: Gender (1 = male; 2 = female)  
- `EDUCATION`: Education level  
- `MARRIAGE`: Marital status  
- `AGE`: Age in years  
- `PAY_0`, `PAY_2` to `PAY_6`: History of past monthly payments (the CSV has no `PAY_1` column)  
- `BILL_AMT1` to `BILL_AMT6`: Amount of bill statement  
- `PAY_AMT1` to `PAY_AMT6`: Amount paid in previous months  
- `default.payment.next.month`: Target (1 = default, 0 = no default)

> **Note:** The dataset is imbalanced (about 77% non-default, 23% default).

---

## Exercises (summary)

### Exercise 1: Data Understanding and Preprocessing
- Load dataset, display shape/info/summary stats.
- Convert categorical columns (`SEX`, `EDUCATION`, `MARRIAGE`) using encoding.
- Check/handle missing values and duplicates.
- Normalize/standardize numeric features.
- Split into train/test (e.g., 80:20).
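
A common pitfall in the steps above is fitting the scaler on the full dataset before splitting, which lets test-set statistics leak into training. The sketch below splits first and then fits `StandardScaler` on the training rows only. The data frame `demo` and its values are synthetic stand-ins, not the real credit data, and the column choice is only an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AGE": rng.integers(21, 70, size=1000),
    "LIMIT_BAL": rng.integers(10_000, 500_000, size=1000),
    "target": rng.choice([0, 1], size=1000, p=[0.77, 0.23]),
})

X_demo = demo[["AGE", "LIMIT_BAL"]]
y_demo = demo["target"]

# 80:20 split; stratify keeps the class ratio (nearly) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)
X_te_sc = scaler.transform(X_te)

print(X_tr_sc.shape, X_te_sc.shape)
print("train default rate:", round(y_tr.mean(), 3))
print("test default rate:", round(y_te.mean(), 3))
```

The same pattern applies to `MinMaxScaler`: call `fit_transform` on the training split and `transform` on the test split.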

### Exercise 2: Model Training and Evaluation
Train and evaluate these classifiers:
- Logistic Regression
- K-Nearest Neighbors (use elbow method to pick K)
- Decision Tree
- Random Forest
- SVM

For each model:
- Train on training set.
- Predict on test set.
- Evaluate with confusion matrix, accuracy, precision, recall, F1-score, ROC AUC.
- Plot ROC curve and include classification report.
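
To make the metrics listed above concrete, the following sketch derives accuracy, precision, recall, and F1 by hand from a confusion matrix. The labels are made up for illustration (6 non-defaults, 4 defaults) and are not results from the lab dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (made up for illustration): 6 non-defaults, 4 defaults.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# For binary labels, ravel() returns the cells in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted defaults, how many were real
recall = tp / (tp + fn)      # of the real defaults, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy is 0.60 even though only one of the four defaulters is caught (recall 0.25), which is why accuracy alone is a weak guide on imbalanced data.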

### Exercise 3: Handling Class Imbalance
- Check target distribution.
- Apply SMOTE to balance dataset.
- Retrain Logistic Regression and Random Forest on balanced data.
- Compare performance before/after balancing (F1, Recall, AUC).
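
The idea behind SMOTE can be sketched in a few lines: each synthetic sample is placed on the line segment between a minority-class point and one of its nearest minority-class neighbours. The function `smote_sketch` below is a simplified, hypothetical illustration of that interpolation step, not the `imblearn` implementation, and the minority points are randomly generated.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE-style oversampling (illustrative only)."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)          # random minority rows
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one neighbour per row
    gap = rng.random((n_new, 1))                            # position along the segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(loc=2.0, size=(23, 2))   # hypothetical minority-class points
X_new = smote_sketch(X_minority, n_new=54)       # 23 + 54 = 77 matches a 77-row majority
print(X_new.shape)
```

Because every synthetic point is an interpolation, it stays inside the region already covered by the minority class. Resampling should be applied to the training split only, so the test set keeps the original class ratio.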

### Exercise 4: Model Comparison Table
Create two tables summarizing performance *before* and *after* SMOTE for each model:
- Accuracy, Precision, Recall, F1-Score, ROC AUC.
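
One possible way to assemble such a table is to collect one dictionary of metrics per model and pass the list to `pd.DataFrame`. The sketch below uses synthetic data from `make_classification` and a hypothetical helper `score_row`; the numbers it prints are not results for the credit dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 77% / 23%), for illustration only.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.77, 0.23], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

def score_row(name, model):
    """Fit a model and return its test metrics as one table row."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_te, pred),
        "Precision": precision_score(y_te, pred, zero_division=0),
        "Recall": recall_score(y_te, pred),
        "F1 Score": f1_score(y_te, pred),
        "ROC AUC": roc_auc_score(y_te, proba),
    }

rows = [
    score_row("Logistic Regression", LogisticRegression(max_iter=1000)),
    score_row("Random Forest", RandomForestClassifier(n_estimators=50, random_state=0)),
]
table = pd.DataFrame(rows).set_index("Model").round(3)
print(table)
```

Building a second table from models trained on the resampled training set gives the before/after comparison side by side.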

---

## Submission Guidelines
- Submit Jupyter notebook as PDF.
- Each section should include the question, code, output, and brief explanation.
- Notebook header/footer as specified in lab brief.

---

## Knowledge Check (pick any 5)
- Why is Recall important in a credit default scenario?
- What is the ROC curve and what does AUC signify?
- How does SMOTE handle class imbalance?
- When would you prefer Random Forest over Logistic Regression?
- What are the pros and cons of using KNN for classification tasks?


In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=UserWarning)


from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    roc_curve,
    confusion_matrix
)

In [None]:
df = pd.read_csv('datasets/UCI_Credit_Card.csv')
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
cat_cols = ['SEX', 'EDUCATION', 'MARRIAGE']

In [None]:
le = LabelEncoder()

In [None]:
df['SEX_LE'] = le.fit_transform(df[cat_cols[0]])
df['SEX_LE'].head()

In [None]:
df1 = pd.get_dummies(df, columns=['EDUCATION'], prefix='EDU', dtype='int64')

In [None]:
df1 = pd.get_dummies(df1, columns=['MARRIAGE'], prefix='MAR', dtype='int64')

In [None]:
df1.columns

In [None]:
df1.isnull().sum().sum()

In [None]:
df1.duplicated().sum()

In [None]:
sc = StandardScaler()
minMax = MinMaxScaler()

In [None]:
df1['AGE_SC'] = sc.fit_transform(df1[['AGE']])
df1['AGE_SC']

In [None]:
df1['LIMIT_BAL_SC'] = sc.fit_transform(df1[['LIMIT_BAL']])
df1['LIMIT_BAL_SC']

In [None]:
col_to_Min_Max = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

In [None]:
df1[col_to_Min_Max] = minMax.fit_transform(df1[col_to_Min_Max])

In [None]:
df1[col_to_Min_Max].min()

In [None]:
df1[col_to_Min_Max].max()

In [None]:
df1.columns

In [None]:
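# Drop the target, the row identifier, and the raw columns that already have encoded or scaled copies.
X = df1.drop(columns=['default.payment.next.month', 'ID', 'SEX', 'AGE', 'LIMIT_BAL'], errors='ignore')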
y = df1['default.payment.next.month']

In [None]:
X.shape, y.shape

In [None]:
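# 80:20 split as required by Exercise 1; stratify preserves the class ratio, random_state makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)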

In [None]:
X_train.shape, X_test.shape

In [None]:
y_train.shape, y_test.shape

In [None]:
df_m_test = pd.DataFrame()
df_m_train = pd.DataFrame()

In [None]:
def print_metrics(name, yt, yp, yproba):
    acc = accuracy_score(yt, yp)
    prec = precision_score(yt, yp)
    rec = recall_score(yt, yp)
    f1 = f1_score(yt, yp)
    auc_score = roc_auc_score(yt, yproba)

    print("=== Classification Report ===")
    print(classification_report(yt, yp))

    fpr, tpr, _ = roc_curve(yt, yproba)
    plt.figure(figsize=(8,6))
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc_score:.2f})", color='blue')
    plt.plot([0,1], [0,1], linestyle='--', color='grey')
    plt.title('ROC Curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='lower right')
    plt.grid(True)
    plt.show()

    # Plot Confusion Matrix
    cm = confusion_matrix(yt, yp)
    plt.figure(figsize=(6,5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()


    # Collect everything in a single-row DataFrame so results can be concatenated later.
    metrics_df = pd.DataFrame([{
        "Model": name,
        "y_true": yt,
        "y_predicted": yp,
        "y_proba": yproba,
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1 Score": f1,
        "ROC AUC": auc_score,
        "Confusion Matrix": cm
    }])
    
    return metrics_df


In [None]:
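# Raise max_iter: the default of 100 may stop before the solver converges on this data.
lr_model = LogisticRegression(max_iter=1000)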
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
y_proba = lr_model.predict_proba(X_test)[:, 1]
print_metrics(lr_model.__class__.__name__, y_test, y_pred, y_proba)

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
y_proba = knn.predict_proba(X_test)[:, 1]

In [None]:
print_metrics(knn.__class__.__name__, y_test, y_pred, y_proba)
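
Exercise 2 asks for the elbow method to pick K, whereas the cell above uses the default `n_neighbors=5`. A minimal sketch of the elbow search follows: it computes the cross-validated error rate for a range of odd K values. It runs on synthetic data from `make_classification`; in the lab, `X_train` and `y_train` would take the place of `X_syn` and `y_syn`, and the K range is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data for illustration only.
X_syn, y_syn = make_classification(n_samples=600, n_features=8, random_state=0)

k_values = list(range(1, 22, 2))   # odd K avoids tied votes in binary problems
error_rates = []
for k in k_values:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_syn, y_syn, cv=5)
    error_rates.append(1 - scores.mean())   # error = 1 - mean CV accuracy

best_k = k_values[int(np.argmin(error_rates))]
print("K with lowest CV error:", best_k)
```

Plotting `error_rates` against `k_values` (for example with `plt.plot`) shows the elbow: the point after which a larger K no longer reduces the error noticeably.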

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
y_proba = dt.predict_proba(X_test)[:, 1]

In [None]:
print_metrics(dt.__class__.__name__, y_test, y_pred, y_proba)

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)[:, 1]

In [None]:
print_metrics(rf.__class__.__name__, y_test, y_pred, y_proba)

In [None]:
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
y_scores = svc.decision_function(X_test)

In [None]:
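# SVC has no predict_proba by default, so pass the decision_function scores (not the stale y_proba from the previous model).
print_metrics(svc.__class__.__name__, y_test, y_pred, y_scores)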

In [None]:
df1['default.payment.next.month'].value_counts()

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
smote = SMOTE(random_state=42)

In [None]:
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
X_train_sm.shape, y_train_sm.shape

In [None]:
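# Raise max_iter: the default of 100 may stop before the solver converges on this data.
lr = LogisticRegression(max_iter=1000)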
lr.fit(X_train_sm, y_train_sm)
y_pred = lr.predict(X_test)
y_proba = lr.predict_proba(X_test)[:, 1]
print_metrics(lr.__class__.__name__, y_test, y_pred, y_proba)

In [None]:
model = RandomForestClassifier()
model.fit(X_train_sm, y_train_sm)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print_metrics(model.__class__.__name__, y_test, y_pred, y_proba)

In [None]:
model = KNeighborsClassifier()
model.fit(X_train_sm, y_train_sm)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print_metrics(model.__class__.__name__, y_test, y_pred, y_proba)

In [None]:
model = DecisionTreeClassifier()
model.fit(X_train_sm, y_train_sm)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print_metrics(model.__class__.__name__, y_test, y_pred, y_proba)

In [None]:
model = SVC()
model.fit(X_train_sm, y_train_sm)
y_pred = model.predict(X_test)
# For a binary SVC, decision_function returns a 1-D array of scores, so no column indexing is needed.
y_proba = model.decision_function(X_test)
print_metrics(model.__class__.__name__, y_test, y_pred, y_proba)