
# Credit Approval — Exploratory & Training Notebook (Rich Comments)

**Project:** `credit-approval-classifier`  
**Generated:** 2025-11-08 21:47  
**Environment (recommended):** `conda activate credit-approval-env`  
**Default data path:** `../data/CreditData.csv`

This notebook is a **well-documented companion** to the script-based project. It demonstrates:
- Loading the dataset (`Approved` with labels `Yes/No`)
- Building **Logistic Regression** and **Decision Tree** pipelines
- Explaining key parameter choices (`handle_unknown`, `sparse_output`, feature scaling)
- Evaluating with confusion matrices & metrics
- Extracting and interpreting Logistic Regression coefficients



## 1) Imports and design notes

- **Pipeline steps are tuples**: scikit-learn expects an ordered list of `(name, estimator)` pairs; the names let us reference individual steps, and the order defines execution.
- **`OneHotEncoder(handle_unknown='ignore', sparse_output=False)`**:
  - `handle_unknown='ignore'` → categories not seen during fitting are encoded as all zeros instead of raising an error.
  - `sparse_output=False` → dense arrays for easier inspection and concatenation (requires scikit-learn 1.2 or newer; older releases call this parameter `sparse`). Keep the sparse default when the one-hot matrix would be very wide.


In [None]:

import os
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay,
    accuracy_score, precision_score, recall_score,
    classification_report
)

print("Matplotlib backend:", matplotlib.get_backend())



## 2) Load data
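This cell looks for `../data/CreditData.csv` relative to this notebook and falls back to `/mnt/data/CreditData.csv` if that file is missing. Adjust the paths if your data lives elsewhere.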
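
If more than two candidate locations are needed, the lookup can be generalized. The sketch below uses a hypothetical helper, `first_existing`, that is not part of the project, and demonstrates it against a temporary directory rather than the real data folder.

```python
import os
import tempfile


def first_existing(paths):
    """Return the first path that exists on disk, or None if none do."""
    for p in paths:
        if os.path.exists(p):
            return p
    return None


# Demonstration with a throwaway file (stand-in for the real CSV).
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, 'CreditData.csv')
    with open(real, 'w') as fh:
        fh.write('Approved\nYes\n')
    missing = os.path.join(tmp, 'nope', 'CreditData.csv')

    found = first_existing([missing, real])   # skips the missing path
    nothing = first_existing([missing])       # no candidate exists

print(found, nothing)
```

Returning `None` lets the caller raise a clear error message before `pd.read_csv` is ever called.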


In [None]:

default_rel = os.path.join('..', 'data', 'CreditData.csv')
fallback_abs = '/mnt/data/CreditData.csv'
csv_path = default_rel if os.path.exists(default_rel) else fallback_abs

print("Loading from:", csv_path)
df = pd.read_csv(csv_path)

TARGET = 'Approved'
if TARGET not in df.columns:
    raise ValueError(f"Expected target '{TARGET}' not found. Columns: {df.columns.tolist()}")

print("Shape:", df.shape)
display(df.head())
print("Target distribution:\n", df[TARGET].value_counts(dropna=False))



## 3) Split and type detection
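
Passing `stratify=y` keeps the class proportions the same in the training and test sets, which matters when `Yes` and `No` are imbalanced. A minimal sketch on made-up data (the 70/30 split of labels is an assumption for illustration, not a property of the real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target (hypothetical): 70 'Yes' rows and 30 'No' rows.
y_toy = pd.Series(['Yes'] * 70 + ['No'] * 30)
X_toy = pd.DataFrame({'x': range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both splits keep the original 70/30 class balance.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```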


In [None]:

X = df.drop(columns=[TARGET])
y = df[TARGET].astype(str)

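# Treat every non-numeric column as categorical so that no column is silently
# dropped by the ColumnTransformer (its default is remainder='drop').
categorical_cols = X.select_dtypes(exclude=[np.number]).columns.tolist()
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()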

print("Categorical:", categorical_cols)
print("Numeric    :", numeric_cols)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)



## 4) Preprocessing & Pipelines
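
Step names are what make a pipeline addressable: they give access to the underlying estimators and allow parameters to be set with the `<step>__<param>` convention. A minimal sketch, using illustrative step names `scale` and `model` and an arbitrary value for `C`:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# Steps run in list order and can be looked up by name.
step_names = [name for name, _ in pipe.steps]
model = pipe.named_steps['model']

# Nested parameters use '<step>__<param>'; 0.5 is an arbitrary example value.
pipe.set_params(model__C=0.5)

print(step_names)
print(model.C)
```

The same naming convention is what tools such as `GridSearchCV` use for parameter grids over pipelines.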


In [None]:

categorical_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
numeric_scaler = StandardScaler()

preprocess_for_logreg = ColumnTransformer(
    transformers=[
        ('cat', categorical_encoder, categorical_cols),
        ('num', numeric_scaler, numeric_cols),
    ]
)

preprocess_for_tree = ColumnTransformer(
    transformers=[
        ('cat', categorical_encoder, categorical_cols),
        ('num', 'passthrough', numeric_cols),
    ]
)

logreg_clf = Pipeline(steps=[
    ('preprocess', preprocess_for_logreg),
    ('model', LogisticRegression(max_iter=1000))
])

tree_clf = Pipeline(steps=[
    ('preprocess', preprocess_for_tree),
    ('model', DecisionTreeClassifier(random_state=42))
])



## 5) Fit models


In [None]:

logreg_clf.fit(X_train, y_train)
tree_clf.fit(X_train, y_train)
print("Done fitting.")



## 6) Evaluate on test set
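
With `labels=['Yes', 'No']`, the confusion matrix rows are the true classes and the columns are the predictions, so precision and recall for `Yes` can be read off directly. A worked sketch on made-up labels (not results from the credit data):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only.
y_true = ['Yes', 'Yes', 'Yes', 'No', 'No',  'No',  'No', 'Yes']
y_hat  = ['Yes', 'No',  'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes']

# Layout with labels=['Yes', 'No']:
# [[TP, FN],
#  [FP, TN]]
cm = confusion_matrix(y_true, y_hat, labels=['Yes', 'No'])
tp, fn = cm[0]
fp, tn = cm[1]

precision = tp / (tp + fp)  # of the predicted 'Yes', how many were right
recall = tp / (tp + fn)     # of the true 'Yes', how many were found

# The hand-computed values agree with scikit-learn's helpers.
sk_precision = precision_score(y_true, y_hat, pos_label='Yes')
sk_recall = recall_score(y_true, y_hat, pos_label='Yes')
print(cm, precision, recall)
```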


In [None]:

def evaluate_model(name, pipeline, X_test, y_test, positive_label='Yes'):
    y_pred = pipeline.predict(X_test)
    other = [lab for lab in sorted(y_test.unique()) if lab != positive_label]
    neg_label = other[0] if other else 'No'
    labels = [positive_label, neg_label]

    cm = confusion_matrix(y_test, y_pred, labels=labels)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, pos_label=positive_label, zero_division=0)
    rec = recall_score(y_test, y_pred, pos_label=positive_label, zero_division=0)

    print(f"\n=== {name} ===")
    print(f"Accuracy : {acc:.4f}")
    print(f"Precision: {prec:.4f} (pos='{positive_label}')")
    print(f"Recall   : {rec:.4f} (pos='{positive_label}')\n")
    print(classification_report(y_test, y_pred, zero_division=0))

    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    fig, ax = plt.subplots(figsize=(5, 4))
    disp.plot(ax=ax, colorbar=False)
    plt.title(f'{name} — Confusion Matrix')
    fig.tight_layout()
    plt.show()

    return {'model': name, 'accuracy': acc, 'precision': prec, 'recall': rec, 'cm': cm}

metrics_logreg = evaluate_model('Logistic Regression', logreg_clf, X_test, y_test, positive_label='Yes')
metrics_tree  = evaluate_model('Decision Tree',        tree_clf,  X_test, y_test, positive_label='Yes')



## 7) Logistic Regression — coefficients with explanations
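
For a binary problem, `coef_` holds the log-odds weights for `classes_[1]`, the label that sorts last (`Yes` when the labels are `Yes`/`No`). Exponentiating a coefficient gives the factor by which the odds of that class change per one-unit increase in the feature. A minimal sketch on made-up one-feature data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): larger x makes 'Yes' more likely.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array(['No', 'No', 'No', 'Yes', 'Yes', 'Yes'])

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

positive_class = clf.classes_[1]   # classes_ is sorted: ['No', 'Yes']
coef = clf.coef_.ravel()[0]
odds_ratio = np.exp(coef)

# Check: moving x from 2 to 3 multiplies the odds of 'Yes' by odds_ratio.
p = clf.predict_proba(np.array([[2.0], [3.0]]))[:, 1]
odds = p / (1 - p)
ratio = odds[1] / odds[0]

print(positive_class, coef, odds_ratio, ratio)
```

For features passed through `StandardScaler`, "one unit" means one standard deviation of the original feature.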


In [None]:

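# Feature order matches the ColumnTransformer: one-hot columns first, then numerics.
ohe = logreg_clf.named_steps['preprocess'].named_transformers_['cat']
cat_feature_names = ohe.get_feature_names_out().tolist()
num_feature_names = numeric_cols
all_features = cat_feature_names + num_feature_names

# coef_ holds log-odds weights for classes_[1] (the label that sorts last; 'Yes' for Yes/No).
coef = logreg_clf.named_steps['model'].coef_.ravel()
coef_df = (
    pd.DataFrame({'feature': all_features, 'coefficient': coef})
    .sort_values('coefficient', ascending=False)
    .reset_index(drop=True)
)
# exp(coefficient) is the odds ratio; for scaled numerics it is per one standard deviation.
coef_df['odds_ratio'] = np.exp(coef_df['coefficient'])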

display(coef_df.head(15))
display(coef_df.tail(15))



## 8) Decision Tree — visualize (depth=4)
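
The `max_depth=4` argument to `plot_tree` limits only what is drawn; the fitted tree keeps its full depth. To actually restrict the model, `max_depth` has to be set on `DecisionTreeClassifier` itself. The sketch below illustrates the difference on synthetic data from `make_classification`, using `export_text` as a text-based stand-in for `plot_tree`:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data (not the credit dataset).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=0
)

# Unrestricted tree: grows until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)
depth_before = full_tree.get_depth()

# Limiting the *display* depth does not modify the fitted tree.
shown = export_text(full_tree, max_depth=2)
depth_after = full_tree.get_depth()

# Limiting the *model* depth does.
capped_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_toy, y_toy)

print(shown)
print(depth_before, depth_after, capped_tree.get_depth())
```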


In [None]:

# Reuse the already-fitted pipeline steps instead of refitting them.
ct = tree_clf.named_steps['preprocess']
fitted_tree = tree_clf.named_steps['model']
feature_names = ct.get_feature_names_out().tolist()

plt.figure(figsize=(20, 12))
plot_tree(
    fitted_tree,
    feature_names=feature_names,
    class_names=fitted_tree.classes_.tolist(),
    filled=False, rounded=True, proportion=True, max_depth=4
)
plt.title('Decision Tree (visualized to depth=4)')
plt.tight_layout()
plt.show()



## 9) Wrap-up and gotchas
- Pipelines use tuple `(name, estimator)` steps by design (ordered & addressable).
- `handle_unknown='ignore'` makes OHE robust to unseen categories.
- `sparse_output=False` is convenient for dense workflows; switch to sparse for huge cardinalities.
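- Scale numeric features for Logistic Regression; trees split on thresholds, so scaling does not change their splits.
- For servers/WSL without a display, select a headless backend (e.g., `matplotlib.use("Agg")`) before creating any figures, and save plots to files instead of calling `plt.show()`.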
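
A minimal sketch of the headless workflow, assuming the plot should be written to a PNG; the output file name is a made-up example:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # select the non-interactive backend before plotting
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title("Headless render")

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "headless_example.png")
fig.savefig(out_path)
plt.close(fig)  # free the figure's memory

backend = matplotlib.get_backend()
print(backend, out_path)
```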
