# FinRisk: Credit Risk & Fraud Detection - Notebook 5
## Phase 2: Advanced Model Optimization

**Objective:** Improve the Logistic Regression model, which is currently our best performer but still falls short of the target metrics. We will use two techniques:
1.  **Hyperparameter Tuning:** Systematically search for the best regularization settings (`C` and `penalty`).
2.  **Feature Engineering:** Add `PolynomialFeatures` so the model can capture non-linear relationships and feature interactions.

In [1]:
# ==============================================================================
# 1. Import Libraries
# ==============================================================================
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.linear_model import LogisticRegression
import xgboost as xgb

# Settings
import warnings
warnings.filterwarnings('ignore')
sns.set_style("whitegrid")

print("Libraries imported successfully.")

Libraries imported successfully.


## 2. Load Data and Re-run Preparation Steps

We load the processed data and set up our features, target, and time-based split as before.
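
As a quick sanity check on the time-based split, the sketch below applies the same sort-then-slice logic to a tiny, made-up DataFrame and confirms that no training row is dated after the earliest test row. The name `df_demo` and its dates are hypothetical and are not project data.

```python
import pandas as pd

# Hypothetical toy data: five applications in arbitrary order.
df_demo = pd.DataFrame({
    'application_date': pd.to_datetime(
        ['2023-03-01', '2023-01-15', '2023-02-10', '2023-04-20', '2023-01-01']
    ),
    'default_flag': [0, 1, 0, 0, 1],
})

# Same logic as the notebook: sort by date, then cut at 80%.
demo_sorted = df_demo.sort_values('application_date').reset_index(drop=True)
demo_split_index = int(len(demo_sorted) * 0.8)
demo_train = demo_sorted.iloc[:demo_split_index]
demo_test = demo_sorted.iloc[demo_split_index:]

# The split is leak-free if the latest training date does not exceed
# the earliest test date.
no_leakage = demo_train['application_date'].max() <= demo_test['application_date'].min()
print(f"Train rows: {len(demo_train)}, test rows: {len(demo_test)}, leak-free: {no_leakage}")
```

The same comparison can be run on `train_df` and `test_df` once they are built in the next cell.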

In [2]:
# Load data
PROCESSED_DATA_PATH = '../data/processed/'
DATA_FILE = 'credit_risk_features.csv'
df = pd.read_csv(os.path.join(PROCESSED_DATA_PATH, DATA_FILE), parse_dates=['application_date'])

# Define features (X) and target (y)
TARGET = 'default_flag'
features_to_exclude = [
    'application_id', 'customer_id', 'application_date', 'last_activity_date',
    'default_flag', 'application_status', 'city'
]
X = df.drop(columns=features_to_exclude)
y = df[TARGET]

# Identify feature types
categorical_features = X.select_dtypes(include=['object', 'category']).columns
numerical_features = X.select_dtypes(include=np.number).columns

# Time-based split
df_sorted = df.sort_values('application_date')
split_index = int(len(df_sorted) * 0.8)
train_df, test_df = df_sorted.iloc[:split_index], df_sorted.iloc[split_index:]
X_train, y_train = train_df[X.columns], train_df[TARGET]
X_test, y_test = test_df[X.columns], test_df[TARGET]

print(f"Training data shape: {X_train.shape}, Testing data shape: {X_test.shape}")

Training data shape: (80000, 26), Testing data shape: (20000, 26)


## 3. Optimization 1: Hyperparameter Tuning with GridSearchCV

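We define a grid of candidate values for Logistic Regression and use time-ordered cross-validation to pick the combination with the highest ROC AUC. We tune `C` (the inverse of regularization strength) and the `penalty` type; the `liblinear` solver is used because it supports both `l1` and `l2` penalties.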
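
To get a feel for the cost of the search, the short sketch below enumerates the grid with scikit-learn's `ParameterGrid` and counts the model fits implied by five cross-validation splits. The variable names (`demo_grid`, `total_cv_fits`) are illustrative only; the grid values mirror the ones used in the next cell.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the notebook cell below.
demo_grid = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
}
demo_n_splits = 5

# ParameterGrid expands the dictionary into every combination.
candidates = list(ParameterGrid(demo_grid))
total_cv_fits = len(candidates) * demo_n_splits

print(f"Candidate settings: {len(candidates)}")
print(f"Cross-validation fits: {total_cv_fits}")
# With the default refit=True, GridSearchCV fits the winning setting
# one more time on the full training set.
print(f"Total fits including the final refit: {total_cv_fits + 1}")
```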

In [3]:
# Define the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'
)

# Create a pipeline with the preprocessor and the classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, class_weight='balanced', solver='liblinear'))
])

# Define the parameter grid to search
# C: Controls the penalty strength. Smaller C means stronger regularization.
# penalty: 'l1' (Lasso) can perform feature selection, 'l2' (Ridge) is standard.
param_grid = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2']
}

# Use TimeSeriesSplit for cross-validation to respect the time-based nature of the data
tscv = TimeSeriesSplit(n_splits=5)

# Set up GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=tscv, scoring='roc_auc', n_jobs=-1)

print("--- Starting GridSearchCV for Logistic Regression ---")
grid_search.fit(X_train, y_train)

print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best cross-validation AUC: {grid_search.best_score_:.4f}")

# Store the best model
best_lr_tuned = grid_search.best_estimator_

--- Starting GridSearchCV for Logistic Regression ---
Best parameters found: {'classifier__C': 100, 'classifier__penalty': 'l1'}
Best cross-validation AUC: 0.6243
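
For intuition on how `TimeSeriesSplit(n_splits=5)` partitions the training data during the search, the sketch below runs it on twelve dummy rows (`X_demo` is a placeholder array, not project data). Each fold trains on an expanding window of earlier rows and validates on the block that immediately follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve placeholder rows, assumed to be already sorted by time.
X_demo = np.arange(12).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=5)
folds = list(tscv_demo.split(X_demo))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(
        f"Fold {i}: train rows 0..{train_idx[-1]} "
        f"({len(train_idx)} rows), validate rows {val_idx[0]}..{val_idx[-1]}"
    )
```

Because early folds train on far fewer rows than later ones, the per-fold AUC values can vary considerably; `grid_search.cv_results_` holds the individual fold scores if they need inspecting.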


## 4. Optimization 2: Engineering Polynomial Features

Now, let's create a new pipeline that generates interaction and polynomial features *before* the logistic regression step. With `degree=2`, each numerical feature gains a squared term and a pairwise product with every other numerical feature, so the otherwise linear model can fit curvature and interactions.

In [None]:
# Create a new pipeline with PolynomialFeatures
# We add this step ONLY for numerical features after scaling.
poly_preprocessor = ColumnTransformer(
    transformers=[
        # Apply scaling, then polynomial features
        ('num_poly', Pipeline(steps=[
            ('scaler', StandardScaler()),
            ('poly', PolynomialFeatures(degree=2, include_bias=False))
        ]), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'
)

# Reuse the best parameters found by GridSearchCV.
# Note: they were tuned without polynomial features, so they may not be optimal for this pipeline.
best_params = grid_search.best_params_
lr_poly_model = LogisticRegression(
    C=best_params['classifier__C'],
    penalty=best_params['classifier__penalty'],
    random_state=42,
    class_weight='balanced',
    solver='liblinear'
)

# Create the full pipeline
pipeline_poly = Pipeline(steps=[
    ('preprocessor', poly_preprocessor),
    ('classifier', lr_poly_model)
])

print("\n--- Training Logistic Regression with Polynomial Features ---")
pipeline_poly.fit(X_train, y_train)
print("Model trained successfully.")


--- Training Logistic Regression with Polynomial Features ---


## 5. Final Evaluation

Let's evaluate our two optimized Logistic Regression models on the held-out test set and check the better one against the success metrics (AUC > 0.75, KS > 40, Gini > 0.50).
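
Before running the evaluation, it may help to see the metrics on a hand-checkable case. The sketch below uses four made-up scores (`y_demo` and `p_demo` are hypothetical) to compute Gini as `2 * AUC - 1` and KS in two ways: as the largest gap between TPR and FPR on the ROC curve, and with the cumulative good/bad approach used by the helper in the next cell.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels (1 = default) and predicted probabilities.
y_demo = np.array([0, 0, 1, 1])
p_demo = np.array([0.1, 0.4, 0.35, 0.8])

# Gini is a linear rescaling of AUC.
auc_demo = roc_auc_score(y_demo, p_demo)
gini_demo = 2 * auc_demo - 1

# KS from the ROC curve: maximum vertical gap between TPR and FPR.
fpr_demo, tpr_demo, _ = roc_curve(y_demo, p_demo)
ks_from_roc = 100 * np.max(tpr_demo - fpr_demo)

# KS from cumulative shares after sorting by score, highest first.
order = np.argsort(-p_demo)
y_sorted = y_demo[order]
cum_bad = np.cumsum(y_sorted) / y_sorted.sum()
cum_good = np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()
ks_from_cumsum = 100 * np.max(cum_bad - cum_good)

print(f"AUC = {auc_demo:.2f}, Gini = {gini_demo:.2f}")
print(f"KS (ROC) = {ks_from_roc:.1f}, KS (cumulative) = {ks_from_cumsum:.1f}")
```

The two KS calculations agree here because all scores are distinct; with tied scores the row-by-row cumulative version can differ slightly from the ROC-based one.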

In [None]:
# Helper function for evaluation
def calculate_ks_and_gini(y_true, y_pred_proba):
    auc = roc_auc_score(y_true, y_pred_proba)
    gini = 2 * auc - 1
    data = pd.DataFrame({'y': y_true, 'p': y_pred_proba}).sort_values('p', ascending=False)
    data['good'] = (1 - data.y).cumsum() / (1 - data.y).sum()
    data['bad'] = data.y.cumsum() / data.y.sum()
    ks = (data.bad - data.good).max()
    return ks * 100, gini

# Get predictions from our new models
y_pred_tuned_lr = best_lr_tuned.predict_proba(X_test)[:, 1]
y_pred_poly_lr = pipeline_poly.predict_proba(X_test)[:, 1]

# Store results
results = {
    "Logistic Regression (Tuned)": y_pred_tuned_lr,
    "Logistic Regression (Poly Features)": y_pred_poly_lr
}

# --- Evaluate and display results ---
evaluation_summary = []
plt.figure(figsize=(12, 9))

for name, y_pred_proba in results.items():
    auc = roc_auc_score(y_test, y_pred_proba)
    ks, gini = calculate_ks_and_gini(y_test, y_pred_proba)
    evaluation_summary.append({'Model': name, 'AUC': auc, 'Gini': gini, 'KS Statistic': ks})
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})')

# Display summary table
evaluation_df = pd.DataFrame(evaluation_summary).sort_values('AUC', ascending=False)
print("--- Final Model Performance Summary ---")
print(evaluation_df)

# Check against success metrics
print("\n--- Checking Against Success Metrics ---")
best_model_stats = evaluation_df.iloc[0]
auc_check = "PASS" if best_model_stats['AUC'] > 0.75 else "FAIL"
ks_check = "PASS" if best_model_stats['KS Statistic'] > 40 else "FAIL"
gini_check = "PASS" if best_model_stats['Gini'] > 0.50 else "FAIL"

print(f"Best Model: {best_model_stats['Model']}")
print(f"AUC > 0.75: {auc_check} (Actual: {best_model_stats['AUC']:.3f})")
print(f"KS > 40: {ks_check} (Actual: {best_model_stats['KS Statistic']:.2f})")
print(f"Gini > 0.50: {gini_check} (Actual: {best_model_stats['Gini']:.3f})")

# Finalize plot
plt.plot([0, 1], [0, 1], 'k--', label='Random Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves for Optimized Models')
plt.legend()
plt.show()