# Retail Credit Risk Scorecard

This project focuses on building a predictive model for credit risk assessment using the "Give Me Some Credit" dataset from Kaggle. The goal is to predict whether a borrower will experience serious delinquency (90+ days past due) within the next two years.

The project centers on feature engineering: deriving new predictive features from the raw financial data. The engineered features are intended to capture relationships in credit behavior, delinquency patterns, and debt burden that the raw columns express only indirectly.

### Key Strengths of This Approach

1. Comprehensive Feature Creation: Goes far beyond simple transformations to create domain-specific, meaningful features.

2. Feature Evaluation: Uses both correlation analysis and model-based importance to select the best features.

3. Data Quality: Robust handling of missing values, outliers, and data anomalies.

4. Domain Knowledge Integration: The features reflect deep understanding of credit risk factors (delinquency patterns, debt burdens, etc.).

The dataset holds one snapshot per borrower, so there are no true lagged values. The delinquency count columns stand in for payment history, and aggregate metrics built on them (like TotalPastDue and the weighted scores) address the project's goal of enhancing predictive power through feature engineering.

In [1]:
import zipfile
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix, roc_curve, precision_recall_curve, average_precision_score
from sklearn.calibration import calibration_curve
from sklearn.utils.class_weight import compute_class_weight
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings('ignore')

## Load the Data

This cell extracts the ZIP archive containing the credit data ("Give Me Some Credit.zip") and loads two CSV files:

- cs-training.csv: Training dataset
- cs-test.csv: Test dataset

In [2]:
# Define the ZIP file path and extract
zip_path = "Give Me Some Credit.zip"
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall("extracted_files")

# Load CSV files
train_data = pd.read_csv("extracted_files/cs-training.csv")
test_data = pd.read_csv("extracted_files/cs-test.csv")

## Data Loading and Initial Cleaning

The cleaning step covers:

- Filling missing values with the column median (MonthlyIncome and NumberOfDependents)
- Removing the unnecessary index column ('Unnamed: 0')
- Capping extreme values at the 1st and 99th percentiles
- Replacing age=0 with the median age
- Capping delinquency-related columns at a maximum of 20
- Adding three first-pass features: TotalPastDue, IncomePerDependent, DebtToIncome
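
As a rough illustration of the percentile capping step, the sketch below clips a small made-up series at its 1st and 99th percentiles. The values are invented for the example and do not come from the dataset.

```python
import pandas as pd

# Toy values (invented): 99 ordinary incomes plus one extreme outlier
income = pd.Series(list(range(1000, 1990, 10)) + [1_000_000], dtype=float)

lower = income.quantile(0.01)
upper = income.quantile(0.99)
capped = income.clip(lower, upper)

# The outlier is pulled down to the 99th percentile; the row count is unchanged
print(income.max(), capped.max())
```

Capping keeps every row, which matters here because the rare extreme values may belong to exactly the borrowers the model needs to score.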

In [3]:
# Data Loading and Cleaning
def load_and_clean_data():
    """Load and clean the credit risk data"""
    # Define the ZIP file path and extract
    zip_path = "Give Me Some Credit.zip"
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall("extracted_files")

    # Load CSV files
    train_data = pd.read_csv("extracted_files/cs-training.csv")
    test_data = pd.read_csv("extracted_files/cs-test.csv")

    # Data Cleaning
    def clean_data(df):
        """Clean the dataset by handling missing values and extreme values."""
        df = df.copy()
        
        # Handle missing values
        df['MonthlyIncome'] = df['MonthlyIncome'].fillna(df['MonthlyIncome'].median())
        df['NumberOfDependents'] = df['NumberOfDependents'].fillna(df['NumberOfDependents'].median())
        
        # Remove index column
        if 'Unnamed: 0' in df.columns:
            df.drop(columns=['Unnamed: 0'], inplace=True)
        
        # Fix invalid ages first; otherwise the 1st-percentile clip below
        # raises age == 0 before this replacement can ever match
        df['age'] = df['age'].replace(0, df['age'].median())
        
        # Cap extreme values at the 1st and 99th percentiles
        numeric_cols = ['RevolvingUtilizationOfUnsecuredLines', 'age', 
                       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio',
                       'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
                       'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines',
                       'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents']
        
        for col in numeric_cols:
            lower = df[col].quantile(0.01)
            upper = df[col].quantile(0.99)
            df[col] = df[col].clip(lower, upper)
        
        max_due = 20
        due_cols = ['NumberOfTime30-59DaysPastDueNotWorse', 
                   'NumberOfTimes90DaysLate',
                   'NumberOfTime60-89DaysPastDueNotWorse']
        for col in due_cols:
            df[col] = df[col].clip(upper=max_due)
        
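        # First-pass engineered features (recomputed later in enhance_features)
        df['TotalPastDue'] = (df['NumberOfTime30-59DaysPastDueNotWorse'] + 
                             df['NumberOfTime60-89DaysPastDueNotWorse'] + 
                             df['NumberOfTimes90DaysLate'])
        df['IncomePerDependent'] = df['MonthlyIncome'] / (df['NumberOfDependents'] + 1)
        # DebtRatio is already defined relative to income, so this product is an
        # estimate of the monthly debt amount despite the column name
        df['DebtToIncome'] = df['DebtRatio'] * df['MonthlyIncome']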
        
        return df

    # Clean both datasets
    train_data = clean_data(train_data)
    test_data = clean_data(test_data)
    
    return train_data, test_data

## Advanced Feature Engineering (Core Focus)

The enhance_features() function creates numerous sophisticated features that fall into several categories:

**Ratio Features**
- Debt burden indicators: DebtToIncome, IncomePerDependent, CreditPerIncome, RealEstatePerIncome
- Credit utilization metrics: RevolvingUtilizationSquared, RevolvingUtilizationLog

**Delinquency Features**
- Aggregated metrics: TotalPastDue (sum of all delinquency types)
- Weighted scores: WeightedDelinquencyScore (more severe delinquencies get higher weights)
- Binary indicators: HasPastDue30/60/90, HasAnyDelinquency
- Severity metrics: DelinquencySeverityRatio (proportion of severe delinquencies)

**Age-Related Features**
- Non-linear transformations: AgeSquared
- Categories: AgeGroup (binned into 6 categories)
- Interactions: AgeCreditRatio, AgeDebtRatio

**Credit Line Features**
- Complexity indicators: CreditToRealEstate, UnsecuredLoansPct

**Advanced Polynomial Features**
- Interaction terms: UtilizationByDelinquency, DebtIncomeByAge

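**Risk Index Features**
- Composite scores: FinancialStressIndex, CreditComplexityIndex, FinancialStabilityIndex

**Statistical Transformations and Scaling**
- Log transforms: LogMonthlyIncome, LogDebtRatio
- Normalized ratio: NormalizedIncomePerDependent
- Additional ratios: OverallDebtBurden, UtilizationToIncome
- Min-max scaled copies: RevolvingUtilizationOfUnsecuredLines_Scaled, DebtRatio_Scaled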
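
To make the delinquency features concrete, here is a small worked example on three made-up borrowers. It uses the same 1/2/3 weights as enhance_features(); the toy counts are invented for illustration.

```python
import pandas as pd

# Three made-up borrowers: counts of 30-59, 60-89 and 90+ day delinquencies
toy = pd.DataFrame({
    "NumberOfTime30-59DaysPastDueNotWorse": [0, 2, 1],
    "NumberOfTime60-89DaysPastDueNotWorse": [0, 1, 0],
    "NumberOfTimes90DaysLate": [0, 0, 3],
})

total = toy.sum(axis=1)
weighted = (toy["NumberOfTime30-59DaysPastDueNotWorse"] * 1
            + toy["NumberOfTime60-89DaysPastDueNotWorse"] * 2
            + toy["NumberOfTimes90DaysLate"] * 3)
# Share of 90+ day events; set to 0 when the borrower has no delinquency at all
severity = (toy["NumberOfTimes90DaysLate"] / total.where(total > 0)).fillna(0.0)

print(pd.DataFrame({
    "TotalPastDue": total,
    "WeightedDelinquencyScore": weighted,
    "DelinquencySeverityRatio": severity,
}))
```

The second and third borrowers have similar totals (3 and 4) but quite different weighted scores (4 and 10), which is the distinction the weighting is meant to surface.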

In [4]:
def enhance_features(df):
    """
    Create advanced features for credit risk scoring based on existing data.
    
    Parameters:
    df (pandas.DataFrame): DataFrame with credit risk features
    
    Returns:
    pandas.DataFrame: Enhanced DataFrame with new features
    """
    # Create a copy to avoid modifying the original dataframe
    enhanced_df = df.copy()
    
    # --- Ratio Features ---
    # Debt burden indicators
    enhanced_df['DebtToIncome'] = enhanced_df['DebtRatio'] * enhanced_df['MonthlyIncome']
    enhanced_df['IncomePerDependent'] = enhanced_df['MonthlyIncome'] / (enhanced_df['NumberOfDependents'] + 1)
    enhanced_df['CreditPerIncome'] = enhanced_df['NumberOfOpenCreditLinesAndLoans'] / (enhanced_df['MonthlyIncome'] / 1000)
    enhanced_df['RealEstatePerIncome'] = enhanced_df['NumberRealEstateLoansOrLines'] / (enhanced_df['MonthlyIncome'] / 1000)
    
    # Credit utilization refined metrics
    enhanced_df['RevolvingUtilizationSquared'] = enhanced_df['RevolvingUtilizationOfUnsecuredLines'] ** 2
    enhanced_df['RevolvingUtilizationLog'] = np.log1p(enhanced_df['RevolvingUtilizationOfUnsecuredLines'])
    
    # --- Delinquency Features ---
    # Aggregated delinquency metrics
    enhanced_df['TotalPastDue'] = (enhanced_df['NumberOfTime30-59DaysPastDueNotWorse'] + 
                                  enhanced_df['NumberOfTime60-89DaysPastDueNotWorse'] + 
                                  enhanced_df['NumberOfTimes90DaysLate'])
    
    # Weighted delinquency score (giving more weight to more severe delinquencies)
    enhanced_df['WeightedDelinquencyScore'] = (
        enhanced_df['NumberOfTime30-59DaysPastDueNotWorse'] * 1 +
        enhanced_df['NumberOfTime60-89DaysPastDueNotWorse'] * 2 +
        enhanced_df['NumberOfTimes90DaysLate'] * 3
    )
    
    # Binary indicators for any history of delinquency
    enhanced_df['HasPastDue30'] = (enhanced_df['NumberOfTime30-59DaysPastDueNotWorse'] > 0).astype(int)
    enhanced_df['HasPastDue60'] = (enhanced_df['NumberOfTime60-89DaysPastDueNotWorse'] > 0).astype(int)
    enhanced_df['HasPastDue90'] = (enhanced_df['NumberOfTimes90DaysLate'] > 0).astype(int)
    enhanced_df['HasAnyDelinquency'] = (enhanced_df['TotalPastDue'] > 0).astype(int)
    
    # Delinquency recency and severity ratio
    enhanced_df['DelinquencySeverityRatio'] = np.where(
        enhanced_df['TotalPastDue'] > 0,
        enhanced_df['NumberOfTimes90DaysLate'] / enhanced_df['TotalPastDue'],
        0
    )
    
    # --- Age-related Features ---
    # Age categories and interactions
    enhanced_df['AgeSquared'] = enhanced_df['age'] ** 2
    enhanced_df['AgeGroup'] = pd.cut(
        enhanced_df['age'], 
        bins=[0, 25, 35, 45, 55, 65, np.inf],  # open-ended top bin so ages above 100 do not become NaN
        labels=[0, 1, 2, 3, 4, 5],
        include_lowest=True
    ).astype(int)
    
    # Age and credit interaction features
    enhanced_df['AgeCreditRatio'] = enhanced_df['age'] / (enhanced_df['NumberOfOpenCreditLinesAndLoans'] + 1)
    enhanced_df['AgeDebtRatio'] = enhanced_df['age'] * enhanced_df['DebtRatio']
    
    # --- Credit Line Features ---
    # Credit line complexity indicators
    enhanced_df['CreditToRealEstate'] = np.where(
        enhanced_df['NumberRealEstateLoansOrLines'] > 0,
        enhanced_df['NumberOfOpenCreditLinesAndLoans'] / enhanced_df['NumberRealEstateLoansOrLines'],
        enhanced_df['NumberOfOpenCreditLinesAndLoans']
    )
    
    enhanced_df['UnsecuredLoansPct'] = np.where(
        enhanced_df['NumberOfOpenCreditLinesAndLoans'] > 0,
        (enhanced_df['NumberOfOpenCreditLinesAndLoans'] - enhanced_df['NumberRealEstateLoansOrLines']) / 
        enhanced_df['NumberOfOpenCreditLinesAndLoans'],
        0
    )
    
    # --- Advanced Polynomial Features ---
    # Interaction between utilization and delinquency
    enhanced_df['UtilizationByDelinquency'] = enhanced_df['RevolvingUtilizationOfUnsecuredLines'] * (enhanced_df['TotalPastDue'] + 1)
    
    # Debt to income adjusted by age
    enhanced_df['DebtIncomeByAge'] = enhanced_df['DebtRatio'] / (enhanced_df['age'] / 40)
    
    # --- Risk Index Features ---
    # Composite risk scores
    enhanced_df['FinancialStressIndex'] = (
        enhanced_df['RevolvingUtilizationOfUnsecuredLines'] * 0.5 +
        (enhanced_df['TotalPastDue'] / 10) * 0.3 +
        (enhanced_df['DebtRatio'] / 2) * 0.2
    )
    
    # Credit complexity index
    enhanced_df['CreditComplexityIndex'] = (
        enhanced_df['NumberOfOpenCreditLinesAndLoans'] * 0.6 +
        enhanced_df['NumberRealEstateLoansOrLines'] * 0.4
    )
    
    # Financial stability index (higher is better)
    enhanced_df['FinancialStabilityIndex'] = (
        (enhanced_df['MonthlyIncome'] / 5000) * 0.4 +
        (enhanced_df['age'] / 50) * 0.2 -
        enhanced_df['DebtRatio'] * 0.25 -
        (enhanced_df['TotalPastDue'] / 5) * 0.15
    )
    
    # --- Statistical Transformations ---
    # Log transformations for skewed features
    enhanced_df['LogMonthlyIncome'] = np.log1p(enhanced_df['MonthlyIncome'])
    enhanced_df['LogDebtRatio'] = np.log1p(enhanced_df['DebtRatio'])
    
    # Interaction between income and dependents, normalized
    enhanced_df['NormalizedIncomePerDependent'] = enhanced_df['IncomePerDependent'] / enhanced_df['IncomePerDependent'].median()
    
    # --- Additional Ratio Features ---
    # Overall debt burden ratio
    enhanced_df['OverallDebtBurden'] = enhanced_df['DebtRatio'] * (enhanced_df['NumberOfOpenCreditLinesAndLoans'] + 1)
    
    # Utilization to income ratio
    enhanced_df['UtilizationToIncome'] = enhanced_df['RevolvingUtilizationOfUnsecuredLines'] / (enhanced_df['MonthlyIncome'] / 5000)
    
    # --- Feature Scaling ---
    # Min-max scaling for selected features (each scaled copy is a linear
    # rescaling, so it is perfectly correlated with its source column)
    for col in ['RevolvingUtilizationOfUnsecuredLines', 'DebtRatio']:
        enhanced_df[f'{col}_Scaled'] = (enhanced_df[col] - enhanced_df[col].min()) / (enhanced_df[col].max() - enhanced_df[col].min())
    
    # Handle any infinities or NaNs from the operations
    enhanced_df = enhanced_df.replace([np.inf, -np.inf], np.nan)
    
    # For numeric columns only, fill NaNs with median
    numeric_cols = enhanced_df.select_dtypes(include=['float64', 'int64']).columns
    enhanced_df[numeric_cols] = enhanced_df[numeric_cols].fillna(enhanced_df[numeric_cols].median())
    
    return enhanced_df
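
The final clean-up step in enhance_features() matters because several ratios divide by MonthlyIncome, which can be zero. The following sketch, on invented values, traces one such ratio through the same replace-then-fill sequence.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NumberOfOpenCreditLinesAndLoans": [4, 8, 6],
    "MonthlyIncome": [2000.0, 0.0, 4000.0],   # the zero causes a division by zero
})

# Same formula as CreditPerIncome: credit lines per 1000 of monthly income
ratio = toy["NumberOfOpenCreditLinesAndLoans"] / (toy["MonthlyIncome"] / 1000)

# Infinite values become NaN, then NaN is filled with the column median
ratio = ratio.replace([np.inf, -np.inf], np.nan)
ratio = ratio.fillna(ratio.median())
print(ratio.tolist())
```

One consequence of this design is that a zero-income borrower ends up with a typical ratio. If that information should be kept, a separate zero-income flag column would be one option, though the notebook does not include one.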



## Feature Selection and Evaluation

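Two functions support feature evaluation:

- evaluate_feature_importance(): fits a shallow random forest and ranks features by impurity-based importance
- check_multicollinearity(): lists feature pairs whose absolute correlation exceeds a threshold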

In [5]:
# Function to evaluate feature importance and select top features
def evaluate_feature_importance(X, y, feature_names, n_features=20):
    """
    Evaluate feature importance using Random Forest and return top N features
    
    Parameters:
    X (numpy.ndarray): Feature matrix
    y (numpy.ndarray): Target vector
    feature_names (list): List of feature names
    n_features (int): Number of top features to return
    
    Returns:
    pandas.DataFrame: Feature importance dataframe
    list: List of top N feature names
    """
    # Train a Random Forest model
    rf = RandomForestClassifier(
        n_estimators=100, 
        max_depth=5,
        class_weight='balanced_subsample',
        random_state=42,
        n_jobs=-1
    )
    
    rf.fit(X, y)
    
    # Get feature importances
    importances = rf.feature_importances_
    
    # Create a dataframe of feature importances
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances
    }).sort_values('Importance', ascending=False)
    
    # Get top N features
    top_features = importance_df.head(n_features)['Feature'].tolist()
    
    return importance_df, top_features

In [6]:
# Function to check for multicollinearity
def check_multicollinearity(df, threshold=0.8):
    """
    Check for multicollinearity among features
    
    Parameters:
    df (pandas.DataFrame): DataFrame with features
    threshold (float): Correlation threshold to flag
    
    Returns:
    pandas.DataFrame: Pairs of features with correlation above threshold
    """
    # Calculate correlation matrix
    corr_matrix = df.corr().abs()
    
    # Get upper triangle of correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    
    # Find index of feature columns with correlation greater than threshold
    correlated_features = [(upper.index[i], upper.columns[j], upper.iloc[i, j]) 
                          for i in range(len(upper.index)) 
                          for j in range(len(upper.columns)) 
                          if upper.iloc[i, j] > threshold]
    
    # Create a dataframe of correlated features
    if correlated_features:
        return pd.DataFrame(correlated_features, columns=['Feature1', 'Feature2', 'Correlation'])
    else:
        return pd.DataFrame(columns=['Feature1', 'Feature2', 'Correlation'])
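
The upper-triangle idea used by check_multicollinearity() can be seen on a tiny synthetic frame. This is only a sketch with invented data: one column is a min-max rescaling of another, and a third is unrelated noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_scaled": (x - x.min()) / (x.max() - x.min()),  # linear rescaling of x
    "noise": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only entries above the diagonal so each pair is reported once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
flagged = pairs[pairs > 0.85]
print(flagged)
```

Only the (x, x_scaled) pair is flagged, which is the same pattern that makes the `_Scaled` columns from enhance_features() show up as highly correlated with their sources.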

## Main Execution Pipeline

The main() function orchestrates:

- Data loading and cleaning
- Feature engineering
- Multicollinearity check
- Feature importance evaluation
- Data splitting and scaling
- Model preparation (though actual modeling code isn't shown in this snippet)
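
Because the modeling step is not shown, the following is only a hypothetical sketch of how the split-and-scale stage might feed a model. It uses synthetic data from scikit-learn rather than the credit dataset, and the choice of logistic regression with balanced class weights is an assumption, not the notebook's confirmed method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real feature matrix (about 7% positives)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.93, 0.07], random_state=42)

# Stratify so both splits keep the same share of positives
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then reuse it on the hold-out split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train_scaled, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1])
print(f"Hold-out ROC AUC: {auc:.3f}")
```

ROC AUC is used here because the target is rare, so plain accuracy would look high even for a model that never predicts delinquency.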

In [7]:
# Example usage in the main function
def main():
    """Main execution function with enhanced feature engineering"""
    try:
        # Load and clean data
        print("Loading and cleaning data...")
        train_data, test_data = load_and_clean_data()
        
        # Apply feature engineering
        print("Applying feature engineering...")
        train_data_enhanced = enhance_features(train_data)
        test_data_enhanced = enhance_features(test_data)
        
        # Define target and features
        target = 'SeriousDlqin2yrs'
        X = train_data_enhanced.drop(columns=[target, 'Unnamed: 0'], errors='ignore')
        y = train_data_enhanced[target].astype(int)
        
        # Check for multicollinearity
        print("Checking for multicollinearity...")
        multicollinearity_df = check_multicollinearity(X, threshold=0.85)
        print(f"Found {len(multicollinearity_df)} pairs of highly correlated features")
        
        # Evaluate feature importance
        print("Evaluating feature importance...")
        importance_df, top_features = evaluate_feature_importance(X.values, y.values, X.columns.tolist())
        
        print("Top 15 important features:")
        print(importance_df.head(15))
        
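        # Use top features for modeling.
        # Caveat: importance was computed on all training rows, so the hold-out
        # split below is not fully independent of the feature selection step.
        X_top = X[top_features]
        
        # Train-test split (model evaluation is not shown in this snippet)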
        X_train, X_test, y_train, y_test = train_test_split(
            X_top, y, test_size=0.3, random_state=42, stratify=y
        )
        
        # Scale features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        # Define models and continue with model evaluation...
        
    except Exception as e:
        print(f"Error in main execution: {str(e)}")
        import traceback
        traceback.print_exc()

if __name__ == "__main__":
    main()

Loading and cleaning data...
Applying feature engineering...
Checking for multicollinearity...
Found 46 pairs of highly correlated features
Evaluating feature importance...
Top 15 important features:
                                        Feature  Importance
17                     WeightedDelinquencyScore    0.148912
10                                 TotalPastDue    0.134966
29                     UtilizationByDelinquency    0.130238
21                            HasAnyDelinquency    0.115453
0          RevolvingUtilizationOfUnsecuredLines    0.073028
39  RevolvingUtilizationOfUnsecuredLines_Scaled    0.061826
15                  RevolvingUtilizationSquared    0.051770
16                      RevolvingUtilizationLog    0.050622
38                          UtilizationToIncome    0.036853
6                       NumberOfTimes90DaysLate    0.030859
20                                 HasPastDue90    0.027272
2          NumberOfTime30-59DaysPastDueNotWorse    0.026253
8          NumberOfT

## Conclusion

This notebook shows how feature engineering can reshape the inputs to a credit risk model. The engineered features dominate the random forest importance ranking. Whether they also improve the accuracy of delinquency predictions still has to be confirmed by the model evaluation step, which is not shown here.

### Key Findings

- Delinquency History is Paramount: The top features overwhelmingly relate to payment delinquency patterns, with WeightedDelinquencyScore, TotalPastDue, and UtilizationByDelinquency emerging as the three most important predictors. This is consistent with past repayment behavior being the strongest indicator of future credit risk.
- Engineered Features Rank Highly: Of the 13 rows visible in the truncated output above, 9 are engineered features and 4 are raw columns. This supports the feature engineering approach, but importance is shared among correlated features, so the ranking alone does not show that the engineered features carry more signal than the raw variables they derive from.
- Utilization Appears in Several Forms: RevolvingUtilizationOfUnsecuredLines and its scaled, squared, and log versions all rank in the top eight. These are monotone transforms of one non-negative variable, so a tree-based model can split on any of them equivalently. Their separate rankings reflect shared importance rather than evidence of a non-linear effect; the transforms matter mainly for linear models such as logistic regression.
- Behavioral Indicators Trump Demographics: Age and the age-derived features do not appear among the visible top features. The leading predictors all reflect financial behaviors and credit management patterns.
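
A caveat worth checking when reading tree-based importances across transformed copies of one variable: monotone transforms preserve the ordering of values, so a decision tree can reach the same partitions from any of them. The sketch below illustrates this on synthetic data with a single scikit-learn decision tree; it is a simplified stand-in for the random forest used above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Non-negative, evenly spread values, loosely mimicking a utilization ratio
x = rng.permutation(np.linspace(0, 2, 200))
y = (x + rng.normal(0, 0.3, size=200) > 1).astype(int)

variants = {
    "raw": x,
    "squared": x ** 2,        # monotone because x >= 0
    "log1p": np.log1p(x),
}

preds = {}
for name, values in variants.items():
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(values.reshape(-1, 1), y)
    preds[name] = tree.predict(values.reshape(-1, 1))

# True when every variant yields the same predictions on the training rows
same = all(np.array_equal(preds["raw"], p) for p in preds.values())
print(same)
```

When all copies are offered to a forest at once, the importance that would go to one column is spread across them, which is why several utilization variants can appear side by side in the ranking.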

### Recommendations

1. Model Deployment Focus: Prioritize monitoring the delinquency-based features in production, as these drive most of the model's predictive power.

2. Customer Intervention Strategies: The weight the importance ranking places on delinquency counts suggests that early intervention for customers showing first signs of delinquency (30-59 days late) may help prevent progression to more severe delinquency, although feature importance alone does not establish that effect.

3. Feature Monitoring: Implement drift detection specifically for the top engineered features to ensure model performance remains stable as customer behavior patterns evolve.

4. Future Enhancements: Explore temporal patterns by creating time-weighted delinquency features or incorporating trend features that capture whether a borrower's situation is improving or deteriorating.
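
For the feature monitoring recommendation, one common choice in credit scoring is the Population Stability Index (PSI). The sketch below is a minimal version under stated assumptions: the function name `population_stability_index` is hypothetical, the bins come from quantiles of the reference sample, and the data is synthetic.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10):
    """PSI of one feature between a reference sample and a newer sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior bin edges come from quantiles of the reference sample
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.unique(np.quantile(reference, quantiles))

    def bin_shares(values):
        idx = np.searchsorted(edges, values, side="right")
        counts = np.bincount(idx, minlength=len(edges) + 1)
        # A small floor keeps empty bins from producing log(0)
        return np.clip(counts / len(values), 1e-6, None)

    ref_share = bin_shares(reference)
    cur_share = bin_shares(current)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)   # e.g. a feature at training time
stable = rng.normal(0.0, 1.0, size=5000)      # same distribution later on
shifted = rng.normal(0.5, 1.0, size=5000)     # distribution has drifted

psi_stable = population_stability_index(reference, stable)
psi_shifted = population_stability_index(reference, shifted)
print(f"stable: {psi_stable:.4f}, shifted: {psi_shifted:.4f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but any alert thresholds would need to be validated against the actual portfolio.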

This project underscores that in credit risk modeling, how you represent the data can be as important as the modeling technique itself. The feature engineering here turns raw financial data into signals that rank highly in the importance analysis and reflect several dimensions of credit risk.