# Credit Card Feature Engineering

In this tutorial, you'll learn how to engineer interaction features from the [Kaggle Credit Card Default dataset](./data/UCI_Credit_Card.csv). Start by installing the dependencies below.

In [10]:
pip install pandas numpy scikit-learn

Note: you may need to restart the kernel to use updated packages.


### Setup: Load Data and Libraries

We first need to import the necessary libraries, load the data, and verify that it loaded correctly. Run the code below to do so.

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the credit card default dataset
df = pd.read_csv('data/UCI_Credit_Card.csv')

# Display original features
print("Original features (first 5 rows):")
print(df[['LIMIT_BAL', 'BILL_AMT1', 'PAY_AMT1', 'AGE']].head())

Original features (first 5 rows):
   LIMIT_BAL  BILL_AMT1  PAY_AMT1  AGE
0    20000.0     3913.0       0.0   24
1   120000.0     2682.0       0.0   26
2    90000.0    29239.0    1518.0   34
3    50000.0    46990.0    2000.0   37
4    50000.0     8617.0    2000.0   57
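
Beyond eyeballing `head()`, one way to verify the load is to check that every column used later in this tutorial is present. The sketch below assumes a hypothetical helper, `missing_columns` (not part of pandas), and uses a tiny stand-in DataFrame built from the rows printed above instead of the real CSV.

```python
import pandas as pd

# Columns this tutorial relies on in later cells
REQUIRED_COLUMNS = [
    'LIMIT_BAL', 'AGE', 'PAY_0', 'BILL_AMT1', 'PAY_AMT1',
    'default.payment.next.month',
]

def missing_columns(frame, required):
    """Return the required column names that are absent from frame (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]

# Stand-in for the loaded CSV; values copied from the first two rows shown above
sample = pd.DataFrame({
    'LIMIT_BAL': [20000.0, 120000.0],
    'BILL_AMT1': [3913.0, 2682.0],
    'PAY_AMT1': [0.0, 0.0],
    'AGE': [24, 26],
})

# The stand-in deliberately lacks two columns, so both are reported
print(missing_columns(sample, REQUIRED_COLUMNS))
```

On the real data you would call `missing_columns(df, REQUIRED_COLUMNS)` and expect an empty list.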


### Interaction Type 1: Division (Ratios)

Ratios capture relative relationships between features, such as how large a bill is compared with the credit limit. They are often among the most useful interaction features for predictive tasks. The code below computes:
1. **Utilization Rate**: How much of the credit limit is being used
2. **Payment Rate**: How much of the bill was paid

In [31]:
# Utilization rate: How much of credit limit is being used
df['utilization_rate'] = df['BILL_AMT1'] / (df['LIMIT_BAL'] + 1)  # +1 avoids division by zero

# Payment rate: How much of the bill was paid
# (+1 avoids division by zero when the bill is 0; a bill of exactly -1 still divides by zero)
df['payment_rate'] = df['PAY_AMT1'] / (df['BILL_AMT1'] + 1)

# Replace any infinity values with 0
df['utilization_rate'] = df['utilization_rate'].replace([np.inf, -np.inf], 0)
df['payment_rate'] = df['payment_rate'].replace([np.inf, -np.inf], 0)

print("\nRatio features (first 5 rows):")
print(df[['utilization_rate', 'payment_rate']].head())


Ratio features (first 5 rows):
   utilization_rate  payment_rate
0          0.195640      0.000000
1          0.022350      0.000000
2          0.324874      0.051915
3          0.939781      0.042561
4          0.172337      0.232072
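
An alternative to the `+ 1` offset is to divide directly and map undefined results to a default value. This is only a sketch: `safe_ratio` is a hypothetical helper (not a pandas function), and the data is the five rows printed in the setup section rather than the full dataset.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator, default=0.0):
    """Element-wise division; infinite and NaN results are replaced with default."""
    ratio = numerator / denominator
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(default)

# Stand-in for the full dataset: the five rows printed in the setup section
sample = pd.DataFrame({
    'LIMIT_BAL': [20000.0, 120000.0, 90000.0, 50000.0, 50000.0],
    'BILL_AMT1': [3913.0, 2682.0, 29239.0, 46990.0, 8617.0],
    'PAY_AMT1': [0.0, 0.0, 1518.0, 2000.0, 2000.0],
})

sample['utilization_rate'] = safe_ratio(sample['BILL_AMT1'], sample['LIMIT_BAL'])
sample['payment_rate'] = safe_ratio(sample['PAY_AMT1'], sample['BILL_AMT1'])
print(sample[['utilization_rate', 'payment_rate']])
```

Because there is no `+ 1` in the denominator, the values differ slightly from the output above (for example, 3913 / 20000 rather than 3913 / 20001 in the first row). Whether 0 is a sensible default for an undefined ratio depends on the feature, so treat it as a modeling choice.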


### Interaction Type 2: Subtraction (Differences)
Differences show gaps or remaining capacity. For our credit data, we can use subtraction to determine:
1. **Available Credit**: How much credit a person still has available
2. **Underpayment**: How much of the bill wasn't paid

In [36]:
# Available credit: How much credit is still available
df['available_credit'] = df['LIMIT_BAL'] - df['BILL_AMT1']

# Underpayment: How much of the bill wasn't paid
df['underpayment'] = df['BILL_AMT1'] - df['PAY_AMT1']
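
The cell above prints nothing, so here is a sketch of what the two difference columns look like. It uses the five rows printed in the setup section as a stand-in for the full dataset.

```python
import pandas as pd

# Stand-in for the full dataset: the five rows printed in the setup section
sample = pd.DataFrame({
    'LIMIT_BAL': [20000.0, 120000.0, 90000.0, 50000.0, 50000.0],
    'BILL_AMT1': [3913.0, 2682.0, 29239.0, 46990.0, 8617.0],
    'PAY_AMT1': [0.0, 0.0, 1518.0, 2000.0, 2000.0],
})

# Same arithmetic as the tutorial cell above
sample['available_credit'] = sample['LIMIT_BAL'] - sample['BILL_AMT1']
sample['underpayment'] = sample['BILL_AMT1'] - sample['PAY_AMT1']

print(sample[['available_credit', 'underpayment']])
```

Both columns can go negative: a negative `available_credit` means the bill exceeds the limit, and a negative `underpayment` means the payment exceeded the bill.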

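**How to Handle Infinite Values?**
Division produces infinity when a nonzero value is divided by exactly zero. The `+ 1` offset prevents this for a bill of 0, but a `BILL_AMT1` of -1 would still make the denominator zero. Many scikit-learn estimators, including `RandomForestClassifier`, raise an error on infinite inputs, so the `replace` calls on the ratio columns set these values to 0.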
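
To make the edge cases concrete, the sketch below uses made-up values (not rows from the dataset) chosen to trigger each one. It also shows that `replace` only handles infinity: a 0 / 0 division yields `NaN`, which is left for the `fillna` step in the modeling cell.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to hit the edge cases
toy = pd.DataFrame({
    'BILL_AMT1': [-1.0, -1.0, 0.0],
    'PAY_AMT1': [500.0, 0.0, 500.0],
})

# Row 0: 500 / 0 is inf; row 1: 0 / 0 is NaN; row 2: 500 / 1 is 500.0
toy['payment_rate'] = toy['PAY_AMT1'] / (toy['BILL_AMT1'] + 1)
print(toy['payment_rate'].tolist())

# Replacing infinity with 0 leaves the NaN in row 1 untouched
cleaned = toy['payment_rate'].replace([np.inf, -np.inf], 0)
print(cleaned.tolist())
```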



### Using Your Engineered Features
Once you've created interaction features, add them to your model:

In [35]:
# Select features including engineered ones
feature_columns = [
    # Original features
    'LIMIT_BAL', 'AGE', 'PAY_0', 'BILL_AMT1', 'PAY_AMT1',
    # Engineered interaction features
    'utilization_rate', 'payment_rate', 'available_credit', 'underpayment'
]

# Prepare X and y
X = df[feature_columns]
y = df['default.payment.next.month']

# Handle any missing values
X = X.fillna(X.median())

# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
predictions = model.predict(X_test)
print(f"\nModel Accuracy: {accuracy_score(y_test, predictions):.4f}")


Model Accuracy: 0.7947
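
Accuracy alone can be hard to interpret when one class is much more common than the other, so it helps to compare against a majority-class baseline and to look at per-class metrics. The sketch below runs on synthetic stand-in data (an assumption, not the credit card dataset) so that it works without the CSV. `DummyClassifier` and `classification_report` are existing scikit-learn tools; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: the positive class is the minority
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=1000) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Model accuracy:    {model_acc:.4f}")

# Per-class precision and recall show how well the minority class is detected
print(classification_report(y_test, model.predict(X_test), zero_division=0))
```

On the real data, swap in the tutorial's `X_train`, `X_test`, `y_train`, and `y_test`. To check whether the engineered features help, train the same model on the original columns only and compare the same metrics.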
