# Phase 3: Handling Imbalanced Data

This phase splits the data into training and test sets, applies SMOTE to the training set, and calculates class weights as an alternative to oversampling.

In [14]:
# Import necessary libraries for this phase
import sklearn
import imblearn
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
import pandas as pd
import json

### Step 1: Data Splitting

First, we separate the features (`X`) from the target (`y`) and split them into training and test sets. Passing `stratify=y` keeps the proportion of defaulted loans the same in both splits. This matters for imbalanced datasets, because a purely random split can shift the minority-class rate between the training and test sets.


In [15]:
# Load the fully preprocessed DataFrame saved at the end of Phase 2.
# Ensure 'processed_lending_club_data.parquet' exists before running this cell.
df_final = pd.read_parquet('processed_lending_club_data.parquet')
print("--- Step 1: Data Splitting ---")

# Define features (X) and target (y)
X = df_final.drop('loan_status', axis=1)
y = df_final['loan_status']

# Split the data into training and testing sets (80/20 split)
# stratify=y ensures the class distribution is the same in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Data splitting complete.")
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
print("-" * 50)

print("Original distribution of target variable in the training set:")
print(y_train.value_counts(normalize=True))
print("\nDistribution of target variable in the testing set:")
print(y_test.value_counts(normalize=True))
print("\nThe distributions are consistent, thanks to stratification.")
print("-" * 50)

--- Step 1: Data Splitting ---
Data splitting complete.
Shape of X_train: (1076248, 93)
Shape of X_test: (269062, 93)
Shape of y_train: (1076248,)
Shape of y_test: (269062,)
--------------------------------------------------
Original distribution of target variable in the training set:
loan_status
0    0.800374
1    0.199626
Name: proportion, dtype: float64

Distribution of target variable in the testing set:
loan_status
0    0.800373
1    0.199627
Name: proportion, dtype: float64

The distributions are consistent, thanks to stratification.
--------------------------------------------------
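
To see the effect of stratification on a small scale, the following sketch compares a stratified and a plain random split on toy labels with the same 80/20 imbalance. The arrays `X_toy` and `y_toy` are invented for illustration and are not part of the Lending Club data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 800 negatives, 200 positives (20% minority class)
y_toy = np.array([0] * 800 + [1] * 200)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Stratified split: the minority rate is preserved in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Plain random split: the minority rate may drift between the parts
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print("stratified   train/test minority rate:", y_tr_strat.mean(), y_te_strat.mean())
print("unstratified train/test minority rate:", y_tr_plain.mean(), y_te_plain.mean())
```

On a dataset as large as the one used here the drift from an unstratified split would usually be small, but stratifying removes it at no cost.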


### Step 2: Implementing SMOTE (Oversampling)

Now we apply SMOTE (Synthetic Minority Over-sampling Technique) to the training data only. SMOTE creates synthetic minority-class points (defaulters) by interpolating between existing minority samples and their nearest minority-class neighbors, which gives the model a balanced dataset to train on. The test set remains untouched and imbalanced, so evaluation reflects the real-world data the model will encounter.


In [16]:
# Instantiate SMOTE
smote = SMOTE(random_state=42)

# Fit and apply SMOTE to the training data
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("SMOTE application complete.")
print("-" * 50)

print("Class distribution in the original training set (y_train):")
print(y_train.value_counts())
print("-" * 50)

print("Class distribution after applying SMOTE (y_train_smote):")
print(y_train_smote.value_counts())
print("\nThe training data is now balanced.")

SMOTE application complete.
--------------------------------------------------
Class distribution in the original training set (y_train):
loan_status
0    861401
1    214847
Name: count, dtype: int64
--------------------------------------------------
Class distribution after applying SMOTE (y_train_smote):
loan_status
1    861401
0    861401
Name: count, dtype: int64

The training data is now balanced.
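
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration and not imbalanced-learn's implementation: the function name `smote_like_sample`, its parameters, and the toy points are all hypothetical, and the brute-force distance matrix is only suitable for tiny inputs.

```python
import numpy as np

def smote_like_sample(minority, k=5, n_new=10, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)

    # Brute-force pairwise Euclidean distances (fine for toy data only)
    diffs = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    k = min(k, n - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base point, one of its neighbours, and a random position between them
    base_idx = rng.integers(0, n, size=n_new)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return minority[base_idx] + gap * (minority[nb_idx] - minority[base_idx])

# Hypothetical minority-class points in two dimensions
minority_toy = np.array(
    [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7], [1.4, 1.0]]
)
synthetic = smote_like_sample(minority_toy, k=3, n_new=8)
print(synthetic.shape)
```

Because every synthetic point lies between two real minority samples, the new points stay inside the region the minority class already occupies.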


### Step 3: Preparing Class Weights

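As an alternative to oversampling with SMOTE, we can use class weights. This method doesn't change the data itself but adjusts the model's loss function to penalize misclassifying the rare class more heavily. The weights are calculated from the original imbalanced training data: with `class_weight='balanced'`, each class receives the weight `n_samples / (n_classes * n_samples_in_class)`, so the rarer a class is, the larger its weight.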

In [17]:
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)

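# Create a dictionary mapping class labels to their calculated weights.
# A {label: weight} dict is one of the formats accepted by scikit-learn's `class_weight` parameter.
# Cast to built-in int/float so the dictionary serializes cleanly to JSON later.
class_weight_dict = {int(c): float(w) for c, w in zip(np.unique(y_train), class_weights)}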

print("Class weights calculation complete.")
print("-" * 50)
print(f"Weight for class 0 (Fully Paid): {class_weight_dict[0]:.4f}")
print(f"Weight for class 1 (Charged Off): {class_weight_dict[1]:.4f}")
print("\nThese weights tell the model to pay much more attention to class 1 during training.")

print("\n--- Phase 3: Handling Imbalanced Data Complete ---")
print("You now have two options for training your models:")
print("1. Use the SMOTE-balanced data: (X_train_smote, y_train_smote)")
print("2. Use the original imbalanced data (X_train, y_train) with the `class_weight_dict`.")


Class weights calculation complete.
--------------------------------------------------
Weight for class 0 (Fully Paid): 0.6247
Weight for class 1 (Charged Off): 2.5047

These weights tell the model to pay much more attention to class 1 during training.

--- Phase 3: Handling Imbalanced Data Complete ---
You now have two options for training your models:
1. Use the SMOTE-balanced data: (X_train_smote, y_train_smote)
2. Use the original imbalanced data (X_train, y_train) with the `class_weight_dict`.
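
As a sanity check, the printed weights can be reproduced by hand from the training-set class counts reported in Step 2, using the `balanced` formula `n_samples / (n_classes * n_samples_in_class)`. The helper name `balanced_weights_from_counts` is hypothetical.

```python
def balanced_weights_from_counts(counts):
    """Apply the 'balanced' heuristic to a {label: count} mapping."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Class counts of y_train as printed in Step 2
train_counts = {0: 861401, 1: 214847}
manual_weights = balanced_weights_from_counts(train_counts)

for label, weight in manual_weights.items():
    print(f"class {label}: {weight:.4f}")
```

The ratio of the two weights is about 4, which mirrors the roughly 80/20 class split.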


In [20]:
print("Saving data to parquet files...")

# Save feature data (DataFrames)
X_train.to_parquet('X_train.parquet')
X_test.to_parquet('X_test.parquet')

# Save target data (convert Series to DataFrames first)
y_train.to_frame().to_parquet('y_train.parquet')
y_test.to_frame().to_parquet('y_test.parquet')

# Save SMOTE-balanced data if you have it
if 'X_train_smote' in locals() and 'y_train_smote' in locals():
    X_train_smote.to_parquet('X_train_smote.parquet')
    y_train_smote.to_frame().to_parquet('y_train_smote.parquet')
    print("SMOTE-balanced data also saved.")

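# Save the class weights. JSON stores the integer keys as strings ("0", "1"),
# so convert them back to int when loading the file.
with open('class_weight_dict.json', 'w') as f:
    json.dump(class_weight_dict, f)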

Saving data to parquet files...
SMOTE-balanced data also saved.
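
One detail to watch when reloading the class weights: JSON object keys are always strings, so the integer labels come back as `"0"` and `"1"`. A minimal sketch of the round trip, using an in-memory string instead of the file and hypothetical weights rounded from the values printed above:

```python
import json

# Hypothetical weights, rounded from the values printed in Step 3
example_weights = {0: 0.6247, 1: 2.5047}

serialized = json.dumps(example_weights)
raw = json.loads(serialized)  # keys are now the strings "0" and "1"

# Convert the keys back to integers so they match the class labels in y_train
restored = {int(label): weight for label, weight in raw.items()}
print(raw)
print(restored)
```

The same conversion applies after `json.load` on `class_weight_dict.json`, before the dictionary is passed to a model's `class_weight` parameter.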
