# DistilBERT Underperformance Diagnostic

**Problem**: DistilBERT experiment (exp_004) achieved only 0.6312 AUC vs expected 0.65-0.68
**Expected gain**: +0.03-0.05 over TF-IDF baseline (0.6253)
**Actual result**: -0.0022 (slightly worse)

**Key observations from execution logs:**
- Early stopping triggered at very low iterations in four of five folds (per-fold iterations: 5, 5, 9, 118, 3)
- Fold 4 performed well (0.6578, 118 iterations)
- Fold 5 performed poorly (0.5980, only 3 iterations)
- Iteration counts range from 3 to 118 across folds, so training behavior is unstable from fold to fold

**Hypotheses to investigate:**
1. Feature scaling/normalization issues with DistilBERT embeddings
2. LightGBM hyperparameters not optimal for dense neural features
3. Data leakage or validation strategy problems
4. DistilBERT embedding quality/distribution issues
5. Class imbalance handling problems
6. Early stopping too aggressive for this feature type

In [3]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import torch
from transformers import DistilBertTokenizer, DistilBertModel
import warnings
warnings.filterwarnings('ignore')

# Set seed
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

print("Libraries loaded successfully")

Libraries loaded successfully


## 1. Load Data and Reproduce DistilBERT Features

In [4]:
# Load data
train_path = '/home/data/train.json'
with open(train_path, 'r') as f:
    train_data = json.load(f)
train_df = pd.DataFrame(train_data)

test_path = '/home/data/test.json'
with open(test_path, 'r') as f:
    test_data = json.load(f)
test_df = pd.DataFrame(test_data)

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")
print(f"Target distribution: {train_df['requester_received_pizza'].mean():.4f}")

# Combine text fields
train_df['combined_text'] = train_df['request_title'].fillna('') + ' ' + train_df['request_text_edit_aware'].fillna('')
test_df['combined_text'] = test_df['request_title'].fillna('') + ' ' + test_df['request_text_edit_aware'].fillna('')

# Names of the meta-features used in the experiment (engineered columns are assumed to already exist in train_df)
meta_features = [
    'total_text_length', 'title_word_count', 'total_word_count', 'word_count', 'text_length',
    'requester_number_of_posts_at_request', 'requester_number_of_comments_at_request',
    'requester_upvotes_minus_downvotes_at_request', 'requester_upvotes_plus_downvotes_at_request',
    'requester_account_age_in_days_at_request', 'requester_days_since_first_post_on_raop_at_request',
    'request_hour', 'request_minute', 'request_day_of_week', 'request_day_of_month',
    'request_month', 'request_year', 'requester_account_age_in_days_at_request_bin',
    'requester_upvotes_plus_downvotes_at_request_bin'
]

print(f"Meta-features to use: {len(meta_features)}")

Training samples: 2878
Test samples: 1162
Target distribution: 0.2484
Meta-features to use: 19
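
The positive rate reported above (0.2484) implies a weight of roughly 3 for the positive class, which is useful context for hypothesis 5. A minimal sketch of that arithmetic, assuming `scale_pos_weight` is defined as negatives divided by positives (the convention used in Section 5); the variable names are illustrative only:

```python
# Positive rate taken from the cell output above
positive_rate = 0.2484

# scale_pos_weight = (# negatives) / (# positives) = (1 - p) / p
implied_scale_pos_weight = (1 - positive_rate) / positive_rate

print(f"Implied scale_pos_weight: {implied_scale_pos_weight:.4f}")
```

The per-fold value computed during cross-validation will differ slightly, since each training fold has its own class counts.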


## 2. Extract DistilBERT Embeddings

In [5]:
# Load DistilBERT model and tokenizer
print("Loading DistilBERT...")
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
model.eval()

if torch.cuda.is_available():
    model = model.cuda()
    print("Using GPU for DistilBERT")
else:
    print("Using CPU for DistilBERT")

# Extract embeddings
def extract_distilbert_features(texts, max_length=256, batch_size=16):
    """Extract [CLS] token embeddings from DistilBERT"""
    all_features = []
    
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        
        # Tokenize
        inputs = tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors='pt'
        )
        
        if torch.cuda.is_available():
            inputs = {k: v.cuda() for k, v in inputs.items()}
        
        # Get embeddings
        with torch.no_grad():
            outputs = model(**inputs)
            # Use [CLS] token (first token) embedding
            cls_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
        
        all_features.append(cls_embeddings)
        
        if i % (batch_size * 10) == 0:
            print(f"Processed {i}/{len(texts)} texts")
    
    return np.vstack(all_features)

print("Extracting DistilBERT features from training data...")
train_distilbert = extract_distilbert_features(train_df['combined_text'].tolist())

print("\nExtracting DistilBERT features from test data...")
test_distilbert = extract_distilbert_features(test_df['combined_text'].tolist())

print(f"\nDistilBERT feature shape: {train_distilbert.shape}")
print(f"Sample values (first 10 dims): {train_distilbert[0, :10]}")

Loading DistilBERT...
Using GPU for DistilBERT
Extracting DistilBERT features from training data...
Processed 0/2878 texts
Processed 160/2878 texts
Processed 320/2878 texts
Processed 480/2878 texts
Processed 640/2878 texts
Processed 800/2878 texts
Processed 960/2878 texts
Processed 1120/2878 texts
Processed 1280/2878 texts
Processed 1440/2878 texts
Processed 1600/2878 texts
Processed 1760/2878 texts
Processed 1920/2878 texts
Processed 2080/2878 texts
Processed 2240/2878 texts
Processed 2400/2878 texts
Processed 2560/2878 texts
Processed 2720/2878 texts

Extracting DistilBERT features from test data...
Processed 0/1162 texts
Processed 160/1162 texts
Processed 320/1162 texts
Processed 480/1162 texts
Processed 640/1162 texts
Processed 800/1162 texts
Processed 960/1162 texts
Processed 1120/1162 texts

DistilBERT feature shape: (2878, 768)
Sample values (first 10 dims): [ 0.21099712 -0.01889989  0.02943027 -0.1861856   0.01320272 -0.2866645
  0.12797116  0.503011   -0.06609333 -0.5012868 ]
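
The extraction above keeps only the [CLS] position. For a model that has not been fine-tuned, averaging over all non-padding tokens is a common alternative that may be worth comparing under hypothesis 4. The sketch below shows only the pooling step, on plain NumPy arrays. `masked_mean_pool` is a hypothetical helper; in this notebook its inputs would come from `outputs.last_hidden_state` and `inputs['attention_mask']` after moving them to CPU.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token embeddings over non-padding positions.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Clip the token count at 1 to avoid dividing by zero for an all-padding row
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# Toy example: 2 sequences, 3 token positions, 2 embedding dimensions
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded position in the first sequence is excluded from the average, which is the point of using the attention mask rather than a plain mean over the sequence axis.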


## 3. Analyze DistilBERT Embedding Distribution

In [None]:
# Analyze embedding statistics
print("=== DistilBERT Embedding Distribution Analysis ===\n")

# Basic statistics
print("Embedding statistics:")
print(f"Mean: {train_distilbert.mean():.4f}")
print(f"Std: {train_distilbert.std():.4f}")
print(f"Min: {train_distilbert.min():.4f}")
print(f"Max: {train_distilbert.max():.4f}")
print(f"Range: {train_distilbert.max() - train_distilbert.min():.4f}")

# Per-dimension statistics
print(f"\nPer-dimension statistics:")
print(f"Mean of means: {train_distilbert.mean(axis=0).mean():.4f}")
print(f"Std of means: {train_distilbert.mean(axis=0).std():.4f}")
print(f"Mean of stds: {train_distilbert.std(axis=0).mean():.4f}")
print(f"Std of stds: {train_distilbert.std(axis=0).std():.4f}")

# Check for constant or near-constant dimensions
constant_dims = np.where(train_distilbert.std(axis=0) < 0.01)[0]
print(f"\nNear-constant dimensions (std < 0.01): {len(constant_dims)}")
if len(constant_dims) > 0:
    print(f"First 10 constant dims: {constant_dims[:10]}")

# Check distribution shape
print(f"\nDistribution shape analysis:")
print(f"Skewness: {pd.Series(train_distilbert.flatten()).skew():.4f}")
print(f"Kurtosis: {pd.Series(train_distilbert.flatten()).kurtosis():.4f}")

# Check for outliers using IQR method
Q1 = np.percentile(train_distilbert, 25)
Q3 = np.percentile(train_distilbert, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = np.sum((train_distilbert < lower_bound) | (train_distilbert > upper_bound))
print(f"Outliers (IQR method): {outliers} ({outliers/train_distilbert.size*100:.2f}%)")
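
The IQR check above pools all 768 dimensions into a single distribution, so a few dimensions with an unusual scale can dominate the count. A per-dimension version may be more informative. The sketch below is one way to do that. `per_dimension_outlier_fraction` is a hypothetical helper, and the synthetic matrix stands in for `train_distilbert`.

```python
import numpy as np

def per_dimension_outlier_fraction(X, k=1.5):
    """Fraction of rows flagged as IQR outliers, computed separately for each column."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return ((X < lower) | (X > upper)).mean(axis=0)

# Synthetic stand-in for the embedding matrix
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 8))
X_demo[:25, 3] += 10.0  # inject outliers into dimension 3 only

frac = per_dimension_outlier_fraction(X_demo)
print(np.round(frac, 3))
print("Most affected dimension:", int(frac.argmax()))
```

On the real embeddings, sorting the returned fractions would show whether outliers are spread evenly or concentrated in a handful of dimensions.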

## 4. Compare Meta-Features vs DistilBERT Features

In [None]:
# Prepare meta-features
meta_features_df = train_df[meta_features].copy()
meta_features_test = test_df[meta_features].copy()

# Handle missing values
meta_features_df = meta_features_df.fillna(0)
meta_features_test = meta_features_test.fillna(0)

print("=== Feature Scale Comparison ===\n")

# Meta-features statistics
meta_means = meta_features_df.mean()
meta_stds = meta_features_df.std()
meta_ranges = meta_features_df.max() - meta_features_df.min()

print("Meta-features statistics:")
print(f"Mean range: {meta_ranges.mean():.2f}")
print(f"Mean std: {meta_stds.mean():.2f}")
print(f"Max value: {meta_features_df.max().max():.2f}")
print(f"Min value: {meta_features_df.min().min():.2f}")

# DistilBERT statistics
print(f"\nDistilBERT embedding statistics:")
print(f"Mean range: {(train_distilbert.max(axis=0) - train_distilbert.min(axis=0)).mean():.4f}")
print(f"Mean std: {train_distilbert.std(axis=0).mean():.4f}")
print(f"Max value: {train_distilbert.max():.4f}")
print(f"Min value: {train_distilbert.min():.4f}")

# Scale comparison visualization
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.boxplot([meta_ranges.values, (train_distilbert.max(axis=0) - train_distilbert.min(axis=0))], 
            labels=['Meta-features', 'DistilBERT'])
plt.title('Feature Range Comparison')
plt.ylabel('Range (max - min)')

plt.subplot(1, 2, 2)
plt.boxplot([meta_stds.values, train_distilbert.std(axis=0)], 
            labels=['Meta-features', 'DistilBERT'])
plt.title('Feature Std Dev Comparison')
plt.ylabel('Standard Deviation')

plt.tight_layout()
plt.show()

# Compare average per-feature ranges (computed once and reused)
bert_range_mean = (train_distilbert.max(axis=0) - train_distilbert.min(axis=0)).mean()
scale_ratio = bert_range_mean / meta_ranges.mean()

print(f"\n=== Scale Mismatch Analysis ===")
print(f"Meta-features average scale: {meta_ranges.mean():.2f}")
print(f"DistilBERT average scale: {bert_range_mean:.4f}")
print(f"Scale ratio (DistilBERT / Meta): {scale_ratio:.4f}")

if scale_ratio < 0.1:
    print("⚠️  NOTE: DistilBERT features are on a much smaller scale than meta-features.")
    print("   Tree splits depend only on the ordering of values within each feature, so scale alone")
    print("   should not make LightGBM ignore them; it matters mainly for linear or distance-based models.")
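
Whether the scale gap matters for a tree model can be checked directly. The sketch below uses scikit-learn's `DecisionTreeClassifier` as a small stand-in for LightGBM (an assumption made to keep the example light) on synthetic data. It compares predictions from a tree fit on raw features with one fit on rescaled features. Integer-valued features are used so the rescaling is exact in floating point.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 5 integer-valued features, label depends on the first two
rng = np.random.default_rng(0)
X_raw = rng.integers(0, 100, size=(300, 5)).astype(float)
y_demo = (X_raw[:, 0] + X_raw[:, 1] > 100).astype(int)

# Per-feature affine rescaling (monotone, so the ordering of values is preserved)
X_scaled = X_raw * 10.0 + 3.0

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_demo)

# Each tree is scored on the feature representation it was trained on
same = np.array_equal(
    tree_raw.predict_proba(X_raw),
    tree_scaled.predict_proba(X_scaled),
)
print("Identical predictions:", same)
```

If the two trees agree, scale by itself is an unlikely explanation for the weak DistilBERT result, and the remaining hypotheses deserve more attention.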

## 5. Test Different LightGBM Hyperparameters

In [None]:
# Prepare combined features (without scaling first)
X_combined_raw = np.hstack([meta_features_df.values, train_distilbert])
print(f"Combined feature shape: {X_combined_raw.shape}")

# Prepare target
y = train_df['requester_received_pizza'].values

# Test different hyperparameter configurations
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

print("=== Testing Different LightGBM Hyperparameter Configurations ===\n")

# Configuration 1: Original parameters
print("Config 1: Original parameters (from experiment)")
print("- num_leaves: 31")
print("- learning_rate: 0.1")
print("- n_estimators: 1000")
print("- scale_pos_weight: auto")
print("- early_stopping: 50 rounds")

# Configuration 2: Reduced early stopping patience
print("\nConfig 2: Reduced early stopping patience")
print("- early_stopping_rounds: 10 (instead of 50)")

# Configuration 3: Adjusted for dense features
print("\nConfig 3: Adjusted for dense neural features")
print("- num_leaves: 63 (more capacity)")
print("- learning_rate: 0.05 (halved from 0.1)")
print("- min_child_samples: 20 (LightGBM default, set explicitly)")
print("- feature_fraction: 0.8 (feature sampling)")

# Configuration 4: With feature scaling (described here; not part of the configs dict below)
print("\nConfig 4: With StandardScaler on DistilBERT features (not run in this cell)")

# Let's test these configurations
configs = {
    'original': {
        'num_leaves': 31,
        'learning_rate': 0.1,
        'n_estimators': 1000,
        'early_stopping_rounds': 50
    },
    'reduced_patience': {
        'num_leaves': 31,
        'learning_rate': 0.1,
        'n_estimators': 1000,
        'early_stopping_rounds': 10
    },
    'more_capacity': {
        'num_leaves': 63,
        'learning_rate': 0.05,
        'n_estimators': 1000,
        'early_stopping_rounds': 50,
        'min_child_samples': 20,
        'feature_fraction': 0.8
    }
}

results = {}

for config_name, params in configs.items():
    print(f"\n{'='*50}")
    print(f"Testing: {config_name}")
    print(f"{'='*50}")
    
    fold_scores = []
    fold_iterations = []
    
    for fold, (train_idx, val_idx) in enumerate(cv.split(X_combined_raw, y)):
        X_train, X_val = X_combined_raw[train_idx], X_combined_raw[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        
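        # Calculate scale_pos_weight (negatives / positives in this training fold)
        scale_pos_weight = (len(y_train) - y_train.sum()) / y_train.sum()
        
        # Early stopping is configured through the callback only, not also as a model parameter
        model_params = {k: v for k, v in params.items() if k != 'early_stopping_rounds'}
        
        # Named clf so the DistilBERT `model` loaded in Section 2 is not overwritten
        clf = lgb.LGBMClassifier(
            random_state=RANDOM_SEED,
            scale_pos_weight=scale_pos_weight,
            **model_params
        )
        
        clf.fit(
            X_train, y_train,
            eval_set=[(X_val, y_val)],
            eval_metric='auc',
            callbacks=[lgb.early_stopping(params['early_stopping_rounds'], verbose=False)]
        )
        
        # Get best iteration
        best_iter = clf.booster_.best_iteration
        fold_iterations.append(best_iter)
        
        # Predict and score
        val_pred = clf.predict_proba(X_val)[:, 1]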
        score = roc_auc_score(y_val, val_pred)
        fold_scores.append(score)
        
        print(f"  Fold {fold+1}: AUC={score:.4f}, iterations={best_iter}")
    
    mean_score = np.mean(fold_scores)
    std_score = np.std(fold_scores)
    mean_iter = np.mean(fold_iterations)
    
    results[config_name] = {
        'mean_score': mean_score,
        'std_score': std_score,
        'fold_scores': fold_scores,
        'mean_iterations': mean_iter,
        'fold_iterations': fold_iterations
    }
    
    print(f"  Mean AUC: {mean_score:.4f} ± {std_score:.4f}")
    print(f"  Mean iterations: {mean_iter:.1f}")

# Compare results
print(f"\n{'='*60}")
print("SUMMARY: Hyperparameter Comparison")
print(f"{'='*60}")

for config_name, result in results.items():
    print(f"{config_name:20s}: {result['mean_score']:.4f} ± {result['std_score']:.4f} (avg {result['mean_iterations']:.1f} iter)")
    print(f"  Fold iterations: {result['fold_iterations']}")
    print(f"  Fold scores: {result['fold_scores']}")
    print()
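
Config 4 is announced above but is not part of the `configs` dictionary. One way to test it is to standardize only the DistilBERT columns, fitting the scaler on each training fold so that validation statistics do not leak into training. In the sketch below, `scale_embedding_block` is a hypothetical helper, and `n_meta` (the number of leading meta-feature columns, 19 in this notebook) is an assumption about the column layout of `X_combined_raw`. Because tree splits are largely insensitive to per-feature scaling, a large change in AUC should not be expected from this step.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def scale_embedding_block(X_train, X_val, n_meta):
    """Standardize columns from index n_meta onward, using training-fold statistics only."""
    scaler = StandardScaler().fit(X_train[:, n_meta:])
    X_train_out = X_train.copy()
    X_val_out = X_val.copy()
    X_train_out[:, n_meta:] = scaler.transform(X_train[:, n_meta:])
    X_val_out[:, n_meta:] = scaler.transform(X_val[:, n_meta:])
    return X_train_out, X_val_out

# Synthetic stand-in: 3 meta columns followed by 4 embedding columns
rng = np.random.default_rng(42)
X_tr = np.hstack([rng.normal(100, 50, size=(200, 3)), rng.normal(0.1, 0.3, size=(200, 4))])
X_va = np.hstack([rng.normal(100, 50, size=(50, 3)), rng.normal(0.1, 0.3, size=(50, 4))])

X_tr_s, X_va_s = scale_embedding_block(X_tr, X_va, n_meta=3)
print("Train embedding means:", np.round(X_tr_s[:, 3:].mean(axis=0), 3))
print("Train embedding stds: ", np.round(X_tr_s[:, 3:].std(axis=0), 3))
```

Inside the cross-validation loop this could be called as `X_train, X_val = scale_embedding_block(X_train, X_val, n_meta=len(meta_features))` before fitting, with the rest of the loop unchanged.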