In [1]:
## Data Understanding
**Reference notebooks for data characteristics:**
- `exploration/eda.ipynb` - Contains full EDA: 2,878 training samples, 24.8% positive rate (moderate class imbalance), text features (title: avg 72 chars, text: avg 403 chars), 32 total features including user activity metrics
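- Key finding: request_number_of_comments_at_retrieval has the highest correlation with the target (0.29); it is recorded at retrieval time, so confirm it is available at prediction time before using it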
- Use the findings from this notebook when implementing features

## Models for Text + Tabular Data
For problems combining text and structured meta-data:

**Primary Approaches:**
1. **Gradient Boosting on Text Embeddings**: Extract embeddings from pretrained transformers (BERT/RoBERTa) and feed them to LightGBM/XGBoost/CatBoost alongside the tabular features. This often outperforms text-only or tabular-only models by combining semantic text understanding with the strengths of tree-based tabular learners.

2. **Multimodal Transformers**: Convert tabular features to text strings and prepend to documents, then fine-tune end-to-end. Effective when text is primary signal but meta-data provides important context.

3. **Two-Stage Ensembles**: Generate LLM embeddings → Train multiple tabular models → Stack with meta-learner. This approach won multiple Kaggle competitions by capturing diverse patterns.

**Model Selection Guidelines:**
- Start with LightGBM on TF-IDF + tabular features (strong baseline)
- Add transformer embeddings for a further gain (roughly 0.02-0.05 AUC is the expectation here; confirm with CV)
- Neural networks useful when >5K samples and text-heavy signal
- For Reddit data: SocBERT (pretrained on roughly 929M sequences of Twitter and Reddit text) may outperform generic BERT

## Preprocessing for Reddit/Text Data

**Text Features:**
- **TF-IDF**: Use n-grams (1-3) on both title and text separately. Include char n-grams for robustness to typos.
- **Text Length Features**: Word count, char count, avg word length - surprisingly predictive
- **Sentiment/LIWC**: Extract gratitude, politeness markers ("please", "thank", "appreciate")
- **Topic Modeling**: LDA with 10-20 topics captures request themes (financial hardship, celebration, etc.)
- **Edit-Aware Text**: Use request_text_edit_aware to remove success-indicating edits
- **Reddit-Specific Cleaning**: Handle markdown, user mentions (/u/username), subreddit links (r/subname), URLs, emojis
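
The cleaning and length features above could be sketched roughly as follows. The regular expressions, placeholder tokens (`URL`, `USER`, `SUBREDDIT`) and function names are assumptions made for illustration, not a complete Reddit markdown parser; emoji handling is left out.

```python
import re

# Illustrative patterns (assumptions); real Reddit markdown has more cases.
MD_LINK_RE = re.compile(r"\[([^\]]+)\]\([^)]*\)")           # [label](target)
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"(?<![A-Za-z0-9])/?u/[A-Za-z0-9_-]+")  # /u/name or u/name
SUB_RE = re.compile(r"(?<![A-Za-z0-9])/?r/[A-Za-z0-9_]+")    # /r/name or r/name
EMPHASIS_RE = re.compile(r"[*~`]+")


def clean_reddit_text(text: str) -> str:
    """Replace Reddit-specific tokens with placeholders and strip emphasis marks."""
    text = MD_LINK_RE.sub(r"\1", text)   # keep only the link label
    text = URL_RE.sub(" URL ", text)
    text = USER_RE.sub(" USER ", text)
    text = SUB_RE.sub(" SUBREDDIT ", text)
    text = EMPHASIS_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def length_features(text: str) -> dict:
    """Character count, word count and average word length (0.0 for empty text)."""
    words = text.split()
    n_words = len(words)
    avg_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    return {"char_count": len(text), "word_count": n_words, "avg_word_length": avg_len}
```

Returning 0.0 for empty text keeps zero-length requests from raising a division error.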

**Tabular Features:**
- **User Activity Ratios**: comments/posts ratios, RAOP-specific vs general activity
- **Temporal Features**: Hour of day, day of week from timestamps (people may be more generous on weekends)
- **Vote Ratios**: upvote/downvote ratios at request time vs retrieval time
- **Flair Encoding**: One-hot or target encode (shroom=received, PIF=given after receiving); because flair records past pizza outcomes, check it for target leakage before use
- **Subreddit Diversity**: Number of unique subreddits as social breadth indicator
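
A minimal pandas sketch of the ratio, vote and temporal features listed above. The function name and the derived column names are made up for illustration. The timestamp column `unix_timestamp_of_request_utc` is an assumption, since the key list printed in this notebook is truncated before any timestamp field appears.

```python
import pandas as pd


def add_tabular_features(df: pd.DataFrame,
                         ts_col: str = "unix_timestamp_of_request_utc") -> pd.DataFrame:
    """Return a copy of df with activity ratios, an upvote fraction and time features."""
    out = df.copy()
    comments = out["requester_number_of_comments_at_request"]
    posts = out["requester_number_of_posts_at_request"]
    raop_comments = out["requester_number_of_comments_in_raop_at_request"]

    # +1 in the denominator avoids division by zero for brand-new accounts.
    out["comments_per_post"] = comments / (posts + 1)
    out["raop_comment_share"] = raop_comments / (comments + 1)

    # upvotes = (plus + minus) / 2, so this is the fraction of votes that are upvotes.
    plus = out["requester_upvotes_plus_downvotes_at_request"]
    minus = out["requester_upvotes_minus_downvotes_at_request"]
    fraction = ((plus + minus) / 2) / plus.where(plus > 0)
    out["upvote_fraction"] = fraction.fillna(0.5)  # neutral value when there are no votes

    # ts_col is assumed to hold Unix seconds in UTC.
    ts = pd.to_datetime(out[ts_col], unit="s", utc=True)
    out["hour_of_day"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    return out
```

Only request-time columns are used here, so the derived features do not depend on information gathered after the outcome.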

**Handling Data Quality Issues:**
- post_was_edited contains mixed types (booleans + timestamps) - clean to boolean
- Zero-length text fields need special handling
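
One way to clean post_was_edited is sketched below. The helper name is hypothetical, and the rule that any non-boolean positive value is an edit timestamp (and therefore means "edited") is an assumption about the data.

```python
def to_edited_flag(value) -> bool:
    """Map a mixed boolean/timestamp value to a plain boolean.

    Assumption: non-boolean positive numbers are edit timestamps,
    so they mean the post was edited.
    """
    if isinstance(value, bool):
        return value
    if value is None:
        return False
    return float(value) > 0  # NaN compares False, so missing values map to False
```

Applied to the DataFrame this would be `df['post_was_edited'].map(to_edited_flag)`.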

## Handling Class Imbalance (24.8% positive)

**Validation Strategy:**
- **Stratified K-Fold (k=5)**: Essential for stable CV scores given class imbalance
- **Stratified train/val split**: Use stratify parameter to maintain class distribution

**Training Techniques:**
1. **Class Weights**: Set scale_pos_weight (XGBoost/LightGBM) to negative_samples/positive_samples = 2163/715 ≈ 3.0; in sklearn, pass class_weight='balanced' or an explicit dict such as {0: 1, 1: 3}
2. **Focal Loss**: Alternative to class weighting, focuses on hard examples
3. **Oversampling**: SMOTE or random oversample minority class in training folds only
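4. **Threshold Tuning**: If hard labels are needed, optimize the decision threshold on the validation set rather than using 0.5. This does not change ROC-AUC, which depends only on how the scores are ranked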

**Avoid**: Oversampling before CV splits - leads to data leakage
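
The sketch below shows the class-weight arithmetic and oversampling applied inside each training fold only. The feature matrix is random placeholder data and `oversample_minority` is a hypothetical helper; SMOTE would replace the random duplication step.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Class counts taken from the target distribution printed in this notebook.
n_neg, n_pos = 2163, 715
scale_pos_weight = n_neg / n_pos  # about 3.03


def oversample_minority(X, y, rng):
    """Randomly duplicate minority rows until both classes have equal counts."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    if len(pos_idx) < len(neg_idx):
        minority, majority = pos_idx, neg_idx
    else:
        minority, majority = neg_idx, pos_idx
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # placeholder features
y = (rng.random(200) < 0.25).astype(int)  # roughly the observed positive rate

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Resample the training fold only; the validation fold keeps its true distribution.
    X_tr, y_tr = oversample_minority(X[train_idx], y[train_idx], rng)
    X_val, y_val = X[val_idx], y[val_idx]
    # fit a model on (X_tr, y_tr) and score it on (X_val, y_val) here
```

Because the split happens before resampling, no duplicated row can appear in both the training and validation parts of a fold.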

## Feature Engineering Specific to This Problem

**High-Impact Features:**
1. **Request Urgency Signals**: Words like "desperate", "starving", "broke", "need" - but balance with authenticity
2. **Gratitude Expressions**: "thank you", "appreciate", "grateful" - politeness matters
3. **Narrative Length**: Longer, detailed stories often more successful (but not too long)
4. **Community Engagement**: RAOP-specific comment count more predictive than general Reddit activity
5. **Reciprocity Indicators**: Mention of past giving, promises to "pay it forward"
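
A rough sketch of lexicon counts for the urgency, gratitude and reciprocity signals above. The word lists are illustrative assumptions seeded from the terms named in these notes, and matching is prefix-based, so "thank" also counts "thanks" (and "broke" also counts "broken").

```python
import re

# Illustrative lexicons (assumptions); extend them after inspecting real requests.
LEXICONS = {
    "urgency": ["desperate", "starving", "broke", "need"],
    "politeness": ["thank", "appreciate", "grateful", "please"],
    "reciprocity": ["pay it forward", "pay it back", "return the favor"],
}

# One compiled pattern per lexicon; \b anchors each term at a word start.
PATTERNS = {
    name: re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + ")")
    for name, terms in LEXICONS.items()
}


def lexicon_counts(text: str) -> dict:
    """Count lexicon matches per category in lower-cased text."""
    lowered = text.lower()
    return {name: len(pattern.findall(lowered)) for name, pattern in PATTERNS.items()}
```

Dividing each count by the word count gives a rate that is less tied to narrative length.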

**From Winning Solutions:**
- Combine request_title + request_text for TF-IDF (captures full context)
- Use request_text_edit_aware to remove post-success edits that leak target
- Extract LIWC-style psycholinguistic features (positive emotion, social words)
- Create interaction features: account_age × activity_level, text_length × politeness
- **Character-level features**: Character n-grams (3-5) capture typos and writing style
- **Domain-adapted embeddings**: SocBERT for social-media language, or Tweet2Vec for character-level patterns

## Ensembling Strategies

**For ROC-AUC Optimization:**
1. **Weighted Average**: Weight models by validation AUC (e.g., 0.4×LGBM + 0.3×XGB + 0.3×CatBoost)
2. **Rank Averaging**: More robust to outliers than probability averaging
3. **Stacking**: Use logistic regression or LightGBM as meta-learner on out-of-fold predictions
4. **Diversity is Key**: Combine tree models (different seeds, feature subsets) with neural networks
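
Rank averaging can be sketched in plain Python as below. The function names are hypothetical; in practice `scipy.stats.rankdata` does the ranking step. Ranks are scaled to [0, 1], which leaves each model's own AUC unchanged because AUC depends only on ordering.

```python
def average_ranks(scores):
    """Return ranks scaled to [0, 1]; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2  # zero-based mean position of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    denom = max(len(scores) - 1, 1)
    return [r / denom for r in ranks]


def rank_average(model_preds, weights=None):
    """Blend per-model prediction lists by the (optionally weighted) mean of ranks."""
    weights = weights or [1.0] * len(model_preds)
    total = sum(weights)
    ranked = [average_ranks(p) for p in model_preds]
    return [
        sum(w * r[i] for w, r in zip(weights, ranked)) / total
        for i in range(len(model_preds[0]))
    ]
```

Passing `weights=[0.4, 0.3, 0.3]` combines this with the weighted-average idea from item 1.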

**Best Practices:**
- Ensure base models are diverse (different algorithms, different features)
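- Build the meta-learner's inputs from out-of-fold predictions (5-fold CV) so it never trains on predictions a base model made for its own training rows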
- Calibrate probabilities before ensembling (especially for tree models)
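
A small stacking sketch using out-of-fold predictions. It runs on synthetic data from `make_classification`, and the two scikit-learn base models are stand-ins for the LightGBM/XGBoost/CatBoost models discussed here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data with roughly 75% negatives.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.75], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Each column holds one model's out-of-fold probability for the positive class.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    for model in base_models
])

# Score the meta-learner with its own CV so the reported AUC is not optimistic.
meta_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
meta_oof = cross_val_predict(
    LogisticRegression(), oof, y, cv=meta_cv, method="predict_proba"
)[:, 1]
stacked_auc = roc_auc_score(y, meta_oof)
```

For test-time predictions, each base model would be refit on all training rows and the meta-learner fit on the full `oof` matrix.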

## Validation and Optimization

**Cross-Validation:**
- **Primary**: Stratified 5-fold CV for all model development
- **Final Score**: Refit on the full training set for the final model; submit to the public LB sparingly (risk of overfitting to it)

**Hyperparameter Tuning:**
- **Optuna/Random Search**: 50-100 trials sufficient for gradient boosting
- **Key Parameters**: Learning rate (0.01-0.1), max_depth (3-7), num_leaves (31-127; effectively capped at 2^max_depth when max_depth is set), subsample (0.7-1.0)
- **Early Stopping**: Use validation AUC with patience=50
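
A random-search sketch over the ranges above, on synthetic data. scikit-learn's `GradientBoostingClassifier` is used as a stand-in for LightGBM here, so `num_leaves` and early stopping are not covered, and `n_iter` is kept tiny only to keep the example fast.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data with roughly 75% negatives.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.75], random_state=0)

param_distributions = {
    "learning_rate": loguniform(0.01, 0.1),  # sampled on a log scale
    "max_depth": randint(3, 8),              # integers 3..7
    "subsample": uniform(0.7, 0.3),          # uniform on [0.7, 1.0]
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_distributions,
    n_iter=5,  # the notes above suggest 50-100 trials for a real search
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
best_params, best_auc = search.best_params_, search.best_score_
```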

**Feature Selection:**
- Use feature importance from LightGBM to prune low-impact features
- Recursive Feature Elimination (RFE) can improve generalization

## Implementation Priority

**Phase 1 (Baseline):**
1. TF-IDF on combined title+text (1-3 n-grams)
2. Basic tabular features (scaling only)
3. LightGBM with class weighting
4. Stratified 5-fold CV
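
The Phase 1 steps could be wired together roughly as follows. The toy rows are invented for illustration, and `LogisticRegression` stands in for LightGBM so the sketch needs only scikit-learn; a gradient-boosting classifier with class weighting would slot into the same pipeline position.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real DataFrame; every value here is invented.
toy = pd.DataFrame({
    "request_title": ["[REQUEST] hungry student", "[REQUEST] pizza please"] * 10,
    "request_text_edit_aware": ["lost my job, will pay it forward", "just want pizza"] * 10,
    "requester_number_of_comments_at_request": [50, 0] * 10,
    "requester_received_pizza": [True, False, False, False] * 5,
})
toy["full_text"] = toy["request_title"] + " " + toy["request_text_edit_aware"]
num_cols = ["requester_number_of_comments_at_request"]

features = ColumnTransformer([
    # A string column name (not a list) passes 1-D text to the vectorizer.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3), min_df=2), "full_text"),
    ("num", StandardScaler(), num_cols),
])
model = make_pipeline(
    features,
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    model,
    toy[["full_text"] + num_cols],
    toy["requester_received_pizza"],
    cv=cv,
    scoring="roc_auc",
)
```

Keeping the vectorizer and scaler inside the pipeline means they are refit on each training fold, so the validation folds do not influence the vocabulary or the scaling.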

**Phase 2 (Improvement):**
1. Add transformer embeddings (BERT-base)
2. Engineer Reddit-specific features (gratitude, urgency)
3. Try CatBoost for categorical handling
4. Simple ensemble (average 2-3 models)

**Phase 3 (Competitive):**
1. Full multimodal approach (text + tabular)
2. Advanced feature engineering (topic modeling, LIWC)
3. Stacked ensemble with 5+ diverse models
4. Threshold optimization on validation set (only if hard labels are required; ROC-AUC is unaffected)

## Key Insights from Data Exploration
- Moderate class imbalance requires careful handling but not extreme measures
- Text length and community engagement metrics are predictive
- User flair is sparse but informative (shroom=received before)
- Edit-aware text prevents leakage from post-success edits

Number of training samples: 2878
Type of data: <class 'list'>
First sample keys: ['giver_username_if_known', 'number_of_downvotes_of_request_at_retrieval', 'number_of_upvotes_of_request_at_retrieval', 'post_was_edited', 'request_id', 'request_number_of_comments_at_retrieval', 'request_text', 'request_text_edit_aware', 'request_title', 'requester_account_age_in_days_at_request', 'requester_account_age_in_days_at_retrieval', 'requester_days_since_first_post_on_raop_at_request', 'requester_days_since_first_post_on_raop_at_retrieval', 'requester_number_of_comments_at_request', 'requester_number_of_comments_at_retrieval', 'requester_number_of_comments_in_raop_at_request', 'requester_number_of_comments_in_raop_at_retrieval', 'requester_number_of_posts_at_request', 'requester_number_of_posts_at_retrieval', 'requester_number_of_posts_on_raop_at_request', 'requester_number_of_posts_on_raop_at_retrieval', 'requester_number_of_subreddits_at_request', 'requester_received_pizza', 'requester_subredd

In [2]:
# Convert to DataFrame for easier analysis
import pandas as pd

df = pd.DataFrame(train_data)  # train_data: the list of dicts loaded in the previous cell

# Check target distribution
print("Target distribution:")
print(df['requester_received_pizza'].value_counts())
print(f"\nClass balance: {df['requester_received_pizza'].mean():.3f} (positive rate)")

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum().head(10))  # Show first 10 columns

Target distribution:
requester_received_pizza
False    2163
True      715
Name: count, dtype: int64

Class balance: 0.248 (positive rate)

Missing values per column:
giver_username_if_known                        0
number_of_downvotes_of_request_at_retrieval    0
number_of_upvotes_of_request_at_retrieval      0
post_was_edited                                0
request_id                                     0
request_number_of_comments_at_retrieval        0
request_text                                   0
request_text_edit_aware                        0
request_title                                  0
requester_account_age_in_days_at_request       0
dtype: int64


In [3]:
# Examine text features
print("Text feature analysis:")
print("\nRequest title length (characters):")
print(df['request_title'].str.len().describe())

print("\nRequest text length (characters):")
print(df['request_text'].str.len().describe())

print("\nRequest text edit aware length (characters):")
print(df['request_text_edit_aware'].str.len().describe())

# Show some examples
print("\n" + "="*50)
print("EXAMPLE 1 (successful):")
successful = df[df['requester_received_pizza'] == True].iloc[0]
print(f"Title: {successful['request_title']}")
print(f"Text (first 200 chars): {successful['request_text'][:200]}...")

print("\n" + "="*50)
print("EXAMPLE 2 (unsuccessful):")
unsuccessful = df[df['requester_received_pizza'] == False].iloc[0]
print(f"Title: {unsuccessful['request_title']}")
print(f"Text (first 200 chars): {unsuccessful['request_text'][:200]}...")

Text feature analysis:

Request title length (characters):
count    2878.000000
mean       71.572967
std        36.233487
min         7.000000
25%        46.000000
50%        64.000000
75%        90.000000
max       272.000000
Name: request_title, dtype: float64

Request text length (characters):
count    2878.000000
mean      402.521543
std       362.393727
min         0.000000
25%       182.000000
50%       308.000000
75%       503.750000
max      4460.000000
Name: request_text, dtype: float64

Request text edit aware length (characters):
count    2878.000000
mean      394.567755
std       351.922518
min         0.000000
25%       180.000000
50%       302.000000
75%       498.000000
max      4460.000000
Name: request_text_edit_aware, dtype: float64

EXAMPLE 1 (successful):
Title: [REQUEST] Not much food until tomorrow.
Text (first 200 chars): I will go ahead and say that I got a pizza meal from here before as to not seem like I'm scamming anyone. I have been promised 2 well-paying jo

In [4]:
# Examine numerical features
numerical_cols = [
    'requester_account_age_in_days_at_request',
    'requester_number_of_comments_at_request',
    'requester_number_of_posts_at_request',
    'requester_upvotes_minus_downvotes_at_request',
    'requester_upvotes_plus_downvotes_at_request',
    'requester_number_of_subreddits_at_request',
    'request_number_of_comments_at_retrieval',
    'number_of_upvotes_of_request_at_retrieval',
    'number_of_downvotes_of_request_at_retrieval'
]

print("Numerical feature statistics:")
for col in numerical_cols:
    print(f"\n{col}:")
    print(df[col].describe())

# Check categorical features
print("\n" + "="*60)
print("Categorical features:")
print("\nUser flair distribution:")
print(df['requester_user_flair'].value_counts())

print("\nPost edited distribution:")
print(df['post_was_edited'].value_counts())

# Correlation with target for numerical features
print("\n" + "="*60)
print("Correlation with target:")
correlations = df[numerical_cols + ['requester_received_pizza']].corr()['requester_received_pizza'].sort_values(ascending=False)
print(correlations)

Numerical feature statistics:

requester_account_age_in_days_at_request:
count    2878.000000
mean      250.682364
std       301.838771
min         0.000000
25%         3.038877
50%       155.156377
75%       383.640090
max      2809.750787
Name: requester_account_age_in_days_at_request, dtype: float64

requester_number_of_comments_at_request:
count    2878.000000
mean      112.311327
std       192.017515
min         0.000000
25%         0.000000
50%        22.000000
75%       132.000000
max       981.000000
Name: requester_number_of_comments_at_request, dtype: float64

requester_number_of_posts_at_request:
count    2878.000000
mean       21.614663
std        51.580719
min         0.000000
25%         0.000000
50%         4.000000
75%        21.000000
max       867.000000
Name: requester_number_of_posts_at_request, dtype: float64

requester_upvotes_minus_downvotes_at_request:
count      2878.000000
mean       1184.582349
std        4198.255486
min         -67.000000
25%           3.000