In [4]:
## Problem Type
Binary classification combining text features (request title, text) with tabular metadata (Reddit activity metrics, user flair, timestamps).

## Reference Notebooks for Data Characteristics
- `exploration/eda.ipynb` - Contains full EDA: 2,878 training samples, 24.8% positive class (class imbalance), text length distributions, feature types (16 int, 9 object, 6 float, 1 bool), missing values in requester_user_flair (75% missing)

## Data Understanding
**Class Imbalance**: ~25% positive class requires special handling. See eda.ipynb for exact distribution.
**Text Features**: request_text (avg 402 chars), request_title (avg 72 chars) - both need preprocessing
**Categorical Features**: requester_user_flair has low cardinality (two observed values, shroom and PIF) but is ~75% missing (2,163 of 2,878 rows)
**Temporal Features**: Unix timestamps available for feature engineering

## Models
For text + tabular classification problems, winning Kaggle solutions typically use:
- **Gradient Boosting (Primary)**: XGBoost, LightGBM, or CatBoost trained on combined text embeddings + tabular features
- **Neural Networks (Secondary)**: BERT/RoBERTa for text encoding combined with tabular features in downstream classifier
- **Ensemble Size**: 3-5 diverse models (mix of tree-based and neural approaches)

## Text Feature Engineering
**Preprocessing** (Critical for Reddit/social media language):
- Preserve informal cues: DON'T remove all punctuation (can indicate sentiment/sarcasm)
- Normalize elongated words: "soooo" → "so"
- Handle Reddit-specific artifacts: strip/normalize URLs, user mentions (/u/username), subreddit tags (/r/subreddit)
- Clean markdown formatting while preserving emoji for sentiment
- Apply lemmatization (preferred over stemming for social media)
- Create custom stopword list including Reddit-specific terms
- Combine request_title and request_text into single document
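
A minimal sketch of several of the preprocessing steps above, using only the standard library. The function name `clean_reddit_text`, the placeholder tokens (`URL`, `USER`, `SUBREDDIT`), and the regular expressions are illustrative assumptions, not a fixed recipe. Lemmatization and stopword handling are left out.

```python
import html
import re

# Patterns are assumptions chosen for illustration; tune them against real posts.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"(?<![A-Za-z0-9])/?u/[A-Za-z0-9_-]+")
SUBREDDIT_RE = re.compile(r"(?<![A-Za-z0-9])/?r/[A-Za-z0-9_]+")
ELONGATED_RE = re.compile(r"([A-Za-z])\1{2,}")  # a letter repeated 3+ times


def clean_reddit_text(title: str, body: str) -> str:
    """Combine title and body, then normalize Reddit-specific artifacts."""
    text = html.unescape(f"{title} {body}")   # "&amp;" -> "&"
    text = URL_RE.sub(" URL ", text)          # URLs first, so /r/ inside a URL is not re-matched
    text = USER_RE.sub(" USER ", text)
    text = SUBREDDIT_RE.sub(" SUBREDDIT ", text)
    text = ELONGATED_RE.sub(r"\1", text)      # "soooo" -> "so"; punctuation such as "!!!" is kept
    return re.sub(r"\s+", " ", text).strip()


example = clean_reddit_text(
    "(REQUEST) not home 4 the holidays &amp; would really like some pizza for my family!!!",
    "soooo hungry, see /r/pizza and /u/some_user http://example.com/x",
)
print(example)
```

Collapsing a repeated letter to a single character follows the "soooo" → "so" example but can damage rare words that legitimately contain a tripled letter; collapsing to two characters is a common alternative.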

**Feature Extraction**:
- TF-IDF vectors (unigrams + bigrams) for gradient boosting models
- Sentence embeddings (BERT, RoBERTa) for neural approaches
- Text length features: char count, word count, avg word length
- Sentiment analysis scores
- Named entity recognition features
- Punctuation density and patterns
- Capitalization patterns (ALL CAPS words count)
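
The hand-crafted features in this list (length, punctuation density, ALL CAPS count) can be sketched with the standard library alone. The function name `text_stats` and the exact definitions (for example, counting only alphabetic words longer than one character as ALL CAPS) are assumptions made for illustration.

```python
import string


def text_stats(text: str) -> dict:
    """Simple surface statistics for one document."""
    words = text.split()
    n_chars = len(text)
    n_words = len(words)
    n_punct = sum(ch in string.punctuation for ch in text)
    # Skip single letters such as "I" so they do not count as shouting.
    n_caps = sum(1 for w in words if len(w) > 1 and w.isalpha() and w.isupper())
    return {
        "char_count": n_chars,
        "word_count": n_words,
        # Guard against empty requests (the minimum text length in the data is 0).
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "punct_density": n_punct / n_chars if n_chars else 0.0,
        "all_caps_words": n_caps,
    }


stats = text_stats("PLEASE help, I am SO hungry!!!")
print(stats)
```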

## Tabular Feature Engineering
**Metadata Features**:
- Log transforms for count features (upvotes, comments, posts) to reduce skewness
- Ratios: upvotes/comments, comments/posts, karma metrics
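- Differences between request-time and retrieval-time metrics (the `*_at_retrieval` columns are measured after the request, so confirm they are available at prediction time and do not leak the outcome)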
- User activity rates: comments per day, posts per day
- Subreddit diversity metrics from requester_subreddits_at_request
- Account age normalized by activity (comments per day of account age)
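
A sketch of the log transforms and rate features, assuming pandas and NumPy are available. The column names come from the training data; the function name `add_activity_features`, the new feature names, and the `+ 1` in the denominators (to avoid division by zero for brand-new accounts) are illustrative choices.

```python
import numpy as np
import pandas as pd


def add_activity_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with log-scaled counts and activity rates added."""
    out = df.copy()
    comments = out["requester_number_of_comments_at_request"]
    posts = out["requester_number_of_posts_at_request"]
    age_days = out["requester_account_age_in_days_at_request"]

    # log1p handles zero counts and compresses the long right tail.
    out["log_comments"] = np.log1p(comments)
    out["log_posts"] = np.log1p(posts)

    # Ratios and rates; +1 keeps the denominator positive.
    out["comments_per_post"] = comments / (posts + 1)
    out["comments_per_day"] = comments / (age_days + 1)
    out["posts_per_day"] = posts / (age_days + 1)
    return out


# Toy rows for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "requester_number_of_comments_at_request": [0, 9],
    "requester_number_of_posts_at_request": [0, 2],
    "requester_account_age_in_days_at_request": [0.0, 4.0],
})
features = add_activity_features(toy)
print(features[["log_comments", "comments_per_post", "comments_per_day"]])
```

Columns that can be negative, such as requester_upvotes_minus_downvotes_at_request, would need a signed transform rather than `log1p`.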

**Categorical Encoding**:
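- **requester_user_flair**: Create explicit "Missing" category for the 75% missing values, then apply target encoding. The missing count (2,163) equals the number of negative examples, so check for target leakage before relying on this feature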
- One-hot encoding for low-cardinality categorical features
- Frequency encoding for high-cardinality features
- Target encoding with careful cross-validation to avoid leakage
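
One way to keep target encoding leak-free is to compute each row's encoding only from rows in other folds. The sketch below assumes pandas and NumPy; the function name `oof_target_encode`, the smoothing constant, and the simple shuffled fold assignment are assumptions, and a stratified splitter could be substituted.

```python
import numpy as np
import pandas as pd


def oof_target_encode(cat: pd.Series, target: pd.Series,
                      n_splits: int = 5, smoothing: float = 10.0,
                      seed: int = 0) -> pd.Series:
    """Out-of-fold, smoothed target encoding of one categorical column."""
    index = cat.index
    cat = cat.fillna("Missing").astype(str).reset_index(drop=True)
    target = target.astype(float).reset_index(drop=True)

    # Shuffled fold ids 0..n_splits-1 of near-equal size.
    rng = np.random.default_rng(seed)
    fold_id = rng.permutation(len(cat)) % n_splits

    encoded = np.empty(len(cat), dtype=float)
    for k in range(n_splits):
        fit_mask = fold_id != k
        prior = target[fit_mask].mean()
        stats = target[fit_mask].groupby(cat[fit_mask]).agg(["sum", "count"])
        # Shrink each category mean toward the prior (m-estimate smoothing).
        smooth = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior.
        encoded[~fit_mask] = cat[~fit_mask].map(smooth).fillna(prior).to_numpy()
    return pd.Series(encoded, index=index, name=f"{cat.name}_te")


# Toy data for illustration only.
toy_cat = pd.Series(["a", "a", None, "b"] * 10, name="flair")
toy_y = pd.Series([1, 0, 0, 1] * 10)
toy_te = oof_target_encode(toy_cat, toy_y)
print(toy_te.head())
```

At prediction time the encoding for new rows would be fitted once on the full training set, using the same smoothing.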

**Temporal Features**:
- Extract hour of day, day of week from timestamps
- Cyclical encoding for time features (sin/cos transforms)
- Time since account creation normalized by request time
- Posting time relative to peak Reddit hours
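
Cyclical encoding maps hour and weekday onto a circle so that 23:30 and 00:30 end up close together. A standard-library sketch follows; the function name `cyclical_time_features` is hypothetical, and it assumes the input is a UTC Unix timestamp such as unix_timestamp_of_request_utc.

```python
import math
from datetime import datetime, timezone


def cyclical_time_features(unix_ts: float) -> dict:
    """Sin/cos encoding of hour of day and day of week for a UTC timestamp."""
    dt = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    hour = dt.hour + dt.minute / 60   # fractional hour in [0, 24)
    dow = dt.weekday()                # Monday = 0 ... Sunday = 6
    return {
        "hour_sin": math.sin(2 * math.pi * hour / 24),
        "hour_cos": math.cos(2 * math.pi * hour / 24),
        "dow_sin": math.sin(2 * math.pi * dow / 7),
        "dow_cos": math.cos(2 * math.pi * dow / 7),
    }


late = cyclical_time_features(23.5 * 3600)    # 23:30 UTC
early = cyclical_time_features(24.5 * 3600)   # 00:30 UTC the next day
gap = math.hypot(late["hour_sin"] - early["hour_sin"],
                 late["hour_cos"] - early["hour_cos"])
print(round(gap, 3))
```

Both the sine and cosine components are needed; either one alone maps two different hours to the same value.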

## Handling Class Imbalance
**Critical for this dataset (24.8% positive class)**:
- Use AUC-ROC as evaluation metric (provided in competition)
- Apply scale_pos_weight in XGBoost/LightGBM (calculate as negative/positive ratio ≈ 3.0)
- Consider class_weight='balanced' in scikit-learn models
- Optional: Try SMOTE oversampling on minority class
- Track PR-AUC alongside AUC-ROC during validation; it is more sensitive to minority-class performance
- Use stratified sampling throughout to preserve class distribution
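
The weights mentioned above follow directly from the class counts in the target distribution (2,163 negatives, 715 positives). The short calculation below shows the negative/positive ratio and the weights produced by the `n_samples / (n_classes * class_count)` formula that scikit-learn documents for `class_weight='balanced'`; the variable names are illustrative.

```python
n_neg, n_pos = 2163, 715          # class counts from the EDA output
n_total = n_neg + n_pos

# Ratio suggested for scale_pos_weight in XGBoost/LightGBM.
scale_pos_weight = n_neg / n_pos

# 'balanced' weights: n_samples / (n_classes * class_count).
balanced_weights = {
    False: n_total / (2 * n_neg),
    True: n_total / (2 * n_pos),
}

print(round(scale_pos_weight, 3))                              # 3.025
print({k: round(v, 3) for k, v in balanced_weights.items()})   # {False: 0.665, True: 2.013}
```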

## Validation Strategy
- Stratified K-Fold (k=5) to preserve class distribution
- Time-based splits if temporal leakage is a concern
- Use early stopping on validation AUC-ROC
- Monitor both AUC-ROC and PR-AUC for imbalanced performance
- Use the same fold assignments for text-based and tabular-based models so their out-of-fold predictions can be compared and stacked
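
A sketch of stratified 5-fold validation that reports both metrics, assuming scikit-learn is installed. The data is synthetic (`make_classification` with roughly 25% positives) and logistic regression stands in for the real models; `average_precision_score` is used as the usual summary of the precision-recall curve.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real feature matrix, about 25% positives.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.75, 0.25], random_state=0,
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
roc_scores, pr_scores = [], []
for train_idx, valid_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[valid_idx])[:, 1]
    roc_scores.append(roc_auc_score(y[valid_idx], proba))
    pr_scores.append(average_precision_score(y[valid_idx], proba))

print(f"AUC-ROC: {np.mean(roc_scores):.3f} +/- {np.std(roc_scores):.3f}")
print(f"PR-AUC:  {np.mean(pr_scores):.3f} +/- {np.std(pr_scores):.3f}")
```

Fixing `random_state` on the splitter keeps the fold assignment reproducible across runs and across models.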

## Ensembling
**Stacking Approach**:
- Level 1: Diverse models (XGBoost on TF-IDF, LightGBM on embeddings, CatBoost on combined)
- Level 2: Logistic regression or simple averaging
- Use out-of-fold predictions for meta-features
- Include both text-heavy and metadata-heavy models for diversity
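
The stacking recipe can be sketched with scikit-learn alone. Here a random forest and a logistic regression stand in for the XGBoost/LightGBM/CatBoost level-1 models, and the data is synthetic; those substitutions are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.75, 0.25], random_state=0,
)

# One shared splitter so every base model sees identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
base_models = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}

# Level 1: each column holds one model's out-of-fold positive-class probability.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# Level 2: a simple meta-learner trained on the out-of-fold predictions.
meta_model = LogisticRegression()
meta_model.fit(oof, y)
print(dict(zip(base_models, meta_model.coef_[0].round(3))))
```

Because each out-of-fold prediction comes from a model that never saw that row, the meta-learner is not trained on leaked labels. Scoring the meta-learner itself still needs a held-out or nested split.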

**Blending**:
- Weighted average based on validation performance
- Rank averaging for robustness
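- Geometric mean of predicted probabilities as an alternative to the arithmetic mean (it penalizes disagreement between models; it does not calibrate probabilities)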
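
Rank averaging and the geometric mean can both be written in a few lines of NumPy. The function names are hypothetical, and ties in the rank step are broken by position rather than averaged, which is a simplification.

```python
import numpy as np


def rank_average(pred_list):
    """Average of per-model ranks, each scaled to [0, 1]."""
    scaled = []
    for preds in pred_list:
        preds = np.asarray(preds, dtype=float)
        ranks = preds.argsort().argsort()          # 0 = lowest score
        scaled.append(ranks / max(len(preds) - 1, 1))
    return np.mean(scaled, axis=0)


def geometric_mean(pred_list, eps=1e-9):
    """Geometric mean of probabilities, computed in log space."""
    stacked = np.clip(np.asarray(pred_list, dtype=float), eps, 1.0)
    return np.exp(np.log(stacked).mean(axis=0))


model_a = [0.1, 0.9, 0.5]
model_b = [0.2, 0.3, 0.8]
print(rank_average([model_a, model_b]))      # [0.   0.75 0.75]
print(geometric_mean([[0.25, 1.0], [1.0, 1.0]]))  # [0.5 1. ]
```

Rank averaging only changes the ordering-based score, which is sufficient for AUC-ROC, but the outputs are no longer probabilities.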

## Optimization
**Hyperparameter Tuning**:
- Bayesian optimization (Optuna) for efficient search
- Focus on: learning_rate, max_depth, min_child_samples (LightGBM; the XGBoost counterpart is min_child_weight), subsample
- Use early stopping to prevent overfitting
- Tune scale_pos_weight carefully for class imbalance

**Feature Selection**:
- SHAP values for feature importance interpretation
- Recursive feature elimination based on validation score
- Correlation analysis to remove redundant features
- Focus on features that work well across multiple model types
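
The correlation-analysis step can be sketched as follows, assuming pandas and NumPy. The function name `drop_correlated` and the 0.95 threshold are illustrative; for each highly correlated pair, the column that appears later is the one dropped.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.95):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy frame: "b" is an exact multiple of "a", "c" is uncorrelated with both.
toy_features = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [1, -1, 1, -1, 1],
})
reduced, dropped = drop_correlated(toy_features)
print(dropped)  # ['b']
```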

Training data shape: (2878, 32)
Columns: ['giver_username_if_known', 'number_of_downvotes_of_request_at_retrieval', 'number_of_upvotes_of_request_at_retrieval', 'post_was_edited', 'request_id', 'request_number_of_comments_at_retrieval', 'request_text', 'request_text_edit_aware', 'request_title', 'requester_account_age_in_days_at_request', 'requester_account_age_in_days_at_retrieval', 'requester_days_since_first_post_on_raop_at_request', 'requester_days_since_first_post_on_raop_at_retrieval', 'requester_number_of_comments_at_request', 'requester_number_of_comments_at_retrieval', 'requester_number_of_comments_in_raop_at_request', 'requester_number_of_comments_in_raop_at_retrieval', 'requester_number_of_posts_at_request', 'requester_number_of_posts_at_retrieval', 'requester_number_of_posts_on_raop_at_request', 'requester_number_of_posts_on_raop_at_retrieval', 'requester_number_of_subreddits_at_request', 'requester_received_pizza', 'requester_subreddits_at_request', 'requester_upvotes_minu

Unnamed: 0,giver_username_if_known,number_of_downvotes_of_request_at_retrieval,number_of_upvotes_of_request_at_retrieval,post_was_edited,request_id,request_number_of_comments_at_retrieval,request_text,request_text_edit_aware,request_title,requester_account_age_in_days_at_request,...,requester_received_pizza,requester_subreddits_at_request,requester_upvotes_minus_downvotes_at_request,requester_upvotes_minus_downvotes_at_retrieval,requester_upvotes_plus_downvotes_at_request,requester_upvotes_plus_downvotes_at_retrieval,requester_user_flair,requester_username,unix_timestamp_of_request,unix_timestamp_of_request_utc
0,,2,5,False,t3_q8ycf,0,I will soon be going on a long deployment whic...,I will soon be going on a long deployment whic...,"[REQUEST] Oceanside, Ca. USA- US Marine getti...",0.0,...,False,[Random_Acts_Of_Pizza],3,3,7,7,,SDMarine,1330391000.0,1330391000.0
1,,2,4,False,t3_ixnia,20,"We would all really appreciate it, and would e...","We would all really appreciate it, and would e...",[REQUEST] Three (verified) medical students in...,99.526863,...,False,"[AskReddit, IAmA, TwoXChromosomes, circlejerk,...",491,883,1459,2187,,TheycallmeFoxJohnson,1311434000.0,1311430000.0
2,,1,2,True,t3_ndy6g,0,"It took a lot of courage to make this post, an...","It took a lot of courage to make this post, an...",(REQUEST) not home 4 the holidays &amp; would ...,0.0,...,False,[Random_Acts_Of_Pizza],1,1,3,3,,riverfrontmom,1323968000.0,1323968000.0
3,,1,1,1363315140.0,t3_1abbu1,32,I will go ahead and say that I got a pizza mea...,I will go ahead and say that I got a pizza mea...,[REQUEST] Not much food until tomorrow.,491.088264,...,True,"[Entroductions, RandomActsOfChristmas, RandomK...",25,21,165,195,shroom,Joeramos,1363305000.0,1363301000.0
4,,3,14,False,t3_kseg4,3,My '99 Jeep Cherokee I've had for 10 years now...,My '99 Jeep Cherokee I've had for 10 years now...,[Request] Had my car stolen today,369.417558,...,False,"[DetroitRedWings, DoesAnybodyElse, FoodPorn, K...",942,2043,1906,3483,,m4ngo,1317088000.0,1317084000.0


In [5]:
# Explore target distribution
print("Target distribution:")
print(df_train['requester_received_pizza'].value_counts())
print(f"\nSuccess rate: {df_train['requester_received_pizza'].mean():.3f}")

# Check for missing values
print(f"\nMissing values per column:")
missing = df_train.isnull().sum()
print(missing[missing > 0])

# Check data types
print(f"\nData types:")
print(df_train.dtypes.value_counts())

Target distribution:
requester_received_pizza
False    2163
True      715
Name: count, dtype: int64

Success rate: 0.248

Missing values per column:
requester_user_flair    2163
dtype: int64

Data types:
int64      16
object      9
float64     6
bool        1
Name: count, dtype: int64


In [6]:
# Explore text features
print("Text feature examples:")
print("\nRequest title examples:")
print(df_train['request_title'].head(3).tolist())

print("\nRequest text length statistics:")
df_train['request_text_length'] = df_train['request_text'].str.len()
print(df_train['request_text_length'].describe())

print("\nRequest title length statistics:")
df_train['request_title_length'] = df_train['request_title'].str.len()
print(df_train['request_title_length'].describe())

# Explore categorical features
print("\nRequester user flair distribution:")
print(df_train['requester_user_flair'].value_counts(dropna=False))

Text feature examples:

Request title examples:
['[REQUEST] Oceanside, Ca. USA-  US Marine getting ready to deploy.', "[REQUEST] Three (verified) medical students in Pittsburgh this summer doing research.  And we're almost out of loan money.", '(REQUEST) not home 4 the holidays &amp; would really like some pizza for my family!!!']

Request text length statistics:
count    2878.000000
mean      402.521543
std       362.393727
min         0.000000
25%       182.000000
50%       308.000000
75%       503.750000
max      4460.000000
Name: request_text_length, dtype: float64

Request title length statistics:
count    2878.000000
mean       71.572967
std        36.233487
min         7.000000
25%        46.000000
50%        64.000000
75%        90.000000
max       272.000000
Name: request_title_length, dtype: float64

Requester user flair distribution:
requester_user_flair
None      2163
shroom     677
PIF         38
Name: count, dtype: int64
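
The flair counts above (677 shroom + 38 PIF = 715) match the number of positive examples, and the 2,163 missing values match the number of negatives. Matching totals do not prove a row-by-row match, so a cross-tabulation is a quick way to check whether flair leaks the target. The sketch below assumes pandas; the helper name `flair_vs_target` and the toy rows are hypothetical and do not reproduce the real data.

```python
import pandas as pd


def flair_vs_target(df: pd.DataFrame) -> pd.DataFrame:
    """Cross-tabulate flair (with an explicit Missing level) against the target."""
    flair = df["requester_user_flair"].fillna("Missing")
    outcome = df["requester_received_pizza"].map({False: "no_pizza", True: "pizza"})
    return pd.crosstab(flair, outcome)


# Toy rows for illustration only.
toy_flair = pd.DataFrame({
    "requester_user_flair": [None, None, "shroom", "PIF"],
    "requester_received_pizza": [False, False, True, True],
})
table = flair_vs_target(toy_flair)
print(table)
```

On the real data this would be called as `flair_vs_target(df_train)`. A table with zeros in every off-diagonal cell would indicate that the feature encodes the outcome.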
