In [2]:
# Seed Prompt: Text Classification with Tabular Meta-data

## Problem Type
Binary classification with text (request_title, request_text) and tabular meta-data features. Evaluation metric: AUC.

## Reference Notebooks for Data Characteristics
- `eda.ipynb` - Contains full EDA including:
  * Target distribution (24.8% success rate - moderate class imbalance)
  * Feature correlations with target (request_number_of_comments_at_retrieval highest at 0.291; note this is a retrieval-time feature)
  * User flair analysis (shroom/PIF have 100% success - likely leakage rather than usable signal, see "Potential Pitfalls & Leakage")
  * Text length statistics and feature distributions

## Core Modeling Strategy

### 1. Multimodal Approach (Text + Tabular)
Strong Kaggle solutions for text+tabular problems typically combine both modalities. Three common options:

**Option A: Separate Models + Ensemble**
- Train transformer model (BERT/RoBERTa) on text features only
- Train gradient boosting model (LightGBM/CatBoost) on tabular features only
- Ensemble predictions via stacking or weighted averaging
- This approach leverages strengths of each model type on their respective data

**Option B: Joint Architecture**
- Use a joint model such as AutoGluon's multimodal predictor (TextPredictor in older releases), which combines text representations with encoded categorical/numerical columns
- Lets a single model learn from both modalities jointly
- Often simpler to implement but may require more compute

**Option C: Feature-as-Text**
- Convert tabular features into short text tokens and prepend to original text
- Surprisingly effective and simple approach that can outperform complex pipelines
- Example: "account_age: 100 days, comments: 50, posts: 10 [SEP] original request text"
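
A minimal sketch of Option C, assuming each row is available as a plain dict. The helper name `row_to_text` and the short labels are hypothetical; the column names come from the training data. Whether the literal string `[SEP]` is treated as a special token depends on the tokenizer, so that separator is an assumption too.

```python
def row_to_text(row, text_key="request_text", sep=" [SEP] "):
    """Prepend selected tabular fields to the request text as short tokens."""
    # Hypothetical short labels mapped to request-time columns.
    fields = {
        "account_age_days": "requester_account_age_in_days_at_request",
        "comments": "requester_number_of_comments_at_request",
        "posts": "requester_number_of_posts_at_request",
    }
    parts = []
    for label, column in fields.items():
        value = row.get(column)
        if value is None:
            continue  # skip missing values instead of writing "None"
        if isinstance(value, float):
            value = round(value)  # keep the prefix short
        parts.append(f"{label}: {value}")
    text = row.get(text_key) or ""
    if not parts:
        return text
    return ", ".join(parts) + sep + text


example = {
    "requester_account_age_in_days_at_request": 100.4,
    "requester_number_of_comments_at_request": 50,
    "requester_number_of_posts_at_request": 10,
    "request_text": "original request text",
}
print(row_to_text(example))
```

Only request-time columns are used here so the prefix does not carry retrieval-time information into the text model.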

### 2. Text Feature Engineering
For Reddit posts specifically, use this pipeline:

**Text Cleaning:**
- Concatenate request_title and request_text into single document
- Remove URLs, non-alphabetic characters, normalize whitespace
- Lowercase text (optional: preserve some capitalization for emphasis detection)
- Remove duplicate posts if any exist
- Custom stopword removal (including Reddit-specific terms)
- Lemmatization to normalize words
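
The first three cleaning steps can be sketched with the standard library alone. The function name `clean_request` is hypothetical, and stopword removal and lemmatization are left out because they need an external NLP library.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_request(title, text):
    """Join title and body, lowercase, drop URLs and non-letters, squeeze spaces."""
    doc = f"{title or ''} {text or ''}".lower()
    doc = URL_RE.sub(" ", doc)        # remove URLs before stripping punctuation
    doc = NON_ALPHA_RE.sub(" ", doc)  # keep only a-z and whitespace
    return WHITESPACE_RE.sub(" ", doc).strip()


print(clean_request("[REQUEST] Hungry in Ohio!",
                    "See http://imgur.com/abc for proof. Thanks!!"))
```

Because lowercasing and punctuation removal destroy the signals counted under the Reddit-specific features (all-caps words, exclamation marks), those counts would need to be computed on the raw text first.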

**Core Text Features:**
- TF-IDF vectors: unigrams, bigrams, trigrams
- Character n-grams (3-5 chars) to capture misspellings and OOV words
- Word embeddings: GloVe/FastText averaged vectors
- Sentiment scores from lexicons (VADER works well for social media)

**Reddit-Specific Features:**
- Length metrics: char count, word count, sentence count
- Special character counts: exclamation marks, question marks, all-caps words
- Gratitude indicators: count of "thanks", "please", "appreciate", "grateful"
- Urgency indicators: count of "urgent", "desperate", "need", "help"
- Story indicators: presence of narrative markers ("I", "my", "when", "because")
- Binary flags: contains_question, contains_gratitude, contains_urgency, contains_story
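
As a rough illustration, the length, punctuation, and lexicon features above could be computed on the raw (uncleaned) text like this. The function name `text_features` and the feature names are hypothetical; the word lists are the ones given in the bullets.

```python
import re

GRATITUDE = {"thanks", "please", "appreciate", "grateful"}
URGENCY = {"urgent", "desperate", "need", "help"}
STORY = {"i", "my", "when", "because"}


def text_features(raw_text):
    """Count simple surface and lexicon features in one request."""
    tokens = re.findall(r"[A-Za-z']+", raw_text)
    lower = [t.lower() for t in tokens]
    feats = {
        "char_count": len(raw_text),
        "word_count": len(tokens),
        "exclamation_count": raw_text.count("!"),
        "question_count": raw_text.count("?"),
        # Single letters such as "I" are not counted as all-caps words.
        "all_caps_count": sum(1 for t in tokens if len(t) > 1 and t.isupper()),
        "gratitude_count": sum(t in GRATITUDE for t in lower),
        "urgency_count": sum(t in URGENCY for t in lower),
        "story_count": sum(t in STORY for t in lower),
    }
    # Binary flags derived from the counts.
    feats["contains_question"] = int(feats["question_count"] > 0)
    feats["contains_gratitude"] = int(feats["gratitude_count"] > 0)
    feats["contains_urgency"] = int(feats["urgency_count"] > 0)
    feats["contains_story"] = int(feats["story_count"] > 0)
    return feats


sample = "I NEED help, please! My kids are hungry because I lost my job. Can anyone help?"
print(text_features(sample))
```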

**Advanced Features:**
- Readability scores: Flesch-Kincaid, SMOG index
- Named entity counts (locations, organizations, persons)
- Topic modeling: LDA/NMF features (5-10 topics)
- LIWC-style psycholinguistic features if available

### 3. Tabular Feature Engineering

**Key Features from EDA:**
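- request_number_of_comments_at_retrieval (highest correlation: 0.291, but recorded at retrieval time, so it may not be available when predicting)
- User flair encoding (one-hot or target encoding; shroom/PIF are perfectly predictive, so check for leakage before using)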
- Account age and activity metrics (log transform skewed distributions)
- Upvote/downvote ratios and differences
- Subreddit diversity features

**Feature Preprocessing:**
- Log transform skewed numeric features (account ages, comment counts)
- Target encoding for high-cardinality categorical features
- Handle missing values in requester_user_flair by encoding them as their own "none" category (2163 of 2878 rows are missing, so mode imputation would mislabel most rows)
- Create interaction features between engagement metrics

**Reddit-Specific Tabular Features:**
- Engagement ratios: comments/posts, upvotes/comments, etc.
- Account maturity: binary flags for new accounts (<30 days)
- Activity density: comments per day, posts per day
- Community engagement: RAOP-specific vs general Reddit activity ratio
- Vote controversy: (downvotes / (upvotes + downvotes))
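
A sketch of these ratios for a single row, assuming the row is a dict keyed by the training columns. The helper names and output feature names are hypothetical. It uses requester-level, request-time vote totals rather than the request's own retrieval-time votes, and recovers downvotes from the two columns the data provides (upvotes minus downvotes, upvotes plus downvotes).

```python
import math


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def tabular_features(row):
    comments = row["requester_number_of_comments_at_request"]
    posts = row["requester_number_of_posts_at_request"]
    age_days = row["requester_account_age_in_days_at_request"]
    raop_posts = row["requester_number_of_posts_on_raop_at_request"]
    minus = row["requester_upvotes_minus_downvotes_at_request"]
    plus = row["requester_upvotes_plus_downvotes_at_request"]
    # up - down = minus and up + down = plus, so down = (plus - minus) / 2
    downvotes = (plus - minus) / 2
    return {
        "comments_per_post": safe_div(comments, posts),
        "comments_per_day": safe_div(comments, age_days),
        "posts_per_day": safe_div(posts, age_days),
        "is_new_account": int(age_days < 30),
        "raop_post_share": safe_div(raop_posts, posts),
        "vote_controversy": safe_div(downvotes, plus),
        # log1p keeps zero counts at zero while compressing the long tail
        "log_comments": math.log1p(comments),
        "log_account_age": math.log1p(age_days),
    }


row = {
    "requester_number_of_comments_at_request": 50,
    "requester_number_of_posts_at_request": 10,
    "requester_account_age_in_days_at_request": 100.0,
    "requester_number_of_posts_on_raop_at_request": 2,
    "requester_upvotes_minus_downvotes_at_request": 60,
    "requester_upvotes_plus_downvotes_at_request": 100,
}
print(tabular_features(row))
```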

### 4. Handling Class Imbalance (24.8% positive)

**Data-Level Techniques:**
- SMOTE or random oversampling of minority class (fit on training folds only, never on validation data)
- Random undersampling of majority class
- Try data-level approaches first, then algorithm-level ones

**Algorithm-Level Techniques:**
- Class weights: Use balanced class weights in models
- Sample weights: Assign higher weights to minority class samples
- Focal loss for neural networks
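
For reference, the class weights implied by the observed target counts (2163 negative, 715 positive) can be worked out directly. The sketch below uses the common "balanced" heuristic, n_samples / (n_classes * class_count), which is the formula scikit-learn documents for `class_weight="balanced"`. The function name is hypothetical.

```python
def balanced_class_weights(counts):
    """weight_c = n_samples / (n_classes * count_c)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


counts = {False: 2163, True: 715}  # from the target distribution in the EDA
weights = balanced_class_weights(counts)

# Negative-to-positive ratio, the kind of value often passed as a
# positive-class weight in gradient boosting libraries.
pos_weight = counts[False] / counts[True]

print({k: round(v, 3) for k, v in weights.items()}, round(pos_weight, 3))
```

With these counts, each positive example weighs roughly three times as much as a negative one. Reweighting mainly affects calibration and thresholded metrics, so its effect on AUC should be checked empirically.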

**Validation Strategy:**
- Stratified K-Fold (k=5 or 10) to preserve class distribution
- Use AUC as primary validation metric (not accuracy)
- Monitor both precision and recall for minority class
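
To make the stratification step concrete, here is a small standard-library sketch that assigns row indices to folds class by class. It is an illustration of the idea, not a replacement for scikit-learn's `StratifiedKFold`, which is the usual choice in practice. The function name is hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Labels with the same class counts as the training data.
labels = [True] * 715 + [False] * 2163
folds = stratified_folds(labels, k=5)
for i, fold in enumerate(folds):
    positives = sum(labels[j] for j in fold)
    print(f"fold {i}: n={len(fold)}, positive rate={positives / len(fold):.3f}")
```

Each fold serves once as the validation set while the remaining folds form the training set.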

### 5. Model Selection

**Primary Models:**
- **LightGBM**: Fast, handles mixed data types, good with imbalanced data
- **CatBoost**: Excellent with categorical features, handles missing values
- **XGBoost**: Robust, well-established, good baseline

**Text Models:**
- **BERT/RoBERTa**: Fine-tune on text classification task
- **DistilBERT**: Faster, slightly less accurate alternative
- **DeBERTa**: Often among the strongest encoders on text classification benchmarks, at higher compute cost

**Ensembling Strategy:**
- Train 3-5 diverse models (different algorithms, different feature sets)
- Use stacking with logistic regression or LightGBM as meta-learner
- Weighted averaging based on validation performance
- Consider rank averaging for AUC optimization
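
Rank averaging can be sketched in a few lines. Since AUC depends only on the ordering of predictions, converting each model's scores to ranks before averaging puts models with different score scales on equal footing. The function names and the optional `weights` argument are hypothetical.

```python
def to_ranks(scores):
    """Return ranks scaled to [0, 1]; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    denom = max(len(scores) - 1, 1)
    return [r / denom for r in ranks]


def rank_average(model_scores, weights=None):
    """Weighted mean of per-model ranks for each example."""
    weights = weights or [1.0] * len(model_scores)
    total = sum(weights)
    ranked = [to_ranks(s) for s in model_scores]
    n = len(ranked[0])
    return [sum(w * r[i] for w, r in zip(weights, ranked)) / total for i in range(n)]


model_a = [0.10, 0.90, 0.50]   # e.g. probabilities from one model
model_b = [0.30, 0.20, 0.10]   # scores on a different scale
print(rank_average([model_a, model_b]))
```

Passing validation-based weights turns this into the weighted variant mentioned above.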

### 6. Optimization Techniques

**Hyperparameter Tuning:**
- Bayesian optimization (Optuna, Hyperopt) for gradient boosting models
- Learning rate scheduling for transformers
- Early stopping based on validation AUC

**Feature Selection:**
- Use SHAP values to identify important features
- Recursive feature elimination for tabular features
- Attention weights from transformers as a rough diagnostic of text feature importance (they are not a reliable attribution measure on their own)

### 7. Potential Pitfalls & Leakage

**Data Leakage Warning:**
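- requester_user_flair values "shroom" and "PIF" have a 100% success rate, and all 2163 rows with missing flair are failures, so flair separates the classes perfectly in the training data
- This pattern suggests flair is assigned after the outcome; confirm whether it is available at request time, and exclude it if not
- A derived "has_received_before" binary flag carries the same information as raw flair here, so it is only safe if it can be built from request-time data
- Check timing for every feature: verify it is available at request time, not after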

**Temporal Considerations:**
- Check if unix_timestamp_of_request creates time-based patterns
- Consider time-based validation splits if temporal leakage is suspected
- Ensure retrieval-time features don't contain future information
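
One simple guard, sketched below, is to build the feature list from column names and drop everything with the `_at_retrieval` suffix along with the target and the flair column. The function name and the `extra_leaky` parameter are hypothetical; other columns may need to be added to that list after inspecting the data.

```python
def request_time_columns(columns,
                         target="requester_received_pizza",
                         extra_leaky=("requester_user_flair",)):
    """Keep columns that are plausibly known when the request is posted."""
    keep = []
    for col in columns:
        if col == target or col in extra_leaky:
            continue
        if col.endswith("_at_retrieval"):
            continue  # measured when the data was scraped, not at request time
        keep.append(col)
    return keep


columns = [
    "request_title",
    "request_text",
    "requester_account_age_in_days_at_request",
    "requester_account_age_in_days_at_retrieval",
    "request_number_of_comments_at_retrieval",
    "number_of_upvotes_of_request_at_retrieval",
    "requester_user_flair",
    "requester_received_pizza",
]
print(request_time_columns(columns))
```

A name-based filter is only a first pass: a column without the suffix can still encode the outcome, so the timing of each remaining feature should still be checked.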

### 8. Implementation Priority

**Phase 1 - Baseline:**
1. Simple LightGBM on tabular features only
2. TF-IDF + LightGBM on text features only
3. Basic ensemble of above

**Phase 2 - Advanced:**
1. Fine-tune BERT on text features
2. Engineer platform-specific text features
3. Implement proper handling of user flair
4. Optimize class imbalance handling

**Phase 3 - Optimization:**
1. Hyperparameter tuning with Optuna
2. Advanced ensembling (stacking, rank averaging)
3. Feature selection and engineering refinement
4. Model blending with different architectures

## Key Success Factors

Based on Kaggle winning solutions for similar problems:
1. **Multimodal approach is essential** - don't rely on text or tabular alone
2. **Handle class imbalance properly** - use both data and algorithm-level techniques
3. **Leverage user flair carefully** - it's highly predictive but may cause leakage
4. **Ensemble diverse models** - combination of tree models and neural networks works best
5. **Engineer platform-specific features** - Reddit-specific patterns are important
6. **Validate properly** - stratified CV is crucial for imbalanced data
7. **Clean text properly** - Reddit posts need specific preprocessing for URLs, mentions, etc.

Training data shape: (2878, 32)
Columns: ['giver_username_if_known', 'number_of_downvotes_of_request_at_retrieval', 'number_of_upvotes_of_request_at_retrieval', 'post_was_edited', 'request_id', 'request_number_of_comments_at_retrieval', 'request_text', 'request_text_edit_aware', 'request_title', 'requester_account_age_in_days_at_request', 'requester_account_age_in_days_at_retrieval', 'requester_days_since_first_post_on_raop_at_request', 'requester_days_since_first_post_on_raop_at_retrieval', 'requester_number_of_comments_at_request', 'requester_number_of_comments_at_retrieval', 'requester_number_of_comments_in_raop_at_request', 'requester_number_of_comments_in_raop_at_retrieval', 'requester_number_of_posts_at_request', 'requester_number_of_posts_at_retrieval', 'requester_number_of_posts_on_raop_at_request', 'requester_number_of_posts_on_raop_at_retrieval', 'requester_number_of_subreddits_at_request', 'requester_received_pizza', 'requester_subreddits_at_request', 'requester_upvotes_minu

In [3]:
# Check target distribution
target_counts = train_df['requester_received_pizza'].value_counts()
print("Target distribution:")
print(target_counts)
print(f"Success rate: {target_counts[True] / len(train_df):.3f}")

Target distribution:
requester_received_pizza
False    2163
True      715
Name: count, dtype: int64
Success rate: 0.248


In [4]:
# Check data types and missing values
print("Data types:")
print(train_df.dtypes.value_counts())

print("\nMissing values:")
missing = train_df.isnull().sum()
print(missing[missing > 0])

# Check text length characteristics
print("\nText length statistics:")
train_df['request_text_length'] = train_df['request_text'].str.len()
train_df['request_title_length'] = train_df['request_title'].str.len()
print(train_df[['request_text_length', 'request_title_length']].describe())

Data types:
int64      19
object      8
float64     4
bool        1
Name: count, dtype: int64

Missing values:
requester_user_flair    2163
dtype: int64

Text length statistics:
       request_text_length  request_title_length
count          2878.000000           2878.000000
mean            402.521543             71.572967
std             362.393727             36.233487
min               0.000000              7.000000
25%             182.000000             46.000000
50%             308.000000             64.000000
75%             503.750000             90.000000
max            4460.000000            272.000000


In [5]:
# Check key meta-data features correlation with target
import matplotlib.pyplot as plt

# Numeric features that might be important
numeric_features = [
    'requester_account_age_in_days_at_request',
    'requester_number_of_comments_at_request',
    'requester_number_of_posts_at_request',
    'requester_number_of_subreddits_at_request',
    'requester_upvotes_minus_downvotes_at_request',
    'requester_upvotes_plus_downvotes_at_request',
    'request_number_of_comments_at_retrieval',
    'number_of_upvotes_of_request_at_retrieval',
    'number_of_downvotes_of_request_at_retrieval'
]

# Calculate correlation with target (convert bool to int for correlation)
target_numeric = train_df['requester_received_pizza'].astype(int)
correlations = {}
for feature in numeric_features:
    if feature in train_df.columns:
        corr = train_df[feature].corr(target_numeric)
        correlations[feature] = corr

# Sort by absolute correlation
sorted_correlations = sorted(correlations.items(), key=lambda x: abs(x[1]), reverse=True)
print("Feature correlations with target (sorted by absolute value):")
for feature, corr in sorted_correlations:
    print(f"{feature}: {corr:.3f}")

# Check user flair distribution
print("\nUser flair distribution:")
print(train_df['requester_user_flair'].value_counts(dropna=False))

# Check success rate by flair
# Note: groupby drops missing keys by default, so the no-flair rows do not appear in this table
print("\nSuccess rate by user flair:")
flair_success = train_df.groupby('requester_user_flair')['requester_received_pizza'].agg(['count', 'mean'])
print(flair_success)

Feature correlations with target (sorted by absolute value):
request_number_of_comments_at_retrieval: 0.291
number_of_upvotes_of_request_at_retrieval: 0.091
requester_number_of_subreddits_at_request: 0.047
requester_account_age_in_days_at_request: 0.043
requester_upvotes_minus_downvotes_at_request: 0.043
requester_number_of_posts_at_request: 0.038
requester_number_of_comments_at_request: 0.037
requester_upvotes_plus_downvotes_at_request: 0.033
number_of_downvotes_of_request_at_retrieval: 0.021

User flair distribution:
requester_user_flair
None      2163
shroom     677
PIF         38
Name: count, dtype: int64

Success rate by user flair:
                      count  mean
requester_user_flair             
PIF                      38   1.0
shroom                  677   1.0
