# Random Acts of Pizza - Baseline Model

This notebook creates a baseline model for predicting pizza request success.

## Approach
1. Load and explore the data
2. Extract numerical features (only those available at request time)
3. Extract text features from title and text using TF-IDF
4. Train a LightGBM model with cross-validation
5. Generate predictions for test set

In [1]:
import json
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')

# Load data
print("Loading training data...")
with open('/home/data/train.json', 'r') as f:
    train_data = json.load(f)

print("Loading test data...")
with open('/home/data/test.json', 'r') as f:
    test_data = json.load(f)

print(f"Train samples: {len(train_data)}")
print(f"Test samples: {len(test_data)}")

# Convert to DataFrame
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

print("\nTrain columns:", train_df.columns.tolist())
print("\nTarget distribution:")
print(train_df['requester_received_pizza'].value_counts())
print(f"Success rate: {train_df['requester_received_pizza'].mean():.3f}")

Loading training data...
Loading test data...
Train samples: 2878
Test samples: 1162

Train columns: ['giver_username_if_known', 'number_of_downvotes_of_request_at_retrieval', 'number_of_upvotes_of_request_at_retrieval', 'post_was_edited', 'request_id', 'request_number_of_comments_at_retrieval', 'request_text', 'request_text_edit_aware', 'request_title', 'requester_account_age_in_days_at_request', 'requester_account_age_in_days_at_retrieval', 'requester_days_since_first_post_on_raop_at_request', 'requester_days_since_first_post_on_raop_at_retrieval', 'requester_number_of_comments_at_request', 'requester_number_of_comments_at_retrieval', 'requester_number_of_comments_in_raop_at_request', 'requester_number_of_comments_in_raop_at_retrieval', 'requester_number_of_posts_at_request', 'requester_number_of_posts_at_retrieval', 'requester_number_of_posts_on_raop_at_request', 'requester_number_of_posts_on_raop_at_retrieval', 'requester_number_of_subreddits_at_request', 'requester_received_pizza'

## Feature Engineering

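Build the numerical feature matrix from request-time fields, build TF-IDF features from the combined title and body text, then stack both into one sparse matrix.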
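The `_at_request` and `_at_retrieval` suffixes in the column names make the request-time restriction easy to check. The following is a minimal sketch on a hypothetical toy frame: the values are made up, and only the naming pattern comes from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; only the column-name suffixes mirror the dataset
toy = pd.DataFrame({
    "requester_number_of_posts_at_request": [3, 0],
    "requester_number_of_posts_at_retrieval": [9, 4],
    "request_number_of_comments_at_retrieval": [5, 1],
})

# Keep fields known when the request was posted; set aside later snapshots
request_time_cols = [c for c in toy.columns if c.endswith("_at_request")]
retrieval_cols = [c for c in toy.columns if c.endswith("_at_retrieval")]

print("request-time:", request_time_cols)
print("retrieval-time:", retrieval_cols)
```

Filtering by suffix is only a convenience here; the cell below instead lists the request-time fields explicitly and keeps those present in both train and test.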

In [5]:
# Define numerical features to use - only features available at request time
numerical_features = [
    'requester_account_age_in_days_at_request',
    'requester_number_of_comments_at_request',
    'requester_number_of_posts_at_request',
    'requester_upvotes_minus_downvotes_at_request',
    'requester_upvotes_plus_downvotes_at_request',
    'requester_number_of_subreddits_at_request',
    'requester_number_of_comments_in_raop_at_request',
    'requester_number_of_posts_on_raop_at_request',
    'requester_days_since_first_post_on_raop_at_request'
]

# Check which features exist in both train and test
available_num_features = [f for f in numerical_features if f in train_df.columns and f in test_df.columns]
print(f"Using {len(available_num_features)} numerical features:")
for f in available_num_features:
    print(f"  - {f}")

# Extract numerical features
X_num_train = train_df[available_num_features].fillna(0)
X_num_test = test_df[available_num_features].fillna(0)

print(f"\nNumerical features shape: {X_num_train.shape}")

# Extract text features
print("\nExtracting text features...")
text_features = ['request_title', 'request_text_edit_aware']

# Combine title and text for TF-IDF
train_text = train_df[text_features[0]].fillna('') + ' ' + train_df[text_features[1]].fillna('')
test_text = test_df[text_features[0]].fillna('') + ' ' + test_df[text_features[1]].fillna('')

# TF-IDF vectorizer
vectorizer = TfidfVectorizer(
    max_features=5000,
    stop_words='english',
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95
)

print("Fitting TF-IDF on training text...")
X_text_train = vectorizer.fit_transform(train_text)
X_text_test = vectorizer.transform(test_text)

print(f"Text features shape: {X_text_train.shape}")

# Combine features; format='csr' returns a CSR matrix, which supports row indexing
from scipy.sparse import hstack
X_train = hstack([X_text_train, X_num_train.values], format='csr')
X_test = hstack([X_text_test, X_num_test.values], format='csr')

y_train = train_df['requester_received_pizza'].values

print(f"\nFinal training shape: {X_train.shape}")
print(f"Final test shape: {X_test.shape}")

Using 9 numerical features:
  - requester_account_age_in_days_at_request
  - requester_number_of_comments_at_request
  - requester_number_of_posts_at_request
  - requester_upvotes_minus_downvotes_at_request
  - requester_upvotes_plus_downvotes_at_request
  - requester_number_of_subreddits_at_request
  - requester_number_of_comments_in_raop_at_request
  - requester_number_of_posts_on_raop_at_request
  - requester_days_since_first_post_on_raop_at_request

Numerical features shape: (2878, 9)

Extracting text features...
Fitting TF-IDF on training text...


Text features shape: (2878, 5000)

Final training shape: (2878, 5009)
Final test shape: (1162, 5009)
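The final width of 5009 is the 5000 TF-IDF columns plus the 9 numerical columns. A small sketch of the same stacking step on hypothetical stand-in matrices (the sizes and values are illustrative, not the notebook's data) shows how the column counts add and why CSR format matters for the fold indexing used later:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical stand-ins: 4 rows, 6 sparse "text" columns, 2 dense numeric columns
text_block = csr_matrix(np.eye(4, 6))
num_block = np.arange(8, dtype=float).reshape(4, 2)

# format="csr" asks hstack for a CSR result directly
combined = hstack([text_block, num_block], format="csr")

print(combined.shape)           # column counts add up: 6 + 2
print(combined[[0, 2]].shape)   # CSR supports selecting rows by index array
```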


## Model Training with Cross-Validation
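The training cell uses `StratifiedKFold`, which keeps the class ratio roughly equal across folds. A minimal sketch on hypothetical toy labels (a 25% positive rate chosen for illustration, not the dataset's actual rate) shows the effect:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 25 positives, 75 negatives
y = np.array([1] * 25 + [0] * 75)
X = np.zeros((len(y), 1))  # feature values do not affect how the split is made

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Positive rate inside each validation fold
fold_rates = [float(y[val_idx].mean()) for _, val_idx in cv.split(X, y)]
print(fold_rates)
```

With an imbalanced target, this keeps every validation fold comparable, which matters when the per-fold metric is AUC.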

In [6]:
# Setup cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Model parameters
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': -1,
    'random_state': 42
}

cv_scores = []
predictions = np.zeros(len(test_df))

print("Training with 5-fold CV...")
for fold, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    print(f"\nFold {fold + 1}/5")
    
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]
    
    # Create LightGBM datasets (named dtrain/dvalid so the raw train_data list is not overwritten)
    dtrain = lgb.Dataset(X_tr, label=y_tr)
    dvalid = lgb.Dataset(X_val, label=y_val)
    
    # Train model with early stopping on the validation fold
    model = lgb.train(
        params,
        dtrain,
        num_boost_round=1000,
        valid_sets=[dvalid],
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)]
    )
    
    # Predict on validation (this fold also drives early stopping, so its AUC is slightly optimistic)
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    score = roc_auc_score(y_val, val_pred)
    cv_scores.append(score)
    
    print(f"Fold {fold + 1} AUC: {score:.4f}")
    
    # Predict on test and average over the folds
    predictions += model.predict(X_test, num_iteration=model.best_iteration) / cv.get_n_splits()

print(f"\nCV Scores: {cv_scores}")
print(f"Mean AUC: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")

Training with 5-fold CV...

Fold 1/5
Training until validation scores don't improve for 50 rounds


Early stopping, best iteration is:
[34]	valid_0's auc: 0.638189
Fold 1 AUC: 0.6382

Fold 2/5
Training until validation scores don't improve for 50 rounds


Early stopping, best iteration is:
[66]	valid_0's auc: 0.625139
Fold 2 AUC: 0.6251

Fold 3/5
Training until validation scores don't improve for 50 rounds


Early stopping, best iteration is:
[29]	valid_0's auc: 0.683797
Fold 3 AUC: 0.6838

Fold 4/5
Training until validation scores don't improve for 50 rounds


Early stopping, best iteration is:
[84]	valid_0's auc: 0.597805
Fold 4 AUC: 0.5978

Fold 5/5
Training until validation scores don't improve for 50 rounds


Early stopping, best iteration is:
[26]	valid_0's auc: 0.680701
Fold 5 AUC: 0.6807

CV Scores: [0.6381886012371001, 0.6251392948852533, 0.6837965729420694, 0.5978049728049728, 0.6807012432012433]
Mean AUC: 0.6451 ± 0.0330
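The summary line can be reproduced from the fold scores printed above. This sketch also shows that `np.std` defaults to the population standard deviation (`ddof=0`); the `ddof=1` variant is included only as an alternative and is not what the notebook reports.

```python
import numpy as np

# Fold AUCs copied from the output above
cv_scores = [0.6381886012371001, 0.6251392948852533, 0.6837965729420694,
             0.5978049728049728, 0.6807012432012433]

mean_auc = np.mean(cv_scores)
pop_std = np.std(cv_scores)             # ddof=0, matches the notebook's summary
sample_std = np.std(cv_scores, ddof=1)  # alternative: sample standard deviation

print(f"Mean AUC: {mean_auc:.4f} ± {pop_std:.4f} (ddof=0)")
print(f"Sample std (ddof=1): {sample_std:.4f}")
```

With only five folds the two estimates differ noticeably, so it is worth stating which one a reported spread refers to.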


## Generate Submission

In [7]:
# Create submission
submission = pd.DataFrame({
    'request_id': test_df['request_id'],
    'requester_received_pizza': predictions
})

print("Submission preview:")
print(submission.head())

# Save submission
submission.to_csv('/home/submission/submission.csv', index=False)
print("\nSubmission saved to /home/submission/submission.csv")
print(f"Shape: {submission.shape}")

Submission preview:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.229269
1   t3_roiuw                  0.164773
2   t3_mjnbq                  0.213965
3   t3_t8wd1                  0.314275
4  t3_1m4zxu                  0.150681

Submission saved to /home/submission/submission.csv
Shape: (1162, 2)
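Before uploading, a few cheap checks can catch a malformed file. The sketch below uses the first rows of the preview as a stand-in for the full submission; `check_submission` is a hypothetical helper, not part of the notebook, and the expected column names are taken from the submission frame built above.

```python
import pandas as pd

# First rows of the preview above, used as a stand-in for the full file
submission = pd.DataFrame({
    "request_id": ["t3_1aw5zf", "t3_roiuw", "t3_mjnbq"],
    "requester_received_pizza": [0.229269, 0.164773, 0.213965],
})

def check_submission(df: pd.DataFrame) -> None:
    """Hypothetical helper: basic sanity checks on a submission frame."""
    assert list(df.columns) == ["request_id", "requester_received_pizza"]
    assert df["request_id"].is_unique, "duplicate request_id"
    assert not df.isna().any().any(), "missing values"
    assert df["requester_received_pizza"].between(0, 1).all(), "not a probability"

check_submission(submission)
print("submission looks consistent")
```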
