# Bag of Subreddits Feature Engineering

This notebook implements the **posting pattern** feature extraction as described in point 3.c of the assignment.

The idea is to build one feature vector per user in which each column holds the **fraction of that user's posts made in a given subreddit**. We keep only the **Top 15 subreddits** to limit dimensionality.
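
One way to write the feature down, using notation introduced here purely for illustration (it is not taken from the assignment): let $n_{u,s}$ be the number of posts by author $u$ in subreddit $s$, and let $S_{15}$ be the set of Top 15 subreddits. For each $s \in S_{15}$ the feature is

$$
x_{u,s} = \frac{n_{u,s}}{\sum_{s'} n_{u,s'}}
$$

where the sum in the denominator runs over all subreddits, not only the Top 15. It follows that $\sum_{s \in S_{15}} x_{u,s} \le 1$, with equality only when the author posts exclusively in Top 15 subreddits.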

## Approach:
1. Load the supervised dataset with subreddit information
2. Identify the top 15 most popular subreddits
3. For each author, compute the fraction of posts in each of the top 15 subreddits
4. Create a "Bag of Subreddits" feature matrix
5. Save features for use in model training

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys

sys.path.append("../source")
from src import stratified_split

## Step 1: Load Data

In [None]:
# Load the original supervised dataset (which contains subreddit information)
df_supervised = pd.read_csv("../data/data_supervised.csv")

# Load target labels
target_df = pd.read_csv("./Data/target_supervised.csv")

print(f"Total comments: {len(df_supervised)}")
print(f"Total authors in target: {len(target_df)}")
print(f"\nDataset columns: {df_supervised.columns.tolist()}")

In [None]:
# Check subreddit distribution
print(f"Unique subreddits: {df_supervised['subreddit'].nunique()}")
print(f"\nTop 20 subreddits:")
print(df_supervised['subreddit'].value_counts().head(20))

## Step 2: Identify Top 15 Subreddits

In [None]:
# Get the top 15 most popular subreddits
TOP_N = 15

subreddit_counts = df_supervised['subreddit'].value_counts()
top_15_subreddits = subreddit_counts.head(TOP_N).index.tolist()

print(f"Top {TOP_N} Subreddits:")
for i, sub in enumerate(top_15_subreddits, 1):
    count = subreddit_counts[sub]
    percentage = (count / len(df_supervised)) * 100
    print(f"{i:2d}. {sub:30s} - {count:6d} posts ({percentage:.2f}%)")

In [None]:
# Visualize top 15 subreddits
plt.figure(figsize=(12, 6))
top_15_data = subreddit_counts.head(TOP_N)
sns.barplot(x=top_15_data.values, y=top_15_data.index, palette='viridis')
plt.xlabel('Number of Posts')
plt.ylabel('Subreddit')
plt.title(f'Top {TOP_N} Subreddits by Post Count')
plt.tight_layout()
plt.show()

## Step 3: Create Bag of Subreddits Feature Matrix

For each author, we compute the fraction of their posts that belong to each of the top 15 subreddits.
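
As a small illustration of the computation, the sketch below builds the same kind of matrix on made-up data with `pd.crosstab`. The names `toy`, `top_subs`, and `features` and the subreddit labels are hypothetical; this is a cross-check under those assumptions, not the function used in the rest of the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per post.
toy = pd.DataFrame({
    "author": ["alice", "alice", "alice", "alice", "bob", "bob"],
    "subreddit": ["funny", "funny", "pics", "chess", "pics", "pics"],
})

# Stand-in for the Top 15 list; "aww" never occurs in the toy data.
top_subs = ["funny", "pics", "aww"]

# Rows = authors, columns = subreddits, values = share of the author's posts.
# normalize="index" divides each row by the author's total over ALL subreddits.
fractions = pd.crosstab(toy["author"], toy["subreddit"], normalize="index")

# Keep only the chosen subreddits, in order; unseen ones become zero columns.
features = fractions.reindex(columns=top_subs, fill_value=0.0)
print(features)
```

Because "chess" is dropped, the row for `alice` sums to less than 1, which is the same effect the validation cell checks for on the real data.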

In [None]:
def create_bag_of_subreddits(df, top_subreddits):
    """
    Create a Bag of Subreddits feature matrix.
    
    For each author, computes the fraction of their posts in each of the top subreddits.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with 'author' and 'subreddit' columns
    top_subreddits : list
        List of top subreddit names to use as features
    
    Returns:
    --------
    pd.DataFrame
        DataFrame with authors as index and subreddit fractions as columns
    """
    # Total posts per author across ALL subreddits (the denominator)
    total_posts_per_author = df.groupby('author').size()

    # Count posts per author per subreddit, restricted to the top subreddits
    # so the intermediate table stays small
    top_df = df[df['subreddit'].isin(top_subreddits)]
    author_subreddit_counts = (
        top_df.groupby(['author', 'subreddit']).size().unstack(fill_value=0)
    )

    # One row per author and one column per top subreddit, in the given order;
    # authors or subreddits with no matching posts are filled with zeros
    author_subreddit_counts = author_subreddit_counts.reindex(
        index=total_posts_per_author.index,
        columns=top_subreddits,
        fill_value=0,
    )

    # Compute fractions (every author has at least one post, so no division by zero)
    author_subreddit_fractions = author_subreddit_counts.div(total_posts_per_author, axis=0)

    return author_subreddit_fractions

# Create the bag of subreddits features
bag_of_subreddits = create_bag_of_subreddits(df_supervised, top_15_subreddits)

print(f"Bag of Subreddits shape: {bag_of_subreddits.shape}")
print(f"\nFeature names (columns): {bag_of_subreddits.columns.tolist()}")
print(f"\nSample of the feature matrix:")
bag_of_subreddits.head(10)

In [None]:
# Validate: fractions should sum to <= 1 for each author
# (can be < 1 if author posts in subreddits outside top 15)
fraction_sums = bag_of_subreddits.sum(axis=1)

print(f"Fraction sum statistics:")
print(f"Min: {fraction_sums.min():.4f}")
print(f"Max: {fraction_sums.max():.4f}")
print(f"Mean: {fraction_sums.mean():.4f}")
print(f"Std: {fraction_sums.std():.4f}")

# Distribution of fraction sums
plt.figure(figsize=(10, 5))
plt.hist(fraction_sums, bins=50, edgecolor='black')
plt.xlabel('Sum of Top 15 Subreddit Fractions')
plt.ylabel('Number of Authors')
plt.title('Distribution of Top 15 Subreddit Activity per Author')
plt.axvline(x=fraction_sums.mean(), color='r', linestyle='--', label=f'Mean: {fraction_sums.mean():.2f}')
plt.legend()
plt.tight_layout()
plt.show()

## Step 4: Match with Target Labels and Prepare for Training

In [None]:
# Get authors that are in both the bag_of_subreddits and target_df
common_authors = bag_of_subreddits.index.intersection(target_df['author'])
print(f"Authors in bag_of_subreddits: {len(bag_of_subreddits)}")
print(f"Authors in target: {len(target_df)}")
print(f"Common authors: {len(common_authors)}")

# Filter to only include authors with labels
bag_of_subreddits_labeled = bag_of_subreddits.loc[common_authors].copy()

# Get corresponding labels
target_df_indexed = target_df.set_index('author')
labels = target_df_indexed.loc[common_authors, 'gender'].values

print(f"\nFinal feature matrix shape: {bag_of_subreddits_labeled.shape}")
print(f"Labels shape: {labels.shape}")
print(f"\nLabel distribution:")
print(pd.Series(labels).value_counts())

## Step 5: Train-Validation-Test Split
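
The `stratified_split` helper is imported from `src` and its implementation is not shown in this notebook. The sketch below is only a guess at what such a helper might look like, assuming scikit-learn is available. The function name `stratified_split_sketch`, the 70/15/15 ratios, and the seed are all assumptions, not the values used by the real helper.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_split_sketch(X, y, val_size=0.15, test_size=0.15, seed=42):
    """Hypothetical stand-in for src.stratified_split (ratios and seed assumed)."""
    # First carve off the test set, keeping class proportions via stratify.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Then split the remainder into train and validation. val_size is rescaled
    # so that it stays (approximately) a fraction of the full dataset.
    rel_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=rel_val, stratify=y_rest, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Toy check on synthetic data: 200 samples, 15 features, 30% positive class.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 15))
y_demo = np.array([0] * 140 + [1] * 60)

parts = stratified_split_sketch(X_demo, y_demo)
print([p.shape[0] for p in parts[:3]])
print([round(float(p.mean()), 3) for p in parts[3:]])
```

The second printed line shows the positive-class share in each split, which should stay close to the 30% of the full toy label vector.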

In [None]:
# Prepare X and y
X_subreddit = bag_of_subreddits_labeled.values
y = labels

# Use the stratified split function from src.py
X_train, X_val, X_test, y_train, y_val, y_test = stratified_split(X_subreddit, y)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

print(f"\nFeature dimensions: {X_train.shape[1]} (Top {TOP_N} subreddits)")

## Step 6: Visualize Subreddit Usage by Gender

In [None]:
# Compare subreddit usage between genders
bag_of_subreddits_with_labels = bag_of_subreddits_labeled.copy()
bag_of_subreddits_with_labels['gender'] = labels

# Calculate mean fraction for each gender
gender_means = bag_of_subreddits_with_labels.groupby('gender')[top_15_subreddits].mean()

# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Bar plot comparison
x = np.arange(len(top_15_subreddits))
width = 0.35

axes[0].bar(x - width/2, gender_means.loc[0], width, label='Male (0)', alpha=0.8)
axes[0].bar(x + width/2, gender_means.loc[1], width, label='Female (1)', alpha=0.8)
axes[0].set_xlabel('Subreddit')
axes[0].set_ylabel('Mean Fraction of Posts')
axes[0].set_title('Average Subreddit Activity by Gender')
axes[0].set_xticks(x)
axes[0].set_xticklabels(top_15_subreddits, rotation=45, ha='right')
axes[0].legend()

# Difference plot (Female - Male)
diff = gender_means.loc[1] - gender_means.loc[0]
colors = ['pink' if d > 0 else 'lightblue' for d in diff]
axes[1].barh(top_15_subreddits, diff, color=colors)
axes[1].axvline(x=0, color='black', linestyle='-', linewidth=0.5)
axes[1].set_xlabel('Difference (Female - Male)')
axes[1].set_ylabel('Subreddit')
axes[1].set_title('Gender Difference in Subreddit Activity\n(Pink = More Female, Blue = More Male)')

plt.tight_layout()
plt.show()

## Step 7: Save Features for Model Training

In [None]:
# Save the bag of subreddits features
np.save('./Data/bag_of_subreddits_train.npy', X_train)
np.save('./Data/bag_of_subreddits_val.npy', X_val)
np.save('./Data/bag_of_subreddits_test.npy', X_test)

# Save labels
np.save('./Data/y_subreddit_train.npy', y_train)
np.save('./Data/y_subreddit_val.npy', y_val)
np.save('./Data/y_subreddit_test.npy', y_test)

# Save the feature names for reference
np.save('./Data/top15_subreddit_names.npy', np.array(top_15_subreddits))

print("✅ Bag of Subreddits features saved successfully!")
print(f"\nSaved files:")
print(f"  - bag_of_subreddits_train.npy: {X_train.shape}")
print(f"  - bag_of_subreddits_val.npy: {X_val.shape}")
print(f"  - bag_of_subreddits_test.npy: {X_test.shape}")
print(f"  - y_subreddit_train.npy: {y_train.shape}")
print(f"  - y_subreddit_val.npy: {y_val.shape}")
print(f"  - y_subreddit_test.npy: {y_test.shape}")
print(f"  - top15_subreddit_names.npy: {len(top_15_subreddits)} names")

## Summary

In this notebook we have:

1. **Identified the Top 15 subreddits** by post count
2. **Created a Bag of Subreddits feature matrix** where each feature represents the fraction of an author's posts in a given subreddit
3. **Matched features with target labels** (gender)
4. **Split the data** into training, validation, and test sets using stratified sampling
5. **Visualized gender differences** in subreddit activity patterns
6. **Saved the features** for use in model training

These features can now be:
- Used alone for classification
- Combined with text features (BoW, TF-IDF) to potentially improve model performance
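
A minimal sketch of the second option, assuming the text features are a sparse matrix such as the output of a TF-IDF vectorizer. The arrays `X_text_demo` and `X_sub_demo` are random stand-ins with made-up shapes, not the saved features.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical shapes: 4 authors, 6 text features, 15 subreddit fractions.
rng = np.random.default_rng(0)
X_text_demo = csr_matrix(rng.random((4, 6)))  # stand-in for a sparse TF-IDF matrix
X_sub_demo = rng.random((4, 15))              # stand-in for Bag of Subreddits rows

# Column-wise concatenation: each author keeps one row, features sit side by side.
X_combined = hstack([X_text_demo, csr_matrix(X_sub_demo)]).tocsr()
print(X_combined.shape)
```

This only makes sense when both matrices contain the same authors in the same row order, so the text features would need to come from the same split as the subreddit features.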