# Hacker News Upvote Prediction: Enhanced EDA

This notebook explores the Hacker News dataset with both **post-level** and **user-level** features to build a more robust prediction model.

## Approach
1. Load pre-extracted data (items and users)
2. Explore post-level features (title, score, time, etc.)
3. Explore user-level features (karma, account age, etc.)
4. Analyze relationships between user attributes and post scores
5. Identify key features for our prediction model

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import os

# For better visualization
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 14})

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

print('Libraries imported successfully!')

## 1. Load Extracted Data

We'll load the data that was previously extracted from the Hacker News database and stored as parquet files.

In [None]:
# Define paths
ITEMS_PATH = '../data/raw/items_100k.parquet'
USERS_PATH = '../data/raw/users_100k.parquet'
MERGED_PATH = '../data/raw/items_users_merged_100k.parquet'

# Check if the files exist
files_exist = all(os.path.exists(path) for path in [ITEMS_PATH, USERS_PATH, MERGED_PATH])

if files_exist:
    # Load the pre-extracted data
    df_items = pd.read_parquet(ITEMS_PATH)
    df_users = pd.read_parquet(USERS_PATH)
    df_merged = pd.read_parquet(MERGED_PATH)
    print(f"Loaded items: {len(df_items)} rows")
    print(f"Loaded users: {len(df_users)} rows")
    print(f"Loaded merged dataset: {len(df_merged)} rows")
else:
    # Fail fast: every later cell depends on these dataframes
    raise FileNotFoundError(
        "Data files not found. Run the data extraction script first:\n"
        "    python ../src/utils/data_extraction.py"
    )

## 2. Explore Post-Level Features

Let's first look at the structure and basic statistics of the items dataset.

In [None]:
# Basic info about the items dataset
print("Items dataset columns:")
print(df_items.columns.tolist())

# Display a few rows
print("\nSample items:")
df_items.head()

In [None]:
# Summary statistics for items
df_items.describe(include='all')

In [None]:
# Check for missing values
missing_counts = df_items.isnull().sum()
missing_percent = (missing_counts / len(df_items)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing Percent': missing_percent
})

print("Missing values in items dataset:")
missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

### Analyze Score Distribution

Let's examine the distribution of scores (upvotes) which is our target variable.

In [None]:
# Score distribution
plt.figure(figsize=(12, 6))
plt.hist(df_items['score'], bins=50, color='skyblue', edgecolor='black')
plt.title('Distribution of Hacker News Scores')
plt.xlabel('Score (Upvotes)')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()

# Look at score distribution stats
print("Score statistics:")
print(df_items['score'].describe())

# Check for outliers
print("\nTop 10 highest scores:")
print(df_items['score'].nlargest(10))

# Plot log-transformed score
plt.figure(figsize=(12, 6))
plt.hist(np.log1p(df_items['score']), bins=50, color='lightgreen', edgecolor='black')
plt.title('Distribution of Log-Transformed Hacker News Scores')
plt.xlabel('Log(Score + 1)')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()

### Analyze Titles

Let's look at title characteristics and their relationship with scores.

In [None]:
# Add title length features (missing titles count as empty, not as the string "nan")
titles = df_items['title'].fillna('').astype(str)
df_items['title_length'] = titles.str.len()
df_items['title_word_count'] = titles.str.split().str.len()

# Plot title length distributions
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

axes[0].hist(df_items['title_length'], bins=50, color='salmon', edgecolor='black')
axes[0].set_title('Distribution of Title Length (Characters)')
axes[0].set_xlabel('Title Length (Characters)')
axes[0].set_ylabel('Frequency')
axes[0].grid(True, alpha=0.3)

axes[1].hist(df_items['title_word_count'], bins=30, color='lightblue', edgecolor='black')
axes[1].set_title('Distribution of Title Word Count')
axes[1].set_xlabel('Title Word Count')
axes[1].set_ylabel('Frequency')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Relationship between title length and score
plt.figure(figsize=(12, 6))
plt.scatter(df_items['title_word_count'], df_items['score'], alpha=0.3, color='blue')
plt.title('Score vs. Title Word Count')
plt.xlabel('Title Word Count')
plt.ylabel('Score')
plt.grid(True, alpha=0.3)
plt.show()

### Analyze Temporal Patterns

Let's look at how scores vary over time.

In [None]:
# Add time-based features
df_items['year'] = df_items['time'].dt.year
df_items['month'] = df_items['time'].dt.month
df_items['day_of_week'] = df_items['time'].dt.dayofweek
df_items['hour'] = df_items['time'].dt.hour

# Score by year
yearly_scores = df_items.groupby('year')['score'].agg(['mean', 'median', 'count'])
print("Scores by year:")
yearly_scores

In [None]:
# Plot yearly trends
plt.figure(figsize=(14, 6))
plt.plot(yearly_scores.index, yearly_scores['mean'], marker='o', linewidth=2, label='Mean Score')
plt.plot(yearly_scores.index, yearly_scores['median'], marker='s', linewidth=2, label='Median Score')
plt.title('Score Trends by Year')
plt.xlabel('Year')
plt.ylabel('Score')
plt.grid(True, alpha=0.3)
plt.legend()
plt.xticks(yearly_scores.index)
plt.show()

# Post counts by year (to see dataset distribution)
plt.figure(figsize=(14, 6))
plt.bar(yearly_scores.index, yearly_scores['count'], color='skyblue')
plt.title('Number of Posts by Year')
plt.xlabel('Year')
plt.ylabel('Number of Posts')
plt.grid(True, alpha=0.3, axis='y')
plt.xticks(yearly_scores.index)
plt.show()

In [None]:
# Score by hour of day
hourly_scores = df_items.groupby('hour')['score'].agg(['mean', 'median', 'count'])

plt.figure(figsize=(14, 6))
plt.plot(hourly_scores.index, hourly_scores['mean'], marker='o', linewidth=2, label='Mean Score')
plt.plot(hourly_scores.index, hourly_scores['median'], marker='s', linewidth=2, label='Median Score')
plt.title('Score by Hour of Day (UTC)')
plt.xlabel('Hour of Day (UTC)')
plt.ylabel('Score')
plt.grid(True, alpha=0.3)
plt.legend()
plt.xticks(range(0, 24, 2))
plt.show()

### Analyze URLs/Domains

Let's look at which domains tend to get more upvotes.

In [None]:
# Extract domain from URL
import re
from urllib.parse import urlparse

def extract_domain(url):
    if not isinstance(url, str):
        return None
    try:
        domain = urlparse(url).netloc
        # Remove www. prefix if present
        domain = re.sub(r'^www\.', '', domain)
        return domain if domain else None
    except ValueError:  # urlparse raises ValueError on malformed URLs
        return None

df_items['domain'] = df_items['url'].apply(extract_domain)

# Count of posts by domain
domain_counts = df_items['domain'].value_counts()
print(f"Number of unique domains: {len(domain_counts)}")
print("\nTop 20 domains by post count:")
print(domain_counts.head(20))

In [None]:
# Average score by domain (for domains with at least 20 posts)
min_posts = 20
domain_stats = df_items.groupby('domain')['score'].agg(['mean', 'median', 'count'])
domain_stats = domain_stats[domain_stats['count'] >= min_posts].sort_values('mean', ascending=False)

print(f"Top 20 domains by average score (minimum {min_posts} posts):")
domain_stats.head(20)

In [None]:
# Plot top domains by average score
top_domains = domain_stats.head(15).index
plt.figure(figsize=(14, 8))
plt.barh(top_domains, domain_stats.loc[top_domains, 'mean'], color='skyblue')
plt.title(f'Top 15 Domains by Average Score (min {min_posts} posts)')
plt.xlabel('Average Score')
plt.ylabel('Domain')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## 3. Explore User-Level Features

Now let's analyze the user dataset and see how user attributes relate to post scores.

In [None]:
# Basic info about the users dataset
print("Users dataset columns:")
print(df_users.columns.tolist())

# Display a few rows
print("\nSample users:")
df_users.head()

In [None]:
# Summary statistics for users
df_users.describe(include='all')

In [None]:
# Check for missing values
missing_counts = df_users.isnull().sum()
missing_percent = (missing_counts / len(df_users)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing Percent': missing_percent
})

print("Missing values in users dataset:")
missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

### Analyze User Karma Distribution

In [None]:
# Karma distribution
plt.figure(figsize=(12, 6))
plt.hist(df_users['karma'], bins=50, color='lightgreen', edgecolor='black')
plt.title('Distribution of User Karma')
plt.xlabel('Karma')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()

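# Log-transformed karma distribution (to handle skew)
# Karma can be negative, which log1p cannot handle, so clip at 0 and drop missing values first
plt.figure(figsize=(12, 6))
plt.hist(np.log1p(df_users['karma'].clip(lower=0).dropna()), bins=50, color='coral', edgecolor='black')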
plt.title('Distribution of Log-Transformed User Karma')
plt.xlabel('Log(Karma + 1)')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()

### Analyze User Account Age

In [None]:
# Calculate account age (in days from account creation to now)
reference_date = pd.Timestamp.now().normalize()  # Current date at midnight
df_users['account_age_days'] = (reference_date - df_users['created']).dt.days

# Calculate account age in years for better visualization
df_users['account_age_years'] = df_users['account_age_days'] / 365.25

# Account age distribution
plt.figure(figsize=(12, 6))
plt.hist(df_users['account_age_years'], bins=50, color='skyblue', edgecolor='black')
plt.title('Distribution of User Account Age')
plt.xlabel('Account Age (Years)')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()

## 4. Joint Analysis of Post and User Features

Now let's look at how user attributes relate to post scores using the merged dataset.

In [None]:
# First check if we have the merged dataset
print("Merged dataset columns:")
print(df_merged.columns.tolist())
print(f"\nMerged dataset shape: {df_merged.shape}")

# Let's see how many posts have user data
user_data_count = df_merged['karma'].notna().sum()
print(f"\nPosts with user data: {user_data_count} ({user_data_count/len(df_merged)*100:.1f}%)")

In [None]:
# In the merged dataset, add account age at post time
df_merged['post_account_age_days'] = (df_merged['time'] - df_merged['created']).dt.days

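# Keep only rows with a non-negative account age. Negative values are data errors,
# and rows without user data (NaN age) are dropped by this filter as well.
# .copy() avoids SettingWithCopyWarning on the column assignments below.
df_merged = df_merged[df_merged['post_account_age_days'] >= 0].copy()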

# Calculate account age in years for better visualization
df_merged['post_account_age_years'] = df_merged['post_account_age_days'] / 365.25

# Analyze relationship between karma and score
plt.figure(figsize=(12, 6))
plt.scatter(np.log1p(df_merged['karma']), 
            np.log1p(df_merged['score']), 
            alpha=0.3, 
            color='blue')
plt.title('Log Score vs. Log Karma')
plt.xlabel('Log(Karma + 1)')
plt.ylabel('Log(Score + 1)')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Analyze relationship between account age and score
plt.figure(figsize=(12, 6))
plt.scatter(df_merged['post_account_age_years'], 
            np.log1p(df_merged['score']), 
            alpha=0.3, 
            color='green')
plt.title('Log Score vs. Account Age at Post Time')
plt.xlabel('Account Age at Post Time (Years)')
plt.ylabel('Log(Score + 1)')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Group users by karma buckets and see average score
df_merged['karma_bucket'] = pd.qcut(df_merged['karma'], 10, duplicates='drop')
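# observed=True keeps only buckets that actually occur and avoids the pandas FutureWarning for categorical groupers
karma_score = df_merged.groupby('karma_bucket', observed=True)['score'].agg(['mean', 'median', 'count']).reset_index()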

plt.figure(figsize=(14, 6))
plt.plot(range(len(karma_score)), karma_score['mean'], marker='o', linewidth=2, label='Mean Score')
plt.plot(range(len(karma_score)), karma_score['median'], marker='s', linewidth=2, label='Median Score')
plt.title('Score by User Karma Bucket')
plt.xlabel('Karma Bucket (Low to High)')
plt.ylabel('Score')
plt.grid(True, alpha=0.3)
plt.legend()
plt.xticks(range(len(karma_score)), [str(bucket) for bucket in karma_score['karma_bucket']], rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Group users by account age and see average score
df_merged['age_bucket'] = pd.qcut(df_merged['post_account_age_years'], 10, duplicates='drop')
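# observed=True keeps only buckets that actually occur and avoids the pandas FutureWarning for categorical groupers
age_score = df_merged.groupby('age_bucket', observed=True)['score'].agg(['mean', 'median', 'count']).reset_index()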

plt.figure(figsize=(14, 6))
plt.plot(range(len(age_score)), age_score['mean'], marker='o', linewidth=2, label='Mean Score')
plt.plot(range(len(age_score)), age_score['median'], marker='s', linewidth=2, label='Median Score')
plt.title('Score by User Account Age Bucket')
plt.xlabel('Account Age Bucket (Young to Old)')
plt.ylabel('Score')
plt.grid(True, alpha=0.3)
plt.legend()
plt.xticks(range(len(age_score)), [str(bucket) for bucket in age_score['age_bucket']], rotation=45)
plt.tight_layout()
plt.show()

## 5. Calculate Feature Correlations

In [None]:
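# Calculate log transformations for highly skewed variables
df_merged['log_score'] = np.log1p(df_merged['score'])
# Karma can be negative, which log1p cannot handle, so clip at 0 first
df_merged['log_karma'] = np.log1p(df_merged['karma'].clip(lower=0))

# The title and time features above were added to df_items, not df_merged,
# so derive them here too (assumes df_merged carries 'title' and 'time')
if 'title' in df_merged.columns:
    merged_titles = df_merged['title'].fillna('').astype(str)
    df_merged['title_length'] = merged_titles.str.len()
    df_merged['title_word_count'] = merged_titles.str.split().str.len()
if 'time' in df_merged.columns:
    df_merged['year'] = df_merged['time'].dt.year
    df_merged['month'] = df_merged['time'].dt.month
    df_merged['day_of_week'] = df_merged['time'].dt.dayofweek
    df_merged['hour'] = df_merged['time'].dt.hour

# Select numeric features for correlation analysis
numeric_features = ['score', 'log_score', 'title_length', 'title_word_count', 
                    'karma', 'log_karma', 'post_account_age_days', 'year', 
                    'month', 'day_of_week', 'hour']

# Keep only the columns that exist in the dataframe
valid_features = [col for col in numeric_features if col in df_merged.columns]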

# Compute correlation matrix
corr_matrix = df_merged[valid_features].corr()

# Visualize correlation matrix
plt.figure(figsize=(14, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

## 6. Key Insights and Next Steps

### Summary of Findings

1. **Score Distribution**
   - The score distribution is heavily right-skewed
   - Log transformation makes it more amenable to modeling

2. **Post Features**
   - Raw title length shows no meaningful correlation with score
   - Posting time (hour of day, day of week) impacts scores
   - Certain domains consistently receive higher scores

3. **User Features**
   - Higher-karma users tend to get more upvotes, but the relationship is non-linear
   - Account age at post time shows [relationship with score]

### Feature Engineering Ideas

1. **Post-level features**
   - Title embeddings from Word2Vec
   - Title characteristics (length, capitalization, etc.)
   - Domain category or domain embedding
   - Time-based features (hour, day of week, month)

2. **User-level features**
   - Log-transformed user karma
   - Account age at post time
   - User's post frequency

### Next Steps

1. **Preprocess Data**
   - Apply log transformation to score and karma
   - Handle missing values appropriately
   - Encode categorical variables

2. **Word2Vec Pipeline**
   - Pre-train on Wikipedia corpus
   - Fine-tune on Hacker News titles
   - Extract title embeddings

3. **Feature Fusion**
   - Combine title embeddings, user features, and other attributes
   - Train regression model for score prediction
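
The feature fusion step could look roughly like the sketch below. It assumes the title embeddings are already available as a 2-D array with one row per post; the helper name `fuse_features`, the 8-dimensional random vectors, and the two numeric columns are placeholders for illustration, not part of the notebook.

```python
import numpy as np

def fuse_features(title_embeddings, numeric_features):
    """Concatenate title embeddings with standardized numeric columns, row by row."""
    title_embeddings = np.asarray(title_embeddings, dtype=float)
    numeric_features = np.asarray(numeric_features, dtype=float)
    if title_embeddings.shape[0] != numeric_features.shape[0]:
        raise ValueError("Both inputs need exactly one row per post")
    mean = numeric_features.mean(axis=0)
    std = numeric_features.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    scaled = (numeric_features - mean) / std
    return np.hstack([title_embeddings, scaled])

# Stand-in data: random vectors instead of real Word2Vec embeddings
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))
numeric = np.column_stack([
    np.log1p([10, 200, 3000, 5]),   # hypothetical log-karma values
    [0.5, 2.0, 7.5, 0.1],           # hypothetical account age in years
])
X = fuse_features(embeddings, numeric)
print(X.shape)
```

In a real pipeline the mean and standard deviation would be computed on the training split only and reused for validation and test data.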

# Feature Engineering: Justification and Strategy

A robust prediction model for Hacker News upvotes requires carefully selected features. Below, we justify each candidate feature based on EDA findings and domain knowledge.

## 1. Title Features
- **Title Text (Embeddings):** The content of the title is the most direct signal of a post's topic and appeal. Word2Vec or transformer embeddings can capture semantic meaning.
- **Title Length (Chars/Words):** While the distribution is fairly normal, outlier titles (very short/long) may affect engagement. Including length as a feature can help the model learn such effects.
- **Keyword Flags:** Phrases like 'Show HN', 'Ask HN', or tech terms (e.g., 'AI', 'Python') are associated with different score distributions. Including binary flags for these can improve predictions.
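
The keyword flags described above might be derived as in the following sketch. It assumes a `title` column as used in the EDA; the helper name `add_keyword_flags`, the flag column names, and the exact regular expressions are illustrative choices rather than anything defined in the notebook.

```python
import pandas as pd

# Hypothetical patterns: flag name -> (regex, case_sensitive)
KEYWORD_PATTERNS = {
    "is_show_hn": (r"^\s*Show HN\b", False),
    "is_ask_hn": (r"^\s*Ask HN\b", False),
    "mentions_ai": (r"\bAI\b", True),        # case-sensitive to skip lowercase "ai"
    "mentions_python": (r"\bPython\b", False),
}

def add_keyword_flags(df, title_col="title"):
    """Return a copy of df with one 0/1 column per keyword pattern."""
    out = df.copy()
    titles = out[title_col].fillna("").astype(str)
    for name, (pattern, case_sensitive) in KEYWORD_PATTERNS.items():
        out[name] = titles.str.contains(pattern, case=case_sensitive, regex=True).astype(int)
    return out

demo = pd.DataFrame({
    "title": ["Show HN: My Python tool", "Ask HN: Is AI overhyped?", None, "Rain in Spain"]
})
flags = add_keyword_flags(demo)
print(flags.drop(columns="title"))
```

The word boundaries (`\b`) keep short terms such as "AI" from matching inside longer words like "Rain".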

## 2. Author/User Features
- **User Karma (log-transformed):** Higher-karma users tend to get more upvotes, but the relationship is non-linear. Log transformation reduces the heavy right skew of the karma distribution.
- **Account Age at Post Time:** Older accounts may have more trust or visibility.
- **Author ID (Categorical):** For prolific users, author identity can be predictive. For rare users, grouping as 'Other' avoids overfitting.
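
One way to group rare authors as 'Other' is sketched below. The helper name `group_rare`, the `min_count` threshold, and the `author` column name are assumptions made for illustration; the notebook does not specify which column holds the author ID.

```python
import pandas as pd

def group_rare(series, min_count=20, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    # Frequent categories and missing values are kept; everything else is relabeled
    keep = series.isin(frequent) | series.isna()
    return series.where(keep, other_label)

authors = pd.Series(["alice"] * 3 + ["bob"] * 2 + ["carol"], name="author")
grouped = group_rare(authors, min_count=2)
print(grouped.value_counts())
```

To avoid leaking information, the set of frequent authors would normally be determined on the training split and then applied unchanged to held-out data.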

## 3. Engagement Features
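- **Descendants (Comment Count, log-transformed):** Strongly correlated with score (correlation ~0.87 after cleaning and log transform). Indicates engagement. Because comments accumulate after submission, this feature is only available for posts that have already been live for some time.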

## 4. URL/Domain Features
- **Is Self-Post:** Self-posts (no URL) have higher median scores than external links.
- **Domain (Categorical):** Certain domains (e.g., arstechnica.com, github.com) are associated with higher or lower typical scores. Grouping rare domains as 'Other' is recommended.
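
The self-post flag can be derived from a missing or empty URL, as in this minimal sketch. It assumes the `url` and `score` columns used in the EDA; the helper name `add_is_self_post` and the three demo rows are made up for illustration.

```python
import pandas as pd

def add_is_self_post(df, url_col="url"):
    """Return a copy of df with is_self_post = 1 where the URL is missing or empty."""
    out = df.copy()
    urls = out[url_col].fillna("").astype(str).str.strip()
    out["is_self_post"] = (urls == "").astype(int)
    return out

demo = pd.DataFrame({
    "url": ["https://example.com/a", None, ""],
    "score": [10, 30, 5],
})
demo = add_is_self_post(demo)
# Compare typical scores for link posts (0) and self-posts (1)
median_by_type = demo.groupby("is_self_post")["score"].median()
print(median_by_type)
```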

## 5. Temporal Features
- **Hour of Day / Day of Week:** Posts during weekends and off-peak hours have higher median scores.
- **Year/Month:** Can capture long-term trends or seasonality.
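
A possible encoding of these temporal features is sketched below. The weekend flag follows directly from the day of week; the sine/cosine encoding of the hour is an added suggestion, not something the EDA above evaluates. The helper name `add_temporal_features` is hypothetical, and the `time` column is assumed to hold UTC timestamps as in the EDA.

```python
import numpy as np
import pandas as pd

def add_temporal_features(df, time_col="time"):
    """Return a copy of df with hour, day-of-week, weekend, and cyclical hour columns."""
    out = df.copy()
    ts = pd.to_datetime(out[time_col])
    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek          # Monday=0 ... Sunday=6
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    # Cyclical encoding so that hour 23 and hour 0 end up close together
    angle = 2 * np.pi * out["hour"] / 24
    out["hour_sin"] = np.sin(angle)
    out["hour_cos"] = np.cos(angle)
    return out

demo = pd.DataFrame({"time": pd.to_datetime(["2024-01-06 23:00", "2024-01-08 06:00"])})
temporal = add_temporal_features(demo)
print(temporal[["hour", "day_of_week", "is_weekend"]])
```

Tree-based models such as LightGBM can usually work with the raw integer hour; the cyclical columns mainly help linear models and neural nets.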

## 6. Status Features
- **Dead Flag:** Posts marked as 'dead' are rare in cleaned data, but if present, should be included as a binary feature.

## Feature Selection Summary
- **Include:** Title embeddings, title length, keyword flags, log(karma), account age, author ID (for frequent posters), log(descendants), is_self_post, domain, hour, dayofweek.
- **Exclude:** Raw title/URL length (no correlation), post count per author (no predictive value).

## Next Steps
- **Preprocessing:** Apply log transforms, handle missing values, encode categoricals.
- **Modeling:** Use a regression model (e.g., LightGBM, Ridge, or neural net) with the above features.
- **Evaluation:** Focus on log(score+1) as the target due to skewness.
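
Evaluating on the log scale could be set up as in the sketch below. The metric shown, root mean squared error on `log(score + 1)`, is one reasonable choice and an assumption here; the helper name `rmsle`, the five demo scores, and the constant baseline are illustrative only.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared error between log1p-transformed true and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

scores = np.array([1, 2, 5, 40, 300])
log_target = np.log1p(scores)                 # what the model would be trained on

# Simplest possible baseline: predict the mean log score for every post
baseline_log_pred = np.full_like(log_target, log_target.mean())
baseline_pred = np.expm1(baseline_log_pred)   # map predictions back to raw scores
baseline_error = rmsle(scores, baseline_pred)
print(f"Baseline RMSLE: {baseline_error:.3f}")
```

Any trained model should beat this constant baseline on held-out data; `np.expm1` is the exact inverse of `np.log1p`, so predictions can be reported as raw scores.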

---
**References:**
- See the EDA above for empirical evidence supporting each feature.
- For more, see the extended EDA in `eda.ipynb`.

# Conclusion & Recommendations

- The Hacker News dataset exhibits strong non-linearities and skewed distributions.
- Feature engineering, especially log transforms and categorical encoding, is essential.
- Engagement (comments), timing, and content all matter.
- Next, proceed to model training with the engineered features.