# Welcome to the Hacker News EDA!

## Introduction

This dataset contains all stories and comments from Hacker News from its launch in 2006. Each story contains a story id, the author that made the post, when it was written, and the number of points the story received. Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".

You can find the dataset on [Kaggle](https://www.kaggle.com/datasets/hacker-news/hacker-news).

## Objectives of the EDA

In this Exploratory Data Analysis (EDA), we aim to investigate the factors that drive engagement and virality on Hacker News posts. Specifically, we will explore the following questions:

- What elements contribute to the virality of a post, as measured by points and number of comments?
- Are there certain types of influencers (authors) who consistently receive more points or comments?  
- Do specific topics or themes in post titles attract higher engagement?  
- How has the volume of posts evolved over time since the platform's launch?

Based on these questions, we formulate the following hypotheses:

- Posts with higher points tend to generate more comments, indicating greater virality and engagement.
- Some authors, acting as influencers, consistently receive more attention in terms of points and comments.  
- Certain topics or keywords in titles are more likely to attract higher engagement.  
- The number of posts has increased over time, reflecting the growth of the Hacker News community.

This analysis will help us understand the dynamics of content popularity and user interaction on Hacker News.

## 1. Global exploration of the dataset

Before diving into specific questions, it is important to get an overall understanding of the dataset’s structure and content.  
This includes examining the dataset’s size, data types, summary statistics, and initial observations on key variables such as points, number of comments, and authors.


In [None]:
# Importing libraries

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [None]:
# Read the csv 

df = pd.read_csv("/home/inesajd/hacker_news.csv")

In [None]:
df.head()

In [None]:
df.info()

We can see that the dataset contains 20,099 rows and 7 columns, including 3 numerical variables and 4 categorical (object) variables.

Each story includes a unique story ID, the author who made the post, the timestamp of the post, and the number of points it received.
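As a complement to `df.info()`, the structural overview described above can be sketched as a small helper that reports the type, missing values, and distinct values of each column. This is only an illustration: the `summarize_structure` helper and the miniature `sample` frame are hypothetical, the values are invented, and the `id` column name is an assumption (the other column names are the ones used later in this notebook).

```python
import pandas as pd

# Hypothetical miniature of the dataset; values are invented for illustration.
# The "id" column name is an assumption, the others appear in this notebook.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["Show HN: A tiny tool", "Ask HN: Career advice?", None, "Python tips"],
    "num_points": [12, 1, 250, 9],
    "num_comments": [3, 0, 140, 1],
    "author": ["alice", "bob", "carol", "alice"],
    "created_at": ["2016-01-05 10:00", "2016-01-06 11:30",
                   "2016-02-01 09:15", "2016-02-03 17:45"],
})


def summarize_structure(frame):
    """Return one row per column: dtype, missing values, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # NaN values are not counted
    })


overview = summarize_structure(sample)
print(sample.shape)
print(overview)
```

Calling `summarize_structure(df)` on the real DataFrame would give the same kind of table for all 7 columns.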

## Distribution of key variables

To better understand the engagement on Hacker News posts, we first examine the distributions of two main variables: **points** (the score a post received) and **number of comments**.  

These distributions will help us identify typical values, the presence of outliers, and the overall shape of the data, which are important for interpreting subsequent analyses.


In [None]:
# Create one figure that holds both histograms side by side.
# The figure and its subplots must live in the same cell to share a canvas.
plt.figure(figsize=(14, 6))

# Distribution of the points
plt.subplot(1, 2, 1)
sns.histplot(df['num_points'], bins=50, kde=True)
plt.title('Distribution of Points')
plt.xlabel('Points')
plt.ylabel('Frequency')

# Distribution of the number of comments
plt.subplot(1, 2, 2)
sns.histplot(df['num_comments'], bins=50, kde=True)
plt.title('Distribution of Number of Comments')
plt.xlabel('Number of Comments')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

We see that the majority of posts score between 0 and approximately 500 points. Let's check that.


In [None]:

# Lowest score in the dataset and the posts that received it
min_points = df["num_points"].min()
print(f"The minimum number of points is {min_points}")
count_min = df[df["num_points"] == min_points]
count_min

There are 2098 posts that have received only one point.

In [None]:
percentage_min = (len(count_min) / len(df)) * 100
round(percentage_min, 2)

10.44% of the posts in the DataFrame scored exactly 1 point.

In [None]:
# Highest score in the dataset and the post(s) that received it
max_point = df["num_points"].max()
print(f"The max number of points is {max_point}")
max_count = df[df["num_points"] == max_point]
max_count

Now let's check the summary statistics of `num_points`.

In [None]:
df["num_points"].describe()

- The majority of posts have a relatively low score (median of 9 points).  
- A few very popular posts with high scores pull the average up, explaining the large standard deviation.  
- The distribution is therefore **highly skewed**, with a long tail on the right.  
- This suggests that only a few posts are very successful, while most receive few votes.

Let's see that with a histogram and a boxplot:


In [None]:

plt.figure(figsize=(10,6))
sns.histplot(df["num_points"], kde=True)
plt.title("Distribution of num_points (without outliers)")
plt.xlabel('Number of Points')
plt.ylabel('Frequency')
plt.xlim(0, 65)
plt.show()

In [None]:
sns.boxplot(y=df["num_points"])
plt.title('Boxplot of Points')
plt.ylim(0,65)
plt.show()

### Interpretation of the distribution

The data reveals a highly skewed distribution of points, where most posts receive relatively low engagement. This pattern is typical in social news platforms, highlighting a few standout posts that capture significant attention amidst a large number of modestly rated ones.

Zooming in on the lower range of points provides a clearer view of where the bulk of posts lie, while the full distribution underscores the presence of exceptional outliers driving the average upwards.

Such insights are crucial for understanding user interaction dynamics and the nature of content popularity on Hacker News.
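The skew described above can also be quantified rather than only read off the plots. The sketch below is an illustration under stated assumptions: it uses synthetic log-normal scores instead of the real `num_points` column, and the `skew_summary` helper is a hypothetical name. It compares the mean with the median, computes the sample skewness, and measures how much of the total score is held by the top 1% of posts.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed scores standing in for df["num_points"] (assumption)
rng = np.random.default_rng(seed=0)
points = pd.Series(np.floor(rng.lognormal(mean=2.0, sigma=1.5, size=5_000)) + 1)


def skew_summary(values):
    """Summarize how strongly a distribution leans on its right tail."""
    top_cut = values.quantile(0.99)
    return {
        "mean": values.mean(),
        "median": values.median(),
        "skewness": values.skew(),  # positive means a long right tail
        "top_1pct_share": values[values >= top_cut].sum() / values.sum(),
    }


summary = skew_summary(points)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A mean well above the median together with a positive skewness is the numeric counterpart of the long right tail; `skew_summary(df["num_points"])` would apply the same check to the real data.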


In [None]:
df['num_comments'].describe()

In [None]:
sns.histplot(df["num_comments"], kde=True)
plt.title("Distribution of num_comments (without outliers)")
plt.xlabel('Number of comments')
plt.ylabel('Frequency')
plt.xlim(0, 30)
plt.show()

In [None]:
sns.boxplot(y=df["num_comments"])
plt.title('Boxplot of comments')
plt.ylim(0,30)
plt.show()

### Interpretation of the number of comments distribution

The distribution of comments is heavily skewed, with most posts receiving very few comments. 

Zooming into the range of 0 to 30 comments reveals that the bulk of posts cluster at the lower end, while a small number generate extensive discussion.

This pattern mirrors the points distribution but with an even stronger concentration towards low engagement, highlighting different dynamics between voting and commenting behaviors on Hacker News.
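One way to make "the bulk of posts cluster at the lower end" concrete is to compute the share of posts at or below a few cutoffs. The following is a minimal sketch with invented comment counts; `share_at_or_below` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd


def share_at_or_below(values, cutoffs):
    """Fraction of observations less than or equal to each cutoff."""
    return pd.Series({c: (values <= c).mean() for c in cutoffs})


# Invented comment counts for illustration only
comments = pd.Series([0, 0, 0, 1, 1, 2, 3, 5, 40, 300])
shares = share_at_or_below(comments, cutoffs=[0, 5, 30])
print(shares)
```

Running the same helper on `df["num_comments"]` and `df["num_points"]` with matching cutoffs would allow a direct comparison of how concentrated each variable is.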


In [None]:
points_threshold = df['num_points'].quantile(0.95)
comments_threshold = df['num_comments'].quantile(0.95)

print(f"Points threshold for viral posts (95th percentile): {points_threshold}")
print(f"Comments threshold for viral posts (95th percentile): {comments_threshold}")

viral_posts = df[(df['num_points'] >= points_threshold) | (df['num_comments'] >= comments_threshold)]
print(f"Number of viral posts: {len(viral_posts)}")


### Interpretation of viral posts thresholds

Using the 95th percentile as a threshold, we define viral posts as those receiving **at least 221 points** or **125 comments**.  

Out of the entire dataset, **1,382 posts** (about 6.9% of the 20,099 rows) meet this criterion, a small but significant subset of highly engaging content. The share exceeds 5% because a post qualifies by crossing either threshold.

These thresholds highlight the skewed nature of engagement on Hacker News, where a relatively small number of posts attract disproportionate attention and interaction.

Studying these viral posts more closely will help us understand what drives exceptional popularity and virality on the platform.
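The percentile rule above can be wrapped in a reusable function, which also makes it easy to compare the "either threshold" definition with a stricter "both thresholds" one. This is a sketch under assumptions: `flag_viral` and its `require_all` parameter are hypothetical names, and the sample data is invented.

```python
import pandas as pd


def flag_viral(frame, q=0.95, cols=("num_points", "num_comments"), require_all=False):
    """Flag rows at or above the q-th quantile of the given columns.

    With require_all=False a row qualifies if ANY column crosses its
    threshold; with require_all=True it must cross ALL of them.
    """
    cols = list(cols)
    thresholds = frame[cols].quantile(q)
    above = frame[cols].ge(thresholds, axis="columns")
    is_viral = above.all(axis=1) if require_all else above.any(axis=1)
    return thresholds, is_viral


# Invented data: points rise from 1 to 20 while comments fall from 20 to 1
sample = pd.DataFrame({
    "num_points": list(range(1, 21)),
    "num_comments": list(range(20, 0, -1)),
})
thresholds, viral_any = flag_viral(sample)
_, viral_all = flag_viral(sample, require_all=True)
print(thresholds)
print(viral_any.sum(), viral_all.sum())
```

In this toy frame no row is high on both measures, so the two definitions give different counts. On the real data the gap between them would show how many posts are highly voted but little discussed, or the reverse.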


In [None]:
corr = df[['num_points', 'num_comments']].corr()
print("Correlation matrix between points and comments:")
print(corr)

## Interpretation of the correlation between points and comments

The correlation coefficient between the number of points and the number of comments is approximately **0.78**, indicating a strong positive linear relationship.

This means that, generally, posts with higher points tend to receive more comments, reflecting greater overall engagement and interaction.

However, since the correlation is not perfect, there are exceptions where some posts may have many points but relatively few comments, or vice versa, highlighting different patterns of user behavior on Hacker News.
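Because both variables are heavily skewed, a handful of extreme posts can dominate the Pearson coefficient. A rank-based (Spearman) correlation is a common complement. The sketch below uses invented numbers and computes Spearman as the Pearson correlation of the ranks, which is its definition; it is an illustration, not a result from the dataset.

```python
import pandas as pd

# Invented sample with one extreme post, for illustration only
sample = pd.DataFrame({
    "num_points":   [1, 2, 3, 5, 8, 13, 400],
    "num_comments": [0, 1, 1, 2, 4, 6, 30],
})

# Pearson: linear association on the raw values
pearson = sample["num_points"].corr(sample["num_comments"])

# Spearman: Pearson correlation of the ranks (ties get average ranks)
spearman = sample["num_points"].rank().corr(sample["num_comments"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If the two coefficients differ noticeably on the real data, that would indicate the linear correlation is driven by a few outliers rather than by the bulk of posts.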


In [None]:
top_authors_viral = viral_posts['author'].value_counts().head(10)
print("Top 10 authors with viral posts:")
print(top_authors_viral)

## Top Authors with Viral Posts

The top 10 authors with the highest number of viral posts are led by *ingve*, who has 24 viral posts, followed by *adamnemecek* and *prostoalex* with 14 each.  

This suggests that certain authors consistently produce highly engaging content that attracts significant attention on Hacker News.

Understanding the characteristics and posting behavior of these influential authors could provide insights into what drives virality on the platform.
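Raw counts of viral posts favor authors who simply post a lot. A per-author viral rate puts the counts in context. The sketch below assumes a boolean `is_viral` column (a hypothetical name, which could be derived from the thresholds defined earlier) and uses invented authors.

```python
import pandas as pd

# Invented posts; "is_viral" is a hypothetical flag column
posts = pd.DataFrame({
    "author": ["ann", "ann", "ann", "ann", "bob", "bob", "cy"],
    "is_viral": [True, False, False, False, True, True, False],
})

per_author = (
    posts.groupby("author")["is_viral"]
    .agg(n_posts="count", n_viral="sum", viral_rate="mean")
    .sort_values("viral_rate", ascending=False)
)
print(per_author)
```

On the real data it would be sensible to keep only authors above some minimum number of posts before ranking by `viral_rate`, since a rate computed from one or two posts is not very informative.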


In [None]:
regex_pattern = r'\b[a-zA-Z]{4,}\b'  # words of at least 4 letters

def extract_words(titles, pattern):
    words_list = []
    for title in titles:
        if isinstance(title, str):
            matches = re.findall(pattern, title.lower())
            words_list.extend(matches)
    return words_list

In [None]:
viral_titles = viral_posts['title'].dropna()
words_list = extract_words(viral_titles, regex_pattern)

In [None]:
word_counts = {}

for w in words_list:
    word_counts[w] = word_counts.get(w, 0) + 1

sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)

print("Top 20 words in viral post titles:")
for word, count in sorted_word_counts[:20]:
    print(f"{word}: {count}")

## Most frequent words in viral post titles

The most common words appearing in the titles of viral posts include both **brand names and technology keywords**, such as *Google*, *Apple*, *Facebook*, *Microsoft*, *Linux*, and *code*.  

This indicates that posts related to major tech companies and software development tend to attract significant attention on Hacker News.

Alongside these, some common English words like *with*, *what*, and *your* appear frequently, which are less informative but typical in natural language.

Focusing further analysis on meaningful keywords and excluding common stop words will help better understand the topics driving virality.


In [None]:
# Common English words to ignore when counting title words.
# The regex above keeps only words of 4+ letters, so the shorter
# entries below are harmless but redundant.
stop_words = {
    'what', 'that', 'your', 'with', 'from', 'about', 'after', 'than', 'more', 'show',
    'this', 'have', 'will', 'they', 'some', 'just', 'like', 'when', 'were', 'them',
    'then', 'also', 'into', 'only', 'over', 'such', 'here', 'very', 'who',
    'which', 'their', 'been', 'because', 'these', 'does', 'had', 'how', 'its', 'our',
    'out', 'off', 'too', 'now', 'new', 'one', 'all', 'any', 'can', 'but', 'and', 'for',
    'are', 'you', 'was', 'not', 'his', 'her', 'she', 'him', 'did', 'has', 'the',
    'a', 'an', 'in', 'on', 'at', 'by', 'of', 'to', 'is', 'as', 'if', 'or'
}

In [None]:
word_counts = {}

for w in words_list:
    if w not in stop_words:
        word_counts[w] = word_counts.get(w, 0) + 1

sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)

print("Top 20 meaningful words in viral post titles:")
for word, count in sorted_word_counts[:20]:
    print(f"{word}: {count}")

## Key themes 

After filtering out common stop words, the most frequent meaningful words in viral post titles highlight major **technology brands** and **topics** such as *Google*, *Apple*, *Facebook*, *Microsoft*, *Linux*, *Amazon*, *Windows*, and *Python*.  

Additionally, thematic words like *code*, *open*, *learning*, *source*, *video*, *released*, *tech*, *data*, and *software* suggest strong interest in software development, open source, education, and technology releases.

This pattern underscores that viral content on Hacker News often revolves around prominent tech companies and current trends in programming and technology.
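The extract, filter, and count steps used above can be expressed more compactly with `collections.Counter` from the standard library. This is a minimal sketch: `count_title_words` is a hypothetical helper, the stop-word set is a small illustrative subset, and the titles are invented.

```python
import re
from collections import Counter

WORD_PATTERN = re.compile(r"\b[a-zA-Z]{4,}\b")  # words of at least 4 letters
STOP_WORDS = {"what", "that", "your", "with", "from"}  # illustrative subset


def count_title_words(titles, stop_words=frozenset()):
    """Count lowercase words across titles, skipping non-strings and stop words."""
    counts = Counter()
    for title in titles:
        if isinstance(title, str):
            words = WORD_PATTERN.findall(title.lower())
            counts.update(w for w in words if w not in stop_words)
    return counts


# Invented titles for illustration only
titles = ["What Google learned from Python", "Python with Linux", None, "Your Linux code"]
counts = count_title_words(titles, STOP_WORDS)
print(counts.most_common(3))
```

`Counter.most_common(n)` replaces the manual dictionary update and the `sorted(...)` call, and the same function can be reused with different stop-word sets.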


In [None]:
proper_nouns = {
    'google', 'apple', 'facebook', 'microsoft', 'linux', 'amazon', 'python', 'windows'
}

proper_noun_counts = {}

for w in words_list:
    if w in proper_nouns:
        proper_noun_counts[w] = proper_noun_counts.get(w, 0) + 1

sorted_proper_nouns = sorted(proper_noun_counts.items(), key=lambda x: x[1], reverse=True)

print("Top proper nouns in viral post titles:")
for word, count in sorted_proper_nouns:
    print(f"{word}: {count}")

In [None]:
filtered_word_counts = {}

for w in words_list:
    if w not in stop_words and w not in proper_nouns:
        filtered_word_counts[w] = filtered_word_counts.get(w, 0) + 1

sorted_filtered_words = sorted(filtered_word_counts.items(), key=lambda x: x[1], reverse=True)

print("Top thematic words excluding proper nouns:")
for word, count in sorted_filtered_words[:10]:
    print(f"{word}: {count}")

## Thematic words (excluding proper nouns)

Excluding proper nouns, the top thematic words focus on concepts related to software development, innovation, and learning, such as *open*, *code*, *first*, *years*, *learning*, *source*, *video*, *released*, *should*, and *tech*.  

These terms suggest that viral posts often center on open source projects, educational content, new releases, and broader technological trends, reflecting the community’s focus on knowledge sharing and advancement.

In [None]:
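# Parse timestamps; values that cannot be parsed become NaT
df['created_at'] = pd.to_datetime(df['created_at'], errors='coerce')

# Number of rows whose timestamp could not be parsed
print(f"Unparsed timestamps: {df['created_at'].isnull().sum()}")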

In [None]:
posts_per_year = df.groupby(df['created_at'].dt.year).size()
posts_per_year

In [None]:
posts_per_month = df.groupby(df['created_at'].dt.to_period('M')).size()


In [None]:
plt.figure(figsize=(14,6))
posts_per_month.plot(kind='line')
plt.title('Monthly Number of Posts on Hacker News (2015-2016)')
plt.xlabel('Month')
plt.ylabel('Number of Posts')
plt.xticks(rotation=45)
plt.show()

In [None]:
threshold = posts_per_month.mean() + posts_per_month.std()
peaks = posts_per_month[posts_per_month > threshold]

In [None]:
threshold_sensitive = posts_per_month.mean() + 0.5 * posts_per_month.std()
peaks_sensitive = posts_per_month[posts_per_month > threshold_sensitive]
print(f"Number of peaks detected with sensitive threshold: {len(peaks_sensitive)}")
print(peaks_sensitive)


In [None]:
for date_period, count in peaks_sensitive.items():
    # Select every post whose timestamp falls in this calendar month.
    # Comparing monthly periods keeps the whole month, including its last day.
    in_month = df['created_at'].dt.to_period('M') == date_period
    posts_peak = df[in_month]

    print(f"\nPeak month: {date_period.strftime('%Y-%m')} with {count} posts")

    # value_counts() ranks titles by how often they repeat, not by points
    top_titles = posts_peak['title'].value_counts().head(5)
    print("Top 5 titles during this peak:")
    print(top_titles.to_string())

## Analysis of peak months and dominant topics

We identified several peak months with significantly higher post volumes between late 2015 and early 2016. During these periods, the number of posts surged to around 1,600-1,700 per month, well above the average.

Examining the top titles from these peak months reveals a mix of timely tech news, security vulnerabilities, social discussions, and emerging technology topics:

- **October 2015** featured major security alerts such as the Adobe Flash vulnerability and lightweight web frameworks for low-resource systems.
- **November 2015** included discussions on corporate reorganizations, critiques of Apple’s design, and intriguing tech stories.
- **January 2016** highlighted debates on Apple’s competitive edge, Bitcoin’s future, and cybersecurity concerns like OpenSSH.
- **March 2016** brought attention to high-profile cyberattacks (DDoS), scientific reproducibility, Tesla firmware hacks, and socio-economic ideas like universal basic income.

These insights suggest that spikes in post volume coincide with significant tech events, security issues, and broader societal discussions, which may drive heightened engagement and virality on Hacker News.
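The rule used above to flag peak months (mean plus a multiple of the standard deviation) can be written as a small helper with the multiplier as a parameter. This is a sketch with invented monthly counts; `find_peaks` and `k` are hypothetical names.

```python
import pandas as pd


def find_peaks(counts, k=1.0):
    """Return the threshold mean + k * std and the entries above it."""
    threshold = counts.mean() + k * counts.std()
    return threshold, counts[counts > threshold]


# Invented monthly post counts for illustration only
monthly = pd.Series(
    [100, 110, 105, 300, 95, 100],
    index=pd.period_range("2015-09", periods=6, freq="M"),
)
threshold, peaks = find_peaks(monthly, k=1.0)
print(f"Threshold: {threshold:.1f}")
print(peaks)
```

With `k=1.0` this mirrors the first threshold in the notebook and with `k=0.5` the more sensitive one. Note that a large spike inflates both the mean and the standard deviation, so this rule can miss smaller peaks that sit next to a very large one.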


## Conclusion

Through this exploratory data analysis of the Hacker News dataset, we investigated the key factors driving engagement and virality on the platform.

We found that the distributions of both points and number of comments are highly skewed, with most posts receiving relatively low engagement while a small subset of viral posts attracts disproportionate attention. Using the 95th percentile thresholds, we identified 1,382 viral posts that stand out in terms of points or comments.

A strong positive correlation (~0.78) between points and comments confirms that posts garnering more votes tend to also generate more discussion, although exceptions highlight varied user interaction patterns.

Analysis of top authors with viral posts revealed that a few influential contributors consistently produce highly engaging content, supporting the hypothesis of identifiable influencers.

Examining viral post titles showed a dominance of major technology brands and thematic keywords related to software development, open source, and emerging tech trends, indicating that content centered on these topics drives significant engagement.

Finally, the temporal analysis of post volumes uncovered distinct peaks that coincide with major tech events, security vulnerabilities, and socio-economic discussions, suggesting that external factors influence user activity and virality on Hacker News.

Overall, these findings reveal patterns and relationships consistent with our initial hypotheses, providing valuable insights into the factors that influence popularity and engagement on Hacker News.  
