# Exploring Hacker News Posts
The source of the data can be found [here](https://www.kaggle.com/datasets/santiagobasulto/all-hacker-news-posts-stories-askshow-hn-polls?resource=download).

# Introduction

In [16]:
# Read in the data.
import pandas as pd

df=pd.read_csv('dataset/hn.csv') 

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3885799 entries, 0 to 3885798
Data columns (total 8 columns):
 #   Column              Dtype  
---  ------              -----  
 0   Object ID           int64  
 1   Title               object 
 2   Post Type           object 
 3   Author              object 
 4   Created At          object 
 5   URL                 object 
 6   Points              int64  
 7   Number of Comments  float64
dtypes: float64(1), int64(2), object(5)
memory usage: 237.2+ MB
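
The `float64` dtype of `Number of Comments` hints that the column may contain missing values, since pandas stores an integer column with gaps as floats, and `Created At` is loaded as plain strings. A minimal sketch of checking both, using a tiny in-memory sample in place of `dataset/hn.csv` (the sample rows and the name `sample` are invented for illustration):

```python
import io
import pandas as pd

# Tiny made-up sample standing in for dataset/hn.csv (values are illustrative only)
sample_csv = io.StringIO(
    "Object ID,Title,Post Type,Author,Created At,URL,Points,Number of Comments\n"
    "1,First post,story,alice,2020-01-01 10:00:00,http://example.com,5,2\n"
    "2,Second post,story,bob,2021-06-15 18:30:00,,1,\n"
)

# parse_dates converts 'Created At' to datetime64 while reading
sample = pd.read_csv(sample_csv, parse_dates=["Created At"])

# Count missing values per column
missing = sample.isna().sum()
print(missing)
print(sample.dtypes)
```

On the full dataset, the same `isna().sum()` call would show whether missing comment counts need to be handled before averaging.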


# Basic Analysis 


In [6]:
# Basic data exploration
print("Total posts:", df.shape[0])
print("Unique authors:", df['Author'].nunique())

# Calculate average points and comments
average_points = df['Points'].mean()
average_comments = df['Number of Comments'].mean()

print(f"Average points per post: {average_points:.2f}")
print(f"Average comments per post: {average_comments:.2f}")



Total posts: 3885799
Unique authors: 344515
Average points per post: 15.22
Average comments per post: 7.49


We can see above that the dataset contains the title of each post, its type, author, creation date, URL, points, and number of comments. Let's start by exploring how the average points and comments per post change from year to year.

# Average Points and Comments by Year

We'll extract the year from the "Created At" column and then calculate the average points and comments per post for each year.

In [12]:
# Convert 'Created At' to datetime so the year can be extracted
df['Created At'] = pd.to_datetime(df['Created At'])

df['Year'] = df['Created At'].dt.year
avg_points_comments_by_year = df.groupby('Year')[['Points', 'Number of Comments']].mean()

print("Average Points and Comments by Year:\n", avg_points_comments_by_year)

Average Points and Comments by Year:
          Points  Number of Comments
Year                               
2006   6.400000            1.511111
2007   5.225750            3.187263
2008   6.950024            3.821054
2009  10.577034            5.118574
2010  12.407591            5.624498
2011   9.592732            3.563401
2012   9.326492            3.886842
2013  10.836811            5.286462
2014  12.911303            5.934712
2015  14.014969            5.906572
2016  15.701899            7.023367
2017  16.901943            7.826129
2018  17.672678            8.205834
2019  18.230468            8.691374
2020  17.668120            9.093882
2021  21.621959           12.253274
2022  21.261100           12.903375
2023  20.850739           12.764002


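Average points and comments per post both trend broadly upward over the period, from 6.40 points and 1.51 comments in 2006 to 20.85 points and 12.76 comments in 2023. The main interruption is 2011 and 2012, when both averages fell below their 2010 levels; both had recovered by 2014.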
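
Yearly means can be pulled up by a handful of very popular posts, and early years may contain far fewer posts than later ones. A minimal sketch, on made-up rows rather than the real data, of reporting the median and the post count next to the mean (the names `toy` and `by_year` are illustrative):

```python
import pandas as pd

# Made-up rows for illustration; column names follow the dataset
toy = pd.DataFrame({
    "Created At": pd.to_datetime(
        ["2020-01-05", "2020-03-10", "2020-07-20", "2021-02-01", "2021-09-30"]
    ),
    "Points": [1, 2, 300, 4, 6],
})
toy["Year"] = toy["Created At"].dt.year

# Mean, median and post count per year in one pass
by_year = toy.groupby("Year")["Points"].agg(["mean", "median", "count"])
print(by_year)
```

In the toy data, a single 300-point post lifts the 2020 mean far above its median, which is the kind of gap worth checking for in the real yearly averages.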


# User Engagement by Hour

To measure user engagement by time of day, we calculate the average points received by posts for each hour of the day. Rather than using head() to display only the top hours, we list all 24 hours, which gives a complete view of how average points vary throughout the day:

In [33]:
# Ensure 'Created At' is in datetime format
df['Created At'] = pd.to_datetime(df['Created At'])

# Extract the hour from the 'Created At' column
df['Hour'] = df['Created At'].dt.hour

# Calculate the average points received by posts for each hour of the day without sorting by the average points
avg_points_by_hour = df.groupby('Hour')['Points'].mean()

# Print the average points for all hours from 0 to 23
print("Average Points by Hour (0-23):")
print(avg_points_by_hour)

best_hour = avg_points_by_hour.idxmax()
worst_hour = avg_points_by_hour.idxmin()
print(f"Best time to post: {best_hour}:00 with an average of {avg_points_by_hour.max():.2f} points.")
print(f"Worst time to post: {worst_hour}:00 with an average of {avg_points_by_hour.min():.2f} points.")


Average Points by Hour (0-23):
Hour
0     15.265126
1     15.327798
2     15.375918
3     15.136323
4     14.919373
5     14.709055
6     14.138929
7     14.194287
8     14.400982
9     14.848922
10    15.488114
11    16.629613
12    16.833736
13    16.088414
14    15.376913
15    15.427714
16    15.526614
17    15.328288
18    15.151001
19    14.971817
20    14.427646
21    14.539836
22    14.581370
23    15.021682
Name: Points, dtype: float64
Best time to post: 12:00 with an average of 16.83 points.
Worst time to post: 6:00 with an average of 14.14 points.


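Average points peak for posts created at 11:00 and 12:00 (16.63 and 16.83 points) and are lowest at 6:00 and 7:00 (14.14 and 14.19 points). The spread between the best and worst hour is modest, about 2.7 points.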
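
The hours reported above depend on the timezone of "Created At", which the notebook does not state. A minimal sketch of converting the hour to a local timezone, assuming the timestamps are in UTC (an assumption, not something confirmed by the dataset) and using made-up timestamps:

```python
import pandas as pd

# Made-up timestamps; treating them as UTC is an assumption, not a documented fact
toy = pd.DataFrame({"Created At": ["2020-01-01 12:00:00", "2020-01-01 20:30:00"]})

# Parse as timezone-aware UTC, then convert to US Pacific time
created_utc = pd.to_datetime(toy["Created At"], utc=True)
toy["Hour UTC"] = created_utc.dt.hour
toy["Hour Pacific"] = created_utc.dt.tz_convert("America/Los_Angeles").dt.hour
print(toy)
```

If the UTC assumption holds, grouping by the converted hour instead of the raw hour would express the best and worst posting times in local time.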

# User Retention

## Number of Posts per User

In [21]:
posts_per_user = df.groupby('Author').size()
avg_posts_per_user = posts_per_user.mean()
print(f"Average number of posts per user: {avg_posts_per_user:.2f}")

Average number of posts per user: 11.28


## Top Users and How Many Posts They Have

In [22]:
top_users = posts_per_user.sort_values(ascending=False).head(10)
print("\nTop users and how many posts they have:")
print(top_users)


Top users and how many posts they have:
Author
rbanffy       26918
Tomte         21141
tosh          16757
pseudolus     14590
jonbaer       13178
ingve         12976
bookofjoe     10542
mooreds        9860
evo_9          9135
prostoalex     8807
dtype: int64
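
An average of 11.28 posts per user alongside top users with tens of thousands of posts suggests a heavily skewed distribution, where the mean is not a typical value. A minimal sketch, on a made-up `Author` column, of comparing the mean with the median and the share of one-time posters (`toy_posts_per_user` and the other names are illustrative):

```python
import pandas as pd

# Made-up author column: one heavy poster and several one-time posters
toy = pd.DataFrame({"Author": ["a"] * 16 + ["b", "c", "d", "e"]})

toy_posts_per_user = toy.groupby("Author").size()

mean_posts = toy_posts_per_user.mean()
median_posts = toy_posts_per_user.median()
# Share of users with exactly one post
single_post_share = (toy_posts_per_user == 1).mean()

print(f"mean={mean_posts:.2f}, median={median_posts:.1f}, "
      f"single-post share={single_post_share:.2f}")
```

Running the same three statistics on the real `posts_per_user` series would show how far the typical user is from the 11.28 average.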


## Engagement

In this analysis, we explore the relationship between the engagement a user's first post receives (in terms of points) and their likelihood of posting again on Hacker News. The hypothesis is that initial recognition (measured by points) might influence a user's decision to engage further with the platform. We will:

1) Identify the first post of each user and examine how many points it received.
2) Categorize these first posts into eight brackets based on the points they received: 0-10, 11-25, 26-50, 51-100, 101-200, 201-500, 501-1000, 1000+
3) Determine the likelihood of users posting again for each bracket by comparing the number of users who made more than one post against those who didn't.

In [29]:
# Sort the DataFrame by Author and Created At to ensure we're getting the first post correctly
df_sorted = df.sort_values(by=['Author', 'Created At'])

# Find the first post of each user
first_posts = df_sorted.drop_duplicates(subset=['Author'], keep='first').copy()

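# Define the points brackets.
# Note: with right=False each bin includes its left edge and excludes its right edge,
# so the label '0-10' actually covers 0-9 points, '11-25' covers 10-24, and so on.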
bins = [0, 10, 25, 50, 100, 200, 500, 1000, float('inf')]
labels = ['0-10', '11-25', '26-50', '51-100', '101-200', '201-500', '501-1000', '1000+']
first_posts['Points Bracket'] = pd.cut(first_posts['Points'], bins=bins, labels=labels, right=False)

# Calculate the total number of posts for each user
user_post_counts = df.groupby('Author').size()

# Identify users who posted more than once
repeat_users = user_post_counts[user_post_counts > 1].index

# Mark first posts by whether the author posted again
first_posts['Posted Again'] = first_posts['Author'].isin(repeat_users)

# Calculate the likelihood of posting again by points bracket
likelihood_posting_again = first_posts.groupby('Points Bracket')['Posted Again'].mean()

print(likelihood_posting_again)


Points Bracket
0-10        0.497411
11-25       0.561714
26-50       0.618009
51-100      0.644573
101-200     0.645652
201-500     0.632810
501-1000    0.616588
1000+       0.578947
Name: Posted Again, dtype: float64
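
The bracket labels and the `right=False` argument used in the cell above do not line up exactly at the bin edges. A minimal sketch of how `pd.cut` assigns edge values under each setting, reusing the same bins and labels on a few hand-picked point values:

```python
import pandas as pd

points = pd.Series([0, 10, 25, 1000])
bins = [0, 10, 25, 50, 100, 200, 500, 1000, float("inf")]
labels = ["0-10", "11-25", "26-50", "51-100", "101-200", "201-500", "501-1000", "1000+"]

# Left-closed intervals, as in the cell above: [0, 10), [10, 25), ...
left_closed = pd.cut(points, bins=bins, labels=labels, right=False)

# Right-closed intervals with the lowest edge included: [0, 10], (10, 25], ...
right_closed = pd.cut(points, bins=bins, labels=labels, right=True, include_lowest=True)

print(pd.DataFrame({"Points": points, "right=False": left_closed, "right=True": right_closed}))
```

With `right=False`, a post with exactly 10 points lands in the bracket labelled "11-25"; with `right=True` and `include_lowest=True`, it lands in "0-10", which matches the labels. Switching settings would change the reported likelihoods slightly, so the output above reflects the left-closed version.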


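The likelihood of posting again rises with the points on a user's first post, from about 50% in the lowest bracket to about 65% in the 51-100 and 101-200 brackets. Beyond roughly 200 points it declines, falling to about 58% for first posts in the 1000+ bracket. This is a correlation, so it does not by itself show that early recognition causes users to post again.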
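
The highest brackets may contain few users, which would make their rates less reliable than those of the lower brackets. A minimal sketch, on made-up rows, of reporting the number of users next to the share who posted again (`toy_first_posts` and `summary` are illustrative names):

```python
import pandas as pd

# Made-up first posts: bracket label and whether the author posted again
toy_first_posts = pd.DataFrame({
    "Points Bracket": ["0-10", "0-10", "0-10", "0-10", "1000+", "1000+"],
    "Posted Again": [True, False, False, True, True, False],
})

# Share of users who posted again and number of users, per bracket
summary = toy_first_posts.groupby("Points Bracket")["Posted Again"].agg(
    share_posted_again="mean", users="size"
)
print(summary)
```

Applied to the real `first_posts` frame, the `users` column would indicate how much weight each bracket's rate can bear.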