# Introduction: Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit.

Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

Many types of posts are submitted to Hacker News, but this project focuses on posts whose titles contain Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, and Show HN posts to show the community a project, product, or just something interesting.

# Objectives and Methodologies

Our goal for this project is to compare the Ask HN, Show HN, and other types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

We will use the Python programming language to explore and analyze the data, and Jupyter Notebook to present our code and findings.

# Dataset Description

The dataset used for this analysis is hacker_news.csv, which contains a total of 20,100 posts.

Below are descriptions of the columns (variables) of the Hacker News dataset:

- **id**: the unique identifier from Hacker News for the post
- **title**: the title of the post
- **url**: the URL that the post links to, if the post has a URL
- **num_points**: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments**: the number of comments on the post
- **author**: the username of the person who submitted the post
- **created_at**: the date and time of the post's submission
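
As a quick illustration of these columns, the sketch below reads two rows with `csv.DictReader` so that each value can be looked up by column name. This is only an aside, not the approach the rest of the notebook takes (which indexes rows by position). The two rows are copied from the dataset previews shown later in this notebook, and `io.StringIO` stands in for the real file as an assumption to keep the example self-contained.

```python
import csv
import io

# Two sample rows copied from the dataset previews; io.StringIO stands in for hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55\n"
)

# DictReader uses the header row as keys, so columns are accessed by name.
posts = list(csv.DictReader(sample_csv))

for post in posts:
    # Every value is read as a string, so numeric columns need an explicit int().
    print(post["title"], "|", int(post["num_comments"]), "comments")
```

Note that a post without a URL, such as the Ask HN row, yields an empty string for `url`.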

## Section 1: Importing Libraries, Reading the Dataset into a List of Lists, and Initial Exploration

In [None]:
# Imports from the Python standard library and pandas used throughout the notebook
import pandas as pd
import re
import csv
import datetime as dt
from datetime import datetime

In [None]:
#open and read data and display the first 5 rows
df = pd.read_csv('hacker_news.csv')
df.head(5)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,10975351,How to Use Open Source and Shut the Fuck Up at...,http://hueniverse.com/2016/01/26/how-to-use-op...,39,10,josep2,1/26/2016 19:30
2,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
3,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
4,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12


In [None]:
#Check the amount of data or posts there are in the dataset
print("Shape of hacker_news dataset:", df.shape)
print('Sample size / Number of Rows:', df.shape[0])
print('Number of columns:', df.shape[1])

Shape of hacker_news dataset: (20100, 7)
Sample size / Number of Rows: 20100
Number of columns: 7


In [None]:
# Filter rows where the 'title' column contains 'ask hn' or 'show hn'
# case=False makes the match case-insensitive; na=False treats missing titles as non-matches
filtered_df = df[df['title'].str.contains('ask HN|show HN', case=False, na=False)]
filtered_df.head(5)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
7,12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55
13,10627194,Show HN: Wio Link ESP8266 Based Web of Things...,https://iot.seeed.cc,26,22,kfihihc,11/25/2015 14:03
17,10610020,Ask HN: Am I the only one outraged by Twitter ...,,28,29,tkfx,11/22/2015 13:43
22,11610310,Ask HN: Aby recent changes to CSS that broke m...,,1,1,polskibus,5/2/2016 10:14
30,12210105,Ask HN: Looking for Employee #3 How do I do it?,,1,3,sph130,8/2/2016 14:20


In [None]:
# Work on an explicit copy so that adding a column does not trigger pandas' SettingWithCopyWarning
filtered_df = filtered_df.copy()

# Create a new column to label posts as either "ask hn" or "show hn"
filtered_df['hn'] = filtered_df['title'].str.extract('(ask HN|show HN)', flags=re.IGNORECASE, expand=False).str.lower()

# Calculate the average number of comments for "Ask HN" and "Show HN" posts
avg_comments_per_type = filtered_df.groupby('hn')['num_comments'].mean()

print("Average comments for 'Ask HN' vs. 'Show HN':")
print(avg_comments_per_type)

Average comments for 'Ask HN' vs. 'Show HN':
hn
ask hn     14.031519
show hn    10.302146
Name: num_comments, dtype: float64


In [None]:
# Analyze the impact of post creation time on the number of comments (all posts)
# Convert 'created_at' to datetime format
df['created_at'] = pd.to_datetime(df['created_at'], format='%m/%d/%Y %H:%M')

# Extract the hour from the 'created_at' column
df['hour'] = df['created_at'].dt.hour

# Calculate the average number of comments by hour
avg_comments_by_hour = df.groupby('hour')['num_comments'].mean()

print("\nAverage comments by hour of post creation:")
print(avg_comments_by_hour)


Average comments by hour of post creation:
hour
0     25.076040
1     21.198980
2     26.015123
3     23.823770
4     21.891841
5     22.715232
6     19.771368
7     24.755906
8     24.328720
9     25.080460
10    24.516035
11    27.118110
12    27.465872
13    27.733212
14    29.144222
15    29.018639
16    23.699693
17    25.538913
18    25.188995
19    24.361572
20    22.277831
21    21.992233
22    21.353143
23    22.598972
Name: num_comments, dtype: float64


In [None]:
filtered_df.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at,hn
7,12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55,ask hn
13,10627194,Show HN: Wio Link ESP8266 Based Web of Things...,https://iot.seeed.cc,26,22,kfihihc,11/25/2015 14:03,show hn
17,10610020,Ask HN: Am I the only one outraged by Twitter ...,,28,29,tkfx,11/22/2015 13:43,ask hn
22,11610310,Ask HN: Aby recent changes to CSS that broke m...,,1,1,polskibus,5/2/2016 10:14,ask hn
30,12210105,Ask HN: Looking for Employee #3 How do I do it?,,1,3,sph130,8/2/2016 14:20,ask hn


In [None]:
# Initialize an empty list to store the data
list_of_lists = []

# Open and read the CSV file
with open('hacker_news.csv', mode='r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        list_of_lists.append(row)

# Print the first 5 rows of the list of lists
print(list_of_lists[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Section 2: Removing header from a list of lists

In [None]:
# Initialize an empty list to store the data
hn = []

# Read the CSV file into a list of lists
with open('hacker_news.csv', mode='r', newline='') as file:
    reader = csv.reader(file)
    hn = list(reader)  # Convert the reader object to a list of lists

# Step 1: Extract the first row as headers
headers = hn[0]

# Step 2: Remove the first row from hn
hn = hn[1:]

# Step 3: Display the headers
print("Headers:")
print(headers)

# Step 4: Display the first five rows of hn to verify that the header row was removed
print("\nFirst five rows of data:")
for row in hn[:5]:
    print(row)


Headers:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

First five rows of data:
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

## Section 3: Extracting "Ask HN" and "Show HN" Posts

In [None]:
# Initialize empty lists to store posts
ask_posts = []
show_posts = []
other_posts = []

# Loop through each row in hn
for row in hn:
    title = row[1].lower()  # Get the title and convert it to lowercase

    # Check if the title starts with 'ask hn' or 'show hn'
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check the number of posts in each list
print("Number of Ask HN posts:", len(ask_posts))
print("Number of Show HN posts:", len(show_posts))
print("Number of Other posts:", len(other_posts))


Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of Other posts: 17194
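
The loop above uses a prefix test (`startswith`), which is stricter than the substring test (`str.contains`) applied in the pandas filter in Section 1. That difference is a plausible reason the averages from the two approaches differ slightly. The sketch below shows the distinction; the titles are hypothetical and made up purely for illustration.

```python
# Hypothetical titles, made up for illustration only (not from the dataset).
titles = [
    "Ask HN: How do you learn a new codebase?",
    "Show HN: A tiny static site generator",
    "Why I stopped reading Ask HN threads",
]

prefixes = ("ask hn", "show hn")

# Prefix test: the title must begin with one of the labels.
prefix_matches = [t for t in titles if t.lower().startswith(prefixes)]

# Substring test: the label may appear anywhere in the title.
anywhere_matches = [t for t in titles if any(p in t.lower() for p in prefixes)]

print("startswith matches:", len(prefix_matches))
print("substring matches:", len(anywhere_matches))
```

The third title mentions Ask HN without being an Ask HN post, so only the substring test picks it up.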


## Section 4: Calculating the Average Number of Comments for "Ask HN" and "Show HN" Posts

In [None]:
# Step 1: Calculate total and average number of comments for Ask HN posts
total_ask_comments = 0

# Iterate over ask_posts to sum up the comments
for post in ask_posts:
    num_comments = int(post[4])  # Convert the number of comments to an integer
    total_ask_comments += num_comments

# Calculate the average number of comments on Ask HN posts
avg_ask_comments = total_ask_comments / len(ask_posts)

# Print the result
print("Average number of comments on Ask HN posts:", avg_ask_comments)


# Step 2: Calculate total and average number of comments for Show HN posts
total_show_comments = 0

# Iterate over show_posts to sum up the comments
for post in show_posts:
    num_comments = int(post[4])  # Convert the number of comments to an integer
    total_show_comments += num_comments

# Calculate the average number of comments on Show HN posts
avg_show_comments = total_show_comments / len(show_posts)

# Print the result
print("Average number of comments on Show HN posts:", avg_show_comments)

# Step 3: Compare the average number of comments
if avg_ask_comments > avg_show_comments:
    print("Ask HN posts receive more comments on average.")
else:
    print("Show HN posts receive more comments on average.")


Average number of comments on Ask HN posts: 14.038417431192661
Average number of comments on Show HN posts: 10.31669535283993
Ask HN posts receive more comments on average.


### Findings

Ask HN posts average about 14.04 comments per post, compared with about 10.32 for Show HN posts. Based on these averages, we can conclude that:

- Ask HN posts tend to generate more discussion and interaction than Show HN posts.
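
The two averaging loops in this section follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that, not the code used above; `average_column` and the sample rows are hypothetical names and data introduced for illustration, with rows in the same column order as hacker_news.csv.

```python
from statistics import mean


def average_column(posts, index):
    """Return the mean of the integer values stored at `index` in each row."""
    return mean(int(row[index]) for row in posts)


# Hypothetical rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: Example A", "", "2", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: Example B", "", "28", "29", "user_b", "11/22/2015 13:43"],
    ["3", "Ask HN: Example C", "", "1", "1", "user_c", "5/2/2016 10:14"],
]

# num_comments is at index 4, num_points at index 3.
print("Average comments:", average_column(sample_posts, 4))
print("Average points:", average_column(sample_posts, 3))
```

With such a helper, the same call works for `ask_posts`, `show_posts`, or `other_posts`, and for either the comments or the points column.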

## Section 5: Finding the Number of "Ask HN" Posts and Comments by Hour Created

In [None]:
# Step 1: Create the result_list with the 'created_at' and number of comments
result_list = []

for post in ask_posts:
    created_at = post[6]  # The 'created_at' column is at index 6
    num_comments = int(post[4])  # The 'num_comments' column is at index 4 and needs to be an integer
    result_list.append([created_at, num_comments])

# Step 2: Initialize dictionaries to count posts and sum comments by hour
counts_by_hour = {}
comments_by_hour = {}

# Step 3: Loop through result_list and process the hours
for row in result_list:
    # Parse the date and extract the hour
    created_at_str = row[0]
    num_comments = row[1]

    # Convert string to datetime object
    created_at_dt = dt.datetime.strptime(created_at_str, "%m/%d/%Y %H:%M")

    # Extract the hour from the datetime object
    hour = created_at_dt.strftime("%H")  # Get the hour in 'HH' format

    # Step 4: Update counts_by_hour and comments_by_hour dictionaries
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1  # Initialize post count for this hour
        comments_by_hour[hour] = num_comments  # Initialize comment count for this hour
    else:
        counts_by_hour[hour] += 1  # Increment post count
        comments_by_hour[hour] += num_comments  # Add the comment count

In [None]:
# Display the number of posts and comments grouped by hour
print("Hour | Posts | Comments")
print("------------------------")
for hour in sorted(counts_by_hour):
    print(f"{hour}:00 | {counts_by_hour[hour]} posts | {comments_by_hour[hour]} comments")


Hour | Posts | Comments
------------------------
00:00 | 55 posts | 447 comments
01:00 | 60 posts | 683 comments
02:00 | 58 posts | 1381 comments
03:00 | 54 posts | 421 comments
04:00 | 47 posts | 337 comments
05:00 | 46 posts | 464 comments
06:00 | 44 posts | 397 comments
07:00 | 34 posts | 267 comments
08:00 | 48 posts | 492 comments
09:00 | 45 posts | 251 comments
10:00 | 59 posts | 793 comments
11:00 | 58 posts | 641 comments
12:00 | 73 posts | 687 comments
13:00 | 85 posts | 1253 comments
14:00 | 107 posts | 1416 comments
15:00 | 116 posts | 4477 comments
16:00 | 108 posts | 1814 comments
17:00 | 100 posts | 1146 comments
18:00 | 109 posts | 1439 comments
19:00 | 110 posts | 1188 comments
20:00 | 80 posts | 1722 comments
21:00 | 109 posts | 1745 comments
22:00 | 71 posts | 479 comments
23:00 | 68 posts | 543 comments
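
The `if hour not in counts_by_hour` branch above can be avoided with `collections.defaultdict`, which starts every missing key at zero. The following is a minimal sketch of that alternative, assuming the same date format as the dataset; the sample pairs are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical [created_at, num_comments] pairs in the dataset's date format.
sample = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]

# defaultdict(int) returns 0 for a missing hour, so no explicit initialization is needed.
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

for hour in sorted(counts_by_hour):
    print(f"{hour}:00 | {counts_by_hour[hour]} posts | {comments_by_hour[hour]} comments")
```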


## Section 6: Calculating the Average Number of Comments for "Ask HN" Posts by Hour

In [None]:
# Initialize an empty list to store the averages
avg_by_hour = []

# Loop through counts_by_hour to calculate the average comments per post for each hour
for hour in counts_by_hour:
    avg_comments = comments_by_hour[hour] / counts_by_hour[hour]  # Calculate average
    avg_by_hour.append([hour, avg_comments])  # Append the hour and the average to the list

# Display the results sorted by hour
print("Hour | Avg Comments Per Post")
print("---------------------------")
for row in sorted(avg_by_hour):
    print(f"{row[0]}:00 | {row[1]:.2f} average comments")


Hour | Avg Comments Per Post
---------------------------
00:00 | 8.13 average comments
01:00 | 11.38 average comments
02:00 | 23.81 average comments
03:00 | 7.80 average comments
04:00 | 7.17 average comments
05:00 | 10.09 average comments
06:00 | 9.02 average comments
07:00 | 7.85 average comments
08:00 | 10.25 average comments
09:00 | 5.58 average comments
10:00 | 13.44 average comments
11:00 | 11.05 average comments
12:00 | 9.41 average comments
13:00 | 14.74 average comments
14:00 | 13.23 average comments
15:00 | 38.59 average comments
16:00 | 16.80 average comments
17:00 | 11.46 average comments
18:00 | 13.20 average comments
19:00 | 10.80 average comments
20:00 | 21.52 average comments
21:00 | 16.01 average comments
22:00 | 6.75 average comments
23:00 | 7.99 average comments


## Section 7: Sorting and Printing Values from a List of Lists

In [None]:
# Step 1: Create a list that swaps the columns in avg_by_hour
swap_avg_by_hour = []

# Step 2: Iterate over avg_by_hour and swap the elements in each row
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])  # Swapping: average first, hour second

# Step 3: Sort swap_avg_by_hour in descending order based on the average number of comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Step 4: Print the top 5 hours for Ask Posts Comments
print("Top 5 Hours for Ask Posts Comments")

for avg, hour in sorted_swap[:5]:  # Extract the top 5
    # Convert hour to a datetime object to format it as "15:00"
    formatted_hour = datetime.strptime(hour, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(formatted_hour, avg))


Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


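Among Ask HN posts, the 15:00 hour stands out with an average of 38.59 comments per post, followed by 02:00 (23.81), 20:00 (21.52), 16:00 (16.80), and 21:00 (16.01). Four of the top five hours fall between 15:00 and 21:00, so posting in that window, and at 15:00 in particular, gives the best chance of receiving comments in this dataset. These figures are averages, so a few heavily commented posts can inflate an hour's value, and the hours are as recorded in the created_at column.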
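
Swapping the columns before sorting works, but `sorted` can also order the rows by the average directly through its `key` argument. The sketch below shows that alternative on a handful of `[hour, average]` pairs taken from the Section 6 output; it is an illustration rather than the method used above.

```python
# A few [hour, average] pairs taken from the Section 6 output.
avg_by_hour = [
    ["00", 8.13],
    ["02", 23.81],
    ["15", 38.59],
    ["16", 16.80],
    ["20", 21.52],
    ["21", 16.01],
]

# Sort by the second element (the average) in descending order and keep the top three.
top_hours = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)[:3]

for hour, avg in top_hours:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

Because the rows keep their original `[hour, average]` order, no second list is needed.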

## Section 8: Additional Insights

This section addresses the following questions:
- Do Show HN or Ask HN posts receive more points on average?
- Are posts created at a certain time more likely to receive more points?
- How do these results compare with the average number of points other posts receive?

In [None]:
# Step 1: Calculate Average Points for Ask and Show HN Posts

# Initialize variables to store total points for ask and show posts
total_ask_points = 0
total_show_points = 0

# Calculate total points for ask posts
for post in ask_posts:
    total_ask_points += int(post[3])  # Points column is the 4th column (index 3)

# Calculate average points for ask posts
avg_ask_points = total_ask_points / len(ask_posts)

# Calculate total points for show posts
for post in show_posts:
    total_show_points += int(post[3])  # Points column is the 4th column (index 3)

# Calculate average points for show posts
avg_show_points = total_show_points / len(show_posts)

# Step 2: Determine if posts created at a certain time get more points

# Create an empty list to store the hour and points information for ask posts
ask_points_by_hour = []

# Iterate over ask_posts and extract the hour and points
for post in ask_posts:
    created_at = post[6]  # Created_at column is at index 6
    points = int(post[3])  # Points column is at index 3
    ask_points_by_hour.append([created_at, points])

# Initialize dictionaries for points by hour and posts by hour
ask_counts_by_hour = {}
ask_points_by_hour_dict = {}

# Loop through ask_points_by_hour and calculate points per hour
for post in ask_points_by_hour:
    hour = dt.datetime.strptime(post[0], "%m/%d/%Y %H:%M").strftime("%H")  # Extract hour
    if hour not in ask_counts_by_hour:
        ask_counts_by_hour[hour] = 1
        ask_points_by_hour_dict[hour] = post[1]
    else:
        ask_counts_by_hour[hour] += 1
        ask_points_by_hour_dict[hour] += post[1]

# Calculate the average points per hour
avg_ask_points_by_hour = []
for hour in ask_counts_by_hour:
    avg_points = ask_points_by_hour_dict[hour] / ask_counts_by_hour[hour]
    avg_ask_points_by_hour.append([hour, avg_points])

# Sort by hour
avg_ask_points_by_hour_sorted = sorted(avg_ask_points_by_hour)

# Step 3: Compare with other posts

# Create lists to store other posts and calculate their average points
total_other_points = 0
for post in other_posts:
    total_other_points += int(post[3])  # Points column is at index 3

# Calculate average points for other posts
avg_other_points = total_other_points / len(other_posts)

# Display results
print(f"Average points for Ask HN posts: {avg_ask_points:.2f}")
print(f"Average points for Show HN posts: {avg_show_points:.2f}")
print(f"Average points for Other posts: {avg_other_points:.2f}")

# Display the average points by hour for ask posts
print("\nAverage Points for Ask HN Posts by Hour")
for hour, avg_points in avg_ask_points_by_hour_sorted:
    print(f"{hour}:00: {avg_points:.2f} average points")


Average points for Ask HN posts: 15.06
Average points for Show HN posts: 27.56
Average points for Other posts: 55.41

Average Points for Ask HN Posts by Hour
00:00: 8.20 average points
01:00: 11.67 average points
02:00: 13.67 average points
03:00: 6.93 average points
04:00: 8.28 average points
05:00: 12.00 average points
06:00: 13.43 average points
07:00: 10.62 average points
08:00: 10.73 average points
09:00: 7.31 average points
10:00: 18.68 average points
11:00: 14.22 average points
12:00: 10.71 average points
13:00: 24.26 average points
14:00: 11.98 average points
15:00: 29.99 average points
16:00: 23.35 average points
17:00: 19.41 average points
18:00: 15.97 average points
19:00: 13.75 average points
20:00: 14.39 average points
21:00: 15.79 average points
22:00: 7.20 average points
23:00: 8.54 average points
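
The per-hour logic for points repeats the per-hour logic for comments from Sections 5 and 6. One way to avoid the repetition is a single function that takes the column index as a parameter. The sketch below is a hedged illustration of that idea; `average_by_hour`, its parameters, and the sample rows are hypothetical and not part of the original project.

```python
from collections import defaultdict
from datetime import datetime


def average_by_hour(posts, value_index, date_index=6):
    """Return {hour: mean of the integer column at value_index} for the given rows."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in posts:
        hour = datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").strftime("%H")
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(counts)}


# Hypothetical rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: Example A", "", "2", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: Example B", "", "1", "1", "user_b", "5/2/2016 9:14"],
    ["3", "Ask HN: Example C", "", "28", "29", "user_c", "11/22/2015 13:43"],
]

print("Average points by hour:", average_by_hour(sample_posts, 3))
print("Average comments by hour:", average_by_hour(sample_posts, 4))
```

Passing index 3 gives averages for points and index 4 gives averages for comments, so the same function could serve `ask_posts`, `show_posts`, and `other_posts`.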


## **Source**

Data and project practice: https://www.dataquest.io/projects/guided-project-a-exploring-hacker-news-posts-2/