Analyzing Hacker News to Find Posts With the Most Comments

We will analyze and compare two types of posts on Hacker News, a website where users vote and comment on technology-related stories. The two types of posts we will investigate start with 'Ask HN' and 'Show HN'.

Our objective is to compare these two types of posts and determine which one receives more comments on average, and whether posts created at a particular time of day receive more comments on average.

In [6]:
# Importing csv file and converting it into list of lists.

from csv import reader

# Using a context manager so the file is closed after reading
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

In [7]:
#Exploring the data set

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [8]:
# Assigning first row as header

headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [9]:
# Removing header row from the list

hn = hn[1:]
print(hn[:5])     #Checking if header has been removed from list

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [10]:
#Sorting the data set into three separate lists

ask_posts = []          #For posts starting with Ask HN
show_posts = []         #For posts starting with Show HN
other_posts = []        #For all other posts

for rows in hn:
    title = rows[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(rows)
    elif title.lower().startswith('show hn'):    
        show_posts.append(rows)
    else:
        other_posts.append(rows)

#Checking the number of posts under each category

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
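
The prefix check used above can also be written as a small reusable function. The sketch below is only an illustration: the function name `classify_title` and the sample titles are hypothetical and are not part of the original notebook or the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercasing makes the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical sample titles, made up for illustration only
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]

labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Keeping the rule in one function means the same check can be reused if further categories are added later.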


In [11]:
# Finding the total and average number of comments for ask posts and show posts

total_ask_comments = 0
total_show_comments = 0

for i in ask_posts:
    comments = int(i[4])
    total_ask_comments += comments

# Computing the average once, after all comments have been summed
avg_ask_comments = total_ask_comments / len(ask_posts)

print(total_ask_comments)
print(avg_ask_comments)

for r in show_posts:
    comments = int(r[4])
    total_show_comments += comments

# Computing the average once, after all comments have been summed
avg_show_comments = total_show_comments / len(show_posts)

print(total_show_comments)
print(avg_show_comments)

24483
14.038417431192661
11988
10.31669535283993


Comparing the totals and averages, we see that Ask HN posts receive more comments per post (about 14.04) than Show HN posts (about 10.32). We will therefore analyze only Ask HN posts further, to find the hour of posting that is associated with the most comments.
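
The average used in this comparison is the total number of comments divided by the number of posts. A minimal sketch of that calculation follows. The helper name `average_comments` and the two sample rows are hypothetical; the rows only mimic the column layout of the data set, with `num_comments` at index 4.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoids dividing by zero when a category has no posts
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example B', '', '3', '20', 'user_b', '8/4/2016 15:10'],
]

sample_avg = average_comments(sample_posts)
print(sample_avg)
```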

In [12]:
#Importing datetime module 

import datetime as dt

In [13]:
#Creating a list with dates and times of posts created and their respective number of comments

result_list = []

for items in ask_posts:
    result_list.append([items[6], int(items[4])])

#Creating dictionaries for number of posts and comments per hour

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    cmmt_date = row[0]
    cmmt_number = row[1]
    cmmt_date = dt.datetime.strptime(cmmt_date, "%m/%d/%Y %H:%M")
    cmmt_time = dt.datetime.strftime(cmmt_date, "%H")
    
    if cmmt_time not in counts_by_hour:
        counts_by_hour[cmmt_time] = 1
        comments_by_hour[cmmt_time] = cmmt_number
    else:
        counts_by_hour[cmmt_time] += 1
        comments_by_hour[cmmt_time] += cmmt_number
        
print(counts_by_hour)  
print(comments_by_hour)
print(len(comments_by_hour))

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
24
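
As an alternative to the explicit `if ... not in ...` check, the same per-hour tallies can be built with `collections.defaultdict`, which starts missing keys at zero. This is a sketch of that variation, not the notebook's own method; the function name `comments_per_hour` is hypothetical, and the sample rows reuse timestamps in the data set's format with illustrative comment counts.

```python
from collections import defaultdict
import datetime as dt


def comments_per_hour(rows):
    """Tally the number of posts and the total comments for each hour of creation."""
    counts = defaultdict(int)    # hour -> number of posts
    comments = defaultdict(int)  # hour -> total number of comments
    for created_at, num_comments in rows:
        parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# Illustrative [created_at, num_comments] pairs
sample_rows = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 8],
]

sample_counts, sample_comments = comments_per_hour(sample_rows)
print(sample_counts)
print(sample_comments)
```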


In [14]:
#Finding the average number of comments per post for each hour

avg_by_hour = []

for keys in comments_by_hour:
    avg = comments_by_hour[keys] / counts_by_hour[keys]
    avg_by_hour.append([keys, avg])

print(avg_by_hour)    

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


In [15]:
#Creating a list with average as first element and hour as second element

swap_avg_by_hour = []

for rows in avg_by_hour:
    swap_avg_by_hour.append([rows[1], rows[0]])  

#Sorting the list in descending order to find top five hours

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

#Formatting hour and average

print("Top 5 Hours for Ask HN posts")
for row in sorted_swap[:5]:
    time = row[1]
    avg_comm = row[0]
    time = dt.datetime.strptime(time, "%H").strftime("%H:%M")
    print("{} {:.2f} average comments per post".format(time, avg_comm))

Top 5 Hours for Ask HN posts
15:00 38.59 average comments per post
02:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.80 average comments per post
21:00 16.01 average comments per post
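
The swap step is one way to sort by average. Another option, sketched here, is to pass a `key` function to `sorted` so the `[hour, average]` pairs can be ordered without building a second list. The sample values are a small subset of the averages printed earlier, rounded to two decimals.

```python
# Subset of [hour, average] pairs, rounded from the output above
sample_avg_by_hour = [
    ['09', 5.58],
    ['15', 38.59],
    ['02', 23.81],
    ['20', 21.52],
]

# Sorting on the second element (the average), highest first
top_two = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:2]

for hour, avg in top_two:
    print("{}:00 {:.2f} average comments per post".format(hour, avg))
```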


We see that Ask HN posts created during the 3 p.m. hour (EST) receive the most comments on average, at 38.59 comments per post.

Based on our analysis, we conclude that Ask HN posts receive the most comments on average when posted at 3 p.m. EST.