Exploring Hacker News Posts: Popularity of Show vs. Ask Posts

**Introduction:** Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result. 

We are interested in posts whose titles begin with either "Ask HN" or "Show HN". Users submit Ask HN posts to ask the Hacker News community a specific question. Users submit Show HN posts to show the community a project, product, or other interesting artifact.

In this data analysis, we will compare these two types of posts to determine the following:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

Please note the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
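The reduction described above could be reproduced along the following lines. This is only a sketch under stated assumptions: the function name `reduce_rows`, the sample size, and the seed are hypothetical, and the rows are made-up stand-ins rather than records from the original file. It is not the procedure that was actually used to prepare `hacker_news.csv`.

```python
import random

def reduce_rows(rows, sample_size, seed=1):
    """Drop rows with zero comments, then draw a random sample.

    Assumes each row is a list whose index 4 holds num_comments as a string,
    matching the column order of hacker_news.csv.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(commented, min(sample_size, len(commented)))

# Made-up rows for illustration only (not taken from the real data set).
rows = [
    ['1', 'Ask HN: Example question?', '', '5', '3', 'user_a', '1/1/2016 10:00'],
    ['2', 'Example link', 'http://example.com', '1', '0', 'user_b', '1/1/2016 11:00'],
    ['3', 'Show HN: Example project', 'http://example.org', '9', '7', 'user_c', '1/2/2016 12:00'],
]
sample = reduce_rows(rows, sample_size=2)
print(len(sample))
```

Because rows without comments are removed before sampling, every average computed later describes only posts that received at least one comment.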

In [1]:
# Opening the Hacker News data set
import csv

# The with block closes the file once the rows have been read.
with open('hacker_news.csv') as file:
    hn = list(csv.reader(file))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


**Removing Headers from the Printed List**

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Each row in the data set above contains the following columns:
    1. Post ID (`id`)
    2. Title of the post (`title`)
    3. Post URL (`url`)
    4. Number of points on the post (`num_points`)
    5. Number of comments on the post (`num_comments`)
    6. Author of the post (`author`)
    7. Date and time the post was created (`created_at`)

Next, we will explore the number of comments for each type of post.

**Extracting ASK Hacker News and SHOW Hacker News Posts:**
Next, we will identify posts whose titles begin with either "Ask HN" or "Show HN" (ignoring case) and separate the data for these types of posts into different lists.

In [5]:
#Separating post data into different lists by Title
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


**Calculating the Avg. Number of Comments on ASK Hacker News and SHOW Hacker News Posts**

In [6]:
#Calculating the average number of comments 'Ask HN' posts receive.
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [7]:
#Calculating the average number of comments 'Show HN' posts receive.
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


The analysis above shows that Ask HN posts receive approximately 14 comments on average, while Show HN posts receive approximately 10.

Because Ask HN posts receive more comments on average, the remaining analysis will focus on these posts.
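The two averaging cells above repeat the same loop, so the calculation could be wrapped in one helper. The following is a minimal sketch, not part of the original notebook: the function name `average_comments` and the example rows are hypothetical, and it assumes the comment count sits at index 4 as in `hacker_news.csv`.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows for illustration only (not taken from the real data set).
example_posts = [
    ['1', 'Ask HN: First example?', '', '2', '4', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: Second example?', '', '3', '10', 'user_b', '1/1/2016 11:00'],
]
print(average_comments(example_posts))  # 7.0
```

With the notebook's own lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to give the same values as the cells above.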

**Analyzing the Number of Comments on Ask Posts Based on Hour Created**

Below we will determine whether the hour at which an Ask post is created affects the number of comments it receives. First, we will count the Ask posts created during each hour of the day. Then we will calculate the average number of comments received by the posts created at each hour.
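Grouping by hour depends on extracting the hour from each `created_at` string. The sketch below shows that single step in isolation, using two timestamps that appear in the printed rows earlier. The helper name `hour_of` is hypothetical; the format string is the one used in the cell that follows.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

def hour_of(created_at):
    """Return the zero-padded hour ('00' to '23') from a created_at string."""
    return dt.datetime.strptime(created_at, date_format).strftime("%H")

print(hour_of("8/4/2016 11:52"))   # 11
print(hour_of("6/17/2016 0:01"))   # 00
```

Returning the hour as a zero-padded string keeps the dictionary keys uniform, so '02' and '2' can never end up as separate keys.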

In [11]:
# Counting the Ask posts created during each hour of the day and totaling the comments they received.
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])]
    )
    
comments_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

**Calculating the Avg. Number of Comments for ASK Hacker News Posts by Hour**

In [12]:
# Calculating the average number of comments that `Ask HN` posts created at each hour of the day receive.
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour

[['17', 11.46],
 ['11', 11.051724137931034],
 ['12', 9.41095890410959],
 ['22', 6.746478873239437],
 ['19', 10.8],
 ['06', 9.022727272727273],
 ['13', 14.741176470588234],
 ['16', 16.796296296296298],
 ['01', 11.383333333333333],
 ['18', 13.20183486238532],
 ['08', 10.25],
 ['20', 21.525],
 ['23', 7.985294117647059],
 ['00', 8.127272727272727],
 ['09', 5.5777777777777775],
 ['05', 10.08695652173913],
 ['21', 16.009174311926607],
 ['10', 13.440677966101696],
 ['03', 7.796296296296297],
 ['07', 7.852941176470588],
 ['02', 23.810344827586206],
 ['14', 13.233644859813085],
 ['15', 38.5948275862069],
 ['04', 7.170212765957447]]

**Sorting and Printing Values from the List Above**

In [13]:
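# Swapping each [hour, average] pair to [average, hour] so that sorting orders the rows by average.
swap_avg_by_hour = []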

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[11.46, '17'], [11.051724137931034, '11'], [9.41095890410959, '12'], [6.746478873239437, '22'], [10.8, '19'], [9.022727272727273, '06'], [14.741176470588234, '13'], [16.796296296296298, '16'], [11.383333333333333, '01'], [13.20183486238532, '18'], [10.25, '08'], [21.525, '20'], [7.985294117647059, '23'], [8.127272727272727, '00'], [5.5777777777777775, '09'], [10.08695652173913, '05'], [16.009174311926607, '21'], [13.440677966101696, '10'], [7.796296296296297, '03'], [7.852941176470588, '07'], [23.810344827586206, '02'], [13.233644859813085, '14'], [38.5948275862069, '15'], [7.170212765957447, '04']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
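As an alternative to swapping the columns, `sorted` accepts a `key` function that selects the value to sort by. The sketch below is offered as an option rather than as the notebook's method; it uses a handful of pairs copied from the `avg_by_hour` output above, and the names `avg_by_hour_sample` and `sorted_by_avg` are hypothetical.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    ['17', 11.46],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['15', 38.5948275862069],
    ['09', 5.5777777777777775],
]

# Sort by the average (index 1), highest first, leaving each pair as [hour, average].
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hr, avg in sorted_by_avg[:3]:
    print(hr, round(avg, 2))
```

This keeps each row in its original `[hour, average]` order, so no second list is needed.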

In [14]:
# Printing the 5 hours with the highest average number of comments.

print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


From the output above, the hour with the most comments per post on average is 15:00, with an average of 38.59 comments per post. Based on the documentation included with the data set, these times are in Eastern Time (EST), so Ask HN posts created during the 3:00 pm EST hour receive the most comments on average.
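Readers outside the Eastern time zone may want the best hour expressed in another zone. The following is a rough sketch that assumes the timestamps are in EST and models EST as a fixed UTC-5 offset, which ignores daylight saving time. The helper name `est_hour_to_utc` is hypothetical.

```python
import datetime as dt

# Assumption: timestamps are in EST, modeled here as a fixed UTC-5 offset.
# This ignores daylight saving time, so results can be off by an hour in summer.
EST = dt.timezone(dt.timedelta(hours=-5), "EST")

def est_hour_to_utc(hour):
    """Convert an hour of the day in EST to the matching 'HH:MM' time in UTC."""
    # The date is arbitrary; only the time of day matters for a fixed offset.
    est_time = dt.datetime(2016, 1, 1, hour, 0, tzinfo=EST)
    return est_time.astimezone(dt.timezone.utc).strftime("%H:%M")

print(est_hour_to_utc(15))  # 20:00
```

A different target zone could be substituted for `dt.timezone.utc` in the `astimezone` call.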

**Conclusion**
Through this analysis of Hacker News posts, we determined which type of post and which creation hour receive the most comments on average. To maximize the number of comments a post receives, we recommend submitting it as an Ask HN post between 3:00 pm and 4:00 pm EST. Because the data set excludes submissions that received no comments, these averages describe only posts that received at least one comment.