# Analyzing 2016 Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

The dataset was downloaded from [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). Note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
-  id: the unique identifier from Hacker News for the post
-  title: the title of the post
-  url: the URL that the post links to, if the post has a URL
-  num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
-  num_comments: the number of comments on the post
-  author: the username of the person who submitted the post
-  created_at: the date and time of the post's submission

Let's open the dataset and take a peek at the first few rows.

In [2]:
from csv import reader

# Read the whole CSV file into a list of lists
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
opened_file.close()

# Separate the header row from the data rows
headers = hn[:1]
print(headers)
hn = hn[1:]
print(*hn[:5], sep = "\n")
print("\n")
print("Total posts are: ", len(hn))

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Total posts are:  20100
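
Because the column names are known, rows can also be read as dictionaries keyed by header, which avoids numeric indexes such as `row[1]`. The following is a minimal sketch, not the approach this notebook uses: it reads a two-row in-memory sample (copied from the output above) instead of `hacker_news.csv`, and the names `sample_csv` and `rows` are illustrative.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv (two rows copied from the output above).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12\n"
)

# DictReader uses the first line as the field names.
rows = list(csv.DictReader(sample_csv))

for row in rows:
    # Values are read as strings, so numeric columns need an explicit int().
    print(row["title"], "->", int(row["num_comments"]), "comments")
```

The rest of this notebook keeps the list-of-lists representation, so the index-based code below is unaffected.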

## Extracting posts whose titles start with 'Ask HN' or 'Show HN'

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.startswith('Ask HN'):
        ask_posts.append(row)
    elif title.startswith('Show HN'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(ask_posts[:1])
print("Total numbers of Ask HN posts are: ", len(ask_posts))
print('\n')
print(show_posts[:1])
print("Total numbers of Show HN posts are: ", len(show_posts))
print('\n')
print(other_posts[:1])
print("Total numbers of other posts are: ", len(other_posts))

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']]
Total numbers of Ask HN posts are:  1742


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']]
Total numbers of Show HN posts are:  1161


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']]
Total numbers of other posts are:  17197
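
Note that `startswith` is case-sensitive, so a title beginning with `ask hn` would be counted under other posts by the loop above. Whether the dataset contains such titles is not checked here. A minimal sketch of a case-insensitive variant, using a hypothetical helper named `categorize`:

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# The first and third titles come from the output above; the second is made up.
print(categorize("Ask HN: How to improve my personal website?"))
print(categorize("show hn: a hypothetical lower-case title"))
print(categorize("Interactive Dynamic Video"))
```

If this variant were used, the category counts could differ from the ones reported above.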


## Compare number of comments by category


In [10]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
print("Total comments for Ask HN: ", total_ask_comments)   

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
print("Total comments for Show HN: ", total_show_comments)  


avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print("Average comments per Ask posts: ", round(avg_ask_comments))
print("Average comments per Show posts: ", round(avg_show_comments))

Total comments for Ask HN:  24466
Total comments for Show HN:  11987
Average comments per Ask posts:  14
Average comments per Show posts:  10


On average, posts with 'Ask HN' in their title receive ~14 comments, versus ~10 comments for posts with 'Show HN' in their title. Intuitively, this makes sense: posts that ask for others' opinions are likely to receive more comments than posts that present something.
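
The rounded averages can be reproduced directly from the totals and post counts printed above. This is only a worked check of the arithmetic with more decimal places; the variable names `ask_count` and `show_count` are illustrative.

```python
# Totals and post counts copied from the outputs above.
total_ask_comments, ask_count = 24466, 1742
total_show_comments, show_count = 11987, 1161

avg_ask = total_ask_comments / ask_count
avg_show = total_show_comments / show_count

print("Ask HN:  {:.2f} comments per post".format(avg_ask))
print("Show HN: {:.2f} comments per post".format(avg_show))
print("Ratio:   {:.2f}".format(avg_ask / avg_show))
```

The unrounded values are about 14.04 and 10.32, so Ask HN posts receive roughly 36% more comments per post than Show HN posts in this sample.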

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

## Compare number of comments by hour posted

In [41]:
import datetime as dt
result_list = []

# Collect [created_at, num_comments] pairs for every Ask HN post
for row in ask_posts:
    post_time = row[6]
    comments = int(row[4])
    result = [post_time,comments]
    result_list.append(result)
    
counts_by_hour = {}    # hour -> number of posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts
for row in result_list:
    post_date = row[0]
    comments_hr = row[1]
    # Parse the timestamp string into a datetime object
    post_dt = dt.datetime.strptime(post_date, "%m/%d/%Y %H:%M")
    row[0] = post_dt
    
    # Zero-padded hour of day as a string, e.g. '09'
    hour = post_dt.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments_hr
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments_hr       
        
print("Total of posts by posted hours: ", counts_by_hour)
print("\n")
print("Total of comments by posted hours: ", comments_by_hour)

Total of posts by posted hours:  {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 108, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 54, '06': 44, '07': 34, '11': 58}


Total of comments by posted hours:  {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1430, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 439, '06': 397, '07': 267, '11': 641}
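
The hour extraction in the cell above can be checked on a single timestamp. A minimal sketch, using the `created_at` value of the Ask HN sample row shown earlier; the variable names are illustrative.

```python
import datetime as dt

# One created_at value copied from the Ask HN sample row shown earlier.
created_at = "8/16/2016 9:55"

# strptime parses the string into a datetime object.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

# strftime("%H") formats the hour as a zero-padded two-character string.
hour = parsed.strftime("%H")

print(parsed)
print(hour)
```

This is why the dictionary keys above are strings such as `'09'` rather than integers.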


We now know the total number of posts and comments for each hour. To take the analysis further, we need the average number of comments per post created during each hour of the day.

In [51]:
avg_by_hour = []
for row in counts_by_hour:
    avg = comments_by_hour[row] / counts_by_hour[row]
    avg_by_hour.append([row, avg])
    
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.24074074074074], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.12962962962963], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
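
As a spot check, a few of these averages can be recomputed from the two dictionaries printed earlier. The sketch below uses only three hours copied from that output; `counts_sample`, `comments_sample`, and `avg_sample` are illustrative names.

```python
# Three entries copied from the dictionaries printed earlier.
counts_sample = {"15": 116, "02": 58, "20": 80}
comments_sample = {"15": 4477, "02": 1381, "20": 1722}

# Average comments per post for each hour in the sample.
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in counts_sample
}

for hour, avg in avg_sample.items():
    print(hour, avg)
```

For example, 4477 comments over 116 posts at hour 15 gives about 38.59 comments per post, matching the list above.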


To make it easier to draw conclusions from the result, we sort the values and look at the top 5.

In [80]:
swap_avg_by_hour = []

for row in avg_by_hour:
    avg = row[1]
    hour = row[0]
    swap_avg_by_hour.append([avg, hour])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('\n')
print("Top 5 Hours for Ask Posts Comments:")
for row in sorted_swap[:5]:
    hour = str(row[1])
    hour = dt.datetime.strptime(hour, "%H")
    hour = hour.strftime("%H:%M")
    print('{0} - {1:.2f} average comments per post'.format(hour, row[0]))

    
# Next, see which hours have the most Ask HN posts
print('\n')
print("Top 5 Hours for Ask HN posting:")
    
swap_counts = []
for hour in counts_by_hour:
    swap_counts.append([counts_by_hour[hour], hour])
swap_post_hr = sorted(swap_counts, reverse = True)

for row in swap_post_hr[:5]:
    hour = dt.datetime.strptime(row[1], "%H").strftime("%H:%M")
    # These are post counts, not averages
    print('{0} - {1} posts'.format(hour, row[0]))
    

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.24074074074074, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.12962962962963, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


Top 5 Hours for Ask Posts Comments:
15:00 - 38.59 average comments per post
02:00 - 23.81 average comments per post
20:00 - 21.52 average comments per post
16:00 - 16.80 average comments per post
21:00 - 16.01 average comments per post


Top 5 Hours for Ask HN posting:
15:00 - 116 posts
19:00 - 110 posts
21:00 - 109 posts
18:00 - 108 posts
16:00 - 108 posts
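
The swap-then-sort step above is one way to rank the hours. As an alternative sketch, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be ranked without swapping them first. The sample below is a subset of the averages printed earlier, rounded to two decimals; `avg_sample` and `top_5` are illustrative names.

```python
# A subset of the [hour, average] pairs printed earlier, rounded for brevity.
avg_sample = [
    ["09", 5.58], ["13", 14.74], ["16", 16.80], ["15", 38.59],
    ["21", 16.01], ["20", 21.53], ["02", 23.81], ["22", 6.75],
]

# Sort by the average (second element), largest first, and keep five rows.
top_5 = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_5:
    print("{0}:00 - {1:.2f} average comments per post".format(hour, avg))
```

One difference in behavior: with a `key` on the value alone, ties keep their original relative order, whereas the swapped lists above break ties by comparing the hour strings.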

## Conclusion

As the results show, 'Ask HN' posts are more likely to receive feedback from readers than 'Show HN' posts. According to the data, 'Ask HN' posts receive ~14 comments on average, while 'Show HN' posts receive ~10.

Among 'Ask HN' posts, those created at 3pm receive the highest average number of comments (~38.6 per post). The other four top hours are 2am, 8pm, 4pm, and 9pm, so most of the top hours fall in the afternoon and evening. This suggests that readers set aside time to read and respond during their leisure hours.

Looking at it from a different angle, most 'Ask HN' posts were also created between 3pm and 9pm. Given that posts created in the afternoon and evening receive the most comments on average, a post submitted shortly before these peak hours may be more likely to receive attention from readers.