# Exploring Hacker News Posts

In this project, we explore a dataset of [Hacker News](https://news.ycombinator.com/) posts to see which kinds of posts tend to get the most traction. Specifically, we compare posts labeled 'Ask HN', where users ask the community a question, with posts labeled 'Show HN', where users share a piece of news or information with their peers.

We will investigate which of these two post types tends to generate more discussion (measured by the number of comments) and whether the time and date of posting influences how much discussion a post generates.

We are working with a subset of the data that contains about 20,000 rows. The original dataset was reduced by removing posts with no comments and then subsampling from the remainder.

To begin, we'll load the data and separate the header row from the data rows.

In [1]:
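from csv import reader

# Read every row into a list; the with block closes the file when done
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Preview the raw data (the first row is the header)
for i in range(0, 5):
    print(hn[i], '\n')

# Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]
print(headers, '\n')
for i in range(0, 5):
    print(hn[i], '\n')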


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0p
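
The loading step can also be wrapped in a small helper. The sketch below is illustrative only: `load_rows` is a hypothetical name, and the in-memory sample stands in for `hacker_news.csv` so the example runs without the real file.

```python
import csv
import io

# Tiny in-memory stand-in for hacker_news.csv (hypothetical sample rows).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,2,6,alice,8/16/2016 9:55\n"
    "2,Show HN: Example project,http://example.com,26,22,bob,11/25/2015 14:03\n"
)


def load_rows(file_obj):
    """Return (header, rows) read from an open CSV file object."""
    all_rows = list(csv.reader(file_obj))
    return all_rows[0], all_rows[1:]


header, rows = load_rows(sample_csv)
print(header)
print(len(rows), 'data rows')
```

With the real file, the same helper could be called inside `with open('hacker_news.csv', newline='') as f:`, which is the form the `csv` module documentation recommends for file objects.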

### Extracting Ask HN and Show HN Posts

We will begin by sorting the posts into the categories we need. Since we are concerned with the popularity of Ask HN and Show HN posts, we will simply place any post that belongs to neither into an "Other" category.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts),'\n',len(show_posts),'\n', len(other_posts))

1744 
 1162 
 17194
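
The prefix check used above can be isolated in a small function, which makes the rule easy to test on individual titles. This is a minimal sketch; `classify_title` is a hypothetical helper, and the sample titles are taken from the outputs shown in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()  # make the check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]

# Tally how many sample titles fall into each category
counts = {}
for title in sample_titles:
    label = classify_title(title)
    counts[label] = counts.get(label, 0) + 1

print(counts)
```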


In [16]:
for i in range(0, 5):
    print(ask_posts[i], '\n')
    
for i in range(0, 5):
    print(show_posts[i], '\n')


['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'] 

['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'] 

['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'] 

['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'] 

['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38'] 

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'] 

['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'] 

['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1

### Calculating the Average Number of Comments

Now that we have separated ask posts and show posts into different lists, we'll calculate the average number of comments each type of post receives.

In [5]:
total_ask_comments = 0
for posts in ask_posts:
    n_comments = int(posts[4])
    total_ask_comments += n_comments

avg_ask_comments = total_ask_comments/len(ask_posts)

print('The average number of comments per ask post is: ' + str(avg_ask_comments))

total_show_comments = 0
for posts in show_posts:
    n_comments = int(posts[4])
    total_show_comments += n_comments
    
avg_show_comments = total_show_comments/len(show_posts)

print('The average number of comments per show post is ' + str(avg_show_comments))

The average number of comments per ask post is: 14.038417431192661
The average number of comments per show post is 10.31669535283993


From our analysis, we can see that the average number of comments per ask post is about 36% higher than it is for show posts. This suggests that ask posts tend to generate more discussion than a typical show post.
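
As a rough check of the gap between the two averages printed above, the relative difference can be computed directly. The variable names below are illustrative; the two values are copied from the cell output.

```python
# Averages copied from the output above
avg_ask = 14.038417431192661
avg_show = 10.31669535283993

# How much higher the ask average is, relative to the show average
relative_diff = (avg_ask - avg_show) / avg_show

print('Ask posts average {:.1%} more comments than show posts'.format(relative_diff))
```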

### Relation between Posting Time and Comments

Since an Ask HN post aims to get feedback from the community, we will look at how the time a post is created relates to the number of responses it gets. To do so, we will group the posts by the hour in which they were created and see which hours have the highest average number of comments per post.
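
The grouping idea can be sketched on a handful of rows before applying it to the full list. The version below uses `collections.defaultdict` as one possible way to bucket by hour; it is not the approach taken in the cells that follow. The first three sample rows come from the ask posts shown earlier, and the fourth is hypothetical.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs
samples = [
    ('8/16/2016 9:55', 6),
    ('11/22/2015 13:43', 29),
    ('5/2/2016 10:14', 1),
    ('1/1/2016 9:10', 4),  # hypothetical row, added so hour 9 has two posts
]

# hour -> [number of posts, total comments]
totals = defaultdict(lambda: [0, 0])
for created_at, n_comments in samples:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    totals[hour][0] += 1
    totals[hour][1] += n_comments

# Average comments per post for each hour that appears in the sample
avg_comments = {hour: comments / posts for hour, (posts, comments) in totals.items()}
print(avg_comments)
```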

In [18]:
import datetime as dt
result_list = []

# Create a list with the post date and number of comments
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

# Dictionaries for the number of posts per hour and the total number of comments on those posts
counts_by_hour = {}
comments_by_hour = {}

datetime_list = []

# Parse the data to fill up the two dictionaries
for result in result_list:
    date = result[0]
    n_comments = result[1]
    date_dt = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    post_hour = date_dt.time().hour
    
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = n_comments
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += n_comments
    
    
    datetime_list.append(post_hour)


In [19]:
# Calculate the average number of comments per post for each hour
avg_by_hour = []

for key in counts_by_hour:
    avg_by_hour.append([key, comments_by_hour[key]/counts_by_hour[key]])

print(avg_by_hour[0:5])

[[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447]]


In [29]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('\n')
print(sorted_swap)

print('Top 5 Hours for Ask Posts Comments: ')

top_5 = []

for i in range(0,5):
    hour = sorted_swap[i][1]
    comments = sorted_swap[i][0]
    hour = dt.datetime.strptime(str(hour), "%H")
    hour_str = hour.strftime('%H:%M')
    
    line_format = '{}: {:.2f} average comments per post'
    to_print = line_format.format(hour_str, comments)
    print(to_print)


[[8.127272727272727, 0], [11.383333333333333, 1], [23.810344827586206, 2], [7.796296296296297, 3], [7.170212765957447, 4], [10.08695652173913, 5], [9.022727272727273, 6], [7.852941176470588, 7], [10.25, 8], [5.5777777777777775, 9], [13.440677966101696, 10], [11.051724137931034, 11], [9.41095890410959, 12], [14.741176470588234, 13], [13.233644859813085, 14], [38.5948275862069, 15], [16.796296296296298, 16], [11.46, 17], [13.20183486238532, 18], [10.8, 19], [21.525, 20], [16.009174311926607, 21], [6.746478873239437, 22], [7.985294117647059, 23]]


[[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [13.20183486238532, 18], [11.46, 17], [11.383333333333333, 1], [11.051724137931034, 11], [10.8, 19], [10.25, 8], [10.08695652173913, 5], [9.41095890410959, 12], [9.022727272727273, 6], [8.127272727272727, 0], [7.985294117647059, 23], [7.852941176470588, 
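
An alternative to swapping the columns before sorting is to pass a `key` function to `sorted`. The sketch below applies this to the first five `[hour, average]` pairs printed earlier; `ranked` is an illustrative name.

```python
# First five [hour, average comments] pairs from the earlier output
avg_by_hour = [
    [0, 8.127272727272727],
    [1, 11.383333333333333],
    [2, 23.810344827586206],
    [3, 7.796296296296297],
    [4, 7.170212765957447],
]

# Sort by the average (second element), highest first, without swapping columns
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, avg))
```

Because only five hours are included here, this ranking does not match the full top five; it only demonstrates the sorting technique.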

Ask posts made at 15:00 (3:00 p.m.) tend to generate the most comments, with about 38.6 comments per post on average. According to the documentation, all times are given in Eastern Standard Time (EST).
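
Readers in other time zones may want to translate the peak hour. The sketch below treats EST as a fixed UTC-5 offset, which is an assumption: if the timestamps actually follow US Eastern local time with daylight saving, a full time zone database (for example the standard library's `zoneinfo`) would be needed instead. The date used is arbitrary.

```python
import datetime as dt

# Assumption: EST is modeled as a fixed UTC-5 offset (no daylight saving)
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

# 15:00 EST on an arbitrary date
peak = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EST)

# Convert the same instant to UTC
peak_utc = peak.astimezone(dt.timezone.utc)
print(peak.strftime('%H:%M %Z'), '=', peak_utc.strftime('%H:%M %Z'))
```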

### Conclusion

In this project, we analyzed ask posts and show posts to determine which post type, and which posting time, receives the most comments on average. Based on our analysis, ask posts receive more comments on average than show posts, and ask posts made between 15:00 and 16:00 EST receive the most comments per post.

It should be noted that the analysis excluded posts that had no comments. Thus it would be more accurate to say that _of the posts that received comments_, those created during the aforementioned hour received the most comments on average.