In [1]:
__author__ = 'Ola Olagunju'
__email__ = 'gunjujide@gmail.com'

# Analyzing the Popularity of Hacker News Posts
----------------

## 1. Introduction
[Hacker News](https://news.ycombinator.com/) is a social news website run by the startup incubator [Y Combinator](https://www.ycombinator.com/), with a focus on computer science and entrepreneurship. It is hugely popular in technology and startup communities. Users can submit any posts that "gratify one's intellectual curiosity" (Ref: Hacker News Guidelines). Posts are voted and commented on, and the top-ranked posts can draw hundreds of thousands of visitors.

You can find the original dataset for Hacker News posts [here](https://www.kaggle.com/hacker-news/hacker-news-posts). For this project, we use **hacker_news.csv**, a modified version of that dataset in which approximately 300,000 rows have been trimmed down to 20,000 rows by:

- Deleting all the posts without any comments

- Sampling randomly from the remaining posts after the deletion
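The two trimming steps above can be sketched in a few lines. This is only an illustration of the idea, not the procedure actually used to produce **hacker_news.csv**: the function name `trim_posts`, the seed, and the made-up rows are all assumptions, and the comment count is assumed to sit at index 4 of each row.

```python
import random

def trim_posts(rows, sample_size, seed=0, comments_index=4):
    """Drop posts without comments, then draw a random sample of the rest."""
    # Step 1: keep only posts that received at least one comment
    with_comments = [row for row in rows if int(row[comments_index]) > 0]
    # Step 2: sample without replacement (seeded so the result is repeatable)
    rng = random.Random(seed)
    return rng.sample(with_comments, min(sample_size, len(with_comments)))

# Made-up rows for illustration only; every third post has zero comments
rows = [[str(i), 'Post {}'.format(i), '', '1', str(i % 3), 'user', '1/1/2016 0:00']
        for i in range(30)]

sample = trim_posts(rows, sample_size=5)
print(len(sample))
```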

Here are the descriptions for the columns of the **hacker_news.csv** dataset:

- **id**: The unique identifier for the post
- **title**: The title of the post
- **url**: The URL that the post links to, if the post has a URL
- **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments**: The number of comments that were made on the post
- **author**: The username of the person who submitted the post
- **created_at**: The date and time at which the post was submitted (time zone Eastern Time in the US)
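The code cells in this notebook address columns by position (for example, index 4 for **num_comments**). As a small sketch, those positions can be given names, assuming the column order listed above matches the header row of the file. `COLUMNS` and `COL` are hypothetical names introduced here for illustration.

```python
# Column order assumed to match the list above
COLUMNS = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row
COL = {name: index for index, name in enumerate(COLUMNS)}

# One row taken from the dataset preview shown in this notebook
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

num_comments = int(sample_row[COL['num_comments']])
created_at = sample_row[COL['created_at']]
print(num_comments, created_at)
```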

We will be analyzing the posts whose titles begin with 'Ask HN' or 'Show HN'.

Users submit Ask HN posts to ask the Hacker News community a specific question. For example: 
- Ask HN: How to improve my personal website?
- Ask HN: Am I the only one outraged by Twitter shutting down share counts?
- Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. For example:
- Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
- Show HN: Something pointless I made
- Show HN: Shanhu.io, a programming playground powered by e8vm

### Targets

1) Which type of post is more popular, 'Ask HN' or 'Show HN'?

2) Does 'Ask HN' or 'Show HN' receive more comments (engagement) on average?

3) Do posts created at a certain time receive more comments on average?

## 2. Data Exploration

In [17]:
# Read the file and put it in a list of lists
from csv import reader

with open('hacker_news.csv') as opened_file:  # the file is closed automatically
    hn = list(reader(opened_file))
headers = hn[0]
hn = hn[1:]

print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [18]:
# Sort all the posts into three different categories
ask_posts = [] # 'Ask HN posts'
show_posts = [] # 'Show HN posts'
other_posts = [] # 'Other posts'

for row in hn:
    title = row[1].lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check how many posts are in each category
print('Ask HN has', len(ask_posts), 'posts')
print('Show HN has', len(show_posts), 'posts')

#print('Other posts have', len(other_posts), 'posts')

Ask HN has 1744 posts
Show HN has 1162 posts


## 3. Data Analysis

In [25]:
# Check which category receives more comments on average

total_ask_comments, total_show_comments = 0, 0

for post in ask_posts:
    num_comments = float(post[4])
    total_ask_comments += num_comments

for post in show_posts:
    num_comments = float(post[4])
    total_show_comments += num_comments

avg_ask_comments = total_ask_comments/len(ask_posts)
avg_show_comments = total_show_comments/len(show_posts)

print('Average number of Ask HN comments:', round(avg_ask_comments))
print('Average number of Show HN comments:', round(avg_show_comments))



Average number of Ask HN comments: 14
Average number of Show HN comments: 10


As we can see above, Ask HN posts receive about 14 comments on average, compared with about 10 for Show HN posts. By this measure, Ask HN posts are more popular than Show HN posts.
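Both loops above follow the same pattern, so the average could also be computed with a single helper. This is a minimal sketch, assuming each row stores the comment count as a string at index 4. `average_comments` is a hypothetical name, not part of the original notebook, and the example rows are made up.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows for illustration only (not taken from the dataset)
example_posts = [
    ['1', 'Ask HN: example a', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: example b', '', '3', '20', 'user_b', '1/1/2016 11:00'],
]

print(average_comments(example_posts))  # 15.0
```

With the notebook's lists, the same call would be `average_comments(ask_posts)` and `average_comments(show_posts)`.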

We'll now focus on Ask HN posts and check which hours of the day attract the most Ask HN posts and comments.

In [39]:
# Find number of posts (and comments) for each hour of the day

import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append([post[6], float(post[4])]) 
    
counts_by_hour, comments_by_hour = {}, {}

for row in result_list:
    date = row[0]
    date_obj = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    hour = date_obj.strftime('%H')
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
        
print(sorted(counts_by_hour.items(), key = lambda x:x[1], reverse = True), '\n')
print(sorted(comments_by_hour.items(), key = lambda x:x[1], reverse = True))

[('15', 116), ('19', 110), ('21', 109), ('18', 109), ('16', 108), ('14', 107), ('17', 100), ('13', 85), ('20', 80), ('12', 73), ('22', 71), ('23', 68), ('01', 60), ('10', 59), ('02', 58), ('11', 58), ('00', 55), ('03', 54), ('08', 48), ('04', 47), ('05', 46), ('09', 45), ('06', 44), ('07', 34)] 

[('15', 4477.0), ('16', 1814.0), ('21', 1745.0), ('20', 1722.0), ('18', 1439.0), ('14', 1416.0), ('02', 1381.0), ('13', 1253.0), ('19', 1188.0), ('17', 1146.0), ('10', 793.0), ('12', 687.0), ('01', 683.0), ('11', 641.0), ('23', 543.0), ('08', 492.0), ('22', 479.0), ('05', 464.0), ('00', 447.0), ('03', 421.0), ('06', 397.0), ('04', 337.0), ('07', 267.0), ('09', 251.0)]


In [64]:
# Show which times have the most comments (engagement) on average

avg_comments_by_hour = []

for hour in comments_by_hour:
    avg_comments_by_hour.append([hour, round(comments_by_hour[hour]/counts_by_hour[hour], 2)])

avg_comments_by_hour = sorted(avg_comments_by_hour, key = lambda x:x[1], reverse = True)
print(avg_comments_by_hour)

print('\n\nTop 5 Hours for Ask Posts Comments:\n')

for hour in avg_comments_by_hour[:5]:
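    est_time = dt.datetime.strptime(hour[0], '%H')
    gmt_time = est_time + dt.timedelta(hours = 5)  # EST is UTC-5
    
    # '%-I' (12-hour clock without zero padding) is platform-specific; Windows uses '%#I'
    est_time = dt.datetime.strftime(est_time, '%-I %p')
    gmt_time = dt.datetime.strftime(gmt_time, '%-I %p')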
    
    print('{} EST or {} GMT: {:.0f} average comments per post'.format(est_time, gmt_time, hour[1]))
    

[['15', 38.59], ['02', 23.81], ['20', 21.52], ['16', 16.8], ['21', 16.01], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['18', 13.2], ['17', 11.46], ['01', 11.38], ['11', 11.05], ['19', 10.8], ['08', 10.25], ['05', 10.09], ['12', 9.41], ['06', 9.02], ['00', 8.13], ['23', 7.99], ['07', 7.85], ['03', 7.8], ['04', 7.17], ['22', 6.75], ['09', 5.58]]


Top 5 Hours for Ask Posts Comments:

3 PM EST or 8 PM GMT: 39 average comments per post
2 AM EST or 7 AM GMT: 24 average comments per post
8 PM EST or 1 AM GMT: 22 average comments per post
4 PM EST or 9 PM GMT: 17 average comments per post
9 PM EST or 2 AM GMT: 16 average comments per post


Taking into consideration that Hacker News attracts viewership from all continents, it is logical that the times when users are usually online in these time zones would be reflected in the averages. 8:00 PM GMT receives the most engagement on posts, as many people are usually home and active online in the evening. At 7:00 AM GMT, many people are active online before going to work or school. The reasoning behind the average comments for 8:00 PM EST, 9:00 PM EST, and 9:00 PM GMT is identical to that for 8:00 PM GMT. 1:00 AM GMT, 2:00 AM GMT, 2:00 AM EST, 3:00 PM EST, and 4:00 PM EST are not times when most users would be online, so we won't consider them the best hours to post in Ask HN.
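The EST-to-GMT pairs discussed above can also be derived with timezone-aware datetimes rather than by adding five hours by hand. This is a sketch under the same simplification the notebook makes: EST is treated as a fixed UTC-5 offset and daylight saving time is ignored. `EST` and `est_hour_to_gmt_hour` are hypothetical names introduced for illustration.

```python
import datetime as dt

# Fixed UTC-5 offset; daylight saving time is deliberately ignored
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

def est_hour_to_gmt_hour(hour_str):
    """Convert a two-digit EST hour such as '15' to the matching GMT hour."""
    est_time = dt.datetime.strptime(hour_str, '%H').replace(tzinfo=EST)
    gmt_time = est_time.astimezone(dt.timezone.utc)
    return gmt_time.strftime('%H')

# The five hours with the highest average number of comments
for hour in ['15', '02', '20', '16', '21']:
    print(hour, 'EST ->', est_hour_to_gmt_hour(hour), 'GMT')
```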

## 4. Conclusion
Based on our analysis, we have learned that:
- 'Ask HN' (Ask Hacker News) posts are more popular than 'Show HN' (Show Hacker News) posts.
- Ask HN posts receive more comments (engagement) on average than Show HN posts.
- 8:00 PM and 9:00 PM (EST or GMT) are the best times to post in Ask HN, as that is when the most user engagement happens.