# Hacker News - Analysis of posts

Hacker News is a technology-oriented forum where users can post questions and discuss topics. To gauge popularity and weigh the importance of individual submissions, posts can be upvoted or downvoted, similar to Reddit. In this project, we analyse "Ask HN" submissions, in which users ask the community specific questions. The timestamps in the data are in US Eastern time, which is also the time zone of Montreal.

|Column name (index)| Description|
|---|---|
|id (0)| Unique identifier of the post|
|title (1)| Title of post|
|url (2)| URL that the post links to|
|num_points (3)| Total points (upvotes - downvotes)|
|num_comments (4)| Total number of comments|
|author (5)| Username |
|created_at (6)| Date and time of post submission|
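
As a small illustration of the table above, the column indices can be given names so that later lookups such as `row[4]` are easier to read. This is only a sketch: the constant names and the sample row are hypothetical and not part of the dataset.

```python
# Column positions taken from the table above; the constant names are illustrative.
ID, TITLE, URL, NUM_POINTS, NUM_COMMENTS, AUTHOR, CREATED_AT = range(7)

# Hypothetical row in the same layout as hacker_news.csv (values are made up).
sample_row = ["1", "Ask HN: Example question?", "", "5", "3", "example_user", "9/26/2016 2:53"]

# csv.reader yields strings, so numeric columns must be converted before arithmetic.
num_comments = int(sample_row[NUM_COMMENTS])
print(sample_row[TITLE], num_comments, sample_row[CREATED_AT])
```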

In [19]:
from csv import reader # to parse .csv file

# utf-8 encoding required to read data; the with-block closes the file after reading
with open('C:/Users/User/Documents/data_sets/hacker_news.csv', encoding='utf-8') as open_file:
    hn = list(reader(open_file)) # parse and transform into list of lists
headers = hn[0] # header row
hn = hn[1:] # data rows

In [20]:
# Separate "Ask HN" and "Show HN" posts from all other posts
ask_posts = [] # posts asking the community a question
show_posts = [] # posts showing a project or product to the community
other_posts = [] # all remaining posts

for row in hn:
    title = row[1].lower() # lower case title to facilitate usage of startswith() method
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of "Ask hn" posts: ', len(ask_posts))  
print('Number of "Show hn" posts: ', len(show_posts))  
print('Number of other posts: ', len(other_posts))  

Number of "Ask hn" posts:  9139
Number of "Show hn" posts:  10158
Number of other posts:  273822


In [21]:
def avg_comments(data):
    '''Returns the average number of comments in data'''
    total_comments = 0 # number of comments in data
    total_posts = len(data) # number of posts in data
    for row in data:
        comments = int(row[4])
        total_comments += comments
    return total_comments/total_posts # average number of comments

In [22]:
avg_ask_comments = avg_comments(ask_posts)
print('Average number of comments for ask posts: ', avg_ask_comments)
avg_show_comments = avg_comments(show_posts)
print('Average number of comments for show posts: ', avg_show_comments)
avg_other_comments = avg_comments(other_posts)
print('Average number of comments for other posts: ', avg_other_comments)

Average number of comments for ask posts:  10.393478498741656
Average number of comments for show posts:  4.886099625910612
Average number of comments for other posts:  6.4572678601427205


The average number of comments per post differs substantially between post types: "Ask HN" posts average about 10.4 comments, compared with about 4.9 for "Show HN" posts and about 6.5 for other posts. This may be because questions stimulate discussion more easily (for example, because they are concise), because they are less technically demanding when asked by mostly inexperienced users, or because users are more inclined to help than to discuss. Since questions generate the most comments, the following analysis focuses on "Ask HN" posts.
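
To put the difference in numbers, the averages printed above can be compared directly. The following is a rough sketch; the variable names are illustrative and the values are copied from the output.

```python
# Averages copied from the output above.
avg_ask = 10.393478498741656
avg_show = 4.886099625910612
avg_other = 6.4572678601427205

# Ratios of average comments: how many times more comments an ask post receives.
ask_vs_show = avg_ask / avg_show
ask_vs_other = avg_ask / avg_other
print(round(ask_vs_show, 2), round(ask_vs_other, 2))
```

On these figures, an ask post receives roughly twice as many comments as a show post on average.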

## At what times do ask-posts acquire the most comments?

In [29]:
# identify date format
for row in ask_posts[0:3]:
    print(row[6])
    
date_format = "%m/%d/%Y %H:%M" # format of post dates

9/26/2016 2:53
9/26/2016 1:17
9/25/2016 22:57
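
The sample dates above are not zero-padded, but `strptime` still accepts them with the `%m/%d/%Y %H:%M` format. A minimal check on the first printed date, using only the standard `datetime` module:

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"  # same format string as in the cell above

# First sample date printed above.
parsed = dt.datetime.strptime("9/26/2016 2:53", date_format)

# strftime("%H") returns the hour as a zero-padded two-character string.
hour = parsed.strftime("%H")
print(parsed, hour)
```

Because the hour comes back as a zero-padded string such as `"02"`, sorting the hour keys alphabetically also sorts them chronologically.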


In [56]:
import datetime as dt # time analysis

result_list = [] # list of tuples containing post dates and number of comments
for row in ask_posts:
    result_list.append((row[6], row[4])) # list of tuples (date created, number of comments)

posts_by_hour = {} # number of posts for each hour
comments_by_hour = {} # number of comments for each hour

# Extract hours and build dictionaries of post and comment counts by hour
for tup in result_list:
    date = tup[0] # date of post
    comments = int(tup[1]) # convert to integer to summarize comments
    time = dt.datetime.strptime(date, date_format) # parse dates according to given format
    hour = time.strftime("%H") # extract hour from post
    if hour not in posts_by_hour: # frequency counts
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

# Post and comment counts for each hour of the day (24-hour clock)
for hour in sorted(posts_by_hour):
    print("Number of posts at", hour + ":00:", posts_by_hour[hour])
    print("Number of comments at", hour + ":00:", comments_by_hour[hour])

Number of posts at 00:00: 301
Number of comments at 00:00: 2277
Number of posts at 01:00: 282
Number of comments at 01:00: 2089
Number of posts at 02:00: 269
Number of comments at 02:00: 2996
Number of posts at 03:00: 271
Number of comments at 03:00: 2154
Number of posts at 04:00: 243
Number of comments at 04:00: 2360
Number of posts at 05:00: 209
Number of comments at 05:00: 1838
Number of posts at 06:00: 234
Number of comments at 06:00: 1587
Number of posts at 07:00: 226
Number of comments at 07:00: 1585
Number of posts at 08:00: 257
Number of comments at 08:00: 2362
Number of posts at 09:00: 222
Number of comments at 09:00: 1477
Number of posts at 10:00: 282
Number of comments at 10:00: 3013
Number of posts at 11:00: 312
Number of comments at 11:00: 2797
Number of posts at 12:00: 342
Number of comments at 12:00: 4234
Number of posts at 13:00: 444
Number of comments at 13:00: 7245
...
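
The same per-hour tallies could also be sketched with `collections.Counter`, which removes the need for the `if hour not in ...` branch. This is an alternative to the loop above, not the approach used in this analysis, and the sample data is made up for illustration.

```python
from collections import Counter
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Hypothetical (created_at, num_comments) pairs in the dataset's layout.
sample = [("9/26/2016 2:53", "4"), ("9/26/2016 2:17", "1"), ("9/25/2016 22:57", "10")]

posts_by_hour = Counter()     # number of posts per hour
comments_by_hour = Counter()  # total comments per hour

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    posts_by_hour[hour] += 1                     # missing keys start at 0
    comments_by_hour[hour] += int(num_comments)  # csv values are strings

print(dict(posts_by_hour), dict(comments_by_hour))
```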

In [61]:
# Compute average number of comments per post at each hour
avg_by_hour = [] # list of [average comments per post, hour] pairs
for hour in posts_by_hour:
    posts = posts_by_hour[hour]
    comments = comments_by_hour[hour]
    avg = round(comments/posts, 1) # average number of comments per post 
    avg_by_hour.append([avg, hour])


In [70]:
# Print results
sorted_avg = sorted(avg_by_hour, reverse=True)[:5] # top 5 hours with highest average comments per "Ask HN" post
for avg in sorted_avg:
    average = avg[0]
    hour = dt.datetime.strptime(avg[1], '%H') # parse hour string into a datetime object
    hour = hour.strftime("%H:%M") # format time as HH:MM
    print('{average:.2f} average comments per "Ask HN" post at {hour}'.format(average = average, hour = hour))

28.70 average comments per "Ask HN" post at 15:00
16.30 average comments per "Ask HN" post at 13:00
12.40 average comments per "Ask HN" post at 12:00
11.10 average comments per "Ask HN" post at 02:00
10.70 average comments per "Ask HN" post at 10:00
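
As a sanity check, some of these averages can be recomputed by hand from the per-hour counts printed earlier. The dictionary below simply copies the counts for four of the top five hours; 15:00 is left out because its counts are not visible in the printed table.

```python
# (posts, comments) per hour, copied from the printed counts earlier in this section.
counts = {"02": (269, 2996), "10": (282, 3013), "12": (342, 4234), "13": (444, 7245)}

# Average comments per post, rounded to one decimal as in the analysis.
averages = {hour: round(comments / posts, 1) for hour, (posts, comments) in counts.items()}
print(averages)
```

The recomputed values (11.1, 10.7, 12.4 and 16.3) agree with the averages reported for 02:00, 10:00, 12:00 and 13:00.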


The results show that "Ask HN" posts created at 15:00 receive the highest average number of comments (28.7 per post), well ahead of 13:00 (16.3 per post). This indicates that, in US Eastern time, which includes Montreal, 15:00 is the best time to ask a question on HN in order to generate the most comments.