# Exploring Hacker News Posts

This project uses a dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a social news website focusing on computer science and entrepreneurship, where user-submitted stories (known as "posts") are voted and commented on, similar to Reddit.

The dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts) and contains the following columns:

`id`: The unique identifier from Hacker News for the post

`title`: The title of the post

`url`: The URL that the post links to, if the post has a URL

`num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

`num_comments`: The number of comments that were made on the post

`author`: The username of the person who submitted the post

`created_at`: The date and time at which the post was submitted
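
As a small illustration of the column layout described above, the sketch below reads two rows with `csv.DictReader` so that each field can be accessed by name. This is an alternative to the list-based approach used in the rest of this notebook, not the method the analysis relies on; the in-memory `sample_csv` string is an assumption standing in for the real file.

```python
import csv
import io

# Two rows taken from the dataset, held in memory instead of read from disk
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

# DictReader uses the header row as keys for every following row
rows = list(csv.DictReader(sample_csv))
first = rows[0]
print(first["title"], "|", first["author"], "|", first["created_at"])
```

Every value is read as a string, so numeric columns such as `num_comments` need an explicit `int()` conversion before any arithmetic.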


Posts whose titles begin with either `Ask HN` or `Show HN` will be analysed. Users submit `Ask HN` posts to ask the Hacker News community a specific question and `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.

These two types of posts will be compared to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?


In [1]:
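from csv import reader

### Import data set ###
# Use a context manager so the file is closed once it has been read
with open('HN_posts_year_to_Sep_26_2016.csv') as open_file:
    read_file = reader(open_file)
    dataset = list(read_file)

dataset_header = dataset[0]  # first row holds the column names
hn = dataset[1:]             # remaining rows are the posts

print(hn[:5])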

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


Below is a function to print rows in a readable way.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(dataset_header)        
print('\n')
explore_data(hn, 0, 5, rows_and_columns=True)
print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Number of rows: 293119
Number of columns: 7

From this dataset, posts beginning with `Ask HN` and `Show HN` (and case variations) will be separated into two different lists.

In [3]:
ask_posts=[]
ask_posts_db=[]

show_posts=[]
show_posts_db=[]

other_posts=[]

for row in hn:
    title=row[1]
    title=title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(title)
        ask_posts_db.append(row)
    elif title.startswith('show hn'):
        show_posts.append(title)
        show_posts_db.append(row)
    else:
        other_posts.append(title)

tot = len(ask_posts)+len(show_posts)+len(other_posts)
print('Number of posts that start with ask_hn:', len(ask_posts))
print('Number of posts that start with show_hn:', len(show_posts))
print('Number of other post types:', len(other_posts))
print(tot)

Number of posts that start with ask_hn: 9139
Number of posts that start with show_hn: 10158
Number of other post types: 273822
293119
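
The prefix check used above can also be wrapped in a small function, which makes it easy to test on individual titles. The sketch below is only an illustration: `classify_title` is a hypothetical helper and the sample titles are made up.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles, for illustration only
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'algorithmic music',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Lowercasing the title before the comparison is what lets variations such as `SHOW HN` or `ask hn` fall into the same group.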


Next, the average number of comments is calculated for ask posts and for show posts to see which type receives more.

In [4]:
total_ask_comments = 0
total_show_comments = 0

# Sum the num_comments column (index 4) for each group of posts
for row in ask_posts_db:
    total_ask_comments += int(row[4])

for row in show_posts_db:
    total_show_comments += int(row[4])

# Divide by the number of rows in the same lists that were summed
avg_ask_comments = total_ask_comments / len(ask_posts_db)
avg_show_comments = total_show_comments / len(show_posts_db)

print('Total number of comments in ask posts:',total_ask_comments)
print('Average number of comments in ask posts:',round(avg_ask_comments,2))
print('\n')
print('Total number of comments in show posts:',total_show_comments)
print('Average number of comments in show posts:',round(avg_show_comments,2))
print('\n')

Total number of comments in ask posts: 94986
Average number of comments in ask posts: 10.39


Total number of comments in show posts: 49633
Average number of comments in show posts: 4.89




Based on these results, ask posts receive more comments on average (10.39 versus 4.89 per post). This suggests that users are more inclined to comment on posts that ask the Hacker News community a specific question.
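
Since the same averaging logic is applied to both groups, it could be folded into one helper. The sketch below is a minimal illustration: `average_comments` is a hypothetical function, and the sample rows are made up but follow the dataset's column order.

```python
def average_comments(rows, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Made-up rows in the dataset's column order, for illustration only
sample_rows = [
    ['1', 'Ask HN: example A', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: example B', '', '1', '4', 'user_b', '9/26/2016 15:02'],
]
print(round(average_comments(sample_rows), 2))
```

Returning `0.0` for an empty list is a design choice that avoids a `ZeroDivisionError` when a group has no posts.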

Since ask posts are more likely to receive comments, the next step is to determine whether ask posts created at a certain time of day attract more comments.

To do so, two dictionaries are created:

`counts_by_hour`: contains the number of ask posts created during each hour of the day.  
`comments_by_hour`: contains the total number of comments received by the ask posts created during each hour.

In [5]:
import datetime as dt

result_list = []
counts_by_hour = {}
comments_by_hour = {}

for row in ask_posts_db:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

datetime_format = "%m/%d/%Y %H:%M"
for row in result_list:
    time = row[0]
    comment = row[1]
    dt_object = dt.datetime.strptime(time, datetime_format)
    dt_hour = dt_object.strftime("%H")
    
    if dt_hour not in counts_by_hour:
        counts_by_hour[dt_hour] = 1
        comments_by_hour[dt_hour]= comment
    
    else:
        counts_by_hour[dt_hour] += 1
        comments_by_hour[dt_hour] += comment

counts_by_hour_items = counts_by_hour.items()
sorted_counts_by_hour = sorted(counts_by_hour_items)

comments_by_hour_items = comments_by_hour.items()
sorted_comments_by_hour = sorted(comments_by_hour_items)

print("Amount of ask posts created per hour:\n", sorted_counts_by_hour)
print('\n')
print("Total amount of comments:\n", sorted_comments_by_hour)
print('\n') 

Amount of ask posts created per hour:
 [('00', 301), ('01', 282), ('02', 269), ('03', 271), ('04', 243), ('05', 209), ('06', 234), ('07', 226), ('08', 257), ('09', 222), ('10', 282), ('11', 312), ('12', 342), ('13', 444), ('14', 513), ('15', 646), ('16', 579), ('17', 587), ('18', 614), ('19', 552), ('20', 510), ('21', 518), ('22', 383), ('23', 343)]


Total amount of comments:
 [('00', 2277), ('01', 2089), ('02', 2996), ('03', 2154), ('04', 2360), ('05', 1838), ('06', 1587), ('07', 1585), ('08', 2362), ('09', 1477), ('10', 3013), ('11', 2797), ('12', 4234), ('13', 7245), ('14', 4972), ('15', 18525), ('16', 4466), ('17', 5547), ('18', 4877), ('19', 3954), ('20', 4462), ('21', 4500), ('22', 3372), ('23', 2297)]
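
The counting pattern above (check whether the key exists, then initialise or increment) can be shortened with `collections.defaultdict`. The sketch below is a variation on the approach rather than the code used in this notebook; the `sample` pairs are made up and only mimic the dataset's date format.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the dataset's date format
sample = [
    ('9/26/2016 3:26', 2),
    ('9/25/2016 15:02', 10),
    ('8/16/2016 15:45', 6),
]

# Missing keys start at 0, so no "if hour not in ..." branch is needed
counts = defaultdict(int)
comments = defaultdict(int)
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```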




The two dictionaries will be used to calculate the average number of comments per post for posts created during each hour of the day.

In [6]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_hour = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, avg_hour])
    
print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


In [7]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap= sorted(swap_avg_by_hour, reverse=True)

print(sorted_swap[:5])
print("\n")
print("Top 5 Hours for Ask Posts Comments:\n")

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], '%H')
    hour = hour.strftime('%H:00')
    string = '{h}: {avg:.2f} average comments per post'.format(h = hour, avg = row[0])
    print(string)
    

[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10']]


Top 5 Hours for Ask Posts Comments:

15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
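
Swapping the columns is one way to sort by the average. Another option, sketched here as an assumption about how the step could be simplified, is to pass a `key` function to `sorted` so the `[hour, average]` pairs keep their original order. The values in `avg_sample` are a subset of the averages printed above.

```python
# A few [hour, average] pairs taken from the averages computed above
avg_sample = [['02', 11.137546468401487], ['15', 28.676470588235293],
              ['13', 16.31756756756757], ['09', 6.653153153153153]]

# Sort by the average (second element), highest first, without swapping columns
ranked = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print('{h}:00: {a:.2f} average comments per post'.format(h=hour, a=avg))
```

The two approaches can differ only when two hours have exactly the same average: the swapped version then falls back to comparing the hour strings, while the `key` version keeps the input order.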


The timestamps in the dataset are in US Eastern Time. To see what the top hours correspond to in Central European Time (CET, used in France for example), which is 6 hours ahead of Eastern Time, each hour is shifted by 6 hours.

In [18]:
for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], '%H')
    Fr_time = hour + dt.timedelta(hours = 6)
    Fr_time = Fr_time.strftime('%H:00')
    string = '{h}: {a:.2f} average comments per post'.format(h = Fr_time, a = row[0])
    print(string)

21:00: 28.68 average comments per post
19:00: 16.32 average comments per post
18:00: 12.38 average comments per post
08:00: 11.14 average comments per post
16:00: 10.68 average comments per post
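
The same conversion can be expressed with timezone-aware `datetime` objects, which makes the offsets explicit and handles hours that roll over past midnight. The sketch below assumes standard time on both sides (UTC-5 and UTC+1); it ignores daylight saving time, during which the difference is still usually 6 hours but can briefly be 5 hours because the two regions switch on different dates. `est_hour_to_cet` is a hypothetical helper.

```python
import datetime as dt

# Fixed offsets for standard time only (assumption: no daylight saving adjustment)
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
CET = dt.timezone(dt.timedelta(hours=1), 'CET')

def est_hour_to_cet(hour_str):
    """Convert an hour string such as '15' from EST to CET, returned as 'HH:00'."""
    # The date is arbitrary; only the hour matters for this illustration
    est_time = dt.datetime(2016, 1, 15, int(hour_str), tzinfo=EST)
    return est_time.astimezone(CET).strftime('%H:00')

for h in ['15', '13', '12', '02', '10']:
    print(h + ':00 EST ->', est_hour_to_cet(h), 'CET')
```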


# CONCLUSION

A dataset of Hacker News posts was analysed to answer two questions:

### - Do Ask HN or Show HN posts receive more comments on average?
Based on the results, ask posts receive more comments on average (10.39 versus 4.89 per post). This suggests that users are more inclined to comment on posts that ask the Hacker News community a specific question.

### - Do posts created at a certain time receive more comments on average?
Based on the results, ask posts created at 3 pm **Eastern Time** received the most comments on average (28.68 per post).

In **Central European Time**, this corresponds to 9 pm.