# Hacker News Post Analysis

This analysis has three aims:

1. Compare the average number of comments on `Ask Posts` and `Show Posts`
2. Examine the hourly distribution of the average number of comments on `Ask Posts`
3. List ideas for digging deeper to reach better conclusions

- Hacker News Posts data [source](https://www.kaggle.com/hacker-news/hacker-news-posts)


In [1]:
def explore_data(dataset, start, end, row_column_count = False):

    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if row_column_count:
        print('Rows:    ',len(dataset))
        print('Columns: ',len(dataset[0]))

In [2]:
from csv import reader

# Open the file in a with-block so it is closed once the rows are read
with open('../my_datasets/HN_posts_year_to_Sep_26_2016.csv', encoding='utf8') as file:
    hn = list(reader(file))

explore_data(hn,0,6,True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Rows:      293120
Columns:   7

In [37]:
headers = hn[0]
hn = hn[1:]
explore_data(hn,0,5)

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']




### Number of each type of post on HN

In [7]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('ask_posts:', len(ask_posts)) 
print('show_posts:', len(show_posts))
print('other_posts:',len(other_posts))

ask_posts: 9139
show_posts: 10158
other_posts: 273822


### See if `ask_posts` or `show_posts` receive more comments on average

In [10]:
total_ask_comments = 0

for row in ask_posts:
    num_comment = int(row[4])
    total_ask_comments += num_comment

avg_ask_comments = round(total_ask_comments / len(ask_posts), 2)
print('Average comments on ask posts: ', avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    num_comment = int(row[4])
    total_show_comments += num_comment
    
avg_show_comments = round(total_show_comments / len(show_posts), 2)
print('Average comments on show posts: ', avg_show_comments)

Average comments on ask posts:  10.39
Average comments on show posts:  4.89


- On average, **ask posts** receive a little more than twice as many comments as **show posts** (10.39 vs. 4.89 comments per post).
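
The two loops above repeat the same calculation, so it could be factored into one helper. The following is a minimal sketch, not part of the original notebook: the function name `average_comments` and the sample rows are made up for illustration, and the column index 4 is assumed to hold `num_comments` as in the header row shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, rounded to 2 places."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return round(total / len(posts), 2)


# Made-up rows in the same column order as the dataset (illustration only)
sample_posts = [
    ['1', 'Ask HN: example one', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: example two', '', '1', '5', 'user_b', '9/26/2016 4:10'],
]

print('Average comments on sample posts: ', average_comments(sample_posts))
```

With the real data, the same helper could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.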

### Hourly distribution of the number of ask posts and their comments

In [18]:
import datetime as dt

result_list = []

# Collect the creation time and comment count of every ask post
for row in ask_posts:
    created_at = row[6]
    num_comment = int(row[4])
    result_list.append([created_at, num_comment])
    
counts_by_hour = {}
comments_by_hour = {}

# Count posts and sum comments for each hour of the day
for row in result_list:
    created_at = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created_at.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print('Number of posts by hour= ', sorted(counts_by_hour.items()), '\n')
print('Number of comments by hour= ', sorted(comments_by_hour.items()))

Number of posts by hour=  [(0, 301), (1, 282), (2, 269), (3, 271), (4, 243), (5, 209), (6, 234), (7, 226), (8, 257), (9, 222), (10, 282), (11, 312), (12, 342), (13, 444), (14, 513), (15, 646), (16, 579), (17, 587), (18, 614), (19, 552), (20, 510), (21, 518), (22, 383), (23, 343)] 

Number of comments by hour=  [(0, 2277), (1, 2089), (2, 2996), (3, 2154), (4, 2360), (5, 1838), (6, 1587), (7, 1585), (8, 2362), (9, 1477), (10, 3013), (11, 2797), (12, 4234), (13, 7245), (14, 4972), (15, 18525), (16, 4466), (17, 5547), (18, 4877), (19, 3954), (20, 4462), (21, 4500), (22, 3372), (23, 2297)]


In [23]:
avg_by_hour = []

for key in counts_by_hour:
    avg_by_hour.append([key, round(comments_by_hour[key] / counts_by_hour[key], 2)])

print("Average number of comments per post by hour= ", sorted(avg_by_hour))

Average number of comments per post by hour=  [[0, 7.56], [1, 7.41], [2, 11.14], [3, 7.95], [4, 9.71], [5, 8.79], [6, 6.78], [7, 7.01], [8, 9.19], [9, 6.65], [10, 10.68], [11, 8.96], [12, 12.38], [13, 16.32], [14, 9.69], [15, 28.68], [16, 7.71], [17, 9.45], [18, 7.94], [19, 7.16], [20, 8.75], [21, 8.69], [22, 8.8], [23, 6.7]]
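
The counting and averaging steps above can also be written in a single pass. This is a sketch under assumptions, not the notebook's own method: it swaps the explicit `if hour not in ...` checks for `collections.defaultdict`, the function name `average_comments_by_hour` is hypothetical, and the sample rows are made up.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(posts, date_index=6, comments_index=4):
    """Map each creation hour to the average number of comments per post."""
    counts = defaultdict(int)    # hour -> number of posts
    comments = defaultdict(int)  # hour -> total number of comments
    for row in posts:
        hour = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").hour
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    return {hour: round(comments[hour] / counts[hour], 2)
            for hour in sorted(counts)}


# Made-up rows in the same column order as the dataset (illustration only)
sample_posts = [
    ['1', 'Ask HN: a', '', '1', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: b', '', '1', '5', 'user_b', '9/25/2016 3:02'],
    ['3', 'Ask HN: c', '', '1', '30', 'user_c', '9/24/2016 15:40'],
]

print(average_comments_by_hour(sample_posts))
```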


### Average number of comments on `Ask Posts` by hour

In [36]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

print("Swapped average by hour = ", swap_avg_by_hour)
print('\n')
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Post Comments:")
print("----------------------------------")      

# EDT is UTC-4 and CEST is UTC+2, hence the 6-hour shift
EDT_to_CEST = dt.timedelta(hours=6)

# Only the five highest averages are printed, as the heading says
for row in sorted_swap[:5]:
    avg_comments = row[0]
    h = str(row[1])
    hour = dt.datetime.strptime(h, '%H') + EDT_to_CEST 
    hour = hour.strftime('%H:%M')
    print("{} = {} average comments per post".format(hour,avg_comments))

Swapped average by hour =  [[11.14, 2], [7.41, 1], [8.8, 22], [8.69, 21], [7.16, 19], [9.45, 17], [28.68, 15], [9.69, 14], [16.32, 13], [8.96, 11], [10.68, 10], [6.65, 9], [7.01, 7], [7.95, 3], [6.7, 23], [8.75, 20], [7.71, 16], [9.19, 8], [7.56, 0], [7.94, 18], [12.38, 12], [9.71, 4], [6.78, 6], [8.79, 5]]


Top 5 Hours for Ask Post Comments:
----------------------------------
21:00 = 28.68 average comments per post
19:00 = 16.32 average comments per post
18:00 = 12.38 average comments per post
08:00 = 11.14 average comments per post
16:00 = 10.68 average comments per post

- According to this data, ask posts created during the hours starting at **21:00, 19:00, 18:00, 08:00, and 16:00** receive the most comments on average. (Times have been converted from Eastern Time to Central European Summer Time.)
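
The ranking can also be done without building a swapped list, by sorting on the average directly. The sketch below is an alternative, not the notebook's own code: the name `top_hours` is hypothetical, the 6-hour offset is the same Eastern-to-CEST shift assumed above, and the sample pairs are a handful of the `[hour, average]` values printed earlier.

```python
def top_hours(avg_by_hour, n=5, offset_hours=6):
    """Return the n [hour, average] pairs with the highest average,
    with each hour shifted by offset_hours and formatted as HH:00."""
    ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
    # The modulo keeps shifted hours inside 0-23 (for example 22 + 6 -> 4)
    return [("{:02d}:00".format((hour + offset_hours) % 24), avg)
            for hour, avg in ranked[:n]]


# A few [hour, average] pairs taken from the output above
sample = [[2, 11.14], [10, 10.68], [12, 12.38],
          [13, 16.32], [15, 28.68], [22, 8.8]]

for hour, avg in top_hours(sample, n=3):
    print("{} = {} average comments per post".format(hour, avg))
```

Sorting with `key=` keeps each pair in its original `[hour, average]` order, so no second list is needed.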


#### Ideas to dig deeper:
1. Check the time distribution for each day of the week
2. Visualize the time distributions
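
As a starting point for the first idea, the per-hour grouping could be repeated per weekday. This is only a sketch under assumptions: the function name `average_comments_by_weekday` is hypothetical, the rows are made up, and the date format is assumed to match the `created_at` column shown earlier.

```python
import datetime as dt
from collections import defaultdict

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def average_comments_by_weekday(posts, date_index=6, comments_index=4):
    """Map each weekday name to the average number of comments per post."""
    totals = defaultdict(lambda: [0, 0])  # weekday -> [posts, comments]
    for row in posts:
        created = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        day = WEEKDAYS[created.weekday()]  # weekday() returns 0 for Monday
        totals[day][0] += 1
        totals[day][1] += int(row[comments_index])
    return {day: round(comments / count, 2)
            for day, (count, comments) in totals.items()}


# Made-up rows in the same column order as the dataset (illustration only)
sample_posts = [
    ['1', 'Ask HN: a', '', '1', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: b', '', '1', '4', 'user_b', '9/26/2016 15:02'],
    ['3', 'Ask HN: c', '', '1', '3', 'user_c', '9/25/2016 9:40'],
]

print(average_comments_by_weekday(sample_posts))
```

Grouping on the pair `(weekday, hour)` instead of the weekday alone would give the hourly distribution for each day.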
