# Hacker News: Optimizing Posts for More Comments
## Introduction
In this project I analyze a set of roughly 293,000 Hacker News posts. My goal is to figure out which posts are the most popular, measured by the number of comments they receive, and what contributes to that.

In [1]:
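import csv

# Read the whole file into a list of rows; the context manager closes the file
with open("Hacker_news.csv") as hn_open:
    hn_read = list(csv.reader(hn_open))

hn_header = hn_read[0]  # column names
hn = hn_read[1:]        # data rows
print(hn_header)
print("\n")
print(hn[:5])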

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
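
As a side note, the header/data split can also be done while reading, without slicing the full list. The following is only a minimal sketch: it uses an in-memory `io.StringIO` object (an assumption made so the example runs without the CSV file), and the names `sample_file`, `header`, and `rows` are illustrative.

```python
import csv
import io

# In-memory stand-in for the CSV file; the data row is copied from the output above
sample_file = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
)

reader = csv.reader(sample_file)
header = next(reader)   # first row holds the column names
rows = list(reader)     # everything after the header

print(header)
print(rows[0][1], rows[0][4])
```

Every field comes back as a string, which is why the comment counts are converted with `int()` later on.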


Next, I separate the posts into `ask_posts`, `show_posts`, and `other_posts`, and check how many posts fall into each category.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if (title.lower().startswith("ask hn")):
        ask_posts.append(row)
    elif (title.lower().startswith("show hn")):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


9139
10158
273822
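
The prefix check used above can be isolated in a small function, which makes it easy to try on individual titles. This is a sketch only: `classify_title` is a hypothetical helper, and the first two sample titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# The first two titles are invented; the third appears in the dataset preview
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "SQLAR  the SQLite Archiver",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Lowercasing the title once keeps the comparison case-insensitive, so "SHOW HN" and "Show HN" land in the same category.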


Now let's find out which type of post gets more comments on average.

In [3]:
# sum the number of comments over all ask posts
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
# print the average number of comments per ask post
avg_ask_comments = total_ask_comments / len(ask_posts)
print(round(avg_ask_comments, 2))


# sum the number of comments over all show posts
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
# print the average number of comments per show post
avg_show_comments = total_show_comments / len(show_posts)
print(round(avg_show_comments, 2))

10.39
4.89


On average, **Ask HN** posts receive 10.39 comments per post, while **Show HN** posts receive only 4.89. This suggests that people are more willing to comment on **Ask HN** posts than on **Show HN** posts. For that reason, the next steps analyze **only Ask HN posts.**
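
Because the same average is computed for two lists, one option is a single helper that takes its numerator and denominator from the same list. The sketch below is not part of the original analysis: `average_comments` is a hypothetical name, and the rows are made-up data laid out in the dataset's column order.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Tiny made-up rows: id, title, url, num_points, num_comments, author, created_at
rows = [
    ["1", "Ask HN: a", "", "1", "4", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: b", "", "1", "6", "user_b", "9/26/2016 15:26"],
]
avg = average_comments(rows)
print(avg)  # 5.0
```

With the real data, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.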

Next I will calculate the number of ask posts created in each hour of the day, along with the number of comments they received. I will then calculate the average number of comments ask posts receive by the hour they were created, using the `datetime` module.

In [4]:
import datetime as dt
result_list = []

# append created_at and number of comments to result_list
for row in ask_posts:
    result_list.append((row[6], int(row[4])))

# create 2 dictionaries
counts_by_hour = {}
comments_by_hour = {}


for row in result_list:
    date = row[0]
    comment = row[1]
    new_date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = new_date.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

print(counts_by_hour)
print("\n")
print(comments_by_hour)

{2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}


{2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}
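
The same per-hour tallies can be built without the explicit `if hour not in ...` branch by using `collections.defaultdict`. This is an alternative sketch rather than the approach taken above; the `sample` pairs are made up, but they follow the dataset's `created_at` format.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the dataset's date format
sample = [("9/26/2016 3:26", 2), ("9/25/2016 3:05", 5), ("9/26/2016 15:40", 10)]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # comments received by posts created in each hour

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {3: 2, 15: 1}
print(dict(comments))  # {3: 7, 15: 10}
```

Missing keys start at 0, so the first post seen for an hour needs no special handling.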


Now I am going to determine the average number of comments per post for each hour of the day, to see whether there is any advantage in choosing a specific time to post.

In [5]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, round((comments_by_hour[hour]/counts_by_hour[hour]), 2)])

print(avg_by_hour)

[[2, 11.14], [1, 7.41], [22, 8.8], [21, 8.69], [19, 7.16], [17, 9.45], [15, 28.68], [14, 9.69], [13, 16.32], [11, 8.96], [10, 10.68], [9, 6.65], [7, 7.01], [3, 7.95], [23, 6.7], [20, 8.75], [16, 7.71], [8, 9.19], [0, 7.56], [18, 7.94], [12, 12.38], [4, 9.71], [6, 6.78], [5, 8.79]]


Although I now have the results I need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [6]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)

[[11.14, 2], [7.41, 1], [8.8, 22], [8.69, 21], [7.16, 19], [9.45, 17], [28.68, 15], [9.69, 14], [16.32, 13], [8.96, 11], [10.68, 10], [6.65, 9], [7.01, 7], [7.95, 3], [6.7, 23], [8.75, 20], [7.71, 16], [9.19, 8], [7.56, 0], [7.94, 18], [12.38, 12], [9.71, 4], [6.78, 6], [8.79, 5]]


In [10]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")


for item in sorted_swap[:5]:
    hour = dt.datetime.strptime(str(item[1]), "%H")
    hour = hour.strftime("%H:%M")
    avg = item[0]
    print("{}: {} average comments per post".format(hour, avg))

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
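
The swap step exists only so that `sorted` compares the averages first. A `key` function achieves the same ordering on the original `[hour, average]` pairs. The sketch below assumes a shortened copy of `avg_by_hour` with six of the pairs printed earlier; `top` and `lines` are illustrative names.

```python
import datetime as dt

# Six [hour, average] pairs copied from the avg_by_hour output above
avg_by_hour = [[2, 11.14], [15, 28.68], [13, 16.32], [12, 12.38], [10, 10.68], [9, 6.65]]

# Sort by the average (second element), highest first, and keep the top five
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

lines = []
for hour, avg in top:
    label = dt.time(hour=hour).strftime("%H:%M")  # e.g. 15 -> "15:00"
    lines.append("{}: {:.2f} average comments per post".format(label, avg))

print("\n".join(lines))
```

Using `dt.time(hour=hour)` also avoids converting the hour to a string and parsing it back.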
