# Exploring Hacker News Posts

### Loading of the data set

We load the Hacker News posts data set from the CSV file `HN_posts_year_to_Sep_26_2016.csv`.
Afterward, we print the first five rows to get a first overview.

In [7]:
from csv import reader

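# Read the whole CSV file into a list of rows (each row is a list of strings).
# The with-statement closes the file once the rows are loaded.
with open("additional_files/HN_posts_year_to_Sep_26_2016.csv", encoding="utf8") as file_op:
    hn = list(reader(file_op))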

for i in range(0,5):
    print(hn[i])
    print("\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']




Because the first row of the data set contains the header line, we separate it from the data and store it in the `headers` variable.

In [8]:
headers = hn[0]
hn = hn[1:]
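
As a minimal sketch of the same loading step, the header row can also be split off while reading by calling `next()` on the reader. The helper `load_rows` and the in-memory sample below are hypothetical and only stand in for the real CSV file.

```python
import csv
import io

# Hypothetical one-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,Example title,http://example.com,1,0,someone,9/26/2016 3:26\n"
)


def load_rows(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    rows = csv.reader(file_obj)
    headers = next(rows)  # the first row is the header line
    return headers, list(rows)


sample_headers, sample_rows = load_rows(sample_csv)
print(sample_headers)
print(len(sample_rows))
```

With a real file, the same helper could be called inside `with open(path, encoding="utf8", newline="") as f:` so that the file is closed after reading.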

***
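### Filtering the data of interest

We create three lists, `ask_posts`, `show_posts`, and `other_posts`, which contain the posts whose titles start with "ask hn", the posts whose titles start with "show hn", and all other posts, respectively. The comparison is case-insensitive because each title is lowercased first.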

In [12]:
ask_posts = []
show_posts = []
other_posts = []

print(("total length of data set:",len(hn)))
print("\n")

for row in hn:
    name = row[1]
    name = name.lower()
    if name.startswith("ask hn"):
        ask_posts.append(row)
    elif name.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(("length of ask_posts:",len(ask_posts)))
print(("length of show_posts:",len(show_posts)))
print(("length of other_posts:",len(other_posts)))

('total length of data set:', 293119)


('length of ask_posts:', 9139)
('length of show_posts:', 10158)
('length of other_posts:', 273822)
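
The prefix check can also be isolated in a small function, which makes it easy to try on a few titles. This is only a sketch: `classify_title` and the sample titles are hypothetical names and data, not part of the notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the lowercased title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical sample titles for illustration.
sample_titles = [
    "Ask HN: How do you learn?",
    "Show HN: My project",
    "algorithmic music",
]

groups = {"ask": [], "show": [], "other": []}
for title in sample_titles:
    groups[classify_title(title)].append(title)

print({kind: len(titles) for kind, titles in groups.items()})
```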


***
### Analysis: Do **'ask posts'** or **'show posts'** receive more comments on average?

In [17]:
total_ask_comments = 0

for row in ask_posts:
    num_comm = int(row[4])
    total_ask_comments += num_comm
    
avg_ask_comments = total_ask_comments/len(ask_posts)
print("Average number of comments on 'ask hn' posts is {:.2f}".format(avg_ask_comments))

total_show_comments = 0

for row in show_posts:
    num_comm = int(row[4])
    total_show_comments += num_comm

avg_show_comments = total_show_comments/len(show_posts)
print("Average number of comments on 'show hn' posts is {:.2f}".format(avg_show_comments))

Average number of comments on 'ask hn' posts is 10.39
Average number of comments on 'show hn' posts is 4.89


We can therefore conclude that **'ask posts'** receive more comments on average than **'show posts'** (10.39 versus 4.89 per post).
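
Since the same averaging loop is written twice, one possible refactoring is a small helper built on `sum()`. The function `average_comments` and the two sample rows are hypothetical; the column index 4 matches `num_comments` in the header row shown earlier.

```python
def average_comments(rows, comments_index=4):
    """Return the mean of the comment counts, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Hypothetical rows in the same column order as the data set.
sample_ask_posts = [
    ["1", "Ask HN: first", "", "1", "4", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: second", "", "2", "6", "user_b", "9/26/2016 4:10"],
]

print("{:.2f}".format(average_comments(sample_ask_posts)))
```

Returning 0.0 for an empty list is a design choice of this sketch; it avoids a `ZeroDivisionError` if a filter matches no posts.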

***
### Further Analysis of ask posts

Next, we'll determine if ask posts created at a certain time are more likely to attract comments.

In [24]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]    # creation timestamp, e.g. '9/26/2016 3:26'
    num_comments = row[4]  # number of comments (still a string here)
    result_list.append([created_at,num_comments])

counts_by_hour = {}
comments_by_hour = {}

for elem in result_list:
    dt_obj = dt.datetime.strptime(elem[0],"%m/%d/%Y %H:%M")
    curr_hour = dt_obj.strftime("%H")
    if curr_hour not in counts_by_hour:
        counts_by_hour[curr_hour] = 1
        comments_by_hour[curr_hour] = int(elem[1])
    else:
        counts_by_hour[curr_hour] += 1
        comments_by_hour[curr_hour] += int(elem[1])

print(counts_by_hour)
print("\n")
print(comments_by_hour)
print("\n")

avg_by_hour = []
for key in counts_by_hour:
    frequ = counts_by_hour[key]
    comm_num = comments_by_hour[key]
    avg_by_hour.append([key,comm_num/frequ])

print(avg_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16',
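
The counting pattern above can be sketched more compactly with `collections.defaultdict`, which removes the need for the `if ... not in` branch. The sample posts are hypothetical, and unlike the cell above this sketch keys the dictionaries by the integer hour rather than a zero-padded string.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs for illustration.
sample_posts = [
    ["9/26/2016 3:26", "4"],
    ["9/25/2016 3:05", "6"],
    ["9/26/2016 15:10", "10"],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample_posts:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += int(num_comments)

# Average comments per post for each hour that has at least one post.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```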

To make the results easier to read, we sort them from the highest to the lowest average and print the top five hours.

In [32]:
swap_avg_by_hour = []
for elem in avg_by_hour:
    swap_avg_by_hour.append([elem[1],elem[0]])

sorted_swap = sorted(swap_avg_by_hour,reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for elem in sorted_swap[0:5]:
    dt_obj = dt.datetime.strptime(elem[1],"%H")
    clock = dt_obj.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(clock,elem[0]))

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
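
Swapping the columns is one way to sort by the average. An alternative sketch passes a `key` function to `sorted()` so the `[hour, average]` pairs can stay in their original order. The sample values below are rounded from the output above and are used only for illustration.

```python
# [hour, average] pairs, rounded from the results printed earlier.
sample_avg_by_hour = [["02", 11.14], ["15", 28.68], ["13", 16.32], ["09", 6.65]]

# Sort by the second element (the average), highest first.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```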


### Conclusion

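In summary, the best time frame to submit an 'ask hn' post is between **12:00 and 15:00**: posts created in the 15:00 hour receive by far the most comments on average (28.68 per post), followed by the 13:00 and 12:00 hours.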