# Hacker News - Analysis of Posts

Hacker News is a popular website in the technology domain where users can submit posts and receive comments and votes on them. The dataset containing the posts can be downloaded from [Kaggle](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). This data science project analyzes the posts to answer the question of whether Ask HN or Show HN posts receive more comments. It also examines whether posts published at a certain time of day receive more comments on average.

Let's read in the dataset first and print the first 5 rows.

In [16]:
from csv import reader

# Read the whole CSV into a list of lists; the with-block closes the file afterwards
with open("./data/HN_posts_year_to_Sep_26_2016.csv", encoding="utf8") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


Let's extract the headers into a separate list and remove the header row from the dataset.

In [17]:
headers = hn[0]  # run only once: after the slice below, hn[0] is no longer the header row
print(headers)
print('\n')

hn = hn[1:]  # run only once: each rerun drops another data row
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


Count the number of posts whose titles begin with *ask hn* or *show hn* (case-insensitive), and the number of all other posts.

In [18]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
len_ask = len(ask_posts)
len_show = len(show_posts)
len_other = len(other_posts)

print("Number of posts in ask_posts: ", len_ask)
print("Number of posts in show_posts: ", len_show)
print("Number of posts in other_posts: ", len_other)

Number of posts in ask_posts:  9139
Number of posts in show_posts:  10158
Number of posts in other_posts:  273822
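
The prefix check above can also be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the function name `classify_title` is hypothetical, and the sample titles are adapted from the output shown in this notebook (one with changed casing to show that the match is case-insensitive).

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Sample titles for illustration only (hypothetical mini-dataset)
sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'SHOW HN: Finding puns computationally',
    'algorithmic music',
]

# Count how many sample titles fall into each category
sample_counts = {'ask': 0, 'show': 0, 'other': 0}
for sample_title in sample_titles:
    sample_counts[classify_title(sample_title)] += 1

print(sample_counts)
```

Keeping the rule in one function means the lowercase conversion happens once per title and the rule can be checked in isolation.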


Let's print the first five rows of each newly created list.

In [19]:
print(ask_posts[:5])
print('\n')
print(show_posts[:5])
print('\n')
print(other_posts[:5])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]


[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/

Calculate the average number of comments for *ask_posts* and *show_posts*.

In [27]:
total_ask_comments = 0

for row in ask_posts:
    num_ask_comments = int(row[4])
    total_ask_comments += num_ask_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average of ask_comments: ",avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    num_show_comments = int(row[4])
    total_show_comments += num_show_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print("Average of show_comments: ", avg_show_comments)

Average of ask_comments:  10.393478498741656
Average of show_comments:  4.886099625910612
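
Since the two loops above differ only in the list they iterate over, the calculation could be expressed once. The following is a sketch under assumptions: the helper name `average_comments` and its `comments_index` parameter are hypothetical, and returning `0.0` for an empty list is a choice made here, not something the notebook specifies. The three sample rows are copied from the `ask_posts` output above.

```python
def average_comments(posts, comments_index=4):
    """Average of the integer values in column `comments_index` over `posts`."""
    if not posts:
        return 0.0  # assumption: treat an empty list as an average of 0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Rows copied from the ask_posts output shown earlier
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]

# (7 + 3 + 0) / 3 comments on average for the three sample rows
print(average_comments(sample_ask))
```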


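On average, ask posts receive about 10.4 comments, a little more than twice the roughly 4.9 comments that show posts receive. Ask posts are therefore more likely to attract comments, so the rest of the analysis focuses on them.

Next, count the ask posts and their comments by the hour of day in which they were created.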

In [34]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_ask_comments = int(row[4])
    result_list.append([created_at, num_ask_comments])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at = row[0]
    comments = row[1]
    # Parse the timestamp (e.g. "9/26/2016 2:53") and keep only the hour
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    
print("Counts by hour: ", counts_by_hour)
print('\n')
print("Comments by hour: ", comments_by_hour)

Counts by hour:  {2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}


Comments by hour:  {2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}
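
The `if hour not in ...` branching above can be avoided with `collections.defaultdict`, which starts every missing key at 0. This is a minimal sketch on four `(created_at, num_comments)` pairs taken from the `ask_posts` output above; the `sample_` names are hypothetical and not part of the notebook.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs taken from the ask_posts output shown earlier
sample_rows = [
    ('9/26/2016 2:53', 7),
    ('9/26/2016 1:17', 3),
    ('9/25/2016 22:57', 0),
    ('9/25/2016 22:48', 3),
]

# defaultdict(int) returns 0 for unseen hours, so no membership check is needed
sample_counts_by_hour = defaultdict(int)
sample_comments_by_hour = defaultdict(int)

for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    sample_counts_by_hour[hour] += 1
    sample_comments_by_hour[hour] += num_comments

print(dict(sample_counts_by_hour))
print(dict(sample_comments_by_hour))
```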


Calculate the average number of comments per ask post for each hour of the day.

In [37]:
avg_by_hour = []

# Both dictionaries share the same keys, so one loop variable is enough
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(avg_by_hour)
print(len(avg_by_hour)) #check if the length is 24 to cover every hour of the day

[[2, 11.137546468401487], [1, 7.407801418439717], [22, 8.804177545691905], [21, 8.687258687258687], [19, 7.163043478260869], [17, 9.449744463373083], [15, 28.676470588235293], [14, 9.692007797270955], [13, 16.31756756756757], [11, 8.96474358974359], [10, 10.684397163120567], [9, 6.653153153153153], [7, 7.013274336283186], [3, 7.948339483394834], [23, 6.696793002915452], [20, 8.749019607843136], [16, 7.713298791018998], [8, 9.190661478599221], [0, 7.5647840531561465], [18, 7.94299674267101], [12, 12.380116959064328], [4, 9.7119341563786], [6, 6.782051282051282], [5, 8.794258373205741]]
24
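
As a quick check of the division, the same averages can be computed with a dictionary comprehension. The sketch below uses three of the hourly totals printed above (15, 13 and 12 o'clock); the names `top_counts`, `top_comments` and `top_avg` are hypothetical.

```python
# Counts and comment totals copied from the outputs above for three hours
top_counts = {15: 646, 13: 444, 12: 342}
top_comments = {15: 18525, 13: 7245, 12: 4234}

# Average comments per post: total comments divided by number of posts, per hour
top_avg = {hour: top_comments[hour] / top_counts[hour] for hour in top_counts}

for hour, average in top_avg.items():
    print("{h:02d}:00: {a:.2f} average comments".format(h=hour, a=average))
```

A dictionary keyed by hour keeps each average next to its hour without relying on list positions.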


Sort the hours by their average number of comments and print them in descending order.

In [55]:
swap_avg_by_hour = []

for row in avg_by_hour:
    new_order = [row[1], row[0]]
    swap_avg_by_hour.append(new_order)
    
print(swap_avg_by_hour)
print('\n')

# Lists compare element by element, so this sorts by the average first
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top Hours for Ask Posts Comments")

for average, hour in sorted_swap:
    hour_o = dt.datetime.strptime(str(hour), "%H")
    hour_h = hour_o.strftime("%H:%M")
    
    print("{h}: {a:.2f} average comments".format(h=hour_h, a=average))

    

[[11.137546468401487, 2], [7.407801418439717, 1], [8.804177545691905, 22], [8.687258687258687, 21], [7.163043478260869, 19], [9.449744463373083, 17], [28.676470588235293, 15], [9.692007797270955, 14], [16.31756756756757, 13], [8.96474358974359, 11], [10.684397163120567, 10], [6.653153153153153, 9], [7.013274336283186, 7], [7.948339483394834, 3], [6.696793002915452, 23], [8.749019607843136, 20], [7.713298791018998, 16], [9.190661478599221, 8], [7.5647840531561465, 0], [7.94299674267101, 18], [12.380116959064328, 12], [9.7119341563786, 4], [6.782051282051282, 6], [8.794258373205741, 5]]


Top Hours for Ask Posts Comments
15:00: 28.68 average comments
13:00: 16.32 average comments
12:00: 12.38 average comments
02:00: 11.14 average comments
10:00: 10.68 average comments
04:00: 9.71 average comments
14:00: 9.69 average comments
17:00: 9.45 average comments
08:00: 9.19 average comments
11:00: 8.96 average comments
22:00: 8.80 average comments
05:00: 8.79 average comments
20:00: 8.75 average 

From this analysis we can conclude that the best time to create an ask post is by far 15:00, with 28.68 comments on average, followed by 13:00 (16.32) and 12:00 (12.38). The times in the dataset are in US Eastern Time (see the [documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts)), which is six hours behind Central European Summer Time (Germany) while both regions observe daylight saving time. In CEST, the best time to get the most comments on an ask post is therefore 21:00, followed by 19:00 and 18:00.
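
The time zone arithmetic can be reproduced with fixed UTC offsets. This is a sketch under a stated assumption: both zones are on daylight saving time, so Eastern Time is EDT (UTC-4) and Central European Time is CEST (UTC+2). The helper name `edt_hour_to_cest` is hypothetical.

```python
import datetime as dt

# Assumption: both zones are on daylight saving time, i.e. EDT (UTC-4) and CEST (UTC+2)
EDT = dt.timezone(dt.timedelta(hours=-4), name="EDT")
CEST = dt.timezone(dt.timedelta(hours=2), name="CEST")


def edt_hour_to_cest(hour):
    """Convert an hour of day in EDT to the corresponding hour of day in CEST."""
    # The date is arbitrary; with fixed offsets only the hour matters
    moment = dt.datetime(2016, 9, 26, hour, 0, tzinfo=EDT)
    return moment.astimezone(CEST).hour


for hour in (15, 13, 12):
    print("{:02d}:00 EDT -> {:02d}:00 CEST".format(hour, edt_hour_to_cest(hour)))
```

Fixed offsets ignore the weeks in which only one of the two regions has switched to or from daylight saving time. For arbitrary dates, the standard-library `zoneinfo` module (Python 3.9+) with named zones handles those transitions.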