# Analysing Hacker News Posts

In this project, we analyse Hacker News posts to find out at what time of day a post receives the most comments on average. The dataset can be downloaded [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

We'll start by reading the CSV file into a list of lists that we can operate on. We then split the data into two parts: the header, which contains the column names, and the remaining rows.

In [1]:
from csv import reader
import datetime as dt

# Read every row while the file is open; the with block closes it afterwards.
with open('/Users/Tejas/csv_files/hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
header = hn[0]
hn = hn[1:]

In [2]:
header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
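The later cells index rows by position (`row[1]`, `posts[4]`, `posts[6]`). As a minimal sketch, the header can be turned into a name-to-index lookup so that those positions do not have to be remembered. The names `col` and `sample_row` are illustrative and not part of the original notebook.

```python
# Header as printed above; sample_row copies the first row of the dataset.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12579008',
              'You have two days to comment if you want stem cells to be classified as your own',
              'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
              '1', '0', 'altstar', '9/26/2016 3:26']

# Map each column name to its position in a row (hypothetical helper).
col = {name: index for index, name in enumerate(header)}

print(col['title'], col['num_comments'], col['created_at'])  # 1 4 6
print(sample_row[col['created_at']])                          # 9/26/2016 3:26
```

With such a lookup, `row[col['num_comments']]` reads the same value as `row[4]`.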

In [3]:
hn[:5]

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

Next, we split the posts into three separate lists: posts that ask the community a question (Ask HN), posts that show a project (Show HN), and all other posts. We do this by checking how each lowercased title begins, using the `startswith` method.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(f"""{ask_posts[:2]} 

{show_posts[:2]}""")
    

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']] 

[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']]
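The same prefix check can be sketched as a small function, which makes it easy to try on individual titles. `classify_title` is a hypothetical helper name, and the last title in the list is made up to show that lowercasing makes the match case-insensitive.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Finding puns computationally',
    'algorithmic music',
    'ASK HN: uppercase prefixes match too',  # made-up title for illustration
]

for title in titles:
    print(classify_title(title), '|', title)
```

Note that a prefix match also accepts titles such as "Ask HN" without a colon, which is the same behaviour as the loop above.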


We can then check the number of posts in each list:

In [5]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822


We'll now find how many comments a post in each list receives on average. We do this by summing the comments across all posts in a list and dividing that total by the number of posts in the list.

In [6]:
total_ask_comments = 0

# Column 4 holds num_comments as a string, so convert it before adding.
for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)

10.393478498741656


In [7]:
total_show_comments = 0

# Same calculation as above, this time for the Show HN posts.
for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = total_show_comments / len(show_posts)
avg_show_comments
    

4.886099625910612
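The two averaging cells above repeat the same loop. One possible way to avoid the duplication is a small function; `average_comments` and its `comments_index` parameter are hypothetical names, and the sample rows are the two Ask HN rows printed earlier.

```python
def average_comments(rows, comments_index=4):
    """Mean of the num_comments column over rows; 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Two Ask HN rows taken from the sample output earlier in the notebook.
sample_rows = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]

print(average_comments(sample_rows))  # (7 + 3) / 2 = 5.0
```

Calling it once with `ask_posts` and once with `show_posts` would be expected to give the two averages shown above.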

From this, we can see that ask posts receive more comments on average: about 10.4 per post, against about 4.9 for show posts. So we can leave out the show posts and carry out the rest of our analysis on the ask posts list.

For each ask post, we'll take the date it was created and the number of comments it received. We'll then compute, for each hour of the day, the total number of comments and the number of posts created in that hour. Dividing the first by the second gives the average number of comments per post for each hour, with a little bit of help from the `datetime` module.
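As a quick illustration of the parsing step, here is how a single `created_at` value can be turned into an hour key with `strptime` and `strftime`. The timestamp is the one from the first row of the dataset; `parsed` and `hour_key` are illustrative names.

```python
import datetime as dt

time_fmt = '%m/%d/%Y %H:%M'    # month/day/year hour:minute, 24-hour clock
created_at = '9/26/2016 3:26'  # value from the first row of the dataset

parsed = dt.datetime.strptime(created_at, time_fmt)
hour_key = parsed.strftime('%H')  # zero-padded hour as a string

print(parsed)    # 2016-09-26 03:26:00
print(hour_key)  # 03
```

Using the zero-padded string as the dictionary key is why the hours appear as `'02'`, `'15'` and so on in the results.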

In [8]:
result_list = []

for posts in ask_posts:
    result_list.append([posts[6], int(posts[4])])
    

    
counts_by_hour = {}
comments_by_hour = {}
time_fmt = '%m/%d/%Y %H:%M' 

for row in result_list:
    date = row[0]
    num_comments = row[1]
    time = dt.datetime.strptime(date, time_fmt).strftime('%H')
    
    if time not in counts_by_hour:
        comments_by_hour[time] = num_comments
        counts_by_hour[time] = 1
        
    else:
        comments_by_hour[time] += num_comments
        counts_by_hour[time] += 1
        
print(f"""{comments_by_hour} 

{counts_by_hour}""")


{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838} 

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
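The `if`/`else` branches above exist only to initialise a missing key. As an alternative sketch, `collections.defaultdict` handles the missing key automatically. The `sample` pairs below are made-up values for illustration, not rows from the dataset.

```python
from collections import defaultdict

# (hour, num_comments) pairs; made-up values for illustration only.
sample = [('15', 10), ('13', 4), ('15', 2), ('02', 6)]

comments_by_hour = defaultdict(int)  # missing hours start at 0
counts_by_hour = defaultdict(int)

for hour, num_comments in sample:
    comments_by_hour[hour] += num_comments
    counts_by_hour[hour] += 1

print(dict(comments_by_hour))  # {'15': 12, '13': 4, '02': 6}
print(dict(counts_by_hour))    # {'15': 2, '13': 1, '02': 1}
```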


In [9]:
avg_by_hour = []

for posts in comments_by_hour:
    avg_by_hour.append([posts, comments_by_hour[posts] / counts_by_hour[posts]])
    
avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

Once that's done, we sort the list in descending order of average and print the five hours with the highest average number of comments per post.

In [10]:
swap_list = []

for hour, avg in avg_by_hour:
    swap_list.append([avg, hour])
    

    
sorted_swap = sorted(swap_list, reverse=True)

for avg, hour in sorted_swap[:5]:
    # Format the hour string as HH:MM for display.
    hour_dt = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    print(f"At {hour_dt}, the avg num comments is: {avg:.2f}")
    

At 15:00, the avg num comments is: 28.68
At 13:00, the avg num comments is: 16.32
At 12:00, the avg num comments is: 12.38
At 02:00, the avg num comments is: 11.14
At 10:00, the avg num comments is: 10.68
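Swapping the columns is one way to sort by average. Another option, sketched below, is to pass a `key` function to `sorted` so the `[hour, average]` pairs can stay in their original order. The `avg_by_hour` list here is a six-entry subset of the values computed above, and `top` is an illustrative name.

```python
# Subset of the [hour, average] pairs computed earlier.
avg_by_hour = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['12', 12.380116959064328],
    ['10', 10.684397163120567],
    ['09', 6.653153153153153],
]

# Sort on the second element (the average), largest first, and keep three.
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top:
    print(f"{hour}:00 -> {avg:.2f} comments per post")
```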


From this, we can see that ask posts created at around 3 pm Eastern Time in the United States receive the most comments on average, about 28.68 per post.

In my time zone, the best time to post would be at around 1:30 am.
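The conversion to a local time can be sketched with fixed UTC offsets from the standard library. Everything about the zones here is an assumption: the notebook does not name the author's time zone, so a UTC+05:30 offset is used because it reproduces the 1:30 am figure, and Eastern Time is taken as UTC-05:00 (standard time). The date is arbitrary.

```python
import datetime as dt

# Fixed offsets that ignore daylight saving time (assumptions for illustration).
eastern_standard = dt.timezone(dt.timedelta(hours=-5), 'EST')
local_zone = dt.timezone(dt.timedelta(hours=5, minutes=30), 'UTC+05:30')  # assumed

# Arbitrary date; only the 15:00 hour comes from the analysis.
best_eastern = dt.datetime(2016, 1, 15, 15, 0, tzinfo=eastern_standard)
best_local = best_eastern.astimezone(local_zone)

print(best_local.strftime('%H:%M'))  # 01:30
```

While daylight saving time is in effect in the Eastern zone (UTC-04:00), the same calculation gives 00:30 instead.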


This marks the end of the project. Thank you so much for reading through it.