# Best Time to Post for More Comments - Analyze Data from Technology Site HackerNews


Hacker News is extrmely popular in technology and startup circles, where top posts on its can get hundreds of thousands of visitors. We want to find out what is the best time to post in order to get more attention. Data can be downloaded [here](https://www.kaggle.com/hacker-news/hacker-news-posts)

Data structure:

|columns | description|
|--------| :------------|
| title | title of the post (self explanatory) |
| url | the url of the item being linked to |
| num_points| the number of upvotes the post received|
| num_comments| the number of comments the post received|
| author | the name of the account that made the post|
| create_at| the date and time the post was made(Eastern Time in the US)|

Titles begin with "Ask HN": users ask the Hacker News community a specfic question.
Titles begin with "Show HN": users show th Hacker News community a project, produce or just something interesting.
We're interested in if these two types of posts receive more comments on average. We also want to know if posts created at a certain tiem receive more comments on average?

In [1]:
import csv

In [2]:
# open the file
opened_file = open("HN_posts_year_to_Sep_26_2016.csv")
# read in the file
read_file = csv.reader(opened_file)
# conver to list
hn = list(read_file)
# display the first 5 rows
for row in hn[:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


In [3]:
# save the header
header = hn[0]
# remove the header row
hn = hn[1:]
# display the header
print(header)
# display the first 5 rows
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


### Filter data

**Create new lists of post with title begin with "Ask HN" or "Show HN"**

In [4]:
# create 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

In [5]:
# parse the data 
# and fill the lists accordingly
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [6]:
# check the number of posts
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822


### Determin if ask or show posts receive more comments on average

**Find total number of the ask posts**

In [7]:
print(ask_posts[:2])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']]


In [8]:
total_ask_comments = 0
for row in ask_posts:
    # convert number of comments to integer
    num_comments = int(row[4])
    total_ask_comments += num_comments

**Compute average comments of the ask posts**

In [9]:
avg_ask_comments = total_ask_comments/len(ask_posts)
# display the average number of comments
print(avg_ask_comments)

10.393478498741656


**Find total number of the show posts**

In [10]:
total_show_comemnts = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comemnts += num_comments

**Compute average number of the show posts**

In [11]:
avg_show_comments = total_show_comemnts/len(show_posts)
print(avg_show_comments)

4.886099625910612


Summary: Ask posts receive more comments on average(~20), while the ask posts receive ~5 comments on acerage.

------

### Do posts created on certain time attract more comments?

In [12]:
import datetime as dt

In [13]:
time = "9/26/2016 3:16"
time = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
time = time.strftime("%-H")
print(time)

3


**Save post creating time and number of comments to a list**

In [14]:
result_list = [] # list of lists
for row in ask_posts:
    # create a list of time and number of comments
    time_and_number = []
    # get post creation time
    created_at = row[6]
    # example of created time: 9/26/2016 3:16
    #created_at = dt.strptime(created_at, "%m/%d/%Y %H/%M")
    # get number of comments
    n_comments = row[3]
    # save the time and the number to the list
    time_and_number.append(created_at)
    time_and_number.append(n_comments)
    # append the time and number list to the result list
    result_list.append(time_and_number)

In [15]:
# check the contents of the result_list
print(len(result_list))
print(result_list[:2])

9139
[['9/26/2016 2:53', '4'], ['9/26/2016 1:17', '6']]


In [16]:
# Create two empty dictionaries
counts_by_hour = {}
comments_by_hour = {}

In [17]:
for item in result_list:
    # first element: date and time, e.g. 9/26/2016 2:53
    date = item[0] 
    # parse date and create a datetime object for it
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    # extract the hour
    hour = date.strftime("%-H")
    
    # second element is number of comments
    n_comments = float(item[1])
    
    # if the hour is not in the dictionary, 
    # create the key as hour and set it equal to 1
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    # if the hour already a key in counts_by_hour
    # increment the value in counts_by_hour by 1
    # increment the value in comments_by_hour by the coment number
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments

### Calculate the average number of comments per hour

In [18]:
print(counts_by_hour)

{'2': 269, '1': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '9': 222, '7': 226, '3': 271, '23': 343, '20': 510, '16': 579, '8': 257, '0': 301, '18': 614, '12': 342, '4': 243, '6': 234, '5': 209}


In [19]:
print(comments_by_hour)

{'2': 2944.0, '1': 2662.0, '22': 3601.0, '21': 5042.0, '19': 4782.0, '17': 7155.0, '15': 13978.0, '14': 5390.0, '13': 7962.0, '11': 2856.0, '10': 3789.0, '9': 1763.0, '7': 2040.0, '3': 2539.0, '23': 2616.0, '20': 4491.0, '16': 5970.0, '8': 2744.0, '0': 2835.0, '18': 6850.0, '12': 4643.0, '4': 2650.0, '6': 2030.0, '5': 2046.0}


In [33]:
# create an empty list of lists for result
ave_by_hour = []
for hour in comments_by_hour:
    # creat an empty list
    ave_hour = []
    # append hour as the first element of list ave_hour
    ave_hour.append(hour) 
    # calculate average
    ave = float(comments_by_hour[hour])/float(counts_by_hour[hour])
    #ave = round(ave)
    
    # append the average value 
    # as the second element of list ave_hour
    ave_hour.append(ave)
    
    # append the list to ave_by_hour
    ave_by_hour.append(ave_hour)

In [34]:
# print average number of comments
# per hour
for item in ave_by_hour:
    print(item)

['2', 10.944237918215613]
['1', 9.439716312056738]
['22', 9.402088772845953]
['21', 9.733590733590734]
['19', 8.66304347826087]
['17', 12.189097103918229]
['15', 21.637770897832816]
['14', 10.50682261208577]
['13', 17.93243243243243]
['11', 9.153846153846153]
['10', 13.436170212765957]
['9', 7.941441441441442]
['7', 9.026548672566372]
['3', 9.3690036900369]
['23', 7.626822157434402]
['20', 8.805882352941177]
['16', 10.310880829015543]
['8', 10.67704280155642]
['0', 9.418604651162791]
['18', 11.156351791530945]
['12', 13.576023391812866]
['4', 10.905349794238683]
['6', 8.675213675213675]
['5', 9.789473684210526]


### Sort and print first few highest

**Create a swap list of the ave_by_hour list:**

In [35]:
swap_ave_by_hour= []
for item in ave_by_hour:
    swap_item = []
    hour = item[0]
    ave = item[1]
    swap_item.append(ave)
    swap_item.append(hour)
    swap_ave_by_hour.append(swap_item)

In [36]:
print(swap_ave_by_hour)

[[10.944237918215613, '2'], [9.439716312056738, '1'], [9.402088772845953, '22'], [9.733590733590734, '21'], [8.66304347826087, '19'], [12.189097103918229, '17'], [21.637770897832816, '15'], [10.50682261208577, '14'], [17.93243243243243, '13'], [9.153846153846153, '11'], [13.436170212765957, '10'], [7.941441441441442, '9'], [9.026548672566372, '7'], [9.3690036900369, '3'], [7.626822157434402, '23'], [8.805882352941177, '20'], [10.310880829015543, '16'], [10.67704280155642, '8'], [9.418604651162791, '0'], [11.156351791530945, '18'], [13.576023391812866, '12'], [10.905349794238683, '4'], [8.675213675213675, '6'], [9.789473684210526, '5']]


**Sort the swap list**

In [37]:
sorted_swap = sorted(swap_ave_by_hour, reverse=True)
print(sorted_swap)

[[21.637770897832816, '15'], [17.93243243243243, '13'], [13.576023391812866, '12'], [13.436170212765957, '10'], [12.189097103918229, '17'], [11.156351791530945, '18'], [10.944237918215613, '2'], [10.905349794238683, '4'], [10.67704280155642, '8'], [10.50682261208577, '14'], [10.310880829015543, '16'], [9.789473684210526, '5'], [9.733590733590734, '21'], [9.439716312056738, '1'], [9.418604651162791, '0'], [9.402088772845953, '22'], [9.3690036900369, '3'], [9.153846153846153, '11'], [9.026548672566372, '7'], [8.805882352941177, '20'], [8.675213675213675, '6'], [8.66304347826087, '19'], [7.941441441441442, '9'], [7.626822157434402, '23']]


In [38]:
for item in sorted_swap[:5]:
    print(item)

[21.637770897832816, '15']
[17.93243243243243, '13']
[13.576023391812866, '12']
[13.436170212765957, '10']
[12.189097103918229, '17']


**Format and display**

In [46]:
template = "{}: {:.2f} average comments per post"
for item in sorted_swap[:5]:
    hour = item[1]
    hour = dt.datetime.strptime(hour, "%H")
    hour = hour.strftime("%H:%M")
    ave = item[0]
    print(template.format(hour, ave))

15:00: 21.64 average comments per post
13:00: 17.93 average comments per post
12:00: 13.58 average comments per post
10:00: 13.44 average comments per post
17:00: 12.19 average comments per post


Summary: In order to get a higer chance of receiving comments, posts are better to becreated in the afternoon, especially around 3p.m.