# Determining optimal posting times for receiving the highest number of comments

In this project we will look into posts on Hacker News (HN). We want to examine the data and answer a few questions, such as:
- Do posts that ask Hacker News a question receive more comments on average than posts where the user wants to show Hacker News something they find interesting?
- Do **ask hn** or **show hn** posts created at certain times receive more comments on average?

We will sort the data into **ask hn** and **show hn** posts, calculate the average number of comments each category receives, and focus on the more active of the two. For that category, we will calculate the average number of comments per post for each hour of the day. We will then present the top 5 posting hours for receiving the most comments.


The [data set](https://www.kaggle.com/hacker-news/hacker-news-posts) consists of approximately 20,000 rows and 7 columns that include the following information:
- **id**: The unique identifier from Hacker News for the post
- **title**: The title of the post
- **url**: The URL that the post links to, if the post has a URL
- **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments**: The number of comments that were made on the post
- **author**: The username of the person who submitted the post
- **created_at**: The date and time at which the post was submitted

For now, we will open the **hacker_news.csv** file, read it, and turn it into a list that we store in the variable **hn**. We will then print the first 5 rows of the data set.


In [1]:
from csv import reader
open_file = open('hacker_news.csv')
read_file = reader(open_file)
hn = list(read_file)
print(hn[:5])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


As we can see, the header row is part of the data set. Let us store it in a variable **headers** and remove it from the data set.

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
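
The two cells above can also be combined into one step. The following is a minimal sketch of an alternative, not the method used in this notebook: a `with` block closes the file automatically, and `next()` takes the header off the reader before the remaining rows are collected. To stay self-contained it reads from `sample_csv`, a hypothetical in-memory stand-in built from two rows shown earlier, instead of from `hacker_news.csv`.

```python
import csv
import io

# Hypothetical stand-in for hacker_news.csv: the header plus two rows from the output above.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

# With the real file this line would be: with open('hacker_news.csv', newline='') as f:
with io.StringIO(sample_csv) as f:
    reader = csv.reader(f)
    headers = next(reader)  # the first row is the header
    hn = list(reader)       # the remaining rows are the data

print(headers)
print(len(hn))
```

Reading the header with `next()` avoids keeping a copy of the full list only to slice it afterwards.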


## Sorting the posts into their respective categories
Since we are interested in posts where users **Ask HN** or want to **Show HN**, we will separate these posts into their own lists.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

In [4]:
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('There are this many posts in the list of ask_posts:',len(ask_posts))
print('There are this many posts in the list of show_posts:',len(show_posts))
print('There are this many posts in the list of other_posts:',len(other_posts))

There are this many posts in the list of ask_posts: 1744
There are this many posts in the list of show_posts: 1162
There are this many posts in the list of other_posts: 17194


We see that **other posts** are far more numerous than **ask hn** or **show hn** posts. However, we are focusing on ask and show posts, as specified in the beginning.
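
The prefix check used for the split can also be isolated in a small helper, which makes the rule easy to test on its own. This is a sketch only: `categorize`, `sample_titles` and `buckets` are hypothetical names introduced for illustration, and the titles are taken from rows printed in this notebook.

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' based on the case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Titles copied from rows shown in this notebook
sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]

buckets = {"ask": [], "show": [], "other": []}
for title in sample_titles:
    buckets[categorize(title)].append(title)

for name, titles in buckets.items():
    print(name, len(titles))
```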

Let us take a look at the first 5 posts of the ask_posts list.

In [5]:
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


Now, let us take a look at the first 5 posts of the show_posts list:

In [6]:
print(show_posts[:5])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


## Counting comments for "ask hn" and "show hn" posts

Next, we will determine which kind of post receives more comments on average.

In [23]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments per ask posts are: ", avg_ask_comments)

total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments per show posts are: ", avg_show_comments)    
    
total_other_comments = 0
for post in other_posts:
    total_other_comments += int(post[4])
    
avg_other_comments = total_other_comments / len(other_posts)
print("Average number of comments per other posts are: ", avg_other_comments)


Average number of comments per ask posts are:  14.038417431192661
Average number of comments per show posts are:  10.31669535283993
Average number of comments per other posts are:  26.8730371059672


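We see that ask posts receive more comments on average than show posts (about 14.04 versus 10.32). Other posts average even more (about 26.87), but they are outside the scope of this analysis, so we will focus the remaining analysis on the ask posts.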
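
The three loops above repeat the same calculation, so it could be factored into one function. The sketch below assumes the row layout shown earlier, with **num_comments** at index 4. `average_comments` and `sample_ask` are hypothetical names, and the two sample rows are copied from the ask_posts preview.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column over a list of rows; 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Two rows copied from the ask_posts preview above
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]

print(average_comments(sample_ask))
```

The check for an empty list is a design choice that avoids a `ZeroDivisionError` if a category happens to contain no posts.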

## Most active hours for commenting
Now we determine whether ask posts created at a certain time receive more comments than the rest. To perform this analysis, we will:
- Calculate the number of ask posts created in each hour of the day, as well as the number of comments they received.
- Calculate the average number of comments that ask posts receive for each hour in which they were created.

In [8]:
import datetime as dt
result_list = []
for row in ask_posts:
    result_list.append(
        [row[6],int(row[4])]) 
    
counts_by_hour = {}
comments_by_hour = {}

date_format = '%m/%d/%Y %H:%M'

for row in result_list:
    date = row[0]
    date = dt.datetime.strptime(date,date_format)
    hour = date.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(row[1])
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(row[1])  
        
print(counts_by_hour)
print("\n")
print(comments_by_hour)

{0: 55, 1: 60, 2: 58, 3: 54, 4: 47, 5: 46, 6: 44, 7: 34, 8: 48, 9: 45, 10: 59, 11: 58, 12: 73, 13: 85, 14: 107, 15: 116, 16: 108, 17: 100, 18: 109, 19: 110, 20: 80, 21: 109, 22: 71, 23: 68}


{0: 447, 1: 683, 2: 1381, 3: 421, 4: 337, 5: 464, 6: 397, 7: 267, 8: 492, 9: 251, 10: 793, 11: 641, 12: 687, 13: 1253, 14: 1416, 15: 4477, 16: 1814, 17: 1146, 18: 1439, 19: 1188, 20: 1722, 21: 1745, 22: 479, 23: 543}
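
The same tally can be written without the explicit `if hour not in ...` branch by using `collections.defaultdict`. This is a sketch of an alternative, not the approach used above. `sample` is a hypothetical list of (created_at, num_comments) pairs copied from the ask_posts preview.

```python
import datetime as dt
from collections import defaultdict

date_format = '%m/%d/%Y %H:%M'

# (created_at, num_comments) pairs copied from the ask_posts preview above
sample = [
    ('8/16/2016 9:55', 6),
    ('11/22/2015 13:43', 29),
    ('5/2/2016 10:14', 1),
    ('8/2/2016 14:20', 3),
    ('10/15/2015 16:38', 17),
]

# Missing keys start at 0, so no membership check is needed
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, date_format).hour
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```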


## Calculating average comments per hour
Now we will calculate the average number of comments per post for each hour by:
- Iterating over the dictionary **counts_by_hour** and, for each hour:
    - calculating the average number of comments per post created in that hour.
    - appending the **hour** and the **average number of comments** to the list **avg_by_hour**.

In [12]:
avg_by_hour = []
for hour in counts_by_hour:
    comments = comments_by_hour[hour]
    count = counts_by_hour[hour]
    avg =  comments/count 
    avg_by_hour.append([hour,avg])
for row in avg_by_hour:
    print(row)

[0, 8.127272727272727]
[1, 11.383333333333333]
[2, 23.810344827586206]
[3, 7.796296296296297]
[4, 7.170212765957447]
[5, 10.08695652173913]
[6, 9.022727272727273]
[7, 7.852941176470588]
[8, 10.25]
[9, 5.5777777777777775]
[10, 13.440677966101696]
[11, 11.051724137931034]
[12, 9.41095890410959]
[13, 14.741176470588234]
[14, 13.233644859813085]
[15, 38.5948275862069]
[16, 16.796296296296298]
[17, 11.46]
[18, 13.20183486238532]
[19, 10.8]
[20, 21.525]
[21, 16.009174311926607]
[22, 6.746478873239437]
[23, 7.985294117647059]


## Sorting the results
To make it easier to find the hours with the highest average number of comments, we will now sort this list. Because each entry is stored as [hour, average], we first swap the columns so that sorting orders the entries by average. The swapped list is stored in **swap_avg_by_hour**.

In [24]:
swap_avg_by_hour=[]
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

sorted_swap = sorted(swap_avg_by_hour,reverse=True)
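
Swapping the columns is one way to sort by average. Another is to pass a `key` function to `sorted`, which leaves each entry in its original [hour, average] order. The sketch below uses a few pairs copied from the **avg_by_hour** output. `top` is a hypothetical name introduced for illustration.

```python
from operator import itemgetter

# A few [hour, average] pairs copied from the avg_by_hour output above
avg_by_hour = [
    [2, 23.810344827586206],
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [16, 16.796296296296298],
    [20, 21.525],
    [21, 16.009174311926607],
]

# Sort on the second element (the average), highest first
top = sorted(avg_by_hour, key=itemgetter(1), reverse=True)

for hour, avg in top[:5]:
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```

One difference to be aware of: with `key`, entries with identical averages keep their original relative order, whereas the swapped lists fall back to comparing the hour.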

## Conclusion - The top 5 times to post
Below we present our findings with the best posting hours. 

In [21]:
string="{}: {:.2f} average comments per post"
for avg,hour in sorted_swap[:5]:
    time = dt.datetime.strptime(str(hour),"%H")
    hour = dt.datetime.strftime(time,"%H:%M")
    print(string.format(hour,avg))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The point of this project was to find the hours at which a post is likely to receive the most comments. According to our results, the best time to publish an ask post is 3 pm, which averages roughly 62% more comments per post than the second-best hour (2 am). The most active hours are **3-5pm**, **8-10pm** and around **2am**.
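
The relative gap between the two best hours can be checked directly from the totals printed in **comments_by_hour** and **counts_by_hour**. This is a small arithmetic sketch. The variable names are hypothetical and the numbers are copied from those outputs.

```python
# Totals for 15:00 and 02:00 copied from comments_by_hour and counts_by_hour above
best_avg = 4477 / 116    # average comments per ask post created at 15:00
second_avg = 1381 / 58   # average comments per ask post created at 02:00

# Relative difference of the best hour over the second-best hour, in percent
relative_gap = (best_avg - second_avg) / second_avg * 100
print("{:.1f}% more comments per post".format(relative_gap))
```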