# Exploring Hacker News Posts

[Hacker News](https://news.ycombinator.com/) is a site started by the startup incubator Y Combinator and popular in technology and startup circles. Users submit posts, which other users vote and comment on. Posts that make it to the top of Hacker News' listings can get hundreds of thousands of visits as a result.

In this project, we will explore two types of posts from the Hacker News site: posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question, and `Show HN` posts to show the community a project, product, or just something interesting.

We will compare these two types of posts to determine:
- Which of the two receive more comments on average?
- Do posts created at a certain time receive more comments on average?

It should be noted that the dataset has been reduced from 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, then randomly sampling from the remaining submissions.
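
The reduction described above could be reproduced roughly as follows. This is a minimal sketch, not the procedure actually used to build the file: the function name `reduce_dataset`, the sample size, the fixed seed, and the toy rows are all assumptions made for illustration.

```python
import random

def reduce_dataset(rows, sample_size, seed=0):
    """Drop rows with zero comments, then randomly sample from the rest.

    `rows` is a list of lists without the header row; index 4 is assumed
    to hold num_comments, as in hacker_news.csv.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    # Never ask for more rows than are available.
    return rng.sample(commented, min(sample_size, len(commented)))

# Tiny made-up rows in the same column order as the real dataset.
toy_rows = [
    ['1', 'Ask HN: A?', '', '5', '3', 'a', '8/4/2016 11:52'],
    ['2', 'Some link', 'http://example.com', '1', '0', 'b', '8/4/2016 12:00'],
    ['3', 'Show HN: B', 'http://example.org', '2', '7', 'c', '8/5/2016 9:15'],
]
reduced = reduce_dataset(toy_rows, sample_size=2)
print(len(reduced))
```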

You can find out more about the Hacker News dataset [here](https://www.kaggle.com/hacker-news/hacker-news-posts).


In [56]:
# Import the Hacker News dataset
from csv import reader

# The with-block closes the file once it has been read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))  # dataset stored in hn as a list of lists

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


The output above shows the first five rows. The first row contains the column headers, and each list after it contains the data for one post.

## Removing headers from a List of Lists

In [7]:
headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


The first row, which contains the column headers, is now assigned to the variable `headers`.

In [57]:
# Removing the first row that has column headers from hn dataset
hn = hn[1:]

print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


We have now separated the header row from the rest of the dataset.

## Filtering data for Ask HN or Show HN Posts

We are only concerned with post titles that begin with either `Ask HN` or `Show HN`.

We'll therefore extract these posts into new lists of lists, one per post type.


In [58]:
# List to hold Ask HN posts' data
ask_posts = []

# List to hold Show HN posts' data
show_posts = []

# List to hold all other posts' data
other_posts = []

In [59]:
for row in hn: #extracting ask posts, show posts and other posts
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)      

In [60]:
# Check the number of posts in each group: Ask HN, Show HN, and other
print('There are '+str(len(ask_posts))+' Ask Hn posts')
print('There are '+str(len(show_posts))+' Show Hn posts')
print('There are '+str(len(other_posts))+' other posts')

There are 1744 Ask Hn posts
There are 1162 Show Hn posts
There are 17194 other posts
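
The prefix test used in the loop above can be tried on its own with a small helper. This is a sketch for illustration only: `classify_title` is a hypothetical name that does not appear in the notebook, and the sample titles are made up.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration only.
samples = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in samples]
print(labels)  # ['ask', 'show', 'other']
```

As a quick check on the counts above, 1744 + 1162 + 17194 = 20100 rows, which agrees with the "approximately 20,000 rows" stated in the introduction.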


## Calculating average number of comments per post

Now that we have separated ask posts and show posts into lists of lists, we will calculate the average number of comments for each type of post.

In [61]:
total_ask_comments = 0
total_show_comments = 0

In [29]:
#Calculate total no of comments for ask posts
for post in ask_posts:
    total_ask_comments+= int(post[4])

#Compute average no of comments for ask posts
avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)


14.038417431192661


In [62]:
#Calculate total no of comments for show posts
for post in show_posts:
    total_show_comments+= int(post[4])

#Compute average no of comments for show posts
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

10.31669535283993


Ask posts receive more comments on average (about `14`) than show posts (about `10`).

Since ask posts are more likely to receive comments, our remaining analysis will focus on these posts.
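
The two averaging cells above keep running totals in notebook-level variables, so re-running a cell without resetting its total would inflate the result. One possible alternative, sketched here under the assumption that index 4 holds `num_comments`, is a small function that recomputes the total each time it is called. The name `average_comments` and the toy rows are hypothetical.

```python
def average_comments(posts):
    """Mean of the num_comments column (index 4) over a list of rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(post[4]) for post in posts)
    return total / len(posts)

# Made-up rows in the same column order as the dataset.
toy_posts = [
    ['1', 'Ask HN: A?', '', '5', '3', 'a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '1', '9', 'b', '8/4/2016 12:00'],
]
print(average_comments(toy_posts))  # 6.0
```

Calling the function twice on the same list returns the same value, which is the property the accumulating cells lack.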

## Finding the number of ask posts and comments by hour created

In this section, we will determine if ask posts created at a certain time are more likely to attract comments.

We will count the `number of ask posts` created in `each hour` of the day along with the `number of comments` they received, and then use these to calculate the `average number of comments per ask post by hour created`.

In [63]:
#Calculating the number of ask posts and comments by hour created
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list: #extract no of comments for posts created each hour
    date = row[0]
    comment =  row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
   
    if time not in counts_by_hour:
        counts_by_hour[time]=1
        comments_by_hour[time]=comment   
    else:
        counts_by_hour[time]+=1
        comments_by_hour[time]+=comment
  
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

`counts_by_hour`: contains the number of ask posts created during each hour of the day

`comments_by_hour`: contains the total number of comments received by ask posts created during each hour
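
To see how the two dictionaries are built, the same hour extraction can be run on a handful of rows. The sketch below uses `collections.defaultdict` in place of the explicit `if`/`else`, which is a swapped-in convenience and not what the cell above does. The toy rows and the names `toy_results`, `counts`, and `comments` are made up.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Made-up [created_at, num_comments] pairs for illustration.
toy_results = [
    ['8/4/2016 11:52', 6],
    ['9/30/2015 4:12', 2],
    ['1/26/2016 11:05', 10],
]

counts = defaultdict(int)    # posts per hour; missing keys start at 0
comments = defaultdict(int)  # comments per hour; missing keys start at 0

for created_at, num_comments in toy_results:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. '04'.
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'11': 2, '04': 1}
print(dict(comments))  # {'11': 16, '04': 2}
```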

## Finding the average number of comments for Ask posts created by each hour of the day 

In [64]:
#Calculate average number of comments per ask post for posts created each hour
avg_by_hour = []

for hour in comments_by_hour:
    avg = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour,avg])

avg_by_hour

[['06', 9.022727272727273],
 ['08', 10.25],
 ['22', 6.746478873239437],
 ['05', 10.08695652173913],
 ['17', 11.46],
 ['02', 23.810344827586206],
 ['11', 11.051724137931034],
 ['18', 13.20183486238532],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['21', 16.009174311926607],
 ['03', 7.796296296296297],
 ['09', 5.5777777777777775],
 ['20', 21.525],
 ['13', 14.741176470588234],
 ['19', 10.8],
 ['00', 8.127272727272727],
 ['04', 7.170212765957447],
 ['07', 7.852941176470588],
 ['12', 9.41095890410959],
 ['23', 7.985294117647059],
 ['15', 38.5948275862069],
 ['01', 11.383333333333333],
 ['10', 13.440677966101696]]

We have now calculated the average number of comments for ask posts created during each hour of the day and stored the results in the list of lists `avg_by_hour`.

## Sorting and Printing Highest Values from a List of Lists

In its current format, the `avg_by_hour` list of lists makes it difficult to identify the hours with the highest average number of comments.

Finally, we will sort this list to find the hours at which creating a post gives it a higher chance of receiving comments.

In [65]:
#List equals avg_by_hour with swapped columns
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

print(swap_avg_by_hour)

[[9.022727272727273, '06'], [10.25, '08'], [6.746478873239437, '22'], [10.08695652173913, '05'], [11.46, '17'], [23.810344827586206, '02'], [11.051724137931034, '11'], [13.20183486238532, '18'], [13.233644859813085, '14'], [16.796296296296298, '16'], [16.009174311926607, '21'], [7.796296296296297, '03'], [5.5777777777777775, '09'], [21.525, '20'], [14.741176470588234, '13'], [10.8, '19'], [8.127272727272727, '00'], [7.170212765957447, '04'], [7.852941176470588, '07'], [9.41095890410959, '12'], [7.985294117647059, '23'], [38.5948275862069, '15'], [11.383333333333333, '01'], [13.440677966101696, '10']]


In [66]:
#Sort list from the highest
sorted_swap =  sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".
          format(dt.datetime.strptime(hr,"%H").strftime("%H:%M"), avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
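
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` also accepts a `key` function, so the list can be ordered by its second element without building a swapped copy. The name `sample_avg_by_hour` is hypothetical; its values are a few pairs copied from `avg_by_hour`.

```python
# A few [hour, average] pairs copied from avg_by_hour.
sample_avg_by_hour = [
    ['06', 9.022727272727273],
    ['02', 23.810344827586206],
    ['20', 21.525],
    ['15', 38.5948275862069],
]

# Sort on the second element (the average), highest first.
top = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

The original list keeps its `[hour, average]` layout, so no second list has to be kept in sync.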


The hour with the most comments per post is `15:00`, with an average of approximately `38.59 comments per post`. That is about `62% higher` than the second-ranked hour, 02:00, which averages 23.81.
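
The 62% figure is the relative difference between the 15:00 and 02:00 averages. A short check, using the two values from `avg_by_hour` (the variable names here are made up for the example):

```python
best = 38.5948275862069         # average for 15:00, from avg_by_hour
runner_up = 23.810344827586206  # average for 02:00, from avg_by_hour

# Relative difference, expressed as a percentage of the runner-up.
pct_higher = (best - runner_up) / runner_up * 100
print("{:.0f}% higher".format(pct_higher))  # 62% higher
```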

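From our top 5 list, we can also see that four of the five hours (15:00, 16:00, 20:00 and 21:00) fall in the `afternoon and evening`, when most people are probably less busy, finished with work, or relaxing at the end of their day. Note that these hours refer to when the posts were created, not when the comments were written.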

In my opinion, most people are at work in the morning and thus have less time to comment on posts.

According to the [dataset's documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the `created_at` timestamps are in US `Eastern Time`. I would therefore recommend creating an ask post at around 3:00 pm Eastern Time to have a higher chance of receiving comments.

Or if you live in Eastern Africa like me 😀, then create your ask post at 10:00 pm EAT (11:00 pm EAT while the US East Coast is on standard time).
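
The conversion from US Eastern Time to East Africa Time can be checked with the standard library. This is a minimal sketch that assumes fixed UTC offsets (UTC-4 for Eastern daylight time, UTC-5 for Eastern standard time, UTC+3 for EAT). The helper `to_eat` and the date used are arbitrary choices for the example, and real code would normally rely on a timezone database so that daylight saving is handled automatically.

```python
import datetime as dt

# Fixed UTC offsets, hard-coded as an assumption for this sketch.
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")  # US Eastern, daylight time
EST = dt.timezone(dt.timedelta(hours=-5), "EST")  # US Eastern, standard time
EAT = dt.timezone(dt.timedelta(hours=3), "EAT")   # East Africa Time

def to_eat(hour, eastern_tz):
    """Convert an hour on an arbitrary date from US Eastern to EAT."""
    eastern = dt.datetime(2016, 1, 1, hour, 0, tzinfo=eastern_tz)
    return eastern.astimezone(EAT).strftime("%H:%M")

print(to_eat(15, EDT))  # 22:00
print(to_eat(15, EST))  # 23:00
```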

## Conclusion

In this project, we explored and analyzed ask and show posts from the Hacker News site to find out which type of post, and which posting time, receives the most comments.

From our findings, ask posts receive more comments on average than show posts, and ask posts created at around 15:00 Eastern Time receive the most comments on average. We therefore recommend that anyone looking to attract comments create an ask post at around that time.