# Exploring Hacker News Posts

***

## Introduction

[Hacker News](https://news.ycombinator.com/) (HN) is a website where user-submitted posts are voted and commented on, much like [Reddit](https://reddit.com).
In this project we are particularly interested in posts whose titles begin with `Ask HN` or `Show HN`.

Users submit `Ask HN` posts to ask the Hacker News community a specific question, while `Show HN` posts are used to show the community a project, a product, or something generally interesting.

The columns in the dataset are described below (index numbers have been added so that we know which column the code below refers to):

Index No. | Column name | Description
----------|-------------|------------
0 | id | The unique identifier from Hacker News for the post
1 | title | The title of the post
2 | url | The URL that the post links to, if the post has a URL
3 | num_points | The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
4 | num_comments | The number of comments that were made on the post
5 | author | The username of the person who submitted the post
6 | created_at | The date and time at which the post was submitted
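
As a minimal sketch of how the index numbers in the table relate to the column names, the header row can be turned into a lookup dictionary. The name `col_index` is a hypothetical helper that is not used elsewhere in this notebook, and the sample row is one of the first rows of the dataset.

```python
# Header row, matching the column names in the table above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its index number.
col_index = {name: position for position, name in enumerate(headers)}

# One row in the same shape as the dataset rows.
sample_row = ['12579005', 'SQLAR  the SQLite Archiver',
              'https://www.sqlite.org/sqlar/doc/trunk/README.md',
              '1', '0', 'blacksqr', '9/26/2016 3:24']

print(col_index['num_comments'])            # 4
print(sample_row[col_index['created_at']])  # 9/26/2016 3:24
```

The rest of the notebook uses the numeric indexes directly (for example `row[4]` for `num_comments`), so this lookup is only a readability aid.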

The dataset we are working with contains approximately 300,000 HN posts. Further information about the dataset used for this project can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

The main objectives of this project are as follows:

**Project Goals:**

1. To determine whether 'Ask HN' posts or 'Show HN' posts receive a higher number of comments on average.
2. To determine the best time of day to submit your post in order to receive a high number of comments.


***




## Project Goal 1: To determine whether 'Ask HN' Posts or 'Show HN' posts receive a higher number of comments on average.

## Opening and reading the file

Below we open the .csv dataset and assign it to the variable `hn`. We also remove the header row and keep it in a separate variable, `headers`.

We print `headers` and the first five rows of `hn` to get a better understanding of what our data consists of.

In [1]:
from csv import reader

opened_file = open('hacker_news.csv', encoding="utf8")
read_file = reader(opened_file)
hn = list(read_file)
headers = hn[0]
hn = hn[1:]

print(headers,'\n')
for row in hn[:5]:
    print(row,'\n')


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'] 

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'] 

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'] 

['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'] 

['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14'] 
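
The cell above leaves the file handle open. One possible alternative, shown here only as a sketch, wraps the reading step in a function and uses a `with` block so the file is closed automatically. The function name `read_posts` is hypothetical, and an in-memory string stands in for `hacker_news.csv` so the example runs on its own.

```python
import csv
import io

def read_posts(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# In-memory stand-in for hacker_news.csv (header plus one real row).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
)
sample_headers, sample_rows = read_posts(sample_csv)
print(sample_headers)
print(len(sample_rows))  # 1

# With the real file, the same function could be called like this:
# with open('hacker_news.csv', encoding='utf8', newline='') as f:
#     headers, hn = read_posts(f)
```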



## Removing Posts without comments

From here on we only consider posts that have at least one comment; posts with 0 comments are dropped. We do this to obtain a fair average across posts that have some traction.

This also reduces the size of our dataset considerably, from approximately 300,000 posts to approximately 80,000.

In [2]:
posts_w_comments = []

for row in hn:
    ncomments = row[4]
    if ncomments != '0':
        posts_w_comments.append(row)
        
print("Number of posts with comments: ",len(posts_w_comments))

Number of posts with comments:  80401
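
The cell above compares the raw string against `'0'`. A slightly more defensive variant, sketched below, converts the field to an integer first, so values such as `' 0 '` or `'00'` would also be treated as zero. The helper name `has_comments` and the three sample rows are made up for illustration and are not part of the dataset.

```python
def has_comments(row, comments_index=4):
    """True when the num_comments field parses to a positive integer."""
    return int(row[comments_index].strip()) > 0

# Illustrative rows only: id, title, url, num_points, num_comments, author, created_at
illustrative_rows = [
    ['1', 'first',  '', '1', '0',   'user_a', '9/26/2016 3:26'],
    ['2', 'second', '', '5', '12',  'user_b', '9/26/2016 3:24'],
    ['3', 'third',  '', '2', ' 0 ', 'user_c', '9/26/2016 3:19'],
]

kept_rows = [row for row in illustrative_rows if has_comments(row)]
print(len(kept_rows))  # 1
```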


## Filtering 'Ask HN' and 'Show HN' posts

We then go through the posts that have comments and separate Ask HN posts and Show HN posts into two lists, `ask_posts` and `show_posts` respectively. All remaining posts go into `other_posts`. Since users may not type the prefix with exactly the same capitalization, we convert each title to lower case before checking whether it begins with `ask hn` or `show hn`.

Finally, the number of 'Ask HN' and 'Show HN' posts is displayed.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in posts_w_comments:      
    title = row[1]
    lower_title = title.lower()
    if lower_title.startswith('ask hn'):
        ask_posts.append(row)
    elif lower_title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("No. of Posts starting with Ask HN: {} \nNo. of Posts starting with Show HN: {}\nNo. of Other Posts {}".format(len(ask_posts),len(show_posts),len(other_posts)))

No. of Posts starting with Ask HN: 6911 
No. of Posts starting with Show HN: 5059
No. of Other Posts 68431


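The results above show that, among posts with at least one comment, there are more Ask HN posts (6911) than Show HN posts (5059). Because posts without comments were removed earlier, these counts do not tell us which type is submitted more often on the website overall.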
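
The case-insensitive prefix check used in the filtering cell can also be written as a small function, which makes it easy to try on individual titles. This is only a sketch: the name `classify_title` and the first two example titles are hypothetical.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

for example_title in ['Ask HN: How do you learn?',
                      'SHOW HN: My side project',
                      'algorithmic music']:
    print(example_title, '->', classify_title(example_title))
```

Note that `startswith` only looks at the beginning of the title, so a post that mentions "Ask HN" later in its title is counted as `other`.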

## Calculating average number of comments for 'Ask HN' and 'Show HN' posts

To calculate the average number of comments for 'Ask HN' and 'Show HN' posts we first sum up the total number of comments for the posts within `ask_posts` and `show_posts` and then divide by the number of posts to obtain an average.

In [4]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)

print("Average number of comments on Ask HN Posts: ", avg_ask_comments)

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments/len(show_posts)

print("Average number of comments of Show HN Posts: ", avg_show_comments)
    

Average number of comments on Ask HN Posts:  13.744175951381855
Average number of comments of Show HN Posts:  9.810832180272781
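
Because the same sum-then-divide logic is applied to both lists, it could be factored into one function. The sketch below assumes a hypothetical helper named `average_comments` and uses three made-up rows; it also guards against an empty list, which the cell above does not need because both lists are non-empty.

```python
def average_comments(rows, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Illustrative rows only; the num_comments field is at index 4.
illustrative_posts = [
    ['1', 'Ask HN: a', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: b', '', '7', '20', 'user_b', '9/26/2016 3:24'],
    ['3', 'Ask HN: c', '', '1', '3',  'user_c', '9/26/2016 3:19'],
]

print(average_comments(illustrative_posts))  # 11.0
print(average_comments([]))                  # 0.0
```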


From the above analysis, Ask HN posts receive more comments on average (about 13.7) than Show HN posts (about 9.8). This could be because people are willing to share their opinions or provide answers to help the original poster with their question. Show HN posts may receive fewer comments because readers may simply look at what the original poster has to say or show, decide whether it is interesting, and move on to the next post, much like on any social media platform. A user may only comment on a Show HN post when they have a strong opinion on the content they are viewing.
***

## Project Goal 2: To determine the best time of day to submit your post in order to receive a high number of comments.

## Finding the number of Ask Posts and number of comments created by hour

Since Ask HN posts receive more comments on average, we'll focus our remaining analysis on just these posts.

With the code below we work with dates and times in order to:
* count the number of Ask HN posts posted each hour.
* calculate the total number of comments for Ask HN posts for each hour.

In [11]:
import datetime as dt

result_list = []

for row in ask_posts:
    post_date_time = row[6]
    ncomments = int(row[4])
    result_list.append([post_date_time,ncomments])
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
    
for row in result_list:
    date_time = row[0]
    ncomments = row[1]
    time = dt.datetime.strptime(date_time, date_format).strftime("%H")
    
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = ncomments
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += ncomments
        
print("Below data is in the format: (Hour, Number of Comments for Ask Posts in that hour)\n")        
print(sorted(comments_by_hour.items(), key=lambda kv: kv[1])) #Sort dictionary by values ascending
print('\n')        
print("Below data is in the format: (Hour, Number of Ask Posts posted in that hour)\n") 
print(sorted(counts_by_hour.items(), key=lambda kv: kv[1])) #Sort dictionary by values ascending
        

Below data is in the format: (Hour, Number of Comments for Ask Posts in that hour)

[('09', 1477), ('07', 1585), ('06', 1587), ('05', 1838), ('01', 2089), ('03', 2154), ('00', 2277), ('23', 2297), ('04', 2360), ('08', 2362), ('11', 2797), ('02', 2996), ('10', 3013), ('22', 3372), ('19', 3954), ('12', 4234), ('20', 4462), ('16', 4466), ('21', 4500), ('18', 4877), ('14', 4972), ('17', 5547), ('13', 7245), ('15', 18525)]


Below data is in the format: (Hour, Number of Ask Posts posted in that hour)

[('07', 157), ('05', 165), ('09', 176), ('06', 176), ('04', 186), ('08', 190), ('03', 212), ('10', 219), ('01', 223), ('02', 227), ('00', 231), ('11', 251), ('12', 274), ('23', 276), ('22', 287), ('13', 326), ('14', 378), ('20', 392), ('17', 404), ('21', 407), ('16', 415), ('19', 420), ('18', 452), ('15', 467)]
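
The two tallies can also be built with `collections.defaultdict`, which removes the need for the `if ... not in` branch. The sketch below is an alternative to the cell above, not the method used in it; `tally_by_hour` is a hypothetical name and the three (timestamp, comments) pairs are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"  # same format string as in the cell above

def tally_by_hour(pairs):
    """Return (posts per hour, comments per hour) keyed by a two-digit hour string."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, ncomments in pairs:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        comments[hour] += ncomments
    return dict(counts), dict(comments)

# Illustrative (created_at, num_comments) pairs.
illustrative_pairs = [("9/26/2016 3:26", 4), ("9/25/2016 3:05", 6), ("9/26/2016 15:40", 10)]

hour_counts, hour_comments = tally_by_hour(illustrative_pairs)
print(hour_counts)    # {'03': 2, '15': 1}
print(hour_comments)  # {'03': 10, '15': 10}
```

The `%m`, `%d`, and `%H` directives accept values without a leading zero when parsing, which is why a timestamp such as `9/26/2016 3:26` is read correctly.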


According to the dataset documentation, the times are given in EST, in 24-hour format.

From the above results, the hour from 15:00 to 16:00 EST has both the highest total number of comments (18525) and the highest number of posts (467).
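
Readers in other time zones may want to translate an hour such as 15:00 EST into their own time. The sketch below assumes that "EST" means a fixed offset of UTC-5 with no daylight saving adjustment; the dataset documentation is not quoted here, so treat that offset as an assumption. The date used is only an example.

```python
import datetime as dt

# Assumption: EST is a fixed UTC-5 offset (no daylight saving time).
EST = dt.timezone(dt.timedelta(hours=-5), name="EST")

naive_time = dt.datetime.strptime("9/26/2016 15:00", "%m/%d/%Y %H:%M")
eastern_time = naive_time.replace(tzinfo=EST)       # attach the assumed zone
utc_time = eastern_time.astimezone(dt.timezone.utc)  # convert to UTC

print(utc_time.strftime("%H:%M"))  # 20:00
```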

## Calculating the Average Number of Comments for Ask HN Posts by Hour

We now use the results from the above code block to determine the average number of comments for Ask HN posts by hour.
This is done by dividing the total number of comments for all posts in a given hour by the number of posts submitted within that hour, i.e. `comments_by_hour[key] / counts_by_hour[key]`.

In [6]:
avg_by_hour = []

for key in comments_by_hour:
    average_by_hour = comments_by_hour[key]/counts_by_hour[key]
    avg_by_hour.append([key, average_by_hour])
    
def take_second(elem):  # Function to return index 1 of an inner list
    return elem[1]

sorted_avg_by_hour = sorted(avg_by_hour, key=take_second, reverse=True) # Uses the above function 'take_second' to sort avg_by_hour in descending order by value
print(sorted_avg_by_hour)

[['15', 39.66809421841542], ['13', 22.2239263803681], ['12', 15.452554744525548], ['10', 13.757990867579908], ['17', 13.73019801980198], ['02', 13.198237885462555], ['14', 13.153439153439153], ['04', 12.688172043010752], ['08', 12.43157894736842], ['22', 11.749128919860627], ['20', 11.38265306122449], ['11', 11.143426294820717], ['05', 11.139393939393939], ['21', 11.056511056511056], ['18', 10.789823008849558], ['16', 10.76144578313253], ['03', 10.160377358490566], ['07', 10.095541401273886], ['00', 9.857142857142858], ['19', 9.414285714285715], ['01', 9.367713004484305], ['06', 9.017045454545455], ['09', 8.392045454545455], ['23', 8.322463768115941]]
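
As a worked check of the division, the sketch below recomputes the averages for four of the hours using the totals and post counts printed earlier. It uses a dictionary comprehension and a `lambda` sort key instead of the helper function, purely as an alternative style; `hourly_comments`, `hourly_counts`, and `ranked_hours` are hypothetical names.

```python
# Totals and post counts for four hours, copied from the earlier output.
hourly_comments = {'12': 4234, '13': 7245, '15': 18525, '23': 2297}
hourly_counts = {'12': 274, '13': 326, '15': 467, '23': 276}

# Average comments per post for each hour.
hourly_average = {hour: hourly_comments[hour] / hourly_counts[hour]
                  for hour in hourly_comments}

# Sort (hour, average) pairs from highest to lowest average.
ranked_hours = sorted(hourly_average.items(), key=lambda item: item[1], reverse=True)

for hour, average in ranked_hours:
    print("{}:00 -> {:.2f}".format(hour, average))
```

For example, hour 15 gives 18525 / 467, which is about 39.67, matching the first entry in the sorted list above.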


The above data tells us that 15:00 EST is the best hour to submit an Ask HN post in order to receive a high number of comments. The top 5 hours to submit Ask HN posts are presented more neatly below:

In [7]:
print("Top 5 Hours to submit Ask HN Posts:\n")
for row in sorted_avg_by_hour[:5]:
    print("Hour: {}, Average Number of Comments: {:.2f}".format(row[0],row[1]))

Top 5 Hours to submit Ask HN Posts:

Hour: 15, Average Number of Comments: 39.67
Hour: 13, Average Number of Comments: 22.22
Hour: 12, Average Number of Comments: 15.45
Hour: 10, Average Number of Comments: 13.76
Hour: 17, Average Number of Comments: 13.73


## Conclusion

To summarize, we have found that among the Hacker News posts that do receive comments:

1) 'Ask HN' posts receive a higher number of comments on average than 'Show HN' posts. <br>
2) 15:00-16:00 EST (3 PM - 4 PM EST) is the best time of day to submit an Ask HN post in order to receive a high number of comments.