# Exploring Hacker News Posts

## Introduction

In this project, we'll work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).


We're specifically interested in posts whose titles begin with either *Ask HN* or *Show HN*. Users submit *Ask HN* posts to ask the Hacker News community a specific question. Below are a few examples:

- Ask HN: How to improve my personal website?
- Ask HN: Am I the only one outraged by Twitter shutting down share counts?
- Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit *Show HN* posts to show the Hacker News community a project, a product, or just something interesting. Below are a few examples:

- Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
- Show HN: Something pointless I made
- Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts to determine the following:

- Do *Ask HN* or *Show HN* posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

As a side analysis, we will also compare the points that Ask HN and Show HN posts receive.

*We are not going to use pandas or numpy in this project.*



## Import CSV

Let's start by importing the libraries we need and reading the data set into a list of lists.

In [1]:
from csv import reader

opened_file = open('hacker_news.csv', encoding='utf8')  # without encoding='utf8', reading the file raises an error
read_file = reader(opened_file)
hn = list(read_file)

print(hn[0])
print(hn[1])
print(hn[2])
print(hn[3])
print(hn[4])
                   

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
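The file handle above is never closed. A minimal sketch of the same read using a `with` block, which closes the file automatically, is shown here. The helper name `read_rows` and the two-line sample file are hypothetical stand-ins so the example runs without `hacker_news.csv`.

```python
import csv
import os
import tempfile


def read_rows(path):
    """Read a CSV file into a list of lists and close the file afterwards."""
    with open(path, encoding='utf8', newline='') as f:
        return list(csv.reader(f))


# Hypothetical two-line sample standing in for hacker_news.csv.
sample = 'id,title,num_comments\n12579005,SQLAR  the SQLite Archiver,0\n'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write(sample)

rows = read_rows(tmp.name)
os.remove(tmp.name)  # clean up the temporary sample file

print(rows[0])
print(rows[1])
```

With the real data set, the call would simply be `read_rows('hacker_news.csv')`.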


You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts). Note that this notebook works with the full file of almost 300,000 rows, including submissions that received no comments, rather than a reduced sample. Below are descriptions of the columns:

|Column|Description|
|------|-----------|
|**id**|The unique identifier from Hacker News for the post|
|**title**|The title of the post|
|**url**|The URL that the post links to, if the post has a URL|
|**num_points**|The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes|
|**num_comments**|The number of comments that were made on the post|
|**author**|The username of the person who submitted the post|
|**created_at**|The date and time at which the post was submitted|




## Split Data

We'll separate the header row from the data and keep only the data rows in *hn*:

In [2]:
headers = hn[0]
hn = hn[1:]

print(headers)
print('\n')
print(hn[0])
print(hn[1])
print(hn[2])
print(hn[3])
print(hn[4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']
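The same header split can also be written with extended iterable unpacking. This is a minimal sketch on a trimmed, hypothetical three-column table (the real file has seven columns), not a change to the cell above.

```python
# Hypothetical trimmed table: the real data set has seven columns.
table = [
    ['id', 'title', 'num_comments'],
    ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', '0'],
    ['12579005', 'SQLAR  the SQLite Archiver', '0'],
]

# The first row goes to `headers`; the starred name collects the remaining rows.
headers, *rows = table

print(headers)
print(len(rows))
```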


## Split Post Categories

We'll create three lists for ask posts, show posts, and other posts, and assign each row to one of them based on how its title starts.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of ask posts: ' + str(len(ask_posts)))
print('Number of show posts: ' + str(len(show_posts)))
print('Number of other posts: ' + str(len(other_posts)))

print('Total number of posts in data set: ' + str(len(hn)))

Number of ask posts: 9139
Number of show posts: 10158
Number of other posts: 273822
Total number of posts in data set: 293119


The three groups add up to the 293,119 posts in our data set: 9,139 ask posts, 10,158 show posts, and 273,822 other posts.
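The prefix check can be isolated in a small function, which makes it easy to try on individual titles. The sketch below assumes a hypothetical helper named `classify_title` and applies the same case-insensitive `startswith` test as the loop above to three titles taken from the data.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Finding puns computationally',
    'algorithmic music',
]

labels = [classify_title(title) for title in titles]
print(labels)
```

Because the title is lowercased first, a title such as `ASK HN: ...` would also be counted as an ask post.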

Let's check the first three rows of each group:

In [4]:
print(ask_posts[0])
print(ask_posts[1])
print(ask_posts[2])
print('\n')
print(show_posts[0])
print(show_posts[1])
print(show_posts[2])
print('\n')
print(other_posts[0])
print(other_posts[1])
print(other_posts[2])

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']
['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']


['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']
['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']
['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  th

## Compute Comment Averages
Next, let's determine if ask posts or show posts receive more comments on average.

In [5]:
# ask comments
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])  # index 4 is the num_comments column
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average ask comments: {}'.format(avg_ask_comments))

# show comments
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print('Average show comments: {}'.format(avg_show_comments))

# other comments 
total_other_comments = 0

for row in other_posts:
    total_other_comments += int(row[4])
    
avg_other_comments = total_other_comments / len(other_posts)
print('Average other comments: {}'.format(avg_other_comments))

Average ask comments: 10.393478498741656
Average show comments: 4.886099625910612
Average other comments: 6.4572678601427205


On average, ask posts receive about 10.4 comments, roughly twice the 4.9 that show posts receive. This is plausible, since ask posts explicitly invite answers while show posts do not.

Since **ask posts** are more likely to receive comments, we'll focus the remaining analysis on these posts.
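The three loops above differ only in the list they iterate over, so the calculation can be factored into one function. This is a sketch with a hypothetical helper, `average_column`, run on the three ask posts printed earlier rather than on the full data set.

```python
def average_column(rows, index):
    """Average the integer values stored as strings in column `index`.

    Raises ZeroDivisionError if `rows` is empty.
    """
    total = sum(int(row[index]) for row in rows)
    return total / len(rows)


# The first three ask posts from the data set.
sample_ask_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]

avg_comments = average_column(sample_ask_posts, 4)  # num_comments column
avg_points = average_column(sample_ask_posts, 3)    # num_points column

print('Average comments: {:.2f}'.format(avg_comments))
print('Average points: {:.2f}'.format(avg_points))
```

The same function works for the points column, so it would cover the points analysis as well.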


## Compute Points Averages

As mentioned in the introduction, we'll also compare points for Ask HN and Show HN posts as a side analysis. Let's check which category gets more points on average.

*Points = Upvotes - Downvotes*

In [6]:
# ask points
total_ask_points = 0

for row in ask_posts:
    total_ask_points += int(row[3]) # index 3 for number of points column
    
avg_ask_points = total_ask_points / len(ask_posts)
print('Average ask points: {}'.format(avg_ask_points))

# show points
total_show_points = 0

for row in show_posts:
    total_show_points += int(row[3])
    
avg_show_points = total_show_points / len(show_posts)
print('Average show points: {}'.format(avg_show_points))

# other points
total_other_points = 0

for row in other_posts:
    total_other_points += int(row[3])
    
avg_other_points = total_other_points / len(other_posts)
print('Average other points: {}'.format(avg_other_points))

Average ask points: 11.31174089068826
Average show points: 14.843571569206537
Average other points: 15.156010108756783


Show HN posts receive more points on average than Ask HN posts (about 14.8 versus 11.3). A possible explanation is that readers tend to vote on a show post rather than comment on it. Remember that the goal of this project is to find the best hour to post in order to get the most comments on our posts.

## Frequency Table For Hours
Next, we'll determine if ask posts created at a certain *time* are more likely to attract comments.

In [7]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6] # index 6 for created_at column
    num_of_comments = int(row[4]) 
    num_of_points = int(row[3])
    result_list.append([created_at, num_of_comments, num_of_points])

counts_by_hour = {}
comments_by_hour = {}
points_by_hour = {}

for row in result_list:
    dt_object = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = dt_object.strftime('%H')
    comments = int(row[1])
    points = int(row[2])
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
        points_by_hour[hour] = points
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        points_by_hour[hour] += points
        
print(counts_by_hour)
print('\n')
print(comments_by_hour)
print('\n')
print(points_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


{'02': 2944, '01': 2662, '22': 3601, '21': 5042, '19': 4782, '17': 7155, '15': 13978, '14': 5390, '13': 7962, '11': 2856, '10': 3789, '09': 1763, '07': 2040, '03': 2539, '23': 2616, '20': 4491, '16': 5970, '08': 2744, '00': 2835, '18': 6850, '12': 4643, '04': 2650, '06': 2030, '05': 2046}


- **counts_by_hour**: the number of ask posts created during each hour of the day.
- **comments_by_hour**: the total number of comments received by ask posts created during each hour.
- **points_by_hour**: the total number of points received by ask posts created during each hour.
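To see how the three dictionaries are filled, here is a small sketch that uses `collections.defaultdict` instead of the explicit `if hour not in ...` check. The first three sample rows come from the ask posts shown earlier; the fourth is hypothetical and only there so that one hour appears twice.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments, num_points]
sample_rows = [
    ['9/26/2016 2:53', 7, 4],
    ['9/26/2016 1:17', 3, 6],
    ['9/25/2016 22:57', 0, 1],
    ['9/24/2016 2:05', 5, 2],  # hypothetical row
]

# Missing keys start at 0, so no membership check is needed.
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)
points_by_hour = defaultdict(int)

for created_at, comments, points in sample_rows:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '02'.
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += comments
    points_by_hour[hour] += points

print(dict(counts_by_hour))
print(dict(comments_by_hour))
print(dict(points_by_hour))
```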


### Compute Averages for Comments and Points Based on Hours
Next, we'll use these three dictionaries to calculate the *average* number of comments and points for ask posts created during each hour of the day.

We'll do that by dividing the total number of comments (or points) for each hour by the number of posts created during that hour.

In [8]:
avg_comments_by_hour = []
avg_points_by_hour = []

for hour in counts_by_hour:
    avg_comments_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    avg_points_by_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])

print(avg_comments_by_hour)
print('\n')
print(avg_points_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


[['02', 10.944237918215613], ['01', 9.439716312056738], ['22', 9.402088772845953], ['21', 9.733590733590734], ['19', 8.66304347826087], ['17', 12.189097103918229], ['15', 21.637770897832816], ['14', 10.50682261208577], ['13', 17.93243243243243], ['11', 9.153846153846153], ['10', 13.436170212765957], ['09', 7.941441441441442], ['07', 9.02654867256
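As a quick check of the division, the sketch below recomputes the comment averages for three hours, using the totals and counts copied from the dictionaries printed earlier. The variable name `avg_by_hour` is only for illustration.

```python
# Totals and counts copied from the earlier output for hours 15, 13, and 12.
counts_by_hour = {'15': 646, '13': 444, '12': 342}
comments_by_hour = {'15': 18525, '13': 7245, '12': 4234}

# Average = total comments for the hour / number of posts in that hour.
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in counts_by_hour
}

for hour, value in avg_by_hour.items():
    print('{}: {:.2f}'.format(hour, value))
```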

### Swap Columns and Print Top 5 Averages
Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [11]:
swap_avg_comments_by_hour = []
swap_avg_points_by_hour = []

# swap columns 
for row in avg_comments_by_hour:
    swap_avg_comments_by_hour.append([row[1], row[0]])
for row in avg_points_by_hour:
    swap_avg_points_by_hour.append([row[1], row[0]])
    
print(swap_avg_comments_by_hour)
print('\n')
sorted_swap_comments = sorted(swap_avg_comments_by_hour, reverse=True)
print(swap_avg_points_by_hour)
print('\n')
sorted_swap_points = sorted(swap_avg_points_by_hour, reverse=True)

print('Top 5 Hours for Ask Posts Comments')

# print top hour for comments
for row in sorted_swap_comments[:5]:
    hour_dt = dt.datetime.strptime(row[1], '%H')
    hour = hour_dt.strftime('%H:%M')
    string = '{}: {:.2f} average comments per post.'.format(hour, row[0])
    print(string)

print('\n')
print('Top 5 Hours for Ask Posts Points')

# print top hour for points
for row in sorted_swap_points[:5]:
    hour_dt = dt.datetime.strptime(row[1], '%H')
    hour = hour_dt.strftime('%H:%M')
    string = '{}: {:.2f} average points per post.'.format(hour, row[0])
    print(string)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


[[10.944237918215613, '02'], [9.439716312056738, '01'], [9.402088772845953, '22'], [9.733590733590734, '21'], [8.66304347826087, '19'], [12.189097103918229, '17'], [21.637770897832816, '15'], [10.50682261208577, '14'], [17.93243243243243, '13'], [9.153846153846153, '11'], [13.436170212765957, '10'], [7.941441441441442, '09'], [9.026548672566372, 

Now we can clearly see the top 5 hours to post an ask post on Hacker News in order to get the most comments. The top hour is **15:00**, but keep in mind that the time zone for this data set is *Eastern Time in the US*.
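Swapping the columns is one way to sort by the average. An alternative sketch is to leave the `[hour, average]` pairs as they are and pass a `key` function to `sorted`. The six pairs below are copied from the averages computed above.

```python
# A subset of the [hour, average comments] pairs computed earlier.
avg_comments_by_hour = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['10', 10.684397163120567],
    ['09', 6.653153153153153],
    ['12', 12.380116959064328],
]

# Sort by the second element (the average), highest first, and keep five.
top_five = sorted(avg_comments_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, average in top_five:
    print('{}:00: {:.2f} average comments per post.'.format(hour, average))

top_hours = [hour for hour, _ in top_five]
```

Sorting with a key avoids building the intermediate swapped lists and keeps each pair in its original order.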

#### Convert to GMT+3 Time Zone


In [20]:
# print top hour for comments
print('TOP HOURS FOR COMMENTS AND POINTS FOR GMT+3 TIMEZONE \n')
for row in sorted_swap_comments[:5]:
    hour_dt = dt.datetime.strptime(row[1], '%H')
    hour_dt += dt.timedelta(hours=8)
    hour = hour_dt.strftime('%H:%M')
    string = '{}: {:.2f} average comments per post.'.format(hour, row[0])
    print(string)

print('\n')
print('Top 5 Hours for Ask Posts Points')

# print top hour for points
for row in sorted_swap_points[:5]:
    hour_dt = dt.datetime.strptime(row[1], '%H')
    hour_dt += dt.timedelta(hours=8)
    hour = hour_dt.strftime('%H:%M')
    string = '{}: {:.2f} average points per post.'.format(hour, row[0])
    print(string)

TOP HOURS FOR COMMENTS AND POINTS FOR GMT+3 TIMEZONE 

23:00: 28.68 average comments per post.
21:00: 16.32 average comments per post.
20:00: 12.38 average comments per post.
10:00: 11.14 average comments per post.
18:00: 10.68 average comments per post.


Top 5 Hours for Ask Posts Points
23:00: 21.64 average points per post.
21:00: 17.93 average points per post.
20:00: 13.58 average points per post.
18:00: 13.44 average points per post.
01:00: 12.19 average points per post.


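There is an 8-hour difference between US Eastern Standard Time (UTC-5) and the GMT+3 time zone, so the code above adds 8 hours to each hour. (I currently reside in Turkey, so this is the time zone I care about.) Note that during US daylight saving time (UTC-4), the difference is 7 hours.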
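Instead of adding a fixed number of hours, the offsets can be made explicit with `datetime.timezone`. This is a sketch that assumes Eastern Standard Time (UTC-5) as the source; the helper name `to_gmt_plus_3` and the placeholder date are hypothetical.

```python
import datetime as dt

EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5))  # assumption: EST, not EDT
EASTERN_DAYLIGHT = dt.timezone(dt.timedelta(hours=-4))
GMT_PLUS_3 = dt.timezone(dt.timedelta(hours=3))


def to_gmt_plus_3(hour_string, source_tz=EASTERN_STANDARD):
    """Convert an hour such as '15' from `source_tz` to GMT+3 as 'HH:MM'."""
    # The date is an arbitrary placeholder; only the hour matters here.
    local = dt.datetime(2016, 1, 1, int(hour_string), 0, tzinfo=source_tz)
    return local.astimezone(GMT_PLUS_3).strftime('%H:%M')


converted = [to_gmt_plus_3(hour) for hour in ['15', '13', '17']]
print(converted)

# The same hour during daylight saving time lands one hour earlier.
print(to_gmt_plus_3('15', source_tz=EASTERN_DAYLIGHT))
```

Hours that cross midnight, such as 17:00 Eastern, wrap around to the next day correctly because the conversion is done on full `datetime` objects.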

## Conclusions



Our goals were to find out whether Ask HN or Show HN posts attract more comments, and to identify the best hour to post in order to get the most comments.

We found that Ask HN posts receive more comments on average than Show HN posts, so we based the hourly analysis on Ask HN posts.

As the averages above show, to attract the most comments for an **Ask HN** post, the best time to post is **23:00** in GMT+3 (15:00 US Eastern Time). Ask posts created during that hour also receive the most points on average, so we can kill two birds with one stone.