# Exploring Hacker News Posts

In this project, we will analyze a data set of submissions to the popular technology site Hacker News.

Hacker News is a site where user-submitted stories ('posts') are voted and commented upon, similar to Reddit.

Hacker News is extremely popular in technology and startup circles.

We are specifically interested in posts whose titles begin with either __'Ask HN'__ or __'Show HN'__. The __'Ask HN'__ posts are used to ask the Hacker News community a specific question. For example:

- __Ask HN__: How to improve my personal website?
- __Ask HN__: Am I the only one outraged by Twitter shutting down share counts?
- __Ask HN__: Aby recent changes to CSS that broke mobile?

The __'Show HN'__ posts are used to show the Hacker News community a project, a product, or just something generally interesting. For example:

- __Show HN__: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
- __Show HN__: Something pointless I made
- __Show HN__: Shanhu.io, a programming playground powered by e8vm

We want to compare these two types of posts to determine the following:

- Do __'Ask HN'__ or __'Show HN'__ posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [1]:
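#Importing the csv reader, loading the data set, and showing the first five rows
from csv import reader

#A context manager closes the file once it has been read
with open('dataset/hacker_news.csv') as file:
    #Calling reader(file) directly avoids rebinding the imported name `reader`
    hn = list(reader(file))

print(hn[:5])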

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
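
As an aside, the standard library's `csv.DictReader` can read the same kind of file into dictionaries keyed by column name. The following is a minimal sketch, not the approach used in the rest of this project: `sample_csv` and `sample_rows` are hypothetical names, and an in-memory `io.StringIO` stands in for the real file so that the example runs on its own.

```python
import csv
import io

# Two data rows copied from the output above; the real file has many more.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

# io.StringIO stands in for open('dataset/hacker_news.csv') in this sketch.
with io.StringIO(sample_csv) as sample_file:
    sample_rows = list(csv.DictReader(sample_file))

# Each row is a dict keyed by the header names; values are still strings.
print(sample_rows[0]['title'])
print(sample_rows[0]['num_comments'])
```

With this layout the header row is consumed automatically, at the cost of slightly more memory per row.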


### Removing Headers from a List of Lists

Notice that the first inner list contains the column headers, and each list after it contains the data for one row. We first need to remove the row containing the column headers.

In [2]:
#Removing headers from a list of lists
headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [3]:
hn = hn[1:]
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
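
The same split can also be written in one step with extended unpacking. This is only a sketch on a small hypothetical sample (`sample`, `sample_headers`, `sample_data`, and `title_index` are names introduced here, not part of the project code); it also shows how a column position might be looked up by name rather than hard-coded.

```python
# Hypothetical sample: the header row plus two data rows from the output above.
sample = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'],
]

# Extended unpacking: the first element becomes sample_headers,
# and every remaining element is collected into the list sample_data.
sample_headers, *sample_data = sample

# Look up a column position by name instead of remembering that title is index 1.
title_index = sample_headers.index('title')
print(title_index)
print(sample_data[0][title_index])
```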


### Extracting Ask HN and Show HN Posts

To find the posts that begin with either 'Ask HN' or 'Show HN', we will use the string method __startswith__. Given a string object, say string1, we can check whether it starts with a given prefix, say dq, by calling string1.startswith(dq) and inspecting the value it returns.

If string1 starts with dq, the call returns True; otherwise it returns False. Note that the comparison is case-sensitive.

In [4]:
print('somedata'.startswith('Some'))
print('somedata'.startswith('some'))

False
True


If we wish to ignore case, we can use the __lower__ method, which returns a lowercase version of the original string. Here's an example:

In [5]:
print('SomeData'.lower())

somedata


Let's use these methods to separate posts beginning with 'Ask HN' and 'Show HN' (and case variations) into two different lists.

In [6]:
#Extracting 'Ask HN' and 'Show HN' posts

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))  

1744
1162
17194
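
The prefix test used in the loop above can also be wrapped in a small function, which makes it easy to check on individual titles. This is a minimal sketch: `classify_title` is a hypothetical helper introduced for illustration and is not part of the project code.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


print(classify_title('Ask HN: How to improve my personal website?'))  # ask
print(classify_title('SHOW HN: Something pointless I made'))          # show
print(classify_title('Interactive Dynamic Video'))                    # other
```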


In [7]:
#Checking the first five rows of ask_posts
ask_posts[:5]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

### Calculating the Average Number of Comments for 'Ask HN' and 'Show HN' posts

Above, we separated the ask posts and the show posts into two lists of lists named __ask_posts__ and __show_posts__.

Now, let's determine if ask posts or show posts receive more comments on average.

In [8]:
#Calculating the Average Number of Comments for 'Ask HN' 
#and 'Show HN' posts

total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_ask_comments = total_ask_comments/(len(ask_posts))    
avg_show_comments = total_show_comments/(len(show_posts))

print(avg_ask_comments)    
print(avg_show_comments)

14.038417431192661
10.31669535283993


We can see that ask posts receive more comments on average, about 14.04 per post compared with 10.32 for show posts. Perhaps this is because users try to help others with their questions.

Now we will focus our remaining analysis just on these posts (ask posts).
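
The averaging step used above can be factored into a reusable helper. The sketch below assumes the same row layout as the data set (number of comments at index 4); `average_comments` and `sample_ask` are hypothetical names, and the three sample rows are taken from the first ask posts shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Three ask posts copied from the output shown earlier (6, 29, and 1 comments).
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_ask))  # 12.0
```

Guarding against an empty list avoids a `ZeroDivisionError` if a category happens to contain no posts.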

### Finding the Number of Ask Posts and Comments by Hour Created

We will determine if ask posts created at a certain time are more likely to attract comments. We will use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In [9]:
#Finding the number of ask posts and comments by
#hour created

import datetime as dt
result_list = []
counts_by_hour = {}
comments_by_hour = {}

for row in ask_posts:
    created_at = row[6]
    number_comments = int(row[4])
    result_list.append([created_at, number_comments])

for row in result_list:
    object_date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    comment_number = row[1]
    hour = dt.datetime.strftime(object_date, "%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment_number
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment_number     

In [10]:
counts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [11]:
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
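
The counting logic above can be written more compactly with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is a sketch under stated assumptions, not a replacement for the cell above: `sample_results`, `sample_counts`, and `sample_comments` are hypothetical names, and the five pairs are taken from the ask posts shown earlier.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical sample of [created_at, num_comments] pairs from rows shown earlier.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
]

# defaultdict(int) starts every missing key at 0, so += works on first sight.
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, n_comments in sample_results:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour string, e.g. '09'
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))
print(dict(sample_comments))
```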

### Calculating the Average Number of Comments for Ask HN Posts by Hour

Previously, we created two dictionaries: __counts_by_hour__, which contains the number of ask posts created during each hour of the day, and __comments_by_hour__, which contains the total number of comments received by the ask posts created during each hour.

Now, we will use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [12]:
#Calculating the Average Number of Comments for Ask HN
#Post by Hour
avg_by_hour = []

for row in comments_by_hour:
    avg_by_hour.append([row, comments_by_hour[row] / counts_by_hour[row]])

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


### Sorting and Printing Values from a List of Lists

Above, we calculated the average number of comments for posts created during each hour of the day and stored the results in a list of lists named __avg_by_hour__.

However, this format makes it hard to identify the hours with the highest values.

Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read and understand.

In [13]:
#Sorting and printing values
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments')

for row in sorted_swap[0:5]:
    print('{}: {:.2f} average comments per post'.format(dt.datetime.strptime(row[1], "%H").strftime("%H:%M"), row[0]))


Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
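
Swapping the columns is one way to sort by the average. An alternative is to pass a `key` function to `sorted`, which leaves the `[hour, average]` pairs in their original order. The sketch below uses a hypothetical `sample_avg` list with four pairs copied from the __avg_by_hour__ output; `top` is likewise a name introduced only for this example.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
sample_avg = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# key= tells sorted() to compare on the average (index 1) instead of the hour.
top = sorted(sample_avg, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

On this sample the loop prints the 15:00, 02:00, and 16:00 entries, in that order.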


We can see that the hour with the most comments per post on average is 15:00, with 38.59 comments per post. In this data set, that makes 15:00 the best hour to post a question if the goal is to attract comments.

### Conclusion

In this short project, we analyzed ask posts to find out which part of the day is the best one to post a question. We found that 15:00 (3 p.m.) is the best hour of the day to do so, with 38.59 average comments per post.

Posting in the afternoon, around 15:00 or 16:00, is a reasonable recommendation: 16:00 also ranks in the top five, with 16.80 average comments per post.