# Exploring Hacker News Posts

Identifying How and When to Post on Hacker News

In this project, we are tasked with recommending what to post on Hacker News, and when, in order to reach the most people. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts whose titles begin with either 'Ask HN' or 'Show HN'. Users submit 'Ask HN' posts to ask the Hacker News community a specific question and 'Show HN' to show off projects or information relevant to the community.

We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?
* Does this behavior change with points?

The dataset can be downloaded here.

Descriptions of the columns:
* <mark>id</mark>: the unique identifier from Hacker News for the post
* <mark>title</mark>: the title of the post
* <mark>url</mark>: the URL that the post links to, if the post has a URL
* <mark>num_points</mark>: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* <mark>num_comments</mark>: the number of comments on the post
* <mark>author</mark>: the username of the person who submitted the post
* <mark>created_at</mark>: the date and time of the post's submission
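
As a minimal sketch of how these column names map onto the data, the snippet below reads rows by name with `csv.DictReader` instead of by position. It uses an in-memory sample (two rows copied from the sample output shown further down) rather than the real file, and `read_posts` is a hypothetical helper name introduced only for illustration.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv: the header plus two rows
# copied from the sample output in this notebook.
SAMPLE_CSV = """id,title,url,num_points,num_comments,author,created_at
12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
11964716,Florida DJs May Face Felony for April Fools' Water Joke,http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/,2,1,vezycash,6/23/2016 22:20
"""


def read_posts(csv_text):
    """Parse CSV text into a list of dicts keyed by column name (hypothetical helper)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


posts = read_posts(SAMPLE_CSV)
first = posts[0]

# Every value is read as a string, so numeric columns need an explicit int().
print(first["title"], int(first["num_comments"]))
```

The rest of this notebook keeps the plain list-of-lists representation, where `num_comments` is index 4 and `created_at` is index 6.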

In [1]:
from csv import reader

# Read the whole file into a list of rows; the with-block closes the file afterwards.
with open('hacker_news.csv', newline='') as opened_file:
    dataset = list(reader(opened_file))
#print(dataset[:5])

In [2]:
headers = dataset[:1]
hn = dataset[1:]
print("Headers")
print(headers)
print('\n')

print("Sample Data")
for row in hn[:3]:
    print(row)
    print('\n')
    

Headers
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


Sample Data
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']




# Extracting Ask HN and Show HN Posts

Before filtering the data, we separated the header row from the rest of the dataset and stored it in a variable named headers.

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith. Since startswith is case sensitive, we'll first convert each title to lowercase with the lower method so that differences in capitalization don't cause missed matches.
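
The case-insensitive prefix check described above can be sketched in isolation as follows. The function name `classify_title` and the example titles are hypothetical and exist only to illustrate the behavior of `lower` combined with `startswith`.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Without lowercasing, a differently capitalized prefix is not matched.
print("ask hn: what editor do you use?".startswith("Ask HN"))  # False

# Hypothetical titles, made up for illustration.
print(classify_title("ASK HN: What editor do you use?"))  # ask
print(classify_title("Show HN: My weekend project"))      # show
print(classify_title("Interactive Dynamic Video"))        # other
```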

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Number of Posts in 'ask_posts': ", len(ask_posts))
print("Number of Posts in 'show_posts': ", len(show_posts))
print("Number of Posts in 'other_posts': ", len(other_posts))

Number of Posts in 'ask_posts':  1744
Number of Posts in 'show_posts':  1162
Number of Posts in 'other_posts':  17194


In [4]:
# comments in ask posts

total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)    
print("average number of comments on ask posts:", avg_ask_comments)

# comments in show posts

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print("average number of comments on show posts:", avg_show_comments)

average number of comments on ask posts: 14.038417431192661
average number of comments on show posts: 10.31669535283993


On average, Ask HN posts receive more comments than Show HN posts: roughly 14 comments per Ask HN post versus roughly 10 per Show HN post.
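
Because the same averaging logic is applied to both lists, it could be factored into a small helper. The sketch below is one possible way to do that; `average_comments` and `sample_rows` are hypothetical names, and the sample rows are abbreviated versions of two rows from the sample output (the URL field is left empty for brevity).

```python
def average_comments(rows, comments_index=4):
    """Return the mean of the integer comment counts in rows, or 0.0 if rows is empty."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Abbreviated rows: num_comments sits at index 4, as in the full dataset.
sample_rows = [
    ['12224879', 'Interactive Dynamic Video', '', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", '', '2', '1', 'vezycash', '6/23/2016 22:20'],
]

print(average_comments(sample_rows))  # (52 + 1) / 2 = 26.5
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to give the same averages as the two loops above.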

# Finding the Number of Ask Posts and Comments by Hour Created

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

We'll use the datetime module to work with the data in the created_at column.
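
As a small illustration of that parsing step, the sketch below extracts the hour from two `created_at` values taken from the sample output. The format string matches the month/day/year and 24-hour time layout seen in the data; `hour_created` and `CREATED_AT_FORMAT` are hypothetical names used only for this example.

```python
import datetime as dt

# Layout of the created_at column, e.g. '8/4/2016 11:52'.
CREATED_AT_FORMAT = "%m/%d/%Y %H:%M"


def hour_created(created_at):
    """Parse a created_at string and return the hour of day as an int (0-23)."""
    return dt.datetime.strptime(created_at, CREATED_AT_FORMAT).hour


print(hour_created("8/4/2016 11:52"))   # 11
print(hour_created("6/23/2016 22:20"))  # 22
```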

In [5]:
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result = [created_at,num_comments]
    result_list.append(result)

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.hour
    n_comments = row[1]
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments

print("Number of posts per each hour of the day:", counts_by_hour)
print("Number of comments per each hour of the day:", comments_by_hour)
print('\n')

# Calculate the average number of comments per post for each hour

avg_by_hour = []

for hour in counts_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])

print(avg_by_hour)

Number of posts per each hour of the day: {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
Number of comments per each hour of the day: {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}


[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 

To sort the hours by their average number of comments, we'll create a copy of avg_by_hour with the two columns swapped and store it in a variable named swap_avg_by_hour.

In [6]:
# Create swapped list

swap_avg_by_hour = []
for row in avg_by_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_by_hour.append([avg, hour])

# Sort by average comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Print 'Top 5 hours for Ask Posts Comments'
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(str(row[1]), "%H")
    hour = hour.strftime("%H:%M")
    print("{0}: {1:.2f} average comments per post".format(hour, row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
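
The same ranking could also be produced without building a swapped list, by passing a key function to `sorted`. The sketch below shows this alternative on a handful of `[hour, average]` pairs; the values are rounded from the output above, and `avg_by_hour_sample` is a hypothetical name used only here.

```python
# A few [hour, average comments] pairs, rounded from the notebook output.
avg_by_hour_sample = [[9, 5.58], [13, 14.74], [15, 38.59], [2, 23.81], [20, 21.52]]

# Sort on the second element (the average) instead of swapping the columns.
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f"{hour:02d}:00: {avg:.2f} average comments per post")
# 15:00: 38.59 average comments per post
# 02:00: 23.81 average comments per post
# 20:00: 21.52 average comments per post
```

One difference from the swapped-list approach is how ties are handled: sorting the swapped pairs falls back to comparing hours when two averages are equal, while the key-based sort keeps tied entries in their original order.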
