# Hacker News Project

In this project we explore posts from the Hacker News website. We want to find out which posts perform better, Ask HN posts or Show HN posts, and which posting times generate the most engagement.

In [19]:
from csv import reader

# Read the whole file into a list of rows; the with-block closes the file.
with open('hacker_news.csv') as open_file:
    hn = list(reader(open_file))

In [20]:
#First five rows of our data (the header row plus four posts):
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Let's separate the header row from the rest of the data.

In [21]:
headers = hn[0]
hn = hn[1:]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
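As a side note, the header row can also be pulled off while reading, before the rows are turned into a list. The sketch below is only an illustration: it uses an in-memory `io.StringIO` in place of `hacker_news.csv` so it runs on its own, and the variable names `sample_csv` and `rows` are assumptions made for this example.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv (header plus the first post shown above).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

csv_reader = csv.reader(sample_csv)
headers = next(csv_reader)   # consume the first row as the header
rows = list(csv_reader)      # everything that remains is data

print(headers)
print(rows[0])
```

With a real file, the same two steps would go inside a `with open(...)` block.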


In [22]:
print(hn[0:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Extracting Ask HN and Show HN Posts

We want to keep only posts whose titles begin with Ask HN or Show HN. To do this, we lowercase each title with the lower method and check its prefix with the startswith method, so the match is case-insensitive.

In [23]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Length of ask_posts: " + str(len(ask_posts)))
print("Length of show_posts: " + str(len(show_posts)))
print("Length of other_posts: " + str(len(other_posts)))

Length of ask_posts: 1744
Length of show_posts: 1162
Length of other_posts: 17194
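To see why lowercasing matters, here is a small sketch of the same prefix check on a few made-up titles. The titles and the `classify` helper are hypothetical and exist only for this illustration; they are not part of the dataset or the analysis above.

```python
# Hypothetical sample titles, not taken from the dataset.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
]

def classify(title):
    """Return 'ask', 'show', or 'other' based on the lowercased title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

labels = [classify(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Without the call to `lower`, the second title would fall into the "other" group because `startswith` is case-sensitive.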


## Finding Comments for Each Category

Next we are going to find the total and average number of comments for ask_posts and show_posts.

In [28]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    n_comments_ask = int(row[4])
    total_ask_comments += n_comments_ask
    
for row in show_posts:
    n_comments_show = int(row[4])
    total_show_comments += n_comments_show
    
print("Ask posts have a total of " + str(total_ask_comments) + " comments.")
print("Show posts have a total of " + str(total_show_comments) + " comments.")

#Calculating average comments
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print("Ask posts have " + str(avg_ask_comments) + " comments on average.")
print("Show posts have " + str(avg_show_comments) + " comments on average.")

Ask posts have a total of 24483 comments.
Show posts have a total of 11988 comments.
Ask posts have 14.038417431192661 comments on average.
Show posts have 10.31669535283993 comments on average.


As we can see, ask posts get about 14 comments on average, while show posts get about 10, meaning that ask posts generate more engagement per post.
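The two loops above do the same work for each category, so the calculation could also be wrapped in a small helper. This is a minimal sketch, assuming rows shaped like the ones in this dataset (comment count as a string at index 4); the function name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Average number of comments across posts; 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as the dataset.
sample_posts = [
    ["1", "Ask HN: A", "", "5", "10", "alice", "1/1/2016 10:00"],
    ["2", "Ask HN: B", "", "3", "4", "bob", "1/2/2016 11:00"],
]

avg = average_comments(sample_posts)
print(avg)  # 7.0
```

The empty-list guard avoids a ZeroDivisionError if a category happens to contain no posts.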

## Does Time Matter?

Next we are going to check whether the time of posting affects engagement. Because ask posts generated more comments on average, we will only use them from now on.

In [30]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    n_comments = int(row[4])
    result_list.append([created_at, n_comments])
#Printing first row 
print(result_list[0:1])

[['8/16/2016 9:55', 6]]


Above we created a list that pairs the time each post was created with the number of comments it received. Now we are going to tally these by hour in two dictionaries: one counting posts per hour and one summing comments per hour.

In [46]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    time = row[0]
    time = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    time = time.strftime("%H")
    n_comments = row[1]
    
    if time in counts_by_hour:
        counts_by_hour[time] += 1
    else:
        counts_by_hour[time] = 1
    if time in comments_by_hour:
        comments_by_hour[time] += n_comments
    else:
        comments_by_hour[time] = n_comments
print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
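The if/else bookkeeping above can be shortened with `collections.defaultdict`, which starts missing keys at zero. The following is a sketch of that alternative, not the approach used in this notebook. The first sample row comes from the output above; the other two are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# First row is from the notebook output; the other two are made up.
sample = [
    ["8/16/2016 9:55", 6],
    ["8/16/2016 9:10", 4],
    ["11/22/2015 13:43", 29],
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # comments per hour

for created_at, n_comments in sample:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. '09'
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))    # {'09': 2, '13': 1}
print(dict(comments))  # {'09': 10, '13': 29}
```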


Let's now create a list of lists that contains the hours when posts were created and the average number of comments for those posts. For each hour, this is the value in comments_by_hour divided by the value in counts_by_hour.

In [52]:
average_list = []
for hour in counts_by_hour:
    n_count = counts_by_hour[hour]
    n_comments = comments_by_hour[hour]
    average = n_comments / n_count
    average_list.append([hour, average])
print(average_list)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Now we have the average comments by hour, but in this format it is difficult to read anything from the list. Let's swap the columns so the average comes first, sort the list, and print the highest values in a format that is easier to read.

In [57]:
swap_average_list = []

for row in average_list:
    hour = row[0]
    comments = row[1]
    swap_average_list.append([comments, hour])
print(swap_average_list)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [61]:
#Sorting in descending order:

sorted_swap = sorted(swap_average_list, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
template = "{time}: {avg_comments:.2f} average comments per post"
for value in sorted_swap[0:5]:
    hour = value[1]
    hour = dt.datetime.strptime(hour, "%H")
    hour = hour.strftime("%H:%M")
    comments = value[0]
    print(template.format(time=hour, avg_comments = comments))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


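From this we can see that the top 5 hours for Ask posts, ranked by average comments per post, are 15:00, 02:00, 20:00, 16:00 and 21:00. The 15:00 hour stands out, averaging 38.59 comments per post, well ahead of the 23.81 at 02:00.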
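As an aside, the column swap is not strictly needed for sorting: `sorted` accepts a `key` function that picks the value to sort on. The sketch below is an alternative to the swap-then-sort steps, using four of the [hour, average] pairs from the output above; the name `top_hours` is an assumption made for this example.

```python
# A few [hour, average] pairs copied from the average_list output above.
average_list = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["16", 16.796296296296298],
]

# Sort by the average (index 1), highest first, without swapping columns.
top_hours = sorted(average_list, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

Sorting on an explicit key also avoids comparing the hour strings when two averages tie.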