# Exploring Hacker News Posts

In this project, I'll be working with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/ "Hacker News").

![](https://volument.com/blog/img/hn-dirt-big.png)

***

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/ "Y Combinator"), where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

***

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts/ "Hacker News data and column descriptions"). Note that for this project the data has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- **id**: The unique identifier from Hacker News for the post
- **title**: The title of the post
- **url**: The URL that the post links to, if the post has a URL
- **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments**: The number of comments that were made on the post
- **author**: The username of the person who submitted the post
- **created_at**: The date and time at which the post was submitted
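
To make the column descriptions concrete, here is a minimal sketch that reads the header and one row (copied from the output further down) from an in-memory string and accesses the fields by column name. It uses `csv.DictReader` purely for illustration; the project itself uses `csv.reader`, and `sample_csv` is a stand-in for the real file.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv: the header plus one row from the data set.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader maps each row to a dictionary keyed by the header names.
rows = list(csv.DictReader(sample_csv))
first = rows[0]

# Every value is read as a string, so numeric columns need an explicit conversion.
num_comments = int(first["num_comments"])
print(first["title"], num_comments)
```

Note that all values come back as strings, which is why the analysis below converts `num_comments` with `int()` before adding anything up.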

***

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question.

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?


In [12]:
# import the reader function from the csv module 
from csv import reader
import datetime as dt 

In [6]:
# Open the CSV file; the with block closes it automatically
with open('hacker_news.csv') as opened_file:
    # Read every row into a list of lists
    hn_raw = list(reader(opened_file))
# Separate the header row from the data rows
headers = hn_raw[0]
hn = hn_raw[1:]
# Print the header and the first 5 rows
print(headers)
print('\n')
print(hn[:5])



['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Now that I've removed the header row from the list of lists, I'm ready to filter the data. Since I'm only concerned with post titles beginning with *Ask HN* or *Show HN*, I'll create new lists of lists containing just the data for those titles.
***
To find the posts that begin with either *Ask HN* or *Show HN*, I'll use the string method **startswith**. Given a string object, it returns `True` if the string begins with the specified prefix and `False` otherwise.

Capitalization matters. To control for case, we can use the **lower** method, which returns a lowercase version of the string, before checking the prefix.
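
As a small illustration of why the case handling matters, the sketch below applies `startswith` with and without `lower()`. The titles in `sample_titles` are hypothetical, not rows from the data set.

```python
# Hypothetical sample titles, not rows from the data set.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "ask hn: best laptop for development?",
    "Show HN: My weekend project",
    "Interactive Dynamic Video",
]

# startswith is case-sensitive, so the all-lowercase "ask hn" title is missed here.
case_sensitive = [t for t in sample_titles if t.startswith("Ask HN")]

# Lowercasing the title first makes the prefix check case-insensitive.
case_insensitive = [t for t in sample_titles if t.lower().startswith("ask hn")]

print(len(case_sensitive))    # 1
print(len(case_insensitive))  # 2
```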

In [7]:
ask_posts = [] 
show_posts = [] 
other_posts = [] 
# Loop through each row in hn 
for row in hn: 
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else: 
        other_posts.append(row)
# Check the number of posts in each list
num_ask_posts = len(ask_posts)
num_show_posts = len(show_posts)
num_other_posts = len(other_posts)

print(num_ask_posts)
print('\n')
print(num_show_posts)
print('\n')
print(num_other_posts)
    

1744


1162


17194


In [11]:
total_ask_comments = 0 
for row in ask_posts:
    a_comment = int(row[4])
    total_ask_comments += a_comment

# Compute the average number of comments for ask posts
avg_ask_comments = total_ask_comments / num_ask_posts
print("The average of comments on ask posts", avg_ask_comments)


total_show_comments = 0 
for row in show_posts:
    a_comment = int(row[4])
    total_show_comments += a_comment

# Compute the average number of comments for show posts
avg_show_comments = total_show_comments / num_show_posts
print("The average of comments on show posts", avg_show_comments)


The average of comments on ask posts 14.038417431192661
The average of comments on show posts 10.31669535283993


**Do show posts or ask posts receive more comments on average?**

Looking at the results printed above, ask posts receive more comments on average (about 14.0) than show posts (about 10.3). This is an intuitive result, because ask posts pose a question and so invite users to respond.
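
The two averaging loops above repeat the same logic, so the calculation could also be sketched as one reusable function. The name `average_comments`, its `comments_index` parameter, and the rows in `sample_rows` are hypothetical and only serve this illustration.

```python
def average_comments(rows, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    # The comment counts are stored as strings, so convert each one before summing.
    return sum(int(row[comments_index]) for row in rows) / len(rows)


# Hypothetical rows in the same column order as the data set.
sample_rows = [
    ["1", "Ask HN: example A", "", "5", "10", "user_a", "1/1/2016 10:00"],
    ["2", "Ask HN: example B", "", "3", "20", "user_b", "1/1/2016 11:00"],
]

print(average_comments(sample_rows))  # 15.0
```

With the lists built earlier, the same function could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.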

Since ask posts are more likely to receive comments, I'll focus the remaining analysis on these posts only.

Next, I'll determine if ask posts created at a certain time are more likely to attract comments. I'll use the following steps to perform this analysis:

1) Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

2) Calculate the average number of comments ask posts receive by hour created.
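
Both steps depend on extracting the hour from the `created_at` string. A minimal sketch of that step, using a sample value in the same format as the data shown earlier:

```python
import datetime as dt

# Sample value in the same format as the created_at column.
created_at = "8/4/2016 11:52"

# "%m/%d/%Y %H:%M" matches month/day/year followed by a 24-hour time.
# strptime also accepts month and day values without a leading zero.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

# strftime("%H") returns the hour as a zero-padded, two-character string.
hour = parsed.strftime("%H")
print(hour)  # 11
```

Keeping the hour as a zero-padded string means it can be used directly as a dictionary key, and the keys sort in chronological order.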




In [15]:
result_list = [] 
for row in ask_posts: 
    created_at = row[6]
    a_comment = int(row[4])
    two_element_list = [created_at, a_comment]
    result_list.append(two_element_list)

# Create two empty dictionaries
counts_by_hour = {}
comments_by_hour = {} 
for row in result_list:
    dt_ob = row[0]
    dt_o = dt.datetime.strptime(dt_ob, "%m/%d/%Y %H:%M")
    post_time = dt_o.strftime("%H")
    
    if post_time not in counts_by_hour:
        counts_by_hour[post_time] = 1
        comments_by_hour[post_time] = int(row[1]) 
    else:
        counts_by_hour[post_time] += 1
        comments_by_hour[post_time] += int(row[1])
        

print(counts_by_hour)
print('\n')
print(comments_by_hour)       
    
    
    

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


Above I created two dictionaries: 

- *counts_by_hour* : contains the number of ask posts created during each hour of the day.

- *comments_by_hour* : contains the corresponding number of comments ask posts created at each hour received.

Next, I'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.
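
The calculation is a per-key division of one dictionary by the other. As a rough sketch, here it is as a dictionary comprehension over two of the hours from the output above; `counts_sample`, `comments_sample`, and `avg_sample` are names introduced only for this example.

```python
# Two entries copied from the dictionaries printed above.
counts_sample = {"15": 116, "02": 58}
comments_sample = {"15": 4477, "02": 1381}

# For each hour, divide the total number of comments by the number of posts.
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in counts_sample
}

for hour, avg in avg_sample.items():
    print(hour, round(avg, 2))
```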


In [16]:
# Calculate the average number of comments per post for posts created during each hour of the day
avg_by_hour = [] 
for key in comments_by_hour:
    avg_comments = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, avg_comments])
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Above, I calculated the average number of comments for posts created during each hour of the day and stored the results in a list of lists named **avg_by_hour**. Next, I'll sort the results and print the top five hours in a more readable format.
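
One way to sort such a list by its second column is to swap the columns first, as the next cell does. An alternative sketch, shown here only for comparison, passes a `key` function to `sorted` and leaves the columns in place. The values in `avg_sample` are a rounded subset of the averages above.

```python
# A rounded subset of the [hour, average] pairs printed above.
avg_sample = [["09", 5.58], ["15", 38.59], ["02", 23.81], ["20", 21.53]]

# key selects the average (index 1) as the sort value; reverse=True puts the largest first.
top = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```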

In [21]:

swap_avg_by_hour = []
for row in avg_by_hour: 
    first_e = row[0]
    second_e = row[1]
    elements = [second_e, first_e]
    swap_avg_by_hour.append(elements)
print(swap_avg_by_hour)

#Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Top 5 Hours for Ask posts Comments", sorted_swap)

for row in sorted_swap[:5]:
    f_ho = dt.datetime.strptime(row[1], "%H")
    f_h = f_ho.strftime("%H:%M")
    string = "{}: {:.2f} average comments per post".format(f_h, row[0])
    print(string)
    
    
    






[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
Top 5 Hours for Ask posts Comments [[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913,

# Conclusion


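Based on the results above, ask posts created in the *1500* hour receive by far the most comments, about 38.6 per post on average, followed by posts created in the early hours of the morning at *0200* (about 23.8).

The *2000*, *1600*, and *2100* hours round out the top five, averaging about 21.5, 16.8, and 16.0 comments per post respectively, so the late afternoon and the evening are also good times to post.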

