# Exploring Hacker News Posts

Hacker News is a site similar to Reddit, where users submit articles and stories, which are then voted and commented on. Hacker News is very popular in the tech and startup world. In this project, we will explore a data set of posts on the Hacker News site. The data set can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). The full data set is not used in this analysis: submissions without any comments were removed, and a random sample was taken of the remaining submissions.

We will be looking specifically at posts whose titles begin with **Ask HN** or **Show HN**. These posts are submitted either to ask the community a specific question or to show the community something, respectively.

We will look at two specific questions about these posts:
* Do **Ask HN** or **Show HN** posts receive more comments on average?
* Do posts created at certain times receive more comments on average?

We will start by importing the libraries we need and reading the data into a list of lists. 

## Reading the Data

In [1]:
from csv import reader #import the reader definition from the csv module
open_file = open('hacker_news.csv') 
read_file = reader(open_file) #use reader to parse the open_file
hn = list(read_file) #convert the read_file into a list of lists

print(hn[:5])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
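The cell above never closes the file handle it opens. As a minimal sketch of one alternative, the same read can be wrapped in a `with` block, which closes the file automatically. The helper name `read_csv_rows` and the two-line temporary file standing in for `hacker_news.csv` are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Return every row of a CSV file as a list of lists of strings."""
    # newline='' is the setting recommended by the csv module documentation;
    # the with block closes the file even if reading raises an error.
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.reader(f))


# Tiny stand-in for hacker_news.csv: the header plus one row copied
# from the output shown above.
sample_text = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
with tempfile.NamedTemporaryFile(
    'w', suffix='.csv', delete=False, newline='', encoding='utf-8'
) as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

sample_rows = read_csv_rows(sample_path)
os.remove(sample_path)  # clean up the temporary file

print(sample_rows[0])
print(sample_rows[1])
```

With the real file, `hn = read_csv_rows('hacker_news.csv')` would produce the same list of lists as the cell above.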


Looking at the first 5 rows of data, we can see the first row contains the headers. We will need to remove them from our working set. 

In [2]:
headers = hn[0] #extract the first row and assign to variable headers
hn = hn[1:] #remove the headers row from the data set
print(headers)
print(hn[:5])


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Sorting the Data by Post Type
Now that we have a usable list to work with, we can begin sorting the data. First we will separate the posts whose titles begin with **Ask HN** or **Show HN** from all other posts.

In [3]:
ask_posts = [] #Create three empty lists to sort the data
show_posts = []
other_posts = []

for row in hn: 
    title = row[1] 
    if title.lower().startswith('ask hn'): #check what the title starts with and append to the appropriate list
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
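The check in the loop above is case-insensitive and only matches at the start of the title. A small sketch that isolates that rule in a function makes the behavior easier to see. The function name `classify_title` and the sample titles are hypothetical, made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # lowercase once so the prefix check ignores case
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles, not taken from the data set.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
    'Why I ask HN for advice',  # contains the phrase but does not start with it
]

labels = [classify_title(title) for title in sample_titles]
for title, label in zip(sample_titles, labels):
    print(label, '|', title)
```

Note that a title which merely contains "ask hn" somewhere in the middle is still classified as `other`, matching the `startswith` logic in the cell above.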


## Finding the Average Number of Comments
Now that the posts are separated by title, we will see which type receives the higher average number of comments.

In [4]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)


14.038417431192661


In [5]:
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


You can see from the results above that **Ask HN** posts receive more comments on average: about 14.04 per post, compared with about 10.32 for **Show HN** posts. We'll dive deeper into the **Ask HN** posts next.
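The two averaging cells repeat the same loop with different lists. One possible way to avoid the duplication is a small helper, sketched here under the assumption that `num_comments` sits at index 4 as in the header row. The name `average_comments` is hypothetical, and the three sample rows are copied from the output printed earlier purely to exercise the function (they are not Ask or Show posts).

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column across the given rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Three rows copied from the printed data set, used only as test input.
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
     '8/4/2016 11:52'],
    ['10975351',
     'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke",
     'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
     '2', '1', 'vezycash', '6/23/2016 22:20'],
]

sample_average = average_comments(sample_posts)
print(sample_average)
```

In the notebook, `average_comments(ask_posts)` and `average_comments(show_posts)` would stand in for the two separate loops.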

## Finding the Number of Ask Posts and Comments by Hour

Now we will try to determine whether **Ask HN** posts created at a particular hour of the day receive more comments.

In [10]:
#Import the datetime module and calculate the number of ask posts created by hour
import datetime as dt

result_list = []
for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
#Create two frequency tables to look at posts per hour and comments per hour
counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime('%H')
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment

comments_by_hour


{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
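The tallying step can also be written without the `if`/`else` branch by using `collections.defaultdict`. This is a sketch of an alternative, not the notebook's own method. The function name `tally_by_hour` is hypothetical, and the sample pairs reuse `created_at` and `num_comments` values from the rows printed earlier, plus one invented pair so that an hour appears twice.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = '%m/%d/%Y %H:%M'  # same format string as the cell above


def tally_by_hour(pairs):
    """Return (posts per hour, comments per hour) keyed by 'HH' strings."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


sample_pairs = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 22:20', 1],
    ['6/17/2016 0:01', 1],
    ['8/5/2016 11:05', 8],  # invented pair so hour '11' appears twice
]

sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)
print(sample_comments)
```

Passing `result_list` from the cell above to `tally_by_hour` should yield the same two frequency tables as the explicit loop.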

## Calculating Average Number of Comments by Hour

In [13]:
#Create a list of lists with the average number of comments per post by hour of the day
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    
avg_by_hour

[['03', 7.796296296296297],
 ['19', 10.8],
 ['13', 14.741176470588234],
 ['00', 8.127272727272727],
 ['02', 23.810344827586206],
 ['08', 10.25],
 ['06', 9.022727272727273],
 ['05', 10.08695652173913],
 ['18', 13.20183486238532],
 ['17', 11.46],
 ['11', 11.051724137931034],
 ['01', 11.383333333333333],
 ['04', 7.170212765957447],
 ['22', 6.746478873239437],
 ['14', 13.233644859813085],
 ['12', 9.41095890410959],
 ['15', 38.5948275862069],
 ['07', 7.852941176470588],
 ['10', 13.440677966101696],
 ['21', 16.009174311926607],
 ['23', 7.985294117647059],
 ['20', 21.525],
 ['09', 5.5777777777777775],
 ['16', 16.796296296296298]]

## Sorting and Printing Results

In [15]:
#Swap the columns so the average comes first, which lets us sort by average
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[7.796296296296297, '03'],
 [10.8, '19'],
 [14.741176470588234, '13'],
 [8.127272727272727, '00'],
 [23.810344827586206, '02'],
 [10.25, '08'],
 [9.022727272727273, '06'],
 [10.08695652173913, '05'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.051724137931034, '11'],
 [11.383333333333333, '01'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [13.233644859813085, '14'],
 [9.41095890410959, '12'],
 [38.5948275862069, '15'],
 [7.852941176470588, '07'],
 [13.440677966101696, '10'],
 [16.009174311926607, '21'],
 [7.985294117647059, '23'],
 [21.525, '20'],
 [5.5777777777777775, '09'],
 [16.796296296296298, '16']]

In [16]:
#sort the results
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap
   

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
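Swapping the columns is one way to sort by the average. As a sketch of another option, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be ordered directly. The list `sample_avg_by_hour` below holds four pairs copied from the output above, and the variable names are illustrative.

```python
# Four [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    ['03', 7.796296296296297],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# key picks the average (index 1) as the sort value; reverse=True puts the
# largest average first, so no column swap is needed.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
sorted_hours = [hour for hour, avg in sorted_by_avg]

for hour, avg in sorted_by_avg:
    print(hour, avg)
```

Applied to the full `avg_by_hour` list, this should give the same ordering as `sorted_swap`, with the hour and average in their original positions.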

In [21]:
print('Top 5 Hours for Ask Posts Comments')
for avg, hour in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg))
    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The hour with the highest average number of comments per post is 15:00, at about 38.59 comments per post. If we were to make an **Ask HN** post, this would be the ideal time to create it to get the most engagement. Referring back to the [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/home), we see that the time zone is Eastern, so the hour with the most comments per post is 3 PM EST.

## Conclusion
In this project, we analyzed **Ask HN** and **Show HN** posts from the website Hacker News to see whether either type of post was better at generating comments, and whether the time of day a post was created affected the number of comments it received.

We found that **Ask HN** posts had more comments per post on average than **Show HN** posts. This is most likely due to the nature of the post, as it invites discussion with other members of the site. When looking at the average number of comments per **Ask HN** post by the hour in which it was posted, we determined that the hour of 3:00 PM EST is the ideal time to post and maximize the number of comments on t