# Exploring Hacker News Posts

In this project, we will explore a data set of submissions to the site "Hacker News". Hacker News is a site similar to Reddit that allows users to submit posts, which other users can then comment and vote on. Unlike Reddit, it is geared primarily towards the technology and startup communities.

The data set that we will be working with can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). We will look specifically at posts whose titles begin with `Ask HN` or `Show HN`, as these posts are addressed directly to the entire Hacker News community. `Ask HN` is used to get input on a question from the community, whereas `Show HN` is used to direct the community to something of interest.

We will be comparing these two types of posts to answer these questions:

* Do `Ask HN` or `Show HN` receive more comments on average?
* Do posts created at a certain time receive more comments on average?

## Importing the Data Set

Let's begin by importing our data set:

In [1]:
# Read the data set into a list of lists
from csv import reader

# The with block closes the file once it has been read
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))

for row in hn[:4]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']




## Removing Header Row

Next we will remove the header row from the data, but we will save it to a variable for later use.

In [2]:
# Save header and then remove from the data set
headers = hn[0]
hn = hn[1:]
# Confirming that the header has been removed
print(headers)
print('\n')

for row in hn[:4]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




## Separating Ask, Show, and Other Posts

Next we will separate the `Ask HN`, `Show HN`, and other posts so that we can use them to answer our questions.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    # Store post title as lower case so the prefix check is case-insensitive
    title = row[1].lower()
    
    # Use str.startswith method to check titles
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
list_len_temp = "There are {1:,} posts in {0}"
# Verify the lists lengths
print(list_len_temp.format('Ask Posts', len(ask_posts)))
print(list_len_temp.format('Show Posts', len(show_posts)))
print(list_len_temp.format('Other Posts', len(other_posts)))

There are 1,744 posts in Ask Posts
There are 1,162 posts in Show Posts
There are 17,194 posts in Other Posts
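
As a quick check of the prefix test used above, here is a minimal sketch that applies the same lower-case `startswith` logic to a few made-up titles. The helper name `classify_title` and the sample titles are illustrative assumptions and are not part of the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles, used only to illustrate the check
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Why I ask HN before launching',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

The third title contains "ask HN" but does not start with it, so it falls into the other group, which matches how the loop above treats such posts.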


## Calculating Average Number of Comments

Now that we have separated out the `Ask HN` and the `Show HN` posts, we are able to find an answer to our first question:

* Do `Ask HN` or `Show HN` receive more comments on average?

In [4]:
total_ask_comments = 0

#Looping through ask_posts to collect comment totals
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)

avg_comments_temp = "{} had an average of {:,.2f} comments"
print(avg_comments_temp.format('Ask Posts', avg_ask_comments))

total_show_comments = 0

# Looping through show_posts to collect comment totals
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)

print(avg_comments_temp.format('Show Posts', avg_show_comments))

Ask Posts had an average of 14.04 comments
Show Posts had an average of 10.32 comments


Our calculations show that `Ask HN` posts receive more comments on average: about 14.04 per post, compared with 10.32 for `Show HN` posts.
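
The two loops above do the same work on different lists, so the calculation could also be wrapped in a small helper. This is only a sketch: the function name `average_comments` and its `comments_index` parameter are assumptions, and the sample rows reuse values from the preview of the data set with the URL field left empty for brevity.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for rows that store counts as strings."""
    if not posts:
        # Avoid dividing by zero when a list is empty
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Rows based on the data preview; the URL field is left empty here
sample_rows = [
    ['12224879', 'Interactive Dynamic Video', '', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise', '', '3', '1', 'hswarna', '6/17/2016 0:01'],
]

sample_average = average_comments(sample_rows)
print("{:,.2f}".format(sample_average))
```

With a helper like this, the averages for `ask_posts` and `show_posts` would each take a single call.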

## Finding Ask Posts and Comments by Hour Created

Since the `Ask HN` posts receive more comments on average, we will focus on them for our next question. We will start by extracting the `'created_at'` and `'num_comments'` values from `ask_posts`, and then we will build two dictionaries keyed by hour: one counting the posts and one totaling the comments.
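
To see how the hour is extracted, here is a minimal sketch that parses a single `created_at` value with `datetime.strptime`. The timestamp is copied from the first data row shown earlier, and the format string assumes every row follows the same month/day/year layout.

```python
import datetime as dt

# Timestamp copied from the first data row in the preview
created = '8/4/2016 11:52'

# %m and %d accept values without a leading zero when parsing
created_dt = dt.datetime.strptime(created, "%m/%d/%Y %H:%M")

print(created_dt.hour)
```

The `hour` attribute is an integer from 0 to 23, which makes it convenient to use as a dictionary key.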

In [5]:
import datetime as dt

results_list = []

#Creates a separate list of lists with created date and number of comments
for row in ask_posts:
    created = row[6]
    n_comments = int(row[4])
    results_list.append([created, n_comments])
    
counts_by_hour = {}
comments_by_hour = {}

for row in results_list:
    created = row[0]
    dt_format = "%m/%d/%Y %H:%M"
    #Creates a datetime object
    created_dt = dt.datetime.strptime(created, dt_format)
    #Extracts the hour of the datetime object
    hour = created_dt.hour
    
    n_comments = int(row[1])
    
    #Sorts results into 2 dictionaries with hour being the key
    #and posts and comments being the values
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments

## Calculating the Average Number of Comments for Ask HN Posts by Hour

Next we can use our two dictionaries to create a list that will hold the average number of comments per hour.

In [6]:
avg_by_hour = []

#Creates a list of lists with the hour and the calculated
#average number of comments
for hour in comments_by_hour:
    comments = comments_by_hour[hour]
    posts = counts_by_hour[hour]
    avg_comments = comments / posts
    avg_by_hour.append([hour, avg_comments])

## Sorting and Printing Values from avg_by_hour

In [7]:
swap_avg_by_hour = []

#swaps the position of average comments and hour
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)
print('\n')

#Sorts list from highest average comments to lowest
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask HN Posts Comments")

#Converts hour back to a datetime object and formats both avg and
#hour for printing
for row in sorted_swap[:5]:
    avg = row[0]
    hour = dt.datetime.strptime(str(row[1]), "%H")
    format_time = hour.strftime("%H:%M")
    avg_hour_template = "{}: {:.2f} average comments per post"
    print(avg_hour_template.format(format_time, avg))

[[8.127272727272727, 0], [11.383333333333333, 1], [23.810344827586206, 2], [7.796296296296297, 3], [7.170212765957447, 4], [10.08695652173913, 5], [9.022727272727273, 6], [7.852941176470588, 7], [10.25, 8], [5.5777777777777775, 9], [13.440677966101696, 10], [11.051724137931034, 11], [9.41095890410959, 12], [14.741176470588234, 13], [13.233644859813085, 14], [38.5948275862069, 15], [16.796296296296298, 16], [11.46, 17], [13.20183486238532, 18], [10.8, 19], [21.525, 20], [16.009174311926607, 21], [6.746478873239437, 22], [7.985294117647059, 23]]


Top 5 Hours for Ask HN Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
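
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be sorted directly. The pairs below are a handful of values taken from the printed list above, and the name `avg_by_hour_sample` is an assumption used only for this illustration.

```python
# A few [hour, average] pairs taken from the printed output above
avg_by_hour_sample = [
    [2, 23.810344827586206],
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [20, 21.525],
]

# Sort on the second element (the average), highest first
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```

Formatting the hour with `{:02d}` also avoids the round trip through `strptime` and `strftime` when only whole hours are needed.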


These are the top 5 hours for creating an `Ask HN` post to get comments. I live in the Central time zone, and according to the data set's documentation these times are in Eastern time. Central time is one hour behind Eastern time, so for me the best times to post would be 2:00pm, 1:00am, 7:00pm, 3:00pm, and 8:00pm.
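
The conversion can be sketched in code as well. This minimal example assumes the times are Eastern and that Central is always exactly one hour behind, which holds because both zones change their clocks on the same dates. The helper name `eastern_to_central` is an assumption introduced for illustration.

```python
import datetime as dt

# Top five hours (Eastern) from the results above
eastern_hours = [15, 2, 20, 16, 21]

def eastern_to_central(hour):
    """Shift an hour of the day back by one, wrapping around midnight."""
    return (hour - 1) % 24

central_hours = [eastern_to_central(hour) for hour in eastern_hours]

for hour in central_hours:
    # %I is the 12-hour clock; strip the leading zero for readability
    label = dt.time(hour=hour).strftime("%I:%M %p").lstrip("0")
    print(label)
```

The modulo keeps the result in the 0 to 23 range, so a midnight Eastern post would map to 23:00 Central on the previous day.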