### Powerful Statistics for Hacker News Posts Engagement: Find what kind of posts and time receive most comments

In this project, we'll aim to find what kind of Hacker News post receives most comments and the best time to post to drive more engagement. 

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. In this project, we are going to look at the post whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll analyze a data set of submissions to Hacker News available [here](https://www.kaggle.com/hacker-news/hacker-news-posts) to compare these two types of posts to determine the following:
- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

#### Limitation of the project
Please note that the dataset has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. The dataset used has been provided as a part of guided project for Dataquest's python intermediate course.

#### Summary of Results
After analyzing the data, we've reached a conclusion that Ask HN posts recieve more comments on average than Show HN posts. Since ask posts are more likely to receive comments, we focused our remaining analysis on these posts and found that ask post made at 15:00 receives the most comments per post.

For more details, please refer to the full analysis below.

### Data Exploration and Data Cleaning
Below, we'll do a quick exploration of the hacker_news.csv file stored in this repository. We'll read in the file using the direct link here.

In [1]:
from csv import reader
file = open('hacker_news.csv')
read_file = reader(file)
hn = list(read_file)
file.close()

headers = hn[0]
hn_dataset = hn[1:]
print(headers)
print('\n')
print(hn_dataset[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Now that we've removed the headers from `hn`, we're ready to filter our data. Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles. Let's separate posts beginning with Ask HN and Show HN (and case variations) into two different lists next.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn_dataset:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('No. of Ask HN posts:', len(ask_posts))
print('\n')
print('No. of Show HN posts:', len(show_posts))
print('\n')
print('No. of other posts:', len(other_posts))    

No. of Ask HN posts: 1744


No. of Show HN posts: 1162


No. of other posts: 17194


Now, let's determine if ask post or show posts receive more comments on average.

In [3]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

average_ask_comments = total_ask_comments / len(ask_posts)
print('Average Ask Comments: ', average_ask_comments)

Average Ask Comments:  14.038417431192661


In [4]:
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

average_show_comments = total_show_comments / len(show_posts)
print('Average Show Comments: ', average_show_comments)

Average Show Comments:  10.31669535283993


#### Finding 1: ####
The `ASK HN` posts receive more comments on average than `SHOW HN` posts.

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts. Next, we'll determine if ask posts created at a certain time are more likely to attract comments.

We'll use the following steps to perform this analysis:

1. Calculate the amount of ask posts created in each hour of the day, along with the number of comments received.

In [6]:
from datetime import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    no_of_comments = int(row[4])
    result_list.append([created_at, no_of_comments])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    datetime_obj = dt.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = datetime_obj.strftime('%H')
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    

avg_by_hour = []

for key in counts_by_hour.keys():
    num_comments = comments_by_hour[key]
    total_count = counts_by_hour[key]
    avg_by_hour.append([key, num_comments/total_count])

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 hours for Ask Posts Comments')

for row in sorted_swap[:6]:
    datetime_obj = dt.strptime(row[1], '%H')
    hour = dt.strftime(datetime_obj, '%H:%M')
    print('{}: {:.2f} average comments per post'.format(hour, row[0]))

Top 5 hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post


#### Conclusion:
In this project, we analyzed Hacker News posts to determine which type of post and time receive the most comments on average. Based on our analysis, 2 interesting statistics on audience engagement that could be useful are:

1. Ask HN posts get significantly higher number of comments (~2.2x higher) than Show HN posts.
    Therefore, it can be recommended that the post be categorized as ask post in order to get more engagement on the posts with the audience in the form of comments.

2. The best time to post is between 15:00 and 16:00 (3:00 pm est - 4:00 pm est) as the statistics above show that the ask post made at 15:00 receives the most comments per post, with an average of ~30 comments per post