# Analysis of Submissions to Hacker News

The purpose of this project is to examine post submissions to the Hacker News online community and determine if posts of a certain type receive more attention from readers.

Two post types are of interest here: "Ask HN" and "Show HN". As their names imply, "Ask HN" posts ask the community for answers or feedback, while "Show HN" posts share information with the community.

Our analysis will examine which of the two types, "Ask" or "Show", receives more comments on average.

The timing of a post submission can be very impactful in online forums. Posts made during off-work hours may receive more attention and comments than posts made when most people are asleep or busy.

The second part of this analysis examines submission times to see whether the hour of submission affects engagement with a post.

## Data

### Structure

The data from Hacker News was retrieved from: <link>https://news.ycombinator.com/</link>

The original dataset contains ~300,000 records, but was pared down to 20,000 by removing submissions with 0 comments, then randomly sampling the remaining submissions.

The source dataset can be downloaded from Kaggle.com: <link>https://www.kaggle.com/hacker-news/hacker-news-posts#HN_posts_year_to_Sep_26_2016.csv</link>

The dataset contains the following fields:

- <code>id</code>: Unique post identifier from Hacker News
- <code>title</code>: Post title
- <code>url</code>: URL that the post links to, if present
- <code>num_points</code>: Total number of points (upvotes - downvotes)
- <code>num_comments</code>: Number of comments on the post
- <code>author</code>: Username of the submitter
- <code>created_at</code>: The date and time of the submission
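
As an illustration of how these field names line up with the data, here is a minimal sketch that reads rows by name with <code>csv.DictReader</code>. The two rows are copied from the sample output shown later in this notebook and held in an in-memory string (an assumption made so the sketch runs without the file). The rest of the notebook uses positional indexes instead.

```python
import csv
import io

# Two rows copied from this notebook's sample output, stored in memory
# so the sketch does not depend on hacker_news.csv being present.
sample_csv = io.StringIO(
    'id,title,url,num_points,num_comments,author,created_at\n'
    '10241589,What sort of a job could I find with my background?,,1,2,Teichopsia,9/18/2015 19:39\n'
    '12332813,"Last Vesper Update, Sync Shutting Down",'
    'http://inessential.com/2016/08/21/last_vesper_update_sync_shutting_down,'
    '4,1,stephenr,8/21/2016 21:28\n'
)

# DictReader uses the header row as keys, so fields are accessed by name.
rows = list(csv.DictReader(sample_csv))

for row in rows:
    # Every value is read as a string; numeric fields need an explicit int().
    print(row['title'], '->', int(row['num_comments']), 'comments')
```

Note that an empty <code>url</code> comes back as an empty string, and that the numeric fields must be converted before doing arithmetic.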

### Import

We'll begin the analysis by importing the dataset from <code>hacker_news.csv</code> and separating the header row from the data.


In [4]:
import csv

hn = []

with open('hacker_news.csv', encoding='utf8') as csv_file:
    read_file = csv.reader(csv_file, delimiter = ',')
    for row in read_file:
        hn.append(row)

headers = hn[0]
hn = hn[1:]

print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [5]:
print(hn[0:5])

[['11793140', 'How unauthorized idiots repair Apple laptops [video]', 'https://www.youtube.com/watch?v=ocF_hrr83Oc', '18', '2', 'sshykes', '5/28/2016 19:51'], ['11925546', 'Crisis based forking can pierce the Decentralized Veil of Ethereum', 'https://blog.stakeventures.com/articles/piercing-ethereums-veil', '25', '8', 'pelle', '6/17/2016 21:05'], ['10241589', 'What sort of a job could I find with my background?', '', '1', '2', 'Teichopsia', '9/18/2015 19:39'], ['11741623', 'Bounty for Open-Source Diabetic pump control exceeds $11,000', 'http://www.openomni.org', '3', '1', 'oskarpearson', '5/20/2016 21:21'], ['12332813', 'Last Vesper Update, Sync Shutting Down', 'http://inessential.com/2016/08/21/last_vesper_update_sync_shutting_down', '4', '1', 'stephenr', '8/21/2016 21:28']]


### Post Types

The analysis calls for comparing engagement between the "Ask" and "Show" post types. Therefore we must identify and separate out these types of posts.

To do so, we'll check whether each title begins with the phrase "Ask HN" or "Show HN" (ignoring case) and place the post in the matching group.

Note that it is possible a post may not fit into either category. Such posts will be tagged and collected as "Other".

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Ask Posts: ', len(ask_posts), '\nShow Posts: ', len(show_posts),
      '\nOther Posts: ', len(other_posts))

Ask Posts:  1688 
Show Posts:  1277 
Other Posts:  17035
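
The grouping rule can also be written as a small function, which makes it easy to try on individual titles. This is a minimal sketch: the name <code>classify_post</code> and the example titles are hypothetical and are not part of the dataset.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles, invented for illustration only.
examples = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Thoughts on Show HN etiquette',
]

for example in examples:
    print(classify_post(example), '<-', example)
```

Because the check uses <code>startswith</code>, a title that only mentions "Show HN" somewhere in the middle, like the third example, falls into the "other" group.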


## Analysis

### Number of Comments

The posts are now grouped and ready for analysis. One metric of post engagement is the number of comments. We will calculate the total and average number of comments for each post type and compare them.

In [7]:
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)

print('Total "Ask" Comments: ', total_ask_comments,
     '\nAverage Comments per "Ask" Post: ', avg_ask_comments)

Total "Ask" Comments:  23392 
Average Comments per "Ask" Post:  13.85781990521327


In [8]:
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = total_show_comments / len(show_posts)

print('Total "Show" Comments: ', total_show_comments,
     '\nAverage Comments per "Show" Post: ', avg_show_comments)

Total "Show" Comments:  12136 
Average Comments per "Show" Post:  9.503523884103368


"Ask" posts receive about 4 more comments on average than "Show" posts.

This result is expected. The purpose of an "Ask" post is to solicit information or feedback. "Show" posts can spark discussion, but even users who appreciate the information may not feel the need to comment on the post. A successful "Show" post can measure its engagement through views or votes, whereas an "Ask" post needs comments, or else its question goes unanswered.
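
The two cells above differ only in the list they loop over, so the calculation can be folded into one helper. The sketch below is one way to do that. The name <code>average_comments</code> is hypothetical, and the two sample rows are copied from the import output earlier in the notebook. The last lines recompute the gap between the two averages from the totals printed above.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows (0.0 if the list is empty)."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Two rows copied from the sample output printed after the import step.
sample_posts = [
    ['10241589', 'What sort of a job could I find with my background?', '',
     '1', '2', 'Teichopsia', '9/18/2015 19:39'],
    ['11741623', 'Bounty for Open-Source Diabetic pump control exceeds $11,000',
     'http://www.openomni.org', '3', '1', 'oskarpearson', '5/20/2016 21:21'],
]
print(average_comments(sample_posts))  # (2 + 1) / 2 = 1.5

# Arithmetic check using the totals and counts printed above.
ask_avg = 23392 / 1688
show_avg = 12136 / 1277
difference = ask_avg - show_avg
print(round(difference, 2))  # roughly 4.35 more comments per "Ask" post
```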

### Submission Timing

We've determined that "Ask" posts receive more comments on average than "Show" posts. But what if there are other factors that affect post engagement? One such factor could be the time of day the post was submitted.

We can hypothesize that posts submitted during people's free time would elicit higher engagement than posts submitted late at night or during work hours.

To examine this possibility, we'll use the <code>datetime</code> module to count the "Ask" posts and total their comments for each hour of the day in which they were submitted.

In [11]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    comments = post[4]
    result_list.append([created_at, comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = date.strftime('%H')
    comments = int(row[1])
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

print(counts_by_hour,'\n',comments_by_hour)

{'01': 50, '21': 92, '07': 42, '19': 97, '13': 94, '02': 43, '10': 51, '12': 58, '16': 102, '23': 73, '18': 108, '11': 56, '22': 68, '14': 92, '15': 137, '17': 101, '20': 89, '05': 40, '09': 42, '00': 58, '03': 61, '04': 50, '08': 38, '06': 46} 
 {'01': 375, '21': 807, '07': 205, '19': 1057, '13': 2529, '02': 395, '10': 588, '12': 1036, '16': 1114, '23': 468, '18': 1269, '11': 675, '22': 572, '14': 1513, '15': 5209, '17': 1878, '20': 921, '05': 185, '09': 325, '00': 634, '03': 497, '04': 438, '08': 395, '06': 307}
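
The same tally can be written more compactly. The sketch below is an alternative, not the cell that produced the output above: it uses <code>collections.defaultdict</code> to avoid the "first time seen" branch and reads the hour as an integer from the parsed datetime. The three timestamp and comment pairs are copied from the sample rows printed earlier and serve only as illustration (they are not necessarily "Ask" posts).

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs copied from the sample rows shown
# after the import step; used here only to make the sketch runnable.
sample = [
    ['5/28/2016 19:51', '2'],
    ['9/18/2015 19:39', '2'],
    ['8/21/2016 21:28', '1'],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample:
    # .hour is an int from 0 to 23, so the keys are integers here,
    # unlike the zero-padded strings produced by strftime('%H').
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts[hour] += 1
    comments[hour] += int(num_comments)

print(dict(counts))
print(dict(comments))
```

Integer keys sort numerically, which can be convenient when listing the hours in order. The string keys used above work equally well because they are zero-padded.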


### Average Comments

The previous results contain 2 dictionaries, one mapping each hour of the day to the number of posts submitted in that hour, and the other mapping each hour to the total number of comments those posts received. Together they provide the information needed to calculate the average number of comments per post for each hour of the day.

This next section calculates the average for each hour, then sorts the results and displays the top 5 hours of the day based on the highest average comments per post.

In [12]:
avg_by_hour = []

for hour in counts_by_hour:
    comments = comments_by_hour[hour]
    average = comments / counts_by_hour[hour]
    avg_by_hour.append([hour, average])
    
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("\nTop 5 Hours for Ask Post Comments\n")

for avg in sorted_swap[0:5]:
    print("{1}:00: {0:.2f} average comments".format(avg[0],avg[1]))


Top 5 Hours for Ask Post Comments

15:00: 38.02 average comments
13:00: 26.90 average comments
17:00: 18.59 average comments
12:00: 17.86 average comments
14:00: 16.45 average comments
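
As a cross-check, the ranking can be recomputed directly from the two dictionaries printed in the previous section. The sketch below copies those values verbatim and sorts with the <code>key</code> argument of <code>sorted</code> instead of swapping the columns. This is an alternative formulation, not the cell that produced the output above.

```python
# Values copied from the output of the Submission Timing cell.
counts_by_hour = {
    '01': 50, '21': 92, '07': 42, '19': 97, '13': 94, '02': 43,
    '10': 51, '12': 58, '16': 102, '23': 73, '18': 108, '11': 56,
    '22': 68, '14': 92, '15': 137, '17': 101, '20': 89, '05': 40,
    '09': 42, '00': 58, '03': 61, '04': 50, '08': 38, '06': 46,
}
comments_by_hour = {
    '01': 375, '21': 807, '07': 205, '19': 1057, '13': 2529, '02': 395,
    '10': 588, '12': 1036, '16': 1114, '23': 468, '18': 1269, '11': 675,
    '22': 572, '14': 1513, '15': 5209, '17': 1878, '20': 921, '05': 185,
    '09': 325, '00': 634, '03': 497, '04': 438, '08': 395, '06': 307,
}

# Average comments per post for each hour.
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in counts_by_hour
}

# Sort the (hour, average) pairs by the average, highest first.
top_five = sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True)[:5]

for hour, average in top_five:
    print("{}:00: {:.2f} average comments".format(hour, average))
```

The hour counts add up to 1688 and the comment totals to 23392, matching the "Ask" post figures reported earlier, and the five hours come out in the same order as above.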


## Results

Calculating the average number of comments revealed that 15:00 (3:00 PM) is the best hour to submit an "Ask" post, with the highest average of 38.02 comments per post. The averages drop quickly after that: 13:00 is second highest with 26.90, and 14:00 is the lowest of the top 5 with 16.45 comments.

Looking at the top 5 hours, we see that 12:00 through 15:00 are adjacent. A cursory thought is that these times may correlate with slower afternoons during work days or free time during most people's lunch hour. Therefore, if the goal is to increase an "Ask" post's comments, it is recommended to submit the post between 12:00 and 15:00.

## Considerations & Next Steps

Comments are a great way to measure engagement with a post on Hacker News, but they are certainly not the only metric. This project limited its scope to comments on just certain types of posts. Other engagement metrics such as views and votes were not considered but may play a significant role in measuring overall engagement.

The next step in determining the best time to submit a post is a similar examination of views and votes. Analyzing all 3 metrics across time should provide a more sophisticated recommendation on the optimum submission time for maximizing engagement.
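
As a starting point for that follow-up, the hourly average can be generalized to any numeric column. The sketch below is a hypothetical helper, not part of the analysis above. It applies the same per-hour averaging to <code>num_points</code> (column 3). The field list includes no view count, so only votes are covered. The rows come from the sample output printed after the import step, with the title and URL blanked to keep them short.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(posts, value_index, time_index=6):
    """Return {hour: mean of the chosen numeric column} for the given rows."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for post in posts:
        hour = dt.datetime.strptime(post[time_index], '%m/%d/%Y %H:%M').hour
        totals[hour] += int(post[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Sample rows from the import output; title and url are blanked because
# only num_points (index 3) and created_at (index 6) are used here.
sample_posts = [
    ['11793140', '', '', '18', '2', 'sshykes', '5/28/2016 19:51'],
    ['11925546', '', '', '25', '8', 'pelle', '6/17/2016 21:05'],
    ['10241589', '', '', '1', '2', 'Teichopsia', '9/18/2015 19:39'],
    ['11741623', '', '', '3', '1', 'oskarpearson', '5/20/2016 21:21'],
    ['12332813', '', '', '4', '1', 'stephenr', '8/21/2016 21:28'],
]

points_by_hour = average_by_hour(sample_posts, value_index=3)
for hour in sorted(points_by_hour):
    print("{:02d}:00: {:.2f} average points".format(hour, points_by_hour[hour]))
```

Passing <code>value_index=4</code> would give the comment averages instead, so the same function could serve both metrics.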