# Hacker News Analysis

This notebook explores a dataset of posts exported from the Hacker News (HN) website.
### Breakdown of columns
    
- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted
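
Every value in the CSV is read as a string, so the numeric and date columns need converting before they can be used in arithmetic. The sketch below shows one way to turn a row into typed values. It inlines two rows from the dataset so it runs without the file; `parse_post` is a hypothetical helper, and treating an empty `url` as missing is an assumption about how the export encodes posts without a link.

```python
import csv
import io
from datetime import datetime

# Two rows copied from the dataset, inlined so the sketch runs without the CSV file.
SAMPLE_CSV = """id,title,url,num_points,num_comments,author,created_at
12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12
"""

def parse_post(raw):
    """Convert one row of strings into typed values (hypothetical helper)."""
    return {
        "id": int(raw["id"]),
        "title": raw["title"],
        "url": raw["url"] or None,  # assumption: a missing URL is an empty string
        "num_points": int(raw["num_points"]),
        "num_comments": int(raw["num_comments"]),
        "author": raw["author"],
        "created_at": datetime.strptime(raw["created_at"], "%m/%d/%Y %H:%M"),
    }

# DictReader uses the header row as keys, so columns are addressed by name.
posts = [parse_post(raw) for raw in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(posts[0]["num_comments"], posts[0]["created_at"].hour)
```

The rest of this notebook keeps the plain list-of-lists layout and converts values where they are needed.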

We focus on posts whose titles begin with Show HN or Ask HN, since these are directed at the Hacker News community.

In [4]:
# Open and parse the CSV file
from csv import reader

with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Print the header row and the first five rows of data
for row in hn[:6]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




In [5]:
# Separate the header row from the data rows
header = hn[0]
hn = hn[1:]

for row in hn[:5]:
    print(row)
    print('\n')
    

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




# Filtering Rows

To analyze these posts, we keep only the titles that start with 'Ask HN' or 'Show HN' (ignoring case) and set the rest aside.

In [14]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()  # lowercase so the prefix match ignores case
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('The number of posts starting with "Ask HN" is',len(ask_posts))
print('The number of posts starting with "Show HN" is',len(show_posts))
print('The number of posts that don\'t start with either "Ask HN" or "Show HN" is', len(other_posts))

The number of posts starting with "Ask HN" is 1744
The number of posts starting with "Show HN" is 1162
The number of posts that don't start with either "Ask HN" or "Show HN" is 17194


With the posts split into groups, we can explore the two groups of interest. First, we compare 'Ask HN' and 'Show HN' posts by the average number of comments each receives.

In [15]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    num_comments = row[4]
    total_ask_comments += int(num_comments)

avg_ask_comments = total_ask_comments/len(ask_posts)

for row in show_posts:
    num_comments = row[4]
    total_show_comments += int(num_comments)

avg_show_comments = total_show_comments/len(show_posts)

print('The average number of comments on posts starting with "Ask HN" is', avg_ask_comments)
print('The average number of comments on posts starting with "Show HN" is', avg_show_comments)

The average number of comments on posts starting with "Ask HN" is 14.038417431192661
The average number of comments on posts starting with "Show HN" is 10.31669535283993
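
The two loops above differ only in the list they iterate over, so the calculation could be folded into one function. The sketch below is one way to do that; `average_comments` and its `comments_index` parameter are hypothetical names, and the sample rows are made up for illustration rather than taken from the dataset.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Illustrative rows only (not from the dataset), laid out like the CSV columns.
sample_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: example two', '', '3', '4', 'user_b', '1/1/2016 15:30'],
]
print(average_comments(sample_posts))  # 7.0
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` should give the same two averages as the loops, and the empty-list check avoids a division by zero if a filter matches nothing.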


From the analysis above, posts starting with 'Ask HN' receive more comments on average (about 14.0) than posts starting with 'Show HN' (about 10.3). We therefore narrow the rest of the analysis to Ask HN posts, since they are more likely to receive comments.

In [33]:
import datetime as dt

# Collect the creation time and comment count of every Ask HN post
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

# Tally the number of posts and the total comments for each hour of the day
count_by_hour = {}
comments_by_hour = {}
for row in result_list:
    comment = row[1]
    date = row[0]
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")  # two-digit hour, e.g. '15'
    if hour not in count_by_hour:
        count_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        count_by_hour[hour] += 1
        comments_by_hour[hour] += comment

In [41]:
avg_by_hour = []
for hour in count_by_hour:
    avg = comments_by_hour[hour]/count_by_hour[hour]
    avg_by_hour.append([hour, round(avg, 1)])
    
print(avg_by_hour)
print(len(avg_by_hour))

[['14', 13.2], ['04', 7.2], ['16', 16.8], ['10', 13.4], ['03', 7.8], ['18', 13.2], ['12', 9.4], ['15', 38.6], ['20', 21.5], ['23', 8.0], ['22', 6.7], ['13', 14.7], ['02', 23.8], ['06', 9.0], ['07', 7.9], ['01', 11.4], ['09', 5.6], ['17', 11.5], ['11', 11.1], ['19', 10.8], ['05', 10.1], ['08', 10.2], ['21', 16.0], ['00', 8.1]]
24
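
The tallying and averaging steps could also be combined into a single pass with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is only a sketch of that alternative: `average_comments_by_hour` is a hypothetical name, and the sample pairs are made up rather than taken from the dataset.

```python
import datetime as dt
from collections import defaultdict

def average_comments_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Map each two-digit hour string to the mean comment count of its posts."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment sum]
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        totals[hour][0] += 1
        totals[hour][1] += num_comments
    return {hour: round(total / count, 1) for hour, (count, total) in totals.items()}

# Illustrative [created_at, num_comments] pairs, not taken from the dataset.
sample_rows = [
    ["8/4/2016 15:10", 10],
    ["8/5/2016 15:45", 20],
    ["8/5/2016 2:05", 6],
]
print(average_comments_by_hour(sample_rows))  # {'15': 15.0, '02': 6.0}
```

Passing `result_list` to this function should yield the same hour-to-average pairs as `avg_by_hour`, but as a dictionary rather than a list of lists.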


The `avg_by_hour` list holds the average number of comments per post for Ask HN posts created during each hour of the day.
The next step is to sort these values so that the hours with the most comments are easy to read off.

In [43]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap1 = row[1]
    swap2 = row[0]
    swap_avg_by_hour.append([swap1, swap2])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]:
    template = '{0}:00: {1} average comments per post'
    store = template.format(row[1], row[0])
    print(store)

[[13.2, '14'], [7.2, '04'], [16.8, '16'], [13.4, '10'], [7.8, '03'], [13.2, '18'], [9.4, '12'], [38.6, '15'], [21.5, '20'], [8.0, '23'], [6.7, '22'], [14.7, '13'], [23.8, '02'], [9.0, '06'], [7.9, '07'], [11.4, '01'], [5.6, '09'], [11.5, '17'], [11.1, '11'], [10.8, '19'], [10.1, '05'], [10.2, '08'], [16.0, '21'], [8.1, '00']]
Top 5 Hours for Ask Posts Comments
15:00: 38.6 average comments per post
02:00: 23.8 average comments per post
20:00: 21.5 average comments per post
16:00: 16.8 average comments per post
21:00: 16.0 average comments per post
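
Swapping the columns is needed only because `sorted` compares lists by their first element. As an alternative, the `key` argument of `sorted` can point at the average directly, so the pairs keep their `[hour, average]` order. The sketch below uses a handful of pairs copied from the `avg_by_hour` output; `avg_by_hour_sample` and `top_hours` are names introduced here for illustration.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [['14', 13.2], ['16', 16.8], ['15', 38.6],
                      ['20', 21.5], ['02', 23.8], ['21', 16.0]]

# Sort on the average (index 1), largest first, leaving each pair unswapped.
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_hours:
    print('{0}:00: {1} average comments per post'.format(hour, avg))
```

One difference to keep in mind: with a `key`, hours that share the same average keep their input order, whereas the swapped lists break such ties by comparing the hour strings.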


# Findings

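In this dataset, Ask HN posts created at 15:00 (3 pm) receive the most comments on average, at 38.6 per post, followed by 02:00 (23.8) and 20:00 (21.5). For an Ask HN post where comments are needed or expected, 15:00 in the time zone of the `created_at` column is therefore the most promising hour to publish.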