# Exploring Hacker News 

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but for this project the data set being utilized has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. 

## Data Set Column Descriptions

- **id**: The unique identifier from Hacker News for the post
- **title**: The title of the post
- **url**: The URL that the posts links to, if the post has a URL
- **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments**: The number of comments that were made on the post
- **author**: The username of the person who submitted the post
- **created_at**: The date and time at which the post was submitted

## Project Goals
We'll compare two types of posts, `Ask HN` and `Show HN` to determine the following:

Do `Ask HN` or `Show HN` receive more comments on average?
Do posts created at a certain time receive more comments on average?

In [10]:
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
headers = hn[0]
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [19]:
hn = hn[1:] #review after separating out header from the rest of the table
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-othe

In [37]:
ask_post = []
show_posts = []
other_posts = []

for row in hn: #filtering out ask hn and show hn posts for further analysis, filtering the rest into "others"
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
ap = len(ask_posts)
sp = len(show_posts)
op = len(other_posts)

print(ap)
print(sp)
print(op)

6976
1162
17193


In [41]:
total_ask_comments = 0
for row in ask_posts:
    num_ask_comments = int(row[4])
    total_ask_comments += num_ask_comments
    
avg_ask_comments = total_ask_comments / ap
print(avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    num_show_comments = int(row[4])
    total_show_comments += num_show_comments
    
avg_show_comments = total_show_comments / ap
print(avg_show_comments) # Come back and do markdown post for this

14.038417431192661
1.7184633027522935


In [52]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
    
counts_by_hour = {} # creating dictionaries for post counts per hour and comments by hour
comments_by_hour = {}

for row in result_list:
    date = row[0]
    comment = row[1]
    ask_dt = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    ask_hour = ask_dt.strftime('%H')
    if ask_hour not in counts_by_hour:
        counts_by_hour[ask_hour] = 1
        comments_by_hour[ask_hour] = comment
    else:
        counts_by_hour[ask_hour] += 1
        comments_by_hour[ask_hour] += comment

In [67]:
#counts_by_hour: contains the number of ask posts created during each hour of the day.
#comments_by_hour: contains the corresponding number of comments ask posts created at each hour received.
#Use the example above to calculate the average number of comments per post for posts created during each hour of the day.
#The result should be a list of lists in which the first element is the hour
#and the second element is the average number of comments per post. 
#Assign the result to a variable named avg_by_hour. Display the results.

avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]]) 
#   avg_by_hour.append([hr, round((comments_by_hour[hr] / counts_by_hour[hr]),2)]) 
    #by bracketing the append we can use hr as the first element for the hour
    # followed by the second element dividing the already created dicts comment/counts by hour for each hr in those dicts.
    # Commented out code rounded to 2 decimal points for readability
avg_by_hour



[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In [75]:
swap_avg_by_hour = [] # swapping the list around so avg comments is the first element and hours is second

for hr in avg_by_hour:
    swap_avg_by_hour.append([hr[1], hr[0]])
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('\n')
print('\n')
print('Top 5 Hours for Ask Posts Comments')

for avg, hr in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg))
    

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]




Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
