# Exploring Hacker News Posts

In this project, I look at a dataset of Hacker News submissions (posts and their comment counts) covering up to 12 months of activity, ending on Sept 16, 2016.

The dataset includes the following columns:

id: The unique identifier from Hacker News for the post

title: The title of the post 

url: The URL that the post links to, if the post has a URL

num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

num_comments: The number of comments that were made on the post

author: The username of the person who submitted the post 

created_at: The date and time at which the post was submitted
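To make the column layout concrete, here is a minimal sketch that parses two rows taken from the dataset with `csv.DictReader`, so that values can be looked up by column name instead of by position. The in-memory `sample_csv` string is an assumption made for illustration; the notebook itself reads `hacker_news.csv` from disk with `csv.reader`.

```python
import csv
import io

# Two rows taken from the dataset, held in memory so the sketch runs
# without hacker_news.csv on disk.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12\n"
)

# DictReader uses the header row as keys, so each row maps column name -> value.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

for row in rows:
    # Every value is read as a string; numeric columns need an explicit int().
    print(row["title"], int(row["num_points"]), int(row["num_comments"]))
```

Because every field arrives as a string, `num_points` and `num_comments` have to be converted before any arithmetic, and `created_at` has to be parsed before the hour can be extracted.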

In [22]:
import csv 

# Reading hacker_news.csv into a list of lists
# The with block closes the file once it has been read
with open('hacker_news.csv') as opened_file:
    hn = list(csv.reader(opened_file))

# Displaying the first five rows from the file 
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [24]:
# Extracting the first row from the data, assigning it to a variable
headers = hn[0]

# Removing first row from hn
hn = hn[1:]

# Displaying headers 
print(headers)

# Displaying the first five rows
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [23]:
# Creating empty lists for ask_posts, show_posts, and other_posts
ask_posts = []
show_posts = []
other_posts = []

# Sorting each post into the relevant list based on how its title starts
for post in hn: 
    title = post[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else: 
        other_posts.append(post)

# Checking the number of posts in each list 
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17195


In [25]:
# Finding the total number of comments in ask posts
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


From the calculation above, ask posts received about 14 comments each on average.

In [26]:
# Finding the total number of comments in show posts
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
# Computing the average number of comments in show posts 
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


From the calculation above, show posts received about 10 comments each on average.
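The two averages are computed with the same loop, so a small helper could cover both cases. The sketch below assumes each post is a list with `num_comments` at index 4, as in this dataset. The name `average_comments` is hypothetical, and the sample rows are copied from the dataset preview purely for illustration (they are neither ask nor show posts).

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column (index 4 in this dataset)."""
    if not posts:
        return 0.0  # assumption: report 0 for an empty list instead of raising
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Sample rows copied from the dataset preview, for illustration only.
sample_posts = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'],
]

print(average_comments(sample_posts))
```

In the notebook, calling the helper on `ask_posts` and `show_posts` would be expected to reproduce the two averages computed above.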

In [28]:
# Importing datetime module 
import datetime as dt 

# Creating an empty list for result_list 
result_list = [] 

# Collecting the creation time and the comment count of each ask post
for row in ask_posts:
    created_at = row[6]
    num_comments = row[4]
    result_list.append([created_at, int(num_comments)])
    
counts_by_hour = {}
comments_by_hour = {}

# Format string that matches the created_at values, e.g. "8/4/2016 11:52"
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    post_date = row[0]
    comments = row[1]
    date = dt.datetime.strptime(post_date, date_format)
    hour = date.strftime("%H")
    
    # Updating both frequency tables for this hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else: 
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments 
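The hour extraction and the two frequency tables built in the cell above can be tried on a handful of values. This sketch uses `collections.defaultdict` as an alternative to the explicit `if`/`else` check; that swap is my assumption, not the notebook's method. Three of the sample pairs come from the dataset preview, and one is made up (and marked as such) so that an hour appears twice.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs. The first three come from the dataset
# preview; the last one is made up so that one hour appears twice.
sample = [
    ("8/4/2016 11:52", 52),
    ("6/23/2016 22:20", 1),
    ("9/30/2015 4:12", 2),
    ("8/5/2016 11:05", 8),  # hypothetical row, not from the dataset
]

date_format = "%m/%d/%Y %H:%M"
counts_by_hour = defaultdict(int)    # missing hours start at 0
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    # Parse the timestamp, then keep only the zero-padded hour ("04", "11", ...)
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```

Both dictionaries share the same keys, which is what lets a later step divide total comments by post count for each hour.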

In [29]:
# Calculating the average number of comments per ask post for each hour
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


In [32]:
# Swapping the columns so that sorting orders the rows by average
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Previewing the four highest [average, hour] pairs
print(sorted_swap[0:4], "\n")

# Printing the five highest averages in a format that's easier to read
for avg, hr in sorted_swap[:5]:
    hour_label = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_label, avg))

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16']] 

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
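Swapping the columns is one way to sort by average. As an alternative sketch, `sorted` also accepts a `key` function, which keeps each row in its original `[hour, average]` order. The list below is a subset of the `avg_by_hour` output shown earlier, and the name `avg_by_hour_sample` is hypothetical.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
    ['16', 16.796296296296298],
]

# key picks the value to sort on, so the rows keep their [hour, average] order.
top_three = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)[:3]

for hour, avg in top_three:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Sorting on the average alone also avoids comparing the hour strings, which the swapped-list approach falls back on when two averages tie.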


In conclusion, I analyzed two types of posts from the Hacker News website: ASK posts, where users ask the Hacker News community questions, and SHOW posts, where users share a project or anything else interesting with the community.
Results showed that ASK posts received more comments, averaging about 14 comments per post compared with about 10 for SHOW posts.
In addition, the average number of comments per ASK post varied by the hour of posting. Posts created at 15:00 averaged the most comments (38.59), followed by 02:00 (23.81) and 20:00 (21.52); four of the five highest-average hours fall in the afternoon or evening.