## Analyzing Hacker News Posts

In this project we analyze posts from Hacker News, a popular technology site where user-submitted stories receive votes and comments, similar to Reddit.

The data file we are examining has been reduced from 300,000 rows to about 20,000 rows by removing all posts that didn't receive comments and then randomly sampling from the remaining submissions.

The columns are identified as follows:
- id: Unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments on the post
- author: The username of the person who submitted the post
- created_at: The date and time of the post's submission
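
As a rough illustration of how these columns map onto the file, the sketch below reads a tiny in-memory sample with `csv.DictReader`, which lets each column be accessed by name. The `sample_csv` variable is a hypothetical stand-in for `hacker_news.csv`; its single data row is the first row of the dataset.

```python
import csv
import io

# Hypothetical two-line sample in the same layout as hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader uses the header row as keys, so columns can be read by name.
rows = list(csv.DictReader(sample_csv))
first = rows[0]

print(first["title"])              # the title column
print(int(first["num_comments"]))  # values arrive as strings, so convert
```

The rest of this project uses plain `csv.reader` and positional indexes instead, so `row[1]` is the title and `row[4]` is the comment count.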

We are interested in posts whose titles begin with either "Ask HN" or "Show HN".
**"Ask HN"** means the poster is asking the community a specific question.
**"Show HN"** means the poster is showing the community a new product, project, or something else of interest.

Our goal is to answer the following two questions:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## Reading Files
First, we read the data file into a list of lists.

In [1]:
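import csv

# Read every row into a list of lists; the with block closes the file afterwards
with open('hacker_news.csv') as file:
    hn = list(csv.reader(file))
hn[:5]  # Preview the header row and the first 4 data rows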

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

## Removing Headers
Next, we separate the file into two lists: one holding the column names and the other holding the data rows.

In [3]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
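
The same split can also be written with extended unpacking. This is a minimal sketch, not the approach used in the cell above; `data` is a hypothetical three-row, three-column stand-in for the list returned by `csv.reader`.

```python
# Hypothetical miniature stand-in for the list returned by csv.reader.
data = [
    ["id", "title", "num_comments"],
    ["12224879", "Interactive Dynamic Video", "52"],
    ["11919867", "Technology ventures: From Idea to Enterprise", "1"],
]

# Extended unpacking: the first row goes to `header`, the rest to `rows`.
header, *rows = data

print(header)
print(len(rows))
```

Unlike reassigning `hn = hn[1:]`, this form leaves the original list untouched, so running the cell twice does not drop a second row.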


## Quantifying Posts
Now that we have the data rows, we can sort each post into one of three lists: "Ask" posts, "Show" posts, and all other posts.

In [4]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [5]:
print("Ask_posts has:", len(ask_posts), " number of posts.")
print("Show_posts has:", len(show_posts), " number of posts.")
print("Other_posts has:", len(other_posts), " number of posts.")

Ask_posts has: 1744  number of posts.
Show_posts has: 1162  number of posts.
Other_posts has: 17194  number of posts.
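
The prefix check can also be pulled into a small function, which makes the rule easy to test in isolation. This is a sketch under the same assumptions as the loop above (case-insensitive match on the start of the title); `classify_post` and the example titles are hypothetical names introduced here.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles used only to exercise the function.
examples = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "Interactive Dynamic Video",
]
labels = [classify_post(title) for title in examples]
print(labels)
```

Note that a prefix match also accepts titles such as "Ask HNers ...", so the counts above may include a few posts that are not strictly Ask HN or Show HN submissions.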


## Comparing Comments
Now we can answer the first of our questions: which post category receives more comments on average?

In [6]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments/(len(ask_posts))
print("The average number of comments on ask posts is: ", avg_ask_comments)

The average number of comments on ask posts is:  14.038417431192661


In [7]:
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments / (len(show_posts))
print("The average number of comments on show posts is: ", avg_show_comments)

The average number of comments on show posts is:  10.31669535283993


On average, ask posts receive about 14 comments, whereas show posts receive about 10. This is plausible because ask posts explicitly invite responses, whereas show posts may only aim to present something to the community.
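
Because the two cells above repeat the same calculation, it could be factored into one helper. The sketch below is one possible version; `average_comments` and `sample_posts` are hypothetical names, and the sample rows are invented solely to show the arithmetic.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(row[comments_index]) for row in posts) / len(posts)


# Hypothetical rows in the same column order as the dataset.
sample_posts = [
    ["1", "Ask HN: first", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: second", "", "3", "4", "user_b", "8/4/2016 12:10"],
]

print(average_comments(sample_posts))  # (10 + 4) / 2
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` should reproduce the two averages printed above.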

## Comments by Hour
Next, we group the ask posts by the hour in which they were created and total the comments for each hour.

In [11]:
import datetime as dt

result_list = []
for row in ask_posts:
    result_list.append([row[6],int(row[4])])
    
counts_by_hour = {}
comments_by_hour = {}

string_parse = "%m/%d/%Y %H:%M"
for row in result_list:
    time = row[0]
    comments = row[1]
    
    hour = dt.datetime.strptime(time, string_parse).strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
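
The `if`/`else` branch in the loop above exists only to initialize missing keys. A possible alternative, sketched here with `collections.defaultdict`, removes that branch. The `sample_results` pairs are hypothetical and mimic the shape of `result_list`; the third timestamp is invented for illustration.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

# Hypothetical [created_at, num_comments] pairs in the shape of result_list.
sample_results = [
    ["8/4/2016 11:52", 52],
    ["1/26/2016 19:30", 10],
    ["6/23/2016 11:05", 6],
]

# defaultdict(int) starts every missing key at 0, so no if/else is needed.
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, num_comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour as the key.
    hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```

The keys stay as two-character strings such as `'11'`, matching the dictionaries produced by the original loop.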

We now have two dictionaries: one with the total number of comments per hour and one with the number of posts per hour. Next, let's compute the average number of comments per post for each hour.

In [23]:
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr]/ counts_by_hour[hr]])
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In [24]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
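
Swapping the columns is one way to sort by the average. Another option, sketched below, is to pass a `key` function to `sorted` so the `[hour, average]` pairs can stay in their original order. The sample holds four pairs copied from `avg_by_hour`; `avg_by_hour_sample`, `ranked`, and `top_hours` are names introduced for this illustration.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
avg_by_hour_sample = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
]

# Sort on the second element (the average), largest first.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)
top_hours = [hour for hour, _ in ranked]
print(top_hours)
```

This avoids building the intermediate `swap_avg_by_hour` list, at the cost of a slightly less obvious `sorted` call.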

In [25]:
print("Top 5 Hours for Ask Post Comments")
for avg, hr in sorted_swap[:5]:
    # Convert the hour string (e.g. '15') to HH:MM and show the average to 2 decimals
    print("{}: {:.2f} average comments per post.".format(
        dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg))

Top 5 Hours for Ask Post Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


The hour with the highest average number of comments per ask post is 15:00 (3 p.m.), at about 38.6 comments per post. That is roughly 15 more comments per post than the next-highest hour, 02:00 (2 a.m.).

## Conclusion
In this project we set out to analyze Hacker News posts to see which type of post receives more comments and whether posts created at a certain time receive more comments on average.

We found that Ask HN posts receive more comments on average than Show HN posts (about 14 versus 10), which fits the purpose of each post type. Among Ask HN posts, those created at 3 p.m. receive the most comments on average, followed by those created at 2 a.m. and 8 p.m.