## Analyzing Hacker News Posts

In this project, we analyze posts from Hacker News, a popular technology site where user-submitted stories receive votes and comments, similar to Reddit.

The data file we are examining has been reduced from 300,000 rows to about 20,000 rows by removing all posts that received no comments and then randomly sampling from the remaining submissions.

The columns are identified as follows:
- id: Unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments on the post
- author: The username of the person who submitted the post
- created_at: The date and time of the post's submission
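The code cells in this notebook index rows by position (for example, `row[4]` for `num_comments`). As a minimal sketch, assuming the CSV columns appear in the order listed above, a name-to-index mapping makes those positions explicit. The names `columns` and `col_index` are illustrative and are not part of the original analysis.

```python
# Column names in the order listed above (assumed to match the CSV header row)
columns = ["id", "title", "url", "num_points", "num_comments", "author", "created_at"]

# Map each column name to its position within a row
col_index = {name: i for i, name in enumerate(columns)}

print(col_index["title"])         # 1
print(col_index["num_points"])    # 3
print(col_index["num_comments"])  # 4
print(col_index["created_at"])    # 6
```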

We are interested in posts whose titles begin with either "Ask HN" or "Show HN".
**"Ask HN"** posts ask the community a specific question.
**"Show HN"** posts show the community a new product, project, or something else of interest.

Our goal is to answer the following two questions:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## Reading Files
First, we read the data file into a list of lists.

In [None]:
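import csv

# Read the whole CSV into a list of rows (each row is a list of strings)
with open('hacker_news.csv') as file:
    hn = list(csv.reader(file))
hn[:5]  # Preview the first 5 rows (the first one is the header row)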
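One detail worth keeping in mind: `csv.reader` returns every field as a string, so numeric columns have to be converted before any arithmetic. The sketch below illustrates this with a small in-memory CSV. The sample text and the names `sample_csv`, `header`, and `first_row` are hypothetical stand-ins, not data from `hacker_news.csv`.

```python
import csv
import io

# Hypothetical two-line CSV standing in for hacker_news.csv
sample_csv = "id,title,num_comments\n1,Ask HN: Example question?,6\n"

rows = list(csv.reader(io.StringIO(sample_csv)))
header, first_row = rows[0], rows[1]

print(first_row)  # every field, including the numbers, is a string

# Convert to int before doing arithmetic
num_comments = int(first_row[2])
print(num_comments + 1)
```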

## Removing Headers
We separate the file into two lists: one holding the header row (the column names) and the other holding the data rows.

In [None]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

## Quantifying Posts
Now that we have the raw data, we can sort each post into one of three lists: "Ask" posts, "Show" posts, and all other posts.

In [None]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
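The prefix check above is case-insensitive and only matches at the start of the title. A minimal sketch of the same logic as a standalone function follows. The function name `classify_title` and the sample titles are hypothetical and are not taken from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles used only to illustrate the matching rules
samples = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "We should ask HN about this",  # 'ask hn' is not at the start
]
labels = [classify_title(t) for t in samples]
print(labels)  # ['ask', 'show', 'other']
```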

In [None]:
print("Number of Ask HN posts:", len(ask_posts))
print("Number of Show HN posts:", len(show_posts))
print("Number of other posts:", len(other_posts))

## Comparing comments 
Now we can answer our first question: which post category receives more comments on average?

In [None]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments/(len(ask_posts))
print("The average number of comments on ask posts is: ", avg_ask_comments)

In [None]:
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments / (len(show_posts))
print("The average number of comments on show posts is: ", avg_show_comments)

On average, Ask posts receive about 14 comments, whereas Show posts receive about 10. This is plausible: Ask posts explicitly invite responses, whereas Show posts may aim only to present something to the community.
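The two averaging loops above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that. The function name `average_column` and the sample rows are hypothetical. The column positions assume the order given in the column list at the top of this notebook.

```python
def average_column(rows, index):
    """Average the integer values found at position `index` in each row."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    total = 0
    for row in rows:
        total += int(row[index])
    return total / len(rows)

# Hypothetical rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: A?', '', '10', '4', 'alice', '8/16/2016 9:55'],
    ['2', 'Ask HN: B?', '', '2', '8', 'bob', '11/22/2015 13:43'],
]
print(average_column(sample_posts, 4))  # average comments: 6.0
print(average_column(sample_posts, 3))  # average points: 6.0
```

With such a helper, the same call would work for `ask_posts` and `show_posts` and for either the comments or the points column.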

**Additional Question**
Which post type, Ask or Show, receives more points on average?

In [None]:
total_ask_points = 0
for row in ask_posts:
    total_ask_points += int(row[3])

ask_average_points = total_ask_points / len(ask_posts)  
print("The average number of points on ask posts is: ", ask_average_points)

In [None]:
total_show_points = 0
for row in show_posts:
    total_show_points += int(row[3])
show_average_points = total_show_points / len(show_posts)
print("The average number of points on show posts is: ", show_average_points)

In [None]:
print("Ask posts average {:.2f} comments and {:.2f} points per post.".format(avg_ask_comments, ask_average_points))
print("\n")
print("Show posts average {:.2f} comments and {:.2f} points per post.".format(avg_show_comments, show_average_points))

Interestingly, Show posts receive more points than Ask posts, even though Ask posts receive more comments on average. Perhaps this is because Show posts are easier to take in and require less interaction.

## Comments/Points by Hour
Next, we compare Ask posts by the hour in which they were created, tracking both comments and points.

In [39]:
import datetime as dt

result_list = []
for row in ask_posts:
    result_list.append([row[6],int(row[4]), int(row[3])])
    
counts_by_hour = {}
comments_by_hour = {}
points_by_hour = {}

string_parse = "%m/%d/%Y %H:%M"
for row in result_list:
    time = row[0]
    comments = row[1]
    points = row[2]
    
    hour = dt.datetime.strptime(time, string_parse).strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        points_by_hour[hour] += points
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
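        points_by_hour[hour] = points

# Show the total comments and total points for each hour
print(comments_by_hour)
print(points_by_hour)
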
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
{'09': 329, '13': 2062, '10': 1102, '14': 1282, '16': 2522, '23': 581, '12': 782, '17': 1941, '15': 3479, '21': 1721, '20': 1151, '02': 793, '18': 1741, '03': 374, '05': 552, '19': 1513, '01': 700, '22': 511, '08': 515, '04': 389, '00': 451, '06': 591, '07': 361, '11': 825}
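To make the parsing step concrete, the sketch below runs the same format string on a few timestamps and counts posts per hour. The timestamps are hypothetical values written in the `%m/%d/%Y %H:%M` format and are not taken from the dataset. The `dict.get` call is a swapped-in, more compact alternative to the `if`/`else` accumulation used above.

```python
import datetime as dt

string_parse = "%m/%d/%Y %H:%M"

# Hypothetical timestamps in the dataset's format
sample_times = ["8/16/2016 9:55", "11/22/2015 13:43", "5/2/2016 9:14"]

sample_counts = {}
for time_str in sample_times:
    # Parse the string, then format just the hour as a zero-padded string
    hour = dt.datetime.strptime(time_str, string_parse).strftime("%H")
    # dict.get returns 0 the first time an hour is seen
    sample_counts[hour] = sample_counts.get(hour, 0) + 1

print(sample_counts)  # {'09': 2, '13': 1}
```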


We now have dictionaries holding the number of posts, the total comments, and the total points for each hour. Next, let's compute the average number of comments and points per post for each hour.

In [41]:
avg_comments_by_hour = []
for hr in comments_by_hour:
    avg_comments_by_hour.append([comments_by_hour[hr]/ counts_by_hour[hr], hr])
avg_comments_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [40]:
avg_points_by_hour = []
for hr in points_by_hour:
    avg_points_by_hour.append([points_by_hour[hr]/ counts_by_hour[hr], hr])
avg_points_by_hour

[[7.311111111111111, '09'],
 [24.258823529411764, '13'],
 [18.677966101694917, '10'],
 [11.981308411214954, '14'],
 [23.35185185185185, '16'],
 [8.544117647058824, '23'],
 [10.712328767123287, '12'],
 [19.41, '17'],
 [29.99137931034483, '15'],
 [15.788990825688073, '21'],
 [14.3875, '20'],
 [13.672413793103448, '02'],
 [15.972477064220184, '18'],
 [6.925925925925926, '03'],
 [12.0, '05'],
 [13.754545454545454, '19'],
 [11.666666666666666, '01'],
 [7.197183098591549, '22'],
 [10.729166666666666, '08'],
 [8.27659574468085, '04'],
 [8.2, '00'],
 [13.431818181818182, '06'],
 [10.617647058823529, '07'],
 [14.224137931034482, '11']]

In [42]:
sorted_comments = sorted(avg_comments_by_hour, reverse = True)
sorted_points = sorted(avg_points_by_hour, reverse = True)
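Sorting works here because each inner list stores the average first: Python compares lists element by element, so the average determines the order. The sketch below shows this on three pairs rounded from the output above, along with an alternative that uses an explicit `key`. The names `avg_by_hour`, `hour_first`, and `sorted_by_key` are illustrative only.

```python
# [average, hour] pairs, rounded from the averages computed above
avg_by_hour = [[5.58, '09'], [38.59, '15'], [14.74, '13']]

# Lists compare element by element, so the average (first item) drives the order
sorted_pairs = sorted(avg_by_hour, reverse=True)
print(sorted_pairs)  # [[38.59, '15'], [14.74, '13'], [5.58, '09']]

# Alternative: keep [hour, average] order and sort with an explicit key
hour_first = [[hr, avg] for avg, hr in avg_by_hour]
sorted_by_key = sorted(hour_first, key=lambda pair: pair[1], reverse=True)
print(sorted_by_key)  # [['15', 38.59], ['13', 14.74], ['09', 5.58]]
```

The `key` form avoids having to place the average first purely for the sake of sorting.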

In [43]:
print("Top 5 Hours for Ask Post Comments")
for avg,hr in sorted_comments[:5]:
    print("{}: {:.2f} average comments per post.".format(
    dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg) )
print("\n")
print("Top 5 Hours for Ask Post Points")
for avg, hr in sorted_points[:5]:
    print("{}: {:.2f} average points per post.".format(
    dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg) )

Top 5 Hours for Ask Post Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


Top 5 Hours for Ask Post Points
15:00: 29.99 average points per post.
13:00: 24.26 average points per post.
16:00: 23.35 average points per post.
17:00: 19.41 average points per post.
10:00: 18.68 average points per post.


Among Ask posts, users can expect the most interaction if they post at 3pm (15:00), which is the best hour for both average comments and average points.
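The hours in the output are 24-hour strings such as `'15'`, while the discussion uses 12-hour labels such as 3pm. A small conversion sketch follows. The helper `to_12_hour` is hypothetical and not part of the original analysis.

```python
def to_12_hour(hour_str):
    """Convert a 24-hour string like '15' to a label like '3pm'."""
    hour = int(hour_str)
    suffix = "am" if hour < 12 else "pm"
    # 0 and 12 both display as 12 on a 12-hour clock
    return "{}{}".format(hour % 12 or 12, suffix)

for hr in ['15', '02', '20']:
    print(hr, "->", to_12_hour(hr))
```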

## Conclusion
In this project, we set out to analyze Hacker News posts to see which type of post receives more comments, and whether posts created at a certain time receive more comments on average.

We found that Ask posts receive more comments than Show posts, which fits what each type of poster is looking for. We also found that Ask posts created at 3pm receive the most comments on average, with 2am and 8pm ranking second and third.