# Exploring Hacker News Posts
>Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

Dataset found here: [Hacker News Posts](https://www.kaggle.com/hacker-news/hacker-news-posts)

The goal is to explore posts whose titles begin with `Ask HN` or `Show HN` and answer the following:
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [3]:
from csv import reader

# open HN file
with open("hacker_news.csv") as f:
    hn = list(reader(f))

hn[:3]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30']]

So we have the following columns:
- id: the identifier of the post
- title: the title of the post
- url: the url of the item being linked to
- num_points: the number of upvotes the post received
- num_comments: the number of comments the post received
- author: the name of the account that made the post
- created_at: the date and time the post was made (the time zone is Eastern Time in the US)
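
Because `created_at` is recorded in US Eastern Time, it can help to convert a timestamp before comparing it with your own local clock. The following is a minimal sketch, not part of the original analysis. It assumes Python 3.9+ (for `zoneinfo`) and that "Eastern Time" corresponds to the `America/New_York` zone; `to_utc` is a hypothetical helper name.

```python
import datetime as dt
from zoneinfo import ZoneInfo

# Assumption: the dataset's "Eastern Time" maps to America/New_York
EASTERN = ZoneInfo("America/New_York")

def to_utc(created_at):
    """Parse a created_at string such as '8/4/2016 11:52' and convert it to UTC."""
    naive = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    # Attach the Eastern zone, then convert; daylight saving time is handled by zoneinfo
    return naive.replace(tzinfo=EASTERN).astimezone(dt.timezone.utc)

summer = to_utc("8/4/2016 11:52")    # daylight time, UTC-4
winter = to_utc("1/26/2016 19:30")   # standard time, UTC-5
print(summer.strftime("%Y-%m-%d %H:%M"))  # 2016-08-04 15:52
print(winter.strftime("%Y-%m-%d %H:%M"))  # 2016-01-27 00:30
```

Note that the offset from UTC changes between summer and winter, so a fixed "minus five hours" rule would be wrong for part of the year.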

In [4]:
# remove column headers
headers = hn[0]
hn = hn[1:]

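Let's split our dataset into three lists: posts whose titles start with `ask hn`, posts whose titles start with `show hn`, and all other posts. Titles are lowercased first so the match is case-insensitive.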

In [5]:
ask_posts = []
show_posts = []
other_posts = []

# split ask, show, and other posts into separate lists
for row in hn:
    title = row[1].lower()
    
    if title.startswith("ask hn"):
        ask_posts.append(row)
        
    elif title.startswith("show hn"):
        show_posts.append(row)
        
    else:
        other_posts.append(row)

print("Number of ask posts: {:,}".format(len(ask_posts)))
print("Number of show posts: {:,}".format(len(show_posts)))
print("Number of other posts: {:,}".format(len(other_posts)))

Number of ask posts: 1,744
Number of show posts: 1,162
Number of other posts: 17,194
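
The prefix test used above can be checked in isolation. This is a small sketch with a hypothetical `classify` helper; the first two titles are made up for illustration, and the third is the sample title shown earlier.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# The first two titles are hypothetical, not taken from the dataset
samples = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "Interactive Dynamic Video",
]
labels = [classify(title) for title in samples]
print(labels)  # ['ask', 'show', 'other']
```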


## Part 1: exploring comments


### Goals
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?


Next, we'll get the average number of comments for both ask and show posts.

In [19]:
# return average of a given column's elements
def col_average(posts, column):
    total_elements = 0
    for row in posts:
        
        n_elements = int(row[column])
        total_elements += n_elements
    
    return total_elements / len(posts)
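
As a quick sanity check, the averaging logic can be run on the two sample rows shown at the top of the notebook. This sketch restates `col_average` so it runs on its own; only the `num_points` and `num_comments` values are copied from those rows, and the other fields are left blank for brevity.

```python
def col_average(posts, column):
    """Average the integer values found in one column of a list of rows."""
    total_elements = 0
    for row in posts:
        total_elements += int(row[column])
    return total_elements / len(posts)

# Columns: id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ["", "", "", "386", "52", "", ""],
    ["", "", "", "39", "10", "", ""],
]

avg_comments = col_average(sample_rows, 4)  # (52 + 10) / 2
avg_points = col_average(sample_rows, 3)    # (386 + 39) / 2
print(avg_comments, avg_points)  # 31.0 212.5
```

The function raises `ZeroDivisionError` if it is given an empty list, which is worth keeping in mind if a filter ever matches no posts.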

In [20]:
# average number of comments in ask posts
avg_ask_comments = col_average(ask_posts, 4)

print("Average number of comments on ask posts: {:,.2f}".format(
    avg_ask_comments))


# average number of comments in show posts
avg_show_comments = col_average(show_posts, 4)

print("Average number of comments on show posts: {:,.2f}".format(
    avg_show_comments))

Average number of comments on ask posts: 14.04
Average number of comments on show posts: 10.32


Ask posts get more comments on average (14.04 versus 10.32), most likely because they pose a question that readers answer in the comments.

Looking at ask posts in more detail, we can calculate the average number of comments per post for each hour of the day to see when it's best to post if we want the most responses.

In [31]:
import datetime as dt

# return list of average of elements in column per hour
def average_perhour(posts, column):
    counts_by_hour = {}
    elements_by_hour = {}
    
    for row in posts:
        
        created_dt = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M") # format '8/4/2016 11:52'
        hour = created_dt.strftime("%H") # could chain to above
        n_elements = int(row[column])
        
        if hour not in counts_by_hour:
            counts_by_hour[hour] = 1
            elements_by_hour[hour] = n_elements
        
        else:            
            counts_by_hour[hour] += 1
            elements_by_hour[hour] += n_elements
    
    avg_by_hour = []
    for hour in counts_by_hour:
        avg_by_hour.append([hour, elements_by_hour[hour]/counts_by_hour[hour]])
    
    return avg_by_hour
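
The same per-hour aggregation can be written more compactly with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is an alternative sketch, not the version used in the rest of the notebook; `average_per_hour_compact` and the `date_column` parameter are hypothetical names, and the third test row is made up.

```python
import datetime as dt
from collections import defaultdict

def average_per_hour_compact(posts, column, date_column=6):
    """Return [hour, average] pairs for the given column, grouped by creation hour."""
    totals = defaultdict(int)  # sum of the column's values per hour
    counts = defaultdict(int)  # number of posts per hour
    for row in posts:
        created = dt.datetime.strptime(row[date_column], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        totals[hour] += int(row[column])
        counts[hour] += 1
    return [[hour, totals[hour] / counts[hour]] for hour in counts]

rows = [
    ["", "", "", "386", "52", "", "8/4/2016 11:52"],
    ["", "", "", "39", "10", "", "1/26/2016 19:30"],
    ["", "", "", "0", "4", "", "8/5/2016 11:05"],  # hypothetical row
]
result = average_per_hour_compact(rows, 4)
print(result)  # [['11', 28.0], ['19', 10.0]]
```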

In [32]:
# average ask posts by hour
avg_by_hour = average_perhour(ask_posts, 4)
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

If we swap the columns so that the average comes first, sorting the list in descending order puts the hours with the highest averages at the top:

In [33]:
def swap_list(unswapped):
    swapped = []
    
    for row in unswapped:
        swapped.append([row[1], row[0]])
        
    return swapped

In [34]:
swap_avg_by_hour = swap_list(avg_by_hour)

swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [12]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[:5]:
    
    print("{}: {:.2f} average comments per post".format(
    dt.datetime.strptime(hr, "%H").strftime("%H:%M"),
    avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
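
Swapping the columns is one way to sort by the average. Another option, sketched here as an alternative, is to pass a `key` function to `sorted` so the original `[hour, average]` layout is kept. The values below are a subset of the ask-post averages shown above.

```python
# A few [hour, average] pairs copied from the ask-post results above
avg_by_hour = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["16", 16.796296296296298],
]

# Sort on the second element (the average), highest first
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

This avoids building an intermediate swapped list and sorts only on the average, rather than on the hour string as a tie-breaker.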


The most active commenting period for ask posts is between 15:00 and 16:00 Eastern Time, with an average of 38.59 comments per post.

The next best hours to create an ask post are 02:00-03:00 and 20:00-21:00 Eastern Time.

In [42]:
show_comments_by_hour = average_perhour(show_posts, 4)

swap_show_comments_by_hour = swap_list(show_comments_by_hour)
sorted_swap_show = sorted(swap_show_comments_by_hour, reverse=True)

print("Top 5 Hours for Show Posts Comments")
for avg, hr in sorted_swap_show[:5]:
    
    print("{}: {:.2f} average comments per post".format(
    dt.datetime.strptime(hr, "%H").strftime("%H:%M"),
    avg))

Top 5 Hours for Show Posts Comments
18:00: 15.77 average comments per post
00:00: 15.71 average comments per post
14:00: 13.44 average comments per post
23:00: 12.42 average comments per post
22:00: 12.39 average comments per post


Show posts, in comparison, have a more uniform distribution of comments per post: the top five hours range from about 12 to 16 average comments, and four of them fall between 18:00 and 00:00 Eastern Time.

## Part 2: exploring points

Let's move on to points and see what the data shows:
- Do show or ask posts receive more points on average?
- Are posts created at a certain time more likely to receive more points?


In [25]:
# average points for ask posts
avg_ask_points = col_average(ask_posts, 3)

print("Average number of points on ask posts: {:,.2f}".format(
    avg_ask_points))


# average points for show posts
avg_show_points = col_average(show_posts, 3)
print("Average number of points on show posts: {:,.2f}".format(
    avg_show_points))

Average number of points on ask posts: 15.06
Average number of points on show posts: 27.56


Show posts receive more points on average (27.56 versus 15.06), which is plausible since they showcase a project that readers can upvote without needing to reply.

Let's compare the average of points by hour for both:

In [41]:
ask_points_by_hour = average_perhour(ask_posts, 3)

swap_ask_points_by_hour = swap_list(ask_points_by_hour)
sorted_swap_ask = sorted(swap_ask_points_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Points")
for avg, hr in sorted_swap_ask[:5]:
    
    print("{}: {:.2f} average points per post".format(
    dt.datetime.strptime(hr, "%H").strftime("%H:%M"),
    avg))
print("\n")

show_points_by_hour = average_perhour(show_posts, 3)

swap_show_points_by_hour = swap_list(show_points_by_hour)
sorted_swap_show = sorted(swap_show_points_by_hour, reverse=True)

print("Top 5 Hours for Show Posts Points")
for avg, hr in sorted_swap_show[:5]:
    
    print("{}: {:.2f} average points per post".format(
    dt.datetime.strptime(hr, "%H").strftime("%H:%M"),
    avg))

Top 5 Hours for Ask Posts Points
15:00: 29.99 average points per post
13:00: 24.26 average points per post
16:00: 23.35 average points per post
17:00: 19.41 average points per post
10:00: 18.68 average points per post


Top 5 Hours for Show Posts Points
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post


Comparing these results with the most active commenting hours for ask and show posts:

Ask posts created from 15:00-16:00 receive the most comments, and that hour also ranks 1st for points; the following hour, 16:00-17:00, ranks 3rd. The 13:00 hour comes in 2nd for points but not for comments, which could suggest that people scroll through Hacker News at lunchtime and upvote without responding.

Show posts have a much higher average of points per post than ask posts. The best times to create a show post are the hours from 22:00 through 00:00, or 12:00-13:00 (all Eastern Time).