# Exploring 'Ask HN' and 'Show HN' posts in Hacker News

[Hacker News](https://news.ycombinator.com/) is a site, similar to Reddit, where user-submitted stories or posts are voted and commented on. It is popular within the technology and startup communities, and a single post can receive hundreds of thousands of visitors.

The data set used in this project can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from 300,000 rows to approximately 20,000 rows by removing submissions that did not receive comments and then randomly sampling from the remaining submissions. The table below describes the columns of the data set.

| Column Name 	| Description 	|
|:-	|:-	|
| `id` 	| The unique identifier from Hacker News for the post 	|
| `title` 	| The title of the post 	|
| `url` 	| The URL that the post links to, if the post has a URL 	|
| `num_points` 	| The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes 	|
| `num_comments` 	| The number of comments that were made on the post 	|
| `author` 	| The username of the person who submitted the post 	|
| `created_at` 	| The date and time at which the post was submitted 	|

The two types of posts that will be focused on are those whose titles start with 'Ask HN' (where a user asks the Hacker News community a specific question) and 'Show HN' (where a user shares a project, product, or some interesting information with the Hacker News community).

The purpose of this project is to compare these two types of posts and answer two questions:

- Do 'Ask HN' or 'Show HN' receive more comments on average?
- Do posts created at a certain time receive more comments on average?

# Importing and prepping the data set

We begin the analysis process by importing and reading the data set.

In [1]:
#open and read the csv file, closing it once the rows are loaded
from csv import reader
with open('hacker_news.csv') as open_file:
    read_file = reader(open_file)
    hn = list(read_file)

#separate the header row and data
headers = hn[0]
hn = hn[1:]

#checking to see if header was removed properly
print(headers)
print('\n')
#display first 5 rows
for row in hn[:5]:
    print (row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
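
The cells that follow address columns by position (for example `row[1]` for the title and `row[4]` for the comment count). As a hedged alternative, the sketch below reads the same layout with `csv.DictReader`, so that columns can be addressed by name. It uses an in-memory sample built from the header and the first row shown above rather than the real file; `sample_csv` and `first` are illustrative names only.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv: the header plus the first data row
# from the output above (an assumption for illustration, not the full file).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader uses the header row as keys, so no index bookkeeping is needed.
rows = list(csv.DictReader(sample_csv))
first = rows[0]

# Values are still strings and must be converted before doing arithmetic.
print(first['title'], int(first['num_comments']))
```

The rest of this analysis keeps the list-of-lists representation, so this is only an option for readers who prefer named columns.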




We also define a helper function that displays the first three rows of a data set for checking purposes.

In [2]:
def display_3(dataset):
    for row in dataset[:3]:
        print(row)

# Finding the averages

Below, we split the data set into three lists: one for `Ask HN` posts, one for `Show HN` posts, and one for all other posts.

In [3]:
#Create three different empty lists to separate the hn dataset
ask_posts = []
show_posts = []
other_posts = []

#Appending values into the three data sets by seeing what the line starts with,
#converting the case of the string to lower case
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
display_3(ask_posts)
print('\n')
print(len(show_posts))
display_3(show_posts)
print('\n')
print(len(other_posts))
display_3(other_posts)

1744
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


1162
['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']


17194
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the S

From the results above, there are 1,744 'Ask HN' posts, 1,162 'Show HN' posts, and 17,194 other posts. Focusing on the 'Ask HN' and 'Show HN' posts, we can see that there are more 'Ask HN' posts than 'Show HN' posts.

Next, we calculate the average number of comments on 'Ask HN' and 'Show HN' posts.
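
To put these counts in proportion, the short sketch below computes each group's share of the sampled data set. The counts are copied from the output above; the dictionary name `post_counts` is only an illustrative choice.

```python
# Post counts copied from the output above.
post_counts = {'Ask HN': 1744, 'Show HN': 1162, 'Other': 17194}

# The three groups together make up the whole sample.
total_posts = sum(post_counts.values())  # 20100

# Express each group as a percentage of the sample, rounded to one decimal.
shares = {}
for post_type, count in post_counts.items():
    shares[post_type] = round(100 * count / total_posts, 1)
    print('{}: {}% of posts'.format(post_type, shares[post_type]))
```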

In [4]:
#creating a function that returns the average number of comments for one of the lists
def avg_comments(dataset):
    #initializing variables
    total_comments = 0
    post_count = 0
    #iterating through the list
    for row in dataset:
        comments = int(row[4]) #converting the string to int
        total_comments = total_comments + comments
        post_count += 1
    #use a separate name so the local variable does not shadow the function
    average = total_comments/post_count
    return average

print('The average amount of comments for the `ask_posts` data set is', round(avg_comments(ask_posts)))
print('The average amount of comments for the `show_posts` data set is', round(avg_comments(show_posts)))


        

The average amount of comments for the `ask_posts` data set is 14
The average amount of comments for the `show_posts` data set is 10


The results above show that 'Ask HN' posts average more comments (about 14 per post) than 'Show HN' posts (about 10 per post). The 'Ask HN' average is higher even though it is spread across more posts.

One possible explanation is that questions invite replies: users may be more inclined to comment when they can help the author than when a post simply presents a project or finding.
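
Because an average can be pulled upward by a few heavily commented posts, it may be worth checking the median alongside it. The sketch below is only an illustration of that check using the standard-library `statistics` module; it uses the comment counts of the three sample 'Ask HN' rows displayed earlier (6, 29, and 1), not the full list, so its numbers say nothing about the data set as a whole.

```python
import statistics

# Comment counts from the three 'Ask HN' sample rows shown earlier.
sample_comments = [6, 29, 1]

# The mean is sensitive to the single large value; the median is not.
sample_mean = statistics.mean(sample_comments)
sample_median = statistics.median(sample_comments)

print('mean:', sample_mean)
print('median:', sample_median)
```

Applying the same two calls to `[int(row[4]) for row in ask_posts]` would give the corresponding figures for the real list.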

# Finding the hours when posts receive the most comments

Now we'd like to see if there is a general time period in the day where posts receive more comments. To find out we have to:

1. Calculate the number of posts that are created each hour of the day and the comments from those posts
2. Calculate the average number of comments that posts receive by the hour created

Below we create functions to apply to both the `ask_posts` list and `show_posts` list. 

The function `posts_created_hourly` returns two dictionaries, each keyed by the hour of the day:

1. `counts_by_hour` contains the number of posts created during each hour
2. `comments_by_hour` contains the total number of comments received by posts created during each hour

In [5]:
import datetime as dt #import datetime module

#creating a function to find post created by the hour
def posts_created_hourly(dataset):
    
    result_list = []
    
    #store 'date created' and 'number of comments' into the list
    for row in dataset:
        created_at = row[6]
        num_comments = int(row[4]) #converting string to int
        result_list.append([created_at, num_comments])
    
    counts_by_hour = {}
    comments_by_hour = {}
    #store the values above into two different dictionaries with matching key values
    for row in result_list:
        create_date = row[0]
        comment = row[1]
        create_date_dt = dt.datetime.strptime(create_date, '%m/%d/%Y %H:%M') #parse the date string into a datetime object
        hour = create_date_dt.strftime('%H') #keep only the hour as a two-digit string
        if hour not in counts_by_hour:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = comment
        else:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += comment

    return counts_by_hour, comments_by_hour

#call the function to the two lists
ask_counts_by_hour, ask_comments_by_hour = posts_created_hourly(ask_posts)
show_counts_by_hour, show_comments_by_hour = posts_created_hourly(show_posts)

print("The frequency table of 'Ask HN' posts per hour:")
print(ask_counts_by_hour, ask_comments_by_hour)
print('\n')
print("the frequency table of 'Show HN' posts per hour:")
print(show_counts_by_hour, show_comments_by_hour)


The frequency table of 'Ask HN' posts per hour:
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58} {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


the frequency table of 'Show HN' posts per hour:
{'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31} {'14': 1156, '22': 570, '18': 962, '07': 299, '20': 612, '05': 58, '16': 1084, '19': 539, '15': 632, '03': 287, '17': 911, '06': 142, '02
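
The tallying step inside `posts_created_hourly` could also be written with `collections.Counter`, which removes the need for the `if hour not in ...` branch. The following is a minimal sketch under that assumption, not a replacement for the cell above; `tally_by_hour` and `sample_pairs` are hypothetical names, and the sample pairs are the `created_at` and `num_comments` values of the three 'Ask HN' rows displayed earlier.

```python
import datetime as dt
from collections import Counter

# (created_at, num_comments) pairs taken from the three sample 'Ask HN' rows.
sample_pairs = [
    ('8/16/2016 9:55', 6),
    ('11/22/2015 13:43', 29),
    ('5/2/2016 10:14', 1),
]

def tally_by_hour(pairs):
    """Return (post counts, comment totals), both keyed by two-digit hour."""
    counts = Counter()
    comments = Counter()
    for created_at, num_comments in pairs:
        # Parse the timestamp, then keep only the hour as a string like '09'.
        hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
        # Missing keys in a Counter start at 0, so no existence check is needed.
        counts[hour] += 1
        comments[hour] += num_comments
    return counts, comments

sample_counts, sample_totals = tally_by_hour(sample_pairs)
print(dict(sample_counts))
print(dict(sample_totals))
```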

From the two dictionaries created above, we can now determine the average number of comments for posts created during each hour of the day. To do this, we create a function called `avg_comments_by_hour` that we can apply to both the `ask_posts` and `show_posts` results. The function takes both dictionaries, divides the comment total by the post count for each matching hour, and appends the hour and its average to a new list.

In [6]:
#create function to calculate the average amount of comments created at each hour
def avg_comments_by_hour(counts_by_hour, comments_by_hour):
    avg_by_hour = []
    for key in counts_by_hour:
        avg = (comments_by_hour[key]/counts_by_hour[key])
        avg = round(avg, 2)
        avg_by_hour.append([key,avg])
    
    return avg_by_hour

ask_avg_by_hour = avg_comments_by_hour(ask_counts_by_hour, ask_comments_by_hour)
show_avg_by_hour = avg_comments_by_hour(show_counts_by_hour, show_comments_by_hour)

print(ask_avg_by_hour)
print('\n')
print(show_avg_by_hour)

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]


[['14', 13.44], ['22', 12.39], ['18', 15.77], ['07', 11.5], ['20', 10.2], ['05', 3.05], ['16', 11.66], ['19', 9.8], ['15', 8.1], ['03', 10.63], ['17', 9.8], ['06', 8.88], ['02', 4.23], ['13', 9.56], ['08', 4.85], ['21', 5.79], ['04', 9.5], ['11', 11.16], ['12', 11.8], ['23', 12.42], ['09', 9.7], ['01', 8.79], ['10', 8.25], ['00', 15.71]]


We have now calculated the average number of comments per post for each hour, but the output is somewhat difficult to read. The next step is to reformat the lists and print the five highest values.

In [7]:
#create function that sorts the output of 'avg_comments_by_hour' and prints the top 5 hours
def sort_by_avg_hour(avg_by_hour, post_type):
    swap_avg_by_hour = []
    for rows in avg_by_hour:
        swap_avg_by_hour.append([rows[1], rows[0]])
    #Sort from highest to lowest
    sorted_swap = sorted(swap_avg_by_hour, reverse = True)
    print('Top 5 hours for {} post comments:'.format(post_type))
    print_sorted(sorted_swap)
    #return the sorted list so the assignments below hold data rather than None
    return sorted_swap

#a subfunction that displays the first 5 rows in a specified format
def print_sorted(sorted_swap):
    the_format = '{hr}: {avg} average comments per post'
    for row in sorted_swap[:5]:
        hour = dt.datetime.strptime(row[1], '%H')
        hour_str = hour.strftime('%H:%M')
        average = row[0]
        print(the_format.format(hr=hour_str, avg = average))
        
sorted_ask_avg_by_hour = sort_by_avg_hour(ask_avg_by_hour, 'Ask HN')
print('\n')
sorted_show_avg_by_hour = sort_by_avg_hour(show_avg_by_hour, 'Show HN')
        

        
    

Top 5 hours for Ask HN post comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post


Top 5 hours for Show HN post comments:
18:00: 15.77 average comments per post
00:00: 15.71 average comments per post
14:00: 13.44 average comments per post
23:00: 12.42 average comments per post
22:00: 12.39 average comments per post
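
The swap-then-sort step above is one way to order the hours. As a hedged alternative, `sorted` can be given a `key` so the `[hour, average]` pairs are sorted by their second element directly, without building a swapped copy. The sketch below uses a handful of pairs copied from the 'Ask HN' output; `ask_sample` and `top_three` are illustrative names.

```python
from operator import itemgetter

# A few [hour, average] pairs copied from the 'Ask HN' averages above.
ask_sample = [['09', 5.58], ['15', 38.59], ['20', 21.52], ['02', 23.81], ['16', 16.8]]

# Sort by the average (index 1), highest first, and keep the top three.
top_three = sorted(ask_sample, key=itemgetter(1), reverse=True)[:3]

for hour, avg in top_three:
    print('{}:00: {} average comments per post'.format(hour, avg))
```

One difference to be aware of: when two hours have the same average, this version keeps them in their original order, whereas the swapped lists fall back to comparing the hour strings.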


With the output in a more readable format, we can see that 'Ask HN' posts get the most comments on average when created at 3 PM Eastern Time, followed by 2 AM, 8 PM, 4 PM, and 9 PM. For 'Show HN' posts, the top hour is 6 PM, followed by 12 AM, 2 PM, 11 PM, and 10 PM. To convert these times to Pacific Standard Time, we would subtract three hours.
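
The three-hour shift to Pacific time can be sketched as simple modular arithmetic on the hour. This assumes a fixed three-hour offset between the two zones; `eastern_hour_to_pacific` is a hypothetical helper, and the hour lists are copied from the top-five output above.

```python
def eastern_hour_to_pacific(hour_str):
    """Shift a two-digit Eastern hour string back three hours, wrapping past midnight."""
    pacific_hour = (int(hour_str) - 3) % 24
    return '{:02d}:00'.format(pacific_hour)

# Top five Eastern hours for each post type, copied from the output above.
ask_top_hours = ['15', '02', '20', '16', '21']
show_top_hours = ['18', '00', '14', '23', '22']

ask_pacific = [eastern_hour_to_pacific(h) for h in ask_top_hours]
show_pacific = [eastern_hour_to_pacific(h) for h in show_top_hours]

print('Ask HN (Pacific):', ask_pacific)
print('Show HN (Pacific):', show_pacific)
```

Note that hours before 03:00 Eastern wrap to the previous calendar day in Pacific time, which this sketch does not track.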

# Conclusion

The purpose of the project was to see whether 'Ask HN' or 'Show HN' posts receive more comments on average, and whether the hour at which a post is created affects how many comments it receives. From the results, 'Ask HN' posts average about 14 comments, compared with about 10 for 'Show HN' posts. Looking at each post type by hour of creation, posts receive the most comments on average when created at 12 PM PST for 'Ask HN' and at 3 PM PST for 'Show HN'. After converting to PST, almost all of the top five hours for both post types fall in the afternoon or evening, with the only morning hour being 11 AM PST for 'Show HN' posts. Based on this sample, which excludes posts that received no comments, the post most likely to attract comments is an 'Ask HN' post created around 3 PM Eastern Time (12 PM PST).