# Analysing posts on Hacker News

### Table of contents
1) Introduction

2) Removing Headers from a List of Lists

3) Extracting Ask HN and Show HN Posts

4) Calculating the Average Number of Comments for Ask HN and Show HN Posts

5) Do posts created at a certain time receive more comments on average?
   * Finding the Amount of Ask Posts and Comments by Hour Created
   * Calculating the Average Number of Comments for Ask HN Posts by Hour

6) Conclusion

## Introduction

In this project, we'll work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

* Ask HN: How to improve my personal website?
* Ask HN: Am I the only one outraged by Twitter shutting down share counts?
* Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, a product, or just something interesting. Below are a few examples:

* Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
* Show HN: Something pointless I made
* Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

In [1]:
# Read the file in as a list of lists
from csv import reader

# Open the file
opened_file = open('hacker_news.csv')

# Read the file
read_file = reader(opened_file)

# Convert the file into a list of list format using list()
hn = list(read_file)

# Close the open file
opened_file.close()

In [2]:
# Display the first five rows
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Removing Headers from a List of Lists

In order to analyze our data, we need to first remove the row containing the column headers. Let's remove that first row next.

In [3]:
# Extract the first row
headers = hn[0]

# Remove the first row
hn1 = hn[1:]

# Display headers
print(headers)

print('\n')
# Display first five rows to verify header row has been removed properly
print(hn1[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Extracting Ask HN and Show HN Posts

Now that we've removed the headers from hn, we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith. Given a string object, say string1, we can check whether it starts with, say, dq by inspecting the result of `string1.startswith('dq')`. If string1 starts with dq, the call returns True; otherwise it returns False. The check is case-sensitive, so we lowercase each title before testing it.
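As a quick illustration of why the title is lowercased before the check, here is a minimal sketch. The first two titles come from the data set; the third is a made-up lowercase variant added only for this example.

```python
# Two titles from the data set, plus one hypothetical lowercase variant
titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "ask hn: a hypothetical lowercase title",
]

# Lowercasing first makes the prefix check case-insensitive
matches = [title.lower().startswith("ask hn") for title in titles]
print(matches)  # [True, False, True]

# Without lowercasing, the lowercase variant is missed
case_sensitive = [title.startswith("Ask HN") for title in titles]
print(case_sensitive)  # [True, False, False]
```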

In [4]:
# Create three empty lists
ask_posts = []
show_posts = []
other_posts = []

# Loop through each row in hn1
for row in hn1:
    # Assign the title in each row to a variable named title
    title = row[1]
    # Let us assign lowercase version of title to title_lower
    title_lower = title.lower()
    # If the lowercase version of title starts with ask hn,
    # append the row to ask_posts.
    if title_lower.startswith('ask hn'):
        ask_posts.append(row)
    # Else if the lowercase version of title starts with show hn, append the row to show_posts
    elif title_lower.startswith('show hn'):
        show_posts.append(row)
    # Else append to other_posts
    else:
        other_posts.append(row)
        
print('Number of posts in ask_posts: ',len(ask_posts))
print('Number of posts in show_posts: ',len(show_posts))
print('Number of posts in other_posts: ',len(other_posts))

        

Number of posts in ask_posts:  1744
Number of posts in show_posts:  1162
Number of posts in other_posts:  17194


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

We separated the "ask posts" and the "show posts" into two lists of lists named ask_posts and show_posts. We will print the first five rows of each, and then determine whether ask posts or show posts receive more comments on average.

In [5]:
# First five rows in the ask_posts list of lists
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


In [6]:
# First five rows in the show_posts list of lists
print(show_posts[:5])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


Let us determine whether ask posts receive more comments on average.

In [7]:
# Assign total number of comments in ask posts to total_ask_comments
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

# Compute the average number of comments on ask posts and assign it to avg_ask_comments
# To do that we will divide total_ask_comments by the number of ask posts
avg_ask_comments = total_ask_comments / len(ask_posts)

print('avg_ask_comments :', avg_ask_comments)

# Find the total number of comments in show posts and assign it to total_show_comments
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
# Compute the average number of comments on show posts and assign it to avg_show_comments
# To calculate the average, we will divide total_show_comments by the number of show posts
avg_show_comments = total_show_comments / len(show_posts)

print('avg_show_comments :', avg_show_comments)
    
    

avg_ask_comments : 14.038417431192661
avg_show_comments : 10.31669535283993


The result shows that, on average, ask posts receive more comments: approximately 14 per post in our sample, compared with approximately 10 for show posts. Since ask posts receive more comments on average, we'll focus our remaining analysis just on these posts.
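The two averaging loops above follow the same pattern, so they could be folded into one helper. The sketch below is only an illustration: `average_comments` and its `comments_index` parameter are hypothetical names, and the three sample rows are copied from the ask_posts output shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows (0.0 for an empty list)."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Three rows copied from the ask_posts output above
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

sample_avg = average_comments(sample_ask)
print(sample_avg)  # (6 + 29 + 1) / 3 = 12.0
```

Dividing by `len(posts)` rather than a typed-in count keeps the average correct if the filtering step ever changes.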

## Do posts created at a certain time receive more comments on average?


We'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1) Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

2) Calculate the average number of comments ask posts receive by hour created.

## Finding the Amount of Ask Posts and Comments by Hour Created

We will tackle the first step: calculating the number of ask posts and comments by hour created. We'll use the datetime [module](https://docs.python.org/3/library/datetime.html) to work with the data in the created_at column.
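To make the parsing step concrete, here is a small sketch of how a created_at string can be turned into an hour label. The two timestamps are taken from the rows displayed earlier; the variable names are only illustrative.

```python
import datetime as dt

# Format of the created_at column: month/day/year hour:minute
date_format = "%m/%d/%Y %H:%M"

# strptime parses the string into a datetime object
created = dt.datetime.strptime("8/4/2016 11:52", date_format)
hour_label = created.strftime("%H")
print(hour_label)  # 11

# strftime("%H") zero-pads the hour, so midnight becomes '00'
midnight = dt.datetime.strptime("6/17/2016 0:01", date_format)
midnight_label = midnight.strftime("%H")
print(midnight_label)  # 00
```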

In [19]:
# First, import the datetime module as dt.
import datetime as dt

# Create an empty list and assign it to result_list. This will be a list of lists.
result_list = []

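# Iterate over ask_posts and append to result_list a list with two elements:
# the creation timestamp and the number of comments.
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

# counts_by_hour maps each hour ('00' to '23') to the number of ask posts created in it;
# comments_by_hour maps each hour to the total comments those posts received.
counts_by_hour = {}
comments_by_hour = {}

# Format of the created_at strings, e.g. '8/16/2016 9:55'
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date = row[0]
    comments = row[1]
    # Parse the timestamp and extract the zero-padded hour as a string
    date_dt = dt.datetime.strptime(date, date_format)
    date_str = date_dt.strftime("%H")
    if date_str not in counts_by_hour:
        counts_by_hour[date_str] = 1
        comments_by_hour[date_str] = comments
    else:
        counts_by_hour[date_str] += 1
        comments_by_hour[date_str] += comments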

In [9]:
print(counts_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


In [10]:
print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
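The same tallies could also be built without the explicit "not in" check by using `collections.defaultdict`. This is an alternative sketch, not the approach used in this notebook. The first five sample rows are the created_at and num_comments values of the ask posts shown earlier; the last row is hypothetical and is there only to show two posts landing in the same hour.

```python
from collections import defaultdict
import datetime as dt

sample_rows = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["8/2/2016 14:20", 3],
    ["10/15/2015 16:38", 17],
    ["1/1/2016 9:05", 4],  # hypothetical row, not from the data set
]

# Missing keys start at 0, so no membership test is needed
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'09': 2, '13': 1, '10': 1, '14': 1, '16': 1}
print(dict(comments))  # {'09': 10, '13': 29, '10': 1, '14': 3, '16': 17}
```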


## Calculating the Average Number of Comments for Ask HN Posts by Hour

We created two dictionaries:

* counts_by_hour : contains the number of ask posts created during each hour of the day.

* comments_by_hour : contains the corresponding number of comments ask posts created at each hour received.

We'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [11]:
# Calculate the average number of comments that `Ask HN` posts created at each hour of the day receive.
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

### Sorting and Printing Values from a List of Lists

We calculated the average number of comments for posts created during each hour of the day, and stored the results in a list of lists named **avg_by_hour**. Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read. 

In [16]:
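# Swap each [hour, average] pair to (average, hour) so that sorting orders by average
swap_avg_by_hour = []
for row in avg_by_hour:
    # Use a name other than `list` to avoid shadowing the built-in
    swapped = (row[1], row[0])
    swap_avg_by_hour.append(swapped)

print(swap_avg_by_hour)

# Sort in descending order of average comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)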

[(5.5777777777777775, '09'), (14.741176470588234, '13'), (13.440677966101696, '10'), (13.233644859813085, '14'), (16.796296296296298, '16'), (7.985294117647059, '23'), (9.41095890410959, '12'), (11.46, '17'), (38.5948275862069, '15'), (16.009174311926607, '21'), (21.525, '20'), (23.810344827586206, '02'), (13.20183486238532, '18'), (7.796296296296297, '03'), (10.08695652173913, '05'), (10.8, '19'), (11.383333333333333, '01'), (6.746478873239437, '22'), (10.25, '08'), (7.170212765957447, '04'), (8.127272727272727, '00'), (9.022727272727273, '06'), (7.852941176470588, '07'), (11.051724137931034, '11')]


In [17]:
sorted_swap

[(38.5948275862069, '15'),
 (23.810344827586206, '02'),
 (21.525, '20'),
 (16.796296296296298, '16'),
 (16.009174311926607, '21'),
 (14.741176470588234, '13'),
 (13.440677966101696, '10'),
 (13.233644859813085, '14'),
 (13.20183486238532, '18'),
 (11.46, '17'),
 (11.383333333333333, '01'),
 (11.051724137931034, '11'),
 (10.8, '19'),
 (10.25, '08'),
 (10.08695652173913, '05'),
 (9.41095890410959, '12'),
 (9.022727272727273, '06'),
 (8.127272727272727, '00'),
 (7.985294117647059, '23'),
 (7.852941176470588, '07'),
 (7.796296296296297, '03'),
 (7.170212765957447, '04'),
 (6.746478873239437, '22'),
 (5.5777777777777775, '09')]

In [18]:
print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], "%H")
    hr = hour.strftime("%H:%M")
    print("{}: {:.2f} average comments per post.".format(hr, row[0]))
    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


From the result above, 15:00 appears to be the best hour to post, with an average of 38.59 comments per post.
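Swapping the columns is one way to sort by the average. Another option, sketched here as an alternative, is to pass a `key` function to `sorted` so the original [hour, average] layout is kept. The sample values are a subset of avg_by_hour from above; `avg_sample` and `top` are illustrative names.

```python
# A subset of the [hour, average] pairs computed above
avg_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
    ['21', 16.009174311926607],
]

# Sort by the second element (the average), highest first
top = sorted(avg_sample, key=lambda row: row[1], reverse=True)

for hour, avg in top[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post.")
```

With this subset, the three hours printed are 15:00, 02:00 and 20:00, in that order, matching the top of the full ranking.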

## Conclusion 

In this project, we analyzed both **ask posts** and **show posts** to determine which type of post, and which time of creation, receives the most comments on average. Our analysis shows that, to attract the most comments, a post should be an **ask post** created between 15:00 and 16:00 (3:00 pm EST to 4:00 pm EST).