# Exploring Hacker News Posts
## Introduction

We'll work with a dataset of submissions to the popular technology site Hacker News.

**What is Hacker News?**

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting.

## Questions we will answer
* **Do Ask HN or Show HN receive more comments on average?**
* **Do posts created at a certain time receive more comments on average?**

To answer these questions, we gathered 20,000 rows of Hacker News submissions that received at least one comment.
We will work with posts whose titles begin with either *Ask HN* or *Show HN* and compare these two types of posts.

# First question: Do Ask HN or Show HN receive more comments on average?



In [1]:
# Let's start by importing what we need and reading the dataset into a list of lists.
from csv import reader

# The with block closes the file once it has been read.
with open('hacker_news.csv') as open_hn:
    hn = list(reader(open_hn))

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
# Now we will remove the header
headers = hn[0]
hn = hn[1:]
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
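A common alternative for separating the header from the data is to pull the first row off the reader with `next()`. The sketch below assumes nothing about the real file: it uses an in-memory string built from the header and first row shown in the output above, and the names `sample_csv`, `sample_headers` and `sample_data` are introduced only for this illustration.

```python
import csv
import io

# Two lines copied from the output above, used as an in-memory stand-in for the file.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

with io.StringIO(sample_csv) as handle:
    rows = csv.reader(handle)
    sample_headers = next(rows)  # the first row is the header
    sample_data = list(rows)     # the remaining rows are the data

print(sample_headers)
print(sample_data)
```

This avoids slicing the full list twice, which may matter little for 20,000 rows but keeps the header and the data separate from the start.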


*Next, we will extract the **Ask HN** and **Show HN** posts.*

*We will create two new lists of lists containing just the rows for these two types of posts.*

In [3]:
# We'll use the string method startswith().
# Since startswith() is case-sensitive, we lowercase each title with lower() first.

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))        
print(len(other_posts))


1744
1162
17194
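To illustrate why the titles are lowercased before the prefix check, here is a minimal sketch. The titles in `sample_titles` are invented for illustration and are not rows from the dataset, and `classify` is a hypothetical helper that mirrors the loop above.

```python
# Hypothetical titles, invented for illustration (not rows from the dataset).
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "ask hn: Best books on statistics?",
    "Show HN: My weekend project",
    "A story about something else",
]

def classify(title):
    """Return 'ask', 'show', or 'other' using a case-insensitive prefix check."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

labels = [classify(title) for title in sample_titles]
print(labels)

# A case-sensitive check for 'Ask HN' misses the all-lowercase title.
case_sensitive_hits = [t for t in sample_titles if t.startswith("Ask HN")]
print(len(case_sensitive_hits))
```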


*Now we will determine whether Ask HN or Show HN posts receive more comments on average.*

In [4]:
# First we find the total number of comments in each of the ask_posts and show_posts lists.
# The comments data is in the column with index 4.

total_ask_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
    
total_show_comments = 0    
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
    
# We need the total number of posts for Ask HN and Show HN.
ask_number_posts = len(ask_posts)
show_number_posts = len(show_posts) 

# We then compute the average number of comments per Ask HN post and per Show HN post:
avg_ask_comments = total_ask_comments / ask_number_posts
avg_show_comments = total_show_comments / show_number_posts

print('Average number of comments per Ask HN post: ', avg_ask_comments)
print('Average number of comments per Show HN post: ', avg_show_comments)



Average number of comments per Ask HN post:  14.038417431192661
Average number of comments per Show HN post:  10.31669535283993


**Answer to: Do Ask HN or Show HN receive more comments on average?**

Ask HN posts receive more comments on average: about 14.04 comments per post, compared with about 10.32 comments per post for Show HN.
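To put the gap in perspective, the relative difference can be worked out from the two averages. This is a quick sketch that uses the rounded values reported above; the variable names are introduced only for this illustration.

```python
# Rounded averages taken from the output above.
avg_ask = 14.04
avg_show = 10.32

# Relative difference, with the Show HN average as the baseline.
relative_gap = (avg_ask - avg_show) / avg_show
print("Ask HN posts receive {:.0%} more comments on average".format(relative_gap))
```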

# Second question: Do posts created at a certain time receive more comments on average?

*We'll determine whether posts created at a certain time attract more comments. Since Ask HN posts receive more comments on average, we'll focus the remaining analysis on these posts. We will use the following steps to perform this analysis:*

* *Calculate the number of ask posts created in each hour of the day, along with the number of comments received.*
* *Calculate the average number of comments ask posts receive by hour created.* 

*Number of ask posts created in each hour of the day and number of comments received:*

In [5]:
# We'll use the datetime module to work with the data in the created_at column:
import datetime as dt
# First, we create a list containing only the two columns of interest from the ask_posts list.
result_list = []
# created_at (date and time) is at index 6.
# num_comments is at index 4.
for row in ask_posts:
    date_time_column = row[6]
    comments = int(row[4])
    # Append both values together as one [created_at, comments] pair.
    result_list.append([date_time_column, comments])

# We will make a frequency table for the number of posts by hour.
counts_by_hour = {}
# We will count the number of comments by hour.
comments_by_hour = {}

# We extract the hour from the first column of result_list (which is a string)
# by creating a datetime object from the string with the strptime() method.
# The format of the date and time in the created_at column is:
# month/day/year hour:min
for row in result_list:
    date_time = row[0]
    comments = row[1]
    date_time_dt_object = dt.datetime.strptime(date_time, '%m/%d/%Y %H:%M')
    # Keep the hour as a zero-padded string such as '09'.
    hour = date_time_dt_object.strftime('%H')
    # Both dictionaries share the same keys, so one if...else updates both.
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        
    
print('Number of posts by hour:\n',counts_by_hour)
print('Number of comments by hour:\n',comments_by_hour)

    





Number of posts by hour:
 {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
Number of comments by hour:
 {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


*Next we will obtain the average number of comments ask posts receive by hour created.*

In [6]:
# We create a list of lists, the first column will be the hour and the second column will be the average number of comments per post posted at that specific hour.
avg_by_hour = []
for hour in counts_by_hour:
    posts_hour = counts_by_hour[hour]
    if hour in comments_by_hour:
        comments_hour = comments_by_hour[hour]
        avg = comments_hour / posts_hour
        avg_by_hour.append([hour, avg])

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
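The two steps above (counting and summing per hour, then dividing) can also be sketched in a single pass with `collections.defaultdict`. The rows in `sample_rows` are invented for illustration; only the date format matches the `created_at` column, and `stats_by_hour` is a name introduced here.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs, invented for illustration.
sample_rows = [
    ["8/4/2016 15:52", 10],
    ["1/26/2016 15:30", 20],
    ["6/23/2016 2:20", 6],
]

# Each hour maps to [number_of_posts, total_comments].
stats_by_hour = defaultdict(lambda: [0, 0])
for created_at, comments in sample_rows:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    stats_by_hour[hour][0] += 1
    stats_by_hour[hour][1] += comments

# Divide the total by the count to get the average for each hour.
avg_by_hour_sample = {
    hour: total / count for hour, (count, total) in stats_by_hour.items()
}
print(avg_by_hour_sample)
```

Keeping the count and the total in one structure means the two values cannot get out of step, at the cost of slightly less readable indexing.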


*To make the data easier to read, we will sort the list by the average number of comments.*
*We will also print the five highest averages.*

In [13]:
# First, we create a new list by swapping the columns of avg_by_hour. 
swap_avg_by_hour = []
for row in avg_by_hour:
    hour = row[0]
    avg_comments = row[1]
    swap_avg_by_hour.append([avg_comments,hour])
print(swap_avg_by_hour)

# Next, we sort this list and print the first 5 rows.
sorted_swap = sorted(swap_avg_by_hour,reverse=True)

print('Top 5 Hours for Ask Posts Comments: \n',sorted_swap[:5])

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
Top 5 Hours for Ask Posts Comments: 
 [[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]
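Swapping the columns is one way to sort by the average. Another option, sketched here, is to pass a `key` function to `sorted()` so the rows keep their original `[hour, average]` order. The list `avg_by_hour_sample` holds a few pairs copied from the `avg_by_hour` output and is introduced only for this illustration.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
]

# Sort by the second column (the average), highest first.
top = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)
print(top[:3])
```

`sorted()` returns a new list, so `avg_by_hour_sample` itself is left unchanged.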


In [23]:
# We print the first 5 rows in a format that is easier to understand:

for row in sorted_swap[:5]:
    print(type(row[1]))
    object_datetime = dt.datetime.strptime(row[1],'%H')
    hour = object_datetime.strftime('%H:%M')
    average = row[0]
    print("{hour} {average:.2f} average comments per post".format(hour=hour, average=average))
# {hour} and {average:.2f} are format fields that are replaced by the values of hour and average, respectively.
# The :.2f specifier formats average as a decimal number with two digits after the decimal point.

# An alternative that skips the datetime class and formats row[1] and row[0] directly:
# print("{hour}:00 {average:.2f} average comments per post".format(hour=row[1], average=row[0]))



<class 'str'>
15:00 38.59 average comments per post
<class 'str'>
02:00 23.81 average comments per post
<class 'str'>
20:00 21.52 average comments per post
<class 'str'>
16:00 16.80 average comments per post
<class 'str'>
21:00 16.01 average comments per post


In [37]:
# Now we convert the hours to Madrid time. For most of the year the difference is 6 hours:
for row in sorted_swap[:5]:
    # Convert the hour string into a datetime object:
    time_usa = dt.datetime.strptime(row[1],'%H')
    # Add 6 hours with timedelta to get the Madrid time.
    madrid_time = time_usa + dt.timedelta(hours=6)
    madrid_hour = madrid_time.strftime('%H:%M')
    average = row[0]
    print("{hour} {average:.2f} average comments per post, Madrid time".format(hour=madrid_hour, average=average))
    
    

21:00 38.59 average comments per post, Madrid time
08:00 23.81 average comments per post, Madrid time
02:00 21.52 average comments per post, Madrid time
22:00 16.80 average comments per post, Madrid time
03:00 16.01 average comments per post, Madrid time
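Because only the hour matters here, the same fixed six-hour shift can be sketched with modular arithmetic, which makes the wrap past midnight explicit (20:00 becomes 02:00). The six-hour offset is the assumption stated in the cell above and does not cover the weeks when daylight saving changes are out of step; `to_madrid_hour` is a hypothetical helper.

```python
OFFSET_HOURS = 6  # assumed fixed offset, as in the cell above

def to_madrid_hour(hour_str, offset=OFFSET_HOURS):
    """Shift a zero-padded hour string such as '20' and wrap around midnight."""
    shifted = (int(hour_str) + offset) % 24
    return "{:02d}:00".format(shifted)

# The five top hours from the sorted list above.
converted = [to_madrid_hour(h) for h in ["15", "02", "20", "16", "21"]]
print(converted)
```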


**Answer to: Do posts created at a certain time receive more comments on average?**

According to the analysis, Ask HN posts created around 15:00 US Eastern Time (21:00 in Madrid) receive the most comments on average, approximately 39 per post. The second best hour is 2:00 (8:00 in Madrid), with an average of about 24 comments per post. Posts created around 20:00 (2:00 in Madrid) receive about 21.5 comments on average. Posts created around 16:00 and 21:00 (22:00 and 3:00 in Madrid, respectively) take the fourth and fifth positions, with around 17 and 16 comments per post, respectively.
