# Hacker News Posts Project
## Introduction
Hacker News is a website where users submit posts, mainly about computer science and entrepreneurship, that other users can vote and comment on, much like Reddit.

Since this project is meant as practice with relatively simple data sets, I will use a reduced version of the [Hacker News Posts data set](https://www.kaggle.com/hacker-news/hacker-news-posts). Instead of 300,000 rows, it has about 20,000, obtained by removing the submissions that did not receive any comments and then randomly sampling from the remaining ones.

The goal of the project is to compare two types of posts:
- **Ask HN** : users submit Ask HN posts to ask the Hacker News community a specific question
- **Show HN** : users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting

I'll compare these two types of posts to determine:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## Data preparation
I start by reading the data set into a list of lists:

In [1]:
from csv import reader

# Open the file in a with-block so it is closed once the rows are read
with open('hacker_news.csv', encoding='utf8', newline='') as opened_file:
    hn = list(reader(opened_file))
print(hn[:5]) # Display the first 5 rows, including the headers

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


I separate the column headers from the list containing the data:

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
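
The later cells refer to columns by position (`row[1]`, `post[4]`, `post[6]`). As a small sketch, those positions can be looked up from the header row instead of being hard-coded. The names `TITLE`, `NUM_COMMENTS`, and `CREATED_AT` are illustrative and not part of the original notebook.

```python
# Header row and one data row copied from the output above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
row = ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/',
       '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Hypothetical constants: look up each column position by name once.
TITLE = headers.index('title')
NUM_COMMENTS = headers.index('num_comments')
CREATED_AT = headers.index('created_at')

print(TITLE, NUM_COMMENTS, CREATED_AT)  # 1 4 6
print(row[TITLE], int(row[NUM_COMMENTS]), row[CREATED_AT])
```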


Now I'm ready to filter the data. Since I want to focus the analysis on posts whose titles begin with *Ask HN* or *Show HN*, I'll split the rows into three lists (ask posts, show posts, and all other posts):

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    row_title = row[1].lower() # Lowercase once so the match ignores case
    if row_title.startswith('ask hn'):
        ask_posts.append(row)
    elif row_title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
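
The check above lowercases each title before calling `str.startswith`, so `Ask HN`, `ask hn`, and `ASK HN` all match. Here is a minimal sketch of the same test written as a function. The name `classify` and the sample titles are assumptions made for illustration.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Illustrative titles; the first two are made up, the third appears in the data above.
samples = ['Ask HN: How do you learn?', 'SHOW HN: My side project', 'Interactive Dynamic Video']
for title in samples:
    print(classify(title), '-', title)
```

Note that a title that only mentions "ask hn" somewhere in the middle is classified as `other`, because only the prefix is tested.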


## Find the post type with more comments on average

Next, I compute the average number of comments for each of the two post types to see which one receives more:

In [4]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [5]:
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


The results above show that *Ask HN* posts receive about 14 comments on average, roughly 4 more than *Show HN* posts.

## Determine if posts created at a certain time receive more comments on average
Since *Ask HN* posts receive more comments on average, I'll focus the rest of the analysis on them. The steps are:
1. Count the ask posts created in each hour of the day, along with the total number of comments they received
2. Calculate the average number of comments per ask post for each hour of creation

In [6]:
import datetime as dt
result_list = []
for post in ask_posts:
    result_list.append([post[6], int(post[4])]) # Append a new list containing the 'created_at' and the 'num_comments' columns
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date = row[0]
    comments = row[1]
    datetime_object = dt.datetime.strptime(date, date_format)
    hour = datetime_object.strftime("%H") # Return the hour as a string
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
        
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
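
To see what the format string does, here is a sketch that parses a single `created_at` value taken from one of the rows shown earlier. `%m/%d/%Y %H:%M` reads the month, day, four-digit year, and a 24-hour time, and `strptime` accepts values without leading zeros, such as `6/17/2016 0:01`.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
created_at = '6/17/2016 0:01'  # value copied from a row printed earlier

parsed = dt.datetime.strptime(created_at, date_format)
print(parsed)                 # 2016-06-17 00:01:00
print(parsed.strftime("%H"))  # 00 (zero-padded hour, as a string)
print(parsed.hour)            # 0 (the same hour, as an int)
```

The notebook uses the zero-padded string form as the dictionary key, which is why the keys look like `'00'`, `'01'`, and so on.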

Now I'll use the two dictionaries I just created, *counts_by_hour* and *comments_by_hour*, to calculate the average number of comments for posts created during each hour of the day:

In [7]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    
avg_by_hour

[['05', 10.08695652173913],
 ['08', 10.25],
 ['23', 7.985294117647059],
 ['18', 13.20183486238532],
 ['02', 23.810344827586206],
 ['07', 7.852941176470588],
 ['16', 16.796296296296298],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['15', 38.5948275862069],
 ['14', 13.233644859813085],
 ['01', 11.383333333333333],
 ['17', 11.46],
 ['11', 11.051724137931034],
 ['06', 9.022727272727273],
 ['13', 14.741176470588234],
 ['12', 9.41095890410959],
 ['10', 13.440677966101696],
 ['03', 7.796296296296297],
 ['22', 6.746478873239437],
 ['21', 16.009174311926607],
 ['19', 10.8],
 ['20', 21.525],
 ['09', 5.5777777777777775]]

To make this easier to read, I swap the two columns so that the average comes first, sort the list of lists in descending order, and print the five highest values:

In [8]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour

[[10.08695652173913, '05'],
 [10.25, '08'],
 [7.985294117647059, '23'],
 [13.20183486238532, '18'],
 [23.810344827586206, '02'],
 [7.852941176470588, '07'],
 [16.796296296296298, '16'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [38.5948275862069, '15'],
 [13.233644859813085, '14'],
 [11.383333333333333, '01'],
 [11.46, '17'],
 [11.051724137931034, '11'],
 [9.022727272727273, '06'],
 [14.741176470588234, '13'],
 [9.41095890410959, '12'],
 [13.440677966101696, '10'],
 [7.796296296296297, '03'],
 [6.746478873239437, '22'],
 [16.009174311926607, '21'],
 [10.8, '19'],
 [21.525, '20'],
 [5.5777777777777775, '09']]

In [9]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments')
for avg_hour in sorted_swap[:5]:
    avg = avg_hour[0]
    # Turn the hour string (e.g. '15') into a clock time (e.g. '15:00')
    formatted_time = dt.datetime.strptime(avg_hour[1], '%H').strftime('%H:%M')
    print('{}: {:.2f} average comments per post'.format(formatted_time, avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
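
The swap in cell 8 exists only so that `sorted` compares the averages first. As a sketch of an alternative, the original `[hour, average]` pairs can be sorted directly with a `key` function. The four pairs below are taken from the `avg_by_hour` output and rounded for brevity.

```python
# A few [hour, average] pairs from avg_by_hour, rounded to two decimals.
avg_by_hour = [['02', 23.81], ['15', 38.59], ['09', 5.58], ['20', 21.52]]

# Sort by the average (second element), highest first, without swapping columns.
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

This keeps each pair in its original order, so no second list is needed.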


## Conclusions
From the analysis above, I can conclude that *Ask HN* posts receive more comments on average than *Show HN* posts.

Moreover, looking at the *Ask HN* posts only, I found that posts created in the hour starting at 15:00 (3 pm) receive the most comments, with an average of 38.59 per post.
Note that the data set uses US Eastern Time, so if you live in Europe like me you may find this [time converter](https://www.thetimezoneconverter.com/) useful for finding your local time.

## Future improvements to this notebook
In the future, I will extend this notebook by analyzing other aspects of the data set. Some ideas:
* Determine which kind of post receives more points
* Determine if posts created at a certain time are more likely to receive more points
* Compare the results for comments with the results for points