# Hacker News

Hacker News is a site similar to Reddit, where user-submitted stories are voted and commented on.

In this project, we'll explore a Hacker News dataset from 2016, and aim to answer the following questions:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

Note:

* **Ask HN** refers to users asking questions of the Hacker News community. 
* **Show HN** refers to users showing the Hacker News community something.

Let's get started.

---

## Examining The Data

In [22]:
from csv import reader

import datetime as dt

In [23]:
with open("hacker_news.csv") as file:
    hn = list(reader(file))

In [24]:
header = hn[0]
hn = hn[1:]

In [25]:
print(header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [4]:
for row in hn[:5]:
    print()
    print(row)


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Next, we want to split the rows into separate lists. If we look closely at the data, we'll see that some titles start with **Ask HN**, while others start with **Show HN**. These are the posts we want to analyze, so we'll separate them from the rest. The `title` column is at index 1.

## Filtering the Data

In [8]:
ask_posts = list()
show_posts = list()
other_posts = list()

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [9]:
print(f"Ask Posts - {len(ask_posts)}")
print(f"Show Posts - {len(show_posts)}")
print(f"Other Posts - {len(other_posts)}")

Ask Posts - 1744
Show Posts - 1162
Other Posts - 17194


Now that we've split the data, let's calculate the average number of comments for show posts, as well as for ask posts.

## Ask and Show Comments Average

In [13]:
total_ask_comments = 0
for row in ask_posts:
    comments = float(row[4])
    total_ask_comments += comments

avg_ask_comments = round(total_ask_comments / len(ask_posts), 2)

total_show_comments = 0
for row in show_posts:
    comments = float(row[4])
    total_show_comments += comments
    
avg_show_comments = round(total_show_comments / len(show_posts), 2)

In [26]:
print(f"Average number of ask comments is {avg_ask_comments}")
print(f"Average number of show comments is {avg_show_comments}")

Average number of ask comments is 14.04
Average number of show comments is 10.32


We can see from this that ask posts receive more comments on average than show posts (14.04 versus 10.32).
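
The two averaging loops above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch, not part of the original analysis: `average_comments` and `sample_rows` are hypothetical names, and the sample rows are made up to match the dataset's column layout (`num_comments` at index 4).

```python
def average_comments(rows, comments_index=4):
    """Return the mean comment count across rows, rounded to 2 decimal places."""
    if not rows:
        # Avoid dividing by zero when a category has no posts.
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return round(total / len(rows), 2)


# Made-up rows for illustration only; they are not taken from hacker_news.csv.
sample_rows = [
    ["1", "Ask HN: Example question?", "", "5", "10", "alice", "8/4/2016 11:52"],
    ["2", "Ask HN: Another question?", "", "3", "4", "bob", "8/4/2016 15:10"],
]

print(average_comments(sample_rows))  # 7.0
```

With the real data, the same helper would presumably be called once with `ask_posts` and once with `show_posts`.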

## Ask Posts and Comments by Hour

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.

In [32]:
results_list = list()

for row in ask_posts:
    comments = row[4]
    date = row[6]
    results_list.append([date, comments])

In [44]:
counts_by_hour = dict()
comments_by_hour = dict()

for row in results_list:
    comments = float(row[1])
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    if hour not in counts_by_hour:
        comments_by_hour[hour] = comments
        counts_by_hour[hour] = 1
    else:
        comments_by_hour[hour] += comments
        counts_by_hour[hour] += 1
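
To make the parsing and tallying step concrete, here is a small sketch of the same idea using `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. It is an alternative formulation, not the code used in this notebook; `sample_results`, `counts`, and `comments` are hypothetical names, and the sample pairs are made up in the dataset's `created_at` format.

```python
from collections import defaultdict
import datetime as dt

# Made-up [created_at, num_comments] pairs for illustration only.
sample_results = [
    ["8/4/2016 11:52", "10"],
    ["6/17/2016 0:01", "4"],
    ["9/30/2015 11:05", "6"],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour ("00" to "23").
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += int(num_comments)

print(dict(counts))    # {'11': 2, '00': 1}
print(dict(comments))  # {'11': 16, '00': 4}
```

Note that `strptime` accepts the unpadded month, day, and hour used in the dataset, while `strftime("%H")` always returns a two-digit hour, which is why the keys sort cleanly as strings.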

In [49]:
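def print_table(dictionary):
    # Build (value, key) tuples so that sorting orders by value first.
    li = list()
    for key, value in dictionary.items():
        tup = (value, key)
        li.append(tup)
    # Sort from largest to smallest value.
    sorted_li = sorted(li, reverse=True)
    for items in sorted_li:
        # Print as "hour : value".
        print(f"{items[1]:.5s} : {items[0]}")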

In [51]:
print_table(counts_by_hour)

15 : 116
19 : 110
21 : 109
18 : 109
16 : 108
14 : 107
17 : 100
13 : 85
20 : 80
12 : 73
22 : 71
23 : 68
01 : 60
10 : 59
11 : 58
02 : 58
00 : 55
03 : 54
08 : 48
04 : 47
05 : 46
09 : 45
06 : 44
07 : 34


In [50]:
print_table(comments_by_hour)

15 : 4477.0
16 : 1814.0
21 : 1745.0
20 : 1722.0
18 : 1439.0
14 : 1416.0
02 : 1381.0
13 : 1253.0
19 : 1188.0
17 : 1146.0
10 : 793.0
12 : 687.0
01 : 683.0
11 : 641.0
23 : 543.0
08 : 492.0
22 : 479.0
05 : 464.0
00 : 447.0
03 : 421.0
06 : 397.0
04 : 337.0
07 : 267.0
09 : 251.0


Let's now work out the average number of comments per ask post for each hour.

In [62]:
ask_comments_hour = list()

for key, value in comments_by_hour.items():
    ask_comments_hour.append([key, value / counts_by_hour[key]])

In [77]:
sorted_list = list()

for row in ask_comments_hour:
    sorted_list.append([row[1], row[0]])

sorted_list = sorted(sorted_list, reverse = True)
for row in sorted_list:
    print(f"{row[1]:2}:00 : {round(row[0], 1)}")

15:00 : 38.6
02:00 : 23.8
20:00 : 21.5
16:00 : 16.8
21:00 : 16.0
13:00 : 14.7
10:00 : 13.4
14:00 : 13.2
18:00 : 13.2
17:00 : 11.5
01:00 : 11.4
11:00 : 11.1
19:00 : 10.8
08:00 : 10.2
05:00 : 10.1
12:00 : 9.4
06:00 : 9.0
00:00 : 8.1
23:00 : 8.0
07:00 : 7.9
03:00 : 7.8
04:00 : 7.2
22:00 : 6.7
09:00 : 5.6
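
The averaging and ranking steps above can also be sketched without swapping the columns, by passing a `key` function to `sorted`. This is an illustrative alternative under made-up inputs: `sample_comments`, `sample_counts`, `averages`, and `ranked` are hypothetical names, and the numbers are not from the dataset.

```python
# Made-up totals keyed by zero-padded hour, for illustration only.
sample_comments = {"15": 90.0, "02": 40.0, "09": 10.0}
sample_counts = {"15": 3, "02": 2, "09": 2}

# Average comments per post for each hour.
averages = {
    hour: sample_comments[hour] / sample_counts[hour]
    for hour in sample_comments
}

# Rank hours from highest to lowest average.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)

for hour, avg in ranked:
    print(f"{hour}:00 : {avg:.1f}")
```

One difference from the swap-and-sort approach: when two hours have the same average, sorting by `key` keeps them in their original dictionary order, whereas sorting `[average, hour]` pairs breaks the tie on the hour string.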


We can see from the above that the best time to post an ask question on Hacker News (if you want to receive comments) is 3pm, followed by 2am and 8pm (Eastern Time in the US). The worst time is 9am.

To be continued...