# Introduction

In this notebook we'll work with a dataset of submissions to a popular website, [Hacker News](https://news.ycombinator.com/).

Hacker News is a website similar to Reddit, where users submit stories and receive votes and comments. It was created by the startup incubator [Y Combinator](https://www.ycombinator.com/) and is popular in technology and startup circles. Posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

# About the Dataset

The dataset has been reduced from almost 300,000 rows to approximately 20,000 by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.

You can find the dataset [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). The columns are described below.

 - id : unique identifier.
 - title : title of the post.
 - url : the URL the post links to, if the post has one.
 - num_points : number of upvotes.
 - num_comments : number of comments.
 - author : the name of the account that created the post.
 - created_at : date and time the post was created.
 
We are interested in posts with titles that begin with either `Ask HN` or `Show HN`. We will compare these two types of posts to determine the following:
 - Do `Ask HN` or `Show HN` posts receive more comments on average?
 - Do posts created at a certain time receive more comments on average?

In [1]:
# importing packages
from csv import reader
import datetime as dt

# loading the dataset (the with block closes the file once it has been read)
with open("hacker_news.csv", newline="") as f:
    hn = list(reader(f))

In [2]:
print("Example of Ask HN: ")
for x in hn[:100]:
    if "Ask HN" in x[1]:
        print(x[1])

Example of Ask HN: 
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
Ask HN: Looking for Employee #3 How do I do it?
Ask HN: Someone offered to buy my browser extension from me. What now?
Ask HN: Limiting CPU, memory, and I/O usage on a program for testing
Ask HN: Which framework for a CRUD app in 2016?
Ask HN: Enter market with a well-funded competitor?
Ask HN: Do you use any realtime PaaS/framework and in case you so which one?


In [3]:
print("Example of Show HN: ")
for x in hn[:100]:
    if "Show HN" in x[1]:
        print(x[1])

Example of Show HN: 
Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
Show HN: Webscope  Easy way for web developers to communicate with Clients
Show HN: GeoScreenshot  Easily test Geo-IP based web pages


In [4]:
print("First four rows of the dataset: ")
print(hn[1:5])

print("=" * 50)
print("\n")
print("Header/columns: ")
print(hn[0])

First four rows of the dataset: 
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Header/columns: 
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


## Filter Posts

In [5]:
header = hn[0]
hn = hn[1:] # skipping/removing the header

Since we are only interested in `Ask HN` and `Show HN` posts, we need to separate them from the rest. To start, let's create three empty lists:
 - `ask_posts` : will contain all `Ask HN` posts.
 - `show_posts` : will contain all `Show HN` posts.
 - `other_posts` : will contain all other posts.

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Number of rows for ask_posts: ", len(ask_posts))
print("Number of rows for show_posts: ", len(show_posts))
print("Number of rows for other_posts: ", len(other_posts))

Number of rows for ask_posts:  1744
Number of rows for show_posts:  1162
Number of rows for other_posts:  17194
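The filter above matches a case-insensitive prefix, which is stricter than the substring check (`"Ask HN" in title`) used in the preview cells. A minimal sketch of the difference follows; `classify` is a hypothetical helper, and the last title is an invented example, not a row from the dataset.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


titles = [
    "Ask HN: Which framework for a CRUD app in 2016?",  # from the preview output
    "Show HN: Something pointless I made",              # from the preview output
    "Why I stopped reading Ask HN threads",             # invented title for illustration
]

for title in titles:
    # Second column: what the looser substring check would report.
    print(classify(title), "|", "Ask HN" in title)
```

The invented third title contains `Ask HN` but does not start with it, so the prefix check places it in the "other" group while the substring check would still match it.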


## Calculate the Average Number of Comments for Each Post Type

In [7]:
def find_total(data, index):
    total = 0

    for rows in data:
        total += int(rows[index])
    return total

total_ask_comments = find_total(ask_posts, 4)
total_show_comments = find_total(show_posts, 4)

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print("Average comments per Ask HN post: ", avg_ask_comments)
print("Average comments per Show HN post: ", avg_show_comments)

Average comments per Ask HN post:  14.038417431192661
Average comments per Show HN post:  10.31669535283993


On average, `Ask HN` posts receive more comments than `Show HN` posts: about 14 comments per post compared with about 10.
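To put the gap in relative terms, the two averages printed above can be compared directly. This is only arithmetic on the reported values; the variable names are chosen for illustration.

```python
# Averages copied from the output of the previous cell.
avg_ask = 14.038417431192661
avg_show = 10.31669535283993

ratio = avg_ask / avg_show
print(f"Ask HN posts receive about {ratio - 1:.0%} more comments per post.")
```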

Since `Ask HN` posts receive more comments on average, let's determine whether `Ask HN` posts created at a certain time of day attract more comments. To do that, we'll create a list of lists containing the `created_at` value and the number of comments for each post. Then we can count the ask posts created during each hour of the day and calculate the average number of comments per post for each hour.

In [8]:
result_list = []

for rows in ask_posts:
    created_at = rows[6] # date and time of creation
    num_comments = int(rows[4]) # number of comments in the ask posts
    result_list.append([created_at, num_comments])
print("Example of result_list:\n", result_list[:5])
print("=" * 50)

Example of result_list:
 [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


In [9]:
counts_by_hour = {}
comments_by_hour = {}

for data in result_list:
    dt_string = data[0] # format: m/d/yyyy H:MM, e.g. 8/16/2016 9:55
    date_time_object = dt.datetime.strptime(dt_string, "%m/%d/%Y %H:%M") # converting to date time object
    time = date_time_object.strftime("%H") # extract the hour as a zero-padded string
    
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = data[1]
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += data[1]
print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
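The parsing step can be checked in isolation. The sketch below uses the first timestamp from `result_list`; it also shows that the `hour` attribute gives the same information as an integer, which is one possible alternative to the `strftime("%H")` string used above.

```python
import datetime as dt

sample = "8/16/2016 9:55"  # first timestamp in result_list
parsed = dt.datetime.strptime(sample, "%m/%d/%Y %H:%M")

hour_as_string = parsed.strftime("%H")  # zero-padded string, as used in the cell above
hour_as_int = parsed.hour               # plain integer

print(hour_as_string, hour_as_int)
```

Note that `strptime` accepts the unpadded month and hour in this timestamp even though the format codes are `%m` and `%H`.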


The two dictionaries now contain the following:
- `counts_by_hour` : the number of ask posts created during each hour of the day.
- `comments_by_hour` : the total number of comments on ask posts created during each hour.

Next, let's use these two dictionaries to calculate the average number of comments per post for each hour of the day.
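The average for an hour is the total number of comments divided by the number of posts for that hour. As a rough sketch, the same bookkeeping can be written with `collections.defaultdict`, which removes the need for an `if`/`else` branch. The first five rows below come from the `result_list` preview; the sixth is a made-up row, added so that one hour has more than one post.

```python
from collections import defaultdict
import datetime as dt

rows = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["8/2/2016 14:20", 3],
    ["10/15/2015 16:38", 17],
    ["8/17/2016 9:10", 4],  # hypothetical row, not from the dataset
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in rows:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

# Average comments per post for each hour.
averages = {hour: comments[hour] / counts[hour] for hour in comments}
print(averages["09"])  # (6 + 4) / 2
```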

In [10]:
avg_by_hour = []

# calculate the average number of comments per post for posts created during each hour of the day.
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print("Average by hour:\n", avg_by_hour[:10])

Average by hour:
 [['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607]]


For readability, and to find the hours with the highest averages, let's sort the `avg_by_hour` list in descending order.
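As an alternative to swapping the elements before sorting, `sorted` accepts a `key` function. Here is a minimal sketch using the ten `[hour, average]` pairs printed above, so the ranking covers only those ten hours and not the full list; `avg_sample` and `top_three` are names chosen for illustration.

```python
avg_sample = [
    ["09", 5.5777777777777775], ["13", 14.741176470588234],
    ["10", 13.440677966101696], ["14", 13.233644859813085],
    ["16", 16.796296296296298], ["23", 7.985294117647059],
    ["12", 9.41095890410959], ["17", 11.46],
    ["15", 38.5948275862069], ["21", 16.009174311926607],
]

# Sort by the average (second element of each pair), highest first.
top_three = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)[:3]

for hour, average in top_three:
    print(f"{hour}:00: {average:.2f} average comments per post.")
```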

In [11]:
swap_avg_by_hour = [] # will hold the pairs from avg_by_hour with their two elements swapped
# previously: [[hour, average by hour]] --> swap elements to: [[average by hour, hour]]

for x in avg_by_hour:
    swap_avg_by_hour.append([x[1], x[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)[:5]
print("Top 5 Hours for Ask Posts Comments:\n")
print("="*50)
for x in sorted_swap:
    time = dt.datetime.strptime(x[1], "%H").strftime("%H:%M")
    template = "{hour}: {num:.2f} average comments per post."
    print(template.format(hour=time, num=x[0]))

Top 5 Hours for Ask Posts Comments:

15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


## Conclusion

The last cell shows that `Ask HN` posts created during the 15:00 hour have the highest average, 38.59 comments per post, so in this sample a post created during that hour tends to receive more comments than one created at any other hour. Since we're using 24-hour notation, 15:00 is 3 PM in 12-hour notation.

A closer look at the top five hours also shows a pattern:
 - in the afternoon, 15:00 and 16:00 (3 PM and 4 PM) average 38.59 and 16.80 comments per post
 - in the evening, 20:00 and 21:00 (8 PM and 9 PM) average 21.52 and 16.01 comments per post
 - in the early morning, 02:00 (2 AM) averages 23.81 comments per post
 
In this sample, `Ask HN` posts created during these hours received the most comments on average.
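
The 24-hour to 12-hour conversions used in this conclusion can be sketched with a small helper. `to_12_hour` is a hypothetical function written for illustration and is not part of the analysis above.

```python
def to_12_hour(hour):
    """Convert an hour in 24-hour notation (0-23) to a 12-hour label."""
    suffix = "AM" if hour < 12 else "PM"
    # Hours 0 and 12 are both shown as 12 on a 12-hour clock.
    return f"{hour % 12 or 12} {suffix}"


# The top five hours from the analysis.
for hour in [15, 2, 20, 16, 21]:
    print(f"{hour:02d}:00 -> {to_12_hour(hour)}")
```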