# Exploring Hacker News Posts

In this project, we will compare two different types of posts from [Hacker News](https://news.ycombinator.com/), a popular website where technology-related 'posts' are voted on and commented on. We will be exploring two types of posts, whose titles begin with either `Ask HN` or `Show HN`.

Users submit `Ask HN` posts to ask the Hacker News community a specific question, such as:

`Ask HN: How to improve my personal website?
 Ask HN: Am I the only one outraged by Twitter Shutting down share counts?
 Ask HN: Were early stage products always so buggy?
 Ask HN: Any recent changes to CSS that broke mobile?`

Similarly, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting. Below are a couple of examples:

`Show HN: Open Codex – OpenAI Codex CLI with open-source LLMs
 Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
 Show HN: Something pointless I made
 Show HN: shanhu.io, a programming playground powered by e8vm`

We will specifically compare these two types of posts to determine the following:

* Do `Ask HN` or `Show HN` receive more comments on average?
* Do posts created at a certain time receive more comments on average?

It is important to note that the [dataset](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) we are working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

## Introduction

First, we will read in the data, and remove the headers.

In [1]:
#Read in the data 

import csv

with open('hacker_news.csv', encoding='utf8') as f:
    hn = list(csv.reader(f))
    
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

## Removing Headers from a List of Lists

In [2]:
#Removing the headers

headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
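As a side note, the header row can also be separated while reading the file, by taking the first row from the reader before collecting the rest. The sketch below is only an illustration: `sample_csv` is a hypothetical in-memory stand-in for `hacker_news.csv`, containing two rows copied from the output above, and `sample_headers` and `sample_rows` are names introduced here.

```python
import csv
import io

# Hypothetical in-memory stand-in for hacker_news.csv (two rows copied from the output above)
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

reader = csv.reader(sample_csv)
sample_headers = next(reader)  # the first row holds the column names
sample_rows = list(reader)     # the remaining rows are data

print(sample_headers)
print(len(sample_rows))
```

Either approach leaves the column names in one list and the data rows in another; slicing, as done above, is simply easier to follow once the whole file is already in memory.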


## Extracting Ask HN and Show HN posts

Now that we have removed the headers from `hn`, we are ready to filter our data. Since we are only interested in posts whose titles begin with `Ask HN` or `Show HN`, we will create new lists of lists containing just the data for those titles.

To find the posts that begin with either `Ask HN` or `Show HN`, we will use the string method `startswith`. Given a string object, say `string1`, we can check whether it starts with, say, `dq` by inspecting the value returned by `string1.startswith('dq')`. If `string1` starts with `dq`, the method returns `True`; otherwise it returns `False`.

In [3]:
print('dataquest'.startswith('Data'))
print('dataquest'.startswith('data'))

False
True


In the example above, the first print call prints `False` because `dataquest` does not start with `Data`. The second print call prints `True` because `dataquest` does start with `data`. Capitalisation matters.

If we wish to control for case, we can use the `lower` method, which returns a lowercase version of the original string.

In [4]:
print('DataQuest'.lower())

dataquest
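The two methods can be combined, and `startswith` also accepts a tuple of prefixes, in which case it returns `True` if any of them matches. The sketch below uses a short, hypothetical list of titles (the upper-case `SHOW HN` variant is made up to illustrate the case handling):

```python
# Hypothetical sample titles; the second one is deliberately in a different case
titles = [
    "Ask HN: How to improve my personal website?",
    "SHOW HN: Something pointless I made",
    "Interactive Dynamic Video",
]

# str.startswith accepts a tuple of prefixes and returns True if any of them matches
prefixes = ("ask hn", "show hn")
matches = [title.lower().startswith(prefixes) for title in titles]

print(matches)  # [True, True, False]
```

The tuple form is handy when we only need to know whether a title belongs to either category; to keep the two categories apart, each prefix still has to be tested separately.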


In [5]:
#Identify posts that start with either 'Ask HN' or 'Show HN' and separate the data into different lists

ask_posts = [] #creating empty lists
show_posts = []
other_posts = []

for row in hn:  #loop through each row in hn
    title = row[1]  #assign the title in each row to a variable named title
    if title.lower().startswith("ask hn"):  #if the lowercase version of title begins with ask hn
        ask_posts.append(row)  #append the row to ask_posts
    elif title.lower().startswith("show hn"):  #if the lowercase version of title begins with show hn
        show_posts.append(row)  #append the row to show_posts
    else:
        other_posts.append(row)  #otherwise append the row to other_posts

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
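Because every row goes through exactly one branch of the `if`/`elif`/`else` statement, the three lists together should account for every row in `hn`. As a quick sanity check, the sketch below adds up the counts printed above (copied in by hand, so it runs without the dataset) and shows each category's share of the total:

```python
# Counts copied from the output above
n_ask, n_show, n_other = 1744, 1162, 17194

total_rows = n_ask + n_show + n_other
print(total_rows)  # 20100

# Share of each category, as a percentage of all rows
for name, n in [("ask", n_ask), ("show", n_show), ("other", n_other)]:
    print(f"{name}: {n / total_rows:.1%}")
```

The total of 20,100 rows is consistent with the "approximately 20,000 rows" mentioned in the introduction.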


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Below are the first five rows in the `ask_posts` list of lists:

In [6]:
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


Below are the first five rows in the `show_posts` list of lists:

In [7]:
print(show_posts[:5])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


Now, let us determine if `ask_posts` or `show_posts` receive more comments on average.

In [8]:
#find the total number of comments in ask_posts
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

#calculate the average number of comments in ask_posts    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [9]:
#find the total number of comments in show posts
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])

#compute the average number of comments on show_posts    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


On average, ask posts in our sample receive approximately 14 comments, whereas show posts receive approximately 10. Since ask posts receive more comments on average, we will focus the remaining analysis on these posts.
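The two cells above repeat the same logic, so it could be wrapped in a small helper. The function below is a hypothetical sketch, not part of the original analysis: `average_comments` and its `comments_index` parameter are names introduced here, and the two toy rows are copied from the `ask_posts` output above.

```python
def average_comments(rows, comments_index=4):
    """Return the mean of the integer values stored in column `comments_index`."""
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Two rows copied from the ask_posts output above
toy_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]

print(average_comments(toy_rows))  # (6 + 29) / 2 = 17.5
```

With the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.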

## Finding the Number of Ask Posts and Comments by Hour Created

Next, we will determine whether ask posts created at a certain time of day attract more comments. We will follow these steps:

1. Calculate the number of ask posts created in each hour of the day, along with the total number of comments those posts received.
2. Calculate the average number of comments ask posts receive by hour created.

In [10]:
import datetime as dt

result_list = []  #create an empty list

for row in ask_posts:  #iterate over ask_posts
    result_list.append(  #append two columns:
        [row[6], int(row[4])]  #the 'created_at' column and the number of comments as an integer
    )
    
counts_by_hour = {}  #create an empty dictionary
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"  #format of the 'created_at' strings

for each_row in result_list:  #loop through each row of result_list
    date = each_row[0]  #extract the date string
    comment = each_row[1]  #extract the number of comments
    #parse the date string into a datetime object, then use strftime() to keep just the hour as a string
    time = dt.datetime.strptime(date, date_format).strftime("%H")

    if time in counts_by_hour:  #if the hour is already a key in counts_by_hour:
        comments_by_hour[time] += comment  #add this post's comments to the hour's total
        counts_by_hour[time] += 1  #increment the hour's post count by 1
    else:  #if the hour isn't a key in counts_by_hour yet:
        comments_by_hour[time] = comment  #start the hour's total with this post's comments
        counts_by_hour[time] = 1  #start the hour's post count at 1
        
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
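The keys of the dictionary are two-character strings such as `'09'` because the hour is produced by `strftime("%H")`, which zero-pads its result. The sketch below walks through the parsing step for a single `created_at` value, taken from the first ask post shown earlier; `hour_string` and `hour_number` are names introduced here for illustration.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
created_at = "8/16/2016 9:55"  # taken from the first ask post shown earlier

parsed = dt.datetime.strptime(created_at, date_format)
hour_string = parsed.strftime("%H")  # zero-padded string
hour_number = parsed.hour            # plain integer

print(hour_string)  # 09
print(hour_number)  # 9
```

Using the zero-padded string keeps the keys easy to format later; using the integer `hour` attribute would be an equally reasonable choice if the hours needed to be sorted numerically.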

## Calculating the Average Number of Comments for Ask HN Posts by Hour

In [11]:
sample_dict = {
    'apple': 2,
    'banana': 4,
    'orange': 6
}

Suppose we have the dictionary `sample_dict` defined above and want to multiply each of its values by ten, returning the results as a list of lists. We can use the following code:

In [12]:
fruits = []  #initialise an empty list and assign it to fruits

for fruit in sample_dict:  #iterate over the keys of sample_dict
    #append a list whose first element is the key and whose second element is the corresponding value multiplied by ten
    fruits.append([fruit, 10*sample_dict[fruit]])

Below are the results:

In [13]:
print(fruits)

[['apple', 20], ['banana', 40], ['orange', 60]]
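The same result can be obtained more compactly with a list comprehension over `dict.items()`, which yields each key together with its value. This is an alternative sketch rather than the approach used in the rest of the project; `fruits_alt` is a name introduced here.

```python
sample_dict = {
    'apple': 2,
    'banana': 4,
    'orange': 6
}

# items() yields (key, value) pairs, so no second dictionary lookup is needed
fruits_alt = [[fruit, 10 * value] for fruit, value in sample_dict.items()]

print(fruits_alt)  # [['apple', 20], ['banana', 40], ['orange', 60]]
```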


In [14]:
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

## Sorting and Printing Values from a List of Lists

In [15]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
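Swapping the columns works because `sorted` compares lists element by element, starting with the first. An alternative that avoids the swap is to pass a `key` function telling `sorted` which element to compare. The sketch below applies it to a few `[hour, average]` pairs copied from `avg_by_hour`; `sample_avg_by_hour` and `sorted_by_avg` are names introduced here.

```python
# A few [hour, average] pairs copied from avg_by_hour above
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), largest first
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

print([hr for hr, avg in sorted_by_avg])  # ['15', '02', '20', '09']
```

One difference worth keeping in mind: if two hours had exactly the same average, the swapped version would break the tie by comparing the hour strings, whereas the `key` version would keep the tied rows in their original order.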

In [16]:
print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[:5]:
    print(
    f"{dt.datetime.strptime(hr, '%H').strftime('%H:%M')}: {avg:.2f} average comments per post"
    )

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The hour with the most comments per post on average is 15:00, with 38.59 comments per post. That is about 62% more than the second-highest hour, 02:00, which averages 23.81 comments per post.
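The relative difference between the top two hourly averages can be checked directly from the values in `sorted_swap`. The short calculation below copies the two averages by hand; `relative_increase` is a name introduced here.

```python
top_avg = 38.5948275862069        # average for 15:00
second_avg = 23.810344827586206   # average for 02:00

# Relative increase of the highest average over the second highest
relative_increase = (top_avg - second_avg) / second_avg

print(f"{relative_increase:.1%}")  # 62.1%
```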

According to the dataset [documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), the timezone used is Eastern Time in the US.
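Readers in other time zones may want to convert these hours. The sketch below converts one `created_at` value to UTC under a loudly stated assumption: it uses a fixed UTC-4 offset (Eastern Daylight Time), which applies to this August date but not to dates in winter, when Eastern Time is UTC-5. The names `eastern_daylight`, `created_eastern` and `created_utc` are introduced here for illustration.

```python
import datetime as dt

# Assumption: a fixed UTC-4 offset (Eastern Daylight Time), valid for this August date only
eastern_daylight = dt.timezone(dt.timedelta(hours=-4), name="EDT")

created = dt.datetime.strptime("8/16/2016 9:55", "%m/%d/%Y %H:%M")
created_eastern = created.replace(tzinfo=eastern_daylight)  # attach the assumed time zone
created_utc = created_eastern.astimezone(dt.timezone.utc)   # convert to UTC

print(created_utc.strftime("%m/%d/%Y %H:%M %Z"))  # 08/16/2016 13:55 UTC
```

For timestamps spread across the whole year, a time zone database such as the standard library's `zoneinfo` module (Python 3.9+) would handle the daylight-saving changes instead of a fixed offset.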

## Conclusion

In this project, we analysed `ask posts` and `show posts` to determine which type of post and which posting time receive the most comments on average. Based on our analysis, to maximise the number of comments a post receives, we would recommend that it be written as an `ask post` and created between `15:00` and `16:00`.

It should be noted that the dataset we analysed excluded posts without any comments. Given that, it is more accurate to say that, *of the posts that received comments*, `ask posts` received more comments on average than `show posts`, and `ask posts` created between `15:00` and `16:00` received the most comments on average.