# Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments. Posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors. The dataset can be downloaded from this [link](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts).

## The main purpose of this project
Our focus in this project is on posts whose titles begin with either Ask HN or Show HN. In Ask HN posts, users ask the Hacker News community a specific question, such as "How to improve my website?" or "Am I the only one outraged by Twitter shutting down share counts?". In Show HN posts, users share a project, a product, or just something interesting, such as "Shanhu.io, a programming playground powered by e8vm".

### Our task here is to compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?


In [1]:
# Read the CSV file into a list of lists for analysis,
# separate the header row, and display the first five rows
from csv import reader

# The with-statement closes the file once it has been read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]
hn = hn[1:]
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [2]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Below are the descriptions of the columns:

| Column | Description |
| :----------- | :----------- |
| id  | the unique identifier from Hacker News for the post |
| title  | the title of the post |
| url  | the URL that the post links to, if the post has a URL |
| num_points  | the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
| num_comments  | the number of comments on the post |
| author  | the username of the person who submitted the post |
| created_at  | the date and time of the post's submission  |
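
As a small illustration of how these columns map onto the row lists, the sketch below looks up positions by column name instead of hard-coding them. The header and row are copied from the output above; `col_index` is a hypothetical helper name, not part of the original notebook.

```python
# Header row and one data row copied from the notebook output
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/',
       '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Map each column name to its position in a row (hypothetical helper)
col_index = {name: i for i, name in enumerate(headers)}

title = row[col_index['title']]
# The csv reader returns every value as a string, so numbers need converting
num_comments = int(row[col_index['num_comments']])
created_at = row[col_index['created_at']]

print(title, num_comments, created_at)
```

The rest of the analysis uses the positions directly: `row[1]` for the title, `row[4]` for the number of comments, and `row[6]` for the creation time.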

Since we are only interested in the posts whose titles begin with Ask HN or Show HN, we will create new lists of lists containing just the data for those titles.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

# To filter posts by the start of their titles ('Ask HN' or 'Show HN'),
# we iterate through each post, lowercase the title so that the match is
# case-insensitive, and use the startswith() method to check the prefix.
# We then append the full row to the matching list above.

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# the number of posts in each list
print('The number of ask posts: ' + str(len(ask_posts)))
print('The number of show posts: ' + str(len(show_posts)))
print('The number of other posts: ' + str(len(other_posts)))

The number of ask posts: 1744
The number of show posts: 1162
The number of other posts: 17194


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Our next task is to determine which type of post receives more comments on average.

In [4]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)

print('The average number of comments on ask posts: ' + str(avg_ask_comments))
print('The average number of comments on show posts: ' + str(avg_show_comments))

The average number of comments on ask posts: 14.038417431192661
The average number of comments on show posts: 10.31669535283993


The results show that the average number of comments is 14.04 for ask posts and 10.32 for show posts, so ask posts receive more comments on average. This may be because users in the Hacker News community tend to respond to a question with answers, suggestions, or follow-up questions.
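
The two averaging loops above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that; `average_comments` and the sample rows are hypothetical and only illustrate the calculation.

```python
def average_comments(posts, comments_col=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_col]) for row in posts)
    return total / len(posts)


# Tiny made-up sample in the same column layout as the dataset
sample_posts = [
    ['1', 'Ask HN: example A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: example B', '', '3', '4', 'user_b', '1/26/2016 19:30'],
]

print(average_comments(sample_posts))  # (10 + 4) / 2 = 7.0
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, it should give the same two averages as the loops.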

## Finding the Number of Ask Posts and Comments by Hour Created

Since ask posts receive more comments on average, we will focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

 - Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
 - Calculate the average number of comments ask posts receive by hour created.

In [11]:
# Count the ask posts and sum their comments for each hour of the day
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = row[4]
    result_list.append([created_at, int(num_comments)])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_str = row[0]
    # we convert the date string to a datetime object and extract the hour (%H) from the datetime object
    hour = dt.datetime.strptime(date_str, '%m/%d/%Y %H:%M').strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
        
print('The number of posts grouped by hour')
print(counts_by_hour)
print('')
print('The number of posts comments grouped by hour')
print(comments_by_hour)

The number of posts grouped by hour
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}

The number of posts comments grouped by hour
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
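
The grouping step can also be written with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is a sketch on a made-up sample, not the notebook's own method. It also reads the hour as an integer through the `.hour` attribute, whereas the code above keeps it as a two-character string.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the dataset's date format
sample = [
    ('8/4/2016 11:52', 52),
    ('1/26/2016 19:30', 10),
    ('6/23/2016 11:05', 1),
]

# Missing keys start at 0, so no membership check is needed
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour  # int, 0-23
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```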


## Calculating the Average Number of Comments for Ask HN Posts by Hour

In this task we will create a list of lists containing the hours during which posts were created and the average number of comments those posts received.

In [6]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_comment = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comment])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In [7]:
# Swap the two columns so that the average number of comments comes first.
# Sorting the swapped list will then sort by the average number of comments.

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [8]:
# Sorting the list of average comments per post
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
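
The same ordering can be obtained without building a swapped list by passing a `key` function to `sorted()`. A minimal sketch, using four of the `[hour, average]` pairs from the output above:

```python
# A few [hour, average] pairs copied from avg_by_hour
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average) in descending order
sorted_by_avg = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

With this approach each row keeps its original `[hour, average]` order, so no swap step is needed.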

In [9]:
print("Top 5 Hours for Ask Posts Comments")
for num, hour in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"), num))    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


To have a higher chance of receiving comments, it is best to create a post at **15:00 Eastern Time**, which is equivalent to **7 pm GMT in Ghana** while US daylight saving time is in effect (8 pm GMT otherwise).
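
The time zone conversion can be checked with a short sketch. It assumes fixed offsets for Eastern Time (UTC-4 during daylight saving time, UTC-5 during standard time) and relies on GMT matching UTC all year in Ghana. `eastern_hour_to_utc` is a hypothetical helper, and the date is arbitrary because the offsets are fixed.

```python
import datetime as dt

# Assumed fixed offsets for US Eastern Time
EDT = dt.timezone(dt.timedelta(hours=-4), 'EDT')  # daylight saving time
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')  # standard time


def eastern_hour_to_utc(hour, tz):
    """Convert a whole hour in the given Eastern offset to a UTC 'HH:MM' string."""
    local = dt.datetime(2016, 1, 1, hour, 0, tzinfo=tz)  # arbitrary date
    return local.astimezone(dt.timezone.utc).strftime('%H:%M')


print(eastern_hour_to_utc(15, EDT))  # 19:00
print(eastern_hour_to_utc(15, EST))  # 20:00
```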

## Conclusion

In this project, we analyzed the dataset collected from the Hacker News website. Our goal was to compare the comments received by posts with titles that begin with either **Ask HN or Show HN**, and to check whether the hour of creation affects the number of comments. After analyzing the dataset, we made the following observations:

- Ask HN posts received more comments on average than Show HN posts (14.04 to 10.32)
- Ask HN posts created at 15:00 Eastern Time (equivalent to 7 pm GMT during US daylight saving time) received the highest average number of comments (38.59 per post).

Based on these observations, a user who wants as many comments as possible should consider creating an Ask HN post around this time. However, additional factors such as the topic, content, and audience of the post also influence the number of comments, so publishing at 15:00 ET (7 pm GMT) does not guarantee the most comments. Nevertheless, posts created in that hour received more comments on average in this dataset, so the analysis can inform a strategy for posting on Hacker News.

