# Exploring Hacker News Posts

Hacker News (HN) is a site started by the startup incubator Y Combinator, where user-submitted posts receive votes and comments. Posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result. The dataset used here can be downloaded from [Kaggle](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts).

## The goal of this project

In this project, we'll compare two types of Hacker News posts to answer the following questions:

Do Ask HN or Show HN posts receive more comments on average?  
Do posts created at a certain time receive more comments on average?

Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

Ask HN: How to improve my personal website?  
Ask HN: Am I the only one outraged by Twitter shutting down share counts?  
Ask HN: Aby recent changes to CSS that broke mobile?  

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:

Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform  
Show HN: Something pointless I made   
Show HN: Shanhu.io, a programming playground powered by e8vm  

In [24]:
# open the dataset and read it into a list of lists

from csv import reader
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
# separate the header row from the data rows
headers = hn[0]
hn = hn[1:]

In [25]:
# display the column headers

print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Below are descriptions of the columns:

| Column | Description |
| :------ | :----------- |
| id | the unique identifier from Hacker News for the post |
| title | the title of the post |
| url | the URL that the post links to, if the post has a URL |
| num_points | the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
| num_comments | the number of comments on the post |
| author | the username of the person who submitted the post |
| created_at | the date and time of the post's submission |


In [26]:
# display the first five rows

print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Extracting Ask HN and Show HN Posts

We'll create new lists of lists containing only the rows whose titles begin with Ask HN or Show HN.

In [27]:
ask_posts = []
show_posts = []
other_posts = []

In [28]:
# separate posts whose titles begin with Ask HN or Show HN using str.startswith
# titles are lowercased first so the match is case-insensitive
# if the title starts with 'ask hn' -> add the full row to ask_posts
# if the title starts with 'show hn' -> add the full row to show_posts
# otherwise -> add the full row to other_posts

for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [29]:
print("number of ask posts: ", len(ask_posts))
print("number of show posts: ", len(show_posts))
print("number of other posts: ", len(other_posts))

number of ask posts:  1744
number of show posts:  1162
number of other posts:  17194
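
The prefix check used in the loop above can be tried in isolation. The sketch below wraps it in a hypothetical helper, `classify_title` (not part of the original notebook), and runs it on a few titles quoted earlier. Lowercasing the title first is what makes the match case-insensitive.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix.

    Hypothetical helper for illustration; it mirrors the loop above.
    """
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Titles copied from the examples and rows shown earlier
sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Note that `startswith` only looks at the beginning of the title, so a post that mentions "Ask HN" later in its title would still be counted as an other post.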


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Let's determine if ask posts or show posts receive more comments on average.

In [30]:
# find the total number of comments in ask posts

total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments/len(ask_posts)

In [31]:
print("total number of comments in ask posts: ", total_ask_comments)
print("average number of comments on ask posts: {:.2f}".format(avg_ask_comments))

total number of comments in ask posts:  24483
average number of comments on ask posts: 14.04


In [32]:
# find the total number of comments in show posts

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments/len(show_posts)

In [33]:
print("total number of comments in show posts: ", total_show_comments)
print('average number of comments on show posts {:.2f}'.format(avg_show_comments))

total number of comments in show posts:  11988
average number of comments on show posts 10.32


As we can see, Ask HN posts receive more comments on average than Show HN posts (14.04 compared to 10.32).
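
The two averaging cells repeat the same logic, so it could be factored into one function. The following is a minimal sketch, assuming rows keep the column layout shown earlier (`num_comments` at index 4). The helper name `average_comments` and the two sample rows are made up for illustration and are not taken from the dataset.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column (hypothetical helper)."""
    if not posts:
        # Avoid dividing by zero when the list is empty
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column layout as the dataset
sample_posts = [
    ["1", "Ask HN: example A", "", "5", "10", "user_a", "1/1/2016 9:00"],
    ["2", "Ask HN: example B", "", "3", "4", "user_b", "1/1/2016 15:30"],
]
sample_avg = average_comments(sample_posts)
print("{:.2f}".format(sample_avg))
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the figures above, since 24483 / 1744 and 11988 / 1162 round to 14.04 and 10.32.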

## Finding the Number of Ask Posts and Comments by Hour Created

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts. Let's determine if ask posts created at a certain time are more likely to attract comments.

We'll use the following steps:
* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.
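
Both steps depend on extracting the hour from the `created_at` string. As a small sketch of that parsing step on a single value (the timestamp is copied from the first row printed earlier, and the variable names are illustrative only):

```python
import datetime as dt

# created_at value copied from the first row of the dataset shown earlier
sample_created_at = "8/4/2016 11:52"

# Parse the string into a datetime object using the dataset's format
parsed = dt.datetime.strptime(sample_created_at, "%m/%d/%Y %H:%M")

# Two ways to get the hour: a zero-padded string or an integer
hour_label = parsed.strftime("%H")
hour_number = parsed.hour
print(hour_label, hour_number)
```

The zero-padded string form is convenient as a dictionary key because hours then sort and display consistently ('02' rather than '2').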

In [34]:
# create a list containing each post's submission date and number of comments

import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append((created_at, num_comments))

In [35]:
# create two dictionaries: the number of posts created in each hour and the number of comments those posts received

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    post_date = row[0]
    num_comments = int(row[1])
    # parse the date and create a datetime object
    post_date = dt.datetime.strptime(post_date, "%m/%d/%Y %H:%M")
    # select just the hour from the datetime object
    post_hour = post_date.strftime("%H")
    if post_hour in counts_by_hour:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += num_comments
    else:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = num_comments

In [36]:
print("The number of posts grouped by hour:")
print(counts_by_hour)

The number of posts grouped by hour:
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


In [37]:
print("The number of comments grouped by hour:")
print(comments_by_hour)

The number of comments grouped by hour:
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


## Calculating the Average Number of Comments for Ask HN Posts by Hour

Let's calculate the average number of comments for posts created during each hour of the day.

In [38]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_comments = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comments])

In [39]:
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Although we now have the results we need, this format makes it difficult to identify the hours with the highest values.

## Sorting and Printing Values from a List of Lists

In [40]:
# swapping columns in avg_by_hour list

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

In [41]:
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [46]:
# sorting swap_avg_by_hour in descending order

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [49]:
print("Top 5 Hours for Ask Posts Comments:")
print()
for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], "%H")
    hour = dt.datetime.strftime(hour, "%H:%M")
    avg_comments = row[0]
    print("{}: {:.2f} average comments per post".format(hour,avg_comments))

Top 5 Hours for Ask Posts Comments:

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Ask HN posts created during the 15:00 hour (Eastern Time in the US) receive the highest average number of comments, at 38.59 per post.
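
The swap step is needed only because `sorted` compares lists element by element, starting with the first. As an alternative sketch (not the approach used above), a `key` function can sort on the average directly. The three pairs below are copied from `avg_by_hour`; the variable names are illustrative.

```python
# A few [hour, average] pairs copied from avg_by_hour above
sample_avg_by_hour = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
]

# Sort by the average (second element), largest first, without swapping columns
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

This keeps each pair in its original `[hour, average]` order, which avoids maintaining a second, swapped copy of the list.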

## Summary

In this project, we analysed data from the Hacker News website to compare the number of comments received by posts beginning with Ask HN or Show HN. Our analysis revealed the following findings:

On average, 
* Ask HN posts receive more comments than Show HN posts (14.04 compared to 10.32).
* Ask HN posts created during the 15:00 hour (Eastern Time) receive the most comments (38.59 per post).

Users who want to maximize comments may therefore consider creating an Ask HN post during this hour. However, other factors such as author engagement, content quality, and post topic also affect the number of comments received, so posting at 15:00 ET does not guarantee the highest number of comments every time. Nevertheless, this analysis can serve as a useful starting point when planning a post on Hacker News.