# Exploring Hacker News Posts

## Table of Contents

1. [**Introduction**](#1)
    - Project Description
    - Data Description
2. [**Acquiring and Loading Data**](#2)
    - Importing Libraries
    - Reading in Data
    - Exploring Data
3. [**Data Cleaning**](#3)
4. [**Data Analysis**](#4)
5. [**Conclusion**](#5)


# 1

## Introduction

**Project Description:**

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.



**Goal/Purpose:**

In this project, we'll compare two different types of posts from [Hacker News](https://news.ycombinator.com/), a popular site where technology-related stories (or 'posts') are voted on and commented on. The two types of posts we'll explore have titles that begin with either `Ask HN` or `Show HN`.

Users submit `Ask HN` posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.


**Questions to be Answered:**

We'll specifically compare these two types of posts to determine the following:

* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

It should be noted that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

You can find the data set [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), along with all the necessary information about the data, including column descriptions.

# 2

## Acquiring and Loading Data
### Library Import



In [1]:
import csv  #To read data
import datetime as dt    #Datetime analysis

### Reading in Data
Import all necessary data here.

In [2]:
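#Open the file in a with block so it is closed automatically after reading
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(csv.reader(opened_file))

print(hn[:5])    #Prints the first five rows (including the header row)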

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


# 3

## Data Cleaning

### Removing Headers from a List of Lists

In [3]:
headers = hn[0]  #Extracting the header
hn = hn[1:]  #Data without the header row

print(headers) #Prints only the headers
print('\n')   #Prints blank lines as a separator
print(hn[:5])   #Prints the first five data rows

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Extracting Ask HN and Show HN Posts

In [4]:
#Creating empty lists
ask_posts = [] 
show_posts = []
other_posts = []

# Looping through the dataset
for row in hn:
    title = row[1]
    
    #Sorting each post into a list based on the start of its lowercased title
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

# checking the number of posts in the newly created lists
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


# 4

## Data Analysis

### Calculating the Average Number of Comments for Ask HN and Show HN Posts

#### For Ask HN Posts

In [5]:
total_ask_comments = 0

for val in ask_posts:
    num_comments = val[4]
    total_ask_comments += int(num_comments)
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


#### For Show HN Posts

In [6]:
total_show_comments = 0

for val in show_posts:
    num_comments = val[4]
    total_show_comments += int(num_comments)
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


On average, `Ask HN` posts in the data receive approximately 14 comments, whereas `Show HN` posts receive approximately 10.
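
The two loops above repeat the same calculation, so it could be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `average_comments`, its `comments_index` parameter, and the sample rows are all made up for illustration. It assumes each row follows the data set's column order, with `num_comments` at index 4.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across a list of post rows."""
    if not posts:
        # Avoid dividing by zero when a list of posts is empty
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the data set
sample_posts = [
    ["1", "Ask HN: Example question?", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: Another question?", "", "3", "4", "user_b", "8/4/2016 15:10"],
]

print(average_comments(sample_posts))  # 7.0
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` should give the same values as the two cells above.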

### Finding the Number of Ask Posts and Comments by Hour Created

We'll determine if ask posts created at a certain time are more likely to attract comments.

In [7]:
result_list = []  #Empty list

for posts in ask_posts:
    result_list.append(
        [posts[6], int(posts[4])]
    )
# For each post in ask_posts, a new list is appended to result_list that contains two values: posts[6] (the date and time of the post) and int(posts[4]) (the number of comments the post received, converted to an integer).

#I'll have to create empty dictionaries and the date format
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

#Iterating through the new list
for row in result_list:
    hour = row[0]
    comment = row[1]
    time = dt.datetime.strptime(hour, date_format).strftime("%H")

    #Updating the running totals for this hour, or starting them if the hour is new
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_by_hour


# Three variables are created: counts_by_hour and comments_by_hour, both of which start as empty dictionaries, and date_format, which specifies the format of the date and time strings in result_list.
# The for loop iterates over each row in result_list.
# The hour and comment variables are assigned the first and second values of the row: the full creation date/time string and the number of comments.
# The strptime() method converts the date/time string to a datetime object, which strftime() then converts back to a string with the "%H" format specifier to extract the hour of the post (time).
# If the hour (time) is already in counts_by_hour, the total for that hour (comments_by_hour[time]) is incremented by the number of comments on the current post (comment), and the count for that hour (counts_by_hour[time]) is incremented by 1.
# If the hour is not yet in counts_by_hour, a new key is added to both dictionaries: comments_by_hour[time] starts at the post's comment count and counts_by_hour[time] starts at 1.
# Finally, the comments_by_hour dictionary is displayed as the output of the cell.
# Overall, this code totals the comments on Ask HN posts by the hour of the day in which each post was created. It does this by extracting the hour from the date and time of each post and then keeping track of the total number of comments and the number of posts for each hour. The output is a dictionary where the keys are the hours of the day and the values are the total number of comments received by Ask HN posts created during that hour.


{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
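
The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. This is only a sketch of an alternative, not the notebook's method: the function name `tally_by_hour` and the sample rows are hypothetical, and it assumes rows shaped like those in `result_list` (a date/time string followed by an integer comment count).

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def tally_by_hour(rows):
    """Return (counts, comments) dictionaries keyed by zero-padded hour."""
    counts = defaultdict(int)    # number of posts created in each hour
    comments = defaultdict(int)  # total comments on those posts
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# Made-up rows in the same shape as result_list
sample_rows = [
    ["8/4/2016 11:52", 52],
    ["1/26/2016 11:05", 10],
    ["6/23/2016 22:20", 1],
]
sample_counts, sample_comments = tally_by_hour(sample_rows)
print(sample_counts)    # {'11': 2, '22': 1}
print(sample_comments)  # {'11': 62, '22': 1}
```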

### Sorting and Printing Values from a List of Lists

In [8]:
avg_by_hour = []

for hours in comments_by_hour:
    avg_by_hour.append([hours, comments_by_hour[hours] / counts_by_hour[hours]])

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In [9]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
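
Swapping the columns is one way to sort by the average. Another option is to pass a `key` function to `sorted()`, which leaves each `[hour, average]` pair in its original order. The sketch below uses a handful of rounded values taken from the output above; the name `sample_avg_by_hour` is made up for illustration.

```python
from operator import itemgetter

# A few [hour, average] pairs, rounded from the results above
sample_avg_by_hour = [["09", 5.58], ["15", 38.59], ["02", 23.81], ["20", 21.52]]

# Sort by the second element (the average), highest first
ranked = sorted(sample_avg_by_hour, key=itemgetter(1), reverse=True)

for hour, avg in ranked[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

Sorting on the average alone also means that ties are not broken by comparing the hour strings, as happens when whole lists are compared.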

In [10]:
# Print the 5 hours with the highest average comments (sorted_swap is already sorted in descending order).

print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post. That is about 62% higher than the second-highest hourly average (23.81 comments per post at 02:00).
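
The size of the gap between the top two hours can be checked with a quick calculation. This sketch simply reuses the two rounded averages printed above; the variable names are made up for illustration.

```python
highest = 38.59  # average comments per post at 15:00
second = 23.81   # average comments per post at 02:00

# Relative increase of the highest average over the second highest
relative_increase = (highest - second) / second * 100
print(f"{relative_increase:.0f}% higher")  # 62% higher
```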

According to the data set [documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), the timezone used is Eastern Time in the US. So, we could also write 15:00 as 3:00 pm ET.
In my timezone, WAT (UTC+1), that is 21:00 (9:00 pm) when Eastern Time is on standard time (EST, UTC-5), or 20:00 (8:00 pm) during daylight saving time (EDT, UTC-4).
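
A conversion like this can be checked with the `datetime` module. The sketch below is not part of the original analysis and rests on stated assumptions: it treats Eastern Time as a fixed EST offset (UTC-5) and ignores daylight saving time, it uses a fixed UTC+1 offset for WAT, and the helper name `est_hour_to_wat` and the calendar date are arbitrary.

```python
import datetime as dt

# Fixed offsets (assumption: daylight saving time is ignored)
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
WAT = dt.timezone(dt.timedelta(hours=1), "WAT")


def est_hour_to_wat(hour):
    """Convert a zero-padded EST hour string such as '15' to a WAT 'HH:MM' string."""
    # The date is arbitrary; only the time of day matters here
    est_time = dt.datetime(2016, 1, 15, int(hour), 0, tzinfo=EST)
    return est_time.astimezone(WAT).strftime("%H:%M")


for hr in ["15", "02", "20", "16", "21"]:
    print(f"{hr}:00 EST -> {est_hour_to_wat(hr)} WAT")
```

Under these assumptions WAT is six hours ahead of EST, so late-evening Eastern hours roll over into the early morning of the next day in WAT.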

To create a post with a higher chance of receiving comments, it is recommended to create it during these hours. However, it is important to note that the data set used for this analysis is specific to the Hacker News website and may not be representative of other websites or platforms. Additionally, factors such as the topic, content, and audience of the post can also influence the number of comments received. Therefore, it is always important to consider multiple factors when creating content for online platforms.

# 5

## Conclusion

**Insights:** 

In this analysis, we examined the average number of comments per post for `Ask HN` and `Show HN` posts on the Hacker News website. We found that `Ask HN` posts received more comments on average than `Show HN` posts. We also analyzed the average number of comments per post for `Ask HN` posts by the hour of the day and found that there are certain hours that tend to receive more comments than others. Specifically, the top 5 hours for `Ask HN` comments were 15:00, 02:00, 20:00, 16:00, and 21:00 in US Eastern Time.

Furthermore, I converted the times to the WAT time zone (GMT+1, Nigeria). Taking Eastern Time as EST (GMT-5), the top 5 hours for `Ask HN` comments correspond to 21:00, 08:00, 02:00, 22:00, and 03:00 WAT; during daylight saving time (EDT, GMT-4), each is one hour earlier. This information can be useful for individuals or organizations looking to create content on the Hacker News website with the goal of receiving more comments.

**Suggestions:**

Overall, this analysis provides insights into the factors that can influence the number of comments received on Hacker News, and may be helpful for content creators, marketers, and other individuals looking to optimize their content strategy.