# Exploring Hacker News Posts

This project uses a [downsampled dataset](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) from `Hacker News`. `Hacker News` is an extremely popular website in technology and startup circles, and posts that make it to the top of the listings can get hundreds of thousands of visitors.

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the `Hacker News` community a specific question, and `Show HN` posts to show the community a project, product, or just something interesting.

## Description of data columns

| Column      | Description|
|:-------------|:------------|
|id           |the unique identifier from Hacker News for the post|
|title        |the title of the post|
|url          |the URL that the post links to|
|num_points   |the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes|
|num_comments |the number of comments on the post|
|author       |the name of the account that made the post|
|created_at   |the date and time the post was made (the time zone is Eastern Time in the US)|

## Project Objective

This project compares the two types of posts, `Ask HN` and `Show HN`, to determine:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## Read in the CSV file as a list of lists

In [1]:
from csv import reader

# Open the file in a with-block so it is closed once the rows are read
with open('/kaggle/input/hacker-news-posts/HN_posts_year_to_Sep_26_2016.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


### 1. Extract the first row of data as headers

In [2]:
# Extract headers from data
headers = hn[0]

print('headers:')
print(headers)

headers:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


### 2. Remove headers

In [3]:
# Remove the header row from hn
hn = hn[1:]

print('First 5 rows of hn data without headers:')
print(hn[:5])

First 5 rows of hn data without headers:
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
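
As a possible alternative to slicing, the header can be split off while reading by calling `next()` on the reader. The sketch below is only an illustration: it uses a hypothetical in-memory stand-in (`sample_csv`) built from two rows shown above instead of the real file, and the names `sample_headers` and `sample_posts` are assumptions, not part of the notebook.

```python
import csv
import io

# Hypothetical in-memory stand-in for the CSV file (two rows copied from the output above)
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,1,0,poindontcare,9/26/2016 3:16\n"
)

with sample_csv as f:
    rows = csv.reader(f)
    sample_headers = next(rows)  # consume the first row as the header
    sample_posts = list(rows)    # everything that remains is data

print(sample_headers)
print(len(sample_posts))  # 2
```

With a real file, `open(path)` would take the place of `sample_csv`, and the rest would stay the same.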


## Find the posts with titles beginning with `Ask HN` or `Show HN`

In [4]:
# Create 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

In [5]:
# Separate the posts
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [6]:
# Count the number of posts in each list
length_ask_posts = len(ask_posts)
print(f'Number of "Ask HN" posts: {length_ask_posts}')

print()

length_show_posts = len(show_posts)
print(f'Number of "Show HN" posts: {length_show_posts}')

print()

length_other_posts = len(other_posts)
print(f'Number of other posts: {length_other_posts}')

Number of "Ask HN" posts: 9139

Number of "Show HN" posts: 10158

Number of other posts: 273822
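
The prefix check used to split the posts can also be pulled into a small helper, which makes it easy to try on individual titles. This is a minimal sketch; `classify_title` and `sample_titles` are hypothetical names introduced here, and the titles are taken from outputs elsewhere in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical sample of titles for illustration
sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Finding puns computationally',
    'algorithmic music',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```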


## Display the first 5 rows of data beginning with `Ask HN` and `Show HN`

In [7]:
# Display first 5 rows of data for 'ask_posts' list
print('First 5 rows from ask_posts:')
print(ask_posts[:5])

First 5 rows from ask_posts:
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]


In [8]:
# Display first 5 rows of data for 'show_posts' list
print('First 5 rows from show_posts:')
print(show_posts[:5])

First 5 rows from show_posts:
[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'], ['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17'], ['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']]


## Calculate the average number of comments for `Ask HN` and `Show HN` posts

In [9]:
# Find the total number of comments on ask posts

total_ask_comments = 0
for row in ask_posts:
    num_of_ask_comments = int(row[4])
    total_ask_comments += num_of_ask_comments

# Find the average number of comments on ask posts
# (// is floor division, so the result is rounded down to a whole number)
avg_ask_comments = total_ask_comments // length_ask_posts

print(f'The average number of comments on ask posts is {avg_ask_comments}')

The average number of comments on ask posts is 10


In [10]:
# Find the total number of comments on show posts

total_show_comments = 0
for row in show_posts:
    num_of_show_comments = int(row[4])
    total_show_comments += num_of_show_comments

# Find the average number of comments on show posts
# (// is floor division, so the result is rounded down to a whole number)
avg_show_comments = total_show_comments // length_show_posts

print(f'The average number of comments on show posts is {avg_show_comments}')

The average number of comments on show posts is 4
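
Because the two cells above use floor division (`//`), the printed averages are rounded down to whole numbers. If fractional averages are wanted, true division (`/`) could be used instead. The sketch below assumes a hypothetical helper `average_comments` and runs on the first three ask posts shown earlier rather than on the full lists.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments per post, computed with true division."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# First three ask posts from the output above (7, 3 and 0 comments)
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]

print(f'{average_comments(sample_ask):.2f}')  # 3.33
print((7 + 3 + 0) // 3)                       # 3 (floor division drops the fraction)
```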


## Find the Number of `Ask HN` Posts and Comments by Hour Created

We'll determine whether ask posts created at a certain time are more likely to attract comments. To perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

2. Calculate the average number of comments ask posts receive by hour created.

In [11]:
# Import the datetime module
import datetime as dt

# Create an empty list 'result_list'
result_list = []

# Iterate over 'ask_posts' list
for post in ask_posts:
    result_list.append([post[6], int(post[4])])

# Create dictionaries 'counts_by_hour' and 'comments_by_hour'
counts_by_hour = {}
comments_by_hour = {}

for item in result_list:
    date = item[0]
    comment = item[1]
    date_format = '%m/%d/%Y %H:%M'
    
    # Parse the string into a datetime object with strptime, then extract the hour as a two-digit string with strftime
    time = dt.datetime.strptime(date, date_format).strftime('%H')
    
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_by_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}
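
The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`, which starts missing keys at zero. This is a sketch under assumptions: `tally_by_hour` and `sample_rows` are hypothetical names, and the sample uses the creation times and comment counts of the first four ask posts shown earlier.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = '%m/%d/%Y %H:%M'

def tally_by_hour(rows):
    """Return (post counts, comment totals) keyed by two-digit hour strings."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# (created_at, num_comments) pairs from the first four ask posts above
sample_rows = [
    ('9/26/2016 2:53', 7),
    ('9/26/2016 1:17', 3),
    ('9/25/2016 22:57', 0),
    ('9/25/2016 22:48', 3),
]

sample_counts, sample_comments = tally_by_hour(sample_rows)
print(sample_counts)    # {'02': 1, '01': 1, '22': 2}
print(sample_comments)  # {'02': 7, '01': 3, '22': 3}
```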

## Calculate the Average Number of Comments for `Ask HN` Posts by Hour

Next, we use the two dictionaries `comments_by_hour` and `counts_by_hour` to calculate the average number of comments for posts created during each hour of the day.

In [12]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]
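
If a dictionary is more convenient than a list of lists, the per-hour average could be built with a dict comprehension. The sketch below is illustrative only; the two small dictionaries are hypothetical samples (totals for the first four ask posts shown earlier), not the full `comments_by_hour` and `counts_by_hour`.

```python
# Hypothetical sample totals: hour -> comments, and hour -> number of posts
sample_comments_by_hour = {'02': 7, '01': 3, '22': 3}
sample_counts_by_hour = {'02': 1, '01': 1, '22': 2}

# Average comments per post for each hour
sample_avg_by_hour = {
    hour: sample_comments_by_hour[hour] / sample_counts_by_hour[hour]
    for hour in sample_comments_by_hour
}

print(sample_avg_by_hour)  # {'02': 7.0, '01': 3.0, '22': 1.5}
```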

## Sorting and Printing Values

Sort the results to identify the hours with the highest average number of comments per post.

In [13]:
# Create empty list 'swap_avg_by_hour'
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
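
Swapping the columns is one way to sort by the average. Another option is to pass a `key` function to `sorted`, which leaves the `[hour, average]` order intact. The sketch below uses a hypothetical four-row excerpt of `avg_by_hour` (`avg_sample`), with values copied from the output above.

```python
# Hypothetical excerpt of avg_by_hour, values copied from the output above
avg_sample = [
    ['02', 11.137546468401487],
    ['01', 7.407801418439717],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
]

# Sort by the second element (the average), highest first, and keep the top two
top_two = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)[:2]

for hour, avg in top_two:
    print(f'{hour}:00: {avg:.2f} average comments per post')
# 15:00: 28.68 average comments per post
# 13:00: 16.32 average comments per post
```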

In [14]:
# Print the top 5 hours with the highest average number of comments (sorted_swap is already sorted)
print('Top 5 Hours for Ask Posts Comments')
for avg, hr in sorted_swap[:5]:
    print(
        '{hour}: {avg_comment:.2f} average comments per post'.format(
            hour = dt.datetime.strptime(hr,'%H').strftime('%H:%M'), avg_comment = avg))

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


The dataset's time zone is US Eastern Time.

Based on the above analysis, `Ask HN` posts created at 15:00 had the highest average number of comments per post (28.68), followed by 13:00 (16.32), 12:00 (12.38), 02:00 (11.14) and 10:00 (10.68).

Therefore, to improve the chance of receiving comments, it is advisable to create an `Ask HN` post at either 15:00 (04:00 SGT the next day) or 13:00 (02:00 SGT the next day). These Singapore times assume Eastern Standard Time (UTC-5); during daylight saving time (UTC-4) they fall one hour earlier.
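
The Eastern-to-Singapore conversion can be checked with fixed UTC offsets from the standard library. This is a rough sketch: the offsets for EST (UTC-5), EDT (UTC-4) and SGT (UTC+8) are hard-coded assumptions, the date is arbitrary, and `to_sgt` is a hypothetical helper.

```python
import datetime as dt

# Fixed offsets (assumed): Eastern Standard, Eastern Daylight, Singapore
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
EDT = dt.timezone(dt.timedelta(hours=-4), 'EDT')
SGT = dt.timezone(dt.timedelta(hours=8), 'SGT')

def to_sgt(hour, source_tz):
    """Convert a whole hour in source_tz to an HH:MM string in Singapore time."""
    # The date is arbitrary; only the clock time matters for this check
    local = dt.datetime(2016, 1, 15, hour, 0, tzinfo=source_tz)
    return local.astimezone(SGT).strftime('%H:%M')

for hour in (15, 13):
    print(f'{hour}:00 Eastern -> {to_sgt(hour, EST)} SGT (EST) or {to_sgt(hour, EDT)} SGT (EDT)')
# 15:00 Eastern -> 04:00 SGT (EST) or 03:00 SGT (EDT)
# 13:00 Eastern -> 02:00 SGT (EST) or 01:00 SGT (EDT)
```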