# Hacker News Content Analysis

## Analysis of Hacker News Content to Determine Visitor Analytics

#### *Kim Kirk* <br> *July 13, 2020*

## Synopsis

A descriptive multivariate data analysis was conducted on Hacker News posts. 20,000 rows from Kaggle were imported, cleaned, and analyzed. Content analysis was conducted to identify which post types received the most comments and the top 5 hours in which to make comments, with the idea that Hacker News is extremely popular in technology circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors; thus, increase the profile of the subject matter in the posts.

### Data Processing


Import necessary libraries and the data set. Explore the data set.

In [1]:
import csv as csv
import pandas as pd

open_data = open('hacker_news.csv')
read_file = csv.reader(open_data)
hn = list(read_file)
open_data.close()

print(hn[0:6])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Remove headers to faciliate ease of data analysis.

In [2]:
headers = hn[0]
print("Headers")
print(headers)
hn = hn[1:]
print("\n")
print("First 5 rows of data set")
print(hn[0:6])

Headers
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


First 5 rows of data set
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', 

### Exploratory Data Analysis

Retrieve posts that begin with "Ask HN" or "Show HN"; these are the posts to analyze.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Ask posts have ', len(ask_posts), ' number of posts')
print('Show posts have ', len(show_posts), ' number of posts')
print('Other types of posts have ', len(other_posts), ' number of posts')


Ask posts have  1744  number of posts
Show posts have  1162  number of posts
Other types of posts have  17194  number of posts


Identify which types of posts receive more comments on average.

In [4]:
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of "ask" posts is ', avg_ask_comments)


total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print('Average number of "show" posts is ', avg_show_comments)





Average number of "ask" posts is  14.038417431192661
Average number of "show" posts is  10.31669535283993


Based on the analysis above, "ask" type posts receive more comments on average than "show" type posts.

Continuing analysis, focus on "ask" type posts because they receive more comments on average. Determine if there is a certain time more likely to attract comments. A check is performed at the end to verify dictionaries have populated correctly. 

In [5]:
import datetime as dt

result_list = []

for row in ask_posts:
    number_of_comments = int(row[4])
    create_date = row[6]
    result_list.append([create_date,number_of_comments])

print("Results list populated?")
print("\n")
print(result_list[0:4])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_time_object = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
       
    #extract the hour
    hour_only = date_time_object.hour
    
    if hour_only not in counts_by_hour:
        counts_by_hour[hour_only] = 1
        comments_by_hour[hour_only] = row[1]
        
    if hour_only in counts_by_hour:
        counts_by_hour[hour_only] += 1
        comments_by_hour[hour_only] += row[1]
        
#check 
print("counts by hour")
for key, value in counts_by_hour.items():
    print(key, value)
print("\n")
print("comments by hour")
for key, value in comments_by_hour.items():
    print(key, value)


Results list populated?


[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3]]
counts by hour
9 46
13 86
10 60
14 108
16 109
23 69
12 74
17 101
15 117
21 110
20 81
2 59
18 110
3 55
5 47
19 111
1 61
22 72
8 49
4 48
0 56
6 45
7 35
11 59


comments by hour
9 257
13 1282
10 794
14 1419
16 1831
23 544
12 691
17 1147
15 4478
21 1749
20 1724
2 1384
18 1441
3 422
5 493
19 1191
1 716
22 481
8 497
4 340
0 457
6 398
7 269
11 643


Identify for posts created during each hour of the day, the average number of comments per post. A check is performed at the end to ensure the list is populated correctly.

In [6]:
    
comments_per_post = []

for i in sorted (counts_by_hour.keys()):
    comments_per_post.append([i, comments_by_hour[i]/counts_by_hour[i]])
    
#check
print("List is not empty")
print(comments_per_post != None)

List is not empty
True


Give the list easier readability. A check is performed at the beginning to ensure the list is populated correctly.

In [7]:
swap_avg_by_hour = []

for item in comments_per_post:
    swap_avg_by_hour.append([item[1], item[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('List is not empty')
#check
print(sorted_swap != None)
print('\n')
print("Top 5 Hours for Ask Posts Comments")

for item in sorted_swap[0:5]:
    hour_datetime = dt.datetime.strptime(str(item[1]), "%H")
    hour_string = hour_datetime.strftime("%H:%M")
    average_comments = item[0]
    stuff = str('{:.2f}'.format(average_comments))
    print(hour_string, ":", stuff)



List is not empty
True


Top 5 Hours for Ask Posts Comments
15:00 : 38.27
02:00 : 23.46
20:00 : 21.28
16:00 : 16.80
21:00 : 15.90


### Conclusion

Based on the analysis, "ask" type posts receive more comments on average than "show" type posts. For "ask" type posts, the top five hours to make comments for Ask Posts are 3:00pm PST, 2:00am PST, 8:00pm PST, 4:00pm PST, and 9:00pm PST. Possible reasons as to why these times are most active, most are afternoon or evening times when people on the West Coast of the United States would be awake. The singular 2:00am time could represent people who have insomnia or work swing shift and would post during early morning hour. 