# Hacker News - Analysis of posts

Hacker News is a technology-oriented forum where users can submit links, post questions, and discuss topics. To gauge popularity and weigh the importance of individual submissions, posts can be upvoted or downvoted, similar to Reddit. In this project, we will analyse "Ask HN" submissions, in which users ask the community specific questions. The time zone for the data is US Eastern, which is also the time zone of Montreal.

|Column name (index)| Description|
|---|---|
|id (0)| Unique identifier of post|
|title (1)| Title of post|
|url (2)| URL that the post links to|
|num_points (3)| Total points (upvotes - downvotes)|
|num_comments (4)| Total number of comments|
|author (5)| Username |
|created_at (6)| Date and time of post submission|

In [19]:
from csv import reader # to parse .csv file

open_file = open('C:/Users/User/Documents/data_sets/hacker_news.csv', encoding = 'utf-8') # utf-8 encoding required to read data
read = reader(open_file) # parse
hn = list(read) # transform into list of lists
headers = hn[0] # headers
hn = hn[1:] # data
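
The loading step above never closes the file handle. A minimal sketch of the same step with a `with` block, which closes the file automatically, might look as follows. The helper name `load_csv` and the one-row sample file are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import os
import tempfile

def load_csv(path):
    """Return (headers, rows) from a CSV file and close the file afterwards."""
    with open(path, encoding='utf-8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Hypothetical sample standing in for hacker_news.csv
sample = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,9/26/2016 2:53\n"
)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(sample)

headers, hn_sample = load_csv(tmp.name)
os.remove(tmp.name)  # clean up the temporary sample file

print(headers)
print(hn_sample)
```

With the real data set, `load_csv` would be called with the path used in the cell above.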

In [20]:
# Separate "Ask HN" and "Show HN" posts from all other posts
ask_posts = [] # posts asking the community a question
show_posts = [] # posts showing a project, product, or something of interest
other_posts = [] 

for row in hn:
    title = row[1].lower() # lower case title to facilitate usage of startswith() method
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else: other_posts.append(row)

print('Number of "Ask hn" posts: ', len(ask_posts))  
print('Number of "Show hn" posts: ', len(show_posts))  
print('Number of other posts: ', len(other_posts))  

Number of "Ask hn" posts:  9139
Number of "Show hn" posts:  10158
Number of other posts:  273822


In [21]:
def avg_comments(data):
    '''Returns the average number of comments in data'''
    total_comments = 0 # number of comments in data
    total_posts = len(data) # number of posts in data
    for row in data:
        comments = int(row[4])
        total_comments += comments
    return total_comments/total_posts # average number of comments

In [22]:
avg_ask_comments = avg_comments(ask_posts)
print('Average number of comments for ask posts: ', avg_ask_comments)
avg_show_comments = avg_comments(show_posts)
print('Average number of comments for show posts: ', avg_show_comments)
avg_other_comments = avg_comments(other_posts)
print('Average number of comments for other posts: ', avg_other_comments)

Average number of comments for ask posts:  10.393478498741656
Average number of comments for show posts:  4.886099625910612
Average number of comments for other posts:  6.4572678601427205


The average number of comments per post differs substantially between post types: "Ask HN" posts receive about 10.4 comments on average, roughly twice as many as "Show HN" posts (4.9). This may be explained by questions stimulating discussions more easily (for example due to their concise nature), by questions being less technically challenging if they are mostly asked by inexperienced users, or by a greater inclination of users to help rather than discuss. Since questions generate the most comments, the following analysis will focus on "Ask HN" posts.
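
A mean can be dominated by a handful of heavily commented posts, so comparing it with the median is a cheap robustness check. The sketch below uses invented comment counts (`toy_comments` is hypothetical, not taken from the HN data) purely to show how far the two statistics can diverge.

```python
from statistics import mean, median

# Hypothetical comment counts: one heavily discussed post among quiet ones
toy_comments = [0, 1, 2, 3, 94]

toy_mean = mean(toy_comments)      # pulled upwards by the single outlier
toy_median = median(toy_comments)  # closer to a "typical" post

print('mean:  ', toy_mean)
print('median:', toy_median)
```

Applied to the real data, the same two calls could be run on the integer values of column 4 for each post type.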

## At what times do "Ask HN" posts acquire the most comments?

In [29]:
# identify date format
for row in ask_posts[0:3]:
    print(row[6])
    
date_format = "%m/%d/%Y %H:%M" # format of post dates

9/26/2016 2:53
9/26/2016 1:17
9/25/2016 22:57


In [56]:
import datetime as dt # time analysis

result_list = [] # list of tuples containing post dates and number of comments
for row in ask_posts:
    result_list.append((row[6], row[4])) # list of tuples (date created, number of comments)

posts_by_hour = {} # number of posts for each hour
comments_by_hour = {} # number of comments for each hour

# Extract hours and build dictionaries of post and comment counts by hour
for tup in result_list:
    date = tup[0] # date of post
    comments = int(tup[1]) # convert to integer to summarize comments
    time = dt.datetime.strptime(date, date_format) # parse dates according to given format
    hour = time.strftime("%H") # extract hour from post
    if hour not in posts_by_hour: # frequency counts
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

# Post and comment counts for each hour of the day (24-hour clock)
for hour in sorted(posts_by_hour):
    print("Number of posts at {}:00: {}".format(hour, posts_by_hour[hour]))
    print("Number of comments at {}:00: {}".format(hour, comments_by_hour[hour]))

Number of posts at 00:00: 301
Number of comments at 00:00: 2277
Number of posts at 01:00: 282
Number of comments at 01:00: 2089
Number of posts at 02:00: 269
Number of comments at 02:00: 2996
Number of posts at 03:00: 271
Number of comments at 03:00: 2154
Number of posts at 04:00: 243
Number of comments at 04:00: 2360
Number of posts at 05:00: 209
Number of comments at 05:00: 1838
Number of posts at 06:00: 234
Number of comments at 06:00: 1587
Number of posts at 07:00: 226
Number of comments at 07:00: 1585
Number of posts at 08:00: 257
Number of comments at 08:00: 2362
Number of posts at 09:00: 222
Number of comments at 09:00: 1477
Number of posts at 10:00: 282
Number of comments at 10:00: 3013
Number of posts at 11:00: 312
Number of comments at 11:00: 2797
Number of posts at 12:00: 342
Number of comments at 12:00: 4234
Number of posts at 13:00: 444
Number of comments at 13:00: 7245
Number of posts at  

In [61]:
# Compute average number of comments per post for each hour
avg_by_hour = [] # list of [average comments per post, hour] pairs
for hour in posts_by_hour:
    posts = posts_by_hour[hour]
    comments = comments_by_hour[hour]
    avg = round(comments/posts, 1) # average number of comments per post 
    avg_by_hour.append([avg, hour])


In [70]:
# Print results
sorted_avg = sorted(avg_by_hour, reverse=True)[:5] # top 5 hours with the highest average comments per "Ask HN" post
for avg in sorted_avg:
    average = avg[0]
    hour = dt.datetime.strptime(avg[1], '%H') # parse the hour string into a datetime object
    hour = hour.strftime("%H:%M") # format time as HH:MM
    print('{average:.2f} average comments per "Ask HN" post at {hour}'.format(average = average, hour = hour))

28.70 average comments per "Ask HN" post at 15:00
16.30 average comments per "Ask HN" post at 13:00
12.40 average comments per "Ask HN" post at 12:00
11.10 average comments per "Ask HN" post at 02:00
10.70 average comments per "Ask HN" post at 10:00


The results show that "Ask HN" posts created at 15:00 receive the highest average number of comments (28.7 per post), well ahead of 13:00 (16.3). This indicates that, in US Eastern time (which includes Montreal), around 15:00 is the best time to ask a question in order to generate the most comments on HN.
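
The hourly analysis can also be written more compactly. The sketch below is one possible variant, not the notebook's own code: it reads the hour from the parsed `datetime` object and keeps both counts in a single `defaultdict`. The function name `avg_comments_by_hour` and the three `toy_rows` are assumptions made for illustration.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"  # same format as identified above

def avg_comments_by_hour(rows):
    """Map each hour of day (0-23) to the mean number of comments per post."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [posts, comments]
    for row in rows:
        hour = dt.datetime.strptime(row[6], DATE_FORMAT).hour
        totals[hour][0] += 1
        totals[hour][1] += int(row[4])
    return {hour: comments / posts
            for hour, (posts, comments) in totals.items()}

# Hypothetical rows in the same column layout as the data set
toy_rows = [
    ['1', 'Ask HN: A?', '', '1', '4', 'a', '9/26/2016 2:53'],
    ['2', 'Ask HN: B?', '', '1', '8', 'b', '9/25/2016 2:10'],
    ['3', 'Ask HN: C?', '', '1', '5', 'c', '9/25/2016 22:57'],
]

averages = avg_comments_by_hour(toy_rows)
# Print hours from highest to lowest average
for hour, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print('{:02d}:00  {:.2f}'.format(hour, avg))
```

Because the averages are not rounded before sorting, ties between hours are less likely than with values rounded to one decimal place.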

## Which URLs are most frequently discussed?

In [79]:
# URL root analysis
url_count = {} # dictionary of URL root frequencies
for row in hn:
    trimmed = row[2].replace('http://', '').replace('https://', '').replace('www.', '') # remove superfluous prefix
    root = trimmed.split('/')[0] # split at forward slashes and keep only the root (domain)
    if root not in url_count and root != '': # exclude posts without URLs
        url_count[root] = 1
    elif root != '': url_count[root] += 1

url_list = [] # list of frequencies to allow sorting
for element in url_count:
    url_list.append((url_count[element], element))
print(sorted(url_list, reverse=True)[:10]) # top 10 urls and their corresponding counts

[(15931, 'medium.com'), (14439, 'github.com'), (6069, 'nytimes.com'), (5280, 'youtube.com'), (4116, 'techcrunch.com'), (3432, 'theguardian.com'), (2979, 'arstechnica.com'), (2745, 'bloomberg.com'), (2321, 'en.wikipedia.org'), (1992, 'bbc.com')]


Not surprisingly, Medium and GitHub are the most cited domains by a large margin. Both websites offer information on tech-related subjects that provides ample material for stimulating discussions: Medium in the form of blog posts and GitHub mainly as a host for code repositories. The remaining websites are news-related, with the exception of YouTube and Wikipedia. Next, we will analyze whether several posts linked to the same full-length URL.
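
As an alternative to chained `replace` calls, the domain can be extracted with the standard library's URL parser. The following is a minimal sketch under the assumption that every non-empty URL starts with a scheme such as `http://`. The helper `domain` and the `toy_urls` list are hypothetical. Unlike `replace('www.', '')`, this only strips a leading `www.` and leaves the rest of the host name untouched.

```python
from collections import Counter
from urllib.parse import urlparse

def domain(url):
    """Return the host of a URL without a leading 'www.', or '' if there is none."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith('www.') else host

# Hypothetical URLs, including one post without a URL
toy_urls = [
    'https://www.github.com/joowani/dtags',
    'http://github.com/ptmt/react-native-desktop',
    'https://medium.com/some-post',
    '',
]

# Count domains, skipping posts without a URL
domain_counts = Counter(d for d in map(domain, toy_urls) if d)
print(domain_counts.most_common(10))
```

`Counter.most_common` returns the entries already sorted by frequency, which replaces the manual list-of-tuples sorting step.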

In [162]:
url_count = {} # dictionary of full-length URL frequencies
for row in hn:
    trimmed = row[2].replace('http://', '').replace('https://', '').replace('www.', '') # remove superfluous prefix
    if trimmed not in url_count and trimmed != '': # exclude posts without URLs
        url_count[trimmed] = 1
    elif trimmed != '': url_count[trimmed] += 1

url_list = [] # list of URL frequencies to allow sorting
for element in url_count:
    url_list.append((url_count[element], element))

# Count of frequencies
frequency_count = {}
for freq in url_list:
    count = freq[0] # URL count
    if count not in frequency_count:
        frequency_count[count] = 1
    else: frequency_count[count] += 1
    
# Print top 10 full-length URLs and their corresponding counts
for url in sorted(url_list, reverse=True)[:10]:  
    print(url)
print('\nCount of frequencies: ')
for count in frequency_count:
    print(count, ': ', frequency_count[count])

(22, 'aioptify.com/topmldmbooks.php?utm_source=hackernews&utm_medium=cpm&utm_campaign=topmlbooks')
(17, 'technologyreview.com/view/541276/deep-learning-machine-teaches-itself-chess-in-72-hours-plays-at-international-master/')
(16, 'systemmeasure.com')
(15, 'waitbutwhy.com/2015/11/the-cook-and-the-chef-musks-secret-sauce.html')
(15, 'journal.sjdm.org/15/15923a/jdm15923a.pdf')
(15, 'businessofsoftware.org/free-software-pricing-guide/')
(14, 'technologyreview.com/view/542626/why-self-driving-cars-must-be-programmed-to-kill/')
(14, 'speerty.com')
(14, 'moralmachine.mit.edu/')
(14, 'github.com/joowani/dtags')

Count of frequencies: 
1 :  221750
2 :  18862
5 :  385
3 :  3774
4 :  1046
10 :  12
11 :  6
6 :  146
7 :  80
9 :  22
8 :  39
14 :  5
13 :  1
12 :  3
22 :  1
15 :  3
16 :  1
17 :  1


The top cited URLs include a review of several data science books, an article on Elon Musk, and a scientific paper on judgment and decision making. The latter probably rose to prominence as a result of the repeated use of the word **bullshit** (www.journal.sjdm.org/15/15923a/jdm15923a.pdf). Notably, an interesting website among the top 10 outlines 13 distinct moral dilemmas (www.moralmachine.mit.edu). In each scenario, an autonomous vehicle has lost brake function and you are presented with two decisions leading to two distinct outcomes. Real cases are even more complex since the outcomes would not be easily known beforehand, but the site does raise some interesting questions about the ethical programming of self-driving cars. Lastly, the frequency list shows that counts drop off sharply: 221,750 URLs were linked by a single post, while only 1,046 were linked by four posts. Four posts linking to the same URL will later be used as a cut-off in an attempt to find outstanding URLs.
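
The "count of frequencies" used above (how many URLs were linked by exactly n posts) can be sketched with two nested `collections.Counter` objects. The URLs in `toy_urls` are invented for illustration and do not come from the data set.

```python
from collections import Counter

# Hypothetical trimmed URLs, one entry per post
toy_urls = ['a.com/x', 'a.com/x', 'b.org/', 'c.net/p', 'c.net/p', 'c.net/p', 'd.io']

url_counts = Counter(toy_urls)                  # URL -> number of posts linking to it
frequency_count = Counter(url_counts.values())  # number of posts -> number of URLs

# Print in ascending order of posts per URL
for n_posts in sorted(frequency_count):
    print(n_posts, ':', frequency_count[n_posts])
```

Sorting the keys before printing makes the drop-off easier to read than the insertion order of a plain dictionary.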

Next, we will associate full-length URLs with their corresponding number of comments. This may give more insight into how popular the URLs are. It is possible that multiple users linked to the same URL but significantly fewer users were interested in it. Evaluating the number of comments and total points in addition will give us a better understanding of the hype surrounding these links.

In [130]:
url_count = {} # dictionary of full-length URL frequencies
# URL as keys; total points, comments and number of posts linking to same URL as values
for row in hn:
    trimmed = row[2].replace('http://', '').replace('https://', '').replace('www.', '') # remove superfluous prefix
    if trimmed not in url_count and trimmed != '': # exclude posts without URLs
        url_count[trimmed] = [int(row[3]), int(row[4]), 1] # list: [number of points, number of comments, number of posts]
    elif trimmed != '':
        old = url_count[trimmed] # current counts in dictionary
        new = [int(row[3]), int(row[4]), 1] # new counts
        url_count[trimmed] = [sum(x) for x in zip(old, new)] # element-wise list addition of old and new counts

upvotes = [] # list of [upvotes, number of posts, URL] to allow sorting
for element in url_count:
    upvotes.append([url_count[element][0], url_count[element][2], element]) # upvotes, number of links to URL, URL
    
comments = [] # list of [comments, number of posts, URL] to allow sorting
for element in url_count:
    comments.append([url_count[element][1], url_count[element][2], element]) # comments, number of links to URL, URL

print('Upvotes:')
sorted_upvotes = sorted(upvotes, reverse=True) # descending order
for url in sorted_upvotes[:10]:  # top 10 full-length URLs and their corresponding upvotes
    print(url)
print('\nComments: ')
sorted_comments = sorted(comments, reverse=True) # descending order
for url in sorted_comments[:10]:  # top 10 full-length URLs and their corresponding comments
    print(url)

Upvotes:
[5771, 1, 'apple.com/customer-letter/']
[3125, 1, 'bbc.co.uk/news/uk-politics-36615028']
[2553, 1, 'pardonsnowden.org/']
[2049, 1, 'news.microsoft.com/2016/06/13/microsoft-to-acquire-linkedin/#sm.0000pigrxf7dne36z7o2gp4nrghse']
[2049, 1, 'blog.dustinkirkland.com/2016/03/ubuntu-on-windows.html?m=1']
[2011, 1, 'nytimes.com/2016/02/12/science/ligo-gravitational-waves-black-holes-einstein.html']
[1952, 1, 'techcrunch.com/2015/09/16/14-year-old-boy-arrested-for-bringing-homemade-clock-to-school/']
[1876, 1, 'blog.ycombinator.com/basic-income']
[1855, 1, 'bunniestudios.com/blog/?p=4782']
[1851, 1, 'tesla.com/blog/master-plan-part-deux']

Comments: 
[2531, 1, 'bbc.co.uk/news/uk-politics-36615028']
[1733, 1, 'apple.com/iPhone7']
[1448, 1, 'blog.ycombinator.com/moving-forward-on-basic-income']
[1120, 1, 'blog.ycombinator.com/basic-income']
[973, 1, 'businessinsider.com/github-the-full-inside-story-2016-2']
[967, 1, 'apple.com/customer-letter/']
[870, 1, 'techcrunch.com/2015/09/16/14-ye

Interestingly, none of the top 10 URLs by upvotes or by comments was linked by more than one post. Further, with one exception, all posts had 99 upvotes/comments. We will therefore keep only URLs that were linked by at least four posts to reduce some of the noise. The hope here is to find articles that prompted multiple users to post, comment on, and upvote the information found in the link.

In [170]:
upvotes_filtered = []
for entry in sorted_upvotes:
    if entry[1] > 3: # keep only URLs that at least 4 posts link to
        upvotes_filtered.append(entry)
comments_filtered = []
for entry in sorted_comments:
    if entry[1] > 3: # keep only URLs that at least 4 posts link to
        comments_filtered.append(entry)
        
print('Upvotes:')
for url in upvotes_filtered[:10]:  # top 10 full-length URLs and their corresponding upvotes
    print(url)
print('\nComments: ')
for url in comments_filtered[:10]:  # top 10 full-length URLs and their corresponding comments
    print(url)

Upvotes:
[1193, 4, 'github.com/HannahMitt/HomeMirror']
[1002, 4, 'youtube.com/watch?v=0nbkaYsR94c']
[981, 5, 'martinfowler.com/articles/serverless.html']
[923, 5, 'code.facebook.com/posts/1189117404435352/']
[857, 4, 'sci-hub.io/']
[778, 4, 'bouk.co/blog/hacking-developers/']
[763, 5, 'threadbase.com/unravelled']
[676, 5, 'brave.com/']
[644, 5, 'github.com/ptmt/react-native-desktop']
[643, 8, 'udacity.com/course/deep-learning--ud730']

Comments: 
[551, 5, 'brave.com/']
[428, 4, 'theguardian.com/business/2016/jun/08/mcdonalds-community-centers-us-physical-social-networks']
[420, 4, 'youtube.com/watch?v=0nbkaYsR94c']
[358, 5, 'martinfowler.com/articles/serverless.html']
[302, 4, 'theguardian.com/us-news/2016/mar/21/death-by-gentrification-the-killing-that-shamed-san-francisco']
[281, 4, 'blog.dustinkirkland.com/2016/02/zfs-is-fs-for-containers-in-ubuntu-1604.html']
[256, 5, 'github.com/ptmt/react-native-desktop']
[254, 4, 'thenation.com/article/universities-are-becoming-billion-dollar-he

To come up with a single top 10 list, we will search the top 100 of both lists and keep the entries with common URLs. A new metric to evaluate hype will be calculated by multiplying the number of upvotes by the number of comments. The results will be divided by the maximum computed metric for easier interpretation. Multiplication is chosen since, for a given total, the product is largest when the two numbers are similar in scale; it therefore favors URLs that score well on both upvotes and comments.
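
As a worked example of the metric, the sketch below applies it to three URLs whose upvote and comment totals are taken from the filtered outputs above. The dictionary names are assumptions made for illustration. Here the maximum is taken over these three entries only, whereas the analysis takes it over all URLs present in both top 100 lists.

```python
# (upvotes, comments) as printed in the filtered lists above
toy = {
    'youtube.com/watch?v=0nbkaYsR94c': (1002, 420),
    'brave.com/': (676, 551),
    'martinfowler.com/articles/serverless.html': (981, 358),
}

# Hype metric: upvotes * comments, scaled by the largest product
products = {url: up * com for url, (up, com) in toy.items()}
max_product = max(products.values())
hype = {url: p / max_product for url, p in products.items()}

# Print from highest to lowest hype ratio
for url, score in sorted(hype.items(), key=lambda kv: kv[1], reverse=True):
    print('{:.3f}  {}'.format(score, url))
```

The YouTube link has the largest product (1002 × 420 = 420,840) and therefore a ratio of 1.0, while `brave.com/` reaches 372,476 / 420,840 ≈ 0.885.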

In [171]:
results = [] # results with new metric from both comments and upvotes
upvotes_comments = [] # list of upvotes and comments associated to the same URL. Used to calculate maximal metric 

# Find URLs that are represented in both the top 100 number upvotes and top 100 number of comments
for top_upvote in upvotes_filtered[:100]:
    for top_comment in comments_filtered[:100]: # Not efficient but lists are small in size
        if top_upvote[2] == top_comment[2]: # if same URL
            new_metric = top_upvote[0]*top_comment[0] # number of comments * number of upvotes
            results.append([new_metric, top_upvote[1], top_upvote[2]])
            upvotes_comments.append((top_upvote[0],top_comment[0]))

max_metric = max([x*y for x,y in upvotes_comments]) # largest metric; used to compute ratios
final_results = []

# Transform metrics into ratios (0,1]
for result in results:
    ratio = result[0]/max_metric
    final_results.append([ratio, result[1], result[2]])

final_results = sorted(final_results, reverse=True) # sort according to metric
# print top 10 of final results
for result in final_results[:10]:
    print(result)

[1.0, 4, 'youtube.com/watch?v=0nbkaYsR94c']
[0.8850774641193803, 5, 'brave.com/']
[0.8345166809238665, 5, 'martinfowler.com/articles/serverless.html']
[0.5241825872065393, 5, 'code.facebook.com/posts/1189117404435352/']
[0.43986313088109497, 4, 'sci-hub.io/']
[0.4251972245984222, 4, 'bouk.co/blog/hacking-developers/']
[0.3940381142476951, 4, 'github.com/HannahMitt/HomeMirror']
[0.39174983366600136, 5, 'github.com/ptmt/react-native-desktop']
[0.32544434939644523, 4, 'theguardian.com/business/2016/jun/08/mcdonalds-community-centers-us-physical-social-networks']
[0.2701430472388556, 5, 'threadbase.com/unravelled']
