# Exploring Hacker News posts

This project explores which types of posts attract the most interest on Hacker News (https://news.ycombinator.com/). We focus on two kinds of posts: "Ask HN" (posts that ask the community for a recommendation) and "Show HN" (posts that show the community a project someone has developed). We also look for the hour of the day at which it is most convenient to post.

In [1]:
# First, let's take a look at our data
import pandas as pd
file = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')
file.head(6)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
1,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24
2,12578997,What if we just printed a flatscreen televisio...,https://medium.com/vanmoof/our-secrets-out-f21...,1,0,pavel_lishin,9/26/2016 3:19
3,12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-al...,1,0,poindontcare,9/26/2016 3:16
4,12578979,How the Data Vault Enables the Next-Gen Data W...,https://www.talend.com/blog/2016/05/12/talend-...,1,0,markgainor1,9/26/2016 3:14
5,12578975,Saving the Hassle of Shopping,https://blog.menswr.com/2016/09/07/whats-new-w...,1,1,bdoux,9/26/2016 3:13


In [2]:
# To make the data easier to work with, we read the CSV file into a list of lists.
# The file contains characters that the default encoding may fail to decode,
# so we open it explicitly with encoding="utf8".

import csv
# The with block closes the file once it has been read
with open('HN_posts_year_to_Sep_26_2016.csv', encoding="utf8") as opened_file:
    read_file = csv.reader(opened_file)
    hn = list(read_file)
head = hn[0]
hn = hn[1:]

print('head of the csv: ', head)
print('\n')
print('number of rows ', len(hn))

head of the csv:  ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


number of rows  293119


In [3]:
# We only care about the 'Ask HN' and 'Show HN' posts

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()

    # elif keeps Ask HN posts from also falling into other_posts
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Let's see how many posts of each type we have
print('ask', len(ask_posts))
print('show', len(show_posts))
print('other', len(other_posts))

ask 9139
show 10158
other 273822
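
The classification step above can also be sketched as a small helper that returns exactly one label per title, which makes the three groups mutually exclusive by construction. The name `classify_title` and the second sample title are hypothetical, introduced only for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' for a Hacker News post title."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    elif lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Sample titles; the 'Show HN' one is made up for this example
sample_titles = [
    'Ask HN: How do you pass on your work when you die?',
    'Show HN: A tiny static site generator',
    'SQLAR the SQLite Archiver',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Because every title gets a single label, the three counts always add up to the total number of rows.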


In [4]:
# This is what an Ask HN post looks like in our list
print(ask_posts[1])

['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']


In [29]:
# Average number of comments on an 'Ask HN' post

total_ask_comments = 0
for i in ask_posts:
    comments = int(i[4])
    total_ask_comments += comments
    
avg_ask_comments = round(total_ask_comments/len(ask_posts),2)
print('There are: ',avg_ask_comments,'comments per ask post')

# Average number of comments on a 'Show HN' post

total_show_comments = 0
for i in show_posts:
    comments = int(i[4])
    total_show_comments += comments
    
avg_show_comments = round(total_show_comments/len(show_posts),2)
print('There are: ',avg_show_comments,'comments per show post')



There are:  10.39 comments per ask post
There are:  4.89 comments per show post


As we can see, Ask HN posts receive more comments on average than Show HN posts (10.39 vs. 4.89 per post).
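
The two loops above do the same calculation on different lists, so it could be sketched as one reusable helper. The function name `average_comments`, its `comments_index` parameter, and the sample rows are assumptions made for this example; they are not part of the original notebook.

```python
def average_comments(rows, comments_index=4):
    """Mean of the num_comments column over a list of CSV rows.

    Returns 0.0 for an empty list to avoid dividing by zero.
    """
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return round(total / len(rows), 2)

# Made-up rows with the same column layout as the dataset
sample_rows = [
    ['1', 'Ask HN: example A', '', '6', '3', 'user_a', '9/26/2016 1:17'],
    ['2', 'Ask HN: example B', '', '2', '8', 'user_b', '9/26/2016 2:05'],
]
print(average_comments(sample_rows))
```

With the real data, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.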

In [6]:
# Now we check whether the hour at which a post is created relates to how many comments it gets
# First we build a list that isolates the data we need: creation time and number of comments

import datetime as dt
result_list = []

for i in ask_posts:
    created_at = i[6]
    num_comments = int(i[4])
    result_list.append([created_at,num_comments])

print(result_list[1])

['9/26/2016 1:17', 3]
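
As a minimal sketch of how the hour can be pulled out of a `created_at` string like the one printed above, `datetime.strptime` parses the text and the result exposes the hour either as an integer or as a zero-padded string. The variable names here are illustrative only.

```python
import datetime as dt

created_at = '9/26/2016 1:17'  # value taken from the row printed above
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

hour_as_int = parsed.hour            # integer hour of the day
hour_as_str = parsed.strftime('%H')  # zero-padded two-character string
print(hour_as_int, hour_as_str)
```

The zero-padded string form is convenient as a dictionary key because it sorts in chronological order.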


In [12]:
# Now we build two frequency tables.
# The first gives the number of Ask HN posts created in each hour
# The second gives the total number of comments on Ask HN posts created in each hour
counts_by_hour = {}
comments_by_hour = {}
# Format string used to parse the created_at timestamps
template = "%m/%d/%Y %H:%M"

for i in result_list:
    date = i[0]
    comment = int(i[1])
    day_hour_minutes = dt.datetime.strptime(date, template)
    hour = day_hour_minutes.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
        
print('Ask posts by hour: ',counts_by_hour)
print('\n')
print('Comments in ask posts by hour: ',comments_by_hour)

Ask posts by hour:  {'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


Comments in ask posts by hour:  {'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
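
The same two frequency tables could also be built with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is only a sketch under that assumption; `tally_by_hour` and the sample pairs are hypothetical and not taken from the dataset.

```python
from collections import defaultdict
import datetime as dt

TEMPLATE = '%m/%d/%Y %H:%M'

def tally_by_hour(pairs):
    """Count posts and sum comments per hour from [created_at, num_comments] pairs."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, TEMPLATE).strftime('%H')
        counts[hour] += 1               # missing keys start at 0
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# Made-up sample in the same shape as result_list
sample = [['9/26/2016 1:17', 3], ['9/25/2016 1:40', 5], ['9/26/2016 15:02', 10]]
counts, comments = tally_by_hour(sample)
print(counts)
print(comments)
```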


In [30]:
# Now we combine both dictionaries to find the average number of comments per post for each hour
# We store the result as a list of lists so it can be sorted later

avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour]/counts_by_hour[hour],2) ])
    
print(avg_by_hour)

[['02', 11.14], ['01', 7.41], ['22', 8.8], ['21', 8.69], ['19', 7.16], ['17', 9.45], ['15', 28.68], ['14', 9.69], ['13', 16.32], ['11', 8.96], ['10', 10.68], ['09', 6.65], ['07', 7.01], ['03', 7.95], ['23', 6.7], ['20', 8.75], ['16', 7.71], ['08', 9.19], ['00', 7.56], ['18', 7.94], ['12', 12.38], ['04', 9.71], ['06', 6.78], ['05', 8.79]]


In [35]:
# We want to sort by the average number of comments, not by hour
# sorted() compares the first element of each inner list, so we swap the columns

swap_avg_by_hour = []
for i in avg_by_hour:
    swap_avg_by_hour.append([i[1],i[0]])

# Now we can sort; the largest averages come first
    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[28.68, '15'], [16.32, '13'], [12.38, '12'], [11.14, '02'], [10.68, '10'], [9.71, '04'], [9.69, '14'], [9.45, '17'], [9.19, '08'], [8.96, '11'], [8.8, '22'], [8.79, '05'], [8.75, '20'], [8.69, '21'], [7.95, '03'], [7.94, '18'], [7.71, '16'], [7.56, '00'], [7.41, '01'], [7.16, '19'], [7.01, '07'], [6.78, '06'], [6.7, '23'], [6.65, '09']]
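
As an alternative to swapping the columns, `sorted` accepts a `key` function, so the list can be ordered by its second element directly. The sketch below uses a handful of the `[hour, average]` pairs printed earlier; the variable name `top` is an assumption for this example.

```python
# A subset of the [hour, average] pairs computed above
avg_by_hour = [['02', 11.14], ['15', 28.68], ['13', 16.32],
               ['12', 12.38], ['10', 10.68], ['09', 6.65]]

# Sort by the average (index 1), largest first, without swapping columns
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print('{}:00 -> {:.2f} average comments per post'.format(hour, avg))
```

This keeps each pair in its original `[hour, average]` order, so no second list is needed.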


In [50]:
# Time to look at the top of the ranking

print("Top 5 Hours for Ask Posts Comments: ")
for avg, hour in sorted_swap[:5]:
    print('{}: {} average comments per post'.format(dt.datetime.strptime(hour,'%H').strftime('%H:%M'), avg))

Top 5 Hours for Ask Posts Comments: 
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


### Conclusion
If we want an Ask HN post to attract as many comments as possible, the best time to publish it is 15:00: posts created in that hour
receive 28.68 comments on average, the highest of any hour. 13:00 (16.32) and 12:00 (12.38) are the next best options, but their averages are clearly lower than the top hour's.