# Exploring Hacker News Posts

This project analyzes a data set of submissions to Hacker News, a popular technology site.

We are interested in posts whose titles begin with either Ask HN or Show HN. We'll compare these two types of posts to determine:
1. Which type receives more comments on average.
2. Whether posts created at a specific time receive more comments on average.

Let's first open the data set (a CSV file) and read it into a list of lists, keeping the header row separate:

In [1]:
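from csv import reader

# Read every row into a list; the with-block closes the file afterwards.
with open('hacker_news.csv', newline='') as opened_file:
    hn=list(reader(opened_file))

hn_header=hn[0]  # first row holds the column names
hn=hn[1:]        # remaining rows are the posts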

In [2]:
print(hn_header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
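
The loading step can be tried without the real file. The sketch below is only an illustration: `sample_csv` is a made-up in-memory stand-in for `hacker_news.csv`, holding the header and a single row from the data set. It shows that `csv.reader` returns every value as a string, which is why the comment counts are converted with `int()` later on.

```python
import io
from csv import reader

# Hypothetical stand-in for hacker_news.csv: the header plus one row.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55\n"
)

rows = list(reader(sample_csv))
sample_header = rows[0]   # column names
sample_posts = rows[1:]   # data rows, every value still a string

print(sample_header)
print(sample_posts[0][4])  # num_comments is the string '6', not the integer 6
```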


In [3]:
ask_posts=[]
show_posts=[]
other_posts=[]
for row in hn:
    title=row[1]
    title=title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
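
The same title check can also be written as a small function, which makes it easy to test on individual titles. This is a sketch, not part of the original analysis; `classify_title` is a hypothetical helper name.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # make the comparison case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(classify_title('Ask HN: How to improve my personal website?'))  # ask
print(classify_title('SHOW HN: Something pointless I made'))          # show
print(classify_title('Askew lines in CSS'))                           # other
```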


In [4]:
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


In [5]:
print(show_posts[:5])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


In [6]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


There are many more `other_posts` than ask or show posts. We'll now find the average number of comments on ask posts and on show posts.

In [7]:
total_ask_comments=0

for row in ask_posts:
    num_com=row[4]
    num_com=int(num_com)
    total_ask_comments+=num_com

avg_ask_comments=total_ask_comments/len(ask_posts)
print(avg_ask_comments)

total_show_comments=0

for row in show_posts:
    num_com=row[4]
    num_com=int(num_com)
    total_show_comments+=num_com

avg_show_comments=total_show_comments/len(show_posts)
print(avg_show_comments)


14.038417431192661
10.31669535283993
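
The two loops above do the same work on different lists, so the calculation could be factored into one function. The sketch below assumes the comment count sits at index 4, as in the header shown earlier; `average_comments` and `sample_ask` are hypothetical names, and the two rows are copied from the ask posts printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column over a list of rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two ask posts copied from the sample output earlier in the notebook.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]

print(average_comments(sample_ask))  # (6 + 29) / 2 = 17.5
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` on the full lists should reproduce the two averages printed above.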


It looks like Ask HN posts receive about 4 more comments on average than Show HN posts (roughly 14.0 versus 10.3). This could be due to the nature of ask posts: the poster explicitly requests a response from HN users, so the question is likely to receive at least some answers. Show HN posts, on the other hand, don't require a response from users; they are there just to show some information.

Next, we'll determine whether ask posts created at a certain hour of the day are more likely to attract comments.

In [8]:
import datetime as dt

result_list=[]
for row in ask_posts:
    created_at=row[6]
    num_com=int(row[4])
    entry=[created_at, num_com]
    result_list.append(entry)
    

In [9]:
print(result_list[:4])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3]]


In [10]:
counts_by_hour={}
comments_by_hour={}

for each in result_list:
    hour_str=each[0]  # full timestamp string, e.g. '8/16/2016 9:55'
    hour_dt=dt.datetime.strptime(hour_str, '%m/%d/%Y %H:%M')  # parse the string into a datetime
    hour=hour_dt.strftime('%H')  # keep only the zero-padded hour, e.g. '09'
    
    if hour not in counts_by_hour:
        counts_by_hour[hour]=1
        comments_by_hour[hour]=each[1]
    else:
        counts_by_hour[hour]+=1
        comments_by_hour[hour]+=each[1]
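
To see what the parsing step does to a single value, here is a minimal sketch using one timestamp from the data. It also shows `datetime.hour`, an alternative (not used in this notebook) that gives the hour as an integer instead of a zero-padded string.

```python
import datetime as dt

created_at = '8/16/2016 9:55'
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

print(parsed.strftime('%H'))  # '09': zero-padded string, the form used for the dictionary keys
print(parsed.hour)            # 9: the same hour as an integer
```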

In [11]:
print(counts_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


In [12]:
print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
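
Raw totals favor hours with many posts, so a natural next step is to divide each hour's comment total by its post count. The sketch below is one possible way to do that, not a result from the original analysis. To stay short it uses only three hours copied from the dictionaries printed above; `counts_sample`, `comments_sample`, and `avg_by_hour` are hypothetical names.

```python
# Three hourly totals copied from the printed dictionaries.
counts_sample = {'09': 45, '15': 116, '02': 58}
comments_sample = {'09': 251, '15': 4477, '02': 1381}

# Average comments per post for each hour.
avg_by_hour = [
    [hour, comments_sample[hour] / counts_sample[hour]]
    for hour in counts_sample
]

# Sort by the average (second element), largest first.
avg_by_hour.sort(key=lambda pair: pair[1], reverse=True)

for hour, avg in avg_by_hour:
    print(f"{hour}:00  {avg:.2f} average comments per post")
```

Replacing the two sample dictionaries with `counts_by_hour` and `comments_by_hour` would rank all 24 hours the same way.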
