In this project, I will be analyzing posts from Hacker News. Hacker News is a news website, similar to Reddit, that is popular in technology and startup circles.

The RAW data set can be found here: https://www.kaggle.com/hacker-news/hacker-news-posts

The CLEANED data set that I will be using can be found here: https://github.com/RobertChaseSommer/HackerNews/blob/master/hacker_news.csv

This dataset was cleaned by DataQuest. The RAW data set had ~300k rows and the CLEANED data set has ~20k (the group counts printed further down add up to 20,100 data rows). Posts without any votes or comments were removed.


Below are the descriptions of the columns:
* id - The unique identifier from Hacker News for the post
* title - The title of the post
* url - The URL that the post links to, if the post has a URL
* num_points - The net score of the post: posts can be upvoted or downvoted, and this column is upvotes minus downvotes
* num_comments - The number of comments on the post
* author - The username of the person who submitted the post
* created_at - The date the post was created
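
As a small illustration of how these columns map onto a row, here is a sketch that reads one row by column name using `csv.DictReader`. The in-memory `sample_csv` is a hypothetical stand-in for the real file (it holds the header and a single row taken from the data set), and this is not the approach used in the rest of the notebook, which works with plain lists and positional indexes.

```python
import csv
import io

# Hypothetical in-memory sample: the header plus one row taken from the data set.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader uses the first line as field names, so each row becomes a dict.
rows = list(csv.DictReader(sample_csv))
first = rows[0]

# Every value is read as a string; numeric columns need an explicit int().
print(first['title'], int(first['num_points']), int(first['num_comments']))
```

Note that every field arrives as a string, which is why the later cells wrap `num_comments` in `int()` before adding it up.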

We will be looking at posts that begin with Ask HN or Show HN.
* Ask HN is a post asking Hacker News users a question.
* Show HN is a post showing Hacker News users something interesting.

We will answer these questions:
* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at certain times receive more comments on average than others?


In [1]:
from csv import reader

# Read the whole file into a list of rows; the with-block closes the file afterwards.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

for row in hn[0:5]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




Below, we remove the header row so that only data rows remain for the analysis.

Be careful to run the code below only once: `hn = hn[1:]` drops the first row of `hn`, so each additional run of this cell removes one more row of data.

In [2]:
header = hn[0] 
hn = hn[1:]     

for row in hn[0:4]:
    print(row)
    print('\n')

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
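
The "run it only once" caveat for the header cell can be avoided by checking whether a header is still present before slicing. The sketch below is one possible way to do that; `split_header` and `hn_sample` are hypothetical names introduced here for illustration, and the check assumes the header's first column is `'id'`, as in this data set.

```python
def split_header(rows, first_column='id'):
    """Return (header, data_rows); return (None, rows) if no header is present."""
    if rows and rows[0] and rows[0][0] == first_column:
        return rows[0], rows[1:]
    return None, rows


# Hypothetical miniature version of hn: the header plus two data rows.
hn_sample = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]

header_sample, data = split_header(hn_sample)
_, data_again = split_header(data)  # second call finds no header and changes nothing

print(len(data), len(data_again))
```

Because the second call leaves the rows untouched, rerunning a cell written this way would not silently drop data.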




Below, I split the posts into three groups:
1. Ask HN posts
2. Show HN posts
3. Other HN posts

In [9]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    correct_title = row[1]
    correct_title = correct_title.lower()
    
    if correct_title.startswith('ask hn'):
        ask_posts.append(row)
    elif correct_title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [10]:
for row in ask_posts[:3]:
    print(row)

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


In [11]:
for row in show_posts[:3]:
    print(row)

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']


Now we will find out whether ask or show posts receive more comments from other users.

In [39]:
total_ask_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
    
print('Total Ask HN comments: ' + str(total_ask_comments))
print('Number of Ask HN posts: ' + str(len(ask_posts)))

avg_ask_comments = (total_ask_comments/len(ask_posts))

print('Average comments per Ask HN post: ' + str(round(avg_ask_comments,2)))


Total Ask HN comments: 24483
Number of Ask HN posts: 1744
Average comments per Ask HN post: 14.04


In [40]:
total_show_comments = 0

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
    
print('Total Show HN comments: ' + str(total_show_comments))
print('Number of Show HN posts: ' + str(len(show_posts)))

avg_show_comments = (total_show_comments/len(show_posts))

print('Average comments per Show HN post: ' + str(round(avg_show_comments,2)))

Total Show HN comments: 11988
Number of Show HN posts: 1162
Average comments per Show HN post: 10.32


As we can see, Ask HN posts receive about 3.7 more comments per post on average than Show HN posts (14.04 versus 10.32). This suggests that people tend to respond more when they are asked a question than when they are simply shown something.
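
The two averaging cells above repeat the same steps, so they could be folded into one helper. The sketch below is an assumption about how that might look rather than the code used in this notebook. `average_comments` and `tiny_sample` are hypothetical names, and the second half simply redoes the arithmetic from the totals printed above.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments for a list of rows; comment counts are stored as strings."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows carrying only the comment counts (6, 29, 1) of the first three Ask HN posts.
tiny_sample = [['', '', '', '', '6'], ['', '', '', '', '29'], ['', '', '', '', '1']]
print(average_comments(tiny_sample))

# Recompute the averages from the totals and post counts printed above.
avg_ask = 24483 / 1744
avg_show = 11988 / 1162
print(round(avg_ask, 2), round(avg_show, 2), round(avg_ask - avg_show, 2))
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two separate loops.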

Do you think this analysis supports the Socratic Method of teaching?

Next, we will check whether Ask HN posts created at certain hours of the day receive more comments on average.

In [51]:
import datetime as dt

result_list = []

for row in ask_posts:
    posted_at = row[6]
    comments = int(row[4])
    a_list = [posted_at, comments]
    result_list.append(a_list)
    
counts_by_hour = {}
comments_by_hour = {}
    
for row in result_list:
    dt_hour = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = dt_hour.strftime('%H')
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print('Counts by the Hour')
print(counts_by_hour)
print('\n')
print('Comments by the Hour')
print(comments_by_hour)
    
    
    

Counts by the Hour
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


Comments by the Hour
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
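
To make the hour extraction in the cell above easier to follow, here is a minimal sketch of what `strptime` and `strftime` do with two `created_at` values from the data set. The variable names are chosen for illustration only.

```python
import datetime as dt

# Two created_at strings from the data set: month/day/year, then 24-hour time.
late_morning = dt.datetime.strptime('8/4/2016 11:52', '%m/%d/%Y %H:%M')
midnight = dt.datetime.strptime('6/17/2016 0:01', '%m/%d/%Y %H:%M')

# .hour gives an integer; strftime('%H') gives the zero-padded string
# that the cell above uses as its dictionary key.
print(late_morning.hour, late_morning.strftime('%H'))
print(midnight.hour, midnight.strftime('%H'))
```

The zero-padded form is why the dictionary keys look like `'09'` and `'00'` rather than `9` and `0`.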


In [55]:
avg_by_hour = []

for row in comments_by_hour:
    avg_by_hour.append([row, round(comments_by_hour[row]/counts_by_hour[row],2)])
    
print(avg_by_hour)

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]


In [58]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
print(swap_avg_by_hour)
    

[[5.58, '09'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [16.8, '16'], [7.99, '23'], [9.41, '12'], [11.46, '17'], [38.59, '15'], [16.01, '21'], [21.52, '20'], [23.81, '02'], [13.2, '18'], [7.8, '03'], [10.09, '05'], [10.8, '19'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [7.17, '04'], [8.13, '00'], [9.02, '06'], [7.85, '07'], [11.05, '11']]


In [60]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[38.59, '15'], [23.81, '02'], [21.52, '20'], [16.8, '16'], [16.01, '21'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [13.2, '18'], [11.46, '17'], [11.38, '01'], [11.05, '11'], [10.8, '19'], [10.25, '08'], [10.09, '05'], [9.41, '12'], [9.02, '06'], [8.13, '00'], [7.99, '23'], [7.85, '07'], [7.8, '03'], [7.17, '04'], [6.75, '22'], [5.58, '09']]
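
The swap-then-sort steps above work because `sorted` compares lists by their first element. An alternative sketch, not the method used in this notebook, is to sort on the average directly with a `key` function. `avg_sample` is a hypothetical dictionary holding a few of the averages printed above.

```python
# A few averages copied from the output above (hour -> average comments per post).
avg_sample = {'09': 5.58, '15': 38.59, '02': 23.81, '20': 21.52, '16': 16.8}

# Sort the (hour, average) pairs by average, highest first, without building a swapped list.
ranked = sorted(avg_sample.items(), key=lambda item: item[1], reverse=True)

for hour, avg in ranked[:3]:
    print('{}:00 {:.2f}'.format(hour, avg))
```

Sorting with a key keeps each pair in its original (hour, average) order, so nothing has to be swapped back when printing.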


In [None]:
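for row in sorted_swap[:4]:
    # Each row is [average, hour]; print the hour first, then the average.
    print('{1}:00: {0:.2f} average comments per post'.format(row[0], row[1]))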