# Hacker News Posts-Analysis

By: Ernest Mack

### In this Jupyter notebook we will work with strings, object-oriented programming, and dates and times using Python objects.

Along the way, we will analyze activity on the Hacker News platform and draw out some interesting findings.

In [1]:
from csv import reader
opened_file = open(r'hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
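
As a small variation on the cell above, the file can be read inside a `with` block so that it is closed automatically. This is only a sketch under stated assumptions: the helper name `load_rows` is hypothetical, the in-memory sample merely mimics the column layout of `hacker_news.csv`, and the `utf-8` encoding in the commented usage is a guess about the file.

```python
import io
from csv import reader

def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]

# Hypothetical one-row sample standing in for hacker_news.csv.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
header, rows = load_rows(sample)
print(header)
print(len(rows))

# With the real file, the same helper could be used like this
# (the encoding is an assumption):
# with open('hacker_news.csv', newline='', encoding='utf-8') as f:
#     header, rows = load_rows(f)
```

Returning the header separately keeps the column names out of the data rows, so later loops do not have to skip the first row.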

### Assign the header row to a variable, then display each column name.

In [2]:
headers = hn[0] # the first row holds the column names
for column_name in headers:
    print(column_name)

id
title
url
num_points
num_comments
author
created_at


## Now print only the first five rows of actual data.

In [3]:
for row in hn[1:6]: # use slicing to print only the first five rows
    print(row)    

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


## Now filter the data to keep only the posts whose titles begin with 'Ask HN' or 'Show HN'.

### See: https://docs.python.org/3/library/stdtypes.html?highlight=startswith#str.startswith
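
To illustrate the idea before applying it to the full dataset, here is a minimal sketch of case-insensitive prefix matching with `str.startswith`. The function name `classify_title` and the three-item `titles` list are hypothetical and exist only for this example.

```python
def classify_title(title):
    """Label a title as 'ask', 'show', or 'other' by its prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical sample titles for illustration.
titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in titles]
print(labels)

# startswith also accepts a tuple of prefixes, which is handy when
# only "is it either kind?" matters.
is_ask_or_show = 'show hn: demo'.startswith(('ask hn', 'show hn'))
print(is_ask_or_show)
```

Lowercasing the title first means that variants such as 'ASK HN' or 'ask hn' are matched as well.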

In [4]:
# let's create some empty lists for use in isolating the entries
# we want
ask_posts = []
show_posts = []
other_posts = []

for row in hn[1:]: # iterate through the data rows, skipping the header
    title = row[1] # assign the title in each row to the variable title
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(title) # only the title is kept for other posts


In [5]:
show_posts[0:5]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11'],
 ['10872799',
  'Show HN: GeoScreenshot  Easily test Geo-IP based web pages',
  'https://www.geoscreenshot.com/',
  '1',
  '9',
  'kpsychwave',
  '1/9/2016 20:45']]

In [6]:
ask_posts[0:5]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

In [7]:
other_posts[0:5]

['Interactive Dynamic Video',
 'How to Use Open Source and Shut the Fuck Up at the Same Time',
 "Florida DJs May Face Felony for April Fools' Water Joke",
 'Technology ventures: From Idea to Enterprise',
 'Note by Note: The Making of Steinway L1037 (2007)']

# Find the average number of comments per post.

In [8]:
total_ask_comments = 0
total_show_comments = 0

# sum the comments (column index 4) across all ask posts
for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)

# repeat the same calculation for the show posts
for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = total_show_comments / len(show_posts)


In [9]:
avg_ask_comments

14.038417431192661

In [10]:
avg_show_comments

10.31669535283993

### From the analysis above, "Ask HN" posts receive more comments on average (about 14) than "Show HN" posts (about 10). This indicates that people respond more to Ask HN posts.

Since ask posts are more likely to receive comments, we'll focus our remaining analysis on just these posts.
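
The averaging step used for this comparison can also be wrapped in a small reusable function. The sketch below is illustrative only: `average_comments` is a hypothetical helper, and `sample_ask` contains just the first two ask posts shown earlier, so its result is not the dataset-wide average.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Two rows copied from the ask_posts preview, used as a tiny sample.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020',
     'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
sample_avg = average_comments(sample_ask)
print(sample_avg)  # (6 + 29) / 2 = 17.5
```

Guarding against an empty list avoids a `ZeroDivisionError` if a filter ever returns no posts.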

### Next, we'll determine if ask posts created at a certain time are more likely to attract comments.

We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.
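
The two steps above can be sketched as follows. This is a minimal illustration rather than the notebook's own cell: it swaps in `collections.defaultdict` to avoid the explicit "is the key already present?" check, `tally_by_hour` is a hypothetical helper name, and the sample rows keep only the two columns that matter here (the last row is made up so that one hour has two posts).

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"  # matches values such as '8/16/2016 9:55'

def tally_by_hour(posts, date_index=6, comments_index=4):
    """Return (counts, comments) dictionaries keyed by two-digit hour strings."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for post in posts:
        created = dt.datetime.strptime(post[date_index], DATE_FORMAT)
        hour = created.strftime("%H")
        counts[hour] += 1
        comments[hour] += int(post[comments_index])
    return dict(counts), dict(comments)

# Unused columns are left blank; the third row is hypothetical.
sample_posts = [
    ['', '', '', '', '29', '', '11/22/2015 13:43'],
    ['', '', '', '', '3', '', '8/2/2016 14:20'],
    ['', '', '', '', '1', '', '1/5/2016 13:05'],
]
counts, comments = tally_by_hour(sample_posts)

# Step 2: average comments per post for each hour.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```

Keeping the counts and the comment totals in separate dictionaries mirrors the two quantities named in step 1, and the average in step 2 is then a single division per hour.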


In [11]:
# Calculate the number of ask posts created during each hour of the day and the number of comments received.
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])]
    )

comments_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_list = list(comments_by_hour)
comments_list[:5]
   

['09', '13', '10', '14', '16']

In [12]:
result_list[1:7]

[['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17],
 ['9/26/2015 23:23', 1],
 ['4/22/2016 12:24', 4]]

### Now we will calculate the average number of comments per post for each hour

In [13]:
avg_by_hour = [] # will hold [hour, average comments per post] pairs

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

### Now we will sort the list by average comments in descending order, then print the five highest values from the sorted list

In [14]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour,'\n')

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap


[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']] 



[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
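
As an alternative to swapping the columns before sorting, `sorted` accepts a `key` function that selects the value to sort on. The sketch below is illustrative only and uses four of the `[hour, average]` pairs from the output above; the name `top_hours` is hypothetical.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the average (index 1), highest first, without reordering columns.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f"{hour}:00 {avg:.2f} average comments per post")
```

Sorting with `key` leaves each pair in its original `[hour, average]` order, so no intermediate swapped list is needed.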

In [15]:
formatted_sorted = []
for item in sorted_swap:
    formatted_sorted.append([item[0], [item[1]]]) # keep the average and wrap the hour in its own list
    
formatted_sorted    

[[38.5948275862069, ['15']],
 [23.810344827586206, ['02']],
 [21.525, ['20']],
 [16.796296296296298, ['16']],
 [16.009174311926607, ['21']],
 [14.741176470588234, ['13']],
 [13.440677966101696, ['10']],
 [13.233644859813085, ['14']],
 [13.20183486238532, ['18']],
 [11.46, ['17']],
 [11.383333333333333, ['01']],
 [11.051724137931034, ['11']],
 [10.8, ['19']],
 [10.25, ['08']],
 [10.08695652173913, ['05']],
 [9.41095890410959, ['12']],
 [9.022727272727273, ['06']],
 [8.127272727272727, ['00']],
 [7.985294117647059, ['23']],
 [7.852941176470588, ['07']],
 [7.796296296296297, ['03']],
 [7.170212765957447, ['04']],
 [6.746478873239437, ['22']],
 [5.5777777777777775, ['09']]]

In [16]:
# Print the 5 hours with the highest average comments (sorted_swap is already sorted).

print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


# End of notebook project

In this project I read through Hacker News posts and analyzed which post types receive the most comments and the hours at which Ask HN posts attract the most comments on average, writing the code to make it work. This mostly meant working with lists and lists of lists, looping through them and slicing as needed to get the analysis I wanted. I used the documentation at Python.org and some Stack Overflow queries as well. I refrained from using pandas or any other Anaconda packages, as the exercise is in plain Python in an effort to build those skills specifically.