#  HACKER NEWS PROJECT

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, much like on Reddit.

It is an extremely popular site in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result. 

Users submit `Ask HN` posts to ask the Hacker News community a specific question or `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.

This project analyses posts whose titles begin with either `Ask HN` or `Show HN`. The aim is to:

* Determine if `Ask HN` posts or `Show HN` posts receive more comments on average.
* Determine if posts created at a certain time receive more comments on average.

## Introduction

In [1]:
from csv import reader

# Read the whole file into a list of rows; the with block closes the file afterwards.
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:
headers = hn[0]
print(headers)
hn = hn[1:]
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
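
The header row can also be used to look up columns by name instead of by position. The following is a minimal sketch, not part of the original notebook: it assumes an in-memory stand-in for `hacker_news.csv` containing only the header and the first data row shown above, and the names `sample_csv` and `col` are hypothetical.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv: the header plus the first data row.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = list(csv.reader(sample_csv))
headers, data = rows[0], rows[1:]

# Map each column name to its position, so row[col["num_comments"]]
# can be used in place of the bare index row[4].
col = {name: i for i, name in enumerate(headers)}

print(data[0][col["num_comments"]])  # 52
```

A lookup like this makes later cells easier to read, at the cost of one extra dictionary.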

## Extracting the total number of different posts.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


* Using a `for` loop with the `lower` and `startswith` string methods, each post was placed into one of three separate lists, depending on whether its title starts with `ask hn`, `show hn`, or neither.
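
The prefix check can be tried on its own with a handful of titles. This is a small sketch under stated assumptions: the `classify` helper and the sample titles (apart from the third, which appears in the dataset preview) are hypothetical and only illustrate the case-insensitive matching.

```python
# Hypothetical sample titles; the notebook itself reads titles from hacker_news.csv.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "show hn: My weekend project",
    "Interactive Dynamic Video",
]


def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


labels = [classify(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```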

## Calculating the average number of comments for Ask posts and Show posts

In [4]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print("{:.2f}".format(avg_ask_comments))


14.04


In [5]:
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print("{:.2f}".format(avg_show_comments))



10.32


Comparing the two averages shows that `Ask HN` posts receive more comments on average (14.04) than `Show HN` posts (10.32). The rest of the analysis therefore focuses on the `ask_posts` list.
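
Since the two cells above differ only in the list they loop over, the calculation could be wrapped in one function. The sketch below is an illustration rather than the notebook's own code: `average_comments` and the sample rows are hypothetical, and the rows simply follow the dataset's column order.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows laid out like the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: A?", "", "5", "6", "alice", "8/16/2016 9:55"],
    ["2", "Ask HN: B?", "", "7", "29", "bob", "11/22/2015 13:43"],
    ["3", "Ask HN: C?", "", "2", "1", "carol", "5/2/2016 10:14"],
]

print("{:.2f}".format(average_comments(sample_posts)))  # 12.00
```

With such a helper, the comparison reduces to calling it once with `ask_posts` and once with `show_posts`.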

## Calculating the number of posts and comments for Ask Posts per hour.

In [6]:
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])]
    )
result_list[:3]

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

In [7]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_ = row[0]
    comment_no = row[1]    
    # Parse the timestamp, then keep only the zero-padded hour (e.g. '09').
    d_time = dt.datetime.strptime(date_, "%m/%d/%Y %H:%M")
    hr = d_time.strftime("%H")
    if hr in counts_by_hour:
        counts_by_hour[hr] += 1
        comments_by_hour[hr] += comment_no  
    else:
        counts_by_hour[hr] = 1
        comments_by_hour[hr] = comment_no
        
counts_by_hour, comments_by_hour

({'00': 55,
  '01': 60,
  '02': 58,
  '03': 54,
  '04': 47,
  '05': 46,
  '06': 44,
  '07': 34,
  '08': 48,
  '09': 45,
  '10': 59,
  '11': 58,
  '12': 73,
  '13': 85,
  '14': 107,
  '15': 116,
  '16': 108,
  '17': 100,
  '18': 109,
  '19': 110,
  '20': 80,
  '21': 109,
  '22': 71,
  '23': 68},
 {'00': 447,
  '01': 683,
  '02': 1381,
  '03': 421,
  '04': 337,
  '05': 464,
  '06': 397,
  '07': 267,
  '08': 492,
  '09': 251,
  '10': 793,
  '11': 641,
  '12': 687,
  '13': 1253,
  '14': 1416,
  '15': 4477,
  '16': 1814,
  '17': 1146,
  '18': 1439,
  '19': 1188,
  '20': 1722,
  '21': 1745,
  '22': 479,
  '23': 543})
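
The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`. This is a sketch of an alternative, not the approach used in the cell above. The first three sample pairs are copied from `result_list`; the fourth is a hypothetical entry added to show two posts landing in the same hour.

```python
import datetime as dt
from collections import defaultdict

sample = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["1/1/2016 9:05", 4],  # hypothetical extra post in the 09 hour
]

# Missing keys start at 0, so no explicit "first time seen" branch is needed.
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'09': 2, '13': 1, '10': 1}
print(dict(comments))  # {'09': 10, '13': 29, '10': 1}
```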

## Calculating the average number of comments for Ask Posts per hour.

In [8]:
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour

[['16', 16.796296296296298],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['17', 11.46],
 ['19', 10.8],
 ['10', 13.440677966101696],
 ['23', 7.985294117647059],
 ['00', 8.127272727272727],
 ['13', 14.741176470588234],
 ['08', 10.25],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['20', 21.525],
 ['21', 16.009174311926607],
 ['11', 11.051724137931034],
 ['14', 13.233644859813085],
 ['04', 7.170212765957447],
 ['06', 9.022727272727273],
 ['01', 11.383333333333333],
 ['09', 5.5777777777777775],
 ['15', 38.5948275862069],
 ['07', 7.852941176470588],
 ['12', 9.41095890410959],
 ['22', 6.746478873239437]]

## Sorting the hourly averages.

In [9]:
swap_avg_by_hour = []

for rows in avg_by_hour:
    swap_avg_by_hour.append([rows[1], rows[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[16.796296296296298, '16'], [7.796296296296297, '03'], [10.08695652173913, '05'], [11.46, '17'], [10.8, '19'], [13.440677966101696, '10'], [7.985294117647059, '23'], [8.127272727272727, '00'], [14.741176470588234, '13'], [10.25, '08'], [23.810344827586206, '02'], [13.20183486238532, '18'], [21.525, '20'], [16.009174311926607, '21'], [11.051724137931034, '11'], [13.233644859813085, '14'], [7.170212765957447, '04'], [9.022727272727273, '06'], [11.383333333333333, '01'], [5.5777777777777775, '09'], [38.5948275862069, '15'], [7.852941176470588, '07'], [9.41095890410959, '12'], [6.746478873239437, '22']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
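
Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted`, which leaves each `[hour, average]` pair in its original order. The sketch below assumes a four-item sample with values rounded from `avg_by_hour`; the name `avg_by_hour_sample` is hypothetical.

```python
# A few [hour, average] pairs, rounded from the avg_by_hour output above.
avg_by_hour_sample = [["16", 16.80], ["02", 23.81], ["15", 38.59], ["09", 5.58]]

# Sort on the second element (the average), highest first.
sorted_sample = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_sample:
    print("{}:00 -> {:.2f}".format(hour, avg))
# 15:00 -> 38.59
# 02:00 -> 23.81
# 16:00 -> 16.80
# 09:00 -> 5.58
```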

In [10]:
import datetime as dt

print("Top 5 Hours for Ask Posts Comments.")
for ave,hr in sorted_swap[:5]:
    avg_com = "{}: {:.2f} average comments per post.".format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"), ave)
    print(avg_com)

Top 5 Hours for Ask Posts Comments.
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


The `15:00` hour receives the most comments per post, with an average of 38.59 comments, followed by the `02:00` hour with an average of 23.81 comments.

According to the dataset [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the timestamps are in US Eastern Time (ET). 15:00 ET is equivalent to 20:00 West Africa Time (WAT) while US daylight saving time is in effect, and to 21:00 WAT otherwise.
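
The conversion between the two time zones can be checked with the standard `datetime` module. This sketch relies on fixed UTC offsets as assumptions (UTC-4 for Eastern Daylight Time, UTC-5 for Eastern Standard Time, UTC+1 for West Africa Time) rather than a time zone database, and the `to_wat` helper is hypothetical.

```python
import datetime as dt

# Fixed offsets, stated as assumptions rather than read from a tz database.
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")  # US Eastern, daylight time
EST = dt.timezone(dt.timedelta(hours=-5), "EST")  # US Eastern, standard time
WAT = dt.timezone(dt.timedelta(hours=1), "WAT")   # West Africa Time


def to_wat(hour, eastern_tz):
    """Convert an Eastern hour to a WAT 'HH:MM' string."""
    # The date is arbitrary; with fixed offsets only the hour matters.
    eastern = dt.datetime(2016, 1, 1, hour, 0, tzinfo=eastern_tz)
    return eastern.astimezone(WAT).strftime("%H:%M")


print(to_wat(15, EDT))  # 20:00
print(to_wat(15, EST))  # 21:00
```

The result depends on whether a given post was created during daylight saving time, so a single WAT equivalent holds only for part of the year.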

## Conclusion.

After analysing the `hacker_news.csv` dataset against the goals of this project, `Ask HN` posts are seen to receive more comments on average than `Show HN` posts. Among `Ask HN` posts, those created during the 15:00 ET hour (20:00 WAT during US daylight saving time) received the most comments, with an average of 38.59 comments per post.