# Hacker News Posts

In this project, we'll compare two types of posts from [Hacker News](https://news.ycombinator.com/), a well-known website where technology articles are voted on and discussed. The two types are posts whose titles start with ```Ask HN``` or ```Show HN```.

Users submit ```Ask HN``` posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Similarly, users submit ```Show HN``` posts to showcase projects, services, or simply interesting things to the community.

We'll specifically compare these two types of posts to determine the following:

-  Do ```Ask HN``` or ```Show HN``` posts receive more comments on average?
-  Do posts created at a certain time receive more comments on average?

The data set we're using was reduced from around 300,000 rows to roughly 20,000 rows by eliminating all entries that received no comments and then randomly selecting from the remaining ones.
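
The reduction described above can be sketched roughly as follows. This is only an illustration, not the procedure actually used to build the file: the helper name `reduce_rows`, the sample size, the seed, and the toy rows are all assumptions, and the sketch presumes the comment count sits at index 4, as it does in the reduced file.

```python
import random

def reduce_rows(rows, sample_size, seed=0):
    """Hypothetical helper: drop rows with zero comments, then sample from the rest."""
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(commented, min(sample_size, len(commented)))

# Toy rows (made up for illustration) in the same column order as the data set:
# id, title, url, num_points, num_comments, author, created_at
toy_rows = [
    ['1', 'Ask HN: Example A?', '', '3', '0', 'user_a', '1/1/2016 0:00'],
    ['2', 'Show HN: Example B', '', '5', '4', 'user_b', '1/2/2016 9:30'],
    ['3', 'Example C', '', '1', '0', 'user_c', '1/3/2016 12:15'],
    ['4', 'Ask HN: Example D?', '', '9', '8', 'user_d', '1/4/2016 15:45'],
]

reduced = reduce_rows(toy_rows, sample_size=2)
print(len(reduced))
```

Because posts without comments are filtered out before sampling, every average computed later describes only posts that received at least one comment.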

## Introduction

Let's look at the data first, then take the headers out.

In [1]:
import csv

# Read every row into a list of lists; the `with` block closes the file afterwards.
with open('hacker_news.csv') as f:
    hn = list(csv.reader(f))
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

## Removing Headers

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


As the output above shows, the data set includes each post's title, its number of comments, and the time it was created. To compare comment counts by post type, we first need to separate the posts into categories.

## Extracting Ask HN and Show HN Posts

In [3]:
postsAsk = []
postsShow = []
postsOther = []

for eachPost in hn:
    title = eachPost[1]
    if title.lower().startswith('ask hn'):
        postsAsk.append(eachPost)
    elif title.lower().startswith('show hn'):
        postsShow.append(eachPost)
    else:
        postsOther.append(eachPost)

print(len(postsAsk))
print(len(postsShow))
print(len(postsOther))

1744
1162
17194


## Average Number of Comments for Ask HN and Show HN Posts

Now that the ```Ask HN``` and ```Show HN``` posts are separated into distinct lists, let's determine the average number of comments each type of post receives.

In [4]:
# avg number of comments `Ask HN` posts receive
commentsAsk = 0

for eachPost in postsAsk:
    commentsAsk += int(eachPost[4])
    
commentsAsk_avg = commentsAsk / len(postsAsk)
print(commentsAsk_avg)

14.038417431192661


In [5]:
# avg number of comments `Show HN` posts receive
commentsShow = 0

for eachPost in postsShow:
    commentsShow += int(eachPost[4])

commentsShow_avg = commentsShow / len(postsShow)
print(commentsShow_avg)

10.31669535283993


On average, ```Ask``` posts receive about 14 comments, whereas ```Show``` posts receive about 10. Since ```Ask``` posts attract more comments on average, we'll limit the rest of our analysis to them.
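
The two averaging cells above repeat the same loop, so the calculation could be factored into a small function. The sketch below is one possible way to do that; the name `average_comments` and the toy rows are assumptions made for illustration and are not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count; counts are stored as strings at comments_index."""
    if not posts:
        raise ValueError("cannot average an empty list of posts")
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Toy rows (made up) with the comment count at index 4, as in the data set.
toy_ask = [
    ['1', 'Ask HN: Example A?', '', '3', '4', 'user_a', '1/1/2016 0:00'],
    ['2', 'Ask HN: Example B?', '', '5', '8', 'user_b', '1/2/2016 9:30'],
]

print(average_comments(toy_ask))  # 6.0
```

With the real lists, `average_comments(postsAsk)` and `average_comments(postsShow)` would be expected to reproduce the two averages printed above.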

## Number of Ask Posts and Comments by Hour Created

Next, let's look at whether posting a question at a particular time will increase the number of comments it receives. First, we'll look at how many ```Ask``` posts were made at each hour of the day and how many comments were left on those questions. Next, we'll calculate the average number of comments ```Ask``` posts typically receive throughout the day.
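
Extracting the hour relies on parsing the `created_at` string. As a small standalone illustration of that step, the sketch below parses two `created_at` values taken from the rows shown earlier; the variable names are chosen here for the example only.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# created_at values copied from the first and fourth data rows shown earlier
afternoon = dt.datetime.strptime("8/4/2016 11:52", date_format)
midnight = dt.datetime.strptime("6/17/2016 0:01", date_format)

# strftime('%H') returns the hour as a zero-padded, two-character string
print(afternoon.strftime("%H"))  # 11
print(midnight.strftime("%H"))   # 00
```

Because the hour comes back as a zero-padded string, the dictionary keys in the next cell look like `'09'` rather than `9`.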

In [6]:
import datetime as dt

result = []

for eachPost in postsAsk:
    result.append([eachPost[6], int(eachPost[4])])
    
commentsHour = {}
countsHour = {}
dateFormat = "%m/%d/%Y %H:%M"

for eachRow in result:
    date = eachRow[0]
    comment = eachRow[1]
    time = dt.datetime.strptime(date, dateFormat).strftime('%H')
    if time in countsHour:
        commentsHour[time] += comment
        countsHour[time] += 1
    else:
        commentsHour[time] = comment
        countsHour[time] = 1
        
commentsHour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

## Average Number of Comments for Ask HN Posts by Hour

In [7]:
avgHour = []

for hr in commentsHour:
    avgHour.append([hr, commentsHour[hr] / countsHour[hr]])
    
avgHour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

## Sorting Values

In [8]:
avgHour_swap = []

for row in avgHour:
    avgHour_swap.append([row[1], row[0]])

print(avgHour_swap)

sortedSwap = sorted(avgHour_swap, reverse = True)
sortedSwap

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
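
Swapping the columns is one way to sort by the average. An alternative, sketched below, passes a `key` function to `sorted` so the `[hour, average]` pairs can stay in their original order. The list here is a rounded four-hour subset of `avgHour`, used only for illustration.

```python
# Rounded subset of avgHour, kept as [hour, average] pairs
avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the second element (the average), highest first
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print(hour, avg)
```

The two approaches can differ only when averages tie: the swapped version breaks ties by comparing the hour strings, whereas the `key` version keeps tied entries in their original order.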

In [9]:
# print the five hours with the highest average comments per post
# (sortedSwap is already sorted in descending order)

for avg, hr in sortedSwap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


With an average of 38.59 comments per post, 15:00 is the hour with the most comments per post. That is roughly 62% more than the second-highest hour, 02:00, which averages 23.81 comments per post. All times are in US Eastern Time.
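
The gap between the top two hours can be checked directly from the averages printed above. The variable names in this short calculation are chosen here for illustration.

```python
top = 38.5948275862069       # average comments per post at 15:00
second = 23.810344827586206  # average comments per post at 02:00

# Relative increase of the top hour over the second-highest hour
relative_increase = (top - second) / second
print(f"{relative_increase:.1%}")  # 62.1%
```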

## Conclusion

In this project, we compared ```Ask``` posts with ```Show``` posts to find which post type and posting time receive the most comments on average. According to our analysis, to get the most comments, a post should be an ```Ask``` post created between 15:00 and 16:00 (3:00 pm to 4:00 pm US Eastern Time).

Note, however, that the data set we examined excludes posts that received no comments. It is therefore more accurate to say that, among posts that received at least one comment, ```Ask``` posts received more comments on average than ```Show``` posts, and ```Ask``` posts created between 15:00 and 16:00 (3:00 pm to 4:00 pm US Eastern Time) received the most comments on average.