# Analysis of Hacker News Posts
---

Hacker News is a social news website focused on technology and entrepreneurship. It allows users to vote and comment on user-submitted stories.

In this project, we will focus on the following:

* Do **Ask HN** or **Show HN** receive more comments on average?
* Do posts created at a **Certain Time** receive more comments on average?

* **Ask HN**: a post in which a user asks the HN community a specific question.
* **Show HN**: a post in which a user shows the HN community interesting stories, news, projects, products, etc.

The data is provided by Hacker News.

The dataset was reduced from almost 300,000 rows to approximately 20,000 rows.
Submissions with no comments were removed, and the remaining submissions were then randomly sampled.
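
The reduction step itself is not part of this notebook. As a minimal sketch of how such a reduction could be done, assuming the full dataset has the same column layout with `num_comments` at index 4: the function name `reduce_dataset`, the sample size, and the seed are illustrative assumptions, not the procedure that was actually used.

```python
import random

def reduce_dataset(rows, sample_size, seed=0):
    """Drop rows with zero comments, then draw a random sample.

    Hypothetical helper: assumes num_comments is the string at index 4.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    # Never ask for more rows than are available
    return rng.sample(commented, min(sample_size, len(commented)))

# Tiny made-up rows in the same column order as the dataset
rows = [
    ['1', 'Ask HN: A?', '', '3', '0', 'a', '1/1/2016 10:00'],
    ['2', 'Show HN: B', '', '5', '2', 'b', '1/1/2016 11:00'],
    ['3', 'C', '', '1', '7', 'c', '1/1/2016 12:00'],
]
sample = reduce_dataset(rows, sample_size=2)
print(len(sample))
```

Filtering before sampling means the sample size refers to commented posts only, which matches the order of the two steps described above.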


In [1]:
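import csv

# Read the whole file into a list of rows; the with-block closes the file
with open('hacker_news.csv') as opened_file:
    hn = list(csv.reader(opened_file))

# Separate the header row from the data rows
header = hn[0]
hn = hn[1:]
print(hn[:5])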

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [15]:
print(header)
print(len(hn))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
20100


In [16]:
# Split the posts into Ask HN, Show HN, and other posts by title prefix
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else: 
        other_posts.append(row)

In [17]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [33]:
# Total and average number of comments for Ask HN posts
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)

# Total and average number of comments for Show HN posts
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = total_show_comments / len(show_posts)
 

In [34]:
print(avg_ask_comments)
print(avg_show_comments)

14.038417431192661
10.31669535283993


As we can see, Ask HN posts receive about 14 comments per post on average, while Show HN posts receive about 10.
Since Ask HN posts attract more comments, the rest of the analysis focuses only on them.
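
Each average is the total number of comments divided by the number of posts in the group. The same calculation could be written once as a reusable helper. This is only a sketch: `average_comments` is a hypothetical name, the rows are made up, and it assumes `num_comments` sits at index 4 as in the header printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post (0.0 for an empty list)."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows for illustration only
example_posts = [
    ['1', 'Ask HN: X?', '', '3', '10', 'a', '1/1/2016 10:00'],
    ['2', 'Ask HN: Y?', '', '5', '20', 'b', '1/1/2016 11:00'],
]
example_avg = average_comments(example_posts)
print(example_avg)  # 15.0
```

Guarding against an empty list avoids a `ZeroDivisionError` if a group happens to contain no posts.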

In [55]:
import datetime as dt

result_list = []        # [created_at, num_comments] for each Ask HN post
counts_by_hour = {}     # hour -> number of posts created in that hour
comments_by_hour = {}   # hour -> total comments on those posts

date_format = "%m/%d/%Y %H:%M"

for post in ask_posts:
    result_list.append([post[-1], int(post[4])])

for row in result_list:
    created_at = row[0]
    n_comments = row[1]

    # Extract the hour as a zero-padded string, e.g. "15"
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")

    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
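
To see what the loop does to each timestamp, the parsing step can be checked on a single value. This sketch uses the `created_at` string from the first row printed earlier.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

created_at = "8/4/2016 11:52"  # created_at of the first row printed earlier
parsed = dt.datetime.strptime(created_at, date_format)

# strftime("%H") gives a zero-padded, two-character hour string
hour = parsed.strftime("%H")
print(parsed)  # 2016-08-04 11:52:00
print(hour)    # 11
```

Because the hour strings are zero-padded, sorting them as text also puts them in chronological order.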

In [62]:
avg_by_hour = []

# Average comments per post for each hour: total comments / number of posts
for hour in comments_by_hour:
    avg_comment = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comment])
    
sorted(avg_by_hour)

[['00', 8.127272727272727],
 ['01', 11.383333333333333],
 ['02', 23.810344827586206],
 ['03', 7.796296296296297],
 ['04', 7.170212765957447],
 ['05', 10.08695652173913],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['08', 10.25],
 ['09', 5.5777777777777775],
 ['10', 13.440677966101696],
 ['11', 11.051724137931034],
 ['12', 9.41095890410959],
 ['13', 14.741176470588234],
 ['14', 13.233644859813085],
 ['15', 38.5948275862069],
 ['16', 16.796296296296298],
 ['17', 11.46],
 ['18', 13.20183486238532],
 ['19', 10.8],
 ['20', 21.525],
 ['21', 16.009174311926607],
 ['22', 6.746478873239437],
 ['23', 7.985294117647059]]

To sort by the average number of comments, we first swap the two values in each pair so that the average comes first.

In [72]:
final_avg_by_hour = []

for row in avg_by_hour:
    final_avg_by_hour.append([row[1],row[0]])
    
sorted_avg_by_hour = sorted(final_avg_by_hour, reverse = True)
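
The swap works because `sorted` compares lists element by element, so the average is compared first. As an alternative sketch, `sorted` also accepts a `key` function, which sorts by the second element without building a swapped list. The pairs below are a few of the averages shown earlier, rounded; the names `avg_by_hour_sample` and `by_average` are illustrative.

```python
# A few [hour, average] pairs taken from the earlier output (rounded)
avg_by_hour_sample = [
    ['02', 23.81],
    ['15', 38.59],
    ['16', 16.8],
    ['20', 21.525],
]

# Sort by the average (index 1), highest first, leaving the pairs unswapped
by_average = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)
print(by_average[0])  # ['15', 38.59]
```

With this approach each pair keeps its `[hour, average]` order, so later code reads the hour at index 0 rather than index 1.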

In [85]:
print("Top 5 hours for Ask HN comments:")

for row in sorted_avg_by_hour[:5]:
    print(row[1],"hr :",row[0])
    

Top 5 hours for Ask HN comments:
15 hr : 38.5948275862069
02 hr : 23.810344827586206
20 hr : 21.525
16 hr : 16.796296296296298
21 hr : 16.009174311926607
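
The hour labels can be made easier to read by formatting them as clock times and rounding the averages. A minimal sketch, using the first three pairs from the sorted output above; the output wording is an illustrative choice.

```python
import datetime as dt

# [average, hour] pairs copied from the top of the sorted output above
top_hours = [
    [38.5948275862069, '15'],
    [23.810344827586206, '02'],
    [21.525, '20'],
]

lines = []
for avg, hour in top_hours:
    # Parse the two-digit hour and render it as HH:MM
    label = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    lines.append("{}: {:.2f} average comments per post".format(label, avg))

print("\n".join(lines))
```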


# Conclusion
---

Here are the answers to our two main questions:
* Do **Ask HN** or **Show HN** receive more comments on average?
    * Ask HN posts receive about 14 comments per post on average, while Show HN posts receive about 10.
* Do posts created at a **Certain Time** receive more comments on average?
    * Ask HN posts created between 15:00 and 16:00 (3 pm to 4 pm) receive the most comments on average, about 38.6 per post. The next best hours are 02:00 (about 23.8) and 20:00 (about 21.5).
    

    