# **Hacker News: Posts That Get the Most Comments**






## **Objective**

In this guided project, we study posts from Hacker News. We look at two types of posts:
* `Ask HN`: users submit posts to ask the Hacker News community a specific question.
* `Show HN`: users submit posts to show the Hacker News community a project, product, or something interesting.

The data set is from <a href='https://www.kaggle.com/hacker-news/hacker-news-posts' target='_blank'>Kaggle.com</a>.

The original data set from Kaggle was extracted in 2016 and contains 300,000 rows. We will work with only about 20,000 rows: posts without comments were removed, and the remaining submissions were randomly sampled.




For the purpose of this project, we will compare the two types of posts to determine the following:
* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?



## **Importing the data**


In [1]:
import csv

with open('hacker_news.csv') as hn_posts:
    hn=list(csv.reader(hn_posts))

#separating the header from the data for ease of use
hn_header=hn[0]
hn=hn[1:]
print(hn_header)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]




Since we are only concerned with post titles beginning with `Ask HN` and `Show HN`, we will separate the two types of posts into different lists. Everything else goes into a third list.

In [23]:
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title=row[1]
    #case-insensitive match on the start of the title
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)


Let's check the number of posts of each type.

In [60]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


1744
1162
17194
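
The prefix test used to split the posts can also be wrapped in a small helper. The sketch below is illustrative only: `classify_title` is a hypothetical name that does not appear in the notebook, and the sample titles are made up apart from the one taken from the printed rows.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the start of the title."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Sample titles for illustration (the first two are invented)
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Keeping the rule in one place means the same test could be reused wherever the post type is needed.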


## **Determining which type of posts receive more comments**

Let's see whether `Ask HN` or `Show HN` posts receive more comments on average.

In [28]:
total_ask_comments=0
total_show_comments=0

for row in hn:
    num_comments=int(row[4])
    title=row[1]
    #add the comment count to the running total for the matching post type
    if title.lower().startswith('show hn'):
        total_show_comments+=num_comments
    elif title.lower().startswith('ask hn'):
        total_ask_comments+=num_comments

avg_ask_comments=total_ask_comments/len(ask_posts)
avg_show_comments=total_show_comments/len(show_posts)

print('For Ask HN posts, there are on average {:.2f} comments.'.format(avg_ask_comments))
print('For Show HN posts, there are on average {:.2f} comments.'.format(avg_show_comments))

For Ask HN posts, there are on average 14.04 comments.
For Show HN posts, there are on average 10.32 comments.
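
Because the posts are already split into separate lists, the same averages could also be computed per list. This is a minimal sketch under that assumption: `average_comments` is a hypothetical helper and the two sample rows are invented, following the column order of the header printed earlier.

```python
def average_comments(rows, comments_index=4):
    """Average the comment counts stored as strings at comments_index."""
    if not rows:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Invented rows: id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example B', '', '3', '4', 'user_b', '8/4/2016 12:10'],
]

avg = average_comments(sample_rows)
print('{:.2f}'.format(avg))
```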


Let's look at the maximum number of comments for each type of post.

In [35]:
#collect the comment counts for each type of post
ask_comment_counts=[int(post[4]) for post in ask_posts]
show_comment_counts=[int(post[4]) for post in show_posts]

print('For Ask HN posts, there is a maximum of {} comments'.format(max(ask_comment_counts)))
print('For Show HN posts, there is a maximum of {} comments'.format(max(show_comment_counts)))

For Ask HN posts, there is a maximum of 947 comments
For Show HN posts, there is a maximum of 306 comments


## **`Ask HN` posts receive more comments on average than `Show HN` posts**


As it turns out, `Ask HN` posts receive on average about 1.4 times as many comments as `Show HN` posts (14.04 versus 10.32), and the most-commented `Ask HN` post received about 3 times as many comments as the most-commented `Show HN` post (947 versus 306).

Since `Ask HN` posts are more likely to receive comments, we will focus our analysis on `Ask HN` posts.
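
The comparison between the two post types can be checked with a few lines of arithmetic. The figures are copied from the outputs printed above; the variable names are only illustrative.

```python
# Values copied from the printed results above
avg_ask, avg_show = 14.04, 10.32
max_ask, max_show = 947, 306

avg_ratio = avg_ask / avg_show  # how many times more comments on average
max_ratio = max_ask / max_show  # how the two maximums compare

print(round(avg_ratio, 2), round(max_ratio, 2))
```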








## **What is the best time for a post to receive more comments**

Let's see whether `Ask HN` posts created at a certain time are more likely to attract comments.

We will do this in the following steps:


**1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.**


In [37]:
#Importing the Datetime module since we are working with time and dates
import datetime as dt


In [39]:

#pair each ask post's creation time with its number of comments
result_list=[]

for post in ask_posts:
    created_at=post[6]
    num_comments=int(post[4])
    result=created_at,num_comments
    result_list.append(result)

#posts created per hour, and comments received by those posts per hour
counts_by_hour={}
comments_by_hour={}


for result in result_list:
    num_comments=result[1]
    time=result[0]
    #parse the timestamp and keep only the zero-padded hour, e.g. '09'
    time=dt.datetime.strptime(time,'%m/%d/%Y %H:%M').strftime('%H')

    if time not in counts_by_hour:
        counts_by_hour[time]=1
        comments_by_hour[time]=num_comments
    else:
        counts_by_hour[time]+=1
        comments_by_hour[time]+=num_comments
    
print(comments_by_hour)
print(counts_by_hour)


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
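
The hourly tally in step 1 can also be written with `collections.defaultdict`, which removes the need to check whether a key already exists. This is an alternative sketch, not the notebook's own code. The sample pairs are invented, using the same timestamp format as the data set.

```python
from collections import defaultdict
import datetime as dt

# Invented (created_at, num_comments) pairs in the data set's timestamp format
sample = [
    ('8/4/2016 11:52', 6),
    ('8/5/2016 11:05', 2),
    ('9/30/2015 4:12', 3),
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # comments received by posts created in each hour

for created_at, num_comments in sample:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '04'
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```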


**2. Calculate the average number of comments per ask post for each hour of creation.**

In [6]:
avg_by_hour=[]
for hr in comments_by_hour:
    avg_by_hour.append([hr,comments_by_hour[hr]/counts_by_hour[hr]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Let's sort the averages and print the five highest values in a format that's easier to read.

In [59]:
swap_avg_by_hour=[]

for hour in avg_by_hour:
    swap=hour[1], hour[0]
    swap_avg_by_hour.append(swap)
    

#sorting on the average (first element), highest first
sorted_swap=sorted(swap_avg_by_hour,reverse=True)
print('Top 5 Hours for Ask Posts Comments\n')
for hour in sorted_swap[:5]:
    top_hour=dt.datetime.strptime(hour[1],'%H')
    top_hour=top_hour.strftime('%H:%M')
    top_comment=hour[0]

    print(f"At {top_hour} there are {top_comment:.2f} average comments per post.")


Top 5 Hours for Ask Posts Comments

At 15:00 there are 38.59 average comments per post.
At 02:00 there are 23.81 average comments per post.
At 20:00 there are 21.52 average comments per post.
At 16:00 there are 16.80 average comments per post.
At 21:00 there are 16.01 average comments per post.
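
The swap step exists only so that `sorted` compares the averages first. An alternative sketch passes a `key` function to `sorted` instead, which leaves the `[hour, average]` pairs unchanged. The totals below are a subset of the hourly figures printed in step 1.

```python
# Subset of the hourly totals printed in step 1
comments_by_hour = {'15': 4477, '02': 1381, '20': 1722, '16': 1814, '09': 251}
counts_by_hour = {'15': 116, '02': 58, '20': 80, '16': 108, '09': 45}

avg_by_hour = [
    [hr, comments_by_hour[hr] / counts_by_hour[hr]] for hr in comments_by_hour
]

# Sort on the average (second element), highest first, without swapping
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hr, avg in ranked[:3]:
    print(f'At {hr}:00 there are {avg:.2f} average comments per post.')
```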


## Conclusions

After our simple analysis we have discovered the following:
* `Ask HN` posts receive more comments on average than `Show HN` posts (14.04 versus 10.32). The most-commented `Ask HN` post also has about 3 times as many comments as the most-commented `Show HN` post (947 versus 306).
* `Ask HN` posts created at 15:00 US Eastern Time receive the highest average number of comments: about 2.3 times as many as posts created one hour later at 16:00, and about 1.6 times as many as posts created at 02:00, the hour with the second-highest average.
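
The hour-to-hour ratios quoted in the conclusions can be reproduced from the hourly totals printed in step 1. The variable names are only illustrative.

```python
# Total comments divided by number of ask posts, from the step 1 output
avg_15 = 4477 / 116  # posts created at 15:00
avg_16 = 1814 / 108  # posts created at 16:00
avg_02 = 1381 / 58   # posts created at 02:00

ratio_15_vs_16 = avg_15 / avg_16
ratio_15_vs_02 = avg_15 / avg_02

print(round(ratio_15_vs_16, 1), round(ratio_15_vs_02, 1))
```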

One thing to keep in mind is that the data we analysed excludes posts without any comments.

As a result, our conclusion should be that **among posts that received one or more comments**, `Ask HN` posts received more comments on average than `Show HN` posts, and `Ask HN` posts created between 15:00 and 16:00 US Eastern Time received the most comments on average. Further analysis with a more complete data set might change this conclusion.