# Analyzing the Hacker News Dataset to Find the Best Time and Category for Creating a Post
---
## Introduction

For those who do not know it, Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result. You can find the original data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

In this project, we're specifically interested in posts with titles that begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just something interesting. We are going to analyze a downsampled dataset of Hacker News posts to answer the following questions:
* Do `Ask HN` or `Show HN` receive more comments on average?
* Do posts created at a certain time receive more comments on average?

**You can download the downsampled data from [here](https://dq-content.s3.amazonaws.com/356/hacker_news.csv)**



## Reading the file

In [5]:
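# Let's take a look at the data file first
from csv import reader

# The with statement closes the file automatically once it has been read
with open('hacker_news.csv') as file:
    hn = list(reader(file))

headers = hn[0]  # the first row holds the column names
hn = hn[1:]      # the remaining rows are the posts
print(headers, '\n')
for row in hn[:5]:
    print(row, '\n')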


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'] 



## Extracting Ask HN and Show HN Posts
Now that we have opened and read the file, the next step is to identify the posts that start with `Ask HN` and `Show HN`, because our analysis revolves around these posts. We'll create three lists named `ask_posts`, `show_posts`, and `other_posts`; the names are self-explanatory. To check whether a title starts with `Ask HN` or `Show HN`, we'll use the built-in string methods `startswith` and `lower`.

In [6]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()  # lowercase so the prefix check ignores case
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
# Sanity check: every post should land in exactly one of the three lists
if len(ask_posts) + len(show_posts) + len(other_posts) == len(hn):
    print(True)

1744
1162
17194
True


Above, we see that every post has been assigned to one of the three lists, because our check printed **`True`**. We used the `lower()` string method so that a title written as `ask HN` or `Ask Hn` instead of `Ask HN` is still added to `ask_posts` and not to `other_posts`.
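
The case-insensitive prefix check can also be tried on a few titles in isolation. The following is a minimal sketch: the `classify` helper and the sample titles are hypothetical and are not part of the dataset or the analysis above.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalize case before comparing
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical sample titles, made up for illustration
samples = [
    'Ask HN: How do you learn?',
    'ask hn: lowercase prefix',
    'Show Hn: My project',
    'Tell me about Ask HN',  # contains the phrase but does not start with it
]
labels = [classify(title) for title in samples]
print(labels)
```

Only the start of the title matters here, so the last sample is classified as `other` even though it mentions `Ask HN`.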

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now we'll calculate the average number of comments received by the posts in `ask_posts` and by the posts in `show_posts`, so we can determine which kind of post gets more engagement.



In [7]:
total_ask_comments = 0
total_show_comments = 0
# num_comments is the fifth column (index 4) and is read from the CSV as a string
for row in ask_posts:
    total_ask_comments += int(row[4])
for row in show_posts:
    total_show_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)
print('Average Comments in "Ask Hn:" posts:', avg_ask_comments)
print('Average Comments in "Show Hn:" posts:', avg_show_comments)

Average Comments in "Ask Hn:" posts: 14.038417431192661
Average Comments in "Show Hn:" posts: 10.31669535283993


As we can see, `Ask HN` posts receive more comments on average than `Show HN` posts: about 14 comments per `Ask HN` post versus about 10 per `Show HN` post.
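
Since the same averaging logic is applied to both lists, it could be wrapped in a small function. This is only a sketch of one possible refactoring: the `average_comments` helper and the toy rows are hypothetical, and the rows simply mimic the column layout shown in the header above.

```python
def average_comments(posts, comments_index=4):
    """Average the comment counts of a list of post rows.

    Assumes each row stores the comment count as a string at
    comments_index, as in the CSV rows shown earlier.
    """
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Toy rows made up for illustration, following the dataset's column order
toy_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: B', '', '3', '20', 'user_b', '1/2/2016 11:00'],
]
toy_average = average_comments(toy_posts)
print(toy_average)
```

With the real data, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.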

## Finding the Number of Ask Posts and Comments by Hour Created
---
Now that we know `Ask HN` posts receive more comments on average, we'll continue with only the `Ask HN` posts and leave the `Show HN` posts behind. Next, we'll look at when the posts were created. We'll count the posts created in each hour of the day, along with the comments they received, and then analyze which hours had the most posts and the most comments. For this step, we'll use the `datetime` module, specifically the `strptime` and `strftime` methods of the `datetime` class.

In [8]:
import datetime as dt  # the alias 'dt' keeps references to the datetime module short
counts_by_hour = {}    # hour -> number of Ask HN posts created in that hour
comments_by_hour = {}  # hour -> total comments received by those posts
result_list = []
for row in ask_posts:
    # keep only the creation timestamp and the comment count
    result_list.append([row[6], int(row[4])])
for item in result_list:
    date = dt.datetime.strptime(item[0], '%m/%d/%Y %H:%M')
    hour = date.strftime('%H')  # zero-padded hour as a string, e.g. '09'
    comments = item[1]
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
comments_by_hour


{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
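
To see what `strptime` and `strftime` do on their own, here is a minimal sketch that parses two timestamps in the `created_at` format from the sample rows and extracts the hour. The `toy_counts` tally and the extra repeated hour are made up for illustration; `dict.get` is shown as an optional alternative to the `if`/`else` membership check, not as the method used above.

```python
import datetime as dt

# Two timestamps in the created_at format seen in the sample rows
timestamps = ['8/4/2016 11:52', '9/30/2015 4:12']

hours = []
for stamp in timestamps:
    parsed = dt.datetime.strptime(stamp, '%m/%d/%Y %H:%M')  # string -> datetime
    hours.append(parsed.strftime('%H'))                     # datetime -> 'HH'
print(hours)

# Tallying with dict.get avoids the explicit if/else membership check
toy_counts = {}
for hour in hours + ['11']:  # the extra '11' is made up to show a repeat
    toy_counts[hour] = toy_counts.get(hour, 0) + 1
print(toy_counts)
```

Note that `strftime('%H')` zero-pads the hour, which is why the dictionary keys above look like `'09'` rather than `'9'`.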

## Calculating the Average Number of Comments for Ask HN Posts by Hour
We now have two dictionaries: `counts_by_hour` holds the number of `Ask HN` posts created in each hour of the day across the whole dataset, and `comments_by_hour` holds the total number of comments received by the posts created in that hour. With this information, we can calculate the average number of comments per `Ask HN` post for each hour. That will give us an idea of when the Hacker News community is most active and what the best time to create a post is. We'll create a list of lists in which each inner list contains the **hour** the posts were created and the **average comments** per post for that hour, in that order.

In [9]:
avg_by_hour=[]
for hour in comments_by_hour:
    comments=comments_by_hour[hour]
    count=counts_by_hour[hour]
    avg_by_hour.append([hour,comments/count])
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

## Sorting and Printing Values from a List of Lists
We are almost done! However, the `avg_by_hour` list above is not easy to analyze, because it is not sorted by the average number of comments. This list is small enough to scan by eye, but with a much larger list we wouldn't be able to find much without sorting. Let's also format the averages to two decimal places to make them easier to read.

In [10]:
swap_avg_by_hour = []
for item in avg_by_hour:
    # Swap the columns because sorted() orders lists by their first value
    swap_avg_by_hour.append([item[1], item[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments.\n')
for item in sorted_swap[:5]:
    template = 'At {}: {:.2f} average comments per post.'
    time = dt.datetime.strptime(item[1], '%H')
    time = time.strftime('%H:%M')  # %M is minutes; %S would print seconds
    print(template.format(time, item[0]))

Top 5 Hours for Ask Posts Comments.

At 15:00: 38.59 average comments per post.
At 02:00: 23.81 average comments per post.
At 20:00: 21.52 average comments per post.
At 16:00: 16.80 average comments per post.
At 21:00: 16.01 average comments per post.
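
As an alternative to swapping the columns, `sorted` accepts a `key` function that selects the value to sort by. The sketch below uses a hypothetical three-item list whose averages are rounded from the output above; it is one possible variation, not the approach used in this notebook.

```python
# Toy list in the same [hour, average] layout as avg_by_hour,
# with averages rounded from the output above
toy_avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81]]

# key picks the second element (the average) as the sort value,
# so the [hour, average] order can stay as it is
sorted_by_avg = sorted(toy_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print('At {}:00: {:.2f} average comments per post.'.format(hour, avg))
```

One small difference: when two hours have exactly the same average, the swapped version falls back to comparing the hour strings, while the `key` version keeps tied items in their original order.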


## Conclusion
---
So there we go: the top 5 hours to create a post for the most interaction. Note that these times are in the EST time zone, as mentioned in the [documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) of the dataset. Since my time zone is the same, I didn't need to convert them, but you might! We found that `Ask HN` posts created between 15:00 and 16:00 EST (3:00 pm to 4:00 pm EST) received the most comments on average. Keep in mind that posts with no title were not included, so this analysis essentially covers only the `Ask HN` posts.
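
If you are in a different time zone, the hours can be converted with `datetime` as well. This is a rough sketch under stated assumptions: it treats EST as a fixed UTC-5 offset and ignores daylight saving time, the date is arbitrary, and `IST` is just a hypothetical reader time zone chosen for illustration.

```python
import datetime as dt

# Assumption: EST as a fixed UTC-5 offset, with no daylight saving adjustment
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
# Hypothetical reader time zone (UTC+5:30), chosen only as an example
IST = dt.timezone(dt.timedelta(hours=5, minutes=30), 'IST')

# The best hour from the analysis, attached to an arbitrary date
best_hour_est = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EST)

best_hour_utc = best_hour_est.astimezone(dt.timezone.utc)
best_hour_ist = best_hour_est.astimezone(IST)
print(best_hour_utc.strftime('%H:%M'))
print(best_hour_ist.strftime('%H:%M'))
```

For conversions that account for daylight saving time, the standard library `zoneinfo` module (Python 3.9 and later) with a named zone such as `America/New_York` may be a better fit.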

***A.G.***