# Exploring Hacker News Posts

## Introduction

Hacker News is a popular technology site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented on, similar to Reddit.

We're specifically interested in posts whose titles begin with either **Ask HN** or **Show HN**. Users submit **Ask HN** posts to ask the Hacker News community a specific question, and **Show HN** posts to show the community a project, a product, or just something interesting.

First, we load the dataset **hacker_news.csv** and display the first 5 rows.

In [1]:
from csv import reader


We store the dataset as a list of lists in the variable **hn** and the header row in the variable **header_hn**.

In [2]:
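# Open the file in a with-block so it is closed once the rows are read
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

header_hn = hn[0]  # the first row holds the column names
hn = hn[1:]        # the remaining rows are the posts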

Now, we display the column names and the first 5 rows of the dataset.

In [3]:
print("The column names are: ",header_hn)
print("\n")
for rows in hn[:5]:
    print(rows)
    print("\n")

The column names are:  ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2
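
The analysis below reads columns by position (`row[1]`, `row[4]`, `row[-1]`). As a hedged alternative, the standard library's `csv.DictReader` lets each row be read by column name instead. The sketch below is not part of the original notebook; the two sample rows and the name `SAMPLE_CSV` are made up for illustration, and `io.StringIO` stands in for the real file so the example runs on its own.

```python
import csv
import io

# Hypothetical sample with the same columns as hacker_news.csv.
SAMPLE_CSV = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,8/16/2016 9:55\n"
    "2,Show HN: Example project,http://example.com,10,2,bob,9/1/2016 14:30\n"
)

# io.StringIO stands in for open("hacker_news.csv") in this sketch.
with io.StringIO(SAMPLE_CSV) as sample_file:
    posts = list(csv.DictReader(sample_file))

for post in posts:
    # Each row is a dict keyed by the header names, so no index lookup is needed.
    print(post["title"], int(post["num_comments"]), post["created_at"])
```

Reading by name makes the later steps less sensitive to column order, at the cost of slightly more memory per row.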

## Filtering our data

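We will split the posts into three groups:
1. Posts whose title contains **Ask HN**
2. Posts whose title contains **Show HN**
3. Other posts

We categorize them by checking the **title** column. The check is case-insensitive and matches the phrase anywhere in the title, so it is slightly broader than requiring the title to begin with the phrase.

We will store these posts in three different lists:
1. ask_posts
2. show_posts
3. other_posts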

In [4]:
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title=row[1]
    title=title.lower()
    if "ask hn" in title:
        ask_posts.append(row)
    elif "show hn" in title:
        show_posts.append(row)
    else:
        other_posts.append(row)

In [5]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1745
1165
17190
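
The filter above uses a substring test (`in`), while the introduction describes titles that begin with the phrase. The two tests can disagree on a few titles. The sketch below illustrates the difference on a handful of made-up titles; the function name `classify` and the `prefix_only` flag are hypothetical and not part of the original notebook.

```python
# Hypothetical titles chosen to show where the two tests disagree.
titles = [
    "Ask HN: How do you learn a new language?",
    "Show HN: My weekend project",
    "Why I stopped reading Ask HN threads",
    "ask hn: lowercase prefix",
]

def classify(title, prefix_only=False):
    """Label a title as 'ask', 'show', or 'other' (case-insensitive)."""
    lowered = title.lower()

    def matches(tag):
        # Prefix test when prefix_only is set, substring test otherwise.
        return lowered.startswith(tag) if prefix_only else tag in lowered

    if matches("ask hn"):
        return "ask"
    if matches("show hn"):
        return "show"
    return "other"

by_substring = [classify(t) for t in titles]
by_prefix = [classify(t, prefix_only=True) for t in titles]
print(by_substring)
print(by_prefix)
```

Only the third title changes category: it mentions "Ask HN" but does not start with it. On the real dataset the counts would likely shift by a few posts if the prefix test were used.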


### Determining if ask posts or show posts receive more comments on average

The **"num_comments"** column is at index **4**; we use it to compute the average number of comments for each group.

In [6]:
total_ask_comments=0
for row in ask_posts:
    number= int(row[4])
    total_ask_comments+=number
    
avg_ask_comments=(total_ask_comments)/len(ask_posts)
print("Average comments in Ask Hn posts: ",avg_ask_comments)


total_show_comments=0
for row in show_posts:
    number= int(row[4])
    total_show_comments+=number
    
avg_show_comments=(total_show_comments)/len(show_posts)
print("Average comments in Show Hn posts: ",avg_show_comments)

    

Average comments in Ask Hn posts:  14.031518624641834
Average comments in Show Hn posts:  10.302145922746782


The results show that **ask posts** receive about **14** comments on average, while **show posts** receive about **10**.

Because ask posts are more likely to receive comments, the rest of the analysis focuses on ask posts.
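
The two averaging loops above repeat the same logic, so they could be folded into one small helper. The following is a minimal sketch, assuming rows shaped like the dataset (with `num_comments` at index 4); the function name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(rows, comments_index=4):
    """Return the mean of the comments column, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Hypothetical rows in the same column order as the dataset.
sample_rows = [
    ["1", "Ask HN: First?", "", "5", "6", "alice", "8/16/2016 9:55"],
    ["2", "Ask HN: Second?", "", "2", "2", "bob", "9/1/2016 14:30"],
]

print(average_comments(sample_rows))
print(average_comments([]))
```

With the real lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`. The empty-list guard avoids a `ZeroDivisionError` if a category has no posts.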

### Finding the most suitable time for posts to receive more comments

Here the column of interest is **"created_at"**, at index **-1**.
The time part of the created_at column uses the 24-hour format, as the first five rows show (19:30, 22:20).

In [14]:
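import datetime as dt  # needed here because this cell uses dt before the later import

date=str(ask_posts[0][-1])
date=dt.datetime.strptime(date,"%m/%d/%Y %H:%M")  # parse e.g. "8/16/2016 9:55"
print(date)
print(dt.datetime.strftime(date,"%H"))  # zero-padded hour as a string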

2016-08-16 09:55:00
09


In [35]:
import datetime as dt
result_list=[]
for row in ask_posts:
    time= row[-1]
    comments= int(row[4])
    result_list.append([time,comments])
    
counts_by_hour={}
comments_by_hour={}

for row in result_list:
    num_comments=row[1]
    date_time= str(row[0])
    date_time=dt.datetime.strptime(date_time,"%m/%d/%Y %H:%M")
    hour=dt.datetime.strftime(date_time,"%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour]=1
        comments_by_hour[hour]=num_comments
    else:
        counts_by_hour[hour]+=1
        comments_by_hour[hour]+=num_comments
        

In [36]:
print("Ask posts by hour: ",counts_by_hour)
print("\n")
print("Number of comments received by Ask post by hour of the post created",comments_by_hour)


Ask posts by hour:  {'02': 58, '11': 58, '04': 47, '07': 34, '18': 109, '06': 44, '23': 69, '15': 116, '22': 71, '21': 109, '05': 46, '00': 55, '03': 54, '14': 107, '09': 45, '12': 73, '10': 59, '16': 108, '01': 60, '19': 110, '17': 100, '08': 48, '13': 85, '20': 80}


Number of comments received by Ask post by hour of the post created {'02': 1381, '11': 641, '04': 337, '07': 267, '18': 1439, '06': 397, '23': 545, '15': 4477, '22': 479, '21': 1745, '05': 464, '00': 447, '03': 421, '14': 1416, '09': 251, '12': 687, '10': 793, '16': 1814, '01': 683, '19': 1188, '17': 1146, '08': 492, '13': 1253, '20': 1722}
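
The if/else bookkeeping in the counting loop can be shortened with `collections.defaultdict`, which starts every missing key at zero. This is a sketch of the same accumulation, not the notebook's own code; the three `[created_at, num_comments]` pairs are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the dataset's date format.
result_list = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]

counts_by_hour = defaultdict(int)    # posts created in each hour
comments_by_hour = defaultdict(int)  # comments received by those posts

for created_at, num_comments in result_list:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. "09"
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```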


### Average comments per post for posts created during each hour of the day

In [19]:
avg_by_hour=[]
for keys in counts_by_hour:
    total_post=counts_by_hour[keys]
    total_comments= comments_by_hour[keys]
    avg_comment= total_comments/total_post
    avg_by_hour.append([keys,avg_comment])

avg_by_hour=sorted(avg_by_hour)

In [22]:
print(avg_by_hour)

[['00', 8.127272727272727], ['01', 11.383333333333333], ['02', 23.810344827586206], ['03', 7.796296296296297], ['04', 7.170212765957447], ['05', 10.08695652173913], ['06', 9.022727272727273], ['07', 7.852941176470588], ['08', 10.25], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['11', 11.051724137931034], ['12', 9.41095890410959], ['13', 14.741176470588234], ['14', 13.233644859813085], ['15', 38.5948275862069], ['16', 16.796296296296298], ['17', 11.46], ['18', 13.20183486238532], ['19', 10.8], ['20', 21.525], ['21', 16.009174311926607], ['22', 6.746478873239437], ['23', 7.898550724637682]]


### Finding the best time to post

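**created_at**: the date and time the post was made (the time zone is Eastern Time in the US).
Indian Standard Time (IST) is 9 hours and 30 minutes ahead of Eastern Daylight Time, and 10 hours and 30 minutes ahead of Eastern Standard Time. This analysis uses the 9 hours 30 minutes offset for all posts.
We will convert the top hours into IST, print them, and analyze the result.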
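
Because the offset between Eastern Time and IST depends on daylight saving, a per-post conversion can differ by an hour from a fixed shift. The sketch below uses fixed-offset `datetime.timezone` objects to show both cases. It assumes the caller already knows which Eastern offset applies to a given date; the names `EDT`, `EST`, `IST`, and `eastern_to_ist` are hypothetical.

```python
import datetime as dt

# Fixed UTC offsets; which Eastern offset applies depends on the date.
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
IST = dt.timezone(dt.timedelta(hours=5, minutes=30), "IST")

def eastern_to_ist(created_at, eastern_tz):
    """Parse a created_at string as Eastern Time and convert it to IST."""
    naive = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    return naive.replace(tzinfo=eastern_tz).astimezone(IST)

summer = eastern_to_ist("8/16/2016 9:55", EDT)   # August: daylight time
winter = eastern_to_ist("1/26/2016 19:30", EST)  # January: standard time

print(summer.strftime("%m/%d/%Y %H:%M"))
print(winter.strftime("%m/%d/%Y %H:%M"))
```

The August post shifts by 9 hours 30 minutes, while the January post shifts by 10 hours 30 minutes and lands on the next calendar day.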

In [30]:
swap_avg_by_hour=[]
for rows in avg_by_hour:
    swap_avg_by_hour.append([rows[1],rows[0]])
sorted_swap=sorted(swap_avg_by_hour,reverse=True)
print(sorted_swap)



[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.898550724637682, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
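
Swapping the columns before sorting works, but `sorted` can also order the pairs by their second element directly through its `key` argument. The following is a minimal sketch of that alternative on a few `[hour, average]` pairs, with averages rounded from the output above; the variable name `top_hours` is hypothetical.

```python
from operator import itemgetter

# A few [hour, average] pairs, rounded from the averages printed earlier.
avg_by_hour = [["00", 8.13], ["02", 23.81], ["15", 38.59], ["16", 16.80]]

# Sort by the average (index 1), highest first, without swapping columns.
top_hours = sorted(avg_by_hour, key=itemgetter(1), reverse=True)

for hour, average in top_hours:
    print("{}: {:.2f} average comments per post".format(hour, average))
```

This keeps each pair in `[hour, average]` order, so later code does not need to remember that the columns were swapped.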


In [32]:
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    time=str(row[1])
    time_hour=dt.datetime.strptime(time,"%H")
    time_gap=dt.timedelta(hours=9,minutes=30)
    IST_time=time_hour+time_gap
    hour_time=dt.datetime.strftime(IST_time,"%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_time,row[0]))
    
    

Top 5 Hours for Ask Posts Comments
00:30: 38.59 average comments per post
11:30: 23.81 average comments per post
05:30: 21.52 average comments per post
01:30: 16.80 average comments per post
06:30: 16.01 average comments per post


The times are in **24-hour format** and expressed in IST. Each one marks the start of a one-hour window.

Looking at the average comments per post for _Ask HN_ posts, the five hours with the highest averages start at **00:30**, **11:30**, **05:30**, **01:30**, and **06:30**.
Four of the five fall between just after midnight (00:30) and early morning (06:30), so this interval looks like the best time to create an **Ask HN** post on the **Hacker News** platform.

In [45]:
x="14:30"
time_hour=dt.datetime.strptime(x,"%H:%M")
t=time_hour + dt.timedelta(hours=9,minutes=30)
print(dt.datetime.strftime(t,"%H:%M"))

00:00
