# Exploring Hacker News Posts

This project analyzes a data set of submissions to Hacker News, a popular technology site. 

We are interested in posts whose titles begin with either Ask HN or Show HN. We'll compare these two types of posts to determine:
1. Which type receives more comments on average.
2. Whether posts created at a specific time receive more comments on average.

We will be working with a data set from Kaggle called hacker_news.csv. 
It contains posts from a 12-month period (September 2015 to September 2016). 

## Part 1

First, let's import the `reader` function from the `csv` module and open the CSV file:

In [1]:
from csv import reader
opened_file = open('hacker_news.csv', encoding='utf8')
read_file=reader(opened_file)
hn=list(read_file)
headers=hn[0]
hn=hn[1:]
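
The same loading step can also be written with a `with` block, so the file is closed automatically once it has been read. This is a minimal sketch: the helper name `load_csv` and the small temporary sample file are illustrative assumptions, not part of the original notebook.

```python
import csv
import os
import tempfile


def load_csv(path):
    """Return (header, rows) from a CSV file, closing the file when done."""
    # newline='' is what the csv module documentation recommends for file objects.
    with open(path, encoding='utf8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Illustrative sample file standing in for hacker_news.csv.
sample = 'id,title,num_comments\n1,Ask HN: Example?,3\n2,Show HN: Demo,5\n'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write(sample)

sample_headers, sample_rows = load_csv(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(sample_headers)
print(sample_rows)
```

With the real file, `headers, hn = load_csv('hacker_news.csv')` would produce the same two variables as the cell above.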

Let's look at the header of the data set:

In [2]:
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

There are seven columns:
1. id: The unique identifier from Hacker News for the post
2. title: The title of the post
3. url: The URL that the post links to, if the post has a URL
4. num_points: The number of points the post acquired, calculated as the total 
5. num_comments: The number of comments that were made on the post
6. author: The username of the person who submitted the post
7. created_at: The date and time at which the post was submitted

Let's explore the first few rows of the data set:

In [3]:
hn[0:3]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20']]

In [4]:
len(hn)

20100

We can see that this data set contains over 20,000 posts. Since we are interested in posts that start with Ask HN and Show HN, we would like to divide this data set into three separate lists. 

To do this, we'll loop through the entire data set and filter the titles using the string method `str.startswith()`. Each title is converted to lowercase first so that the match is case-insensitive. 

In our loop, we'll separate the posts into three lists: `ask_posts`, `show_posts`, and `other_posts`. 

In [5]:
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title=row[1]
    title=title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
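
The prefix test used in the loop above can be isolated in a small function, which makes it easy to check on a handful of titles. This is only a sketch: `classify_title` and the sample titles are hypothetical names and values introduced for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase so the match ignores case
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: Something pointless I made',
    'Interactive Dynamic Video',
    'Why I ask HN for advice',  # contains "ask hn" but does not start with it
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Note that `startswith` only looks at the beginning of the string, so a title that merely contains "ask hn" somewhere in the middle is classified as 'other'.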


Let's see if this worked correctly by exploring a few rows from each new list:

In [6]:
print(ask_posts[:2])
print('\n')
print(show_posts[:2])
print('\n')
print(other_posts[:2])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']]


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]


Everything was filtered correctly. Now let's check the number of posts in each list:

In [7]:
print(len(ask_posts))
print('\n')
print(len(show_posts))
print('\n')
print(len(other_posts))

1744


1162


17194


There are substantially more 'other' posts than Ask HN and Show HN posts. 

A good next step is to compare the average number of comments for Ask HN and Show HN posts. We'll accumulate the total number of comments for Ask HN posts in the integer `total_ask_posts` and for Show HN posts in `total_show_posts`, then divide each total by the number of posts in its list:

In [8]:
#Code to calculate average amount of comments for Ask HN posts
total_ask_posts=0
avg_ask_posts=0


for row in ask_posts:
    n_com=int(row[4])
    total_ask_posts+=n_com
avg_ask_posts=total_ask_posts/len(ask_posts)

#Code to calculate average amount of comments for Show HN posts
total_show_posts=0
avg_show_posts=0

for row in show_posts:
    n_com=int(row[4])
    total_show_posts+=n_com
avg_show_posts=total_show_posts/len(show_posts)
print('Average number of comments on Show posts is {:.2f}'.format(avg_show_posts))
print('Average number of comments on Ask posts is {:.2f}'.format(avg_ask_posts))


Average number of comments on Show posts is 10.32
Average number of comments on Ask posts is 14.04
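
Because the two loops above differ only in the list they read, the calculation could be factored into one function. This is a sketch under stated assumptions: `average_comments` is a hypothetical helper, and the guard for an empty list is an addition that the original cell does not need.

```python
def average_comments(rows, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Two rows copied from the ask_posts preview shown earlier.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020',
     'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
print('{:.2f}'.format(average_comments(sample_ask)))
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, it would return the same averages the cell above computes.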


We can clearly see that Ask posts generate more comments on average than Show posts. This makes sense because, by definition, Ask posts ask other users for an answer to a question. Show posts, on the other hand, don't ask other users for a response; they simply display some information. 

## Part 2

Let's focus on Ask HN posts. We would like to determine whether there is a relationship between the time a post was created and the number of comments it received. 

Because we will be working with time, we first need to import the datetime module. We will then create a new list that holds two columns: the hour of the day and the number of comments:

In [9]:
import datetime as dt

result_list=[]
date_format='%m/%d/%Y %H:%M'
for row in ask_posts:
    created_at=row[6]
    # The line below first uses strptime to parse the string into a datetime object
    # and then uses strftime to extract the hour only.
    created_at=dt.datetime.strptime(created_at, date_format).strftime("%H")
    n_com=int(row[4])
    result_list.append([created_at, n_com])


Looking at the first few elements of the new list, we can see that this worked well:

In [10]:
result_list[0:5]

[['09', 6], ['13', 29], ['10', 1], ['14', 3], ['16', 17]]
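
As an alternative to calling `strftime("%H")`, the parsed `datetime` object exposes the hour directly through its `hour` attribute. The sketch below assumes the same date format as the cell above; `hour_label` is a hypothetical helper name, and the zero-padding only keeps the keys identical to the two-digit strings used in this notebook.

```python
import datetime as dt

DATE_FORMAT = '%m/%d/%Y %H:%M'


def hour_label(created_at):
    """Parse a created_at string and return its hour as a two-digit string."""
    parsed = dt.datetime.strptime(created_at, DATE_FORMAT)
    return '{:02d}'.format(parsed.hour)


# Timestamps taken from the first two Ask HN rows shown earlier.
print(hour_label('8/16/2016 9:55'))
print(hour_label('11/22/2015 13:43'))
```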

Now let's create two frequency tables: one holding the number of posts created during each hour, and one holding the total number of comments those posts received:

In [11]:
counts_by_hour={}
comments_by_hour={}

for row in result_list:
    n_hour=row[0]
    n_com=int(row[1])
    if n_hour in counts_by_hour:
        counts_by_hour[n_hour]+=1
        comments_by_hour[n_hour]+=n_com
    else:
        counts_by_hour[n_hour]=1
        comments_by_hour[n_hour]=n_com
    
    

In [15]:
comments_by_hour
counts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}
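
The `if`/`else` bookkeeping in the counting loop can be avoided with `collections.defaultdict`, which starts every missing key at zero. This is a sketch on a tiny sample: the first five pairs come from the `result_list` preview above, while the sixth pair is made up so that one hour repeats.

```python
from collections import defaultdict

# First five [hour, comments] pairs shown in the result_list preview.
sample_results = [['09', 6], ['13', 29], ['10', 1], ['14', 3], ['16', 17]]
# One extra, made-up pair so that an hour appears twice.
sample_results.append(['09', 4])

sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # total comments per hour
for hour, n_comments in sample_results:
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))
print(dict(sample_comments))
```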

Now, let's find the average number of comments per post for each hour:


In [28]:
avg_by_hour=[]
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

In [17]:
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Even though we now have the information we were looking for, this format is difficult to read. Let's swap the columns so that the list can be sorted by average, then print it in a more readable format:

In [30]:
swap_avg_by_hour=[]
for hr in avg_by_hour:
    swap_avg_by_hour.append([hr[1],hr[0]])
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [36]:
sorted_swap=sorted(swap_avg_by_hour,reverse=True)

In [43]:
for each in sorted_swap:
    print("{hr}:00: {avg:.2f} average comments per post".format(hr=each[1], avg=each[0]))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post
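
The column swap is one way to make `sorted` order the list by average. Another option, sketched below, is to pass a `key` function so the original `[hour, average]` pairs can be sorted as they are. The sample holds four pairs copied from `avg_by_hour`; `sample_avg` and `ranked` are illustrative names.

```python
# A few [hour, average] pairs copied from avg_by_hour above.
sample_avg = [['09', 5.5777777777777775], ['15', 38.5948275862069],
              ['02', 23.810344827586206], ['20', 21.525]]

# Sort on the second element directly instead of swapping columns first.
ranked = sorted(sample_avg, key=lambda pair: pair[1], reverse=True)
for hour, avg in ranked:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```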


Based on the information above, we can conclude that the best time to create an Ask HN post is during the 3 PM hour Eastern Time (the 2 PM hour Central Time). 

# Conclusion

Based on our analysis, to maximize the number of comments an Ask HN post receives, we recommend creating it between 3 PM and 4 PM (Eastern Time).
However, since this data set does not include posts that did not receive comments, it is more accurate to say that, of the posts that did receive comments, Ask HN posts created between 15:00 and 16:00 received the most comments on average. 