# Hacker News Posts
**Hacker News** is a site started by the startup incubator **Y Combinator**, where user-submitted stories (known as "***posts***") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

In this project, we're specifically interested in posts whose titles begin with either ***Ask HN*** or ***Show HN***. Users submit *Ask HN* posts to ask the Hacker News community a specific question and *Show HN* posts to show the Hacker News community a project, product, or just generally something interesting.

The main aim of this project is therefore to find out:
- Do **Ask HN** or **Show HN** posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## Opening and Exploring the Data
To save time and resources, a sample of this data is obtained from Kaggle for analysis. This [data set](https://www.kaggle.com/hacker-news/hacker-news-posts) covers the 12 months up to September 26, 2016:

**Opening the data set:**

In [2]:
from csv import reader

### Hacker News Posts dataset ###
opened_file = open('C:/Users/Luci/Desktop/Data Science/DataQuest/my_datasets/HN_posts.csv', encoding = 'utf8')
read_file = reader(opened_file)
hn = list(read_file)
hn_header = hn[0]
hn = hn[1:]

print(hn_header)
print(hn[0])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
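
The cell above leaves the file handle open. As a minimal sketch of the same header/rows split with a `with` block, the example below reads from an in-memory sample so that it runs without the CSV file. The names `sample_csv`, `sample_header` and `sample_rows` are hypothetical and are not part of the original notebook.

```python
import io
from csv import reader

# Hypothetical one-row sample standing in for HN_posts.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578908,Ask HN: What TLD do you use for local development?,,4,7,Sevrene,9/26/2016 2:53\n"
)

# The with block closes the handle automatically when the block exits.
with io.StringIO(sample_csv) as sample_file:
    rows = list(reader(sample_file))

sample_header = rows[0]   # column names
sample_rows = rows[1:]    # data rows only

print(sample_header)
print(sample_rows[0])
```

With the real file, the same pattern would apply by replacing `io.StringIO(sample_csv)` with the `open(...)` call used above.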


Next, the posts are separated into those whose titles begin with 'Ask HN', those whose titles begin with 'Show HN', and all other posts.

In [50]:
# lists to store the posts beginning with 'Ask HN', 'Show HN' and others.
ask_posts = []
show_posts = []
other_posts = []

for each_row in hn:
    title = each_row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(each_row)
    elif title.startswith('show hn'):
        show_posts.append(each_row)
    else:
        other_posts.append(each_row)

len_ask_posts = len(ask_posts)
len_show_posts = len(show_posts)
len_other_posts = len(other_posts)

#checking the filter performance
print(ask_posts[0])
print('\n')
print(show_posts[0])
print('\n')
print(other_posts[0])
print('\n')
print('Length of ASK posts: ',len_ask_posts)
print('\n')
print('Length of SHOW posts:' , len_show_posts)
print('\n')
print('Length of other posts:', len_other_posts)
print('\n')





['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


Length of ASK posts:  9139


Length of SHOW posts: 10158


Length of other posts: 273822
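
The prefix check used in the filter can also be written as a small reusable function. The sketch below is one possible way to do this; `classify_title` is a hypothetical helper, and the three titles are the ones printed in the output above.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Finding puns computationally',
    'You have two days to comment if you want stem cells to be classified as your own',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```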




## Calculating the average number of comments for ASK HN and SHOW HN posts
Determining if *ask posts* or *show posts* receive more comments on average:

In [10]:
# average ask comments
total_ask_comments = 0
count_ask = 0

for each_row in ask_posts:
    num_comments = each_row[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments
    count_ask += 1

avg_ask_comments = total_ask_comments / count_ask

print('Total ask comments: ', total_ask_comments)
print('Average ask comments: ', avg_ask_comments)

# average show comments
total_show_comments = 0
count_show = 0

for each_row in show_posts:
    num_comments = each_row[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments
    count_show += 1

avg_show_comments = total_show_comments / count_show

print('\nTotal show comments: ', total_show_comments)
print('Average show comments: ', avg_show_comments)

Total ask comments:  94986
Average ask comments:  10.393478498741656

Total show comments:  49633
Average show comments:  4.886099625910612


From the above cell, it can be concluded that **ASK HN** posts receive more comments on average than **SHOW HN** posts: about 10.39 comments per post compared with about 4.89, roughly twice as many.

Since ASK HN posts receive more comments on average, the remaining analysis focuses only on these posts.
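
The two averaging loops above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch, assuming each row stores the comment count as a string at index 4 (as in this data set); `average_comments` and `sample_posts` are hypothetical names.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count of the given rows, or 0.0 for an empty list."""
    counts = [int(row[comments_index]) for row in posts]
    if not counts:
        return 0.0
    return sum(counts) / len(counts)


# Two rows taken from the outputs shown earlier (7 comments and 0 comments).
sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'],
]

print(average_comments(sample_posts))
```

Applied to the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the averages printed above.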

## Determining if ASK posts created at a certain time are more likely to attract comments

To perform this analysis, the following steps are followed:
1. The number of ASK posts created in each hour of the day is calculated, along with the number of comments those posts received.
2. The average number of comments per ASK post is then calculated for each hour.

**1. Calculating the number of ASK posts and comments by hour created.**


In [21]:
import datetime as dt

result_list = []

for each_row in ask_posts:
    created_at = each_row[6]
    num_comments = int(each_row[4])
    result_list.append([created_at, num_comments])
       
counts_by_hour = {}          #contains the number of ASK posts created during each hour of the day.
comments_by_hour = {}        #contains the corresponding number of comments ASK posts created at each hour received.

for each_row in result_list:
    my_datetime_object = dt.datetime.strptime(each_row[0], "%m/%d/%Y %H:%M")
    
    #Using the datetime.strftime() method to select just the hour from the datetime object.
    hour_str = my_datetime_object.strftime("%H")
    
    if hour_str in counts_by_hour:
        counts_by_hour[hour_str] += 1
        comments_by_hour[hour_str] += each_row[1]
    else:
        counts_by_hour[hour_str] = 1
        comments_by_hour[hour_str] = each_row[1]
    
print('Number of ASK posts created by hour: ', counts_by_hour)
print('\nNumber of comments received by hour: ', comments_by_hour)
    

Number of ASK posts created by hour:  {'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}

Number of comments received by hour:  {'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
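
The same tally can be sketched more compactly with `collections.defaultdict`, which removes the need for the `if`/`else` branch, and with the `hour` attribute of the parsed datetime. This is an alternative sketch, not the approach used in the cell above. Note that it keys the dictionaries by integer hour rather than by a two-digit string. The second and third sample rows are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs; only the first comes from the data set,
# the other two are hypothetical.
sample_posts = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 2:10', 3],
    ['9/25/2016 15:05', 10],
]

counts = defaultdict(int)     # posts created in each hour
comments = defaultdict(int)   # comments received by posts created in each hour

for created_at, num_comments in sample_posts:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```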


Using the two dictionaries (`counts_by_hour` and `comments_by_hour`), the next step is to calculate the average number of comments for posts created during each hour of the day.

In [26]:
avg_by_hour = []

for each_key in counts_by_hour:
    avg_num_comments = comments_by_hour[each_key] / counts_by_hour[each_key]
    avg_by_hour.append([each_key, avg_num_comments])
    
print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


The list above contains the required results, but this format makes it hard to identify the hours with the highest averages. Therefore, in the last step, the columns of the list of lists are swapped, the list is sorted, and the five highest values are printed in a format that's easier to read.

In [28]:
swap_avg_by_hour = []

for each_row in avg_by_hour:
    swap_avg_by_hour.append([each_row[1],each_row[0]])
    
print(swap_avg_by_hour)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


The `sorted()` function is then used to sort `swap_avg_by_hour` in descending order. Because the first element of each inner list is the average number of comments, the list is sorted by that average.

In [30]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
for each_entry in sorted_swap:
    print(each_entry[1], ':' ,each_entry[0])

15 : 28.676470588235293
13 : 16.31756756756757
12 : 12.380116959064328
02 : 11.137546468401487
10 : 10.684397163120567
04 : 9.7119341563786
14 : 9.692007797270955
17 : 9.449744463373083
08 : 9.190661478599221
11 : 8.96474358974359
22 : 8.804177545691905
05 : 8.794258373205741
20 : 8.749019607843136
21 : 8.687258687258687
03 : 7.948339483394834
18 : 7.94299674267101
16 : 7.713298791018998
00 : 7.5647840531561465
01 : 7.407801418439717
19 : 7.163043478260869
07 : 7.013274336283186
06 : 6.782051282051282
23 : 6.696793002915452
09 : 6.653153153153153
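
The swap step exists only so that `sorted()` compares the averages first. As an alternative sketch, the `key` parameter of `sorted()` can pick the average directly from the original `[hour, average]` pairs. The list `avg_sample` below is a hypothetical four-entry subset of the values printed above.

```python
# Subset of the [hour, average] pairs from avg_by_hour.
avg_sample = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['09', 6.653153153153153],
]

# Sort by the second element (the average), highest first, without swapping columns.
top_hours = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post.")
```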


In [51]:
print('Top five Hours for Ask Posts Comments:\n')

template = "{time}: {avg:.2f} average comments per post."
output_list_format = []

for each_entry in sorted_swap:
    avg_comments = each_entry[0]
    # Parse the two-digit hour string and reformat it as HH:MM.
    my_hour_object = dt.datetime.strptime(each_entry[1], '%H')
    hour_str = my_hour_object.strftime("%H:%M")
    
    output = template.format(time = hour_str, avg = avg_comments)
    output_list_format.append(output)
    
# Print only the five hours with the highest averages.
for x in range(0, 5):
    print(output_list_format[x])
    

Top five Hours for Ask Posts Comments:

15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
02:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.


## Conclusion
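From the analysis above, it is observed that ASK posts created during the hours starting at 15:00, 13:00, 12:00, 02:00 and 10:00 receive the most comments on average. The 15:00 hour stands out with about 28.68 comments per post, well ahead of 13:00 with about 16.32. These hours are in the time zone used by the data set's `created_at` column.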