# Guided Project: Exploring Hacker News Posts

Hacker News is a site where user-submitted posts are voted and commented on. [Hacker News](https://news.ycombinator.com/) is extremely popular in technology and startup circles, and posts that reach the top of the list can receive thousands of visitors as a result.

This project focuses on 'Ask HN' and 'Show HN' posts. Users submit 'Ask HN' posts to ask the community a specific question.

Users submit 'Show HN' posts to show the Hacker News community a project or product, or to share something interesting.

The data set for this project comes from Kaggle: [data set](https://www.kaggle.com/hacker-news/hacker-news-posts).
For the guided project, the data has been reduced from 300,000 rows to approximately 20,000 by removing all submissions that received no comments and then randomly sampling from the remainder.
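
The reduction described above (drop posts without comments, then sample at random) could look roughly like the sketch below. It is only an illustration: the helper name `reduce_posts`, the seed, and the toy rows are assumptions, not the procedure actually used to prepare the file.

```python
import random

# Toy rows in the data set's column order; the values are made up for illustration.
posts = [
    ['1', 'Post A', 'http://example.com/a', '10', '0', 'alice', '1/1/2016 10:00'],
    ['2', 'Post B', 'http://example.com/b', '25', '3', 'bob', '1/2/2016 11:00'],
    ['3', 'Post C', 'http://example.com/c', '7', '12', 'carol', '1/3/2016 12:00'],
    ['4', 'Post D', 'http://example.com/d', '2', '1', 'dave', '1/4/2016 13:00'],
]

def reduce_posts(rows, sample_size, seed=0):
    """Drop rows with zero comments, then randomly sample up to sample_size rows."""
    with_comments = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # seeded so the sample is reproducible
    k = min(sample_size, len(with_comments))
    return rng.sample(with_comments, k)

reduced = reduce_posts(posts, sample_size=2)
print(len(reduced))
```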

It includes the following columns:

* id: the unique identifier from Hacker News for the post
* title: the title of the post
* url: the url of the item being linked to
* num_points: the number of upvotes the post received
* num_comments: the number of comments the post received
* author: the name of the account that made the post
* created_at: the date and time the post was made (the time zone is Eastern Time in the US)
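
As a side note, the columns listed above can also be read by name rather than by position. The sketch below uses `csv.DictReader` from the standard library on a one-row excerpt embedded as text (so it runs without the CSV file); the variable names are illustrative only.

```python
import csv
import io

# Header plus the first data row of the file, embedded as text for a self-contained example.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader uses the header row as keys, so each row becomes a dict of column name -> value.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
first = rows[0]

# All values are read as strings, so numeric columns still need an explicit int().
print(first['title'], int(first['num_comments']))
```

Accessing `first['num_comments']` instead of a positional index such as `row[4]` makes the intent clearer, at the cost of slightly more memory per row.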

The aim of the project is to compare the two types of posts to determine the following:

1. Do 'Ask HN' or 'Show HN' posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

In [39]:
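from csv import reader

# Read the whole CSV into a list of rows; the file is closed when the block ends
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

hn_header = hn[0]  # the first row holds the column names
hn = hn[1:]        # the remaining rows are the posts
print(hn_header)
print('\n')
print(hn[0:2])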

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]


In [40]:
# Create three empty lists to hold the rows for each post type
ask_posts = []
show_posts = []
other_posts = []

# Loop through each row and separate the posts by title prefix
for row in hn:
    title = row[1]
    title = title.lower()  # lower-case the title so the match is case-insensitive
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [41]:
len_ask_posts = len(ask_posts)
len_show_posts = len(show_posts)
len_other_posts = len(other_posts)
print('Number of posts in ask_posts :',len_ask_posts)
print('Number of posts in show_posts :',len_show_posts)
print('Number of posts in other_posts :',len_other_posts)

Number of posts in ask_posts : 1744
Number of posts in show_posts : 1162
Number of posts in other_posts : 17194
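
The title-prefix rule used to build the three lists can also be isolated in a small helper, which makes it easy to check on individual titles. This is a minimal sketch; the function name `post_type` and the example titles are hypothetical.

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles used only to exercise the helper
examples = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [post_type(title) for title in examples]
print(labels)
```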


## 1. Do Ask HN or Show HN posts receive more comments on average?

In [42]:
# Calculate the average number of comments for the Ask HN category
total_ask_comments = 0
num = 0
for i in ask_posts:
    num_comments = int(i[4])
    total_ask_comments += num_comments
    num +=1
average_ask_comments = (total_ask_comments / num)
print('Average Ask HN comments : ',average_ask_comments)

# Calculate the average number of comments for the Show HN category
total_show_comments = 0
num_1 = 0
for k in show_posts:
    num_show_comments = int(k[4])
    total_show_comments += num_show_comments
    num_1 +=1
average_show_comments = (total_show_comments / num_1) 
print('Average Show HN comments : ',average_show_comments)

Average Ask HN comments :  14.038417431192661
Average Show HN comments :  10.31669535283993


From the calculation above we can see that 'Ask HN' posts receive more comments on average than 'Show HN' posts. We will focus on the 'Ask HN' posts for the second question in the introduction.
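
Because the same averaging logic is applied to both categories, it could be factored into one function. The sketch below assumes rows in the data set's column order (num_comments at index 4); the name `average_comments` and the toy rows are illustrative, not part of the original analysis.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(row[comments_index]) for row in posts) / len(posts)

# Made-up rows in the data set's column order, for illustration only
toy_posts = [
    ['1', 'Ask HN: first question', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: second question', '', '8', '20', 'user_b', '1/2/2016 11:00'],
]
print(average_comments(toy_posts))
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two hand-written loops.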

## 2. Do posts created at a certain time receive more comments on average?

In [43]:
#create new list to append the data
result_list = [] 
#For loop to iterate and add in the created_at data and num comments to the result_list
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    new_row = [created_at,num_comments]
    result_list.append(new_row)

In the above code we appended the 'Ask HN' created_at data and the comments column to the result list. 

In the code below, after importing datetime, we loop over the result list and add up the comments for each hour in a dictionary called 'comments_by_hour'. The 'counts_by_hour' dictionary counts the number of posts created in each hour.

In [44]:
import datetime as dt

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date_time = row[0]
    num_comments = row[1]
    hour_time = dt.datetime.strptime(date_time, "%m/%d/%Y %H:%M")
    
    hour = hour_time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] +=1
        comments_by_hour[hour] += num_comments
        
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
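
The same per-hour aggregation can be written without the explicit `if hour not in ...` check by using `collections.defaultdict`. This is an alternative sketch, not the approach used above, and the sample rows are hypothetical values in the same shape as `result_list`.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the same shape as result_list
sample_rows = [
    ['8/4/2016 11:52', 6],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 4],
]

counts = defaultdict(int)    # posts per hour; missing keys start at 0
comments = defaultdict(int)  # total comments per hour; missing keys start at 0
for created_at, num_comments in sample_rows:
    # Parse the timestamp, then keep only the zero-padded hour as the key
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```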

### Sorting values from a list of lists

We then create an empty list called avg_by_hour and append the average number of comments per post for each hour, producing a list of lists.

In [45]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour]/counts_by_hour[hour],4)])

avg_by_hour

[['09', 5.5778],
 ['13', 14.7412],
 ['10', 13.4407],
 ['14', 13.2336],
 ['16', 16.7963],
 ['23', 7.9853],
 ['12', 9.411],
 ['17', 11.46],
 ['15', 38.5948],
 ['21', 16.0092],
 ['20', 21.525],
 ['02', 23.8103],
 ['18', 13.2018],
 ['03', 7.7963],
 ['05', 10.087],
 ['19', 10.8],
 ['01', 11.3833],
 ['22', 6.7465],
 ['08', 10.25],
 ['04', 7.1702],
 ['00', 8.1273],
 ['06', 9.0227],
 ['07', 7.8529],
 ['11', 11.0517]]

To sort by the average number of comments, we swap the two elements of each inner list and append the result to the empty list 'swap_avg_by_hour'. We then sort that list and print the five hours with the highest average number of comments.

In [46]:
swap_avg_by_hour = [] #create an empty list
for row in avg_by_hour:
    first_element = row[0]
    second_element = row[1]
    edited = [second_element,first_element]
    swap_avg_by_hour.append(edited)
print(swap_avg_by_hour)    

[[5.5778, '09'], [14.7412, '13'], [13.4407, '10'], [13.2336, '14'], [16.7963, '16'], [7.9853, '23'], [9.411, '12'], [11.46, '17'], [38.5948, '15'], [16.0092, '21'], [21.525, '20'], [23.8103, '02'], [13.2018, '18'], [7.7963, '03'], [10.087, '05'], [10.8, '19'], [11.3833, '01'], [6.7465, '22'], [10.25, '08'], [7.1702, '04'], [8.1273, '00'], [9.0227, '06'], [7.8529, '07'], [11.0517, '11']]


In [47]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Top 5 Hours for Ask Posts Comments")
print('\n')
for avg,hour in sorted_swap[:5]:
    message = "{} : {:.2f} average comments per post"
    print(message.format(dt.datetime.strptime(hour,'%H').strftime('%H:%M'),avg))

Top 5 Hours for Ask Posts Comments


15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post
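
As an alternative to swapping the elements, `sorted` accepts a `key` function, so the list can be ordered by its second element directly. The sketch below reuses the avg_by_hour values printed earlier; `top_five` is an illustrative name.

```python
import datetime as dt

# Average comments per hour, copied from the avg_by_hour output above
avg_by_hour = [
    ['09', 5.5778], ['13', 14.7412], ['10', 13.4407], ['14', 13.2336],
    ['16', 16.7963], ['23', 7.9853], ['12', 9.411], ['17', 11.46],
    ['15', 38.5948], ['21', 16.0092], ['20', 21.525], ['02', 23.8103],
    ['18', 13.2018], ['03', 7.7963], ['05', 10.087], ['19', 10.8],
    ['01', 11.3833], ['22', 6.7465], ['08', 10.25], ['04', 7.1702],
    ['00', 8.1273], ['06', 9.0227], ['07', 7.8529], ['11', 11.0517],
]

# Sort by the average (index 1), highest first, and keep the first five entries
top_five = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    label = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    print("{} : {:.2f} average comments per post".format(label, avg))
```

One difference worth noting: with a `key`, ties keep their original order, whereas sorting the swapped lists breaks ties by the hour string. There are no ties in this data, so both approaches give the same top five.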


## Conclusion

After analysing the data set, we can conclude that 'Ask HN' posts receive more comments on average than 'Show HN' posts. Among 'Ask HN' posts, those created at 3:00 PM receive the highest average number of comments per post, 38.59. 'Ask HN' posts created between 3:00 PM and 4:00 PM (US Eastern Time) therefore receive more comments on average than posts created at any other hour.