# Hacker News analysis

In this project I analyze a data set from the popular website <a href="https://news.ycombinator.com/">Hacker News</a>. This <a href="https://www.kaggle.com/hacker-news/hacker-news-posts">data set</a> contains 293,119 rows of user-submitted posts, including posts that received no comments.

A brief description of the website: Hacker News is a social news website focusing on computer science and entrepreneurship. Originally, it was created with the intention that the site would resemble Reddit in how users upvote, downvote, or comment on posts. The site is a community of users that is constantly administered and moderated.
HN (Hacker News) also has two tags: Ask HN and Show HN. The Ask HN tag consists of questions that users pose openly to the community. Below is an example of questions posted under this tag.


<img src="Project_images/Ask.png">

The Show HN tag consists of users showcasing their creations or anything else they want to share. The community can provide feedback and upvote or downvote accordingly. Below is an example of posts displayed under this tag.


<img src="Project_images/Show.png">

## Exploring Hacker News Posts

There are two main goals in this project:

* To determine which type of post, Ask HN or Show HN, receives more comments
* To determine which time of day has the biggest influence on the number of comments

There are also two side goals:

* To determine which type of post, Ask HN or Show HN, receives more points
* To determine which time of day has the biggest influence on the number of points

In [1]:
# Open the Hacker News file and read it as a list of lists
from csv import reader

# The with-block closes the file once the rows have been read
with open('Hacker_news.csv', encoding="utf8") as open_file:
    read_file = reader(open_file)
    hn = list(read_file)

hn_header = hn[0]
hn = hn[1:]

print(hn_header, '\n')
for ee in hn[:5]:
    print(ee)
    
print("\n", "The Hacker News dataset consists of", len(hn), "rows")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']

 The Hacker News dataset consists of 293119 rows
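As an alternative to indexing columns by position, the standard library's `csv.DictReader` keys each row by the header names. The sketch below uses a small in-memory sample (two rows copied from the output above) rather than the real file, so it runs on its own; `sample_csv` and `rows` are illustrative names, not part of the original notebook.

```python
import csv
import io

# Two rows copied from the output above, held in memory for illustration.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

# DictReader consumes the header line and maps each value to its column name.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

for row in rows:
    print(row["title"], "|", row["num_comments"], "comments")
```

Accessing `row["num_comments"]` instead of `row[4]` makes the later loops easier to read, at the cost of slightly more memory per row.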

## Data cleaning

After exploring the Hacker News dataset, I can see there are about 293,000 rows. Not all of this data belongs to Ask HN or Show HN, which is what we're looking for. Posts under these two tags have titles starting with "Ask HN" or "Show HN", and this project focuses only on them. For data cleaning, I'll create three empty lists: one for Ask HN posts, one for Show HN posts, and one for everything else.

In [2]:
ask_posts = [] # this list will contain all of the 'Ask hn' posts on hacker news
show_posts = [] # this list will contain all of the 'Show hn' posts on hacker news
other_posts = [] # will contain data which belong to neither of 'Ask' or 'Show'

for ee in hn:
    title = ee[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(ee)
    elif title.lower().startswith('show hn'):
        show_posts.append(ee)
    else:
        other_posts.append(ee)
        
# checking how much data we are left with in each list
print("Ask HN tags consists of", len(ask_posts), "posts")
print("Show HN tags consists of", len(show_posts), "posts")
print("Other posts on HN are", len(other_posts), "in numbers")

Ask HN tags consists of 9139 posts
Show HN tags consists of 10158 posts
Other posts on HN are 273822 in numbers
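The filtering loop above can also be wrapped in a small function, which makes it easy to try on a handful of rows before running it on the full data set. This is only a sketch: `split_by_prefix` and the sample rows are hypothetical and follow the column order of the header printed earlier.

```python
def split_by_prefix(rows, title_index=1):
    """Group rows into 'ask', 'show' and 'other' by title prefix (case-insensitive)."""
    groups = {"ask": [], "show": [], "other": []}
    for row in rows:
        title = row[title_index].lower()
        if title.startswith("ask hn"):
            groups["ask"].append(row)
        elif title.startswith("show hn"):
            groups["show"].append(row)
        else:
            groups["other"].append(row)
    return groups


# Hypothetical rows in the same column order as the dataset.
sample = [
    ["1", "Ask HN: How do you learn Python?", "", "5", "12", "alice", "9/26/2016 3:26"],
    ["2", "SHOW HN: My weekend project", "http://example.com", "9", "3", "bob", "9/26/2016 4:10"],
    ["3", "algorithmic music", "http://example.org", "1", "0", "carol", "9/26/2016 3:16"],
]

groups = split_by_prefix(sample)
print({name: len(posts) for name, posts in groups.items()})
```

Lowercasing the title before calling `startswith` is what lets "SHOW HN" and "Show HN" land in the same list.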


We are left with 19,297 posts to work with. Our data is all in a clean, readable format for our first goal of determining which subject receives more comments. So let's begin.

## Analysis Part I: Average Comments & Points

As stated, the objective is to check which type of post is more popular: Ask HN or Show HN. For this purpose, I will calculate the average number of comments and points that Ask HN and Show HN posts receive.

### Ask HN Comments and Points

In [3]:
# creating variables for calculating average comments per post
total_ask_comments = 0
length = 0

# looping through the respective list
for ee in ask_posts:
    num_comments = int(ee[4])
    total_ask_comments += num_comments
    length += 1
    
avg_ask_comments = total_ask_comments / length
print("Ask HN posts have a total of",(total_ask_comments), "comments. With each post averaging about", round(avg_ask_comments, 1), "comments.")

Ask HN posts have a total of 94986 comments. With each post averaging about 10.4 comments.


In [4]:
total_points = 0
length = 0

for ee in ask_posts:
    num_points = int(ee[3])
    total_points += num_points
    length += 1
    
avg_ask_points = total_points/length
print("Ask HN posts have a total of",(total_points), "points. With each post averaging about", round(avg_ask_points, 1), "points.")

Ask HN posts have a total of 103378 points. With each post averaging about 11.3 points.


### Show HN Comments and Points

I have calculated the average number of comments and points received by Ask HN posts. Now, I'll repeat the process for Show HN posts.

In [5]:
total_show_comments = 0
length = 0

for ee in show_posts:
    num_comments = int(ee[4])
    total_show_comments += num_comments
    length += 1
    
avg_show_comments = total_show_comments / length
print("Show HN posts have a total of",(total_show_comments), "comments. With each post averaging about", round(avg_show_comments, 1), "comments.")
    

Show HN posts have a total of 49633 comments. With each post averaging about 4.9 comments.


In [6]:
total_points = 0
length = 0

for ee in show_posts:
    num_points = int(ee[3])
    total_points += num_points
    length += 1
    
avg_show_points = total_points/length
print("Show HN posts have a total of",(total_points), "points. With each post averaging about", round(avg_show_points, 1), "points.")


Show HN posts have a total of 150781 points. With each post averaging about 14.8 points.
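The four cells above repeat the same loop with a different list or column. A minimal sketch of a helper that covers all four cases is shown here; `average_column` and the sample rows are hypothetical names and data used only for illustration.

```python
def average_column(rows, index):
    """Return (total, mean) of the integer values stored in column `index`."""
    values = [int(row[index]) for row in rows]
    total = sum(values)
    return total, total / len(values)


# Hypothetical rows; column 3 is num_points and column 4 is num_comments.
sample_posts = [
    ["1", "Ask HN: A", "", "4", "10", "alice", "9/26/2016 3:26"],
    ["2", "Ask HN: B", "", "6", "2", "bob", "9/26/2016 4:10"],
    ["3", "Ask HN: C", "", "2", "0", "carol", "9/26/2016 5:00"],
]

total_comments, avg_comments = average_column(sample_posts, 4)
total_points, avg_points = average_column(sample_posts, 3)
print(total_comments, round(avg_comments, 1))  # 12 4.0
print(total_points, round(avg_points, 1))      # 12 4.0
```

The helper assumes a non-empty list; an empty one would raise `ZeroDivisionError`, just as the original loops would.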


It appears that Ask HN posts receive more comments and discussion than Show HN posts: 94,986 comments in total against 49,633, a difference of 45,353, even though there are fewer Ask HN posts.

However, Show HN receives more points than Ask HN. The difference in averages is only about 3.5 points per post (14.8 versus 11.3), though.

While Show HN contains more posts, Ask HN averages about 5.5 more comments per post (10.4 versus 4.9). I wonder if there are some posts with a large number of comments that are affecting this average.

### Skewed comments

This part of the analysis checks whether certain posts with a large number of comments are skewing the averages. To do this, I'll create two empty lists, one per tag, to hold the posts that have more than 500 comments.

In [7]:
big_comments_ask = [] 
big_comments_show = []

for ee in ask_posts:
    comments = int(ee[4])
    if comments > 500:
        big_comments_ask.append(ee)
        
for ee in show_posts:
    comments = int(ee[4])
    if comments > 500:
        big_comments_show.append(ee)
        
print("Posts with over 500 Comments:" "\n" "\n"
      "Ask HN Posts :", len(big_comments_ask), "\n"
      "Show HN Posts :", len(big_comments_show))

Posts with over 500 Comments:

Ask HN Posts : 19 
Show HN Posts : 0


We can see that there are 19 Ask HN posts with over 500 comments, while there are none for Show HN. Though 19 is a small number, I will compute a new average for Ask HN posts that excludes these high-comment posts.

In [8]:
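# Find the new average for Ask HN posts. Show HN is not recalculated because
# none of its posts exceed 500 comments, so its average doesn't change.
under_500_comments = 0
under_500_total = 0
under_500_avg = 0

for row in ask_posts:
    comments = int(row[4])
    # Note: the strict `<` also leaves out any post with exactly 500 comments,
    # which the `> 500` filter in the previous cell did not count either.
    if comments < 500: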
        under_500_comments += comments
        under_500_total += 1

under_500_avg = under_500_comments / under_500_total

print("Fixed Ask HN Average Comments :", round(under_500_avg, 1))

Fixed Ask HN Average Comments : 8.8


Ask HN posts still gather more comments than Show HN posts. Even when the high-comment posts are excluded, the average number of comments drops only from 10.4 to 8.8, which is still well above the Show HN average of 4.9. Hence, Ask HN gathers more comments in general and a fairly good number of points as well.
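Another way to check for skew, not used in this notebook, is to compare the mean with the median, since the median is barely affected by a few very large values. The sketch below uses made-up comment counts purely to show the effect; `comment_counts` is a hypothetical list, not data from the file.

```python
from statistics import mean, median

# Hypothetical comment counts: most posts are small, one is very large.
comment_counts = [0, 1, 2, 3, 4, 5, 900]

mean_comments = round(mean(comment_counts), 1)
median_comments = median(comment_counts)

print("mean:", mean_comments)      # mean: 130.7
print("median:", median_comments)  # median: 3
```

A mean far above the median, as here, suggests that a few outliers are pulling the average up.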

## Analysis Part II: Time of Day

Now I can begin with the second goal of finding the most popular time of day for posts receiving comments and points. Since Ask HN posts receive almost double the comments of Show HN posts, I'll base the analysis on the Ask HN data. I'll also use Ask HN for the points comparison, since the average points were similar and keeping to one set of posts makes the results easier to read.

I'll first need to put all of the data into a readable and standard form; for this we'll use dictionaries. We only want the hour of day in which each post was created, along with its comments and points, so we can create two new lists holding only this information.

In [9]:
import datetime

result_comment = []
result_point = []

for ee in ask_posts:
    created_at = ee[6]
    comments = int(ee[4])
    points = int(ee[3])
    result_comment.append([created_at, comments])
    result_point.append([created_at, points])
    
    
counts_by_hour = {}
comments_by_hour = {}

for ee in result_comment:
    comments = ee[1]
    dates = datetime.datetime.strptime(ee[0], "%m/%d/%Y %H:%M")
    hour = dates.strftime("%H")
    hour = hour + ":00"
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
        
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        
print("Post by the Hour",'\n',counts_by_hour)
print('\n')
print("Comments by the Hour",'\n',comments_by_hour)

Post by the Hour 
 {'02:00': 269, '01:00': 282, '22:00': 383, '21:00': 518, '19:00': 552, '17:00': 587, '15:00': 646, '14:00': 513, '13:00': 444, '11:00': 312, '10:00': 282, '09:00': 222, '07:00': 226, '03:00': 271, '23:00': 343, '20:00': 510, '16:00': 579, '08:00': 257, '00:00': 301, '18:00': 614, '12:00': 342, '04:00': 243, '06:00': 234, '05:00': 209}


Comments by the Hour 
 {'02:00': 2996, '01:00': 2089, '22:00': 3372, '21:00': 4500, '19:00': 3954, '17:00': 5547, '15:00': 18525, '14:00': 4972, '13:00': 7245, '11:00': 2797, '10:00': 3013, '09:00': 1477, '07:00': 1585, '03:00': 2154, '23:00': 2297, '20:00': 4462, '16:00': 4466, '08:00': 2362, '00:00': 2277, '18:00': 4877, '12:00': 4234, '04:00': 2360, '06:00': 1587, '05:00': 1838}
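The hour extraction used in the cell above can be isolated and tested on its own. The sketch below is one possible way to do it, assuming the same `month/day/year hour:minute` format seen in the data; `hour_label` and `sample_times` are hypothetical names, and `collections.Counter` is swapped in for the manual dictionary counting.

```python
from collections import Counter
from datetime import datetime


def hour_label(created_at):
    """Turn a timestamp such as '9/26/2016 3:26' into an hour bucket such as '03:00'."""
    parsed = datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    return parsed.strftime("%H:00")


# Hypothetical timestamps in the dataset's format.
sample_times = ["9/26/2016 3:26", "9/25/2016 3:59", "9/26/2016 15:05"]

# Counter tallies how many posts fall into each hour bucket.
posts_per_hour = Counter(hour_label(t) for t in sample_times)
print(posts_per_hour)
```

Zero-padding the hour (`03:00` rather than `3:00`) keeps the labels sortable as plain strings.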


### Average Comments per Hour

Now that we have the number of posts per hour as well as the comments accumulated, we can find the average number of comments per post for each hour. For this purpose, I create an empty list and fill it from the two dictionaries above. Note that the `//` operator performs floor division, so each average is rounded down to a whole number.

In [10]:
avg_comments_hour = []

for key in counts_by_hour:
    average = comments_by_hour[key] // counts_by_hour[key]
    
    avg_comments_hour.append([key, average])
    
print("Average number of comments per hour",'\n', avg_comments_hour)

Average number of comments per hour 
 [['02:00', 11], ['01:00', 7], ['22:00', 8], ['21:00', 8], ['19:00', 7], ['17:00', 9], ['15:00', 28], ['14:00', 9], ['13:00', 16], ['11:00', 8], ['10:00', 10], ['09:00', 6], ['07:00', 7], ['03:00', 7], ['23:00', 6], ['20:00', 8], ['16:00', 7], ['08:00', 9], ['00:00', 7], ['18:00', 7], ['12:00', 12], ['04:00', 9], ['06:00', 6], ['05:00', 8]]
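Because the averages are computed with `//`, the fractional part is discarded. The short example below uses the 15:00 figures from the two dictionaries printed earlier (18,525 comments across 646 posts) to show the difference between floor division and true division; the variable names are illustrative.

```python
# Figures taken from the 15:00 entries of the two dictionaries above.
comments_at_15 = 18525
posts_at_15 = 646

floor_avg = comments_at_15 // posts_at_15          # floor division drops the fraction
true_avg = round(comments_at_15 / posts_at_15, 2)  # true division keeps it

print(floor_avg)  # 28
print(true_avg)   # 28.68
```

Using `/` with `round` would keep hours with close averages from collapsing into the same whole number.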


The averages are computed as required. However, the `avg_comments_hour` list is hard to read: the hours are not in chronological order, and the averages are not sorted.
To make the information more readable, I will create a new list that orders the hours by average number of comments, from the best time to the worst.

In [11]:
swap_avg_by_hour = [] # empty list

for ee in avg_comments_hour:
    swap_avg_by_hour.append([ee[1],ee[0]]) # appending the number of comments and time period to list to sort it later
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True) # created a new variables which intakes our list in a sorted manner
print(sorted_swap)

[[28, '15:00'], [16, '13:00'], [12, '12:00'], [11, '02:00'], [10, '10:00'], [9, '17:00'], [9, '14:00'], [9, '08:00'], [9, '04:00'], [8, '22:00'], [8, '21:00'], [8, '20:00'], [8, '11:00'], [8, '05:00'], [7, '19:00'], [7, '18:00'], [7, '16:00'], [7, '07:00'], [7, '03:00'], [7, '01:00'], [7, '00:00'], [6, '23:00'], [6, '09:00'], [6, '06:00']]
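Swapping the columns is one way to sort by the average. Another option, sketched below, is to pass a `key` function to `sorted` so the original `[hour, average]` order is kept. The four pairs are copied from the output above; `by_average` is an illustrative name.

```python
# A few [hour, average] pairs copied from the output above.
avg_comments_hour = [["02:00", 11], ["15:00", 28], ["13:00", 16], ["12:00", 12]]

# Sort on the second element, largest first, without building a swapped list.
by_average = sorted(avg_comments_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in by_average:
    print(f"{hour}: {average} average comments per post")
```

One difference to be aware of: with a `key`, hours that share the same average keep their original relative order, whereas sorting the swapped lists breaks ties by the hour string.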


The list of lists `sorted_swap` orders the hours by average comments per post, from the best to the worst time of day. We can make this even more presentable by highlighting the top 5 and the worst 3 hours for Ask HN comments.

In [12]:
print("Top 5 best hours for Ask HN comments")
print('\n')
for row in sorted_swap[:5]:
    statement = "{0}: {1} average number of comments per post"
    print(statement.format(row[1],row[0]))
    
print('\n')
print("3 worst hours for Ask HN comments")
print('\n')
for row in sorted_swap[-3:]:
    statement = "{0}: {1} average number of comments per post"
    print(statement.format(row[1],row[0]))

Top 5 best hours for Ask HN comments


15:00: 28 average number of comments per post
13:00: 16 average number of comments per post
12:00: 12 average number of comments per post
02:00: 11 average number of comments per post
10:00: 10 average number of comments per post


3 worst hours for Ask HN comments


23:00: 6 average number of comments per post
09:00: 6 average number of comments per post
06:00: 6 average number of comments per post


It appears that 3 PM Eastern Time is when Ask HN posts attract the most comments (28 per post on average), with 1 PM and 12 PM in second and third place. The hours with the fewest comments per post are 6 AM, 9 AM, and 11 PM. For the best chance of receiving comments on an Ask HN post, one should post at 3 PM EST and avoid posting at 6 AM, 9 AM, and 11 PM EST.
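Readers outside the Eastern time zone may want the equivalent hour in UTC. The sketch below assumes, as the text above does, that the timestamps are Eastern Standard Time, modelled here as a fixed UTC-5 offset with no daylight saving adjustment; `EST` and `est_hour_to_utc` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps are Eastern Standard Time, a fixed UTC-5 offset.
EST = timezone(timedelta(hours=-5), "EST")


def est_hour_to_utc(hour):
    """Return the UTC hour (0-23) matching a whole hour in EST."""
    # The date is arbitrary; only the hour matters for a fixed offset.
    local = datetime(2016, 1, 15, hour, 0, tzinfo=EST)
    return local.astimezone(timezone.utc).hour


print(est_hour_to_utc(15))  # 20
```

Under that assumption, 3 PM EST corresponds to 20:00 UTC.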

### Average Points per Hour

In [13]:
points_by_hour = {}

for ee in result_point:
    points = ee[1]
    dates = datetime.datetime.strptime(ee[0], '%m/%d/%Y %H:%M')
    hour = dates.strftime("%H")
    hour = hour + ":00"
    
    if hour not in points_by_hour:
        points_by_hour[hour] = points
        
    else:
        points_by_hour[hour] += points
        
print('\n')
print("Points by the Hour",'\n',points_by_hour)



Points by the Hour 
 {'02:00': 2944, '01:00': 2662, '22:00': 3601, '21:00': 5042, '19:00': 4782, '17:00': 7155, '15:00': 13978, '14:00': 5390, '13:00': 7962, '11:00': 2856, '10:00': 3789, '09:00': 1763, '07:00': 2040, '03:00': 2539, '23:00': 2616, '20:00': 4491, '16:00': 5970, '08:00': 2744, '00:00': 2835, '18:00': 6850, '12:00': 4643, '04:00': 2650, '06:00': 2030, '05:00': 2046}


In [14]:
avg_points_hour = []

for key in counts_by_hour:
    average2 = points_by_hour[key] // counts_by_hour[key]
    avg_points_hour.append([key, average2])

print("Average number of points per hour",'\n', avg_points_hour)

Average number of points per hour 
 [['02:00', 10], ['01:00', 9], ['22:00', 9], ['21:00', 9], ['19:00', 8], ['17:00', 12], ['15:00', 21], ['14:00', 10], ['13:00', 17], ['11:00', 9], ['10:00', 13], ['09:00', 7], ['07:00', 9], ['03:00', 9], ['23:00', 7], ['20:00', 8], ['16:00', 10], ['08:00', 10], ['00:00', 9], ['18:00', 11], ['12:00', 13], ['04:00', 10], ['06:00', 8], ['05:00', 9]]


We have the result, but to make it neater we will sort the list as we did for comments per hour.

In [15]:
swap_avg_pts_by_hour = []

for ee in avg_points_hour:
    hour = ee[0]
    avg = ee[1]
    swap_avg_pts_by_hour.append([avg, hour])

sorted_swap_pts = sorted(swap_avg_pts_by_hour, reverse=True)

print("Top 5 best hours for Ask HN points")
print('\n')
for row in sorted_swap_pts[:5]:
    statement = "{0}: {1} average number of points per post"
    print(statement.format(row[1],row[0]))
    
print('\n')
print("3 worst hours for Ask HN points")
print('\n')
for row in sorted_swap_pts[-3:]:
    statement = "{0}: {1} average number of points per post"
    print(statement.format(row[1],row[0]))

Top 5 best hours for Ask HN points


15:00: 21 average number of points per post
13:00: 17 average number of points per post
12:00: 13 average number of points per post
10:00: 13 average number of points per post
17:00: 12 average number of points per post


3 worst hours for Ask HN points


06:00: 8 average number of points per post
23:00: 7 average number of points per post
09:00: 7 average number of points per post
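The points analysis mirrors the comments analysis step for step, so both could be served by one function that takes the column index as a parameter. The following is a minimal sketch under that assumption; `average_by_hour` and the sample rows are hypothetical, and it uses true division rather than the floor division used in the cells above.

```python
from datetime import datetime


def average_by_hour(rows, value_index, date_index=6):
    """Return {hour: mean value} for one numeric column, grouped by posting hour."""
    totals, counts = {}, {}
    for row in rows:
        parsed = datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H:00")
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Hypothetical rows; column 3 is num_points and column 4 is num_comments.
sample_posts = [
    ["1", "Ask HN: A", "", "4", "10", "alice", "9/26/2016 15:26"],
    ["2", "Ask HN: B", "", "6", "2", "bob", "9/25/2016 15:10"],
    ["3", "Ask HN: C", "", "2", "0", "carol", "9/26/2016 5:00"],
]

avg_points = average_by_hour(sample_posts, 3)
avg_comments = average_by_hour(sample_posts, 4)
print(avg_points)
print(avg_comments)
```

Calling the function once per column avoids keeping separate `result_comment` and `result_point` lists in step with each other.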


## Conclusion

Our initial goals were to answer two questions: which type of post, Ask HN or Show HN, receives more comments on average, and which time of day has the greatest influence on the number of comments.

Our side goals were to answer these same questions for points.

In conclusion, we have answered all of our questions. Ask HN posts receive the most comments from the community, with almost double the number of comments of Show HN posts, and are more likely to attract very large discussions (19 Ask HN posts had over 500 comments, against none for Show HN). Show HN posts, however, receive more points than Ask HN posts, though the averages are similar (14.8 versus 11.3). To receive the most comments and points on an Ask HN post, one should post at 3 PM EST, when the user base appears to be most active.