# Exploring Hacker News Posts #

In this project we will be using a dataset from the famous tech website Hacker News.

On this website, people can post articles, questions, their own work, or any other kind of content. Just like on Reddit, those posts can be upvoted (gain points) and commented on.

We have a dataset of 300,000 rows. Each row corresponds to a post, with the date it was created, its number of points, its number of comments, and so on.

We reduced that dataset to about 20,000 rows by removing all the posts without any comments.

We will focus mainly on two types of posts:

1. The 'Ask HN' posts, where people ask the community a question
2. The 'Show HN' posts, where people share their work with the community

We will go through this dataset and try to answer two questions: does a post created at a certain time receive more comments? And how do the two subcategories, Ask HN and Show HN, compare?

First, let's open our .csv file, create a list of lists, and separate the column names from the rest of the dataset.


In [1]:
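from csv import reader

# Read the whole file into a list of lists; the with-block closes the file for us
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))

hn_header = hn[0]  # the first row holds the column names
hn = hn[1:]        # every other row is a post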
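
To make the header/data split concrete, here is a minimal sketch that runs on an in-memory sample instead of the real file. The column names and the sample row are assumptions made for illustration; only the column positions (title at index 1, points at 3, comments at 4, date at 6) follow from the code used in this notebook.

```python
import csv
import io


def split_header(rows):
    """Return (header, data_rows) from a list of CSV rows."""
    return rows[0], rows[1:]


# Hypothetical two-line sample standing in for hacker_news.csv
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,someone,8/4/2016 11:52\n"
)

sample_header, sample_rows = split_header(list(csv.reader(sample_csv)))
print(sample_header)
print(sample_rows)
```

Note that every value comes back as a string, which is why the numeric columns are converted with `int()` later on.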

Let's focus on our two categories and count the Show HN and Ask HN posts. We use the 'startswith' method on a lower-cased copy of each title, so that the comparison ignores capitalization.

We will divide the posts into 3 lists of lists.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()  # compare in lower case so "ASK HN" also matches
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):  # elif keeps each post in exactly one list
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
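
The prefix test can also be wrapped in a small function that returns exactly one category per title, which makes it easy to check on a few examples. This is only a sketch: `classify_title` and the sample titles are hypothetical and are not part of the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' depending on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles used only to exercise the three branches
samples = ["Ask HN: How do you learn?", "SHOW HN: My side project", "A regular article"]
for sample in samples:
    print(sample, "->", classify_title(sample))
```

Because each title gets a single label, the three resulting groups cannot overlap.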


We can see that there are more Ask HN posts than Show HN posts, even though the majority of posts are neither Ask HN nor Show HN posts.

We will now calculate the average number of comments for each of these two categories.

In [3]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts) 

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments/len(show_posts) 

print(avg_ask_comments)
print(avg_show_comments)

14.038417431192661
10.31669535283993
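
Since the same sum-and-divide loop is applied to both lists, it could be factored into one helper. The sketch below assumes rows shaped like the ones in this notebook (values stored as strings); `average_column` and the sample rows are hypothetical.

```python
def average_column(rows, index):
    """Mean of the integer values stored at position `index` in each row."""
    if not rows:
        raise ValueError("cannot average an empty list of rows")
    return sum(int(row[index]) for row in rows) / len(rows)


# Hypothetical rows: index 3 holds the points, index 4 the comments
sample_rows = [
    ["1", "Ask HN: a", "", "3", "10", "x", "8/4/2016 11:52"],
    ["2", "Ask HN: b", "", "7", "20", "y", "8/4/2016 12:10"],
]
print(average_column(sample_rows, 4))  # 15.0
```

With such a helper, the two averages would become `average_column(ask_posts, 4)` and `average_column(show_posts, 4)`.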


We can see that Ask HN posts receive 14 comments on average, whereas Show HN posts receive only 10.
This could be explained by the fact that people love to answer questions and show their knowledge. It could also lead to people arguing about the best way to solve the issue raised in the post, generating even more comments.

Now let's focus on the timing of those posts. We want to know:

- at what hour people tend to create Ask HN posts
- whether the number of comments received is linked to a specific time of the day

We will use the datetime module and its strptime and strftime methods to help us.

In [4]:
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6],int(row[4])]) #We add the date of posting and the number of comments

    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    post_time = row[0]
    nb_comment = row[1]
    post_time = dt.datetime.strptime(post_time, "%m/%d/%Y %H:%M")
    post_hour = post_time.strftime("%H")
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = nb_comment
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += nb_comment
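
To see what the two datetime calls do on a single value, here is a small sketch. The timestamp is a made-up example in the same month/day/year hour:minute format that the code above parses.

```python
import datetime as dt

sample = "8/16/2016 9:55"  # hypothetical timestamp, not taken from the dataset

# strptime turns the text into a datetime object
parsed = dt.datetime.strptime(sample, "%m/%d/%Y %H:%M")

# strftime turns the datetime back into text; "%H" keeps only the zero-padded hour
hour_key = parsed.strftime("%H")

print(parsed)
print(hour_key)
```

The zero-padded hour string is what ends up as the dictionary key, which is why the keys look like '09' rather than 9.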
        



Now that we have two dictionaries with:

- the number of Ask HN posts per hour
- the total number of comments per hour

let's calculate the average number of comments per post for posts created during each hour of the day, and add it to a new list of lists.

In [5]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


We now have the results we wanted, but they are not easy to read. Let's fix that by sorting them and printing the five highest averages.

In [6]:
swap_avg_by_hour = []

for i in avg_by_hour:
    hour = i[0]
    avg = i[1]
    swap_avg_by_hour.append([avg,hour])
    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Top 5 Hours for Ask Posts Comments")

for avg, hour in sorted_swap[:5]:
    hour = dt.datetime.strptime(hour, "%H").strftime("%H:%M")  # "15" -> "15:00"
    print("{}: {:.2f} average comments per post".format(hour, avg))
                                                  
                                        

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
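
As an alternative to swapping the columns before sorting, `sorted` accepts a `key` function that picks the value to sort on. The sketch below uses a handful of the averages printed earlier, rounded to two decimals; the variable names are hypothetical.

```python
# A few [hour, average] pairs copied from the output above and rounded
avg_by_hour_sample = [
    ["09", 5.58],
    ["15", 38.59],
    ["02", 23.81],
    ["20", 21.52],
    ["16", 16.80],
    ["21", 16.01],
]

# Sort on the second element (the average), highest first, and keep five
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_hours:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

This keeps each pair in its original `[hour, average]` order, so no second list is needed.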


From the results above, it seems that the best time to create an Ask HN post is 15:00 (3 pm) US Eastern Time, which is 6 hours behind the time here in France.

This could explain why posts created at 15:00 and 16:00 receive so many comments: those hours correspond to 21:00 and 22:00 in the Central European time zone. So we can assume that many visitors to the HN website are Europeans.

We can also see that posts created at 02:00 receive a lot of comments. This could be linked to the fact that 02:00 in the Eastern US corresponds to 23:00 in the Western US (Los Angeles, San Francisco...).
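
The hour conversions in this discussion can be checked with the datetime module. The sketch below assumes fixed standard-time offsets (UTC-5 for Eastern, UTC-8 for Pacific, UTC+1 for Central Europe) and an arbitrary date; real time zones shift with daylight saving, so this is only an approximation. `convert_hour` is a hypothetical helper.

```python
import datetime as dt

# Assumed fixed offsets (standard time); daylight saving is ignored here
EASTERN = dt.timezone(dt.timedelta(hours=-5), "ET")
PACIFIC = dt.timezone(dt.timedelta(hours=-8), "PT")
CENTRAL_EUROPE = dt.timezone(dt.timedelta(hours=1), "CET")


def convert_hour(hour, source, target):
    """Return the clock hour in `target` when it is `hour` o'clock in `source`."""
    moment = dt.datetime(2016, 1, 15, hour, tzinfo=source)  # arbitrary date
    return moment.astimezone(target).hour


for hour in (15, 16):
    print(hour, "ET ->", convert_hour(hour, EASTERN, CENTRAL_EUROPE), "CET")
print(2, "ET ->", convert_hour(2, EASTERN, PACIFIC), "PT")
```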







### The points ###

Now let's focus on the points earned by each post, as we did for the comments.

Let's see whether Show HN or Ask HN posts receive more points on average.

In [7]:
total_ask_points = 0

for row in ask_posts:
    num_points = int(row[3])
    total_ask_points += num_points
    
avg_ask_points = total_ask_points/len(ask_posts) 

total_show_points = 0

for row in show_posts:
    num_points = int(row[3])
    total_show_points += num_points
    
avg_show_points = total_show_points/len(show_posts) 

print(avg_ask_points)
print(avg_show_points)

15.061926605504587
27.555077452667813


In the "Comment" part above, we saw that Ask HN posts generated more comments than Show HN posts (14 vs 10 comments on average).

However, we can see that Show HN posts received far more points than Ask HN posts (27.6 vs 15.1 on average).

This could be explained by the fact that people upvote the work of others as a form of acknowledgement: they "like" the content, just as they would on Facebook, for instance.

Now let's determine whether Ask HN posts created at a certain time are more likely to receive more points. We will follow the same process as before, calculating the average number of points received for each hour.

In [24]:
result_list = [] 

for row in ask_posts:
    result_list.append([row[6],int(row[4]),int(row[3])]) #This time we also add the number of points
    
counts_points_by_hour = {}
points_by_hour = {}

for row in result_list:
    post_time = row[0]
    nb_points = row[2]
    post_time = dt.datetime.strptime(post_time, "%m/%d/%Y %H:%M")
    post_hour = post_time.strftime("%H")
    if post_hour not in counts_points_by_hour:
        counts_points_by_hour[post_hour] = 1
        points_by_hour[post_hour] = nb_points
    else:
        counts_points_by_hour[post_hour] += 1
        points_by_hour[post_hour] += nb_points

avg_points_by_hour = []

for hour in counts_points_by_hour:
    avg_points_by_hour.append([hour, points_by_hour[hour]/counts_points_by_hour[hour]])
    
print(avg_points_by_hour)

[['09', 7.311111111111111], ['13', 24.258823529411764], ['10', 18.677966101694917], ['14', 11.981308411214954], ['16', 23.35185185185185], ['23', 8.544117647058824], ['12', 10.712328767123287], ['17', 19.41], ['15', 29.99137931034483], ['21', 15.788990825688073], ['20', 14.3875], ['02', 13.672413793103448], ['18', 15.972477064220184], ['03', 6.925925925925926], ['05', 12.0], ['19', 13.754545454545454], ['01', 11.666666666666666], ['22', 7.197183098591549], ['08', 10.729166666666666], ['04', 8.27659574468085], ['00', 8.2], ['06', 13.431818181818182], ['07', 10.617647058823529], ['11', 14.224137931034482]]
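
The comments and points analyses differ only in the column they read, so the per-hour averaging could be written once and reused. This is a sketch under the same assumptions as the cells above (date at index 6, values stored as strings); `average_by_hour` and the sample rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(rows, value_index, date_index=6, date_format="%m/%d/%Y %H:%M"):
    """Map each posting hour ('00' to '23') to the mean of column `value_index`."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        hour = dt.datetime.strptime(row[date_index], date_format).strftime("%H")
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in counts}


# Hypothetical rows: two posts created at 11h, one at 15h
sample_rows = [
    ["1", "Ask HN: a", "", "4", "10", "x", "8/4/2016 11:52"],
    ["2", "Ask HN: b", "", "8", "20", "y", "8/5/2016 11:10"],
    ["3", "Ask HN: c", "", "6", "2", "z", "8/5/2016 15:00"],
]
print(average_by_hour(sample_rows, 3))  # points
print(average_by_hour(sample_rows, 4))  # comments
```

Using `defaultdict(int)` removes the need for the `if post_hour not in ...` branch, since missing keys start at 0.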


Now let's make those results easier to read.
    


In [27]:
swap_avg_points_by_hour = []

for i in avg_points_by_hour:
    hour = i[0]
    avg = i[1]
    swap_avg_points_by_hour.append([avg,hour])
    
sorted_points_swap = sorted(swap_avg_points_by_hour, reverse = True)

print("Top 5 Hours for Ask Posts Points")

for avg, hour in sorted_points_swap[:5]:
    hour = dt.datetime.strptime(hour, "%H").strftime("%H:%M")  # "15" -> "15:00"
    print("{}: {:.2f} average points per post".format(hour, avg))

Top 5 Hours for Ask Posts Points
15:00: 29.99 average points per post
13:00: 24.26 average points per post
16:00: 23.35 average points per post
17:00: 19.41 average points per post
10:00: 18.68 average points per post


When we compare those results to the ones we got for the number of comments on each post, we can see some differences.

It seems that most of the upvoted content is posted in the afternoon, which could be explained by the fact that both US and EU users can be online at that time (it's the afternoon for US users and the evening for EU users).

It also suggests that the people who give points ("likes") and the people who comment and answer the questions are not exactly the same population.
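
One simple way to quantify the difference between the two rankings is to compare the two top-5 lists as sets. The hours below are copied from the two "Top 5" outputs above; the variable names are hypothetical.

```python
# Hours copied from the two "Top 5" outputs above
top_comment_hours = ["15", "02", "20", "16", "21"]
top_point_hours = ["15", "13", "16", "17", "10"]

shared = sorted(set(top_comment_hours) & set(top_point_hours))
only_comments = sorted(set(top_comment_hours) - set(top_point_hours))
only_points = sorted(set(top_point_hours) - set(top_comment_hours))

print("In both rankings:", shared)
print("Only in the comments ranking:", only_comments)
print("Only in the points ranking:", only_points)
```

Only 15:00 and 16:00 appear in both lists, which is consistent with the idea that commenting and upvoting peak at partly different times.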
