# Project: analysing Hacker News posts
The goal of this analysis is to study Hacker News posts whose titles start with `Ask HN` or `Show HN`.

The dataset is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

We will compare these two types of posts to answer two questions:
- Which type of post receives more comments on average?
- At what time of day should a question be posted to receive the most comments on average?

In [10]:
import csv
with open('HN_posts_year_to_Sep_26_2016.csv', encoding="utf8") as opened_file: # close the file once it has been read
    hn = list(csv.reader(opened_file))

headers = hn[0]
hn = hn[1:]

Next, we split the posts into three lists: `ask_posts`, `show_posts`, and `other_posts`.

In [11]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822


We now compute the average number of comments for each type of post.

In [13]:
total_ask_comments, total_show_comments = 0, 0

for askpost in ask_posts: # loop for ask posts
    num_comments = askpost[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print('avg ask comments = ', round(avg_ask_comments,3))

for showpost in show_posts: # loop for show posts
    num_comments = showpost[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print('avg show comments = ', round(avg_show_comments,3))

avg ask comments =  10.393
avg show comments =  4.886


We can observe that an ask post receives 10.393 comments on average, more than twice the 4.886 comments that a show post receives on average.
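
The two loops above differ only in the list they iterate over, so the calculation could be factored into a small helper. The sketch below is one possible way to do that, not the notebook's own code: `average_comments` and the sample rows are hypothetical, and it assumes, as in the cell above, that the comment count sits at index 4 of each row.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count of `posts`, or 0.0 for an empty list.

    Hypothetical helper, not part of the original notebook; index 4
    mirrors the column read as num_comments in the cell above.
    """
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows for illustration only; only index 4 matters here.
sample_posts = [
    ["1", "Ask HN: example A", "", "5", "10", "alice", "9/26/2016 2:53"],
    ["2", "Ask HN: example B", "", "1", "4", "bob", "9/26/2016 1:17"],
]

sample_avg = average_comments(sample_posts)
print(round(sample_avg, 3))
```

With the real data, the same helper could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`. The guard for an empty list avoids a division by zero if one of the lists ends up empty.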

We will now determine whether ask posts created at a particular time of day receive more comments. To do this, we will:
- Count the ask posts created in each hour of the day, along with the total number of comments they received
- Calculate the average number of comments per ask post for each hour of creation

In [14]:
import datetime as dt
result_list = [] # initialize list of lists with created time and number of comments

for post in ask_posts: # add created_at and num_comments as a list
    result_list.append(
        [post[6], int(post[4])]
    )
    
counts_by_hour = {} # number of ask posts created for each hour
comments_by_hour = {} # number of comments on ask posts created for each hour
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date, comment = row[0], row[1]
    time = dt.datetime.strptime(date, date_format).strftime('%H') # parse the date string, then extract the hour as a two-digit string
    
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
        
comments_by_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}
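
The `if time not in counts_by_hour` branch above exists only to initialise missing keys. As a hedged alternative, `collections.defaultdict(int)` starts every missing key at 0, so the tally needs no branch. The sketch below uses made-up sample data and the names `sample_results`, `counts`, and `comments`, which are illustrative and not taken from the notebook.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"  # same format string as in the cell above

# Made-up [created_at, num_comments] pairs for illustration only.
sample_results = [
    ["9/26/2016 2:53", 10],
    ["9/25/2016 2:10", 4],
    ["9/24/2016 15:30", 6],
]

counts = defaultdict(int)    # posts per hour; missing keys start at 0
comments = defaultdict(int)  # comments per hour; missing keys start at 0

for created_at, num_comments in sample_results:
    # Parse the timestamp and keep the hour as a two-digit string.
    hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```

Both dictionaries always share the same keys here, because each loop iteration updates the two of them together.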

In [15]:
avg_by_hour = [] # avg number of comments per post for each hour

for hour in comments_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour] # avg of comments created per posts created
    avg_by_hour.append([hour, round(avg,3)])

avg_by_hour

[['02', 11.138],
 ['01', 7.408],
 ['22', 8.804],
 ['21', 8.687],
 ['19', 7.163],
 ['17', 9.45],
 ['15', 28.676],
 ['14', 9.692],
 ['13', 16.318],
 ['11', 8.965],
 ['10', 10.684],
 ['09', 6.653],
 ['07', 7.013],
 ['03', 7.948],
 ['23', 6.697],
 ['20', 8.749],
 ['16', 7.713],
 ['08', 9.191],
 ['00', 7.565],
 ['18', 7.943],
 ['12', 12.38],
 ['04', 9.712],
 ['06', 6.782],
 ['05', 8.794]]

The list of lists `avg_by_hour` is not easy to read, so we will sort it by average and display the five highest values.

In [16]:
swap_avg_by_hour = [] 

for row in avg_by_hour: # swap columns to sort later by the number of comments
    swap_avg_by_hour.append(
        [row[1], row[0]]
    )
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments: ")

for average, hours in sorted_swap[:5]: # loop over the 5 highest values
    hours = dt.datetime.strptime(hours, '%H').strftime('%H:%M') # strptime parses the hour string into a datetime object; strftime formats it
    print('{}: {:.2f} average comments per post'.format(hours, average))


Top 5 Hours for Ask Posts Comments: 
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
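
The column swap above is needed only because `sorted` compares each inner list by its first element. A possible alternative, sketched below with a few pairs copied from the `avg_by_hour` output, is to pass a `key` function that picks the average directly. The names `sample_avg_by_hour` and `top_hours` are illustrative and not part of the notebook.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    ["02", 11.138],
    ["15", 28.676],
    ["13", 16.318],
    ["09", 6.653],
]

# Sort by the average (second element), highest first, without swapping columns.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```

One difference to keep in mind: when two hours have exactly the same average, the swapped-list sort breaks the tie by hour, whereas the `key` version keeps the original order of the tied entries.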


# Conclusion
We can conclude that ask posts receive the most comments on average when they are created at 15:00 (28.68 comments per post), followed by 13:00 (16.32) and 12:00 (12.38). Only the 15:00 slot exceeds 20 comments per post on average.
