# Analyzing Hacker News Posts
In this project, I work with a dataset describing approximately 20,000 posts published on the Hacker News website. The dataset can be downloaded from [Kaggle](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). Note that the file used in this project is a sample of the full dataset available at that link.

Some of the posts in the dataset belong to two specific categories: **Ask HN** and **Show HN**. The first category includes posts in which users ask the Hacker News community for specific information. The second category includes posts in which users showcase, for informational purposes, a project, a product, or something interesting.

The goal of this work is to determine which type of post (Ask HN or Show HN) receives more comments on average, and whether posts published at a particular time of day receive more comments on average.

In [1]:
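from csv import reader

# Read the whole CSV file into a list of rows (each row is a list of strings)
with open('hacker_news.csv') as file:
    csv_reader = reader(file)
    hn = list(csv_reader)

# Show the header row and the first four data rows
for row in hn[:5]:
    print(row)
    print('\n')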

    

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




In [2]:
header = hn[0]
hn = hn[1:]
print(header)
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
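
As a side note, the header can also be separated while the file is being read, by consuming the first row with `next()`. The snippet below is only a minimal sketch: it uses an in-memory sample (`sample`, built from the first data row printed above) instead of `hacker_news.csv`, and the names `sample_header` and `sample_data` are illustrative.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv (header plus one row copied from the output above)
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = csv.reader(sample)
sample_header = next(rows)   # consume the first row as the header
sample_data = list(rows)     # every remaining row is data

print(sample_header)
print(len(sample_data))
```

This gives the same split as slicing with `hn[0]` and `hn[1:]`, without first building a list that still contains the header.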


## Extracting Ask HN and Show HN Posts
Once the dataset has been imported and the first row containing the column headers has been separated from the actual data, it is time to focus on the posts belonging to the **Ask HN** and **Show HN** categories. These posts are identified by the label *Ask HN* or *Show HN* at the beginning of the post title. Since not all posts belong to one of these two categories, we must also identify and set aside the posts that are neither **Ask HN** nor **Show HN**. The comparison is made case-insensitive by lowercasing the title first.

In [27]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(f"Number of 'Ask' posts: {len(ask_posts)}")
print('\n')
print(f"Number of 'Show' posts: {len(show_posts)}")
print('\n')
print(f"Number of 'Other' posts: {len(other_posts)}")
print('\n')

Number of 'Ask' posts: 1744


Number of 'Show' posts: 1162


Number of 'Other' posts: 17194
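
As a quick sanity check, the three categories should cover every row exactly once. The sketch below wraps the prefix test in a hypothetical helper, `classify_title` (not part of the original notebook), tries it on made-up titles, and then adds up the three counts reported above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles used only to illustrate the case-insensitive match
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)

# Counts copied from the output above
ask_count, show_count, other_count = 1744, 1162, 17194
total = ask_count + show_count + other_count
print(total)
```

The total of 20100 rows is consistent with the "approximately 20,000 posts" mentioned in the introduction.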




## Calculating the Average Number of Comments for Ask HN and Show HN Posts
In the following code, I will calculate the average number of comments per post for Ask HN and Show HN in order to determine which of the two types receives, on average, more interactions.

In [34]:
total_show_comments = 0
total_ask_comments = 0

# Sum the comment counts (column index 4) over all Show HN posts
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print(f"The average number of comments for each 'Show' post is: {round(avg_show_comments,2)}")

# Sum the comment counts (column index 4) over all Ask HN posts
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print(f"The average number of comments for each 'Ask' post is: {round(avg_ask_comments,2)}")

The average number of comments for each 'Show' post is: 10.32
The average number of comments for each 'Ask' post is: 14.04
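
Since the same computation is applied to both lists, it could be factored into a small function. The following is a minimal sketch, not the code used to produce the results above: `average_comments` and `sample_posts` are hypothetical names, the sample rows are made up, and returning `0.0` for an empty list is an assumption chosen to avoid a division by zero.

```python
def average_comments(posts, comments_index=4):
    """Average of the comment counts stored as strings at comments_index."""
    if not posts:
        return 0.0  # assumption: an empty category has an average of 0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows with the same column layout as the dataset
sample_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: example two', '', '3', '4', 'user_b', '1/1/2016 11:00'],
]

print(average_comments(sample_posts))
print(average_comments([]))
```

With the notebook's lists, the calls would be `average_comments(show_posts)` and `average_comments(ask_posts)`.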


## Finding the Number of Ask Posts and Comments by Hour Created
On average, *Ask* posts receive more comments than *Show* posts (14.04 versus 10.32 per post). From this point onward, the analysis therefore focuses exclusively on *Ask* posts. In particular, I examine whether there is a time of day at which *Ask* posts receive more comments on average.

In [35]:
import datetime as dt

In [66]:
result_list = []
count_by_hour = {}
comments_by_hour = {}

# Collect (creation time, number of comments) pairs for every Ask HN post
for row in ask_posts:
    creation_time = row[6]
    n_comments = int(row[4])
    result_list.append((creation_time, n_comments))

# Tally posts and comments per hour of creation ('00' to '23')
for creation_time, n_comments in result_list:
    dt_time = dt.datetime.strptime(creation_time, "%m/%d/%Y %H:%M")
    dt_hour = dt_time.strftime("%H")
    if dt_hour not in count_by_hour:
        count_by_hour[dt_hour] = 1
        comments_by_hour[dt_hour] = n_comments
    else:
        count_by_hour[dt_hour] += 1
        comments_by_hour[dt_hour] += n_comments
    

## Calculating the Average Number of Comments for Ask HN Posts by Hour
To obtain the average number of comments per post for each hour of the day, I created two dictionaries, `count_by_hour` and `comments_by_hour`, which store, respectively, the number of *Ask* posts created in each hour of the day and the total number of comments on the posts published in each hour.

In [67]:
print(count_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
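
The two dictionaries above can also be built without the explicit `if`/`else` branch by using `collections.defaultdict`. This is only an illustrative sketch on made-up rows; `sample_rows`, `counts`, and `comment_totals` are hypothetical names, and the timestamp format is the one used in the `created_at` column.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the dataset's timestamp format
sample_rows = [
    ('8/4/2016 11:52', 52),
    ('1/26/2016 19:30', 10),
    ('6/23/2016 11:05', 1),
]

counts = defaultdict(int)          # posts per hour; missing keys start at 0
comment_totals = defaultdict(int)  # comments per hour; missing keys start at 0

for created_at, n_comments in sample_rows:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")   # zero-padded hour string, e.g. '11'
    counts[hour] += 1
    comment_totals[hour] += n_comments

print(dict(counts))
print(dict(comment_totals))
```

A `defaultdict(int)` creates a missing key with the value 0 on first access, which is why no membership test is needed.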


In [68]:
avg_by_hour = []

for hour in count_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour] / count_by_hour[hour]])

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


## Sorting and Printing Values from a List of Lists
I reorder the results, sorting them in descending order of average comments per post.

In [72]:
swap_avg_by_hour = []

for row in avg_by_hour:
    first = row[0]
    second = row[1]
    swap_avg_by_hour.append([second, first])
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [80]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 Hour For Ask Posts Comments:")
for avg, hour in sorted_swap[:5]:
    dt_time = dt.datetime.strptime(hour, "%H")
    dtf_hour = dt_time.strftime("%H:%M")
    print(f"{dtf_hour}: {avg:.2f} average comments per post")

Top 5 Hour For Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
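
Swapping the columns is one way to sort by the average. An alternative is to leave the `[hour, average]` pairs as they are and pass a `key` function to `sorted()`. The sketch below uses a hypothetical list, `sample_avg_by_hour`, holding a few values rounded from the output above.

```python
# A few [hour, average] pairs, rounded from the results above
sample_avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the second element (the average), highest first
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

One small difference: with the swapped lists, ties on the average are broken by the hour string, whereas the `key` version keeps tied entries in their original order.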


## Conclusions
Based on the results obtained, “Ask” posts published at 15:00 receive the highest average number of comments (38.59 per post) of any hour of the day. This is followed by 02:00, 20:00, 16:00, and 21:00.

Leaving aside the isolated peak at 02:00, two time intervals stand out among the top five hours: from 15:00 to 16:59 and from 20:00 to 21:59. During these hours, the average number of comments is higher. A possible explanation is that in the afternoon users who are working or studying turn to the Hacker News community for clarification. In the second interval, it is likely that users, having completed their daily activities, spend time on the site for leisure and self-directed learning.