# Exploring Hacker News Posts

## Overview
In this project, we explore and analyze a dataset of submissions to Hacker News, a popular tech-focused community site. The goal is to uncover trends in user submissions and identify factors that drive community engagement.

We're specifically interested in posts with titles that begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question, and `Show HN` posts to show the community a project or product. We want to determine which of these two kinds of posts receives more comments on average, and whether posts created at a certain time of day receive more comments.

Steps of the analysis:

- Remove headers from a list of lists
- Extract `Ask HN` and `Show HN` posts
- Calculate the average number of comments for `Ask HN` and `Show HN` posts
- Find the number of `Ask HN` posts and average comments by hour created
- Sort and print values from a list of lists

## Introduction

First, I will read the data and remove the headers.

In [40]:
from csv import reader

# The file is a UTF-8 encoded CSV; 'utf-8-sig' also strips a leading byte-order mark if one is present
with open('hacker_news.csv', encoding='utf-8-sig') as opened_file:
    hn = list(reader(opened_file))

hn_header = hn[0]  # keep the header row in its own variable so it is not treated as data
hn = hn[1:]        # drop the header row, so hn now starts at the first row of actual data

Next, I check that the header row was separated correctly.

In [42]:
print(hn_header)
print('\n')
explore_data(hn, 0, 3, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


Number of rows: 20100
Number of columns: 7
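
The `explore_data` helper called above is not defined in this section. A minimal sketch that would produce output of the same shape is shown below. The parameter names and the body are assumptions inferred from the call and its printed output, not the original definition, and the sample rows are made up.

```python
def explore_data(dataset, start, end, rows_and_columns=False):
    """Print rows dataset[start:end], optionally followed by the dataset's size.

    Hypothetical reconstruction: the signature and body are inferred from
    the call explore_data(hn, 0, 3, True) and the output it printed.
    """
    for row in dataset[start:end]:
        print(row)
        print('\n')  # leave blank lines between rows

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


# Tiny made-up dataset, for illustration only
sample = [
    ['1', 'Ask HN: example title', '3'],
    ['2', 'Show HN: another example', '5'],
]
explore_data(sample, 0, 1, True)
```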


Now that I have removed the headers from `hn`, it is time to filter the data. Since we are only concerned with post titles beginning with `Ask HN` or `Show HN`, I will create new lists of lists containing just the data for those titles.

## Extracting Ask HN and Show HN Posts

First, I will identify posts that begin with either `Ask HN` or `Show HN` and separate the data for those two types of posts into different lists. Separating the data makes it easier to analyze in the following steps.

In [45]:
# Create empty lists to hold each category of post
ask_posts = []
show_posts = []
other_posts = []

for post in hn:  # 'post' avoids shadowing the built-in id()
    title = post[1].lower()  # lowercase the title so the prefix check is case-insensitive
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)

print(f"Number of Ask HN posts: {len(ask_posts)}")
print(f"Number of Show HN posts: {len(show_posts)}")
print(f"Number of Other posts: {len(other_posts)}")

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of Other posts: 17194


## Calculating the Average Number of Comments for `Ask HN` and `Show HN` Posts

To find the average number of comments for each group, I sum the `num_comments` column (index 4) over the group's posts and divide by the number of posts in the group.

In [47]:
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])  # num_comments is stored as a string

# Every post counts toward the average, including those with zero comments
avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)

14.038417431192661


In [48]:
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])  # num_comments is stored as a string

# Every post counts toward the average, including those with zero comments
avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)

10.31669535283993


In [49]:
total_other_comments = 0

for post in other_posts:
    total_other_comments += int(post[4])  # num_comments is stored as a string

# Every post counts toward the average, including those with zero comments
avg_other_comments = total_other_comments / len(other_posts)

print(avg_other_comments)

26.8730371059672


From the last three cells, `Ask HN` posts average about 14 comments per post, `Show HN` posts average just over 10, and other posts average almost 27.

`Ask HN` posts receive more comments on average than `Show HN` posts, but neither type receives as many as other posts.

Since `Ask HN` posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
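
The three averaging cells above repeat the same loop. As a sketch, the calculation could be factored into a single helper. `average_comments` is a hypothetical name that does not appear in the original notebook, and the sample rows are made up to keep the example self-contained. The default index of 4 matches the position of `num_comments` in the header.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count of rows whose counts are stored as strings."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows in the dataset's column order, for illustration only
sample_posts = [
    ['1', 'Ask HN: first example', '', '10', '6', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: second example', '', '4', '9', 'user_b', '1/2/2016 11:30'],
]
print(average_comments(sample_posts))  # 7.5
```

With the real lists, `average_comments(ask_posts)` would be expected to reproduce the `Ask HN` average printed above.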

## Finding the Number of Ask Posts and Comments by Hour Created

Next, I will determine if `Ask HN` posts created at a certain time are more likely to attract comments. I will make use of the following steps to perform this analysis:

- Calculate the number of `Ask HN` posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments `Ask HN` posts receive by hour created.
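
As a small illustration of the parsing step these calculations rely on, the sketch below extracts the hour from one `created_at` value copied from the sample rows printed earlier. The format string assumes every timestamp follows the month/day/year hour:minute layout seen in those rows.

```python
import datetime as dt

# Sample value copied from the first data row printed earlier
created_at = "8/4/2016 11:52"

# Assumed layout: month/day/year hour:minute on a 24-hour clock
date_format = "%m/%d/%Y %H:%M"

parsed = dt.datetime.strptime(created_at, date_format)
hour = parsed.strftime("%H")  # zero-padded hour as a string
print(hour)  # 11
```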

In [52]:
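import datetime as dt

# Pair each Ask HN post's creation timestamp with its comment count
result_list = []
for post in ask_posts:
    result_list.append([post[6], int(post[4])])

comments_by_hour = {}  # hour ('00' to '23') -> total comments on posts created in that hour
counts_by_hour = {}    # hour ('00' to '23') -> number of posts created in that hour
date_format = "%m/%d/%Y %H:%M"

for created_at, num_comments in result_list:
    # Parse the timestamp and keep only the zero-padded hour
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    if hour in counts_by_hour:
        comments_by_hour[hour] += num_comments
        counts_by_hour[hour] += 1
    else:
        comments_by_hour[hour] = num_comments
        counts_by_hour[hour] = 1

comments_by_hour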

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

## Calculating the Average Number of Comments for `Ask HN` Posts by Hour

In [53]:
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

## Sorting the Values

In [79]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]]) # I swapped the columns such that the avg no of comments comes first. It is easier to rearrange that way

print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [55]:
sorted_avg = sorted(swap_avg_by_hour, reverse = True)
print(sorted_avg)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
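
Swapping the columns is one way to sort by the average. An alternative sketch uses the `key` argument of `sorted` to sort on the second element directly, leaving each row as `[hour, average]`. The three rows below are copied from the `avg_by_hour` output above; the variable names are illustrative only.

```python
# A few [hour, average] rows copied from the avg_by_hour output
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort by the average (index 1), highest first, without swapping columns
sorted_sample = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hour, avg in sorted_sample:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```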


In [56]:
print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_avg[:5]:
    hr = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print(f"At {hr}: {avg :.2F} average comments per post")

Top 5 Hours for 'Ask HN' Comments
At 15:00: 38.59 average comments per post
At 02:00: 23.81 average comments per post
At 20:00: 21.52 average comments per post
At 16:00: 16.80 average comments per post
At 21:00: 16.01 average comments per post


The hour with the highest average number of comments per post is 15:00, at 38.59, followed by 02:00 at 23.81. The top hour's average is about 62% higher than the second-highest.
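
The size of that gap can be checked directly from the two averages printed above:

```python
top_avg = 38.5948275862069       # 15:00, from the sorted output
second_avg = 23.810344827586206  # 02:00, from the sorted output

# Relative difference of the top hour over the runner-up, in percent
pct_higher = (top_avg - second_avg) / second_avg * 100
print(f"{pct_higher:.1f}% higher")  # 62.1% higher
```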

## Conclusion

In this project, I analyzed `Ask HN` and `Show HN` posts to identify which type of post and which time of day receive the most comments on average. `Ask HN` posts receive more comments on average than `Show HN` posts, and `Ask HN` posts created between 15:00 and 16:00 receive the most comments per post, so that is the best window for maximizing comments.