# Exploring Hacker News Posts

This is a Dataquest guided project in Python. The goal is to explore a sample of Hacker News posts to see whether "Ask HN" or "Show HN" posts receive more comments, and whether posts made at certain times of day receive more comments. Ask posts ask the community for information; Show posts show off a piece of work, such as code or a web page. For this exercise, Dataquest removed all posts that received no comments and then randomly sampled the remainder, which reduced the 300k posts of the original dataset to the 20k in the dataset I'll be using.

The whole dataset and documentation can be found [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts).

## Importing the dataset

In [1]:
from csv import reader

opened_file = open("hacker_news.csv")

read_file = reader(opened_file)

hn = list(read_file)

for row in hn[:5]: print(row, '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 
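
As an aside, the same import can be sketched with a `with` block, which closes the file handle automatically. To stay self-contained, this sketch reads from an in-memory string (`sample_csv`, a hypothetical two-row excerpt of the rows printed above) rather than from `hacker_news.csv`; the `encoding` mentioned in the comment is an assumption about the file, not something the dataset documentation confirms.

```python
import csv
import io

# Hypothetical excerpt standing in for hacker_news.csv (header plus two rows shown above).
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

# With the real file this might be (encoding is an assumption):
#     with open("hacker_news.csv", encoding="utf-8", newline="") as f:
with io.StringIO(sample_csv) as f:
    rows = list(csv.reader(f))

# Split the header row from the data rows in one step.
header, data = rows[0], rows[1:]

print(header)
print(len(data))  # 2
```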



## Separating the column headers from the dataset

In [2]:
hn_header = hn[0]

hn = hn[1:]

In [3]:
print(hn_header) #verifying that header row was placed into its own variable

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [4]:
#making a header dictionary for easier identification of column indices:
header_dict = {}
i = 0
for item in hn_header:
    header_dict[item] = i
    i += 1

In [5]:
print(header_dict.items()) #printing it out to refer back to

dict_items([('id', 0), ('title', 1), ('url', 2), ('num_points', 3), ('num_comments', 4), ('author', 5), ('created_at', 6)])
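
The same lookup table can also be built without a manual counter. The sketch below uses `enumerate()` inside a dict comprehension; it redefines `hn_header` locally so that it runs on its own.

```python
hn_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# enumerate() yields (index, name) pairs, so no separate counter variable is needed.
header_dict = {name: index for index, name in enumerate(hn_header)}

print(header_dict["title"])         # 1
print(header_dict["num_comments"])  # 4
```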


In [6]:
for row in hn[:5]: print(row, '\n') #verifying that the header row was removed from the main dataset

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'] 



## Separating out posts of interest

The next step is to separate out the posts we're interested in (Ask HN and Show HN) as two lists, from all the others.

In [7]:
ask_posts = []
show_posts = [] 
other_posts = []

print(header_dict["title"]) #column index

1


In [8]:
for row in hn:
    
    title = (row[1]).lower()

    if title.startswith("ask hn"):
        ask_posts.append(row)
        
    elif title.startswith("show hn"):
        show_posts.append(row)
        
    else:
        other_posts.append(row)

In [9]:
print("ask: ", len(ask_posts), "| show: ", len(show_posts), "| other: ", len(other_posts))

ask:  1744 | show:  1162 | other:  17194
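
The prefix test can also be pulled into a small helper, which makes the rule easy to check on its own. This is only a sketch: `classify_title` is a hypothetical name, and the titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles, for illustration only.
titles = ["Ask HN: How do you learn?", "SHOW HN: My side project", "Interactive Dynamic Video"]

groups = {"ask": [], "show": [], "other": []}
for title in titles:
    groups[classify_title(title)].append(title)

print({name: len(members) for name, members in groups.items()})
```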


## Average number of comments

Next I'll calculate the average number of comments received for different types of posts:

In [10]:
print(header_dict["num_comments"])

4


In [11]:
total_ask_comments = 0

total_show_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
    
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments

average_ask_comments = round(total_ask_comments/len(ask_posts))

average_show_comments = round(total_show_comments/len(show_posts))

print("Average number of comments:", "\nAsk posts: ", average_ask_comments, '\nShow posts: ', average_show_comments)


Average number of comments: 
Ask posts:  14 
Show posts:  10


On average, Ask posts receive more comments than Show posts: about 14 per post compared with about 10.
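
Since the two loops differ only in the list they read, the averaging step could be written once as a function. The sketch below assumes rows in the same column layout as the dataset; `average_column` and the toy rows are hypothetical.

```python
def average_column(rows, index):
    """Mean of the integer values stored as strings in column `index`."""
    values = [int(row[index]) for row in rows]
    return sum(values) / len(values)


# Hypothetical rows in the dataset's layout (num_points at index 3, num_comments at index 4).
toy_ask = [
    ["1", "Ask HN: a", "", "5", "10", "user1", "1/1/2016 10:00"],
    ["2", "Ask HN: b", "", "3", "20", "user2", "1/1/2016 11:00"],
]

print(average_column(toy_ask, 4))  # 15.0
```

The same function would also work for points by passing index 3 instead of 4.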

## Post timing

Dataquest requests a further analysis of whether the time of day affects how many comments Ask posts receive.

In [12]:
import datetime as dt

print(header_dict["created_at"])

6


### List of post times

The first task is to create a list containing the hour each Ask post was made and the number of comments it received, plus dictionaries of the overall number of posts at different hours and the number of comments at different hours.

In [13]:
result_list = []
counts_by_hour = {}
comments_by_hour = {}

#post time format example 9/30/2015 4:12

for post in ask_posts:
    
    hour = dt.datetime.strptime(post[6], "%m/%d/%Y %H:%M").strftime("%H") #process string into datetime object, format to extract just the hour
    
    #method chaining: `strptime()` returns a datetime object, and `.strftime("%H")` is then called on that object, giving the hour as a zero-padded string such as "04"
    
    comments = int(post[4])
    
    result_list.append([hour, comments])
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
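
An alternative sketch of the same tally uses `collections.defaultdict`, which removes the need to check whether the hour has been seen before, and reads the hour as an integer via the datetime object's `.hour` attribute. The `toy_posts` pairs are hypothetical and only mimic the dataset's date format.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs in the dataset's date format.
toy_posts = [("9/30/2015 4:12", 2), ("8/4/2016 11:52", 52), ("1/26/2016 11:30", 10)]

# Missing keys start at 0, so += works on the first occurrence of an hour.
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, comments in toy_posts:
    # .hour is an int (4, 11, ...) rather than a zero-padded string ("04", "11").
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += comments

print(dict(counts_by_hour))    # {4: 1, 11: 2}
print(dict(comments_by_hour))  # {4: 2, 11: 62}
```

Integer hours sort numerically without relying on zero-padding, at the cost of needing formatting later when printing.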

### Average number of comments

Next I will calculate the average number of comments per Ask post per hour:

In [14]:
avg_by_hour = []

for hour_key in counts_by_hour:
    avg_by_hour.append((hour_key, (comments_by_hour[hour_key]/counts_by_hour[hour_key])))

Average number of comments per post per hour, rounded to two decimal places:

In [15]:
avg_by_hour = sorted(avg_by_hour)

for hour in avg_by_hour:
    print(("{time} o\'clock: {comments:.2f} comments").format(time = hour[0], comments = hour[1]))

00 o'clock: 8.13 comments
01 o'clock: 11.38 comments
02 o'clock: 23.81 comments
03 o'clock: 7.80 comments
04 o'clock: 7.17 comments
05 o'clock: 10.09 comments
06 o'clock: 9.02 comments
07 o'clock: 7.85 comments
08 o'clock: 10.25 comments
09 o'clock: 5.58 comments
10 o'clock: 13.44 comments
11 o'clock: 11.05 comments
12 o'clock: 9.41 comments
13 o'clock: 14.74 comments
14 o'clock: 13.23 comments
15 o'clock: 38.59 comments
16 o'clock: 16.80 comments
17 o'clock: 11.46 comments
18 o'clock: 13.20 comments
19 o'clock: 10.80 comments
20 o'clock: 21.52 comments
21 o'clock: 16.01 comments
22 o'clock: 6.75 comments
23 o'clock: 7.99 comments


Just for fun, an improvised bar graph of the results:

In [16]:
for hour in avg_by_hour:
    print(hour[0], 'o\'clock: ', "--" * (int(hour[1])))

00 o'clock:  ----------------
01 o'clock:  ----------------------
02 o'clock:  ----------------------------------------------
03 o'clock:  --------------
04 o'clock:  --------------
05 o'clock:  --------------------
06 o'clock:  ------------------
07 o'clock:  --------------
08 o'clock:  --------------------
09 o'clock:  ----------
10 o'clock:  --------------------------
11 o'clock:  ----------------------
12 o'clock:  ------------------
13 o'clock:  ----------------------------
14 o'clock:  --------------------------
15 o'clock:  ----------------------------------------------------------------------------
16 o'clock:  --------------------------------
17 o'clock:  ----------------------
18 o'clock:  --------------------------
19 o'clock:  --------------------
20 o'clock:  ------------------------------------------
21 o'clock:  --------------------------------
22 o'clock:  ------------
23 o'clock:  --------------


Sorting the results from the highest to the lowest average number of comments per post:

In [17]:
avg_by_hour_reversed = []

for item in avg_by_hour:
    avg_by_hour_reversed.append((item[1], item[0]))
    
avg_by_hour_reversed = sorted(avg_by_hour_reversed, reverse = True)

In [18]:
print("Average number of comments per post per hour:\n")

for hour in avg_by_hour_reversed:
    
    hrformat = dt.datetime.strptime(hour[1], "%H").strftime("%H:%M")
    
    bar = "--" * (round(hour[0]))
    
    print(("{time}: {bar} {comments:.2f}").format(time = hrformat, bar = bar,  comments = hour[0]))

Average number of comments per post per hour:

15:00: ------------------------------------------------------------------------------ 38.59
02:00: ------------------------------------------------ 23.81
20:00: -------------------------------------------- 21.52
16:00: ---------------------------------- 16.80
21:00: -------------------------------- 16.01
13:00: ------------------------------ 14.74
10:00: -------------------------- 13.44
14:00: -------------------------- 13.23
18:00: -------------------------- 13.20
17:00: ---------------------- 11.46
01:00: ---------------------- 11.38
11:00: ---------------------- 11.05
19:00: ---------------------- 10.80
08:00: -------------------- 10.25
05:00: -------------------- 10.09
12:00: ------------------ 9.41
06:00: ------------------ 9.02
00:00: ---------------- 8.13
23:00: ---------------- 7.99
07:00: ---------------- 7.85
03:00: ---------------- 7.80
04:00: -------------- 7.17
22:00: -------------- 6.75
09:00: ------------ 5.58
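
Swapping the tuple elements is one way to sort by the average. Another is the `key` argument of `sorted()`, which leaves the pairs in their original order. The sketch below uses a few (hour, average) pairs copied from the output above; `ranked` is a hypothetical name.

```python
# A few (hour, average) pairs shaped like avg_by_hour, with values from the output above.
avg_by_hour = [("02", 23.81), ("15", 38.59), ("09", 5.58), ("20", 21.52)]

# key= selects the second element of each pair; reverse=True puts the largest first.
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print("{}:00: {:.2f}".format(hour, avg))
```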


Dataquest also asks me to print the results in a specific format using `str.format()`:

In [19]:
for hour in avg_by_hour_reversed:
    
    hrformat = dt.datetime.strptime(hour[1], "%H").strftime("%H:%M")
    
    print(("{time}: {comments:.2f} average comments per post").format(time = hrformat, comments = hour[0]))
            

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post


By this measure, the best time to post an Ask post for engagement is 15:00, which averages 38.59 comments per post. According to the documentation of the dataset, the times are recorded in the US Eastern Time zone (UTC-5). This means that the best time for me to post (from UTC+0) is at about 8pm.
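
The conversion can be checked with a short sketch. It assumes a fixed UTC-5 offset, as stated above, and so ignores daylight saving time; the date is arbitrary because only the hour matters.

```python
import datetime as dt

# Assumption: a fixed UTC-5 offset for US Eastern Time (daylight saving time is ignored).
eastern = dt.timezone(dt.timedelta(hours=-5))

# Arbitrary date; 15:00 is the hour with the highest average number of comments.
best_hour_et = dt.datetime(2016, 1, 26, 15, 0, tzinfo=eastern)
best_hour_utc = best_hour_et.astimezone(dt.timezone.utc)

print(best_hour_utc.strftime("%H:%M"))  # 20:00
```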

# Next steps:

* Determine if show or ask posts receive more points on average.
* Determine if posts created at a certain time are more likely to receive more points.
* Compare your results to the average number of comments and points other posts receive.

## Show/Ask points

In [20]:
print(header_dict.items())

dict_items([('id', 0), ('title', 1), ('url', 2), ('num_points', 3), ('num_comments', 4), ('author', 5), ('created_at', 6)])


In [21]:
#number of points on a post is at index 3

total_ask = 0
total_show = 0

for post in ask_posts: 
    total_ask += int(post[3])

for post in show_posts: 
    total_show += int(post[3])
    
avg_ask = total_ask/len(ask_posts)
avg_show = total_show/len(show_posts)

print(("The average number of points on an Ask post was {t1:.2f}, and the average number of points on a Show post was {t2:.2f}."
              ).format(t1 = avg_ask, t2 = avg_show))

The average number of points on an Ask post was 15.06, and the average number of points on a Show post was 27.56.


This result, in combination with previous exploration, suggests that on average when people ask a question on Hacker News they are quite likely to get lots of (hopefully helpful) comments. In contrast, when people post something they want to show off on Hacker News, they will receive a positive response (lots of votes/points), but not as many comments.

## Post timing vs points

In [22]:
#num_points index = 3
#created_at index = 6

timing_points = {}
timing_number = {}

for post in (ask_posts + show_posts):
    hour = dt.datetime.strptime(post[6], "%m/%d/%Y %H:%M").strftime("%H")
    points = int(post[3])
    
    if hour not in timing_number:
        timing_number[hour] = 1
        timing_points[hour] = points

    else:
        timing_number[hour] += 1
        timing_points[hour] += points
        
        
time_point_avg = []

for time in timing_number:
    avg = timing_points[time]/timing_number[time] #both values are already integers
    time_point_avg.append((avg, time))

print("Time of day vs average number of points for Ask and Show posts collectively (top 5 times)\n") 

for row in sorted(time_point_avg, reverse = True)[:5]:   #I don't really need to see the whole day, top 5 suffices
    hrformat = dt.datetime.strptime(row[1], "%H").strftime("%H:%M")
    print(("{hour}: {points:.2f} pts").format(hour = hrformat, points = row[0]))

Time of day vs average number of points for Ask and Show posts collectively (top 5 times)

15:00: 29.42 pts
16:00: 25.65 pts
12:00: 24.81 pts
13:00: 24.46 pts
18:00: 23.27 pts


## Comparison with "Other" posts



In [23]:
total_other_comments = 0

for post in other_posts:
    comments = int(post[4])
    total_other_comments += comments
    
average_other_comments = round(total_other_comments/len(other_posts))

print("Average number of comments on \"other\" posts:", average_other_comments)

Average number of comments on "other" posts: 27


Recall that Ask posts received an average of 14 comments, and Show posts only 10. This result indicates that, in aggregate, posts other than Ask and Show posts tend to receive many more comments. Since I previously only investigated Ask posts separately, I will now do some analysis on Show posts and Other posts as individual groups.

## Show/Other posts - timing

I'm going to need to extract information about the number of posts per hour and the number of comments per post per hour twice more, so why not practice making a function:


In [24]:
def post_hour(dataset):
    
    result_list = []
    counts_by_hour = {}
    comments_by_hour = {}

    for post in dataset:
    
        hour = dt.datetime.strptime(post[6], "%m/%d/%Y %H:%M").strftime("%H")
    
        comments = int(post[4])
    
        result_list.append([hour, comments])
    
        if hour not in counts_by_hour:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = comments
    
        else:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += comments
            
    return result_list, counts_by_hour, comments_by_hour

In [27]:
#Show posts
show_results, show_by_hour, show_comments_hour = post_hour(show_posts) #unpack the information returned by the function into 3 vars

In [28]:
#Other posts
other_results, other_by_hour, other_comments_hour = post_hour(other_posts)
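
One possible next step is to turn the two dictionaries returned by `post_hour()` into sorted averages. The sketch below is an assumption about how that might look, not part of the original notebook: `average_by_hour` is a hypothetical helper, and the toy dictionaries only mimic the shape of the function's output.

```python
def average_by_hour(counts_by_hour, comments_by_hour):
    """Return (hour, average comments per post) pairs, highest average first."""
    averages = [
        (hour, comments_by_hour[hour] / counts_by_hour[hour])
        for hour in counts_by_hour
    ]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


# Hypothetical dictionaries in the shape returned by post_hour().
toy_counts = {"09": 2, "15": 4}
toy_comments = {"09": 10, "15": 100}

for hour, avg in average_by_hour(toy_counts, toy_comments):
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

With the real data, the call would take `show_by_hour` and `show_comments_hour`, or `other_by_hour` and `other_comments_hour`.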