# Exploring Hacker News Posts
This project aims to determine the types of posts which receive the most comments on average.

Hacker News has specific posts that begin with 'Ask HN' or 'Show HN'. 'Ask HN' refers to a post in which a user asks the community a specific question. 'Show HN' refers to a post in which a user shows the community a project, product or something else of interest.

The goal for this project is to determine the following:
* Do 'Ask HN', 'Show HN' or Other posts receive more comments and points on average?
* Do posts created at a certain time receive more comments and points on average?
* Overall, what is best to post and when should it be posted?


## Opening & Exploring the Dataset
Before importing the dataset, we will create a reusable function that allows for quick exploration of datasets. It prints the requested rows and, if required, the number of rows and columns present.

In [1]:
def explore_data(dataset, start, end, rows_and_columns = False, header = False):
    dataset_slice = dataset[start:end] # slices data using inputted integer values
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
        
    rows = len(dataset)
    columns = len(dataset[0])
        
    if header:
        rows -= 1 # if dataset contains header, ignore header row
        
    if rows_and_columns:
        print('Number of rows:', rows)
        print('Number of columns:', columns)

The Hacker News dataset can now be imported for analysis. You can find the dataset [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts?resource=download).

In [2]:
from csv import reader

# Read the whole CSV into a list of rows; the with block closes the file afterwards.
with open("Datasets/HN_posts_year_to_Sep_26_2016.csv", encoding = 'utf8') as opened_hacker:
    hacker = list(reader(opened_hacker))

In [3]:
explore_data(hacker, 0, 6)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']




The first 5 rows of the dataset have been printed above, alongside the header columns. Below are descriptions of the columns:
* `id`: the unique identifier from Hacker News for the post
* `title`: the title of the post
* `url`: the URL that the post links to, if the post has a URL
* `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: the number of comments on the post
* `author`: the username of the person who submitted the post
* `created_at`: the date and time of the post's submission
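
Because each row is a plain list, fields are read by position (for example, `row[4]` for `num_comments`). As a small illustration, and not part of the original analysis, a row can be paired with the header names to make that mapping explicit. The names `sample_row` and `record` are illustrative.

```python
# Header and first data row, copied from the output above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']

# Pair each column name with its value so fields can be read by name.
record = dict(zip(headers, sample_row))

# Every value is read from the CSV as a string, so numeric fields need int().
num_comments = int(record['num_comments'])
print(headers.index('num_comments'), num_comments)
```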

## Cleaning the Dataset

### Removal of Headers
In order to analyse the data, we need to remove the row containing the column headers.


In [4]:
headers = hacker[0]
hacker = hacker[1:]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [5]:
explore_data(hacker, 0, 5)

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']




We can now see the headers are no longer present in the dataset.

### Removal of Non-Commented Posts
Given that the aim of the project is to assess the average number of comments for each post type, we aim to reduce the size of the dataset by removing all submissions that did not receive any comments.

In [6]:
for row in hacker:
    num = row[4]
    if int(num) == 0:
        del row

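We will loop over the dataset to check whether the previous code cell performed as expected. A count of 0 would mean every post without comments was removed. The count printed below is not 0: `del row` only unbinds the loop variable and leaves the list `hacker` unchanged, so the posts without comments remain in the dataset and are included in all of the figures that follow.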

In [7]:
count = 0
for row in hacker:
    num = row[4]
    if int(num) == 0:
        count += 1
print(count)

212718
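
Deleting the loop variable does not change the list it came from, which is why the count above is not zero. One way to actually drop the posts without comments is to build a new list. The sketch below uses a few made-up rows (`sample_rows` is illustrative, not taken from the dataset) and the same column index, 4, for `num_comments`. Applying this to `hacker` would change every figure reported later, so the numbers in this notebook should be read as covering all posts.

```python
# Made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ['1', 'Ask HN: Example question?', '', '4', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Example link post', 'http://example.com', '1', '0', 'user_b', '9/26/2016 3:26'],
    ['3', 'Show HN: Example project', 'http://example.org', '2', '1', 'user_c', '9/26/2016 0:36'],
]

# `del row` inside a for loop only unbinds the name `row`; the list is untouched.
for row in sample_rows:
    if int(row[4]) == 0:
        del row
print(len(sample_rows))  # still 3

# Building a new list keeps only the rows with at least one comment.
commented = [row for row in sample_rows if int(row[4]) > 0]
print(len(commented))  # 2
```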


### Filtering of Posts
Since we want to compare post titles beginning with 'Ask HN' or 'Show HN' against all other posts, we'll create three new 2D lists: one for Ask HN posts, one for Show HN posts and one for the remaining posts.

To find the posts that begin with either 'Ask HN' or 'Show HN', we'll use the string method `startswith`.
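
`startswith` is case-sensitive, so titles are lowercased before matching. A minimal sketch of that check on a few example titles (the helper name `post_type` is hypothetical and is not used elsewhere in this notebook; the second title is deliberately upper-cased to show the effect of `lower`):

```python
def post_type(title):
    """Return 'ask', 'show' or 'other' based on the start of the title."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

for title in ['Ask HN: What TLD do you use for local development?',
              'SHOW HN: Finding puns computationally',
              'algorithmic music']:
    print(post_type(title), '-', title)
```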

In [8]:
ask_posts = []
show_posts = []
other_posts = []

for row in hacker:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
    

In [9]:
print(" Ask HN: {:,}\n Show HN: {:,}\n Other: {:,}.".format(len(ask_posts), len(show_posts), len(other_posts)))

 Ask HN: 9,139
 Show HN: 10,158
 Other: 273,822.


In [10]:
explore_data(ask_posts, 0, 5)

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']


['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']


['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']


['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']




In [11]:
explore_data(show_posts, 0, 5)

['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']


['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']


['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44']


['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17']


['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']




## Analysis of Dataset
### How to Attract Comments
#### Which Posts Receive More Comments?

In [12]:
total_ask_comments = 0
total_show_comments = 0
total_other_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print("The average number of ask comments is ", avg_ask_comments)

for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print("The average number of show comments is ", avg_show_comments)

for row in other_posts:
    total_other_comments += int(row[4])
avg_other_comments = total_other_comments / len(other_posts)
print("The average number of other comments is ", avg_other_comments)

The average number of ask comments is  10.393478498741656
The average number of show comments is  4.886099625910612
The average number of other comments is  6.4572678601427205


We can see that 'Ask HN' posts receive more than twice as many comments on average (10.39) as 'Show HN' posts (4.89), showing that Ask HN posts attract more comments. Posts other than Show HN or Ask HN receive 6.46 comments on average.

#### Are Ask Posts Created at Specific Times More Likely to Attract Comments?

To answer this question, the following steps will be carried out:
* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.
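
Both steps rely on extracting the hour from the `created_at` string. A short sketch of that parsing step, using a timestamp from the output shown earlier:

```python
import datetime as dt

created_at = '9/26/2016 3:26'  # format: month/day/year hour:minute

# strptime turns the string into a datetime object using the given format.
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

# strftime('%H') gives the zero-padded hour as a string; .hour gives an int.
hour_label = parsed.strftime('%H')
print(hour_label, parsed.hour)
```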

In [13]:
import datetime as dt

counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments received by those posts

for row in ask_posts:
    time = row[6]
    num = int(row[4])
    time = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    hour = time.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num

We now have two dictionaries:
* `counts_by_hour`: contains the number of ask posts created during each hour of the day.
* `comments_by_hour`: contains the total number of comments received by ask posts created during each hour of the day.

We can now use these two dictionaries to calculate the average number of comments for posts created during each hour of the day. 

In [14]:
avg_by_hour = []

for post in counts_by_hour:
    avg = float(comments_by_hour[post]) / counts_by_hour[post]
    avg_by_hour.append([post, avg])
avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. We will sort the 2D list and print the 5 highest values in a more readable format.
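
The cells below swap the two columns so that `sorted` orders the list by the average. An alternative sketch, not the approach used in this notebook, passes a `key` function to `sorted` so that no swap is needed. The three pairs are taken from the output above and rounded to two decimal places.

```python
# [hour, average comments] pairs, a subset of avg_by_hour rounded for brevity.
sample_avg_by_hour = [['02', 11.14], ['15', 28.68], ['13', 16.32]]

# Sort by the second element of each pair, largest first.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```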

In [15]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour

[[11.137546468401487, '02'],
 [7.407801418439717, '01'],
 [8.804177545691905, '22'],
 [8.687258687258687, '21'],
 [7.163043478260869, '19'],
 [9.449744463373083, '17'],
 [28.676470588235293, '15'],
 [9.692007797270955, '14'],
 [16.31756756756757, '13'],
 [8.96474358974359, '11'],
 [10.684397163120567, '10'],
 [6.653153153153153, '09'],
 [7.013274336283186, '07'],
 [7.948339483394834, '03'],
 [6.696793002915452, '23'],
 [8.749019607843136, '20'],
 [7.713298791018998, '16'],
 [9.190661478599221, '08'],
 [7.5647840531561465, '00'],
 [7.94299674267101, '18'],
 [12.380116959064328, '12'],
 [9.7119341563786, '04'],
 [6.782051282051282, '06'],
 [8.794258373205741, '05']]

In [16]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 Hours for Ask Posts Comments")
template = "{}: {:.2f} average comments per post"
for i in range(5):
    print(template.format(dt.datetime.strptime(sorted_swap[i][1], "%H").strftime("%H:%M"), sorted_swap[i][0]))

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


From the displayed results, Ask HN posts created at 15:00 EST (20:00 BST) receive the most comments on average: 28.68 per post, 76% more than the second-highest average displayed.
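
The percentage quoted above is the relative difference between the two highest hourly averages. As a worked check using the values from the earlier output (the function name `percent_increase` is illustrative):

```python
def percent_increase(new, old):
    """Relative increase of `new` over `old`, expressed as a percentage."""
    return (new - old) / old * 100

top = 28.676470588235293     # 15:00 average from the output above
second = 16.31756756756757   # 13:00 average from the output above

print(round(percent_increase(top, second)))
```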

#### Are Show Posts Created at Specific Times More Likely to Attract Comments?
To answer this question, the same steps will be followed. As we are repeating the same code, we will make it a function.

In [17]:
def top_comments(dataset, post_type):
    counts_by_hour = {}    # hour -> number of posts created in that hour
    comments_by_hour = {}  # hour -> total comments received by those posts

    for row in dataset:
        time = row[6]
        num = int(row[4])
        time = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
        hour = time.strftime("%H")
        if hour in counts_by_hour:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += num
        else:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = num

    avg_by_hour = []

    for post in counts_by_hour:
        avg = float(comments_by_hour[post]) / counts_by_hour[post]
        avg_by_hour.append([post, avg])

    swap_avg_by_hour = []
    for row in avg_by_hour:
        swap_avg_by_hour.append([row[1], row[0]])

    sorted_swap = sorted(swap_avg_by_hour, reverse = True)
    print("Top 5 Hours for {} Comments".format(post_type))
    template = "{}: {:.2f} average comments per post"
    for i in range(5):
        print(template.format(dt.datetime.strptime(sorted_swap[i][1], "%H").strftime("%H:%M"), sorted_swap[i][0]))

In [18]:
top_comments(show_posts, 'Show Post')

Top 5 Hours for Show Post Comments
12:00: 6.99 average comments per post
07:00: 6.68 average comments per post
11:00: 6.00 average comments per post
08:00: 5.60 average comments per post
14:00: 5.52 average comments per post


Show HN posts created at 12:00 EST (17:00 BST) receive the most comments on average: 6.99 per post, only 0.31 more than the 2nd-highest average displayed. In fact, the highest average is only 27% greater than the 5th highest. This suggests that the time a Show HN post is created plays less of a role in its ability to generate comments than it does for Ask HN posts.

#### Are Other Posts Created at Specific Times More Likely to Attract Comments?
To answer this question, the previously created function will be used. Other posts are posts that are neither Ask HN posts nor Show HN posts.

In [19]:
top_comments(other_posts, "Other")

Top 5 Hours for Other Comments
12:00: 7.59 average comments per post
11:00: 7.37 average comments per post
02:00: 7.18 average comments per post
13:00: 7.15 average comments per post
05:00: 6.79 average comments per post


Other posts created at 12:00 EST (17:00 BST) receive the most comments on average. However, as with Show HN posts, the hourly averages are very close together, with less than half a comment separating the highest average from the 4th highest.

Interestingly, both Show HN and Other posts created at 12:00 EST (17:00 BST) attract the most comments, suggesting that this is the time users should post to potentially maximise their chance of receiving comments. Furthermore, 11:00 EST (16:00 BST) is 2nd highest for Other posts and 3rd highest for Show HN posts, suggesting that a time interval of 11:00 - 13:00 EST is best for creating posts.

To check whether this pattern holds more generally, the same analysis will be run on all posts.

#### Are Posts Created at Specific Times More Likely to Attract Comments?

In [20]:
top_comments(hacker, "All")

Top 5 Hours for All Comments
12:00: 7.69 average comments per post
11:00: 7.37 average comments per post
13:00: 7.34 average comments per post
02:00: 7.27 average comments per post
15:00: 7.05 average comments per post


All of the hours displayed have appeared previously in the analysis, most notably 11:00 and 12:00 EST, which have the second-highest and highest averages respectively. Posts created at 11:00 EST generate only 0.03 more comments on average than posts created at 13:00 EST (18:00 BST). 13:00 EST was also the 4th-highest hour for Other posts and the 2nd-highest for Ask HN posts, so it also appears to be a suitable time to post to generate more comments.

From this analysis, it can be said that the most suitable time interval to create a post to attract more comments would be 11:00 - 14:00 EST (16:00 - 19:00 BST).

### How to Generate Points on Posts
#### Which Posts Receive More Points?

In [21]:
total_ask_points = 0
total_show_points = 0
total_other_points = 0

for row in ask_posts:
    total_ask_points += int(row[3])
avg_ask_points = total_ask_points / len(ask_posts)
print("The average number of points for Ask HN posts is ", avg_ask_points)

for row in show_posts:
    total_show_points += int(row[3])
avg_show_points = total_show_points / len(show_posts)
print("The average number of points for Show HN posts is ", avg_show_points)

for row in other_posts:
    total_other_points += int(row[3])
avg_other_points = total_other_points / len(other_posts)
print("The average number of points for Other posts is ", avg_other_points)

The average number of points for Ask HN posts is  11.31174089068826
The average number of points for Show HN posts is  14.843571569206537
The average number of points for Other posts is  15.156010108756783


Here, we can see that Other posts receive the most points, at about 15 per post on average, so avoiding the Ask HN and Show HN formats is associated with more points. Ask HN posts receive the fewest, at about 11 points per post on average.

Considering that Ask HN posts generate the most comments, it is interesting that they receive the fewest points, suggesting that the Ask HN and Show HN communities may differ in how they interact. As Ask HN posts invite users to comment and answer, receiving more comments and fewer points is understandable, since showing appreciation for the post by upvoting is less necessary. In contrast, Show HN posts may be less likely to receive comments because they invite users to show appreciation through an upvote.

#### Are Ask Posts Created at Specific Times More Likely to Receive Points?
To answer this question, an approach similar to what was used to analyse the comments will be followed. 
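
The comment and point analyses differ only in which column is averaged. As a sketch of how the two could share one helper (the name `average_by_hour` and its parameters are assumptions for illustration; the notebook itself defines a separate function below), the column index can be passed as an argument and the results returned instead of printed:

```python
import datetime as dt

def average_by_hour(dataset, value_index, time_index=6):
    """Return [hour, average] pairs sorted by average, largest first."""
    totals = {}
    counts = {}
    for row in dataset:
        hour = dt.datetime.strptime(row[time_index], '%m/%d/%Y %H:%M').strftime('%H')
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    averages = [[hour, totals[hour] / counts[hour]] for hour in counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)

# Made-up rows in the dataset's column order (index 3: points, index 4: comments).
sample_rows = [
    ['1', 'Ask HN: A?', '', '4', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: B?', '', '6', '3', 'user_b', '9/26/2016 2:17'],
    ['3', 'Ask HN: C?', '', '1', '2', 'user_c', '9/25/2016 22:57'],
]

print(average_by_hour(sample_rows, 3))  # average points per hour
print(average_by_hour(sample_rows, 4))  # average comments per hour
```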

In [22]:
def most_points(dataset, post_type):
    counts_by_hour = {}  # hour -> number of posts created in that hour
    points_by_hour = {}  # hour -> total points received by those posts

    for row in dataset:
        time = row[6]
        points = int(row[3])
        time = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
        hour = time.strftime("%H")
        if hour in counts_by_hour:
            counts_by_hour[hour] += 1
            points_by_hour[hour] += points
        else:
            counts_by_hour[hour] = 1
            points_by_hour[hour] = points

    avg_by_hour = []

    for post in counts_by_hour:
        avg = float(points_by_hour[post]) / counts_by_hour[post]
        avg_by_hour.append([post, avg])

    swap_avg_by_hour = []
    for row in avg_by_hour:
        swap_avg_by_hour.append([row[1], row[0]])

    sorted_swap = sorted(swap_avg_by_hour, reverse = True)
    print("Top 5 Hours to Generate Points for {} Posts".format(post_type))
    template = "{}: {:.2f} average points per post"
    for i in range(5):
        print(template.format(dt.datetime.strptime(sorted_swap[i][1], "%H").strftime("%H:%M"), sorted_swap[i][0]))

In [23]:
most_points(ask_posts, "Ask HN")

Top 5 Hours to Generate Points for Ask HN Posts
15:00: 21.64 average points per post
13:00: 17.93 average points per post
12:00: 13.58 average points per post
10:00: 13.44 average points per post
17:00: 12.19 average points per post


Interestingly, 4 of the 5 hours in this top 5 also appear in the top 5 hours for attracting comments; only 17:00 EST (the 5th-highest average) does not. Furthermore, the top 3 hours appear in the same order in both lists. This suggests that these hours may coincide with a larger number of users being active on the platform. That data is not currently available; however, analysing when most users are active could provide further insight.
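
The overlap between the two top-5 lists can be checked with a set intersection. A small sketch using the hours printed in the two Ask HN outputs above:

```python
# Top 5 hours for Ask HN posts, copied from the two outputs above.
top_comment_hours = ['15:00', '13:00', '12:00', '02:00', '10:00']
top_point_hours = ['15:00', '13:00', '12:00', '10:00', '17:00']

# Hours that appear in both lists, and hours unique to the points list.
shared = set(top_comment_hours) & set(top_point_hours)
only_in_points = set(top_point_hours) - set(top_comment_hours)

print(len(shared), sorted(only_in_points))
```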
#### Are Show Posts Created at Specific Times More Likely to Receive Points?

In [24]:
most_points(show_posts, "Show HN")

Top 5 Hours to Generate Points for Show HN Posts
12:00: 20.91 average points per post
11:00: 19.26 average points per post
13:00: 17.02 average points per post
19:00: 16.06 average points per post
06:00: 15.99 average points per post


Again, we see some similarity between the Show HN comments list and points list, with 2 of the 5 hours (12:00 and 11:00 EST) present in both. Posts created at 12:00 EST (17:00 BST) not only attract the most comments but also generate the most points, further suggesting that this is a good time to post on the site.
#### Are Other Posts Created at Specific Times More Likely to Receive Points?

In [25]:
most_points(other_posts, "Other")

Top 5 Hours to Generate Points for Other Posts
02:00: 16.71 average points per post
12:00: 16.70 average points per post
11:00: 16.29 average points per post
00:00: 16.12 average points per post
13:00: 16.02 average points per post


Similarity between the comments and points lists appears to be a theme: 4 of the 5 hours are present in both lists, with only 00:00 EST absent from the comments list. 12:00 EST and 11:00 EST are 2nd and 3rd in this list, and 13:00 EST (18:00 BST) is 5th, supporting the suggestion that posting between 11:00 and 13:00 EST is best. The same analysis will now be run on all posts to see whether this holds.
#### Are Posts Created at Specific Times More Likely to Receive Points?

In [26]:
most_points(hacker, "All")

Top 5 Hours to Generate Points for All Posts
12:00: 16.79 average points per post
02:00: 16.41 average points per post
11:00: 16.19 average points per post
13:00: 16.11 average points per post
00:00: 15.88 average points per post


Posts created at 12:00 EST (17:00 BST) receive the most points on average, at 16.79 per post. We also see 11:00 EST and 13:00 EST in 3rd and 4th place respectively. This suggests that the best time interval to create a post in order to generate points is 11:00 - 14:00 EST (16:00 - 19:00 BST). Note that, because the earlier step intended to remove posts without comments left the dataset unchanged, these averages cover all posts, including those with no comments.

## Conclusion
In this project, we analysed posts to determine which post type and posting time receive the most comments and points on average. Based on the analysis, to maximise the number of comments a post receives, it is recommended that the post is an Ask HN post created between 15:00 and 16:00 EST (20:00 - 21:00 BST). However, across all posts, it is best to post between 11:00 and 14:00 EST (16:00 - 19:00 BST).

This time interval also holds for generating points, so posting between 11:00 EST and 14:00 EST appears to maximise both the number of comments and the number of points a post can receive. These figures are based on all posts in the dataset, since the check in the cleaning step showed that posts without comments were not removed.

The fact that this time interval holds for both comments and points suggests that these may be the hours when users are most active on Hacker News. Data on the number of users active at each hour of the day is not available; however, this would be worth investigating to find out whether that is the case.