# Exploring Hacker News Posts 

Time is a crucial factor in any analysis. Analyzing when events occur can reveal underlying trends and systemic patterns. For example, it can help a social media user choose the best time to post content so that it engages the community as much as possible.

In this project, we will work with a dataset of submissions to a popular technology site called Hacker News. We will try to identify the 'golden hours' for users looking to create posts on the platform, then we will explore the type of posts that draw user engagement.

![](https://wiredcraft.com/images/posts/hacker-news.png)


## Overview

[Hacker News](https://news.ycombinator.com/) is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. The platform is extremely popular in technology and startup circles, and posts that make it to the top listing can get hundreds of thousands of visitors.


## Dataset Information

The original dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). For this analysis, we have reduced the data from almost *300,000 rows* to approximately *20,000 rows*. We did this by eliminating submissions that didn't receive any comments and then randomly sampling from the remaining submissions. The `7` columns in the dataset are described below:

- `id`:- Unique identifier from Hacker News for the post
- `title`:- Title of the post
- `url`:- The URL that the post links to, if the post has a URL
- `num_points`:- Number of points the post acquired (calculated as *total upvotes - total downvotes*)
- `num_comments`:- Number of comments on the post
- `author`:- Username of the person who submitted the post
- `created_at`:- Date and time of the post's submission

## Categories of Interest

We're specifically interested in posts with titles that begin with either Ask HN or Show HN. 

- Users submit `Ask HN` posts to ask the Hacker News community a specific question, for example:

```
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?

```
- Users submit `Show HN` posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:

```
Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made

```

## Importing Libraries
As we progress in this project, we will need to read files, work with dates and times, and create visualizations. To do so, we will import the `reader` function from Python's `csv` module, the `datetime` module, and Plotly's visualization libraries: `plotly.express`, `plotly.graph_objects`, and `plotly.subplots`.

In [1]:
from csv import reader
import datetime as dt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

## Data Opening and Exploration
To open and explore the dataset, we will create a function called `extract_data()` that takes a file path as an argument and returns the contents of the file as a list of lists. For ease of analysis, the function will return the column headers and the actual data as separate entities.

In [2]:
def extract_data(filepath):
    """Locates the file at filepath, reads it, then returns the header row and the data rows."""

    # the with statement closes the file once it has been read
    with open(filepath) as opened_file:
        result = list(reader(opened_file))

    # return both the column headers and the actual data as separate entities
    return result[0], result[1:]

We will define another function called `explore_data()` that lets us display a specified range of dataset rows in a readable format:

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    
    """
    Displays dataset rows in readable format
  
    Parameters:
    dataset (list): a list of lists
    start (int): start index for dataset slice
    end (int): end index for dataset slice
    rows_and_columns (bool): specifies whether to print the number of rows and columns.
    
    output:
    prints the sliced dataset rows in readable format
    prints the number of dataset rows and columns if rows_and_columns is True
  
    """
    
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows: {:,}'.format(len(dataset)))
        print('Number of columns:', len(dataset[0]))

Now, let's open and explore the Hacker News dataset using the functions we created:

In [4]:
# extract the header and data from 'hacker_news.csv'
hn_header, hn = extract_data('hacker_news.csv')

# print the column headers and explore the first four rows of 'hacker_news.csv'
print(hn_header, '\n')

explore_data(hn, 0, 4, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


Number of rows: 20,100
Number of columns: 7


### Observations
*The extracted data contains a total of **20,100 rows** and **7 columns**. The following columns will be useful in our analysis - `title`, `num_points`, `num_comments`, and `created_at`.*

Now that we have extracted the dataset and identified the columns that are relevant to our analysis, we are ready to isolate the posts that are of particular interest to us.
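The cells that follow read these columns by position (`row[1]`, `row[4]`, `row[6]`). As a minimal sketch, the positions can be looked up from the header row instead of being memorized. The constant names below are hypothetical and do not appear elsewhere in this notebook.

```python
# Header row as printed earlier in this notebook.
hn_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical constants naming the columns of interest.
TITLE = hn_header.index('title')
NUM_POINTS = hn_header.index('num_points')
NUM_COMMENTS = hn_header.index('num_comments')
CREATED_AT = hn_header.index('created_at')

# One row copied from the output above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Values are stored as strings, so numeric columns need an explicit conversion.
print(sample_row[TITLE], int(sample_row[NUM_COMMENTS]), sample_row[CREATED_AT])
```

With this approach, `row[NUM_COMMENTS]` would be equivalent to `row[4]`.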

## Isolating the Ask HN and Show HN Posts

We are only concerned with post titles beginning with `Ask HN` or `Show HN`. Thus, we will create new lists containing just the data for those titles and store them in corresponding variables.

To find the posts that begin with either `Ask HN` or `Show HN`, we'll use Python's built-in string method `str.startswith()`. We also need the `str.lower()` method to ensure that our search is case insensitive (i.e., unaffected by irregular capitalization of `Ask HN` or `Show HN`).
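The following sketch shows how lowercasing a title before calling `str.startswith()` makes the match case insensitive. The helper `post_category()` is hypothetical, and the second sample title has had its capitalization altered on purpose to illustrate the point.

```python
def post_category(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()  # normalize case so 'ASK HN' and 'Ask HN' match alike
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


titles = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: Something pointless I made',  # capitalization altered for illustration
    'Interactive Dynamic Video',
]

for title in titles:
    print(post_category(title), '|', title)
```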

## 1. Removing Headers from a List of Lists

The first row of the CSV file contains the column headers, and each row after it contains the data for one post. To analyze the data, we need to keep the header row apart from the data rows. The `open_dataset()` function below does the same job as `extract_data()`, but it reads the file with an explicit UTF-8 encoding and takes an optional `header` flag.

In [5]:
def open_dataset(file_name, header=True):        
    # the with statement closes the file once it has been read
    with open(file_name, encoding='utf8') as opened_file:
        data = list(reader(opened_file))
    
    if header:
        return data[1:], data[0]
    else:
        return data
    
hn, headers = open_dataset('hacker_news.csv')
display(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Now let's display the first five rows of `hn`.

In [6]:
for row in hn[:5]:
    print(row)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


## 2. Extracting Ask HN and Show HN Posts

Now that we've removed the headers from `hn`, we're ready to filter our data. Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles. To find the posts that begin with either `Ask HN` or `Show HN`, we'll use the string method `str.startswith()`.

First, we will create three empty lists called `ask_posts`, `show_posts`, and `other_posts`.

In [7]:
ask_posts = []
show_posts = []
other_posts = []

Now we will loop through each row in `hn` and separate the posts into `ask_posts`, `show_posts`, and `other_posts` according to their titles.

In [8]:
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


## 3. Calculating the Average Number of Comments for Ask HN and Show HN Posts

Next, let's determine whether ask posts or show posts receive more comments on average.

For each category, we first sum the number of comments over all posts, then compute the average with the expression `avg_num_comments = total_num_comments / length`, where `length` is the number of posts in that category.

In [9]:
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print("The average number of comments for Ask HN is:", avg_ask_comments)

The average number of comments for Ask HN is: 14.038417431192661


In [10]:
total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments

avg_show_comments = total_show_comments / len(show_posts)
print("The average number of comments for Show HN is:", avg_show_comments)

The average number of comments for Show HN is: 10.31669535283993


Average number of comments per category:
- Ask posts: 14.04
- Show posts: 10.32

Ask posts receive an average of about four more comments than show posts.
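The two loops above follow the same pattern, so they could be folded into one function. The sketch below assumes the comment count sits at index `4` of each row, as it does in this dataset. The name `average_comments()` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of rows (hypothetical helper)."""
    # Counts are stored as strings in the CSV, so convert before summing.
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the dataset's column order; only the comment counts matter here.
sample_posts = [
    ['1', 'Ask HN: first example', '', '10', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: second example', '', '20', '29', 'user_b', '11/22/2015 13:43'],
    ['3', 'Ask HN: third example', '', '5', '1', 'user_c', '5/2/2016 10:14'],
]

print(average_comments(sample_posts))  # (6 + 29 + 1) / 3 = 12.0
```

Note that the function raises `ZeroDivisionError` for an empty list, so a caller would need to handle categories with no posts.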

Now we will calculate the percentage difference in the number of posts and in the average number of comments, using the `Ask HN` figures as the base.

In [11]:
# Percentage difference in the number of posts
print(((len(ask_posts) - len(show_posts))/len(ask_posts))*100)

33.37155963302752


In [12]:
# Percentage difference in the avg number of comments
print(((avg_ask_comments - avg_show_comments)/avg_ask_comments)*100)

26.510980291006664


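From our summary measures above:

- There are about 33% fewer `Show HN` posts than `Ask HN` posts (equivalently, about 50% more `Ask HN` posts than `Show HN` posts).
- `Show HN` posts receive about 26.5% fewer comments on average than `Ask HN` posts (equivalently, `Ask HN` posts receive about 36% more).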

These findings answer our first question: which type of post receives more comments on average? `Ask HN` posts do, which suggests that they generate more discussion. Because more activity is associated with these posts, the site may want to prioritize them and provide additional resources for handling them.
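A percentage difference depends on which quantity serves as the base. The sketch below uses the counts and averages printed above and computes each difference both ways. The function name `percent_difference()` is hypothetical.

```python
def percent_difference(value, base):
    """Return how far value lies above (+) or below (-) base, in percent."""
    return (value - base) / base * 100


# Figures taken from the outputs above.
ask_count, show_count = 1744, 1162
avg_ask, avg_show = 14.038417431192661, 10.31669535283993

# Show HN measured against Ask HN (the base used in the cells above, sign flipped).
print(round(percent_difference(show_count, ask_count), 1))
print(round(percent_difference(avg_show, avg_ask), 1))

# Ask HN measured against Show HN gives larger percentages for the same gap.
print(round(percent_difference(ask_count, show_count), 1))
print(round(percent_difference(avg_ask, avg_show), 1))
```

The same gap of 582 posts is roughly a third of the `Ask HN` count but roughly half of the `Show HN` count, which is why the wording of a percentage claim has to name its base.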

## 4. Finding the Number of Ask Posts and Comments by Hour Created

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:


1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

We'll start with the first step: calculating the number of ask posts and comments by hour created. We'll use the [`datetime` module](https://docs.python.org/3/library/datetime.html) to work with the data in the `created_at` column.
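As a small illustration of the parsing step, the sketch below converts one `created_at` string from the dataset into a `datetime` object and extracts the hour. The format string matches the `month/day/year hour:minute` layout seen in the rows above.

```python
import datetime as dt

raw = '8/16/2016 9:55'  # a created_at value in the dataset's format

# %m = month, %d = day, %Y = four-digit year, %H = hour (24-hour clock), %M = minute
parsed = dt.datetime.strptime(raw, '%m/%d/%Y %H:%M')

print(parsed)                 # 2016-08-16 09:55:00
print(parsed.hour)            # 9 (an integer)
print(parsed.strftime('%H'))  # 09 (a zero-padded string)
```

The zero-padded string form is convenient as a dictionary key because it sorts in clock order.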


In [13]:
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
    
print(result_list[:5])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


Now that we have a list of lists containing the date and number of comments, we can split it further into two dictionaries:

- `counts_by_hour`: contains the number of ask posts created during each hour of the day.
- `comments_by_hour`: contains the total number of comments received by ask posts created during each hour.

In [14]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    num_comments = row[1]
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date.strftime('%H')
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] =  num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
        
display("Posts per hour: ", counts_by_hour)
display("Comments per hour: ", comments_by_hour)

'Posts per hour: '

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

'Comments per hour: '

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

The hour with the most posts was `15:00` (3:00 PM), with 116 posts, and the hour with the most comments was also `15:00`, with 4,477 comments.
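The `if`/`else` bookkeeping in the cell above can also be written with `collections.defaultdict`, which supplies a starting value of `0` for unseen keys. This is an alternative sketch, not the notebook's method. The first five sample rows come from the `result_list` output above, and the sixth is a hypothetical row added so that one hour appears twice.

```python
from collections import defaultdict
import datetime as dt

result_list = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
    ['1/1/2016 9:10', 4],  # hypothetical row, added to show accumulation
]

counts_by_hour = defaultdict(int)    # missing hours start at 0
comments_by_hour = defaultdict(int)

for created_at, num_comments in result_list:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```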

## 5. Calculating the Average Number of Comments for Ask HN Posts by Hour

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day. To do this, we'll create a list of lists containing the hours during which posts were created and the average number of comments those posts received.

In [15]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour,round(comments_by_hour[hour] / counts_by_hour[hour],2)])
    
display(avg_by_hour)

[['09', 5.58],
 ['13', 14.74],
 ['10', 13.44],
 ['14', 13.23],
 ['16', 16.8],
 ['23', 7.99],
 ['12', 9.41],
 ['17', 11.46],
 ['15', 38.59],
 ['21', 16.01],
 ['20', 21.52],
 ['02', 23.81],
 ['18', 13.2],
 ['03', 7.8],
 ['05', 10.09],
 ['19', 10.8],
 ['01', 11.38],
 ['22', 6.75],
 ['08', 10.25],
 ['04', 7.17],
 ['00', 8.13],
 ['06', 9.02],
 ['07', 7.85],
 ['11', 11.05]]

Once again, the highest average number of comments occurs at `3:00 PM`.
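Rather than scanning the list by eye, the hour with the highest average can be found programmatically. The sketch below uses three hours copied from the dictionaries above. The names `avg_lookup` and `best_hour` are hypothetical.

```python
# Totals for three of the 24 hours, copied from the outputs above.
comments_by_hour = {'15': 4477, '02': 1381, '20': 1722}
counts_by_hour = {'15': 116, '02': 58, '20': 80}

# Average comments per post for each hour.
avg_lookup = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in comments_by_hour
}

# max() with key=avg_lookup.get returns the key whose value is largest.
best_hour = max(avg_lookup, key=avg_lookup.get)
print(best_hour, round(avg_lookup[best_hour], 2))  # 15 38.59
```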

## 6. Sorting and Printing Values from a List of Lists

In the previous cell, we calculated the average number of comments for posts created during each hour of the day and stored the results in a list of lists named `avg_by_hour`. Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [16]:
# Swap the columns so that the average comes first in each row
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
display(swap_avg_by_hour)

[[5.58, '09'],
 [14.74, '13'],
 [13.44, '10'],
 [13.23, '14'],
 [16.8, '16'],
 [7.99, '23'],
 [9.41, '12'],
 [11.46, '17'],
 [38.59, '15'],
 [16.01, '21'],
 [21.52, '20'],
 [23.81, '02'],
 [13.2, '18'],
 [7.8, '03'],
 [10.09, '05'],
 [10.8, '19'],
 [11.38, '01'],
 [6.75, '22'],
 [10.25, '08'],
 [7.17, '04'],
 [8.13, '00'],
 [9.02, '06'],
 [7.85, '07'],
 [11.05, '11']]

We now use the [`sorted()` function](https://docs.python.org/3/library/functions.html#sorted) to sort `swap_avg_by_hour` in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.

In [17]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments:")
# slice the sorted list so that only the five highest averages are printed
for row in sorted_swap[:5]:
    average = row[0]
    hour = row[1]
    hour = dt.datetime.strptime(hour,'%H')
    hour = hour.strftime('%H:%M')
    print('{}: {} average comments per post'.format(hour,average))

Top 5 Hours for Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post


From the summary above, ask posts created between `15:00` and `16:00` receive the most comments on average by a wide margin (38.59 per post, compared with 23.81 for the next-best hour). As the Hacker News dataset is in the EST time zone, this corresponds to `3:00-4:00 PM EST`.
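The swap-and-sort steps above can be shortened by passing a `key` function to `sorted()`, which sorts on the second column without reordering it. This is an alternative sketch that uses a subset of the `avg_by_hour` values shown earlier. The name `top_five` is hypothetical.

```python
# A subset of avg_by_hour from the output above: [hour, average comments].
avg_by_hour = [
    ['09', 5.58], ['13', 14.74], ['16', 16.8], ['15', 38.59],
    ['21', 16.01], ['20', 21.52], ['02', 23.81],
]

# Sort on the average (index 1), highest first, and keep the first five rows.
top_five = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)[:5]

for hour, average in top_five:
    print('{}:00: {} average comments per post'.format(hour, average))
```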

# Conclusions on the correlation between posting time and engagement

As we can see, the best hour to post and get comments is from `3:00 - 4:00 PM EST`. 
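Readers in other time zones may want these hours expressed in UTC. The sketch below assumes the timestamps are in EST with a fixed offset of UTC-5 and ignores daylight saving time, so results for summer dates could be off by an hour. The helper `est_hour_to_utc()` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset, with no daylight saving adjustment.
EST = timezone(timedelta(hours=-5), 'EST')


def est_hour_to_utc(hour):
    """Convert an hour of the day in EST to the matching hour in UTC (hypothetical helper)."""
    local = datetime(2016, 1, 1, hour, tzinfo=EST)  # the date itself is arbitrary
    return local.astimezone(timezone.utc).hour


for hour in (15, 20, 2):
    print('{:02d}:00 EST -> {:02d}:00 UTC'.format(hour, est_hour_to_utc(hour)))
```

Under this assumption, `15:00` EST corresponds to `20:00` UTC, and `20:00` EST falls at `01:00` UTC on the following day. Handling daylight saving correctly would require a named zone from the standard library's `zoneinfo` module instead of a fixed offset.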

Looking at the first several rows of the sorted output, the most engaging times, and therefore the times when moderators are most needed, are mid-afternoon (`3:00-4:00 PM EST`), the evening (`20:00`), and late at night (`02:00`).

This pattern is plausible: many participants likely have full-time jobs, so much of the activity happens either shortly before the end of the workday or after work.