# Explore Hacker News Posts

In this project, we'll work with a dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

<img src="https://s3.amazonaws.com/dq-content/354/hacker_news.jpg" width="600" height="600" />

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as **posts**) receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

The dataset is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but for this project it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. The columns are described below:
* `id`: the unique identifier from Hacker News for the post
* `title`: the title of the post
* `url`: the URL that the post links to, if the post has a URL
* `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: the number of comments on the post
* `author`: the username of the person who submitted the post
* `created_at`: the date and time of the post's submission

## Introduction

We begin by importing the libraries we need and reading the dataset into a list of lists. We'll also remove the column headers from the rows.

In [1]:
from csv import reader

opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)

# Separate the column headers and rows
headers = hn[0] # columns
hn = hn[1:] # rows
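
The cell above leaves the file handle open. A minimal sketch of the same read using a `with` block, which closes the file automatically, might look like the following. The helper name `read_rows`, the temporary stand-in file, and the explicit `utf-8` encoding are assumptions made for illustration and are not part of the original notebook.

```python
import csv
import os
import tempfile

def read_rows(path):
    """Return (headers, rows) from a CSV file, closing the file when done."""
    # newline='' is the setting the csv module documentation recommends
    with open(path, newline='', encoding='utf-8') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Tiny stand-in file so the sketch runs without hacker_news.csv
sample = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8', newline='') as tmp:
    tmp.write(sample)
    sample_path = tmp.name

sample_headers, sample_rows = read_rows(sample_path)
os.remove(sample_path)  # clean up the stand-in file

print(sample_headers)
print(sample_rows[0][1])
```

With the real dataset, the call would be `read_rows('hacker_news.csv')`, assuming the file sits in the working directory as in the cell above.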

Here are the column headers and the first five rows of the dataset:

In [2]:
# Display columns
print(headers)

# Display rows
for row in hn[:5]:
    print(row)
    print()

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']



We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Below are a few examples:

```
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
```

Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:

```
Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
```

We'll compare these two types of posts to determine the following:
* Do `Ask HN` or `Show HN` receive more comments on average?
* Do posts created at a certain time receive more comments on average?

## Extracting Ask HN and Show HN Posts

Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just data for those titles.

To find the posts that begin with either `Ask HN` or `Show HN`, we'll use the string method `startswith`.

Because string comparison is case sensitive, we'll first convert each title to lower case with the `lower` string method, so that variants such as `ask hn` and `Ask HN` are matched alike.

In [3]:
# Identify posts that begin with either `Ask HN` or `Show HN`
# and separate the data into different lists
ask_posts = []
show_posts = []
other_posts = []

# Convert each title to lower case,
# then append the post to the respective list of lists
for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print(f'Number of ask posts: {len(ask_posts)}')
print(f'Number of show posts: {len(show_posts)}')
print(f'Number of other posts: {len(other_posts)}')

Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194


We have more **ask posts** (`1744`) than **show posts** (`1162`). To verify that the posts were sorted into the right lists, we'll look at the titles of the first two and last two entries of each list.

In [4]:
for entry in range(0, 2):
    print(ask_posts[entry][1])
    print(ask_posts[-entry-1][1])
    
print('\n')

for entry in range(0, 2):
    print(show_posts[entry][1])
    print(show_posts[-entry-1][1])

Ask HN: How to improve my personal website?
Ask HN: Why are papers still published as PDFs?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: How do you balance a serious relationship with starting a company?


Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Parse recipe ingredients using JavaScript
Show HN: Something pointless I made
Show HN: PhantomJsCloud, Headless Browser SaaS


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Next, we'll try to answer our **first question** and determine whether ask posts or show posts receive more comments on average.

In [5]:
# Calculate the average number of comments in 'Ask HN' posts
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments/len(ask_posts)
print(f'The average number of comments on ask posts is: {avg_ask_comments:.2f}')

The average number of comments on ask posts is: 14.04


In [6]:
# Calculate the average number of comments in 'Show HN' posts
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments/len(show_posts)
print(f'The average number of comments on show posts is: {avg_show_comments:.2f}')

The average number of comments on show posts is: 10.32


On average, **ask posts** receive more comments (approximately 14.04) than **show posts** (approximately 10.32). Since ask posts receive more comments on average, we'll focus our remaining analysis on these posts.
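
The two cells above repeat the same loop. As a sketch, the calculation could be wrapped in a small helper; the name `average_comments`, its `comments_index` parameter, and the `demo_posts` rows are hypothetical and made up for illustration (they are not rows from the dataset).

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows in the same column order as the dataset
demo_posts = [
    ['1', 'Ask HN: first example', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: second example', '', '2', '4', 'user_b', '8/5/2016 9:10'],
]
demo_avg = average_comments(demo_posts)
print(f'{demo_avg:.2f}')  # 7.00
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, the helper would give the same averages as the loops above, since it performs the same sum and division.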

## Finding the Number of Ask Posts and Comments by Hour Created

Moving on to our **second question**, we'll determine if ask posts created at a certain *time* are more likely to attract comments. We'll use the following steps to perform the analysis:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In the code cell below, we'll work on the first step and calculate the number of ask posts and comments by hour created. We'll use the [`datetime` module](https://docs.python.org/3/library/datetime.html) to work with the data in the `created_at` column: the `datetime.strptime()` class method parses a date stored as a string and returns a `datetime` object.

To begin, we'll import `datetime` as `dt` and iterate over `ask_posts` to extract `post_created` and `num_comments` into the `result_list`.

Next, we'll loop through `result_list` and store the data into two separate dictionaries `counts_by_hour` and `comments_by_hour` respectively.
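
Before running this on the whole list, here is a small sketch of the parsing step on a single value. The variable names `sample_created_at` and `parsed` are illustrative only; the timestamp is the `created_at` value of the first row displayed earlier.

```python
import datetime as dt

sample_created_at = '8/4/2016 11:52'
parsed = dt.datetime.strptime(sample_created_at, '%m/%d/%Y %H:%M')

print(parsed)                 # 2016-08-04 11:52:00
print(parsed.hour)            # 11 (an int)
print(parsed.strftime('%H'))  # 11 (a zero-padded string)
```

Using `strftime('%H')` keeps the hour as a two-character string such as `'09'`, which is the form used for the dictionary keys in the cell below.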

In [7]:
import datetime as dt

result_list = [] # list to store post time and num of comments

for post in ask_posts:
    post_created = post[6] # post time
    num_comments = int(post[4]) # number of comments
    result_list.append([post_created, num_comments])

counts_by_hour = {} # dict to store number of posts by hour
comments_by_hour = {} # dict to store number of comments by hour
date_format = '%m/%d/%Y %H:%M'

for row in result_list:
    hour_str = row[0]
    comment = row[1]
    # Convert dates from string to datetime objects
    hour_dt  = dt.datetime.strptime(hour_str, date_format)
    # Extract only hours from the datetime objects
    hour = hour_dt.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
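
The `if`/`else` bookkeeping above can also be written with `collections.defaultdict`, which starts every missing key at zero. The following is a sketch on made-up rows (`demo_results` is hypothetical data, not taken from the dataset) rather than a replacement for the cell above.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the same shape as result_list
demo_results = [
    ['8/4/2016 11:52', 6],
    ['8/5/2016 11:05', 4],
    ['8/5/2016 15:30', 29],
]

demo_counts = defaultdict(int)    # posts per hour
demo_comments = defaultdict(int)  # comments per hour
for created_at, n_comments in demo_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    demo_counts[hour] += 1
    demo_comments[hour] += n_comments

print(dict(demo_counts))    # {'11': 2, '15': 1}
print(dict(demo_comments))  # {'11': 10, '15': 29}
```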

We have now created two dictionaries:
* `counts_by_hour`: contains the number of ask posts created during each hour of the day.
* `comments_by_hour`: contains the total number of comments received by ask posts created during each hour.

Let's view the data we have stored in these dictionaries:

In [8]:
# Posts count
counts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [9]:
# Comments count
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

## Calculating the Average Number of Comments for Ask HN Posts by Hour

We'll now use these two dictionaries to calculate the **average number of comments** for **posts created during each hour** of the day.

For this purpose, we'll create a list of lists, `avg_by_hour`, in which the first element of each inner list is the hour during which posts were created and the second element is the average number of comments those posts received.

In [10]:
# Calculate the average amount of comments `Ask HN` posts created at each hour of the day receive
avg_by_hour = []
for comment_hr in comments_by_hour:
    comment_avg = comments_by_hour[comment_hr]/counts_by_hour[comment_hr]
    avg_by_hour.append([comment_hr, comment_avg])

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

## Sorting the Values from the List (avg_by_hour)

Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting them. To do so, we'll first create `swap_avg_by_hour`, a version of `avg_by_hour` with the two elements of each row **swapped** so that the average comes first.

In [11]:
# Swap the elements of 'avg_by_hour' and store them into a new list of lists
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

We can now use the [`sorted()` function](https://docs.python.org/3/library/functions.html#sorted) to sort `swap_avg_by_hour` in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.

In [12]:
# Sort the list in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
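
As an alternative sketch, `sorted()` also accepts a `key` function, which makes it possible to sort on the second element without building a swapped copy. The list `demo_avg_by_hour` below is a shortened illustration whose values are rounded from the averages computed earlier.

```python
# Three [hour, average] rows, rounded from the results above
demo_avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81]]

# Sort by the average (index 1), highest first
demo_sorted = sorted(demo_avg_by_hour, key=lambda row: row[1], reverse=True)
print(demo_sorted)  # [['15', 38.59], ['02', 23.81], ['09', 5.58]]
```

One difference from the swap approach is that each row keeps its original `[hour, average]` order, so any code reading the sorted result would index the columns the other way round.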

## Displaying the Results of Sorted List

Finally, we'll print the five highest values in a format that's easier to read.

In [13]:
# Display the top 5 hours with the highest average comments
print("Top 5 Hours for Ask Posts Comments.")
for row in sorted_swap[:5]:
    comment_avg = row[0]
    str_hr = row[1]
    # Parse the hour string into a datetime object
    dt_hr = dt.datetime.strptime(str_hr, '%H')
    # Format the time as hours and minutes
    format_hr = dt_hr.strftime('%H:%M')
    print(f'{format_hr}: {comment_avg:.2f} average comments per post')

Top 5 Hours for Ask Posts Comments.
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post. That is about 62% higher than the second-highest hourly average (23.81 comments per post at 02:00).
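
The 62% figure can be checked with a short calculation. This is a sketch using the two averages from `avg_by_hour`; the variable names are illustrative.

```python
top_avg = 38.5948275862069        # average for 15:00
second_avg = 23.810344827586206   # average for 02:00

# Relative increase of the top hour over the runner-up, in percent
pct_increase = (top_avg - second_avg) / second_avg * 100
print(f'{pct_increase:.1f}%')  # 62.1%
```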

According to the dataset [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/home), the time zone used is Eastern Time in the US. Since our local time zone is Pakistan Standard Time (PKT), we'll convert 15:00 (EST) into PKT. This will make it easier to decide when to create a post to maximize the number of comments.

We'll use the [pytz module](https://pypi.org/project/pytz/), which allows accurate, cross-platform time zone calculations in Python. The code cell below implements the conversion in the following steps:
1. Import the `pytz` package
2. Extract the **hour** with the highest average number of comments from the list `sorted_swap`
3. Instantiate the `US/Eastern` and `Asia/Karachi` time zones for the conversion
4. Parse the hour string into a datetime object
5. Localize the datetime object to US Eastern time
6. Convert the US Eastern datetime to a Pakistan datetime

In [14]:
import pytz

# Extract hour of max comments from 'sorted_swap'
max_comments_hr = sorted_swap[0][1]
# Instantiate timezone of 'US' and 'Pakistan'
us_tz = pytz.timezone('US/Eastern')
pakistan_tz = pytz.timezone('Asia/Karachi')
# String format for time
fmt = '%I:%M %p'

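# Parse the hour string into a datetime object and attach a reference date.
# Assumption: a January 2016 date is used so that Eastern Standard Time applies;
# strptime's default year (1900) would make pytz fall back to historical offsets.
dt_hr = dt.datetime.strptime(max_comments_hr, '%H').replace(year=2016, month=1, day=15)
# Mark the naive datetime as US Eastern time
us_dt = us_tz.localize(dt_hr)
# Convert eastern time to pakistan time
pkt_dt = us_dt.astimezone(pakistan_tz)
print(f'{us_dt.strftime(fmt)} EST is {pkt_dt.strftime(fmt)} PKT')

03:00 PM EST is 01:00 AM PKT


So if we create a post around **1:00 AM** local time, we increase its chances of receiving the most comments.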
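
As a cross-check that relies only on the standard library, the conversion can also be sketched with fixed UTC offsets. This assumes EST is UTC-5 and PKT is UTC+5 (a 10-hour difference) and deliberately ignores daylight saving time; the names `EST`, `PKT`, `est_time`, and `pkt_time` are illustrative.

```python
import datetime as dt

# Assumed fixed offsets, with no daylight saving adjustment
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
PKT = dt.timezone(dt.timedelta(hours=5), 'PKT')

# The date is arbitrary: with fixed offsets only the time of day matters
est_time = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EST)
pkt_time = est_time.astimezone(PKT)

fmt = '%I:%M %p'
print(f'{est_time.strftime(fmt)} EST is {pkt_time.strftime(fmt)} PKT')
# 03:00 PM EST is 01:00 AM PKT
```

During US daylight saving time, Eastern Time is UTC-4, so the same 15:00 local hour would correspond to midnight PKT.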

## Conclusion

In this project, we analyzed **ask posts** and **show posts** to determine which type of post and which time of day receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that the post be an ask post created between 15:00 and 16:00 EST (3:00-4:00 pm).

However, it should be noted that the dataset we analyzed excluded posts without any comments. Given that, it's more accurate to say that *of the posts that received comments*, **ask posts** received more comments on average and ask posts created between 15:00 and 16:00 EST received the most comments on average.