# Hacker News Posts
*An analysis by Heather Gray -- October 18, 2019*
_____________________________________


## Overview
Hacker News is a discussion page for tech-geeks and startups. Users create posts to share news or projects or ask questions. In this analysis, we will be comparing two types of HN posts: 

* Ask HN (posts where users ask other members of HN tech questions)
* Show HN (posts where projects or news articles or other items of interest are shared)

We want to know which type of post is more popular based on user engagement and scores. We also want to understand at what times users on the site are most active by analyzing post times.

## About the data set

[The csv file](https://www.kaggle.com/hacker-news/hacker-news-posts), obtained from Kaggle, originally contained nearly 300k rows. For this analysis, the data set has been reduced to ~20k rows: all rows representing posts with no comments were removed, as they are not useful to this project. The data set in use was compiled in 2016.

## Columns of the data set

| Column       	| Purpose                                                                                                                	|
|--------------	|------------------------------------------------------------------------------------------------------------------------	|
| id           	| The unique identifier from Hacker News for the post                                                                    	|
| title        	| The title of the post                                                                                                  	|
| url          	| The URL that the post links to, if the post has a URL                                                                  	|
| num_points   	| The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes  	|
| num_comments 	| The number of comments that were made on the post                                                                      	|
| author       	| The username of the person who submitted the post                                                                      	|
| created_at   	| The date and time at which the post was submitted                                                                      	|

## Exploring the data set

In this section we will load the data set, verify its columns and print a few sample rows to get a feel for its structure. We will create a function to display a table of the dataset headers with indices so that it can be referred to from any point in the program by calling the function.

In [1]:
from csv import reader

# open the file, read it and convert it to a list of rows;
# the with block closes the file once it has been read
with open('hacker_news.csv', encoding='utf-8') as csv_file:
    hn = list(reader(csv_file))

# create a list of headers
headers = hn[0]

# store data set without headers
hn = hn[1:]

def display_headers(header_list):
    ''' prints a formatted list of headers with indices'''
    print('|Ind', '| Heading')
    for index, header in enumerate(header_list):
        print('|', index, ' |', header)

display_headers(headers)

|Ind | Heading
| 0  | id
| 1  | title
| 2  | url
| 3  | num_points
| 4  | num_comments
| 5  | author
| 6  | created_at


## Exploring & analyzing the data set

#### Sample Data
We will define a function to read and display data in hn within a given range of rows. Testing the function with an end_row of 5 will give us a decent sample of the data.

In [3]:
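def print_data_rows(data, start_row=0, end_row=None):
    '''prints the rows of data from start_row up to, but not including, end_row'''
    # end_row=None slices to the end of the list; an empty string would raise a TypeError
    for row in data[start_row:end_row]:
        print(row)

print_data_rows(hn, end_row=5)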

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
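
To make the structure of a row concrete, here is a minimal sketch that pairs the column names with the first sample row printed above. The variable names `row` and `record` are introduced here for illustration only; the values are copied from the sample output.

```python
# Column names in the order printed by display_headers(headers)
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# First sample row from the output above
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
       '8/4/2016 11:52']

# Pair each header with its value; every value is still a string at this point
record = dict(zip(headers, row))

# Numeric columns must be converted before doing arithmetic on them
num_points = int(record['num_points'])
num_comments = int(record['num_comments'])

print(record['title'], num_points, num_comments)
```

Because the csv reader returns every field as a string, the later cells call `int()` on the comment counts before summing them.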


### Post & comment analysis
In this section, we'll explore the count of posts and comments in Ask HN and Show HN posts, which are explained below.

#### Partitioning the data
We will now partition the data into three categories:
* ask_posts - posts containing 'ask hn' in their title
* show_posts - posts containing 'show hn' in their title
* other_posts - all posts not fitting the above categories

In [4]:
def categorize_data(dataset):
    '''Split dataset into 3 post categories'''
    ask_posts = []
    show_posts = []
    other_posts = []
    
    for row in dataset:
        post_title = row[1].lower()
        if 'ask hn' in post_title:
            ask_posts.append(row)
        elif 'show hn' in post_title:
            show_posts.append(row)
        else:
            other_posts.append(row)
            
    # return all 3 lists as a tuple        
    return ask_posts, show_posts, other_posts


# unpack categorized data into local variables
ask_posts, show_posts, other_posts = categorize_data(hn)

# print a sample title of each type of post category
print(' == Post Summary ==\n' 
      ' Show: (', len(show_posts), ' posts) ', show_posts[1][1],'\n',
      'Ask:  (', len(ask_posts), ' posts) ', ask_posts[1][1],'\n',
      'Other:(', len(other_posts), 'posts) ', other_posts[1][1])

 == Post Summary ==
 Show: ( 1165  posts)  Show HN: Something pointless I made 
 Ask:  ( 1745  posts)  Ask HN: Am I the only one outraged by Twitter shutting down share counts? 
 Other:( 17190 posts)  How to Use Open Source and Shut the Fuck Up at the Same Time
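
Note that `categorize_data` matches the phrase anywhere in the title, not only at the start. The sketch below compares that substring test with a stricter prefix test. The two helper functions and the last title in the list are hypothetical and are not part of the analysis; the other titles come from the outputs above.

```python
titles = [
    'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
    'Why I stopped reading Ask HN threads',  # hypothetical title, not from the data set
]

def category_by_substring(title):
    """Mirror categorize_data: match the phrase anywhere in the title."""
    lowered = title.lower()
    if 'ask hn' in lowered:
        return 'ask'
    if 'show hn' in lowered:
        return 'show'
    return 'other'

def category_by_prefix(title):
    """Stricter alternative: match only when the title starts with the phrase."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

for title in titles:
    print(category_by_substring(title), category_by_prefix(title), '|', title)
```

The two tests agree on the first three titles and differ only on the hypothetical one, which merely mentions Ask HN. Whether that difference matters for the real data set is not something the counts above can tell us.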


#### Analyzing comments
We want to understand the types of posts which are getting the most user engagement, so we will get the average number of comments in each subset of HN posts.

In [5]:
def get_comment_count(post_list):
    total_comments = 0
    number_of_posts = len(post_list)
    for row in post_list:
        num_comments = int(row[4])
        total_comments += num_comments
    
    average_comments = round(total_comments / number_of_posts)
    
    return average_comments
    
print('== Average Comments ==\n'
      '  Ask posts:   ', get_comment_count(ask_posts),
      '\n  Show posts:  ', get_comment_count(show_posts),
      '\n  Other posts: ', get_comment_count(other_posts))

== Average Comments ==
  Ask posts:    14 
  Show posts:   10 
  Other posts:  27


#### Analysis of post & comment counts
After parsing and exploring the data, we can make the following observations.

Of all the Ask HN and Show HN posts which have comments:

1. There are approximately 50% more Ask HN posts than Show HN posts (1745 versus 1165); put the other way, there are about 33% fewer Show HN posts than Ask HN posts
2. Ask HN posts average more comments than Show HN posts, at a ratio of roughly 7:5 (14 versus 10 comments per post)

In [6]:
# verify the figures above
print('There are ' + str(round((len(ask_posts) / len(show_posts) - 1) * 100)) + 
      '% more Ask HN posts than Show HN posts')
print('For every ' + str(get_comment_count(show_posts)) + ' comments on an average Show HN post there are ' + 
       str(get_comment_count(ask_posts)) + ' on an average Ask HN post')

There are 50% more Ask HN posts than Show HN posts
For every 10 comments on an average Show HN post there are 14 on an average Ask HN post
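
The percentage difference between the two post counts depends on which count is used as the baseline. A short worked sketch, using the counts from the post summary above (the variable names are introduced here for illustration):

```python
ask_count = 1745   # Ask HN posts, from the post summary
show_count = 1165  # Show HN posts, from the post summary

# Baseline = Show HN: how many more Ask posts there are, relative to Show posts
pct_more_ask = (ask_count / show_count - 1) * 100

# Baseline = Ask HN: how many fewer Show posts there are, relative to Ask posts
pct_fewer_show = (1 - show_count / ask_count) * 100

print(round(pct_more_ask), round(pct_fewer_show))  # 50 33
```

So "about 33%" describes how far Show HN falls short of Ask HN, while "about 50%" describes how far Ask HN exceeds Show HN.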


### Time analysis
Are certain times more popular for certain types of posts and do posts made during certain times get more user interaction? In this section we will answer these questions and more.

#### Parse and format dates
To parse and format the dates, we will use the datetime module from the Python standard library. Each date and time string must be converted to a Python datetime object with strptime().

In [7]:
import datetime as dt

def parse_date(posts):
    for row in posts:
        post_datetime = row[6]
        row[6] = dt.datetime.strptime(post_datetime, '%m/%d/%Y %H:%M')
        
parse_date(ask_posts)
parse_date(show_posts)
parse_date(other_posts)
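
As a quick check of the format string, here is a minimal sketch that parses the `created_at` value of the first sample row. The variable names `raw` and `parsed` are illustrative.

```python
import datetime as dt

raw = '8/4/2016 11:52'  # created_at value from the first sample row

# %m and %d accept months and days without a leading zero when parsing
parsed = dt.datetime.strptime(raw, '%m/%d/%Y %H:%M')

print(parsed)                 # 2016-08-04 11:52:00
print(parsed.strftime('%H'))  # hour as a zero-padded string: 11
print(parsed.hour)            # hour as an integer: 11
```

Since `parse_date` overwrites column 6 in place, running that cell a second time on the same lists would pass a datetime object to `strptime()`, which raises a TypeError.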

#### Calculate posts per hour
We want to know how many posts are made in each hour of the day, as well as how many comments those posts receive. To do that, we create a function, posts_per_hour(), which returns a tuple containing two dictionaries: post counts keyed by hour and comment totals keyed by hour.

After we've created the dictionaries, we will turn them into Counter objects in order to display stats on the top 5 hours.

Because we are primarily focused on Ask and Show posts, we won't calculate posts per hour for the other posts or analyze them any further from this point forward.

In [8]:
def posts_per_hour(posts):
    counts_by_hour = {}
    comments_by_hour = {}
    
    for row in posts:
        created_hour = row[6].strftime('%H')
        num_comments = int(row[4])
        if created_hour in counts_by_hour:
            comments_by_hour[created_hour] += num_comments
            counts_by_hour[created_hour] += 1
        else:
            comments_by_hour[created_hour] = num_comments
            counts_by_hour[created_hour] = 1
            
    return counts_by_hour, comments_by_hour
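
The same tallying logic can be sketched on a tiny input to see what the two dictionaries look like. This version uses `collections.defaultdict` in place of the if/else branch, which is a substitution made here for brevity and not what the function above does. The `sample` pairs are hypothetical and not taken from the data set.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs, not taken from the data set
sample = [
    ('8/4/2016 11:52', 52),
    ('8/5/2016 11:05', 8),
    ('1/26/2016 19:30', 10),
]

counts_by_hour = defaultdict(int)    # missing hours start at 0
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))    # {'11': 2, '19': 1}
print(dict(comments_by_hour))  # {'11': 60, '19': 10}
```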

In [9]:
ask_posts_per_hour = posts_per_hour(ask_posts)
show_posts_per_hour = posts_per_hour(show_posts)
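
The top-5 step mentioned above relies on `Counter.most_common`. A minimal sketch of how it behaves, using five hourly Ask HN post counts taken from the results reported further down (the variable names are illustrative):

```python
from collections import Counter

# Ask HN post counts for five hours, as reported later in this analysis
counts = {'15': 116, '19': 110, '21': 109, '18': 109, '16': 108}

# most_common(n) returns the n largest entries as a list of (key, count) pairs
top_three = Counter(counts).most_common(3)

print(top_three)  # [('15', 116), ('19', 110), ('21', 109)]
```

Hours 21 and 18 are tied at 109 posts; entries with equal counts keep the order in which they were first inserted, so hour 21 is listed first here.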

In [10]:
def print_posts_per_hour(hourly_stats, title=None):
    '''prints (hour, value) pairs for posts per hour and comments per hour.
    hourly_stats is a tuple of two lists of pairs; title is an optional heading'''
    # the parameter is not called posts_per_hour, to avoid shadowing that function
    if title is not None:
        print('   ' + title)
    print('\n  /Posts per Hour/\n')
    for hour, count in hourly_stats[0]:
        print('   hour', hour, ':' ,count)
    print('\n\n  /Comments per Hour/\n')
    for hour, count in hourly_stats[1]:
        print('   hour', hour, ':' ,count)
    print()
    

In [11]:
from collections import Counter

# take the top 5 hours by post count and by comment total for Ask HN posts;
# most_common(5) returns a list of (hour, value) pairs, and the two lists are stored as a tuple
ask_counts_comments_by_hour = (Counter(ask_posts_per_hour[0]).most_common(5),
                               Counter(ask_posts_per_hour[1]).most_common(5))

# only the Ask HN figures are printed here; Show HN would need its own
# pair of Counters built from show_posts_per_hour
print('\n', '|'*3, 'M O S T | A C T I V E | H O U R S', '|'*3,'\n')
print('_'*18)
print_posts_per_hour(ask_counts_comments_by_hour)
print()


 ||| M O S T | A C T I V E | H O U R S ||| 

__________________

  /Posts per Hour/

   hour 15 : 116
   hour 19 : 110
   hour 21 : 109
   hour 18 : 109
   hour 16 : 108


  /Comments per Hour/

   hour 15 : 4477
   hour 16 : 1814
   hour 21 : 1745
   hour 20 : 1722
   hour 18 : 1439




#### Average comments per post, per hour

We have the top 5 hours for Ask HN posts by number of posts and by number of comments. Now we will go a little further and get the average comments for posts created in each hour of the day.

To do this, we copy the two Ask HN dictionaries returned by posts_per_hour() and loop over the hours to build pairs of hour and average comments per post.

For this section, we will focus on the Ask posts, since they receive more comments on average than Show posts.

In [12]:
ask_dict_counts = dict(ask_posts_per_hour[0])
ask_dict_comments = dict(ask_posts_per_hour[1])

In [13]:
avg_comments_hr = []
for hour, count in ask_dict_counts.items():
    avg_comments_hr.append([hour, ask_dict_comments[hour] - count])

In [14]:
avg_comments_hr = Counter(dict(avg_comments_hr)).most_common(24)

print('\n', '|'*3, 'C O M M E N T S / H O U R', '|'*3,'\n')
print('     (R A N K E D)\n')

for i, (hour, avg_comments) in enumerate(avg_comments_hr, start=1):
    print('     ' + format('%02d'% i) + '| ' + hour + ':00' + ': ' + str(avg_comments))


 ||| C O M M E N T S / H O U R ||| 

     (R A N K E D)

     01| 15:00: 4361
     02| 16:00: 1706
     03| 20:00: 1642
     04| 21:00: 1636
     05| 18:00: 1330
     06| 02:00: 1323
     07| 14:00: 1309
     08| 13:00: 1168
     09| 19:00: 1078
     10| 17:00: 1046
     11| 10:00: 734
     12| 01:00: 623
     13| 12:00: 614
     14| 11:00: 583
     15| 23:00: 476
     16| 08:00: 444
     17| 05:00: 418
     18| 22:00: 408
     19| 00:00: 392
     20| 03:00: 367
     21| 06:00: 353
     22| 04:00: 290
     23| 07:00: 233
     24| 09:00: 206
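
The ranked figures above equal each hour's comment total minus its post count (for example, 4477 - 116 = 4361 at 15:00), because the loop in cell In [13] subtracts where a per-post average would divide. Below is a minimal sketch of the division, restricted to the four hours whose post and comment totals both appear in the top 5 output earlier. The name `ask_hour_totals` is introduced here for illustration, and the remaining hours cannot be recomputed from the printed output alone.

```python
# (posts, comments) per hour for Ask HN, limited to hours where both totals
# are printed in the top 5 output
ask_hour_totals = {
    '15': (116, 4477),
    '16': (108, 1814),
    '18': (109, 1439),
    '21': (109, 1745),
}

# average comments per post = total comments / number of posts
avg_by_hour = {hour: comments / posts
               for hour, (posts, comments) in ask_hour_totals.items()}

# rank the hours from highest to lowest average
ranked = sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True)

for hour, avg in ranked:
    print(f'{hour}:00: {avg:.2f} average comments per post')
```

For these four hours, 15:00 still ranks first, at roughly 38.6 comments per post.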


#### Observations

Based on the ranked data of average comments per hour, most people interact with posts around 15:00 (3pm). This is an interesting time because it seems likely that most of the Hacker News audience, who are tech workers, would be at work during this time. However, tech companies are often flexible and allow time for their workers to catch up on the happenings in the industry. 3pm is probably the time when workers are feeling a little unfocused or bored with the day. Most of the day's most productive work will have been done and meetings are usually earlier in the day.

The 2nd most popular time is 16:00 (4pm) and this also makes sense for the same reasons as above. Either these are the workers who have spent more than an hour on Hacker News or these are the ones who like to read their news near quitting time.

After that, 20:00 (8pm) is the next most popular time. This is personal time for a lot of people. They've come home from work, visited with the family, eaten dinner and maybe watched a little television so they're free to do their own thing around this time.

#### Conclusion

Of the two post types compared, Ask HN posts draw more engagement than Show HN posts, averaging about 14 comments per post against 10. If a person wants the most engagement on Hacker News, posting an Ask HN thread around 3pm is their best bet.