# Analyzing Hacker News Posts For User Engagement

`Hacker News` is a site started by the startup incubator `Y Combinator`, where users submit posts that other users can vote and comment on. The site is similar to Reddit, but its philosophy is to post only 'interesting' topics. The website says that if a story is likely to be on the news, then it should not be on the site.

The original data set had over 300,000 rows but was shortened to roughly 20,000 by removing posts that received no comments and then randomly sampling the rest.

Below are the columns:

 |Name| Description|
 |---|---|
 |`id`| the unique identifier of the post|
 |`title`| title of the post (self explanatory)|
 |`url`| the url of the item being linked to|
 |`num_points`| the number of upvotes the post received|
 |`num_comments`| the number of comments the post received|
 |`author`| the name of the account that made the post|
 |`created_at`| the date and time the post was made (the time zone is Eastern Time in the US)|

Currently, I am only interested in posts that begin with:
- `Ask HN`
- `Show HN`

On this website, users submit `Ask HN` posts to ask the community a question and `Show HN` posts to show the community a project, a product, or something else they find interesting.

My goal is to find which type of post receives the most user-to-user interaction (comments), and what time of day is best to post in order to receive that interaction.

# Introduction

I will begin by opening the file, reading it in, then creating a list of lists.

In [1]:
# Opening, reading, and listing the file

from csv import reader

# The with statement closes the file once it has been read
with open('hacker_news.csv', encoding='utf8') as open_file:
    read_file = reader(open_file)
    hn = list(read_file)

# Displaying the first five rows

hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]
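
Every value that `reader` returns is a string, including the numeric columns, which is why the later cells convert with `int()`. Here is a minimal sketch of that behavior. It assumes a two-line in-memory sample (`sample_csv`, built from the rows shown above) rather than the real file, so nothing is read from disk.

```python
import io
from csv import reader

# Hypothetical in-memory stand-in for hacker_news.csv: the header plus one row.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# io.StringIO lets csv.reader treat the string like an open file.
with io.StringIO(sample_csv) as sample_file:
    sample_rows = list(reader(sample_file))

first_post = sample_rows[1]
print(type(first_post[4]))      # the num_comments value is a str
print(int(first_post[4]) + 1)   # convert to int before doing arithmetic
```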

# Removing the Header

The first row above contains the headers for the data set. To analyze this data, I will need to separate the header from the rest of the rows.

In [2]:
headers = hn[0]
h_news = hn[1:]

print(headers)
print('\n')
print(h_news[:5])      

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


# HN Post Types

Because I only care for `Ask HN` and `Show HN`, I will separate these types of posts into lists. I will have three lists:

- Ask posts
- Show posts
- Other posts

I will look through the `h_news` list (list without the header) and separate the data into the lists.

In [15]:
# New lists to sort the data into post types
ask_posts = []
show_posts = []
other_posts = []

for row in h_news:
    # Pulling out the title column, which is index 1
    title = row[1]
    
    # Sorting the rows into particular lists based off the title
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Printing the header, then the number of posts and the first row of each list
print(headers, '\n')
print(len(ask_posts))
print(ask_posts[0], '\n')
print(len(show_posts))
print(show_posts[0], '\n')
print(len(other_posts))
print(other_posts[0], '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

1744
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'] 

1162
['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'] 

17194
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 
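
The sorting rule used above can also be sketched as a small function, which makes the case-insensitive prefix check easy to try on individual titles. The name `classify_title` and the third sample title are assumptions made for illustration; they are not part of the notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# The first two titles come from the rows printed above;
# the third is a made-up title to show that case is ignored.
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Interactive Dynamic Video',
    'SHOW HN: A made-up project title',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```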



# Comments

Now I will look to see which type of post, `Ask HN` or `Show HN`, had more comments on average.

To do this I will do the following:

- Create a running total of comments for each type of post
- Iterate through each post list and add each post's number of comments to the respective total
- Get the average number of comments per post by dividing the total by the number of posts in the list

In [22]:
# Count Variables for the types of posts
total_ask_comments = 0
total_show_comments = 0
total_other_comments = 0

# Each type of post will be iterated and the average will be found
for row in ask_posts:
    # Pulling out the comment column with index 4
    comments = int(row[4])
    total_ask_comments += comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments

avg_show_comments = total_show_comments / len(show_posts)

for row in other_posts:
    comments = int(row[4])
    total_other_comments += comments 
    
avg_other_comments = total_other_comments / len(other_posts)

# Each average is displayed, followed by a blank line
print("The average number of comments per 'Ask HN' post is:", avg_ask_comments, '\n')
print("The average number of comments per 'Show HN' post is:", avg_show_comments, '\n')
print("The average number of comments per 'Other' post is:", avg_other_comments, '\n')

The average number of comments per 'Ask HN' post is: 14.038417431192661 

The average number of comments per 'Show HN' post is: 10.31669535283993 

The average number of comments per 'Other' post is: 26.8730371059672 



**Note: Currently, this analysis is only based off of the `Ask HN` and `Show HN` posts. The 'Other' posts will be ignored for now.**

As you can see above, `Ask HN` posts received about 4 more comments per post on average than `Show HN` posts. This could be due to the nature of the post: a user who asks a question is probably more likely to receive a response than a user who is just showing something.

Since `Ask HN` posts receive more comments on average, I will focus the rest of the analysis on `Ask HN` posts.
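
The averaging step repeats the same loop for each list, so it could be wrapped in a helper. The sketch below is one possible version: `average_comments` is a hypothetical name, and the two sample rows are copied from the output shown earlier. It divides by `len(posts)` so that no post count has to be typed in by hand.

```python
def average_comments(posts, comments_index=4):
    """Average of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two rows copied from the output above (6 and 52 comments).
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
]

sample_average = average_comments(sample_posts)
print(sample_average)
```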

In [23]:
# Displaying the headers to prevent having to scroll up to view them
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


# Calculating the number of ask posts and comments by hour

I will now calculate the number of `Ask HN` posts and comments by the hour. 

To do this I will have to do the following:
- Import the `datetime` module to help parse the dates and times
- Pull the `created_at` and `num_comments` columns you see above, at index `-1` and `4`, into a list of lists
- Build two dictionaries keyed by hour: one counting the posts and one totaling the comments

In [49]:
# Importing the datetime module as dt
import datetime as dt

result_list = []

# There will be two elements to this loop: Time created and Comments
for row in ask_posts:
    created = row[-1]
    comments = int(row[4])
    
    # Each pair is appended to result_list, creating a list of lists
    result_list.append([created, comments])
    
counts_by_hour = {}
comments_by_hour = {}

# This variable is to tell the strptime method what format the date from the dataset is in
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    created = row[0]
    comments = int(row[1])
    
    # Parsing the date into a datetime object, then formatting just the hour as a two-digit string
    hour = dt.datetime.strptime(created, date_format).strftime("%H")
    
    
    # If the hour is not already in the dictionaries, an entry is created for it;
    # otherwise the post count is incremented by 1 and the comment total by the number of comments
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        
        
print("Comments By Hour:")        
print(comments_by_hour, '\n')
print("Posts By Hour:")
print(counts_by_hour)

Comments By Hour:
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641} 

Posts By Hour:
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
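
The same tally can be sketched with `collections.defaultdict`, which removes the need for the `if`/`else` check, and with the `hour` attribute of the parsed `datetime`, which gives the hour as an integer instead of a string. This is an alternative sketch, not the code used above, and the three `[created_at, num_comments]` pairs are made-up sample values in the data set's date format.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Hypothetical [created_at, num_comments] pairs for illustration only.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['8/4/2016 9:10', 4],
    ['1/26/2016 19:30', 10],
]

# Missing keys start at 0, so no membership check is needed.
posts_per_hour = defaultdict(int)
comments_per_hour = defaultdict(int)

for created, comments in sample_results:
    # .hour returns the hour as an int (9, 19) rather than a string ('09', '19')
    hour = dt.datetime.strptime(created, date_format).hour
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += comments

print(dict(posts_per_hour))
print(dict(comments_per_hour))
```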


# Average number of comments per post by hour


I will now create a list of lists where the first element is the hour and the second element is the average number of comments per post for that hour.


In [57]:
# New list for average comments by hour
avg_by_hour = []

for hour in comments_by_hour:
    # Dividing the number of comments in that hour by the number of posts
    # in that hour to find the average. The hour and its average are then
    # appended to the avg_by_hour list.
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
        
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
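
As a quick check of the arithmetic, the sketch below recomputes three of the averages from the totals printed earlier. It uses a dictionary comprehension instead of a list of lists; that is a design choice of this sketch, and the `sample_` names are assumptions, not variables from the notebook.

```python
# Totals copied from the "Comments By Hour" and "Posts By Hour" output above.
sample_comments = {'15': 4477, '02': 1381, '09': 251}
sample_counts = {'15': 116, '02': 58, '09': 45}

# Average comments per post for each hour: total comments / number of posts.
sample_avg = {
    hour: sample_comments[hour] / sample_counts[hour]
    for hour in sample_comments
}

for hour, avg in sample_avg.items():
    print(hour, round(avg, 2))
```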

# Sorting the values

To rank the hours, I will swap the two elements of each pair so that the average comes first. `sorted` compares lists element by element, so it will then order the rows by average.



In [59]:
# Swapping the avg by hour to sort by the average
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [63]:
# Sorting the rows by average, from highest to lowest

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
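
The swap is one way to sort by the average. Another option, sketched here as an alternative rather than the approach used above, is to pass a `key` function to `sorted` so the `[hour, average]` pairs can stay in their original order. The four pairs are copied from the `avg_by_hour` output.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# key tells sorted() to compare only the average (index 1) of each pair.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print(hour, avg)
```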

In [67]:
# Printing the top 5 hours for Ask Posts

print("Top 5 Hours for `Ask HN` Comments:")
for avg, hr in sorted_swap[:5]:
    # Formatting the hour as HH:MM and the average to two decimal places
    hour_label = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_label, avg))

Top 5 Hours for `Ask HN` Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


# Conclusion

The top 5 hours to post are listed above. The best time to post is by far 3 PM (Eastern Time), with an average of 38.59 comments *per* post. A user looking for the highest possible user-to-user engagement should post at this time, which is 10 AM my time (Hawaii Standard Time). The posts in this data are from 2015 and 2016, so the best times could be very different by now, but this is still a great place to start.
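
The conversion from Eastern Time to Hawaii time can be sketched with fixed UTC offsets. This sketch assumes Eastern Standard Time (UTC-5) and Hawaii Standard Time (UTC-10). While the East Coast is on daylight saving time (UTC-4) the Hawaii hour is one hour earlier, because Hawaii does not observe daylight saving time. The function name `eastern_hour_to_hawaii` is hypothetical.

```python
import datetime as dt

# Assumed fixed offsets: this ignores daylight saving time on the East Coast.
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5), "EST")
HAWAII_STANDARD = dt.timezone(dt.timedelta(hours=-10), "HST")

def eastern_hour_to_hawaii(hour):
    """Convert an hour of the day in EST to the matching hour in HST."""
    # The calendar date is arbitrary; only the hour matters here.
    eastern = dt.datetime(2016, 1, 15, hour, 0, tzinfo=EASTERN_STANDARD)
    return eastern.astimezone(HAWAII_STANDARD).hour

# The top three hours from the ranking above.
for hr in ['15', '02', '20']:
    print(hr, '->', eastern_hour_to_hawaii(int(hr)))
```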

Another idea for further analysis would be to find the top ten authors who receive the most comments and look into what kinds of posts they make. That way you could possibly find the following information:

- How do the top authors write their titles?
- What type of posts are the top authors posting?
- What format are the top authors using for their posts? Are they the same? Are they different?

You can use this new information to increase the value of your own posts. 
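
As a starting point for that follow-up, total comments per author could be tallied with `collections.Counter`. This is only a sketch of the idea: the rows below are invented placeholders in the data set's column order, not real posts, and on the real data `most_common(10)` would give the top ten authors.

```python
from collections import Counter

# Hypothetical rows in the data set's column order (author is index 5).
sample_posts = [
    ['1', 'Ask HN: First question?', '', '2', '6', 'author_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Second question?', '', '5', '10', 'author_a', '8/17/2016 15:02'],
    ['3', 'Ask HN: Third question?', '', '1', '3', 'author_b', '8/18/2016 20:41'],
]

# Summing the num_comments column (index 4) for each author.
comments_by_author = Counter()
for row in sample_posts:
    comments_by_author[row[5]] += int(row[4])

# most_common(n) returns the n authors with the highest totals.
top_authors = comments_by_author.most_common(2)
print(top_authors)
```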