# Exploring Hacker News Posts

# Introduction

In this project, we will look at data from Hacker News, a social news website where users submit stories that other people can vote and comment on. The goal is to understand which types of posts are likely to receive higher engagement. We will use this [data](https://www.kaggle.com/hacker-news/hacker-news-posts) from Kaggle throughout the project.

We will perform the following analyses to arrive at our conclusion:
 * Compare `Ask HN` and `Show HN` posts
 * Explore whether posts created at a certain time receive more comments on average

### Summary of Results

`Ask HN` posts created around 3 PM EST receive the most comments on average. In our data set, `Ask HN` posts receive 112.7% more comments on average than `Show HN` posts. Therefore, a user who wants to maximize comments on Hacker News should submit an `Ask HN` post around 3 PM EST.

For more details, please refer to the full analysis below.

# Data Exploration

Let's first explore the data set from Kaggle and transform the data into a list of lists.

In [1]:
from csv import reader

# Read the data and transform it into a list of lists;
# the with-block closes the file once it has been read
with open('HN_posts_year_to_Sep_26_2016.csv') as opened_file:
    hn = list(reader(opened_file))

# Store header data on a separate list
hn_header = hn[0]
hn = hn[1:]

print(hn_header)
hn[0:10]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14'],
 ['12578975',
  'Saving the H

For our analysis, we will focus on the `title`, `num_comments`, and `created_at` columns. The `title` and `created_at` columns are useful for segmenting the data, while the `num_comments` column serves as our measure of user engagement.
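
The code cells in this notebook access these columns by position (`row[1]`, `row[4]`, `row[6]`). As a minimal sketch, the positions can also be looked up from the header row instead of being hard-coded. The names `TITLE`, `NUM_COMMENTS`, `CREATED_AT`, and `sample_row` are illustrative and not part of the original analysis.

```python
# Header row as printed by the first cell
hn_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical constants: look up each column's position by name
TITLE = hn_header.index('title')
NUM_COMMENTS = hn_header.index('num_comments')
CREATED_AT = hn_header.index('created_at')

# First row of the data set, copied from the output above
sample_row = ['12579008',
              'You have two days to comment if you want stem cells to be classified as your own',
              'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
              '1', '0', 'altstar', '9/26/2016 3:26']

print(TITLE, NUM_COMMENTS, CREATED_AT)   # 1 4 6
print(sample_row[CREATED_AT])            # 9/26/2016 3:26
```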

# Data Analysis

## Post Types Comparison

In our analysis, we will compare `Ask HN` and `Show HN` posts to understand which type of post is more popular. A user submits an `Ask HN` post to ask the community a specific question, while a `Show HN` post is where a user shows a project, a product, or just something generally interesting. We can segment the data by checking how each value in the `title` column begins.

In [2]:
# Create empty list for each post type
ask_posts  = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]                      # Use the title column
    title = title.lower()               # Converts the title to all lower case
    if title.startswith('ask hn'):      # Adds post to ask_posts list if it starts with 'ask hn'
        ask_posts.append(row)
    elif title.startswith('show hn'):   # Adds post to show_posts list if it starts with 'show hn'
        show_posts.append(row)
    else:                               # Adds all remaining posts to other_posts list
        other_posts.append(row)

# Check the number of posts on each list
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
print('\n')

# Confirm that each list contains the correct type of post
print(ask_posts[0:5])
print('\n')
print(show_posts[0:5])

9139
10158
273822


[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]


[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', 

The data set includes a comparable number of `Ask HN` and `Show HN` posts, with the `Show HN` list having about 11% more rows than the `Ask HN` list.
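
As a quick check of that figure, the relative difference can be computed directly from the two counts printed above. This is only a sketch; the variable names are illustrative.

```python
ask_count = 9139     # number of Ask HN posts printed above
show_count = 10158   # number of Show HN posts printed above

# Relative difference, expressed as a percentage of the Ask HN count
pct_more_show = (show_count - ask_count) / ask_count * 100
print(f"Show HN has about {pct_more_show:.0f}% more posts than Ask HN")
```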

Next, let's determine whether `Ask HN` or `Show HN` posts receive more comments on average.

In [3]:
# Create a variable to sum the number of comments for Ask HN type
total_ask_comments = 0

# Create a loop that sums up the num_comments column for ask_posts list
for row in ask_posts:                   
    num_comments = int(row[4])           
    total_ask_comments += num_comments 

# Calculate the average number of comments for Ask HN type
avg_ask_comments = total_ask_comments / len(ask_posts)  
print('The average number of comments for ask posts is',avg_ask_comments)

# Create a variable to sum the number of comments for Show HN type
total_show_comments = 0

# Create a loop that sums up the num_comments column for show_posts list
for row in show_posts:                   
    num_comments = int(row[4])           
    total_show_comments += num_comments 
    
# Calculate the average number of comments for Show HN type
avg_show_comments = total_show_comments / len(show_posts) 
print('The average number of comments for show posts is',avg_show_comments)

The average number of comments for ask posts is 10.393478498741656
The average number of comments for show posts is 4.886099625910612


On average, `Ask HN` posts receive more than twice as many comments as `Show HN` posts.
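
The 112.7% figure quoted in the summary follows from the two averages printed above. A short sketch of the arithmetic, with the averages copied from that output:

```python
avg_ask_comments = 10.393478498741656   # average for Ask HN, from the output above
avg_show_comments = 4.886099625910612   # average for Show HN, from the output above

ratio = avg_ask_comments / avg_show_comments   # how many times more comments
pct_more = (ratio - 1) * 100                   # the same gap as a percentage increase

print(f"Ask HN / Show HN ratio: {ratio:.2f}")
print(f"Ask HN posts receive {pct_more:.1f}% more comments on average")
```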

Since `Ask HN` posts receive more comments than `Show HN` posts, we will focus on `Ask HN` posts for the rest of our analysis.

## Post Created Time Analysis
 
Let's explore whether the time a post is created affects the number of comments it receives. To do this analysis, we will perform the following steps:
 * Calculate the number of ask posts created in each hour of the day
 * Calculate the average number of comments ask posts receive by hour created
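
Both steps rely on extracting the hour from the `created_at` string. As a small illustration, here is how a single timestamp from the data set is parsed with `datetime.strptime`; the variable names are illustrative.

```python
import datetime as dt

created_at = '9/26/2016 2:53'   # timestamp of the first Ask HN row shown earlier

# Parse the string using the month/day/year hour:minute layout of the data set
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

hour_int = parsed.hour             # hour as an integer
hour_str = parsed.strftime('%H')   # hour as a zero-padded string

print(hour_int)   # 2
print(hour_str)   # 02
```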

Let's first calculate the number of ask posts created in each hour of the day.

In [4]:
import datetime as dt

# Create an empty list for the result 
result_lists = []

# Append post creation time and number of comments variable to the result list
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_lists.append([created_at,num_comments])

# Create empty dictionaries to count the number of posts and comments by hour 
counts_by_hour = {}
comments_by_hour = {}

for row in result_lists:
    date_str = str(row[0])
    date_dt = dt.datetime.strptime(date_str,'%m/%d/%Y %H:%M') # Convert the created_at string to a datetime object
    date_hr = dt.datetime.strftime(date_dt, '%H')             # Extract the hour as a zero-padded string
    if date_hr not in counts_by_hour:
        counts_by_hour[date_hr] = 1               # Create key in counts_by_hour and set value to 1
        comments_by_hour[date_hr] = row[1]        # Create key in comments_by_hour and set value to number of comments
    else:
        counts_by_hour[date_hr] += 1              # Increment the value in counts_by_hour by 1
        comments_by_hour[date_hr] += row[1]       # Increment the value in comments_by_hour by the number of comments

Now we have two dictionaries:
 1. `counts_by_hour` holds the number of ask posts created in each hour
 2. `comments_by_hour` holds the total number of comments on the ask posts created in each hour
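
As an alternative sketch (not the approach used in this notebook), the same tallies can be built with `collections.defaultdict`, which removes the need for the `if`/`else` branch. The sample rows below are the creation times and comment counts of the first five `Ask HN` posts printed earlier.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs from the first five Ask HN rows shown earlier
sample_rows = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:57', 0],
    ['9/25/2016 22:48', 3],
    ['9/25/2016 21:50', 2],
]

# Missing keys start at 0, so no explicit "first time seen" branch is needed
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))     # {'02': 1, '01': 1, '22': 2, '21': 1}
print(dict(comments_by_hour))   # {'02': 7, '01': 3, '22': 3, '21': 2}
```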

Next, we will use these dictionaries to calculate the average number of comments for each hour of the day.

In [5]:
# Create an empty list to store the result
avg_by_hour = []

# Loop through each hour in comments_by_hour
for hour in comments_by_hour:
    # Divide the total number of comments for the hour by the number of posts created in that hour
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]]) 

avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

Next, let's identify which hour has the highest average number of comments. Because `sorted` compares lists by their first element, we will first swap the two columns so that the average number of comments comes before the hour of day.

In [6]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]]) # Swap the average number of comments with the hour of day column

swap_avg_by_hour

[[11.137546468401487, '02'],
 [7.407801418439717, '01'],
 [8.804177545691905, '22'],
 [8.687258687258687, '21'],
 [7.163043478260869, '19'],
 [9.449744463373083, '17'],
 [28.676470588235293, '15'],
 [9.692007797270955, '14'],
 [16.31756756756757, '13'],
 [8.96474358974359, '11'],
 [10.684397163120567, '10'],
 [6.653153153153153, '09'],
 [7.013274336283186, '07'],
 [7.948339483394834, '03'],
 [6.696793002915452, '23'],
 [8.749019607843136, '20'],
 [7.713298791018998, '16'],
 [9.190661478599221, '08'],
 [7.5647840531561465, '00'],
 [7.94299674267101, '18'],
 [12.380116959064328, '12'],
 [9.7119341563786, '04'],
 [6.782051282051282, '06'],
 [8.794258373205741, '05']]

Next, we'll sort this list in descending order of average comments.

In [7]:
sorted_swap = sorted(swap_avg_by_hour,reverse = True) # Sort from highest to lowest average number of comments

print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:                    # Loop through the first five rows
    time = dt.datetime.strptime(row[1],"%H")   # Convert the hour string to a datetime object
    time = dt.datetime.strftime(time,"%H:%M")  # Format the time as hour:minute
    print("{time}: {avg:.2f} average comments per post".format(time=time, avg=row[0])) # Print the formatted time and average

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
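
As a side note, the column swap can be avoided by passing a `key` function to `sorted`. The sketch below uses a handful of `[hour, average]` pairs copied from the `avg_by_hour` output above; `avg_by_hour_sample` and `top_hours` are illustrative names.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above
avg_by_hour_sample = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['12', 12.380116959064328],
    ['10', 10.684397163120567],
    ['09', 6.653153153153153],
]

# Sort on the second element (the average) without reordering the columns
top_hours = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hour, avg in top_hours[:5]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```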


# Conclusion

A user who wants to maximize comments on Hacker News should create an `Ask HN` post around 3 PM EST. However, there are a few other factors and open questions to consider:
* The data set is from 2016, so users' interests may have shifted in recent years
* Which topics receive the most comments?
* Which types of posts receive more points or upvotes?