## Exploring Hacker News Data Analysis

Magnus Skonberg | December 29th 2022

### Project Intro

The purpose of this project is to apply the basic Python string handling, object-oriented programming, and date/time functionality covered up to this point in a practical data analysis.

In this project, we'll work with a modified dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/) and explore Ask HN and Show HN posts to determine the following:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

The dataset we're working with was reduced from ~300k rows to ~20k rows by removing submissions that did not receive comments and then randomly sampling from the remaining submissions.
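The reduction itself is not part of this notebook, but the two steps described above (drop submissions without comments, then sample at random) could look roughly like the sketch below. The function name `reduce_submissions`, the fixed seed, and the example rows are assumptions made for illustration; only the column layout (comment count at index 4) follows the data shown later.

```python
import random

def reduce_submissions(rows, sample_size, seed=0):
    """Drop rows with zero comments, then randomly sample from the rest."""
    # Assumption: the comment count is stored as a string at column index 4
    with_comments = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(with_comments, min(sample_size, len(with_comments)))

# Made-up rows for illustration only (not taken from the real dataset)
rows = [
    ['1', 'Ask HN: Example question?', '', '5', '3', 'user_a', '1/1/2016 10:00'],
    ['2', 'Example link post', 'http://example.com', '2', '0', 'user_b', '1/1/2016 11:00'],
    ['3', 'Show HN: Example project', 'http://example.org', '9', '7', 'user_c', '1/2/2016 15:30'],
]

reduced = reduce_submissions(rows, sample_size=2)
print(len(reduced))
```

One consequence of this filtering is that every remaining post has at least one comment, so the averages computed below describe commented posts only.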

### Load Data

To start, we read in our data:

In [1]:
# Import relevant libraries
from csv import reader
import datetime as dt

# Define function to open CSV file and return header and data
def header_and_data(csv_path):
    # Use a context manager so the file is closed once it has been read
    with open(csv_path, encoding='utf8', newline='') as opened_file:
        list_of_lists = list(reader(opened_file)) # generate list of lists

    return list_of_lists[0], list_of_lists[1:] # return header and data

In [2]:
headers, hn = header_and_data('hacker_news.csv')
#headers #verify header extraction
hn[:5] #verify first 5 rows

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

### Extract Posts Based on Type

We iterate over our dataset and sort posts into separate lists by type (Ask HN, Show HN, or other):

In [3]:
# Create three empty lists called ask_posts, show_posts, and other_posts
ask_posts = []
show_posts = []
other_posts = []

# Instantiate comment count variables
total_ask_comments = 0
total_show_comments = 0

# Empty list to collect [created_at, num_comments] pairs for ask posts
result_list = []

for row in hn:
    title = row[1]
    num_comments = int(row[4])
    created_at = row[6]

    #check whether the post is of ask, show, or other type
    if title.lower().startswith("ask hn"):
        ask_posts.append(title)
        total_ask_comments += num_comments
        
        # Record [created_at, num_comments] for this ask post (used later for the per-hour analysis)
        result_list.append([created_at, num_comments])
        
    elif title.lower().startswith("show hn"):
        show_posts.append(title)
        total_show_comments += num_comments
        
    else:
        other_posts.append(title)
       

In [4]:
# Check the number of posts in ask_posts, show_posts, and other_posts.
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
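
The prefix check used in the loop above can also be pulled out into a small helper, which makes the rule easy to test on its own. This is only a sketch: `classify_post` is a hypothetical name that does not appear in the notebook, and the example titles are made up.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase once so the check is case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Made-up titles for illustration
for example in ["Ask HN: How do you learn?", "SHOW HN: My side project", "Some news story"]:
    print(example, "->", classify_post(example))
```

Note that a title only counts as an ask or show post when the prefix appears at the very start, so a title that merely mentions "Ask HN" later on is classified as other.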


### Type Popularity

We calculate the average number of comments for each post type as a proxy for popularity:

In [5]:
# Check whether ask or show posts get more comments
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print(avg_ask_comments)
print(avg_show_comments)

14.038417431192661
10.31669535283993


Ask posts get more comments on average (about 14.0 per post versus 10.3 for show posts). This suggests that, on Hacker News, users engage more with questions than with posts that show them something.

Since ask posts receive more comments on average, we'll focus the remaining analysis on these posts.

### Comments per Hour

Next, we'll determine whether ask posts created at a certain hour of the day attract more comments on average:

In [6]:
# Create two empty dictionaries called counts_by_hour and comments_by_hour.
counts_by_hour = {}
comments_by_hour = {}

# Loop through each row of result_list and extract the hour from the date
for row in result_list:
    created_at = row[0]
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
    else:
        counts_by_hour[hour] += 1
    
    if hour not in comments_by_hour:
        comments_by_hour[hour] = row[1]
    else:
        comments_by_hour[hour] += row[1]

In [7]:
# Create a list of lists containing the hours during which posts were created and the average number of comments
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
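
As an aside, the two dictionaries and the averaging step can be combined into one function with `collections.defaultdict`, which removes the `if hour not in ...` branches. The sketch below is an alternative formulation rather than the code used above; the function name `average_comments_by_hour` and the three sample rows are assumptions for illustration.

```python
import datetime as dt
from collections import defaultdict

def average_comments_by_hour(rows):
    """Map each hour string ('00'..'23') to the mean comment count of its posts."""
    counts = defaultdict(int)    # number of posts created in each hour
    comments = defaultdict(int)  # total comments on posts created in each hour
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}

# Made-up [created_at, num_comments] pairs in the same format as result_list
sample = [["8/4/2016 11:52", 52], ["1/26/2016 11:30", 10], ["9/30/2015 4:12", 2]]
averages = average_comments_by_hour(sample)
print(averages)
```

Here the two posts created in hour `11` average to 31 comments, and the single post in hour `04` keeps its own count of 2.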


In [8]:
# Create a list that equals avg_by_hour with swapped columns.
swap_avg_by_hour = []

# Iterate over the rows of avg_by_hour, and append to swap_avg_by_hour a list whose first element is the second element of the row, and whose second element is the first element of the row.
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
#print(sorted_swap) #verify sort

print("Top 5 Hours for Ask Posts Comments.")

for row in sorted_swap[:5]:
    string = "{}: {:.2f} average comments per post".format(dt.datetime.strptime(row[1],'%H').strftime('%H:%M'), row[0])
    print(string)

Top 5 Hours for Ask Posts Comments.
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


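By far the best time to publish an ask post is around 3pm (15:00), which averages about 38.6 comments per post. The next best times are 2am (likely specific to Hacker News users) and 8pm, with roughly 23.8 and 21.5 comments per post respectively. Note that these hours follow the time zone of the dataset's timestamps.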
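
The ranking can also be produced without swapping the columns first, by passing a `key` function to `sorted`. The sketch below assumes a shortened `avg_by_hour` list whose values are rounded from the output shown earlier; the name `top_hours` is introduced here for illustration.

```python
# Shortened example: [hour, average] pairs with values rounded from the earlier output
avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the average (second element of each pair), largest first
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Sorting on the average alone also avoids comparing hour strings when two averages tie, which the swapped-list approach would do implicitly.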

### Conclusion

In this project, we analyzed Ask HN and Show HN posts on Hacker News to determine which type of post and which posting time attract the most comments. We used comment count as a proxy for popularity.

**For maximum popularity, we'd recommend submitting an ask post between 3pm and 4pm.**