# Exploring Hacker News Posts

## Introduction

Hacker News is a social media website where users post and discuss technology-related content. Just like on Reddit, users can upvote their favorite posts to the front page of the website. The more upvotes a post earns, the more views it is likely to attract. There are many different types of posts on Hacker News. In this project, I aim to determine whether posts created at a certain time of day receive more comments on average. In addition, I want to find out which of two types of posts, **Ask HN** or **Show HN**, receives more comments on average. In **Ask HN** posts, users ask the Hacker News community a question. In **Show HN** posts, users show the community an interesting project or product.

## Data 

The data set I will be working with comes from Dataquest. It is a modified version of the original [source](https://www.kaggle.com/hacker-news/hacker-news-posts). It contains approximately 20,000 posts, each of which has at least one comment.

## Data Cleaning

In [71]:
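from csv import reader

#read every row into a list; the with block closes the file afterwards
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

#observe the header row and the first four posts
hn[:5]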

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

Before analyzing the data, I will separate the header row from the rest of the rows.

In [72]:
headers = hn[0]

#remove headers
hn = hn[1:]

print(headers)

#verify that headers are removed
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Data Analysis

Now that the column headers have been removed from the *hn* list, I can filter the posts by whether their titles begin with **Ask HN** or **Show HN** and collect them into separate lists.

In [73]:
#create empty lists for each type of post
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"): 
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


1744
1162
17194
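The prefix test used in the loop above can also be pulled into a small function, which makes it easy to try on individual titles. This is only a sketch: `classify_title` and the sample titles are made up for illustration and are not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# made-up titles, used only to illustrate the case-insensitive prefix match
sample_titles = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "Interactive Dynamic Video",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing the title before the comparison is what lets a title such as "SHOW HN: ..." land in the show group.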


In [74]:
#average number of comments per ask post
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

#average number of comments per show post
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

    

14.038417431192661
10.31669535283993
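Since the ask and show averages are computed the same way, the calculation could be folded into one function. The sketch below assumes the same column order as the data set (comment count at index 4); `average_comments` and the sample rows are hypothetical and exist only to show the idea.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments over the given rows; 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# made-up rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: Example B', '', '3', '4', 'user_b', '1/1/2016 11:00'],
]
sample_average = average_comments(sample_posts)
print(sample_average)
```

With the lists built earlier, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`. The guard for an empty list avoids a division by zero if a filter matches nothing.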


On average, ask posts receive about 14 comments and show posts receive about 10 comments. Since ask posts generate more user participation than show posts do, I would like to learn whether ask posts created at a certain time of day are more likely to get comments.

In [75]:
import datetime as dt

#pair each ask post's creation time with its comment count
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])

#number of ask posts created in each hour
counts_by_hour = {}

#total comments received by ask posts created in each hour
comments_by_hour = {}

for row in result_list:
    created_at = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created_at.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
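To make the grouping step easier to follow, here is a minimal sketch of the same idea on a few made-up rows. It uses `collections.defaultdict` in place of the explicit `if hour not in ...` check, which is a substitution for illustration and not what the notebook does. The timestamps follow the data set's `created_at` format, but the rows themselves are invented.

```python
import datetime as dt
from collections import defaultdict

# made-up [created_at, num_comments] pairs in the data set's timestamp format
sample_rows = [
    ['8/4/2016 11:52', 52],
    ['8/5/2016 11:05', 8],
    ['6/17/2016 0:01', 1],
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # total comments for posts created in each hour

for created_at, num_comments in sample_rows:
    # parse the timestamp, then keep only the zero-padded hour, e.g. '11' or '00'
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

averages = {hour: comments[hour] / counts[hour] for hour in comments}
print(averages)
```

A `defaultdict(int)` starts every missing key at 0, so the first post for an hour needs no special case.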

In [76]:
#average number of comments per ask post for each hour
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_by_hour

[['09', 5.5777777777777775],
 ['10', 13.440677966101696],
 ['15', 38.5948275862069],
 ['06', 9.022727272727273],
 ['02', 23.810344827586206],
 ['23', 7.985294117647059],
 ['20', 21.525],
 ['08', 10.25],
 ['16', 16.796296296296298],
 ['01', 11.383333333333333],
 ['14', 13.233644859813085],
 ['00', 8.127272727272727],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['21', 16.009174311926607],
 ['11', 11.051724137931034],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['17', 11.46],
 ['04', 7.170212765957447],
 ['22', 6.746478873239437],
 ['12', 9.41095890410959],
 ['07', 7.852941176470588],
 ['13', 14.741176470588234]]

In [77]:
#put the average first so that sorted() orders the rows by average
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")

for average, hour in sorted_swap[:5]:
    #format the hour as HH:MM for display
    hour = dt.datetime.strptime(hour, "%H")
    hour = hour.strftime('%H:%M')
    output = "{}: {num:.2f} average comments per post.".format(hour, num=average)
    print(output)

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.
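The swap before sorting is one way to rank the hours. As an alternative sketch, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be ordered by their second element directly. The sample below copies four pairs from the `avg_by_hour` output above; the name `avg_by_hour_sample` is introduced here only for illustration.

```python
# a few [hour, average] pairs copied from the avg_by_hour output
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# sort by the average (index 1), highest first, without building a swapped list
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post.".format(hour, average))
```

Sorting with a key keeps each pair in its original `[hour, average]` order, which avoids a second list and makes the unpacking in the loop match the data layout.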


Ask posts created during the 15:00 hour (US Eastern time) receive the most comments on average, about 38.59 per post. That is substantially higher than the average of 23.81 comments per post for the second-ranked hour, 02:00.