# Exploring Hacker News Posts

## Introduction

In this project we will be exploring a dataset of posts to a popular technology site, [Hacker News](https://news.ycombinator.com/). Hacker News is very popular in tech and start-up circles, with the most popular posts getting hundreds of thousands of visitors.

We will be working with a reduced version of the dataset: a random sample of approximately 20,000 posts that received at least one comment. We are specifically interested in posts whose titles begin with `Ask HN` or `Show HN`, which ask the community a question or show it something of interest. We'll compare these two types of post to answer the following questions:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's read in the dataset:

In [1]:
opened_file = open('hacker_news.csv')

from csv import reader

read_file = reader(opened_file)

hn = list(read_file)

hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]
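
The cell above leaves the file handle open. As a minimal sketch, the same read can be wrapped in a `with` block, which closes the file automatically. The helper name `read_csv_rows` and the tiny sample file are hypothetical, made up so the snippet runs on its own:

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Return every row of a CSV file as a list of lists of strings."""
    # newline='' is the setting the csv module documentation recommends
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.reader(f))


# Hypothetical two-line sample standing in for hacker_news.csv
sample = 'id,title,num_comments\n1,"Ask HN: Example, with a comma",3\n'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8', newline='') as tmp:
    tmp.write(sample)

sample_rows = read_csv_rows(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(sample_rows)
```

With the real dataset, the call would be `read_csv_rows('hacker_news.csv')`, assuming the file sits in the working directory as it does in the cell above.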

## Removing Headers

Let's remove the header row:

In [2]:
headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [3]:
hn = hn[1:]
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Extracting Ask HN and Show HN Posts

We're only concerned with `Ask HN` and `Show HN` posts, so we'll create a new list of lists with only these posts. To check whether a title begins with `Ask HN` or `Show HN`, we'll use the string method `string.startswith("substring")`, which returns `True` if the string starts with `"substring"` and `False` otherwise:

In [4]:
string1 = "DataQuest is the best!"
string1.startswith("DataQuest")

True

We will analyse `Ask HN` and `Show HN` posts separately, so we'll create separate lists for each. To control for case, we'll use the `string.lower()` method:

In [5]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
len(ask_posts), len(show_posts), len(other_posts)

(1744, 1162, 17194)
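
`startswith` is case-sensitive, which is why the titles are lowercased before the check. Here is a small illustration with a made-up title. Passing a tuple of prefixes is an optional shortcut that this project does not use:

```python
title = "ask hn: how do I learn Python?"  # made-up title

# startswith is case-sensitive, so the capitalised prefix does not match
matches_exact = title.startswith("Ask HN")
print(matches_exact)       # False

# Lowercasing first makes the check case-insensitive
matches_lowered = title.lower().startswith("ask hn")
print(matches_lowered)     # True

# A tuple of prefixes matches if any one of them matches
is_ask_or_show = title.lower().startswith(("ask hn", "show hn"))
print(is_ask_or_show)      # True
```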

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Let's now calculate the average number of comments for each type of post. The number of comments is stored as a string, so we'll need to convert it to an integer. Instead of repeating the same code three times, we'll write a function which calculates the average number of comments:

In [6]:
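def avg_comments(type_of_post):
    # Sum the num_comments column (index 4), converting each string to an int
    total_comments = 0
    for post in type_of_post:
        comments = int(post[4])
        total_comments += comments
    # Use a distinct name so the local variable doesn't shadow the function
    average = total_comments / len(type_of_post)
    return average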

In [7]:
print("Average number of comments:\n-----------------------------")
print("Ask Posts:",avg_comments(ask_posts))
print("Show Posts:",avg_comments(show_posts))
print("Other Posts:",avg_comments(other_posts))

Average number of comments:
-----------------------------
Ask Posts: 14.038417431192661
Show Posts: 10.31669535283993
Other Posts: 26.8730371059672


Ask posts receive about 14 comments on average, whereas show posts receive around 10 comments on average.
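
The same average can be written more compactly with `sum()` and a generator expression. The sketch below uses a hypothetical name, `mean_comments`, and made-up rows. Unlike the function above, it returns `0.0` for an empty list instead of raising `ZeroDivisionError`:

```python
def mean_comments(posts, comments_index=4):
    """Average the integer values stored as strings at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(post[comments_index]) for post in posts) / len(posts)


# Made-up rows in the same column order as the dataset
sample_posts = [
    ['1', 'Ask HN: A?', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: B?', '', '5', '10', 'user_b', '8/17/2016 10:05'],
]
print(mean_comments(sample_posts))  # 8.0
print(mean_comments([]))            # 0.0
```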

## Finding the Number of Ask Posts and Comments by Hour Created

Since ask posts receive more comments, we'll focus the rest of our analysis on these posts. We now want to find out if posts created at a specific time of day receive more comments, so we'll calculate the number of ask posts created in each hour of the day, the number of comments those posts received, and the average number of comments per post for each hour.

In [8]:
print(headers)
ask_posts[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

Let's first create a list of lists, `result_list`, in which each element is a pair: the date and time a post was created, and the number of comments it received:

In [9]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[-1]
    num_comments = int(post[-3])
    result_list.append([created_at,num_comments])
    
result_list[:6]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17],
 ['9/26/2015 23:23', 1]]
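
The `created_at` values are strings in month/day/year hour:minute form. As a small sketch, this is how `datetime.strptime` turns one of them into a `datetime` object, from which the hour can be read either as an integer or as a zero-padded string:

```python
import datetime as dt

created_at = '8/16/2016 9:55'  # first value in the output above

# %m = month, %d = day, %Y = four-digit year, %H = 24-hour clock, %M = minute
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed)                 # 2016-08-16 09:55:00
print(parsed.hour)            # 9
print(parsed.strftime("%H"))  # 09
```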

Now let's create frequency tables of the number of posts by hour and the number of comments by hour. We'll need to parse the `created_at` strings:

In [10]:
counts_by_hour = {}
comments_by_hour = {}

for item in result_list:
    created_at = item[0]
    created_dt_object = dt.datetime.strptime(created_at,"%m/%d/%Y %H:%M")
    hour = created_dt_object.strftime("%H")
    comments = item[1]
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    

"Number of posts by hour:",counts_by_hour, "Number of comments by hour:",comments_by_hour

('Number of posts by hour:',
 {'09': 45,
  '13': 85,
  '10': 59,
  '14': 107,
  '16': 108,
  '23': 68,
  '12': 73,
  '17': 100,
  '15': 116,
  '21': 109,
  '20': 80,
  '02': 58,
  '18': 109,
  '03': 54,
  '05': 46,
  '19': 110,
  '01': 60,
  '22': 71,
  '08': 48,
  '04': 47,
  '00': 55,
  '06': 44,
  '07': 34,
  '11': 58},
 'Number of comments by hour:',
 {'09': 251,
  '13': 1253,
  '10': 793,
  '14': 1416,
  '16': 1814,
  '23': 543,
  '12': 687,
  '17': 1146,
  '15': 4477,
  '21': 1745,
  '20': 1722,
  '02': 1381,
  '18': 1439,
  '03': 421,
  '05': 464,
  '19': 1188,
  '01': 683,
  '22': 479,
  '08': 492,
  '04': 337,
  '00': 447,
  '06': 397,
  '07': 267,
  '11': 641})
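
The `if`/`else` branch above handles hours that are not yet in the dictionaries. One alternative, sketched here on made-up pairs with hypothetical variable names, is `collections.defaultdict`, which starts every missing key at zero:

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the same shape as result_list
sample_pairs = [
    ['8/16/2016 9:55', 6],
    ['5/2/2016 9:14', 1],
    ['8/2/2016 14:20', 3],
]

sample_counts = defaultdict(int)    # missing hours start at 0
sample_comments = defaultdict(int)

for created_at, comments in sample_pairs:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += comments

print(dict(sample_counts))    # {'09': 2, '14': 1}
print(dict(sample_comments))  # {'09': 7, '14': 3}
```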

## Calculating the Average Number of Comments for Ask HN Posts by Hour

We'll use the dictionaries we just created to calculate the average number of comments for posts created during each hour of the day. There are two ways we can do this:

In [11]:
# dictionary
avg_dict = {}

for hour in counts_by_hour:
    avg_dict[hour] = comments_by_hour[hour] / counts_by_hour[hour]
    
print("Average number of comments per hour:")
print(avg_dict)

# list
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour

Average number of comments per hour:
{'09': 5.5777777777777775, '13': 14.741176470588234, '10': 13.440677966101696, '14': 13.233644859813085, '16': 16.796296296296298, '23': 7.985294117647059, '12': 9.41095890410959, '17': 11.46, '15': 38.5948275862069, '21': 16.009174311926607, '20': 21.525, '02': 23.810344827586206, '18': 13.20183486238532, '03': 7.796296296296297, '05': 10.08695652173913, '19': 10.8, '01': 11.383333333333333, '22': 6.746478873239437, '08': 10.25, '04': 7.170212765957447, '00': 8.127272727272727, '06': 9.022727272727273, '07': 7.852941176470588, '11': 11.051724137931034}


[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

This format makes it difficult to see which hours have the highest average number of comments. We'll use the built-in `sorted()` function, which works on any iterable object, including lists and dictionaries. We'll use the list of lists, although an analogous method would work for the dictionary.

First we make a list with the values swapped:

In [12]:
swap_avg_by_hour = []

for element in avg_by_hour:
    first = element[1]
    second = element[0]
    swap_avg_by_hour.append([first,second])

swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

Now we sort this list by average number of comments using `sorted()`. Setting `reverse = True` ensures the list is in descending order:

In [13]:
sorted_avg = sorted(swap_avg_by_hour, reverse = True)
sorted_avg

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
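
Swapping the columns is one way to sort by average. Another is the `key` argument of `sorted()`, which sorts the dictionary's items directly. This is a sketch on a hypothetical subset of the averages, rounded for brevity:

```python
# A few of the averages from above, rounded for brevity
sample_avg = {'09': 5.58, '15': 38.59, '02': 23.81, '20': 21.525}

# items() yields (hour, average) pairs; key picks the average to sort on
by_average = sorted(sample_avg.items(), key=lambda pair: pair[1], reverse=True)
print(by_average)
# [('15', 38.59), ('02', 23.81), ('20', 21.525), ('09', 5.58)]
```

This keeps the hour first in each pair, so no swapped copy of the data is needed.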

Let's now display the top 5 hours in an easy-to-read format:

In [14]:
print("Top 5 Hours for Ask Posts Comments\n----------------------------------------------------")
for item in sorted_avg[:5]:
    comments = item[0]
    hour = item[1]
    dt_hour = dt.datetime.strptime(hour,"%H")
    format_hour = dt_hour.strftime("%H:00")
    print("{}: {:.2f} average comments per post".format(format_hour, comments))

Top 5 Hours for Ask Posts Comments
----------------------------------------------------
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Based on this, to have a higher chance of receiving comments on an `Ask HN` post, you should post between 3pm and 4pm Eastern Time, between 2am and 3am ET, or between 8pm and 9pm ET. The timezone can be found in the [documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) for the dataset.

## Summary

In this project, we analysed a dataset with data about posts created on [Hacker News](https://news.ycombinator.com/). We narrowed down our scope to `Ask HN` and `Show HN` posts and saw that `Ask HN` posts received more comments on average. We then created frequency tables to break down the `Ask HN` posts by the hour they were created. In conclusion, we found that to have a higher chance of receiving more comments, you should make an `Ask HN` post between 3pm and 4pm ET (Eastern Time), between 2am and 3am ET, or between 8pm and 9pm ET.

## Next Steps

For a more in-depth analysis, we could explore the following points:

- Determine if `Show HN` or `Ask HN` posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
- Compare these results to the average number of comments and points other posts receive.
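
As a starting point for the points-by-hour question, the hourly average could be generalised to any numeric column. The function name `average_by_hour`, its parameters, and the sample rows below are all hypothetical, so treat this as a sketch rather than a finished analysis:

```python
import datetime as dt


def average_by_hour(posts, value_index, date_index=6):
    """Map each hour that appears in posts to the mean of the column at value_index."""
    totals = {}
    counts = {}
    for post in posts:
        hour = dt.datetime.strptime(post[date_index], "%m/%d/%Y %H:%M").strftime("%H")
        totals[hour] = totals.get(hour, 0) + int(post[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Made-up rows; index 3 is num_points and index 4 is num_comments
example_posts = [
    ['1', 'Ask HN: A?', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: B?', '', '4', '10', 'user_b', '8/17/2016 9:05'],
    ['3', 'Ask HN: C?', '', '28', '17', 'user_c', '10/15/2015 16:38'],
]
points_by_hour = average_by_hour(example_posts, value_index=3)
print(points_by_hour)  # {'09': 3.0, '16': 28.0}
```

Calling it with `value_index=4` would give the comment averages instead, which makes the two measures easy to compare.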