# Exploring Hacker News Posts

We are exploring [Hacker News](https://news.ycombinator.com/) posts, specifically comparing two types of posts: `Ask HN` and `Show HN`.

In `Ask HN` posts, users submit a question to the Hacker News community (e.g. "What is the best programming language to learn for data science?"). In `Show HN` posts, users share projects, products, and other interesting technology-related items.

We will compare these two types of posts to learn the following:
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

We are working with a reduced dataset, shrunk from about 300,000 rows to about 20,000 rows. This was done by removing posts with no comments and then randomly sampling from the remaining posts.
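The reduction described above was done before this notebook, so its actual code is not part of the project. A minimal sketch of how such a reduction could work, assuming rows in the same column order as `hacker_news.csv`; the function name `reduce_dataset`, the seed, and the example rows are hypothetical:

```python
import random

def reduce_dataset(rows, target_size, seed=0):
    """Drop rows with zero comments, then randomly sample up to target_size rows."""
    # Column 4 holds num_comments in this dataset's column order.
    with_comments = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    size = min(target_size, len(with_comments))
    return rng.sample(with_comments, size)

# Hypothetical rows for illustration only (not taken from the dataset).
example_rows = [
    ['1', 'Post A', '', '5', '0', 'user_a', '1/1/2016 10:00'],
    ['2', 'Post B', '', '3', '4', 'user_b', '1/2/2016 11:00'],
    ['3', 'Post C', '', '9', '2', 'user_c', '1/3/2016 12:00'],
    ['4', 'Post D', '', '1', '7', 'user_d', '1/4/2016 13:00'],
]

reduced = reduce_dataset(example_rows, target_size=2)
print(len(reduced))
```

Filtering before sampling matters here: sampling first would leave fewer rows than the target once the zero-comment posts were removed.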

In [1]:
from csv import reader

# Read the whole CSV into a list of rows; the with block closes the file afterwards.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Here is a sample of the data we are working with. There are just 7 columns, the most important being the title, the number of comments, and the creation date.

Before we get too far, let's remove the header from the dataset, but keep it for reference.

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Separating Ask HN and Show HN posts

A key step in the process is finding the `Ask HN` and `Show HN` posts and creating our own datasets to use.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
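
The `if`/`elif`/`else` test in the loop above can also be written as a small reusable function. This is only a sketch of the same prefix check; the name `classify_title` and the example titles are hypothetical:

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # lowercasing makes the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles for illustration.
print(classify_title('Ask HN: How do you learn Python?'))  # ask
print(classify_title('SHOW HN: My weekend project'))       # show
print(classify_title('Interactive Dynamic Video'))         # other
```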


## Calculating Average Number of Comments

Now that we have separated the data, let's look at the average number of comments for both types of posts.

In [4]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [5]:
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


On average, ask posts get around 14 comments while show posts get around 10. Since ask posts tend to attract more comments, we will focus on them for the rest of the analysis.
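
The two averaging cells above repeat the same steps for each list. As a sketch, the calculation could be wrapped in one helper; `average_comments` and the example rows are hypothetical and not part of the original notebook:

```python
def average_comments(posts):
    """Return the mean of the num_comments column (index 4) across posts."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[4]) for row in posts)
    return total / len(posts)

# Hypothetical rows for illustration only (6 and 10 comments).
sample_posts = [
    ['1', 'Ask HN: Example question one', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Example question two', '', '5', '10', 'user_b', '11/22/2015 13:43'],
]
print(average_comments(sample_posts))  # 8.0
```

In the notebook, the same helper could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.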

## Ask Posts and Comments by Hour

Now let's look at which posting hour tends to attract the most comments.
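
The key step is turning a `created_at` string into an hour. A short sketch of that parsing on its own, using two `created_at` values that appear in the rows printed earlier:

```python
import datetime as dt

# Format of the created_at column: month/day/year hour:minute.
DATE_FORMAT = '%m/%d/%Y %H:%M'

parsed = dt.datetime.strptime('8/4/2016 11:52', DATE_FORMAT)
print(parsed.hour)            # 11 (an integer)
print(parsed.strftime('%H'))  # 11 (a zero-padded string)

# Single-digit hours are parsed too; strftime pads them to two digits.
midnight = dt.datetime.strptime('6/17/2016 0:01', DATE_FORMAT)
print(midnight.strftime('%H'))  # 00
```

Using the zero-padded string as the dictionary key keeps hours such as `'02'` and `'15'` in a consistent two-character form.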

In [6]:
import datetime as dt

# Pair each ask post's creation time with its comment count.
result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])
    
counts_by_hour = {}    # number of ask posts created in each hour
comments_by_hour = {}  # total comments on ask posts created in each hour

for row in result_list:
    date = row[0]
    comments = row[1]
    # Parse the timestamp and keep the zero-padded hour, e.g. '15'.
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
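
The `if hour not in ...` branching above can be avoided with `collections.defaultdict`, which starts missing keys at zero. A sketch of that alternative, not the approach used in this notebook; `tally_by_hour` and the example rows are hypothetical:

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(posts):
    """Return (post counts, comment totals) keyed by zero-padded hour."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for row in posts:
        hour = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M').strftime('%H')
        counts[hour] += 1
        comments[hour] += int(row[4])
    return dict(counts), dict(comments)

# Hypothetical rows for illustration only.
sample_posts = [
    ['1', 'Ask HN: Example A', '', '1', '4', 'user_a', '8/16/2016 15:05'],
    ['2', 'Ask HN: Example B', '', '1', '6', 'user_b', '9/2/2016 15:40'],
    ['3', 'Ask HN: Example C', '', '1', '3', 'user_c', '9/3/2016 9:10'],
]
sample_counts, sample_comments = tally_by_hour(sample_posts)
print(sample_counts)
print(sample_comments)
```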

## Calculating the Average per Hour

The output above shows the total number of comments on ask posts, grouped by the hour in which the posts were created. Hour 15 (3pm EST) has by far the largest total. However, a large total could simply mean that more posts were created in that hour, so let's divide the number of comments by the number of posts for each hour.

In [7]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, round((comments_by_hour[hour] / counts_by_hour[hour]), 2)])
    
avg_by_hour

[['00', 8.13],
 ['20', 21.52],
 ['18', 13.2],
 ['07', 7.85],
 ['14', 13.23],
 ['04', 7.17],
 ['19', 10.8],
 ['12', 9.41],
 ['17', 11.46],
 ['09', 5.58],
 ['16', 16.8],
 ['03', 7.8],
 ['02', 23.81],
 ['13', 14.74],
 ['23', 7.99],
 ['10', 13.44],
 ['01', 11.38],
 ['22', 6.75],
 ['08', 10.25],
 ['21', 16.01],
 ['05', 10.09],
 ['06', 9.02],
 ['15', 38.59],
 ['11', 11.05]]

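Let's sort the averages from highest to lowest. Swapping the two values in each pair puts the average first, so that `sorted` orders the list by it.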

In [8]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[8.13, '00'], [21.52, '20'], [13.2, '18'], [7.85, '07'], [13.23, '14'], [7.17, '04'], [10.8, '19'], [9.41, '12'], [11.46, '17'], [5.58, '09'], [16.8, '16'], [7.8, '03'], [23.81, '02'], [14.74, '13'], [7.99, '23'], [13.44, '10'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [16.01, '21'], [10.09, '05'], [9.02, '06'], [38.59, '15'], [11.05, '11']]


[[38.59, '15'],
 [23.81, '02'],
 [21.52, '20'],
 [16.8, '16'],
 [16.01, '21'],
 [14.74, '13'],
 [13.44, '10'],
 [13.23, '14'],
 [13.2, '18'],
 [11.46, '17'],
 [11.38, '01'],
 [11.05, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.09, '05'],
 [9.41, '12'],
 [9.02, '06'],
 [8.13, '00'],
 [7.99, '23'],
 [7.85, '07'],
 [7.8, '03'],
 [7.17, '04'],
 [6.75, '22'],
 [5.58, '09']]
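
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be sorted without building a swapped list. The sample below copies four pairs from the `avg_by_hour` output; the variable names are hypothetical:

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
avg_by_hour_sample = [['00', 8.13], ['20', 21.52], ['02', 23.81], ['15', 38.59]]

# Sort by the second element (the average), highest first.
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)
print(sorted_by_avg)
# [['15', 38.59], ['02', 23.81], ['20', 21.52], ['00', 8.13]]
```

One difference to keep in mind: with a `key`, pairs that have equal averages keep their original relative order, while the swapped-list approach breaks ties by the hour string.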

Now let's print the top five hours in a readable format.

In [9]:
print("Top 5 Hours for Ask Post Comments")

for avg, hour in sorted_swap[:5]:
    # Reformat the hour string, e.g. '15' becomes '15:00'.
    hour_label = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_label, avg))

Top 5 Hours for Ask Post Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


## Conclusion

As we can see, ask posts created during hour 15 (3pm EST) receive the most comments on average, at 38.59 comments per post. Posting during that hour may therefore improve the chances of receiving comments on a post in the `Ask HN` category on Hacker News.
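
To use this result from another time zone, the hour has to be converted. The sketch below assumes the timestamps are in EST as stated above and models EST as a fixed UTC-5 offset, ignoring daylight saving time; the function name `est_hour_to_utc` and the reference date are hypothetical:

```python
import datetime as dt

# Assumption: timestamps are in EST, modeled here as a fixed UTC-5 offset.
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

def est_hour_to_utc(hour_string):
    """Convert a zero-padded EST hour such as '15' to an 'HH:MM' string in UTC."""
    # The date itself is arbitrary; only the hour matters for a fixed offset.
    est_time = dt.datetime(2016, 1, 26, int(hour_string), 0, tzinfo=EST)
    return est_time.astimezone(dt.timezone.utc).strftime('%H:%M')

print(est_hour_to_utc('15'))  # 20:00
```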

## Ending notes for continued expansion

That's it for the guided steps! Here's a quick summary of what we accomplished in this guided project:

- We set a goal for the project.
- We collected and sorted the data.
- We reformatted and cleaned the data to prepare it for analysis.
- We analyzed the data.

Guided projects can be used to build a portfolio to showcase to potential employers, so we encourage you to keep working on this. Here are some next steps for you to consider:

- Determine if show or ask posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
- Compare your results to the average number of comments and points other posts receive.
- Use Dataquest's [data science project style guide](https://www.dataquest.io/blog/data-science-project-style-guide/) to format your project.

If you choose to work on the next steps independently, you'll inevitably run into tasks you don't know how to perform or errors you don't know how to resolve. Don't get discouraged! This is part of the learning process. Although referring back to previous missions is a great way to refresh your memory on certain topics, there are also a couple of tools you should get familiar with because you'll need to use them in a real-world job setting.

The best thing to do if you hit an error you can't resolve or don't know how to perform a task is search for the answer on Google. When you search, make sure to include the word "python" — otherwise, you'll get results from other programming languages. For example, instead of searching for "how to find the first element in a list," search for "python how to find the first element in a list."

As you search, you'll see one site constantly appear at the top of the results — Stack Overflow. Stack Overflow is an online community where people ask and answer programming questions. In most situations, you'll find that someone has asked the same question as you or a similar question that can help you. The community is very active, so the answers are almost always accurate.

Congratulations, this is the end of the course!