## Guided Project 2: Exploring Hacker News Posts

![alt text][logo]

[logo]: https://cdn-images-1.medium.com/max/700/1*CVOGx9ckrpWyTvtQgauYpw.jpeg "Logo Hacker News"


[Hacker News](https://news.ycombinator.com/news) is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as 'posts') are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

The data set can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts) and it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

Below are descriptions of the columns: 

* **`id`** - The unique identifier from Hacker News for the post
* **`title`** - The title of the post
* **`url`** - The URL that the post links to, if the post has a URL
* **`num_points`** - The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* **`num_comments`** - The number of comments that were made on the post
* **`author`** - The username of the person who submitted the post
* **`created_at`** - The date and time at which the post was submitted


We're specifically interested in posts whose titles begin with either **`Ask HN`** or **`Show HN`**. Users submit **`Ask HN`** posts to ask the Hacker News community a specific question. Below are a few examples:

    Ask HN: How to improve my personal website?
    Ask HN: Am I the only one outraged by Twitter shutting down share counts?
    Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit **`Show HN`** posts to show the Hacker News community a project, product, or just generally something interesting. Below are a few examples:

    Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
    Show HN: Something pointless I made
    Show HN: Shanhu.io, a programming playground powered by e8vm
    

We'll compare these two types of posts to determine the following:

* Do **`Ask HN`** or **`Show HN`** posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the data set into a list of lists.

In [1]:
# Open the file, read the data set into a list of lists, and close the file
import csv

with open('hacker_news.csv') as data:
    hn = list(csv.reader(data))
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

## Separating the data from the header

In [2]:
# Extract the first row of data, and assign it to the variable headers.

headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
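The later cells refer to columns by position (`row[1]`, `row[4]`, `row[6]`). As a small sketch, the header row can be turned into a name-to-index lookup so that those positions do not have to be memorized. The name `col_index` is illustrative and is not part of the original notebook.

```python
# Header row as shown in the output above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row.
col_index = {name: position for position, name in enumerate(headers)}

# First data row from the output above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

print(col_index['title'], col_index['num_comments'], col_index['created_at'])
print(sample_row[col_index['title']])
print(int(sample_row[col_index['num_comments']]))
```

With this lookup, `row[col_index['num_comments']]` reads the same value as `row[4]`.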


## Isolating Ask and Show posts from the others

In [3]:
# Create lists called ask_posts, show_posts, and other_posts to store the posts.
ask_posts = []
show_posts = []
other_posts = []

# Note: append the whole row, not just the title. Appending the title makes the
# later int() conversion of the comment count fail with an
# "invalid literal for int() with base 10" error.
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check the number of posts in each list
print("There are", len(ask_posts), "Ask HN posts")
print("There are", len(show_posts), "Show HN posts")
print("There are", len(other_posts), "other posts")

There are 1744 Ask HN posts
There are 1162 Show HN posts
There are 17194 other posts
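The prefix test used in the loop above can also be isolated in a small function, which makes it easy to check against a few titles. This is only a sketch; `classify_title` is a hypothetical name, and the sample titles are taken from the examples earlier in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing the title first means that variants such as `ASK HN` or `ask hn` are counted as well.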


## Verify the content of our lists

In [4]:
# Let's print the first five rows of ask_posts
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


In [5]:
# Let's print the first five rows of show_posts

print(show_posts[:5])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


## Average comments for Ask and Show Posts

In [6]:
# Let's find the average number of comments on Ask HN posts
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])  # num_comments is the fifth column
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [7]:
# Let's find the average number of comments on Show HN posts
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])  # num_comments is the fifth column
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


Based on the results above, Ask HN posts receive more comments on average than Show HN posts (about 14.0 versus 10.3). This is plausible, since a question invites readers to offer their own perspective.

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts. 
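The two averaging cells above run the same loop on different lists. As a sketch, that loop could be wrapped in one function and reused. The name `average_comments` and its `comments_column` parameter are assumptions made for illustration. The sample rows are the first three `ask_posts` rows printed earlier.

```python
def average_comments(posts, comments_column=4):
    """Mean of the num_comments column over a list of rows (hypothetical helper)."""
    if not posts:
        raise ValueError("cannot average an empty list of posts")
    total = sum(int(post[comments_column]) for post in posts)
    return total / len(posts)


# First three rows of ask_posts from the output above.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1',
     'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_ask))  # (6 + 29 + 1) / 3
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` on the full lists should reproduce the two averages shown above.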


## Let's find the best time to ask questions on HN

We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.
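Both steps depend on extracting the hour from the `created_at` string. A minimal sketch of that parsing, using the timestamp of the first Ask HN post shown earlier and the `%m/%d/%Y %H:%M` layout that the data follows:

```python
import datetime as dt

date_template = "%m/%d/%Y %H:%M"

# created_at value of the first Ask HN post shown earlier.
created_at = "8/16/2016 9:55"

parsed = dt.datetime.strptime(created_at, date_template)
hour_label = parsed.strftime("%H")  # zero-padded string, useful as a dictionary key
hour_number = parsed.hour           # plain integer

print(parsed)
print(hour_label, hour_number)
```

`strptime` accepts the unpadded month and hour in the data, while `strftime("%H")` always returns two digits.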


### 1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

In [8]:
# 1. Count the ask posts created in each hour and total their comments.

import datetime as dt

# Pair each post's creation time with its comment count
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

count_by_hour = {}
comments_by_hour = {}
date_template = "%m/%d/%Y %H:%M"

for created_at, comments in result_list:
    # Parse the timestamp and keep the hour as a zero-padded string, e.g. "09"
    time_posted = dt.datetime.strptime(created_at, date_template)
    hour = time_posted.strftime("%H")

    if hour not in count_by_hour:
        count_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        count_by_hour[hour] += 1
        comments_by_hour[hour] += comments

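## Posts by hour
The keys are displayed in sorted order here, most likely because the notebook's pretty-printer sorts dictionary keys for display. The dictionary's own iteration order is different, as the unsorted `avg_by_hour` output further down shows.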

In [9]:
count_by_hour

{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}

## Comments by hour

In [10]:
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
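The two dictionaries above are built with an explicit "first time seen" check. As an alternative sketch, `collections.defaultdict` removes that branch. The first five pairs below come from the `ask_posts` rows printed earlier. The last pair is made up so that one hour appears twice.

```python
from collections import defaultdict
import datetime as dt

# (created_at, num_comments) pairs; the last one is a made-up row for illustration.
sample = [
    ("8/16/2016 9:55", 6),
    ("11/22/2015 13:43", 29),
    ("5/2/2016 10:14", 1),
    ("8/2/2016 14:20", 3),
    ("10/15/2015 16:38", 17),
    ("1/1/2016 9:10", 4),  # hypothetical, not from the data set
]

# Missing keys start at 0, so no "if hour not in ..." check is needed.
count_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    count_by_hour[hour] += 1
    comments_by_hour[hour] += comments

print(dict(count_by_hour))
print(dict(comments_by_hour))
```

Here the `'09'` key ends up with two posts and ten comments, while every other hour has a single post.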

### 2. Calculate and display the average number of comments ask posts receive by hour created

In [11]:
avg_by_hour = []
# To get the average number of comments per post for each hour, divide that
# hour's total comments by its number of posts.
for hour in comments_by_hour:
    # Keep the averages as floats. Formatting them here with "{:.3f}".format(...)
    # turns them into strings, and sorted() then compares them character by
    # character instead of numerically.
    avg_by_hour.append([hour, comments_by_hour[hour] / count_by_hour[hour]])

avg_by_hour

[['12', 9.41095890410959],
 ['23', 7.985294117647059],
 ['07', 7.852941176470588],
 ['09', 5.5777777777777775],
 ['11', 11.051724137931034],
 ['14', 13.233644859813085],
 ['19', 10.8],
 ['15', 38.5948275862069],
 ['06', 9.022727272727273],
 ['08', 10.25],
 ['10', 13.440677966101696],
 ['13', 14.741176470588234],
 ['17', 11.46],
 ['22', 6.746478873239437],
 ['00', 8.127272727272727],
 ['21', 16.009174311926607],
 ['18', 13.20183486238532],
 ['02', 23.810344827586206],
 ['04', 7.170212765957447],
 ['03', 7.796296296296297],
 ['20', 21.525],
 ['01', 11.383333333333333],
 ['05', 10.08695652173913],
 ['16', 16.796296296296298]]
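The cell above keeps the averages as floats instead of formatting them into strings. A small sketch, using three of the averages from the output, shows why that matters once the values are sorted:

```python
# Three averages from the output above.
averages = [9.41095890410959, 38.5948275862069, 5.5777777777777775]

# Floats sort numerically.
as_floats = sorted(averages, reverse=True)

# Formatted strings sort character by character, so "9.411" beats "38.595".
as_strings = sorted(["{:.3f}".format(value) for value in averages], reverse=True)

print(as_floats)
print(as_strings)
```

The string version puts `'9.411'` first because the character `'9'` compares greater than `'3'`. Formatting is therefore best left to the final `print` step.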

## Let's sort the list
The end goal is to recommend the best time frame to post on Hacker News.

In [12]:
# Swap each pair so that the average comes first; sorted() then orders by average.
swap_avg_by_hour = []

for hour, average in avg_by_hour:
    swap_avg_by_hour.append([average, hour])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[9.41095890410959, '12'], [7.985294117647059, '23'], [7.852941176470588, '07'], [5.5777777777777775, '09'], [11.051724137931034, '11'], [13.233644859813085, '14'], [10.8, '19'], [38.5948275862069, '15'], [9.022727272727273, '06'], [10.25, '08'], [13.440677966101696, '10'], [14.741176470588234, '13'], [11.46, '17'], [6.746478873239437, '22'], [8.127272727272727, '00'], [16.009174311926607, '21'], [13.20183486238532, '18'], [23.810344827586206, '02'], [7.170212765957447, '04'], [7.796296296296297, '03'], [21.525, '20'], [11.383333333333333, '01'], [10.08695652173913, '05'], [16.796296296296298, '16']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
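Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so the original `[hour, average]` pairs can be ordered directly. The pairs below are a subset of the `avg_by_hour` output above.

```python
# A few [hour, average] pairs from avg_by_hour above.
avg_by_hour = [
    ['12', 9.41095890410959],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['09', 5.5777777777777775],
    ['20', 21.525],
]

# Sort on the second element (the average) without building a swapped copy.
sorted_by_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in sorted_by_avg:
    print(hour, average)
```

The pairs keep their `[hour, average]` layout, so any later loop would unpack them as `for hour, average in ...`.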

## Print the top 5 hours for Ask HN post comments

In [13]:
print("Top 5 Hours for Ask Posts Comments")

for average, hour in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post.".format(
        dt.datetime.strptime(hour, "%H").strftime("%H:%M"), average)
         )
    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.
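The hours above are printed in 24-hour form. If 12-hour labels are preferred, a sketch along these lines could convert them. `to_12_hour` is a hypothetical helper, and the AM/PM text produced by `%p` depends on the locale (the usual C/English locale is assumed here).

```python
import datetime as dt


def to_12_hour(hour_label):
    """Convert a zero-padded hour such as '15' to a 12-hour label (hypothetical helper)."""
    return dt.datetime.strptime(hour_label, "%H").strftime("%I:%M %p")


# Top five hours from the output above.
for hour_label in ['15', '02', '20', '16', '21']:
    print(hour_label, "->", to_12_hour(hour_label))
```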


#### Abstract

The goal of this project was to analyze `Ask HN` and `Show HN` posts that received comments.
The original [data set](https://www.kaggle.com/hacker-news/hacker-news-posts) was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
Running the code above on the original, unreduced data set might therefore produce different results.

## Conclusion

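According to the [data set documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the timestamps are in US Eastern Time, so 15:00 means 3 PM Eastern.

From the results of our analysis:

* The best hour to post an `Ask HN` question is 3 PM, with about 38.6 comments per post on average.
* Four of the top five hours (15:00, 16:00, 20:00 and 21:00) fall between 3 PM and 9 PM, which makes that the best interval overall. The exception is 02:00, which ranks second.
* One possible explanation is that people have more free time to read and comment from mid-afternoon onward than earlier in the day.

There are more avenues to explore and more questions to ask and answer. Our imagination is the limit.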