# Exploring Hacker News Posts

In this project, we will compare two different types of posts from Hacker News, a popular site where people can share links or ask questions about technology-related topics.

We will specifically compare two types of posts:

* `Ask HN`: Posts whose title starts with Ask HN ask the Hacker News community a question about a specific topic.
* `Show HN`: Posts whose title starts with Show HN introduce the community to a project, product or something interesting.


The information we want to obtain through the analysis is the following:
* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

## Introduction

First, we will open the `hacker_news.csv` file, which contains the data set we will be working with, and read it into a list of lists:

In [1]:
# Read in the data.
import csv

# Use a context manager so the file is closed once it has been read
with open('hacker_news.csv') as f:
    hn = list(csv.reader(f))

# Show first five rows:
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

We can see above that the data set contains the following columns:

* `id`: The unique identifier from Hacker News for the post
* `title`: The title of the post
* `url`: The URL that the post links to, if the post has a URL
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the post
* `author`: The username of the person who submitted the post
* `created_at`: The date and time at which the post was submitted

As we can see, the first row contains the headers of this data set, so we will extract it and store it in a separate list:

In [2]:
# Store the header row in its own list
headers = hn[0]

# Remove the header row from the hn list
hn = hn[1:]
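
As a side note, the header row can also be separated while the file is being read, by calling `next` on the reader before collecting the remaining rows. The sketch below is only an illustration of that idea: it uses a small in-memory sample (the `sample`, `sample_headers` and `sample_rows` names are assumptions made for this example) instead of the real `hacker_news.csv` file.

```python
import csv
import io

# Hypothetical one-row sample standing in for hacker_news.csv
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

reader = csv.reader(sample)
sample_headers = next(reader)  # the first row is consumed as the header
sample_rows = list(reader)     # every remaining row is data

print(sample_headers)
print(len(sample_rows))
```

Both approaches end with the same two lists; slicing `hn[1:]` is simply more explicit about what is being removed.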

## Extracting Ask HN and Show HN Posts

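To separate the posts beginning with `Ask HN`, the posts beginning with `Show HN`, and all other posts into three different lists, we will use the string method `startswith`. Because titles are not capitalized consistently, we convert each title to lowercase before checking its prefix: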

In [3]:
# Split into three different lists

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# Check number of posts for each condition

print("Ask posts: " + str(len(ask_posts)))
print("Show posts: " + str(len(show_posts)))
print("Other posts: " + str(len(other_posts)))

Ask posts: 1744
Show posts: 1162
Other posts: 17194
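
The prefix check used above can be tried in isolation on a few titles. The following is a minimal sketch, not part of the original analysis: the `classify_title` helper and the sample titles are hypothetical and exist only to show how `lower` and `startswith` combine.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix."""
    lowered = title.lower()  # makes the comparison case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, used only for illustration
for example in ['Ask HN: How do you learn?', 'SHOW HN: My project', 'Interactive Dynamic Video']:
    print(example, '->', classify_title(example))
```

Note that a title that merely contains the phrase somewhere in the middle is still classified as `other`, because `startswith` only looks at the beginning of the string.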


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now we will determine which type of post receives more comments on average:

In [4]:
# Calculate the average number of comments `Ask HN` posts receive
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments/len(ask_posts)
print("Average number of comments of 'Ask HN' posts: " + str(avg_ask_comments))


# Calculate the average number of comments `Show HN` posts receive
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments/len(show_posts)
print("Average number of comments of 'Show HN' posts: " + str(avg_show_comments))

Average number of comments of 'Ask HN' posts: 14.038417431192661
Average number of comments of 'Show HN' posts: 10.31669535283993


On average, ask posts receive approximately 14 comments, whereas show posts receive approximately 10. Since ask posts are more likely to receive comments, we will focus the remaining analysis on them.
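
The two averaging loops above follow the same pattern, so they could be folded into one small function. The sketch below shows one way to do that; the `average_comments` name, its `comments_index` parameter and the sample rows are assumptions made for this example, and returning `0.0` for an empty list is a choice made here to avoid a division by zero.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored as strings in each row."""
    if not posts:
        return 0.0  # assumption: treat an empty list as an average of zero
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical rows that mimic the column layout of the data set
sample_posts = [
    ['1', 'Ask HN: First question', '', '5', '52', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Second question', '', '3', '10', 'user_b', '1/26/2016 19:30'],
]

print(average_comments(sample_posts))
```

With the real lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.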

## Finding the Number of Ask Posts and Comments by Hour Created

Next, we will check whether the time of day at which an ask post is created affects the number of comments it receives.

To do that, we will count the ask posts created during each hour of the day, along with the total number of comments those posts received:

In [5]:
# Count the ask posts created during each hour of the day and total the comments they received

import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    comments = int(post[4])
    result_list.append([created_at, comments])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    time1 = row[0]
    time2 = dt.datetime.strptime(time1, "%m/%d/%Y %H:%M")
    post_hour = dt.datetime.strftime(time2, "%H")
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = row[1]
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += row[1]
        
print("Number of posts by hour:")      
print(counts_by_hour)
print("Number of comments by hour:")      
print(comments_by_hour)

Number of posts by hour:
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
Number of comments by hour:
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
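
The hour extraction in the loop above depends on the format string matching the layout of `created_at`. As a small illustration, the sketch below parses two timestamps taken from the first rows of the data set; the `hour_of` helper is a hypothetical name introduced only for this example.

```python
import datetime as dt

DATE_FORMAT = "%m/%d/%Y %H:%M"  # month/day/year, then 24-hour time

def hour_of(created_at):
    """Return the zero-padded hour ('00' to '23') of a created_at string."""
    parsed = dt.datetime.strptime(created_at, DATE_FORMAT)
    return parsed.strftime("%H")

for raw in ["8/4/2016 11:52", "6/17/2016 0:01"]:
    print(raw, "->", hour_of(raw))
```

`strptime` accepts months, days and hours without a leading zero, while `strftime("%H")` always pads the hour to two digits, which is why the dictionary keys above look like `'09'` and `'00'`.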


## Calculating the Average Number of Comments for Ask HN Posts by Hour


Then, we will calculate the average number of comments per post created during each hour of the day, using the `counts_by_hour` and `comments_by_hour` dictionaries obtained in the previous section:

In [6]:
# Calculate the average number of comments per 'Ask HN' post created during each hour.

avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print("Average number of comments per post per hour:")
print(avg_by_hour)
    

Average number of comments per post per hour:
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


## Sorting and Printing Values from a List of Lists

To make it easier to identify the hours with the highest values, we will finish this analysis by sorting the list of lists in descending order and printing the ranking in a format that's easier to read:

In [7]:
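# Sort the hours in descending order of average comments per post.
# The average is moved to the first position of each pair because sorted() compares the first element first.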

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Ranking of Hours for Ask Posts Comments:")

for average, hour in sorted_swap:
    hour_label = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_label, average))

Ranking of Hours for Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post


Posts created during the 15:00 hour receive the most comments on average, at 38.59 comments per post. That is about 62% more than the second-ranked hour, 02:00, which averages 23.81 comments per post.
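
The gap between the two leading hours can be checked with a short calculation. The sketch below uses the totals printed earlier for 15:00 (4477 comments over 116 posts) and 02:00 (1381 comments over 58 posts); the variable names are assumptions made for this example.

```python
# Totals taken from the comments_by_hour and counts_by_hour outputs
top_average = 4477 / 116     # 15:00
second_average = 1381 / 58   # 02:00

# Relative increase of the top hour over the second-ranked hour
relative_increase = (top_average - second_average) / second_average

print("{:.2f} vs {:.2f}: {:.0%} higher".format(top_average, second_average, relative_increase))
```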

## Conclusion

In this project, we analyzed posts on Hacker News to determine which type of post and which posting hour receive the most comments on average. Based on our analysis, which did not take into account posts without comments, we'd recommend that, to maximize the number of comments it receives, a post be written as an `Ask HN` post and be created between 15:00 and 16:00.