# Project
**Exploring Hacker News Posts**

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

The data set can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
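- created_at: The date and time at which the post was submitted

The objective of this project is to analyze Ask HN and Show HN posts: which of the two types receives more comments on average, and whether posts created at a certain time of day receive more comments.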


In [1]:
#Reading the hacker_news.csv file:
from csv import reader

#the with block closes the file once it has been read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
hn[:5]


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:
#extracting the header
headers = hn[0]
hn = hn[1:]
print(headers)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

Having removed the headers from hn, we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.
We'll use the string method *startswith*, after converting each title to lowercase so that the match is case-insensitive.
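
As a minimal sketch of how the prefix check behaves, the snippet below applies it to a few made-up titles. The titles and the `categorize` helper are illustrative assumptions and are not part of the data set or the notebook's own code.

```python
# Hypothetical sample titles, not taken from the data set.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "ask hn: lowercase prefix",
    "Show HN: My weekend project",
    "Why I ask HN readers for feedback",
]

def categorize(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

categories = [categorize(title) for title in sample_titles]
print(categories)
```

Note that the last title contains "ask HN" in the middle, so it falls into the "other" category: *startswith* only matches at the beginning of the string.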

In [3]:
#create empty lists
ask_posts = []
show_posts = []
other_posts = []

#looping through each row to find number of posts in each of
#the stated categories

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('The number of ask posts is: ' + str(len(ask_posts)))
print('The number of show posts is: ' + str(len(show_posts)))
print('The number of other posts is: ' + str(len(other_posts)))

        

The number of ask posts is: 1744
The number of show posts is: 1162
The number of other posts is: 17194


The code above has separated the "ask posts" and the "show posts" into two lists of lists named ask_posts and show_posts.

Next, we'll determine which, among the two, received more comments on average.
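
The averaging step can be sketched as a small helper function. The name `average_comments`, its `comments_index` parameter, and the sample rows are assumptions made for illustration; the rows only mimic the column order of the data set.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored as strings in each row."""
    if not posts:
        # Avoid dividing by zero when a category is empty.
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows: id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: First question?', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Second question?', '', '5', '10', 'user_b', '8/16/2016 10:05'],
]

sample_average = average_comments(sample_posts)
print(sample_average)
```

A helper like this would avoid repeating the same loop for ask_posts and show_posts.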

In [4]:
print(ask_posts[:3])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']]


In [5]:
total_ask_comments = 0
for item in ask_posts:
    num_comments = int(item[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments/len(ask_posts)
print('The average comments on ask posts is: ' + str(avg_ask_comments))

total_show_comments = 0
for item in show_posts:
    num_comments = int(item[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments/len(show_posts)
print('The average comments on show posts is: ' + str(avg_show_comments))


The average comments on ask posts is: 14.038417431192661
The average comments on show posts is: 10.31669535283993


From the output above, ask posts receive more comments on average than show posts (about 14.04 versus 10.32 comments per post).

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments those posts received.
2. Calculate the average number of comments ask posts receive by hour created.
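
The first step can be sketched on a handful of rows. The sample rows below are illustrative assumptions in the same `created_at` format as the data set, and `dict.get` is used here as one possible way to accumulate the totals.

```python
import datetime as dt

# Hypothetical [created_at, num_comments] pairs in the data set's date format.
sample_rows = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]

date_format = "%m/%d/%Y %H:%M"
sample_counts = {}
sample_comments = {}

for created_at, num_comments in sample_rows:
    # Parse the string into a datetime, then keep only the zero-padded hour.
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    sample_counts[hour] = sample_counts.get(hour, 0) + 1
    sample_comments[hour] = sample_comments.get(hour, 0) + num_comments

print(sample_counts)
print(sample_comments)
```

Two of the sample posts fall in hour "09", so that key ends up with a count of 2 and a comment total of 7.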

In [6]:
import datetime as dt
result_list = []
for post in ask_posts:
    date_created = post[6]
    num_of_comments = int(post[4])
    result_list.append([date_created, num_of_comments])
    
counts_by_hour = {}
comments_by_hour = {}

date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    created_date = row[0]
    comment = row[1]
    
    parsed_date = dt.datetime.strptime(created_date, date_format)
    hour_created = parsed_date.strftime("%H")
    
    if hour_created not in counts_by_hour:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comment
    else:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comment 
        
print(comments_by_hour)
print(counts_by_hour)

{'02': 1381, '06': 397, '20': 1722, '01': 683, '22': 479, '11': 641, '14': 1416, '13': 1253, '05': 464, '19': 1188, '08': 492, '10': 793, '18': 1439, '15': 4477, '09': 251, '07': 267, '17': 1146, '03': 421, '21': 1745, '04': 337, '00': 447, '16': 1814, '23': 543, '12': 687}
{'02': 58, '06': 44, '20': 80, '01': 60, '22': 71, '11': 58, '14': 107, '13': 85, '05': 46, '19': 110, '08': 48, '10': 59, '18': 109, '15': 116, '09': 45, '07': 34, '17': 100, '03': 54, '21': 109, '04': 47, '00': 55, '16': 108, '23': 68, '12': 73}


We'll use these two dictionaries, counts_by_hour and comments_by_hour, to calculate the average number of comments for posts created during each hour of the day.

To achieve this, we need to create a list of lists containing the hours during which posts were created and the average number of comments those posts received.
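
On a toy example, dividing the comment total by the post count for each hour looks like this. The two small dictionaries are made-up values used only for illustration.

```python
# Hypothetical totals for two hours of the day.
sample_comments_by_hour = {"09": 7, "13": 29}
sample_counts_by_hour = {"09": 2, "13": 1}

# One [hour, average] pair per hour; both dictionaries share the same keys.
sample_avg_by_hour = [
    [hour, sample_comments_by_hour[hour] / sample_counts_by_hour[hour]]
    for hour in sample_comments_by_hour
]

print(sample_avg_by_hour)
```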


In [7]:
avg_by_hour = []
for hour_created in comments_by_hour:
    average = (comments_by_hour[hour_created])/ (counts_by_hour[hour_created])
    avg_by_hour.append([hour_created, average ])
    
print("The average number of comments per ask post per hour is: " +
      str(avg_by_hour))
    

The average number of comments per ask post per hour is: [['02', 23.810344827586206], ['06', 9.022727272727273], ['20', 21.525], ['01', 11.383333333333333], ['22', 6.746478873239437], ['11', 11.051724137931034], ['14', 13.233644859813085], ['13', 14.741176470588234], ['05', 10.08695652173913], ['19', 10.8], ['08', 10.25], ['10', 13.440677966101696], ['18', 13.20183486238532], ['15', 38.5948275862069], ['09', 5.5777777777777775], ['07', 7.852941176470588], ['17', 11.46], ['03', 7.796296296296297], ['21', 16.009174311926607], ['04', 7.170212765957447], ['00', 8.127272727272727], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959]]


To finish, we need to sort the list of lists to identify hours with highest values, and print the five highest values in a format that's easier to read.
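
One possible alternative to swapping the columns before sorting is to pass a `key` function to `sorted`, so the list is ordered by the average directly. This is a sketch on made-up values, not the approach used in the cells that follow.

```python
# Hypothetical [hour, average] pairs.
sample_avg_by_hour = [["09", 3.5], ["13", 29.0], ["20", 12.25]]

# Sort by the second element (the average), highest first.
sample_sorted = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in sample_sorted[:2]:
    print("{}:00: {:.2f} average comments per post.".format(hour, average))
```

Sorting with a key leaves the [hour, average] order intact, so no second list is needed.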

In [8]:
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1], hour[0]])
    
print(swap_avg_by_hour)


[[23.810344827586206, '02'], [9.022727272727273, '06'], [21.525, '20'], [11.383333333333333, '01'], [6.746478873239437, '22'], [11.051724137931034, '11'], [13.233644859813085, '14'], [14.741176470588234, '13'], [10.08695652173913, '05'], [10.8, '19'], [10.25, '08'], [13.440677966101696, '10'], [13.20183486238532, '18'], [38.5948275862069, '15'], [5.5777777777777775, '09'], [7.852941176470588, '07'], [11.46, '17'], [7.796296296296297, '03'], [16.009174311926607, '21'], [7.170212765957447, '04'], [8.127272727272727, '00'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12']]


In [14]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    average = row[0]
    hour = row[1]
    
    hour_format = "%H"
    new_hour = dt.datetime.strptime(hour, hour_format)
    hour = new_hour.strftime("%H:%M")
    
    output = "{a}: {b:.2f} average comments per post.".format(a = hour, b = average)
    print(output)   
    
        

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


**15:00 (EST)** has the highest average number of comments per ask post (38.59), well ahead of 02:00 (23.81) and 20:00 (21.52). In this sample, creating an ask post around 15:00 therefore gives the best chance of receiving comments.