# Exploring Hacker News Posts

In this project, we'll work with a data set of submissions to the popular technology site Hacker News.
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

# Aim of the project 
In this project, we are specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question.

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

Our goal is to compare these two types of posts to answer the following questions:
1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?


## Opening and Exploring the dataset
You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. 

Let's start by importing the library we need, reading the data set into a list of lists, and separating the header row from the data.

In [13]:
# Read hacker_news.csv into a list of lists
from csv import reader

# The with-block closes the file once it has been read
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))

hn_header = hn[0]  # holds the header row of the dataset
# Remove the header and store the remaining rows in hn
hn = hn[1:]


To make the data easier to explore, we'll define a function named explore_data() that can be called repeatedly to print rows in a readable way.

In [14]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        


In [15]:
# Now let's explore the hn dataset.
print(hn_header)
print('\n')
explore_data(hn, 0, 5, True)  # prints the first 5 rows

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Number of rows: 20100
Number of columns: 7

So there are a total of 20100 rows and 7 columns in the dataset.
Below is a description of the columns.

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted
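
Because each row is a plain list, columns are accessed by position. The following is a minimal sketch of looking up a column's index by name and pairing a row with its header. The `header` and `sample_row` values are copied from the output above, and the variable names are illustrative only.

```python
# Header and one row copied from the output above, so the snippet
# runs on its own without hacker_news.csv.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Find a column's position by name instead of hard-coding the index.
comments_idx = header.index('num_comments')
num_comments = int(sample_row[comments_idx])  # csv values are read as strings

# Pair each column name with its value for easier reading.
labelled = dict(zip(header, sample_row))
print(comments_idx, num_comments)
print(labelled['created_at'])
```

Note that every value read by the csv module is a string, which is why num_comments has to be converted with int() before doing any arithmetic.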

## Filter the data
The next step is to filter our data. Since we're only concerned with post titles beginning with 'Ask HN' or 'Show HN', we'll create new lists of lists containing just the data for those titles.

To do this, we'll take the following steps:
- Create three empty lists called ask_posts, show_posts, and other_posts.
- Loop through each row in hn.
- Assign the title in each row to a variable named title. Because the title column is the second column, we need to get the element at index 1 in each row.
- Implement the following steps:
  1. If the lowercase version of title starts with ask hn, append the row to ask_posts.
  2. Else if the lowercase version of title starts with show hn, append the row to show_posts.
  3. Else append the row to other_posts.
- Check the number of posts in ask_posts, show_posts, and other_posts.


In [16]:
ask_posts =[]
show_posts =[]
other_posts =[]

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else :
        other_posts.append(row)
print("There are {num} posts starting with Ask Hn".format(num=len(ask_posts)))
print("There are {num} posts starting with Show Hn".format(num=len(show_posts)))
print("There are {num} other posts ".format(num=len(other_posts)))



There are 1744 posts starting with Ask Hn
There are 1162 posts starting with Show Hn
There are 17194 other posts 


Now we can check the first five rows of ask_posts and show_posts by calling the explore_data() function.

In [17]:
# First 5 rows of ask_posts
print("The first 5 rows of ask_posts:")
print('\n')
explore_data(ask_posts, 0, 5, True)
print('\n')
# First 5 rows of show_posts
print("The first 5 rows of show_posts:")
print('\n')
explore_data(show_posts, 0, 5, True)

The first 5 rows of ask_posts:


['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']


['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']


['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']


Number of rows: 1744
Number of columns: 7


The first 5 rows of show_posts:


['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']


['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015

Next, let's determine if ask posts or show posts receive more comments on average.

# Part 1: Do ask posts or show posts receive more comments on average?

## Step 1: Average number of comments on ask posts
To do this, we need to:
1. Find the total number of comments on ask posts.
2. Find the average number of comments on ask posts.

In [20]:
#find the total number of comments in ask posts.
total_ask_comments =0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments+=num_comments
#Avg number of comments
avg_ask_comments = total_ask_comments/len(ask_posts)
print("The average number of comments on ask post is {average:.2f}".format(average=avg_ask_comments))
    

The average number of comments on ask post is 14.04


## Step 2: Average number of comments on show posts
To do this, we need to:
1. Find the total number of comments on show posts.
2. Find the average number of comments on show posts.

In [19]:
#find the total number of comments in show posts.
total_show_comments =0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments+=num_comments
#Avg number of comments on show posts
avg_show_comments = total_show_comments/len(show_posts)
print("The average number of comments on show post is {average:.2f}".format(average=avg_show_comments))
    

The average number of comments on show post is 10.32


From the above findings, we can see that ask posts receive 14.04 comments on average, while show posts receive 10.32.
So, in this sample, ask posts receive more comments on average than show posts.
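
The two averages above are computed with the same loop, so the calculation could be wrapped in a small helper. The sketch below is one possible way to do that. The function name `average_comments` and the sample rows are hypothetical and are not part of the original analysis.

```python
def average_comments(posts, comments_idx=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_idx]) for row in posts)
    return total / len(posts)

# Tiny made-up sample in the same column layout as the dataset.
sample_posts = [
    ['1', 'Ask HN: example A', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: example B', '', '28', '29', 'user_b', '11/22/2015 13:43'],
]
print("{:.2f}".format(average_comments(sample_posts)))  # (6 + 29) / 2
```

With the real data, the helper would be called as average_comments(ask_posts) and average_comments(show_posts).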

# Part 2: Do posts created at a certain time receive more comments on average?


Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

To perform this analysis, we'll take the following steps:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

## Step 1: Calculating the number of ask posts and comments by hour created
We will make use of the datetime module to work with the data in the created_at column.

In this step, we are going to calculate the number of ask posts created per hour, along with the total number of comments.

The steps are as follows:
- Import the datetime module as dt.
- Create an empty list and assign it to result_list. This will be a list of lists.
- Iterate over ask_posts and append to result_list a list with two elements:
   1. The first element is the value of the created_at column.
   2. The second element is the number of comments on the post (converted to int).

   The result_list should look like this:
   result_list = [[created_at, num_of_comments], ...]

- Create two empty dictionaries called counts_by_hour and comments_by_hour.
- Loop through each row of result_list:
    1. Take the date string, which is the first element of the row.
    2. Use the datetime.strptime() method to parse the date and create a datetime object. Pass the string to parse as the first argument and a string that specifies the format as the second argument.
    3. Use the datetime.strftime() method to select just the hour from the datetime object.
    4. If the hour isn't a key in counts_by_hour:
       - Create the key in counts_by_hour and set it equal to 1.
       - Create the key in comments_by_hour and set it equal to the comment number.
    5. If the hour is already a key in counts_by_hour:
       - Increment the value in counts_by_hour by 1.
       - Increment the value in comments_by_hour by the comment number.
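
To make the parsing step concrete, here is a minimal sketch that converts a single created_at string into an hour. The date string is the created_at value of the first ask post shown earlier. The variable names are illustrative only.

```python
import datetime as dt

date_string = '8/16/2016 9:55'  # created_at value of the first ask post

# %m/%d/%Y %H:%M matches month/day/four-digit year, then hour:minute (24-hour clock).
parsed = dt.datetime.strptime(date_string, "%m/%d/%Y %H:%M")

# %H formats the hour as a zero-padded two-character string.
hour = parsed.strftime("%H")
print(parsed)
print(hour)
```

Because strftime() returns a string, the hours end up as dictionary keys such as '09' rather than as integers.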
   
   

In [50]:
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_of_comments = int(row[4])
    result_list.append([created_at,num_of_comments])
print("The first row of result_list:")
print('\n')
print(result_list[0])  # prints the first row of result_list


The first row of result_list:


['8/16/2016 9:55', 6]


From the above output, we can see that result_list is a list of lists in which the first element of each row is the created_at value and the second element is the number of comments.

In [38]:
#create two empty dictionary
counts_by_hour ={}
comments_by_hour ={}
for row in result_list:
    create_date = row[0]
    num_of_comments = row[1]
    create_date = dt.datetime.strptime(create_date,"%m/%d/%Y %H:%M")#to parse and create a datetime object
    created_time = dt.datetime.strftime(create_date,"%H")#to select just the hour from the datetime object.
    if created_time not in counts_by_hour:
        counts_by_hour[created_time]=1
        comments_by_hour[created_time]=num_of_comments
    else:
        counts_by_hour[created_time]+=1
        comments_by_hour[created_time]+=num_of_comments
        
print("The amount of ask post created by hour:")
print('\n')
print(counts_by_hour)
print('\n')
print("The total number of comments created by hour:")
print('\n')
print(comments_by_hour)
        
        

The amount of ask post created by hour:


{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


The total number of comments created by hour:


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


From the above output, we can read the total number of ask posts created in each hour and the total number of comments they received.
For instance, 45 ask posts were created between 9:00 and 9:59, and together they received 251 comments.
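
As an aside, the "create the key or increment it" pattern used above can be written more compactly with collections.defaultdict from the standard library. This is only a sketch of an alternative, not the approach used in this project, and the `sample_results` rows are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# Made-up sample in the [created_at, num_of_comments] layout of result_list.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 9:14', 1],
]

counts = defaultdict(int)    # missing keys start at 0
comments = defaultdict(int)

for created_at, num_of_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_of_comments

print(dict(counts))
print(dict(comments))
```

Since a missing key defaults to 0, the if/else branch is no longer needed.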

## Step 2: Calculate the average number of comments ask posts receive by hour created
In this step, we are going to calculate the average number of comments per post for posts created during each hour of the day. The result should be a list of lists in which the first element is the hour and the second element is the average number of comments per post.

In [49]:
avg_by_hour = []
for hour in counts_by_hour:
    # average = total comments in that hour / number of posts in that hour
    avg_comments = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comments])

print(avg_by_hour)
    
    
    

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


The format of the above result makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

For this, we first create a list that equals avg_by_hour with the columns swapped.

In [52]:
# Create an empty list swap_avg_by_hour
swap_avg_by_hour = []
for row in avg_by_hour:
    # Swap the elements of the row so that the average comes first
    # and the hour second. This makes the list easier to sort.
    swap_avg_by_hour.append([row[1], row[0]])
print("The swapped avg_by_hour is:")
print('\n')
print(swap_avg_by_hour)


The swapped avg_by_hour is:


[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


We need the results sorted in descending order so that we can find the five hours with the highest average number of comments. For this, we can use the sorted() function on swap_avg_by_hour. This works because sorted() compares lists element by element, and the first element of each row in swap_avg_by_hour is the average number of comments for that hour.
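
Swapping the columns is one way to sort by the average. Another option, sketched here as an alternative rather than as the method used in this project, is to pass a key function to sorted() so the original [hour, average] order is kept. The `sample_avg_by_hour` list holds a few pairs copied from avg_by_hour above.

```python
# A few [hour, average] pairs copied from avg_by_hour above.
sample_avg_by_hour = [['09', 5.5777777777777775], ['15', 38.5948275862069],
                      ['20', 21.525], ['02', 23.810344827586206]]

# Sort on the second element (the average) without swapping the columns.
by_average = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in by_average:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

The key function tells sorted() which value to compare, so no intermediate swapped list is required.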

In [59]:
sorted_swap=sorted(swap_avg_by_hour,reverse=True )

print("Top 5 Hours for Ask posts Comments")
print('\n')
for row in sorted_swap[0:5]:
    time_dt = dt.datetime.strptime(row[1],"%H")
    time = dt.datetime.strftime(time_dt,"%H:%M")
    print("{t}: {avg:.2f} average comments per post".format(t=time,avg=row[0]))

Top 5 Hours for Ask posts Comments


15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


From the above results, we can conclude that ask posts created at 15:00 (between 3pm and 4pm) received the most comments on average, at 38.59 comments per post.
Posts created at 02:00 (between 2am and 3am) have the second-highest average, about 23.81 comments per post. Posts created at 8pm, 4pm, and 9pm also received more comments than posts created at other hours, with averages of 21.52, 16.80, and 16.01 respectively.