# Hacker News Data Analysis Project

## By: David Vieira

In this project we are going to work with a data set from Hacker News, a popular website created by the startup incubator Y Combinator. The site is built around user-submitted posts that receive votes and comments, and the most popular posts can attract a large number of visitors.

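The data set that is going to be used can be found [Here](https://www.kaggle.com/hacker-news/hacker-news-posts). This data set has a particularity: a reduced version of it exists, cut from about 300,000 rows to about 20,000 by removing posts that received no comments and then randomly sampling the remaining ones. The file loaded in this notebook appears to be the full version, since the output below includes posts with zero comments and the three post categories add up to roughly 293,000 rows.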

The columns of the data set are described below:

* **id:** The unique identifier from Hacker News for the post
* **title:** The title of the post
* **url:** The URL that the post links to, if the post has a URL
* **num_points:** The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* **num_comments:** The number of comments that were made on the post
* **author:** The username of the person who submitted the post
* **created_at:** The date and time at which the post was submitted
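
Because the rows are later handled as plain lists and accessed by position (for example `row[1]` for the title), it can help to keep a name-to-position lookup. The following is a minimal sketch, assuming the column order listed above; `column_index` and the sample row are hypothetical and are not part of the original notebook.

```python
# Column names in the order listed above
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row
column_index = {name: position for position, name in enumerate(headers)}

# Hypothetical row, shaped like the rows in the data set
sample_row = ['1', 'Ask HN: An example question?', '', '4', '7', 'some_user', '9/26/2016 2:53']

title = sample_row[column_index['title']]
num_comments = int(sample_row[column_index['num_comments']])
print(title, num_comments)
```

With this lookup, `column_index['num_comments']` plays the same role as the literal index `4` used in the cells of this notebook.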

Our analysis focuses mainly on posts whose titles begin with **Ask HN** or **Show HN**.

* **Ask HN**: Posts that ask the Hacker News community a question
* **Show HN**: Posts that show something to the Hacker News community

Our questions are:

* Which of the two types of posts receives more comments on average?
* Do posts created at a certain time receive more comments on average?

Let's begin:

First we will import the libraries that we need.


In [1]:
#Libraries
from csv import reader
import datetime as dt



In [6]:
#Dataset to List of lists
#The with statement closes the file once it has been read
with open('C:/Users/david/Desktop/Autodidacta/Datasets/HN_posts_year_to_Sep_26_2016.csv', encoding="utf8") as file:
    read_file = reader(file)
    hn = list(read_file)
headers = hn[0]
hn = hn[1:]

#Function for data set visualization
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]) if dataset else 0)

#Data Set Columns
print(headers)
print("\n")

#First five rows and number of columns
explore_data(hn, 0, 5, True)


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Number of 

Now that we are all set with our dataset, let's filter it. As we said, we only need the posts whose titles begin with **Ask HN** or **Show HN**, so we will test the title of each post and separate the posts into sub-lists according to this criterion. For this we will use the string method **startswith**, which checks whether a string starts with a given substring. The titles are converted to lowercase first so that the check ignores capitalization.
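
As a quick illustration of how `startswith` behaves, here is a small sketch. The titles and the `classify` helper are made up for this example and are not taken from the data set.

```python
# Hypothetical titles, used only to illustrate str.startswith
titles = [
    "Ask HN: How do you learn?",
    "SHOW HN: My project",
    "Tell me about Ask HN",
]

def classify(title):
    """Return 'ask', 'show' or 'other' based on the start of the title."""
    lowered = title.lower()  # lowercase first so the check ignores case
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

labels = [classify(title) for title in titles]
print(labels)
```

Note that the third title contains "Ask HN" but does not begin with it, so it falls into the "other" group.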

In [3]:
ask_posts = []
show_posts = []
other_posts = []

#Post Filtering
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
#Post quantity in each category
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

#First posts
explore_data(ask_posts, 0, 5)
explore_data(show_posts, 0, 5)

9139
10158
273822
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']


['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']


['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']


['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']


['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']


['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']


['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1',

Now that we filtered the posts, let's answer our first question:

Which of the two types of posts receives more comments on average?

In [4]:
#Get total number of comments in ask posts
total_ask_comments = 0
for post in ask_posts:
    comments = int(post[4])
    total_ask_comments += comments
#Get the average number of comments for ask posts
avg_ask_comments = total_ask_comments/len(ask_posts)
print("Average ask comments: {}".format(avg_ask_comments))

#Get total number of comments in show posts
total_show_comments = 0
for post in show_posts:
    comments = int(post[4])
    total_show_comments += comments
#Get the average number of comments for show posts
avg_show_comments = total_show_comments/len(show_posts)
print("Average show comments: {}".format(avg_show_comments))

Average ask comments: 10.393478498741656
Average show comments: 4.886099625910612
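
The two loops above repeat the same logic, so they could be folded into one helper. This is only a sketch: `average_comments` is a hypothetical function name, and the sample rows are invented to keep the example self-contained.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments; num_comments is column 4 in this data set."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: First question?', '', '4', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: Second question?', '', '6', '3', 'user_b', '9/26/2016 1:17'],
]

print("Average comments: {}".format(average_comments(sample_posts)))
```

In the notebook itself, the helper would be called once with `ask_posts` and once with `show_posts`.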


This analysis shows that Ask HN posts receive more comments on average: approximately 10 per post, compared with approximately 5 for Show HN posts. This makes sense. Ask HN posts put a question to the community, and a question invites an answer. Show HN posts, on the other hand, only present something to the community, so people comment on them only when they voluntarily want to say something to the author.

Since ask posts are the ones that get more comments, we will focus the rest of our analysis on this kind of post.

Now we will determine whether ask posts created at a certain time are more likely to get comments. For this we will:

* Calculate the number of ask posts created in each hour of the day and the number of comments they received.

* Calculate the average number of comments ask posts receive by hour created.

So let's do it!

For this we will use the datetime module that we imported at the beginning of this project to parse the dates, which are given to us as strings, into datetime objects so that we can operate on them.
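
A minimal sketch of that parsing step on a single value, using a timestamp in the same format as the `created_at` column:

```python
import datetime as dt

# A timestamp in the data set's format (month/day/year hour:minute)
created_at = "9/26/2016 3:26"

# Parse the string into a datetime object
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

hour_label = parsed.strftime("%H")  # zero-padded string, as used for the dictionary keys
hour_number = parsed.hour           # the same hour as an integer

print(parsed, hour_label, hour_number)
```

Either form of the hour works as a dictionary key; the notebook uses the zero-padded string.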


In [13]:
#Create a list of lists that contain the date  and number of comments
result_list = []

for post in ask_posts:
    creation_date = post[6]
    comments = int(post[4])   
    result_list.append([creation_date, comments])

#Create 2 dictionaries
counts_by_hour = {} #frequency table for number of posts per hour of day
comments_by_hour = {} #Total of comments per hour
for row in result_list:
    date = row[0]
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M") #e.g. '9/26/2016 3:26'
    hour = date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1 
        comments_by_hour[hour] += row[1]
 

Now that we have two dictionaries that tell us the number of posts per hour and the total number of comments per hour, we will use them to calculate the average number of comments for posts created during each hour of the day.

In [15]:
avg_by_hour = []
for hour in counts_by_hour:
    average = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, average])

#Display the result
avg_by_hour


[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]
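
The counting and averaging steps above can also be written without the `if`/`else` branch. The sketch below uses `collections.defaultdict`, which is an alternative to the plain dictionaries used in this notebook, and runs on hypothetical `[created_at, num_comments]` pairs shaped like `result_list`.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical [created_at, num_comments] pairs in the same shape as result_list
sample_results = [
    ["9/26/2016 2:53", 7],
    ["9/26/2016 2:10", 3],
    ["9/25/2016 22:57", 0],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1        # missing keys start at 0
    comments[hour] += num_comments

# Average comments per post for each hour
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```

A `defaultdict(int)` starts every missing key at zero, which is why no membership test is needed before incrementing.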

Now that we have the average number of comments for posts created during each hour of the day, let's sort the results so that they are easier to read and analyze.

In [17]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

sorted_swap = sorted(swap_avg_by_hour,reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    avg = row[0]
    hour = row[1]
    hour = dt.datetime.strptime(hour, "%H")
    hour = hour.strftime("%H:%M")
    template = "{0}: {1:.2f} average comments per post"
    print(template.format(hour, avg))



Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
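
The swap step in the cell above is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so the pairs can be sorted by their second element directly. The values below are a few of the `[hour, average]` pairs from the earlier output; `avg_by_hour_sample` is a name used only for this example.

```python
# A few [hour, average] pairs copied from the avg_by_hour output
avg_by_hour_sample = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['12', 12.380116959064328],
]

# Sort by the average (second element), highest first
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```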


From this analysis we can see that Ask HN posts created at 15:00 (3 pm; the time zone is Eastern Time in the US) receive the most comments on average, about 28.68 per post, well ahead of 13:00 with 16.32. So if you want more people commenting on your Ask HN posts, this is the hour to post them.