# Hacker News Post Comment Analysis

We are looking at Hacker News, a forum-style website that is popular in technology and startup circles.  Each post has a number of comments and a number of points, where points are the number of upvotes minus the number of downvotes.

Our goal is to look specifically at `Ask HN` and `Show HN` posts.  `Ask HN` posts are ones where the user asks everyone on the forum a question.  `Show HN` posts are ones where the user shows the forum something they have done.  We will compare how many comments each of these types of posts gets.  We will also check whether posts made at a certain time get more comments.

In [1]:
'''
Importing the data set
'''
from csv import reader

#Open up the hacker news data set
#(a forward slash avoids the invalid '\h' escape, and 'with' closes the file)
with open('Data_Sets/hacker_news.csv', encoding='utf8') as opened_file:
    read_file = reader(opened_file)
    h_news = list(read_file)

#Split the header out of the list
h_news_header = h_news[0]
h_news = h_news[1:]

'''
Printing the data set
'''
#Make a function to print the data set
def print_list(arg_list, start, stop):
    #Loop through the list section
    for row in arg_list[start:stop]:
        #Print the row with a space after it
        print(row)
        print('\n')
        
#Print the first 5 rows with the header
print(h_news_header)
print('\n')
print_list(h_news, 0, 5)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']




Now that we have the data set loaded, we can start working towards our goals.

#### Separating The List
Now that we have the data set in a list, with the header split out, we will divide the list up.  We are looking at `Ask HN` and `Show HN`, so we will split the data into 3 lists: one for `Ask HN`, one for `Show HN`, and one for all the rest of the posts.  Posts that have no comments are skipped and do not go into any of the lists.

In [2]:
'''
Separating the List into 3 Lists
'''
#Create 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

#Loop through the data set
for row in h_news:
    #Grab the title
    title = row[1]
    
    #Only if the post has comments
    if int(row[4]) != 0:
        #If the title starts with "ask hn" (any casing) put in the ask list
        if title.lower().startswith('ask hn'):
            ask_posts.append(row)
        #Else if the title starts with "show hn" (any casing) put in the show list
        elif title.lower().startswith('show hn'):
            show_posts.append(row)
        #Otherwise put the post in the other list
        else:
            other_posts.append(row)
        
#Look at the lengths of each list
print("Ask HN posts: ", len(ask_posts))
print("Show HN posts: ", len(show_posts))
print("Other Posts: ", len(other_posts))

Ask HN posts:  6911
Show HN posts:  5059
Other Posts:  68431
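
The prefix test used to sort the posts can also be sketched on its own.  The snippet below is a minimal illustration, not part of the original analysis: `classify_title` is a hypothetical helper name and the sample titles are made up.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles, for illustration only
sample_titles = ['Ask HN: How do you learn?', 'SHOW HN: My project', 'algorithmic music']
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing the title before the comparison means that differently cased titles such as `ASK HN` land in the same list.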


#### Comparing Ask vs. Show
Now that we have all the `Ask HN` posts and all the `Show HN` posts in separate lists, we can compare them.  To do this we will add up the number of comments in each list and divide by the number of posts in that list.  That gives us the average number of comments for each type of post.

In [3]:
'''
Calculating the average Ask HN
'''
#Create a total variable
total_ask_comments = 0

#Loop through the ask list
for row in ask_posts:
    #Grab the number of comments and convert it into an integer
    num_comments = row[4]
    num_comments = int(num_comments)
    
    #Add to the total counter
    total_ask_comments += num_comments
    
#Calculate the average for the comments and print it out
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average Ask HN Comments: {:.2f}".format(avg_ask_comments))

'''
Calculating the average Show HN
'''
#Create a total variable
total_show_comments = 0

#Loop through the show list
for row in show_posts:
    #Grab the number of comments and convert it into an integer
    num_comments = row[4]
    num_comments = int(num_comments)
    
    #Add to the total counter
    total_show_comments += num_comments
    
#Calculate the average for the comments and print it out
avg_show_comments = total_show_comments / len(show_posts)
print("Average Show HN Comments: {:.2f}".format(avg_show_comments))

Average Ask HN Comments: 13.74
Average Show HN Comments: 9.81


Now we have the average number of comments for `Ask HN` and `Show HN`.  One thing that is immediately obvious is that `Ask HN` posts average about 4 comments more than `Show HN` posts (13.74 versus 9.81).  
One possible reason is that answering questions can be easier than giving compliments.  When answering a question you can draw on experience, and you can even piggyback off someone else's answer to clarify or add something.  But when replying to someone showing off something they did, there are only so many ways to respond, and you can't really piggyback off other comments.
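
The two averaging loops above follow the same pattern, so they could be folded into one function.  This is only a sketch: `average_comments` is a hypothetical helper name, the sample rows are made up, and the comment count is assumed to be at index 4 as in the data set's header.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored as strings at comments_index."""
    if not posts:
        # Avoid dividing by zero when the list is empty
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows laid out like the data set: id, title, url, points, comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: first example', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: second example', '', '5', '4', 'user_b', '9/26/2016 4:10'],
]
sample_average = average_comments(sample_posts)
print("Average comments: {:.2f}".format(sample_average))
```

With the real lists, the same function would be called once on `ask_posts` and once on `show_posts`.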

#### Comparing time of posts
Now we will look at whether the time a post is created affects the number of comments it gets.  Because we know that `Ask HN` posts get more comments on average, we will use that list for our time comparison.  We will do the comparison in 2 steps:
1. We will group `Ask HN` posts by each hour of the day that they are created.
2. We will calculate the average number of comments each group receives.
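
The two steps above can be sketched on a handful of made-up rows.  The names `sample_rows`, `counts`, `totals`, and `averages` are illustrative only, and `defaultdict` is used here as one way to skip the check for whether a key already exists.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the data set's date format
sample_rows = [
    ["9/26/2016 3:26", 4],
    ["9/25/2016 3:05", 6],
    ["9/26/2016 15:10", 20],
]

counts = defaultdict(int)  # number of posts per hour
totals = defaultdict(int)  # total comments per hour

# Step 1: group the posts by the hour they were created
for created_at, num_comments in sample_rows:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")
    counts[hour] += 1
    totals[hour] += num_comments

# Step 2: average the comments within each hour
averages = {hour: totals[hour] / counts[hour] for hour in counts}
print(averages)
```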

In [4]:
'''
Grabbing the time and comment numbers
'''
import datetime as dt

#Create an empty time comment list
time_comment_list = []

#Loop through the ask posts list
for row in ask_posts:
    #grab the created time and the number of comments
    created_time = row[6]
    num_comments = row[4]
    num_comments = int(num_comments)
    
    #Append the time and comments to the time comment list
    time_comment_list.append([created_time, num_comments])
    
'''
Calculating the number of comments per hour
'''
#Create 2 empty dictionaries
posts_by_hour = {}
comments_by_hour = {}

#Loop through the time comment list
for row in time_comment_list:
    #Grab the time of each post
    hour = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    #Grab only the hour for each post
    hour = hour.strftime("%H")
    
    #If the hour is a key in the dictionaries
    if hour in posts_by_hour:
        #Then increment the posts counter by 1 and add up the comments
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    #Else set the hour as a key in the dictionaries
    else:
        #Set the counter to 1 and set the number of comments
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]

'''
Calculating the average comments per hour
'''
#Create an empty average by hour list
average_by_hour = []

#Loop through the dictionaries
for hour in posts_by_hour:
    #Grab the average comments per hour
    average = comments_by_hour[hour] / posts_by_hour[hour]
    
    #Append the average and hour into the average by hour list
    average_by_hour.append([average, hour])

'''
Printing the average comments per hour
'''
#Sort the average by hour list
average_by_hour.sort(reverse = True)
#Loop through the average by hour list
for i in average_by_hour:
    #Create a string for formatting
    print_string = "{1}:00 has an Average Comment of {0:>5.2f}"
    #print with formatting
    print(print_string.format(i[0], i[1]))

15:00 has an Average Comment of 39.67
13:00 has an Average Comment of 22.22
12:00 has an Average Comment of 15.45
10:00 has an Average Comment of 13.76
17:00 has an Average Comment of 13.73
02:00 has an Average Comment of 13.20
14:00 has an Average Comment of 13.15
04:00 has an Average Comment of 12.69
08:00 has an Average Comment of 12.43
22:00 has an Average Comment of 11.75
20:00 has an Average Comment of 11.38
11:00 has an Average Comment of 11.14
05:00 has an Average Comment of 11.14
21:00 has an Average Comment of 11.06
18:00 has an Average Comment of 10.79
16:00 has an Average Comment of 10.76
03:00 has an Average Comment of 10.16
07:00 has an Average Comment of 10.10
00:00 has an Average Comment of  9.86
19:00 has an Average Comment of  9.41
01:00 has an Average Comment of  9.37
06:00 has an Average Comment of  9.02
09:00 has an Average Comment of  8.39
23:00 has an Average Comment of  8.32


Now we have a list of the average comments per hour, sorted by the average so that it is easier to see the best times to post.  One thing to keep in mind is that the times in the data set are in Eastern Time, while we are in Central Time, which is 1 hour behind.  This means that the top time of 3:00PM in the data set is actually 2:00PM for us.  That leaves the best times for us to post as 2:00PM, 12:00PM, 11:00AM, 9:00AM, and 4:00PM.  Four of these top 5 times are within 3 hours of noon.
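
The conversion of the top hours can be written out as a small calculation.  This is a rough sketch that assumes a fixed one-hour difference between Eastern and Central Time; `shift_hour` is a hypothetical helper, and the hours are copied from the top five rows of the output above.

```python
def shift_hour(hour, offset_hours=-1):
    """Shift a two-digit hour string such as '15' by offset_hours, wrapping around midnight."""
    return "{:02d}".format((int(hour) + offset_hours) % 24)

# Top five hours in the data set (Eastern Time), taken from the output above
eastern_top_hours = ["15", "13", "12", "10", "17"]

# Assumption: Central Time is always exactly one hour behind Eastern Time
central_top_hours = [shift_hour(hour) for hour in eastern_top_hours]
print(central_top_hours)
```

The modulo keeps the result in the 00 to 23 range, so an Eastern hour of `00` maps to `23` on the previous day.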

## Conclusion
In this project, we took a data set from Hacker News, a forum website that is popular in technology and startup circles.  Our first goal was to compare Show HN posts, which show off something someone has done, with Ask HN posts, which ask the HN community a question.  We compared them by the average number of comments each type of post gets.  Our second goal was to see which hour is best to post in order to get the most comments.

The comparison between Ask HN posts and Show HN posts showed us that Ask HN posts get about 4 more comments on average than Show HN posts.  One possible reason is that multiple people can answer a question and add to a previous answer, while comments on someone's achievement tend to be similar to one another.

For the best hour to post, we focused on Ask HN posts to see which hour had the highest average number of comments per post.  One thing to note about the data set is that the times are in Eastern Time.  In Texas we use Central Time, which is 1 hour behind the times in the data set.  Based on the data, 2:00PM is the best time for us to post, with 12:00PM and 11:00AM being the next best.  This leaves a 4 hour window, from 11:00AM to 2:00PM, for us to post and on average get a good number of comments.