# Exploring Hacker News Posts: Ask or Show Hacker News?

### Introduction
This project asks which category of Hacker News post receives more comments: Ask Hacker News (Ask HN) or Show Hacker News (Show HN). Hacker News is a popular site where technology-related stories (or 'posts') are voted and commented upon. The two types of posts we'll explore have titles that begin with either Ask HN or Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll specifically compare these two types of posts to determine the following:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

It should be noted that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

The data for this project and the column descriptions can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).



In [2]:
# Read the file in as a list of lists
from csv import reader

# The with block closes the file once it has been read
with open("hacker_news.csv") as opened_file:
    all_rows = list(reader(opened_file))

# The first row is the header; the remaining rows are the posts
hacker_news_header = all_rows[0]
hacker_news = all_rows[1:]

# Take a quick look at the data set
print(hacker_news_header)
print(hacker_news[:6])





['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Ti
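
As a minimal sketch of the same header/rows split, the snippet below reads a small in-memory CSV instead of `hacker_news.csv`. The two sample rows are copied from the output above; `sample_csv`, `header`, and `rows` are illustrative names rather than part of the notebook.

```python
import io
from csv import reader

# Two rows copied from the output above, preceded by the header line.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

csv_rows = reader(sample_csv)
header = next(csv_rows)  # the first line holds the column names
rows = list(csv_rows)    # every remaining line is a post

print(header)
print(len(rows), "rows; num_comments is still a string:", repr(rows[0][4]))
```

Every value comes back as a string, which is why the later cells convert `num_points` and `num_comments` with `int` before doing arithmetic.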

# Extracting the Data Needed for Our Project

* We extract the entries for Ask HN posts
* We extract the entries for Show HN posts

In [3]:
# To extract these posts, we use the lower and startswith methods of the string class. Titles may
# mix upper and lower case (for example "Ask HN" or "ask hn"), so for uniformity we convert each
# title to lower case before applying the startswith method.

ask_posts = []
show_posts = []
other_posts = []
for each_list in hacker_news:
    title = each_list[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(each_list)
    # elif keeps the categories exclusive, so ask posts are not also counted as other posts
    elif title.lower().startswith("show hn"):
        show_posts.append(each_list)
    else:
        other_posts.append(each_list)
print("Number of ask hn posts", len(ask_posts))
print("\n")
print("Number of show hn posts", len(show_posts))
print("\n")
print("Number of other posts", len(other_posts))



Number of ask hn posts 1744


Number of show hn posts 1162


Number of other posts 17194
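
The branching logic can be checked on a few hand-written titles. This is only a sketch: `classify_title` is a hypothetical helper, and the sample titles are either taken from the data set (one with its case changed on purpose) or made up for illustration. Using `elif` makes the three categories mutually exclusive.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    elif lowered.startswith("show hn"):
        return "show"
    return "other"

sample_titles = [
    "Ask HN: How to improve my personal website?",
    "SHOW HN: Something pointless I made",  # upper case still matches
    "Interactive Dynamic Video",
    "Why I ask HN for advice",              # made up: 'ask hn' is not at the start
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```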


In [4]:
# print the first 2 rows of ask_posts, show_posts to have a feel of the data
print(ask_posts[:2])
print("\n")
print(show_posts[:2])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']]


# We Analyse our Extracted Data

* Here, we check which category of posts receives more comments on average

In [5]:
# We total the comments for ask HN posts and show HN posts, then calculate the average number
# of comments per post for each category
total_ask_comments = 0
total_show_comments = 0
for each_list in ask_posts:
    num_comments = each_list[4]
    num_comments = int(num_comments)
    total_ask_comments = total_ask_comments + num_comments
for each_list in show_posts:
    num_comments = each_list[4]
    num_comments = int(num_comments)
    total_show_comments = total_show_comments + num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)
print("Average Comments for Ask Posts:", avg_ask_comments)
print("Average Comments for Show Posts:", avg_show_comments)
    

Average Comments for Ask Posts: 14.038417431192661
Average Comments for Show Posts: 10.31669535283993


From the above, we can see that Ask HN posts receive more comments on average (about 14.04) than Show HN posts (about 10.32).
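
Because the two loops differ only in the list they walk over, the calculation could also be wrapped in a small helper. The sketch below assumes rows shaped like those in the data set, with the comment count at index 4; `average_column` is a hypothetical helper, and the two sample rows (taken from the ask posts printed earlier) are only there to make the snippet runnable.

```python
def average_column(rows, index):
    """Average the integer values stored as strings at `index` in each row."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty category
    return sum(int(row[index]) for row in rows) / len(rows)

# Rows in the notebook's layout; index 3 is num_points, index 4 is num_comments.
sample_posts = [
    ["12296411", "Ask HN: How to improve my personal website?", "",
     "2", "6", "ahmedbaracat", "8/16/2016 9:55"],
    ["10610020", "Ask HN: Am I the only one outraged by Twitter shutting down share counts?", "",
     "28", "29", "tkfx", "11/22/2015 13:43"],
]
print(average_column(sample_posts, 4))  # (6 + 29) / 2
```

The same helper would work for points by passing index 3 instead of 4.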

Could the time at which ask posts are created be a contributing factor to the number of comments they receive? We'll determine this by:

* Calculating the number of ask posts created in each hour of the day and the number of comments those posts received.

* Calculating the average number of comments ask posts receive for each hour.
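
Both steps depend on turning a `created_at` string into an hour. A minimal sketch of that conversion, using two timestamps from the data set and the format string `%m/%d/%Y %H:%M` (the variable names are illustrative):

```python
import datetime as dt

timestamps = ["8/16/2016 9:55", "11/22/2015 13:43"]

hours = []
for stamp in timestamps:
    parsed = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M")
    hours.append(parsed.strftime("%H"))  # zero-padded hour as a string

print(hours)
```

Keeping the hour as a zero-padded string gives 24 possible dictionary keys, from "00" to "23".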

In [6]:
# We need the datetime module to parse the creation times
import datetime as dt
# We create an empty list to store the time each ask post was created and the number of comments it received.
result_list = []
for each_list in ask_posts:
    created_at = each_list[6]
    n_comment = each_list[4]
    # The comment count stays a string here; it is converted to int when the totals are computed
    result_list.append([created_at, n_comment])
    
# Print the first few entries to check the result
print(result_list[:10])


[['8/16/2016 9:55', '6'], ['11/22/2015 13:43', '29'], ['5/2/2016 10:14', '1'], ['8/2/2016 14:20', '3'], ['10/15/2015 16:38', '17'], ['9/26/2015 23:23', '1'], ['4/22/2016 12:24', '4'], ['11/16/2015 9:22', '1'], ['2/24/2016 17:57', '1'], ['6/4/2016 17:17', '2']]


In [7]:

# Count the ask posts and total their comments for each hour of the day
num_ask_posts_per_time = {}
num_comments_per_time = {}
for each_list in result_list:
    date_time = each_list[0]
    comment = int(each_list[1])
    date_time = dt.datetime.strptime(date_time, "%m/%d/%Y %H:%M")
    date_time = date_time.strftime("%H")
    if date_time in num_ask_posts_per_time:
        num_ask_posts_per_time[date_time] += 1
        num_comments_per_time[date_time] += comment   
    else:
        num_ask_posts_per_time[date_time] = 1
        num_comments_per_time[date_time] = comment

# observe your dictionaries
print("Hours:Posts:", num_ask_posts_per_time)
print("\n")
print("Hours:Comments:", num_comments_per_time)




Hours:Posts: {'01': 60, '22': 71, '00': 55, '10': 59, '02': 58, '14': 107, '19': 110, '12': 73, '21': 109, '15': 116, '07': 34, '09': 45, '06': 44, '04': 47, '23': 68, '17': 100, '05': 46, '13': 85, '16': 108, '03': 54, '11': 58, '20': 80, '08': 48, '18': 109}


Hours:Comments: {'01': 683, '22': 479, '00': 447, '10': 793, '02': 1381, '14': 1416, '19': 1188, '12': 687, '21': 1745, '15': 4477, '07': 267, '09': 251, '06': 397, '04': 337, '23': 543, '17': 1146, '05': 464, '13': 1253, '16': 1814, '03': 421, '11': 641, '20': 1722, '08': 492, '18': 1439}


In [8]:
# We proceed to calculate the average number of comments ask posts receive for each hour
avg_comments_per_hour = []
for each_hour in num_comments_per_time:
    total_comments = num_comments_per_time[each_hour]
    total_posts = num_ask_posts_per_time[each_hour]
    avg_comments_per_hour.append([each_hour, total_comments / total_posts])
avg_comments_per_hour

[['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['00', 8.127272727272727],
 ['10', 13.440677966101696],
 ['02', 23.810344827586206],
 ['14', 13.233644859813085],
 ['19', 10.8],
 ['12', 9.41095890410959],
 ['21', 16.009174311926607],
 ['15', 38.5948275862069],
 ['07', 7.852941176470588],
 ['09', 5.5777777777777775],
 ['06', 9.022727272727273],
 ['04', 7.170212765957447],
 ['23', 7.985294117647059],
 ['17', 11.46],
 ['05', 10.08695652173913],
 ['13', 14.741176470588234],
 ['16', 16.796296296296298],
 ['03', 7.796296296296297],
 ['11', 11.051724137931034],
 ['20', 21.525],
 ['08', 10.25],
 ['18', 13.20183486238532]]

The result above shows the average number of comments per ask HN post for each hour. Next, we sort and format it for easier reading.

In [64]:
avg_comments_hour = []
for each_list in avg_comments_per_hour:
    avg_comments_hour.append([each_list[1], each_list[0]])
avg_comments = sorted(avg_comments_hour, reverse = True)
print("Top 10 hours with the highest average comments per ask HN post:")
avg_comments[:10]

Top 10 hours with the highest average comments per ask HN post:


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17']]
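
Swapping the columns is one way to sort by the average. An alternative sketch, not used in this notebook, leaves each pair as `[hour, average]` and passes a `key` function to `sorted`. The three pairs below are taken from the results above and rounded; the variable names are illustrative.

```python
hourly_averages = [["01", 11.38], ["15", 38.59], ["02", 23.81]]

# Sort on the second element (the average) from highest to lowest.
ranked = sorted(hourly_averages, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked:
    print(f"{hour}:00 -> {average:.2f}")
```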

In [10]:
# We format the top 5 hours for the final report
print("Top 5 hours that have the highest comments for ask HN post:")
final_report = avg_comments[:5]
for each_list in final_report:
    average = each_list[0]
    hour_label = dt.datetime.strptime(each_list[1], "%H").strftime("%H:%M")
    template = "{time}: {figure:.2f} average comments per post."
    output = template.format(time = hour_label, figure = average)
    print(output)

Top 5 hours that have the highest comments for ask HN post:
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post. That is about 62% more than the hour with the second-highest average (02:00, with 23.81 comments per post).
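
The size of that gap follows directly from the two averages in the report above. As a quick check (the variable names are illustrative):

```python
highest = 38.59         # 15:00 average, from the report above
second_highest = 23.81  # 02:00 average, from the report above

# Relative increase of the highest hour over the second highest, in percent.
relative_increase = (highest - second_highest) / second_highest * 100
print(f"{relative_increase:.1f}% more comments per post")
```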

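According to the [data set documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the timezone used is Eastern Time in the US. So 15:00 can also be written as 3:00 pm Eastern Time, which is 8:00 pm West Africa Time (WAT) while US daylight saving time is in effect and 9:00 pm WAT otherwise.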
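
The West Africa Time equivalent depends on whether US daylight saving time is in effect. The sketch below uses fixed UTC offsets (UTC-4 for Eastern Daylight Time, UTC-5 for Eastern Standard Time, UTC+1 for WAT) instead of a time zone database; the date is arbitrary and `to_wat` is a hypothetical helper.

```python
import datetime as dt

EDT = dt.timezone(dt.timedelta(hours=-4))  # Eastern Daylight Time
EST = dt.timezone(dt.timedelta(hours=-5))  # Eastern Standard Time
WAT = dt.timezone(dt.timedelta(hours=1))   # West Africa Time (no daylight saving)

def to_wat(hour, eastern_offset):
    """Convert an Eastern clock hour on an arbitrary date to a WAT 'HH:MM' string."""
    eastern = dt.datetime(2016, 1, 1, hour, 0, tzinfo=eastern_offset)
    return eastern.astimezone(WAT).strftime("%H:%M")

print("15:00 EDT ->", to_wat(15, EDT), "WAT")
print("15:00 EST ->", to_wat(15, EST), "WAT")
```

The five-hour difference applies during daylight saving time and the six-hour difference during standard time.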

# Conclusion
In this project, we analyzed ask posts and show posts to determine which type of post and which time of day receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that the post be an ask post created between 15:00 and 16:00 Eastern Time (3:00 pm - 4:00 pm; 8:00 pm - 9:00 pm WAT during US daylight saving time).

However, it should be noted that the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that, of the posts that received comments, ask posts received more comments on average, and ask posts created between 15:00 and 16:00 Eastern Time (3:00 pm - 4:00 pm; 8:00 pm - 9:00 pm WAT during US daylight saving time) received the most comments on average.

# Further Investigation on the dataset

Here, we will:
* Determine if show or ask posts receive more points on average.
* Determine if posts created at a certain time are more likely to receive more points.
* Compare our results to the average number of comments and points other posts receive.

In [11]:
# We proceed to determine whether show or ask posts receive more points on average

total_ask_points = 0
total_show_points = 0
for each_list in ask_posts:
    num_points = each_list[3]
    num_points = int(num_points)
    total_ask_points = total_ask_points + num_points
for each_list in show_posts:
    num_points = each_list[3]
    num_points = int(num_points)
    total_show_points = total_show_points + num_points
avg_ask_points = total_ask_points / len(ask_posts)
avg_show_points = total_show_points / len(show_posts)
print("Average Points for Ask Posts:", avg_ask_points)
print("Average Points for Show Posts:", avg_show_points)
    

Average Points for Ask Posts: 15.061926605504587
Average Points for Show Posts: 27.555077452667813


In [35]:
# Collect the points and creation time of each show post
points_per_hour = []
for each_list in show_posts:
    points = each_list[3]
    created_at = each_list[-1]
    points_per_hour.append([points, created_at])

# Parse each creation time with strptime and keep only the hour with strftime

num_points_per_hour = {}
num_show_posts_per_hour = {}
for each_list in points_per_hour:
    points = int(each_list[0])
    date_time = dt.datetime.strptime(each_list[1], "%m/%d/%Y %H:%M")
    hour_key = date_time.strftime("%H")
    if hour_key in num_show_posts_per_hour:
        num_show_posts_per_hour[hour_key] += 1
        num_points_per_hour[hour_key] += points
    else:
        num_show_posts_per_hour[hour_key] = 1
        num_points_per_hour[hour_key] = points

# Average points per show post for each hour
average_points_per_hour = []
for each_hour in num_points_per_hour:
    total_points = num_points_per_hour[each_hour]
    total_posts = num_show_posts_per_hour[each_hour]
    average_points_per_hour.append([each_hour, total_points / total_posts])
average_points_per_hour


    


[['01', 25.0],
 ['22', 40.34782608695652],
 ['00', 37.83870967741935],
 ['10', 18.916666666666668],
 ['05', 5.473684210526316],
 ['14', 25.430232558139537],
 ['19', 30.945454545454545],
 ['12', 41.68852459016394],
 ['21', 18.425531914893618],
 ['11', 33.63636363636363],
 ['07', 19.0],
 ['09', 18.433333333333334],
 ['06', 23.4375],
 ['04', 14.846153846153847],
 ['17', 27.107526881720432],
 ['23', 42.388888888888886],
 ['02', 11.333333333333334],
 ['13', 24.626262626262626],
 ['16', 28.322580645161292],
 ['03', 25.14814814814815],
 ['15', 28.564102564102566],
 ['20', 30.316666666666666],
 ['08', 15.264705882352942],
 ['18', 36.31147540983606]]

In [43]:
avg_points_hour = []
for each_list in average_points_per_hour:
    avg_points_hour.append([each_list[1], each_list[0]])
avg_points = sorted(avg_points_hour, reverse = True)
print("Top 5 hours that have the highest points for show HN post:")
avg_points[:5]

Top 5 hours that have the highest points for show HN post:


[[42.388888888888886, '23'],
 [41.68852459016394, '12'],
 [40.34782608695652, '22'],
 [37.83870967741935, '00'],
 [36.31147540983606, '18']]

In [63]:
top_hours = avg_points[:20]
for each_list in top_hours:
    points = each_list[0]
    hour_label = dt.datetime.strptime(str(each_list[1]), "%H").strftime("%H:%M")
    template = "There are an average of {point:.2f} points per show HN post made at {time}"
    output = template.format(point = points, time = hour_label)
    print(output)
  

There are an average of 42.39 points per show HN post made at 23:00
There are an average of 41.69 points per show HN post made at 12:00
There are an average of 40.35 points per show HN post made at 22:00
There are an average of 37.84 points per show HN post made at 00:00
There are an average of 36.31 points per show HN post made at 18:00
There are an average of 33.64 points per show HN post made at 11:00
There are an average of 30.95 points per show HN post made at 19:00
There are an average of 30.32 points per show HN post made at 20:00
There are an average of 28.56 points per show HN post made at 15:00
There are an average of 28.32 points per show HN post made at 16:00
There are an average of 27.11 points per show HN post made at 17:00
There are an average of 25.43 points per show HN post made at 14:00
There are an average of 25.15 points per show HN post made at 03:00
There are an average of 25.00 points per show HN post made at 01:00
There are an average of 24.63 points per show HN

We further investigated ask posts and show posts to determine which type of post and which time of day receive the most points on average. Show HN posts receive more points on average (about 27.56) than Ask HN posts (about 15.06). Although Show HN posts submitted at 23:00 US Eastern Time (4:00 am West Africa Time during US daylight saving time) have the highest average points, the points a show post receives do not appear to depend strongly on when it was submitted: the five best hours (23:00, 12:00, 22:00, 00:00, and 18:00) are scattered across the day rather than clustered together. It seems that the kind of post submitted, rather than the time of posting, determines the points received. Further investigation is recommended.
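
One way to put numbers on how spread out the hourly averages are is to compare their range and standard deviation with their mean. The sketch below uses the 24 hourly averages for Show HN posts printed above, rounded to two decimals. It is a rough descriptive check under those assumptions, not a statistical test, and the variable names are illustrative.

```python
import statistics

# Average points per Show HN post for each hour, rounded from the output above.
avg_points_by_hour = {
    "00": 37.84, "01": 25.00, "02": 11.33, "03": 25.15, "04": 14.85, "05": 5.47,
    "06": 23.44, "07": 19.00, "08": 15.26, "09": 18.43, "10": 18.92, "11": 33.64,
    "12": 41.69, "13": 24.63, "14": 25.43, "15": 28.56, "16": 28.32, "17": 27.11,
    "18": 36.31, "19": 30.95, "20": 30.32, "21": 18.43, "22": 40.35, "23": 42.39,
}

values = list(avg_points_by_hour.values())
best_hour = max(avg_points_by_hour, key=avg_points_by_hour.get)
worst_hour = min(avg_points_by_hour, key=avg_points_by_hour.get)

# Note: this is the unweighted mean of the hourly averages, not the overall
# average points per post, because hours with more posts are not weighted more.
print("Highest hourly average:", best_hour, avg_points_by_hour[best_hour])
print("Lowest hourly average:", worst_hour, avg_points_by_hour[worst_hour])
print("Mean of hourly averages:", round(statistics.mean(values), 2))
print("Standard deviation (population):", round(statistics.pstdev(values), 2))
```

Weighting each hour by its number of posts, or looking at medians instead of means, would be natural next steps, since a few very popular posts can pull an hourly average up.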
