# Hacker News comment project

In this project we analyse a Hacker News data set to see which kind of post is most likely to receive comments from the community. 
Hacker News is a social news website focusing on technology and start-ups. Users submit stories (known as "posts"), and the community votes and comments on them, much as on Reddit. The site was started by the startup incubator Y Combinator and created by Paul Graham in February 2007.

The data set comes from [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts) and covers posts from September 26, 2015 to September 26, 2016. It has the following columns:

| Column name | Description|
|----|----|
|id|The unique identifier from Hacker News for the post|
|title| The title of the post|
|url| The URL that the post links to, if the post has a URL|
|num_points| The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes|
|num_comments| The number of comments that were made on the post|
|author| The username of the person who submitted the post|
|created_at| The date and time at which the post was submitted. The time zone is Eastern Time in the US|

We are interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question, and `Show HN` posts to show the community a project, a product or simply something interesting. 

We will analyse `Ask HN` and `Show HN` posts and try to determine which of these two kinds of post receives more comments on average. We are also interested in whether the time at which a post is created influences the number of comments it receives. 

**Importing data**

In [1]:
# read the csv file into python
from csv import reader 

# open the file in a context manager so it is closed after reading
with open("hacker_news.csv") as file_opened:
    file_read = reader(file_opened)
    list_of_list = list(file_read)

# final data set we will work with
hn = list_of_list[1:]
hn_header = list_of_list[0]

In [6]:
# show the first three rows
print(hn[:3])
print("\n")
# number of data points
print(len(hn))

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


20100


We are only interested in posts beginning with `Ask HN` or `Show HN`. There are 1744 `Ask HN` posts and 1162 `Show HN` posts.

In [13]:
# create empty lists
ask_posts = []
show_posts = []
other_posts = []

# iterate over the dataset and separate the rows into three different lists
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Posts starting with 'Ask HN':")        
print(len(ask_posts))
print("Posts starting with 'Show HN':")
print(len(show_posts))
print("All the other posts:")
print(len(other_posts))

Posts starting with 'Ask HN':
1744
Posts starting with 'Show HN':
1162
All the other posts:
17194


On average, `Ask HN` posts (14.04 comments) receive more comments than `Show HN` posts (10.32 comments). Because `Ask HN` posts receive more comments, we will analyse them further and investigate whether ask posts created at a certain time of day are more likely to attract comments.

In [18]:
total_ask_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

print("Average comments of the 'Ask HN' posts:")
print(avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments

avg_show_comments = total_show_comments / len(show_posts)

print("Average comments of the 'Show HN' posts:")
print(avg_show_comments)
    

Average comments of the 'Ask HN' posts:
14.038417431192661
Average comments of the 'Show HN' posts:
10.31669535283993


First we will count the ask posts created in each hour of the day, along with the number of comments they received. Then we will calculate the average number of comments per ask post for each hour of creation.

In [25]:
# import datetime 
import datetime as dt

# create empty list
result_list = []

# create a list of list with created_at and comments
for post in ask_posts:
    created_at = post[6]
    comments = int(post[4])
    # use a name other than `list` so the built-in is not shadowed
    pair = [created_at, comments]
    result_list.append(pair)

# create two empty dictionaries
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    # extract the hour from the date
    time = row[0]
    time = time.split(" ")
    hour = time[1]
    # create datetime object
    hour = dt.datetime.strptime(hour, "%H:%M")
    # select only the hour
    hour = dt.datetime.strftime(hour, "%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1 
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1] 
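
The hour can also be read from the full `created_at` string in a single parsing step, and `collections.defaultdict` removes the need for the `if`/`else` branch. The following is only a sketch of that alternative, not the code used in this notebook. It runs on the three rows printed earlier (which are not ask posts), and the names `sample`, `counts` and `comments` are chosen here for illustration.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs taken from the three rows printed earlier
sample = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("6/23/2016 22:20", 1),
]

# missing keys start at 0, so no membership check is needed
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, num_comments in sample:
    # parse date and time together, then format the hour as "00".."23"
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```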

`counts_by_hour` contains the number of ask posts created during each hour of the day.

In [31]:
counts_by_hour

{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}

`comments_by_hour` contains the total number of comments on the ask posts created during each hour.

In [33]:
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

We will use the two dictionaries above to calculate the average number of comments for posts created during each hour of the day.

In [37]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

In [38]:
avg_by_hour

[['00', 8.127272727272727],
 ['22', 6.746478873239437],
 ['01', 11.383333333333333],
 ['23', 7.985294117647059],
 ['03', 7.796296296296297],
 ['18', 13.20183486238532],
 ['09', 5.5777777777777775],
 ['16', 16.796296296296298],
 ['21', 16.009174311926607],
 ['19', 10.8],
 ['12', 9.41095890410959],
 ['08', 10.25],
 ['02', 23.810344827586206],
 ['06', 9.022727272727273],
 ['14', 13.233644859813085],
 ['15', 38.5948275862069],
 ['17', 11.46],
 ['13', 14.741176470588234],
 ['05', 10.08695652173913],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034],
 ['20', 21.525],
 ['10', 13.440677966101696],
 ['04', 7.170212765957447]]

In [40]:
# swap the columns 
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

[[8.127272727272727, '00'], [6.746478873239437, '22'], [11.383333333333333, '01'], [7.985294117647059, '23'], [7.796296296296297, '03'], [13.20183486238532, '18'], [5.5777777777777775, '09'], [16.796296296296298, '16'], [16.009174311926607, '21'], [10.8, '19'], [9.41095890410959, '12'], [10.25, '08'], [23.810344827586206, '02'], [9.022727272727273, '06'], [13.233644859813085, '14'], [38.5948275862069, '15'], [11.46, '17'], [14.741176470588234, '13'], [10.08695652173913, '05'], [7.852941176470588, '07'], [11.051724137931034, '11'], [21.525, '20'], [13.440677966101696, '10'], [7.170212765957447, '04']]


In [44]:
# sort the result in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
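
Swapping the columns is one way to sort by average. An alternative is to pass a key function to `sorted`, which leaves the `[hour, average]` pairs in their original layout. A minimal sketch on a handful of the pairs shown above; `avg_by_hour_sample` and `sorted_by_avg` are names introduced here for illustration.

```python
# a few [hour, average] pairs copied from avg_by_hour above
avg_by_hour_sample = [
    ["00", 8.127272727272727],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
    ["09", 5.5777777777777775],
]

# sort on the second element (the average), highest first
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in sorted_by_avg[:3]:
    print(f"{hour}:00: {average:.2f} average comments per post")
```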

In [53]:
print("Top 5 Hours for Ask Posts Comments (Eastern time US)")
for average, hour in sorted_swap[:5]:
    template = "{hour}: {avg:.2f} average comments per post"
    hours = dt.datetime.strptime(hour, "%H")
    hours = dt.datetime.strftime(hours, "%H:%M")
    print(template.format(hour = hours, avg = average)) 

Top 5 Hours for Ask Posts Comments (Eastern time US)
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The best time to publish an `Ask HN` post appears to be between **3 and 4 pm** Eastern Time (US), which is 8 pm to 9 pm London time, with an average of 38.59 comments per post. This may be a good time because people who have finished work can read and comment on their way home or at home. The next best hour is **2 am** Eastern Time (7 am London time), perhaps because people see the post first thing in the morning, followed by **8 to 9 pm** Eastern Time (1 am to 2 am London time). 
To confirm these findings we would need a data set that records when each comment was written. For now, this is only a broad analysis of the best time to publish a post. I would also expect the number of points (upvotes minus downvotes) to strongly influence how many comments a post receives. 
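
The London times quoted above come from adding five hours to Eastern Time. The sketch below shows that arithmetic under the assumption of a fixed five-hour offset. The real difference drops to four hours for a few weeks each year, because the US and the UK change their clocks on different dates. `eastern_to_london` is a hypothetical helper, not part of the notebook.

```python
def eastern_to_london(hour, offset=5):
    """Shift an Eastern Time hour (0-23) to London time, wrapping past midnight.

    Assumes a fixed offset; daylight saving transitions are ignored.
    """
    return (hour + offset) % 24


# the three hours highlighted in the analysis
for eastern_hour in (15, 2, 20):
    london_hour = eastern_to_london(eastern_hour)
    print(f"{eastern_hour:02d}:00 Eastern -> {london_hour:02d}:00 London")
```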

Further plan for analysis:
* Determine if show or ask posts receive more points on average.
* Determine if posts created at a certain time are more likely to receive more points.
* Compare these results to the average number of comments and points that other posts receive.
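
As a possible starting point for these steps, the averaging loop can be generalised to any numeric column, so the same function serves both `num_points` (index 3) and `num_comments` (index 4). This is only a sketch: `average_column` is a hypothetical helper, and it is exercised here on two of the rows printed at the start of the notebook rather than on the ask or show posts.

```python
def average_column(rows, index):
    """Return the mean of the integer values stored at `index` in each row."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(row[index]) for row in rows) / len(rows)


# two of the rows printed earlier (columns as in the table at the top)
sample_rows = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke",
     'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
     '2', '1', 'vezycash', '6/23/2016 22:20'],
]

NUM_POINTS, NUM_COMMENTS = 3, 4

print(average_column(sample_rows, NUM_POINTS))
print(average_column(sample_rows, NUM_COMMENTS))
```

In the notebook itself the function would be called with `ask_posts`, `show_posts` and `other_posts` in place of `sample_rows`.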