# Exploring Hacker News Posts

The goal of this project is to compare posts from [Hacker News](https://news.ycombinator.com/), a popular site where users vote and comment on technology-related stories. We will focus on two types of posts: those whose titles start with 'Ask HN' and those whose titles start with 'Show HN'.

'Ask HN' posts are submitted by users to ask the Hacker News community a question, for example 'How to improve my business website?'. In 'Show HN' posts, users show the Hacker News community a product, project, or something else that could be interesting, for instance 'Shanhu.io, a programming playground powered by e8vm'.

Comparing the two types of posts mentioned above will enable us to answer the following questions:
- Which type of post receives more comments on average?
- Is the number of comments related to the time at which the post is created?

It should be noted that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
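The reduction described above can be sketched roughly as follows. This is only an illustration of "filter, then randomly sample": the function name `sample_commented_posts`, the seed, and the toy rows are assumptions, not the procedure actually used to build the data set.

```python
import random

def sample_commented_posts(rows, sample_size, seed=0):
    """Keep rows with at least one comment, then draw a random sample."""
    # Column 4 is assumed to hold num_comments, as in the header shown later.
    with_comments = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed (an assumption) so the sketch is repeatable
    return rng.sample(with_comments, min(sample_size, len(with_comments)))

# Made-up rows: id, title, url, num_points, num_comments, author, created_at
toy_rows = [
    ['1', 'A', '', '5', '0', 'u1', '1/1/2016 10:00'],
    ['2', 'B', '', '3', '4', 'u2', '1/1/2016 11:00'],
    ['3', 'C', '', '9', '2', 'u3', '1/1/2016 12:00'],
    ['4', 'D', '', '1', '7', 'u4', '1/1/2016 13:00'],
]
sampled = sample_commented_posts(toy_rows, sample_size=2)
print(len(sampled))
```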

## Introduction

Our first action involves several steps. We will open the file containing our data set, read it, and convert it to a list of lists. Afterwards, we will separate the header row from the rest of the data.

In [1]:
from csv import reader 

### Opening the Hacker News data set ###
opened_file = open('C:\\Users\\malgo\\OneDrive\\Pulpit\\DataQuest\\hacker_news.csv', encoding='utf8')
read_file = reader(opened_file)
hn = list(read_file)
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:
# Remove the headers.
headers = hn[0]
hn = hn[1:]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Our data set has seven columns:
- 'id': an identifier for the post,
- 'title': the title of the post,
- 'url': the URL the post links to, if any,
- 'num_points': the number of points the post received,
- 'num_comments': the number of comments the post received,
- 'author': the username of the user who submitted the post,
- 'created_at': the date and time of the post's creation.
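The loading and header-splitting steps above could also be written with a `with` block, so the file is closed automatically. The sketch below uses an in-memory string in place of `hacker_news.csv`; the sample row and the names `sample_csv`, `sample_headers`, and `sample_rows` are made up for illustration.

```python
import io
from csv import reader

# Made-up two-line CSV standing in for hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example?,,2,6,someone,8/16/2016 9:55\n"
)

# With a real file this would be: with open(path, encoding='utf8') as f:
with sample_csv as f:
    rows = list(reader(f))

# Split the header row from the data rows in one step.
sample_headers, sample_rows = rows[0], rows[1:]
print(sample_headers)
print(sample_rows)
```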

## Identifying Posts

The next step is to filter our data for posts whose titles begin with Ask HN or Show HN. We will create a separate list of lists for each of these two groups, plus one for all remaining posts.

In [12]:
# Identify posts that start with either `Ask HN` or `Show HN` and separate the data into different lists.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of posts starting with Ask HN:', len(ask_posts))    
print('Number of posts starting with Show HN:', len(show_posts))    
print('Number of other posts:', len(other_posts))    

Number of posts starting with Ask HN: 1744
Number of posts starting with Show HN: 1162
Number of other posts: 17194


We end up with 1,744 posts starting with Ask HN and 1,162 posts starting with Show HN.
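The title check used above can be isolated in a small helper, which makes the case-insensitive matching easy to test on its own. The function name `post_type` is hypothetical; the titles are taken from rows shown in this notebook, with one deliberately upper-cased to show that case does not matter.

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # compare in lower case so 'ASK HN' also matches
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(post_type('Ask HN: How to improve my personal website?'))
print(post_type('SHOW HN: Something pointless I made'))
print(post_type('Interactive Dynamic Video'))
```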

In [9]:
ask_posts[:2]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43']]

In [8]:
show_posts[:2]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46']]

## Calculating the Average Number of Comments

Now that we have separated Ask HN and Show HN posts, we can determine which type of post receives more comments on average.

In [15]:
# Calculating the average number of comments received by 'Ask HN' posts.
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of Ask HN comments is:', round(avg_ask_comments,2))

Average number of Ask HN comments is: 14.04


In [16]:
# Calculating the average number of comments received by 'Show HN' posts.
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print('Average number of Show HN comments is:', round(avg_show_comments,2))

Average number of Show HN comments is: 10.32


Based on the calculations above, Ask HN posts received more comments on average (approximately 14) than Show HN posts (approximately 10). Because Ask HN posts tend to attract more comments, we will focus the next part of the analysis on them.
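Since the two averaging cells above differ only in the list they loop over, the calculation could be wrapped in one helper. This is a minimal sketch; `average_column` and the shortened toy rows are assumptions for illustration, and the helper would raise `ZeroDivisionError` on an empty list.

```python
def average_column(rows, index):
    """Average the integer values stored as strings in one column."""
    total = sum(int(row[index]) for row in rows)
    return total / len(rows)

# Shortened made-up rows; index 4 plays the role of num_comments.
toy_ask = [['', '', '', '2', '6'], ['', '', '', '28', '29']]
print(round(average_column(toy_ask, 4), 2))  # (6 + 29) / 2
```

With the real data, the same call would be `average_column(ask_posts, 4)` and `average_column(show_posts, 4)`.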

## Post Creation Time and Number of Comments

Having determined that Ask HN posts receive more comments on average, we would like to find out whether the time at which a post is submitted matters. In other words: do posts created at a certain hour attract more comments? To answer this question we will use the `datetime` module.

In [25]:
# Calculate the number of Ask HN posts and comments by hour created
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append(
        [row[-1], int(row[-3])]
    )

count_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in count_by_hour:
        comments_by_hour[time] += comment
        count_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        count_by_hour[time] = 1
        
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
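The per-hour bookkeeping above can be shortened with `collections.defaultdict`, which removes the need for the `if`/`else` branch. The sketch below is one possible variant, not the notebook's own code; `totals_by_hour` and the toy rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

def totals_by_hour(rows, value_index):
    """Return (posts per hour, summed values per hour) keyed by 'HH' strings."""
    counts = defaultdict(int)  # missing keys start at 0
    totals = defaultdict(int)
    for row in rows:
        # The last column is assumed to hold created_at.
        hour = dt.datetime.strptime(row[-1], DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        totals[hour] += int(row[value_index])
    return dict(counts), dict(totals)

# Made-up rows in the same column order as the data set.
toy_posts = [
    ['1', 'Ask HN: a', '', '2', '6', 'u1', '8/16/2016 9:55'],
    ['2', 'Ask HN: b', '', '28', '29', 'u2', '11/22/2015 13:43'],
    ['3', 'Ask HN: c', '', '1', '4', 'u3', '1/2/2016 9:10'],
]
toy_counts, toy_comments = totals_by_hour(toy_posts, value_index=4)
print(toy_counts)
print(toy_comments)
```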

## Calculating the Average Number of Comments for Ask HN Posts by Hour

We now know the total number of comments for each hour. However, to draw any conclusions, we need the average number of comments per Ask HN post for each hour. We will then sort the results.

In [29]:
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / count_by_hour[hr]])

avg_by_hour    

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In [35]:
# Swapping the values (in order to sort them by the average number of comments)
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [37]:
# Sorting the values
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")

for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The hour with the most comments per post is 15:00 (approximately 39 comments on average). There is a big gap between it and the second-ranked hour, 02:00 (approximately 24 comments on average).
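The swapping step above exists only so that `sorted` compares the averages first. An alternative sketch is to pass a `key` function to `sorted`, which ranks the `[hour, average]` pairs directly. The list `avg_sample` below holds a few values rounded from the output above and is only for illustration.

```python
# A few [hour, average] pairs, rounded from the notebook's output.
avg_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the second element (the average), highest first, without swapping.
ranked = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hr, avg in ranked[:3]:
    print("{}:00: {:.2f} average comments per post".format(hr, avg))
```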

According to the data set [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the timezone is Eastern Time in the U.S., so 15:00 can also be written as 3:00 pm EST.
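To express an hour in 12-hour form or in another timezone, something like the following could be used. This is a rough sketch under a loud assumption: it treats Eastern Time as a fixed UTC-5 offset and ignores daylight saving time. The helper name `describe_hour` and the date are arbitrary placeholders.

```python
import datetime as dt

# Assumption: Eastern Time modeled as a fixed UTC-5 offset (no daylight saving).
EST = dt.timezone(dt.timedelta(hours=-5), name="EST")

def describe_hour(hour_str):
    """Return the hour as a 12-hour string and as the matching UTC time."""
    # The date is an arbitrary placeholder; only the hour matters here.
    local = dt.datetime(2016, 1, 15, int(hour_str), 0, tzinfo=EST)
    twelve_hour = local.strftime("%I:%M %p")
    utc_time = local.astimezone(dt.timezone.utc).strftime("%H:%M")
    return twelve_hour, utc_time

print(describe_hour('15'))
```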

## Points on Average

In this part of our project, we would like to determine which type of post receives more points on average.

In [42]:
# Calculating average points for Show HN posts
sum_show = 0
for row in show_posts:
    points = int(row[3])
    sum_show += points
    
avg_show = sum_show / len(show_posts)
round(avg_show,2)

27.56

In [46]:
# Calculating average points for Ask HN posts
sum_ask = 0
for row in ask_posts:
    points = int(row[3])
    sum_ask+= points
    
avg_ask = sum_ask / len(ask_posts)
round(avg_ask,2)

15.06

Show HN posts received about 28 points on average, considerably more than Ask HN posts, which received only about 15 points on average.

## Points and Time

Next, we ask whether posts submitted at a certain time are more likely to receive more points. This time we will focus on Show HN posts, because they receive more points on average.

In [51]:
# Calculate the number of Show HN posts and points by hour created
import datetime as dt

final_list = []

for row in show_posts:
    final_list.append(
        [row[-1], int(row[3])]
    )

num_by_hour = {} # number of Show HN posts created in each hour
points_by_hour = {} # total points received in each hour
date_format = "%m/%d/%Y %H:%M"

for each_row in final_list:
    date = each_row[0]
    points = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in num_by_hour:
        points_by_hour[time] += points
        num_by_hour[time] += 1
    else:
        points_by_hour[time] = points
        num_by_hour[time] = 1
        
points_by_hour

{'14': 2187,
 '22': 1856,
 '18': 2215,
 '07': 494,
 '20': 1819,
 '05': 104,
 '16': 2634,
 '19': 1702,
 '15': 2228,
 '03': 679,
 '17': 2521,
 '06': 375,
 '02': 340,
 '13': 2438,
 '08': 519,
 '21': 866,
 '04': 386,
 '11': 1480,
 '12': 2543,
 '23': 1526,
 '09': 553,
 '01': 700,
 '10': 681,
 '00': 1173}

## Calculating the Average Number of Points for Show HN Posts by Hour

We now know the total number of points for each hour. However, to draw any conclusions, we need the average number of points per Show HN post for each hour. We will then sort the results.

In [55]:
avg_points_hour = []

for hr in points_by_hour:
    avg_points_hour.append([hr, points_by_hour[hr] / num_by_hour[hr]])

avg_points_hour

[['14', 25.430232558139537],
 ['22', 40.34782608695652],
 ['18', 36.31147540983606],
 ['07', 19.0],
 ['20', 30.316666666666666],
 ['05', 5.473684210526316],
 ['16', 28.322580645161292],
 ['19', 30.945454545454545],
 ['15', 28.564102564102566],
 ['03', 25.14814814814815],
 ['17', 27.107526881720432],
 ['06', 23.4375],
 ['02', 11.333333333333334],
 ['13', 24.626262626262626],
 ['08', 15.264705882352942],
 ['21', 18.425531914893618],
 ['04', 14.846153846153847],
 ['11', 33.63636363636363],
 ['12', 41.68852459016394],
 ['23', 42.388888888888886],
 ['09', 18.433333333333334],
 ['01', 25.0],
 ['10', 18.916666666666668],
 ['00', 37.83870967741935]]

In [56]:
# Swapping the values (in order to sort them by the average number of points)
swap_avg_points_hour = []

for row in avg_points_hour:
    swap_avg_points_hour.append([row[1], row[0]])
    
print(swap_avg_points_hour)

[[25.430232558139537, '14'], [40.34782608695652, '22'], [36.31147540983606, '18'], [19.0, '07'], [30.316666666666666, '20'], [5.473684210526316, '05'], [28.322580645161292, '16'], [30.945454545454545, '19'], [28.564102564102566, '15'], [25.14814814814815, '03'], [27.107526881720432, '17'], [23.4375, '06'], [11.333333333333334, '02'], [24.626262626262626, '13'], [15.264705882352942, '08'], [18.425531914893618, '21'], [14.846153846153847, '04'], [33.63636363636363, '11'], [41.68852459016394, '12'], [42.388888888888886, '23'], [18.433333333333334, '09'], [25.0, '01'], [18.916666666666668, '10'], [37.83870967741935, '00']]


In [57]:
# Sorting the values
sort_swap = sorted(swap_avg_points_hour, reverse=True)
print("Top 5 Hours for Show Posts Points")

for avg, hr in sort_swap[:5]:
    print(
        "{}: {:.2f} average points per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )

Top 5 Hours for Show Posts Points
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post


Based on these results, late evening appears to be the best time to create a Show HN post: 22:00, 23:00, and 00:00 receive roughly 38 to 42 points on average. The one exception outside that window is 12:00, which also receives a high number of points (about 42 on average).
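The comments-by-hour and points-by-hour analyses above follow the same steps, so they could share one function that takes the column index as a parameter. The sketch below is a possible refactoring, not the code that produced the results above; `top_hours` and the third toy row are made up.

```python
import datetime as dt

def top_hours(rows, value_index, n=5, date_format="%m/%d/%Y %H:%M"):
    """Return the n (hour, average) pairs with the highest average value."""
    totals, counts = {}, {}
    for row in rows:
        # The last column is assumed to hold created_at.
        hour = dt.datetime.strptime(row[-1], date_format).hour
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    averages = [(hour, totals[hour] / counts[hour]) for hour in totals]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:n]

# Two rows borrowed from the Show HN sample above plus one made-up row.
toy_show = [
    ['1', 'Show HN: a', '', '26', '22', 'u1', '11/25/2015 14:03'],
    ['2', 'Show HN: b', '', '747', '102', 'u2', '11/29/2015 22:46'],
    ['3', 'Show HN: c', '', '10', '3', 'u3', '12/1/2015 14:40'],
]
for hour, avg in top_hours(toy_show, value_index=3):
    print("{:02d}:00: {:.2f} average points per post".format(hour, avg))
```

Calling `top_hours(ask_posts, 4)` would rank hours by comments, and `top_hours(show_posts, 3)` by points.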

## Conclusion

To conclude, we analyzed Ask HN and Show HN posts to find out which type of post and which posting time attracted the most comments on average. We found that, to maximize the number of comments a post receives, it should be an Ask HN post created between 15:00 and 16:00 (3:00 pm - 4:00 pm EST).
Because the data set excludes posts without any comments, it is more accurate to say that, of the posts that received comments, Ask HN posts received more comments on average, and Ask HN posts created between 15:00 and 16:00 (3:00 pm - 4:00 pm EST) received the most comments on average.