# Exploring Hacker News Posts

This project explores a data set of posts from [Hacker News](https://news.ycombinator.com) and analyzes post type and publication time to find out what gives a post the best chance of attracting comments.

The analysis focuses on two types of posts:
1. Posts whose titles start with "Ask HN", where users ask the Hacker News community a question.
2. Posts whose titles start with "Show HN", where users show the community a project, a product, or something similar.

The goal is to determine which type of post receives more comments on average, and which hour of publication is associated with the most comments per post.

# 1. Opening and exploring the data set
First, we read the data set into a list of rows and separate the header row from the data.

In [1]:
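from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file afterwards.
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))

# Separate the header row from the data rows.
headers = hn[0]
hn = hn[1:]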
print('\033[1m' + "Total number of posts before cleaning: " + '\033[0m', len(hn),"\n")
print(headers,"\n")
for post in hn[:5]:
    print(post,"\n")

Total number of posts before cleaning:  293119 

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'] 

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'] 

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'] 

['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'] 

['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94'
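
The code in this notebook addresses columns by position (for example, `post[4]` for `num_comments`). As an alternative, the standard library's `csv.DictReader` exposes each row by column name. The sketch below is only an illustration, not the approach used in the rest of this notebook; `sample_csv` is a hypothetical in-memory stand-in for `hacker_news.csv`, and its two rows are made up.

```python
import csv
import io

# Hypothetical two-row sample with the same columns as hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,First example,http://example.com/a,1,0,alice,9/26/2016 3:26\n"
    "2,Ask HN: Example question?,,4,7,bob,9/26/2016 2:53\n"
)

# DictReader uses the header row as keys, so each post becomes a dict.
posts = list(csv.DictReader(sample_csv))

for post in posts:
    # Values are still strings, so numeric columns need an explicit int().
    print(post["title"], int(post["num_comments"]))
```

Accessing `post["num_comments"]` instead of `post[4]` makes the later steps easier to read, at the cost of slightly more memory per row.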

# 2. Cleaning the data set
Many rows in the data set have zero comments, as the sample rows above illustrate. We exclude these posts from the analysis, so every average below is computed only over posts that received at least one comment.

In [2]:
new_hn=[]
for post in hn:
    if post[4] != '0':
        new_hn.append(post)
    
hn = new_hn

print('\033[1m' + "Total number of posts after cleaning: " + '\033[0m', len(hn),"\n")
for post in hn[:5]:
    print(post,"\n")

Total number of posts after cleaning:  80401 

['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13'] 

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'] 

['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'] 

['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54'] 

['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37'] 
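
The same filtering step can be sketched as a list comprehension. This version compares the comment count as an integer rather than as the string `'0'`, which is an assumption about how one might want to treat values such as `'00'`. The rows in `sample_rows` and the constant `NUM_COMMENTS` are hypothetical and only mirror the column order of the data set.

```python
# Hypothetical sample rows in the same column order as the data set:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ["1", "First example", "http://example.com/a", "1", "0", "alice", "9/26/2016 3:26"],
    ["2", "Ask HN: Example question?", "", "4", "7", "bob", "9/26/2016 2:53"],
    ["3", "Second example", "http://example.com/b", "2", "1", "carol", "9/26/2016 1:54"],
]

NUM_COMMENTS = 4  # index of the num_comments column

# Keep only posts with at least one comment.
commented = [row for row in sample_rows if int(row[NUM_COMMENTS]) > 0]

print(len(commented))  # 2
```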



# 3. Dividing posts into groups
We divide the posts in the data set into three groups:
1. Posts whose titles start with Ask HN
2. Posts whose titles start with Show HN
3. All other posts, whose titles start with neither Ask HN nor Show HN

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(post)
    elif title.startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print('\033[1m' + "Total number of posts in hacker news dataset: " + '\033[0m', len(hn),"\n")
print('\033[1m' + "Number of posts in ask_posts list: " + '\033[0m', len(ask_posts),"\n")
for post in ask_posts[:5]:
    print(post,"\n")

print('\033[1m' + "Number of posts in show_posts list: " + '\033[0m', len(show_posts),"\n")
for post in show_posts[:5]:
    print(post,"\n")
    
print('\033[1m' + "Number of posts in other_posts list: " + '\033[0m', len(other_posts),"\n")
for post in other_posts[:5]:
    print(post,"\n")

Total number of posts in hacker news dataset:  80401 

Number of posts in ask_posts list:  6911 

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'] 

['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'] 

['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'] 

['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50'] 

['12576946', 'Ask HN: How hard would it be to make a cheap, hackable phone?', '', '2', '1', 'hkt', '9/25/2016 19:30'] 

Number of posts in show_posts list:  5059 

['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06'] 

['12576813', 'Show HN: Learn Japanese Vocab via multiple cho
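
The grouping rule can also be isolated in a small function, which makes it easy to check on individual titles. The helper name `classify` is hypothetical; it applies the same case-insensitive prefix test as the loop above.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

print(classify("Ask HN: What TLD do you use for local development?"))  # ask
print(classify("SHOW HN: an example project"))  # show
print(classify("algorithmic music"))  # other
```

Note that a prefix test on `"ask hn"` also matches titles that merely begin with those characters, with or without a colon after them.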

# 4. Calculating average number of comments for both types of posts

In [4]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments/len(ask_posts)
print("Average number of comments in ask_posts: ", avg_ask_comments,"\n")

Average number of comments in ask_posts:  13.744175951381855 



In [5]:
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments/len(show_posts)
print("Average number of comments in show_posts: ", avg_show_comments,"\n")

Average number of comments in show_posts:  9.810832180272781 



In [6]:
print(avg_ask_comments > avg_show_comments)

True


Ask HN posts receive more comments on average than Show HN posts: about 13.74 versus 9.81 per post. A plausible explanation is that a question directly invites replies, whereas a project showcase does not require a response.
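
The two averaging cells above repeat the same logic, so the calculation could be factored into one function. This is a minimal sketch: `average_comments` is a hypothetical helper, and `sample_ask` is made-up data with the comment count in column 4, as in the data set.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical rows: only the num_comments column (index 4) is filled in.
sample_ask = [["", "", "", "", "7"], ["", "", "", "", "3"]]
print(average_comments(sample_ask))  # 5.0
```

With the lists built earlier, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.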

# 5. Finding the number of Ask HN posts in every hour
We group Ask HN posts by the hour in which they were published and compute the average number of comments per post for each hour.

In [7]:
import datetime as dt
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
counts_by_hour = {}
comments_by_hour = {}

for result in result_list:
    dt_object = dt.datetime.strptime(result[0], "%m/%d/%Y %H:%M")
    dt_hour = dt_object.strftime("%H")
    if dt_hour not in counts_by_hour:
        counts_by_hour[dt_hour]=1
        comments_by_hour[dt_hour]=result[1]
    else:
        counts_by_hour[dt_hour]+=1
        comments_by_hour[dt_hour]+=result[1]

In [8]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

In [9]:
avg_by_hour.sort()
for item in avg_by_hour:
    print(item)

['00', 9.857142857142858]
['01', 9.367713004484305]
['02', 13.198237885462555]
['03', 10.160377358490566]
['04', 12.688172043010752]
['05', 11.139393939393939]
['06', 9.017045454545455]
['07', 10.095541401273886]
['08', 12.43157894736842]
['09', 8.392045454545455]
['10', 13.757990867579908]
['11', 11.143426294820717]
['12', 15.452554744525548]
['13', 22.2239263803681]
['14', 13.153439153439153]
['15', 39.66809421841542]
['16', 10.76144578313253]
['17', 13.73019801980198]
['18', 10.789823008849558]
['19', 9.414285714285715]
['20', 11.38265306122449]
['21', 11.056511056511056]
['22', 11.749128919860627]
['23', 8.322463768115941]
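
The per-hour aggregation can be written more compactly with `collections.defaultdict`, which removes the need for the `if dt_hour not in counts_by_hour` branch. The sketch below assumes the same `created_at` format as the data set; the three entries in `sample` are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs in the data set's date format.
sample = [
    ("9/26/2016 2:53", 7),
    ("9/26/2016 2:10", 3),
    ("9/25/2016 15:30", 20),
]

counts_by_hour = defaultdict(int)    # hour -> number of posts
comments_by_hour = defaultdict(int)  # hour -> total comments

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

# Average comments per post for each hour that has at least one post.
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in counts_by_hour
}
print(sorted(avg_by_hour.items()))
```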


# 6. Sorting and printing the result

In [10]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

for row in swap_avg_by_hour:
    print(row)

[9.857142857142858, '00']
[9.367713004484305, '01']
[13.198237885462555, '02']
[10.160377358490566, '03']
[12.688172043010752, '04']
[11.139393939393939, '05']
[9.017045454545455, '06']
[10.095541401273886, '07']
[12.43157894736842, '08']
[8.392045454545455, '09']
[13.757990867579908, '10']
[11.143426294820717, '11']
[15.452554744525548, '12']
[22.2239263803681, '13']
[13.153439153439153, '14']
[39.66809421841542, '15']
[10.76144578313253, '16']
[13.73019801980198, '17']
[10.789823008849558, '18']
[9.414285714285715, '19']
[11.38265306122449, '20']
[11.056511056511056, '21']
[11.749128919860627, '22']
[8.322463768115941, '23']


In [11]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [12]:
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    dt_obj = dt.datetime.strptime(row[1], "%H")
    dt_str = dt_obj.strftime("%H:%M")
    print("{0}: {1:.2f} average comments per post".format(dt_str, row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 39.67 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post
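
Swapping the columns is one way to sort by average. Another is to pass a `key` function to `sorted`, which leaves the `[hour, average]` layout unchanged. The sketch below uses four of the hourly averages printed above, rounded to two decimals, rather than the full list.

```python
# A few of the hourly averages from the output above, rounded to two decimals.
avg_by_hour = [["12", 15.45], ["13", 22.22], ["15", 39.67], ["16", 10.76]]

# Sort by the average (second element), highest first, and keep the top three.
top = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)[:3]

for hour, avg in top:
    print("{0}:00: {1:.2f} average comments per post".format(hour, avg))
```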


# 7. Conclusion
We can conclude that, among posts with at least one comment, Ask HN posts tend to receive more comments than Show HN posts. Moreover, Ask HN posts published between 15:00 and 16:00 (as recorded in the `created_at` column) receive the most comments, about 39.67 per post on average.