# Exploring Hacker News Posts

In this project we explore a dataset of posts collected from [Hacker News](https://news.ycombinator.com), a site where anyone can submit a post and receive comments and votes.

More specifically, we want to find out which type of post is more popular in terms of comments and points.

Afterwards, we'll look for the best time of day (CET) to submit a post in order to get as many comments or points as possible.

Our dataset contains the following columns:
* `id`: The unique identifier from Hacker News for the post
* `title`: The title of the post
* `url`: The URL that the post links to, if the post has a URL
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the post
* `author`: The username of the person who submitted the post
* `created_at`: The date and time at which the post was submitted

[Here](https://www.kaggle.com/aquaregis32/exploring-hacker-news-posts) you can find the dataset and its documentation.

First of all, we open the dataset and transform it into a suitable format: a list of lists.

In [1]:
from csv import reader
import datetime as dt
from dateutil import tz

#reading the whole file into a list of lists; the file is closed automatically
with open('hacker_news.csv') as opened:
    hn = list(reader(opened))
#extracting header
header = hn[0]
#keeping only the data rows
hn = hn[1:]

print(header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [2]:
def explore_data(dataset, start, end, rows_cols=False):
    #printing the requested slice of rows
    for row in dataset[start:end]:
        print(row)
        print('\n')
        
    #printing the dimensions of the dataset
    if rows_cols:
        num_row = len(dataset)
        num_cols = len(dataset[0])
        print('Dataset contains : \nRows {0} \nColumns {1}'.format(num_row, num_cols))
        
#the function prints its results and returns nothing, so we call it directly
explore_data(hn, 0, 6, True)

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-n

### Extracting Ask HN and Show HN Posts

We are only interested in posts whose titles start with `Ask HN` or `Show HN`, so we separate the rows belonging to these two categories from the rest.

In [3]:
#creating empty lists to store our data
ask_posts = []
show_posts = []
other_posts = []

#filling up the list with relevant data
for row in hn:
    #lowercasing to match needed pattern
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('{0} titles from Ask category\n{1} titles from Show category\n{2} titles from Other categories'.format(len(ask_posts), len(show_posts), len(other_posts)))


9139 titles from Ask category
10158 titles from Show category
273822 titles from Other categories
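
The prefix test used above can be checked on a few made-up titles. The sketch below is only an illustration: the helper name `categorize` and the sample titles are assumptions and are not part of the dataset or of the cells above.

```python
def categorize(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, invented for illustration only
samples = [
    'Ask HN: How do you learn SQL?',
    'SHOW HN: My side project',
    'Why I like asking questions',
]
for title in samples:
    print(categorize(title), '<-', title)
```

Because the title is lowercased first, `Ask HN`, `ASK HN` and `ask hn` all land in the same category, while a title that merely contains the words later on is counted as "other".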


### Finding the average for comments and points

Below we check which of the two categories receives more comments and points on average.

In [4]:
def avg_num(dataset, col_index):
    #initializing the running total to 0
    total_num = 0
    for row in dataset:
        num = int(row[col_index])
        #summing up the values of the chosen column
        total_num += num
    #returning the average
    return total_num / len(dataset)
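
To make the computation concrete, here is a minimal sketch of the same column average on three made-up rows. The name `column_mean` and the rows in `toy_rows` are assumptions chosen for illustration; only the column order follows the dataset header.

```python
def column_mean(rows, col_index):
    """Average of the integer values stored as strings in one column."""
    values = [int(row[col_index]) for row in rows]
    # Note: an empty list of rows would raise ZeroDivisionError here
    return sum(values) / len(values)

# Made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
toy_rows = [
    ['1', 'Ask HN: A', '', '3', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: B', '', '1', '0', 'user_b', '1/1/2016 11:00'],
    ['3', 'Ask HN: C', '', '8', '5', 'user_c', '1/1/2016 12:00'],
]

print(column_mean(toy_rows, 4))  # average of the comment counts 10, 0 and 5
print(column_mean(toy_rows, 3))  # average of the point counts 3, 1 and 8
```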
    

In [5]:
avg_ask_comments = avg_num(ask_posts, 4)
avg_show_comments = avg_num(show_posts, 4)
print('The average number of comments for:\n Ask titles is {0:.2f}\n Show titles is {1:.2f}'.format(avg_ask_comments, avg_show_comments))

The average number of comments for:
 Ask titles is 10.39
 Show titles is 4.89


As expected, `Ask` posts receive more comments on average: they ask a question, so they invite other users to answer.

In [6]:
avg_ask_points = avg_num(ask_posts, 3)
avg_show_points = avg_num(show_posts, 3)
print('The average number of points for:\n Ask titles is {0:.2f}\n Show titles is {1:.2f}'.format(avg_ask_points, avg_show_points))

The average number of points for:
 Ask titles is 11.31
 Show titles is 14.84


As for points, people tend to upvote `Show` posts more.

That's why we are going to evaluate `Ask` posts by comments and `Show` posts by points.

### Most popular time for posting

Below we explore whether posts created at a certain time of day attract more comments or points.

In [7]:
def extract_attention(dataset, col_time, col_att):
    result_list = []
    for row in dataset:
        #saving creation time
        time_create = row[col_time]
        #saving the number of comments/points
        num = int(row[col_att])
        #appending in a form of list
        result_list.append([time_create, num])
    return result_list

For further analysis we're going to extract the following pairs:
- **hour-comments** from `ask_posts`
- **hour-points** from `show_posts`

In [8]:
ask_hour_comm = extract_attention(ask_posts, -1, 4)

In [9]:
show_hour_points = extract_attention(show_posts, -1, 3)

Below we create two dictionaries per category and fill them with the following information (described here for ask posts):
 * `counts_by_hour` - the number of ask posts created during each hour of the day
 * `comments_by_hour` - the total number of comments received by ask posts created at each hour (named `att_by_hour` in the code, because the same function is reused for points)
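
The accumulation pattern behind these two dictionaries can be sketched on a few made-up pairs. The data in `pairs` is invented for illustration, and `dict.get` is used here as a compact alternative to an explicit `if hour not in ...` check.

```python
# Made-up (hour, comments) pairs, for illustration only
pairs = [('09', 3), ('09', 5), ('22', 10)]

counts_by_hour = {}
comments_by_hour = {}
for hour, comments in pairs:
    # dict.get supplies 0 the first time an hour is seen
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
    comments_by_hour[hour] = comments_by_hour.get(hour, 0) + comments

print(counts_by_hour)    # posts per hour
print(comments_by_hour)  # total comments per hour
```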

In [11]:
def attention_by_hour(result_list):
    #number of total posts by hour
    counts_by_hour = {}
    #number of total comments or points by hour
    att_by_hour = {}

    #defining zones: timestamps are treated as EST and converted to CET
    from_zone = tz.gettz('EST')
    to_zone = tz.gettz('CET')
    
    for row in result_list:
        #each row is a [creation time, number of comments/points] pair
        time_create = row[0]
        num_att = row[1]
        #parsing the time into a naive datetime object
        time_obj = dt.datetime.strptime(time_create, '%m/%d/%Y %H:%M')
        #attaching the source time zone
        source = time_obj.replace(tzinfo=from_zone)
        #converting to the target time zone
        local = source.astimezone(to_zone)
        #extracting only the hour
        hour = local.strftime('%H')
        if hour not in counts_by_hour:
            counts_by_hour[hour] = 1
            att_by_hour[hour] = num_att
        else:
            counts_by_hour[hour] += 1
            att_by_hour[hour] += num_att
    return counts_by_hour, att_by_hour

ask_posts_by_hour, ask_comments_by_hour = attention_by_hour(ask_hour_comm)
show_posts_by_hour, show_points_by_hour = attention_by_hour(show_hour_points)
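
To illustrate the conversion step on a single timestamp, here is a minimal sketch that uses only the standard library. It is not the code used above: instead of `dateutil`, it assumes fixed offsets (EST = UTC-5, CET = UTC+1) and ignores daylight saving time, so its result may differ by one hour from the `tz.gettz('CET')` conversion during summer time. The helper name `est_to_cet_hour` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets, no daylight saving time
EST_FIXED = timezone(timedelta(hours=-5), 'EST')
CET_FIXED = timezone(timedelta(hours=1), 'CET')

def est_to_cet_hour(timestamp):
    """Parse 'm/d/YYYY H:MM', treat it as EST and return the CET hour as 'HH'."""
    naive = datetime.strptime(timestamp, '%m/%d/%Y %H:%M')
    eastern = naive.replace(tzinfo=EST_FIXED)
    return eastern.astimezone(CET_FIXED).strftime('%H')

print(est_to_cet_hour('9/26/2016 3:26'))   # 03:26 at UTC-5 is 09:26 at UTC+1
print(est_to_cet_hour('9/26/2016 20:00'))  # crosses midnight: 02:00 on the next day
```

With these fixed offsets the shift is always six hours, which is why a post submitted in the evening in EST is counted in the early-morning CET buckets.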

In [12]:
print('Ask Titles :\nPosts per hour : ', sorted(ask_posts_by_hour.items()))
print('\n')
print('Ask Titles :\nComments per hour : ', sorted(ask_comments_by_hour.items()))

Ask Titles :
Posts per hour :  [('00', 628), ('01', 576), ('02', 546), ('03', 502), ('04', 497), ('05', 361), ('06', 317), ('07', 278), ('08', 288), ('09', 272), ('10', 259), ('11', 216), ('12', 209), ('13', 231), ('14', 241), ('15', 259), ('16', 234), ('17', 292), ('18', 305), ('19', 380), ('20', 484), ('21', 557), ('22', 651), ('23', 556)]


Ask Titles :
Comments per hour :  [('00', 6372), ('01', 4584), ('02', 3719), ('03', 5165), ('04', 4008), ('05', 2888), ('06', 2054), ('07', 2537), ('08', 1879), ('09', 2870), ('10', 2213), ('11', 2476), ('12', 1545), ('13', 1411), ('14', 1847), ('15', 2270), ('16', 1727), ('17', 2880), ('18', 3211), ('19', 4837), ('20', 7062), ('21', 10661), ('22', 13266), ('23', 3504)]


In [13]:
print('Show Titles :\nPosts per hour : ', sorted(show_posts_by_hour.items()))
print('\n')
print('Show Titles :\nPoints per hour : ', sorted(show_points_by_hour.items()))

Show Titles :
Posts per hour :  [('00', 697), ('01', 641), ('02', 548), ('03', 484), ('04', 378), ('05', 363), ('06', 314), ('07', 263), ('08', 211), ('09', 201), ('10', 212), ('11', 181), ('12', 179), ('13', 194), ('14', 277), ('15', 305), ('16', 304), ('17', 371), ('18', 443), ('19', 541), ('20', 659), ('21', 772), ('22', 851), ('23', 769)]


Show Titles :
Points per hour :  [('00', 9864), ('01', 8943), ('02', 8467), ('03', 7384), ('04', 5175), ('05', 4951), ('06', 3888), ('07', 3982), ('08', 3632), ('09', 1499), ('10', 3084), ('11', 1888), ('12', 1902), ('13', 3273), ('14', 4294), ('15', 4288), ('16', 3677), ('17', 5285), ('18', 9919), ('19', 10240), ('20', 10041), ('21', 11581), ('22', 11431), ('23', 12093)]


Now that we have this information structured by hour, we need to find out:
* the average number of comments per post for each hour, for Ask posts
* the average number of points per post for each hour, for Show posts

In [14]:
def avg_by_hour(att_by_hour, counts_by_hour):
    avg = []
    for hour in att_by_hour:
        #appending the hour and the number of comments/points divided by the number of posts
        avg.append([hour, att_by_hour[hour] / counts_by_hour[hour]])
    return avg

#average comments by hour from Ask Titles
ask_avg_comm = avg_by_hour(ask_comments_by_hour, ask_posts_by_hour)
#average points by hour from Show Titles
show_avg_points = avg_by_hour(show_points_by_hour, show_posts_by_hour) 
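
As a quick sanity check, the division can be reproduced by hand for two hours, using the Ask totals printed earlier (hours `21` and `22`). The variable names `comments_sample`, `posts_sample` and `avg_sample` are introduced only for this sketch.

```python
# Totals copied from the printed Ask output for hours 21 and 22
comments_sample = {'21': 10661, '22': 13266}
posts_sample = {'21': 557, '22': 651}

# Average comments per post = total comments / number of posts, hour by hour
avg_sample = {hour: comments_sample[hour] / posts_sample[hour] for hour in comments_sample}

for hour, avg in sorted(avg_sample.items()):
    print('{}: {:.2f}'.format(hour, avg))
```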

In [15]:
print(explore_data(ask_avg_comm, 0, len(ask_avg_comm)))

['09', 10.551470588235293]


['08', 6.524305555555555]


['05', 8.0]


['04', 8.064386317907445]


['02', 6.811355311355311]


['00', 10.146496815286625]


['22', 20.377880184331797]


['21', 19.14003590664273]


['20', 14.590909090909092]


['18', 10.527868852459017]


['17', 9.863013698630137]


['16', 7.380341880341881]


['14', 7.66390041493776]


['10', 8.544401544401545]


['06', 6.479495268138801]


['03', 10.288844621513944]


['23', 6.302158273381295]


['15', 8.764478764478765]


['07', 9.12589928057554]


['01', 7.958333333333333]


['19', 12.728947368421053]


['11', 11.462962962962964]


['13', 6.108225108225108]


['12', 7.392344497607655]


None


In [16]:
print(explore_data(show_avg_points, 0, len(show_avg_points)))

['07', 15.140684410646388]


['06', 12.382165605095542]


['03', 15.256198347107437]


['02', 15.450729927007298]


['01', 13.951638065522621]


['23', 15.725617685305592]


['21', 15.001295336787564]


['17', 14.245283018867925]


['16', 12.095394736842104]


['15', 14.059016393442622]


['13', 16.871134020618555]


['10', 14.547169811320755]


['04', 13.69047619047619]


['00', 14.152080344332855]


['22', 13.432432432432432]


['18', 22.390519187358915]


['14', 15.501805054151625]


['11', 10.430939226519337]


['20', 15.236722306525039]


['19', 18.927911275415898]


['08', 17.213270142180093]


['05', 13.639118457300276]


['09', 7.45771144278607]


['12', 10.625698324022347]


None


Our next step is to sort the data by average comments/points in descending order.

In [17]:
def swap_sort(avg_by_hour):
    swap_avg_by_hour = []
    for row in avg_by_hour:
        #appending swapped columns
        swap_avg_by_hour.append([row[1], row[0]])
    sorted_swap = sorted(swap_avg_by_hour, reverse=True)
    return sorted_swap
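
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` also accepts a `key` function, which sorts on the second element without reordering the pairs. The three pairs below are rounded values taken from the Ask results, and the names `pairs` and `ranked` are assumptions for this example.

```python
from operator import itemgetter

# [hour, average] pairs, rounded from the Ask results for illustration
pairs = [['09', 10.55], ['22', 20.38], ['21', 19.14]]

# Sort on the average (index 1), highest first; the pair layout is unchanged
ranked = sorted(pairs, key=itemgetter(1), reverse=True)

for hour, avg in ranked:
    print('{}:00 {:.2f}'.format(hour, avg))
```

With this variant the hour stays in the first position, so the later printing loop would read `row[0]` for the hour and `row[1]` for the average.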

In [18]:
ask_comm_sort = swap_sort(ask_avg_comm)

print('Top 6 Hours (CET) for Ask Posts Comments')
for row in ask_comm_sort[:6]:
    hour = row[1]
    #parsing the string into a datetime object
    hour_obj = dt.datetime.strptime(hour, '%H')
    #specifying the format for the datetime string
    hour_str = hour_obj.strftime('%H:%M')
    avg_comm = row[0]
    print('{}: {:.2f} average comments per post'.format(hour_str, avg_comm))

Top 6 Hours (CET) for Ask Posts Comments
22:00: 20.38 average comments per post
21:00: 19.14 average comments per post
20:00: 14.59 average comments per post
19:00: 12.73 average comments per post
11:00: 11.46 average comments per post
09:00: 10.55 average comments per post


In [19]:
show_points_sort = swap_sort(show_avg_points)

print('Top 6 Hours (CET) for Show Posts Points')
for row in show_points_sort[:6]:
    hour = row[1]
    #parsing the string into a datetime object
    hour_obj = dt.datetime.strptime(hour, '%H')
    #specifying the format for the datetime string
    hour_str = hour_obj.strftime('%H:%M')
    avg_points = row[0]
    print('{}: {:.2f} average points per post'.format(hour_str, avg_points))

Top 6 Hours (CET) for Show Posts Points
18:00: 22.39 average points per post
19:00: 18.93 average points per post
08:00: 17.21 average points per post
13:00: 16.87 average points per post
23:00: 15.73 average points per post
14:00: 15.50 average points per post


The hours that attract the most comments are not the same as the hours that attract the most points.

If you want to ask a question, it's better to post between 19:00 and 23:00 (CET), when Ask posts receive the most comments on average.

If you want to share something and collect votes, post between 18:00 and 20:00 or between 13:00 and 15:00 (CET), or try the morning between 08:00 and 09:00.