# POSTING ONLINE: ARE MY PEERS' VOLITIONAL ONLINE ENGAGEMENT TIME DEPENDENT?


In this project, we'll work with a dataset of submissions to popular technology site Hacker News. Our aim is to understand when user's most engage in posts and how it affects the reach and success of a 
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts). Below are descriptions of the columns:

id: the unique identifier from Hacker News for the post
<br>
title: the title of the post
<br>
url: the URL that the posts links to, if the post has a URL
<br>
num_points: the number of points the post acquired, calculated as the total
number of upvotes minus the total number of downvotes
<br>
num_comments: the number of comments on the post
<br>
author: the username of the person who submitted the post
<br>
created_at: the date and time of the post's submission

In [2]:
# lET'S READ THE DATASET AND CHECK FIRST FEW ROWS
from csv import reader 

hn = open("hacker_news.csv")
hn = reader(hn) # parse opened data 
hn = list(hn) # convert the data to a list
for each_row in hn[:5]:
    print(each_row, '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 



In [3]:
# Let's seperate out the header information
header = hn[:1][0]
print(header)
print()
hn = hn[1:]
for each_row in hn[:5]:
    print(each_row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


In [4]:
# let's Create three empty lists
ask_posts = []
show_posts = [] 
other_posts = []

for each_row in hn:
    title = each_row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(each_row)
    elif title.startswith('show hn'):
        show_posts.append(each_row)
    else:
        other_posts.append(each_row)
print('Number of posts as ASK HN: {}'.format(len(ask_posts)))
print()
print('Number of posts as SHOW HN: {}'.format(len(show_posts)))
print()
print('Number of posts from other categories: {}'.format(len(other_posts)))

Number of posts as ASK HN: 1744

Number of posts as SHOW HN: 1162

Number of posts from other categories: 17194


In [6]:
total_ask_comments = 0
for post in ask_posts:
    Num_comment = post[4]
    Num_comment = int(Num_comment)
    total_ask_comments += Num_comment
avg_ask_comments = total_ask_comments/ len(ask_posts)
print('Average number of comments for ASK HN: {}'.format(avg_ask_comments))

# Now calculate the average number of comments for the show post
total_show_comments = 0
for post in show_posts:
    Num_comment2 = post[4]
    Num_comment2 = int(Num_comment2)
    total_show_comments += Num_comment2
avg_show_comments = total_show_comments/ len(show_posts)
print('Average number of comments for SHOW HN: {}'.format(avg_show_comments))

Average number of comments for ASK HN: 14.038417431192661
Average number of comments for SHOW HN: 10.31669535283993


Looking into the average values for both ASK HN and SHOW HN posts, it's obvious that posts under the catagory of ASK HN receives on average more comments than the SHOW HN catagory. One reason can be that people in platforms like this, tends to demonstrate their personal skills, or understanding rather than looking into other's projects. This might influence users to engage in replying and responding to other's questions, because it gives an opportunity to exercise personal opinions, and skills. 
But looking into someone else's project requires several things, it should be interesting to the person to go and check it. Moreover, to comment on a project, you'll need to go through a full project to comment about it. People tend avoid investing so much time unless the project is so interesting that it excites him.

So, on average, ask posts receive more comments than show posts. Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

We'll calculate the number of ask posts created in each hour of the day, along with the number of comments received.
we then calculate the average number of comments ask posts receive by hour created.

In [7]:
# Let's calculate the number of ask posts and comments by hour created. 
# We'll use the datetime module to work with the data in the created_at column.

# First prepare a list of list with two elements in each; date and number of comments
import datetime as dt
result_list = []
for post in ask_posts:
    result_list.append([post[6], int(post[4])])
# Let's check if we got the list correctly. 
for element in result_list[:3]:
    print (element)
    
# Create two empty dictionaries to generate a frequency of hourly comments
counts_by_hour = {}
comments_by_hour = {}
for element in result_list:
    time = element[0]
    datetime_object = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    hour = datetime_object.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = element[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += element[1]     


['8/16/2016 9:55', 6]
['11/22/2015 13:43', 29]
['5/2/2016 10:14', 1]


Above, we created two dictionaries:

counts_by_hour: contains the number of ask posts created during each hour of the day.
comments_by_hour: contains the corresponding number of comments ask posts created at each hour received.
Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [8]:
# Let's calculate the average number of comments per post for posts created during each hour of the day.
avg_by_hour = []

for comment_hour in counts_by_hour:
    avg_by_hour.append([comment_hour, (comments_by_hour[comment_hour])/counts_by_hour[comment_hour]])

# print the result as sorted by average comments
#from operator import itemgetter
swap_avg_by_hour = []
for element in avg_by_hour:
    swap_avg_by_hour.append([element[1], element[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
# check if the list has been correctly sorted
for element in sorted_swap:
    print (element)  

[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[13.20183486238532, '18']
[11.46, '17']
[11.383333333333333, '01']
[11.051724137931034, '11']
[10.8, '19']
[10.25, '08']
[10.08695652173913, '05']
[9.41095890410959, '12']
[9.022727272727273, '06']
[8.127272727272727, '00']
[7.985294117647059, '23']
[7.852941176470588, '07']
[7.796296296296297, '03']
[7.170212765957447, '04']
[6.746478873239437, '22']
[5.5777777777777775, '09']


In [9]:
print("Top 5 Hours for Ask Posts Comments")
for element in sorted_swap[:5]:
    the_average = element[0]
    the_hour = element[1]
    the_hour = dt.datetime.strptime(the_hour, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(the_hour, the_average))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Looking into the results of our analysis, we can easily identify the hours when most of the posts received on average higher reach, i.e., comments. So, if someone is interested in reaching more users, he should post in the above 5 hours and especially in the evening at around 3 pm or in the night at around 02.00 am. Because most users stay online at these hours. 
# Next, we'll try to determine if show or ask posts receive more points on average

data set description:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

In [10]:
total_ask_points = 0
for post in ask_posts:
    Num_points = post[3]
    Num_points = int(Num_points)
    total_ask_points += Num_points
    
avg_ask_points = total_ask_points/ len(ask_posts)
print('Average number of points for ASK HN: {}'.format(avg_ask_points))

# Now calculate the average number of comments for the show post
total_show_points = 0
for post in show_posts:
    Num_points2 = post[3]
    Num_points2 = int(Num_points2)
    total_show_points += Num_points2
avg_show_points = total_show_points/ len(show_posts)
print('Average number of points for SHOW HN: {}'.format(avg_show_comments))

Average number of points for ASK HN: 15.061926605504587
Average number of points for SHOW HN: 10.31669535283993


As evident from the result above, like comments, post that were formed as questions received more points than the show type posts.
### Now, we'll determine if posts created at a certain time are more likely to receive more points.

In [13]:
# Let's calculate the number of ask posts and points by hour created. 
# We'll use the datetime module to work with the data in the created_at column.

# First prepare a list of list with two elements in each; date and number of points
import datetime as dt
result_list = []
for post in ask_posts:
    result_list.append([post[6], int(post[3])])

# Create two empty dictionaries to generate a frequency of hourly points
counts_by_hour = {}
points_by_hour = {}
for element in result_list:
    time = element[0]
    datetime_object = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    hour = datetime_object.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        points_by_hour[hour] = element[1]
    else:
        counts_by_hour[hour] += 1
        points_by_hour[hour] += element[1]  

# Let's calculate the average number of comments per post for posts created during each hour of the day.
avg_by_hour = []

for point_hour in counts_by_hour:
    avg_by_hour.append([point_hour, (points_by_hour[point_hour])/counts_by_hour[point_hour]])

# print the result as sorted by average comments
#from operator import itemgetter
swap_avg_by_hour = []
for element in avg_by_hour:
    swap_avg_by_hour.append([element[1], element[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

#print("__________________________________________________________________________________________")
print("Top 5 Hours for Ask Posts points")
for element in sorted_swap[:5]:
    the_average = element[0]
    the_hour = element[1]
    the_hour = dt.datetime.strptime(the_hour, "%H").strftime("%H:%M")
    print("{}: {:.2f} average points per post".format(the_hour, the_average))


Top 5 Hours for Ask Posts points
15:00: 29.99 average points per post
13:00: 24.26 average points per post
16:00: 23.35 average points per post
17:00: 19.41 average points per post
10:00: 18.68 average points per post


From our above analysis, we can easily determine that people are more engaged in their volutional online activities at the evening, usually after 2 pm to 6 pm. This is interesting becuase most of the offices close after 5 pm or the earliest at 4 pm. Our results indicate that many people after their lunch break tend to spend time in less stressful or less mentally demanding activities. Does it imply that companies and businesses better start their offices rather very early, maybe at 6 am and close it after 2 pm? That would be something very interesting.   