# Predicting Community Engagement for Hacker News Posts

This project examines a sample of approximately 20,000 user posts from the Hacker News online forum (the [dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) is available on Kaggle).

The data set includes the following:

* id (Hacker News's unique ID number for the post)
* title (post's title)
* url (url that the post links to)
* num_points (number of upvotes post received minus number of downvotes post received)
* num_comments (number of user comments on post)
* author (name of post's author)
* created_at (date and time of post - in %-m/%-d/%Y %-H:%M format)
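
As a quick check of the `created_at` format, the sketch below parses one value copied from the data set. It assumes that `strptime` with the plain `%m`, `%d`, and `%H` directives accepts values without leading zeros, which is how the parsing cells later in this notebook use it. The `%-m` style shown in the list is a platform-dependent way of writing the same fields without padding.

```python
import datetime as dt

# Sample value copied from the data set's created_at column.
raw = "8/4/2016 11:52"

# The plain %m, %d, and %H directives also parse values without leading zeros.
parsed = dt.datetime.strptime(raw, "%m/%d/%Y %H:%M")

print(parsed)       # 2016-08-04 11:52:00
print(parsed.hour)  # 11
```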

Specifically, the project will examine two primary classes of posts: those whose titles begin with either `Ask HN` or `Show HN`. 

`Ask HN` are posts to ask the Hacker News community a specific question.

`Show HN` are posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two categories of posts to determine the following:

* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?
* Do `Ask HN` or `Show HN` posts receive more points on average?
* Do posts created at a certain time receive more points on average?


In [1]:
from csv import reader

# Open the file in a context manager so it is closed after reading.
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
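
As an aside, the rows can also be read by column name rather than by position. The sketch below is one possible alternative, not the approach used in the rest of this notebook: it uses `csv.DictReader` on a two-line stand-in for `hacker_news.csv` (the `sample_csv` name is made up for illustration) so that it runs without the real file.

```python
import csv
import io

# Two-line stand-in for hacker_news.csv: the header plus one row from the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# With the real file this would be: with open("hacker_news.csv", newline="") as f:
with sample_csv as f:
    rows = list(csv.DictReader(f))

first = rows[0]
# Values are still strings, so counts need int() before any arithmetic.
print(first["title"], int(first["num_comments"]))
```

Named columns trade a little setup for code that does not depend on remembering that, say, index 4 holds the comment count.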


Next, we'll assign the header row to its own variable `headers`, and remove the header row from the rest of the data set.

In [2]:
headers = hn[0]
hn = hn[1:]

print(headers)
print("\n")
print(hn[:2])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]


Next, we'll split the posts into three lists, depending on what type of post they are - Ask, Show, or Other. Then we'll check the length of each list.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    # Lowercase the title so the prefix check is case-insensitive.
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print("\n")
print(len(show_posts))
print("\n")
print(len(other_posts))

1744


1162


17194
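
The prefix check can also be pulled into a small helper, which makes the rule easy to test on its own. This is a minimal sketch: `classify_title` is a hypothetical name, and the titles passed to it are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles, for illustration only.
print(classify_title("Ask HN: How do you learn a new language?"))  # ask
print(classify_title("SHOW HN: My weekend project"))               # show
print(classify_title("Interactive Dynamic Video"))                 # other
```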


Next, we'll find the total number of comments for each category (Ask and Show) and calculate the average number of comments per post in the category.

In [4]:
total_ask_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
    
avg_show_comments = total_show_comments / len(show_posts)

print(avg_ask_comments)
print(avg_show_comments)

14.038417431192661
10.31669535283993


As shown above, posts in the Ask category receive about 3.7 more comments per post on average than posts in the Show category (14.04 versus 10.32).
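
Since the same total-then-divide pattern is applied to more than one column, it could be wrapped in a helper. The sketch below assumes a hypothetical `average_column` function built on `statistics.mean`, and runs it on the two sample rows printed earlier rather than on the full lists.

```python
from statistics import mean


def average_column(rows, index):
    """Average the integer values stored as strings in one column."""
    return mean(int(row[index]) for row in rows)


# Two rows copied from the sample output earlier in the notebook.
rows = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
]

print(average_column(rows, 4))  # average of the num_comments column
print(average_column(rows, 3))  # average of the num_points column
```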

From this point, we'll focus the rest of our comment analysis on the Ask posts alone, since they receive more comments on average.

The next step is to determine if Ask posts are more or less likely to receive comments depending on the posting time. 

In [5]:
import datetime as dt

result_list = []

for row in ask_posts:
    ask_list = []
    time = row[6]
    comments = int(row[4])
    ask_list.append(time)
    ask_list.append(comments)
    result_list.append(ask_list)
    
print(result_list[:2])

posts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    time = row[0]
    comments = row[1]
    time_dt = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    hour = time_dt.hour
    if hour in posts_by_hour:
        posts_by_hour[hour] += 1        
    else:
        posts_by_hour[hour] = 1
    if hour in comments_by_hour:
        comments_by_hour[hour] += comments
    else:
        comments_by_hour[hour] = comments
        
print('\n')
print(posts_by_hour)
print('\n')
print(comments_by_hour)

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29]]


{0: 55, 1: 60, 2: 58, 3: 54, 4: 47, 5: 46, 6: 44, 7: 34, 8: 48, 9: 45, 10: 59, 11: 58, 12: 73, 13: 85, 14: 107, 15: 116, 16: 108, 17: 100, 18: 109, 19: 110, 20: 80, 21: 109, 22: 71, 23: 68}


{0: 447, 1: 683, 2: 1381, 3: 421, 4: 337, 5: 464, 6: 397, 7: 267, 8: 492, 9: 251, 10: 793, 11: 641, 12: 687, 13: 1253, 14: 1416, 15: 4477, 16: 1814, 17: 1146, 18: 1439, 19: 1188, 20: 1722, 21: 1745, 22: 479, 23: 543}


In the cell above, we created the following variables (remember, this is just for the Ask posts):

* `result_list` - This is a list of lists, with each inner list comprised of two elements: The time of the post (as a str), and the number of comments the post received (as an int).


* `posts_by_hour` - This is a dictionary containing a frequency table. Keys = hours (0-23); values = total number of Ask posts created in each hour.


* `comments_by_hour` - This is a dictionary of totals. Keys = hours (0-23); values = total number of comments received by the Ask posts created in each hour.
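
The two tallies can also be built without the `if`/`else` branches. The following sketch uses `collections.defaultdict`, which starts every missing key at 0; it runs on just the two `result_list` entries printed above, so the numbers are illustrative rather than the full results.

```python
import datetime as dt
from collections import defaultdict

# The two entries printed from result_list above.
sample = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29]]

# Missing hours default to 0, so no membership check is needed.
posts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    posts_by_hour[hour] += 1
    comments_by_hour[hour] += comments

print(dict(posts_by_hour))     # {9: 1, 13: 1}
print(dict(comments_by_hour))  # {9: 6, 13: 29}
```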

Next, we will calculate the average number of comments per Ask post for each hour of the day. This will be stored as the variable `avg_by_hour`, a list of lists.

In [6]:
avg_by_hour = []

for hour in posts_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour] / 
                              posts_by_hour[hour])])

for hour in avg_by_hour:
    print(*hour)

0 8.127272727272727
1 11.383333333333333
2 23.810344827586206
3 7.796296296296297
4 7.170212765957447
5 10.08695652173913
6 9.022727272727273
7 7.852941176470588
8 10.25
9 5.5777777777777775
10 13.440677966101696
11 11.051724137931034
12 9.41095890410959
13 14.741176470588234
14 13.233644859813085
15 38.5948275862069
16 16.796296296296298
17 11.46
18 13.20183486238532
19 10.8
20 21.525
21 16.009174311926607
22 6.746478873239437
23 7.985294117647059


This is the information we need, but it's not in the most readable format. Next, we'll sort the `avg_by_hour` list so it's ordered by average comment count, rather than by hour.

We'll start by creating a new list, `swap_avg_by_hour`, in which the two columns of `avg_by_hour` are swapped so that the average comment count comes first and the hour second.

In [7]:
swap_avg_by_hour = []

for row in avg_by_hour:
    hour = row[0]
    comments = row[1]
    swap_avg_by_hour.append([comments, hour])
    
for row in swap_avg_by_hour:
    print(*row)

8.127272727272727 0
11.383333333333333 1
23.810344827586206 2
7.796296296296297 3
7.170212765957447 4
10.08695652173913 5
9.022727272727273 6
7.852941176470588 7
10.25 8
5.5777777777777775 9
13.440677966101696 10
11.051724137931034 11
9.41095890410959 12
14.741176470588234 13
13.233644859813085 14
38.5948275862069 15
16.796296296296298 16
11.46 17
13.20183486238532 18
10.8 19
21.525 20
16.009174311926607 21
6.746478873239437 22
7.985294117647059 23
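
Swapping the columns is one way to prepare for sorting. An alternative, sketched here on a few pairs copied from the `avg_by_hour` output, is to leave the list as it is and pass a `key` function to `sorted` that picks out the average.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour = [
    [0, 8.127272727272727],
    [2, 23.810344827586206],
    [15, 38.5948275862069],
    [20, 21.525],
]

# Sort on the second element (the average), largest first.
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked:
    print(hour, average)
```

With this subset the hours come out in the order 15, 2, 20, 0.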


In [8]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('\n')
print('Top 5 Hours for Ask Post Comments:')

for row in sorted_swap[:5]:
    hour = str(row[1])
    hour_dt = dt.datetime.strptime(hour, "%H")
    new_hour = hour_dt.strftime("%H:%M")
    output = "{}: {:.2f} average comments per post".format(
    new_hour, row[0])
    print(output)



Top 5 Hours for Ask Post Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Based on our analysis, Ask HN posts created during the 15:00 hour (UTC -5) receive the most comments on average.
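
To translate that hour into another time zone, the offset can be attached to a `datetime` and converted. This is a minimal sketch that assumes a fixed UTC -5 offset, as stated above, and ignores daylight saving time; the date used is arbitrary.

```python
import datetime as dt

# Assumption: a fixed UTC-5 offset, as stated in the text; daylight saving is ignored.
source_tz = dt.timezone(dt.timedelta(hours=-5))

# The date is arbitrary; only the hour matters here.
local = dt.datetime(2016, 8, 16, 15, 0, tzinfo=source_tz)
as_utc = local.astimezone(dt.timezone.utc)

print(as_utc.strftime("%H:%M"))  # 20:00
```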

Let's now examine which type of posts receive more points. Remember, points are the total number of upvotes given to a post less the total number of downvotes.

We'll begin by calculating the total number of points per category (Ask and Show), and then calculate the average number of points per post for each category.

In [9]:
total_ask_points = 0

for row in ask_posts:
    points = int(row[3])
    total_ask_points += points
    
total_show_points = 0
    
for row in show_posts:
    points = int(row[3])
    total_show_points += points

avg_ask_points = total_ask_points / len(ask_posts)
avg_show_points = total_show_points / len(show_posts)

print(avg_ask_points)
print(avg_show_points)

15.061926605504587
27.555077452667813


We see here that Show posts receive on average nearly twice as many points per post as Ask posts. 
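
To put a number on "nearly twice", the ratio of the two averages printed above can be computed directly. The variable names in this sketch are chosen for illustration.

```python
# Averages copied from the output above.
ask_average = 15.061926605504587
show_average = 27.555077452667813

ratio = show_average / ask_average
print(round(ratio, 2))  # 1.83
```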

From this point, we'll focus our points analysis on the Show posts, since they receive more points on average.

The next step is to determine if Show posts are more or less likely to be upvoted depending on the time of their posting.

In [10]:
show_result_list = []

for row in show_posts:
    time = row[6]
    points = row[3]
    show_result_list.append([time, points])

print(show_result_list[:2])

[['11/25/2015 14:03', '26'], ['11/29/2015 22:46', '747']]


In [11]:
show_posts_by_hour = {}
show_points_by_hour = {}

for row in show_result_list:
    time = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = time.hour
    points = int(row[1])
    if hour in show_posts_by_hour:
        show_posts_by_hour[hour] += 1
    else:
        show_posts_by_hour[hour] = 1
    if hour in show_points_by_hour:
        show_points_by_hour[hour] += points
    else:
        show_points_by_hour[hour] = points
      
print(show_posts_by_hour)    
print('\n')
print(show_points_by_hour)



{0: 31, 1: 28, 2: 30, 3: 27, 4: 26, 5: 19, 6: 16, 7: 26, 8: 34, 9: 30, 10: 36, 11: 44, 12: 61, 13: 99, 14: 86, 15: 78, 16: 93, 17: 93, 18: 61, 19: 55, 20: 60, 21: 47, 22: 46, 23: 36}


{0: 1173, 1: 700, 2: 340, 3: 679, 4: 386, 5: 104, 6: 375, 7: 494, 8: 519, 9: 553, 10: 681, 11: 1480, 12: 2543, 13: 2438, 14: 2187, 15: 2228, 16: 2634, 17: 2521, 18: 2215, 19: 1702, 20: 1819, 21: 866, 22: 1856, 23: 1526}


In the cells above, we created the following variables (remember, this is just for the Show posts):

* `show_result_list` - This is a list of lists, with each inner list comprised of two elements: The time of the post (as a str), and the number of points the post received (as a str).


* `show_posts_by_hour` - This is a dictionary containing a frequency table. Keys = hours (0-23); values = total number of Show posts created in each hour.


* `show_points_by_hour` - This is a dictionary of totals. Keys = hours (0-23); values = total number of points received by the Show posts created in each hour.
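
Because the comment and point analyses follow the same steps, they could share one function. The sketch below assumes a hypothetical `average_by_hour` helper that takes the column index to average; the rows are made-up examples in the data set's column order, so the result is illustrative only.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(rows, value_index, time_index=6):
    """Map each posting hour to the mean of one numeric column."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        hour = dt.datetime.strptime(row[time_index], "%m/%d/%Y %H:%M").hour
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in counts}


# Hypothetical rows in the data set's column order; only columns 3 and 6 matter here.
rows = [
    ['1', 'Show HN: A', '', '26', '0', 'a', '11/25/2015 14:03'],
    ['2', 'Show HN: B', '', '10', '0', 'b', '11/26/2015 14:30'],
    ['3', 'Show HN: C', '', '747', '0', 'c', '11/29/2015 22:46'],
]

print(average_by_hour(rows, value_index=3))  # {14: 18.0, 22: 747.0}
```

Passing `value_index=4` instead would average the `num_comments` column in the same way.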

Next, we will calculate the average number of points per Show post for each hour of the day. This will be stored as the variable `show_avg_by_hour`, a list of lists.

In [12]:
show_avg_by_hour = []

for hour in show_posts_by_hour:
    show_avg_by_hour.append([hour, show_points_by_hour[hour] / 
                            show_posts_by_hour[hour]])
    
for row in show_avg_by_hour:
    print(*row)

0 37.83870967741935
1 25.0
2 11.333333333333334
3 25.14814814814815
4 14.846153846153847
5 5.473684210526316
6 23.4375
7 19.0
8 15.264705882352942
9 18.433333333333334
10 18.916666666666668
11 33.63636363636363
12 41.68852459016394
13 24.626262626262626
14 25.430232558139537
15 28.564102564102566
16 28.322580645161292
17 27.107526881720432
18 36.31147540983606
19 30.945454545454545
20 30.316666666666666
21 18.425531914893618
22 40.34782608695652
23 42.388888888888886


Again, this is the information we need, but it's not in the most readable format. Next, we'll sort the `show_avg_by_hour` list so it's ordered by average point count, rather than by hour.

We'll start by creating a new list, `show_swap_avg_by_hour`, in which the two columns of `show_avg_by_hour` are swapped so that the average point count comes first and the hour second.

In [13]:
show_swap_avg_by_hour = []

for row in show_avg_by_hour:
    show_swap_avg_by_hour.append([row[1], row[0]])
    
for row in show_swap_avg_by_hour:
    print(*row)

37.83870967741935 0
25.0 1
11.333333333333334 2
25.14814814814815 3
14.846153846153847 4
5.473684210526316 5
23.4375 6
19.0 7
15.264705882352942 8
18.433333333333334 9
18.916666666666668 10
33.63636363636363 11
41.68852459016394 12
24.626262626262626 13
25.430232558139537 14
28.564102564102566 15
28.322580645161292 16
27.107526881720432 17
36.31147540983606 18
30.945454545454545 19
30.316666666666666 20
18.425531914893618 21
40.34782608695652 22
42.388888888888886 23


In [14]:
show_sorted_swap = sorted(show_swap_avg_by_hour, reverse=True)

print('\n')
print('Top 5 Hours for Show Post Points:')

for row in show_sorted_swap[:5]:
    time = dt.datetime.strptime(str(row[1]), "%H")
    hour = time.strftime("%H:%M")
    output = "{}: {:.2f} average points per post"
    print(output.format(hour, row[0]))
    



Top 5 Hours for Show Post Points:
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post
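
Since the hour is already an integer, the label can also be produced with a format specifier instead of a round trip through `strptime` and `strftime`. This sketch uses three `[average, hour]` pairs copied from the output above.

```python
# Three [average, hour] pairs copied from the output above.
top = [
    [42.388888888888886, 23],
    [41.68852459016394, 12],
    [37.83870967741935, 0],
]

# :02d zero-pads the hour, so 0 becomes "00"; :.2f keeps two decimals.
lines = [
    f"{hour:02d}:00: {average:.2f} average points per post"
    for average, hour in top
]

for line in lines:
    print(line)
```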


## Summary

From our analysis, we have seen that:

* Ask posts receive more comments on average than Show posts.

* Ask posts receive the most comments on average when created during the hour of 15:00 (UTC -5).

* Show posts receive more points on average than Ask posts.

* Show posts receive the most points on average when created during the hour of 23:00 (UTC -5).

More broadly, the two top-five lists (Ask post comments and Show post points) together cover ten hours that span from noon to 2:00 a.m. Most of them fall into two time pockets: hours starting between 3:00 and 6:00 p.m., and hours starting between 8:00 p.m. and midnight. None of these ten hours falls in the morning.
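
The observation about morning hours can be checked directly from the two top-five lists. In this sketch the hours are copied from the outputs above, and "morning" is assumed to mean hours 3 through 11; a different cutoff would be an equally reasonable choice.

```python
# Top five hours copied from the two ranked outputs above.
ask_top_hours = [15, 2, 20, 16, 21]
show_top_hours = [23, 12, 22, 0, 18]

combined = sorted(set(ask_top_hours + show_top_hours))

# Assumption: "morning" means hours 3 through 11 inclusive.
morning = [hour for hour in combined if 3 <= hour < 12]

print(combined)  # [0, 2, 12, 15, 16, 18, 20, 21, 22, 23]
print(morning)   # []
```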