# Researching Patterns in Hacker News Posts
This project looks for patterns behind successful posts on [Hacker News](https://news.ycombinator.com/), a site where people from start-ups and the tech industry share stories and comment on each other's submissions. It explores two main questions:
1. Do posts whose titles start with *Ask HN* or *Show HN* receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

The data set consists of about 20,000 rows. Each row describes one post: ID, title, URL, number of points, number of comments, author, and date of submission.

In [3]:
from csv import reader

# Read the whole CSV into a list of rows; the context manager closes the file
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

print('First rows of our data set: ')
for row in hn[:5]:
    print(' - ',row)

First rows of our data set: 
 -  ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
 -  ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
 -  ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
 -  ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
 -  ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


### Removing Headers from a Dataset

In [4]:
headers = hn[0]
hn = hn[1:]

print(headers, '\n')
for row in hn[:5]:
    print(' - ', row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

 -  ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
 -  ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
 -  ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
 -  ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
 -  ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


### Extracting `Ask HN` and `Show HN` Posts

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of Ask Posts is', len(ask_posts))
print('Number of Show Posts is', len(show_posts))
print('Number of other posts is', len(other_posts))

Number of Ask Posts is 1744
Number of Show Posts is 1162
Number of other posts is 17194


### The Average Number of Comments for `Ask HN` and `Show HN` Posts

In [6]:
def average_item(dataset, index):
    total_items = 0

    for row in dataset:
        num_items = int(row[index])
        total_items += num_items

    avg_item = total_items / len(dataset)
    return avg_item
    

avg_ask_comments = average_item(ask_posts, 4)
avg_show_comments = average_item(show_posts, 4)
avg_all_comments = average_item(hn, 4)
print('Average comments number of `Ask HN` is {:.2f}'
      .format(avg_ask_comments))
print('Average comments number of `Show HN` is {:.2f}'
      .format(avg_show_comments))
print('Average comments number of all posts is {:.2f}'
      .format(avg_all_comments))

Average comments number of `Ask HN` is 14.04
Average comments number of `Show HN` is 10.32
Average comments number of all posts is 24.80


The calculation shows that `Ask HN` and `Show HN` posts receive roughly half as many comments as posts overall (14.04 and 10.32 versus 24.80). Comparing the two groups with each other suggests that people are less inclined to respond to something being shown than to answer a question.
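
One caveat is that the "all posts" average includes the `Ask HN` and `Show HN` posts themselves. A minimal sketch of comparing a group against a chosen baseline follows. The helper `ratio_to_baseline` and the sample rows are hypothetical and only illustrate the arithmetic; they are not taken from the data set.

```python
def average_item(dataset, index):
    """Mean of the integer values stored in column `index`."""
    return sum(int(row[index]) for row in dataset) / len(dataset)


def ratio_to_baseline(group, baseline, index):
    """Group average divided by baseline average for column `index`."""
    return average_item(group, index) / average_item(baseline, index)


# Hypothetical rows in the same column order as hacker_news.csv
# (index 4 holds num_comments); the numbers are made up for illustration.
sample_ask = [
    ['1', 'Ask HN: a', '', '5', '10', 'u1', '1/1/2016 10:00'],
    ['2', 'Ask HN: b', '', '3', '20', 'u2', '1/1/2016 11:00'],
]
sample_other = [
    ['3', 'Story c', 'http://example.com', '40', '30', 'u3', '1/1/2016 12:00'],
    ['4', 'Story d', 'http://example.com', '60', '30', 'u4', '1/1/2016 13:00'],
]

ratio = ratio_to_baseline(sample_ask, sample_other, 4)
print('Ask posts get {:.2f} times the comments of other posts'.format(ratio))
```

With the real data, `ratio_to_baseline(ask_posts, other_posts, 4)` would compare `Ask HN` posts with posts that carry neither prefix.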

### Finding the Number of Posts and Comments by Hour Created

In [7]:
import datetime as dt

result_list = []

for row in hn:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_time_created = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    an_hour = date_time_created.strftime("%H")
    
    if an_hour in counts_by_hour:
        counts_by_hour[an_hour] += 1
        comments_by_hour[an_hour] += row[1]
    else:
        counts_by_hour[an_hour] = 1
        comments_by_hour[an_hour] = row[1]

print('We created a frequency table for amount of posts and comments by hours')
print('Hour: amount of posts')
print(counts_by_hour)
print('\nHour: amount of comments')
print(comments_by_hour)

We created a frequency table for amount of posts and comments by hours
Hour: amount of posts
{'11': 762, '19': 1145, '22': 875, '00': 697, '04': 527, '09': 609, '16': 1302, '18': 1254, '14': 1151, '10': 686, '12': 923, '13': 1102, '20': 1051, '03': 488, '17': 1362, '01': 588, '23': 778, '08': 578, '02': 529, '21': 1030, '15': 1234, '06': 468, '07': 508, '05': 453}

Hour: amount of comments
{'11': 20664, '19': 27894, '22': 18684, '00': 17478, '04': 11537, '09': 15274, '16': 30857, '18': 31587, '14': 33545, '10': 16818, '12': 25351, '13': 30562, '20': 23414, '03': 11626, '17': 34784, '01': 12465, '23': 17582, '08': 14062, '02': 13762, '21': 22652, '15': 35809, '06': 9253, '07': 12576, '05': 10290}
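
The same frequency tables could also be built with `collections.defaultdict`, which removes the explicit check for a missing key. This is a sketch on a few made-up rows rather than a replacement for the cell above, and `tally_by_hour` is a hypothetical helper name.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Return (posts per hour, summed values per hour) keyed by 'HH' strings.

    Each row is a [created_at, value] pair, like the entries of result_list.
    """
    counts = defaultdict(int)
    totals = defaultdict(int)
    for created_at, value in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        totals[hour] += value
    return dict(counts), dict(totals)


# Made-up sample in the same shape as result_list
sample = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 8],
]
counts, totals = tally_by_hour(sample)
print(counts)  # {'11': 2, '19': 1}
print(totals)  # {'11': 60, '19': 10}
```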


### Average Number of Comments for Posts by Hours

In [8]:
avg_comments_by_hour = []

for an_hour in comments_by_hour:
    avg_by_hour = comments_by_hour[an_hour] / counts_by_hour[an_hour]
    avg_comments_by_hour.append([an_hour, avg_by_hour])

print('List of hours and average comments per post by that hour')
for row in avg_comments_by_hour:
    print(row)

List of hours and average comments per post by that hour
['11', 27.118110236220474]
['19', 24.361572052401748]
['22', 21.353142857142856]
['00', 25.076040172166426]
['04', 21.891840607210625]
['09', 25.080459770114942]
['16', 23.69969278033794]
['18', 25.188995215311003]
['14', 29.14422241529105]
['10', 24.516034985422742]
['12', 27.465872156013003]
['13', 27.733212341197824]
['20', 22.27783063748811]
['03', 23.82377049180328]
['17', 25.53891336270191]
['01', 21.198979591836736]
['23', 22.59897172236504]
['08', 24.32871972318339]
['02', 26.015122873345934]
['21', 21.992233009708738]
['15', 29.01863857374392]
['06', 19.771367521367523]
['07', 24.755905511811022]
['05', 22.71523178807947]


### Sorting and Printing Values from a List of Lists

In [9]:
swap_avg_comments_by_hour = []

for row in avg_comments_by_hour:
    swap_avg_comments_by_hour.append([row[1],row[0]])

print('Swapped position:')
for row in swap_avg_comments_by_hour:
    print(row)

Swapped position:
[27.118110236220474, '11']
[24.361572052401748, '19']
[21.353142857142856, '22']
[25.076040172166426, '00']
[21.891840607210625, '04']
[25.080459770114942, '09']
[23.69969278033794, '16']
[25.188995215311003, '18']
[29.14422241529105, '14']
[24.516034985422742, '10']
[27.465872156013003, '12']
[27.733212341197824, '13']
[22.27783063748811, '20']
[23.82377049180328, '03']
[25.53891336270191, '17']
[21.198979591836736, '01']
[22.59897172236504, '23']
[24.32871972318339, '08']
[26.015122873345934, '02']
[21.992233009708738, '21']
[29.01863857374392, '15']
[19.771367521367523, '06']
[24.755905511811022, '07']
[22.71523178807947, '05']


In [10]:

sorted_swap = sorted(swap_avg_comments_by_hour, reverse = True)
highest_chance_comm_hour = sorted_swap[0][0]

print('Top 5 Hours for Posts')
for row in sorted_swap[:5]:
    an_hour = row[1]
    an_hour = dt.datetime.strptime(an_hour, "%H")
    text_hour = an_hour.strftime("%H:%M")
    print('{}: {:.2f} average comments per post'.format(text_hour, row[0]))

Top 5 Hours for Posts
14:00: 29.14 average comments per post
15:00: 29.02 average comments per post
13:00: 27.73 average comments per post
12:00: 27.47 average comments per post
11:00: 27.12 average comments per post
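
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so `[hour, average]` pairs can be sorted directly without building a swapped list. The values below are a few pairs copied from the output above and rounded to two decimals, for illustration only.

```python
# A few [hour, average comments] pairs, rounded from the output above
sample_avg_by_hour = [
    ['11', 27.12],
    ['14', 29.14],
    ['06', 19.77],
    ['15', 29.02],
]

# Sort by the second element (the average), largest first
top_hours = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in top_hours[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```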


So far, posts created between 14:00 and 15:00 have the highest average number of comments. Let's calculate how this hour compares with the average over the whole day.

In [11]:
print('Average comment per post for whole day is {:.2f}'
      .format(avg_all_comments))
print('Average comment per post by the highest chance hour is {:.2f}'
      .format(highest_chance_comm_hour))
print('How much better is it to post at a certain hour? {:.2f} / {:.2f} = {:.2f}'
      .format(highest_chance_comm_hour, avg_all_comments
      , highest_chance_comm_hour / avg_all_comments)) 


Average comment per post for whole day is 24.80
Average comment per post by the highest chance hour is 29.14
How much better is it to post at a certain hour? 29.14 / 24.80 = 1.18


> **So posts created between 14:00 and 15:00 receive about 18 percent more comments on average than posts overall.**

### Exploration by points (upvotes and downvotes)
Let's compare `Ask HN` and `Show HN` posts again, this time by points.

In [13]:
avg_ask_points = average_item(ask_posts, 3)
avg_show_points = average_item(show_posts, 3)
avg_all_points = average_item(hn, 3)

print('Average points number of `Ask HN` is {:.2f}'
      .format(avg_ask_points))
print('Average points number of `Show HN` is {:.2f}'
      .format(avg_show_points))
print('Average points number of all posts is {:.2f}'
      .format(avg_all_points))

Average points number of `Ask HN` is 15.06
Average points number of `Show HN` is 27.56
Average points number of all posts is 50.30


Although `Ask HN` posts receive more comments, `Show HN` posts receive almost twice as many points (27.56 versus 15.06). Points signal a positive reaction, whereas comments can also be neutral or negative. Even so, the average over all posts (50.30) is still well above both groups.

----------
The same method can be used to find the top hours, this time by points.

In [22]:
result_list_2 = []

for row in hn:
    created_at = row[6]
    num_points = int(row[3])
    result_list_2.append([created_at, num_points])
    
points_by_hour = {}
counts_by_hour = {}

for row in result_list_2:
    date_time_created = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    an_hour = date_time_created.strftime("%H")
    
    if an_hour in counts_by_hour:
        counts_by_hour[an_hour] += 1
        points_by_hour[an_hour] += row[1]
    else:
        counts_by_hour[an_hour] = 1
        points_by_hour[an_hour] = row[1]
        
avg_points_by_hour = []

for an_hour in points_by_hour:
    avg_by_hour = points_by_hour[an_hour] / counts_by_hour[an_hour]
    avg_points_by_hour.append([an_hour, avg_by_hour])
    
swap_avg_points_by_hour = []

for row in avg_points_by_hour:
    swap_avg_points_by_hour.append([row[1],row[0]])
    

sorted_swap_2 = sorted(swap_avg_points_by_hour, reverse = True)
highest_chance_points_hour = sorted_swap_2[0][0]

print('Top 5 Hours for Posts')
for row in sorted_swap_2[:5]:
    an_hour = row[1]
    an_hour = dt.datetime.strptime(an_hour, "%H")
    text_hour = an_hour.strftime("%H:%M")
    print('{}: {:.2f} average points per post'.format(text_hour, row[0]))

Top 5 Hours for Posts
13:00: 56.17 average points per post
15:00: 55.65 average points per post
10:00: 54.71 average points per post
14:00: 54.44 average points per post
19:00: 54.17 average points per post


This shows that posts created between 13:00 and 14:00 receive the most points on average.
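
Because the comments and points calculations differ only in the column they read, the steps could be folded into one function. The following is a sketch under that assumption; `top_hours_by_column` is a hypothetical name and the sample rows are made up.

```python
import datetime as dt


def top_hours_by_column(rows, value_index, date_index=6, limit=5):
    """Return up to `limit` (hour, average) pairs, highest average first.

    `rows` follow the hacker_news.csv column order; `value_index` selects
    the numeric column to average (3 for num_points, 4 for num_comments).
    """
    counts = {}
    totals = {}
    for row in rows:
        created = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        counts[hour] = counts.get(hour, 0) + 1
        totals[hour] = totals.get(hour, 0) + int(row[value_index])

    averages = [(hour, totals[hour] / counts[hour]) for hour in counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:limit]


# Made-up rows: id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ['1', 'a', '', '10', '4', 'u1', '8/4/2016 13:10'],
    ['2', 'b', '', '30', '2', 'u2', '8/5/2016 13:45'],
    ['3', 'c', '', '50', '9', 'u3', '8/5/2016 15:00'],
]

by_points = top_hours_by_column(sample_rows, 3)
by_comments = top_hours_by_column(sample_rows, 4)
print(by_points)
print(by_comments)
```

With the real data, `top_hours_by_column(hn, 4)` and `top_hours_by_column(hn, 3)` would cover the comments and points rankings respectively.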

In [24]:
print('Average point per post for whole day is {:.2f}'
      .format(avg_all_points))
print('Average point per post by the highest chance hour is {:.2f}'
      .format(highest_chance_points_hour))
print('How much better is it to post at a certain hour? {:.2f} / {:.2f} = {:.2f}'
      .format(highest_chance_points_hour, avg_all_points
      , highest_chance_points_hour / avg_all_points)) 


Average point per post for whole day is 50.30
Average point per post by the highest chance hour is 56.17
How much better is it to post at a certain hour? 56.17 / 50.30 = 1.12


Combining the two measures, *comments* and *points*, the best time to post is **from 13:00 to 15:00**.

Averaging the two gains gives a rough estimate of the advantage of posting in this window: (18% + 12%) / 2 = 15%.
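
The arithmetic behind these percentages can be checked with a short sketch. It uses the rounded averages printed above, so the results are approximate, and `relative_lift` is a hypothetical helper.

```python
def relative_lift(best_hour_avg, overall_avg):
    """Fractional improvement of the best hour over the overall average."""
    return best_hour_avg / overall_avg - 1


# Rounded values taken from the printed output in this notebook
comments_lift = relative_lift(29.14, 24.80)
points_lift = relative_lift(56.17, 50.30)
mean_lift = (comments_lift + points_lift) / 2

print('Comments lift: {:.1%}'.format(comments_lift))
print('Points lift:   {:.1%}'.format(points_lift))
print('Mean lift:     {:.1%}'.format(mean_lift))
```

The two lifts come from different hours (14:00 for comments, 13:00 for points), so their mean is only a rough summary of the 13:00 to 15:00 window.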

## Conclusion
This analysis produced two main results:
* Posts whose titles start with `Ask HN` or `Show HN` receive far less feedback than posts overall: roughly 30 to 57 percent of the average number of comments and points.
* Posts created between 13:00 and 15:00 receive about 15% more comments and points on average than posts overall.