# Parsing Hacker News Posts
## The main goal of this project is to analyze posts from the well-known website *Hacker News* and find differences between ask posts and show posts.
### First, we open the file that records the posts and read it into a list of lists:

In [22]:
from csv import reader

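# Read the whole file into a list of rows; the with block closes the file afterwards
with open('D:/PROYECTOS_PYTHON/HN_posts_year_to_Sep_26_2016.csv', encoding="utf8") as open_file:
    hn = list(reader(open_file))

# Separate the header row from the data rows
hn_header = hn[0]
hn = hn[1:]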

print(hn_header)
print(hn[:5])
print(hn[46])
print(hn[555])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
['12578430', 'Wi
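
The cells that follow address columns by position (`row[1]` for the title, `row[4]` for `num_comments`, `row[6]` for `created_at`). As an illustrative alternative, not the approach used in this notebook, `csv.DictReader` lets each row be read by column name. The sketch below uses an in-memory sample built from two rows of the output above; `sample_csv` and `sample_rows` are names introduced only for this example.

```python
from csv import DictReader
from io import StringIO

# Two rows copied from the printed output above, standing in for the real file.
sample_csv = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

# DictReader treats the first line as the header and yields one mapping per row.
sample_rows = list(DictReader(sample_csv))

for row in sample_rows:
    # Values are still strings, so numeric columns need int() before arithmetic.
    print(row['title'], '|', int(row['num_comments']), '|', row['created_at'])
```

With this layout the later steps could refer to `row['num_comments']` instead of `row[4]`, at the cost of slightly more memory per row.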

#### The next step is to separate the posts into three lists: ask posts, show posts, and other posts:

In [11]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of ask posts: ', len(ask_posts))
print('Number of show posts: ', len(show_posts))
print('Number of other posts: ', len(other_posts))

Number of ask posts:  9139
Number of show posts:  10158
Number of other posts:  273822
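
The same prefix test can be wrapped in a small function and tallied with `collections.Counter`, which is handy when only the counts are needed. This is a minimal sketch on made-up titles; `classify_title` and `sample_titles` are hypothetical names that do not appear in the notebook.

```python
from collections import Counter


def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles, for illustration only.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'Show HN: A tiny CSV viewer',
    'ask hn: best books on statistics?',
    'algorithmic music',
]

# Count how many titles fall into each category.
counts = Counter(classify_title(title) for title in sample_titles)
print(counts['ask'], counts['show'], counts['other'])
```

Lowercasing before the comparison, as the loop above also does, makes the test insensitive to capitalization such as `ask hn` versus `Ask HN`.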


### Now, we will compute the total and average number of comments for each type of post:

In [18]:
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments

avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments

avg_show_comments = total_show_comments / len(show_posts)

print('Total of comments in ask posts: ', total_ask_comments)
print('Average of comments in ask posts: ',round(avg_ask_comments, 2))
print('\n')
print('Total of comments in show posts: ', total_show_comments)
print('Average of comments in show posts: ',round(avg_show_comments, 2))

Total of comments in ask posts:  94986
Average of comments in ask posts:  10.39


Total of comments in show posts:  49633
Average of comments in show posts:  4.89


### We can observe that ask posts receive more comments on average (10.39 per post) than show posts (4.89 per post), which is more than twice as many. The two types have a similar number of posts (9,139 ask posts and 10,158 show posts), and ask posts collect nearly twice as many comments in total (94,986 versus 49,633).
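
A quick arithmetic check of the figures printed above, keeping the ratio of the per-post averages separate from the ratio of the totals. The numbers are copied from the earlier outputs; the variable names `num_ask_posts`, `num_show_posts`, `avg_ask` and `avg_show` are introduced only for this sketch.

```python
# Totals and post counts copied from the outputs above.
total_ask_comments = 94986
total_show_comments = 49633
num_ask_posts = 9139
num_show_posts = 10158

# Average comments per post for each type.
avg_ask = total_ask_comments / num_ask_posts
avg_show = total_show_comments / num_show_posts

print('Averages:', round(avg_ask, 2), round(avg_show, 2))
print('Ratio of averages:', round(avg_ask / avg_show, 2))
print('Ratio of totals:', round(total_ask_comments / total_show_comments, 2))
```

The two ratios differ because there are slightly more show posts than ask posts, so the per-post gap is wider than the gap in totals.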

#### Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

### Next, we'll determine if ask posts created at a certain time are more likely to attract comments.

In [52]:
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    comments = int(row[4])
    result_list.append([created_at, comments])
    
counts_by_hour = {}
comments_by_hour = {}

dateformat = '%m/%d/%Y %H:%M'
for row in result_list:
    created_at = row[0]
    num_comments = row[1]
    # Parse the timestamp and keep only the zero-padded hour, e.g. '15'
    h = dt.datetime.strptime(created_at, dateformat)
    h_time = h.strftime('%H')
    
    if h_time not in counts_by_hour:
        counts_by_hour[h_time] = 1
        comments_by_hour[h_time] = num_comments
    else:
        counts_by_hour[h_time] += 1
        comments_by_hour[h_time] += num_comments

print('Number of posts by hour: ')
display(counts_by_hour)
print('Number of comments by hour: ')
display(comments_by_hour)

Number of posts by hour: 


{'02': 269,
 '01': 282,
 '22': 383,
 '21': 518,
 '19': 552,
 '17': 587,
 '15': 646,
 '14': 513,
 '13': 444,
 '11': 312,
 '10': 282,
 '09': 222,
 '07': 226,
 '03': 271,
 '23': 343,
 '20': 510,
 '16': 579,
 '08': 257,
 '00': 301,
 '18': 614,
 '12': 342,
 '04': 243,
 '06': 234,
 '05': 209}

Number of comments by hour: 


{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}

In [53]:
avg_by_hour = []
for hour_time in counts_by_hour:
    avg_by_hour.append([hour_time, round(comments_by_hour[hour_time] / counts_by_hour[hour_time], 2)])

print('Average of comments by post (by hour): ')
display(avg_by_hour)

Average of comments by post (by hour): 


[['02', 11.14],
 ['01', 7.41],
 ['22', 8.8],
 ['21', 8.69],
 ['19', 7.16],
 ['17', 9.45],
 ['15', 28.68],
 ['14', 9.69],
 ['13', 16.32],
 ['11', 8.96],
 ['10', 10.68],
 ['09', 6.65],
 ['07', 7.01],
 ['03', 7.95],
 ['23', 6.7],
 ['20', 8.75],
 ['16', 7.71],
 ['08', 9.19],
 ['00', 7.56],
 ['18', 7.94],
 ['12', 12.38],
 ['04', 9.71],
 ['06', 6.78],
 ['05', 8.79]]
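
The two dictionaries and the averaging step can also be combined into one pass with `collections.defaultdict`, which removes the need for the `if h_time not in ...` branch. This is a sketch on made-up data in the same `[created_at, num_comments]` format as `result_list`; `sample_results`, `totals_by_hour` and `sample_avg_by_hour` are hypothetical names.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the same format as result_list.
sample_results = [
    ['9/26/2016 3:26', 4],
    ['9/26/2016 3:50', 6],
    ['8/16/2016 15:05', 30],
    ['8/17/2016 15:40', 10],
]

# Maps hour -> [number of posts, total comments]; missing hours start at [0, 0].
totals_by_hour = defaultdict(lambda: [0, 0])
for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    totals_by_hour[hour][0] += 1
    totals_by_hour[hour][1] += num_comments

# Average comments per post for each hour, rounded to two decimals.
sample_avg_by_hour = {
    hour: round(total / count, 2)
    for hour, (count, total) in totals_by_hour.items()
}
print(sample_avg_by_hour)
```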

### We sort and display the results so that they are easier to read:

In [73]:
swap_avg_by_hour = []

for row in avg_by_hour:
    first = row[1]
    second = row[0]
    swap_avg_by_hour.append([first, second])


print(swap_avg_by_hour)
print('\n')


sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for Ask Posts Comments: ')
print('\n')

for row in sorted_swap[:5]:
    timeformat = "%H"
    hour = row[1]
    avg = row[0]
    hour_oo = dt.datetime.strptime(hour, timeformat)
    hour_string = hour_oo.strftime("%H:00")
    print('{h}: {a:.2f} average comments per post'.format(h=hour_string, a=avg))

[[11.14, '02'], [7.41, '01'], [8.8, '22'], [8.69, '21'], [7.16, '19'], [9.45, '17'], [28.68, '15'], [9.69, '14'], [16.32, '13'], [8.96, '11'], [10.68, '10'], [6.65, '09'], [7.01, '07'], [7.95, '03'], [6.7, '23'], [8.75, '20'], [7.71, '16'], [9.19, '08'], [7.56, '00'], [7.94, '18'], [12.38, '12'], [9.71, '04'], [6.78, '06'], [8.79, '05']]


Top 5 Hours for Ask Posts Comments: 


15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
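
Swapping the columns works because `sorted` compares lists element by element. An alternative is to pass a `key` function, so the `[hour, average]` pairs can be sorted in place of a swapped copy. The sketch below uses a few pairs copied from `avg_by_hour`; `sample_avg_by_hour` and `top_hours` are names introduced only for this example.

```python
# A few [hour, average] pairs copied from avg_by_hour above.
sample_avg_by_hour = [['02', 11.14], ['15', 28.68], ['13', 16.32], ['09', 6.65]]

# Sort by the average (second element), highest first, without a swapped list.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print('{h}:00: {a:.2f} average comments per post'.format(h=hour, a=avg))
```

Because the hour is already a zero-padded string such as `'02'`, it can be formatted directly, without a round trip through `strptime` and `strftime`.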


# Conclusions:

### Our results show that ask posts created between 15:00 and 16:00 receive the most comments on average (28.68 per post). In general, the best hours for posting are around midday and the early afternoon (15:00, 13:00, 12:00), possibly because users are more available to answer these posts later in the day. However, we cannot confirm this from the data: to find out, we would need to analyze how long posts take to be answered. Another possible explanation is that during these hours both American and European users are active, and together they probably make up most of the user base.
### The second most popular period for a post to receive comments is the early morning and the morning (02:00, 10:00), perhaps because the morning is another convenient time to answer.