# Exploring Hacker News Posts
The purpose of this project is to compare
1. the average number of comments per post
2. the average number of comments per post by hour of creation

between two types of Hacker News posts, Ask HN and Show HN.
- Ask HN: ask the Hacker News community a specific question
- Show HN: show the Hacker News community a project, product, or something interesting

The data set was reduced to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
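
The reduction described above could be sketched roughly as follows. This is only an illustration, not the procedure actually used to prepare `hacker_news.csv`: the helper name `reduce_dataset`, the sample size, the seed, and the toy rows are all assumptions made for the example.

```python
import random

def reduce_dataset(rows, sample_size, seed=0):
    """Drop rows with zero comments, then draw a random sample.

    `rows` is a list of lists shaped like the CSV rows used in this
    notebook, with the comment count stored as a string at index 4.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed (assumed) so the sample is repeatable
    # Never ask for more rows than are available.
    return rng.sample(commented, min(sample_size, len(commented)))

# Made-up rows for illustration: 3 of these 4 have at least one comment.
toy_rows = [
    ['1', 'Ask HN: A?', '', '5', '3', 'a', '1/1/2016 10:00'],
    ['2', 'Show HN: B', '', '2', '0', 'b', '1/1/2016 11:00'],
    ['3', 'C', '', '9', '7', 'c', '1/1/2016 12:00'],
    ['4', 'D', '', '1', '1', 'd', '1/1/2016 13:00'],
]
sample = reduce_dataset(toy_rows, sample_size=2)
print(len(sample))
```

Filtering before sampling means every sampled row has at least one comment, which matches the description of the data set.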

In [24]:
# Open the data file and store its rows in a list
from csv import reader
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
header = hn[0]  # separate the header row from the data
hn = hn[1:]
print(header)
for row in hn[:3]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


## Classify each type of post

In [25]:
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    title = post[1].lower() #set all characters to lowercase
    if title.startswith('ask hn'):
        ask_posts.append(post) #add post which starts with 'ask hn' to list
    elif title.startswith('show hn'):
        show_posts.append(post) #add post which starts with 'show hn' to list
    else:
        other_posts.append(post)
print('Number of ask posts : ',len(ask_posts))
print('Number of show posts : ',len(show_posts))
print('Number of other posts : ',len(other_posts))

Number of ask posts :  1744
Number of show posts :  1162
Number of other posts :  17194


## Calculate the average number of comments per post


In [32]:
#ask post
total_ask_comment = 0
for post in ask_posts:
    num_comment = int(post[4])
    total_ask_comment += num_comment
avg_ask_comments = total_ask_comment/len(ask_posts)
print('Average of comments from ask posts : ',avg_ask_comments)

#show post
total_show_comment = 0
for post in show_posts:
    num_comment = int(post[4])
    total_show_comment += num_comment
avg_show_comments = total_show_comment/len(show_posts)
print('Average of comments from show posts : ',avg_show_comments)

Average of comments from ask posts :  14.038417431192661
Average of comments from show posts :  10.31669535283993


From the calculation above, Ask HN posts receive more comments on average than Show HN posts (about 14.04 versus 10.32). From here on, we analyze only Ask HN posts to find the hour of creation that attracts the most comments.
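
The two averaging loops above repeat the same logic, so they could be folded into one small helper. The following is a minimal sketch; the function name `average_comments`, the choice to return 0.0 for an empty list, and the toy rows are assumptions for illustration and are not part of the original notebook.

```python
def average_comments(posts, comment_index=4):
    """Return the mean number of comments across `posts`."""
    if not posts:
        return 0.0  # assumption: treat an empty group as 0 rather than raising
    total = sum(int(post[comment_index]) for post in posts)
    return total / len(posts)

# Made-up rows in the same column layout as the data set.
toy_posts = [
    ['1', 'Ask HN: A?', '', '5', '3', 'a', '1/1/2016 10:00'],
    ['2', 'Ask HN: B?', '', '2', '9', 'b', '1/1/2016 11:00'],
]
print(average_comments(toy_posts))  # 6.0
```

With such a helper, the comparison would reduce to calling `average_comments(ask_posts)` and `average_comments(show_posts)`.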

# Ask Posts

## Count posts and comments by hour of creation

This part builds two dictionaries keyed by hour of creation: one counts the posts created in each hour, and the other totals the comments those posts received.

In [27]:
import datetime as dt
result_list = []
for post in ask_posts:
    created_at = post[6]
    comment = int(post[4])
    # store each post's creation time and comment count as a pair
    result_list.append([created_at, comment])
counts_by_hour = {}
comments_by_hour = {}
for result in result_list:
    # use datetime.strptime to parse the string into a datetime object
    # (month, day, year, hour, minute)
    dt_result = dt.datetime.strptime(result[0],"%m/%d/%Y %H:%M")
    # use strftime to extract only the hour as a two-digit string
    hours = dt_result.strftime("%H")
    # update the post count and comment total for this hour
    if hours not in counts_by_hour:
        counts_by_hour[hours] = 1
        comments_by_hour[hours] = result[1]
    else:
        counts_by_hour[hours] += 1
        comments_by_hour[hours] += result[1]
print('Dictionary of the number of posts created, by hour')
print(counts_by_hour)
print('Dictionary of the number of comments, by hour of creation')
print(comments_by_hour)

Dictionary of the number of posts created, by hour
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
Dictionary of the number of comments, by hour of creation
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
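
As an alternative to the explicit `if hours not in ...` check, the same tally could be written with `collections.defaultdict`, which supplies a starting value of 0 for unseen keys. This is a sketch under assumptions: the function name `tally_by_hour` and the toy rows are invented for the example, and only the date format is taken from the cell above.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(posts, date_index=6, comment_index=4):
    """Return (post counts, comment totals), both keyed by two-digit hour."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for post in posts:
        created = dt.datetime.strptime(post[date_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        counts[hour] += 1
        comments[hour] += int(post[comment_index])
    return dict(counts), dict(comments)

# Made-up rows: two posts created in hour 11 and one in hour 19.
toy_posts = [
    ['1', 'Ask HN: A?', '', '5', '52', 'a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '2', '3', 'b', '8/4/2016 11:05'],
    ['3', 'Ask HN: C?', '', '9', '10', 'c', '1/26/2016 19:30'],
]
counts, comments = tally_by_hour(toy_posts)
print(counts)
print(comments)
```

Parsing the date directly inside the loop also removes the need for the intermediate `result_list`.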


## Calculate the average number of comments per post by hour of creation

This part uses the two dictionaries created above to calculate the average number of comments per post for each hour.

In [28]:
avg_by_hour = []
for hour in counts_by_hour:  # each key is a two-digit hour string
    post_num = counts_by_hour[hour]
    comment_num = comments_by_hour[hour]
    avg = comment_num/post_num
    avg_by_hour.append([hour,avg])
print('List of the average of comments from total posts by hour created')
print(avg_by_hour)

List of the average of comments from total posts by hour created
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
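
The per-hour division can be checked by hand on a couple of entries. The sketch below copies the totals for hours 15 and 02 from the output above into small dictionaries (the names `counts_sample`, `comments_sample`, and `avg_by_hour_sample` are hypothetical) and computes the same averages with a dictionary comprehension.

```python
# Values copied from the two dictionaries printed earlier (hours 15 and 02 only).
counts_sample = {'15': 116, '02': 58}
comments_sample = {'15': 4477, '02': 1381}

# Average comments per post for each hour: total comments / number of posts.
avg_by_hour_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in counts_sample
}
for hour, avg in avg_by_hour_sample.items():
    print(hour, round(avg, 2))
```

Rounded to two decimals, these agree with the 38.59 and 23.81 reported for those hours.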


## Sort the average number of comments per post by hour of creation

In [29]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
sorted_swap = sorted(swap_avg_by_hour,reverse=True)
print('Top 5 of the average of comments :')
for row in sorted_swap[:5]:
    print(row)

Top 5 of the average of comments :
[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']


Alternatively, the same results can be printed in a format that is easier to read:

In [31]:
print('Top 5 of the average of comments :')
for row in sorted_swap[:5]:
    template = '{}: {:.2f} average comments per post.'
    time = dt.datetime.strptime(row[1],'%H')
    time = time.strftime('%H:%M')
    avg = row[0]
    print(template.format(time,avg))

Top 5 of the average of comments :
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.
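
The swap step is one way to sort by average. Another option, sketched here, is to pass a `key` function to `sorted` so the `[hour, average]` pairs keep their original order. The list `avg_sample` holds rounded values taken from the output above and exists only for this illustration.

```python
# A few [hour, average] pairs, rounded from the results above.
avg_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.53]]

# Sort by the average (index 1), largest first, and keep the top three.
top = sorted(avg_sample, key=lambda row: row[1], reverse=True)[:3]

for hour, avg in top:
    print('{}:00: {:.2f} average comments per post.'.format(hour, avg))
```

Sorting with a key avoids building a second list and keeps the hour as the first element of every row.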


# Conclusion

Based on our analysis, to maximize the number of comments a post receives, we recommend making it an Ask HN post and creating it between 15:00 and 16:00, the hour with the highest average (about 38.59 comments per post).