# Exploring Hacker News Posts
## This project explores Hacker News posts to determine whether the community leaves more comments, on average, on posts that ask a question (Ask HN) or on user-submitted posts that showcase a project, product, or something generally interesting (Show HN).

In [2]:
from csv import reader

# Open the file in a with-block so it is closed once the rows have been read
with open('hacker_news.csv', encoding='utf-8') as open_file:
    read_file = reader(open_file)
    hn = list(read_file)

print(hn[:5]) #Displays the first 5 rows of the list to determine what kind of data we are dealing with.

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


After confirming that the first row contains the header and the subsequent rows contain the actual data, we remove the first row from the list 'hn' and store it in its own variable, 'headers'.

In [3]:
headers = hn[0]
hn = hn[1:]

print(hn[:5]) # re-displaying the list to make sure we have eliminated the header row

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


In [4]:
print(headers) #prints the header of the list of data

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
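The cells that follow read columns by position (for example `row[1]` for the title and `row[4]` for the number of comments). As a minimal sketch, those positions can be looked up from the header row instead of being memorized. The name `column_index` is an assumption introduced here for illustration; the sample row is copied from the output above.

```python
# Header row as printed above
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper mapping: column name -> position in each row
column_index = {name: position for position, name in enumerate(headers)}

# First data row, copied from the printed output
sample_row = ['12579008',
              'You have two days to comment if you want stem cells to be classified as your own',
              'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
              '1', '0', 'altstar', '9/26/2016 3:26']

print(column_index['title'])                    # 1
print(column_index['num_comments'])             # 4
print(sample_row[column_index['created_at']])   # 9/26/2016 3:26
```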


With the header separated from the data, we sort each post into one of three categories based on its title: Ask Posts, Show Posts, and Other Posts. A post whose title begins with "ask hn" is an Ask Post, one whose title begins with "show hn" is a Show Post, and any post that matches neither prefix goes into the Other Posts category. The comparison is case-insensitive.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'): # title.lower() standardizes the title so the prefix match is case-insensitive
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("# of Ask hn Posts:", len(ask_posts))
print("# of Show hn Posts:", len(show_posts))
print("# of Other hn Posts:", len(other_posts))

# of Ask hn Posts: 9139
# of Show hn Posts: 10158
# of Other hn Posts: 273822


After running the filter through the whole list of data, we can see there are 9139 posts that are "ask posts", 10158 posts that are "show posts" and 273822 posts that are neither "ask" nor "show" posts. 

Having separated the lists of Ask HN posts and Show HN posts, we then are able to figure out the average number of comments for each of these Post categories.
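The averaging step is the same for both categories, so it could be written once as a small helper. The sketch below is only an illustration: the function name `average_comments` and the two sample rows are hypothetical, and the only thing taken from the dataset is that `num_comments` sits at index 4.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, rounded to 2 decimals.

    `posts` is a list of rows shaped like the rows in `hn`;
    `comments_index` is the position of the num_comments column.
    """
    if not posts:
        # Avoid dividing by zero when a category is empty
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return round(total / len(posts), 2)


# Hypothetical rows, shaped like the dataset, used only to exercise the helper
sample_posts = [
    ['1', 'Ask HN: example A', '', '1', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: example B', '', '1', '3', 'user_b', '9/26/2016 1:17'],
]

print(average_comments(sample_posts))  # 5.0
print(average_comments([]))            # 0.0
```

With the real lists, the same call would be `average_comments(ask_posts)` and `average_comments(show_posts)`.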

In [7]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = round(total_ask_comments / len(ask_posts),2)

print("Average # of comments per Ask post:", avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = round(total_show_comments / len(show_posts),2)

print("Average # of comments per Show post:", avg_show_comments)

Average # of comments per Ask post: 10.39
Average # of comments per Show post: 4.89


As we can see, ask posts receive more comments on average than show posts (10.39 versus 4.89). Moving forward, we will focus our analysis on ask posts, since they are more likely to receive comments.

The next step is to determine whether ask posts created at a certain time of day are more likely to attract comments. To do this, we will count the number of ask posts created in each hour of the day and work out the average number of comments per post for each hour.

In [13]:
import datetime as dt

result_list = []

# Build a list of lists, each with two elements: the date/time the post was created and its number of comments
for ask in ask_posts:
    created_at = ask[6]
    num_comments = int(ask[4])
    result_list.append([created_at, num_comments]) 

print(result_list[:5]) #printed the first 5 lines for verification



[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:57', 0], ['9/25/2016 22:48', 3], ['9/25/2016 21:50', 2]]
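Before grouping by hour, it may help to see how a single timestamp string is turned into an hour label. The sketch below parses the first `created_at` value from the output above; the variable names `sample`, `parsed`, and `hour_label` are assumptions used only for this illustration.

```python
import datetime as dt

# First created_at value from the output above
sample = '9/26/2016 2:53'

# Parse month/day/year hour:minute into a datetime object
parsed = dt.datetime.strptime(sample, "%m/%d/%Y %H:%M")

# Format only the hour as a zero-padded two-digit string
hour_label = parsed.strftime("%H")

print(hour_label)    # 02
print(parsed.hour)   # 2
```

Using the zero-padded string keeps the dictionary keys sortable as text, so '02' sorts before '10'.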


In [20]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_time = row[0]
    date_time = dt.datetime.strptime(date_time, "%m/%d/%Y %H:%M")
    post_time = date_time.strftime("%H")

    if post_time not in counts_by_hour:
        counts_by_hour[post_time] = 1
        comments_by_hour[post_time] = int(row[1])
    else:
        counts_by_hour[post_time] += 1
        comments_by_hour[post_time] += int(row[1])

sorted_counts_by_hour = {}
sorted_comments_by_hour ={}

sorted_list = sorted(counts_by_hour)

for each_hour in sorted_list:
    sorted_counts_by_hour[each_hour] = counts_by_hour[each_hour]
    sorted_comments_by_hour[each_hour] = comments_by_hour[each_hour]

print(sorted_counts_by_hour)
print(sorted_comments_by_hour)

{'00': 301, '01': 282, '02': 269, '03': 271, '04': 243, '05': 209, '06': 234, '07': 226, '08': 257, '09': 222, '10': 282, '11': 312, '12': 342, '13': 444, '14': 513, '15': 646, '16': 579, '17': 587, '18': 614, '19': 552, '20': 510, '21': 518, '22': 383, '23': 343}
{'00': 2277, '01': 2089, '02': 2996, '03': 2154, '04': 2360, '05': 1838, '06': 1587, '07': 1585, '08': 2362, '09': 1477, '10': 3013, '11': 2797, '12': 4234, '13': 7245, '14': 4972, '15': 18525, '16': 4466, '17': 5547, '18': 4877, '19': 3954, '20': 4462, '21': 4500, '22': 3372, '23': 2297}
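The `if ... not in ... else` pattern above could also be written with `collections.defaultdict`, which starts every missing key at zero. This is an alternative sketch and not the approach used in this notebook; it runs on the five sample rows printed earlier, and the names `sample_results`, `counts`, and `comments` are assumptions.

```python
import datetime as dt
from collections import defaultdict

# The five [created_at, num_comments] pairs printed earlier
sample_results = [['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3],
                  ['9/25/2016 22:57', 0], ['9/25/2016 22:48', 3],
                  ['9/25/2016 21:50', 2]]

counts = defaultdict(int)     # posts per hour, missing hours default to 0
comments = defaultdict(int)   # total comments per hour

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(sorted(counts.items())))    # {'01': 1, '02': 1, '21': 1, '22': 2}
print(dict(sorted(comments.items())))  # {'01': 3, '02': 7, '21': 2, '22': 3}
```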


In [21]:
#Calculating the average number of comments per post for each hour of the day

avg_by_hour = []
for each_hour in sorted_list:
    avg_comment = comments_by_hour[each_hour] / counts_by_hour[each_hour]
    avg_by_hour.append([each_hour, avg_comment])

print(avg_by_hour)

[['00', 7.5647840531561465], ['01', 7.407801418439717], ['02', 11.137546468401487], ['03', 7.948339483394834], ['04', 9.7119341563786], ['05', 8.794258373205741], ['06', 6.782051282051282], ['07', 7.013274336283186], ['08', 9.190661478599221], ['09', 6.653153153153153], ['10', 10.684397163120567], ['11', 8.96474358974359], ['12', 12.380116959064328], ['13', 16.31756756756757], ['14', 9.692007797270955], ['15', 28.676470588235293], ['16', 7.713298791018998], ['17', 9.449744463373083], ['18', 7.94299674267101], ['19', 7.163043478260869], ['20', 8.749019607843136], ['21', 8.687258687258687], ['22', 8.804177545691905], ['23', 6.696793002915452]]
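A list of `[hour, average]` pairs like this one can also be ranked directly by passing a `key` function to `sorted`, which avoids reordering the columns first. The sketch below is an alternative for illustration only; it uses a handful of values from the output above, rounded to two decimals, and the name `ranked` is an assumption.

```python
# A few [hour, average] pairs from the output above, rounded to 2 decimals
avg_by_hour = [['00', 7.56], ['15', 28.68], ['13', 16.32], ['09', 6.65], ['12', 12.38]]

# Sort by the second element (the average), highest first
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
# 15:00: 28.68 average comments per post
# 13:00: 16.32 average comments per post
# 12:00: 12.38 average comments per post
```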


In [37]:
# Swap the positions of the hour and the average so that sorted() orders the pairs by average, from highest to lowest
swap_avg_by_hour = []

for key in avg_by_hour:
    swap_avg_by_hour.append([key[1],key[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

#Printing the "Top 5 Hours for Ask Posts Comments"

print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    hour_object = dt.datetime.strptime(row[1], "%H")
    hour_object = hour_object.strftime("%H:%M")
    result_string = "{}: {:.2f} average comments per post".format(hour_object, row[0])
    print(result_string)

print("Worst 5 Hours for Ask Posts Comments")

# Take the last five entries; the slice [-6:-1] would skip the lowest average
for row in sorted_swap[-5:]:
    hour_object = dt.datetime.strptime(row[1], "%H")
    hour_object = hour_object.strftime("%H:%M")
    result_string = "{}: {:.2f} average comments per post".format(hour_object, row[0])
    print(result_string)



Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
Worst 5 Hours for Ask Posts Comments
19:00: 7.16 average comments per post
07:00: 7.01 average comments per post
06:00: 6.78 average comments per post
23:00: 6.70 average comments per post
09:00: 6.65 average comments per post


In conclusion, based on the given data, Ask Posts submitted at 3 PM generate the most comments on average (28.68 per post), while Ask Posts submitted at 9 AM generate the fewest (6.65 per post), closely followed by those submitted at 11 PM (6.70 per post).