<a id="1"></a>
# <p style=";font-size:150%;text-align:center;border-radius:10px 10px;">**Hacker News Posts**</p>

# About Dataset

This dataset contains Hacker News posts from the 12 months up to September 26, 2016.


| # | Attribute | Description |
| --- | --- | --- |
|1| id | The unique identifier from Hacker News for the post|
|2|title| The title of the post|
|3|url| The URL that the post links to, if the post has a URL|
|4|num_points| The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes|
|5|num_comments| The number of comments on the post|
|6|author| The username of the person who submitted the post|
|7|created_at| The date and time of the post's submission (the time zone is Eastern Time in the US)|


In [1]:
from csv import reader

In [20]:
# Read the file in as a list of lists; the with block closes the file afterwards
with open("HN_posts_year_to_Sep_26_2016.csv", "r", encoding='utf-8') as file:
# with open("hacker_news.csv", "r", encoding='utf-8') as file:
    data = list(reader(file))

# Display the first five rows 
print(data[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


The first five rows show that the first inner list contains the column headers and that each subsequent list contains the data for one row. Before analyzing the data, we need to remove the header row.

In [21]:
data = data[1:]
# Check that the header row has been removed.
print(data[0:5])

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


Since we are only concerned with submissions that received comments, we need to remove every row whose value in the fifth column (num_comments, index 4) is 0.

In [22]:
# Keep only the rows with at least one comment
data = [row for row in data if int(row[4]) != 0]

Next, as we're only concerned with post titles beginning with *Ask HN* or *Show HN*, we'll create new lists of lists containing just the data for those titles.

In [24]:
# To find the posts that begin with either Ask HN or Show HN, we'll use the string method string.startswith(value, start, end)
ask_posts = []
show_posts = [] 
other_posts = []

for value in data:
    # startswith() is case-sensitive, so we lowercase the title with lower()
    # before comparing it against the lowercase prefixes.
    title = value[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(value)
    elif title.startswith('show hn'):
        show_posts.append(value)
    else:
        other_posts.append(value)

print(' Number of ask posts: ', len(ask_posts))
print(' Number of show posts: ', len(show_posts))
print(' Number of other posts: ', len(other_posts))

 Number of ask posts:  6911
 Number of show posts:  5059
 Number of other posts:  68431
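
The prefix check above can also be wrapped in a small function, which makes the case-insensitive matching easy to verify on its own. The following is a minimal sketch; the name `classify_title` and the sample titles are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical sample titles, not taken from the dataset
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'show hn: a tiny CSV viewer',
    'Algorithmic music',
]

# Count how many titles fall into each category
category_counts = Counter(classify_title(t) for t in sample_titles)
print(category_counts['ask'], category_counts['show'], category_counts['other'])
```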


In [25]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)

print('The total comments on ask posts: ', total_ask_comments)
print('The average number of comments on ask posts: ', avg_ask_comments)

The total comments on ask posts:  94986
The average number of comments on ask posts:  13.744175951381855


In [26]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments/len(show_posts)

print('The total comments on show posts: ', total_show_comments)
print('The average number of comments on show posts: ', avg_show_comments)

The total comments on show posts:  49633
The average number of comments on show posts:  9.810832180272781


**According to the above analysis:** 
  - On average, ask posts receive more comments than show posts (about 13.74 versus 9.81 comments per post).
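
The two averaging cells above repeat the same loop, so the calculation could be factored into one helper. This is a sketch only; `average_comments` and the sample rows are hypothetical, and the rows merely mimic the seven-column layout of the dataset.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same seven-column layout as the dataset
sample_posts = [
    ['1', 'Ask HN: A', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: B', '', '5', '4', 'user_b', '9/26/2016 15:02'],
]
print(average_comments(sample_posts))
```

With the notebook's own lists, the same helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.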

In [27]:
import datetime as dt

In [28]:
result_list = []
for row in ask_posts:
    temp = [row[6], int(row[4])]
    result_list.append(temp)

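Next, we'll build two dictionaries from this list: one counting the ask posts created during each hour of the day, and one totaling the comments those posts received. We'll then use them to calculate the average number of comments per post for each hour.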

In [29]:
counts_by_hour = {}  # number of ask posts created during each hour of the day
comments_by_hour = {}  # total number of comments received by ask posts created during each hour

for row in result_list:
    # created_at looks like '9/26/2016 3:26'; only the hour is needed here
    date, time = row[0].split()
    hour, minutes = time.split(':')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

We now calculate the average number of comments for posts created during each hour of the day and store the results in a list of lists named avg_by_hour.

In [30]:
avg_by_hour = []
for hour, count in counts_by_hour.items():
    avg_by_hour.append([hour,comments_by_hour[hour]/count])
print(avg_by_hour)

[['2', 13.198237885462555], ['1', 9.367713004484305], ['22', 11.749128919860627], ['21', 11.056511056511056], ['19', 9.414285714285715], ['17', 13.73019801980198], ['15', 39.66809421841542], ['14', 13.153439153439153], ['13', 22.2239263803681], ['11', 11.143426294820717], ['10', 13.757990867579908], ['9', 8.392045454545455], ['7', 10.095541401273886], ['3', 10.160377358490566], ['16', 10.76144578313253], ['8', 12.43157894736842], ['0', 9.857142857142858], ['23', 8.322463768115941], ['20', 11.38265306122449], ['18', 10.789823008849558], ['12', 15.452554744525548], ['4', 12.688172043010752], ['6', 9.017045454545455], ['5', 11.139393939393939]]
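
The hourly aggregation above could also be written with `datetime.strptime` and `collections.defaultdict`, which avoids splitting the timestamp by hand. This is a minimal sketch under two assumptions: every `created_at` value follows the `month/day/year hour:minute` pattern seen in the sample rows, and integer hours are acceptable as keys (the notebook uses strings). The function name and sample pairs are hypothetical.

```python
import datetime as dt
from collections import defaultdict

def average_comments_by_hour(pairs):
    """Map each hour (0-23) to the mean comment count of posts created in that hour.

    `pairs` is a list of [created_at, num_comments] items, with created_at
    formatted like '9/26/2016 3:26'.
    """
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        # Parse the full timestamp and keep only the hour as an integer
        hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}

# Hypothetical [created_at, num_comments] pairs
sample_pairs = [['9/26/2016 3:26', 4], ['9/25/2016 3:59', 6], ['9/26/2016 15:02', 10]]
print(average_comments_by_hour(sample_pairs))
```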


In [31]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
sorted_swap = sorted(swap_avg_by_hour,  reverse = True)

In [33]:
sorted_swap

[[39.66809421841542, '15'],
 [22.2239263803681, '13'],
 [15.452554744525548, '12'],
 [13.757990867579908, '10'],
 [13.73019801980198, '17'],
 [13.198237885462555, '2'],
 [13.153439153439153, '14'],
 [12.688172043010752, '4'],
 [12.43157894736842, '8'],
 [11.749128919860627, '22'],
 [11.38265306122449, '20'],
 [11.143426294820717, '11'],
 [11.139393939393939, '5'],
 [11.056511056511056, '21'],
 [10.789823008849558, '18'],
 [10.76144578313253, '16'],
 [10.160377358490566, '3'],
 [10.095541401273886, '7'],
 [9.857142857142858, '0'],
 [9.414285714285715, '19'],
 [9.367713004484305, '1'],
 [9.017045454545455, '6'],
 [8.392045454545455, '9'],
 [8.322463768115941, '23']]
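
Swapping the columns is one way to sort by the average. Another option is to pass a `key` function to `sorted`, which leaves each pair in its original `[hour, average]` order. The sketch below uses a few pairs copied from the `avg_by_hour` output above; the names `sample_avg_by_hour` and `top_hours` are hypothetical.

```python
# A few [hour, average] pairs copied from the avg_by_hour output
sample_avg_by_hour = [
    ['2', 13.198237885462555],
    ['15', 39.66809421841542],
    ['13', 22.2239263803681],
    ['23', 8.322463768115941],
]

# Sort by the second element (the average), highest first
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
for hour, average in top_hours[:2]:
    print('{}:00 -> {:.2f}'.format(hour, average))
```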

In [32]:
print('Top 5 Hours for Ask Posts Comments:')    
for value in sorted_swap[:5]:
    time = dt.datetime.strptime(value[1] + ':00',"%H:%M")
    a = time.strftime("%H:%M")
    string = '{}: {:.2f} average comments per post'
    print(string.format(a, value[0]))

Top 5 Hours for Ask Posts Comments:
15:00: 39.67 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post


**According to the above analysis:**
 - Ask posts created at 15:00 and 13:00 receive the most comments on average (39.67 and 22.22 per post), so these are the best hours to submit an ask post, especially 15:00. By contrast, posts created at 9:00 and 23:00 receive the fewest comments on average (8.39 and 8.32 per post).
 
 - The timestamps in this dataset are in US Eastern Time, so the hours need converting for other time zones. For example, 15:00 Eastern corresponds to 2:00 the next day in Viet Nam (UTC+7) while daylight saving time is in effect in the US, and to 3:00 during standard time.
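
The conversion from Eastern Time to Viet Nam time can be checked with the standard `datetime` module. This is a sketch that assumes fixed UTC offsets: UTC-4 for Eastern Daylight Time (in effect in late September) and UTC+7 for Viet Nam. During US standard time the Eastern offset is UTC-5, so the result shifts by one hour. The constant and function names are hypothetical.

```python
import datetime as dt

# Assumption: US Eastern Daylight Time (UTC-4), which applies in late September.
# During standard time the offset is UTC-5 and the result is one hour later.
EASTERN_DAYLIGHT = dt.timezone(dt.timedelta(hours=-4))
VIET_NAM = dt.timezone(dt.timedelta(hours=7))  # Indochina Time, no daylight saving

def eastern_hour_to_viet_nam(hour):
    """Convert an hour of the day in EDT to the matching hour in Viet Nam."""
    # The date is arbitrary; only the hour of the converted time is used.
    eastern = dt.datetime(2016, 9, 26, hour, tzinfo=EASTERN_DAYLIGHT)
    return eastern.astimezone(VIET_NAM).hour

for hour in (15, 13):
    print('{}:00 Eastern -> {}:00 in Viet Nam'.format(hour, eastern_hour_to_viet_nam(hour)))
```

Fixed offsets keep the example dependent only on the standard library; a time zone database would be needed to handle the daylight saving transitions automatically.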