# Exploring Hacker News Posts

This project is part of my self-directed data engineering study through *Dataquest.io*. It is a guided project that focuses on the technology site **Hacker News**. The goal is to estimate when most people are online by computing the average number of comments on posts created during each hour of the day. With that, I will know what time to post a question on Hacker News to have the best chance of getting answers.

### Read the CSV file into a list of lists
First, I read the data from hacker_news.csv and assign the result to the variable **hn** as a list of lists. Then I display the first five rows (lists) of hn.

In [1]:
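import csv

# Raw string so the backslashes in the Windows path are not read as escape sequences
with open(r'Dataset\Guided Project Hacker New Posts\hacker_news.csv') as opened_csv:
    read_csv = csv.reader(opened_csv)
    hn = list(read_csv)  # the file is closed automatically when the block ends

print(*hn[0:5], sep='\n\n')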

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']

['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


As the output above shows, the first row is the header, so I will separate it from the hn list.

In [2]:
headers = hn[0]
hn = hn[1:]  # keep only the data rows
print(*hn[0:5], sep='\n\n')

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']

['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']

['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']
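
As a minimal sketch of an alternative, the header can also be split off while reading, by taking the first row from the reader with `next()` before collecting the rest. The in-memory CSV below is a hypothetical two-line sample in the same column layout, not the real file, and the names `sample_csv`, `sample_headers` and `sample_rows` are only for illustration.

```python
import csv
import io

# Hypothetical two-row sample in the same column layout as hacker_news.csv
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,Example title,http://example.com,1,0,altstar,9/26/2016 3:26\n"
)

reader = csv.reader(sample_csv)
sample_headers = next(reader)  # consume the first row as the header
sample_rows = list(reader)     # everything that remains is data

print(sample_headers)
print(len(sample_rows))
```

Either approach should give the same two pieces; this one simply avoids building the full list before removing its first element.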


### Filter for the relevant data
Now the hn list is ready to be filtered. I am only interested in posts whose titles begin with **Ask HN** or **Show HN**, so the next step is to split the rows into three new lists of lists.
- ask_posts : posts whose title starts with Ask HN
- show_posts : posts whose title starts with Show HN
- other_posts : all remaining posts, which I won't focus on

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title_lower = title.lower()
    if title_lower.startswith('ask hn'):
        ask_posts.append(row)
    elif title_lower.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("ask hn : "+ str(len(ask_posts)))
print("show hn : "+ str(len(show_posts)))
print("other : "+ str(len(other_posts)))

ask hn : 9139
show hn : 10158
other : 273822
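
Because every row goes into exactly one of the three lists, the counts above should add up to the number of data rows (9139 + 10158 + 273822 = 293119). The sketch below wraps the same prefix test in a hypothetical helper, `split_posts`, and checks that property on a made-up four-row sample. Only the title column matters here.

```python
def split_posts(rows):
    """Split rows into (ask, show, other) by title prefix, ignoring case."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[1].lower()
        if title.startswith('ask hn'):
            ask.append(row)
        elif title.startswith('show hn'):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other

# Hypothetical miniature data set; index 1 is the title, as in the CSV
sample = [
    ['1', 'Ask HN: Example question?'],
    ['2', 'Show HN: Example project'],
    ['3', 'ASK HN: upper-case prefix'],
    ['4', 'An ordinary link'],
]
ask, show, other = split_posts(sample)

# Every row must land in exactly one group
assert len(ask) + len(show) + len(other) == len(sample)
print(len(ask), len(show), len(other))
```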


### Find the average number of comments

Next I will check which type of post (ask or show) gets more attention from the site's members.
First, I will create a function to calculate the average: **find_avg**.

In [4]:
def find_avg(main_list, cal_index):
    total = 0 
    for row in main_list:
        total += int(row[cal_index])
    avg = total/len(main_list)
    return avg

avg_ask_comments = find_avg(ask_posts, 4)
print('Ask posts average comment: '+ format(avg_ask_comments,'.2f')+' comments')
avg_show_comments = find_avg(show_posts, 4)
print('Show posts average comment: '+ format(avg_show_comments,'.2f')+' comments')


Ask posts average comment: 10.39 comments
Show posts average comment: 4.89 comments
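
For comparison, the standard library's `statistics.mean` could be used for the same calculation. This is only a sketch on hypothetical rows (the titles, authors and counts are made up), with the comment count at index 4 as in the real data.

```python
from statistics import mean

# Hypothetical rows; index 4 holds num_comments as a string, as in the CSV
sample_posts = [
    ['1', 'Ask HN: A?', '', '3', '6', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: B?', '', '1', '3', 'user_b', '9/26/2016 1:17'],
    ['3', 'Ask HN: C?', '', '2', '0', 'user_c', '9/25/2016 22:57'],
]

# Convert each comment count to int, then average
avg_comments = mean(int(row[4]) for row in sample_posts)
print(format(avg_comments, '.2f'))
```

One difference to keep in mind: on an empty list `mean` raises `StatisticsError`, while `find_avg` raises `ZeroDivisionError`.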


Ask posts receive more attention from members: their average number of comments is roughly twice that of show posts (10.39 versus 4.89). Since ask posts are more likely to receive comments, I'll focus on them from here on.

### Average comments per ask post by hour of day
In the next step, I'll calculate the number of ask posts and comments by the hour in which each post was created, using the datetime module (imported as **dt**).

In [5]:
import datetime as dt
result_list = []
for row in ask_posts:
    result_list.append([dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M"), int(row[4])])

Above, I created result_list, a list of lists in which each entry contains the creation datetime and the number of comments. After a quick look at the first few entries, I'll loop through this list and, for each hour, count the posts and sum the comments.

In [6]:
print(*result_list[0:4], sep='\n')

[datetime.datetime(2016, 9, 26, 2, 53), 7]
[datetime.datetime(2016, 9, 26, 1, 17), 3]
[datetime.datetime(2016, 9, 25, 22, 57), 0]
[datetime.datetime(2016, 9, 25, 22, 48), 3]
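
To make the format string concrete, here is a small sketch that parses a single `created_at` value in the layout seen in the data. `%m/%d/%Y %H:%M` reads month, day, four-digit year, then hour and minute on a 24-hour clock.

```python
import datetime as dt

created_at = '9/26/2016 3:26'  # layout taken from the created_at column
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

print(parsed)
print(parsed.hour)  # the hour is also available directly as an integer
```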


In [7]:
counts_by_hour = {}
comments_by_hour = {}
for element in result_list:
    hour = element[0].strftime('%H')
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += element[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = element[1]
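
The same tally could be written without the `if`/`else` branch by using `collections.defaultdict`, which starts missing keys at zero. This is a sketch on hypothetical `[datetime, num_comments]` pairs in the same shape as result_list. The names `sample_results`, `counts` and `comments` are only for illustration.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical [datetime, num_comments] pairs in the same shape as result_list
sample_results = [
    [dt.datetime(2016, 9, 26, 2, 53), 7],
    [dt.datetime(2016, 9, 26, 2, 10), 3],
    [dt.datetime(2016, 9, 25, 22, 57), 0],
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # total comments per hour
for created, num_comments in sample_results:
    hour = created.strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```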

I now know the posting frequency for each hour (counts_by_hour) and the total number of comments for each hour (comments_by_hour). Both are dictionaries, so next I will convert them to lists of lists.

In [8]:
def dict_to_list(dict_):
    # Build a [key, value] pair for every entry in the dictionary
    pairs = []
    for key in dict_:
        pairs.append([key, dict_[key]])
    return pairs

counts_by_hour_list = dict_to_list(counts_by_hour)
comments_by_hour_list = dict_to_list(comments_by_hour)

I now have a list of *how many posts were created in each hour* (counts_by_hour_list) and a list of *how many comments those posts received* (comments_by_hour_list), so I can find the average number of comments per post by hour (**avg_by_hour**).

In [9]:
avg_by_hour = []
for element in comments_by_hour_list:
    hour = element[0]
    comment = element[1]
    for element2 in counts_by_hour_list:
        if hour == element2[0]:
            avg_comment = comment/element2[1]
            avg_by_hour.append([avg_comment, hour])
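
Since both dictionaries share the same hour keys, the averages could also be computed directly from them with one lookup per hour, without the nested loop. The sketch below uses hypothetical totals (`sample_counts` and `sample_comments` are made-up stand-ins for the real dictionaries).

```python
# Hypothetical per-hour totals in the same shape as the two dictionaries
sample_counts = {'02': 2, '22': 1, '15': 4}
sample_comments = {'02': 10, '22': 0, '15': 30}

# One [average, hour] pair per hour: total comments divided by number of posts
sample_avg = [
    [sample_comments[hour] / sample_counts[hour], hour]
    for hour in sample_comments
]
print(sample_avg)
```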

Next, I will sort the averages in descending order. Note that the timestamps in this data set are in GMT-4, which is 11 hours behind my local time in Thailand (GMT+7), so I add 11 hours when displaying the results.

In [10]:
avg_by_hour.sort(reverse=True)
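
The sort works because Python compares lists element by element, and the average is stored first in each pair. A sketch of a more explicit variant, on hypothetical pairs, uses `sorted` with a key function that orders by the average only and leaves the original list untouched.

```python
# Hypothetical [average, hour] pairs in the same shape as avg_by_hour
sample_avg = [[5.0, '02'], [0.0, '22'], [7.5, '15']]

# Order by the average (index 0) only, highest first; returns a new list
ranked = sorted(sample_avg, key=lambda pair: pair[0], reverse=True)
print(ranked)
```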

In [11]:
print("Top 5 Hours for Ask Post Comments")
for element in avg_by_hour[0:5]:
    avg_comment = format(element[0], '.2f')
    dt_hour = dt.datetime.strptime(element[1], '%H') + dt.timedelta(hours=11)
    show_str = 'At {time} : {avg} average comments per post.'
    print(show_str.format(time=dt_hour.strftime('%H:%M'), avg = avg_comment))

Top 5 Hours for Ask Post Comments
At 02:00 : 28.68 average comments per post.
At 00:00 : 16.32 average comments per post.
At 23:00 : 12.38 average comments per post.
At 13:00 : 11.14 average comments per post.
At 21:00 : 10.68 average comments per post.
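
The 11-hour shift can also be expressed with explicit UTC offsets instead of a hand-computed `timedelta`. This is a sketch assuming fixed offsets of GMT-4 for the data set and GMT+7 for Thailand, as stated above. The date is arbitrary and `source_tz` and `local_tz` are illustrative names. It shows that the top hour, 02:00 in my local time, corresponds to 15:00 in the data set's time zone.

```python
import datetime as dt

source_tz = dt.timezone(dt.timedelta(hours=-4))  # offset stated for the data set
local_tz = dt.timezone(dt.timedelta(hours=7))    # Thailand, GMT+7

# The date is arbitrary; only the hour matters for this illustration
posted = dt.datetime(2016, 9, 26, 15, 0, tzinfo=source_tz)
local = posted.astimezone(local_tz)

print(local.strftime('%H:%M'))
```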
