## Exploring Hacker News Posts

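Hacker News is an online community similar to Reddit, popular in technology and startup circles.

When users want to ask the community a question, they begin the post title with `Ask HN`, for example: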

`Ask HN: How to improve my personal website?`

When they want to show something to the community, they begin the title with `Show HN`, for example:

`Show HN: Something pointless I made`

The questions this project sets out to answer are:

- Do `Ask HN` or `Show HN` receive more comments on average?

- Do posts created at a certain time receive more comments on average?

Data source: [Hacker News Posts](https://www.kaggle.com/hacker-news/hacker-news-posts). The dataset has the following columns:

- id: the unique identifier from Hacker News for the post
- title: title of the post (self explanatory)
- url: the url of the item being linked to
- num_points: the number of upvotes the post received
- num_comments: the number of comments the post received
- author: the name of the account that made the post
- created_at: the date and time the post was made (the time zone is Eastern Time in the US)

In [1]:
from csv import reader

# Read the whole CSV into a list of rows; the first row is the header.
with open("hacker_news.csv") as open_file:
    hn = list(reader(open_file))
hn[0]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


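The `str.startswith()` method checks whether a string begins with a given prefix and returns a Boolean. We use it to find the posts whose titles start with `Ask HN` or `Show HN`.

`str.startswith()` is case-sensitive, so we first convert each title to lowercase with `lower()` to avoid missing any posts.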
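To see why the lowercasing step matters, here is a minimal sketch. The titles are made up for illustration and are not rows from the dataset.

```python
# Hypothetical titles for illustration only; they are not taken from the dataset.
titles = [
    "Ask HN: How to improve my personal website?",
    "ask hn: any book recommendations?",
    "Show HN: Something pointless I made",
]

# A case-sensitive check misses the all-lowercase variant.
strict_matches = [t for t in titles if t.startswith("Ask HN")]

# Lowercasing the title first catches both spellings.
relaxed_matches = [t for t in titles if t.lower().startswith("ask hn")]

print(len(strict_matches))
print(len(relaxed_matches))
```

The second count is larger because the lowercase title is only matched after normalization.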

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [4]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [7]:
total_ask_comments = 0
for row in ask_posts:
    num_cmnts = float(row[4])
    total_ask_comments += num_cmnts
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    num_cmnts = float(row[4])
    total_show_comments += num_cmnts
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


`Ask HN` posts receive more comments on average (about 14.04 per post, versus about 10.32 for `Show HN`).
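The two averaging loops above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that. The name `average_comments` and the sample rows are hypothetical, and it assumes the comment count sits at column index 4 as in this dataset.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column layout as the dataset, for illustration only.
sample_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: example two', '', '3', '4', 'user_b', '1/2/2016 11:30'],
]

print(average_comments(sample_posts))
```

With the real data, the same helper could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.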

---

In [37]:
import datetime as dt

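def num_cmnts(dataset):
    """Return [created_at, num_comments] pairs plus per-hour post and comment totals."""
    # Keep only the creation time (column 6) and the comment count (column 4).
    result_list = []
    for row in dataset:
        created_at = row[6]
        num = row[4]
        result_list.append([created_at, num])

    counts_by_hour = {}    # hour -> number of posts created in that hour
    comments_by_hour = {}  # hour -> total comments on those posts
    for row in result_list:
        num = int(row[1])
        created = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
        row[0] = created  # replace the raw string with the parsed datetime
        hour = created.hour
        if hour in counts_by_hour:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += num
        else:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = num

    return result_list, counts_by_hour, comments_by_hour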

In [56]:
result_list, counts_by_hour, comments_by_hour = num_cmnts(ask_posts)

In [57]:
comments_by_hour

{9: 251,
 13: 1253,
 10: 793,
 14: 1416,
 16: 1814,
 23: 543,
 12: 687,
 17: 1146,
 15: 4477,
 21: 1745,
 20: 1722,
 2: 1381,
 18: 1439,
 3: 421,
 5: 464,
 19: 1188,
 1: 683,
 22: 479,
 8: 492,
 4: 337,
 0: 447,
 6: 397,
 7: 267,
 11: 641}

In [58]:
avg_by_hour = []
for k in counts_by_hour:
    counts = counts_by_hour[k]
    avg_by_hour.append([k, comments_by_hour[k]/counts])
sorted(avg_by_hour)

[[0, 8.127272727272727],
 [1, 11.383333333333333],
 [2, 23.810344827586206],
 [3, 7.796296296296297],
 [4, 7.170212765957447],
 [5, 10.08695652173913],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [8, 10.25],
 [9, 5.5777777777777775],
 [10, 13.440677966101696],
 [11, 11.051724137931034],
 [12, 9.41095890410959],
 [13, 14.741176470588234],
 [14, 13.233644859813085],
 [15, 38.5948275862069],
 [16, 16.796296296296298],
 [17, 11.46],
 [18, 13.20183486238532],
 [19, 10.8],
 [20, 21.525],
 [21, 16.009174311926607],
 [22, 6.746478873239437],
 [23, 7.985294117647059]]

In [59]:
swap_avg_by_hour = []
for row in avg_by_hour:
    avg = row[1]
    hour = row[0]
    swap_avg_by_hour.append([avg, hour])
    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)   
print(sorted_swap)

[[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [13.20183486238532, 18], [11.46, 17], [11.383333333333333, 1], [11.051724137931034, 11], [10.8, 19], [10.25, 8], [10.08695652173913, 5], [9.41095890410959, 12], [9.022727272727273, 6], [8.127272727272727, 0], [7.985294117647059, 23], [7.852941176470588, 7], [7.796296296296297, 3], [7.170212765957447, 4], [6.746478873239437, 22], [5.5777777777777775, 9]]
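Swapping the columns is one way to sort by the average. An alternative is to pass a `key` function to `sorted()` and leave the `[hour, average]` pairs as they are. The sketch below uses a small sample with values rounded from the output above. The names `sample_avg_by_hour` and `top_hours` are hypothetical.

```python
# A few [hour, average] pairs, rounded from the notebook output for illustration.
sample_avg_by_hour = [[2, 23.81], [9, 5.58], [15, 38.59], [20, 21.52]]

# Sort by the second element (the average), highest first, without swapping columns.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

print(top_hours)
```

This keeps each pair in its original `[hour, average]` order, so later code can still read the hour from index 0.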


In [63]:
print("Top 5 Hours for Ask Posts Comments:")

for row in sorted_swap[:5]:
    num = row[0]
    hour = str(row[1])
    hour = dt.datetime.strptime(hour, "%H")
    hour = dt.datetime.strftime(hour, "%H:%M")
    # The line above is equivalent to: hour = hour.strftime("%H:%M")
    template = "{hour}: {num:.2f} average comments per post"
    print(template.format(hour = hour, num = num))


Top 5 Hours for Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
