# Hacker News site analysis

In this project, we'll work with a dataset of submissions to the popular technology site Hacker News.
<br>
<br>
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
<br>
<br>
We will explore a cleaned dataset of Hacker News posts to answer two research questions about question posts (<strong>ask hn</strong>) and feedback posts (<strong>show hn</strong>):

- Which of the two categories receives more comments on average?
- When is the best time of day to post ask hn and show hn items?


Let's start by reading the data.

In [7]:
import datetime as dt
from csv import reader

# Read every row of the CSV into a list; the with block closes the file afterwards
with open('C:/Users/Пользователь/Desktop/data science/dataquest/projects/project python for data science interm/HN.csv', encoding='Latin-1') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]  # the first row holds the column names
hn = hn[1:]      # the remaining rows are the posts
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Ã\x82Â\x93the-data-vaultÃ\x82Â\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
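The loading step above can also be wrapped in a small function. The following is only a sketch: `load_rows` and `sample_csv` are hypothetical names, and the in-memory string stands in for `HN.csv` (it reuses the header and two rows from the output above) so the example runs without the local file.

```python
import csv
import io

# Hypothetical stand-in for HN.csv: the header plus two rows from the output above.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,You have two days to comment if you want stem cells to be classified as your own,"
    "http://www.regulations.gov/document?D=FDA-2015-D-3719-0018,1,0,altstar,9/26/2016 3:26\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

def load_rows(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# With the real file this would be: with open(path, encoding='Latin-1', newline='') as f:
with io.StringIO(sample_csv) as f:
    sample_headers, sample_rows = load_rows(f)

print(sample_headers)
print(len(sample_rows))
```

Separating the header from the data rows inside one function keeps the later cells free of index bookkeeping.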


Below are descriptions of the columns:
    
**id:** the unique identifier from Hacker News for the post
<br>
**title:** the title of the post
<br>
**url:** the URL that the post links to, if the post has a URL
<br>
**num_points:** the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
<br>
**num_comments:** the number of comments on the post
<br>
**author:** the username of the person who submitted the post
<br>
**created_at:** the date and time of the post's submission

Now let's categorize the posts into three categories - ask_posts, show_posts, and other_posts. These categories will help us see which types of posts are more common.

In [8]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        

In [9]:
# Display the number of posts in each category
print("number of ask hn posts: ", len(ask_posts))
print("number of show hn posts: ", len(show_posts))
print("number of other posts: ", len(other_posts))
print("check: ", len(hn) == len(ask_posts) + len(show_posts) + len(other_posts))

number of ask hn posts:  9139
number of show hn posts:  10158
number of other posts:  273822
check:  True
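The prefix test used for the split can be isolated in a helper to see how individual titles are classified. This is a sketch only: `categorize` is a hypothetical name, and the titles are made up for illustration rather than taken from the dataset.

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, for illustration only.
examples = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'algorithmic music',
    'Why ask HN?',
]
labels = [categorize(t) for t in examples]
print(labels)
```

The last title contains "ask HN" but does not start with it, so it falls into the other category, just as it would in the loop above.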


Now let's see which category drives more comments. To do this we need to calculate the average number of comments by category.

In [10]:
total_ask_comments = 0
for i in ask_posts:
    num_comments = int(i[4])  # num_comments column
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

10.393478498741656


In [11]:
total_show_comments = 0
for i in show_posts:
    num_comments = int(i[4])  # num_comments column
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

4.886099625910612


In [12]:
template = "average number of comments for {name}: {avg: ,.2f}"

print(template.format(name = "ask hn posts", avg = avg_ask_comments))
print(template.format(name = "show hn posts", avg = avg_show_comments))

average number of comments for ask hn posts:  10.39
average number of comments for show hn posts:  4.89
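Since the two averaging cells differ only in the list they loop over, the calculation could be factored into one function. The sketch below assumes the column order shown earlier (num_comments at index 4); `average_comments` and `toy_posts` are hypothetical names, and the rows are invented for illustration.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the dataset's column order, for illustration only.
toy_posts = [
    ['1', 'Ask HN: A', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: B', '', '1', '4', 'user_b', '9/26/2016 15:02'],
]
print(average_comments(toy_posts))
```

The empty-list guard is an added safety choice, not something the original cells need, because both lists are known to be non-empty here.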


So, on average, ask posts received more comments than show posts (10.39 versus 4.89 per post). Since ask posts are more likely to receive comments, we'll focus the remaining analysis on them.
<br>
<br>
Next, we'll determine whether ask posts created at a certain time of day are more likely to attract comments. To do this, we first count the <strong>ask hn</strong> posts created in each hour of the day and total the comments they received. We then calculate the average number of comments per ask hn post for each hour.
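As a small worked example of the hour extraction, here is how a single `created_at` value from the first row of the dataset is parsed. The variable names `sample`, `parsed`, and `hour_key` are illustrative only.

```python
import datetime as dt

date_format = '%m/%d/%Y %H:%M'
sample = '9/26/2016 3:26'  # created_at value from the first row shown earlier

parsed = dt.datetime.strptime(sample, date_format)
hour_key = parsed.strftime('%H')  # zero-padded, two-character string

print(parsed)
print(hour_key)
print(parsed.hour)
```

`strftime('%H')` yields the string `'03'`, while `parsed.hour` yields the integer `3`; either works as a dictionary key as long as it is used consistently.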

In [13]:
result_list = []
for i in ask_posts:
    created_at = i[6]
    comments = int(i[4])
    result_list.append((created_at, comments))
    
counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'
for i in result_list:
    date = i[0]
    comment = i[1]
    date = dt.datetime.strptime(date, date_format)
    hour = date.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
    

In [14]:
avg_by_hour = []
for hour in counts_by_hour:
    avg = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, avg])
print(avg_by_hour)   

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]
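The counting and averaging steps can also be written with a single accumulator per hour. This is an alternative sketch using `collections.defaultdict`, not the approach used in the cells above; `toy_results` holds made-up (created_at, num_comments) pairs for illustration.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs, for illustration only.
toy_results = [
    ('9/26/2016 3:26', 4),
    ('9/25/2016 3:59', 6),
    ('9/26/2016 15:02', 30),
]

totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment total]
for created_at, comments in toy_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    totals[hour][0] += 1
    totals[hour][1] += comments

# Average comments per post for each hour.
toy_avg_by_hour = {hour: total / count for hour, (count, total) in totals.items()}
print(toy_avg_by_hour)
```

Keeping the count and the total in one structure removes the need for the `if hour not in ...` branch.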


Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing <strong>the five highest values</strong> in a format that's easier to read.

In [16]:
swap_avg_by_hour = []
for i in avg_by_hour:
    swap_avg_by_hour.append([i[1], i[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

for row in sorted_swap[:5]:
    hour_form = dt.datetime.strptime(str(row[1]), "%H")
    hour_form = hour_form.strftime("%H:%M")
    print(hour_form,'{:.2f}'.format(row[0]),'average comments per post')
    

15:00 28.68 average comments per post
13:00 16.32 average comments per post
12:00 12.38 average comments per post
02:00 11.14 average comments per post
10:00 10.68 average comments per post
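The same ranking can be produced without building a swapped list by passing a `key` function to `sorted`. The sketch below uses three `[hour, average]` pairs copied from the `avg_by_hour` output earlier; `sample_avg_by_hour` and `ranked` are illustrative names.

```python
# Three [hour, average] pairs copied from the avg_by_hour output earlier.
sample_avg_by_hour = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
]

# Sort on the average (index 1), highest first, leaving each pair in its original order.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print('{}:00 {:.2f} average comments per post'.format(hour, avg))
```

Because the hour strings are already zero-padded, appending `:00` gives the same `HH:MM` display as the `strptime`/`strftime` round trip.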


# Conclusions

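Ask hn posts attract more comments than show hn posts, averaging 10.39 comments per post against 4.89. Among ask hn posts, those created in the 15:00 hour received the most comments on average (28.68 per post), followed by 13:00 (16.32) and 12:00 (12.38). It therefore appears that 3 p.m., in the time zone of the dataset's timestamps, is the best time to post questions on Hacker News.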