# Hacker News Data Analysis
## Site Info
---
Hacker News is a site started by the startup incubator Y Combinator. User-submitted posts can be voted and commented on, much like on Reddit. The site appeals to people interested in technology and startups, and posts that reach the top of the Hacker News listings can gain wide exposure.

## Data Info
---
The dataset contains roughly 20,000 rows. It was reduced from about 300,000 rows by first removing posts that received no comments and then randomly sampling from the remaining posts.
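
The reduction described above can be sketched on a tiny scale. This is only an illustration under stated assumptions: the rows in `sample_rows` are made up, `sample_size` is a hypothetical stand-in for the real sample size, and the actual tool used to prepare the dataset is not known from this notebook.

```python
import random

# Made-up rows in the dataset's column order:
# [id, title, url, num_points, num_comments, author, created_at]
sample_rows = [
    ["1", "Ask HN: Example question?", "", "5", "3", "user_a", "8/4/2016 11:52"],
    ["2", "Example link post", "http://example.com", "2", "0", "user_b", "1/26/2016 19:30"],
    ["3", "Show HN: Example project", "http://example.org", "10", "7", "user_c", "6/23/2016 22:20"],
    ["4", "Another link post", "http://example.net", "1", "1", "user_d", "9/30/2015 4:12"],
]

# Step 1: keep only posts that received at least one comment (column index 4).
commented = [row for row in sample_rows if int(row[4]) > 0]

# Step 2: draw a random sample without replacement.
# The seed is set only so the illustration is repeatable.
random.seed(1)
sample_size = 2  # hypothetical; the real dataset keeps roughly 20,000 rows
sampled = random.sample(commented, sample_size)

print(len(commented), len(sampled))
```

Filtering before sampling matters here: sampling first and filtering afterwards would leave an unpredictable number of rows.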

The column descriptions are as follows:

* id: The unique identifier from Hacker News for the post
* title: The title of the post
* url: The URL that the post links to, if the post has a URL
* num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments: The number of comments that were made on the post
* author: The username of the person who submitted the post
* created_at: The date and time at which the post was submitted
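
Because the analysis below addresses columns by numeric index, it can help to see how the column names map onto one row. The following is a minimal sketch, not part of the original notebook: the `Post` type is a hypothetical helper and `raw_row` is a made-up example.

```python
from collections import namedtuple

# Field names mirror the column list above; Post itself is a hypothetical helper.
Post = namedtuple(
    "Post",
    ["id", "title", "url", "num_points", "num_comments", "author", "created_at"],
)

# Made-up example row in the same column order as the CSV.
raw_row = ["101", "Ask HN: Example question?", "", "12", "4", "example_user", "8/4/2016 11:52"]

post = Post(*raw_row)

# The csv reader returns every value as a string,
# so numeric columns need an explicit conversion.
points = int(post.num_points)
comments = int(post.num_comments)

print(post.title, points, comments)
```

Index access still works on a named tuple, so `post[4]` and `post.num_comments` refer to the same value.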

In [1]:
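from csv import reader

# Read the whole CSV into a list of rows, then split off the header row.
with open("hacker_news.csv") as f:
    read_file = reader(f)
    hn = list(read_file)
    hn_header = hn[0]  # column names
    hn.pop(0)          # remove the header so hn holds only data rows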

In [2]:
for v in enumerate(hn_header):
    print(v)
# print('\n')
# for i in range(5):
#     print(hn[i], '\n')

(0, 'id')
(1, 'title')
(2, 'url')
(3, 'num_points')
(4, 'num_comments')
(5, 'author')
(6, 'created_at')


## Ask HN vs. Show HN
---
For this analysis, we will look only at posts whose titles begin with 'Ask HN' or 'Show HN'. We will filter the data and find which of the two post types is more popular on Hacker News, in terms of the number of posts and the average number of comments per post.

First, we will have to separate the dataset into three separate lists: `ask_posts`, `show_posts`, and `other_posts`.

In [3]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print("Number of 'Ask HN' posts:", len(ask_posts))
print("Number of 'Show HN' posts:", len(show_posts))
print("Number of other posts:", len(other_posts))

Number of 'Ask HN' posts: 1744
Number of 'Show HN' posts: 1162
Number of other posts: 17194


Next, we will sum the comments in each list into `total_ask_comments` and `total_show_comments` respectively, and then compute the average number of comments per post for each list.

In [4]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)

print("Avg. number of comments on 'Ask HN' posts:", round(avg_ask_comments, 2))
print("Avg. number of comments on 'Show HN' posts:", round(avg_show_comments, 2))

Avg. number of comments on 'Ask HN' posts: 14.04
Avg. number of comments on 'Show HN' posts: 10.32


From this data, we can see that Ask HN posts outnumber Show HN posts, and that the average number of comments on Ask HN posts (14.04) is also greater than the average for Show HN posts (10.32). This suggests that Ask HN posts tend to draw more discussion from Hacker News users.
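
The two averaging loops above repeat the same logic, so they could be folded into one function. This is a sketch only: `average_comments` is a hypothetical helper name, and the demo rows are made up rather than taken from the dataset.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment-count column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the dataset's column order; only index 4 (num_comments) matters here.
demo_ask = [
    ["1", "Ask HN: First question?", "", "3", "6", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: Second question?", "", "9", "10", "user_b", "8/5/2016 15:10"],
]
demo_show = [
    ["3", "Show HN: Project one", "http://example.com", "4", "3", "user_c", "8/6/2016 9:05"],
    ["4", "Show HN: Project two", "http://example.org", "7", "4", "user_d", "8/7/2016 20:40"],
    ["5", "Show HN: Project three", "http://example.net", "2", "5", "user_e", "8/8/2016 13:15"],
]

print(round(average_comments(demo_ask), 2))
print(round(average_comments(demo_show), 2))
```

Returning 0.0 for an empty list is a design choice that avoids a `ZeroDivisionError` if one of the filtered lists turns out to be empty.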

## When Posts are Created
---
In this section, we will compare Ask HN posts based on the *time* of day they were created. The `datetime` module is used to parse each timestamp so that the hour can be extracted and compared.
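
As a small illustration of the parsing step, the sketch below converts one timestamp string into a `datetime` object and reads off the hour. The format string matches the one used in this notebook; `example_timestamp` is a made-up value.

```python
import datetime as dt

# Month/day/year followed by a 24-hour time, as in the created_at column.
date_format = "%m/%d/%Y %H:%M"

# Made-up timestamp in the same layout as the dataset.
example_timestamp = "8/4/2016 11:52"

parsed = dt.datetime.strptime(example_timestamp, date_format)

# Two ways to get the hour: a zero-padded string or a plain integer.
hour_label = parsed.strftime("%H")
hour_number = parsed.hour

print(hour_label, hour_number)
```

The zero-padded string form sorts correctly as text (for example, "09" before "10"), which is why it works as a dictionary key that is later sorted.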

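First, we import the `datetime` module and collect each post's creation time and comment count into a list of lists. We then count the posts and comments for each hour and compute the average number of comments per post.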

In [12]:
import datetime as dt

# Pair each Ask HN post's creation timestamp with its comment count.
result_list = []
for row in ask_posts:
    date_created = row[6]
    comments = int(row[4])
    result_list.append([date_created, comments])

# Tally the number of posts and the total comments for each hour of the day.
counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'
for row in result_list:
    # Parse the timestamp and keep the zero-padded hour, e.g. '15'.
    hour = dt.datetime.strptime(row[0], date_format).strftime('%H')
    comments = row[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

# Average comments per post for each hour, sorted chronologically.
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_by_hour.sort()
print("Avg. number of comments by the hour: ")
result_template = "{}:00, {:.2f} comments on average per post."
for row in avg_by_hour:
    print(result_template.format(row[0], row[1]))
print('\n')

# Swap the columns so the list can be sorted by average, highest first.
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour.sort(reverse=True)
print("Highest avg. number of comments by the hour: ")
for row in swap_avg_by_hour:
    print(result_template.format(row[1], row[0]))

Avg. number of comments by the hour: 
00:00, 8.13 comments on average per post.
01:00, 11.38 comments on average per post.
02:00, 23.81 comments on average per post.
03:00, 7.80 comments on average per post.
04:00, 7.17 comments on average per post.
05:00, 10.09 comments on average per post.
06:00, 9.02 comments on average per post.
07:00, 7.85 comments on average per post.
08:00, 10.25 comments on average per post.
09:00, 5.58 comments on average per post.
10:00, 13.44 comments on average per post.
11:00, 11.05 comments on average per post.
12:00, 9.41 comments on average per post.
13:00, 14.74 comments on average per post.
14:00, 13.23 comments on average per post.
15:00, 38.59 comments on average per post.
16:00, 16.80 comments on average per post.
17:00, 11.46 comments on average per post.
18:00, 13.20 comments on average per post.
19:00, 10.80 comments on average per post.
20:00, 21.52 comments on average per post.
21:00, 16.01 comments on average per post.
22:00, 6.75 comments on average per post.
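
The ranking step can also be done without building a swapped list, by passing a `key` function to `sorted`. The sketch below is an alternative, not the notebook's own method. The dictionary `avg_by_hour_sample` holds rounded averages copied from a few lines of the printed output above, not the full table.

```python
# Rounded averages copied from a few lines of the printed output (not every hour).
avg_by_hour_sample = {
    "02": 23.81,
    "15": 38.59,
    "16": 16.80,
    "20": 21.52,
    "21": 16.01,
}

# Sort (hour, average) pairs by the average, highest first.
ranked = sorted(avg_by_hour_sample.items(), key=lambda pair: pair[1], reverse=True)

result_template = "{}:00, {:.2f} comments on average per post."
for hour, avg in ranked:
    print(result_template.format(hour, avg))
```

Sorting with a key keeps each pair in its original (hour, average) order, so the print loop does not need to reverse the indices.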

## Conclusion
---
Early in this analysis, we found that Ask HN posts draw more discussion on Hacker News than Show HN posts. Among Ask HN posts, those created around 15:00 (3:00 PM) receive the most comments per post on average (38.59). Users who want to maximize comments may therefore want to create a post whose title begins with 'Ask HN' around that time.

The techniques used here to analyze the response to posts can also help decide when to post on a social media platform. This applies to marketing, self-promotion, or seeking a quick response, since each of these benefits from reaching the largest number of viewers and participants.

For the future, these aspects of the dataset can also be compared:

* Determine whether Show HN or Ask HN posts receive more points on average.
* Determine whether posts created at a certain time are more likely to receive more points.
* Compare these results to the average number of comments and points that other posts receive.
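
As a starting point for the second item, the hourly grouping could be repeated with the `num_points` column. This is a rough sketch under stated assumptions: the rows in `demo_posts` are made up, the constant names are hypothetical, and no result for the real dataset is implied.

```python
import datetime as dt
from collections import defaultdict

POINTS_INDEX = 3       # num_points column
CREATED_AT_INDEX = 6   # created_at column
DATE_FORMAT = "%m/%d/%Y %H:%M"

# Made-up rows in the dataset's column order.
demo_posts = [
    ["1", "Ask HN: First question?", "", "4", "2", "user_a", "8/4/2016 15:10"],
    ["2", "Ask HN: Second question?", "", "8", "5", "user_b", "8/5/2016 15:45"],
    ["3", "Ask HN: Third question?", "", "3", "1", "user_c", "8/6/2016 9:05"],
]

# Collect the points of every post under the hour it was created.
points_by_hour = defaultdict(list)
for row in demo_posts:
    hour = dt.datetime.strptime(row[CREATED_AT_INDEX], DATE_FORMAT).hour
    points_by_hour[hour].append(int(row[POINTS_INDEX]))

# Average points per post for each hour that has at least one post.
avg_points_by_hour = {
    hour: sum(values) / len(values) for hour, values in points_by_hour.items()
}

for hour in sorted(avg_points_by_hour):
    print("{:02d}:00, {:.2f} points on average per post.".format(hour, avg_points_by_hour[hour]))
```

Using integer hours as keys keeps the sort numeric, and `defaultdict(list)` removes the need for the "first time seen" branch used in the comment analysis.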