# Title: The Hacker News Dataset
# Author: Daniyal Siddiqui

### Purpose of the Project:
**To use Basic Python to explore large datasets and work with datetime objects**

---

**In this project, we'll work with a data set of submissions to popular technology site [Hacker News](https://news.ycombinator.com/).**

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. 

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit Ask HN posts to ask the Hacker News community a specific question.
Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

- *Do Ask HN or Show HN receive more comments on average?*
- *Do posts created at a certain time receive more comments on average?*

we will first read the file

In [41]:
from csv import reader
dataset = open("hacker_news.csv")
read = reader(dataset)
hn = list(read)

The list of columns in the dataset looks like this:

In [42]:
headers = hn[0]
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Below are descriptions of the columns:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the posts links to, if it the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted

Here are the first two rows of data without headers:

In [43]:
hn = hn[1:]
hn[0:2]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30']]

Next we will separate out the type of post into three categories:
- Ask post
- Show post
- Other post

In [44]:
ask_post = list()
show_post = list()
other_post = list()

for i in hn:
    title = i[1] # since the column title is at the second position
    if title.startswith("Ask HN"):
        ask_post.append(i)
    elif title.startswith("Show HN"):
        show_post.append(i)
    else:
        other_post.append(i)    

Let's look at the first two rows of the newly created ask_post list:

In [45]:
ask_post[0:2]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43']]

#### We will now find out the Average comments that features in each of the Ask HN and Show HN posts.

For this, we wil define a function, so that we dont have to repeat ourself.

In [48]:
def tot_avg_com(post_type):
    tot_com = 0
    count = 0
    for i in post_type:
        tot_com += int(i[4])
        count += 1
    average = round(tot_com/count,2)
    return average
print("Average comments in Ask HN posts",tot_avg_com(ask_post))
print("Average comments in Show HN posts",tot_avg_com(show_post))

Average comments in Ask HHN posts 14.04
Average comments in Show HHN posts 10.32


We can see that on average the Ask HN posts recieve more comments than the Show HN posts. It goes on to show that people are more engaged when a certain question is posed rather than just viewing something.

### Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the amount of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

In [97]:
from datetime import datetime as dt
tot_com_by_hr = dict()
posts_by_hr = dict()
for i in ask_post:
    created_at = dt.strptime(i[6], "%m/%d/%Y %H:%M").strftime("%H")
    tot_com_by_hr[created_at] = tot_com_by_hr.get(created_at,int(i[4]))+int(i[4])
    posts_by_hr[created_at] = posts_by_hr.get(created_at, 0)+1
avg_by_hr = []
for i in tot_com_by_hr:
    avg_by_hr.append([i, round(tot_com_by_hr[i]/posts_by_hr[i],2)]) 
print("Average Ask HN posts sorted by hour of the day:")
for i in avg_by_hr:
    print(*i)

Average Ask HN posts sorted by hour of the day:
09 5.71
13 15.08
10 13.46
14 13.26
16 16.95
23 8.0
12 9.47
17 11.47
15 38.6
21 16.05
20 21.55
02 23.86
18 13.26
03 7.81
05 10.72
19 10.83
01 11.93
22 6.77
08 10.35
04 7.23
00 8.31
06 9.05
07 7.91
11 11.09


Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [110]:
avg = []
for i in avg_by_hr:
    avg.append([i[1],i[0]])
avg_sorted = sorted(avg, reverse = True)[0:5]

for i in avg_sorted:
    print("{0}:00 - {1:.2f} average comments per post".format(i[1],i[0]))

15:00 - 38.60 average comments per post
02:00 - 23.86 average comments per post
20:00 - 21.55 average comments per post
16:00 - 16.95 average comments per post
21:00 - 16.05 average comments per post


## Conclusion

The above analysis shows us that to get the most engagement on the Hacker News Site, one should use the `Ask HN` tag in the title and should post at 15:00 hrs or 3p.m. 
