### Exploring Hacker News


[Hacker News](https://news.ycombinator.com/) is a project that came out of the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories are voted and commented upon. The site attracts a large number of visitors from technology and startup circles, which explains its popularity.
  
We're specifically interested in posts whose titles begin with either Ask HN or Show HN. **Ask HN** posts ask the Hacker News community a specific question. **Show HN** posts show the community a project, a product, or just something interesting.

The objective of this project is to explore the Hacker News dataset and answer the following questions:
* Which of the two post types receives more comments on average?
* At which times of day do posts attract the most comments?


The data set contains the following column headers:
* **id:**  The unique identifier from Hacker News for the post
* **title:** The title of the post
* **url:** The URL that the post links to, if the post has a URL
* **num_points:** The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* **num_comments:** The number of comments on the post
* **author:** The username of the person who submitted the post
* **created_at:** The date and time at which the post was submitted

We are going to separate the header row from the rest of the data to simplify later work on this dataset.
Let's read the data and explore the first few rows.

In [1]:
from csv import reader

# Read the whole file into a list of rows; the with-block closes the file
with open('hacker_news.csv') as open_file:
    dataset = list(reader(open_file))

headers = dataset[0] #headers
hn = dataset[1:] #rest of the data

# Print the header row and the first four posts
for row in dataset[:5]:
    print(row,"\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 
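
As a variation on the cell above, the header/data split can be wrapped in a small reusable function. This is only a sketch: `load_rows` is a hypothetical helper name that does not appear in the notebook, and the example reads from an in-memory sample (one row copied from the output above) so that it runs without `hacker_news.csv`. The `encoding` in the commented usage is an assumption about the file.

```python
import csv
import io


def load_rows(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# In-memory stand-in for hacker_news.csv: the header plus one row from the output above
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)
sample_headers, sample_rows = load_rows(sample)
print(sample_headers)
print(sample_rows[0])

# With the real file (encoding is an assumption):
# with open('hacker_news.csv', newline='', encoding='utf-8') as f:
#     headers, hn = load_rows(f)
```

Keeping the split in one function makes it easy to check on a tiny sample before running it on the full file.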



### Extraction of *Ask HN* and *Show HN* posts

At this point in the analysis we want to filter the data down to the posts we are interested in: **Ask HN** and **Show HN**. To do so, we create new lists of lists and use the `startswith()` string method. To make sure the match works regardless of letter case, we combine `startswith()` with the `lower()` method.

In [2]:
ask_posts = list()
show_posts = list()
other_posts = list()

for row in hn:
    title = row[1]
    # Lowercase the title so the prefix check ignores letter case
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Ask HN posts:',len(ask_posts))
print('Show HN posts',len(show_posts))
print('Other posts:',len(other_posts))

Ask HN posts: 1744
Show HN posts 1162
Other posts: 17194
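
The case-insensitive prefix check can be tried in isolation on a few titles. The sketch below assumes a hypothetical helper, `classify_title`, and made-up titles; neither is part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalize case once, then test the prefixes
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles for illustration only
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend side project',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Because the title is lowercased before the check, `'SHOW HN: ...'` and `'Show HN: ...'` land in the same group.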


### Calculating averages for the Ask HN and Show HN posts

Which type of post attracts more comments on average? Let's find out.

In [3]:
#Ask HN number of comments
ask_posts_comments = 0
for row in ask_posts:
    ask_posts_comments += int(row[4])
ask_posts_average = round(ask_posts_comments / len(ask_posts),1)   
    
#Show HN number of comments
show_posts_comments = 0
for row in show_posts:
    show_posts_comments += int(row[4])
show_posts_average = round(show_posts_comments / len(show_posts),1)

print('Total Ask HN comments:',ask_posts_comments)
print('Total Show HN comments:',show_posts_comments)
print('Average number of comments per Ask HN post:',ask_posts_average)
print('Average number of comments per Show HN post:',show_posts_average)

Total Ask HN comments: 24483
Total Show HN comments: 11988
Average number of comments per Ask HN post: 14.0
Average number of comments per Show HN post: 10.3


On average, Ask HN posts attract about 14 comments per post, while Show HN posts attract about 10.
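
Since the same total-then-divide calculation is done twice, it could be factored into one function. The sketch below assumes a hypothetical helper, `average_comments`, and two made-up rows in the dataset's column order. The last line only re-derives the rounded averages from the totals and post counts printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean comment count for a list of rows; 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the dataset's column order (num_comments is column 4)
sample_posts = [
    ['1', 'Ask HN: First question?', '', '5', '12', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Second question?', '', '3', '7', 'user_b', '8/4/2016 12:10'],
]
sample_average = average_comments(sample_posts)
print(sample_average)

# Cross-check using the totals and post counts reported in the outputs above
ask_check = round(24483 / 1744, 1)
show_check = round(11988 / 1162, 1)
print(ask_check, show_check)
```

The empty-list guard is a design choice of this sketch: it avoids a `ZeroDivisionError` if a filter happens to match no posts.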

### Distribution of comments for the Ask posts per hour

The previous step shows that, on average, Ask HN posts attract more comments than Show HN posts, so we will focus the rest of the analysis on Ask HN posts.
Next, we would like to find out whether posts created at specific times of the day attract more comments.  

To do this we are going to perform two calculations:
* Calculate the number of Ask HN posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments Ask HN posts receive by hour created.

In [4]:
import datetime as dt

result_list = list()
posts_by_hour = dict()
comments_by_hour = dict()

for row in ask_posts:
    created_dt_str = row[6]
    comments_count = int(row[4])
    created_dt = dt.datetime.strptime(created_dt_str,'%m/%d/%Y %H:%M')
    #hour = created_dt.hour #alternative way of deriving the hour from the datetime object
    hour2 = created_dt.strftime('%H')
    result_list.append([hour2,comments_count])

for i in result_list:
    hour = i[0]
    comments_count = i[1]
    posts_by_hour[hour] = posts_by_hour.get(hour,0) + 1
    comments_by_hour[hour] = comments_by_hour.get(hour,0) + comments_count
    
print('Count of posts by hour')
print(posts_by_hour,'\n')
print('Count of comments by hour')
print(comments_by_hour)   

Count of posts by hour
{'09': 45, '02': 58, '23': 68, '11': 58, '19': 110, '10': 59, '07': 34, '00': 55, '04': 47, '20': 80, '21': 109, '01': 60, '12': 73, '05': 46, '08': 48, '14': 107, '06': 44, '18': 109, '17': 100, '03': 54, '16': 108, '15': 116, '22': 71, '13': 85} 

Count of comments by hour
{'09': 251, '02': 1381, '23': 543, '11': 641, '19': 1188, '10': 793, '07': 267, '00': 447, '04': 337, '20': 1722, '21': 1745, '01': 683, '12': 687, '05': 464, '08': 492, '14': 1416, '06': 397, '18': 1439, '17': 1146, '03': 421, '16': 1814, '15': 4477, '22': 479, '13': 1253}
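
The parse-then-tally pattern used above can also be written with `collections.defaultdict`, which removes the need for `.get(hour, 0)`. This is an alternative sketch rather than the notebook's method. The timestamps and comment counts are made up, and `sample_posts_by_hour` and `sample_comments_by_hour` are illustrative names.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the dataset's date format
sample = [
    ('8/4/2016 11:52', 6),
    ('9/1/2016 11:05', 4),
    ('6/17/2016 0:01', 1),
]

sample_posts_by_hour = defaultdict(int)     # missing keys start at 0
sample_comments_by_hour = defaultdict(int)

for created_at, n_comments in sample:
    # Parse the timestamp, then keep only the zero-padded hour as a string
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    sample_posts_by_hour[hour] += 1
    sample_comments_by_hour[hour] += n_comments

print(dict(sample_posts_by_hour))
print(dict(sample_comments_by_hour))
```

Both approaches produce the same dictionaries; `defaultdict` simply moves the default value into the container.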


We can now use both dictionaries to calculate the average number of comments per post for each hour.

In [9]:
avg_by_hour = list()

for hour in posts_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour] / posts_by_hour[hour]])
for i in avg_by_hour:
    print(i)

['09', 5.5777777777777775]
['02', 23.810344827586206]
['23', 7.985294117647059]
['11', 11.051724137931034]
['19', 10.8]
['10', 13.440677966101696]
['07', 7.852941176470588]
['00', 8.127272727272727]
['04', 7.170212765957447]
['20', 21.525]
['21', 16.009174311926607]
['01', 11.383333333333333]
['12', 9.41095890410959]
['05', 10.08695652173913]
['08', 10.25]
['14', 13.233644859813085]
['06', 9.022727272727273]
['18', 13.20183486238532]
['17', 11.46]
['03', 7.796296296296297]
['16', 16.796296296296298]
['15', 38.5948275862069]
['22', 6.746478873239437]
['13', 14.741176470588234]


The new list of lists gives us exactly what we need, but the data is not sorted, so let's fix that.

In [10]:
import datetime as dt

swap_avg_by_hour = list()
for i in avg_by_hour:
    hour = i[0]
    avg = i[1]
    swap_avg_by_hour.append([avg,hour])
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('Top 5 Hours for Ask Posts Comments:')

for i in sorted_swap[:5]:
    hour = i[1] #string
    avg = i[0] #float
    hour = dt.datetime.strptime(hour,'%H')
    hour_str = hour.strftime('%H:%M')
    print(hour_str,' ','{:.2f}'.format(avg))
    

Top 5 Hours for Ask Posts Comments:
15:00   38.59
02:00   23.81
20:00   21.52
16:00   16.80
21:00   16.01
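
Swapping the columns is one way to sort by average. Another is to pass a `key` function to `sorted()`, which leaves each `[hour, average]` pair in its original order. The sketch below uses five of the pairs printed earlier; `sample_avg_by_hour` and `ranked` are illustrative names.

```python
# Five [hour, average] pairs copied from the output above, in unsorted order
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['02', 23.810344827586206],
    ['20', 21.525],
    ['16', 16.796296296296298],
    ['15', 38.5948275862069],
]

# Sort by the second element (the average), highest first
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print('{}:00   {:.2f}'.format(hour, avg))
```

With a `key`, no second list is needed, and the hour stays in the first position for later use.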


### Conclusion

Based on the findings above, Ask HN posts created at 3 pm (Eastern Time in the US) attract the most comments, about 38.6 per post on average. The second-best hour is 2 am, and it is well behind: at about 23.8 comments per post, it receives roughly 38% fewer comments than posts created at 3 pm.
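
The 38% figure is the relative drop from the top hour to the second-best hour, computed from the two rounded averages in the ranking above. The variable names are illustrative:

```python
top_avg = 38.59     # average comments per post at 15:00
second_avg = 23.81  # average comments per post at 02:00

# Relative reduction compared with the top hour
relative_drop = (top_avg - second_avg) / top_avg
print('{:.1%}'.format(relative_drop))
```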

This project allowed us to accomplish the following:
* Set the goal for the project
* Collect and store the data
* Clean and prepare the data for analysis
* Analyze the data and draw conclusions

We have determined which type of post and which times of the day attract the most comments. For anyone who wants to take advantage of this analysis, Ask HN posts created between 15:00 and 17:00, between 20:00 and 22:00, and between 02:00 and 03:00 received the most comments on average, so posting at those times gives the best chance of attracting comments.