# Working with dates in Python - Hacker News Posts
##### Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result. The dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).
##### We're specifically interested in posts whose titles begin with either Ask HN or Show HN. These two types of posts will be compared and the following will be determined:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?






The `datetime` module is imported and the dataset is read in as a list of lists.
The first five rows (including the header row) are shown.

In [1]:
import datetime as dt
from csv import reader

# Read the CSV into a list of lists; the with block closes the file afterwards
with open("hacker_news.csv") as open_file:
    read_file = reader(open_file)
    hn = list(read_file)

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


The header is separated from the rest of the dataset.

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print('\n')
print(hn[:4])
print("Length of hn", len(hn))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
Length of hn 20100


Now that the header is removed, the dataset has to be filtered to separate the posts of interest (posts whose titles begin with "Ask HN" or "Show HN"). Each title is converted to lowercase before the comparison, so that variations in capitalization such as "ask hn" or "Ask Hn" are matched as well.

In [3]:
ask_posts = []
show_posts = []
other_posts=[]

for row in hn:
    title = row[1]
   
    lowercase_title =title.lower()
    
    if lowercase_title.startswith('ask hn'):
        ask_posts.append(row)
    elif lowercase_title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Length of ask posts: ",len(ask_posts))
print("Length of show posts: ",len(show_posts))
print("Length of other posts", len(other_posts))
print("Length of hn:", len(hn), "/ Length of each posts added together: ", len(ask_posts) + len(show_posts) + len(other_posts))

Length of ask posts:  1744
Length of show posts:  1162
Length of other posts 17194
Length of hn: 20100 / Length of each posts added together:  20100


In [4]:
print(ask_posts[:5]) #examining ask posts

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


In [8]:
print(show_posts[:5]) #examining show posts

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


Next, the average number of comments is calculated for each type of post to determine which type receives more comments on average.

In [10]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments+= int(row[4])
avg_ask_comments = total_ask_comments/len(ask_posts)


total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments/len(show_posts)
    
print("Average ask comments", avg_ask_comments) 
print("Average show comments", avg_show_comments)
print("Ask posts receive more comments on average")

    

Average ask comments 14.038417431192661
Average show comments 10.31669535283993
Ask posts receive more comments on average
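
The two loops above perform the same calculation, so the logic could be factored into a helper function. The following is a minimal sketch rather than part of the original analysis: the name `average_comments` is hypothetical, the default column index 4 assumes the `num_comments` position shown in the header, and the two sample rows are copied from the ask posts printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column over a list of rows."""
    if not posts:
        raise ValueError("posts must not be empty")
    # The CSV reader yields strings, so each count is converted to int first
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows copied from the ask posts shown earlier (6 and 29 comments)
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]

sample_average = average_comments(sample_posts)
print(sample_average)
```

With the full dataset loaded, the same function could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.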


Since ask posts receive more comments on average, the rest of the analysis focuses on them. Below, it is determined whether ask posts created at a certain hour of the day are likely to attract more comments.

In [11]:
result_list = []
for row in ask_posts:
    temp_list = [row[6], row[4]]  # [created_at, num_comments] for this post
    result_list.append(temp_list)  # result_list becomes a list of [created_at, num_comments] pairs

counts_by_hour = {}  # number of ask posts created during each hour of the day
comments_by_hour = {}  # total number of comments received by ask posts created during each hour

for row in result_list:
    date_dt = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")  # parse created_at into a datetime object
    hour = date_dt.strftime("%H")  # extract the hour as a zero-padded string, e.g. "09"
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1  # first post seen for this hour
        comments_by_hour[hour] = int(row[1])  # start the total with this post's comment count
    else:
        counts_by_hour[hour] += 1  # one more post for this hour
        comments_by_hour[hour] += int(row[1])  # add this post's comments to the hour's total
 
print("Number of ask posts per hour")
print(counts_by_hour)
print("\n")
print("Number of comments per hour")
print(comments_by_hour)

Number of ask posts per hour
{'21': 109, '22': 71, '09': 45, '04': 47, '10': 59, '14': 107, '08': 48, '15': 116, '19': 110, '01': 60, '12': 73, '00': 55, '18': 109, '05': 46, '13': 85, '11': 58, '23': 68, '07': 34, '16': 108, '20': 80, '02': 58, '17': 100, '03': 54, '06': 44}


Number of comments per hour
{'21': 1745, '22': 479, '09': 251, '04': 337, '10': 793, '14': 1416, '08': 492, '15': 4477, '19': 1188, '01': 683, '12': 687, '00': 447, '18': 1439, '05': 464, '13': 1253, '11': 641, '23': 543, '07': 267, '16': 1814, '20': 1722, '02': 1381, '17': 1146, '03': 421, '06': 397}
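
To make the parsing step concrete, here is a small worked example on a single `created_at` value (taken from the first ask post shown earlier). It is only an illustration of how `strptime` and `strftime` behave with the format string used above, and it shows why the dictionary keys are zero-padded strings such as `'09'` rather than integers.

```python
import datetime as dt

# created_at value of the first ask post printed earlier
created_at = "8/16/2016 9:55"

# strptime turns the string into a datetime object using the given format
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

# strftime("%H") formats the hour as a two-digit, zero-padded string
hour = parsed.strftime("%H")

print(parsed)
print(hour)
```

If integer keys were preferred, `parsed.hour` would give the hour as an `int` instead.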


Next, the average number of comments per post is calculated for each hour. The result is a list of lists in which the first element is the hour and the second element is the average number of comments per post.

In [13]:
avg_by_hour = []

for hours in comments_by_hour:
    avg_by_hour.append([hours, comments_by_hour[hours]/counts_by_hour[hours]])

import pandas as pd
pd.DataFrame(avg_by_hour)


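|   | 0  | 1         |
|---|----|-----------|
| 0 | 21 | 16.009174 |
| 1 | 22 | 6.746479  |
| 2 | 09 | 5.577778  |
| 3 | 04 | 7.170213  |
| 4 | 10 | 13.440678 |
| 5 | 14 | 13.233645 |
| 6 | 08 | 10.250000 |
| 7 | 15 | 38.594828 |
| 8 | 19 | 10.800000 |
| 9 | 01 | 11.383333 |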

The results indicate that ask posts created during the 3 p.m. hour (15:00 to 15:59) receive the most comments on average, at about 38.6 comments per post.
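
One way to check this conclusion is to sort the hourly averages in descending order. The sketch below is an assumption about how that could be done, not a step from the original notebook: it copies the two dictionaries printed above so that it runs on its own, and the names `sorted_avg` and `label` are introduced here for illustration.

```python
import datetime as dt

# Totals copied from the output printed earlier
counts_by_hour = {'21': 109, '22': 71, '09': 45, '04': 47, '10': 59, '14': 107,
                  '08': 48, '15': 116, '19': 110, '01': 60, '12': 73, '00': 55,
                  '18': 109, '05': 46, '13': 85, '11': 58, '23': 68, '07': 34,
                  '16': 108, '20': 80, '02': 58, '17': 100, '03': 54, '06': 44}
comments_by_hour = {'21': 1745, '22': 479, '09': 251, '04': 337, '10': 793, '14': 1416,
                    '08': 492, '15': 4477, '19': 1188, '01': 683, '12': 687, '00': 447,
                    '18': 1439, '05': 464, '13': 1253, '11': 641, '23': 543, '07': 267,
                    '16': 1814, '20': 1722, '02': 1381, '17': 1146, '03': 421, '06': 397}

# Average comments per post for each hour, as [hour, average] pairs
avg_by_hour = [[hour, comments_by_hour[hour] / counts_by_hour[hour]]
               for hour in comments_by_hour]

# Sort by the average (second element), highest first
sorted_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

print("Top 5 hours for ask post comments")
for hour, avg in sorted_avg[:5]:
    # Reformat the hour string, e.g. "15" -> "15:00"
    label = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    print(f"{label}: {avg:.2f} average comments per post")
```

Sorting on the average rather than on the hour puts the busiest hours at the top, so the 15:00 hour appears first, followed by 02:00 and 20:00.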