# Exploring Hacker News Posts
---
We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a couple examples:

- Ask HN: How to improve my personal website?
- Ask HN: Am I the only one outraged by Twitter shutting down share counts?
- Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. Below are a couple of examples:

- Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform'
- Show HN: Something pointless I made
- Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Answering these two questions will help us to determine what hours are best for Ask HN and Show HN posts. The criteria for "best" is having high average comment count and high consistency.

## Opening File

In [22]:
from csv import reader

openFile = open(r'C:\Users\Jason Minhas\Jupyter Projects\Exploring Hacker News Posts\rawDate\Hacker_News_Posts_09.26.2015-09.26.2016.csv', encoding="utf8")
readFile = reader(openFile)
hackerNews = list(readFile)
hackerNews = hackerNews[1:]

for row in hackerNews[0:10]:print(row)
    
print(len(hackerNews))

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']
['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-yo

## Isolating "Ask HN" and "Show HN" posts

Below we sort "Ask HN" and "Show HN" posts into their own datasets. This will allows us to work with each type of post without interfering with each other.

In [4]:
askHN = []
showHN = []

for row in hackerNews:
    title = row[1].lower()
    if title[:6] == "ask hn":
        askHN.append(row)
    elif title[:7] == "show hn":
        showHN.append(row)
    else:
        pass
        
print('Ask HN Count: ' + str(len(askHN)))
print('Show HN Count: ' + str(len(showHN)))
print('Total Count: ' + str(len(askHN)+len(showHN)))

Ask HN Count: 9139
Show HN Count: 10158
Total Count: 19297


## Average by Column

<code>columnAvg</code> is a function to get the average of any given column. In this case we will get the average comments for the both datasets.

In [5]:
def columnAvg(dataset, colIndex):
    tempList = []
    for row in dataset:
        tempList.append(int(row[colIndex]))
        
    return round(sum(tempList)/len(tempList),2)

In [6]:
print('Ask HN Average Comment:', columnAvg(askHN,4))
print('Show HN Average Comment:', columnAvg(showHN,4))

Ask HN Average Comment: 10.39
Show HN Average Comment: 4.89


"Ask HN" have 10.4 comments on average while "Show HN" posts have 4.89. Next we'll look at which one performs more consistently.

## Standard Deviation by Column

Similar to the <code>columnAvg</code> function the next part will help us determine the standard deviation of any given column. The first function <code>StDev</code> will help us get the standard deviation of any list, This function will be used in the next <code>columnStDev</code>.

In [7]:
def StDev(values):
    mean = sum(values) / len(values)
    variance = sum((i - mean) ** 2 for i in values)/(len(values)-1)    
    return variance ** (1/2)

In [8]:
def columnStDev(dataset, colIndex):
    tempList = []
    for row in dataset:
        tempList.append(int(row[colIndex]))
        
    return round(StDev(tempList),2)

In [9]:
print('Ask HN Standard Deviation:',columnStDev(askHN,4))
print('Show HN Standard Deviation:',columnStDev(showHN,4))

Ask HN Standard Deviation: 43.51
Show HN Standard Deviation: 16.15


"Ask HN" has a standard deviation of 43.53 comments while "Show HN" posts has 16.15. Although "Show HN" has less comments on average it receives a more consistent number of comments.

## Converting column to datetime format

<code>datetimeConvert</code> function below will allow us to convert any given column into datetime. This will make it easier to manipulate our datasets in regards to hour which is what our focus will be in the analysis.

In [10]:
def datetimeConvert(dataset, colIndex, datetimeFormatStr):
    import datetime as dt
    
    for row in dataset:
        if type(row[colIndex]) != 'datetime.datetime':
            datetimeConverted = dt.datetime.strptime(row[colIndex], datetimeFormatStr)
            row[colIndex] = datetimeConverted
        else: pass
        
    return dataset
        
askHN = datetimeConvert(askHN, 6, "%m/%d/%Y %H:%M")
showHN = datetimeConvert(showHN, 6, "%m/%d/%Y %H:%M")


## Frequency Tables

Both <code>countByHour</code> and <code>commentsByHour</code> functions create dictionaries that will be used in <code>commentAvgByHour</code>.

In [11]:
def countByHour(dataset, hourIndex):
    freqTable = {}
    for row in dataset:
        hour = row[hourIndex].strftime("%I:00 %p")
        if hour in freqTable:
            freqTable[hour] += 1
        else:
            freqTable[hour] = 1
    return freqTable

In [12]:
def commentsByHour(dataset, hourIndex, commentIndex):
    freqTable= {}
    for row in dataset:
        hour = row[hourIndex].strftime("%I:00 %p")
        commentAmt = int(row[4])
        if hour in freqTable:
            freqTable[hour] += commentAmt
        else:
            freqTable[hour] = commentAmt
    return freqTable

<code>commentAvgByHour</code> prints the average comments by hour for a given dataset. In our case we are looking at the "Ask HN" and "Show HN" posts.

In [13]:
def commentAvgByHour(dataset, hourIndex, commentIndex):
    dataset_CountByHour = countByHour(dataset,hourIndex)
    dataset_CommentsByHour = commentsByHour(dataset, hourIndex, commentIndex)  
    
    commentAvgByHour = {}
    
    for item in dataset_CountByHour:
        commentAvgByHour[item] = round(dataset_CommentsByHour[item]/dataset_CountByHour[item],2)

    commentAvgByHour = {k: v for k, v in sorted(commentAvgByHour.items(), key=lambda item: item[1], reverse=True)}

    return commentAvgByHour

In [14]:
askHNcommentAvgByHour = commentAvgByHour(askHN, 6, 4)
showHNcommentAvgByHour = commentAvgByHour(showHN, 6, 4)

print('Average Comments By Hour for Ask HN')
for item in askHNcommentAvgByHour: 
    print(item, '-', askHNcommentAvgByHour[item])
print('\n')
print('Average Comments By Hour for Show HN')
for item in showHNcommentAvgByHour: 
    print(item, '-', showHNcommentAvgByHour[item])

Average Comments By Hour for Ask HN
03:00 PM - 28.68
01:00 PM - 16.32
12:00 PM - 12.38
02:00 AM - 11.14
10:00 AM - 10.68
04:00 AM - 9.71
02:00 PM - 9.69
05:00 PM - 9.45
08:00 AM - 9.19
11:00 AM - 8.96
10:00 PM - 8.8
05:00 AM - 8.79
08:00 PM - 8.75
09:00 PM - 8.69
03:00 AM - 7.95
06:00 PM - 7.94
04:00 PM - 7.71
12:00 AM - 7.56
01:00 AM - 7.41
07:00 PM - 7.16
07:00 AM - 7.01
06:00 AM - 6.78
11:00 PM - 6.7
09:00 AM - 6.65


Average Comments By Hour for Show HN
12:00 PM - 6.99
07:00 AM - 6.68
11:00 AM - 6.0
08:00 AM - 5.6
02:00 PM - 5.52
01:00 PM - 5.43
02:00 AM - 5.15
04:00 AM - 5.04
07:00 PM - 5.02
06:00 PM - 4.94
04:00 PM - 4.71
06:00 AM - 4.71
09:00 AM - 4.67
12:00 AM - 4.65
03:00 PM - 4.57
11:00 PM - 4.53
03:00 AM - 4.53
05:00 PM - 4.25
08:00 PM - 4.16
09:00 PM - 4.09
01:00 AM - 4.07
10:00 PM - 3.85
10:00 AM - 3.8
05:00 AM - 3.44


In [15]:
def commentStDevByHour(dataset, hourIndex, commentIndex):
    commentStDevByHour = {}
    for row in dataset:
        hour = row[hourIndex].strftime("%I:00 %p")
        commentAmt = int(row[4])
        if hour in commentStDevByHour:
            commentStDevByHour[hour].append(commentAmt)
        else:
            commentStDevByHour[hour] = []
    
    for item in commentStDevByHour: 
        commentStDevByHour[item] = round(StDev(commentStDevByHour[item]), 2)      
            
    commentStDevByHour = {k: v for k, v in sorted(commentStDevByHour.items(), key=lambda item: item[1])}
    
    return commentStDevByHour

In [16]:
askHNcommentStDevByHour = commentStDevByHour(askHN, 6, 4)
showHNcommentStDevByHour = commentStDevByHour(showHN, 6, 4)

print('Standard Deviation of Comments By Hour for Ask HN')
for item in askHNcommentStDevByHour: 
    print(item, '-', askHNcommentStDevByHour[item])
print('\n')
print('Standard Deviation of Comments By Hour for Show HN')
for item in showHNcommentStDevByHour: 
    print(item, '-', showHNcommentStDevByHour[item])

Standard Deviation of Comments By Hour for Ask HN
11:00 PM - 13.07
06:00 AM - 13.86
09:00 AM - 14.89
03:00 AM - 17.5
07:00 PM - 18.5
01:00 AM - 20.06
07:00 AM - 20.65
12:00 AM - 21.15
08:00 AM - 23.32
10:00 AM - 23.7
11:00 AM - 24.26
06:00 PM - 24.88
02:00 PM - 25.67
10:00 PM - 29.21
04:00 PM - 32.32
09:00 PM - 33.01
12:00 PM - 34.55
08:00 PM - 35.24
05:00 AM - 36.59
05:00 PM - 36.97
04:00 AM - 44.3
02:00 AM - 54.87
01:00 PM - 57.8
03:00 PM - 116.23


Standard Deviation of Comments By Hour for Show HN
10:00 AM - 10.1
01:00 AM - 10.65
05:00 AM - 10.81
03:00 AM - 10.89
06:00 AM - 11.02
10:00 PM - 11.54
08:00 PM - 12.59
09:00 PM - 13.6
04:00 AM - 13.84
03:00 PM - 13.92
12:00 AM - 13.99
05:00 PM - 14.2
11:00 PM - 14.5
04:00 PM - 15.2
09:00 AM - 16.87
11:00 AM - 16.88
07:00 PM - 16.91
01:00 PM - 16.99
08:00 AM - 18.5
06:00 PM - 18.52
02:00 PM - 20.65
12:00 PM - 20.86
07:00 AM - 23.88
02:00 AM - 24.68


## Normalize the Frequency Table

Our objective for this next part is to find the hour that is best for having high average comment consistently. In this situation we will weigh both high number of comments and consistency the same. Normalizing the frequency tables will allow us to compare the performance of average comments and standard deviation for each hour. The range of the tables will be converted into a 0-1 scale. We can then create a measurement to find which hour will give us the best of both high average comments and high consistency. Below is a function that normalizes the values in a frequency table.

In [17]:
def normalizeFreqTable(freqTable):
    tempList = []
    for item in freqTable:    
        tempList.append(freqTable[item])
    maxNum = max(tempList)
    minNum = min(tempList)

    freqTableNorm = freqTable.copy()

    for item in freqTableNorm:
        freqTableNorm[item] = round((freqTableNorm[item]-minNum)/(maxNum-minNum),2)
        
    return freqTableNorm

Below you can see how the function normalizes the values in the dictionary. The first table is the original average comments by hour, The second table is normalizing the first table. 3pm had the highest average comments at 28.68 and so it gets the value of 1 when normalized. The lowest average comment value, 6.65 at 9am, now gets the value of 0. All the other values from the original Ask HN frequency table is in between 0 and 1. 

In [18]:
askHNcommentAvgByHourNorm = normalizeFreqTable(askHNcommentAvgByHour)

print("Ask HN Average Comments")
for item in askHNcommentAvgByHour: 
    print(item, '-', askHNcommentAvgByHour[item])
print("\n")
print("Ask HN Average Comments Normalized")
for item in askHNcommentAvgByHourNorm: 
    print(item, '-', askHNcommentAvgByHourNorm[item])

Ask HN Average Comments
03:00 PM - 28.68
01:00 PM - 16.32
12:00 PM - 12.38
02:00 AM - 11.14
10:00 AM - 10.68
04:00 AM - 9.71
02:00 PM - 9.69
05:00 PM - 9.45
08:00 AM - 9.19
11:00 AM - 8.96
10:00 PM - 8.8
05:00 AM - 8.79
08:00 PM - 8.75
09:00 PM - 8.69
03:00 AM - 7.95
06:00 PM - 7.94
04:00 PM - 7.71
12:00 AM - 7.56
01:00 AM - 7.41
07:00 PM - 7.16
07:00 AM - 7.01
06:00 AM - 6.78
11:00 PM - 6.7
09:00 AM - 6.65


Ask HN Average Comments Normalized
03:00 PM - 1.0
01:00 PM - 0.44
12:00 PM - 0.26
02:00 AM - 0.2
10:00 AM - 0.18
04:00 AM - 0.14
02:00 PM - 0.14
05:00 PM - 0.13
08:00 AM - 0.12
11:00 AM - 0.1
10:00 PM - 0.1
05:00 AM - 0.1
08:00 PM - 0.1
09:00 PM - 0.09
03:00 AM - 0.06
06:00 PM - 0.06
04:00 PM - 0.05
12:00 AM - 0.04
01:00 AM - 0.03
07:00 PM - 0.02
07:00 AM - 0.02
06:00 AM - 0.01
11:00 PM - 0.0
09:00 AM - 0.0


In [19]:
showHNcommentAvgByHourNorm = normalizeFreqTable(showHNcommentAvgByHour)
askHNcommentStDevByHourNorm = normalizeFreqTable(askHNcommentStDevByHour)
showHNcommentStDevByHourNorm = normalizeFreqTable(showHNcommentStDevByHour)

for item in askHNcommentStDevByHourNorm: 
    print(item, '-', askHNcommentStDevByHourNorm[item])

11:00 PM - 0.0
06:00 AM - 0.01
09:00 AM - 0.02
03:00 AM - 0.04
07:00 PM - 0.05
01:00 AM - 0.07
07:00 AM - 0.07
12:00 AM - 0.08
08:00 AM - 0.1
10:00 AM - 0.1
11:00 AM - 0.11
06:00 PM - 0.11
02:00 PM - 0.12
10:00 PM - 0.16
04:00 PM - 0.19
09:00 PM - 0.19
12:00 PM - 0.21
08:00 PM - 0.21
05:00 AM - 0.23
05:00 PM - 0.23
04:00 AM - 0.3
02:00 AM - 0.41
01:00 PM - 0.43
03:00 PM - 1.0


## Performance Metric
Below is a function that uses the two dictionaries we created to make a new metric. We will use the <code>commentavgNorm</code> and <code>StdDevNorm</code>. The formula is as follows:

**Performance Metric = commentavgNorm-StdDevNorm**

For example, if the 5:00 pm hour has a normalized value of .90 for average comments and a normalized value of .70 for standard deviation. Then the performance indicator value  would be .30. We are assigning a positive value to the normalized value for average comments because higher is better. On the other hand we are assigning a negative value to the normalized value for standard deviation because lower is better. Lower standard deviation means more consistent. With this metric the perfect hour would be a performance metric of 1 this means it has the highest comment average and the lowest standard deviation. On the pother hand the worst hour would be a performance metric of -1. This would indicate a that the hour has the lowest comment average and the highest standard deviation.


In [20]:
def indicatorDict(AvgByHourNorm, StDevByHourNorm):
    indicatorDict = {}
    for hour in AvgByHourNorm:
        indicatorDict[hour] = round(AvgByHourNorm[hour]-StDevByHourNorm[hour],2)
        
    indicatorDict = {k: v for k, v in sorted(indicatorDict.items(), key=lambda item: item[1],reverse=True)}
        
    return indicatorDict

In [21]:
askHNcommentPerfIndicator = indicatorDict(askHNcommentAvgByHourNorm, askHNcommentStDevByHourNorm)
showHNcommentPerfIndicator = indicatorDict(showHNcommentAvgByHourNorm, showHNcommentStDevByHourNorm)

print("Ask HN Performance Indicator")
for item in askHNcommentPerfIndicator: 
    print(item, '-', askHNcommentPerfIndicator[item])
print("\n")
print("Show HN Performance Indicator")
for item in showHNcommentPerfIndicator: 
    print(item, '-', showHNcommentPerfIndicator[item])

Ask HN Performance Indicator
10:00 AM - 0.08
12:00 PM - 0.05
02:00 PM - 0.02
08:00 AM - 0.02
03:00 AM - 0.02
01:00 PM - 0.01
03:00 PM - 0.0
06:00 AM - 0.0
11:00 PM - 0.0
11:00 AM - -0.01
09:00 AM - -0.02
07:00 PM - -0.03
12:00 AM - -0.04
01:00 AM - -0.04
06:00 PM - -0.05
07:00 AM - -0.05
10:00 PM - -0.06
05:00 PM - -0.1
09:00 PM - -0.1
08:00 PM - -0.11
05:00 AM - -0.13
04:00 PM - -0.14
04:00 AM - -0.16
02:00 AM - -0.21


Show HN Performance Indicator
06:00 AM - 0.3
12:00 PM - 0.26
03:00 AM - 0.26
11:00 AM - 0.25
04:00 AM - 0.19
01:00 AM - 0.14
10:00 AM - 0.1
01:00 PM - 0.09
12:00 AM - 0.07
03:00 PM - 0.06
08:00 AM - 0.03
08:00 PM - 0.03
10:00 PM - 0.02
04:00 PM - 0.01
11:00 PM - 0.01
07:00 PM - -0.02
07:00 AM - -0.04
05:00 PM - -0.05
05:00 AM - -0.05
09:00 PM - -0.06
09:00 AM - -0.11
02:00 PM - -0.13
06:00 PM - -0.16
02:00 AM - -0.52


## Conclusion

Here are the results if you are looking for high comment average and low standard deviation.


For Ask HN you should post between 10:00 am and 10:59 am. This will give you an average comment count of 10.68 and standard deviation of 23.7 which are ranked 5 and 10 respectively.

For Show HN you should post between 6:00 am and 6:59 am. This will give you an average comment count of 4.71 and standard deviation of 11.02 which are ranked 12 and 5 respectively.