# Hacker News Data Analysis 

We're specifically interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question.

> Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:

> Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts to determine the following:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

### Summary of Results
1. On average, **Ask HN posts receive more comments.** Show HN posts average about ten comments per post, while Ask HN posts average about fourteen.
2. Yes, **Ask HN posts created at certain times receive more comments on average. Specifically, Ask HN posts created at 3pm (EST) receive ~24 more comments than the overall Ask HN average.**

In [1]:
from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file afterwards.
with open('hacker_news.csv') as openf:
    hn = list(reader(openf))
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers,'\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


# Part One: Filtering Data

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [4]:
print(ask_posts[:2])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]


In [5]:
print(show_posts[:2])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']]


In [6]:
def TotalandAverageNumberofCommentCalc(dataset,index,calculateaverage=False):
    # Sum the integer values stored (as strings) at position `index` of every row.
    total_comments = 0
    for row in dataset:
        total_comments += int(row[index])
    # Return only after the loop has finished, so that every row is counted.
    if not calculateaverage:
        return total_comments
    # Also return the mean per row (the dataset must not be empty).
    avg_comments = total_comments/len(dataset)
    return total_comments,avg_comments

ask_posts_totalavg = TotalandAverageNumberofCommentCalc(ask_posts,4,True)
show_posts_totalavg = TotalandAverageNumberofCommentCalc(show_posts,4,True)
print("Total Number of Comments for Ask Posts: {totalcomments:,}\nAverage Number of Comments for Ask Posts: {avgcomments:,.0f}."
      .format(totalcomments=ask_posts_totalavg[0],
              avgcomments=ask_posts_totalavg[1]))
print('------------------------------------------------')
print("Total Number of Comments for Show Posts: {totalcomments:,}\nAverage Number of Comments for Show Posts: {avgcomments:,.0f}."
      .format(totalcomments=show_posts_totalavg[0],
              avgcomments=show_posts_totalavg[1]))

Total Number of Comments for Ask Posts: 24,483
Average Number of Comments for Ask Posts: 14.
------------------------------------------------
Total Number of Comments for Show Posts: 11,988
Average Number of Comments for Show Posts: 10.


On average, Ask HN posts receive more comments. This makes sense: questions invite answers, whereas Show HN posts mostly attract feedback. Note that the gap between the averages is small, about four comments per post, and this notebook does not test whether it is statistically significant.
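
One way to check whether a gap in averages like this could be due to chance is a permutation test. The block below is a minimal sketch of that idea, not part of the original analysis: the function name `permutation_p_value`, its parameters, and the toy comment counts are all assumptions made for illustration.

```python
import random

def permutation_p_value(sample_a, sample_b, n_resamples=2000, seed=0):
    """Estimate a two-sided p-value for the difference in means by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_resamples):
        # Reassign values to the two groups at random and recompute the gap.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    return extreme / n_resamples

# Made-up comment counts, used only to show the call; they are not from the dataset.
toy_ask = [6, 29, 1, 3, 17, 40, 2, 9]
toy_show = [22, 102, 0, 4, 1, 7, 3, 5]
p_value = permutation_p_value(toy_ask, toy_show)
print(p_value)
```

On the real data, the two samples would be built from the comment column, for example `[int(row[4]) for row in ask_posts]` and the same expression for `show_posts`. A small p-value would suggest that the gap is unlikely to come from chance alone.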

Since Ask HN posts receive more comments on average, we will focus on that segment of the data for the rest of the analysis.

# Part Two: Analyzing Frequency of Ask Posts by Time

In [7]:
import datetime as dt

In [8]:
print(ask_posts[:2])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]


In [10]:
def commentsfreq(inputdataset,dateindex,numcommindex):
    # Returns two dicts keyed by zero-padded hour string ("00" to "23"):
    # the number of posts created in that hour, and the total comments on them.
    counts_by_hour = {}
    comments_by_hour = {}
    dateformat = "%m/%d/%Y %H:%M"
    for row in inputdataset:
        date = row[dateindex]
        comments = int(row[numcommindex])
        # Parse the creation timestamp and keep only the hour, e.g. "09".
        hour = dt.datetime.strptime(date,dateformat).strftime("%H")
        if hour not in counts_by_hour:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = comments
        else:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += comments
    return counts_by_hour, comments_by_hour

In [11]:
askpostscounts = commentsfreq(ask_posts,6,4)

In [12]:
countsbyhour = askpostscounts[0]
commentsperhour = askpostscounts[1]
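
To make the hour bucketing easier to follow, here is a small self-contained sketch of the same idea on three rows. The helper name `comments_by_hour_sketch` is hypothetical, and it uses `collections.defaultdict` as an alternative to the explicit `if hour not in ...` check. The first two rows are the Ask HN posts printed earlier; the third row is made up so that one hour contains two posts.

```python
import datetime as dt
from collections import defaultdict

def comments_by_hour_sketch(rows, date_index=6, comments_index=4):
    """Return (posts per hour, total comments per hour), keyed by hour strings like '09'."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for row in rows:
        parsed = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    return dict(counts), dict(comments)

sample_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    # Made-up row, present only so that the 13:00 bucket holds two posts.
    ['0', 'Ask HN: Hypothetical example post?', '', '1', '1', 'example_user', '1/2/2016 13:05'],
]

sample_counts, sample_comments = comments_by_hour_sketch(sample_rows)
print(sample_counts)
print(sample_comments)
```

Dividing each entry of the second dictionary by the matching entry of the first gives the average number of comments per post for that hour.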

In [14]:
print("Total Ask Post Comments by Hour:")
for row in sorted(commentsperhour):
    print("{row} : {count}".format(row=row,count=commentsperhour[row]))

Total Ask Post Comments by Hour:
00 : 447
01 : 683
02 : 1381
03 : 421
04 : 337
05 : 464
06 : 397
07 : 267
08 : 492
09 : 251
10 : 793
11 : 641
12 : 687
13 : 1253
14 : 1416
15 : 4477
16 : 1814
17 : 1146
18 : 1439
19 : 1188
20 : 1722
21 : 1745
22 : 479
23 : 543


In [38]:
avg_by_hour =  []
for hour in commentsperhour:
    avg_by_hour.append([hour,(commentsperhour[hour]/countsbyhour[hour])])
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In [60]:
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1],hour[0]])
sorted_swap = sorted(swap_avg_by_hour,reverse=True)

# print(sorted_swap)

for eachhr in sorted_swap[:5]:
    print("{hr} : {avgcomment:.2f} average comments per post."
          .format(hr=dt.datetime.strptime(eachhr[1],"%H")
                  .strftime("%H:%M"),avgcomment=eachhr[0]))
                  

15:00 : 38.59 average comments per post.
02:00 : 23.81 average comments per post.
20:00 : 21.52 average comments per post.
16:00 : 16.80 average comments per post.
21:00 : 16.01 average comments per post.
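
The swap-then-sort step can also be written without building a second list, by passing a `key` function to `sorted`. The sketch below shows this on a few `[hour, average]` pairs copied from the `avg_by_hour` output; the names `avg_by_hour_sample` and `top_three` are used only for this illustration.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# Sort on the second element (the average), highest first, and keep the top three.
top_three = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top_three:
    print("{}:00 : {:.2f} average comments per post.".format(hour, avg))
```

Sorting with a key leaves each pair in its original `[hour, average]` order, so no swapping back is needed afterwards.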


Based on these results, the best time to post an Ask HN question is 3-4pm (EST), that is, during the 15:00 or 16:00 hour. Posts created in those hours average more than 15 comments each (38.59 and 16.80 respectively), and the 15:00 hour stands well above every other hour. The 02:00 and 20:00 hours also average more than 20 comments per post.
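
Readers in other time zones need to convert these hours. The sketch below converts a whole hour from EST to UTC using only the standard library. It assumes that "EST" means a fixed UTC-5 offset with no daylight saving adjustment; the helper name `est_hour_to_utc` is hypothetical.

```python
import datetime as dt

# Assumption: "EST" is a fixed UTC-5 offset, ignoring daylight saving time.
EST = dt.timezone(dt.timedelta(hours=-5), "EST")

def est_hour_to_utc(hour):
    """Return the UTC hour (0-23) that matches a whole hour given in fixed-offset EST."""
    # The date is arbitrary; with a fixed offset only the hour matters.
    local = dt.datetime(2016, 1, 1, hour, 0, tzinfo=EST)
    return local.astimezone(dt.timezone.utc).hour

for hour in (15, 16):
    print("{:02d}:00 EST -> {:02d}:00 UTC".format(hour, est_hour_to_utc(hour)))
```

If the timestamps in the dataset actually follow US Eastern Time with daylight saving, the offset is UTC-4 for part of the year, and the conversion would need a real time zone database instead of a fixed offset.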

`TODO`
1. Determine if show or ask posts receive more points on average.
2. Determine if posts created at a certain time are more likely to receive more points.
3. Compare your results to the average number of comments and points other posts receive.
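
As a possible starting point for the first TODO item, the sketch below averages the `num_points` column (index 3) for a handful of rows. The helper `average_of_column` is hypothetical, and the sample rows are just the first two Ask HN and Show HN posts printed earlier, so the numbers illustrate the calculation and are not a result for the full dataset.

```python
def average_of_column(rows, index):
    """Mean of the integer values stored as strings at `index`; rows must not be empty."""
    return sum(int(row[index]) for row in rows) / len(rows)

POINTS_INDEX = 3  # position of num_points in each row, per the header row

# First two rows of ask_posts and show_posts, as printed earlier in the notebook.
ask_sample = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
show_sample = [
    ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
     'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'],
    ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/',
     '747', '102', 'dhotson', '11/29/2015 22:46'],
]

ask_avg_points = average_of_column(ask_sample, POINTS_INDEX)
show_avg_points = average_of_column(show_sample, POINTS_INDEX)
print(ask_avg_points, show_avg_points)
```

Running the same helper on the full `ask_posts`, `show_posts`, and `other_posts` lists would answer the first and third TODO items.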