# Project: Exploring Hacker News Posts

Objectives:
- Compare two types of posts, `Ask HN` and `Show HN`, to determine the following:
    a. Do `Ask HN` or `Show HN` posts receive more comments on average?
    b. Do posts created at a certain time receive more comments on average?

The data set is available on Kaggle [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

In [14]:
from csv import reader

In [15]:
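# Read the CSV inside a context manager so the file is closed after loading
with open('hacker_news.csv') as opn_hn:
    rd_hn = reader(opn_hn)
    hn = list(rd_hn)
headers = hn[0]  # first row holds the column names
hn = hn[1:]      # remaining rows are the posts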

In [65]:
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [17]:
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

In [40]:
ask_posts = []
show_posts = []
other_posts = []

In [41]:
for rows in hn:
    title = rows[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(rows)
    elif title.lower().startswith('show hn'):
        show_posts.append(rows)
    else:
        other_posts.append(rows)

In [50]:
len(ask_posts)

1744

In [49]:
len(show_posts)

1162

In [51]:
len(other_posts)

17194

In [54]:
#calculate the total number of comments in ask posts
total_ask_comments = 0
for rows in ask_posts:
    comm = int(rows[4])
    total_ask_comments += comm

In [55]:
total_ask_comments

24483

In [58]:
avg_ask_comments = total_ask_comments/len(ask_posts)
avg_ask_comments

14.038417431192661

In [59]:
#calculate the total number of comments in show posts
total_show_comments = 0
for rows in show_posts:
    comm = int(rows[4])
    total_show_comments += comm

In [60]:
total_show_comments

11988

In [61]:
avg_show_comments = total_show_comments/len(show_posts)
avg_show_comments

10.31669535283993

On average, ask posts in this data set receive about 14 comments each, while show posts receive about 10. Ask posts therefore receive more comments on average than show posts.
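
The two averages above are computed with the same loop written twice. As a minimal sketch, the calculation could be wrapped in a helper; the function name `average_comments`, its `comments_index` parameter, and the sample rows are assumptions made for illustration and are not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across posts, or 0.0 if posts is empty."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up sample rows in the same column layout as the data set
sample_posts = [
    ['1', 'Ask HN: example one', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: example two', '', '3', '10', 'user_b', '11/22/2015 13:43'],
]
print(average_comments(sample_posts))  # 8.0
```

With the notebook's lists, `average_comments(ask_posts)` and `average_comments(show_posts)` should give the same figures as the cells above, since the arithmetic is unchanged.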

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
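
The two steps above can be sketched in a single pass over the rows. This is only an illustration under stated assumptions: the function name `hourly_comment_stats`, its parameters, and the sample rows are hypothetical, and `collections.defaultdict` stands in for the explicit `if ... not in` checks used in the cells that follow.

```python
import datetime as dt
from collections import defaultdict


def hourly_comment_stats(posts, date_index=6, comments_index=4):
    """Return {hour: (post_count, total_comments, average_comments)}.

    Hours are zero-padded 'HH' strings, matching the notebook's dictionaries.
    """
    counts = defaultdict(int)
    comments = defaultdict(int)
    for row in posts:
        created = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    return {h: (counts[h], comments[h], comments[h] / counts[h]) for h in counts}


# Made-up sample rows in the same column layout as the data set
sample_rows = [
    ['1', 'Ask HN: a', '', '1', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: b', '', '1', '2', 'user_b', '5/2/2016 9:14'],
    ['3', 'Ask HN: c', '', '1', '29', 'user_c', '11/22/2015 13:43'],
]
print(hourly_comment_stats(sample_rows))
# {'09': (2, 8, 4.0), '13': (1, 29, 29.0)}
```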

In [63]:
import datetime as dt

In [107]:
result_list = []

In [109]:
for rows in ask_posts:
    result_list.append([rows[6],int(rows[4])])

In [111]:
result_list[:3] #check result_list

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

In [112]:
counts_by_hour = {}
comments_by_hour = {}

In [114]:
format_string = "%m/%d/%Y %H:%M"
for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, format_string).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment

`counts_by_hour`: contains the number of ask posts created during each hour of the day.

In [115]:
counts_by_hour #check the dict

{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}

`comments_by_hour`: contains the total number of comments received by ask posts created during each hour of the day.

In [116]:
comments_by_hour #check the dict

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

In [117]:
avg_by_hour = []

In [118]:
for hour in counts_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour]/counts_by_hour[hour]])

In [119]:
avg_by_hour[:5]

[['03', 7.796296296296297],
 ['16', 16.796296296296298],
 ['08', 10.25],
 ['00', 8.127272727272727],
 ['20', 21.525]]

In [120]:
swap_avg_by_hour = []

In [121]:
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

In [123]:
swap_avg_by_hour[:5]

[[7.796296296296297, '03'],
 [16.796296296296298, '16'],
 [10.25, '08'],
 [8.127272727272727, '00'],
 [21.525, '20']]

In [124]:
sorted_swap = sorted(swap_avg_by_hour,reverse=True)
print(sorted_swap[:5])

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]
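
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so `[hour, average]` pairs can be sorted on their second element directly. The list `avg_sample` and its values below are made up for illustration.

```python
from operator import itemgetter

# Made-up [hour, average] pairs in the same shape as avg_by_hour
avg_sample = [['03', 7.5], ['16', 16.75], ['08', 10.25], ['20', 21.5]]

# Sort on the average (index 1), highest first, without building a swapped list
top = sorted(avg_sample, key=itemgetter(1), reverse=True)
print(top[:2])  # [['20', 21.5], ['16', 16.75]]
```

One difference to keep in mind: with the swapped list, ties on the average are broken by the hour string, whereas sorting with `key` keeps tied items in their original order.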


In [128]:
formatstr = '{a}:00 {b:.2f} average comments per post'
for row in sorted_swap[:5]:
    b, a = row  # each row is [average, hour]; hour is already a zero-padded 'HH' string
    print(formatstr.format(a=a, b=b))

15:00 38.59 average comments per post
02:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.80 average comments per post
21:00 16.01 average comments per post
