# Guided Project: Exploring Hacker News Posts

For this project, I worked with a data set of posts from Hacker News. The data set originally consisted of 300,000 rows but was reduced to about 20,000 rows by removing submissions with no comments and taking a random sample of the rest. The columns of the data set are id, title, url, num_points, num_comments, author, and created_at. I was most interested in posts whose titles begin with 'Ask HN' or 'Show HN'. The project set out to answer two questions:

1) Do posts whose titles start with 'Ask HN' or 'Show HN' receive more comments on average?

2) Do posts created at a certain time receive more comments on average?

### Step 1: Convert Data to List of Lists

In [1]:
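#Step 1: Create list of lists of the data
from csv import reader
#Open the file in a with-block so it is closed once the rows are read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
#Print the first five rows (the header row is still included at this point)
print(hn[:5])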



[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
#Store the header row separately, then remove it from the data
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Step 2: Count the Ask, Show and Other Posts

In [3]:
#Create empty lists of ask_posts, show_posts, other_posts
ask_posts = []
show_posts = []
other_posts = []
for row in hn: 
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


### Step 3: Calculate the Average Number of Comments


In [4]:
#Calculate the ask_comments average
total_ask_comments = 0
for row in ask_posts: 
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
#Calculate the show_comments average
total_show_comments = 0
for row in show_posts: 
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


The output of the last cell shows that ask posts receive more comments on average (about 14.04) than show posts (about 10.32).
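The two loops in the last cell do the same work, so they could be folded into one helper. The following is a minimal sketch under that assumption: `average_comments` and the toy rows are hypothetical, not part of the original notebook, and the comment count is assumed to sit at index 4, as it does in `hn`.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments over a list of post rows."""
    if not posts:
        # Avoid dividing by zero when a category has no posts
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Toy rows in the same column order as hn (made-up values, for illustration only)
toy_posts = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example B', '', '3', '4', 'user_b', '8/4/2016 12:10'],
]
print(average_comments(toy_posts))  # 7.0
```

With a helper like this, the same call would work for both `ask_posts` and `show_posts`.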

### Step 4: Calculate the Number of Ask Posts and Comments by Hour

Since ask posts get more comments than show posts on average, the rest of the analysis focuses on ask posts and checks whether those created at a certain time are more likely to attract comments. Answering this takes two parts:

1) Count the ask posts created in each hour of the day, along with the number of comments they received.

2) Calculate the average number of comments ask posts receive, by hour created.
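The two parts above can be sketched on a tiny hand-made sample before running them on the full data. This is an illustrative sketch only: the sample rows are made up, the `toy_` names are hypothetical, and `collections.defaultdict` is used here as one possible way to accumulate the totals.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, n_comments] pairs in the data set's date format
sample = [
    ['8/4/2016 11:52', 6],
    ['8/4/2016 11:05', 2],
    ['9/30/2015 4:12', 3],
]

toy_counts = defaultdict(int)    # part 1: number of posts per hour
toy_comments = defaultdict(int)  # part 1: total comments per hour

for created_at, n_comments in sample:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. '04'
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    toy_counts[hour] += 1
    toy_comments[hour] += n_comments

# Part 2: average comments per post for each hour
toy_avg = {hour: toy_comments[hour] / toy_counts[hour] for hour in toy_counts}
print(toy_avg)  # {'11': 4.0, '04': 3.0}
```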

To start, we'll work through the first part.

In [5]:
#Import the datetime module as dt
import datetime as dt
#Pair each ask post's creation time with its number of comments
result_list = []
for row in ask_posts:
    created_at = row[6]
    n_comments = int(row[4])
    result_list.append([created_at, n_comments])

#counts_by_hour: number of ask posts per hour
#comments_by_hour: total number of comments per hour
counts_by_hour = {}
comments_by_hour = {}
date_form = '%m/%d/%Y %H:%M'

for row in result_list:
    date = row[0]
    comment = row[1]
    #Parse the timestamp and keep only the hour as a zero-padded string
    hour = dt.datetime.strptime(date, date_form).strftime('%H')
    if hour in counts_by_hour:
        comments_by_hour[hour] += comment
        counts_by_hour[hour] += 1
    else:
        comments_by_hour[hour] = comment
        counts_by_hour[hour] = 1

print(comments_by_hour)
print(counts_by_hour)
    

{'19': 1188, '01': 683, '14': 1416, '02': 1381, '06': 397, '07': 267, '12': 687, '21': 1745, '18': 1439, '08': 492, '10': 793, '00': 447, '23': 543, '22': 479, '11': 641, '20': 1722, '09': 251, '13': 1253, '03': 421, '17': 1146, '04': 337, '16': 1814, '05': 464, '15': 4477}
{'19': 110, '01': 60, '14': 107, '02': 58, '06': 44, '07': 34, '12': 73, '21': 109, '18': 109, '08': 48, '10': 59, '00': 55, '23': 68, '22': 71, '11': 58, '20': 80, '09': 45, '13': 85, '03': 54, '17': 100, '04': 47, '16': 108, '05': 46, '15': 116}


In [6]:
#Build a list of [hour, average comments per ask post] pairs
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(avg_by_hour)

[['19', 10.8], ['01', 11.383333333333333], ['14', 13.233644859813085], ['02', 23.810344827586206], ['06', 9.022727272727273], ['07', 7.852941176470588], ['12', 9.41095890410959], ['21', 16.009174311926607], ['18', 13.20183486238532], ['08', 10.25], ['10', 13.440677966101696], ['00', 8.127272727272727], ['23', 7.985294117647059], ['22', 6.746478873239437], ['11', 11.051724137931034], ['20', 21.525], ['09', 5.5777777777777775], ['13', 14.741176470588234], ['03', 7.796296296296297], ['17', 11.46], ['04', 7.170212765957447], ['16', 16.796296296296298], ['05', 10.08695652173913], ['15', 38.5948275862069]]


In [8]:
#Swap the columns to [average, hour] so that sorting orders the rows by average
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
#Sort in descending order so the hours with the highest averages come first
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 Hours for 'Ask HN' Posts Comments")
for avg, hour in sorted_swap[:5]:
    print(
        '{}: {:.2f} average comments per post'.format(dt.datetime.strptime(hour, '%H').strftime('%H : %M'), avg)
         )
    


[[10.8, '19'], [11.383333333333333, '01'], [13.233644859813085, '14'], [23.810344827586206, '02'], [9.022727272727273, '06'], [7.852941176470588, '07'], [9.41095890410959, '12'], [16.009174311926607, '21'], [13.20183486238532, '18'], [10.25, '08'], [13.440677966101696, '10'], [8.127272727272727, '00'], [7.985294117647059, '23'], [6.746478873239437, '22'], [11.051724137931034, '11'], [21.525, '20'], [5.5777777777777775, '09'], [14.741176470588234, '13'], [7.796296296296297, '03'], [11.46, '17'], [7.170212765957447, '04'], [16.796296296296298, '16'], [10.08695652173913, '05'], [38.5948275862069, '15']]
Top 5 Hours for 'Ask HN' Posts Comments
15 : 00: 38.59 average comments per post
02 : 00: 23.81 average comments per post
20 : 00: 21.52 average comments per post
16 : 00: 16.80 average comments per post
21 : 00: 16.01 average comments per post
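Swapping the columns is one way to sort by average. As an alternative sketch, `sorted` accepts a `key` function, so `[hour, average]` pairs can be sorted directly without building a second list. The three pairs below are copied from the `avg_by_hour` output above; the name `sample_avg_by_hour` is introduced here only for illustration.

```python
# Three [hour, average] pairs taken from the avg_by_hour output
sample_avg_by_hour = [
    ['19', 10.8],
    ['02', 23.810344827586206],
    ['15', 38.5948275862069],
]

# Sort by the second element (the average), highest first
top = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top:
    print('{}:00 {:.2f} average comments per post'.format(hour, avg))
```

Sorting with `key` keeps each pair in its original `[hour, average]` order, which may be easier to read when the list is reused later.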


### Final Results: 

The best hours to create an 'Ask HN' post for a higher chance of receiving comments are 15:00 (3 PM), 02:00 (2 AM), 20:00 (8 PM), 16:00 (4 PM) and 21:00 (9 PM). Each of these hours averages more comments per post than the overall ask-post average of about 14.04, which suggests that ask posts created at certain times do receive more comments than average.
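The comparison against the overall average can be checked directly. The sketch below hard-codes the figures printed in Steps 3 and 4; the names `overall_avg` and `top_hours` are hypothetical and not part of the original notebook.

```python
# Overall average comments per ask post (from the Step 3 output)
overall_avg = 14.038417431192661

# Top five hourly averages (from the Step 4 output)
top_hours = {
    '15': 38.5948275862069,
    '02': 23.810344827586206,
    '20': 21.525,
    '16': 16.796296296296298,
    '21': 16.009174311926607,
}

# Express each hourly average as a multiple of the overall average
ratios = {hour: avg / overall_avg for hour, avg in top_hours.items()}
for hour, ratio in ratios.items():
    print('{}:00 averages {:.2f}x the overall ask-post average'.format(hour, ratio))

# True only if every listed hour beats the overall average
print(all(avg > overall_avg for avg in top_hours.values()))
```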