### Hacker News Posts

This project is going to analyse Hacker News posts in order to answer the following two questions:
1. Do *ask* or *show* posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

*ask* posts are submitted to ask the Hacker News community a specific question, while *show* posts show the community a project, product, or interesting topic.

#### Background
This notebook was prepared for the Dataquest "Python - Data Science Fundamentals" course. The "Ref. page" headings below refer to the page of the project requirements that each step addresses.

We start our analysis by importing the required libraries and creating a helper function for printing lists. 

In [3]:
# Library requirements
import csv

# Import data (the with block closes the file once it has been read)
with open('hacker_news.csv', newline='') as hn_opened:
    hn = list(csv.reader(hn_opened))

# Initialise explore_data function
# This function is taken from the last guided project
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        # print('\n') writes two newlines, leaving two blank lines after each row
        print('\n') 

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Let's have a look at the first five rows of the `hacker_news` csv file, including the header row.

In [4]:
explore_data(hn, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


Number of rows: 20101
Number of columns: 7


#### Ref. page 2

The first row is a header. Let's extract the header row to a separate variable and then remove it from the `hn` list.

In [5]:
headers = hn[0] 
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [6]:
hn = hn[1:]

In [7]:
explore_data(hn, 0, 5, True)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Number of rows: 20100
Number of columns: 7


No more header, that looks good.
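As an aside, the header can also be split off while the file is being read, by calling `next()` on the reader before collecting the remaining rows. The sketch below is only an illustration: it uses a small in-memory sample (the names `sample_csv`, `sample_headers` and `sample_rows` and the sample data are made up) rather than the real `hacker_news.csv`.

```python
import csv
import io

# Hypothetical in-memory sample standing in for hacker_news.csv
sample_csv = io.StringIO(
    "id,title,num_comments\n"
    "1,Ask HN: Example question?,5\n"
    "2,Show HN: Example project,3\n"
)

reader = csv.reader(sample_csv)
sample_headers = next(reader)   # consume the first row as the header
sample_rows = list(reader)      # the remaining rows are data only

print(sample_headers)
print(len(sample_rows))
```

This gives the same result as slicing with `hn[1:]`, without keeping the header in the data list at any point.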

#### Ref. page 3
We now want to split the posts into three lists: *ask*, *show* and *other*. The loop below lowercases each title, assigns posts whose titles start with `ask` or `show` to the matching list, puts everything else in *other*, and prints the number of records in each list.

In [8]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask'):
        ask_posts.append(row)
    elif title.startswith('show'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Ask posts:", len(ask_posts))
print("Show posts:", len(show_posts))
print("Other posts:", len(other_posts))

# Check totals
total_posts = len(ask_posts) + len(show_posts) + len(other_posts)
print("Total posts:", total_posts)

Ask posts: 1756
Show posts: 1164
Other posts: 17180
Total posts: 20100
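One caveat about the loop above: the prefixes `ask` and `show` also match titles such as "Asking for a friend" or "Showdown at noon". The sketch below compares that loose match with a stricter one. The stricter prefixes `ask hn` and `show hn`, the function `classify_title` and the sample titles are assumptions made for illustration; they are not part of the notebook, and the effect on the real counts has not been measured here.

```python
def classify_title(title, strict=False):
    """Return 'ask', 'show', or 'other' for a post title.

    strict=False mirrors the notebook's prefixes ('ask', 'show').
    strict=True uses the assumed tighter prefixes 'ask hn' and 'show hn'.
    """
    lowered = title.lower()
    ask_prefix, show_prefix = ('ask hn', 'show hn') if strict else ('ask', 'show')
    if lowered.startswith(ask_prefix):
        return 'ask'
    if lowered.startswith(show_prefix):
        return 'show'
    return 'other'

# Made-up titles for illustration only
sample_titles = [
    'Ask HN: How do you learn?',
    'Asking for a friend',
    'Show HN: My app',
    'Showdown at noon',
]

loose_labels = [classify_title(t) for t in sample_titles]
strict_labels = [classify_title(t, strict=True) for t in sample_titles]

print(loose_labels)
print(strict_labels)
```

If the two approaches gave noticeably different counts on the real data, the stricter prefixes would probably be the safer choice.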


#### Ref. page 4

We have been asked to determine whether ask posts or show posts receive more comments on average.

In [9]:
# Ask comments
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)

# Show comments
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)

print("Average ask post comments:", avg_ask_comments)
print("Average show post comments:", avg_show_comments)


Average ask post comments: 14.203302961275627
Average show post comments: 10.323024054982818


On average, ask posts receive about 14.2 comments while show posts receive about 10.3, so ask posts attract more comments per post.
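The two averaging loops follow the same pattern, so they could be folded into one helper. The following is a minimal sketch under that assumption: the function name `average_comments`, its `comments_index` parameter and the sample rows are hypothetical, and the rows only imitate the notebook's column order.

```python
def average_comments(posts, comments_index=4):
    """Mean of the integer comment counts stored at comments_index in each row."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the notebook's column order
sample_posts = [
    ['1', 'Ask HN: A?', '', '10', '6', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '3', '2', 'user_b', '8/4/2016 12:10'],
]

print(average_comments(sample_posts))
```

With the notebook's lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.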

#### Ref. page 5

Moving on, the next phase of analysis will focus on ask posts and determine if there are any trends based on the hour of day these are posted.

The code below populates two dictionaries keyed by hour: one holds the number of ask posts submitted in each hour, and the other holds the total number of comments those posts received.

In [10]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_str = row[0]
    date = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    comments = row[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

In [11]:
# View the dictionaries
print(counts_by_hour)
print(comments_by_hour)

{'12': 73, '00': 55, '19': 112, '17': 100, '06': 44, '18': 109, '15': 116, '21': 109, '16': 109, '09': 45, '10': 60, '23': 70, '07': 34, '11': 58, '08': 48, '22': 71, '13': 86, '14': 110, '02': 58, '04': 48, '03': 55, '20': 80, '01': 60, '05': 46}
{'12': 687, '00': 447, '19': 1295, '17': 1146, '06': 397, '18': 1439, '15': 4477, '21': 1745, '16': 1949, '09': 251, '10': 794, '23': 713, '07': 267, '11': 641, '08': 492, '22': 479, '13': 1254, '14': 1420, '02': 1381, '04': 339, '03': 459, '20': 1722, '01': 683, '05': 464}
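The same tallying can be written without the `if hour not in ...` branch by using `collections.defaultdict`, which starts every missing key at zero. This is a sketch of that alternative, not the notebook's method; the sample pairs and the `sample_*` names are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the notebook's date format
sample_results = [
    ['8/4/2016 11:52', 6],
    ['8/4/2016 11:05', 2],
    ['9/30/2015 4:12', 10],
]

sample_counts = defaultdict(int)     # posts per hour
sample_comments = defaultdict(int)   # total comments per hour

for created_at, num_comments in sample_results:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")     # zero-padded hour, e.g. '04'
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```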


#### Ref. page 6

We now want to determine the average comments per post for each hour of the day. 

Below we create a list of lists in which the first element is the hour, and the second element is the average number of comments per post.

In [14]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

In [15]:
# View the resultant list
print(avg_by_hour)

[['12', 9.41095890410959], ['00', 8.127272727272727], ['19', 11.5625], ['17', 11.46], ['06', 9.022727272727273], ['18', 13.20183486238532], ['15', 38.5948275862069], ['21', 16.009174311926607], ['16', 17.880733944954127], ['09', 5.5777777777777775], ['10', 13.233333333333333], ['23', 10.185714285714285], ['07', 7.852941176470588], ['11', 11.051724137931034], ['08', 10.25], ['22', 6.746478873239437], ['13', 14.581395348837209], ['14', 12.909090909090908], ['02', 23.810344827586206], ['04', 7.0625], ['03', 8.345454545454546], ['20', 21.525], ['01', 11.383333333333333], ['05', 10.08695652173913]]
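As a spot check of the arithmetic, the 15:00 hour has 4477 comments across 116 posts, which gives 4477 / 116 ≈ 38.59 comments per post. The sketch below repeats the division for three hours copied from the dictionaries printed earlier; the names `counts_subset`, `comments_subset` and `avg_subset` are introduced here only for the example.

```python
# Three hours copied from the dictionaries printed earlier in the notebook
counts_subset = {'15': 116, '02': 58, '09': 45}
comments_subset = {'15': 4477, '02': 1381, '09': 251}

# Average comments per post for each hour
avg_subset = {
    hour: comments_subset[hour] / counts_subset[hour]
    for hour in counts_subset
}

for hour, avg in avg_subset.items():
    print(hour, round(avg, 2))
```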


#### Ref. page 7

To tidy things up, we sort the list and print the five hours with the highest average number of comments per post.

In [16]:
swap_avg_by_hour = []

for row in avg_by_hour:
    avg = row[1]
    hour = row[0]
    swap_avg_by_hour.append([avg, hour])

In [17]:
print(swap_avg_by_hour)

[[9.41095890410959, '12'], [8.127272727272727, '00'], [11.5625, '19'], [11.46, '17'], [9.022727272727273, '06'], [13.20183486238532, '18'], [38.5948275862069, '15'], [16.009174311926607, '21'], [17.880733944954127, '16'], [5.5777777777777775, '09'], [13.233333333333333, '10'], [10.185714285714285, '23'], [7.852941176470588, '07'], [11.051724137931034, '11'], [10.25, '08'], [6.746478873239437, '22'], [14.581395348837209, '13'], [12.909090909090908, '14'], [23.810344827586206, '02'], [7.0625, '04'], [8.345454545454546, '03'], [21.525, '20'], [11.383333333333333, '01'], [10.08695652173913, '05']]


In [18]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

In [19]:
print(sorted_swap)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [17.880733944954127, '16'], [16.009174311926607, '21'], [14.581395348837209, '13'], [13.233333333333333, '10'], [13.20183486238532, '18'], [12.909090909090908, '14'], [11.5625, '19'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.25, '08'], [10.185714285714285, '23'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.345454545454546, '03'], [8.127272727272727, '00'], [7.852941176470588, '07'], [7.0625, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
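Swapping the columns is one way to sort by the average. An alternative is to pass a `key` function to `sorted`, which leaves the `[hour, average]` order intact. The sketch below uses four pairs copied from `avg_by_hour`; the names `sample_avg_by_hour` and `sorted_by_avg` are introduced here for illustration.

```python
# A few [hour, average] pairs copied from avg_by_hour
sample_avg_by_hour = [
    ['12', 9.41095890410959],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['09', 5.5777777777777775],
]

# Sort on the second element (the average) without building a swapped list
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

print(sorted_by_avg)
```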


In [57]:
for row in range(5):
    hour = dt.datetime.strptime(sorted_swap[row][1], "%H")
    # "%H:%M" is the portable spelling of the platform-dependent "%R"
    fmtd_hour = hour.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(fmtd_hour, sorted_swap[row][0]))
    

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 17.88 average comments per post
21:00: 16.01 average comments per post
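Because the hour is already a zero-padded string, the same lines can be produced without parsing it back into a `datetime`. This is a small sketch of that shortcut using the top two entries from `sorted_swap`; `top_hours` and `lines` are names made up for the example.

```python
# Top two [average, hour] pairs copied from sorted_swap
top_hours = [[38.5948275862069, '15'], [23.810344827586206, '02']]

# Append ":00" to the hour string and round the average to two decimals
lines = [f"{hour}:00: {avg:.2f} average comments per post" for avg, hour in top_hours]

for line in lines:
    print(line)
```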


This small piece of analysis set out to answer the following two questions in relation to Hacker News posts:
1. Do *ask* or *show* posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

Based on the above analysis, we can conclude that *ask* posts receive more comments on average than *show* posts (about 14.2 versus 10.3 per post).

Further, *ask* posts created during the 15:00 hour (3 o'clock in the afternoon) elicit the most comments on average, at about 38.6 per post, well ahead of the next-best hour, 02:00, at about 23.8.

Looks like it's the afternoon lull that has techies commenting on Hacker News.....