# Data Pre-cleaning

The original data file includes more than 300,000 entries. The data was cleaned in two steps:

1. Remove all submissions without comments
2. Randomly sample from remaining submissions

The final dataset has approximately 20,000 entries.
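
The pre-cleaning code itself is not part of this notebook, so the following is only a minimal sketch of the two steps described above. The function name `preclean`, the `seed` parameter, and the toy rows are assumptions made for illustration.

```python
import random

def preclean(rows, sample_size, seed=0):
    """Drop rows with zero comments, then draw a random sample.

    Assumes each row is a list with num_comments at index 4,
    matching the column layout described in this notebook.
    """
    with_comments = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # seeded so the sample is reproducible
    k = min(sample_size, len(with_comments))
    return rng.sample(with_comments, k)

# Toy rows (hypothetical): id, title, url, num_points, num_comments, author, created_at
toy_rows = [
    ['1', 'Ask HN: a', '', '5', '0', 'u1', '1/1/2016 10:00'],
    ['2', 'Show HN: b', '', '7', '3', 'u2', '1/2/2016 11:00'],
    ['3', 'c', '', '2', '1', 'u3', '1/3/2016 12:00'],
]
cleaned = preclean(toy_rows, sample_size=2)
print(len(cleaned))  # 2: the row with zero comments is dropped
```

Seeding the random generator is a design choice that keeps the sample stable between runs; whether the original sampling was seeded is not stated.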

## Dataset Headings

[0] = id: the unique identifier number

[1] = title: the title of the post

[2] = url: the url that the post links to, if it links to a URL

[3] = num_points: the number of points the post acquired, calculated as upvotes less downvotes

[4] = num_comments: the number of comments made on the post

[5] = author: the username of the person who submitted the post

[6] = created_at: the date and time at which the post was submitted
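
Because the rows are plain lists, the positional indices listed above are easy to mix up. One possible way to make lookups more readable is to map each header name to its index. The `col` dictionary below is a hypothetical helper, not something the notebook itself defines.

```python
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position, e.g. 'num_comments' -> 4
col = {name: index for index, name in enumerate(headers)}

# First data row from the sample output of the dataset
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

num_comments = int(row[col['num_comments']])
print(num_comments)  # 52
```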

## (1) Importing Data

In [1]:
from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file afterwards.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]
hn = hn[1:]

print('A sample of the dataset is below:')
print(headers)
print('=============================================================================')
print(hn[0:4])
print('=============================================================================')

A sample of the dataset is below:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## (2) Data sorting and aggregation
With the data read into our program, we start by filtering the remaining data into three buckets:

1. 'Ask HN'
2. 'Show HN'
3. 'Other'

Additionally, we can calculate the average number of comments for each of these categories.

In [2]:
ask_posts = []
show_posts = []
other_posts = []
for entry in hn:
    title = entry[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(entry)
    elif title.startswith('show hn'):
        show_posts.append(entry)
    else:
        other_posts.append(entry)

num_ask_posts = len(ask_posts)
num_show_posts = len(show_posts)
num_other_posts = len(other_posts)
print('The total number of posts is {}'.format(len(hn)))
print('======================================')
print('The number of ask posts is \t{}'.format(num_ask_posts))
print('The number of show posts is \t{}'.format(num_show_posts))
print('The number of other posts is \t{}'.format(num_other_posts))    

The total number of posts is 20100
The number of ask posts is 	1744
The number of show posts is 	1162
The number of other posts is 	17194


## Data analysis
### (Q-1) What post types receive the most comments?

Next, let's determine if ask posts or show posts receive more comments on average.

In [3]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)  
print(avg_ask_comments)

14.038417431192661


In [4]:
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

10.31669535283993


In [5]:
print('Average comments on Ask HN posts: {:.2f}'.format(avg_ask_comments))
print('Average comments on Show HN posts: {:.2f}'.format(avg_show_comments))

Average comments on Ask HN posts: 14.04
Average comments on Show HN posts: 10.32


As expected, Ask HN posts receive more comments on average than Show HN posts.

### (A-1) Ask posts receive more comments on average than show posts
Based on our analysis, ask posts receive more comments on average (14.04) than show posts (10.32). The average for other posts was not computed, so they are left out of this comparison. The finding is plausible, as those seeking help are specifically soliciting comments.
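
The averaging loops above repeat the same logic for each bucket. A small helper could compute the average for any list of posts, including `other_posts`, whose average is not shown in the outputs above. This is a sketch only: the name `average_comments` and the toy rows are assumptions for illustration.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of rows.

    Assumes num_comments is stored as a string at `comments_index`.
    Returns 0.0 for an empty list to avoid dividing by zero.
    """
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Toy rows (hypothetical) in the same column layout as the dataset
toy_ask = [
    ['1', 'Ask HN: x', '', '3', '4', 'u1', '1/1/2016 10:00'],
    ['2', 'Ask HN: y', '', '1', '6', 'u2', '1/1/2016 11:00'],
]
print(average_comments(toy_ask))  # 5.0
print(average_comments([]))       # 0.0
```

With the notebook's variables, the helper would be called as `average_comments(ask_posts)`, `average_comments(show_posts)`, and `average_comments(other_posts)`.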


### (Q-2) Does time-of-day affect Ask HN post comment number?

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

Calculating average number of comments by hour created

In [6]:
import datetime as dt

In [7]:
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

In [8]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    comments_count = row[1]
    date_string = row[0]
    date_created = dt.datetime.strptime(date_string,"%m/%d/%Y %H:%M")
    hour_created = date_created.hour
    if hour_created in counts_by_hour:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments_count 
    else:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments_count
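
The if/else bookkeeping in the loop above could also be written with `collections.defaultdict`, which starts every missing hour at zero. The sketch below is an alternative under that assumption, not the notebook's own method. The function name `average_comments_by_hour` and the sample pairs are hypothetical.

```python
import datetime as dt
from collections import defaultdict

def average_comments_by_hour(rows):
    """Map hour of day -> average comments per post.

    `rows` is an iterable of (created_at, num_comments) pairs, with
    created_at formatted like '8/4/2016 11:52'.
    """
    counts = defaultdict(int)    # posts created in each hour
    comments = defaultdict(int)  # total comments on those posts
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}

sample = [('8/4/2016 11:52', 4), ('8/5/2016 11:10', 6), ('9/1/2016 15:00', 10)]
averages = average_comments_by_hour(sample)
print(averages)  # {11: 5.0, 15: 10.0}
```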

We now have the number of posts and the total number of comments for each hour, so we can calculate the average number of comments per post for each hour of the day.

In [9]:
print('Posts created by hour: ', counts_by_hour)

print('Total comments added by hour', comments_by_hour)

Posts created by hour:  {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
Total comments added by hour {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}


In [10]:
avg_by_hour = []

for key in counts_by_hour:
    avg_comments = comments_by_hour[key]/counts_by_hour[key]
    avg_by_hour.append([key,avg_comments])

In [11]:
avg_by_hour

[[9, 5.5777777777777775],
 [13, 14.741176470588234],
 [10, 13.440677966101696],
 [14, 13.233644859813085],
 [16, 16.796296296296298],
 [23, 7.985294117647059],
 [12, 9.41095890410959],
 [17, 11.46],
 [15, 38.5948275862069],
 [21, 16.009174311926607],
 [20, 21.525],
 [2, 23.810344827586206],
 [18, 13.20183486238532],
 [3, 7.796296296296297],
 [5, 10.08695652173913],
 [19, 10.8],
 [1, 11.383333333333333],
 [22, 6.746478873239437],
 [8, 10.25],
 [4, 7.170212765957447],
 [0, 8.127272727272727],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [11, 11.051724137931034]]

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list and printing the five highest values in a format that's easier to read.

In [12]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
swap_avg_by_hour

[[5.5777777777777775, 9],
 [14.741176470588234, 13],
 [13.440677966101696, 10],
 [13.233644859813085, 14],
 [16.796296296296298, 16],
 [7.985294117647059, 23],
 [9.41095890410959, 12],
 [11.46, 17],
 [38.5948275862069, 15],
 [16.009174311926607, 21],
 [21.525, 20],
 [23.810344827586206, 2],
 [13.20183486238532, 18],
 [7.796296296296297, 3],
 [10.08695652173913, 5],
 [10.8, 19],
 [11.383333333333333, 1],
 [6.746478873239437, 22],
 [10.25, 8],
 [7.170212765957447, 4],
 [8.127272727272727, 0],
 [9.022727272727273, 6],
 [7.852941176470588, 7],
 [11.051724137931034, 11]]

In [13]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for Ask Posts Comments')

for row in sorted_swap[:5]:
    hour_formatted = dt.datetime.strptime(str(row[1]),'%H')
    hour_formatted += dt.timedelta(hours=2)  # shift from EST (UTC-5) to UTC-3
    hour_formatted = hour_formatted.strftime('%H:%M')

    print('{}: {:.2f} average comments per post'.format(hour_formatted,row[0]))

Top 5 Hours for Ask Posts Comments
17:00: 38.59 average comments per post
04:00: 23.81 average comments per post
22:00: 21.52 average comments per post
18:00: 16.80 average comments per post
23:00: 16.01 average comments per post
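
Swapping the columns before sorting works, but `sorted` also accepts a `key` function, which allows sorting on the average directly. The sketch below uses a few of the `[hour, average]` pairs from the output above. The name `top_three` is introduced here for illustration.

```python
# A subset of the [hour, average comments] pairs computed earlier
avg_by_hour = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [20, 21.525],
    [2, 23.810344827586206],
]

# Sort on the second element (the average), highest first
top_three = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, average in top_three:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, average))
```

This keeps the hour in the first position throughout, so no intermediate swapped list is needed. The hours printed here are the dataset's original hours, without the two-hour shift.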


### (A-2) Afternoon and evening include several of the best times to post
As shown in the output above, Ask HN posts created at 17:00 received by far the most comments on average (38.59), followed by posts created at 04:00, 22:00, 18:00, and 23:00. These times are given in UTC-3. The code treats the dataset's timestamps as Eastern time, so the corresponding Eastern hours are 15:00, 02:00, 20:00, 16:00, and 21:00.
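
The time-zone shift can be made explicit with timezone-aware datetimes instead of a bare two-hour offset. The sketch below assumes fixed offsets (EST as UTC-5, and a target zone of UTC-3) and ignores daylight saving time. The names `EST`, `UTC_MINUS_3`, and `convert_hour` are hypothetical.

```python
import datetime as dt

# Fixed-offset zones; an assumption that ignores daylight saving time
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
UTC_MINUS_3 = dt.timezone(dt.timedelta(hours=-3), 'UTC-3')

def convert_hour(hour, source=EST, target=UTC_MINUS_3):
    """Convert an hour of day from `source` to `target` as 'HH:MM'."""
    # The date is arbitrary; only the hour matters for this conversion
    moment = dt.datetime(2016, 1, 1, hour, 0, tzinfo=source)
    return moment.astimezone(target).strftime('%H:%M')

print(convert_hour(15))  # 17:00
print(convert_hour(2))   # 04:00
```

Passing a different `target` offset converts the same hours to another time zone without changing the rest of the analysis.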