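# Guided Project: Exploring Hacker News Posts

This project looks at two types of Hacker News posts: "Ask HN" posts, where users ask the community a question, and "Show HN" posts, where users share a project or something they find interesting. We'll compare these two types of posts to determine the following: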

- Do "Ask HN" or "Show HN" receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [1]:
from csv import reader

# Open the Hacker News Posts dataset and read it into a list of lists;
# the with block closes the file once it has been read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
hn_header = hn[0]
hn = hn[1:]

print(hn_header)
print("")
print(hn[:3])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]
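
The cells in this notebook access columns by position (for example `row[1]` for the title and `row[4]` for the comment count). As a hedged alternative, the standard library's `csv.DictReader` lets each row be read by column name. The sketch below runs on a one-row in-memory sample copied from the output above rather than on the real file, and the names `sample_csv`, `rows`, and `first` are illustrative only.

```python
import csv
import io

# In-memory stand-in for the real file: the header plus the first data row shown above
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader uses the first line as keys, so each row becomes a dictionary
rows = list(csv.DictReader(sample_csv))
first = rows[0]

# Values are still strings, so numeric columns need an explicit conversion
print(first['title'], int(first['num_comments']))
```

Reading by name trades a little speed for readability; the rest of this notebook keeps the positional list-of-lists form.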


In [2]:
# start and end: integers marking the start and end of the slice
# rows_and_columns: Boolean (default False); when True, print the dataset's dimensions
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n") # Adds a new (empty) line after each row
        
    if rows_and_columns:
        print("Number of rows:", len(dataset))
        print("Number of columns:", len(dataset[0]))
        
explore_data(hn,0,2,True)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


Number of rows: 20100
Number of columns: 7


# Data Cleaning Process
- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.
- Check for any non-English posts.

In [3]:
# Find malformed rows by comparing each row's length to the header's length
print("Errors in posts, by length of the rows:")
for index, row in enumerate(hn):
    if len(row) != len(hn_header):
        print(row)
        print("Error row #:", index)

Errors in posts, by length of the rows:


In [4]:
# Find any duplicate posts, based on the id number

hn_duplicate_posts = []
hn_unique_posts = []

for row in hn:
    post_id = row[0]
    if post_id in hn_unique_posts:
        hn_duplicate_posts.append(post_id)
    else:
        hn_unique_posts.append(post_id)
        
print("Number of duplicate posts:", len(hn_duplicate_posts))
print("Number of unique posts:", len(hn_unique_posts))

Number of duplicate posts: 0
Number of unique posts: 20100
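
The duplicate check above tests membership in a list, which rescans the list for every row. A minimal sketch of the same idea using a set, where lookups do not require a scan, might look like the following. The function name `find_duplicate_ids`, its `id_index` parameter, and the three sample rows are hypothetical and not part of the original notebook.

```python
def find_duplicate_ids(rows, id_index=0):
    """Return (duplicates, seen): repeated ids in order found, and the set of distinct ids."""
    seen = set()
    duplicates = []
    for row in rows:
        post_id = row[id_index]
        if post_id in seen:
            duplicates.append(post_id)
        else:
            seen.add(post_id)
    return duplicates, seen

# Hypothetical rows: the id '1' appears twice
sample_rows = [['1', 'first post'], ['2', 'second post'], ['1', 'repeat of first']]
duplicates, unique_ids = find_duplicate_ids(sample_rows)

print("Number of duplicate posts:", len(duplicates))
print("Number of unique posts:", len(unique_ids))
```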


In [5]:
# Treat a string as English unless it has more than three non-ASCII characters,
# so titles containing a few emoji or symbols are kept
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127: # ASCII code points 0-127 cover standard English characters
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
True
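
To see why a title passes or fails this check, it can help to look at the non-ASCII count directly. The sketch below splits the logic into two small helpers; `count_non_ascii`, `is_mostly_ascii`, and the `max_non_ascii` parameter are names assumed here for illustration. The third test string is a mis-encoded title of the kind this dataset contains, written with escape sequences so the four stray characters are explicit.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for character in text if ord(character) > 127)

def is_mostly_ascii(text, max_non_ascii=3):
    """Same rule as is_english above: allow up to max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

# A mis-encoded title: each middle dot arrives as two non-ASCII characters
mojibake_title = 'Crypt\u00c2\u00b7o\u00c2\u00b7phobe'

for title in ['Docs To Go\u2122 Free Office Suite', 'Instachat \U0001F61C', mojibake_title]:
    print(count_non_ascii(title), is_mostly_ascii(title))
```

A simple character count like this is a heuristic: it can reject English titles that were damaged by an encoding problem, and it accepts non-English titles written entirely in ASCII.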


In [6]:
# Filter out the non-English posts by title
hn_english = []
hn_non_english = []

for row in hn:
    title = row[1]
    if is_english(title):
        hn_english.append(row)
    else:
        hn_non_english.append(row)
        
print("Examples of some post titles:")
explore_data(hn_english,0,2,True)
print("")
print(len(hn) - len(hn_english), "posts removed for non-English words.")
print("")
print(hn_non_english)

Examples of some post titles:
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


Number of rows: 20097
Number of columns: 7

3 posts removed for non-English words.

[['11365297', 'CryptÂ·oÂ·phobe', 'https://www.cryptophobia.com/', '4', '1', 'r0muald', '3/26/2016 11:41'], ['11459868', 'Today is 2Â²/2Â³/2?', '', '33', '6', 'sinak', '4/9/2016 4:14'], ['10606767', 'Dormir en el limbo: radiografÃ\xada de Airbnb  EL ESPAÃ\x91OL', 'http://datos.elespanol.com/proyectos/airbnb/', '1', '1', 'malditojavi', '11/21/2015 13:39']]


In [7]:
# Separate posts whose titles begin with "Ask HN" or "Show HN" from all other posts
ask_posts = []
show_posts = []
other_posts = []

for row in hn_english:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of "Ask HN" posts:', len(ask_posts))
ask_hn_percent = (len(ask_posts)/(len(hn)))*100
print('"Ask HN" posts make up', round(ask_hn_percent, 3), "% of the overall posts.")

print('Number of "Show HN" posts:', len(show_posts))
show_hn_percent = (len(show_posts)/(len(hn)))*100
print('"Show HN" posts make up', round(show_hn_percent, 3), "% of the overall posts.")

print('Number of other posts:', len(other_posts))
other_hn_percent = (len(other_posts)/(len(hn)))*100
print('Other HN posts make up', round(other_hn_percent, 3), "% of the overall posts.")
print("")
print('Examples of some "Ask HN" posts:')
print(ask_posts[:3])
print("")
print('Examples of some "Show HN" posts:')
print(show_posts[:3])

Number of "Ask HN" posts: 1744
"Ask HN" posts make up 8.677 % of the overall posts.
Number of "Show HN" posts: 1162
"Show HN" posts make up 5.781 % of the overall posts.
Number of other posts: 17191
Other HN posts make up 85.527 % of the overall posts.

Examples of some "Ask HN" posts:
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']]

Examples of some "Show HN" posts:
[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.i

In [8]:
# Finding the total number of comments on "Ask HN" and "Show HN" posts
# ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    num_comments = row[4]
    total_ask_comments = total_ask_comments + int(num_comments)
    
for row in show_posts:
    num_comments = row[4]
    total_show_comments = total_show_comments + int(num_comments)
    
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print('Total number of comments on "Ask HN" posts:', total_ask_comments)
print('The average number of comments on "Ask HN" posts:', round(avg_ask_comments, 3))
print("")
print('Total number of comments on "Show HN" posts:', total_show_comments)
print('The average number of comments on "Show HN" posts:', round(avg_show_comments, 3))
print("")
# Difference between the two shares of all posts, in percentage points
print('The "Ask HN" share of all posts is', round(ask_hn_percent - show_hn_percent, 3), 'percentage points higher than the "Show HN" share')
# Total "Ask HN" comments expressed as a percentage of total "Show HN" comments
ask_show_percent = (total_ask_comments / total_show_comments)*100
print('"Ask HN" posts have', round(ask_show_percent, 3), '% as many comments as "Show HN" posts')

Total number of comments on "Ask HN" posts: 24483
The average number of comments on "Ask HN" posts: 14.038

Total number of comments on "Show HN" posts: 11988
The average number of comments on "Show HN" posts: 10.317

The "Ask HN" share of all posts is 2.896 percentage points higher than the "Show HN" share
"Ask HN" posts have 204.229 % as many comments as "Show HN" posts


# Do Show posts or Ask posts receive more comments on average?
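I calculated the total and average number of comments for "Ask HN" and "Show HN" posts.

"Ask HN" posts receive more comments on average: about 14.04 per post, compared with about 10.32 for "Show HN" posts. "Ask HN" also has 582 more posts than "Show HN" (1,744 versus 1,162, a gap of about 2.9 percentage points in their shares of all posts) and a little more than double the total number of comments (24,483 versus 11,988).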
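
The figures in this comparison can be rechecked from the totals printed earlier. The following is a minimal sketch with the four totals copied from the output above; the variable names `ask_count`, `show_count`, `extra_posts`, `pct_more_posts`, and `comment_ratio` are chosen here for illustration.

```python
# Totals copied from the earlier cell outputs
total_ask_comments, ask_count = 24483, 1744
total_show_comments, show_count = 11988, 1162

# Average comments per post for each type
avg_ask = total_ask_comments / ask_count
avg_show = total_show_comments / show_count

# Absolute and relative difference in the number of posts
extra_posts = ask_count - show_count
pct_more_posts = extra_posts / show_count * 100

# How many times more total comments "Ask HN" received
comment_ratio = total_ask_comments / total_show_comments

print(round(avg_ask, 3), round(avg_show, 3))
print(extra_posts, round(pct_more_posts, 1))
print(round(comment_ratio, 2))
```

Note that the relative difference in post counts (`pct_more_posts`) is a different quantity from the difference between the two types' shares of all posts, which is measured in percentage points.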

----

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
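
The two steps above can be sketched on a handful of rows before running them on the full dataset. This version uses `collections.defaultdict` so that a missing hour starts at zero. The first three sample rows match the timestamp format used in this dataset, the fourth is made up so that one hour has two posts, and the `sample_` names are illustrative.

```python
from collections import defaultdict
import datetime as dt

# [created_at, num_comments] pairs; the last row is hypothetical
sample_posts = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['1/1/2016 13:05', 11],
]

sample_counts = defaultdict(int)    # posts created in each hour
sample_comments = defaultdict(int)  # comments received by posts created in each hour

for created_at, num_comments in sample_posts:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H:00")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

# Average comments per post for each hour
sample_avg = {hour: sample_comments[hour] / sample_counts[hour] for hour in sample_counts}

for hour in sorted(sample_avg):
    print(hour, sample_avg[hour])
```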

In [42]:
# Create a list of lists with two elements (time created and number of comments)
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    two_elements = [created_at, num_comments]
    result_list.append(two_elements)

# Create dictionaries: number of posts by hour and number of comments by hour
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date_time = row[0]
    date_time = dt.datetime.strptime(date_time, "%m/%d/%Y %H:%M")
    post_time = date_time.strftime("%H:00")
    if post_time not in counts_by_hour:
        counts_by_hour[post_time] = 1
        comments_by_hour[post_time] = int(row[1])
    else:
        counts_by_hour[post_time] += 1
        comments_by_hour[post_time] += int(row[1])

print("Example of the result_list:")
print(result_list[:3])
print("")
# Convert the dictionary into a list of tuples, so it can be sorted
counts_by_hour_list = sorted([(k,v) for k, v in counts_by_hour.items()])
print("Number of posts, by the hour:")
for row in counts_by_hour_list:
    print(row)
print("")
comments_by_hour_list = sorted([(k,v) for k, v in comments_by_hour.items()])
print("Number of comments, by the hour:")
for row in comments_by_hour_list:
    print(row)

Example of the result_list:
[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

Number of posts, by the hour:
('00:00', 55)
('01:00', 60)
('02:00', 58)
('03:00', 54)
('04:00', 47)
('05:00', 46)
('06:00', 44)
('07:00', 34)
('08:00', 48)
('09:00', 45)
('10:00', 59)
('11:00', 58)
('12:00', 73)
('13:00', 85)
('14:00', 107)
('15:00', 116)
('16:00', 108)
('17:00', 100)
('18:00', 109)
('19:00', 110)
('20:00', 80)
('21:00', 109)
('22:00', 71)
('23:00', 68)

Number of comments, by the hour:
('00:00', 447)
('01:00', 683)
('02:00', 1381)
('03:00', 421)
('04:00', 337)
('05:00', 464)
('06:00', 397)
('07:00', 267)
('08:00', 492)
('09:00', 251)
('10:00', 793)
('11:00', 641)
('12:00', 687)
('13:00', 1253)
('14:00', 1416)
('15:00', 4477)
('16:00', 1814)
('17:00', 1146)
('18:00', 1439)
('19:00', 1188)
('20:00', 1722)
('21:00', 1745)
('22:00', 479)
('23:00', 543)


In [33]:
# Calculate the average number of comments received by the hour created
avg_by_hour = []
for hour in comments_by_hour:
    avg_comments = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comments])

sorted_avg_by_hour = sorted(avg_by_hour)
for row in sorted_avg_by_hour:
    print(row)

['00:00', 8.127272727272727]
['01:00', 11.383333333333333]
['02:00', 23.810344827586206]
['03:00', 7.796296296296297]
['04:00', 7.170212765957447]
['05:00', 10.08695652173913]
['06:00', 9.022727272727273]
['07:00', 7.852941176470588]
['08:00', 10.25]
['09:00', 5.5777777777777775]
['10:00', 13.440677966101696]
['11:00', 11.051724137931034]
['12:00', 9.41095890410959]
['13:00', 14.741176470588234]
['14:00', 13.233644859813085]
['15:00', 38.5948275862069]
['16:00', 16.796296296296298]
['17:00', 11.46]
['18:00', 13.20183486238532]
['19:00', 10.8]
['20:00', 21.525]
['21:00', 16.009174311926607]
['22:00', 6.746478873239437]
['23:00', 7.985294117647059]


In [48]:
# Sorting the list of lists and printing the five highest values
# in a format that's easier to read
swap_avg_by_hour = []
for row in avg_by_hour:
    first_element = row[0]
    second_element = row[1]
    swap_avg_by_hour.append([second_element, first_element])
    
# Sort the list
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Print the five highest values
print('Top 5 hours for "Ask HN" comments, by average per post:')
for row in sorted_swap[:5]:
    # row[1] is already an "HH:00" string, so it can be formatted directly
    result_string = "{} EST: {:.2f} average comments per post.".format(row[1], row[0])
    print(result_string)

Top 5 hours for "Ask HN" comments, by average per post:
15:00 EST: 38.59 average comments per post.
02:00 EST: 23.81 average comments per post.
20:00 EST: 21.52 average comments per post.
16:00 EST: 16.80 average comments per post.
21:00 EST: 16.01 average comments per post.
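
Swapping the two columns is one way to sort by the average. As a hedged alternative, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be ranked without building a second list. The sketch below uses four of the averages printed earlier; `sample_avg_by_hour` and `top_hours` are illustrative names.

```python
# A few [hour, average] pairs copied from the averages printed earlier
sample_avg_by_hour = [
    ['15:00', 38.5948275862069],
    ['02:00', 23.810344827586206],
    ['09:00', 5.5777777777777775],
    ['20:00', 21.525],
]

# Sort by the second element (the average), largest first, and keep the top three
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, average in top_hours:
    print("{}: {:.2f} average comments per post.".format(hour, average))
```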


# Which posting hours have a higher chance of receiving comments?

- 15:00 EST (3pm EST) has the highest average number of comments per "Ask HN" post.
- 02:00 EST (2am EST) follows with the second highest.
- With 3pm the highest and 4pm the fourth highest, posts created in the mid-afternoon appear to have the best chance of receiving comments.