# Summary of this course

- how to work with strings
- object-oriented programming
- dates and times

We are interested in titles beginning with:
- Ask HN
- Show HN

Let's compare these types of posts to determine:
- which receives more comments on average
- whether posts created at a certain time receive more comments on average


# Importing the libraries needed and reading the dataset

In [1]:
from csv import reader
hn = list(reader(open('hacker_news.csv')))
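
A variant of the loading step, as a minimal sketch: a `with` block closes the file once it has been read, and the header is split off in the same step. The helper name `load_rows` and the explicit `utf-8` encoding are assumptions, not part of the original notebook.

```python
from csv import reader

def load_rows(path):
    """Read a CSV file and return (header, data_rows)."""
    # newline='' is the setting the csv module docs recommend for file objects
    with open(path, newline='', encoding='utf-8') as f:
        rows = list(reader(f))
    # First row is the header, everything after it is data
    return rows[0], rows[1:]
```

With the dataset in the working directory, this could be called as `headers, hn = load_rows('hacker_news.csv')`.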

In [2]:
#print(hn[:4])
print(hn[0])
print('\n')
print(hn[1])
print('\n')
print(hn[2])
print('\n')
print(hn[3])
print('\n')
print(hn[4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


In [3]:
# extracting the header
headers = hn[0]

In [4]:
# removing the header from the dataset
hn = hn[1:]

In [5]:
print(headers)
print('\n')
print(hn[0])
print('\n')
print(hn[1])
print('\n')
print(hn[2])
print('\n')
print(hn[3])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


# Filtering the data

Next, separate the posts whose titles start with:
- Ask HN
- Show HN

In [9]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1] #assigning the title to a variable
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
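
The same three-way split can be written around a small helper, sketched here with made-up rows. `classify_title` and `sample_rows` are hypothetical names; lowercasing the title once mirrors the case-insensitive check in the cell above.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase once so the check is case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up rows in the dataset's column order, for illustration only
sample_rows = [
    ['1', 'Ask HN: Example question?', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'SHOW HN: Example project', 'http://example.com', '5', '3', 'user_b', '11/25/2015 14:03'],
    ['3', 'Some other story', 'http://example.com/story', '1', '1', 'user_c', '6/17/2016 0:01'],
]

# One list per category, filled in a single pass over the rows
groups = {'ask': [], 'show': [], 'other': []}
for row in sample_rows:
    groups[classify_title(row[1])].append(row)

print({kind: len(rows) for kind, rows in groups.items()})
```

Keeping the prefix test in one function means the rule only has to be changed in one place if another category is added.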


In [10]:
print(ask_posts[:3])
print(show_posts[:3])
print(other_posts[:3])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']]
[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://h

> To find the total and average number of comments on ask posts:

In [13]:
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [14]:
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

10.31669535283993


On average, 'Ask Posts' receive more comments than 'Show Posts' (about 14.04 versus 10.32 per post), so the remaining analysis focuses on 'Ask Posts'.
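
The two averaging cells repeat the same loop, so the calculation could be factored into one function. This is a sketch only: `average_comments` is a hypothetical helper, and the rows below are made up rather than taken from the dataset.

```python
def average_comments(posts, comments_index=4):
    """Mean of the integer value stored at comments_index in each row."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows in the dataset's column order, for illustration only
sample_posts = [
    ['1', 'Ask HN: A?', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: B?', '', '28', '29', 'user_b', '11/22/2015 13:43'],
    ['3', 'Ask HN: C?', '', '1', '1', 'user_c', '5/2/2016 10:14'],
]

print(average_comments(sample_posts))  # (6 + 29 + 1) / 3 = 12.0
```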

# Ask Posts analysis

Next, determine whether ask posts created at a particular hour of the day attract more comments.

To do this:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received
2. Calculate the average number of comments ask posts receive by hour created
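
Both steps depend on pulling the hour out of the `created_at` column. As a small sketch, the snippet below parses one timestamp in the dataset's format and shows the hour both as a zero-padded string and as an integer; the variable names are illustrative.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
created_at = "8/16/2016 9:55"  # same format as the created_at column

parsed = dt.datetime.strptime(created_at, date_format)
hour_label = parsed.strftime("%H")  # zero-padded string: '09'
hour_number = parsed.hour           # integer: 9

print(hour_label, hour_number)
```

The zero-padded string form sorts correctly as text, which is convenient when the hour is used as a dictionary key.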


In [15]:
import datetime as dt

In [29]:
result_list = [] #this will be a list of lists

for row in ask_posts:
    #1st element is the 'created at', 2nd element is the number of comments
    result_list.append(
        [row[6], int(row[4])] 
    )
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    hour = row[0]
    comment = row[1]
    
    objt_time = dt.datetime.strptime(hour, date_format) #convert to object datetime
    time = objt_time.strftime("%H") #extracts the hour and converts to string
    
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
        
counts_by_hour

{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}

> PROGRESS SO FAR:

Created two dictionaries:
- `counts_by_hour`: contains the number of ask posts created during each hour of the day
- `comments_by_hour`: contains the total number of comments received by ask posts created at each hour
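
As an alternative sketch, `collections.defaultdict` removes the need for the `if time not in ...` branch, because missing keys start at zero. The sample pairs and the names `sample_counts` and `sample_comments` are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Made-up (created_at, num_comments) pairs, for illustration only
sample = [
    ("8/16/2016 9:55", 6),
    ("11/22/2015 13:43", 29),
    ("5/2/2016 9:14", 1),
]

sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # comments per hour

for created_at, n_comments in sample:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    sample_counts[hour] += 1            # a missing key starts at 0
    sample_comments[hour] += n_comments

print(dict(sample_counts))
print(dict(sample_comments))
```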

> Next, calculate the average number of comments per post for posts created during each hour of the day:

In [31]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    
avg_by_hour

[['11', 11.051724137931034],
 ['07', 7.852941176470588],
 ['22', 6.746478873239437],
 ['05', 10.08695652173913],
 ['17', 11.46],
 ['01', 11.383333333333333],
 ['19', 10.8],
 ['13', 14.741176470588234],
 ['00', 8.127272727272727],
 ['21', 16.009174311926607],
 ['06', 9.022727272727273],
 ['08', 10.25],
 ['14', 13.233644859813085],
 ['20', 21.525],
 ['12', 9.41095890410959],
 ['18', 13.20183486238532],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['03', 7.796296296296297],
 ['10', 13.440677966101696],
 ['15', 38.5948275862069],
 ['09', 5.5777777777777775],
 ['02', 23.810344827586206],
 ['04', 7.170212765957447]]

# Sorting the list

Sort the list and print the five highest values in an easier-to-read format.

In [36]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[11.051724137931034, '11'], [7.852941176470588, '07'], [6.746478873239437, '22'], [10.08695652173913, '05'], [11.46, '17'], [11.383333333333333, '01'], [10.8, '19'], [14.741176470588234, '13'], [8.127272727272727, '00'], [16.009174311926607, '21'], [9.022727272727273, '06'], [10.25, '08'], [13.233644859813085, '14'], [21.525, '20'], [9.41095890410959, '12'], [13.20183486238532, '18'], [16.796296296296298, '16'], [7.985294117647059, '23'], [7.796296296296297, '03'], [13.440677966101696, '10'], [38.5948275862069, '15'], [5.5777777777777775, '09'], [23.810344827586206, '02'], [7.170212765957447, '04']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
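
Swapping the columns is one way to sort by the average. Another, sketched below, is to pass a `key` function to `sorted` so the `[hour, average]` pairs keep their original order. The three pairs are copied from the `avg_by_hour` output above; the variable names are illustrative.

```python
# Three [hour, average] pairs copied from the avg_by_hour output
avg_by_hour_sample = [
    ['11', 11.051724137931034],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average) without building a swapped list
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

print(sorted_by_avg)
```

If two hours had exactly the same average, the two approaches could order them differently: the swapped list falls back to comparing the hour strings, while the `key` version keeps the input order.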

In [56]:
print("Top 5 Hours for Ask Posts Comments")
print("[converted from ET to GMT+1 (Summer Time)]")

for avg, hour in sorted_swap[:5]:
    # shift the hour by 5 hours to go from ET to Lisbon summer time
    lisbon_time = dt.datetime.strptime(hour, "%H") + dt.timedelta(hours=5)
    print(
        "{}: {:.2f} average comments per post".format(lisbon_time.strftime("%H:%M"), avg)
    )

Top 5 Hours for Ask Posts Comments
[converted from ET to GMT+1 (Summer Time)]
20:00: 38.59 average comments per post
07:00: 23.81 average comments per post
01:00: 21.52 average comments per post
21:00: 16.80 average comments per post
02:00: 16.01 average comments per post


In [52]:
# Just a few tests to convert to Lisbon time (GMT+1 since it's Summer Time)

tempo = dt.datetime.strptime(sorted_swap[0][1], "%H")
gmt_tempo = tempo + dt.timedelta(hours=5)

print(gmt_tempo)

str_tempo = gmt_tempo.strftime("%H:%M")

str_tempo

1900-01-01 20:00:00


'20:00'
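
The 5-hour shift can also be derived from the two UTC offsets instead of being hard-coded. The sketch below assumes fixed offsets of UTC-4 for Eastern daylight time and UTC+1 for Lisbon summer time, so it only holds while both zones are on summer time. The function name `eastern_hour_to_lisbon` and the date used are illustrative.

```python
import datetime as dt

# Assumed fixed offsets: Eastern daylight time is UTC-4, Lisbon summer time is UTC+1
eastern_summer = dt.timezone(dt.timedelta(hours=-4))
lisbon_summer = dt.timezone(dt.timedelta(hours=1))

def eastern_hour_to_lisbon(hour_label):
    """Convert an hour string such as '15' from Eastern to Lisbon summer time."""
    # The date is arbitrary; only the hour matters for this conversion
    eastern = dt.datetime(2016, 7, 1, int(hour_label), tzinfo=eastern_summer)
    return eastern.astimezone(lisbon_summer).strftime("%H:%M")

for hour_label in ["15", "02", "20"]:
    print(hour_label, "->", eastern_hour_to_lisbon(hour_label))
```

Hours late in the Eastern evening roll over to the next day in Lisbon, which is why `'20'` maps to `01:00`.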

# Conclusions

> On average, ask posts created between 3pm and 4pm ET (20h-21h GMT+1, Lisbon Summer Time) receive the most comments.

However, posts without comments were not included, so this conclusion applies only to posts that received comments.