# Exploring Hacker News Posts

We will work with a data set from Hacker News (to visit the site, click [here](https://news.ycombinator.com/)) which contains information about posts on the site. We will focus on two kinds of posts: those where users ask the community questions, whose titles begin with Ask HN, and those where users show projects, whose titles begin with Show HN. We start by reading the CSV file:

In [1]:
from csv import reader

# Read every row into memory; the with block closes the file afterwards.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

Let's check some rows in the dataset:

In [2]:
hn[0:6]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 

The first row is the header of the dataset. We need to remove it to avoid errors while cleaning the data, but we're going to save it to a separate variable:


In [3]:
headers = hn[0]
del hn[0]

headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
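
As an alternative sketch, the header can be split from the data rows in a single step with extended unpacking. The sample row below is copied from the output above; `sample_csv`, `sample_headers`, and `sample_rows` are illustrative names that are not part of the notebook, and the in-memory text only stands in for the real file.

```python
import io
from csv import reader

# Small in-memory stand-in for hacker_news.csv (assumption: same column layout).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# The first row goes to sample_headers; every remaining row goes to sample_rows.
sample_headers, *sample_rows = reader(sample_csv)

print(sample_headers)
print(len(sample_rows))
```

Compared with `del hn[0]`, this leaves the original list untouched, which can help when a cell is re-run by accident.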

We will now start analyzing the data by separating posts whose titles start with either Show HN or Ask HN:

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    title = title.lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
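
To see why the titles are lowercased before the prefix check, here is a small sketch on a handful of titles. `classify_title` is a hypothetical helper, and the last two titles are invented for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Interactive Dynamic Video',
    'SHOW HN: An example project',       # invented title, upper-case prefix
    'Why I never ask HN for advice',     # invented title, prefix not at the start
]

labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Only the start of the title is tested, so a title that merely mentions "ask HN" later on still falls into the other group.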


We can now check the posts saved in the ask_posts list:

In [5]:
ask_posts[0:5]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

Let's now group the Ask HN posts by the hour they were created and total the number of comments for each hour:

In [6]:
import datetime as dt

result_list = []

for post in ask_posts:
    post_time = post[6]
    n_comments = int(post[4])
    time_data = [post_time, n_comments]
    result_list.append(time_data)
    
counts_by_hour = {}
comments_by_hour = {}

for data in result_list:
    hour_data = data[0]
    comments = data[1]
    post_d, post_h = hour_data.split()
    post_time = dt.datetime.strptime(post_h, '%H:%M')
    hr = post_time.strftime('%H')
    
    if hr not in counts_by_hour:
        counts_by_hour[hr] = 1
        comments_by_hour[hr] = comments
    else:
        counts_by_hour[hr] += 1
        comments_by_hour[hr] += comments

We can now use this information to calculate the average number of comments by hour:

In [14]:
avg_comments = []

for hr in counts_by_hour:
    avg_by_hour = comments_by_hour[hr]/counts_by_hour[hr]
    avg_comments.append([hr, avg_by_hour])

avg_comments

[['21', 16.009174311926607],
 ['11', 11.051724137931034],
 ['05', 10.08695652173913],
 ['01', 11.383333333333333],
 ['23', 7.985294117647059],
 ['17', 11.46],
 ['06', 9.022727272727273],
 ['10', 13.440677966101696],
 ['04', 7.170212765957447],
 ['07', 7.852941176470588],
 ['00', 8.127272727272727],
 ['13', 14.741176470588234],
 ['18', 13.20183486238532],
 ['08', 10.25],
 ['15', 38.5948275862069],
 ['16', 16.796296296296298],
 ['20', 21.525],
 ['14', 13.233644859813085],
 ['02', 23.810344827586206],
 ['19', 10.8],
 ['22', 6.746478873239437],
 ['12', 9.41095890410959],
 ['03', 7.796296296296297],
 ['09', 5.5777777777777775]]
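
The two steps above (accumulating totals per hour, then dividing) can be sketched together on a small sample. This version parses the whole `created_at` string with a single format, `'%m/%d/%Y %H:%M'`, which is an assumption based on the rows shown earlier; `average_comments_by_hour` and `sample_posts` are hypothetical names.

```python
import datetime as dt

def average_comments_by_hour(rows):
    """Map each two-digit hour to the mean comment count of posts created in it.

    Each row is a [created_at, num_comments] pair.
    """
    counts = {}
    totals = {}
    for created_at, n_comments in rows:
        hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
        counts[hour] = counts.get(hour, 0) + 1
        totals[hour] = totals.get(hour, 0) + n_comments
    return {hour: totals[hour] / counts[hour] for hour in counts}

# Timestamps and comment counts taken from the Ask HN rows shown above,
# plus one invented row so that hour 09 contains two posts.
sample_posts = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['1/1/2016 9:05', 2],   # invented row for illustration
]
sample_averages = average_comments_by_hour(sample_posts)
print(sample_averages)
```

Using `dict.get` with a default of 0 removes the need for the separate `if hr not in counts_by_hour` branch.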

We would now like to know which posting hour gets the most comments on average. We first need to swap the order of the elements in every list in our *avg_comments* list, and then use the __sorted__ function:

In [16]:
swap_avg_by_hour = []

for hr in avg_comments:
    swap_avg_by_hour.append([hr[1],hr[0]])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

[[16.009174311926607, '21'], [11.051724137931034, '11'], [10.08695652173913, '05'], [11.383333333333333, '01'], [7.985294117647059, '23'], [11.46, '17'], [9.022727272727273, '06'], [13.440677966101696, '10'], [7.170212765957447, '04'], [7.852941176470588, '07'], [8.127272727272727, '00'], [14.741176470588234, '13'], [13.20183486238532, '18'], [10.25, '08'], [38.5948275862069, '15'], [16.796296296296298, '16'], [21.525, '20'], [13.233644859813085, '14'], [23.810344827586206, '02'], [10.8, '19'], [6.746478873239437, '22'], [9.41095890410959, '12'], [7.796296296296297, '03'], [5.5777777777777775, '09']]
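
Swapping the columns is one way to sort by average. Another possible sketch, assuming the same `[hour, average]` layout, passes a `key` function to `sorted` so the original order of the pairs is preserved. The three pairs below are copied from the avg_comments output above; `sample_avg` and `by_average` are illustrative names.

```python
# Three [hour, average] pairs copied from the avg_comments output above.
sample_avg = [
    ['21', 16.009174311926607],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average), highest first.
by_average = sorted(sample_avg, key=lambda pair: pair[1], reverse=True)

for hour, avg in by_average:
    print(f"{hour}: {avg:.2f} average comments per post")
```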


We can now see which hours have the highest average number of comments:

In [20]:
print('Top 5 hours for Ask Posts comments')

for hr in sorted_swap[0:5]:
    post_time = str(hr[1])
    avg = hr[0]  # separate name so the avg_comments list is not overwritten
    final_string = "{hr}: {n_comments:.2f} average comments per post"
    post_time_p = dt.datetime.strptime(post_time, '%H')
    post_time_f = post_time_p.strftime('%H')
    top_time = final_string.format(hr=post_time_f, n_comments=avg)
    print(top_time)

Top 5 hours for Ask Posts comments
15: 38.59 average comments per post
02: 23.81 average comments per post
20: 21.52 average comments per post
16: 16.80 average comments per post
21: 16.01 average comments per post
