# Exploring Hacker News Posts

This project analyzes posts published on Hacker News, a popular technology news website.
The two main questions are which category of posts receives more comments and at what times of day the most-commented posts are created.

In [2]:
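from csv import reader

# Read the whole file, then split the header row from the data rows.
# The with-block closes the file once reading is done.
with open('hacker_news.csv', newline='') as file:
    data = list(reader(file))
headers = data[0]
hn = data[1:]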

Columns

In [3]:
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Data

In [4]:
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
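The cells in this notebook access columns by position (`row[1]`, `row[4]`, `row[6]`). As an alternative, `csv.DictReader` keys each row by column name. The following is a minimal sketch, not part of the original analysis; it reads from an in-memory string (`sample_csv`, a hypothetical stand-in for `hacker_news.csv`) so that it runs without the data file.

```python
import csv
import io

# Hypothetical in-memory stand-in for hacker_news.csv (header plus one row)
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# Each row becomes a dict keyed by the header names
rows = list(csv.DictReader(sample_csv))
first = rows[0]
print(first['title'], first['num_comments'])
```

Values are still strings, so numeric columns such as `num_comments` need an `int()` conversion before any arithmetic.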

Separate the posts whose titles begin with `Ask HN` or `Show HN` (case-insensitive) from all other posts

In [11]:
# Lists to store the posts separated by category
ask_posts = []
show_posts = []
other_posts = []

# Iterate over the data
for row in hn:
    
    # Get the title and apply the lowercase method
    title = row[1].lower()
    
    # Check the category
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(f"Number of Ask HN posts: {len(ask_posts)}")
print(f"Number of Show HN posts: {len(show_posts)}")
print(f"Number of other posts: {len(other_posts)}")

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194
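The same prefix check can be wrapped in a small function, which makes it easy to test on individual titles. This is only a sketch: `classify_title`, `posts_by_category`, and the sample rows are hypothetical names and data, not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical rows in the same column order as the dataset
sample_rows = [
    ['1', 'Ask HN: How do you learn?', '', '5', '3', 'a', '8/4/2016 11:52'],
    ['2', 'Show HN: My project', 'http://example.com', '7', '2', 'b', '1/26/2016 19:30'],
    ['3', 'Interactive Dynamic Video', 'http://example.com', '386', '52', 'c', '8/4/2016 11:52'],
]

# Group the rows by category using the title in column 1
posts_by_category = {'ask': [], 'show': [], 'other': []}
for row in sample_rows:
    posts_by_category[classify_title(row[1])].append(row)

category_counts = {name: len(rows) for name, rows in posts_by_category.items()}
print(category_counts)
```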


Create a function to compute the average of the number of comments for each post category

In [15]:
def compute_avg_total_comments(posts):
    total_comments = 0
    for row in posts:
        total_comments += int(row[4])
    return total_comments/len(posts)

In [19]:
avg_ask_comments = compute_avg_total_comments(ask_posts)
avg_show_comments = compute_avg_total_comments(show_posts)
avg_other_comments = compute_avg_total_comments(other_posts)
print("Average number of comments on ask posts: {:.2f}".format(avg_ask_comments))
print("Average number of comments on show posts: {:.2f}".format(avg_show_comments))
print("Average number of comments on other posts: {:.2f}".format(avg_other_comments))

Average number of comments on ask posts: 14.04
Average number of comments on show posts: 10.32
Average number of comments on other posts: 26.87
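The averaging function divides by `len(posts)`, so it would raise `ZeroDivisionError` if a category happened to be empty. A possible variant that guards against that case is sketched below; `average_comments` and its `comments_index` parameter are hypothetical names, and returning `0.0` for an empty list is an assumption about the desired behavior.

```python
def average_comments(posts, comments_index=4):
    """Average the comment counts in `posts`; return 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as the dataset
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B', '', '7', '20', 'b', '1/26/2016 19:30'],
]

avg_sample = average_comments(sample_posts)
avg_empty = average_comments([])
print("{:.2f}".format(avg_sample))
```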


On average, ask posts receive more comments than show posts (14.04 versus 10.32). Neither category has the highest average, however: posts outside both categories average 26.87 comments.

Now, let's see at which hours of the day `Ask HN` posts attract the most comments

In [33]:
import datetime as dt
result_list = []
for post in ask_posts:
    element = []
    element.append(post[6]) # created_at
    element.append(int(post[4])) # num_comments
    result_list.append(element)

counts_by_hour = {}
comments_by_hour = {}
for element in result_list:
    
    # Convert the time to a datetime format
    time_obj = dt.datetime.strptime(element[0],"%m/%d/%Y %H:%M")
    
    # Get the hour only
    hour = time_obj.strftime("%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = element[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += element[1]

Number of posts by hour

In [34]:
counts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

Number of comments by hour

In [32]:
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
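The `if hour not in ...` branching used to build both dictionaries can be avoided with `collections.defaultdict`, which starts every missing key at zero. The sketch below applies the same parsing to a few hypothetical `(created_at, num_comments)` pairs; `sample`, `counts`, and `comments` are illustrative names.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical (created_at, num_comments) pairs in the dataset's date format
sample = [('8/4/2016 11:52', 6), ('8/5/2016 11:10', 4), ('9/30/2015 4:12', 2)]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # comments received by posts created in each hour

for created_at, num_comments in sample:
    # Parse the timestamp and keep the zero-padded hour as the key
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```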

Let's compute the average number of comments per post for posts created during each hour of the day

In [38]:
avg_by_hour = []
for time in counts_by_hour:
    
    # Get the number of posts and comments for each hour
    num_posts = counts_by_hour[time]
    num_comments = comments_by_hour[time]
    
    # Compute the average of comments per post
    avg = num_comments/num_posts
    
    # Append to the list of lists
    avg_by_hour.append([time, avg])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
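The per-hour average is simply the comment total divided by the post count for that hour. As a compact alternative to the loop, a dictionary comprehension can compute it in one expression. The sketch below uses three hours copied from the outputs above; the `_subset` and `avg_by_hour_dict` names are hypothetical.

```python
# Three hours copied from counts_by_hour and comments_by_hour above
counts_subset = {'15': 116, '02': 58, '09': 45}
comments_subset = {'15': 4477, '02': 1381, '09': 251}

# Average comments per post for each hour
avg_by_hour_dict = {
    hour: comments_subset[hour] / counts_subset[hour]
    for hour in counts_subset
}

for hour, avg in avg_by_hour_dict.items():
    print("{}: {:.2f}".format(hour, avg))
```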

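Sort the list by the average number of comments per post. Because `sorted` compares list elements from left to right, each pair is first swapped so that the average comes before the hour.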

In [39]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [41]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
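Swapping the columns is one way to sort by the average. Another option is to pass a `key` function to `sorted`, which leaves each `[hour, average]` pair in its original order. A minimal sketch, using a few averages rounded from the output above (`avg_by_hour_sample` and `ranked` are hypothetical names):

```python
# A few [hour, average] pairs, rounded from the results above
avg_by_hour_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the second element (the average), highest first
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

With a `key` function, ties between averages keep their original relative order instead of falling back to a comparison of the hour strings.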

In [44]:
print("****Top 5 Hours for Ask Posts Comments****")
for avg, hour in sorted_swap[:5]:
    print("{}:00: {:.2f} average comments per post".format(hour,avg))

****Top 5 Hours for Ask Posts Comments****
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


### Conclusions

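On average, `Ask HN` posts received more comments than `Show HN` posts (14.04 versus 10.32), although posts in neither category averaged the most (26.87). Among `Ask HN` posts, those created at 15:00 drew the most comments per post (38.59 on average), followed by those created at 02:00, 20:00, 16:00, and 21:00.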