# Exploring Hacker News Posts

## Introduction
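We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the community a project, a product, or something else they find interesting.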

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [13]:
import datetime as dt
from csv import reader

In [14]:
# Open the dataset, read it, and convert it to a list of lists
open_hn = open('hacker_news.csv')
read_hn = reader(open_hn)
hn = list(read_hn)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


The first inner list contains the column headers, and each list after it contains the data for one row. To analyze the data, we first need to separate the header row from the data rows.

In [15]:
# Separate the header row from the data rows
headers = hn[:1]
hn = hn[1:]
print(headers)
print('\n')
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
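The loading and header-splitting steps above can also be wrapped in a small function. The following is only a sketch: `load_rows` and `sample_csv` are hypothetical names, and the one-row in-memory sample stands in for `hacker_news.csv` so the example runs without the file.

```python
import io
from csv import reader

# Hypothetical in-memory sample standing in for hacker_news.csv
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

def load_rows(file_obj):
    """Return (header, rows) read from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]

header, rows = load_rows(sample_csv)
print(header)     # the column names
print(len(rows))  # number of data rows
```

With the real file, the same function could be called inside `with open('hacker_news.csv') as f:`, which also closes the file once reading is done.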


## Extracting Ask HN and Show HN Posts

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

In [16]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

        
    

1744
1162
17194
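To see how the case-insensitive prefix check sorts individual titles, here is a minimal sketch. `classify_title` is a hypothetical helper and the titles are made up for illustration; they are not taken from the dataset.

```python
def classify_title(title):
    """Label a title as 'ask', 'show', or 'other' by its prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration only
for title in ['Ask HN: How do you learn?', 'SHOW HN: My side project', 'Why I ask HN']:
    print(title, '->', classify_title(title))
```

Lowercasing the title before the check means variants such as `ASK HN` are matched too, while a title that merely contains the phrase later on is still counted as `other`.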


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Next, we'll determine whether ask posts or show posts receive more comments on average.

In [17]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
#Compute the average number of comments on ask posts
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
    

14.038417431192661


In [18]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
# Compute the average number of comments on show posts
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)



10.31669535283993
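The two cells above repeat the same loop on different lists. One way to avoid the repetition is a small helper, sketched below. `average_column` is a hypothetical name, and the sample rows are invented; they only follow the dataset's column order.

```python
def average_column(rows, index):
    """Return the mean of the integer values stored at `index` in each row."""
    total = 0
    for row in rows:
        total += int(row[index])
    return total / len(rows)

# Invented rows that follow the dataset's column order
sample_rows = [
    ['1', 'Ask HN: first', '', '4', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: second', '', '6', '20', 'user_b', '8/5/2016 12:00'],
]
print(average_column(sample_rows, 4))  # average of num_comments
print(average_column(sample_rows, 3))  # average of num_points
```

Note that this sketch assumes `rows` is not empty; an empty list would raise a `ZeroDivisionError`.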


On average, __ask posts__ receive more comments than show posts. Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

__Next, we'll determine whether ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:__

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

In [19]:
result_list = []
counts_by_hour = {}
comments_by_hour = {}

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    
    # Parse the creation timestamp (the last column of the row) and extract the hour
    date = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(date, "%H")

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    
    

Now we'll calculate the average number of comments per post for posts created during each hour of the day.

In [20]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
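The counting and averaging steps above can be combined into one reusable function. This is a sketch under stated assumptions: `average_by_hour` is a hypothetical name, the timestamp is assumed to sit in column 6 in the `%m/%d/%Y %H:%M` format used by this dataset, and the sample rows are invented.

```python
import datetime as dt

def average_by_hour(posts, value_index, date_index=6):
    """Average the integer column `value_index` per creation hour."""
    totals = {}
    counts = {}
    for post in posts:
        created = dt.datetime.strptime(post[date_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        # dict.get supplies 0 the first time an hour is seen
        totals[hour] = totals.get(hour, 0) + int(post[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

# Invented rows that follow the dataset's column order
sample_posts = [
    ['1', 'Ask HN: a', '', '3', '10', 'u1', '8/4/2016 11:52'],
    ['2', 'Ask HN: b', '', '5', '20', 'u2', '1/26/2016 11:05'],
    ['3', 'Ask HN: c', '', '7', '6', 'u3', '6/23/2016 22:20'],
]
print(average_by_hour(sample_posts, 4))  # average comments per hour
```

Passing `3` instead of `4` as `value_index` would average points rather than comments with the same code.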


In [21]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
    

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap[0:4],"\n")

print("Average comments per post: \n")

# Format the results for display.

for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )
        
       

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16']] 

Average comments per post: 

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
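Swapping the columns works because `sorted` compares lists element by element. An alternative, sketched below, is to pass a `key` function to `sorted` so the original `[hour, average]` order is kept. `top_hours` is a hypothetical helper, and the sample pairs are rounded copies of a few values from the output above.

```python
def top_hours(pairs, n):
    """Return the n [hour, average] pairs with the highest average."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]

# Rounded copies of a few [hour, average] pairs from the output above
sample_avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

for hour, avg in top_hours(sample_avg_by_hour, 3):
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

One small difference: with a `key`, ties between equal averages keep their original order, whereas the swapped lists break ties by comparing the hour strings.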


__Determine if show or ask posts receive more points on average__

In [22]:
total_ask_points = 0
for row in ask_posts:
    num_points = int(row[3])
    total_ask_points += num_points
    
# Compute the average number of points on ask posts
avg_ask_points = total_ask_points / len(ask_posts)
print(avg_ask_points)

15.061926605504587


In [23]:
total_show_points = 0
for row in show_posts:
    num_points = int(row[3])
    total_show_points += num_points
    
# Compute the average number of points on show posts
avg_show_points = total_show_points / len(show_posts)
print(avg_show_points)

27.555077452667813


Show posts receive more points on average (about 27.56, compared with about 15.06 for ask posts). One possible explanation is that beginners search and read more before asking.

__Determine if posts created at a certain time are more likely to receive more points.__

In [25]:
result = []
counts_by_hour = {}
points_by_hour = {}

for post in show_posts:
    created_at = post[6]
    num_points = int(post[3])
    
    # Parse the creation timestamp (the last column of the row) and extract the hour
    date = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(date, "%H")

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        points_by_hour[hour] = num_points
    else:
        counts_by_hour[hour] += 1
        points_by_hour[hour] += num_points

In [26]:
avg_hour = []
for hour in points_by_hour:
    avg_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])
print(avg_hour)

[['14', 25.430232558139537], ['22', 40.34782608695652], ['18', 36.31147540983606], ['07', 19.0], ['20', 30.316666666666666], ['05', 5.473684210526316], ['16', 28.322580645161292], ['19', 30.945454545454545], ['15', 28.564102564102566], ['03', 25.14814814814815], ['17', 27.107526881720432], ['06', 23.4375], ['02', 11.333333333333334], ['13', 24.626262626262626], ['08', 15.264705882352942], ['21', 18.425531914893618], ['04', 14.846153846153847], ['11', 33.63636363636363], ['12', 41.68852459016394], ['23', 42.388888888888886], ['09', 18.433333333333334], ['01', 25.0], ['10', 18.916666666666668], ['00', 37.83870967741935]]


Because this result is hard to read, we'll sort it and format it to make it easier to read.

In [27]:
swap_avg_hour = []
for row in avg_hour:
    swap_avg_hour.append([row[1], row[0]])
    
    

sorted_swap = sorted(swap_avg_hour, reverse = True)
print(sorted_swap[0:4],"\n")

print("Average points per post: \n")

# Format the results for display.

for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average points per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

[[42.388888888888886, '23'], [41.68852459016394, '12'], [40.34782608695652, '22'], [37.83870967741935, '00']] 

Average points per post: 

23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post


Comparing the two results, show posts created at 23:00 receive the most points (42.39 per post on average), while ask posts created at 15:00 receive the most comments (38.59 per post on average).

## Conclusion:

In this analysis, ask posts received more comments on average (about 14) than show posts (about 10). The number of comments per post also varies with the hour in which the post was created: for example, ask posts created at 15:00 averaged about 38.59 comments, whereas those created at 21:00 averaged only about 16.

Show posts, on the other hand, received more points on average (about 27.56) than ask posts (about 15.06). The number of points per post also varies by hour, peaking at 23:00 with an average of 42.39 points per post.