# **Exploring posts on Hacker News**

#### In this project, I analyze a dataset of Hacker News posts (https://www.kaggle.com/hacker-news/hacker-news-posts) to find out which posts generate a high number of comments from users.
#### The two types of posts I explore begin with either Ask HN or Show HN.

#### Users submit Ask HN posts to ask the Hacker News community a specific question.
#### Users submit Show HN posts to showcase their work to the Hacker News community.

#### I will try to gain insights into this dataset by finding answers to the following questions:

 ##### Do Ask HN or Show HN posts receive more comments on average?
 ##### Do posts created at a certain time receive more comments on average?

#### Note that the dataset was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.

### Introduction

#### First, I read the file containing the dataset and store it as a list of lists.

In [6]:
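from csv import reader

# Open the file in a context manager so it is closed after reading
with open("hacker_news.csv") as f:
    hn = list(reader(f))

# Print the header row and the first four data rows
for i in range(0, 5):
    print(hn[i])
    print("\n")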

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
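As a minimal sketch of the same loading step, the snippet below reads CSV text with `csv.reader` inside a `with` block. To stay runnable without the dataset, it assumes an in-memory stand-in (`sample_csv`, a name introduced here) built from the header and two rows shown in the output above.

```python
import io
from csv import reader

# Stand-in for hacker_news.csv: the header and two rows copied from the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

# The with-block closes the file-like object once all rows have been read.
with sample_csv as f:
    rows = list(reader(f))

print(rows[0])
print(len(rows) - 1, "data rows")

# csv.reader returns every field as a string, including numeric columns.
print(type(rows[1][4]))
```

Because every field arrives as a string, numeric columns such as `num_comments` need an explicit `int()` conversion before any arithmetic.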




### Then, I remove the header row and store the remaining rows

In [10]:
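# Keep the header row separately and drop it from the data.
# Run this cell only once: re-running it drops another row from hn.
headers = hn[0]
hn = hn[1:]

print(headers)
print("\n")

# Print the first five data rows
for i in range(0, 5):
    print(hn[i])
    print("\n")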

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/', '53', '22', 'Dei

### Separating AskHN and ShowHN posts

#### I identify posts whose titles start with Ask HN or Show HN, ignoring case, and store them in separate lists.

In [24]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Number of AskHN posts - " + str(len(ask_posts)))
print("\n")
print("Number of ShowHN posts - " + str(len(show_posts)))
print("\n")
print("Other posts - " + str(len(other_posts)))
print("\n")

Number of AskHN posts - 1744


Number of ShowHN posts - 1162


Other posts - 17192
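The prefix check above can also be written as a small function. The sketch below uses a hypothetical helper, `post_type`, and a handful of made-up titles (they are not rows from the dataset) to show why lowercasing first matters and why `startswith` ignores matches later in the title.

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Made-up titles for illustration only, not rows from the dataset.
example_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
    "Why I ask HN readers for feedback",
]

for title in example_titles:
    print(post_type(title), "-", title)
```

The last title contains "ask HN" but does not start with it, so it is classified as `other`.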




### Average number of comments on AskHN and ShowHN posts

#### Now, I calculate the average number of comments on AskHN and ShowHN posts.

In [20]:
total_ask_comments = 0

for row in ask_posts:
    num_ask_comments = int(row[4])
    total_ask_comments += num_ask_comments

avg_ask_comments = total_ask_comments/len(ask_posts)

print("Average comments on ask posts: {}".format(avg_ask_comments))

total_show_comments = 0

for row in show_posts:
    num_show_comments = int(row[4])
    total_show_comments += num_show_comments

avg_show_comments = total_show_comments/len(show_posts)

print("Average comments on show posts: {}".format(avg_show_comments))

Average comments on ask posts: 14.038417431192661
Average comments on show posts: 10.31669535283993


#### As the output shows, ask posts receive more comments on average than show posts (about 14.04 versus 10.32). One possible explanation is that ask posts put a question directly to the community, which invites readers to reply and discuss.
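Since the same averaging loop runs twice, it could be factored into one function. This is a sketch only: `average_comments` and `comments_index` are names introduced here, and the three rows are made up to match the dataset's column layout.

```python
def average_comments(rows, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Made-up rows with the same column order as the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ['1', 'Ask HN: first example', '', '5', '52', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: second example', '', '3', '10', 'user_b', '1/26/2016 19:30'],
    ['3', 'Ask HN: third example', '', '2', '1', 'user_c', '6/23/2016 22:20'],
]

print("Average comments: {:.2f}".format(average_comments(sample_rows)))
```

The guard for an empty list avoids a `ZeroDivisionError` if one of the post types has no rows.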

### Finding number of AskHN posts and comments by the hour
#### Next, I calculate the number of AskHN posts created in each hour of the day and the total number of comments those posts received. The goal is to see whether posts created at a particular hour tend to receive more comments.

In [16]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])

counts_by_hour = {}
comments_by_hour = {}

for element in result_list:
    date = element[0]
    strformat = "%m/%d/%Y %H:%M"
    obj = dt.datetime.strptime(date,strformat)
    hour = obj.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(element[1])
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(element[1])

print(counts_by_hour)
print("\n")
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
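The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`. The sketch below is one possible variant, not the approach used in this notebook, and it runs on three made-up `[created_at, num_comments]` pairs in the dataset's date format.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the dataset's date format.
sample = [
    ["8/4/2016 11:52", 52],
    ["1/26/2016 19:30", 10],
    ["6/23/2016 19:05", 1],
]

# Missing keys start at 0, so no explicit "first time seen" branch is needed.
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, num_comments in sample:
    # Parse the timestamp, then keep only the zero-padded hour as the key.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```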


### Average number of comments by hour

In [17]:
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr,comments_by_hour[hr]/counts_by_hour[hr]])
    
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


### Sorting and printing values from a list of lists

In [20]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour,reverse=True)

print("\n"+"Top 5 Hours for AskHN Comments"+"\n")

for i in range (0,5):
    hr = sorted_swap[i][1]
    hrobject = dt.datetime.strptime(hr,"%H")
    hour = hrobject.strftime("%H")
    
    avgcomments = sorted_swap[i][0]
    
    print("{0}:00: {1:.2f} average comments per post".format(hour,avgcomments))

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]

Top 5 Hours for AskHN Comments

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
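Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so the original `[hour, average]` pairs can be sorted directly. The pairs below are a subset copied from the `avg_by_hour` output, and `top_three` is a name introduced here.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['21', 16.009174311926607],
]

# Sort on the second element (the average), highest first, without swapping columns.
top_three = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top_three:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```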


#### The hour that receives the *most* comments per post on average is 15:00, with an average of 38.59 comments per post.
#### The hour that receives the *fewest* comments per post on average is 09:00, with an average of 5.58 comments per post.

#### According to the dataset documentation, the timestamps are in US Eastern Time, so 15:00 can also be written as 3:00 pm Eastern Time.
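To translate the top hour into another timezone, a rough sketch with fixed UTC offsets is shown below. It assumes Eastern Time is UTC-5 and Pacific Time is UTC-8 and ignores daylight saving time, so results can be off by an hour for part of the year. The names `EASTERN`, `PACIFIC`, and `top_hour`, and the date used, are illustrative choices.

```python
import datetime as dt

# Assumption: fixed offsets, ignoring daylight saving time.
EASTERN = dt.timezone(dt.timedelta(hours=-5), "EST")
PACIFIC = dt.timezone(dt.timedelta(hours=-8), "PST")

# The date is arbitrary; only the 15:00 hour matters for this illustration.
top_hour = dt.datetime(2016, 1, 1, 15, 0, tzinfo=EASTERN)

utc_time = top_hour.astimezone(dt.timezone.utc)
pacific_time = top_hour.astimezone(PACIFIC)

print("UTC:    ", utc_time.strftime("%H:%M"))      # 20:00
print("Pacific:", pacific_time.strftime("%H:%M"))  # 12:00
```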