# Guided Project: Exploring Hacker News Posts

Hacker News is a site similar to Reddit where user-submitted posts receive votes and comments. The site is very popular in technology and startup circles.

In this project, we will inspect a reduced data set from the Hacker News website to determine whether:
   - Posts beginning with "Ask HN" receive more comments on average than posts beginning with "Show HN", or the reverse.
   - Posts created at certain times receive more comments on average than others.
    

**DATA READ**

In this first part, we read the data set file and store it in an appropriate format (a list of lists).
The first few rows are printed to check that everything was read correctly.

In [12]:
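from csv import reader

# Read the whole CSV into a list of lists; the first row is the header.
# The with-block closes the file once reading is done.
with open('hacker_news.csv') as file:
    hn = list(reader(file))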

print(*hn[:5], sep = "\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


In [14]:
headers = hn[0]  # keep the column headers in their own variable
hn = hn[1:]  # remove the header row from the data set

print(headers, "\n")
print(*hn[:5], sep="\n")

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://a
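
Note that the first line printed above is a data row rather than the column names. This is what one would see if the cell were run a second time: `hn[0]` is by then already a data row, and one more row is dropped from `hn`. Below is a minimal sketch of a guard that makes the split safe to re-run. It assumes the header row can be recognized by `'id'` in its first column, as in the first printout; the helper name `split_header` is hypothetical.

```python
def split_header(rows, first_column_name="id"):
    """Return (headers, data_rows); safe to call on data that has no header."""
    if rows and rows[0][0] == first_column_name:
        return rows[0], rows[1:]
    # No header row found: leave the data untouched
    return None, rows


# Two rows copied from the first printout
sample = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
]

headers, data = split_header(sample)
headers_again, data_again = split_header(data)  # a second call removes nothing

print(headers)
print(len(data), len(data_again))
```

With a guard like this, re-running the cell cannot silently shrink the data set.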

**DATA CLEANING**

In this section, we filter and reshape the data set to get what we need for our analysis.

1. Separate posts beginning with "Ask HN" and "Show HN" (including case variations) into two different lists. All remaining posts go into a third list.

In [15]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17193
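
The prefix test used in the loop above can be isolated in a small function, which makes the case-insensitive matching easy to check on its own. This is only a sketch: the function name `classify_title` and the sample titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles, for illustration only
titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
    'Why I ask HN for help',  # contains "ask HN" but does not start with it
]

labels = [classify_title(t) for t in titles]
print(labels)  # ['ask', 'show', 'other', 'other']
```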


2. Let's determine whether ask posts or show posts receive more comments on average.

In [18]:
total_ask_comments = 0
total_show_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
    
for post in show_posts:
    total_show_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)

avg_show_comments = total_show_comments / len(show_posts)

print("The average number of comments on ask posts is", round(avg_ask_comments, 2))
print("The average number of comments on show posts is", round(avg_show_comments, 2))



The average number of comments on ask posts is 14.04
The average number of comments on show posts is 10.32


From the averages calculated above, we can clearly see that ask posts receive more comments on average than show posts (14.04 versus 10.32). A plausible explanation is that people are more likely to answer a question asked in a post than to comment on an informative post.
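
The two loops above perform the same computation, so it could be written once as a function. The sketch below assumes, as in the data set, that the comment count is stored as a string at index 4. The function name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments per post; raises ZeroDivisionError for an empty list."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows with the same column layout as the data set
sample_posts = [
    ['1', 'Ask HN: first question', '', '3', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: second question', '', '5', '29', 'user_b', '11/22/2015 13:43'],
    ['3', 'Ask HN: third question', '', '1', '1', 'user_c', '5/2/2016 10:14'],
]

sample_average = average_comments(sample_posts)
print(sample_average)  # (6 + 29 + 1) / 3 = 12.0
```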

3. Let's calculate the number of ask posts created in each hour of the day, along with the total number of comments those posts received. To do this, we need to parse the creation timestamps.

In [21]:
import datetime as dt

result_list = []
counts_by_hour = {}
comments_by_hour = {}

for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
template = "%m/%d/%Y %H:%M"  # matches timestamps such as '8/16/2016 9:55'

for row in result_list:
    date_dt = dt.datetime.strptime(row[0], template)
    hour = date_dt.strftime("%H")  # zero-padded hour, e.g. '09'
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]

print(result_list[:5], "\n")
print(counts_by_hour, "\n")
print(comments_by_hour, "\n")
        
        
    

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]] 

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58} 

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641} 
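
To see how the dictionary keys above are produced, the short sketch below parses two timestamps taken from the printed `result_list`. `strptime` accepts months, days, and hours without a leading zero, while `strftime("%H")` always returns a two-character string. The variable names are chosen for illustration.

```python
import datetime as dt

template = "%m/%d/%Y %H:%M"
samples = ['8/16/2016 9:55', '11/22/2015 13:43']

hours = []
for raw in samples:
    parsed = dt.datetime.strptime(raw, template)
    hours.append(parsed.strftime("%H"))  # string key, zero-padded
    print(raw, '->', parsed.strftime("%H"), '(as an integer:', parsed.hour, ')')

print(hours)  # ['09', '13']
```

Using the string form keeps keys such as `'09'` and `'13'` the same width, which is why they appear zero-padded in the output above.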



4. Let's create a list of lists containing each hour in which ask posts were created and the average number of comments those posts received.

In [33]:
avg_by_hour = []

for hour in comments_by_hour:  # the dictionary keys are hours, e.g. '15'
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(*avg_by_hour, sep="\n")

['09', 5.5777777777777775]
['13', 14.741176470588234]
['10', 13.440677966101696]
['14', 13.233644859813085]
['16', 16.796296296296298]
['23', 7.985294117647059]
['12', 9.41095890410959]
['17', 11.46]
['15', 38.5948275862069]
['21', 16.009174311926607]
['20', 21.525]
['02', 23.810344827586206]
['18', 13.20183486238532]
['03', 7.796296296296297]
['05', 10.08695652173913]
['19', 10.8]
['01', 11.383333333333333]
['22', 6.746478873239437]
['08', 10.25]
['04', 7.170212765957447]
['00', 8.127272727272727]
['06', 9.022727272727273]
['07', 7.852941176470588]
['11', 11.051724137931034]
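
As a check on the figures above, each average is simply the hour's total comments divided by the hour's post count. The sketch below repeats the calculation for three hours, using the counts and totals printed in step 3. The variable names are chosen for illustration.

```python
# Values copied from the counts_by_hour and comments_by_hour output
sample_counts = {'15': 116, '02': 58, '20': 80}
sample_comments = {'15': 4477, '02': 1381, '20': 1722}

sample_avg = {hour: sample_comments[hour] / sample_counts[hour] for hour in sample_counts}

for hour, value in sample_avg.items():
    print(hour, value)
```

For example, hour `'15'` has 4477 comments across 116 posts, which gives roughly 38.59 comments per post.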


5. Finally, let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [39]:
swap_avg_by_hour = []

for elem in avg_by_hour:
    swap_avg_by_hour.append([elem[1], elem[0]])
    
print(*swap_avg_by_hour, "\n", sep="\n")

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments\n")

for row in sorted_swap[:5]:
    date = row[1]
    date_dt = dt.datetime.strptime(date, "%H")
    hour = date_dt.strftime("%H")
    template = "{time}:00: {average:.2f} average comments per post"
    output = template.format(time=hour, average=row[0])
    print(output, "\n")
    
    


[5.5777777777777775, '09']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[16.796296296296298, '16']
[7.985294117647059, '23']
[9.41095890410959, '12']
[11.46, '17']
[38.5948275862069, '15']
[16.009174311926607, '21']
[21.525, '20']
[23.810344827586206, '02']
[13.20183486238532, '18']
[7.796296296296297, '03']
[10.08695652173913, '05']
[10.8, '19']
[11.383333333333333, '01']
[6.746478873239437, '22']
[10.25, '08']
[7.170212765957447, '04']
[8.127272727272727, '00']
[9.022727272727273, '06']
[7.852941176470588, '07']
[11.051724137931034, '11']


Top 5 Hours for Ask Posts Comments

15:00: 38.59 average comments per post 

02:00: 23.81 average comments per post 

20:00: 21.52 average comments per post 

16:00: 16.80 average comments per post 

21:00: 16.01 average comments per post 
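
Swapping the columns is one way to sort by the average. An alternative sketch, not the approach used above, passes a `key` function to `sorted` so the `[hour, average]` pairs can stay in their original order. The sample list holds six of the pairs printed in step 4, and the name `sample_avg_by_hour` is hypothetical.

```python
# Six [hour, average] pairs copied from the avg_by_hour output
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['16', 16.796296296296298],
    ['15', 38.5948275862069],
    ['21', 16.009174311926607],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average), highest first, and keep the top five
top_five = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, average in top_five:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```

The resulting order of hours (15, 02, 20, 16, 21) matches the top five printed above.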



From the results above, we can clearly see that, in this data set, ask posts created at around 3 pm (15:00) receive the most comments on average, about 38.59 per post. So, we can conclude that posting a question around 3 pm gives it the best chance of being answered.