![Image](https://s3.amazonaws.com/dq-content/354/hacker_news.jpg)

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the dataset [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that we have reduced it from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- `id`: the unique identifier from Hacker News for the post
- `title`: the title of the post
- `url`: the URL that the post links to, if the post has a URL
- `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: the number of comments on the post
- `author`: the username of the person who submitted the post
- `created_at`: the date and time of the post's submission
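
Because each row is handled as a plain list, columns are accessed by position. As a minimal sketch, the column names above can be mapped to their positions; `COLUMNS` and `col_index` are illustrative names and not part of the dataset or the analysis in this project:

```python
# Column names in the order listed above (illustrative constant).
COLUMNS = ["id", "title", "url", "num_points", "num_comments", "author", "created_at"]

# Map each column name to its position within a row.
col_index = {name: position for position, name in enumerate(COLUMNS)}

# A sample row copied from the dataset.
sample_row = ["12578908", "Ask HN: What TLD do you use for local development?",
              "", "4", "7", "Sevrene", "9/26/2016 2:53"]

print(col_index["num_comments"])              # 4
print(sample_row[col_index["num_comments"]])  # 7
```

Looking up positions by name avoids scattering bare numbers such as `4` and `6` through the code, at the cost of one extra dictionary.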

In this project, we'll work with a dataset of submissions to the popular technology site Hacker News.

We're specifically interested in posts with titles that begin with either `Ask HN` or `Show HN`. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the community a project, a product, or something they find interesting.

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the dataset into a list of lists.

In [1]:
from csv import reader

# Read the whole CSV into a list of lists; the file is closed automatically
with open("data/HN_posts_year_to_Sep_26_2016.csv", encoding='utf-8') as file_open:
    hn_data = list(reader(file_open))

# Separate the header row from the data rows
headers = hn_data[0]
print(headers, "\n")
hn_data = hn_data[1:]
print(hn_data[:5])

print(len(hn_data[0]))
print("Number of rows: {}\nNumber of columns: {}".format(len(hn_data), len(hn_data[0])))


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
7
Number of ro

The first five rows shown above all have 0 (zero) comments. Since our goal is to examine which posts receive more comments, we will remove the posts that have no comments from the dataset.

In [2]:
# collecting rows with comments in separate list 'hn'
hn = []
for row in hn_data:
    if row[4] != '0':
        hn.append(row)

# checking if there are rows with '0' points
number_points_0 = 0
for row in hn:
    if row[3] == '0':
        number_points_0 += 1
print("Number of rows with '0' points:", number_points_0)        

print('Number of rows in dataset:', len(hn))  

print('First 5 rows:')

for row in hn[:5]:
    print(row)

Number of rows with '0' points: 0
Number of rows in dataset: 80401
First 5 rows:
['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']
['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']
['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']
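
The filtering step above can also be written as a list comprehension. This is a minimal sketch on made-up rows (the `sample_rows` data is invented for illustration); it assumes, as in the dataset, that `num_comments` sits at index 4 and is stored as a string:

```python
# Made-up rows in the same column order as the dataset.
sample_rows = [
    ["1", "First post", "", "3", "0", "alice", "9/26/2016 3:26"],
    ["2", "Second post", "", "5", "2", "bob", "9/26/2016 3:24"],
    ["3", "Third post", "", "1", "10", "carol", "9/26/2016 3:19"],
]

# Keep only rows whose num_comments (index 4) is greater than zero.
with_comments = [row for row in sample_rows if int(row[4]) > 0]

print(len(with_comments))  # 2
```

Converting to `int` before comparing also handles values such as `'00'`, which a string comparison against `'0'` would let through.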


In [3]:
ask_posts=[]
show_posts=[]
other_posts=[]
for row in  hn:
    title = row[1]
    title=title.lower()
    
    if title.startswith("show hn"):
        show_posts.append(row)
    elif title.startswith("ask hn"):
        ask_posts.append(row)
    else:
        other_posts.append(row)

print (len(ask_posts),"\n\n", len(show_posts),"\n\n", len(other_posts),"\n\n")

    
                                        

6911 

 5059 

 68431 
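
The title check in the cell above can be wrapped in a small helper so that the rule is defined in one place. This is only a sketch; `classify_title` is a hypothetical name that does not appear in the analysis:

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

print(classify_title("Ask HN: What TLD do you use for local development?"))  # ask
print(classify_title("SHOW HN: my side project"))                            # show
print(classify_title("Saving the Hassle of Shopping"))                       # other
```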




In [4]:
total_ask_comments=0
for row in ask_posts:
    num_comments=int(row[4])
    total_ask_comments+=num_comments
avg_ask_comments=total_ask_comments/len(ask_posts)


print('Total Ask Comments =', total_ask_comments)
print('\n')
print('Average Ask Comments =', avg_ask_comments)
print('Average Rounded Ask Comments =', round(avg_ask_comments))

total_show_comments=0
for row in show_posts:
    num_comments=int(row[4])
    total_show_comments+=num_comments
# compute the average once, after the loop has finished
avg_show_comments=total_show_comments/len(show_posts)


    
print('\n')
print('Total Show Comments =', total_show_comments)
print('\n')
print('Average Show Comments =', avg_show_comments)
print('Average Rounded Show Comments =', round(avg_show_comments))

total_other_comments=0
for row in other_posts:
    num_comments=int(row[4])
    total_other_comments+=num_comments
# compute the average once, after the loop has finished
avg_other_comments=total_other_comments/len(other_posts)

print('\n')
print('Total other Comments =', total_other_comments)
print('\n')
print('Average other Comments =', avg_other_comments)
print('Average Rounded other Comments =', round(avg_other_comments))
    

Total Ask Comments = 94986


Average Ask Comments = 13.744175951381855
Average Rounded Ask Comments = 14


Total Show Comments = 49633


Average Show Comments = 9.810832180272781
Average Rounded Show Comments = 10


Total other Comments = 1768142


Average other Comments = 25.838318890561297
Average Rounded other Comments = 26


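The findings above show that Ask HN posts receive more comments on average than Show HN posts (about 14 versus 10 comments per post), so the rest of the analysis focuses on Ask HN posts.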
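
The averaging loop is repeated for each group of posts, so it could be factored into a function. A minimal sketch, assuming `num_comments` is at index 4; `average_comments` and the sample rows are hypothetical and only illustrate the idea:

```python
def average_comments(rows, comments_index=4):
    """Average of the num_comments column, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Made-up rows in the same column order as the dataset.
sample_rows = [
    ["1", "Ask HN: a", "", "3", "4", "alice", "9/26/2016 3:26"],
    ["2", "Ask HN: b", "", "5", "8", "bob", "9/26/2016 3:24"],
]

print(average_comments(sample_rows))  # 6.0
```

With such a helper, each group would need a single call, for example `average_comments(ask_posts)`.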

Next, we'll extract the `created_at` value and the number of comments from each Ask HN row and store them as tuples in `result_list`.

Then, iterating over that list, we'll build two dictionaries:

- `counts_by_hour`: contains the number of ask posts created during each hour of the day.
- `comments_by_hour`: contains the total number of comments received by the ask posts created during each hour.
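
The hour is obtained by parsing the `created_at` string into a `datetime` object and formatting it back as a two-digit hour. A minimal sketch for a single timestamp copied from the dataset:

```python
import datetime as dt

timestamp = "9/26/2016 2:53"  # created_at value copied from the dataset

# Parse month/day/year and hour:minute; single-digit fields are accepted.
parsed = dt.datetime.strptime(timestamp, "%m/%d/%Y %H:%M")

# Format only the hour, zero-padded to two digits.
hour = parsed.strftime("%H")

print(hour)  # 02
```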

In [5]:
import datetime as dt
result_list=[]
for row in ask_posts:
    created_at=row[6]
    num_comments=int(row[4])
    result_list.append((created_at,num_comments))

counts_by_hour , comments_by_hour={},{}

for row in result_list:
    hour=row[0]
    date_object=dt.datetime.strptime(hour,"%m/%d/%Y %H:%M")
    hour=date_object.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour]=1
        comments_by_hour[hour]=row[1]
    else:
        counts_by_hour[hour]+=1
        comments_by_hour[hour]+=row[1]
    
print(counts_by_hour)
print(comments_by_hour)
    
    
    

{'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
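
The `if hour not in ...` branch can be avoided with `collections.defaultdict`, which starts every missing key at zero. This is an alternative sketch on made-up tuples, not the approach used in the cell above:

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) tuples shaped like result_list.
sample = [("9/26/2016 2:53", 7), ("9/25/2016 2:10", 3), ("9/26/2016 15:00", 20)]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'02': 2, '15': 1}
print(dict(comments))  # {'02': 10, '15': 20}
```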


Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

- `avg_by_hour`: a list of lists containing the hours during which posts were created and the average number of comments those posts received.


In [6]:
avg_by_hour=[]
for hour in counts_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour]/counts_by_hour[hour]])
print(sorted(avg_by_hour))


[['00', 9.857142857142858], ['01', 9.367713004484305], ['02', 13.198237885462555], ['03', 10.160377358490566], ['04', 12.688172043010752], ['05', 11.139393939393939], ['06', 9.017045454545455], ['07', 10.095541401273886], ['08', 12.43157894736842], ['09', 8.392045454545455], ['10', 13.757990867579908], ['11', 11.143426294820717], ['12', 15.452554744525548], ['13', 22.2239263803681], ['14', 13.153439153439153], ['15', 39.66809421841542], ['16', 10.76144578313253], ['17', 13.73019801980198], ['18', 10.789823008849558], ['19', 9.414285714285715], ['20', 11.38265306122449], ['21', 11.056511056511056], ['22', 11.749128919860627], ['23', 8.322463768115941]]
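
As a worked check of two of the averages, the counts and comment totals printed earlier for hours 15 and 13 can be divided directly. The `*_sample` names below are illustrative:

```python
# Values copied from the counts_by_hour and comments_by_hour output.
counts_sample = {"15": 467, "13": 326}
comments_sample = {"15": 18525, "13": 7245}

# Average comments per post = total comments / number of posts.
avg_sample = {hour: comments_sample[hour] / counts_sample[hour]
              for hour in counts_sample}

print(round(avg_sample["15"], 2))  # 39.67
print(round(avg_sample["13"], 2))  # 22.22
```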


Next, we create a list in which the order of each pair is swapped, so that the average comes first:
- `swap_avg_by_hour`

Then we sort that list from the highest to the lowest average number of comments using the `sorted()` function and assign the result to:

- `sorted_swap`

In [7]:
swap_avg_by_hour=[]
for i in avg_by_hour:
    swap_avg_by_hour.append([i[1],i[0]])
sorted_swap=sorted(swap_avg_by_hour, reverse=True)
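
Swapping the columns is one way to sort by average. An alternative sketch uses the `key` argument of `sorted()` to sort on the second element without building a second list; the sample values are rounded figures taken from the averages above:

```python
# [hour, average] pairs in the same shape as avg_by_hour.
avg_sample = [["12", 15.45], ["15", 39.67], ["13", 22.22]]

# Sort by the average (index 1), highest first.
by_average = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

print(by_average)  # [['15', 39.67], ['13', 22.22], ['12', 15.45]]
```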
    

Finally, we use the `str.format()` method to print each hour and its average in the following format: `15:00: 39.67 average comments per post`.

In [8]:
print("Top 5 Hours for Ask Posts Comments")
template="{}:00: {:.2f} average comments per post"
for i in sorted_swap[:5]:
    print(template.format(i[1],i[0]))
    

Top 5 Hours for Ask Posts Comments
15:00: 39.67 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post
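
The same lines can be produced with f-strings, which accept the same `:.2f` format specification. A minimal sketch using the top two (average, hour) pairs, with averages rounded from the results above:

```python
# (average, hour) pairs in the same order as the entries of sorted_swap.
top = [(39.67, "15"), (22.22, "13")]

lines = [f"{hour}:00: {avg:.2f} average comments per post" for avg, hour in top]

for line in lines:
    print(line)
# 15:00: 39.67 average comments per post
# 13:00: 22.22 average comments per post
```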
