# Hacker News
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

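You can find the data set here. Note that a reduced version exists, cut from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. The file loaded in this notebook, `HN_posts_year_to_Sep_26_2016.csv`, is the full one, with 293,119 rows excluding the header. The column names appear in the header row printed below.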

* First steps:
- Read the file `HN_posts_year_to_Sep_26_2016.csv`
- Display the first row

In [7]:
from csv import reader

# Read the whole CSV into a list of lists; the file is closed when the block exits
with open("HN_posts_year_to_Sep_26_2016.csv") as open_file:
    hn = list(reader(open_file))

# The first row is the header
print(hn[0])
print("\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']




* Steps
- Create a list with only the header
- Create a list with all the data, without the header
- Create a function to explore the data set: print rows and display the total number of rows and columns
- Print the first 5 rows

In [8]:
# 1 - Dataset's header
hn_header = hn[0]
# 2 - dataset without header
hn_dataset = hn[1:]
# 3 - Function to explore the data

def explore_dataset(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('\n------------------------------------------------------')
        print(' * Number of rows:', len(dataset))
        print(' * Number of columns:', len(dataset[0]))
        
explore_dataset(hn_dataset, 0, 5, True)

        
        

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']



------------------------------------------------------
 * Number of rows: 293119
 * Number of columns: 7
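
Separating the header from the data can also happen while reading, by calling `next()` on the reader. The following is a minimal sketch under stated assumptions: it reads from an in-memory string (`sample_csv`, a made-up two-row excerpt with the same header as the real file) rather than from disk, so that it runs anywhere.

```python
import io
from csv import reader

# Made-up excerpt; the header matches the one printed above, the rows are invented.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: An example question?,,4,7,alice,9/26/2016 2:53\n"
    "2,Show HN: An example project,http://example.com,2,0,bob,9/26/2016 1:17\n"
)

with io.StringIO(sample_csv) as handle:
    rows = reader(handle)
    header = next(rows)   # consume the first row as the header
    data = list(rows)     # everything that remains is data

print(header)
print(len(data), "rows,", len(data[0]), "columns")
```

With a real file, `io.StringIO(sample_csv)` would be replaced by `open(...)`; the rest stays the same.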

* Now that we've separated the header from `hn`, we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles, plus a third list for all other posts.

In [9]:
ask_posts = []
show_posts = []
other_posts = []

for posts in hn_dataset:
    title = posts[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(posts) 
    elif title.startswith('show hn'):
        show_posts.append(posts)
    else:
        other_posts.append(posts)
        
print("* Number of ASK HN posts:", len(ask_posts))
print("* Number of SHOW HN posts:", len(show_posts))
print("* Number of OTHER posts:", len(other_posts))

* Number of ASK HN posts: 9139
* Number of SHOW HN posts: 10158
* Number of OTHER posts: 273822
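
The same three-way split can be expressed as a small helper, which makes the prefix rule easy to check on its own. This is a sketch, not part of the original notebook: the names `classify_title` and `split_posts` are illustrative, and the sample rows are made up.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the (case-insensitive) title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


def split_posts(rows, title_index=1):
    """Group rows into a dict of lists keyed by post category."""
    groups = {'ask': [], 'show': [], 'other': []}
    for row in rows:
        groups[classify_title(row[title_index])].append(row)
    return groups


# Made-up rows for illustration; only the title column matters here.
sample_rows = [
    ['1', 'Ask HN: An example question?'],
    ['2', 'SHOW HN: An example project'],
    ['3', 'algorithmic music'],
]

groups = split_posts(sample_rows)
print({kind: len(rows) for kind, rows in groups.items()})
```

Keeping the rule in one function means a change to the prefixes only has to be made in one place.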


* Explore the new lists

In [10]:
print('\n-----------------------------------------------------------------------------------------------------')
print('\n--Some posts starting with the "Ask HN":-------------------------------------------------------------')
print(ask_posts[:5])  
print('\n--Some posts starting with the "Show HN":------------------------------------------------------------')
print(show_posts[:5])  
print('\n-----------------------------------------------------------------------------------------------------')


-----------------------------------------------------------------------------------------------------

--Some posts starting with the "Ask HN":-------------------------------------------------------------
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]

--Some posts starting with the "Show HN":------------------------------------------------------------
[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/

* Next, let's determine if ask posts or show posts receive more comments on average.

In [11]:
# Sum the comments across all ask posts (the total is initialized once, before the loop)
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])

# Average = total comments / number of posts
avg_ask_comments = total_ask_comments / len(ask_posts)
print("* Average comments per ASK HN post: {:.2f}".format(avg_ask_comments))

# Same calculation for show posts
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments / len(show_posts)
print("* Average comments per SHOW HN post: {:.2f}".format(avg_show_comments))


* Average comments per ASK HN post: 10.39
* Average comments per SHOW HN post: 4.89
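
The average number of comments is the total number of comments divided by the number of posts. A minimal sketch of that calculation as a reusable function follows; the name `average_comments` is illustrative (not from the notebook), and the sample is the five ask posts printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# The first five ask posts shown earlier in this notebook.
sample_ask_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
    ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50'],
]

# 7 + 3 + 0 + 3 + 2 = 15 comments over 5 posts
print(average_comments(sample_ask_posts))  # 3.0
```

The same function can be called once with `ask_posts` and once with `show_posts`, which avoids duplicating the loop.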


* Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

 1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
 2. Calculate the average number of comments ask posts receive by hour created.


In [90]:
import datetime as dt

result_list = []

for row in ask_posts:
    number_comments = int(row[4])
    date_and_time = row[6]
    # Append the creation timestamp and the comment count as a pair
    result_list.append([date_and_time, number_comments])
                                               
counts_by_hour = {}
comments_by_hour = {}

date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    # Creation timestamp as a string, e.g. '9/26/2016 2:53'
    date = row[0]
    # Number of comments on the post
    comment = row[1]
    # Parse the string, then use strftime() to keep just the zero-padded hour, e.g. '02'
    time = dt.datetime.strptime(date, date_format).strftime('%H')

    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
        
print(comments_by_hour)
print('\n')
print(counts_by_hour)
                        

{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
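
The "first time seen / already seen" branching can be avoided with `collections.defaultdict`, which starts every missing key at zero. A minimal sketch, assuming rows shaped like the ask posts above; the function name `tally_by_hour` is hypothetical.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(rows, date_index=6, comments_index=4,
                  date_format="%m/%d/%Y %H:%M"):
    """Return (counts_by_hour, comments_by_hour) keyed by zero-padded hour strings."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for row in rows:
        hour = dt.datetime.strptime(row[date_index], date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    return dict(counts), dict(comments)


# The first five ask posts shown earlier in this notebook.
sample_ask_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
    ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50'],
]

counts, comments = tally_by_hour(sample_ask_posts)
print(counts)    # {'02': 1, '01': 1, '22': 2, '21': 1}
print(comments)  # {'02': 7, '01': 3, '22': 3, '21': 2}
```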


* Calculate the average number of comments per post for posts created during each hour of the day.

In [101]:
avg_by_hours = []

for hours in comments_by_hour:
    avg_by_hours.append([hours, comments_by_hour[hours] / counts_by_hour[hours]])

avg_by_hours


[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]
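
The per-hour average is simply the hour's comment total divided by its post count. As a quick check, the sketch below recomputes three of the averages with a dict comprehension, using totals copied from the two dictionaries printed above; keeping the result as a dict (`avg_by_hour`, an illustrative name) is an alternative to the list of lists.

```python
# Per-hour totals for three hours, copied from the outputs above.
comments_by_hour = {'15': 18525, '13': 7245, '09': 1477}
counts_by_hour = {'15': 646, '13': 444, '09': 222}

# One average per hour: total comments / number of posts
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in comments_by_hour
}

for hour, avg in avg_by_hour.items():
    print(hour, round(avg, 2))
```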

* Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [107]:
swap_avg_by_hour = []

for row in avg_by_hours:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


* Use the sorted() function to sort swap_avg_by_hour in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.

 - Set the reverse argument to True, so that the highest value in the first column appears first in the list.
 - Assign the result to swap_sorted.


### Sorting and Printing Values from a List of Lists

In [110]:
swap_sorted = sorted(swap_avg_by_hour, reverse=True)

swap_sorted

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
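
Swapping the columns is one way to sort by the average; another is to pass a `key` function to `sorted()` and leave the `[hour, average]` pairs as they are. A short sketch using six of the pairs computed above:

```python
# A subset of the [hour, average] pairs from avg_by_hours above.
avg_by_hours = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['12', 12.380116959064328],
    ['10', 10.684397163120567],
    ['09', 6.653153153153153],
]

# Sort on the second element (the average), highest first, without swapping columns.
sorted_by_avg = sorted(avg_by_hours, key=lambda pair: pair[1], reverse=True)

print([hour for hour, _ in sorted_by_avg])  # ['15', '13', '12', '02', '10', '09']
```

Sorting with a key also avoids ties being broken by the hour string, which is what happens when whole `[average, hour]` lists are compared.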

* Print the 5 hours with the highest average number of comments from the sorted list.
* The format string takes two arguments:
  - Argument 1 is the hour, inserted at the first placeholder, "{}"
     - The hour is stored as a string in "%H" format (for example '15')
       - Use datetime.strptime() to parse it with the format "%H", then strftime() to render it in the new format "%H:%M" (for example '15:00')
  - Argument 2 is the average, inserted at the second placeholder, "{:.2f}" (two decimal places)
      

In [120]:
print("Top 5 Hours for 'Ask HN' Comments")
# Loop through each average and each hour (in this order) in the first five lists of swap_sorted.
for avg, hr in swap_sorted[:5]:
    # hr is a string in '%H' format, for example '15';
    # parse it and use strftime to render it as '%H:%M', for example '15:00'
    hour_label = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print("In hour: {}, the average comments per post is {:.2f} ".format(hour_label, avg))

Top 5 Hours for 'Ask HN' Comments
In hour: 15:00, the average comments per post is 28.68 
In hour: 13:00, the average comments per post is 16.32 
In hour: 12:00, the average comments per post is 12.38 
In hour: 02:00, the average comments per post is 11.14 
In hour: 10:00, the average comments per post is 10.68 


The hour that receives the most comments per post on average is 15:00, with an average of 28.68 comments per post. That is about 76% higher than the second-highest hour, 13:00, which averages 16.32 comments per post.
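
The size of the gap between the two highest hours can be worked out from the averages printed above. A small sketch of the arithmetic, with variable names chosen for illustration:

```python
# Averages for 15:00 and 13:00, copied from the sorted output above.
top_avg = 28.676470588235293
second_avg = 16.31756756756757

# Relative increase of the top hour over the runner-up, in percent.
relative_increase = (top_avg - second_avg) / second_avg * 100

print("{:.1f}%".format(relative_increase))  # 75.7%
```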


# Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post and which posting time receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that the post be categorized as an ask post and created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST).

However, it should be noted that the file analyzed here includes posts that received no comments (several of the first rows printed above have 0 in the num_comments column), so the averages cover all ask and show posts in the file, not only those that attracted discussion. Within that data set, ask posts received more comments on average than show posts, and ask posts created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST) received the most comments on average.
