# Exploring Hacker News Posts
In this project, we analyze roughly `293,000` rows of Hacker News posts.

We want to know whether 'Ask HN' posts or 'Show HN' posts get more comments on Hacker News. 

We are also analyzing whether posts created at a certain time get more comments on average than others. 

# Opening Our Data Set
We must open our csv file to access the data.

1. Import `reader` from `csv` by using `from csv import reader`.
2. Use `open('HN_posts.csv')` to open the file and save it to the variable `opened_csv`.
3. Use `reader(opened_csv)` to read the file and save it to the variable `read_csv`.
4. Use `list(read_csv)` to create a list of the data and save it to the variable `hn`.

Steps 2-4 can be combined into a single line of code instead of three, as shown below:

In [1]:
from csv import reader

hn = list(reader(open('HN_posts.csv')))

# We print the first few rows of the data to analyze the columns
for row in hn[:5]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
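
The one-line version above leaves the file handle open. A minimal sketch of an alternative, assuming we are free to wrap the steps in a helper (the name `load_rows` and the tiny stand-in file are hypothetical, not part of the original notebook), uses a `with` block so the file is closed as soon as the rows are read:

```python
import csv
import os
import tempfile

def load_rows(path):
    """Read every row of a CSV file into a list of lists, then close the file."""
    # newline='' is what the csv module recommends when opening files for reading
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.reader(f))

# Tiny stand-in file with made-up rows, so the sketch runs without HN_posts.csv
sample = 'id,title,num_comments\n1,"Ask HN: Hello, world?",3\n2,Show HN: A demo,0\n'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf-8') as tmp:
    tmp.write(sample)

rows = load_rows(tmp.name)
os.remove(tmp.name)  # clean up the stand-in file

for row in rows:
    print(row)
```

With the real data, `hn = load_rows('HN_posts.csv')` would play the same role as the one-liner.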




# Removing the Header from Data Set
To analyze our data, we must first separate the header row, which contains the column names, from the rest of the rows.

We then display the first few rows of the new data set to confirm that the header row was removed.

In [2]:
headers = hn[0]

hn = hn[1:]

print(headers)
print('\n')
print('END HEADER')
print('\n')

for row in hn[:3]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


END HEADER


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
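
Once the header row is stored separately, it can also be used to look up column positions by name instead of remembering that the title is at index 1 or the comment count at index 4. The following is only a sketch of that idea; the dictionary name `col` is an assumption and is not used elsewhere in this project:

```python
# Header row and one data row, copied from the output above
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
row = ['12579005', 'SQLAR  the SQLite Archiver',
       'https://www.sqlite.org/sqlar/doc/trunk/README.md',
       '1', '0', 'blacksqr', '9/26/2016 3:24']

# Map each column name to its position in a row (hypothetical helper dictionary)
col = {name: index for index, name in enumerate(headers)}

print(col['title'], col['num_comments'], col['created_at'])
print(row[col['title']])
print(int(row[col['num_comments']]))
```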




# Extracting 'Ask HN' and 'Show HN' Posts
Now that we have removed the header row, we can filter the data set for posts whose titles begin with either `Ask HN` or `Show HN`.

We will use the `startswith` string method to find these posts and sort them into separate lists.

We first loop through our data set:

1. We save the title at index 1 to the variable `title`.
2. We convert `title` to lowercase using `title.lower()`, so that the match is case-insensitive.
3. We check whether `title` starts with `ask hn`, `show hn`, or neither, and append the row to the matching list.
4. After the loop, we print the length of each list.

In [3]:
# We create three lists: ask_posts, show_posts, other_posts
ask_posts = []
show_posts = []
other_posts = []

# We loop through the data set to filter our posts
for row in hn:
    title = row[1]
    title = title.lower()
    
    # If title starts with 'ask hn', append to ask_posts
    if title.startswith('ask hn'):
        ask_posts.append(row)
    # elif title starts with 'show hn', append to show_posts
    elif title.startswith('show hn'):
        show_posts.append(row)
    # Otherwise, append the row to other_posts
    else:
        other_posts.append(row)
        
print('Total Ask HN Posts: ', len(ask_posts))
print('Total Show HN Posts: ', len(show_posts))
print('Total Other Posts: ', len(other_posts))

Total Ask HN Posts:  9139
Total Show HN Posts:  10158
Total Other Posts:  273822
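
The classification rule used in the loop can be isolated in a small function, which makes it easy to try on individual titles. This is a sketch only: the function name `classify_title` and the first two example titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()  # lowercase first so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Two made-up titles plus one title from the sample rows shown earlier
examples = ['Ask HN: How do you learn?', 'SHOW HN: My side project', 'algorithmic music']
labels = [classify_title(title) for title in examples]
print(labels)  # ['ask', 'show', 'other']
```

Note that `startswith` only checks the prefix, so any title beginning with those six or seven characters is counted, whatever follows them.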


# Calculating Average Comments for 'Ask HN' and 'Show HN' Posts
Now that we have our posts sorted by `Ask HN` and `Show HN`, we can calculate the average number of comments for each type of post.

In [4]:
# Running totals of comments for each type of post
total_ask_comments = 0
total_show_comments = 0

# Loop for ask_posts
for comments in ask_posts:
    num_comments = int(comments[4])
    total_ask_comments += num_comments
    
# Loop for show_posts
for comments in show_posts:
    num_comments = int(comments[4])
    total_show_comments += num_comments
    
# Calculate average comments for ask_posts
avg_ask_comments = total_ask_comments / len(ask_posts)

# Calculate average comments for show_posts
avg_show_comments = total_show_comments / len(show_posts)

# Print both averages
print('Average Number of Comments for ask_posts: ', round(avg_ask_comments, 2))
print('Average Number of Comments for show_posts: ', round(avg_show_comments, 2))

Average Number of Comments for ask_posts:  10.39
Average Number of Comments for show_posts:  4.89


The average number of comments for `ask_posts` is `10.39`.

The average number of comments for `show_posts` is `4.89`.

It can be inferred that posts starting with `Ask HN` have a **higher average number of comments** than posts starting with `Show HN`.
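
The two loops above do the same work on different lists, so the calculation could be expressed once as a helper. The sketch below assumes a hypothetical function `average_comments` and uses three made-up rows in the same column layout as the data set:

```python
def average_comments(posts, comments_index=4):
    """Average the integer comment counts stored at comments_index in each row."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)  # raises ZeroDivisionError if posts is empty

# Made-up rows: id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: First question?', '', '5', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: Second question?', '', '2', '2', 'user_b', '9/26/2016 4:10'],
    ['3', 'Ask HN: Third question?', '', '1', '0', 'user_c', '9/26/2016 15:02'],
]

print(average_comments(sample_posts))  # (10 + 2 + 0) / 3 = 4.0
```

On the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would correspond to `avg_ask_comments` and `avg_show_comments`.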

# Calculating Number of 'Ask HN' Posts and Comments Created Per Hour
Now that we have determined which type of post receives more comments on average, we want to know if posting at a specific time has an effect on the number of comments on the post. From here on, we work only with the `Ask HN` posts, since they receive more comments on average.

To determine these specific times, we will follow two steps:

1. First, we will calculate the number of `Ask HN` posts created in each hour of the day, as well as the number of comments they received.
2. Then, we will calculate the average number of comments these posts receive per hour.

We will be using the `datetime` module, as well as the `strptime()` and `strftime()` methods for our analysis.
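
As a quick illustration of how these two calls fit together, the sketch below parses one `created_at` value taken from the sample rows above and extracts its hour. The variable names are chosen for this example only.

```python
import datetime as dt

date_format = '%m/%d/%Y %H:%M'   # month/day/year hour:minute (24-hour clock)
created_at = '9/26/2016 3:26'    # value from the first data row shown earlier

# strptime turns the string into a datetime object
parsed = dt.datetime.strptime(created_at, date_format)

# strftime turns the datetime back into a string; '%H' gives a zero-padded hour
hour_label = parsed.strftime('%H')

print(parsed)       # 2016-09-26 03:26:00
print(hour_label)   # 03
print(parsed.hour)  # 3 (the same hour as an integer)
```

Because `%H` always yields two digits, the resulting labels such as `'03'` and `'15'` sort correctly as strings.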

In [5]:
import datetime as dt

result_list = []

# We append the `created_at` and `num_comments` columns to our `result_list`
for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
# Holds number of ask posts created at each hour of the day
posts_per_hour = {}

# Holds the number of comments the posts received by hour
comments_per_hour = {}
date_format = '%m/%d/%Y %H:%M'

for row in result_list:
    date = row[0]
    num_comments = row[1]
    # Here we format the `created_at` time and grab the `hour`
    hour = dt.datetime.strptime(date, date_format).strftime('%H')
    
    # We make a simple frequency table to count the hours and comments
    if hour not in posts_per_hour:
        posts_per_hour[hour] = 1
        comments_per_hour[hour] = num_comments
    else:
        posts_per_hour[hour] += 1
        comments_per_hour[hour] += num_comments

comments_per_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}
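
The `if hour not in ...` check can be avoided with `collections.defaultdict`, which starts every missing key at zero. This is an alternative sketch rather than the approach used above, and it runs on three made-up `[created_at, num_comments]` pairs so that the result is easy to verify by hand:

```python
from collections import defaultdict
import datetime as dt

date_format = '%m/%d/%Y %H:%M'

# Made-up [created_at, num_comments] pairs in the same shape as result_list
sample_results = [
    ['9/26/2016 3:26', 4],
    ['9/25/2016 3:05', 6],
    ['9/26/2016 15:40', 10],
]

sample_posts_per_hour = defaultdict(int)     # missing hours start at 0
sample_comments_per_hour = defaultdict(int)

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
    sample_posts_per_hour[hour] += 1
    sample_comments_per_hour[hour] += num_comments

print(dict(sample_posts_per_hour))     # {'03': 2, '15': 1}
print(dict(sample_comments_per_hour))  # {'03': 10, '15': 10}
```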

# Calculating Average Number of Comments per 'Ask HN' Post by Hour
We now have two dictionaries, `posts_per_hour` and `comments_per_hour`.

`posts_per_hour` contains the number of `Ask HN` posts created during each hour of the day.

`comments_per_hour` contains the total number of comments received by the posts created during each hour.

We will now use these two dictionaries to calculate the average number of comments per post during each hour of the day.

In [6]:
# Average comments for posts created per hour of the day
avg_per_hour = []

for hour in comments_per_hour:
    avg_per_hour.append([hour, comments_per_hour[hour] / posts_per_hour[hour]])
    
avg_per_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]
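
The same division can be written as a dictionary comprehension that pairs each hour with its average. The sketch below uses two small made-up dictionaries (not the real counts) so the arithmetic is easy to follow:

```python
# Made-up totals for two hours, in the same shape as the real dictionaries
sample_posts_per_hour = {'03': 2, '15': 1}
sample_comments_per_hour = {'03': 10, '15': 10}

# average = total comments in the hour / number of posts in the hour
sample_avg_per_hour = {
    hour: sample_comments_per_hour[hour] / count
    for hour, count in sample_posts_per_hour.items()
}

print(sample_avg_per_hour)  # {'03': 5.0, '15': 10.0}
```

A dictionary keeps the hour-to-average lookup direct, while the list of lists used above is more convenient for the sorting step that follows.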

# Sorting and Printing Values from a List of Lists
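Now that we have the average number of comments for posts created during each hour of the day, we will sort the `avg_per_hour` list for easier reading. Because `sorted()` compares lists element by element, we first swap each pair so that the average comes before the hour.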

In [7]:
swap_avg_per_hour = []

for row in avg_per_hour:
    swap_avg_per_hour.append([row[1], row[0]])
    
print(swap_avg_per_hour)

sorted_swap = sorted(swap_avg_per_hour, reverse = True)

sorted_swap

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
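
Swapping the columns is one way to sort by the average. Another option, sketched here on three of the `[hour, average]` pairs from the output above, is to pass a `key` function to `sorted()` so the original column order is kept. The variable names are assumptions for this example.

```python
# Three [hour, average] pairs copied from avg_per_hour
subset = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
]

# key picks the value to sort by (the average at index 1); reverse=True sorts descending
ranked = sorted(subset, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print(hour, round(avg, 2))
```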

In [8]:
print('Top 5 Hours for Ask Post Comments')
print('\n')

for avg, hour in sorted_swap[:5]:
    print('{}: {:.2f} average comments per post.'.format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"),avg))

Top 5 Hours for Ask Post Comments


15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
02:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.


From our analysis above, posts created during the `15:00` hour have the highest average, at `28.68` comments per post.

According to the [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the `created_at` column is in Eastern Standard Time, so `15:00` corresponds to `3:00pm` EST.

Creating an `Ask HN` post at around `3:00pm` EST therefore seems to attract the most comments on average.
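
The conversion from a 24-hour label to a 12-hour clock time can also be done in code rather than by hand. The following is a minimal sketch, assuming a hypothetical helper named `to_12_hour`; the AM/PM text produced by `%p` depends on the locale, and the expected values shown assume the default one.

```python
import datetime as dt

def to_12_hour(hour_label):
    """Turn a zero-padded 24-hour label such as '15' into a string like '3:00 PM'."""
    parsed = dt.datetime.strptime(hour_label, '%H')
    # %I is the hour on a 12-hour clock, %p is the AM/PM marker
    return parsed.strftime('%I:%M %p').lstrip('0')

print(to_12_hour('15'))  # 3:00 PM
print(to_12_hour('02'))  # 2:00 AM
```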

# Conclusion
In this project, we analyzed post data from the Hacker News data set. We separated the posts whose titles start with `Ask HN` from those that start with `Show HN` to determine which type of post attracts more comments. `Ask HN` posts had the higher average number of comments per post.

We also looked at when `Ask HN` posts were created to determine what time of day a post might receive a higher average number of comments. Posts created during the `15:00` hour, or `3:00pm` EST, had the highest average number of comments per post.