# **Exploration of Hacker News Posts**

This project was developed as part of the 'Python for Data Science: Intermediate' training from [Dataquest](https://www.dataquest.io/).

The goal of this project is to compare two types of posts from the Hacker News [website](https://news.ycombinator.com/): Ask HN and Show HN. We will explore and analyze the data to determine which type of post receives more comments on average, and whether posts created at certain times receive more comments on average.

Note: The dates and times in this project are from the US Eastern Time Zone (EST/EDT).

## Opening the Data Set

The data set we will be using for this project covers the 12 months up to September 26, 2016 and contains almost 300,000 rows. It can be downloaded [here](https://www.kaggle.com/hacker-news/hacker-news-posts?select=HN_posts_year_to_Sep_26_2016.csv).

Below, we will open the data set (saved locally as hacker_news.csv) and take a look at the first 5 rows:

In [1]:
from csv import reader

# Open the Hacker News data set; the with block closes the file when done
with open('hacker_news.csv', encoding='utf8', newline='') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# Separate the column headers from the main data
hn_header = hn[0]
hn = hn[1:]

In [2]:
# Displaying the first 5 rows
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # Adds a new (empty) line after each row for readability
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(hn_header)
print('\n')
explore_data(hn, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Number of rows: 293119
Number of columns: 7

Here we can see that the Hacker News data set contains 293,119 rows (one per post) and 7 columns. We stored the column headers separately in hn_header so that hn holds only post data.
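As a minimal sketch of the header/data split, the snippet below runs the same steps on a small in-memory CSV instead of the real file. The sample rows and the names sample_csv, rows, header, and data are made up for illustration and are not part of the project code.

```python
import io
from csv import reader

# Hypothetical three-line sample standing in for hacker_news.csv
sample_csv = io.StringIO(
    "id,title,num_comments\n"
    "1,Ask HN: Example question?,7\n"
    "2,Show HN: Example project,0\n"
)

rows = list(reader(sample_csv))
header = rows[0]   # first row holds the column names
data = rows[1:]    # remaining rows hold the posts

print(header)
print(len(data), 'rows,', len(data[0]), 'columns')
```

Note that every value comes back as a string, which is why the comment counts are converted with int() later in the analysis.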

## Extracting Ask HN and Show HN Posts

As we mentioned earlier, we are interested in comparing the Ask HN and Show HN posts, so we will be separating these two types of posts into their own lists below:

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
    
print('Ask HN:', len(ask_posts))
print('Show HN:', len(show_posts))
print('Other:', len(other_posts))

print('\n')
print('Ask HN:')
print(ask_posts[:5])

print('\n')
print('Show HN:')
print(show_posts[:5])

Ask HN: 9139
Show HN: 10158
Other: 273822


Ask HN:
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]


Show HN:
[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'h

The data set contains 9,139 Ask HN posts and 10,158 Show HN posts. The remaining 273,822 posts are stored in a list called other_posts. The output also shows the first 5 posts of the Ask and Show lists.
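The prefix check can also be wrapped in a small helper, which makes the case-insensitive matching easy to try on individual titles. This is only a sketch: classify_title and sample_titles are hypothetical names, and the second title is an upper-cased variant made up to show that lower() lets it match.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'SHOW HN: Finding puns computationally',  # upper-cased on purpose
    'algorithmic music',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```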

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Next, we will determine if Ask posts or Show posts receive more comments on average:

In [4]:
# Finding the total number of comments in Ask posts
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

# Finding the average number of comments in Ask posts
avg_ask_comments = total_ask_comments / len(ask_posts)

print('Total Ask Comments:', total_ask_comments)
print('Average Ask Comments:', avg_ask_comments)

Total Ask Comments: 94986
Average Ask Comments: 10.393478498741656


In [5]:
# Finding the total number of comments in Show posts
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
# Finding the average number of comments in Show posts
avg_show_comments = total_show_comments / len(show_posts)

print('Total Show Comments:', total_show_comments)
print('Average Show Comments:', avg_show_comments)

Total Show Comments: 49633
Average Show Comments: 4.886099625910612


Based on the findings above, Ask posts receive more comments on average compared to Show posts (about 10 comments per Ask post vs. about 5 comments per Show post).

Since Ask posts receive more comments on average, we will focus our remaining analysis just on these posts.
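Because the two averages are computed the same way, the calculation could be factored into one function. The sketch below assumes the row layout shown earlier (num_comments at index 4); average_comments and sample_ask are hypothetical names, and the two sample rows are the first two Ask posts from the output above.

```python
def average_comments(posts, comments_index=4):
    """Average number of comments over a list of post rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)  # raises ZeroDivisionError for an empty list

sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '',
     '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '',
     '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]

print(average_comments(sample_ask))
```

With the full lists, average_comments(ask_posts) and average_comments(show_posts) would be expected to reproduce the two averages printed above.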

## Date/Time Analysis of Ask HN Posts and Comments

Now, we will determine if Ask posts created at a certain *time* attract more comments on average. We'll use the following steps to perform this analysis:
1. Calculate the number of Ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments Ask posts receive by hour created.
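The two steps above can be sketched on a handful of posts. This version uses collections.defaultdict, which is a swapped-in alternative to the explicit membership checks used in this project; the names sample_posts, counts, comments, and averages are hypothetical, and the sample values are taken from the first Ask rows shown earlier.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs from the first four Ask posts
sample_posts = [
    ('9/26/2016 2:53', 7),
    ('9/26/2016 1:17', 3),
    ('9/25/2016 22:57', 0),
    ('9/25/2016 22:48', 3),
]

# Step 1: count posts and total comments per hour of creation
counts = defaultdict(int)
comments = defaultdict(int)
for created_at, num_comments in sample_posts:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

# Step 2: average comments per post for each hour
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```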

### Part One: Finding the Number of Ask Posts and Comments by Hour Created

First, we will count the Ask posts created in each hour and total the comments they received:

In [6]:
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at = row[0]
    comments = row[1]
    # Parse the timestamp, then keep only the zero-padded hour (e.g. '03')
    created_dt = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = created_dt.strftime('%H')

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        
print(comments_by_hour)
print('\n')
print(counts_by_hour)

{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
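The dictionary keys above are two-character strings such as '03' because the hour is produced with strftime('%H'). The short sketch below shows what strptime and strftime do to one timestamp from the data set, and contrasts the string label with the integer .hour attribute; the variable names are illustrative only.

```python
import datetime as dt

created_at = '9/26/2016 3:26'  # timestamp of the first row in the data set

# strptime turns the text into a datetime object
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

hour_label = parsed.strftime('%H')  # zero-padded string
hour_number = parsed.hour           # plain integer

print(repr(hour_label), hour_number)
```

Either form would work as a dictionary key; the string form sorts and prints consistently as two digits.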


### Part Two: Calculating the Average Number of Comments for Ask HN Posts by Hour

Next, we will calculate the average number of comments per post for posts created during each hour of the day:

In [7]:
avg_by_hour = []

for hour in comments_by_hour:
    comments = comments_by_hour[hour]
    counts = counts_by_hour[hour]
    avg_by_hour.append([hour, (comments/counts)])
    
for row in avg_by_hour:
    print(row)
    print('\n')

['02', 11.137546468401487]


['01', 7.407801418439717]


['22', 8.804177545691905]


['21', 8.687258687258687]


['19', 7.163043478260869]


['17', 9.449744463373083]


['15', 28.676470588235293]


['14', 9.692007797270955]


['13', 16.31756756756757]


['11', 8.96474358974359]


['10', 10.684397163120567]


['09', 6.653153153153153]


['07', 7.013274336283186]


['03', 7.948339483394834]


['23', 6.696793002915452]


['20', 8.749019607843136]


['16', 7.713298791018998]


['08', 9.190661478599221]


['00', 7.5647840531561465]


['18', 7.94299674267101]


['12', 12.380116959064328]


['04', 9.7119341563786]


['06', 6.782051282051282]


['05', 8.794258373205741]




## Sorting the Results

Now we can see the average number of comments per post for each hour in which a post was created. Although we have the results, the unsorted list makes it hard to identify the hours with the highest values. To resolve this, we'll sort the results and print the 5 highest values in a format that is easier to read:

In [8]:
swap_avg_by_hour = []

# Swapping the indexes for average number of comments and associated hour
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


In [9]:
# Sorting the results in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for Ask Posts Comments:')

for row in sorted_swap[:5]:
    # Parse the hour string so it can be formatted as HH:MM
    hour_dt = dt.datetime.strptime(row[1], '%H')
    hour_time = hour_dt.strftime('%H:%M')

    results_str = '{}: {:.2f} average comments per post.'
    print(results_str.format(hour_time, row[0]))

Top 5 Hours for Ask Posts Comments:
15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
02:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.
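Swapping the columns works because sorted() compares lists element by element. An alternative, shown here only as a sketch, is to leave each [hour, average] pair as it is and pass a key function to sorted(). The four pairs below are copied from the output above; top_hours is a hypothetical name.

```python
# A few [hour, average] pairs taken from the results above
avg_by_hour = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['09', 6.653153153153153],
]

# Sort by the average (index 1), highest first, without swapping columns
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print('{}:00: {:.2f} average comments per post.'.format(hour, avg))
```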


## Conclusion

Now that we have results that are much easier to read, we can see that Ask posts created at 15:00 (or 3:00pm) receive the most comments on average. 15:00 has an average of almost 29 comments per post, compared to 13:00 (or 1:00pm) in second place with an average of about 16 comments per post.

Note: The dates and times in this project are from the US Eastern Time Zone (EST/EDT).