# Exploring Hacker News Posts

[Hacker News](https://news.ycombinator.com/) is a website where user-submitted stories (better known as posts) receive votes and comments. It's popular in technology and startup circles.

For our analysis, two types of posts are of interest:

- `Ask HN`: questions to the community,
- `Show HN`: announcements and showcasing of products, projects etc.

In our analysis, we will determine:

- Which type of post (`Ask HN` or `Show HN`) receives more comments on average.
- Whether posts created at a certain time of the day receive more comments on average.

We'll be working with the [`hacker_news.csv`](https://dq-content.s3.amazonaws.com/356/hacker_news.csv) dataset.

## Data import

We start by importing the dataset from a CSV file called `hacker_news.csv`. To do that, we open the file and use `reader` from the `csv` module to read it. We then convert the data into a list of lists and assign it to a variable called `hn`.

In [1]:
from csv import reader

open_file = open("hacker_news.csv")
read_file = reader(open_file)
hn = list(read_file)
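
As a hedged aside, the same reading step can be wrapped in a `with` block so the file is closed automatically once it has been read. The sketch below uses a small in-memory sample (`sample_csv`, a hypothetical one-row excerpt built from the first row of the dataset) in place of the real file so that it runs on its own. With the real dataset, the `io.StringIO(...)` call would be replaced by `open("hacker_news.csv")`.

```python
import io
from csv import reader

# Hypothetical excerpt standing in for hacker_news.csv: header plus one row.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# The with-block closes the file-like object as soon as the block ends.
with io.StringIO(sample_csv) as sample_file:
    sample_rows = list(reader(sample_file))

print(sample_rows[0])  # header row
print(sample_rows[1])  # first data row; every value is still a string
```

Note that `reader` returns every field as a string, which is why the later cells call `int()` on the comment counts.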

Let's take a look at the first few rows of our dataset.

In [2]:
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


As we can see, the first list in our list of lists contains the column headers. We want to separate it from the rest so we can work with the data alone.

### Removing Headers from a List of Lists

We will extract the first row of `hn`, which contains the column names, and save it to a variable called `headers`. Then we'll slice off that row and assign the result back to `hn`, overwriting the version that included the header.

In [3]:
headers = hn[0]
hn = hn[1:]

In [4]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
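
For comparison, here is a minimal sketch of the same header split written with star-unpacking instead of two slicing statements. The `sample` list and the names `sample_headers` and `sample_rows` are made up for illustration and are not part of the notebook.

```python
# Hypothetical miniature dataset: one header row plus two data rows.
sample = [
    ["id", "title", "num_comments"],
    ["1", "Ask HN: Example question?", "6"],
    ["2", "Show HN: Example project", "3"],
]

# Star-unpacking assigns the first row to one name and the rest to another.
sample_headers, *sample_rows = sample

print(sample_headers)
print(len(sample_rows))
```

Both approaches give the same result; slicing, as used in this notebook, keeps the original variable name `hn` for the data rows.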


In the table below, we have included the name of the column and a description of what it refers to in order to make it easier to understand the data.

| Column Name | Description |
| ----------- | ----------- |
| id | the unique identifier from Hacker News for the post |
| title | the title of the post |
| url | the URL that the post links to, if the post has a URL |
| num_points | the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
| num_comments | the number of comments on the post |
| author | the username of the person who submitted the post |
| created_at | the date and time of the post's submission |

Now we print a few rows to check that the row of column names was removed.

In [5]:
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Extracting Ask HN and Show HN Posts

Now we are ready to filter our data. We're only interested in posts whose titles begin with `Ask HN` or `Show HN`, so we will create lists of lists containing only those posts.

To separate posts beginning with `Ask HN` and `Show HN`, we create three empty lists: `ask_posts`, `show_posts` and `other_posts`. We loop through each row in the dataset and use the built-in string methods `lower()` and `startswith()`. First, we call `lower()` on the title column and save the result to a variable called `title`. Lowercasing the whole title ensures we don't miss any posts because of inconsistent capitalization. Next, we use `startswith()` to check whether the lowercased title starts with:
- `ask hn`: if `True`, we append the row to the `ask_posts` list,
- `show hn`: if `True`, we append the row to the `show_posts` list,
- anything else: if both previous conditions are `False`, we append the row to the `other_posts` list.

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
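
To illustrate why the title is lowercased before the prefix check, here is a small sketch. The helper `classify_title` is a hypothetical name, and the second sample title has been deliberately re-capitalized to show that the check is case-insensitive.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How to improve my personal website?",
    "SHOW HN: Something pointless I made",  # capitalization changed on purpose
    "Interactive Dynamic Video",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```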

We print a few rows to check that the previous code works as expected.

In [7]:
print("Ask posts: ", ask_posts[:5])
print("\n")
print("Show posts: ", show_posts[:5])
print("\n")
print("Other posts: ", other_posts[:5])

Ask posts:  [['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


Show posts:  [['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https

Now we check how many posts are in each of our lists.

In [8]:
print("Number of Ask HN posts: ", len(ask_posts))
print("Number of Show HN posts: ", len(show_posts))
print("Number of other posts: ", len(other_posts))

Number of Ask HN posts:  1744
Number of Show HN posts:  1162
Number of other posts:  17194


### Calculating the Average Number of Comments for Ask HN and Show HN Posts

Let's determine if ask posts or show posts receive more comments on average.

To determine the average number of comments on ask posts, we initialize a variable `total_ask_comments` to 0. Next, we iterate over the rows in the `ask_posts` list to:
- extract the number of comments from each row, convert it to an integer and assign it to a variable called `n_comments`,
- add it to `total_ask_comments`.

When the loop finishes, `total_ask_comments` holds the total number of comments on ask posts.

Then we calculate the average number of comments on ask posts (assigned to a variable called `avg_ask_comments`) by dividing `total_ask_comments` by the length of the `ask_posts` list and rounding the result to two decimal places.

In [9]:
total_ask_comments = 0

for row in ask_posts:
    n_comments = int(row[4])
    total_ask_comments += n_comments
    
avg_ask_comments = round((total_ask_comments / len(ask_posts)), 2)

print("Average number of comments on ask posts: ", avg_ask_comments)

Average number of comments on ask posts:  14.04


To determine the average number of comments on show posts, we initialize a variable `total_show_comments` to 0. Next, we iterate over the rows in the `show_posts` list to:
- extract the number of comments from each row, convert it to an integer and assign it to a variable called `n_comments`,
- add it to `total_show_comments`.

When the loop finishes, `total_show_comments` holds the total number of comments on show posts.

Then we calculate the average number of comments on show posts (assigned to a variable called `avg_show_comments`) by dividing `total_show_comments` by the length of the `show_posts` list and rounding the result to two decimal places.


In [10]:
total_show_comments = 0

for row in show_posts:
    n_comments = int(row[4])
    total_show_comments += n_comments
    
avg_show_comments = round((total_show_comments / len(show_posts)), 2)

print("Average number of comments on show posts: ", avg_show_comments)

Average number of comments on show posts:  10.32
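
Since the two cells above differ only in the list they loop over, the calculation could be factored into one function. The following is a sketch under that assumption. The name `average_comments`, its `comments_index` parameter, and the choice to return `0.0` for an empty list are illustrative and not part of the notebook.

```python
def average_comments(rows, comments_index=4):
    """Mean of the num_comments column, rounded to two decimal places."""
    if not rows:
        return 0.0  # assumption: report 0.0 rather than divide by zero
    total = sum(int(row[comments_index]) for row in rows)
    return round(total / len(rows), 2)


# The first two ask posts from the output shown earlier.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
]

print(average_comments(sample_ask))  # (6 + 29) / 2 = 17.5
```

Calling the function once with `ask_posts` and once with `show_posts` would reproduce the two averages printed above.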


As we can see, `Ask HN` posts receive more comments on average than `Show HN` posts (14.04 versus 10.32). Because ask posts are more likely to generate comments, the rest of our analysis focuses on the `ask_posts` list.

### Finding the Number of Ask Posts and Comments by Hour Created

We will now determine whether `Ask HN` posts created at a certain time are more likely to attract comments.

To do that, we need to import the `datetime` module, which lets us work with dates and times.

Then we create an empty list and name it `result_list`. We loop over each row in `ask_posts` and assign two columns to variables:
- the date and time of the post's submission to a variable called `created_at`, and
- the number of comments on the post, converted to an integer, to a variable called `n_comments`.

After that, we append these two variables as a list to `result_list`. When the loop finishes, we have a list of lists holding the date and time each post was submitted and the number of comments it received.

In [11]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    n_comments = int(row[4])
    result_list.append([created_at, n_comments])

Let's print a few rows to check that our list of lists looks the way we want.

In [12]:
print(result_list[:5])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]
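
The same two-column extraction can also be written as a list comprehension. This is only an alternative sketch; `sample_ask` and `sample_result` are illustrative names, and the two rows are taken from the ask posts printed earlier.

```python
# Two ask posts copied from the output shown earlier.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
]

# Keep created_at (index 6) and num_comments (index 4, converted to int).
sample_result = [[row[6], int(row[4])] for row in sample_ask]

print(sample_result)  # [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29]]
```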


To calculate the number of ask posts created in each hour of the day, along with the number of comments they received, we create two empty dictionaries:
- `counts_by_hour`: contains the number of ask posts created during each hour of the day, and
- `comments_by_hour`: contains the total number of comments received by ask posts created during each hour.

In a for loop we iterate over the rows in `result_list`:
- we use the `datetime.strptime()` method to create a datetime object and assign it to a variable `date_obj`,
- we use the `strftime()` method of that datetime object to extract the hour from the date and assign it to a variable `hour`,
- we also create a variable `comment`, which contains the number of comments.

Next we check whether `hour` is not yet a key in `counts_by_hour`:
- if that is `True`:
    - we create the key `hour` in `counts_by_hour` and set it to 1,
    - we create the key `hour` in `comments_by_hour` and set it to `comment`,
- if that is `False`:
    - we increment the value for the key `hour` in `counts_by_hour` by 1,
    - we increment the value for the key `hour` in `comments_by_hour` by `comment`.
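
Before running the full loop, it may help to see the parsing step on a single timestamp. The sketch below parses the first `created_at` value from `result_list`; the variable names `timestamp`, `parsed` and `hour_text` are illustrative.

```python
import datetime as dt

timestamp = "8/16/2016 9:55"

# strptime turns the string into a datetime object using the given format.
parsed = dt.datetime.strptime(timestamp, "%m/%d/%Y %H:%M")

# strftime turns the datetime back into a string; "%H" is the zero-padded hour.
hour_text = parsed.strftime("%H")

print(parsed)       # 2016-08-16 09:55:00
print(hour_text)    # 09
print(parsed.hour)  # 9 (the same hour as an integer)
```

The notebook uses the zero-padded string form (`'09'`) as the dictionary key; the integer attribute `parsed.hour` would work equally well as a key.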

In [13]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_obj = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date_obj.strftime("%H")
    comment = int(row[1])
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
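
As a possible alternative to the explicit `if`/`else` branch, `collections.defaultdict` supplies a starting value of 0 for any missing key. The sketch below uses the five rows printed from `result_list` plus one made-up row (marked in the code) so that one hour occurs twice.

```python
import datetime as dt
from collections import defaultdict

sample_result = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
    ['1/1/2016 9:05', 4],  # hypothetical row, added so hour '09' appears twice
]

# defaultdict(int) returns 0 for a key that has not been seen yet.
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, n_comments in sample_result:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))
print(dict(sample_comments))
```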

Let's print the `counts_by_hour` dictionary to check that it looks correct.

In [14]:
print(counts_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


Let's print the `comments_by_hour` dictionary to check that it looks correct.

In [15]:
print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


### Calculating the Average Number of Comments for Ask HN Posts by Hour

Now we will use the dictionaries `counts_by_hour` and `comments_by_hour` to calculate the average number of comments for posts created during each hour of the day.

We create a list `avg_by_hour`. In a for loop we iterate over every hour in `counts_by_hour` and:
- create a variable `n_of_counts` containing the number of posts created in that hour,
- create a variable `n_of_comments` containing the number of comments on posts created in that hour,
- calculate the average number of comments per post for that hour by dividing `n_of_comments` by `n_of_counts`, and round the value to two decimal places,
- append `hour` and `average` as a list to `avg_by_hour`.

In [16]:
avg_by_hour = []

for hour in counts_by_hour:
    n_of_counts = counts_by_hour[hour]
    n_of_comments = comments_by_hour[hour]
    average = round((n_of_comments / n_of_counts), 2)
    avg_by_hour.append([hour, average])
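
As a quick arithmetic check, the sketch below repeats the calculation for two of the hours, using the counts and comment totals printed above for `'15'` and `'02'`. The names `sample_counts`, `sample_comments` and `sample_avg` are illustrative, and a list comprehension is used in place of the loop.

```python
# Values copied from the counts_by_hour and comments_by_hour outputs above.
sample_counts = {"15": 116, "02": 58}
sample_comments = {"15": 4477, "02": 1381}

# One [hour, average] pair per hour, rounded to two decimal places.
sample_avg = [
    [hour, round(sample_comments[hour] / sample_counts[hour], 2)]
    for hour in sample_counts
]

print(sample_avg)  # [['15', 38.59], ['02', 23.81]]
```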

Let's print our `avg_by_hour` list

In [17]:
avg_by_hour

[['09', 5.58],
 ['13', 14.74],
 ['10', 13.44],
 ['14', 13.23],
 ['16', 16.8],
 ['23', 7.99],
 ['12', 9.41],
 ['17', 11.46],
 ['15', 38.59],
 ['21', 16.01],
 ['20', 21.52],
 ['02', 23.81],
 ['18', 13.2],
 ['03', 7.8],
 ['05', 10.09],
 ['19', 10.8],
 ['01', 11.38],
 ['22', 6.75],
 ['08', 10.25],
 ['04', 7.17],
 ['00', 8.13],
 ['06', 9.02],
 ['07', 7.85],
 ['11', 11.05]]

Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's sort the list of lists and print the few highest values in a format that's easier to read.

### Sorting and Printing Values from a List of Lists

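We start by creating a list `swap_avg_by_hour` in which the first column holds the average number of comments for posts created during each hour of the day and the second column holds the hour. Swapping the columns matters because `sorted()` compares lists by their first element, so putting the average first lets us sort by it.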

In [18]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

Let's check our changes

In [19]:
swap_avg_by_hour

[[5.58, '09'],
 [14.74, '13'],
 [13.44, '10'],
 [13.23, '14'],
 [16.8, '16'],
 [7.99, '23'],
 [9.41, '12'],
 [11.46, '17'],
 [38.59, '15'],
 [16.01, '21'],
 [21.52, '20'],
 [23.81, '02'],
 [13.2, '18'],
 [7.8, '03'],
 [10.09, '05'],
 [10.8, '19'],
 [11.38, '01'],
 [6.75, '22'],
 [10.25, '08'],
 [7.17, '04'],
 [8.13, '00'],
 [9.02, '06'],
 [7.85, '07'],
 [11.05, '11']]

We use the `sorted()` function with the `reverse` argument set to `True` to sort our `swap_avg_by_hour` list from the highest average number of comments to the lowest.

In [20]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
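
As a possible alternative to swapping the columns first, `sorted()` accepts a `key` function that selects the value to sort by. The sketch below sorts a four-row excerpt of `avg_by_hour` directly by its second column; `sample_avg` and `sample_sorted` are illustrative names.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
sample_avg = [['09', 5.58], ['13', 14.74], ['15', 38.59], ['02', 23.81]]

# key picks the average (index 1); reverse=True puts the largest first.
sample_sorted = sorted(sample_avg, key=lambda pair: pair[1], reverse=True)

print(sample_sorted)
# [['15', 38.59], ['02', 23.81], ['13', 14.74], ['09', 5.58]]
```

With this approach the rows keep their original `[hour, average]` order, so no swapped copy of the list is needed.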

To show the few highest values in a format that's easier to read, we create an empty list called `final_list`. In a for loop we iterate over each row of the `sorted_swap` list and:
- use the `datetime.strptime()` method to create a datetime object and assign it to a variable `data_obj`,
- use the `strftime()` method of that datetime object to change the format of the hour and assign it to a variable `hour`,
- create a variable `template` containing the string we want to generate, which follows the format `15:00 38.59 average comments per post`,
- append the template, filled in with the `format()` method (which replaces the `{}` placeholders with the values of our variables), to `final_list`.

In [21]:
final_list = []

for row in sorted_swap:
    data_obj = dt.datetime.strptime(row[1], "%H")
    hour = data_obj.strftime("%H:%M")
    template = "{time} {average:.2f} average comments per post"
    
    final_list.append(template.format(time = hour, average = row[0]))

Let's print the top 5 hours for comments on ask posts.

In [22]:
final_list[:5]

['15:00 38.59 average comments per post',
 '02:00 23.81 average comments per post',
 '20:00 21.52 average comments per post',
 '16:00 16.80 average comments per post',
 '21:00 16.01 average comments per post']
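
The same strings can also be produced with f-strings, which place the values directly inside the text. This is a minimal sketch using the top three `[average, hour]` pairs from the sorted output; `sample_sorted` and `sample_lines` are illustrative names.

```python
import datetime as dt

# Top three [average, hour] pairs copied from the sorted results above.
sample_sorted = [[38.59, '15'], [23.81, '02'], [16.8, '16']]

sample_lines = []
for average, hour in sample_sorted:
    # Parse the two-digit hour, then render it as HH:MM.
    time_text = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    # :.2f formats the average with exactly two decimal places.
    sample_lines.append(f"{time_text} {average:.2f} average comments per post")

for line in sample_lines:
    print(line)
```

The `:.2f` format specification is what turns `16.8` into `16.80`, matching the output above.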

The hour of the day with the highest average number of comments per post is 15:00, or 3:00 pm EST. The next highest is 02:00 (2:00 am), with an average of 23.81 comments per post.

## Conclusion

After analyzing the dataset, we found that `Ask HN` posts receive more comments per post on average than `Show HN` posts, and that `Ask HN` posts created at 3:00 pm Eastern Standard Time (EST) receive the most comments, at 38.59 average comments per post. The next highest hour, 2:00 am EST, averaged 23.81.