# Exploring Hacker News Posts

![Image](https://s3.amazonaws.com/dq-content/354/hacker_news.jpg)

This project analyzes the Hacker News Posts (HN) dataset, a collection of submissions to the technology site [Hacker News](http://news.ycombinator.com/). Hacker News is a site where user-submitted stories (known as "posts") receive votes and comments; depending on the votes a post receives, it moves up or down in the feed of posts, similar to reddit.

You can find the Hacker News Posts dataset [here](http://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
- `id`: the unique identifier from Hacker News for the post;
- `title`: the title of the post;
- `url`: the URL that the post links to, if the post has a URL;
- `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes;
- `num_comments`: the number of comments on the post;
- `author`: the username of the person who submitted the post;
- `created_at`: the date and time of the post's submission.

We are specifically interested in posts with titles that begin with either "Ask HN" or "Show HN". Users submit "Ask HN" posts to ask the Hacker News community a specific question. Likewise, users submit "Show HN" posts to show the Hacker News community a project, product, or just something interesting. We will compare these two types of posts to determine the following:
- Do "Ask HN" or "Show HN" receive more comments on average?
- Do posts created at a certain time receive more comments on average?

### Opening and exploring the data
First, to start working with the dataset, we will read it from a CSV file and assign it to the `HN` variable. To do this, we will import `reader` from the `csv` module and write the `extract_data` function, which takes one argument, `directory` (the path to the file). The function returns the contents of the dataset as a list of lists.
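
As a small illustration of the "list of lists" format, the sketch below runs `reader` over an in-memory CSV string instead of a file. The names `sample_csv` and `rows` are made up for this example, and the two lines of sample text stand in for the real file; they reuse the header and one row from the dataset.

```python
from csv import reader
from io import StringIO

# Hypothetical two-line sample standing in for hacker_news.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# reader yields one list of strings per CSV line; list() collects them.
rows = list(reader(StringIO(sample_csv)))

print(rows[0])     # the header row
print(rows[1][4])  # num_comments of the first post, still a string
```

Every value comes back as a string, which is why the later cells convert `num_comments` with `int()` before doing arithmetic.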

In [1]:
from datetime import datetime as dt
from datetime import timedelta
from csv import reader
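def extract_data(directory):
    # Open the file in a context manager so it is closed after reading
    with open(directory, encoding="utf8") as opened_dataset:
        return list(reader(opened_dataset))
# Raw string so the backslashes in the Windows path are not treated as escapes
HN = extract_data(r'..\..\Datasets\P2_Exploring_Hacker_News_Posts\hacker_news.csv')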


To take a first look at the data from the Hacker News dataset, we will write the `explore_data` function, which takes four arguments:
1. `dataset` - the dataset to explore, as a list of lists.
2. `start` - the index of the first row to display.
3. `end` - the index at which to stop displaying rows (this row itself is not displayed).
4. `rows_and_columns` - whether to also print the number of rows and the number of columns in the dataset. The argument is `False` by default.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n')

Let's use the `explore_data` function to display the header row and the first five rows of data.

In [3]:
explore_data(HN, 0, 6)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




Notice that the first inner list contains the column headers. In the next step we will extract this first row and assign it to the variable `headers`.

In [4]:
headers = HN[:1]
print(headers)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


In the next step we will remove the header row from the `HN` list and use the `explore_data` function to display the first five rows, verifying that the header row was removed properly.
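
The difference between slicing and indexing when separating the header from the data can be seen in a small sketch. The dataset `sample` and the variable names below are hypothetical and exist only for this illustration.

```python
# Hypothetical miniature dataset: a header row followed by two data rows.
sample = [
    ['id', 'title', 'num_comments'],
    ['1', 'Ask HN: Example question?', '3'],
    ['2', 'Show HN: Example project', '7'],
]

# sample[:1] is a list that contains the header row (a list of lists) ...
header_slice = sample[:1]
# ... while sample[0] is the header row itself.
header_row = sample[0]
# Everything after the first row is data.
data_rows = sample[1:]

print(header_slice)
print(header_row)
print(len(data_rows))
```

Either form works for keeping the headers around; the slice simply adds one extra level of nesting.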

In [5]:
HN = HN[1:]
explore_data(HN, 0, 5)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




Since we are only concerned with post titles beginning with "Ask HN" or "Show HN", we will create new lists of lists containing just the data for those titles. To find these posts, we will use the string method `startswith`, which returns `True` if the string starts with the substring passed as an argument. Note that capitalization matters, so we will also use the string method `lower`, which returns a lowercase version of the string. Let's use these methods to separate posts beginning with "Ask HN" and "Show HN" (and case variations) into two different lists.
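
Before applying this to the dataset, here is a minimal sketch of how `lower` and `startswith` combine. The titles are invented for the example and are not taken from the dataset.

```python
# Hypothetical titles used only to illustrate the matching rule.
titles = [
    'Ask HN: How do you learn?',
    'ask hn: lowercase variant',
    'Show HN: My project',
    'Task HN is not a prefix match',
]

# Lowercase each title first so that the comparison ignores capitalization.
ask_flags = [title.lower().startswith('ask hn') for title in titles]
show_flags = [title.lower().startswith('show hn') for title in titles]

for title, is_ask, is_show in zip(titles, ask_flags, show_flags):
    print(title, is_ask, is_show)
```

The last title contains "ask hn" but does not start with it, so it matches neither category.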

In [6]:
ask_posts = []
show_posts = []
other_posts = []
for data in HN:
    title = data[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(data)
    elif title.lower().startswith('show hn'):
        show_posts.append(data)
    else:
        other_posts.append(data)
print("The number of posts in the list 'ask_posts' is {list}".format(list = len(ask_posts)), "\n")
print("The number of posts in the list 'show_posts' is {list}".format(list = len(show_posts)), "\n")
print("The number of posts in the list 'other_posts' is {list}".format(list = len(other_posts)), "\n")

The number of posts in the list 'ask_posts' is 1744 

The number of posts in the list 'show_posts' is 1162 

The number of posts in the list 'other_posts' is 17194 



Next, let's determine if ask posts or show posts receive more comments on average.

In [7]:
def comments_summary(comments_list):
    total_comments = 0
    for data in comments_list:
        num_comments = int(data[4])
        total_comments += num_comments
    return total_comments

total_ask_comments = comments_summary(ask_posts)
avg_ask_comments = total_ask_comments/len(ask_posts)
print("Average number of comments in the 'Ask HN' posts is {avg_ask_comments:.2f}".format(avg_ask_comments = avg_ask_comments), "\n")

total_show_comments = comments_summary(show_posts)
avg_show_comments = total_show_comments/len(show_posts)
print("Average number of comments in the 'Show HN' posts is {avg_show_comments:.2f}".format(avg_show_comments = avg_show_comments), "\n")

total_other_comments = comments_summary(other_posts)
avg_other_comments = total_other_comments/len(other_posts)
print("Average number of comments in the 'Other' posts is {avg_other_comments:.2f}".format(avg_other_comments = avg_other_comments), "\n")
    

Average number of comments in the 'Ask HN' posts is 14.04 

Average number of comments in the 'Show HN' posts is 10.32 

Average number of comments in the 'Other' posts is 26.87 



As the output shows, the assumption that "Ask HN" or "Show HN" posts receive more comments on average than other posts is wrong. The output also shows that, on average, ask posts receive more comments than show posts. Since "Ask HN" posts are more likely to receive comments than "Show HN" posts, we will focus the remaining analysis on these posts.

Next, we will determine if "Ask HN" posts created at a certain time are more likely to attract comments. We will use the following steps to perform this analysis:
1. Calculate the number of "Ask HN" posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments "Ask HN" posts receive by hour created.
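
The first step can be sketched on a few rows before running it on the full list. The pairs in `sample_posts` are hypothetical, written in the same date format as the `created_at` column, and the names `counts` and `comments` are chosen only for this example.

```python
from datetime import datetime

# Hypothetical (created_at, num_comments) pairs in the dataset's date format.
sample_posts = [
    ('8/4/2016 11:52', 52),
    ('1/26/2016 19:30', 10),
    ('6/23/2016 11:05', 4),
]

counts = {}
comments = {}
for created_at, num_comments in sample_posts:
    # Parse the string into a datetime and keep only the hour (0 to 23).
    hour = datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    # dict.get supplies 0 the first time an hour is seen.
    counts[hour] = counts.get(hour, 0) + 1
    comments[hour] = comments.get(hour, 0) + num_comments

print(counts)
print(comments)
```

Using `dict.get` with a default is one way to avoid the explicit `if`/`else` branch; both approaches produce the same dictionaries.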

In [8]:
results_list = []

for data in ask_posts:
    result_list = []
    created_at = data[6]
    comments_number = int(data[4])
    result_list.append(created_at)
    result_list.append(comments_number)
    results_list.append(result_list)

counts_by_hour = {}
comments_by_hour = {}

for data in results_list:
    date_time_dt = dt.strptime(data[0], '%m/%d/%Y %H:%M')
    hour_dt = int(date_time_dt.hour)
    if hour_dt not in counts_by_hour:
        counts_by_hour[hour_dt] = 1
        comments_by_hour[hour_dt] = data[1]
    else:
        counts_by_hour[hour_dt] += 1
        comments_by_hour[hour_dt] += data[1]

Let's make sure that everything works correctly. To do this, we will print the values stored in the dictionaries `counts_by_hour` and `comments_by_hour` under the key 6.

In [9]:
print("Number of 'Ask HN' posts created at 6 am is {number}".format(number = counts_by_hour[6]), "\n")
print("Number of comments in 'Ask HN' posts created at 6 am is {number}".format(number = comments_by_hour[6]), "\n")

Number of 'Ask HN' posts created at 6 am is 44 

Number of comments in 'Ask HN' posts created at 6 am is 397 



Next, we will use the dictionaries `counts_by_hour` and `comments_by_hour` to calculate the average number of comments for posts created during each hour of the day.
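
As a quick check of the arithmetic, the sketch below applies the same division to the hour-6 totals printed above (44 posts, 397 comments). The helper `average_by_hour` and the `*_sample` names are hypothetical and are not part of the notebook's code.

```python
def average_by_hour(comments, counts):
    """Return [hour, average] pairs, rounded to two decimals."""
    return [[hour, round(comments[hour] / counts[hour], 2)] for hour in comments]

# Totals for hour 6 as printed by the check above.
counts_sample = {6: 44}
comments_sample = {6: 397}

avg_sample = average_by_hour(comments_sample, counts_sample)
print(avg_sample)
```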

In [10]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour]/counts_by_hour[hour], 2)])
print(avg_by_hour)

[[9, 5.58], [13, 14.74], [10, 13.44], [14, 13.23], [16, 16.8], [23, 7.99], [12, 9.41], [17, 11.46], [15, 38.59], [21, 16.01], [20, 21.52], [2, 23.81], [18, 13.2], [3, 7.8], [5, 10.09], [19, 10.8], [1, 11.38], [22, 6.75], [8, 10.25], [4, 7.17], [0, 8.13], [6, 9.02], [7, 7.85], [11, 11.05]]


Let's display the results.

In [11]:
for data in avg_by_hour:
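    print("The average number of comments for posts created from {hour_start}:00 to {hour_end}:00 is {avg_number}".format(hour_start = data[0], hour_end = data[0] + 1, avg_number = data[1]))

The average number of comments for posts created from 9:00 to 10:00 is 5.58
The average number of comments for posts created from 13:00 to 14:00 is 14.74
The average number of comments for posts created from 10:00 to 11:00 is 13.44
The average number of comments for posts created from 14:00 to 15:00 is 13.23
The average number of comments for posts created from 16:00 to 17:00 is 16.8
The average number of comments for posts created from 23:00 to 24:00 is 7.99
The average number of comments for posts created from 12:00 to 13:00 is 9.41
The average number of comments for posts created from 17:00 to 18:00 is 11.46
The average number of comments for posts created from 15:00 to 16:00 is 38.59
The average number of comments for posts created from 21:00 to 22:00 is 16.01
The average number of comments for posts created from 20:00 to 21:00 is 21.52
The average number of comments for posts created from 2:00 to 3:00 is 23.81
The average number of comments for posts created from 18:00 to 19:00 is 13.2
The average number of comments for posts created from 3:00 to 4:00 is 7.8
The average number of comments for posts created from 5:00 to 6:00 is 10.09
The average number of comments for posts created from 19:00 to 20:00 is 10.8
The average number of comments for posts created from 1:00 to 2:00 is 11.38
The average number of comments for posts created from 22:00 to 23:00 is 6.75
The average number of comments for posts created from 8:00 to 9:00 is 10.25
The average number of comments for posts created from 4:00 to 5:00 is 7.17
The average number of comments for posts created from 0:00 to 1:00 is 8.13
The average number of comments for posts created from 6:00 to 7:00 is 9.02
The average number of comments for posts created from 7:00 to 8:00 is 7.85
The average number of comments for posts created from 11:00 to 12:00 is 11.05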

Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that is easier to read.

In [12]:
swap_avg_by_hour = []
for data in avg_by_hour:
    swap_avg_by_hour.append([data[1], data[0]])
print(swap_avg_by_hour)

[[5.58, 9], [14.74, 13], [13.44, 10], [13.23, 14], [16.8, 16], [7.99, 23], [9.41, 12], [11.46, 17], [38.59, 15], [16.01, 21], [21.52, 20], [23.81, 2], [13.2, 18], [7.8, 3], [10.09, 5], [10.8, 19], [11.38, 1], [6.75, 22], [10.25, 8], [7.17, 4], [8.13, 0], [9.02, 6], [7.85, 7], [11.05, 11]]


Let's use the built-in function `sorted()` to sort `swap_avg_by_hour` in descending order. Since the first element of each inner list is the average number of comments, the list will be sorted by that value.
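
Lists are compared element by element, which is why sorting the swapped list orders it by the average. As an alternative sketch (not the approach this notebook takes), the `key` argument of `sorted()` can select the average directly, so the columns do not need to be swapped. The list `pairs` holds a few `[hour, average]` values copied from `avg_by_hour`.

```python
# A few [hour, average] pairs copied from avg_by_hour.
pairs = [[9, 5.58], [15, 38.59], [2, 23.81], [20, 21.52]]

# key= picks the average (index 1) as the sort value, so no swap is needed.
by_average = sorted(pairs, key=lambda pair: pair[1], reverse=True)
print(by_average)
```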

In [13]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for 'Ask HN' posts Comments:")
for data in sorted_swap[:5]:
    print("{time}:00: {comments} average comments per post".format(time = data[1], comments = data[0]))

Top 5 Hours for 'Ask HN' posts Comments:
15:00: 38.59 average comments per post
2:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post


If we refer back to [the documentation](https://www.kaggle.com/hacker-news/hacker-news-posts) for the dataset, we see from the description that the time zone of the `created_at` column is Eastern Time in the US. Since we live in Moscow, where the time zone is EST +8, to answer the question "During which hours should we create a post to have a higher chance of receiving comments?" we have to loop through the list `sorted_swap` and add 8 hours to every hour.
To do this, we will use the `timedelta` class from the `datetime` module. There is one problem: the hours in the list `sorted_swap` are stored as `int` values, but a `timedelta` can only be added to a `datetime` object. To solve this, we will convert each `int` to a string and parse it with the method `strptime()`.
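
Because only the hour matters here, the same shift can also be sketched with plain modular arithmetic, as a cross-check of the `timedelta` approach. The names `top_hours_est`, `OFFSET_HOURS`, and `shifted` are made up for this example, and the 8-hour offset is the one assumed in the text.

```python
# Top five hours in Eastern Time, from the "Top 5 Hours" output above.
top_hours_est = [15, 2, 20, 16, 21]

OFFSET_HOURS = 8  # the EST +8 offset assumed in the text

# Adding the offset and wrapping with % 24 keeps each result in 0..23.
shifted = [(hour + OFFSET_HOURS) % 24 for hour in top_hours_est]
print(shifted)
```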

In [14]:
sorted_swap_EST8 = []

for data in sorted_swap:
    hours_dt = dt.strptime(str(data[1]), '%H')
    hours_to_add = timedelta(hours = 8)
    hours = int((hours_dt + hours_to_add).strftime("%H"))
    sorted_swap_EST8.append([data[0], hours])

for data in sorted_swap_EST8[:5]:
    print("{time}:00: {comments} average comments per post".format(time = data[1], comments = data[0]))

23:00: 38.59 average comments per post
10:00: 23.81 average comments per post
4:00: 21.52 average comments per post
0:00: 16.8 average comments per post
5:00: 16.01 average comments per post


# Conclusions

In this project, we analyzed data from the Hacker News Posts (HN) [data set](https://www.kaggle.com/hacker-news/hacker-news-posts) in order to answer two questions:
1. Do posts with titles that begin with either "Ask HN" or "Show HN" receive more comments on average and, accordingly, reach a higher position in the feed of posts?
2. Do posts created at a certain time receive more comments on average and, if so, at what time should users create posts to get more comments?

To answer the first question, we can look at the output of `In[7]`. It shows that "Ask HN" and "Show HN" posts receive fewer comments on average than other posts, and that, on average, "Ask HN" posts receive more comments than "Show HN" posts.

To answer the second question, we can look at the outputs of `In[13]` and `In[14]`. They suggest that, to get more comments, users should create "Ask HN" posts at `15:00, 2:00, 20:00, 16:00, 21:00` EST +0 or at `23:00, 10:00, 4:00, 0:00, 5:00` EST +8.