# Finding the best ways to get to the top of the news feed - analyzing Hacker News Posts

![Image](https://s3.amazonaws.com/dq-content/354/hacker_news.jpg)

This project analyzes the Hacker News Posts (HN) dataset, a collection of submissions to the technology site [Hacker News](http://news.ycombinator.com/). Hacker News is a site where user-submitted stories (known as "posts") receive votes and comments. Depending on the number of comments and of positive or negative votes, a post moves up or down in the feed, similar to Reddit.

We are specifically interested in posts with titles that begin with either "Ask HN" or "Show HN". Users submit "Ask HN" posts to ask the Hacker News community a specific question. Likewise, users submit "Show HN" posts to show the Hacker News community a project, product, or just something interesting. We will compare these two types of posts to determine the following:
- Do "Ask HN" or "Show HN" posts receive more comments and points on average?
- Do posts created at a certain time receive more comments and points on average?

You can find the Hacker News Posts dataset [here](http://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
- `id`: the unique identifier from Hacker News for the post;
- `title`: the title of the post;
- `url`: the URL that the post links to, if the post has a URL;
- `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes;
- `num_comments`: the number of comments on the post;
- `author`: the username of the person who submitted the post;
- `created_at`: the date and time of the post's submission.

#### Opening and exploring the data
To start working with the information stored in the dataset, we first have to extract it from a `CSV` file and assign it to the `HN` variable. To do this, we will import the `reader` class from the `csv` module and use the `extract_data` function, which takes one argument, `directory` (the path to the file). The function returns the contents of the dataset as a list of lists.
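To illustrate what `reader` produces, here is a minimal sketch that parses a tiny in-memory CSV instead of the real file. The two rows and the shortened column list are made up for illustration and are not part of the dataset.

```python
import io
from csv import reader

# Made-up sample in a shortened version of the dataset's column layout (an assumption).
sample_csv = io.StringIO(
    "id,title,num_points\n"
    '1,"Ask HN: How do you, personally, learn?",12\n'
)

rows = list(reader(sample_csv))
header, first_row = rows[0], rows[1]

print(header)     # the column names
print(first_row)  # every field is a string, including the number of points
```

Because every field comes back as a string, numeric columns such as `num_points` have to be converted with `int()` before any arithmetic.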

In [1]:
# Import the required classes from the 'datetime' and 'csv' modules
from datetime import datetime as dt
from datetime import timedelta
from csv import reader

# Extract the data from a CSV file into a list of lists
def extract_data(directory):
    # The 'with' block closes the file once it has been read
    with open(directory, encoding="utf8") as opened_dataset:
        return list(reader(opened_dataset))

# Raw string, so the backslashes in the Windows path are not treated as escapes
HN = extract_data(r'..\..\Datasets\P2_Exploring_Hacker_News_Posts\hacker_news.csv')


To take a first look at the data from the Hacker News dataset, we will write the `explore_data` function, which takes four arguments:
1. `dataset` - the dataset to explore (a list of lists).
2. `start` - the index of the first row to display.
3. `end` - the index at which to stop displaying rows (the row at this index is not included).
4. `rows_and_columns` - whether to also print the total number of rows and columns in the dataset. The argument is `False` by default.

In [2]:
# Exploration of the data
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n')

Let's use the `explore_data` function to display the header row and the first five rows of data.

In [3]:
explore_data(HN, 0, 6)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




#### Removing the column 'headers'

Notice that the first inner list contains the column headers. In the next step we will extract this row and assign it to the variable `headers`. Because we use the slice `HN[:1]`, `headers` is a one-element list of lists.

In [4]:
# Assign the 'headers' row to the variable
headers = HN[:1]
print(headers)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
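The double brackets in the output come from slicing rather than indexing. The following sketch, on a made-up three-row table, shows the difference between the two.

```python
# Made-up table: one header row followed by two data rows (illustrative only).
table = [["id", "title"], ["1", "First post"], ["2", "Second post"]]

header_row = table[0]     # indexing returns the row itself
header_slice = table[:1]  # slicing returns a list that contains the row
body = table[1:]          # everything after the header

print(header_row)
print(header_slice)
print(len(body))
```

Indexing with `[0]` is the more convenient form when the header is later used to look up column positions.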


Next, we will remove the header row from the `HN` list and use the `explore_data` function to display the first five rows to verify that the header row was removed properly.

In [5]:
# Remove the row 'headers'
HN = HN[1:]
explore_data(HN, 0, 5)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




#### Sorting and analyzing the data

Since we are only concerned with post titles beginning with "Ask HN" or "Show HN", we will create new lists of lists containing just the data for those titles. To find the posts that begin with either "Ask HN" or "Show HN", we will use the string method `startswith`. The method returns `True` if the string starts with the substring given as an argument. Notice that capitalization matters, so we will also use the string method `lower`, which returns a lowercase version of the string. Let's use these methods to separate posts beginning with "Ask HN" and "Show HN" (and case variations) into two different lists.
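As a quick illustration of how `lower` and `startswith` combine, here is a small sketch on made-up titles. The helper name `classify` is hypothetical and not part of the notebook.

```python
# Made-up titles covering different capitalizations (illustrative only).
titles = [
    "Ask HN: What are you reading?",
    "ask hn: lowercase variant",
    "Show HN: My weekend project",
    "A regular news story",
]

def classify(title):
    """Return 'ask', 'show' or 'other' based on a case-insensitive prefix check."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

for title in titles:
    print(classify(title), "-", title)
```

Without the call to `lower`, the second title would fall into the "other" group, because `startswith` compares characters exactly.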

In [6]:
# Create three empty lists for storing data from the posts 'Ask HN', 'Show HN' and 'Other'
ask_posts = []
show_posts = []
other_posts = []

# Sort data from the dataset 'HN'
for data in HN:
    title = data[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(data)
    elif title.lower().startswith('show hn'):
        show_posts.append(data)
    else:
        other_posts.append(data)

# Print a summary of results of the sorting    
print("The number of posts in the list 'ask_posts' is {list}".format(list = len(ask_posts)), "\n")
print("The number of posts in the list 'show_posts' is {list}".format(list = len(show_posts)), "\n")
print("The number of posts in the list 'other_posts' is {list}".format(list = len(other_posts)), "\n")

The number of posts in the list 'ask_posts' is 1744 

The number of posts in the list 'show_posts' is 1162 

The number of posts in the list 'other_posts' is 17194 



Next, let's determine if 'Ask HN' posts or 'Show HN' posts receive more comments and points on average.

In [7]:
# Sum the points and comments of every post in a list
def points_comments_summary(data_list):
    total_points = 0
    total_comments = 0
    for data in data_list:
        num_points = int(data[3])
        num_comments = int(data[4])
        total_points += num_points
        total_comments += num_comments
    return total_points, total_comments

# Calculate the average number of points and comments per post in the list 'ask_posts'
total_ask_points, total_ask_comments = points_comments_summary(ask_posts)
avg_ask_points = total_ask_points/len(ask_posts)
avg_ask_comments = total_ask_comments/len(ask_posts)

# Print the result
print("Average number of points in the 'Ask HN' posts is {avg_ask_points:.2f}".format(avg_ask_points = avg_ask_points))
print("Average number of comments in the 'Ask HN' posts is {avg_ask_comments:.2f}".format(avg_ask_comments = avg_ask_comments), "\n")

# Calculate the average number of points and comments per post in the list 'show_posts'
total_show_points, total_show_comments = points_comments_summary(show_posts)
avg_show_points = total_show_points/len(show_posts)
avg_show_comments = total_show_comments/len(show_posts)

# Print the result
print("Average number of points in the 'Show HN' posts is {avg_show_points:.2f}".format(avg_show_points = avg_show_points))
print("Average number of comments in the 'Show HN' posts is {avg_show_comments:.2f}".format(avg_show_comments = avg_show_comments), "\n")

# Calculate the average number of points and comments per post in the list 'other_posts'
total_other_points, total_other_comments = points_comments_summary(other_posts)
avg_other_points = total_other_points/len(other_posts)
avg_other_comments = total_other_comments/len(other_posts)

# Print the result
print("Average number of points in the 'Other' posts is {avg_other_points:.2f}".format(avg_other_points = avg_other_points))
print("Average number of comments in the 'Other' posts is {avg_other_comments:.2f}".format(avg_other_comments = avg_other_comments), "\n")
    

Average number of points in the 'Ask HN' posts is 15.06
Average number of comments in the 'Ask HN' posts is 14.04 

Average number of points in the 'Show HN' posts is 27.56
Average number of comments in the 'Show HN' posts is 10.32 

Average number of points in the 'Other' posts is 55.41
Average number of comments in the 'Other' posts is 26.87 
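The three near-identical calculations above could be folded into a single helper. The sketch below is one possible shape for it, not the notebook's own code: the function name `average_points_comments` and the sample rows are assumptions, while the column positions 3 and 4 follow the dataset layout described earlier.

```python
def average_points_comments(rows, points_idx=3, comments_idx=4):
    """Return (average points, average comments) for a list of post rows."""
    if not rows:
        # Avoid dividing by zero for an empty group
        return 0.0, 0.0
    total_points = sum(int(row[points_idx]) for row in rows)
    total_comments = sum(int(row[comments_idx]) for row in rows)
    return total_points / len(rows), total_comments / len(rows)

# Made-up rows in the dataset's column order (illustrative only).
sample_rows = [
    ["1", "Ask HN: a", "", "10", "4", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: b", "", "20", "6", "user_b", "8/4/2016 12:10"],
]

avg_points, avg_comments = average_points_comments(sample_rows)
print("{:.2f} points, {:.2f} comments".format(avg_points, avg_comments))
```

With such a helper, each group needs only one call, for example `average_points_comments(ask_posts)`.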



As the output shows, the assumption that "Ask HN" or "Show HN" posts receive more comments and points on average than other posts is wrong.

The output also shows that, on average, "Ask HN" posts receive more comments than "Show HN" posts but fewer points. Since the position of a post in the news feed depends on the number of comments and of positive or negative votes, we need to decide which type of post to focus on.

Since "Ask HN" posts receive more comments on average, we will focus the remaining analysis on these posts.

Next, we will determine if "Ask HN" posts created at a certain time are more likely to attract comments and points. We will use the following steps to perform this analysis:
1. Calculate the number of "Ask HN" posts created in each hour of the day, along with the number of comments and points received.
2. Calculate the average number of comments and points "Ask HN" posts receive by hour created.
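The two steps above can be sketched on a handful of made-up posts. This version uses `collections.defaultdict` to keep the three running totals together, which is a design choice of the sketch rather than the approach taken in the notebook.

```python
from collections import defaultdict
from datetime import datetime

# Made-up (created_at, points, comments) tuples in the dataset's date format.
sample_posts = [
    ("8/4/2016 11:52", 10, 4),
    ("9/30/2015 11:05", 20, 6),
    ("6/17/2016 0:01", 3, 1),
]

# Step 1: hour -> [post count, total points, total comments]
totals_by_hour = defaultdict(lambda: [0, 0, 0])
for created_at, points, comments in sample_posts:
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    totals_by_hour[hour][0] += 1
    totals_by_hour[hour][1] += points
    totals_by_hour[hour][2] += comments

# Step 2: hour -> (average points, average comments)
averages_by_hour = {
    hour: (points / count, comments / count)
    for hour, (count, points, comments) in totals_by_hour.items()
}
print(averages_by_hour)
```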

In [8]:
# Create an empty list to store the number of comments and points received at each hour
results_list = []

# Store the data
for data in ask_posts:
    result_list = []
    created_at = data[6]
    points_number = int(data[3])
    comments_number = int(data[4])
    result_list.append(created_at)
    result_list.append(points_number)
    result_list.append(comments_number)
    results_list.append(result_list)

# Create empty dictionaries to store frequency tables of hours, comments and points
counts_by_hour = {}
points_by_hour = {}
comments_by_hour = {}

# Accumulate the number of posts, points and comments by hour created
for data in results_list:
    date_time_dt = dt.strptime(data[0], '%m/%d/%Y %H:%M')
    hour_dt = int(date_time_dt.hour)
    if hour_dt not in counts_by_hour:
        counts_by_hour[hour_dt] = 1
        points_by_hour[hour_dt] = data[1] 
        comments_by_hour[hour_dt] = data[2] 
    else:
        counts_by_hour[hour_dt] += 1
        points_by_hour[hour_dt] += data[1] 
        comments_by_hour[hour_dt] += data[2]

Let's make sure that everything works correctly. To do this, we will print the values stored under the key 6 in the dictionaries `counts_by_hour`, `points_by_hour` and `comments_by_hour`.

In [9]:
print("Number of 'Ask HN' posts created at 6 am is {number}".format(number = counts_by_hour[6]))
print("Number of points in 'Ask HN' posts created at 6 am is {number}".format(number = points_by_hour[6]))
print("Number of comments in 'Ask HN' posts created at 6 am is {number}".format(number = comments_by_hour[6]), "\n")

Number of 'Ask HN' posts created at 6 am is 44
Number of points in 'Ask HN' posts created at 6 am is 591
Number of comments in 'Ask HN' posts created at 6 am is 397 



Next, we will use the dictionaries `counts_by_hour`, `points_by_hour` and `comments_by_hour` to calculate the average number of comments and points for posts created during each hour of the day.

In [10]:
# Create an empty list to store average number of comments and points created during each hour of the day
avg_by_hour = []

# Calculate the average number of comments and points for posts created during each hour of the day
for hour in comments_by_hour:
    avg_by_hour.append([hour, round(points_by_hour[hour]/counts_by_hour[hour], 2), round(comments_by_hour[hour]/counts_by_hour[hour], 2)])

# Print preliminary result
print(avg_by_hour)

[[9, 7.31, 5.58], [13, 24.26, 14.74], [10, 18.68, 13.44], [14, 11.98, 13.23], [16, 23.35, 16.8], [23, 8.54, 7.99], [12, 10.71, 9.41], [17, 19.41, 11.46], [15, 29.99, 38.59], [21, 15.79, 16.01], [20, 14.39, 21.52], [2, 13.67, 23.81], [18, 15.97, 13.2], [3, 6.93, 7.8], [5, 12.0, 10.09], [19, 13.75, 10.8], [1, 11.67, 11.38], [22, 7.2, 6.75], [8, 10.73, 10.25], [4, 8.28, 7.17], [0, 8.2, 8.13], [6, 13.43, 9.02], [7, 10.62, 7.85], [11, 14.22, 11.05]]


Let's display the results.

In [11]:
# Display the results.
for data in avg_by_hour:
    print("The average number of points for posts created from {hour_start} to {hour_end} is {avg_number}".format(hour_start = data[0], hour_end = data[0] + 1, avg_number = data[1]))
    print("The average number of comments for posts created from {hour_start} to {hour_end} is {avg_number}".format(hour_start = data[0], hour_end = data[0] + 1, avg_number = data[2]), '\n')

The average number of points for posts created from 9 to 10 is 7.31
The average number of comments for posts created from 9 to 10 is 5.58 

The average number of points for posts created from 13 to 14 is 24.26
The average number of comments for posts created from 13 to 14 is 14.74 

The average number of points for posts created from 10 to 11 is 18.68
The average number of comments for posts created from 10 to 11 is 13.44 

The average number of points for posts created from 14 to 15 is 11.98
The average number of comments for posts created from 14 to 15 is 13.23 

The average number of points for posts created from 16 to 17 is 23.35
The average number of comments for posts created from 16 to 17 is 16.8 

The average number of points for posts created from 23 to 24 is 8.54
The average number of comments for posts created from 23 to 24 is 7.99 

The average number of p

Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that is easier to read.
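One way to sort rows of the form `[hour, average points, average comments]` by a chosen column is to pass a `key` function to `sorted()`. The sketch below uses four rows copied from the `avg_by_hour` output above; the variable names are assumptions made for the example.

```python
from operator import itemgetter

# Four [hour, average points, average comments] rows taken from the output above.
avg_rows = [
    [9, 7.31, 5.58],
    [15, 29.99, 38.59],
    [2, 13.67, 23.81],
    [20, 14.39, 21.52],
]

# Sort by the third column (average comments), highest first.
top_by_comments = sorted(avg_rows, key=itemgetter(2), reverse=True)

for hour, _avg_points, avg_comments in top_by_comments:
    print("{}:00: {} average comments per post".format(hour, avg_comments))
```

Sorting with a key leaves the column order untouched, so the same list can also be ranked by points with `itemgetter(1)`.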

In [12]:
# Put the average number of comments first so that the list can be sorted by it
swap_avg_by_hour = []
for data in avg_by_hour:
    swap_avg_by_hour.append([data[2], data[0]])
print(swap_avg_by_hour)

[[5.58, 9], [14.74, 13], [13.44, 10], [13.23, 14], [16.8, 16], [7.99, 23], [9.41, 12], [11.46, 17], [38.59, 15], [16.01, 21], [21.52, 20], [23.81, 2], [13.2, 18], [7.8, 3], [10.09, 5], [10.8, 19], [11.38, 1], [6.75, 22], [10.25, 8], [7.17, 4], [8.13, 0], [9.02, 6], [7.85, 7], [11.05, 11]]


Let's use the built-in function `sorted()` to sort `swap_avg_by_hour` in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.

In [13]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for 'Ask HN' posts Comments:")
for data in sorted_swap[:5]:
    print("{time}:00: {comments} average comments per post".format(time = data[1], comments = data[0]))

Top 5 Hours for 'Ask HN' posts Comments:
15:00: 38.59 average comments per post
2:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post


If we refer back to [the documentation](https://www.kaggle.com/hacker-news/hacker-news-posts) for the dataset, we see from the description that the time zone of the `created_at` column is Eastern Time in the US. Since we live in Moscow, where the time zone is EST +8, answering the question "During which hours should we create a post to have a higher chance of receiving comments?" requires looping through the list `sorted_swap` and adding 8 hours to every hour.
To do this, we will use the `timedelta` class from the `datetime` module. There is one problem: the hours in the list `sorted_swap` are stored as `int`, but a `timedelta` can only be added to a `datetime` object. To solve the problem, we will convert each `int` to a string and parse it with the method `strptime()`.

In [14]:
sorted_swap_EST8 = []

# The offset is the same for every row, so it is created once
hours_to_add = timedelta(hours = 8)

for data in sorted_swap:
    hours_dt = dt.strptime(str(data[1]), '%H')
    hours = int((hours_dt + hours_to_add).strftime("%H"))
    sorted_swap_EST8.append([data[0], hours])

for data in sorted_swap_EST8[:5]:
    print("{time}:00: {comments} average comments per post".format(time = data[1], comments = data[0]))

23:00: 38.59 average comments per post
10:00: 23.81 average comments per post
4:00: 21.52 average comments per post
0:00: 16.8 average comments per post
5:00: 16.01 average comments per post
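Since only the hour of the day is shifted, the same conversion can also be sketched with modular arithmetic instead of `datetime` objects. The helper name `shift_hour` is hypothetical, and a fixed offset ignores daylight saving time, so the result should be treated as an approximation.

```python
def shift_hour(hour, offset_hours):
    """Shift an hour of the day (0-23) by a whole-hour offset, wrapping at midnight."""
    return (hour + offset_hours) % 24

# Hours in Eastern Time, shifted by the +8 offset used above
for eastern_hour in [15, 20, 16]:
    print("{}:00 -> {}:00".format(eastern_hour, shift_hour(eastern_hour, 8)))
```

The modulo operation handles the wrap past midnight, which the `strptime` and `strftime` round trip achieves by rolling over to the next day.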


# Conclusions

In this project, we analyzed data from the Hacker News Posts (HN) [data set](https://www.kaggle.com/hacker-news/hacker-news-posts) in order to answer two questions:
1. Do posts with titles that begin with either "Ask HN" or "Show HN" receive more comments on average and, accordingly, rank higher in the feed of posts?
2. Do posts created at a certain time receive more comments on average and, if so, at what time should users create posts to get more comments?

To answer the first question, we can look at the output of `In[7]`. It shows that "Ask HN" and "Show HN" posts receive fewer comments on average than other posts and that, on average, "Ask HN" posts receive more comments than "Show HN" posts.

To answer the second question, we can look at the outputs of `In[13]` and `In[14]`. They show that, to get more comments, users should create "Ask HN" posts at `15:00, 2:00, 20:00, 16:00, 21:00` EST +0 or at `23:00, 10:00, 4:00, 0:00, 5:00` EST +8.