# *Hacker News* posts strategy

In this project, we'll work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com).

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com), where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

__Through our analysis of Hacker News data, we'll identify the posting strategy that attracts the most comments and therefore the most visibility.__

The data set is available on [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts). Note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments.

The two types of posts we'll explore begin with either Ask HN or Show HN.
Users submit Ask HN posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll specifically compare these two types of posts to determine the following:
* Between Ask HN and Show HN, which type of post receives more comments on average?
* Do posts created at a certain time receive more comments on average?

## Set up

In [35]:
import csv

# Read the whole file into a list of lists; the with block closes the file for us
with open('hacker_news.csv') as open_file:
    hn = list(csv.reader(open_file))
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

__Removing Headers from the List of Lists__

In [36]:
hn = hn[1:]
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Extracting Ask HN and Show HN Posts

In [37]:
ask_posts = []
show_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)

In [38]:
show_posts[:2]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46']]

In [39]:
len(ask_posts)

1744

In [40]:
len(show_posts)

1162

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [41]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [42]:
total_show_comments = 0

for row in show_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


We just determined that, on average, ask posts receive more comments than show posts. Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
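
The two loops above differ only in the list they iterate over, so the calculation can be factored into a small helper. This is a minimal sketch, assuming the column layout shown earlier (num_comments at index 4); the function name `average_comments` and the sample rows are made up for illustration and are not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored as strings in each row."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: Example question one?', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example question two?', '', '3', '4', 'user_b', '8/4/2016 12:10'],
]
print(average_comments(sample_posts))  # 7.0
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, it should reproduce the two averages printed above.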

## What is the optimum time to post for attracting comments?

We'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.
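
Before running this on the real data, the two steps can be sketched on a handful of rows. The sample values below are hypothetical, and `defaultdict` is used here only as a compact alternative to the explicit `if`/`else` bookkeeping in the cells that follow.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical [created_at, num_comments] pairs in the data set's date format
sample = [
    ['8/4/2016 11:52', 6],
    ['1/26/2016 11:05', 2],
    ['6/23/2016 22:20', 10],
]

counts = defaultdict(int)    # step 1: posts created in each hour
comments = defaultdict(int)  # step 1: comments received in each hour

for created_at, num_comments in sample:
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

# Step 2: average number of comments per post for each hour
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)  # {'11': 4.0, '22': 10.0}
```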

__Finding the Number of Ask Posts and Comments by Hour Created__

In [45]:
# Pair each ask post's creation time with its number of comments
from datetime import datetime

result_list = [] 

for row in ask_posts:
    created_at = row[6]
    comments = int(row[4])
    result_list.append([created_at, comments])

In [47]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    comments = row[1]
    time = datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")
    
    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comments
    else:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comments

In [48]:
print(comments_by_hour)

{'21': 1745, '12': 687, '18': 1439, '05': 464, '04': 337, '22': 479, '03': 421, '09': 251, '01': 683, '16': 1814, '02': 1381, '06': 397, '15': 4477, '08': 492, '13': 1253, '00': 447, '17': 1146, '20': 1722, '14': 1416, '07': 267, '19': 1188, '23': 543, '10': 793, '11': 641}


__Average number of comments for ask posts created during each hour of the day__

In [49]:
average_comments_by_hour = [[round(comments_by_hour[x]/counts_by_hour[x],3), x] for x in counts_by_hour]

print(average_comments_by_hour)

[[16.009, '21'], [9.411, '12'], [13.202, '18'], [10.087, '05'], [7.17, '04'], [6.746, '22'], [7.796, '03'], [5.578, '09'], [11.383, '01'], [16.796, '16'], [23.81, '02'], [9.023, '06'], [38.595, '15'], [10.25, '08'], [14.741, '13'], [8.127, '00'], [11.46, '17'], [21.525, '20'], [13.234, '14'], [7.853, '07'], [10.8, '19'], [7.985, '23'], [13.441, '10'], [11.052, '11']]


__Sorting and Printing Values from a List of Lists__

This format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the hourly averages, from highest to lowest, in a format that's easier to read.

In [50]:
# Each pair is [average, hour], so sorting orders the pairs by average
sorted_swap = sorted(average_comments_by_hour, reverse=True)
for avg, h in sorted_swap:
    time = datetime.strptime(h, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(time, avg))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post


## Analysis conclusion

Ask HN posts receive more comments on average than Show HN posts, and among Ask HN posts, those created in the 3pm hour (15:00) receive the most: 38.59 comments per post on average. That is about 62% more than the second-best hour (02:00, with 23.81 comments per post). To maximize the chance of attracting comments, the best strategy is therefore to submit an Ask HN post at around 3pm. Keep in mind that the data set excludes submissions that received no comments, so these averages describe only posts that attracted at least one comment.
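
The 62% figure can be checked directly from the two highest averages in the sorted output above. The variable names below are chosen for this illustration only.

```python
# Averages taken from the sorted output above
highest = 38.595        # 15:00
second_highest = 23.81  # 02:00

# Relative increase of the best hour over the second-best hour
increase = (highest - second_highest) / second_highest
print("{:.1%}".format(increase))  # 62.1%
```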

# Natural Language Processing

For this part, we work with the same kind of data: submissions users made to Hacker News from 2006 to 2015. We use a version that has been randomly sampled down to 3,000 rows and from which all of the extraneous columns have been removed.

Our data only has four columns:

- submission_time - When the article was submitted
- upvotes - The number of upvotes the article received
- url - The base URL of the article
- headline - The article's headline

#### Our goal is to discover which types of articles tend to be the most popular on Hacker News. 
To do so, we'll analyse the correlation between headlines and upvotes. Since upvotes are an indicator of popularity, we'll be predicting the number of upvotes the articles received, based on their headlines.

## Set up

In [1]:
import csv

# Read the raw rows first to inspect the file; note that it has no header row
with open('sel_hn_stories.csv') as open_file:
    stories = list(csv.reader(open_file))

stories[:5]

[['2014-06-24T05:50:40.000Z',
  '1',
  'flux7.com',
  '8 Ways to Use Docker in the Real World'],
 ['2010-02-17T16:57:59Z',
  '1',
  'blog.jonasbandi.net',
  'Software: Sadly we did adopt from the construction analogy'],
 ['2014-02-04T02:36:30Z',
  '1',
  'blogs.wsj.com',
  ' Google’s Stock Split Means More Control for Larry and Sergey '],
 ['2011-10-26T07:11:29Z',
  '1',
  'threatpost.com',
  'SSL DOS attack tool released exploiting negotiation overhead'],
 ['2011-04-03T15:43:44Z',
  '67',
  'algorithm.com.au',
  'Immutability and Blocks Lambdas and Closures']]

In [3]:
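import pandas as pd
# Note: read_csv treats the first line of the file as a header by default, so the
# first story is consumed here and replaced by the column names assigned below.
# Passing header=None would keep that row (and change the outputs that follow).
submissions = pd.read_csv("sel_hn_stories.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()
print(submissions[:5])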

        submission_time  upvotes                  url  \
0  2010-02-17T16:57:59Z        1  blog.jonasbandi.net   
1  2014-02-04T02:36:30Z        1        blogs.wsj.com   
2  2011-10-26T07:11:29Z        1       threatpost.com   
3  2011-04-03T15:43:44Z       67     algorithm.com.au   
4  2013-01-13T16:49:20Z        1      winmacsofts.com   

                                            headline  
0  Software: Sadly we did adopt from the construc...  
1   Google’s Stock Split Means More Control for L...  
2  SSL DOS attack tool released exploiting negoti...  
3       Immutability and Blocks Lambdas and Closures  
4         Comment optimiser la vitesse de Wordpress?  


## Tokenizing the Headlines

We'll train a linear regression algorithm that predicts the number of upvotes a headline would receive. To do this, we'll need to convert each headline to a numerical representation.

Tokenization breaks a sentence up into disconnected words: we simply split each headline into a list of individual words, or tokens.

In [4]:
tokenized_headlines = []
for item in submissions["headline"]:
    tokenized_headlines.append(item.split())
    
print(tokenized_headlines[:5])

[['Software:', 'Sadly', 'we', 'did', 'adopt', 'from', 'the', 'construction', 'analogy'], ['Google’s', 'Stock', 'Split', 'Means', 'More', 'Control', 'for', 'Larry', 'and', 'Sergey'], ['SSL', 'DOS', 'attack', 'tool', 'released', 'exploiting', 'negotiation', 'overhead'], ['Immutability', 'and', 'Blocks', 'Lambdas', 'and', 'Closures'], ['Comment', 'optimiser', 'la', 'vitesse', 'de', 'Wordpress?']]


## Preprocessing Tokens to Increase Accuracy

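We now have tokens, but we need to process them a bit to make our predictions more accurate. Differences in capitalization and punctuation mean that tokens such as "Software:" and "software" would be treated as different words even though they mean the same thing.

We'll convert those variations so that they're consistent by lowercasing each token and removing punctuation.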

In [5]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []
for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        for punc in punctuation:
            token = token.replace(punc, "")
        tokens.append(token)
    clean_tokenized.append(tokens)
    
print(clean_tokenized[:5])

[['software', 'sadly', 'we', 'did', 'adopt', 'from', 'the', 'construction', 'analogy'], ['googles', 'stock', 'split', 'means', 'more', 'control', 'for', 'larry', 'and', 'sergey'], ['ssl', 'dos', 'attack', 'tool', 'released', 'exploiting', 'negotiation', 'overhead'], ['immutability', 'and', 'blocks', 'lambdas', 'and', 'closures'], ['comment', 'optimiser', 'la', 'vitesse', 'de', 'wordpress']]
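
The same cleaning can also be expressed with `str.translate`, which removes every listed character in one pass instead of looping over the punctuation list. This is an alternative sketch rather than the notebook's method; `PUNCTUATION`, `REMOVE_TABLE` and `clean_headline` are names introduced here for illustration.

```python
# Same characters as the punctuation list in the cell above
PUNCTUATION = ",:;.'\"’?/-+&()"
# Translation table that maps each punctuation character to "deleted"
REMOVE_TABLE = str.maketrans("", "", PUNCTUATION)


def clean_headline(headline):
    """Split a headline on whitespace, then lowercase and strip punctuation."""
    return [token.lower().translate(REMOVE_TABLE) for token in headline.split()]


print(clean_headline("Comment optimiser la vitesse de Wordpress?"))
# ['comment', 'optimiser', 'la', 'vitesse', 'de', 'wordpress']
```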


## Assembling a Matrix of Unique Words

Now that we have our tokens, we can begin converting the sentences to their numerical representations. 

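First, we'll retrieve all of the words that occur more than once across the headlines. Tokens that occur only once don't add anything to the model's predictive power, and removing them will make our algorithm run much more quickly. Despite its name, the unique_tokens list in the code holds these repeated words, each stored once.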

Then, we'll create a matrix, and assign those words as the column headers. We'll initialize all of the values in the matrix to 0.

In [6]:
import numpy as np
unique_tokens = []  # tokens seen at least twice (these become the columns)
single_tokens = []  # tokens seen at least once
for tokens in clean_tokenized:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.append(token)
        elif token not in unique_tokens:
            # Second or later sighting: record the token once
            unique_tokens.append(token)

counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)
print(counts[:5])

   and  for  as  you  is  the  split  good  how  what    ...     frameworks  \
0    0    0   0    0   0    0      0     0    0     0    ...              0   
1    0    0   0    0   0    0      0     0    0     0    ...              0   
2    0    0   0    0   0    0      0     0    0     0    ...              0   
3    0    0   0    0   0    0      0     0    0     0    ...              0   
4    0    0   0    0   0    0      0     0    0     0    ...              0   

   animated  walks  auctions  clouds  hammer  autonomous  vehicle  \
0         0      0         0       0       0           0        0   
1         0      0         0       0       0           0        0   
2         0      0         0       0       0           0        0   
3         0      0         0       0       0           0        0   
4         0      0         0       0       0           0        0   

   crowdsourcing  disaster  
0              0         0  
1              0         0  
2              0       

## Counting Token Occurrences

Now that we have a matrix where all values are 0, we need to fill in the correct counts for each cell. This involves going through each set of tokens, and incrementing the column counters in the appropriate row.

When we're finished, we'll have a row vector for each headline that tells us how many times each token occurred in that headline.

In [7]:
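for i, item in enumerate(clean_tokenized):
    for token in item:
        if token in unique_tokens:
            # Use a single .loc call; chained indexing (counts.iloc[i][token])
            # may update a temporary copy instead of counts itself
            counts.loc[i, token] += 1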
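
To make the idea of a row vector concrete, here is a small pure-Python sketch of the same counting step. The toy `headlines` and `vocabulary` lists are hypothetical stand-ins for clean_tokenized and unique_tokens, and `collections.Counter` replaces the cell-by-cell increments.

```python
from collections import Counter

# Toy data standing in for clean_tokenized and unique_tokens
headlines = [
    ['how', 'to', 'learn', 'python'],
    ['python', 'and', 'python', 'tools'],
]
vocabulary = ['python', 'how', 'tools']

rows = []
for tokens in headlines:
    token_counts = Counter(tokens)
    # A Counter returns 0 for tokens that never appear in this headline
    rows.append([token_counts[word] for word in vocabulary])

print(rows)  # [[1, 1, 0], [2, 0, 1]]
```

Each inner list is one headline's row vector, with one entry per vocabulary word.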

## Removing Columns to Increase Accuracy

We have a very large number of columns in our matrix. This can make it very hard for a linear regression model to make good predictions. Too many columns will cause the model to fit to noise instead of the signal in the data.

There are two kinds of features that will reduce prediction accuracy.

- Features that occur only a few times will cause overfitting, because the model doesn't have enough information to accurately decide whether they're important.

- Features that occur too many times can also cause issues. These are words like "and" and "to", which occur in nearly every headline. These words don't add any information, because they don't necessarily correlate with upvotes.

To reduce the number of features and enable the linear regression model to make better predictions, we'll remove any words that occur fewer than 5 times or more than 100 times.

In [8]:
word_counts = counts.sum(axis=0)

counts = counts.loc[:,(word_counts >= 5) & (word_counts <= 100)]
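
The threshold rule is easier to see on a plain dictionary. The totals below are invented for illustration; the bounds are inclusive, matching the `>=` and `<=` comparisons in the cell above.

```python
# Hypothetical total counts per word across all headlines
word_totals = {'the': 950, 'python': 42, 'startup': 17, 'zeppelin': 1}

MIN_COUNT, MAX_COUNT = 5, 100

# Keep only words that are neither too rare nor too common
kept = [word for word, total in word_totals.items()
        if MIN_COUNT <= total <= MAX_COUNT]
print(kept)  # ['python', 'startup']
```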

## Splitting the Data Into Train and Test Sets

Now we'll need to split the data into two sets so that we can evaluate our algorithm effectively. We'll train our algorithm on a training set, then test its performance on a test set.

The train_test_split() function from scikit-learn will help us accomplish this.

We'll pass in .2 for the test_size parameter to randomly select 20% of the rows for our test set, and 80% for our training set.

X_train and X_test contain the predictors, and y_train and y_test contain the value we're trying to predict (upvotes).

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)
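
To illustrate what an 80/20 split does, here is a hand-rolled sketch in plain Python. It is not scikit-learn's implementation and will not select the same rows as train_test_split(); the function name `simple_split` and its parameters are assumptions made for this example.

```python
import random


def simple_split(rows, test_size=0.2, seed=1):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    # A seeded generator makes the shuffle reproducible between runs
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


train, test = simple_split(list(range(10)))
print(len(train), len(test))  # 8 2
```

Fixing the seed plays the same role as random_state=1 in the cell above: the same rows end up in each set every time the code runs.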

## Making Predictions With fit()

Now that we have a training set and a test set, let's train a model and make test predictions. We'll use a linear regression algorithm from scikit-learn.

First we'll initialize the model using the LinearRegression class. Then, we'll use the fit() method on the model to train with X_train and y_train. Finally, we'll make predictions with X_test.

When we fit a linear regression model, it assigns a coefficient to each column. Essentially, the model is determining which words correlate with more upvotes, and which with fewer. By finding these correlations, the model will be able to predict which headlines will be highly upvoted in the future.

In [11]:
from sklearn.linear_model import LinearRegression

clf = LinearRegression()
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print(predictions)

[ 2.17690686e+01  6.35049729e+01 -1.67007237e+01  1.67866575e+01
 -1.97586441e+00  3.44558067e+01 -4.49860607e+01  1.41788903e+01
  1.53594595e+01  4.82887218e+00  2.25350723e+00  4.98527927e+01
  1.10696859e+01  3.78096656e+01  1.10326030e+01 -1.90095575e-01
  1.10326030e+01  3.72920816e+00 -1.40047322e+01  3.48050765e+01
  6.43508350e+01  1.10326030e+01  2.44084956e+01  1.10326030e+01
  2.02609640e+01  2.36476055e+00  1.10326030e+01  2.26720526e+00
  2.22436673e+01  2.66568210e+00 -3.47492521e+00 -4.72847975e+01
  3.67933060e+00  1.09959656e+02  9.91904416e+00  4.43886626e+01
  9.00963982e+00 -2.17246247e+01  2.92874561e+01 -7.08448438e+00
  5.38368177e+01 -2.67775578e+00  3.52360958e+01  2.15580590e+01
  1.10326030e+01  2.07073523e+01 -1.06418175e+01  1.10326030e+01
  1.72869227e+01 -1.39319454e+01 -1.55296118e+01  1.23604698e+01
 -5.10036138e+00  1.10326030e+01  1.10326030e+01 -1.06795758e+01
  2.56007188e+01  2.70652122e+01  9.53182447e+00  2.12059356e+01
 -6.39424491e-02  2.30081

## Calculating Prediction Error

Now that we have predictions, we can calculate our prediction error. We'll use mean squared error (MSE).

With MSE, we subtract the predictions from the actual values, square the results, and find the mean. Because the errors are squared, MSE penalizes errors further away from the actual value more than those close to the actual value. 

In [12]:
mse = sum((predictions - y_test) ** 2) / len(predictions)
print(mse)


2651.145705668968


If we take the square root of our MSE to express the error in upvotes, we get about 51.5. This root mean squared error (RMSE) means that our predictions are typically off by roughly 51.5 upvotes. The error is high, most likely because we trained on a very small data set; a larger training set should reduce it.

In [16]:
mse = sum((predictions - y_test) ** 2) / len(predictions)
rmse = (mse)**0.5
print(rmse)

51.48927757959872
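
A small worked example shows how squaring penalizes large errors and how the RMSE returns the result to the original units. The three values are made up for illustration.

```python
actual = [10, 0, 3]
predicted = [7, 4, 3]

# An error of 3 contributes 9, while an error of 4 contributes 16
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
print(squared_errors)  # [9, 16, 0]

mse = sum(squared_errors) / len(squared_errors)  # 25 / 3, roughly 8.33
rmse = mse ** 0.5                                # roughly 2.89, in the same units as the data
print(mse, rmse)
```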
