# 1. Introduction

* Learning a language is difficult because language has many complex rules. If we want computers to be able to understand language, `we either need to explicitly teach computers the rules, or enable the computers to intuit the rules themselves.` The former is a lot like `learning a second language`, and the latter is a lot like `learning your native language`.

* Broadly speaking, natural language processing is the study of enabling computers to understand human languages. This field may involve teaching computers **to automatically score essays, infer grammatical rules, or determine the emotions associated with text**.

* When we feed a computer written text, it has no idea what that text means. In order for a computer to begin making inferences from it, we'll need to `convert the text to a numerical representation`. This process will enable the computer to intuit grammatical rules, which is more akin to learning a first language.

# 2. Overview of the Data

* Our data set consists of submissions users made to Hacker News from 2006 to 2015.

* Developer Arnaud Drizard used the Hacker News API to scrape the data, which you can find in one of his GitHub repositories. We've sampled 3000 rows from the data randomly, and removed all of the extraneous columns. Our data only has four columns:
  * submission_time - When the article was submitted
  * upvotes - The number of upvotes the article received
  * url - The base URL of the article
  * headline - The article's headline

In this mission, we'll be predicting the number of upvotes the articles received, based on their headlines. Because upvotes are an indicator of popularity, we'll discover which types of articles tend to be the most popular.

In [1]:
import pandas as pd 
submission=pd.read_csv('sel_hn_stories.csv')

In [2]:
submission.head()

|   | 2014-06-24T05:50:40.000Z | 1 | flux7.com | 8 Ways to Use Docker in the Real World |
|---|---|---|---|---|
| 0 | 2010-02-17T16:57:59Z | 1 | blog.jonasbandi.net | Software: Sadly we did adopt from the construc... |
| 1 | 2014-02-04T02:36:30Z | 1 | blogs.wsj.com | Google’s Stock Split Means More Control for L... |
| 2 | 2011-10-26T07:11:29Z | 1 | threatpost.com | SSL DOS attack tool released exploiting negoti... |
| 3 | 2011-04-03T15:43:44Z | 67 | algorithm.com.au | Immutability and Blocks Lambdas and Closures |
| 4 | 2013-01-13T16:49:20Z | 1 | winmacsofts.com | Comment optimiser la vitesse de Wordpress? |


In [3]:
submission.columns=['submission_time','upvotes','url','headline']

In [4]:
submission.head()

|   | submission_time | upvotes | url | headline |
|---|---|---|---|---|
| 0 | 2010-02-17T16:57:59Z | 1 | blog.jonasbandi.net | Software: Sadly we did adopt from the construc... |
| 1 | 2014-02-04T02:36:30Z | 1 | blogs.wsj.com | Google’s Stock Split Means More Control for L... |
| 2 | 2011-10-26T07:11:29Z | 1 | threatpost.com | SSL DOS attack tool released exploiting negoti... |
| 3 | 2011-04-03T15:43:44Z | 67 | algorithm.com.au | Immutability and Blocks Lambdas and Closures |
| 4 | 2013-01-13T16:49:20Z | 1 | winmacsofts.com | Comment optimiser la vitesse de Wordpress? |


In [15]:
submission.shape

(2999, 4)

In [17]:
submission=submission.dropna()

In [18]:
submission.shape

(2800, 4)
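
The first `head()` output above shows a real submission in the header position. That suggests the CSV file has no header row and that `read_csv` used the first record as column names, which would also explain 2999 rows rather than 3000. A minimal sketch, assuming that reading of the output, passes `header=None` and `names=` so no row is lost. The two-row CSV text is a stand-in built from rows shown above, not the real file, and `submissions_demo` is an illustrative name.

```python
import io

import pandas as pd

# Stand-in for sel_hn_stories.csv: two records, no header row (assumed layout).
csv_text = (
    "2014-06-24T05:50:40.000Z,1,flux7.com,8 Ways to Use Docker in the Real World\n"
    "2011-04-03T15:43:44Z,67,algorithm.com.au,Immutability and Blocks Lambdas and Closures\n"
)

columns = ["submission_time", "upvotes", "url", "headline"]

# header=None tells pandas the first line is data; names= supplies the labels.
submissions_demo = pd.read_csv(io.StringIO(csv_text), header=None, names=columns)

print(submissions_demo.shape)
```

With the real file, the same two arguments could be passed alongside the file name in place of the later `submission.columns = [...]` assignment.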

# 3. Tokenizing the Headlines

* Our goal is to train a linear regression algorithm that predicts the number of upvotes a headline would receive. To do this, we'll need to convert each headline to a numerical representation.
* We'll use a bag of words model. [A bag of words model](https://en.wikipedia.org/wiki/Bag-of-words_model) represents each piece of text as a numerical vector.
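
To make the idea of a numerical vector concrete, here is a small sketch of a bag of words representation on two invented headlines. It uses only the standard library; the names `headlines`, `vocabulary`, and `vectors` are illustrative and not part of the mission code.

```python
from collections import Counter

# Toy headlines, invented for illustration.
headlines = ["the cat sat", "the dog sat on the cat"]

# Vocabulary: every distinct word, in a fixed (sorted) order.
vocabulary = sorted({word for h in headlines for word in h.split()})

# One vector per headline: how often each vocabulary word appears in it.
vectors = []
for h in headlines:
    word_counts = Counter(h.split())
    vectors.append([word_counts[word] for word in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Word order is discarded: each headline is described only by which vocabulary words it contains and how often.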

**`The first step in creating a bag of words model is tokenization. In tokenization, we break a sentence up into disconnected words.`**

Tokenization splits each sentence into a list of individual words, or tokens. Here, the split occurs on the space character (" ").

## TODO:
* Split each headline into individual words on the space character (" "), and append the resulting list to tokenized_headlines.

* When you're finished, tokenized_headlines should be a list of lists. Each list should contain the tokens for the headline located at the corresponding position in the submissions dataframe.

In [20]:
tokenized_headlines=[]
submission['headline'].str.split(" ")

0       [Software:, Sadly, we, did, adopt, from, the, ...
1       [, Google’s, Stock, Split, Means, More, Contro...
2       [SSL, DOS, attack, tool, released, exploiting,...
3       [Immutability, and, Blocks, Lambdas, and, Clos...
4       [Comment, optimiser, la, vitesse, de, Wordpress?]
5       [ilk, is, not, as, good, for, you, as, you, th...
6        [Worldometers, -, Real, time, world, statistics]
7       [icrosoft, strikes, back:, introduces, docs, f...
8                            [Net, HTTP, status, codes, ]
9       [Anecdata, or, how, McKinsey’s, story, became,...
11            [Immigration, Overhaul, Passes, in, Senate]
12           [What, matters, most, at, Ad:TECH, SF, 2014]
13      [Amazon, Silk, revisited:, Is, the, split, clo...
14      [Dieter, Ram's, Ten, Principles, Of, Good, Des...
15                                          [Gmail, Down]
16      [Show, How, Don't, Tell, What, -, A, Managemen...
17      [U.S., releases, images, said, to, implicate, ...
18      [Real-

In [38]:
tokenized_headlines = []
for item in submission["headline"]:
    tokenized_headlines.append(item.split())

In [41]:
tokenized_headlines[0]

['Software:',
 'Sadly',
 'we',
 'did',
 'adopt',
 'from',
 'the',
 'construction',
 'analogy']
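
The `str.split(" ")` output earlier in this section contains empty tokens (for example, the leading `''` in row 1), while the loop uses `split()` with no argument. A short sketch of the difference, on an invented headline with a leading, a doubled, and a trailing space:

```python
headline = " Net HTTP  status codes "

# Splitting on a single space keeps an empty string wherever a space is
# leading, trailing, or doubled.
on_space = headline.split(" ")

# Splitting with no argument treats runs of whitespace as one separator
# and drops empty strings.
on_whitespace = headline.split()

print(on_space)       # ['', 'Net', 'HTTP', '', 'status', 'codes', '']
print(on_whitespace)  # ['Net', 'HTTP', 'status', 'codes']
```

Using `split()` therefore avoids carrying empty tokens into the later steps.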

# 4. Preprocessing Tokens to Increase Accuracy

* We now have tokens, but we need to process them a bit to make our predictions more accurate. We know that "Berlin", "Berlin.", and "berlin" all refer to the same word, but the computer doesn't know that. We'll need to convert those variations so that they're consistent.

* We can do this by lowercasing (which will convert "Berlin" to "berlin"), and also by removing punctuation (so "Berlin." becomes "Berlin").

## TODO:

* Loop through each item in tokenized_headlines, which is a list of lists.
  * For each list of tokens:
    * Convert each individual token to lowercase
    * Remove all of the items in the punctuation list from each individual token
    * Append the clean list to clean_tokenized
* clean_tokenized should now be a list of lists. Each list should contain the preprocessed tokens associated with the headline in the corresponding position of the submissions dataframe.

In [42]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []

In [47]:
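for item in tokenized_headlines:
    tokens = []
    for token in item:
        # Lowercase so that "Berlin" and "berlin" become the same token
        token = token.lower()
        # Strip every character in the punctuation list from the token
        for punc in punctuation:
            token = token.replace(punc, '')
        tokens.append(token)
    clean_tokenized.append(tokens)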

In [48]:
clean_tokenized[:3]

[['software',
  'sadly',
  'we',
  'did',
  'adopt',
  'from',
  'the',
  'construction',
  'analogy'],
 ['googles',
  'stock',
  'split',
  'means',
  'more',
  'control',
  'for',
  'larry',
  'and',
  'sergey'],
 ['ssl',
  'dos',
  'attack',
  'tool',
  'released',
  'exploiting',
  'negotiation',
  'overhead']]
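
As an alternative to the nested `replace` loop, the same cleanup can be sketched with `str.maketrans` and `str.translate`, which delete all listed characters in one pass. This is a different technique from the one used above, and `clean_token` is a hypothetical helper name introduced for illustration.

```python
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]

# The third argument of maketrans lists characters that translate() deletes.
remove_punctuation = str.maketrans("", "", "".join(punctuation))


def clean_token(token):
    """Lowercase a token and delete the listed punctuation characters."""
    return token.lower().translate(remove_punctuation)


print([clean_token(t) for t in ["Berlin", "Berlin.", "berlin"]])
# ['berlin', 'berlin', 'berlin']
```

All three spellings collapse to the same token, which is the consistency the preprocessing step is after.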

# 5. Assembling a Matrix of Unique Words

Now that we have our tokens, we can begin converting the sentences to their numerical representations. `First, we'll retrieve all of the unique words from all of the headlines.` Then, we'll create a matrix, and `assign those words as the column headers`. We'll `initialize all of the values in the matrix to 0`.

We'll use a pandas dataframe instead of a NumPy matrix. We can create a dataframe with all zero values by passing the scalar 0 to the DataFrame constructor, together with an index and the column names.
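
One way to build such an all-zero dataframe is sketched below. The row count and column names are hypothetical placeholders; in the mission they would come from `clean_tokenized` and `unique_tokens`.

```python
import pandas as pd

# Hypothetical inputs: three headlines and three token columns.
n_rows = 3
columns = ["stock", "split", "ssl"]

# A scalar first argument is broadcast to every cell of the frame.
counts_demo = pd.DataFrame(0, index=range(n_rows), columns=columns)

print(counts_demo)
```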

## TODO:
* Find all of the unique tokens in clean_tokenized, and assign the result to unique_tokens.
  * Only add tokens that occur more than once (across all of the headlines). Tokens that only occur once don't add anything to the model's prediction power, and removing them will make our algorithm run much more quickly. 
  * To do this, you can keep a list of the tokens that occur once in the data, and a different list of the tokens that occur more than once. If a token is already in the first list when you encounter it and it's not in the second list, you should add it to the second list.
  * When you're finished, unique_tokens should contain any tokens that occur more than once across all of the headlines.
  * Each token in unique_tokens should only appear in the list a single time.
 
* Create a dataframe with as many rows as there are items in the clean_tokenized list. Each column name should be a token in unique_tokens. Initialize all of the cells to the value 0. Assign the dataframe to the variable counts.
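
The first TODO item can also be sketched with `collections.Counter` instead of the two-list bookkeeping described above. This is a swapped-in technique that should give the same set of tokens; the toy input and the `_demo` names are invented for illustration.

```python
from collections import Counter

# Toy stand-in for clean_tokenized (invented for illustration).
clean_tokenized_demo = [
    ["google", "stock", "split"],
    ["stock", "split", "news"],
    ["ssl", "attack"],
]

# Count every token across all headlines.
token_counts = Counter(
    token for tokens in clean_tokenized_demo for token in tokens
)

# Keep tokens seen more than once; each appears a single time in the result.
unique_tokens_demo = [token for token, n in token_counts.items() if n > 1]

print(unique_tokens_demo)  # ['stock', 'split']
```

Counting with a dictionary-backed `Counter` avoids the repeated list membership checks of the two-list approach, which may matter as the number of distinct tokens grows.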

In [49]:
unique_tokens=[]
single_tokens=[]

In [None]:
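for item in clean_tokenized:
    for token in item:
        if token in single_tokens and token not in unique_tokens:
            unique_tokens.append(token)
        elif token not in single_tokens:
            single_tokens.append(token)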