
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Natural Language Processing (NLP)


---


![](https://snag.gy/uvESGH.jpg)

## Learning Objectives


### Core
- Extract features from unstructured text using Scikit Learn
    - Count vectorizer
    - TFIDF vectorizer
    - Hashing vectorizer
- Describe downsides of bag-of-words approaches
- Remove stop words


### Target
- Identify parts of speech using NLTK
    - Stemming 
    - Segmentation
    - Parts of speech tagging


### Stretch

- Describe how TFIDF works and calculate scores by hand
- Describe the advantages and disadvantages of using the hashing vectorizer


### Lesson Guide
- [Introduction to text feature extraction](#intro)
- [An NLP project: rapstats.io](#rapstats)
- [Common NLP problems](#common)
- [Common NLP models](#models)
- [A simple example](#simple)
- [Bag-of-words / word counting](#bow)
    - [Sklearn `CountVectorizer`](#countvectorizer)
- [Hashing](#hash)
    - [Sklearn `HashingVectorizer`](#hashingvectorizer)
- [Term frequency - inverse document frequency](#tfidf)
    - [Sklearn `TfidfVectorizer`](#tfidf-vec)
- [Downsides to bag-of-words](#downsides-bow)
- [Segmentation](#segmentation)
    - [NLTK sentencer](#nltk-sentencer)
- [Stemming with NLTK](#stem-nltk)
    - [Stemming approaches](#group)
- [Stop words](#stopwords)
- [Part of speech tagging](#pos)
- [Unicode: a common pitfall](#unicode)
- [Conclusion](#conclusion)
- [Additional resources](#resources)

<a name="intro"></a>
## Introduction to text feature extraction

---

The models we have been using so far accept a 2D matrix of real numbers as input `X` and a target vector of classes or numbers `y`. What if our starting point data is not given in the form of a table of numbers, but rather is unstructured? This is the case when working with text documents.

We need a way to go from unstructured data to our numeric `X` matrix in order to use the same models. This is called _feature extraction_ and this lesson is dedicated to it.

The applications of using text data in statistical modeling are practically infinite. Some examples include:
- Sentiment analysis of Yelp reviews
- Identifying topics of news articles
- Classification of political authors

<a id='common'></a>
## Common NLP problems

---

The table below details some of the most common problems and tasks in the vast field of natural language processing (NLP).

| Task | Description |
|-|-|
| **Sentiment Analysis** | Is what is written positive or negative? |
| **Named Entity Recognition** | Classify names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. |
| **Summarization** | Boiling down large bodies of text to paraphrased versions |
| **Topic Modeling** | What topics does a body of text belong to? (e.g., auto-tagging of news articles) |
| **Question answering** | Given a human-language question, determine its answer. |
| **Word disambiguation** | Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet. |
| **Machine dialog systems** | Building response systems that react contextually to human input (e.g., Me: "Siri, cook me some bacon." Siri: "How do you like your bacon?") |


See Also:

- [News Headline Analysis](http://nbviewer.jupyter.org/github/AYLIEN/headline_analysis/blob/06f1223012d285412a650c201a19a1c95859dca1/main-chunks.ipynb?utm_content=buffer5d40c&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)
- [Sentiment + Robot Classification in Movies](http://nbviewer.jupyter.org/github/cojette/ClusteringRobotsinMovie/blob/master/Classification%20of%20Robots%20in%20Movies.ipynb)
- [Text Summarization / Gensim](http://nbviewer.jupyter.org/github/piskvorky/gensim/blob/develop/docs/notebooks/summarization_tutorial.ipynb)
- [Sentiment Analysis Intro](http://nbviewer.jupyter.org/github/sgsinclair/alta/blob/master/ipynb/SentimentAnalysis.ipynb)

<a id='models'></a>
## Common NLP models and terms

---

- LSI (Latent Semantic Indexing)
- LDA (Latent Dirichlet Allocation)
- HDP (Hierarchical Dirichlet Process)
- Word2Vec
- LogisticRegression
- Naive Bayes
- SVM
- CountVectorizer
- TF-IDF (term frequency - inverse document frequency)
- DTM (document term matrix)

> **Note:** This is not an exhaustive list, nor will we be covering all of these models in class. NLP is a very deep and very broad area of data science that could warrant its own immersive entirely!

<a id='simple'></a>
## A Simple Example
---

Suppose we are building a spam classifier. Inputs are emails and the output is a binary label.

Here's an example of an input email from each class:

In [1]:
spam = """
Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.
"""

ham = """
Hello, \nI am writing in regards to your application to the position of Data Scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews and we would like to invite you for an on-site interview with our Senior Data Scientist Mr. John Smith. You will find attached to this message further information on date, time and location of the interview. Please let me know if I can be of any further assistance. Best Regards.
"""
print(spam)
print()
print(ham)


Hello,
I saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.



Hello, 
I am writing in regards to your application to the position of Data Scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews and we would like to invite you for an on-site interview with our Senior Data Scientist Mr. John Smith. You will find attached to this message further information on date, time and location of the interview. Please let me know if I can be of any further assistance. Best Regards.



- Can you think of a simple heuristic rule to catch email like this?
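
One possible answer, sketched under loud assumptions: flag a message when it contains several hand-picked "suspicious" terms. The term list, the `min_hits` threshold, and the function name `looks_like_spam` are all hypothetical choices made for illustration, not a recommended filter.

```python
# Hypothetical keyword list, chosen by hand for illustration only.
SUSPICIOUS_TERMS = ("donate", "million", "euros", "diagnosed")


def looks_like_spam(text, min_hits=2):
    """Flag a message that contains at least `min_hits` suspicious terms."""
    lowered = text.lower()
    hits = sum(term in lowered for term in SUSPICIOUS_TERMS)
    return hits >= min_hits


# Short excerpts from the two example emails above.
spam_snippet = "I decided to WILL/Donate the sum of 8,750,000.00 Euros"
ham_snippet = "We would like to invite you for an on-site interview"

print(looks_like_spam(spam_snippet))  # True: contains "donate" and "euros"
print(looks_like_spam(ham_snippet))   # False: no suspicious terms
```

A rule like this is brittle (spammers change their wording), which is one motivation for learning from word counts instead.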

## Tokenizing
---

When we "tokenize" data, we take it and split it up into pieces, usually words.

In [2]:
# Step1 : Tokenization
# spam
spam_list=spam.lower().split()
spam_list

['hello,',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin.',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality.',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you.',
 'my',
 'name',
 'is',
 'mr.',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'directors',
 'of',
 'pjsc',
 '"lukoil".',
 'i',
 'am',
 '86',
 'years',
 'old',
 'and',
 'i',
 'was',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'years',
 'ago.',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week.',
 'i',
 'decided',
 'to',
 'will/donate',
 'the',
 'sum',
 'of',
 '8,750,000.00',
 'euros(eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euros',
 'only',
 'etc.',
 'etc.']

In [3]:
# Step1 : Tokenization
# ham
ham_list=ham.lower().split()
ham_list

['hello,',
 'i',
 'am',
 'writing',
 'in',
 'regards',
 'to',
 'your',
 'application',
 'to',
 'the',
 'position',
 'of',
 'data',
 'scientist',
 'at',
 'hooli',
 'x.',
 'we',
 'are',
 'pleased',
 'to',
 'inform',
 'you',
 'that',
 'you',
 'passed',
 'the',
 'first',
 'round',
 'of',
 'interviews',
 'and',
 'we',
 'would',
 'like',
 'to',
 'invite',
 'you',
 'for',
 'an',
 'on-site',
 'interview',
 'with',
 'our',
 'senior',
 'data',
 'scientist',
 'mr.',
 'john',
 'smith.',
 'you',
 'will',
 'find',
 'attached',
 'to',
 'this',
 'message',
 'further',
 'information',
 'on',
 'date,',
 'time',
 'and',
 'location',
 'of',
 'the',
 'interview.',
 'please',
 'let',
 'me',
 'know',
 'if',
 'i',
 'can',
 'be',
 'of',
 'any',
 'further',
 'assistance.',
 'best',
 'regards.']
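
Notice that splitting on whitespace leaves punctuation attached to the tokens (`'hello,'`, `'linkedin.'`), so `'linkedin.'` and `'linkedin'` would count as different words. A minimal alternative, using only the standard library, is to extract runs of word characters with a regular expression. The helper name `tokenize` is an assumption for this sketch.

```python
import re


def tokenize(text):
    """Lowercase the text and return each run of word characters as a token."""
    return re.findall(r"\w+", text.lower())


sample = "Hello,\nI saw your contact information on LinkedIn."
tokens = tokenize(sample)
print(tokens)
# ['hello', 'i', 'saw', 'your', 'contact', 'information', 'on', 'linkedin']
```

The trade-off is that hyphenated words and formatted numbers are broken apart: `on-site` becomes `on` and `site`, and `8,750,000.00` becomes four separate tokens.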

<a id='bow'></a>
## Bag-of-words (BoW) / word counting
---

The bag-of-words model is a simplified representation of the raw data. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words.

Bag-of-words representations discard grammar, order, and structure in the text, but track occurrences.

In [4]:
# BoW from scratch
from collections import Counter

In [5]:
# spam
spam_count=Counter(spam_list)
spam_count

Counter({'hello,': 1,
         'i': 7,
         'saw': 1,
         'your': 2,
         'contact': 2,
         'information': 1,
         'on': 1,
         'linkedin.': 1,
         'have': 2,
         'carefully': 1,
         'read': 1,
         'through': 1,
         'profile': 1,
         'and': 3,
         'you': 1,
         'seem': 1,
         'to': 2,
         'an': 2,
         'outstanding': 1,
         'personality.': 1,
         'this': 2,
         'is': 2,
         'one': 1,
         'major': 1,
         'reason': 1,
         'why': 1,
         'am': 2,
         'in': 2,
         'with': 2,
         'you.': 1,
         'my': 1,
         'name': 1,
         'mr.': 1,
         'valery': 1,
         'grayfer': 1,
         'chairman': 1,
         'of': 4,
         'the': 2,
         'board': 1,
         'directors': 1,
         'pjsc': 1,
         '"lukoil".': 1,
         '86': 1,
         'years': 2,
         'old': 1,
         'was': 1,
         'diagnosed': 1,
         'cancer':

In [6]:
# ham
ham_count=Counter(ham_list)
ham_count

Counter({'hello,': 1,
         'i': 2,
         'am': 1,
         'writing': 1,
         'in': 1,
         'regards': 1,
         'to': 5,
         'your': 1,
         'application': 1,
         'the': 3,
         'position': 1,
         'of': 4,
         'data': 2,
         'scientist': 2,
         'at': 1,
         'hooli': 1,
         'x.': 1,
         'we': 2,
         'are': 1,
         'pleased': 1,
         'inform': 1,
         'you': 4,
         'that': 1,
         'passed': 1,
         'first': 1,
         'round': 1,
         'interviews': 1,
         'and': 2,
         'would': 1,
         'like': 1,
         'invite': 1,
         'for': 1,
         'an': 1,
         'on-site': 1,
         'interview': 1,
         'with': 1,
         'our': 1,
         'senior': 1,
         'mr.': 1,
         'john': 1,
         'smith.': 1,
         'will': 1,
         'find': 1,
         'attached': 1,
         'this': 1,
         'message': 1,
         'further': 2,
         'information

> In the above example we counted the number of times each word appeared in the text. Note that since we included all the words in the text, we created a dictionary that contains many words with only one appearance.
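
To feed such counts to a model, the per-document dictionaries have to be aligned on a shared vocabulary so that every document becomes a row of the same length. The sketch below shows one way to do this with the standard library; the two toy documents and all variable names are made up for illustration.

```python
from collections import Counter

# Two toy documents (invented for this sketch).
docs = {
    "doc_a": "i have a cat and i have a dog",
    "doc_b": "you have a pen",
}

# One word-count dictionary per document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Shared vocabulary: every distinct word across all documents, sorted.
vocabulary = sorted(set().union(*counts.values()))

# One row per document; a Counter returns 0 for words it has not seen.
matrix = [[counts[name][word] for word in vocabulary] for name in docs]

print(vocabulary)
for name, row in zip(docs, matrix):
    print(name, row)
# ['a', 'and', 'cat', 'dog', 'have', 'i', 'pen', 'you']
# doc_a [2, 1, 1, 1, 2, 2, 0, 0]
# doc_b [1, 0, 0, 0, 1, 0, 1, 1]
```

This table of counts is the document-term matrix (DTM) mentioned earlier.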

<a name="countvectorizer"></a>
## Sklearn `CountVectorizer`
---

Sklearn offers a `CountVectorizer` class that performs the same word counting, with many configurable options:

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
# 1- initialize vectorizer
cv=CountVectorizer(token_pattern=r'\w+') # raw string: a token is a run of one or more word characters
spam_ham=spam_list+ham_list

In [9]:
# 2- fit vectorizer (build Vocabulary)
cv.fit((spam,ham))

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='\\w+', tokenizer=None,
        vocabulary=None)

### The count vectorizer returns a sparse matrix (see [scipy](https://docs.scipy.org/doc/scipy/reference/sparse.html))

In a sparse matrix, only the nonzero entries are stored.
Conceptually, each one is kept as a triple of numbers: its row index, its column index, and its value.

This is particularly useful for NLP, where each document contains only a small fraction of all the possible words.
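
The triple idea can be sketched with plain Python lists. This is a simplified illustration rather than how scipy lays out memory: scipy's coordinate (COO) format is the closest match, while the CSR format returned by the vectorizer compresses the row indices further. The toy matrix is invented for the example.

```python
# A small dense matrix that is mostly zeros (toy data).
dense = [
    [0, 2, 0, 0, 1],
    [0, 0, 0, 3, 0],
]

# Keep only the nonzero cells as (row, column, value) triples.
triples = [
    (i, j, value)
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
]
print(triples)  # [(0, 1, 2), (0, 4, 1), (1, 3, 3)]

# The dense matrix can be rebuilt from the triples plus its shape.
n_rows, n_cols = len(dense), len(dense[0])
rebuilt = [[0] * n_cols for _ in range(n_rows)]
for i, j, value in triples:
    rebuilt[i][j] = value
```

Here 3 triples replace 10 cells; with a vocabulary of tens of thousands of words the saving is far larger.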


In [10]:
# 3- transform (calculate weights)
document_matrix =cv.transform((spam,ham)) 

#### Handling sparse matrices

In [11]:
print('Vocabulary Words:')
# note: scikit-learn 1.0 deprecated this method in favor of get_feature_names_out()
cv.get_feature_names()

Vocabulary Words:


['00',
 '000',
 '2',
 '750',
 '8',
 '86',
 'ago',
 'am',
 'an',
 'and',
 'any',
 'application',
 'are',
 'assistance',
 'at',
 'attached',
 'be',
 'best',
 'board',
 'can',
 'cancer',
 'carefully',
 'chairman',
 'contact',
 'data',
 'date',
 'decided',
 'diagnosed',
 'directors',
 'donate',
 'eight',
 'etc',
 'euros',
 'fifty',
 'find',
 'first',
 'for',
 'further',
 'going',
 'grayfer',
 'have',
 'hello',
 'hooli',
 'hundred',
 'i',
 'if',
 'in',
 'inform',
 'information',
 'interview',
 'interviews',
 'invite',
 'is',
 'john',
 'know',
 'later',
 'let',
 'like',
 'linkedin',
 'location',
 'lukoil',
 'major',
 'me',
 'message',
 'million',
 'mr',
 'my',
 'name',
 'of',
 'old',
 'on',
 'one',
 'only',
 'operation',
 'our',
 'outstanding',
 'passed',
 'personality',
 'pjsc',
 'please',
 'pleased',
 'position',
 'profile',
 'read',
 'reason',
 'regards',
 'round',
 'saw',
 'scientist',
 'seem',
 'senior',
 'seven',
 'site',
 'smith',
 'sum',
 'that',
 'the',
 'this',
 'thousand',
 'throu

In [12]:
print("Transform to numpy array format:")
document_matrix.toarray()

Transform to numpy array format:


array([[1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
        1, 2, 0, 0, 1, 1, 1, 1, 1, 2, 2, 1, 0, 0, 1, 0, 1, 1, 2, 1, 0, 1,
        7, 0, 2, 0, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1,
        1, 1, 4, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1,
        0, 1, 0, 1, 0, 0, 1, 0, 2, 2, 1, 1, 0, 2, 1, 1, 0, 1, 1, 2, 2, 0,
        0, 0, 2, 2, 2],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0,
        0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 0, 0, 1, 1, 0,
        2, 1, 1, 1, 1, 2, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1,
        0, 0, 4, 0, 2, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 2, 1, 0,
        2, 0, 1, 0, 1, 1, 0, 1, 3, 1, 0, 0, 1, 5, 0, 0, 2, 0, 0, 1, 1, 1,
        1, 1, 0, 4, 1]])

In [13]:
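print("Number of nonzero entries:")
# caution: nonzero() returns a (row_indices, column_indices) tuple, so len() of it is always 2;
# the number of stored entries is available as document_matrix.nnz
len(document_matrix.nonzero())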

Number of nonzero entries:


2

In [14]:
print("Highest count:")
document_matrix.max()

Highest count:


7

#### Inserting the result into a dataframe

> **Note:** For huge text bodies continue working with the sparse matrix format. Many sklearn models can digest sparse matrices.

In [15]:
# create dataframe
import pandas as pd
# note: scikit-learn 1.0 deprecated get_feature_names() in favor of get_feature_names_out()
df=pd.DataFrame(data=document_matrix.toarray(),columns=cv.get_feature_names(),index=['spam','ham'])
df

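| | 00 | 000 | 2 | 750 | 8 | 86 | ago | am | an | and | ... | week | why | will | with | would | writing | x | years | you | your |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
| **spam** | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 3 | ... | 1 | 1 | 2 | 2 | 0 | 0 | 0 | 2 | 2 | 2 |
| **ham** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 4 | 1 |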


In [16]:
# sort columns 

- ### Spend a couple of minutes scanning the documentation to figure out what those parameters do. 

Share a few takeaways from the documentation. What arguments and capabilities stand out to you? How do the arguments affect the parsing behavior?

[Count Vectorizer Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

<a id='hash'></a>
## Hashing

---
![](https://i.ytimg.com/vi/bs7Wq0Z1uYk/maxresdefault.jpg)

A hash value (or simply hash), also called a message digest, is a number generated from a string of text. 

The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.

Think of the hash as a code that represents the original text in a more condensed format.

<a name="hashingvectorizer"></a>
## Sklearn `HashingVectorizer`

---

As we have seen, we can cap the size of the `CountVectorizer` vocabulary, for example by keeping only words within a certain frequency range. However, we still have to compute the vocabulary and hold it in memory. This can be a problem with a large corpus, or in streaming applications where we don't know which words we will encounter in the future.

Both problems can be solved using the `HashingVectorizer`, which converts a collection of text documents to a matrix of occurrences calculated with the [hashing trick](https://en.wikipedia.org/wiki/Feature_hashing). Each word is mapped to a feature with the use of a [hash function](https://en.wikipedia.org/wiki/Hash_function) that converts it to a hash. If we encounter that word again in the text, it will be converted to the same hash, allowing us to count word occurrences without retaining a dictionary in memory. This is very convenient!

The main drawback of this trick is that it's *not possible to compute the inverse transform*, and thus we lose information on what words the important features correspond to. The hash function employed is the signed 32-bit version of Murmurhash3.
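
The hashing trick itself fits in a few lines. The sketch below is a simplified stand-in, not sklearn's implementation: it uses `hashlib.md5` instead of MurmurHash3, ignores the sign, and uses a deliberately tiny `n_features` so the vector is easy to inspect. The function name `hashed_counts` is an assumption.

```python
import hashlib


def hashed_counts(text, n_features=16):
    """Count tokens into a fixed-size vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in text.lower().split():
        # md5 is a stand-in hash for this sketch (sklearn uses MurmurHash3).
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features  # column chosen by the hash
        vector[index] += 1
    return vector


vec_a = hashed_counts("the cat sat on the mat")
vec_b = hashed_counts("the mat sat on the cat")

print(len(vec_a))      # always n_features, whatever the text length
print(vec_a == vec_b)  # same words, same vector: order is ignored
```

Because the column is computed directly from the word, nothing has to be remembered between documents. The price is that two different words may land in the same column (a collision), and the column index cannot be mapped back to a word.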

- ### What characteristics should feature extraction from text satisfy?

> It should return a vector of fixed size, regardless of the length of a text.

- ### Using the code above as example, let's repeat the vectorization using a `HashingVectorizer`.

> Look up how to do this and then try it.


In [17]:
from sklearn.feature_extraction.text import HashingVectorizer

In [18]:
# 1- initialize vectorizer 
has=HashingVectorizer()

In [19]:
# 2- fit vectorizer
has.fit(spam_ham)

HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_range=(1, 1), non_negative=False,
         norm='l2', preprocessor=None, stop_words=None, strip_accents=None,
         token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None)

In [20]:
# 3- transform (calculate weights)
document_matrix = has.transform((spam,ham))

In [None]:
# create dataframe

In [None]:
# sort columns

- ### What new parameters does this vectorizer offer?

> Go to the documentation and compare to the `CountVectorizer`.

[Hashing Vectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)

<a name="tfidf"></a>
## Term frequency - inverse document frequency (tf-idf)

---

A tf-idf score tells us which words are most discriminating between documents. Words that occur a lot in one document but don't occur in many documents contain a great deal of discriminating power.

- This weight is a statistical measure used to evaluate how important a word is to a document in a collection (aka corpus).
- The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. (Some implementations add one to the counts to avoid division by zero.)

This enhances terms that are highly specific to a particular document, while suppressing terms that are common to most documents.

#### Let's see how it is calculated:
 * $t$ stands for the term of interest
 * $d$ represents a particular document, in a corpus $D$
 * $N$ represents the number of documents in corpus $D$

Then term frequency (tf) is:
 
### $$ tf(t,d) = \frac{count(t)}{length(d)} $$

And inverse document frequency (idf) is:
$$ \mathrm{idf}(t, D) = \log\left(\frac{N}{|\{d \in D : t \in d\}|}\right) $$

* The denominator is the number of documents in the corpus in which term _t_ appears

Then term frequency inverse document frequency (tf-idf) is:

### $$ tfidf(t,d,D) = tf(t,d) * idf(t,D) $$

#### An example

Consider the three sentences:

    a) I have a cat.

    b) I have a puppy.
    
    c) I have a dog, I have a kitten, and I have a pen.
    
- Calculate the tf-idf of the words `cat`, `have` and `puppy`.
- On paper, sketch the values obtained in a 3-dimensional coordinate system.
- Follow the default sklearn settings which will discard single letter words.

#### Tf-idf calculation

In [21]:
import numpy as np

In [22]:
# corpus 
corpus = ['I have a cat',
          'I have a puppy',
          'I have a dog, I have a kitten, and I have a pen']
corpus

['I have a cat',
 'I have a puppy',
 'I have a dog, I have a kitten, and I have a pen']

In [23]:
# Tf-idf from scratch
word = 'cat'

In [24]:
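# calculate TF
# count whole tokens rather than substrings (str.count would also match 'cat' inside 'category')
tf=corpus[0].lower().split().count(word)/len(corpus[0].split())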
tf

0.25

In [25]:
# calculate IDF
# a document counts only if the word appears in it as a whole token
idf=np.log(len(corpus)/sum(1 for doc in corpus if word in doc.lower().split()))
idf

1.0986122886681098

In [26]:
# calculate TF-IDF
tf*idf

0.27465307216702745
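
The same steps can be wrapped into small functions so that any term and document can be scored. This is a minimal sketch of the formulas above using only the standard library. The helper names are assumptions, and unlike sklearn's defaults it keeps single-letter words and applies no smoothing or normalization.

```python
import math

corpus = [
    "I have a cat",
    "I have a puppy",
    "I have a dog, I have a kitten, and I have a pen",
]


def tokenize(doc):
    """Lowercase, split on whitespace, and strip trailing commas/periods."""
    return [token.strip(",.").lower() for token in doc.split()]


def tf(term, doc):
    """Share of the document's tokens that are equal to `term`."""
    tokens = tokenize(doc)
    return tokens.count(term) / len(tokens)


def idf(term, docs):
    """log(N / number of documents containing `term`).

    Raises ZeroDivisionError if the term appears in no document.
    """
    containing = sum(1 for doc in docs if term in tokenize(doc))
    return math.log(len(docs) / containing)


def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


print(tfidf("cat", corpus[0], corpus))    # 0.25 * log(3), as computed above
print(tfidf("have", corpus[0], corpus))   # 0.0: "have" appears in every document
print(tfidf("puppy", corpus[1], corpus))  # same score as "cat" in document a)
```

The zero score for `have` shows the idf term at work: a word that occurs in every document carries no discriminating power, however often it appears.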

<a id='tfidf-vec'></a>
## Sklearn `TfidfVectorizer`

---

### Why Use TFIDF?

- Common words are penalized
- Rare words have more influence

Sklearn provides a tf-idf vectorizer that works similarly to the other vectorizers we've covered. Notice that we can also eliminate stop words to improve our analysis.

#### Use the `TfidfVectorizer` to fit the spam and ham data.

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
# 1- initialize vectorizer
vect=TfidfVectorizer(stop_words='english')

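The `stop_words` attribute only echoes the option we passed in (the string `'english'`); the actual word list used by sklearn is returned by `vect.get_stop_words()`: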

In [29]:
stopwords = vect.stop_words
stopwords

'english'

In [30]:
# 2- fit vectorizer
vect.fit((spam,ham))

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [31]:
# 3- transform (calculate weights)
document_matrix = vect.transform((spam,ham))

In [32]:
# create dataframe
pd.DataFrame(document_matrix.toarray())

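| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
| **0** | 0.145067 | 0.145067 | 0.145067 | 0.145067 | 0.145067 | 0.0 | 0.0 | 0.0 | 0.0 | 0.145067 | ... | 0.145067 | 0.0 | 0.0 | 0.145067 | 0.145067 | 0.0 | 0.145067 | 0.145067 | 0.0 | 0.290133 |
| **1** | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.155195 | 0.155195 | 0.155195 | 0.155195 | 0.0 | ... | 0.0 | 0.155195 | 0.155195 | 0.0 | 0.0 | 0.155195 | 0.0 | 0.0 | 0.155195 | 0.0 |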
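
These weights do not follow the plain tf * idf formula from the previous section. With the default settings shown in the repr above (`smooth_idf=True`, `norm='l2'`), sklearn documents a smoothed idf, `ln((1 + N) / (1 + df)) + 1`, applied to raw counts, after which each row is scaled to unit length. The sketch below reimplements that recipe with the standard library so the effect can be inspected. The function name and the toy counts are assumptions, and it is not sklearn's own code.

```python
import math


def sklearn_style_tfidf(counts_per_doc):
    """counts_per_doc: list of {term: raw count} dicts, one per document."""
    n_docs = len(counts_per_doc)
    vocabulary = sorted(set().union(*counts_per_doc))
    # document frequency: number of documents containing each term
    df = {t: sum(1 for c in counts_per_doc if t in c) for t in vocabulary}
    # smoothed idf, as documented for smooth_idf=True
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for counts in counts_per_doc:
        weights = [counts.get(t, 0) * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))  # l2 norm of the row
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


vocabulary, rows = sklearn_style_tfidf([
    {"cat": 1, "have": 1},    # toy counts for "have cat"
    {"puppy": 1, "have": 1},  # toy counts for "have puppy"
])
print(vocabulary)
print(rows)
```

Two consequences are visible in the table above: a word shared by all documents still gets a nonzero weight (its idf is 1 rather than 0), and all words that occur once in the same document and in no other share one value, because the row is normalized as a whole.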


In [33]:
# sort columns

<a name="downsides-bow"></a>
## Downsides to bag-of-words

---

Bag-of-words approaches like the ones outlined above completely ignore the structure of a sentence; they merely assess the presence of specific words or word combinations.

The same word can have multiple meanings in different contexts. Consider for example the following two sentences:

- There's wood floating in the **sea**.
- Mike's in a **sea** of trouble with the move.

How do we teach a computer to disambiguate? Later we will cover some other techniques that may help.
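
The loss of word order is easy to demonstrate. In the sketch below (toy sentences and the helper name `bag` are invented for illustration), two sentences with opposite meanings produce exactly the same bag of words.

```python
from collections import Counter


def bag(text):
    """Bag-of-words: word counts with order discarded."""
    return Counter(text.lower().split())


sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

same_bag = bag(sentence_a) == bag(sentence_b)
print(same_bag)  # True, although the sentences mean different things
```

Any model trained on these counts alone cannot tell the two sentences apart.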


<a id='segmentation'></a>
## Segmentation

---

_Segmentation_ is a technique to **identify sentences** within a body of text. Language is not a continuous uninterrupted stream of words: punctuation serves as a guide to group together words that convey meaning when contiguous.
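
As a rough sketch of the idea, sentence boundaries can be approximated by splitting on whitespace that follows sentence-ending punctuation. The regular expression below is an assumption made for illustration and is far from robust: an abbreviation such as "Mr. Smith" would be split in the wrong place.

```python
import re

easy_text = "I went to the zoo today. What do you think of that? I bet you hate it, Or maybe you don't."

# Split wherever whitespace is preceded by '.', '?' or '!'.
sentences = re.split(r"(?<=[.?!])\s+", easy_text)

for sentence in sentences:
    print(sentence)
# I went to the zoo today.
# What do you think of that?
# I bet you hate it, Or maybe you don't.
```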


In [34]:
easy_text = "I went to the zoo today. What do you think of that? I bet you hate it, Or maybe you don't."

In [35]:
# Segmentation from scratch 

easy_text_list = []
start = 0
for i, c in enumerate(easy_text):
    if c in ('.', ',', '?'):
        easy_text_list.append(easy_text[start:i+1])
        start = i+1

In [36]:
easy_text_list

['I went to the zoo today.',
 ' What do you think of that?',
 ' I bet you hate it,',
 " Or maybe you don't."]

<a id='nltk-sentencer'></a>
- ### There's an easier way to do the same thing!

<a name="install"></a>
### Install NLTK (Natural Language Toolkit) packages

First, in your terminal, run 

```bash
pip install nltk
```

Then within python, run the following:

```python
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
```

- ### Use `nltk.download()` to explore the available packages.

In [65]:
import nltk
#nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shadow/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [38]:
#!pip install nltk

In [39]:
from nltk.tokenize import PunktSentenceTokenizer

In [40]:
seg=PunktSentenceTokenizer()
seg.sentences_from_text(easy_text)

['I went to the zoo today.',
 'What do you think of that?',
 "I bet you hate it, Or maybe you don't."]

<a name="stem-nltk"></a>
## Stemming with NLTK

---

**Text normalization** is the process of converting slightly different versions of words with essentially equivalent meaning into the same features.

For example: LinkedIn sees 6000+ variations of the title "Software Engineer" and 8000+ variations of the word "IBM".

- ### What are other common cases of text that could need normalization?


    - Person titles (Mr, MR, Dr etc.)
    - Dates (10/03, March 10 etc.)
    - Numbers
    - Plurals
    - Verb conjugations
    - Slang
    - SMS abbreviations

It would be wrong to treat "MR." and "mr" as different features, so we need a technique to normalize words to a common root. This technique is called _stemming_.

- Science, Scientist => Scien
- Swimming, Swimmer, Swim => Swim

As we did above we could define a Stemmer based on rules:

In [41]:
# Stemming from scratch 

def stem(tokens):
    '''rule-based stemming of a bunch of tokens'''

    new_bag = []
    for token in tokens:
        # define rules here
        if token.endswith('s'):
            new_bag.append(token[:-1])
        elif token.endswith('er'):
            new_bag.append(token[:-2])
        elif token.endswith('ce'):
            new_bag.append(token[:-2])
        elif token.endswith('tion'):
            new_bag.append(token[:-4])
        elif token.endswith('tist'):
            new_bag.append(token[:-4])
        elif token.endswith('ing'):
            new_bag.append(token[:-3])
        else:
            new_bag.append(token)

    return new_bag

In [42]:
list_science = ['Science', 'Scientist', 'Scientists']
print(list_science)

['Science', 'Scientist', 'Scientists']


In [43]:
list_play = ['player', 'plays', 'playing']
print(list_play)

['player', 'plays', 'playing']
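
To see what such rules do to these lists, here is a compact, table-driven restatement of the same suffix rules, tried in the same order. The names `SUFFIX_RULES` and `strip_suffix` are introduced only for this sketch.

```python
# Same suffixes, in the same order, as the rule-based stemmer above.
SUFFIX_RULES = ("s", "er", "ce", "tion", "tist", "ing")


def strip_suffix(token):
    """Remove the first matching suffix; return the token unchanged otherwise."""
    for suffix in SUFFIX_RULES:
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token


science_stems = [strip_suffix(t) for t in ["Science", "Scientist", "Scientists"]]
play_stems = [strip_suffix(t) for t in ["player", "plays", "playing"]]

print(science_stems)  # ['Scien', 'Scien', 'Scientist']
print(play_stems)     # ['play', 'play', 'play']
```

The `play` family collapses nicely, but `Scientists` only loses its plural `s` because each token gets a single rule applied. Hand-written rules quickly run into cases like this.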


Luckily for us, NLTK contains several robust stemmers.

In [44]:
from nltk.stem import PorterStemmer

In [54]:
# 'walk' - 'walking' - 'walked'
# named `porter` so that it does not shadow the stem() function defined above
porter=PorterStemmer()
print(porter.stem('walk'))
print(porter.stem('walking'))
print(porter.stem('walked'))

walk
walk
walk


<a id='group'></a>
### Stemming approaches


> There are other stemmers available in NLTK. Look at [this article](https://www.elastic.co/guide/en/elasticsearch/guide/current/choosing-a-stemmer.html) to find out about different stemmers, and have a look at how stemming works in different languages.

<a id='stopwords'></a>
## Stop words

---

Some words are very common and provide no legitimate information about the content of the text.

- ### Can you give some examples?

> We should remove these _stop words_. Note that each language has different stop words.

In [66]:
from nltk.corpus import stopwords

In [67]:
sentence = "this is a foo bar sentence"

In [71]:
# load the English stop word list and show the first ten entries
stop=stopwords.words('english')
stop[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
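
Removing stop words from the example sentence is then a simple filter. To keep the sketch self-contained it uses a tiny hand-picked set instead of the NLTK list, which is an assumption made for illustration; with NLTK installed you would filter against `stopwords.words('english')` instead.

```python
# Hand-picked stand-in for a real stop word list (illustration only).
STOP_WORDS = {"this", "is", "a", "the", "of", "and", "to"}

sentence = "this is a foo bar sentence"

# Keep only the words that are not stop words.
filtered = [word for word in sentence.split() if word not in STOP_WORDS]
print(" ".join(filtered))  # foo bar sentence
```

Using a set rather than a list makes each membership test fast, which matters once the corpus is large.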

In [49]:
#How many stop words

<a id='pos'></a>
## Part of speech tagging

---

Each word has a specific role in a sentence (Verb, Noun, etc.). Parts-of-speech tagging (POS) is a feature extraction technique that attaches a tag to each word in the sentence in order to provide a more precise context for further analysis. This is often a resource intensive process, but it can sometimes improve the accuracy of our models to have the grammatical features.

In [72]:
from nltk.tag import pos_tag
from nltk.tokenize import WordPunctTokenizer

In [73]:
sentence = "today is a great day to learn nlp"

In [74]:
# tokenization
tok = WordPunctTokenizer()
tokens = tok.tokenize(sentence)
tokens

['today', 'is', 'a', 'great', 'day', 'to', 'learn', 'nlp']

In [77]:
#nltk.download('averaged_perceptron_tagger')
# pos_tag returns (word, tag) pairs; dict() keeps only one tag per distinct word
tags = dict(pos_tag(tokens))
tags

{'today': 'NN',
 'is': 'VBZ',
 'a': 'DT',
 'great': 'JJ',
 'day': 'NN',
 'to': 'TO',
 'learn': 'VB',
 'nlp': 'NN'}

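Here is the explanation for the abbreviations, which follow the Penn Treebank tag set:

| Tag | Meaning |
|-|-|
| `NN` | Noun, singular or mass |
| `VBZ` | Verb, third person singular present |
| `DT` | Determiner |
| `JJ` | Adjective |
| `TO` | The word "to" |
| `VB` | Verb, base form |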

<a id='unicode'></a>
## Unicode: a common pitfall

What happens when we get a character outside the ASCII range, for instance a German umlaut **&ouml;** or Japanese katakana **片仮名 / カタカナ**?


- Python 3 strings are Unicode, so the characters themselves are not the problem; trouble starts when bytes are decoded with the wrong encoding
- A mismatched encoding either raises a `UnicodeDecodeError` or silently produces garbled text
- This problem can be very frustrating to deal with

Luckily, sklearn has robust classes for text feature extraction, and its vectorizers accept `encoding` and `decode_error` parameters. Use sklearn's built-in text preprocessing when possible, and always save/encode your text as UTF-8 when you have the option.
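
The short sketch below, using only built-in Python 3 string methods, shows what goes right and wrong when text is converted to bytes and back. The sample string is made up for illustration.

```python
text = "Köln カタカナ"

# Encoding to UTF-8 and decoding with UTF-8 round-trips cleanly.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")
print(decoded == text)  # True

# Decoding UTF-8 bytes with the wrong codec silently garbles the text.
garbled = "ö".encode("utf-8").decode("latin-1")
print(garbled)  # Ã¶

# A stricter codec refuses the bytes instead of guessing.
try:
    encoded.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)  # False
```

Being explicit about the encoding whenever a file is opened, for example `open(path, encoding="utf-8")`, avoids most of these surprises.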

<a name="conclusion"></a>
## Conclusion

---

In this lesson we obtained an overview of Natural Language Processing (NLP) and learned about two very powerful toolkits:
- Scikit Learn Feature Extraction Text
- Natural Language Tool Kit

#### Some real world applications of these techniques:

- Spam Detection
- Preprocessing for larger NLP problems
- Job market analysis
- Crude topic analysis
- Building a keyword extraction heuristic and piping it into a marketing analysis

<a id='resources'></a>
## Additional resources

---

- Check out this [Yelp blog post](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) on how they completed a classification task (with over 1000 response variables!) using restaurant review text
- A list of all stop-words is available [here](https://github.com/ga-students/DSI-DC-2/blob/master/curriculum/Week-05/5.04-nlp/stop-words.txt) h/t sleevillanueva
- Wikipedia's articles on [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) and [hash functions](https://en.wikipedia.org/wiki/Hash_function) are great places to turn for more info on hashing
- This lesson made use of Charlie Greenbacher's [Intro to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf), which he delivered at the [DC-NLP Meetup](http://www.meetup.com/DC-NLP/) 
- Wikipedia includes a [walkthrough](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) of TF-IDF
- Play with Google's [ngram tool](https://books.google.com/ngrams/graph?content=data+science&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cdata%20science%3B%2Cc0)
- A hilarious data scientist gone rogue used NLP and eigenfaces (eigenvectors used for face recognition) [for Tinder](http://dataconomy.com/hacking-tinder-with-facial-recognition-nlp/)
- Check out KPCB's 2016 internet trends report, [a massive, insightful deck](http://www.kpcb.com/internet-trends)
- [Choosing a Stemmer](https://www.elastic.co/guide/en/elasticsearch/guide/current/choosing-a-stemmer.html)
- Check documentation: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html), [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
- [Feature Hashing](https://en.wikipedia.org/wiki/Feature_hashing)
- [Term Frequency Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)