

## Intro to NLP

### Learning Objectives
*After this lesson, you will be able to:*

- Extract features from unstructured text with scikit-learn.
- Describe how count vectorization and TF-IDF work.
- Define stop words and remove them with scikit-learn.
- Identify shortcomings with the aforementioned methods.


<a name="intro"></a>
## Introduction to Text Feature Extraction

---

The models we’ve been using so far accept a two-dimensional matrix of real numbers as input `X` and a target vector of classes or numbers as `y`. What if our starting data are not a table of numbers but are unstructured instead? This is the case when we work with text documents.

> We need a way to go from unstructured data to our numeric `X` matrix in order to use the same models. This is called _feature extraction_, and this lesson is dedicated to it.

The applications of using text data in statistical modeling are practically infinite. Some examples include:

- Sentiment analysis of Yelp reviews.
- Identifying topics of news articles.
- Classification of political authors.

<br>
<center>
# Y ~ "YOLO 4life ^^ BBQ@@ OMG LOL!"
</center>
<br>

<img src="https://snag.gy/FoaBeK.jpg" style="width: 400px; float: left; margin-right: 20px;">

<img src="https://snag.gy/Qz0mav.jpg" style="width: 150px; float: left; margin-right: 50px;">

<img src="https://snag.gy/6Lu9aC.jpg">


<a id='common'></a>
## Common NLP Problems

---

The table below details some of the most common problems and tasks in the vast field of natural language processing (NLP).

| Task | Description |
|-|-|
| **Sentiment Analysis** | Determining if what is written is positive or negative. | 
| **Named Entity Recognition** | Classifying names of people, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. |
| **Summarization** | Boiling down large bodies of text to paraphrased versions. |
| **Topic Modeling** | Pinpointing the topics a body of text belongs to (e.g., auto-tagging news articles). |
| **Question Answering** | Determining the answer to a human-language question. |
| **Word Disambiguation** | Many words have more than one meaning; we have to select the meaning that makes the most sense in context. For this problem, we’re typically given a list of words and associated word senses (e.g., from a dictionary or from an online resource such as WordNet). |
| **Machine Dialog Systems** | Building response systems that react contextually to human input (e.g., Me: "Siri, cook me some bacon." Siri: "How do you like your bacon cooked?"). |


See also:

- [News headline analysis](http://nbviewer.jupyter.org/github/AYLIEN/headline_analysis/blob/06f1223012d285412a650c201a19a1c95859dca1/main-chunks.ipynb?utm_content=buffer5d40c&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer).
- [Sentiment and robot classification in movies](http://nbviewer.jupyter.org/github/cojette/ClusteringRobotsinMovie/blob/master/Classification%20of%20Robots%20in%20Movies.ipynb).
- [Text summarization with Gensim](http://nbviewer.jupyter.org/github/piskvorky/gensim/blob/develop/docs/notebooks/summarization_tutorial.ipynb).
- [Sentiment analysis introduction](http://nbviewer.jupyter.org/github/sgsinclair/alta/blob/master/ipynb/SentimentAnalysis.ipynb).

<a id='models'></a>
## Some Common NLP Models and Terms

---

- Latent semantic indexing (LSI)
- Latent Dirichlet allocation (LDA)
- Hierarchical Dirichlet process (HDP)
- Word2Vec
- Logistic regression
- Naive Bayes
- SVM
- CountVectorizer
- Term frequency-inverse document frequency (TF-IDF)
- Document term matrix (DTM)

> **Note:** This is not an exhaustive list, nor will we be covering all of these models in class. NLP is a very deep and broad area of data science that could warrant its own Immersive course entirely.

<a id='simple'></a>
## A Simple Example
---

Suppose we're building a spam classifier. The inputs are emails and the output is a binary label.

Here's an example of an input email from each class:

In [50]:
spam = """
Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer, chairman of the board of directors of PJSC "LUKOIL." I am 86-years old and I was diagnosed with cancer two years ago. I will be going in for an operation later this week. I decided to will/donate the sum of 8,750,000.00 Euros (eight million seven hundred and fifty thousand euros only etc. etc.
"""

ham = """
Hello,\nI am writing in regards to your application to the position of data scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews, and we would like to invite you for an onsite interview with our senior data scientist, Mr. John Smith. You will find attached to this message further information on date, time, and location of the interview. Please let me know if I can be of any further assistance. Best regards.
"""

print(spam)
print()
print(ham)


Hello,
I saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer, chairman of the board of directors of PJSC "LUKOIL." I am 86-years old and I was diagnosed with cancer two years ago. I will be going in for an operation later this week. I decided to will/donate the sum of 8,750,000.00 Euros (eight million seven hundred and fifty thousand euros only etc. etc.



Hello,
I am writing in regards to your application to the position of data scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews, and we would like to invite you for an onsite interview with our senior data scientist, Mr. John Smith. You will find attached to this message further information on date, time, and location of the interview. Please let me know if I can be of any further assistance. Best regards.



### Basic terminology

---

Virtually all NLP uses this base terminology:

- a collection of text is a **document**
- a collection of documents is a **corpus** (plural corpora)

In [90]:
corpus = [spam, ham]  # two docs in our corpus

## Can You Think of a Simple Heuristic Rule to Catch Emails Like This?

> _We could check for the presence of the words "donate," "will," "sum," "cancer," "LinkedIn," and those that are similar._

By defining a simple rule that parses the text for the presence of keywords, we’re performing one of the simplest text feature extraction methods: _binary word counting_.
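
The heuristic above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the short example strings, and the threshold of three keywords are assumptions made for the sketch, not part of the lesson's code.

```python
import re

# Keywords taken from the heuristic suggested above.
KEYWORDS = ["donate", "will", "sum", "cancer", "linkedin"]

def binary_keyword_features(text, keywords=KEYWORDS):
    """Return a list of 0/1 flags, one per keyword, marking presence in text."""
    # Lowercase and keep alphabetic runs so "LinkedIn." matches "linkedin".
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if word in tokens else 0 for word in keywords]

def looks_like_spam(text, threshold=3):
    """Hypothetical rule: flag the text if at least `threshold` keywords appear."""
    return sum(binary_keyword_features(text)) >= threshold

spam_snippet = ("I saw your contact information on LinkedIn. "
                "I decided to will/donate the sum of 8,750,000.00 Euros")
ham_snippet = "You will find attached to this message further information."

print(binary_keyword_features(spam_snippet))  # [1, 1, 1, 0, 1]
print(binary_keyword_features(ham_snippet))   # [0, 1, 0, 0, 0]
print(looks_like_spam(spam_snippet), looks_like_spam(ham_snippet))  # True False
```

Note that "will" also appears in the legitimate snippet, which is why a single keyword is a weak signal on its own.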


<a id='bow'></a>
## Bag of Words/Word Counting
---

The bag-of-words model is a simplified representation of the raw data. In this model, text (such as a sentence or document) is represented as the bag (multiset) of its words.

Bag-of-words representations discard grammar, order, and structure in the text but track occurrences.
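
As a rough sketch of the idea, a bag of words is just a mapping from each word to the number of times it occurs. The example below uses only the standard library; the `bag_of_words` helper and its simple tokenization rule are assumptions for illustration, not how scikit-learn tokenizes.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

bag = bag_of_words("The cat sat on the mat. The mat was flat.")
print(bag.most_common(2))  # [('the', 3), ('mat', 2)]
print(bag["dog"])          # 0, because words that never occur have a count of zero
```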

<a name="countvectorizer"></a>
## Demo: Scikit-Learn `CountVectorizer`
---

Scikit-learn offers a `CountVectorizer` class that turns a collection of text documents into a matrix of word counts.

**Note**: There are several parameters to tweak.

In [91]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer()
#cvec = CountVectorizer(stop_words='english')

In [92]:
cvec.fit(corpus)
new_corpus = cvec.transform(corpus)
new_corpus

<2x112 sparse matrix of type '<class 'numpy.int64'>'
	with 130 stored elements in Compressed Sparse Row format>

In [94]:
new_corpus.todense()

matrix([[1, 1, 1, 1, 1, 2, 2, 3, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1,
         2, 0, 0, 1, 1, 1, 1, 1, 2, 2, 1, 0, 0, 1, 0, 1, 1, 2, 1, 0, 1,
         0, 2, 0, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1,
         1, 1, 4, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,
         0, 1, 0, 1, 0, 1, 0, 1, 0, 2, 2, 1, 1, 0, 2, 1, 1, 1, 0, 1, 1,
         2, 2, 0, 0, 2, 2, 2],
        [0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,
         0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 0, 0, 1, 1, 0,
         1, 1, 1, 1, 2, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1,
         0, 0, 4, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 2,
         1, 0, 2, 0, 1, 0, 1, 0, 1, 3, 1, 0, 0, 1, 5, 0, 0, 0, 2, 0, 0,
         1, 1, 1, 1, 0, 4, 1]])

In [97]:
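# On scikit-learn >= 1.0 use get_feature_names_out();
# the older get_feature_names() was removed in 1.2.
df = pd.DataFrame(new_corpus.todense(),
                  columns=cvec.get_feature_names_out(),
                  index=['spam', 'ham'])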
df.T.sort_values('spam', ascending=False).head(10).T

| | of | and | your | in | etc | is | have | contact | the | this |
|-|-|-|-|-|-|-|-|-|-|-|
| **spam** | 4 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| **ham** | 4 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 3 | 1 |


### Spend a couple of minutes scanning the documentation to figure out what those parameters do.

In groups, share a few takeaways from the documentation.  In particular, look at what the following do:

- `encoding`
- `analyzer`
- `lowercase`

What arguments and capabilities stand out to you? 

We will soon look at stop words, documents and n-grams which are mentioned in the docs.

[Count vectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
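
Since n-grams come up in the documentation (see the `ngram_range` parameter), here is a minimal from-scratch sketch of what a word n-gram is. The `word_ngrams` helper is hypothetical and only meant to show the idea; it is not scikit-learn's implementation.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous word n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "not a good movie".split()
print(word_ngrams(tokens))
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie']
```

Bigrams such as "good movie" keep a little of the local word order that single-word counts throw away.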

<a id='stopwords'></a>
## Stop Words

---

Some words are used so commonly that they provide little information about the content of a text. These are called **stop words**, and we often remove them before modeling.

In [64]:
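# ENGLISH_STOP_WORDS is importable from the `text` submodule; the old
# `sklearn.feature_extraction.stop_words` module is gone from recent releases.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print(ENGLISH_STOP_WORDS)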

frozenset({'only', 'etc', 'anyone', 'serious', 'whereafter', 'after', 'please', 'both', 'against', 'moreover', 'too', 'amoungst', 'meanwhile', 'i', 'three', 'until', 'first', 'neither', 'what', 'they', 'per', 'get', 'yourselves', 'system', 'us', 'was', 'where', 'very', 'became', 'top', 'although', 'cry', 'beyond', 'about', 'any', 'along', 'between', 'five', 'fire', 'none', 'side', 'show', 'anything', 'am', 'one', 'third', 'cant', 'wherever', 'somewhere', 'much', 'sometimes', 'with', 'how', 'my', 'four', 'towards', 'whenever', 'yourself', 're', 'whereby', 'due', 'herself', 'thin', 'has', 'itself', 'next', 'hereupon', 'some', 'there', 'during', 'before', 'been', 'except', 'it', 'upon', 'their', 'had', 'no', 'while', 'me', 'these', 'you', 'name', 'often', 'from', 'last', 'even', 'un', 'another', 'every', 'mostly', 'over', 'put', 'several', 'someone', 'at', 'thus', 'must', 'all', 'found', 'other', 'seemed', 'beforehand', 'hasnt', 'above', 'couldnt', 'though', 'or', 'eleven', 'hence', 'ever
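
To make the effect of stop word removal concrete, here is a small standard-library sketch. The `STOP_WORDS` set is a hand-picked subset of the words printed above and the `remove_stop_words` helper is hypothetical; in practice you would pass `stop_words='english'` to the vectorizer instead.

```python
import re

# Hypothetical, hand-picked subset of the stop words printed above (illustration only).
STOP_WORDS = frozenset({"i", "am", "was", "with", "you", "my", "it", "at", "from", "all"})

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase and tokenize text, then drop any token found in stop_words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]

print(remove_stop_words("I was diagnosed with cancer"))  # ['diagnosed', 'cancer']
```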

<a id='hash'></a>
## What is a Hash Function?

---
![](https://i.ytimg.com/vi/bs7Wq0Z1uYk/maxresdefault.jpg)

### Hashing

A hash value is a number generated from a string of text. It's also referred to simply as "hash" or "message digest."

The hash is substantially smaller than the text itself and is generated by a formula in such a way that it's extremely unlikely some other text will produce the same hash value.

Think of the hash as a code that represents the original text in a more condensed format.

![](images/hash_function.png)
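
A quick way to see these properties is to hash a few strings with Python's standard `hashlib` module. SHA-256 is used here purely as a familiar example of a hash function; the `hex_digest` helper name is an assumption for this sketch.

```python
import hashlib

def hex_digest(text):
    """Return the SHA-256 digest of text as a hexadecimal string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

long_text = "a much longer piece of text " * 100

# The digest has the same length no matter how long the input is.
print(len(hex_digest("spam")), len(hex_digest(long_text)))  # 64 64

# The same input always produces the same hash...
print(hex_digest("spam") == hex_digest("spam"))  # True

# ...while even a small change produces a different one.
print(hex_digest("spam") == hex_digest("Spam"))  # False
```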

<a name="hashingvectorizer"></a>
## Scikit-Learn's `HashingVectorizer`

---

As we’ve seen, we can set the `CountVectorizer` dictionary to a fixed size, only keeping words of certain frequencies. However, **we still have to compute a dictionary and hold it in memory.** This could be a problem when:

- we have a large corpus, or
- we work with streaming applications where we don't know which words we'll encounter in the future.

Both problems can be solved using the `HashingVectorizer`, which converts a collection of text documents to a matrix of occurrences calculated with the [hashing trick](https://en.wikipedia.org/wiki/Feature_hashing). Each word is mapped to a feature with the use of a [hash function](https://en.wikipedia.org/wiki/Hash_function), which converts it to a hash. If we encounter that word again in the text, it will be converted to the same hash, allowing us to count word occurrence without retaining a dictionary in memory.

**Huh?**

$$
\text{word} \rightarrow \text{hash} \rightarrow \text{column index}
$$

The main drawback of this trick is that it's *not possible to compute the inverse transform* and we lose information on which words correspond with the important features. The hash function employed is the signed 32-bit version of Murmurhash3.
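
The word → hash → column index pipeline can be sketched with the standard library. This is a simplified illustration under stated assumptions: it uses `zlib.crc32` rather than MurmurHash3, a tiny `N_FEATURES` value, whitespace tokenization, and no sign handling or normalization, so it will not reproduce `HashingVectorizer` output.

```python
import zlib

N_FEATURES = 16  # deliberately tiny so the vector is easy to read

def column_index(word, n_features=N_FEATURES):
    """Map a word to a column: word -> hash -> column index."""
    return zlib.crc32(word.encode("utf-8")) % n_features

def hashed_counts(text, n_features=N_FEATURES):
    """Count tokens into a fixed-length vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in text.lower().split():
        vector[column_index(token, n_features)] += 1
    return vector

vec = hashed_counts("free entry free prize")
print(vec)
print(column_index("free"))  # the same word always lands in the same column
```

No dictionary is kept anywhere, which is also why the inverse mapping is lost: given only a column index, there is no way to tell which word (or colliding words) produced the count.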

### Using the code above as an example, let's repeat the vectorization using a `HashingVectorizer`.

Look up how to complete this step and then try it out for yourself.

In [99]:
from sklearn.feature_extraction.text import HashingVectorizer
hvec = HashingVectorizer()
hvec.fit(corpus)

df  = pd.DataFrame(hvec.transform(corpus).todense(), index=['spam', 'ham'])  

df.T.sort_values('spam', ascending=False).head(10).T

| | 479532 | 170062 | 994433 | 144749 | 832412 | 675997 | 1005907 | 174171 | 828689 | 134503 |
|-|-|-|-|-|-|-|-|-|-|-|
| **spam** | 0.336861 | 0.16843 | 0.16843 | 0.16843 | 0.16843 | 0.16843 | 0.16843 | 0.16843 | 0.16843 | 0.084215 |
| **ham** | 0.334497 | 0.083624 | 0.0 | 0.0 | 0.334497 | 0.083624 | 0.083624 | 0.418121 | 0.083624 | 0.0 |


### What new parameters does this vectorizer offer?

Go back to the documentation and compare it to `CountVectorizer`.


> Answer:
- n_features

<a name="downsides-bow"></a>
## Downsides to Bag of Words

---

Bag-of-words approaches like the one outlined above completely ignore the structure of a sentence and merely assess the presence of specific words or word combinations.

But the same word can have multiple meanings in different contexts. Consider, for example, the following two sentences:

- There's wood floating in the **sea**.
- Mike's in a **sea** of trouble with the move.

How do we teach a computer to disambiguate? We'll cover some techniques that may help with this a little later.
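
Both limitations are easy to reproduce. The sketch below uses a deliberately crude whitespace tokenizer (the `bag` helper is an assumption for illustration) to show that word order is discarded and that both senses of "sea" end up in the same feature.

```python
from collections import Counter

def bag(sentence):
    """Count lowercase, whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(sentence.lower().split())

# Different meanings, identical bags: word order is discarded.
a = bag("the dog bit the man")
b = bag("the man bit the dog")
print(a == b)  # True

# Both senses of "sea" are counted in the same feature.
literal = bag("there's wood floating in the sea")
figurative = bag("mike's in a sea of trouble with the move")
print(literal["sea"], figurative["sea"])  # 1 1
```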


<a name="tfidf"></a>
## Term Frequency-Inverse Document Frequency (TF-IDF)

---

A TF-IDF score tells us which words are most discriminating between documents. Words that occur often in one document but don't occur in many documents contain a great deal of discriminating power.

- This weight is a statistical measure used to evaluate how important a word is to a document in a collection (corpus).
- The importance increases in proportion to the number of times a word appears in a document but is offset by the frequency of the word in the corpus.

Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

The inverse document frequency is a measure of how much information the word provides — that is, whether the term is common or rare across all documents. It's the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient.

**Let's see how it's calculated:**

Term frequency (`tf`) is the frequency of a certain term in a document:

$$
\mathrm{tf}(t,d) = \frac{N_\text{term}}{N_\text{terms in Document}}
$$

where

- $N_\text{term}$ is the number of times a term/word $t$ appears in document $d$
- $N_\text{terms in Document}$ is the number of terms/words in document $d$

Inverse document frequency (`idf`) is defined as the logarithm of the total number of documents in the corpus divided by the number of documents that contain the term:

$$
\mathrm{idf}(t, D) = \log\frac{N_\text{Documents}}{N_\text{Documents that contain term}}
$$

where

- $N_\text{Documents}$ is the number of documents in the corpus $D$
- $N_\text{Documents that contain term}$ is the number of documents in $D$ that contain term/word $t$

TF-IDF is then calculated as:

$$
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)
$$
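
The three formulas above translate almost directly into Python. This is a minimal sketch on a made-up four-document corpus; the function names and documents are assumptions for illustration. Scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant, so its numbers will differ from these.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of term divided by the number of tokens in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

def tfidf(term, doc_tokens, corpus_tokens):
    """TF-IDF as the product of the two quantities above."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical toy corpus, already tokenized.
docs = [
    "free prize free entry".split(),
    "free meeting at noon".split(),
    "free lunch at noon".split(),
    "project meeting today".split(),
]

# "free" occurs twice in the first document but appears in 3 of 4 documents.
print(round(tfidf("free", docs[0], docs), 4))   # 0.1438
# "prize" occurs once but appears in only 1 of 4 documents.
print(round(tfidf("prize", docs[0], docs), 4))  # 0.3466
```

Even though "free" is the more frequent word in the first document, "prize" receives the higher score because it is rarer across the corpus.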


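> **You might ask: But what is `log` used for?**<br>
> Good question! The logarithm is a sublinear transformation. It compresses the ratio $N_\text{Documents} / N_\text{Documents that contain term}$ so that extremely rare terms don't receive weights that dwarf everything else, while a term that appears in every document gets a weight of $\log 1 = 0$.

> A function $f$ is sublinear if, for sufficiently large input, it grows slower than any linear function $g$ (paraphrased from Wikipedia).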
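
To see the dampening effect of the logarithm, compare the raw document ratio with its log for a hypothetical corpus of one million documents. The corpus size, the document frequencies, and the helper name below are made up for illustration.

```python
import math

def idf_with_and_without_log(n_documents, document_frequencies):
    """Return (document frequency, raw ratio, log-scaled ratio) for each frequency."""
    rows = []
    for n_containing in document_frequencies:
        ratio = n_documents / n_containing
        rows.append((n_containing, ratio, math.log(ratio)))
    return rows

for n_containing, ratio, log_ratio in idf_with_and_without_log(
        1_000_000, [1, 100, 10_000, 1_000_000]):
    print(f"{n_containing:>9} {ratio:>11.1f} {log_ratio:>6.2f}")
```

The raw ratio spans six orders of magnitude, while the log-scaled version stays between 0 and roughly 13.8.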

<a id='tfidf-vec'></a>
## Practice Using the `TfidfVectorizer`

---

### Why Use TF-IDF?
- Common words are penalized.
- Rare words have more influence.

Scikit-learn provides a TF-IDF vectorizer that works similarly to the other vectorizers we've covered. Notice that we can also eliminate stop words to improve our analysis.

As you did above, import and initialize the `TfidfVectorizer`, then fit the spam and ham data.

In [100]:
from sklearn.feature_extraction.text import TfidfVectorizer

tvec = TfidfVectorizer(stop_words='english')
tvec.fit(corpus)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [101]:
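# On scikit-learn >= 1.0 use get_feature_names_out();
# the older get_feature_names() was removed in 1.2.
df = pd.DataFrame(tvec.transform(corpus).todense(),
                  columns=tvec.get_feature_names_out(),
                  index=['spam', 'ham'])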

df.transpose().sort_values('spam', ascending=False).head(10).transpose()

| | years | euros | contact | personality | linkedin | lukoil | major | million | old | operation |
|-|-|-|-|-|-|-|-|-|-|-|
| **spam** | 0.290133 | 0.290133 | 0.290133 | 0.145067 | 0.145067 | 0.145067 | 0.145067 | 0.145067 | 0.145067 | 0.145067 |
| **ham** | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [102]:
df.transpose().sort_values('ham', ascending=False).head(10).transpose()

| | scientist | regards | data | interview | round | hooli | inform | interviews | invite | like |
|-|-|-|-|-|-|-|-|-|-|-|
| **spam** | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| **ham** | 0.31039 | 0.31039 | 0.31039 | 0.31039 | 0.155195 | 0.155195 | 0.155195 | 0.155195 | 0.155195 | 0.155195 |


### "Real" Example

---

Let's test this stuff out on some SMS text data.  Can you predict real vs. promotional texts just based on what is written?  Let's see...

In [2]:
import pandas as pd

In [5]:
df = pd.read_csv('./sms.csv', index_col=0)
df.head(10)

| | class | text |
|-|-|-|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [78]:
df.shape

(5574, 2)

In [6]:
df['class'].value_counts()/df.shape[0]

ham     0.865985
spam    0.134015
Name: class, dtype: float64

In [7]:
from sklearn.model_selection import train_test_split

X = df['text'].values
y = df['class'].map({'ham':0, 'spam':1})

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [10]:
# Vectorize
from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer(stop_words='english')

X_train_counts = cvec.fit_transform(X_train)
X_test_counts = cvec.transform(X_test)

In [11]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train_counts, y_train)
log_reg.score(X_test_counts, y_test)

0.9820652173913044
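
To put that score in context, it helps to compare it with a baseline that always predicts the majority class. The sketch below reuses the two numbers printed above. One assumption to keep in mind: the class shares were computed on the full dataset, so the baseline on the test split may differ slightly.

```python
# Numbers copied from the outputs above.
ham_share = 0.865985                  # share of 'ham' in the full dataset
model_accuracy = 0.9820652173913044   # log_reg.score on the test set

# A classifier that always predicts the majority class is right this often
# (assuming the test split has roughly the same class balance).
baseline_accuracy = max(ham_share, 1 - ham_share)

# Fraction of the baseline's errors that the model eliminates.
error_reduction = (model_accuracy - baseline_accuracy) / (1 - baseline_accuracy)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"model accuracy:    {model_accuracy:.3f}")
print(f"error reduction:   {error_reduction:.1%}")
```

With imbalanced classes like these, accuracy alone can look impressive even for a trivial model, so the gap to the baseline is the more informative number.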

<a name="conclusion"></a>
## Conclusion

---

In this lesson, we covered an overview of natural language processing and learned about two powerful toolkits:

- Scikit-Learn's Feature Extraction Text
- Natural Language Toolkit

**Check:** What are some real-world applications of these techniques?

- Spam detection.
- Preprocessing for larger NLP problems.
- Job market analysis.
- Determining if someone is dateable or not; identifying "I" in relation to the signifier.
- Crude topic analysis.
- Building a keyword extraction heuristic and piping it into a marketing analysis.

<a id='resources'></a>
## Additional Resources

---

- Check out this [Yelp blog post](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) on how it completed a classification task (with more than 1,000 response variables) using restaurant review text.
- Always check documentation: 
    - [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). 
    - [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html). 
    - [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
- A list of all stop words is available [here](https://github.com/ga-students/DSI-DC-2/blob/master/curriculum/Week-05/5.04-nlp/stop-words.txt).
- Wikipedia's [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) and [hash functions](https://en.wikipedia.org/wiki/Hash_function) entries are a great place to turn for more information on hashing.
- Check out Charlie Greenbacker's [introduction to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf), which he delivered at the [DC-NLP Meetup](http://www.meetup.com/DC-NLP/).
- Wikipedia also has a [walk through](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) of TF-IDF.
- We played with Google's [ngram tool](https://books.google.com/ngrams/graph?content=data+science&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cdata%20science%3B%2Cc0).
- A hilarious data scientist has gone rogue and used NLP and eigenfaces (eigenvalues for face recognition) [for Tinder](http://dataconomy.com/hacking-tinder-with-facial-recognition-nlp/).
- We referenced KPCB's 2016 internet trends. If you're into startups, check out [this insightful deck](http://www.kpcb.com/internet-trends).
- [Choosing a stemmer](https://www.elastic.co/guide/en/elasticsearch/guide/current/choosing-a-stemmer.html).
- [Feature hashing](https://en.wikipedia.org/wiki/Feature_hashing).

<a id='rapstats'></a>
## And just for fun... An Example NLP Project: rapstats.io

---

<a href="http://rapstats.io"><img src="https://snag.gy/8GSVqf.jpg"></a>

<img src="https://snag.gy/8eJNFv.jpg" style="width: 300px; float: left;">
<img src="https://snag.gy/2Hz0o7.jpg" style="width: 300px;">

**See also:**

- [The Largest Vocabulary in Hip Hop](http://poly-graph.co/vocabulary.html).
- [Rap Genius: Rap Stats](http://genius.com/rapstats).
- [Rap lyric generator, Hieu Nguyen and Brian Sa](http://nlp.stanford.edu/courses/cs224n/2009/fp/5.pdf).