# Natural Language Processing

Data comes in many different forms: time stamps, sensor readings, images, categorical labels, and so much more. But text is still some of the most valuable data out there for those who know how to use it.

For example, Google scans billions of searches to track how specific terms change in frequency over time.

![Google Trends for the spaCy, NLTK, and Gensim libraries](https://i.imgur.com/QR7eIjt.png)

Meanwhile, tens of thousands of companies look for trends in their customer support requests. But today, few have the technology to automatically identify common terms and alert the appropriate product team. 

In this course, you'll focus on building models with language data. As with other domains, much of the work in natural language processing (NLP) is about representing text so it can be plugged into machine learning models. That is, we need to convert documents, words, or even individual characters into numbers and vectors. These vectors can then be used as input to models.
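
As a rough illustration of "converting documents into numbers", here is a minimal bag-of-words sketch using only the Python standard library. The sample sentences, the whitespace tokenization, and the helper name `to_count_vector` are assumptions made for this example; they are not the representation the course necessarily uses later.

```python
from collections import Counter

# Two tiny example documents (made up for illustration).
docs = [
    "tea is healthy",
    "tea is calming and tea is delicious",
]

# Build a fixed vocabulary: one vector position per distinct word, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc):
    """Map a document to a list of word counts aligned with `vocab`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_count_vector(doc) for doc in docs]
print(vocab)    # ['and', 'calming', 'delicious', 'healthy', 'is', 'tea']
print(vectors)  # [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 2, 2]]
```

Each document becomes a vector of the same length, which is the shape most machine learning models expect as input.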

## Outline

In this course, you will use spaCy, a leading NLP library, to write code for:

* Text processing and pattern matching
* Text classification models
* Word vectors & embeddings to better represent text numerically

#### A note before you get started

To get the most out of this course, you'll need some experience with machine learning. If you don't have basic experience with scikit-learn, the [Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) and [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) courses will help you get started.

## NLP with spaCy

spaCy is a leading library for NLP. It is relatively new and has quickly become one of the most popular Python frameworks. Most people find it intuitive, and it has excellent [documentation](https://spacy.io/usage).

To use spaCy, you need to load a **model**. Models are language specific and come in different sizes, typically small, medium, and large. Larger models have more capabilities but also consume more memory, run slower, and take longer to load.

You load a model by passing its name to `spacy.load`:

In [None]:
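import spacy

# Load the small English model.
# If it is not installed yet, run: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')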

With the model loaded, you can process text like this:

In [None]:
doc = nlp("Tea is healthy, calming, and delicious, don't you think?")

## Tokenizing

Calling `nlp` on a string returns a document object that contains **tokens**. A token is a unit of text in the document, such as an individual word or a punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't". You can see the tokens by iterating through the document.

In [None]:
for token in doc:
    print(token)

Iterating through a document gives you token objects. Each of these tokens comes with additional information. In most cases, the important ones are `token.lemma_` and `token.is_stop`.

## Text preprocessing

There are a few types of preprocessing that can improve how we model with words. The first is "lemmatizing." The "lemma" of a word is its base form. For example, "walk" is the lemma of the word "walking". So, when you lemmatize the word "walking", you convert it to "walk".

It's also common to remove stopwords. Stopwords are words that occur frequently in the language and don't carry much information. English stopwords include "the", "is", "and", "but", and "not". With a spaCy token, `token.lemma_` returns the lemma, while `token.is_stop` returns `True` if the token is a stopword and `False` otherwise.

In [None]:
print("{:<15}{:<15}{}".format('Token', 'Lemma', 'Stopword'))
print("-"*40)
for token in doc:
    print("{:<15}{:<15}{}".format(str(token), token.lemma_, token.is_stop))

Why are lemmatizing and identifying stopwords important? Language data has a lot of noise mixed in with informative content. In the sentence above, the important words are "tea", "healthy", "calming", and "delicious". Removing stopwords might help the predictive model fit to only the relevant words. Lemmatizing similarly helps by combining multiple forms of the same word into one base form ("calming", "calms", and "calmed" would all change to "calm").

However, lemmatizing and dropping stopwords might result in your models performing worse. You'll need to treat this preprocessing as part of your hyperparameter optimization process.
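
One way to treat preprocessing as a hyperparameter is to put each step behind a flag and try every combination. The sketch below is only an illustration: `preprocess`, `Tok`, and the `sample` tokens are hypothetical names, and `Tok` is a stand-in that mimics just the `text`, `lemma_`, and `is_stop` attributes of spaCy tokens so the example runs without a model. With a loaded model you would pass `nlp(text)` instead of `sample`.

```python
from collections import namedtuple

# Hypothetical stand-in for spaCy tokens (only the attributes used below).
Tok = namedtuple("Tok", ["text", "lemma_", "is_stop"])

def preprocess(tokens, lemmatize=True, drop_stopwords=True):
    """Return a list of strings, applying each preprocessing step only if enabled."""
    kept = []
    for tok in tokens:
        if drop_stopwords and tok.is_stop:
            continue  # skip stopwords when that step is switched on
        kept.append(tok.lemma_ if lemmatize else tok.text)
    return kept

# Hand-written example values, not real model output.
sample = [
    Tok("tea", "tea", False),
    Tok("is", "be", True),
    Tok("calming", "calm", False),
]

# Try every combination, as you would inside a hyperparameter search.
for lemmatize in (True, False):
    for drop_stopwords in (True, False):
        result = preprocess(sample, lemmatize, drop_stopwords)
        print(f"lemmatize={lemmatize}, drop_stopwords={drop_stopwords}: {result}")
```

In a real experiment, you would train and validate a model on each variant and keep whichever settings score best.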

## Pattern Matching

Another common NLP task is matching tokens or phrases within chunks of text or whole documents. You can do pattern matching with regular expressions, but spaCy's matching capabilities tend to be easier to use.
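
For comparison, a regular-expression version of term matching might look like the sketch below. The term list and the sample sentence are assumptions chosen for illustration. Note that `re` reports character offsets, whereas spaCy's matchers report token positions.

```python
import re

# Example terms to search for (chosen for illustration).
terms = ["Galaxy Note", "iPhone 11", "iPhone XS", "Google Pixel"]

# Escape each term so special characters are matched literally,
# then join the terms as alternatives in a single case-insensitive pattern.
pattern = re.compile("|".join(re.escape(term) for term in terms), flags=re.IGNORECASE)

text = "I compared the iphone 11 with the Google Pixel and an old Galaxy Note."

# Collect (start, end, matched text) for every occurrence, using character offsets.
found = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]
for start, end, matched in found:
    print(start, end, matched)
```

This works for simple cases, but the pattern gets harder to maintain once you need to match on token-level properties such as lemmas, which is where spaCy's matchers become more convenient.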

To match individual tokens, you create a `Matcher`. When you want to match a list of terms, it's easier and more efficient to use `PhraseMatcher`. For example, if you want to find where different smartphone models show up in some text, you can create patterns for the model names of interest. First, you create the `PhraseMatcher` itself.

In [None]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

The matcher is created using the vocabulary of your model. Here we're using the small English model you loaded earlier. Setting `attr='LOWER'` matches the phrases on lowercased text, which makes the matching case-insensitive.

Next you create a list of terms to match in the text. The phrase matcher needs the patterns as document objects. The easiest way to get these is with a list comprehension using the `nlp` model.

In [None]:
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
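# Pass the patterns as a list (spaCy 2.2.2 and later).
# Older releases used: matcher.add("TerminologyList", None, *patterns)
matcher.add("TerminologyList", patterns)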

Then you create a document from the text to search and use the phrase matcher to find where the terms occur in the text.

In [None]:
# Borrowed from https://daringfireball.net/linked/2019/09/21/patel-11-pro
text_doc = nlp("Glowing review overall, and some really interesting side-by-side photography "
               "tests pitting the iPhone 11 Pro against the Galaxy Note 10 Plus and last year’s " 
               "iPhone XS and Google Pixel 3.") 
matches = matcher(text_doc)
print(matches)

Each match is a tuple of the match ID and the token positions where the phrase starts and ends. The end position is exclusive, so `text_doc[start:end]` gives the matched span.

In [None]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])
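
The token-based `Matcher` mentioned earlier works in a similar way, and you can loop over every match instead of looking only at the first one. The following is a minimal sketch under a few assumptions: it uses `spacy.blank("en")` (a tokenizer-only pipeline) so that no model download is needed, the pattern name `IPHONE_MODEL` and the sample sentence are made up, and the list-style `matcher.add` call requires spaCy 2.2.2 or later.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only English pipeline; enough for matching on lowercased token text.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: "iphone" in any casing, followed by "11" or "xs" in any casing.
pattern = [{"LOWER": "iphone"}, {"LOWER": {"IN": ["11", "xs"]}}]
matcher.add("IPHONE_MODEL", [pattern])

doc = nlp("She traded her iPhone XS for an iphone 11 Pro.")

# Each match is (match_id, start, end), with token positions and an exclusive end.
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], start, end, doc[start:end].text)
```

Token patterns like this are more flexible than a fixed list of phrases, at the cost of being slower to write when you simply have many literal terms.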

## Your Turn
Now that you've seen a few uses of spaCy for NLP, it's your turn to use it to analyze **[Yelp reviews](#$NEXT_NOTEBOOK_URL$)**.