**[Natural Language Processing Home Page](https://www.kaggle.com/learn/natural-language-processing)**

---


# Natural Language Processing

Data comes in many different forms, such as time stamps, sensor readings, images, categorical labels, and much more. A large amount of data exists as language, in text and speech. The field of using computers to understand language data is known as Natural Language Processing (NLP).

This understanding can come in the form of information extraction for observing trends. For example, Google can scan billions of searches to track how specific terms change in frequency over time.

![Google trends for SpaCy, NLTK, and Gensim libraries](https://i.imgur.com/QR7eIjt.png)

Or consider that some term is showing up in a lot of customer support tickets. You can have a program observe these tickets for frequent terms and alert the appropriate product team. 

Within machine learning, and in this course, we're more interested in using language data to build predictive models. As with other domains, much of the work in NLP is finding ways to represent text or speech such that it can be used with machine learning models. That is, we need to convert documents or words or even individual characters into numbers and vectors. These vectors can then be used as input to models.
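As a rough illustration of turning text into numbers, the sketch below builds bag-of-words count vectors with only the standard library. The example sentences and the whitespace tokenizer are assumptions made for illustration, and this is not necessarily the representation used later in the course.

```python
from collections import Counter

# Hypothetical example documents, chosen only for illustration.
documents = [
    "tea is healthy",
    "tea is calming and tea is delicious",
]

# Naive whitespace tokenization (an assumption; real tokenizers handle
# punctuation, contractions, and casing more carefully).
tokenized = [document.lower().split() for document in documents]

# The vocabulary fixes one vector position per distinct word.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def to_count_vector(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vectors = [to_count_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a list of numbers with the same length as the vocabulary, which is the kind of fixed-size input a model can consume.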

As with other domains, you can break down NLP into supervised and unsupervised tasks. Within supervised learning, you have applications like spam detection, machine translation, and voice recognition. A common use case of unsupervised learning is topic modeling, or clustering documents into topics. In this course you'll focus on supervised text classification.

#### A note before you get started

This mini-course was built assuming you already have some experience with machine learning. If you don't have experience with supervised learning and the scikit-learn library, please take the Intro to Machine Learning and Intermediate Machine Learning mini-courses before continuing on with these tutorials.

Looking into this dataset https://www.kaggle.com/crowdflower/twitter-airline-sentiment

## NLP with spaCy

In this course you'll be using the spaCy library to extract information from text and to convert text into vectors for classification models. spaCy is relatively new and has quickly become one of the most popular Python frameworks for NLP. Personally, I find it intuitive to use and backed by excellent documentation.

To use spaCy, you need to load a **model**. Models are language specific and come in different sizes, typically small, medium, and large. Larger models have more capabilities but also consume more memory, run slower, and take longer to load.

You load a model by passing its name to `spacy.load`:

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

The above code loads the small English model. When loading models, you might run into an error like this:
```
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
```
This error means the model isn't installed on your machine. You'll need to download it with spaCy, replacing `model_name` with the name of the model (for example `en_core_web_sm`), by running

```
python -m spacy download model_name
```

in your terminal.
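If you'd rather handle the missing-model case in code, one possible approach is sketched below. The helper name `load_pipeline` and the choice to fall back to a blank English pipeline are assumptions for illustration, not part of spaCy or of this course.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Load a spaCy pipeline, falling back to a blank English one."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        print(f"Model '{name}' not found. Install it with:")
        print(f"    python -m spacy download {name}")
        # Assumption: a tokenizer-only pipeline is an acceptable stand-in.
        return spacy.blank("en")


nlp = load_pipeline()
print(nlp.pipe_names)
```

A blank pipeline only tokenizes, so features that depend on a trained model, such as lemmas, won't be available until the real model is installed.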

With the model loaded, we can use it to process some text.

In [None]:
doc = nlp("Tea is healthy, calming, and delicious, don't you think?")

## Tokenizing

Calling `nlp` on a string returns a document object that contains **tokens**. A token is one unit of text in the document, such as an individual word or punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't". To get the tokens, you iterate through the document.

In [None]:
for token in doc:
    print(token)
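To check the contraction splitting yourself without downloading a model, a minimal sketch is to use a blank English pipeline, which contains only the rule-based tokenizer. The names `nlp_blank` and `doc_blank` are chosen here for illustration.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer,
# so no model download is needed for this check.
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tea is healthy, calming, and delicious, don't you think?")

# token.text is the token's original string.
tokens = [token.text for token in doc_blank]
print(tokens)

# "don't" should appear as two tokens, and commas as tokens of their own.
print("do" in tokens, "n't" in tokens, "," in tokens)
```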

Iterating through a document gives you token objects. Each of these tokens comes with additional information. In most cases, the important ones are `token.lemma_` and `token.is_stop`.

## Text preprocessing

The "lemma" of a word is its base form. For example, "to be" is the root verb of "is", so the lemma of "is" is "be". Lemmatizing also strips inflectional endings, such as changing "calming" to "calm". Converting the words in a text to their lemmas is called "lemmatizing", and it is one form of text "normalization".

Stopwords are words that occur frequently in the language and don't contain much information. In English, stopwords include "the", "is", "and", "but", "not". With a spaCy token, `token.lemma_` returns the lemma, while `token.is_stop` returns a boolean `True` if the token is a stopword and `False` otherwise.

In [None]:
print("{:<15}{:<15}{}".format('Token', 'Lemma', 'Stopword'))
print("-"*40)
for token in doc:
    print("{:<15}{:<15}{}".format(str(token), token.lemma_, token.is_stop))

Why are lemmas and stopwords important? Language data tends to have a lot of noise mixed in with informative content. In the sentence above, the important words are tea, healthy, calming, and delicious. Removing the stopwords might improve the quality of the data for use in predictive models. Using lemmas helps reduce noise as well by collapsing multiple forms of the same word into one base form ("calming", "calms", and "calmed" would all change to "calm").

However, lemmatizing and dropping stopwords might result in your models performing worse. You'll need to treat this preprocessing as part of your hyperparameter optimization process.
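One way to treat these choices as tunable settings is to put them behind flags, as in the sketch below. The function name `preprocess`, its parameters, and the decision to drop punctuation are assumptions for illustration. The example runs on a blank pipeline, which has no lemmatizer, so it only toggles stopword removal; with a trained model such as `en_core_web_sm` you could also pass `lemmatize=True`.

```python
import spacy


def preprocess(doc, lemmatize=True, drop_stopwords=True):
    """Return a list of cleaned token strings from a spaCy Doc."""
    cleaned = []
    for token in doc:
        # Assumption: punctuation and whitespace are never useful here.
        if token.is_punct or token.is_space:
            continue
        if drop_stopwords and token.is_stop:
            continue
        # token.lemma_ may be empty when the pipeline has no lemmatizer,
        # so fall back to the original text in that case.
        text = (token.lemma_ if lemmatize else "") or token.text
        cleaned.append(text)
    return cleaned


nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tea is healthy, calming, and delicious, don't you think?")

with_stop = preprocess(doc_blank, lemmatize=False, drop_stopwords=False)
without_stop = preprocess(doc_blank, lemmatize=False, drop_stopwords=True)
print(with_stop)
print(without_stop)
```

You could then train a model once per flag combination and keep whichever setting validates best.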

## Pattern Matching

Another common use of spaCy is matching tokens or phrases within chunks of text or whole documents. Pattern matching is often done with regular expressions, but spaCy's matching capabilities tend to be easier to use.

To match individual tokens, you create a `Matcher`. When you want to match a list of terms, it's easier and more efficient to use `PhraseMatcher`. For example, if you want to find where different smartphone models show up in some text you can create patterns for the model names of interest. First you create the `PhraseMatcher` itself.

In [None]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

The matcher is created using the vocabulary of your model. Here we're using the small English model you loaded earlier. Setting `attr='LOWER'` matches the phrases on lowercased text, which provides case-insensitive matching.
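To see what `attr='LOWER'` changes, the sketch below compares it with a matcher that uses the default exact-text matching. It uses a blank English pipeline so no model download is needed, and it assumes the `matcher.add(key, patterns)` signature used by spaCy v3. The sample sentence and variable names are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank("en")

# Default matching compares the exact token text.
exact_matcher = PhraseMatcher(nlp_blank.vocab)
# attr="LOWER" compares the lowercased token text instead.
lower_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")

pattern = nlp_blank.make_doc("iPhone XS")
exact_matcher.add("PHONE", [pattern])
lower_matcher.add("PHONE", [pattern])

# Hypothetical text with unusual casing.
sample = nlp_blank.make_doc("I compared the IPHONE xs with my old phone.")
exact_hits = exact_matcher(sample)
lower_hits = lower_matcher(sample)

print(len(exact_hits), len(lower_hits))
for match_id, start, end in lower_hits:
    print(sample[start:end].text)
```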

Next you create a list of terms to match in the text. The phrase matcher needs the patterns as document objects. The easiest way to get these is with a list comprehension using the `nlp` model.

In [None]:
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
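# spaCy v3 takes the patterns as a list; older v2 releases used
# matcher.add("TerminologyList", None, *patterns) instead.
matcher.add("TerminologyList", patterns)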

Then you create a document from the text to search and use the phrase matcher to find where the terms occur in the text.

In [None]:
# Borrowed from https://daringfireball.net/linked/2019/09/21/patel-11-pro
text_doc = nlp("Glowing review overall, and some really interesting side-by-side photography "
               "tests pitting the iPhone 11 Pro against the Galaxy Note 10 Plus and last year’s " 
               "iPhone XS and Google Pixel 3.") 
matches = matcher(text_doc)
print(matches)

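`matches` is a list of tuples, one per match. Each tuple holds the match ID and the token positions where the phrase starts and ends (the end position is exclusive, so `text_doc[start:end]` is the matched span).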

In [None]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])
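To list every match rather than only the first, you can loop over the results. The sketch below repeats the whole example end to end on a blank English pipeline so that it runs without a downloaded model; it assumes the spaCy v3 `matcher.add(key, patterns)` signature, uses a plain apostrophe in the review text, and sorts the matches by start position for readable output.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank("en")
matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")

terms = ["Galaxy Note", "iPhone 11", "iPhone XS", "Google Pixel"]
# make_doc only tokenizes, which is all the phrase matcher needs.
matcher.add("TerminologyList", [nlp_blank.make_doc(term) for term in terms])

review = nlp_blank.make_doc(
    "Glowing review overall, and some really interesting side-by-side "
    "photography tests pitting the iPhone 11 Pro against the Galaxy Note 10 "
    "Plus and last year's iPhone XS and Google Pixel 3."
)

found = []
# Each match is (match_id, start, end); start and end are token indices.
for match_id, start, end in sorted(matcher(review), key=lambda m: m[1]):
    label = nlp_blank.vocab.strings[match_id]
    span_text = review[start:end].text
    found.append((label, start, end, span_text))
    print(label, start, end, span_text)
```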

Now that you've seen a few uses of spaCy for NLP, it's your turn to try it. In the next exercise, you'll use the `PhraseMatcher` to perform some analysis on Yelp reviews.

---





*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*