# Lecture 8: Text Normalization 

## General Outline

* What is text normalization?
    * Tokenization
    * Stemming
    * Lemmatization
    * Stopwords

We have used these ideas in the past, but now we will go into more detail. More specifically, we will examine the differences between some of the leading libraries for text normalization.

Since we have to represent our text as numbers, we want a clear picture of what happens to our corpus as we process it.
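To see why the choice of tokenizer matters before we turn to the libraries, here is a minimal sketch using only the standard library. The sample sentence and both splitting rules are illustrative assumptions, not the behavior of any library discussed below.

```python
import re

# Hypothetical sample sentence (not taken from the lecture text)
sample = "Particle filters can bridge this divide (in principle)."

# Rule 1: split on whitespace only, so punctuation stays attached to words
whitespace_tokens = sample.split()

# Rule 2: runs of word characters and single punctuation marks become separate tokens
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```

The same sentence yields a different number of tokens under each rule, which in turn changes any counts or vectors built from them.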

In [None]:
# We will use the same text to illustrate the differences among the libraries
import pandas as pd

text = """
Human infants have the remarkable ability to learn any human language. One proposed mechanism for this ability 
is distributional learning, where learners infer the underlying cluster structure from unlabeled input. Computational
models of distributional learning have historically been principled but psychologically-implausible
computational-level models, or ad hoc but psychologically plausible algorithmic-level models. Approximate rational
models like particle filters can potentially bridge this divide, and allow principled, but psychologically plausible
models of distributional learning to be specified and evaluated. As a proof of concept, I evaluate one such particle
filter model, applied to learning English voicing categories from distributions of voice-onset times (VOTs). 
I find that this model learns well, but behaves somewhat differently from the standard, unconstrained Gibbs
sampler implementation of the underlying rational model.
"""

## Tokenization Libraries

* [NLTK](https://www.nltk.org/)
* [spaCy](https://spacy.io/)
* [TextBlob](https://textblob.readthedocs.io/en/dev/)
* [Gensim](https://radimrehurek.com/gensim/)
* [Stanford CoreNLP (Stanza)](https://stanfordnlp.github.io/stanza/)

In [None]:
# Let's keep tabs on our packages
packages = ['nltk', 'spacy', 'textblob', 'gensim', 'stanza']

### NLTK Example

https://www.nltk.org/api/nltk.tokenize.html

In [None]:
## Tokenize the text using the NLTK library

# Import the NLTK library
from nltk.tokenize import word_tokenize

# Tokenize the text
nltk_tokens = word_tokenize(text)

## Total tokens in document
len(nltk_tokens)

In [None]:
## Explore the docs
word_tokenize??

### spaCy Example

https://spacy.io/api/tokenizer

In [None]:
## Tokenize the text using the spaCy library

# Import the spaCy library
import spacy

# Create a spaCy object
NLP = spacy.load('en_core_web_sm')

# Tokenize the text
spacy_tokens = [token.text for token in NLP(text)]
len(spacy_tokens)

In [None]:
# Inspect the tokens
spacy_tokens

In [None]:
NLP.tokenizer??

### TextBlob Example

https://textblob.readthedocs.io/en/dev/_modules/textblob/tokenizers.html

In [None]:
## Tokenize the text using the TextBlob library
from textblob import TextBlob

# Create a TextBlob object
blob = TextBlob(text)

# Count the word tokens (blob.words strips punctuation)
len(blob.words)


In [None]:
blob??

### Gensim Example

https://tedboy.github.io/nlps/generated/generated/gensim.utils.tokenize.html

In [None]:
## Import the gensim library
import gensim

## Tokenize the text using the gensim library
gensim_tokens = list(gensim.utils.tokenize(text))

In [None]:
len(gensim_tokens)

In [None]:
gensim_tokens

### Stanford CoreNLP Example | Now Stanza

https://stanfordnlp.github.io/stanza/installation_usage.html#getting-started

In [None]:
## Tokenize the text using the Stanza library (the successor to Stanford CoreNLP)

# Import the Stanza library
import stanza

# Pipeline for English
stan_NLP = stanza.Pipeline(lang='en', processors='tokenize')

## Tokenize the text using the stanza library
stan_tokens = [token.text for sent in stan_NLP(text).sentences for token in sent.tokens]

In [None]:
len(stan_tokens)

In [None]:
stan_NLP??

In [None]:
from IPython.display import display, Markdown
# Display the token counts
display(Markdown("| NLTK | spaCy | TextBlob | Gensim | Stanza |\n| --- | --- | --- | --- | --- |\n| {} | {} | {} | {} | {} |".format(len(nltk_tokens), len(spacy_tokens), len(blob.words), len(gensim_tokens), len(stan_tokens))))

In [None]:
# Display the texts
display(Markdown("| NLTK | spaCy | TextBlob | Gensim | Stanza |\n| --- | --- | --- | --- | --- |\n| {} | {} | {} | {} | {} |".format(nltk_tokens, spacy_tokens, blob.words, gensim_tokens, stan_tokens)))
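The side-by-side tables make it hard to see exactly where two tokenizers disagree. One possible way to pinpoint the differences is to compare token multisets with `collections.Counter`. The helper name `token_diff` and the two short token lists are hypothetical stand-ins; in the notebook you could pass any two of the token lists built above.

```python
from collections import Counter

def token_diff(tokens_a, tokens_b):
    """Return (only_in_a, only_in_b) as Counters of unmatched tokens."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Counter subtraction keeps only positive counts
    return counts_a - counts_b, counts_b - counts_a

# Hypothetical outputs from two tokenizers for the same phrase
tokens_a = ["voice-onset", "times", "(", "VOTs", ")", "."]
tokens_b = ["voice", "onset", "times", "VOTs"]

only_a, only_b = token_diff(tokens_a, tokens_b)
print("only in A:", dict(only_a))
print("only in B:", dict(only_b))
```

This ignores token order, so it shows which tokens differ but not where in the text they occur.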

## Stemming Libraries

Stemming is the computational process of reducing inflected (or sometimes derived) words to their word stem, base, or root form. The stem is generally a written word form, but it need not be a valid word on its own.
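To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm or any library's stemmer; the function name `toy_stem`, the suffix list, and the minimum-stem-length rule are all assumptions chosen for illustration.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[: -len(suffix)]
    return lowered

for word in ["learning", "learned", "learns", "models", "this"]:
    print(word, "->", toy_stem(word))
```

Note that `this` is cut down to `thi`: rule-based stemmers work on surface patterns, so their output is not always a dictionary word.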

### NLTK Example

https://www.nltk.org/howto/stem.html

In [None]:
## NLTK Stemming

## import the PorterStemmer
from nltk.stem import PorterStemmer

## Create an instance of the PorterStemmer
stemmer = PorterStemmer()

## Stem the tokens
nltk_stemmed_tokens = [stemmer.stem(token) for token in nltk_tokens]



In [None]:
## Examine the stemmed tokens
print(" ".join(nltk_stemmed_tokens))

### spaCy Example

spaCy does not have a built-in stemmer, but it does have a lemmatizer. See below.

### TextBlob Example

In [None]:
## TextBlob Stemming

## Stem the tokens
blob = TextBlob(text)

blob_stemmed_tokens = [word.stem() for word in blob.words]

In [None]:
print(" ".join(blob_stemmed_tokens))

### Gensim Example

In [None]:
## Gensim Stemming
import gensim
from gensim.parsing import stem_text

# Stem the raw text; stem_text lowercases its input and returns a single string, not a list of tokens
gensim_stemmed_tokens = stem_text(text)

In [None]:
gensim_stemmed_tokens

### Stanford CoreNLP Example

Stanza does not have a built-in stemmer, but it does have a lemmatizer. See below.

In [None]:
# Display the stemmed texts
display(Markdown("| NLTK | TextBlob | Gensim |\n| --- | --- | --- |\n| {} | {} | {} |".format(nltk_stemmed_tokens, blob_stemmed_tokens, gensim_stemmed_tokens)))

## Lemmatization Libraries

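Lemmatization is the linguistic process of grouping the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

For example, the German definite article inflects for case and gender. A lemmatizer would group all of the forms in the table below under a single lemma.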

| Case | Masc. | Fem. | Neut. |
|------|-------|------|-------|
| nominative | der | die | das |
| accusative | den | die | das |
| dative | dem | der | dem |
| genitive | des | der | des |
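The table can be read as a lookup from inflected form to lemma. A minimal sketch of dictionary-based lemmatization follows, assuming (for illustration) that every form in the table is filed under the lemma `der`. The names `ARTICLE_LEMMAS` and `lookup_lemma` and the sample phrase are hypothetical, and real lemmatizers also draw on part of speech and context.

```python
# Inflected forms from the table, all mapped to one assumed lemma
ARTICLE_FORMS = ["der", "die", "das", "den", "dem", "des"]
ARTICLE_LEMMAS = {form: "der" for form in ARTICLE_FORMS}

def lookup_lemma(token):
    """Return the lemma for a known form, or the token unchanged."""
    return ARTICLE_LEMMAS.get(token.lower(), token)

# Hypothetical sample phrase
phrase = ["Dem", "Kind", "des", "Nachbarn"]
print([lookup_lemma(tok) for tok in phrase])
```

Unlike a stemmer, a lookup of this kind always returns a real dictionary form, but it can only handle the forms it has been given.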

### NLTK Example

https://www.nltk.org/_modules/nltk/stem/wordnet.html

In [None]:
import nltk
nltk.download('wordnet')

## NLTK Lemmatization
from nltk.stem import WordNetLemmatizer


# Create an instance of the WordNetLemmatizer
nltk_lemmatizer = WordNetLemmatizer()

# Lemmatize the tokens
nltk_lemmas = [nltk_lemmatizer.lemmatize(token) for token in nltk_tokens]

In [None]:
print(" ".join(nltk_lemmas))

### spaCy Example

https://spacy.io/api/lemmatizer

In [None]:
## SpaCy Lemmatization
import spacy

## Create an instance of the spaCy library
spacy_NLP = spacy.load('en_core_web_sm')

## Lemmatize the tokens
spacy_lemmas = [token.lemma_ for token in spacy_NLP(text)]



In [None]:
print(" ".join(spacy_lemmas))

### TextBlob Example

In [None]:
from textblob import TextBlob

## Create an instance of the TextBlob library
blob = TextBlob(text)

## Lemmatize the blob
blob_lemmas = [word.lemmatize() for word in blob.words]


In [None]:
print(" ".join(blob_lemmas))

### Gensim Example

Gensim no longer ships a lemmatizer. It used to wrap the one from `Pattern`, but that wrapper has since been removed.

### Stanford CoreNLP Example

https://stanfordnlp.github.io/stanza/lemma.html

In [None]:
import stanza

## Create an instance of the stanza library
stanza_NLP = stanza.Pipeline(lang='en', processors='tokenize,lemma')

## Create a document object
doc = stanza_NLP(text)

# Stanza lemmas
stanza_lemmas = [word.lemma for sent in doc.sentences for word in sent.words]

## Print each word alongside its lemma
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

In [None]:
# Display the lemmatized texts
display(Markdown("| NLTK | spaCy | TextBlob | Stanza |\n| --- | --- | --- | --- |\n| {} | {} | {} | {} |".format(nltk_lemmas, spacy_lemmas, blob_lemmas, stanza_lemmas)))

## Stopwords Libraries

`Stopwords` are words so common that they carry little information for most analyses. For example, the word `the` is a stopword. To normalize our text, we remove the stopwords from our corpus.
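As a minimal sketch of that removal step, the snippet below filters a token list against a small stopword set. The set `TOY_STOPWORDS` is a hypothetical stand-in; the lists that ship with the libraries are much longer and differ from one another.

```python
# Hypothetical, deliberately tiny stopword set
TOY_STOPWORDS = {"the", "to", "any", "have"}

tokens = ["Human", "infants", "have", "the", "remarkable", "ability",
          "to", "learn", "any", "human", "language"]

# Compare in lowercase so that "The" and "the" are treated alike
content_tokens = [tok for tok in tokens if tok.lower() not in TOY_STOPWORDS]
print(content_tokens)
```

Whether a word such as `any` or `have` counts as a stopword depends on the list, so the same corpus can look different after filtering with each library.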

### NLTK Example

https://www.nltk.org/search.html?q=stopwords&check_keywords=yes&area=default

In [None]:
## NLTK Stopwords

## Import the stopwords
from nltk.corpus import stopwords

print(stopwords.words('english'))

### spaCy Example

https://spacy.io/api/language#section-defaults

In [None]:
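# Import the stopword set directly so the cell does not depend on a model having been loaded earlier
from spacy.lang.en.stop_words import STOP_WORDS

## Create a set of spaCy stopwords in English
spacy_stopwords = STOP_WORDS

spacy_stopwords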

### TextBlob Example

TextBlob relies on NLTK for stopwords.

### Gensim Example

https://radimrehurek.com/gensim/parsing/preprocessing.html

In [None]:
import gensim
from gensim.parsing.preprocessing import STOPWORDS

STOPWORDS

### Stanford CoreNLP Example