# Lecture 8: Text Normalization

## Table of Contents

1. [Text Normalization](#Text-Normalization)
2. [Tokenization](#Tokenization)
3. [Stemming](#Stemming)
4. [Lemmatization](#Lemmatization)
5. [Stop Words](#Stop-Words)
6. [Vectorization](#Vectorization)

## Text Normalization

Text normalization is the process of transforming text into a single canonical form that it might not have had before. This can involve changing the case of the text, removing punctuation, expanding contractions, and so on. The goal is to transform text into a more standard form that might be easier for other NLP tasks. More importantly, our goal is to think about strategies to represent text in numerical form, which is the only way that machine learning algorithms can process text.

### Standard-library Python tools for text normalization

* `string` (punctuation)
* `re` (regular expressions)
* `str.lower()`, `str.upper()`, `str.capitalize()`, `str.title()` (case conversion)
* `str.split()`, `str.join()` (splitting and joining)
* `str.replace()` (substring replacement)
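
As a minimal sketch of how these standard-library pieces combine, the cell below lowercases a string, expands a few contractions, strips punctuation, and collapses whitespace. The `CONTRACTIONS` table and the `normalize` helper are hypothetical names introduced only for illustration; a real contraction table would be far larger.

```python
import re
import string

# Hypothetical contraction table for illustration; real tables are much larger
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def normalize(raw: str) -> str:
    """Lowercase, expand a few contractions, strip punctuation, collapse whitespace."""
    lowered = raw.lower()
    # Expand contractions first, because the apostrophe is removed in the next step
    for short, full in CONTRACTIONS.items():
        lowered = lowered.replace(short, full)
    # str.translate with a deletion table removes every ASCII punctuation character
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", no_punct).strip()

print(normalize("It's  remarkable:\nInfants CAN'T help but learn!"))
# it is remarkable infants cannot help but learn
```

The order of the steps matters: stripping punctuation before expanding contractions would turn `can't` into `cant`.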

### Pip-installable libraries for text normalization

* `nltk` (Natural Language Toolkit)
* `spacy` (Industrial Strength NLP)
* `textblob` (Simplified Text Processing)
* `gensim` (Topic Modeling for Humans)
* `stanza` (the Stanford NLP Group's Python library, which also includes a client for Stanford CoreNLP)

## Sample text for understanding text normalization strategies

In [1]:
text = """
Human infants have the remarkable ability to learn any human language. One proposed mechanism for this ability 
is distributional learning, where learners infer the underlying cluster structure from unlabeled input. Computational
models of distributional learning have historically been principled but psychologically-implausible
computational-level models, or ad hoc but psychologically plausible algorithmic-level models. Approximate rational
models like particle filters can potentially bridge this divide, and allow principled, but psychologically plausible
models of distributional learning to be specified and evaluated. As a proof of concept, I evaluate one such particle
filter model, applied to learning English voicing categories from distributions of voice-onset times (VOTs). 
I find that this model learns well, but behaves somewhat differently from the standard, unconstrained Gibbs
sampler implementation of the underlying rational model.
"""

## Tokenization: breaking text into smaller units
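
Before turning to the libraries, it can help to see what a tokenizer has to decide. The sketch below compares plain whitespace splitting with a small regular expression; the pattern is an illustrative assumption and is not the rule set used by any of the libraries in this section.

```python
import re

sample = "I evaluate voice-onset times (VOTs). It's principled, but plausible."

# Naive approach: split on whitespace; punctuation stays glued to the words
whitespace_tokens = sample.split()

# Regex approach: keep runs of word characters (allowing internal hyphens or
# apostrophes) as one token, and emit every other non-space character on its own
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sample)

print(whitespace_tokens, len(whitespace_tokens), sep='\n')
print(regex_tokens, len(regex_tokens), sep='\n')
```

Whitespace splitting yields 9 tokens, including `(VOTs).` as a single unit, while the regex yields 14 because brackets, commas, and periods become separate tokens. The libraries differ on exactly these choices (hyphens, contractions, punctuation), which is one reason their token counts tend not to agree.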

### NLTK

NLTK (Natural Language Toolkit) is a long-standing platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic analysis. NLTK is a great tool for learning NLP, but its age shows, and it is not the best tool for building production systems.

In [None]:
import nltk
from nltk.tokenize import word_tokenize

nltk_tokens = word_tokenize(text)

print(nltk_tokens, len(nltk_tokens), sep='\n')

#### NLTK tokenization docs

In [None]:
word_tokenize??

### SpaCy

SpaCy is a modern Python library for industrial-strength natural language processing. It is designed to be fast and accurate, and it includes built-in capabilities for visualizing and analyzing NLP data. SpaCy is a great choice for building real-world NLP applications.

In [None]:
# tokenize our sample text using spacy

import spacy

NLP = spacy.load('en_core_web_sm')

spacy_tokens = [token.text for token in NLP(text)]

print(spacy_tokens, len(spacy_tokens), sep='\n')

In [None]:
spacy.tokens.Token??

### TextBlob

TextBlob is a simplified text processing library that sits on top of NLTK. It provides an easy-to-use interface to NLTK along with some text processing capabilities of its own. TextBlob is a great choice for getting started with NLP.

In [None]:
# tokenize the text using the textblob library
from textblob import TextBlob

blob_tokens = TextBlob(text).words

print(blob_tokens, len(blob_tokens), sep='\n')

In [None]:
funcs = [x for x in dir(TextBlob) if not x.startswith('_')]
funcs

### Gensim

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It is designed to be fast and memory efficient, and it includes capabilities for hierarchical document clustering. Gensim is a great choice for building production NLP systems.

https://tedboy.github.io/nlps/generated/generated/gensim.utils.tokenize.html

In [None]:
import gensim

# gensim.utils.tokenize returns a generator, so materialize it as a list
gensim_tokens = list(gensim.utils.tokenize(text))

print(gensim_tokens, len(gensim_tokens), sep='\n')

In [None]:
gensim.utils.tokenize??

### Stanza (Stanford NLP)

Stanza is the Stanford NLP Group's Python library for natural language analysis. Its own neural pipeline supports tokenization, sentence segmentation, part-of-speech tagging, lemmatization, dependency parsing, and named entity recognition. It also includes a Python client for Stanford CoreNLP, the group's Java toolkit, which adds capabilities such as coreference resolution. The cells below use Stanza's native pipeline.

https://stanfordnlp.github.io/stanza/installation_usage.html#getting-started

In [None]:
import stanza

# Instantiate our stanza pipeline
stan_NLP = stanza.Pipeline(lang='en', processors='tokenize')

stan_tokens = [token.text for sent in stan_NLP(text).sentences for token in sent.tokens]

In [None]:
print(stan_tokens, len(stan_tokens), sep='\n')

In [None]:
stan_NLP??

### Tokenization analysis

In [None]:
from IPython.display import display, Markdown
# Display the token counts
display(Markdown("| NLTK | spaCy | TextBlob | Gensim | Stanza |\n| --- | --- | --- | --- | --- |\n| {} | {} | {} | {} | {} |"\
    .format(len(nltk_tokens),
            len(spacy_tokens),
            len(blob_tokens),
            len(gensim_tokens),
            len(stan_tokens)))
    )

In [None]:
# Display the texts
display(Markdown("| NLTK | spaCy | TextBlob | Gensim | Stanza |\n| --- | --- | --- | --- | --- |\n| {} | {} | {} | {} | {} |"\
    .format(nltk_tokens,
            spacy_tokens,
            blob_tokens,
            gensim_tokens,
            stan_tokens))
    )

## Stemming: (attempting to) reduce words to their root form

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form. For example, the stem of the word `waiting` is `wait`. A stemming algorithm reduces the words `waiting`, `waits`, and `waited` to the same stem. This is important in building NLP systems because it helps us ensure that we are processing all forms of a word using the same representation.
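
To make the idea concrete, here is a toy suffix-stripping stemmer. It is only a sketch: the `SUFFIXES` list and `toy_stem` function are hypothetical, and this is not the Porter algorithm used by the libraries below, which applies many more rules and conditions.

```python
# Toy suffix-stripping stemmer for illustration only; this is NOT the Porter algorithm
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny suffix list

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters of stem."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered

for word in ["waiting", "waits", "waited", "learning", "models", "is", "this"]:
    print(word, "->", toy_stem(word))
```

The three forms of `wait` collapse to the same stem, but `this` becomes `thi`. Rule-based stemmers make mistakes of this kind, which is why the heading says "attempting to".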


### NLTK

https://www.nltk.org/howto/stem.html

In [None]:
## NLTK Stemming

## import the PorterStemmer
from nltk.stem import PorterStemmer

## Create an instance of the PorterStemmer
stemmer = PorterStemmer()

## Stem the tokens
nltk_stemmed_tokens = [stemmer.stem(token) for token in nltk_tokens]

In [None]:
nltk_stemmed = " ".join(nltk_stemmed_tokens)

# Display the stemmed text
print(nltk_stemmed[:100])

### SpaCy

https://spacy.io/api/lemmatizer

spaCy does not ship a stemmer; it provides a lemmatizer instead. See below for spaCy lemmatization.

### TextBlob

https://textblob.readthedocs.io/en/dev/quickstart.html#stemming

In [None]:
## TextBlob Stemming

## Stem the tokens
blob = TextBlob(text)

blob_stemmed_tokens = [word.stem() for word in blob.words]

### Gensim

https://tedboy.github.io/nlps/generated/generated/gensim.parsing.porter.PorterStemmer.html


In [None]:
import gensim
from gensim.parsing.porter import PorterStemmer

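# Reuse one stemmer instance instead of constructing a new one per token
gensim_stemmer = PorterStemmer()
gensim_stemmed_tokens = [gensim_stemmer.stem(token) for token in gensim_tokens]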

In [None]:
# Join into a new variable so gensim_stemmed_tokens stays a list for the comparison below
gensim_stemmed = " ".join(gensim_stemmed_tokens)
gensim_stemmed[:100]

### Stanza

Stanza does not have a built-in stemmer, but it does have a lemmatizer. See below for Stanza lemmatization.

### Comparison of stemming algorithms

In [None]:
# Display the stemmed texts
display(Markdown("| NLTK | TextBlob | Gensim |\n| --- | --- | --- |\n| {} | {} | {} |"\
    .format(nltk_stemmed_tokens,
            blob_stemmed_tokens,
            gensim_stemmed_tokens))
        )

## Lemmatization: reducing words to their dictionary form

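Lemmatization is the linguistic process of grouping the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. German definite articles are a compact example: the singular forms in the table below vary by case and gender, yet a lemmatizer can treat them all as one item (conventionally listed under `der`).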

| Case | Masc. | Fem. | Neut. |
|------|-------|------|-------|
| nominative | der | die | das |
| accusative | den | die | das |
| dative | dem | der | dem |
| genitive | des | der | des |
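
The simplest possible lemmatizer is a lookup table. The sketch below builds one from the article forms in the table above; `ARTICLE_LEMMAS` and `lookup_lemma` are hypothetical names, and the choice of `der` as the lemma is an assumption about convention. The library lemmatizers in the following cells rely on much larger dictionaries, rules, or trained models.

```python
# Hypothetical lookup table built from the article forms in the table above.
# Assumption: "der" is used as the lemma; other tools may follow a different convention.
ARTICLE_LEMMAS = {form: "der" for form in ("der", "die", "das", "den", "dem", "des")}

def lookup_lemma(token: str) -> str:
    """Return the lemma from the table, or the lowercased token if it is unknown."""
    lowered = token.lower()
    return ARTICLE_LEMMAS.get(lowered, lowered)

phrase = "Der Hund sieht den Mann".split()
print([lookup_lemma(tok) for tok in phrase])
# ['der', 'hund', 'sieht', 'der', 'mann']
```

Unlike a stemmer, a lookup maps `den` to `der` even though the two share no common suffix rule; the cost is that every form has to be known in advance.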

### NLTK

https://www.nltk.org/_modules/nltk/stem/wordnet.html

In [None]:
import nltk
nltk.download('wordnet')

## NLTK Lemmatization
from nltk.stem import WordNetLemmatizer


# Create an instance of the WordNetLemmatizer
nltk_lemmatizer = WordNetLemmatizer()

# Lemmatize the tokens
# Note: without a pos argument, lemmatize() treats every token as a noun (pos='n')
nltk_lemmas = [nltk_lemmatizer.lemmatize(token) for token in nltk_tokens]

In [None]:
print(" ".join(nltk_lemmas))

### SpaCy

https://spacy.io/api/lemmatizer

In [None]:
## SpaCy Lemmatization
import spacy

## Create an instance of the spaCy library
spacy_NLP = spacy.load('en_core_web_sm')

## Lemmatize the tokens
spacy_lemmas = [token.lemma_ for token in spacy_NLP(text)]

In [None]:
print(" ".join(spacy_lemmas))

### TextBlob


In [None]:
from textblob import TextBlob

## Create an instance of the TextBlob library
blob = TextBlob(text)

## Lemmatize the blob
blob_lemmas = [word.lemmatize() for word in blob.words]

In [None]:
print(" ".join(blob_lemmas))

### Gensim

Gensim no longer has a lemmatizer.

### Stanza

https://stanfordnlp.github.io/stanza/lemma.html

In [None]:
import stanza

## Create an instance of the stanza library
stanza_NLP = stanza.Pipeline(lang='en', processors='tokenize,lemma')

## Create a document object
doc = stanza_NLP(text)

# Stanza lemmas
stanza_lemmas = [word.lemma for sent in doc.sentences for word in sent.words]

## Print each word alongside its lemma
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

In [None]:
# Display the lemmatized texts
display(Markdown("| NLTK | spaCy | TextBlob | Stanza |\n| --- | --- | --- | --- |\n| {} | {} | {} | {} |"\
    .format(nltk_lemmas,
            spacy_lemmas,
            blob_lemmas,
            stanza_lemmas))
        )

## Stopword lists

`Stopwords` are words that are so common that they are not useful for analysis. For example, the word `the` is a stopword. To normalize our text with a stopword list, we remove every token that appears in the list from our corpus.
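
The removal step itself is a simple filter. The sketch below uses a small hand-picked set, `TOY_STOPWORDS`, which is a hypothetical stand-in for the much longer library lists shown in this section.

```python
# Hypothetical, hand-picked stopword set; the library lists in this section are far longer
TOY_STOPWORDS = {"the", "to", "have", "any", "a", "of", "is"}

def remove_stopwords(tokens, stopwords=TOY_STOPWORDS):
    """Keep only tokens whose lowercased form is not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens = "Human infants have the remarkable ability to learn any human language".split()
print(remove_stopwords(tokens))
# ['Human', 'infants', 'remarkable', 'ability', 'learn', 'human', 'language']
```

Storing the stopwords in a `set` keeps each membership check fast, and comparing lowercased tokens means `The` and `the` are both filtered.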

### NLTK

https://www.nltk.org/book/ch02.html#stopwords_index_term

In [None]:
## NLTK Stopwords

## Import the stopwords
from nltk.corpus import stopwords

print(stopwords.words('english'))

### SpaCy

https://spacy.io/api/language#section-defaults

In [None]:
import spacy

## Get the set of spaCy stopwords for English (import the module explicitly)
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords

spacy_stopwords

### TextBlob

TextBlob relies on NLTK for stopword lists.

### Gensim

https://radimrehurek.com/gensim/parsing/preprocessing.html

In [None]:
import gensim
from gensim.parsing.preprocessing import STOPWORDS

STOPWORDS

### Create your own stopword list using Zipf's Law or other statistical methods
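
One simple statistical approach is to count word frequencies and treat the most frequent words as stopword candidates. Zipf's law says that in a large corpus a word's frequency is roughly inversely proportional to its frequency rank, so a handful of top-ranked words account for a large share of all tokens. The sketch below is a minimal version of that idea; `frequency_stopwords` and the `top_n` cutoff are hypothetical, and the toy corpus is far too small for Zipf's law to hold in any meaningful way.

```python
import re
from collections import Counter

def frequency_stopwords(corpus: str, top_n: int = 5) -> list[str]:
    """Return the top_n most frequent lowercased word tokens in the corpus."""
    words = re.findall(r"[a-z]+", corpus.lower())
    return [word for word, _ in Counter(words).most_common(top_n)]

toy_corpus = "the cat and the dog and the bird saw the log"

# Inspect rank and frequency for the most common words
counts = Counter(re.findall(r"[a-z]+", toy_corpus.lower()))
for rank, (word, freq) in enumerate(counts.most_common(2), start=1):
    print(rank, word, freq)

print(frequency_stopwords(toy_corpus, top_n=2))
# ['the', 'and']
```

On a domain-specific corpus the top-ranked words can include content words (for example `model` in a collection of NLP abstracts), so the candidates are worth reviewing by hand before they are removed.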

## Interesting features of spaCy and Stanza

In [5]:
import spacy

## Create an instance of the spaCy library
NLP = spacy.load('en_core_web_sm')

doc = NLP(text)

for token in doc:
    print(f'{token.text=}, {token.is_stop=}, {token.is_punct=}, {token.is_space=}, {token.is_alpha=}, {token.is_digit=}')

token.text='\n', token.is_stop=False, token.is_punct=False, token.is_space=True, token.is_alpha=False, token.is_digit=False
token.text='Human', token.is_stop=False, token.is_punct=False, token.is_space=False, token.is_alpha=True, token.is_digit=False
token.text='infants', token.is_stop=False, token.is_punct=False, token.is_space=False, token.is_alpha=True, token.is_digit=False
token.text='have', token.is_stop=True, token.is_punct=False, token.is_space=False, token.is_alpha=True, token.is_digit=False
token.text='the', token.is_stop=True, token.is_punct=False, token.is_space=False, token.is_alpha=True, token.is_digit=False
token.text='remarkable', token.is_stop=False, token.is_punct=False, token.is_space=False, token.is_alpha=True, token.is_digit=False
token.text='ability', token.is_stop=False, token.is_punct=False, token.is_space=False, token.is_alpha=True, token.is_digit=False
token.text='to', token.is_stop=True, token.is_punct=False, token.is_space=False, token.is_alpha=True, token.is

In [7]:
# dependency parsing

import spacy

## Create an instance of the spaCy library

NLP = spacy.load('en_core_web_sm')

## Create a document object
doc = NLP(text)

## Print the dependency parsing
for token in doc:
    print(f'{token.text} <--{token.dep_}-- {token.head.text}')
    
# graph the dependency parsing
from spacy import displacy

displacy.render(doc, style='dep', jupyter=True)


 <--dep-- Human
Human <--amod-- infants
infants <--nsubj-- have
have <--ROOT-- have
the <--det-- ability
remarkable <--amod-- ability
ability <--dobj-- have
to <--aux-- learn
learn <--acl-- ability
any <--det-- language
human <--amod-- language
language <--dobj-- learn
. <--punct-- have
One <--nummod-- mechanism
proposed <--amod-- mechanism
mechanism <--nsubj-- is
for <--prep-- mechanism
this <--det-- ability
ability <--pobj-- for

 <--dep-- ability
is <--ROOT-- is
distributional <--amod-- learning
learning <--attr-- is
, <--punct-- learning
where <--advmod-- infer
learners <--nsubj-- infer
infer <--relcl-- learning
the <--det-- structure
underlying <--amod-- structure
cluster <--compound-- structure
structure <--dobj-- infer
from <--prep-- structure
unlabeled <--amod-- input
input <--pobj-- from
. <--punct-- is
Computational <--amod-- models

 <--dep-- Computational
models <--nsubjpass-- principled
of <--prep-- models
distributional <--amod-- learning
learning <--pobj-- of
have <--au

In [8]:
# Named Entity Recognition
import spacy

## Create an instance of the spaCy library
NLP = spacy.load('en_core_web_sm')

## Create a document object
doc = NLP(text)

## Print the named entities
for ent in doc.ents:
    print(f'{ent.text} - {ent.label_}')

One - CARDINAL
one - CARDINAL
English - LANGUAGE
Gibbs - GPE


In [9]:
# create (token, dependency label, head) triples from the dependency parse
import spacy

## Create an instance of the spaCy library
NLP = spacy.load('en_core_web_sm')

## Create a document object
doc = NLP(text)

## Create a list of triples
triples = [(token.text, token.dep_, token.head.text) for token in doc]

## Print the triples
for triple in triples:
    print(triple)

('\n', 'dep', 'Human')
('Human', 'amod', 'infants')
('infants', 'nsubj', 'have')
('have', 'ROOT', 'have')
('the', 'det', 'ability')
('remarkable', 'amod', 'ability')
('ability', 'dobj', 'have')
('to', 'aux', 'learn')
('learn', 'acl', 'ability')
('any', 'det', 'language')
('human', 'amod', 'language')
('language', 'dobj', 'learn')
('.', 'punct', 'have')
('One', 'nummod', 'mechanism')
('proposed', 'amod', 'mechanism')
('mechanism', 'nsubj', 'is')
('for', 'prep', 'mechanism')
('this', 'det', 'ability')
('ability', 'pobj', 'for')
('\n', 'dep', 'ability')
('is', 'ROOT', 'is')
('distributional', 'amod', 'learning')
('learning', 'attr', 'is')
(',', 'punct', 'learning')
('where', 'advmod', 'infer')
('learners', 'nsubj', 'infer')
('infer', 'relcl', 'learning')
('the', 'det', 'structure')
('underlying', 'amod', 'structure')
('cluster', 'compound', 'structure')
('structure', 'dobj', 'infer')
('from', 'prep', 'structure')
('unlabeled', 'amod', 'input')
('input', 'pobj', 'from')
('.', 'punct', 'is'