# Lecture 8: Text Normalization

## Table of Contents

1. [Text Normalization](#Text-Normalization)
2. [Tokenization](#Tokenization)
3. [Stemming](#Stemming)
4. [Lemmatization](#Lemmatization)
5. [Stop Words](#Stop-Words)
6. [Vectorization](#Vectorization)

## Text Normalization

Text normalization is the process of transforming text into a single canonical form that it might not have had before. This can involve changing the case of the text, removing punctuation, expanding contractions, and so on. The goal is to transform text into a more standard form that might be easier for other NLP tasks. More importantly, our goal is to think about strategies to represent text in numerical form, which is the only way that machine learning algorithms can process text.

### Python standard-library tools for text normalization

* `string` (punctuation)
* `re` (regular expressions)
* `str.lower()`, `str.upper()`, `str.capitalize()`, `str.title()` (string methods for changing case)
* `str.split()`, `str.join()` (string methods for splitting and joining)
* `str.replace()` (string method for substring replacement)
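
As a minimal sketch of how these standard-library pieces fit together, the function below lowercases a string, expands a few contractions, strips punctuation, and collapses whitespace. The `CONTRACTIONS` table and the `normalize` name are hypothetical and only for illustration; a real contraction list would be much larger and context-dependent (for instance, `it's` can also mean `it has`).

```python
import re
import string

# Hypothetical, deliberately tiny contraction table (illustration only)
CONTRACTIONS = {"can't": "cannot", "it's": "it is", "don't": "do not"}

def normalize(raw: str) -> str:
    """Lowercase, expand a few contractions, strip punctuation, squeeze spaces."""
    lowered = raw.lower()
    for short, full in CONTRACTIONS.items():
        lowered = lowered.replace(short, full)
    # A translation table with a deletion set removes all ASCII punctuation
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", no_punct).strip()

print(normalize("It's  true: we CAN'T skip normalization!"))
```

Contractions are expanded before punctuation is removed, because stripping the apostrophe first would turn `can't` into `cant` and the lookup would no longer match.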

### Pip-installable libraries for text normalization

* `nltk` (Natural Language Toolkit)
* `spacy` (Industrial Strength NLP)
* `textblob` (Simplified Text Processing)
* `gensim` (Topic Modeling for Humans)
* `stanza` (the Stanford NLP Group's Python library, which also includes a client for the Java-based CoreNLP)

## Sample text for understanding text normalization strategies

In [None]:
text = """
Human infants have the remarkable ability to learn any human language. One proposed mechanism for this ability 
is distributional learning, where learners infer the underlying cluster structure from unlabeled input. Computational
models of distributional learning have historically been principled but psychologically-implausible
computational-level models, or ad hoc but psychologically plausible algorithmic-level models. Approximate rational
models like particle filters can potentially bridge this divide, and allow principled, but psychologically plausible
models of distributional learning to be specified and evaluated. As a proof of concept, I evaluate one such particle
filter model, applied to learning English voicing categories from distributions of voice-onset times (VOTs). 
I find that this model learns well, but behaves somewhat differently from the standard, unconstrained Gibbs
sampler implementation of the underlying rational model.
"""

## Tokenization: breaking text into smaller units

### NLTK

NLTK is a long-standing platform for building Python programs to work with human language data. Although somewhat dated, it provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic analysis. NLTK is a great tool for learning NLP, but it is generally not the first choice for building production systems.

In [None]:
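import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer models; download them once
# (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

nltk_tokens = word_tokenize(text)

print(nltk_tokens, len(nltk_tokens), sep='\n')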

#### NLTK tokenization docs

In [None]:
word_tokenize??

### SpaCy

SpaCy is a modern Python library for industrial-strength natural language processing. It is designed to be fast and accurate, and it includes built-in capabilities for visualizing and analyzing NLP data. SpaCy is a great choice for building real-world NLP applications.

In [None]:
# tokenize our sample text using spacy

import spacy

NLP = spacy.load('en_core_web_sm')

spacy_tokens = [token.text for token in NLP(text)]

print(spacy_tokens, len(spacy_tokens), sep='\n')

In [None]:
spacy.tokens.Token??

### TextBlob

TextBlob is a simplified text processing library that sits on top of NLTK. It provides an easy-to-use interface to NLTK along with some text processing capabilities of its own. TextBlob is a great choice for getting started with NLP.

In [None]:
# tokenize the text using the textblob library
from textblob import TextBlob

blob_tokens = TextBlob(text).words

print(blob_tokens, len(blob_tokens), sep='\n')

In [None]:
funcs = [x for x in dir(TextBlob) if not x.startswith('_')]
funcs

### Gensim

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It is designed to be fast and memory efficient, and it includes capabilities for hierarchical document clustering. Gensim is a great choice for building production NLP systems.

https://tedboy.github.io/nlps/generated/generated/gensim.utils.tokenize.html

In [None]:
import gensim

# gensim.utils.tokenize returns a generator, so materialize it as a list
gensim_tokens = list(gensim.utils.tokenize(text))

# print the tokens and the token count
print(gensim_tokens, len(gensim_tokens), sep='\n')

In [None]:
gensim.utils.tokenize??

### Stanza

Stanza is the Stanford NLP Group's Python library for natural language analysis. It provides support for tokenization, sentence segmentation, part-of-speech tagging, lemmatization, named entity recognition, and dependency parsing. It also includes a Python client for Stanford CoreNLP, a suite of production-ready analysis tools written in Java that adds capabilities such as coreference resolution.

https://stanfordnlp.github.io/stanza/installation_usage.html#getting-started

In [None]:
import stanza

# Instantiate our stanza pipeline
stan_NLP = stanza.Pipeline(lang='en', processors='tokenize')

stan_tokens = [token.text for sent in stan_NLP(text).sentences for token in sent.tokens]

In [None]:
print(stan_tokens, len(stan_tokens), sep='\n')

In [None]:
stan_NLP??

### Tokenization analysis

In [None]:
from IPython.display import display, Markdown
# Display the token counts
display(Markdown("| NLTK | spaCy | TextBlob | Gensim | Stanza |\n| --- | --- | --- | --- | --- |\n| {} | {} | {} | {} | {} |"\
    .format(len(nltk_tokens),
            len(spacy_tokens),
            len(blob_tokens),
            len(gensim_tokens),
            len(stan_tokens)))
    )

In [None]:
# Display the texts
display(Markdown("| NLTK | spaCy | TextBlob | Gensim | Stanza |\n| --- | --- | --- | --- | --- |\n| {} | {} | {} | {} | {} |"\
    .format(nltk_tokens,
            spacy_tokens,
            blob_tokens,
            gensim_tokens,
            stan_tokens))
    )

## Stemming: (attempting to) reduce words to their root form

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form. For example, the stem of the word `waiting` is `wait`. A stemming algorithm reduces the words `waiting`, `waits`, and `waited` to the same stem. This is important in building NLP systems because it helps us ensure that we are processing all forms of a word using the same representation.
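
To make the idea concrete, here is a toy suffix stripper. It is not the Porter algorithm used by the libraries below, only a sketch of the core idea; the `SUFFIXES` tuple and the `toy_stem` name are hypothetical.

```python
# Toy suffix stripper: NOT the Porter algorithm, just the core idea
SUFFIXES = ("ing", "ed", "s")  # checked in this order

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: toy_stem(w) for w in ["waiting", "waits", "waited", "wait"]}
print(stems)

# Crude rules also produce non-words: "running" becomes "runn"
print(toy_stem("running"))
```

All four forms of `wait` collapse to the same stem, while `running` shows why stems are not guaranteed to be dictionary words; real stemmers add many more rules but share this limitation.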


### NLTK

https://www.nltk.org/howto/stem.html

In [None]:
## NLTK Stemming

## import the PorterStemmer
from nltk.stem import PorterStemmer

## Create an instance of the PorterStemmer
stemmer = PorterStemmer()

## Stem the tokens
nltk_stemmed_tokens = [stemmer.stem(token) for token in nltk_tokens]

In [None]:
nltk_stemmed = " ".join(nltk_stemmed_tokens)

# Display the stemmed text
print(nltk_stemmed[:100])

### SpaCy

https://spacy.io/api/lemmatizer

spaCy does not include a stemmer. See below for spaCy lemmatization.

### TextBlob

https://textblob.readthedocs.io/en/dev/quickstart.html#stemming

In [None]:
## TextBlob Stemming

## Stem the tokens
blob = TextBlob(text)

blob_stemmed_tokens = [word.stem() for word in blob.words]

### Gensim

https://tedboy.github.io/nlps/generated/generated/gensim.parsing.porter.PorterStemmer.html


In [None]:
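import gensim
# Note: this import shadows the NLTK PorterStemmer imported earlier
from gensim.parsing.porter import PorterStemmer

# Create the stemmer once and reuse it for every token
gensim_stemmer = PorterStemmer()
gensim_stemmed_tokens = [gensim_stemmer.stem(token) for token in gensim_tokens]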

In [None]:
# Join into a separate variable so the token list stays available for the comparison below
gensim_stemmed = " ".join(gensim_stemmed_tokens)
gensim_stemmed[:100]

### Stanza

Stanza does not have a built-in stemmer, but it does have a lemmatizer. See below for Stanza lemmatization.

### Comparison of stemming algorithms

In [None]:
# Display the stemmed texts
display(Markdown("| NLTK | TextBlob | Gensim |\n| --- | --- | --- |\n| {} | {} | {} |"\
    .format(nltk_stemmed_tokens,
            blob_stemmed_tokens,
            gensim_stemmed_tokens))
        )

## Lemmatization: reducing words to their dictionary form

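Lemmatization is the linguistic process of grouping the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. For example, the German definite article takes different forms depending on case and gender, yet all of the forms in the table below share a single lemma, `der`.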

| Case | Masc. | Fem. | Neut. |
|------|-------|------|-------|
| nominative | der | die | das |
| accusative | den | die | das |
| dative | dem | der | dem |
| genitive | des | der | des |
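
Unlike stemming, lemmatization is essentially a lookup problem. The sketch below uses a small hand-written table; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and real lemmatizers rely on large lexicons and usually on part-of-speech information as well.

```python
# Hypothetical hand-written lemma table (illustration only)
LEMMA_TABLE = {
    # German definite articles from the table above
    "der": "der", "die": "der", "das": "der",
    "den": "der", "dem": "der", "des": "der",
    # A few irregular English forms that suffix stripping cannot handle
    "is": "be", "are": "be", "was": "be",
    "better": "good", "best": "good",
}

def toy_lemmatize(token: str) -> str:
    """Return the dictionary form if known, otherwise the lowercased token."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)

print([toy_lemmatize(t) for t in ["Dem", "des", "is", "best", "model"]])
```

Irregular forms such as `best` and `is` are the cases where a lookup-based lemmatizer does something a rule-based stemmer cannot.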

### NLTK

https://www.nltk.org/_modules/nltk/stem/wordnet.html

In [None]:
import nltk
nltk.download('wordnet')

## NLTK Lemmatization
from nltk.stem import WordNetLemmatizer


# Create an instance of the WordNetLemmatizer
nltk_lemmatizer = WordNetLemmatizer()

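# Lemmatize the tokens (lemmatize() treats every token as a noun
# unless a `pos` argument is supplied, so verbs are mostly left unchanged)
nltk_lemmas = [nltk_lemmatizer.lemmatize(token) for token in nltk_tokens]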

In [None]:
print(" ".join(nltk_lemmas))

### SpaCy

https://spacy.io/api/lemmatizer

In [None]:
## SpaCy Lemmatization
import spacy

## Create an instance of the spaCy library
spacy_NLP = spacy.load('en_core_web_sm')

## Lemmatize the tokens
spacy_lemmas = [token.lemma_ for token in spacy_NLP(text)]

In [None]:
print(" ".join(spacy_lemmas))

### TextBlob


In [None]:
from textblob import TextBlob

## Create an instance of the TextBlob library
blob = TextBlob(text)

## Lemmatize the blob
blob_lemmas = [word.lemmatize() for word in blob.words]

In [None]:
print(" ".join(blob_lemmas))

### Gensim

Gensim no longer has a lemmatizer.

### Stanza

https://stanfordnlp.github.io/stanza/lemma.html

In [None]:
import stanza

## Create a stanza pipeline (the English lemmatizer typically expects
## part-of-speech tags, so `pos` is included here as a precaution)
stanza_NLP = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma')

## Create a document object
doc = stanza_NLP(text)

# Stanza lemmas
stanza_lemmas = [word.lemma for sent in doc.sentences for word in sent.words]

## Print each word alongside its lemma
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

In [None]:
# Display the lemmatized texts
display(Markdown("| NLTK | spaCy | TextBlob | Stanza |\n| --- | --- | --- | --- |\n| {} | {} | {} | {} |"\
    .format(nltk_lemmas,
            spacy_lemmas,
            blob_lemmas,
            stanza_lemmas))
        )

## Stopword lists

Stopwords are words that are so common that they carry little information for many analyses. For example, the word `the` is a stopword. To normalize our text with a stopword list, we remove those words from our corpus.

### NLTK

https://www.nltk.org/book/ch02.html#stopwords_index_term

In [None]:
## NLTK Stopwords

## Import the stopwords (download the corpus once if it is not installed)
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

print(stopwords.words('english'))

### SpaCy

https://spacy.io/api/language#section-defaults

In [None]:
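# Import the stopword set directly; `import spacy` alone is not
# guaranteed to load the `spacy.lang.en` submodule
from spacy.lang.en.stop_words import STOP_WORDS

## The set of spaCy stopwords in English
spacy_stopwords = STOP_WORDS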

spacy_stopwords

### TextBlob

TextBlob relies on NLTK for stopword lists.

### Gensim

https://radimrehurek.com/gensim/parsing/preprocessing.html

In [None]:
import gensim
from gensim.parsing.preprocessing import STOPWORDS

STOPWORDS

### Create your own stopword list using Zipf's Law or other statistical methods
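
Zipf's law observes that a word's frequency is roughly inversely proportional to its frequency rank, so a handful of top-ranked words account for a large share of all tokens. One simple way to exploit this, sketched below, is to treat the most frequent tokens in your own corpus as stopwords. The `frequency_stopwords` helper, the `top_k` cutoff, and the toy documents are all hypothetical; in practice the cutoff is a judgment call and the candidates should be reviewed by hand.

```python
from collections import Counter

def frequency_stopwords(documents, top_k=3):
    """Return the top_k most frequent lowercase whitespace tokens."""
    counts = Counter(
        token for doc in documents for token in doc.lower().split()
    )
    # most_common sorts by descending count
    return [token for token, _ in counts.most_common(top_k)]

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]
print(frequency_stopwords(docs, top_k=1))
```

A common alternative is to threshold on document frequency (the share of documents containing a word) rather than on raw token counts.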

## Vectorization: representing text as numbers

### Bag of Words

Bag of words is a representation of text that describes the occurrence of words within a document. It involves two things:

* A vocabulary of known words.
* A measure of the presence of known words.
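
The two ingredients above can be sketched in a few lines of plain Python. The toy documents and the `bag_of_words` helper are hypothetical, and raw counts are only one possible measure of presence (binary indicators are another).

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]

# 1. A vocabulary of known words (sorted so column order is reproducible)
vocab = sorted({word for doc in docs for word in doc.split()})

# 2. A measure of presence: here, raw counts per document
def bag_of_words(doc, vocab):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(matrix)
```

Each row has one column per vocabulary word, and word order within the document is discarded, which is where the "bag" in the name comes from.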

### TF-IDF

TF-IDF stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
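
A minimal sketch of one textbook variant of this weighting follows: term frequency is the term's share of the document, and inverse document frequency is the natural log of the number of documents divided by the number containing the term. The `tf_idf` helper and toy documents are hypothetical, and libraries often use smoothed or normalized variants, so exact values may differ from theirs.

```python
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

def tf_idf(term, doc, docs):
    """Textbook weighting: (count / doc length) * ln(N / document frequency)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0

print(tf_idf("the", docs[0], docs))  # term appears in every document
print(tf_idf("ran", docs[2], docs))  # term appears in only one document
```

A term that occurs in every document gets a weight of zero under this variant, while a term unique to one document is weighted most heavily.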

### Word2Vec

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.
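
To illustrate what "reconstructing linguistic contexts" means in practice, the sketch below generates the (center word, context word) training pairs used by the skip-gram flavor of word2vec. It covers only the data preparation step, not the neural network itself; the `skipgram_pairs` helper and its `window` parameter are hypothetical names.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within `window` positions of each token."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

pairs = list(skipgram_pairs(["infants", "learn", "any", "language"], window=1))
print(pairs)
```

The model is then trained to predict the context word from the center word, which pushes words that occur in similar contexts toward similar vectors.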

### GloVe

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
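
The co-occurrence statistics that GloVe starts from can be sketched as a simple windowed count. This is a simplification under stated assumptions: the `cooccurrence_counts` helper is hypothetical, the counts are unweighted, and the model-fitting step that turns counts into vectors is not shown.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered (word, neighbor) pair falls within the window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

sentences = [["ice", "is", "cold"], ["steam", "is", "hot"]]
counts = cooccurrence_counts(sentences)
print(counts[("is", "cold")], counts[("ice", "hot")])
```

Unlike word2vec, which streams over local windows during training, this table is aggregated over the whole corpus once before any vectors are learned.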

### FastText

FastText is an extension of the word2vec model that generates vectors based on subword information. Where word2vec treats each word as an atomic entity, FastText represents each word as a bag of character n-grams and learns a vector for each n-gram. A word's vector is then computed by combining the vectors of its n-grams. Subword information allows us to obtain vectors for words that did not appear in our training corpus.
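
The subword units can be sketched as follows: wrap the word in boundary markers and slide a window of `n` characters across it. The `char_ngrams` helper is a hypothetical name, and the sketch extracts a single n-gram size, whereas FastText typically uses a range of sizes.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word` wrapped in < and > boundary marks."""
    marked = f"<{word}>"
    return [marked[i : i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("where"))
```

Because an unseen word still shares many n-grams with known words, its vector can be assembled from n-gram vectors learned during training.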

### ELMo

ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment, and sentiment analysis.

### GPT

GPT vectors are based on the GPT family of models. GPT-2, for example, is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages (about 40GB of internet text). It is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of the original GPT model.

## Vectorizing text

There are numerous ways to vectorize text. We will begin with some relatively naive methods and advance to more sophisticated methods in the coming weeks. To begin, let's code out a vanilla implementation of a bag of words vectorizer using `numpy`.

In [8]:
import numpy as np

### Corpus

When we vectorize text, we are concerned with the corpus, understood as a collection of documents, and the vocabulary used in the corpus.

In [9]:
corpus = ['NLP class is the most awesome class in the Data Science program',
          'I think you are wrong and the best data science class is Graphing Algorithms',
          'No, you both are wrong, it is Machine Learning']

### Vocabulary

How shall we define vocabulary?

* White space?
* Tokens?
* Lemmas?
* Stems?

In [10]:
token_vocab = set([word for word in ' '.join(corpus).split(' ')])

In [11]:
print(token_vocab, len(token_vocab), sep='\n')

{'program', 'data', 'Data', 'wrong,', 'and', 'awesome', 'in', 'are', 'best', 'science', 'you', 'most', 'it', 'is', 'No,', 'NLP', 'Machine', 'I', 'Science', 'Algorithms', 'the', 'Learning', 'class', 'wrong', 'think', 'both', 'Graphing'}
27


In [12]:
### Let's go with lemmas
import spacy

NLP = spacy.load('en_core_web_sm')

vocabulary = [token.lemma_.lower() for token in NLP(' '.join(corpus))]

print(vocabulary, len(vocabulary), sep='\n')

['nlp', 'class', 'be', 'the', 'most', 'awesome', 'class', 'in', 'the', 'data', 'science', 'program', 'i', 'think', 'you', 'be', 'wrong', 'and', 'the', 'good', 'data', 'science', 'class', 'be', 'graphing', 'algorithms', 'no', ',', 'you', 'both', 'be', 'wrong', ',', 'it', 'be', 'machine', 'learning']
37


In [13]:
# reduce down to unique words
vocabulary = list(set(vocabulary))

print(vocabulary, len(vocabulary), sep='\n')

['graphing', 'data', 'program', 'nlp', 'and', 'awesome', 'in', 'science', 'you', 'most', 'it', ',', 'machine', 'learning', 'the', 'be', 'class', 'wrong', 'algorithms', 'think', 'good', 'i', 'both', 'no']
24


### Vectorizing our corpus

How can we vectorize our corpus?

In [14]:
# let's write a function to vectorize our text

def create_sparse_vector(sentence, vocabulary):
    # initialize the vector with one slot per vocabulary entry
    vector = np.zeros(len(vocabulary), dtype=int)
    
    # lemmatize the sentence (lowercased to match the vocabulary)
    sentence_lemmas = [token.lemma_.lower() for token in NLP(sentence)]
    
    # mark each lemma as present (1); repeated lemmas are not counted
    for lemma in sentence_lemmas:
        if lemma in vocabulary:
            vector[vocabulary.index(lemma)] = 1
    return vector

In [15]:
# let's use the function to vectorize our corpus

vectors = [create_sparse_vector(sentence, vocabulary) for sentence in corpus]
print(vectors, len(vectors), sep='\n')

[array([0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0]), array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 1])]
3


In [20]:
# let's look up some words in our vocabulary
vocabulary[3]

'nlp'

### Using `sklearn` CountVectorizer

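Now that we have coded our own vectorizer, we have a better understanding of what `sklearn` is doing under the hood. Note that `CountVectorizer` records how many times each term occurs rather than just whether it is present. By default it also lowercases the text and keeps only tokens of two or more characters, which is why `i` and the comma are absent from its vocabulary.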

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize the vectorizer
vectorizer = CountVectorizer()

# let's vectorize our sentences
vectorized = vectorizer.fit_transform(corpus)

# let's look at the vocabulary
print(vectorizer.vocabulary_)

{'nlp': 15, 'class': 6, 'is': 10, 'the': 19, 'most': 14, 'awesome': 3, 'in': 9, 'data': 7, 'science': 18, 'program': 17, 'think': 20, 'you': 22, 'are': 2, 'wrong': 21, 'and': 1, 'best': 4, 'graphing': 8, 'algorithms': 0, 'no': 16, 'both': 5, 'it': 11, 'machine': 13, 'learning': 12}


In [22]:
# let's look at the vectorized sentences
print(vectorized.toarray())

[[0 0 0 1 0 0 2 1 0 1 1 0 0 0 1 1 0 1 1 2 0 0 0]
 [1 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 1 1 1 1 1]
 [0 0 1 0 0 1 0 0 0 0 1 1 1 1 0 0 1 0 0 0 0 1 1]]
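
The vocabulary and counts above can be approximated without `sklearn`. The sketch below assumes the default `CountVectorizer` settings: lowercase the text, then keep runs of two or more word characters. The names `TOKEN`, `tokenize`, `vocab`, and `rows` are hypothetical, and the corpus is repeated so the block runs on its own.

```python
import re
from collections import Counter

corpus = ['NLP class is the most awesome class in the Data Science program',
          'I think you are wrong and the best data science class is Graphing Algorithms',
          'No, you both are wrong, it is Machine Learning']

# Approximation of the default tokenization: lowercase, then keep runs of
# two or more word characters (so "I" and punctuation are dropped)
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    return TOKEN.findall(doc.lower())

# Alphabetically sorted vocabulary defines the column order
vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})

rows = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    rows.append([counts[term] for term in vocab])

print(len(vocab))
print(rows[0])
```

Under these assumptions the sketch yields 23 vocabulary terms, with `class` and `the` counted twice in the first document, in line with the matrix printed above.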


In [27]:
# Let's get the feature names (the corpus vocabulary in column order);
# note that this overwrites our earlier `vocabulary` list
vocabulary = vectorizer.get_feature_names_out()

In [28]:
vocabulary

array(['algorithms', 'and', 'are', 'awesome', 'best', 'both', 'class',
       'data', 'graphing', 'in', 'is', 'it', 'learning', 'machine',
       'most', 'nlp', 'no', 'program', 'science', 'the', 'think', 'wrong',
       'you'], dtype=object)

In [29]:
# Recover the terms present in each document (word order and counts are lost)
reconstructed = vectorizer.inverse_transform(vectorized)
reconstructed

[array(['nlp', 'class', 'is', 'the', 'most', 'awesome', 'in', 'data',
        'science', 'program'], dtype='<U10'),
 array(['class', 'is', 'the', 'data', 'science', 'think', 'you', 'are',
        'wrong', 'and', 'best', 'graphing', 'algorithms'], dtype='<U10'),
 array(['is', 'you', 'are', 'wrong', 'no', 'both', 'it', 'machine',
        'learning'], dtype='<U10')]

In [30]:
vectorizer.inverse_transform??

Signature: vectorizer.inverse_transform(X)
Source:   
    def inverse_transform(self, X):
        """Return terms per document with nonzero entries in X.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            Document-term matrix.

        Returns
        -------
        X_inv : list of arrays of shape (n_samples,)
            List of arrays of terms.
        """
        self._check_vocabulary()
        # We need CSR format for fast row manipulations.

### Using `sklearn` TfidfVectorizer

Next we explore `TfidfVectorizer`, a more sophisticated vectorizer that builds the same vocabulary as `CountVectorizer` but weights each count by how rare the term is across the corpus.

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize the vectorizer
vectorizer = TfidfVectorizer()

# let's vectorize our sentences
vectorized = vectorizer.fit_transform(corpus)

# let's look at the vocabulary
print(vectorizer.vocabulary_)

{'nlp': 15, 'class': 6, 'is': 10, 'the': 19, 'most': 14, 'awesome': 3, 'in': 9, 'data': 7, 'science': 18, 'program': 17, 'think': 20, 'you': 22, 'are': 2, 'wrong': 21, 'and': 1, 'best': 4, 'graphing': 8, 'algorithms': 0, 'no': 16, 'both': 5, 'it': 11, 'machine': 13, 'learning': 12}


In [32]:
# let's look at the vectorized sentences
print(vectorized.toarray())

[[0.         0.         0.         0.29970733 0.         0.
  0.4558703  0.22793515 0.         0.29970733 0.17701198 0.
  0.         0.         0.29970733 0.29970733 0.         0.29970733
  0.22793515 0.4558703  0.         0.         0.        ]
 [0.32620527 0.32620527 0.24808752 0.         0.32620527 0.
  0.24808752 0.24808752 0.32620527 0.         0.19266209 0.
  0.         0.         0.         0.         0.         0.
  0.24808752 0.24808752 0.32620527 0.24808752 0.24808752]
 [0.         0.         0.28574186 0.         0.         0.37571621
  0.         0.         0.         0.         0.22190405 0.37571621
  0.37571621 0.37571621 0.         0.         0.37571621 0.
  0.         0.         0.         0.28574186 0.28574186]]


### What has TFIDFVectorization done?

In [33]:
vectorizer.fit_transform??

Signature: vectorizer.fit_transform(raw_documents, y=None)
Source:   
    def fit_transform(self, raw_documents, y=None):
        """Learn vocabulary and idf, return document-term matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which generates either str, unicode or file objects.

        y : None
            This parameter is ignored.

        Returns
        -------
        X : s

In [36]:
# What are the IDF weights for each of the words in the vocabulary?
# (`idf_` holds one inverse document frequency per term, not per-document TF-IDF scores)
idf_weights = vectorizer.idf_
vocabulary = vectorizer.get_feature_names_out()

for word, score in zip(vocabulary, idf_weights):
    print(f'{word}: {score}')

algorithms: 1.6931471805599454
and: 1.6931471805599454
are: 1.2876820724517808
awesome: 1.6931471805599454
best: 1.6931471805599454
both: 1.6931471805599454
class: 1.2876820724517808
data: 1.2876820724517808
graphing: 1.6931471805599454
in: 1.6931471805599454
is: 1.0
it: 1.6931471805599454
learning: 1.6931471805599454
machine: 1.6931471805599454
most: 1.6931471805599454
nlp: 1.6931471805599454
no: 1.6931471805599454
program: 1.6931471805599454
science: 1.2876820724517808
the: 1.2876820724517808
think: 1.6931471805599454
wrong: 1.2876820724517808
you: 1.2876820724517808
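
These values follow from the smoothed formula that `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term. The sketch below recomputes them for our three-document corpus and then applies L2 normalization (also a default) to the third document. The `smoothed_idf` helper and the `doc3_dfs` table are hypothetical names, with document frequencies read off the corpus by hand.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed default form."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Terms found in 1, 2, or all 3 of our documents
for df in (1, 2, 3):
    print(df, smoothed_idf(3, df))

# Third document: every term occurs once, so tf = 1 and the raw weight is the idf
doc3_dfs = {"are": 2, "both": 1, "is": 3, "it": 1, "learning": 1,
            "machine": 1, "no": 1, "wrong": 2, "you": 2}
raw = {term: smoothed_idf(3, df) for term, df in doc3_dfs.items()}

# Divide by the Euclidean length of the row so the row has unit norm
norm = math.sqrt(sum(weight * weight for weight in raw.values()))
print(round(raw["is"] / norm, 8))
```

A term found in all three documents, such as `is`, gets the minimum weight of 1.0 rather than 0, and after normalization it ends up with the smallest entry in the row (about 0.2219 in the matrix printed earlier).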
