# Lecture 8: Text Normalization 

## General Outline

* What is text normalization?
    * Tokenization
    * Stemming
    * Lemmatization
    * Stopwords

We have used these ideas before; now we will go into more detail. Specifically, we will compare how some of the leading libraries handle text normalization.

Because we ultimately have to represent our text as numbers, we want a clear picture of what happens to our corpus at each processing step.
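
As a minimal sketch of why this matters, the snippet below counts word types before and after one normalization step (lowercasing). It uses only the standard library, and the variable names are illustrative rather than part of the lecture code.

```python
from collections import Counter

# First sentence of the lecture text, split on whitespace for simplicity
sentence = "Human infants have the remarkable ability to learn any human language"
words = sentence.split()

# Count word types as-is, then again after lowercasing
raw_counts = Counter(words)
lowercased_counts = Counter(word.lower() for word in words)

print(len(raw_counts), len(lowercased_counts))
print(lowercased_counts["human"])
```

Without lowercasing, `Human` and `human` are counted as two different types; after lowercasing they collapse into one, so any numeric representation built on these counts changes as well.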

In [38]:
# We will use the same text to illustrate the differences among the libraries
import pandas as pd

text = """
Human infants have the remarkable ability to learn any human language. One proposed mechanism for this ability 
is distributional learning, where learners infer the underlying cluster structure from unlabeled input. Computational
models of distributional learning have historically been principled but psychologically-implausible
computational-level models, or ad hoc but psychologically plausible algorithmic-level models. Approximate rational
models like particle filters can potentially bridge this divide, and allow principled, but psychologically plausible
models of distributional learning to be specified and evaluated. As a proof of concept, I evaluate one such particle
filter model, applied to learning English voicing categories from distributions of voice-onset times (VOTs). 
I find that this model learns well, but behaves somewhat differently from the standard, unconstrained Gibbs
sampler implementation of the underlying rational model.
"""

## Tokenization Libraries

* [NLTK](https://www.nltk.org/)
* [spaCy](https://spacy.io/)
* [TextBlob](https://textblob.readthedocs.io/en/dev/)
* [Gensim](https://radimrehurek.com/gensim/)
* [Stanford CoreNLP (Stanza)](https://stanfordnlp.github.io/stanza/)

In [73]:
# Let's keep tabs on our packages
packages = ['nltk', 'spacy', 'textblob', 'gensim', 'stanza']

### NLTK Example

https://www.nltk.org/api/nltk.tokenize.html

In [74]:
## Tokenize the text using the NLTK library

# Import the NLTK library
from nltk.tokenize import word_tokenize

# Tokenize the text
nltk_tokens = word_tokenize(text)

## Total tokens in document
len(nltk_tokens)

140

In [75]:
## Explore the docs
word_tokenize??

Signature: word_tokenize(text, language='english', preserve_line=False)
Source:
def word_tokenize(text, language="english", preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preser

### spaCy Example

https://spacy.io/api/tokenizer

In [76]:
## Tokenize the text using the spaCy library

# Import the spaCy library
import spacy

# Create a spaCy object
NLP = spacy.load('en_core_web_sm')

# Tokenize the text
spacy_tokens = [token.text for token in NLP(text)]
len(spacy_tokens)

158

In [77]:
# Inspect the tokens
spacy_tokens

['\n',
 'Human',
 'infants',
 'have',
 'the',
 'remarkable',
 'ability',
 'to',
 'learn',
 'any',
 'human',
 'language',
 '.',
 'One',
 'proposed',
 'mechanism',
 'for',
 'this',
 'ability',
 '\n',
 'is',
 'distributional',
 'learning',
 ',',
 'where',
 'learners',
 'infer',
 'the',
 'underlying',
 'cluster',
 'structure',
 'from',
 'unlabeled',
 'input',
 '.',
 'Computational',
 '\n',
 'models',
 'of',
 'distributional',
 'learning',
 'have',
 'historically',
 'been',
 'principled',
 'but',
 'psychologically',
 '-',
 'implausible',
 '\n',
 'computational',
 '-',
 'level',
 'models',
 ',',
 'or',
 'ad',
 'hoc',
 'but',
 'psychologically',
 'plausible',
 'algorithmic',
 '-',
 'level',
 'models',
 '.',
 'Approximate',
 'rational',
 '\n',
 'models',
 'like',
 'particle',
 'filters',
 'can',
 'potentially',
 'bridge',
 'this',
 'divide',
 ',',
 'and',
 'allow',
 'principled',
 ',',
 'but',
 'psychologically',
 'plausible',
 '\n',
 'models',
 'of',
 'distributional',
 'learning',
 'to',
 'b

In [12]:
NLP.tokenizer??

Type:           Tokenizer
String form:    <spacy.tokenizer.Tokenizer object at 0x7fe616a09090>
File:           /media/james/Projects/GitHub/DATA_340_NLP/Notebooks/venv/lib/python3.10/site-packages/spacy/tokenizer.cpython-310-x86_64-linux-gnu.so
Docstring:
Tokenizer(Vocab vocab, rules=None, prefix_search=None, suffix_search=None, infix_finditer=None, token_match=None, url_match=None, faster_heuristics=True)
Segment text, and create Doc objects with the discovered segment
    boundaries.

    DOCS: https://spacy.io/api/tokenizer

Init docstring:
Create a `Tokenizer`, to create `Doc` objects given unicode text.

vocab (Vocab): A storage container for lexical types.
rules (dict): Exceptions and special-cases for the tokenizer.
prefix_search (callable): A function matching the signature of
    `re.compile(string).search` to match prefixes.
suffix_search (callable): A function matching the signature of
    `re.compile(string).se

### TextBlob Example

https://textblob.readthedocs.io/en/dev/_modules/textblob/tokenizers.html

In [78]:
## Tokenize the text using the TextBlob library
from textblob import TextBlob

# Create a TextBlob object
blob = TextBlob(text)

# Count the word tokens
len(blob.words)


124

In [16]:
blob??

Type:        TextBlob
String form:
Human infants have the remarkable ability to learn any human language. One proposed mechanism for this ability 
is distributional learning, where learners infer the underlying cluster structure from unlabeled input. Computational
models of distributional learning have historically been principled but psychologically-implausible
computational-level models, or ad hoc but psychologically plausible algorithmic-level models. Approximate rational
models like particle filters can potentially bridge this divide, and allow principled, but psychologically plausible
models of distributional learning to be specified and evaluated. As a proof of concept, I evaluate one such particle
filter model, applied to learning English voicing categories from distributions of voice-onset times (VOTs). 
I find that this model learns well, but behaves somewhat differently from the standard, unconstrained Gibbs
sampler implementation of the underlying ratio

### Gensim Example

https://tedboy.github.io/nlps/generated/generated/gensim.utils.tokenize.html

In [34]:
## Import the gensim library
import gensim

## Tokenize the text using the gensim library
gensim_tokens = list(gensim.utils.tokenize(text))

In [36]:
len(gensim_tokens)

128

In [37]:
gensim_tokens

['Human',
 'infants',
 'have',
 'the',
 'remarkable',
 'ability',
 'to',
 'learn',
 'any',
 'human',
 'language',
 'One',
 'proposed',
 'mechanism',
 'for',
 'this',
 'ability',
 'is',
 'distributional',
 'learning',
 'where',
 'learners',
 'infer',
 'the',
 'underlying',
 'cluster',
 'structure',
 'from',
 'unlabeled',
 'input',
 'Computational',
 'models',
 'of',
 'distributional',
 'learning',
 'have',
 'historically',
 'been',
 'principled',
 'but',
 'psychologically',
 'implausible',
 'computational',
 'level',
 'models',
 'or',
 'ad',
 'hoc',
 'but',
 'psychologically',
 'plausible',
 'algorithmic',
 'level',
 'models',
 'Approximate',
 'rational',
 'models',
 'like',
 'particle',
 'filters',
 'can',
 'potentially',
 'bridge',
 'this',
 'divide',
 'and',
 'allow',
 'principled',
 'but',
 'psychologically',
 'plausible',
 'models',
 'of',
 'distributional',
 'learning',
 'to',
 'be',
 'specified',
 'and',
 'evaluated',
 'As',
 'a',
 'proof',
 'of',
 'concept',
 'I',
 'evaluate',
 

### Stanford CoreNLP Example | Now Stanza

https://stanfordnlp.github.io/stanza/installation_usage.html#getting-started

In [79]:
## Tokenize the text using the Stanza library

# Import the Stanza library
import stanza

# Pipeline for English
stan_NLP = stanza.Pipeline(lang='en', processors='tokenize')

## Tokenize the text using the stanza library
stan_tokens = [token.text for sent in stan_NLP(text).sentences for token in sent.tokens]

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 20.4MB/s]                    
INFO:stanza:Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

INFO:stanza:Use device: gpu
INFO:stanza:Loading: tokenize
INFO:stanza:Done loading processors!


In [80]:
len(stan_tokens)

148

In [81]:
stan_NLP??

Signature:   stan_NLP(doc, processors=None)
Type:        Pipeline
String form: <Pipeline: tokenize=TokenizeProcessor(/home/james/stanza_resources/en/tokenize/combined.pt)>
File:        /media/james/Projects/GitHub/DATA_340_NLP/Notebooks/venv/lib/python3.10/site-packages/stanza/pipeline/core.py
Source:
class Pipeline:

    def __init__(self,
                 lang='en',
                 dir=DEFAULT_MODEL_DIR,
                 package='default',
                 processors={

In [95]:
from IPython.display import display, Markdown
# Display the token counts
display(Markdown("| NLTK | spaCy | TextBlob | Gensim | Stanza |\n| --- | --- | --- | --- | --- |\n| {} | {} | {} | {} | {} |".format(len(nltk_tokens), len(spacy_tokens), len(blob.words), len(gensim_tokens), len(stan_tokens))))

| NLTK | spaCy | TextBlob | Gensim | Stanza |
| --- | --- | --- | --- | --- |
| 140 | 158 | 124 | 128 | 148 |
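
Comparing the token lists suggests the counts differ for three reasons: whether punctuation is kept as tokens, whether hyphenated compounds such as `voice-onset` are split, and whether newlines are emitted as tokens. The sketch below imitates those choices with plain regular expressions on one sentence from the text. The patterns and labels are assumptions made for illustration; they are not the rules the libraries actually implement.

```python
import re

sample = "I evaluate one such particle\nfilter model, applied to voice-onset times (VOTs)."

# Hypothetical patterns that only imitate the behaviour visible in the outputs;
# they are NOT the libraries' real tokenization rules.
PATTERNS = {
    "punctuation kept, hyphens kept (NLTK-like)": r"\w+(?:-\w+)*|[^\w\s]",
    "punctuation kept, hyphens split, newlines kept (spaCy-like)": r"\w+|[^\w\s]|\n",
    "punctuation dropped, hyphens kept (TextBlob-like)": r"\w+(?:-\w+)*",
    "punctuation dropped, hyphens split (Gensim-like)": r"\w+",
    "punctuation kept, hyphens split (Stanza-like)": r"\w+|[^\w\s]",
}

# Number of tokens each pattern produces on the same sample
token_counts = {
    name: len(re.findall(pattern, sample)) for name, pattern in PATTERNS.items()
}

for name, count in token_counts.items():
    print(f"{count:>3}  {name}")
```

Even on a single sentence the five strategies give five different counts, in the same relative order as the table: the spaCy-like pattern yields the most tokens and the TextBlob-like pattern the fewest.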

In [96]:
# Display the texts
display(Markdown("| NLTK | spaCy | TextBlob | Gensim | Stanza |\n| --- | --- | --- | --- | --- |\n| {} | {} | {} | {} | {} |".format(nltk_tokens, spacy_tokens, blob.words, gensim_tokens, stan_tokens)))

| NLTK | spaCy | TextBlob | Gensim | Stanza |
| --- | --- | --- | --- | --- |
| ['Human', 'infants', 'have', 'the', 'remarkable', 'ability', 'to', 'learn', 'any', 'human', 'language', '.', 'One', 'proposed', 'mechanism', 'for', 'this', 'ability', 'is', 'distributional', 'learning', ',', 'where', 'learners', 'infer', 'the', 'underlying', 'cluster', 'structure', 'from', 'unlabeled', 'input', '.', 'Computational', 'models', 'of', 'distributional', 'learning', 'have', 'historically', 'been', 'principled', 'but', 'psychologically-implausible', 'computational-level', 'models', ',', 'or', 'ad', 'hoc', 'but', 'psychologically', 'plausible', 'algorithmic-level', 'models', '.', 'Approximate', 'rational', 'models', 'like', 'particle', 'filters', 'can', 'potentially', 'bridge', 'this', 'divide', ',', 'and', 'allow', 'principled', ',', 'but', 'psychologically', 'plausible', 'models', 'of', 'distributional', 'learning', 'to', 'be', 'specified', 'and', 'evaluated', '.', 'As', 'a', 'proof', 'of', 'concept', ',', 'I', 'evaluate', 'one', 'such', 'particle', 'filter', 'model', ',', 'applied', 'to', 'learning', 'English', 'voicing', 'categories', 'from', 'distributions', 'of', 'voice-onset', 'times', '(', 'VOTs', ')', '.', 'I', 'find', 'that', 'this', 'model', 'learns', 'well', ',', 'but', 'behaves', 'somewhat', 'differently', 'from', 'the', 'standard', ',', 'unconstrained', 'Gibbs', 'sampler', 'implementation', 'of', 'the', 'underlying', 'rational', 'model', '.'] | ['\n', 'Human', 'infants', 'have', 'the', 'remarkable', 'ability', 'to', 'learn', 'any', 'human', 'language', '.', 'One', 'proposed', 'mechanism', 'for', 'this', 'ability', '\n', 'is', 'distributional', 'learning', ',', 'where', 'learners', 'infer', 'the', 'underlying', 'cluster', 'structure', 'from', 'unlabeled', 'input', '.', 'Computational', '\n', 'models', 'of', 'distributional', 'learning', 'have', 'historically', 'been', 'principled', 'but', 'psychologically', '-', 'implausible', '\n', 'computational', '-', 'level', 'models', ',', 'or', 'ad', 'hoc', 'but', 'psychologically', 'plausible', 
'algorithmic', '-', 'level', 'models', '.', 'Approximate', 'rational', '\n', 'models', 'like', 'particle', 'filters', 'can', 'potentially', 'bridge', 'this', 'divide', ',', 'and', 'allow', 'principled', ',', 'but', 'psychologically', 'plausible', '\n', 'models', 'of', 'distributional', 'learning', 'to', 'be', 'specified', 'and', 'evaluated', '.', 'As', 'a', 'proof', 'of', 'concept', ',', 'I', 'evaluate', 'one', 'such', 'particle', '\n', 'filter', 'model', ',', 'applied', 'to', 'learning', 'English', 'voicing', 'categories', 'from', 'distributions', 'of', 'voice', '-', 'onset', 'times', '(', 'VOTs', ')', '.', '\n', 'I', 'find', 'that', 'this', 'model', 'learns', 'well', ',', 'but', 'behaves', 'somewhat', 'differently', 'from', 'the', 'standard', ',', 'unconstrained', 'Gibbs', '\n', 'sampler', 'implementation', 'of', 'the', 'underlying', 'rational', 'model', '.', '\n'] | ['Human', 'infants', 'have', 'the', 'remarkable', 'ability', 'to', 'learn', 'any', 'human', 'language', 'One', 'proposed', 'mechanism', 'for', 'this', 'ability', 'is', 'distributional', 'learning', 'where', 'learners', 'infer', 'the', 'underlying', 'cluster', 'structure', 'from', 'unlabeled', 'input', 'Computational', 'models', 'of', 'distributional', 'learning', 'have', 'historically', 'been', 'principled', 'but', 'psychologically-implausible', 'computational-level', 'models', 'or', 'ad', 'hoc', 'but', 'psychologically', 'plausible', 'algorithmic-level', 'models', 'Approximate', 'rational', 'models', 'like', 'particle', 'filters', 'can', 'potentially', 'bridge', 'this', 'divide', 'and', 'allow', 'principled', 'but', 'psychologically', 'plausible', 'models', 'of', 'distributional', 'learning', 'to', 'be', 'specified', 'and', 'evaluated', 'As', 'a', 'proof', 'of', 'concept', 'I', 'evaluate', 'one', 'such', 'particle', 'filter', 'model', 'applied', 'to', 'learning', 'English', 'voicing', 'categories', 'from', 'distributions', 'of', 'voice-onset', 'times', 'VOTs', 'I', 'find', 'that', 'this', 'model', 
'learns', 'well', 'but', 'behaves', 'somewhat', 'differently', 'from', 'the', 'standard', 'unconstrained', 'Gibbs', 'sampler', 'implementation', 'of', 'the', 'underlying', 'rational', 'model'] | ['Human', 'infants', 'have', 'the', 'remarkable', 'ability', 'to', 'learn', 'any', 'human', 'language', 'One', 'proposed', 'mechanism', 'for', 'this', 'ability', 'is', 'distributional', 'learning', 'where', 'learners', 'infer', 'the', 'underlying', 'cluster', 'structure', 'from', 'unlabeled', 'input', 'Computational', 'models', 'of', 'distributional', 'learning', 'have', 'historically', 'been', 'principled', 'but', 'psychologically', 'implausible', 'computational', 'level', 'models', 'or', 'ad', 'hoc', 'but', 'psychologically', 'plausible', 'algorithmic', 'level', 'models', 'Approximate', 'rational', 'models', 'like', 'particle', 'filters', 'can', 'potentially', 'bridge', 'this', 'divide', 'and', 'allow', 'principled', 'but', 'psychologically', 'plausible', 'models', 'of', 'distributional', 'learning', 'to', 'be', 'specified', 'and', 'evaluated', 'As', 'a', 'proof', 'of', 'concept', 'I', 'evaluate', 'one', 'such', 'particle', 'filter', 'model', 'applied', 'to', 'learning', 'English', 'voicing', 'categories', 'from', 'distributions', 'of', 'voice', 'onset', 'times', 'VOTs', 'I', 'find', 'that', 'this', 'model', 'learns', 'well', 'but', 'behaves', 'somewhat', 'differently', 'from', 'the', 'standard', 'unconstrained', 'Gibbs', 'sampler', 'implementation', 'of', 'the', 'underlying', 'rational', 'model'] | ['Human', 'infants', 'have', 'the', 'remarkable', 'ability', 'to', 'learn', 'any', 'human', 'language', '.', 'One', 'proposed', 'mechanism', 'for', 'this', 'ability', 'is', 'distributional', 'learning', ',', 'where', 'learners', 'infer', 'the', 'underlying', 'cluster', 'structure', 'from', 'unlabeled', 'input', '.', 'Computational', 'models', 'of', 'distributional', 'learning', 'have', 'historically', 'been', 'principled', 'but', 'psychologically', '-', 'implausible', 
'computational', '-', 'level', 'models', ',', 'or', 'ad', 'hoc', 'but', 'psychologically', 'plausible', 'algorithmic', '-', 'level', 'models', '.', 'Approximate', 'rational', 'models', 'like', 'particle', 'filters', 'can', 'potentially', 'bridge', 'this', 'divide', ',', 'and', 'allow', 'principled', ',', 'but', 'psychologically', 'plausible', 'models', 'of', 'distributional', 'learning', 'to', 'be', 'specified', 'and', 'evaluated', '.', 'As', 'a', 'proof', 'of', 'concept', ',', 'I', 'evaluate', 'one', 'such', 'particle', 'filter', 'model', ',', 'applied', 'to', 'learning', 'English', 'voicing', 'categories', 'from', 'distributions', 'of', 'voice', '-', 'onset', 'times', '(', 'VOTs', ')', '.', 'I', 'find', 'that', 'this', 'model', 'learns', 'well', ',', 'but', 'behaves', 'somewhat', 'differently', 'from', 'the', 'standard', ',', 'unconstrained', 'Gibbs', 'sampler', 'implementation', 'of', 'the', 'underlying', 'rational', 'model', '.'] |

## Stemming Libraries

Stemming is the computational process of reducing inflected (or sometimes derived) words to a word stem, base, or root form. The stem is a written form that need not be a dictionary word: in the output below, the Porter stemmer turns `ability` into `abil`.
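
To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm used by the libraries in this section; the suffix list, the `toy_stem` name, and the minimum stem length are all assumptions chosen for illustration.

```python
# Hypothetical rule list, checked in order; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "ly", "s")


def toy_stem(word, min_stem_length=3):
    """Lowercase the word and strip the first matching suffix,
    provided the remaining stem is at least min_stem_length characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ["infants", "learning", "learns", "proposed", "is"]]
print(stems)
```

Like a real stemmer, this rule-based approach maps `learning` and `learns` to the same stem, but it also produces non-words such as `propos`, because it never consults a dictionary.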

### NLTK Example

https://www.nltk.org/howto/stem.html

In [82]:
## NLTK Stemming

## import the PorterStemmer
from nltk.stem import PorterStemmer

## Create an instance of the PorterStemmer
stemmer = PorterStemmer()

## Stem the tokens
nltk_stemmed_tokens = [stemmer.stem(token) for token in nltk_tokens]



In [83]:
## Examine the stemmed tokens
print(" ".join(nltk_stemmed_tokens))

human infant have the remark abil to learn ani human languag . one propos mechan for thi abil is distribut learn , where learner infer the underli cluster structur from unlabel input . comput model of distribut learn have histor been principl but psychologically-implaus computational-level model , or ad hoc but psycholog plausibl algorithmic-level model . approxim ration model like particl filter can potenti bridg thi divid , and allow principl , but psycholog plausibl model of distribut learn to be specifi and evalu . as a proof of concept , i evalu one such particl filter model , appli to learn english voic categori from distribut of voice-onset time ( vot ) . i find that thi model learn well , but behav somewhat differ from the standard , unconstrain gibb sampler implement of the underli ration model .


### spaCy Example

spaCy does not have a built-in stemmer, but it does have a lemmatizer. See below.

### TextBlob Example

In [44]:
## TextBlob Stemming

## Stem the tokens
blob = TextBlob(text)

blob_stemmed_tokens = [word.stem() for word in blob.words]

In [45]:
print(" ".join(blob_stemmed_tokens))

human infant have the remark abil to learn ani human languag one propos mechan for thi abil is distribut learn where learner infer the underli cluster structur from unlabel input comput model of distribut learn have histor been principl but psychologically-implaus computational-level model or ad hoc but psycholog plausibl algorithmic-level model approxim ration model like particl filter can potenti bridg thi divid and allow principl but psycholog plausibl model of distribut learn to be specifi and evalu as a proof of concept i evalu one such particl filter model appli to learn english voic categori from distribut of voice-onset time vot i find that thi model learn well but behav somewhat differ from the standard unconstrain gibb sampler implement of the underli ration model


### Gensim Example

In [50]:
## Gensim Stemming
import gensim
from gensim.parsing import stem_text

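# stem_text works on the raw string: it lowercases the text, splits on whitespace,
# stems each piece, and returns one string (not a list of tokens)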
gensim_stemmed_tokens = stem_text(text)

In [51]:
gensim_stemmed_tokens

'human infant have the remark abil to learn ani human language. on propos mechan for thi abil is distribut learning, where learner infer the underli cluster structur from unlabel input. comput model of distribut learn have histor been principl but psychologically-implaus computational-level models, or ad hoc but psycholog plausibl algorithmic-level models. approxim ration model like particl filter can potenti bridg thi divide, and allow principled, but psycholog plausibl model of distribut learn to be specifi and evaluated. as a proof of concept, i evalu on such particl filter model, appli to learn english voic categori from distribut of voice-onset time (vots). i find that thi model learn well, but behav somewhat differ from the standard, unconstrain gibb sampler implement of the underli ration model.'

### Stanford CoreNLP Example

Stanza does not have a built-in stemmer, but it does have a lemmatizer. See below.

## Lemmatization Libraries

### NLTK Example

In [55]:
import nltk
nltk.download('wordnet')

## NLTK Lemmatization
from nltk.stem import WordNetLemmatizer


# Create an instance of the WordNetLemmatizer
nltk_lemmatizer = WordNetLemmatizer()

# Lemmatize the tokens
nltk_lemmas = [nltk_lemmatizer.lemmatize(token) for token in nltk_tokens]

[nltk_data] Downloading package wordnet to /home/james/nltk_data...


In [57]:
print(" ".join(nltk_lemmas))

Human infant have the remarkable ability to learn any human language . One proposed mechanism for this ability is distributional learning , where learner infer the underlying cluster structure from unlabeled input . Computational model of distributional learning have historically been principled but psychologically-implausible computational-level model , or ad hoc but psychologically plausible algorithmic-level model . Approximate rational model like particle filter can potentially bridge this divide , and allow principled , but psychologically plausible model of distributional learning to be specified and evaluated . As a proof of concept , I evaluate one such particle filter model , applied to learning English voicing category from distribution of voice-onset time ( VOTs ) . I find that this model learns well , but behaves somewhat differently from the standard , unconstrained Gibbs sampler implementation of the underlying rational model .
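
The output above leaves verb forms such as `learns`, `behaves`, and `proposed` unchanged. `WordNetLemmatizer.lemmatize` accepts an optional `pos` argument that defaults to noun, so every token here was looked up as a noun. The sketch below shows why the part of speech matters, using a tiny hand-written lexicon. The `LEXICON` table and `toy_lemmatize` function are hypothetical stand-ins, not NLTK's implementation.

```python
# Hypothetical mini-lexicon: (surface form, part of speech) -> lemma
LEXICON = {
    ("learns", "v"): "learn",
    ("behaves", "v"): "behave",
    ("proposed", "v"): "propose",
    ("models", "n"): "model",
    ("infants", "n"): "infant",
}


def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; return it unchanged if absent."""
    return LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("models"))           # found under the default noun tag
print(toy_lemmatize("learns"))           # no noun entry, so returned unchanged
print(toy_lemmatize("learns", pos="v"))  # found once the verb tag is supplied
```

With a dictionary-based lemmatizer, a word is only reduced when it is looked up under the right part of speech, which is why pipelines that tag parts of speech first tend to lemmatize verbs more completely.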


### spaCy Example

https://spacy.io/api/lemmatizer

In [58]:
## spaCy Lemmatization
import spacy

## Load the small English spaCy pipeline
spacy_NLP = spacy.load('en_core_web_sm')

## Lemmatize the tokens
spacy_lemmas = [token.lemma_ for token in spacy_NLP(text)]



In [60]:
print(" ".join(spacy_lemmas))


 human infant have the remarkable ability to learn any human language . one propose mechanism for this ability 
 be distributional learning , where learner infer the underlie cluster structure from unlabeled input . computational 
 model of distributional learning have historically be principle but psychologically - implausible 
 computational - level model , or ad hoc but psychologically plausible algorithmic - level model . approximate rational 
 model like particle filter can potentially bridge this divide , and allow principle , but psychologically plausible 
 model of distributional learning to be specify and evaluate . as a proof of concept , I evaluate one such particle 
 filter model , apply to learn english voicing category from distribution of voice - onset time ( VOTs ) . 
 I find that this model learn well , but behave somewhat differently from the standard , unconstrained Gibbs 
 sampler implementation of the underlying rational model . 



### TextBlob Example

In [62]:
from textblob import TextBlob

## Create a TextBlob object
blob = TextBlob(text)

## Lemmatize the blob
blob_lemmas = [word.lemmatize() for word in blob.words]


In [64]:
print(" ".join(blob_lemmas))

Human infant have the remarkable ability to learn any human language One proposed mechanism for this ability is distributional learning where learner infer the underlying cluster structure from unlabeled input Computational model of distributional learning have historically been principled but psychologically-implausible computational-level model or ad hoc but psychologically plausible algorithmic-level model Approximate rational model like particle filter can potentially bridge this divide and allow principled but psychologically plausible model of distributional learning to be specified and evaluated As a proof of concept I evaluate one such particle filter model applied to learning English voicing category from distribution of voice-onset time VOTs I find that this model learns well but behaves somewhat differently from the standard unconstrained Gibbs sampler implementation of the underlying rational model


### Gensim Example

Gensim no longer ships a lemmatizer. Earlier versions wrapped the lemmatizer from `Pattern`, but that wrapper has since been removed.

### Stanford CoreNLP Example

https://stanfordnlp.github.io/stanza/lemma.html

In [70]:
import stanza

## Build a Stanza pipeline that tokenizes and lemmatizes
stanza_NLP = stanza.Pipeline(lang='en', processors='tokenize,lemma')

## Create a document object
doc = stanza_NLP(text)

## Lemmatize the tokens
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 20.6MB/s]                    
INFO:stanza:Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| lemma     | combined |

INFO:stanza:Use device: gpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: lemma
INFO:stanza:Done loading processors!


word: Human 	lemma: human
word: infants 	lemma: infant
word: have 	lemma: have
word: the 	lemma: the
word: remarkable 	lemma: remarkable
word: ability 	lemma: ability
word: to 	lemma: to
word: learn 	lemma: learn
word: any 	lemma: any
word: human 	lemma: human
word: language 	lemma: language
word: . 	lemma: .
word: One 	lemma: one
word: proposed 	lemma: propose
word: mechanism 	lemma: mechanism
word: for 	lemma: for
word: this 	lemma: this
word: ability 	lemma: ability
word: is 	lemma: be
word: distributional 	lemma: distributional
word: learning 	lemma: learning
word: , 	lemma: ,
word: where 	lemma: where
word: learners 	lemma: learner
word: infer 	lemma: infer
word: the 	lemma: the
word: underlying 	lemma: underlying
word: cluster 	lemma: cluster
word: structure 	lemma: structure
word: from 	lemma: from
word: unlabeled 	lemma: unlabel
word: input 	lemma: input
word: . 	lemma: .
word: Computational 	lemma: Computational
word: models 	lemma: model
word: of 	lemma: of
word: distribution

## Stopwords Libraries

`Stopwords` are words so common that they contribute little to most analyses. For example, the word `the` is a stopword. To normalize our text with respect to stopwords, we remove them from our corpus.
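
A minimal sketch of stopword removal follows, using only the standard library. The `STOPWORDS` set is a small hand-written assumption for illustration; each library covered here ships its own, much longer list.

```python
# Hypothetical stopword list; real libraries provide far larger ones.
STOPWORDS = {"the", "of", "to", "a", "and", "is", "this", "for", "from", "but", "have", "any"}

# First sentence of the lecture text, split on whitespace for simplicity
tokens = "Human infants have the remarkable ability to learn any human language".split()

# Compare case-insensitively so capitalised stopwords are also removed
content_tokens = [token for token in tokens if token.lower() not in STOPWORDS]

print(content_tokens)
```

Four of the eleven tokens are dropped, leaving the words that carry most of the sentence's meaning.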

### NLTK Example

### spaCy Example

### TextBlob Example

### Gensim Example

### Stanford CoreNLP Example