<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Text Pre-processing
Text is messy, and it usually needs substantial pre-processing before it is useful for modeling.  A text pre-processing pipeline will generally include at least the following steps:  
- Tokenize the text - split it into words and punctuation  
- Remove stop words and punctuation  
- Convert words to their root forms using lemmatization or stemming  

This notebook walks through a basic example of how to perform those steps using two common NLP libraries: [NLTK](https://www.nltk.org) and [spaCy](https://spacy.io).
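
Before turning to the libraries, the three steps can be sketched in plain Python to show how they chain together. This is only an illustration under stated assumptions: `TOY_STOP_WORDS` and `TOY_LEMMAS` are hypothetical hand-written stand-ins for the much larger resources that NLTK and spaCy ship, and the function names are made up for this sketch.

```python
import re

# Hypothetical toy resources, for illustration only; real pipelines use the
# much larger stop word lists and lemmatizers that NLTK and spaCy provide.
TOY_STOP_WORDS = {"i", "some", "the", "then", "they", "off"}
TOY_LEMMAS = {"saw": "see", "geese": "goose", "took": "take", "flying": "fly"}


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def remove_stop_words_and_punctuation(tokens):
    """Drop tokens in the toy stop list and tokens with no letters or digits."""
    return [
        t for t in tokens
        if t.lower() not in TOY_STOP_WORDS and any(ch.isalnum() for ch in t)
    ]


def lemmatize(tokens):
    """Map each token to its root form using the toy lookup table."""
    return [TOY_LEMMAS.get(t.lower(), t.lower()) for t in tokens]


def preprocess(text):
    """Run the three steps in order: tokenize, filter, lemmatize."""
    return lemmatize(remove_stop_words_and_punctuation(tokenize(text)))


result = preprocess("I saw some geese near the pond. Then they took off flying.")
print(result)
```

The order of the filter and lemmatize steps is a design choice; the NLTK walkthrough below filters first, while the spaCy walkthrough lemmatizes first.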


In [5]:
import string

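# Import spaCy; the model download only needs to run once
import spacy
#!python -m spacy download en_core_web_sm

# Import NLTK and download the resources used below
import nltk
nltk.download('omw-1.4')
# The cells below also rely on the tokenizer, stop word, and WordNet resources.
# Uncomment if they are not installed yet (newer NLTK releases use 'punkt_tab'):
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')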

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package omw-1.4 to /Users/jjr10/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [130]:
example_doc = '''I saw some geese near the pond. Then they took off flying.'''

## NLTK
Let's first use NLTK to pre-process our text.  We'll start by tokenizing our sentence, then remove punctuation and stop words, and then we will lemmatize the tokens to get the root words.

### Tokenization

In [131]:
# Convert to tokens
tokens = nltk.word_tokenize(example_doc)

print(tokens)

['I', 'saw', 'some', 'geese', 'near', 'the', 'pond', '.', 'Then', 'they', 'took', 'off', 'flying', '.']


### Remove stop words & punctuation

In [172]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

{'she', "hasn't", 'but', 'hasn', 'because', "you're", "don't", 'he', 'ours', 'shouldn', 'when', "you'll", 'isn', 'from', 'where', 'any', 'no', 'few', "isn't", 'them', 'up', 'won', 'yourself', 'only', 'haven', 'don', 'ain', 'an', 'on', 'while', 'all', 'why', 'wouldn', 'did', 'then', 'himself', 'there', 'now', 'couldn', 'in', 'hadn', 's', 'd', 'is', 'if', 'through', 'which', 'o', 'here', "haven't", 'under', 'been', 'mightn', 'wasn', 'i', 'his', "shouldn't", "wasn't", 'we', 'needn', 'this', 'once', 'll', 'just', "shan't", 'can', 'themselves', "couldn't", 'm', 'below', 'by', 'out', "mightn't", "it's", 'at', "wouldn't", 't', 'your', 'such', 'over', "you've", 'was', 'has', 'above', 've', 'their', "doesn't", 'during', 'have', 'and', 'be', 'between', 'ma', "needn't", "weren't", 'her', 'those', 'having', 'the', "won't", 'itself', "she's", 'him', 'will', 'further', 'other', 'whom', "should've", 'about', 'same', "you'd", 'too', 'y', "aren't", 'theirs', 'or', 'myself', 'who', 'these', 'than', 'bei

In [173]:
punctuations = string.punctuation

# Filter out stop words and punctuation
tokens = [w for w in tokens if w.lower() not in stop_words and w not in punctuations]
 
print(tokens)

['saw', 'geese', 'near', 'pond', 'took', 'flying']
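
One detail of the filter above is worth noting: `punctuations` is a string, so `w not in punctuations` is a substring test rather than a per-character membership test. Single-character tokens behave as expected, but a multi-character token such as `...` or `--`, which tokenizers can emit, is only removed if it happens to appear as a contiguous run inside `string.punctuation`. A minimal sketch of a stricter check follows; `is_punctuation` is a hypothetical helper, not part of NLTK.

```python
import string

PUNCT_CHARS = set(string.punctuation)


def is_punctuation(token):
    """Return True when the token is non-empty and made up only of punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)


sample = ["pond", ".", "...", "--", "off"]

# Substring test against the string: multi-character punctuation slips through
kept_substring = [t for t in sample if t not in string.punctuation]

# Per-character test against the set: every punctuation-only token is removed
kept_per_char = [t for t in sample if not is_punctuation(t)]

print(kept_substring)
print(kept_per_char)
```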


### Lemmatize

In [174]:
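from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun unless a pos argument is passed,
# so verb forms such as 'took' and 'flying' are left unchanged here
tokens = [wordnet_lemmatizer.lemmatize(word).lower().strip() for word in tokens]
print(tokens)

['saw', 'goose', 'near', 'pond', 'took', 'flying']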
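
The introduction lists stemming as an alternative to lemmatization, but this notebook only demonstrates lemmatization. As a rough sketch of the difference: a stemmer strips suffixes by rule, while a lemmatizer looks words up in a dictionary. Both pieces below are assumptions made for illustration: `crude_stem` is a hypothetical toy suffix stripper (far simpler than the stemmers NLTK provides) and `TOY_LEMMAS` is a hand-written lookup table.

```python
# Hypothetical toy suffix rules, checked in order; not the Porter algorithm.
SUFFIXES = ("ing", "es", "s")

# Hypothetical hand-written lemma table standing in for a real dictionary.
TOY_LEMMAS = {"geese": "goose", "flying": "fly", "ponds": "pond", "took": "take"}


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the toy table, falling back to the word itself."""
    return TOY_LEMMAS.get(word, word)


words = ["geese", "flying", "ponds", "took"]
stems = [crude_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)
print(lemmas)
```

In this sketch the rule-based stripper handles regular suffixes (`flying`, `ponds`) but leaves irregular forms such as `geese` and `took` untouched, which is where a dictionary lookup helps.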


In [175]:
# Combine the filtered lemmas back into a string
doc_processed = " ".join(tokens)

print('Original:')
print(example_doc)
print('Processed:')
print(doc_processed)

Original:
I saw some geese near the pond. Then they took off flying.
Processed:
saw goose near pond took flying


## spaCy
Let's now walk through the same example using spaCy.  As with NLTK, we first tokenize the text.  spaCy's tokens differ from NLTK's, though: NLTK produces plain strings, while each spaCy token is an object carrying additional information about the word, such as its part of speech and root form.  We will therefore extract the lemmas from the spaCy tokens first, and then remove stop words and punctuation from the resulting list of lemma strings.

### Tokenization

In [176]:
# Process sentence
nlp = spacy.load("en_core_web_sm")
doc = nlp(example_doc)
# Get tokens (each element is a spaCy Token object, not a string)
tokens = list(doc)

print(tokens)

[I, saw, some, geese, near, the, pond, ., Then, they, took, off, flying, .]


### Lemmatization

In [177]:
# Extract the lemmas for each token
tokens = [token.lemma_.lower().strip() for token in tokens]
print(tokens)

['i', 'see', 'some', 'geese', 'near', 'the', 'pond', '.', 'then', 'they', 'take', 'off', 'fly', '.']


### Remove stop words and punctuation

In [178]:
from spacy.lang.en.stop_words import STOP_WORDS
# Use a distinct name so the NLTK `stopwords` module imported earlier is not overwritten
spacy_stop_words = set(STOP_WORDS)
punctuations = string.punctuation

tokens = [token for token in tokens if token.lower() not in spacy_stop_words and token not in punctuations]
print(tokens)

['geese', 'near', 'pond', 'fly']


In [179]:
# Combine the filtered lemmas back into a string
doc_processed = " ".join(tokens)

print('Original:')
print(example_doc)
print('Processed:')
print(doc_processed)

Original:
I saw some geese near the pond. Then they took off flying.
Processed:
geese near pond fly
