# Brief nltk and spaCy Tutorial for Text Preprocessing

We will use both the nltk and spaCy libraries for data preprocessing; in our case, that means text-cleaning tasks.

## Documentation

**nltk** : https://www.nltk.org/ 

**spaCy**: https://spacy.io/ 

## Installation 

**nltk**: https://www.nltk.org/install.html 

**spaCy**: https://spacy.io/usage/ 



## Imports 

We start by importing the necessary packages. Make sure you have downloaded the nltk data and a spaCy language model; see the commands and links below.

**nltk**

`nltk.download()` 
https://www.nltk.org/data.html

**spaCy**

`python -m spacy download en_core_web_sm  # small model`

`python -m spacy download en_core_web_md  # medium model`

`python -m spacy download en_core_web_lg  # large model, we will be using this`

https://spacy.io/usage/models
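
Because `python -m spacy download` installs each model as a regular Python package, one way to check which models are already present is to look the package names up with the standard library. The snippet below is a minimal sketch under that assumption; the helper name `is_installed` is hypothetical and not part of spaCy.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# Report which of the three English models from above are available.
for model in ("en_core_web_sm", "en_core_web_md", "en_core_web_lg"):
    status = "installed" if is_installed(model) else "missing"
    print(f"{model}: {status}")
```

If the large model is reported as missing, `spacy.load("en_core_web_lg")` in the import cell will fail until the download command has been run.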


In [25]:
import re
import nltk
import spacy 
from nltk.corpus import stopwords # to filter stopwords
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk.stem import WordNetLemmatizer # lemmatizer 
from nltk.stem.snowball import SnowballStemmer # stemmer

# nltk.download() # uncomment to download the nltk data; this takes a good while

nlp = spacy.load("en_core_web_lg") # large web model initialization
stopwords = set(stopwords.words('english')) # obtain English stopwords


Now we define some sample text and walk through the common preprocessing steps.

In [8]:
# example text 
text = """Megathread: Bernie Sanders Undergoes Emergency Heart Procedure, Suspends Campaign Events Until Further Notice
Megathread
Sen. Bernie Sanders underwent heart surgery after he experienced chest discomfort during a campaign event on Tuesday, his campaign said.

Jeff Weaver, a senior adviser to Sanders's campaign, said a medical evaluation of the Vermont senator discovered blockage in one of his arteries, and two stents were successfully inserted.

Sen. Sanders is conversing and in good spirits, Weaver said. He will be resting up over the next few days. We are canceling his events and appearances until further notice, and we will continue to provide appropriate updates."""

Both libraries handle the steps that follow reasonably well. However, we will mostly use spaCy whenever possible.


## nltk: tokenizing

In [9]:
word_tokens = word_tokenize(text, language='english') # tokenize by words (returns a list)
sent_tokens = sent_tokenize(text, language='english') # tokenize by sentence (returns a list)
print(word_tokens, "\n") 
print(sent_tokens, "\n")

['Megathread', ':', 'Bernie', 'Sanders', 'Undergoes', 'Emergency', 'Heart', 'Procedure', ',', 'Suspends', 'Campaign', 'Events', 'Until', 'Further', 'Notice', 'Megathread', 'Sen.', 'Bernie', 'Sanders', 'underwent', 'heart', 'surgery', 'after', 'he', 'experienced', 'chest', 'discomfort', 'during', 'a', 'campaign', 'event', 'on', 'Tuesday', ',', 'his', 'campaign', 'said', '.', 'Jeff', 'Weaver', ',', 'a', 'senior', 'adviser', 'to', 'Sanders', "'s", 'campaign', ',', 'said', 'a', 'medical', 'evaluation', 'of', 'the', 'Vermont', 'senator', 'discovered', 'blockage', 'in', 'one', 'of', 'his', 'arteries', ',', 'and', 'two', 'stents', 'were', 'successfully', 'inserted', '.', 'Sen.', 'Sanders', 'is', 'conversing', 'and', 'in', 'good', 'spirits', ',', 'Weaver', 'said', '.', 'He', 'will', 'be', 'resting', 'up', 'over', 'the', 'next', 'few', 'days', '.', 'We', 'are', 'canceling', 'his', 'events', 'and', 'appearances', 'until', 'further', 'notice', ',', 'and', 'we', 'will', 'continue', 'to', 'provide'
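
`word_tokenize` keeps punctuation as separate tokens and needs the downloaded nltk data. When only alphabetic words are wanted, a regular-expression tokenizer may be enough. The following is a small sketch using nltk's `RegexpTokenizer`; the sample sentence and the pattern are illustrative choices, not part of the original notebook.

```python
from nltk.tokenize import RegexpTokenizer

# Illustrative sentence built from phrases in the example text.
sample = "Sen. Sanders's campaign said two stents were inserted."

# Match runs of ASCII letters only; everything else acts as a separator.
alpha_tokenizer = RegexpTokenizer(r"[A-Za-z]+")
tokens = alpha_tokenizer.tokenize(sample)
print(tokens)
# ['Sen', 'Sanders', 's', 'campaign', 'said', 'two', 'stents', 'were', 'inserted']
```

Note the trade-off: the abbreviation "Sen." loses its period and the possessive "'s" becomes a stray "s", whereas `word_tokenize` keeps "Sen." and "'s" as tokens.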

## spaCy: tokenizing

In [10]:
doc = nlp(text) # process text with the model 

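# keep purely alphabetic tokens; this drops punctuation as well as tokens such as "Sen." and "'s"
sp_word_tokens = [token.text for token in doc if token.text.isalpha()]
sp_sent_tokens = list(doc.sents) # sentence spans detected by the pipeline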
print(sp_word_tokens, "\n")
print(sp_sent_tokens, "\n")

['Megathread', 'Bernie', 'Sanders', 'Undergoes', 'Emergency', 'Heart', 'Procedure', 'Suspends', 'Campaign', 'Events', 'Until', 'Further', 'Notice', 'Megathread', 'Bernie', 'Sanders', 'underwent', 'heart', 'surgery', 'after', 'he', 'experienced', 'chest', 'discomfort', 'during', 'a', 'campaign', 'event', 'on', 'Tuesday', 'his', 'campaign', 'said', 'Jeff', 'Weaver', 'a', 'senior', 'adviser', 'to', 'Sanders', 'campaign', 'said', 'a', 'medical', 'evaluation', 'of', 'the', 'Vermont', 'senator', 'discovered', 'blockage', 'in', 'one', 'of', 'his', 'arteries', 'and', 'two', 'stents', 'were', 'successfully', 'inserted', 'Sanders', 'is', 'conversing', 'and', 'in', 'good', 'spirits', 'Weaver', 'said', 'He', 'will', 'be', 'resting', 'up', 'over', 'the', 'next', 'few', 'days', 'We', 'are', 'canceling', 'his', 'events', 'and', 'appearances', 'until', 'further', 'notice', 'and', 'we', 'will', 'continue', 'to', 'provide', 'appropriate', 'updates'] 

[Megathread: Bernie Sanders Undergoes Emergency Hear

## stemming (nltk) 


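Basic idea: chop off word endings with heuristic rules so that related word forms share a common stem. The stem is lowercased and need not be a dictionary word (for example, "surgery" becomes "surgeri").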

In [11]:
stemmer = SnowballStemmer(language='english')  # stemmer algorithm 
stemmer.stem("Cats") # example: returns 'cat' (not displayed, since it is not the last expression in the cell)

stemmed = [stemmer.stem(word) for word in word_tokens]  # the whole text 
print(stemmed) 

['megathread', ':', 'berni', 'sander', 'undergo', 'emerg', 'heart', 'procedur', ',', 'suspend', 'campaign', 'event', 'until', 'further', 'notic', 'megathread', 'sen.', 'berni', 'sander', 'underw', 'heart', 'surgeri', 'after', 'he', 'experienc', 'chest', 'discomfort', 'dure', 'a', 'campaign', 'event', 'on', 'tuesday', ',', 'his', 'campaign', 'said', '.', 'jeff', 'weaver', ',', 'a', 'senior', 'advis', 'to', 'sander', "'s", 'campaign', ',', 'said', 'a', 'medic', 'evalu', 'of', 'the', 'vermont', 'senat', 'discov', 'blockag', 'in', 'one', 'of', 'his', 'arteri', ',', 'and', 'two', 'stent', 'were', 'success', 'insert', '.', 'sen.', 'sander', 'is', 'convers', 'and', 'in', 'good', 'spirit', ',', 'weaver', 'said', '.', 'he', 'will', 'be', 'rest', 'up', 'over', 'the', 'next', 'few', 'day', '.', 'we', 'are', 'cancel', 'his', 'event', 'and', 'appear', 'until', 'further', 'notic', ',', 'and', 'we', 'will', 'continu', 'to', 'provid', 'appropri', 'updat', '.']
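
To look at individual stems rather than the whole list, a few words from the example text can be stemmed one by one. This is a minimal sketch; the word selection is arbitrary, and the expected stems are the ones that appear in the output above.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language="english")

# A handful of words taken from the example text.
words = ["Cats", "surgery", "arteries", "underwent", "successfully"]

# Map each word to its stem; the stemmer also lowercases its input.
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
# Cats -> cat
# surgery -> surgeri
# arteries -> arteri
# underwent -> underw
# successfully -> success
```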


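## stemming (spaCy) 

spaCy does not ship a stemmer; it provides lemmatization instead (see the spaCy lemmatization section further down). The cell below therefore only repeats the spaCy tokenization. To stem these tokens, pass each `token.text` to an nltk stemmer such as the `SnowballStemmer` used above.

In [12]:
doc = nlp(text) # process the text with the model 

sp_word_tokens = [token.text for token in doc if token.text.isalpha()] # keep alphabetic tokens only
sp_sent_tokens = [sent for sent in doc.sents]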
print(sp_word_tokens, "\n")
print(sp_sent_tokens, "\n")

['Megathread', 'Bernie', 'Sanders', 'Undergoes', 'Emergency', 'Heart', 'Procedure', 'Suspends', 'Campaign', 'Events', 'Until', 'Further', 'Notice', 'Megathread', 'Bernie', 'Sanders', 'underwent', 'heart', 'surgery', 'after', 'he', 'experienced', 'chest', 'discomfort', 'during', 'a', 'campaign', 'event', 'on', 'Tuesday', 'his', 'campaign', 'said', 'Jeff', 'Weaver', 'a', 'senior', 'adviser', 'to', 'Sanders', 'campaign', 'said', 'a', 'medical', 'evaluation', 'of', 'the', 'Vermont', 'senator', 'discovered', 'blockage', 'in', 'one', 'of', 'his', 'arteries', 'and', 'two', 'stents', 'were', 'successfully', 'inserted', 'Sanders', 'is', 'conversing', 'and', 'in', 'good', 'spirits', 'Weaver', 'said', 'He', 'will', 'be', 'resting', 'up', 'over', 'the', 'next', 'few', 'days', 'We', 'are', 'canceling', 'his', 'events', 'and', 'appearances', 'until', 'further', 'notice', 'and', 'we', 'will', 'continue', 'to', 'provide', 'appropriate', 'updates'] 

[Megathread: Bernie Sanders Undergoes Emergency Hear

## lemmatization (nltk) 

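Bring each word back to its dictionary form (its lemma). By default, `WordNetLemmatizer.lemmatize` treats every token as a noun (`pos='n'`), so verb forms such as "underwent" stay unchanged unless a part-of-speech tag is passed.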

In [14]:
lemmatizer = WordNetLemmatizer() # instantiate nltk lemmatizer 

# lemmatize all tokens (default part of speech is noun)
lemm_tokens_nltk = [lemmatizer.lemmatize(token) for token in word_tokens] 
print(lemm_tokens_nltk)


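## lemmatization (spaCy)

spaCy's lemmatizer uses the part-of-speech tags assigned by the pipeline, so it handles verb forms that nltk's default noun lemmatization leaves unchanged (for example, "underwent" becomes "undergo"). spaCy also provides models for many languages other than English.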

In [15]:
doc = nlp(text) # process text with the model 
doc_lemmas = [(token.text, token.lemma_) for token in doc if token.text.isalpha()]
print(doc_lemmas) # pronouns show up with the placeholder lemma '-PRON-' in the output below

print("Lemma: ", doc_lemmas[0][1]) # how to access lemmas 

[('Megathread', 'megathread'), ('Bernie', 'Bernie'), ('Sanders', 'Sanders'), ('Undergoes', 'undergo'), ('Emergency', 'Emergency'), ('Heart', 'Heart'), ('Procedure', 'Procedure'), ('Suspends', 'suspend'), ('Campaign', 'campaign'), ('Events', 'event'), ('Until', 'until'), ('Further', 'further'), ('Notice', 'notice'), ('Megathread', 'Megathread'), ('Bernie', 'Bernie'), ('Sanders', 'Sanders'), ('underwent', 'undergo'), ('heart', 'heart'), ('surgery', 'surgery'), ('after', 'after'), ('he', '-PRON-'), ('experienced', 'experience'), ('chest', 'chest'), ('discomfort', 'discomfort'), ('during', 'during'), ('a', 'a'), ('campaign', 'campaign'), ('event', 'event'), ('on', 'on'), ('Tuesday', 'Tuesday'), ('his', '-PRON-'), ('campaign', 'campaign'), ('said', 'say'), ('Jeff', 'Jeff'), ('Weaver', 'Weaver'), ('a', 'a'), ('senior', 'senior'), ('adviser', 'adviser'), ('to', 'to'), ('Sanders', 'Sanders'), ('campaign', 'campaign'), ('said', 'say'), ('a', 'a'), ('medical', 'medical'), ('evaluation', 'evaluat
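
In the output above, pronouns such as "he" and "his" are mapped to the placeholder `-PRON-` instead of a real lemma. If the placeholder is unwanted downstream, one option is to fall back to the lowercased token text. The helper below is a sketch that works on `(text, lemma)` pairs shaped like `doc_lemmas`; the name `resolve_pron` and the sample pairs are hypothetical, and whether the placeholder appears at all may depend on the spaCy version in use.

```python
def resolve_pron(pairs, placeholder="-PRON-"):
    """Replace placeholder lemmas with the lowercased original token text."""
    return [text.lower() if lemma == placeholder else lemma
            for text, lemma in pairs]


# Sample pairs in the same (text, lemma) format as doc_lemmas.
sample_pairs = [("He", "-PRON-"), ("experienced", "experience"),
                ("his", "-PRON-"), ("said", "say")]
print(resolve_pron(sample_pairs))
# ['he', 'experience', 'his', 'say']
```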

## preprocessing-function template using nltk and re


In [28]:
tokenizer = word_tokenize # tokenizer
stemmer = SnowballStemmer(language='english')  # stemmer
lemmatizer = WordNetLemmatizer() # lemmatizer 

def preprocess_text(sentence, stem=False, lemmatize=False): 
    """
    Cleans a text string by applying the following steps: 
        1. Tokenize the input text 
        2. Remove punctuation, symbols and unwanted characters
        3. Convert the tokens to lowercase 
        4. Stem and/or lemmatize (according to the flags)
        5. Remove stopwords, empty strings and single-character tokens
    Returns the cleaned tokens joined into one space-separated string.
    """
    # Tokenize
    tokens = tokenizer(sentence) 
    
    # Remove punctuation & symbols
    tokens = [re.sub(r"[^a-zA-Z]","", token) for token in tokens ]
    
    # convert to lowercase 
    tokens = [token.lower() for token in tokens]
    
    # Stem or lemmatize
    if stem: 
        tokens = [stemmer.stem(token) for token in tokens] 
    if lemmatize:
        tokens = [lemmatizer.lemmatize(token) for token in tokens] 
    
    # remove stopwords and empty strings 
    tokens = [token for token in tokens if token not in stopwords
              and len(token) > 1] 
    
    return ' '.join(tokens)

# Note: the nltk lemmatizer can be swapped for spaCy's lemmatizer (spaCy has no stemmer).

# example 
ex1_stem = preprocess_text(text, stem=True)
ex2_lemm = preprocess_text(text, lemmatize=True)
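
To see what the punctuation-removal and length-filter steps of `preprocess_text` do to awkward tokens, the same two expressions can be run on a few hand-picked tokens. This is a standalone sketch; the token list is illustrative.

```python
import re

# Tokens of the kind word_tokenize produces for the example text.
raw_tokens = ["Sen.", "Sanders", "'s", ":", "two"]

# Same regex and lowercasing as in preprocess_text: strip non-letters.
cleaned = [re.sub(r"[^a-zA-Z]", "", token).lower() for token in raw_tokens]
print(cleaned)  # ['sen', 'sanders', 's', '', 'two']

# The final length filter removes the empty string and the stray 's'.
kept = [token for token in cleaned if len(token) > 1]
print(kept)  # ['sen', 'sanders', 'two']
```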

In [27]:
print(ex1_stem) 

megathread berni sander undergo emerg heart procedur suspend campaign event notic megathread sen berni sander underw heart surgeri experienc chest discomfort dure campaign event tuesday campaign said jeff weaver senior advis sander campaign said medic evalu vermont senat discov blockag one arteri two stent success insert sen sander convers good spirit weaver said rest next day cancel event appear notic continu provid appropri updat


In [29]:
print(ex2_lemm)

megathread bernie sander undergoes emergency heart procedure suspends campaign event notice megathread sen bernie sander underwent heart surgery experienced chest discomfort campaign event tuesday campaign said jeff weaver senior adviser sander campaign said medical evaluation vermont senator discovered blockage one artery two stent successfully inserted sen sander conversing good spirit weaver said resting next day canceling event appearance notice continue provide appropriate update


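Observe carefully the differences between stemming and lemmatizing: the stemmer truncates words to stems that are often not real words ("surgeri", "underw", "appropri"), while the lemmatizer returns dictionary forms ("surgery", "artery") but, with its default noun setting, leaves verb forms such as "underwent" and "canceling" untouched.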
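
One detail in the stemmed output is worth a closer look: "dure" survives, although "during" is absent from the lemmatized output. The likely reason is the order of steps, since the stopword filter runs after stemming and "dure" no longer matches the stopword list. The sketch below reproduces the effect with a tiny stand-in stopword set (`demo_stopwords` is hypothetical, not the nltk list).

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language="english")

# Tiny stand-in for the nltk stopword list, for illustration only.
demo_stopwords = {"during", "a", "on"}
tokens = ["discomfort", "during", "a", "campaign", "event"]

# Order used in preprocess_text: stem first, then filter stopwords.
stem_then_filter = [stem for stem in (stemmer.stem(t) for t in tokens)
                    if stem not in demo_stopwords]

# Alternative order: filter stopwords first, then stem what is left.
filter_then_stem = [stemmer.stem(t) for t in tokens
                    if t not in demo_stopwords]

print(stem_then_filter)  # ['discomfort', 'dure', 'campaign', 'event']
print(filter_then_stem)  # ['discomfort', 'campaign', 'event']
```

Filtering before stemming is one way to avoid such leftovers; whether it is preferable depends on the task.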