<h1> Text preprocessing tutorial </h1>

In [1]:
import nltk
import spacy
import pandas as pd

# 1. Preliminaries (Tokenization)

In [3]:
# nltk.download('punkt')  # tokenizer data; newer NLTK releases use 'punkt_tab' instead
from nltk import word_tokenize, sent_tokenize

Sentence segmentation

In [None]:
mytext = """In the previous chapter, we saw examples of some common NLP
applications that we might encounter in everyday life. If we were asked to
build such an application, think about how we would approach doing so at our
organization. We would normally walk through the requirements and break the
problem down into several sub-problems, then try to develop a step-by-step
procedure to solve them. Since language processing is involved, we would also
list all the forms of text processing needed at each step. This step-by-step
processing of text is known as pipeline. It is the series of steps involved in
building any NLP model. These steps are common in every NLP project, so it
makes sense to study them in this chapter. Understanding some common procedures
in any NLP pipeline will enable us to get started on any NLP problem encountered
in the workplace. Laying out and developing a text-processing pipeline is seen
as a starting point for any NLP application development process. In this
chapter, we will learn about the various steps involved and how they play
important roles in solving the NLP problem and we’ll see a few guidelines
about when and how to use which step. In later chapters, we’ll discuss
specific pipelines for various NLP tasks (e.g., Chapters 4–7)."""

my_sent = sent_tokenize(mytext)
my_sent

Word tokenization

In [None]:
for sent in my_sent:
    print(word_tokenize(sent))
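To make the two steps above concrete, here is a minimal standard-library sketch of what sentence segmentation and word tokenization do. It is a simplified stand-in, not NLTK's algorithm: `sent_tokenize` relies on the trained Punkt model, which handles abbreviations such as "Dr." that this regex would split incorrectly. The function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical.

```python
import re


def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace and an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part.strip() for part in parts if part.strip()]


def simple_word_tokenize(sentence):
    # A token is a word (internal hyphens/apostrophes allowed) or one punctuation mark.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)


sample = ("This step-by-step processing of text is known as pipeline. "
          "It is the series of steps involved in building any NLP model.")

sents = simple_sent_tokenize(sample)
tokens = [simple_word_tokenize(s) for s in sents]

for sent_tokens in tokens:
    print(sent_tokens)
```

Each sentence becomes its own list of tokens, with the final period kept as a separate token, which is the same shape of output the NLTK loop above prints.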

# 2. Frequent steps:
    - Stop word removal
    - Stemming and lemmatization
    - Removing digits and punctuation
    - Lowercasing, etc.

In [1]:
# nltk.download('stopwords')
from nltk.corpus import stopwords
from string import punctuation

Remove stop words, punctuation, and digits, and lowercase the tokens

In [22]:
def preprocess_corpus(text):
    """Split text into sentences and return one list of cleaned tokens per sentence."""
    my_stop_words = set(stopwords.words('english'))
    # Split the text into sentences
    sents = sent_tokenize(text)

    def remove_stop_digit(word_tokens):
        # Lowercase first so that capitalized stop words ("In", "We") are removed too
        lowered = (token.lower() for token in word_tokens)
        return [token for token in lowered if token not in my_stop_words
                and token not in punctuation and not token.isdigit()]

    return [remove_stop_digit(word_tokenize(sent)) for sent in sents]

In [23]:
print(preprocess_corpus(mytext))

[['previous', 'chapter', 'saw', 'examples', 'common', 'nlp', 'applications', 'might', 'encounter', 'everyday', 'life'], ['asked', 'build', 'application', 'think', 'would', 'approach', 'organization'], ['would', 'normally', 'walk', 'requirements', 'break', 'problem', 'several', 'sub-problems', 'try', 'develop', 'step-by-step', 'procedure', 'solve'], ['since', 'language', 'processing', 'involved', 'would', 'also', 'list', 'forms', 'text', 'processing', 'needed', 'step'], ['step-by-step', 'processing', 'text', 'known', 'pipeline'], ['series', 'steps', 'involved', 'building', 'nlp', 'model'], ['steps', 'common', 'every', 'nlp', 'project', 'makes', 'sense', 'study', 'chapter'], ['understanding', 'common', 'procedures', 'nlp', 'pipeline', 'enable', 'us', 'get', 'started', 'nlp', 'problem', 'encountered', 'workplace'], ['laying', 'developing', 'text-processing', 'pipeline', 'seen', 'starting', 'point', 'nlp', 'application', 'development', 'process'], [
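Stop-word lists such as NLTK's English list are lowercase, so a token should be lowercased before it is compared against the list. A minimal sketch of the difference, using a small hypothetical stop-word set (`STOP_WORDS`) rather than the NLTK list:

```python
# Hypothetical mini stop-word list, for illustration only (NLTK's list is much longer).
STOP_WORDS = {"in", "the", "we", "of", "some"}

tokens = ["In", "the", "previous", "chapter", "we", "saw", "examples"]

# Filter, then lowercase: capitalized "In" does not match "in" and survives.
filter_then_lower = [t.lower() for t in tokens if t not in STOP_WORDS]

# Lowercase, then filter: "In" becomes "in" and is removed.
lower_then_filter = [t for t in (tok.lower() for tok in tokens) if t not in STOP_WORDS]

print(filter_then_lower)  # ['in', 'previous', 'chapter', 'saw', 'examples']
print(lower_then_filter)  # ['previous', 'chapter', 'saw', 'examples']
```

The same ordering concern applies to any case-sensitive lookup in the pipeline, which is why lowercasing is usually done early.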

Lemmatization and Stemming

Stemming

In [36]:
from nltk.stem import PorterStemmer,LancasterStemmer,SnowballStemmer
import pandas as pd

In [54]:
words = [
    'adjusting', 'adjustment', 'adjustments', 'alumni', 'alumnus', 'am', 'are', 'ate', 'beautified', 'beautiful', 'beautifully', 'beautify', 'best', 'better', 'bigger', 
    'biggest', 'children', 'connected', 'connection', 'connections', 'connectivity', 'cried', 'cries', 'criteria', 'criterion', 'crying', 'democracy', 'democratization',
    'democratize', 'discoveries', 'discovering','men', 'mice', 'misunderstanding', 'misunderstandings', 'more beautiful', 'most beautiful', 
    'multiplying', 'nationalities', 'nationality', 'nationalization', 'nationalize', 'nationalized', 'organization', 'organizations', 'phenomena', 'phenomenon', 'quicker', 'quickest', 'quickly', 
    'ran', 'relating', 'relational', 'relationally', 'remembering', 'runnable', 'runner', 'running','science', 'scientific', 'scientifically', 'studies', 'study', 'studying', 'troubled', 
    'troubles', 'troubling', 'understood', 'unknowingly', 'unknown', 'was', 'went', 'were', 'women', 'wondered', 'wondering', 'wonders', 'worse', 'worst'
]


# Initialize stemmers
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer('english')

# Apply stemmers
porter_stemmed = [porter_stemmer.stem(word) for word in words]
lancaster_stemmed = [lancaster_stemmer.stem(word) for word in words]
snowball_stemmed = [snowball_stemmer.stem(word) for word in words]

# Create DataFrame
df = pd.DataFrame({
    'Original': words,
    'PorterStemmer': porter_stemmed,
    'LancasterStemmer': lancaster_stemmed,
    'SnowballStemmer': snowball_stemmed
})

# Print DataFrame
df.sample(5)

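# ===================> SnowballStemmer is recommended: Lancaster over-stems (e.g. 'science' -> 'sci', 'better' -> 'bet'), while Porter leaves forms such as 'quickli' for 'quickly'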

    Original  PorterStemmer  LancasterStemmer  SnowballStemmer
63  studying          studi             study            studi
49   quickly        quickli             quick            quick
58   science         scienc               sci           scienc
13    better         better               bet           better
62     study          studi             study            studi
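The practical use of a stemmer in a pipeline is to collapse inflected variants of a word onto one key, for example before counting terms. The sketch below groups a few of the words from the list above by their Snowball stem. It assumes only that `nltk` is installed; the stemmers need no extra data download.

```python
from collections import defaultdict

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["study", "studies", "studying", "connected", "connection", "connections"]

# Map each stem to the surface forms that reduce to it.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for stem, members in groups.items():
    print(stem, "<-", members)
```

Note that a stem such as `studi` is only a lookup key and not a dictionary word; when readable base forms are needed, lemmatization is the better fit.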


Lemmatization

In [55]:
# Using nltk

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# pos='a' treats 'better' as an adjective; with the default pos='n' it is returned unchanged
lemmatizer.lemmatize('better', pos='a')

[nltk_data] Downloading package wordnet to /home/tuanphan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


'good'
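The call above hard-codes `pos='a'`. In a pipeline the part of speech usually comes from a tagger, and NLTK's default tagger (`nltk.pos_tag`) emits Penn Treebank tags, while `WordNetLemmatizer.lemmatize` expects one of the WordNet letters `'n'`, `'v'`, `'a'`, `'r'`. A small mapping helper is a common way to bridge the two; the name `penn_to_wordnet_pos` is hypothetical and the fallback to noun is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (e.g. 'JJR', 'VBD') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, ...
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else falls back to noun


for tag in ["JJR", "VBD", "RB", "NNS"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

With a tagged token `(word, tag)`, the lemmatizer defined above could then be called as `lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))`.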

In [60]:
# Using spacy

sp = spacy.load('en_core_web_sm')
doc = sp('better')  # calling the pipeline returns a Doc, which iterates over tokens
for token in doc:
    print(token.text, token.lemma_)

better well


# 3. Advanced preprocessing steps:
- POS tagging

In [70]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Charles Spencer Chaplin was born on 16 April 1889 toHannah Chaplin (born Hannah Harriet Pedlingham Hill) and Charles Chaplin Sr')
for token in doc:
    print(token.text,token.lemma_,token.pos_,token.shape_,token.is_alpha,token.is_stop)

Charles Charles PROPN Xxxxx True False
Spencer Spencer PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
was be AUX xxx True True
born bear VERB xxxx True False
on on ADP xx True True
16 16 NUM dd False False
April April PROPN Xxxxx True False
1889 1889 NUM dddd False False
toHannah toHannah PROPN xxXxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
( ( PUNCT ( False False
born bear VERB xxxx True False
Hannah Hannah PROPN Xxxxx True False
Harriet Harriet PROPN Xxxxx True False
Pedlingham Pedlingham PROPN Xxxxx True False
Hill Hill PROPN Xxxx True False
) ) PUNCT ) False False
and and CCONJ xxx True True
Charles Charles PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
Sr Sr PROPN Xx True False
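The token attributes printed above can drive the earlier cleaning steps directly. The sketch below keeps the lemmas of alphabetic, non-stop-word tokens. To stay runnable without the `en_core_web_sm` model it uses stand-in objects built from a few rows of the output above; the helper name `content_lemmas` is hypothetical, and a real spaCy `Doc` can be passed in the same way because its tokens expose the same attributes.

```python
from types import SimpleNamespace


def content_lemmas(tokens):
    """Return lowercased lemmas of alphabetic tokens that are not stop words."""
    return [t.lemma_.lower() for t in tokens if t.is_alpha and not t.is_stop]


# (text, lemma_, is_alpha, is_stop) copied from the printed output above.
rows = [
    ("was", "be", True, True),
    ("born", "bear", True, False),
    ("on", "on", True, True),
    ("16", "16", False, False),
    ("April", "April", True, False),
]
tokens = [
    SimpleNamespace(text=text, lemma_=lemma, is_alpha=alpha, is_stop=stop)
    for text, lemma, alpha, stop in rows
]

print(content_lemmas(tokens))  # ['bear', 'april']
```

With the model loaded, `content_lemmas(nlp(text))` would apply the same filter to every token of a document.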
