Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# PREPROCESSING

## Tokenization

*Tokenization* is the process of splitting an input text into tokens (words or other relevant elements, such as punctuation).

#### Making use of regular expressions

We can tokenize a piece of text by using a regular expression tokenizer, such as the one available in **NLTK**.

For starters, let's stick to alphanumerical sequences of characters.

In [2]:
import nltk
from nltk import regexp_tokenize

text = 'That U.S.A. poster-print costs $12.40...'

pattern = '[a-zA-Z0-9_]+'
tokens = regexp_tokenize(text, pattern)
print(len(tokens))
print(tokens)

9
['That', 'U', 'S', 'A', 'poster', 'print', 'costs', '12', '40']


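#### Problems with this simple tokenizer

- Currency amounts such as `$12.40` lose their symbol and are split at the decimal point.
- Floating-point numbers are split wherever a point or comma marks the decimal places.
- Acronyms written with periods, such as `U.S.A.`, are broken apart: periods are treated as **word boundaries** even when they are part of the word itself.
- The ellipsis is omitted altogether.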

We can refine the regular expression to obtain a more sensible tokenization.

In [3]:
pattern = r'''(?x)           # set flag to allow verbose regexps
        (?:[A-Z]\.)+         # abbreviations, e.g. U.S.A.
        | \w+(?:-\w+)*       # words with optional internal hyphens
        | \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
        | \.\.\.             # ellipsis
        | [][.,;"'?():_`-]   # these are separate tokens; includes ], [ (hyphen last, so it is literal)
        '''


tokens = regexp_tokenize(text, pattern)
print(len(tokens))
print(tokens)


6
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
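
The order of the alternatives matters, because the regex engine takes the first alternative that matches at each position. With words listed before numbers, a number without a leading `$` is consumed by `\w+`, so `82%` loses its percent sign. The sketch below uses Python's built-in `re.findall` as a stand-in for `regexp_tokenize` (an assumption made to keep the example self-contained; the two should behave alike for patterns without capturing groups) and compares both orderings on a made-up sentence. The names `WORDS_FIRST` and `NUMBERS_FIRST` are illustrative only.

```python
import re

# Shared alternatives, written once so that only their order differs below.
ABBREVIATION = r'(?:[A-Z]\.)+'          # e.g. U.S.A.
WORD = r'\w+(?:-\w+)*'                  # words with optional internal hyphens
NUMBER = r'\$?\d+(?:\.\d+)?%?'          # currency and percentages
ELLIPSIS = r'\.\.\.'
PUNCTUATION = r'''[][.,;"'?():_`-]'''   # hyphen placed last so it is literal

WORDS_FIRST = '|'.join([ABBREVIATION, WORD, NUMBER, ELLIPSIS, PUNCTUATION])
NUMBERS_FIRST = '|'.join([ABBREVIATION, NUMBER, WORD, ELLIPSIS, PUNCTUATION])

sample = 'Sales rose 82% to $12.40...'  # made-up sentence for illustration

words_first_tokens = re.findall(WORDS_FIRST, sample)
numbers_first_tokens = re.findall(NUMBERS_FIRST, sample)

print(words_first_tokens)    # ['Sales', 'rose', '82', 'to', '$12.40', '...']
print(numbers_first_tokens)  # ['Sales', 'rose', '82%', 'to', '$12.40', '...']
```

Putting numbers first has a cost of its own: a token such as `3D` would then be split into `3` and `D`, so the better ordering depends on the text at hand.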


#### Using NLTK

NLTK also includes a word tokenizer, which gets roughly the same result (it finds "words" and punctuation).

In [4]:
from nltk import word_tokenize

text = 'That U.S.A. poster-print costs $12.40...'
tokens = word_tokenize(text)

print(len(tokens))
print(tokens)

7
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']


In [5]:
word_tokenize("I don't think we're flying today.")

['I', 'do', "n't", 'think', 'we', "'re", 'flying', 'today', '.']

You can try [other tokenizers](https://www.nltk.org/api/nltk.tokenize.html) available in NLTK.

In [6]:
# try out the wordpunct tokenizer
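
As a possible starting point, the sketch below runs NLTK's `wordpunct_tokenize` on the same example text. This tokenizer splits the input into runs of alphanumeric characters and runs of other non-whitespace characters, so every punctuation mark is separated from the adjacent word.

```python
from nltk.tokenize import wordpunct_tokenize

text = 'That U.S.A. poster-print costs $12.40...'
tokens = wordpunct_tokenize(text)

print(len(tokens))  # 16
print(tokens)
# ['That', 'U', '.', 'S', '.', 'A', '.', 'poster', '-', 'print', 'costs',
#  '$', '12', '.', '40', '...']
```

Compared with `word_tokenize`, the acronym, the hyphenated word and the price are all broken into smaller pieces.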


Let's get a sentence from the user and tokenize it.

In [7]:
s = input("Enter some text:")
tokens = word_tokenize(s)

print("You typed", len(tokens), "words:", tokens)

You typed 1 words: ['blya']


#### Sentence segmentation

We may also be interested in splitting the text into sentences.

In [8]:
from nltk import sent_tokenize

text = "Hello. Are you Mr. Smith? Just to let you know that I have finished my M.Sc. and Ph.D. on AI. I loved it!"
sentences = sent_tokenize(text)

print(sentences)
print("Number of sentences:", len(sentences))

['Hello.', 'Are you Mr. Smith?', 'Just to let you know that I have finished my M.Sc.', 'and Ph.D. on AI.', 'I loved it!']
Number of sentences: 5
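
To see why a trained sentence tokenizer is useful, it helps to compare it with a naive baseline. The sketch below is an illustrative baseline written with Python's `re` module (it is not how NLTK works): it splits after every period, question mark or exclamation mark that is followed by whitespace. On the same text it yields 7 fragments, because it also breaks after `Mr.` and `Ph.D.`, which `sent_tokenize` handled correctly.

```python
import re

text = ("Hello. Are you Mr. Smith? Just to let you know that I have "
        "finished my M.Sc. and Ph.D. on AI. I loved it!")

# Naive rule: a sentence ends at '.', '?' or '!' followed by whitespace.
naive_sentences = re.split(r'(?<=[.?!])\s+', text)

for fragment in naive_sentences:
    print(fragment)
print("Number of fragments:", len(naive_sentences))  # 7
```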


#### Experimenting with long texts

We can try downloading a book from [Project Gutenberg](https://www.gutenberg.org/).

In [9]:
from urllib import request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

print(len(raw))
print(raw[:75])

1176812
﻿The Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky


How many sentences are there? Print out the second sentence (index 1).

In [10]:
# insert your code here


How many tokens are there? What is the index of the first token in the second sentence?

In [11]:
# insert your code here


And how many types (unique tokens) are there? Which is the most frequent one? *(Hint: use a [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) container from collections.)*

In [12]:
# insert your code here
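
One possible way to approach this, sketched on a tiny hand-made token list rather than on the book itself: a `Counter` maps each type to its number of occurrences, so its length is the number of types and `most_common(1)` returns the most frequent one. The list `sample_tokens` is a placeholder; with the book, it would be the output of `word_tokenize(raw)`.

```python
from collections import Counter

# Placeholder token list; replace it with the tokens obtained from the book.
sample_tokens = ['the', 'cat', 'saw', 'the', 'dog', '.', 'the', 'dog', 'ran', '.']

type_counts = Counter(sample_tokens)

print("Number of tokens:", len(sample_tokens))            # 10
print("Number of types:", len(type_counts))               # 6
print("Most frequent type:", type_counts.most_common(1))  # [('the', 3)]
```

Counting is case-sensitive, so `The` and `the` are distinct types unless the tokens are lowercased first.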


#### Dealing with multi-word expressions (MWE)

Sometimes we want certain words to stick together when tokenizing, such as in multi-word names.

In [13]:
word_tokenize("Good muffins cost $3.88\nin New York.")

['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']

One way to do it is to supply our own lexicon and make use of NLTK's [MWE tokenizer](https://www.nltk.org/api/nltk.tokenize.mwe.html).

In [14]:
from nltk.tokenize import MWETokenizer
from nltk import sent_tokenize, word_tokenize

s = "Good muffins cost $3.88\nin New York."
mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator=' ')

[mwe.tokenize(word_tokenize(sent)) for sent in sent_tokenize(s)]

[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York', '.']]

Try out your own multi-word expressions to tokenize text.

In [15]:
# try out your own multi-word expressions
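
A minimal sketch of what such an experiment could look like, assuming a hypothetical lexicon with two expressions. It uses `separator='_'` and `add_mwe` to extend the lexicon after construction; a plain whitespace split stands in for `word_tokenize` only to keep the example self-contained.

```python
from nltk.tokenize import MWETokenizer

# Hypothetical lexicon of multi-word expressions, given as tuples of tokens.
mwe = MWETokenizer([('natural', 'language', 'processing')], separator='_')
mwe.add_mwe(('Project', 'Gutenberg'))  # expressions can also be added later

# A whitespace split stands in for word_tokenize in this sketch.
tokens = "We study natural language processing with books from Project Gutenberg".split()

merged = mwe.tokenize(tokens)
print(merged)
# ['We', 'study', 'natural_language_processing', 'with', 'books', 'from', 'Project_Gutenberg']
```

Matching is done on exact tokens, so the lexicon is case-sensitive: `('project', 'gutenberg')` would not match the text above.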


## Stemming and Lemmatization

*Stemming* and *Lemmatization* are techniques used to normalize tokens, so as to reduce the size of the vocabulary.
Whereas lemmatization maps each word to its lemma (its dictionary form), stemming typically applies a set of transformation rules that cut off word-final affixes.

#### Stemming

NLTK includes one of the most well-known stemmers: the [Porter stemmer](https://www.emerald.com/insight/content/doi/10.1108/00330330610681286/full/pdf?casa_token=eT_IPtH_eLEAAAAA:Z3lAtxWdxf0FL479mL-A7tC-_QRzxNeeyC2DFLyWwGBlcj6DQcwu2Bnq37waDPcXKOnXkMMDtKGyCaYGZtYcb3lgBZ9uaHKUNO0JCMivSdPE4HTe).

In [16]:
from nltk.stem import PorterStemmer

# initialize the Porter Stemmer
porter = PorterStemmer()

Let's use an illustrative piece of text:

In [17]:
sentence = '''The European Commission has funded a numerical study to analyze the purchase of a pipe organ with no noise
for Europe's organization. Numerous donations have followed the analysis after a noisy debate.'''

# tokenize: split the text into words
word_list = nltk.word_tokenize(sentence)

print("\nOriginal word list:", word_list)
print("\nOriginal number of distinct tokens:", len(set(word_list)))


Original word list: ['The', 'European', 'Commission', 'has', 'funded', 'a', 'numerical', 'study', 'to', 'analyze', 'the', 'purchase', 'of', 'a', 'pipe', 'organ', 'with', 'no', 'noise', 'for', 'Europe', "'s", 'organization', '.', 'Numerous', 'donations', 'have', 'followed', 'the', 'analysis', 'after', 'a', 'noisy', 'debate', '.']

Original number of distinct tokens: 31


Now, we stem the tokens in the text:

In [18]:
# stem list of words and join
stemmed_output = ' '.join([porter.stem(w) for w in word_list])
print("Stemmed text:", stemmed_output)

# tokenize: split the text into words
stemmed_word_list = nltk.word_tokenize(stemmed_output)

print("\nStemmed word list:", stemmed_word_list)
print("\nStemmed number of distinct tokens:", len(set(stemmed_word_list)))

Stemmed text: the european commiss ha fund a numer studi to analyz the purchas of a pipe organ with no nois for europ 's organ . numer donat have follow the analysi after a noisi debat .

Stemmed word list: ['the', 'european', 'commiss', 'ha', 'fund', 'a', 'numer', 'studi', 'to', 'analyz', 'the', 'purchas', 'of', 'a', 'pipe', 'organ', 'with', 'no', 'nois', 'for', 'europ', "'s", 'organ', '.', 'numer', 'donat', 'have', 'follow', 'the', 'analysi', 'after', 'a', 'noisi', 'debat', '.']

Stemmed number of distinct tokens: 28


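You can see the reduced vocabulary size (28 distinct tokens instead of 31). Some tokens are over-generalized (semantically different tokens that get the same stem, such as "organ" and "organization", which both become "organ"), while others are under-generalized (semantically similar tokens that get different stems, such as "noise" and "noisy", which become "nois" and "noisi").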

Try out [other stemmers](https://www.nltk.org/api/nltk.stem.html) available in NLTK.

In [19]:
# try out other stemmers
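
One way to run this comparison, as a sketch: the Snowball (English) and Lancaster stemmers are two of the alternatives shipped with NLTK, applied here next to Porter on word pairs taken from the example sentence. The choice of words and the `stemmers` dictionary are illustrative; the stems are printed rather than listed here, since they can differ between stemmers.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

stemmers = {
    'Porter': PorterStemmer(),
    'Snowball': SnowballStemmer('english'),
    'Lancaster': LancasterStemmer(),
}

# Word pairs taken from the example sentence above.
words = ['organ', 'organization', 'noise', 'noisy', 'analyze', 'analysis']

# Print one row of stems per stemmer so the differences are easy to spot.
for name, stemmer in stemmers.items():
    stems = [stemmer.stem(w) for w in words]
    print(f"{name:10}", stems)
```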


We can try a few for Portuguese:

In [20]:
# Portuguese stemmer: https://www.nltk.org/_modules/nltk/stem/rslp.html
from nltk.stem import RSLPStemmer

# RSLPStemmer() raises a LookupError until the 'rslp' resource has been downloaded
nltk.download('rslp')

stemmer = RSLPStemmer()
sentence = "Estou mesmo a gostar desta unidade curricular, todos gostamos de unidades curriculares interessantes."

word_list = nltk.word_tokenize(sentence)
stemmed_output = ' '.join([stemmer.stem(w) for w in word_list])
print(stemmed_output)


In [None]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("portuguese")
sentence = "Estou mesmo a gostar desta unidade curricular, todos gostamos de unidades curriculares interessantes."

word_list = nltk.word_tokenize(sentence)
stemmed_output = ' '.join([stemmer.stem(w) for w in word_list])
print(stemmed_output)

#### Lemmatization

NLTK includes a [lemmatizer based on WordNet](https://www.nltk.org/api/nltk.stem.wordnet.html).

In [21]:
# WordNet lemmatizer
from nltk.stem import WordNetLemmatizer 

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

sentence = "Men and women love to study artificial intelligence while studying data science. Don't you? My feet and teeth are clean!"


# tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)

# lemmatize list of words
lemmatized_output = [lemmatizer.lemmatize(w) for w in word_list]
print(lemmatized_output)

['Men', 'and', 'women', 'love', 'to', 'study', 'artificial', 'intelligence', 'while', 'studying', 'data', 'science', '.', 'Do', "n't", 'you', '?', 'My', 'feet', 'and', 'teeth', 'are', 'clean', '!']
['Men', 'and', 'woman', 'love', 'to', 'study', 'artificial', 'intelligence', 'while', 'studying', 'data', 'science', '.', 'Do', "n't", 'you', '?', 'My', 'foot', 'and', 'teeth', 'are', 'clean', '!']


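Note that `lemmatize` treats every word as a noun unless a part-of-speech tag is passed through its `pos` argument, which is why "studying" is left unchanged. Compare the result with stemming applied to the same text.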

In [None]:
# compare with stemming
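
A minimal sketch of this comparison, assuming the Porter stemmer is the one of interest. The token list is copied from the tokenizer output shown above, so the example does not depend on any downloaded resource.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Tokens copied from the tokenizer output shown above.
word_list = ['Men', 'and', 'women', 'love', 'to', 'study', 'artificial',
             'intelligence', 'while', 'studying', 'data', 'science', '.',
             'Do', "n't", 'you', '?', 'My', 'feet', 'and', 'teeth', 'are',
             'clean', '!']

stemmed_output = [porter.stem(w) for w in word_list]
print(stemmed_output)
```

The stemmer maps both `study` and `studying` to `studi`, which is not a dictionary word, whereas the lemmatizer output above keeps `studying` as it is and turns `women` and `feet` into `woman` and `foot`.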


## spaCy

spaCy includes [language processing pipelines](https://spacy.io/usage/processing-pipelines) that handle several NLP tasks at once. We can use one of the available [trained pipelines](https://spacy.io/models).

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

We simply pass the sentence through the language processing pipeline (in this case, for English):

In [None]:
sent = nlp(sentence)
print(sent)
print(len(sent))

As you can see, we now have a sequence of tokens, each of which has specific [attributes](https://spacy.io/api/token#attributes) attached. For instance, we can easily get the lemma for each word:

In [None]:
for token in sent:
    print(token.text, token.lemma_)