# Processing Raw Text with NLTK

## Tokenization

<div class="alert-info">

<p>What is a token in NLP?</p>

A "token" is an instance of a sequence of characters that are grouped together as a useful semantic unit for processing.

</div>

Using NLTK we can tokenize the text into <b>words</b> or into <b>sentences</b>:
+ <b>word_tokenize</b> for words
+ <b>sent_tokenize</b> for sentences

In [5]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [6]:
def read_file(filename):
    with open(filename,'r') as file:
        text = file.read()
    return text

In [7]:
text = read_file('shakespeare-taming-of-the-shrew.txt')

In [10]:
words = word_tokenize(text)

# word_tokenize - inputs ---> string containing the text
#               - outputs --> list of words

In [11]:
sentences = sent_tokenize(text)

# sent_tokenize - inputs ---> string containing the text
#               - outputs --> list of sentences

## Regular Expressions

The re.search() function takes two main arguments:
+ the regex pattern to look for
+ the string in which we want to search for that pattern

In [13]:
import re

In [19]:
re.search('^ab','abc')

<re.Match object; span=(0, 2), match='ab'>

It can also be used <b>as a boolean</b> in a conditional statement:

In [18]:
if re.search('^ab','abc'):
    print('Found it!')

Found it!
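
As a small sketch of why this works (the sample strings below are made up for illustration): re.search() returns a match object when the pattern is found and None otherwise, and those two values behave as True and False in a condition.

```python
import re

# Hypothetical sample strings, chosen only to illustrate the ^ anchor.
candidates = ['abc', 'cab', 'ab']

# bool(match object) is True and bool(None) is False.
starts_with_ab = {s: bool(re.search(r'^ab', s)) for s in candidates}
print(starts_with_ab)

# Without the ^ anchor the pattern may match anywhere in the string.
print(re.search(r'ab', 'cab').span())
```

The string 'cab' contains "ab" but does not start with it, so only the unanchored pattern matches it.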


Example
+ Get all words which end with ed:

In [20]:
print([w for w in words if re.search('ed$',w)])

['bed', 'winded', 'cried', 'bed', 'bed', 'bed', 'waked', 'distilled', 'husbanded', 'bed', 'fitted', 'observed', 'accomplished', 'shed', 'restored', 'commanded', 'infused', 'kindred', 'caged', 'bed', 'studded', 'breathed', 'painted', 'beguiled', 'surprised', 'painted', 'deed', 'shed', 'indeed', 'restored', 'waked', 'waked', 'thanked', 'bed', 'bed', 'charged', 'bed', 'arrived', 'approved', 'conceived', 'achieved', 'affected', 'abjured', 'resolved', 'need', 'resolved', 'appointed', 'brooked', 'married', 'whipped', 'maintained', 'agreed', 'wed', 'bed', 'rated', 'Sacred', 'advised', 'plotted', 'need', 'tied', 'charged', 'wounded', 'descried', 'changed', 'indeed', 'beloved', 'approved', 'rebused', 'deceased', 'wed', 'deceased', 'rehearsed', 'disguised', 'unsuspected', 'disguised', 'perused', 'perfumed', 'assured', 'promised', 'lighted', 'promised', 'beloved', 'Beloved', 'chafed', 'pitched', 'arrived', 'promised', 'provided', 'speed', 'wed', 'indeed', 'jested', 'revenged', 'grieved', 'called'

The same as above, but storing the list in a variable and printing it as a set to remove the duplicates:

In [21]:
words_ending_with_ed = [w for w in words if re.search('ed$',w)]
print(set(words_ending_with_ed))

{'undeserved', 'deceived', 'carved', 'descried', 'busied', 'opposed', 'hipped', 'satisfied', 'half-chequed', 'strewed', 'stumbled', 'assured', 'compassed', 'jested', 'wed', 'possessed', 'armed', 'endured', 'approved', 'wounded', 'unparted', 'sounded', 'starved', 'pricked', 'logger-headed', 'happened', 'spied', 'entreated', 'lighted', 'turned', 'feed', 'moved', 'commanded', 'disguised', 'Braved', 'distilled', 'hunted', 'interrupted', 'trimmed', 'whipped', 'supposed', 'faded', 'repaired', 'murdered', 'infected', 'tuned', 'lodged', 'bedazzled', 'conceived', 'Beloved', 'led', 'brooked', 'thanked', 'amazed', 'alighted', 'mated', 'furnished', "'greed", 'tied', 'full-gorged', 'laced', 'Kated', 'frighted', 'promised', 'observed', 'caged', 'faced', 'walked', 'abjured', "'pointed", 'shed', 'reported', 'perused', 'contented', 'sewed', 'gartered', 'blessed', 'rated', 'hailed', 'chafed', 'forced', 'winded', 'resolved', 'pieced', 'brushed', 'caparisoned', 'rayed', 'prepared', 'renowned', 'replied', 
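
The pattern is case-sensitive, which is why both 'beloved' and 'Beloved' show up above. A possible variation, sketched here on a small made-up word list (not the Shakespeare text), lowercases the words and sorts the result:

```python
import re

# Hypothetical word list used only for this illustration.
sample_words = ['bed', 'Sacred', 'wedding', 'painted', 'bed', 'EDITED', 'red']

# Case-sensitive: 'EDITED' is skipped because 'ED' does not match 'ed$'.
ending_in_ed = sorted({w for w in sample_words if re.search(r'ed$', w)})
print(ending_in_ed)

# Case-insensitive match, with the words normalized to lowercase.
ending_in_ed_any_case = sorted(
    {w.lower() for w in sample_words if re.search(r'ed$', w, re.IGNORECASE)}
)
print(ending_in_ed_any_case)
```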

### Findall

*Note

Writing a regular expression for NLTK's <b>findall</b> is a little different from standard RegEx:

each token pattern is written inside <>. These symbols indicate the boundaries of a token:

+ From the documentation:

    + <span style="color: #999999;">Find instances of the regular expression in the text.</span><br /><span style="color: #999999;">The text is a list of tokens, and a regexp pattern to match</span><br /><span style="color: #999999;">a single token must be surrounded by angle brackets. E.g.</span>
    

Inside <>, the modifiers we add (such as ?, *, +) apply to the characters of the token. For example:

+ <span>&#60;</span>.*a<span>&#62;</span> matches every token that ends in "a" (zero or more characters followed by a final "a")

If we put the modifiers outside <>, they apply to the whole token:

+ <span>&#60;</span>so<span>&#62;</span>+ matches one or more consecutive tokens equal to "so"
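
To make the role of the angle brackets more concrete, here is a minimal sketch of the idea using only the standard library. It is an approximation written for illustration, not NLTK's actual implementation: the helper name token_findall and the hand-made token list are assumptions.

```python
import re

def token_findall(tokens, pattern):
    """Hypothetical helper: apply a <...>-style token pattern to a token list."""
    # Wrap every token in angle brackets and glue them into one string.
    raw = ''.join('<' + tok + '>' for tok in tokens)
    # Each <...> in the pattern becomes a non-capturing group for one token.
    pattern = pattern.replace('<', '(?:<(?:').replace('>', ')>)')
    # An unescaped "." must not run past the end of a token.
    pattern = re.sub(r'(?<!\\)\.', '[^>]', pattern)
    # Strip the outer brackets of each hit and split it back into tokens.
    return [hit[1:-1].split('><') for hit in re.findall(pattern, raw)]

# Hand-made token list (an assumption; not produced by word_tokenize).
tokens = ['I', 'am', 'a', 'programmer', 'learning', 'and', 'studying', 'NLP',
          'but', 'now', 'I', 'am', 'so', 'so', '!']

print(token_findall(tokens, r'<.*a>'))    # modifier inside <>: acts on characters
print(token_findall(tokens, r'<so>+'))    # modifier outside <>: acts on whole tokens
print(token_findall(tokens, r'<.*ing>'))
```

Because "." is rewritten to "any character except >", a pattern inside <> can never spill over into the next token.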

In [42]:
import nltk

text = 'Hello, I am a computer programmer who is currently learning and studying NLP, but now I am so so !'

obj = nltk.Text(nltk.word_tokenize(text))

In [47]:
obj.findall(r'<.*a>')

a


In [44]:
obj.findall(r'<so>+')

so so


In [48]:
obj.findall(r'<.*ing>')

learning; studying


In [45]:
obj.findall(r'<.*er><.*><.*ing>+')  # no hits: no "-er" token is followed by any token and then an "-ing" token




In [23]:
import nltk
from nltk.corpus import gutenberg as gt

In [24]:
moby = nltk.Text(gt.words('melville-moby_dick.txt'))

<p><span style="background-color: #00ccff;"><strong>Example:</strong></span></p>

Let's try to find all phrases with the structure "a _____ man":

In [25]:
moby.findall(r'<a><.*><man>')  # findall prints its hits and returns None, so print() is not needed

a monied man; a nervous man; a dangerous man; a white man; a white
man; a white man; a pious man; a queer man; a good man; a mature man;
a white man; a Cape man; a great man; a wise man; a wise man; a
butterless man; a white man; a fiendish man; a pale man; a furious
man; a better man; a certain man; a complete man; a dismasted man; a
younger man; a brave man; a brave man; a brave man; a brave man


<p><span style="background-color: #00ccff;"><strong>Example:</strong></span></p>

Let's try to find all three-word sequences that end in "bro".

In [29]:
from nltk.corpus import nps_chat

In [31]:
chat_obj = nltk.Text(nps_chat.words())

In [32]:
chat_obj.findall(r'<.*><.*><bro>')  # findall prints its hits and returns None, so print() is not needed

you rule bro; telling you bro; u twizted bro


### Normalization

Text normalization is the process of transforming text into a single canonical form that it might not have had before.

Two common normalization techniques are:
+ Stemming
+ Lemmatization

While stemming applies algorithmic rules (typically suffix stripping) to obtain the stem, lemmatization uses a dictionary to look up the lemma.
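
The difference can be sketched with a deliberately tiny toy example. The suffix list, the dictionary and the function names below are invented for this illustration; they are far simpler than the rules and data that real stemmers and lemmatizers use.

```python
# Toy suffix rules and a toy dictionary (both hypothetical).
SUFFIXES = ('ing', 'ed', 's')
LEMMA_DICT = {'women': 'woman', 'lying': 'lie', 'were': 'be'}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words come back unchanged."""
    return LEMMA_DICT.get(word, word)

for w in ['walked', 'women', 'lying']:
    print(w, '->', toy_stem(w), '|', toy_lemmatize(w))
```

The rule-based function handles regular forms such as 'walked' without any data, while the lookup handles irregular forms such as 'women' but only for words that are in its dictionary.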

#### Stemming

##### PorterStemmer

+ http://www.tartarus.org/~martin/PorterStemmer/

In [49]:
from nltk import PorterStemmer

In [50]:
porter = PorterStemmer()  # default mode is 'NLTK_EXTENSIONS'; the other modes are 'MARTIN_EXTENSIONS' and 'ORIGINAL_ALGORITHM' (see the documentation)
tokens = ['lying']

In [51]:
print(porter.stem(tokens[0]))

lie


##### LancasterStemmer

In [52]:
from nltk import LancasterStemmer

In [53]:
lancaster = LancasterStemmer()
tokens2 = ['lying']

In [55]:
print(lancaster.stem(tokens2[0]))

lying


As we can see, for this word <b>Lancaster does not</b> reduce <b>lying</b> to its root <b>lie</b>, whereas Porter, in the run above, does.

#### Lemmatization

There are several lemmatizers, but we will only use WordNet's Lemmatizer:

In [57]:
from nltk import WordNetLemmatizer
from nltk.corpus import brown

In [58]:
tokens = brown.words(categories=['religion'])

In [59]:
wnl = WordNetLemmatizer()

In [64]:
text2 = 'the women are lying'

In [66]:
tokens2 = nltk.word_tokenize(text2)

In [69]:
print([wnl.lemmatize(t) for t in tokens2])

['the', 'woman', 'are', 'lying']

Note that <b>lemmatize</b> treats every token as a noun by default (pos='n'), which is why <b>are</b> and <b>lying</b> are returned unchanged. Passing pos='v' asks for the verb lemma instead.


In [60]:
print(set([wnl.lemmatize(t) for t in tokens]))

{'upside', 'As', 'conceptualization', 'partaker', 'originally', 'balanced', 'sector', 'chunk', 'prudential', 'development', 'Munich', 'able', 'arm', 'Manjucri', 'ritual', 'adverse', 'rarely', 'sympathy', 'central', 'little', 'Among', 'sacrifice', 'justification', 'tenth', 'withheld', 'originated', 'propriety', 'Our', 'prohibition', 'blindness', '1,419,833', 'fallen', 'forget', 'chorus', 'prompt', 'governed', 'uniqueness', 'factor', 'mathematical', 'Program', 'occasional', '9,910,741', 'feature', 'textbook', 'consciously', 'recluse', 'association', 'mass', 'dedication', 'insisting', 'symbolism', 'participate', 'preserve', 'Churches', 'gathering', 'tried', 'trend', 'square', 'welfare', 'Abigail', 'pass', 'mode', 'hunger', 'proscription', 'otherworldly', 'gossip', '200', 'analyzing', 'careful', 'Yonkers', 'appointment', 'Hwa-Shan', 'demoted', 'willing', 'Baltimore', 'rank', 'unexpectedly', 'possibly', 'wayside', 'optimistic', 'thesis', "author's", 'foolishness', 'Holiness', 'Join', 'self-

### Regex for Tokenization

Let's try to build our own tokenizer that uses Regular Expression patterns.

In [70]:
text = 'A Linux server, like any other computer you may be familiar with, runs applications. To the computer, these are' \
       ' considered "processes" While Linux will handle the low-level, behind-the-scenes management in a process\'s ' \
       'life-cycle, you will need a way of interacting with the operating system to manage it from a higher-level.'

First of all, we are going to try to split our text into tokens:

In [72]:
print(re.split(' ', text))

['A', 'Linux', 'server,', 'like', 'any', 'other', 'computer', 'you', 'may', 'be', 'familiar', 'with,', 'runs', 'applications.', 'To', 'the', 'computer,', 'these', 'are', 'considered', '"processes"', 'While', 'Linux', 'will', 'handle', 'the', 'low-level,', 'behind-the-scenes', 'management', 'in', 'a', "process's", 'life-cycle,', 'you', 'will', 'need', 'a', 'way', 'of', 'interacting', 'with', 'the', 'operating', 'system', 'to', 'manage', 'it', 'from', 'a', 'higher-level.']


As we can see, punctuation stays attached to the words: the last token is <b>higher-level.</b>, with the trailing period, rather than <b>higher-level</b>. So let's try a different approach:

In [73]:
print(re.split(r'\s+', text))

['A', 'Linux', 'server,', 'like', 'any', 'other', 'computer', 'you', 'may', 'be', 'familiar', 'with,', 'runs', 'applications.', 'To', 'the', 'computer,', 'these', 'are', 'considered', '"processes"', 'While', 'Linux', 'will', 'handle', 'the', 'low-level,', 'behind-the-scenes', 'management', 'in', 'a', "process's", 'life-cycle,', 'you', 'will', 'need', 'a', 'way', 'of', 'interacting', 'with', 'the', 'operating', 'system', 'to', 'manage', 'it', 'from', 'a', 'higher-level.']


The <b>\s</b> metacharacter is used to find a whitespace character.

* A whitespace character can be:

    * A space character
    * A tab character
    * A carriage return character
    * A new line character
    * A vertical tab character
    * A form feed character
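
A short sketch of where the two patterns differ, using a made-up string that contains a tab, a newline and a double space (unlike the text above):

```python
import re

# Hypothetical sample with mixed whitespace.
sample = 'low-level\tprocess\n  management'

# Splitting on a single space ignores tabs and newlines and
# produces an empty string between the two consecutive spaces.
by_space = re.split(' ', sample)
print(by_space)

# \s+ treats any run of whitespace characters as one separator.
by_whitespace = re.split(r'\s+', sample)
print(by_whitespace)
```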

Since our text only contains single spaces, the result is the same as before. Let's try something different:

In [75]:
print(re.findall(r'\W', text))

[' ', ' ', ',', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ',', ' ', ' ', '.', ' ', ' ', ' ', ',', ' ', ' ', ' ', ' ', '"', '"', ' ', ' ', ' ', ' ', ' ', ' ', '-', ',', ' ', '-', '-', ' ', ' ', ' ', ' ', "'", ' ', '-', ',', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '-', '.']


In [76]:
print(re.split(r'\W', text))

['A', 'Linux', 'server', '', 'like', 'any', 'other', 'computer', 'you', 'may', 'be', 'familiar', 'with', '', 'runs', 'applications', '', 'To', 'the', 'computer', '', 'these', 'are', 'considered', '', 'processes', '', 'While', 'Linux', 'will', 'handle', 'the', 'low', 'level', '', 'behind', 'the', 'scenes', 'management', 'in', 'a', 'process', 's', 'life', 'cycle', '', 'you', 'will', 'need', 'a', 'way', 'of', 'interacting', 'with', 'the', 'operating', 'system', 'to', 'manage', 'it', 'from', 'a', 'higher', 'level', '']
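
The empty strings in this output appear wherever two non-word characters are adjacent (for example a comma followed by a space) or where the text ends with one. A minimal sketch on a made-up fragment, showing one possible way to avoid them:

```python
import re

# Hypothetical fragment in the style of the text above.
sample = 'with, runs applications.'

# Each single non-word character is a separator, so ", " yields an empty string.
print(re.split(r'\W', sample))

# Split on runs of non-word characters and drop any leftover empty strings.
tokens = [t for t in re.split(r'\W+', sample) if t]
print(tokens)
```

This removes the empty strings, but hyphenated words are still broken apart, because "-" is also a non-word character.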


<b>\W</b> matches any <b>non-word</b> character, so the punctuation is removed, but <b>higher-level</b> is now split into <b>higher</b> and <b>level</b> and empty strings appear in the result. Let's try to refine our pattern:

Using alternation in the RegEx:

*Note: the pipe character | means OR

In [77]:
print(re.findall(r'\w+|\S|\w*', text)) # 1 or more word characters OR one non-whitespace character OR 0 or more word characters

['A', '', 'Linux', '', 'server', ',', '', 'like', '', 'any', '', 'other', '', 'computer', '', 'you', '', 'may', '', 'be', '', 'familiar', '', 'with', ',', '', 'runs', '', 'applications', '.', '', 'To', '', 'the', '', 'computer', ',', '', 'these', '', 'are', '', 'considered', '', '"', 'processes', '"', '', 'While', '', 'Linux', '', 'will', '', 'handle', '', 'the', '', 'low', '-', 'level', ',', '', 'behind', '-', 'the', '-', 'scenes', '', 'management', '', 'in', '', 'a', '', 'process', "'", 's', '', 'life', '-', 'cycle', ',', '', 'you', '', 'will', '', 'need', '', 'a', '', 'way', '', 'of', '', 'interacting', '', 'with', '', 'the', '', 'operating', '', 'system', '', 'to', '', 'manage', '', 'it', '', 'from', '', 'a', '', 'higher', '-', 'level', '.', '']
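
The empty strings come from the last alternative: \w* is allowed to match zero characters, so it "succeeds" with an empty match at every space. A small sketch on a made-up fragment, assuming we simply drop that alternative:

```python
import re

# Hypothetical fragment used only for this illustration.
sample = 'low-level, ok.'

# With \w* as a fallback, empty matches are reported at whitespace.
with_empty = re.findall(r'\w+|\S|\w*', sample)
print(with_empty)

# Without it, only words and single punctuation characters are returned.
without_empty = re.findall(r'\w+|\S', sample)
print(without_empty)
```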


In [78]:
print(re.findall(r"\w+[-']+\w+", text)) # (1 or more word characters) followed by (1 or more of - or ') followed by (1 or more word characters)

['low-level', 'behind-the', "process's", 'life-cycle', 'higher-level']


In [79]:
print(re.findall(r"\w+[-']+\w+|\w+", text)) # same as previous OR 1 or more word characters

['A', 'Linux', 'server', 'like', 'any', 'other', 'computer', 'you', 'may', 'be', 'familiar', 'with', 'runs', 'applications', 'To', 'the', 'computer', 'these', 'are', 'considered', 'processes', 'While', 'Linux', 'will', 'handle', 'the', 'low-level', 'behind-the', 'scenes', 'management', 'in', 'a', "process's", 'life-cycle', 'you', 'will', 'need', 'a', 'way', 'of', 'interacting', 'with', 'the', 'operating', 'system', 'to', 'manage', 'it', 'from', 'a', 'higher-level']
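
The order of the alternatives matters here: the regex engine tries them from left to right and keeps the first one that matches. A short sketch on a made-up fragment, comparing both orders:

```python
import re

# Hypothetical fragment in the style of the text above.
sample = "a low-level process's life-cycle"

# Compound pattern first: hyphenated and apostrophe words stay in one piece.
longer_first = re.findall(r"\w+[-']+\w+|\w+", sample)
print(longer_first)

# Plain \w+ first: it always wins, so the compound pattern is never used.
shorter_first = re.findall(r"\w+|\w+[-']+\w+", sample)
print(shorter_first)
```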


Tuning a RegEx can be very tricky, so although the result is not perfect (<b>behind-the-scenes</b> is still split into <b>behind-the</b> and <b>scenes</b>), we will leave it as it is for now.
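
One possible refinement, offered as a sketch rather than a definitive tokenizer: let the "separator plus word" part repeat, so that words with more than one hyphen stay together. The sample below is an excerpt of the text used above.

```python
import re

# Excerpt of the example text.
sample = "the low-level, behind-the-scenes management in a process's life-cycle"

# A word, optionally followed by any number of (- or ') + word groups.
tokens = re.findall(r"\w+(?:[-']\w+)*", sample)
print(tokens)
```

Punctuation such as commas is still discarded by this pattern, which may or may not be what a given application needs.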