# Intro

This notebook illustrates the main pre-processing strategies for text with tools such as NLTK, spaCy and TextBlob.
The steps covered are:

- tokenization;
- stop word removal;
- lemmatization/stemming;
- N-grams.

### Tokenization

Tokenization is the process of segmenting text into tokens, which may represent words, characters or subwords. For instance, the sentence "I am supermotivated" can be split into "[I, am, supermotivated]" (words) or "[I, am, super, motivated]" (subwords). The later processing steps operate on tokens, so this step prepares the data for them.
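
The different granularities can be illustrated without any library. The sketch below is only a minimal illustration and not how NLTK, spaCy or TextBlob tokenize: the regular expression, the `toy_vocab` set and the `greedy_subwords` helper are assumptions made for this example.

```python
import re

def word_tokens(text):
    # Runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def greedy_subwords(word, vocab):
    # Greedy longest-match segmentation against a vocabulary;
    # falls back to single characters when nothing matches.
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# Hypothetical subword vocabulary, chosen to reproduce the example above.
toy_vocab = {"I", "am", "super", "motivated"}

sentence = "I am supermotivated"
words = word_tokens(sentence)
subwords = [p for w in words for p in greedy_subwords(w, toy_vocab)]
chars = list("am")  # character-level tokens of a single word

print(words)     # ['I', 'am', 'supermotivated']
print(subwords)  # ['I', 'am', 'super', 'motivated']
print(chars)     # ['a', 'm']
```

Real subword tokenizers learn their vocabulary from data; the fixed set here only serves to show the segmentation step.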

### Stop Words Removal

Stop words are words that carry little information for the processing task. They act as noise and can be detrimental to the analysis. For this reason, a list of stop words exists for each language, and these words are removed from the text before the next processing steps. Typically they are very common words, which can distort the measured similarity between documents.
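
The effect on document similarity can be sketched with a small example. The stop list, the two sentences and the choice of Jaccard similarity are assumptions made for illustration; real stop word lists are much longer.

```python
# Hypothetical, deliberately tiny stop list.
STOP_WORDS = {"the", "is", "a", "of", "on", "in", "and", "to"}

def jaccard(a, b):
    # Size of the intersection divided by size of the union of the two token sets.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

doc1 = "the cat is on the mat".split()
doc2 = "the stock is on the rise".split()

raw_similarity = jaccard(doc1, doc2)
filtered_similarity = jaccard(remove_stop_words(doc1), remove_stop_words(doc2))

print(round(raw_similarity, 2))  # 0.43
print(filtered_similarity)       # 0.0
```

The two sentences share only stop words, so they look fairly similar before filtering and completely unrelated afterwards.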

### Lemmatizing/Stemming

Languages inflect words for gender, number, tense, etc., which produces several variations of a word and, in irregular cases, changes the form of the word completely. Having the root form of a word is better for downstream methods, since differently inflected forms are then treated as the same word. In the sentence "I **like** to have something that he **likes**", "like" and "likes" originate from the same root word "like" and should both be treated as "like" to improve the analysis. There are two ways of simplifying the text representation in this way: lemmatisation and stemming.

Lemmatisation is the process of grouping together the inflected forms of a word under its lemma, that is, its dictionary form (e.g. "*to walk*", "*walked*" and "*walking*" have the same lemma "*walk*").

Stemming differs from lemmatisation in that the meaning of the word is not taken into account, as it is for the lemma. A stem is the root form of a word and is not necessarily a dictionary word, e.g. "*world*" is the stem of "*worldwide*" and "*worlds*".
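
The difference can be sketched with a toy rule-based stemmer and a toy lemma lookup. Both the suffix list and the lemma table below are hypothetical and far simpler than what real stemmers and lemmatizers use.

```python
# Hypothetical suffix rules and lemma table, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"walked": "walk", "walking": "walk", "likes": "like", "went": "go"}

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    # Dictionary lookup; unknown words are returned unchanged.
    return LEMMAS.get(word, word)

for w in ["walked", "walking", "likes", "went"]:
    print(w, "stem:", toy_stem(w), "lemma:", toy_lemma(w))
```

The stemmer turns "likes" into the non-word "lik" and cannot relate "went" to "go", whereas the lookup handles both because it relies on dictionary knowledge.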
 
### N-grams

Text is a sequential structure in which the *next* word depends to some degree on the *previous* ones. When designing probabilistic models, it is therefore useful to structure the text so that these dependencies can be captured, and N-grams serve this purpose. N-grams organize the text by grouping *N* consecutive tokens in a sliding window, so that adjacent groups overlap in all but one token. For instance, the sentence "I am doing great", organized as bigrams (2-grams), would be: ["I am", "am doing", "doing great"].
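
A minimal sketch of the sliding window, and of how bigram counts can feed a simple probabilistic model, is given below. The helper names and the tiny corpus are assumptions made for this example; the probability is a plain relative frequency without smoothing.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of size n over the token list, one token at a time.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I am doing great".split()
bigrams = ngrams(tokens, 2)
print(bigrams)  # [('I', 'am'), ('am', 'doing'), ('doing', 'great')]

# Hypothetical toy corpus used to estimate P(word | previous word).
corpus = "I am doing great . I am happy .".split()
bigram_counts = Counter(ngrams(corpus, 2))
context_counts = Counter(corpus[:-1])  # every token that has a successor

def next_word_probability(prev, word):
    # Relative frequency of `word` among all tokens that follow `prev`.
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(next_word_probability("I", "am"))      # 1.0
print(next_word_probability("am", "doing"))  # 0.5
```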


#### Using NLTK

In [77]:
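import nltk

# Assumes the required NLTK resources (tokenizer, POS tagger, NE chunker,
# stop word list and WordNet) have already been fetched with nltk.download(...).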


#-------------------------------------------------------------------------------------------------------------------------------
# Tokenization
#-------------------------------------------------------------------------------------------------------------------------------

sentence = "I am super-motivated."
tokens = nltk.word_tokenize(sentence)
print("tokenized sentence:------------")
print(tokens)
tagged = nltk.pos_tag(tokens)
print("\ntagged tokens:------------")
print(tagged)

entities = nltk.chunk.ne_chunk(tagged)
print("\nnamed entities:------------")
print(entities)

#-------------------------------------------------------------------------------------------------------------------------------
# Stop Word Removal
#-------------------------------------------------------------------------------------------------------------------------------
#from: https://www.geeksforgeeks.org/removing-stop-words-nltk-python/

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

example_sent = """This is a sample sentence, showing off the stop words filtration."""

stop_words = set(stopwords.words('english')) 

word_tokens = word_tokenize(example_sent) 

# Keep only the tokens that are not in the stop word list.
# The comparison is case-sensitive and the NLTK list is lowercase, so "This" is kept.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens) 
print(filtered_sentence) 

#-------------------------------------------------------------------------------------------------------------------------------
# Stemming and Lemmatizing
#-------------------------------------------------------------------------------------------------------------------------------
#from: https://www.guru99.com/stemming-lemmatization-python-nltk.html

from nltk.stem import PorterStemmer

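e_words = ["studies", "studying", "cries", "cry"]

# Stemming: strips suffixes by rule, so the result need not be a real word.
ps = PorterStemmer()
for w in e_words:
    root_word = ps.stem(w)
    print(root_word)

# Lemmatizing: looks words up in WordNet. Without a POS argument every word
# is treated as a noun, which is why "studying" is returned unchanged.
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))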
    

#-------------------------------------------------------------------------------------------------------------------------------
# N-grams
#-------------------------------------------------------------------------------------------------------------------------------
#from: https://www.pythonprogramming.in/generate-the-n-grams-for-the-given-sentence-using-nltk-or-textblob.html

from nltk.util import ngrams

data = 'A class is a blueprint for the object.'

# ngrams() returns a generator, so it is materialized with list() before printing.
n_grams = ngrams(nltk.word_tokenize(data), 2)
print(list(n_grams))
n_grams = ngrams(nltk.word_tokenize(data), 4)
print(list(n_grams))

tokenized sentence:------------
['I', 'am', 'super-motivated', '.']

tagged tokens:------------
[('I', 'PRP'), ('am', 'VBP'), ('super-motivated', 'JJ'), ('.', '.')]

named entities:------------
(S I/PRP am/VBP super-motivated/JJ ./.)
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
studi
studi
cri
cri
Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry
[('A', 'class'), ('class', 'is'), ('is', 'a'), ('a', 'blueprint'), ('blueprint', 'for'), ('for', 'the'), ('the', 'object'), ('object', '.')]
[('A', 'class', 'is', 'a'), ('class', 'is', 'a', 'blueprint'), ('is', 'a', 'blueprint', 'for'), ('a', 'blueprint', 'for', 'the'), ('blueprint', 'for', 'the', 'object'), ('for', 'the', 'object', '.')]


#### Using TextBlob

In [80]:
from textblob import TextBlob

sentence = "I am super-motivated."
blob = TextBlob(sentence)
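# In a notebook only the last expression of a cell is displayed,
# so wrap these in print() to inspect them.
blob.tokens
blob.tags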

#-------------------------------------------------------------------------------------------------------------------------------
# Stop Word Removal
#-------------------------------------------------------------------------------------------------------------------------------
# TextBlob has no functionality specific to stop word removal


#-------------------------------------------------------------------------------------------------------------------------------
# Lemmatizing
#-------------------------------------------------------------------------------------------------------------------------------

# import the Word class from textblob
from textblob import Word

# create a Word object. 
u = Word("rocks") 

# apply lemmatization. 
print("rocks :", u.lemmatize()) 

# create a Word object. 
v = Word("corpora") 

# apply lemmatization. 
print("corpora :", v.lemmatize()) 

# create a Word object. 
w = Word("better") 

# apply lemmatization with the POS argument "a", which denotes adjective.
print("better :", w.lemmatize("a"))

#-------------------------------------------------------------------------------------------------------------------------------
# N-grams
#-------------------------------------------------------------------------------------------------------------------------------
#from: https://www.pythonprogramming.in/generate-the-n-grams-for-the-given-sentence-using-nltk-or-textblob.html
data = 'A class is a blueprint for the object.'
n_grams = TextBlob(data).ngrams(2)
print(n_grams)
n_grams = TextBlob(data).ngrams(4)
print(n_grams)

rocks : rock
corpora : corpus
better : good
[WordList(['A', 'class']), WordList(['class', 'is']), WordList(['is', 'a']), WordList(['a', 'blueprint']), WordList(['blueprint', 'for']), WordList(['for', 'the']), WordList(['the', 'object'])]
[WordList(['A', 'class', 'is', 'a']), WordList(['class', 'is', 'a', 'blueprint']), WordList(['is', 'a', 'blueprint', 'for']), WordList(['a', 'blueprint', 'for', 'the']), WordList(['blueprint', 'for', 'the', 'object'])]


#### Using spaCy

In [18]:
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
doc = nlp("I am super-motivated")
for token in doc:
    print(token)

print("\n")
# add a special case that splits the token "am" into "a" and "m"
special_case = [{ORTH:"a"}, {ORTH:"m"}]
nlp.tokenizer.add_special_case("am", special_case)
doc = nlp("I am super-motivated")
for token in doc:
    print(token)

print("\n")
# show which tokenizer rule or pattern produced each token
from spacy.lang.en import English

nlp = English()
text = '''I am super-motivated'''
doc = nlp(text)
tok_exp = nlp.tokenizer.explain(text)
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
    print(t[1], "\t", t[0])

    

I
am
super
-
motivated


I
a
m
super
-
motivated


I 	 TOKEN
am 	 TOKEN
super 	 TOKEN
- 	 INFIX
motivated 	 TOKEN


spaCy also lets you define your own tokenizer rules. You can plug in custom functions or base the tokenization on pattern matchers such as regular expressions.

In [52]:
import re
import spacy
from spacy.tokenizer import Tokenizer

simple_url_re = re.compile(r'''^https?://''')
prefix_re = re.compile(r'''ac\[''')
suffix_re = re.compile(r'''c''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, 
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     url_match=simple_url_re.match)


nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp("hello-world. :) ac[https://www.nltk.org/data.html aaaaacccc")
print([t.text for t in doc])


['hello-world.', ':)', 'ac[', 'https://www.nltk.org/data.htm', 'l', 'aaaaa', 'c', 'c', 'c', 'c']


There is also a package of tokenizers that can be used with spaCy for special cases:
https://github.com/huggingface/tokenizers

In [73]:
#-------------------------------------------------------------------------------------------------------------------------------
# Stop Word Removal
#-------------------------------------------------------------------------------------------------------------------------------
import spacy    
nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words.add("my_new_stopword")
nlp.Defaults.stop_words

#-------------------------------------------------------------------------------------------------------------------------------
# Lemmatizing (spaCy has no stemmer; use the NLTK one for stemming)
#-------------------------------------------------------------------------------------------------------------------------------
#https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/

sentence7 = nlp('A letter has been written, asking him to be released')

for word in sentence7:
    print(word.text + '  ===>', word.lemma_)
    
#-------------------------------------------------------------------------------------------------------------------------------
# N-grams
#-------------------------------------------------------------------------------------------------------------------------------
# spaCy does not have a dedicated n-gram function

A  ===> a
letter  ===> letter
has  ===> have
been  ===> be
written  ===> write
,  ===> ,
asking  ===> ask
him  ===> he
to  ===> to
be  ===> be
released  ===> release
