# Intro


Analysing a language requires awareness of what counts as noise for the task at hand. The problem is not that the text itself is ambiguous, but that, for a given analysis, inflected word forms and function words such as sentence connectors carry little information and can obscure the signal. For instance, the most frequent words in *English* are "the", "and", "or", and so on, which are irrelevant for most analyses. Likewise, verb conjugation or grammatical gender may not be relevant.

Pre-processing the text takes all these aspects of a language into account and is an essential step. It involves cleaning the text, removing unnecessary words and punctuation, and reducing the text to its simplest form so that relevant features can be extracted and processed correctly. This section presents the most relevant methods for text pre-processing.

In this notebook, the main text pre-processing strategies are illustrated with tools such as NLTK, spaCy and TextBlob.
The steps covered are:

- tokenization;
- stop word removal;
- lemmatization/stemming;
- N-grams;
- parsing;
- grammar inference.

### Tokenization

Tokenization is the process of segmenting text into tokens, which may represent words, characters or subwords. For instance, the sentence "I am supermotivated" can be split into "[I, am, supermotivated]" or "[I, am, super, motivated]". This step prepares the data for the subsequent processing steps.
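As a rough illustration of the three token granularities mentioned above, here is a minimal sketch using only the Python standard library. The function names and the tiny subword vocabulary are assumptions made for this example; real tokenizers (NLTK, spaCy) use far richer rules.

```python
import re

def word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+(?:-\w+)* keeps hyphenated words together; [^\w\s] captures punctuation.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

def char_tokenize(text):
    """Split text into character tokens, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]

def subword_tokenize(word, vocab):
    """Greedy longest-match split of a word into known subwords."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces

print(word_tokenize("I am super-motivated."))  # ['I', 'am', 'super-motivated', '.']
print(char_tokenize("I am"))                   # ['I', 'a', 'm']
# Hypothetical two-entry vocabulary, only for illustration.
print(subword_tokenize("supermotivated", {"super", "motivated"}))  # ['super', 'motivated']
```

The greedy longest-match strategy is only one possible way to obtain subwords; it is used here because it is short enough to read at a glance.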

### Stop Words Removal

Stop words are words that carry little information for the analysis. They act as noise and can be detrimental to the results. For this reason, a list of stop words exists for each language, and these words are removed from the text before the next processing steps. Typically they are very common words, which can distort, for example, the similarity computed between documents.
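The idea can be sketched in a few lines of plain Python. The stop list below is a hypothetical, deliberately tiny one (real lists contain well over a hundred entries), and the function name is an assumption of this example. Tokens are lowercased before the lookup, because stop lists are usually stored in lowercase; without that step a capitalised token such as "This" would be kept, as happens in the NLTK output further down.

```python
# Hypothetical, deliberately tiny stop list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "this", "off"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["This", "is", "a", "sample", "sentence", "showing",
          "off", "the", "stop", "words", "filtration"]
print(remove_stop_words(tokens))
# ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```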

### Lemmatizing/Stemming

As mentioned, languages inflect words for gender, tense, number, and so on, which produces variations of a word and, for irregular forms, may change the word completely. Reducing words to a common base form helps later processing steps, since differently inflected forms are then treated as the same word. In the sentence "I **like** to have something that he **likes**", "like" and "likes" derive from the same base word "like" and should both be counted as "like". There are two ways of simplifying the text in this way: lemmatisation and stemming.

Lemmatisation is the process of grouping together the inflected forms of a word under its lemma, that is, its dictionary form (e.g. "*to walk*", "*walked*" and "*walking*" share the lemma "*walk*").

Stemming differs from lemmatisation in that it strips affixes by rule, without inferring the meaning of the word as is done for the lemma. A stem is the part of the word left after stripping, e.g. "*world*" is the stem of "*worlds*", and it need not be a dictionary word: in the NLTK example below, the Porter stemmer reduces "*studies*" to "*studi*".
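To make the contrast concrete, the sketch below places a toy rule-based stemmer next to a toy dictionary-based lemmatizer. Both are assumptions made for illustration: the suffix list is not the Porter algorithm, and the lookup table merely stands in for a real lexical resource such as WordNet.

```python
# Hypothetical lookup table standing in for a real dictionary such as WordNet.
LEMMA_TABLE = {"studies": "study", "studying": "study",
               "cries": "cry", "better": "good"}

# Suffixes tried in order; purely illustrative, not the Porter algorithm.
SUFFIXES = ("ing", "ies", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "studying", "cries", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
# studies -> stud | study
# studying -> study | study
# cries -> cri | cry
# better -> better | good
```

The stemmer is cheap but can return non-words ("stud", "cri") and cannot handle irregular forms ("better"), whereas the lemmatizer returns dictionary forms but only for words it knows.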
 
### N-grams

Text is a sequential structure in which the *next* word depends to some degree on the *previous* ones. When designing probabilistic models, it is therefore useful to structure the text so that these dependencies are explicit, and N-grams serve this purpose. An N-gram is a group of *N* consecutive tokens; consecutive N-grams overlap, each one shifted by a single token from the previous one. For instance, the sentence "I am doing great", organized as bigrams (2-grams), gives: ["I am", "am doing", "doing great"].
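A minimal standard-library sketch of N-gram extraction follows, together with a simple count-based estimate of how likely one word is to follow another. The helper names `ngrams` and `next_word_probability` are chosen for this example only and are not tied to any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all overlapping n-grams as tuples."""
    # zip over n shifted views of the token list; it stops at the shortest one.
    return list(zip(*(tokens[i:] for i in range(n))))

def next_word_probability(tokens, prev, nxt):
    """Estimate P(nxt | prev) from bigram counts (no smoothing)."""
    bigram_counts = Counter(ngrams(tokens, 2))
    prev_total = sum(c for (first, _), c in bigram_counts.items() if first == prev)
    return bigram_counts[(prev, nxt)] / prev_total if prev_total else 0.0

tokens = "I am doing great".split()
print(ngrams(tokens, 2))  # [('I', 'am'), ('am', 'doing'), ('doing', 'great')]
print(ngrams(tokens, 3))  # [('I', 'am', 'doing'), ('am', 'doing', 'great')]

# "a" is followed by "b" twice and by "c" once, so P(b | a) = 2/3.
print(next_word_probability("a b a b a c".split(), "a", "b"))
```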


#### Using NLTK

In [77]:
import nltk


#-------------------------------------------------------------------------------------------------------------------------------
# Tokenization
#-------------------------------------------------------------------------------------------------------------------------------

sentence = "I am super-motivated."
tokens = nltk.word_tokenize(sentence)
print("tokenized sentence:------------")
print(tokens)
tagged = nltk.pos_tag(tokens)
print("\ntagged tokens:------------")
print(tagged)

entities = nltk.chunk.ne_chunk(tagged)
print("\nnamed entities:------------")
print(entities)

#-------------------------------------------------------------------------------------------------------------------------------
# Stop Word Removal
#-------------------------------------------------------------------------------------------------------------------------------
#from: https://www.geeksforgeeks.org/removing-stop-words-nltk-python/

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

example_sent = """This is a sample sentence, showing off the stop words filtration."""

stop_words = set(stopwords.words('english')) 

word_tokens = word_tokenize(example_sent) 

# Keep only the tokens that are not in the stop word list.
# Note: the NLTK list is lowercase, so capitalised tokens such as "This" are kept.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens) 
print(filtered_sentence) 

#-------------------------------------------------------------------------------------------------------------------------------
# Stemming and Lemmatizing
#-------------------------------------------------------------------------------------------------------------------------------
#from: https://www.guru99.com/stemming-lemmatization-python-nltk.html

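from nltk.stem import PorterStemmer, WordNetLemmatizer

e_words = ["studies", "studying", "cries", "cry"]

# Stemming: rule-based suffix stripping; the result need not be a real word.
ps = PorterStemmer()
for w in e_words:
    root_word = ps.stem(w)
    print(root_word)

# Lemmatization: dictionary lookup. Without a POS argument, WordNet treats
# every word as a noun, which is why "studying" is returned unchanged.
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))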
    

#-------------------------------------------------------------------------------------------------------------------------------
# N-grams
#-------------------------------------------------------------------------------------------------------------------------------
#from: https://www.pythonprogramming.in/generate-the-n-grams-for-the-given-sentence-using-nltk-or-textblob.html

from nltk.util import ngrams

data = 'A class is a blueprint for the object.'

n_grams = ngrams(nltk.word_tokenize(data), 2)
print([gram for gram in n_grams])
n_grams = ngrams(nltk.word_tokenize(data), 4)
print([gram for gram in n_grams])

tokenized sentence:------------
['I', 'am', 'super-motivated', '.']

tagged tokens:------------
[('I', 'PRP'), ('am', 'VBP'), ('super-motivated', 'JJ'), ('.', '.')]

named entities:------------
(S I/PRP am/VBP super-motivated/JJ ./.)
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
studi
studi
cri
cri
Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry
[('A', 'class'), ('class', 'is'), ('is', 'a'), ('a', 'blueprint'), ('blueprint', 'for'), ('for', 'the'), ('the', 'object'), ('object', '.')]
[('A', 'class', 'is', 'a'), ('class', 'is', 'a', 'blueprint'), ('is', 'a', 'blueprint', 'for'), ('a', 'blueprint', 'for', 'the'), ('blueprint', 'for', 'the', 'object'), ('for', 'the', 'object', '.')]


#### Using TextBlob

In [80]:
from textblob import TextBlob

sentence = "I am super-motivated."
blob = TextBlob(sentence)
blob.tokens
blob.tags

#-------------------------------------------------------------------------------------------------------------------------------
# Stop Word Removal
#-------------------------------------------------------------------------------------------------------------------------------
# TextBlob has no dedicated function for stop word removal


#-------------------------------------------------------------------------------------------------------------------------------
# Lemmatizing
#-------------------------------------------------------------------------------------------------------------------------------

# import the Word class from textblob
from textblob import Word 

# create a Word object. 
u = Word("rocks") 

# apply lemmatization. 
print("rocks :", u.lemmatize()) 

# create a Word object. 
v = Word("corpora") 

# apply lemmatization. 
print("corpora :", v.lemmatize()) 

# create a Word object. 
w = Word("better") 

# apply lemmatization with 
# parameter "a", "a" denotes adjective. 
print("better :", w.lemmatize("a")) 

#-------------------------------------------------------------------------------------------------------------------------------
# N-grams
#-------------------------------------------------------------------------------------------------------------------------------
#from: https://www.pythonprogramming.in/generate-the-n-grams-for-the-given-sentence-using-nltk-or-textblob.html
data = 'A class is a blueprint for the object.'
n_grams = TextBlob(data).ngrams(2)
print(n_grams)
n_grams = TextBlob(data).ngrams(4)
print(n_grams)

rocks : rock
corpora : corpus
better : good
[WordList(['A', 'class']), WordList(['class', 'is']), WordList(['is', 'a']), WordList(['a', 'blueprint']), WordList(['blueprint', 'for']), WordList(['for', 'the']), WordList(['the', 'object'])]
[WordList(['A', 'class', 'is', 'a']), WordList(['class', 'is', 'a', 'blueprint']), WordList(['is', 'a', 'blueprint', 'for']), WordList(['a', 'blueprint', 'for', 'the']), WordList(['blueprint', 'for', 'the', 'object'])]


#### Using spaCy

In [18]:
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
doc = nlp("I am super-motivated")
for token in doc:
    print(token)

print("\n")
#add special case to specific token
special_case = [{ORTH:"a"}, {ORTH:"m"}]
nlp.tokenizer.add_special_case("am", special_case)
doc = nlp("I am super-motivated")
for token in doc:
    print(token)

print("\n")
#explaining text with tokenizer
from spacy.lang.en import English

nlp = English()
text = '''I am super-motivated'''
doc = nlp(text)
tok_exp = nlp.tokenizer.explain(text)
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
    print(t[1], "\t", t[0])

    

I
am
super
-
motivated


I
a
m
super
-
motivated


I 	 TOKEN
am 	 TOKEN
super 	 TOKEN
- 	 INFIX
motivated 	 TOKEN


spaCy also lets you define your own tokenizer rules. You can supply specific functions or base the tokenization on text-matching rules such as regular expressions.

In [52]:
import re
import spacy
from spacy.tokenizer import Tokenizer

simple_url_re = re.compile(r'''^https?://''')
prefix_re = re.compile(r'''ac\[''')
suffix_re = re.compile(r'''c''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, 
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     url_match=simple_url_re.match)


nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp("hello-world. :) ac[https://www.nltk.org/data.html aaaaacccc")
print([t.text for t in doc])


['hello-world.', ':)', 'ac[', 'https://www.nltk.org/data.htm', 'l', 'aaaaa', 'c', 'c', 'c', 'c']


There is also a tokenizers package that can be used with spaCy for special cases:
https://github.com/huggingface/tokenizers

In [73]:
#-------------------------------------------------------------------------------------------------------------------------------
# Stop Word Removal
#-------------------------------------------------------------------------------------------------------------------------------
import spacy    
nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words.add("my_new_stopword")
nlp.Defaults.stop_words

#-------------------------------------------------------------------------------------------------------------------------------
# Lemmatizing (spaCy has no stemmer; use the one from NLTK)
#-------------------------------------------------------------------------------------------------------------------------------
#https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/

sentence7 = nlp(u'A letter has been written, asking him to be released')

for word in sentence7:
    print(word.text + '  ===>', word.lemma_)
    
#-------------------------------------------------------------------------------------------------------------------------------
# N-grams
#-------------------------------------------------------------------------------------------------------------------------------
#Spacy does not have a specific Ngram function

A  ===> a
letter  ===> letter
has  ===> have
been  ===> be
written  ===> write
,  ===> ,
asking  ===> ask
him  ===> he
to  ===> to
be  ===> be
released  ===> release


### Text Parsing - Using Regex

Parsing is the process of splitting text into smaller segments according to rules. These rules are defined by a grammar. One example is the class of regular grammars, which underlie regular expressions, a powerful mechanism for matching text patterns.

Besides regular expressions, text can also be parsed based on other rules. For this, *spaCy* offers a set of rule-based matchers, which can match on: (1) text, (2) regular expressions, (3) part-of-speech (POS) tags, (4) lemmas, (5) morphological features and (6) the dependency tree.

This example shows how to use Python's `re` module and the `commonregex` package, which provides predefined regular expressions for common text patterns.

In [40]:
#Regex usage
import re
from commonregex import CommonRegex

text = """John, please get that article on www.linkedin.com to me by 5:00PM 
                               on Jan 9th 2012. 4:00 would be ideal, actually. If you have any 
                               questions, You can reach me at (519)-236-2723x341 or get in touch with
                               my associate at harold.smith@gmail.com"""

#Use re module from python---------------------------
#search for lowercase words of two or more letters that start with an "a" and are surrounded by spaces:
search = r" (a[a-z]+) "
a = re.finditer(search, text)
print("--------Matches using re module from Python----------")
for i in a:
    print(i)

#search for specific pattern with CommonRegex--------
#search for times in text
print("\n")
print("----- Matches using Common Regex module for time values and emails -------")

parsed_text = CommonRegex(text)
print("time values:")
print(parsed_text.times)
print("emails:")
print(parsed_text.emails)

#Additional regular expression features
#Using verbose mode (re.VERBOSE)
print("\n")
print("----- Verbosity -----")

matches = re.finditer(r'''
        \S+ #Matches one or more non-whitespace characters
        @   #Matches @ char
        \S+ #Matches one or more non-whitespace characters
''', text, re.VERBOSE)

for match in matches:
    print(match)

#String replacement with a function
print("\n")
print("----- String Replace with Function -----")
def change_email_name(match):
    # Length of the local part including the "@", so the "@" is masked as well
    a = len(re.search(r"(\S+)@", match[0])[0])
    return "*"*a + match[0][a:]

sub = re.sub(r"\S+@\S+", change_email_name, text)
print(sub)


#Create a dictionary from the named groups in the pattern:
print("\n")
print("-----Giving tags to the pattern found-----")
print(re.search(r"(?P<email>\S+@\S+)", text).groupdict())


#Using lookaheads and lookbehinds:
print("\n")
print("----- Lookahead -----")
print("Search for word that precedes '@'")
print(re.findall(r"\w+(?=@)", text))
print("----- Lookbehind -----")
print("Search for word that follows '@'")
print(re.findall(r"(?<=@)\w+", text))

--------Matches using re module from Python----------
<re.Match object; span=(21, 30), match=' article '>
<re.Match object; span=(157, 162), match=' any '>
<re.Match object; span=(221, 225), match=' at '>
<re.Match object; span=(298, 309), match=' associate '>


----- Matches using Common Regex module for time values and emails -------
time values:
['5:00PM', '4:00']
emails:
['harold.smith@gmail.com']


----- Verbosity -----
<re.Match object; span=(312, 334), match='harold.smith@gmail.com'>


----- String Replace with Function -----
John, please get that article on www.linkedin.com to me by 5:00PM 
                               on Jan 9th 2012. 4:00 would be ideal, actually. If you have any 
                               questions, You can reach me at (519)-236-2723x341 or get in touch with
                               my associate at *************gmail.com


-----Giving tags to the pattern found-----
{'email': 'harold.smith@gmail.com'}


----- Lookahead -----
Search for word that 

### Text Parsing - Using spaCy

spaCy has a rule-based matching module that supports rules at several levels of linguistic analysis. The following link lists all the rules available:

https://spacy.io/usage/rule-based-matching#matcher

With this method, several rules can be combined to define a pattern, which gives a lot of freedom when searching the text.

#### Entity, Shape, Tag, Lemma Rules

Rules can be written at multiple levels of word analysis. For instance, words can be matched because they are verbs or pronouns, or because their shape is *xx*. The table below is used as a reference:

![Caption](../Figures/OverallRulesSpacy.PNG)

#### Dependency Rules

Rules can also be based on the dependencies between the words of a sentence. The operators available for navigating the dependency tree are shown below:

![Caption](../Figures/DependencyRulesSpacy.PNG)



In [52]:
#collect data from matches
def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]  # Matched span
    sent = span.sent  # Sentence containing matched span
    # Append mock entity for match in displaCy style to matched_sents.
    # Get the match span by offsetting the start and end of the span with the
    # start of the sentence in the doc
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})

import spacy
from spacy import displacy
from spacy.matcher import DependencyMatcher, Matcher

# Load the model first, so that nlp is defined before it is applied to the text
nlp = spacy.load("en_core_web_sm")

text = "Smith founded a healthcare company in 2005."
doc = nlp(text)


print("----------- POS MATCHER -----------")
print("\n")
print("pattern:")
print(r"""pattern1 = [{"POS": "NOUN", "TEXT":{"REGEX":"^h\w+"}}]""")
matched_sents = []
matcher = Matcher(nlp.vocab)
pattern1 = [{"POS": "NOUN", "TEXT": {"REGEX": r"^h\w+"}}]

matcher.add("multiple", [pattern1], on_match=collect_sents)

matches = matcher(doc)

for match in matched_sents:
    i = match["ents"][0]
    print(text[i["start"]:i["end"]])


    
print("----------- DEPENDENCY MATCHER -----------")
print("\n")
print("pattern:")
print("""pattern = [
  {
    "RIGHT_ID": "anchor_founded",       # unique name
    "RIGHT_ATTRS": {"ORTH": "founded"}  # token pattern for "founded"
  },
  {
    "LEFT_ID": "anchor_founded",
    "REL_OP": ">", #leftid is the head of the parsing tree, which means, what comes next to anchor is matched
    "RIGHT_ID": "founded_subject",
    "RIGHT_ATTRS": {"DEP": "nsubj"}, #the dependency between the anchor and this has to be nsubj 
   },
   {
    "LEFT_ID": "anchor_founded",
    "REL_OP": ">", #leftid is the head of the parsing tree, which means, what comes next to anchor is matched
    "RIGHT_ID": "founded_prep",
    "RIGHT_ATTRS": {"DEP": "prep"}, #the dependency between the anchor and this has to be prep 
   }
]""")
matcher = DependencyMatcher(nlp.vocab)
pattern = [
  {
    "RIGHT_ID": "anchor_founded",       # unique name
    "RIGHT_ATTRS": {"ORTH": "founded"}  # token pattern for "founded"
  },
  {
    "LEFT_ID": "anchor_founded",
    "REL_OP": ">", #leftid is the head of the parsing tree, which means, what comes next to anchor is matched
    "RIGHT_ID": "founded_subject",
    "RIGHT_ATTRS": {"DEP": "nsubj"}, #the dependency between the anchor and this has to be nsubj 
   },
   {
    "LEFT_ID": "anchor_founded",
    "REL_OP": ">", #leftid is the head of the parsing tree, which means, what comes next to anchor is matched
    "RIGHT_ID": "founded_prep",
    "RIGHT_ATTRS": {"DEP": "prep"}, #the dependency between the anchor and this has to be prep 
   }
]

matcher.add("FOUNDED", [pattern])
displacy.render(doc, jupyter=True)
matches = matcher(doc)

# matches[0] is (match_id, token_ids); each id is the index of a matched token in doc
for i in matches[0][1]:
    print("match"+str(i)+":")
    print(doc[i])

----------- POS MATCHER -----------


pattern:
pattern1 = [{"POS": "NOUN", "TEXT":{"REGEX":"^h\w+"}}]
healthcare
----------- DEPENDENCY MATCHER -----------


pattern:
pattern = [
  {
    "RIGHT_ID": "anchor_founded",       # unique name
    "RIGHT_ATTRS": {"ORTH": "founded"}  # token pattern for "founded"
  },
  {
    "LEFT_ID": "anchor_founded",
    "REL_OP": ">", #leftid is the head of the parsing tree, which means, what comes next to anchor is matched
    "RIGHT_ID": "founded_subject",
    "RIGHT_ATTRS": {"DEP": "nsubj"}, #the dependency between the anchor and this has to be nsubj 
   },
   {
    "LEFT_ID": "anchor_founded",
    "REL_OP": ">", #leftid is the head of the parsing tree, which means, what comes next to anchor is matched
    "RIGHT_ID": "founded_prep",
    "RIGHT_ATTRS": {"DEP": "prep"}, #the dependency between the anchor and this has to be prep 
   }
]


match1:
founded
match0:
Smith
match5:
in


### Text Parsing - Fuzzy text parsing (TODO)

Sometimes text has to be matched with a certain tolerance for differences. Imagine you want to find the word "made", but it was misspelled as "maide" in the text. In such cases, fuzziness in the search is necessary.
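Before turning to a dedicated library, the idea can be sketched with `difflib` from the Python standard library. This is not how spaczz works internally; it is only an assumed, minimal way of scoring similarity, and the `fuzzy_find` helper and its `cutoff` value are choices made for this example.

```python
from difflib import SequenceMatcher

def fuzzy_find(target, tokens, cutoff=0.8):
    """Return (token, ratio) pairs whose similarity to target is >= cutoff."""
    hits = []
    for token in tokens:
        # ratio() is 2 * matched_chars / total_chars, a value between 0 and 1.
        ratio = SequenceMatcher(None, target, token.lower()).ratio()
        if ratio >= cutoff:
            hits.append((token, round(ratio, 2)))
    return hits

tokens = "The table was maide of wood and made to last".split()
print(fuzzy_find("made", tokens))  # [('maide', 0.89), ('made', 1.0)]
```

Lowering `cutoff` tolerates more spelling variation but also lets in unrelated words, so the threshold usually has to be tuned on the data at hand.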

In [89]:
import spacy
from spacy.tokens import Span
from spaczz.matcher import FuzzyMatcher, RegexMatcher, SimilarityMatcher

nlp = spacy.load("en_core_web_sm")
text = "Smiths founded a healthcare company in 2005."
doc = nlp(text)

print("----------- Fuzzy Matcher -----------")
print("\n")
print("pattern:")

def add_name_ent(matcher, doc, i, matches):
    """Callback on match function. Adds "NAME" entities to doc."""
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    _match_id, start, end, _ratio = matches[i]
    entity = Span(doc, start, end, label="NAME")
    doc.ents += (entity,)

matcher = RegexMatcher(nlp.vocab)
matcher.add("NAME", [r"(Smith)"], on_match=add_name_ent)
matches = matcher(doc)
print(matches)

print("----------- Regex Fuzzy Matcher -----------")
print("\n")
print("pattern:")


print("----------- Similarity MATCHER -----------")
print("\n")
print("pattern:")

# use a permissive similarity threshold (thresh=0.5) to try to produce matches in this example
matcher = SimilarityMatcher(nlp.vocab, thresh=0.5)
matcher.add("name", [nlp("health")])
matches = matcher(doc)
print(matches)

----------- Fuzzy Matcher -----------


pattern:
[]
----------- Regex Fuzzy Matcher -----------


pattern:
----------- Similarity MATCHER -----------


pattern:
[]


                Similarity results may not be useful.


## Grammar Inference (TODO)

Besides grammar-based text parsing methods, in which text is parsed according to specific criteria, there are also grammar inference methods, which infer the rules, and therefore the structure and language, of a piece of text. Examples of such algorithms are Sequitur, which infers context-free grammars, and Regular Positive and Negative Inference (RPNI), which infers regular grammars. This example uses the Sequitur algorithm and RPNI for regular inference (from: https://github.com/steynvl/inferrer).
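To give an intuition for what inferring a grammar from repetition means, here is a deliberately simplified sketch. It is not Sequitur: it is a greedy offline scheme that repeatedly replaces the most frequent adjacent pair of symbols with a new rule, whereas Sequitur works incrementally and enforces additional constraints. The function names are assumptions of this example.

```python
from collections import Counter

def infer_grammar(text):
    """Greedy pair replacement: returns {rule_id: [symbols]}, rule 0 is the start."""
    sequence = list(text)      # terminals are 1-character strings
    rules = {}
    next_id = 1                # nonterminals are integers, so they never clash
    while True:
        pairs = Counter(zip(sequence, sequence[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:          # no adjacent pair repeats: nothing left to factor out
            break
        rules[next_id] = list(pair)
        # Rewrite left to right, replacing non-overlapping occurrences of the pair.
        rewritten, i = [], 0
        while i < len(sequence):
            if i + 1 < len(sequence) and (sequence[i], sequence[i + 1]) == pair:
                rewritten.append(next_id)
                i += 2
            else:
                rewritten.append(sequence[i])
                i += 1
        sequence = rewritten
        next_id += 1
    rules[0] = sequence
    return rules

def expand(symbol, rules):
    """Expand a symbol back into the text it derives."""
    if isinstance(symbol, str):
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])

grammar = infer_grammar("abc abc abc")
for rule_id in sorted(grammar):
    print(rule_id, "->", grammar[rule_id])
print(expand(0, grammar) == "abc abc abc")  # True
```

Expanding the start rule reproduces the input exactly, which is the property any such inferred grammar has to satisfy; the rules themselves expose the repeated substructure of the text.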

In [121]:
from sksequitur import parse, Production

text = "abc abc abc"

#Parse text
grammar = parse(text)
print(grammar)
print(grammar[0])
print(grammar[1])
print(grammar[2])

0 -> 1 1 2
1 -> 2 _                                          abc_
2 -> a b c                                        abc
[Production(1), Production(1), Production(2)]
[Production(2), ' ']
['a', 'b', 'c']


In [47]:
# import sys
# import argparse
# from methods import inferrer
# from typing import Set

# pos_examples = set(["abababab", "abababab"])
# neg_examples = set(["abab"])
# alphabet = mt.inferrer.utils.determine_alphabet(pos_examples.union(neg_examples))

# algorithm = "rpni"

# if algorithm in ['rpni', 'gold']:
#     learner = inferrer.Learner(alphabet=alphabet,
#                                pos_examples=pos_examples,
#                                neg_examples=neg_examples,
#                                algorithm=algorithm)
# elif algorithm in ['lstar', 'nlstar']:
#     learner = inferrer.Learner(alphabet=alphabet,
#                                oracle=inferrer.oracle.PassiveOracle(pos_examples,
#                                                                     neg_examples),
#                                algorithm=algorithm)

# dfa = learner.learn_grammar()
# print(dfa.to_regex())

(ba*ba*b|a)*(ba*|())
