### Lemmatization
* In contrast to stemming, lemmatization looks beyond word reduction, and considers a language's full vocabulary to apply a morphological analysis to words.
* The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence.

* Lemmatization is typically seen as much more informative than simple stemming, which is why spaCy offers only lemmatization and no stemmer.

* Lemmatization looks at surrounding text to determine a given word's part of speech; it does not categorize phrases.
* Let's see how to perform lemmatization with spaCy.

In [2]:
import spacy

In [3]:
nlp = spacy.load('en_core_web_sm')

In [4]:
doc1 = nlp(u'I am runner running in a race because I love to run since I ran today')

In [5]:
for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)


I 	 PRON 	 4690420944186131903 	 I
am 	 AUX 	 10382539506755952630 	 be
runner 	 NOUN 	 12640964157389618806 	 runner
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
because 	 SCONJ 	 16950148841647037698 	 because
I 	 PRON 	 4690420944186131903 	 I
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 SCONJ 	 10066841407251338481 	 since
I 	 PRON 	 4690420944186131903 	 I
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today


In [8]:
def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

In [6]:
doc2 = nlp(u'I saw ten mice today!')

In [9]:
show_lemmas(doc2)

I            PRON   4690420944186131903    I
saw          VERB   11925638236994514241   see
ten          NUM    7970704286052693043    ten
mice         NOUN   1384165645700560590    mouse
today        NOUN   11042482332948150395   today
!            PUNCT  17494803046312582752   !
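
To make the contrast with stemming concrete, here is a toy sketch that compares a crude suffix-stripping stemmer with a dictionary lookup. Both `crude_stem` and `LEMMA_TABLE` are hypothetical illustrations and not how spaCy works internally; the lemma pairs are copied from the outputs above.

```python
# Toy contrast between stemming and lemmatization (not spaCy's implementation).
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately crude rule set

# Hypothetical mini-dictionary; the pairs come from the spaCy outputs above.
LEMMA_TABLE = {"am": "be", "mice": "mouse", "ran": "run",
               "running": "run", "saw": "see"}

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word itself."""
    return LEMMA_TABLE.get(word, word)

for word in ["am", "mice", "ran", "running", "saw"]:
    print(f"{word:10} stem={crude_stem(word):10} lemma={lookup_lemma(word)}")
```

The stemmer can only chop characters off, so it leaves irregular forms such as 'mice' and 'ran' unchanged, while the lookup maps them to 'mouse' and 'run'.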


# Stop Words

* Words like "a" and "the" appear so frequently that they don't require tagging as thoroughly as nouns, verbs and modifiers.
* We call these stop words, and they can be filtered from the text to be processed.
* spaCy holds a built-in list of English stop words; the list in the version used below has 326 entries.

* Stop words are very common words that give us little additional information, and they can hurt some natural language processing tasks, so we often remove them. Below we look at how to check and remove stop words with spaCy, and how to add our own stop words, for example when a unique data set contains an acronym that is used too often.


In [10]:
import spacy

In [11]:
nlp = spacy.load('en_core_web_sm')


In [12]:
print(nlp.Defaults.stop_words)

{'along', 'someone', 'part', 'everyone', 'through', 'does', 'ten', '’d', 'mine', 'first', 'sixty', 'third', 'together', 'noone', 'the', 'their', 'several', 'already', 'around', 'could', 'be', 'n‘t', 'before', 'yet', 'call', 'same', 'my', 'has', 'never', 'put', 'see', "'re", 'here', 'rather', 'them', 'n’t', 'unless', 'are', '‘s', 'by', 'too', 'therein', 'give', 'whereafter', 'those', 'still', 'would', 'bottom', 'these', 'under', 'wherever', 'fifteen', 'keep', 'becomes', 'who', 'nothing', 'since', 'four', 'whole', 'and', 'almost', 'beyond', 'used', 'twenty', 'may', 'well', 'ca', 'you', 'thereby', 'might', 'though', 'across', 'his', 'within', 'herself', 'was', 'even', 're', 'whether', 'whereby', 'after', 'while', 'in', 'nine', 'about', 'now', "'s", 'both', 'an', 'she', 'on', 'always', 'whither', 'herein', 'yours', 'seemed', 'whom', 'moreover', 'one', 'did', 'somewhere', 'own', 'above', 'hundred', 'whoever', 'it', 'itself', 'name', 'otherwise', 'this', 'least', 'either', 'being', 'yourselv

In [25]:
len(nlp.Defaults.stop_words)

326

In [17]:
nlp.vocab['is'].is_stop
nlp.vocab['mystery'].is_stop

False

In [19]:
nlp.Defaults.stop_words.add('btw')

In [20]:
nlp.vocab['btw'].is_stop = True

In [22]:
len(nlp.Defaults.stop_words)
nlp.vocab['btw'].is_stop

True

In [23]:
nlp.Defaults.stop_words.remove('beyond')

In [18]:
nlp.vocab['beyond'].is_stop = False

In [26]:
nlp.vocab['beyond'].is_stop

False
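
Once the flags are set, filtering stop words out of a document is a short list comprehension. The sketch below assumes spaCy v3 and uses a blank English pipeline (`spacy.blank("en")`) instead of `en_core_web_sm`, on the assumption that `is_stop` is a lexical attribute and needs no trained model; the sample sentence is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because is_stop is looked up per word and needs no trained components.
nlp = spacy.blank("en")

# Register a custom stop word, mirroring the 'btw' cells above.
nlp.Defaults.stop_words.add("btw")
nlp.vocab["btw"].is_stop = True

doc = nlp("btw the mystery is solved")

# Keep only the tokens that are not flagged as stop words.
kept = [token.text for token in doc if not token.is_stop]
print(kept)
```

Note that adding a word to `nlp.Defaults.stop_words` changes the shared defaults, so it affects every English pipeline in the same Python session.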

# Vocabulary and Matching 

* We've seen how a body of text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.
* Next we will identify and label specific phrases that match patterns we define ourselves.

* This can be thought of as a more powerful form of regular expressions, in which token attributes such as parts of speech are taken into account in the pattern search.

Compared to using regular expressions on raw text, spaCy’s rule-based matcher engines and components not only let you find the words and phrases you’re looking for – they also give you access to the tokens within the document and their relationships. This means you can easily access and analyze the surrounding tokens, merge spans into single tokens or add entries to the named entities in doc.ents.
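
The difference can be sketched in a few lines. A regular expression only reports character offsets into the raw string, while a `Matcher` hit is a token span whose neighbours can be read directly from the `Doc`. This is a minimal sketch assuming spaCy v3 and a blank English pipeline; the sentence and the key `BLOCKCHAIN_BASED` are made up for illustration.

```python
import re

import spacy
from spacy.matcher import Matcher

text = "We adopted blockchain-based contracts last year."

# Regular expression: we only get character offsets into the raw string.
regex_hit = re.search(r"blockchain-based", text)
print("regex span:", regex_hit.span())

# Token-based matching: we get token indices and can inspect neighbours.
nlp = spacy.blank("en")  # assumption: tokenizer only, no trained model
doc = nlp(text)

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "blockchain"}, {"IS_PUNCT": True}, {"LOWER": "based"}]
matcher.add("BLOCKCHAIN_BASED", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    # doc[start - 1] and doc[end] are the tokens on either side of the match.
    print(span.text, "| previous:", doc[start - 1].text, "| next:", doc[end].text)
```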

# Token based matching

Orth (short for "orthography"): the exact verbatim text of the token. In spaCy this is the same value as Text, not a normalized form; a normalized form is exposed separately as Norm.

Text: The original text of the token.

Lower: The lowercase form of the token.

Lemma: The base or dictionary form of the token.

Pos: The part-of-speech (POS) tag of the token, which identifies the grammatical role of the word in the sentence.

Tag: The detailed POS tag of the token, which provides additional information about the word, such as tense or number.

Dep: The dependency label of the token, which describes the syntactic relationship between the token and its head.

Shape: The shape of the token, which captures its capitalization, punctuation, and other formatting.

is_alpha: A Boolean attribute that indicates whether the token consists entirely of alphabetic characters.

is_digit: A Boolean attribute that indicates whether the token consists entirely of digits.

is_punct: A Boolean attribute that indicates whether the token is a punctuation mark.

is_space: A Boolean attribute that indicates whether the token consists entirely of whitespace characters.

is_stop: A Boolean attribute that indicates whether the token is a stop word, which is a common word that is often excluded from analysis.

OP: operator or quantifier to determine how often to match a token pattern. 

* '!' Negate the pattern, by requiring it to match exactly 0 times.

* '?' Make the pattern optional, by allowing it to match 0 or 1 times.

* '+' Require the pattern to match 1 or more times.

* '*' Allow the pattern to match 0 or more times.

* '{n}' Require the pattern to match exactly n times.

* '{n,m}' Require the pattern to match at least n but not more than m times.

* '{n,}' Require the pattern to match at least n times.

* '{,m}' Require the pattern to match at most m times.
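
Several of the attributes listed above can be inspected directly on tokens. The sketch below assumes spaCy v3 and a blank English pipeline, so it only prints attributes that do not depend on a trained model; Pos, Tag, Dep and Lemma need a pipeline such as `en_core_web_sm` and are left out. The sample sentence is made up.

```python
import spacy

# Assumption: tokenizer-only pipeline, so only lexical attributes are filled in.
nlp = spacy.blank("en")
doc = nlp("Apple sold 300 units today!")

# One row per token: text, lowercase form, shape and the Boolean flags.
for token in doc:
    print(f"{token.text:8} {token.lower_:8} {token.shape_:6} "
          f"alpha={token.is_alpha!s:5} digit={token.is_digit!s:5} "
          f"punct={token.is_punct!s:5} stop={token.is_stop}")
```

In a pattern, the same attributes are written as upper-case keys, for example `{'LOWER': 'apple'}` or `{'IS_DIGIT': True}`.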

In [19]:
import spacy

In [20]:
nlp = spacy.load('en_core_web_sm')

In [21]:
# rule based matching
from spacy.matcher import Matcher

In [112]:
doc = nlp(u"The use of blockchain-based smart contracts has greatly reduced transaction costs in the renewable energy sector. "
          "As more companies adopt blockchain based solutions for energy management, we can expect to see increased efficiency and transparency in the industry of Block Chain Based")

In [113]:
matcher = Matcher(nlp.vocab)

In [114]:
# blockchain based
pattern1 = [{'LOWER': 'blockchain'}, {'LOWER': 'based'}]
# blockchain-based
pattern2 = [{'LOWER': 'blockchain'}, {'IS_PUNCT' : True}, {'LOWER': 'based'}]
# block chain based
pattern3 = [{'LOWER' : 'block'}, {'LOWER' : 'chain' }, {'LOWER': 'based'}]

In [115]:
# the list of patterns is passed as the second positional argument
matcher.add('Blockchain-based', [pattern1, pattern2, pattern3], on_match=None)

In [116]:
found_matches = matcher(doc)

In [117]:
print(found_matches)

[(12507262609179161830, 3, 6), (12507262609179161830, 23, 25), (12507262609179161830, 43, 46)]


In [118]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id] # get string representation
    span = doc[start:end]                # get the matched span
    print(match_id, string_id, start, end, span.text)

12507262609179161830 Blockchain-based 3 6 blockchain-based
12507262609179161830 Blockchain-based 23 25 blockchain based
12507262609179161830 Blockchain-based 43 46 Block Chain Based


In [119]:
matcher.remove('Blockchain-based')

In [148]:
pattern1 = [{'LOWER' : 'blockchain'}, {'LOWER' : 'based'}]
# blockchain-based, blockchain--based and any other run of punctuation in between
pattern2 = [{'LOWER' : 'blockchain'}, {'IS_PUNCT' : True, 'OP': '*' }, {'LOWER' : 'based' }]

In [149]:
# the list of patterns is passed as the second positional argument
matcher.add('Blockchain-based', [pattern1, pattern2], on_match=None)

In [122]:
doc2 = nlp(u"Blockchain--based is solarpower !")

In [142]:
found_matches = matcher(doc2)

In [124]:
print(found_matches)

[(12507262609179161830, 0, 3)]
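
The other quantifiers are used the same way as '*'. As a minimal sketch, assuming spaCy v3 and a blank English pipeline, a single pattern with '?' covers both the hyphenated and the space-separated spelling; the sample text and the key `BLOCKCHAIN_BASED` are made up.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer only, no trained model
matcher = Matcher(nlp.vocab)

# The punctuation token in the middle is optional: it may occur 0 or 1 times.
pattern = [{"LOWER": "blockchain"},
           {"IS_PUNCT": True, "OP": "?"},
           {"LOWER": "based"}]
matcher.add("BLOCKCHAIN_BASED", [pattern])

doc = nlp("blockchain-based or blockchain based")

# Sort by start index so the spans print in document order.
found = sorted(matcher(doc), key=lambda m: m[1])
matched = [doc[start:end].text for _, start, end in found]
print(matched)
```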


# Part two: Phrase matching

In the previous part we saw how to perform rule-based matching. An alternative, and often more efficient, method is to match on a terminology list.

In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [4]:
from spacy.matcher import PhraseMatcher

In [5]:
matcher = PhraseMatcher(nlp.vocab)

In [16]:
with open('../TextFiles/reaganomics.txt') as fs:
    doc3 = nlp(fs.read())

In [17]:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

In [18]:
phrase_patterns = [nlp(text) for text in phrase_list]
print(type(phrase_patterns[0]))

<class 'spacy.tokens.doc.Doc'>


In [141]:
matcher.add('EconMatcher', phrase_patterns, on_match=None)

In [136]:
found_matches = matcher(doc3)

In [140]:
print(found_matches)


[(3680293220734633682, 41, 45), (3680293220734633682, 49, 53), (3680293220734633682, 54, 56), (3680293220734633682, 61, 65), (3680293220734633682, 673, 677), (3680293220734633682, 2986, 2990)]


In [138]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc3[start: end]
    #span = doc3[start - 5: end + 5]

    print(match_id, string_id, start, end, span.text)

3680293220734633682 EconMatcher 41 45 supply-side economics
3680293220734633682 EconMatcher 49 53 trickle-down economics
3680293220734633682 EconMatcher 54 56 voodoo economics
3680293220734633682 EconMatcher 61 65 free-market economics
3680293220734633682 EconMatcher 673 677 supply-side economics
3680293220734633682 EconMatcher 2986 2990 trickle-down economics
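
The commented-out line in the loop above hints at widening each span to show some context. Here is a self-contained sketch of that idea, assuming spaCy v3 and a blank English pipeline. The sentence is a made-up stand-in for `reaganomics.txt`, `WINDOW` is a hypothetical helper constant, and the slice is clamped so it never runs past either end of the document.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer only, no trained model
matcher = PhraseMatcher(nlp.vocab)

terms = ["voodoo economics", "supply-side economics"]
# nlp.make_doc only tokenizes, which is all the PhraseMatcher needs here.
matcher.add("EconMatcher", [nlp.make_doc(term) for term in terms])

# Hypothetical sentence standing in for the text file used above.
doc = nlp("Critics called it voodoo economics, while supporters "
          "preferred supply-side economics.")

WINDOW = 2  # hypothetical: number of context tokens on each side
results = []
for match_id, start, end in sorted(matcher(doc), key=lambda m: m[1]):
    # Clamp the window to the document boundaries.
    context = doc[max(start - WINDOW, 0):min(end + WINDOW, len(doc))]
    results.append((doc[start:end].text, context.text))
    print(doc[start:end].text, "->", context.text)
```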
