<center>    
    <h1 id='spacy-notebook-2' style='color:#7159c1; font-size:350%'>Data Structures</h1>
    <i style='font-size:125%'>Diving into Spacy's Architecture and Objects</i>
</center>

> **Topics**

```
- 📖 Vocab, StringStore and Lexeme
- 📁 Documents, Tokens and Spans (Part II)
- ✨ Part-of-Speech (POS), Morphemes and Sentence Segmentation
- 🪞 Word Vectors and Semantic Similarity
- 🎨 Combining Predictions and Rules
- 🔍 PhraseMatcher, Morphological Attributes Matcher and DependencyMatcher
```

<h1 id='0-vocab-stringstore-and-lexeme' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📖 | Vocab, StringStore and Lexeme</h1>

`Vocab`, `StringStore` and `Lexeme`: what are they? What is their relationship? Why are they so important to Spacy's architecture?

Before creating Documents with Spacy, our pipeline must know a considerable number of words from the target language so that it can identify the Part-of-Speech (POS), Dependency Label, Syntactic Head and so on of each Token.

All words are stored in the `StringStore` object. To save memory and avoid duplicate entries, each word is stored only once and is mapped to a hash value.

Each hash is in turn stored in the `Vocab` object, where each entry is called a `Lexeme`. Since the Vocab stores words independently, that is, without any surrounding text, Lexemes contain only `context-independent` info, such as whether a word is alphabetic, a digit or a punctuation character.

There's also the `Document` object, where each element is a `Token`. Since a Document stores sequential text with context, each Token also carries `context-dependent` info, such as Part-of-Speech (POS), Dependency Label and Syntactic Head.

In a nutshell, `StringStore` stores each word as a string only once and maps it to a hash. That hash identifies a `Lexeme` in `Vocab`, which holds all the `context-independent` info about the word.

Besides, every time we create a `Document` in Spacy, each word is looked up in Vocab. If it exists, Spacy reuses its hash and resolves the string through StringStore when needed. If it doesn't exist, the word is inserted right away into Vocab and StringStore, and then Spacy associates it with the Document.

So, each time we look up a word, Spacy finds its hash value in Vocab and its string representation in StringStore.

The image below illustrates the association between these objects:

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./images/1-spacy-architecture.png' alt='Spacy Architecture to Store Words' />
    <figcaption>Figure 1 - Spacy Architecture to Store Words By <a href='https://course.spacy.io/en/chapter2'>Spacy - Advanced NLP with Spacy Course - Chapter 2</a>.</figcaption>
</figure>
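
To make the relationship concrete, here is a minimal pure-Python sketch of the idea. It is not Spacy's actual implementation: the `toy_hash` function and the `ToyStringStore` and `ToyVocab` classes are hypothetical names made up for illustration, and the SHA-256 based hash is an assumption (Spacy uses its own 64-bit hash function).

```python
import hashlib


def toy_hash(string):
    """Hypothetical 64-bit hash; Spacy uses a different hash function."""
    digest = hashlib.sha256(string.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big')


class ToyStringStore:
    """Stores each string only once and maps hash -> string."""

    def __init__(self):
        self._strings = {}

    def add(self, string):
        key = toy_hash(string)
        self._strings[key] = string  # re-adding a word never duplicates it
        return key

    def __getitem__(self, key):
        # String in, hash out; hash in, string out
        return toy_hash(key) if isinstance(key, str) else self._strings[key]

    def __len__(self):
        return len(self._strings)


class ToyVocab:
    """Maps hash -> lexeme (a dict of context-independent attributes)."""

    def __init__(self):
        self.strings = ToyStringStore()
        self._lexemes = {}

    def __getitem__(self, word):
        key = self.strings.add(word)
        if key not in self._lexemes:
            self._lexemes[key] = {
                'orth': key,
                'is_alpha': word.isalpha(),
                'is_digit': word.isdigit(),
            }
        return self._lexemes[key]


vocab = ToyVocab()
for word in ['I', 'love', 'love', 'NLP']:
    lexeme = vocab[word]

love_hash = vocab.strings['love']
print(vocab.strings[love_hash])  # love
print(len(vocab.strings))        # 3 ('love' is stored only once)
```

The word `love` appears twice in the input but only once in the store, and the lexeme only knows facts that hold for the word in isolation.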

In [1]:
# Blank models are instantiated with an (almost) empty Vocab and
# StringStore, whereas pre-trained models are instantiated with both
# objects already populated.
#
# In both models, blank and pre-trained, new words are automatically
# inserted into Vocab and StringStore while creating Documents.
#
import spacy
nlp_blank = spacy.blank('en')
document = nlp_blank('I love Natural Language Processing!')

In [2]:
# Getting Hash and String of a word
love_hash = nlp_blank.vocab.strings['love']
love_string = nlp_blank.vocab.strings[love_hash]

print(f'- Love Hash: {love_hash}')
print(f'- Love String: {love_string}')

- Love Hash: 3702023516439754181
- Love String: love


In [3]:
# Adding a new Word into Vocab and StringStore
nlp_blank.vocab.strings.add('hate')

hate_hash = nlp_blank.vocab.strings['hate']
hate_string = nlp_blank.vocab.strings[hate_hash]

print(f'- Hate Hash: {hate_hash}')
print(f'- Hate String: {hate_string}')

- Hate Hash: 8706232279129489120
- Hate String: hate


In [4]:
# Accessing Directly via Document
document_love_hash = document.vocab.strings['love'] # 'document.vocab' is a reference to the Vocab (and StringStore) shared with the nlp object
document_love_string = document.vocab.strings[document_love_hash]

print(f'- Document Love Hash: {document_love_hash}')
print(f'- Document Love String: {document_love_string}')

- Document Love Hash: 3702023516439754181
- Document Love String: love


---

In [5]:
# Exploring Lexemes
document = nlp_blank('I love Natural Language Processing!')
lexeme = nlp_blank.vocab['love']

print(f'- Text: {lexeme.text}')
print(f'- Hash Value: {lexeme.orth}')
print(f'- Some Lexical Attributes:')
print(f'\t- Is Alphabetic? {lexeme.is_alpha}')
print(f'\t- Is Punctuation? {lexeme.is_punct}')
print(f'\t- Is Digit? {lexeme.is_digit}')
print(f'\t- Is Like a Number? {lexeme.like_num}')

- Text: love
- Hash Value: 3702023516439754181
- Some Lexical Attributes:
	- Is Alphabetic? True
	- Is Punctuation? False
	- Is Digit? False
	- Is Like a Number? False


<h1 id='1-documents-tokens-and-spans-part-ii' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📁 | Documents, Tokens and Spans (Part II)</h1>

Recapping what we already know about these three objects:

- **Document** - `sequence of Tokens, that is, a sequential text with context`;
- **Token** - `a word or punctuation mark; each element of a Document`;
- **Span** - `a slice of the Document, consisting of one or more consecutive Tokens`.

We have seen that Documents are created automatically when processing text with the `nlp` object, as below:

```python
document = nlp('Hey it is me, Goku!')
```

However, we can also create them manually by passing three parameters:

- **Vocab** - `Vocab object of the target language`;
- **Words** - `list with the sequential text, where each element becomes a Token`;
- **Spaces** - `list of booleans telling whether the word at the same index in the 'words' parameter is followed by a space`.
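
The way `words` and `spaces` combine into the final text can be sketched in plain Python. The `join_words` helper below is a hypothetical name used only for illustration, not part of Spacy; it simply appends a space after each word whose flag is `True`.

```python
def join_words(words, spaces):
    """Rebuild the text: a word is followed by a space only if its flag is True."""
    if len(words) != len(spaces):
        raise ValueError('words and spaces must have the same length')
    return ''.join(word + (' ' if space else '') for word, space in zip(words, spaces))


example_words = ['I', 'love', 'NLP', '!']
example_spaces = [True, True, False, False]
example_text = join_words(example_words, example_spaces)
print(example_text)  # I love NLP!
```

A quick check like this helps to spot a wrong flag before building the Document, since a misplaced `True` silently adds an unwanted space.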

In [6]:
# Manually Creating a Document
from spacy.tokens import Doc, Span

nlp_large = spacy.load('en_core_web_lg')

words = ['Hey', 'it', '\'s', 'me', ',', 'Goku', '!']
spaces = [True, False, True, False, True, False, False] # 'it' is not followed by a space
document = Doc(nlp_large.vocab, words=words, spaces=spaces)

print(f'- Document Text: {document}')

- Document Text: Hey it's me, Goku!


In [7]:
# Manually Creating a Span
span_without_label = Span(document, 5, 6)
span_with_label = Span(document, 5, 6, label='PERSON')

print(f'- Span Without Label: {span_without_label} - (label: {span_without_label.label_})')
print(f'- Span With Label: {span_with_label} - (label: {span_with_label.label_})')

- Span Without Label: Goku - (label: )
- Span With Label: Goku - (label: PERSON)


In [8]:
# Updating Document Entities
print('- Document Named Entities (NER) - Before Manual Update:')

for entity in document.ents: print(entity.text, entity.label_, spacy.explain(entity.label_))

###

document.ents = [span_with_label]

print('---\n- Document Named Entities (NER) - After Manual Update:')

for entity in document.ents: print(entity.text, entity.label_, spacy.explain(entity.label_))

- Document Named Entities (NER) - Before Manual Update:
---
- Document Named Entities (NER) - After Manual Update:
Goku PERSON People, including fictional


<h1 id='2-part-of-speech-pos-morphemes-and-sentence-segmentation' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Part-of-Speech (POS), Morphemes and Sentence Segmentation</h1>

When a Document is created, the text is first split by the tokenizer and then processed by the pipeline components in order. The most common ones are:

- **tokenizer** - `splits the text into Tokens`;

- **tok2vec** - `computes a vector representation for each Token that downstream components, such as tagger and parser, can share. These are not the static WordVectors exposed via '.vector', which come from the vectors table shipped with the medium (md) and large (lg) Pipelines`;

- **tagger** - `assigns the Tag and Part-of-Speech (POS) to each Token, that is, its grammatical role`;

- **parser** - `assigns the syntactic relationships between the Tokens in the text, such as Dependency Label and Syntactic Head`;

- **lemmatizer** - `assigns the Lemma (dictionary/base form) to each Token`;

- **attribute_ruler** - `assigns information to Tokens following specific rules and logic given by us. This component is normally used when Spacy cannot handle a certain word, phrase or target language well`;

- **ner (Named Entity Recognition)** - `identifies and assigns Named Entities and their Labels`;

- **textcat** - `assigns categories to Documents. It is not part of the pre-trained Pipelines and has to be trained on labeled examples. This component is normally used in Text Classification projects. For instance, the review 'Steins;Gate is an amazing show' should be classified as 'positive'`;

- **merge_noun_chunks** - `optional component that merges multiple Tokens that represent a single noun phrase. For instance, instead of 'Son Goku' being considered as two Tokens, one for each word, it's considered as a single Token`;

- **merge_entities** - `optional component that merges multiple Tokens that represent an entity. For example, instead of 'the Kame House' being considered as three Tokens, one for each word, it's considered as a single Token`.
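
The "in order" part matters because later components may rely on what earlier ones wrote. The toy sketch below shows only that idea in plain Python; every function name in it is hypothetical and the "document" is just a dictionary, so it should not be read as Spacy's implementation.

```python
def toy_tokenizer(text):
    """Creates the shared 'document' (a plain dict here) from raw text."""
    return {'text': text, 'tokens': text.split()}


def toy_lowercaser(document):
    document['lower'] = [token.lower() for token in document['tokens']]
    return document


def toy_digit_flagger(document):
    # Depends on 'lower', so it must run after toy_lowercaser
    document['is_digit'] = [token.isdigit() for token in document['lower']]
    return document


def run_pipeline(text, components):
    """Tokenizes first, then applies each (name, component) pair in order."""
    document = toy_tokenizer(text)
    for name, component in components:
        document = component(document)
    return document


toy_components = [('lowercaser', toy_lowercaser), ('digit_flagger', toy_digit_flagger)]
toy_document = run_pipeline('Goku is 42', toy_components)
print(toy_document['lower'])     # ['goku', 'is', '42']
print(toy_document['is_digit'])  # [False, False, True]
```

Swapping the two components would raise a `KeyError`, which mirrors why the order of the components in a pipeline cannot be arbitrary.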

About the `tagger` component, Spacy provides two kinds of `Part-of-Speech (POS)` Tags: the `coarse` one, accessible via the `.pos_` attribute, and the `fine-grained` one, accessible via the `.tag_` attribute, being:

- **coarse (.pos_)** - `based on Universal Dependencies Tag Set (https://universaldependencies.org/u/pos/all.html)`;

- **fine-grained (.tag_)** - `based on OntoNotes 5.0, the dataset used to train the Language Model in Spacy (https://catalog.ldc.upenn.edu/LDC2013T19)`.

By the way, the `fine-grained (.tag_)` Part-of-Speech (POS) carries more information about the Token. For example, when a Token is a verb, the tag can encode person and number (such as 3rd person singular) and tense (such as past or present), whereas the `coarse (.pos_)` one only tells that the Token is a verb, so we must access the morphological attributes (`.morph`) in order to get further details.

In [9]:
# Coarse and Fine-Grained Part-of-Speech (POS)
document = nlp_large('Hey it\'s me, Goku!')

for token in document:
    print(f'- Text: {token.text}')
    print(f'- Coarse POS: {token.pos_} - {spacy.explain(token.pos_)}')
    print(f'- Fine-Grained POS: {token.tag_} - {spacy.explain(token.tag_)}')
    print('---')

- Text: Hey
- Coarse POS: INTJ - interjection
- Fine-Grained POS: UH - interjection
---
- Text: it
- Coarse POS: PRON - pronoun
- Fine-Grained POS: PRP - pronoun, personal
---
- Text: 's
- Coarse POS: AUX - auxiliary
- Fine-Grained POS: VBZ - verb, 3rd person singular present
---
- Text: me
- Coarse POS: PRON - pronoun
- Fine-Grained POS: PRP - pronoun, personal
---
- Text: ,
- Coarse POS: PUNCT - punctuation
- Fine-Grained POS: , - punctuation mark, comma
---
- Text: Goku
- Coarse POS: PROPN - proper noun
- Fine-Grained POS: NNP - noun, proper singular
---
- Text: !
- Coarse POS: PUNCT - punctuation
- Fine-Grained POS: . - punctuation mark, sentence closer
---


---

`Morphemes` are the smallest meaningful pieces of words. In Spacy, the morphological attributes they express (number, person, tense and so on) can be accessed through the `.morph` attribute. Besides, not all Tokens have morphological attributes and, when they do, the available ones may differ from Token to Token.

In order to get all available `Morphological Attributes`, we can convert `.morph` into a dictionary and then read each value with the dictionary's `get` method.
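
The string form of these attributes is a list of `Key=Value` pairs separated by `|`. As a rough illustration of what the dictionary conversion gives us, the sketch below parses that format in plain Python. The `parse_morph` helper is hypothetical and only approximates the behaviour; in real code, prefer `token.morph.to_dict()`.

```python
def parse_morph(morph_string):
    """Turns 'Case=Nom|Number=Sing' into {'Case': 'Nom', 'Number': 'Sing'}."""
    if not morph_string:
        return {}  # Tokens without morphological attributes
    return dict(feature.split('=', 1) for feature in morph_string.split('|'))


features = parse_morph('Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs')
print(features.get('Number', ''))  # Sing
print(features.get('Tense', ''))   # empty string, since the attribute is absent
```

Passing a default to `get` avoids a `KeyError` for Tokens that lack a given attribute.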

In [10]:
# Morphemes
for token in document:
    morpheme_dictionary = token.morph.to_dict()
    morpheme_number = morpheme_dictionary.get('Number', '')
    morpheme_person = morpheme_dictionary.get('Person', '')

    print(f'- Text: {token.text}')
    print(f'- Morpheme Attributes: {token.morph}')
    print(f'- Morpheme Attributes Dictionary: {morpheme_dictionary}')
    print(f'- Morpheme - Number: {morpheme_number}') # token.morph.get('Number') also works, but returns a list
    print(f'- Morpheme - Person: {morpheme_person}') # token.morph.get('Person') also works, but returns a list
    print('---')

- Text: Hey
- Morpheme Attributes: 
- Morpheme Attributes Dictionary: {}
- Morpheme - Number: 
- Morpheme - Person: 
---
- Text: it
- Morpheme Attributes: Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs
- Morpheme Attributes Dictionary: {'Case': 'Nom', 'Gender': 'Neut', 'Number': 'Sing', 'Person': '3', 'PronType': 'Prs'}
- Morpheme - Number: Sing
- Morpheme - Person: 3
---
- Text: 's
- Morpheme Attributes: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
- Morpheme Attributes Dictionary: {'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Pres', 'VerbForm': 'Fin'}
- Morpheme - Number: Sing
- Morpheme - Person: 3
---
- Text: me
- Morpheme Attributes: Case=Acc|Number=Sing|Person=1|PronType=Prs
- Morpheme Attributes Dictionary: {'Case': 'Acc', 'Number': 'Sing', 'Person': '1', 'PronType': 'Prs'}
- Morpheme - Number: Sing
- Morpheme - Person: 1
---
- Text: ,
- Morpheme Attributes: PunctType=Comm
- Morpheme Attributes Dictionary: {'PunctType': 'Comm'}
- Morpheme - Number: 

---

`Sentence Segmentation` is very useful when dealing with large Documents. It splits the Document into sentences, which we can access through the `.sents` attribute.

In [11]:
# Sentence Segmentation
document = nlp_large('Hey it\'s me, Goku! You look strong, let\'s fight! Oh, but let\'s head to the Kame House first.')
sentences = list(document.sents)
for index, sentence in enumerate(sentences): print(f'- Sentence {index}: {sentence}')

- Sentence 0: Hey it's me, Goku!
- Sentence 1: You look strong, let's fight!
- Sentence 2: Oh, but let's head to the Kame House first.


<h1 id='3-word-vectors-and-semantic-similarity' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🪞 | Word Vectors and Semantic Similarity</h1>

`Semantic Similarity` is a technique to check how similar Documents, Spans and Tokens are to each other given their content rather than their context. In practice, the similarity usually goes from 0 (`non-similar`) to 1 (`totally similar`) and it's calculated via `Cosine Similarity` by default in Spacy.
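
Cosine Similarity is the dot product of two vectors divided by the product of their lengths. The plain-Python sketch below shows the formula on tiny made-up vectors; the `cosine_similarity` helper is written here for illustration and returning `0.0` for a zero vector is an assumption of this sketch. Mathematically the value lies between -1 and 1, although word vectors of related words mostly give positive values.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a zero vector is treated as non-similar
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0.0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # -1.0

print(same_direction, orthogonal, opposite)
```

Only the direction of the vectors matters, not their length, which is why `[1, 2]` and `[2, 4]` count as (almost exactly) identical.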

In order to calculate the similarity, the NLP object must have `WordVectors`, which are produced beforehand by algorithms such as `Word2Vec` and shipped with the Pipeline.

However, only medium (md) and large (lg) pre-trained Pipelines ship with WordVectors that are automatically available for Documents, Spans and Tokens. So, when working with small (sm), accuracy (trf) or blank models, we have to provide the WordVectors ourselves:

- **❌ blank** - `doesn't contain WordVectors`;
- **❌ en_core_web_sm** - `doesn't contain WordVectors`;
- **❌ en_core_web_trf** - `doesn't contain WordVectors`;
- **✔️ en_core_web_md** - `contains WordVectors`;
- **✔️ en_core_web_lg** - `contains WordVectors`.

Besides, Semantic Similarity tends to work better on short phrases than on long Documents and Spans with many irrelevant words (such as Stop Words).

In [13]:
# Accessing WordVector object of a Document via '.vector' list
document0 = nlp_large('Hey it\'s me, Goku!')
print(f'- WordVectors Size: {len(document0.vector)}')
print(f'- WordVectors: {document0.vector}')

- WordVectors Size: 300
- WordVectors: [-1.79141298e-01  2.59370029e-01 -6.19667135e-02 -9.03841406e-02
  4.12608907e-02  4.53091338e-02  1.05849013e-01 -2.85014302e-01
 -1.73039287e-01  1.65766442e+00 -1.70944706e-01 -1.68457717e-01
  3.90684120e-02  1.74515545e-02 -2.41932869e-01 -1.07065730e-01
  1.13352863e-02  8.43292892e-01 -1.54243857e-01 -1.04621135e-01
  1.19625010e-01 -4.73900791e-03  3.30229998e-02 -1.82213709e-01
 -2.81651430e-02  2.05937158e-02  2.95375735e-02 -5.45188524e-02
  2.76406139e-01 -2.49356136e-01  5.31405434e-02  2.57778853e-01
 -3.96807445e-03  5.42552806e-02  4.83615696e-02 -5.21360002e-02
 -6.91885203e-02  1.26119420e-01 -1.37276560e-01 -8.89018178e-02
 -1.22366585e-01 -1.40172720e-01  1.59969963e-02 -1.77072324e-02
  8.81782845e-02  1.93355709e-01 -2.18852863e-01 -1.02922000e-01
  1.13547578e-01 -2.76979953e-02  5.00818565e-02  1.70085698e-01
  4.49685715e-02  2.94317063e-02  2.10785540e-03  1.87033281e-01
 -4.05484326e-02  3.63524295e-02  6.01218566e-02 -1

In [14]:
# Accessing WordVector object of a Token via '.vector' list
token0 = document0[5]
print(f'- WordVectors Size: {len(token0.vector)}')
print(f'- WordVectors: {token0.vector}')

- WordVectors Size: 300
- WordVectors: [-4.0001e-01 -6.4200e-01  5.4013e-01 -4.6932e-01  2.3678e-01 -2.1087e-01
  6.1721e-01  1.7844e-01 -3.0992e-02 -8.0075e-01 -2.2622e-02 -5.7395e-01
 -2.0335e-01  5.6272e-01 -2.1141e-01 -1.8668e-01  4.7549e-01  2.4373e-01
 -3.3346e-01 -2.6711e-01  4.7203e-01 -2.5117e-01  2.1239e-01 -9.1873e-01
 -1.7530e-01  3.1297e-01 -3.4612e-02 -3.0685e-01 -6.7977e-02 -3.3142e-01
  8.2209e-02  6.5252e-02  5.3315e-01 -4.8743e-02  1.8212e-01 -2.1009e-01
 -9.8826e-01 -1.4896e-01 -5.1341e-01 -2.7604e-02 -4.3536e-01 -9.4728e-01
  2.3615e-01  4.6428e-01 -2.3510e-01  1.9391e-01 -1.4218e-01 -4.6251e-01
 -1.9442e-01 -4.2031e-01  5.2424e-02 -3.2733e-01  5.8309e-01 -5.6361e-01
 -2.2144e-01  1.0501e+00 -1.4222e-02  3.3922e-01  1.4471e-01 -4.2506e-01
  3.8405e-02  5.8688e-01 -1.6905e-01 -4.0435e-01 -2.8506e-01 -7.6365e-02
  1.0794e-01  4.7223e-01  4.4349e-02 -4.1628e-01 -4.7326e-01 -4.2144e-02
  1.1894e-01  1.2493e-02  3.9658e-02  7.8624e-01  4.5608e-01 -2.4868e-01
  6.6457e-01

---

In [15]:
# Similarity between Documents
document1 = nlp_large('I love pizza')
document2 = nlp_large('I love pasta')
print(f'- Similarity between Documents: {document1.similarity(document2)}')

- Similarity between Documents: 0.9358318464113806


In [16]:
# Similarity between Tokens
document3 = nlp_large('I love pizza and pasta')
token1 = document3[2]
token2 = document3[4]
print(f'- Similarity between Tokens: {token1.similarity(token2)}')

- Similarity between Tokens: 0.7369545698165894


In [17]:
# Similarity between Document and Token
document4 = nlp_large('I love pizza')
token = nlp_large('cheese')[0]
print(f'- Similarity between Document and Token: {document4.similarity(token)}')

- Similarity between Document and Token: 0.5415431108130979


In [18]:
# Similarity between Span and Document
document5 = nlp_large('McDonalds sells burger')
span = nlp_large('I like pizza and pasta')[2:5]
print(f'- Similarity between Span and Document: {span.similarity(document5)}')
print(f'- Similarity between Document and Span: {document5.similarity(span)}')

- Similarity between Span and Document: 0.5886225771237401
- Similarity between Document and Span: 0.5886225771237401


---

Besides, similarity doesn't recognize `sentiments`. For instance, the two phrases `I love cats` and `I hate cats` have high similarity even though their meanings are totally opposite.

This happens because their contents are very similar: both contain the word `I`, followed by a `VERB`, and then the word `cats`.
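
One way to see why: by default, a Document vector in Spacy is the average of its Token vectors, so the shared words pull both Documents together. The sketch below uses made-up 4-dimensional vectors (an assumption for illustration, real WordVectors have hundreds of dimensions) in which `love` and `hate` are deliberately unrelated, and the Documents still end up fairly similar.

```python
import math


def cosine(vector_a, vector_b):
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


# Hypothetical toy vectors: every word points in its own direction
toy_vectors = {
    'I':    [1.0, 0.0, 0.0, 0.0],
    'cats': [0.0, 1.0, 0.0, 0.0],
    'love': [0.0, 0.0, 1.0, 0.0],
    'hate': [0.0, 0.0, 0.0, 1.0],
}

doc_love = mean_vector([toy_vectors[word] for word in ['I', 'love', 'cats']])
doc_hate = mean_vector([toy_vectors[word] for word in ['I', 'hate', 'cats']])

word_similarity = cosine(toy_vectors['love'], toy_vectors['hate'])  # 0.0
document_similarity = cosine(doc_love, doc_hate)                    # about 0.67

print(word_similarity, document_similarity)
```

With real vectors the effect is even stronger, because `love` and `hate` occur in similar contexts and therefore tend to have similar vectors themselves.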

In [19]:
# Semantic Similarity doesn't consider Sentiments but only Contents
document6 = nlp_large('I love cats')
document7 = nlp_large('I hate cats')
print(f'- Similarity between Documents: {document6.similarity(document7)}')

- Similarity between Documents: 0.9409261755229907


<h1 id='4-combining-predictions-and-rules' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🎨 | Combining Predictions and Rules</h1>

Combining Predictions from `Statistical Models` with Rules from `Rule-Based Systems` is a powerful technique to improve searches and Document processing.

It is, literally, the combination of `context-dependent` and `context-independent` attributes (predictions) with exact texts (rules) during searches and matches over the Documents. So:

- **Statistical Models** - `search for generalized info, such as Named Entities (NER), Part-of-Speech (POS), Dependency Label and Syntactic Head of Tokens`;
- **Rule-Based Systems** - `search for specific, finite info, such as Specific Named Entities (countries of the world, soccer player names and dog breeds). We can achieve it by using Tokenizer, Matcher and PhraseMatcher objects from Spacy`.

In [20]:
# Combining Predictions from Statistical Models
# and Rules from Rule-Based Systems
from spacy.matcher import Matcher

document = nlp_large('I have a Golden Retriever')

pattern = [{ 'LOWER': 'golden' }, { 'LOWER': 'retriever' }]

matcher = Matcher(nlp_large.vocab)
matcher.add('DOG_BREED', [pattern])
matches = matcher(document)

for match_id, start_index, end_index in matches:
    matched_span = document[start_index:end_index]
    print(f'- Text: {matched_span.text}')
    print(f'- Root: {matched_span.root.text}') # Token that decides the Category of the Span
    print(f'- Part-of-Speech (POS): {matched_span.root.pos_} ({spacy.explain(matched_span.root.pos_)})')
    print(f'- Dependency Label: {matched_span.root.dep_} ({spacy.explain(matched_span.root.dep_)})')
    print(f'- Syntactic Head: {matched_span.root.head.text}')
    print(f'- Previous Token: {document[start_index-1].text}')

- Text: Golden Retriever
- Root: Retriever
- Part-of-Speech (POS): PROPN (proper noun)
- Dependency Label: dobj (direct object)
- Syntactic Head: have
- Previous Token: a


<h1 id='5-phrasematcher-morphological-attributes-matcher-and-dependencymatcher' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🔍 | PhraseMatcher, Morphological Attributes Matcher and Dependency Matcher</h1>

Unlike `Matcher`, `PhraseMatcher` is more efficient and faster when we want to search for a specific list of strings. Instead of receiving dictionaries with search rules, it receives only the strings that we want to search and match, converted into Documents. Oh, and always remember to convert this list with `nlp.pipe` in order to gain efficiency and save memory.

`Morphological Attributes Matcher` is simply a Matcher whose patterns filter on morphological attributes rather than Part-of-Speech, Lemmas or other info.

As for `DependencyMatcher`, it applies filters based on Syntactic Patterns, that is, Dependency Label and Syntactic Head filters.

Then:

- **Matcher and Morphological Attributes Matcher** - `useful when searching for Tokens or Spans with simple filters, such as Morphological and Part-of-Speech (POS) attributes`;

- **PhraseMatcher** - `useful when searching for specific Spans or Named Entities from a known list of terms`;

- **DependencyMatcher** - `useful when searching for Tokens and Spans taking the Dependency Label relationship into consideration`.
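
To make the PhraseMatcher idea concrete, the sketch below finds an exact sequence of Tokens inside a list of Tokens and returns `(start, end)` indices, the same shape of result the Matchers give us. It is a naive scan written for illustration only (the `find_phrase` helper is hypothetical); Spacy's PhraseMatcher uses a far more efficient data structure.

```python
def find_phrase(tokens, phrase_tokens):
    """Returns (start, end) index pairs where phrase_tokens occurs in tokens."""
    matches = []
    size = len(phrase_tokens)
    if size == 0:
        return matches
    for start in range(len(tokens) - size + 1):
        if tokens[start:start + size] == phrase_tokens:
            matches.append((start, start + size))  # 'end' is exclusive, like a slice
    return matches


sentence_tokens = ['I', 'have', 'a', 'Golden', 'Retriever']
phrase_matches = find_phrase(sentence_tokens, ['Golden', 'Retriever'])

for start_index, end_index in phrase_matches:
    print(start_index, end_index, sentence_tokens[start_index:end_index])
# 3 5 ['Golden', 'Retriever']
```

Because the end index is exclusive, `document[start_index:end_index]` gives exactly the matched Span.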

In [36]:
# Morphological Attributes - Not as Spans
from spacy.matcher import Matcher

document = nlp_large('Hey it\'s me, Goku!')
pattern_1 = [{ 'MORPH': 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin' }] # instead of matching the whole morphological string exactly,
pattern_2 = [{ 'MORPH': { 'IS_SUPERSET': ['Mood=Ind', 'Person=3'] } }] # we can take advantage of the 'IS_SUPERSET' operator!!

morphological_matcher = Matcher(nlp_large.vocab)
morphological_matcher.add('indMood+3Person', [pattern_2])
matches = morphological_matcher(document)

for match_id, start_index, end_index in matches:
    pattern_name = nlp_large.vocab[match_id].text
    matched_span = document[start_index:end_index]

    print(f'- Fetched Pattern: {pattern_name}')
    print(f'- Matched Span: {matched_span.text}')
    print('---')

- Fetched Pattern: indMood+3Person
- Matched Span: 's
---
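
The `IS_SUPERSET` check can be pictured with plain Python sets: a Token matches when its set of morphological attributes contains every attribute listed in the pattern. The helper name below is hypothetical and the sketch only mirrors the logic, it is not how Spacy implements it.

```python
def morph_is_superset(token_morph, required_features):
    """True when the Token's attributes include all the required ones."""
    token_features = set(token_morph.split('|')) if token_morph else set()
    return token_features >= set(required_features)


required = ['Mood=Ind', 'Person=3']

verb_matches = morph_is_superset('Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin', required)
pronoun_matches = morph_is_superset('Case=Acc|Number=Sing|Person=1|PronType=Prs', required)

print(verb_matches)     # True
print(pronoun_matches)  # False
```

This is why only `'s` is matched in the sentence: it is the only Token carrying both `Mood=Ind` and `Person=3`.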


In [37]:
# Morphological Attributes - as Spans
matches = morphological_matcher(document, as_spans=True)

for matched_span in matches:
    print(f'- Matched Span: {matched_span}')
    print('---')

- Matched Span: 's
---


---

In [38]:
# PhraseMatcher
from spacy.matcher import PhraseMatcher

document = nlp_large('I have a Golden Retriever')

pattern = ['Golden Retriever']

phraseMatcher = PhraseMatcher(nlp_large.vocab)
phraseMatcher.add('dog-breed', nlp_large.pipe(pattern))
matches = phraseMatcher(document)

for match_id, start_index, end_index in matches:
    pattern_name = nlp_large.vocab[match_id].text
    matched_span = document[start_index:end_index]
    
    print(f'- Pattern Name: {pattern_name}')
    print(f'- Matched Span: {matched_span.text}')
    print('---')

- Pattern Name: dog-breed
- Matched Span: Golden Retriever
---


---

In [44]:
# Dependency Matcher
from spacy.matcher import DependencyMatcher

document = nlp_large('On 17 September 2012, protestors returned to Zuccotti Park to mark the one-year anniversary of the beginning of the occupation.')

pattern = [
    { 'RIGHT_ID': 'verb', 'RIGHT_ATTRS': { 'POS': 'VERB' } } # anchor token: any verb in the document
    , { 'LEFT_ID': 'verb', 'REL_OP': '>', 'RIGHT_ID': 'subject', 'RIGHT_ATTRS': { 'DEP': 'nsubj' } } # nominal subject whose direct head is the anchor verb
    #   , { 'LEFT_ID': 'verb', 'REL_OP': '>', 'RIGHT_ID': 'd_object', 'RIGHT_ATTRS': { 'DEP': 'dobj' } } # direct object directly related to the head
]

dependencyMatcher = DependencyMatcher(nlp_large.vocab)
dependencyMatcher.add('verb-nsubj+verb-dobj', [pattern])
matches = dependencyMatcher(document)

for match_id, matched_token_ids in matches:
    pattern_name = nlp_large.vocab[match_id].text
    matched_token_1 = document[matched_token_ids[0]]
    matched_token_2 = document[matched_token_ids[1]]

    print(f'- Pattern Name: {pattern_name}')
    print(f'- Head Token: {matched_token_1.text}')
    print(f'- Nominal Subject Token: {matched_token_2.text}')
    print('---')

- Pattern Name: verb-nsubj+verb-dobj
- Head Token: returned
- Nominal Subject Token: protestors
---
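
The `>` operator means "the left Token is the direct head of the right Token". The sketch below applies the same logic to a hand-written parse in plain Python. The parse (POS tags, Dependency Labels and head indices) is typed by hand for illustration and is an assumption, not output from a model, and the `match_verb_subject` helper is hypothetical.

```python
# Hand-written parse of 'protestors returned to Zuccotti Park';
# 'head' holds the index of each Token's Syntactic Head
toy_parse = [
    {'text': 'protestors', 'pos': 'NOUN',  'dep': 'nsubj',    'head': 1},
    {'text': 'returned',   'pos': 'VERB',  'dep': 'ROOT',     'head': 1},
    {'text': 'to',         'pos': 'ADP',   'dep': 'prep',     'head': 1},
    {'text': 'Zuccotti',   'pos': 'PROPN', 'dep': 'compound', 'head': 4},
    {'text': 'Park',       'pos': 'PROPN', 'dep': 'pobj',     'head': 2},
]


def match_verb_subject(tokens):
    """Finds (verb_index, subject_index) pairs where the verb is the direct head."""
    matches = []
    for verb_index, verb in enumerate(tokens):
        if verb['pos'] != 'VERB':
            continue
        for child_index, child in enumerate(tokens):
            is_direct_child = child['head'] == verb_index and child_index != verb_index
            if is_direct_child and child['dep'] == 'nsubj':
                matches.append((verb_index, child_index))
    return matches


dependency_matches = match_verb_subject(toy_parse)
for verb_index, subject_index in dependency_matches:
    print(toy_parse[verb_index]['text'], '>', toy_parse[subject_index]['text'])
# returned > protestors
```

As with the real DependencyMatcher, each match lists the Token indices in the same order as the pattern: the anchor verb first and its subject second.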


---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub:** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).