<center>    
    <h1 id='spacy-notebook-1' style='color:#7159c1; font-size:350%'>Finding Words, Phrases, Names and Concepts</h1>
    <i style='font-size:125%'>Introduction to Spacy Package</i>
</center>

> **Topics**

```
- ⚙️ NLP Processes
- 📁 Spacy Documents, Tokens and Spans
- 📝 Lexical Attributes
- 🪈 Pipelines
- 🏷️ Named Entities (NER)
- 🔍 Matches
```

<h1 id='0-nlp-processes' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>⚙️ | NLP Processes</h1>

`Natural Language Processing (NLP)` is a field of Artificial Intelligence and Machine Learning focused on processing sequential texts. It covers a variety of tasks, ranging from `Phonemes` and `Morphemes & Lexemes` to `Syntax` and `Context`. Application-wise, we can build `Speech to Text`, `Document Summarization`, `Part-of-Speech (POS) Tagging` projects and much more!!

The image below shows some of the applications related to each building block of language.

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./images/0-nlp-processes.png' alt='Diagram of possible NLP Applications in each Block of Language' />
    <figcaption>Figure 1 - Diagram of possible NLP Applications in each Block of Language. By <a href='https://www.oreilly.com/library/view/practical-natural-language/9781492054047/ch01.html'>O'Reilly - Practical Natural Language Processing - Chapter 1</a>.</figcaption>
</figure>

<h1 id='1-spacy-documents-tokens-and-spans' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📁 | Spacy Documents, Tokens and Spans</h1>

`Spacy` is a great Python package to work with NLP and in this section we are going to dive into its basic objects: Documents, Tokens and Spans! First off, in order to install the package, we should run the following command:

```bash
pip install -U spacy
```

In Spacy, we always create a blank Language Model (LM) or load a pre-trained one. For now, we'll be creating a blank one.

In [1]:
# Importings
import spacy

# Creating a blank English NLP Object AKA Processing Pipeline
nlp = spacy.blank('en')

After creating a Processing Pipeline, we can process sequential texts to create a Document with Tokens. By default, Spacy tokenizes at the word level, that is, each word or punctuation mark is considered a Token.

Besides, a slice of a Document containing one or more tokens is called a Span, so:

- **Document** - `sequence of Tokens`;
- **Token** - `a word or punctuation mark`;
- **Span** - `a slice of the Document with one or more tokens`.

There are some observations about Tokenization to keep in mind:

    1. a single space between words is not a token, but the extra whitespace in a run of multiple spaces becomes a whitespace token;
    
    2. words with a hyphen are split into multiple tokens, for instance, the word 'ad-free' is split into three tokens: ['ad', '-', 'free'].
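
These two observations can be checked with a quick sketch. It assumes the same kind of blank English pipeline used in this notebook, and the sample texts ('Hello world' and friends) are just illustrative choices, not part of the original lesson.

```python
import spacy

# Blank English pipeline: only the tokenizer, no trained components
nlp = spacy.blank('en')

# A single space separates tokens and is not a token itself
single_space_tokens = [token.text for token in nlp('Hello world')]

# Extra whitespace between words is kept as its own whitespace token
double_space_tokens = [token.text for token in nlp('Hello  world')]

# A hyphenated word is split around the hyphen
hyphen_tokens = [token.text for token in nlp('ad-free')]

print(f'- Single Space: {single_space_tokens}')
print(f'- Double Space: {double_space_tokens}')
print(f'- Hyphen: {hyphen_tokens}')
```

Printing the lists rather than the raw text makes the whitespace token visible, since it would otherwise be easy to miss.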

In [2]:
# Document
document = nlp('Hey it\'s me, Goku!')

print(f'- Full Document Text: {document.text}')
print('---')

print('- Text of Each Token:')
for token in document: print(token.text)

- Full Document Text: Hey it's me, Goku!
---
- Text of Each Token:
Hey
it
's
me
,
Goku
!


In [3]:
# Token
token = document[5]
print(f'- Single Token Text: {token.text}')

- Single Token Text: Goku


In [4]:
# Span
span = document[0:4]
print(f'- Single Span Token Text: {span.text}')

- Single Span Token Text: Hey it's me


<h1 id='2-lexical-attributes' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📝 | Lexical Attributes</h1>

`Lexical Attributes` are info about the words of a language that are `context-independent`, that is, they are influenced neither by the role of the word in a sentence nor by its context.

There are a lot of Lexical Attributes, so let's just keep in mind the main ones for now:

- **i** - `token's index in the document`;
- **text** - `token's text in string type`;
- **is_alpha** - `whether a token contains only alphabetic characters (equivalent to token.text.isalpha())`;
- **is_punct** - `whether a token is punctuation`;
- **is_digit** - `whether a token contains only digits (0-9)`;
- **like_num** - `whether a token resembles a number (5 or 'five')`.

In [5]:
# Lexical Attributes
document = nlp('It costs $5.00. Yeah, five dollars!')

print(f'- Index: {[token.i for token in document]}')
print(f'- Text: {[token.text for token in document]}')
print(f'- Is Alphabetic? {[token.is_alpha for token in document]}')
print(f'- Is Punctuation? {[token.is_punct for token in document]}')
print(f'- Is Digit? {[token.is_digit for token in document]}')
print(f'- Is Like a Number? {[token.like_num for token in document]}')

- Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
- Text: ['It', 'costs', '$', '5.00', '.', 'Yeah', ',', 'five', 'dollars', '!']
- Is Alphabetic? [True, True, False, False, False, True, False, True, True, False]
- Is Punctuation? [False, False, False, False, True, False, True, False, False, True]
- Is Digit? [False, False, False, False, False, False, False, False, False, False]
- Is Like a Number? [False, False, False, True, False, False, False, True, False, False]
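
Lexical Attributes get more useful when we combine them. As a small sketch (my own illustration, not an official recipe), we can use `i` and `like_num` together to collect the numbers that come right after a '$' token. The `prices` list is a name made up for this example.

```python
import spacy

nlp = spacy.blank('en')
document = nlp('It costs $5.00. Yeah, five dollars!')

prices = []
for token in document:
    # Keep number-like tokens whose previous token is the dollar sign
    if token.like_num and token.i > 0 and document[token.i - 1].text == '$':
        prices.append(token.text)

print(f'- Prices: {prices}')
```

Notice that 'five' resembles a number too, but it is skipped because the token before it is not '$'.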


<h1 id='3-pipelines' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🪈 | Pipelines</h1>

Lexical Attributes alone are not enough in projects where we need information such as `Part-of-Speech (POS)` tags, `Named Entities` and `Contexts`, that is, `context-dependent` variables.

In these scenarios, we need a trained model that has learnt the patterns of the target language from a corpus, and in order to get one, we can either train a model from scratch or load a pre-trained one.

Since training a model from scratch is not the goal of this notebook and I still don't have enough knowledge for it (😂), we will be working with English pre-trained models from Spacy.

---

Spacy offers four pre-trained pipelines for English:

- **en_core_web_sm** - `small pipeline that doesn't contain WordVectors`;
- **en_core_web_md** - `medium pipeline that contains WordVectors`;
- **en_core_web_lg** - `large pipeline that contains WordVectors`;
- **en_core_web_trf** - `pipeline based on Transformers rather than WordVectors and more focused on accuracy`.

While the `sm, md and lg` Pipelines are focused on `Efficiency` and work faster, the `trf` Pipeline is focused on `Accuracy` and demands more time and computational resources to process.

Since these models are not automatically installed with Spacy, we must install them manually by running the following commands:

```bash
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg
python -m spacy download en_core_web_trf
```
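
Since a pipeline may or may not be installed on a given machine, it can be handy to check before loading it. The sketch below is just one possible approach: `load_pipeline_or_blank` is a hypothetical helper written for this example, and falling back to a blank pipeline is my own assumption about what a reasonable default would be.

```python
import spacy
from spacy.util import is_package


def load_pipeline_or_blank(name='en_core_web_sm'):
    """Load a pre-trained pipeline if it is installed, otherwise a blank English one."""
    if is_package(name):
        return spacy.load(name)
    return spacy.blank('en')


pipeline = load_pipeline_or_blank()
blank_pipeline = spacy.blank('en')

# A blank pipeline has no components, a pre-trained one lists its components
print(f'- Loaded Components: {pipeline.pipe_names}')
print(f'- Blank Components: {blank_pipeline.pipe_names}')
```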

Even though the Pipelines are pre-trained already, we can fine-tune them with our own dataset!

---

And finally, let's take a look at some of the advantages of using pre-trained Pipelines:

- **Part-of-Speech (.pos_) Recognition** - `whether a token is a verb, noun, pronoun...`;
- **Dependency Label (.dep_) Recognition** - `what role a token takes in the sentence according to the context, such as nominal subject, object, root, determiner...`;
- **Syntactic Head (.head.text)** - `parent token that the current token is attached to through its Dependency Label`;
- **Stop Word (.is_stop) Detection** - `whether a token is a Stop Word, that is, a very common word from the language's stop list`.

In [6]:
# Loading Large Pipeline
nlp_large = spacy.load('en_core_web_lg')

In [7]:
# Checking Tokens' Context-Dependent Info
document = nlp_large('Hey it\'s me, Goku!')

for token in document:
    print(f'- Text: {token.text}')
    print(f'- Part-of-Speech (POS): {token.pos_}')
    print(f'- Dependency Label: {token.dep_}')
    print(f'- Head Token: {token.head.text}')
    print(f'- Is Stop Word: {token.is_stop}')
    print('---')

- Text: Hey
- Part-of-Speech (POS): INTJ
- Dependency Label: intj
- Head Token: 's
- Is Stop Word: False
---
- Text: it
- Part-of-Speech (POS): PRON
- Dependency Label: nsubj
- Head Token: 's
- Is Stop Word: True
---
- Text: 's
- Part-of-Speech (POS): AUX
- Dependency Label: ROOT
- Head Token: 's
- Is Stop Word: True
---
- Text: me
- Part-of-Speech (POS): PRON
- Dependency Label: attr
- Head Token: 's
- Is Stop Word: True
---
- Text: ,
- Part-of-Speech (POS): PUNCT
- Dependency Label: punct
- Head Token: 's
- Is Stop Word: False
---
- Text: Goku
- Part-of-Speech (POS): PROPN
- Dependency Label: attr
- Head Token: 's
- Is Stop Word: False
---
- Text: !
- Part-of-Speech (POS): PUNCT
- Dependency Label: punct
- Head Token: 's
- Is Stop Word: False
---
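
A side note on `.is_stop`: as far as I can tell, it relies on a word list shipped with the language data rather than on trained weights, so it should also be available in a blank pipeline. The sketch below checks that assumption with the same sentence. The `stop_flags` dictionary is a name made up for this example.

```python
import spacy

# Blank pipeline: no tagger or parser, only language data
nlp = spacy.blank('en')
document = nlp('Hey it\'s me, Goku!')

# Map each token text to its stop word flag
stop_flags = {token.text: token.is_stop for token in document}

for text, is_stop in stop_flags.items():
    print(f'- Text: {text} | Is Stop Word: {is_stop}')
```

Attributes such as `.pos_` and `.dep_`, on the other hand, stay empty in a blank pipeline, because they are predicted by trained components.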


<h1 id='4-named-entities-ner' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🏷️ | Named Entities (NER)</h1>

A common task when dealing with sequential texts is to identify real-world objects, such as companies, organizations, people, countries, products, currencies... These objects are called `Entities` or `Named Entities` in NLP, and the task of finding them is called `Named Entity Recognition (NER)`.

All recognized entities from a document can be accessed via the following attributes:

- **document.ents** - `returns a tuple of Spans, where each Span corresponds to an entity`;
- **Entity Label (.label_)** - `returns what kind of entity the Span is. We can use the 'spacy.explain' function to get a further explanation about the label definition`.

Since whether a Token or Span is an entity depends on the context, we also need a pre-trained model here.

In [8]:
# Listing Entities
document = nlp_large('Apple is looking at buying U.K. startup at $1 billion.')

for entity in document.ents:
    print(f'- Entity Text: {entity.text}')
    print(f'- Label: {entity.label_}')
    print(f'- Explanation: {spacy.explain(entity.label_)}')
    print('---')

- Entity Text: Apple
- Label: ORG
- Explanation: Companies, agencies, institutions, etc.
---
- Entity Text: U.K.
- Label: GPE
- Explanation: Countries, cities, states
---
- Entity Text: $1 billion
- Label: MONEY
- Explanation: Monetary values, including unit
---


In [9]:
# We can also call the 'spacy.explain' function to get further info about Spacy's abbreviations
print(f'- GPE: {spacy.explain("GPE")}')
print(f'- NNP: {spacy.explain("NNP")}')
print(f'- dobj: {spacy.explain("dobj")}')

- GPE: Countries, cities, states
- NNP: noun, proper singular
- dobj: direct object


However, there are some situations where entities are not identified automatically by Spacy. In these situations, we must get them as Spans manually.

In [10]:
# Getting Entities Manually (1)
text = 'Upcoming iPhone X release date leaked as Apple reveals pre-orders'
document = nlp_large(text)

for entity in document.ents:
    print(f'- Entity Text: {entity.text}')
    print(f'- Label: {entity.label_}')
    print(f'- Explanation: {spacy.explain(entity.label_)}')
    print('---')

- Entity Text: Apple
- Label: ORG
- Explanation: Companies, agencies, institutions, etc.
---


In [11]:
# Getting Entities Manually (2)
iphone_x_entity = document[1:3]
print(f'- Missing Entity: {iphone_x_entity}')

- Missing Entity: iPhone X
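
The slice above is a plain Span without any label. If we want it to behave like an entity, one possible way is to build a labelled `Span` and append it to `document.ents`. This is only a sketch: it uses a blank pipeline so it runs without any download, and choosing 'PRODUCT' as the label is my own assumption.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank('en')
document = nlp('Upcoming iPhone X release date leaked as Apple reveals pre-orders')

# Tokens 1 and 2 are 'iPhone' and 'X'; the end index is exclusive
iphone_x_entity = Span(document, 1, 3, label='PRODUCT')

# Keep the entities already found (none in a blank pipeline) and add ours
document.ents = list(document.ents) + [iphone_x_entity]

for entity in document.ents:
    print(f'- Entity Text: {entity.text}')
    print(f'- Label: {entity.label_}')
```

With a pre-trained pipeline such as `nlp_large`, the same assignment would keep the entities the model already found and add the new one next to them.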


We will learn better ways to fetch non-recognized entities and even make the model learn to identify them automatically in later lessons.

<h1 id='5-matches' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🔍 | Matches</h1>

Sometimes we want to apply rules in order to find words or phrases in a text, and it is in these situations that `Matches` and `RegEx` come into action.

While `RegEx` only searches within strings, `Matches` allow us to search within Spacy Documents, Tokens and Spans, applying advanced linguistic rules, such as matching the word 'duck' only when it's a noun and not a verb.

Match patterns are passed as a list of dictionaries, where each dictionary corresponds to the search rule for a single token. Besides, the rules are applied sequentially, so they must match consecutive Tokens.

Oh, and there are some filter Operators that we can use with Matcher:

- **{ 'OP': '!' }** - `Negation - Match 0 times`;
- **{ 'OP': '?' }** - `Optional - Match 0 or 1 times`;
- **{ 'OP': '+' }** - `Match 1 or more times`;
- **{ 'OP': '*' }** - `Match 0 or more times`.
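
To get a feeling for the difference between '?' and '*', here is a small sketch. It only uses the `LOWER` attribute, so a blank pipeline is enough. The sentence and the `matched_texts` helper are made up for this example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
document = nlp('I am happy and very very happy')


def matched_texts(pattern):
    """Return the distinct texts matched by a single pattern, sorted alphabetically."""
    matcher = Matcher(nlp.vocab)
    matcher.add('EXAMPLE_PATTERN', [pattern])
    return sorted({document[start:end].text for _, start, end in matcher(document)})


# '?' accepts at most one 'very' before 'happy'
optional_very = [{ 'LOWER': 'very', 'OP': '?' }, { 'LOWER': 'happy' }]

# '*' accepts any number of 'very' before 'happy'
any_very = [{ 'LOWER': 'very', 'OP': '*' }, { 'LOWER': 'happy' }]

print(f"- With '?': {matched_texts(optional_very)}")
print(f"- With '*': {matched_texts(any_very)}")
```

Only the '*' version is able to reach 'very very happy', since '?' gives up after a single 'very'.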

In [12]:
# Matches (1)
from spacy.matcher import Matcher

pattern1 = [{ 'TEXT': 'iPhone' }, { 'TEXT': 'X' }] # matches a token with text 'iPhone' followed by a token with text 'X'
pattern2 = [{ 'LOWER': 'iphone' }, { 'LOWER': 'x' }] # matches a token with lowercase text 'iphone' followed by a token with lowercase text 'x'
pattern3 = [{ 'LEMMA': 'buy' }, { 'POS': 'NOUN' }] # matches a token with lemma (dictionary form) as 'buy' followed by a token with part-of-speech (pos) as 'NOUN'. For instance, 'buy flowers' and 'bought books'

matcher = Matcher(nlp_large.vocab) # always feed Matcher with Vocab when instantiating it
matcher.add('IPHONE_PATTERN_1', [pattern1])
matcher.add('IPHONE_PATTERN_2', [pattern2])
matcher.add('IPHONE_PATTERN_3', [pattern3])

Matcher returns a list of tuples containing the following three elements each:

- **match_id** - `match id (hash of the pattern name)`;
- **start** - `index of the first token of the matched Span in the Document`;
- **end** - `index right after the last token of the matched Span in the Document (exclusive)`.

Also, we can create a Span of the match by slicing the Document with the 'start' and 'end' indexes.
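
Since the match id is a hash, it is not very readable on its own. A minimal sketch of how to turn it back into the pattern name through the Vocab's string store is shown below. It uses a blank pipeline and a single `LOWER` pattern, and 'IPHONE_PATTERN' is just a name chosen for this example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
document = nlp('Upcoming iPhone X release date leaked as Apple reveals pre-orders')

matcher = Matcher(nlp.vocab)
matcher.add('IPHONE_PATTERN', [[{ 'LOWER': 'iphone' }, { 'LOWER': 'x' }]])

results = []
for match_id, start_index, end_index in matcher(document):
    # Look the hash up in the string store to recover the pattern name
    pattern_name = nlp.vocab.strings[match_id]
    matched_span = document[start_index:end_index]
    results.append((pattern_name, start_index, end_index, matched_span.text))

    print(f'- Pattern Name: {pattern_name}')
    print(f'- Start and End: {start_index}, {end_index}')
    print(f'- Matched Span: {matched_span.text}')
```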

In [13]:
matches = matcher(document)

for match_id, start_index, end_index in matches:
    matched_span = document[start_index: end_index]

    print(f'- Match ID: {match_id}')
    print(f'- Matched Span: {matched_span.text}')
    print('---')

- Match ID: 7162708591567205619
- Matched Span: iPhone X
---
- Match ID: 16007330697362006008
- Matched Span: iPhone X
---


---

Let's do some exercises with Matches now!!

In [14]:
# Exercise 1) Extract the name of the tournament
document = nlp_large('2018 FIFA World Cup: France won!')

pattern_fifa = [
    { 'IS_DIGIT': True }
    , { 'LOWER': 'fifa' }
    , { 'LOWER': 'world' }
    , { 'LOWER': 'cup' }
    , { 'IS_PUNCT': True }
]

matcher = Matcher(nlp_large.vocab)
matcher.add('FIFA_PATTERN', [pattern_fifa])
matches = matcher(document)

for match_id, start_index, end_index in matches:
    matched_span = document[start_index:end_index]
    print(f'- Tournament\'s Name: {matched_span.text}')

- Tournament's Name: 2018 FIFA World Cup:


In [15]:
# Exercise 2) Apply two filters to a single token
document = nlp_large('I loved cats but I love dogs more')

pattern_animals = [{ 'LEMMA': 'love', 'POS': 'VERB' }, { 'POS': 'NOUN' }]

matcher = Matcher(nlp_large.vocab)
matcher.add('ANIMALS_SEARCH', [pattern_animals])
matches = matcher(document)

for match_id, start_index, end_index in matches:
    matched_span = document[start_index:end_index]
    print(f'- Matched Span: {matched_span.text}')

- Matched Span: loved cats
- Matched Span: love dogs


In [16]:
# Exercise 3) Apply filters using operators
document = nlp_large('I bought a smartphone. Now I am buying apps.')

pattern_with_operators = [
    { 'LEMMA': 'buy' }
    , { 'POS': 'DET', 'OP': '?' }
    , { 'POS': 'NOUN' }
]

matcher = Matcher(nlp_large.vocab)
matcher.add('PATTERN_WITH_OPERATORS', [pattern_with_operators])
matches = matcher(document)

for match_id, start_index, end_index in matches:
    matched_span = document[start_index:end_index]
    print(f'- Matched Span: {matched_span.text}')

- Matched Span: bought a smartphone
- Matched Span: buying apps


In [17]:
# Exercise 4) Apply filters using operators again
document = nlp_large('I love cats and I am very very happy')

pattern_love_cats = [{ 'LEMMA': 'love', 'POS': 'VERB' }, { 'LOWER': 'cats' }]
pattern_very_happy = [{ 'LOWER': 'very', 'OP': '+' }, { 'LOWER': 'happy' }]

matcher = Matcher(nlp_large.vocab)
matcher.add('PATTERN_LOVE_CATS', [pattern_love_cats])
matcher.add('PATTERN_VERY_HAPPY', [pattern_very_happy])
matches = matcher(document)

for match_id, start_index, end_index in matches:
    matched_span = document[start_index:end_index]
    print(f'- Matched Span: {matched_span.text}')

- Matched Span: love cats
- Matched Span: very happy
- Matched Span: very very happy
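
The last output lists both 'very happy' and 'very very happy', because the '+' operator reports every overlapping possibility. If we only care about the longest one, a possible approach is the `greedy` argument of `matcher.add`. The sketch below assumes Spacy v3 and uses a blank pipeline with a `LOWER` pattern only.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
document = nlp('I love cats and I am very very happy')

pattern_very_happy = [{ 'LOWER': 'very', 'OP': '+' }, { 'LOWER': 'happy' }]

matcher = Matcher(nlp.vocab)
# greedy='LONGEST' keeps only the longest match among the overlapping ones
matcher.add('PATTERN_VERY_HAPPY', [pattern_very_happy], greedy='LONGEST')

longest_matches = [
    document[start_index:end_index].text
    for match_id, start_index, end_index in matcher(document)
]

print(f'- Longest Matches: {longest_matches}')
```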


---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).