<a href="https://colab.research.google.com/github/Gladiator07/Natural-Language-Processing/blob/main/Basics/Text-Preprocessing/spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# spaCy Tutorial

### References
- [Overview of Spacy](https://www.analyticsvidhya.com/blog/2020/03/spacy-tutorial-learn-natural-language-processing/)
- [Comprehensive article](https://www.machinelearningplus.com/spacy-tutorial-nlp/)

## spaCy's Processing Pipeline

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/03/spacy_pipeline.png)

The first step when working with spaCy is to pass a text string to an NLP object (conventionally named `nlp`). This object is essentially a pipeline of text pre-processing operations that the input string passes through in order.

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
# run the text through the pipeline to get a processed Doc object
doc = nlp("He went to play basketball")

In [3]:
# list the active pipeline components
nlp.pipe_names

['tagger', 'parser', 'ner']

In [4]:
# disable pipeline components that are not required (saves computation)
nlp.disable_pipes('tagger', 'parser')

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7fe7259c9690>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7fe721bb2520>)]

In [5]:
nlp.pipe_names

['ner']
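
To make the idea of a pipeline concrete, here is a minimal sketch in plain Python. It is a toy and not spaCy's implementation: `ToyPipeline`, `tokenize`, `add_lowercase`, and `add_is_title` are hypothetical names invented for this illustration. It only shows the general shape: a tokenizer produces a document, each named component annotates it in turn, and a disabled component is simply skipped.

```python
# Toy illustration only: every name here is hypothetical and this is
# not how spaCy is implemented internally.

def tokenize(text):
    """Split raw text into a simple 'doc': a list of token dicts."""
    return [{"text": word} for word in text.split()]

def add_lowercase(doc):
    """Example component: annotate every token with its lowercase form."""
    for token in doc:
        token["lower"] = token["text"].lower()
    return doc

def add_is_title(doc):
    """Example component: flag tokens that are title-cased."""
    for token in doc:
        token["is_title"] = token["text"].istitle()
    return doc

class ToyPipeline:
    def __init__(self, components):
        # Ordered list of (name, function) pairs
        self.components = list(components)

    @property
    def pipe_names(self):
        return [name for name, _ in self.components]

    def __call__(self, text, disable=()):
        doc = tokenize(text)
        for name, component in self.components:
            if name in disable:
                continue  # a skipped component costs no computation
            doc = component(doc)
        return doc

toy_nlp = ToyPipeline([("lowercase", add_lowercase), ("is_title", add_is_title)])
toy_doc = toy_nlp("He went to play basketball")

print(toy_nlp.pipe_names)  # ['lowercase', 'is_title']
print(toy_doc[0])          # {'text': 'He', 'lower': 'he', 'is_title': True}
```

Calling `toy_nlp("He went", disable=("lowercase",))` leaves the `lower` annotation out, which mirrors why disabling unused components can save work.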

## 1. POS Tagging (Part-of-Speech)

In English grammar, the part of speech tells us the function of a word and how it is used in a sentence. Some of the common parts of speech in English are noun, pronoun, adjective, verb, and adverb.

POS tagging is the task of automatically assigning POS tags to all the words of a sentence. It is helpful in various downstream tasks in NLP, such as feature engineering, language understanding, and information extraction.

In [6]:
nlp = spacy.load('en_core_web_sm')

# run the text through the pipeline to get a processed Doc object
doc = nlp("He went to play basketball")

# iterate over the tokens and print each token's coarse-grained POS tag
for token in doc:
    print(token.text, "-->", token.pos_)

He --> PRON
went --> VERB
to --> PART
play --> VERB
basketball --> NOUN


In [7]:
# if you are not sure what a POS tag means, use spacy.explain
spacy.explain("PART")

'particle'
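
As one possible illustration of using POS tags for feature engineering, the sketch below turns the tags printed above into simple count and frequency features. The `tagged` list is copied by hand from that output, and the names `pos_counts` and `pos_features` are chosen for this example only.

```python
from collections import Counter

# (token, POS) pairs copied from the tagging output above
tagged = [
    ("He", "PRON"),
    ("went", "VERB"),
    ("to", "PART"),
    ("play", "VERB"),
    ("basketball", "NOUN"),
]

# Count how often each tag occurs in the sentence
pos_counts = Counter(pos for _, pos in tagged)

# Convert counts to relative frequencies, a simple sentence-level feature
total = sum(pos_counts.values())
pos_features = {pos: count / total for pos, count in pos_counts.items()}

print(pos_counts["VERB"])    # 2
print(pos_features["VERB"])  # 0.4
```

With a live `doc`, `Counter(token.pos_ for token in doc)` would produce the same kind of counts directly.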

## 2. Dependency Parsing

Every sentence has a grammatical structure, and dependency parsing lets us extract it. The result can be thought of as a directed graph in which the nodes are the words of the sentence and the edges are the dependencies between those words.

In [8]:
for token in doc:
    print(token.text, "-->", token.dep_)

He --> nsubj
went --> ROOT
to --> aux
play --> advcl
basketball --> dobj


In [9]:
for token in doc:
    print(token.dep_, "-->", spacy.explain(token.dep_))

nsubj --> nominal subject
ROOT --> None
aux --> auxiliary
advcl --> adverbial clause modifier
dobj --> direct object
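
The directed-graph view can be sketched in plain Python. The dependency labels below are taken from the output above, but the head of each token is an assumption made for illustration, because the cells above do not print `token.head`. The names `arcs`, `children`, and `render` are hypothetical.

```python
# (token, dependency label, head) triples.
# Labels come from the output above; the head column is an assumption.
arcs = [
    ("He", "nsubj", "went"),
    ("went", "ROOT", "went"),  # in spaCy the root token is its own head
    ("to", "aux", "play"),
    ("play", "advcl", "went"),
    ("basketball", "dobj", "play"),
]

# Build a head -> [(child, label)] adjacency map
children = {}
root = None
for token, dep, head in arcs:
    if dep == "ROOT":
        root = token
        continue
    children.setdefault(head, []).append((token, dep))

def render(node, label="ROOT", depth=0):
    """Return the subtree under `node` as a list of indented lines."""
    lines = ["  " * depth + f"{node} ({label})"]
    for child, dep in children.get(node, []):
        lines.extend(render(child, dep, depth + 1))
    return lines

tree_lines = render(root)
print("\n".join(tree_lines))
# went (ROOT)
#   He (nsubj)
#   play (advcl)
#     to (aux)
#     basketball (dobj)
```

With a real `doc`, the same triples could be read from `token.text`, `token.dep_`, and `token.head.text`.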


## 3. Named Entity Recognition

Entities are words or groups of words that refer to common real-world things such as persons, locations, and organizations. These entities have proper names.

For example, consider the following sentence:

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/03/sentence.png)

In this sentence, the entities are “Donald Trump”, “Google”, and “New York City”.

In [10]:
doc = nlp("Indians spent over $71 billion on clothes in 2018")

for ent in doc.ents:
    print(ent.text, "-->", ent.label_)
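    print(ent.label_, "-->", spacy.explain(ent.label_))

Indians --> NORP
NORP --> Nationalities or religious or political groups
$71 billion --> MONEY
MONEY --> Monetary values, including unit
2018 --> DATE
DATE --> Absolute or relative dates or periods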
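
A common next step is to group the extracted entities by label. The sketch below does this in plain Python; the `entities` list is copied by hand from the output above, and `by_label` is a name chosen for this example.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above
entities = [("Indians", "NORP"), ("$71 billion", "MONEY"), ("2018", "DATE")]

# Collect the entity texts under each label
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(dict(by_label))
# {'NORP': ['Indians'], 'MONEY': ['$71 billion'], 'DATE': ['2018']}
```

With a live `doc`, the pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.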


## 4. Rule-Based Matching using spaCy

Rule-based matching is another tool in spaCy’s arsenal. With the spaCy `Matcher`, you can find words and phrases in text using user-defined rules.

`It is like Regular Expressions on steroids.`

While regular expressions use only text patterns to find words and phrases, the spaCy matcher uses text patterns as well as lexical properties of each word, such as POS tags, dependency tags, and lemmas.

Let’s see how it works:

In [11]:
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import Matcher

# initialize the matcher with the spaCy vocabulary
matcher = Matcher(nlp.vocab)

doc = nlp("Some people start their day with lemon water")

# define rule
pattern = [{'TEXT': 'lemon'}, {'TEXT' : 'water'}]

# add rule (spaCy v2 signature; in spaCy v3 this is matcher.add('rule_1', [pattern]))
matcher.add('rule_1', None, pattern)

Our objective is for the matcher to find this pattern whenever the token “lemon” is immediately followed by the token “water”.

In [12]:
matches = matcher(doc)
matches

[(7604275899133490726, 6, 8)]

Each match is a tuple of three elements. The first element, ‘7604275899133490726’, is the match ID. The second and third elements are the start and end token indices of the matched span; the end index is exclusive, so here tokens 6 and 7 are matched.

In [13]:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

lemon water
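
The start and end values behave like ordinary Python slice bounds. The sketch below checks this without spaCy, under the assumption that whitespace splitting yields the same tokens as spaCy's tokenizer for this punctuation-free sentence. The match tuple is copied from the output above.

```python
# Whitespace tokens; assumed to coincide with spaCy's tokens for this sentence
tokens = "Some people start their day with lemon water".split()

# Match tuple copied from the matcher output above
match_id, start, end = (7604275899133490726, 6, 8)

# The span covers tokens start .. end-1, so `end` itself is excluded
matched = tokens[start:end]
print(matched)      # ['lemon', 'water']
print(end - start)  # 2 tokens in the span
```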


So, the pattern is a list of token attributes. For example, ‘TEXT’ is a token attribute that means the exact text of the token. There are, in fact, many other useful token attributes in spaCy which can be used to define a variety of rules and patterns.

For more rules, visit: https://spacy.io/usage/rule-based-matching

Let’s see another use case of the spaCy matcher. Consider the two sentences below:

- You read this book
- I will book my ticket

Now we want to find out whether a sentence contains the word “book”. It seems pretty straightforward, right? But here is the catch: we want to match “book” only if it is used in the sentence as a noun.

In the first sentence above, “book” has been used as a noun and in the second sentence, it has been used as a verb. So, the spaCy matcher should be able to extract the pattern from the first sentence only. Let’s try it out:

In [14]:
doc1 = nlp("You read this book")
doc2 = nlp("I will book my ticket")

pattern = [{'TEXT': 'book', 'POS': 'NOUN'}]

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 this is matcher.add('rule_2', [pattern])
matcher.add('rule_2', None, pattern)

In [15]:
matches = matcher(doc1)
matches

[(375134486054924901, 3, 4)]

The matcher has found the pattern in the first sentence.


In [16]:
matches = matcher(doc2)
matches

[]

Nice! Though “book” is present in the second sentence, the matcher ignored it as it was not a noun.
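
To show why a pattern written as a list of token-attribute dictionaries can tell the two uses of “book” apart, here is a toy matcher in plain Python. It is a simplified stand-in and not spaCy's `Matcher`: `token_matches` and `find_matches` are hypothetical helpers, the tokens are hand-written dictionaries, and the POS values assigned to them are assumptions chosen to mirror the example above.

```python
# Toy stand-in for the matching idea; not spaCy's Matcher implementation.

def token_matches(token, spec):
    """A token matches when every attribute in the spec has the required value."""
    return all(token.get(attr) == value for attr, value in spec.items())

def find_matches(tokens, pattern):
    """Return (start, end) pairs where consecutive tokens satisfy the pattern."""
    n = len(pattern)
    return [
        (start, start + n)
        for start in range(len(tokens) - n + 1)
        if all(token_matches(tokens[start + i], pattern[i]) for i in range(n))
    ]

# Hand-written tokens; the POS values are assumptions for illustration
sent1 = [
    {"TEXT": "You", "POS": "PRON"},
    {"TEXT": "read", "POS": "VERB"},
    {"TEXT": "this", "POS": "DET"},
    {"TEXT": "book", "POS": "NOUN"},
]
sent2 = [
    {"TEXT": "I", "POS": "PRON"},
    {"TEXT": "will", "POS": "AUX"},
    {"TEXT": "book", "POS": "VERB"},
    {"TEXT": "my", "POS": "PRON"},
    {"TEXT": "ticket", "POS": "NOUN"},
]

toy_pattern = [{"TEXT": "book", "POS": "NOUN"}]

print(find_matches(sent1, toy_pattern))  # [(3, 4)]
print(find_matches(sent2, toy_pattern))  # []
```

Because every key in a pattern entry must agree with the token, adding `'POS': 'NOUN'` is what filters out the verb usage in the second sentence.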