The objective of the five 3-hour practical sessions this week is to explore different tools
for important NLP tasks.
 
The first two sessions will focus on the NLP pipeline and how to apply it to text classification. 
We will first learn how to create NLP pipelines with spaCy and chain NLP tools to cover:
1. Loading Models
2. Loading and Normalization
    1. Loading text and documents
    2. Segmenting into tokens and sentences
    3. Stop-word filtering
    4. Lemmatization and Stemming
    5. Computing counts and frequencies
3. Tasks
    1. Part of Speech Tagging
    2. Parsing
        1. Dependency
        2. Shallow
    3. Named Entity Recognition 
4. Rule-based Matching
5. Custom pipelines


After exploring the capabilities of spaCy pipelines in the first session, we will complete a tutorial in the second session (read/watch https://huggingface.co/tasks/text-classification, get the notebook: https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb) on text classification with Hugging Face transformers, and then adapt both approaches for text classification on a new dataset: 

- A traditional ML pipeline using scikit-learn and feature engineering based on NLP features (e.g. computed with spaCy). You can combine TF-IDF features with your own one-hot encoded linguistic features (scikit-learn can combine different types of features using pipelines).

- A transformer architecture as studied in the Hugging Face tutorial. 

You will have to adjust the approaches to train/evaluate them on the Yelp Reviews dataset: https://www.yelp.com/dataset
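
As a starting point, the reviews can be read with the standard library alone. The sketch below assumes the review file is in JSON Lines format with `text` and `stars` fields (check this against the files you actually download). The helper names `stars_to_label` and `read_reviews`, as well as the three-class label scheme, are illustrative choices and not part of the dataset.

```python
import io
import json


def stars_to_label(stars):
    """Map a star rating to a coarse sentiment label (an assumed scheme)."""
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    return "neutral"


def read_reviews(lines, limit=None):
    """Yield (text, label) pairs from an iterable of JSON lines."""
    for i, line in enumerate(lines):
        if limit is not None and i >= limit:
            break
        record = json.loads(line)
        yield record["text"], stars_to_label(record["stars"])


# Tiny in-memory stand-in for the real file (made-up reviews)
sample_file = io.StringIO(
    '{"text": "Great food and friendly staff.", "stars": 5}\n'
    '{"text": "Cold fries, slow service.", "stars": 1}\n'
    '{"text": "It was fine.", "stars": 3}\n'
)
reviews = list(read_reviews(sample_file))
print(reviews)
```

With the real data, you would pass an open file handle instead of `sample_file` and use `limit` to work on a manageable subset first.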

Please establish a protocol to compare and evaluate the two systems using relevant metrics. You may consider several transformer models and see if task-specific models are better (they should be).  
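
One possible shape for such a protocol is to score every system on the same held-out test set with the same metrics. The sketch below uses scikit-learn's `accuracy_score` and `f1_score`; the labels, the predictions, and the names `system_a` and `system_b` are made up for illustration and say nothing about which approach performs better.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up gold labels and predictions, for illustration only
y_true = ["pos", "neg", "pos", "neg", "pos", "neg"]
system_predictions = {
    "system_a": ["pos", "neg", "neg", "neg", "pos", "pos"],
    "system_b": ["pos", "neg", "pos", "neg", "pos", "pos"],
}


def evaluate(gold, predicted):
    """Compute the same metrics for every system."""
    return {
        "accuracy": accuracy_score(gold, predicted),
        "macro_f1": f1_score(gold, predicted, average="macro"),
    }


results = {name: evaluate(y_true, preds)
           for name, preds in system_predictions.items()}
for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.2f} "
          f"macro-F1={scores['macro_f1']:.2f}")
```

Macro-averaged F1 is included because accuracy alone can be misleading when the classes are imbalanced, which is likely with review ratings.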


For the feature engineering aspect, you could proceed as follows: 
   1. Explore the corpus and create an NLP pipeline to prepare the text for processing
   2. Identify features that are likely to be informative for the classification task
   3. Create feature vectors from the linguistic features using scikit-learn
   4. Set up a series of classifiers and compare them to determine optimal parameters and feature combinations    
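
Steps 3 and 4 above can be sketched with a scikit-learn `FeatureUnion` that concatenates TF-IDF vectors with one-hot encoded linguistic features. The `linguistic_features` function and its two features are hypothetical placeholders (in practice you would compute them with spaCy), and the four training texts are made up.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def linguistic_features(texts):
    """Hypothetical hand-made features; replace with spaCy-derived ones."""
    return [
        {
            "has_exclamation": "!" in text,
            "length": "long" if len(text.split()) > 4 else "short",
        }
        for text in texts
    ]


features = FeatureUnion([
    # Bag-of-words view of the raw text
    ("tfidf", TfidfVectorizer()),
    # Linguistic view: DictVectorizer one-hot encodes string values
    ("linguistic", Pipeline([
        ("extract", FunctionTransformer(linguistic_features)),
        ("one_hot", DictVectorizer()),
    ])),
])

classifier = Pipeline([
    ("features", features),
    ("model", LogisticRegression(max_iter=1000)),
])

train_texts = ["Great food!", "Terrible service and cold food",
               "Loved it!", "Never coming back to this place"]
train_labels = ["pos", "neg", "pos", "neg"]
classifier.fit(train_texts, train_labels)
predictions = classifier.predict(["Great place!", "Cold and terrible"])
print(predictions)
```

Because everything lives in one `Pipeline`, the final estimator can be swapped or tuned (for example with `GridSearchCV`) without touching the feature code.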


This notebook is adapted from several online resources:
- Source 1 https://realpython.com/natural-language-processing-spacy-python/
- Source 2 https://www.ekino.com/articles/simple-nlp-tasks-tutorial
- Source 3 https://spacy.io/usage/processing-pipelines
# I. NLP Pipelines
## 1. Loading models
We first need to download a set of models that we can use with spaCy. There is a limited number of 
supported languages, but English and French are among them. You may consult the spaCy documentation to see a list of available models:
[https://spacy.io/usage/models](https://spacy.io/usage/models)


Install the spaCy English model:

```
python -m spacy download en_core_web_md
```

There's also a French model that can be loaded in the same way if you need it:

```
python -m spacy download fr_core_news_md
```

In [46]:
!python -m spacy download en_core_web_lg

2023-01-11 09:17:21.850199: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.1/en_core_web_lg-3.4.1-py3-none-any.whl (587.7 MB)
     587.7/587.7 MB 2.6 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.4.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')



In [47]:
import spacy
nlp = spacy.load('en_core_web_lg') # Load english models

## 2. Loading and Normalization
### 2.1 Loading texts and documents
Now that we know how to load models, we can see how to perform the first basic steps of the NLP Pipeline: reading and processing a text.  

We can either process a string directly or load the text from a file. 
Then we process the text with spaCy, which gives us a `Doc` instance from which we can retrieve the successive annotations. 

In [3]:
# From a string 
introduction_text = ('This tutorial is about Natural Language Processing in Spacy.')
introduction_doc = nlp(introduction_text)

In [None]:
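# From a file
file_name = 'introduction.txt'
# Use a context manager so the file is closed after reading
with open(file_name, encoding='utf-8') as text_file:
    introduction_file_text = text_file.read()
introduction_file_doc = nlp(introduction_file_text)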

### 2.2 Segmenting into tokens and sentences
#### Sentence detection
**Sentence Detection** is the process of locating the start and end of sentences 
in a given text. This allows you to divide a text into linguistically meaningful 
units. You’ll use these units when you’re processing your text to perform tasks such
as **part of speech tagging** and **entity extraction**.

In spaCy, the `sents` property of a `Doc` is used to extract sentences. 
Here’s how you would extract the total number of sentences and the sentences 
for a given input text:

In [4]:
about_text = ('Gus Proto is a Python developer currently'
               ' working for a London-based Fintech'
              ' company. He is interested in learning'
              ' Natural Language Processing.')

about_doc = nlp(about_text)
sentences = list(about_doc.sents)
len(sentences)

for sentence in sentences:
    print ("SENT: "+str(sentence))

SENT: Gus Proto is a Python developer currently working for a London-based Fintech company.
SENT: He is interested in learning Natural Language Processing.
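
The trained pipeline derives sentence boundaries from its statistical components. When you only need a quick split on punctuation, spaCy also ships a rule-based `sentencizer` component. The sketch below uses a blank English pipeline (tokenizer only, no model download); the names `rule_nlp` and `rule_doc` are chosen here so that they do not overwrite the `nlp` object used in the rest of the notebook.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
rule_nlp = spacy.blank("en")
# Rule-based sentence splitter based on punctuation
rule_nlp.add_pipe("sentencizer")

rule_doc = rule_nlp("Gus Proto is a Python developer. He is interested in NLP.")
rule_sentences = [sent.text for sent in rule_doc.sents]
print(rule_sentences)
```

This is faster than running a full parser, but it will make mistakes on text where punctuation does not reliably mark sentence ends.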


#### Tokenization

**Tokenization** breaks a text into its basic units, called **tokens**. 
In spaCy, it is actually the first step of the pipeline: the sentence boundaries shown above are computed from the tokens by later components. 
Tokens are the units used for further analysis, like part of speech tagging.

In spaCy, you can print tokens by iterating on the Doc object:

In [5]:
for token in about_doc:
    print (token, token.idx)

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
He 86
is 89
interested 92
in 103
learning 106
Natural 115
Language 123
Processing 132
. 142


Note how spaCy preserves the starting index of the tokens. 
It’s useful for in-place word replacement. 
spaCy provides [various](https://spacy.io/api/token#attributes) attributes for the Token class:

In [6]:
for token in about_doc:
    print (token, token.idx, token.text_with_ws,
           token.is_alpha, token.is_punct, 
           token.is_space,token.shape_, token.is_stop)

Gus 0 Gus  True False False Xxx False
Proto 4 Proto  True False False Xxxxx False
is 10 is  True False False xx True
a 13 a  True False False x True
Python 15 Python  True False False Xxxxx False
developer 22 developer  True False False xxxx False
currently 32 currently  True False False xxxx False
working 42 working  True False False xxxx False
for 50 for  True False False xxx True
a 54 a  True False False x True
London 56 London True False False Xxxxx False
- 62 - False True False - False
based 63 based  True False False xxxx False
Fintech 69 Fintech  True False False Xxxxx False
company 77 company True False False xxxx False
. 84 .  False True False . False
He 86 He  True False False Xx True
is 89 is  True False False xx True
interested 92 interested  True False False xxxx False
in 103 in  True False False xx True
learning 106 learning  True False False xxxx False
Natural 115 Natural  True False False Xxxxx False
Language 123 Language  True False False Xxxxx False
Processing 132 Pro

In this example, some of the commonly required attributes are accessed:

 - `text_with_ws` prints token text with trailing space (if present).
 - `is_alpha` detects if the token consists of alphabetic characters or not.
 - `is_punct` detects if the token is a punctuation symbol or not.
 - `is_space` detects if the token is a space or not.
 - `shape_` prints out the shape of the word.
 - `is_stop` detects if the token is a stop word or not.

**Note:** *We will see stop word filtering in the next section* 
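
The character offsets stored in `token.idx` make in-place word replacement straightforward, because the original spacing and punctuation can be copied unchanged. The following is a minimal sketch; `replace_token` is a helper written for this example, and a blank English pipeline is used since only the tokenizer is needed.

```python
import spacy

offset_nlp = spacy.blank("en")  # tokenizer only; no model download needed
original_text = "Gus Proto is a Python developer."
offset_doc = offset_nlp(original_text)


def replace_token(doc, old, new):
    """Replace every token whose text equals `old`, using character offsets."""
    pieces = []
    cursor = 0
    for token in doc:
        if token.text == old:
            pieces.append(doc.text[cursor:token.idx])  # text before the token
            pieces.append(new)
            cursor = token.idx + len(token.text)       # skip the old token
    pieces.append(doc.text[cursor:])                   # remaining text
    return "".join(pieces)


replaced_text = replace_token(offset_doc, "Python", "Rust")
print(replaced_text)
```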

You can also customize the tokenization process to detect tokens on custom characters. 
This is often used for hyphenated words, which are words joined with a hyphen. 
For example, “London-based” is a hyphenated word.

spaCy allows you to customize tokenization by updating the tokenizer property on the `nlp` object:

In [None]:
import re
import spacy
from spacy.tokenizer import Tokenizer

# Small English model; it must be installed first
# (python -m spacy download en_core_web_sm)
custom_nlp = spacy.load('en_core_web_sm')
# Reuse the default prefix and suffix rules of the loaded pipeline
prefix_re = spacy.util.compile_prefix_regex(custom_nlp.Defaults.prefixes)
suffix_re = spacy.util.compile_suffix_regex(custom_nlp.Defaults.suffixes)
# Split inside a token only on `-` or `~`
infix_re = re.compile(r'''[-~]''')

def customize_tokenizer(nlp):
    # Build a tokenizer that uses `-` (and `~`) as infix delimiters
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

custom_nlp.tokenizer = customize_tokenizer(custom_nlp)
custom_tokenizer_about_doc = custom_nlp(about_text)
print([token.text for token in custom_tokenizer_about_doc])


To customize tokenization, you can pass several parameters to the `Tokenizer` class:
 - `nlp.vocab` is the storage container for the vocabulary shared across the pipeline. Special cases such as contractions and emoticons are passed separately through the optional `rules` argument.
 - `prefix_search` is the function that is used to handle preceding punctuation, such as opening parentheses.
 - `infix_finditer` is the function that is used to handle non-whitespace separators, such as hyphens.
 - `suffix_search` is the function that is used to handle succeeding punctuation, such as closing parentheses.
 - `token_match` is an optional boolean function that is used to match strings that should never be split. It overrides the previous rules and is useful for entities like URLs or numbers.

**Note:** *spaCy already detects hyphenated words as individual tokens. The above code is just an example to show how tokenization can be customized. It can be used for any other character.*
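
Besides prefix, suffix, and infix rules, the tokenizer can be given special cases that always split a specific string in a fixed way. The sketch below follows the pattern shown in the spaCy documentation and uses a blank English pipeline; `special_nlp` is a name chosen for this example.

```python
import spacy
from spacy.symbols import ORTH

special_nlp = spacy.blank("en")  # tokenizer only

# Always split the string "gimme" into the two tokens "gim" and "me"
special_nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

special_tokens = [token.text for token in special_nlp("gimme that")]
print(special_tokens)
```

The pieces of a special case must add up to the original string, since spaCy's tokenization never alters the text.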

### 2.3 Stop-word filtering
Stop words are the most common words in a language. 
In the English language, some examples of stop words are `the`, `are`, `but`, and `they`. 
Most sentences need to contain stop words in order to be full sentences that make sense.

Generally, stop words are removed because they aren’t significant and distort the word 
frequency analysis. spaCy has a list of stop words for the English language:

In [7]:
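# Import the stop word list explicitly instead of relying on `spacy.lang.en` being loaded
from spacy.lang.en.stop_words import STOP_WORDS
spacy_stopwords = STOP_WORDS
len(spacy_stopwords)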

326

In [8]:
for stop_word in list(spacy_stopwords)[:10]:
     print(stop_word)


give
two
while
nowhere
‘s
become
forty
's
they
beyond


You can remove stop words from the input text:

In [9]:
for token in about_doc:
    if not token.is_stop:
        print(token)

Gus
Proto
Python
developer
currently
working
London
-
based
Fintech
company
.
interested
learning
Natural
Language
Processing
.


Stop words like `is`, `a`, `for`, `He`, and `in` are not printed in the output above. You can also create a list of tokens not containing stop words:

In [10]:
about_no_stopword_doc = [token for token in about_doc if not token.is_stop]
print (about_no_stopword_doc)


[Gus, Proto, Python, developer, currently, working, London, -, based, Fintech, company, ., interested, learning, Natural, Language, Processing, .]
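
For a specific corpus, you may want to treat some frequent domain words as stop words too. One simple way, sketched below, is to filter against your own set built from spaCy's list. The extra word `fintech` is an arbitrary example, and the blank English pipeline is used only for tokenization.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Assumed domain-specific addition; pick your own after exploring the corpus
custom_stop_words = STOP_WORDS | {"fintech"}

stop_nlp = spacy.blank("en")
stop_doc = stop_nlp("Gus is working for a Fintech company")

# Compare on the lowercased form so that capitalized words are also filtered
kept_tokens = [token.text for token in stop_doc
               if token.lower_ not in custom_stop_words]
print(kept_tokens)
```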


### 2.4 Lemmatization and Stemming
#### Lemmatization
**Lemmatization** is the process of reducing the inflected forms of a word to a common base form, while
ensuring that this reduced form is a valid word of the language. 
This reduced form or root word is called a **lemma**.

For example, *organizes*, *organized* and *organizing* are all forms of *organize*. 
Here, *organize* is the lemma. The inflection of a word allows you to express different 
grammatical categories like tense (organized vs organize), 
number (trains vs train), and so on. 
Lemmatization is necessary because it helps you reduce the inflected forms 
of a word so that they can be analyzed as a single item. 
It can also help you **normalize** the text.

spaCy has the attribute `lemma_` on the Token class. This attribute has the lemmatized form of a token:

In [48]:
conference_help_text = ('Gus is helping organize a developer'
    'conference on Applications of Natural Language'
    ' Processing. He keeps organizing local Python meetups'
    ' and several internal talks at his workplace.')
conference_help_doc = nlp(conference_help_text)
for token in conference_help_doc:
    print (token, token.lemma_)

Gus Gus
is be
helping helping
organize organize
a a
developerconference developerconference
on on
Applications Applications
of of
Natural Natural
Language Language
Processing Processing
. .
He he
keeps keep
organizing organize
local local
Python Python
meetups meetup
and and
several several
internal internal
talks talk
at at
his his
workplace workplace
. .


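In this example, *organizing* reduces to its lemma form *organize*. 
If you do not lemmatize the text, 
then *organize* and *organizing* will be counted as different tokens, 
even though they have the same meaning. 
Lemmatization helps you avoid counting such variants separately.

Note that `developerconference` shows up as a single token: Python concatenates adjacent string literals as-is, and the first literal in `conference_help_text` has no trailing space. Writing `' conference on ...'` with a leading space would restore the two words.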

#### Stemming
spaCy does not provide a stemmer, but we can use NLTK, an older reference NLP library, and import its Porter
stemmer for Python. 

In [12]:
# import these modules 
from nltk.stem import PorterStemmer 

ps = PorterStemmer() 
# A close alternative: SnowballStemmer("english"), from nltk.stem import SnowballStemmer
# Snowball stemmers are also available for French
  
# choose some words to be stemmed 
words = ["program", "programs", "programer", "programing", "programers"] 
  
for w in words: 
    print(w, " : ", ps.stem(w)) 

program  :  program
programs  :  program
programer  :  program
programing  :  program
programers  :  program


Note that NLTK requires you to download additional resources for some of its components (the stemmers do not need any): 
[https://www.nltk.org/data.html](https://www.nltk.org/data.html)

### 2.5 Computing counts and frequencies
You can now convert a given text into tokens and perform statistical analysis over it. 
This analysis can give you various insights about word patterns, such as common words or unique words in the text:


In [13]:
from collections import Counter
complete_text = ('Gus Proto is a Python developer currently'
    ' working for a London-based Fintech company. He is'
    ' interested in learning Natural Language Processing.'
    ' There is a developer conference happening on 21 July'
    ' 2019 in London. It is titled "Applications of Natural'
    ' Language Processing". There is a helpline number '
    ' available at +1-1234567891. Gus is helping organize it.'
    ' He keeps organizing local Python meetups and several'
    ' internal talks at his workplace. Gus is also presenting'
    ' a talk. The talk will introduce the reader about "Use'
    ' cases of Natural Language Processing in Fintech".'
    ' Apart from his work, he is very passionate about music.'
    ' Gus is learning to play the Piano. He has enrolled '
    ' himself in the weekend batch of Great Piano Academy.'
    ' Great Piano Academy is situated in Mayfair or the City'
    ' of London and has world-class piano instructors.')

complete_doc = nlp(complete_text)
# Remove stop words and punctuation symbols
words = [token.text for token in complete_doc
         if not token.is_stop and not token.is_punct]
word_freq = Counter(words)
# 5 commonly occurring words with their frequencies
common_words = word_freq.most_common(5)
print (common_words)

# Unique words
unique_words = [word for (word, freq) in word_freq.items() if freq == 1]
print (unique_words)


[('Gus', 4), ('London', 3), ('Natural', 3), ('Language', 3), ('Processing', 3)]
['Proto', 'currently', 'working', 'based', 'company', 'interested', 'conference', 'happening', '21', 'July', '2019', 'titled', 'Applications', 'helpline', 'number', 'available', '+1', '1234567891', 'helping', 'organize', 'keeps', 'organizing', 'local', 'meetups', 'internal', 'talks', 'workplace', 'presenting', 'introduce', 'reader', 'Use', 'cases', 'Apart', 'work', 'passionate', 'music', 'play', 'enrolled', 'weekend', 'batch', 'situated', 'Mayfair', 'City', 'world', 'class', 'piano', 'instructors']


By looking at the common words, you can see that the text as a whole is probably about Gus, 
London, or Natural Language Processing. 
This way, you can take any unstructured text and perform statistical analysis to 
know what it’s about.

Here’s another example of the same text with stop words:

In [14]:
words_all = [token.text for token in complete_doc if not token.is_punct]
word_freq_all = Counter(words_all)
# 5 commonly occurring words with their frequencies
common_words_all = word_freq_all.most_common(5)
print (common_words_all)

[('is', 10), ('a', 5), ('in', 5), ('Gus', 4), ('of', 4)]


Four out of five of the most common words are stop words, which don’t tell you much about the text. If you consider stop words while doing word frequency analysis, then you won’t be able to derive meaningful insights from the input text. This is why removing stop words is so important.
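
Raw counts depend on the length of the text, so it is often more useful to compare relative frequencies (count divided by the total number of counted words). A minimal sketch on a made-up two-sentence text, using a blank English pipeline for tokenization and the lexical `is_stop` and `is_punct` flags:

```python
from collections import Counter

import spacy

freq_nlp = spacy.blank("en")  # tokenizer only; stop word flags are lexical
freq_doc = freq_nlp("Gus plays piano. Gus loves piano music.")

content_words = [token.text for token in freq_doc
                 if not token.is_stop and not token.is_punct]
counts = Counter(content_words)
total = sum(counts.values())

# Relative frequency of each word among the counted words
relative_freq = {word: count / total for word, count in counts.items()}
for word, freq in sorted(relative_freq.items(), key=lambda item: -item[1]):
    print(f"{word}: {freq:.2f}")
```

Relative frequencies let you compare documents of different lengths, which matters when reviews range from one line to several paragraphs.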
## 3. Visualization
spaCy comes with a built-in visualizer called displaCy. 
You can use it to visualize a dependency parse or named entities in a browser or 
a Jupyter notebook.

You can use displaCy to display the POS tags and dependencies of tokens.
To view the result in a browser:

In [15]:
from spacy import displacy
about_interest_text = ('He is interested in learning'
    ' Natural Language Processing.')
about_interest_doc = nlp(about_interest_text)


In [16]:
displacy.serve(about_interest_doc, style='dep')


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


The above code will spin up a simple web server. 
You can see the visualization by opening [http://127.0.0.1:5000](http://127.0.0.1:5000) in your browser:

![](https://files.realpython.com/media/displacy_pos_tags.45059f2bf851.png)

You can also directly render the result in Jupyter: 

In [17]:
displacy.render(about_interest_doc, style='dep', jupyter=True)
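
Outside a notebook, `displacy.render` can return the markup as a string, which you can then save to a file. displaCy also has a manual mode that draws a parse you supply yourself, without running any model. In the sketch below, the parse in `manual_parse` is hand-written for illustration and is not model output.

```python
from spacy import displacy

# Hand-written parse (an assumption for illustration, not model output)
manual_parse = {
    "words": [
        {"text": "Gus", "tag": "PROPN"},
        {"text": "is", "tag": "AUX"},
        {"text": "learning", "tag": "VERB"},
        {"text": "piano", "tag": "NOUN"},
    ],
    "arcs": [
        {"start": 0, "end": 2, "label": "nsubj", "dir": "left"},
        {"start": 1, "end": 2, "label": "aux", "dir": "left"},
        {"start": 2, "end": 3, "label": "dobj", "dir": "right"},
    ],
}

# jupyter=False forces the function to return the markup instead of displaying it
svg_markup = displacy.render(manual_parse, style="dep",
                             manual=True, jupyter=False)
print(svg_markup[:80])
```

Writing `svg_markup` to a `.svg` file gives you a figure that can be included in a report.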

## 4. Tasks
In this part we will present an overview of the NLP tasks included in spaCy that are part of the 
standard NLP pipeline.

### 4.1. Part of Speech Tagging

Part of speech, or POS, is a grammatical role that explains how a particular word is used in a sentence. 
There are many different tagsets for POS tagging. The most common one for English is the Penn Treebank tagset, which is derived from 
the Brown Corpus tagset. It has 36 word-level tags (plus tags for punctuation and symbols), which can be regrouped into the following broad categories: 

- Noun
- Pronoun
- Adjective
- Verb
- Adverb
- Preposition
- Conjunction
- Interjection

**Part of speech tagging** is the process of assigning a **POS tag** to each token depending 
on its usage in the sentence. 
POS tags are useful for assigning a syntactic category like noun or verb to each word.

In spaCy, POS tags are available as an attribute on the Token object:

In [18]:
for token in about_doc:
    print (token, token.tag_, token.pos_, spacy.explain(token.tag_))

Gus NNP PROPN noun, proper singular
Proto NNP PROPN noun, proper singular
is VBZ AUX verb, 3rd person singular present
a DT DET determiner
Python NNP PROPN noun, proper singular
developer NN NOUN noun, singular or mass
currently RB ADV adverb
working VBG VERB verb, gerund or present participle
for IN ADP conjunction, subordinating or preposition
a DT DET determiner
London NNP PROPN noun, proper singular
- HYPH PUNCT punctuation mark, hyphen
based VBN VERB verb, past participle
Fintech NNP PROPN noun, proper singular
company NN NOUN noun, singular or mass
. . PUNCT punctuation mark, sentence closer
He PRP PRON pronoun, personal
is VBZ AUX verb, 3rd person singular present
interested JJ ADJ adjective (English), other noun-modifier (Chinese)
in IN ADP conjunction, subordinating or preposition
learning VBG VERB verb, gerund or present participle
Natural NNP PROPN noun, proper singular
Language NNP PROPN noun, proper singular
Processing NN NOUN noun, singular or mass
. . PUNCT punctuation m

Here, two attributes of the Token class and one helper function are used:

1. `tag_` lists the fine-grained part of speech.
2. `pos_` lists the coarse-grained part of speech.
3. `spacy.explain` gives descriptive details about a particular POS tag. spaCy provides a complete tag list along with an explanation for each tag.

Using POS tags, you can extract a particular category of words:

In [19]:
nouns = []
adjectives = []
for token in about_doc:
    if token.pos_ == 'NOUN':
        nouns.append(token)
    if token.pos_ == 'ADJ':
        adjectives.append(token)

nouns
adjectives

[interested]

You can use this to derive insights, 
remove the most common nouns, 
or see which adjectives are used for a particular noun.
### 4.2. Parsing
spaCy only supports dependency parsing, which we will examine first. We cannot use spaCy to get a full constituency parse;
however, third-party libraries make it possible to obtain a shallow constituency parse. 

#### 4.2.1 Dependency
**Dependency parsing** is the process of extracting the dependency parse of a sentence to represent 
its grammatical structure. It defines the dependency relationship between **headwords** and their 
**dependents**. The head of a sentence has no dependency and is called the **root of the sentence**. 
The **verb** is usually the head of the sentence. All other words are linked to the headword.

The dependencies can be mapped in a directed graph representation:

- Words are the nodes.
- The grammatical relationships are the edges.

Dependency parsing helps you know what role a word plays in the text and how different words 
relate to each other. It’s also used in **shallow parsing** and **named entity recognition**.

Here’s how you can use dependency parsing to see the relationships between words:

In [20]:
piano_text = 'Gus is learning piano'
piano_doc = nlp(piano_text)
for token in piano_doc:
    print (token.text, token.tag_, token.head.text, token.dep_)

Gus NNP learning nsubj
is VBZ learning aux
learning VBG learning ROOT
piano NN learning dobj


In this example, the sentence contains three relationships:

- `nsubj` marks the nominal subject. Its headword is a verb.
- `aux` is an auxiliary word. Its headword is a verb.
- `dobj` is the direct object of the verb. Its headword is a verb.

There is a detailed list of relationships with descriptions. 

You can visualise the result as follows: 

In [21]:
displacy.render(piano_doc, style='dep', jupyter=True)

This image shows you that the subject of the sentence is the proper noun `Gus` and 
that it has a learn relationship with `piano`.
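
The relationships listed above can be turned into simple subject-verb-object triples by looking at the children of the root. The sketch below builds a `Doc` by hand with the same heads and labels as the output above, so that it runs without a trained parser; `subject_verb_object` is a helper written for this example and only handles the `nsubj` and `dobj` labels.

```python
import spacy
from spacy.tokens import Doc

vocab_nlp = spacy.blank("en")

# Hand-annotated parse mirroring the output above (heads are token indices)
manual_doc = Doc(
    vocab_nlp.vocab,
    words=["Gus", "is", "learning", "piano"],
    heads=[2, 2, 2, 2],
    deps=["nsubj", "aux", "ROOT", "dobj"],
)


def subject_verb_object(doc):
    """Collect (subject, verb, object) triples around each root token."""
    triples = []
    for token in doc:
        if token.dep_ != "ROOT":
            continue
        subjects = [child for child in token.children if child.dep_ == "nsubj"]
        objects = [child for child in token.children if child.dep_ == "dobj"]
        for subject in subjects:
            for obj in objects:
                triples.append((subject.text, token.text, obj.text))
    return triples


svo_triples = subject_verb_object(manual_doc)
print(svo_triples)
```

The same function can be applied to `piano_doc` or to any document parsed by the `nlp` pipeline.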
##### Navigating the Tree and Subtree
The dependency parse tree has all the properties of a tree. 
This tree contains information about sentence structure and grammar and can be traversed
 in different ways to extract relationships.

spaCy provides attributes like children, lefts, rights, and subtree to navigate the parse tree:


In [22]:
one_line_about_text = ('Gus Proto is a Python developer'
    ' currently working for a London-based Fintech company')
one_line_about_doc = nlp(one_line_about_text)
# Extract children of `developer`
print([token.text for token in one_line_about_doc[5].children])

# Extract previous neighboring node of `developer`
print (one_line_about_doc[5].nbor(-1))

# Extract next neighboring node of `developer`
print (one_line_about_doc[5].nbor())

# Extract all tokens on the left of `developer`
print([token.text for token in one_line_about_doc[5].lefts])

# Extract tokens on the right of `developer`
print([token.text for token in one_line_about_doc[5].rights])

# Print subtree of `developer`
print (list(one_line_about_doc[5].subtree))

['a', 'Python', 'working']
Python
currently
['a', 'Python']
['working']
[a, Python, developer, currently, working, for, a, London, -, based, Fintech, company]


You can construct a function that takes a subtree as an argument and returns a string by merging words in it:

In [23]:
def flatten_tree(tree):
    return ''.join([token.text_with_ws for token in list(tree)]).strip()

# Print flattened subtree of `developer`
print (flatten_tree(one_line_about_doc[5].subtree))

a Python developer currently working for a London-based Fintech company


You can use this function to print all the tokens in a subtree.
#### 4.2.2 Shallow
**Shallow parsing**, or **chunking**, is the process of extracting phrases from unstructured text. 
Chunking groups adjacent tokens into phrases on the basis of their POS tags.
There are some standard well-known chunks such as noun phrases, verb phrases, and prepositional phrases.
##### Noun Phrase Detection

A noun phrase is a phrase that has a noun as its head. 
It could also include other kinds of words, such as adjectives, ordinals, determiners. 
Noun phrases are useful for explaining the context of the sentence. 
They help you infer *what* is being talked about in the sentence.

spaCy has the property `noun_chunks` on Doc object. You can use it to extract noun phrases:

In [24]:
conference_text = 'There is a developer conference happening on 21 July 2019 in London.'
conference_doc = nlp(conference_text)
# Extract Noun Phrases
for chunk in conference_doc.noun_chunks:
    print (chunk)


a developer conference
21 July
London


By looking at noun phrases, you can get information about your text.
For example, `a developer conference` indicates that the text mentions a conference, 
while the date `21 July` lets you know that the conference is scheduled for 21 July. 
You can figure out whether the conference is in the past or the future.
London tells you that the conference is in London.




### 4.3 Named Entity Recognition

**Named Entity Recognition (NER)** is the process of locating **named entities** in unstructured text and then classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, time expressions, and so on.

You can use **NER** to know more about the meaning of your text. 
For example, you could use it to populate tags for a set of documents in order to 
improve the keyword search. You could also use it to categorize customer support tickets
 into relevant categories.

spaCy has the property `ents` on `Doc` objects. You can use it to extract named entities:

In [30]:
piano_class_text = ('Great Piano Academy is situated'
    ' in Mayfair or the City of London and has'
    ' world-class piano instructors.')
piano_class_doc = nlp(piano_class_text)
for ent in piano_class_doc.ents:
    print(ent.text, ent.start_char, ent.end_char,
          ent.label_, spacy.explain(ent.label_))


Great Piano Academy 0 19 ORG Companies, agencies, institutions, etc.
Mayfair 35 42 GPE Countries, cities, states
the City of London 46 64 GPE Countries, cities, states


In the above example, `ent` is a Span object with various attributes:

- `text` gives the Unicode text representation of the entity.
- `start_char` denotes the character offset for the start of the entity.
- `end_char` denotes the character offset for the end of the entity.
- `label_` gives the label of the entity.

`spacy.explain` gives descriptive details about an entity label. 
The spaCy model has a pre-trained list of entity classes. 
You can use displaCy to visualize these entities:

In [31]:
displacy.render(piano_class_doc, style="ent", jupyter=True)

You can use NER to redact people’s names from a text. 
For example, you might want to do this in order to hide personal information collected in a survey.
You can use spaCy to do that:

In [38]:
survey_text = ('Out of 5 people surveyed, James Robert,'
               ' Julie Fuller and Benjamin Brooks like'
               ' apples. Kelly Cox and Matthew Evans'
               ' like oranges.')

def replace_person_names(token):
    if token.ent_iob != 0 and token.ent_type_ == 'PERSON':
        return '[REDACTED] '
    return str(token)+" "

def redact_names(nlp_doc):
    # Merge each entity into a single token before replacing names
    with nlp_doc.retokenize() as retokenizer:
        for ent in nlp_doc.ents:
            retokenizer.merge(ent)
    tokens = map(replace_person_names, nlp_doc)
    return "".join(tokens)

survey_doc = nlp(survey_text)
redact_names(survey_doc)

'Out of 5 people surveyed , [REDACTED] , [REDACTED] and [REDACTED] like apples . [REDACTED] and [REDACTED] like oranges . '

In this example, replace_person_names() uses ent_iob. It gives the IOB code of the named entity tag using inside-outside-beginning (IOB) tagging. Here, it can assume a value other than zero, because zero means that no entity tag is set.
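
The redaction above inserts a space after every token, which is why the output contains spaces before commas and periods. An alternative sketch, shown below, rebuilds the text from the entity character offsets (`start_char`, `end_char`) so that the original spacing is preserved. To stay runnable without a trained model, the two `PERSON` entities are set by hand here; with the `nlp` pipeline they would come from the NER component. `redact_entities` is a helper written for this example.

```python
import spacy

redact_nlp = spacy.blank("en")
redact_text = "James Robert and Julie Fuller like apples."
redact_doc = redact_nlp(redact_text)

# Stand-in for model predictions: mark the two names by hand
person_spans = []
for name in ("James Robert", "Julie Fuller"):
    start = redact_text.index(name)
    person_spans.append(
        redact_doc.char_span(start, start + len(name), label="PERSON"))
redact_doc.ents = person_spans


def redact_entities(doc, label="PERSON", mask="[REDACTED]"):
    """Replace entities with the given label, keeping the original spacing."""
    pieces = []
    cursor = 0
    for ent in doc.ents:
        if ent.label_ == label:
            pieces.append(doc.text[cursor:ent.start_char])
            pieces.append(mask)
            cursor = ent.end_char
    pieces.append(doc.text[cursor:])
    return "".join(pieces)


redacted_text = redact_entities(redact_doc)
print(redacted_text)
```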

## 5. Pipelines

spaCy actually runs all the tasks and tools in the pipeline automatically when you call the `nlp` object on a text.
![Alt](https://d33wubrfki0l68.cloudfront.net/16b2ccafeefd6d547171afa23f9ac62f159e353d/48b91/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg)

In some situations, it may be useful to run only part of the standard pipeline and thus to disable some processes. 
For example, we can disable POS tagging and parsing as follows: 

In [39]:
import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

# Reuse the `nlp` pipeline loaded earlier and skip the components we do not need
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])

[('$9.4 million', 'MONEY'), ('the prior year', 'DATE'), ('$2.7 million', 'MONEY')]
[('twelve billion dollars', 'MONEY'), ('1b', 'MONEY')]




You can find more details about pipelines and custom pipeline components in the spaCy documentation: https://spacy.io/usage/processing-pipelines
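
As a small illustration of a custom pipeline component, the sketch below registers a function with `@Language.component` and appends it to a blank English pipeline. The component name `content_token_counter` and the extension attribute `n_content_tokens` are invented for this example; it uses the spaCy v3 API.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Custom attribute, available as doc._.n_content_tokens
# (force=True allows re-running the cell)
Doc.set_extension("n_content_tokens", default=0, force=True)


@Language.component("content_token_counter")
def content_token_counter(doc):
    """Count the tokens that are neither stop words nor punctuation."""
    doc._.n_content_tokens = sum(
        1 for token in doc if not token.is_stop and not token.is_punct)
    return doc  # a component must return the Doc


demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")
demo_nlp.add_pipe("content_token_counter", last=True)

demo_doc = demo_nlp("Gus is learning piano.")
print(demo_nlp.pipe_names, demo_doc._.n_content_tokens)
```

The same `add_pipe` call works on a trained pipeline, so a component like this could feed hand-made features into the scikit-learn part of the project.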
 

## 6. Rule-based Matching
**Rule-based matching** is one of the steps in extracting information from unstructured text. 
It’s used to identify and extract tokens and phrases according to patterns (such as lowercase)
 and grammatical features (such as part of speech).

Rule-based matching can use regular expressions to extract entities (such as phone numbers)
 from an unstructured text. It’s different from extracting text using regular expressions only in the sense that regular expressions don’t consider the lexical and grammatical attributes of the text.

With rule-based matching, you can extract a first name and a last name, which are always **proper nouns**:

In [43]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
def extract_full_name(nlp_doc):
    pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
    matcher.add('FULL_NAME', [pattern])
    matches = matcher(nlp_doc)
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text

extract_full_name(about_doc)

'Gus Proto'

In this example, `pattern` is a list of dictionaries that defines the sequence of tokens to be matched. Each dictionary describes one token, and both require the POS tag `PROPN` (proper noun), so the pattern matches two consecutive proper nouns. The pattern is then added to the `Matcher` under the key `FULL_NAME`, which serves as its match ID. Finally, matches are obtained with their start and end token indexes.

You can also use rule-based matching to extract phone numbers:

In [45]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
conference_org_text = ('There is a developer conference'
    ' happening on 21 July 2019 in London. It is titled'
    ' "Applications of Natural Language Processing".'
    ' There is a helpline number available'
    ' at (123) 456-789')

def extract_phone_number(nlp_doc):
    pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'},
               {'ORTH': ')'}, {'SHAPE': 'ddd'},
               {'ORTH': '-', 'OP': '?'},
               {'SHAPE': 'ddd'}]
    matcher.add('PHONE_NUMBER', [pattern])
    matches = matcher(nlp_doc)
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text

conference_org_doc = nlp(conference_org_text)
extract_phone_number(conference_org_doc)

'(123) 456-789'

In this example, only the pattern changes compared with the previous example, so that it matches phone numbers. Some additional token attributes are used:

 - `ORTH` gives the exact text of the token.
 - `SHAPE` transforms the token string to show orthographic features.
 - `OP` defines operators. Using ? as a value means that the pattern is optional, meaning it can match 0 or 1 times.
 
Rule-based matching helps you identify and extract tokens and phrases according to lexical patterns (such as lowercase) and grammatical features (such as part of speech).
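
The functions above return only the first match and add the pattern again on every call. A variant, sketched below, registers the pattern once and returns every match in document order. Since `ORTH` and `SHAPE` are lexical attributes, a blank English pipeline is enough; the example sentence and the names `phone_matcher` and `find_phone_numbers` are made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher

phone_nlp = spacy.blank("en")  # tokenizer only; ORTH and SHAPE need no model
phone_matcher = Matcher(phone_nlp.vocab)

# Register the pattern once, outside the function
phone_matcher.add("PHONE_NUMBER", [[
    {"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"},
    {"SHAPE": "ddd"}, {"ORTH": "-", "OP": "?"}, {"SHAPE": "ddd"},
]])


def find_phone_numbers(doc):
    """Return the text of every phone number match, in document order."""
    matches = sorted(phone_matcher(doc), key=lambda match: match[1])
    return [doc[start:end].text for _, start, end in matches]


phone_doc = phone_nlp("Call (123) 456-789 or (987) 654-321 today")
phone_numbers = find_phone_numbers(phone_doc)
print(phone_numbers)
```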




 

##### Verb Phrase Detection

A verb phrase is a syntactic unit composed of at least one verb. This verb can be followed by other chunks, such as noun phrases. Verb phrases are useful for understanding the actions that nouns are involved in.


In [59]:
import spacy   
from spacy.matcher import Matcher
from spacy.util import filter_spans

sentence = 'The cat sat on the mat. He quickly ran to the market. The dog jumped into the water. The author is writing a book.'
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'AUX', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]

# instantiate a Matcher instance
matcher = Matcher(nlp.vocab)
matcher.add("Verb phrase", [pattern])

doc = nlp(sentence) 
# call the matcher to find matches 
matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]

print (filter_spans(spans)) 

[sat, quickly ran, jumped, is writing]


In this example, the matcher returns one verb phrase per sentence: *sat*, *quickly ran*, *jumped*, and *is writing*.
The above code extracts all the verb phrases using a pattern of POS tags. 
You can tweak the pattern for verb phrases depending upon your use case.