# What is spaCy?
spaCy is an open-source natural language processing library designed to handle NLP tasks with efficient implementations of common algorithms. For many NLP tasks spaCy implements only one method, choosing the most efficient algorithm currently available, so we don't have the option of choosing another algorithm.

# What is NLTK?
Natural Language Toolkit (NLTK) is a popular open-source library that provides many functionalities, but includes less efficient implementations.

# NLTK vs spaCy
spaCy does not include pre-built models for some applications, such as sentiment analysis, which is typically easier to perform with NLTK.

# What is NLP?
Natural Language Processing is an area of computer science and AI concerned with interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data.

# spaCy Basics
The `nlp` object returned by `spacy.load()` is callable: `nlp(text)` takes raw text and performs a series of operations to tag, parse, and describe the text data.

# spaCy Object

After importing the spacy module we load a **model** and name it `nlp`.<br>Next we create a **Doc** object by applying the model to our text, and name it `doc`.<br>spaCy also builds a companion **Vocab** object that we'll cover in later sections.<br>The **Doc** object that holds the processed text is our focus here.


For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging           
For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
To see the full name of a tag use `spacy.explain(tag)`
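
As a quick illustration of `spacy.explain`, the sketch below looks up a few labels that appear in the outputs of this notebook. The `labels` list and the `explanations` dictionary are names chosen for this example only.

```python
import spacy

# Labels taken from the outputs shown in this notebook
labels = ["PROPN", "NNP", "GPE"]

# spacy.explain returns a short description string for a known label
explanations = {label: spacy.explain(label) for label in labels}

for label, meaning in explanations.items():
    print(f"{label:<6} {meaning}")
```

No trained pipeline is needed for this lookup; `spacy.explain` only consults spaCy's built-in glossary.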

|Attribute|Description|Example (`doc2[0]`)|
|:------|:------|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`Tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [1]:
import spacy
##Load language library
nlp = spacy.load('en_core_web_sm')

doc=nlp(u'Tesla is looking at buying U.S. startups for $6 million.')
for token in doc:
    print(token.text)##the text
    print(token.pos_)##part of speech
    print(token.dep_)##syntactic dependency
    print()
print(nlp.pipeline,'\n')
print(nlp.pipe_names)

Tesla
PROPN
nsubj

is
AUX
aux

looking
VERB
ROOT

at
ADP
prep

buying
VERB
pcomp

U.S.
PROPN
compound

startups
NOUN
dobj

for
ADP
prep

$
SYM
quantmod

6
NUM
compound

million
NUM
pobj

.
PUNCT
punct

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x0000029F51EE3C50>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x0000029F51EE3F50>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x0000029F51DF2110>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x0000029F51FDE550>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x0000029F520A8450>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x0000029F51DF22D0>)] 

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [2]:
doc2 = nlp(u"Tesla isn't looking for anymore companies")
for token in doc2:
    print(token.text,'\t',token.pos_,'\t',token.dep_,'\t',token.lemma_,'\t',token.shape_,'\t',token.tag_,'\t',spacy.explain(token.tag_),'\t',token.is_alpha,'\t',token.is_stop)
print()
# Access tokens by index; note that this loop reads from `doc` (the sentence from cell 1), not `doc2`
for i in range(len(doc2)):
    print(doc[i].text,'\t',doc[i].pos_,'\t',doc[i].dep_,'\t',doc[i].lemma_,'\t',doc[i].shape_,'\t',doc[i].tag_,'\t',spacy.explain(doc[i].tag_),'\t',doc[i].is_alpha,'\t',doc[i].is_stop)
print()
spacy.explain('PROPN')

Tesla 	 PROPN 	 nsubj 	 Tesla 	 Xxxxx 	 NNP 	 noun, proper singular 	 True 	 False
is 	 AUX 	 aux 	 be 	 xx 	 VBZ 	 verb, 3rd person singular present 	 True 	 True
n't 	 PART 	 neg 	 not 	 x'x 	 RB 	 adverb 	 False 	 True
looking 	 VERB 	 ROOT 	 look 	 xxxx 	 VBG 	 verb, gerund or present participle 	 True 	 False
for 	 ADP 	 prep 	 for 	 xxx 	 IN 	 conjunction, subordinating or preposition 	 True 	 True
anymore 	 ADJ 	 amod 	 anymore 	 xxxx 	 JJ 	 adjective (English), other noun-modifier (Chinese) 	 True 	 False
companies 	 NOUN 	 pobj 	 company 	 xxxx 	 NNS 	 noun, plural 	 True 	 False

Tesla 	 PROPN 	 nsubj 	 Tesla 	 Xxxxx 	 NNP 	 noun, proper singular 	 True 	 False
is 	 AUX 	 aux 	 be 	 xx 	 VBZ 	 verb, 3rd person singular present 	 True 	 True
looking 	 VERB 	 ROOT 	 look 	 xxxx 	 VBG 	 verb, gerund or present participle 	 True 	 False
at 	 ADP 	 prep 	 at 	 xx 	 IN 	 conjunction, subordinating or preposition 	 True 	 True
buying 	 VERB 	 pcomp 	 buy 	 xxxx 	 VBG 	 verb, gerund 

'proper noun'

# Tokenization
Tokenization is the process of breaking up the original text into component pieces (tokens).


In [3]:
mystring = '"We\'re moving to N.C.R.!"'
print(mystring)
doc1 = nlp(mystring)
for token in doc1:
    print(token.text,'\t',token.pos_)
print('\n')

doc2 = nlp(u"We're here to help1 zsned snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")
for token in doc2:
    print(token.text,'\t',token.pos_)
print('\n')

doc3 = nlp(u"A 5km NYC cab ride from St. Joseph costs Rs 50.00 /-")
for token in doc3:
    print(token.text,'\t',token.pos_)

print('\nLength of doc 1',len(doc1),'\nLength of doc 2',len(doc2),'\nLength of doc 3',len(doc3))

"We're moving to N.C.R.!"
" 	 PUNCT
We 	 PRON
're 	 AUX
moving 	 VERB
to 	 ADP
N.C.R. 	 PROPN
! 	 PUNCT
" 	 PUNCT


We 	 PRON
're 	 AUX
here 	 ADV
to 	 AUX
help1 	 PROPN
zsned 	 VERB
snail 	 NOUN
- 	 PUNCT
mail 	 NOUN
, 	 PUNCT
email 	 NOUN
support@oursite.com 	 X
or 	 CCONJ
visit 	 VERB
us 	 PRON
at 	 ADP
http://www.oursite.com 	 X
! 	 PUNCT


A 	 DET
5 	 NUM
km 	 NOUN
NYC 	 PROPN
cab 	 NOUN
ride 	 NOUN
from 	 ADP
St. 	 PROPN
Joseph 	 PROPN
costs 	 VERB
Rs 	 NOUN
50.00 	 NUM
/- 	 PUNCT

Length of doc 1 8 
Length of doc 2 18 
Length of doc 3 13


# Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

In [4]:
doc = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')
life_quote = doc[16:30]
print(life_quote)
type(life_quote)

"Life is what happens to us while we are making other plans"


spacy.tokens.span.Span
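
To make the slicing semantics concrete, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that it runs without a downloaded model; the sentence and variable names are illustrative.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank pipeline is enough here because slicing only needs tokens
nlp_blank = spacy.blank("en")
doc_demo = nlp_blank("Life is what happens to us while we are making other plans")

# Indices are token positions, not character offsets; the stop index is exclusive
span = doc_demo[2:6]

print(span.text)                    # the text covered by the slice
print(span.start, span.end)         # token offsets of the slice within the Doc
print(isinstance(span, Span))       # slicing a Doc yields a Span
print([token.text for token in span])
```

A Span is a view onto its parent Doc, so the tokens it yields are the same Token objects you would get by indexing the Doc directly.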

# Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [5]:
doc = nlp(u'This is the first sentence. This is the second sentence. This is the last sentence.')
for stnc in doc.sents:
    print(stnc)
doc[6].is_sent_start

This is the first sentence.
This is the second sentence.
This is the last sentence.


True
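
The cell above relies on the trained pipeline to mark sentence starts. As a hedged alternative sketch, spaCy also ships a rule-based `sentencizer` component that splits on punctuation. The example assumes a blank English pipeline and uses illustrative variable names; it is not the method used elsewhere in these notes.

```python
import spacy

# Assumption: blank pipeline plus the built-in rule-based sentencizer
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

doc_rules = nlp_rules(
    "This is the first sentence. This is the second sentence. This is the last sentence."
)

# Doc.sents is a generator of Span objects, one per sentence
sentences = [sent.text for sent in doc_rules.sents]

# Tokens flagged as the start of a sentence
starts = [token.i for token in doc_rules if token.is_sent_start]

print(sentences)
print(starts)
```

The `is_sent_start` flags are what `Doc.sents` is built from, which is why index 6 reports `True` in the cell above.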


# Named Entities
Going a step beyond tokens, *named entities* add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.

In [6]:
doc = nlp(u"Apple to build a factory at Hong Kong for 6 million U.S. Dollars")
for entity in  doc.ents:
    print(entity)
    print(entity.label_,'\n',str(spacy.explain(entity.label_)),'\n')

Apple
ORG 
 Companies, agencies, institutions, etc. 

Hong Kong
GPE 
 Countries, cities, states 

6 million U.S. Dollars
MONEY 
 Monetary values, including unit 




# Noun Chunks
Similar to `Doc.ents`, `Doc.noun_chunks` are another object property. *Noun chunks* are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, in [Sheb Wooley's 1958 song](https://en.wikipedia.org/wiki/The_Purple_People_Eater), a *"one-eyed, one-horned, flying, purple people-eater"* would be one long noun chunk.

In [7]:
doc = nlp(u"Autonomous cars shift insurance liability towards manufacturers!")
for chunks in doc.noun_chunks:
    print(chunks)

Autonomous cars
insurance liability
manufacturers



# Built-in Visualizers

spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more info visit https://spacy.io/usage/visualizers

# Visualizing the dependency parse
The optional `'distance'` argument sets the distance between tokens. If the distance is made too small, text that appears beneath short arrows may become too compressed to read.

In [8]:
from spacy import displacy
doc = nlp(u"Apple is going to build a factory in India in the state of U.P. for a sum-total of 4 million U.S. Dollars")
displacy.render(doc, style = 'dep', jupyter = True,options = {'distance':110})

# Visualizing the entity recognizer

In [9]:
doc = nlp(u"Apple is going to build a factory in India in the state of U.P. for a sum-total of 4 million U.S. Dollars")
## 'distance' only applies to the dependency visualizer, so no options are passed here
displacy.render(doc, style = 'ent', jupyter = True)


# Creating Visualizations Outside of Jupyter
If you're using another Python IDE or writing a script, you can choose to have spaCy serve up HTML separately:

In [10]:
##doc = nlp(u'This is a sentence.')
##displacy.serve(doc, style='dep')

After uncommenting and running the above cell, view the dependency parse at

http://127.0.0.1:5000

To shut down the server, return to Jupyter and interrupt the kernel through the Kernel menu, by hitting the black square on the toolbar, or by typing the keyboard shortcut `Esc`, `I`, `I`.
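
If you would rather not run a server, one option is to ask displaCy for the markup as a string and save it yourself. The sketch below makes several assumptions: it uses a blank English pipeline, sets two entities by hand (since a blank pipeline has no entity recognizer), and writes to a hypothetical file name in the system temp directory.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption: blank pipeline, so entities are assigned manually for the demo
nlp_blank = spacy.blank("en")
doc_viz = nlp_blank("Apple is going to build a factory in India")
doc_viz.ents = [Span(doc_viz, 0, 1, label="ORG"), Span(doc_viz, 8, 9, label="GPE")]

# jupyter=False forces displaCy to return the markup instead of displaying it;
# page=True wraps it in a complete HTML page
html = displacy.render(doc_viz, style="ent", jupyter=False, page=True)

# Hypothetical output location; open the file in any browser
out_path = Path(tempfile.gettempdir()) / "displacy_entities.html"
out_path.write_text(html, encoding="utf-8")
print(out_path)
```

With a trained pipeline such as `en_core_web_sm`, the manual `doc.ents` assignment would not be needed.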

# Stemming
Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be the **stem** for [boat, boater, boating, boats].

Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization. For those interested, there's some background on this decision [here](https://github.com/explosion/spaCy/issues/327). We discuss the virtues of *lemmatization* in the next section.

Instead, we'll use another popular NLP tool called **nltk**, which stands for *Natural Language Toolkit*. For more information on nltk visit https://www.nltk.

# Porter Stemmer

One of the most common - and effective - stemming tools is [*Porter's Algorithm*](https://tartarus.org/martin/PorterStemmer/) developed by Martin Porter in [1980](https://tartarus.org/martin/PorterStemmer/def.txt). The algorithm employs five phases of word reduction, each with its own set of mapping rules. In the first phase, simple suffix mapping rules are defined.

From a given set of stemming rules only one rule is applied, based on the longest suffix S1. Thus, `caresses` reduces to `caress` but not `cares`.
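
The longest-suffix rule can be sketched in a few lines of plain Python. The four rules below are step 1a of Porter's 1980 definition (SSES → SS, IES → I, SS → SS, S → nothing); the function name `porter_step_1a` is made up for this illustration, and the sketch covers only this one step, not the full stemmer that NLTK implements.

```python
# Step 1a suffix rules as (suffix, replacement) pairs
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def porter_step_1a(word):
    """Apply only the rule whose suffix is the longest match for `word`."""
    candidates = [(suffix, repl) for suffix, repl in STEP_1A_RULES if word.endswith(suffix)]
    if not candidates:
        return word  # no rule applies
    suffix, repl = max(candidates, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "---->", porter_step_1a(w))
```

Because `sses` is longer than `s`, `caresses` is rewritten by the first rule alone and never reaches `cares`.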

More sophisticated phases consider the length/complexity of the word before applying a rule.

In [11]:
import nltk
print("PorterStemmer\n")
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()
words = ['run','runner','ran','runs','easily','fairly','fairness','generously','generate','generous','generation']
for wrd in words:
    print(wrd + '---->' + p_stemmer.stem(wrd))
print("\n\nSnowballStemmer\n")
from nltk.stem.snowball import SnowballStemmer
s_stemmer = SnowballStemmer(language = 'english')
for wrd in words:
    print(wrd + '---->' + s_stemmer.stem(wrd))
    

PorterStemmer

run---->run
runner---->runner
ran---->ran
runs---->run
easily---->easili
fairly---->fairli
fairness---->fair
generously---->gener
generate---->gener
generous---->gener
generation---->gener


SnowballStemmer

run---->run
runner---->runner
ran---->ran
runs---->run
easily---->easili
fairly---->fair
fairness---->fair
generously---->generous
generate---->generat
generous---->generous
generation---->generat



Stemming has its drawbacks. If given the token `saw`, stemming might always return `saw`, whereas lemmatization would likely return either `see` or `saw` depending on whether the use of the token was as a verb or a noun. As an example, consider the following:

In [12]:
phrase = 'I am meeting him tomorrow at the meeting'
for word in phrase.split():
    print(word+' --> '+p_stemmer.stem(word))

I --> i
am --> am
meeting --> meet
him --> him
tomorrow --> tomorrow
at --> at
the --> the
meeting --> meet


# Lemmatization
In contrast to stemming, lemmatization looks beyond word reduction, and considers a language's full vocabulary to apply a *morphological analysis* to words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence.
Lemmatization is much more informative than simple stemming, which is why spaCy offers only lemmatization instead of stemming. Lemmatization looks at the surrounding text to determine a given word's part of speech; it does not categorize phrases.

In [13]:
doc = nlp(u'I am a runner running in a race because I love to run since i ran today')
for token in doc:
    print(token.text,'\t',token.pos_,'\t',token.lemma,'\t',token.lemma_)

I 	 PRON 	 4690420944186131903 	 I
am 	 AUX 	 10382539506755952630 	 be
a 	 DET 	 11901859001352538922 	 a
runner 	 NOUN 	 12640964157389618806 	 runner
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
because 	 SCONJ 	 16950148841647037698 	 because
I 	 PRON 	 4690420944186131903 	 I
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 SCONJ 	 10066841407251338481 	 since
i 	 PRON 	 4690420944186131903 	 I
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today


# Function to show lemmas
Since the above output is ragged, we define a function that shows the lemmas in neatly aligned columns.

In [14]:
def show_lemmas(doc):
    for token in doc:
        print(f"{token.text:{12}}{token.pos_:{6}}{token.lemma:<{22}}{token.lemma_:}")
doc = nlp(u'I am a runner running in a race because I love to run since I ran today when I saw eighteen mice')
show_lemmas(doc)

I           PRON  4690420944186131903   I
am          AUX   10382539506755952630  be
a           DET   11901859001352538922  a
runner      NOUN  12640964157389618806  runner
running     VERB  12767647472892411841  run
in          ADP   3002984154512732771   in
a           DET   11901859001352538922  a
race        NOUN  8048469955494714898   race
because     SCONJ 16950148841647037698  because
I           PRON  4690420944186131903   I
love        VERB  3702023516439754181   love
to          PART  3791531372978436496   to
run         VERB  12767647472892411841  run
since       SCONJ 10066841407251338481  since
I           PRON  4690420944186131903   I
ran         VERB  12767647472892411841  run
today       NOUN  11042482332948150395  today
when        SCONJ 15807309897752499399  when
I           PRON  4690420944186131903   I
saw         VERB  11925638236994514241  see
eighteen    NUM   9609336664675087640   eighteen
mice        NOUN  1384165645700560590   mouse


# Stop Words
Stop words are words such as 'a', 'an' and 'the' that appear so frequently that they don't need to be tagged as thoroughly as nouns, verbs or modifiers. They can be filtered out of the text to be processed. spaCy holds around 326 such stop words.

In [15]:
print(nlp.Defaults.stop_words,'\n\n',len(nlp.Defaults.stop_words))

{'being', 'do', 'must', 'whence', 'whether', "n't", 'in', "'d", 'would', 'fifty', 'be', 'either', 'hence', 'can', 'just', 'before', 'itself', 'whom', 'became', 'until', 'front', 'whereupon', 'both', 'am', 'used', 'only', 'upon', 'hundred', 'make', 'therein', 'becomes', 'nor', '‘m', 'yourself', '’ve', '‘ll', 'on', 'name', 'below', 'under', '‘ve', 'it', 'ever', 'eleven', 'bottom', 'their', 'i', 'during', 'yourselves', 'anyone', 'indeed', 'thru', 'afterwards', 'meanwhile', 'has', 'anyway', 'could', 'seeming', 'its', 'off', 'move', 'into', 'n’t', 'none', 'amongst', '’re', 'mine', 'without', 'than', 'sometimes', 'whenever', 'down', 'again', 'will', 'hereafter', 'more', 'though', 'first', 'everyone', 'herein', 'neither', 'go', 'almost', 'then', 'thence', 'using', 'should', 'not', 'hers', 'ten', 'rather', 'seems', 'made', 'otherwise', 'beforehand', 'already', 'eight', 'that', 'does', 'thereby', 'three', 'whatever', 'for', 'never', 'become', 'about', 'among', 'all', "'ll", 'were', 'noone', 'un

We can check whether a word is a stop word or not by using `nlp.vocab['word'].is_stop`

We can also add a word that we feel is used too often to the default set by using `nlp.Defaults.stop_words.add('word')` followed by `nlp.vocab['word'].is_stop=True`

We can remove a stop word by using `nlp.Defaults.stop_words.remove('word')` followed by `nlp.vocab['word'].is_stop=False`

In [16]:
print(nlp.vocab['only'].is_stop,'\t',nlp.vocab['India'].is_stop)

print(len(nlp.Defaults.stop_words))
nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop = True
print('\n',len(nlp.Defaults.stop_words))

nlp.Defaults.stop_words.remove('btw')
nlp.vocab['btw'].is_stop = False
print('\n',len(nlp.Defaults.stop_words))

True 	 False
326

 327

 326
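
A common use of the `is_stop` flag is to drop stop words before further processing. The following is a minimal sketch under the assumption that a blank English pipeline is sufficient (stop words are a lexical attribute, so no trained model is required); the sentence and variable names are illustrative.

```python
import spacy

# Assumption: blank pipeline; is_stop comes from the language defaults
nlp_blank = spacy.blank("en")
doc_stop = nlp_blank("Solar power is the future of energy in India")

# Keep only the tokens that are not on the stop list
content_words = [token.text for token in doc_stop if not token.is_stop]

print(content_words)
```

Whether filtering helps depends on the task; for phrase matching or parsing the stop words usually need to stay in place.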


# Vocabulary and Matching
So far we have seen how a body of text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas. Now we will identify and label specific phrases that match patterns we define ourselves.

# Rule-based Matching 
spaCy offers a rule-based matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.


`matcher` returns a list of tuples. Each tuple contains an ID for the match, with start & end tokens that map to the span `doc[start:end]`. The `match_id` is simply the hash value of the `string_ID` 'SolarPower'

In [17]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern1 = [{'LOWER' : 'solarpower'}]
pattern2 = [{'LOWER' : 'solar'},{'LOWER' : 'power'}]
pattern3 = [{'LOWER' : 'solar'},{'IS_PUNCT' : True},{'LOWER' : 'power'}]
#pattern=[[{'LOWER' : 'solarpower'}],[{'LOWER' : 'solar'},{'LOWER' : 'power'}],[{'LOWER' : 'solar'},{'IS_PUNCT' : True},{'LOWER' : 'power'}]]

matcher.add('SolarPower', [pattern1, pattern2, pattern3])
#matcher.add('SolarPower', pattern)

doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


# Making a function for finding matches

In [18]:
def findingMatches(doc):
    found_matches = matcher(doc)
    for match_id, start, end in found_matches:
        string_id = nlp.vocab.strings[match_id] # get string representation
        span = doc[start:end]                    # get the matched span
        print(match_id, string_id, start, end, span.text)
        
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

findingMatches(doc)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


# Setting pattern options and quantifiers
You can make token rules optional by passing an `'OP':'*'` argument. This lets us streamline our patterns list. The code below finds both two-word patterns, with and without the hyphen!

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
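
As a small sketch of a quantifier other than `*`, the pattern below uses `'OP': '?'` so that at most one punctuation token may sit between the two words. It assumes a blank English pipeline (the `LOWER` and `IS_PUNCT` attributes need no trained model) and uses names such as `demo_matcher` that are specific to this example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank pipeline; only lexical attributes are used in the pattern
nlp_blank = spacy.blank("en")
demo_matcher = Matcher(nlp_blank.vocab)

# 'solar', then zero or one punctuation token, then 'power'
optional_hyphen = [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "power"}]
demo_matcher.add("SolarPower", [optional_hyphen])

doc_q = nlp_blank("Solar power and solar-power are the same thing.")

# Collect (start, end, text) triples in document order
matched = sorted((start, end, doc_q[start:end].text) for _, start, end in demo_matcher(doc_q))
print(matched)
```

Compared with `*`, the `?` quantifier is the tighter choice when you expect a single optional hyphen rather than an arbitrary run of punctuation.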


In [19]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
#patterns = [[{'LOWER': 'solarpower'}], [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]]

matcher.remove('SolarPower')

matcher.add('SolarPower', [pattern1, pattern2])
#matcher.add('SolarPower', patterns)

doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

found_matches = matcher(doc)
print(found_matches)

doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')
found_matches_doc2 = matcher(doc2)
print(found_matches_doc2)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]
[]


# Working with lemmas!
If we wanted to match on both 'solar power' and 'solar powered', it might be tempting to look for the *lemma* of 'powered' and expect it to be 'power'. This is not always the case! The lemma of the *adjective* 'powered' is still 'powered'.<br>A `LEMMA`-based pattern would therefore match an occurrence of 'Solar-powered' that the lemmatizer treats as a verb, but miss one that it considers an adjective.<br>For this case it may be better to set explicit token patterns.

In [20]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solarpowered'}]
pattern4 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'powered'}]
##patterns=[[{'LOWER': 'solarpower'}], [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}], [{'LOWER': 'solarpowered'}], [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'powered'}]]

matcher.remove('SolarPower')

matcher.add('SolarPower',[pattern1, pattern2, pattern3, pattern4])
#matcher.add('SolarPower',patterns)

doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')
found_matches_doc2 = matcher(doc2)
print(found_matches_doc2)

[(8656102463236116519, 0, 3), (8656102463236116519, 5, 8)]


# Other token attributes
Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>
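
To show a couple of these attributes in use, here is a hedged sketch that matches on `LIKE_EMAIL` and `LIKE_URL`. It assumes a blank English pipeline, since both are lexical attributes; the match names `EMAIL` and `URL` and the variable names are chosen for this example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank pipeline; LIKE_EMAIL and LIKE_URL need no trained model
nlp_blank = spacy.blank("en")
attr_matcher = Matcher(nlp_blank.vocab)

# One single-token pattern per attribute
attr_matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])
attr_matcher.add("URL", [[{"LIKE_URL": True}]])

doc_attr = nlp_blank("Email support@oursite.com or visit http://www.oursite.com today")

# Map each match name back to the text it matched
found = {
    nlp_blank.vocab.strings[match_id]: doc_attr[start:end].text
    for match_id, start, end in attr_matcher(doc_attr)
}
print(found)
```

Attributes such as `POS`, `TAG`, `DEP`, `LEMMA` and `ENT_TYPE` work the same way in a pattern, but they require a pipeline that actually sets those annotations.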

# Token wildcard
You can pass an empty dictionary `{}` as a wildcard to represent *any token*. For example, to retrieve hashtags without knowing what might follow the `#` character, we can use `[{'ORTH':'#'},{}]`.
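
The wildcard pattern can be tried out as follows. This is a sketch that assumes a blank English pipeline and an invented example sentence; `hashtag_matcher` and `hashtags` are names used only here.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank pipeline; the tokenizer splits '#' from the word after it
nlp_blank = spacy.blank("en")
hashtag_matcher = Matcher(nlp_blank.vocab)

# A literal '#' followed by any single token
hashtag_matcher.add("HASHTAG", [[{"ORTH": "#"}, {}]])

doc_tags = nlp_blank("Loving the weather today #sunny #weekend")

# Sort by start offset so the results follow document order
matches = sorted(hashtag_matcher(doc_tags), key=lambda m: m[1])
hashtags = [doc_tags[start:end].text for _, start, end in matches]
print(hashtags)
```

The wildcard matches exactly one token, so a `#` at the very end of the text would not produce a match.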

## PhraseMatcher
In the above section we used token patterns to perform rule-based matching. An alternative - and often more efficient - method is to match on terminology lists. In this case we use PhraseMatcher to create a Doc object from a list of phrases, and pass that into `matcher` instead.

In [21]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

with open('C:/nlp/textFiles/reaganomics.txt', encoding= 'unicode_escape') as f:
    doc3 = nlp(f.read())

phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

phrase_patterns = [nlp(text) for text in phrase_list]

matcher.add('VoodooEconomics', phrase_patterns)

matches = matcher(doc3)
matches

[(3473369816841043438, 41, 45),
 (3473369816841043438, 49, 53),
 (3473369816841043438, 54, 56),
 (3473369816841043438, 61, 65),
 (3473369816841043438, 673, 677),
 (3473369816841043438, 2986, 2990)]

In [22]:
matcher.remove('VoodooEconomics')
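
The cell above depends on a local text file. For a version that runs anywhere, here is a minimal self-contained sketch of the same idea. It assumes a blank English pipeline, an invented one-sentence text, and illustrative names (`phrase_matcher`, `EconTerms`); `nlp.make_doc` is used to build the pattern Docs because only tokenization is needed.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: blank pipeline; PhraseMatcher compares token text by default
nlp_blank = spacy.blank("en")
phrase_matcher = PhraseMatcher(nlp_blank.vocab)

terms = ["voodoo economics", "supply-side economics"]

# make_doc runs only the tokenizer, which is all the patterns need
phrase_matcher.add("EconTerms", [nlp_blank.make_doc(term) for term in terms])

doc_econ = nlp_blank(
    "Critics called it voodoo economics, while supporters preferred supply-side economics."
)

# Sort by start offset so the phrases appear in document order
hits = sorted(phrase_matcher(doc_econ), key=lambda m: m[1])
found_phrases = [doc_econ[start:end].text for _, start, end in hits]
print(found_phrases)
```

By default the comparison is on exact token text, so `Voodoo Economics` with capitals would not match these lowercase patterns.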

# Part of Speech Basics
The challenge of correctly identifying parts of speech is summed up nicely in the [spaCy docs](https://spacy.io/usage/linguistic-features):

Processing raw text intelligently is difficult: most words are rare, and it's common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information. That's exactly what spaCy is designed to do: you put in raw text, and get back a **Doc** object, that comes with a variety of annotations.

In this section we'll take a closer look at coarse POS tags (noun, verb, adjective) and fine-grained tags (plural noun, past-tense verb, superlative adjective).

## View token tags
Recall that you can obtain a particular token by its index position.
* To view the coarse POS tag use `token.pos_`
* To view the fine-grained tag use `token.tag_`
* To view the description of either type of tag use `spacy.explain(tag)`

Note that `token.pos` and `token.tag` return integer hash values; by adding the underscore we get the text equivalent that lives in **doc.vocab**
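
The relationship between an integer ID and its text form can be sketched with the vocabulary's string store directly. The example assumes a blank English pipeline and adds the string `NNP` by hand, since a blank pipeline has not tagged anything yet; the variable names are illustrative.

```python
import spacy

# Assumption: blank pipeline; we only need its vocabulary
nlp_blank = spacy.blank("en")
strings = nlp_blank.vocab.strings

# Adding a string returns its integer hash
tag_hash = strings.add("NNP")

print(tag_hash)            # the integer form
print(strings[tag_hash])   # look up the text from the integer
print(strings["NNP"])      # look up the integer from the text
```

This is the same lookup that `nlp.vocab.strings[match_id]` performs in the matcher examples earlier.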

In [23]:
doc = nlp(u'The quick brown fox jumped over the lazy dog')
for token in doc:
    print(token.text, '\t', token.pos_, '\t', token.tag_, '\t', spacy.explain(token.tag_))

The 	 DET 	 DT 	 determiner
quick 	 ADJ 	 JJ 	 adjective (English), other noun-modifier (Chinese)
brown 	 ADJ 	 JJ 	 adjective (English), other noun-modifier (Chinese)
fox 	 NOUN 	 NN 	 noun, singular or mass
jumped 	 VERB 	 VBD 	 verb, past tense
over 	 ADP 	 IN 	 conjunction, subordinating or preposition
the 	 DET 	 DT 	 determiner
lazy 	 ADJ 	 JJ 	 adjective (English), other noun-modifier (Chinese)
dog 	 NOUN 	 NN 	 noun, singular or mass


# Coarse-grained Part-of-speech Tags
Every token is assigned a POS Tag from the following list:


<table><tr><th>POS</th><th>DESCRIPTION</th><th>EXAMPLES</th></tr>
    
<tr><td>ADJ</td><td>adjective</td><td>*big, old, green, incomprehensible, first*</td></tr>
<tr><td>ADP</td><td>adposition</td><td>*in, to, during*</td></tr>
<tr><td>ADV</td><td>adverb</td><td>*very, tomorrow, down, where, there*</td></tr>
<tr><td>AUX</td><td>auxiliary</td><td>*is, has (done), will (do), should (do)*</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>*and, or, but*</td></tr>
<tr><td>CCONJ</td><td>coordinating conjunction</td><td>*and, or, but*</td></tr>
<tr><td>DET</td><td>determiner</td><td>*a, an, the*</td></tr>
<tr><td>INTJ</td><td>interjection</td><td>*psst, ouch, bravo, hello*</td></tr>
<tr><td>NOUN</td><td>noun</td><td>*girl, cat, tree, air, beauty*</td></tr>
<tr><td>NUM</td><td>numeral</td><td>*1, 2017, one, seventy-seven, IV, MMXIV*</td></tr>
<tr><td>PART</td><td>particle</td><td>*'s, not,*</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>*I, you, he, she, myself, themselves, somebody*</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>*Mary, John, London, NATO, HBO*</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>*., (, ), ?*</td></tr>
<tr><td>SCONJ</td><td>subordinating conjunction</td><td>*if, while, that*</td></tr>
<tr><td>SYM</td><td>symbol</td><td>*$, %, §, ©, +, −, ×, ÷, =, :), 😝*</td></tr>
<tr><td>VERB</td><td>verb</td><td>*run, runs, running, eat, ate, eating*</td></tr>
<tr><td>X</td><td>other</td><td>*sfpksdpsxmsa*</td></tr>
<tr><td>SPACE</td><td>space</td><td></td></tr>
</table>

## Fine-grained Part-of-speech Tags
Tokens are subsequently given a fine-grained tag as determined by morphology:
<table>
<tr><th>POS</th><th>Description</th><th>Fine-grained Tag</th><th>Description</th><th>Morphology</th></tr>
<tr><td>ADJ</td><td>adjective</td><td>AFX</td><td>affix</td><td>Hyph=yes</td></tr>
<tr><td>ADJ</td><td></td><td>JJ</td><td>adjective</td><td>Degree=pos</td></tr>
<tr><td>ADJ</td><td></td><td>JJR</td><td>adjective, comparative</td><td>Degree=comp</td></tr>
<tr><td>ADJ</td><td></td><td>JJS</td><td>adjective, superlative</td><td>Degree=sup</td></tr>
<tr><td>ADJ</td><td></td><td>PDT</td><td>predeterminer</td><td>AdjType=pdt PronType=prn</td></tr>
<tr><td>ADJ</td><td></td><td>PRP\$</td><td>pronoun, possessive</td><td>PronType=prs Poss=yes</td></tr>
<tr><td>ADJ</td><td></td><td>WDT</td><td>wh-determiner</td><td>PronType=int rel</td></tr>
<tr><td>ADJ</td><td></td><td>WP\$</td><td>wh-pronoun, possessive</td><td>Poss=yes PronType=int rel</td></tr>
<tr><td>ADP</td><td>adposition</td><td>IN</td><td>conjunction, subordinating or preposition</td><td></td></tr>
<tr><td>ADV</td><td>adverb</td><td>EX</td><td>existential there</td><td>AdvType=ex</td></tr>
<tr><td>ADV</td><td></td><td>RB</td><td>adverb</td><td>Degree=pos</td></tr>
<tr><td>ADV</td><td></td><td>RBR</td><td>adverb, comparative</td><td>Degree=comp</td></tr>
<tr><td>ADV</td><td></td><td>RBS</td><td>adverb, superlative</td><td>Degree=sup</td></tr>
<tr><td>ADV</td><td></td><td>WRB</td><td>wh-adverb</td><td>PronType=int rel</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>CC</td><td>conjunction, coordinating</td><td>ConjType=coor</td></tr>
<tr><td>DET</td><td>determiner</td><td>DT</td><td>determiner</td><td></td></tr>
<tr><td>INTJ</td><td>interjection</td><td>UH</td><td>interjection</td><td></td></tr>
<tr><td>NOUN</td><td>noun</td><td>NN</td><td>noun, singular or mass</td><td>Number=sing</td></tr>
<tr><td>NOUN</td><td></td><td>NNS</td><td>noun, plural</td><td>Number=plur</td></tr>
<tr><td>NOUN</td><td></td><td>WP</td><td>wh-pronoun, personal</td><td>PronType=int rel</td></tr>
<tr><td>NUM</td><td>numeral</td><td>CD</td><td>cardinal number</td><td>NumType=card</td></tr>
<tr><td>PART</td><td>particle</td><td>POS</td><td>possessive ending</td><td>Poss=yes</td></tr>
<tr><td>PART</td><td></td><td>RP</td><td>adverb, particle</td><td></td></tr>
<tr><td>PART</td><td></td><td>TO</td><td>infinitival to</td><td>PartType=inf VerbForm=inf</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>PRP</td><td>pronoun, personal</td><td>PronType=prs</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>NNP</td><td>noun, proper singular</td><td>NounType=prop Number=sing</td></tr>
<tr><td>PROPN</td><td></td><td>NNPS</td><td>noun, proper plural</td><td>NounType=prop Number=plur</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>-LRB-</td><td>left round bracket</td><td>PunctType=brck PunctSide=ini</td></tr>
<tr><td>PUNCT</td><td></td><td>-RRB-</td><td>right round bracket</td><td>PunctType=brck PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>,</td><td>punctuation mark, comma</td><td>PunctType=comm</td></tr>
<tr><td>PUNCT</td><td></td><td>:</td><td>punctuation mark, colon or ellipsis</td><td></td></tr>
<tr><td>PUNCT</td><td></td><td>.</td><td>punctuation mark, sentence closer</td><td>PunctType=peri</td></tr>
<tr><td>PUNCT</td><td></td><td>''</td><td>closing quotation mark</td><td>PunctType=quot PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>""</td><td>closing quotation mark</td><td>PunctType=quot PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>``</td><td>opening quotation mark</td><td>PunctType=quot PunctSide=ini</td></tr>
<tr><td>PUNCT</td><td></td><td>HYPH</td><td>punctuation mark, hyphen</td><td>PunctType=dash</td></tr>
<tr><td>PUNCT</td><td></td><td>LS</td><td>list item marker</td><td>NumType=ord</td></tr>
<tr><td>PUNCT</td><td></td><td>NFP</td><td>superfluous punctuation</td><td></td></tr>
<tr><td>SYM</td><td>symbol</td><td>#</td><td>symbol, number sign</td><td>SymType=numbersign</td></tr>
<tr><td>SYM</td><td></td><td>\$</td><td>symbol, currency</td><td>SymType=currency</td></tr>
<tr><td>SYM</td><td></td><td>SYM</td><td>symbol</td><td></td></tr>
<tr><td>VERB</td><td>verb</td><td>BES</td><td>auxiliary "be"</td><td></td></tr>
<tr><td>VERB</td><td></td><td>HVS</td><td>forms of "have"</td><td></td></tr>
<tr><td>VERB</td><td></td><td>MD</td><td>verb, modal auxiliary</td><td>VerbType=mod</td></tr>
<tr><td>VERB</td><td></td><td>VB</td><td>verb, base form</td><td>VerbForm=inf</td></tr>
<tr><td>VERB</td><td></td><td>VBD</td><td>verb, past tense</td><td>VerbForm=fin Tense=past</td></tr>
<tr><td>VERB</td><td></td><td>VBG</td><td>verb, gerund or present participle</td><td>VerbForm=part Tense=pres Aspect=prog</td></tr>
<tr><td>VERB</td><td></td><td>VBN</td><td>verb, past participle</td><td>VerbForm=part Tense=past Aspect=perf</td></tr>
<tr><td>VERB</td><td></td><td>VBP</td><td>verb, non-3rd person singular present</td><td>VerbForm=fin Tense=pres</td></tr>
<tr><td>VERB</td><td></td><td>VBZ</td><td>verb, 3rd person singular present</td><td>VerbForm=fin Tense=pres Number=sing Person=3</td></tr>
<tr><td>X</td><td>other</td><td>ADD</td><td>email</td><td></td></tr>
<tr><td>X</td><td></td><td>FW</td><td>foreign word</td><td>Foreign=yes</td></tr>
<tr><td>X</td><td></td><td>GW</td><td>additional word in multi-word expression</td><td></td></tr>
<tr><td>X</td><td></td><td>XX</td><td>unknown</td><td></td></tr>
<tr><td>SPACE</td><td>space</td><td>_SP</td><td>space</td><td></td></tr>
<tr><td></td><td></td><td>NIL</td><td>missing tag</td><td></td></tr>
</table>

# Working with POS Tags
In the English language, the same string of characters can have different meanings, even within the same sentence. For this reason, morphology is important. **spaCy** uses machine learning algorithms to best predict the use of a token in a sentence. Is *"I read books on NLP"* present or past tense? Is *wind* a verb or a noun?

In [24]:
doc1 = nlp(u'I read books on nlp')
doc2 = nlp(u'I read a book on nlp')
doc3 = nlp(u'I am reading a book on nlp')
print(f'{doc1[1].text:{10}}{doc1[1].pos_:{8}}{doc1[1].tag_:{6}}{spacy.explain(doc1[1].tag_)}')
print(f'{doc2[1].text:{10}}{doc2[1].pos_:{8}}{doc2[1].tag_:{6}}{spacy.explain(doc2[1].tag_)}')
print(f'{doc3[2].text:{10}}{doc3[2].pos_:{8}}{doc3[2].tag_:{6}}{spacy.explain(doc3[2].tag_)}')

read      VERB    VBP   verb, non-3rd person singular present
read      VERB    VBD   verb, past tense
reading   VERB    VBG   verb, gerund or present participle
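
The nested braces in the f-strings above, such as `{doc1[1].text:{10}}`, set a minimum field width. A small plain-Python sketch of the same formatting (the variable names are only illustrative):

```python
word, pos, tag = 'read', 'VERB', 'VBD'

# {value:{width}} pads the value to at least `width` characters;
# strings are left-aligned by default.
row = f'{word:{10}}{pos:{8}}{tag:{6}}|'
print(row)

# The inner braces allow the width to come from a variable.
width = 10
assert f'{word:{width}}' == f'{word:10}'
```

The trailing `|` is only there to make the padding after the last column visible.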


# Counting POS Tags
The `Doc.count_by()` method accepts a specific token attribute as its argument, and returns a frequency count of the given attribute as a dictionary object. Keys in the dictionary are the integer values of the given attribute ID, and values are the frequency. Counts of zero are not included.

In [25]:
doc = nlp(u'The quick brown fox jumped over the lazy dog')
pos_counts = doc.count_by(spacy.attrs.POS)
for pos,freq in sorted(pos_counts.items()):
    print(f'{pos:{4}}.\t{doc.vocab[pos].text:{5}}\t{spacy.explain(doc.vocab[pos].text):{10}}  :  {freq}')

  84.	ADJ  	adjective   :  3
  85.	ADP  	adposition  :  1
  90.	DET  	determiner  :  2
  92.	NOUN 	noun        :  2
 100.	VERB 	verb        :  1


In [26]:
doc = nlp(u'The quick brown fox jumped over the lazy dog')
tag_counts = doc.count_by(spacy.attrs.TAG)
for tag,freq in sorted(tag_counts.items()):
    print(f'{tag:{4}}.\t{doc.vocab[tag].text:{5}}\t{spacy.explain(doc.vocab[tag].text):{10}}  :  {freq}')

1292078113972184607.	IN   	conjunction, subordinating or preposition  :  1
10554686591937588953.	JJ   	adjective (English), other noun-modifier (Chinese)  :  3
15267657372422890137.	DT   	determiner  :  2
15308085513773655218.	NN   	noun, singular or mass  :  2
17109001835818727656.	VBD  	verb, past tense  :  1


Why did the ID numbers get so big?\
In spaCy, certain text values are hardcoded into `Doc.vocab` and take up the first several hundred ID numbers. Strings like 'NOUN' and 'VERB' are used frequently by internal operations. Others, like fine-grained tags, are assigned hash values as needed.
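
The two-way lookup between strings and integer IDs can be pictured with a small stand-in. This is only a sketch: `TinyStringStore` is a hypothetical class, and it derives IDs with `hashlib.blake2b`, whereas spaCy uses its own 64-bit hash, so the numbers will not match the ones printed above.

```python
import hashlib


class TinyStringStore:
    """Hypothetical stand-in for a string <-> 64-bit ID lookup table."""

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        # Derive a deterministic 64-bit integer from the string's bytes.
        digest = hashlib.blake2b(text.encode('utf8'), digest_size=8).digest()
        key = int.from_bytes(digest, 'big')
        self._by_id[key] = text
        return key

    def __getitem__(self, key):
        # Map an ID back to the string it was created from.
        return self._by_id[key]


store = TinyStringStore()
tag_id = store.add('VBD')
print(tag_id, store[tag_id])
```

In spaCy itself the lookup from ID back to string is `doc.vocab[tag].text`, as used in the cell above.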

Why don't SPACE tags appear?\
In spaCy, only strings of spaces (two or more) are assigned tokens. Single spaces are not.

In [27]:
doc = nlp(u'The quick brown fox jumped over the lazy dog')
dep_counts = doc.count_by(spacy.attrs.DEP)
for dep,freq in sorted(dep_counts.items()):
    print(f'{dep:{20}}.\t{doc.vocab[dep].text:{5}}\t{spacy.explain(doc.vocab[dep].text):{25}}  :  {freq}')

                 402.	amod 	adjectival modifier        :  3
                 415.	det  	determiner                 :  2
                 429.	nsubj	nominal subject            :  1
                 439.	pobj 	object of preposition      :  1
                 443.	prep 	prepositional modifier     :  1
 8206900633647566924.	ROOT 	root                       :  1


# Visualizing Parts of Speech
spaCy offers an outstanding visualizer called **displaCy**:

In [28]:
#from spacy import displacy
doc = nlp(u'The quick brown fox jumped over the lazy dog')
displacy.render(doc, style = 'dep', jupyter = True, options = {'distance' : 110})

In [29]:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{7}} {token.dep_:{7}} {spacy.explain(token.dep_)}')

The        DET     det     determiner
quick      ADJ     amod    adjectival modifier
brown      ADJ     amod    adjectival modifier
fox        NOUN    nsubj   nominal subject
jumped     VERB    ROOT    root
over       ADP     prep    prepositional modifier
the        DET     det     determiner
lazy       ADJ     amod    adjectival modifier
dog        NOUN    pobj    object of preposition


# Creating Visualizations Outside of Jupyter
If you're using another Python IDE or writing a script, you can choose to have spaCy serve up HTML separately.

Instead of `displacy.render()`, use `displacy.serve()`:

In [30]:
# displacy.serve(doc, style='dep', options={'distance': 110})

# Handling Large Text
`displacy.serve()` accepts a single Doc or list of Doc objects. Since large texts are difficult to view in one line, you may want to pass a list of spans instead. Each span will appear on its own line:

In [31]:
doc2 = nlp(u"This is a sentence. This is another, possibly longer sentence.")

spans = list(doc2.sents)
# displacy.serve(spans,  style = 'dep', options = {'distance' : 110})

## Customizing the Appearance
Besides setting the distance between tokens, you can pass other arguments to the `options` parameter:

<table>
<tr><th>NAME</th><th>TYPE</th><th>DESCRIPTION</th><th>DEFAULT</th></tr>
<tr><td>`compact`</td><td>bool</td><td>"Compact mode" with square arrows that takes up less space.</td><td>`False`</td></tr>
<tr><td>`color`</td><td>unicode</td><td>Text color (HEX, RGB or color names).</td><td>`#000000`</td></tr>
<tr><td>`bg`</td><td>unicode</td><td>Background color (HEX, RGB or color names).</td><td>`#ffffff`</td></tr>
<tr><td>`font`</td><td>unicode</td><td>Font name or font family for all text.</td><td>`Arial`</td></tr>
</table>

For a full list of options visit https://spacy.io/api/top-level#displacy_options

In [32]:
options = {'distance': 110, 'compact': True, 'color': 'yellow', 'bg': '#09a3d5', 'font': 'Times'}

# displacy.serve(doc, style='dep', options=options)

Great! Now you should be familiar with visualizing spaCy's dependency parse. For more info on **displaCy** visit https://spacy.io/usage/visualizers

# Named Entity Recognition (NER)
spaCy has an NER pipeline component that identifies token spans fitting a predetermined set of named entities. These are available as the `ents` property of a `Doc` object.

In [33]:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(f'{ent.text:{20}}-{ent.label_:{20}}-{str(spacy.explain(ent.label_))}')
    else:
        print('No Named Entity Found')
doc = nlp(u'May I go to Washington D.C. next May to see the Washington Monument?')
show_ents(doc)

Washington D.C.     -GPE                 -Countries, cities, states
next May            -DATE                -Absolute or relative dates or periods
the Washington Monument-ORG                 -Companies, agencies, institutions, etc.


# NER Tags
Tags are accessible through the `.label_` property of an entity.
<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including "%".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>"first", "second", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>

## Adding a Named Entity to a Span
Normally we would have spaCy build a library of named entities by training it on several samples of text.<br>In this case, we only want to add one value:

In [34]:
doc = nlp(u'Tesla to build a U.K. factory for $6 million')
show_ents(doc)
print()
from spacy.tokens import Span
ORG = doc.vocab.strings[u'ORG']
new_ent = Span(doc, 0, 1, label = ORG)
doc.ents = list(doc.ents) +[new_ent]
show_ents(doc)

U.K.                -GPE                 -Countries, cities, states
$6 million          -MONEY               -Monetary values, including unit

Tesla               -ORG                 -Companies, agencies, institutions, etc.
U.K.                -GPE                 -Countries, cities, states
$6 million          -MONEY               -Monetary values, including unit


In the code above, the arguments passed to `Span()` are:
-  `doc` - the name of the Doc object
-  `0` - the *start* index position of the span
-  `1` - the *stop* index position (exclusive)
-  `label=ORG` - the label assigned to our entity

## Adding Named Entities to All Matching Spans
What if we want to tag *all* occurrences of a term, not just one? In this section we show how to use the PhraseMatcher to identify a series of spans in the Doc:

In [35]:
doc = nlp(u'Our company plans to introduce a new vacuum cleaner. '
         u'If successful, the vacuum cleaner will be our first product')
show_ents(doc)
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]
matcher.add('newproduct', phrase_patterns)
matches = matcher(doc)
print('\n',matches,'\n')

# from spacy.tokens import Span
PROD = doc.vocab.strings[u'PRODUCT']
new_ents = [Span(doc, match[1], match[2], label = PROD) for match in matches]
doc.ents = list(doc.ents) + new_ents
show_ents(doc)

first               -ORDINAL             -"first", "second", etc.

 [(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)] 

vacuum cleaner      -PRODUCT             -Objects, vehicles, foods, etc. (not services)
vacuum cleaner      -PRODUCT             -Objects, vehicles, foods, etc. (not services)
first               -ORDINAL             -"first", "second", etc.


In [36]:
matcher.remove('newproduct')

## Counting Entities
While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:

In [37]:
doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')
show_ents(doc)
#print()
len([ent for ent in doc.ents if ent.label_ == 'MONEY'])

29.50               -MONEY               -Monetary values, including unit
five dollars        -MONEY               -Monetary values, including unit


2
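
To count every label at once rather than one at a time, `collections.Counter` is one option. The sketch below uses a hand-written list of `(text, label)` pairs as a stand-in for `doc.ents`, so it runs without a model; with a real Doc the same idea would be `Counter(ent.label_ for ent in doc.ents)`.

```python
from collections import Counter

# Stand-in for doc.ents: (entity text, entity label) pairs copied from the output above.
ents = [('29.50', 'MONEY'), ('five dollars', 'MONEY')]

label_counts = Counter(label for _, label in ents)
print(label_counts)

# Labels that never occur simply count as 0.
print(label_counts['MONEY'], label_counts['DATE'])
```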

# Problem with line breaks
Sometimes line breaks are interpreted as `GPE` entities.\
This can be fixed by adding a custom pipeline component that removes whitespace-only entities.


In [38]:
from spacy.language import Language
@Language.component("remove_whitespace_entities")
def remove_whitespace_entities(doc):
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc
nlp.add_pipe("remove_whitespace_entities", after = 'ner')
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'remove_whitespace_entities']
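
The filter inside the component depends only on `str.isspace()`, which is true when a string is non-empty and made up entirely of whitespace. A quick check on plain strings (the sample values are made up for illustration):

```python
candidates = ['\n', ' \n\t', 'Paris', 'New\nYork', '']

# Keep everything that is not purely whitespace, as the component does with e.text.
kept = [text for text in candidates if not text.isspace()]
print(kept)
```

A value that merely contains a line break, such as the fourth one, is kept; only whitespace-only strings are dropped.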


# Removing a Pipeline Component
Similarly, to remove a component from the pipeline we use the function `nlp.remove_pipe("<nameOfThePipeline>")`.

To disable a component at loading we use `nlp = spacy.load("en_core_web_sm", disable=["ner"])`. A disabled component is still loaded, but it is not run until it is re-enabled.

To exclude a component at loading we use `nlp = spacy.load("en_core_web_sm", exclude=["ner"])`. An excluded component is not loaded at all.

In [39]:
nlp.remove_pipe('remove_whitespace_entities')
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


# Noun Chunks
`doc.noun_chunks` are basic noun phrases: token spans that include the noun and the words describing it. Noun chunks cannot be nested, cannot overlap, and do not involve prepositional phrases or relative clauses.\
Where `doc.ents` relies on the *ner* pipeline component, `doc.noun_chunks` is provided by the *parser*.

# `noun_chunks` components:
<table>
<tr><td>`.text`</td><td>The original noun chunk text.</td></tr>
<tr><td>`.root.text`</td><td>The original text of the word connecting the noun chunk to the rest of the parse.</td></tr>
<tr><td>`.root.dep_`</td><td>Dependency relation connecting the root to its head.</td></tr>
<tr><td>`.root.head.text`</td><td>The text of the root token's head.</td></tr>
</table>

In [40]:
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc.noun_chunks:
    print(f'{chunk.text:{25}}{chunk.root.text:{15}}{chunk.root.dep_:{10}}{chunk.root.head.text}')

Autonomous cars          cars           nsubj     shift
insurance liability      liability      dobj      shift
manufacturers            manufacturers  pobj      toward


# `Doc.noun_chunks` is a generator
Previously we mentioned that `Doc` objects do not retain a list of sentences, but they're available through the `Doc.sents` generator.<br>It's the same with `Doc.noun_chunks` - lists can be created if needed:

In [57]:
#len(doc.noun_chunks)

TypeError: object of type 'generator' has no len()

In [42]:
len(list(doc.noun_chunks))

3
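
If only the count is needed, the chunks can also be counted without building a list. A minimal sketch with a hypothetical `FakeDoc` class whose `noun_chunks` property returns a fresh generator on each access, which is the behaviour the cells above rely on:

```python
class FakeDoc:
    """Hypothetical stand-in: exposes noun chunks through a generator property."""

    def __init__(self, chunks):
        self._chunks = chunks

    @property
    def noun_chunks(self):
        # Each access builds a new generator, so it can be iterated again.
        return (chunk for chunk in self._chunks)


fake_doc = FakeDoc(['Autonomous cars', 'insurance liability', 'manufacturers'])

# Count the items without storing them.
n_chunks = sum(1 for _ in fake_doc.noun_chunks)

# A second access starts again from the beginning.
first_chunk = next(fake_doc.noun_chunks)
print(n_chunks, first_chunk)
```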

# Visualizing named entities
In addition to visualizing dependencies with `style='dep'`, **displaCy** offers a `style='ent'` visualizer for named entities:

In [43]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
         u'By contrast, Sony sold only 7 thousand Walkman music players. ')
displacy.render(doc, style = 'ent', jupyter = True)

# Viewing Sentences Line by Line
Unlike the displaCy dependency parse, the NER viewer has to take in a Doc object with an `ents` attribute. For this reason, we can't just pass a list of spans to `.render()`; we have to create a new Doc from each `span.text`:

In [44]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
         u'By contrast, the vendor near my house sold a lot of lemonade.')
for sent in doc.sents:
    displacy.render(nlp(sent.text), style = 'ent', jupyter = True)




If a span does not contain any entities, displaCy will issue a harmless warning. This can be avoided with an additional bit of code:

In [45]:
for sent in doc.sents:
    docx = nlp(sent.text)
    if docx.ents:
        displacy.render(docx, style='ent', jupyter=True)
        print('\n')
    else:
        print(docx.text)



By contrast, the vendor near my house sold a lot of lemonade.


# Viewing Specific Entities 
You can pass a list of entity types to restrict the visualization:

In [46]:
opt = {'ents' : ['ORG', 'PRODUCT']}
displacy.render(doc, style = 'ent', jupyter = True, options = opt)

# Customizing colors and effects

In [47]:
colors = {'ORG': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 'PRODUCT': 'radial-gradient(yellow, green)'}
options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}
displacy.render(doc, style = 'ent', jupyter = True, options = options)

# Creating Visualizations Outside of Jupyter


In [48]:
# displacy.serve(doc, style = 'ent', options=options)

# Sentence Segmentation
In this section we'll look at sentence segmentation and set our own segmentation rules.

In [49]:
doc = nlp(u'This is the first sentence. This is the second one. This is the last sentence')
for sent in doc.sents:
    print(sent)

This is the first sentence.
This is the second one.
This is the last sentence


# `Doc.sents` is a generator
It is important to note that `doc.sents` is a *generator*. That is, a Doc is not segmented until `doc.sents` is called. This means that, where you could print the second Doc token with `print(doc[1])`, you can't call the "second Doc sentence" with `print(doc.sents[1])`:

In [56]:
print(doc[1])
#print(doc.sents[1])

Management


TypeError: 'generator' object is not subscriptable

In [51]:
doc_sents = [sent for sent in doc.sents]
print(f'{doc_sents[0]}\n{doc_sents[1]}\n{doc_sents[2]}')

This is the first sentence.
This is the second one.
This is the last sentence
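
When only one sentence is needed, `itertools.islice` can pull it from the generator without building the whole list. The sketch uses a plain generator function, `fake_sents`, as a hypothetical stand-in for `doc.sents`; the helper `nth` is also not part of spaCy:

```python
from itertools import islice


def fake_sents():
    # Stand-in for doc.sents: yields one sentence at a time.
    yield 'This is the first sentence.'
    yield 'This is the second one.'
    yield 'This is the last sentence'


def nth(iterable, n, default=None):
    """Return item n (0-based) of an iterable, or default if it is too short."""
    return next(islice(iterable, n, None), default)


second = nth(fake_sents(), 1)
missing = nth(fake_sents(), 10)
print(second, missing)
```

With a real Doc, `nth(doc.sents, 1)` would return the second sentence as a Span.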


# `sents` are Spans
At first glance it looks like each `sent` contains text from the original Doc object. In fact they're just Spans with start and end token pointers:

In [52]:
type(doc_sents[1])

spacy.tokens.span.Span

In [53]:
print(doc_sents[1].start, doc_sents[1].end)

6 12
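
The idea of a span as a pair of pointers into its parent can be sketched in a few lines. `SpanView` is a hypothetical class, not spaCy's `Span`, and the token list is typed by hand to match the sentences above:

```python
from dataclasses import dataclass


@dataclass
class SpanView:
    """Hypothetical view over a token list: stores positions, not copies of the text."""
    tokens: list
    start: int   # index of the first token (inclusive)
    end: int     # index one past the last token (exclusive)

    def __len__(self):
        return self.end - self.start

    @property
    def text(self):
        # The text is read from the parent list on demand.
        return ' '.join(self.tokens[self.start:self.end])


tokens = ['This', 'is', 'the', 'first', 'sentence', '.',
          'This', 'is', 'the', 'second', 'one', '.',
          'This', 'is', 'the', 'last', 'sentence']

second_sent = SpanView(tokens, 6, 12)
print(second_sent.start, second_sent.end, len(second_sent), second_sent.text)
```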


# Adding Rules
By default, spaCy's trained pipelines use the dependency parse to determine sentence boundaries. We can add rules of our own, but they have to run *before* the parser, and therefore before the Doc object is created, because the sentence start tokens are fixed once the Doc has been parsed:

In [54]:
doc2 = nlp(u'This is a sentence. This is a sentence. This is a sentence.')

for token in doc2:
    print(token.is_sent_start, ' '+token.text)

True  This
False  is
False  a
False  sentence
False  .
True  This
False  is
False  a
False  sentence
False  .
True  This
False  is
False  a
False  sentence
False  .


In [55]:
doc = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')
for sent in doc.sents:
    print(sent)

"Management is doing things right; leadership is doing the right things."
-Peter Drucker


Now let's add a new rule to form sentences

In [70]:
@Language.component('set_custom_boundaries')
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc
nlp.add_pipe('set_custom_boundaries', before = 'parser')
print(nlp.pipe_names,'\n')

for sent in doc.sents:
    print(sent)
print()
doc1 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')   
for sent in doc1.sents:
    print(sent)
    
nlp.remove_pipe('set_custom_boundaries')
print('\n',nlp.pipe_names)

['tok2vec', 'tagger', 'set_custom_boundaries', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'] 

"Management is doing things right; leadership is doing the right things."
-Peter Drucker

"Management is doing things right;
leadership is doing the right things."
-Peter Drucker

 ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


Doc objects created before the component was added are not affected.

# Why not change the token directly
We don't assign `True` to `.is_sent_start` directly because spaCy refuses to change the value after the document has been parsed, to prevent inconsistencies in the data.


In [72]:
print(doc[7])
#doc[7].is_sent_start = True

leadership


ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

# Changing the rules
In some cases we want to replace spaCy's default sentence segmentation with our own set of rules.\
The cell below first shows the default behaviour on a string that contains line breaks.

In [73]:
mystring = u'This is a sentence. This is another.\n\nthus is the \nthird sentence.'
doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])

['This', 'is', 'a', 'sentence', '.']
['This', 'is', 'another', '.', '\n\n']
['thus', 'is', 'the', '\n', 'third', 'sentence', '.']


In [80]:
nlp.remove_pipe('split_on_newlist')

('split_on_newlist', <function __main__.split_on_newlist(doc)>)

In [81]:
@Language.component('split_on_newlist')
def split_on_newlist(doc):
    # Decide every boundary here so the parser, which runs next, keeps them:
    # a sentence starts at the first token and after any token that begins with a newline.
    seen_newline = False
    for word in doc:
        word.is_sent_start = (word.i == 0) or seen_newline
        seen_newline = word.text.startswith('\n')
    return doc  # a component must return the Doc

nlp.add_pipe('split_on_newlist', before = 'parser')
doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])
nlp.remove_pipe('split_on_newlist')