<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#spaCy-basics" data-toc-modified-id="spaCy-basics-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>spaCy basics</a></span><ul class="toc-item"><li><span><a href="#Installation-and-setup" data-toc-modified-id="Installation-and-setup-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Installation and setup</a></span><ul class="toc-item"><li><span><a href="#From-the-command-line-or-terminal" data-toc-modified-id="From-the-command-line-or-terminal-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>From the command line or terminal</a></span></li><li><span><a href="#Alternatively,-create-a-virtual-environment" data-toc-modified-id="Alternatively,-create-a-virtual-environment-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Alternatively, create a virtual environment</a></span></li><li><span><a href="#Next,-download-the-specific-model-of-language" data-toc-modified-id="Next,-download-the-specific-model-of-language-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Next, download the specific model of language</a></span></li></ul></li><li><span><a href="#Working-with-spaCy-in-Python" data-toc-modified-id="Working-with-spaCy-in-Python-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Working with spaCy in Python</a></span></li><li><span><a href="#spaCy-objects" data-toc-modified-id="spaCy-objects-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>spaCy objects</a></span><ul class="toc-item"><li><span><a href="#Pipeline" data-toc-modified-id="Pipeline-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Pipeline</a></span></li><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Tokenization</a></span></li><li><span><a href="#Part-of-speech-tagging-(POS)" data-toc-modified-id="Part-of-speech-tagging-(POS)-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span>Part-of-speech tagging 
(POS)</a></span></li><li><span><a href="#Dependencies" data-toc-modified-id="Dependencies-1.3.4"><span class="toc-item-num">1.3.4&nbsp;&nbsp;</span>Dependencies</a></span></li><li><span><a href="#Additional-token-attributes" data-toc-modified-id="Additional-token-attributes-1.3.5"><span class="toc-item-num">1.3.5&nbsp;&nbsp;</span>Additional token attributes</a></span></li><li><span><a href="#Spans" data-toc-modified-id="Spans-1.3.6"><span class="toc-item-num">1.3.6&nbsp;&nbsp;</span>Spans</a></span></li><li><span><a href="#Sentences" data-toc-modified-id="Sentences-1.3.7"><span class="toc-item-num">1.3.7&nbsp;&nbsp;</span>Sentences</a></span></li></ul></li></ul></li><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Tokenization</a></span><ul class="toc-item"><li><span><a href="#Prefixes,-suffixes,-and-infixes" data-toc-modified-id="Prefixes,-suffixes,-and-infixes-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Prefixes, suffixes, and infixes</a></span></li><li><span><a href="#Exceptions" data-toc-modified-id="Exceptions-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Exceptions</a></span></li><li><span><a href="#Counting-tokens" data-toc-modified-id="Counting-tokens-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Counting tokens</a></span></li><li><span><a href="#Counting-vocab-entries" data-toc-modified-id="Counting-vocab-entries-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Counting vocab entries</a></span></li><li><span><a href="#Tokens-can-be-retrieved-by-index-position-and-slice" data-toc-modified-id="Tokens-can-be-retrieved-by-index-position-and-slice-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Tokens can be retrieved by index position and slice</a></span></li><li><span><a href="#Tokens-cannot-be-reassigned" data-toc-modified-id="Tokens-cannot-be-reassigned-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Tokens cannot be 
reassigned</a></span></li><li><span><a href="#Named-entities" data-toc-modified-id="Named-entities-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Named entities</a></span></li><li><span><a href="#Noun-chunks" data-toc-modified-id="Noun-chunks-2.8"><span class="toc-item-num">2.8&nbsp;&nbsp;</span>Noun chunks</a></span></li></ul></li></ul></div>

# spaCy basics

* For more info, visit: https://spacy.io

## Installation and setup

* For more info, visit: https://spacy.io/usage

### From the command line or terminal

> `conda install -c conda-forge spacy`
> 
> or
> 
> `pip install -U spacy`

### Alternatively, create a virtual environment

> `conda create -n spacyenv python spacy`

### Next, download the specific model of language

> `python -m spacy download en_core_web_sm`

## Working with spaCy in Python

In [1]:
# Import spaCy and load the language library
import spacy

nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc_1 = nlp(u'Tesla is looking at buying a U.S. startup for $6 million.')

row_format = "{:>10}" * 2
# Print each token with its integer part-of-speech ID (token.pos)
for token in doc_1:
    print(row_format.format(token.text, token.pos))

     Tesla        96
        is        87
   looking       100
        at        85
    buying       100
         a        90
      U.S.        96
   startup        92
       for        85
         $        99
         6        93
   million        93
         .        97


In [2]:
row_format = "{:>10}" * 3
for token in doc_1:
    print(row_format.format(token.text, token.pos_, token.dep_))

     Tesla     PROPN     nsubj
        is       AUX       aux
   looking      VERB      ROOT
        at       ADP      prep
    buying      VERB     pcomp
         a       DET       det
      U.S.     PROPN  compound
   startup      NOUN      dobj
       for       ADP      prep
         $       SYM  quantmod
         6       NUM  compound
   million       NUM      pobj
         .     PUNCT     punct
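
The `row_format` string used in the cells above is plain Python string formatting, not anything specific to spaCy. As a small standalone illustration (the sample values are copied from the first output row above), repeating `"{:>10}"` produces one right-aligned, 10-character column per value:

```python
# Build a three-column format string by repeating one right-aligned field
row_format = "{:>10}" * 3
print(repr(row_format))

# Each value is padded on the left to a width of 10 characters
line = row_format.format("Tesla", "PROPN", "nsubj")
print(line)
print(len(line))
```

Changing the multiplier changes the number of columns, which is why the first cell uses `* 2` and the later cells use `* 3`.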


## spaCy objects

### Pipeline

* Image source: https://spacy.io/usage/spacy-101#pipelines

![Pipeline](../Figures/1.%20Pipeline.png)

In [3]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fe1b80dbad0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7fe1b80ef170>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fe1b80ca130>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fe1b80ca280>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7fe1b79f6050>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fe1b8135a50>)]

In [4]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

### Tokenization

In [5]:
doc_2 = nlp(u"Tesla isn't looking into startups anymore.")

row_format = "{:>10}" * 3
for token in doc_2:
    print(row_format.format(token.text, token.pos_, token.dep_))

     Tesla     PROPN     nsubj
        is       AUX       aux
       n't      PART       neg
   looking      VERB      ROOT
      into       ADP      prep
  startups      NOUN      pobj
   anymore       ADV    advmod
         .     PUNCT     punct


In [6]:
doc_2 = nlp(u"Tesla isn't   looking into startups anymore.")

row_format = "{:>10}" * 3
for token in doc_2:
    print(row_format.format(token.text, token.pos_, token.dep_))

     Tesla     PROPN     nsubj
        is       AUX       aux
       n't      PART       neg
               SPACE     nsubj
   looking      VERB      ROOT
      into       ADP      prep
  startups      NOUN      pobj
   anymore       ADV    advmod
         .     PUNCT     punct


In [7]:
doc_2

Tesla isn't   looking into startups anymore.

In [8]:
type(doc_2)

spacy.tokens.doc.Doc

In [9]:
doc_2[0]

Tesla

In [10]:
doc_2[0].text

'Tesla'

### Part-of-speech tagging (POS)

* For more info, visit: https://spacy.io/usage/linguistic-features#pos-tagging

In [11]:
doc_2[0].pos_

'PROPN'

In [12]:
spacy.explain('PROPN')

'proper noun'

### Dependencies

* For more info, visit: https://spacy.io/usage/linguistic-features#dependency-parse

* The [Stanford typed dependencies manual](https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf) gives a good explanation of typed dependencies.

In [13]:
doc_2[0].dep_

'nsubj'

In [14]:
spacy.explain('nsubj')

'nominal subject'

### Additional token attributes

![Additional token attributes](../Figures/2.%20Additional%20token%20attributes.png)

In [15]:
# Lemmas (the base form of the word)
print(doc_2[0].text)
print(doc_2[0].lemma_)

Tesla
Tesla


In [16]:
print(doc_2[4].text)
print(doc_2[4].lemma_)

looking
look


In [17]:
# Simple parts-of-speech & detailed tags
print(doc_2[0].pos_ + ' / ' + spacy.explain(doc_2[0].pos_))
print(doc_2[0].tag_ + ' / ' + spacy.explain(doc_2[0].tag_))

PROPN / proper noun
NNP / noun, proper singular


In [18]:
print(doc_2[4].pos_ + ' / ' + spacy.explain(doc_2[4].pos_))
print(doc_2[4].tag_ + '  / ' + spacy.explain(doc_2[4].tag_))

VERB / verb
VBG  / verb, gerund or present participle


In [19]:
# Word shapes
print(doc_2[0].text + ': ' + doc_2[0].shape_)
print(doc_1[6].text + ' : ' + doc_1[6].shape_)

Tesla: Xxxxx
U.S. : X.X.


In [20]:
# Boolean values
print(doc_2[0].is_alpha)
print(doc_2[0].is_stop)

True
False
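
According to the spaCy documentation, the `shape_` attribute used earlier in this section replaces alphabetic characters with `x` or `X` and digits with `d`, and truncates runs of the same character after length 4. The pure-Python function below is a rough sketch of that rule for illustration only. `word_shape_sketch` is a hypothetical helper, not part of spaCy, and it may differ from the library on edge cases.

```python
def word_shape_sketch(text: str) -> str:
    """Approximate a token's orthographic shape (illustrative, not spaCy's code)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        # Track the length of the current run of identical shape characters
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Keep at most four characters from any single run
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Tesla", "U.S.", "10.30", "startups"]:
    print(word, "->", word_shape_sketch(word))
```

Because long runs are truncated, many different words collapse to the same shape, which is what makes the attribute useful as a coarse feature.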


### Spans

In [21]:
doc_3 = nlp(
    u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
    the phrase "Life is what happens to us while we are making other plans" \
    was written by cartoonist Allen Saunders and published in Reader\'s Digest \
    in 1957 when Lennon was 17.'
)

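# Tokens 17-30 hold the quotation, including both quotation marks. The indented
# continuation lines above add whitespace tokens that count toward these indices.
life_quote = doc_3[17:31]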
print(life_quote)

"Life is what happens to us while we are making other plans"


In [22]:
type(life_quote)

spacy.tokens.span.Span

In [23]:
type(doc_3)

spacy.tokens.doc.Doc

### Sentences

In [24]:
doc_4 = nlp(
    u'This is the first sentence. This is another sentence. This is the last sentence.'
)

for sent in doc_4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [25]:
doc_4[6]

This

In [26]:
doc_4[6].is_sent_start

True

In [27]:
doc_4[8]

another

In [28]:
doc_4[8].is_sent_start

False
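
In `en_core_web_sm`, sentence boundaries are by default derived from the dependency parse, so `is_sent_start` reflects a model prediction. As a rough intuition only, the sketch below swaps in a much simpler punctuation rule: a token starts a sentence if it is the first token or follows a sentence-final punctuation mark. `sentence_start_flags` is a hypothetical helper, and the token list is written by hand to mirror `doc_4`.

```python
from typing import List, Sequence


def sentence_start_flags(tokens: Sequence[str],
                         enders: Sequence[str] = (".", "!", "?")) -> List[bool]:
    """Flag tokens that open a sentence under a naive punctuation rule."""
    flags = []
    at_start = True  # the first token always opens a sentence
    for tok in tokens:
        flags.append(at_start)
        # The next token opens a sentence only if this one ends a sentence
        at_start = tok in enders
    return flags


# Hand-written tokens mirroring doc_4 (assumption: one token per word or period)
tokens = ("This is the first sentence . This is another sentence . "
          "This is the last sentence .").split()
flags = sentence_start_flags(tokens)
start_indices = [i for i, flag in enumerate(flags) if flag]
print(start_indices)
print(tokens[6], flags[6])
print(tokens[8], flags[8])
```

A rule like this breaks on abbreviations such as `U.S.` when they are not kept as single tokens, which is one reason a parser-based approach is used instead.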

# Tokenization

* Image source: https://spacy.io/usage/spacy-101#annotations-token

![Tokenization](../Figures/3.%20Tokenization.png)

In [29]:
# Import spaCy and load the language library
import spacy

nlp = spacy.load('en_core_web_sm')

In [30]:
# Create a string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


In [31]:
# Create a Doc object and explore tokens
doc_1 = nlp(mystring)

for token in doc_1:
    print(token.text)

"
We
're
moving
to
L.A.
!
"


In [32]:
for token in doc_1:
    print(token.text, end=' | ')

" | We | 're | moving | to | L.A. | ! | " | 

## Prefixes, suffixes, and infixes

In [33]:
doc_2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc_2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


In [34]:
doc_3 = nlp(u'A 5km NYC cab ride costs $10.30.')

for t in doc_3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30
.


## Exceptions

In [35]:
doc_4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc_4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.
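
spaCy's documentation describes tokenization as splitting the text on whitespace, then repeatedly checking each chunk against special-case exceptions while peeling off prefixes and suffixes, and finally splitting what remains on infixes. The following is a heavily simplified sketch of that loop in plain Python. The rule sets (`EXCEPTIONS`, `PREFIXES`, `SUFFIXES`, `INFIXES`) and the function `tokenize_sketch` are hypothetical stand-ins chosen to cover a few sentences from this section; spaCy's real rules are far larger and language-specific.

```python
import re

# Hypothetical, heavily reduced rule sets (assumptions for illustration only)
EXCEPTIONS = {
    "U.S.": ["U.S."], "St.": ["St."], "L.A.": ["L.A."],
    "Let's": ["Let", "'s"], "We're": ["We", "'re"],
}
PREFIXES = re.compile(r'^["$(]')
SUFFIXES = re.compile(r'(["!?,.)]|(?<=[0-9])km)$')
INFIXES = re.compile(r"(?<=[A-Za-z])-(?=[A-Za-z])")


def tokenize_sketch(text):
    tokens = []
    for chunk in text.split():  # step 1: split on whitespace
        prefixes, suffixes = [], []
        while chunk and chunk not in EXCEPTIONS:
            match = PREFIXES.search(chunk)
            if match:  # step 2: peel one prefix and re-check
                prefixes.append(match.group())
                chunk = chunk[match.end():]
                continue
            match = SUFFIXES.search(chunk)
            if match:  # step 3: peel one suffix and re-check
                suffixes.insert(0, match.group())
                chunk = chunk[:match.start()]
                continue
            break
        tokens.extend(prefixes)
        if chunk in EXCEPTIONS:  # step 4: exceptions are emitted as listed
            tokens.extend(EXCEPTIONS[chunk])
        elif chunk:  # step 5: split the remainder on infixes
            pos = 0
            for match in INFIXES.finditer(chunk):
                tokens.extend([chunk[pos:match.start()], match.group()])
                pos = match.end()
            tokens.append(chunk[pos:])
        tokens.extend(suffixes)
    return tokens


print(tokenize_sketch("Let's visit St. Louis in the U.S. next year."))
```

With these toy rules the sketch yields the same token sequences as the spaCy output shown in this section for the `"We're moving to L.A.!"`, `A 5km NYC cab ride costs $10.30.`, and `Let's visit St. Louis ...` examples. Removing `"U.S."` from `EXCEPTIONS` would let the suffix rule strip its final period, which is the problem exceptions exist to prevent.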


## Counting tokens

In [36]:
len(doc_4)

11

## Counting vocab entries

In [37]:
doc_4.vocab

<spacy.vocab.Vocab at 0x7fe1b7eaae10>

In [38]:
len(doc_4.vocab)

791

## Tokens can be retrieved by index position and slice

In [39]:
doc_5 = nlp(u'It is better to give than to receive.')

# Retrieve the third token
doc_5[2]

better

In [40]:
# Retrieve three tokens from the middle
doc_5[2:5]

better to give

In [41]:
# Retrieve the last four tokens
doc_5[-4:]

than to receive.

## Tokens cannot be reassigned

In [42]:
try:
    doc_5[2] = 'worse'
except TypeError:
    print('TypeError')

TypeError


In [43]:
doc_6 = nlp(u'My dinner was horrible.')
doc_7 = nlp(u'Your dinner was delicious.')

In [44]:
# Try to change "My dinner was horrible." to "My dinner was delicious."
try:
    doc_6[3] = doc_7[3]
except TypeError:
    print('TypeError')

TypeError


## Named entities

In [45]:
doc_8 = nlp(u'Apple to build a Hong Kong factory for $6 million.')

for token in doc_8:
    print(token.text, end=' | ')

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | . | 

In [46]:
len(doc_8.ents)

3

In [47]:
for entity in doc_8.ents:
    print(entity)

Apple
Hong Kong
$6 million


In [48]:
for entity in doc_8.ents:
    print(entity)
    print(entity.label_)
    print(spacy.explain(entity.label_))
    print('\n')

Apple
ORG
Companies, agencies, institutions, etc.


Hong Kong
GPE
Countries, cities, states


$6 million
MONEY
Monetary values, including unit




In [49]:
for entity in doc_8.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + spacy.explain(entity.label_))

Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit
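
Each entity in `doc.ents` is a span over one or more tokens, and spaCy also exposes the annotation per token through `token.ent_iob_` (`B` for the first token of an entity, `I` for a continuation, `O` for a token outside any entity) and `token.ent_type_`. The sketch below shows how token-level tags can be grouped back into entity strings. The rows are written by hand to mirror the entities printed above rather than produced by the model, and `group_entities` is a hypothetical helper.

```python
# Hand-written (text, trailing whitespace, IOB tag, entity type) rows. In spaCy
# these values would come from token.text, token.whitespace_, token.ent_iob_
# and token.ent_type_; here they are assumptions that mirror the output above.
ROWS = [
    ("Apple", " ", "B", "ORG"),
    ("to", " ", "O", ""),
    ("build", " ", "O", ""),
    ("a", " ", "O", ""),
    ("Hong", " ", "B", "GPE"),
    ("Kong", " ", "I", "GPE"),
    ("factory", " ", "O", ""),
    ("for", " ", "O", ""),
    ("$", "", "B", "MONEY"),
    ("6", " ", "I", "MONEY"),
    ("million", "", "I", "MONEY"),
    (".", "", "O", ""),
]


def group_entities(rows):
    """Merge consecutive B/I-tagged tokens into (entity text, label) pairs."""
    entities = []
    current_text, current_label = "", None
    for text, space, iob, label in rows:
        if iob == "I" and current_label is not None:
            current_text += text + space  # continue the open entity
            continue
        if current_label is not None:  # a B or O tag closes the open entity
            entities.append((current_text.strip(), current_label))
            current_text, current_label = "", None
        if iob == "B":  # a B tag opens a new entity
            current_text, current_label = text + space, label
    if current_label is not None:  # flush an entity that ends the document
        entities.append((current_text.strip(), current_label))
    return entities


entities = group_entities(ROWS)
for text, label in entities:
    print(text + ' - ' + label)
```

Carrying the trailing whitespace along with each token is what lets `$`, `6`, and `million` rejoin as `$6 million` rather than `$ 6 million`.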


## Noun chunks

In [50]:
doc_9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc_9.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [51]:
doc_10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc_10.noun_chunks:
    print(chunk.text)

Red cars
higher insurance rates


In [52]:
doc_11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc_11.noun_chunks:
    print(chunk.text)

He
purple people-eater
