
# spaCy basics

* For more info, visit: https://spacy.io

## Installation and setup

* For more info, visit: https://spacy.io/usage

### From the command line or terminal

> `conda install -c conda-forge spacy`
> 
> or
> 
> `pip install -U spacy`

### Alternatively, create a virtual environment

> `conda create -n spacyenv python spacy`

### Next, download the specific model of language

> `python -m spacy download en_core_web_sm`

## Working with spaCy in Python

In [4]:
# Import spaCy and load the language library
import spacy

In [16]:
nlp = spacy.load('en_core_web_sm')

In [35]:
# Create a Doc object
doc_1 = nlp(u'Tesla is looking at buying a U.S. startup for $6 million.')

In [21]:
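row_format = "{:>10}" * 2
# Print each token with its part-of-speech ID. Note that token.pos is an
# integer ID, while token.pos_ (with the underscore) is the readable label.
for token in doc_1:
    print(row_format.format(token.text, token.pos))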

     Tesla        96
        is        87
   looking       100
        at        85
    buying       100
         a        90
      U.S.        96
   startup        92
       for        85
         $        99
         6        93
   million        93
         .        97


In [22]:
row_format = "{:>10}" * 3
for token in doc_1:
    print(row_format.format(token.text, token.pos_, token.dep_))

     Tesla     PROPN     nsubj
        is       AUX       aux
   looking      VERB      ROOT
        at       ADP      prep
    buying      VERB     pcomp
         a       DET       det
      U.S.     PROPN  compound
   startup      NOUN      dobj
       for       ADP      prep
         $       SYM  quantmod
         6       NUM  compound
   million       NUM      pobj
         .     PUNCT     punct


## spaCy objects

### Pipeline

* Image source: https://spacy.io/usage/processing-pipelines

![Pipeline](../Figures/1.%20Pipeline.png)

In [53]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fa7b00822f0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7fa7b00af650>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fa7aa999ad0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fa7aa999750>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7fa7c5e368c0>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fa7b02f3820>)]

In [54]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

### Tokenization

In [55]:
doc_2 = nlp(u"Tesla isn't looking into startups anymore.")

row_format = "{:>10}" * 3
for token in doc_2:
    print(row_format.format(token.text, token.pos_, token.dep_))

     Tesla     PROPN     nsubj
        is       AUX       aux
       n't      PART       neg
   looking      VERB      ROOT
      into       ADP      prep
  startups      NOUN      pobj
   anymore       ADV    advmod
         .     PUNCT     punct


In [56]:
# Extra spaces between words are kept as a separate whitespace token
doc_2 = nlp(u"Tesla isn't   looking into startups anymore.")

row_format = "{:>10}" * 3
for token in doc_2:
    print(row_format.format(token.text, token.pos_, token.dep_))

     Tesla     PROPN     nsubj
        is       AUX       aux
       n't      PART       neg
               SPACE     nsubj
   looking      VERB      ROOT
      into       ADP      prep
  startups      NOUN      pobj
   anymore       ADV    advmod
         .     PUNCT     punct


In [57]:
doc_2

Tesla isn't   looking into startups anymore.

In [58]:
type(doc_2)

spacy.tokens.doc.Doc

In [59]:
doc_2[0]

Tesla

In [60]:
doc_2[0].text

'Tesla'

### Part-of-speech tagging (POS)

* For more info, visit: https://spacy.io/usage/linguistic-features#pos-tagging

In [37]:
doc_2[0].pos_

'PROPN'

In [52]:
spacy.explain('PROPN')

'proper noun'

### Dependencies

* For more info, visit: https://spacy.io/usage/linguistic-features#dependency-parse

* The [Stanford typed dependencies manual](https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf) gives a good explanation of typed dependencies.

In [43]:
doc_2[0].dep_

'nsubj'

In [51]:
spacy.explain('nsubj')

'nominal subject'

### Additional token attributes

In [12]:
# Lemmas (the base form of the word):
print(doc_2[4].text)
print(doc_2[4].lemma_)

looking
look


In [50]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc_2[4].pos_ + ' / ' + spacy.explain(doc_2[4].pos_))
print(doc_2[4].tag_ + '  / ' + spacy.explain(doc_2[4].tag_))

VERB / verb
VBG  / verb, gerund or present participle


In [14]:
# Word Shapes:
print(doc_2[0].text + ': ' + doc_2[0].shape_)
print(doc_1[6].text + ' : ' + doc_1[6].shape_)

Tesla: Xxxxx
U.S. : X.X.
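
To see where shapes such as `Xxxxx` and `X.X.` come from, here is a minimal pure-Python sketch of the mapping. It assumes the behavior described in spaCy's documentation: letters become `x` or `X`, digits become `d`, other characters are kept, and runs of the same shape character are cut off after four. The function name `approximate_shape` is hypothetical, and this is an illustration rather than spaCy's actual implementation.

```python
def approximate_shape(text):
    """Approximate a token's shape_ string (illustrative, not spaCy's code)."""
    shape = []
    previous = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        # Track how long the current run of identical shape characters is
        run_length = run_length + 1 if shape_char == previous else 1
        previous = shape_char
        # Only the first four characters of a run are kept
        if run_length <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Tesla", "U.S.", "looking"]:
    print(word, approximate_shape(word))
```

Because of the run-length cap, longer lowercase words such as "looking" collapse to `xxxx`, which is why shapes are useful as coarse features rather than exact patterns.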


In [15]:
# Boolean Values:
print(doc_2[0].is_alpha)
print(doc_2[0].is_stop)

True
False


___
## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object, written `Doc[start:stop]`, where `start` and `stop` are token indices and `stop` is exclusive.

In [16]:
doc3 = nlp(
    u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.'
)

In [17]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [18]:
type(life_quote)

spacy.tokens.span.Span

In upcoming lectures we'll see how to create Span objects using `Span()`. This will allow us to assign additional information to the Span.
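
As a brief preview, the sketch below builds a Span directly with the `Span` class from `spacy.tokens`. It is a minimal example under a few assumptions that are not part of the original notebook: it uses `spacy.blank("en")` (a tokenizer-only pipeline, so no model download is required), reuses a shortened version of the Tesla sentence, and picks the label `"ORG"` purely for illustration.

```python
import spacy
from spacy.tokens import Span

# A blank English pipeline only tokenizes, so no trained model is needed
nlp = spacy.blank("en")
doc = nlp("Tesla is looking at buying a U.S. startup")

# Span(doc, start, end, label=...) covers the tokens from start up to,
# but not including, end. The "ORG" label is chosen here for illustration.
company = Span(doc, 0, 1, label="ORG")

print(company.text, company.label_)
```

Slicing with `doc[0:1]` yields a Span over the same token, but without a label; the constructor is what lets us attach that extra information.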

___
## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [19]:
doc4 = nlp(
    u'This is the first sentence. This is another sentence. This is the last sentence.'
)

In [20]:
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [21]:
# Token 6 is the second "This", so it is flagged as the start of a sentence
doc4[6].is_sent_start

True
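
To make the link between "start of sentence" flags and sentence segments concrete, here is a small pure-Python sketch that groups a token list into sentences wherever a flag is set. The helper `group_sentences`, the hand-written token list, and the flags are all hypothetical stand-ins; this illustrates the idea behind `Doc.sents`, not how spaCy implements it.

```python
def group_sentences(tokens, is_sent_start):
    """Group tokens into sentences, opening a new one at each flagged token."""
    sentences = []
    for token, starts_sentence in zip(tokens, is_sent_start):
        # Open a new sentence at a flagged token (or at the very first token)
        if starts_sentence or not sentences:
            sentences.append([])
        sentences[-1].append(token)
    return sentences


# Hand-written tokens standing in for a Doc; only tokens 0 and 6 are flagged
tokens = ["This", "is", "the", "first", "sentence", ".",
          "This", "is", "another", "sentence", "."]
flags = [index in (0, 6) for index in range(len(tokens))]

sentences = group_sentences(tokens, flags)
for sentence in sentences:
    print(sentence)
```

The flags alone carry all the segmentation information, which is why changing them (as custom segmentation rules do) changes the sentences that come out.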

## Next up: Tokenization