## Intro to spaCy

After importing the `spacy` module, we load a model and name it `nlp`.
Next we create a Doc object by applying the model to our text, and name it `doc`.
spaCy also builds a companion Vocab object that we'll cover in later sections.
The Doc object that holds the processed text is our focus here.

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

In [4]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

In [9]:
for token in doc:
    print(f'{token.text:10} {token.pos_:10} {token.dep_:10}')

Tesla      PROPN      nsubj     
is         AUX        aux       
looking    VERB       ROOT      
at         ADP        prep      
buying     VERB       pcomp     
U.S.       PROPN      compound  
startup    NOUN       dobj      
for        ADP        prep      
$          SYM        quantmod  
6          NUM        compound  
million    NUM        pobj      


## Tokenization

The first step in processing text is to split it into its component parts (words and punctuation), called "tokens". Each token is then annotated inside the Doc object with descriptive information.

In [11]:
doc2 = nlp(u"Tesla isn't   looking into startups anymore.")

for token in doc2:
    print(f'{token.text:10} {token.pos_:10} {token.dep_:10}')

Tesla      PROPN      nsubj     
is         AUX        aux       
n't        PART       neg       
           SPACE      nsubj     
looking    VERB       ROOT      
into       ADP        prep      
startups   NOUN       pobj      
anymore    ADV        advmod    
.          PUNCT      punct     
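
Tokenization itself is rule-based and does not need the statistical model. As a minimal sketch, assuming spaCy is installed, a blank English pipeline (`spacy.blank('en')`, which is not used elsewhere in this section) is enough to see that the split is non-destructive: whitespace is kept, so the original string can be rebuilt from the tokens. The names `blank_nlp`, `blank_doc` and `rebuilt` are illustrative only.

```python
import spacy

# Assumption: a blank English pipeline, which has a tokenizer but no tagger or parser.
blank_nlp = spacy.blank('en')
text = "Tesla isn't   looking into startups anymore."
blank_doc = blank_nlp(text)

# text_with_ws is the token text plus its trailing whitespace character, if any.
rebuilt = ''.join(token.text_with_ws for token in blank_doc)

print([token.text for token in blank_doc])
print(rebuilt == text)
```

Because no tagger or parser runs here, `pos_` and `dep_` would be empty on these tokens; only the token boundaries are comparable with the output above.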


## Part-of-Speech Tagging (POS)
The next step after splitting the text into tokens is to assign parts of speech. In the above example, `Tesla` was recognized as a ***proper noun***. This step relies on statistical modeling rather than a fixed lookup. For example, words that follow "the" are typically nouns.

In [12]:
doc2[0].pos_

'PROPN'

## Dependencies
We also looked at the syntactic dependencies assigned to each token. `Tesla` is identified as an `nsubj`, the ***nominal subject*** of the sentence.

In [13]:
doc2[0].dep_

'nsubj'

In [14]:
spacy.explain('PROPN')

'proper noun'

In [15]:
spacy.explain('nsubj')

'nominal subject'
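
Since `spacy.explain` is a plain glossary lookup, it can be called without loading a model. The sketch below builds a small legend for the labels that appear in this section. The `labels` list and the `legend` dictionary are illustrative names, not part of spaCy.

```python
import spacy

# Labels taken from the outputs shown in this section.
labels = ['PROPN', 'nsubj', 'VBG']

# Look up a human-readable description for each label.
legend = {label: spacy.explain(label) for label in labels}

for label, description in legend.items():
    print(f'{label:10} {description}')
```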

## Additional Token Attributes

In [16]:
# Lemmas (the base form of the word):
print(doc2[4].text)
print(doc2[4].lemma_)

looking
look


In [17]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[4].pos_)
print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))

VERB
VBG / verb, gerund or present participle


In [18]:
# Word Shapes:
print(doc2[0].text+': '+doc2[0].shape_)
print(doc[5].text+' : '+doc[5].shape_)

Tesla: Xxxxx
U.S. : X.X.


In [19]:
# Boolean Values:
print(doc2[0].is_alpha)
print(doc2[0].is_stop)

True
False
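
To make the word shape attribute more concrete, here is a rough pure-Python approximation of how such a shape could be derived. The function `approx_shape` is hypothetical and is not spaCy's implementation. It assumes uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, other characters stay as they are, and a run of the same shape character is cut off after four.

```python
def approx_shape(text, max_run=4):
    """Approximate a word shape; an illustrative sketch, not spaCy's own code."""
    shape = []
    last = ''
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            shape_char = 'd'
        else:
            shape_char = char  # punctuation and symbols are kept unchanged
        # Count how many times the same shape character has repeated.
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Keep at most max_run copies of a repeated shape character.
        if run <= max_run:
            shape.append(shape_char)
    return ''.join(shape)

print(approx_shape('Tesla'))
print(approx_shape('U.S.'))
```

Under these assumptions the two calls reproduce the `Xxxxx` and `X.X.` shapes shown above, while a longer lowercase word such as `startups` collapses to `xxxx`.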


## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

In [20]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [21]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [22]:
type(life_quote)

spacy.tokens.span.Span
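
A span is a view onto its parent Doc rather than a copy, so its tokens keep their original document positions. The following sketch illustrates this on a shorter text, assuming a blank English pipeline (`spacy.blank('en')`) so that no model download is needed. The names `span_doc` and `span` are illustrative.

```python
import spacy

# Assumption: a blank pipeline is enough here because slicing only needs token boundaries.
blank_nlp = spacy.blank('en')
span_doc = blank_nlp('Life is what happens to us while we are making other plans')

span = span_doc[4:6]

print(span.text)               # the text covered by the slice
print(span.start, span.end)    # token offsets of the slice within the Doc
print(span[0].i)               # a token's index refers to the Doc, not the span
print(span.doc is span_doc)    # the span points back to its parent Doc
```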

## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [23]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [24]:
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [28]:
doc4[6].is_sent_start

True
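
Note that `Doc.sents` is a generator, so it cannot be indexed directly; wrapping it in `list()` gives a list of Span objects. The sketch below shows this, assuming spaCy v3 and a blank English pipeline with the rule-based `sentencizer` component added instead of the full model. The names `sent_nlp`, `sent_doc`, `sentences` and `starts` are illustrative.

```python
import spacy

# Assumption: spaCy v3, where add_pipe accepts the component name as a string.
sent_nlp = spacy.blank('en')
sent_nlp.add_pipe('sentencizer')

sent_doc = sent_nlp('This is the first sentence. This is another sentence. This is the last sentence.')

# Materialize the generator so individual sentences can be indexed.
sentences = list(sent_doc.sents)
print(sentences[1].text)

# Indices of the tokens flagged as the start of a sentence.
starts = [token.i for token in sent_doc if token.is_sent_start]
print(starts)
```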