In [3]:
# Import spaCy and load the language library

import spacy
nlp = spacy.load('en_core_web_lg')

In [15]:
# Create a Doc object from a Unicode string
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

In [16]:
# token.text is the token's text, pos_ is its part-of-speech tag, and dep_ is its syntactic dependency label
for token in doc:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj



1. Tesla is recognized to be a Proper Noun, not just a word at the start of a sentence
2. U.S. is kept together as one entity (we call this a 'token')

___
# Pipeline
When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data. For a diagram of the pipeline, see https://spacy.io/usage/spacy-101#pipelines

In [17]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x2a3b6342908>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x2a394ae31c8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x2a3b0de91c8>)]

In [20]:
# ner stands for named entity recognizer

# Components can be added to or removed from the pipeline as needed

In [21]:
nlp.pipe_names

['tagger', 'parser', 'ner']
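
To make the idea of an ordered pipeline concrete, here is a minimal pure-Python sketch. It is a toy illustration, not spaCy's implementation: `ToyPipeline`, `count_tokens` and `flag_capitalized` are hypothetical names invented for this example. The method names only echo spaCy's `nlp.add_pipe` and `nlp.remove_pipe`, whose exact signatures depend on the spaCy version.

```python
# Toy illustration only: a pipeline is an ordered list of (name, component) pairs.
# ToyPipeline and its components are hypothetical and are not part of spaCy.

class ToyPipeline:
    def __init__(self):
        self.pipeline = []  # ordered (name, callable) pairs

    @property
    def pipe_names(self):
        return [name for name, _ in self.pipeline]

    def add_pipe(self, name, component):
        self.pipeline.append((name, component))

    def remove_pipe(self, name):
        self.pipeline = [(n, c) for n, c in self.pipeline if n != name]

    def __call__(self, text):
        # Step 1: break the text into tokens (naive whitespace split).
        doc = {"text": text, "tokens": text.split()}
        # Step 2: each component annotates the doc and passes it on.
        for _, component in self.pipeline:
            doc = component(doc)
        return doc


def count_tokens(doc):
    doc["n_tokens"] = len(doc["tokens"])
    return doc


def flag_capitalized(doc):
    doc["capitalized"] = [tok[:1].isupper() for tok in doc["tokens"]]
    return doc


toy_nlp = ToyPipeline()
toy_nlp.add_pipe("counter", count_tokens)
toy_nlp.add_pipe("caps", flag_capitalized)
print(toy_nlp.pipe_names)

toy_doc = toy_nlp("Tesla is looking at buying U.S. startup")
print(toy_doc["n_tokens"], toy_doc["capitalized"])

toy_nlp.remove_pipe("caps")
print(toy_nlp.pipe_names)
```

The point of the sketch is the ordering: the text is tokenized first, and every later component receives the result of the one before it.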

## Tokenization
The first step in processing text is to split it into its component parts (words and punctuation), called "tokens". These tokens are annotated inside the Doc object to contain descriptive information.

In [22]:
doc2 = nlp(u"Tesla isn't   looking into startups anymore.")
for token in doc2:
    print(token.text,token.pos_,token.dep_)

Tesla PROPN nsubj
is AUX ROOT
n't PART neg
   SPACE 
looking VERB xcomp
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


Notice how `isn't` has been split into two tokens. spaCy recognizes both the root verb `is` and the negation attached to it. Notice also that both the extended whitespace and the period at the end of the sentence are assigned their own tokens.

It's important to note that even though `doc2` contains processed information about each token, it also retains the original text:

In [23]:
doc2

Tesla isn't   looking into startups anymore.
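
One way to see why the original text survives tokenization is to keep each token together with the whitespace that follows it. The sketch below is pure Python and does not call spaCy. The `(text, trailing whitespace)` pairs are written by hand to mirror the tokens printed above, and how the three spaces are divided between `n't` and the whitespace token is an assumption. In spaCy the corresponding attributes are `token.text` and `token.whitespace_`.

```python
# Hand-written (text, trailing_whitespace) pairs mirroring the tokens printed above.
# Assumption: the first space belongs to "n't" and the other two form their own token.
original = "Tesla isn't   looking into startups anymore."

pairs = [
    ("Tesla", " "),
    ("is", ""),
    ("n't", " "),
    ("  ", ""),       # the extended whitespace is a token of its own
    ("looking", " "),
    ("into", " "),
    ("startups", " "),
    ("anymore", ""),
    (".", ""),
]

# Concatenating every token with its trailing whitespace rebuilds the input.
rebuilt = "".join(text + ws for text, ws in pairs)
print(rebuilt == original)
```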

We can use indexing to grab an individual token from the Doc:

In [24]:
doc2[0].pos_

'PROPN'

In [40]:
doc2[4]

looking

___
## Additional Token Attributes
We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`Tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [39]:
for token in doc2:
    print(token.text,token.lemma_,token.pos_,token.tag_,token.shape_,token.is_stop,token.is_alpha)

Tesla Tesla PROPN NNP Xxxxx False True
is be AUX VBZ xx True True
n't not PART RB x'x True False
      SPACE _SP    False False
looking look VERB VBG xxxx False True
into into ADP IN xxxx True True
startups startup NOUN NNS xxxx False True
anymore anymore ADV RB xxxx False True
. . PUNCT . . False False
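
The `shape_` column can look cryptic at first (`looking` becomes `xxxx`, not `xxxxxxx`). The function below is a pure-Python approximation of the rule, inferred from the output above and not taken from spaCy itself. Uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, other characters are kept, and a run of the same shape character is capped at four. `word_shape` and `max_run` are names chosen for this sketch.

```python
def word_shape(text, max_run=4):
    """Approximate a token's shape: X/x for letters, d for digits, runs capped."""
    shape = []
    last, run = "", 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as they are
        # Count how long the current run of identical shape characters is.
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Tesla", "is", "n't", "looking", "U.S."]:
    print(word, word_shape(word))
```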


___
## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

In [41]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [45]:
span_quote = doc3[16:30]

In [46]:
span_quote

"Life is what happens to us while we are making other plans"

In [47]:
type(span_quote)

spacy.tokens.span.Span

In [49]:
type(doc3)

spacy.tokens.doc.Doc
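
The indices `16` and `30` do not have to be counted by hand. The sketch below works on a plain Python list, not a spaCy Doc. The tokens are hand-copied under the assumption that they match the tokenization implied by `doc3[16:30]` above. It finds the quotation-mark tokens and derives the slice from them, which also shows that `stop` is exclusive.

```python
# Tokens hand-copied to match how doc3 is indexed above (an assumption, not spaCy output).
tokens = [
    "Although", "commonly", "attributed", "to", "John", "Lennon", "from",
    "his", "song", '"', "Beautiful", "Boy", '"', ",", "the", "phrase",
    '"', "Life", "is", "what", "happens", "to", "us", "while", "we", "are",
    "making", "other", "plans", '"', "was", "written", "by",
]

# Positions of every quotation-mark token.
quote_positions = [i for i, tok in enumerate(tokens) if tok == '"']

start = quote_positions[2]      # opening quote of the second quoted phrase
stop = quote_positions[3] + 1   # stop is exclusive, so add 1 to keep the closing quote

span = tokens[start:stop]
print(start, stop, len(span))
print(span[1], span[-2])
```

With a real Doc, the same positions could be collected from each token's `i` attribute (its index within the Doc).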

___
## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`.

In [50]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [52]:
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [55]:
doc4[6].is_sent_start  # True if this token starts a sentence; otherwise False or None, depending on the spaCy version

True
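
The link between per-token "start of sentence" flags and sentence segments can be sketched in pure Python. This illustrates the idea only and is not how spaCy implements `Doc.sents`. `iter_sentences` is a hypothetical helper, and the token list and flags are written by hand to match `doc4`, where tokens 0, 6 and 11 begin a sentence.

```python
def iter_sentences(tokens, is_sent_start):
    """Yield lists of tokens, starting a new list wherever the flag is True."""
    current = []
    for token, starts in zip(tokens, is_sent_start):
        if starts and current:
            yield current
            current = []
        current.append(token)
    if current:
        yield current


tokens = ["This", "is", "the", "first", "sentence", ".",
          "This", "is", "another", "sentence", ".",
          "This", "is", "the", "last", "sentence", "."]

# Hand-written flags: True only for the tokens that open a sentence.
flags = [i in (0, 6, 11) for i in range(len(tokens))]

sentences = list(iter_sentences(tokens, flags))
for sent in sentences:
    print(" ".join(sent))
```

Because the sentences are produced lazily from the flags, nothing is stored until the generator is consumed, which is consistent with `Doc.sents` being iterated in a loop above.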