# spaCy Basics

**spaCy** is an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).

In this section we'll install and set up spaCy to work with Python, and then introduce some concepts related to Natural Language Processing.

___
# Installation and Setup

Installation is a two-step process. First, install spaCy using either conda or pip. Next, download the specific model you want, based on language.

### 1. From the command line or terminal:
> `pip install -U spacy`


### 2. Next, also from the command line (you must run this as admin or use sudo):

> `python -m spacy download en_core_web_sm`

In [1]:
import spacy
# Load the language library ('en_core' means core English language, 'web_sm' means the small version of this library)
nlp = spacy.load('en_core_web_sm')

In [2]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

Here we use the language library that we just loaded, which spaCy developed.
It parses the entire string "Tesla is looking at buying U.S. startup for $6 million" into separate components for us, known as tokens. Essentially, each word becomes a token.

In [3]:
for token in doc:
    print(token.text, token.pos)

Tesla 96
is 100
looking 100
at 85
buying 100
U.S. 96
startup 92
for 85
$ 99
6 93
million 93


**spaCy** is smart enough to treat "U.S." (capital U and S with their dots) as a single token, since we're talking about the country. It also realizes that the "$" sign and the "6" should be separated: "$" stands for the US dollar, "6" stands for an amount, and "million" stands for another amount.

`pos` stands for part of speech. When we run the cell, we see a number after each word, such as "Tesla 96" and "is 100". These numbers correspond to parts of speech such as adverb, noun, adjective, etc.

In [15]:
for token in doc:
    print(token.text, token.pos_)

Tesla PROPN
is VERB
looking VERB
at ADP
buying VERB
U.S. PROPN
startup NOUN
for ADP
$ SYM
6 NUM
million NUM


If we want the readable name instead of the number, we use `pos_` (pos followed by an underscore). spaCy is smart enough to know that "Tesla" is a proper noun, "is" is a verb, "looking" is a verb, "U.S." is a proper noun, "startup" is a noun, and so on.
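
To make the link between the two outputs concrete, the sketch below pairs each integer ID from the `token.pos` output with the label from the `token.pos_` output for the same token. It uses only the values printed in this section and plain Python, so it illustrates the correspondence rather than how spaCy stores it internally.

```python
# Integer IDs and labels copied from the two outputs above, in token order.
ids = [96, 100, 100, 85, 100, 96, 92, 85, 99, 93, 93]
labels = ["PROPN", "VERB", "VERB", "ADP", "VERB", "PROPN",
          "NOUN", "ADP", "SYM", "NUM", "NUM"]

# Build an ID -> label lookup and check that an ID never maps to two labels.
pos_names = {}
for pos_id, label in zip(ids, labels):
    assert pos_names.setdefault(pos_id, label) == label

for pos_id in sorted(pos_names):
    print(pos_id, pos_names[pos_id])
```

Each ID maps to exactly one label, which is why `token.pos` and `token.pos_` can be treated as two views of the same tag.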

In [16]:
for token in doc:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is VERB aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


Adding `dep_` gives us even more information: "dep" stands for 'syntactic dependency'.

This doesn't look very user-friendly, but right away we see some interesting things happen:
1. Tesla is recognized to be a Proper Noun, not just a word at the start of a sentence
2. U.S. is kept together as one entity (we call this a 'token')

As we dive deeper into spaCy we'll see what each of these abbreviations means and how they're derived. We'll also see how spaCy can interpret the last three tokens combined `$6 million` as referring to ***money***.

___
# spaCy Objects

After importing the spacy module in the cell above we loaded a **model** and named it `nlp`.<br>Next we created a **Doc** object by applying the model to our text, and named it `doc`.<br>spaCy also builds a companion **Vocab** object that we'll cover in later sections.<br>The **Doc** object that holds the processed text is our focus here.

___
# Pipeline
When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data.

In [17]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x24b1a84aa90>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x24b25474768>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x24b254747c8>)]

In [18]:
nlp.pipe_names

['tagger', 'parser', 'ner']
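
As a rough mental model, a pipeline can be pictured as an ordered list of named functions that each receive the tokenized text, add annotations, and pass it on. The sketch below is a toy in plain Python, not spaCy's implementation; `tokenize`, `mark_title_case`, and `add_length` are hypothetical stand-ins for the real tokenizer and for components such as `tagger`, `parser`, and `ner`.

```python
def tokenize(text):
    # Stand-in tokenizer (assumption): split on whitespace only.
    return [{"text": word} for word in text.split()]

def mark_title_case(tokens):
    # Hypothetical component: flag tokens that start with an uppercase letter.
    for tok in tokens:
        tok["is_title"] = tok["text"][:1].isupper()
    return tokens

def add_length(tokens):
    # Hypothetical component: record the character length of each token.
    for tok in tokens:
        tok["length"] = len(tok["text"])
    return tokens

# Components run in list order, after the text has been tokenized.
pipeline = [("mark_title_case", mark_title_case), ("add_length", add_length)]

def process(text):
    doc = tokenize(text)
    for _name, component in pipeline:
        doc = component(doc)
    return doc

pipe_names = [name for name, _ in pipeline]
result = process("Tesla is looking at startups")
print(pipe_names)
print(result[0])
```

The point of the sketch is the ordering: tokenization happens first, and every later step only adds information to tokens that already exist.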

___
## Tokenization
The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information. We'll go into much more detail on tokenization in an upcoming lecture. For now, let's look at another example:

In [22]:
doc2 = nlp(u"Tesla isn't looking into startups anymore.")

In [24]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is VERB aux
n't ADV neg
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


**Notice** how `isn't` has been split into two tokens. spaCy recognizes both the root verb `is` and the negation attached to it. Notice also that the period at the end of the sentence is assigned its own token.

It's important to note that even though `doc2` contains processed information about each token, it also retains the original text. In the next example, a run of extra spaces is assigned its own token (tagged `SPACE`), and the Doc still prints as the original string:

In [25]:
doc2 = nlp(u"Tesla isn't       looking into startups anymore.")

In [26]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is VERB aux
n't ADV neg
       SPACE 
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


In [30]:
doc2

Tesla isn't       looking into startups anymore.

In [31]:
doc2[0]

Tesla

In [29]:
type(doc2)

spacy.tokens.doc.Doc
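
One way to picture how a Doc can hold separate tokens and still print the original string is to store each token together with the space that follows it. The sketch below is a toy regex tokenizer written for illustration, not spaCy's tokenizer: it does not split `isn't` or the final period, and the `toy_tokenize` name is hypothetical.

```python
import re

# Same sentence as above, with seven spaces between "isn't" and "looking".
text = "Tesla isn't" + " " * 7 + "looking into startups anymore."

def toy_tokenize(text):
    """Return (token, trailing_space) pairs that together cover every character."""
    # A token is a run of non-space characters or a run of whitespace,
    # optionally followed by one space that is stored alongside it.
    return [(m.group(1), m.group(2)) for m in re.finditer(r"(\S+|\s+)( ?)", text)]

pairs = toy_tokenize(text)
rebuilt = "".join(token + space for token, space in pairs)

print([token for token, _ in pairs])
print(rebuilt == text)
```

Because one space is attached to `isn't`, the remaining six spaces form a token of their own, which mirrors the separate `SPACE` row in the output above.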

___
## Part-of-Speech Tagging (POS)
The next step after splitting the text up into tokens is to assign parts of speech. In the above example, `Tesla` was recognized to be a ***proper noun***. Here some statistical modeling is required. For example, words that follow "the" are typically nouns.

In [27]:
doc2[0].pos_

'PROPN'

___
## Dependencies
We also looked at the syntactic dependencies assigned to each token. `Tesla` is identified as an `nsubj` or the ***nominal subject*** of the sentence.

In [28]:
doc2[0].dep_

'nsubj'

In [32]:
spacy.explain('PROPN')

'proper noun'

In [33]:
spacy.explain('nsubj')

'nominal subject'

___
## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

In [34]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [36]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [37]:
type(life_quote)

spacy.tokens.span.Span

In [38]:
type(doc3)

spacy.tokens.doc.Doc
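
The `start:stop` indices count tokens, not characters, and `stop` is exclusive, as with ordinary Python slices. The sketch below reproduces the `16:30` arithmetic on a plain list; the regex is a stand-in tokenizer assumed for illustration (words and single punctuation marks), and the text is cut off before `Reader's` so that the stand-in's token boundaries match the ones relevant to the slice.

```python
import re

text = ('Although commonly attributed to John Lennon from his song "Beautiful Boy", '
        'the phrase "Life is what happens to us while we are making other plans" '
        'was written by cartoonist Allen Saunders')

# Stand-in tokenizer (assumption, not spaCy's rules): words or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)

start, stop = 16, 30
quote = tokens[start:stop]

print(len(quote))                                # number of tokens in the slice
print(quote[0], quote[1], quote[-2], quote[-1])  # opening quote, first word, last word, closing quote
print(tokens[stop])                              # first token after the slice
```

The slice holds `stop - start` tokens, and both quotation marks count as tokens, which is why the quote needs 14 positions for 12 words.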

In [39]:
doc4 = nlp(u"This is the first sentence. This is another sentence. This is last sentence")

In [40]:
for sentence in doc4.sents:
    print(sentence)

This is the first sentence.
This is another sentence.
This is last sentence


In [41]:
doc4[6]

This

In [43]:
doc4[6].is_sent_start

True

In [44]:
doc4[8].is_sent_start 
# No output is shown because token 8 is not the start of a sentence (is_sent_start is None here; some spaCy versions return False), unlike token 6, which returns True
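
To see why index 6 is a sentence start and index 8 is not, it helps to lay the tokens out with their positions. The sketch below flags sentence starts with a simple punctuation rule in plain Python; the rule and the `sentence_starts` helper are assumptions for illustration, not how spaCy decides sentence boundaries.

```python
# Tokens of the doc4 text, listed by hand in the order shown above.
tokens = ["This", "is", "the", "first", "sentence", ".",
          "This", "is", "another", "sentence", ".",
          "This", "is", "last", "sentence"]

def sentence_starts(tokens):
    """Flag each token that opens a sentence, using a toy punctuation rule."""
    flags = []
    previous_ended = True  # the first token always starts a sentence
    for tok in tokens:
        flags.append(previous_ended)
        previous_ended = tok in {".", "!", "?"}
    return flags

flags = sentence_starts(tokens)
print([i for i, is_start in enumerate(flags) if is_start])
print(flags[6], flags[8])
```

Token 6 is the second `This`, which follows a period, while token 8 is `another`, in the middle of the second sentence.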