
spaCy (https://spacy.io/) is an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).

# Installation and Setup

## 1. From the command line or terminal:
pip install -U spacy
## 2. Next, also from the command line:
python -m spacy download en_core_web_sm
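
Before loading a model, it can help to confirm that the installation worked. The snippet below is a minimal sketch, assuming spaCy itself is installed; `MODEL_NAME` and `model_available` are illustrative names, and the model name is the one loaded later in this notebook.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"  # the model name loaded later in this notebook

# is_package() reports whether a pip-installed package with this name exists.
model_available = is_package(MODEL_NAME)

print("spaCy version:", spacy.__version__)
print(MODEL_NAME, "installed:", model_available)
```

If the second line prints `False`, the download step above has not completed in the active Python environment.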

# Working with spaCy follows these steps
1. Loading the language library
2. Building the pipeline
3. Using tokens
4. Part-of-speech tagging
5. Understanding the token attributes

In [1]:
import spacy

In [2]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


This doesn't look very user-friendly, but right away we see some interesting things happen:
1. `Tesla` is recognized as a proper noun, not just a word at the start of a sentence.
2. `U.S.` is kept together as a single unit (we call this a 'token').

As we dive deeper into spaCy we'll see what each of these abbreviations means and how they're derived. We'll also see how spaCy can interpret the last three tokens combined, `$6 million`, as referring to ***money***.

## The spaCy Pipeline

In [3]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f54d22f8d68>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f54d21525e8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f54d2152648>)]

In [4]:
nlp.pipe_names

['tagger', 'parser', 'ner']
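
The tokenizer does not appear in this list: it always runs first and produces the Doc that the listed components then annotate in order. A rough way to see this is to build a pipeline with no components at all. The sketch below assumes only that spaCy is installed (no trained model is needed); `blank_nlp` is an illustrative name.

```python
import spacy

# spacy.blank("en") builds an English pipeline that has only the rule-based tokenizer.
blank_nlp = spacy.blank("en")
print(blank_nlp.pipe_names)  # no components have been added yet

# The text is still split into tokens, because tokenization happens before the pipeline.
doc = blank_nlp("Tesla is looking at buying U.S. startup for $6 million")
tokens = [token.text for token in doc]
print(tokens)
```

Attributes such as `pos_` and `dep_` stay empty here, since no tagger or parser has run.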

___
## Tokenization
The first step in processing text is to split it into its component parts (words and punctuation), called "tokens". Each token is annotated inside the Doc object with descriptive information. For now, let's look at another example:

In [5]:
doc2 = nlp(u"Tesla isn't   looking into startups anymore.")

for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
n't PART neg
   SPACE 
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


Notice how `isn't` has been split into two tokens. spaCy recognizes both the verb `is` and the negation `n't` attached to it. Notice also that both the extended whitespace and the period at the end of the sentence are assigned their own tokens.

It's important to note that even though `doc2` contains processed information about each token, it also retains the original text:

In [7]:
doc2

Tesla isn't   looking into startups anymore.

In [8]:
doc2[0]

Tesla

In [9]:
type(doc2)

spacy.tokens.doc.Doc
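
The tokenization behavior described above can be checked with the tokenizer alone. The following is a minimal sketch, assuming a blank English pipeline (so no trained model is required); `nlp_tok`, `token_texts`, and `rebuilt` are illustrative names. It relies on `text_with_ws`, which holds each token's text plus any trailing whitespace.

```python
import spacy

nlp_tok = spacy.blank("en")  # tokenizer only, no trained components
text = "Tesla isn't   looking into startups anymore."
doc = nlp_tok(text)

# The contraction is split, and the extra whitespace gets its own token.
token_texts = [token.text for token in doc]
print(token_texts)

# Joining each token with its trailing whitespace rebuilds the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
print(doc.text == text)
```

Because nothing is thrown away during tokenization, the Doc can always reproduce the original string.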

## Part-of-Speech Tagging (POS)
The next step after splitting the text into tokens is to assign parts of speech. In the above example, `Tesla` was recognized as a ***proper noun***. This requires some statistical modeling, because a word's part of speech depends on its context. For example, words that follow "the" are typically nouns.

In [10]:
doc2[0].pos_

'PROPN'

## Dependencies
The earlier examples also printed the syntactic dependency assigned to each token. `Tesla` is identified as `nsubj`, the nominal subject of the sentence.

In [11]:
doc2[0].dep_

'nsubj'

In [12]:
spacy.explain('PROPN')

'proper noun'

In [12]:
spacy.explain('nsubj')

'nominal subject'
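
Since `spacy.explain` only looks labels up in spaCy's built-in glossary, it works without loading a model. One way to decode several abbreviations from the earlier outputs at once is sketched below; the `labels` list and the `glossary` dictionary are illustrative names, not part of spaCy.

```python
import spacy

# Labels taken from the printed outputs earlier in this notebook.
labels = ["PROPN", "AUX", "VERB", "ADP", "NOUN", "nsubj", "dobj", "pobj", "VBG"]

# Map each abbreviation to its human-readable description.
glossary = {label: spacy.explain(label) for label in labels}
for label, description in glossary.items():
    print(f"{label:<6} {description}")

# Labels that are not in the glossary come back as None.
print(spacy.explain("NOT_A_LABEL"))
```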

## Additional Token Attributes

In [13]:
# Lemmas (the base form of the word):
print(doc2[4].text)
print(doc2[4].lemma_)

looking
look


In [14]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[4].pos_)
print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))

VERB
VBG / verb, gerund or present participle


In [15]:
# Word Shapes:
print(doc2[0].text+': '+doc2[0].shape_)
print(doc[5].text+' : '+doc[5].shape_)

Tesla: Xxxxx
U.S. : X.X.


In [16]:
# Boolean Values:
print(doc2[0].is_alpha)
print(doc2[0].is_stop)

True
False
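
Attributes such as `shape_`, `is_alpha`, and `is_stop` depend only on the token's text, so they can be inspected for every token in one pass. The sketch below assumes a blank English pipeline, which is enough for these text-based attributes (`lemma_`, `pos_`, and `tag_` would still need the trained model); `nlp_lex` and `rows` are illustrative names.

```python
import spacy

nlp_lex = spacy.blank("en")  # tokenizer only; enough for text-based attributes
doc = nlp_lex("Tesla is looking at buying U.S. startup for $6 million")

# One row per token: text, shape, alphabetic?, stop word?, punctuation?, number-like?
rows = [
    (t.text, t.shape_, t.is_alpha, t.is_stop, t.is_punct, t.like_num)
    for t in doc
]
for row in rows:
    print(row)
```

Scanning the rows shows, for example, that `U.S.` is not purely alphabetic because of its periods, while `is` is flagged as a stop word.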


## Spans
Large Doc objects can be hard to work with at times. A span is a slice of a Doc object, written in the form `Doc[start:stop]`.

In [17]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [18]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [19]:
type(life_quote)

spacy.tokens.span.Span
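
Counting token positions by hand, as with `doc3[16:30]`, is error-prone on longer texts. One possible alternative is to compute the slice boundaries from the tokens themselves. The sketch below assumes a blank English pipeline and a shortened, illustrative sentence; `nlp_span`, `quote_positions`, and `quote` are hypothetical names.

```python
import spacy

nlp_span = spacy.blank("en")  # slicing only needs the tokenizer
doc = nlp_span('The phrase "Life is what happens" was written by Allen Saunders.')

# Find the token indices of the quotation marks, then slice between them.
quote_positions = [token.i for token in doc if token.text == '"']
start, stop = quote_positions[0], quote_positions[-1] + 1  # stop is exclusive
quote = doc[start:stop]

print(quote.text)
print(quote.start, quote.end, len(quote))
print(quote.doc is doc)  # a Span is a view onto its parent Doc
```

The Span keeps its `start` and `end` offsets relative to the parent Doc, so its tokens still report their original positions through `token.i`.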

## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`.

In [20]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [21]:
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [22]:
doc4[6].is_sent_start

True
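
In the cells above, the sentence boundaries come from the trained model's parser. When only sentence splitting is needed, spaCy's rule-based `sentencizer` component is a lighter option. The sketch below is an assumption-laden illustration rather than part of the original notebook: it uses a blank English pipeline, and because the way components are added changed between spaCy v2 and v3, it branches on the installed version. `nlp_sent`, `sentences`, and `sentence_starts` are illustrative names.

```python
import spacy

nlp_sent = spacy.blank("en")

# Adding a built-in component differs between major versions of spaCy.
if spacy.__version__.startswith("2."):
    nlp_sent.add_pipe(nlp_sent.create_pipe("sentencizer"))  # spaCy v2 style
else:
    nlp_sent.add_pipe("sentencizer")  # spaCy v3 style

doc = nlp_sent("This is the first sentence. This is another sentence. This is the last sentence.")

# Doc.sents is a generator of Span objects, one per sentence.
sentences = [sent.text for sent in doc.sents]
sentence_starts = [sent.start for sent in doc.sents]

print(sentences)
print(sentence_starts)
print(doc[6].is_sent_start)
```

Because `Doc.sents` is a generator, collecting it into a list is a convenient way to index or count sentences.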