# spaCy Basics

**spaCy** (https://spacy.io/) is an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).

In this section we'll install and set up spaCy to work with Python, and then introduce some concepts related to Natural Language Processing.

In [1]:
# Download the English model for spaCy
# (the `en` shortcut is a spaCy v2 feature; on v3, download `en_core_web_sm` instead)
!python -m spacy download en

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 6.5 MB/s
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


## Working with spaCy in Python

In [2]:
# Import spaCy and load the language library
import spacy
from prettytable import PrettyTable

nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp('Tesla is looking at buying U.S. startup for $6 million')

# Print each token with its part-of-speech tag and dependency label
t = PrettyTable(["Text", "Part of Speech", "Syntactic Dependency"])
for token in doc:
    t.add_row([token.text, token.pos_, token.dep_])
print(t)

+---------+----------------+----------------------+
|   Text  | Part of Speech | Syntactic Dependency |
+---------+----------------+----------------------+
|  Tesla  |     PROPN      |        nsubj         |
|    is   |      AUX       |         aux          |
| looking |      VERB      |         ROOT         |
|    at   |      ADP       |         prep         |
|  buying |      VERB      |        pcomp         |
|   U.S.  |     PROPN      |       compound       |
| startup |      NOUN      |         dobj         |
|   for   |      ADP       |         prep         |
|    $    |      SYM       |       quantmod       |
|    6    |      NUM       |       compound       |
| million |      NUM       |         pobj         |
+---------+----------------+----------------------+


## Pipeline
<img src="https://miro.medium.com/max/1400/1*w4qkY84JfG5h2ChhR8SKnA.png" width="600"/>

In [3]:
# Prebuilt pipeline in spaCy, containing the processing components
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f768af0c750>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f768afcf2f0>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f768afcf360>)]

In [4]:
# Names of the pipeline components
nlp.pipe_names

['tagger', 'parser', 'ner']
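
The components listed above run in order on every text passed to `nlp`. The tokenizer always runs first and is not listed in `nlp.pipe_names`. As a minimal sketch, assuming only that spaCy itself is installed (no model download), the snippet below starts from a blank English pipeline and adds the rule-based `sentencizer` component. The names `nlp_blank` and `doc_blank` are illustrative, chosen so they don't overwrite the notebook's `nlp`, and the version check is an assumption made to cover both the v2 API used in this notebook and the v3 API.

```python
import spacy

# A blank pipeline contains only the tokenizer, which is not listed as a component.
nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)

# Add the rule-based sentence splitter; the call differs between spaCy v2 and v3.
if spacy.__version__.startswith("2."):
    nlp_blank.add_pipe(nlp_blank.create_pipe("sentencizer"))
else:
    nlp_blank.add_pipe("sentencizer")
print(nlp_blank.pipe_names)

# Every text passed to the pipeline goes through the tokenizer, then each component.
doc_blank = nlp_blank("This is one sentence. This is another.")
sentences = [sent.text for sent in doc_blank.sents]
print(sentences)
```

A blank pipeline like this is handy when only tokenization or sentence splitting is needed, since it skips the tagger, parser and entity recognizer.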

## Tokenization

In [5]:
# The extra spaces are deliberate: spaCy keeps them as a separate SPACE token
doc2 = nlp("Tesla isn't   looking into startups anymore.")

# Print each token separately
t = PrettyTable(["Text", "Part of Speech", "Syntactic Dependency"])
for token in doc2:
    t.add_row([token.text, token.pos_, token.dep_])
print(t)

+----------+----------------+----------------------+
|   Text   | Part of Speech | Syntactic Dependency |
+----------+----------------+----------------------+
|  Tesla   |     PROPN      |        nsubj         |
|    is    |      AUX       |         aux          |
|   n't    |      PART      |         neg          |
|          |     SPACE      |                      |
| looking  |      VERB      |         ROOT         |
|   into   |      ADP       |         prep         |
| startups |      NOUN      |         pobj         |
| anymore  |      ADV       |        advmod        |
|    .     |     PUNCT      |        punct         |
+----------+----------------+----------------------+


In [6]:
doc2

Tesla isn't   looking into startups anymore.

In [7]:
doc2[0]

Tesla

In [8]:
type(doc2)

spacy.tokens.doc.Doc
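
The table above shows that the contraction `isn't` is split into two tokens and that the extra spaces survive as a token of their own. A small sketch of why that matters: spaCy's tokenization is non-destructive, so the original string can be rebuilt from the tokens. This example assumes only that spaCy is installed and uses a blank English pipeline (`nlp_tok` is an illustrative name), since tokenization does not need a trained model.

```python
import spacy

nlp_tok = spacy.blank("en")  # tokenizer only; no statistical model required
doc_tok = nlp_tok("Tesla isn't   looking into startups anymore.")

# Each token records its text, its character offset, and any trailing whitespace.
for token in doc_tok:
    print(repr(token.text), token.idx, repr(token.whitespace_))

# Tokenization is non-destructive: joining the pieces restores the input.
rebuilt = "".join(token.text_with_ws for token in doc_tok)
print(rebuilt == doc_tok.text)
```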

## Part-of-Speech Tagging (POS)

In [9]:
doc2[0].pos_

'PROPN'

## Dependencies

In [10]:
doc2[0].dep_

'nsubj'

In [11]:
# To see the full name of a tag
spacy.explain('PROPN')

'proper noun'

In [12]:
spacy.explain('nsubj')

'nominal subject'
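
When several labels need decoding at once, `spacy.explain` can be wrapped in a small lookup. The sketch below builds a legend for a few labels seen in the outputs above; `NOT_A_LABEL` is a made-up label included only to show that unknown labels come back as `None` (recent spaCy versions may also emit a warning in that case).

```python
import spacy

# Labels taken from the outputs above, plus one made-up label for contrast.
labels = ["PROPN", "VERB", "nsubj", "pobj", "NOT_A_LABEL"]

# spacy.explain returns a description string, or None for unknown labels.
legend = {label: spacy.explain(label) for label in labels}
for label, description in legend.items():
    print(f"{label:12} {description}")
```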

## Additional Token Attributes

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist only of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [13]:
# The .lemma_ attribute holds the base form of the word (lemmatization)
print("Original Word   :", doc2[4].text)
print("Lemmatized Word :", doc2[4].lemma_)

Original Word   : looking
Lemmatized Word : look


In [14]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[4].pos_)
print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))

VERB
VBG / verb, gerund or present participle


In [15]:
# Word shapes (the second example uses `doc`, the first Doc created above):
print(doc2[0].text+': '+doc2[0].shape_)
print(doc[5].text+' : '+doc[5].shape_)

Tesla: Xxxxx
U.S. : X.X.


In [16]:
# Boolean values:
print(doc2[0].is_alpha) # Check whether the token consists of alphabetic characters
print(doc2[0].is_stop) # Check whether the token is a stop word

True
False
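
Several of the attributes in this section (`.shape_`, `.is_alpha`, `.is_stop`) are lexical: they depend only on the token's text, not on a trained model. As a rough sketch under that assumption, the snippet below collects them for every token of the first example sentence using a blank English pipeline. `nlp_lex` and `rows` are illustrative names, and `.like_num` is an additional spaCy token attribute not covered in the table above.

```python
import spacy

nlp_lex = spacy.blank("en")  # lexical attributes need no trained model
doc_lex = nlp_lex("Tesla is looking at buying U.S. startup for $6 million")

# Collect a few model-independent attributes for every token.
rows = [
    (token.text, token.shape_, token.is_alpha, token.is_stop, token.like_num)
    for token in doc_lex
]

# Columns: text, shape, is_alpha, is_stop, like_num
for text, shape, is_alpha, is_stop, like_num in rows:
    print(f"{text:10} {shape:6} {is_alpha!s:6} {is_stop!s:6} {like_num!s:6}")
```

Attributes such as `.pos_`, `.tag_`, `.dep_` and `.lemma_` would not be filled in reliably here, because they rely on the trained components loaded with `en_core_web_sm`.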


## Spans

In [17]:
doc3 = nlp('Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [18]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [19]:
type(life_quote)

spacy.tokens.span.Span
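
A `Span` is a slice of a `Doc`, addressed by token positions rather than characters. The following sketch, which assumes only that spaCy is installed and uses a blank English pipeline (`nlp_span` and `doc_span` are illustrative names), shows the bookkeeping a `Span` carries: its start and end token offsets and a reference back to the parent `Doc`.

```python
import spacy

nlp_span = spacy.blank("en")  # slicing works on any Doc, with or without a model
doc_span = nlp_span("Life is what happens to us while we are making other plans")

# Slicing a Doc by token position returns a Span.
span = doc_span[2:4]
print(span.text)              # text covered by the span
print(span.start, span.end)   # token offsets into the parent Doc
print(span.doc is doc_span)   # the Span keeps a reference to its Doc

# Indexing inside a Span is relative to the Span; token.i is the position in the Doc.
print(span[0].text, span[0].i)
```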

## Sentences

In [20]:
doc4 = nlp('This is the first sentence. This is another sentence. This is the last sentence.')

In [21]:
# Getting the Sentences from the above string
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [22]:
print("Word :", f"'{doc4[6]}'")
print("Is Sentence Start :", doc4[6].is_sent_start)

Word : 'This'
Is Sentence Start : True
