# spaCy Basics

**spaCy** (https://spacy.io/) is an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).


# Installation and Setup

Installation is a two-step process. First, install spaCy using either conda or pip. Next, download the specific model you want, based on language.<br> For more info visit https://spacy.io/usage/

### 1. From the command line or terminal:
> `conda install -c conda-forge spacy`
> <br>*or*<br>
> `pip install -U spacy`

> ### Alternatively you can create a virtual environment:
> `conda create -n spacyenv python=3 spacy=2`

### 2. Next, also from the command line (you must run this as admin or use sudo):

> `python -m spacy download en`
> <br>*or, using the full model name*<br>
> `python -m spacy download en_core_web_sm`


> ### If successful, you should see a message like:

> **`Linking successful`**<br>
> `    C:\Anaconda3\envs\spacyenv\lib\site-packages\en_core_web_sm -->`<br>
> `    C:\Anaconda3\envs\spacyenv\lib\site-packages\spacy\data\en`<br>
> ` `<br>
> `    You can now load the model via spacy.load('en_core_web_sm')`
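
To confirm the setup from Python, a minimal sketch along these lines may help. It relies on `spacy.util.is_package`, which reports whether a model package is installed, so a skipped download produces a hint instead of an exception. The `MODEL` constant is an illustrative name, not part of spaCy.

```python
import spacy
from spacy.util import is_package

MODEL = "en_core_web_sm"  # illustrative constant; use the model you downloaded

print("spaCy version:", spacy.__version__)

if is_package(MODEL):
    # The model package is installed, so loading it should succeed.
    nlp = spacy.load(MODEL)
    print("Loaded", MODEL, "with components:", nlp.pipe_names)
else:
    print(MODEL, "is not installed; run: python -m spacy download", MODEL)
```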


# Working with spaCy in Python

In [15]:
# Import spaCy
import spacy

In [16]:
# and load the language library
nlp = spacy.load("en_core_web_sm")

In [17]:


# Create a Doc object
doc = nlp(u'I am thinking about buying a new mac worth 1280$')

# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_)

I PRON nsubj
am VERB aux
thinking VERB ROOT
about ADP prep
buying VERB pcomp
a DET det
new ADJ amod
mac NOUN dobj
worth ADJ prep
1280 NUM npadvmod
$ SYM advmod


___
# spaCy Objects

After importing the spacy module in the cell above we loaded a **model** and named it `nlp`.<br>Next we created a **Doc** object by applying the model to our text, and named it `doc`.<br>spaCy also builds a companion **Vocab** object that we'll cover in later sections.<br>The **Doc** object that holds the processed text is our focus here.

___
# Pipeline
When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data.<br>(Pipeline diagram; image source: https://spacy.io/usage/spacy-101#pipelines)

In [18]:
nlp.pipeline

[<spacy.tagger.Tagger at 0x22c88b33678>,
 <spacy.pipeline.DependencyParser at 0x22cb5c4c138>,
 <spacy.matcher.Matcher at 0x22cc1307278>,
 <spacy.pipeline.EntityRecognizer at 0x22cc0e0b728>]
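
The raw component objects are not very readable. As a rough sketch, assuming spaCy 2 or later (where `nlp.pipeline` is a list of `(name, component)` pairs), the component names can be listed directly. The fallback to a blank English pipeline is only there so the snippet still runs when the model has not been downloaded.

```python
import spacy
from spacy.util import is_package

MODEL = "en_core_web_sm"

# Fall back to a blank English pipeline (tokenizer only) if the model is missing.
nlp = spacy.load(MODEL) if is_package(MODEL) else spacy.blank("en")

# Component names, in the order they are applied to a Doc.
print(nlp.pipe_names)

# nlp.pipeline pairs each name with its component object.
for name, component in nlp.pipeline:
    print(name, "->", type(component).__name__)
```

The tokenizer does not appear in this list: it always runs first and produces the Doc that the listed components then annotate.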

## Tokenization
The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information. 

In [24]:
doc2 = nlp(u"He's not looking for a job  anymore")

for token in doc2:
    print(token.text, token.pos_, token.dep_)

He PRON nsubj
's VERB aux
not ADV neg
looking VERB ROOT
for ADP prep
a DET det
job NOUN pobj
  SPACE 
anymore ADV advmod


Notice how `He's` has been split into two tokens. spaCy recognizes both the pronoun `He` and the auxiliary verb `'s`, and tags `not` as the negation attached to the root verb. Notice also that the extended whitespace between `job` and `anymore` is assigned its own token.

It's important to note that even though `doc2` contains processed information about each token, it also retains the original text:

In [26]:
doc2

He's not looking for a job  anymore

In [27]:
doc2[0]

He

In [28]:
type(doc2)

spacy.tokens.doc.Doc
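
The original text is retained because tokenization is non-destructive: each token remembers its trailing whitespace. The sketch below illustrates this with a blank English pipeline (`spacy.blank("en")`), which provides the tokenizer without needing a downloaded model. The variable names are illustrative.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained model required

text = "He's not looking for a job  anymore"
doc = nlp(text)

tokens = [token.text for token in doc]
print(tokens)

# text_with_ws keeps each token's trailing whitespace,
# so joining the pieces restores the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
print(doc.text == text)
```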

___
## Part-of-Speech Tagging (POS)
The next step after splitting the text up into tokens is to assign parts of speech. In the above example, `He` was recognized to be a ***pronoun***. Here some statistical modeling is required. For example, words that follow "the" are typically nouns.

For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging

In [29]:
doc2[0].pos_

'PRON'

___
## Dependencies
We also looked at the syntactic dependencies assigned to each token. `He` is identified as an `nsubj` or the ***nominal subject*** of the sentence.

For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
<br>A good explanation of typed dependencies can be found [here](https://nlp.stanford.edu/software/dependencies_manual.pdf)

In [30]:
doc2[0].dep_

'nsubj'

In [31]:
spacy.explain('PROPN')

'proper noun'

In [32]:
spacy.explain('nsubj')

'nominal subject'
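
Since `spacy.explain` is a plain glossary lookup, it works without loading a model. As a small sketch, the loop below builds a legend for the labels that appeared in the `doc2` output earlier; the two lists are simply copied from that output.

```python
import spacy

# Labels copied from the doc2 output shown earlier.
pos_labels = ["PRON", "VERB", "ADV", "ADP", "DET", "NOUN"]
dep_labels = ["nsubj", "aux", "neg", "prep", "det", "pobj", "advmod"]

for label in pos_labels + dep_labels:
    # spacy.explain returns None for labels it has no description for.
    print(f"{label:<8}{spacy.explain(label)}")
```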

___
## Additional Token Attributes
We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`He`|
|`.lemma_`|The base form of the word|`he`|
|`.pos_`|The simple part-of-speech tag|`PRON`/`pronoun`|
|`.tag_`|The detailed part-of-speech tag|`PRP`/`pronoun, personal`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`True`|

In [33]:
# Lemmas (the base form of the word):
print(doc2[4].text)
print(doc2[4].lemma_)

for
for


In [34]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[4].pos_)
print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))

ADP
IN / conjunction, subordinating or preposition


In [35]:
# Word Shapes:
print(doc2[0].text+': '+doc2[0].shape_)
print(doc2[5].text+' : '+doc2[5].shape_)

He: Xx
a : x


In [36]:
# Boolean Values:
print(doc2[0].is_alpha)
print(doc2[0].is_stop)

True
True
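
Attributes such as `.shape_` and `.is_alpha` are lexical, meaning they depend only on the token's text, so they are available even without a trained model. The sketch below prints a few of them for the first example sentence using a blank English pipeline. `like_num` and `is_punct` are additional spaCy token attributes not covered in the table above.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; lexical attributes still work
doc = nlp("I am thinking about buying a new mac worth 1280$")

# One row per token: text, shape, alphabetic?, number-like?, punctuation?
for token in doc:
    print(token.text, token.shape_, token.is_alpha,
          token.like_num, token.is_punct, sep="\t")

# Map each token's text to its shape for quick lookup.
shapes = {token.text: token.shape_ for token in doc}
```

Note that shapes are truncated: runs of the same character class longer than four are collapsed, so `thinking` has the shape `xxxx`.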


___
## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

In [37]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [38]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [39]:
type(life_quote)

spacy.tokens.span.Span
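
A span keeps track of where it sits in its parent Doc, both in token positions and in character offsets. The following sketch, which uses a blank English pipeline and a shortened version of the quote, shows those offsets. The variable names are illustrative.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained model required
doc = nlp("Life is what happens to us while we are making other plans")

span = doc[0:4]  # tokens 0, 1, 2 and 3; the stop index is excluded
print(span.text)

# Token offsets of the span within the Doc
print(span.start, span.end)

# Character offsets into the original text
print(span.start_char, span.end_char)
print(doc.text[span.start_char:span.end_char] == span.text)
```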

___
## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [40]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [41]:
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.
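
The "start of sentence" tag can be inspected through `token.is_sent_start`. The sketch below is not the parser-based segmentation used above: it swaps in spaCy's rule-based `sentencizer` component, which splits on sentence-final punctuation and needs no downloaded model. It assumes spaCy v3, where `add_pipe` takes the component name as a string.

```python
import spacy

nlp = spacy.blank("en")
# spaCy v3 call; in v2 this would be nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. This is another sentence. This is the last sentence.")

# Tokens flagged as the start of a sentence
starts = [token.text for token in doc if token.is_sent_start]
print(starts)

# Doc.sents is a generator of Span objects, so collect it into a list to index it.
sentences = [sent.text for sent in doc.sents]
print(sentences)
```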
