# NLP with spaCy

- Introduction
- Setting up and getting started with spaCy
- spaCy workflow
- spaCy data model
- Processing text
- Text syntax and structure
- Word vectors

## Introduction

### Why spaCy?

- Curated algorithms
- Robust data model
- Efficient in memory and computation
- Interoperability and customization

SpaCy's data model lets annotations accumulate as text moves through the NLP pipeline, while the original text is preserved. SpaCy is object-oriented: objects are created, then mutated and queried to get the work done.

SpaCy performs tokenization, sentence recognition, part-of-speech tagging, lemmatization, dependency parsing, and named entity recognition in a single call.

## Setting up and getting started with spaCy
1. pip install spacy
2. python -m spacy.en.download

The second command downloads the data for the English model. Loading that model gives you the parser, tagger, vocabulary and word vectors.

In [1]:
# import spacy and load english model
import spacy

nlp = spacy.load('en')

In [2]:
# parse text into Document object
doc = nlp("I went to school this morning, but it's Sunday. School is closed! Silly me :)")

Each token is an object with many properties. A property with a trailing underscore returns the string representation, while the same property without the underscore returns an index (int) into spaCy's vocabulary. These are the properties of the Token class:

In [3]:
# print properties and methods of Token class
[prop for prop in dir(doc[0]) if not prop.startswith('_')]

['ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_type',
 'ent_type_',
 'has_repvec',
 'has_vector',
 'head',
 'i',
 'idx',
 'is_alpha',
 'is_ancestor',
 'is_ancestor_of',
 'is_ascii',
 'is_bracket',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_space',
 'is_stop',
 'is_title',
 'lang',
 'lang_',
 'left_edge',
 'lefts',
 'lemma',
 'lemma_',
 'lex_id',
 'like_email',
 'like_num',
 'like_url',
 'lower',
 'lower_',
 'n_lefts',
 'n_rights',
 'nbor',
 'norm',
 'norm_',
 'orth',
 'orth_',
 'pos',
 'pos_',
 'prefix',
 'prefix_',
 'prob',
 'rank',
 'repvec',
 'right_edge',
 'rights',
 'sentiment',
 'shape',
 'shape_',
 'similarity',
 'string',
 'subtree',
 'suffix',
 'suffix_',
 'tag',
 'tag_',
 'text',
 'text_with_ws',
 'vector',
 'vector_norm',
 'vocab',
 'whitespace_']

Now let's check the properties of the Doc class.

In [4]:
print(type(doc))

<class 'spacy.tokens.doc.Doc'>


In [5]:
# print properties and methods of Document class
[prop for prop in dir(doc) if not prop.startswith('_')]

['count_by',
 'doc',
 'ents',
 'from_array',
 'from_bytes',
 'has_vector',
 'is_parsed',
 'is_tagged',
 'mem',
 'merge',
 'noun_chunks',
 'noun_chunks_iterator',
 'read_bytes',
 'sentiment',
 'sents',
 'similarity',
 'string',
 'tensor',
 'text',
 'text_with_ws',
 'to_array',
 'to_bytes',
 'user_data',
 'user_hooks',
 'user_span_hooks',
 'user_token_hooks',
 'vector',
 'vector_norm',
 'vocab']

In [6]:
print(type(doc[0]))

<class 'spacy.tokens.token.Token'>


## spaCy workflow
1. Encode the input text as unicode (Python 3 does this automatically)
2. Instantiate a language model in one of two ways:
    - calling spacy.load('en') with the language id, or
    - instantiating the language class directly with spacy.en.English()
3. The language model, which is a document constructor, then takes a unicode string as its argument and returns a Document object with annotations
![](images/spaCy_workflow.png)

### NLP Pipeline
The language model runs its functions over the document to build the annotations. These functions are stored in the model's pipeline object. The functions include:
- POS tagger
- Dependency parser
- Matcher
- Named entity recognizer

Each function mutates the document in place and once the language model is instantiated, it can be used over and over to process text.
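
To make the idea concrete, here is a minimal pure-Python sketch of a pipeline as a list of functions that annotate one document object in place. It is not spaCy's implementation: `ToyDoc`, `ToyLanguage`, `count_tokens` and `flag_capitalized` are hypothetical names invented for this illustration.

```python
class ToyDoc:
    """Toy stand-in for a document: the text is kept, annotations are added."""

    def __init__(self, text):
        self.text = text
        self.tokens = text.split()   # naive whitespace tokenization
        self.annotations = {}        # filled in by the pipeline functions


def count_tokens(doc):
    # Hypothetical pipeline step: mutates the document in place
    doc.annotations["n_tokens"] = len(doc.tokens)


def flag_capitalized(doc):
    # Hypothetical pipeline step: records the tokens that start with a capital
    doc.annotations["capitalized"] = [t for t in doc.tokens if t[:1].isupper()]


class ToyLanguage:
    """Toy document constructor that runs every pipeline function in order."""

    def __init__(self, pipeline):
        self.pipeline = list(pipeline)

    def __call__(self, text):
        doc = ToyDoc(text)
        for function in self.pipeline:
            function(doc)            # each step annotates the same object
        return doc


toy_nlp = ToyLanguage([count_tokens, flag_capitalized])
doc_a = toy_nlp("I like to watch Star Wars movies")
doc_b = toy_nlp("School is closed")  # the same model object is reused
print(doc_a.annotations)  # {'n_tokens': 7, 'capitalized': ['I', 'Star', 'Wars']}
print(doc_b.annotations)  # {'n_tokens': 3, 'capitalized': ['School']}
```

Because the pipeline is an ordinary list, adding or removing a step only means changing that list.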

In [7]:
# encode text as unicode (python 3 does that automatically)
text = "I'm running some python code on my jupyter notebook"

# initialize language model
nlp = spacy.load('en')

# create a document
document = nlp(text)

In [8]:
# print the default pipeline functions
for function in nlp.pipeline:
    print(function)

<spacy.tagger.Tagger object at 0x11bfc75e8>
<spacy.pipeline.DependencyParser object at 0x11de0d8b8>
<spacy.matcher.Matcher object at 0x153c7b518>
<spacy.pipeline.EntityRecognizer object at 0x1535e4d18>


### Customizing the Language Model

The language model is fully customizable. You can add or remove functions from the pipeline. Let's see an example of this.

In [9]:
# Run the standard language model
starwars_text = 'I like to watch starwars movies.'
document = nlp(starwars_text)
for token in document:
    print(token.orth_, token.tag_)

I PRP
like VBP
to TO
watch VB
starwars NNS
movies NNS
. .


In [10]:
# Modify the default language model
def identify_starwars(doc):
    for token in doc:
        if token.text == 'starwars':
            token.tag_ = 'NNP'
            
def get_new_pipeline(nlp):
    return [nlp.tagger, nlp.parser, nlp.matcher, nlp.entity, identify_starwars]

custom_nlp = spacy.load('en', create_pipeline=get_new_pipeline)
new_document = custom_nlp(starwars_text)

for function in custom_nlp.pipeline:
    print(function)

<spacy.tagger.Tagger object at 0x153f4e318>
<spacy.pipeline.DependencyParser object at 0x155da42c8>
<spacy.matcher.Matcher object at 0x18bc6fc18>
<spacy.pipeline.EntityRecognizer object at 0x18b5bf548>
<function identify_starwars at 0x153c852f0>


In [11]:
for token in new_document:
    print(token.orth_, token.tag_)

I PRP
like VBP
to TO
watch VB
starwars NNP
movies NNS
. .


### Multi-threading

Natural Language Processing is a task that can usually be parallelized, because processing one document is normally independent of processing another. The function nlp.pipe takes a list or generator of texts and returns a generator of document objects, and it can use multiple threads to process the texts. Because it returns a generator, the pipeline only runs once we start consuming that generator.

In [13]:
texts = ['This text will be multiplied ten thousand times.'] * 10000

In [14]:
%%time

for doc in nlp.pipe(texts, batch_size=100, n_threads=4):
    doc.is_parsed

CPU times: user 2.25 s, sys: 122 ms, total: 2.37 s
Wall time: 2.38 s
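
The batching and lazy evaluation described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `toy_pipe` and `toy_process` are hypothetical names, the per-document work is a trivial word count, and pure-Python threads will not reproduce spaCy's speed.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def toy_process(text):
    # Stand-in for the per-document work; here it only counts words
    return len(text.split())


def toy_pipe(texts, batch_size=100, n_threads=4):
    """Lazily yield one result per text, processing one batch at a time."""
    iterator = iter(texts)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        while True:
            batch = list(islice(iterator, batch_size))
            if not batch:
                break
            # pool.map preserves the input order within the batch
            yield from pool.map(toy_process, batch)


toy_texts = ['This text will be multiplied ten thousand times.'] * 10000
results = toy_pipe(toy_texts, batch_size=100, n_threads=4)  # nothing has run yet
first = next(results)          # the first batch is processed now
total = first + sum(results)   # the remaining batches run as they are consumed
print(first, total)            # 8 80000
```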


## spaCy data model

Other NLP libraries like NLTK provide pure functions for tasks like tokenization and POS tagging. The output of those functions replaces the input, so the original text is lost along the way. One key difference is that spaCy takes an object-oriented approach, where objects are created, then mutated through added metadata and queried to get work done. To prevent data inconsistencies, spaCy stores the data only once and provides different pointers and views to that data.

The central repository in spaCy is called the StringStore. The StringStore is a huge list of strings containing words and annotations, such as POS tags. This way, a token only holds a few integer references that say how to retrieve its values. This helps reduce the memory footprint of a document.

In [15]:
# properties without '_' at the end are indexes to the StringStore
doc = nlp("Let's take a look at how spaCy's data is organized.")
token = doc[0]
print(token.pos, token.pos_)

98 VERB


In [16]:
# access the same index through the StringStore
POS_index = token.pos
StringStore = token.vocab.strings

value = StringStore[POS_index]
print(value)

VERB


In [17]:
# print a piece of the StringStore
for i in [1, 2, 3, 4, 5, 6, 82, 83, 84, 100, 101, 102, 200, 500, 501, 502, 1000, 1001, 1002, 1003, 5000, 7000, 50000]:
    print(i, StringStore[i])

1 IS_ALPHA
2 IS_ASCII
3 IS_DIGIT
4 IS_LOWER
5 IS_PUNCT
6 IS_SPACE
82 ADJ
83 ADP
84 ADV
100 EOL
101 SPACE
102 Animacy_anim
200 PronType_ind
500 en
501 the
502 xxx
1000 igh
1001 Is
1002 Bush
1003 bush
5000 guaranteed
7000 July
50000 debt-discounting
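
The idea of storing each string once and referring to it by an integer can be sketched in a few lines. This is a toy version under stated assumptions, not spaCy's StringStore: the class name `ToyStringStore` and its `add` method are hypothetical.

```python
class ToyStringStore:
    """Stores each distinct string once and hands out integer IDs."""

    def __init__(self):
        self._strings = []   # ID -> string
        self._ids = {}       # string -> ID

    def add(self, string):
        # Only the first occurrence of a string takes up a new slot
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def __getitem__(self, key):
        # Integer keys return the string; string keys return the ID
        if isinstance(key, int):
            return self._strings[key]
        return self._ids[key]

    def __len__(self):
        return len(self._strings)


store = ToyStringStore()
words = "the lazy dog saw the other lazy dog".split()
token_ids = [store.add(w) for w in words]

print(token_ids)                                # [0, 1, 2, 3, 0, 4, 1, 2]
print(len(store))                               # 5 distinct strings for 8 tokens
print([store[i] for i in token_ids] == words)   # True
```

The document only has to keep the list of integers; repeated words cost one integer each instead of a new string.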


Here is a summarized representation of spaCy's data model.

![](images/spaCy_data_model.png)

Many of the attributes stored in a token are context-specific, which means that we need information from the document to determine their values.

- *ex: "hit" can be a noun or a verb*

SpaCy has data structures called *lexemes* to store all non-context specific information about a word, like its lower case form, shape, suffix, etc.

Similarly, a document simply points to its tokens. A document also provides spans, which are subsequences of tokens such as multi-token entities and noun chunks.

![](images/spaCy_data_model_2.png)
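
The split between shared lexemes, context-specific tokens, and spans as views can be sketched as follows. All class names (`ToyLexeme`, `ToyToken`, `ToyVocab`, `ToySpan`) are hypothetical, the POS tags are assigned by hand, and the three-character suffix is a choice made for this toy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLexeme:
    # Context-independent facts about a word type
    text: str
    lower: str
    suffix: str


@dataclass
class ToyToken:
    lexeme: ToyLexeme   # shared by every occurrence of the word
    pos: str            # context-specific, so it lives on the token


class ToyVocab:
    """Creates each lexeme once and returns the same object afterwards."""

    def __init__(self):
        self._lexemes = {}

    def __getitem__(self, text):
        if text not in self._lexemes:
            self._lexemes[text] = ToyLexeme(text, text.lower(), text[-3:])
        return self._lexemes[text]


class ToySpan:
    """A view over a slice of the document's tokens; nothing is copied."""

    def __init__(self, tokens, start, end):
        self.tokens, self.start, self.end = tokens, start, end

    def __iter__(self):
        return (self.tokens[i] for i in range(self.start, self.end))


vocab = ToyVocab()
# Hand-assigned tags for illustration: "hit" is a verb first, then a noun
tagged = [("They", "PRON"), ("hit", "VERB"), ("a", "DET"), ("hit", "NOUN")]
tokens = [ToyToken(vocab[text], pos) for text, pos in tagged]

print(tokens[1].lexeme is tokens[3].lexeme)   # True: one lexeme, two tokens
print(tokens[1].pos, tokens[3].pos)           # VERB NOUN
span = ToySpan(tokens, 2, 4)
print([t.lexeme.text for t in span])          # ['a', 'hit']
```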

## Processing Text

* Tokenizing words, spans and sentences
* Removing stopwords
* Removing punctuation
* Lemmatization

### Getting Tokens and Spans

In [18]:
doc = nlp("The brown fox is quick and he is jumping over the lazy dog")

# print all the tokens
print([token for token in doc])

[The, brown, fox, is, quick, and, he, is, jumping, over, the, lazy, dog]


In [19]:
# another way to do this is indexing the doc
print('Number of tokens: {}'.format(len(doc)))
print('First token: {}'.format(doc[0]))
print('Last token: {}'.format(doc[-1]))
print('Tokens 2 through 4: {}'.format(doc[1:4]))

Number of tokens: 13
First token: The
Last token: dog
Tokens 2 through 4: brown fox is


In [20]:
# print all the noun chunks
for chunk in doc.noun_chunks:
    print(chunk)

The brown fox
he
the lazy dog


### Getting Sentences

- spaCy performs sentence boundary detection to automatically detect sentences
- calling document.sents returns an iterable of sentences
- because spaCy returns generators, it postpones doing work until we actually need results
    - this allows us to process individual elements of an array without having to load everything to memory all at once

In [21]:
doc = nlp("I went to school this morning, but it's Sunday. School is closed! Silly me :)")

# print all the sentences
for sent in doc.sents:
    print(sent)

I went to school this morning, but it's Sunday.
School is closed!
Silly me :)
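
The following sketch shows what "postponing work" means for a generator of sentences. It uses a naive regular expression rather than spaCy's sentence boundary detection, and `toy_sentences` is a hypothetical helper written for this illustration.

```python
import re


def toy_sentences(text):
    """Yield sentences one at a time, splitting after '.', '!' or '?'."""
    start = 0
    for match in re.finditer(r"(?<=[.!?])\s+", text):
        yield text[start:match.start()]
        start = match.end()
    if start < len(text):
        yield text[start:]


toy_text = "I went to school this morning, but it's Sunday. School is closed! Silly me :)"
sentences = toy_sentences(toy_text)   # no splitting has happened yet
print(next(sentences))                # only the first sentence is computed
print(list(sentences))                # the rest are produced on demand
```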


### Removing Stop Words

In [22]:
# print all of the stop words in the document
print([token for token in doc if token.is_stop])

[I, to, this, but, it, is, me]


In [23]:
# remove the stop words
doc_1 = [token for token in doc if not token.is_stop]
print(doc_1)

[went, school, morning, ,, 's, Sunday, ., School, closed, !, Silly, :)]


### Removing Punctuation

In [24]:
# print all punctuation in the document
print([token for token in doc if token.is_punct])

[,, ., !, :)]


In [25]:
doc_2 = [token for token in doc_1 if not token.is_punct]
print(doc_2)

[went, school, morning, 's, Sunday, School, closed, Silly]
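
The two filters can also be applied in a single pass. The sketch below works on plain strings copied from the tokenization above, so it runs without a model; `TOY_STOP_WORDS`, `is_stop` and `is_punct` are hypothetical stand-ins for spaCy's `is_stop` and `is_punct` flags, and the stop list is deliberately tiny.

```python
import string

# Tokens copied from the tokenization shown above
tokens = ["I", "went", "to", "school", "this", "morning", ",", "but", "it",
          "'s", "Sunday", ".", "School", "is", "closed", "!", "Silly", "me", ":)"]

# Hypothetical, tiny stop list chosen to match the stop words printed above
TOY_STOP_WORDS = {"i", "to", "this", "but", "it", "is", "me"}


def is_stop(token):
    return token.lower() in TOY_STOP_WORDS


def is_punct(token):
    # True when every character is ASCII punctuation, e.g. "," or ":)"
    return all(ch in string.punctuation for ch in token)


content = [t for t in tokens if not is_stop(t) and not is_punct(t)]
print(content)
# ['went', 'school', 'morning', "'s", 'Sunday', 'School', 'closed', 'Silly']
```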


In [28]:
doc2 = nlp('Hey, this is gr8! I rly enjoy it!')

for token in doc2:
    print(token.text, token.pos_)

Hey INTJ
, PUNCT
this DET
is VERB
gr8 NOUN
! PUNCT
I PRON
rly ADV
enjoy VERB
it PRON
! PUNCT


### Lemmatization

In [29]:
lemmas = [token.lemma_ for token in doc_2]
print(lemmas)

['go', 'school', 'morning', 'be', 'sunday', 'school', 'closed', 'silly']


## Text Syntax and Structure

* Part-of-Speech (POS) tagging
* Dependency parsing
* Named entity recognition

### POS tagging
* allows you to know what is a verb and what is a noun
* can't always rely on a lexicon (ex: hit can be a noun or verb)
    * "You can verb anything." William Safire
* Can be achieved with dependency parsing

In [30]:
print('TOKEN\t:\tPTB_TAG\t:\tUNIVERSAL_TAG')
print('---------------------------------------------')
for token in doc:
    print('{}\t:\t{}\t:\t{}'.format(token, token.tag_, token.pos_))

TOKEN	:	PTB_TAG	:	UNIVERSAL_TAG
---------------------------------------------
I	:	PRP	:	PRON
went	:	VBD	:	VERB
to	:	IN	:	ADP
school	:	NN	:	NOUN
this	:	DT	:	DET
morning	:	NN	:	NOUN
,	:	,	:	PUNCT
but	:	CC	:	CCONJ
it	:	PRP	:	PRON
's	:	VBZ	:	VERB
Sunday	:	NNP	:	PROPN
.	:	.	:	PUNCT
School	:	NNP	:	PROPN
is	:	VBZ	:	VERB
closed	:	JJ	:	ADJ
!	:	.	:	PUNCT
Silly	:	VB	:	VERB
me	:	PRP	:	PRON
:)	:	.	:	PUNCT


### Dependency Parsing
The sentence meaning comes from combining *chunks* of meaning. Think of words as functions, where they have arguments (dependent words).

*ex: "She hit the wall."*

- "hit" requires a hitter (nsubj) and a hittee (direct object)
- "hit" defines how this sentence flows

![](images/displacy_example.png?raw=true)

In [31]:
doc2 = nlp('The brown fox is quick and he is jumping over the lazy dog')

def print_dep_parsing(doc):
    header = 'LEFT <--- WORD[WORD_TYPE] --> RIGHT\n-----------------------------------'
    dep_pattern = '{left} <--- {word}[{word_type}] ---> {right}'
    print(header)
    for token in doc:
        print(dep_pattern.format(
            word=token.orth_,
            word_type=token.dep_,
            left=[t.orth_ for t in token.lefts],
            right=[t.orth_ for t in token.rights]
        ))

print_dep_parsing(doc2)

LEFT <--- WORD[WORD_TYPE] --> RIGHT
-----------------------------------
[] <--- The[det] ---> []
[] <--- brown[amod] ---> []
['The', 'brown'] <--- fox[nsubj] ---> []
['fox'] <--- is[ROOT] ---> ['quick', 'jumping']
[] <--- quick[acomp] ---> ['and']
[] <--- and[cc] ---> []
[] <--- he[nsubj] ---> []
[] <--- is[aux] ---> []
['he', 'is'] <--- jumping[ccomp] ---> ['over']
[] <--- over[prep] ---> ['dog']
[] <--- the[det] ---> []
[] <--- lazy[amod] ---> []
['the', 'lazy'] <--- dog[pobj] ---> []


A better way to visualize the dependency parse is to use [displaCy](https://demos.explosion.ai/displacy/).
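
A dependency parse can be stored as one head index per word. The sketch below rebuilds the left and right children from such a list without running the parser. The `heads` values are read off the output printed above, and the helper names `lefts`, `rights` and `subtree` are plain functions written for this illustration, not spaCy's implementation.

```python
words = ["The", "brown", "fox", "is", "quick", "and", "he", "is",
         "jumping", "over", "the", "lazy", "dog"]
# heads[i] is the index of the head of word i; the root points to itself
heads = [2, 2, 3, 3, 3, 4, 8, 8, 3, 8, 12, 12, 9]


def lefts(i):
    # Dependents of word i that appear before it
    return [words[j] for j in range(i) if heads[j] == i]


def rights(i):
    # Dependents of word i that appear after it
    return [words[j] for j in range(i + 1, len(words)) if heads[j] == i]


def subtree(i):
    """All words dominated by word i, in sentence order."""
    left = [w for j in range(i) if heads[j] == i for w in subtree(j)]
    right = [w for j in range(i + 1, len(words)) if heads[j] == i
             for w in subtree(j)]
    return left + [words[i]] + right


root = next(i for i, h in enumerate(heads) if h == i)
print(words[root], lefts(root), rights(root))  # is ['fox'] ['quick', 'jumping']
print(subtree(8))  # ['he', 'is', 'jumping', 'over', 'the', 'lazy', 'dog']
```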

### Entity Recognition
Named entities can be accessed through doc.ents. Entity recognition finds spans that refer to specific things, such as people, locations and organizations, as well as expressions such as times.

In [32]:
doc = nlp('I flew to New York in the morning, got breakfast at Culture Espresso and met Susan at the Central Park.')

print('TOKEN : NAMED_ENTITY')
print('--------------------')
for ent in doc.ents:
    print('{} : {}'.format(ent, ent.label_))

TOKEN : NAMED_ENTITY
--------------------
New York : GPE
morning : TIME
Culture Espresso : ORG
Susan : PERSON
the Central Park : LOC
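
At the token level, entities are commonly encoded with IOB tags (B = begins an entity, I = inside an entity, O = outside), which is what the `ent_iob_` and `ent_type_` token properties listed earlier expose. The sketch below groups such tags back into entity spans. The tags are written by hand to match the output above rather than produced by the model, and `iob_to_spans` is a hypothetical helper.

```python
def iob_to_spans(tagged_tokens):
    """Group (token, iob, type) triples into (text, label) entity spans."""
    spans, current, label = [], [], None
    for token, iob, ent_type in tagged_tokens:
        if iob == "B" or (iob == "I" and not current):
            # A new entity starts; close the previous one if it is still open
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], ent_type
        elif iob == "I":
            current.append(token)
        else:
            # "O" closes any open entity
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


# Hand-written tags that mirror the entities printed above
tagged = [
    ("I", "O", ""), ("flew", "O", ""), ("to", "O", ""),
    ("New", "B", "GPE"), ("York", "I", "GPE"), ("in", "O", ""),
    ("the", "O", ""), ("morning", "B", "TIME"), (",", "O", ""),
    ("got", "O", ""), ("breakfast", "O", ""), ("at", "O", ""),
    ("Culture", "B", "ORG"), ("Espresso", "I", "ORG"), ("and", "O", ""),
    ("met", "O", ""), ("Susan", "B", "PERSON"), ("at", "O", ""),
    ("the", "B", "LOC"), ("Central", "I", "LOC"), ("Park", "I", "LOC"),
    (".", "O", ""),
]

for text, ent_label in iob_to_spans(tagged):
    print('{} : {}'.format(text, ent_label))
```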


## Word vectors

SpaCy makes using word vectors really easy. The English model installs vectors for 1 million vocabulary entries, using 300-dimensional vectors trained on the Common Crawl corpus using the [GloVe](https://nlp.stanford.edu/projects/glove/) algorithm.

You can access the vector from a Lexeme, Token, Span or Doc object through the '.vector' property.

In [33]:
doc = nlp('Math quiz: do you remember your linear algebra classes?')
print(doc[0].vector.shape)

(300,)
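
For a Span or Doc, the default vector is the average of its token vectors. A minimal sketch of that averaging, using made-up 3-dimensional vectors instead of the 300-dimensional ones (`average_vector` and `toy_vectors` are hypothetical names, and the numbers are invented):

```python
def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Invented vectors, for illustration only
toy_vectors = {"linear": [1.0, 0.0, 2.0], "algebra": [3.0, 2.0, 0.0]}
span_vector = average_vector([toy_vectors["linear"], toy_vectors["algebra"]])
print(span_vector)  # [2.0, 1.0, 1.0]
```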


In [34]:
print(doc[0].vector)

[-0.16111     0.37452999  0.032565   -0.040995   -0.0052605   0.15727
 -0.2818      0.23202001  0.31810999  1.56599998 -0.28884    -0.11199
  0.31402999  0.20852    -0.0071106   0.21749     0.24699999  1.51549995
 -0.50665998  0.069402    0.27742001 -0.35484001  0.214      -0.016425
  0.21309    -0.34055001  0.042495    0.094604    0.47981    -0.0035279
 -0.17168     0.16067     0.64766002 -0.15987     0.37630999  0.29563001
  0.24669001  0.64608997 -0.10777    -0.068073    0.25356001 -0.66258001
 -0.38949999 -0.13439     0.17922001  0.2665      0.12296     0.36366999
  0.35499001 -0.17848    -0.49405     0.30045     0.52583998  0.45199999
 -0.77667999 -0.05694     0.21585     0.19785    -0.094671    0.11303
 -0.20819999  0.042582   -0.013595   -0.22459     0.46709001  0.077615
 -0.55409998  0.86278999  0.92205     0.13732    -0.34322    -0.16752
  0.088037   -0.60540998 -0.11989     0.73468     0.16678999  0.34481999
  0.10886     1.10099995 -0.49709001 -0.38852     0.59628999  0.3600

### Word Similarity

SpaCy also has a convenient way of calculating the cosine similarity between words, spans and documents. All we have to do is call the *.similarity* method on the object.

In [35]:
# get a few words and spans from the last utterance
linear_algebra = doc[-4:-2]
classes = doc[-2]
math = doc[0]
print(classes)

classes


In [37]:
math.similarity(linear_algebra)

0.35822012113274243
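
Cosine similarity is the dot product of two vectors divided by the product of their lengths: cos(u, v) = u·v / (‖u‖ ‖v‖). A minimal pure-Python sketch follows. `cosine_similarity` is a helper written for this example, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0   # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))     # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))     # 0.0  (orthogonal)
print(cosine_similarity([3.0, 4.0], [-3.0, -4.0]))   # -1.0 (opposite direction)
```

The result depends only on the direction of the vectors, not on their length.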

We can also calculate the similarity between words from the vocabulary.

In [38]:
# get words from the vocabulary
happy = nlp.vocab['happy']
excited = nlp.vocab['excited']
sad = nlp.vocab['sad']
green = nlp.vocab['green']

In [41]:
# compare similarity
happy.similarity(green)

0.31557633403397889

In [42]:
def get_similar_words(word):
    
    # get all the words from the vocabulary, only lower case
    all_words = [w for w in nlp.vocab if w.has_vector
                 and w.orth_.islower() and w.lower_ != word.lower_]
    
    # sort the words based on similarity
    sorted_list = sorted(all_words, key=word.similarity, reverse=True)
    
    return sorted_list

In [44]:
pizza = nlp.vocab['pizza']
for word in get_similar_words(pizza)[:10]:
    print(word.orth_)

pasta
burger
sandwiches
sandwich
burgers
cheese
bread
steak
fries
salad
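
Sorting the whole vocabulary just to keep ten words does more work than needed. As a sketch under stated assumptions, `heapq.nlargest` can select the top k entries directly. The 3-dimensional vectors below are invented for the example, and `most_similar` and `cosine` are hypothetical helpers, not part of spaCy.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


# Invented vectors, for illustration only
toy_vectors = {
    "pizza":  [0.9, 0.8, 0.1],
    "pasta":  [0.8, 0.9, 0.2],
    "burger": [0.7, 0.6, 0.3],
    "car":    [0.1, 0.2, 0.9],
    "green":  [0.2, 0.1, 0.5],
}


def most_similar(word, vectors, k=10):
    """Return the k words whose vectors are closest to the vector of `word`."""
    query = vectors[word]
    candidates = (w for w in vectors if w != word)
    return heapq.nlargest(k, candidates,
                          key=lambda w: cosine(query, vectors[w]))


print(most_similar("pizza", toy_vectors, k=2))  # ['pasta', 'burger']
```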
