In [1]:
import warnings
warnings.filterwarnings("ignore")

# spaCy

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It's designed specifically for production use and helps you build applications that process and "understand" large volumes of text.

[Documentation](https://spacy.io)

Reference: [DataCamp](https://www.datacamp.com/community/blog/spacy-cheatsheet)

**Download statistical models**

Statistical models predict part-of-speech tags, dependency labels, named entities and more. See the [documentation](https://spacy.io) for available models.

In [2]:
# ! python -m spacy download en_core_web_sm

**Check that your installed models are up to date**

In [3]:
# ! python -m spacy validate

**Loading statistical models**

In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")

**Processing text**

Processing text with the nlp object returns a Doc object that holds all information about the tokens, their linguistic features and their relationships.

In [5]:
doc = nlp("This is a text")

**Accessing token attributes**

In [6]:
[token.text for token in doc]

['This', 'is', 'a', 'text']

**Accessing spans**

The end index of a span is exclusive. So `doc[2:4]` is a span starting at token 2 and running up to, but not including, token 4.

In [7]:
span = doc[2:4]
span.text

'a text'
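The same half-open convention applies to ordinary Python slices, so the index arithmetic can be checked without spaCy. The following is a minimal sketch in which a plain list (`tokens`, an illustrative stand-in and not a spaCy object) takes the place of the `Doc`:

```python
# Plain list standing in for the tokens of nlp("This is a text")
tokens = ["This", "is", "a", "text"]

start, end = 2, 4
span_tokens = tokens[start:end]  # includes indices 2 and 3, excludes 4

# The length of a half-open slice is simply end - start
span_length = end - start

# Adjacent slices share a boundary without overlapping
left, right = tokens[:start], tokens[start:]

print(span_tokens)
print(span_length)
print(left + right == tokens)
```

Because the end index is excluded, `doc[:2]` and `doc[2:]` together cover every token exactly once.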

**Creating a span manually**

In [8]:
from spacy.tokens import Span
doc = nlp("I live in New York")

# Span for "New York" with label GPE (geopolitical)
span = Span(doc, 3, 5, label="GPE")
span.text

'New York'

**Part-of-speech tags (predicted by statistical model)**

In [9]:
doc = nlp("This is a text.")

# Coarse-grained part-of-speech tags
print([token.pos_ for token in doc])

# Fine-grained part-of-speech tags
print([token.tag_ for token in doc])

['DET', 'AUX', 'DET', 'NOUN', 'PUNCT']
['DT', 'VBZ', 'DT', 'NN', '.']


**Syntactic dependencies (predicted by statistical model)**

In [10]:
# Dependency labels
print([token.dep_ for token in doc])

# Syntactic head token (governor)
print([token.head.text for token in doc])

['nsubj', 'ROOT', 'det', 'attr', 'punct']
['is', 'is', 'text', 'is', 'is']
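The output above shows that the root token ("is") has itself as its head. A small sketch of how head links can be followed, using plain lists copied from that output; expressing the heads as token indices is an assumption made here so that repeated words stay unambiguous, and the helper names are hypothetical:

```python
# Values copied from the output above, with heads given as token indices
tokens = ["This", "is", "a", "text", "."]
deps = ["nsubj", "ROOT", "det", "attr", "punct"]
heads = [1, 1, 3, 1, 1]  # index of each token's head; the root points to itself


def find_root(heads):
    """Return the index of the token that is its own head."""
    for i, head in enumerate(heads):
        if i == head:
            return i
    raise ValueError("no root found")


def children_of(index, heads):
    """Return indices of tokens whose head is `index` (the root is not its own child)."""
    return [i for i, head in enumerate(heads) if head == index and i != index]


def path_to_root(index, heads):
    """Follow head links from a token up to the root."""
    path = [index]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path


root = find_root(heads)
print(tokens[root], deps[root])
print([tokens[i] for i in children_of(root, heads)])
print([tokens[i] for i in path_to_root(2, heads)])
```

With a real `Doc`, the same information is available through `token.head`, as in the cell above.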


**Named Entities (predicted by statistical model)**

In [11]:
doc = nlp("Larry Page founded Google")
# Text and label of named entity span
[(ent.text, ent.label_) for ent in doc.ents]

[('Larry Page', 'PERSON'), ('Google', 'ORG')]

**Sentences (usually needs the dependency parser)**

In [12]:
doc = nlp("This is a sentence. This is another one.")
# doc.sents is a generator that yields sentence spans
[sent.text for sent in doc.sents]

['This is a sentence.', 'This is another one.']

**Base noun phrases (needs the tagger and parser)**

In [13]:
doc = nlp("I have a red car")
# doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]

['I', 'a red car']

**Label explanations**

In [14]:
print(spacy.explain("RB"))
print(spacy.explain("GPE"))

adverb
Countries, cities, states


**Visualizing**

In [15]:
from spacy import displacy

doc = nlp("This is a sentence")
displacy.render(doc, style="dep")

In [16]:
doc = nlp("Larry Page founded Google")
displacy.render(doc, style="ent")

**Word vectors and similarity**

In [17]:
doc1 = nlp("I like cats")
doc2 = nlp("I like dogs")

# Compare 2 documents
print(doc1.similarity(doc2))
# Compare 2 tokens
print(doc1[2].similarity(doc2[2]))
# Compare tokens and spans
print(doc1[0].similarity(doc2[1:3]))

0.8868963431185917
0.71163386
0.004555256
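By default, the score returned by `similarity` is the cosine similarity of the two vectors. The sketch below computes that formula in plain Python; the three-dimensional vectors are made-up values chosen for illustration, not output from any model:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat an all-zero vector as having no similarity to anything
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vectors, chosen only to show the behaviour
cats = [1.0, 2.0, 0.5]
dogs = [0.9, 2.1, 0.4]
coffee = [-2.0, 1.0, 0.0]

print(cosine_similarity(cats, dogs))    # close to 1: vectors point the same way
print(cosine_similarity(cats, coffee))  # 0: vectors are orthogonal
```

The score depends only on the direction of the vectors, not on their length, which is why `vector_norm` is exposed separately.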


**Accessing word vectors**

In [18]:
doc = nlp("I like cats")

print(doc[2].vector)
print(doc[2].vector_norm)

[-0.56020373  0.5598117  -1.7227595  -1.4512374  -1.8633732   1.5181652
  2.4497442   2.71671     0.7253275  -0.91191405 -2.365035    1.452349
  0.99665594 -0.07202923  0.45251393 -0.627396    0.09271556 -0.923568
  1.043541   -2.3212037   2.518638    1.0047516  -1.465948    0.16524512
  2.3367646   1.0068069  -0.40840077 -1.7979589  -0.01955447 -3.6745605
 -0.33341095 -1.1460736   0.02087694 -1.0230592  -1.0785705  -0.4837411
  3.5818644  -1.6769918  -0.6168177  -0.44305182  1.1309762  -3.375774
 -0.7314371   3.8165627  -1.2384409  -1.840332    1.9778384  -0.99668455
 -0.91630244  0.9320802   2.6079159   2.8978107   2.014973   -0.4275763
 -2.230522    1.5558622  -3.321266    1.4540917  -1.2626281   1.5063637
 -1.4935691   1.3518717   0.72768307 -2.4933002   2.5934777  -2.8751411
  0.70920813 -4.5025535  -1.4368149  -0.4430135  -1.9885433  -2.3653316
  0.662556    2.171824    0.14537    -0.92470443  1.3664596   0.4993388
  3.4350014  -0.93349695  3.529655    1.2558105  -1.772904    0.4

**Pipeline components**

Functions that take a Doc object, modify it and return it.

![spacy](../meta/spacy.png)

Name | Description | Creates
:---|:---|:---
tagger | Part-of-speech tagger | Token.tag
parser | Dependency parser | Token.dep, Token.head, Doc.sents, Doc.noun_chunks
ner | Named entity recognizer | Doc.ents, Token.ent_iob, Token.ent_type
textcat | Text classifier | Doc.cats

* The pipeline components are listed, in order, in the model's meta.json
* Built-in components need binary data (the trained model weights) to make predictions

**Pipeline information**

In [19]:
print(nlp.pipe_names)

print(nlp.pipeline)

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x000001AD0C73A088>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x000001AD0C681108>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x000001AD0C6811C8>)]


**Custom components**

In [20]:
# Function that modifies the doc and returns it
def custom_component(doc):
    print("Do something to the doc here!")
    return doc

# Add the component first in the pipeline
# nlp.add_pipe(custom_component, first=True)

Components can be added first, last (default), or before or after an existing component.
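To make the ordering options concrete, here is a toy pipeline written in plain Python. `ToyPipeline`, `make_component` and the dict used as a document are all hypothetical stand-ins; this is a sketch of the idea, not spaCy's implementation:

```python
class ToyPipeline:
    """Hypothetical stand-in for an nlp object: an ordered list of (name, function) pairs."""

    def __init__(self):
        self.components = []

    @property
    def pipe_names(self):
        return [name for name, _ in self.components]

    def add_pipe(self, func, name, first=False, before=None, after=None):
        # Work out the insertion index; appending last is the default
        if first:
            index = 0
        elif before is not None:
            index = self.pipe_names.index(before)
        elif after is not None:
            index = self.pipe_names.index(after) + 1
        else:
            index = len(self.components)
        self.components.insert(index, (name, func))

    def __call__(self, doc):
        # Each component receives the doc, may modify it, and returns it
        for _, func in self.components:
            doc = func(doc)
        return doc


def make_component(label):
    """Build a component that records its own name on the doc."""
    def component(doc):
        doc["seen_by"].append(label)
        return doc
    return component


pipeline = ToyPipeline()
for name in ("tagger", "parser", "ner"):
    pipeline.add_pipe(make_component(name), name)
pipeline.add_pipe(make_component("custom_first"), "custom_first", first=True)
pipeline.add_pipe(make_component("custom_before_ner"), "custom_before_ner", before="ner")

doc = pipeline({"text": "This is a text", "seen_by": []})
print(pipeline.pipe_names)
print(doc["seen_by"])
```

The order in which components ran (`doc["seen_by"]`) matches `pipe_names`, which is why position matters when a custom component relies on annotations set by an earlier one.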

**Rule-based matching**

In [21]:
from spacy.matcher import Matcher

# Each dict represents one token and its attributes
matcher = Matcher(nlp.vocab)

# Add with ID, optional callback and pattern(s)
pattern = [{"LOWER": "new"}, {"LOWER": "york"}]
matcher.add('CITIES', None, pattern)

# Match by calling the matcher on a Doc object
doc = nlp("I live in New York")
matches = matcher(doc)

# Matches are (match_id, start, end) tuples
for match_id, start, end in matches:
    # Get the matched span by slicing the Doc
    span = doc[start:end]
    print(span.text)

New York


In [22]:
# "love cats", "loving cats", "loved cats"
pattern1 = [{"LEMMA": "love"}, {"LOWER": "cats"}]
# "10 people", "twenty people"
pattern2 = [{"LIKE_NUM": True}, {"TEXT": "people"}]
# "book", "a cat", "the sea" (optional article + noun)
pattern3 = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]

**Using operators and quantifiers**

Operator | Description
:---:|:---
`{'OP': '!'}` | Negation: match 0 times
`{'OP': '?'}` | Optional: match 0 or 1 times
`{'OP': '+'}` | Match 1 or more times
`{'OP': '*'}` | Match 0 or more times

**Shared vocab and string store**

Vocab: stores data shared across multiple documents
* To save memory, spaCy encodes all strings to hash values
* Strings are only stored once in the StringStore via `nlp.vocab.strings`
* String store: lookup table in both directions
* Hashes can't be reversed – that's why we need to provide the shared vocab
* The doc also exposes the vocab and strings

In [23]:
doc = nlp("I love coffee")
print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])

print('hash value:', doc.vocab.strings['coffee'])

hash value: 3197928453018144401
string value: coffee
hash value: 3197928453018144401
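The two-way lookup, and the reason a hash alone is not enough, can be sketched with a toy string store. `ToyStringStore` is hypothetical, and the hash function used here (`hashlib.blake2b` truncated to 64 bits) is an assumption for illustration, not the function spaCy uses:

```python
import hashlib


class ToyStringStore:
    """Hypothetical two-way lookup table: string -> hash and hash -> string."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(string):
        # Assumption: any stable 64-bit hash is good enough for this sketch
        digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, string):
        key = self._hash(string)
        self._by_hash[key] = string  # the string itself is stored only once
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._hash(key)  # string -> hash can always be computed
        return self._by_hash[key]   # hash -> string needs a stored entry


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store["coffee"] == coffee_hash)
print(store[coffee_hash])

# A store that has never seen "coffee" cannot turn the hash back into text
other_store = ToyStringStore()
try:
    other_store[coffee_hash]
    resolved = True
except KeyError:
    resolved = False
print(resolved)
```

The failed lookup in `other_store` is the situation the shared vocab avoids: the hash can only be resolved by a store that has already seen the string.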


**Lexemes: entries in the vocabulary**
    
A Lexeme object is an entry in the vocabulary
* Contains the context-independent information about a word
* Word text: lexeme.text and lexeme.orth (the hash)
* Lexical attributes like lexeme.is_alpha
* Not context-dependent part-of-speech tags, dependencies or entity labels

In [24]:
lexeme = nlp.vocab['coffee']
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


**Vocab, hashes and lexemes**

![](../meta/spacy-lexeme.png)

**Processing large volumes of text**
* Use nlp.pipe method
* Processes texts as a stream, yields Doc objects
* Much faster than calling nlp on each text

BAD:
```
docs = [nlp(text) for text in LOTS_OF_TEXTS]
```
GOOD:
```
docs = list(nlp.pipe(LOTS_OF_TEXTS))
```

**Passing in context**

* Setting as_tuples=True on nlp.pipe lets you pass in (text, context) tuples
* Yields (doc, context) tuples
* Useful for associating metadata with the doc


In [25]:
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]
for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])

This is a text 15
And another text 16
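The streaming behaviour and the way context is carried along can be sketched with a plain generator. `toy_pipe` and `toy_process` are hypothetical stand-ins; the sketch does not reproduce the internal batching that makes `nlp.pipe` fast, only the lazy, one-at-a-time interface:

```python
def toy_pipe(process, items, as_tuples=False):
    """Lazily apply `process` to each text; with as_tuples, pass the context through."""
    for item in items:
        if as_tuples:
            text, context = item
            yield process(text), context
        else:
            yield process(item)


processed = []


def toy_process(text):
    # Stand-in for the real pipeline: record the call and split on whitespace
    processed.append(text)
    return text.split()


data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

results = toy_pipe(toy_process, data, as_tuples=True)
first_doc, first_context = next(results)
print(first_doc, first_context["page_number"])
print(len(processed))  # only one text has been processed so far

remaining = list(results)
print(len(processed))
```

Because the generator is consumed lazily, wrapping it in `list(...)` is only needed when all results must be held in memory at once.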


In [26]:
from spacy.tokens import Doc
Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]
for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']

**Using only the tokenizer**

* Use nlp.make_doc to turn a text into a Doc object
* Use nlp.disable_pipes to temporarily disable one or more pipes
* Only the remaining components run inside the with block
* The disabled pipes are restored after the with block

In [27]:
text = "Larry Page founded Google"
with nlp.disable_pipes('tagger', 'parser'):
    doc = nlp(text)
    print(doc.ents)

(Larry Page, Google)


**Training and updating models**

**Why update the model?**

* Better results on your specific domain
* Learn classification schemes specifically for your problem
* Essential for text classification
* Very useful for named entity recognition
* Less critical for part-of-speech tagging and dependency parsing

**How training works**

1. Initialize the model weights randomly with nlp.begin_training
2. Predict a few examples with the current weights by calling nlp.update
3. Compare prediction with true labels
4. Calculate how to change weights to improve predictions
5. Update weights slightly
6. Go back to 2
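The six steps above are the usual gradient-descent loop. A minimal sketch on a one-weight toy model, where the data, learning rate and squared-error loss are all assumptions chosen for illustration and have nothing to do with spaCy's own models:

```python
import random

random.seed(0)

# Toy data for the relationship y = 2 * x
examples = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

# 1. Initialize the weight randomly
weight = random.uniform(-1.0, 1.0)
learning_rate = 0.01

for step in range(200):
    random.shuffle(examples)
    for x, y_true in examples:
        y_pred = weight * x                 # 2. Predict with the current weight
        error = y_pred - y_true             # 3. Compare prediction with the true label
        gradient = 2.0 * error * x          # 4. Gradient of the squared error w.r.t. the weight
        weight -= learning_rate * gradient  # 5. Update the weight slightly
    # 6. Go back to step 2 with the updated weight

print(round(weight, 3))
```

In spaCy, steps 2 to 5 happen inside `nlp.update`; the loop around it is what you write yourself.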

**Training the entity recognizer**

* The entity recognizer tags words and phrases in context
* Each token can only be part of one entity
* Examples need to come with context
```
("iPhone X is coming", {'entities': [(0, 8, 'GADGET')]})
```
* Texts with no entities are also important
```
("I need a new phone! Any tips?" , {'entities': []})
```
* Goal: teach the model to generalize

* Update an existing model: a few hundred to a few thousand examples
* Train a new category: a few thousand to a million examples
* spaCy's English models: 2 million words
* Usually created manually by human annotators
* Can be semi-automated, for example by using spaCy's Matcher
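As a rough illustration of semi-automated annotation, the sketch below builds examples in the `(text, {'entities': [...]})` format shown above, with character offsets found by exact string search. It uses Python's `re` module rather than spaCy's Matcher, and the `annotate` helper is hypothetical:

```python
import re


def annotate(text, phrases, label):
    """Build a (text, {'entities': [...]}) example from exact phrase matches."""
    entities = []
    for phrase in phrases:
        for match in re.finditer(re.escape(phrase), text):
            # Character offsets: start is inclusive, end is exclusive
            entities.append((match.start(), match.end(), label))
    return (text, {"entities": sorted(entities)})


texts = [
    "iPhone X is coming",
    "How to preorder the iPhone X",
    "I need a new phone! Any tips?",
]
examples = [annotate(text, ["iPhone X"], "GADGET") for text in texts]

for text, annotations in examples:
    print(text, annotations)

# Each offset pair slices the entity text back out of the string
text, annotations = examples[1]
start, end, label = annotations["entities"][0]
print(text[start:end])
```

Overlapping phrases are not resolved here, so a human pass over the generated examples is still needed before training.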


**The steps of a training loop**

1. Loop for a number of iterations.
2. Shuffle the training data.
3. Divide the data into batches.
4. Update the model for each batch.
5. Save the updated model.

```
import random
import spacy

# Assumes `nlp` is an already loaded pipeline with an entity recognizer
# and `path_to_model` is an output directory of your choice
TRAINING_DATA = [
    ("How to preorder the iPhone X", {'entities': [(20, 28, 'GADGET')]}),
    # And many more examples...
]

for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA):
        # Split the batch in texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model (spaCy v2-style call, as used in this notebook)
        nlp.update(texts, annotations)
# Save the model
nlp.to_disk(path_to_model)
```

**Setting up a new pipeline from scratch**

```
import random
import spacy

# `examples` is assumed to be a list of (text, annotations) tuples
# in the same format as TRAINING_DATA above
# Start with blank English model
nlp = spacy.blank('en')
# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
# Add a new label
ner.add_label('GADGET')
# Start the training
nlp.begin_training()
# Train for 10 iterations
for itn in range(10):
    random.shuffle(examples)
    # Divide examples into batches
    for batch in spacy.util.minibatch(examples, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)
```