spaCy processes text through a pipeline made up of several steps, including tokenisation, lemmatisation and other optional add-on components.

A pipeline is instantiated either as a blank one with `spacy.blank(<language>)` or as a trained one with `spacy.load(<name of trained model>)`.

Objects
- nlp : pipeline
- doc : the end result of the raw text through the pipeline
- token : individual token (word or component) in the document
- Span : multiple tokens (a slice of the doc)
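
As a rough illustration of how these four objects relate, the sketch below builds each one from a blank English pipeline. It assumes spaCy v3 is installed; the variable names are only illustrative.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# A blank pipeline only contains the tokenizer (no trained components)
nlp = spacy.blank("en")

# Calling the pipeline on raw text returns a Doc
doc = nlp("Randy is learning NLP on spacy!")

# Indexing a Doc returns a Token; slicing a Doc returns a Span
token = doc[0]
span = doc[0:3]

print(type(doc).__name__, type(token).__name__, type(span).__name__)
print(token.text)
print(span.text)
```

Note that a Span comes from slicing the Doc itself; slicing a plain Python list of tokens gives another list, not a Span.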

# Tutorial 1

In [58]:
import spacy

In [59]:
nlp = spacy.blank("en")

In [60]:
sample_string = "Randy is learning NLP on spacy!"
doc = nlp(sample_string)

In [61]:
doc.text

'Randy is learning NLP on spacy!'

In [62]:
# tokens
tokens = [tok for tok in doc]

In [63]:
tokens

[Randy, is, learning, NLP, on, spacy, !]

In [64]:
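# spans
# Note: slicing the Python list gives a list of Token objects, not a Span;
# slice the Doc itself (doc[:3]) to get a real Span.
sample_span = tokens[:3]
sample_span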

[Randy, is, learning]

In [65]:
# Lexical attributes
# Doc: https://spacy.io/api/token
# e.g token.is_alpha , token.is_punct , token.like_num
for idx, token in enumerate(tokens):
    print(f"Examining {token} in idx {idx}")
    print(f"Alpha: {token.is_alpha}")
    print(f"ascii: {token.is_ascii}")
    print(f"Title case: {token.is_title}")
    print(f"Lower case: {token.is_lower}")
    print(f"Left punctuation: {token.is_left_punct}")
    print(f"Right punctuation: {token.is_right_punct}")
    print(f"Number: {token.like_num}")
    print(f"OOV?: {token.is_oov}")
    print(f"url: {token.like_url}")
    print(f"Email: {token.like_email}")


Examining Randy in idx 0
Alpha: True
ascii: True
Title case: True
Lower case: False
Left punctuation: False
Right punctuation: False
Number: False
OOV?: True
url: False
Email: False
Examining is in idx 1
Alpha: True
ascii: True
Title case: False
Lower case: True
Left punctuation: False
Right punctuation: False
Number: False
OOV?: True
url: False
Email: False
Examining learning in idx 2
Alpha: True
ascii: True
Title case: False
Lower case: True
Left punctuation: False
Right punctuation: False
Number: False
OOV?: True
url: False
Email: False
Examining NLP in idx 3
Alpha: True
ascii: True
Title case: False
Lower case: False
Left punctuation: False
Right punctuation: False
Number: False
OOV?: True
url: False
Email: False
Examining on in idx 4
Alpha: True
ascii: True
Title case: False
Lower case: True
Left punctuation: False
Right punctuation: False
Number: False
OOV?: True
url: False
Email: False
Examining spacy in idx 5
Alpha: True
ascii: True
Title case: False
Lower case: True
Left punct

# Loading a Trained Model

In [66]:
nlp_en_core_web_sm = spacy.load("en_core_web_sm")



In [67]:
sample_email = "chngyuanlong@gmail.com"
sample_url = "https://myfirstdatasciencejob.wordpress.com/"
doc = nlp_en_core_web_sm(sample_email)
tokens = [tok for tok in doc]
print(tokens)
print(f"Like email : {tokens[0].like_email}")

[chngyuanlong@gmail.com]
Like email : True


In [68]:
doc = nlp_en_core_web_sm(sample_url)
tokens = [tok for tok in doc]
print(tokens)
print(f"Like url : {tokens[0].like_url}")

[https://myfirstdatasciencejob.wordpress.com/]
Like url : True


In [69]:
sample_similarity_sentence = "My cat is like a god. Oh opps I meant dog"
doc = nlp_en_core_web_sm(sample_similarity_sentence)
tokens = [tok for tok in doc]
print(tokens)

[My, cat, is, like, a, god, ., Oh, opps, I, meant, dog]


In [70]:
print(tokens[1].similarity(tokens[5]))
print(tokens[1].similarity(tokens[-1]))

0.34402838349342346
0.23464104533195496




# NLP Tasks

What fascinates me is that spaCy has built-in features for core NLP tasks such as dependency parsing, named entity recognition (NER), part-of-speech (POS) tagging and identification of syntactic heads.

In [71]:
for token in tokens:
    print(f"Text: {token.text}, POS Tag: {token.pos_}, Dep label: {token.dep_}, Head token: {token.head.text}, Lemmatised token: {token.lemma_}")

Text: My, POS Tag: PRON, Dep label: poss, Head token: cat, Lemmatised token: my
Text: cat, POS Tag: NOUN, Dep label: nsubj, Head token: is, Lemmatised token: cat
Text: is, POS Tag: AUX, Dep label: ROOT, Head token: is, Lemmatised token: be
Text: like, POS Tag: ADP, Dep label: prep, Head token: is, Lemmatised token: like
Text: a, POS Tag: DET, Dep label: det, Head token: god, Lemmatised token: a
Text: god, POS Tag: NOUN, Dep label: pobj, Head token: like, Lemmatised token: god
Text: ., POS Tag: PUNCT, Dep label: punct, Head token: is, Lemmatised token: .
Text: Oh, POS Tag: INTJ, Dep label: poss, Head token: opps, Lemmatised token: oh
Text: opps, POS Tag: NOUN, Dep label: npadvmod, Head token: meant, Lemmatised token: opps
Text: I, POS Tag: PRON, Dep label: nsubj, Head token: meant, Lemmatised token: I
Text: meant, POS Tag: VERB, Dep label: ROOT, Head token: meant, Lemmatised token: mean
Text: dog, POS Tag: NOUN, Dep label: dobj, Head token: meant, Lemmatised token: dog


In [72]:
# If the doc contains no entities, nothing is printed.
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

In [73]:
sample_NER = "We will need to destroy the Whiskey company else they get too big"
doc = nlp_en_core_web_sm(sample_NER)
tokens = [tok for tok in doc]
print(tokens)

[We, will, need, to, destroy, the, Whiskey, company, else, they, get, too, big]


In [74]:
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

Whiskey ORG


In [77]:
from spacy.matcher import Matcher

[From website](https://course.spacy.io/en/chapter1)

Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

It's also more flexible: you can search for texts but also other lexical attributes.

You can even write rules that use a model's predictions.

For example, find the word "duck" only if it's a verb, not a noun.

Important: each dictionary in a pattern describes exactly one token.
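
To make the one-dictionary-per-token idea concrete, here is a minimal sketch that uses only lexical attributes, so it should run on a blank pipeline without a trained model. The pattern name `QUANTITY` and the variable names are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("I need two dozen eggs and maybe 1kg of minced beef.")

matcher = Matcher(nlp.vocab)

# Three dictionaries, so the pattern spans exactly three tokens
quantity_pattern = [
    {"LIKE_NUM": True},   # token 1: looks like a number ("two", "10", ...)
    {"LOWER": "dozen"},   # token 2: lowercase form is "dozen"
    {"IS_ALPHA": True},   # token 3: any alphabetic token
]
matcher.add("QUANTITY", [quantity_pattern])

matched_texts = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched_texts)
```

Patterns that rely on predictions, such as `POS`, need a trained pipeline instead of a blank one.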

In [115]:
sample_matcher_text = "I need two dozen eggs and maybe 1kg of minced beef. Do not beef me"

In [116]:
doc = nlp_en_core_web_sm(sample_matcher_text)

In [118]:
tokens = [tok for tok in doc]

for token in tokens:
    print(token.text, token.pos_)

I PRON
need VERB
two NUM
dozen NOUN
eggs NOUN
and CCONJ
maybe ADV
1 NUM
kg NOUN
of ADP
minced VERB
beef NOUN
. PUNCT
Do AUX
not PART
beef VERB
me PRON


In [154]:
# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp_en_core_web_sm.vocab)

# Let's try to catch the following pattern : VERB , [NOUN or PRON]
VERB_PATTERN = [{"POS":"VERB"}, {"POS":{"IN":["PRON", "NOUN"]}}]

In [155]:
# Add the pattern to the matcher
matcher.add("items_pattern", [VERB_PATTERN])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

# The matcher relies on the predicted POS tags: "minced" was tagged VERB here,
# although in "minced beef" it reads more like ADJ + NOUN than VERB + NOUN

Matches: ['minced beef', 'beef me']


# Tutorial 2

StringStore, Lexemes and Vocab

In [1]:
import spacy

In [2]:
nlp = spacy.blank("en")
doc = nlp("I have a cat")

In [3]:
cat_hash = nlp.vocab.strings['cat']

In [4]:
cat_hash

5439657043933447811

In [5]:
cat_string = nlp.vocab.strings[cat_hash]

In [6]:
cat_string

'cat'

In [7]:
nlp.vocab.strings['have']

14692702688101715474

In [8]:
nlp.vocab.strings[14692702688101715474]

'have'

In [9]:
nlp.vocab.strings['fishy']

3687079329867984377

In [10]:
nlp.vocab.strings[3687079329867984377]

KeyError: "[E018] Can't retrieve string for hash '3687079329867984377'. This usually refers to an issue with the `Vocab` or `StringStore`."

In [11]:
nlp.vocab.strings.add('fishy')

3687079329867984377

In [12]:
nlp.vocab.strings['fishy']

3687079329867984377

In [13]:
nlp.vocab.strings[3687079329867984377]

'fishy'
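
The KeyError above arises because a string-to-hash lookup is computed on the fly, while a hash-to-string lookup only works for strings that are already in the store. A minimal sketch of that asymmetry, assuming a fresh blank English vocab that has not yet seen the word "fishy":

```python
import spacy

nlp = spacy.blank("en")
strings = nlp.vocab.strings

# String -> hash works for any string, stored or not
fishy_hash = strings["fishy"]
known_before = "fishy" in strings

# Hash -> string fails until the string has been added to the store
try:
    strings[fishy_hash]
    reverse_lookup_failed = False
except KeyError:
    reverse_lookup_failed = True

strings.add("fishy")
known_after = "fishy" in strings

print(known_before, reverse_lookup_failed, known_after, strings[fishy_hash])
```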

In [14]:
lexeme = nlp.vocab['fishy']

In [19]:
lexeme.orth

3687079329867984377

Using Doc, Span manually

In [32]:
from spacy.tokens import Doc

words = ["spacy","is",'cool','!']
spaces = [True, True, False, False]

In [33]:
doc = Doc(nlp.vocab, words=words, spaces=spaces)

In [34]:
doc.text

'spacy is cool!'

Using Doc, Span and Entities manually

In [35]:
import spacy

nlp = spacy.blank("en")

from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

In [37]:
doc = Doc(nlp.vocab, words, spaces)

In [41]:
doc.text

'I like David Bowie'

In [42]:
span = Span(doc, 2,4, label="PERSON")

In [44]:
span.text

'David Bowie'

In [45]:
span.label

380

In [46]:
doc.ents = [span]

In [47]:
# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

[('David Bowie', 'PERSON')]


Best Practices

Rewrite the following code so that it makes better use of spaCy's native features.

In [48]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)



Found proper noun before a verb: Berlin


In [54]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Singapore takes the cake everytime!")



In [56]:
for token in doc[:-1]:
    # Stop before the last token so doc[token.i + 1] never runs past the end
    if token.pos_ == "PROPN" and doc[token.i + 1].pos_ == "VERB":
        res = token.text
        print("Found proper noun before a verb:", res)

Found proper noun before a verb: Singapore


Similarity Measure

In [58]:
nlp = spacy.load("en_core_web_md")



In [59]:
doc = nlp("Fee fi fo fum")

In [60]:
fum_vector = doc[3].vector

In [61]:
fum_vector

array([-6.2737e-01, -3.4660e-01,  2.7209e-01,  3.0464e-01,  3.2837e-01,
       -1.7831e-02,  1.1909e+00, -1.0056e-01,  2.2841e-01, -1.5137e+00,
        5.0329e-01,  1.7327e-01,  5.1083e-01,  1.7106e-01, -3.7955e-01,
       -2.7223e-01,  4.1014e-01, -1.6467e+00,  2.7785e-01, -2.9190e-01,
        8.9143e-02,  4.7478e-01,  2.1643e-01,  4.0249e-01,  5.0436e-02,
       -7.3996e-02,  3.0738e-01, -3.6356e-01,  5.9862e-01, -4.0765e-01,
        3.8553e-01, -3.0245e-01, -3.1639e-01, -1.5023e-01, -1.0749e-01,
       -6.5798e-01,  1.6014e-02,  2.9421e-01,  1.1150e+00,  6.6496e-01,
       -6.7577e-01, -5.0856e-01, -7.1808e-01, -3.0598e-01, -4.3807e-01,
       -7.4416e-01,  6.3013e-01, -1.1900e-01,  3.6612e-01,  1.6718e-01,
        2.4640e-01, -1.4228e-01, -7.4555e-01, -4.5723e-01,  1.0380e-01,
       -4.8714e-01,  2.5367e-01,  3.2114e-04,  4.7412e-01, -5.2455e-01,
       -1.8667e-01,  3.9628e-02,  9.3008e-03,  5.5973e-02,  1.4076e-01,
        2.6564e-02,  3.4447e-01,  9.9613e-02,  2.2072e-01, -1.90

In [62]:
fee_vector = doc[0].vector

In [63]:
doc[3].similarity(doc[0])

-0.11414127796888351
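
By default, spaCy's `similarity` is the cosine similarity between the two objects' vectors, which is why the value can be negative, as above. A minimal sketch of that computation in plain Python follows; the toy vectors are made up for illustration and are not real word vectors, and returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # No direction to compare; treat as "no similarity" in this sketch
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```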

In [64]:
doc = nlp("This movie is the shit. I have never been blown away by such astounding storytelling in my life")

span1 = doc[0:5]
span2 = doc[6:]

In [65]:
span1.similarity(span2)

0.8902060389518738

Combining predictions and rules

In [70]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad"}, {"TEXT": "-"}, {"LOWER": "free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", [pattern1])
matcher.add("PATTERN2", [pattern2])



In [71]:
# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


In [72]:
from spacy.matcher import PhraseMatcher

In [73]:
matcher = PhraseMatcher(nlp.vocab)

Efficient Phrase Matching

In [None]:
import json
import spacy

# The next block is not runnable because I do not have the JSON file,
# but it illustrates that when the full list of terms is available,
# a PhraseMatcher helps greatly
with open("exercises/en/countries.json", encoding="utf8") as f:
    COUNTRIES = json.load(f)

nlp = spacy.blank("en")
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

In [75]:
# [Czech Republic, Slovakia]


In [None]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
import json

# The next block is not runnable because I do not have the JSON and text files,
# but it illustrates that when the full list of terms is available,
# a PhraseMatcher helps greatly
with open("exercises/en/countries.json", encoding="utf8") as f:
    COUNTRIES = json.load(f)
with open("exercises/en/country_text.txt", encoding="utf8") as f:
    TEXT = f.read()

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

# Create a doc and reset existing entities
doc = nlp(TEXT)
doc.ents = []

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])

In [None]:
# in --> Namibia
# in --> South Africa
# Africa --> Cambodia
# of --> Kuwait
# as --> Somalia
# Somalia --> Haiti
# Haiti --> Mozambique
# in --> Somalia
# for --> Rwanda
# Britain --> Singapore
# War --> Sierra Leone
# of --> Afghanistan
# invaded --> Iraq
# in --> Sudan
# of --> Congo
# earthquake --> Haiti
# [('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]

# Tutorial 3

Processing pipeline

In [79]:
import spacy

# Load the en_core_web_sm pipeline
nlp = spacy.load("en_core_web_sm")

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)



['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x000002310138FB80>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x0000023102406B80>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x0000023116CC89E0>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x0000023106D48A40>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x0000023102054240>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x0000023116CC8E40>)]


Custom Components

In [82]:
import spacy
from spacy.language import Language

# Define the custom component
@Language.component("length_component")
def length_component_function(doc):
    # Get the doc's length
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc


# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe('length_component', first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("Who let the dogs out? Who?! Who?! Who?! Who?! Who?!")

['length_component', 'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
This document is 21 tokens long.


Complex Components

In this exercise, you’ll be writing a custom component that uses the PhraseMatcher to find animal names in the document and adds the matched spans to the doc.ents. A PhraseMatcher with the animal patterns has already been created as the variable matcher.

- Define the custom component and apply the matcher to the doc.
- Create a Span for each match, assign the label ID for "ANIMAL" and overwrite the doc.ents with the new spans.
- Add the new component to the pipeline after the "ner" component.
- Process the text and print the entity text and entity label for the entities in doc.ents.

In [86]:
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)

# Define the custom component
@Language.component("animal_component")
def animal_component_function(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc

# Add the component to the pipeline after the "ner" component
nlp.add_pipe('animal_component', after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])



animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
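
The component above overwrites `doc.ents`, so any entities set by an earlier component such as `ner` are discarded. One possible variant keeps them and resolves overlaps with `spacy.util.filter_spans`, which prefers longer spans. This is only a sketch: the component name `animal_merge_component` is hypothetical, and a blank pipeline with a hand-set PERSON entity stands in for the trained `ner` output.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
animal_matcher = PhraseMatcher(nlp.vocab)
animal_matcher.add("ANIMAL", list(nlp.pipe(["Golden Retriever", "cat"])))


# "animal_merge_component" is a hypothetical name used for this sketch
@Language.component("animal_merge_component")
def animal_merge_component(doc):
    new_spans = [
        Span(doc, start, end, label="ANIMAL")
        for match_id, start, end in animal_matcher(doc)
    ]
    # Keep existing entities; filter_spans drops overlaps, preferring longer spans
    doc.ents = filter_spans(list(doc.ents) + new_spans)
    return doc


# Simulate an entity set by an earlier component, then apply the component
doc = nlp.make_doc("David Bowie has a cat and a Golden Retriever")
doc.ents = [Span(doc, 0, 2, label="PERSON")]
doc = animal_merge_component(doc)

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

In a real pipeline the component would be added with `nlp.add_pipe("animal_merge_component", after="ner")` rather than called directly.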
