#  Lecture 8: The Spacy Pipeline

In this session, we take a closer look at Spacy. 

* What happens when we apply a model to a text?
* Spacy pipelines
* Spacy attributes
* Word and sentence similarity
* Accessing and creating spans
* Adding components to a Spacy pipeline
* Adding attributes

[Spacy Cheat-sheet](https://www.datacamp.com/cheat-sheet/spacy-cheat-sheet-advanced-nlp-in-python)

In [None]:
import spacy

In [None]:
#!python -m spacy download en_core_web_sm

## Loading a language model
We are going to load the English language model. For other languages see: https://spacy.io/models/

In [None]:
# Load the Spacy model "en_core_web_sm"
nlp = spacy.load("en_core_web_sm")

# Apply the model to a text
doc = nlp("This is a text")

## Spacy Models/Pipelines

#### What does the nlp object actually do?

- Spacy models are _**pipelines**_ made of several components, e.g., a tokenizer and a tagger.
- Applying a Spacy model (here the "nlp" object) to a string yields a _**Doc object**_ that contains various types of linguistic annotations/attributes for _**text, tokens and spans**_, depending on which components the model is made of.
- For instance, if the model contains a POS tagger, applying this model to a text will yield a Doc object which associates each token in the text with a POS tag.
- A _**pipeline component**_ is a function which takes a Doc object as input, modifies it and returns the modified Doc.

In [None]:
# Accessing the text of each token
[token.text for token in doc]

In [None]:
# Accessing the Token objects themselves (they are displayed via their text)
[token for token in doc]

In [None]:
#type([token for token in doc][0])

In [None]:
#type([token.text for token in doc][0])

#### Getting information about the pipeline 

In [None]:
# List the model's components
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

In [None]:
nlp.pipeline

## Spacy Attributes
- Plain Spacy attributes (e.g., token.pos) return integer label IDs.
- For **string labels**, use the attributes with a trailing underscore, for example token.pos_.
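
The two forms exist because strings are encoded as integer IDs internally, with a lookup table to get the text back. The sketch below imitates that idea in plain Python. `ToyStringStore` is a hypothetical class written for this example, and SHA-256 is only a stand-in for whatever hashing spaCy actually uses.

```python
import hashlib


class ToyStringStore:
    """Maps strings to integer IDs and back (toy version, not spaCy's class)."""

    def __init__(self):
        self._id_to_string = {}

    def add(self, string):
        # Derive a stable 64-bit integer ID from the string
        digest = hashlib.sha256(string.encode("utf-8")).digest()
        string_id = int.from_bytes(digest[:8], "big")
        self._id_to_string[string_id] = string
        return string_id

    def __getitem__(self, string_id):
        # Going back from the ID to the text needs the lookup table
        return self._id_to_string[string_id]


strings = ToyStringStore()
noun_id = strings.add("NOUN")

print(type(noun_id).__name__)  # the ID is an integer, like token.pos
print(strings[noun_id])        # the readable label, like token.pos_
```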

#### Part-of-speech tags (predicted by statistical model)

In [None]:
doc = nlp("This is a text.")
# Coarse-grained part-of-speech tags
[token.pos_ for token in doc]

In [None]:
# Fine-grained part-of-speech tags
[token.tag_ for token in doc]

In [None]:
# Label explanations
spacy.explain("RB")

In [None]:
spacy.explain("GPE")

#### Syntactic dependencies (predicted by statistical model)

In [None]:
# Dependency labels
[(token.text, token.dep_) for token in doc]

In [None]:
# Syntactic head token (governor)
[(token.text, token.head.text) for token in doc]

#### Named Entities (predicted by statistical model)

In [None]:
doc = nlp("Henri Poincaré was born in Nancy, France")
# Text and label of named entity span
[(ent.text, ent.label_) for ent in doc.ents]

#### Sentences (usually needs the dependency parser)

In [None]:
doc = nlp("Henri Poincaré was born in Nancy. Henri Poincaré was a French genius who revolutionized mathematics, physics, and philosophy in the late 19th century.")
[sent.text for sent in doc.sents]

#### NP Chunks
- Base noun phrases (needs the tagger and parser)

In [None]:
doc = nlp("I have a red car")
# doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]

In [None]:
doc = nlp("Henri Poincaré was a French genius who revolutionized mathematics, physics, and philosophy in the late 19th century.")
# doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]

#### Getting information about Spacy attributes
Type 'doc.' followed by TAB to see the available methods and attributes. Take your time to check out the documentation at https://spacy.io/api/ to learn what they are and what you can do with them. The better you understand the data objects and functions, the easier it will be to use them.

In [None]:
doc.lang_

### Visualizing
⚠️ If you're in a Jupyter notebook, use displacy.render. Otherwise, use displacy.serve to start a web server and show the visualization in your browser.

In [None]:
from spacy import displacy

#### Visualize dependencies

In [None]:
doc = nlp("This is a sentence")
displacy.render(doc, style="dep")

#### Visualize named entities

In [None]:
doc = nlp("Larry Page founded Google")
displacy.render(doc, style="ent")

## Word and Sentence similarity
⚠️ To use word vectors, you need to install one of the larger models ending in md or lg, for example en_core_web_lg.

In [None]:
#!python -m spacy download en_core_web_lg

In [None]:
#nlp = spacy.load("en_core_web_lg")

#### Sentence Similarity

In [None]:
doc1 = nlp("I like cats")
doc2 = nlp("I like dogs")
# Compare 2 documents
doc1.similarity(doc2)

#### Word Similarity

In [None]:
# Compare 2 tokens (print both results: a notebook cell only displays its last expression)
print(doc1[2].similarity(doc2[2]))
# Compare a token and a span
print(doc1[0].similarity(doc2[1:3]))
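
By default, the similarity score is the cosine similarity between two vectors, and the vector of a Doc or Span is the average of its token vectors. The sketch below reproduces that computation in plain Python. The 3-dimensional vectors in `toy_vectors` are invented for the demo; real vectors come from the md/lg models and have a few hundred dimensions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Invented toy vectors, for illustration only
toy_vectors = {
    "I": [1.0, 0.0, 0.0],
    "like": [0.0, 1.0, 0.0],
    "cats": [0.0, 0.2, 1.0],
    "dogs": [0.0, 0.4, 0.9],
}

# Word similarity: compare two token vectors directly
word_sim = cosine(toy_vectors["cats"], toy_vectors["dogs"])

# Sentence similarity: average the token vectors first, then compare
vec1 = average([toy_vectors[w] for w in ["I", "like", "cats"]])
vec2 = average([toy_vectors[w] for w in ["I", "like", "dogs"]])
sent_sim = cosine(vec1, vec2)

print(f"cats vs dogs: {word_sim:.3f}")
print(f"'I like cats' vs 'I like dogs': {sent_sim:.3f}")
```

Because the two sentences share two of their three words, averaging pulls their vectors closer together than the vectors of "cats" and "dogs" alone.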

#### Accessing word vectors

In [None]:
doc = nlp("I like cats")
# Vector as a numpy array
doc[2].vector
# The L2 norm of the token's vector
#doc[2].vector_norm

## Spans

There are three basic units in Spacy: documents, tokens and spans. A span is a contiguous slice of a document, i.e., a sequence of one or more tokens.

#### Accessing Spans

In [None]:
doc = nlp("I live in New York")
# Span for "New York": tokens 3 and 4 (the end index 5 is exclusive)
span = doc[3:5]
span.text

In [None]:
displacy.render(doc, style="ent")

#### Creating Spans

In [None]:
# Import the Span object
from spacy.tokens import Span
# Create a Doc object
doc = nlp("I live in X Y")
# Span for "X Y" with label GPE (geopolitical); the label is assigned by hand, not predicted
span = Span(doc, 3, 5, label="GPE")
span.text

In [None]:
span.label_

## Custom components: Modifying the Pipeline
Components can be added to a pipeline first, last (default), or before or after an existing component.

Custom pipeline components let you add your own component to the spaCy pipeline that is executed when you call the pipeline on a text – for example, to modify the doc and add more data to it.

- A custom component is a function which modifies a doc and returns the modified version.
- Custom components are defined using the following syntax:

<code>@Language.component("custom_component")
def custom_component(doc):
    # Do something to the doc here
    return doc
</code>

- Use the <code>nlp.add_pipe</code> method to add the component to the pipeline.
- Remember to pass the string name of the component, not the function itself.

<code>nlp.add_pipe("custom_component", first=True)</code>

In [None]:
from spacy.language import Language

#### A custom component that prints the number of tokens in a document.

In [None]:
import spacy
from spacy.language import Language

# Define the custom component
@Language.component("length_component")
def length_component_function(doc):
    # Get the doc's length 
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc


# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe("length_component", first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("This is a sentence.")

### Adding custom attributes to the Doc, Token and Span objects 
- You can also add custom attributes to create and store additional annotations about the input text.

- declare new token attribute "ATTR"

> <code>Token.set_extension("ATTR", default=False)</code>

- assign attribute to token

> <code>doc[i]._.ATTR = True</code>

- print attribute for input text

> <code>print([token._.ATTR for token in doc])</code>

#### Adding new attributes
- Custom attributes can be added to the Doc, Token and Span objects.
- Custom attributes  are registered on the global Doc, Token and Span classes and become available as <code>._.attr</code>

In [None]:
from spacy.tokens import Doc, Token, Span
doc = nlp("The sky over New York is blue")

#### Attribute extensions (with default value)

- Adding  the <code>is_country</code> attribute for tokens which denote a country

In [None]:
# Use Token.set_extension to register "is_country" (default False).
# Update it for "Spain" and print it for all tokens.

import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")

# Register the Token extension attribute "is_country" with the default value False
Token.set_extension("is_country", default=False)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

#### Attribute extensions (with getter & setter)

In [None]:
# Register custom attribute "reversed" on Doc class
get_reversed = lambda doc: doc.text[::-1]
Doc.set_extension("reversed", getter=get_reversed)
# Compute value of extension attribute with getter
doc._.reversed

In [None]:
# Custom attribute "reversed" on Token class

from spacy.tokens import Token

nlp = spacy.blank("en")

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]

# Register the Token property extension "reversed" with the getter get_reversed
Token.set_extension("reversed", getter=get_reversed, force=True)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print("reversed:", token._.reversed)

#### Span extensions (with getter & setter)
- Enriching named entities with a Wikipedia search URL

In [None]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

def get_wikipedia_url(span):
    # Get a Wikipedia search URL if the span has one of these entity labels
    # ("LOC" is the label en_core_web_sm uses for non-GPE locations)
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension("wikipedia_url", getter=get_wikipedia_url)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture in Paris."
)
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

## Rule-based matching

spaCy’s rule-based Matcher lets you define rules to find words and phrases in text.
 https://spacy.io/api/matcher
 
 - Import the <code>Matcher</code> from <code>spacy.matcher</code>.
 - Initialize it with the nlp object’s shared vocab. The shared vocabulary is available as the <code>nlp.vocab</code> attribute.
 - Create a pattern 
 - Use the <code>matcher.add</code> method to add the pattern to the matcher.
 - Call the matcher on the doc and store the result in the variable "matches".
 - Iterate over the matches and get the matched span from the start to the end index.
 

### SpaCy Patterns
 
 A _**pattern**_ is a list of dictionaries keyed by attribute names, one dictionary per token. For example, <code>[{"TEXT": "Hello"}]</code> will match one token whose exact text is “Hello”.
 
 The start and end values of each match describe the start and end index of the matched span. To get the span, you can create a slice of the doc using the given start and end.

 - To match a token with an exact text, you can use the TEXT attribute. For example, <code>{"TEXT": "Apple"}</code> will match tokens with the exact text “Apple”.
 - To match a number token, you can use the <code>"IS_DIGIT"</code> attribute, which will only return True for tokens consisting of only digits.
 - To specify a lemma, you can use the "LEMMA" attribute in the token pattern. For example, <code>{"LEMMA": "be"}</code> will match tokens like “is”, “was” or “being”.
 - To find proper nouns, you want to match all tokens whose "POS" value equals "PROPN": <code>{"POS": "PROPN"}</code>
 - Operators can be added via the "OP" key. For example, "OP": "?" to match zero or one time.
 - A pattern for adjective plus one or two nouns   
 <code>[{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]</code>
 
 See https://spacy.io/api/matcher for the full description of Spacy pattern format
 
 See https://course.spacy.io/en/chapter1 for some more pattern examples


#### Using the Matcher

In [None]:

from spacy.matcher import Matcher
# Matcher is initialized with the shared vocab
matcher = Matcher(nlp.vocab)

# Pattern for New York, new york, new York or New york
pattern = [{"LOWER": "new"}, {"LOWER": "york"}]

# Add pattern to matcher
matcher.add('CITIES', [pattern])

# Match by calling the matcher on a Doc object
doc = nlp("I live in New York")
matches = matcher(doc)

# Matches are (match_id, start, end) tuples
for match_id, start, end in matches:
     # Get the matched span by slicing the Doc
     span = doc[start:end]
     print(span.text)

#### Patterns

In [None]:
# "love cats", "loving cats", "loved cats"
pattern1 = [{"LEMMA": "love"}, {"LOWER": "cats"}]
# "10 people", "twenty people"
pattern2 = [{"LIKE_NUM": True}, {"TEXT": "people"}]
# "book", "a cat", "the sea" (optional determiner + noun)
pattern3 = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]

#### Operators and quantifiers
Operators can be added to a token dict via the "OP" key.

| OP | Description |
|----|-------------|
| ! | Negate pattern and match exactly 0 times. |
| ? | Make pattern optional and match 0 or 1 times. |
| \+ | Require pattern to match 1 or more times. |
| \* | Allow pattern to match 0 or more times. |
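
To see what the quantifiers do, here is a small spaCy-free sketch of a token-pattern matcher. It is a toy under stated assumptions: tokens are plain dicts with hand-written "TEXT" and "POS" values, the helper names (`token_matches`, `match_ends`, `find_matches`) are made up for this example, only "?", "+" and "\*" are supported ("!" is left out), and every possible span is reported rather than only the longest one.

```python
def token_matches(token, spec):
    """True if the token has every attribute value the spec asks for."""
    return all(token.get(key) == value for key, value in spec.items() if key != "OP")


def match_ends(tokens, i, pattern):
    """Return every end index at which `pattern` can finish when started at i."""
    if not pattern:
        return {i}
    spec, rest = pattern[0], pattern[1:]
    op = spec.get("OP", "1")  # "1" stands for "exactly once", the default
    ends = set()
    if op in ("?", "*"):
        # Zero occurrences are allowed: skip this spec without consuming a token
        ends |= match_ends(tokens, i, rest)
    if i < len(tokens) and token_matches(tokens[i], spec):
        if op in ("+", "*"):
            # Consume one token, then allow zero or more repetitions of the spec
            ends |= match_ends(tokens, i + 1, [dict(spec, OP="*")] + rest)
        else:
            # "1" or "?": consume exactly one token and move on
            ends |= match_ends(tokens, i + 1, rest)
    return ends


def find_matches(tokens, pattern):
    """All (start, end) spans, end exclusive, where the pattern matches."""
    matches = []
    for start in range(len(tokens)):
        for end in sorted(match_ends(tokens, start, pattern)):
            if end > start:  # ignore empty matches
                matches.append((start, end))
    return matches


# Hand-annotated tokens for "a red sports car" (the tags are assumptions)
tokens = [
    {"TEXT": "a", "POS": "DET"},
    {"TEXT": "red", "POS": "ADJ"},
    {"TEXT": "sports", "POS": "NOUN"},
    {"TEXT": "car", "POS": "NOUN"},
]

# Adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]
for start, end in find_matches(tokens, pattern):
    print(start, end, " ".join(t["TEXT"] for t in tokens[start:end]))
```

The optional second noun yields two spans here, "red sports" and "red sports car", because the "?" token can match either zero times or once.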

#### Matching word forms and POS tags
Search for forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag "PROPN" (proper noun).

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

#### Creating a pattern from a list of tokens
- The **Matcher** lets you match sequences based on lists of token descriptions.
- The **PhraseMatcher** lets you efficiently match large terminology lists.
- The PhraseMatcher accepts match patterns in the form of Doc objects.
 
https://spacy.io/api/phrasematcher
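
The idea of matching a terminology list can be sketched without spaCy: store every phrase as a tuple of tokens in a set, then test each candidate slice of the text with a single lookup. This is only an illustration of the principle, not how the PhraseMatcher is implemented; `build_phrase_index` and `find_phrases` are hypothetical helpers and tokenization is a plain whitespace split.

```python
def build_phrase_index(phrases):
    """Store each phrase as a tuple of tokens, plus the phrase lengths in use."""
    index = {tuple(phrase.split()) for phrase in phrases}
    lengths = sorted({len(entry) for entry in index})
    return index, lengths


def find_phrases(tokens, index, lengths):
    """Return (start, end) spans, end exclusive, whose tokens form a known phrase."""
    matches = []
    for start in range(len(tokens)):
        for n in lengths:
            end = start + n
            # One set lookup per candidate slice, however long the phrase list is
            if end <= len(tokens) and tuple(tokens[start:end]) in index:
                matches.append((start, end))
    return matches


COUNTRIES = ["France", "Germany", "Czech Republic", "Slovakia"]
tokens = "Czech Republic may help Slovakia protect its airspace".split()

index, lengths = build_phrase_index(COUNTRIES)
for start, end in find_phrases(tokens, index, lengths):
    print(start, end, " ".join(tokens[start:end]))
```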

In [None]:
import spacy

COUNTRIES = ["France","Germany","Czech Republic","Slovakia"]

nlp = spacy.blank("en")
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

## Adding a custom component which adds a new attribute using Matching

#### The custom component labels country names as "GPE" entities; a "capital" Span attribute then looks up each country's capital

In [None]:
import json
import spacy
from spacy.language import Language
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher


COUNTRIES = ["France","Germany","Czech Republic","Slovakia"]

# Empty pipeline for English (tokenizer)
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))

@Language.component("countries_component")
def countries_component_function(doc):
    # Get all matches (all countries)
    matches = matcher(doc)
    # Create an entity Span with the label "GPE" for each match
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc

# Add the component to the pipeline
nlp.add_pipe("countries_component")
print(nlp.pipe_names)


CAPITALS = {'Afghanistan': 'Kabul', 'Czech Republic': 'Prague','Slovakia': 'Bratislava', 'Slovenia': 'Ljubljana'}

# Getter that looks up the span text in the dictionary of country capitals
# retrieves capital from dictionary using python get method for dictionaries
get_capital = lambda span: CAPITALS.get(span.text)

# Register the Span extension attribute "capital" with the getter get_capital
Span.set_extension("capital", getter=get_capital)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

#### A custom component that uses the PhraseMatcher to find animal names in the document and adds the matched spans to the doc.ents. 

The cell below first creates a PhraseMatcher with the animal patterns and stores it in the variable matcher. It then carries out the following steps:

- Define the custom component and apply the matcher to the doc.
- Create a Span for each match, assign the label ID for "ANIMAL" and overwrite the doc.ents with the new spans.
- Add the new component to the pipeline after the "ner" component.
- Process the text and print the entity text and entity label for the entities in doc.ents

In [None]:
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]

# Pattern that matches any animal in the animals list
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)

# Define the custom component
@Language.component("animal_component")
def animal_component_function(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the "ner" component
nlp.add_pipe("animal_component", after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

## Scaling
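Instead of calling nlp on each text in a loop, pass the whole list of texts to nlp.pipe and iterate over the Doc objects it yields. nlp.pipe processes the texts in batches, which is usually much faster on large collections.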


In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
with open("../data/webnlg-test.txt") as infile:
    # Each line of the file is treated as one text
    novel = infile.readlines()
    #print(novel[:50])
# Process the texts as a stream and print the tokens of each doc
for doc in nlp.pipe(novel):
    print([token.text for token in doc])
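
The following spaCy-free sketch shows the general shape of such a streaming, batched loop: texts are grouped into batches, each batch is processed in one call, and the results are yielded one by one in the original order. `batched`, `toy_pipe` and `toy_process_batch` are invented names, and the "processing" is just a whitespace split standing in for a real pipeline.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def toy_pipe(texts, process_batch, batch_size=2):
    """Process texts batch by batch and yield the results one at a time."""
    for batch in batched(texts, batch_size):
        yield from process_batch(batch)


batch_sizes = []  # records how many texts each call received


def toy_process_batch(batch):
    batch_sizes.append(len(batch))
    return [text.split() for text in batch]  # stand-in for real processing


texts = ["This is a text", "Another short text", "A third one"]
docs = list(toy_pipe(texts, toy_process_batch, batch_size=2))

print(batch_sizes)
print(docs[0])
```

Three texts with a batch size of 2 lead to two processing calls instead of three, while the caller still receives one result per text.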