Ch1 - Introduction

Getting Started

In [None]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

Documents, spans and tokens

In [None]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

Lexical Attributes

In [None]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)

Loading Models

In [None]:
import spacy

# Load the "en_core_web_sm" model
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

Predicting Linguistic Annotations

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

In [None]:
spacy.explain("nsubj")

In [None]:
spacy.explain("ccomp")

In [None]:
spacy.explain("acomp")

In [None]:
spacy.explain("det")

In [None]:
spacy.explain("amod")

In [None]:
spacy.explain("relcl")

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

In [None]:
doc.ents

Predicting named entities in context

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Rule-based Matching

In [None]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Writing match patterns

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)


# Summary - what have I learnt?

## What is spaCy & what does it do?
It is an NLP framework for processing text and labelling it with linguistic annotations, such as named entities.

It has a number of models under the hood to predict complex linguistic attributes in context, including:
- Part-of-speech tagging - Identifying words as Noun, Verb, Adjective etc.
- Syntactic dependencies - Identifying how each word relates to the others in the sentence structure (subject, object, determiner etc.)
- Named Entities - Identifying words as Organisations, Countries, Money etc.  

Built-in models have been trained on example texts, but can be updated with more examples to fine-tune predictions.


Given that these models are never going to be perfect, there is also the option to add various "Rule-based matching" criteria to manually assign certain words/phrases. The predictions from the aforementioned models can also be used within the Rule-based matching system.

There are also a number of preprocessing methods that allow for easy lemmatization etc. The lemmatizer can also be edited or created manually as a rule-based method.

## What structures does it use?

### nlp object
An object loaded from spaCy containing the processing pipeline (including the language-specific rules for tokenisation etc.).

E.g. `nlp = spacy.load("en_core_web_sm")` OR `nlp = English()` in the code above
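
As a rough sketch of the difference between the two, a blank English pipeline can be created without downloading a trained model. This assumes only that spaCy itself is installed; the variable names and the sample sentence are made up for illustration.

```python
import spacy
from spacy.lang.en import English

# Two equivalent ways to create a blank English pipeline (tokenizer only)
nlp_blank = English()
nlp_also_blank = spacy.blank("en")

# A blank pipeline has no components such as a tagger, parser or NER
print(nlp_blank.pipe_names)

# Tokenization still works because it is rule-based
blank_doc = nlp_blank("Hello world!")
blank_tokens = [token.text for token in blank_doc]
print(blank_tokens)
```

Predictions such as part-of-speech tags or entities need a trained pipeline like `en_core_web_sm`; the blank pipeline is enough for tokenization and lexical attributes.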

### Document

The result of processing a string of text with the nlp object (a `Doc` object)

E.g. `doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")` 
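
A small sketch of what this means in practice, using a blank English pipeline so no model download is assumed: the processed document is an object rather than a plain string, but it keeps the original text intact.

```python
from spacy.lang.en import English

nlp = English()
text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"
doc = nlp(text)

# The result is a Doc object, not a plain string
print(type(doc).__name__)

# The original text is preserved exactly
print(doc.text == text)

# Tokens keep their trailing whitespace, so the text can be rebuilt from them
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```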

### Token

Constituent parts of a Document i.e. words and punctuation

E.g. `doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")`  
`token = doc[0]` OR `token = doc[i]` for `0 <= i < len(doc)`
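
To illustrate the indexing rules, here is a short sketch on the sentence from the first exercise. It uses a blank English pipeline, which is an assumption made to keep the example free of model downloads.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I like tree kangaroos and narwhals.")

# Valid indices run from 0 to len(doc) - 1
first_token = doc[0]
last_token = doc[len(doc) - 1]
print(len(doc), first_token.text, last_token.text)

# Each token knows its own position in the Doc
print([(token.i, token.text) for token in doc])

# Indexing past the end raises an IndexError, as with a list
try:
    doc[len(doc)]
except IndexError:
    print("Index out of range")
```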

### Span

A contiguous slice of Tokens within a Document (created with the same slice syntax as a Python list)

E.g. `doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")`  
`span = doc[2:6]`
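
A quick sketch of how slicing behaves, again on a blank English pipeline (an assumption to avoid needing a trained model). The slice `doc[1:3]` is the same one used for "iPhone X" in the exercise above.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# The end index is exclusive, so this covers tokens 1 and 2
span = doc[1:3]
print(span.text)

# A Span records where it starts and ends in the Doc
print(span.start, span.end, len(span))

# The tokens in a Span keep their positions from the parent Doc
print([token.i for token in span])
```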

## What attributes/methods are available to use on documents and tokens?

Particularly useful attributes are:

### text
Returns the text of a token, span or document

### pos_
Part of Speech

### dep_
Syntactic Dependency

### ents
All Entities within a doc

### label_
Label of entity


For a full list of methods/attributes for your document or token, use the `dir()` function, e.g. `dir(doc)`.
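
The raw `dir()` output includes many internal names, so one option is to filter out anything starting with an underscore. This is a sketch on a blank English pipeline; the variable names are illustrative only.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Now less than 4% are.")

# Keep only the public names (those without a leading underscore)
doc_attrs = [name for name in dir(doc) if not name.startswith("_")]
token_attrs = [name for name in dir(doc[0]) if not name.startswith("_")]

print(doc_attrs)
print(token_attrs)
```

Note that names ending in an underscore, such as `pos_`, are kept by this filter because only the leading character is checked.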

## What are the possible classifications of a Part-of-speech tag, dependency label or entity type?

A glossary of all of the classifications and their abbreviations can be found in the [spaCy GitHub repo](https://github.com/explosion/spaCy/blob/master/spacy/glossary.py).

This list is used as the basis for the `spacy.explain()` function.
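
As a small sketch, the glossary can be queried either through `spacy.explain()` or directly through the `GLOSSARY` dictionary in `spacy.glossary`. No trained pipeline is needed for this; the labels in the loop are just examples taken from these notes.

```python
import spacy
from spacy.glossary import GLOSSARY

# spacy.explain looks the label up in the glossary
label_description = spacy.explain("nsubj")
print(label_description)

# The same description is available from the glossary dictionary
print(GLOSSARY["nsubj"])

# Describe several labels at once
for label in ["det", "amod", "PROPN", "ORG"]:
    print(label, "->", spacy.explain(label))
```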

## Rule-based matching

Similar to regular expressions, but instead of simply matching strings, it can match on any combination of token attributes. This can include model predictions, e.g. duck (verb) vs. duck (noun) can be distinguished by the POS attribute that the model predicts for the token.

Match patterns are defined as a list of dictionaries, one dictionary per token, e.g. `pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]`

You can initialise a Matcher with the vocabulary shared by the nlp object

`matcher = Matcher(nlp.vocab)`

then add the above pattern to the matcher

`matcher.add("IPHONE_PATTERN", [pattern])`

then call it on any doc

`matches = matcher(doc)`
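
Putting those three steps together gives the sketch below. Because `LOWER` is a lexical attribute, it is assumed here that a blank English pipeline is enough; the `matcher.add` call uses the list-of-patterns form shown in the exercises above.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialise the Matcher with the shared vocabulary and register the pattern
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]
matcher.add("IPHONE_PATTERN", [pattern])

# Each match is a (match_id, start, end) tuple of token indices
matches = matcher(doc)
matched_texts = [doc[start:end].text for match_id, start, end in matches]
print(matched_texts)

# match_id is a hash; look it up in the string store to get the pattern name
matched_names = [nlp.vocab.strings[match_id] for match_id, start, end in matches]
print(matched_names)
```

Matching on `LOWER` rather than `TEXT` makes the pattern case-insensitive, so "iphone x" and "IPHONE X" would also match.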

More involved examples are included above (and in the slides on the spaCy website).

## Useful Links

The spaCy docs are detailed and easy to understand. Some useful links for reference are:  
- [Linguistic Features](https://spacy.io/usage/linguistic-features)