# spaCy
- Natural Language Understanding Library
- Rule based and machine learning approaches to natural language understanding  
- Build rule-based training data sets, use statistical models to predict features in text
> [Main Table of Contents](../../../README.md)

In [205]:
import spacy

## In This Notebook
- Spacy usage example
- Top-Level Functions
- Pipelines
	- Tokenizer is a *special* component
	- Trained pipelines
- Containers
	- Language class
	- Doc class
		- Doc Attributes
		- Doc Methods
	- Span class
		- Span Attributes
		- Span Methods
	- Token class
		- Token/lexical Attributes
		- Token/lexical Methods
- Statistical Models
	- POS - Parts of Speech
	- Deps - Dependencies
	- Ents - Named Entities
- Matcher - Pattern Rules
	- Matcher Usage Flow
- Phrase Matcher
- Train Models

## Spacy usage example
> spaCy finds elements in text in two ways: statistical models (Tagger, DependencyParser, EntityRecognizer, etc.) predict categories such as part-of-speech tags, dependency labels, and named entity labels, while rule-based matchers find elements based on a rule or pattern, similar to regex

In [206]:
import spacy

# Load a trained pipeline
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm") 

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print((entity.text, entity.label_), end=' ')

Noun phrases: ['Sebastian Thrun', 'self-driving cars', 'Google', 'few people', 'the company', 'him', 'I', 'you', 'very senior CEOs', 'major American car companies', 'my hand', 'I', 'Thrun']
Verbs: ['start', 'work', 'drive', 'take', 'tell', 'shake', 'turn', 'talk', 'say']
('Sebastian Thrun', 'PERSON') ('Google', 'ORG') ('2007', 'DATE') ('American', 'NORP') ('Thrun', 'GPE') 

## Top-Level Functions

Function/Attribute | Description
--- | ---
spacy.lang.en.stop_words.STOP_WORDS | Set of stop words, used for `Token.is_stop`
spacy.load(str) | Load a pipeline using the name of an installed package, a string path, or a Path object<br>Instantiates the Language class, which is a text-processing pipeline<br>Returns a Language object, conventionally named `nlp`
spacy.blank(str) | Create blank pipeline of given language class<br>Equivalent to `English()` from `spacy.lang.en`
spacy.explain(str) | Get a description of POS (part of speech) tag, dependency label or entity type<br>See `glossary.py` for full list of terms
spacy.lang.en.English() | Access to English class<br>Equivalent to `spacy.blank("en")`

In [207]:
nlp = spacy.load("en_core_web_sm") # load a trained pipeline by package name
# nlp = spacy.load("/path/to/pipeline") # string path
nlp = spacy.load("en_core_web_sm", exclude=["parser", "tagger"])
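
The functions in the table above can be tried without downloading a trained package. The following is a minimal sketch that assumes only that spaCy itself is installed; the variable name `nlp_blank` is chosen here for illustration.

```python
import spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

# spacy.blank("en") builds an English pipeline that has a tokenizer
# but no other components
nlp_blank = spacy.blank("en")
print(type(nlp_blank).__name__, nlp_blank.pipe_names)

# English() gives the same kind of Language object
print(isinstance(nlp_blank, English), English().pipe_names)

# spacy.explain turns a tag or label into a human-readable description
for label in ("GPE", "nsubj", "VBZ"):
    print(label, "->", spacy.explain(label))

# STOP_WORDS is a plain Python set of strings
print("the" in STOP_WORDS, "robot" in STOP_WORDS)
```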

## Pipelines
> Text -> Tokenizer -> nlp pipeline/components [Tagger -> Parser -> NER -> ...] -> Doc

- nlp pipeline/components are functions and each component receives Doc and returns Doc

### Tokenizer is a *special* component
- While other components receive and return `Doc`, Tokenizer receives text and returns `Doc`
- Can only be *one* Tokenizer
- Tokenizer *DOES NOT* show up in `nlp.pipe_names`
- It is writable and customizable like other components
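
The difference between the tokenizer and the other components can be sketched with a blank pipeline and one custom component. The component name `token_counter` and the `user_data` key are hypothetical, picked only for this illustration.

```python
import spacy
from spacy.language import Language


# A pipeline component is a function that receives a Doc and returns a Doc
@Language.component("token_counter")  # hypothetical component name
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("token_counter")

# The tokenizer turns text into a Doc, then the components run in order
doc = nlp("Tokenizer runs first.")

print(nlp.pipe_names)                 # the tokenizer is not listed here
print(type(nlp.tokenizer).__name__)   # it lives on nlp.tokenizer instead
print(doc.user_data["n_tokens"])
```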

### Trained Pipelines
- The three `en_core_web_*` packages are English pipelines optimized for CPU
	- Trained on Genre: written text (blogs, news, comments)
	- Active Pipeline Components: tok2vec, tagger, parser, attribute_ruler, lemmatizer, ner  




	Trained Pipeline Name | Description
	--- | ---
	en_core_web_sm | Doesn't include vectors
	en_core_web_md | Includes 20k unique vectors
	en_core_web_lg | Includes 514k unique vectors

## Containers
- Language
- Doc
- Span
- Token

### Language class
- A text-processing pipeline
- Conventionally the instantiated object is named `nlp`
- A call to `spacy.load`, `spacy.blank`, or `English` returns this object
- A pipeline consists of components
	- Components are functions that receive a Doc and return a Doc  


	Attribute or Method | Description
	--- | ---
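	nlp.make_doc | Runs only the tokenizer and returns the resulting `Doc`<br>No other pipeline components are applied<br>Can be checked with `doc.has_annotation`, e.g. `"DEP"` for the parser or `"ENT_IOB"` for named entities (see example below)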
	nlp.pipe_names | Returns list of pipeline component names
	nlp.pipeline | Returns list of tuples (component name, component object)
	nlp.pipe(texts) | Process an iterable of texts as a stream<br>Returns a generator that yields Doc
	nlp.add_pipe('component_fn_name') | Add any pipeline component<br>Default `last=True`<br>Location kwargs: `before`, `after`, `first`, `last`<br>kwarg `source` adds a component from another pipeline (Language object)
	nlp.select_pipes(*, disable: str\|iterable, enable: str\|iterable) | Context manager<br>kwarg 'disable' Name(s) of pipeline components to disable<br>kwarg 'enable' Name(s) of pipeline components that will not be disabled
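
A minimal sketch of `add_pipe` placement and the `enable` kwarg of `select_pipes`, assuming a blank English pipeline. The built-in `sentencizer` is used because it needs no trained weights; `mark_seen` is a hypothetical component defined only for this example.

```python
import spacy
from spacy.language import Language


@Language.component("mark_seen")  # hypothetical component name
def mark_seen(doc):
    doc.user_data["seen"] = True
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")                      # appended last by default
nlp.add_pipe("mark_seen", before="sentencizer")  # placed relative to another component
print(nlp.pipe_names)

# enable= keeps only the named components active inside the context
with nlp.select_pipes(enable="sentencizer"):
    inside = list(nlp.pipe_names)
    print(inside)

# Outside the context every component is active again
print(nlp.pipe_names)

# nlp.pipe takes an iterable of texts and yields one Doc per text
docs = list(nlp.pipe(["One. Two.", "Three."]))
sentence_counts = [len(list(d.sents)) for d in docs]
print(sentence_counts)
```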

In [208]:
# nlp.pipe(texts) EXAMPLE
nlp = spacy.load("en_core_web_sm")
texts = ['Use pipe method.', 'Do not use list comprehension to process iterable of texts']

docs = [nlp(text) for text in texts]  # Bad: processes the texts one at a time

docs = nlp.pipe(texts)               # Good, docs is generator
for doc in docs:
    print(doc, end=' ')

docs = list(nlp.pipe(texts))          # Good, docs is list

Use pipe method. Do not use list comprehension to process iterable of texts 

In [209]:
# nlp.select_pipes(*, disable: str|iterable, enable: str|iterable) EXAMPLE
nlp = spacy.load("en_core_web_sm")
texts = ['Use pipe method.', 'Do not use list comprehension to process iterable of texts']
with nlp.select_pipes(disable=["tagger", "parser"]):
    # Everything in context won't use tagger, parser components
    print(nlp.pipe_names)

# Everything here will use all pipeline components
print(nlp.pipe_names)
   

['tok2vec', 'attribute_ruler', 'lemmatizer', 'ner']
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [210]:
# make_doc runs only the tokenizer, so no parser annotation is present; has_annotation confirms it.
# For named entities, check 'ENT_IOB' ('ENT' and 'ENTS' are not Token attributes).
doc = nlp.make_doc('Will only use tokenizer pipeline component') 
doc.has_annotation('DEP')

False
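
The same check can be applied to named entities. This sketch assumes a blank English pipeline and sets one entity by hand, so the `ORG` label on "Apple" is illustrative rather than predicted by a model.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# make_doc only tokenizes, so no entity or dependency annotation exists yet
doc = nlp.make_doc("Apple opened a store in Berlin")
before = doc.has_annotation("ENT_IOB")

# Setting entities by hand fills in the token-level ENT_IOB values
doc.set_ents([Span(doc, 0, 1, label="ORG")])
after = doc.has_annotation("ENT_IOB")

# The parser never ran, so DEP is still unset
has_dep = doc.has_annotation("DEP")

print(before, after, has_dep)
```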

### Doc class
- Container to access linguistic annotations
- Access sentences, named entities
- A sequence of `Token` objects

In [211]:
# Access Doc object

# 1. via nlp object
doc = nlp('Some text')

# 2. via nlp object
docs = nlp.pipe(['Hello World', 'Some text']) # generator yields Doc object

# 3. Manually instantiate Doc class. Doc(vocab, words, spaces)
import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_sm")
words = ["hello", "world", "!"]
spaces = [True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

#### Doc Attributes
- Attribute name without trailing underscore usually returns the numerical version
- Attribute name with trailing underscore returns string version

	Attribute | Description
	--- | ---
	text | String representation of document text
	vocab | The store of lexical types
	lang | Language of the doc's vocabulary (int)
	lang_ | Language of the doc's vocabulary (str)
	spans | dictionary of named span groups
	ents | Tuple of `Span` objects<br>named entities in the doc
	cats | Typically set by `TextCategorizer`<br>text categories mapped to scores
	sents | Sentences in the doc<br>Sentence spans have no label
	noun_chunks | Noun chunks/noun phrases as `Span` objects<br>Requires a syntactically parsed doc (i.e. the dependency parser has run)
	sentiment | Scalar value indicating the positivity or negativity of doc (if avail)
	vector | Default: Average of token vectors<br>*Best practice*: prefer short phrase docs over long docs riddled with common words<br>1D array<br>Numerical representation of doc
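
The underscore convention can be seen on a few attributes that do not need a trained model. A small sketch, assuming a blank English pipeline:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world!")

print(doc.text)    # the original string
print(doc.lang_)   # string version of the language
print(doc.lang)    # integer (hash) version of the same value

# Attributes filled by trained components are empty on a blank pipeline
print(doc.ents, doc.cats)
```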



#### Doc Methods  
- Methods

  
	Method | Description
	--- | ---
	.has_annotation(attr:int\|str) | bool<br>Whether the doc contains annotation on a `Token` attribute
	.similarity(other) | Default: cosine similarity of word vectors<br>'other' can be Doc, Span, Token, Lexeme objects<br>Higher is more similar<br>Need trained model that contains vectors e.g. `en_core_web_md` or `en_core_web_lg`
	.set_ents(entities:list[Span]) | Set the named entities in the doc
	.set_extension(fn_name) | Add custom attributes on the doc<br>kwarg 'getter' sets a custom doc property<br>kwarg 'method' sets a custom method<br>kwarg 'default' sets a custom writable attribute<br>custom attributes/methods accessible via `doc._.attr_name` or `doc._.method_name`
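
The three kinds of custom extension can be sketched as follows. The extension names `source`, `n_tokens` and `starts_with` are hypothetical, and `force=True` is passed only so the cell can be re-run without an "already exists" error.

```python
import spacy
from spacy.tokens import Doc

# default= creates a writable attribute
Doc.set_extension("source", default="unknown", force=True)
# getter= creates a computed, read-only property
Doc.set_extension("n_tokens", getter=lambda doc: len(doc), force=True)
# method= creates a callable that receives the doc as first argument
Doc.set_extension(
    "starts_with",
    method=lambda doc, prefix: doc.text.startswith(prefix),
    force=True,
)

nlp = spacy.blank("en")
doc = nlp("Hello World, I am span")

doc._.source = "notebook"
print(doc._.source, doc._.n_tokens, doc._.starts_with("Hello"))
```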

In [212]:
doc = nlp('Hello World, I am span')
# print(f'Named entity model run on doc? {doc.has_annotation("ENT_IOB")}')  # use "ENT_IOB"; "ENTS" is not a Token attribute
print(f'Dependency parser run on doc? {doc.has_annotation("DEP")}')


Dependency parser run on doc? True


### Span class
- Span is a view of one or more tokens

In [213]:
# Access a Span object

# 1. Slice bracket Notation on a Doc object. doc[start: exclusive_end]
doc = nlp('Hello World, I am span')
span = doc[3:]  # ints are token idx NOT char idx
print(span)

# 2. Manually instantiate Span class. Span(doc, start, exclusive_end)
from spacy.tokens import Span
span = Span(doc, 3, 6) # ints are token idx NOT char idx
print(span)


I am span
I am span


#### Span Attributes
- Attribute name without trailing underscore usually returns the numerical version
- Attribute name with trailing underscore returns string version

	Attribute | Description
	--- | ---
	doc | Parent document
	text | String representation of span text
	sent | The sentence span this span is a part of
	sents | Iterable[Span]
	start | Token offset for the start of the span
	end | Token offset for the end of the span
	start_char | Char offset for the start of the span
	end_char | char offset for the end of the span
	ents | Tuple of `Span` objects<br>Named entities that fall completely within the span
	noun_chunks |  Yields `Span`<br>If doc syntactically parsed (e.g. DEP model) then attribute contains noun chunks/noun phrase in the span
	label | span's label (int)<br>i.e. named entity type<br>see named entity section
	label_ | span's label (str)<br>i.e. named entity type<br>see named entity section
	lemma_ | span's lemma
	sentiment | Scalar value indicating the positivity or negativity of the span
	vector | Default: Average of token vectors<br>*Best practice*: prefer short phrase spans over long spans riddled with common words<br>1D array<br>Numerical representation of span
	root | Token with the shortest path to the root of the sentence (or the root itself)
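
Token offsets and character offsets are easy to mix up, so here is a small sketch on the same sentence used above. It assumes a blank English pipeline; the label `EXAMPLE` is made up for illustration.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Hello World, I am span")

# start/end are token offsets, end is exclusive
span = Span(doc, 3, 6, label="EXAMPLE")

print(span.text)
print(span.start, span.end)            # token offsets
print(span.start_char, span.end_char)  # character offsets into doc.text
print(span.label_, span.doc is doc)

# Character offsets slice the original string to the same text
print(doc.text[span.start_char:span.end_char])
```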

#### Span Methods  
- Methods

  
	Method | Description
	--- | ---
	.as_doc | Create new `Doc` object corresponding to the span, with a copy of the data
	.similarity(other) | Default: cosine similarity of word vectors<br>'other' can be Doc, Span, Token, Lexeme objects<br>Higher is more similar<br>Need trained model that contains vectors e.g. `en_core_web_md` or `en_core_web_lg`
	.set_extension(fn_name) | Add custom attributes on a `Span` object<br>kwarg 'getter' sets a custom span property<br>kwarg 'method' sets a custom method<br>kwarg 'default' sets a custom writable attribute<br>custom attributes/methods accessible via `span._.attr_name` or `span._.method_name`
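
A span is a view into its parent doc, while `as_doc` produces an independent copy. A short sketch, assuming a blank English pipeline:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello World, I am span")
span = doc[3:6]

# as_doc copies the span's data into a new, standalone Doc
span_doc = span.as_doc()

print(type(span_doc).__name__, span_doc.text, len(span_doc))

# Token indices restart at 0 in the new Doc but keep their
# original position inside the span's parent doc
print(span_doc[0].i, span[0].i)
```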

### Token class

In [214]:
# Access a Token object

# 1. Bracket notation on Doc object.  doc[token_idx]
doc = nlp('Hello World, I am span')
token = doc[1]

#### Token/Lexical Attributes
- Attribute name without trailing underscore usually returns the numerical version
- Attribute name with trailing underscore returns string version

	Attributes | Description
	--- | ---
	i | Token index in the parent doc<br>Preferred way to index tokens<br>Instead of e.g. [i for i, token in enumerate(doc)]
	idx | Char index of the token in parent doc
	text | String representation of the token content<br>Equivalent to `orth_`
	sent | Sentence *span* this token is a part of
	ent_type | named entity type (int)
	ent_type_ | named entity type (str)
	pos | part of speech (int)<br>simple UPOS pos tag
	pos_ | part of speech (str)<br>simple UPOS pos tag
	tag | Fine-grained part of speech  (int)<br>detailed pos tag
	tag_ | Fine-grained part of speech  (str)<br>detailed pos tag
	dep | Syntactic dependency relation (int)
	dep_ | Syntactic dependency relation (str)
	lang | Language of parent doc (int)
	lang_ | Language of parent doc (str)
	lemma | Base form of the token, with no inflectional suffixes (int)
	lemma_ | Base form of the token, with no inflectional suffixes (str)
	ancestors | Sequence of the token's ancestors
	children | Sequence of the token's immediate syntactic children
	sentiment | Scalar value indicating the positivity or negativity of the token
	vector |  1D array<br>Numerical representation of the token
	is_alpha, is_ascii, is_digit, is_lower, is_upper, is_title<br>is_sent_start, is_sent_end, is_space<br>is_punct, is_left_punct, is_right_punct, is_bracket<br>is_quote, is_currency | bool
	like_url, like_num, like_email | bool
	is_oov | Is the token out-of-vocabulary?<br>Does it not have a word vector?<br>bool
	is_stop | Is token part of a "stop list"?<br>bool
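
Many of the lexical attributes above are computed from the token text alone, so they work without a trained model. A minimal sketch, assuming a blank English pipeline and an example sentence made up for this purpose:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I paid $5 for 2 apples.")

# i is the token index, idx is the character offset in doc.text
for token in doc:
    print(
        token.i, token.idx, repr(token.text),
        token.is_alpha, token.is_punct, token.like_num,
        token.is_currency, token.is_stop,
    )

texts = [token.text for token in doc]
```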



#### Token Methods  
- Methods

  
	Method | Description
	--- | ---
	.similarity(other) | Default: cosine similarity of word vectors<br>'other' can be Doc, Span, Token, Lexeme objects<br>Higher is more similar<br>Need trained model that contains vectors e.g. `en_core_web_md` or `en_core_web_lg`
	.set_extension(fn_name) | Add custom attributes on a `Token` object<br>kwarg 'getter' sets a custom token property<br>kwarg 'method' sets a custom method<br>kwarg 'default' sets a custom writable attribute<br>custom attributes/methods accessible via `token._.attr_name` or `token._.method_name`

## Statistical Models
- Predict features in text after proper training
- Feature Examples
	- e.g. POS model predicts which tag/label most likely applies in any context similar to trained context
	- e.g. NER model predicts which label/ent_type most likely applies in any  context similar to trained context
- Statistical Models are context dependent

### POS - Parts of Speech
- Parts of speech are grammatical word categories
	- e.g. noun, verb, adjective, etc
- POS annotations are accessed as `Token` attributes
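
To show how the POS attributes are read without depending on a downloaded model, the sketch below builds a `Doc` with hand-written tags. The tag values are supplied manually as an assumption for illustration; a trained pipeline such as `en_core_web_sm` would predict them instead.

```python
import spacy
from spacy.symbols import NOUN
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated example: pos is the coarse UPOS tag, tags the fine-grained tag
words = ["Cars", "shift", "liability"]
doc = Doc(
    nlp.vocab,
    words=words,
    pos=["NOUN", "VERB", "NOUN"],
    tags=["NNS", "VBP", "NN"],
)

for token in doc:
    # pos is the integer ID, pos_ and tag_ are the string versions
    print(token.text, token.pos, token.pos_, token.tag_, spacy.explain(token.tag_))

pos_tags = [token.pos_ for token in doc]
```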

	

### Deps - Dependencies
- Dependency parsers create a syntax-based tree
- spaCy provides a powerful API to navigate this tree
- Dependency annotations are accessed as `Token` attributes

In [215]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# dep, dep_ attribute
for token in doc:
    print(token.text, token.dep, token.dep_)

# noun_chunks attribute
print('\nCHUNKS')
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

# Navigate parse tree
# Use head, children attributes
print('\nNAVIGATE TREE')
for token in doc:
    print(token.text, token.head.text, token.head.pos_,
            [child for child in token.children])

Autonomous 402 amod
cars 429 nsubj
shift 8206900633647566924 ROOT
insurance 7037928807040764755 compound
liability 416 dobj
toward 443 prep
manufacturers 439 pobj

CHUNKS
Autonomous cars cars nsubj shift
insurance liability liability dobj shift
manufacturers manufacturers pobj toward

NAVIGATE TREE
Autonomous cars NOUN []
cars shift VERB [Autonomous]
shift shift VERB [cars, liability, toward]
insurance liability NOUN []
liability shift VERB [insurance]
toward shift VERB [manufacturers]
manufacturers toward ADP []
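
Beyond `head` and `children`, the tree can also be walked with `ancestors` and `subtree`. The sketch below writes the heads and labels by hand (copied from the first five tokens of the output above) so it runs on a blank pipeline; with a trained pipeline the parser would supply them.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["Autonomous", "cars", "shift", "insurance", "liability"]
heads = [1, 2, 2, 4, 2]  # index of each token's head; the root points to itself
deps = ["amod", "nsubj", "ROOT", "compound", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

root = [token for token in doc if token.dep_ == "ROOT"][0]
print("root:", root.text)

# children: immediate dependents of a token
print("children of root:", [t.text for t in root.children])

# ancestors: walk upward from a token to the root
print("ancestors of 'Autonomous':", [t.text for t in doc[0].ancestors])

# subtree: a token plus everything below it
print("subtree of 'liability':", [t.text for t in doc[4].subtree])
```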


### Ents - Named Entities
- Named entities are real-world objects that are assigned a category label, e.g. a person, an organization, or a country
- The standard way to access named entity annotations is the `doc.ents` property
	- `doc.ents` is a sequence of `Span` objects
	- Document level, entity type as a hash value: `ent.label` (i.e. `span.label`)
	- Document level, entity type as a string value: `ent.label_` (i.e. `span.label_`)
	- Token level, entity type as a hash value: `token.ent_type`
	- Token level, entity type as a string value: `token.ent_type_`  


	Built-in entities | Description | Example
	--- | --- | ---
	ORG | Organization | Apple
	GPE | Geopolitical entity<br>i.e. countries, cities, states | U.K.
	MONEY | Monetary values<br>including unit | $400 million
	DATE | Date | 2007

In [216]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
ents = [(e.text, e.label_) for e in doc.ents]
print(ents)

# token level
ents = [(token.text, token.ent_type_) for token in doc]
print(ents) # NOTE: Only first two tkns are named entities in this model

[('San Francisco', 'GPE')]
[('San', 'GPE'), ('Francisco', 'GPE'), ('considers', ''), ('banning', ''), ('sidewalk', ''), ('delivery', ''), ('robots', '')]


## EntityRuler
- `doc.ents` is writable, so custom named entities can be added by assigning a list of labeled `Span` objects (token offsets, end exclusive):
		doc.ents = list(doc.ents) + [Span(doc, start, end, label="LABEL")]
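
Both routes to custom entities can be sketched on a blank pipeline: the `entity_ruler` component named in this section's heading, and a direct assignment to `doc.ents`. The pattern and the `PRODUCT` label on "robots" are illustrative assumptions, not model predictions.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Rule-based route: the entity_ruler component sets doc.ents from patterns
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "San Francisco"}])

doc = nlp("San Francisco considers banning sidewalk delivery robots")
from_ruler = [(e.text, e.label_) for e in doc.ents]
print(from_ruler)

# Manual route: doc.ents is writable, so extend it with a labeled Span
robots = Span(doc, 6, 7, label="PRODUCT")  # token offsets, end exclusive
doc.ents = list(doc.ents) + [robots]
after_manual = [(e.text, e.label_) for e in doc.ents]
print(after_manual)
```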

## Matcher - Pattern Rules
- Matchers work like regular expressions, but they match sequences of tokens instead of characters
- `Matcher` matches sequences based on lists of token descriptions
- Find words/phrases using rules/patterns describing token attributes
	- Can be token annotations like text, POS tags, lexical attributes
- Applying a matcher to a `Doc` gives access to the matched tokens
	- Matches are returned as list[tuple], where each tuple is (match_id, start_tkn_idx, end_tkn_idx (exclusive))
- When writing patterns, one dictionary is one token
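
As a sketch of "one dictionary is one token", the pattern below has five dictionaries and therefore matches exactly five tokens. It uses only lexical attributes, so a blank English pipeline is enough; the match ID `FIFA_PATTERN` and the sentence are chosen for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dictionary per token: a number, three words (case-insensitive), punctuation
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]
matcher.add("FIFA_PATTERN", [pattern])

doc = nlp("2018 FIFA World Cup: France won!")
found = [
    (nlp.vocab.strings[match_id], start, end, doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(found)
```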

### Matcher Usage Flow
1. Initialize Matcher with a vocab
2. Add pattern with `matcher.add` method
3. Use matcher by calling matcher on a doc
4. Returns list[tuples] where (match_id, start, end)
	- match_id is int
	- Use `nlp.vocab.strings[match_id]` to get str version of match_id
	- start, end are token index (where end is exclusive)

In [217]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)  # 1. Initialize Matcher with a vocab
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
# Add match ID "HelloWorld" with no callback and one pattern
matcher.add("HelloWorld", [pattern])  # 2. Add pattern to matcher

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)  # 3. Use matcher by calling matcher on a doc
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string rep of match_id
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

15578876784678163569 HelloWorld 0 3 Hello, world


## Phrase Matcher
- `PhraseMatcher` efficiently matches large terminology lists
- Accepts match patterns in the form of `Doc` objects
- Same Usage Flow as `Matcher`.
	- Difference is in step 2. Instead of adding pattern list[dict], add list[Doc]
- `Doc` pattern can contain single or multiple tokens

In [218]:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

Angela Merkel
Barack Obama
Washington, D.C.
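
By default `PhraseMatcher` compares the verbatim token text, so it is case-sensitive. A minimal sketch of case-insensitive matching via the `attr` argument, assuming a blank English pipeline and a made-up sentence:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# attr="LOWER" compares the lowercase form of each token
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["Barack Obama", "Angela Merkel"]
matcher.add("Names", [nlp.make_doc(term) for term in terms])

doc = nlp("angela merkel met BARACK OBAMA")

# Sort by start token so the order follows the text
matches = sorted(matcher(doc), key=lambda match: match[1])
found = [doc[start:end].text for _, start, end in matches]
print(found)
```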


## Train Models