In [11]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

**Snippet: Self-made 3 Versions of Preprocessing Text**

1. Bare Token: Sans grammar, sans semantics

```python
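# Helper functions come from gensim; `nlp`, SYMBOLS and STOPLIST are assumed to be defined earlier in the notebook
from gensim.parsing.preprocessing import (
    remove_stopwords, strip_multiple_whitespaces, strip_non_alphanum,
    strip_numeric, strip_punctuation, strip_short)

def token_cleaner(text):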
    text = strip_multiple_whitespaces(text)
    text = remove_stopwords(text)
    text = strip_numeric(text) 
    text = strip_non_alphanum(text)
    text = strip_punctuation(text)
    text = strip_short(text, minsize=3)
    text = [ tok.lemma_.lower().strip() for tok in nlp(text, disable=['tagger', 'parser', 'ner']) ]
    text = [ tok for tok in text if tok not in SYMBOLS and tok not in STOPLIST ]
    return ' '.join(text)
```

2. Lemmas: Retain grammar and semantics

```python
def token_cleaner(text):
    text = strip_multiple_whitespaces(text)
    text = strip_non_alphanum(text)
    text = strip_punctuation(text)
    text = strip_short(text, minsize=3) # optional
    text = [ tok.lemma_.lower().strip() for tok in nlp(text, disable=['tagger', 'parser', 'ner']) ]
    text = [ tok for tok in text if tok not in SYMBOLS ]
    return ' '.join(text)
```

3. Clean Text: Remove non-text only

```python
def token_cleaner(text):
    text = strip_multiple_whitespaces(text)
    text = strip_non_alphanum(text)
    text = strip_punctuation(text)
    text = strip_short(text, minsize=3) # optional
    text = [ tok.text.lower().strip() for tok in nlp(text, disable=['tagger', 'parser', 'ner']) ]
    text = [ tok for tok in text if tok not in SYMBOLS ]
    return ' '.join(text)
```

### spaCy trick: speed up the above by processing all texts in one call

```python
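def doc_to_spans(list_of_texts, join_string=' ||| '):
    # parse everything in one call, then cut the Doc back into one Span per input text
    all_docs = nlp(join_string.join(list_of_texts))
    separator = join_string.strip()
    split_inds = [i for i, token in enumerate(all_docs) if token.text == separator] + [len(all_docs)]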
    new_docs = [all_docs[(i + 1 if i > 0 else i):j] for i, j in zip([0] + split_inds[:-1], split_inds)]
    return new_docs 
```

**GPU**

```python
import random

import numpy
import spacy

def prefer_gpu():
    used = spacy.util.use_gpu(0)
    if used is None:
        return False
    else:
        import cupy.random

        cupy.random.seed(0)
        return True
random.seed(0)
numpy.random.seed(0)
use_gpu = prefer_gpu()
print("Using GPU?", use_gpu)
```

# COURSE

## Token, Span, Lexical Attributes

In [2]:
from spacy.lang.en import English
nlp = English()
doc = nlp("Hello world!")
for token in doc:
    print(token.text)
token = doc[1]
print(token.text)

Hello
world
!
world


In [3]:
span = doc[1:4]
print(span.text)

world!


In [4]:
doc = nlp("It costs $5.")

print('Index: ', [token.i for token in doc])
print('Text: ', [token.text for token in doc])
print('is_alpha:', [token.is_alpha for token in doc])
print('is_punct:', [token.is_punct for token in doc])
print('like_num:', [token.like_num for token in doc])
print('like_url', [token.like_url for token in doc])

Index:  [0, 1, 2, 3, 4]
Text:  ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]
like_url [False, False, False, False, False]


In [5]:
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. "
          "Now less than 4% are.")
for token in doc:
    if token.like_num:
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4


### Statistical Model
- POS tags
- Syntactic dependencies
- Named entities

1. **Training on Labelled Data**
2. **Can be updated with more examples to fine-tune**

**Model packages built-in**
- Binary weights
- Vocab
- Metadata (language, pipeline)

In [7]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("She ate the pizza")
for token in doc:
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN


In [8]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


In [9]:
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


In [12]:
spacy.explain("GPE")
spacy.explain("NNP")
spacy.explain('dobj')

'Countries, cities, states'

'noun, proper singular'

'direct object'

### Rule-based Matching

**Why not REGEX**
1. Match on `Doc` not just strings
2. Match on tokens and token attributes
3. Use model's predictions
4. e.g. "duck" (verb) vs. "duck" (noun)

**Match patterns**
- Lists of dictionaries, one per token
- Match exact token texts
`[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]`
- Match lexical attributes
`[{'LOWER': 'iphone'}, {'LOWER': 'x'}]`
- Match any token attributes
`[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]`

In [13]:
from spacy.matcher import Matcher

# INIT matcher with SHARED VOCAB
matcher = Matcher(nlp.vocab)

# Add pattern to matcher
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

doc = nlp("New iPhone X release date leaked")

matches = matcher(doc)

In [15]:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X


**Matching Lexical**

```python
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]

doc = nlp("2018 FIFA World Cup: France won!")
# 2018 FIFA World Cup:
```

**Matching other Token Attributes**

```python
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
# loved dogs 
# love cats
```

**Operators and Quantifiers**

```python
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'}, # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
# bought a smartphone
# buying apps
```

- `{'OP': '!'}` Negation: match 0 times
- `'+'` Match 1 or more times
- `'*'` Match 0 or more times
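
To see how the `?`, `*` and `+` quantifiers behave, here is a plain-Python sketch of a token-pattern matcher. It is not spaCy's `Matcher`: tokens are plain dicts, `token_matches`, `match_here` and `find_matches` are hypothetical helpers, `'!'` is left out, and only the longest match per start position is returned.

```python
def token_matches(token, spec):
    """Check one token (a dict of attributes) against one pattern dict, ignoring 'OP'."""
    return all(token.get(key) == value for key, value in spec.items() if key != "OP")

def match_here(tokens, pattern, i, j):
    """Return the end index of a match of pattern[j:] starting at tokens[i], or None."""
    if j == len(pattern):
        return i
    spec = pattern[j]
    op = spec.get("OP")
    ok = i < len(tokens) and token_matches(tokens[i], spec)
    if op in ("?", "*"):
        if ok:
            # greedy: consume the token first; '*' stays on the same pattern item
            next_j = j + 1 if op == "?" else j
            end = match_here(tokens, pattern, i + 1, next_j)
            if end is not None:
                return end
        # otherwise skip this optional pattern item
        return match_here(tokens, pattern, i, j + 1)
    if not ok:
        return None
    if op == "+":
        # one match is required; try to consume further repetitions first
        end = match_here(tokens, pattern, i + 1, j)
        if end is not None:
            return end
    return match_here(tokens, pattern, i + 1, j + 1)

def find_matches(tokens, pattern):
    """Return (start, end) token offsets, longest match per start position."""
    matches = []
    for start in range(len(tokens)):
        end = match_here(tokens, pattern, start, 0)
        if end is not None and end > start:
            matches.append((start, end))
    return matches

pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
bought = [{"LEMMA": "buy", "POS": "VERB"}, {"LEMMA": "a", "POS": "DET"},
          {"LEMMA": "smartphone", "POS": "NOUN"}]
buying = [{"LEMMA": "buy", "POS": "VERB"}, {"LEMMA": "app", "POS": "NOUN"}]
print(find_matches(bought, pattern), find_matches(buying, pattern))
```

The optional determiner is consumed in "bought a smartphone" and skipped in "buying apps", so both token lists match the same pattern.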


## Large-scale Data Analysis 

**Data Structure: Vocab, Lexemes and StringStore**

- `Vocab` stores data shared across multiple documents
- To save memory, spaCy encodes all strings to **hash values**
- Strings are only stored once in the `StringStore` via `nlp.vocab.strings`
- String store: **lookup table** in both directions

```python
coffee_hash = nlp.vocab.strings['coffee']
coffee_string = nlp.vocab.strings[coffee_hash]
```

- Hashes cannot be reversed, so the shared vocab (which holds the string store) always needs to be passed along!!!

```python
# Raises an error if the string has not been seen before
string = nlp.vocab.strings[3197928453018144401]
```
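
A plain-Python sketch can make the two-way lookup and the "cannot be reversed" point concrete. This is not spaCy's implementation: `ToyStringStore` is a hypothetical class, and a truncated SHA-256 digest stands in for spaCy's own 64-bit hash function.

```python
import hashlib

class ToyStringStore:
    """Hypothetical two-way lookup table: string -> hash and hash -> string."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(string):
        # stand-in 64-bit hash (assumption: spaCy uses a different hash function)
        digest = hashlib.sha256(string.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "little")

    def add(self, string):
        key = self.hash_string(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            # string -> hash can always be computed
            return self.hash_string(item)
        # hash -> string only works if the string was stored before
        return self._by_hash[item]

store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])

empty_store = ToyStringStore()
try:
    empty_store[coffee_hash]
except KeyError:
    print("hash unknown to a store that never saw the string")
```

The second store has the same hash function but no entry for the hash, which is why objects need to share one vocab.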

- Look up string and hash in `nlp.vocab.strings`

```python
doc = nlp("I love coffee")
coffee_hash = nlp.vocab.strings['coffee'] # 3197928453018144401
nlp.vocab.strings[coffee_hash] # coffee

# doc also exposes vocab and strings
doc.vocab.strings['coffee']
```

- `Lexeme` obj is entry in vocab

```python
lexeme = nlp.vocab['coffee']

lexeme.text, lexeme.orth, lexeme.is_alpha # coffee 3197... True
```

- Contains **context-independent** info about word
    - Word text: `lexeme.text` and `lexeme.orth` (the hash)
    - Lexical attributes like `lexeme.is_alpha`
    - NOT context-dependent POS, DEPs or NE
    
**DATA STRUCTURE**

VOCAB, HASHES, LEXEMES

- DOC (Token : I : PRON) <-nsubj- (Token : love : VERB) -dobj-> (Token : coffee : NOUN)
- VOCAB (Lexeme : 46904...) (Lexeme : 37020...) (Lexeme : 31979...)
- STRINGSTORE (46904... : "I") (37020... : "love") (31979... : "coffee")

In [17]:
doc = nlp("I have a cat")

# look up hash for 'cat'
cat_hash = nlp.vocab.strings["cat"]
print(cat_hash)

# look up cat_hash to get string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811
cat


**Docs, Spans, NE from Scratch**
- spaCy under the hood

In [20]:
from spacy.tokens import Doc

words = ['Hello', 'world', '!']
spaces = [True, False, False]
# manual creation doc
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Hello world!


In [22]:
from spacy.tokens import Span

span = Span(doc, 0, 2)

span_with_label = Span(doc, 0, 2, label="GREETING")

print(span_with_label.text, span_with_label.label_)

doc.ents = [span_with_label]

print([(ent.text, ent.label_) for ent in doc.ents])

Hello world GREETING
[('Hello world', 'GREETING')]


**BEST PRACTICES**

- CONVERT RESULT TO STRINGS AS LATE AS POSSIBLE
- USE TOKEN ATTRIBUTE IF AVAILABLE e.g. `token.i` for token index
- ALWAYS PASS IN SHARED `vocab`

**Vectors**

`Doc.similarity(), Span.similarity(), Token.similarity()`

**THREE-WAY COMPARISON POSSIBLE** (any of the three object types against any other)

- Similarity is cosine by default; this can be changed
- `Doc` and `Span` vectors default to the average of their `Token` vectors
- Short phrases work better than long documents with many irrelevant words

- Useful for many applications: recommendations, flagging duplicates, etc.
- What counts as "similar" depends on the context and on what the application needs to do
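
The two defaults above (cosine similarity, averaged token vectors) can be sketched in plain Python. The 3-dimensional vectors are made-up numbers for illustration, and `average` and `cosine` are hypothetical helpers, not spaCy functions.

```python
import math

def average(vectors):
    """Element-wise mean of equally sized vectors (how a Doc/Span vector is built by default)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy word vectors (invented values, not taken from any model)
toy_vectors = {
    "cat": [1.0, 0.0, 1.0],
    "dog": [1.0, 0.2, 0.8],
    "car": [0.0, 1.0, 0.0],
}

doc_a = average([toy_vectors["cat"], toy_vectors["dog"]])  # "cat dog"
doc_b = average([toy_vectors["car"]])                      # "car"

sim_cat_dog = cosine(toy_vectors["cat"], toy_vectors["dog"])
sim_docs = cosine(doc_a, doc_b)
print(sim_cat_dog > sim_docs)
```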

### Combining Models and Rules

In [24]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add("DOG", None, pattern)
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)

Matched span: Golden Retriever


**Exact Match**

`COUNTRIES` is a list of country-name strings loaded from a JSON file

```python
# faster version of [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

# create doc and find matches in it
doc = nlp(TEXT) # some new text

for match_id, start, end in matcher(doc):
    # create a Span with label for "GPE"
    span = Span(doc, start, end, label="GPE")
    # overwrite doc.ents and add the span
    doc.ents = list(doc.ents) + [span]
    # get span's root and head token
    span_root_head = span.root.head
    print(span_root_head.text, "-->", span.text)
print([
    (ent.text, ent.label_) for ent in doc.ents if ent.label_ ==
    "GPE"])
```

## Processing Pipeline

**Built-in Components**
- **tagger** POS Token.tag
- **parser** Dependency parser Token.dep, Token.head, Doc.sents, Doc.noun_chunks
- **ner** Doc.ents, Token.ent_iob, Token.ent_type
- **textcat** Doc.cats

**Custom Components**
- A function that takes a `doc`, modifies it and returns it
- Can be added to the pipeline using `nlp.add_pipe`

In [29]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load('en_core_web_sm')
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))

matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

def animal_component(doc):
    matches = matcher(doc)
    # create Span for each match and assign label 
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # OVERWRITE doc.ents with matched spans
    doc.ents = spans
    return doc

nlp.add_pipe(animal_component, after="ner")
print(nlp.pipe_names)

doc = nlp("I have a cat and a Golden Retriever in my home in Montreal")
print([ (ent.text, ent.label_) for ent in doc.ents ])

['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


**Extension Attributes**

- Add custom metadata to documents, tokens and spans
- Accessible via `._` property

```python
doc._.title = "My document"
token._.is_color = True
span._.has_color = False
```

- Registered on global `Doc, Token, Span` using `set_extension`

```python
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)
```

**Scaling and Performance**

- SLOW `docs = [nlp(text) for text in BIG_TEXT]`
- FAST `docs = list(nlp.pipe(BIG_TEXT))`
- Passing in context 
    - `for doc, context in nlp.pipe(data, as_tuples=True): doc.text, context['page_number'] # self-defined`
    - combined with `set_extension` in for loop to make `doc._.id and doc._.page_number` 
- Run ONLY the tokenizer when nothing else is needed, or disable unused pipeline components
    - BAD: `doc = nlp("Hello world")`
    - GOOD: `doc = nlp.make_doc("Hello world")`
    - `with nlp.disable_pipes('tagger', 'parser'): doc = nlp(text)`
    
```python
# Example of pipe
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)
```

## Training Neural Network

- Essential for custom Textcat and NER
- Less critical for POS tagging and Dep parsing

**FLOW**
1. INIT model weights randomly with `nlp.begin_training`
2. Predict a few examples with current weights by calling `nlp.update`
3. Compare prediction with true labels
4. Calculate how to change weights to improve predictions
5. Update weights slightly
6. Go back to 2
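
The six steps can be sketched on a toy problem in plain Python. This is not spaCy's training code: it is a one-feature logistic regression with made-up data, and `weight`, `bias`, `learning_rate` and `predict` are names chosen for the illustration.

```python
import math
import random

random.seed(0)

# toy labelled data: one feature x, label 1 if x is positive else 0
examples = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

# 1. initialise the weights randomly
weight = random.uniform(-0.1, 0.1)
bias = random.uniform(-0.1, 0.1)
learning_rate = 0.5

def predict(x):
    """Probability that x belongs to class 1 under the current weights."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

for epoch in range(100):
    random.shuffle(examples)
    for x, label in examples:
        prediction = predict(x)            # 2. predict with the current weights
        error = prediction - label         # 3. compare with the true label
        grad_w = error * x                 # 4. work out how to change the weights
        grad_b = error
        weight -= learning_rate * grad_w   # 5. update the weights slightly
        bias -= learning_rate * grad_b
    # 6. go back to step 2 for the next pass over the data

print(all((predict(x) > 0.5) == bool(label) for x, label in examples))
```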

**NER Trainer**
- NER tags words and phrases
- Each token can be part of only one entity (labels are mutually exclusive)
- Examples need CONTEXT!!
    - `("iPhone X is coming", {"entities": [(0, 8, "GADGET")]})`

**Texts with no entities also important!!**
- `("I need a new phone! Any tips?", {"entities": []})`
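
The entity offsets in these examples are character positions, with the end exclusive. A small sketch of how such tuples could be built by hand; `make_example` is a hypothetical helper that only finds the first occurrence of a phrase.

```python
def make_example(text, phrase, label):
    """Hypothetical helper: build a (text, annotations) tuple with character offsets."""
    start = text.find(phrase)
    if start == -1:
        # no entity in this text: still a useful (negative) training example
        return (text, {"entities": []})
    end = start + len(phrase)  # end offset is exclusive
    return (text, {"entities": [(start, end, label)]})

positive = make_example("iPhone X is coming", "iPhone X", "GADGET")
negative = make_example("I need a new phone! Any tips?", "iPhone X", "GADGET")
print(positive)
print(negative)
```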

**Training Data**

- Updating an existing model = a few hundred to a few thousand examples
- Training a new category = a few thousand to a million examples
- Manual human annotators
- Can be semi-automated - **Matcher**

```python
# make patterns for Matcher
matcher.add("GADGET", None, pattern1, pattern2) # two patterns, e.g. 'iPhone X' and a variant using the '?' operator
TRAINING_DATA = []

for doc in nlp.pipe(TEXTS):
    # match on doc and create list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # get (start char, end char, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # format matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # append to training
    TRAINING_DATA.append(training_example)
```

**Training Loop**
- Loop for a number of iterations
- Shuffle training data
- Divide data into batches
- Update model per batch
- Save model

```python
import random

import spacy

# assumes `nlp`, TRAINING_DATA and path_to_model are already defined
for i in range(10):
    random.shuffle(TRAINING_DATA)
    for batch in spacy.util.minibatch(TRAINING_DATA):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        nlp.update(texts, annotations)
nlp.to_disk(path_to_model)
```
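
What the shuffle-and-batch part of the loop does can be shown without spaCy. `toy_minibatch` below is a hypothetical stand-in for `spacy.util.minibatch` using a fixed batch size, and the training data is made up.

```python
import random

def toy_minibatch(items, size=2):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

random.seed(0)
data = [("text %d" % i, {"entities": []}) for i in range(5)]
random.shuffle(data)  # a new order on every pass avoids learning the order itself

batch_sizes = []
for batch in toy_minibatch(data, size=2):
    # same unpacking as in the loop above
    texts = [text for text, annotation in batch]
    annotations = [annotation for text, annotation in batch]
    batch_sizes.append(len(texts))
print(batch_sizes)
```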

```python
# new pipeline from scratch
nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
ner.add_label('GADGET')

nlp.begin_training()
for itn in range(10):
    random.shuffle(examples)
    losses = {}
    for batch in spacy.util.minibatch(examples, size=2):
        ...
        nlp.update(texts, annotations, losses=losses)
        print(losses)
```

**BEST PRACTICES**
- Existing model can overfit new data
    - e.g. if only update with `WEBSITE`, it can 'unlearn' what a `PERSON` is
    - Aka 'catastrophic forgetting' problem
- Solution 1: Mix in previously correct predictions
    - also include `PERSON` examples
    - Run existing spaCy model over data and extract all other relevant entities !!

- Models cannot learn everything
    - predictions based on **local context**
    - model can struggle to learn if decision hard to make based on context
    - **label scheme needs to be consistent and not too specific**
        - e.g. `CLOTHING` is better than `ADULT_CLOTHING`, `CHILDRENS_CLOTHING`, etc.
- Solution 2: Plan label scheme carefully
    - pick categories reflecting local context
    - more generic better than specific
    - **use rules to go from generic labels to specific categories**
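
A minimal sketch of the last point, going from a generic predicted label to a more specific category with a rule. The predictions, the keyword set and `refine_label` are all invented for illustration.

```python
# (entity text, generic label) pairs, as a model trained on generic labels might produce
predicted = [("dress", "CLOTHING"), ("baby onesie", "CLOTHING"), ("Berlin", "GPE")]

# hypothetical keyword rule for the specific category
CHILD_KEYWORDS = {"baby", "kids", "toddler"}

def refine_label(text, label):
    """Map the generic CLOTHING label to a specific category; leave other labels alone."""
    if label != "CLOTHING":
        return label
    words = set(text.lower().split())
    return "CHILDRENS_CLOTHING" if words & CHILD_KEYWORDS else "ADULT_CLOTHING"

refined = [(text, refine_label(text, label)) for text, label in predicted]
print(refined)
```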

# GUIDE

## Linguistic Features
Processing raw text produces a `Doc`: a richly annotated object that exposes linguistic features.



### POS Tagging (Need Model)

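- After tokenizing, spaCy can **parse** and **tag** a given `Doc`; this is where the statistical model's predictions come in
- The model consists of **binary** data, produced by showing a system enough examples to generalise, e.g. a word following "the" in English is most likely a noun
- Linguistic annotations are available as `Token` attributes
- spaCy encodes all strings to hash values to reduce memory usage and improve speed
- Appending `_` to an attribute name gives its string representation, e.g. `token.pos_`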

In [76]:
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

print("{:<8}\t{:<8}\t{:<8}\t{:<8}\t{:<8}\t{:<8}\t{:<8}\t{:<8}".format('text', 
                                                                      'lemma', 
                                                                      'pos',
                                                                      'tag',
                                                                      'dep', 
                                                                      'shape',
                                                                      'is_alpha',
                                                                      'is_stop'))
print()
for token in doc:
    print("{:<8}\t{:<8}\t{:<8}\t{:<8}\t{:<8}\t{:<8}\t{:<8}\t{:<8}".format(
        token.text, 
        token.lemma_, # base form
        token.pos_, # simple POS tag
        token.tag_, # detailed POS tag
        token.dep_, # syntactic dependency or relation between tokens
        token.shape_, # word shape - cap, punc, digit
        token.is_alpha, 
        token.is_stop))
    
spacy.explain('ADP')

text    	lemma   	pos     	tag     	dep     	shape   	is_alpha	is_stop 

Apple   	Apple   	PROPN   	NNP     	nsubj   	Xxxxx   	1       	0       
is      	be      	AUX     	VBZ     	aux     	xx      	1       	1       
looking 	look    	VERB    	VBG     	ROOT    	xxxx    	1       	0       
at      	at      	ADP     	IN      	prep    	xx      	1       	1       
buying  	buy     	VERB    	VBG     	pcomp   	xxxx    	1       	0       
U.K.    	U.K.    	PROPN   	NNP     	compound	X.X.    	0       	0       
startup 	startup 	NOUN    	NN      	dobj    	xxxx    	1       	0       
for     	for     	ADP     	IN      	prep    	xxx     	1       	1       
$       	$       	SYM     	$       	quantmod	$       	0       	0       
1       	1       	NUM     	CD      	compound	d       	0       	0       
billion 	billion 	NUM     	CD      	pobj    	xxxx    	1       	0       


'adposition'

In [48]:
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro"}

spacy.displacy.render(doc, options=options)



In [54]:
# long text each sent

text = """In ancient Rome, some neighbors live in three adjacent houses. In the center is the house of Senex, who lives there with wife Domina, son Hero, and several slaves, including head slave Hysterium and the musical's main character Pseudolus. """
#A slave belonging to Hero, Pseudolus wishes to buy, win, or steal his freedom. One of the neighboring houses is owned by Marcus Lycus, who is a buyer and seller of beautiful women; the other belongs to the ancient Erronius, who is abroad searching for his long-lost children (stolen in infancy by pirates). One day, Senex and Domina go on a trip and leave Pseudolus in charge of Hero. Hero confides in Pseudolus that he is in love with the lovely Philia, one of the courtesans in the House of Lycus (albeit still a virgin)."""
doc = nlp(text)
sentence_spans = list(doc.sents)
spacy.displacy.render(sentence_spans, style="dep")

In [56]:
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
doc = nlp(text)
spacy.displacy.render(doc, style="ent")

### Rule-based Morphology

- a **lemma** (root form) is **inflected** (modified/combined) with one or more **morphological features** to create a surface form.
- I read the paper yesterday.
    - SURFACE: read
    - LEMMA: read
    - POS: verb
    - MORPHOLOGICAL FEAT.: VerbForm=Fin, Mood=Ind, Tense=Past
- Other feature values: VerbForm=Ger (gerund), Mood=Subj, Tense=Future, etc
- English is morphologically simple, so spaCy uses rules that can be **keyed by the token, the POS tag, or a combination of the two**
- Logic
    - the tokenizer consults a **MAPPING TABLE** `TOKENIZER_EXCEPTIONS`, which allows sequences of characters to be mapped to multiple tokens. Each token may be assigned a POS and one or more morphological features
    - the POS tagger then assigns each token an **extended POS tag** (`Token.tag`) expressing the POS and some morphological information, e.g. that a verb is past tense
    - for words whose POS was not set by a prior process, a **MAPPING TABLE** `TAG_MAP` maps the tags to a POS and a set of morphological features
    - finally, a **rule-based deterministic lemmatizer** maps the surface form to a lemma in light of the previously assigned extended POS and morphological information, without consulting the context of the token! (the lemmatizer also accepts list-based exception files, acquired from WordNet)
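
The last step can be sketched as a tiny rule-based lemmatizer: exceptions first, then suffix rules keyed by POS, with no look at the surrounding context. The tables below are invented for illustration and are far smaller than spaCy's; `lemmatize` is a hypothetical function.

```python
# hypothetical exception table keyed by (surface form, POS)
EXCEPTIONS = {("was", "VERB"): "be", ("mice", "NOUN"): "mouse", ("better", "ADJ"): "good"}

# hypothetical suffix rules per POS: (suffix to strip, replacement)
SUFFIX_RULES = {
    "VERB": [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("ses", "s"), ("s", "")],
}

def lemmatize(word, pos):
    """Deterministic lemma lookup from the surface form and its POS only."""
    word = word.lower()
    if (word, pos) in EXCEPTIONS:
        return EXCEPTIONS[(word, pos)]
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        # keep at least two characters of stem so short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: len(word) - len(suffix)] + replacement
    return word

# the same surface form gets a different lemma depending on the POS it was tagged with
print(lemmatize("meeting", "VERB"), lemmatize("meeting", "NOUN"))
```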

### Dependency Parsing (Need Model)

- fast navigating the dep-tree
- sentence boundary detection
- iterate over base noun phrases or "chunks"
- `doc.is_parsed` -> bool

**NOUN CHUNKS**

- "base noun phrases" - flat phrases having a noun as HEAD - sort of "noun plus words describing it" (the lavish green grass or the world's largest tech fund)
- root.text - text of the word connecting the noun chunk to the rest of the parse
    - cars
    - liability
    - manufacturers
- root.dep_ - dependency relation connecting the root to its head
    - nsubj
    - dobj
    - pobj
- root.head.text - root token's head text
    - shift
    - shift
    - toward

In [57]:
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, 
          chunk.root.text, 
          chunk.root.dep_,
          chunk.root.head.text)

Autonomous cars cars nsubj shift
insurance liability liability dobj shift
manufacturers manufacturers pobj toward


**NAVIGATING PARSE TREE**

- spaCy uses **head, child** to repr words **connected by single arc** in dep-tree
- `.dep` is hash and `.dep_` string

In [74]:
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
print("{:<8s}\t{:<8s}\t{:<8s}\t{:<8s}\t{:<8s}".format('text', 'dep', ' head-text', 'head-POS', 'children'))
print()
for token in doc:
    print("{:<8s}\t{:<8s}\t{:<8s}\t{:<8s}\t{}".format(token.text,
                                      token.dep_, 
                                      token.head.text, 
                                      token.head.pos_,
                                      [child for child in token.children]))

text    	dep     	 head-text	head-POS	children

Autonomous	amod    	cars    	NOUN    	[]
cars    	nsubj   	shift   	VERB    	[Autonomous]
shift   	ROOT    	shift   	VERB    	[cars, liability]
insurance	compound	liability	NOUN    	[]
liability	dobj    	shift   	VERB    	[insurance, toward]
toward  	prep    	liability	NOUN    	[manufacturers]
manufacturers	pobj    	toward  	ADP     	[]


- dep: syntactic link connecting child to head
- head.text: token head text
- head.pos: POS tag of token head
- children: immediate syntactic deps of token

> BECAUSE SYNTACTIC RELATIONS FORM A TREE, EACH WORD HAS EXACTLY ONE HEAD

- hence you can iterate over the arcs in the tree by iterating over the words in the sentence
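
Because every word has exactly one head, the whole tree fits in a single list of head indices. The sketch below is plain Python, not spaCy: the `heads` list is copied from the output table above (the root points to itself), and `children` and `subtree` are hypothetical helpers.

```python
words = ["Autonomous", "cars", "shift", "insurance", "liability", "toward", "manufacturers"]
# index of each word's head; the ROOT ("shift") is its own head
heads = [1, 2, 2, 4, 2, 4, 5]

def children(i):
    """Immediate syntactic dependents of token i."""
    return [j for j, head in enumerate(heads) if head == i and j != i]

def subtree(i):
    """Token i plus everything it dominates, in sentence order."""
    result = [i]
    for child in children(i):
        result.extend(subtree(child))
    return sorted(result)

# one arc per non-root word: iterating over the words visits every arc exactly once
arcs = [(words[head], words[i]) for i, head in enumerate(heads) if head != i]

liability = subtree(4)
left_edge, right_edge = liability[0], liability[-1]
print(arcs)
print(words[left_edge:right_edge + 1])
```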

In [77]:
from spacy.symbols import nsubj, VERB

doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# Finding a verb with a subject from below: good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
print(verbs)

{shift}


In [78]:
# less good 
# iterate twice, once for the head, then again via children
verbs = []  # a list this time: `verbs` above is a set, which has no .append()
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break

**Iterating around local tree**

In [79]:
doc = nlp("bright red apples on the tree")
print([token.text for token in doc[2].lefts])  # ['bright', 'red']
print([token.text for token in doc[2].rights])  # ['on']
print(doc[2].n_lefts)  # 2
print(doc[2].n_rights)  # 1

['bright', 'red']
['on']
2
1


> **Get a whole phrase by its syntactic head using `Token.subtree`, which returns an ordered sequence of tokens. Walk up the tree with `Token.ancestors` and check dominance with `Token.is_ancestor`.**

In [83]:
doc = nlp("Credit and mortgage account holders must submit their requests")
print("{:<8s}\t{:<8s}\t{:<8s}\t{:<8s}\t{:<8s}".format(
    'TEXT', 'DEP', 'N_LEFTS', 'N_RIGHTS', 'ANCESTORS'))
print()
root = [token for token in doc if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print("{:<8s}\t{:<8s}\t{:<8}\t{:<8}\t{}".format(
        descendant.text, descendant.dep_, descendant.n_lefts,
            descendant.n_rights,
            [ancestor.text for ancestor in descendant.ancestors]))

TEXT    	DEP     	N_LEFTS 	N_RIGHTS	ANCESTORS

Credit  	nmod    	0       	2       	['account', 'holders', 'submit']
and     	cc      	0       	0       	['Credit', 'account', 'holders', 'submit']
mortgage	conj    	0       	0       	['Credit', 'account', 'holders', 'submit']
account 	compound	1       	0       	['holders', 'submit']
holders 	nsubj   	1       	0       	['submit']


- `.left_edge` and `.right_edge` are especially useful: they give the first and last token of the subtree, which is the **easiest way to create a `Span` for a syntactic phrase** (THE RIGHT EDGE IS WITHIN THE SUBTREE, SO ADD 1 TO USE IT AS THE END-POINT OF A RANGE)

In [87]:
doc = nlp("Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
print("{:8<}\t{:8<}\t{:8<}\t{:8<}".format(
    'TEXT', 'POS', 'DEP', 'HEAD TEXT'))
print()
for token in doc:
    print("{:8<}\t{:8<}\t{:8<}\t{:8<}".format(
        token.text, token.pos_, token.dep_, token.head.text))

TEXT	POS	DEP	HEAD TEXT

Credit and mortgage account holders	NOUN	nsubj	submit
must	VERB	aux	submit
submit	VERB	ROOT	submit
their	DET	poss	requests
requests	NOUN	dobj	submit


**DISABLING PARSER**

- If you don't need any syntactic information, disable the parser
- The model then loads and runs faster

### Entity

- stats-model prediction

**ACCESS**

- `doc.ents` => `Span` sequence
- accessed either as hash or string `ent.label, ent.label_`
- also at `token.ent_iob` and `token.ent_type` 
- The IOB code indicates whether a token begins an entity, is inside one, or is outside any entity
    - I inside an NE
    - O outside 
    - B beginning of NE
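
How entity spans translate into per-token IOB codes can be sketched in plain Python. `spans_to_iob` is a hypothetical helper, and the spans use token offsets with an exclusive end, the same convention as `Span(doc, start, end)`.

```python
def spans_to_iob(n_tokens, spans):
    """Turn (start, end, label) token spans into one (iob, label) pair per token."""
    tags = [("O", "")] * n_tokens          # default: outside any entity
    for start, end, label in spans:
        tags[start] = ("B", label)         # first token begins the entity
        for i in range(start + 1, end):
            tags[i] = ("I", label)         # remaining tokens are inside it
    return tags

tokens = ["San", "Francisco", "considers", "banning", "sidewalk", "delivery", "robots"]
tags = spans_to_iob(len(tokens), [(0, 2, "GPE")])
for token, (iob, label) in zip(tokens, tags):
    print(token, iob, label)
```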

In [92]:
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)  # ['San', 'B', 'GPE']
print(ent_francisco)  # ['Francisco', 'I', 'GPE']

[('San Francisco', 0, 13, 'GPE')]
['San', 'B', 'GPE']
['Francisco', 'I', 'GPE']


#### Setting NE Annotations

- ensure consistency, set at **document level**
- BUT CANNOT write directly to `token.ent_iob, token.ent_type` 
- SO easiest to set by assigning to `doc.ents` and create new NE as `Span`

In [40]:
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "fb" as an entity :(

Before []


In [96]:
from spacy.tokens import Span

fb_ent = Span(doc, 0, 1, label="ORG")
doc.ents = list(doc.ents) + [fb_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) 
       for e in doc.ents]
print('After', ents)

After [('fb', 0, 2, 'ORG')]


In [100]:
doc[0]

fb

#### Setting NE from array

- `doc.from_array` with both `ENT_TYPE, ENT_IOB` in the array

In [103]:
import numpy
from spacy.attrs import ENT_IOB, ENT_TYPE

doc = nlp.make_doc("London is a big city in the United Kingdom.")
print("Before", doc.ents)  # []

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)))
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
print("After", doc.ents)  # [London]

Before ()
After (London,)


#### Setting NE in Cython

- writes directly to the underlying C struct
- efficient native code

```cython
# cython: infer_types=True
from spacy.tokens.doc cimport Doc

cpdef set_entity(Doc doc, int start, int end, int ent_type):
    for i in range(start, end):
        doc.c[i].ent_type = ent_type
    doc.c[start].ent_iob = 3
    for i in range(start+1, end):
        doc.c[i].ent_iob = 2
```

- if you write to `TokenC*` structs directly, you are responsible for ensuring the data is left in a consistent state

### Entity Linking

- resolves a textual entity to a unique identifier from a knowledge base (KB)
- the [processing scripts](https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking) use WikiData identifiers, but you can create your own KB and train a new entity linking model with it (see KB API and Training later)

#### Accessing Entity Identifiers

- available as a hash or string via `ent.kb_id` / `ent.kb_id_` on a `Span`, or `ent_kb_id` / `ent_kb_id_` on a `Token`

### Tokenization

- input: a string; output: a `Doc`
- constructing a `Doc` requires a `Vocab` instance, a sequence of word strings and, optionally, a sequence of `spaces` booleans
- non-destructive: `doc.text == input_text`
- segments text into words, punctuation, etc. by applying rules specific to each language
- first, the raw text is split on whitespace, similar to `text.split(' ')`
- then, on each substring, the tokenizer performs two checks:
    1. **Does the substring match a tokenizer exception rule?** e.g. "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token
    2. **Can a prefix, suffix or infix be split off?** e.g. punctuation like commas, periods, hyphens or quotes
- if there is a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings
- this way spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks

![example](https://spacy.io/tokenization-57e618bd79d933c4ccd308b5739062d6.svg)

#### Tokenizer Data

- Global and lang-specific tokenizer data is supplied via lang-data

![Language data](https://spacy.io/language_data-ef63e6a58b7ec47c073fb59857a76e5f.svg)

#### Custom Tokenisation Rules

- most domains have idiosyncrasies that need custom rules, e.g. very specific expressions, or abbreviations only used in that field

In [1]:
from spacy.symbols import ORTH
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("gimme that")
print([w.text for w in doc])

['gimme', 'that']


In [2]:
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

print([w.text for w in nlp("gimme that")])

['gim', 'me', 'that']


> A special case doesn't have to match an entire whitespace-delimited substring: the tokenizer incrementally splits off punctuation and keeps looking up the remaining substring

In [3]:
[w.text for w in nlp("gimme!")]

['gim', 'me', '!']

In [4]:
[w.text for w in nlp('("...gimme...?")')]

['(', '"', '...', 'gim', 'me', '...', '?', '"', ')']

#### How it works?

- handle "don't" as well as "(don't)!" via splitting off the open bracket, then the exclamation, then close bracket, finally matching special case

```python
def tokenizer_pseudo_code(text, special_cases, prefix_search, suffix_search,
                          infix_finditer, token_match):
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            while prefix_search(substring) or suffix_search(substring):
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ''
                    break
                if prefix_search(substring):
                    split = prefix_search(substring).end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    if substring in special_cases:
                        continue
                if suffix_search(substring):
                    split = suffix_search(substring).start()
                    suffixes.append(substring[split:])
                    substring = substring[:split]
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ''
            elif token_match(substring):
                tokens.append(substring)
                substring = ''
            elif list(infix_finditer(substring)):
                infixes = infix_finditer(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ''
            elif substring:
                tokens.append(substring)
                substring = ''
        tokens.extend(reversed(suffixes))
    return tokens
```

1. iterate over whitespace-separated substrings
2. check whether there is an explicitly defined rule (special case) for this substring; if so, use it
3. otherwise try to consume one prefix; if one was consumed, go back to 2) so that special cases always get priority
4. if no prefix was consumed, try to consume a suffix, then go back to 2)
5. if neither a prefix nor a suffix can be consumed, look for a special case
6. look for a token match
7. look for "infixes" (hyphens etc.) and split the substring into tokens on all infixes
8. once no more of the string can be consumed, handle it as a single token
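
The order of those steps is easier to see with a trace. Below is a minimal plain-Python sketch, not spaCy's implementation: it covers only the special-case, prefix and suffix steps (no `token_match`, no infixes), and the rule set and the name `tokenize_with_trace` are assumptions made up for this illustration.

```python
import re

# Toy rules, assumed for illustration only; spaCy's real English rules are far larger.
SPECIAL_CASES = {"don't": ["do", "n't"]}
PREFIX_RE = re.compile(r'''^[\[\("']''')
SUFFIX_RE = re.compile(r'''[\]\)"'!?.,]$''')


def tokenize_with_trace(text):
    """Return (tokens, trace) for a simplified special-case/prefix/suffix loop."""
    tokens, trace = [], []
    for substring in text.split():
        suffixes = []
        while substring:
            # Special cases are checked first on every pass, so they always win.
            if substring in SPECIAL_CASES:
                tokens.extend(SPECIAL_CASES[substring])
                trace.append(("SPECIAL", substring))
                break
            prefix = PREFIX_RE.search(substring)
            if prefix:
                tokens.append(substring[:prefix.end()])
                trace.append(("PREFIX", substring[:prefix.end()]))
                substring = substring[prefix.end():]
                continue
            suffix = SUFFIX_RE.search(substring)
            if suffix:
                suffixes.append(substring[suffix.start():])
                trace.append(("SUFFIX", substring[suffix.start():]))
                substring = substring[:suffix.start()]
                continue
            # Nothing left to strip: keep the remainder as one token.
            tokens.append(substring)
            trace.append(("TOKEN", substring))
            break
        # Suffixes were collected from the outside in, so emit them reversed.
        tokens.extend(reversed(suffixes))
    return tokens, trace


tokens, trace = tokenize_with_trace("(don't)!")
print(tokens)
for rule, piece in trace:
    print(rule, piece)
```

For "(don't)!" the trace reads PREFIX `(`, SUFFIX `!`, SUFFIX `)`, SPECIAL `don't`, and the tokens come out as `(`, `do`, `n't`, `)`, `!`.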

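#### Debugging the Tokenizer (v2.2.3+)

- a working implementation of the pseudo-code above is available as `nlp.tokenizer.explain(text)`
- returns a list of tuples showing which tokenizer rule or pattern was matched for each token
- tokens produced are the same as from `nlp.tokenizer()` except for whitespace tokens
- added in v2.2.3; on an older install the call fails with `AttributeError`, as in the cell below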

In [5]:
from spacy.lang.en import English

nlp = English()
text = '''"Let's go!"'''
doc = nlp(text)
tok_exp = nlp.tokenizer.explain(text)
print(tok_exp)
for t in tok_exp:
    print(t[1], "\t", t[0])

AttributeError: 'spacy.tokenizer.Tokenizer' object has no attribute 'explain'

#### Customising Tokenizer Class

Let's imagine you wanted to create a tokenizer for a new language or specific domain. There are five things you would need to define:

1. A dictionary of special cases. This handles things like contractions, units of measurement, emoticons, certain abbreviations, etc.
2. A function prefix_search, to handle preceding punctuation, such as open quotes, open brackets, etc.
3. A function suffix_search, to handle succeeding punctuation, such as commas, periods, close quotes, etc.
4. A function infix_finditer, to handle non-whitespace separators, such as hyphens etc.
5. An optional boolean function token_match matching strings that should never be split, overriding the infix rules. Useful for things like URLs or numbers. Note that prefixes and suffixes will be split off before token_match is applied.

> usually no need to subclass `Tokenizer`; standard usage is to build regex objects with `re.compile()` and pass their `.search()` and `.finditer()` methods:

In [6]:
import re
import spacy
from spacy.tokenizer import Tokenizer

special_cases = {":)": [{"ORTH": ":)"}]}
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, rules=special_cases,
                                prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=simple_url_re.match)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("hello-world. :)")
print([t.text for t in doc]) # ['hello', '-', 'world.', ':)']

  


['hello', '-', 'world.', ':)']


> If you need to subclass the tokenizer instead, the relevant methods to specialize are find_prefix, find_suffix and find_infix.

Important note
When customizing the prefix, suffix and infix handling, remember that you're passing in functions for spaCy to execute, e.g. `prefix_re.search`, not just the regular expressions. This means that your functions also need to define how the rules should be applied. For example, if you're adding your own prefix rules, you need to make sure they're only applied to characters at the beginning of a token, e.g. by adding `^`. Similarly, suffix rules should only be applied at the end of a token, so your expression should end with a `$`.
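
A small check with plain `re` shows why the anchor matters. This is only a sketch: the variable names and the sample string `f(x)` are made up for illustration.

```python
import re

unanchored = re.compile(r"[\[(]")   # matches an opening bracket anywhere
anchored = re.compile(r"^[\[(]")    # matches an opening bracket only at the start

candidate = "f(x)"

# The unanchored search finds the "(" in the middle of the string.
match = unanchored.search(candidate)
print(match.start(), match.end())

# The anchored search finds nothing, because "f(x)" does not start with a bracket.
print(anchored.search(candidate))

# On a string that really starts with a bracket, the anchored pattern matches.
print(anchored.search("(x)").end())
```

Following the pseudo-code earlier in this section, the prefix is cut at `match.end()`. Passing `unanchored.search` as `prefix_search` would therefore split off `f(` as if it were a prefix.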

## Rule Based Matching

### EntityRuler

- component for adding NEs based on pattern dicts, making it easy to combine rule-based and statistical NER

#### Entity Patterns

- each pattern is a dict with "label" and "pattern" keys
- **phrase patterns**: exact string match
    - `{"label": "ORG", "pattern": "Apple"}`
- **token patterns**: list of dicts, one dict per token
    - `{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}`
    
#### Usage

- when `nlp` is called on a text, matches found in the `Doc` are added to `doc.ents`
- designed to integrate with the stats-model: if added BEFORE "ner", the model respects the existing NE spans and adjusts its predictions around them; if added AFTER "ner", it only adds spans to `doc.ents` if there is NO OVERLAP with entities the model predicted
- to OVERWRITE overlapping NEs, set `overwrite_ents=True` on init
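
The "after ner" behaviour and the effect of overwriting can be pictured with plain tuples. This is a hypothetical sketch of the idea only, not the `EntityRuler` implementation: `merge_rule_spans` and the `(start, end, label)` tuples (with exclusive `end`) are assumptions for illustration.

```python
def merge_rule_spans(model_ents, rule_spans, overwrite=False):
    """Combine (start, end, label) spans; `end` is exclusive."""

    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    if overwrite:
        # Rule spans win: drop any model span that clashes with a rule span.
        kept = [e for e in model_ents if not any(overlaps(e, r) for r in rule_spans)]
        return sorted(kept + list(rule_spans))
    # Model spans win: add a rule span only if it touches no existing entity.
    added = [r for r in rule_spans if not any(overlaps(r, e) for e in model_ents)]
    return sorted(list(model_ents) + added)


model_ents = [(0, 2, "PERSON"), (5, 6, "GPE")]
rule_spans = [(1, 3, "ORG"), (8, 9, "ORG")]

print(merge_rule_spans(model_ents, rule_spans))
print(merge_rule_spans(model_ents, rule_spans, overwrite=True))
```

Without overwriting, the rule span `(1, 3, "ORG")` is dropped because it overlaps the model's `(0, 2, "PERSON")`. With overwriting, the PERSON span is the one that goes.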

In [3]:
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Apple', 'ORG'), ('San Francisco', 'GPE')]


In [6]:
import spacy
nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp)
patterns = [{"label": "ORG", "pattern": "MyCorp Inc."}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp("MyCorp Inc. is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])

print(nlp.pipe_names)

[('MyCorp Inc.', 'ORG'), ('U.S.', 'GPE')]
['tagger', 'parser', 'ner', 'entity_ruler']


#### Validating and debugging patterns

- validate patterns against the JSON schema

`ruler = EntityRuler(nlp, validate=True)`

#### Adding IDs to patterns

- patterns also accept an `id` attribute
- allows multiple patterns to be associated with the SAME NE !!!

In [7]:
nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc1 = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc1.ents])

doc2 = nlp("Apple is opening its first big office in San Fran.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc2.ents])

[('Apple', 'ORG', 'apple'), ('San Francisco', 'GPE', 'san-francisco')]
[('Apple', 'ORG', 'apple'), ('San Fran', 'GPE', 'san-francisco')]


> if `id` is included, the `ent_id_` property of the matched NE is set to that `id` !!! so in the example above it's easy to identify that both GPE patterns map to the same NE

#### Pattern files

- `to_disk` and `from_disk` save and load patterns to and from JSONL files, one pattern per line

```jsonl
# patterns.jsonl
{"label": "ORG", "pattern": "Apple"}
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
```

```python
ruler.to_disk('./patterns.jsonl')
new_ruler = EntityRuler(nlp).from_disk('./patterns.jsonl')
```
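
The one-pattern-per-line layout can also be written and read back with the standard library alone. A minimal sketch, assuming a temporary directory as the location; it does not go through `EntityRuler` at all.

```python
import json
import os
import tempfile

patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]

# Temporary location, chosen only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "patterns.jsonl")

# JSONL: one JSON object per line, with no enclosing list.
with open(path, "w", encoding="utf8") as f:
    for pattern in patterns:
        f.write(json.dumps(pattern) + "\n")

# Reading back: parse each non-empty line on its own.
with open(path, encoding="utf8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == patterns)
```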

- when an `nlp` object with an `EntityRuler` in its pipeline is saved, its patterns are exported automatically
- they land in the model directory written by `nlp.to_disk`

In [None]:
nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
nlp.add_pipe(ruler)
nlp.to_disk("/path/to/model")

### Combining Model and Rule

- rules can augment the stats-model by **presetting tags, NEs or sentence boundaries for specific tokens**
- the stats-model usually respects these presets, which can sometimes improve the accuracy of OTHER DECISIONS!!
- rules can also run after the model to correct common errors
- rules can reference attributes set by the model to implement more abstract logic

#### Example: Expanding NE

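- the model may predict only part of an NE, either through a plain misprediction or because the NE definition in the original training corpus doesn't match what the application needs
- e.g. titles like "Mr." or "Dr." left outside PERSON spans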

In [10]:
nlp = spacy.load("en_core_web_sm")
from spacy.tokens import Span

def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
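        if ent.label_ == "PERSON" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
                # Extend the span one token to the left to include the title
                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                new_ents.append(new_ent)
            else:
                # No title in front: keep the entity unchanged
                new_ents.append(ent)
        else:
            new_ents.append(ent)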
    doc.ents = new_ents
    return doc

# Add the component after the named entity recognizer
nlp.add_pipe(expand_person_entities, after='ner')

doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Dr Alex Smith', 'PERSON'), ('Acme Corp Inc.', 'ORG')]


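- alternative: register an **extension attribute** `._.person_title` on `Span` instead of changing the span
- advantage: the NE text stays intact and can still be used to look up the name in a KB
- in the cell below the title comes back as `None` because this `nlp` object still carries `expand_person_entities`, so "Dr" is already inside the span, which starts at token 0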

In [12]:
def get_person_title(span):
    if span.label_ == "PERSON" and span.start != 0:
        prev_token = span.doc[span.start - 1]
        if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
            return prev_token.text

# Register the Span extension as 'person_title'
Span.set_extension("person_title", getter=get_person_title)

doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_, ent._.person_title) for ent in doc.ents])

[('Dr Alex Smith', 'PERSON', None), ('Acme Corp Inc.', 'ORG', None)]


#### Example: POS Tags and Depen-Parse with NE

- case: telling past from present occupations using NER plus syntax !!
- trigger word: "work", in PAST TENSE or PRESENT TENSE
- check whether company names are attached to it and whether the person is the subject

In [13]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Alex Smith worked at Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])
spacy.displacy.render(doc, options={'fine_grained': True})

[('Alex Smith', 'PERSON'), ('Acme Corp Inc.', 'ORG')]


- "worked" is ROOT of sentence and past tense verb
- subject is "Alex Smith", the person who worked; "at Acme Corp Inc." is the **prepositional phrase** attached to the verb "worked"
- workflow
    - find predicted PERSON NEs
    - find their HEAD and check whether it is the trigger word "work"
    - check for prepositional phrases attached to the HEAD and whether they contain an ORG NE
    - decide whether the company affiliation is current from the tense in the HEAD's POS tag

In [14]:
person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
for ent in person_entities:
    # Because the entity is a span, we need to use its root token. The head
    # is the syntactic governor of the person, e.g. the verb
    head = ent.root.head
    if head.lemma_ == "work":
        # Check if the children contain a preposition
        preps = [token for token in head.children if token.dep_ == "prep"]
        for prep in preps:
            # Check if tokens part of ORG entities are in the preposition's
            # children, e.g. at -> Acme Corp Inc.
            orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
            # If the verb is in past tense, the company was a previous company
            print({'person': ent, 'orgs': orgs, 'past': head.tag_ == "VBD"})

{'person': Alex Smith, 'orgs': [Inc.], 'past': True}


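- wrap the logic in a custom pipeline component
- the logic above expects NEs to be merged into single tokens !!! hence `merge_entities` (compare `[Inc.]` above with `[Acme Corp Inc.]` below)
- instead of printing, the result could be written to custom attributes on the entity span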

In [15]:
from spacy.pipeline import merge_entities
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

def extract_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == "work":
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
                print({'person': ent, 'orgs': orgs, 'past': head.tag_ == "VBD"})
    return doc

# To make the entities easier to work with, we'll merge them into single tokens
nlp.add_pipe(merge_entities)
nlp.add_pipe(extract_person_orgs)

doc = nlp("Alex Smith worked at Acme Corp Inc.")
# If you're not in a Jupyter / IPython environment, use displacy.serve
displacy.render(doc, options={'fine_grained': True})

{'person': Alex Smith, 'orgs': [Acme Corp Inc.], 'past': True}


In [18]:
def extract_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == "work":
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                orgs = [t for t in prep.children if t.ent_type_ == "ORG"]
                aux = [token for token in head.children if token.dep_ == "aux"]
                past_aux = any(t.tag_ == "VBD" for t in aux)
                # Past if "worked" (VBD) or "was working" (VBG with a past-tense auxiliary)
                past = head.tag_ == "VBD" or (head.tag_ == "VBG" and past_aux)
                print({'person': ent, 'orgs': orgs, 'past': past})
    return doc

## Pipelines

- `Tokenizer` "tokenizer" -> `Doc` tokens
- `Tagger` "tagger" -> `Doc[i].tag` POS tags
- `DependencyParser` "parser" -> `Doc[i].head, Doc[i].dep, Doc.sents, Doc.noun_chunks`
- etc
- ALWAYS DEPENDS ON STATS-MODEL and its capabilities

**ORDER OF PIPELINE**
> v2.x: stats-models like the tagger or parser are INDEPENDENT and don't share any data between themselves. E.g. NER doesn't use any features set by the tagger or parser. This means they CAN BE SWAPPED or REMOVED without affecting the others

> BUT custom components may depend on annotations set by others, e.g. a custom lemmatizer may need POS tags, so there the order matters.
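
A toy pipeline in plain Python makes the order dependency concrete. Everything here is hypothetical: the dict standing in for a `Doc`, the two component functions and `run` are illustration only, not spaCy code.

```python
def toy_tagger(doc):
    # Hypothetical stand-in for a statistical tagger.
    doc["tags"] = ["VERB" if w.endswith("ing") else "NOUN" for w in doc["words"]]
    return doc


def toy_lemmatizer(doc):
    # Depends on the tags set by an earlier component.
    if "tags" not in doc:
        raise ValueError("toy_lemmatizer needs tags; run toy_tagger first")
    doc["lemmas"] = [
        w[:-3] if t == "VERB" else w for w, t in zip(doc["words"], doc["tags"])
    ]
    return doc


def run(pipeline, words):
    # Call each (name, component) pair in order, passing the doc along.
    doc = {"words": words}
    for name, component in pipeline:
        doc = component(doc)
    return doc


good = [("tagger", toy_tagger), ("lemmatizer", toy_lemmatizer)]
print(run(good, ["walking", "dog"])["lemmas"])

bad = list(reversed(good))
try:
    run(bad, ["walking", "dog"])
except ValueError as err:
    print("wrong order:", err)
```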

**TOKENIZER SPECIAL**
> "tokenizer" is hidden from the pipeline list since there can ONLY be one tokenizer: while all other pipeline components take a `Doc` and return it, the tokenizer takes a **string of text** and turns it into a `Doc`. Still customisable, since `nlp.tokenizer` is writable

### Processing Text

- when processing large volumes of text, the **STATS-MODELS ARE MORE EFFICIENT IF THEY WORK ON BATCHES OF TEXTS**; `nlp.pipe` takes an iterable of texts, batches them internally and yields `Doc` objects
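
In spaCy itself this is just `for doc in nlp.pipe(texts): ...`. What "batching internally" means can be sketched in plain Python. Both `minibatch` and `process_batch` below are hypothetical stand-ins written for this note, not spaCy's code.

```python
from itertools import islice


def minibatch(items, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def process_batch(texts):
    # Stand-in for a model call that handles a whole batch at once.
    return [len(text.split()) for text in texts]


texts = ["one two", "three", "four five six", "seven", "eight nine"]
results = []
for batch in minibatch(texts, size=2):
    results.extend(process_batch(batch))
print(results)
```

The caller still sees one result per text, in the original order. Only the number of calls into the model changes.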

### How it works

- when loading a model, spaCy first consults its `meta.json`
- loads the language class and data via `get_lang_class` and initialises it
- **`Language` class contains the shared vocab, tokenization rules and lang-specific annotation scheme**
- creates each pipeline component via `create_pipe` and adds it in order via `add_pipe`
- makes the model data available to the `Language` class by calling `from_disk` with the model dir

- e.g.
    - lang: "en"
    - pipeline: ["tagger", "parser", "ner"]
    - language class: `spacy.lang.en.English`
    
> **FUNDAMENTALLY, A SPACY MODEL HAS 3 COMPONENTS:** WEIGHTS, i.e. binary data loaded from dir; a PIPELINE of functions called in order; LANG DATA like tokenization rules and annotation schemes

In [21]:
# What spacy.load does under the hood

lang = "en"
pipeline = ["tagger", "parser", "ner"]
data_path = "path/to/encore"

cls = spacy.util.get_lang_class(lang)  # the English class
nlp = cls()  # initialise an empty pipeline
for name in pipeline:
    component = nlp.create_pipe(name)
    nlp.add_pipe(component)
# nlp.from_disk(data_path)  # left commented out, so the components have no weights yet

In [22]:
# What calling nlp(text) does under the hood
doc = nlp.make_doc("This is a sentence.")
for name, proc in nlp.pipeline:
    doc = proc(doc)

ValueError: [E109] Model for component 'tagger' not initialized. Did you forget to load a model, or forget to call begin_training()?

- built-in components are also available in `Language.factories`, so they can be initialised via `nlp.create_pipe` with their string names

FACTORIES

- tagger, parser, ner, 
- "entity_linker" `from spacy.pipeline import EntityLinker`
- "sentencizer"
- "merge_noun_chunks" - merge all noun chunks into single token - should be after tagger and parser
- "merge_entities" - after NER
- "merge_subtokens" - merge subtokens predicted by parser into single tokens, after parser

### Custom Pipeline

```python
def my_component(doc):
    # do something to the doc here
    return doc
```

In [23]:
def my_component(doc):
    print("After tokenization, this doc has {} tokens.".format(len(doc)))
    print("The part-of-speech tags are:", [token.pos_ for token in doc])
    if len(doc) < 10:
        print("This is a pretty short document.")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(my_component, name="print_info", last=True)
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner', 'print_info']
doc = nlp("This is a sentence.")

['tagger', 'parser', 'ner', 'print_info']
After tokenization, this doc has 5 tokens.
The part-of-speech tags are: ['DET', 'AUX', 'DET', 'NOUN', 'PUNCT']
This is a pretty short document.


> wrap a component in a class to INIT it with custom settings and hold state within the component !!!

- useful for stateful components, especially ones depending on shared data !!!
- e.g. a custom `EntityMatcher` initialised with `nlp`, a term list and an NE label; using `PhraseMatcher`, it matches the terms in the `Doc` and adds them to `doc.ents`

In [25]:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class EntityMatcher(object):
    name = "entity_matcher"
    
    def __init__(self, nlp, terms, label):
        patterns = [nlp.make_doc(text) for text in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)
        
    def __call__(self, doc):
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            doc.ents = list(doc.ents) + [span]
        return doc
    
nlp = spacy.load("en_core_web_sm")
terms = ("cat", "dog", "tree kangaroo", "giant sea spider")
entity_matcher = EntityMatcher(nlp, terms, "ANIMAL")

nlp.add_pipe(entity_matcher, after="ner")

doc = nlp("This is a text about Barack Obama and a tree kangaroo")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Barack Obama', 'LOC'), ('tree kangaroo', 'ANIMAL')]


### Example: Custom sentence segmentation logic

- add after tokenization but BEFORE "parser", so the parser can take the preset sentence boundaries into account

In [26]:
def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-2]):
        # Define sentence start if pipe + titlecase token
        if token.text == "|" and doc[i+1].is_title:
            doc[i+1].is_sent_start = True
        else:
            # Explicitly set sentence start to False otherwise to
            # tell parser to leave those tokens alone
            doc[i+1].is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(custom_sentencizer, before="parser")  # Insert before the parser
doc = nlp("This is. A sentence. | This is. Another sentence.")
for sent in doc.sents:
    print(sent.text)

This is. A sentence. |
This is. Another sentence.


### Example: Entity Matching and Tagging with Custom Attributes

- takes a term list (company names) and matches occurrences as ORG
- merges the matched tokens
- sets custom `._.is_tech_org` attributes
- uses `PhraseMatcher`, applied to the `Doc`

[link to github](https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_entities.py)

### Adding factories

- when loading a model, spaCy looks up each pipeline name string in its internal factories to initialise the component
- a custom component's name won't be found there
- so tell spaCy where to find it by adding it to `Language.factories`

In [27]:
from spacy.language import Language
Language.factories["entity_matcher"] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)

> ship the code above and the custom component in the model package's `__init__.py`, so it is executed when the model loads; `**cfg` settings are passed all the way down from `spacy.load()`

In [None]:
nlp = spacy.load("custom_model", terms=["tree kangaroo"], label="ANIMAL")
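
The lookup-by-name idea and the way keyword settings travel down to the component can be sketched with a plain dict. The names `FACTORIES`, `ToyMatcher` and `create_pipe` are hypothetical and only mirror the shape of the real mechanism.

```python
# Registry mapping string names to factory callables.
FACTORIES = {}


class ToyMatcher:
    # Stand-in component that just remembers its settings.
    def __init__(self, nlp, terms=(), label="MISC"):
        self.terms = tuple(terms)
        self.label = label


# Register under a string name, mirroring the lambda shown above.
FACTORIES["entity_matcher"] = lambda nlp, **cfg: ToyMatcher(nlp, **cfg)


def create_pipe(nlp, name, **cfg):
    # Look the name up and forward any settings to the factory.
    if name not in FACTORIES:
        raise KeyError("no factory registered for {!r}".format(name))
    return FACTORIES[name](nlp, **cfg)


component = create_pipe(None, "entity_matcher", terms=["tree kangaroo"], label="ANIMAL")
print(type(component).__name__, component.terms, component.label)
```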

### Extension attributes

- storage of additional info

# API

### Containers

- DOC 
    - container of sequence of tokens & annotations
    - owns data
    - made via `Tokenizer`
    - mod (inplace) via COMPONENTs of PIPELINE
        - `Language` object coordinates COMPONENTS
        - raw text -> pipeline -> annotated document
        - orchestrate training & serialisation
    - SPAN
        - view pointer
        - slice of `Doc`
    - TOKEN
        - view pointer
        - {word, punctuation, symbol, whitespace, etc}

- VOCAB
    - look-up tables / meta 
    - LEXEME
        - entry in vocabulary
        - word type without annotations (opposed to token)



In [184]:
# Overriding Labels in Span()

from spacy.tokens import Span

doc = nlp(u"FB is hiring a new VP of global policy")
doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings[u"ORG"])]
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

FB 0 2 ORG


In [191]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be VERB VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


### Pipeline

- `Language`
    - text-processing pipe
    - load once per process as `nlp`
    - pass instance around application
- `Tokenizer`
    - segment text & create `Doc`
- `Lemmatizer`
    - determine base forms of words
- `Morphology`
    - assign linguistic features 
    - {lemmas, noun case, verb tense, etc}
    - based on word, POS
- `Tagger`
    - annotate POS on `Doc`
- `DependencyParser`
    - annotate syntactic dependencies on `Doc`
- `EntityRecognizer`
    - annotate NER
- `TextCategorizer`
    - assign categories or labels to `Doc`
- `Matcher` 
    - match sequences of tokens based on pattern rules (similar to regex)
- `PhraseMatcher`
    - match sequences of tokens based on phrases
- `EntityRuler`
    - add entity `Span`s to `Doc` using token-based rules or exact phrase matches
- `Sentencizer`
    - custom sentence boundary detection logic (no dependency parse needed)
- Other functions
    - automatically apply something to `Doc`, e.g. merge spans of tokens
    



### Models and Training Data

#### JSON input format for training

`convert` command converts the `.conllu` format used by the Universal Dependencies corpora to the JSON training format

```json
[{
    "id": int,                      # ID of the document within the corpus
    "paragraphs": [{                # list of paragraphs in the corpus
        "raw": string,              # raw text of the paragraph
        "sentences": [{             # list of sentences in the paragraph
            "tokens": [{            # list of tokens in the sentence
                "id": int,          # index of the token in the document
                "dep": string,      # dependency label
                "head": int,        # offset of token head relative to token index
                "tag": string,      # part-of-speech tag
                "orth": string,     # verbatim text of the token
                "ner": string       # BILUO label, e.g. "O" or "B-ORG"
            }],
            "brackets": [{          # phrase structure (NOT USED by current models)
                "first": int,       # index of first token
                "last": int,        # index of last token
                "label": string     # phrase label
            }]
        }]
    }]
}]
```
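
The BILUO scheme in the `ner` field (Begin, In, Last, Unit, Out) can be derived from token-level spans. A minimal sketch: `spans_to_biluo` is a hypothetical helper, and the DATE span over "Oct. 19" is an assumption made for the example.

```python
def spans_to_biluo(n_tokens, spans):
    """spans: (start, end, label) token offsets with exclusive `end`."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity
            tags[start] = "U-" + label
        else:
            tags[start] = "B-" + label
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label
            tags[end - 1] = "L-" + label
    return tags


words = ["In", "an", "Oct.", "19", "review"]
print(spans_to_biluo(len(words), [(2, 4, "DATE")]))
```
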
> EXAMPLE: dep, POS, NER from Wall Street Journal portion of Penn Treebank

```json
[
    {
      "id": 42,
      "paragraphs": [
        {
          "raw": "In an Oct. 19 review of \"The Misanthrope\" at Chicago's Goodman Theatre (\"Revitalized Classics Take the Stage in Windy City,\" Leisure & Arts), the role of Celimene, played by Kim Cattrall, was mistakenly attributed to Christina Haag. Ms. Haag plays Elianti.",
          "sentences": [
            {
              "tokens": [
                {
                  "head": 44,
                  "dep": "prep",
                  "tag": "IN",
                  "orth": "In",
                  "ner": "O",
                  "id": 0
                },
                {
                  "head": 3,
                  "dep": "det",
                  "tag": "DT",
                  "orth": "an",
                  "ner": "O",
                  "id": 1
                },
                {
                  "head": 2,
                  "dep": "nmod",
                  "tag": "NNP",
                  "orth": "Oct.",
                  "ner": "B-DATE",
                  "id": 2
                },
```
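
Because `head` is an offset relative to the token's own index, the absolute head index is `id + head`. A short sketch using the three tokens shown above, with the other fields dropped for brevity; `absolute_head` is a name made up here.

```python
tokens = [
    {"id": 0, "orth": "In", "head": 44},
    {"id": 1, "orth": "an", "head": 3},
    {"id": 2, "orth": "Oct.", "head": 2},
]


def absolute_head(token):
    # The stored head is relative to the token's own position.
    return token["id"] + token["head"]


for token in tokens:
    print(token["orth"], "->", absolute_head(token))
```

Both "an" and "Oct." resolve to index 4, which would be "review" if the paragraph tokenizes as In / an / Oct. / 19 / review.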

#### Lexical data for Vocab

- CLI `spacy init-model` populates a model's vocab from a **JSONL** file passed via the `--jsonl-loc` option
- first line defines the language and vocab settings
- every other line is a JSON object describing one lexeme
- the lexical attributes are set as attributes on `Lexeme`
- the command outputs a ready-to-use model with a `Vocab` containing the lexical data

**FIRST LINE (settings), THEN ONE LEXEME PER LINE**

```json
{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}
{"orth": ".", "id": 1, "lower": ".", "norm": ".", "shape": ".", "prefix": ".", "suffix": ".", "length": 1, "cluster": "8", "prob": -3.0678977966308594, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": ",", "id": 2, "lower": ",", "norm": ",", "shape": ",", "prefix": ",", "suffix": ",", "length": 1, "cluster": "4", "prob": -3.4549596309661865, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "the", "id": 3, "lower": "the", "norm": "the", "shape": "xxx", "prefix": "t", "suffix": "the", "length": 3, "cluster": "11", "prob": -3.528766632080078, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "I", "id": 4, "lower": "i", "norm": "I", "shape": "X", "prefix": "I", "suffix": "I", "length": 1, "cluster": "346", "prob": -3.791565179824829, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": false, "is_title": true, "is_upper": true, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "to", "id": 5, "lower": "to", "norm": "to", "shape": "xx", "prefix": "t", "suffix": "to", "length": 2, "cluster": "12", "prob": -3.8560216426849365, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
...
```
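
A file in this layout can be split into header and lexemes with the standard library. This is only a reading sketch: the sample lines are shortened copies of the ones above, and nothing here calls `spacy init-model`.

```python
import io
import json

# Shortened copies of the lines above; most attributes omitted for brevity.
sample = io.StringIO(
    '{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}\n'
    '{"orth": ".", "id": 1, "prob": -3.0678977966308594, "is_punct": true}\n'
    '{"orth": "the", "id": 3, "prob": -3.528766632080078, "is_punct": false}\n'
)

# First line: language and settings. Remaining lines: one lexeme each.
header = json.loads(sample.readline())
lexemes = [json.loads(line) for line in sample if line.strip()]

print(header["lang"], header["settings"]["oov_prob"])
print([lex["orth"] for lex in lexemes if not lex["is_punct"]])
```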



### Other Classes

- `Vocab`	
    - A lookup table for the vocabulary that allows you to access Lexeme objects.
- `StringStore`	
    - Map strings to and from hash values.
- `Vectors`	
    - Container class for vector data keyed by string.
- `GoldParse`	
    - Collection for training annotations.
- `GoldCorpus`	
    - An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER.

In [None]:
# Example Text Pipeline
from spacy.lang.en import English
nlp = English()
tokens = nlp(u"Some\nspaces  and\ttab characters")
tokens_text = [t.text for t in tokens]
assert tokens_text == ["Some", "\n", "spaces", " ", "and", "\t", "tab", "characters"]

In [None]:
# spacy.explain returns the description for the string representation of a POS tag.
import spacy
spacy.explain("RB")

'adverb'

In [None]:
# It also works for dependency labels.
spacy.explain("prt")

'particle'

In [None]:
# Named-entity label
spacy.explain("LANGUAGE")

'Any named language'

## Vocab

In [17]:
from spacy.vocab import Vocab

print(Vocab.__doc__)

A look-up table that allows you to access `Lexeme` objects. The `Vocab`
    instance also provides access to the `StringStore`, and owns underlying
    C-data that is shared between `Doc` objects.

    DOCS: https://spacy.io/api/vocab
    


In [23]:
vocab = Vocab(strings=['Hello', 'world'])
vocab, vocab.__len__()

(<spacy.vocab.Vocab at 0x7f04bffbc7a0>, 2)

In [29]:
import spacy

In [30]:
nlp = spacy.load('eng_large')

In [33]:
nlp.vocab.vectors_length

300

In [34]:
nlp.vocab.get_vector('apple')

array([-3.6391e-01,  4.3771e-01, -2.0447e-01, -2.2889e-01, -1.4227e-01,
        2.7396e-01, -1.1435e-02, -1.8578e-01,  3.7361e-01,  7.5339e-01,
       -3.0591e-01,  2.3741e-02, -7.7876e-01, -1.3802e-01,  6.6992e-02,
       -6.4303e-02, -4.0024e-01,  1.5309e+00, -1.3897e-02, -1.5657e-01,
        2.5366e-01,  2.1610e-01, -3.2720e-01,  3.4974e-01, -6.4845e-02,
       -2.9501e-01, -6.3923e-01, -6.2017e-02,  2.4559e-01, -6.9334e-02,
       -3.9967e-01,  3.0925e-02,  4.9033e-01,  6.7524e-01,  1.9481e-01,
        5.1488e-01, -3.1149e-01, -7.9939e-02, -6.2096e-01, -5.3277e-03,
       -1.1264e-01,  8.3528e-02, -7.6947e-03, -1.0788e-01,  1.6628e-01,
        4.2273e-01, -1.9009e-01, -2.9035e-01,  4.5630e-02,  1.0120e-01,
       -4.0855e-01, -3.5000e-01, -3.6175e-01, -4.1396e-01,  5.9485e-01,
       -1.1524e+00,  3.2424e-02,  3.4364e-01, -1.9209e-01,  4.3255e-02,
        4.9227e-02, -5.4258e-01,  9.1275e-01,  2.9576e-01,  2.3658e-02,
       -6.8737e-01, -1.9503e-01, -1.1059e-01, -2.2567e-01,  2.41

In [35]:
nlp.vocab.to_disk('DELE_VOCAB')

In [43]:
!tree DELE_VOCAB/ -sh

DELE_VOCAB/
├── [9.0M]  key2row
├── [123M]  lexemes.bin
├── [1.6M]  lookups.bin
├── [ 22M]  strings.json
└── [784M]  vectors

0 directories, 5 files


In [45]:
dele_bytestrings = nlp.vocab.to_bytes()

In [46]:
dele_bytestrings.__class__

bytes

In [47]:
dele_bytestrings.__len__()

979847654

In [52]:
dele_bytestrings[:100], dele_bytestrings[-100:]

(b'\x84\xa7strings\xdb\x01\x169\xc3["\\"\\"","#","$","\'\'",",","-LRB-","-RRB-",".",":","ADD","AFX","BES","CC","CD","DT","EX"',
 b'elling\x91\xa5yodel\xa6zapped\x91\xa3zap\xa7zapping\x91\xa3zap\xa9zigzagged\x91\xa6zigzag\xaazigzagging\x91\xa6zigzag\xa6zipped\x91\xa3zip\xa7zipping\x91\xa3zip')

In [57]:
nlp.vocab.strings['apple']

8566208034543834098

In [61]:
nlp.vocab.lookups.tables

['lemma_lookup', 'lemma_rules', 'lemma_index', 'lemma_exc']

## StringStore 64-bit Hash

In [63]:
from spacy.strings import StringStore
StringStore.__doc__

'Look up strings by 64-bit hashes.\n\n    DOCS: https://spacy.io/api/stringstore\n    '

In [65]:
stringstore = StringStore(['hellow' ,'world'])

In [66]:
stringstore['hellow']

6030250719154556199

In [67]:
stringstore = StringStore(["apple", "orange"])
banana_hash = stringstore.add("banana")
assert len(stringstore) == 3
assert banana_hash == 2525716904149915114
assert stringstore[banana_hash] == "banana"
assert stringstore["banana"] == banana_hash

In [68]:
from spacy.strings import hash_string
assert hash_string("apple") == 8566208034543834098
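
The two-way lookup can be pictured with a small toy class. This is a sketch of the idea only: `ToyStringStore` and `hash64` are hypothetical, and the hash is a stand-in built from `hashlib`, so its integers will not match spaCy's values shown above.

```python
import hashlib


def hash64(string):
    # Stand-in 64-bit hash: first 8 bytes of SHA-256 (not spaCy's hash function).
    digest = hashlib.sha256(string.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "little")


class ToyStringStore:
    def __init__(self, strings=()):
        self._by_hash = {}
        for string in strings:
            self.add(string)

    def add(self, string):
        key = hash64(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        # str -> hash can be computed; hash -> str needs the stored table.
        if isinstance(key, str):
            return hash64(key)
        return self._by_hash[key]

    def __len__(self):
        return len(self._by_hash)


store = ToyStringStore(["apple", "orange"])
banana = store.add("banana")
print(len(store), store[banana], store["banana"] == banana)
```

The asymmetry is the point: a string can always be hashed, but a hash can only be turned back into a string if that string was added to the store.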

## Vectors

In [70]:
from spacy.vectors import Vectors
import numpy 

empty_vectors = Vectors(shape=(10000, 300))

data = numpy.zeros((3, 300), dtype='f')
keys = ["cat", "dog", "rat"]
vectors = Vectors(data=data, keys=keys)

In [73]:
empty_vectors.data

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [74]:
Vectors.__doc__

'Store, save and load word vectors.\n\n    Vectors data is kept in the vectors.data attribute, which should be an\n    instance of numpy.ndarray (for CPU vectors) or cupy.ndarray\n    (for GPU vectors). `vectors.key2row` is a dictionary mapping word hashes to\n    rows in the vectors.data table.\n\n    Multiple keys can be mapped to the same vector, and not all of the rows in\n    the table need to be assigned - so len(list(vectors.keys())) may be\n    greater or smaller than vectors.shape[0].\n\n    DOCS: https://spacy.io/api/vectors\n    '

In [82]:
vectors.shape

(3, 300)

In [83]:
cat_id = nlp.vocab.strings["cat"]
cat_vector = nlp.vocab.vectors[cat_id]
# Vectors are NumPy arrays, so `==` compares element-wise; compare the whole arrays instead.
assert numpy.array_equal(cat_vector, nlp.vocab["cat"].vector)

In [86]:
for key in nlp.vocab.vectors.keys():
    print(key, nlp.vocab.strings[key])
    break

3424551750583975941 croup


## Lookups and Table

## KB

In [87]:
from spacy.kb import KnowledgeBase
print(KnowledgeBase.__doc__)

A `KnowledgeBase` instance stores unique identifiers for entities and their textual aliases,
    to support entity linking of named entities to real-world concepts.

    DOCS: https://spacy.io/api/kb
    


In [88]:
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)

In [91]:
kb.add_entity.__doc__

'\n        Add an entity to the KB, optionally specifying its log probability based on corpus frequency\n        Return the hash of the entity ID/name at the end.\n        '

```python
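# vector1 and vector2 stand for entity vectors of length 64 (entity_vector_length);
# they are not defined in this notebook.
kb.add_entity(entity="Q42", freq=32, entity_vector=vector1)
kb.add_entity(entity="Q463035", freq=111, entity_vector=vector2)
# set_entities registers the same entities in a single call
kb.set_entities(entity_list=["Q42", "Q463035"], freq_list=[32, 111], vector_list=[vector1, vector2])
kb.add_alias(alias="Douglas", entities=["Q42", "Q463035"], probabilities=[0.6, 0.3])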
```

## GoldParse

## nlp.entity

In [96]:
nlp.entity.cfg

{'beam_width': 1,
 'beam_density': 0.0,
 'beam_update_prob': 1.0,
 'cnn_maxout_pieces': 3,
 'deprecation_fixes': {'vectors_name': 'en_core_web_lg.vectors'},
 'nr_class': 74,
 'hidden_depth': 1,
 'token_vector_width': 96,
 'hidden_width': 64,
 'maxout_pieces': 2,
 'pretrained_vectors': 'en_core_web_lg.vectors',
 'bilstm_depth': 0}

<thinc.neural.optimizers.Optimizer at 0x7f04c00669d0>

## CLI API

### Download

- **Model**
    - installed as Python packages, like any other module
    - Chinese (no model yet; depends on Jieba)
    - en_core_web_sm/md/lg
    - en_vectors_web_lg (631 MB, 1,070,971 unique 300-dimensional vectors)

> **It's not recommended to use this command as part of an automated process. If you know which model your project needs, you should consider a direct download via pip, or uploading the model to a local PyPI installation and fetching it straight from there. This will also allow you to add it as a versioned package dependency to your project.**

```bash
python -m spacy download [model] [--direct]
```
> **As of v2.0, spaCy expects all shortcut links to be loadable model packages. If you want to load a data directory, call spacy.load() or Language.from_disk() with the path, or use the package command to create a model package.**




### Info & Validate

```bash
python -m spacy info [model] [--markdown]
```

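- `validate` finds all installed models (packages and shortcut links) and checks their compatibility with the installed spaCy version
- run it after `pip install -U spacy` to make sure the installed models still work
- useful for detecting out-of-sync shortcut links
- usable in a production build process (exits with status 1 on error)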

```bash
python -m spacy validate
```


### Convert

- convert files into spaCy's JSON for `train`

```bash
python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
[--n-sents] [--morphology] [--lang]
```

- output defaults to the `jsonl` format
- `--converter` options
    - `auto`: pick the converter automatically based on the file extension
    - `conll, conllu, conllubio`: Universal Dependencies
    - `ner`: tab-based NER
    - `iob`: IOB or IOB2 NER
    


### Train

- input is JSON-formatted training data
- after each epoch, a model is saved to the output directory
- **accuracy scores and model details are added to `meta.json` so the model can be packaged with the `package` command**
- `--pipeline tagger,parser` trains only the tagger and the parser

```bash
python -m spacy train [lang] [output_path] [train_path] [dev_path]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-examples] [--use-gpu]
[--version] [--meta-path] [--init-tok2vec] [--parser-multitasks]
[--entity-multitasks] [--gold-preproc] [--noise-level] [--learn-tokens]
[--verbose]
```

**DETAIL OPTIONS**

`output_path`	positional
- Directory to store model in. Will be created if it doesn't exist.

`train_path`	positional
- Location of JSON-formatted training data. Can be a file or a directory of files.

`dev_path`	positional
- Location of JSON-formatted development data for evaluation. Can be a file or a directory of files.

`--base-model, -b`	option	
- Optional name of base model to update. Can be any loadable spaCy model.

`--pipeline, -p`	option	
- Comma-separated names of pipeline components to train. Defaults to 'tagger,parser,ner'.

`--vectors, -v`	option	
- Model to load vectors from.

`--n-iter, -n`	option	
- Number of iterations (default: 30).

`--n-examples, -ns`	option	
- Number of examples to use (defaults to 0 for all examples).

`--use-gpu, -g`	option	
- Whether to use GPU. Can be either 0, 1 or -1.

`--version, -V`	option	
- Model version. Will be written out to the model's meta.json after training.

`--meta-path, -m`	option	
- Optional path to model meta.json. All relevant properties like lang, pipeline and spacy_version will be overwritten.

`--init-tok2vec, -t2v` option	
- Path to pretrained weights for the token-to-vector parts of the models. See spacy pretrain. Experimental.

`--parser-multitasks, -pt`	option	
- Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'

`--entity-multitasks, -et`	option	
- Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'

`--noise-level, -nl`	option	
- Float indicating the amount of corruption for data augmentation.

`--gold-preproc, -G`	flag	
- Use gold preprocessing.

`--learn-tokens, -T`	flag
- Make parser learn gold-standard tokenization by merging subtokens. Typically used for languages like Chinese.

`--verbose, -VV` flag	
- Show more detailed messages during training.

`--help, -h`	flag	
- Show help message and available arguments.

**CUSTOM ENV VARIABLE**

```bash
token_vector_width=256 learn_rate=0.0001 spacy train [...]

alias train-parser="python -m spacy train en /output /data /train /dev -n 1000"
token_vector_width=256 train-parser
```

`dropout_from`
- Initial dropout rate. Default: `0.2`

`dropout_to`
- Final dropout rate. Default: `0.2`

`dropout_decay`
- Rate of dropout change. Default: `0.0`

`batch_from`
- Initial batch size. Default: `1`

`batch_to`
- Final batch size. Default: `64`

`batch_compound`
- Rate of batch size acceleration. Default: `1.001`

`token_vector_width`
- Width of embedding tables and convolutional layers. Default: `128`

`embed_size`
- Number of rows in embedding tables. Default: `7500`

`hidden_width`
- Size of the parser's and NER's hidden layers. Default: `128`

`learn_rate`
- Learning rate. Default: `0.001`

`optimizer_B1`
- Momentum for the Adam solver. Default: `0.9`

`optimizer_B2`
- Adagrad-momentum for the Adam solver. Default: `0.999`

`optimizer_eps`
- Epsilon value for the Adam solver. Default: `1e-08`

`L2_penalty`
- L2 regularization penalty. Default: `1e-06`

`grad_norm_clip`
- Gradient L2 norm constraint. Default: `1.0`
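
To get a feel for how `batch_from`, `batch_to` and `batch_compound` interact, here is a minimal sketch. It assumes the schedule is geometric: start at `batch_from`, multiply by `batch_compound` after every batch, and never exceed `batch_to`. The function below is an independent stand-in written for these notes (spaCy ships a similar helper, `spacy.util.compounding`).

```python
import itertools


def compounding(start, stop, compound):
    """Yield an infinite series that grows by a constant factor and is capped at `stop`."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


# Defaults from the list above: batch_from=1, batch_to=64, batch_compound=1.001
sizes = compounding(1.0, 64.0, 1.001)
first_three = list(itertools.islice(sizes, 3))
print(first_three)

# A larger factor reaches the cap quickly and then stays there.
fast = list(itertools.islice(compounding(1.0, 64.0, 2.0), 10))
print(fast)
```

With a factor as small as 1.001 the batch size grows very slowly, so a short training run spends most of its time on small batches.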




### Pretrain

- pretrains the `tok2vec` layer of a pipeline component
- uses an approximate language-modelling objective (LMAO)
    - load pre-trained vectors
    - train a component such as a CNN or BiLSTM
    - so that it predicts vectors matching the pre-trained ones
    - weights are saved after each epoch
    - pass the path to one of these weights files to `spacy train`
- especially helpful when little labelled data is available
- still experimental; results vary
- when passing the weights to `spacy train`, all settings must be identical to those used for pretraining

```bash
python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width]
[--depth] [--embed-rows] [--dropout] [--seed] [--n-iter] [--use-vectors]
```

`texts_loc`	positional
- Path to JSONL file with raw texts to learn from, with text provided as the key "text".

`vectors_model`	positional
- Name or path to spaCy model with vectors to learn from.

`output_dir`	positional
- Directory to write models to on each epoch.

`--width, -cw`	option
- Width of CNN layers.

`--depth, -cd`	option
- Depth of CNN layers.

`--embed-rows, -er`	option
- Number of embedding rows.

`--dropout, -d`	option
- Dropout rate.

`--batch-size, -bs`	option
- Number of words per training batch.

`--max-length, -xw`	option
- Maximum words per example. Longer examples are discarded.

`--min-length, -nw`	option
- Minimum words per example. Shorter examples are discarded.

`--seed, -s`	option
- Seed for random number generators.

`--n-iter, -i`	option
- Number of iterations to pretrain.

`--use-vectors, -uv`	flag
- Whether to use the static vectors as input features.

**JSONL format raw text**

   > raw text can be provided as a .jsonl (newline-delimited JSON) file containing one input text per line (roughly paragraph length is good). Optionally, custom tokenization can be provided.<br>
    > Our utility library `srsly` provides a handy `write_jsonl` helper that takes a file path and list of dictionaries and writes out JSONL-formatted data.

```python
import srsly
data = [{"text": "Some text"}, {"text": "More..."}]
srsly.write_jsonl("/path/to/text.jsonl", data)
``` 
<br>

- Example

```python
{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}
```
<br>
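
If `srsly` is not at hand, the same JSONL layout can be written and read back with the standard library alone. This is a minimal sketch: the temporary file location is arbitrary, and the check for the `"text"` key simply mirrors the requirement stated above.

```python
import json
import os
import tempfile

records = [{"text": "Some text"}, {"text": "More..."}]

# Write one JSON object per line (the JSONL layout described above).
path = os.path.join(tempfile.mkdtemp(), "text.jsonl")
with open(path, "w", encoding="utf8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back, skipping blank lines and checking the required "text" key.
loaded = []
with open(path, encoding="utf8") as f:
    for line in f:
        if line.strip():
            entry = json.loads(line)
            assert "text" in entry, "each line needs a 'text' key"
            loaded.append(entry)

print(loaded)
```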




### INIT-Model

- **Converting word vectors for use in spaCy**
- create a new model directory from raw data such as word frequencies, Brown clusters and word vectors
- similar to the `spacy model` command in spaCy v1.x
- output = model containing vocab and vectors

> As of v2.1.0, the --freqs-loc and --clusters-loc are deprecated and have been replaced with the --jsonl-loc argument, which lets you pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line. For more details on the format, see the annotation specs.

**EXAMPLE Using other Vectors**

```shell
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
```

```bash
python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]
[--prune-vectors]
```

```python
nlp_latin = spacy.load("/tmp/la_vectors_wiki_lg")
doc1 = nlp_latin(u"Caecilius est in horto")
doc2 = nlp_latin(u"servus est in atrio")
doc1.similarity(doc2)
```

`lang`
- positional	Model language ISO code, e.g. en.

`output_dir`
- positional	Model output directory. Will be created if it doesn't exist.

`--jsonl-loc, -j`
- option	Optional location of JSONL-formatted vocabulary file with lexical attributes.

`--vectors-loc, -v`
- option	Optional location of vectors file. Should be a tab-separated file in Word2Vec format where the first column contains the word and the remaining columns the values. File can be provided in .txt format or as a zipped text file in .zip or .tar.gz format.

`--prune-vectors, -V`
- option	Number of vectors to prune the vocabulary to. Defaults to -1 for no pruning.

**Optimizing vector coverage**

- To help you strike a good balance between coverage and memory usage, spaCy's Vectors class lets you map multiple keys to the same row of the table. If you're using the spacy init-model command to create a vocabulary, pruning the vectors will be taken care of automatically if you set the --prune-vectors flag. You can also do it manually in the following steps:
  1. Start with a word vectors model that covers a huge vocabulary. For instance, the en_vectors_web_lg model provides 300-dimensional GloVe vectors for over 1 million terms of English.
  2. If your vocabulary has values set for the Lexeme.prob attribute, the lexemes will be sorted by descending probability to determine which vectors to prune. Otherwise, lexemes will be sorted by their order in the Vocab.
  3. Call Vocab.prune_vectors with the number of vectors you want to keep.
  
  ```python
  nlp = spacy.load('en_vectors_web_lg')
  n_vectors = 105000  # number of vectors to keep
  removed_words = nlp.vocab.prune_vectors(n_vectors)

  assert len(nlp.vocab.vectors) <= n_vectors  # unique vectors have been pruned
  assert nlp.vocab.vectors.n_keys > n_vectors  # but not the total entries
  ```
  > **Vocab.prune_vectors reduces the current vector table to a given number of unique entries, and returns a dictionary containing the removed words, mapped to (string, score) tuples, where string is the entry the removed word was mapped to, and score the similarity score between the two words.**
  
  - For instance, the returned dictionary may show that the vector for “Shore” was removed and remapped to the vector of “coast”, which is deemed about 73% similar, while “Leaving” was remapped to the vector of “leaving”, which is identical.
  - If you're using the init-model command, you can set the --prune-vectors option to easily reduce the size of the vectors as you add them to a spaCy model:
  
  ```shell
  python -m spacy init-model la /tmp/la_vectors_web_md --vectors-loc la.300d.vec.tgz --prune-vectors 10000
  ```
  
  - This will create a spaCy model with vectors for the first 10,000 words in the vectors model. All other words in the vectors model are mapped to the closest vector among those retained.
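
The remapping idea behind pruning can be sketched without spaCy. The function below is a simplified illustration and not `Vocab.prune_vectors` itself: it assumes the table is already ordered by priority, keeps the first `n_keep` rows, and points every other key at the most similar kept row by cosine similarity. The toy 2-dimensional vectors are invented for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune(table, n_keep):
    """Keep the first n_keep rows; remap every other key to its most similar kept key.

    `table` maps word -> vector and is assumed to be ordered by priority.
    Returns (key2row, rows, removed), where removed maps word -> (kept_word, score).
    """
    words = list(table)
    kept = words[:n_keep]
    rows = [table[w] for w in kept]
    key2row = {w: i for i, w in enumerate(kept)}
    removed = {}
    for word in words[n_keep:]:
        best = max(kept, key=lambda k: cosine(table[word], table[k]))
        key2row[word] = key2row[best]  # several keys now share one row
        removed[word] = (best, cosine(table[word], table[best]))
    return key2row, rows, removed


toy = {
    "coast": [1.0, 0.1],
    "leaving": [0.0, 1.0],
    "shore": [0.9, 0.2],
    "Leaving": [0.0, 1.0],
}
key2row, rows, removed = prune(toy, n_keep=2)
print(removed)
```

After pruning there are fewer rows than keys, which is the situation the two assertions on `len(nlp.vocab.vectors)` and `n_keys` above check for.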
  
**Loading GloVe vectors**

- spaCy comes with built-in support for loading GloVe vectors from a directory. The Vectors.from_glove method assumes a binary format, the vocab provided in a vocab.txt, and the naming scheme of vectors.{size}.[fd].bin. For example:

```python
nlp = spacy.load("en_core_web_sm")
nlp.vocab.vectors.from_glove("/path/to/vectors")
```

- If your instance of Language already contains vectors, they will be overwritten. To create your own GloVe vectors model package like spaCy's en_vectors_web_lg, you can call nlp.to_disk, and then package the model using the package command.

**Storing Vectors on GPU (Chainer or PyTorch) https://spacy.io/usage/vectors-similarity#gpu**




### Evaluate

- evaluates a model's accuracy and speed on JSON-annotated data
- prints the results and can export displaCy visualizations of the parses

```bash
python -m spacy evaluate [model] [data_path] [--displacy-path] [--displacy-limit]
[--gpu-id] [--gold-preproc]
```

`model`
- positional	Model to evaluate. Can be a package or shortcut link name, or a path to a model data directory.

`data_path`
- positional	Location of JSON-formatted evaluation data.

`--displacy-path, -dp`
- option	Directory to output rendered parses as HTML. If not set, no visualizations will be generated.

`--displacy-limit, -dl`
- option	Number of parses to generate per file. Defaults to 25. Keep in mind that a significantly higher number might cause the .html files to render slowly.

`--gpu-id, -g`
- option	GPU to use, if any. Defaults to -1 for CPU.

`--gold-preproc, -G`
- flag	Use gold preprocessing.




### Package

- generates a model package from an existing model data directory
- if a path to a `meta.json` is given, it is used
- otherwise the meta data is entered via the command line
- run `python setup.py sdist` from the newly created directory to turn the model into an installable archive file

```bash
python -m spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] [--force]
```

**Example**

```bash
python -m spacy package /input /output
cd /output/en_model-0.0.0
python setup.py sdist
pip install dist/en_model-0.0.0.tar.gz
```

# Custom Pipelines and Extensions

In [None]:
import spacy
from spacy.tokens import Doc

In [None]:
Doc.set_extension('is_greeting', default=False)
nlp = spacy.load('en')
doc = nlp(u'hello world')
doc._.doc_extensions

# The ._ namespace keeps custom attributes separate from built-ins, so code does not break when spaCy adds new attributes
doc._.is_greeting = True 

In [None]:
# Customise Processing Pipeline (same nlp() as above)

component = MyComponent() # See below for INIT

nlp.add_pipe(component, after='tagger')

doc = nlp(u'This is a sentence')

**The nlp object is an instance of Language, which contains the data and annotation scheme of the language you're using and a pre-defined pipeline of components, like the tagger, parser and entity recognizer. If you're loading a model, the Language instance also has access to the model's binary data. All of this is specific to each model, and defined in the model's meta.json – for example, a Spanish NER model requires different weights, language data and pipeline components than an English parsing and tagging model. This is also why the pipeline state is always held by the Language class. spacy.load() puts this all together and returns an instance of Language with a pipeline set and access to the binary data.**

```python
doc = nlp.make_doc(u'This is a sentence')   # create a Doc from raw text
for name, proc in nlp.pipeline:             # iterate over components in order
    doc = proc(doc)                         # call each component on the Doc
```

**In spaCy 2.0 the pipeline is simply a list of (name, function) tuples**

```python
nlp.pipeline
[('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>),
 ('ner', <spacy.pipeline.EntityRecognizer>)]
```

To make it more convenient to modify the pipeline, there are several built-in methods to get, add, replace, rename or remove individual components. spaCy's default pipeline components, like the tagger, parser and entity recognizer now all follow the same, consistent API and are subclasses of `Pipe`. If you're developing your own component, using the Pipe API will make it fully trainable and serializable. At a minimum, a component needs to be a callable that takes a Doc and returns it:

```python
def my_component(doc):
    print("The doc is {} characters long and has {} tokens."
          .format(len(doc.text), len(doc)))
    return doc
```

The component can then be added at any position of the pipeline using the `nlp.add_pipe()` method. The arguments `before, after, first, and last` let you specify component names to insert the new component before or after, or tell spaCy to insert it first (i.e. directly after tokenization) or last in the pipeline.

```python
nlp = spacy.load('en')
nlp.add_pipe(my_component, name='print_length', last=True)
doc = nlp(u"This is a sentence.")
```
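
The positioning arguments are easier to reason about when the pipeline is treated as what it is, a plain list of `(name, callable)` tuples. The sketch below is a toy re-implementation for illustration only, not spaCy's `add_pipe`; the "doc" here is just a list that records which components ran.

```python
def add_pipe(pipeline, component, name, before=None, after=None, first=False, last=False):
    """Insert (name, component) into a list of (name, callable) tuples."""
    if sum(bool(x) for x in (before, after, first, last)) > 1:
        raise ValueError("use only one of before, after, first, last")
    names = [n for n, _ in pipeline]
    if first:
        index = 0
    elif before is not None:
        index = names.index(before)
    elif after is not None:
        index = names.index(after) + 1
    else:  # the default and last=True both append
        index = len(pipeline)
    pipeline.insert(index, (name, component))


def run(pipeline, doc):
    for _, proc in pipeline:  # same loop as the nlp.pipeline example above
        doc = proc(doc)
    return doc


def make(label):
    """Build a toy component that appends its label to the doc."""
    def component(doc):
        doc.append(label)
        return doc
    return component


pipeline = [("tagger", make("tagger")), ("parser", make("parser")), ("ner", make("ner"))]
add_pipe(pipeline, make("print_length"), "print_length", after="tagger")
order = run(pipeline, [])
print(order)
```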

**Extension attributes on Doc, Token and Span**

When you implement your own pipeline components that modify the `Doc`, you often want to extend the API, so that the information you're adding is conveniently accessible. spaCy v2.0 introduces a new mechanism that lets you register your own attributes, properties and methods that become available in the `._` namespace, for example, `doc._.my_attr`.

**Why ._?**
Writing to a ._ attribute instead of to the Doc directly keeps a clearer separation and makes it easier to ensure backwards compatibility. For example, if you've implemented your own .coref property and spaCy claims it one day, it'll break your code. Similarly, just by looking at the code, you'll immediately know what's built-in and what's custom – for example, doc.sentiment is spaCy, while doc._.sent_score isn't.

There are three main types of extensions that can be registered via the `set_extension()` method:

1. Attribute extensions. Set a default value for an attribute, which can be overwritten.
2. Property extensions. Define a `getter` and an optional `setter` function.
3. Method extensions. Assign a function that becomes available as an object method.

```python
Doc.set_extension('hello_attr', default=True)
Doc.set_extension('hello_property', getter=get_value, setter=set_value)
Doc.set_extension('hello_method', method=lambda doc, name: 'Hi {}!'.format(name))

doc._.hello_attr            # True
doc._.hello_property        # return value of get_value
doc._.hello_method('Ines')  # 'Hi Ines!'
```

**WHY Extensions?**

Being able to easily write custom data to the `Doc, Token and Span` means that applications using spaCy can take full advantage of the built-in data structures and the benefits of Doc objects as the **single source of truth** containing all information:

- No information is lost during tokenization and parsing, so you can always relate annotations to the original string.
- The Token and Span are views of the Doc, so they're always up-to-date and consistent.
- Efficient C-level access is available to the underlying TokenC* array via doc.c.
- APIs can standardise on passing around Doc objects, reading and writing from them whenever necessary. Fewer signatures makes functions more reusable and composable.

**TODO - learn from these examples of custom components**

- https://explosion.ai/blog/spacy-v2-pipelines-extensions
- https://github.com/explosion/spaCy/blob/develop/examples/pipeline/custom_component_countries_api.py
- https://github.com/explosion/spaCy/blob/develop/examples/pipeline/custom_component_entities.py
- https://github.com/explosion/spaCy/blob/develop/examples/pipeline/custom_attr_methods.py
- https://github.com/explosion/spaCy/blob/develop/examples/pipeline/custom_sentence_segmentation.py
- https://github.com/explosion/spaCy/blob/develop/examples/pipeline/fix_space_entities.py
- https://github.com/explosion/spaCy/blob/develop/examples/pipeline/multi_processing.py

# Serialisation & Packaging

## NER Model

- A custom `tokenizer` is serialised by spaCy itself (as JSON)
- A **custom component** is not; the best way to ship it is to wrap the model as a Python package
- Running `spacy package` on the saved model directory creates all files needed for packaging, including an `__init__.py` with a `load()` function that calls `spacy.load`, so the package can be used like this:
    ```python
    import en_core_web_sm
    nlp = en_core_web_sm.load()
    ```
- **QUICK-AND-DIRTY** way: add all custom code to `__init__.py` and register `CustomEntityRecognizer` in the global factories:
    ```python
    from spacy.language import Language
    # add custom NER to global factories
    Language.factories['CustomEntityRecognizer'] = CustomEntityRecognizer
    ```
- **Everything** the model needs must be available from within the package; declare additional dependencies in `setup.py` and include any extra module files
- Once done, run `python setup.py sdist` to build the package; this adds a `.tar.gz` archive to `/dist` that can later be installed with `pip install`

#### More Elegantly

**ADD/SERIALISATION**
- by adding `to_disk, from_disk, to_bytes, from_bytes` methods to the component
- `nlp.from_disk` iterates over the pipeline and checks each component for these methods:
    ```python
    for pipe_name, proc in nlp.pipeline:
        if hasattr(proc, 'from_disk'):
            proc.from_disk(model_path/pipe_name)
    ```
- the component's data is then loaded from `/path/to/model/custom_ner/`:

    ```python
    import json

    class CustomEntityRecognizer:
        name = 'custom_ner'
        def __init__(self, nlp):
            self.vocab = nlp.vocab
            self.some_data = None
        def __call__(self, spacy_doc, *args, **kwargs):
            # predict_single is the author's own prediction helper (not defined in these notes)
            return predict_single(spacy_doc)
        def from_disk(self, path, **kwargs):
            # load everything the component needs from its own directory
            data_path = path/'some_data.json'
            with data_path.open() as f:
                self.some_data = json.load(f)
            return self
    ```

**TL;DR**

1. Save out model with custom tokenizer only
2. `spacy package` in saved dir, edit meta etc
3. Edit `__init__.py` include custom component, custom NER model etc and add entry to global factories
4. `python setup.py sdist` within package dir to build package
5. Install `.tar.gz` model created in `/dist` 


### Factories via ENTRY POINT

```python
# SERIALIZE

bytes_data = nlp.to_bytes()
lang = nlp.meta["lang"]  # "en"
pipeline = nlp.meta["pipeline"]  # ["tagger", "parser", "ner"]

# DESERIALIZE

nlp = spacy.blank(lang)
for pipe_name in pipeline:
    pipe = nlp.create_pipe(pipe_name)
    nlp.add_pipe(pipe)
nlp.from_bytes(bytes_data)
```

```python
# SPACY.LOAD UNDER THE HOOD

lang = "en"
pipeline = ["tagger", "parser", "ner"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"

cls = spacy.util.get_lang_class(lang)   # 1. Get Language class, e.g. English
nlp = cls()                             # 2. Initialize it
for name in pipeline:
    component = nlp.create_pipe(name)   # 3. Create the pipeline components
    nlp.add_pipe(component)             # 4. Add the component to the pipeline
nlp.from_disk(data_path)                # 5. Load in the binary data


#THE PIPELINE UNDER THE HOOD

doc = nlp.make_doc(u"This is a sentence")   # create a Doc from raw text
for name, proc in nlp.pipeline:             # iterate over components in order
    doc = proc(doc)                         # apply each component
    
```

**Using Pickle**

- When pickling a `Doc` or an `EntityRecognizer`, keep in mind that each of them requires the shared `vocab` (including the string-to-hash mappings, label schemes and optional vectors), so pickling them one by one quickly becomes very large

```python
# PICKLING OBJECTS WITH SHARED DATA
import pickle

doc1 = nlp(u"Hello world")
doc2 = nlp(u"This is a test")

doc1_data = pickle.dumps(doc1)
doc2_data = pickle.dumps(doc2)
print(len(doc1_data) + len(doc2_data))  # 6636116 😞

doc_data = pickle.dumps([doc1, doc2])
print(len(doc_data))  # 3319761 😃
```
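
The size difference comes from how `pickle` handles shared references: within a single `dumps` call an object that is referenced several times is written only once. A spaCy-free sketch of the same effect, with plain dicts standing in for `Doc` objects and a long list standing in for the shared vocab (both are stand-ins invented for this illustration):

```python
import pickle

# Stand-in for the large shared vocab (strings, vectors, ...).
vocab = ["word%d" % i for i in range(50000)]

# Each toy "doc" holds a reference to the same vocab object.
doc1 = {"vocab": vocab, "words": ["Hello", "world"]}
doc2 = {"vocab": vocab, "words": ["This", "is", "a", "test"]}

separate = len(pickle.dumps(doc1)) + len(pickle.dumps(doc2))
together = len(pickle.dumps([doc1, doc2]))
print(separate, together)

# After unpickling together, both docs still share one vocab object.
restored1, restored2 = pickle.loads(pickle.dumps([doc1, doc2]))
```

Pickling the two objects separately serialises the vocab twice; pickling them in one list serialises it once and keeps the sharing intact on load.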

### Example
- The `EntityRuler` component saves its patterns as a `.jsonl` file when the pipeline is written with `to_disk`, and as a bytestring when it is serialised with `to_bytes`; this lets you save a model with rule-based NER and include all rules WITH the model data

```python
import json

class CustomComponent(object):
    name = "my_component"

    def __init__(self):
        self.data = []

    def __call__(self, doc):
        # Do something to the doc here
        return doc

    def add(self, data):
        # Add something to the component's data
        self.data.append(data)

    def to_disk(self, path):
        # This will receive the directory path + /my_component
        data_path = path / "data.json"
        with data_path.open("w", encoding="utf8") as f:
            f.write(json.dumps(self.data))

    def from_disk(self, path, **cfg):
        # This will receive the directory path + /my_component
        data_path = path / "data.json"
        with data_path.open("r", encoding="utf8") as f:
            self.data = json.load(f)  # json.load reads from a file object
        return self
```

- After adding the component to the pipeline and adding some data to it, you can serialise the `nlp` object to a directory
- this calls the custom component's `to_disk` method

```python
nlp = spacy.load("en_core_web_sm")
my_component = CustomComponent()
my_component.add({"hello": "world"})
nlp.add_pipe(my_component)
nlp.to_disk("/path/to/model")
```

```bash
DIRECTORY STRUCTURE
└── /path/to/model
    ├── my_component     # data serialized by "my_component"
    |   └── data.json
    ├── ner              # data for "ner" component
    ├── parser           # data for "parser" component
    ├── tagger           # data for "tagger" component
    ├── vocab            # model vocabulary
    ├── meta.json        # model meta.json with name, language and pipeline
    └── tokenizer        # tokenization rules
```
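
The save and load round trip can be tried without a spaCy model. In this sketch `save_pipeline` and `load_pipeline` are hypothetical helpers that imitate the loop over `nlp.pipeline` shown earlier, and the component creates its own sub-directory, which is an assumption of the sketch rather than a statement about what spaCy does.

```python
import json
import tempfile
from pathlib import Path


class CustomComponent(object):
    name = "my_component"

    def __init__(self):
        self.data = []

    def to_disk(self, path):
        # Assumption of this sketch: the component creates its own directory.
        path.mkdir(parents=True, exist_ok=True)
        with (path / "data.json").open("w", encoding="utf8") as f:
            json.dump(self.data, f)

    def from_disk(self, path, **cfg):
        with (path / "data.json").open("r", encoding="utf8") as f:
            self.data = json.load(f)
        return self


def save_pipeline(pipeline, model_path):
    for pipe_name, proc in pipeline:
        if hasattr(proc, "to_disk"):
            proc.to_disk(model_path / pipe_name)


def load_pipeline(pipeline, model_path):
    for pipe_name, proc in pipeline:
        if hasattr(proc, "from_disk"):
            proc.from_disk(model_path / pipe_name)


model_path = Path(tempfile.mkdtemp()) / "model"
saved = CustomComponent()
saved.data.append({"hello": "world"})
save_pipeline([(saved.name, saved)], model_path)

fresh = CustomComponent()
load_pipeline([(fresh.name, fresh)], model_path)
print(fresh.data)
```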

**NOTE on loading components**
- on loading, spaCy reads `meta.json` and looks up each component name in the internal factories
- to make sure spaCy can initialise `my_component`, register a factory for it:

```python
from spacy.language import Language
Language.factories["my_component"] = lambda nlp, **cfg: CustomComponent()
```

#### ENTRY POINT

- specifically, `nlp.create_pipe` looks the component name up in the **factories**
- You must write to `Language.factories` **BEFORE** loading the model

```python
pipe = nlp.create_pipe("custom_component")  # fails 👎

Language.factories["custom_component"] = CustomComponentFactory
pipe = nlp.create_pipe("custom_component")  # works 👍
```
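
Why the order matters becomes obvious once the factories are seen as a plain name-to-callable dictionary. The sketch below is a toy registry, not spaCy's code: `factories`, `create_pipe` and `DemoComponent` are hypothetical stand-ins for `Language.factories`, `nlp.create_pipe` and a custom component.

```python
factories = {}  # stand-in for Language.factories: name -> callable(nlp, **cfg)


def create_pipe(name, nlp=None, **cfg):
    if name not in factories:
        raise KeyError("Can't find factory for '{}'".format(name))
    return factories[name](nlp, **cfg)


class DemoComponent(object):
    def __init__(self, nlp, **cfg):
        self.nlp = nlp
        self.cfg = cfg

    def __call__(self, doc):
        return doc


try:
    create_pipe("custom_component")
    failed = False
except KeyError:
    failed = True  # not registered yet, so the lookup fails

factories["custom_component"] = DemoComponent
pipe = create_pipe("custom_component", style="basic")
print(failed, type(pipe).__name__)
```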

- this is messy and often requires initialisation code to be shipped with the model
- Using **ENTRY POINTS**, model packages and extension packages can define their own `spacy_factories`, which spaCy adds and initialises
- this happens automatically when a package in the same environment exposes spaCy entry points - **SNEK example**

```bash
PACKAGE DIRECTORY STRUCTURE
├── snek.py   # the extension code
└── setup.py  # setup file for pip installation
```

```python
# SNEK.PY
snek = """
    --..,_                     _,.--.
       `'.'.                .'`__ o  `;__.
          '.'.            .'.'`  '---'`  `
            '.`'--....--'`.'
              `'--....--'`
"""

class SnekFactory(object):
    def __init__(self, nlp, **cfg):
        self.nlp = nlp

    def __call__(self, doc):
        print(snek)
        return doc
```

- to add an entry to the factories, expose it in `setup.py` via `entry_points`:

```python
# SETUP.PY
from setuptools import setup

setup(
    name="snek",
    entry_points={
        "spacy_factories": [
            "snek = snek:SnekFactory"
         ]
    }
)
```

- The entry point tells spaCy that the name `snek` can be found in the module `snek` (i.e. `snek.py`) as `SnekFactory`
- the same package can expose multiple entry points
- to make them available to spaCy, install the package, e.g. via `python setup.py develop`
- now, from spaCy:

```python
>>> from spacy.lang.en import English
>>> nlp = English()
>>> snek = nlp.create_pipe("snek")  # this now works! 🐍🎉
>>> nlp.add_pipe(snek)
>>> doc = nlp(u"I am snek")
    --..,_                     _,.--.
       `'.'.                .'`__ o  `;__.
          '.'.            .'.'`  '---'`  `
            '.`'--....--'`.'
              `'--....--'`
```
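
To check which factories the installed packages in the current environment actually expose, the entry point group can be listed with the standard library. This is one possible way to inspect it, assuming Python 3.8 or later; how spaCy performs the lookup internally is not covered in these notes, and `find_entry_points` is a hypothetical helper name.

```python
from importlib.metadata import entry_points


def find_entry_points(group):
    """Return {name: EntryPoint} for every installed entry point in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):   # Python 3.10+
        selected = eps.select(group=group)
    else:                        # Python 3.8 / 3.9: dict of group -> entry points
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


found = find_entry_points("spacy_factories")
for name, ep in found.items():
    print(name, "->", ep.value)  # e.g. the "module:object" string from setup.py
```

Calling `ep.load()` on an entry would import the module and return the object it points to; the dictionary is simply empty if no installed package exposes the group.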

**ADVANCED COMPONENTS WITH SETTINGS `**cfg`**

```python
nlp = spacy.load("en_core_snek_sm", snek_style="cute")

# how
SNEKS = {"basic": snek, "cute": cute_snek}  # collection of sneks

class SnekFactory(object):
    def __init__(self, nlp, **cfg):
        self.nlp = nlp
        self.snek_style = cfg.get("snek_style", "basic")
        self.snek = SNEKS[self.snek_style]

    def __call__(self, doc):
        print(self.snek)
        return doc

    def to_disk(self, path):
        snek_path = path / "snek.txt"
        with snek_path.open("w", encoding="utf8") as snek_file:
            snek_file.write(self.snek)

    def from_disk(self, path, **cfg):
        snek_path = path / "snek.txt"
        with snek_path.open("r", encoding="utf8") as snek_file:
            self.snek = snek_file.read()
        return self
```

**CUSTOM LANGUAGE CLASSES VIA ENTRY POINT**

- a `SnekLanguage` class for a custom model, WITHOUT modifying spaCy's code to add a language

```python
#SNEK.PY

from spacy.language import Language
from spacy.attrs import LANG

class SnekDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "snk"


class SnekLanguage(Language):
    lang = "snk"
    Defaults = SnekDefaults
    # Some custom snek language stuff here
```

- Alongside `spacy_factories`, there is also an entry point option for `spacy_languages`, which maps language codes to language-specific `Language` subclasses:

```python
#SETUP.PY

from setuptools import setup

setup(
    name="snek",
    entry_points={
        "spacy_factories": [
            "snek = snek:SnekFactory"
        ],
        # new: map the language code to the custom Language subclass
        "spacy_languages": [
            "snk = snek:SnekLanguage"
        ]
    }
)
```

- The custom `snk` language can then be loaded and is resolved to `SnekLanguage` via the custom entry point
- e.g. with a `meta.json` specifying `"lang": "snk"`

```python
from spacy.util import get_lang_class

SnekLanguage = get_lang_class("snk")
nlp = SnekLanguage()
```

**Distribution of Model**

- `Language.to_disk()`
- Dir created writing out WHOLE pipeline
- Deploy via wrapping as Python package
- **CLI** 

```bash
python -m spacy package /home/me/data/en_example_model /home/me/my_models

# creating
DIRECTORY STRUCTURE
└── /
    ├── MANIFEST.in                   # to include meta.json
    ├── meta.json                     # model meta data
    ├── setup.py                      # setup file for pip installation
    └── en_example_model              # model directory
        ├── __init__.py               # init for pip installation
        └── en_example_model-1.0.0    # model data
```

- Beware: directories need to be named per the naming conventions `lang_name` and `lang_name-version`

**Custom Model Setup**

- The `load()` method that comes with the model package templates handles assembling and returning a `Language` object with the loaded pipeline and data
- If requiring custom pipeline component / custom language class => **ship code with model**
- For examples of this, check out the implementations of spaCy’s [`load_model_from_init_py`](https://spacy.io/api/top-level#util.load_model_from_init_py) and [`load_model_from_path`](https://spacy.io/api/top-level#util.load_model_from_path) utility functions.

**Building Model PKG**

- `python setup.py sdist`
- `pip install /path/to/ex_xxx.tar.gz`
- **Loading only binary data** => `nlp = spacy.blank('en').from_disk('/path/to/data')`

Publishing a new version of spaCy often means re-training all available models, which is [quite a lot](https://spacy.io/usage/models#languages). To make this run smoothly, we’re using an automated build process and a [`spacy train`](https://spacy.io/api/cli#train) template that looks like this:

```bash
python -m spacy train {lang} {models_dir}/{name} {train_data} {dev_data} -m meta/{name}.json -V {version} -g {gpu_id} -n {n_epoch} -ns {n_sents}
```

In a directory `meta`, we keep `meta.json` templates for the individual models, containing all relevant information that doesn’t change across versions, like the name, description, author info and training data sources. When we train the model, we pass in the meta template file as the `--meta` argument, and specify the current model version as the `--version` argument.

On each epoch, the model is saved out with a `meta.json` using our template and added properties, like the `pipeline`, `accuracy` scores and the `spacy_version` used to train the model. After training completion, the best model is selected automatically and packaged using the [`package`](https://spacy.io/api/cli#package) command. Since a full meta file is already present on the trained model, no further setup is required to build a valid model package.

```bash
python -m spacy package -f {best_model} dist/
cd dist/{model_name}
python setup.py sdist
```

This process allows us to quickly trigger the model training and build process for all available models and languages, and generate the correct meta data automatically.


In [None]:
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.load("en_core_web_sm")
customer_feedback = open("customer_feedback_627.txt").read()
doc = nlp(customer_feedback)
doc.to_disk("/tmp/customer_feedback_627.bin")

new_doc = Doc(Vocab()).from_disk("/tmp/customer_feedback_627.bin")

# Training

```python
for doc in textcat.pipe(docs, batch_size=50):
    pass

scores = textcat.predict([doc1, doc2])

textcat.set_annotations([doc1, doc2], scores)

losses = {}
optimizer = nlp.begin_training()
textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)

optimizer = textcat.begin_training(pipeline=nlp.pipeline)
# An optional optimizer. Should take two arguments weights and gradient, and an optional ID. Will be created via TextCategorizer if not set.

# demo
optimizer = nlp.begin_training(get_data)
for itn in range(100):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
nlp.to_disk("/model")


# recommended simple training format
{
   "entities": [(0, 4, "ORG")],
   "heads": [1, 1, 1, 5, 5, 2, 7, 5],
   "deps": ["nsubj", "ROOT", "prt", "quantmod", "compound", "pobj", "det", "npadvmod"],
   "tags": ["PROPN", "VERB", "ADP", "SYM", "NUM", "NUM", "DET", "NOUN"],
   "cats": {"BUSINESS": 1.0},
}

nlp = spacy.blank('en')
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk("/model")

# BATCH HEURISTIC
def get_batches(train_data, model_type):
    max_batch_sizes = {"tagger": 32, "parser": 16, "ner": 16, "textcat": 64}
    max_batch_size = max_batch_sizes[model_type]
    if len(train_data) < 1000:
        max_batch_size /= 2
    if len(train_data) < 500:
        max_batch_size /= 2
    batch_size = compounding(1, max_batch_size, 1.001)
    batches = minibatch(train_data, size=batch_size)
    return batches
```

> This will set the batch size to start at 1, and increase each batch until it reaches a maximum size. The tagger, parser and entity recognizer all take whole sentences as input, so they’re learning a lot of labels in a single example. You therefore need smaller batches for them. The batch size for the text categorizer should be somewhat larger, especially if your documents are long.
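
To make the batch schedule concrete, here is a pure-Python sketch that does not call spaCy. `compounding_sketch` is a hypothetical stand-in for `spacy.util.compounding`, written under the assumption that the series multiplies by the compound rate at each step and is capped at the maximum:

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop.

    Assumed behaviour, for illustration only (not spaCy's own code).
    """
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound

# Same arguments as the "tagger" case above: from 1 up to a maximum of 32
sizes = list(islice(compounding_sketch(1.0, 32.0, 1.001), 5000))
print(sizes[0], round(sizes[1000], 3), sizes[-1])
```

With a rate of 1.001 the growth is slow: it takes a few thousand batches before the cap is reached, so early updates are made on very small batches.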

> By default spaCy uses the Adam solver, with default settings (learning rate 0.001, beta1=0.9, beta2=0.999). Some researchers have said they found these settings terrible on their problems – but they’ve always performed very well in training spaCy’s models, in combination with the rest of our recipe. You can change these settings directly, by modifying the corresponding attributes on the optimizer object. You can also set environment variables, to adjust the defaults.

> There are two other key hyper-parameters of the solver: L2 regularization, and gradient clipping (max_grad_norm). Gradient clipping is a hack that’s not discussed often, but everybody seems to be using. It’s quite important in helping to ensure the network doesn’t diverge, which is a fancy way of saying “fall over during training”. The effect is sort of similar to setting the learning rate low. It can also compensate for a large batch size (this is a good example of how the choices of all these hyper-parameters intersect).
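
As a rough illustration of what clipping by norm means, the following sketch rescales a gradient whose L2 norm exceeds a threshold. It is plain Python on a list of floats; `clip_by_norm` is a hypothetical helper, not the optimizer's actual code:

```python
import math

def clip_by_norm(gradient, max_norm):
    """Rescale `gradient` so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm or norm == 0.0:
        return list(gradient)          # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradient]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5.0
print(clipped)
```

The direction of the update is preserved and only its length shrinks, which is why the effect resembles a temporarily lower learning rate.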

> For small datasets, it’s useful to set a high dropout rate at first, and decay it down towards a more reasonable value. This helps avoid the network immediately overfitting, while still encouraging it to learn some of the more interesting things in your data. spaCy comes with a decaying utility function to facilitate this. You might try setting:

```python
from spacy.util import decaying
dropout = decaying(0.6, 0.2, 1e-4)
```
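A sketch of how such a schedule could be consumed, one value per update. `decaying_sketch` is a hypothetical re-implementation that assumes a linear decay with a floor; the exact schedule may differ between spaCy versions:

```python
from itertools import islice

def decaying_sketch(start, stop, decay):
    """Yield start, start-decay, start-2*decay, ... never going below stop.

    Assumed linear schedule, for illustration only.
    """
    current = float(start)
    while True:
        yield max(current, stop)
        current -= decay

dropout = decaying_sketch(0.6, 0.2, 1e-4)
rates = list(islice(dropout, 5000))    # one dropout rate per update step
print(rates[0], rates[-1])
```

In a training loop you would call `next(dropout)` once per batch and pass the value as the `drop` argument of the update call.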

> Parameter averaging: the trick is to store the moving average of the weights during training. We don’t optimize this average – we just track it. Then when we want to actually use the model, we use the averages, not the most recent value. In spaCy (and Thinc) this is done by using a context manager, `use_params`, to temporarily replace the weights:

```python
with nlp.use_params(optimizer.averages):
    nlp.to_disk("/model")
```

> The context manager is handy because you naturally want to evaluate and save the model at various points during training (e.g. after each epoch). After evaluating and saving, the context manager will exit and the weights will be restored, so you resume training from the most recent value, rather than the average. By evaluating the model after each epoch, you can remove one hyper-parameter from consideration (the number of epochs). Having one less magic number to guess is extremely nice – so having the averaging under a context manager is very convenient.
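
The swap-and-restore idea can be sketched without spaCy. `TinyModel` below is a hypothetical toy class (a plain running mean of the weights, not Thinc's implementation) that shows the weights being replaced inside the `with` block and restored afterwards:

```python
from contextlib import contextmanager

class TinyModel:
    """Toy model: a weight list plus a running mean of those weights."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.averages = list(weights)   # mean over every weight value seen so far
        self.n_updates = 0

    def update(self, gradient, learn_rate=0.1):
        self.weights = [w - learn_rate * g for w, g in zip(self.weights, gradient)]
        self.n_updates += 1
        # Incremental mean over n_updates + 1 values (initial weights included)
        self.averages = [a + (w - a) / (self.n_updates + 1)
                         for a, w in zip(self.averages, self.weights)]

    @contextmanager
    def use_params(self, params):
        backup = self.weights
        self.weights = list(params)     # temporarily use the given parameters
        try:
            yield self
        finally:
            self.weights = backup       # resume from the most recent weights

model = TinyModel([1.0, 1.0])
model.update([1.0, -1.0])
model.update([1.0, -1.0])
with model.use_params(model.averages):
    saved = list(model.weights)         # what would be evaluated or written to disk
print(saved, model.weights)
```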

## THINC 

In [None]:
"""This script is experimental.

Try pre-training the CNN component of the text categorizer using a cheap
language modelling-like objective. Specifically, we load pre-trained vectors
(from something like word2vec, GloVe, FastText etc), and use the CNN to
predict the tokens' pre-trained vectors. This isn't as easy as it sounds:
we're not merely doing compression here, because heavy dropout is applied,
including over the input words. This means the model must often (50% of the time)
use the context in order to predict the word.

To evaluate the technique, we're pre-training with the 50k texts from the IMDB
corpus, and then training with only 100 labels. Note that it's a bit dirty to
pre-train with the development data, but also not *so* terrible: we're not using
the development labels, after all --- only the unlabelled text.

@plac.annotations(
    width=("Width of CNN layers", "positional", None, int),
    embed_size=("Embedding rows", "positional", None, int),
    pretrain_iters=("Number of iterations to pretrain", "option", "pn", int),
    train_iters=("Number of iterations to train", "option", "tn", int),
    train_examples=("Number of labelled examples", "option", "eg", int),
    vectors_model=("Name or path to vectors model to learn from"),
)

"""
import plac
import random
import spacy
import thinc.extra.datasets
from spacy.util import minibatch, use_gpu, compounding
import tqdm
from spacy._ml import Tok2Vec
from spacy.pipeline import TextCategorizer
import numpy

In [None]:
pretrain_iters=30
train_iters=30
train_examples=1000

In [None]:
# Load pretrain data - un-labelled

def load_texts(limit=0):
  train, dev = thinc.extra.datasets.imdb()
  train_texts, train_labels = zip(*train)
  dev_texts, dev_labels = zip(*dev)
  train_texts = list(train_texts)
  dev_texts = list(dev_texts)
  random.shuffle(train_texts)
  random.shuffle(dev_texts)
  if limit >= 1:
      return train_texts[:limit]
  else:
      return list(train_texts) + list(dev_texts)

In [None]:
temp_text = load_texts(limit=0)

In [None]:
# Load Textcat pipe train-dev data - LABELLED 

def load_textcat_data(limit=0):
    """Load data from the IMDB dataset."""
    # Partition off part of the train data for evaluation
    train_data, eval_data = thinc.extra.datasets.imdb()
    random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    eval_texts, eval_labels = zip(*eval_data)
    cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
    eval_cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in eval_labels]
    return (texts, cats), (eval_texts, eval_cats)

In [None]:
temp_train, temp_eval = load_textcat_data()

In [None]:
# labels 
temp_train[1][:5]

[{'NEGATIVE': True, 'POSITIVE': False},
 {'NEGATIVE': True, 'POSITIVE': False},
 {'NEGATIVE': False, 'POSITIVE': True},
 {'NEGATIVE': False, 'POSITIVE': True},
 {'NEGATIVE': True, 'POSITIVE': False}]

In [None]:
def prefer_gpu():
    used = spacy.util.use_gpu(0)
    if used is None:
        return False
    else:
        import cupy.random

        cupy.random.seed(0)
        return True

In [None]:
random.seed(0)
numpy.random.seed(0)
use_gpu = prefer_gpu()
print("Using GPU?", use_gpu)

Using GPU? True


In [None]:
# Textcat model construct

def build_textcat_model(tok2vec, nr_class, width):
    from thinc.v2v import Model, Softmax, Maxout
    from thinc.api import flatten_add_lengths, chain
    from thinc.t2v import Pooling, sum_pool, mean_pool, max_pool
    from thinc.misc import Residual, LayerNorm
    from spacy._ml import logistic, zero_init

    with Model.define_operators({">>": chain}):
        model = (
            tok2vec
            >> flatten_add_lengths
            >> Pooling(mean_pool)
            >> Softmax(nr_class, width)
        )
    model.tok2vec = tok2vec
    return model

In [None]:
# Create NLP or model object

def create_pipeline(width, embed_size, vectors_model):
    print("Load vectors")
    nlp = spacy.load(vectors_model)
    print("Start training")
    textcat = TextCategorizer(
        nlp.vocab,
        labels=["POSITIVE", "NEGATIVE"],
        model=build_textcat_model(
            Tok2Vec(width=width, embed_size=embed_size), 2, width
        ),
    )

    nlp.add_pipe(textcat)
    return nlp

In [None]:
nlp = create_pipeline(width=300, embed_size=7500, vectors_model='en')

Load vectors
Start training


In [None]:
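# Wrap a model so that its forward pass still runs but no gradient flows back
# into it (the backprop callback is replaced by None), i.e. the layer is frozen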
def block_gradients(model):
    from thinc.api import wrap

    def forward(X, drop=0.0):
        Y, _ = model.begin_update(X, drop=drop)
        return Y, None

    return wrap(forward, model)

# Main FN for pretraining "tensorizer" pipeline using texts
def train_tensorizer(nlp, texts, dropout, n_iter):
    tensorizer = nlp.create_pipe("tensorizer")
    nlp.add_pipe(tensorizer)
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        losses = {}
        for i, batch in enumerate(minibatch(tqdm.tqdm(texts))):
            docs = [nlp.make_doc(text) for text in batch]
            tensorizer.update(docs, None, losses=losses, sgd=optimizer, drop=dropout)
        print(losses)
    return optimizer

For GPU support, we're grateful to use the work of Chainer's cupy module, which provides a numpy-compatible interface for GPU arrays. However, installing Chainer when no GPU is available currently causes an error. We therefore do not list Chainer as an explicit dependency, so building Thinc for GPU requires some extra steps:



In [None]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130


In [None]:
# The CuPy build has to match the installed CUDA version (10.0 above)
!pip install cupy-cuda100



In [None]:
# CLI steps to check that the right CUDA setup is in place for Thinc's GPU implementation
# Optional??

!ls /usr/local/cuda -a
!export CUDA_HOME=/usr/local/cuda # Or wherever your CUDA is
!export PATH=$PATH:$CUDA_HOME/bin
!pip install chainer
!python -c "import cupy; assert cupy" # Check it installed
!pip install thinc_gpu_ops thinc # Or `thinc[cuda]`
!python -c "import thinc_gpu_ops" # Check the GPU ops were built

**ERROR**

- CuPy dtype error
  - seems to be Colab env-dep issues
  - But dimension error still occurs 
  - Perhaps due to incorrect dimension or width and embed_size hyperparams 
  - These unknown as not given in GitHub source
  
- Solution
  - Not able to use this snippet
  - Resort to only TextCat training as above without pretrain this way
  - Could still pretrain using CLI model method (see later)

In [None]:
optimizer = train_tensorizer(nlp, temp_text, dropout=0.2, n_iter=pretrain_iters)

  0%|          | 0/50000 [00:00<?, ?it/s]


ValueError: ignored

## Subclassing TextCategorizer

```python
# To make this the default, override `Language.factories`:
#   Language.factories['textcat'] = lambda nlp, **cfg: CustomTextCat(nlp.vocab, **cfg)
# Then use: nlp.create_pipe('textcat')

class CustomTextCat(spacy.pipeline.TextCategorizer):
    @classmethod
    def Model(cls, nr_class=1, width=128, **cfg):
        # this needs to return a Thinc model
        return build_text_classifier(nr_class, width, **cfg)
```

## Evaluate

```python
tp = 0.0
fp = 0.0
fn = 0.0
for eg in test_examples:
    doc = nlp(eg["text"])
    guesses = set((ent.start_char, ent.end_char, ent.label_) for ent in doc.ents)
    truths = set((span['start'], span['end'], span['label']) for span in eg['spans'])
    tp += len(guesses.intersection(truths))
    fn += len(truths - guesses)
    fp += len(guesses - truths)
precision = tp / (tp+fp+1e-10)
recall = tp / (tp+fn+1e-10)
fscore = (2 * precision * recall) / (precision + recall + 1e-10)

# usage
nlp = spacy.load('en_core_web_lg')
model = EntityRecognizer(nlp, label=['PERSON', 'ORG'])
stats = model.evaluate(examples, no_missing=True)

def gold_to_spacy(dataset, spacy_model, bilou=False):
    # spacy_model: name or path of the model used for tokenization
    nlp = spacy.load(spacy_model)
    annos = []
    for eg in dataset:
        entities = [(span['start'], span['end'], span['label'])
                    for span in eg.get('spans', [])]
        if bilou:
            doc = nlp(eg['text'])
            entities = spacy.gold.biluo_tags_from_offsets(doc, entities)
            anno_entry = [eg['text'], entities]
        else:
            anno_entry = [eg['text'], {'entities': entities}]
        annos.append(anno_entry)
    return annos

def eval_prf(ner_model, examples):
    scorer = spacy.scorer.Scorer()
    for input_, anno in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = spacy.gold.GoldParse(doc_gold_text, entities=anno['entities'])
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

def model_stats(dataset, spacy_model, label=None, is_prf=False):
    """Evaluate model accuracy of model based on dataset without training
    """
    nlp = spacy.load(spacy_model)
    
    if is_prf:
        examples = gold_to_spacy(dataset, spacy_model)
        score = eval_prf(nlp, examples)
        print("Precision {:0.4f}\tRecall {:0.4f}\tF-score {:0.4f}".format(score['ents_p'],
                                                                          score['ents_r'],
                                                                          score['ents_f']))
    else:
        model = EntityRecognizer(nlp, label=label)
        evaldoc = merge_spans(dataset)
        evals = list(split_sentences(model.nlp, evaldoc))
        scores = model.evaluate(evals)
        print("Accuracy {:0.4f}\tRight {:0.0f}\tWrong {:0.0f}\tUnknown {:0.0f}\tEntities {:0.0f}".format(scores['acc'],
                                                                                                         scores['right'],
                                                                                                         scores['wrong'],
                                                                                                         scores['unk'],
                                                                                                         scores['ents']))
```
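
The precision / recall / F-score arithmetic at the top of the block above can be checked on toy data without loading a model. The spans below are made-up `(start_char, end_char, label)` tuples and `prf` is a hypothetical helper:

```python
def prf(guesses, truths):
    """Precision, recall and F-score over two sets of (start, end, label) tuples."""
    tp = len(guesses & truths)          # predicted and correct
    fp = len(guesses - truths)          # predicted but wrong
    fn = len(truths - guesses)          # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore

truths = {(0, 5, "ORG"), (20, 33, "GPE")}
guesses = {(0, 5, "ORG"), (20, 33, "LOC"), (40, 44, "PERSON")}
precision, recall, fscore = prf(guesses, truths)
print(precision, recall, fscore)
```

A span with the right offsets but the wrong label counts as both a false positive and a false negative, which is why exact-match NER scores can look harsh.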

### CI of NER

```python
# CI: confidence scores for entities via beam search
from collections import defaultdict

doc = nlp.make_doc(text)
(beams, somethingelse) = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)
# Accumulate over ALL beam parses: initialise once, outside the loop
entity_scores = defaultdict(float)
for score, ents in nlp.entity.moves.get_beam_parses(beams[0]):
    print(score, ents)
    for start, end, label in ents:
        entity_scores[(start, end, label)] += score
print('entity_scores', entity_scores)
for (start, end, label), value in entity_scores.items():
    if label == 'LOCATION':
        print(start, doc[start], value)  # doc[start] is the entity's first token
        
        
        
        
# another impl
ner = nlp.get_pipe('ner')
docs = [nlp.make_doc(text) for text in batch]
beams = ner.beam_parse(docs, beam_width=16)
for beam in beams:
    entities = ner.moves.get_beam_annot(beam)
```
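The aggregation step is easier to follow on toy data. The sketch below assumes each beam parse comes as a `(probability, entities)` pair whose probabilities sum to 1 across the beam; the numbers and the `entity_confidences` helper are made up for illustration:

```python
from collections import defaultdict

def entity_confidences(parses):
    """Sum the probability of every parse in which an entity appears."""
    scores = defaultdict(float)
    for prob, ents in parses:
        for ent in ents:
            scores[ent] += prob
    return dict(scores)

# Hypothetical beam of three parses over the same document
parses = [
    (0.6, [(3, 5, "GPE")]),
    (0.3, [(3, 5, "ORG")]),
    (0.1, [(3, 5, "GPE"), (7, 8, "PERSON")]),
]
confidences = entity_confidences(parses)
for ent, value in sorted(confidences.items(), key=lambda kv: -kv[1]):
    print(ent, round(value, 3))
```

An entity that appears in most of the probability mass of the beam gets a confidence near 1, while one that shows up only in an unlikely parse stays near 0.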

# Rule-Based Matching

## Token Based

**One pattern describing 3 tokens**

- a token whose lowercase form is "hello", e.g. 'hello' or 'HELLO'
- a token whose `is_punct` flag == True
- a token whose lowercase form == "world"

```python
[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
```

**Important note**

> When writing patterns, keep in mind that each dictionary represents one token. If spaCy’s tokenization doesn’t match the tokens defined in a pattern, the pattern is not going to produce any results. When developing complex patterns, make sure to check examples against spaCy’s tokenization:

```python
doc = nlp(u"A complex-example,!")
print([token.text for token in doc])
```


In [1]:
import spacy
from spacy.matcher import Matcher

In [2]:
nlp = spacy.load("en_core_web_sm")

# vocab must be shared with document the matcher operates on
matcher = Matcher(nlp.vocab)

# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", None, pattern) # first arg is ID

In [3]:
# load text using same nlp object (hence same vocab space)
doc = nlp(u"Hello, World! Hello world!")
# operate on doc
matches = matcher(doc) 

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

15578876784678163569 HelloWorld 0 3 Hello, World


In [4]:
# Optionally, we could also choose to add more than one pattern, 
# for example to also match sequences without punctuation between "hello" and "world":

matcher.add("HelloWorld", None,
            [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
            [{"LOWER": "hello"}, {"LOWER": "world"}])
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

15578876784678163569 HelloWorld 0 3 Hello, World
15578876784678163569 HelloWorld 4 6 Hello world


> By default, the matcher will only return the matches and not do anything else, like merge entities or assign labels. This is all up to you and can be defined individually for each pattern, by passing in a callback function as the `on_match` argument on `add()`. This is useful, because it lets you write entirely custom and pattern-specific logic. For example, you might want to **merge some patterns into one token, while adding entity labels for other pattern types**. You shouldn’t have to create different matchers for each of those processes.

**Available token attributes**

- `ORTH` (unicode): The exact verbatim text of a token.
- `TEXT` (unicode, v2.1): The exact verbatim text of a token.
- `LOWER` (unicode): The lowercase form of the token text.
- `LENGTH` (int): The length of the token text.
- `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` (bool): Token text consists of alphabetic characters, ASCII characters, digits.
- `IS_LOWER`, `IS_UPPER`, `IS_TITLE` (bool): Token text is in lowercase, uppercase, titlecase.
- `IS_PUNCT`, `IS_SPACE`, `IS_STOP` (bool): Token is punctuation, whitespace, stop word.
- `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` (bool): Token text resembles a number, URL, email.
- `POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE` (unicode): The token’s simple and extended part-of-speech tag, dependency label, lemma, shape.
- `ENT_TYPE` (unicode): The token’s entity label.
- `_` (dict, v2.1): Properties in custom extension attributes.

**Extended pattern syntax and attributes V2.1**

> Instead of mapping to a single value, token patterns can also map to a `dictionary of properties`. For example, to specify that the value of a lemma should be part of a list of values, or to set a minimum character length. The following rich comparison attributes are available:

```python
# Matches "love cats" or "likes flowers"
pattern1 = [{"LEMMA": {"IN": ["like", "love"]}},
            {"POS": "NOUN"}] # NOT_IN, ==, >=, etc

# Matches tokens of length >= 10
pattern2 = [{"LENGTH": {">=": 10}}]
```

**REGEX**

In some cases, only matching tokens and token attributes isn’t enough – for example, you might want to match different spellings of a word, without having to add a new pattern for each spelling.

```python
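# The regex is applied to the text of ONE token, so this covers single-token
# spellings (e.g. "US"); the outer group is now closed and a raw string is used
pattern = [{"TEXT": {"REGEX": r"^([Uu](\.?|nited) ?[Ss](\.?|tates))"}},
           {"LOWER": "president"}]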
```

> 'REGEX' as an operator (instead of a top-level property that only matches on the token’s text) allows defining rules for any string value, including custom attributes

```python
# Match tokens with fine-grained POS tags starting with 'V'
pattern = [{"TAG": {"REGEX": "^V"}}]

# Match custom attribute values with regular expressions
pattern = [{"_": {"country": {"REGEX": r"^([Uu](\.?|nited) ?[Ss](\.?|tates))"}}}]
```

**Operators and quantifiers**

The matcher also lets you use quantifiers, specified as the `OP` key. Quantifiers let you define sequences of tokens to be matched, e.g. one or more punctuation marks, or specify optional tokens. Note that there are no nested or scoped quantifiers – instead, you can build those behaviors with `on_match` callbacks.

- `!`: Negate the pattern, by requiring it to match exactly 0 times.
- `?`: Make the pattern optional, by allowing it to match 0 or 1 times.
- `+`: Require the pattern to match 1 or more times.
- `*`: Allow the pattern to match zero or more times.

```python
pattern = [{"LOWER": "hello"},
           {"IS_PUNCT": True, "OP": "?"}]
```
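To get a feel for how the `?`, `+` and `*` quantifiers behave, here is a toy matcher in plain Python. It is not spaCy's `Matcher`: tokens are plain strings, only the `LOWER` and `IS_PUNCT` keys are understood, `!` is left out, and `match_ends` is a hypothetical helper that returns every token index at which the pattern can end:

```python
import string

def is_punct(token):
    """True if the token is non-empty and made only of ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

def token_ok(token, spec):
    """Check one token string against the attribute keys this sketch supports."""
    if "LOWER" in spec and token.lower() != spec["LOWER"]:
        return False
    if "IS_PUNCT" in spec and is_punct(token) != spec["IS_PUNCT"]:
        return False
    return True

def match_ends(tokens, pattern, i=0):
    """Return every end index at which `pattern` can finish when started at `i`."""
    if not pattern:
        return {i}
    spec, rest = pattern[0], pattern[1:]
    op = spec.get("OP", "1")            # "1" is this sketch's label for "exactly once"
    ends = set()
    if op in ("?", "*"):
        ends |= match_ends(tokens, rest, i)          # zero occurrences
    if i < len(tokens) and token_ok(tokens[i], spec):
        if op in ("1", "?"):
            ends |= match_ends(tokens, rest, i + 1)  # exactly one occurrence
        else:                                        # "+" or "*": more hits are optional
            more = [dict(spec, OP="*")] + rest
            ends |= match_ends(tokens, more, i + 1)
    return ends

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}]
with_comma = sorted(match_ends(["Hello", ",", "world"], pattern))
without_comma = sorted(match_ends(["Hello", "world"], pattern))
print(with_comma, without_comma)
```

Because the punctuation token is optional, the first call can end either after "Hello" or after the comma, while the second can only end after "Hello".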

**Wildcard**
- An empty dictionary, `{}`, acts as a wildcard representing any token.
- Useful if you know the context of what you’re trying to match, but very little about the specific token and its characters.
- For example, let’s say you’re trying to extract people’s user names from your data. All you know is that they are listed as `"User name: {username}"`. The name itself may contain any character, but no whitespace – so you’ll know it will be handled as one token.

```python
[{"ORTH": "User"}, {"ORTH": "name"}, {"ORTH": ":"}, {}]
```

**Adding on_match rules**

- See below demo
> match all mentions of “Google I/O” (which spaCy tokenizes as `['Google', 'I', '/', 'O']`). To be safe, you only match on the uppercase versions, in case someone has written it as “Google i/o”.


In [88]:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

In [89]:
def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="EVENT")
    doc.ents += (entity,)
    print(entity.text)

pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
matcher.add("GoogleIO", add_event_ent, pattern)

In [90]:
doc = nlp(u"This is a text about Google I/O.")

In [91]:
matches = matcher(doc)

In [92]:
matches

[]

In [31]:
from spacy import displacy
html = displacy.render(doc, style="ent", page=True,
                options={"ents": ["EVENT"]})

**Example: Using linguistic annotations**

- Analysing user comments to find out what people are saying about Facebook.
- Finding adjectives following “Facebook is” or “Facebook was”.

```python
[{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}]
```
- For a quick overview of the results, collect all sentences containing a match and render them with the displaCy visualizer.
- In the callback function, you’ll have access to the start and end of each match, as well as the parent Doc.
- Determine the containing sentence via `doc[start:end].sent`, and calculate the start and end of the matched span within the sentence.
- Using displaCy in “manual” mode lets you pass in a list of dictionaries containing the text and entities to render.

In [111]:
import spacy
from spacy import displacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matched_sents = []  # Collect data of matched sentences to be visualized

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]  # Matched span
    sent = span.sent  # Sentence containing matched span
    # Append mock entity for match in displaCy style to matched_sents
    # get the match span by offsetting the start and end of the span with the
    # start and end of the sentence in the doc
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})

pattern = [{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"},
           {"POS": "ADJ"}]
matcher.add("FacebookIs", collect_sents, pattern)  # add pattern
doc = nlp(u"I'd say that Facebook is evil. – Facebook is pretty cool, right?")
matches = matcher(doc)

matched_sents

# error to be fixed
# Serve visualization of sentences containing match with displaCy
# set manual=True to make displaCy render straight from a dictionary
# (if you're not running the code within a Jupyter environment, you can
# use displacy.serve instead)

[{'text': "I'd say that Facebook is evil.",
  'ents': [{'start': 13, 'end': 29, 'label': 'MATCH'}]},
 {'text': 'Facebook is pretty cool, right?',
  'ents': [{'start': 0, 'end': 23, 'label': 'MATCH'}]}]
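
The offsets in the output above are relative to the sentence, not to the whole document. A minimal sketch of that arithmetic, using plain string positions instead of spaCy spans (`relative_span` is a hypothetical helper and the text is shortened):

```python
def relative_span(sent_start_char, span_start_char, span_end_char, label="MATCH"):
    """Shift document-level character offsets so they are relative to the sentence."""
    return {
        "start": span_start_char - sent_start_char,
        "end": span_end_char - sent_start_char,
        "label": label,
    }

doc_text = "I'd say that Facebook is evil. Facebook is pretty cool, right?"
span_text = "Facebook is pretty cool"

sent_start = doc_text.index("Facebook is pretty")  # where the second sentence begins
span_start = doc_text.index(span_text)             # document-level offsets of the match
span_end = span_start + len(span_text)

ent = relative_span(sent_start, span_start, span_end)
sentence = doc_text[sent_start:]
print(ent, repr(sentence[ent["start"]:ent["end"]]))
```

Slicing the sentence text with the shifted offsets recovers exactly the matched phrase, which is what displaCy's manual mode needs in order to highlight it.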

## Efficient Phrase Matching

- To match large terminology lists, use `PhraseMatcher` and create `Doc` objects as patterns instead of token patterns
- The `Doc` patterns can contain single or multiple tokens.

In [112]:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
terminology_list = [u"Barack Obama", u"Angela Merkel", u"Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terminology_list]

In [113]:
patterns

[Barack Obama, Angela Merkel, Washington, D.C.]

In [114]:
matcher.add("TerminologyList", None, *patterns)

doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
          u"converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

Angela Merkel
Barack Obama
Washington, D.C.


**Speed on creating pattern**

```diff
- patterns = [nlp(term) for term in LOTS_OF_TERMS]
+ patterns = [nlp.make_doc(term) for term in LOTS_OF_TERMS]
+ patterns = list(nlp.tokenizer.pipe(LOTS_OF_TERMS))
```


**Matching on other token attributes**

- By default, the `PhraseMatcher` will match on the verbatim token text, e.g. `Token.text`.
- By setting the `attr` argument on initialization, you can change which token attribute the matcher should use when comparing the phrase pattern to the matched Doc.
- For example, using the attribute `LOWER` lets you match on `Token.lower` and create case-insensitive match patterns:

In [115]:
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(name) for name in [u"Angela Merkel", u"Barack Obama"]]

In [116]:
patterns

[Angela Merkel, Barack Obama]

In [117]:
matcher.add("Names", None, *patterns)

doc = nlp(u"angela merkel and us president barack Obama")
for match_id, start, end in matcher(doc):
    print("Matched based on lowercase token text:", doc[start:end])

Matched based on lowercase token text: angela merkel
Matched based on lowercase token text: barack Obama


**Matching on SHAPE**

In [118]:
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
matcher.add("IP", None, nlp(u"127.0.0.1"), nlp(u"127.127.0.0"))

doc = nlp(u"Often the router will have an IP address such as 192.168.1.1 or 192.168.2.1.")
for match_id, start, end in matcher(doc):
    print("Matched based on token shape:", doc[start:end])

Matched based on token shape: 192.168.1.1
Matched based on token shape: 192.168.2.1


> In theory, the same also works for attributes like POS. For example, a pattern nlp("I like cats") matched based on its part-of-speech tag would return a match for “I love dogs”. You could also match on boolean flags like IS_PUNCT to match phrases with the same sequence of punctuation and non-punctuation tokens as the pattern. But this can easily get confusing and doesn’t have much of an advantage over writing one or two token patterns.
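
Returning to the SHAPE example: two IP patterns were added because they have different shapes. The sketch below approximates the shape feature in plain Python (digits become `d`, lowercase `x`, uppercase `X`, and runs of the same shape character are cut off after four). `word_shape_sketch` is a hypothetical helper modelled on that idea, not spaCy's own function:

```python
def word_shape_sketch(text):
    """Map characters to d / x / X and truncate runs longer than four."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char           # punctuation is kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:
            shape.append(shape_char)
    return "".join(shape)

for ip in ["127.0.0.1", "127.127.0.0", "192.168.1.1", "192.168.2.1"]:
    print(ip, word_shape_sketch(ip))
```

Under this approximation both addresses in the text share the shape of `127.127.0.0` rather than that of `127.0.0.1`, so a single pattern would not cover every IP layout.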

## Rule-based ER

- The `EntityRuler` is an exciting new **component** that lets you **add named entities based on pattern dictionaries, and makes it easy to combine rule-based and statistical named entity recognition for even more powerful models.**

**Entity Pattern**

1. **Phrase patterns** for exact string matches (string).
```python 
{"label": "ORG", "pattern": "Apple"}
```
2. **Token patterns** with one dictionary describing one token (list).
```python
{"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]}
```

**Using the entity ruler**
- The `EntityRuler` is a **pipeline component** that’s typically added via `nlp.add_pipe`.
- When the nlp object is called on a text, it will **find matches in the doc and add them as entities to the `doc.ents`, using the specified pattern label as the entity label.**



In [119]:
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp(u"Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Apple', 'ORG'), ('San Francisco', 'GPE')]
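
The difference between the two pattern types can be illustrated without spaCy. This is a toy matcher over plain token strings, not the `EntityRuler` implementation; `token_matches` and `find_entities` are hypothetical helpers, and only the `lower` key (plus an internal `orth` key for phrase patterns) is handled.

```python
def token_matches(token, spec):
    """Check one token text against one pattern dict ('lower' and 'orth' only)."""
    for attr, value in spec.items():
        if attr == "lower" and token.lower() != value:
            return False
        if attr == "orth" and token != value:
            return False
    return True


def find_entities(tokens, patterns):
    """Return (text, label) pairs for every pattern match in a list of token texts."""
    found = []
    for entry in patterns:
        pattern = entry["pattern"]
        if isinstance(pattern, str):
            # Phrase pattern: exact, case-sensitive match on each word
            specs = [{"orth": word} for word in pattern.split()]
        else:
            # Token pattern: one dict per token
            specs = pattern
        for start in range(len(tokens) - len(specs) + 1):
            window = tokens[start:start + len(specs)]
            if all(token_matches(t, s) for t, s in zip(window, specs)):
                found.append((" ".join(window), entry["label"]))
    return found


patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
]
tokens = ["apple", "pie", "from", "Apple", "in", "SAN", "Francisco"]
print(find_entities(tokens, patterns))
```

The phrase pattern skips the lowercase `apple`, while the token pattern accepts `SAN Francisco` because it compares lowercased text.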


> The entity ruler is designed to **integrate with spaCy’s existing statistical models and enhance the named entity recognizer.**
> If it’s **added before the `"ner"` component,** the entity recognizer will **respect the existing entity spans** and **adjust its predictions around it.**
> This can significantly improve accuracy in some cases. **If it’s added after the `"ner"` component, the entity ruler will only add spans to the `doc.ents` if they don’t overlap with existing entities predicted by the model.**
> To **overwrite overlapping entities**, you can set `overwrite_ents=True` on initialization.

In [120]:
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp)
patterns = [{"label": "ORG", "pattern": "MyCorp Inc."}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp(u"MyCorp Inc. is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('MyCorp Inc.', 'ORG'), ('U.S.', 'GPE')]
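
The overlap rule described above (ruler placed after `"ner"`, with or without `overwrite_ents`) can be sketched in plain Python. This is only an illustration of the rule, not spaCy's code: spans are `(start, end, label)` tuples with an exclusive end, `combine_spans` is a hypothetical helper, and the `PERSON` prediction for "MyCorp" is an invented model error used to show the effect.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def combine_spans(model_ents, rule_ents, overwrite_ents=False):
    """Merge model and rule spans following the overlap rule."""
    if overwrite_ents:
        # Rule spans win: drop model spans that overlap any rule span
        kept = [m for m in model_ents if not any(overlaps(m, r) for r in rule_ents)]
        return sorted(kept + list(rule_ents))
    # Model spans win: only add rule spans that do not overlap
    added = [r for r in rule_ents if not any(overlaps(r, m) for m in model_ents)]
    return sorted(list(model_ents) + added)


# Hypothetical model output for "MyCorp Inc. is a company in the U.S."
model_ents = [(0, 1, "PERSON"), (7, 8, "GPE")]
rule_ents = [(0, 2, "ORG")]

print(combine_spans(model_ents, rule_ents))
print(combine_spans(model_ents, rule_ents, overwrite_ents=True))
```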


**Using Pattern FILE**

The `to_disk` and `from_disk` methods let you save and load patterns to and from JSONL (newline-delimited JSON) files, containing one pattern object per line.

```python
# patterns.jsonl
{"label": "ORG", "pattern": "Apple"}
{"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]}

ruler.to_disk("./patterns.jsonl")
new_ruler = EntityRuler(nlp).from_disk("./patterns.jsonl")
```
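
To make the file format concrete, here is a minimal sketch that writes and reads a JSONL pattern file with the standard library only. `write_jsonl` and `read_jsonl` are hypothetical helpers, and the temporary directory is just for the demonstration.

```python
import json
import os
import tempfile

patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "patterns.jsonl")
write_jsonl(path, patterns)
loaded = read_jsonl(path)
print(loaded == patterns)
```

Because each line is an independent JSON document, patterns can be appended to the file or streamed from it without parsing the whole file at once.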

> When you save out an `nlp` object that has an `EntityRuler` added to its pipeline, its patterns are automatically exported to the model directory:
```python
nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
nlp.add_pipe(ruler)
nlp.to_disk("/path/to/model")
```

> The saved model now **includes the "entity_ruler" in its "pipeline" setting in the meta.json,** and the model directory contains a file `entityruler.jsonl` with the patterns. When you load the model back in, all pipeline components will be restored and deserialized, including the entity ruler. **This lets you ship powerful model packages with binary weights and rules included!**

## Combining Model and Rules

You can combine statistical and rule-based components in a variety of ways. Rule-based components can be used to improve the accuracy of statistical models, by presetting tags, entities or sentence boundaries for specific tokens. The statistical models will usually respect these preset annotations, which sometimes improves the accuracy of other decisions. You can also use rule-based components after a statistical model to correct common errors. Finally, rule-based components can reference the attributes set by statistical models, in order to implement more abstract logic.



In [161]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

def set_sentiment(matcher, doc, i, matches):
    doc.sentiment += 0.1

pattern1 = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
pattern2 = [[{"ORTH": emoji, "OP": "+"}] for emoji in ["😀", "😂", "🤣", "😍"]]
matcher.add("GoogleIO", None, pattern1)  # Match "Google I/O" (ORTH is case-sensitive)
matcher.add("HAPPY", set_sentiment, *pattern2)  # Match one or more happy emoji

doc = nlp(u"A text about Google I/O 😀😀")
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(string_id, span.text)
print("Sentiment", doc.sentiment)

GoogleIO Google I/O
HAPPY 😀
HAPPY 😀😀
HAPPY 😀
Sentiment 0.30000001192092896
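
The output above has three `HAPPY` matches for two emoji, and a sentiment that is not exactly 0.3. A rough sketch of why, under two assumptions: that the `+` operator reports every span consisting only of the target token, and that `doc.sentiment` is stored as a 32-bit float. `plus_matches` and `as_float32` are hypothetical helpers, not part of spaCy.

```python
import struct


def plus_matches(tokens, target):
    """Return every (start, end) span made only of `target` tokens (end exclusive)."""
    spans = []
    for start in range(len(tokens)):
        end = start
        while end < len(tokens) and tokens[end] == target:
            end += 1
            spans.append((start, end))
    return spans


def as_float32(value):
    """Round a Python float to 32-bit precision."""
    return struct.unpack("f", struct.pack("f", value))[0]


tokens = ["A", "text", "about", "Google", "I", "/", "O", "😀", "😀"]
spans = plus_matches(tokens, "😀")
print(spans)

# The callback runs once per match, adding 0.1 each time
sentiment = 0.0
for _ in spans:
    sentiment = as_float32(sentiment + 0.1)
print(sentiment)
```

Each of the three spans triggers the callback once, so the sentiment is incremented three times, and the 32-bit rounding would account for the trailing digits in the printed value.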


**Example: Expanding Named Entity**

- For example, the corpus spaCy’s English models were trained on defines a `PERSON` entity as just the **person name, without titles like “Mr” or “Dr”.**
- This makes sense, because it makes it easier to resolve the entity type back to a knowledge base. 
- But what if your application needs the full names, including the titles?

In [121]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Alex Smith', 'PERSON'), ('first', 'ORDINAL'), ('Acme Corp Inc.', 'ORG')]


- While you could try and teach the model a new definition of the PERSON entity by **updating it with more examples of spans that include the title, this might not be the most efficient approach.**
- The existing model was trained on over **2 million words**, so in order to completely change the definition of an entity type, you might need a lot of training examples. 
- However, if you already have the predicted PERSON entities, you can use a **rule-based approach** that checks whether they come with a title and if so, expands the entity span by one token. 
- After all, what all titles in this example have in common is that if they occur, they occur in the previous token right before the person entity.

> The function below takes a `Doc`, modifies its `doc.ents` and returns it. This is **exactly what a pipeline component does**, so in order to let it run automatically when processing a text with the `nlp` object, we can use `nlp.add_pipe` to add it to the current pipeline.

In [122]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

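def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        # Only PERSON entities that are not at the start of the doc can have a title
        if ent.label_ == "PERSON" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
                # Extend the span one token to the left to include the title
                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                new_ents.append(new_ent)
            else:
                # Keep PERSON entities without a title (otherwise they would be dropped)
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc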

# Add the component after the named entity recognizer
nlp.add_pipe(expand_person_entities, after='ner')

doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Dr Alex Smith', 'PERSON'), ('first', 'ORDINAL'), ('Acme Corp Inc.', 'ORG')]


> An alternative approach would be to use an **extension attribute** like `._.person_title` and add it to `Span` objects (which includes entity spans in `doc.ents`). The **advantage here is that the entity text stays intact and can still be used to look up the name in a knowledge base.** The following function takes a `Span` object, checks the previous token if it’s a `PERSON` entity and returns the title if one is found. The `Span.doc` attribute gives us easy access to the span’s parent document.
```python
def get_person_title(span):
    if span.label_ == "PERSON" and span.start != 0:
        prev_token = span.doc[span.start - 1]
        if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
            return prev_token.text
```

In [123]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

def get_person_title(span):
    if span.label_ == "PERSON" and span.start != 0:
        prev_token = span.doc[span.start - 1]
        if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
            return prev_token.text

# Register the Span extension as 'person_title'
Span.set_extension("person_title", getter=get_person_title)

doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_, ent._.person_title) for ent in doc.ents])

[('Alex Smith', 'PERSON', 'Dr'), ('first', 'ORDINAL', None), ('Acme Corp Inc.', 'ORG', None)]


**Example: Using entities, part-of-speech tags and the dependency parse**

- Let’s say you want to parse **professional biographies and extract the person names and company names, and whether it’s a company they’re currently working at, or a previous company.**
- One approach could be to try and train a named entity recognizer to predict `CURRENT_ORG` and `PREVIOUS_ORG`, but this distinction is very subtle and something the entity recognizer may struggle to learn. Nothing about “Acme Corp Inc.” is inherently “current” or “previous”.
- However, the **syntax** of the sentence holds some very important clues: we can check for **trigger words like “work”, whether they’re past tense or present tense, whether company names are attached to it and whether the person is the subject.**
- All of this information is available in the part-of-speech tags and the dependency parse.

In [124]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alex Smith worked at Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Alex Smith', 'PERSON'), ('Acme Corp Inc.', 'ORG')]


In [127]:
displacy.render(doc, style='dep', options={'fine_grained': True})

> In this example, “worked” is the root of the sentence and is a past tense verb.
> - Its subject is “Alex Smith”, the person who worked. “at Acme Corp Inc.” is a prepositional phrase attached to the verb “worked”.
> - To extract this relationship, we can start by looking at the predicted `PERSON` entities, find their heads and check whether they’re attached to a trigger word like “work”. Next, we can check for prepositional phrases attached to the head and whether they contain an `ORG` entity.
> - Finally, to determine whether the company affiliation is current, we can check the head’s part-of-speech tag.

In [128]:
import spacy
from spacy.pipeline import merge_entities
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

def extract_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == "work":
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
                print({'person': ent, 'orgs': orgs, 'past': head.tag_ == "VBD"})
    return doc

# To make the entities easier to work with, we'll merge them into single tokens
nlp.add_pipe(merge_entities)
nlp.add_pipe(extract_person_orgs)

doc = nlp("Alex Smith worked at Acme Corp Inc.")
# If you're not in a Jupyter / IPython environment, use displacy.serve
displacy.render(doc, options={'fine_grained': True})

{'person': Alex Smith, 'orgs': [Acme Corp Inc.], 'past': True}


> If you change the sentence structure above, for example to “was working”, you’ll notice that our current logic fails and doesn’t correctly detect the company as a past organization. That’s because the root is a participle and the tense information is in the attached auxiliary “was”.

> To solve this, we can adjust the rules to also check for the above construction:
```python
def extract_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == "work":
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                orgs = [t for t in prep.children if t.ent_type_ == "ORG"]
                aux = [token for token in head.children if token.dep_ == "aux"]
                past_aux = any(t.tag_ == "VBD" for t in aux)
                past = head.tag_ == "VBD" or (head.tag_ == "VBG" and past_aux)
                print({'person': ent, 'orgs': orgs, 'past': past})
    return doc
```
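
The tense rule in the adjusted function can be checked in isolation. The sketch below pulls the boolean logic into a hypothetical helper `is_past` that works on plain tag strings, so the four common cases can be compared side by side without running a parser.

```python
def is_past(head_tag, aux_tags):
    """Past if the head is VBD, or a VBG participle with a past-tense auxiliary."""
    past_aux = any(tag == "VBD" for tag in aux_tags)
    return head_tag == "VBD" or (head_tag == "VBG" and past_aux)


cases = {
    "worked": ("VBD", []),
    "was working": ("VBG", ["VBD"]),
    "is working": ("VBG", ["VBZ"]),
    "works": ("VBZ", []),
}
for phrase, (head_tag, aux_tags) in cases.items():
    print(phrase, is_past(head_tag, aux_tags))
```

The tags here are the Penn Treebank tags one would expect for these verb forms; the tags a model actually predicts may differ on real text.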

In [136]:
[tok for tok in doc if tok.has_vector]

[Alex Smith, worked, at, Acme Corp Inc.]

In [147]:
# ADD CUSTOM SIMILARITY HOOKS

class SimilarityModel(object):
    def __init__(self, model):
        self._model = model

    def __call__(self, doc):
        doc.user_hooks["similarity"] = self.similarity
        doc.user_span_hooks["similarity"] = self.similarity
        doc.user_token_hooks["similarity"] = self.similarity
        return doc  # a pipeline component has to return the Doc

    def similarity(self, obj1, obj2):
        y = self._model([obj1.vector, obj2.vector])
        return float(y[0])

In [152]:
doc.user_hooks['vector'] = np.array(10)

In [156]:
print(doc.vector)

TypeError: 'numpy.ndarray' object is not callable
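
The `TypeError` above comes from assigning a value where a function is expected: a user hook is called with the object, so it must be callable. The stand-in class below (`FakeDoc`, a hypothetical simplification, not spaCy's `Doc`) reproduces the same failure mode and the fix using plain lists instead of arrays.

```python
class FakeDoc:
    """Stand-in for a Doc whose `vector` defers to a user hook when one is set."""

    def __init__(self, default_vector):
        self.user_hooks = {}
        self._default_vector = default_vector

    @property
    def vector(self):
        if "vector" in self.user_hooks:
            # The hook is *called* with the doc, so it has to be a function
            return self.user_hooks["vector"](self)
        return self._default_vector


fake = FakeDoc([0.0, 0.0])

fake.user_hooks["vector"] = [10.0]  # a value, not a function
try:
    fake.vector
except TypeError as err:
    error = str(err)
    print("TypeError:", error)

fake.user_hooks["vector"] = lambda doc: [10.0]  # wrap the value in a callable
fixed = fake.vector
print(fixed)
```

In the notebook cell, the equivalent change would be to assign a function that returns the array, for example a `lambda` taking the doc, rather than the array itself.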

## Ines Guide

### Ines Guide on NLP Task
- Start with generic categories, then extract more specific info from them.
- Start with the `lg` stock NE to get similar info.
- Incrementally update a particular NE with new data, then fully batch train at the end.
- Add rules (e.g. for "Q2 2018") and match patterns based on regex and spaCy's linguistic pipeline.
- Use the `parser` to extract relationships around NE.
- Train a TextCat component to assign labels to whole sentences or paragraphs (good when information is less dense).
- e.g. "Sales totalled 864 million"
    - tokens 864 million == "MONEY"
    - walk up the tree and check how it attaches to the rest of the sentence
        - direct object attached to verb.lemma "total / to total" with subject "sales"
    - NER "NOUN" and "VERB" for example
- e.g. financial report: period/timeframe encoded in the headline
    - Detect headlines with rules or a textcat "HEADLINE" label
    - Treat all text in between as "body" associated with the headline
    - Detect if the headline references a quarter and normalize it into structured data
        - e.g. "second quarter of 2018" -> {"q": 2, "year": 2018} via a custom attribute

```python
from spacy.tokens import Token

# Register the extensions token._.headline and token._.year
Token.set_extension('headline', default=None)
Token.set_extension('year', default=None)

doc = nlp("This is a headline. This is some text.")
headline = doc[0:5]  # Span containing tokens 0-4

for token in doc[5:10]:  # rest of the text
    token._.headline = headline
    # Set structured data on the token; it could come from
    # a function that parses the headline text
    # (get_year_from_headline is not defined in these notes)
    token._.year = get_year_from_headline(headline)
```

> Any token in the doc, for example one inside a "MONEY" entity, can now check its `._.year` to see whether its headline links it to a year.
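
The normalization step from the guide ("second quarter of 2018" -> `{"q": 2, "year": 2018}`) can be sketched with a regular expression. `parse_quarter` is a hypothetical helper, and the pattern only covers the two phrasings mentioned in these notes ("Q2 2018" and "second quarter of 2018"); real headlines would likely need more variants.

```python
import re

ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4}

QUARTER_RE = re.compile(
    r"\b(?:q(?P<num>[1-4])|(?P<word>first|second|third|fourth)\s+quarter)"
    r"(?:\s+of)?\s+(?P<year>\d{4})\b",
    re.IGNORECASE,
)


def parse_quarter(text):
    """Return {"q": int, "year": int} for the first quarter reference, else None."""
    match = QUARTER_RE.search(text)
    if match is None:
        return None
    if match.group("num"):
        quarter = int(match.group("num"))
    else:
        quarter = ORDINALS[match.group("word").lower()]
    return {"q": quarter, "year": int(match.group("year"))}


for headline_text in [
    "Results for the second quarter of 2018",
    "Q2 2018 highlights",
    "Annual report",
]:
    print(headline_text, "->", parse_quarter(headline_text))
```

The returned dictionary could then be stored on a custom attribute, in the same way `._.year` is set in the snippet above.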

# Extra EcoSystem Lib

### ADAM - Wikipedia Q&A

```bash
git clone https://github.com/5hirish/adam_qas.git
cd adam_qas
pip install -r requirements.txt
python -m qas.adam 'When was linux kernel version 4.0 released ?'
```

### AllenNLP

- use to develop pipeline components adding annotations to `Doc`

### ExcelCy - Excel Integration with spaCy. Training NER using XLSX from PDF, DOCX, PPT, PNG or JPG.

```python
from excelcy import ExcelCy
# collect sentences, annotate Entities and train NER using spaCy
excelcy = ExcelCy.execute(file_path='https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx')
# use the nlp object as per spaCy API
doc = excelcy.nlp('Google rebrands its business apps')
# or save it for faster bootstrap for application
excelcy.nlp.to_disk('/model')
```

### explacy - visualise spaCy parse

```python
import spacy
import explacy

nlp = spacy.load('en')
explacy.print_parse_info(nlp, 'The salad was surprisingly tasty.')
```

### spacy_hunspell - spell checker

```python
import spacy
from spacy_hunspell import spaCyHunSpell

nlp = spacy.load('en_core_web_sm')
hunspell = spaCyHunSpell(nlp, 'mac')
nlp.add_pipe(hunspell)
doc = nlp('I can haz cheezeburger.')
haz = doc[2]
haz._.hunspell_spell  # False
haz._.hunspell_suggest  # ['ha', 'haze', 'hazy', 'has', 'hat', 'had', 'hag', 'ham', 'hap', 'hay', 'haw', 'ha z']
```

### spacy-lookup 
- powerful NER matcher for large dictionaries using FlashText module
-  The extension sets the custom `Doc, Token and Span` attributes `._.is_entity, ._.entity_type, ._.has_entities and ._.entities.` Named Entities are matched using the python module flashtext, and looked up in the data provided by different dictionaries.

```python
import spacy
from spacy_lookup import Entity

nlp = spacy.load('en')
entity = Entity(keywords_list=['python', 'java platform'])
nlp.add_pipe(entity, last=True)

doc = nlp(u"I am a product manager for a java and python.")
assert doc._.has_entities == True
assert doc[2:5]._.has_entities == True
assert doc[0]._.is_entity == False
assert doc[3]._.is_entity == True
print(doc._.entities)
```

### spacy-vis using Hierplane

- local installation https://github.com/DeNeutoy/spacy-vis

```bash
docker run -p 8080:8080 -it markn/spacy-vis bash bin/serve
```

### textpipe - clean and extract metadata

```python
from textpipe import doc, pipeline
sample_text = 'Sample text! <!DOCTYPE>'
document = doc.Doc(sample_text)
print(document.clean)
# 'Sample text!'
print(document.language)
# 'en'
print(document.nwords)
# 2

pipe = pipeline.Pipeline(['CleanText', 'NWords'])
print(pipe(sample_text))
# {'CleanText': 'Sample text!', 'NWords': 2}
```



# KEY SOURCE CODE - spacy_pipe.pyx

```python
# cython: infer_types=True
# cython: profile=True
# coding: utf8
from __future__ import unicode_literals

cimport numpy as np

import numpy
import srsly
from collections import OrderedDict
from thinc.api import chain
from thinc.v2v import Affine, Maxout, Softmax
from thinc.misc import LayerNorm
from thinc.neural.util import to_categorical, copy_array

from ..tokens.doc cimport Doc
from ..syntax.nn_parser cimport Parser
from ..syntax.ner cimport BiluoPushDown
from ..syntax.arc_eager cimport ArcEager
from ..morphology cimport Morphology
from ..vocab cimport Vocab

from ..syntax import nonproj
from ..attrs import POS, ID
from ..parts_of_speech import X
from .._ml import Tok2Vec, build_tagger_model
from .._ml import build_text_classifier, build_simple_cnn_text_classifier
from .._ml import build_bow_text_classifier
from .._ml import link_vectors_to_models, zero_init, flatten
from .._ml import masked_language_model, create_default_optimizer
from ..errors import Errors, TempErrors
from .. import util


def _load_cfg(path):
    if path.exists():
        return srsly.read_json(path)
    else:
        return {}


class Pipe(object):
    """This class is not instantiated directly. Components inherit from it, and
    it defines the interface that components should follow to function as
    components in a spaCy analysis pipeline.
    """

    name = None

    @classmethod
    def Model(cls, *shape, **kwargs):
        """Initialize a model for the pipe."""
        raise NotImplementedError

    def __init__(self, vocab, model=True, **cfg):
        """Create a new pipe instance."""
        raise NotImplementedError

    def __call__(self, doc):
        """Apply the pipe to one document. The document is
        modified in-place, and returned.

        Both __call__ and pipe should delegate to the `predict()`
        and `set_annotations()` methods.
        """
        self.require_model()
        scores, tensors = self.predict([doc])
        self.set_annotations([doc], scores, tensors=tensors)
        return doc

    def require_model(self):
        """Raise an error if the component's model is not initialized."""
        if getattr(self, "model", None) in (None, True, False):
            raise ValueError(Errors.E109.format(name=self.name))

    def pipe(self, stream, batch_size=128, n_threads=-1):
        """Apply the pipe to a stream of documents.

        Both __call__ and pipe should delegate to the `predict()`
        and `set_annotations()` methods.
        """
        for docs in util.minibatch(stream, size=batch_size):
            docs = list(docs)
            scores, tensors = self.predict(docs)
            self.set_annotations(docs, scores, tensors=tensors)
            yield from docs

    def predict(self, docs):
        """Apply the pipeline's model to a batch of docs, without
        modifying them.
        """
        self.require_model()
        raise NotImplementedError

    def set_annotations(self, docs, scores, tensors=None):
        """Modify a batch of documents, using pre-computed scores."""
        raise NotImplementedError

    def update(self, docs, golds, drop=0.0, sgd=None, losses=None):
        """Learn from a batch of documents and gold-standard information,
        updating the pipe's model.

        Delegates to predict() and get_loss().
        """
        self.require_model()
        raise NotImplementedError

    def rehearse(self, docs, sgd=None, losses=None, **config):
        pass

    def get_loss(self, docs, golds, scores):
        """Find the loss and gradient of loss for the batch of
        documents and their predicted scores."""
        raise NotImplementedError

    def add_label(self, label):
        """Add an output label, to be predicted by the model.

        It's possible to extend pre-trained models with new labels,
        but care should be taken to avoid the "catastrophic forgetting"
        problem.
        """
        raise NotImplementedError

    def create_optimizer(self):
        return create_default_optimizer(self.model.ops, **self.cfg.get("optimizer", {}))

    def begin_training(
        self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs
    ):
        """Initialize the pipe for training, using data examples if available.
        If no model has been initialized yet, the model is added."""
        if self.model is True:
            self.model = self.Model(**self.cfg)
        link_vectors_to_models(self.vocab)
        if sgd is None:
            sgd = self.create_optimizer()
        return sgd

    def use_params(self, params):
        """Modify the pipe's model, to use the given parameter values."""
        with self.model.use_params(params):
            yield

    def to_bytes(self, exclude=tuple(), **kwargs):
        """Serialize the pipe to a bytestring.

        exclude (list): String names of serialization fields to exclude.
        RETURNS (bytes): The serialized object.
        """
        serialize = OrderedDict()
        serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
        if self.model not in (True, False, None):
            serialize["model"] = self.model.to_bytes
        serialize["vocab"] = self.vocab.to_bytes
        exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
        return util.to_bytes(serialize, exclude)

    def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
        """Load the pipe from a bytestring."""

        def load_model(b):
            # TODO: Remove this once we don't have to handle previous models
            if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
                self.cfg["pretrained_vectors"] = self.vocab.vectors.name
            if self.model is True:
                self.model = self.Model(**self.cfg)
            self.model.from_bytes(b)

        deserialize = OrderedDict()
        deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
        deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
        deserialize["model"] = load_model
        exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
        util.from_bytes(bytes_data, deserialize, exclude)
        return self

    def to_disk(self, path, exclude=tuple(), **kwargs):
        """Serialize the pipe to disk."""
        serialize = OrderedDict()
        serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
        serialize["vocab"] = lambda p: self.vocab.to_disk(p)
        if self.model not in (None, True, False):
            serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
        exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
        util.to_disk(path, serialize, exclude)

    def from_disk(self, path, exclude=tuple(), **kwargs):
        """Load the pipe from disk."""

        def load_model(p):
            # TODO: Remove this once we don't have to handle previous models
            if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
                self.cfg["pretrained_vectors"] = self.vocab.vectors.name
            if self.model is True:
                self.model = self.Model(**self.cfg)
            self.model.from_bytes(p.open("rb").read())

        deserialize = OrderedDict()
        deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
        deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
        deserialize["model"] = load_model
        exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
        util.from_disk(path, deserialize, exclude)
        return self


class Tensorizer(Pipe):
    """Pre-train position-sensitive vectors for tokens."""

    name = "tensorizer"

    @classmethod
    def Model(cls, output_size=300, **cfg):
        """Create a new statistical model for the class.

        width (int): Output size of the model.
        embed_size (int): Number of vectors in the embedding table.
        **cfg: Config parameters.
        RETURNS (Model): A `thinc.neural.Model` or similar instance.
        """
        input_size = util.env_opt("token_vector_width", cfg.get("input_size", 96))
        return zero_init(Affine(output_size, input_size, drop_factor=0.0))

    def __init__(self, vocab, model=True, **cfg):
        """Construct a new statistical model. Weights are not allocated on
        initialisation.

        vocab (Vocab): A `Vocab` instance. The model must share the same
            `Vocab` instance with the `Doc` objects it will process.
        model (Model): A `Model` instance or `True` allocate one later.
        **cfg: Config parameters.

        EXAMPLE:
            >>> from spacy.pipeline import TokenVectorEncoder
            >>> tok2vec = TokenVectorEncoder(nlp.vocab)
            >>> tok2vec.model = tok2vec.Model(128, 5000)
        """
        self.vocab = vocab
        self.model = model
        self.input_models = []
        self.cfg = dict(cfg)
        self.cfg.setdefault("cnn_maxout_pieces", 3)

    def __call__(self, doc):
        """Add context-sensitive vectors to a `Doc`, e.g. from a CNN or LSTM
        model. Vectors are set to the `Doc.tensor` attribute.

        docs (Doc or iterable): One or more documents to add vectors to.
        RETURNS (dict or None): Intermediate computations.
        """
        tokvecses = self.predict([doc])
        self.set_annotations([doc], tokvecses)
        return doc

    def pipe(self, stream, batch_size=128, n_threads=-1):
        """Process `Doc` objects as a stream.

        stream (iterator): A sequence of `Doc` objects to process.
        batch_size (int): Number of `Doc` objects to group.
        YIELDS (iterator): A sequence of `Doc` objects, in order of input.
        """
        for docs in util.minibatch(stream, size=batch_size):
            docs = list(docs)
            tensors = self.predict(docs)
            self.set_annotations(docs, tensors)
            yield from docs

    def predict(self, docs):
        """Return a single tensor for a batch of documents.

        docs (iterable): A sequence of `Doc` objects.
        RETURNS (object): Vector representations for each token in the docs.
        """
        self.require_model()
        inputs = self.model.ops.flatten([doc.tensor for doc in docs])
        outputs = self.model(inputs)
        return self.model.ops.unflatten(outputs, [len(d) for d in docs])

    def set_annotations(self, docs, tensors):
        """Set the tensor attribute for a batch of documents.

        docs (iterable): A sequence of `Doc` objects.
        tensors (object): Vector representation for each token in the docs.
        """
        for doc, tensor in zip(docs, tensors):
            if tensor.shape[0] != len(doc):
                raise ValueError(Errors.E076.format(rows=tensor.shape[0], words=len(doc)))
            doc.tensor = tensor

    def update(self, docs, golds, state=None, drop=0.0, sgd=None, losses=None):
        """Update the model.

        docs (iterable): A batch of `Doc` objects.
        golds (iterable): A batch of `GoldParse` objects.
        drop (float): The dropout rate.
        sgd (callable): An optimizer.
        RETURNS (dict): Results from the update.
        """
        self.require_model()
        if isinstance(docs, Doc):
            docs = [docs]
        inputs = []
        bp_inputs = []
        for tok2vec in self.input_models:
            tensor, bp_tensor = tok2vec.begin_update(docs, drop=drop)
            inputs.append(tensor)
            bp_inputs.append(bp_tensor)
        inputs = self.model.ops.xp.hstack(inputs)
        scores, bp_scores = self.model.begin_update(inputs, drop=drop)
        loss, d_scores = self.get_loss(docs, golds, scores)
        d_inputs = bp_scores(d_scores, sgd=sgd)
        d_inputs = self.model.ops.xp.split(d_inputs, len(self.input_models), axis=1)
        for d_input, bp_input in zip(d_inputs, bp_inputs):
            bp_input(d_input, sgd=sgd)
        if losses is not None:
            losses.setdefault(self.name, 0.0)
            losses[self.name] += loss
        return loss

    def get_loss(self, docs, golds, prediction):
        ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs])
        target = self.vocab.vectors.data[ids]
        d_scores = (prediction - target) / prediction.shape[0]
        loss = (d_scores ** 2).sum()
        return loss, d_scores

    def begin_training(self, gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
        """Allocate models, pre-process training data and acquire an
        optimizer.

        gold_tuples (iterable): Gold-standard training data.
        pipeline (list): The pipeline the model is part of.
        """
        if pipeline is not None:
            for name, model in pipeline:
                if getattr(model, "tok2vec", None):
                    self.input_models.append(model.tok2vec)
        if self.model is True:
            self.model = self.Model(**self.cfg)
        link_vectors_to_models(self.vocab)
        if sgd is None:
            sgd = self.create_optimizer()
        return sgd


class Tagger(Pipe):
    """Pipeline component for part-of-speech tagging.

    DOCS: https://spacy.io/api/tagger
    """

    name = "tagger"

    def __init__(self, vocab, model=True, **cfg):
        self.vocab = vocab
        self.model = model
        self._rehearsal_model = None
        self.cfg = OrderedDict(sorted(cfg.items()))
        self.cfg.setdefault("cnn_maxout_pieces", 2)

    @property
    def labels(self):
        return tuple(self.vocab.morphology.tag_names)

    @property
    def tok2vec(self):
        if self.model in (None, True, False):
            return None
        else:
            return chain(self.model.tok2vec, flatten)

    def __call__(self, doc):
        tags, tokvecs = self.predict([doc])
        self.set_annotations([doc], tags, tensors=tokvecs)
        return doc

    def pipe(self, stream, batch_size=128, n_threads=-1):
        for docs in util.minibatch(stream, size=batch_size):
            docs = list(docs)
            tag_ids, tokvecs = self.predict(docs)
            self.set_annotations(docs, tag_ids, tensors=tokvecs)
            yield from docs

    def predict(self, docs):
        self.require_model()
        if not any(len(doc) for doc in docs):
            # Handle case where there are no tokens in any docs.
            n_labels = len(self.labels)
            guesses = [self.model.ops.allocate((0, n_labels)) for doc in docs]
            tokvecs = self.model.ops.allocate((0, self.model.tok2vec.nO))
            return guesses, tokvecs
        tokvecs = self.model.tok2vec(docs)
        scores = self.model.softmax(tokvecs)
        guesses = []
        for doc_scores in scores:
            doc_guesses = doc_scores.argmax(axis=1)
            if not isinstance(doc_guesses, numpy.ndarray):
                doc_guesses = doc_guesses.get()
            guesses.append(doc_guesses)
        return guesses, tokvecs

    def set_annotations(self, docs, batch_tag_ids, tensors=None):
        if isinstance(docs, Doc):
            docs = [docs]
        cdef Doc doc
        cdef int idx = 0
        cdef Vocab vocab = self.vocab
        for i, doc in enumerate(docs):
            doc_tag_ids = batch_tag_ids[i]
            if hasattr(doc_tag_ids, "get"):
                doc_tag_ids = doc_tag_ids.get()
            for j, tag_id in enumerate(doc_tag_ids):
                # Don't clobber preset POS tags
                if doc.c[j].tag == 0 and doc.c[j].pos == 0:
                    # Don't clobber preset lemmas
                    lemma = doc.c[j].lemma
                    vocab.morphology.assign_tag_id(&doc.c[j], tag_id)
                    if lemma != 0 and lemma != doc.c[j].lex.orth:
                        doc.c[j].lemma = lemma
                idx += 1
            if tensors is not None and len(tensors):
                if isinstance(doc.tensor, numpy.ndarray) \
                and not isinstance(tensors[i], numpy.ndarray):
                    doc.extend_tensor(tensors[i].get())
                else:
                    doc.extend_tensor(tensors[i])
            doc.is_tagged = True

    def update(self, docs, golds, drop=0., sgd=None, losses=None):
        self.require_model()
        if losses is not None and self.name not in losses:
            losses[self.name] = 0.

        tag_scores, bp_tag_scores = self.model.begin_update(docs, drop=drop)
        loss, d_tag_scores = self.get_loss(docs, golds, tag_scores)
        bp_tag_scores(d_tag_scores, sgd=sgd)

        if losses is not None:
            losses[self.name] += loss

    def rehearse(self, docs, drop=0., sgd=None, losses=None):
        """Perform a 'rehearsal' update, where we try to match the output of
        an initial model.
        """
        if self._rehearsal_model is None:
            return
        guesses, backprop = self.model.begin_update(docs, drop=drop)
        target = self._rehearsal_model(docs)
        gradient = guesses - target
        backprop(gradient, sgd=sgd)
        if losses is not None:
            losses.setdefault(self.name, 0.0)
            losses[self.name] += (gradient**2).sum()

    def get_loss(self, docs, golds, scores):
        scores = self.model.ops.flatten(scores)
        tag_index = {tag: i for i, tag in enumerate(self.labels)}
        cdef int idx = 0
        correct = numpy.zeros((scores.shape[0],), dtype="i")
        guesses = scores.argmax(axis=1)
        known_labels = numpy.ones((scores.shape[0], 1), dtype="f")
        for gold in golds:
            for tag in gold.tags:
                if tag is None:
                    correct[idx] = guesses[idx]
                elif tag in tag_index:
                    correct[idx] = tag_index[tag]
                else:
                    correct[idx] = 0
                    known_labels[idx] = 0.
                idx += 1
        correct = self.model.ops.xp.array(correct, dtype="i")
        d_scores = scores - to_categorical(correct, nb_classes=scores.shape[1])
        d_scores *= self.model.ops.asarray(known_labels)
        loss = (d_scores**2).sum()
        d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs])
        return float(loss), d_scores

    def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None,
                       **kwargs):
        orig_tag_map = dict(self.vocab.morphology.tag_map)
        new_tag_map = OrderedDict()
        for raw_text, annots_brackets in get_gold_tuples():
            for annots, brackets in annots_brackets:
                ids, words, tags, heads, deps, ents = annots
                for tag in tags:
                    if tag in orig_tag_map:
                        new_tag_map[tag] = orig_tag_map[tag]
                    else:
                        new_tag_map[tag] = {POS: X}
        cdef Vocab vocab = self.vocab
        if new_tag_map:
            vocab.morphology = Morphology(vocab.strings, new_tag_map,
                                          vocab.morphology.lemmatizer,
                                          exc=vocab.morphology.exc)
        self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
        if self.model is True:
            for hp in ["token_vector_width", "conv_depth"]:
                if hp in kwargs:
                    self.cfg[hp] = kwargs[hp]
            self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
        link_vectors_to_models(self.vocab)
        if sgd is None:
            sgd = self.create_optimizer()
        return sgd

    @classmethod
    def Model(cls, n_tags, **cfg):
        if cfg.get("pretrained_dims") and not cfg.get("pretrained_vectors"):
            raise ValueError(TempErrors.T008)
        return build_tagger_model(n_tags, **cfg)

    def add_label(self, label, values=None):
        if label in self.labels:
            return 0
        if self.model not in (True, False, None):
            # Here's how the model resizing will work, once the
            # neuron-to-tag mapping is no longer controlled by
            # the Morphology class, which sorts the tag names.
            # The sorting makes adding labels difficult.
            # smaller = self.model._layers[-1]
            # larger = Softmax(len(self.labels)+1, smaller.nI)
            # copy_array(larger.W[:smaller.nO], smaller.W)
            # copy_array(larger.b[:smaller.nO], smaller.b)
            # self.model._layers[-1] = larger
            raise ValueError(TempErrors.T003)
        tag_map = dict(self.vocab.morphology.tag_map)
        if values is None:
            values = {POS: "X"}
        tag_map[label] = values
        self.vocab.morphology = Morphology(
            self.vocab.strings, tag_map=tag_map,
            lemmatizer=self.vocab.morphology.lemmatizer,
            exc=self.vocab.morphology.exc)
        return 1

    def use_params(self, params):
        with self.model.use_params(params):
            yield

    def to_bytes(self, exclude=tuple(), **kwargs):
        serialize = OrderedDict()
        if self.model not in (None, True, False):
            serialize["model"] = self.model.to_bytes
        serialize["vocab"] = self.vocab.to_bytes
        serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
        tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
        serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map)
        exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
        return util.to_bytes(serialize, exclude)

    def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
        def load_model(b):
            # TODO: Remove this once we don't have to handle previous models
            if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
                self.cfg["pretrained_vectors"] = self.vocab.vectors.name
            if self.model is True:
                token_vector_width = util.env_opt(
                    "token_vector_width",
                    self.cfg.get("token_vector_width", 96))
                self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
            self.model.from_bytes(b)

        def load_tag_map(b):
            tag_map = srsly.msgpack_loads(b)
            self.vocab.morphology = Morphology(
                self.vocab.strings, tag_map=tag_map,
                lemmatizer=self.vocab.morphology.lemmatizer,
                exc=self.vocab.morphology.exc)

        deserialize = OrderedDict((
            ("vocab", lambda b: self.vocab.from_bytes(b)),
            ("tag_map", load_tag_map),
            ("cfg", lambda b: self.cfg.update(srsly.json_loads(b))),
            ("model", lambda b: load_model(b)),
        ))
        exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
        util.from_bytes(bytes_data, deserialize, exclude)
        return self

    def to_disk(self, path, exclude=tuple(), **kwargs):
        tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
        serialize = OrderedDict((
            ("vocab", lambda p: self.vocab.to_disk(p)),
            ("tag_map", lambda p: srsly.write_msgpack(p, tag_map)),
            ("model", lambda p: p.open("wb").write(self.model.to_bytes())),
            ("cfg", lambda p: srsly.write_json(p, self.cfg))
        ))
        exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
        util.to_disk(path, serialize, exclude)

    def from_disk(self, path, exclude=tuple(), **kwargs):
        def load_model(p):
            # TODO: Remove this once we don't have to handle previous models
            if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
                self.cfg["pretrained_vectors"] = self.vocab.vectors.name
            if self.model is True:
                self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
            with p.open("rb") as file_:
                self.model.from_bytes(file_.read())

        def load_tag_map(p):
            tag_map = srsly.read_msgpack(p)
            self.vocab.morphology = Morphology(
                self.vocab.strings, tag_map=tag_map,
                lemmatizer=self.vocab.morphology.lemmatizer,
                exc=self.vocab.morphology.exc)

        deserialize = OrderedDict((
            ("cfg", lambda p: self.cfg.update(_load_cfg(p))),
            ("vocab", lambda p: self.vocab.from_disk(p)),
            ("tag_map", load_tag_map),
            ("model", load_model),
        ))
        exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
        util.from_disk(path, deserialize, exclude)
        return self


class MultitaskObjective(Tagger):
    """Experimental: Assist training of a parser or tagger, by training a
    side-objective.
    """

    name = "nn_labeller"

    def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg):
        self.vocab = vocab
        self.model = model
        if target == "dep":
            self.make_label = self.make_dep
        elif target == "tag":
            self.make_label = self.make_tag
        elif target == "ent":
            self.make_label = self.make_ent
        elif target == "dep_tag_offset":
            self.make_label = self.make_dep_tag_offset
        elif target == "ent_tag":
            self.make_label = self.make_ent_tag
        elif target == "sent_start":
            self.make_label = self.make_sent_start
        elif hasattr(target, "__call__"):
            self.make_label = target
        else:
            raise ValueError(Errors.E016)
        self.cfg = dict(cfg)
        self.cfg.setdefault("cnn_maxout_pieces", 2)

    @property
    def labels(self):
        return self.cfg.setdefault("labels", {})

    @labels.setter
    def labels(self, value):
        self.cfg["labels"] = value

    def set_annotations(self, docs, dep_ids, tensors=None):
        pass

    def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, tok2vec=None,
                       sgd=None, **kwargs):
        gold_tuples = nonproj.preprocess_training_data(get_gold_tuples())
        for raw_text, annots_brackets in gold_tuples:
            for annots, brackets in annots_brackets:
                ids, words, tags, heads, deps, ents = annots
                for i in range(len(ids)):
                    label = self.make_label(i, words, tags, heads, deps, ents)
                    if label is not None and label not in self.labels:
                        self.labels[label] = len(self.labels)
        if self.model is True:
            token_vector_width = util.env_opt("token_vector_width")
            self.model = self.Model(len(self.labels), tok2vec=tok2vec)
        link_vectors_to_models(self.vocab)
        if sgd is None:
            sgd = self.create_optimizer()
        return sgd

    @classmethod
    def Model(cls, n_tags, tok2vec=None, **cfg):
        token_vector_width = util.env_opt("token_vector_width", 96)
        softmax = Softmax(n_tags, token_vector_width*2)
        model = chain(
            tok2vec,
            LayerNorm(Maxout(token_vector_width*2, token_vector_width, pieces=3)),
            softmax
        )
        model.tok2vec = tok2vec
        model.softmax = softmax
        return model

    def predict(self, docs):
        self.require_model()
        tokvecs = self.model.tok2vec(docs)
        scores = self.model.softmax(tokvecs)
        return tokvecs, scores

    def get_loss(self, docs, golds, scores):
        if len(docs) != len(golds):
            raise ValueError(Errors.E077.format(value="loss", n_docs=len(docs),
                                                n_golds=len(golds)))
        cdef int idx = 0
        correct = numpy.zeros((scores.shape[0],), dtype="i")
        guesses = scores.argmax(axis=1)
        for i, gold in enumerate(golds):
            for j in range(len(docs[i])):
                # Handles alignment for tokenization differences
                label = self.make_label(j, gold.words, gold.tags,
                                        gold.heads, gold.labels, gold.ents)
                if label is None or label not in self.labels:
                    correct[idx] = guesses[idx]
                else:
                    correct[idx] = self.labels[label]
                idx += 1
        correct = self.model.ops.xp.array(correct, dtype="i")
        d_scores = scores - to_categorical(correct, nb_classes=scores.shape[1])
        loss = (d_scores**2).sum()
        return float(loss), d_scores

    @staticmethod
    def make_dep(i, words, tags, heads, deps, ents):
        if deps[i] is None or heads[i] is None:
            return None
        return deps[i]

    @staticmethod
    def make_tag(i, words, tags, heads, deps, ents):
        return tags[i]

    @staticmethod
    def make_ent(i, words, tags, heads, deps, ents):
        if ents is None:
            return None
        return ents[i]

    @staticmethod
    def make_dep_tag_offset(i, words, tags, heads, deps, ents):
        if deps[i] is None or heads[i] is None:
            return None
        offset = heads[i] - i
        offset = min(offset, 2)
        offset = max(offset, -2)
        return "%s-%s:%d" % (deps[i], tags[i], offset)

    @staticmethod
    def make_ent_tag(i, words, tags, heads, deps, ents):
        if ents is None or ents[i] is None:
            return None
        else:
            return "%s-%s" % (tags[i], ents[i])

    @staticmethod
    def make_sent_start(target, words, tags, heads, deps, ents, cache=True, _cache={}):
        """A multi-task objective for representing sentence boundaries,
        using BILU scheme. (O is impossible)

        The implementation of this method uses an internal cache that relies
        on the identity of the heads array, to avoid requiring a new piece
        of gold data. You can pass cache=False if you know the cache will
        do the wrong thing.
        """
        assert len(words) == len(heads)
        assert target < len(words), (target, len(words))
        if cache:
            if id(heads) in _cache:
                return _cache[id(heads)][target]
            else:
                for key in list(_cache.keys()):
                    _cache.pop(key)
            sent_tags = ["I-SENT"] * len(words)
            _cache[id(heads)] = sent_tags
        else:
            sent_tags = ["I-SENT"] * len(words)

        def _find_root(child):
            seen = set([child])
            while child is not None and heads[child] != child:
                seen.add(child)
                child = heads[child]
            return child

        sentences = {}
        for i in range(len(words)):
            root = _find_root(i)
            if root is None:
                sent_tags[i] = None
            else:
                sentences.setdefault(root, []).append(i)
        for root, span in sorted(sentences.items()):
            if len(span) == 1:
                sent_tags[span[0]] = "U-SENT"
            else:
                sent_tags[span[0]] = "B-SENT"
                sent_tags[span[-1]] = "L-SENT"
        return sent_tags[target]


class ClozeMultitask(Pipe):
    @classmethod
    def Model(cls, vocab, tok2vec, **cfg):
        output_size = vocab.vectors.data.shape[1]
        output_layer = chain(
            LayerNorm(Maxout(output_size, tok2vec.nO, pieces=3)),
            zero_init(Affine(output_size, output_size, drop_factor=0.0))
        )
        model = chain(tok2vec, output_layer)
        model = masked_language_model(vocab, model)
        model.tok2vec = tok2vec
        model.output_layer = output_layer
        return model

    def __init__(self, vocab, model=True, **cfg):
        self.vocab = vocab
        self.model = model
        self.cfg = cfg

    def set_annotations(self, docs, dep_ids, tensors=None):
        pass

    def begin_training(self, get_gold_tuples=lambda: [], pipeline=None,
                        tok2vec=None, sgd=None, **kwargs):
        link_vectors_to_models(self.vocab)
        if self.model is True:
            self.model = self.Model(self.vocab, tok2vec)
        X = self.model.ops.allocate((5, self.model.tok2vec.nO))
        self.model.output_layer.begin_training(X)
        if sgd is None:
            sgd = self.create_optimizer()
        return sgd

    def predict(self, docs):
        self.require_model()
        tokvecs = self.model.tok2vec(docs)
        vectors = self.model.output_layer(tokvecs)
        return tokvecs, vectors

    def get_loss(self, docs, vectors, prediction):
        # The simplest way to implement this would be to vstack the
        # token.vector values, but that's a bit inefficient, especially on GPU.
        # Instead we fetch the index into the vectors table for each of our tokens,
        # and look them up all at once. This prevents data copying.
        ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs])
        target = vectors[ids]
        gradient = (prediction - target) / prediction.shape[0]
        loss = (gradient**2).sum()
        return float(loss), gradient

    def update(self, docs, golds, drop=0., sgd=None, losses=None):
        pass

    def rehearse(self, docs, drop=0., sgd=None, losses=None):
        self.require_model()
        if losses is not None and self.name not in losses:
            losses[self.name] = 0.
        predictions, bp_predictions = self.model.begin_update(docs, drop=drop)
        loss, d_predictions = self.get_loss(docs, self.vocab.vectors.data, predictions)
        bp_predictions(d_predictions, sgd=sgd)

        if losses is not None:
            losses[self.name] += loss


class TextCategorizer(Pipe):
    """Pipeline component for text classification.

    DOCS: https://spacy.io/api/textcategorizer
    """
    name = 'textcat'

    @classmethod
    def Model(cls, nr_class=1, **cfg):
        embed_size = util.env_opt("embed_size", 2000)
        if "token_vector_width" in cfg:
            token_vector_width = cfg["token_vector_width"]
        else:
            token_vector_width = util.env_opt("token_vector_width", 96)
        if cfg.get("architecture") == "simple_cnn":
            tok2vec = Tok2Vec(token_vector_width, embed_size, **cfg)
            return build_simple_cnn_text_classifier(tok2vec, nr_class, **cfg)
        elif cfg.get("architecture") == "bow":
            return build_bow_text_classifier(nr_class, **cfg)
        else:
            return build_text_classifier(nr_class, **cfg)

    @property
    def tok2vec(self):
        if self.model in (None, True, False):
            return None
        else:
            return self.model.tok2vec

    def __init__(self, vocab, model=True, **cfg):
        self.vocab = vocab
        self.model = model
        self._rehearsal_model = None
        self.cfg = dict(cfg)

    @property
    def labels(self):
        return tuple(self.cfg.setdefault("labels", []))

    @labels.setter
    def labels(self, value):
        self.cfg["labels"] = tuple(value)

    def __call__(self, doc):
        scores, tensors = self.predict([doc])
        self.set_annotations([doc], scores, tensors=tensors)
        return doc

    def pipe(self, stream, batch_size=128, n_threads=-1):
        for docs in util.minibatch(stream, size=batch_size):
            docs = list(docs)
            scores, tensors = self.predict(docs)
            self.set_annotations(docs, scores, tensors=tensors)
            yield from docs

    def predict(self, docs):
        self.require_model()
        scores = self.model(docs)
        scores = self.model.ops.asarray(scores)
        tensors = [doc.tensor for doc in docs]
        return scores, tensors

    def set_annotations(self, docs, scores, tensors=None):
        for i, doc in enumerate(docs):
            for j, label in enumerate(self.labels):
                doc.cats[label] = float(scores[i, j])

    def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None):
        scores, bp_scores = self.model.begin_update(docs, drop=drop)
        loss, d_scores = self.get_loss(docs, golds, scores)
        bp_scores(d_scores, sgd=sgd)
        if losses is not None:
            losses.setdefault(self.name, 0.0)
            losses[self.name] += loss

    def rehearse(self, docs, drop=0., sgd=None, losses=None):
        if self._rehearsal_model is None:
            return
        scores, bp_scores = self.model.begin_update(docs, drop=drop)
        target = self._rehearsal_model(docs)
        gradient = scores - target
        bp_scores(gradient, sgd=sgd)
        if losses is not None:
            losses.setdefault(self.name, 0.0)
            losses[self.name] += (gradient**2).sum()

    def get_loss(self, docs, golds, scores):
        truths = numpy.zeros((len(golds), len(self.labels)), dtype="f")
        not_missing = numpy.ones((len(golds), len(self.labels)), dtype="f")
        for i, gold in enumerate(golds):
            for j, label in enumerate(self.labels):
                if label in gold.cats:
                    truths[i, j] = gold.cats[label]
                else:
                    not_missing[i, j] = 0.
        truths = self.model.ops.asarray(truths)
        not_missing = self.model.ops.asarray(not_missing)
        d_scores = (scores-truths) / scores.shape[0]
        d_scores *= not_missing
        mean_square_error = (d_scores**2).sum(axis=1).mean()
        return float(mean_square_error), d_scores

    def add_label(self, label):
        if label in self.labels:
            return 0
        if self.model not in (None, True, False):
            # This functionality was available previously, but was broken.
            # The problem is that we resize the last layer, but the last layer
            # is actually just an ensemble. We're not resizing the child layers
            # - a huge problem.
            raise ValueError(Errors.E116)
            # smaller = self.model._layers[-1]
            # larger = Affine(len(self.labels)+1, smaller.nI)
            # copy_array(larger.W[:smaller.nO], smaller.W)
            # copy_array(larger.b[:smaller.nO], smaller.b)
            # self.model._layers[-1] = larger
        self.labels = tuple(list(self.labels) + [label])
        return 1

    def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
        if self.model is True:
            self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
            self.model = self.Model(len(self.labels), **self.cfg)
            link_vectors_to_models(self.vocab)
        if sgd is None:
            sgd = self.create_optimizer()
        return sgd


cdef class DependencyParser(Parser):
    """Pipeline component for dependency parsing.

    DOCS: https://spacy.io/api/dependencyparser
    """

    name = "parser"
    TransitionSystem = ArcEager

    @property
    def postprocesses(self):
        return [nonproj.deprojectivize]

    def add_multitask_objective(self, target):
        if target == "cloze":
            cloze = ClozeMultitask(self.vocab)
            self._multitasks.append(cloze)
        else:
            labeller = MultitaskObjective(self.vocab, target=target)
            self._multitasks.append(labeller)

    def init_multitask_objectives(self, get_gold_tuples, pipeline, sgd=None, **cfg):
        for labeller in self._multitasks:
            tok2vec = self.model.tok2vec
            labeller.begin_training(get_gold_tuples, pipeline=pipeline,
                                    tok2vec=tok2vec, sgd=sgd)

    def __reduce__(self):
        return (DependencyParser, (self.vocab, self.moves, self.model), None, None)

    @property
    def labels(self):
        # Get the labels from the model by looking at the available moves
        return tuple(set(move.split("-")[1] for move in self.move_names))


cdef class EntityRecognizer(Parser):
    """Pipeline component for named entity recognition.

    DOCS: https://spacy.io/api/entityrecognizer
    """

    name = "ner"
    TransitionSystem = BiluoPushDown
    nr_feature = 6

    def add_multitask_objective(self, target):
        if target == "cloze":
            cloze = ClozeMultitask(self.vocab)
            self._multitasks.append(cloze)
        else:
            labeller = MultitaskObjective(self.vocab, target=target)
            self._multitasks.append(labeller)

    def init_multitask_objectives(self, get_gold_tuples, pipeline, sgd=None, **cfg):
        for labeller in self._multitasks:
            tok2vec = self.model.tok2vec
            labeller.begin_training(get_gold_tuples, pipeline=pipeline,
                                    tok2vec=tok2vec)

    def __reduce__(self):
        return (EntityRecognizer, (self.vocab, self.moves, self.model),
                None, None)

    @property
    def labels(self):
        # Get the labels from the model by looking at the available moves, e.g.
        # B-PERSON, I-PERSON, L-PERSON, U-PERSON
        return tuple(set(move.split("-")[1] for move in self.move_names
                if move[0] in ("B", "I", "L", "U")))


class EntityLinker(Pipe):
    name = 'entity_linker'

    @classmethod
    def Model(cls, nr_class=1, **cfg):
        # TODO: non-dummy EL implementation
        return None

    def __init__(self, model=True, **cfg):
        self.model = False
        self.cfg = dict(cfg)
        self.kb = self.cfg["kb"]

    def __call__(self, doc):
        self.set_annotations([doc], scores=None, tensors=None)
        return doc

    def pipe(self, stream, batch_size=128, n_threads=-1):
        """Apply the pipe to a stream of documents.
        Both __call__ and pipe should delegate to the `predict()`
        and `set_annotations()` methods.
        """
        for docs in util.minibatch(stream, size=batch_size):
            docs = list(docs)
            self.set_annotations(docs, scores=None, tensors=None)
            yield from docs

    def set_annotations(self, docs, scores, tensors=None):
        """
        Currently implemented as taking the KB entry with highest prior probability for each named entity
        TODO: actually use context etc
        """
        for i, doc in enumerate(docs):
            for ent in doc.ents:
                candidates = self.kb.get_candidates(ent.text)
                if candidates:
                    best_candidate = max(candidates, key=lambda c: c.prior_prob)
                    for token in ent:
                        token.ent_kb_id_ = best_candidate.entity_

    def get_loss(self, docs, golds, scores):
        # TODO
        pass

    def add_label(self, label):
        # TODO
        pass


class Sentencizer(object):
    """Segment the Doc into sentences using a rule-based strategy.

    DOCS: https://spacy.io/api/sentencizer
    """

    name = "sentencizer"
    default_punct_chars = [".", "!", "?"]

    def __init__(self, punct_chars=None, **kwargs):
        """Initialize the sentencizer.

        punct_chars (list): Punctuation characters to split on. Will be
            serialized with the nlp object.
        RETURNS (Sentencizer): The sentencizer component.

        DOCS: https://spacy.io/api/sentencizer#init
        """
        self.punct_chars = punct_chars or self.default_punct_chars

    def __call__(self, doc):
        """Apply the sentencizer to a Doc and set Token.is_sent_start.

        doc (Doc): The document to process.
        RETURNS (Doc): The processed Doc.

        DOCS: https://spacy.io/api/sentencizer#call
        """
        start = 0
        seen_period = False
        for i, token in enumerate(doc):
            is_in_punct_chars = token.text in self.punct_chars
            token.is_sent_start = i == 0
            if seen_period and not token.is_punct and not is_in_punct_chars:
                doc[start].is_sent_start = True
                start = token.i
                seen_period = False
            elif is_in_punct_chars:
                seen_period = True
        if start < len(doc):
            doc[start].is_sent_start = True
        return doc

    def to_bytes(self, **kwargs):
        """Serialize the sentencizer to a bytestring.

        RETURNS (bytes): The serialized object.

        DOCS: https://spacy.io/api/sentencizer#to_bytes
        """
        return srsly.msgpack_dumps({"punct_chars": self.punct_chars})

    def from_bytes(self, bytes_data, **kwargs):
        """Load the sentencizer from a bytestring.

        bytes_data (bytes): The data to load.
        RETURNS (Sentencizer): The loaded object.

        DOCS: https://spacy.io/api/sentencizer#from_bytes
        """
        cfg = srsly.msgpack_loads(bytes_data)
        self.punct_chars = cfg.get("punct_chars", self.default_punct_chars)
        return self

    def to_disk(self, path, exclude=tuple(), **kwargs):
        """Serialize the sentencizer to disk.

        DOCS: https://spacy.io/api/sentencizer#to_disk
        """
        path = util.ensure_path(path)
        path = path.with_suffix(".json")
        srsly.write_json(path, {"punct_chars": self.punct_chars})


    def from_disk(self, path, exclude=tuple(), **kwargs):
        """Load the sentencizer from disk.

        DOCS: https://spacy.io/api/sentencizer#from_disk
        """
        path = util.ensure_path(path)
        path = path.with_suffix(".json")
        cfg = srsly.read_json(path)
        self.punct_chars = cfg.get("punct_chars", self.default_punct_chars)
        return self

      
__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"]
```
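
The rule in `Sentencizer.__call__` above can be easier to follow on plain strings. The sketch below restates it without spaCy, purely as an illustration: `sentence_starts`, `split_sentences` and `is_punct` are hypothetical helpers, and `is_punct` is only a rough stand-in for `Token.is_punct` based on Unicode categories.

```python
import unicodedata


def is_punct(text):
    """Rough stand-in for Token.is_punct (an assumption, not spaCy's own check)."""
    return bool(text) and all(unicodedata.category(ch).startswith("P") for ch in text)


def sentence_starts(tokens, punct_chars=(".", "!", "?")):
    """Return one boolean per token: True where a sentence begins."""
    flags = [False] * len(tokens)
    start = 0
    seen_period = False
    for i, text in enumerate(tokens):
        in_punct_chars = text in punct_chars
        flags[i] = i == 0
        if seen_period and not is_punct(text) and not in_punct_chars:
            # First non-punctuation token after a sentence-final mark:
            # confirm the previous start and open a new sentence here.
            flags[start] = True
            start = i
            seen_period = False
        elif in_punct_chars:
            seen_period = True
    if start < len(tokens):
        flags[start] = True
    return flags


def split_sentences(tokens, punct_chars=(".", "!", "?")):
    """Group tokens into sentences using the flags from sentence_starts."""
    sentences = []
    for text, is_start in zip(tokens, sentence_starts(tokens, punct_chars)):
        if is_start:
            sentences.append([])
        sentences[-1].append(text)
    return sentences


tokens = ["Hello", "world", ".", "How", "are", "you", "?", "!"]
print(split_sentences(tokens))
```

Trailing punctuation such as the final `!` stays attached to the sentence it closes, because a new sentence only opens at the next non-punctuation token.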

# Build LANG

## TRAIN Language Model
https://spacy.io/usage/training

**Key demo (Norwegian language model creation): https://github.com/explosion/spaCy/issues/3082**

Flow of Training
- Creating a vocabulary file
  - spaCy expects that common words will be cached in a Vocab instance. The vocabulary caches lexical features. spaCy loads the vocabulary from binary data, in order to keep loading efficient. The easiest way to save out a new binary vocabulary file is to use the spacy init-model command, which expects a JSONL file with words and their lexical attributes. See the docs on the vocab JSONL format for details.
- Training the word vectors
  - Word2vec and related algorithms let you train useful word similarity models from unlabeled text. This is a key part of using deep learning for NLP with limited labeled data. The vectors are also useful by themselves: they power the `.similarity` methods in spaCy. For best results, you should pre-process the text with spaCy before training the Word2vec model. This ensures your tokenization will match. You can use spaCy's word vectors training script, which pre-processes the text with your language-specific tokenizer and trains the model using Gensim. The `vectors.bin` file should consist of one word and vector per line.
  - https://github.com/explosion/spacy/tree/master/bin/train_word_vectors.py
  - If you don't have a large sample of text available, you can also convert word vectors produced by a variety of other tools into spaCy's format. See the docs on converting word vectors for details.
- Creating or converting a training corpus
  - The easiest way to train spaCy's tagger, parser, entity recognizer or text categorizer is to use the `spacy train` command-line utility. To use it, you'll need training and evaluation data in the JSON format spaCy expects for training.
  - You can train the model using an annotated corpus for your language. If your data is in one of the supported formats, the easiest solution might be to use the `spacy convert` command-line utility. This supports several popular formats, including the IOB format for named entity recognition, the JSONL format produced by the Prodigy annotation tool, and the CoNLL-U format used by the Universal Dependencies corpus.
  - One thing to keep in mind is that spaCy expects to train its models from whole documents, not just single sentences. If your corpus only contains single sentences, spaCy's models will never learn to expect multi-sentence documents, leading to low performance on real text. To mitigate this problem, you can use the -N argument to the `spacy convert` command to merge some of the sentences into longer pseudo-documents.
- Training the tagger and parser
  - Once you have your training and evaluation data in the format spaCy expects, you can train your model using spaCy's `train` command. Note that training statistical models still involves a degree of trial and error. You may need to tune one or more settings, also called "hyper-parameters", to achieve optimal performance. See the usage guide on training for more details.
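
The idea behind merging sentences into pseudo-documents can be pictured with a small spaCy-free sketch. The helper name `group_sentences` and its `n_sents` parameter are hypothetical; this only illustrates the grouping that `-N` asks for, not how `spacy convert` implements it.

```python
def group_sentences(sentences, n_sents=10):
    """Join consecutive sentences into pseudo-documents of up to n_sents each."""
    if n_sents < 1:
        raise ValueError("n_sents must be at least 1")
    docs = []
    for start in range(0, len(sentences), n_sents):
        # The last group may be shorter than n_sents
        docs.append(" ".join(sentences[start:start + n_sents]))
    return docs


sentences = [
    "First sentence.",
    "Second sentence.",
    "Third sentence.",
    "Fourth sentence.",
    "Fifth sentence.",
]
pseudo_docs = group_sentences(sentences, n_sents=2)
print(pseudo_docs)
```

With five single-sentence inputs and `n_sents=2`, this yields three pseudo-documents, the last one holding the leftover sentence.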
  


1. From scratch 
2. Update on existing model


> Both can be preceded by **Pretrain**

### (1) From Scratch (CLI or Code)

**CLI method**
- Input
  - **Annotated data in one of several popular formats, including the IOB format for named entity recognition, the JSONL format produced by the annotation tool Prodigy, and the CoNLL-U format used by the Universal Dependencies corpus.**
  - `spacy convert` into spaCy JSON format
- Example:

```shell
git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora
mkdir ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-train.conllu ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-dev.conllu ancora-json
mkdir models
python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
```

**Simple code method (Preferred)**

> Instead of sequences of `Doc` and `GoldParse` objects, you can also use the "simple training style" and **pass raw texts and dictionaries of annotations to `nlp.update`.** The dictionaries can have the **keys `entities`, `heads`, `deps`, `tags` and `cats`.** This is generally recommended, as it removes one layer of abstraction, and avoids unnecessary imports. It also makes it easier to structure and load your training data.

- Example Annotations

```python
{
   "entities": [(0, 4, "ORG")],
   "heads": [1, 1, 1, 5, 5, 2, 7, 5],
   "deps": ["nsubj", "ROOT", "prt", "quantmod", "compound", "pobj", "det", "npadvmod"],
   "tags": ["PROPN", "VERB", "ADP", "SYM", "NUM", "NUM", "DET", "NOUN"],
   "cats": {"BUSINESS": 1.0},
}
```
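
Before training, it can help to sanity-check such a dictionary against its text. The sketch below is a hypothetical helper (`check_annotations` is not part of spaCy), and the token list is written by hand on the assumption that the tokenizer splits `$1` into `$` and `1`.

```python
def check_annotations(text, tokens, annotations):
    """Return a list of problems found in one simple-style annotation dict."""
    problems = []
    # Entity offsets are character positions into the raw text
    for start, end, label in annotations.get("entities", []):
        if not (0 <= start < end <= len(text)):
            problems.append(f"bad span ({start}, {end}, {label!r})")
    # Token-level keys need exactly one value per token
    for key in ("heads", "deps", "tags"):
        values = annotations.get(key)
        if values is not None and len(values) != len(tokens):
            problems.append(f"{key}: expected {len(tokens)} values, got {len(values)}")
    return problems


text = "Uber blew through $1 million a week"
tokens = ["Uber", "blew", "through", "$", "1", "million", "a", "week"]  # assumed tokenization
annotations = {
    "entities": [(0, 4, "ORG")],
    "heads": [1, 1, 1, 5, 5, 2, 7, 5],
    "deps": ["nsubj", "ROOT", "prt", "quantmod", "compound", "pobj", "det", "npadvmod"],
    "tags": ["PROPN", "VERB", "ADP", "SYM", "NUM", "NUM", "DET", "NOUN"],
    "cats": {"BUSINESS": 1.0},
}
print(text[0:4])
print(check_annotations(text, tokens, annotations))
```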

- Simple Training Loop

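```python
import random
import spacy

TRAIN_DATA = [
    (u"Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    (u"Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]

nlp = spacy.blank('en')
# A blank pipeline has no components; add an entity recognizer to train (spaCy v2 API)
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
ner.add_label('ORG')

optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)  # reshuffle the examples every epoch
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk("/model")
```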

> The above training loop leaves out a few details that can really improve accuracy, but the principle really is that simple. Once you've got your pipeline together and you want to tune the accuracy, you usually want to process your training examples in batches, and experiment with minibatch sizes and dropout rates, set via the `drop` keyword argument. See the `Language` and `Pipe` API docs for available options.
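
Batching with growing batch sizes can be sketched in plain Python. The two helpers below are simplified stand-ins written for this note (spaCy ships utilities of the same names in `spacy.util`, which are what you would normally use); the sizes 1.0, 4.0 and 2.0 are arbitrary illustration values.

```python
import random


def compounding(start, stop, factor):
    """Yield batch sizes that grow geometrically from start, capped at stop."""
    size = start
    while True:
        yield min(size, stop)
        size *= factor


def minibatch(items, sizes):
    """Split items into consecutive batches whose sizes come from the sizes iterator."""
    items = list(items)
    index = 0
    while index < len(items):
        size = max(1, int(next(sizes)))
        yield items[index:index + size]
        index += size


examples = [(f"text {i}", {"entities": []}) for i in range(10)]
random.seed(0)
random.shuffle(examples)
batches = list(minibatch(examples, compounding(1.0, 4.0, 2.0)))
print([len(batch) for batch in batches])  # -> [1, 2, 4, 3]

# Inside a real training loop each batch would then be unpacked and passed on, e.g.:
#     texts, annotations = zip(*batch)
#     nlp.update(texts, annotations, sgd=optimizer, drop=0.2)
```

Starting with small batches and growing them is a common choice because early updates are noisy anyway; the dropout value in the comment is only a placeholder to experiment with.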





#### NER



**(1) BUILT-IN ENTITY**

**Blank Model or Load Built-in**

**(2) CUSTOM ENTITY**

**Training an additional entity type**

> **In practice, you'll need many more; a few hundred would be a good start. You will also likely need to mix in examples of other entity types, which might be obtained by running the entity recognizer over unlabelled sentences, and adding their annotations to the training set.**
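
Turning the recognizer's own predictions into training pairs could look roughly like this. `doc_to_annotation` is a hypothetical helper; it relies only on the `text`, `ents`, `start_char`, `end_char` and `label_` attributes, so a stand-in object is used here to keep the sketch runnable without a model.

```python
from types import SimpleNamespace


def doc_to_annotation(doc):
    """Turn a processed doc into a (text, {"entities": [...]}) training pair."""
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    return (doc.text, {"entities": entities})


# Stand-in for a spaCy Doc; with a loaded model you would pass nlp(text) instead.
fake_doc = SimpleNamespace(
    text="Google rebrands its business apps",
    ents=[SimpleNamespace(start_char=0, end_char=6, label_="ORG")],
)
example = doc_to_annotation(fake_doc)
print(example)
```

Pairs produced this way can be appended to the hand-labelled examples so the model keeps seeing the entity types it already knows.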

In [7]:
clean_text = "In practice, you’ll need many more - a few hundred would be a good start. You will also likely need to mix in examples of other entity types, which might be obtained by running the entity recognizer over unlabelled sentences, and adding their annotations to the training set."

In [17]:
import spacy
sent_nlp = spacy.load('en_core_web_lg')

In [18]:
sent_doc = sent_nlp(clean_text)

In [19]:
list(sent_doc.sents)

In [20]:
sent_nlp.add_pipe(sent_nlp.create_pipe('sentencizer'), before='parser')

In [21]:
sent_nlp.pipe_names

['tagger', 'sentencizer', 'parser', 'ner']

In [22]:
sent_doc = sent_nlp(clean_text)

In [24]:
sent_text = [sent for sent in sent_doc.sents]

In [26]:
sent_text

[In practice, you’ll need many more - a few hundred would be a good start.,
 You will also likely need to mix in examples of other entity types, which might be obtained by running the entity recognizer over unlabelled sentences, and adding their annotations to the training set.]

In [34]:
sent_text[0].text.find('good') + len('good')

66
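
The same `find` arithmetic can be wrapped in a small helper that returns a ready-made entity tuple. `find_span` and the label `QUALITY` are made up for illustration; the sentence is retyped with ASCII punctuation, which keeps the character offsets the same.

```python
def find_span(text, phrase, label, start_at=0):
    """Return (start, end, label) for the first occurrence of phrase, or None."""
    start = text.find(phrase, start_at)
    if start == -1:
        return None
    return (start, start + len(phrase), label)


sentence = "In practice, you'll need many more - a few hundred would be a good start."
span = find_span(sentence, "good", "QUALITY")
print(span)  # -> (62, 66, 'QUALITY')
```

Slicing `sentence[62:66]` gives back `good`, which is a quick way to confirm the offsets before adding them to the training data.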

In [None]:
TRAIN_DATA = [
    # use the raw text strings sent1 ... sent5 made above instead of inline u'raw text'
    (sent1, {'entities': [
        (17, 21, 'MONEY'),
        (52, 57, 'LOC'),
        (77, 91, 'PERSON'),
        (129, 134, 'MONEY')
    ]}),
    (sent2, {'entities': [
        (250, 256, 'PERSON')
    ]}),
    (sent3, {'entities': [
        (64, 71, 'ORG'),
        (154, 158, 'MONEY')
    ]}),
    (sent4, {'entities': [
        (55, 61, 'PERSON'),
        (88, 93, 'MONEY'),
        (111, 115, 'MONEY')
    ]}),
    (sent5, {'entities': [
        (6, 11, 'MONEY'),
        (25, 32, 'ORG')
    ]}),
]

# MISC

## GENSIM loading Pretrained Word2Vec 

```python
from gensim.models import KeyedVectors
# Load vectors directly from the file 1G
model = KeyedVectors.load_word2vec_format('data/GoogleGoogleNews-vectors-negative300.bin', binary=True)
# Access vectors for specific words with a keyed lookup:
vector = model['easy']
# see the shape of the vector (300,)
vector.shape
# Processing sentences is less convenient than with spaCy: look up each word, skipping out-of-vocabulary ones
sentence = "This is some text I am processing with Spacy"
vectors = [model[x] for x in sentence.split(' ') if x in model]
```
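
To get a single vector per sentence, one simple option is to average the word vectors, which is also what spaCy's `Doc.vector` does by default. The sketch below assumes only that the model supports `in` and `[]` lookups, so a toy dictionary stands in for the loaded `KeyedVectors`; `sentence_vector` is a hypothetical helper.

```python
def sentence_vector(sentence, vectors):
    """Average the vectors of the in-vocabulary words of a whitespace-split sentence."""
    found = [vectors[word] for word in sentence.split() if word in vectors]
    if not found:
        return None  # no known words, so there is nothing to average
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Toy 2-dimensional vectors standing in for the 300-dimensional model above
toy_vectors = {"easy": [1.0, 0.0], "text": [0.0, 1.0]}
print(sentence_vector("easy text processing", toy_vectors))  # -> [0.5, 0.5]
```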


