## Reasons for updating a model:
* Better results on your specific domain
* Learn classification schemes specifically for your problem
* Essential for text classification
* Very useful for named entity recognition
* Less critical for part of speech tagging and dependency parsing

## How training works:
* Initialize the model weights randomly (if not starting from a trained pipeline)
* Predict a few examples with the current weights
* Compare prediction with true labels
* Calculate how to change weights to improve predictions
* Update weights slightly
* Go back to second step

![image.png](attachment:9f202c24-6863-4bc3-b941-97230f06265b.png)

* Training data - examples and their annotations
* Text - the input text the model should predict a label for (a sentence, paragraph or longer document)
* Label - the label the model should predict (a text category, or an entity span and its type)
* Gradient - how to change the weights to reduce the current error (computed by comparing the predicted label to the true label)
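
The training steps above can be sketched in plain Python. This is a toy illustration only: it assumes a tiny logistic-regression model in place of spaCy's neural network, and the data, `learning_rate` and `predict` helper are made up for the example.

```python
import math
import random

random.seed(0)

# Toy training data (hypothetical): (features, true label)
examples = [
    ([1.0, 0.0], 1),
    ([0.9, 0.2], 1),
    ([0.1, 1.0], 0),
    ([0.0, 0.8], 0),
]

# Step 1: initialize the weights randomly
weights = [random.uniform(-0.1, 0.1) for _ in range(2)]
learning_rate = 0.5


def predict(features):
    """Return the predicted probability of label 1 for the given features."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


for epoch in range(200):
    random.shuffle(examples)
    for features, label in examples:
        prediction = predict(features)            # Step 2: predict with the current weights
        error = prediction - label                # Step 3: compare with the true label
        gradient = [error * x for x in features]  # Step 4: how to change the weights
        for i in range(len(weights)):
            weights[i] -= learning_rate * gradient[i]  # Step 5: update the weights slightly
    # Step 6: go back to step 2 for the next pass over the data

for features, label in examples:
    print(features, "true:", label, "predicted:", round(predict(features)))
```

The small `learning_rate` is what makes each update "slight": every example nudges the weights a little rather than overwriting what was learned from the others.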

In [None]:
# Example: Training the entity recognizer
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# The entity recognizer tags words and phrases in context
# Each token can only be part of one entity
# Examples need to come with context
doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]

# Texts with no entities are also important
doc = nlp("I need a new phone! Any tips?")
doc.ents = []
# Goal: teach the model to generalize

## The training data
* Examples of what we want the model to predict in context
* Updating an existing model requires a few hundred to a few thousand examples
* Training a new category requires a few thousand to a million examples
* Usually created manually by human annotators
* Can be semi-automated, for example with spaCy's Matcher

## Training vs evaluation data
* Training data: used to update the model
* Evaluation data:
  * data the model hasn't seen during training
  * used to calculate how accurate the model is
  * Should be representative of the data the model will see at runtime
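
As a rough illustration of how evaluation data is used to measure accuracy, the sketch below compares predicted entity spans against the true (gold) spans and computes precision, recall and F-score, the kind of entity scores spaCy reports during training. The `entity_scores` helper and the span tuples are hypothetical, and this is not spaCy's own scorer.

```python
def entity_scores(predicted, gold):
    """Compute precision, recall and F-score for (start, end, label) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical annotations for one evaluation text
gold_spans = {(0, 2, "GADGET"), (5, 6, "GADGET")}
predicted_spans = {(0, 2, "GADGET"), (3, 4, "GADGET")}

precision, recall, f_score = entity_scores(predicted_spans, gold_spans)
print(precision, recall, f_score)
```

Because the evaluation texts were never used to update the weights, these scores estimate how the model will behave on unseen data at runtime.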

In [3]:
# Generating training corpus
import random
import spacy
from spacy.tokens import Span, DocBin

nlp = spacy.blank("en")

# Create a Doc with entity spans. You usually want at least a few hundred to a few thousand representative examples.
doc1 = nlp("iPhone X is coming")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
# Create another doc without entity spans
doc2 = nlp("I need a new phone! Any tips?")

docs = [doc1, doc2]  # and so on...

# Split the data into two portions:
# training data - used to update the model
# development (evaluation) data - held back to measure how accurate the model is
random.shuffle(docs)
train_docs = docs[:len(docs) // 2]
dev_docs = docs[len(docs) // 2:]

# DocBin - a container for efficiently storing and saving Doc objects.
# It can be saved to a binary .spacy file on disk, which is the format
# spaCy's training process loads training and development (evaluation) data from.
# Create and save a collection of training docs
train_docbin = DocBin(docs=train_docs)
train_docbin.to_disk("models/train.spacy")
# Create and save a collection of evaluation docs
dev_docbin = DocBin(docs=dev_docs)
dev_docbin.to_disk("models/dev.spacy")

# If the training and evaluation data is in another format (.conll, .conllu, .iob) instead of spaCy's binary format, convert it first:
# !python -m spacy convert ./train.gold.conll ./corpus
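
To check what was actually written to a `.spacy` file, a `DocBin` can be read back and turned into `Doc` objects again. The sketch below is a minimal round trip under the assumption that a temporary directory is an acceptable stand-in for the real output path.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")
doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Temporary path used only for this illustration
    path = Path(tmp_dir) / "train.spacy"
    DocBin(docs=[doc]).to_disk(path)

    # Load the file again and rebuild the Doc objects with the shared vocab
    loaded_bin = DocBin().from_disk(path)
    loaded_docs = list(loaded_bin.get_docs(nlp.vocab))

loaded_ents = [(ent.text, ent.label_) for ent in loaded_docs[0].ents]
print(loaded_docs[0].text, loaded_ents)
```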

In [9]:
import json
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span, DocBin

with open("exercises/en/iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Add patterns to the matcher
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("GADGET", [pattern1, pattern2])
docs = []
for doc in nlp.pipe(TEXTS):
    matches = matcher(doc)
    # Turn each match into a Span, using the match ID ("GADGET") as the label
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    # Overwrite the doc's entities with the matched spans
    doc.ents = spans
    docs.append(doc)

# Instantiate the DocBin with the list of docs
doc_bin = DocBin(docs=docs)
# Save the DocBin to a file called train.spacy
doc_bin.to_disk("models/train.spacy")
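
Since each token can only be part of one entity, assigning overlapping matches to `doc.ents` raises an error. The two patterns above do not overlap, but looser patterns can. One possible way to handle this, sketched here with a made-up sentence and an extra single-token pattern, is `spacy.util.filter_spans`, which drops overlapping spans and prefers longer ones.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Hypothetical patterns that overlap on "iPhone X": "iphone" alone and "iphone x"
matcher.add("GADGET", [[{"LOWER": "iphone"}], [{"LOWER": "iphone"}, {"LOWER": "x"}]])

doc = nlp("I want an iPhone X and an iPhone 8")
spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matcher(doc)]

# Keep only non-overlapping spans before assigning them as entities
doc.ents = filter_spans(spans)
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```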

## Configuring and running training
* The config file is the single source of truth for all settings
* It is typically called config.cfg
* Defines how to initialize the nlp object
* Includes all settings about the pipeline components and their model implementations
* Configures the training process and hyperparameters
* Makes your training more reproducible

### Generating a config
* spaCy can auto-generate a default config file for you, using either:
  * the interactive [quickstart widget](https://spacy.io/usage/training#quickstart "quickstart widget")
  * the [init config](https://spacy.io/api/cli#init-config "init config") command on the CLI

In [12]:
# init config - the command to run
# ./config.cfg - output path for the generated config
# --lang - language class of the pipeline, en for English
# --pipeline - comma-separated names of components to include
!python -m spacy init config ./config.cfg --lang en --pipeline ner

# To view the config file (cat is a Unix command; on Windows, use type instead)
!cat ./config.cfg

[!] To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
[i] Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[+] Auto-filled config with all values
[+] Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


'cat' is not recognized as an internal or external command,
operable program or batch file.


### Training a pipeline
* All you need is the config.cfg and the training and development data as .spacy files
* Config settings can be overridden on the command line
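
To see how a dotted override such as `--paths.train` maps onto a config section, the sketch below parses a short config string with thinc's `Config` class, which spaCy's config system builds on. The config text is an abridged, illustrative excerpt rather than a complete trainable config, and the override values are assumptions for the example.

```python
from thinc.api import Config

# Abridged, illustrative excerpt (not a complete config.cfg)
CONFIG_TEXT = """
[paths]
train = null
dev = null

[nlp]
lang = "en"
pipeline = ["ner"]

[corpora]

[corpora.train]
path = ${paths.train}

[training]
eval_frequency = 200
"""

# Each key is "section.option", mirroring --paths.train on the command line
overrides = {"paths.train": "./train.spacy", "training.eval_frequency": 100}
config = Config().from_str(CONFIG_TEXT, overrides=overrides)

print(config["paths"]["train"])
print(config["training"]["eval_frequency"])
print(config["nlp"]["pipeline"])
# The corpus path refers to ${paths.train}, so it follows the override
print(config["corpora"]["train"]["path"])
```

Keeping the paths as variables in one `[paths]` section is what lets the same config file be reused with different data files.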

In [None]:
# train - the command to run
# ./config.cfg - the path to the config file
# --output - the path to the output directory to save the trained pipeline
# --paths.train - override with the path to the training data
# --paths.dev - override with the path to the evaluation data
# Training makes several passes over the data. Each pass over the data is also called an "epoch".
# Within each epoch, spaCy outputs the accuracy scores every 200 steps (batches) by default;
# you can change the frequency in the config (eval_frequency).
# Training runs until the model stops improving, then exits automatically.
!python -m spacy train ./config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy

### Loading a trained pipeline
* output after training is a regular loadable spaCy pipeline
  * model-last: last trained pipeline
  * model-best: best trained pipeline
* load it with spacy.load

In [None]:
import spacy

nlp = spacy.load("/path/to/output/model-best")
doc = nlp("iPhone 11 vs iPhone 8: What's the difference?")
print(doc.ents)

To make it easy to deploy a pipeline, create a Python package:
* [spacy package](https://spacy.io/api/cli#package "spacy package"): creates an installable Python package containing your pipeline
* Easy to version and deploy

## Best practices for training spaCy models

Problem 1: Models can "forget" things
* An existing model can overfit on new data (if you only update it with "WEBSITE", it can unlearn what a "PERSON" is)
* Also known as "catastrophic forgetting" problem

Solution 1: Mix in previously correct predictions
* For example, if you are training "WEBSITE", also include examples of "PERSON"
* Run the existing spaCy model over the data and extract all other relevant entities
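
One way to mix in previous predictions is sketched below. Since no trained pipeline is assumed to be installed, a blank pipeline with an entity ruler stands in for the existing model (in practice you would load your trained pipeline with `spacy.load` instead), and the new "WEBSITE" annotation is hand-written.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

# Stand-in for an existing trained pipeline (assumption for this sketch)
existing_nlp = spacy.blank("en")
ruler = existing_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "Alexis Ohanian"}])

text = "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans"
doc = existing_nlp(text)

# Entities the existing model already predicts correctly
previous_ents = list(doc.ents)
# New annotation for the label being added
new_ents = [Span(doc, 0, 1, label="WEBSITE")]

# Combine both sets, dropping any overlapping spans
doc.ents = filter_spans(new_ents + previous_ents)
mixed = [(ent.text, ent.label_) for ent in doc.ents]
print(mixed)
```

The resulting doc carries both the new "WEBSITE" entity and the "PERSON" entity, so updating on it reinforces what the model already knew.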

Problem 2: Models can't learn everything
* spaCy's models make predictions based on local context
* A model can struggle to learn if the decision is difficult to make based on context
* The label scheme needs to be consistent and not too specific
  * For example: "CLOTHING" is better than "ADULT_CLOTHING" and "CHILDRENS_CLOTHING"

Solution 2: Plan your label scheme carefully
* Pick categories that are reflected in local context
* More generic is better than too specific
* Use rules to go from generic labels to specific categories
* BAD: LABELS = ["ADULT_SHOES", "CHILDRENS_SHOES", "BANDS_I_LIKE"]
* GOOD: LABELS = ["CLOTHING", "BAND"]
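
The idea of going from a generic label to a specific category with rules could look roughly like this. The keyword list and the `refine_label` helper are hypothetical and only illustrate the approach: the model predicts the generic "CLOTHING" label, and a simple rule over the surrounding words picks the specific category afterwards.

```python
# Hypothetical keywords that suggest children's clothing
CHILD_WORDS = {"kids", "children", "childrens", "toddler", "baby"}


def refine_label(label, context_tokens):
    """Map a generic predicted label to a specific category using keyword rules."""
    if label != "CLOTHING":
        return label
    lowered = {token.lower() for token in context_tokens}
    if lowered & CHILD_WORDS:
        return "CHILDRENS_CLOTHING"
    return "ADULT_CLOTHING"


print(refine_label("CLOTHING", ["Kids", "rain", "jacket"]))
print(refine_label("CLOTHING", ["wool", "coat"]))
print(refine_label("BAND", ["Metallica"]))
```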

## Training multiple labels
* To automate labeling, use an annotation tool such as Brat, a popular open-source solution, or Prodigy, Explosion's annotation tool that integrates with spaCy

In [None]:
# Complete the token offsets for the "WEBSITE" entities in the data
# Update the training data to include annotations for the "PERSON" entities “PewDiePie” and “Alexis Ohanian”.
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

doc1 = nlp("Reddit partners with Patreon to help creators build communities")
doc1.ents = [
    Span(doc1, 0, 1, label="WEBSITE"),
    Span(doc1, 3, 4, label="WEBSITE"),
]

doc2 = nlp("PewDiePie smashes YouTube record")
doc2.ents = [Span(doc2, 0, 1, label="PERSON"), Span(doc2, 2, 3, label="WEBSITE")]

doc3 = nlp("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans")
doc3.ents = [Span(doc3, 0, 1, label="WEBSITE"), Span(doc3, 2, 4, label="PERSON")]

# And so on...
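
Annotations are often stored as character offsets rather than token indices. As a hedged sketch, assuming annotations in a made-up `(start_char, end_char, label)` format, `doc.char_span` can convert them to spans; it returns `None` when the offsets do not line up with token boundaries, so such annotations are skipped here.

```python
import spacy

nlp = spacy.blank("en")

text = "PewDiePie smashes YouTube record"
# Hypothetical character-offset annotations: (start_char, end_char, label)
annotations = [(0, 9, "PERSON"), (18, 25, "WEBSITE")]

doc = nlp(text)
spans = []
for start_char, end_char, label in annotations:
    span = doc.char_span(start_char, end_char, label=label)
    if span is None:
        # The offsets do not match token boundaries, so skip this annotation
        print("Skipping misaligned annotation:", text[start_char:end_char])
    else:
        spans.append(span)

doc.ents = spans
converted = [(ent.text, ent.label_) for ent in doc.ents]
print(converted)
```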