# Chapter 4: Training a neural network model

## Why update the model?
- Better results on your specific domain
- Learn classification schemes specifically for your problem
- Essential for text classification
- Very useful for named entity recognition
- Less critical for part-of-speech tagging and dependency parsing

### How training works (1)
1. Initialize the model weights randomly with nlp.begin_training
1. Predict a few examples with the current weights by calling nlp.update
1. Compare prediction with true labels
1. Calculate how to change weights to improve predictions
1. Update weights slightly
1. Go back to 2.
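
The loop above can be illustrated without spaCy at all. The following is a minimal sketch, assuming a toy one-feature logistic model rather than spaCy's actual neural network; the data, the learning rate, and the names `predict` and `train` are made up for illustration.

```python
import math
import random


def predict(weights, x):
    """Return the probability of the positive label for one feature value."""
    w, b = weights
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def train(examples, n_iter=200, lr=0.5, seed=0):
    """Toy training loop: predict, compare, compute the gradient, update."""
    rng = random.Random(seed)
    # 1. Initialize the weights randomly
    weights = [rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)]
    examples = list(examples)
    for _ in range(n_iter):
        rng.shuffle(examples)
        for x, label in examples:
            # 2. Predict with the current weights
            p = predict(weights, x)
            # 3. Compare the prediction with the true label
            error = p - label
            # 4. Gradient of the log loss with respect to each weight
            grad_w, grad_b = error * x, error
            # 5. Update the weights slightly
            weights[0] -= lr * grad_w
            weights[1] -= lr * grad_b
    return weights


# Hypothetical data: negative inputs are labelled 0, positive inputs 1
DATA = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
weights = train(DATA)
print([round(predict(weights, x)) for x, _ in DATA])
```

In spaCy the same cycle is hidden behind nlp.update; the sketch only makes the individual steps visible.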

### How training works (2)
- Diagram of the training process
- Training data: Examples and their annotations.
- Text: The input text the model should predict a label for.
- Label: The label the model should predict.
- Gradient: How to change the weights.


### Example: Training the entity recognizer
- The entity recognizer tags words and phrases in context
- Each token can only be part of one entity
- Examples need to come with context
```
("iPhone X is coming", {'entities': [(0, 8, 'GADGET')]})
```
- Texts with no entities are also important
```
("I need a new phone! Any tips?", {'entities': []})
```
- Goal: teach the model to generalize
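
Because the annotations are character offsets, a quick sanity check is to slice each text with its offsets and confirm the result is the intended phrase. A small sketch in plain Python; the helper name `entity_texts` is hypothetical and not part of spaCy.

```python
def entity_texts(example):
    """Return the (substring, label) pairs described by an example's offsets."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


EXAMPLES = [
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

for example in EXAMPLES:
    print(entity_texts(example))
```

An offset that is off by one shows up immediately as a truncated or padded substring.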

### The training data
- Examples of what we want the model to predict in context
- Update an existing model: a few hundred to a few thousand examples
- Train a new category: a few thousand to a million examples
- spaCy's English models: 2 million words
- Usually created manually by human annotators
- Can be semi-automated – for example, using spaCy's Matcher!

In [5]:
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("static/iphone.json") as f:
    TEXTS = json.load(f)

nlp = English()
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]

# Add patterns to the matcher
matcher.add("GADGET", None, pattern1, pattern2)

In [11]:
TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    print(spans)
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")

[iPhone X, iPhone]
[iPhone X, iPhone]
[iPhone X, iPhone]
[iPhone, iPhone 8]
[iPhone]
[]
('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 10, 'GADGET'), (4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
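
Note that the output above contains overlapping entities, for example both (20, 28) and (20, 26) in the first text, because more than one pattern matches at the same position. Since each token can only be part of one entity, overlaps like these would need to be resolved before training. One way, sketched here in plain Python under the assumption that the longest match should win, is to filter the offset tuples; `filter_overlaps` is a hypothetical helper. Recent versions of spaCy also provide spacy.util.filter_spans, which applies a similar longest-first rule to Span objects.

```python
def filter_overlaps(entities):
    """Keep the longest entity among overlapping (start, end, label) tuples."""
    # Longest spans first; ties are broken by the earlier start offset
    by_length = sorted(entities, key=lambda ent: (ent[0] - ent[1], ent[0]))
    kept = []
    for start, end, label in by_length:
        # Keep the span only if it does not overlap anything already kept
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Entities copied from the matcher output above
print(filter_overlaps([(20, 28, "GADGET"), (20, 26, "GADGET")]))
print(filter_overlaps([(4, 10, "GADGET"), (4, 12, "GADGET")]))
```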


### The steps of a training loop
- Loop for a number of times.
- Shuffle the training data.
- Divide the data into batches.
- Update the model for each batch.
- Save the updated model.
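
The shuffle-and-batch steps can be sketched in plain Python. This is only an illustration of the idea: `minibatch` here is a simplified stand-in written for this example, not spaCy's own spacy.util.minibatch, and the update and save steps are left as comments.

```python
import random


def minibatch(items, size=2):
    """Yield consecutive batches of at most `size` items."""
    items = list(items)
    for i in range(0, len(items), size):
        yield items[i:i + size]


DATA = ["text %d" % i for i in range(5)]  # placeholder training examples
batch_sizes = []

for epoch in range(3):                      # loop for a number of times
    random.shuffle(DATA)                    # shuffle the training data
    for batch in minibatch(DATA, size=2):   # divide the data into batches
        batch_sizes.append(len(batch))
        # update the model with this batch here
# save the updated model here

print(batch_sizes)
```

Shuffling before every pass means each batch groups different examples in each iteration.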

In [18]:
import random 
from spacy.lang.en import English
nlp = English()
nlp.pipe_names, nlp.pipeline

([], [])

### Updating an existing model
- Improve the predictions on new data
- Especially useful to improve existing categories, like PERSON
- Also possible to add new categories
- Be careful and make sure the model doesn't "forget" the old ones

```
import random
import spacy

# Assumes nlp is a loaded model and TRAINING_DATA is a list of
# (text, annotations) tuples

# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA):
        # Split the batch in texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        print(texts, annotations)
        nlp.update(texts, annotations)

# Save the model (path_to_model is a placeholder for an output directory)
nlp.to_disk(path_to_model)
```


### Setting up a new pipeline from scratch
```
import random
import spacy

# Start with blank English model
nlp = spacy.blank('en')

# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

# Add a new label
ner.add_label('GADGET')

# Start the training
nlp.begin_training()

# Train for 10 iterations
# (examples is a placeholder for a list of (text, annotations) tuples)
for itn in range(10):
    random.shuffle(examples)
    # Divide examples into batches
    for batch in spacy.util.minibatch(examples, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)
```

In [24]:
import random
import spacy
TRAINING_DATA = [
    ['How to preorder the iPhone X', {'entities': [[20, 28, 'GADGET']]}], 
    ['iPhone X is coming', {'entities': [[0, 8, 'GADGET']]}], 
    ['Should I pay $1,000 for the iPhone X?', {'entities': [[28, 36, 'GADGET']]}], 
    ['The iPhone 8 reviews are here', {'entities': [[4, 12, 'GADGET']]}], 
    ['Your iPhone goes up to 11 today', {'entities': [[5, 11, 'GADGET']]}], 
    ['I need a new phone! Any tips?', {'entities': []}]
]

# Create a blank 'en' model
nlp = spacy.blank('en')

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

# Add the label 'GADGET' to the entity recognizer
ner.add_label('GADGET')
print(nlp.pipe_names)


# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model
        nlp.update(texts, annotations, losses=losses)
        print(losses)

['ner']
{'ner': 13.333332538604736}
{'ner': 25.146877884864807}
{'ner': 33.52562475204468}
{'ner': 6.892914116382599}
{'ner': 15.981010377407074}
{'ner': 19.556943893432617}
{'ner': 3.2705454602837563}
{'ner': 6.5285542123019695}
{'ner': 8.234125423361547}
{'ner': 1.54903374746209}
{'ner': 4.290305365080712}
{'ner': 5.90817542761215}
{'ner': 5.616827309131622}
{'ner': 13.944963905960321}
{'ner': 15.914044762554113}
{'ner': 2.5698778703808784}
{'ner': 6.481321446597576}
{'ner': 9.146115079522133}
{'ner': 2.071006953716278}
{'ner': 3.934548668563366}
{'ner': 5.221237783553079}
{'ner': 0.9154163065832108}
{'ner': 1.3927827711322607}
{'ner': 2.7633928216837376}
{'ner': 1.0691045480070898}
{'ner': 1.108544173936025}
{'ner': 1.1252407127994957}
{'ner': 0.0021242555508251826}
{'ner': 0.003405092637052576}
{'ner': 2.2033404988185588}


In [31]:
unseen_data = [
    "Apple is slowing down the iPhone 8 and iPhone X - how to stop it",
    "I finally understand what the iPhone X ‘notch’ is for",
    "Everything you need to know about the Samsung Galaxy S9",
    "Looking to compare iPad models? Here’s how the 2018 lineup stacks",
    "The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple",
    "what is the cheapest ipad, especially ipad pro???",
    "Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics",
]

docs = nlp.pipe(unseen_data)
for doc in docs:
    print(doc, doc.ents)
    print(20*"_")

Apple is slowing down the iPhone 8 and iPhone X - how to stop it (iPhone 8, iPhone X)
____________________
I finally understand what the iPhone X ‘notch’ is for (iPhone X,)
____________________
Everything you need to know about the Samsung Galaxy S9 ()
____________________
Looking to compare iPad models? Here’s how the 2018 lineup stacks ()
____________________
The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple (iPhone 8, iPhone 8)
____________________
what is the cheapest ipad, especially ipad pro??? ()
____________________
Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics ()
____________________


### Problem 1: Models can "forget" things
- An existing model can overfit on new data
- For example, if you only update it with WEBSITE, it can "unlearn" what a PERSON is
- Also known as the "catastrophic forgetting" problem

### Solution 1: Mix in previously correct predictions
- For example, if you're training WEBSITE, also include examples of PERSON
- Run the existing spaCy model over the data and extract all other relevant entities

BAD:
```
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})
]
```
GOOD:
```
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),
    ('Obama is a person', {'entities': [(0, 5, 'PERSON')]})
]
```
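
One way to build the mixed data set is to merge the new annotations with whatever the existing model already predicts for the same texts. The sketch below is plain Python and assumes a hypothetical `predict_entities` callable that returns (start, end, label) tuples; with spaCy these could be derived from the doc.ents of the existing model, which is stubbed out here as `fake_model` so the example runs on its own.

```python
def mix_in_predictions(examples, predict_entities):
    """Add the existing model's non-overlapping predictions to new annotations."""
    mixed = []
    for text, annotations in examples:
        entities = list(annotations["entities"])
        for start, end, label in predict_entities(text):
            # Skip predictions that overlap a manually annotated entity
            if all(end <= s or start >= e for s, e, _ in entities):
                entities.append((start, end, label))
        mixed.append((text, {"entities": sorted(entities)}))
    return mixed


def fake_model(text):
    """Stand-in for an existing model that only knows PERSON (hypothetical)."""
    start = text.find("Obama")
    return [(start, start + 5, "PERSON")] if start >= 0 else []


NEW_DATA = [("Obama uses Reddit", {"entities": [(11, 17, "WEBSITE")]})]
print(mix_in_predictions(NEW_DATA, fake_model))
```

Predictions taken from the old model are only as good as that model, so they are worth spot-checking before training on them.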

### Problem 2: Models can't learn everything
- spaCy's models make predictions based on local context
- Model can struggle to learn if decision is difficult to make based on context
- Label scheme needs to be consistent and not too specific
- For example: CLOTHING is better than ADULT_CLOTHING and CHILDRENS_CLOTHING
    
### Solution 2: Plan your label scheme carefully
- Pick categories that are reflected in local context
- More generic is better than too specific
- Use rules to go from generic labels to specific categories

BAD:
```
LABELS = ['ADULT_SHOES', 'CHILDRENS_SHOES', 'BANDS_I_LIKE']
```
GOOD:
```
LABELS = ['CLOTHING', 'BAND']
```
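
Going from a generic label to a specific category with rules might look like the following. This is a sketch only: the keyword list and the `refine_label` helper are invented for illustration and are not part of spaCy.

```python
import re

# Hypothetical keyword rules that refine a generic CLOTHING entity
CHILDREN_KEYWORDS = {"kids", "children", "toddler", "baby"}


def refine_label(entity_text, label):
    """Map a generic label to a more specific one using simple rules."""
    if label != "CLOTHING":
        return label
    # Lowercase alphabetic words in the entity text
    words = set(re.findall(r"[a-z]+", entity_text.lower()))
    if words & CHILDREN_KEYWORDS:
        return "CHILDRENS_CLOTHING"
    return "ADULT_CLOTHING"


print(refine_label("kids' rain jacket", "CLOTHING"))
print(refine_label("leather jacket", "CLOTHING"))
print(refine_label("Metallica", "BAND"))
```

The model only has to learn the generic CLOTHING label from context, while the finer distinction is handled by rules that are easy to inspect and change.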

In [34]:
nlp = spacy.load("en_core_web_sm")

In [43]:
doc = nlp("There's also a Paris in Arkansas, lol")
doc.ents

(Paris, Arkansas)

In [47]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
matcher.add("GPE", None, *[nlp(ent.text) for ent in doc.ents])

In [49]:
matcher(doc)

[(384, 4, 5), (384, 6, 7)]

## Your new spaCy skills
- Extract linguistic features: part-of-speech tags, dependencies, named entities
- Work with pre-trained statistical models
- Find words and phrases using Matcher and PhraseMatcher match rules
- Best practices for working with data structures Doc, Token, Span, Vocab, Lexeme
- Find semantic similarities using word vectors
- Write custom pipeline components with extension attributes
- Scale up your spaCy pipelines and make them fast
- Create training data for spaCy's statistical models
- Train and update spaCy's neural network models with new data

In [70]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f14a1b65d30>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f14a1a562e8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f14a1a56348>)]