# Chapter 4: Training a neural network model

- https://course.spacy.io/chapter4

How training works 

1. Initialize the model weights randomly with nlp.begin_training.
2. Predict a few examples with the current weights by calling nlp.update.
3. Compare the predictions with the true labels.
4. Calculate how to change the weights to improve the predictions.
5. Update the weights slightly.
6. Go back to step 2.
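
The steps above can be illustrated without spaCy. The following is a minimal sketch, assuming a toy model with a single weight `w` that predicts `w * x` and a squared-error loss. The names `train_toy_model` and `learning_rate` and the data are hypothetical; the code only mirrors the six steps and says nothing about spaCy's internals.

```python
import random


def train_toy_model(examples, iterations=200, learning_rate=0.05, seed=0):
    """Fit a single weight w so that w * x approximates y."""
    rng = random.Random(seed)
    # Step 1: initialize the weight randomly
    w = rng.uniform(-1.0, 1.0)
    for _ in range(iterations):
        for x, y_true in examples:
            # Step 2: predict with the current weight
            y_pred = w * x
            # Step 3: compare the prediction with the true label
            error = y_pred - y_true
            # Step 4: gradient of the squared error (error ** 2) with respect to w
            gradient = 2 * error * x
            # Step 5: update the weight slightly
            w -= learning_rate * gradient
        # Step 6: go back to step 2 for the next pass over the data
    return w


# Hypothetical data generated by y = 3 * x
toy_examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
learned_w = train_toy_model(toy_examples)
print(round(learned_w, 3))
```

With this data the weight moves toward 3, the value used to generate the examples.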

# 1)-Creating training data

spaCy’s rule-based Matcher is a great way to quickly create training data for named entity models. A list of sentences is available as the variable TEXTS. You can print it in the IPython shell to inspect it. We want to find all mentions of different iPhone models, so that we can create training data to teach a model to recognize them as 'GADGET'.

In [1]:
import json
import random
import spacy
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("iphone.json") as f:
    TEXTS = json.loads(f.read())

In [2]:
TEXTS

['How to preorder the iPhone X',
 'iPhone X is coming',
 'Should I pay $1,000 for the iPhone X?',
 'The iPhone 8 reviews are here',
 'Your iPhone goes up to 11 today',
 'I need a new phone! Any tips?']

In [3]:
nlp = English()
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]

# Add patterns to the matcher
matcher.add("GADGET", None, pattern1, pattern2)

In [4]:
print(matcher)

<spacy.matcher.matcher.Matcher object at 0x12251f0e0>


In [5]:
with open("iphone.json") as f:
    TEXTS = json.loads(f.read())

In [6]:
type(TEXTS)

list

In [7]:
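# Note: str(TEXTS) turns the whole list into one string, so the brackets,
# quotes and commas of the list syntax become tokens too (see Out [11]).
# Case 2 below processes each text separately with nlp.pipe instead.
doc = str(TEXTS)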

In [8]:
doc

"['How to preorder the iPhone X', 'iPhone X is coming', 'Should I pay $1,000 for the iPhone X?', 'The iPhone 8 reviews are here', 'Your iPhone goes up to 11 today', 'I need a new phone! Any tips?']"

In [9]:
doc_2=nlp(doc)
doc_2

['How to preorder the iPhone X', 'iPhone X is coming', 'Should I pay $1,000 for the iPhone X?', 'The iPhone 8 reviews are here', 'Your iPhone goes up to 11 today', 'I need a new phone! Any tips?']

In [10]:
type(doc_2)

spacy.tokens.doc.Doc

In [11]:
print([t.text for t in doc_2])

['[', "'", 'How', 'to', 'preorder', 'the', 'iPhone', 'X', "'", ',', "'", 'iPhone', 'X', 'is', 'coming', "'", ',', "'", 'Should', 'I', 'pay', '$', '1,000', 'for', 'the', 'iPhone', 'X', '?', "'", ',', "'", 'The', 'iPhone', '8', 'reviews', 'are', 'here', "'", ',', "'", 'Your', 'iPhone', 'goes', 'up', 'to', '11', 'today', "'", ',', "'", 'I', 'need', 'a', 'new', 'phone', '!', 'Any', 'tips', '?', "'", ']']


In [12]:
matches = matcher(doc_2)
matches

[(274229460355380281, 6, 8),
 (274229460355380281, 6, 7),
 (274229460355380281, 11, 13),
 (274229460355380281, 11, 12),
 (274229460355380281, 25, 27),
 (274229460355380281, 25, 26),
 (274229460355380281, 32, 33),
 (274229460355380281, 32, 34),
 (274229460355380281, 41, 42)]

In [13]:
for match_id, start, end in matches:
    span = doc_2[start:end]
    print(span.text)

iPhone X
iPhone
iPhone X
iPhone
iPhone X
iPhone
iPhone
iPhone 8
iPhone


### Case 2:
    
Let’s use the match patterns we’ve created in the previous exercise to bootstrap a set of training examples. A list of sentences is available as the variable TEXTS.

**Loading file**

In [14]:
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("iphone.json") as f:
    TEXTS = json.loads(f.read())

In [15]:
TEXTS

['How to preorder the iPhone X',
 'iPhone X is coming',
 'Should I pay $1,000 for the iPhone X?',
 'The iPhone 8 reviews are here',
 'Your iPhone goes up to 11 today',
 'I need a new phone! Any tips?']

**Creating Pattern**

In [16]:
nlp = English()
matcher = Matcher(nlp.vocab)
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]
matcher.add("GADGET", None, pattern1, pattern2)

In [17]:
TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 10, 'GADGET'), (4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
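
Note that the output above contains overlapping entities, such as (20, 28) and (20, 26) for the same text, because pattern2 also matches "iPhone" on its own. An entity recognizer generally expects non-overlapping spans. The following is a minimal sketch of one way to keep only the longest span, assuming (start, end, label) tuples. The helper `keep_longest_entities` is hypothetical; spaCy also provides `spacy.util.filter_spans`, which applies a similar longest-first rule to `Span` objects.

```python
def keep_longest_entities(entities):
    """Drop entities that overlap with a longer, already accepted entity."""
    # Longest spans first; ties are broken by the earlier start offset
    ordered = sorted(entities, key=lambda ent: (-(ent[1] - ent[0]), ent[0]))
    accepted = []
    for start, end, label in ordered:
        # Two half-open spans overlap if each starts before the other ends
        overlaps = any(
            start < a_end and a_start < end for a_start, a_end, _ in accepted
        )
        if not overlaps:
            accepted.append((start, end, label))
    # Return the surviving entities in document order
    return sorted(accepted)


raw_entities = [(20, 28, "GADGET"), (20, 26, "GADGET")]
print(keep_longest_entities(raw_entities))  # [(20, 28, 'GADGET')]
```

Applied to every example, this yields the same single-entity annotations as the gadgets.json data loaded in section 4.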


# 2)-Training Loop

The steps of a training loop:

1. Loop for a number of times.
2. Shuffle the training data.
3. Divide the data into batches.
4. Update the model for each batch.
5. Save the updated model.

In [None]:
# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA):
        # Split the batch in texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)
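
To make the batching step concrete, here is a minimal pure-Python sketch of what dividing the data into batches means. The helper `simple_minibatch` is hypothetical: it stands in for the idea behind `spacy.util.minibatch` and is not its implementation. The three examples are taken from this chapter's data.

```python
import random


def simple_minibatch(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Three (text, annotations) pairs in the format used in this chapter
examples = [
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

random.shuffle(examples)
batch_sizes = []
for batch in simple_minibatch(examples, size=2):
    # Split each batch into parallel lists of texts and annotations
    texts = [text for text, annotation in batch]
    annotations = [annotation for text, annotation in batch]
    batch_sizes.append(len(texts))

print(batch_sizes)  # [2, 1]
```

Three examples with a batch size of 2 give one full batch and one partial batch, whatever order the shuffle produces.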

**Setting new pipeline from scratch**

In [None]:
# Start with blank English model
nlp = spacy.blank('en')
# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
# Add a new label
ner.add_label('GADGET')

# Start the training
nlp.begin_training()
# Train for 10 iterations
for itn in range(10):
    random.shuffle(examples)
    # Divide examples into batches
    for batch in spacy.util.minibatch(examples, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

# 3)-Setting up the pipeline

In [18]:
import spacy

# Create a blank 'en' model
nlp = spacy.blank("en")

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# Add the label 'GADGET' to the entity recognizer
ner.add_label("GADGET")

# 4)-Building a training loop

In [19]:
import spacy
import random
import json

with open("gadgets.json") as f:
    TRAINING_DATA = json.loads(f.read())

In [20]:
TRAINING_DATA

[['How to preorder the iPhone X', {'entities': [[20, 28, 'GADGET']]}],
 ['iPhone X is coming', {'entities': [[0, 8, 'GADGET']]}],
 ['Should I pay $1,000 for the iPhone X?', {'entities': [[28, 36, 'GADGET']]}],
 ['The iPhone 8 reviews are here', {'entities': [[4, 12, 'GADGET']]}],
 ['Your iPhone goes up to 11 today', {'entities': [[5, 11, 'GADGET']]}],
 ['I need a new phone! Any tips?', {'entities': []}]]

In [21]:
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GADGET")

In [22]:
# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model
        nlp.update(texts, annotations, losses=losses)
        print(losses)

{'ner': 10.833333492279053}
{'ner': 21.11037278175354}
{'ner': 32.944974422454834}
{'ner': 10.338136613368988}
{'ner': 17.314597129821777}
{'ner': 21.89810425043106}
{'ner': 1.8344886228442192}
{'ner': 4.833294221898541}
{'ner': 8.11140689929016}
{'ner': 1.1629105684405658}
{'ner': 3.208279492217116}
{'ner': 5.001780176447937}
{'ner': 4.669799611903727}
{'ner': 6.814525853376836}
{'ner': 10.169673154130578}
{'ner': 2.9788101445883512}
{'ner': 3.9745154915872263}
{'ner': 6.346071177118574}
{'ner': 1.0001555554335937}
{'ner': 2.775552038263413}
{'ner': 3.6978900395915844}
{'ner': 1.4259431460814085}
{'ner': 1.599497755166908}
{'ner': 1.725024571223571}
{'ner': 0.7875489346955078}
{'ner': 0.7933864060581186}
{'ner': 0.7936531797922908}
{'ner': 0.0006064233625888704}
{'ner': 2.180438003223818}
{'ner': 2.1806718718755658}


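The numbers printed to the shell represent the loss, the amount of work left for the optimizer. Because losses is reset at the start of each iteration and printed after every batch, each iteration prints three running totals, and the last of the three is the loss for that iteration. The lower the number, the better. In real life, you normally want to use *a lot* more data than this, ideally at least a few hundred or a few thousand examples.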
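
The printed values can be turned back into per-batch losses. The following is a minimal sketch, using the first six values from the output above and assuming, as that output suggests, that each group of three values accumulates within one iteration. The helper name `per_batch_losses` is hypothetical.

```python
def per_batch_losses(running_totals):
    """Convert running totals within one iteration into per-batch losses."""
    losses = []
    previous = 0.0
    for total in running_totals:
        losses.append(total - previous)
        previous = total
    return losses


# First two iterations from the output above, three batches each
iteration_1 = [10.833333492279053, 21.11037278175354, 32.944974422454834]
iteration_2 = [10.338136613368988, 17.314597129821777, 21.89810425043106]

for running_totals in (iteration_1, iteration_2):
    batch_losses = per_batch_losses(running_totals)
    # The last running total is the loss for the whole iteration
    print([round(loss, 2) for loss in batch_losses], round(running_totals[-1], 2))
```

Comparing the last value of each group across iterations is what shows whether the loss is going down.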

# 5)-Training multiple labels

In [23]:
TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
    ("PewDiePie smashes YouTube record", {"entities": [(18, 25, "WEBSITE")]}),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE")]},
    ),
    # And so on...
]

This is a case of single-class labeling: every entity above is labeled WEBSITE.

### 5b.

Update the training data to include annotations for the PERSON entities “PewDiePie” and “Alexis Ohanian”.

In [24]:
TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
    (
        "PewDiePie smashes YouTube record",
        {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]},
    ),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE"), (15, 29, "PERSON")]},
    ),
    # And so on...
]
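
Character offsets are easy to get wrong when written by hand. The following is a minimal sketch for checking them, assuming the (text, {"entities": [...]}) format used above. The helper `entity_texts` is hypothetical; it slices each text with its offsets so the result can be inspected.

```python
def entity_texts(training_data):
    """Return (substring, label) pairs for every annotated entity."""
    result = []
    for text, annotations in training_data:
        for start, end, label in annotations["entities"]:
            # End offsets are exclusive, like Python slices
            result.append((text[start:end], label))
    return result


TRAINING_DATA = [
    (
        "PewDiePie smashes YouTube record",
        {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]},
    ),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE"), (15, 29, "PERSON")]},
    ),
]

for substring, label in entity_texts(TRAINING_DATA):
    print(label, repr(substring))
```

A substring that is cut off or includes surrounding whitespace signals an offset that is wrong by one or more characters.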

Manual labeling is slow, tedious work. There are also more advanced annotation tools, such as Brat and Prodigy.