### Practical Introduction to Model Training
#### In this notebook, we will train a spaCy named entity recognition (NER) model using data from [LitBank](https://github.com/dbamman/litbank), an annotated dataset of 100 works of English-language fiction.

Steps:  
✅ Load annotation data from LitBank  
✅ Create train and validation sets  
✅ Train NER from scratch using only the EN language object  
✅ Visualize the results and compare the model's predictions against the original data  
✅ Assess whether the model is sufficiently useful for research, and what would need to be improved or changed  


In [None]:
!pip install spacy scikit-learn tqdm
!git clone https://github.com/dbamman/litbank.git
import spacy 
print(f'Using spaCy version {spacy.__version__}')

In [2]:
from pathlib import Path
entities_path = Path.cwd() / 'litbank' / 'entities' / 'brat'

text_files = [f for f in entities_path.iterdir() if f.suffix == '.txt']
assert len(text_files) == 100
print(f'[*] imported {len(text_files)} files')

[*] imported 100 files
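Each `.txt` file in the brat folder has a sibling `.ann` file that holds its entity annotations in brat standoff format: one tab-separated line per entity, `ID<TAB>LABEL START END<TAB>TEXT`, where `START` and `END` are character offsets into the text. Here is a minimal sketch of reading one such line. The helper name `parse_ann_line`, the `EntityAnnotation` tuple and the sample line are made up for illustration and are not part of LitBank or spaCy.

```python
from typing import NamedTuple, Optional


class EntityAnnotation(NamedTuple):
    """One text-bound annotation: a label plus character offsets into the .txt file."""
    label: str
    start: int
    end: int


def parse_ann_line(line: str) -> Optional[EntityAnnotation]:
    """Parse one brat line of the form 'ID<TAB>LABEL START END<TAB>TEXT'.

    Returns None for blank lines and for anything that is not a simple
    text-bound annotation (for example a discontinuous span).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2 or not fields[0].startswith("T"):
        return None
    parts = fields[1].split()
    if len(parts) != 3:
        return None
    label, start, end = parts
    if not (start.isdigit() and end.isdigit()):
        return None
    return EntityAnnotation(label, int(start), int(end))


# Made-up sample, not copied from LitBank.
sample_text = "Emma Woodhouse lived in Highbury."
sample_ann = "T1\tPER 0 14\tEmma Woodhouse\nT2\tGPE 24 32\tHighbury\n"

parsed = [a for a in map(parse_ann_line, sample_ann.split("\n")) if a is not None]
for ann in parsed:
    # The offsets index straight into the raw text.
    print(ann.label, repr(sample_text[ann.start:ann.end]))
```

Returning `None` for anything unexpected keeps a single odd line from stopping the whole import, at the cost of silently skipping it, so it is worth counting how many lines were skipped.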


In [3]:
# for each file, create a Doc object and add the annotation data to doc.ents
# our output is a list of Doc objects 
import spacy 
from tqdm.notebook import tqdm
from spacy.tokens import Span, DocBin
from spacy.util import filter_spans


docs = []

# Note: we use a blank pipeline rather than a pretrained model, because a pretrained
# model would add its own predictions and we want only the LitBank annotations
nlp = spacy.blank("en")
nlp.add_pipe('sentencizer') # used in training assessment


for text_file in tqdm(text_files):
    doc = nlp(text_file.read_text())
    annotation_file = entities_path / (text_file.stem + '.ann')
    annotations = annotation_file.read_text().split('\n')
    ents = []
    for annotation in annotations:
        if not annotation.strip():  # skip blank lines, such as the one after the final newline
            continue
        # brat line format: ID<TAB>LABEL START END<TAB>TEXT
        label, start, end = annotation.split('\t')[1].split()
        span = doc.char_span(int(start), int(end), label=label)
        # char_span returns None when the offsets do not align with token boundaries
        if span is not None:
            ents.append(span)
    
    filtered = filter_spans(ents)
    doc.ents = filtered
    docs.append(doc)
    

assert len(docs) == 100

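The call to `filter_spans` matters because `doc.ents` cannot hold overlapping spans, while brat annotations can nest (a place name inside a longer person mention, for instance). The sketch below illustrates a "prefer the longest span" rule on plain `(start, end, label)` tuples. It is an illustration of the idea, not spaCy's own code, and `drop_overlaps` is a hypothetical helper name.

```python
from typing import List, Tuple

SpanTuple = Tuple[int, int, str]  # (start, end, label), end exclusive


def drop_overlaps(spans: List[SpanTuple]) -> List[SpanTuple]:
    """Keep a non-overlapping subset, preferring longer spans, then earlier ones."""
    # Longest first; among equal lengths, the earlier start comes first.
    ranked = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: List[SpanTuple] = []
    taken = set()  # positions already covered by a kept span
    for start, end, label in ranked:
        # Later candidates are never longer than kept ones, so checking
        # the two end positions is enough to detect an overlap.
        if start not in taken and (end - 1) not in taken:
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept, key=lambda s: s[0])


# Made-up example: the second span is nested inside the first.
candidates = [(0, 3, "PER"), (2, 3, "GPE"), (5, 6, "LOC")]
print(drop_overlaps(candidates))
```

One consequence for the research question at the end of this notebook: any nested inner entity is dropped before training, so a model trained this way never sees those annotations.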

In [5]:
# Split the data into sets for training and validation 
from sklearn.model_selection import train_test_split

# Hold out 10% of the docs, then split the held-out docs 70/30 into validation and test
train_set, validation_set = train_test_split(docs, test_size=0.1)
validation_set, test_set = train_test_split(validation_set, test_size=0.3)
print(f'🚂 Created {len(train_set)} training docs')
print(f'😊 Created {len(validation_set)} validation docs')
print(f'🧪 Created {len(test_set)} test docs')

🚂 Created 90 training docs
😊 Created 7 validation docs
🧪 Created 3 test docs
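Because the split above is random, a different set of novels lands in each bucket on every run, which makes scores hard to compare between runs. Here is a minimal standard-library sketch of a seeded split that reproduces the same 90/7/3 sizes. The function name `split_docs`, its parameters and the seed value are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_docs(
    items: Sequence[T], held_out: float, test_share: float, seed: int
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle with a fixed seed, then cut into train / validation / test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed, same order
    n_held_out = math.ceil(held_out * len(shuffled))
    n_test = math.ceil(test_share * n_held_out)
    held = shuffled[:n_held_out]
    train = shuffled[n_held_out:]
    test, validation = held[:n_test], held[n_test:]
    return train, validation, test


# Integers stand in for the 100 Doc objects.
train_ids, validation_ids, test_ids = split_docs(range(100), 0.1, 0.3, seed=42)
print(len(train_ids), len(validation_ids), len(test_ids))
```

With scikit-learn itself, passing a fixed `random_state` to both `train_test_split` calls gives the same kind of reproducibility.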


In [6]:
# Add training Docs to DocBin and store to disk
from spacy.tokens import DocBin

# the DocBin will store the training documents
train_db = DocBin()
for doc in train_set:
    train_db.add(doc)
train_db.to_disk("./train.spacy")

# Save the validation Docs to disk 
validation_db = DocBin()
for doc in validation_set:
    validation_db.add(doc)
    
validation_db.to_disk("./dev.spacy") 

# Save the test Docs to disk 
test_db = DocBin()
for doc in test_set:
    test_db.add(doc)
    
test_db.to_disk("./test.spacy") 

In [7]:
!ls -al train.spacy dev.spacy test.spacy

-rw-r--r-- 1 root root  115753 Dec 23 08:20 dev.spacy
-rw-r--r-- 1 root root   53751 Dec 23 08:20 test.spacy
-rw-r--r-- 1 root root 1406959 Dec 23 08:20 train.spacy


In [9]:
!python3 -m spacy init config ./config.cfg --lang en --pipeline ner -F

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [11]:
!python3 -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy

ℹ Saving to output directory: output
ℹ Using CPU
[2021-12-23 08:22:05,786] [INFO] Set up nlp object from config
[2021-12-23 08:22:05,792] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-23 08:22:05,794] [INFO] Created vocabulary
[2021-12-23 08:22:05,795] [INFO] Finished initializing nlp object
[2021-12-23 08:22:11,376] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00   1072.88    0.00    0.00    0.00    0.00
  2     200      18889.69  63358.78   35.62   28.97   46.21    0.36
  4     400      11414.18  27850.79   53.58   60.92   47.83    0.54
  6     600      24399.51  23338.55   54.53   64.79   47.08    0.55
  8     800      20359.37  18970.20   57.03   62.17   52.6
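The `ENTS_P`, `ENTS_R` and `ENTS_F` columns report precision, recall and F-score for entities on the validation set. The sketch below shows how these three numbers relate, assuming exact-match scoring in which a predicted entity counts as correct only when its start, end and label all agree with a gold entity. The function `prf` and the sample entities are made up for illustration; this is not spaCy's scorer.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, label)


def prf(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-label matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Made-up example: two exact matches, one wrong label, one missed entity.
gold = {(0, 2, "PER"), (5, 6, "GPE"), (9, 11, "PER"), (14, 15, "LOC")}
predicted = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "PER")}

p, r, f = prf(gold, predicted)
print(f"P={p:.2%}  R={r:.2%}  F={f:.2%}")
```

In the rows logged above, precision is higher than recall from step 400 onward, which under this reading means the model misses many gold entities but is right more often than not when it does predict one.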

In [22]:
# View the predictions of our new model
import random
from spacy import displacy 

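# model-last is the final checkpoint; output/model-best holds the best-scoring one
new_nlp = spacy.load("output/model-last")
# Pick one held-out test doc and run its raw text through the trained pipeline
val_doc = random.choice(test_set)
doc = new_nlp(val_doc.text)

# Render the entities predicted in the first 100 tokens
displacy.render(doc[:100], jupyter=True, style="ent")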

In [23]:
# Compare against the original LitBank annotations 
displacy.render(val_doc[:100], jupyter=True, style="ent")
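Reading two displaCy renderings side by side works for a short excerpt, but it is easy to miss differences. A rough sketch of a programmatic comparison follows, working on plain `(start_char, end_char, label)` tuples. The helper `compare_entities` and the sample offsets are hypothetical.

```python
from typing import Dict, List, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label)


def compare_entities(gold: List[Entity], predicted: List[Entity]) -> Dict[str, list]:
    """Sort entities into matched, wrong_label, missed and spurious buckets."""
    gold_set, pred_set = set(gold), set(predicted)
    gold_by_span = {(start, end): label for start, end, label in gold}
    # Same boundaries as a gold entity, but a different label.
    wrong_label = sorted(
        (start, end, gold_by_span[(start, end)], label)
        for start, end, label in pred_set - gold_set
        if (start, end) in gold_by_span
    )
    relabelled_spans = {(start, end) for start, end, _, _ in wrong_label}
    return {
        "matched": sorted(gold_set & pred_set),
        "wrong_label": wrong_label,  # (start, end, gold_label, predicted_label)
        "missed": sorted(e for e in gold_set - pred_set if e[:2] not in relabelled_spans),
        "spurious": sorted(e for e in pred_set - gold_set if e[:2] not in gold_by_span),
    }


# Made-up offsets and labels for illustration.
gold_ents = [(0, 14, "PER"), (24, 32, "GPE"), (40, 46, "FAC")]
pred_ents = [(0, 14, "PER"), (24, 32, "LOC"), (50, 55, "PER")]

report = compare_entities(gold_ents, pred_ents)
for bucket, items in report.items():
    print(bucket, items)
```

In this notebook the two inputs could be built with `[(e.start_char, e.end_char, e.label_) for e in val_doc.ents]` for the LitBank annotations and the same expression over `doc.ents` for the predictions. The `missed` and `wrong_label` buckets are a natural starting point for judging whether the model is useful for research.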