### Outline
#### Goal: train a spaCy NER model from LitBank data

- Load annotation data from LitBank
- Create train and validation sets
- Identify entities in text using Matcher (a baseline that only matches known strings and learns nothing, so note the entities it misses in the validation set)
- Train NER from scratch using only a blank language object
- Train NER from scratch for the small English model
- Fine-tune existing NER pipeline
- Assess results for the various approaches
- Where do we see improvement? When is the model sufficiently useful for research?


In [3]:
!pip install spacy scikit-learn tqdm
!git clone https://github.com/dbamman/litbank.git
import spacy 
print(f'Using spaCy version {spacy.__version__}')

fatal: destination path 'litbank' already exists and is not an empty directory.
Using spaCy version 3.2.1


In [4]:
from pathlib import Path
entities_path = Path.cwd() / 'litbank' / 'entities' / 'brat'

text_files = [f for f in entities_path.iterdir() if f.suffix == '.txt']
assert len(text_files) == 100
print(f'[*] imported {len(text_files)} files')

[*] imported 100 files


In [5]:
# for each file, create a Doc object and add the annotation data to doc.ents
# our output is a list of Doc objects 
import spacy 
from tqdm.notebook import tqdm
from spacy.tokens import Span, DocBin
from spacy.util import filter_spans


docs = []

# Note: use a blank pipeline; a pretrained model would add its own predictions, and we only want the LitBank annotations
nlp = spacy.blank("en")

for text_file in tqdm(text_files):
    doc = nlp(text_file.read_text())
    annotation_file = entities_path / (text_file.stem + '.ann')
    annotations = annotation_file.read_text().splitlines()
    ents = []
    for annotation in annotations:
        if not annotation.strip():
            continue  # skip blank lines, e.g. a trailing newline at the end of the file
        # each brat line looks like: ID <tab> LABEL START END <tab> TEXT
        label, start, end = annotation.split('\t')[1].split()
        span = doc.char_span(int(start), int(end), label=label)
        if span is not None:  # char_span returns None when the offsets do not align with token boundaries
            ents.append(span)
    
    filtered = filter_spans(ents)
    doc.ents = filtered
    docs.append(doc)
    

assert len(docs) == 100

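The loop above assumes every line of an `.ann` file is a simple text-bound brat annotation of the form `ID<TAB>LABEL START END<TAB>TEXT`. As a minimal sketch of a more defensive parser (the names `BratEntity` and `parse_brat_line` are hypothetical helpers, not part of spaCy or LitBank), lines that do not fit that shape could be skipped instead of raising an error:

```python
from typing import NamedTuple, Optional


class BratEntity(NamedTuple):
    label: str
    start: int
    end: int


def parse_brat_line(line: str) -> Optional[BratEntity]:
    """Parse one text-bound brat annotation such as 'T1\\tPER 0 9\\tElizabeth'.

    Returns None for anything that does not fit that shape (assumption:
    entity lines have an ID starting with 'T' and a single contiguous span).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2 or not fields[0].startswith("T"):
        return None
    parts = fields[1].split()
    if len(parts) != 3:
        return None  # e.g. a discontinuous span like 'LOC 3 8;12 20'
    label, start, end = parts
    if not (start.isdigit() and end.isdigit()):
        return None
    return BratEntity(label, int(start), int(end))


# Hypothetical example line, not taken from the LitBank files
example = parse_brat_line("T1\tPER 0 9\tElizabeth")
print(example)
```

Inside the notebook loop, the result could feed `doc.char_span(entity.start, entity.end, label=entity.label)` whenever it is not `None`.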

In [6]:
# Split the data into sets for training and validation 
from sklearn.model_selection import train_test_split

# fixed seed (arbitrary value) so the same split is produced on every run
train_set, validation_set = train_test_split(docs, test_size=0.2, random_state=42)
print(f'Created {len(train_set)} training docs')
print(f'Created {len(validation_set)} validation docs')

Created 80 training docs
Created 20 validation docs
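With only 20 validation documents, a random split can leave some entity labels thinly represented, which makes later evaluation scores harder to interpret. A small sketch for checking this is below; `label_counts` is a hypothetical helper, and the stand-in objects exist only so the snippet runs without the LitBank data.

```python
from collections import Counter
from types import SimpleNamespace


def label_counts(docs):
    """Count entity labels across docs; each doc needs .ents whose items have .label_."""
    return Counter(ent.label_ for doc in docs for ent in doc.ents)


# Stand-in objects (assumption: same attribute names as spaCy Doc and Span).
# In the notebook you would pass train_set and validation_set instead.
fake_doc = SimpleNamespace(
    ents=[
        SimpleNamespace(label_="PER"),
        SimpleNamespace(label_="PER"),
        SimpleNamespace(label_="LOC"),
    ]
)
demo_counts = label_counts([fake_doc])
print(demo_counts.most_common())
```

Comparing `label_counts(train_set)` with `label_counts(validation_set)` shows whether any label is missing or rare in the validation data.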


In [7]:
# Add training Docs to DocBin and store to disk
from spacy.tokens import DocBin

# the DocBin will store the training documents
train_db = DocBin()
for doc in train_set:
    train_db.add(doc)
train_db.to_disk("./train.spacy")

In [8]:
# Save the validation Docs to disk 
validation_db = DocBin()
for doc in validation_set:
    validation_db.add(doc)
    
validation_db.to_disk("./validate.spacy")

In [9]:
!ls -al train.spacy validate.spacy

-rw-rw-r-- 1 ds ds 1240441 Dec 15 08:57 train.spacy
-rw-rw-r-- 1 ds ds  334587 Dec 15 08:57 validate.spacy
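The outline's Matcher step is not implemented in this notebook. One possible baseline, sketched here under assumptions, collects the entity strings seen in the training docs and looks them up with spaCy's `PhraseMatcher`. Nothing is learned, so entities that never appear in the training set will be missed. The function names and the example sentences are hypothetical.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans


def build_phrase_matcher(nlp, train_docs):
    """Create a PhraseMatcher with one rule per label, built from training entity texts."""
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
    texts_by_label = {}
    for doc in train_docs:
        for ent in doc.ents:
            texts_by_label.setdefault(ent.label_, set()).add(ent.text)
    for label, texts in texts_by_label.items():
        matcher.add(label, [nlp.make_doc(text) for text in sorted(texts)])
    return matcher


def match_entities(nlp, matcher, text):
    """Tokenize text and set doc.ents from the matcher hits (longest span wins on overlap)."""
    doc = nlp.make_doc(text)
    spans = [
        Span(doc, start, end, label=nlp.vocab.strings[match_id])
        for match_id, start, end in matcher(doc)
    ]
    doc.ents = filter_spans(spans)
    return doc


# Tiny illustration with made-up text; in the notebook, pass train_set instead
nlp = spacy.blank("en")
train_doc = nlp.make_doc("Elizabeth went to London.")
train_doc.ents = [
    train_doc.char_span(0, 9, label="PER"),
    train_doc.char_span(18, 24, label="GPE"),
]
matcher = build_phrase_matcher(nlp, [train_doc])
matched = match_entities(nlp, matcher, "Later elizabeth left London again.")
print([(ent.text, ent.label_) for ent in matched.ents])
```

Running `match_entities` over the text of each validation doc and comparing the result with its gold `doc.ents` gives a rough picture of what pure string matching misses.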


In [6]:
!python3 -m spacy init config ./config.cfg --lang en --pipeline ner


✘ Unknown command: init
Available: download, link, info, train, pretrain, debug-data, evaluate, convert,
package, init-model, profile, validate
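The commands listed in this error (`link`, `init-model`, `debug-data`) match the spaCy v2 command line, while the kernel reported version 3.2.1 earlier. A likely explanation, inferred from the output and not confirmed, is that `python3` in the shell resolves to a different environment than the notebook kernel. A minimal sketch of one workaround is to build the command from `sys.executable`, so the CLI runs under the same interpreter as the notebook; `spacy_cli` is a hypothetical helper.

```python
import sys


def spacy_cli(*args):
    """Build a spaCy CLI command that uses the interpreter running this notebook."""
    return [sys.executable, "-m", "spacy", *args]


init_cmd = spacy_cli(
    "init", "config", "./config.cfg", "--lang", "en", "--pipeline", "ner"
)
print(" ".join(init_cmd))

# To actually run it from the notebook (not executed in this sketch):
# import subprocess
# subprocess.run(init_cmd, check=True)
```

In a notebook cell, `!{sys.executable} -m spacy init config ./config.cfg --lang en --pipeline ner` has the same effect.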



In [5]:
# inspect the new config.cfg file 
!cat config.cfg

cat: config.cfg: No such file or directory


In [None]:
#!python3 -m spacy train ./config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./validate.spacy