# Fine-tuning a spaCy NER Model with arXiv Summaries
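This notebook trains a spaCy NER model on the scientific summaries you've extracted. Since the summaries don't have pre-labeled entities, we'll use a two-step approach:
1. Use an existing spaCy model (`en_core_web_lg`) to pre-annotate entities. These are model predictions, not hand-checked labels.
2. Train an NER pipeline on these annotations. Note that the config below defines a fresh `tok2vec` + `ner` pipeline rather than updating the weights of `en_core_web_lg`.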
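
To make the annotation format concrete, here is a minimal sketch of one training example in the `(text, {"entities": [...]})` shape used later in the notebook. The sentence, the labels, and the helper `make_annotation` are hypothetical and written by hand for illustration; in the notebook the offsets come from the pre-annotating model instead.

```python
def make_annotation(text, phrases):
    """Build a (text, {"entities": [...]}) tuple from (phrase, label) pairs.

    Offsets are character positions with an exclusive end, matching the
    (start_char, end_char, label) triples the notebook stores.
    """
    entities = []
    for phrase, label in phrases:
        start = text.index(phrase)  # raises ValueError if the phrase is absent
        entities.append((start, start + len(phrase), label))
    return (text, {"entities": entities})


# Hypothetical sentence and labels, for illustration only.
text = "We evaluate the method on data released by Stanford University in 2015."
example = make_annotation(text, [("Stanford University", "ORG"), ("2015", "DATE")])

# Slicing the text with the stored offsets should give back the entity string.
for start, end, label in example[1]["entities"]:
    print(f"{text[start:end]!r} [{start}:{end}] -> {label}")
```

Slicing the text with each stored offset pair is a cheap way to confirm that the annotations still line up with the text they belong to.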

In [15]:
import pandas as pd
import spacy
import random
from spacy.tokens import DocBin
from pathlib import Path
import subprocess

# 1. Load the data
print("Loading dataset...")
summaries_df = pd.read_csv('data/arXiv_summaries_3000.csv')
# Display a few examples
print(f"Loaded {len(summaries_df)} summaries")
print("\nExample summaries:")
summaries_df.head()

Loading dataset...
Loaded 3000 summaries

Example summaries:


Unnamed: 0,summary
0,"We present a general, consistency-based framew..."
1,In this paper we present a transformation of f...
2,We consider the integration of existing cone-s...
3,"We introduce Ak, an extension of the action de..."
4,Evolutionary artificial neural networks (EANNs...


In [None]:
# 2. Pre-annotate data using an existing model
print("\nPre-annotating entities using existing model...")

# Load the large pretrained English model, downloading it first if needed
try:
    nlp = spacy.load("en_core_web_lg")
    print("Using existing en_core_web_lg model")
except OSError:
    print("Model not found. Installing en_core_web_lg...")
    # spacy.cli.download installs the model package itself,
    # so it should not be wrapped in subprocess.check_call
    from spacy.cli import download
    download("en_core_web_lg")
    nlp = spacy.load("en_core_web_lg")
    print("Successfully installed and loaded en_core_web_lg model")


Pre-annotating entities using existing model...
Using existing en_core_web_lg model


In [None]:
# Create training data
TRAIN_DATA = []

# Process in batches with nlp.pipe, which is faster than calling nlp() per text
batch_size = 100
for i in range(0, len(summaries_df), batch_size):
    batch = summaries_df['summary'][i:i+batch_size].tolist()
    for doc in nlp.pipe(batch):
        # Store each predicted entity as (start_char, end_char, label)
        entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
        TRAIN_DATA.append((doc.text, {"entities": entities}))
    print(f"Processed {min(i+batch_size, len(summaries_df))}/{len(summaries_df)} summaries")

print(f"\nCreated {len(TRAIN_DATA)} training examples")

Processed 100/3000 summaries
Processed 200/3000 summaries
Processed 300/3000 summaries
Processed 400/3000 summaries
Processed 500/3000 summaries
Processed 600/3000 summaries
Processed 700/3000 summaries
Processed 800/3000 summaries
Processed 900/3000 summaries
Processed 1000/3000 summaries
Processed 1100/3000 summaries
Processed 1200/3000 summaries
Processed 1300/3000 summaries
Processed 1400/3000 summaries
Processed 1500/3000 summaries
Processed 1600/3000 summaries
Processed 1700/3000 summaries
Processed 1800/3000 summaries
Processed 1900/3000 summaries
Processed 2000/3000 summaries
Processed 2100/3000 summaries
Processed 2200/3000 summaries
Processed 2300/3000 summaries
Processed 2400/3000 summaries
Processed 2500/3000 summaries
Processed 2600/3000 summaries
Processed 2700/3000 summaries
Processed 2800/3000 summaries
Processed 2900/3000 summaries
Processed 3000/3000 summaries

Created 3000 training examples
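
Before training on these pre-annotations, it can help to see which labels dominate and how many summaries got no entities at all. The sketch below assumes data in the same `(text, {"entities": [...]})` shape as `TRAIN_DATA`; `label_counts`, `count_empty`, and `sample_data` are hypothetical names and values used only for illustration.

```python
from collections import Counter


def label_counts(data):
    """Count how often each entity label occurs across all examples."""
    counts = Counter()
    for _text, annot in data:
        counts.update(label for _start, _end, label in annot["entities"])
    return counts


def count_empty(data):
    """Count examples that carry no entity annotations."""
    return sum(1 for _text, annot in data if not annot["entities"])


# Hypothetical stand-in for TRAIN_DATA, written by hand.
sample_data = [
    ("Deep learning at MIT in 2019.", {"entities": [(17, 20, "ORG"), (24, 28, "DATE")]}),
    ("Results from Google improved.", {"entities": [(13, 19, "ORG")]}),
    ("No entities here.", {"entities": []}),
]

for label, n in label_counts(sample_data).most_common():
    print(f"{label}: {n}")
print(f"Examples without entities: {count_empty(sample_data)}")
```

In the notebook, the same two calls could be made on `TRAIN_DATA` directly.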


In [21]:
# 3. Split into training and evaluation sets
random.shuffle(TRAIN_DATA)
split = int(len(TRAIN_DATA) * 0.8)  # 80% train, 20% eval
train_data = TRAIN_DATA[:split]
eval_data = TRAIN_DATA[split:]

print(f"Training set: {len(train_data)} examples")
print(f"Evaluation set: {len(eval_data)} examples")

Training set: 2400 examples
Evaluation set: 600 examples
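
The shuffle above is unseeded, so each run gives a different split. A minimal sketch of a reproducible alternative follows; the helper `split_data` and its default `seed=0` are assumptions, not part of the original notebook.

```python
import random


def split_data(data, train_fraction=0.8, seed=0):
    """Shuffle a copy of the data with a fixed seed and split it in two."""
    rng = random.Random(seed)  # local generator, leaves the global state alone
    shuffled = list(data)      # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integers stand in for the 3000 training examples.
items = list(range(3000))
train, dev = split_data(items)
print(f"Training set: {len(train)} examples")
print(f"Evaluation set: {len(dev)} examples")
```

With a fixed seed, evaluation scores from different training runs refer to the same held-out examples.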


In [22]:
# 4. Convert to spaCy's binary format
def convert_to_spacy(data, output_path):
    nlp = spacy.blank("en")  # Create blank Language class
    db = DocBin()  # Create a DocBin object
    
    for text, annot in data:
        doc = nlp.make_doc(text)  # Create Doc object
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is not None:  # Some spans might be invalid
                ents.append(span)
        doc.ents = ents  # Add entities to Doc
        db.add(doc)
    
    db.to_disk(output_path)  # Save to disk
    print(f"Saved {len(data)} examples to {output_path}")

# Create output directory if it doesn't exist
Path("./corpus").mkdir(parents=True, exist_ok=True)

# Convert and save data
convert_to_spacy(train_data, "./corpus/train.spacy")
convert_to_spacy(eval_data, "./corpus/eval.spacy")

Saved 2400 examples to ./corpus/train.spacy
Saved 600 examples to ./corpus/eval.spacy
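
Assigning overlapping spans to `doc.ents` raises a `ValueError` in spaCy. The annotations here come from a single model's `doc.ents`, so they should not overlap, but that can change if annotations from several sources are merged. The sketch below shows one way to resolve overlaps on plain offset triples by preferring longer spans; `drop_overlaps` is a hypothetical helper, and for `Span` objects spaCy provides `spacy.util.filter_spans`.

```python
def drop_overlaps(entities):
    """Keep the longest spans and drop any span that overlaps a kept one.

    `entities` is a list of (start_char, end_char, label) triples with an
    exclusive end offset. The result is sorted by start offset.
    """
    kept = []
    # Visit the longest spans first; ties are broken by the earlier start.
    by_length = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    for start, end, label in by_length:
        # Two spans are disjoint if one ends before the other begins.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Hypothetical offsets: the first two spans overlap on characters 3-4.
spans = [(0, 5, "ORG"), (3, 10, "PERSON"), (12, 15, "DATE")]
print(drop_overlaps(spans))
```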


In [None]:
# 5. Create config file for training
config = """
[paths]
train = "./corpus/train.spacy"
dev = "./corpus/eval.spacy"
vectors = null
[system]
gpu_allocator = null
seed = 0
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]
batch_size = 1000
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["LOWER", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 1000, 1000, 2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = true
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
[pretraining]
[initialize]
vectors = null
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.tokenizer]
"""
# Write config file
with open("./config.cfg", "w") as f:
    f.write(config)

In [None]:
# 6. Train the model (showing commands - run in terminal)
print("\nTo train the model, run the following commands in your terminal:\n")
print("cd /Users/benny/Projects/NER-SLU-Library/training")
print("python -m spacy train config.cfg --output ./output")


To train the model, run the following commands in your terminal:

cd /Users/benny/Projects/NER-SLU-Library/training
python -m spacy train config.cfg --output ./output


In [48]:
# 7. Load and test the trained model
trained_nlp = spacy.load("./output/model-best")

# testing the trained model
sample_text = """MANILA, Philippines — The methodical handling of the arrest of former President Rodrigo Duterte could redeem the “good image” of the Philippine National Police, said former Sen. Antonio Trillanes IV on Saturday.
In a radio interview, Trillanes praised PNP Chief Gen. Rommel Marbil and Maj. Gen. Nicolas Torre, head of the PNP Criminal Investigation and Detection Group (PNP-CIDG), for showing professionalism even when Duterte’s relatives and lawyers tried to prevent them from implementing the International Criminal Court’s (ICC) arrest order.
“So far, we see (the PNP) as very professional compared to the time of Duterte when (police officers) themselves were involved in killing ordinary Filipinos whom they were supposed to protect,” Trillanes said in the “Usapang Senado” program on dwIZ."""
doc = trained_nlp(sample_text)

print("\nDetected entities:")
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")



Detected entities:
MANILA - ORG
Philippines — The - ORG
President Rodrigo Duterte - ORG
the Philippine National Police - ORG
Sen. Antonio Trillanes IV - ORG
Saturday - GPE
Trillanes - ORG
PNP Chief Gen. Rommel Marbil - ORG
Gen. Nicolas Torre - ORG
PNP Criminal Investigation - ORG
PNP-CIDG - ORG
the International Criminal Court - ORG
PNP - ORG
Filipinos - ORG
Trillanes - ORG


In [47]:
# 8. Visualization of the entities
from spacy import displacy
displacy.render(doc, style="ent")