# spaCy NER

NER pipeline for detecting `Dataset` entities. The training and evaluation annotations were created by loading the `.jsonl` labels into the Prodigy database and exporting them with the [data-to-spacy](https://prodi.gy/docs/recipes#data-to-spacy) recipe in the Prodigy CLI.

## Sources:
- bibliofake_20220309_v4 (2,005 labels)
- s2orc_20220309_v4 (2,005 labels)
- paperpile_20220309_v4 (2,004 labels)

NER inputs (80/20 split):
- 4,812 training examples (ner/train.spacy)
- 1,202 evaluation examples (ner/dev.spacy)
- custom configuration file (/config.cfg)
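The example counts above can be checked with a little arithmetic. The sketch below is not Prodigy's implementation; it assumes the 20% evaluation share is taken from each source separately and rounded down, which happens to reproduce the 4,812 / 1,202 totals. The helper name `split_source` and the placeholder examples are hypothetical.

```python
import random

# Label counts per source, as listed under "Sources".
SOURCE_SIZES = {
    "bibliofake_20220309_v4": 2005,
    "s2orc_20220309_v4": 2005,
    "paperpile_20220309_v4": 2004,
}


def split_source(examples, eval_percent=20, seed=0):
    """Shuffle one source and hold out eval_percent of it (rounded down)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = len(shuffled) * eval_percent // 100  # integer math avoids float rounding
    return shuffled[n_eval:], shuffled[:n_eval]


train, dev = [], []
for name, size in SOURCE_SIZES.items():
    # Placeholder examples: (source name, index) pairs stand in for annotations.
    examples = [(name, i) for i in range(size)]
    source_train, source_dev = split_source(examples)
    train.extend(source_train)
    dev.extend(source_dev)

print(len(train), len(dev))  # 4812 1202
```

A single 80/20 cut over all 6,014 labels would give 4,811 training examples when rounded down, so the per-source assumption is one plausible explanation for the reported numbers, not a confirmed one.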

In [1]:
import spacy
from spacy import displacy

print(spacy.__version__)
print(spacy.require_gpu())

3.1.1
True


## Load configuration settings

In [2]:
!python -m spacy init fill-config ../archive/tuning-config.cfg

[paths]
train = "../ner/train.spacy"
dev = "../ner/dev.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[component
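The printed config is an INI-style file whose dotted section names (for example `components.ner.model`) describe nested settings, and whose values are JSON-like. spaCy loads it with its own config system; the sketch below uses only the standard library to inspect a fragment copied from the output above, and the helper name `parse_fragment` is hypothetical.

```python
import configparser
import json

# Fragment copied from the filled config printed above.
CONFIG_FRAGMENT = """
[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null
"""


def parse_fragment(text):
    """Parse an INI-style fragment and decode each value as JSON."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. "nO"
    parser.read_string(text)
    return {
        section: {key: json.loads(value) for key, value in parser[section].items()}
        for section in parser.sections()
    }


settings = parse_fragment(CONFIG_FRAGMENT)
print(settings["nlp"]["pipeline"])                       # ['transformer', 'ner']
print(settings["components.ner.model"]["hidden_width"])  # 64
```

This is only meant for quick inspection; it does not resolve registered functions such as `@architectures` or variable references the way spaCy's loader does.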

## Train and save NER model
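
The training run in this section prints a `huggingface/tokenizers` warning that suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch of launching the same command from Python with that variable set is shown here; the helper names `build_train_command`, `training_env`, and `run_training` are hypothetical, and the paths are the ones used in the cell in this section.

```python
import os
import subprocess


def build_train_command(config, output, train, dev, gpu_id=0):
    """Assemble the `spacy train` invocation as an argument list."""
    return [
        "python", "-m", "spacy", "train", config,
        "--output", output,
        "--paths.train", train,
        "--paths.dev", dev,
        "--gpu-id", str(gpu_id),
    ]


def training_env(base_env=None):
    """Copy the environment and disable tokenizers parallelism explicitly."""
    env = dict(os.environ if base_env is None else base_env)
    env["TOKENIZERS_PARALLELISM"] = "false"
    return env


def run_training(command):
    """Run the training command; raises CalledProcessError on failure."""
    subprocess.run(command, env=training_env(), check=True)


command = build_train_command(
    "../config.cfg", "../corpus/", "../ner/train.spacy", "../ner/dev.spacy"
)
print(" ".join(command))
# Call run_training(command) to start training (needs spaCy and a GPU).
```

Passing the variable through `env` keeps the setting local to the training subprocess instead of changing the notebook's own environment.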

In [3]:
!python -m spacy train ../config.cfg --output ../corpus/ --paths.train ../ner/train.spacy --paths.dev ../ner/dev.spacy --gpu-id 0

ℹ Using GPU: 0
[2022-05-30 20:42:55,789] [INFO] Set up nlp object from config
[2022-05-30 20:42:57,628] [INFO] Pipeline: ['transformer', 'ner']
[2022-05-30 20:42:57,631] [INFO] Created vocabulary
[2022-05-30 20:42:57,632] [INFO] Finished initializing nlp object
[2022-05-30 20:43:07,771] [INFO] Initialized pipeline components: ['transformer', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
wandb: Currently logged in as: lafias. Use `wandb login --relogin` to force relogin
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disa