Use individual spaCy with transformer backbones for all NER models #335

Closed · 5 tasks done · Fixed by #351
FrancescoCasalegno opened this issue Apr 13, 2021 · 17 comments

@FrancescoCasalegno
Contributor

FrancescoCasalegno commented Apr 13, 2021

🚀 Feature

In light of what we discussed in PR #328, and in particular looking at the results shown in this comparison table, we should make the following changes to the NER models in data_and_models/pipelines/ner.

  • All NER models should use a transformer backbone. Currently they use tok2vec.
  • All NER models should initialize the weights of this backbone from the pre-trained weights of CORD19 NLI+STS v1. These weights should not be frozen during fine-tuning on the NER task.
  • There should be one distinct NER model (= spaCy pipeline) for each entity type we support. Note that currently this is not the case, as e.g. model2 is used to extract 3 different entity types (see table here).
  • Unlike the experiments of PR NER with spacy-transformers and our BioBERT NLI+STS CORD-19 #328, the spaCy pipeline should also include the rule-based entity extraction component (Note: for the moment let's keep using add_er.py, then in the future we'll improve that with Migrate add_er to spacy train and config.cfg #310).
  • All the evaluation results (token and entity based, Prec, Rec, F1) obtained before and after this PR should be collected in a table.
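For illustration, here is a minimal sketch of what each such pipeline could look like with spaCy 3.x and spacy-transformers; the checkpoint path `biobert_nli_sts_cord19_v1` is a placeholder for the actual location of the CORD19 NLI+STS v1 weights, not the real value:

```python
import spacy

# Minimal sketch, not the final implementation (spaCy 3.x + spacy-transformers).
# "biobert_nli_sts_cord19_v1" is a placeholder for the actual checkpoint path.
nlp = spacy.blank("en")
nlp.add_pipe(
    "transformer",
    # Pre-trained backbone, fine-tuned during NER training (not frozen).
    config={"model": {"name": "biobert_nli_sts_cord19_v1"}},
)
nlp.add_pipe("ner")  # one pipeline per entity type, so one NER head per model
```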
@pafonta
Contributor

pafonta commented Apr 14, 2021

Regarding this point:

Unlike the experiments of PR #328, the spaCy pipeline should also include the rule-based entity extraction component.

There is already an existing issue about this: #310.

@FrancescoCasalegno
Contributor Author

Once this issue is addressed, the following action should also be taken (see also PR #345).

  • remove from requirements
    • en_ner_bc5cdr_md
    • en_ner_bionlp13cg_md
    • en_ner_craft_md
    • en_ner_jnlpba_md

@EmilieDel EmilieDel added this to To do in Sprint 6 Apr 20, 2021
@pafonta pafonta self-assigned this Apr 22, 2021
@pafonta pafonta moved this from To do to In progress in Sprint 6 Apr 22, 2021
@pafonta pafonta removed their assignment Apr 22, 2021
@pafonta pafonta moved this from In progress to To do in Sprint 6 Apr 22, 2021
@pafonta
Contributor

pafonta commented Apr 22, 2021

Hello @FrancescoCasalegno,

Before moving further with this task, I think some points should be clarified:

  1. Are all the checkboxes in the issue description expected to be done in this issue, and therefore in one pull request?
    That is a lot of work. The current issue could be split into several incremental pieces of work. As noted in Use individual spaCy with transformer backbones for all NER models #335 (comment), there was for example already an issue for one point.

  2. Should this issue also deal with migrating all the NER elements in BBS from using one model for several entities to using one model per entity?
    It is not clear at the moment how much extra work this task, which is not listed in the issue description, could require (e.g. moving away from the ee_models_library.csv logic).

  3. Should this issue also deal with building the new train / dev sets with only one entity per set?
    Currently, 3 out of 5 annotation sets have multiple entities instead of one.

  4. Are the new models expected to use part of the scispaCy elements used currently? Or should it be a brand new ["transformer", "ner"] pipeline like in NER with spacy-transformers and our BioBERT NLI+STS CORD-19 #328?
    The currently used elements from scispaCy are: a custom tokenizer, a custom vocabulary, and several trained components (tagger, attribute_ruler, lemmatizer, parser, ner).

  5. Shouldn't the issue be started only after the conclusions of some experiments, like Test reproducibility of spaCy training #343 and Compare runtimes of spaCy NER pipelines using CPU and GPU #337?
    Indeed, if the results are not reproducible on GPU (#343), this will impact the current issue (e.g. should accuracy be favoured over reproducibility? should the tracking of the NER models with DVC be stopped?). Besides, if the runtime on CPU with a transformer component (#337) has increased too much compared to before, BBS would need to be adapted (e.g. by switching to inference on GPU instead of CPU).

  6. Should we decide on a convention for the new model names?
    Currently they are model1, ..., model5, but this issue wants to create one model per entity. Some entities (e.g. cell compartment) have spaces in their names.

  7. Should this issue also deal with replacing the miniconda-based pipeline/ner/Dockerfile with one that has GPU support?
    Indeed, GPU support is required for fine-tuning the transformer component.

@pafonta
Contributor

pafonta commented Apr 23, 2021

All points above, except the following one, are independent of implementation constraints.

For the following point, some implementation constraints could apply (tested on spaCy 3.0.5).

  1. Are the new models expected to use part of the scispaCy elements used currently? Or should it be a brand new ["transformer", "ner"] pipeline like in NER with spacy-transformers and our BioBERT NLI+STS CORD-19 #328?
    The currently used elements from scispaCy are: a custom tokenizer, a custom vocabulary, and several trained components (tagger, attribute_ruler, lemmatizer, parser, ner).

I have investigated. Technically, it is possible to reuse the following from scispaCy:

  • the custom tokenizer,
  • the custom vocabulary,
  • the trained tagger, attribute_ruler, lemmatizer, and parser components.

The trained ner component cannot be reused, but that is expected: we use a different type of embedding layer. scispaCy uses a spacy.Tok2Vec, while this ticket is about using a spacy-transformers.TransformerModel.
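For reference, a minimal sketch of how this reuse can be done in spaCy 3.x, assuming en_ner_bc5cdr_md is installed (details may differ from the actual experiment):

```python
import spacy

# Sketch of the reuse described above.
source = spacy.load("en_ner_bc5cdr_md")  # scispaCy model

nlp = spacy.blank("en", vocab=source.vocab)  # reuse the custom vocabulary
nlp.tokenizer = source.tokenizer             # reuse the custom tokenizer
for name in ("tagger", "attribute_ruler", "lemmatizer", "parser"):
    nlp.add_pipe(name, source=source)        # copy the trained components
# The trained "ner" component is deliberately not sourced: it expects a
# spacy.Tok2Vec embedding layer, not a spacy-transformers.TransformerModel.
```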

➡️ I will post on Monday a performance (entity F1) comparison between reusing or not the scispaCy elements.

@pafonta
Contributor

pafonta commented Apr 26, 2021

➡️ I will post on Monday a performance (entity F1) comparison between reusing or not the scispaCy elements.

TL;DR

In the context of the experiment (see Methods below), re-using scispaCy components gives lower F1 and precision but higher recall.

Results

Pipeline B is pipeline A plus the tokenizer, the vocabulary, the tagger, the attribute_ruler, the lemmatizer, and the parser from scispaCy.

[Image: score comparison table "reusing scispaCy parts" for pipelines A and B]

Methods

Performance scores are from pipelines/ner/eval.py. Reported scores are averaged over two training runs.

Both pipelines use the same train, dev, and test sets and Transformer checkpoint.

Pipeline B uses en_ner_bc5cdr_md as source for the scispaCy components.

@FrancescoCasalegno
Contributor Author

FrancescoCasalegno commented Apr 27, 2021

1. Are all the checkboxes in the issue description expected to be done in this issue?

Let's try to do so, unless some logical steps can be isolated (see also next point).

2. Should this issue also deal with migrating all the NER elements in BBS from using one model for several entities to using one model per entity?

Let's try to keep using ee_models_library.csv even though it will be redundant and trivial (one line per entity type).
Then, let's work on a separate issue to get rid of this file, which is indeed becoming obsolete → see Issue #349.

3. Should this issue also deal with building the new train / dev sets with only one entity per set?

Let's first try to implement the following points (see the sketch below for the per-entity split):

  • one train JSONL file per entity type (also a test JSONL file?)
  • preprocess step split in 2: train_valid_split + convert_jsonl_spacy

Then request a review, and once it's OK, don't merge into master but keep working to add the remaining features.
Why? Because this is the only way to bring master from one valid, complete state to another with the same properties ("consistency").
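Here is a minimal sketch of the per-entity split mentioned above (the function name, label list, and output naming are hypothetical):

```python
import json

def split_by_label(path, labels):
    # Split a multi-entity annotation JSONL into one JSONL per entity type,
    # keeping only the spans of the given label in each output file.
    for label in labels:
        with open(path) as fin, open(f"train_{label.lower()}.jsonl", "w") as fout:
            for line in fin:
                annotation = json.loads(line)
                spans = [s for s in annotation.get("spans") or [] if s["label"] == label]
                fout.write(json.dumps({**annotation, "spans": spans}) + "\n")

split_by_label("annotations.jsonl", ["DISEASE", "CHEMICAL"])
```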

4. Are the new models expected to use part of the scispaCy elements used currently? Or should it be a brand new ["transformer", "ner"] pipeline like in #328?

Based on the results shown here and the discussion with the team, let's stick with a brand new pipeline. This is easier and also seems to provide better F1 score.

5. Shouldn't the issue be started only after the conclusions of some experiments, like #343 and #337?

Yes, let's wait for #343 and #337 to be done!

6. Should we decide on a convention for the new model names?

The convention we agreed upon is model-<entity_type_name>; for instance, we could have model-cell_compartment.
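Expressed as a small (hypothetical) helper, the convention would be:

```python
def model_name(entity_type):
    # Convention: "model-<entity_type_name>", with spaces replaced by underscores.
    return "model-" + entity_type.lower().replace(" ", "_")

assert model_name("cell compartment") == "model-cell_compartment"
```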

7. Should this issue also deal with replacing the miniconda-based pipeline/ner/Dockerfile with one that has GPU support?

How long do the train_xxx phases take if run on CPU?
I would say we should avoid assuming that the user has access to a GPU, even if this comes at the cost of slightly longer runtimes. In any case, let's also wait for the results of #343 and #337.

@FrancescoCasalegno FrancescoCasalegno added this to To do in Sprint 7 Apr 27, 2021
@pafonta pafonta self-assigned this Apr 28, 2021
@pafonta
Contributor

pafonta commented Apr 28, 2021

How long do the train_xxx phases take if run on CPU?

With the exact same settings, training on CPU (80 cores) is about 4.73 times slower than training on GPU (1 GPU).

It takes 2.840 s / step on CPU versus 0.6005 s / step on GPU (2.840 / 0.6005 ≈ 4.73).

@pafonta
Contributor

pafonta commented Apr 29, 2021

UPDATE Create one JSONL per entity from the JSONLs containing several entities.

While working on this task (see the update above), I have noticed several issues with the annotations:

  1. Some texts are ignored. Should they then be deleted?
  2. Some texts are ignored but have spans. Should they be checked and then deleted?
  3. Some annotation files have duplicated _input_hash values (see the sketch below). Should they be checked, keeping only one of each duplicate?
  4. Some annotation files have labels with inconsistent case. Should the case be normalized?
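For point 3, a minimal sketch of how such duplicates can be detected (field names follow the Prodigy JSONL format shown later in this thread):

```python
import json
from collections import Counter

def duplicated_hashes(path):
    # Return the _input_hash values appearing more than once in an annotation file.
    with open(path) as f:
        counts = Counter(json.loads(line)["_input_hash"] for line in f)
    return [h for h, n in counts.items() if n > 1]
```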

@FrancescoCasalegno FrancescoCasalegno added this to the v0.2.0 milestone Apr 30, 2021
@pafonta
Contributor

pafonta commented Apr 30, 2021

UPDATE Automate analysis of annotations

I have written a script (analyze.py) to analyze annotations in order to clean them up for training NER models. The code can be used either as a script or as a function. As a script, it analyzes an individual annotation file. As a function, it analyzes several annotation files and collects the results.

The goal is to reuse the code in the future to analyze new annotations. The code is added to the PR in 020d6f4. There is an extensive documentation inside the file.

Note for both sections below:
Valid texts are texts which are accepted and which contain spans (i.e. entities).
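In code, this definition corresponds to something like the following (field names as in the Prodigy exports shown below):

```python
def is_valid(annotation):
    # "Valid": accepted by the annotator AND containing at least one span (entity).
    return annotation.get("answer") == "accept" and bool(annotation.get("spans"))
```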

Here is the summary table created with the code as a function:

for all the annotation files used for training and evaluation
| | total texts | ignored texts w/o spans | ignored texts w/ spans | accepted texts w/o spans | accepted texts w/ spans | accepted texts w/ spans (%) | duplicated `_input_hash` | accepted texts w/ multiple labels | labels needing normalization |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| annotations5 | 207 | 19 | 2 | 114 | 72 | 34.78 | 1 | 0 | 0 |
| annotations6 | 634 | 104 | 0 | 241 | 289 | 45.58 | 4 | 24 | 0 |
| annotations9 | 131 | 7 | 0 | 45 | 79 | 60.31 | 0 | 31 | 0 |
| annotations14 | 383 | 7 | 13 | 57 | 304 | 79.37 | 0 | 35 | 0 |
| annotations15 | 151 | 3 | 0 | 33 | 115 | 76.16 | 0 | 0 | 0 |
| annotations10 | 309 | 3 | 1 | 6 | 299 | 96.76 | 0 | 252 | 7 |
| annotations12 | 23 | 1 | 0 | 2 | 20 | 86.96 | 0 | 20 | 0 |

Here is an example of individual report created by the code as a script:

```
python analyze.py --verbose annotations5_EmmanuelleLogette_2020-06-30_raw2_Disease.jsonl
Analyzing annotations5_EmmanuelleLogette_2020-06-30_raw2_Disease.jsonl...

total texts: 207

ignored texts w/o spans: 19
### INFO ### This might not be expected.
(only the first 10 rows are displayed)
                       text spans
42                        4  None
49   Redelman-Sidi et al. [  None
75                  2006) .  None
78                        †  None
91                       55  None
94                        *  None
95                  2017) .  None
133                     S0.  None
136      2012; Wang et al.,  None
137                2016) ].  None

ignored texts w/ spans: 2
### WARNING ### There might be an issue.
                                                  text               spans
55   Deshalb wurden Bioindikatoren entwickelt, dere...           [DISEASE]
116                      nodeficiency syndrome (AIDS).  [DISEASE, DISEASE]

accepted texts w/o spans: 114
### INFO ### This might not be expected.
(only the first 10 rows are displayed)
                                                 text spans
2   This analysis and report was approved by the I...  None
3   Before their postmortem examinations, all the ...  None
5   In a contemporary study, it was revealed that ...  None
7   Clinical information is indispensable and micr...  None
9   Although the beneficial effects of EP4 agonist...  None
10  Nineteen patients (95%) had fourfold or greate...  None
11  Consequently, extensive shotgun sequencing may...  None
12  Pairwise distance among coronaviruses (CoVs), ...  None
13  In the cranial concept, during cranial externa...  None
14  The PanAd3-based vaccine was tested for induct...  None

accepted texts w/ spans: 72
(only the first 10 rows are displayed)
                                                 text                                 spans
0   Diarrhea can develop as a result of C. parvum ...           [DISEASE, DISEASE, DISEASE]
1   Neonates with febrile pneumonia generally shou...                             [DISEASE]
4   6, 7 Set against the interconnectedness of onl...  [DISEASE, DISEASE, DISEASE, DISEASE]
6   HIV-1, the main cause of acquired immune defic...                    [DISEASE, DISEASE]
8   Furthermore, rigorous selection is also effect...                    [DISEASE, DISEASE]
16  However, detection and response capacity of CO...                             [DISEASE]
26  Methods: All patients with the diagnosis of pr...                    [DISEASE, DISEASE]
32  Those results suggest that the enzyme plays a ...                    [DISEASE, DISEASE]
36  GIVA cPLA 2 has been shown to increase and med...                    [DISEASE, DISEASE]
37  Taken together, the observations suggest that ...                             [DISEASE]

accepted texts w/ spans (%): 34.78

duplicated '_input_hash': 1
### CRITICAL ### There is an issue to be investigated.
    text                                      meta  _input_hash  _task_hash                                          tokens spans  _session_id    _view_id  answer
21     (  {'source': 'CORD-19 - sentence 6254790'}   -996919429  -247505887  [{'text': '(', 'start': 0, 'end': 1, 'id': 0}]  None          NaN  ner_manual  accept
197    (     {'source': 'CORD-19 - sentence 3673'}   -996919429  -247505887  [{'text': '(', 'start': 0, 'end': 1, 'id': 0}]  None          NaN  ner_manual  ignore

accepted texts w/ multiple labels: 0
Empty DataFrame
Columns: [text, spans, spans_count]
Index: []

labels needing normalization: 0

labels before normalization:
DISEASE 142

labels after normalization:
DISEASE 142

...analyzed annotations5_EmmanuelleLogette_2020-06-30_raw2_Disease.jsonl
```

@pafonta
Contributor

pafonta commented May 3, 2021

UPDATE Analyze the annotations to define cleaning rules (for the next step).

Here is what I found:

Case 1 - Ignored texts (w/o and w/ spans)

Causes:

  • sentence selection (e.g. annotations5)
    • sentences are not sentences
    • sentences are not in English
  • annotation guidelines (e.g. annotations14)
    • texts have been annotated but are ignored

Case 2 - Duplicated _input_hash

Potential causes:

  • Prodigy import/export

Observations:

  • the 11 texts are single-character texts
  • not all texts are ignored (e.g. annotations5)

Case 3 - Inconsistent case for labels

Causes:

  • annotation session setup

Example from annotations10 (numbers are label counts):

```
CELL_TYPE  66
CHEMICAL 100
CONDITION  59
DISEASE 241
ORGANISM 174
PATHWAY  97
PROTEIN 228

cell_type  30
chemical  26
condition  49
disease  76
organism  25
pathway  38
protein  30
```
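A normalization pass could be as simple as upper-casing the span labels; a sketch, assuming this is the rule we want:

```python
def normalize_labels(annotation):
    # Merge label variants differing only by case, e.g. "disease" into "DISEASE".
    for span in annotation.get("spans") or []:
        span["label"] = span["label"].upper()
    return annotation
```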

@FrancescoCasalegno FrancescoCasalegno added this to In progress in Sprint 8 May 4, 2021
@pafonta
Contributor

pafonta commented May 6, 2021

Hello @FrancescoCasalegno,

The evaluation is crashing. After investigating, I found that the bug was introduced by #348.

Would you know why?

The bug:

```
Traceback (most recent call last):
  File "eval.py", line 111, in <module>
    main()
  File "eval.py", line 107, in main
    json.dump(all_metrics_dict, f)
  File "/opt/conda/lib/python3.8/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/opt/conda/lib/python3.8/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/opt/conda/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/opt/conda/lib/python3.8/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/opt/conda/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type int64 is not JSON serializable
```
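For context, the error boils down to numpy scalars not being JSON serializable; the float(v) cast mentioned in the comments below is one workaround (a sketch with dummy values):

```python
import json

import numpy as np

metrics = {"entity_f1": np.float64(0.87), "support": np.int64(142)}  # dummy values

# json.dump(metrics, f) raises: TypeError: Object of type int64 is not JSON serializable

# Workaround: cast numpy scalars to native Python floats before serializing.
with open("metrics.json", "w") as f:
    json.dump({k: float(v) for k, v in metrics.items()}, f)
```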

To reproduce:

```sh
# For the bug introduced by #348, use 0bb500551b1b7c6f5bb9228335aa4df30a654e9c.
# For the working code __before__ #348, use b9c886966ca4d893b41457a17262e198e3ba7f03.
export COMMIT=...

git clone https://github.com/BlueBrain/Search
cd Search/

# Change <image> and <container>.
docker build -f data_and_models/pipelines/ner/Dockerfile --build-arg BBS_REVISION=$COMMIT -t <image> .
docker run -it --rm -v /raid:/raid --name <container> <image>

git checkout $COMMIT
git checkout -- data_and_models/pipelines/ner/dvc.lock

cd data_and_models/pipelines/ner/
dvc pull --with-deps evaluation@organism
dvc repro -fs evaluation@organism
```

@FrancescoCasalegno
Contributor Author

Hi @pafonta,
#362 should fix it; can you please have a look?

@pafonta
Contributor

pafonta commented May 7, 2021

Hello @FrancescoCasalegno,

I have 1) merged #362 on top of #351 and 2) locally removed the float(v) fix in ner/eval.py.

The evaluation stage worked! Thank you 🎉

NB: For practical reasons, I keep float(v) in ner/eval.py for #351 until #362 is merged.

@FrancescoCasalegno
Contributor Author

@pafonta FYI, #362 was just merged into master 🙂

@pafonta
Contributor

pafonta commented May 7, 2021

Hello @FrancescoCasalegno,

Could you confirm your position on the following points for #351?

  1. Should the training happen on GPU (NVIDIA)?
  2. Should the inference happen on CPU (Intel)?
  3. What revision should be put in ner/Dockerfile? I would recommend v0.2.0a for the reason below.

Regarding point 3:
From my understanding, our current handling of DVC + Docker makes it mandatory to use a version tag in the Dockerfile. Indeed, we cannot know the hash of the merge commit before merging, can we?
Besides, we cannot release v0.2.0 yet as other elements are part of it (cf. https://github.com/BlueBrain/Search/milestone/1).

@FrancescoCasalegno
Contributor Author

FrancescoCasalegno commented May 7, 2021

Hi!

Here are my 2 cents. The following points are what I think we should keep in mind.

  1. We should not release a new version if results are not reproducible.
  2. We currently know (see Test reproducibility of spaCy training #343) that using spaCy 3.x gives non-deterministic results with either (a) a transformer backbone or (b) a GPU.
  3. To fix those 2 issues, we are working on the following:
    a. Created a PR on PyTorch (Ensure torch.save() deterministic output pytorch/pytorch#57536), plus a local PR to patch torch while waiting for the PyTorch PR to be merged (Patch torch.save() to ensure deterministic behavior #359).
    b. Created an issue on spaCy (NER Training on GPU is not reproducible explosion/spaCy#7981).
  4. As long as (b) is not solved, IMO the next release will have to use CPU. I understand that this could take forever with the current settings; that's why we have Adjust patience and other NER training duration parameters #356.
  5. I would also say that, as long as at least one of the 2 PRs proposed for (a) is not merged, this PR cannot be merged either (since results wouldn't be deterministic).

Given those points:

Should the training happen on GPU (NVIDIA)?

Not sure this would be good, because we would need to modify the Dockerfiles of dvc, but most importantly because results wouldn't be deterministic.

Should the inference happen on CPU (Intel)?

I don't think this really matters, or does it? Results should be the same on CPU and GPU, and moreover the runtime should be irrelevant in both cases.

What revision should be put in ner/Dockerfile? I would recommend v0.2.0a for the reason below.

I see our recurring nightmare here 😁 Sure, v0.2.0a is a viable solution; otherwise v0.0.11 is also OK with me.

@pafonta
Contributor

pafonta commented May 7, 2021

Thank you very much for the clarifications!

as long as at least one of the 2 PRs proposed for (a) is not merged, this PR cannot be merged either

So, should we mark #351 as blocked until #359 is merged?

Otherwise, I have updated the TODO list of #351 to:

  • include the switch to training on CPU,
  • use v0.2.0a as tag for the DVC + Docker magic.

Yes, using the revision in the Dockerfile for DVC is a nightmare. Maybe we could come up with something easier.

@Stannislav Stannislav added this to In progress in Sprint 9 May 11, 2021
@FrancescoCasalegno FrancescoCasalegno added this to In progress in Sprint 10 May 25, 2021
Sprint 10 automation moved this from In progress to Done Jun 1, 2021