<a href="https://www.kaggle.com/code/remakia/introduction-to-ner-exercise?scriptVersionId=162906763" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
!pip install jupyter_black
!pip install "edsnlp[ml]==0.10.5"
%reload_ext jupyter_black

# Introduction to Named Entity Recognition (NER) and its clinical applications

This notebook will explore the world of NER, examining the underlying concepts and techniques that make it possible and showcasing real-world applications of NER in the medical domain.

## What Is Named Entity Recognition?

At its core, Named Entity Recognition, or NER for short, is a subtask of NLP that focuses on identifying and classifying entities within textual data. These entities encompass a diverse range of information, including names of individuals, organizations, locations, dates, numerical values, and more. NER equips machines with the ability to extract these entities, making it a fundamental tool for diverse applications across various industries.



![NER_health.png](https://www.johnsnowlabs.com/wp-content/uploads/2023/04/1_0T2vy3DbBnLzI01KfGFUng.webp)

The image above gives an idea of what a NER model does: it finds the different entities present in a text, such as persons, dates, organizations, and locations. NER thus adds meaning to a text document. In short, it is a form of information extraction.

## Task: Detecting chemicals and drugs in a medical document

The objective of this first class is to automatically extract chemicals and drugs from medical documents with a NER method.

## I. Rule-Based Approaches
Rule-based approaches are among the earliest and simplest techniques used for named entity recognition (NER). In a rule-based approach, a set of handcrafted rules and patterns is defined to identify entities in text data.
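To make the idea concrete, here is a minimal sketch of a rule-based recognizer that uses only Python's standard `re` module. The two regular expressions and the names `RULES` and `find_entities` are illustrative assumptions; they are not part of spaCy, and the rest of this notebook relies on spaCy's entity ruler instead.

```python
import re
from typing import List, Tuple

# Hypothetical handcrafted rules: one regular expression per entity label.
RULES = {
    "DISO": re.compile(r"covid[ -]?19", re.IGNORECASE),
    "DRUG": re.compile(r"parac[ée]tamol", re.IGNORECASE),
}


def find_entities(text: str) -> List[Tuple[str, str, int, int]]:
    """Return (mention, label, start_char, end_char) tuples sorted by position."""
    found = []
    for label, regex in RULES.items():
        for match in regex.finditer(text):
            found.append((match.group(), label, match.start(), match.end()))
    return sorted(found, key=lambda ent: ent[2])


example = "Le patient atteint de Covid 19 ne tolère pas le paracétamol"
entities = find_entities(example)
print(entities)
```

Each rule only recognizes the spellings it was written for, which is the main weakness of this family of methods.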

### i. NER on one text with manual patterns


#### Introduction to spaCy

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. For more information check out the [documentation](https://spacy.io/usage/spacy-101).


#### Architecture

The central data structures in spaCy are the [Language](https://spacy.io/api/language) class, the [Vocab](https://spacy.io/api/vocab) and the [Doc](https://spacy.io/api/doc) object. The `Language` class is used to process a text and turn it into a `Doc` object. It’s typically stored as a variable called `nlp`. The `Doc` object owns the **sequence of tokens** and all their annotations. The `Doc` object is constructed by the [Tokenizer](https://spacy.io/api/tokenizer), and then modified in place by the components of the pipeline.

![image.png](https://spacy.io/images/architecture.svg)




#### First example

Let's try with a simple example: Extract Disorder (DISO) and the Drug name (DRUG) from the text below:

`"Le patient atteint de Covid 19 ne tolère pas le paracétamol"`


In [None]:
import spacy

print(spacy.__version__)

In [None]:
text = "Le patient atteint de Covid 19 ne tolère pas le paracétamol"
print(text)

#### Tokenization

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off, whereas “U.K.” should remain one token. Each Doc consists of individual tokens, and we can iterate over them.

While punctuation rules are usually pretty general, tokenizer exceptions strongly depend on the specifics of the individual language. This is why each available language has its own subclass, like English or French, that loads lists of hard-coded data and exception rules. A blank pipeline for a given language is created with the [spacy.blank](https://spacy.io/usage/spacy-101#annotations) method.

![image.png](attachment:7f55d743-bc31-41ff-b49b-6bfe9326d39f.png)

In [None]:
# Import the NLP framework spaCy
import spacy

# Create a blank French pipeline (tokenizer only, no trained component)
nlp = spacy.blank("fr")

# Apply the pipeline and get a spaCy Doc object.
doc = nlp(text)

# If you do not want to run the pipeline but only tokenize the text
doc = nlp.make_doc(text)

# Text processing in spaCy is non-destructive
assert doc.text == text

# You can access a specific token
token = doc[2]  # The third token

# And create a Span using slices
span = doc[:3]  # The first three tokens

print(f"The tokens: {list(doc)}\nA Span: {doc[:3]}")

#### Using the entity ruler

The [EntityRuler](https://spacy.io/api/entityruler) is a pipeline component that’s typically added via [nlp.add_pipe](https://spacy.io/api/language#add_pipe). When the `nlp` object is called on a text, it finds matches in the doc and adds them as entities to `doc.ents`, using the specified pattern label as the entity label. If matches overlap, the pattern matching the most tokens takes priority. If they are equally long, the match occurring first in the Doc is chosen.
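The overlap rule described above can be sketched in plain Python. This is only an illustration of the selection logic (longest match first, then earliest), not spaCy's actual implementation; the `resolve_overlaps` helper and the candidate matches are hypothetical.

```python
from typing import List, Tuple

Match = Tuple[int, int, str]  # (start_token, end_token, label), end is exclusive


def resolve_overlaps(matches: List[Match]) -> List[Match]:
    """Keep the longest matches first, then the earliest ones, and drop any overlap."""
    ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    used_tokens = set()
    kept = []
    for start, end, label in ordered:
        span_tokens = set(range(start, end))
        if used_tokens.isdisjoint(span_tokens):
            kept.append((start, end, label))
            used_tokens |= span_tokens
    return sorted(kept)


# Tokens: ["The", "San", "Francisco", "Giants", "won"]
# Two overlapping candidates: "San Francisco" (GPE) and "San Francisco Giants" (ORG)
candidates = [(1, 3, "GPE"), (1, 4, "ORG")]
resolved = resolve_overlaps(candidates)
print(resolved)  # [(1, 4, 'ORG')]
```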

In [None]:
# Create a new pipeline component for entity recognition
ruler = nlp.add_pipe("entity_ruler")

#### Define [Entity patterns](https://spacy.io/usage/rule-based-matching#entityruler-patterns)

Entity patterns are dictionaries with two keys: "label", specifying the label to assign to the entity if the pattern is matched, and "pattern", the match pattern. The entity ruler accepts two types of patterns:

1. Phrase patterns for exact string matches (string).

```python
{"label": "ORG", "pattern": "Apple"}
```

2. Token patterns with one dictionary describing one token (list).

```python
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
```


In [None]:
# Define some patterns to match entities
patterns = [
    {
        "label": "DRUG",
        "pattern": "paracétamol",
    },  # exact string match: "paracétamol" is labeled as a drug (DRUG)
    {
        "label": "DISO",
        "pattern": [{"LOWER": "covid"}, {"LOWER": "19"}],
    },  # match "Covid 19" as a disorder
    # The LOWER attribute makes the match case-insensitive: it also matches "CoVId 19"
]

# Add the patterns to the entity ruler
ruler.add_patterns(patterns)


# Process some text with the pipeline
doc = nlp(text)


# Print the recognized entities and their labels
print([(ent.text, ent.label_) for ent in doc.ents])

#### Visualizing the entity recognizer

The entity visualizer, `ent`, highlights named entities and their labels in a text.

In [None]:
import spacy
from spacy import displacy

displacy.render(doc, style="ent")

#### Your turn !

**Sub-task 1**: Use the spaCy [entity_ruler](https://spacy.io/api/entityruler) pipeline to extract the "Chemicals and Drugs" (`CHEM` label) from the text below: `Carbonate de lithium`, `antihistaminique` and `Cétirizine`.

```
"Perfusion d'une ampoule de Carbonate de lithium et introduction d'un antihistaminique par Cétirizine 10 mg x 2 par jour, avec diminution puis disparition de l'oedème."
```

In [None]:
text = "Perfusion d'une ampoule de Carbonate de lithium et introduction d'un antihistaminique par Cétirizine 10 mg x 2 par jour, avec diminution puis disparition de l'oedème."
print(text)

In [None]:
# Create a blank French pipeline
nlp = spacy.blank("fr")

# Create a new pipeline component for entity recognition
ruler = nlp.add_pipe("entity_ruler")

# Define some patterns to match entities
## YOUR CODE HERE
# Find the 3 patterns matching Carbonate de lithium, antihistaminique and Cétirizine
patterns = 

### END YOUR CODE

# Add the patterns to the entity ruler
ruler.add_patterns(patterns)

# Process some text with the pipeline
doc = nlp(text)

# Print the recognized entities and their labels
print([(ent.text, ent.label_) for ent in doc.ents])

displacy.render(doc, style="ent")

#### **Issue**: The pipeline does not apply to all drugs 

In [None]:
text = "Le patient a avalé un comprimé de FLUOCARIL BI-FLUORE le matin, un doliprane le midi et deux Paracetamol le soir"

# Process some text with the pipeline
doc = nlp(text)

# Print the recognized entities and their labels
print([(ent.text, ent.label_) for ent in doc.ents])

displacy.render(doc, style="ent")

#### **Solution**: Adding all known French drug names to the pipeline's patterns

### ii. NER on one text with a knowledge dictionary

#### a. Load the drugs dictionary

##### Introducing ROMEDI database

We are using the "Référentiel Ouvert du Médicament" (ROMEDI). [ROMEDI](https://bioportal.lirmm.fr/ontologies/ROMEDI?p=summary) is a database derived from the public drug database (http://base-donnees-publique.medicaments.gouv.fr/). It was created and is maintained by the Equipe de Recherche en Informatique Appliquée à la Santé (ERIAS), Université de Bordeaux, Inserm ([Cossin et al., 2019](https://ebooks.iospress.nl/publication/51952)).

It contains **5789 drug names**.

In [None]:
import json

# Path of the knowledge dictionary
drugs_dict_path = "/kaggle/input/drugs-dictionary/drugs.json"

# Opening JSON file
with open(drugs_dict_path) as json_file:
    drugs_dict = json.load(json_file)

print(drugs_dict)

#### b. Using the entity ruler with a large number of patterns

**Sub-task 2**: Add all drug names from the [ROMEDI](https://bioportal.lirmm.fr/ontologies/ROMEDI?p=summary) database to the spaCy [entity_ruler](https://spacy.io/api/entityruler) pipeline's patterns, in order to extract the "Chemicals and Drugs" (`CHEM` label) from the text below:

```
"Le patient a avalé un comprimé de FLUOCARIL BI-FLUORE le matin, un doliprane le midi et deux Paracetamol le soir"
```

*Tips*:
- Make sure it is case-insensitive (use the **LOWER** attribute for token patterns).
- Make sure it matches drugs with multiple tokens such as "FLUOCARIL BI-FLUORE".

In [None]:
text = "Le patient a avalé un comprimé de FLUOCARIL BI-FLUORE le matin, un doliprane le midi et deux Paracetamol le soir"
print(text)

In [None]:
# Create a blank French pipeline
nlp = spacy.blank("fr")

# Create a new pipeline component for entity recognition
ruler = nlp.add_pipe("entity_ruler")

# Define some patterns to match entities
## YOUR CODE HERE
# Find the patterns matching all drug names from ROMEDI database
# Be careful with the drugs containing multiple tokens


patterns = 


## END YOUR CODE

ruler.add_patterns(patterns)

# Process some text with the pipeline
doc = nlp(text)

# Print the recognized entities and their labels
print([(ent.text, ent.label_) for ent in doc.ents])

displacy.render(doc, style="ent")

#### **Issue**: the pipeline does not detect drugs with **accents**

Now we have a rule-based pipeline able to detect any drug from the ROMEDI database. **However, it does not detect drugs written with accents.**

In [None]:
text = "Le patient a avalé un comprimé de FLUOCARIL BI-FLUORE le matin, un doliprane le midi et deux Paracétamol le soir"

# Process some text with the pipeline
doc = nlp(text)

# Print the recognized entities and their labels
print([(ent.text, ent.label_) for ent in doc.ents])

displacy.render(doc, style="ent")

#### **Solution**: Normalizing the text before the NER process

#### c. Normalization

The normalisation scheme adheres to the non-destructive doctrine. In other words, `nlp(text).text == text` is always true. To achieve this, the input text is never modified. Instead, the `norm_` attribute of each token is modified.

**Sub-task 3**: Remove accents from the `norm_` attribute of each token by using the `unidecode()` function.
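To illustrate what removing accents means at the string level, here is a small sketch that uses only the standard `unicodedata` module. It is an alternative to `unidecode()` shown for illustration, not the function the exercise asks for; the `strip_accents` name is hypothetical. The two approaches are not strictly equivalent: this one only drops combining accents, so a ligature such as "œ" is left unchanged.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose accented characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Paracétamol"))  # Paracetamol
print(strip_accents("Cétirizine"))  # Cetirizine
```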

In [None]:
from unidecode import unidecode


@spacy.Language.component("normalize")
def normalize(doc):
    for token in doc:
        ## YOUR CODE HERE (1 line)
        ## Remove accents from the `.norm_` attribute
        
        ## END YOUR CODE
    return doc

In [None]:
# Create a blank French pipeline
nlp = spacy.blank("fr")

# Add the normalization component defined above
nlp.add_pipe("normalize")

# Create a new pipeline component for entity recognition
ruler = nlp.add_pipe("entity_ruler")

# Define some patterns to match entities
## YOUR CODE HERE
# Use the NORM attribute instead of LOWER attribute


patterns = 


## END YOUR CODE

ruler.add_patterns(patterns)

# Process some text with the pipeline
doc = nlp(text)

# Print the recognized entities and their labels
print([(ent.text, ent.label_) for ent in doc.ents])

displacy.render(doc, style="ent")

#### Congrats

We now have a rule-based pipeline that normalizes text and detects any drug from the ROMEDI database. Let's try it on a real corpus.

In [None]:
nlp.to_disk("model_rule_based")

### iii. NER on real world data with multiple texts

#### Introducing QUAERO 2016

The [**QUAERO French Medical Corpus**](https://quaerofrenchmed.limsi.fr/) was initially developed as a resource for named entity recognition and normalization. It was then improved with the purpose of creating a gold standard set of normalized entities for French biomedical text, which was used in the CLEF eHealth evaluation lab.

The QUAERO dataset is divided into 2 datasets:
- MEDLINE: A selection of 2497 annotated MEDLINE titles.
- EMEA: A selection of 9 annotated EMEA documents (drug evaluation documents made by the European Medicines Agency) divided into 35 files.

#### Annotation

The annotation process was guided by concepts in the [Unified Medical Language System (UMLS)](https://www.nlm.nih.gov/research/umls/index.html). Ten types of clinical entities, as defined by the following [UMLS Semantic Groups](https://lhncbc.nlm.nih.gov/semanticnetwork/download/SemGroups.txt) (Bodenreider and McCray 2003), were annotated:
- Anatomy (ANAT)
- Chemicals and Drugs (CHEM)
- Devices (DEVI)
- Disorders (DISO)
- Geographic Areas (GEOG)
- Living Beings (LIVB)
- Objects (OBJC)
- Phenomena (PHEN)
- Physiology (PHYS)
- Procedures (PROC)


Annotations are available in the BRAT Rapid Annotation Tool (BRAT) standoff format, described here: http://brat.nlplab.org/standoff.html, which can be loaded into BRAT for visualization.
![image.png](https://quaerofrenchmed.limsi.fr/images/quaeroFRmed_MEDLINE.jpg)
![image.png](https://quaerofrenchmed.limsi.fr/images/quaeroFRmed_EMEA.jpg)

#### BRAT standoff format
The annotations are stored separately from the annotated document text, which is never modified by the tool.

For each text document in the system, there is a corresponding annotation file. The two are associated by the file naming convention that their base name (file name without suffix) is the same: for example, the file `10028548.ann` contains annotations for the file `10028548.txt`.

##### Text files (.txt)
Text files are expected to have the suffix `.txt` and contain the text of the original documents input into the system.

```
Analyse minéralogique et exploration des pathologies asbestosiques.
```

The document texts are stored in plain text files encoded in UTF-8 (an extension of ASCII, so plain ASCII texts also work).

##### Annotation files (.ann)
Annotations are stored in files with the `.ann` suffix:
```
T1	PROC 0 7	Analyse
#1	AnnotatorNotes T1	C0936012
T2	DISO 41 66	pathologies asbestosiques
#2	AnnotatorNotes T2	C0003949
```
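As an illustration of the format, the sketch below parses the entity lines of the example above with plain Python and checks that each character offset points at the annotated mention. It is a simplified reader written for this example only: the `Entity` and `parse_ann` names are hypothetical, it assumes contiguous spans, and it ignores the `#` note lines. The notebook itself relies on EDS-NLP to read BRAT files.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann(ann: str) -> List[Entity]:
    """Parse the entity ("T") lines of a BRAT .ann file, assuming contiguous spans."""
    entities = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue  # skip annotator notes ("#") and any other line type
        ent_id, label_and_span, mention = line.split("\t")
        label, start, end = label_and_span.split(" ")
        entities.append(Entity(ent_id, label, int(start), int(end), mention))
    return entities


txt = "Analyse minéralogique et exploration des pathologies asbestosiques."
ann = (
    "T1\tPROC 0 7\tAnalyse\n"
    "#1\tAnnotatorNotes T1\tC0936012\n"
    "T2\tDISO 41 66\tpathologies asbestosiques\n"
    "#2\tAnnotatorNotes T2\tC0003949\n"
)

parsed = parse_ann(ann)
for ent in parsed:
    # The offsets index into the .txt content; the end offset is exclusive.
    print(ent.label, repr(txt[ent.start : ent.end]))
```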

#### Have a look !

Please go to the `quaero` data folder and have a look at the different data files.

#### a. Converting BRAT data into a list of Spacy Doc

In order to use our spaCy pipeline, we need to convert the BRAT data into a list of spaCy [Doc](https://spacy.io/api/doc) objects. Thanks to the Python library [EDS-NLP](https://aphp.github.io/edsnlp/v0.10.5/), you can easily integrate BRAT into your spaCy project.

The [BratReader](https://aphp.github.io/edsnlp/v0.10.5/data/) (or edsnlp.data.read_standoff) reads a directory of BRAT files and yields Spacy [Doc](https://aphp.github.io/edsnlp/latest/tutorials/spacy101/) objects.

**Sub-task 4**: 
- Convert the BRAT dev data folder from MEDLINE dataset to a list of Spacy doc, with the [edsnlp.data.read_standoff()](https://aphp.github.io/edsnlp/latest/data/standoff/#edsnlp.data.standoff.read_standoff) function. 
- Use the [span_setter](https://aphp.github.io/edsnlp/latest/reference/edsnlp/utils/span_getters/#edsnlp.utils.span_getters.SpanSetterArg) argument in order to keep only the `CHEM` labelled entities.

In [None]:
import edsnlp

# Path of the dev MEDLINE dataset
dev_dataset = "/kaggle/input/quaero/QUAERO_FrenchMed/corpus/dev/MEDLINE"

# Data Connector
## YOUR CODE HERE (1 line)
# use edsnlp.data.read_standoff() function
# complete the span_setter argument


## END YOUR CODE

`edsnlp.data.read_standoff()` returns a [LazyCollection](https://aphp.github.io/edsnlp/latest/concepts/inference/#edsnlp.core.lazy_collection.LazyCollection).
To iterate over the documents multiple times efficiently, or to access them by
index, you must convert it to a list:

In [None]:
true_docs = list(doc_iterator)

#### Visualizing annotations

The entity visualizer, `ent`, highlights named entities and their labels in a text.

In [None]:
from spacy import displacy

displacy.render(true_docs[0], style="ent")

#### b. Processing multiple texts

We've seen how to apply a spaCy NLP pipeline to a single text. Let's deploy it on a large number of documents.

**Sub-task 5**: Implement a function that processes a list of spaCy Doc objects (the `docs` argument) with a spaCy NLP pipeline (the `nlp` argument), document by document, using a for loop.

*Tips*: The output Doc must not contain the gold annotations! To ensure this, process only the text (`doc.text`) of each spaCy Doc.

In [None]:
from typing import List
from spacy.tokens import Doc
from spacy import Language


def process_docs(nlp: Language, docs: List[Doc]) -> List[Doc]:
    """Process the documents one by one in a for loop and return the list of predicted documents"""
    ## YOUR CODE HERE

    
    ## END YOUR CODE
    return pred_docs

In [None]:
pred_docs = process_docs(nlp=nlp, docs=true_docs)
displacy.render(pred_docs[0], style="ent")

#### c. NER Evaluation

In pattern recognition, information retrieval, object detection and classification (machine learning), precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space. These metrics are based on the numbers of True positives (TP), False positives (FP) and False negatives (FN):
- True Positive (TP): entity that is returned by a NER system and also appears in the ground truth.
- False Positive (FP): entity that is returned by a NER system but does not appear in the ground truth.
- False Negative (FN): entity that is not returned by a NER system but appears in the ground truth.


![image.png](attachment:15a24392-ec2f-40b5-8448-565f079a5850.png)

##### Matching mode

A True Positive (TP) result is obtained when the predicted entity matches the ground truth entity. There are two types of "match":
- **Exact** boundary matching: the predicted entity boundaries are exactly the same as the true entity boundaries.
- **Partial** boundary matching: the predicted entity boundaries overlap the true entity boundaries.
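The two matching modes can be illustrated on bare character offsets, independently of spaCy. This is a sketch under the assumption that spans are half-open `(start_char, end_char)` intervals, as in the BRAT example above; the helper names `same_boundaries` and `overlaps` are hypothetical.

```python
from typing import Tuple

Offsets = Tuple[int, int]  # (start_char, end_char), end is exclusive


def same_boundaries(a: Offsets, b: Offsets) -> bool:
    """Exact matching: both spans start and end at the same characters."""
    return a == b


def overlaps(a: Offsets, b: Offsets) -> bool:
    """Partial matching: the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


gold = (41, 66)  # "pathologies asbestosiques"
shorter = (41, 52)  # "pathologies" only
elsewhere = (0, 7)  # "Analyse"

print(same_boundaries(gold, shorter), overlaps(gold, shorter))  # False True
print(same_boundaries(gold, elsewhere), overlaps(gold, elsewhere))  # False False
```

Note that two adjacent spans such as `(0, 7)` and `(7, 10)` do not overlap under this convention, because the end offset is exclusive.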

**Sub-task 6**: Implement a function that returns `True` when two entities match and `False` otherwise, taking the matching mode (`exact` or `partial`) into account.


In [None]:
from typing import List
from spacy.tokens import Doc, Span
from spacy import Language


def is_match(true_ent: Span, pred_ent: Span, matching_mode: str = "exact") -> bool:
    """Returns True if the predicted entity matches the ground truth entity.
    If matching_mode = "exact": the function returns True when the entity boundaries are exactly the same.
    If matching_mode = "partial": the function returns True when the entity boundaries overlap.
    """
    start_char_true, end_char_true = (true_ent.start_char, true_ent.end_char)
    start_char_pred, end_char_pred = (pred_ent.start_char, pred_ent.end_char)
    if matching_mode == "exact":
        ## YOUR CODE HERE

        
        ## END YOUR CODE
    elif matching_mode == "partial":
        ## YOUR CODE HERE

        
        
        ## END YOUR CODE

    else:
        raise ValueError(
            f"Expecting matching_mode to be 'exact' or 'partial' and not {matching_mode}"
        )

##### Metrics

##### 1. Precision:
Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances. Written as a formula:

![CodeCogsEqn (54).png](attachment:cf80b753-ebc5-4ebc-adac-25811904f082.png)

##### 2. Recall:
Recall, or sensitivity, gauges the model’s ability to identify all relevant positive entities. It is the fraction of relevant instances that were retrieved. Written as a formula:

![CodeCogsEqn (55).png](attachment:eaffe310-adf1-4654-9803-5245250e546f.png)

##### 3. F1 Score:
The F1 score is the harmonic mean of precision and recall:

![image.png](attachment:f39c28a0-8b0a-4152-ae7d-f142ebf6ae3f.png)
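In terms of counts, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is 2 × precision × recall / (precision + recall). The sketch below applies these formulas to made-up counts. The `precision_recall_f1` helper is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions impose.

```python
from typing import Dict


def precision_recall_f1(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall and F1 from entity counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention: 0.0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 8 correct predictions, 2 spurious ones, 8 missed entities.
metrics = precision_recall_f1(tp=8, fp=2, fn=8)
print(metrics)
```

With these counts, precision is high (0.8) but recall is only 0.5: most predictions are correct, yet half of the true entities are missed.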

**Sub-task 7**: Implement a function that compares two lists of spaCy Doc objects, document by document, and returns a dictionary with:
   - The number of True Positive (TP): predicted entities matching true entities.
   - The number of False Positive (FP): predicted entities not matching any true entities.
   - The number of False Negative (FN): true entities not matching any predicted entities.
   - The Precision (precision): the fraction of relevant instances among the retrieved instances.
   - The Recall (recall): the fraction of relevant instances that were retrieved.
   - The F1-score (f1): the harmonic mean of precision and recall.

In [None]:
from typing import List, Dict
from spacy.tokens import Doc


def evaluate(
    true_docs: List[Doc], pred_docs: List[Doc], matching_mode: str = "exact"
) -> Dict:
    """Compare two lists of spaCy Doc objects, document by document, and return a dictionary with:
    - The number of True Positive (TP): predicted entities matching true entities.
    - The number of False Positive (FP): predicted entities not matching any true entities.
    - The number of False Negative (FN): true entities not matching any predicted entities.
    - The Precision (precision): the fraction of relevant instances among the retrieved instances.
    - The Recall (recall): the fraction of relevant instances that were retrieved.
    - The F1-score (f1): the harmonic mean of precision and recall.
    """
    scores = {"TP": 0, "FP": 0, "FN": 0}
    for true_doc, pred_doc in zip(true_docs, pred_docs):
        ## YOUR CODE HERE (~15 lines)

        
        
        ## END YOUR CODE
            
    ## YOUR CODE HERE
    # Compute Precision, Recall, F1
    # Precision

    # Recall


    # F1

    
    ## END YOUR CODE
    return scores

##### Sanity check

Calling `evaluate(true_docs, true_docs)` should return `{'TP': 765, 'FP': 0, 'FN': 0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}`

In [None]:
evaluate(true_docs=true_docs, pred_docs=true_docs, matching_mode="exact")

#### **Issue**: The recall is low which means that many entities are not detected.

This is a limitation of rule-based methods: they only find the entities covered by their patterns.



In [None]:
scores = evaluate(true_docs=true_docs, pred_docs=pred_docs, matching_mode="exact")
print(scores)

In [None]:
displacy.render(pred_docs[0], style="ent")

#### **Solution**: Training a supervised machine learning model.

### iv. NER with Transformer

Transformers are used for Named Entity Recognition (NER) for the following reasons:

- **Contextual information**: Transformers are capable of capturing the context and relationships between words in a sentence, which is essential for accurate NER. By considering the context, transformers can disambiguate words that have multiple meanings and identify named entities more accurately.
- **Pre-training**: Transformers are pre-trained on large corpora of text data, which provides them with a vast knowledge of the language and the ability to generate high-quality token representations that can be fine-tuned for specific NER tasks.
- **Transfer learning**: Transformers can be fine-tuned for NER using a small annotated dataset, which reduces the amount of labeled data required to train NER models and makes it possible to adapt to new domains and languages easily.
- **High accuracy**: Transformers have been shown to achieve state-of-the-art performance on various NER benchmarks and have been widely adopted in many NLP applications.

In summary, transformers provide a powerful and flexible framework for NER, making it possible to extract structured information from unstructured text data effectively and efficiently.

#### BERT: The Transformer model used for NER

[BERT](https://arxiv.org/abs/1810.04805) is an "encoder-only" transformer architecture.

##### **Resources**

If you're interested in the details of the Transformer model, please have a look at these resources:
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) (This one is really enlightening)
- [Attention is all you need](https://arxiv.org/abs/1706.03762)
- [Transformer (Google AI blog post)](https://blog.research.google/2017/08/transformer-novel-neural-network.html)


##### **A quick summary**

The BERT model is a stack of encoders. The original Transformer stacks six of them and BERT-base stacks twelve; there’s nothing magical about these numbers, and one can definitely experiment with other arrangements.

![The_transformer_encoder_decoder_stack.png](attachment:6a449110-52df-483c-bc52-70a05e11b847.png)

##### **Encoder block**

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into blocks:

![image.png](attachment:80ee162a-698e-435f-a688-8a0ac00c3ecf.png)

##### **Input**

We begin by turning each input word into a vector using an embedding algorithm. The transformer adds a positional encoding vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence.

![image.png](attachment:891d960a-998e-4192-84a1-c09f66792019.png)

##### **Self-attention layer**

The most important part of the model is the self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. Say the following sentence is an input sentence we want to translate:

“The animal didn't cross the street because it was too tired”

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm. When the model is processing the word “it”, self-attention allocates more weight to “animal”.
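The weighting mechanism can be sketched as scaled dot-product attention in plain Python. All vectors below are made-up two-dimensional toy values chosen so that “it” points towards “animal”; in a real model, queries, keys and values are learned projections of the token embeddings, and the `attention` helper here is only an illustration.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(
    query: List[float], keys: List[List[float]], values: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention for a single query vector."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output


words = ["animal", "street", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy key vectors (assumed)
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy value vectors (assumed)
query_it = [1.0, 0.2]  # toy query vector for the word "it" (assumed)

weights, output = attention(query_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these toy values, the largest weight goes to “animal”, so the new representation of “it” is pulled towards the vector of “animal”.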


#### Introducing [**EDS-NLP**](https://aphp.github.io/edsnlp/v0.10.5/)

[EDS-NLP](https://aphp.github.io/edsnlp/v0.10.5/) is a collaborative NLP framework that aims at extracting information from French clinical notes. At its core, it is a collection of components or pipes, either rule-based functions or deep learning modules. These components are organized into a novel, efficient and modular pipeline system, built for hybrid and multitask models. We use [spaCy](https://spacy.io/) to represent documents and their annotations, and [PyTorch](https://pytorch.org/) as a deep-learning backend for trainable components.

[EDS-NLP](https://aphp.github.io/edsnlp/v0.10.5/) is versatile and can be used on any textual document. The rule-based components are fully compatible with [spaCy](https://spacy.io/)'s pipelines, and vice versa. This library is a product of collaborative effort, and we encourage further contributions to enhance its capabilities.

Check out the [documentation](https://aphp.github.io/edsnlp/v0.10.5/)

#### Step-by-step walkthrough

Training a supervised deep-learning model consists of feeding batches of annotated samples taken from a training corpus to a model and optimizing the parameters of the model to decrease its prediction error. The process of training a pipeline with [EDS-NLP](https://aphp.github.io/edsnlp/v0.10.5/) is structured as follows:

#### a. Defining the model
We start by seeding the random states and instantiating a new trainable pipeline. The model described here computes text embeddings with a pre-trained BERT transformer and performs the NER prediction task using a linear layer and a softmax. To compose deep-learning modules, we nest them in a dictionary: each new dictionary instantiates a new module, and the `@factory` key selects the class of the module.
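The NER module used in the next cell classifies each token into a BIOUL tag (Begin, Inside, Outside, Unit, Last). As an illustration of this tagging scheme, the sketch below converts entity spans into one tag per token. The `to_bioul` helper is hypothetical and assumes non-overlapping spans given as token indices with an exclusive end; it is not EDS-NLP's implementation.

```python
from typing import List, Tuple


def to_bioul(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start_token, end_token, label) spans into one BIOUL tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"  # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # tokens in the middle
            tags[end - 1] = f"L-{label}"  # last token
    return tags


tokens = ["Perfusion", "de", "Carbonate", "de", "lithium", "et", "Cétirizine"]
spans = [(2, 5, "CHEM"), (6, 7, "CHEM")]
tags = to_bioul(len(tokens), spans)
print(list(zip(tokens, tags)))
```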

In [None]:
import edsnlp
from confit.utils.random import set_seed

set_seed(42)

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    "eds.ner_crf",  # We use the eds.ner_crf NER task module, which classifies word embeddings into NER labels (BIOUL scheme).
    name="ner",
    config={
        "mode": "independent",  # Linear + softmax
        "window": 1,  # Linear + softmax
        "embedding": {
            "@factory": "eds.transformer",  # Embedding
            "model": "prajjwal1/bert-tiny",  # BERT model
        },
    },
)

#### b. Adapting a dataset

To train a pipeline, we must convert our annotated data into documents that will be used either as training samples or as evaluation samples. This is done by designing a function that converts the dataset into a list of spaCy Doc objects.

At this step, we might also want to perform **data augmentation**, **filtering**, **splitting** or any other **data transformation**. For example, since a Transformer cannot handle too much text in one sample, large documents need to be split into pieces. However, in our case, the MEDLINE documents are small, so we do not need to split them.

**Sub-task 8**: Implement the NER adapter:
- Convert the BRAT data folder to a list of Spacy doc, with the [edsnlp.data.read_standoff()](https://aphp.github.io/edsnlp/latest/data/standoff/#edsnlp.data.standoff.read_standoff) function. 
- Use the [span_setter](https://aphp.github.io/edsnlp/latest/reference/edsnlp/utils/span_getters/#edsnlp.utils.span_getters.SpanSetterArg) argument in order to keep only the `CHEM` labelled entities.
- Use the `ents` attribute of each `Doc` to check if it contains any annotations. If it is empty don't yield the doc.

In [None]:
import edsnlp


def ner_adapter(
    path: str,
    skip_empty: bool = True,  # skip documents that do not contain any annotations.
):
    """Take the path to a BRAT data folder and yield spaCy Docs with only the CHEM labeled entities.
    If skip_empty is True, also skip the docs without CHEM labeled entities.
    """
    ## YOUR CODE HERE
    # use edsnlp.data.read_standoff() function
    # complete the span_setter argument
    docs =
    
    ## END CODE HERE
    for doc in docs:
        ## YOUR CODE HERE
        # Skip the doc without annotations ("continue")


        ## END CODE HERE
        yield doc

#### c. Loading the data

We then load and adapt (i.e., convert into spaCy Doc objects) the training and validation datasets. The adaptation of raw documents depends on the tokenization used by the trained model; here the adapter takes only the data path and relies on the reader's tokenizer.

In [None]:
train_data_path = "/kaggle/input/quaero/QUAERO_FrenchMed/corpus/train/MEDLINE"
val_data_path = "/kaggle/input/quaero/QUAERO_FrenchMed/corpus/dev/MEDLINE"

train_docs = list(ner_adapter(path=train_data_path, skip_empty=True))
val_docs = list(ner_adapter(path=val_data_path, skip_empty=True))

#### d. Complete the initialization with the training data
We initialize the missing or incomplete component attributes (such as label vocabularies) with the training dataset.

In [None]:
nlp.post_init(train_docs)

#### e. Preprocessing the data

The training dataset is then preprocessed into features. The preprocessed dataset is wrapped into a PyTorch DataLoader, which uses the model's own collate method, and is fed to the model during the training loop.

In [None]:
import torch

batch_size = 16

preprocessed = list(
    nlp.preprocess_many(  #
        train_docs,
        supervision=True,
    )
)
dataloader = torch.utils.data.DataLoader(
    preprocessed,
    batch_size=batch_size,
    collate_fn=nlp.collate,
    shuffle=True,
)

#### f. Looping through the training data
We instantiate an optimizer and start the training loop. Inside the loop, each trainable component receives the collated batch from the dataloader through its `TorchComponent.module_forward` method, which computes the loss.
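The training cell below draws batches from `chain.from_iterable(repeat(dataloader))`, which turns a finite dataloader into an endless stream of batches. The sketch below shows the same idiom on a plain list standing in for the dataloader; the batch names are placeholders.

```python
from itertools import chain, islice, repeat

# A plain list stands in for the dataloader: one pass over it is one epoch.
batches = ["batch-0", "batch-1", "batch-2"]

# repeat() yields the same iterable forever and chain.from_iterable() walks
# through it again each time the previous pass is exhausted.
iterator = chain.from_iterable(repeat(batches))

seen = list(islice(iterator, 7))  # take 7 steps, i.e. a bit more than 2 epochs
print(seen)
```

Because the dataloader itself is re-iterated at the start of every pass, a dataloader created with `shuffle=True` draws a new order at each epoch. `itertools.cycle` would instead replay the batches saved during the first pass.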

In [None]:
from itertools import chain, repeat
from tqdm import tqdm

lr = 3e-4
n_steps = len(dataloader) * 5  # 5 Epochs
optimizer = torch.optim.AdamW(
    params=nlp.parameters(),
    lr=lr,
)

# We will loop over the dataloader
iterator = chain.from_iterable(repeat(dataloader))
for step in tqdm(range(n_steps), "Training model", leave=True):
    batch = next(iterator)
    optimizer.zero_grad()
    with nlp.cache():
        loss = torch.zeros((), device="cpu")
        for name, component in nlp.torch_components():
            output = component.module_forward(batch[name])  #
            if "loss" in output:
                loss += output["loss"]
    loss.backward()

    optimizer.step()

#### g. Evaluating the model
Finally, the model is evaluated on the validation dataset.

In [None]:
pred_docs = process_docs(nlp=nlp, docs=val_docs)
scores = evaluate(true_docs=val_docs, pred_docs=pred_docs, matching_mode="exact")
print(scores)

In [None]:
displacy.render(val_docs[0], style="ent")

In [None]:
displacy.render(pred_docs[0], style="ent")

### v. Model's hyperparameters optimization

#### a. Batch Size

Batch size is one of the most important hyperparameters in deep learning training. It is the number of samples used in one forward and backward pass through the network, and it directly affects both the accuracy and the computational efficiency of training. Choosing it is a trade-off between accuracy and speed: large batch sizes can shorten training time but may lead to lower accuracy and poorer generalization, while small batch sizes can give better accuracy at the cost of longer, more expensive training.

The batch size can also affect the convergence of the model, meaning that it can influence the optimization process and the speed at which the model learns. Small batch sizes can be more susceptible to random fluctuations in the training data, while larger batch sizes are more resistant to these fluctuations but may converge more slowly.

It is important to note that there is no one-size-fits-all answer when it comes to choosing a batch size, as the ideal size will depend on several factors, including the size of the training dataset, the complexity of the model, and the computational resources available.
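
One concrete effect of the batch size is the number of weight updates performed per epoch. The sketch below computes it for the batch sizes used in this notebook. The dataset size `n_samples = 800` is a made-up value for illustration, not the size of the QUAERO training set.

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches (and optimizer steps) needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_samples = 800  # hypothetical dataset size
for batch_size in [1, 2, 4, 8, 16, 32]:
    print(f"batch_size={batch_size:>2} -> {steps_per_epoch(n_samples, batch_size)} steps per epoch")
```

This matches `len(dataloader)` for a PyTorch `DataLoader` that keeps the last incomplete batch, which is the default behaviour.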

#### Monitoring the loss

To know how the model is training, it is important to monitor the **Training loss**.

**Sub-task 9**: Store the loss at each step in order to plot the training loss curve.
*Hint*: You will have to convert the loss, which is a [`torch.Tensor`](https://pytorch.org/docs/stable/tensors.html), into a `float`.
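
Once the losses are stored as floats, the raw curve is usually noisy, especially with small batches. A trailing moving average makes the trend easier to read when plotting. The helper `moving_average` and the toy values below are assumptions made for illustration; they are not part of the exercise.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing moving average; the first points average over fewer values."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


toy_losses = [4.0, 2.0, 3.0, 1.0, 2.0, 0.0]  # made-up per-step losses
print(moving_average(toy_losses, window=2))
# -> [4.0, 3.0, 2.5, 2.0, 1.5, 1.0]
```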

In [None]:
from torch import Tensor


def store_train_loss(train_losses: list, loss: Tensor) -> None:
    ## YOUR CODE HERE (1 line)


    ## END CODE HERE

In [None]:
import edsnlp
import torch
from confit.utils.random import set_seed
from itertools import chain, repeat
from tqdm import tqdm


def train_NER_bert(batch_size: int):

    # 1. Defining the model
    set_seed(42)
    nlp = edsnlp.blank("eds")
    nlp.add_pipe(
        "eds.ner_crf",  # We use the eds.ner_crf NER task module, which classifies word embeddings into NER labels (BIOUL scheme).
        name="ner",
        config={
            "mode": "independent",  # Linear + softmax
            "window": 1,  # Linear + softmax
            "embedding": {
                "@factory": "eds.transformer",  # Embedding
                "model": "prajjwal1/bert-tiny",  # BERT model
            },
        },
    )

    # 2. Complete the initialization with the training data
    nlp.post_init(train_docs)

    # 3. Preprocessing the data
    preprocessed = list(
        nlp.preprocess_many(  #
            train_docs,
            supervision=True,
        )
    )
    dataloader = torch.utils.data.DataLoader(
        preprocessed,
        batch_size=batch_size,
        collate_fn=nlp.collate,
        shuffle=True,
    )

    # 4. Training loop
    lr = 3e-4
    n_steps = len(dataloader) * 5  # 5 Epochs
    train_losses = []
    optimizer = torch.optim.AdamW(
        params=nlp.parameters(),
        lr=lr,
    )

    # We will loop over the dataloader
    iterator = chain.from_iterable(repeat(dataloader))
    for step in tqdm(range(n_steps), "Training model", leave=True):
        batch = next(iterator)
        optimizer.zero_grad()
        with nlp.cache():
            loss = torch.zeros((), device="cpu")
            for name, component in nlp.torch_components():
                output = component.module_forward(batch[name])  #
                if "loss" in output:
                    loss += output["loss"]

        loss.backward()

        optimizer.step()

        store_train_loss(train_losses=train_losses, loss=loss)

    return nlp, train_losses

**Sub-task 10**: For **batch size = 1, 2, 4, 8, 16, 32**:
- Print the training duration using the `time` module.
- Print the metrics (TP, FP, FN, Precision, Recall, F1-score) on the validation dataset using `process_docs` and `evaluate()`.
- Plot the training loss using the `matplotlib` module.

In [None]:
import datetime
from matplotlib import pyplot as plt
import time

for batch_size in [1, 2, 4, 8, 16, 32]:
    print(f"######## BATCH SIZE = {batch_size} #######")
    ## YOUR CODE HERE


    
    nlp, train_losses = train_NER_bert(batch_size=batch_size)

    
    
    ## END CODE HERE

#### Why Do Large Batch Sizes Lead To Poorer Generalization?

With a small batch size, the gradient estimate oscillates much more from step to step than with a large batch size. This oscillation can be seen as noise. On a non-convex loss landscape (which is often the case), this noise helps the optimizer escape local minima. Larger batches also take fewer and coarser search steps toward the optimal solution, so by construction they are less likely to converge on it.
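
The noise argument can be checked with a small simulation. The sketch below draws made-up per-sample gradients for a single parameter, then measures how much the mini-batch average varies from batch to batch for each batch size. Everything here (the Gaussian gradients, the helper `minibatch_gradient_std`) is a toy assumption, not a measurement on the NER model.

```python
import random
import statistics
from typing import List


def minibatch_gradient_std(
    per_sample_grads: List[float], batch_size: int, n_batches: int, rng: random.Random
) -> float:
    """Standard deviation of the mini-batch gradient estimate across random batches."""
    estimates = []
    for _ in range(n_batches):
        batch = rng.sample(per_sample_grads, batch_size)
        estimates.append(sum(batch) / batch_size)
    return statistics.pstdev(estimates)


rng = random.Random(0)
# Made-up per-sample gradients: true mean 1.0, large spread between samples
per_sample_grads = [rng.gauss(1.0, 2.0) for _ in range(2000)]

noise_by_batch_size = {
    bs: minibatch_gradient_std(per_sample_grads, bs, n_batches=500, rng=rng)
    for bs in [1, 2, 4, 8, 16, 32]
}
for bs, noise in noise_by_batch_size.items():
    print(f"batch_size={bs:>2} -> gradient noise (std) = {noise:.3f}")
```

The spread shrinks roughly with the square root of the batch size, which is why small batches produce the noisier updates described above.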

#### b. Epochs

An epoch is a full training cycle through all of the samples in the training dataset. The number of epochs determines how many times the model will see the entire training data before completing training.

The number of epochs is an important hyperparameter to set correctly, as it can affect both the accuracy and computational efficiency of the training process. If the number of epochs is too small, the model may not learn the underlying patterns in the data, resulting in underfitting. On the other hand, if the number of epochs is too large, the model may overfit the training data, leading to poor generalization performance on new, unseen data.

The ideal number of epochs for a given training process can be determined through experimentation, by monitoring the performance of the model on a validation set. Once the model stops improving on the validation set, it is a good indication that a sufficient number of epochs has been reached.
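
The stopping criterion described above is often implemented as early stopping with a patience. The following is a minimal sketch under stated assumptions: the helper `early_stopping_index`, the patience value and the validation scores are all made up for illustration and are not used elsewhere in this notebook.

```python
from typing import List, Tuple


def early_stopping_index(val_f1: List[float], patience: int) -> Tuple[int, int]:
    """Return (stop_index, best_index) for a sequence of validation F1 scores.

    Training stops at the first evaluation that is `patience` evaluations
    past the best score seen so far without any improvement.
    """
    best_score = float("-inf")
    best_index = 0
    for i, score in enumerate(val_f1):
        if score > best_score:
            best_score, best_index = score, i
        elif i - best_index >= patience:
            return i, best_index
    return len(val_f1) - 1, best_index


# Made-up validation F1 scores, one per evaluation
toy_val_f1 = [0.10, 0.30, 0.45, 0.50, 0.49, 0.48, 0.47, 0.46]
stop_index, best_index = early_stopping_index(toy_val_f1, patience=2)
print(f"stop at evaluation {stop_index}, best model at evaluation {best_index}")
```

In practice the model weights from the best evaluation would be saved and restored, since the last evaluated model is no longer the best one.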

#### Monitoring the metrics

To know if the model is underfitting or overfitting, it is important to monitor the validation metrics:
- Validation Precision, Recall, F1
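
As a reminder of how these metrics relate to the counts of true positives (TP), false positives (FP) and false negatives (FN), here is a small sketch. The function name and the dictionary keys are assumptions chosen for illustration; they may differ from what the notebook's `evaluate()` returns.

```python
from typing import Dict


def precision_recall_f1(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall and F1 from entity-level counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"Precision": precision, "Recall": recall, "F1": f1}


# Made-up counts: 8 entities found correctly, 2 spurious, 8 missed
print(precision_recall_f1(tp=8, fp=2, fn=8))
```

A training F1 that keeps rising while the validation F1 stalls or drops is the usual sign of overfitting.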

**Sub-task 11**: Store the evaluation metrics (TP, FP, FN, Precision, Recall, F1) on the validation dataset and training dataset every 100 steps.

In [None]:
from typing import List
from spacy import Language
from spacy.tokens import Doc


def store_evaluation_metrics(
    train_scores: List[dict],
    train_docs: List[Doc],
    val_scores: List[dict],
    val_docs: List[Doc],
    nlp: Language,
    step: int,
) -> None:
    ## YOUR CODE HERE
    # Train Precision/Recall/F1

    train_score =
    
    train_score["Step"] = step
    train_score["Dataset"] = "Train"

    # Val Precision/Recall/F1
    
    val_score = 
    
    val_score["Step"] = step
    val_score["Dataset"] = "Validation"

    
    
    
    ## END CODE HERE
    print(val_score)

In [None]:
import edsnlp
import torch
from confit.utils.random import set_seed
from itertools import chain, repeat
from tqdm import tqdm


def train_NER_bert(batch_size: int, n_epochs: int):
    # 1. Defining the model
    set_seed(42)
    nlp = edsnlp.blank("eds")
    nlp.add_pipe(
        "eds.ner_crf",  # We use the eds.ner_crf NER task module, which classifies word embeddings into NER labels (BIOUL scheme).
        name="ner",
        config={
            "mode": "independent",  # Linear + softmax
            "window": 1,  # Linear + softmax
            "embedding": {
                "@factory": "eds.transformer",  # Embedding
                "model": "prajjwal1/bert-tiny",  # BERT model
            },
        },
    )

    # 2. Complete the initialization with the training data
    nlp.post_init(train_docs)

    # 3. Preprocessing the data
    preprocessed = list(
        nlp.preprocess_many(  #
            train_docs,
            supervision=True,
        )
    )
    dataloader = torch.utils.data.DataLoader(
        preprocessed,
        batch_size=batch_size,
        collate_fn=nlp.collate,
        shuffle=True,
    )

    # 4. Training loop
    lr = 3e-4
    n_steps = len(dataloader) * n_epochs  # Epochs
    train_losses = []
    train_scores = []
    val_scores = []
    optimizer = torch.optim.AdamW(
        params=nlp.parameters(),
        lr=lr,
    )

    # We will loop over the dataloader
    iterator = chain.from_iterable(repeat(dataloader))
    for step in tqdm(range(n_steps), "Training model", leave=True):
        batch = next(iterator)
        optimizer.zero_grad()
        with nlp.cache():
            loss = torch.zeros((), device="cpu")
            for name, component in nlp.torch_components():
                output = component.module_forward(batch[name])  #
                if "loss" in output:
                    loss += output["loss"]

        loss.backward()

        optimizer.step()

        store_train_loss(train_losses=train_losses, loss=loss)

        ## Storing metrics on Validation Dataset and Training dataset
        if (step % 100) == 0:
            store_evaluation_metrics(
                train_scores=train_scores,
                train_docs=train_docs,
                val_scores=val_scores,
                val_docs=val_docs,
                nlp=nlp,
                step=step,
            )

    return nlp, train_losses, train_scores, val_scores

**Sub-task 12**: For **n_epochs = 20**, **batch_size = 2** and **lr = 3e-4**:
- Print the training duration using the `time` module.
- Print the metrics (TP, FP, FN, Precision, Recall, F1-score) on the validation dataset using `process_docs` and `evaluate()`.
- Plot the training loss using the `matplotlib` module.
- Plot the validation F1-score against the training F1-score using the `matplotlib` module.

In [None]:
import altair as alt
from matplotlib import pyplot as plt
import pandas as pd
import time

batch_size = 2
n_epochs = 20


## YOUR CODE HERE



nlp, train_losses, train_scores, val_scores = train_NER_bert(
    batch_size=batch_size, n_epochs=n_epochs
)



## END CODE HERE


#### c. Learning rate

The learning rate controls how much to change the model in response to the estimated error each time the model weights are updated. Choosing the learning rate is challenging as a value too small may result in a long training process that could get stuck, whereas a value too large may result in learning a sub-optimal set of weights too fast or an unstable training process.
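
These three regimes can be seen on a toy problem. The sketch below runs plain gradient descent on the one-dimensional loss f(w) = w², whose minimum is at w = 0. This is only an illustration: the notebook trains with AdamW on a far more complex loss, and the learning rates below are chosen for this toy function, not for the NER model.

```python
def gradient_descent(lr: float, n_steps: int = 20, w0: float = 1.0) -> float:
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(n_steps):
        grad = 2 * w  # derivative of w**2
        w -= lr * grad
    return w


for lr in [0.001, 0.4, 1.1]:
    print(f"lr={lr:<5} -> final w = {gradient_descent(lr):.4g}")
```

With the smallest learning rate, w has barely moved after 20 steps; with the intermediate one it is essentially at the minimum; with the largest one each step overshoots and the iterates diverge.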

#### **Issue**: Correlation between the hyperparameters

Hyperparameters influence each other. For instance, decreasing the learning rate may improve the model when combined with a larger number of epochs or a smaller batch size.

#### Introducing Optuna

[Optuna](https://optuna.org/) is an automatic hyperparameter optimization software framework, particularly designed for machine learning.
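
To give an idea of what such a framework automates, here is a dependency-free sketch of the simplest search strategy, plain random search, over the three hyperparameters discussed above. This is not Optuna's API and not its default sampling algorithm, which uses the results of past trials to choose the next ones. The function `toy_objective` is a made-up stand-in that returns instantly; in this notebook the objective would train a model and return its validation F1-score.

```python
import math
import random
from typing import Dict, Tuple


def toy_objective(batch_size: int, n_epochs: int, lr: float) -> float:
    """Made-up score in (0, 1], highest at batch_size=4, n_epochs=10, lr=1e-3."""
    return math.exp(
        -0.1 * (math.log2(batch_size) - 2) ** 2
        - 0.01 * (n_epochs - 10) ** 2
        - 0.5 * (math.log10(lr) + 3) ** 2
    )


def random_search(n_trials: int, seed: int = 0) -> Tuple[Dict[str, float], float]:
    """Sample hyperparameters at random and keep the best-scoring combination."""
    rng = random.Random(seed)
    best_params, best_score = {}, float("-inf")
    for _ in range(n_trials):
        params = {
            "batch_size": rng.choice([1, 2, 4, 8, 16, 32]),
            "n_epochs": rng.randint(1, 20),
            "lr": 10 ** rng.uniform(-5, -2),  # sampled on a log scale
        }
        score = toy_objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(n_trials=50)
print(best_params, round(best_score, 3))
```

Sampling the learning rate on a log scale is a common choice because useful values span several orders of magnitude.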

In [None]:
import edsnlp
import torch
from confit.utils.random import set_seed
from itertools import chain, repeat
from tqdm import tqdm


def train_NER_bert(batch_size: int, n_epochs: int, lr: float):
    # 1. Defining the model
    set_seed(42)
    nlp = edsnlp.blank("eds")
    nlp.add_pipe(
        "eds.ner_crf",  # We use the eds.ner_crf NER task module, which classifies word embeddings into NER labels (BIOUL scheme).
        name="ner",
        config={
            "mode": "independent",  # Linear + softmax
            "window": 1,  # Linear + softmax
            "embedding": {
                "@factory": "eds.transformer",  # Embedding
                "model": "prajjwal1/bert-tiny",  # BERT model
            },
        },
    )

    # 2. Complete the initialization with the training data
    nlp.post_init(train_docs)

    # 3. Preprocessing the data
    preprocessed = list(
        nlp.preprocess_many(  #
            train_docs,
            supervision=True,
        )
    )
    dataloader = torch.utils.data.DataLoader(
        preprocessed,
        batch_size=batch_size,
        collate_fn=nlp.collate,
        shuffle=True,
    )

    # 4. Training loop
    n_steps = len(dataloader) * n_epochs  # Epochs
    optimizer = torch.optim.AdamW(
        params=nlp.parameters(),
        lr=lr,
    )

    # We will loop over the dataloader
    iterator = chain.from_iterable(repeat(dataloader))
    for step in tqdm(range(n_steps), "Training model", leave=True):
        batch = next(iterator)
        optimizer.zero_grad()
        with nlp.cache():
            loss = torch.zeros((), device="cpu")
            for name, component in nlp.torch_components():
                output = component.module_forward(batch[name])  #
                if "loss" in output:
                    loss += output["loss"]

        loss.backward()

        optimizer.step()

    return nlp

**OPTIONAL: Sub-task 13**: Using [Optuna](https://optuna.org/), find the combination of `batch_size`, `n_epochs` and `lr` that maximizes the F1-score on the validation dataset.

In [None]:
import optuna


def objective(trial):
    ## YOUR CODE HERE (~ 7 lines)

    
    
    
    ## END YOUR CODE


In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)

In [None]:
study.best_params

#### Choose the best combination

**Sub-task 14**: For **n_epochs = 6**, **batch_size = 2** and **lr = 3e-4**:
- Print the training duration using the `time` module.
- Print the metrics (TP, FP, FN, Precision, Recall, F1-score) on the validation dataset using `process_docs` and `evaluate()`.
- Plot the training loss using the `matplotlib` module.
- Plot the validation F1-score against the training F1-score using the `matplotlib` module.

In [None]:
import edsnlp
import torch
from confit.utils.random import set_seed
from itertools import chain, repeat
from tqdm import tqdm


def train_NER_bert(batch_size: int, n_epochs: int, lr: float):
    # 1. Defining the model
    set_seed(42)
    nlp = edsnlp.blank("eds")
    nlp.add_pipe(
        "eds.ner_crf",  # We use the eds.ner_crf NER task module, which classifies word embeddings into NER labels (BIOUL scheme).
        name="ner",
        config={
            "mode": "independent",  # Linear + softmax
            "window": 1,  # Linear + softmax
            "embedding": {
                "@factory": "eds.transformer",  # Embedding
                "model": "prajjwal1/bert-tiny",  # BERT model
            },
        },
    )

    # 2. Complete the initialization with the training data
    nlp.post_init(train_docs)

    # 3. Preprocessing the data
    preprocessed = list(
        nlp.preprocess_many(  #
            train_docs,
            supervision=True,
        )
    )
    dataloader = torch.utils.data.DataLoader(
        preprocessed,
        batch_size=batch_size,
        collate_fn=nlp.collate,
        shuffle=True,
    )

    # 4. Training loop (uses the lr passed as argument)
    n_steps = len(dataloader) * n_epochs  # Epochs
    train_losses = []
    train_scores = []
    val_scores = []
    optimizer = torch.optim.AdamW(
        params=nlp.parameters(),
        lr=lr,
    )

    # We will loop over the dataloader
    iterator = chain.from_iterable(repeat(dataloader))
    for step in tqdm(range(n_steps), "Training model", leave=True):
        batch = next(iterator)
        optimizer.zero_grad()
        with nlp.cache():
            loss = torch.zeros((), device="cpu")
            for name, component in nlp.torch_components():
                output = component.module_forward(batch[name])  #
                if "loss" in output:
                    loss += output["loss"]

        loss.backward()

        optimizer.step()

        store_train_loss(train_losses=train_losses, loss=loss)

        ## Storing metrics on Validation Dataset and Training dataset
        if (step % 100) == 0:
            store_evaluation_metrics(
                train_scores=train_scores,
                train_docs=train_docs,
                val_scores=val_scores,
                val_docs=val_docs,
                nlp=nlp,
                step=step,
            )

    return nlp, train_losses, train_scores, val_scores

In [None]:
import altair as alt
from matplotlib import pyplot as plt
import pandas as pd
import time

batch_size = 2
n_epochs = 6
lr = 3e-4


## YOUR CODE HERE



nlp, train_losses, train_scores, val_scores = train_NER_bert(
    batch_size=batch_size, n_epochs=n_epochs, lr=lr
)



## END CODE HERE
nlp.to_disk("model_ML_BERT_tiny")

#### d. Other hyperparameters

Many other hyperparameters could be fine-tuned, such as:
- The AdamW optimizer parameters
    - betas
    - weight_decay
- Using another optimizer (SGD, Adam, etc.)
- Model's architecture:
    - Number of layers
    - Adding CNN
    - Adding CRF

However, these experiments are time-consuming and may only slightly improve the model.

#### e. Embedding: pre-trained BERT models (Transfer learning)

The weights of the initial embedding process are very important for the downstream task.
Here, we are using [BERT-Tiny](https://huggingface.co/prajjwal1/bert-tiny), which is one of the smallest pre-trained BERT variants. It has been trained mainly on English text.

What about using:
- a larger BERT
- a larger BERT trained on French text
- a larger BERT trained on French biomedical text

In [None]:
import edsnlp
import torch
from confit.utils.random import set_seed
from itertools import chain, repeat
from tqdm import tqdm
from accelerate import Accelerator


def train_NER_bert(batch_size: int, n_epochs: int, lr: float, embedding_model: str):
    # 1. Defining the model
    set_seed(42)
    nlp = edsnlp.blank("eds")
    nlp.add_pipe(
        "eds.ner_crf",  # We use the eds.ner_crf NER task module, which classifies word embeddings into NER labels (BIOUL scheme).
        name="ner",
        config={
            "mode": "independent",  # Linear + softmax
            "window": 1,  # Linear + softmax
            "embedding": {
                "@factory": "eds.transformer",  # Embedding
                "model": embedding_model,  # BERT model
            },
        },
    )

    # 2. Complete the initialization with the training data
    nlp.post_init(train_docs)

    # 3. Preprocessing the data
    preprocessed = list(
        nlp.preprocess_many(  #
            train_docs,
            supervision=True,
        )
    )
    dataloader = torch.utils.data.DataLoader(
        preprocessed,
        batch_size=batch_size,
        collate_fn=nlp.collate,
        shuffle=True,
    )

    # 4. Training loop
    n_steps = len(dataloader) * n_epochs  # Epochs
    train_losses = []
    train_scores = []
    val_scores = []
    optimizer = torch.optim.AdamW(
        params=nlp.parameters(),
        lr=lr,
    )
    _, trained_pipes = zip(*nlp.torch_components())
    accelerator = Accelerator()
    print("Device:", accelerator.device)
    [dataloader, optimizer, *trained_pipes] = accelerator.prepare(
        dataloader,
        optimizer,
        *trained_pipes,
    )
    # We will loop over the dataloader
    iterator = chain.from_iterable(repeat(dataloader))
    for step in tqdm(range(n_steps), "Training model", leave=True):
        batch = next(iterator)
        optimizer.zero_grad()
        with nlp.cache():
            loss = torch.zeros((), device=accelerator.device)
            for name, pipe in nlp.torch_components():
                output = pipe.module_forward(batch[name])  #
                if "loss" in output:
                    loss += output["loss"]

        accelerator.backward(loss)
        optimizer.step()

        # Storing training loss
        train_loss = float(loss) / (len(dataloader) * batch_size)
        train_losses.append(train_loss)

        ## Storing metrics on Validation Dataset and Training dataset
        if (step % 20) == 0:
            # Train Precision/Recall/F1
            pred_train_docs = process_docs(nlp=nlp, docs=train_docs)
            train_score = evaluate(
                true_docs=train_docs, pred_docs=pred_train_docs, matching_mode="exact"
            )
            train_score["Step"] = step
            train_score["Dataset"] = "Train"
            train_scores.append(train_score)  #

            # Val Precision/Recall/F1
            pred_val_docs = process_docs(nlp=nlp, docs=val_docs)
            val_score = evaluate(
                true_docs=val_docs, pred_docs=pred_val_docs, matching_mode="exact"
            )
            val_score["Step"] = step
            val_score["Dataset"] = "Validation"
            val_scores.append(val_score)  #
            print(val_score)

    return nlp, train_losses, train_scores, val_scores

##### A larger BERT

[BERT base model (uncased)](https://huggingface.co/google-bert/bert-base-uncased) is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.

**Sub-task 14**: For **n_epochs = 100**, **batch_size = 32**, **lr = 5e-5** and **embedding_model = "google-bert/bert-base-uncased"**:
- Print the training duration using the `time` module.
- Print the metrics (TP, FP, FN, Precision, Recall, F1-score) on the validation dataset using `process_docs` and `evaluate()`.
- Plot the training loss using the `matplotlib` module.
- Plot the validation F1-score against the training F1-score using the `matplotlib` module.

In [None]:
import altair as alt
import pandas as pd

batch_size = 32
n_epochs = 100
lr = 5e-5
embedding_model = "google-bert/bert-base-uncased"

## YOUR CODE HERE



nlp, train_losses, train_scores, val_scores = train_NER_bert(
    batch_size=batch_size, n_epochs=n_epochs, lr=lr, embedding_model=embedding_model
)



## END CODE HERE
nlp.to_disk("model_ML_bert_base")

##### A larger BERT trained on French text

[CamemBERT](https://arxiv.org/abs/1911.03894) is a language model based on Facebook’s RoBERTa model released in 2019. It was trained on 138 GB of French text and is a state-of-the-art language model for French.

**Sub-task 15**: For **n_epochs = 100**, **batch_size = 32**, **lr = 5e-5** and **embedding_model = "almanach/camembert-base"**:
- Print the training duration using the `time` module.
- Print the metrics (TP, FP, FN, Precision, Recall, F1-score) on the validation dataset using `process_docs` and `evaluate()`.
- Plot the training loss using the `matplotlib` module.
- Plot the validation F1-score against the training F1-score using the `matplotlib` module.

In [None]:
import altair as alt
import pandas as pd

batch_size = 32
n_epochs = 100
lr = 5e-5
embedding_model = "almanach/camembert-base"

## YOUR CODE HERE



nlp, train_losses, train_scores, val_scores = train_NER_bert(
    batch_size=batch_size, n_epochs=n_epochs, lr=lr, embedding_model=embedding_model
)



## END CODE HERE
nlp.to_disk("model_ML_camembert_base")

##### A larger BERT trained on biomedical French text

[CamemBERT-bio](https://huggingface.co/almanach/camembert-bio-base) is a state-of-the-art French biomedical language model built by continual pre-training from camembert-base.

**Sub-task 15**: Try the `"almanach/camembert-bio-base"` embedding model.

In [None]:
import altair as alt
import pandas as pd

batch_size = 32
n_epochs = 100
lr = 5e-5
embedding_model = "almanach/camembert-bio-base"

## YOUR CODE HERE



nlp, train_losses, train_scores, val_scores = train_NER_bert(
    batch_size=batch_size, n_epochs=n_epochs, lr=lr, embedding_model=embedding_model
)



## END CODE HERE
nlp.to_disk("model_ML_camembert_bio")

**Sub-task 16**: Display a pandas table with the metrics (Precision, Recall, F1-score) on the MEDLINE test dataset for:
- Rule-based method
- ML method with tiny BERT
- ML method with camembert-base
- ML method with camembert-bio-base

In [None]:
import edsnlp

# Path of the test MEDLINE dataset
test_dataset = "/kaggle/input/quaero/QUAERO_FrenchMed/corpus/test/MEDLINE"

doc_iterator = edsnlp.data.read_standoff(
    test_dataset,
    span_setter={"ents": "CHEM"},
)

test_docs = list(doc_iterator)

In [None]:
import spacy
import edsnlp

## YOUR CODE HERE






## END CODE HERE