## Phase 1: Data Engineering & Clinical Preprocessing
In this phase, we address the unique challenges of French clinical NLP. Standard NLP pipelines often fail in the healthcare domain due to:

* **Nested Entities:** Medical terms often overlap (e.g., "Infarctus" inside "Infarctus du myocarde").

* **Sub-token Alignment:** Transformers like DrBERT split words into fragments, requiring us to align our labels manually.

* **Class Imbalance:** In medical text, "Outside" tokens (non-medical words) vastly outnumber specific disease or drug entities.

In [None]:
import pandas as pd
import numpy as np
import re
import ast
from transformers import AutoTokenizer
from collections import Counter
import torch

# Load the cleaned dataset
df = pd.read_csv('quaero_clean.csv')

def robust_word_tokenize(text_str):
    # This regex finds everything inside single or double quotes
    # It solves the issue where French words are mashed together in the CSV
    return re.findall(r"['\"](.*?)['\"]", text_str)

def robust_tag_tokenize(tag_str):
    # Extracts all numbers from the string [1 1 0 0]
    return [int(t) for t in re.findall(r'\d+', tag_str)]

# Apply the robust cleaning
df['words'] = df['words'].apply(robust_word_tokenize)
df['ner_tags'] = df['ner_tags'].apply(robust_tag_tokenize)

# Filter out any rows where lengths don't match (Safety check)
df = df[df['words'].map(len) == df['ner_tags'].map(len)]

print(f"Dataset Loaded: {len(df)} rows")
print(f"Sample words (Correctly Split): {df.iloc[0]['words'][:5]}...")
print(f"Sample tags: {df.iloc[0]['ner_tags'][:5]}...")

Dataset Loaded: 2552 rows
Sample words (Correctly Split): ['Insuffisance', 'gonadotrope', 'associée', 'à', 'l']...
Sample tags: [1, 1, 0, 0, 0]...
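
The sample words above end in 'l' where the sentence contains "l'", which suggests that an apostrophe inside a word is being read as a closing quote. Below is a minimal sketch of a quote-aware alternative. It assumes the CSV stores each row as a NumPy-style string such as ['associée' 'à' "l'" 'hypoplasie'], which is an assumption about the file and not confirmed here. parse_quoted_words and the sample string are hypothetical.

```python
import re

def parse_quoted_words(text_str):
    # Group 1 captures the opening quote; \1 requires the SAME quote
    # character to close the word, so "l'" survives as a single token.
    return [m.group(2) for m in re.finditer(r"(['\"])(.*?)\1", text_str)]

# Hypothetical row in the assumed storage format
sample = """['associée' 'à' "l'" 'hypoplasie' 'surrénale']"""
print(parse_quoted_words(sample))
```

If a parser like this were swapped in, the word/tag length check that follows would still be needed, since the number of recovered words per row can change.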


## Step 1: Handling Nested Entities (Longest-Match Strategy)
Medical nomenclature is hierarchical, so annotated terms can nest. The pipeline should keep the most specific clinical term: if both "Cancer" (6 characters) and "Cancer du poumon" (16 characters) are annotated, the longest span wins, which prevents double-counting the same mention. The flat version of QUAERO used here has already been pre-processed this way, so the cell below only verifies that each row has exactly one tag per word.

In [4]:
def resolve_nested_spans(words, tags):
    """
    Check that a row has exactly one tag per word.

    This project uses the flat version of QUAERO, which has already been
    pre-processed to keep the longest medical span, so no span resolution
    is performed here. With overlapping raw offsets, a longest-match pass
    would have to run before this check.
    """
    return len(words) == len(tags)

# Validation check
df['is_valid'] = df.apply(lambda row: resolve_nested_spans(row['words'], row['ner_tags']), axis=1)
print(f"Rows with consistent word/tag alignment: {df['is_valid'].sum()} / {len(df)}")

Rows with consistent word/tag alignment: 2552 / 2552
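
For readers working from raw, overlapping annotations, the longest-match rule described in this step could look like the sketch below. It is not part of the original notebook: keep_longest_spans is a hypothetical helper, and the (start, end, label) tuples with an exclusive end offset are an assumed input format.

```python
def keep_longest_spans(spans):
    """Greedy longest-match: keep a span only if it overlaps no longer span.

    spans: list of (start, end, label) tuples, end exclusive (assumed format).
    """
    # Longest spans first; ties broken by earliest start
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in ordered:
        overlaps = any(start < k_end and end > k_start for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, label))
    return sorted(kept)

# "Cancer" (0-6) is nested inside "Cancer du poumon" (0-16)
example_spans = [(0, 6, "DISO"), (0, 16, "DISO"), (10, 16, "ANAT"), (20, 29, "CHEM")]
print(keep_longest_spans(example_spans))
```

The greedy pass is one simple design choice. It always prefers length over label type, which may not suit every annotation scheme.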


## Step 2: DrBERT Tokenizer Alignment
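We use DrBERT, a specialized model pre-trained on the French NACHOS corpus. DrBERT's tokenizer splits words into subword pieces, so a single word like "hypoplasie" might become several tokens, for example ['hypo', 'plas', 'ie']. This split is illustrative: the exact pieces depend on the tokenizer's vocabulary. The checkpoint loads as a CamemBERT-style model, so the pieces follow SentencePiece conventions and do not carry the WordPiece "##" prefix.

We must align our labels so that:

* The first piece of the word (hypo) gets the original tag.

* The remaining pieces (plas, ie) get a special value of -100.

* Why -100? PyTorch's CrossEntropyLoss uses ignore_index=-100 by default, so positions labelled -100 do not contribute to the loss.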
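
The alignment rule can be illustrated without loading a tokenizer. The sketch below uses a hand-written word_ids list in place of the one a fast tokenizer would return, so both that list and the helper name align_labels are made up for illustration.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's first piece its tag; mask everything else."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None or wid == previous:
            # Special token, or a non-first piece of the current word
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[wid])
        previous = wid
    return aligned

# Hypothetical tokenization: <s>, three pieces of word 0, word 1, two pieces of word 2, </s>
toy_word_ids = [None, 0, 0, 0, 1, 2, 2, None]
toy_labels = [1, 0, 3]
print(align_labels(toy_word_ids, toy_labels))
```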

In [5]:
#initialize the French Tokenizer
model_checkpoint = "Dr-BERT/DrBERT-7GB"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["words"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Test alignment on the first row
test_row = {"words": [df.iloc[0]['words']], "ner_tags": [df.iloc[0]['ner_tags']]}
test_aligned = tokenize_and_align_labels(test_row)

print("Original Words:", df.iloc[0]['words'])
print("Aligned Labels:", test_aligned["labels"][0])

Original Words: ['Insuffisance', 'gonadotrope', 'associée', 'à', 'l', ' ', ' ', 'congénitale', 'à', 'forme', 'cytomégalique', '.']
Aligned Labels: [-100, 1, 1, -100, 0, 0, 0, 1, 1, 1, 0, 0, 3, -100, -100, 0, -100]


## Step 3: Class Balancing (Weighted Cross-Entropy)
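In French clinical text, the large majority of words are "common" (Outside/O-tag). In this dataset the O-tag weight computed below (0.12) corresponds to roughly three quarters of all words. If we train a model without weights, it can reach a high token-level accuracy by guessing "O" for everything while extracting no entities at all.

To fix this, we calculate inverse frequency weights. This tells the model: "If you miss a rare Disease or Chemical tag, the penalty is higher than for a common word, in proportion to how rare the tag is." With the weights computed below, the scale runs from 0.12 for the O-tag to about 35 for the rarest tag.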
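
As a worked illustration of the weighting formula used in the next cell, weight = total_tokens / (num_classes * class_count), here is a small sketch on made-up counts. The toy tag list and the helper name inverse_frequency_weights are hypothetical and not taken from the dataset.

```python
from collections import Counter

def inverse_frequency_weights(tags, num_classes):
    counts = Counter(tags)
    total = sum(counts.values())
    # A class that never occurs falls back to a count of 1 to avoid division by zero
    return [total / (num_classes * counts.get(c, 1)) for c in range(num_classes)]

# Toy corpus: 8 "O" tokens (tag 0) and 2 entity tokens (tag 1)
toy_tags = [0] * 8 + [1] * 2
toy_weights = inverse_frequency_weights(toy_tags, num_classes=2)
print(toy_weights)  # 10 / (2 * 8) and 10 / (2 * 2)
```

With these counts the rare class is weighted four times as heavily as the common one, which is the ratio of their frequencies.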

In [6]:
# Flatten all tags to count occurrences
all_tags = [tag for tags_list in df['ner_tags'] for tag in tags_list]
tag_counts = Counter(all_tags)

# Calculate Weights: Total_Samples / (Num_Classes * Class_Count)
total_tokens = sum(tag_counts.values())
num_classes = 11  # Tags 0 to 10
weights = []

for i in range(num_classes):
    count = tag_counts.get(i, 1) # Avoid division by zero
    weight = total_tokens / (num_classes * count)
    weights.append(weight)

# Convert to Tensor for PyTorch
class_weights = torch.tensor(weights, dtype=torch.float)

print("Calculated Class Weights:")
for i, w in enumerate(weights):
    print(f"Tag {i}: {w:.2f}")

Calculated Class Weights:
Tag 0: 0.12
Tag 1: 1.33
Tag 2: 2.25
Tag 3: 3.78
Tag 4: 3.65
Tag 5: 1.87
Tag 6: 7.33
Tag 7: 34.90
Tag 8: 30.08
Tag 9: 27.09
Tag 10: 23.33


## Phase 2: Domain-Specific Fine-Tuning (Sovereign AI)
In this phase, we move beyond generic models and standard evaluation. We will:

1. **Model Selection:** Utilize DrBERT, a sovereign French model pre-trained on 7GB of medical text.

2. **K-Fold Cross-Validation (5 Folds):** Instead of a single split, we will train 5 different models, each on a different 80% of the data, and evaluate each on the held-out 20%. Averaging over folds gives a more robust performance estimate than any single split.

3. **Strict Evaluation:** We utilize the seqeval library to calculate Strict Entity-Level F1-Scores, ensuring that an entity is only marked "correct" if its type, start, and end are all perfect.

## Step 1: Metric Configuration (Seqeval)
We define the function that the trainer will use at each epoch. This function maps our numeric tags back to their clinical names (DISO, CHEM, etc.) and uses seqeval to judge the model's performance on full entities rather than individual words.

In [7]:
import numpy as np
from seqeval.metrics import f1_score, classification_report

# Mapping tags to their clinical labels for seqeval
label_list = ["O", "DISO", "PROC", "ANAT", "LIVB", "CHEM", "PHYS", "DEVI", "GEOG", "PHEN", "OBJC"]

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=2)

    # Drop positions labelled -100 (special tokens and non-first sub-tokens),
    # then map the remaining ids to string labels
    true_predictions = [
        [label_list[pred_id] for pred_id, label_id in zip(prediction, label) if label_id != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[label_id] for pred_id, label_id in zip(prediction, label) if label_id != -100]
        for prediction, label in zip(predictions, labels)
    ]

    return {
        "f1": f1_score(true_labels, true_predictions),
    }
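
seqeval is built around IOB-style tags such as B-DISO and I-DISO, whereas label_list holds bare type names, so it is worth verifying how unprefixed labels are scored. One possible way to make entity boundaries explicit before scoring is to convert each run of identical labels into IOB2 tags, as sketched below. The helper to_iob2 is a hypothetical name and not part of the original notebook. It assumes that adjacent words with the same label belong to one entity, which the flat labels cannot confirm.

```python
def to_iob2(flat_labels):
    """Turn bare type labels into IOB2 tags, one entity per run of equal labels."""
    tagged, previous = [], "O"
    for label in flat_labels:
        if label == "O":
            tagged.append("O")
        elif label == previous:
            tagged.append(f"I-{label}")   # continues the current run
        else:
            tagged.append(f"B-{label}")   # starts a new entity
        previous = label
    return tagged

toy_sequence = ["DISO", "DISO", "O", "ANAT", "CHEM", "CHEM"]
print(to_iob2(toy_sequence))
```

Applied to both true_labels and true_predictions inside compute_metrics, a conversion like this would give seqeval tags in the scheme it expects.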

## Step 2: 5-Fold Cross-Validation Loop
This is the core of Phase 2. We use KFold from Scikit-Learn to create 5 distinct training "folds."
Note: This process will train the model 5 times. If you are on a CPU, I recommend reducing n_splits to 2 for testing, but keep it at 5 for your final GitHub push.
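
To see how the rows divide across folds, the sketch below reproduces the documented KFold sizing rule in plain Python: the first n_samples % n_splits folds receive one extra sample. The function name kfold_validation_sizes is made up for illustration, and the sketch only computes sizes, not the shuffled indices.

```python
def kfold_validation_sizes(n_samples, n_splits):
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds get one additional validation sample
    return [base + 1 if i < extra else base for i in range(n_splits)]

val_sizes = kfold_validation_sizes(2552, 5)
train_sizes = [2552 - v for v in val_sizes]
print(val_sizes)
print(train_sizes)
```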

In [None]:
from sklearn.model_selection import KFold
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from datasets import Dataset
import torch.nn as nn
import numpy as np

#DEFINE THE WEIGHTED TRAINER
# This overrides the default loss to use clinical class weights
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        
        # Move class_weights (from Phase 1) to the same device as the model (GPU/CPU)
        loss_fct = nn.CrossEntropyLoss(weight=class_weights.to(model.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

#INITIALIZE K-FOLD
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_results = []

# Prepare data directly from df columns
data_list = df[['words', 'ner_tags']].to_dict('records')

print(f"Starting 5-Fold Cross-Validation with DrBERT...")

for fold, (train_idx, val_idx) in enumerate(kf.split(data_list)):
    print(f"\n--- 🏥 Fold {fold + 1}/5 ---")
    
    # Create the datasets for this specific fold
    train_data = Dataset.from_list([data_list[i] for i in train_idx])
    val_data = Dataset.from_list([data_list[i] for i in val_idx])
    
    # Tokenize and align using Phase 1 function
    train_tokenized = train_data.map(tokenize_and_align_labels, batched=True)
    val_tokenized = val_data.map(tokenize_and_align_labels, batched=True)

    # Load a fresh copy of DrBERT for each fold
    model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=11)

    # Training settings
    args = TrainingArguments(
        output_dir=f"./fold_{fold}",
        eval_strategy="epoch",  
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        num_train_epochs=3, 
        weight_decay=0.01,
        save_strategy="no", 
        report_to="none"
    )

    # Initialize the Trainer
    trainer = WeightedTrainer(
        model=model,
        args=args,
        train_dataset=train_tokenized,
        eval_dataset=val_tokenized,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        data_collator=DataCollatorForTokenClassification(tokenizer)
    )

    # Start training
    trainer.train()
    
    # Evaluation
    eval_score = trainer.evaluate()['eval_f1']
    fold_results.append(eval_score)
    print(f"Fold {fold+1} Strict F1: {eval_score:.4f}")

# Final Result
print("\n Final K-Fold Results")
print(f"Mean Strict F1: {np.mean(fold_results):.4f} (+/- {np.std(fold_results):.4f})")

Starting 5-Fold Cross-Validation with DrBERT...

--- 🏥 Fold 1/5 ---

Map: 100%|██████████| 2041/2041 [00:00<00:00, 6271.06 examples/s]
Map: 100%|██████████| 511/511 [00:00<00:00, 13813.95 examples/s]
Some weights of CamembertForTokenClassification were not initialized from the model checkpoint at Dr-BERT/DrBERT-7GB and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 0.93504 | 0.380329 |
| 2 | No log | 0.864846 | 0.446229 |
| 3 | No log | 0.87139 | 0.467171 |

Fold 1 Strict F1: 0.4672

--- 🏥 Fold 2/5 ---

Map: 100%|██████████| 2041/2041 [00:00<00:00, 5561.86 examples/s]
Map: 100%|██████████| 511/511 [00:00<00:00, 7091.14 examples/s]

| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 1.020945 | 0.367676 |
| 2 | No log | 0.85587 | 0.437758 |
| 3 | No log | 0.832995 | 0.451453 |

Fold 2 Strict F1: 0.4515

--- 🏥 Fold 3/5 ---

Map: 100%|██████████| 2042/2042 [00:00<00:00, 6028.92 examples/s]
Map: 100%|██████████| 510/510 [00:00<00:00, 7732.05 examples/s]

| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 1.000529 | 0.365399 |
| 2 | No log | 0.932861 | 0.435588 |
| 3 | No log | 0.940697 | 0.444444 |

Fold 3 Strict F1: 0.4444

--- 🏥 Fold 4/5 ---

Map: 100%|██████████| 2042/2042 [00:00<00:00, 4878.19 examples/s]
Map: 100%|██████████| 510/510 [00:00<00:00, 6491.10 examples/s]

| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 0.907523 | 0.341085 |
| 2 | No log | 0.830673 | 0.429599 |
| 3 | No log | 0.841363 | 0.444099 |

Fold 4 Strict F1: 0.4441

--- 🏥 Fold 5/5 ---

Map: 100%|██████████| 2042/2042 [00:00<00:00, 4908.74 examples/s]
Map: 100%|██████████| 510/510 [00:00<00:00, 6281.39 examples/s]

| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 0.977565 | 0.393737 |
| 2 | No log | 0.877662 | 0.450378 |
| 3 | No log | 0.879932 | 0.459078 |

Fold 5 Strict F1: 0.4591

--- ✅ Final K-Fold Results ---
Mean Strict F1: 0.4532 (+/- 0.0089)
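
The summary line can be re-derived from the final-epoch F1 of each fold. The sketch below uses only the standard library. np.std defaults to the population standard deviation, which statistics.pstdev matches, and the sample standard deviation is shown alongside as an optional alternative that the notebook does not report.

```python
import statistics

# Final-epoch F1 per fold, copied from the training logs above
fold_f1 = [0.467171, 0.451453, 0.444444, 0.444099, 0.459078]

mean_f1 = statistics.mean(fold_f1)
population_std = statistics.pstdev(fold_f1)  # same convention as np.std (ddof=0)
sample_std = statistics.stdev(fold_f1)       # ddof=1, slightly larger with only 5 folds

print(f"Mean Strict F1: {mean_f1:.4f} (+/- {population_std:.4f})")
print(f"Sample std (ddof=1): {sample_std:.4f}")
```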


In [None]:
import os
from transformers import AutoModelForTokenClassification, TrainingArguments, DataCollatorForTokenClassification
from datasets import Dataset

print("🚀 Starting Final Production Training on 100% of data...")

# 1. Use ALL data (No train/val split this time)
final_data_list = df[['words', 'ner_tags']].to_dict('records')
final_dataset = Dataset.from_list(final_data_list)

# 2. Apply your trusted alignment function
final_tokenized = final_dataset.map(tokenize_and_align_labels, batched=True)

# 3. Load a fresh, untrained DrBERT
final_model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=11)

# 4. Final Training Arguments (Saving the model this time!)
final_args = TrainingArguments(
    output_dir="./drbert-clinical-ner-final",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=4, # Slightly longer for the final run
    weight_decay=0.01,
    save_strategy="epoch", # Save at the end of every epoch
    save_total_limit=1,    # Keep only the most recent checkpoint to save disk space
    report_to="none"
)

# 5. Initialize the Trainer (Still using your Custom Weighted Loss)
final_trainer = WeightedTrainer(
    model=final_model,
    args=final_args,
    train_dataset=final_tokenized,
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer)
)

# 6. Train it!
final_trainer.train()

# 7. Explicitly save the final model and tokenizer to a dedicated folder
save_path = "./my_final_french_ner_model"
final_trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)

print(f"Production Model successfully saved to: {save_path}")

In [12]:
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# Point to the folder where we saved your model (must match save_path from the training cell)
saved_model_path = "./my_final_french_ner_model"

# Load your newly trained AI
print("Loading your Custom French Medical AI...")
custom_tokenizer = AutoTokenizer.from_pretrained(saved_model_path)
custom_model = AutoModelForTokenClassification.from_pretrained(saved_model_path)

# Create a Hugging Face Pipeline for NER
# aggregation_strategy="simple" merges adjacent tokens with the same predicted label into one entity span
ner_pipeline = pipeline(
    "ner", 
    model=custom_model, 
    tokenizer=custom_tokenizer, 
    aggregation_strategy="simple"
)

# The test sentence (a typical French clinical note)
test_sentence = "Un patient de 68 ans avec des antécédents d'hypertension artérielle et de diabète de type 2 est admis aux urgences pour une dyspnée sévère et des douleurs thoraciques irradiant vers le bras gauche. L'électrocardiogramme a révélé une fibrillation auriculaire, justifiant l'administration intraveineuse de 40 mg de Furosémide et l'implantation d'un pacemaker temporaire."

print(f"\nAnalyzing text: '{test_sentence}'\n")

# Run the AI!
predictions = ner_pipeline(test_sentence)

id2label = {
    "LABEL_0": "O", 
    "LABEL_1": "DISO (Maladie)", 
    "LABEL_2": "PROC (Procédure)", 
    "LABEL_3": "ANAT (Anatomie)", 
    "LABEL_4": "LIVB (Être Vivant)", 
    "LABEL_5": "CHEM (Médicament)", 
    "LABEL_6": "PHYS (Physiologie)", 
    "LABEL_7": "DEVI (Appareil)", 
    "LABEL_8": "GEOG (Lieu)", 
    "LABEL_9": "PHEN (Phénomène)", 
    "LABEL_10": "OBJC (Objet)"
}

# Print the results nicely
print("\n-Extracted Clinical Entities -")
if not predictions:
    print("No entities found. (Try a longer medical sentence!)")
else:
    for entity in predictions:
        word = entity['word']
        raw_label = entity['entity_group']
        score = entity['score']
        
        # Format the output
        readable_label = id2label.get(raw_label, raw_label)
        
        # We ignore 'O' (Outside) tags to only show the medical terms
        if readable_label != "O":
            print(f" Term: {word:<15} | Type: {readable_label:<18} | Confidence: {score:.2f}")

Device set to use cpu


Loading your Custom French Medical AI...

Analyzing text: 'Un patient de 68 ans avec des antécédents d'hypertension artérielle et de diabète de type 2 est admis aux urgences pour une dyspnée sévère et des douleurs thoraciques irradiant vers le bras gauche. L'électrocardiogramme a révélé une fibrillation auriculaire, justifiant l'administration intraveineuse de 40 mg de Furosémide et l'implantation d'un pacemaker temporaire.'


-Extracted Clinical Entities -
 Term: patient         | Type: LIVB (Être Vivant) | Confidence: 1.00
 Term: 68 ans          | Type: LIVB (Être Vivant) | Confidence: 0.76
 Term: antécédents     | Type: DISO (Maladie)     | Confidence: 0.65
 Term: 'hypertension artérielle | Type: DISO (Maladie)     | Confidence: 0.78
 Term: diabète de type | Type: DISO (Maladie)     | Confidence: 0.83
 Term: urgences        | Type: DISO (Maladie)     | Confidence: 0.37
 Term: dyspnée sévère  | Type: DISO (Maladie)     | Confidence: 0.88
 Term: douleurs        |

In [17]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

print("⚙️ Building the Semantic CIM-10 Ranker...")

# Create a hard-coded "mini" CIM-10 database (simulating a real hospital database)
cim10_data = {
    "Code": ["I10", "E11.9", "R06.0", "R07.4", "I48.9", "R51", "G03.9"],
    "Description": [
        "Hypertension artérielle essentielle",
        "Diabète sucré de type 2 sans complication",
        "Dyspnée",
        "Douleur thoracique, sans précision",
        "Fibrillation auriculaire non spécifiée",
        "Céphalée",
        "Méningite, non spécifiée"
    ]
}
df_cim10 = pd.DataFrame(cim10_data)

# Build the Mathematical Search Engine (TF-IDF)
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
tfidf_matrix = vectorizer.fit_transform(df_cim10['Description'])

# Define the Mapping Function
def get_cim10_code(extracted_term):
    # Convert the extracted word into numbers
    term_vector = vectorizer.transform([extracted_term])
    
    # Calculate Cosine Similarity against all codes in the database
    similarities = cosine_similarity(term_vector, tfidf_matrix)
    
    # Get the best match
    best_match_idx = similarities.argmax()
    best_score = similarities[0, best_match_idx]
    
    # If the similarity score is decent (> 0.25), return the code
    if best_score > 0.25:
        return df_cim10.iloc[best_match_idx]['Code'], df_cim10.iloc[best_match_idx]['Description'], best_score
    else:
        return "N/A", "Aucun code correspondant", best_score

# Let's test it using the exact output from your Phase 2 test!
test_extracted_diseases = [
    "hypertension artérielle", 
    "diabète de type 2", 
    "dyspnée sévère", 
    "fibrillation auriculaire"
]

print("\n-- Semantic Interoperability Results (T2A Mapping) --")
for disease in test_extracted_diseases:
    code, desc, score = get_cim10_code(disease)
    print(f"🔸Terme Extrait : {disease:<25}")
    print(f"   ↳ Code CIM-10 : {code} ({desc}) | Confiance: {score:.2f}\n")

⚙️ Building the Semantic CIM-10 Ranker...

-- Semantic Interoperability Results (T2A Mapping) --
🔸Terme Extrait : hypertension artérielle  
   ↳ Code CIM-10 : I10 (Hypertension artérielle essentielle) | Confiance: 0.86

🔸Terme Extrait : diabète de type 2        
   ↳ Code CIM-10 : E11.9 (Diabète sucré de type 2 sans complication) | Confiance: 0.66

🔸Terme Extrait : dyspnée sévère           
   ↳ Code CIM-10 : R06.0 (Dyspnée) | Confiance: 0.94

🔸Terme Extrait : fibrillation auriculaire 
   ↳ Code CIM-10 : I48.9 (Fibrillation auriculaire non spécifiée) | Confiance: 0.87
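
To show what the ranker measures, the sketch below computes a cosine similarity over character n-gram counts using only the standard library. It is deliberately simplified compared with the TfidfVectorizer setup above (raw counts, no IDF weighting, no word-boundary padding), so its scores will not match the ones printed. char_ngrams and cosine are hypothetical helper names.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4):
    text = text.lower()
    return Counter(
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    )

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = char_ngrams("dyspnée sévère")
score_match = cosine(query, char_ngrams("Dyspnée"))
score_other = cosine(query, char_ngrams("Céphalée"))
print(f"Dyspnée: {score_match:.2f} | Céphalée: {score_other:.2f}")
```

Because the match is based on shared character sequences, it tolerates inflection and extra modifiers such as "sévère", but it cannot relate synonyms that share no spelling.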



In [18]:
print("Initializing RGPD Pseudonymization Pipeline...\n")

# Define our new test sentence with a Geography (GEOG) added
test_sentence_rgpd = "Un patient de 68 ans avec des antécédents d'hypertension artérielle est admis aux urgences de l'Hôpital Pitié-Salpêtrière pour une dyspnée."

predictions = ner_pipeline(test_sentence_rgpd)

#  The Scrubber Function
def pseudonymize_clinical_note(text, ner_results):
    scrubbed_text = text
    
    # We target LABEL_4 (LIVB / Patients) and LABEL_8 (GEOG / Locations)
    sensitive_tags = {
        "LABEL_4": "[DONNÉE_PATIENT]", 
        "LABEL_8": "[LIEU_GÉOGRAPHIQUE]"
    }

    # We sort the entities in REVERSE order based on where they start.
    sorted_entities = sorted(ner_results, key=lambda x: x['start'], reverse=True)

    for ent in sorted_entities:
        raw_label = ent['entity_group']
        
        # If the AI flagged it as sensitive, we scrub it
        if raw_label in sensitive_tags:
            start = ent['start']
            end = ent['end']
            placeholder = sensitive_tags[raw_label]
            
            #slice the string to inject the placeholder
            scrubbed_text = scrubbed_text[:start] + placeholder + scrubbed_text[end:]

    return scrubbed_text

# execute the Scrubber
safe_text = pseudonymize_clinical_note(test_sentence_rgpd, predictions)

#Display the compliance results
print("--- 🛑 AVANT RGPD (Texte original, non-sécurisé) ---")
print(test_sentence_rgpd)
print("\n--- ✅ APRÈS RGPD (Pseudonymisé, prêt pour la base de données) ---")
print(safe_text)

Initializing RGPD Pseudonymization Pipeline...

--- 🛑 AVANT RGPD (Texte original, non-sécurisé) ---
Un patient de 68 ans avec des antécédents d'hypertension artérielle est admis aux urgences de l'Hôpital Pitié-Salpêtrière pour une dyspnée.

--- ✅ APRÈS RGPD (Pseudonymisé, prêt pour la base de données) ---
Un[DONNÉE_PATIENT] avec des antécédents d'hypertension artérielle est admis aux urgences de l'Hôpital[LIEU_GÉOGRAPHIQUE]-[LIEU_GÉOGRAPHIQUE] pour une dyspnée.
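
In the scrubbed text above, the space before each placeholder is lost and the hospital name yields two placeholders joined by a hyphen. One possible refinement is sketched below: trim whitespace from each span before replacing it, and merge neighbouring spans of the same label. The entity offsets are hand-built stand-ins for pipeline output, pseudonymize_trimmed is a hypothetical name, and merging across a hyphen is an assumption that may not suit every text.

```python
def pseudonymize_trimmed(text, entities, placeholders):
    spans = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        label = ent["entity_group"]
        if label not in placeholders:
            continue
        start, end = ent["start"], ent["end"]
        # Trim whitespace that the tokenizer attached to the span
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start == end:
            continue
        # Merge with the previous span if only spaces/hyphens separate them (assumption)
        if spans and spans[-1][2] == label and text[spans[-1][1]:start].strip(" -") == "":
            spans[-1] = (spans[-1][0], end, label)
        else:
            spans.append((start, end, label))
    # Replace from right to left so earlier offsets stay valid
    for start, end, label in reversed(spans):
        text = text[:start] + placeholders[label] + text[end:]
    return text

note = "Un patient de 68 ans admis à Pitié-Salpêtrière."

def fake_entity(fragment, label):
    # Hand-built stand-in for one pipeline result (hypothetical offsets)
    start = note.index(fragment)
    return {"entity_group": label, "start": start, "end": start + len(fragment)}

fake_results = [
    fake_entity(" patient", "LABEL_4"),
    fake_entity(" Pitié", "LABEL_8"),
    fake_entity("Salpêtrière", "LABEL_8"),
]
tags = {"LABEL_4": "[DONNÉE_PATIENT]", "LABEL_8": "[LIEU_GÉOGRAPHIQUE]"}
scrubbed = pseudonymize_trimmed(note, fake_results, tags)
print(scrubbed)
```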


