# 🧪 Testing ClinicalBERT / CamemBERT-bio

This notebook tests medical text-encoding models.

**Available models:**
- `ClinicalBERT`: English (medicalai/ClinicalBERT)
- `CamemBERT-bio`: French (almanach/camembert-bio-base) ⭐ RECOMMENDED

## 1️⃣ Installing dependencies

In [1]:
# Install if needed
!pip install transformers torch -q

## 2️⃣ Imports

In [2]:
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np

print(f"✅ PyTorch version: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")


✅ PyTorch version: 2.10.0+cpu
✅ CUDA available: False


## 3️⃣ Testing CamemBERT-bio (French) ⭐

In [3]:
# Load the French model
print("🔧 Loading CamemBERT-bio...")

model_name = "almanach/camembert-bio-base"
tokenizer_fr = AutoTokenizer.from_pretrained(model_name)
model_fr = AutoModel.from_pretrained(model_name)

# Switch to evaluation mode (disables dropout)
model_fr.eval()

print("✅ Model loaded!")

🔧 Loading CamemBERT-bio...


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Loading weights: 100%|██████████| 197/197 [00:00<00:00, 826.58it/s, Materializing param=encoder.layer.11.output.dense.weight]
CamembertModel LOAD REPORT from: almanach/camembert-bio-base
Key                             | Status     | 
--------------------------------+------------+-
lm_head.bias                    | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
lm_head.layer_norm.bias         | UNEXPECTED | 
lm_head.dense.weight            | UNEXPECTED | 
lm_head.layer_norm.weight       | UNEXPECTED | 
pooler.dense.weight             | MISSING    | 
pooler.dense.bias               | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored w

✅ Model loaded!


In [4]:
# Test with symptoms in French
symptoms_fr = [
    "douleur thoracique intense",  # severe chest pain
    "essoufflement au repos",      # shortness of breath at rest
    "sueurs froides"               # cold sweats
]

# Join into a single text
text_fr = ", ".join(symptoms_fr)
print(f"📝 Input text: {text_fr}")

# Tokenize
inputs = tokenizer_fr(text_fr, return_tensors="pt", max_length=512, truncation=True)
print(f"\n🔢 Tokens: {inputs['input_ids'].shape}")

# Encode
with torch.no_grad():
    outputs = model_fr(**inputs)

# Extract the embedding of the first token (<s>, CamemBERT's equivalent of [CLS])
embeddings = outputs.last_hidden_state[:, 0, :].numpy()

print("\n✅ Embeddings extracted!")
print(f"   Shape: {embeddings.shape}")
print(f"   Type: {embeddings.dtype}")
print(f"   Min: {embeddings.min():.3f}")
print(f"   Max: {embeddings.max():.3f}")
print(f"   Mean: {embeddings.mean():.3f}")
print(f"\n   First 10 values: {embeddings[0][:10]}")

📝 Input text: douleur thoracique intense, essoufflement au repos, sueurs froides

🔢 Tokens: torch.Size([1, 16])

✅ Embeddings extracted!
   Shape: (1, 768)
   Type: float32
   Min: -1.855
   Max: 0.524
   Mean: 0.011

   First 10 values: [ 0.07329346  0.19525796  0.07638985 -0.02134542 -0.12308737  0.1294392
  0.02200944  0.14559159  0.11541703 -0.00263739]
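
The cell above keeps only the first token's vector. A common alternative is to average all token vectors while ignoring padding (mean pooling). The sketch below is one possible way to do that on NumPy arrays; the function name `mean_pool` and the toy input are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padded positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # Broadcast the mask over the hidden dimension
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    # Guard against division by zero for fully masked rows
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy data standing in for a model output (hypothetical values):
# 1 sequence, 3 tokens (the last one is padding), hidden size 2
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]], dtype=np.float32)
mask = np.array([[1, 1, 0]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # (1, 2)
print(pooled)        # [[2. 3.]]
```

With the objects from the cell above, this could be called as `mean_pool(outputs.last_hidden_state.numpy(), inputs["attention_mask"].numpy())`. Whether mean pooling or the first-token vector works better for a given task is something to check empirically.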


## 6️⃣ Creating a reusable function

In [5]:
def encode_symptoms(symptoms: list[str], model_type: str = "camembert-bio") -> np.ndarray:
    """
    Encode a list of symptoms into a single embedding vector.

    Args:
        symptoms: List of symptom strings
        model_type: "camembert-bio" or "clinical-bert"

    Returns:
        np.ndarray: Embedding of shape (768,)
    """
    # Join the symptoms into one text
    text = ", ".join(symptoms)

    # Select the model
    if model_type == "camembert-bio":
        tokenizer = tokenizer_fr
        model = model_fr
    else:
        # tokenizer_en / model_en are not defined in the cells shown here;
        # load ClinicalBERT into them before using this branch
        tokenizer = tokenizer_en
        model = model_en

    # Encode
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # First-token embedding for the single input text
    embeddings = outputs.last_hidden_state[:, 0, :].numpy()

    return embeddings[0]
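
Model selection by name can also be expressed as a lookup table, so that an unknown or not-yet-loaded model name fails with an explicit message. This is a minimal sketch: `select_model` and `registry` are hypothetical names, and placeholder strings stand in for the real tokenizer and model objects.

```python
def select_model(model_type: str, registry: dict):
    """Return the (tokenizer, model) pair registered under model_type."""
    try:
        return registry[model_type]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise ValueError(
            f"Unknown model_type {model_type!r}; expected one of: {known}"
        ) from None

# Placeholder strings stand in for the loaded tokenizer/model objects
registry = {"camembert-bio": ("tokenizer_fr", "model_fr")}
print(select_model("camembert-bio", registry))  # ('tokenizer_fr', 'model_fr')
```

In the notebook, the registry would hold the actual objects, for example `{"camembert-bio": (tokenizer_fr, model_fr)}`, with further entries added as other models are loaded.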

In [6]:
# Test the function
test_symptoms = ["fièvre élevée", "toux sèche", "fatigue"]  # high fever, dry cough, fatigue

emb = encode_symptoms(test_symptoms)

print("✅ Function tested!")
print(f"   Input: {test_symptoms}")
print(f"   Output shape: {emb.shape}")
print(f"   Sample: {emb[:5]}")

✅ Function tested!
   Input: ['fièvre élevée', 'toux sèche', 'fatigue']
   Output shape: (768,)
   Sample: [ 0.05768203  0.19409488  0.11901671 -0.06593066 -0.05924556]
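
Once symptoms are encoded as fixed-size vectors, a typical next step is to compare two of them. The sketch below computes cosine similarity with NumPy; the helper `cosine_similarity` and the toy vectors are illustrative assumptions and do not come from the original notebook.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors (1.0 means same direction)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(u, v) / denom)

# Toy vectors (hypothetical values) standing in for 768-dimensional embeddings
a = np.array([1.0, 0.0, 1.0])
b = np.array([2.0, 0.0, 2.0])  # same direction as a
c = np.array([0.0, 3.0, 0.0])  # orthogonal to a
print(cosine_similarity(a, b))  # close to 1.0
print(cosine_similarity(a, c))  # 0.0
```

With the function defined earlier, two symptom lists could be compared as `cosine_similarity(encode_symptoms(list_a), encode_symptoms(list_b))`, where `list_a` and `list_b` are placeholder names. Scores from a base encoder that has not been fine-tuned for similarity are best treated as exploratory.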
