<a href="https://colab.research.google.com/github/24p11/recode-icd/blob/main/icd_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

"""
# üìÑ Notebook: Hybrid Model for ICD-10 Code Matching with RAG
**Context**: This notebook implements a **hybrid model** combining **binary classification** and **vector similarity (RAG)**
to determine if a medical expression can be related to an ICD-10 code.

The model has been build with:
- **40,000 ICD-10 codes** (including 15,000 frequently used codes).
- **350,000 training examples** (extract-definition pairs with binary labels from ICD index and APHP ICD-10 vocabulary list).
- **Explanation generation** via RAG for traceability and auditability.

---

## üè• About ICD-10 (International Classification of Diseases)
### **Context**
The **International Classification of Diseases (ICD-10)** is a standardized medical coding system recommended by the WHO, used for:
- **Classifying medical diagnoses** for mortality and morbidity statistics.
- **Facilitating billing** (e.g., T2A in France for activity-based pricing).
- **Producing public health statistics** (epidemiology, research).

### **ICD-10 Hierarchy**
ICD-10 is organized into a **hierarchical structure**:
1. **21 chapters** (e.g., *Chapter II* for tumors).
2. **Blocks**: Homogeneous groupings of 3-character categories (e.g., *"C00-C97"* for malignant tumors).
3. **3-character categories** (e.g., *"C15"* for malignant tumors of the esophagus).
4. **4-character codes** (e.g., *"C15.9"* for malignant tumor of the esophagus, unspecified).

### **Classification Content**
ICD-10 consists of two main parts:
1. **Tabular List**:
   - Hierarchical list of codes organized as **chapter ‚Üí block ‚Üí category ‚Üí code**.
   - Example:
     ```
     Chapter II: Neoplasms (C00-D48)
     ‚îî‚îÄ‚îÄ Block: Malignant neoplasms (C00-C97)
         ‚îî‚îÄ‚îÄ Category: Malignant neoplasm of esophagus (C15)
             ‚îî‚îÄ‚îÄ Code: C15.9 (unspecified)
     ```
2. **Alphabetic Index**:
   - List of medical terms (e.g., *"chest pain"*) with associated ICD-10 codes.
   - Essential tool for **medical coders** (e.g., PMSI coders in France).

### **ICD-10 Code Examples**
| Code   | Description                          | Chapter                     |
|--------|--------------------------------------|-----------------------------|
| R07.4  | Chest pain, unspecified              | Chapter XVIII (Symptoms)    |
| I21.9  | Acute myocardial infarction, unspecified | Chapter IX (Circulatory diseases) |
| C15.9  | Malignant neoplasm of esophagus, unspecified | Chapter II (Neoplasms) |


---

## üéØ Objectives
1. **Classify** whether an expression can be related to a specific ICD-10 code (binary task).
2. **RAG** given an expression suggest a probable ICD-10 code if possible.
2. **Explain** the decision using **similar codes** (RAG) and textual justifications.


---

## üèó Architecture
### 1. Hybrid Model
- **Binary Classification**:
  - Input: `[CLS] extract [SEP] ICD-10 definition [SEP]` (tokenized).
  - Output: Match probability (0 or 1).
  - Model: `CamemBERT-Bio-Base` (110M parameters, optimized for French).
- **Vector Similarity (RAG)**:
  - Precomputed embeddings of ICD-10 definitions (Annoy index for fast retrieval).
  - Cosine similarity score between extract and definitions.
- **Combined Score**:
  - Final score = `0.6 * classification_prob + 0.4 * similarity` (adjustable weights).

### 2. Data
- **Format**:
  ```json
  {
    "extract": "severe chest pain",
    "code": "R07.4",
    "definition": "Chest pain, unspecified",
    "label": 1  // 1=match, 0=no match
  }


In [1]:
import pandas as pd
import numpy as np
import json
import ast


In [2]:
PATH_DATA = "data/"
SOURCE_MODEL_NAME = "almanach/camembert-bio-base"
FINAL_MODEL_NAME = "camenbert_bio_icd_code_check_fr"

hf_token = "xxxx"
hf_username = "rflicoteaux"

In [3]:
df_index_icd = pd.read_csv(PATH_DATA + "cim_index_modifie.csv", sep=";")

In [4]:
df_index_icd

Unnamed: 0,code,icd_description,index_orginal,index_reformulate
0,A163,"Tuberculose des ganglions intrathoraciques, (s...","Ad√©nite (de), m√©diastinale, tuberculeuse (avec...",ad√©nite m√©diastinale tuberculeuse
1,A163,"Tuberculose des ganglions intrathoraciques, (s...","Ad√©nite (de), m√©diastinale, tuberculeuse (avec...",ad√©nite tuberculeuse m√©diastinale
2,A163,"Tuberculose des ganglions intrathoraciques, (s...","Ad√©nite (de), m√©diastinale, tuberculeuse (avec...",ad√©nite m√©diastinale tuberculeuse avec manifes...
3,A163,"Tuberculose des ganglions intrathoraciques, (s...","Ad√©nite (de), m√©diastinale, tuberculeuse (avec...",ad√©nite tuberculeuse m√©diastinale avec manifes...
4,A163,"Tuberculose des ganglions intrathoraciques, (s...","Ad√©nite (de), m√©diastinale, tuberculeuse (avec...",ad√©nite de m√©diastin tuberculeuse
...,...,...,...,...
281098,Z907,Absence acquise d'organe(s) g√©nital(aux),"Absence (compl√®te ou partielle) (de), vulve (c...",absence compl√®te acquise de vulve
281099,Z907,Absence acquise d'organe(s) g√©nital(aux),"Absence (compl√®te ou partielle) (de), vulve (c...",absence partielle acquise de la vulve
281100,Z907,Absence acquise d'organe(s) g√©nital(aux),"Absence (compl√®te ou partielle) (de), vulve (c...",absence partielle acquise de vulve
281101,Z907,Absence acquise d'organe(s) g√©nital(aux),Hyst√©rectomis√©e,hyst√©rectomis√©e


In [5]:
df_index_icd['code'] = df_index_icd['code'].str.replace('.', '', regex=False)

In [6]:
df_icd = pd.read_csv(PATH_DATA +"cim_10_atih_2019.tsv", sep="\t",header=None,names=["code","aut_mco","pos","aut_ssr","lib_court","libelle"])
df_icd.code = df_icd.code.str.replace(" ","")


In [7]:
icd10_description = {}
for index, row in df_icd.iterrows():
    code = row["code"]
    description = row["libelle"]
    icd10_description[code] = description

D√©tail source Hector :
- A  = Index CIM-10 (lite alphab√©tique - r√©f√©rence)
- B  =  CIM-10 (liste analytique) (ICD-D√©finition)
- DR1 =     (sur)
- ED1 =  Dictionaire coll√©giale endocrinologie (sur)
- GRONES = Groupe NESTOR (sur)
- METABOL = (sur)
- NP1= (sur)
- OP1= (sur)
- ORPHA  = Classification ORHPA Net (sur)
- RH1= (sur)
- SPILFG = Dictionaire Soci√©t√© fran√ßaise de pathologies infectieuses (sur)
- SRLF  = Dictionaire Soci√©t√© fran√ßaise de r√©animation (sur)
- T  = Th√©sam (√† fiabiliser)

In [8]:
df_hector_1= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="Cim Analytique",names =["libelle","source","code","autre_code"])
df_hector_2= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="Cim Alphab√©tique",names =["libelle","source","code","autre_code"])
df_hector = pd.concat([df_hector_1,df_hector_2],axis =0)
tmp= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="Thesam",names =["libelle","source","code","autre_code"])
df_hector = pd.concat([df_hector,tmp],axis =0)
tmp= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="Dermatologie",names =["libelle","source","code","autre_code"])
df_hector = pd.concat([df_hector,tmp],axis =0)
tmp= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="Endocrinologie",names =["libelle","source","code","autre_code"])
df_hector = pd.concat([df_hector,tmp],axis =0)
tmp= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="GRONES",names =["libelle","source","code","autre_code"])
df_hector = pd.concat([df_hector,tmp],axis =0)
tmp= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="Troubles m√©taboliques",names =["libelle","source","code","autre_code"])
df_hector = pd.concat([df_hector,tmp],axis =0)
tmp= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="N√©phrologie",names =["libelle","source","code","autre_code"])
df_hector = pd.concat([df_hector,tmp],axis =0)
tmp= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="Ophtalmo",names =["libelle","source","code","autre_code"])
df_hector = pd.concat([df_hector,tmp],axis =0)
tmp= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="Orphanet",names =["libelle","source","code","autre_code"])
df_hector = pd.concat([df_hector,tmp],axis =0)
tmp= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="Rhumatologie",names =["libelle","source","code","autre_code"])
df_hector = pd.concat([df_hector,tmp],axis =0)
tmp= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="Germes",names =["libelle","source","code","autre_code"])
df_hector = pd.concat([df_hector,tmp],axis =0)
tmp= pd.read_excel(PATH_DATA + "Dictionnaire_Hector_MAJ062019.xlsx",sheet_name="SRLF",names =["libelle","source","code","autre_code"])
df_hector = pd.concat([df_hector,tmp],axis =0)

In [9]:
df_hector.groupby("source").size()

source
A          40303
B          45266
DR1         1834
ED1          301
GRONES       166
METABOL      269
NP1          715
OP1          444
ORPHA      10274
RH1         1042
SPILFG       258
SRLF          51
T          21247
dtype: int64

In [10]:
df_prep = df_hector[~ df_hector.source.isin(["A","B","T"])].rename(columns={"libelle":"extrait"}).merge(df_icd[["code","libelle"]].rename(columns={"libelle":"definition"}) )[["extrait","code","definition"]]
df_prep = pd.concat([df_prep,df_index_icd[["index_reformulate","code","icd_description"]].rename(columns={"icd_description":"definition",		"index_reformulate": "extrait"})])

In [11]:
df_prep = df_prep[~((df_prep.definition.isna()) | (df_prep.code.str.contains("nocode") ) ) ]

In [12]:
df_prep

Unnamed: 0,extrait,code,definition
0,Tuberculose pulmonaire SAI,A159,Tuberculose de l'appareil respiratoire sans pr...
1,Ad√©nite tuberculeuse,A182,Ad√©nopathie tuberculeuse p√©riph√©rique
2,Abc√®s tuberculeux cutan√© et sous cutan√©,A184,Tuberculose de la peau et du tissu cellulaire ...
3,Tuberculose cutan√©e,A184,Tuberculose de la peau et du tissu cellulaire ...
4,Lupus tuberculeux,A184,Tuberculose de la peau et du tissu cellulaire ...
...,...,...,...
281098,absence compl√®te acquise de vulve,Z907,Absence acquise d'organe(s) g√©nital(aux)
281099,absence partielle acquise de la vulve,Z907,Absence acquise d'organe(s) g√©nital(aux)
281100,absence partielle acquise de vulve,Z907,Absence acquise d'organe(s) g√©nital(aux)
281101,hyst√©rectomis√©e,Z907,Absence acquise d'organe(s) g√©nital(aux)


### Preparation negatives data :
Build 3 dataframes with charactistitics
- The code and extrait are differents
- ```df_prep_neg1``` : code and extrait are from the same ICD chapter. Will be reprensented at 30%
- ```df_prep_neg2``` : code and extrait are from the same ICD categorie. Will be reprensented at 60%
- ```df_prep_neg3``` :  code and extrait are not inside the same chapter and are randomly choosen. Will be reprensented at 10%

In [13]:
df_prep_neg1 = df_prep.assign(categ = df_prep.code.str[0]).sample(frac=1).sort_values("categ")["code"].reset_index()
df_prep_neg1 =pd.concat([df_prep.sort_values('code').reset_index().rename(columns={"code":"old_code"}).drop(columns="index"),
           df_prep_neg1.drop(columns="index")],axis=1)
df_prep_neg1 = df_prep_neg1[(df_prep_neg1.old_code!=df_prep_neg1.code)]
df_prep_neg1 = df_prep_neg1.drop(columns=['definition','old_code']).merge(df_icd[["code","libelle"]].rename(columns={"libelle":"definition"}))

In [14]:
df_prep_neg2 = df_prep.assign(categ = df_prep.code.str[:3]).sample(frac=1).sort_values("categ")["code"].reset_index()
df_prep_neg2 =pd.concat([df_prep.sort_values('code').reset_index().rename(columns={"code":"old_code"}).drop(columns="index"),
           df_prep_neg2.drop(columns="index")],axis=1)
df_prep_neg2 = df_prep_neg2[ (df_prep_neg2.old_code!=df_prep_neg2.code)]
df_prep_neg2 = df_prep_neg2.drop(columns=['definition','old_code']).merge(df_icd[["code","libelle"]].rename(columns={"libelle":"definition"}))

In [15]:
df_prep_neg3 = df_prep.sample(frac=1)["code"].reset_index()
df_prep_neg3 =pd.concat([df_prep.reset_index().rename(columns={"code":"old_code"}).drop(columns="index"),
           df_prep_neg3.drop(columns="index")],axis=1)
df_prep_neg3 = df_prep_neg3[ (df_prep_neg3.old_code!=df_prep_neg3.code) & (df_prep_neg3.old_code.str[0]!=df_prep_neg3.code.str[0])]
df_prep_neg3 = df_prep_neg3.drop(columns=['definition','old_code']).merge(df_icd[["code","libelle"]].rename(columns={"libelle":"definition"}))

In [16]:
df_prep_neg = pd.concat([df_prep_neg1.sample(round(len(df_prep) * 0.3 )),
          df_prep_neg2.sample(round(len(df_prep) * 0.6 )),
          df_prep_neg3.sample(round(len(df_prep) * 0.1 ))],
          axis=0).reset_index().drop(columns="index")


In [17]:
df_prep_final = pd.concat([df_prep.assign(label=1).reset_index().drop(columns="index"),
                           df_prep_neg.assign(label=0).reset_index().drop(columns="index")]).\
                           sample(frac=1).reset_index().drop(columns="index")

In [18]:
df_prep_final

Unnamed: 0,extrait,code,definition,label
0,kyste collo√Øde folliculaire de la peau,L721,Kyste s√©bac√©,0
1,instabilit√© de l'√©motion,Q068,Autres malformations cong√©nitales pr√©cis√©es de...,0
2,"hydrates de carbone, anomalie du m√©tabolisme c...",E740,Th√©saurismose glycog√©nique,0
3,calcul urinaire obstructif sans infection (emp...,N133,"Hydron√©phroses, autres et sans pr√©cision",0
4,dysplasie moyenne col ut√©rin,A601,Infection de la marge cutan√©e de l'anus et du ...,0
...,...,...,...,...
552929,phlegmon √† ligament ut√©rin rond,N732,"Param√©trite et phlegmon pelvien, sans pr√©cision",1
552930,fracture par abduction de l'orteil,S642,L√©sion traumatique du nerf radial au niveau du...,0
552931,ag√©n√©sie oreille,Q169,Malformation cong√©nitale de l'oreille (avec at...,1
552932,surveillance m√©dicale pour grossesse avec ant√©...,Z354,Surveillance d'une grossesse avec multiparit√© ...,0


In [19]:
len(df_prep_final)

552934

In [None]:
data = []
for index, row in df_prep_final.iterrows():
    data.append({
        "extrait": row["extrait"],
        "code": row["code"],
        "definition": row["definition"],
        "label": row["label"]  # Set label to 1 for all entries
    })

In [None]:
from transformers import AutoModel, AutoTokenizer
import torch
from sklearn.metrics.pairwise import cosine_similarity
from datasets import Dataset
import numpy as np


In [None]:
tokenizer = AutoTokenizer.from_pretrained(SOURCE_MODEL_NAME)

tokenizer_config.json: 0.00B [00:00, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/811k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/374 [00:00<?, ?B/s]

In [None]:
# Convertir en Dataset Hugging Face
dataset = Dataset.from_list(data)

# Tokenisation pour la classification binaire
def tokenize(examples):
    texts = [f"[CLS] {e} [SEP] {d} [SEP]" for e, d in zip(examples["extrait"], examples["definition"])]
    return tokenizer(texts, padding="max_length", truncation=True, max_length=64)

tokenized_dataset = dataset.map(tokenize, batched=True)

Map:   0%|          | 0/552934 [00:00<?, ? examples/s]

In [None]:
# Convertir en Dataset Hugging Face
dataset_val = Dataset.from_list(data[:1000])

tokenized_dataset_val = dataset_val.map(tokenize, batched=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
from transformers import AutoModelForSequenceClassification, AutoModel
import torch.nn as nn
import torch

class HybridModel(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.base_model = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.base_model.config.hidden_size, 1)  # T√™te de classification binaire
        # Pas de t√™te suppl√©mentaire pour les embeddings (on utilise les embeddings du [CLS])
        self.loss_fct = nn.BCEWithLogitsLoss() # Binary Cross-Entropy with Logits Loss

    def forward(self, input_ids, attention_mask, labels=None, output_embeddings=False):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # Embedding du token [CLS]
        logits = self.classifier(cls_embedding)

        if labels is not None:
            # Calculate loss if labels are provided
            loss = self.loss_fct(logits.squeeze(-1), labels.float())
            if output_embeddings:
                return (loss, logits, cls_embedding)
            return (loss, logits)

        if output_embeddings:
            return logits, cls_embedding
        return logits

# Initialiser le mod√®le
model = HybridModel(SOURCE_MODEL_NAME)
model.to("cuda")

config.json:   0%|          | 0.00/710 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/443M [00:00<?, ?B/s]

Some weights of CamembertModel were not initialized from the model checkpoint at almanach/camembert-bio-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


HybridModel(
  (base_model): CamembertModel(
    (embeddings): CamembertEmbeddings(
      (word_embeddings): Embedding(32005, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): CamembertEncoder(
      (layer): ModuleList(
        (0-11): 12 x CamembertLayer(
          (attention): CamembertAttention(
            (self): CamembertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): CamembertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

In [None]:
#2.3. Entra√Ænement du Mod√®le
from transformers import Trainer, TrainingArguments
from sklearn.metrics import f1_score, roc_auc_score, precision_score, recall_score
import numpy as np
import torch


# 1. Define evaluation metrics
def compute_metrics(eval_pred):
    """Compute classification metrics for binary task."""
    logits, labels = eval_pred
    probs = torch.sigmoid(torch.tensor(logits)).numpy()  # Convert logits to probabilities
    preds = (probs > 0.5).astype(int)  # Default threshold of 0.5 for binary classification

    # Return comprehensive metrics
    return {
        "f1": f1_score(labels, preds, average="binary"),
        "precision": precision_score(labels, preds, average="binary"),
        "recall": recall_score(labels, preds, average="binary"),
        "roc_auc": roc_auc_score(labels, probs),
        "accuracy": (preds == labels).mean()
    }

# 2. Configure training arguments for A100 40GB optimization
training_args = TrainingArguments(
    output_dir="./results",                # Output directory for model checkpoints
    per_device_train_batch_size=32,       # Batch size for training (32 fits well in 40GB)
    per_device_eval_batch_size=64,        # Larger batch size for evaluation
    gradient_accumulation_steps=2,        # Accumulate gradients over 2 steps (equivalent to batch_size=64)
    num_train_epochs=5,                    # Number of training epochs
    learning_rate=2e-5,                   # Typical learning rate for BERT-like models
    warmup_steps=500,                      # Learning rate warmup steps
    weight_decay=0.01,                     # L2 regularization
    logging_dir="./logs",                  # Directory for training logs
    logging_steps=100,                     # Log every 100 steps
    eval_strategy="epoch",          # Evaluate after each epoch
    save_strategy="epoch",                 # Save model after each epoch
    load_best_model_at_end=True,           # Load best model at end of training
    metric_for_best_model="f1",            # Use F1 score to determine best model
    greater_is_better=True,                # Higher F1 is better
    fp16=True,                             # Use mixed precision (FP16) to save memory
    report_to="none"                      # Disable TensorBoard/WandB reporting
)

# 3. Initialize Trainer with our hybrid model
trainer = Trainer(
    model=model,                          # Our hybrid model (classification + embeddings)
    args=training_args,                   # Training configuration
    train_dataset=tokenized_dataset,      # Training data
    eval_dataset=tokenized_dataset_val,       # Evaluation data (replace with actual validation set)
    compute_metrics=compute_metrics       # Metrics computation function
)

# 4. Start training
print("Starting training...")
trainer.train()



Starting training...


Epoch,Training Loss,Validation Loss,F1,Precision,Recall,Roc Auc,Accuracy
1,0.3007,0.250969,0.890966,0.863179,0.920601,0.959023,0.500204
2,0.2167,0.16351,0.933754,0.915464,0.95279,0.982429,0.50102
3,0.1854,0.117568,0.957806,0.941909,0.974249,0.989367,0.501224
4,0.1511,0.089649,0.964931,0.955789,0.974249,0.992349,0.5017
5,0.1422,0.076325,0.971429,0.958246,0.984979,0.993373,0.501428


TrainOutput(global_step=43200, training_loss=0.21994276695781284, metrics={'train_runtime': 3863.2349, 'train_samples_per_second': 715.636, 'train_steps_per_second': 11.182, 'total_flos': 0.0, 'train_loss': 0.21994276695781284, 'epoch': 5.0})

In [None]:
# 5. Save the trained model and tokenizer
import safetensors
import os
output_dir = "./" + FINAL_MODEL_NAME
os.makedirs(output_dir, exist_ok=True)
safetensors.torch.save_model(model, os.path.join(output_dir, "model.safetensors"))
tokenizer.save_pretrained(output_dir)
print(f"Model saved to {output_dir}")

Model saved to ./camenbert_bio_icd_code_check_fr


In [None]:
from huggingface_hub import login, create_repo, upload_folder
# Log in to Hugging Face Hub using the token accessed earlier
login(token=hf_token)
# Define the local directory containing your model files
# This should be the directory where you saved model.safetensors, params.json, and tekken.json
local_model_directory = output_dir
repo_id =  hf_username + "/" + FINAL_MODEL_NAME

create_repo(repo_id, exist_ok=True)

print(f"Hugging Face repository '{repo_id}' created or already exists.")



Hugging Face repository 'rflicoteaux/camenbert_bio_icd_code_check_fr' created or already exists.


In [None]:
# Upload the files to the Hugging Face repository
# The 'repo_id' is the name of your repository on the Hub
# The 'folder_path' is the local directory containing the files to upload
# The 'path_in_repo' is the path within the repository where the files will be uploaded (e.g., "." for the root)
upload_folder(
    repo_id=repo_id,
    folder_path=local_model_directory,
    path_in_repo=".",
    commit_message="Upload Cambenbert-bio finetuned to validate ICD-10 expressions in French"
)

print(f"Model files from '{local_model_directory}' uploaded to '{repo_id}'.")

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...heck_fr/model.safetensors:   0%|          |  551kB /  443MB            

  ...r/sentencepiece.bpe.model: 100%|##########|  811kB /  811kB            

Model files from './camenbert_bio_icd_code_check_fr' uploaded to 'rflicoteaux/camenbert_bio_icd_code_check_fr'.


In [None]:
#2.4. Utilisation des Embeddings pour le RAG
def get_embedding(text, model, tokenizer):
    inputs = tokenizer(f"[CLS] {text} [SEP]", return_tensors="pt", padding=True, truncation=True, max_length=64).to("cuda")
    with torch.no_grad():
        _, embedding = model(**inputs, output_embeddings=True)
    return embedding.cpu().numpy()[0]



In [None]:
# Fonction de recherche des codes similaires (RAG)
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_similar_codes(extrait, top_k=5):
    extrait_embedding = get_embedding(extrait, model, tokenizer)
    similarities = {
        code: cosine_similarity([extrait_embedding], [embedding])[0][0]
        for code, embedding in cim10_embeddings.items()
    }
    return sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:top_k]

In [None]:
# Pr√©calculer les embeddings des d√©finitions CIM10
cim10_embeddings = {}
for index, row in df_icd.iterrows():
    code = row["code"]
    definition = row["libelle"]
    cim10_embeddings[code] = get_embedding(definition, model, tokenizer)

In [None]:
df_mistral_syn= pd.read_excel(PATH_DATA + "cim_mistral_synomyns.xlsx",index_col=0)

In [None]:
# Test S321

In [None]:
# 2.5. M√©canisme de RAG Complet
def predict_with_rag(expression, icd10_code, seuil_classification=0.5, seuil_similarity=0.7):
    # √âtape 1 : Classification binaire
    inputs = tokenizer(f"[CLS] {expression} [SEP] {icd10_description[icd10_code]} [SEP]", return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(**inputs)
    prob_match = torch.sigmoid(logits).item()
    is_match = prob_match >= seuil_classification

    # √âtape 2 : Similarit√© (RAG)
    similarity = cosine_similarity(
        [get_embedding(expression, model, tokenizer)],
        [cim10_embeddings[icd10_code]]
    )[0][0]

    # √âtape 3 : D√©cision combin√©e
    combined_score = 0.6 * prob_match + 0.4 * similarity  # Pond√©ration ajustable
    final_match = combined_score >= seuil_similarity

    # √âtape 4 : Explications via RAG (top 3 codes similaires)
    similar_codes = retrieve_similar_codes(expression, top_k=3)

    # Convert all float32 values to standard floats before returning
    return {
        "code": icd10_code,
        "icd10_description": icd10_description[icd10_code],
         "expression": expression,
        "is_match": bool(final_match),
        "prob_match": float(prob_match),
        "similarity": float(similarity),
        "combined_score": float(combined_score),
        "similar_codes": [(code, float(sim)) for code, sim in similar_codes], # Ensure similarity scores are floats
        "explanation": f"Le score combin√© ({combined_score:.2f}) est bas√© sur : "
                       f"Classification ({prob_match:.2f}) + Similarit√© ({similarity:.2f}). ",
        "Codes similaires" : f"{', '.join([f'{icd10_description[code] + "[" + code +"]"} ({sim:.2f})' for code, sim in similar_codes])}"
    }

In [None]:
# Exemple d'utilisation
result = predict_with_rag(
    expression="douleur thoracique non syst√©matis√©e, sans signe d'accompagnement",
    icd10_code="R074"
)



print(json.dumps(result, indent=2, ensure_ascii=False))

{
  "code": "R074",
  "icd10_description": "Douleur thoracique, sans pr√©cision",
  "expression": "douleur thoracique non syst√©matis√©e, sans signe d'accompagnement",
  "is_match": true,
  "prob_match": 0.9576753973960876,
  "similarity": 0.5909492373466492,
  "combined_score": 0.8109849095344543,
  "similar_codes": [
    [
      "M2559",
      0.676640510559082
    ],
    [
      "I209",
      0.6534093022346497
    ],
    [
      "M479",
      0.6530258655548096
    ]
  ],
  "explanation": "Le score combin√© (0.81) est bas√© sur : Classification (0.96) + Similarit√© (0.59). ",
  "Codes similaires": "Douleur articulaire - Si√®ge non pr√©cis√©[M2559] (0.68), Angine de poitrine, sans pr√©cision[I209] (0.65), Spondylarthrose, sans pr√©cision[M479] (0.65)"
}


In [None]:
#4. √âvaluation et Optimisation
#4.1. M√©triques Cl√©s
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score

def evaluate_model(test_dataset, model, tokenizer):
    predictions = []
    labels = []
    similarities = []

    for example in test_dataset:
        inputs = tokenizer(f"[CLS] {example['extrait']} [SEP] {example['definition']} [SEP]", return_tensors="pt").to("cuda")
        with torch.no_grad():
            logits, embedding = model(**inputs, output_embeddings=True)
        prob = torch.sigmoid(logits).item()
        similarity = cosine_similarity(
            [embedding.cpu().numpy()],
            [cim10_embeddings[example["code"]]]
        )[0][0]
        predictions.append(prob >= 0.5)
        labels.append(example["label"])
        similarities.append(similarity)

    # M√©triques
    metrics = {
        "auc_roc": roc_auc_score(labels, [p for p in predictions]),
        "f1": f1_score(labels, predictions),
        "precision": precision_score(labels, predictions),
        "recall": recall_score(labels, predictions),
        "mean_similarity_pos": np.mean([s for s, l in zip(similarities, labels) if l == 1]),
        "mean_similarity_neg": np.mean([s for s, l in zip(similarities, labels) if l == 0])
    }
    return metrics

# Exemple d'utilisation
metrics = evaluate_model(tokenized_dataset, model, tokenizer)
print(json.dumps(metrics, indent=2))


In [None]:
#4.2. Optimisation des Seuils
def optimize_thresholds(test_dataset, model, tokenizer):
    probs = []
    labels = []
    similarities = []

    for example in test_dataset:
        inputs = tokenizer(f"[CLS] {example['extrait']} [SEP] {example['definition']} [SEP]", return_tensors="pt").to("cuda")
        with torch.no_grad():
            logits, embedding = model(**inputs, output_embeddings=True)
        probs.append(torch.sigmoid(logits).item())
        labels.append(example["label"])
        # Pass the embeddings as 2D arrays without extra list wrapping
        similarity = cosine_similarity(
            embedding.cpu().numpy(),
            cim10_embeddings[example["code"]].reshape(1, -1) # Reshape cim10_embeddings to be 2D
        )[0][0]
        similarities.append(similarity)


    # Optimiser le seuil pour la classification
    f1_scores = []
    thresholds = np.linspace(0, 1, 50)
    for t in thresholds:
        preds = [p >= t for p in probs]
        f1_scores.append(f1_score(labels, preds))
    optimal_cls_threshold = thresholds[np.argmax(f1_scores)]

    # Optimiser le seuil pour la similarit√©
    combined_scores = [0.6*p + 0.4*s for p, s in zip(probs, similarities)]
    f1_scores = []
    for t in thresholds:
        preds = [cs >= t for cs in combined_scores]
        f1_scores.append(f1_score(labels, preds))
    optimal_combined_threshold = thresholds[np.argmax(f1_scores)]

    return {
        "optimal_cls_threshold": optimal_cls_threshold,
        "optimal_combined_threshold": optimal_combined_threshold,
        "max_f1_cls": max(f1_scores),
        "max_f1_combined": max(f1_scores)
    }

In [None]:

# Exemple d'utilisation
thresholds = optimize_thresholds(tokenized_dataset_val, model, tokenizer)
print(json.dumps(thresholds, indent=2))

{
  "optimal_cls_threshold": 0.673469387755102,
  "optimal_combined_threshold": 0.2040816326530612,
  "max_f1_cls": 0.7077385424492862,
  "max_f1_combined": 0.7077385424492862
}


In [None]:
#5.2. G√©n√©ration d'Explications avec RAG
from transformers import pipeline

# Charger un mod√®le de g√©n√©ration (ex : Mistral 7B pour les explications)
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct", device="cuda")

def generate_explanation(extrait, code, similar_codes):
    prompt = f"""
    **Contexte** :
    - Extrait du CRH : "{extrait}"
    - Code CIM10 : {code} ({cim10_definitions[code]})
    - Codes similaires : {', '.join([f"{c} ({cim10_definitions[c]})" for c, _ in similar_codes])}

    **T√¢che** : Expliquez pourquoi l'extrait correspond (ou non) au code CIM10, en citant des √©l√©ments pr√©cis du texte.
    Si l'extrait ne correspond pas, proposez une alternative parmi les codes similaires et justifiez.
    **Format** :
    - Correspondance : [Oui/Non]
    - Justification : "[explication avec citations]"
    - Alternative sugg√©r√©e : [CODE] - "[justification]"
    """
    response = generator(prompt, max_length=256, temperature=0.3)
    return response[0]["generated_text"]

# Exemple d'utilisation
explanation = generate_explanation(
    extrait="douleurs thoraciques intenses",
    code="R07.4",
    similar_codes=retrieve_similar_codes("douleurs thoraciques intenses")
)
print(explanation)
