<a href="https://colab.research.google.com/github/24p11/recode-scenario/blob/main/scenario_oncology_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Create fictive clinical notes from Code set (DRG + ICD)

Code set are the raw classification data, they are extracted from French National database (Base nationale PMSI). They are made of 
* classification profile made of grouping variables from DRG records which are prepared with their frequency in the national database
    - age (class)
    - sexe
    - DRG (racine GHM)
    - Main diagnosis (ICD10) : cf
    - Hospitalization management type : cf
* diagnosis associated to each classification profile, extracted with their frequencies
* procedures associated to each classification profile, specialy for surgery and technical gestures, extracted with their frequencies

From thoses raw information we produce a coded clinical scenario which will be uses a seed.

This scenario is transformed into a detail prompt that will be given to a LLM for generation.
From the combinaision of primary and related diagnosis in French discharge abstract, we derived two notions :
* Primary diagnosis : host the notion of principal pathology, it is rather the primary diagnosis of the discharge abstract or the related diagnosis when it exists and that the primary diagnosis of the discharge abstract is from the chapter "Facteurs influant sur l’état de santé" of ICD10
* The Hospitalization management type is rather the term "Primary diagnosis" or the ICD-10 code of the related diagnosis when it exists

Variables dictionary :
* drg_code
* drg_description
* drg_parent_code
* drg_parent_code_description
* icd_code
* icd_code_description
* icd_parent_code
* icd_parent_code_description
* icd_primary_code : digonstic principal
* icd_primary_code_definition : digonstic principal
* icd_secondary : related diagnosis
* cage : age classes [0-1[, [1-5[,[15-18[, [5-10[, [10-15[, [30-40[, [50-60[, [18-30[, [40-50[, [60-70[, [70-80[, [80-[
* cage2 :  age classes [0-1[ , [1-5[ , [5-10[ , [10-15[ , [15-18[, [18-50[ , [50-[
* sexe : 1/2 (F)
       

Table classification_profile
* drg_parent_code
* icd_primary_code
* icd_primary_parent_code
* case_management_type
* cage
* cage2
* sexe

Table secondary_diagnosis
* drg_parent_code
* icd_primary_parent_code
* cage2
* sexe

Use the col_names options in the load function of the project to align the columns names of your files with this dictionary.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np
import datetime as dt

In [3]:
from utils import *

In [259]:
gs = generate_scenario()
# Load official dictionaries
# col_names option allow you to algin your column names the project dictionary.
gs.load_offical_icd("cim_2024.xlsx",col_names={"code" : "icd_code","libelle":"icd_code_description"} )
gs.load_offical_procedures("ccam_actes_2024.xlsx",col_names={"code":"procedure","libelle_long":"procedure_description"} )

In [260]:
# Load data from BN  PMSI
col_names={"racine":"drg_parent_code","das": "icd_secondary_code","diag":"icd_primary_code","categ_cim":"icd_primary_parent_code","mdp":"case_management_type","nb_situations":"nb","acte":"procedure"}
gs.load_classification_profile("bn_pmsi_cases_20250819.csv", col_names)
gs.load_secondary_icd("bn_pmsi_related_diag_20250818.csv",col_names)
gs.load_procedures("bn_pmsi_procedures_20250818.csv",col_names)

In [308]:
current_profile = gs.df_classification_profile.drop(columns="nb").loc[120]

In [313]:
scenario = gs.get_clinical_scenario_template()
for k,v in current_profile.items():
    scenario[k]=v 


In [322]:
scenario["icd_primary_description"] = gs.get_icd_description(current_profile.icd_primary_code)
scenario["icd_primary_description_alternative"] = gs.get_icd_alternative_descriptions(current_profile.icd_primary_code)
scenario["case_management_type_description"] = gs.get_icd_description(current_profile.case_management_type)
scenario["text_secondary_icd"] = ""
scenario["age"] = get_age(current_profile.cage)
#For chronic diseases we choose grouping profile only on ICD
grouping_secondary =["icd_primary_code","icd_secondary_code","cage2","sexe","nb"]
scenario["metastases"] = gs.sample_from_df(profile =current_profile,df_values= gs.df_secondary_icd.query("type=='Metastasis'")[grouping_secondary])  

###Secodary diagnosis :
###We sample secondary diagnosis by steps : metastases, metastases ln, chronic,complications
###Each time build :
### - official descriptions
### - official : alternatives descripttion
scenario["text_secondary_icd_official"]=""
scenario["text_secondary_icd_alternative"]=""
if scenario["metastases"].shape[0] > 0 :
  for index, row in scenario["metastases"].iterrows():
    scenario["text_secondary_icd_official"] += "- " + row.icd_code_description_official + "("+ row.icd_secondary_code+")\n"
    scenario["text_secondary_icd_alternative"] += "- " + row.icd_code_description_official + "("+ row.icd_secondary_code+") : " + row.icd_code_description_alternative + "\n"
    
scenario["metastases_ln"] = gs.sample_from_df(profile =current_profile,df_values= gs.df_secondary_icd.query("type=='Metastasis LN'")[grouping_secondary])  
if scenario["metastases_ln"].shape[0] > 0 :
  for index, row in scenario["metastases_ln"].iterrows():
    scenario["text_secondary_icd_official"] += "- " + row.icd_code_description_official + "("+ row.icd_secondary_code+")\n"
    scenario["text_secondary_icd_alternative"] += "- " + row.icd_code_description_official + "("+ row.icd_secondary_code+") : " + row.icd_code_description_alternative + "\n"

scenario["chronic"] =  gs.sample_from_df(profile =current_profile,df_values= gs.df_secondary_icd.query("type.isin(['Chronic'])")[grouping_secondary])  
if scenario["chronic"].shape[0] > 0 :
  for index, row in scenario["chronic"].iterrows():
    scenario["text_secondary_icd_official"] += "- " + row.icd_code_description_official + "("+ row.icd_secondary_code+")\n"
    scenario["text_secondary_icd_alternative"] += "- " + row.icd_code_description_official + "("+ row.icd_secondary_code+") : " + row.icd_code_description_alternative + "\n"

#For complication drg_parent_code we choose grouping profile only on ICD
grouping_secondary =["drg_parent_code","icd_secondary_code","cage2","sexe","nb"]
scenario["complications"] = gs.sample_from_df(profile =current_profile,df_values= gs.df_secondary_icd.query("type.isin(['Acute'])")[grouping_secondary]) 
if scenario["complications"].shape[0] > 0 :
  for index, row in scenario["complications"].iterrows():
    scenario["text_secondary_icd_official"] += "- " + row.icd_code_description_official + "("+ row.icd_secondary_code+")\n"
    scenario["text_secondary_icd_alternative"] += "- " + row.icd_code_description_official + "("+ row.icd_secondary_code+") : " + row.icd_code_description_alternative + "\n"

scenario["date_entry"],scenario["date_discharge"] = get_dates_of_stay(current_profile.mode_hospit,current_profile.mode_entree,current_profile.los_mean,current_profile.los_sd)
scenario["date_of_birth"] = random_date_between( scenario["date_entry"] - datetime.timedelta(days = 365*(scenario["age"]+1)) , scenario["date_entry"] - datetime.timedelta(days = 365*(scenario["age"])))
current_profile["case_management_type_description"] = scenario["case_management_type_description"] 
scenario["case_management_type_text"] , scenario["cd_md_pec"] = gs.define_cancer_md_pec(current_profile)
scenario["first_name"] , scenario["laste_name"] = gs.get_names(current_profile.sexe)
scenario["first_name_med"] , scenario["laste_name_med"] = gs.get_names(random.randint(1, 2))


18


In [332]:
SCENARIO = "**SCÉNARIO DE DÉPART :**\n"

for k,v in scenario.items():
  if k == "age" and v is not None:
    SCENARIO +="- Âge du patient : " + str(v) + "ans\n"
  if k == "sexe" and v is not None:  
    SCENARIO +="- Sexe du patient : " + str(v) + "\n"
  if k == "date_entry" and v is not None:  
    SCENARIO +="- Date d'entrée : "+ v.strftime("%d/%m/%Y") + "\n"
  if k == "date_discharge" and v is not None:  
    SCENARIO +="- Date de sortie : "+ v.strftime("%d/%m/%Y") + "\n" 
  if k == "date_of_birth" and v is not None:  
    SCENARIO +="- Date de naissance : "+ v.strftime("%d/%m/%Y") + "\n"  
  if k == "last_name" and v is not None:  
    SCENARIO +="- Nom du patient : "+ v + "\n" 
  if k == "first_name" and v is not None:  
    SCENARIO +="- Prénom du patient : "+ v + "\n" 
  if(scenario["icd_primary_code"] in gs.icd_codes_cancer) : 
    if k == "icd_primary_code_definition" and v is not None:  
      SCENARIO +="- Localisation anatomique de la tumeur primaire : ("+ v + scenario["icd_primary_code"] + ")\n" 
    if k == "cancer_histology":
      if v is not None:  
          SCENARIO +="- Type anatomopathologique de la tumeur primaire : ("+ v + scenario["icd_primary_code"] + ")\n" 
      else :
          SCENARIO +="- Type anatomopathologique de la tumeur primaire : Vous choisirez vous même un type histologique cohérent avec la localisation anatomique\n" 
    if k == "scrore_TNM":
      if v is not None:  
          SCENARIO +="- Score TNM :"+ v + "\n" 
      else :
          SCENARIO +="- Score TNM : Si la notion de score de TNM est pertinente avec le type histologique et la localisation anatomique, vous choisirez un score TNM\n" 
    if k == "cancer_stage":
       if v is not None:  
          SCENARIO +="- Stade tumoral : " + v + "\n" 
    if k == "biomarqueurs":
       if v is not None:  
          SCENARIO +="- Biomarqueurs tumoraux : "+ v + "\n" 
       else :
          SCENARIO +="- Biomarqueurs tumoraux : Vous choisirez des biomarqueur tumoraux cohérent avec la localisation et l'histologie de la tumeur\n" 
  
  if k == "mode_entree" and v is not None:  
    SCENARIO +="- Mode d'entrée' : "+ v + "\n" 
  if k == "mode_sortie" and v is not None:  
    SCENARIO +="- Mode de sortie' : "+ v + "\n" 

  if k == "case_management_type" and v is not None:  
    SCENARIO +="- Mode de prise en charge : "+ scenario["case_management_type_text"] + "\n" 
    SCENARIO +="- codage CIM10 :\n" 
    SCENARIO +="   * Diagnostic principal : "+  scenario["icd_primary_description"] + "("+ scenario["icd_primary_code"] + ")\n"  
    if scenario["case_management_type"]!="DP":
      SCENARIO +="   * Diagnostic relié : "+  scenario["case_management_type_description"] + "("+ scenario["case_management_type"] + ")\n"  
    else:
      SCENARIO +="   * Diagnostic relié :  Aucun\n"  
    
    SCENARIO +="   * Diagnostic associés : \n"
    SCENARIO +=  scenario["text_secondary_icd_official"]  + "\n"  

print(SCENARIO)


**SCÉNARIO DE DÉPART :**
- Âge du patient : 81ans
- Sexe du patient : 1
- Date d'entrée : 16/10/2024
- Date de sortie : 26/10/2024
- Date de naissance : 01/05/1943
- Prénom du patient : Franco
- Mode de prise en charge : Hospitalisation pour prise en charge diagnostique et thérapeutique du diagnotic principal en hospitalisation complète
- codage CIM10 :
   * Diagnostic principal : Hypotension orthostatique(I951)
   * Diagnostic relié :  Aucun
   * Diagnostic associés : 
- Tumeur maligne secondaire des os et de la moelle osseuse(C795)
- Tumeur maligne secondaire du poumon(C780)
- Tumeur maligne secondaire du poumon(C780)
- Tumeur maligne secondaire des os et de la moelle osseuse(C795)
- Tumeur maligne secondaire et non précisée des ganglions lymphatiques de la tête, de la face et du cou(C770)
- Sclérose systémique, sans précision(M349)
- Cardiopathie hypertensive, (sans insuffisance cardiaque congestive)(I119)
- Besoin d'assistance à domicile, aucun autre membre du foyer n'étant capable

In [333]:

ICD_ALTERNATIVES =""

if scenario["icd_primary_code"] not in gs.icd_codes_cancer :
      ICD_ALTERNATIVES +=" - " + scenario["icd_primary_description"] + "("+ scenario["icd_primary_code"] + ") : " 
      ICD_ALTERNATIVES +=": "+ scenario["icd_primary_description_alternative"] + "\n"
ICD_ALTERNATIVES +=  scenario["text_secondary_icd_alternative"]  + "\n" 

print(ICD_ALTERNATIVES)


 - Hypotension orthostatique(I951) : : hypotension orthostatique
- Tumeur maligne secondaire des os et de la moelle osseuse(C795) : metastases medullaires vertebrales, metastase osseuse maxillaire compressive, metastase clavicule, metastase osseuse costale, metastases l l
- Tumeur maligne secondaire du poumon(C780) : metastase hilaire, metastases pulmonaires surinfectees, metastase lobe superieur droit, metastase champ pulmoniare droit, metastases endobronchiques
- Tumeur maligne secondaire du poumon(C780) : metastase lobe superieur droit, metastase champ pulmoniare droit, metastase pulmonaire apicale gauche, metastase pulmonaire contro laterale, metastase pulmonaire lobe superieur
- Tumeur maligne secondaire des os et de la moelle osseuse(C795) : metastase clavicule gauche, metastase humerale droite, metastase canal rachidien, metastase os malaire gauche, metastase vertebrale d d
- Tumeur maligne secondaire et non précisée des ganglions lymphatiques de la tête, de la face et du cou(C7

In [335]:
def prepare_prompt(prompt_path, case):
  with open(prompt_path, "r", encoding="utf-8") as f:
      content = f.read()
  return (content
          .replace("[[SCENARIO here]", case["SCENARIO"])
          .replace("[ICD_ALTERNATIVES here]", case["ICD_ALTERNATIVES"])

          )

prepare_prompt("templates/scenario_onco_v1.txt",case = {"SCENARIO":SCENARIO,"ICD_ALTERNATIVES":ICD_ALTERNATIVES})

'Vous êtes un oncologue clinicien expert. Votre tâche est de générer un compte rendu d\'hospitalisation en style clinique synthétique à partir d\'un scénario comprenant le résumé PMSI (codes de la classification internationale des maladies), ainsi que d\'autres informations décrivant l\'hospitalisation.\n\n\n[SCENARIO here]\n\n\n\n\n**INSTRUCTIONS :**\n\n- Vous devez utiliser exclusivement les diagnostics fournis. Aucun autre diagnostic ne doit être ajouté ou inventé.\t\t\n\n- Lors vous mentionnerez dans le texte les diagnostics, vous utiliserez autant que faire ce peut une formulation moins formelle que la définition du code. Pour vous aider nous vous proposons ci-dessous quelques exemples de formulation alternative qui peut être retrouvée dans les comptes rendus d\'hospitalisation.\n - Hypotension orthostatique(I951) : : hypotension orthostatique\n- Tumeur maligne secondaire des os et de la moelle osseuse(C795) : metastases medullaires vertebrales, metastase osseuse maxillaire compre