<a href="https://colab.research.google.com/github/24p11/recode-scenario/blob/main/scenario_oncology_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Create fictitious clinical notes from code sets (DRG + ICD)

Code sets are the raw classification data that can be extracted from a national database (in France, the Base nationale PMSI). They consist of:
* a classification profile, built from grouping variables of the DRG records and provided with its frequency in the national database:
    - age (class)
    - sex
    - DRG (racine GHM)
    - main diagnosis (ICD-10): see the definition below
    - hospitalization management type: see the definition below
* the diagnoses associated with each classification profile, extracted with their frequencies
* the procedures associated with each classification profile, especially surgery and technical procedures, extracted with their frequencies

From this raw information we produce a coded clinical scenario, which is used as a seed.

This scenario is then turned into a detailed prompt that is given to an LLM for generation.
From the combination of the primary and related diagnoses in the French discharge abstract, we derive two notions:
* Primary diagnosis: carries the notion of principal pathology. It is either the primary diagnosis of the discharge abstract or, when a related diagnosis exists and the primary diagnosis of the discharge abstract belongs to the ICD-10 chapter "Facteurs influant sur l’état de santé" (factors influencing health status), that related diagnosis.
* Hospitalization management type: either the term "Primary diagnosis" (`DP` in the data) or, when a related diagnosis exists, the ICD-10 code recorded as primary diagnosis of the discharge abstract (for example `Z511`).
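
The two notions can be sketched as a small function. This is only an illustration of the rule as read from the definitions and from the sample rows, where `case_management_type` is either `DP` or a `Z` code. The function name is hypothetical, and membership in the chapter "Facteurs influant sur l’état de santé" is approximated by a code starting with `Z`.

```python
from typing import Optional, Tuple


def derive_notions(abstract_primary: str,
                   abstract_related: Optional[str] = None) -> Tuple[str, str]:
    """Return (primary diagnosis, hospitalization management type).

    Assumption: ICD-10 chapter XXI codes are detected by their leading "Z".
    """
    if abstract_related and abstract_primary.startswith("Z"):
        # The related diagnosis carries the pathology; the Z code
        # of the abstract describes how the stay was managed.
        return abstract_related, abstract_primary
    # Otherwise the primary diagnosis of the abstract is kept as is.
    return abstract_primary, "DP"


# Chemotherapy session (Z511) for a pancreatic tumour (C250)
print(derive_notions("Z511", "C250"))
# Stay whose abstract has no related diagnosis
print(derive_notions("D067"))
```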


In [4]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
import pandas as pd
import numpy as np
import datetime as dt

In [6]:
from utils import *

In [9]:
gs = generate_scenario()
# Load official dictionaries
# The col_names option lets you align your column names with the project dictionary.
gs.load_offical_icd("cim_2024.xlsx",col_names={"code" : "icd_code","libelle":"icd_code_description"} )
gs.load_offical_procedures("ccam_actes_2024.xlsx",col_names={"code":"procedure","libelle_long":"procedure_description"} )
col_names={"Code CIM":"icd_parent_code","Localisation":"primary_site","Type Histologique":"histological_type",
           "Stade":"stage","Marqueurs Tumoraux":"biomarkers","Traitement":"treatment_recommandation","Protocole de Chimiothérapie":"chemotherapy_regimen"}
gs.load_cancer_treatement_recommandations("Tableau récapitulatif traitement cancer.xlsx",col_names ) 

In [10]:
# Load data from the national PMSI database (BN PMSI)
col_names={"racine":"drg_parent_code","das": "icd_secondary_code","diag":"icd_primary_code","categ_cim":"icd_primary_parent_code",
            "mdp":"case_management_type","nb_situations":"nb","acte":"procedure",
            "mode_entree":"admission_mode",
            "mode_sortie":"discharge_disposition",
            "mode_hospit":"admission_type"}
gs.load_classification_profile("bn_pmsi_cases_20250819.csv", col_names)
gs.load_secondary_icd("bn_pmsi_related_diag_20250818.csv",col_names)
gs.load_procedures("bn_pmsi_procedures_20250818.csv",col_names)
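
The `col_names` dictionaries map source column names to the project's names. How the loaders apply them is not shown in this notebook; the sketch below assumes a plain `DataFrame.rename`, which silently ignores keys absent from the file, so one mapping can be shared by several extracts. The `raw` table is made up for the illustration.

```python
import pandas as pd

# Hypothetical extract that still uses the source column names
raw = pd.DataFrame({
    "racine": ["28Z07"],
    "diag": ["C250"],
    "mdp": ["Z511"],
    "nb_situations": [120],
})

# Subset of the mapping used above; "acte" is absent from `raw`
col_names = {
    "racine": "drg_parent_code",
    "diag": "icd_primary_code",
    "mdp": "case_management_type",
    "nb_situations": "nb",
    "acte": "procedure",
}

# rename() only touches the columns it finds in the frame
renamed = raw.rename(columns=col_names)
print(list(renamed.columns))
```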

In [6]:
cols_scenario = ["first_name", "last_name", "cage2", "cage", "sexe",
                 "last_name_med", "icd_primary_code",
                 "admission_type", "admission_mode", "discharge_disposition",
                 "drg_parent_code", "icd_secondaray_code", "cd_md_pec"]

cols_cancer = ["cancer_stage", "score_TNM", "histological_type", "treatment_recommandation", "chemotherapy_regimen"]

In [11]:
#Prepare cases
# df_profile = gs.df_classification_profile.drop(columns="nb")
df_profile = gs.df_classification_profile
df_profile = df_profile[
    df_profile.icd_primary_code.isin(gs.icd_codes_cancer)
    & ~df_profile.drg_parent_code.isin(gs.drg_parent_code_radio)
]
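
The filter above combines two `isin` masks. A minimal stand-alone sketch of the same pattern follows; the code lists and rows are placeholders, not the contents of `gs.icd_codes_cancer` or `gs.drg_parent_code_radio`.

```python
import pandas as pd

# Placeholder profiles and code lists, for illustration only
profiles = pd.DataFrame({
    "icd_primary_code": ["C250", "I10", "C67"],
    "drg_parent_code": ["28Z07", "05M09", "XXRAD"],
})
cancer_codes = ["C250", "C67"]
excluded_drg_roots = ["XXRAD"]

# Boolean masks: True where the row matches the list
is_cancer = profiles["icd_primary_code"].isin(cancer_codes)
is_excluded = profiles["drg_parent_code"].isin(excluded_drg_roots)

# Keep cancer rows whose DRG root is not in the excluded list
selected = profiles[is_cancer & ~is_excluded]
print(selected)
```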

In [17]:
def create_system_prompt(scenario):
    """Return the system prompt template matching the scenario's type of stay."""
    # The third character of the DRG root is "C" for surgical stays;
    # slicing avoids an IndexError on codes shorter than three characters.
    is_surgical = scenario["drg_parent_code"][2:3] == "C"

    if scenario["admission_type"] == "Inpatient" and is_surgical:
        template_name = "surgery_complete.txt"
    elif scenario["admission_type"] == "Outpatient" and is_surgical:
        template_name = "surgery_outpatient.txt"
    else:
        template_name = "scenario_onco_v1.txt"

    with open("templates/" + template_name, "r", encoding="utf-8") as f:
        prompt = f.read()

    return prompt
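
To check the template choice without touching the `templates/` folder, the selection rule can be isolated in a pure function. `pick_template` is a hypothetical helper that mirrors the conditions of `create_system_prompt`; the three calls use DRG roots and admission types taken from the sample output of this notebook.

```python
def pick_template(admission_type: str, drg_parent_code: str) -> str:
    """Return the template file name for a stay (no file access)."""
    # Third character "C" marks a surgical DRG root
    is_surgical = drg_parent_code[2:3] == "C"
    if is_surgical and admission_type == "Inpatient":
        return "surgery_complete.txt"
    if is_surgical and admission_type == "Outpatient":
        return "surgery_outpatient.txt"
    # Every other stay falls back to the generic oncology template
    return "scenario_onco_v1.txt"


print(pick_template("Inpatient", "11C13"))
print(pick_template("Outpatient", "13C11"))
print(pick_template("Outpatient", "28Z07"))
```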

In [12]:
# col_names = ["icd_primary_code", "case_management_type", "drg_parent_code", "cage2","cage", "sexe", "admission_type","admission_mode", "discharge_disposition",
#              "dms", "los_mean", "los_sd", "drg_parent_description"]
# test_generation_v1 = df_profile.sample(20)[col_names].reset_index(drop=True)
test_generation_v1 = df_profile.sample(1000, weights="nb").reset_index(drop=True)
test_generation_v1

Unnamed: 0,icd_primary_code,case_management_type,drg_parent_code,age,cage,cage2,sexe,admission_type,admission_mode,discharge_disposition,...,da,libelle_da,gp_cas,libelle_gp_cas,ga,libelle_ga,da_gp,da_gp_ga,anseqta,aso
0,D067,DP,13C11,ge_18,[40-50[,[18-50[,2,Outpatient,DOMICILE,DOMICILE,...,D12,Gynécologie - sein,C17,Chirurgie Gynécologique,G103,Chirurgie pour tumeurs malignes (app génital fem),D12C17,D12C17G103,2024,C
1,C67,DP,11C13,ge_18,[70-80[,[50-[,1,Inpatient,DOMICILE,DOMICILE,...,D15,Uro-néphrologie et génital,C19,Chirurgie Urologique,G127,"Chirurgies transurétrales, autres",D15C19,D15C19G127,2024,C
2,C250,Z511,28Z07,ge_18,[50-60[,[50-[,2,Outpatient,DOMICILE,DOMICILE,...,D27,Séances,S02,Chimiothérapie pour tumeur,G190,Séances : chimiothérapie,D27S02,D27S02G190,2024,M
3,C67,Z511,17M06,ge_18,[60-70[,[50-[,1,Inpatient,DOMICILE,DOMICILE,...,D17,"Chimiothérapie, radiothérapie, hors séances",X23,Chimiothérapie (hors séances),G148,Chimiothérapie hors séances,D17X23,D17X23G148,2024,M
4,C920,DP,17M09,ge_18,[60-70[,[50-[,1,Inpatient,DOMICILE,DOMICILE,...,D16,Hématologie,X14,"Maladies immunitaires, du Sang, des Organes hé...",G144,Affections hématologiques malignes,D16X14,D16X14G144,2024,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,C182,DP,06M05,ge_18,[80-[,[50-[,2,Inpatient,URGENCES,DOMICILE,...,D01,Digestif,X02,Hépato-Gastro-Entérologie,G011,Prise en charge médicale des tumeurs malignes ...,D01X02,D01X02G011,2024,M
996,C64,Z515,23Z02,ge_18,[40-50[,[18-50[,1,Inpatient,DOMICILE,DECES,...,D24,"Douleurs chroniques, Soins palliatifs",X22,Douleur et soins palliatifs,G176,Soins palliatifs,D24X22,D24X22G176,2024,M
997,C880,Z511,17M06,ge_18,[60-70[,[50-[,1,Inpatient,DOMICILE,DOMICILE,...,D17,"Chimiothérapie, radiothérapie, hors séances",X23,Chimiothérapie (hors séances),G148,Chimiothérapie hors séances,D17X23,D17X23G148,2024,M
998,C19,Z511,28Z07,ge_18,[40-50[,[18-50[,2,Outpatient,DOMICILE,DOMICILE,...,D27,Séances,S02,Chimiothérapie pour tumeur,G190,Séances : chimiothérapie,D27S02,D27S02G190,2024,M
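
`sample(..., weights="nb")` draws rows with a probability proportional to the `nb` column, so frequent profiles are selected more often. The toy table below is invented to show the effect; unlike the cell above it samples with replacement and a fixed seed, so that a three-row table can yield 50 draws reproducibly.

```python
import pandas as pd

# Invented frequencies: the third profile has weight 0
profiles = pd.DataFrame({
    "drg_parent_code": ["28Z07", "17M06", "13C11"],
    "nb": [900, 100, 0],
})

sampled = profiles.sample(
    n=50,
    weights="nb",      # column used as sampling weight
    replace=True,      # needed here because n exceeds the number of rows
    random_state=0,    # fixed seed for a reproducible draw
).reset_index(drop=True)

# A row with weight 0 can never be drawn
print(sampled["drg_parent_code"].value_counts())
```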


In [15]:
from tqdm import tqdm

In [77]:
list_scenario = []

# for i in tqdm(range(len(test_generation_v1))):  # full run over every sampled profile
for i in tqdm(range(1)):  # single profile while testing
    profile = test_generation_v1.iloc[i].copy()
    # "nb" is the frequency count used as sampling weight; it is dropped before building the scenario
    scenario = gs.generate_scenario_from_profile(profile.drop("nb"))
    # Copy every scenario field, then attach the two prompts
    row = dict(scenario)
    row["user_prompt"] = gs.make_prompts_marks_from_scenario(scenario)
    row["system_prompt"] = create_system_prompt(scenario)

    # Cancer cases get an extra instruction about histology and biomarkers
    if scenario["icd_primary_code"] in gs.icd_codes_cancer:
        prefix = """Lorsque vous mentionnerez dans le texte les diagnostics, vous utiliserez une formulation moins formelle que la définition du code. Veillez à bien préciser le type histologique et la valeur des biomarqueurs si recherchés. Respecter le plan.
                """
    else:
        prefix = """Lorsque vous mentionnerez dans le texte les diagnostics, vous utiliserez une formulation moins formelle que la définition du code. Respecter le plan."""  
    row["prefix"] = prefix
    list_scenario.append(row)

100%|██████████| 1/1 [00:01<00:00,  1.84s/it]


In [78]:
keep_cols = ['age',  'cage', 'cage2','sexe', 'date_entry', 'date_discharge', 'date_of_birth',
       'first_name', 'last_name', 'icd_primary_code', 'icd_primary_description', 'icd_parent_code',
       'case_management_type','case_management_type_description', 'case_management_type_text', 
       'drg_parent_code', 'drg_parent_description',
       'icd_secondaray_code',  'text_secondary_icd_official', 
        'procedure', 'text_procedure',
        'admission_type','admission_mode', 'discharge_disposition', 'dms', 'los_mean', 'los_sd',
       'cancer_stage', 'score_TNM', 'histological_type',
       'treatment_recommandation', 'chemotherapy_regimen', 'biomarkers',
       'first_name_med', 'last_name_med',
       'cd_md_pec', 'user_prompt', 'system_prompt']
df_scenario = pd.DataFrame(list_scenario)[keep_cols]
df_scenario

Unnamed: 0,age,cage,cage2,sexe,date_entry,date_discharge,date_of_birth,first_name,last_name,icd_primary_code,...,score_TNM,histological_type,treatment_recommandation,chemotherapy_regimen,biomarkers,first_name_med,last_name_med,cd_md_pec,user_prompt,system_prompt
0,48,[40-50[,[18-50[,2,2024-04-30,2024-04-30,1975-06-14,Simone,Rouzil,D067,...,,,,,,Jacqueline,Biehlmann,11,**SCÉNARIO DE DÉPART :**\n- Âge du patient : 4...,Vous êtes chirurgien praticien. Votre tâche es...
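
Selecting with `[keep_cols]` raises a `KeyError` as soon as one listed key is missing from every scenario. If that strictness is not wanted, `reindex(columns=...)` is one possible alternative: it keeps the requested order and fills absent columns with `NaN`. The rows below are invented and much shorter than a real scenario.

```python
import pandas as pd

# Two invented scenario rows with different sets of keys
rows = [
    {"age": 48, "icd_primary_code": "D067", "user_prompt": "..."},
    {"age": 71, "icd_primary_code": "C67"},
]
keep = ["age", "icd_primary_code", "biomarkers", "user_prompt"]

# reindex() creates "biomarkers" as an all-NaN column instead of failing
table = pd.DataFrame(rows).reindex(columns=keep)
print(table)
```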


In [79]:
gs.df_procedure_official.procedure[0:2]

0    AAFA001
1    AAFA002
Name: procedure, dtype: object

In [80]:
print(df_scenario.loc[0,"user_prompt"])

**SCÉNARIO DE DÉPART :**
- Âge du patient : 48 ans
- Sexe du patient : Féminin
- Date d'entrée : 30/04/2024
- Date de sortie : 30/04/2024
- Date de naissance : 14/06/1975
- Prénom du patient : Simone
- Nom du patient : Rouzil
- Mode de prise en charge : Prise en charge en chirugie ambulatoire pour - destruction de la muqueuse utérine par thermocontact, par voie vaginale (jknd001)

- Codage CIM10 :
   * Diagnostic principal : Carcinome in situ d'autres parties du col de l'utérus (D067)
   * Diagnostic associés : 
- Sclérose en plaques (G35)
- Syndrome de dépendance à l'alcool, utilisation actuelle, sans symptôme physique (F10240)
- Insuffisance (de la valvule) mitrale (non rhumatismale) (I340)
- Présence de prothèse d'une valvule cardiaque (Z952)
- Tumeur maligne secondaire et non précisée des ganglions lymphatiques intrapelviens (C775)
- Dysplasie moyenne du col de l'utérus (N871)

* Acte CCAM :
- destruction de la muqueuse utérine par thermocontact, par voie vaginale (jknd001)

- Loca

In [23]:
df_scenario.to_csv("test_generation_v2.csv")
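
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference on an in-memory buffer; whether the index should be kept in `test_generation_v2.csv` is a choice left to the notebook's author.

```python
import io

import pandas as pd

df = pd.DataFrame({"icd_primary_code": ["D067"], "age": [48]})

# Default behaviour: the index is written as a first, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```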

In [None]:
# df_scenario =[]
# for i in range(0,5):

#     current_profile = df_profile.iloc[i,:]

#     scenario = gs.generate_scenario_from_profile(current_profile)
#     row = {k:scenario[k] for k in scenario if k in cols_scenario }
#     cancer = [scenario[k] for k in scenario if k in cols_cancer ]

#     row.update({"cancer":cancer})
    
#     case  = gs.make_prompts_marks_from_scenario(scenario)
    
#     row.update({'case': case})


#     if row['admission_type'] == "Inpatient" and row['drg_parent_code'][2:3]=="C" :
#         template_name = "surgery_complete.txt"
#     elif row['admission_type'] == "Outpatient" and row['drg_parent_code'][2:3]=="C" :
#         template_name = "surgery_outpatient.txt"
#     elif row['drg_parent_code'][2:3]=="K" :
#         template_name = "interventionnel.txt"
#     elif row['cd_md_pec']==17 :
#         template_name = "bilan.txt"
#     else:
#         template_name = "scenario_onco_v1.txt"
        
#     prompt =  prepare_prompt("templates/" + template_name ,case =case)
#     row.update({'prompt': prompt})

#     df_scenario.append(row)

In [226]:
# df_scenario= pd.DataFrame(df_scenario)

In [227]:
# df_scenario.to_csv(gs.path_data + "test_scenario_v1.csv")

In [None]:
# df_scenario[]