<a href="https://colab.research.google.com/github/24p11/recode-scenario/blob/main/scenario_oncology_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Create fictitious clinical notes from a code set (DRG + ICD)

A code set is the raw classification data that can be extracted from a national database (in France, the Base nationale PMSI). It is made of:
* a classification profile, built from the grouping variables of DRG records and prepared with its frequency in the national database
    - age (class)
    - sex
    - DRG (racine GHM)
    - main diagnosis (ICD-10): defined below
    - hospitalization management type: defined below
* the diagnoses associated with each classification profile, extracted with their frequencies
* the procedures associated with each classification profile, especially surgery and technical procedures, extracted with their frequencies

From this raw information we produce a coded clinical scenario, which is used as a seed.
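As a rough illustration of the seeding step, the sketch below draws profiles in proportion to their frequency, mirroring the `sample(..., weights="nb")` call used later in this notebook. The profile values and the `draw_profiles` helper are invented for illustration (only the field names come from the notebook), and unlike `DataFrame.sample` with its defaults, this version draws with replacement.

```python
import random

# Made-up toy profiles: the values are NOT real PMSI figures.
# "nb" is the frequency of the profile in the source database.
toy_profiles = [
    {"cage": "[50-60[", "sexe": 2, "drg_parent_code": "28Z07",
     "icd_primary_code": "C821", "case_management_type": "Z511", "nb": 900},
    {"cage": "[70-80[", "sexe": 1, "drg_parent_code": "06C04",
     "icd_primary_code": "C180", "case_management_type": "DP", "nb": 90},
    {"cage": "[40-50[", "sexe": 2, "drg_parent_code": "09C19",
     "icd_primary_code": "D059", "case_management_type": "DP", "nb": 10},
]


def draw_profiles(profiles, k, seed=None):
    """Draw k profiles with replacement, weighted by their "nb" frequency."""
    rng = random.Random(seed)  # local generator, so a seed gives repeatable draws
    weights = [p["nb"] for p in profiles]
    return rng.choices(profiles, weights=weights, k=k)


for seed_profile in draw_profiles(toy_profiles, k=3, seed=42):
    print(seed_profile["icd_primary_code"], seed_profile["drg_parent_code"])
```

Frequent profiles dominate the draw, so the generated scenarios roughly follow the case mix of the source database instead of giving rare profiles the same weight as common ones.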

This scenario is transformed into a detailed prompt that is given to an LLM for generation.

From the combination of primary and related diagnoses in the French discharge abstract, we derive two notions:
* Primary diagnosis: carries the notion of principal pathology. It is either the primary diagnosis of the discharge abstract, or the related diagnosis when one exists and the primary diagnosis of the discharge abstract belongs to the ICD-10 chapter "Facteurs influant sur l’état de santé" (factors influencing health status, Z codes)
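* Hospitalization management type: either the label "DP" (primary diagnosis), when the abstract's primary diagnosis is itself the principal pathology, or the ICD-10 code of the abstract's primary diagnosis (a Z code such as Z511) when a related diagnosis exists and takes its place as principal pathology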
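
The rule above can be sketched as a small function. This is a hedged reading rather than the project's implementation: the helper name `derive_primary_and_management` is hypothetical, membership of the chapter "Facteurs influant sur l’état de santé" is approximated by a leading "Z", and the management type is assumed to carry the abstract's own Z code, as pairs such as C821 / Z511 in the sample output of this notebook suggest.

```python
from typing import Optional, Tuple


def derive_primary_and_management(dp: str, dr: Optional[str]) -> Tuple[str, str]:
    """Return (icd_primary_code, case_management_type) for one discharge abstract.

    dp: primary diagnosis of the discharge abstract (ICD-10 code)
    dr: related diagnosis of the discharge abstract, or None when absent
    """
    # Assumption: ICD-10 chapter XXI codes are exactly those starting with "Z".
    dp_is_health_factor = dp.upper().startswith("Z")
    if dr and dp_is_health_factor:
        # The related diagnosis holds the pathology; the Z code says how it is managed.
        return dr, dp
    # Otherwise the abstract's primary diagnosis is the pathology itself.
    return dp, "DP"


print(derive_primary_and_management("Z511", "C821"))  # ('C821', 'Z511')
print(derive_primary_and_management("C50", None))     # ('C50', 'DP')
```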


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np
import datetime as dt

In [3]:
from utils import *

In [4]:
gs = generate_scenario()
# Load official dictionaries
# The col_names option lets you align your column names with the project dictionary.
gs.load_offical_icd("cim_2024.xlsx",col_names={"code" : "icd_code","libelle":"icd_code_description"} )
gs.load_offical_procedures("ccam_actes_2024.xlsx",col_names={"code":"procedure","libelle_long":"procedure_description"} )
col_names={"Code CIM":"icd_parent_code","Localisation":"primary_site","Type Histologique":"histological_type",
           "Stade":"stage","Marqueurs Tumoraux":"biomarkers","Traitement":"treatment_recommandation","Protocole de Chimiothérapie":"chemotherapy_regimen"}
gs.load_cancer_treatement_recommandations("Tableau récapitulatif traitement cancer.xlsx",col_names ) 

In [5]:
# Load data from the BN PMSI (national PMSI database)
col_names={"racine":"drg_parent_code","das": "icd_secondary_code","diag":"icd_primary_code","categ_cim":"icd_primary_parent_code",
            "mdp":"case_management_type","nb_situations":"nb","acte":"procedure",
            "mode_entree":"admission_mode",
            "mode_sortie":"discharge_disposition",
            "mode_hospit":"admission_type"}
gs.load_classification_profile("bn_pmsi_cases_20250819.csv", col_names)
gs.load_secondary_icd("bn_pmsi_related_diag_20250818.csv",col_names)
gs.load_procedures("bn_pmsi_procedures_20250818.csv",col_names)

In [6]:
cols_scenario = ["first_name","last_name","cage2","cage","sexe",
                "last_name_med","icd_primary_code",
                "admission_type","admission_mode","discharge_disposition",
                'drg_parent_code','icd_primary_code','icd_secondaray_code','cd_md_pec']
                
cols_cancer = ["cancer_stage","TNM_score","histological_type","treatment_recommandation","chemotherapy_regimen"]

In [7]:
#Prepare cases
# df_profile = gs.df_classification_profile.drop(columns="nb")
df_profile = gs.df_classification_profile
df_profile = df_profile[df_profile.icd_primary_code.isin(gs.icd_codes_cancer)]

In [8]:
# test_profile = df_profile[(df_profile["icd_primary_code"]=="C50") & (df_profile["case_management_type"]=="DP")].iloc[0].copy()
# test_profile

In [9]:
# test_scenario = gs.generate_scenario_from_profile(test_profile)
# test_scenario

In [10]:
# def create_prompt(scenario):
#     if scenario['admission_type'] == "Inpatient" and scenario['drg_parent_code'][2:3]=="C" :
#         template_name = "surgery_complete.txt"
#     elif scenario['admission_type'] == "Outpatient" and scenario['drg_parent_code'][2:3]=="C" :
#         template_name = "surgery_outpatient.txt"
#     else:
#         template_name = "scenario_onco_v1.txt"
    
#     case  = gs.make_prompts_marks_from_scenario(scenario)
#     prompt = prepare_prompt("templates/" + template_name, case=case)
#     return prompt

In [11]:
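def create_system_prompt(scenario):
    """Return the text of the system prompt template that matches the type of stay."""
    # The third character of the DRG root (racine GHM) is "C" for surgical stays.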
    if scenario['admission_type'] == "Inpatient" and scenario['drg_parent_code'][2:3]=="C" :
        template_name = "surgery_complete.txt"
    elif scenario['admission_type'] == "Outpatient" and scenario['drg_parent_code'][2:3]=="C" :
        template_name = "surgery_outpatient.txt"
    else:
        template_name = "scenario_onco_v1.txt"
        
    with open("templates/" + template_name, "r", encoding="utf-8") as f:
        prompt = f.read()
    
    return prompt

In [12]:
def create_user_prompt(scenario):
    """Build the user prompt: the scenario text, plus cancer-specific instructions if any."""
    case = gs.make_prompts_marks_from_scenario(scenario)
    prompt = case["SCENARIO"]
    if len(case["INSTRUCTIONS_CANCER"]) > 0:
        prompt += case["INSTRUCTIONS_CANCER"]
    return prompt

In [13]:
# test_prompt = create_prompt(test_scenario)
# print (test_prompt)

In [14]:
# col_names = ["icd_primary_code", "case_management_type", "drg_parent_code", "cage2","cage", "sexe", "admission_type","admission_mode", "discharge_disposition",
#              "dms", "los_mean", "los_sd", "drg_parent_description"]
# test_generation_v1 = df_profile.sample(20)[col_names].reset_index(drop=True)
test_generation_v1 = df_profile.sample(1000, weights="nb").reset_index(drop=True)
test_generation_v1

Unnamed: 0,icd_primary_code,case_management_type,drg_parent_code,age,cage,cage2,sexe,admission_type,admission_mode,discharge_disposition,...,da,libelle_da,gp_cas,libelle_gp_cas,ga,libelle_ga,da_gp,da_gp_ga,anseqta,aso
0,C50,Z04880,23M20,ge_18,[80-[,[50-[,2,Outpatient,DOMICILE,DOMICILE,...,D26,"Activités inter spécialités, suivi thérapeutiq...",X24,"Médecine inter spécialités, Autres symptômes o...",G194,Signes et symptômes,D26X24,D26X24G194,2024,M
1,D059,DP,09C19,ge_18,[40-50[,[18-50[,2,Inpatient,DOMICILE,DOMICILE,...,D12,Gynécologie - sein,C18,Chirurgie du sein,G107,Chirurgie pour tumeurs malignes sein,D12C18,D12C18G107,2024,C
2,C67,DP,11C02,ge_18,[70-80[,[50-[,2,Inpatient,DOMICILE,DOMICILE,...,D15,Uro-néphrologie et génital,C19,Chirurgie Urologique,G126,"Chirurgies reins, uretères, vessie, glandes su...",D15C19,D15C19G126,2024,C
3,C821,Z511,28Z07,ge_18,[50-60[,[50-[,2,Outpatient,DOMICILE,DOMICILE,...,D27,Séances,S02,Chimiothérapie pour tumeur,G190,Séances : chimiothérapie,D27S02,D27S02G190,2024,M
4,C61,Z087,12M08,ge_18,[80-[,[50-[,1,Outpatient,DOMICILE,DOMICILE,...,D15,Uro-néphrologie et génital,X13,Appareil génital masculin,G139,Explorations et surveillance des affections de...,D15X13,D15X13G139,2024,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,C180,DP,06C04,ge_18,[70-80[,[50-[,2,Inpatient,DOMICILE,DOMICILE,...,D01,Digestif,C06,"Chir. Digestive majeure : oesophage, estomac, ...",G002,Chirurgie digestive majeure,D01C06,D01C06G002,2024,C
996,C349,Z511,28Z07,ge_18,[80-[,[50-[,2,Outpatient,DOMICILE,DOMICILE,...,D27,Séances,S02,Chimiothérapie pour tumeur,G190,Séances : chimiothérapie,D27S02,D27S02G190,2024,M
997,C07,DP,03C26,ge_18,[80-[,[50-[,1,Inpatient,DOMICILE,DOMICILE,...,D10,"ORL, Stomatologie",C15,Chirurgie ORL stomato,G085,Chirurgies ORL majeures,D10C15,D10C15G085,2024,C
998,C185,DP,06C04,ge_18,[70-80[,[50-[,1,Inpatient,URGENCES,DOMICILE,...,D01,Digestif,C06,"Chir. Digestive majeure : oesophage, estomac, ...",G002,Chirurgie digestive majeure,D01C06,D01C06G002,2024,C


In [15]:
from tqdm import tqdm

In [16]:
# Generate one scenario, with its user and system prompts, for each sampled profile
list_scenario = []

for i in tqdm(range(len(test_generation_v1))):
    profile = test_generation_v1.iloc[i].copy()
    scenario = gs.generate_scenario_from_profile(profile)
    row = {k:scenario[k] for k in scenario.keys()}
    user_prompt = create_user_prompt(scenario)
    system_prompt = create_system_prompt(scenario)
    row["user_prompt"] = user_prompt
    row["system_prompt"] = system_prompt
    list_scenario.append(row)


In [16]:
keep_cols = ['age', 'sexe', 'date_entry', 'date_discharge', 'date_of_birth',
       'first_name', 'last_name', 'icd_primary_code', 'case_management_type',
       'icd_secondaray_code', 'admission_mode', 'discharge_disposition',
       'cancer_stage', 'score_TNM', 'histological_type',
       'treatment_recommandation', 'chemotherapy_regimen', 'drg_parent_code',
       'cage', 'cage2', 'admission_type', 'dms', 'los_mean', 'los_sd',
       'drg_parent_description', 'icd_parent_code', 'icd_primary_description',
       'case_management_type_description', 'first_name_med', 'last_name_med',
       'text_secondary_icd_official', 'procedure', 'text_procedure',
       'case_management_type_text', 'cd_md_pec', 'user_prompt', 'system_prompt', 'biomarkers']
df_scenario = pd.DataFrame(list_scenario)[keep_cols]
df_scenario

Unnamed: 0,age,sexe,date_entry,date_discharge,date_of_birth,first_name,last_name,icd_primary_code,case_management_type,icd_secondaray_code,...,first_name_med,last_name_med,text_secondary_icd_official,procedure,text_procedure,case_management_type_text,cd_md_pec,user_prompt,system_prompt,biomarkers
0,59,2,2024-05-04,2024-05-04,1964-11-01,Claude,Viaene,C530,DP,[],...,Claude,Le fur,,[ZZQX092],- Examen immunocytochimique ou immunohistochim...,Prise en charge en chirugie ambulatoire pour -...,11,**SCÉNARIO DE DÉPART :**\n- Âge du patient : 5...,Vous êtes chirurgien praticien. Votre tâche es...,
1,62,1,2024-07-24,2024-07-31,1961-10-15,Ylhan,Virfolet,C793,Z515,[],...,Anahé,Faduilhe,,[ZCQH001],- Scanographie de l'abdomen et du petit bassin...,Prise en charge pour soins palliatifs,8,**SCÉNARIO DE DÉPART :**\n- Âge du patient : 6...,Vous êtes un oncologue clinicien expert. Votre...,
2,83,1,2024-07-26,2024-08-17,1941-05-19,Gianluca,Godement,C67,DP,[],...,Katerina,Foucault,,[EBQJ002],- Remnographie des vaisseaux cervicaux [Angio-...,Prise en charge chirugicale en hospitalisation...,12,**SCÉNARIO DE DÉPART :**\n- Âge du patient : 8...,Vous êtes chirurgien praticien. Votre tâche es...,HER2


In [23]:
df_scenario.to_csv("test_generation_v2.csv")

In [None]:
# df_scenario =[]
# for i in range(0,5):

#     current_profile = df_profile.iloc[i,:]

#     scenario = gs.generate_scenario_from_profile(current_profile)
#     row = {k:scenario[k] for k in scenario if k in cols_scenario }
#     cancer = [scenario[k] for k in scenario if k in cols_cancer ]

#     row.update({"cancer":cancer})
    
#     case  = gs.make_prompts_marks_from_scenario(scenario)
    
#     row.update({'case': case})


#     if row['admission_type'] == "Inpatient" and row['drg_parent_code'][2:3]=="C" :
#         template_name = "surgery_complete.txt"
#     elif row['admission_type'] == "Outpatient" and row['drg_parent_code'][2:3]=="C" :
#         template_name = "surgery_outpatient.txt"
#     elif row['drg_parent_code'][2:3]=="K" :
#         template_name = "interventionnel.txt"
#     elif row['cd_md_pec']==17 :
#         template_name = "bilan.txt"
#     else:
#         template_name = "scenario_onco_v1.txt"
        
#     prompt =  prepare_prompt("templates/" + template_name ,case =case)
#     row.update({'prompt': prompt})

#     df_scenario.append(row)

In [226]:
# df_scenario= pd.DataFrame(df_scenario)

In [227]:
# df_scenario.to_csv(gs.path_data + "test_scenario_v1.csv")

In [None]:
# df_scenario[]