In [1]:
import pandas as pd
from modules.NER_functions import *
import warnings 
warnings.filterwarnings("ignore")

  from .autonotebook import tqdm as notebook_tqdm


# NoteBook Conversion
Le but de ce notebook est d'extraire le nombre de conditions par critère d'inclusivité des essais cliniques. 

Voici un schéma simplifié des étapes du notebook:

![Texte alternatif](./images/schema.png)



Le modèle entrainé est un modèle BERT fine-tuné. C'est un modèle de reconnaissance d'entités nommés, il est entrainé pour assigner un label à chaque token d'une phrase qu'on lui soumet.

![Texte alternatif2](./images/ner.jpg)

Notre modèle a été entrainé spécialement sur des données d'essais cliniques et présente 5 labels:
* Mood
* Drug
* Condition
* Person
* Procedure
* Observation

Le but du modèle est ainsi de compter le nombre d'attributs recherchés pour un essai clinique afin de construire des indicateurs si un essai clinique est relativement exigeant sur sa sélection de patient.

In [6]:
data = pd.read_csv("../data/clini_data.csv")

## Predictions du modèle

In [7]:
# On charge d'abord le modèle

model = Bert_Model()

In [8]:
model.load_model('./model/model1')

In [9]:
tags = ['I-Procedure',
 'B-Drug',
 'I-Condition',
  'I-Person',
 'B-Condition',
 'O',
 'I-Observation',
 'B-Procedure',
 'B-Person',
 'B-Mood',
 'I-Mood',
 'B-Observation',
 'I-Drug',
 'PAD']

# Ici on rentre les tags "a la main" car il faut qu'ils respectent le même ordre que ceux qui ont servis à entraîner le modèle.

In [10]:
model.getTag(tags)

In [11]:
data = pd.read_csv("../data/clini_data.csv")

In [12]:
data = data_treatment(data)

In [13]:
data['EligibilityCriteria'].iloc[3]

'Inclusion Criteria:\n\n● Age between 18 and 60 years.\n\nWillingness and ability to fully understand the content and scope of the experiment and comply with its instructions.\nHave signed the informed consent.\n\nExclusion Criteria:\n\n● Pregnancy.\n\nOngoing chronic pain or neuromuscular disorder, or any Desis that effect the nociceptive system and not allowed to be evaluated in normal Condition\nHistory of addictive behavior, defined as abuse of alcohol, cannabis, opioids, or other drugs.\nHistory of heat sensitivity disorders.\nHistory of mental illness.\nPresence of fever, tuberculosis, malignant tumors, infectious processes, acute inflammatory processes\nImplantation of pacemakers or metal prostheses.\nUse of analgesics within 24 hours prior to participation in the experiment.\nLack of sleep (< 6 hours) the night before the experiment.\nHigh alcohol intake the evening before the experiment.'

In [14]:
preds = model.predict(data['EligibilityCriteria'].iloc[80][:200])

In [15]:
# Voici un exemple de prédictions du modèle
for elem in preds[10:]:
    print(elem[0], "-"*10, elem[1])

##met ---------- O
##ast ---------- O
##atic ---------- O
, ---------- O
localized ---------- O
, ---------- O
or ---------- O
regional ---------- O
solid ---------- I-Condition
or ---------- I-Condition
blood ---------- I-Condition
ma ---------- I-Condition
##li ---------- I-Condition
##gna ---------- I-Condition
##ncy ---------- I-Condition
( ---------- I-Condition
i ---------- I-Condition
##es ---------- I-Condition
) ---------- I-Condition
; ---------- O
( ---------- O
2 ---------- O
) ---------- O
completion ---------- O
of ---------- O
primary ---------- B-Procedure
cancer ---------- I-Procedure
treatment ---------- I-Procedure
( ---------- O
radiation ---------- B-Procedure
, ---------- O
surgery ---------- B-Procedure
, ---------- O
and ---------- O
/ ---------- O
or ---------- O
ch ---------- B-Procedure
##em ---------- B-Procedure
##otherapy ---------- B-Procedure
) ---------- O
; ---------- O
( ---------- O


In [16]:
# On ne veut garder que les critères d'inclusions. Pour ce faire, on ne sélectionne dans les critères d'éligibilités que le texte compris entre 
# Inclusion Criteria: et Exclusion Criteria

In [17]:
def extract_inclusion_criteria(eligibility_criteria):
    if type(eligibility_criteria)!=str:
        return None
    match = re.search(r'Inclusion Criteria:(.*?)Exclusion Criteria:', eligibility_criteria, re.DOTALL)
    
    if match:
        return match.group(1).strip()
    else:
        return None

data['InclusionCriteria'] = data['EligibilityCriteria'].apply(extract_inclusion_criteria)

In [18]:
print(data['EligibilityCriteria'].iloc[2],"\n\n", "-"*40,"\n\n", data["InclusionCriteria"].iloc[2])

Inclusion Criteria:

Symptomatic paroxysmal AF that are unresponsive to antiarrhythmic drugs (one or more than one).
Willing to undergo catheter ablation for AF.

Exclusion Criteria:

History of any type of catheter ablation for cardiac arrhythmias.
Sinus node dysfunction that requires permanent pacemaker implantation. 

 ---------------------------------------- 

 Symptomatic paroxysmal AF that are unresponsive to antiarrhythmic drugs (one or more than one).
Willing to undergo catheter ablation for AF.


In [84]:
df = data[(data['Phase']=="Phase 3") & (data['EligibilityCriteria']!="None") & (data['EligibilityCriteria']!="")] #On enlève les valeurs manquantes 
# et on ne garde que les essais cliniques en phase 3

Etant donné que les critères d'éligibilités sont généralement long (et la taille de token que peut prendre en compte notre modèle est d'environ 100), on sépare les textes en plusieurs observations de 80 mots qu'on va re-assembler après.

In [85]:
from textwrap import wrap

def split_text_by_words(text, max_words=80):
    words = text.split()
    segments = []
    current_segment = []
    word_count = 0

    for word in words:
        if word_count  + 1 <= max_words:
            current_segment.append(word)
            word_count +=  1
        else:
            segments.append(' '.join(current_segment))
            current_segment = [word]
            word_count = len(word)

    segments.append(' '.join(current_segment))
    return segments

df['InclusionCriteria'] = df['InclusionCriteria'].fillna('').apply(split_text_by_words)

df_expanded = df.explode('InclusionCriteria')

df_expanded.reset_index(drop=True, inplace=True)

In [191]:
df_expanded[['NCTId','InclusionCriteria']].head(10)

Unnamed: 0,NCTId,InclusionCriteria
0,NCT06183229,Male and female subjects aged 18 to 65 years i...
1,NCT06183229,men - consent to use approved methods of contr...
2,NCT06182631,Need IV's line placement for IV fluids and/or ...
3,NCT06182319,Male or female ≥18 years of age. Documentation...
4,NCT06181435,Participants must be 18 years of age (when sig...
5,NCT06181435,at baseline Weekly average of daily PP-NRS of ...
6,NCT06180707,Participants must have at least 28 teeth in th...
7,NCT06179732,Shelters with 1 or more children from 6-10 yea...
8,NCT06178991,Participants 18 through 64 years of age (or th...
9,NCT06178679,Be at least 18 years of age; Provide written i...


In [92]:
# On met maintenant les textes dans une liste pour y appliquer le modèle.
list_inclusion = df_expanded['InclusionCriteria'].tolist()

In [97]:
list_inclusion[:3]

['Male and female subjects aged 18 to 65 years inclusive. Written informed consent. Co-living with persons who has developed influenza or other acute respiratory viral infection, diagnosed no more than 3 days ago. No signs of acute respiratory viral infection, influenza or COVID-19 at the time of inclusion in the study. For women with preserved reproductive potential - a negative pregnancy test and consent to use approved methods of contraception during the entire period of participation in the study; for',
 'men - consent to use approved methods of contraception during the entire period of participation in the study and for 3 weeks after the end of the study.',
 "Need IV's line placement for IV fluids and/or phlebotomy"]

In [94]:
preds = model.batch_predict(list_inclusion)

12151it [05:07, 39.48it/s]


In [100]:
# On crée un dataframe temporaire, celui ci va nous servir à re-répartir les prédictions par essais cliniques car la liste_preds n'est pas indexée par
# les NCTId des essais cliniques mais ont le même ordre que ceux présents dans df_expanded.
df_temp= pd.DataFrame()
df_temp['NCTId'] = df_expanded['NCTId']
df_temp['Liste'] = preds

In [192]:
df_temp.head(10)

Unnamed: 0,NCTId,Liste
0,NCT06183229,"[(male, B-Person), (and, O), (female, B-Person..."
1,NCT06183229,"[(men, O), (-, O), (consent, O), (to, O), (use..."
2,NCT06182631,"[(need, O), (i, B-Procedure), (##v, B-Procedur..."
3,NCT06182319,"[(male, B-Person), (or, O), (female, B-Person)..."
4,NCT06181435,"[(participants, O), (must, O), (be, O), (18, O..."
5,NCT06181435,"[(at, O), (base, O), (##line, O), (weekly, O),..."
6,NCT06180707,"[(participants, O), (must, O), (have, O), (at,..."
7,NCT06179732,"[(shelters, O), (with, O), (1, O), (or, O), (m..."
8,NCT06178991,"[(participants, O), (18, O), (through, O), (64..."
9,NCT06178679,"[(be, O), (at, O), (least, O), (18, O), (years..."


In [142]:
# On fusionne maintenant les prédictions entre elles 
df_merged = df_temp.groupby('NCTId')['Liste'].agg(sum).reset_index()

In [193]:
list_preds = df_merged['Liste'].tolist() # On a enfin notre liste de prediction avec une prédiction par essai clinique:
len(list_preds) == df_expanded['NCTId'].nunique()

True

## Transcription des prédictions du modèle

Maintenant qu'on a nos prédictions, on compte le nombre d'attributs mentionnée dans les critères d'éligibilité des essais cliniques. 
On créer pour chaque essais clinique un dictionnaire qui associe a un attribut le nombre de fois qu'il apparait. 

In [149]:
list_keys = [element[2:] for element in tags if element.startswith("B")]
list_keys

['Drug', 'Condition', 'Procedure', 'Person', 'Mood', 'Observation']

In [150]:
for elem in preds[0]:
    if elem[1]!='O':
        print(elem)
# On ne va compter que les éléments qui commencent par B pour ne pas compter un même mot/groupe de mots plusieurs fois

('male', 'B-Person')
('female', 'B-Person')
('aged', 'B-Person')
('in', 'B-Condition')
('##fluenza', 'B-Condition')
('acute', 'B-Condition')
('respiratory', 'I-Condition')
('viral', 'I-Condition')
('infection', 'I-Condition')
('acute', 'B-Condition')
('respiratory', 'I-Condition')
('viral', 'I-Condition')
('infection', 'I-Condition')
('in', 'B-Condition')
('##fluenza', 'B-Condition')


In [151]:
def count_attributes(pred):
    dict_clini = {} # dict_clini est un dictionnaire qui repertorie le nombre de mots d'intérêts par catégorie.
    # On initialise le dictionnaire
    for elem in list_keys:
        dict_clini[elem]=0



    for i in pred:
        if i[1][0]=="B":
            if i[0][:2]!="##":
                dict_clini[i[1][2:]]+=1 #On rajoute 1 à la clé du compteur
    return dict_clini
count_attributes(list_preds[0])

{'Drug': 2,
 'Condition': 5,
 'Procedure': 0,
 'Person': 3,
 'Mood': 0,
 'Observation': 0}

In [153]:
preds_dict_list = [count_attributes(elem) for elem in list_preds]


In [155]:
df_merged['raw_count'] = preds_dict_list
# On assigne les "compteurs" aux observations, ici l'ordre est préservé.

In [166]:
df_augmented = pd.merge(df_merged, df, how='inner', on='NCTId') #On fusionne notre dataframe qui contient les prédictions avec l'ancien. On utilise le NCTId
# comme clé de fusion.

In [178]:
# On vérifie qu'on a bien fais match les essais cliniques entre eux
stri =''
for word in df_test['Liste'].iloc[0]:
    stri+= ' ' + word[0]
print(stri)
print(df_augmented['EligibilityCriteria'].iloc[0])

 aged 12 - 85 years ; of either gender . confirmed p ##ah due to id ##io ##pathic pulmonary art ##erial h ##yper ##tens ##ion ( i ##pa ##h ) or f ##ami ##lial pulmonary art ##erial h ##yper ##tens ##ion ( f ##pa ##h ) . 6 - minute walk distance ( 6 - m ##w ##d ) between 100 - 450 meters at screening . on a stable dose of si ##lde ##na ##fi ##l , with or without b ##ose ##nta ##n .
Inclusion Criteria:

Aged 12-85 years; of either gender.
Confirmed PAH due to idiopathic pulmonary arterial hypertension (IPAH) or familial pulmonary arterial hypertension (FPAH).
6-minute walk distance (6-MWD) between 100-450 meters at screening.
On a stable dose of sildenafil, with or without bosentan.

Exclusion Criteria:

Any treatment for PAH with prostacyclins, prostacyclin analogues, endothelin-1 antagonists, or phosphodiesterase-5 (PDE-5) inhibitors other than sildenafil within the past 12 weeks.
Pulmonary hypertension due to conditions other than those stated in inclusion criteria.
Additional PAH med

In [168]:
df_augmented[['InclusionCriteria','raw_count']]

Unnamed: 0,InclusionCriteria,raw_count
0,[Aged 12-85 years; of either gender. Confirmed...,"{'Drug': 2, 'Condition': 5, 'Procedure': 0, 'P..."
1,[Clinical diagnosis of 4-10 previously untreat...,"{'Drug': 0, 'Condition': 1, 'Procedure': 0, 'P..."
2,[Community dwelling patients 65 years of age o...,"{'Drug': 0, 'Condition': 1, 'Procedure': 0, 'P..."
3,[Male and female patients with mild to severe ...,"{'Drug': 0, 'Condition': 1, 'Procedure': 0, 'P..."
4,[Persistent asthma of a minimum of six months ...,"{'Drug': 0, 'Condition': 0, 'Procedure': 1, 'P..."
...,...,...
6256,[Participants must have at least 28 teeth in t...,"{'Drug': 0, 'Condition': 0, 'Procedure': 0, 'P..."
6257,[Participants must be 18 years of age (when si...,"{'Drug': 0, 'Condition': 4, 'Procedure': 1, 'P..."
6258,[Male or female ≥18 years of age. Documentatio...,"{'Drug': 0, 'Condition': 6, 'Procedure': 0, 'P..."
6259,[Need IV's line placement for IV fluids and/or...,"{'Drug': 1, 'Condition': 0, 'Procedure': 2, 'P..."


In [194]:
print(df_augmented['EligibilityCriteria'].iloc[3000])
print("\n\n")
print("Extraction pour les critères d'inclusions :\n\n",df_augmented['raw_count'].iloc[3000])

Inclusion Criteria:

Age ≥2 years and ≤75 years.
Confirmed diagnosis of primary immunodeficiency (PI) disease as defined by the European Society for Immunodeficiencies and Pan American Group for Immunodeficiency and requiring immunoglobulin replacement therapy due to hypogammaglobulinaemia or agammaglobulinaemia. Note: The exact type of PI disease will be recorded.
Established on a consistent or stable mg/kg dose of any SCIG treatment for a minimum of 3 months prior to Screening. Note: patients entering Cohort 3 must be on weekly SCIG infusions for a minimum of 12 weeks.

Availability of the Immunoglobulin G (IgG) trough levels of 2 previous SCIG infusions within 1 year of Screening, with 1 trough level obtained within 3 months prior to enrollment, and maintenance of trough serum IgG levels

≥5.0 g/L in 2 previous infusions. Patients with no prior IgG trough level within 3 months prior to enrollment may use the Screening IgG trough level as their 2nd reading.

Voluntarily given, fully 

In [200]:
# On réparti maintenant les observations dans des colonnes annexes (afin de pouvoir utiliser les variables facilement ensuite)

colonnes_separees = df_augmented['raw_count'].apply(pd.Series)
colonnes_separees.rename(columns={'Condition': 'Conditions'}, inplace=True)
df_final = pd.concat([df_augmented, colonnes_separees], axis=1)
df_final[["NCTId", "InclusionCriteria","Drug","Conditions","Procedure"]].sample(5)

Unnamed: 0,NCTId,InclusionCriteria,Drug,Conditions,Procedure
4057,NCT04592419,[Signed informed consent prior to participatio...,0,5,0
5790,NCT05783492,[General Criteria Provision of signed and date...,0,1,1
4906,NCT05136261,[Children and adolescents aged three to 16 yea...,0,2,0
562,NCT01527357,[Has signed an institutional review board/inde...,1,1,0
3532,NCT04311658,[adolescents and young women requesting LNG-IU...,0,0,1


In [201]:
print(df[df['NCTId']=='NCT04311658'].EligibilityCriteria.iloc[0])

Inclusion Criteria:

adolescents and young women requesting LNG-IUD insertion

Exclusion Criteria:

heavy vaginal bleeding,pregnancy, contraindications to IUD insertion, allergy or contraindication to isosorbide mononitrate, uterine anomaly


In [187]:
# On sauvegarde le modèle 

df_final.to_csv('./data/Data_augmented_final.csv')

In [188]:
import pickle

# J'enregistre les prédictions dans un fichier pkl au cas où je souhaite les réutiliser
with open('./data/predictions_final.pkl', 'wb') as fichier:
    pickle.dump(preds, fichier)

In [189]:
with open('./data/predictions_final.pkl', 'rb') as fichier:
    predictions = pickle.load(fichier)