# Advancing Representation LLMs with Knowledge Graph Integration for Clinical Entity Normalization in Spanish

This notebook processes concepts from the **Unified Medical Language System (UMLS)**, focusing on extracting hierarchical relations (parents and grandparents) to **enrich representation models (Representation LLMs)**.

The goal is to provide additional semantic context to concepts, enabling models to capture ontological structures and enhance their representation and generalization capabilities.

The workflow includes:
- Loading and cleaning UMLS concepts.
- Filtering relevant languages (English and Spanish).
- Preparing data for constructing triplets enriched with parent relations.

---


In [None]:
import os, sys
import pandas as pd
from tqdm.auto import tqdm
import gc
import random
import itertools

sys.path.append(os.path.join(os.getcwd(), '../src'))
from utils import extract_column_names_from_ctl_file




## Defining Paths and Limits

Configure the input path and set the maximum number of positive pairs kept per concept for each relation type (no parents, parents, and grandparents).


In [2]:
UMLS_VERSION = "2023AA"
UMLS_PATH = f"/scratch/data/UMLS/{UMLS_VERSION}/META/"
MAX_NOPARENTS = 15
MAX_PARENTS = 15
MAX_GRAND_PARENTS = 45

## Loading and Filtering Concepts

Load UMLS concepts and filter to keep only English and Spanish terms.
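
As a minimal sketch of this filtering step, the generator below processes pipe-delimited rows one at a time instead of holding every line in memory. The names `iter_conso_rows` and `make_row`, and the sample rows, are illustrative assumptions and not part of this notebook; the column positions (CUI at 0, LAT at 1, AUI at 7, STR at 14) follow the MRCONSO layout used in the cell below.

```python
from typing import Iterable, Iterator, List

KEEP_LANGS = {"ENG", "SPA"}  # languages retained in this notebook


def iter_conso_rows(lines: Iterable[str], keep_langs=KEEP_LANGS) -> Iterator[List[str]]:
    """Yield the fields of each MRCONSO-style row whose LAT is in keep_langs."""
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if fields[1] not in keep_langs:
            continue
        # Rows end with a trailing "|", so the last split element is empty.
        yield fields[:-1]


def make_row(cui: str, lat: str, aui: str, term: str) -> str:
    """Build a hypothetical 18-field row (illustration only)."""
    fields = [""] * 18
    fields[0], fields[1], fields[7], fields[14] = cui, lat, aui, term
    return "|".join(fields) + "|\n"


# Hypothetical identifiers and terms, not real UMLS content.
sample = [
    make_row("C9000001", "ENG", "A9000001", "example term"),
    make_row("C9000001", "SPA", "A9000002", "término de ejemplo"),
    make_row("C9000001", "FRE", "A9000003", "terme d'exemple"),
]
rows = list(iter_conso_rows(sample))
print(len(rows))
```

Passing an open file handle instead of `sample` would stream the real file, at the cost of losing the total line count that `tqdm` uses for its progress bar.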

In [3]:
# MRCONSO.RRF is pipe-delimited and UTF-8 encoded.
with open(os.path.join(UMLS_PATH, "MRCONSO.RRF"), "r", encoding="utf-8") as f:
    lines = f.readlines()
print(len(lines))

cleaned = []
for l in tqdm(lines):
    lst = l.rstrip("\n").split("|")
    lang = lst[1]  # LAT column
    if lang not in ("ENG", "SPA"):
        continue
    # Each row ends with a trailing "|", so drop the empty last field.
    cleaned.append(lst[:-1])
print(len(cleaned))

13609918


100%|██████████| 13609918/13609918 [00:34<00:00, 398883.57it/s]

9975167





## Structuring Concepts

Organize the information into a DataFrame, removing duplicates and keeping key columns for hierarchical relations.


In [4]:
colnames = extract_column_names_from_ctl_file(os.path.join(UMLS_PATH, "MRCONSO.ctl"))
conso_es_en_df = pd.DataFrame(cleaned, columns=colnames)[["CUI","STR","LAT","TS","STT","ISPREF","AUI"]]
conso_es_en_df.drop_duplicates(inplace=True)
conso_es_en_df.head()

Unnamed: 0,CUI,STR,LAT,TS,STT,ISPREF,AUI
0,C0000005,(131)I-Macroaggregated Albumin,ENG,P,PF,Y,A26634265
1,C0000005,(131)I-MAA,ENG,S,PF,Y,A26634266
2,C0000039,"1,2-dipalmitoylphosphatidylcholine",ENG,P,PF,N,A28315139
3,C0000039,"1,2-dipalmitoylphosphatidylcholine",ENG,P,PF,Y,A28572604
4,C0000039,"1,2-Dipalmitoylphosphatidylcholine",ENG,P,VC,Y,A0016515


In [5]:
del cleaned, lines
gc.collect()

0

In [6]:
conso_en_df = conso_es_en_df[conso_es_en_df.LAT=="ENG"].reset_index(drop=True)
conso_es_df = conso_es_en_df[conso_es_en_df.LAT=="SPA"].reset_index(drop=True)

In [7]:
conso_en_df.shape, conso_es_df.shape

((8603906, 7), (1371261, 7))

In [8]:
dict_AUI_CUI = dict(zip(conso_es_en_df.AUI.to_list(), conso_es_en_df.CUI.to_list()))

In [9]:
del conso_es_en_df
gc.collect()

0

## Including TradeNames

This section adds **TradeNames** for each concept, providing additional term variants that may be relevant in clinical contexts.

It is important to note that **TradeNames are only available in the English version of UMLS**, so this step is performed exclusively on English concepts. This enrichment allows models to capture commercial names alongside standardized terminologies, enhancing their semantic representation.
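
The attachment logic can be sketched with plain dictionaries. This is a simplified illustration and not the pandas implementation used in this notebook: `attach_tradenames`, the sample relations, and all identifiers are hypothetical. It assumes that `AUI1` identifies the trade-name atom and that `CUI2` is the concept that receives the name, which is how the columns are used in the cells of this section.

```python
from collections import defaultdict


def attach_tradenames(relations, aui_to_str):
    """Map each target CUI to the trade-name strings linked to it.

    relations:  iterable of (aui1, cui2, rela) tuples from MRREL-style rows.
    aui_to_str: dict from English atom identifier (AUI) to its string.
    """
    cui_to_names = defaultdict(set)
    for aui1, cui2, rela in relations:
        if rela != "has_tradename":
            continue
        name = aui_to_str.get(aui1)
        if name is not None:  # skip atoms missing from the English table
            cui_to_names[cui2].add(name)
    return {cui: sorted(names) for cui, names in cui_to_names.items()}


# Hypothetical data for illustration only.
relations = [
    ("A001", "C100", "has_tradename"),
    ("A002", "C100", "has_tradename"),
    ("A003", "C200", "isa"),            # different relation, ignored
    ("A999", "C300", "has_tradename"),  # no string available for this atom
]
aui_to_str = {"A001": "BrandB", "A002": "BrandA", "A003": "generic name"}
tradenames = attach_tradenames(relations, aui_to_str)
print(tradenames)
```

Collecting the names in a set removes repeated strings per concept, which mirrors the de-duplication applied later when terms are grouped by CUI.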


In [10]:
with open(os.path.join(UMLS_PATH, "MRREL.RRF"), "r") as f:
    lines = f.readlines()
print (len(lines))

cleaned = []
for l in tqdm(lines):
    lst = l.rstrip("\n").split("|")
    rel_attribute = lst[7]  # RELA column
    if rel_attribute != "has_tradename": continue 
    cleaned.append(lst[0:-1])
print (len(cleaned))

44047264


100%|██████████| 44047264/44047264 [00:26<00:00, 1676260.81it/s]

60991





In [11]:
colnames = extract_column_names_from_ctl_file(os.path.join(UMLS_PATH, "MRREL.ctl"))
rel_ht_df = pd.DataFrame(cleaned, columns=colnames)
rel_ht_df.drop_duplicates(inplace=True)
rel_ht_df.head()

Unnamed: 0,CUI1,AUI1,STYPE1,REL,CUI2,AUI2,STYPE2,RELA,RUI,SRUI,SAB,SL,RG,DIR,SUPPRESS,CVF
0,C0000266,A10336090,SCUI,RB,C0006230,A31756224,SCUI,has_tradename,R49103038,,RXNORM,RXNORM,,,N,
1,C0000266,A29937449,AUI,RB,C0546852,A11915709,AUI,has_tradename,R181538536,,PDQ,PDQ,,,N,
2,C0000266,A7572841,AUI,SY,C0546852,A10760252,AUI,has_tradename,R54483006,,NCI,NCI,,,N,
3,C0001520,A10334571,SCUI,RB,C0031447,A31713462,SCUI,has_tradename,R49101957,,RXNORM,RXNORM,,,N,
4,C0002025,A10762147,AUI,SY,C0700562,A7577925,AUI,has_tradename,R54481441,,NCI,NCI,,,N,


In [12]:
rel_ht_df.shape

(60991, 16)

In [13]:
del cleaned, lines
gc.collect()

0

In [14]:
# AUI1 holds the atom identifiers (AUIs) of the trade names.
tradenames_AUIs = rel_ht_df.AUI1.to_list()
conso_tradenames = conso_en_df[conso_en_df.AUI.isin(tradenames_AUIs)]


# Build a DataFrame with the same structure as conso_es_df.
tradenames_es = pd.DataFrame()
tradenames_es['CUI'] = rel_ht_df['CUI2']
tradenames_es['STR'] = ''
tradenames_es['LAT'] = 'ENG'
tradenames_es['TS'] = 'S'
tradenames_es['STT'] = ''
tradenames_es['ISPREF'] = ''
tradenames_es['AUI'] = rel_ht_df['AUI1']
# Fill in the remaining fields from conso_tradenames.
# Keep only rows whose (AUI, STR) combination is unique.
unique_tradenames_rows = conso_tradenames.groupby(['AUI', 'STR']).filter(lambda x: len(x) == 1)
# Build dictionaries keyed by AUI whose values are lists of STR, STT and ISPREF.
trade_aui_str = unique_tradenames_rows.groupby('AUI')['STR'].apply(list).to_dict()
trade_aui_stt = unique_tradenames_rows.groupby('AUI')['STT'].apply(list).to_dict()
trade_aui_ispref = unique_tradenames_rows.groupby('AUI')['ISPREF'].apply(list).to_dict()
# Map the strings and attributes onto each row through its AUI.
tradenames_es["STR"] = tradenames_es.AUI.map(trade_aui_str)
tradenames_es["STT"] = tradenames_es.AUI.map(trade_aui_stt)
tradenames_es["ISPREF"] = tradenames_es.AUI.map(trade_aui_ispref)
tradenames_es["STR"] = tradenames_es.STR.explode()
tradenames_es["STT"] = tradenames_es.STT.explode()
tradenames_es["ISPREF"] = tradenames_es.ISPREF.explode()
# Combine the Spanish concepts with the trade names.
conso_es_td_df = pd.concat([conso_es_df,tradenames_es]).reset_index(drop=True).copy()


In [15]:
conso_es_td_df.shape 

(1432252, 7)

In [16]:
del rel_ht_df, conso_en_df, conso_tradenames, tradenames_es
gc.collect()

0

## Loading the MRHIER File to Extract Hierarchical Relations

In this section, the **MRHIER.RRF** file is loaded, which contains the **hierarchical relations** between UMLS concepts. This file will be used to identify the **parents** and **grandparents** of each concept.

It is important to note that **MRHIER does not directly provide the parent or grandparent CUI (Concept Unique Identifier)**. Instead, it provides the **AUI (Atom Unique Identifier)**, which refers to specific terms or synonyms. Therefore, a **manual mapping from AUI to CUI** is required, using the previously loaded data from the **MRCONSO.RRF** file.

This step is crucial to building the correct hierarchical relations that will enrich the knowledge of the Representation LLMs.
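
The lookup described above can be sketched as follows. In MRHIER, `PTR` is a period-separated path of AUIs from the root of the hierarchy down to the immediate parent, so the last element is the parent and the second-to-last is the grandparent. The helper names and the sample identifiers are hypothetical, and the sketch returns `None` for a missing ancestor where the notebook uses the `"NO_AB"` placeholder.

```python
def parent_and_grandparent(ptr):
    """Return (parent AUI, grandparent AUI) from a PTR path; None when absent."""
    path = [aui for aui in ptr.split(".") if aui]
    parent = path[-1] if len(path) >= 1 else None
    grandparent = path[-2] if len(path) >= 2 else None
    return parent, grandparent


def to_cui(aui, aui_to_cui):
    """Resolve an AUI to its CUI, returning "" when the atom is unknown."""
    if aui is None:
        return ""
    return aui_to_cui.get(aui, "")


# Hypothetical AUI-to-CUI table; in the notebook it is built from MRCONSO.
aui_to_cui = {"A3": "C0000003", "A2": "C0000002"}

parent_aui, grandparent_aui = parent_and_grandparent("A1.A2.A3")
parent_cui = to_cui(parent_aui, aui_to_cui)
grandparent_cui = to_cui(grandparent_aui, aui_to_cui)
print(parent_cui, grandparent_cui)
```

Because the AUI-to-CUI table in this notebook only covers English and Spanish atoms, ancestors whose atoms exist only in other languages resolve to an empty string and are discarded later.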


In [17]:
with open(os.path.join(UMLS_PATH, "MRHIER.RRF"), "r") as f:
    lines = f.readlines()
print (len(lines))


33656779


In [18]:
cleaned = []
for l in tqdm(lines):
    lst = l.rstrip("\n").split("|")
    cleaned.append(lst[0:-1])
print (len(cleaned))

100%|██████████| 33656779/33656779 [01:27<00:00, 385525.83it/s] 

33656779





In [19]:
colnames = extract_column_names_from_ctl_file(os.path.join(UMLS_PATH, "MRHIER.ctl"))
hier_df = pd.DataFrame(cleaned, columns=colnames)
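# PTR is the path from the root to the immediate parent; its second-to-last
# element is the grandparent atom (AB_AUI), when the path is long enough.
hier_df["AB_AUI"] = hier_df["PTR"].map(lambda x: x.split(".")[-2] if len(x.split(".")) >= 2 else "NO_AB")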
hier_df = hier_df[["CUI","AUI","PAUI", "AB_AUI"]]
hier_df.drop_duplicates(inplace=True)
hier_df.head()

Unnamed: 0,CUI,AUI,PAUI,AB_AUI
0,C0000039,A0016515,A0100865,A0064073
1,C0000039,A0049238,A0699760,A0650841
2,C0000039,A11754881,A11752705,A11755810
3,C0000039,A13042554,A13041957,A13055790
4,C0000039,A13096036,A9113576,A9114688


In [20]:
del cleaned, lines
gc.collect()

0

In [21]:
hier_df["PCUI"] = hier_df.PAUI.map(lambda x: dict_AUI_CUI.get(x,"") )
hier_df["AB_CUI"] = hier_df.AB_AUI.map(lambda x: dict_AUI_CUI.get(x,"") )

In [22]:
hier_df.head()

Unnamed: 0,CUI,AUI,PAUI,AB_AUI,PCUI,AB_CUI
0,C0000039,A0016515,A0100865,A0064073,C1959616,C0162448
1,C0000039,A0049238,A0699760,A0650841,C0311458,C0311397
2,C0000039,A11754881,A11752705,A11755810,,
3,C0000039,A13042554,A13041957,A13055790,,
4,C0000039,A13096036,A9113576,A9114688,,


In [23]:
grouped_hier_df =  hier_df.groupby('CUI').agg(lambda x: list(set(x)))

In [24]:
del hier_df
gc.collect()

0

In [25]:
grouped_hier_df.head()

CUI,AUI,PAUI,AB_AUI,PCUI,AB_CUI
C0000039,"[A32917594, A13424401, A7480660, A18972171, A0...","[A8049187, A9113576, A7551192, A24703238, A189...","[A7492380, A13421833, A15706899, A9220464, A32...","[, C0031676, C0311458, C1959616]","[, C0162448, C0311397, C0023779, C1979801]"
C0000052,"[A0630731, A8009325, A7480666, A27769867, A189...","[A13423393, A8020611, A7492280, A0055007, A349...","[A32866223, A8018552, A3144582, A6074366, A288...","[, C0014442, C0443499, C0017772, C0019495, C00...","[, C0014442, C0443485, C0201682, C0019495, C00..."
C0000084,"[A13431788, A9223642, A15722925, A11745997, A1...","[A15678887, A0063904, A7469369, A9114942, A754...","[A13064528, A15912976, A13053086, A7440080, A1...","[C0017789, , C0040887]","[, C0815040, C0001129, C0002524]"
C0000096,"[A18993650, A7548002, A7480686, A7471256, A052...","[A9169756, A7554490, A0124908, A13059178, A157...","[A7471261, A0134459, A11766490, A7512404, A802...","[, C0039771]","[, C0043318]"
C0000097,"[A1199224, A15732141, A18998932, A3230610, A13...","[A13050367, A7560729, A30048386, A11751019, A7...","[A3268142, A11752904, A0959166, A18969400, A34...","[, C0034255, C0242702, C0027934, C0576798]","[, C0019398, C0019399, C0032346, C0040549, C05..."


In [26]:
def remove_empty_values(lst):
    # Drop the "" placeholders left by AUIs that had no CUI mapping.
    return [x for x in lst if x != '']
grouped_hier_df["PCUI"] = grouped_hier_df.PCUI.map(remove_empty_values)
grouped_hier_df["AB_CUI"] = grouped_hier_df.AB_CUI.map(remove_empty_values)

In [27]:
grouped_hier_df.head()

CUI,AUI,PAUI,AB_AUI,PCUI,AB_CUI
C0000039,"[A32917594, A13424401, A7480660, A18972171, A0...","[A8049187, A9113576, A7551192, A24703238, A189...","[A7492380, A13421833, A15706899, A9220464, A32...","[C0031676, C0311458, C1959616]","[C0162448, C0311397, C0023779, C1979801]"
C0000052,"[A0630731, A8009325, A7480666, A27769867, A189...","[A13423393, A8020611, A7492280, A0055007, A349...","[A32866223, A8018552, A3144582, A6074366, A288...","[C0014442, C0443499, C0017772, C0019495, C0040...","[C0014442, C0443485, C0201682, C0019495, C0040..."
C0000084,"[A13431788, A9223642, A15722925, A11745997, A1...","[A15678887, A0063904, A7469369, A9114942, A754...","[A13064528, A15912976, A13053086, A7440080, A1...","[C0017789, C0040887]","[C0815040, C0001129, C0002524]"
C0000096,"[A18993650, A7548002, A7480686, A7471256, A052...","[A9169756, A7554490, A0124908, A13059178, A157...","[A7471261, A0134459, A11766490, A7512404, A802...",[C0039771],[C0043318]
C0000097,"[A1199224, A15732141, A18998932, A3230610, A13...","[A13050367, A7560729, A30048386, A11751019, A7...","[A3268142, A11752904, A0959166, A18969400, A34...","[C0034255, C0242702, C0027934, C0576798]","[C0019398, C0019399, C0032346, C0040549, C0597..."


In [None]:
# The index of grouped_hier_df is the CUI, so each column converts directly
# into a {CUI: list of ancestor CUIs} dictionary.
cui_pcui_dict = grouped_hier_df['PCUI'].to_dict()
cui_ab_cui_dict = grouped_hier_df['AB_CUI'].to_dict()

In [29]:
grouped_conso_es_td_df = conso_es_td_df.groupby('CUI').agg(lambda x: list(set(x)))

In [30]:
del conso_es_td_df, grouped_hier_df
gc.collect()

0

In [31]:
grouped_conso_es_td_df.head()

CUI,STR,LAT,TS,STT,ISPREF,AUI
C0000039,"[1,2-Dipalmitoilfosfatidilcolina, Dipalmitoil-...",[SPA],"[P, S]",[PF],[Y],"[A27548619, A9176213]"
C0000052,"[enzima ramificadora 1,4-alfa-glucano (sustanc...",[SPA],"[P, S]",[PF],[Y],"[A27900241, A27899115, A27905681, A27907797, A..."
C0000084,"[Carboxiglutamato-gamma, Ácido gamma-Carboxigl...",[SPA],"[P, S]",[PF],[Y],"[A27683108, A9223642, A27539770, A27602743, A2..."
C0000096,"[3-Isobutil-1-metilxantina, Isobutilteofilina,...",[SPA],"[P, S]",[PF],[Y],"[A27700895, A9176264, A27638420]"
C0000097,"[metilfeniltetrahidropiridina (sustancia), 1-M...",[SPA],"[P, S]",[PF],[Y],"[A27602762, A5902941, A9176265, A5902942]"


In [None]:
grouped_conso_es_td_df = grouped_conso_es_td_df.reset_index()
grouped_conso_es_td_df["PCUI"] = grouped_conso_es_td_df.CUI.map(cui_pcui_dict)
grouped_conso_es_td_df["GPCUI"] = grouped_conso_es_td_df.CUI.map(cui_ab_cui_dict)

In [33]:
grouped_conso_es_td_df = grouped_conso_es_td_df[["CUI","STR","PCUI","GPCUI"]]
grouped_conso_es_td_df.head(25)

Unnamed: 0,CUI,STR,PCUI,GPCUI
0,C0000039,"[1,2-Dipalmitoilfosfatidilcolina, Dipalmitoil-...","[C0031676, C0311458, C1959616]","[C0162448, C0311397, C0023779, C1979801]"
1,C0000052,"[enzima ramificadora 1,4-alfa-glucano (sustanc...","[C0014442, C0443499, C0017772, C0019495, C0040...","[C0014442, C0443485, C0201682, C0019495, C0040..."
2,C0000084,"[Carboxiglutamato-gamma, Ácido gamma-Carboxigl...","[C0017789, C0040887]","[C0815040, C0001129, C0002524]"
3,C0000096,"[3-Isobutil-1-metilxantina, Isobutilteofilina,...",[C0039771],[C0043318]
4,C0000097,"[metilfeniltetrahidropiridina (sustancia), 1-M...","[C0034255, C0242702, C0027934, C0576798]","[C0019398, C0019399, C0032346, C0040549, C0597..."
5,C0000098,"[1-Metil-4-fenilpiridinio, Ciperquat]",[C0034256],[C0034255]
6,C0000102,"[1-Aminonaftaleno, 1-naftilamina (sustancia), ...","[C0541489, C0027378, C0002508, C0007090]","[C0541487, C0303743, C0032346, C0032458, C0029..."
7,C0000103,"[1-Naftilisotiocianato, alfa-Naftilisotiocianato]","[C0206359, C0027378]","[C0032458, C0038776, C0162344]"
8,C0000107,"[1-Sar-8-Ile-Angiotensina II, 1-Sarcosina-8-Is...",[C0003009],[C0003018]
9,C0000119,"[11-Hidroxicorticoesteroides, 11-Hidroxicortic...",[C0020343],[C0001617]


In [None]:
# Concepts without an entry in the hierarchy dictionaries map to NaN; replace those with empty lists.
grouped_conso_es_td_df["PCUI"] = grouped_conso_es_td_df["PCUI"].map(lambda x: x if isinstance(x, list) else [])
grouped_conso_es_td_df["GPCUI"] = grouped_conso_es_td_df["GPCUI"].map(lambda x: x if isinstance(x, list) else [])

In [34]:
grouped_conso_es_td_df["STR"] = grouped_conso_es_td_df["STR"].map(lambda x: [item.lower() for item in x])

In [35]:
CUI_STR = dict(zip(grouped_conso_es_td_df.CUI.to_list(), grouped_conso_es_td_df.STR.to_list()))

## Generating Positive Pairs (noparents, parents, and grandparents)

In this section, **positive pairs** are generated to form the triplets that will enrich Representation LLMs. Three types of relations are considered:

- **Noparents**: Pairs of synonyms of the same concept, without hierarchical information.
- **Parents**: Pairs drawn from the concept's synonyms together with the terms of its direct parents.
- **Grandparents**: Pairs drawn from the concept's synonyms together with the terms of its parents and grandparents.

To avoid **imbalances** in the dataset, the pairs of each concept are shuffled and truncated to a fixed maximum (`MAX_NOPARENTS`, `MAX_PARENTS`, or `MAX_GRAND_PARENTS`), so that concepts with many synonyms or ancestors do not dominate the training data. This balance helps the model learn the different semantic hierarchies.
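
A minimal sketch of the pair generation and capping, assuming a hypothetical helper `capped_pairs` and made-up terms. Unlike the cells below, it uses a seeded `random.Random` instance so that the selection is reproducible; the seed is an assumption, not something this notebook sets.

```python
import itertools
import random


def capped_pairs(terms, max_pairs, rng):
    """Return up to max_pairs unordered pairs of distinct terms, chosen at random."""
    unique_terms = sorted(set(terms))  # sorted so the result depends only on the seed
    pairs = list(itertools.combinations(unique_terms, 2))
    rng.shuffle(pairs)
    return pairs[:max_pairs]


rng = random.Random(0)  # fixed seed, an assumption for reproducibility

# Hypothetical concept identifier and synonyms, including one duplicate.
cui = "C9000001"
synonyms = ["término a", "término b", "término c", "término a"]

pairs = capped_pairs(synonyms, max_pairs=2, rng=rng)

# Same "CUI||term1||term2" layout that the cells below write to disk.
lines = [cui + "||" + a + "||" + b for a, b in pairs]
print(lines)
```

A concept with `n` distinct terms yields `n * (n - 1) / 2` candidate pairs, so the cap only takes effect once that count exceeds the chosen maximum.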


In [36]:
noparents_dict = {}
for cui, str_list in zip(grouped_conso_es_td_df['CUI'], grouped_conso_es_td_df['STR']):
    terms = list(set([term for term in str_list if isinstance(term, str)]))
    noparents_dict[cui] = list(itertools.combinations(terms, 2))

In [37]:
noparents_dict_nmax = {}
for key, value in noparents_dict.items():
    random.shuffle(value)
    noparents_dict_nmax[key] = value[:MAX_NOPARENTS]

In [38]:
pos_pairs = []
for k,v in tqdm(noparents_dict_nmax.items()):
    for p in v:
        line = str(k) + "||" + p[0] + "||" + p[1]
        pos_pairs.append(line)
len(pos_pairs)

100%|██████████| 517580/517580 [00:00<00:00, 818904.76it/s]


1357991

In [None]:
with open(f'../data/triplets/training_file_umls{UMLS_VERSION.lower()}_es_uncased_no_dup_pairwise_pair_th{MAX_NOPARENTS}_noparents.txt', 'w') as f:
    for line in pos_pairs:
        f.write("%s\n" % line)

In [None]:
def map_codes(codes):
    # Collect the terms of every code that has an entry in CUI_STR.
    terms = []
    for code in codes:
        terms.extend(CUI_STR.get(code) or [])
    return terms
 
grouped_conso_es_td_df['STR_Parent'] = grouped_conso_es_td_df['PCUI'].map(map_codes)
grouped_conso_es_td_df['STR_Granparent'] = grouped_conso_es_td_df['GPCUI'].map(map_codes)

In [46]:
grouped_conso_es_td_df.shape

(517580, 6)

In [47]:
grouped_conso_es_td_df.head()

Unnamed: 0,CUI,STR,PCUI,GPCUI,STR_Parent,STR_Granparent
0,C0000039,"[1,2-dipalmitoilfosfatidilcolina, dipalmitoil-...","[C0031676, C0311458, C1959616]","[C0162448, C0311397, C0023779, C1979801]","[fosfolípido, producto que contiene fosfolípid...","[fosfoglicérido, glicerofosfolípidos, fosfogli..."
1,C0000052,"[enzima ramificadora 1,4-alfa-glucano (sustanc...","[C0014442, C0443499, C0017772, C0019495, C0040...","[C0014442, C0443485, C0201682, C0019495, C0040...",[sustancia con mecanismo de acción enzimática ...,[sustancia con mecanismo de acción enzimática ...
2,C0000084,"[carboxiglutamato-gamma, ácido gamma-carboxigl...","[C0017789, C0040887]","[C0815040, C0001129, C0002524]","[glutamatos, ácido tricarboxílico, ácidos tric...","[aminoácidos acídicos, aminoácidos ácidos, áci..."
3,C0000096,"[3-isobutil-1-metilxantina, isobutilteofilina,...",[C0039771],[C0043318],"[somophyllin-t, uniphyl, theobid duracap, t-ph...",[xantinas]
4,C0000097,"[metilfeniltetrahidropiridina (sustancia), 1-m...","[C0034255, C0242702, C0027934, C0576798]","[C0019398, C0019399, C0032346, C0040549, C0597...","[piridinas, antagonista de receptor de dopamin...","[compuesto heterocíclico, compuestos heterocíc..."


In [48]:
%%time
parents_dict = {}
for cui, str_list, str_parlist in zip(grouped_conso_es_td_df['CUI'], grouped_conso_es_td_df['STR'], grouped_conso_es_td_df['STR_Parent']):
    terms = list(set([term for term in str_list if isinstance(term, str)] + str_parlist))
    parents_dict[cui] = list(itertools.combinations(terms, 2))
parents_dict_nmax = {}
for key, value in parents_dict.items():
    random.shuffle(value)
    parents_dict_nmax[key] = value[:MAX_PARENTS]

CPU times: user 13.7 s, sys: 75.4 ms, total: 13.8 s
Wall time: 13.8 s


In [49]:
pos_pairs_pcui = []
for k,v in tqdm(parents_dict_nmax.items()):
    for p in v:
        line = str(k) + "||" + p[0] + "||" + p[1]
        pos_pairs_pcui.append(line)
len(pos_pairs_pcui)

100%|██████████| 517580/517580 [00:02<00:00, 235220.40it/s]


5350647

In [50]:
with open(f'../data/triplets/training_file_umls{UMLS_VERSION.lower()}_es_uncased_no_dup_pairwise_pair_th{MAX_PARENTS}_parents.txt', 'w') as f:
    for line in pos_pairs_pcui:
        f.write("%s\n" % line)

In [51]:
%%time
grandparents_dict = {}
for cui, str_list, str_parlist, str_granparent in zip(grouped_conso_es_td_df['CUI'], grouped_conso_es_td_df['STR'], grouped_conso_es_td_df['STR_Parent'], grouped_conso_es_td_df['STR_Granparent']):
    terms = list(set([term for term in str_list if isinstance(term, str)] + str_parlist + str_granparent))
    grandparents_dict[cui] = list(itertools.combinations(terms, 2))

grandparents_dict_nmax = {}
for key, value in grandparents_dict.items():
    random.shuffle(value)
    grandparents_dict_nmax[key] = value[:MAX_GRAND_PARENTS]

CPU times: user 1min 2s, sys: 302 ms, total: 1min 2s
Wall time: 1min 2s


In [52]:
pos_pairs_gpcui = []
for k,v in tqdm(grandparents_dict_nmax.items()):
    for p in v:
        line = str(k) + "||" + p[0] + "||" + p[1]
        pos_pairs_gpcui.append(line)

100%|██████████| 517580/517580 [00:06<00:00, 80992.48it/s] 


In [53]:
with open(f'../data/triplets/training_file_umls{UMLS_VERSION.lower()}_es_uncased_no_dup_pairwise_pair_th{MAX_GRAND_PARENTS}_grandparents.txt', 'w') as f:
    for line in pos_pairs_gpcui:
        f.write("%s\n" % line)