# Clinical Entity Resolution:

Use Cases:
* This technique derives insight from clinical entities that have already been detected and recognized.
* It includes activities such as abbreviation detection and resolution.
* It links entities to preexisting databases.

Also known as:
* Clinical record matching
* Clinical fuzzy matching
* Clinical record linkage

### Clinical Knowledge Bases:
* Clinical knowledge bases are curated databases for clinical and biomedical data.

Some examples are:
* UMLS (Unified Medical Language System)
* MeSH (Medical Subject Headings)
* RxNorm (normalized names for clinical drugs)
* GO (Gene Ontology)
* HPO (Human Phenotype Ontology)

### Central Principles:

* Entity recognition and extraction
* Established knowledge base 
* Entity matching and linkage without a unique identifier
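
To make the third principle concrete, here is a minimal sketch of matching a mention to a knowledge base when no shared identifier exists. It uses only the standard library's `difflib`; the `TOY_KB` dictionary, its concept IDs, and the `0.8` threshold are assumptions made up for illustration and do not come from any real vocabulary.

```python
from difflib import SequenceMatcher

# Hypothetical miniature knowledge base: concept ID -> known names (aliases).
# The IDs are invented for this example.
TOY_KB = {
    "KB001": ["hypoglycemia", "low blood sugar"],
    "KB002": ["Raynaud phenomenon", "Raynaud's phenomenon"],
    "KB003": ["diabetes mellitus", "diabetes"],
}


def similarity(a, b):
    """Return a 0..1 string similarity ratio, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_mention(mention, kb=TOY_KB, threshold=0.8):
    """Return (concept_id, score) for the best-matching alias, or None."""
    best_id, best_score = None, 0.0
    for concept_id, aliases in kb.items():
        for alias in aliases:
            score = similarity(mention, alias)
            if score > best_score:
                best_id, best_score = concept_id, score
    # Only accept the best candidate if it is similar enough
    if best_score >= threshold:
        return best_id, best_score
    return None


print(link_mention("hypoglycaemia"))       # spelling variant of a known alias
print(link_mention("sleeve gastrectomy"))  # nothing similar in the toy KB
```

The threshold trades recall for precision: lowering it links more mentions but admits more wrong matches.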

### Practical Uses:
* Deduplication of clinical entities for data quality assurance 
* Access further information from knowledge bases 
* Linkage of disparate clinical data sources 
* Clinical entity disambiguation 
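
As a rough illustration of the deduplication use case, surface forms that resolve to the same concept can be grouped together. The sketch below assumes the linking step has already produced a mention-to-concept mapping; the `resolved` dictionary and its concept IDs are placeholders, not real identifiers.

```python
from collections import defaultdict

# Hypothetical linker output: surface form -> resolved concept ID
# (None when the mention could not be linked).
resolved = {
    "lymphocytes": "C-0001",
    "lymphocyte": "C-0001",
    "Lymphocyte": "C-0001",
    "glutathione": "C-0002",
    "gut-liver": None,
}


def group_by_concept(mention_to_concept):
    """Group surface forms that resolve to the same concept ID."""
    groups = defaultdict(list)
    unresolved = []
    for mention, concept_id in mention_to_concept.items():
        if concept_id is None:
            unresolved.append(mention)
        else:
            groups[concept_id].append(mention)
    return dict(groups), unresolved


groups, unresolved = group_by_concept(resolved)
print(groups)      # mentions sharing a concept ID are duplicates of each other
print(unresolved)  # mentions that need manual review
```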

In [1]:
import scispacy
import spacy
import en_ner_bc5cdr_md
import en_ner_bionlp13cg_md
import en_core_sci_md
from scispacy.abbreviation import AbbreviationDetector
from scispacy.linking import EntityLinker
from pprint import pprint
import pandas as pd


In [2]:
# Saving my sample data into a variable called case_report1
# This can be found here https://www.frontiersin.org/articles/10.3389/fendo.2022.951377/full
case_report1 = """Background: Hypoglycemia is uncommon in people who are not being treated for diabetes mellitus and, when present, the differential diagnosis is broad. 
Artifactual hypoglycemia describes discrepancy between low capillary and normal plasma glucose levels regardless of symptoms and should be considered in patients with Raynaud’s phenomenon.
Case Presentation: A 46-year-old female patient with a history of a sleeve gastrectomy started complaining about episodes of lipothymias preceded by sweating, nausea, and dizziness. 
During one of these episodes, a capillary blood glucose was obtained with a value of 24 mg/dl. She had multiple emergency admissions with low-capillary glycemia. 
An exhaustive investigation for possible causes of hypoglycemia was made for 18 months. 
The 72h fasting test was negative for hypoglycemia. A Raynaud’s phenomenon was identified during one appointment.
Conclusion: Artifactual hypoglycemia has been described in various conditions including Raynaud’s phenomenon, peripheral arterial disease, Eisenmenger syndrome, acrocyanosis, or hypothermia. 
With this case report, we want to reinforce the importance of being aware of this diagnosis to prevent anxiety, unnecessary treatment, and diagnostic tests."""

In [3]:
# Saving my sample data into a variable called case_report2
# This can be found here https://www.frontiersin.org/articles/10.3389/fcdhc.2022.934629/full
case_report2 = """Introduction: This study aimed at assessing the patterns of care and glycemic control of patients with diabetes (DM) in real life during a follow-up of 2 years in the public and private health sectors in Brazil.
Methods: BINDER was an observational study of patients >18 years old, with type-1 (T1DM) and type-2 DM (T2DM), followed at 250 sites from 40 cities across the five regions of Brazil. The results for the 1,266 participants who were followed for 2 years are presented.
Main results: Most patients were Caucasians (75%), male (56.7%) and from the private health sector (71%). Of the 1,266 patients who entered the analysis, 104 (8.2%) had T1DM and 1162 (91.8%) had T2DM. Patients followed in the private sector represented 48% of the patients with T1DM and 73% of those with T2DM. For T1DM, in addition to insulins (NPH in 24%, regular in 11%, long-acting analogues in 58%, fast-acting analogues in 53%, and others in 12%), the patients received biguanide (20%), SGLT2-I (4%), and GLP-1Ra (<1%). After 2 years, 13% of T1DM patients were using biguanide, 9% SGLT2-I, 1% GLP-1Ra, and 1% pioglitazone; the use of NPH and regular insulins decreased to 13% and 8%, respectively, while 72% were receiving long-acting insulin analogues, and 78% fast-acting insulin analogues. Treatment for T2DM consisted of biguanide (77%), sulfonylureas (33%), DPP4 inhibitors (24%), SGLT2-I (13%), GLP-1Ra (2.5%), and insulin (27%), with percentages not changing during follow-up. Regarding glucose control, mean HbA1c at baseline and after 2 years of follow-up was 8.2 (1.6)% and 7.5 (1.6)% for T1DM, and 8.4 (1.9)% and 7.2 (1.3)% for T2DM, respectively. After 2 years, HbA1c<7% was reached in 25% of T1DM and 55% of T2DM patients from private institutions and in 20.5% of T1DM and 47% of T2DM from public institutions.
Conclusion: Most patients did not reach the HbA1c target in private or public health systems. At the 2-year follow-up, there were no significant improvements in HbA1c in either T1DM or T2DM, which suggests an important clinical inertia.
Introduction
According to the International Diabetes Federation, 463 million people are currently living with diabetes (DM) worldwide (1, 2). In 2019, it was estimated that there were about 16.8 million people aged from 20 to 79 years with DM in Brazil, with a projected increase of 55% by the year 2045 (1, 2). Type 2 diabetes (T2DM) comprises approximately 90% of all DM diagnoses (3). Estimates related to the number of existing cases of type 1 diabetes (T1DM) in children and adolescents from 0 to 14 years show that Brazil occupies the third place in the global panorama, with 55,500 cases, behind India (95,600) and the United States (94,200) (1).
Chronic non-communicable diseases (NCDs) are responsible for nearly two thirds of deaths in Brazil, 5.3% of which due to DM (4). In addition, DM is known to be an important risk factor for chronic cardiovascular disease (CVD), which accounts for 31.3% of deaths in our country (5).
Over the last decades, age-standardized rates have shown a tendency to reduced mortality caused by CVD and DM in Brazil (6, 7), in agreement with the aging of the population and the extension of life with the disease. The considerable burden of these diseases was highlighted in the Project on the Global Burden of Disease in Brazil (Burden of disease in Brazil, 1990–2016), in which DM was identified to be responsible for 4.7% of disability-adjusted life-years (DALY) in total and 6.1% of DALY originated by NCDs (8).
One of the great current challenges is, therefore, to deal with this increase in morbidity, which requires controlling the disease and preventing complications. These data are even more worrisome when considering the number of affected people in Brazil. Brazilian data on the prevalence of DM representative of the population of nine capitals date from the 1980s (9). At that time, it was estimated that approximately 7.6% of the Brazilian population aged between 30 and 69 years had DM, with both genders being equally affected, and with the prevalence of the disease increasing with age and body fat. A more recent estimate of the prevalence of self-reported DM in Brazil was performed by the Surveillance System of Risk and Protective Factors for Chronic Diseases by Telephone Survey (VIGITEL, Vigilância de Fatores de Risco por Inquérito Telefônico), implemented in 27 state capitals since 2006 (10). In the VIGITEL 2018, 8.1% of women and 7.1% of men ≥18 years old in Brazil reported having DM; the numbers increased with age, reaching 23.1% in individuals over 64 years of age, and decreased with higher the level of education, affecting 15.2% of the participants with from 0 to 8 years of schooling and 3.7% in the group with higher education (10).
The high prevalence of DM exerts a negative impact on health not only due to mortality, but also through complications and disabilities resulting from the prolonged time living with the disease and poor metabolic control. In addition to the health-related effects, diabetes is associated with an unwanted economic impact on both individual and society levels. Studies show that associated costs increase according to the duration of DM and the presence of micro- and macrovascular complications (11, 12). Inadequate glycemic control can aggravate these medical conditions and has been reported in studies including patients with T1DM and T2DM treated in the Brazilian Public Unified Health System (SUS, Sistema Único de Saúde) (11, 13, 14). Data related to the management of diabetes in the private sector in Brazil are still scarce.
To understand this scenario, there is a lack of data on the prevalence of chronic complications and comorbidities, including cardiovascular risk factors, associated with DM in the Brazilian population. In this regard, public and private health services represent opportunities to access professional care and different medications, providing information to guide better strategies for secondary and tertiary prevention of DM. The disease burden of DM is a relevant concern that requires secondary and tertiary prevention strategies. To develop these actions, it is necessary to understand the epidemiological and current management landscape of patients with diabetes in Brazil.
The BrazIliaN Type 1 & 2 DiabetEs Disease Registry (BINDER) study was an observational study, with both a cross-sectional and a longitudinal phase, designed to assess the demographic and clinical characteristics, patterns of care and glycemic control of patients with DM in real life during a follow-up of 2 years in the public and private health sectors in Brazil. In this paper, we present the results of the longitudinal analysis which included the patients followed for 2 years.
Patients and methods
Study design and population
This was a observational study of individuals with DM followed for 2 years in the BINDER study. BINDER included patients with T1DM and T2DM followed by 250 physicians from different public and private healthcare services, geographically distributed in 40 cities across the five regions of Brazil. The study had both cross-sectional and longitudinal phases (for a total duration of 2 years). Five waves of data collection were performed; for each wave, information from the last 6 months was obtained. To be enrolled in the study, patients had to be 18 years or older, have T1DM or T2DM, and had to have attended at least one medical visit at the study site in the 6 months prior to study entry. Pregnancy, gestational diabetes and other types of DM except T1DM or T2DM were excluded.
Each medical specialist (endocrinologists, cardiologists, or general practitioners) was responsible for recruiting about ten patients. To minimize patient selection bias, investigators were instructed to recruit patients in a retrospective consecutive manner starting from the patients that were last seen in the service according to medical charts. The initial sample of the study comprised 2,488 patients who entered the first wave of data collection (baseline visit). In the longitudinal phase, four subsequent follow-up visits were planned to occur every 6 months until the completion of the 2-year follow-up period. In this paper, we present the results obtained for the 1,266 participants who completed the final visit scheduled to occur after 2 years of follow-up and comprised the population of the longitudinal analysis.
Participating study centers were selected by the Associação Brasileira de Organizações Representativas de Pesquisa Clínica according to a proprietary database. A total of 250 sites/medical specialists of 40 Brazilian cities of the five country regions were chosen: 124 in the Southeast; 48 in the Northeast; 38 in the South Region; 30 in the Central-West Region; and 10 in the North Region.
The participant physicians collected data from patient medical charts covering the medical appointments that occurred from 07-Apr-2016, the date of study initiation, to 13-Dec-2019, the date of the final visit for the study.
The study was conducted after the approval by the ethics committee of the Universidade Federal de São Paulo (São Paulo, Brazil), and the study was conducted in accordance with the Declaration of Helsinki and the International Conference on Harmonization guidelines for Good Clinical Practice. Informed consent was obtained from all patients.
Data collection, variables and evaluation criteria
Data were collected from medical charts using an electronic CRF (e-CRF), and data management was performed according to the Data Validation Plan with data review processes in order to clarify data issues.
Variables of interest in the cross-sectional (baseline) phase were age, gender, ethnicity, educational level, body mass index (BMI), age at diagnosis, DM duration (time since diagnosis), abdominal circumference, blood pressure and laboratory results, risk factors for CVD, comorbidities, DM complications, glycemic control, medical specialties involved in patient care, and type of treatment. For the subsequent waves and longitudinal phase, collected data included glycemic control (HbA1c), weight, BMI, use of insulins and other medications, number of medications, and comorbidities and complications.
The achievement of individual HbA1c target (<7.0% or defined individual target) in patients with T1DM and T2DM at the study baseline (cross-sectional phase) and after 2 years of follow-up was the primary objective of the study and was described by the proportion of patients who reached the target in the overall study population and per DM type. The proportion was complemented by the respective 95% confidence interval (CI). Secondary objectives included the description of patients regarding their demographic and clinical characteristics, presence of comorbidities, complications, patterns of treatment and hospitalizations at baseline and during the follow-up period.
As this is a disease registry, non-interventional study, no data were collected beyond those required for routine clinical practice. However, Adverse Drug Reactions to any Sanofi product that occurred during the course of the study was to be reported to the Sponsor within 24 hours from the moment the investigator was notified about the case, in compliance with pharmacovigilance practice.
Statistical considerations and analysis
Statistical analysis was based on pooled data from all patients. Given the observational nature of the study, the statistical analysis was mainly descriptive, using appropriate summary statistics according to the type of variable. Descriptive statistics as number of non-missing data, range (minimum and maximum values), mean, standard deviation (SD), median and interquartile range (IQR) were calculated for summarizing numerical variables. Frequencies and proportions were calculated for summarizing categorical variables. There was no data imputation for missing/not available data in the calculations. The number of participants with available information for each variable are displayed in the tables, when considered relevant.
For the longitudinal phase, statistical analysis was based on pooled data of all patients who had available data at baseline and also at the end of follow-up, after 2 years. Descriptive analyses were performed according to the DM type and health care system (private and public sectors).
For the cross-sectional phase, nearly 2,500 patients were planned to be enrolled. Considering a planned sample size of 2,500 patients for the cross-sectional phase and assuming that T2DM comprise 90% of DM cases, the study expected to recruit about 2,250 patients with T2DM and 250 patients with T1DM. The sample size of 2,250 T2DM patients would ensure 95% CIs with a maximum width of 2.1% below and above point estimate. On the other hand, with a sample size of 250 T1DM patients, the maximum expected width was 6.2% below and above the point estimate.
Sample size calculation was performed based on published data from population studies conducted in Brazil that estimated the proportion of patients with HbA1c values within the target. Considering an expected proportion of 27% of patients with T2DM within the HbA1c target (3), the sample of 2,250 patients with T2DM would allow assessing this proportion with 95% CIs with a maximum width of 1.8% below and above the point estimate; and for an expected proportion of 10% of T1DM patients within the HbA1c target (14), the sample of 250 patients with T1DM would allow assessing this proportion with 95% CIs with a maximum width of 3.7% below and above the point estimate.
Results
Baseline characteristics and comorbidities of the subset of patients who entered the longitudinal analysis were similar to those of the patients comprising the total study sample (data not shown). The baseline sample comprised 91.9% of patients with T2DM, the mean age was 63 years, and 52.2% were from the Southeast Region, while the sample at the end of the follow-up period had 91.8% of patients with T2DM, mean age of 62 years, and 51.8% from the Southeast Region.
Patient characteristics
The study sample for the longitudinal analysis comprised a total of 1,266 patients of the BINDER study who had completed the 2 years of follow-up with data collection in all five waves. As shown in Table 1, 56.7% of patients were male, 74.7% were Caucasian, and 33.5% had a college or higher degree of education. One hundred and four patients had T1DM (8.2%), and 1162 (91.8%) had T2DM. At the time of the initial study visit, the mean age of T1DM and T2DM patients were 35.0 and 63.7 years, respectively; patients aged 18 to 30 years comprised 38.5% of the T1DM group and 0.5% of the T2DM patients. T1DM patients were under treatment for a longer time (mean treatment duration: 15.8years for T1DM vs 9.8 years for T2DM), although the mean time since DM diagnosis was similar between T1DM (16.5 years) and T2DM (17.8 years). Of the assessed patients, 48% of those with T1DM and 73.2% of those with T2DM were followed in the private health sector (Table 1). A family history of DM was reported by 12.2% and 25.3% of patients with T1DM and T2DM, respectively.
Table 1"""

In [4]:
def show_medical_abbreviation(model, document):
    """
    Detect medical abbreviations in a document and resolve them to their long forms.

    Parameters:
        model (module): A pre-trained biomedical model package from scispaCy
        document (str): Document to be processed

    Returns:
        list: Unique strings of the form "<abbreviation> <long form>"
    """

    nlp = model.load()
    nlp.add_pipe('abbreviation_detector')
    doc = nlp(document)

    # Building a set first ensures only unique values are returned
    abbreviated = list({f"{abrv} {abrv._.long_form}" for abrv in doc._.abbreviations})
    
    return abbreviated
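
scispaCy's `AbbreviationDetector` implements the Schwartz and Hearst abbreviation algorithm. The sketch below is a simplified, dependency-free approximation of that idea and not the library's code: it looks for a short form in parentheses and scans the preceding words right to left for a matching long form. The function names, the regular expression, and the sample sentence are assumptions for illustration.

```python
import re


def best_long_form(short, candidate):
    """Match the characters of `short` right to left inside `candidate`.

    Returns the matched long form, or None if the characters cannot be aligned.
    """
    l_idx = len(candidate) - 1
    for s_idx in range(len(short) - 1, -1, -1):
        ch = short[s_idx].lower()
        if not ch.isalnum():
            continue
        # Move left until the character matches; the first character of the
        # short form must also sit at the start of a word in the long form.
        while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
            s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def find_abbreviations(text):
    """Return {short form: long form} for 'long form (SHORT)' patterns."""
    pairs = {}
    for match in re.finditer(r"\(([A-Za-z][\w-]{1,9})\)", text):
        short = match.group(1)
        words = text[:match.start()].split()
        # Search window used by the original algorithm: min(|A| + 5, 2 * |A|) words
        window = min(len(short) + 5, len(short) * 2)
        long_form = best_long_form(short, " ".join(words[-window:]))
        if long_form:
            pairs[short] = long_form
    return pairs


sample = "body mass index (BMI), standard deviation (SD) and interquartile range (IQR)"
print(find_abbreviations(sample))
```

This version only handles the "long form (short form)" order, which may explain why a document with no parenthesized definitions yields an empty list.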

In [5]:
show_medical_abbreviation(en_ner_bionlp13cg_md, case_report1)

# No abbreviations were detected 


[]

In [6]:
show_medical_abbreviation(en_ner_bc5cdr_md, case_report1)

[]

In [7]:
show_medical_abbreviation(en_ner_bionlp13cg_md, case_report2)
# The output shows the abbreviations the model detected, along with a few spurious pairs such as 'target target'

['Paulo Paulo, Brazil',
 'IQR interquartile range',
 'NCDs non-communicable diseases',
 'T2DM type-2 DM',
 'Brazil Brazil, 1990–2016',
 'DALY disability-adjusted life-years',
 'SD standard deviation',
 'CVD cardiovascular disease',
 'e-CRF electronic CRF',
 'CI confidence interval',
 'target target',
 'BMI body mass index',
 'BINDER BrazIliaN Type 1 & 2 DiabetEs Disease Registry']

In [8]:
show_medical_abbreviation(en_ner_bc5cdr_md, case_report2)

['Paulo Paulo, Brazil',
 'IQR interquartile range',
 'NCDs non-communicable diseases',
 'T2DM type-2 DM',
 'Brazil Brazil, 1990–2016',
 'DALY disability-adjusted life-years',
 'SD standard deviation',
 'CVD cardiovascular disease',
 'e-CRF electronic CRF',
 'CI confidence interval',
 'target target',
 'BMI body mass index',
 'BINDER BrazIliaN Type 1 & 2 DiabetEs Disease Registry']
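
Both outputs contain pairs such as `'Paulo Paulo, Brazil'` and `'target target'`, where the "long form" merely repeats the short form. One possible post-processing step is sketched below. It assumes the pairs are collected as `(short, long_form)` tuples rather than formatted strings, and the `is_plausible` heuristic is my own illustration, not part of scispaCy.

```python
def is_plausible(short, long_form):
    """Reject pairs whose long form starts with the short form itself."""
    words = long_form.split()
    if not words:
        return False
    first_word = words[0].strip(",.;:").casefold()
    return first_word != short.casefold()


# A few pairs taken from the output above, split into tuples
pairs = [
    ("Paulo", "Paulo, Brazil"),
    ("IQR", "interquartile range"),
    ("T2DM", "type-2 DM"),
    ("Brazil", "Brazil, 1990–2016"),
    ("target", "target"),
    ("BMI", "body mass index"),
]

kept = [(short, long_form) for short, long_form in pairs if is_plausible(short, long_form)]
print(kept)
```

The heuristic is deliberately narrow, so it would not catch every false positive, and it should be checked against real data before being relied on.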

I will now resolve clinical entities by linking them to established knowledge bases.

I will use the scispaCy entity linker (`EntityLinker`, added to a pipeline as `scispacy_linker`) to connect to the MeSH, HPO, RxNorm, GO, and UMLS biomedical knowledge bases.
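
The scispaCy linker generates candidates through a character 3-gram string-overlap search over knowledge base names, using an approximate nearest-neighbour index. The sketch below illustrates the underlying idea only: it uses raw trigram counts and an exact cosine comparison instead of TF-IDF weights and an approximate index, and the `TOY_ALIASES` IDs are invented.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of the lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical concept IDs mapped to one canonical name each
TOY_ALIASES = {
    "X1": "Lymphocytes",
    "X2": "Lymphocytosis",
    "X3": "Glutathione",
}


def rank_candidates(mention, aliases=TOY_ALIASES, k=2):
    """Return the k most similar (concept_id, score) pairs, best first."""
    mention_vec = char_ngrams(mention)
    scored = [
        (concept_id, cosine(mention_vec, char_ngrams(name)))
        for concept_id, name in aliases.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


print(rank_candidates("lymphocyte"))
```

Because the comparison is purely string based, a mention is linked to names that look similar, which is not always the same as names that mean the same thing.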

In [9]:
entities_df = pd.read_csv('data/bionlp_entities.csv')

In [10]:
entities_df

| Unnamed: 0 | Entity | Label | Ner_model |
|---|---|---|---|
| 0 | patients | ORGANISM | bionlp13cg |
| 1 | loperamide hydrochloride | SIMPLE_CHEMICAL | bionlp13cg |
| 2 | sodium chloride | SIMPLE_CHEMICAL | bionlp13cg |
| 3 | gut-liver | CELLULAR_COMPONENT | bionlp13cg |
| 4 | lymphocytes | CELL | bionlp13cg |
| ... | ... | ... | ... |
| 96 | electrolytes | CELLULAR_COMPONENT | bionlp13cg |
| 97 | lymphocyte | CELL | bionlp13cg |
| 98 | C-reactive protein | GENE_OR_GENE_PRODUCT | bionlp13cg |
| 99 | glutathione | SIMPLE_CHEMICAL | bionlp13cg |


In [11]:
# For ease of use, the entity resolution tasks are applied to the Entity column of the dataframe

# To link to the MeSH database I load a general-purpose pipeline named mesh_nlp
mesh_nlp = spacy.load('en_core_sci_md')

# MeSH contains 30k entities
mesh_nlp.add_pipe('scispacy_linker', config={'resolve_abbreviations': True, 'linker_name': 'mesh'})
# A knowledge-base-specific name keeps later cells from overwriting this linker
mesh_linker = mesh_nlp.get_pipe('scispacy_linker')

def mesh_entity_linker(document):
    """Link the first entity detected in `document` to its MeSH candidates."""
    doc = mesh_nlp(document)
    # No entity detected: return only the 'NaN' placeholder
    if not doc.ents:
        return ['NaN']
    entity = doc.ents[0]
    entity_details = [entity]
    # Each candidate in kb_ents is a (concept_id, score) pair
    for concept_id, score in entity._.kb_ents:
        entity_details.append('Entity_Matching_Score : {}'.format(score))
        entity_details.append(mesh_linker.kb.cui_to_entity[concept_id])
    return entity_details



In [12]:
# Apply the MeSH entity linker to the Entity column in entities_df
# A result that contains only the entity, such as [(gut-liver)], means MeSH returned no candidate for it;
# the placeholder string is returned only when no entity is detected in the text at all
entities_df['mesh_output'] = entities_df['Entity'].apply(mesh_entity_linker)

In [13]:
# I can see the new column with entities resolved and their confidence scores.
entities_df

| Unnamed: 0 | Entity | Label | Ner_model | mesh_output |
|---|---|---|---|---|
| 0 | patients | ORGANISM | bionlp13cg | [(patients), Entity_Matching_Score : 0.9999998... |
| 1 | loperamide hydrochloride | SIMPLE_CHEMICAL | bionlp13cg | [(loperamide, hydrochloride), Entity_Matching_... |
| 2 | sodium chloride | SIMPLE_CHEMICAL | bionlp13cg | [(sodium, chloride), Entity_Matching_Score : 0... |
| 3 | gut-liver | CELLULAR_COMPONENT | bionlp13cg | [(gut-liver)] |
| 4 | lymphocytes | CELL | bionlp13cg | [(lymphocytes), Entity_Matching_Score : 1.0, (... |
| ... | ... | ... | ... | ... |
| 96 | electrolytes | CELLULAR_COMPONENT | bionlp13cg | [(electrolytes), Entity_Matching_Score : 1.0, ... |
| 97 | lymphocyte | CELL | bionlp13cg | [(lymphocyte), Entity_Matching_Score : 0.90758... |
| 98 | C-reactive protein | GENE_OR_GENE_PRODUCT | bionlp13cg | [(C-reactive, protein), Entity_Matching_Score ... |
| 99 | glutathione | SIMPLE_CHEMICAL | bionlp13cg | [(glutathione), Entity_Matching_Score : 1.0, (... |


In [14]:
entities_df['mesh_output'][0]

[patients,
 'Entity_Matching_Score : 0.9999998807907104',
 CUI: D010361, Name: Patients
 Definition: Individuals participating in the health care system for the purpose of receiving therapeutic, diagnostic, or preventive procedures.
 TUI(s): 
 Aliases: (total: 2): 
 	 Patients, Clients,
 'Entity_Matching_Score : 0.8265573382377625',
 CUI: D028642, Name: Mentally Ill Persons
 Definition: Persons with psychiatric illnesses or diseases, particularly psychotic and severe mood disorders.
 TUI(s): 
 Aliases: (total: 3): 
 	 Mentally Ill Persons, Mentally Ill, Mental Patients,
 'Entity_Matching_Score : 0.7403873801231384',
 CUI: D007297, Name: Inpatients
 Definition: Persons admitted to health facilities which provide board and room, for the purpose of observation, care, diagnosis or treatment.
 TUI(s): 
 Aliases: (total: 1): 
 	 Inpatients]
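
The output above mixes the entity, score strings, and knowledge base entries in one flat list, which is awkward to filter or sort. A possible alternative is to turn each `(concept_id, score)` candidate into a small record, as sketched here. The `to_records` helper and the `min_score` cut-off are assumptions for illustration; the IDs, names, and scores are copied from the output above.

```python
def to_records(mention, candidates, names, min_score=0.0):
    """Convert (concept_id, score) pairs into dictionaries, dropping weak matches."""
    records = []
    for concept_id, score in candidates:
        if score < min_score:
            continue
        records.append(
            {
                "mention": mention,
                "concept_id": concept_id,
                "name": names.get(concept_id),
                "score": round(score, 4),
            }
        )
    return records


# Candidates and names transcribed from the MeSH output for "patients"
candidates = [
    ("D010361", 0.9999998807907104),
    ("D028642", 0.8265573382377625),
    ("D007297", 0.7403873801231384),
]
names = {
    "D010361": "Patients",
    "D028642": "Mentally Ill Persons",
    "D007297": "Inpatients",
}

for record in to_records("patients", candidates, names, min_score=0.8):
    print(record)
```

Records of this shape can be passed to `pd.DataFrame` directly, giving one row per candidate instead of one nested list per entity.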

In [15]:
# I will repeat the above steps for the HPO knowledge base
hpo_nlp = spacy.load('en_core_sci_md')

# HPO contains 16k concepts focused on phenotypic abnormalities encountered in human diseases
hpo_nlp.add_pipe('scispacy_linker', config={'resolve_abbreviations': True, 'linker_name': 'hpo'})
hpo_linker = hpo_nlp.get_pipe('scispacy_linker')

def hpo_entity_linker(document):
    """Link the first entity detected in `document` to its HPO candidates."""
    doc = hpo_nlp(document)
    # No entity detected: return only the 'NaN' placeholder
    if not doc.ents:
        return ['NaN']
    entity = doc.ents[0]
    entity_details = [entity]
    # Each candidate in kb_ents is a (concept_id, score) pair
    for concept_id, score in entity._.kb_ents:
        entity_details.append('Entity_Matching_Score : {}'.format(score))
        entity_details.append(hpo_linker.kb.cui_to_entity[concept_id])
    return entity_details
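
Since the MeSH and HPO cells differ only in the knowledge base they load, the shared logic could be written once. The sketch below is one possible refactoring, not the notebook's method: `make_entity_linker` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic only the attributes used here (`doc.ents`, `entity._.kb_ents`, `linker.kb.cui_to_entity`, `canonical_name`) so the example runs without downloading a model.

```python
from types import SimpleNamespace


def make_entity_linker(nlp, linker):
    """Build a linking function bound to one pipeline and its own linker."""

    def link(text):
        doc = nlp(text)
        if not doc.ents:
            return None, []
        entity = doc.ents[0]
        matches = [
            (concept_id, score, linker.kb.cui_to_entity[concept_id].canonical_name)
            for concept_id, score in entity._.kb_ents
        ]
        return entity, matches

    return link


# --- Stand-ins for a real pipeline and linker (illustration only) ---
def _fake_nlp(text):
    ent = SimpleNamespace(
        text=text,
        _=SimpleNamespace(kb_ents=[("D010361", 0.9999998807907104)]),
    )
    return SimpleNamespace(ents=[ent] if text else [])


_fake_linker = SimpleNamespace(
    kb=SimpleNamespace(
        cui_to_entity={"D010361": SimpleNamespace(canonical_name="Patients")}
    )
)

link = make_entity_linker(_fake_nlp, _fake_linker)
entity, matches = link("patients")
print(entity.text, matches)
```

With the real objects, a call such as `make_entity_linker(hpo_nlp, hpo_nlp.get_pipe('scispacy_linker'))` should give each knowledge base its own function, with its linker captured in the closure rather than read from a shared global name.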



In [16]:
entities_df['hpo_output'] = entities_df['Entity'].apply(hpo_entity_linker)

In [17]:
# Looking at the dataframe, only entities that may relate to human phenotypes,
# such as lymphocytes and electrolytes, have an output with a score returned
entities_df

| Unnamed: 0 | Entity | Label | Ner_model | mesh_output | hpo_output |
|---|---|---|---|---|---|
| 0 | patients | ORGANISM | bionlp13cg | [(patients), Entity_Matching_Score : 0.9999998... | [(patients)] |
| 1 | loperamide hydrochloride | SIMPLE_CHEMICAL | bionlp13cg | [(loperamide, hydrochloride), Entity_Matching_... | [(loperamide, hydrochloride)] |
| 2 | sodium chloride | SIMPLE_CHEMICAL | bionlp13cg | [(sodium, chloride), Entity_Matching_Score : 0... | [(sodium, chloride)] |
| 3 | gut-liver | CELLULAR_COMPONENT | bionlp13cg | [(gut-liver)] | [(gut-liver)] |
| 4 | lymphocytes | CELL | bionlp13cg | [(lymphocytes), Entity_Matching_Score : 1.0, (... | [(lymphocytes), Entity_Matching_Score : 0.9143... |
| ... | ... | ... | ... | ... | ... |
| 96 | electrolytes | CELLULAR_COMPONENT | bionlp13cg | [(electrolytes), Entity_Matching_Score : 1.0, ... | [(electrolytes), Entity_Matching_Score : 0.712... |
| 97 | lymphocyte | CELL | bionlp13cg | [(lymphocyte), Entity_Matching_Score : 0.90758... | [(lymphocyte), Entity_Matching_Score : 0.83922... |
| 98 | C-reactive protein | GENE_OR_GENE_PRODUCT | bionlp13cg | [(C-reactive, protein), Entity_Matching_Score ... | [(C-reactive, protein), Entity_Matching_Score ... |
| 99 | glutathione | SIMPLE_CHEMICAL | bionlp13cg | [(glutathione), Entity_Matching_Score : 1.0, (... | [(glutathione)] |


In [18]:
entities_df['hpo_output'][4]

[lymphocytes,
 'Entity_Matching_Score : 0.9143446683883667',
 CUI: C0221277, Name: Abnormal lymphocyte morphology
 Definition: A lymphocyte that may be irregular or not conforming to type.
 TUI(s): T033
 Aliases: (total: 2): 
 	 Abnormality of cells of the lymphoid lineage, Abnormal lymphocytes,
 'Entity_Matching_Score : 0.7550817728042603',
 CUI: C0580550, Name: Abnormal number of lymphocytes
 Definition: Any abnormality in the total number of lymphocytes in the blood. []
 TUI(s): T033
 Aliases: (total: 4): 
 	 Abnormal numbers of lymphocytes, Abnormal lymphocyte count, Abnormal lymphocyte counts, Abnormality of lymphocyte number,
 'Entity_Matching_Score : 0.7446752190589905',
 CUI: C0024282, Name: Lymphocytosis
 Definition: Excess of normal lymphocytes in the blood or in any effusion.
 TUI(s): T047
 Aliases: (total: 1): 
 	 High lymphocyte count,
 'Entity_Matching_Score : 0.7003375291824341',
 CUI: C1836855, Name: Vacuolated blood lymphocytes
 Definition: The presence of clear, sharp

In [19]:
# I will now apply the same approach to the RxNorm knowledge base
rxnorm_nlp = spacy.load('en_core_sci_md')

# RxNorm contains 100k concepts focused on normalized names for clinical drugs
rxnorm_nlp.add_pipe('scispacy_linker', config={'resolve_abbreviations': True, 'linker_name': 'rxnorm'})
rxnorm_linker = rxnorm_nlp.get_pipe('scispacy_linker')

def rxnorm_entity_linker(document):
    """Link the first entity detected in `document` to its RxNorm candidates."""
    doc = rxnorm_nlp(document)
    # No entity detected: return only the 'NaN' placeholder
    if not doc.ents:
        return ['NaN']
    entity = doc.ents[0]
    entity_details = [entity]
    # Each candidate in kb_ents is a (concept_id, score) pair
    for concept_id, score in entity._.kb_ents:
        entity_details.append('Entity_Matching_Score : {}'.format(score))
        entity_details.append(rxnorm_linker.kb.cui_to_entity[concept_id])
    return entity_details



In [20]:
entities_df['rxnorm_output'] = entities_df['Entity'].apply(rxnorm_entity_linker)

In [21]:
entities_df

| Unnamed: 0 | Entity | Label | Ner_model | mesh_output | hpo_output | rxnorm_output |
|---|---|---|---|---|---|---|
| 0 | patients | ORGANISM | bionlp13cg | [(patients), Entity_Matching_Score : 0.9999998... | [(patients)] | [(patients)] |
| 1 | loperamide hydrochloride | SIMPLE_CHEMICAL | bionlp13cg | [(loperamide, hydrochloride), Entity_Matching_... | [(loperamide, hydrochloride)] | [(loperamide, hydrochloride), Entity_Matching_... |
| 2 | sodium chloride | SIMPLE_CHEMICAL | bionlp13cg | [(sodium, chloride), Entity_Matching_Score : 0... | [(sodium, chloride)] | [(sodium, chloride), Entity_Matching_Score : 0... |
| 3 | gut-liver | CELLULAR_COMPONENT | bionlp13cg | [(gut-liver)] | [(gut-liver)] | [(gut-liver)] |
| 4 | lymphocytes | CELL | bionlp13cg | [(lymphocytes), Entity_Matching_Score : 1.0, (... | [(lymphocytes), Entity_Matching_Score : 0.9143... | [(lymphocytes)] |
| ... | ... | ... | ... | ... | ... | ... |
| 96 | electrolytes | CELLULAR_COMPONENT | bionlp13cg | [(electrolytes), Entity_Matching_Score : 1.0, ... | [(electrolytes), Entity_Matching_Score : 0.712... | [(electrolytes)] |
| 97 | lymphocyte | CELL | bionlp13cg | [(lymphocyte), Entity_Matching_Score : 0.90758... | [(lymphocyte), Entity_Matching_Score : 0.83922... | [(lymphocyte), Entity_Matching_Score : 0.70314... |
| 98 | C-reactive protein | GENE_OR_GENE_PRODUCT | bionlp13cg | [(C-reactive, protein), Entity_Matching_Score ... | [(C-reactive, protein), Entity_Matching_Score ... | [(C-reactive, protein)] |
| 99 | glutathione | SIMPLE_CHEMICAL | bionlp13cg | [(glutathione), Entity_Matching_Score : 1.0, (... | [(glutathione)] | [(glutathione), Entity_Matching_Score : 1.0, (... |
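
One way to compare the three knowledge bases is to measure coverage: the share of entities for which a linker returned at least one candidate. In this notebook's output format that corresponds to a list with more than one element. The sketch below is an illustration that uses only the nine rows visible in the table above, so the resulting numbers describe that excerpt and not the full 100-row dataframe; the `coverage` helper is hypothetical.

```python
def coverage(outputs):
    """Fraction of linker outputs that contain at least one candidate."""
    if not outputs:
        return 0.0
    linked = sum(1 for out in outputs if len(out) > 1)
    return linked / len(outputs)


def as_output(has_candidate):
    """Mimic the notebook's list format: entity only, or entity plus a match."""
    return ["entity", "score", "concept"] if has_candidate else ["entity"]


# True where the visible row shows a score, False where only the entity appears
# (rows 0-4 and 96-99 of the table above, in order)
visible = {
    "mesh_output": [True, True, True, False, True, True, True, True, True],
    "hpo_output": [False, False, False, False, True, True, True, True, False],
    "rxnorm_output": [False, True, True, False, False, False, True, False, True],
}

for column, flags in visible.items():
    outputs = [as_output(flag) for flag in flags]
    print(f"{column}: {coverage(outputs):.2f}")
```

On the real dataframe, an expression along the lines of `entities_df['mesh_output'].apply(len).gt(1).mean()` should compute the same quantity for a whole column.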


In [22]:
entities_df['rxnorm_output'][99]

[glutathione,
 'Entity_Matching_Score : 1.0',
 CUI: C0017817, Name: L-Glutathione
 Definition: A tripeptide with many roles in cells. It conjugates to drugs to make them more soluble for excretion, is a cofactor for some enzymes, is involved in protein disulfide bond rearrangement and reduces peroxides.
 TUI(s): T116, T121, T123
 Aliases: (total: 1): 
 	 Glutathione]

In [23]:
# I will now apply the same approach to the GO (Gene Ontology) knowledge base
go_nlp = spacy.load('en_core_sci_md')

# GO contains 67k concepts focused on the functions of genes
go_nlp.add_pipe('scispacy_linker', config={'resolve_abbreviations': True, 'linker_name': 'go'})
go_linker = go_nlp.get_pipe('scispacy_linker')

def go_entity_linker(document):
    """Link the first entity detected in `document` to its GO candidates."""
    doc = go_nlp(document)
    # No entity detected: return only the 'NaN' placeholder
    if not doc.ents:
        return ['NaN']
    entity = doc.ents[0]
    entity_details = [entity]
    # Each candidate in kb_ents is a (concept_id, score) pair
    for concept_id, score in entity._.kb_ents:
        entity_details.append('Entity_Matching_Score : {}'.format(score))
        entity_details.append(go_linker.kb.cui_to_entity[concept_id])
    return entity_details



In [24]:
entities_df['go_output'] = entities_df['Entity'].apply(go_entity_linker)

In [25]:
entities_df

Unnamed: 0,Entity,Label,Ner_model,mesh_output,hpo_output,rxnorm_output,go_output
0,patients,ORGANISM,bionlp13cg,"[(patients), Entity_Matching_Score : 0.9999998...",[(patients)],[(patients)],[(patients)]
1,loperamide hydrochloride,SIMPLE_CHEMICAL,bionlp13cg,"[(loperamide, hydrochloride), Entity_Matching_...","[(loperamide, hydrochloride)]","[(loperamide, hydrochloride), Entity_Matching_...","[(loperamide, hydrochloride)]"
2,sodium chloride,SIMPLE_CHEMICAL,bionlp13cg,"[(sodium, chloride), Entity_Matching_Score : 0...","[(sodium, chloride)]","[(sodium, chloride), Entity_Matching_Score : 0...","[(sodium, chloride)]"
3,gut-liver,CELLULAR_COMPONENT,bionlp13cg,[(gut-liver)],[(gut-liver)],[(gut-liver)],[(gut-liver)]
4,lymphocytes,CELL,bionlp13cg,"[(lymphocytes), Entity_Matching_Score : 1.0, (...","[(lymphocytes), Entity_Matching_Score : 0.9143...",[(lymphocytes)],"[(lymphocytes), Entity_Matching_Score : 0.7786..."
...,...,...,...,...,...,...,...
96,electrolytes,CELLULAR_COMPONENT,bionlp13cg,"[(electrolytes), Entity_Matching_Score : 1.0, ...","[(electrolytes), Entity_Matching_Score : 0.712...",[(electrolytes)],[(electrolytes)]
97,lymphocyte,CELL,bionlp13cg,"[(lymphocyte), Entity_Matching_Score : 0.90758...","[(lymphocyte), Entity_Matching_Score : 0.83922...","[(lymphocyte), Entity_Matching_Score : 0.70314...","[(lymphocyte), Entity_Matching_Score : 0.89669..."
98,C-reactive protein,GENE_OR_GENE_PRODUCT,bionlp13cg,"[(C-reactive, protein), Entity_Matching_Score ...","[(C-reactive, protein), Entity_Matching_Score ...","[(C-reactive, protein)]","[(C-reactive, protein)]"
99,glutathione,SIMPLE_CHEMICAL,bionlp13cg,"[(glutathione), Entity_Matching_Score : 1.0, (...",[(glutathione)],"[(glutathione), Entity_Matching_Score : 1.0, (...","[(glutathione), Entity_Matching_Score : 0.8375..."


In [26]:
# The GO linker was able to resolve gene-related entities, for example row 4 (lymphocytes)
entities_df['go_output'][4]

[lymphocytes,
 'Entity_Matching_Score : 0.7786888480186462',
 CUI: C1326202, Name: B cell apoptotic process
 Definition: Any apoptotic process in a B cell, a lymphocyte of B lineage with the phenotype CD19-positive and capable of B cell mediated immunity. [CL:0000236, GOC:add, GOC:mtg_apoptosis, ISBN:0781735149]
 TUI(s): T043
 Aliases (abbreviated, total: 20): 
 	 B-cell apoptosis, programmed cell death of B-lymphocytes by apoptosis, programmed cell death, B-cells, apoptosis of B-lymphocytes, B-lymphocyte programmed cell death by apoptosis, B cell programmed cell death by apoptosis, programmed cell death, B lymphocytes, apoptosis of B-cells, apoptosis of B cells, programmed cell death, B cells,
 'Entity_Matching_Score : 0.7447288632392883',
 CUI: C0024262, Name: lymphocyte activation
 Definition: Morphologic alteration of small B LYMPHOCYTES or T LYMPHOCYTES in culture into large blast-like cells able to synthesize DNA and RNA and to divide mitotically. It is induced by INTERLEUKINS; M
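
Each cell in the `*_output` columns is a flat list: the entity first, then alternating score strings and knowledge-base entries. That layout is awkward to filter or sort. The following is a minimal sketch of regrouping such a list into `(score, concept)` pairs. The helper name `parse_entity_details` is hypothetical, and the sample uses plain CUI strings as stand-ins for the knowledge-base entry objects.

```python
SCORE_PREFIX = 'Entity_Matching_Score : '

def parse_entity_details(entity_details):
    """Regroup [entity, score string, concept, ...] into (entity, [(score, concept), ...])."""
    entity, rest = entity_details[0], entity_details[1:]
    matches = []
    # After the entity, items alternate: score string, then knowledge-base entry.
    for score_text, concept in zip(rest[0::2], rest[1::2]):
        matches.append((float(score_text[len(SCORE_PREFIX):]), concept))
    # Best match first; sort on the score only so concepts need not be comparable.
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return entity, matches

# Stand-in for entities_df['go_output'][4], with CUI strings replacing the entry objects.
sample = ['lymphocytes',
          'Entity_Matching_Score : 0.7786888480186462', 'C1326202',
          'Entity_Matching_Score : 0.7447288632392883', 'C0024262']
entity, matches = parse_entity_details(sample)
best_score, best_concept = matches[0]
```

An entity with no match, such as `['gut-liver']`, comes back with an empty match list, so unresolved rows can be filtered with a simple truthiness check.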

In [27]:
# Finally, link the same entities against the UMLS (Unified Medical Language System) knowledge base.
umls_nlp = spacy.load('en_core_sci_md')

# UMLS levels 0, 1, 2 and 9; this knowledge base has about 3M concepts.
umls_nlp.add_pipe('scispacy_linker', config={'resolve_abbreviations': True, 'linker_name': 'umls'})
# Use a dedicated name so this linker does not shadow the linkers of the other knowledge bases.
umls_linker = umls_nlp.get_pipe('scispacy_linker')

def umls_entity_linker(document):
    """Return [entity, score string, KB entry, ...] for the first entity in `document`."""
    doc = umls_nlp(document)
    if not doc.ents:
        # No entity recognised: return the 'NaN' placeholder on its own.
        return ['NaN']
    entity = doc.ents[0]
    entity_details = [entity]
    # kb_ents holds (concept id, similarity score) candidates for the entity.
    for concept_id, score in entity._.kb_ents:
        entity_details.append('Entity_Matching_Score : {}'.format(score))
        entity_details.append(umls_linker.kb.cui_to_entity[concept_id])
    return entity_details


In [28]:
entities_df['umls_output'] = entities_df['Entity'].apply(umls_entity_linker)
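
The GO and UMLS linker functions share the same body and differ only in which pipeline and linker they read. One possible refactor, sketched here under the assumption that every knowledge base is set up the same way, is a small factory that binds the pipeline and its linker explicitly. The name `make_entity_linker` is hypothetical and not part of scispacy.

```python
def make_entity_linker(nlp, linker):
    """Return a function that links the first entity found in a text.

    `nlp` is a loaded spaCy pipeline and `linker` is its 'scispacy_linker' pipe.
    Binding both here avoids relying on a shared global `linker` variable.
    """
    def link(text):
        doc = nlp(text)
        if not doc.ents:
            # Same placeholder the notebook uses when nothing is recognised.
            return ['NaN']
        entity = doc.ents[0]
        details = [entity]
        for concept_id, score in entity._.kb_ents:
            details.append('Entity_Matching_Score : {}'.format(score))
            details.append(linker.kb.cui_to_entity[concept_id])
        return details
    return link

# Usage sketch (assumes the pipelines from the cells above are already loaded):
# go_entity_linker = make_entity_linker(go_nlp, go_nlp.get_pipe('scispacy_linker'))
# umls_entity_linker = make_entity_linker(umls_nlp, umls_nlp.get_pipe('scispacy_linker'))
```

Because each returned function keeps its own `nlp` and `linker`, re-running cells in a different order cannot silently point one knowledge base's function at another's linker.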

In [29]:
entities_df

Unnamed: 0,Entity,Label,Ner_model,mesh_output,hpo_output,rxnorm_output,go_output,umls_output
0,patients,ORGANISM,bionlp13cg,"[(patients), Entity_Matching_Score : 0.9999998...",[(patients)],[(patients)],[(patients)],"[(patients), Entity_Matching_Score : 1.0, (C00..."
1,loperamide hydrochloride,SIMPLE_CHEMICAL,bionlp13cg,"[(loperamide, hydrochloride), Entity_Matching_...","[(loperamide, hydrochloride)]","[(loperamide, hydrochloride), Entity_Matching_...","[(loperamide, hydrochloride)]","[(loperamide, hydrochloride), Entity_Matching_..."
2,sodium chloride,SIMPLE_CHEMICAL,bionlp13cg,"[(sodium, chloride), Entity_Matching_Score : 0...","[(sodium, chloride)]","[(sodium, chloride), Entity_Matching_Score : 0...","[(sodium, chloride)]","[(sodium, chloride), Entity_Matching_Score : 1..."
3,gut-liver,CELLULAR_COMPONENT,bionlp13cg,[(gut-liver)],[(gut-liver)],[(gut-liver)],[(gut-liver)],[(gut-liver)]
4,lymphocytes,CELL,bionlp13cg,"[(lymphocytes), Entity_Matching_Score : 1.0, (...","[(lymphocytes), Entity_Matching_Score : 0.9143...",[(lymphocytes)],"[(lymphocytes), Entity_Matching_Score : 0.7786...","[(lymphocytes), Entity_Matching_Score : 1.0, (..."
...,...,...,...,...,...,...,...,...
96,electrolytes,CELLULAR_COMPONENT,bionlp13cg,"[(electrolytes), Entity_Matching_Score : 1.0, ...","[(electrolytes), Entity_Matching_Score : 0.712...",[(electrolytes)],[(electrolytes)],"[(electrolytes), Entity_Matching_Score : 1.0, ..."
97,lymphocyte,CELL,bionlp13cg,"[(lymphocyte), Entity_Matching_Score : 0.90758...","[(lymphocyte), Entity_Matching_Score : 0.83922...","[(lymphocyte), Entity_Matching_Score : 0.70314...","[(lymphocyte), Entity_Matching_Score : 0.89669...","[(lymphocyte), Entity_Matching_Score : 1.0, (C..."
98,C-reactive protein,GENE_OR_GENE_PRODUCT,bionlp13cg,"[(C-reactive, protein), Entity_Matching_Score ...","[(C-reactive, protein), Entity_Matching_Score ...","[(C-reactive, protein)]","[(C-reactive, protein)]","[(C-reactive, protein), Entity_Matching_Score ..."
99,glutathione,SIMPLE_CHEMICAL,bionlp13cg,"[(glutathione), Entity_Matching_Score : 1.0, (...",[(glutathione)],"[(glutathione), Entity_Matching_Score : 1.0, (...","[(glutathione), Entity_Matching_Score : 0.8375...","[(glutathione), Entity_Matching_Score : 0.9999..."


In [32]:
entities_df['umls_output'][100]

[C-reactive,
 'Entity_Matching_Score : 0.8884108662605286',
 CUI: C0006560, Name: C-reactive protein
 Definition: A plasma protein that circulates in increased amounts during inflammation and after tissue damage.
 TUI(s): T116, T129
 Aliases (abbreviated, total: 12): 
 	 Protein, C-Reactive, c reactive protein, CRP, C reactive protein (substance), C Reactive Protein, CRP - C-reactive protein, c reactive proteins, Proteins, specific or class, C-reactive, c-reactive protein (CRP), C-reactive protein,
 'Entity_Matching_Score : 0.8884108662605286',
 CUI: C1413716, Name: CRP gene
 Definition: This gene plays a role in immune and inflammatory processes.
 TUI(s): T028
 Aliases: (total: 9): 
 	 PENTRAXIN 1, SHORT, CRP gene, CRP, C-REACTIVE PROTEIN, PENTRAXIN-RELATED, C-Reactive Protein, Pentraxin-Related Gene, pentraxin 1, C-reactive protein, PTX1, CRP Gene,
 'Entity_Matching_Score : 0.8884108662605286',
 CUI: C4048285, Name: C-Reactive Protein, human
 Definition: C-reactive protein (224 aa, ~

Comparing the output of these linkers side by side helps me decide which knowledge base is the most appropriate for my specific use case.
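
One rough way to make that comparison concrete is to measure, per knowledge base, the share of entities that received at least one match (an output list longer than one item). This is a minimal sketch, not a full evaluation: the helper names are hypothetical, and the toy rows only mirror which cells in the table above were linked, with `'score'` and `'concept'` as placeholders.

```python
def linked_fraction(outputs):
    """Share of entities whose output list contains at least one match."""
    if not outputs:
        return 0.0
    linked = sum(1 for details in outputs if len(details) > 1)
    return linked / len(outputs)

def rank_knowledge_bases(columns):
    """Sort knowledge-base columns by coverage, highest first."""
    coverage = {name: linked_fraction(outputs) for name, outputs in columns.items()}
    return sorted(coverage.items(), key=lambda item: item[1], reverse=True)

# Toy data: 'score' and 'concept' are placeholders, not real linker output.
toy_columns = {
    'mesh_output': [['patients', 'score', 'concept'], ['gut-liver'],
                    ['lymphocytes', 'score', 'concept']],
    'rxnorm_output': [['patients'], ['gut-liver'], ['lymphocytes']],
    'go_output': [['patients'], ['gut-liver'],
                  ['lymphocytes', 'score', 'concept']],
}
ranking = rank_knowledge_bases(toy_columns)

# With the real DataFrame, the columns could be collected like this:
# columns = {c: entities_df[c].tolist() for c in entities_df.columns if c.endswith('_output')}
```

Coverage alone does not settle the choice. The GO output above links "lymphocytes" to "B cell apoptotic process", which is related to the entity but is not the same concept, so the matched concepts still need to be checked by hand.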