# About this notebook:

- This notebook builds the dictionaries used later to label articles.
- The MeSH term data was downloaded from the [US National Library of Medicine (NLM)](https://www.nlm.nih.gov/databases/download/mesh.html) website.

**About MeSH**
- The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. 
- It is used for indexing, cataloging, and searching of biomedical and health-related information. 
- MeSH includes the subject headings appearing in MEDLINE/PubMed, the NLM Catalog, and other NLM databases.


We focus on the NLM **Primary Disease** MeSH branches:
![MESH](MeSH_tree.jpg)

# Import Libraries

In [1]:
# Import libraries
import numpy as np 
import pandas as pd 

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Visualisation:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")

# Part 1: Preparing MeSH terms dataframe

In [2]:
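# Read the MeSH tree file (plain text despite the .bin extension: one "Term;TreeNumber" pair per line)
with open('mtrees2022.bin', 'r') as f:
    lines = f.readlines()

# Create a list of dictionaries with the Term and term_full_ID for each line
data = []
for line in lines:
    line = line.strip()
    if not line:
        continue  # skip blank lines so the unpacking below cannot fail
    term, term_full_ID = line.split(';')
    data.append({'Term': term, 'term_full_ID': term_full_ID})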

print(len(data))

62317


In [3]:
# Create a Pandas dataframe from the list of dictionaries
df = pd.DataFrame(data)

# Print the dataframe
print(df)

                        Term     term_full_ID
0               Body Regions              A01
1         Anatomic Landmarks          A01.111
2                     Breast          A01.236
3      Mammary Glands, Human      A01.236.249
4                    Nipples      A01.236.500
...                      ...              ...
62312              North Sea  Z01.756.092.650
62313              Black Sea      Z01.756.217
62314           Indian Ocean      Z01.756.342
62315      Mediterranean Sea      Z01.756.592
62316          Pacific Ocean      Z01.756.700

[62317 rows x 2 columns]
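The same two-column table can also be read in a single call. This is a minimal sketch, assuming every line has the form `Term;TreeNumber` with exactly one semicolon and no double-quote characters. The inline sample stands in for `mtrees2022.bin`, and `read_mesh_tree` is a hypothetical helper name, not part of the original notebook.

```python
import io

import pandas as pd


def read_mesh_tree(source):
    """Read 'Term;TreeNumber' lines into a two-column dataframe (hypothetical helper)."""
    return pd.read_csv(
        source,
        sep=';',                          # the file is semicolon-delimited
        header=None,                      # there is no header row
        names=['Term', 'term_full_ID'],   # same column names as the loop above
        dtype=str,                        # keep tree numbers as text
    )


# Two sample lines copied from the output above, used instead of the real file
sample = io.StringIO("Body Regions;A01\nMammary Glands, Human;A01.236.249\n")
sample_df = read_mesh_tree(sample)
print(sample_df)
```

Passing the path `'mtrees2022.bin'` instead of the `StringIO` object should give the full table, provided the assumptions above hold for the real file.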


In [4]:
# The first character of the tree number is the top-level MeSH category (e.g. 'C' = Diseases)
df['Root_term'] = df['term_full_ID'].apply(lambda x: x[0])
df.head()

Unnamed: 0,Term,term_full_ID,Root_term
0,Body Regions,A01,A
1,Anatomic Landmarks,A01.111,A
2,Breast,A01.236,A
3,"Mammary Glands, Human",A01.236.249,A
4,Nipples,A01.236.500,A


In [5]:
disease_df = df.copy()

In [6]:
disease_df['Root_term'].unique()
disease_df['Root_term'].nunique()

array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'V', 'Z'], dtype=object)

16

In [7]:
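# Keep only the Diseases category; .copy() avoids SettingWithCopyWarning when columns are added later
disease_df = disease_df.loc[disease_df['Root_term']=='C',:].copy()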
disease_df

Unnamed: 0,Term,term_full_ID,Root_term
8513,Infections,C01,C
8514,"Aneurysm, Infected",C01.069,C
8515,"Arthritis, Infectious",C01.100,C
8516,"Arthritis, Reactive",C01.100.500,C
8517,Asymptomatic Infections,C01.125,C
...,...,...,...
21258,"Eye Injuries, Penetrating",C26.986.450,C
21259,"Head Injuries, Penetrating",C26.986.500,C
21260,"Wounds, Gunshot",C26.986.900,C
21261,"Wounds, Stab",C26.986.950,C


In [8]:
# The first three characters of the tree number identify the primary disease branch (e.g. 'C01')
disease_df['Pri_Disease_ID'] = disease_df['term_full_ID'].apply(lambda x: x[:3])

disease_df['Pri_Disease_ID'].unique()
disease_df['Pri_Disease_ID'].nunique()

array(['C01', 'C04', 'C05', 'C06', 'C07', 'C08', 'C09', 'C10', 'C11',
       'C12', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21',
       'C22', 'C23', 'C24', 'C25', 'C26'], dtype=object)

23

In [9]:
# Root_term is 'C' for every remaining row, so it is no longer needed
disease_df = disease_df.drop(columns="Root_term")
disease_df

Unnamed: 0,Term,term_full_ID,Pri_Disease_ID
8513,Infections,C01,C01
8514,"Aneurysm, Infected",C01.069,C01
8515,"Arthritis, Infectious",C01.100,C01
8516,"Arthritis, Reactive",C01.100.500,C01
8517,Asymptomatic Infections,C01.125,C01
...,...,...,...
21258,"Eye Injuries, Penetrating",C26.986.450,C26
21259,"Head Injuries, Penetrating",C26.986.500,C26
21260,"Wounds, Gunshot",C26.986.900,C26
21261,"Wounds, Stab",C26.986.950,C26


# Part 2: Primary Disease and associated diseases

NLM MeSH terms form a tree.<br> The image below shows how associated disease terms fall under a Primary Disease branch and how they are connected (in this case, "**Digestive System Diseases**" and "**Neoplasms**").

![MESH_digestive](MeSH_disease_example.png)
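The tree structure is encoded in the dotted tree numbers: each prefix of a tree number is the tree number of an ancestor. A minimal sketch of that relationship, where `tree_ancestors` is a hypothetical helper name introduced only for illustration:

```python
def tree_ancestors(tree_number):
    """Return every dotted prefix of a MeSH tree number, top-level branch first."""
    parts = tree_number.split('.')
    return ['.'.join(parts[:i]) for i in range(1, len(parts) + 1)]


# 'C01.100.500' is "Arthritis, Reactive" in the dataframe above
print(tree_ancestors('C01.100.500'))
# ['C01', 'C01.100', 'C01.100.500']
```

In the data above these three tree numbers correspond to "Infections", "Arthritis, Infectious" and "Arthritis, Reactive". A term that sits under several branches simply has several tree numbers, one per position in the tree.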

In [10]:
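# key = primary disease ID (e.g. 'C01')
# value = primary disease name (e.g. 'Infections')
# Top-level entries are the rows whose tree number is exactly three characters long
_df = disease_df.loc[(disease_df["term_full_ID"].str.len())<4,:]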
Pri_disease_dict = dict(zip(_df['term_full_ID'], _df['Term']))
Pri_disease_dict

{'C01': 'Infections',
 'C04': 'Neoplasms',
 'C05': 'Musculoskeletal Diseases',
 'C06': 'Digestive System Diseases',
 'C07': 'Stomatognathic Diseases',
 'C08': 'Respiratory Tract Diseases',
 'C09': 'Otorhinolaryngologic Diseases',
 'C10': 'Nervous System Diseases',
 'C11': 'Eye Diseases',
 'C12': 'Urogenital Diseases',
 'C14': 'Cardiovascular Diseases',
 'C15': 'Hemic and Lymphatic Diseases',
 'C16': 'Congenital, Hereditary, and Neonatal Diseases and Abnormalities',
 'C17': 'Skin and Connective Tissue Diseases',
 'C18': 'Nutritional and Metabolic Diseases',
 'C19': 'Endocrine System Diseases',
 'C20': 'Immune System Diseases',
 'C21': 'Disorders of Environmental Origin',
 'C22': 'Animal Diseases',
 'C23': 'Pathological Conditions, Signs and Symptoms',
 'C24': 'Occupational Diseases',
 'C25': 'Chemically-Induced Disorders',
 'C26': 'Wounds and Injuries'}

In [11]:
print(f"Length of Pri_disease_dict: {len(Pri_disease_dict)}")

Length of Pri_disease_dict: 23


In [12]:
disease_df['Pri_disease_term'] = disease_df['Pri_Disease_ID'].map(Pri_disease_dict)
disease_df

Unnamed: 0,Term,term_full_ID,Pri_Disease_ID,Pri_disease_term
8513,Infections,C01,C01,Infections
8514,"Aneurysm, Infected",C01.069,C01,Infections
8515,"Arthritis, Infectious",C01.100,C01,Infections
8516,"Arthritis, Reactive",C01.100.500,C01,Infections
8517,Asymptomatic Infections,C01.125,C01,Infections
...,...,...,...,...
21258,"Eye Injuries, Penetrating",C26.986.450,C26,Wounds and Injuries
21259,"Head Injuries, Penetrating",C26.986.500,C26,Wounds and Injuries
21260,"Wounds, Gunshot",C26.986.900,C26,Wounds and Injuries
21261,"Wounds, Stab",C26.986.950,C26,Wounds and Injuries


## Creating the term-to-primary-disease dictionary

Each unique term is a key, and its value is the collection of primary disease categories under which that term appears in the tree.
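The grouping can be sketched on a toy dataframe. The rows below are made up for illustration (the duplicated "abdominal pain" row mimics a term that occurs at several tree positions under the same branch), and `toy` and `toy_dict` are hypothetical names. Applying `set` inside the groupby removes duplicate categories in one step.

```python
import pandas as pd

# Illustrative rows only, not taken from the real MeSH file
toy = pd.DataFrame({
    'Term': [
        'abducens nerve injury',
        'abducens nerve injury',
        'abdominal pain',
        'abdominal pain',
    ],
    'Pri_disease_term': [
        'nervous system diseases',
        'wounds and injuries',
        'pathological conditions, signs and symptoms',
        'pathological conditions, signs and symptoms',
    ],
})

# One key per term; the value is the set of distinct primary disease categories
toy_dict = toy.groupby('Term')['Pri_disease_term'].apply(set).to_dict()
print(toy_dict)
```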

In [16]:
# Lower-case every column (terms, IDs and category names) so later lookups are case-insensitive
disease_df = disease_df.apply(lambda x: x.str.lower())


In [17]:
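# Map each term to the primary disease category of every tree position it occupies (duplicates included)
pri_term_dict = disease_df.groupby('Term')['Pri_disease_term'].apply(list).to_dict()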
len(pri_term_dict)

4933

In [18]:
pri_term_dict

{'22q11 deletion syndrome': ['musculoskeletal diseases',
  'cardiovascular diseases',
  'cardiovascular diseases',
  'hemic and lymphatic diseases',
  'congenital, hereditary, and neonatal diseases and abnormalities',
  'congenital, hereditary, and neonatal diseases and abnormalities',
  'congenital, hereditary, and neonatal diseases and abnormalities',
  'congenital, hereditary, and neonatal diseases and abnormalities',
  'congenital, hereditary, and neonatal diseases and abnormalities',
  'congenital, hereditary, and neonatal diseases and abnormalities',
  'endocrine system diseases'],
 '46, xx disorders of sex development': ['urogenital diseases',
  'urogenital diseases',
  'urogenital diseases',
  'congenital, hereditary, and neonatal diseases and abnormalities',
  'endocrine system diseases'],
 '46, xx testicular disorders of sex development': ['urogenital diseases',
  'urogenital diseases',
  'urogenital diseases',
  'congenital, hereditary, and neonatal diseases and abnormalitie

In [19]:
# Remove duplicate values for each key by converting every list to a set
pri_term_dict = {key: set(value) for key, value in pri_term_dict.items()}

In [20]:
pri_term_dict

{'22q11 deletion syndrome': {'cardiovascular diseases',
  'congenital, hereditary, and neonatal diseases and abnormalities',
  'endocrine system diseases',
  'hemic and lymphatic diseases',
  'musculoskeletal diseases'},
 '46, xx disorders of sex development': {'congenital, hereditary, and neonatal diseases and abnormalities',
  'endocrine system diseases',
  'urogenital diseases'},
 '46, xx testicular disorders of sex development': {'congenital, hereditary, and neonatal diseases and abnormalities',
  'endocrine system diseases',
  'urogenital diseases'},
 'abdomen, acute': {'pathological conditions, signs and symptoms'},
 'abdominal abscess': {'infections'},
 'abdominal injuries': {'wounds and injuries'},
 'abdominal neoplasms': {'neoplasms'},
 'abdominal pain': {'pathological conditions, signs and symptoms'},
 'abducens nerve diseases': {'nervous system diseases'},
 'abducens nerve injury': {'nervous system diseases', 'wounds and injuries'},
 'aberrant crypt foci': {'neoplasms'},
 'a

In [21]:
# The keys were already lower-cased with the dataframe, so this is a no-op kept as a safeguard
pri_term_dict = {key.lower(): value for key, value in pri_term_dict.items()}
pri_term_dict 

{'22q11 deletion syndrome': {'cardiovascular diseases',
  'congenital, hereditary, and neonatal diseases and abnormalities',
  'endocrine system diseases',
  'hemic and lymphatic diseases',
  'musculoskeletal diseases'},
 '46, xx disorders of sex development': {'congenital, hereditary, and neonatal diseases and abnormalities',
  'endocrine system diseases',
  'urogenital diseases'},
 '46, xx testicular disorders of sex development': {'congenital, hereditary, and neonatal diseases and abnormalities',
  'endocrine system diseases',
  'urogenital diseases'},
 'abdomen, acute': {'pathological conditions, signs and symptoms'},
 'abdominal abscess': {'infections'},
 'abdominal injuries': {'wounds and injuries'},
 'abdominal neoplasms': {'neoplasms'},
 'abdominal pain': {'pathological conditions, signs and symptoms'},
 'abducens nerve diseases': {'nervous system diseases'},
 'abducens nerve injury': {'nervous system diseases', 'wounds and injuries'},
 'aberrant crypt foci': {'neoplasms'},
 'a

In [22]:
# np.save pickles the dict into a 0-d object array, so loading it back requires allow_pickle=True
np.save('NLM_MeSH.npy', pri_term_dict) 
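Reading the dictionary back takes two extra steps, because NumPy wraps the dict in a zero-dimensional object array. A minimal sketch using a small stand-in dictionary and a temporary file instead of `NLM_MeSH.npy`; `example_dict` and `loaded` are hypothetical names:

```python
import os
import tempfile

import numpy as np

# Stand-in for pri_term_dict, using one entry shown in the output above
example_dict = {'abdominal abscess': {'infections'}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.npy')
    np.save(path, example_dict)  # stored as a pickled 0-d object array
    # allow_pickle=True is required for object arrays; .item() unwraps the dict
    loaded = np.load(path, allow_pickle=True).item()

print(loaded)
```

Because loading relies on pickle, files from untrusted sources should not be opened this way.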

## Viewing as dataframe: disease_df

In [23]:
disease_df

Unnamed: 0,Term,term_full_ID,Pri_Disease_ID,Pri_disease_term
8513,infections,c01,c01,infections
8514,"aneurysm, infected",c01.069,c01,infections
8515,"arthritis, infectious",c01.100,c01,infections
8516,"arthritis, reactive",c01.100.500,c01,infections
8517,asymptomatic infections,c01.125,c01,infections
...,...,...,...,...
21258,"eye injuries, penetrating",c26.986.450,c26,wounds and injuries
21259,"head injuries, penetrating",c26.986.500,c26,wounds and injuries
21260,"wounds, gunshot",c26.986.900,c26,wounds and injuries
21261,"wounds, stab",c26.986.950,c26,wounds and injuries


In [24]:
disease_df.to_csv("MeSH_terms.csv",index=False)
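As a rough illustration of how the term dictionary might be used for the article labelling mentioned at the top of this notebook, the sketch below collects the primary disease categories of every term that occurs in a piece of text. It uses naive substring matching on lower-cased text, which is an assumption made here for illustration and not necessarily the method used in the later notebooks. `label_text` and `example_terms` are hypothetical names.

```python
def label_text(text, term_dict):
    """Return the primary disease categories of all dictionary terms found in the text."""
    text = text.lower()  # the dictionary keys are lower-case
    labels = set()
    for term, categories in term_dict.items():
        if term in text:  # naive substring match
            labels |= set(categories)
    return labels


# Two entries copied from the dictionary output above
example_terms = {
    'abdominal pain': {'pathological conditions, signs and symptoms'},
    'abdominal abscess': {'infections'},
}

print(label_text('Patients reported abdominal pain after surgery.', example_terms))
# {'pathological conditions, signs and symptoms'}
```

Many MeSH terms are stored in inverted form (for example "wounds, gunshot"), so plain substring matching will miss natural phrasings such as "gunshot wounds".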