# Exploration

In this notebook, we explore the dataset of medical questions for entity recognition, generated using Wikidata. The purpose is to check the quality of the questions and to verify that the generation worked as intended.

Ideally, we'd need to have at least 2 questions per entity.

In [7]:
import pandas as pd

In [9]:
pd.set_option('display.max_colwidth', 1000)
df = pd.read_csv('entity_recognition_dataset_2.csv')
df.sample(5)

Unnamed: 0,entity_type,entity_name,relation,template,missing_attribute
4065,disease,mucocutaneous leishmaniasis,is located (anatomically) in the,The disease mucocutaneous leishmaniasis is located (anatomically) in the,upper respiratory tract
512,disease,myoclonic dystonia 26,is caused by a mutation in the gene named,The disease myoclonic dystonia 26 is caused by a mutation in the gene named,KCTD17
4470,disease,poikiloderma with neutropenia,", its main symptom is","The disease poikiloderma with neutropenia , its main symptom is",neutropenia
4096,disease,hand and arm congenital deformity,is located (anatomically) in the,The disease hand and arm congenital deformity is located (anatomically) in the,upper extremity
4142,disease,pleural tuberculosis,is located (anatomically) in the,The disease pleural tuberculosis is located (anatomically) in the,pleura


In [10]:
df['relation'].value_counts()

relation
is caused by a mutation in the gene named    3701
, its main symptom is                         669
 is located (anatomically) in the             379
is used to treat                              141
was first identified by                       108
, it's chemical formula is                     68
has an active ingredient with the name of      16
originated in                                   2
Name: count, dtype: int64

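The next cell checks how many entities have at least 2 questions assigned to them. Only 366 entities meet this threshold, and the printed 7.20% is their count divided by the number of rows in the dataset, not by the number of distinct entities.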

In [11]:
# Count the entities that have at least two rows in the dataset.
# Note: the denominator is the number of rows (len(df)), not the number of distinct entities.
frac_entities = (df['entity_name'].value_counts() > 1).sum() / len(df)
print(f"Percentage of entities with at least two questions: {frac_entities*100:.2f}%")

Percentage of entities with at least two questions: 7.20%


In [15]:
sum(df['entity_name'].value_counts().values > 1)

366
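The percentage printed above divides by the number of rows, so it is not a per-entity share. As a minimal sketch on hypothetical toy data (the frame `toy` and its values are made up for illustration, not taken from the dataset), the two denominators can be compared like this:

```python
import pandas as pd

# Hypothetical toy data: "a" has 3 questions, "b" has 2, "c" and "d" have 1 each.
toy = pd.DataFrame({"entity_name": ["a", "a", "a", "b", "b", "c", "d"]})

counts = toy["entity_name"].value_counts()
n_multi = int((counts > 1).sum())  # entities with at least two questions

share_of_rows = n_multi / len(toy)                          # denominator: rows
share_of_entities = n_multi / toy["entity_name"].nunique()  # denominator: distinct entities

print(f"{n_multi} entities with 2+ questions")
print(f"share of rows: {share_of_rows:.2%}")
print(f"share of distinct entities: {share_of_entities:.2%}")
```

Applying the same idea to `df` would mean replacing `len(df)` with `df['entity_name'].nunique()`; the resulting figure is not shown here because it was not computed in this notebook.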

# Formatting the dataset to replicate Oliver's methodology

In [28]:
filename = "formated_entity_recognition_dataset.csv"
df_formated = df[df['entity_name'].map(df['entity_name'].value_counts()) > 1].copy()
# For the entities with more than 2 questions, we will only keep the first two
df_formated = df_formated.groupby('entity_name').head(2)

In [29]:
df_formated['relation'].value_counts()

relation
, its main symptom is                        274
is caused by a mutation in the gene named    252
 is located (anatomically) in the            104
was first identified by                       36
is used to treat                              32
, it's chemical formula is                    22
has an active ingredient with the name of     12
Name: count, dtype: int64
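The filtering in the cells above keeps only entities with more than one question and then the first two rows per entity. A small sketch on hypothetical toy data (the frame `toy` and the relation labels are invented for illustration) shows the same two steps, using `groupby(...).transform("size")` as one possible alternative to mapping `value_counts()`:

```python
import pandas as pd

# Hypothetical toy data: "a" has 3 questions, "b" has 2, "c" has 1.
toy = pd.DataFrame({
    "entity_name": ["a", "a", "a", "b", "b", "c"],
    "relation": ["r1", "r2", "r3", "r1", "r2", "r1"],
})

# Each row receives the number of rows that share its entity_name.
group_size = toy.groupby("entity_name")["relation"].transform("size")

# Step 1: drop entities with a single question.
# Step 2: keep the first two rows of each remaining entity.
kept = toy[group_size > 1].groupby("entity_name").head(2)

# Every remaining entity should now have exactly two rows.
assert (kept["entity_name"].value_counts() == 2).all()
print(kept)
```

Because `head(2)` preserves the original row order, which two questions survive depends on how the source CSV is sorted.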

In [30]:
# Rename columns:
# template to query
# entity_name to entity
# missing_attribute to missing_words
# New column with "red_herring" set to False
# New column with description set to None
# New column with id set to None
# Erase all other columns

df_formated = df_formated.rename(columns={'template': 'query', 'entity_name': 'entity', 'missing_attribute': 'missing_words'})
df_formated['red_herring'] = False
df_formated['description'] = None
df_formated['id'] = None
df_formated = df_formated[['entity', 'description', 'id', 'red_herring', 'query', 'missing_words']]
# Wrap each missing_words string in a single-element list
df_formated['missing_words'] = df_formated['missing_words'].apply(lambda x: [x])
df_formated

Unnamed: 0,entity,description,id,red_herring,query,missing_words
0,Legg–Calvé–Perthes disease,,,False,The disease Legg–Calvé–Perthes disease was first identified by,[Karel Maydl]
2,Wiskott-Aldrich syndrome,,,False,The disease Wiskott-Aldrich syndrome was first identified by,[Alfred Wiskott]
4,Gordon-Holmes syndrome,,,False,The disease Gordon-Holmes syndrome was first identified by,[Gordon Morgan Holmes]
10,erythromelalgia,,,False,The disease erythromelalgia was first identified by,[Silas Weir Mitchell]
13,chronic congestive splenomegaly,,,False,The disease chronic congestive splenomegaly was first identified by,[Guido Banti]
...,...,...,...,...,...,...
5067,amentia,,,False,"The disease amentia , its main symptom is",[amnesia]
5070,interstitial lung disease,,,False,"The disease interstitial lung disease , its main symptom is",[inflammation]
5075,trimethylaminuria,,,False,"The disease trimethylaminuria , its main symptom is",[fetor]
5076,appendicitis,,,False,"The disease appendicitis , its main symptom is",[vomiting]


In [31]:
df_formated.to_csv(filename, index=False)
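One thing to keep in mind when reading the exported file back: CSV has no list type, so the `missing_words` lists are written as their string representation. The sketch below uses a hypothetical one-row frame and an in-memory buffer instead of the real file, and assumes `ast.literal_eval` is an acceptable way to parse the column back into lists:

```python
import ast
import io

import pandas as pd

# Hypothetical one-row frame mirroring the exported column layout.
toy = pd.DataFrame({
    "entity": ["appendicitis"],
    "query": ["The disease appendicitis , its main symptom is"],
    "missing_words": [["vomiting"]],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain read: missing_words comes back as a string, not a list.
buffer.seek(0)
raw = pd.read_csv(buffer)

# Read with a converter: the string is parsed back into a Python list.
buffer.seek(0)
parsed = pd.read_csv(buffer, converters={"missing_words": ast.literal_eval})

print(type(raw["missing_words"].iloc[0]), type(parsed["missing_words"].iloc[0]))
```

The same `converters` argument could be passed when loading `formated_entity_recognition_dataset.csv`, provided every value in the column is a valid Python literal.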