## Background
The aim of this research is to use AI-assisted technology to automatically generate interaction claims from additional, previously underutilized sources of data. In doing so, we can increase data parity between the published state of knowledge and what is available in databases.  
  
**Approach**:  
  
1. Take an existing pretrained model and fine-tune it to perform **entity recognition** of drugs, genes, variants, and phenotypes. 
2. Either (1) identify a suitable model or (2) further train an existing pretrained model to perform **text summarization**, condensing a label into a single 'claim' that links drugs, genes, variants, and phenotypes. (Possibly **relationship classification** instead?)

## Load Labels using Datasets
The dataset is obtained from the OpenFDA JSON resource. Pre-sectioned labels were imported using the json library and converted to a pandas dataframe, which was saved as the Excel file read in the cell below. 
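The JSON import step itself is not shown in this notebook. As a rough sketch of what it could look like, assuming each OpenFDA label record stores a section as a list of strings and keeps the brand name under an `openfda` sub-object (the `flatten_label` helper, the `SECTIONS` list, and the sample record are all hypothetical):

```python
import json

# Hypothetical subset of section names to keep as columns
SECTIONS = ['adverse_reactions', 'indications_and_usage', 'pharmacogenomics']

def flatten_label(record, sections=SECTIONS):
    """Turn one label record into a flat row with one string (or None) per section."""
    names = record.get('openfda', {}).get('brand_name') or [None]
    row = {'brand_name': names[0]}
    for section in sections:
        parts = record.get(section)
        # Sections are assumed to be lists of strings; join them into one text field
        row[section] = ' '.join(parts) if parts else None
    return row

# Tiny made-up payload standing in for a downloaded OpenFDA file
sample = json.loads(
    '{"results": [{"openfda": {"brand_name": ["Creon"]},'
    ' "adverse_reactions": ["6 ADVERSE REACTIONS ..."]}]}'
)
rows = [flatten_label(record) for record in sample['results']]
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` would then give a frame with one column per section, similar in shape to the table shown below.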

In [1]:
from datasets import Dataset
import pandas as pd

df = pd.read_excel('../data/openfda.xlsx').reset_index(drop=True).drop('Unnamed: 0', axis=1)

df


Unnamed: 0,brand_name,adverse_reactions,indications_and_usage,contraindications,warnings_and_cautions,warnings,precautions,pharmacokinetics,purpose,clinical_pharmacology,active_ingredient,stop_use,boxed_warning,pharmacodynamics,pharmacogenomics
0,AMOXICILLIN AND CLAVULANATE POTASSIUM,ADVERSE REACTIONS SECTION The following are di...,INDICATIONS & USAGE SECTION To reduce the deve...,CONTRAINDICATIONS SECTION Amoxicillinfor oral ...,WARNINGS AND PRECAUTIONS SECTION 5.1 Anaphylac...,,,,,CLINICAL PHARMACOLOGY SECTION 12.1 Mechanism o...,,,,,
1,UNDA 312,,Uses For the relief of symptoms associated wit...,,,Warnings Sore throat warning: Severe or persis...,,,Uses For the relief of symptoms associated wit...,,Active ingredients Each drop contains: Angelic...,Stop use and ask a doctor if Cough persists fo...,,,
2,SUN PROTECT LIP BALM SPF 30,,Uses Helps prevent sunburn. If used as directe...,,,Warnings For external use only. Do not use on ...,,,Purpose Sunscreen,,Drug Facts Active ingredients Non Nano Zinc Ox...,,,,
3,LOSARTAN POTASSIUM AND HYDROCHLOROTHIAZIDE,,,,,,,,,,,,,,
4,Potassium Phosphates,6 ADVERSE REACTIONS The following clinically s...,1 INDICATIONS AND USAGE Potassium Phosphates I...,4 CONTRAINDICATIONS Potassium Phosphates Injec...,5 WARNINGS AND PRECAUTIONS Serious Cardiac Adv...,,,12.3 Pharmacokinetics Distribution Approximate...,,12 CLINICAL PHARMACOLOGY 12.1 Mechanism of Act...,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
215910,Carbidopa and Levodopa,ADVERSE REACTIONS The most common adverse reac...,INDICATIONS AND USAGE Carbidopa and levodopa t...,CONTRAINDICATIONS Nonselective monoamine oxida...,,WARNINGS When carbidopa and levodopa tablets a...,"PRECAUTIONS General As with levodopa, periodic...",Pharmacokinetics Carbidopa reduces the amount ...,,CLINICAL PHARMACOLOGY Mechanism of Action Park...,,,,Pharmacodynamics When levodopa is administered...,
215911,REFRESH Optive Mega-3,,"Uses For the temporary relief of burning, irri...",,,Warnings For external use only. To avoid conta...,,,Purpose Eye lubricant Eye lubricant Eye lubricant,,Active ingredients Carboxymethylcellulose sodi...,Stop use and ask a doctor if you experience ey...,,,
215912,Creon,6 ADVERSE REACTIONS The most serious adverse r...,1 INDICATIONS AND USAGE CREON ® is indicated f...,4 CONTRAINDICATIONS None. None ( 4 ),5 WARNINGS AND PRECAUTIONS Fibrosing colonopat...,,,12.3 Pharmacokinetics The pancreatic enzymes i...,,12 CLINICAL PHARMACOLOGY 12.1 Mechanism of Act...,,,,,
215913,Losartan Potassium and Hydrochlorothiazide,6 ADVERSE REACTIONS Most common adverse reacti...,1 INDICATIONS AND USAGE Losartan potassium and...,4 CONTRAINDICATIONS Losartan potassium and hyd...,5 WARNINGS AND PRECAUTIONS Hypotension: Correc...,,,12.3 Pharmacokinetics Losartan Potassium Absor...,,12 CLINICAL PHARMACOLOGY 12.1 Mechanism of Act...,,,WARNING: FETAL TOXICITY When pregnancy is dete...,12.2 Pharmacodynamics Losartan Potassium Losar...,


## Named Entity Recognition (NER) for Pharmacokinetics / Pharmacodynamics / Pharmacogenomics
The model used for tag generation in this section is the bioNLP System for BioNER and BioNEN: https://github.com/librairy/bio-ner

- Genetics: https://huggingface.co/alvaroalon2/biobert_genetic_ner  
- Diseases: https://huggingface.co/alvaroalon2/biobert_diseases_ner
- Chemicals: https://huggingface.co/alvaroalon2/biobert_chemical_ner
  
Example: https://user-images.githubusercontent.com/72864707/120455516-20a28f80-c395-11eb-97a8-fb54b017eaab.png
  
The system performs biomedical named entity recognition and normalization for the Disease, Chemical, and Genetic entity classes using state-of-the-art models, with BioBERT as the core component for recognizing entities in text. Each model is a fine-tuned version of BioBERT for genetic, chemical, or disease entities, produced as part of a master's thesis from the Escuela Tecnica Superior (ETS) Universidad Politecnica de Madrid. Two additional corpora were used to train each entity class: Diseases (BC5CDR - Diseases, NCBI - Diseases), Chemicals (BC4CHEMD, BC5CDR - Chemicals), and Genes/Proteins (JNLPBA, BC2GM).
  
https://oa.upm.es/67933/1/TFM_ALVARO_ALONSO_CASERO.pdf 

### Pharmacogenomics
This section is likely key to the end goal of linking genomics, diseases, and interactions. TO DO: Implement a pipe for diseases and chemicals as well!

#### Filter

In [2]:
# Datasets will load each column from a dataframe as a dataset but cannot handle single Series objects
f = pd.DataFrame(df, columns=['pharmacogenomics'])
dataset = Dataset.from_pandas(f)
dataset

Dataset({
    features: ['pharmacogenomics'],
    num_rows: 215915
})

In [3]:
# Filter Nonetype and length 
print(f'UNFILTERED: {len(dataset["pharmacogenomics"])}')

dataset = dataset.filter(lambda x: x['pharmacogenomics'] is not None)
print(f'After NONE CHECK: {len(dataset["pharmacogenomics"])}')

# test = dataset.filter(lambda x: len(x['adverse_reactions']) < 8000)
# print(f'After TITAN LENGTH CHECK: {len(test["adverse_reactions"])}')

# dataset = dataset.filter(lambda x: len(x['adverse_reactions']) < 512)
# print(f'After ADE LENGTH CHECK: {len(dataset["adverse_reactions"])}')

UNFILTERED: 215915


Filter:   0%|          | 0/215915 [00:00<?, ? examples/s]

After NONE CHECK: 355


In [4]:
dataset

Dataset({
    features: ['pharmacogenomics'],
    num_rows: 355
})
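To put the filtered row count in context, the arithmetic below computes what fraction of all labels has each section populated. It only uses row counts reported by the filter cells in this notebook; the variable names are mine.

```python
# Row counts reported by the "NONE CHECK" filter cells in this notebook
total_labels = 215915
non_empty = {
    'pharmacogenomics': 355,
    'pharmacodynamics': 26217,
    'adverse_reactions': 77064,
}

# Fraction of all labels that have text in each section
coverage = {section: count / total_labels for section, count in non_empty.items()}
for section, fraction in coverage.items():
    print(f'{section}: {fraction:.2%} of labels')
```

The pharmacogenomics section is by far the sparsest, which is worth keeping in mind when judging how far results from it generalize.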

#### Pipe

In [5]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe_gene = pipeline("token-classification", model="alvaroalon2/biobert_genetic_ner")
pipe_chemical = pipeline("token-classification", model="alvaroalon2/biobert_chemical_ner")
pipe_disease = pipeline("token-classification", model="alvaroalon2/biobert_diseases_ner")



In [6]:
# Filter and generate tags
def generate_genomic_ner(entry):
    return {'genomic_ner': pipe_gene(entry['pharmacogenomics']) }

def generate_chemical_ner(entry):
    return {'chemical_ner': pipe_chemical(entry['pharmacogenomics']) }

def generate_diseases_ner(entry):
    return {'diseases_ner': pipe_disease(entry['pharmacogenomics']) }


dataset = dataset.map(generate_genomic_ner)
dataset = dataset.map(generate_chemical_ner)
dataset = dataset.map(generate_diseases_ner)

Map:   0%|          | 0/355 [00:00<?, ? examples/s]

Map:   0%|          | 0/355 [00:00<?, ? examples/s]

Map:   0%|          | 0/355 [00:00<?, ? examples/s]

In [7]:
dataset

Dataset({
    features: ['pharmacogenomics', 'genomic_ner', 'chemical_ner', 'diseases_ner'],
    num_rows: 355
})

In [8]:
def merged_ner_record(entry):
    return {'merged_ner_groups': entry['genomic_ner'] + entry['chemical_ner'] + entry['diseases_ner']}

dataset = dataset.map(merged_ner_record)

Map:   0%|          | 0/355 [00:00<?, ? examples/s]

In [9]:
print(len(dataset['genomic_ner'][0]))
print(len(dataset['chemical_ner'][0]))
print(len(dataset['diseases_ner'][0]))
len(dataset['merged_ner_groups'][0])


41
23
169


233

#### Rejoin Chunks


In [10]:
def recombine_ner_chunks(entry):
    # Collect the BIO tag and the word piece of every token returned by the pipes
    all_tags = [tag_set['entity'] for tag_set in entry['merged_ner_groups']]
    all_words = [tag_set['word'] for tag_set in entry['merged_ner_groups']]

    # Number the chunks: a 'B-' tag starts a new chunk, an 'I-' tag continues the current one
    i = 0
    chunk_number = []
    for tag in all_tags:
        if tag.startswith('B-'):
            i += 1
            chunk_number.append(i)
        elif tag.startswith('I-'):
            chunk_number.append(i)

    # Join consecutive word pieces that share a chunk number
    rebuilt = []
    word = ''
    for position, chunk_indicator in enumerate(chunk_number):
        word = word + ' ' + all_words[position]
        is_last = position == len(chunk_number) - 1
        if is_last or chunk_indicator != chunk_number[position + 1]:
            rebuilt.append(word.strip())  # chunk finished
            word = ''  # start a new chunk

    return {'ner_chunk_indicators': chunk_number, 'individual_ner_words': all_words, 'ner_chunks': rebuilt}

In [11]:
dataset = dataset.map(recombine_ner_chunks)
dataset

Map:   0%|          | 0/355 [00:00<?, ? examples/s]

Dataset({
    features: ['pharmacogenomics', 'genomic_ner', 'chemical_ner', 'diseases_ner', 'merged_ner_groups', 'ner_chunk_indicators', 'individual_ner_words', 'ner_chunks'],
    num_rows: 355
})

In [12]:
dataset['pharmacogenomics'][0]

'12.5 Pharmacogenomics Disposition of HMG-CoA reductase inhibitors, including rosuvastatin, involves OATP1B1 and other transporter proteins. Higher plasma concentrations of rosuvastatin have been reported in very small groups of patients (n=3 to 5) who have two reduced function alleles of the gene that encodes OATP1B1 ( SLCO1B1 521T>C). The frequency of this genotype (i.e., SLCO1B1 521 C/C) is generally lower than 5% in most racial/ethnic groups. The impact of this polymorphism on efficacy and/or safety of rosuvastatin has not been clearly established.'

In [18]:
dataset['chemical_ner'][0]

[{'end': 39,
  'entity': 'B-CHEMICAL',
  'index': 15,
  'score': 0.9999948740005493,
  'start': 37,
  'word': 'HM'},
 {'end': 40,
  'entity': 'I-CHEMICAL',
  'index': 16,
  'score': 0.9996401071548462,
  'start': 39,
  'word': '##G'},
 {'end': 41,
  'entity': 'I-CHEMICAL',
  'index': 17,
  'score': 0.9999849796295166,
  'start': 40,
  'word': '-'},
 {'end': 43,
  'entity': 'I-CHEMICAL',
  'index': 18,
  'score': 0.999990701675415,
  'start': 41,
  'word': 'Co'},
 {'end': 44,
  'entity': 'I-CHEMICAL',
  'index': 19,
  'score': 0.9999616146087646,
  'start': 43,
  'word': '##A'},
 {'end': 78,
  'entity': 'B-CHEMICAL',
  'index': 27,
  'score': 0.9999961853027344,
  'start': 77,
  'word': 'r'},
 {'end': 80,
  'entity': 'B-CHEMICAL',
  'index': 28,
  'score': 0.9995237588882446,
  'start': 78,
  'word': '##os'},
 {'end': 81,
  'entity': 'I-CHEMICAL',
  'index': 29,
  'score': 0.9978333115577698,
  'start': 80,
  'word': '##u'},
 {'end': 84,
  'entity': 'I-CHEMICAL',
  'index': 30,
  'score

In [13]:
dataset['ner_chunks'][0]

['HM ##G - Co ##A red ##uc ##tase',
 'O ##AT ##P ##1 ##B ##1',
 'transport',
 'O ##AT ##P ##1 ##B ##1',
 'SL ##CO ##1 ##B ##1 52 ##1 ##T > C',
 'SL ##CO ##1 ##B ##1 52 ##1 C / C',
 'HM ##G - Co ##A',
 'r',
 '##os ##u ##vas ##tat ##in',
 'r',
 '##os ##u ##vas ##tat ##in',
 'r ##os ##u ##vas ##tat ##in']

In [14]:
dataset['pharmacogenomics'][257]

'12.5 Pharmacogenomics Patients who are CYP2C19 poor metabolizers have little to no CYP2C19 enzyme function compared to CYP2C19 normal metabolizers that have fully functional CYP2C19 enzymes. After single doses of abrocitinib, CYP2C19 poor metabolizers demonstrated dose-normalized AUC of abrocitinib values that were 2.3-fold higher when compared to CYP2C19 normal metabolizers. Approximately 3–5% of Caucasians and Blacks and 15 to 20% of Asians are CYP2C19 poor metabolizers [see Dosage and Administration (2.4) and Use in Specific Populations (8.8) ] .'

In [15]:
dataset['ner_chunks'][257]

['C ##YP ##2 ##C ##19',
 'C ##YP ##2 ##C ##19 enzyme',
 'C ##YP ##2 ##C ##19',
 'C ##YP ##2 ##C ##19 enzymes',
 'C ##YP ##2 ##C ##19',
 'C ##YP ##2 ##C ##19',
 'C ##YP ##2 ##C ##19',
 'a ##bro ##ci ##tin ##ib',
 'a',
 '##bro ##ci ##tin ##ib']

TO DO: Make the recombine step also return the entity type! Also, fix the post-processing!
NOTE: check that examples [257] and [0] look good
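One possible direction for the TO DO above, sketched under a few assumptions: entity text is recovered by slicing the original string with the `start`/`end` offsets instead of joining word pieces, and a `##` piece is treated as a continuation even when the model tags it `B-` (which is what splits `r` from `##os ##u ##vas ##tat ##in` in example [0]). The `merge_tokens` helper is hypothetical. The token list is a trimmed copy of the `chemical_ner` output for example [0]; the output shown above is cut off, so the `##vas` word and the last two tokens are reconstructed from the chunk list.

```python
def merge_tokens(tokens, text):
    """Group token-level predictions into entity spans that carry their type."""
    spans = []
    for tok in tokens:
        label = tok['entity'].split('-', 1)[-1]  # 'B-CHEMICAL' -> 'CHEMICAL'
        is_subword = tok['word'].startswith('##')
        continues = tok['entity'].startswith('I-') or is_subword
        if spans and continues and spans[-1]['type'] == label:
            spans[-1]['end'] = tok['end']  # extend the current span
        else:
            spans.append({'type': label, 'start': tok['start'], 'end': tok['end']})
    for span in spans:
        # Slice the source text so casing and punctuation are preserved
        span['text'] = text[span['start']:span['end']]
    return spans

text = ('12.5 Pharmacogenomics Disposition of HMG-CoA reductase inhibitors, '
        'including rosuvastatin, involves OATP1B1')
tokens = [
    {'entity': 'B-CHEMICAL', 'word': 'HM', 'start': 37, 'end': 39},
    {'entity': 'I-CHEMICAL', 'word': '##G', 'start': 39, 'end': 40},
    {'entity': 'I-CHEMICAL', 'word': '-', 'start': 40, 'end': 41},
    {'entity': 'I-CHEMICAL', 'word': 'Co', 'start': 41, 'end': 43},
    {'entity': 'I-CHEMICAL', 'word': '##A', 'start': 43, 'end': 44},
    {'entity': 'B-CHEMICAL', 'word': 'r', 'start': 77, 'end': 78},
    {'entity': 'B-CHEMICAL', 'word': '##os', 'start': 78, 'end': 80},
    {'entity': 'I-CHEMICAL', 'word': '##u', 'start': 80, 'end': 81},
    {'entity': 'I-CHEMICAL', 'word': '##vas', 'start': 81, 'end': 84},
    # The next two tokens are assumed (the printed output above is truncated)
    {'entity': 'I-CHEMICAL', 'word': '##tat', 'start': 84, 'end': 87},
    {'entity': 'I-CHEMICAL', 'word': '##in', 'start': 87, 'end': 89},
]
spans = merge_tokens(tokens, text)
print(spans)
```

This should be run per model output (`genomic_ner`, `chemical_ner`, `diseases_ner`) rather than on the concatenated `merged_ner_groups`. The `transformers` token-classification pipeline also accepts an `aggregation_strategy` argument (for example `"simple"` or `"first"`) that groups tokens inside the pipeline; whether it resolves the split-word case for these models has not been checked here.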

### Pharmacodynamics TO DO

#### Filter

In [35]:
# Datasets will load each column from a dataframe as a dataset but cannot handle single Series objects
f = pd.DataFrame(df, columns=['pharmacodynamics'])
dataset = Dataset.from_pandas(f)
dataset

Dataset({
    features: ['pharmacodynamics'],
    num_rows: 215915
})

In [38]:
# Filter Nonetype and length 
print(f'UNFILTERED: {len(dataset["pharmacodynamics"])}')

dataset = dataset.filter(lambda x: x['pharmacodynamics'] is not None)
print(f'After NONE CHECK: {len(dataset["pharmacodynamics"])}')

# test = dataset.filter(lambda x: len(x['adverse_reactions']) < 8000)
# print(f'After TITAN LENGTH CHECK: {len(test["adverse_reactions"])}')

# dataset = dataset.filter(lambda x: len(x['adverse_reactions']) < 512)
# print(f'After ADE LENGTH CHECK: {len(dataset["adverse_reactions"])}')

UNFILTERED: 215915


Filter:   0%|          | 0/215915 [00:00<?, ? examples/s]

After NONE CHECK: 26217


#### Pipe

In [39]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="alvaroalon2/biobert_genetic_ner")

In [40]:
# Filter and generate tags
def generate_ner_tags(entry):
    return {'ner_tags': pipe(entry['pharmacodynamics']) }

dataset = dataset.map(generate_ner_tags)

Map:   0%|          | 0/26217 [00:00<?, ? examples/s]

#### Rejoin Chunks

In [None]:
def recombine_ner_chunks(entry):
    # Collect the BIO tag and the word piece of every token returned by the pipe
    all_tags = [tag_set['entity'] for tag_set in entry['ner_tags']]
    all_words = [tag_set['word'] for tag_set in entry['ner_tags']]

    # Number the chunks: a 'B-' tag starts a new chunk, an 'I-' tag continues the current one
    i = 0
    chunk_number = []
    for tag in all_tags:
        if tag.startswith('B-'):
            i += 1
            chunk_number.append(i)
        elif tag.startswith('I-'):
            chunk_number.append(i)

    # Join consecutive word pieces that share a chunk number
    rebuilt = []
    word = ''
    for position, chunk_indicator in enumerate(chunk_number):
        word = word + ' ' + all_words[position]
        is_last = position == len(chunk_number) - 1
        if is_last or chunk_indicator != chunk_number[position + 1]:
            rebuilt.append(word.strip())  # chunk finished
            word = ''  # start a new chunk

    return {'ner_chunk_indicators': chunk_number, 'individual_ner_words': all_words, 'ner_chunks': rebuilt}

In [None]:
dataset = dataset.map(recombine_ner_chunks)
dataset

## Named Entity Recognition (NER) for Adverse Reactions Dataset
The model being used for this tag generation is the electramed-small-ADE-DRUG-EFFECT-ner-v3: https://huggingface.co/chintagunta85/electramed-small-ADE-DRUG-EFFECT-ner-v3
      
The model card on Hugging Face is not very descriptive, but according to James the model may be the one described in this paper: https://arxiv.org/pdf/2201.01405v2.pdf


### Filter

In [None]:
# Datasets will load each column from a dataframe as a dataset but cannot handle single Series objects
f = pd.DataFrame(df, columns=['adverse_reactions'])
dataset = Dataset.from_pandas(f)

In [None]:
# Filter Nonetype and length 
print(f'UNFILTERED: {len(dataset["adverse_reactions"])}')

dataset = dataset.filter(lambda x: x['adverse_reactions'] is not None)
print(f'After NONE CHECK: {len(dataset["adverse_reactions"])}')

test = dataset.filter(lambda x: len(x['adverse_reactions']) < 8000)
print(f'After TITAN LENGTH CHECK: {len(test["adverse_reactions"])}')

dataset = dataset.filter(lambda x: len(x['adverse_reactions']) < 512)
print(f'After ADE LENGTH CHECK: {len(dataset["adverse_reactions"])}')

UNFILTERED: 215915


Filter:   0%|          | 0/215915 [00:00<?, ? examples/s]

After NONE CHECK: 77064


Filter:   0%|          | 0/77064 [00:00<?, ? examples/s]

After TITAN LENGTH CHECK: 59382


Filter:   0%|          | 0/77064 [00:00<?, ? examples/s]

After ADE LENGTH CHECK: 5642
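The length filters above compare character counts, and the 512-character cut keeps only 5642 of the 77064 non-empty sections. As an alternative to discarding long sections, a sketch of splitting them into shorter pieces is given below. The `split_into_chunks` helper and its naive sentence-splitting regex are assumptions for illustration, not part of the pipeline used here.

```python
import re

def split_into_chunks(text, max_chars=512):
    """Greedily pack sentences into chunks shorter than max_chars characters."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if len(candidate) < max_chars or not current:
            # Note: a single sentence longer than max_chars is kept whole
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

# Made-up text: the same 22-character sentence repeated 50 times
example = ' '.join(['Headache was reported.'] * 50)
chunks = split_into_chunks(example, max_chars=100)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Each chunk could then be passed to the pipe separately, with the caveat that entity offsets become relative to the chunk rather than the full section, and that the model's real limit is counted in tokens, not characters.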


### Pipe

In [5]:
# Load ADE-DRUG-EFFECT pipe  
# Paper: https://paperswithcode.com/paper/mining-adverse-drug-reactions-from
# Google Colab: https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb

from transformers import pipeline, AutoTokenizer

# NOTE: this tokenizer is loaded but never passed to the pipeline below, which loads the model's own tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
pipe = pipeline("token-classification", model="chintagunta85/electramed-small-ADE-DRUG-EFFECT-ner-v3")



In [6]:
# Filter and generate tags
def generate_ner_tags(entry):
    return {'ner_tags': pipe(entry['adverse_reactions']) }

dataset = dataset.map(generate_ner_tags)

Map:   0%|          | 0/5642 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [7]:
# Inspect
dataset

Dataset({
    features: ['adverse_reactions', 'ner_tags'],
    num_rows: 5642
})

In [8]:
# Inspect
dataset['ner_tags'][3]

[{'end': 7,
  'entity': 'B-EFFECT',
  'index': 1,
  'score': 0.9073781967163086,
  'start': 0,
  'word': 'adverse'},
 {'end': 17,
  'entity': 'I-EFFECT',
  'index': 2,
  'score': 0.9293419122695923,
  'start': 8,
  'word': 'reactions'}]

In [9]:
dataset['adverse_reactions'][3220]



### Rejoin NER Fragments 
Once NER has been performed with this model, post-processing must be done to rejoin the chunks where applicable.

In [10]:
def recombine_ner_chunks(entry):
    # Collect the BIO tag and the word piece of every token returned by the pipe
    all_tags = [tag_set['entity'] for tag_set in entry['ner_tags']]
    all_words = [tag_set['word'] for tag_set in entry['ner_tags']]

    # Number the chunks: a 'B-' tag starts a new chunk, an 'I-' tag continues the current one
    i = 0
    chunk_number = []
    for tag in all_tags:
        if tag.startswith('B-'):
            i += 1
            chunk_number.append(i)
        elif tag.startswith('I-'):
            chunk_number.append(i)

    # Join consecutive word pieces that share a chunk number
    rebuilt = []
    word = ''
    for position, chunk_indicator in enumerate(chunk_number):
        word = word + ' ' + all_words[position]
        is_last = position == len(chunk_number) - 1
        if is_last or chunk_indicator != chunk_number[position + 1]:
            rebuilt.append(word.strip())  # chunk finished
            word = ''  # start a new chunk

    return {'ner_chunk_indicators': chunk_number, 'individual_ner_words': all_words, 'ner_chunks': rebuilt}

In [11]:
dataset = dataset.map(recombine_ner_chunks)
dataset

Map:   0%|          | 0/5642 [00:00<?, ? examples/s]

Dataset({
    features: ['adverse_reactions', 'ner_tags', 'ner_chunk_indicators', 'individual_ner_words', 'ner_chunks'],
    num_rows: 5642
})

In [12]:
# Inspect
dataset['ner_chunks']

[['rash'],
 ['adverse reactions section', 'satisfaction', 'vi'],
 ['adverse reactions', 'mo'],
 ['adverse reactions'],
 ['adverse reaction'],
 ['adverse reaction', 'tn 370'],
 ['adverse reactions', 'cp ##da - 1'],
 ['potassium salts'],
 ['allergic sensitization', 'foli ##c acid'],
 ['adverse reactions'],
 ['adverse reaction', 'wal', 'il'],
 ['corticosteroids'],
 ['adverse reaction', 'ald', 'il 60 ##51'],
 ['adverse reactions'],
 ['adverse reactions'],
 [],
 ['adverse reaction', 'ri'],
 ['reactions allergic rash'],
 ['reactions allergic rash'],
 ['vitamin b 12'],
 ['adverse reactions section', 'ave', 'sp ##f'],
 [],
 ['adverse reaction', 'pri'],
 [],
 ['foli ##c acid'],
 ['ny ##stat ##in'],
 ['sham'],
 ['ny ##stat ##in'],
 ['claims satisfaction'],
 ['adverse reactions', 'mo'],
 ['adverse reactions dist', 'oh'],
 ['sp ##f 100 +'],
 ['sham'],
 ['adverse reaction - re'],
 [],
 ['gentamicin'],
 ['lact ##ulose'],
 ['adverse reaction dist'],
 ['adverse reactions'],
 ['adverse reactions', 'top

In [13]:
# IL + zip code getting recognized as an interleukin
dataset[12]

{'adverse_reactions': 'Adverse reaction Distributed by ALDI Inc. Batavia, IL 60510 DOUBLE BUARANTEE REPLACE THE PRODUCT REFUND YOUR MONEY www.ALDI.us PLEASE RECYCLE',
 'ner_tags': [{'end': 7,
   'entity': 'B-EFFECT',
   'index': 1,
   'score': 0.8888990879058838,
   'start': 0,
   'word': 'adverse'},
  {'end': 16,
   'entity': 'I-EFFECT',
   'index': 2,
   'score': 0.8761894702911377,
   'start': 8,
   'word': 'reaction'},
  {'end': 35,
   'entity': 'B-DRUG',
   'index': 5,
   'score': 0.5261055827140808,
   'start': 32,
   'word': 'ald'},
  {'end': 53,
   'entity': 'B-DRUG',
   'index': 13,
   'score': 0.8849818110466003,
   'start': 51,
   'word': 'il'},
  {'end': 56,
   'entity': 'I-DRUG',
   'index': 14,
   'score': 0.3542076349258423,
   'start': 54,
   'word': '60'},
  {'end': 58,
   'entity': 'I-DRUG',
   'index': 15,
   'score': 0.4874825179576874,
   'start': 56,
   'word': '##51'}],
 'ner_chunk_indicators': [1, 1, 2, 3, 3, 3],
 'individual_ner_words': ['adverse', 'reaction', 

In [14]:
# Inspect
dataset[53]

{'adverse_reactions': 'ADVERSE REACTIONS Developing teeth of children under age 6 may become permanently discolored if excessive amounts are repeatedly swallowed. The following adverse reactions are possible in individuals hypersensitive to fluoride: eczema, atopic dermatitis, urticaria, gastric distress, headache, and weakness.',
 'ner_tags': [{'end': 226,
   'entity': 'B-DRUG',
   'index': 35,
   'score': 0.9134745001792908,
   'start': 218,
   'word': 'fluoride'}],
 'ner_chunk_indicators': [1],
 'individual_ner_words': ['fluoride'],
 'ner_chunks': ['fluoride']}

TO DO: Quantify the different occurrences and see if anything pops up? Post-processing is also still needed to remove the ## markers and to get rid of occurrences of the phrase "adverse reaction(s)".
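A first pass at this TO DO could look like the sketch below. It assumes that stripping `##` markers by plain string replacement is good enough for counting and that any chunk starting with "adverse reaction" is header noise; the `count_chunks` helper is hypothetical, and the sample rows are copied from the `ner_chunks` output above.

```python
from collections import Counter

def count_chunks(ner_chunks_column, ignore_prefixes=('adverse reaction',)):
    """Count cleaned NER chunks across rows, skipping section-header noise."""
    counts = Counter()
    for chunks in ner_chunks_column:
        for chunk in chunks:
            cleaned = chunk.replace(' ##', '')  # 'foli ##c acid' -> 'folic acid'
            if cleaned.startswith('##'):
                cleaned = cleaned[2:]  # chunk that begins with a word piece
            if cleaned.startswith(ignore_prefixes):
                continue
            counts[cleaned] += 1
    return counts

# A few rows copied from the ner_chunks output above
sample_chunks = [
    ['rash'],
    ['adverse reactions section', 'satisfaction', 'vi'],
    ['adverse reactions', 'mo'],
    ['foli ##c acid'],
    ['ny ##stat ##in'],
    ['foli ##c acid'],
    ['ny ##stat ##in'],
]
chunk_counts = count_chunks(sample_chunks)
print(chunk_counts.most_common(3))
```

On the full dataset this would be called as `count_chunks(dataset['ner_chunks'])`; very short chunks such as `vi` or `mo` would still need a separate filter.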

In [15]:
dataset_df = dataset.to_pandas()

In [19]:
dataset_df['ner_chunks']

0                                              [rash]
1       [adverse reactions section, satisfaction, vi]
2                             [adverse reactions, mo]
3                                 [adverse reactions]
4                                  [adverse reaction]
                            ...                      
5637                                [potassium salts]
5638                       [hyper ##thy ##roid ##ism]
5639                                               []
5640                                  [foli ##c acid]
5641                                               []
Name: ner_chunks, Length: 5642, dtype: object

### Evaluate Performance TODO
Figure out a way to use the model's score to assess performance.

In [66]:
dataset_50 = 
# dataset_80 = 
# dataset_95 =

# manually label dataset? 100 sentences? 32 lowball


Filter:   0%|          | 0/647 [00:00<?, ? examples/s]

In [None]:
dataset_50 = []
dataset_80 = []
dataset_95 = []

# Bucket every predicted tag by the model's confidence score.
# Each element of entry['ner_tags'] is a dict, so the score is read directly from it.
for entry in dataset:
    for ner_tag in entry['ner_tags']:
        score = ner_tag['score']
        if score > 0.50:
            dataset_50.append(ner_tag)
        if score > 0.80:
            dataset_80.append(ner_tag)
        if score > 0.95:
            dataset_95.append(ner_tag)
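One way the score thresholds could be tied to the manual-labeling idea noted above is to compute precision and recall at each threshold against a small hand-labeled set. The sketch below is a simplification: it matches on token strings only, and the gold labels are made up for illustration. The tags are a trimmed copy of the `dataset[12]` output shown earlier.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall between predicted and gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def evaluate_at_thresholds(ner_tags, gold_words, thresholds=(0.50, 0.80, 0.95)):
    """Keep only tags scoring above each threshold and compare them to the gold words."""
    results = {}
    for threshold in thresholds:
        kept = [tag['word'] for tag in ner_tags if tag['score'] > threshold]
        results[threshold] = precision_recall(kept, gold_words)
    return results

# Trimmed copy of the ner_tags for dataset[12]
tags = [
    {'word': 'adverse', 'score': 0.8888990879058838},
    {'word': 'reaction', 'score': 0.8761894702911377},
    {'word': 'ald', 'score': 0.5261055827140808},
    {'word': 'il', 'score': 0.8849818110466003},
    {'word': '60', 'score': 0.3542076349258423},
    {'word': '##51', 'score': 0.4874825179576874},
]
gold = ['adverse', 'reaction']  # hypothetical manual labels, for illustration only
results = evaluate_at_thresholds(tags, gold)
for threshold, (precision, recall) in results.items():
    print(f'score > {threshold:.2f}: precision={precision:.2f}, recall={recall:.2f}')
```

A real evaluation would compare character spans and entity types rather than bare words, and would need enough labeled sentences for the numbers to be meaningful.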

#### Convert free-text to JSON representations 
TO DO: This section is likely no longer relevant, but keeping it for now.

**(9/14)**: In a previous exercise with Kindred, PDF labels from Drugs@FDA were used as follows:  
  
1. PDFs were downloaded 
2. PDFs were converted to free-text files.  
3. Free-text files were then separated into different sections for Indications, Contraindications, and Adverse Effects.  (possible LLM task?)
    
This separation was chosen as the types of named entities and relationships would be largely identical, but with different inferred meanings dependent on the section they were present within (this seemed a hard problem to address). Additionally, due to the labels being highly unstructured and designed for visual, human understanding, the conversion from PDF to free-text could be extremely messy. 

**(9/28)**: In this exercise, Drugs@FDA labels were obtained from the OpenFDA resource and converted from JSON format to a pandas dataframe (Excel file). Some of these fields are still messy, but they are less messy than with the previous method and are pre-sectioned.

I am thinking now about how to re-format this dataset for use in HuggingFace models.

In thinking about this, I think the data should eventually be formatted like this:

In [None]:
# JSON format for Adverse Effects section (placeholder strings mark values still to be filled in)
{
    "meta": {
        "label": "<identifier>",
        "drug": "<drug label>",
        "type": "<type of page, i.e. indication, adverse effects, contraindications>",
        "url": "<url of label download>",
    },
    "adverse_effects": "<free text dump>",
    "tokens": [...],
    "pos_tags": [...],  # IDs
    "chunk_tags": [...],  # IDs
    "ner_tags": [...],  # IDs
    "id": "<identifier for data point>",
}
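As a sketch of how one row might be turned into this format, the helper below splits the text on whitespace and assigns each token the label of the first predicted entity it overlaps, using the `start`/`end` offsets from the pipe output. Several things here are assumptions: the `to_record` name, whitespace tokenization, label strings instead of IDs, and the omission of `pos_tags` and `chunk_tags`, which nothing in this notebook produces. The example text and offsets are made up, loosely based on the fluoride example above.

```python
import json
import re

def to_record(identifier, drug, text, ner_tags):
    """Build one JSON-style record with whitespace tokens aligned to NER labels."""
    tokens, labels = [], []
    for match in re.finditer(r'\S+', text):
        label = 'O'  # default for tokens outside any predicted entity
        for tag in ner_tags:
            # A token takes the label of the first prediction it overlaps
            if tag['start'] < match.end() and tag['end'] > match.start():
                label = tag['entity']
                break
        tokens.append(match.group())
        labels.append(label)
    return {
        'meta': {'label': identifier, 'drug': drug, 'type': 'adverse effects', 'url': None},
        'adverse_effects': text,
        'tokens': tokens,
        'ner_tags': labels,  # label strings here; mapping them to IDs is left open
        'id': identifier,
    }

text = 'Individuals hypersensitive to fluoride may develop urticaria.'
tags = [{'entity': 'B-DRUG', 'start': 30, 'end': 38, 'word': 'fluoride'}]
record = to_record('example-0', 'hypothetical fluoride product', text, tags)
print(json.dumps(record, indent=2))
```

Keeping `tokens` and `ner_tags` the same length is the main point, since token-classification datasets generally expect one label per token.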