## Convert free-text to JSON representations

In a previous exercise with Kindred, PDF labels from Drugs@FDA were processed as follows:  
  
1. PDFs were downloaded.
2. PDFs were converted to free-text files.  
3. Free-text files were then split into separate sections for Indications, Contraindications, and Adverse Effects. 
    
This separation was chosen because the types of named entities and relationships are largely identical across sections, but their inferred meaning depends on the section in which they appear (this seemed a hard problem to address).
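
The code used for the original split is not shown in this notebook. As a rough illustration only, a section split could be sketched with regular expressions, assuming the free text contains the heading strings `INDICATIONS AND USAGE`, `CONTRAINDICATIONS`, and `ADVERSE REACTIONS`. The names `SECTION_HEADINGS` and `split_sections` are hypothetical, and the sample text is made up.

```python
import re

# Assumed heading patterns; real labels may word or order these differently.
SECTION_HEADINGS = {
    "indication": r"INDICATIONS\s+AND\s+USAGE",
    "contraindication": r"CONTRAINDICATIONS",
    "adverse effects": r"ADVERSE\s+REACTIONS",
}

def split_sections(label_text):
    """Return {section name: text between its heading and the next known heading}."""
    hits = []
    for name, pattern in SECTION_HEADINGS.items():
        match = re.search(pattern, label_text)
        if match:
            hits.append((match.start(), match.end(), name))
    hits.sort()  # order headings by position in the text

    sections = {}
    for i, (_, body_start, name) in enumerate(hits):
        body_stop = hits[i + 1][0] if i + 1 < len(hits) else len(label_text)
        sections[name] = label_text[body_start:body_stop].strip()
    return sections

# Made-up sample text, for illustration only
sample = (
    "INDICATIONS AND USAGE Drug X is indicated for acne. "
    "CONTRAINDICATIONS Do not use in pregnancy. "
    "ADVERSE REACTIONS Skin irritation was reported."
)
sections = split_sections(sample)
print(sections)
```

Because each section runs until the next *known* heading, any other label section sitting in between would be swallowed into the preceding one; a real pipeline would need a fuller heading list.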

For this exercise, I am thinking about how to reformat this dataset for use with HuggingFace models. The aim of this research is to use AI-assisted technology to automatically generate interaction claims from additional, previously underutilized sources of data. In doing so, we can increase data parity between the published state of knowledge and what is available in databases.  
  
**Approach**:  
  
1. Take an existing pretrained model and train it to perform **entity recognition** of drugs, genes, variants, and phenotypes. 
2. Either (a) identify a suitable model or (b) further train an existing pretrained model to perform **text summarization**, condensing a label into one individual 'claim' that links drugs, genes, variants, and phenotypes.

I think the data should eventually be formatted like this:

In [None]:
# JSON format
{
    "meta": {
        "label": <identifier>,
        "drug": <drug label>,
        "type": <type of page, i.e. indication, adverse effects, contraindications>,
        "url": <url of label download>
    },
    "text": <free text dump>,
    "tokens": [...],
    "pos_tags": [...], # IDs
    "chunk_tags": [...], # IDs
    "ner_tags": [...], # IDs
    "id": <identifier for data point>
}
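
To make the template concrete, here is a minimal sketch of building one record with that shape in Python. Everything specific is an assumption for illustration: the helper name `build_record`, the label inventory `NER_LABELS`, the whitespace tokenization, the example sentence, and the placeholder identifier and URL.

```python
import json

# Hypothetical label set; the real inventory depends on the NER model used.
NER_LABELS = ["O", "B-DRUG", "I-DRUG", "B-EFFECT", "I-EFFECT"]
LABEL_TO_ID = {label: i for i, label in enumerate(NER_LABELS)}

def build_record(record_id, label_id, drug, section, url, text):
    """Assemble one data point following the JSON template above."""
    tokens = text.split()  # naive whitespace tokenization, for illustration only
    return {
        "meta": {"label": label_id, "drug": drug, "type": section, "url": url},
        "text": text,
        "tokens": tokens,
        "pos_tags": [],    # left empty in this sketch
        "chunk_tags": [],  # left empty in this sketch
        "ner_tags": [LABEL_TO_ID["O"]] * len(tokens),  # placeholder: every token untagged
        "id": record_id,
    }

record = build_record(
    record_id="0",
    label_id="example-label",             # hypothetical identifier
    drug="tretinoin",
    section="indication",
    url="https://example.org/label.pdf",  # placeholder URL
    text="Tretinoin gel is indicated for topical treatment of acne vulgaris.",
)
print(json.dumps(record, indent=2))
```

Storing `ner_tags` as integer IDs aligned one-to-one with `tokens` keeps the record serializable as plain JSON, with the ID-to-label mapping kept alongside the data.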

## Load our Dataset
The dataset is generated from FDA labels. Labels were filtered by section (indication/adverse effects/contraindication). Once identified, label sections were split into sentences using Kindred and saved as individual corpora.

In [1]:
from datasets import load_dataset
dataset = load_dataset("../old-data/kindred-data-sets/sentence_size/indication_corpus/")
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 141
    })
})

In [2]:
# Inspect features
dataset['train'].features

{'text': Value(dtype='string', id=None)}

In [3]:
# Inspect entry
dataset['train'][0]

{'text': ' CLINICAL STUDIES  Tretinoin gel, USP (microsphere) 0.1%: In two vehicle-controlled studies  tretinoin gel, USP (microsphere) 0.1% applied once daily was significantly  more effective than vehicle in reducing the severity of acne lesion counts.'}

## Generate NER tags for Dataset

In [4]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# PubMedBERT tokenizer and masked-LM weights; note that these are not passed to the pipeline below
tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
model = AutoModelForMaskedLM.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
# The token-classification pipeline loads its own tokenizer and weights from this checkpoint
pipe = pipeline("token-classification", model="chintagunta85/electramed-small-ADE-DRUG-EFFECT-ner-v3")

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClas

In [13]:
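# Filter and generate tags
def generate_ner_tags(entry):
    # Run the NER pipeline on one entry and store its token-level predictions
    return {'ner_tags': pipe(entry['text'])}

# Drop entries with no text, then entries of 512 or more characters.
# Note: len() counts characters, not tokens, so this is only a rough proxy for the model's input limit.
dataset = dataset.filter(lambda x: x['text'] is not None)
dataset = dataset.filter(lambda x: len(x['text']) < 512)

dataset = dataset.map(generate_ner_tags)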

In [14]:
# Inspect
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'ner_tags'],
        num_rows: 75
    })
})

In [15]:
# Inspect
dataset['train'][0]['ner_tags']

[{'end': 22,
  'entity': 'B-DRUG',
  'index': 3,
  'score': 0.6573849320411682,
  'start': 19,
  'word': 'tre'},
 {'end': 25,
  'entity': 'B-DRUG',
  'index': 4,
  'score': 0.5464023947715759,
  'start': 22,
  'word': '##tin'},
 {'end': 28,
  'entity': 'I-DRUG',
  'index': 5,
  'score': 0.6481591463088989,
  'start': 25,
  'word': '##oin'},
 {'end': 227,
  'entity': 'B-EFFECT',
  'index': 52,
  'score': 0.9042876362800598,
  'start': 225,
  'word': 'ac'},
 {'end': 229,
  'entity': 'I-EFFECT',
  'index': 53,
  'score': 0.9350144267082214,
  'start': 227,
  'word': '##ne'},
 {'end': 236,
  'entity': 'I-EFFECT',
  'index': 54,
  'score': 0.8742935657501221,
  'start': 230,
  'word': 'lesion'}]
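
The output above is per word piece (`tre`, `##tin`, `##oin`), so one drug name is spread over several predictions. A minimal sketch of merging pieces into entity spans by character offset follows. The helper `merge_token_predictions` and its `max_gap` heuristic are assumptions for illustration, not part of the pipeline, and the example text and scores are made up.

```python
def merge_token_predictions(text, tags, max_gap=1):
    """Merge word-piece predictions into entity spans using character offsets.

    Heuristic (an assumption of this sketch): a piece extends the current span
    when it has the same entity type and either touches the span directly or is
    an I- tag separated by at most `max_gap` characters (e.g. a single space).
    """
    spans = []
    for tag in sorted(tags, key=lambda t: t["start"]):
        prefix, _, label = tag["entity"].partition("-")
        current = spans[-1] if spans else None
        if current is not None and current["label"] == label:
            gap = tag["start"] - current["end"]
            if gap == 0 or (prefix == "I" and 0 <= gap <= max_gap):
                current["end"] = tag["end"]
                current["scores"].append(tag["score"])
                continue
        spans.append({"label": label, "start": tag["start"],
                      "end": tag["end"], "scores": [tag["score"]]})
    return [
        {"label": s["label"], "start": s["start"], "end": s["end"],
         "text": text[s["start"]:s["end"]],
         "score": sum(s["scores"]) / len(s["scores"])}  # mean piece score
        for s in spans
    ]

# Made-up example with hand-written offsets and scores
example_text = "Tretinoin treats acne lesion"
example_tags = [
    {"entity": "B-DRUG", "score": 0.6, "start": 0, "end": 3},
    {"entity": "B-DRUG", "score": 0.5, "start": 3, "end": 6},
    {"entity": "I-DRUG", "score": 0.7, "start": 6, "end": 9},
    {"entity": "B-EFFECT", "score": 0.9, "start": 17, "end": 21},
    {"entity": "I-EFFECT", "score": 0.8, "start": 22, "end": 28},
]
merged = merge_token_predictions(example_text, example_tags)
for span in merged:
    print(span)
```

The token-classification pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups pieces itself; the manual version here just makes the grouping rule explicit.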

## Evaluate Performance
Figure out a way to use the model's confidence scores to assess performance.

In [66]:
# Unfinished placeholders for score-thresholded subsets (see the loop in the next cell)
# dataset_50 = 
# dataset_80 = 
# dataset_95 =

In [None]:
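# Collect predicted tags whose confidence score exceeds each threshold
dataset_50 = []
dataset_80 = []
dataset_95 = []

# Iterate over the 'train' split; iterating the DatasetDict itself yields split names
for entry in dataset['train']:
    # Each element of 'ner_tags' is one token-level prediction (a dict with a 'score')
    for tag in entry['ner_tags']:
        if tag['score'] > 0.50:
            dataset_50.append(tag)
        if tag['score'] > 0.80:
            dataset_80.append(tag)
        if tag['score'] > 0.95:
            dataset_95.append(tag)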
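
One way to look at the scores is to count how many predictions of each entity label survive each threshold. The sketch below does this for a flat list of predictions; the helper name `count_by_threshold` and the example scores are hypothetical. Note that these scores are the model's own confidence, so without gold annotations this shows how selective a threshold is, not how accurate the tags are.

```python
from collections import Counter

def count_by_threshold(tags, thresholds=(0.50, 0.80, 0.95)):
    """Count predictions per entity label that score above each threshold."""
    summary = {}
    for threshold in thresholds:
        kept = [tag for tag in tags if tag["score"] > threshold]
        summary[threshold] = Counter(tag["entity"] for tag in kept)
    return summary

# Hypothetical predictions, shaped like the pipeline output shown earlier
example_tags = [
    {"entity": "B-DRUG", "score": 0.66},
    {"entity": "I-DRUG", "score": 0.65},
    {"entity": "B-EFFECT", "score": 0.90},
    {"entity": "I-EFFECT", "score": 0.96},
]
summary = count_by_threshold(example_tags)
for threshold, counts in summary.items():
    print(threshold, dict(counts))
```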