## Convert free-text to JSON representations

In previous exercise with Kindred, PDF labels from Drugs@FDA were used as follows:  
  
1. PDFs were downloaded 
2. PDFs were converted to free-text files.  
3. Free-text files were then separated into different sections for Indications, Contraindications, and Adverse Effects. 
    
This separation was chosen as the types of named entities and relationships would be largely identical, but with different inferred meanings dependent on the section they were present within (this seemed a hard problem to address).

For this exercise, I am thinking about how to re-format this dataset for use in HuggingFace models. The aim of this research is to take an existing pretrained model, such as BioNLP, and train it further for **entity recognition** of drugs, genes, variants, and phenotypes. Following this, train the model further to perform **text summarization** to summarize a label to link drugs, genes, variants, phenotypes within one individual 'claim'.

In thinking about this, I think the data should eventually be formatted like this:

In [None]:
# JSON format
{
    "meta": { "label": <identifier>,
                "drug": <drug label>,
                "type": <type of page, i.e. indication, adverse effects, contraindications>,
                "url": <url of label download>,
            },
    "text": <free text dump>,
    "tokens": [...],
    "pos_tags": [...], # IDs
    "chunk_tags": [...], # IDs
    "ner_tags": [...], # IDs
    "id": <identifier for data point>

}

This I think is the data format that needs to happen to make a training set for at least named entity recognition?

### Loading the Dataset

In [3]:
from datasets import load_dataset
dataset = load_dataset("../old-data/kindred-data-sets/indication_pages")
dataset

Resolving data files:   0%|          | 0/647 [00:00<?, ?it/s]

Found cached dataset text (/Users/mjc014/.cache/huggingface/datasets/text/indication_pages-3968f876ec7b966a/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 647
    })
})

In [9]:
dataset['train'].features

{'text': Value(dtype='string', id=None)}

In [5]:
dataset['train'][0]



### Tokenize

In [35]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
model = AutoModelForMaskedLM.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
pipe = pipeline("token-classification", model="chintagunta85/electramed-small-ADE-DRUG-EFFECT-ner-v3")

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [31]:
def make_tokens(text):
    return {'tokens' : tokenizer.tokenize(text['text'])}

dataset = dataset.filter(lambda x: x['text'] is not None)

dataset = dataset.map(make_tokens)

Filter:   0%|          | 0/647 [00:00<?, ? examples/s]

Map:   0%|          | 0/647 [00:00<?, ? examples/s]

In [34]:
dataset[0]

 'tokens': ['the',
  'plastic',
  'container',
  'is',
  'made',
  'from',
  'a',
  'multilayer',
  '##ed',
  'film',
  'specifically',
  'developed',
  'for',
  'parenteral',
  'drugs',
  '.',
  'it',
  'contains',
  'no',
  'plastic',
  '##izers',
  'and',
  'exhibits',
  'virtually',
  'no',
  'le',
  '##ach',
  '##ables',
  '.',
  'the',
  'solution',
  'contact',
  'layer',
  'is',
  'a',
  'rubber',
  '##ized',
  'copolymer',
  'of',
  'ethylene',
  'and',
  'propyl',
  '##ene',
  '.',
  'the',
  'container',
  'is',
  'nont',
  '##oxic',
  'and',
  'biologically',
  'inert',
  '.',
  'the',
  'container',
  '-',
  'solution',
  'unit',
  'is',
  'a',
  'closed',
  'system',
  'and',
  'is',
  'not',
  'dependent',
  'upon',
  'entry',
  'of',
  'external',
  'air',
  'during',
  'administration',
  '.',
  'the',
  'container',
  'is',
  'over',
  '##wr',
  '##apped',
  'to',
  'provide',
  'protection',
  'from',
  'the',
  'physical',
  'environment',
  'and',
  'to',
  'provid

### Attempting to NER?

In [37]:
pipe = pipeline("token-classification", model="chintagunta85/electramed-small-ADE-DRUG-EFFECT-ner-v3")

def get_ner_tags(text):
    return {'ner_tags' : pipe(text['tokens'])}

dataset = dataset.map(get_ner_tags) # try tqdming this dog, also is there a better way to batch this?

Map:   0%|          | 0/647 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [38]:
dataset

Dataset({
    features: ['text', 'tokens', 'ner_tags'],
    num_rows: 647
})

In [39]:
dataset[0]['ner_tags'] # next step, choose a score threshold and filter for confidence, notion of extraction?

[[],
 [{'end': 7,
   'entity': 'B-DRUG',
   'index': 1,
   'score': 0.313151478767395,
   'start': 0,
   'word': 'plastic'}],
 [],
 [],
 [],
 [],
 [{'end': 1,
   'entity': 'B-DRUG',
   'index': 1,
   'score': 0.3816811144351959,
   'start': 0,
   'word': 'a'}],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [{'end': 7,
   'entity': 'B-DRUG',
   'index': 1,
   'score': 0.313151478767395,
   'start': 0,
   'word': 'plastic'}],
 [],
 [],
 [],
 [],
 [],
 [{'end': 2,
   'entity': 'B-EFFECT',
   'index': 1,
   'score': 0.6361506581306458,
   'start': 0,
   'word': 'le'}],
 [{'end': 5,
   'entity': 'B-DRUG',
   'index': 3,
   'score': 0.419985294342041,
   'start': 2,
   'word': 'ach'}],
 [],
 [],
 [],
 [{'end': 8,
   'entity': 'B-DRUG',
   'index': 1,
   'score': 0.35858142375946045,
   'start': 0,
   'word': 'solution'}],
 [{'end': 7,
   'entity': 'B-EFFECT',
   'index': 1,
   'score': 0.5007998943328857,
   'start': 0,
   'word': 'contact'}],
 [{'end': 5,
   'entity': 'B-EFFE

From this output, I think I am running the pipe incorrectly, I think I should be running the pipe on the whole text as opposed to the individual tokens, as the index field seems to always be the same lists of one?

In [48]:
# Try 2
pipe = pipeline("token-classification", model="chintagunta85/electramed-small-ADE-DRUG-EFFECT-ner-v3")

def get_ner_tags_again(text):
    return {'ner_tags_again' : pipe(text['text'])}

dataset = dataset.map(get_ner_tags_again) # try tqdming this dog, also is there a better way to batch this?

Map:   0%|          | 0/647 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


This might not be correct, tensor length mismatch. This probably means things need to be split down and padded before hand. Would the workflow be separate out doc to sentences, then pipe the sentences? This is a TO DO

### Try out some filter thresholds

In [66]:
dataset_50 = 
# dataset_80 = 
# dataset_95 =

Filter:   0%|          | 0/647 [00:00<?, ? examples/s]

In [60]:
dataset_50 = []
dataset_80 = []
dataset_95 = []

for entry in dataset:
    for ner_tags in entry['ner_tags']:
        try:
            if ner_tags[0]['score'] > 0.50:
                dataset_50.append()
            if ner_tags[0]['score'] > 0.80:
                print(ner_tags[0])
            if ner_tags[0]['score'] > 0.95:
                print(ner_tags[0])

            else:
                pass
        except:
            pass

{'end': 8, 'entity': 'B-DRUG', 'index': 1, 'score': 0.8111954927444458, 'start': 0, 'word': 'ethylene'}
{'end': 4, 'entity': 'B-DRUG', 'index': 1, 'score': 0.923985481262207, 'start': 0, 'word': 'prop'}
{'end': 3, 'entity': 'B-DRUG', 'index': 1, 'score': 0.7690395712852478, 'start': 0, 'word': 'tam'}
{'end': 13, 'entity': 'B-DRUG', 'index': 1, 'score': 0.8387004137039185, 'start': 0, 'word': 'hydrochloride'}
{'end': 4, 'entity': 'B-EFFECT', 'index': 1, 'score': 0.7765231728553772, 'start': 0, 'word': 'dias'}
{'end': 13, 'entity': 'B-DRUG', 'index': 1, 'score': 0.8387004137039185, 'start': 0, 'word': 'hydrochloride'}
{'end': 6, 'entity': 'B-EFFECT', 'index': 1, 'score': 0.7771751880645752, 'start': 0, 'word': 'change'}
{'end': 10, 'entity': 'B-EFFECT', 'index': 1, 'score': 0.7592799663543701, 'start': 0, 'word': 'myocardial'}
{'end': 4, 'entity': 'B-DRUG', 'index': 1, 'score': 0.9100304841995239, 'start': 0, 'word': 'prop'}
{'end': 7, 'entity': 'B-EFFECT', 'index': 1, 'score': 0.7565127

The problem of padding needs to be addressed and pipe needs to be correctly run. Analysis is impossible on this set as compound chemicals / phenotypes are being split up and evaluated separately when they should be together.