## Background
Goal: load a manually annotated sentence set (built from the indications_and_usage section) and work out how to evaluate a model against it using Hugging Face tools.

## Convert Text/Annotations to JSON
Dataset objects do not support individual item assignment, so every field has to be populated before loading. The easiest approach I found was to write everything to a JSON file and load it with load_dataset().

In [1]:
# Listify TEXT
# Note: readlines() returns a list of lines, so each entry is a list rather than a bare string.
text_for_json = []
for number in range(100):
    with open(f'sentences/texts/{number}.txt', 'r') as f:
        text_for_json.append(f.readlines())

text_for_json[0:3]

[["1 INDICATIONS AND USAGE Memantine hydrochloride is an N-methyl-D-aspartate (NMDA) receptor antagonist indicated for the treatment of moderate to severe dementia of the Alzheimer's type. ( 1 ) Memantine hydrochloride tablets, USP are indicated for the treatment of moderate to severe dementia of the Alzheimer's type."],
 ['1 INDICATIONS AND USAGE Zenchent Fe, norethindrone and ethinyl estradiol tablets, chewable and ferrous fumarate tablets are indicated for use by females of reproductive potential to prevent pregnancy. • Zenchent Fe, norethindrone and ethinyl estradiol tablets, chewable and ferrous fumarate tablets is a progestin/estrogen COC indicated for use by females of reproductive potential to prevent pregnancy. ( 1 )'],
 ['1 INDICATIONS AND USAGE Alprazolam extended-release tablets are indicated for the treatment of panic disorder with or without agoraphobia, in adults. Alprazolam extended-release tablets are a benzodiazepine indicated for the treatment of panic disorder with 

In [2]:
# Listify ANNOTATIONS
ann_for_json = []
for number in range(100):
    with open(f'sentences/ann/{number}.ann', 'r') as f:
        entry = []
        for item in f:
            # Line layout: ID <tab> "LABEL START END" <tab> TEXT
            fields = item.split('\t')
            label, start, end = fields[1].split(' ')[:3]
            entry.append({
                'entity': label,
                'start': start,  # offsets are kept as strings here
                'end': end,
                'word': fields[2].replace('\n', ''),
            })
    ann_for_json.append(entry)
ann_for_json[0:3]

[[],
 [{'entity': 'CHEMICAL', 'start': '24', 'end': '35', 'word': 'Zenchent Fe'},
  {'entity': 'CHEMICAL', 'start': '37', 'end': '50', 'word': 'norethindrone'},
  {'entity': 'CHEMICAL',
   'start': '55',
   'end': '72',
   'word': 'ethinyl estradiol'},
  {'entity': 'CHEMICAL',
   'start': '95',
   'end': '111',
   'word': 'ferrous fumarate'},
  {'entity': 'CHEMICAL', 'start': '203', 'end': '214', 'word': 'Zenchent Fe'},
  {'entity': 'CHEMICAL',
   'start': '216',
   'end': '229',
   'word': 'norethindrone'},
  {'entity': 'CHEMICAL',
   'start': '234',
   'end': '251',
   'word': 'ethinyl estradiol'},
  {'entity': 'CHEMICAL',
   'start': '274',
   'end': '290',
   'word': 'ferrous fumarate'},
  {'entity': 'CHEMICAL', 'start': '304', 'end': '313', 'word': 'progestin'},
  {'entity': 'CHEMICAL', 'start': '314', 'end': '322', 'word': 'estrogen'}],
 [{'entity': 'CHEMICAL', 'start': '24', 'end': '34', 'word': 'Alprazolam'},
  {'entity': 'CHEMICAL', 'start': '150', 'end': '160', 'word': 'Alpra
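
The parsing cell above assumes every line of an `.ann` file is a simple text-bound annotation. A slightly more defensive variant is sketched below, assuming the files follow the brat standoff layout (`ID<TAB>LABEL START END<TAB>TEXT`). The helper name `parse_ann_line` and the sample lines are hypothetical, and unlike the notebook cell it casts the offsets to `int`.

```python
from typing import Optional


def parse_ann_line(line: str) -> Optional[dict]:
    """Parse one text-bound annotation line; return None for anything else."""
    parts = line.rstrip("\n").split("\t")
    # Text-bound annotations have three tab-separated fields and an ID starting with "T".
    if len(parts) != 3 or not parts[0].startswith("T"):
        return None
    label_and_span = parts[1].split(" ")
    # Discontinuous spans such as "24 35;40 50" split into more pieces; skip them here.
    if len(label_and_span) != 3:
        return None
    label, start, end = label_and_span
    if not (start.isdigit() and end.isdigit()):
        return None
    return {"entity": label, "start": int(start), "end": int(end), "word": parts[2]}


# Hypothetical sample lines, modelled on the output shown above.
sample = [
    "T1\tCHEMICAL 24 35\tZenchent Fe\n",
    "#1\tAnnotatorNotes T1\tcheck spelling\n",
]
parsed = [p for p in (parse_ann_line(s) for s in sample) if p is not None]
print(parsed)
```

Integer offsets make later comparisons against the pipeline output (which reports `start` and `end` as integers) a little less fragile than comparing formatted strings.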

In [3]:
# Cast to JSON and save
import json
data = json.dumps([{'text': text, 'chemical_eval': ann} for text, ann in zip(text_for_json, ann_for_json)])
with open('sentences/evalset.json', 'w', encoding='utf-8') as f:
    f.write(data)
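
Before saving, it may be worth checking that each annotation's offsets actually point at its `word` in the text. The sketch below is one way to do that; `find_offset_mismatches` is a hypothetical helper, and the second annotation in the sample is deliberately wrong to show what gets flagged. In the notebook each text is a one-element list, so the string to pass would be `text[0]`.

```python
def find_offset_mismatches(text: str, annotations: list) -> list:
    """Return the annotations whose [start, end) slice does not equal their word."""
    mismatches = []
    for ann in annotations:
        start, end = int(ann["start"]), int(ann["end"])
        if text[start:end] != ann["word"]:
            mismatches.append(ann)
    return mismatches


text = "1 INDICATIONS AND USAGE Gabapentin tablets USP are indicated for:"
annotations = [
    {"entity": "CHEMICAL", "start": "24", "end": "34", "word": "Gabapentin"},
    # Hypothetical bad entry: the span covers "tablets", not "tablet".
    {"entity": "CHEMICAL", "start": "35", "end": "42", "word": "tablet"},
]
mismatches = find_offset_mismatches(text, annotations)
print(mismatches)
```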

## Load Dataset into Huggingface
Use the load_dataset method to load the JSON file. It infers the features 'text' and 'chemical_eval' from the keys written with json.dumps above.

In [6]:
from datasets import load_dataset
dataset = load_dataset("json",data_files="sentences/evalset.json",split='train')

Downloading and preparing dataset json/default to /Users/mjc014/.cache/huggingface/datasets/json/default-ef66e144266a6f5b/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Dataset json downloaded and prepared to /Users/mjc014/.cache/huggingface/datasets/json/default-ef66e144266a6f5b/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


In [7]:
dataset

Dataset({
    features: ['text', 'chemical_eval'],
    num_rows: 100
})

In [8]:
dataset['chemical_eval'][1]

[{'end': '35', 'entity': 'CHEMICAL', 'start': '24', 'word': 'Zenchent Fe'},
 {'end': '50', 'entity': 'CHEMICAL', 'start': '37', 'word': 'norethindrone'},
 {'end': '72',
  'entity': 'CHEMICAL',
  'start': '55',
  'word': 'ethinyl estradiol'},
 {'end': '111',
  'entity': 'CHEMICAL',
  'start': '95',
  'word': 'ferrous fumarate'},
 {'end': '214', 'entity': 'CHEMICAL', 'start': '203', 'word': 'Zenchent Fe'},
 {'end': '229', 'entity': 'CHEMICAL', 'start': '216', 'word': 'norethindrone'},
 {'end': '251',
  'entity': 'CHEMICAL',
  'start': '234',
  'word': 'ethinyl estradiol'},
 {'end': '290',
  'entity': 'CHEMICAL',
  'start': '274',
  'word': 'ferrous fumarate'},
 {'end': '313', 'entity': 'CHEMICAL', 'start': '304', 'word': 'progestin'},
 {'end': '322', 'entity': 'CHEMICAL', 'start': '314', 'word': 'estrogen'}]

## Pipe
Run the Hugging Face pipeline over the dataset and use map to store the predicted entities in their own field.

In [11]:
# Use a pipeline as a high-level helper
from transformers import pipeline
import pandas as pd

pipe_chemical = pipeline("token-classification", model="alvaroalon2/biobert_chemical_ner", aggregation_strategy="simple")

def pipe(label_header, dataset):
    """Run the chemical NER pipeline on `label_header` and store the output in 'chemical_test'."""
    def generate_chemical_ner(entry):
        return {'chemical_test': pipe_chemical(entry[label_header])}
    return dataset.map(generate_chemical_ner)

def post_process(feature, dataset):
    """Stack the per-row entity lists in `feature` into one DataFrame."""
    frames = [pd.DataFrame(entry) for entry in dataset[feature]]
    return pd.concat(frames).reset_index(drop=True)

In [12]:
dataset = pipe('text', dataset)


## Evaluate 
With the pipeline output attached to the dataset, compare the predictions against the human annotations and calculate precision, recall, and F1.  
  
*TODO: Clean this up from scratch, improve calculations, improve score*
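
For reference, the scores computed in the cells below follow the standard definitions, where $TP$, $FP$, and $FN$ are counts of exact matches, spurious predictions, and missed annotations (the cells then multiply by 100 to report percentages):

$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}
$$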

In [22]:
# Inspect
number = 7
(dataset['text'][number],dataset['chemical_eval'][number],dataset['chemical_test'][number])

(['1 INDICATIONS AND USAGE Gabapentin tablets USP are indicated for: • Management of postherpetic neuralgia in adults • Adjunctive therapy in the treatment of partial onset seizures, with and without secondary generalization, in adults and pediatric patients 3 years and older with epilepsy Gabapentin tablets are indicated for: • Postherpetic neuralgia in adults ( 1 ) • Adjunctive therapy in the treatment of partial onset seizures, with and without secondary generalization, in adults and pediatric patients 3 years and older with epilepsy ( 1 )'],
 [{'end': '34', 'entity': 'CHEMICAL', 'start': '24', 'word': 'Gabapentin'},
  {'end': '298', 'entity': 'CHEMICAL', 'start': '288', 'word': 'Gabapentin'}],
 [[{'end': 34,
    'entity_group': 'CHEMICAL',
    'score': 0.9902465343475342,
    'start': 24,
    'word': 'Gabapentin'},
   {'end': 298,
    'entity_group': 'CHEMICAL',
    'score': 0.9975243806838989,
    'start': 288,
    'word': 'Gabapentin'}]])

In [33]:
for m in dataset['chemical_eval'][7]:
    print(m['entity'])

CHEMICAL
CHEMICAL


In [40]:
dataset['chemical_test'][7]

[[{'end': 34,
   'entity_group': 'CHEMICAL',
   'score': 0.9902465343475342,
   'start': 24,
   'word': 'Gabapentin'},
  {'end': 298,
   'entity_group': 'CHEMICAL',
   'score': 0.9975243806838989,
   'start': 288,
   'word': 'Gabapentin'}]]
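
The output above is wrapped in an extra list (`[[...]]`), presumably because each `text` value is itself a list of lines and the pipeline returns one result list per input string. A minimal sketch of flattening that extra level, using a hypothetical helper `flatten_predictions` and the two predictions shown above:

```python
def flatten_predictions(nested):
    """Flatten one level: [[pred, pred], [pred]] -> [pred, pred, pred]."""
    return [pred for group in nested for pred in group]


nested = [[
    {"end": 34, "entity_group": "CHEMICAL", "score": 0.9902465343475342,
     "start": 24, "word": "Gabapentin"},
    {"end": 298, "entity_group": "CHEMICAL", "score": 0.9975243806838989,
     "start": 288, "word": "Gabapentin"},
]]
flat = flatten_predictions(nested)
print(len(flat))  # 2
```

Indexing with `[0]`, as the later cells do, gives the same result as long as every text holds exactly one line.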

In [58]:
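index = 7

eval_group = []
eval_test = []

# Serialize each entity as "start end label word" so the two lists can be compared as strings
for m in dataset['chemical_eval'][index]:
    eval_group.append(f'{m["start"]} {m["end"]} {m["entity"]} {m["word"]}')

# [0] unwraps the extra list level in the pipeline output
for n in dataset['chemical_test'][index][0]:
    eval_test.append(f'{n["start"]} {n["end"]} {n["entity_group"]} {n["word"]}')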
    

In [67]:
tp = 0
fp = 0
fn = 0

for entity in eval_test:
    if entity in eval_group:
        tp += 1
    else:
        fp += 1

for entity in eval_group:
    if entity in eval_test:
        pass
    else:
        fn += 1

precision = (tp/(tp+fp))*100
recall = (tp/(tp+fn))*100


print(f'TP: {tp} \nFP: {fp} \nFN: {fn}')
print(f'PRECISION: {precision} \nRECALL: {recall}')
print(f'F1: {(2*(precision*recall))/(precision+recall)}')




TP: 2 
FP: 0 
FN: 0
PRECISION: 100.0 
RECALL: 100.0
F1: 100.0
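
The score cell above divides by `tp+fp` and `tp+fn` directly, which raises `ZeroDivisionError` for a document with no predictions or no annotations. A guarded version is sketched below; the function name `precision_recall_f1` is hypothetical, and returning 0.0 when a score is undefined is one common convention rather than the only option.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) as percentages, using 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision * 100, recall * 100, f1 * 100


print(precision_recall_f1(2, 0, 0))  # (100.0, 100.0, 100.0)
print(precision_recall_f1(0, 0, 0))  # (0.0, 0.0, 0.0)
```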


In [71]:
dataset

Dataset({
    features: ['text', 'chemical_eval', 'chemical_test'],
    num_rows: 100
})

In [72]:
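# Rudimentary loop to evaluate F1
# NOTE: eval_group and eval_test are never reset inside the loop, so each iteration
# re-counts the entities of every earlier document. The totals printed below are inflated as a result.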

tp = 0
fp = 0
fn = 0

index = 0
eval_group = []
eval_test = []

while index < len(dataset):

    for m in dataset['chemical_eval'][index]:
        eval_group.append((f'{m["start"]} {m["end"]} {m["entity"]} {m["word"]}'))
        
    for n in dataset['chemical_test'][index][0]:
        eval_test.append((f'{n["start"]} {n["end"]} {n["entity_group"]} {n["word"]}'))

    for entity in eval_test:
        if entity in eval_group:
            tp += 1
        else:
            fp += 1

    for entity in eval_group:
        if entity in eval_test:
            pass
        else:
            fn += 1

    index += 1

    
precision = (tp/(tp+fp))*100
recall = (tp/(tp+fn))*100


print(f'TP: {tp} \nFP: {fp} \nFN: {fn}')
print(f'PRECISION: {precision} \nRECALL: {recall}')
print(f'F1: {(2*(precision*recall))/(precision+recall)}')




TP: 6268 
FP: 3803 
FN: 2862
PRECISION: 62.23810942309602 
RECALL: 68.65279299014239
F1: 65.28826623613354
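
The loop above appends to `eval_group` and `eval_test` without clearing them, so entities from earlier documents are counted again on every pass, which is why TP alone is far larger than the number of annotations. A minimal sketch of a per-document count that starts from fresh sets each time is shown below. `count_matches` and the toy documents are hypothetical; the match key (start, end, label, word) mirrors the string key used in the notebook.

```python
def count_matches(gold_docs, pred_docs):
    """Count exact-match TP/FP/FN, comparing one document at a time."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        # Fresh sets for every document, so nothing carries over between iterations.
        gold_keys = {(int(g["start"]), int(g["end"]), g["entity"], g["word"]) for g in gold}
        pred_keys = {(int(p["start"]), int(p["end"]), p["entity_group"], p["word"]) for p in pred}
        tp += len(gold_keys & pred_keys)
        fp += len(pred_keys - gold_keys)
        fn += len(gold_keys - pred_keys)
    return tp, fp, fn


# Toy data (hypothetical): doc 0 has no gold entities, doc 1 has one missed entity.
gold_docs = [
    [],
    [{"start": "24", "end": "34", "entity": "CHEMICAL", "word": "Gabapentin"},
     {"start": "288", "end": "298", "entity": "CHEMICAL", "word": "Gabapentin"}],
]
pred_docs = [
    [{"start": 24, "end": 47, "entity_group": "CHEMICAL", "word": "Memantine hydrochloride"}],
    [{"start": 24, "end": 34, "entity_group": "CHEMICAL", "word": "Gabapentin"}],
]
print(count_matches(gold_docs, pred_docs))  # (1, 1, 1)
```

With the notebook's dataset, `pred_docs` would be built from `dataset['chemical_test']` after unwrapping the extra list level with `[0]`.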


In [56]:
def post_process(feature, dataset):
    """Stack the per-row entity lists in `feature` into one DataFrame."""
    frames = [pd.DataFrame(entry) for entry in dataset[feature]]
    return pd.concat(frames).reset_index(drop=True)

def post_process2(feature, dataset):
    """Same as post_process, but unwrap the extra list level in the pipeline output first."""
    frames = [pd.DataFrame(entry[0]) for entry in dataset[feature]]
    return pd.concat(frames).reset_index(drop=True)


evaluated = post_process('chemical_eval',dataset)
evaluated2 = post_process2('chemical_test',dataset)


In [92]:
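# Build "start end label word" keys, reading columns by name rather than by position
eval_truth = []
for _, row in evaluated.iterrows():
    eval_truth.append(f'{row["start"]} {row["end"]} {row["entity"]} {row["word"]}')

eval_test = []
for _, row in evaluated2.iterrows():
    eval_test.append(f'{row["start"]} {row["end"]} {row["entity_group"]} {row["word"]}')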

    # for m in dataset['chemical_eval'][index]:
    #     eval_group.append((f'{m["start"]} {m["end"]} {m["entity"]} {m["word"]}'))
        
    # for n in dataset['chemical_test'][index][0]:
    #     eval_test.append((f'{n["start"]} {n["end"]} {n["entity_group"]} {n["word"]}'))


In [113]:
matched = set(eval_test).intersection(eval_truth)

tp = len(matched) # correct labels
fp = len(set(eval_test)) - tp # things that were labeled that shouldn't have been

# Use eval_truth here, not eval_group left over from an earlier cell
fn = len(eval_truth) - tp # things that should've been labeled but were not

precision = (tp/(tp+fp))*100
recall = (tp/(tp+fn))*100


print(f'TP: {tp} \nFP: {fp} \nFN: {fn}')
print(f'PRECISION: {precision} \nRECALL: {recall}')


# print(f'F1: {(2*(precision*recall))/(precision+recall)}')
# print(f'TP: {tp}\nFP: {fp}\nFN: {fn}')

TP: 95 
FP: 62 
FN: 55
PRECISION: 60.509554140127385 
RECALL: 63.33333333333333
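
One caveat with the set-based comparison above: the keys carry no document index, so two entities in different documents that share the same offsets, label, and word collapse into a single set element. The sketch below shows the effect and one possible fix, prefixing each key with the document index. `build_keys` and the sample documents are hypothetical.

```python
def build_keys(docs, label_field, with_doc_id=True):
    """Return a set of hashable keys, one per entity, optionally tagged with the document index."""
    keys = set()
    for doc_id, entries in enumerate(docs):
        for e in entries:
            key = (int(e["start"]), int(e["end"]), e[label_field], e["word"])
            keys.add((doc_id,) + key if with_doc_id else key)
    return keys


# Hypothetical gold annotations: two documents that open with the same drug name.
gold_docs = [
    [{"start": "24", "end": "34", "entity": "CHEMICAL", "word": "Gabapentin"}],
    [{"start": "24", "end": "34", "entity": "CHEMICAL", "word": "Gabapentin"}],
]
print(len(build_keys(gold_docs, "entity", with_doc_id=False)))  # 1
print(len(build_keys(gold_docs, "entity")))  # 2
```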


In [114]:
evaluated # TRUTH SET / HUMAN LABELS

| | end | entity | start | word |
|---|---|---|---|---|
| 0 | 35 | CHEMICAL | 24 | Zenchent Fe |
| 1 | 50 | CHEMICAL | 37 | norethindrone |
| 2 | 72 | CHEMICAL | 55 | ethinyl estradiol |
| 3 | 111 | CHEMICAL | 95 | ferrous fumarate |
| 4 | 214 | CHEMICAL | 203 | Zenchent Fe |
| ... | ... | ... | ... | ... |
| 145 | 46 | CHEMICAL | 36 | prilocaine |
| 146 | 100 | CHEMICAL | 91 | lidocaine |
| 147 | 120 | CHEMICAL | 110 | prilocaine |
| 148 | 333 | CHEMICAL | 324 | Lidocaine |


In [57]:
evaluated2 # MODEL PREDICTIONS

| | end | entity_group | score | start | word |
|---|---|---|---|---|---|
| 0 | 47 | CHEMICAL | 0.999128 | 24 | Memantine hydrochloride |
| 1 | 74 | CHEMICAL | 0.999993 | 54 | N - methyl - D - aspartate |
| 2 | 215 | CHEMICAL | 0.999696 | 192 | Memantine hydrochloride |
| 3 | 50 | CHEMICAL | 0.999892 | 37 | norethindrone |
| 4 | 72 | CHEMICAL | 0.999734 | 55 | ethinyl estradiol |
| ... | ... | ... | ... | ... | ... |
| 152 | 100 | CHEMICAL | 0.999996 | 91 | lidocaine |
| 153 | 120 | CHEMICAL | 0.984445 | 110 | prilocaine |
| 154 | 333 | CHEMICAL | 0.999674 | 324 | Lidocaine |
| 155 | 339 | CHEMICAL | 0.999998 | 338 | p |
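
Row 1 of the predictions shows the model's `word` as `N - methyl - D - aspartate`, with spaces that are not in the source text, so any key that includes `word` can fail to match even when the offsets are right. One possible workaround, sketched below, is to rebuild the word from the text using the predicted offsets. `with_surface_word` is a hypothetical helper; the text is the start of the first sentence shown earlier.

```python
def with_surface_word(text: str, prediction: dict) -> dict:
    """Return a copy of the prediction whose word is the exact slice text[start:end]."""
    fixed = dict(prediction)
    fixed["word"] = text[prediction["start"]:prediction["end"]]
    return fixed


text = ("1 INDICATIONS AND USAGE Memantine hydrochloride is an "
        "N-methyl-D-aspartate (NMDA) receptor antagonist")
prediction = {"end": 74, "entity_group": "CHEMICAL", "score": 0.999993,
              "start": 54, "word": "N - methyl - D - aspartate"}
fixed = with_surface_word(text, prediction)
print(fixed["word"])  # N-methyl-D-aspartate
```

Dropping `word` from the match key and comparing on offsets and label alone would be another option.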
