## Background
To build an annotation set for evaluation, load the manually annotated sentence set (created from the indications_and_usage section) and figure out a way to evaluate against it using Hugging Face methods.

## Convert Text/Annotations to JSON
Dataset objects do not support individual item assignment, so all fields need to be populated before loading. The easiest way I found was to convert everything to JSON and load it with load_dataset().

In [188]:
# Listify TEXT
text_for_json = []
for number in range(0,100):
    with open(f'sentences/texts/{number}.txt','r') as f:
        text_for_json.append(f.readlines())

text_for_json[0:3]

[["1 INDICATIONS AND USAGE Memantine hydrochloride is an N-methyl-D-aspartate (NMDA) receptor antagonist indicated for the treatment of moderate to severe dementia of the Alzheimer's type. ( 1 ) Memantine hydrochloride tablets, USP are indicated for the treatment of moderate to severe dementia of the Alzheimer's type."],
 ['1 INDICATIONS AND USAGE Zenchent Fe, norethindrone and ethinyl estradiol tablets, chewable and ferrous fumarate tablets are indicated for use by females of reproductive potential to prevent pregnancy. • Zenchent Fe, norethindrone and ethinyl estradiol tablets, chewable and ferrous fumarate tablets is a progestin/estrogen COC indicated for use by females of reproductive potential to prevent pregnancy. ( 1 )'],
 ['1 INDICATIONS AND USAGE Alprazolam extended-release tablets are indicated for the treatment of panic disorder with or without agoraphobia, in adults. Alprazolam extended-release tablets are a benzodiazepine indicated for the treatment of panic disorder with 

In [189]:
# Listify ANNOTATIONS
ann_for_json = []
for number in range(0,100):
    with open(f'sentences/ann/{number}.ann','r') as f:
        g = f.readlines()
        entry = []
        for item in g:
            entry_dict = {}
            entity_block = item.split('\t')[1]
            entry_dict['entity'] = entity_block.split(' ')[0]
            entry_dict['start'] = entity_block.split(' ')[1]
            entry_dict['end'] = entity_block.split(' ')[2]
            entry_dict['word'] = item.split('\t')[2].replace('\n','')
            entry.append(entry_dict)
    ann_for_json.append(entry)
ann_for_json[0:3]

[[],
 [{'entity': 'CHEMICAL', 'start': '24', 'end': '35', 'word': 'Zenchent Fe'},
  {'entity': 'CHEMICAL', 'start': '37', 'end': '50', 'word': 'norethindrone'},
  {'entity': 'CHEMICAL',
   'start': '55',
   'end': '72',
   'word': 'ethinyl estradiol'},
  {'entity': 'CHEMICAL',
   'start': '95',
   'end': '111',
   'word': 'ferrous fumarate'},
  {'entity': 'CHEMICAL', 'start': '203', 'end': '214', 'word': 'Zenchent Fe'},
  {'entity': 'CHEMICAL',
   'start': '216',
   'end': '229',
   'word': 'norethindrone'},
  {'entity': 'CHEMICAL',
   'start': '234',
   'end': '251',
   'word': 'ethinyl estradiol'},
  {'entity': 'CHEMICAL',
   'start': '274',
   'end': '290',
   'word': 'ferrous fumarate'},
  {'entity': 'CHEMICAL', 'start': '304', 'end': '313', 'word': 'progestin'},
  {'entity': 'CHEMICAL', 'start': '314', 'end': '322', 'word': 'estrogen'}],
 [{'entity': 'CHEMICAL', 'start': '24', 'end': '34', 'word': 'Alprazolam'},
  {'entity': 'CHEMICAL', 'start': '150', 'end': '160', 'word': 'Alpra

In [192]:
# Cast to JSON and save
import json
data = json.dumps([{'text': text, 'chemical_eval': ann} for text, ann in zip(text_for_json, ann_for_json)])
with open('sentences/evalset.json', 'w', encoding='utf-8') as f:
    f.write(data)
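
Before relying on the saved file, it may be worth checking that the annotation offsets really index into the text. The sketch below assumes that 'start' and 'end' are character offsets into the joined lines of each text file (the texts are stored as lists because of readlines(), and the offsets are stored as strings). The helper name check_offsets is hypothetical and not part of the original notebook.

```python
def check_offsets(text_lines, annotations):
    """Return the annotations whose text[start:end] slice differs from 'word'."""
    text = "".join(text_lines)  # readlines() stored each text as a list of lines
    mismatches = []
    for ann in annotations:
        start, end = int(ann["start"]), int(ann["end"])  # offsets are strings
        if text[start:end] != ann["word"]:
            mismatches.append(ann)
    return mismatches


# Shortened copy of the second record shown above.
example_text = [
    "1 INDICATIONS AND USAGE Zenchent Fe, norethindrone and ethinyl estradiol tablets"
]
example_ann = [
    {"entity": "CHEMICAL", "start": "24", "end": "35", "word": "Zenchent Fe"},
    {"entity": "CHEMICAL", "start": "37", "end": "50", "word": "norethindrone"},
    {"entity": "CHEMICAL", "start": "55", "end": "72", "word": "ethinyl estradiol"},
]

bad = check_offsets(example_text, example_ann)
print(bad)
```

An empty list means every annotated span lines up with its text. Running the same check over zip(text_for_json, ann_for_json) would flag any record whose offsets have drifted.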

## Load Dataset into Huggingface
Use load_dataset to load the JSON file. It picks up the features 'text' and 'chemical_eval' from the keys written out with json.dumps above.

In [181]:
from datasets import load_dataset
dataset = load_dataset("json", data_files="sentences/evalset.json", split='train')  # same path the save cell wrote to

Downloading and preparing dataset json/default to /Users/mjc014/.cache/huggingface/datasets/json/default-e79d0187071c381e/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...



Dataset json downloaded and prepared to /Users/mjc014/.cache/huggingface/datasets/json/default-e79d0187071c381e/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


In [193]:
dataset

Dataset({
    features: ['text', 'chemical_eval'],
    num_rows: 100
})

In [187]:
dataset['chemical_eval'][1]

[{'end': '35', 'entity': 'CHEMICAL', 'start': '24', 'word': 'Zenchent Fe'},
 {'end': '50', 'entity': 'CHEMICAL', 'start': '37', 'word': 'norethindrone'},
 {'end': '72',
  'entity': 'CHEMICAL',
  'start': '55',
  'word': 'ethinyl estradiol'},
 {'end': '111',
  'entity': 'CHEMICAL',
  'start': '95',
  'word': 'ferrous fumarate'},
 {'end': '214', 'entity': 'CHEMICAL', 'start': '203', 'word': 'Zenchent Fe'},
 {'end': '229', 'entity': 'CHEMICAL', 'start': '216', 'word': 'norethindrone'},
 {'end': '251',
  'entity': 'CHEMICAL',
  'start': '234',
  'word': 'ethinyl estradiol'},
 {'end': '290',
  'entity': 'CHEMICAL',
  'start': '274',
  'word': 'ferrous fumarate'},
 {'end': '313', 'entity': 'CHEMICAL', 'start': '304', 'word': 'progestin'},
 {'end': '322', 'entity': 'CHEMICAL', 'start': '314', 'word': 'estrogen'}]

## Evaluate
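Plan: load the pipelines, run them over the dataset to get predictions, and compare those predictions against the 'chemical_eval' annotations. How best to compare the two is still open; calculating an F1 score is one candidate.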
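
One way to make the comparison concrete is exact-match, span-level precision, recall and F1. This is a minimal sketch, not a settled method: it assumes each prediction is a dict with 'start' and 'end' character offsets (the same shape as the gold annotations), it ignores entity labels because a model's label names may not match 'CHEMICAL', and the function name span_prf and the predicted spans are made up for illustration.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end) character spans."""
    # Cast to int so string offsets (gold) and integer offsets compare equal.
    gold_spans = {(int(a["start"]), int(a["end"])) for a in gold}
    pred_spans = {(int(a["start"]), int(a["end"])) for a in predicted}

    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Gold spans copied from the annotations above; predictions are hypothetical.
gold = [
    {"entity": "CHEMICAL", "start": "24", "end": "35", "word": "Zenchent Fe"},
    {"entity": "CHEMICAL", "start": "37", "end": "50", "word": "norethindrone"},
]
predicted = [
    {"start": 24, "end": 35, "word": "Zenchent Fe"},  # matches a gold span
    {"start": 37, "end": 45, "word": "norethin"},     # partial span, counted as wrong
]

precision, recall, f1 = span_prf(gold, predicted)
print(precision, recall, f1)
```

Exact matching is strict: a prediction that covers only part of a gold span counts as both a false positive and a false negative. For a score over all 100 records, summing true positives and span counts across records before dividing is usually more stable than averaging per-record F1, since some records (like the first one) have no annotations at all.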