## Background
To evaluate the models, this notebook extracts a test set from the large label database. The code selects 100 sentences at random from one predetermined label section, since evaluations will likely have to be run section by section.

## Load Dataset
Load the dataset from the previously extracted OpenFDA JSON resource, which is stored here as an Excel file.

In [1]:
import pandas as pd

drugs_at_fda = pd.read_excel('../data/openfda.xlsx').reset_index(drop=True).drop('Unnamed: 0', axis=1)

drugs_at_fda

Unnamed: 0,brand_name,adverse_reactions,indications_and_usage,contraindications,warnings_and_cautions,warnings,precautions,pharmacokinetics,purpose,clinical_pharmacology,active_ingredient,stop_use,boxed_warning,pharmacodynamics,pharmacogenomics
0,AMOXICILLIN AND CLAVULANATE POTASSIUM,ADVERSE REACTIONS SECTION The following are di...,INDICATIONS & USAGE SECTION To reduce the deve...,CONTRAINDICATIONS SECTION Amoxicillinfor oral ...,WARNINGS AND PRECAUTIONS SECTION 5.1 Anaphylac...,,,,,CLINICAL PHARMACOLOGY SECTION 12.1 Mechanism o...,,,,,
1,UNDA 312,,Uses For the relief of symptoms associated wit...,,,Warnings Sore throat warning: Severe or persis...,,,Uses For the relief of symptoms associated wit...,,Active ingredients Each drop contains: Angelic...,Stop use and ask a doctor if Cough persists fo...,,,
2,SUN PROTECT LIP BALM SPF 30,,Uses Helps prevent sunburn. If used as directe...,,,Warnings For external use only. Do not use on ...,,,Purpose Sunscreen,,Drug Facts Active ingredients Non Nano Zinc Ox...,,,,
3,LOSARTAN POTASSIUM AND HYDROCHLOROTHIAZIDE,,,,,,,,,,,,,,
4,Potassium Phosphates,6 ADVERSE REACTIONS The following clinically s...,1 INDICATIONS AND USAGE Potassium Phosphates I...,4 CONTRAINDICATIONS Potassium Phosphates Injec...,5 WARNINGS AND PRECAUTIONS Serious Cardiac Adv...,,,12.3 Pharmacokinetics Distribution Approximate...,,12 CLINICAL PHARMACOLOGY 12.1 Mechanism of Act...,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
215910,Carbidopa and Levodopa,ADVERSE REACTIONS The most common adverse reac...,INDICATIONS AND USAGE Carbidopa and levodopa t...,CONTRAINDICATIONS Nonselective monoamine oxida...,,WARNINGS When carbidopa and levodopa tablets a...,"PRECAUTIONS General As with levodopa, periodic...",Pharmacokinetics Carbidopa reduces the amount ...,,CLINICAL PHARMACOLOGY Mechanism of Action Park...,,,,Pharmacodynamics When levodopa is administered...,
215911,REFRESH Optive Mega-3,,"Uses For the temporary relief of burning, irri...",,,Warnings For external use only. To avoid conta...,,,Purpose Eye lubricant Eye lubricant Eye lubricant,,Active ingredients Carboxymethylcellulose sodi...,Stop use and ask a doctor if you experience ey...,,,
215912,Creon,6 ADVERSE REACTIONS The most serious adverse r...,1 INDICATIONS AND USAGE CREON ® is indicated f...,4 CONTRAINDICATIONS None. None ( 4 ),5 WARNINGS AND PRECAUTIONS Fibrosing colonopat...,,,12.3 Pharmacokinetics The pancreatic enzymes i...,,12 CLINICAL PHARMACOLOGY 12.1 Mechanism of Act...,,,,,
215913,Losartan Potassium and Hydrochlorothiazide,6 ADVERSE REACTIONS Most common adverse reacti...,1 INDICATIONS AND USAGE Losartan potassium and...,4 CONTRAINDICATIONS Losartan potassium and hyd...,5 WARNINGS AND PRECAUTIONS Hypotension: Correc...,,,12.3 Pharmacokinetics Losartan Potassium Absor...,,12 CLINICAL PHARMACOLOGY 12.1 Mechanism of Act...,,,WARNING: FETAL TOXICITY When pregnancy is dete...,12.2 Pharmacodynamics Losartan Potassium Losar...,


## Pre-Process

Run the preprocessing filter. This code was created earlier, during the NER_tagging phase, and is reused here. It drops rows where the chosen section is missing, then drops texts longer than 512 tokens.

In [15]:
from datasets import Dataset
from transformers import AutoTokenizer


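def pre_process(label_header, df):
    """Keep the rows of `label_header` that are present and at most 512 tokens long."""
    # Restrict the DataFrame to the single section of interest.
    f = pd.DataFrame(df, columns=[label_header])
    dataset = Dataset.from_pandas(f)

    print(f'UNFILTERED: {len(dataset)}')

    # Drop labels that do not contain this section.
    dataset = dataset.filter(lambda x: x[label_header] is not None)
    print(f'After NONE CHECK: {len(dataset)}')

    # Drop texts that exceed the 512-token limit.
    # Note: tokenize() does not count special tokens such as [CLS] and [SEP].
    tokenizer = AutoTokenizer.from_pretrained("alvaroalon2/biobert_genetic_ner")
    dataset = dataset.filter(lambda x: len(tokenizer.tokenize(x[label_header])) <= 512)
    print(f'After Token Length (<512) CHECK: {len(dataset)}')

    return dataset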


In [16]:
dataset = pre_process('indications_and_usage',drugs_at_fda)

UNFILTERED: 215915


After NONE CHECK: 205650

Token indices sequence length is longer than the specified maximum sequence length for this model (807 > 512). Running this sequence through the model will result in indexing errors


After Token Length (<512) CHECK: 186829
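
One detail of the length filter is worth spelling out: `tokenizer.tokenize` returns word pieces without special tokens, whereas encoding a single sequence for a BERT-style model normally adds `[CLS]` and `[SEP]`. The sketch below is not part of the original pipeline. It assumes two special tokens and uses a hypothetical helper, `fits_model`, with stand-in token lists, to show how a stricter budget could be checked if the texts are later fed to the model.

```python
# Sketch only: MAX_LEN and NUM_SPECIAL are assumptions for a BERT-style model.
MAX_LEN = 512      # maximum sequence length reported in the warning above
NUM_SPECIAL = 2    # [CLS] and [SEP], added when a single sequence is encoded


def fits_model(tokens, max_len=MAX_LEN, num_special=NUM_SPECIAL):
    """Return True if the tokens plus the special tokens fit within max_len."""
    return len(tokens) + num_special <= max_len


# Stand-in token lists; in the notebook they would come from tokenizer.tokenize(text).
print(fits_model(["tok"] * 510))  # True
print(fits_model(["tok"] * 512))  # False: passes the `<= 512` filter but encodes to 514 ids
```

If this matters for the evaluation, the filter threshold could be lowered to 510, or truncation could be enabled at encoding time.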


## Random Number Selection
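Select 100 random row indices using a fixed seed so that the draw is reproducible. Save the indices to a YAML file and write each selected text to its own .txt file for downstream manual tagging. Because `randint` samples with replacement, an index can repeat: 79843 appears twice in the list below.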

In [20]:
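# Take 1 (seed 151)
from random import seed
from random import randint

# Fix the seed so the same indices are drawn on every run.
seed(151)

numbers = []
for _ in range(100):
    # randint includes both endpoints, so the upper bound is the last valid index.
    value = randint(0, len(dataset) - 1)
    numbers.append(value)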

numbers

[185224,
 107378,
 186660,
 165881,
 53637,
 92297,
 73986,
 96862,
 150468,
 30436,
 120940,
 128303,
 150040,
 19358,
 119989,
 28065,
 105492,
 75517,
 175120,
 48068,
 152006,
 96081,
 146424,
 31538,
 34940,
 40343,
 79843,
 46131,
 45976,
 47022,
 88654,
 109780,
 16210,
 88750,
 136290,
 116425,
 26898,
 75331,
 79099,
 44624,
 137733,
 132537,
 148284,
 175257,
 122755,
 57583,
 36634,
 27163,
 89205,
 131778,
 152311,
 134907,
 167338,
 103821,
 175692,
 32695,
 129134,
 107994,
 50447,
 113582,
 152933,
 59686,
 28600,
 177797,
 165664,
 163763,
 115867,
 146377,
 12316,
 79843,
 179512,
 169005,
 123600,
 52127,
 185578,
 52923,
 55925,
 180266,
 137215,
 252,
 100249,
 8680,
 180767,
 180348,
 127190,
 17331,
 114861,
 152449,
 92938,
 80851,
 34362,
 29014,
 143411,
 108577,
 158653,
 72204,
 159516,
 162485,
 27263,
 29197]
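
If 100 distinct rows are wanted, sampling without replacement is one option. The following is a minimal sketch, not the method used for take 1: `sample_indices` is a hypothetical helper, and it will not reproduce the list above even with the same seed, because it draws from the generator differently.

```python
import random


def sample_indices(n_rows, k, seed_value):
    """Draw k distinct row indices in [0, n_rows) from a seeded generator."""
    rng = random.Random(seed_value)  # local generator, leaves the global state alone
    return rng.sample(range(n_rows), k)


# 186829 is the filtered dataset size reported above; 151 is the seed used for take 1.
indices = sample_indices(186829, 100, 151)
print(len(indices), len(set(indices)))  # 100 100
```

Using a local `random.Random` instance keeps the draw independent of any other code that touches the global random state.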

In [28]:
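# Save take-1, seed 151
with open('take-1.yaml','w') as f:
    # Write the take and seed as keys and the indices as a list so the file parses as YAML.
    f.write('take: 1\n')
    f.write('seed: 151\n')
    f.write('numbers:\n')
    for number in numbers:
        f.write(f'  - {number}\n')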


In [29]:
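import os

from tqdm import tqdm

# Look up each sampled row. Indexing the row first can avoid loading the whole column on every access.
text = []
for number in tqdm(numbers):
    text.append(dataset[number]['indications_and_usage'])

# Write each selected text to sentences/<position>.txt, creating the folder if needed.
os.makedirs('sentences', exist_ok=True)
for counter, sentence in enumerate(tqdm(text)):
    with open(f'sentences/{counter}.txt', 'w', encoding='utf-8') as f:
        f.write(sentence)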


100%|██████████| 100/100 [02:23<00:00,  1.43s/it]
100%|██████████| 100/100 [00:00<00:00, 2395.96it/s]
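
The files are named by their position in the draw, so the link back to the dataset row exists only through the order of the saved indices. For manual tagging it may help to record that link explicitly. The sketch below is one possible way to do so. The helper `write_manifest`, the file name `manifest.csv`, and the column names are all hypothetical, and it writes to a temporary folder rather than to `sentences/`.

```python
import csv
import os
import tempfile


def write_manifest(indices, out_dir, manifest_name="manifest.csv"):
    """Write one row per text file: the file name and the dataset index it came from."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, manifest_name)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "dataset_index"])
        for position, index in enumerate(indices):
            writer.writerow([f"{position}.txt", index])
    return path


# The first three indices from take 1, used here only as example input.
demo_dir = tempfile.mkdtemp()
manifest_path = write_manifest([185224, 107378, 186660], demo_dir)
```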
