* **Simple Transformers** lets you quickly train and evaluate Transformer models: initializing, training, and evaluating a model takes only three lines of code. The library is built on the Hugging Face Transformers library.

* GitHub repo: https://github.com/ThilinaRajapakse/simpletransformers

* Website: https://simpletransformers.ai/docs/qa-specifics/

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [1]:
%cd /content/drive/MyDrive/Colab Notebooks/Data

/content/drive/MyDrive/Colab Notebooks/Data


In [2]:
#!unzip -q QA.zip

### Data Credit:
**BioASQ:** A challenge on large-scale biomedical semantic indexing and question answering

https://academic.oup.com/bioinformatics/article/36/4/1234/5566506

### Data is downloaded from:
https://github.com/dmis-lab/biobert



In [3]:
import json
from pathlib import Path

# Load the BioASQ factoid training set; the file is closed automatically
with Path('BioASQ/BioASQ-train-factoid-4b.json').open() as json_file:
    data = json.load(json_file)

In [4]:
type(data)

dict

In [5]:
data.keys()

dict_keys(['data', 'version'])

In [6]:
data['version']

'BioASQ6b'

In [7]:
data = data['data']

In [8]:
type(data)

list

In [9]:
len(data[0]), type(data[0]), data[0].keys()

(2, dict, dict_keys(['paragraphs', 'title']))

In [10]:
data[0]['title']    # Same as version

'BioASQ6b'

In [11]:
type(data[0]['paragraphs']), len(data[0]['paragraphs'])

(list, 3266)

In [14]:
data[0]['paragraphs'][:3]

[{'qas': [{'id': '52bf208003868f1b06000019_002',
    'question': 'What is the inheritance pattern of Li–Fraumeni syndrome?',
    'answers': [{'text': 'autosomal dominant', 'answer_start': 213}]}],
  'context': 'Balanced t(11;15)(q23;q15) in a TP53+/+ breast cancer patient from a Li-Fraumeni syndrome family. Li-Fraumeni Syndrome (LFS) is characterized by early-onset carcinogenesis involving multiple tumor types and shows autosomal dominant inheritance. Approximately 70% of LFS cases are due to germline mutations in the TP53 gene on chromosome 17p13.1. Mutations have also been found in the CHEK2 gene on chromosome 22q11, and others have been mapped to chromosome 11q23. While characterizing an LFS family with a documented defect in TP53, we found one family member who developed bilateral breast cancer at age 37 yet was homozygous for wild-type TP53. Her mother also developed early-onset primary bilateral breast cancer, and a sister had unilateral breast cancer and a soft tissue sarcoma. C

In [12]:
data[0]['paragraphs'][0]

{'qas': [{'id': '52bf208003868f1b06000019_002',
   'question': 'What is the inheritance pattern of Li–Fraumeni syndrome?',
   'answers': [{'text': 'autosomal dominant', 'answer_start': 213}]}],
 'context': 'Balanced t(11;15)(q23;q15) in a TP53+/+ breast cancer patient from a Li-Fraumeni syndrome family. Li-Fraumeni Syndrome (LFS) is characterized by early-onset carcinogenesis involving multiple tumor types and shows autosomal dominant inheritance. Approximately 70% of LFS cases are due to germline mutations in the TP53 gene on chromosome 17p13.1. Mutations have also been found in the CHEK2 gene on chromosome 22q11, and others have been mapped to chromosome 11q23. While characterizing an LFS family with a documented defect in TP53, we found one family member who developed bilateral breast cancer at age 37 yet was homozygous for wild-type TP53. Her mother also developed early-onset primary bilateral breast cancer, and a sister had unilateral breast cancer and a soft tissue sarcoma. Cytog

In [13]:
dict_items = list(data[0]['paragraphs'][0]['qas'][0].items())
dict_items.insert(1, ("is_impossible", False))
dict(dict_items)

{'id': '52bf208003868f1b06000019_002',
 'is_impossible': False,
 'question': 'What is the inheritance pattern of Li–Fraumeni syndrome?',
 'answers': [{'text': 'autosomal dominant', 'answer_start': 213}]}

In [15]:
for i in data[0]['paragraphs'][:100]:
  print(i['qas'][0])

{'id': '52bf208003868f1b06000019_002', 'question': 'What is the inheritance pattern of Li–Fraumeni syndrome?', 'answers': [{'text': 'autosomal dominant', 'answer_start': 213}]}
{'id': '52bf208003868f1b06000019_003', 'question': 'What is the inheritance pattern of Li–Fraumeni syndrome?', 'answers': [{'text': 'autosomal dominant', 'answer_start': 105}]}
{'id': '530cf4fe960c95ad0c00000b_001', 'question': 'Which type of lung cancer is afatinib used for?', 'answers': [{'text': 'EGFR-mutant NSCLC', 'answer_start': 1203}]}
{'id': '53148a07dae131f847000002_001', 'question': 'Which hormone abnormalities are characteristic to Pendred syndrome?', 'answers': [{'text': 'thyroid', 'answer_start': 419}]}
{'id': '53148a07dae131f847000002_002', 'question': 'Which hormone abnormalities are characteristic to Pendred syndrome?', 'answers': [{'text': 'thyroid', 'answer_start': 705}]}
{'id': '53148a07dae131f847000002_003', 'question': 'Which hormone abnormalities are characteristic to Pendred syndrome?', 'a
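The IDs printed above repeat a common prefix with a numeric suffix, and the same question text appears under several of them. A minimal sketch, assuming the part before the last underscore identifies the question and the suffix numbers its contexts, groups the IDs copied from that output:

```python
from collections import defaultdict

# IDs copied from the printed output above.
ids = [
    "52bf208003868f1b06000019_002",
    "52bf208003868f1b06000019_003",
    "530cf4fe960c95ad0c00000b_001",
    "53148a07dae131f847000002_001",
    "53148a07dae131f847000002_002",
    "53148a07dae131f847000002_003",
]

# Assumption: "<question id>_<context number>" is the ID layout.
groups = defaultdict(list)
for qa_id in ids:
    base, _, suffix = qa_id.rpartition("_")
    groups[base].append(suffix)

for base, suffixes in groups.items():
    print(base, suffixes)
```

If that reading of the IDs holds, the 3266 paragraphs cover fewer distinct questions than paragraphs, which matters when splitting data so the same question does not land in both training and evaluation sets.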

### Creating structured input compatible with the simpletransformers QuestionAnsweringModel

### Input Structure:

The input data should be a single list of dictionaries (or path to a JSON file containing the same). A dictionary represents a single context and its associated questions.

* Each such dictionary contains two attributes, the "context" and "qas".

* * context: The paragraph or text from which the question is asked.
* * qas: A list of questions and answers (format below).

Questions and answers are represented as dictionaries. Each dictionary in qas has the following format.

* * * id: (string) A unique ID for the question. Should be unique across the entire dataset.
* * * question: (string) A question.
* * * is_impossible: (bool) Whether the question cannot be answered from the context. Use False when the context contains the answer.
* * * answers: (list) The list of correct answers to the question. A single answer is represented by a dictionary with the following attributes.

* * * * text: (string) The answer to the question. Must be a substring of the context.
* * * * answer_start: (int) Starting index of the answer in the context.
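The structure described above can be checked before training. The sketch below is a hypothetical helper (`check_entry` is not part of simpletransformers) that verifies the required keys and confirms that each answer text really sits at its `answer_start` offset in the context. The example entry is a shortened, made-up variant of the first paragraph.

```python
def check_entry(entry):
    """Return a list of problems found in one {'context', 'qas'} dictionary."""
    context = entry.get("context")
    if not isinstance(context, str):
        return ["'context' must be a string"]

    problems = []
    for qa in entry.get("qas", []):
        qa_id = qa.get("id", "?")
        for key in ("id", "question", "is_impossible", "answers"):
            if key not in qa:
                problems.append(f"{qa_id}: missing '{key}'")
        for answer in qa.get("answers", []):
            start, text = answer["answer_start"], answer["text"]
            # The answer must be a substring of the context at the stated offset.
            if context[start:start + len(text)] != text:
                problems.append(f"{qa_id}: answer not found at offset {start}")
    return problems


context = "Li-Fraumeni Syndrome (LFS) shows autosomal dominant inheritance."
example = {
    "context": context,
    "qas": [
        {
            "id": "example_001",
            "is_impossible": False,
            "question": "What is the inheritance pattern of Li-Fraumeni syndrome?",
            "answers": [
                {
                    "text": "autosomal dominant",
                    # Compute the offset instead of hard-coding it.
                    "answer_start": context.index("autosomal dominant"),
                }
            ],
        }
    ],
}

print(check_entry(example))  # an empty list means no problems were found
```

Running a check like this over the whole list catches shifted offsets early, which would otherwise only show up as poor training results.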

In [16]:
def create_dataset(file_path):
  # Load the SQuAD-style JSON file; the file is closed automatically
  with open(file_path) as json_file:
    data = json.load(json_file)

  # Keep only the first 100 paragraphs
  QA_data = data['data'][0]['paragraphs'][:100]

  train = []
  for i in QA_data:
    # Take the first question of the paragraph and insert
    # "is_impossible" right after its "id" key
    list_items = list(i['qas'][0].items())
    list_items.insert(1, ("is_impossible", False))
    dict_modified = dict(list_items)

    train.append({'context': i['context'], 'qas': [dict_modified]})

  return train
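The function above keeps only the first question of each paragraph. Where a paragraph carries several questions, a variant along the following lines would keep them all. This is a sketch under assumptions: `to_qa_format` and its `limit` parameter are hypothetical names, and it assumes every question has an answer, so `is_impossible` is always False.

```python
def to_qa_format(paragraphs, limit=None):
    """Convert SQuAD-style paragraphs, keeping every question of each context."""
    dataset = []
    for paragraph in paragraphs[:limit]:  # limit=None keeps all paragraphs
        qas = []
        for qa in paragraph["qas"]:
            qas.append({
                "id": qa["id"],
                # Assumption: every question in the file is answerable.
                "is_impossible": False,
                "question": qa["question"],
                "answers": qa["answers"],
            })
        dataset.append({"context": paragraph["context"], "qas": qas})
    return dataset


# Tiny made-up sample in the same shape as data['data'][0]['paragraphs'].
sample = [
    {
        "context": "Aspirin inhibits cyclooxygenase.",
        "qas": [
            {"id": "demo_001", "question": "What does aspirin inhibit?",
             "answers": [{"text": "cyclooxygenase", "answer_start": 17}]},
            {"id": "demo_002", "question": "Which drug inhibits cyclooxygenase?",
             "answers": [{"text": "Aspirin", "answer_start": 0}]},
        ],
    }
]

converted = to_qa_format(sample)
print(len(converted), len(converted[0]["qas"]))
```

Building each question dictionary explicitly also fixes the key order without relying on the position of `id` in the source file.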

In [17]:
train_path = 'BioASQ/BioASQ-train-factoid-4b.json'
train = create_dataset(train_path)

In [18]:
train

[{'context': 'Balanced t(11;15)(q23;q15) in a TP53+/+ breast cancer patient from a Li-Fraumeni syndrome family. Li-Fraumeni Syndrome (LFS) is characterized by early-onset carcinogenesis involving multiple tumor types and shows autosomal dominant inheritance. Approximately 70% of LFS cases are due to germline mutations in the TP53 gene on chromosome 17p13.1. Mutations have also been found in the CHEK2 gene on chromosome 22q11, and others have been mapped to chromosome 11q23. While characterizing an LFS family with a documented defect in TP53, we found one family member who developed bilateral breast cancer at age 37 yet was homozygous for wild-type TP53. Her mother also developed early-onset primary bilateral breast cancer, and a sister had unilateral breast cancer and a soft tissue sarcoma. Cytogenetic analysis using fluorescence in situ hybridization of a primary skin fibroblast cell line revealed that the patient had a novel balanced reciprocal translocation between the long arms of 

In [19]:
len(train)

100

In [20]:
train_data = data[0]['paragraphs']

In [21]:
len(train_data)

3266

In [22]:
test_path = 'BioASQ/BioASQ-train-factoid-5b.json'
test = create_dataset(test_path)

In [23]:
test

[{'context': 'Acrokeratosis paraneoplastica (Bazex syndrome): report of a case associated with small cell lung carcinoma and review of the literature. Acrokeratosis paraneoplastic (Bazex syndrome) is a rare, but distinctive paraneoplastic dermatosis characterized by erythematosquamous lesions located at the acral sites and is most commonly associated with carcinomas of the upper aerodigestive tract. We report a 58-year-old female with a history of a pigmented rash on her extremities, thick keratotic plaques on her hands, and brittle nails. Chest imaging revealed a right upper lobe mass that was proven to be small cell lung carcinoma. While Bazex syndrome has been described in the dermatology literature, it is also important for the radiologist to be aware of this entity and its common presentations.',
  'qas': [{'id': '56bc751eac7ad10019000013_001',
    'is_impossible': False,
    'question': 'Name synonym of Acrokeratosis paraneoplastica.',
    'answers': [{'text': 'Bazex syndrome', '

# Simple Transformers Model Training

In [None]:
# Install simpletransformers
!pip install simpletransformers

* Usage Steps


* * Initialize a QuestionAnsweringModel
* * Train the model with **train_model()**
* * Evaluate the model with **eval_model()**
* * Make predictions on (unlabelled) data with **predict()**





In [24]:
import logging
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

To create a QuestionAnsweringModel, you must specify a model_type and a model_name.

* **model_type** should be one of the model types from the supported models (e.g. bert, electra, xlnet)

* **model_name** specifies the exact architecture and trained weights to use. This may be a Hugging Face Transformers compatible pre-trained model, a community model, or the path to a directory containing model files.

In [None]:
# Model Names and Types
model_type="bert"
model_name= "bert-base-cased"
if model_type == "bert":
    model_name = "bert-base-cased"

elif model_type == "roberta":
    model_name = "roberta-base"

elif model_type == "distilbert":
    model_name = "distilbert-base-cased"

elif model_type == "distilroberta":
    model_type = "roberta"
    model_name = "distilroberta-base"

elif model_type == "electra-base":
    model_type = "electra"
    model_name = "google/electra-base-discriminator"

elif model_type == "electra-small":
    model_type = "electra"
    model_name = "google/electra-small-discriminator"

elif model_type == "xlnet":
    model_name = "xlnet-base-cased"
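The same selection can be written as a lookup table, which keeps each alias next to the `model_type` and `model_name` it resolves to. This is only a sketch: `MODEL_CHOICES` and `resolve_model` are hypothetical names, and the pairs are copied from the branches above.

```python
# alias -> (model_type, model_name), copied from the if/elif chain above
MODEL_CHOICES = {
    "bert": ("bert", "bert-base-cased"),
    "roberta": ("roberta", "roberta-base"),
    "distilbert": ("distilbert", "distilbert-base-cased"),
    "distilroberta": ("roberta", "distilroberta-base"),
    "electra-base": ("electra", "google/electra-base-discriminator"),
    "electra-small": ("electra", "google/electra-small-discriminator"),
    "xlnet": ("xlnet", "xlnet-base-cased"),
}


def resolve_model(choice):
    """Return (model_type, model_name) for an alias, or raise ValueError."""
    try:
        return MODEL_CHOICES[choice]
    except KeyError:
        known = ", ".join(sorted(MODEL_CHOICES))
        raise ValueError(f"Unknown model choice {choice!r}; expected one of: {known}") from None


model_type, model_name = resolve_model("distilroberta")
print(model_type, model_name)
```

Unlike the chain of branches, an unknown alias fails immediately here instead of silently keeping the default.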

In [None]:
### Advanced Methodology
train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "use_cached_eval_features": True,
    "output_dir": f"outputs/{model_type}",
    "best_model_dir": f"outputs/{model_type}/best_model",
    "evaluate_during_training": True,
    "max_seq_length": 128,
    "num_train_epochs": 5,
    "evaluate_during_training_steps": 1000,
    "wandb_project": "Question Answer Application",
    "wandb_kwargs": {"name": model_name},
    "save_model_every_epoch": False,
    "save_eval_checkpoints": False,
    "n_best_size": 3,
    # "use_early_stopping": True,
    # "early_stopping_metric": "mcc",
    # "n_gpu": 2,
    # "manual_seed": 4,
    # "use_multiprocessing": False,
    "train_batch_size": 128,
    "eval_batch_size": 64,
    # "config": {
    #     "output_hidden_states": True
    # }
}

In [25]:
# Configure the model
model_args = QuestionAnsweringArgs()
model_args.train_batch_size = 16
model_args.evaluate_during_training = True
model_args.overwrite_output_dir = True
model_args.n_best_size = 3
model_args.num_train_epochs = 10

# Create the QuestionAnsweringModel
model = QuestionAnsweringModel(
    "distilbert", "distilbert-base-uncased-distilled-squad", args=model_args
)

In [26]:
# Train the model
model.train_model(train, eval_data=test)

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 100/100 [00:01<00:00, 64.80it/s]
add example index and unique id: 100%|██████████| 100/100 [00:00<00:00, 369216.90it/s]


Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 0 of 10:   0%|          | 0/9 [00:00<?, ?it/s]

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.

convert squad examples to features: 100%|██████████| 100/100 [00:01<00:00, 54.22it/s]

add example index and unique id: 100%|██████████| 100/100 [00:00<00:00, 314180.07it/s]


Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/9 [00:00<?, ?it/s]

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.

convert squad examples to features: 100%|██████████| 100/100 [00:02<00:00, 44.01it/s]

add example index and unique id: 100%|██████████| 100/100 [00:00<00:00, 257635.38it/s]


Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/9 [00:00<?, ?it/s]

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.

convert squad examples to features: 100%|██████████| 100/100 [00:02<00:00, 44.56it/s]

add example index and unique id: 100%|██████████| 100/100 [00:00<00:00, 263792.70it/s]


Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/9 [00:00<?, ?it/s]

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.

convert squad examples to features: 100%|██████████| 100/100 [00:02<00:00, 44.69it/s]

add example index and unique id: 100%|██████████| 100/100 [00:00<00:00, 253432.27it/s]


Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/9 [00:00<?, ?it/s]

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.

convert squad examples to features: 100%|██████████| 100/100 [00:01<00:00, 54.66it/s]

add example index and unique id: 100%|██████████| 100/100 [00:00<00:00, 250256.80it/s]


Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/9 [00:00<?, ?it/s]

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.

convert squad examples to features: 100%|██████████| 100/100 [00:01<00:00, 54.53it/s]

add example index and unique id: 100%|██████████| 100/100 [00:00<00:00, 226229.99it/s]


Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/9 [00:00<?, ?it/s]

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.

convert squad examples to features: 100%|██████████| 100/100 [00:02<00:00, 43.20it/s]

add example index and unique id: 100%|██████████| 100/100 [00:00<00:00, 229950.88it/s]


Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/9 [00:00<?, ?it/s]

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.

convert squad examples to features: 100%|██████████| 100/100 [00:02<00:00, 45.84it/s]

add example index and unique id: 100%|██████████| 100/100 [00:00<00:00, 158754.88it/s]


Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/9 [00:00<?, ?it/s]

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.

convert squad examples to features: 100%|██████████| 100/100 [00:01<00:00, 53.18it/s]

add example index and unique id: 100%|██████████| 100/100 [00:00<00:00, 237503.06it/s]


Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 9 of 10:   0%|          | 0/9 [00:00<?, ?it/s]

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.

convert squad examples to features: 100%|██████████| 100/100 [00:01<00:00, 51.05it/s]

add example index and unique id: 100%|██████████| 100/100 [00:00<00:00, 243713.19it/s]


Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

INFO:simpletransformers.question_answering.question_answering_model: Training of distilbert model complete. Saved to outputs/.


(90,
 {'global_step': [9, 18, 27, 36, 45, 54, 63, 72, 81, 90],
  'correct': [0, 11, 14, 14, 14, 16, 14, 14, 14, 14],
  'similar': [83, 86, 79, 83, 82, 74, 78, 75, 75, 78],
  'incorrect': [17, 3, 7, 3, 4, 10, 8, 11, 11, 8],
  'train_loss': [1.4478111267089844,
   0.8534301519393921,
   1.497988224029541,
   0.6585150957107544,
   0.4236159324645996,
   1.0082619190216064,
   0.7700591087341309,
   0.28037965297698975,
   0.35962390899658203,
   0.5479906797409058],
  'eval_loss': [-7.303267045454546,
   -7.747159090909091,
   -7.4406960227272725,
   -7.5886008522727275,
   -8.239524147727273,
   -8.118430397727273,
   -8.355113636363637,
   -8.390980113636363,
   -8.669211647727273,
   -8.763139204545455]})
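The tuple returned above holds the final global step (90, which is 9 steps per epoch over 10 epochs) and a history with one entry per evaluation. A small sketch, with values copied from that output, lines the counts up per step and picks the evaluation with the most exact matches. Choosing `correct` as the selection criterion is an assumption made here for illustration.

```python
# Values copied from the training output above (losses omitted).
history = {
    "global_step": [9, 18, 27, 36, 45, 54, 63, 72, 81, 90],
    "correct": [0, 11, 14, 14, 14, 16, 14, 14, 14, 14],
    "similar": [83, 86, 79, 83, 82, 74, 78, 75, 75, 78],
    "incorrect": [17, 3, 7, 3, 4, 10, 8, 11, 11, 8],
}

# One row per evaluation: (step, correct, similar, incorrect)
rows = list(zip(history["global_step"], history["correct"],
                history["similar"], history["incorrect"]))

print(f"{'step':>4} {'correct':>7} {'similar':>7} {'incorrect':>9}")
for step, correct, similar, incorrect in rows:
    print(f"{step:>4} {correct:>7} {similar:>7} {incorrect:>9}")

# max() returns the first row with the highest 'correct' count.
best = max(rows, key=lambda row: row[1])
print("most exact matches at step", best[0])
```

In this run the count of exact matches peaks at step 54 and then settles at 14, so the model saved at the end of training is not the one with the most exact matches.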

In [27]:
# Evaluate the model
result, texts = model.eval_model(test)

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 100/100 [00:01<00:00, 53.45it/s]
add example index and unique id: 100%|██████████| 100/100 [00:00<00:00, 305262.30it/s]


Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

In [28]:
result

{'correct': 14, 'similar': 78, 'incorrect': 8, 'eval_loss': -8.763139204545455}
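To read these counts, it helps to have a concrete rule for the three buckets. The sketch below assumes that an exact match after stripping whitespace counts as correct and that a substring overlap in either direction counts as similar. The library's actual rule may differ, and `classify` is a hypothetical name.

```python
def classify(prediction, truth):
    """Label a prediction as 'correct', 'similar', or 'incorrect' (assumed rule)."""
    prediction, truth = prediction.strip(), truth.strip()
    if prediction == truth:
        return "correct"
    # Substring overlap in either direction counts as a near miss.
    if prediction in truth or truth in prediction:
        return "similar"
    return "incorrect"


print(classify("autosomal dominant", "autosomal dominant"))              # correct
print(classify("autosomal dominant inheritance", "autosomal dominant"))  # similar
print(classify("thyroid", "autosomal dominant"))                         # incorrect
print(classify("", "autosomal dominant"))                                # similar

# Counts copied from the result above (eval_loss left out).
result = {"correct": 14, "similar": 78, "incorrect": 8}
total = sum(result.values())
print(f"exact-match rate: {result['correct'] / total:.2f}")
```

Under this assumed rule an empty prediction is a substring of every answer and lands in the similar bucket, so a high similar count is worth checking against the predicted texts before treating it as a sign of quality.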

In [None]:
texts

In [30]:
# Make predictions with the model
to_predict = [
              {'context': 'Balanced t(11;15)(q23;q15) in a TP53+/+ breast cancer patient from a Li-Fraumeni syndrome family. Li-Fraumeni Syndrome (LFS) is characterized by early-onset carcinogenesis involving multiple tumor types and shows autosomal dominant inheritance. Approximately 70% of LFS cases are due to germline mutations in the TP53 gene on chromosome 17p13.1. Mutations have also been found in the CHEK2 gene on chromosome 22q11, and others have been mapped to chromosome 11q23. While characterizing an LFS family with a documented defect in TP53, we found one family member who developed bilateral breast cancer at age 37 yet was homozygous for wild-type TP53. Her mother also developed early-onset primary bilateral breast cancer, and a sister had unilateral breast cancer and a soft tissue sarcoma. Cytogenetic analysis using fluorescence in situ hybridization of a primary skin fibroblast cell line revealed that the patient had a novel balanced reciprocal translocation between the long arms of chromosomes 11 and 15: t(11;15)(q23;q15). This translocation was not present in a primary skin fibroblast cell line from a brother with neuroblastoma, who was heterozygous for the TP53 mutation. There was no evidence of acute lymphoblastic leukemia in either the patient or her mother, although a nephew did develop leukemia and died in childhood. These data may implicate the region at breakpoint 11q23 and/or 15q15 as playing a significant role in predisposition to breast cancer development.',
  'qas': [{'id': '52bf208003868f1b06000019_002',
    'is_impossible': False,
    'question': 'What is the inheritance pattern of Li–Fraumeni syndrome?'
    }]
     }
]

answers, probabilities = model.predict(to_predict)

print(answers)
print(probabilities)

INFO:simpletransformers.question_answering.question_answering_model: Converting to features started.
convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 44.84it/s]
add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 4999.17it/s]


Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

[{'id': '52bf208003868f1b06000019_002', 'answer': ['autosomal dominant', 'autosomal dominant inheritance', 'autosomal']}]
[{'id': '52bf208003868f1b06000019_002', 'probability': [0.9246951034430451, 0.07158390908618548, 0.0037204293015799645]}]
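Because `n_best_size` is 3, each question comes back with three candidate answers and three matching probabilities. A short sketch, using the values printed above, pairs them by ID and reports the most probable candidate. It assumes both lists describe the same questions and that the candidates and probabilities are aligned by position.

```python
# Values copied from the prediction output above.
answers = [{"id": "52bf208003868f1b06000019_002",
            "answer": ["autosomal dominant", "autosomal dominant inheritance", "autosomal"]}]
probabilities = [{"id": "52bf208003868f1b06000019_002",
                  "probability": [0.9246951034430451, 0.07158390908618548, 0.0037204293015799645]}]

# Index probabilities by question ID so the two lists need not share an order.
prob_by_id = {item["id"]: item["probability"] for item in probabilities}

top_answers = {}
for item in answers:
    candidates = zip(item["answer"], prob_by_id[item["id"]])
    # Sort candidates by probability, highest first.
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    top_answers[item["id"]] = ranked[0]

for qa_id, (text, prob) in top_answers.items():
    print(f"{qa_id}: {text!r} (p={prob:.3f})")
```

The top candidate here, 'autosomal dominant', matches the gold answer shown for this question in the training data.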
