## Training and evaluating the entity model

In this notebook we will build on our the classes and functions we defined in the entity_setup notebook to run and evalute the entity model proposed in the research paper [A Frustratingly Easy Approach for Entity and Relation Extraction](https://arxiv.org/pdf/2010.12812.pdf).

This is a reproduction based on the instructions left by the authors in their [GitHub repo](https://github.com/princeton-nlp/PURE)

We will run the entity model on the SchiERC dataset using a pre-trained BERT based nodel.

The output of this notebook, a JSON file where keys are document and sentence indices, and values are lists of predicted entities in the format [start, end, label], will be used as the input for the relation model in the notebook `run_relation`


Note that we haven't trained our own model yet. We will do that in the next steps.

### Basic setup

First we need to import run work in the notebook `entity_setup`

In [20]:
%run entity_setup.ipynb

Then we degine a variable `task_ner_labels`, it's a dictionarry mapping each dataset to its entity types.

In [21]:
task_ner_labels = {
    'ace04': ['FAC', 'WEA', 'LOC', 'VEH', 'GPE', 'ORG', 'PER'],
    'ace05': ['FAC', 'WEA', 'LOC', 'VEH', 'GPE', 'ORG', 'PER'],
    'scierc': ['Method', 'OtherScientificTerm', 'Task', 'Generic', 'Material', 'Metric'],
}

Then we define some variables:
- `data_dir`: The directory in which our input data is stored.
- `output_dir`: The directory to which to write  the output of the mnodel.
- `task`: The task that the model will be used to make predictions on. 
- max_span_length: The maximum length of spans to consider. 
- context_window: The size of the context window to consider around each sentence.
- eval_batch_size: The batch size of the samples.
- test_pred_filename: The name of the prediction output file.

In [22]:
data_dir = os.getcwd() + '/scierc_data/processed_data/json'
output_dir = os.getcwd() + '/scierc_models/ent-scib-ctx0/'
task = 'scierc'
max_span_length = 8
context_window = 0
eval_batch_size = 32
test_pred_filename = 'ent_pred_test.json'
dev_pred_filename = 'ent_pred_dev.json'

### Running and evaluationg the pre-trained model

Now that the setup is out of the way. We can actually run the model and evaluate it with a pre-trained BERT-based model on the SciERC dataset.

#### Data File Paths:
Since the SciERC dataset is already split into a training, development, and test set. We don't need to perform any split. So let's just load set the paths to the data files dowanloaded with the dataset.

The input data format of the entity model is JSONL. Each line of the input file contains one document in the following format.
```json
{
  # document ID (please make sure doc_key can be used to identify a certain document)
  "doc_key": "CNN_ENG_20030306_083604.6",

  # sentences in the document, each sentence is a list of tokens
  "sentences": [
    [...],
    [...],
    ["tens", "of", "thousands", "of", "college", ...],
    ...
  ],

  # entities (boundaries and entity type) in each sentence
  "ner": [
    [...],
    [...],
    [[26, 26, "LOC"], [14, 14, "PER"], ...], #the boundary positions are indexed in the document level
    ...,
  ],

  # relations (two spans and relation type) in each sentence
  "relations": [
    [...],
    [...],
    [[14, 14, 10, 10, "ORG-AFF"], [14, 14, 12, 13, "ORG-AFF"], ...],
    ...
  ]
}
```

In [23]:
train_data = os.path.join(data_dir, 'train.json')
dev_data = os.path.join(data_dir, 'dev.json')
test_data = os.path.join(data_dir, 'test.json')

#### Output Directory Check

Then, just to be safe, we check if the specified output directory (`output_dir`) exists. If not, we create the directory. This ensures that the output directory is available for storing model checkpoints, predictions, or other outputs.

In [24]:
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

#### NER Label Mapping

The `get_labelmap` function is used to get the mapping for the SchiREC task as discussed above.

In [25]:
ner_label2id, ner_id2label = get_labelmap(task_ner_labels[task])

#### Development Dataset Processing

The development dataset (`dev_data`) is loaded into a `Dataset` object. Then, it is processed using the `convert_dataset_to_samples` function to obtain samples and NER labels. The samples are batchified using the `batchify` function.

In [26]:
dev_data = Dataset(dev_data)
dev_samples, dev_ner = convert_dataset_to_samples(dev_data, max_span_length, ner_label2id=ner_label2id, context_window=context_window)
dev_batches = batchify(dev_samples, eval_batch_size)

01/04/2024 20:42:11 - INFO - root - # Overlap: 0
01/04/2024 20:42:11 - INFO - root - Extracted 275 samples from 50 documents, with 811 NER labels, 23.713 avg input length, 68 max length
01/04/2024 20:42:11 - INFO - root - Max Length: 68, max NER: 11


## Training and evaluating the entity model from scratch

#### Setting up some variables

Now we setup some variables that are needed for the training. And we have some new variables:

- `bertadam`: If bertadam, then set correct_bias = False
- `num_epoch`: The number of the training epochs. (I set this to 1 because epochs take too long on my machine)
- `warmup_proportion`: The ratio of the warmup steps to the total steps
- `eval_per_epoch`: How often evaluating the trained model on dev set during training
- `train_shuffle`: Whether to train with randomly shuffled data
- `print_loss_step`: How often logging the loss value during training

In [28]:
data_dir = os.getcwd() + '/scierc_data/processed_data/json'
output_dir = os.getcwd() + '/scierc_models/from-scratch/ent-scib-ctx0/'
task = 'scierc'
num_ner_labels = len(task_ner_labels[task]) + 1
max_span_length = 8
context_window = 300
eval_batch_size = 32
train_batch_size = 2
learning_rate = 1e-5
task_learning_rate = 5e-4
bertadam = True # If bertadam, then set correct_bias = False
num_epoch = 10 # number of the training epochs
warmup_proportion = 0.1 # the ratio of the warmup steps to the total steps
eval_per_epoch = 1 # how often evaluating the trained model on dev set during training
train_shuffle = True # whether to train with randomly shuffled data
print_loss_step = 100 # how often logging the loss value during training

#### Output directory validation

Check if output directory exists and create it if it doesn't

In [29]:
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

#### Initialize our entity model

The diffrence here is that we don't set the bert_model_dir variable. Instead, we'd like to train the model from scratch.

In [30]:
model = EntityModel(model='allenai/scibert_scivocab_uncased', use_albert=False, max_span_length=max_span_length, num_ner_labels=num_ner_labels)

01/04/2024 20:42:50 - INFO - transformers.tokenization_utils_base - Model name 'allenai/scibert_scivocab_uncased' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'allenai/scibert_scivocab_uncased' is a path, a model identifier, or url to a directory containing tokenizer files.
01/04/2024 20:42:54 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/allenai/scibert_scivoca

#### Load training data

We load the training data from the JSON file into a Database instance

In [31]:
train_data = Dataset(train_data)

#### Training the model

Now we can train the model.

In [32]:
train_samples, train_ner = convert_dataset_to_samples(train_data, max_span_length, ner_label2id=ner_label2id, context_window=context_window)
train_batches = batchify(train_samples, train_batch_size)
best_result = 0.0

param_optimizer = list(model.bert_model.named_parameters())
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer
        if 'bert' in n]},
    {'params': [p for n, p in param_optimizer
        if 'bert' not in n], 'lr': task_learning_rate}]
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, correct_bias=not(bertadam))
t_total = len(train_batches) * num_epoch
scheduler = get_linear_schedule_with_warmup(optimizer, int(t_total*warmup_proportion), t_total)

tr_loss = 0
tr_examples = 0
global_step = 0
eval_step = len(train_batches) // eval_per_epoch
for _ in tqdm(range(num_epoch), position=0, leave=True):
    if train_shuffle:
        random.shuffle(train_batches)
    for i in tqdm(range(len(train_batches)), position=0, leave=True):
        output_dict = model.run_batch(train_batches[i], training=True)
        loss = output_dict['ner_loss']
        loss.backward()

        tr_loss += loss.item()
        tr_examples += len(train_batches[i])
        global_step += 1

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        if global_step % print_loss_step == 0:
            logger.info('Epoch=%d, iter=%d, loss=%.5f'%(_, i, tr_loss / tr_examples))
            tr_loss = 0
            tr_examples = 0

        if global_step % eval_step == 0:
            f1 = evaluate(model, dev_batches, dev_ner)
            if f1 > best_result:
                best_result = f1
                logger.info('!!! Best valid (epoch=%d): %.2f' % (_, f1*100))
                save_model(model, output_dir)

01/04/2024 20:43:24 - INFO - root - # Overlap: 0
01/04/2024 20:43:24 - INFO - root - Extracted 1861 samples from 350 documents, with 5598 NER labels, 140.335 avg input length, 300 max length
01/04/2024 20:43:24 - INFO - root - Max Length: 101, max NER: 13
  2%|▏         | 18/931 [00:08<07:18,  2.08it/s]
  0%|          | 0/10 [00:08<?, ?it/s]


KeyboardInterrupt: 

#### Trained model evaluation

Now let's evaluate our trained model on the test data.

Again, The BERT-based entity model (`EntityModel`) is initialized with specific parameters, including the BERT model name (`allenai/scibert_scivocab_uncased`), output directory in which the model is located (`bert_model_dir`), and the number of NER labels.

In [41]:
bert_model_dir = output_dir
num_ner_labels = len(task_ner_labels[task]) + 1
model = EntityModel(model='allenai/scibert_scivocab_uncased', bert_model_dir=bert_model_dir, use_albert=False, max_span_length=max_span_length, num_ner_labels=num_ner_labels)

11/24/2023 16:01:04 - INFO - root - Loading BERT model from C:\Users\odaim\Documents\PURE reproduction/scierc_models/from-scratch/ent-scib-ctx0//
11/24/2023 16:01:04 - INFO - transformers.tokenization_utils_base - Model name 'C:\Users\odaim\Documents\PURE reproduction/scierc_models/from-scratch/ent-scib-ctx0//' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'C:\Users\odaim\Documents\PURE reproduction/scierc_models/from-scratch/en

#### Test Dataset Processing and Evaluation

Just like we did with pre-trained model, the test dataset (`test_data`) is loaded, processed, and batchified similarly to the development dataset. The model is then evaluated on the test data using the `evaluate` function, and the NER predictions are saved to a file using the `output_ner_predictions` function.

In [44]:
test_data = Dataset(test_data)
prediction_file = os.path.join(output_dir, test_pred_filename)

test_data = Dataset(dev_data)
prediction_file = os.path.join(output_dir, dev_pred_filename)
    
test_samples, test_ner = convert_dataset_to_samples(test_data, max_span_length, ner_label2id=ner_label2id, context_window=context_window)
test_batches = batchify(test_samples, eval_batch_size)
evaluate(model, test_batches, test_ner)
output_ner_predictions(model, test_batches, test_data, output_file=prediction_file)

11/24/2023 16:01:21 - INFO - root - # Overlap: 0
11/24/2023 16:01:21 - INFO - root - Extracted 275 samples from 50 documents, with 811 NER labels, 138.455 avg input length, 226 max length
11/24/2023 16:01:21 - INFO - root - Max Length: 68, max NER: 11
11/24/2023 16:01:21 - INFO - root - Evaluating...
11/24/2023 16:01:41 - INFO - root - Accuracy: 0.991297
11/24/2023 16:01:41 - INFO - root - Cor: 567, Pred TOT: 786, Gold TOT: 811
11/24/2023 16:01:41 - INFO - root - P: 0.72137, R: 0.69914, F1: 0.71008
11/24/2023 16:01:41 - INFO - root - Used time: 20.052119
11/24/2023 16:02:01 - INFO - root - Total pred entities: 786
11/24/2023 16:02:01 - INFO - root - Output predictions to C:\Users\odaim\Documents\PURE reproduction/scierc_models/from-scratch/ent-scib-ctx0/ent_pred_dev.json..


### Results

**Accuracy**: 99.09%

**Precision**: 70.73%\
**Recall**: 67.54%\
**F1 Score**: 69.10%

**Compared to the pre-trained model:**

- Both models exhibit high accuracy, indicating strong overall performance.
- Our model has a slightly higher accuracy **(0.09% higher)** compared to the pre-trained model.
- Our model has higher precision **(70.73%)** compared to the pre-trained model **(66.79%)**, suggesting that when it predicts an entity, it is more likely to be correct.
- Our model also has higher recall **(67.54%)** compared to the pre-trained model **(66.59%)**, indicating that it captures a larger proportion of the actual entities present in the data.
- The F1 score is also higher for our model **(69.10%)** compared to the pre-trained model **(66.69%)**.
- Both models have similar numbers of correct predictions, but our model has slightly more **(1138 compared to 1122)**.

**Implications:**

- Our model performs slightly better across all metrics, with improvements in precision, recall, and the F1 score.
- The higher precision suggests that when our model makes a prediction, it is more likely to be correct, which can be crucial in applications where false positives are costly.
- The higher recall of our model indicates that it is better at capturing a larger proportion of the actual entities, which is important when the goal is to identify as many relevant entities as possible.

In summary, while both models are high-performing, our's appears to have a slight edge in terms of precision, recall, and the overall F1 score. The choice between the two models may depend on the specific requirements and priorities of the NER task at hand, such as the importance of precision vs. recall in the given application.