## Running and evaluating the pre-trained entity model

In this notebook we build on the classes and functions defined in the `entity_setup` notebook to run and evaluate the entity model proposed in the research paper [A Frustratingly Easy Approach for Entity and Relation Extraction](https://arxiv.org/pdf/2010.12812.pdf).

This is a reproduction based on the instructions left by the authors in their [GitHub repo](https://github.com/princeton-nlp/PURE).

We will run the entity model on the SciERC dataset using a pre-trained BERT-based model.

The output of this notebook is a JSON file in which keys are document and sentence indices and values are lists of predicted entities in the format `[start, end, label]`. This file is used as the input for the relation model in the notebook `run_relation`.


Note that we haven't trained our own model yet. We will do that in the next steps.

### Basic setup

First, we run the `entity_setup` notebook so that the classes and functions defined there are available here.

In [7]:
%run entity_setup.ipynb

Then we define a variable `task_ner_labels`, a dictionary mapping each dataset to its entity types.

In [8]:
task_ner_labels = {
    'ace04': ['FAC', 'WEA', 'LOC', 'VEH', 'GPE', 'ORG', 'PER'],
    'ace05': ['FAC', 'WEA', 'LOC', 'VEH', 'GPE', 'ORG', 'PER'],
    'scierc': ['Method', 'OtherScientificTerm', 'Task', 'Generic', 'Material', 'Metric'],
}

Then we define some variables:
- `data_dir`: The directory in which our input data is stored.
- `output_dir`: The directory to which the output of the model is written.
- `task`: The task (dataset) on which the model makes predictions.
- `max_span_length`: The maximum length, in tokens, of the spans to consider.
- `context_window`: The size of the context window to consider around each sentence.
- `eval_batch_size`: The number of samples per batch during evaluation.
- `test_pred_filename`: The name of the prediction output file for the test set.
- `dev_pred_filename`: The name of the prediction output file for the development set.

In [9]:
data_dir = os.getcwd() + '/scierc_data/processed_data/json'
output_dir = os.getcwd() + '/scierc_models/ent-scib-ctx0/'
task = 'scierc'
max_span_length = 8
context_window = 0
eval_batch_size = 32
test_pred_filename = 'ent_pred_test.json'
dev_pred_filename = 'ent_pred_dev.json'
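
To make the role of `max_span_length` concrete, here is a minimal sketch of span enumeration. It assumes that candidate spans are all contiguous token ranges of at most `max_span_length` tokens with inclusive end indices; the helper `enumerate_spans` is hypothetical and is not part of `entity_setup`.

```python
def enumerate_spans(tokens, max_span_length):
    """Return all (start, end) token spans, end inclusive, up to max_span_length tokens."""
    spans = []
    for start in range(len(tokens)):
        # The span may not run past the sentence end or exceed the length limit.
        last_end = min(start + max_span_length, len(tokens))
        for end in range(start, last_end):
            spans.append((start, end))
    return spans


toy_sentence = ["we", "train", "a", "neural", "parser"]
toy_spans = enumerate_spans(toy_sentence, max_span_length=3)
print(len(toy_spans))  # 12 candidate spans for 5 tokens and a limit of 3
```

A larger `max_span_length` covers longer entities but increases the number of candidate spans the model has to classify.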

### Running and evaluating the pre-trained model

Now that the setup is out of the way, we can run the entity model with a pre-trained BERT-based checkpoint and evaluate it on the SciERC dataset.

#### Data File Paths
Since the SciERC dataset is already split into training, development, and test sets, we don't need to perform any split ourselves. We only set the paths to the data files downloaded with the dataset.

The input data format of the entity model is JSONL. Each line of the input file contains one document in the following format.
```json
{
  # document ID (please make sure doc_key can be used to identify a certain document)
  "doc_key": "CNN_ENG_20030306_083604.6",

  # sentences in the document, each sentence is a list of tokens
  "sentences": [
    [...],
    [...],
    ["tens", "of", "thousands", "of", "college", ...],
    ...
  ],

  # entities (boundaries and entity type) in each sentence
  "ner": [
    [...],
    [...],
    [[26, 26, "LOC"], [14, 14, "PER"], ...], #the boundary positions are indexed in the document level
    ...,
  ],

  # relations (two spans and relation type) in each sentence
  "relations": [
    [...],
    [...],
    [[14, 14, 10, 10, "ORG-AFF"], [14, 14, 12, 13, "ORG-AFF"], ...],
    ...
  ]
}
```
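
Because entity boundaries are indexed at the document level, recovering the tokens of an entity requires subtracting the offset of the sentence it belongs to. The following sketch shows this on a made-up toy document (not taken from SciERC); the helper `entity_mentions` is hypothetical. Note that the `#` comments in the schema above are annotations only and must not appear in an actual JSONL file.

```python
import json

# One JSONL line: a hypothetical two-sentence document in the format above.
line = json.dumps({
    "doc_key": "toy_doc",
    "sentences": [["We", "use", "BERT", "."],
                  ["It", "improves", "entity", "recognition", "."]],
    "ner": [[[2, 2, "Method"]], [[6, 7, "Task"]]],
    "relations": [[], []],
})


def entity_mentions(doc):
    """Return (text, label) pairs, converting document-level indices to sentence-level."""
    mentions = []
    offset = 0  # document-level index of the first token of the current sentence
    for sentence, sentence_ner in zip(doc["sentences"], doc["ner"]):
        for start, end, label in sentence_ner:
            tokens = sentence[start - offset:end - offset + 1]  # end is inclusive
            mentions.append((" ".join(tokens), label))
        offset += len(sentence)
    return mentions


doc = json.loads(line)
mentions = entity_mentions(doc)
print(mentions)  # [('BERT', 'Method'), ('entity recognition', 'Task')]
```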

In [10]:
train_data = os.path.join(data_dir, 'train.json')
dev_data = os.path.join(data_dir, 'dev.json')
test_data = os.path.join(data_dir, 'test.json')

#### Output Directory Check

Then, just to be safe, we check if the specified output directory (`output_dir`) exists. If not, we create the directory. This ensures that the output directory is available for storing model checkpoints, predictions, or other outputs.

In [11]:
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
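
As a side note, the same check can be written in a single call with the `exist_ok` parameter of `os.makedirs`. The sketch below uses a temporary directory and a hypothetical `demo_output_dir` so that it does not touch the real output directory.

```python
import os
import tempfile

# Hypothetical stand-in for output_dir, created under a temporary directory.
base_dir = tempfile.mkdtemp()
demo_output_dir = os.path.join(base_dir, "scierc_models", "ent-scib-ctx0")

# exist_ok=True creates missing parent directories and does not fail
# if the directory is already there, so calling it twice is safe.
os.makedirs(demo_output_dir, exist_ok=True)
os.makedirs(demo_output_dir, exist_ok=True)

print(os.path.isdir(demo_output_dir))  # True
```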

#### NER Label Mapping

The `get_labelmap` function builds the mappings between entity labels and integer ids for the SciERC task, using the entity types defined in `task_ner_labels` above.

In [12]:
ner_label2id, ner_id2label = get_labelmap(task_ner_labels[task])
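
For reference, here is a minimal sketch of what such a label mapping can look like. It assumes, as in the original PURE code, that ids start at 1 and that id 0 is reserved for "not an entity"; the function `sketch_labelmap` is a hypothetical stand-in and may differ from the `get_labelmap` defined in `entity_setup`.

```python
def sketch_labelmap(label_list):
    """Map each label to an id starting at 1; id 0 is left for non-entity spans (assumption)."""
    label2id = {}
    id2label = {}
    for index, label in enumerate(label_list):
        label2id[label] = index + 1
        id2label[index + 1] = label
    return label2id, id2label


scierc_labels = ['Method', 'OtherScientificTerm', 'Task', 'Generic', 'Material', 'Metric']
label2id, id2label = sketch_labelmap(scierc_labels)
print(label2id['Method'], id2label[6])  # 1 Metric
```

Under this assumption the classifier needs `len(label_list) + 1` output classes: one per entity type plus one for spans that are not entities.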

#### Development Dataset Processing

The development dataset (`dev_data`) is loaded into a `Dataset` object. Then, it is processed using the `convert_dataset_to_samples` function to obtain samples and NER labels. The samples are batchified using the `batchify` function.

In [13]:
dev_data = Dataset(dev_data)
dev_samples, dev_ner = convert_dataset_to_samples(dev_data, max_span_length, ner_label2id=ner_label2id, context_window=context_window)
dev_batches = batchify(dev_samples, eval_batch_size)

01/04/2024 20:41:36 - INFO - root - # Overlap: 0
01/04/2024 20:41:36 - INFO - root - Extracted 275 samples from 50 documents, with 811 NER labels, 23.713 avg input length, 68 max length
01/04/2024 20:41:36 - INFO - root - Max Length: 68, max NER: 11
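
To illustrate the batching step, the sketch below splits a list of samples into consecutive chunks of at most `eval_batch_size` items. This is a simplified assumption about what `batchify` does; the actual function in `entity_setup` may handle special cases (such as very long sentences) differently, and `sketch_batchify` is a hypothetical name.

```python
def sketch_batchify(samples, batch_size):
    """Split samples into consecutive batches of at most batch_size items."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


# 275 placeholder samples, matching the number of development samples logged above.
toy_samples = list(range(275))
toy_batches = sketch_batchify(toy_samples, batch_size=32)
print(len(toy_batches), len(toy_batches[-1]))  # 9 19
```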


#### Model Initialization

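The BERT-based entity model (`EntityModel`) is initialized with the BERT model name (`allenai/scibert_scivocab_uncased`), the directory from which the pre-trained model is loaded (`bert_model_dir`, here the same as `output_dir`), and the number of NER labels. The number of labels is the number of entity types plus one, the extra class standing for spans that are not entities.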

In [69]:
bert_model_dir = output_dir
num_ner_labels = len(task_ner_labels[task]) + 1
model = EntityModel(model='allenai/scibert_scivocab_uncased', bert_model_dir=bert_model_dir, use_albert=False, max_span_length=max_span_length, num_ner_labels=num_ner_labels)

11/23/2023 13:33:05 - INFO - root - Loading BERT model from C:\Users\odaim\Documents\PURE reproduction/scierc_models/ent-scib-ctx0//
11/23/2023 13:33:05 - INFO - transformers.tokenization_utils_base - Model name 'C:\Users\odaim\Documents\PURE reproduction/scierc_models/ent-scib-ctx0//' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'C:\Users\odaim\Documents\PURE reproduction/scierc_models/ent-scib-ctx0//' is a path, a model ident

#### Test Dataset Processing and Evaluation

Finally, the test dataset (`test_data`) is loaded, processed, and batchified in the same way as the development dataset. The model is then evaluated on the test data using the `evaluate` function, and the NER predictions are saved to a file using the `output_ner_predictions` function.

In [70]:
test_data = Dataset(test_data)
prediction_file = os.path.join(output_dir, test_pred_filename)

test_samples, test_ner = convert_dataset_to_samples(test_data, max_span_length, ner_label2id=ner_label2id, context_window=context_window)
test_batches = batchify(test_samples, eval_batch_size)
evaluate(model, test_batches, test_ner)
output_ner_predictions(model, test_batches, test_data, output_file=prediction_file)

11/23/2023 13:33:13 - INFO - root - # Overlap: 0
11/23/2023 13:33:13 - INFO - root - Extracted 551 samples from 100 documents, with 1685 NER labels, 24.321 avg input length, 97 max length
11/23/2023 13:33:13 - INFO - root - Max Length: 97, max NER: 13
11/23/2023 13:33:13 - INFO - root - Evaluating...
11/23/2023 13:33:28 - INFO - root - Accuracy: 0.990194
11/23/2023 13:33:28 - INFO - root - Cor: 1122, Pred TOT: 1680, Gold TOT: 1685
11/23/2023 13:33:28 - INFO - root - P: 0.66786, R: 0.66588, F1: 0.66686
11/23/2023 13:33:28 - INFO - root - Used time: 15.231171
11/23/2023 13:33:41 - INFO - root - Total pred entities: 1680
11/23/2023 13:33:41 - INFO - root - Output predictions to C:\Users\odaim\Documents\PURE reproduction/scierc_models/ent-scib-ctx0/ent_pred_test.json..


### Results 
**Accuracy**: 0.990194

**Correct Predictions**: 1122\
**Total Predictions**: 1680\
**Total Gold Entities**: 1685

**Precision**: 0.66786\
**Recall**: 0.66588\
**F1 Score**: 0.66686
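
The precision, recall, and F1 values can be recomputed from the three counts in the log. The sketch below uses the standard definitions; the helper `precision_recall_f1` is hypothetical and is not the `evaluate` function from `entity_setup`.

```python
def precision_recall_f1(correct, predicted, gold):
    """Compute precision, recall, and F1 from entity counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts taken from the evaluation log above.
p, r, f1 = precision_recall_f1(correct=1122, predicted=1680, gold=1685)
print(f"P: {p:.5f}, R: {r:.5f}, F1: {f1:.5f}")  # P: 0.66786, R: 0.66588, F1: 0.66686
```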

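**Implications**:

- The accuracy of 0.990194 is far higher than precision and recall, which shows that it is measured over all candidate spans rather than over entities alone. Since most candidate spans are not entities, correctly rejecting them dominates this number, so accuracy on its own says little about the quality of entity extraction.
- Precision (0.66786) and recall (0.66588) are nearly equal, and the model predicts 1680 entities against 1685 gold entities, so it shows no strong tendency to over- or under-predict entities.
- The F1 score of 0.66686 is the more informative metric: about two thirds of the predicted entities are counted as correct, and about two thirds of the gold entities are found. This leaves clear room for improvement.
- These scores are measured on the held-out test set, so they reflect performance on unseen documents from the same dataset. They do not tell us how the model would perform on text from other domains.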

**Note** 
It's important to note that the interpretation of these metrics can depend on the specific requirements and priorities of the NER task at hand. For example, in some cases, precision may be more critical than recall, or vice versa, depending on the consequences of false positives and false negatives.

## Training and evaluating the entity model from scratch
Now we will train and evaluate an entity model from scratch in the notebook `train_entity`.