# Training a BERT model on new data

In this notebook we use the same model architecture that we used for the SciERC dataset, but this time we train and evaluate our models on a new dataset: the [TAC Relation Extraction Dataset](https://nlp.stanford.edu/projects/tacred/) (TACRED).

## The TAC Relation Extraction Dataset
TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the [corpus](https://catalog.ldc.upenn.edu/LDC2018T03) used in the yearly [TAC Knowledge Base Population (TAC KBP) challenges](https://tac.nist.gov/2017/KBP/index.html). Examples in TACRED cover 41 relation types as used in the TAC KBP challenges (e.g., per:schools_attended and org:members) or are labeled as no_relation if no defined relation is held. These examples are created by combining available human annotations from the TAC KBP challenges and crowdsourcing.

### Bias towards predicting false positives
To ensure that models trained on TACRED are not biased towards predicting false positives on real-world text, the dataset authors fully annotated all sampled sentences where no relation was found between the mention pairs as negative examples. As a result, 79.5% of the examples are labeled as no_relation.

### The dataset is made up of 3 JSON files:
1. train.json: The training examples. 68124 in total.
2. dev.json: The development examples. 22631 in total.
3. test.json: The test examples. 15509 in total.
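
To get a feel for the label imbalance described above, the relation labels in a split can be counted directly. The following is a minimal sketch, assuming each example is a dict with a `relation` key (as in the TACRED JSON files). The helper name `relation_distribution` and the four inline records are made up for illustration.

```python
from collections import Counter

def relation_distribution(examples, negative_label="no_relation"):
    """Return (fraction of negative examples, Counter of positive relation labels)."""
    counts = Counter(example["relation"] for example in examples)
    total = sum(counts.values())
    negative_fraction = counts.pop(negative_label, 0) / total if total else 0.0
    return negative_fraction, counts

# Illustrative records only; in practice `examples` would be the list loaded from train.json.
examples = [
    {"relation": "no_relation"},
    {"relation": "per:title"},
    {"relation": "no_relation"},
    {"relation": "org:founded_by"},
]
negative_fraction, positives = relation_distribution(examples)
print(negative_fraction, positives.most_common())
```

Run on the real training split, the first value would be expected to land close to the 79.5% quoted above.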

## Preparing the data

The datapoints of the TACRED dataset have a different shape from those of the SciERC dataset, so we must first map the TACRED data to the shape of the SciERC data so that it fits our Dataset class.

### Install nltk
We must first install the nltk library, whose tokenizer is imported below.

In [1]:
! pip install nltk



### Install `punkt` for nltk
Since we'll use nltk's word_tokenize, we must also download the punkt tokenizer.

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\odaim\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Some needed imports

In [3]:
import json
import os
from unidecode import unidecode
from tqdm import tqdm
import itertools

from nltk.tokenize import word_tokenize

In [4]:
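def normalize_tacred_sample(sample):
    """Map one TACRED example to the SciERC-style document format used by our Dataset class."""
    norm = {}
    norm['doc_key'] = sample['docid']
    # Each TACRED example is a single tokenized sentence.
    norm['sentences'] = [sample['token']]
    relation = sample['relation']
    norm['ner'] = []
    # One relation per example: [subj_start, subj_end, obj_start, obj_end, label].
    norm['relations'] = [[[sample['subj_start'], sample['subj_end'], sample['obj_start'], sample['obj_end'], relation]]]
    norm['clusters'] = []

    # Entities: the subject and object spans first, then spans read off the Stanford NER tags.
    entities = []
    tac_ner = sample['stanford_ner']
    entities.append([sample['subj_start'], sample['subj_end'], sample['subj_type']])
    entities.append([sample['obj_start'], sample['obj_end'], sample['obj_type']])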
    i = 0
    while i < len(tac_ner):
        if tac_ner[i] != 'O':
            ner = []
            ner.append(i)
            j = i
            while j < len(tac_ner):
                if tac_ner[i] == tac_ner[j]:
                    if j == len(tac_ner) - 1:
                        ner.append(j)
                        ner.append(tac_ner[j])
                        break
                    j = j + 1
                    continue
                else:
                    ner.append(j - 1)
                    ner.append(tac_ner[j - 1])
                    i = j
                    break
                    
            entities.append(ner)
        i = i + 1
    # Sort, then drop exact duplicates (e.g. a tagged span identical to the subject or object span).
    entities.sort()
    norm['ner'].append([span for span, _ in itertools.groupby(entities)])
    
    return norm

In [5]:


print(normalize_tacred_sample(json.loads('{"id": "e779865fb91e34998dce", "docid": "eng-NG-31-142693-10075646", "relation": "no_relation", "token": ["To", "Fed", "Judge", "Jeff", "White", ",", "Cayman", "Isles", "Bank", "Julius", "Baer", "\'s", "Lapdog", "in", "San", "Francisco"], "subj_start": 9, "subj_end": 10, "obj_start": 1, "obj_end": 1, "subj_type": "PERSON", "obj_type": "ORGANIZATION", "stanford_pos": ["TO", "NNP", "NNP", "NNP", "NNP", ",", "NNP", "NNP", "NNP", "NNP", "NNP", "POS", "NNP", "IN", "NNP", "NNP"], "stanford_ner": ["O", "O", "O", "PERSON", "PERSON", "O", "ORGANIZATION", "ORGANIZATION", "ORGANIZATION", "PERSON", "PERSON", "O", "O", "O", "LOCATION", "LOCATION"], "stanford_head": [5, 5, 5, 5, 0, 5, 11, 11, 11, 11, 13, 11, 5, 16, 16, 13], "stanford_deprel": ["case", "compound", "compound", "compound", "ROOT", "punct", "compound", "compound", "compound", "compound", "nmod:poss", "case", "appos", "case", "compound", "nmod"]}')))

{'doc_key': 'eng-NG-31-142693-10075646', 'sentences': [['To', 'Fed', 'Judge', 'Jeff', 'White', ',', 'Cayman', 'Isles', 'Bank', 'Julius', 'Baer', "'s", 'Lapdog', 'in', 'San', 'Francisco']], 'ner': [[[1, 1, 'ORGANIZATION'], [3, 4, 'PERSON'], [6, 8, 'ORGANIZATION'], [9, 10, 'PERSON'], [10, 10, 'PERSON'], [14, 15, 'LOCATION'], [15, 15, 'LOCATION']]], 'relations': [[[9, 10, 1, 1, 'no_relation']]], 'clusters': []}
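
Note that the printed `ner` list contains `[10, 10, 'PERSON']` and `[15, 15, 'LOCATION']` alongside the longer spans `[9, 10, 'PERSON']` and `[14, 15, 'LOCATION']`. These come from the index handling in the nested `while` loops: a run that ends at the last token is visited again from its final token, and a run that directly follows another run (here the PERSON run starting at token 9, right after the ORGANIZATION run) is entered one token late. If one span per run of identical tags is what is wanted, a possible alternative is sketched below with `itertools.groupby`. The helper name `tags_to_spans` is hypothetical, and the results reported later in this notebook were produced with the function above, not with this sketch.

```python
from itertools import groupby

def tags_to_spans(tags, outside="O"):
    """Collapse a per-token tag sequence into [start, end, label] spans (inclusive ends)."""
    spans = []
    start = 0
    for label, run in groupby(tags):
        length = len(list(run))
        if label != outside:
            spans.append([start, start + length - 1, label])
        start += length
    return spans

# The stanford_ner tags of the sample printed above.
tags = ["O", "O", "O", "PERSON", "PERSON", "O", "ORGANIZATION", "ORGANIZATION",
        "ORGANIZATION", "PERSON", "PERSON", "O", "O", "O", "LOCATION", "LOCATION"]
spans = tags_to_spans(tags)
print(spans)
```

Like the original loop, this treats adjacent tokens with the same tag as one entity, so two neighbouring entities of the same type would still be merged.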


In [6]:
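def write_normal_data(in_dir, out_dir):
    """Normalize every sample in the JSON file `in_dir` and write one JSON document per line to the file `out_dir`."""
    with open(in_dir) as f:
        data = json.load(f)
    # Open the output once, in write mode, so re-running the cell does not append duplicates.
    with open(out_dir, 'w') as normalized:
        for sample in tqdm(data, desc="Normalizing data samples..."):
            normalized.write(json.dumps(normalize_tacred_sample(sample)) + "\n")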

In [7]:
train_data_path = os.getcwd() + '/other_data/tacred/data/json/train.json'
normal_train_data_path = os.getcwd() + '/other_data/tacred/data/json/norm_train.json'

write_normal_data(train_data_path, normal_train_data_path)

In [167]:
test_data_path = os.getcwd() + '/other_data/tacred/data/json/test.json'
normal_test_data_path = os.getcwd() + '/other_data/tacred/data/json/norm_test.json'

write_normal_data(test_data_path, normal_test_data_path)

Normalizing data samples...: 100%|██████████| 15509/15509 [00:03<00:00, 4190.59it/s]


In [168]:
dev_data_path = os.getcwd() + '/other_data/tacred/data/json/dev.json'
normal_dev_data_path = os.getcwd() + '/other_data/tacred/data/json/norm_dev.json'

write_normal_data(dev_data_path, normal_dev_data_path)

Normalizing data samples...: 100%|██████████| 22631/22631 [00:05<00:00, 4282.49it/s]


## Training our BERT models

Now that the TACRED data has the proper shape, we can train a new BERT model on it.

### The entity model

#### Set up

First we run the entity_setup.ipynb notebook to set up our classes and functions in the kernel.

In [9]:
%run entity_model/entity_setup.ipynb

  from .autonotebook import tqdm as notebook_tqdm


#### Model training

Now we train our BERT-based entity model. This will look very familiar compared to the work we've done before.
Before anything else, we set up some variables, the same ones we set before.

#### `task_ner_labels`
This is a map from our datasets to their respective entity types. Here we added the TACRED entity types.

In [10]:
task_ner_labels = {
    'ace04': ['FAC', 'WEA', 'LOC', 'VEH', 'GPE', 'ORG', 'PER'],
    'ace05': ['FAC', 'WEA', 'LOC', 'VEH', 'GPE', 'ORG', 'PER'],
    'scierc': ['Method', 'OtherScientificTerm', 'Task', 'Generic', 'Material', 'Metric'],
    'tacred': ['ORGANIZATION', 'NUMBER', 'MONEY', 'ORDINAL', 'DATE', 'PERCENT', 'PERSON', 'DURATION', 'MISC', 'LOCATION', 'SET', 'TIME', 'TITLE', 'NATIONALITY', 'RELIGION', 'URL', 'CAUSE_OF_DEATH', 'COUNTRY', 'STATE_OR_PROVINCE', 'CRIMINAL_CHARGE', 'CITY', 'IDEOLOGY']
}

Then we define the other variables:
- `data_dir`: The directory in which our input data is stored.
- `output_dir`: The directory to which the output of the model is written.
- `task`: The task (dataset) that the model will make predictions on.
- `max_span_length`: The maximum length, in tokens, of the spans to consider.
- `context_window`: The size of the context window to consider around each sentence.
- `eval_batch_size`: The batch size used during evaluation.
- `test_pred_filename`: The name of the prediction output file for the test set.
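
To make `max_span_length` concrete: a span-based entity model scores candidate spans rather than individual tokens, so the number of candidates per sentence grows with this setting. The sketch below is an assumption about how such candidates can be enumerated, not the code inside `convert_dataset_to_samples` (which lives in entity_setup.ipynb and is not shown here). The helper name `enumerate_spans` is hypothetical.

```python
def enumerate_spans(num_tokens, max_span_length):
    """List all (start, end) token spans, inclusive, that are at most max_span_length tokens long."""
    spans = []
    for start in range(num_tokens):
        last_end = min(start + max_span_length, num_tokens)
        for end in range(start, last_end):
            spans.append((start, end))
    return spans

# A 16-token sentence, like the sample normalized earlier, with max_span_length = 8.
candidates = enumerate_spans(16, 8)
print(len(candidates))
```

Most of these candidates are not entities, which is worth keeping in mind when reading span-level accuracy figures.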

In [11]:
data_dir = os.getcwd() + '/other_data/tacred/data/json/'
output_dir = os.getcwd() + '/tacred_models/ent-scib-ctx0/'
task = 'tacred'
max_span_length = 8
test_pred_filename = 'ent_pred_test.json'
dev_pred_filename = 'ent_pred_dev.json'

num_ner_labels = len(task_ner_labels[task]) + 1
context_window = 300
eval_batch_size = 32
train_batch_size = 2
learning_rate = 1e-5
task_learning_rate = 5e-4
bertadam = True # If bertadam, then set correct_bias = False
num_epoch = 4 # number of training epochs
warmup_proportion = 0.1 # the ratio of warmup steps to total steps
eval_per_epoch = 1 # how many times per epoch to evaluate the model on the dev set
train_shuffle = True # whether to train with randomly shuffled data
print_loss_step = 100 # how often (in steps) to log the loss value during training

#### Data File Paths:
Since the TACRED dataset is already split into training, development, and test sets, we don't need to perform any split. We just set the paths to the normalized data files we created above.


In [12]:
train_data = os.path.join(data_dir, 'norm_train.json')
dev_data = os.path.join(data_dir, 'norm_dev.json')
test_data = os.path.join(data_dir, 'norm_test.json')

#### Output Directory Check

Then, just to be safe, we check if the specified output directory (`output_dir`) exists. If not, we create the directory. This ensures that the output directory is available for storing model checkpoints, predictions, or other outputs.

In [13]:
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

#### NER Label Mapping

The `get_labelmap` function is used to get the label mapping for the TACRED task, based on the entity types listed above.

In [14]:
ner_label2id, ner_id2label = get_labelmap(task_ner_labels[task])
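
For readers who do not have entity_setup.ipynb open, the sketch below shows one plausible shape for such a mapping. It assumes that ids start at 1 and that 0 is reserved for "not an entity", which would be consistent with `num_ner_labels = len(task_ner_labels[task]) + 1` above. The real `get_labelmap` may differ, and the name `sketch_labelmap` is hypothetical.

```python
def sketch_labelmap(label_list):
    """Build label -> id and id -> label dicts, leaving id 0 free for the 'no entity' class (assumed)."""
    label2id = {label: i + 1 for i, label in enumerate(label_list)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label

label2id, id2label = sketch_labelmap(["ORGANIZATION", "PERSON", "LOCATION"])
print(label2id)
print(id2label)
```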

#### Development Dataset Processing

The development dataset (`dev_data`) is loaded into a `Dataset` object. Then, it is processed using the `convert_dataset_to_samples` function to obtain samples and NER labels. The samples are batchified using the `batchify` function.

In [15]:
dev_data = Dataset(dev_data)

In [16]:
dev_samples, dev_ner = convert_dataset_to_samples(dev_data, max_span_length, ner_label2id=ner_label2id, context_window=context_window)
dev_batches = batchify(dev_samples, eval_batch_size)

02/12/2024 00:35:05 - INFO - root - # Overlap: 0
02/12/2024 00:35:05 - INFO - root - Extracted 22631 samples from 22631 documents, with 129939 NER labels, 35.463 avg input length, 95 max length
02/12/2024 00:35:05 - INFO - root - Max Length: 95, max NER: 29
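
As a rough picture of what batching does to these sample counts, the sketch below splits a list into fixed-size chunks. This is a simplification: the actual `batchify` from entity_setup.ipynb may sort samples or treat very long sentences specially. The name `chunk_batches` is hypothetical.

```python
def chunk_batches(samples, batch_size):
    """Split samples into consecutive batches; the last batch may be smaller."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

# Stand-ins for the real sample lists, using the counts from the logs.
dev_like = list(range(22631))
train_like = list(range(68124))

print(len(chunk_batches(dev_like, 32)))    # eval_batch_size = 32
print(len(chunk_batches(train_like, 2)))   # train_batch_size = 2
```

With a batch size of 2, the 68124 training samples give 34062 batches, which matches the progress-bar total in the training log further down.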


#### Initialize our entity model

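We initialize the entity model from the pretrained `allenai/scibert_scivocab_uncased` checkpoint. It has not been trained on TACRED yet.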

In [17]:
model = EntityModel(model='allenai/scibert_scivocab_uncased', use_albert=False, max_span_length=max_span_length, num_ner_labels=num_ner_labels)

02/12/2024 00:35:05 - INFO - transformers.tokenization_utils_base - Model name 'allenai/scibert_scivocab_uncased' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'allenai/scibert_scivocab_uncased' is a path, a model identifier, or url to a directory containing tokenizer files.
02/12/2024 00:35:09 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/allenai/scibert_scivoca

#### Load training data

We load the training data from the JSON file into a `Dataset` instance.

In [18]:
train_data = Dataset(train_data)

#### Training the model

Now we can train the model.
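
The training cell below combines `AdamW` with `get_linear_schedule_with_warmup`, so the learning rate rises linearly from 0 during the first `warmup_proportion` of the steps and then decays linearly back to 0. The sketch below reproduces that multiplier in plain Python to make the shape concrete. The function name `linear_warmup_factor` is ours, and the step counts are derived from the settings above (68124 training samples according to the log, batch size 2, 4 epochs).

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

batches_per_epoch = 68124 // 2          # 34062 batches at train_batch_size = 2
total_steps = batches_per_epoch * 4     # num_epoch = 4
warmup_steps = int(total_steps * 0.1)   # warmup_proportion = 0.1

for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, 1e-5 * linear_warmup_factor(step, warmup_steps, total_steps))
```

The multiplier applies to each parameter group's own base rate, so the task-specific layers follow the same shape scaled to `task_learning_rate`.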

In [20]:
train_samples, train_ner = convert_dataset_to_samples(train_data, max_span_length, ner_label2id=ner_label2id, context_window=context_window)
train_batches = batchify(train_samples, train_batch_size)
best_result = 0.0

param_optimizer = list(model.bert_model.named_parameters())
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer
        if 'bert' in n]},
    {'params': [p for n, p in param_optimizer
        if 'bert' not in n], 'lr': task_learning_rate}]
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, correct_bias=not(bertadam))
t_total = len(train_batches) * num_epoch
scheduler = get_linear_schedule_with_warmup(optimizer, int(t_total*warmup_proportion), t_total)

tr_loss = 0
tr_examples = 0
global_step = 0
eval_step = len(train_batches) // eval_per_epoch
for _ in tqdm(range(num_epoch), position=0, leave=True):
    if train_shuffle:
        random.shuffle(train_batches)
    for i in tqdm(range(len(train_batches)), position=0, leave=True):
        output_dict = model.run_batch(train_batches[i], training=True)
        loss = output_dict['ner_loss']
        loss.backward()

        tr_loss += loss.item()
        tr_examples += len(train_batches[i])
        global_step += 1

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        if global_step % print_loss_step == 0:
            logger.info('Epoch=%d, iter=%d, loss=%.5f'%(_, i, tr_loss / tr_examples))
            tr_loss = 0
            tr_examples = 0

        if global_step % eval_step == 0:
            f1 = evaluate(model, dev_batches, dev_ner)
            if f1 > best_result:
                best_result = f1
                logger.info('!!! Best valid (epoch=%d): %.2f' % (_, f1*100))
                save_model(model, output_dir)

02/12/2024 00:36:04 - INFO - root - # Overlap: 0
02/12/2024 00:36:04 - INFO - root - Extracted 68124 samples from 68124 documents, with 420007 NER labels, 37.069 avg input length, 96 max length
02/12/2024 00:36:04 - INFO - root - Max Length: 96, max NER: 38
  0%|          | 99/34062 [00:21<1:37:37,  5.80it/s]02/12/2024 00:36:26 - INFO - root - Epoch=0, iter=99, loss=890.37061
  1%|          | 199/34062 [00:40<1:58:25,  4.77it/s]02/12/2024 00:36:45 - INFO - root - Epoch=0, iter=199, loss=747.48587
  1%|          | 299/34062 [00:58<1:31:50,  6.13it/s]02/12/2024 00:37:03 - INFO - root - Epoch=0, iter=299, loss=183.91385
  1%|          | 399/34062 [01:16<1:38:41,  5.68it/s]02/12/2024 00:37:21 - INFO - root - Epoch=0, iter=399, loss=45.67421
  1%|▏         | 499/34062 [01:34<1:36:59,  5.77it/s]02/12/2024 00:37:39 - INFO - root - Epoch=0, iter=499, loss=43.93870
  2%|▏         | 599/34062 [01:53<1:42:40,  5.43it/s]02/12/2024 00:37:58 - INFO - root - Epoch=0, iter=599, loss=42.83338
  2%|▏   

#### Trained model evaluation

Now let's evaluate our trained model on the test data.

Again, the BERT-based entity model (`EntityModel`) is initialized with specific parameters, including the BERT model name (`allenai/scibert_scivocab_uncased`), the output directory in which the trained model is located (`bert_model_dir`), and the number of NER labels.

In [21]:
bert_model_dir = output_dir
num_ner_labels = len(task_ner_labels[task]) + 1
model = EntityModel(model='allenai/scibert_scivocab_uncased', bert_model_dir=bert_model_dir, use_albert=False, max_span_length=max_span_length, num_ner_labels=num_ner_labels)

02/12/2024 19:36:14 - INFO - root - Loading BERT model from C:\Users\odaim\Documents\PURE reproduction/tacred_models/ent-scib-ctx0//
02/12/2024 19:36:14 - INFO - transformers.tokenization_utils_base - Model name 'C:\Users\odaim\Documents\PURE reproduction/tacred_models/ent-scib-ctx0//' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'C:\Users\odaim\Documents\PURE reproduction/tacred_models/ent-scib-ctx0//' is a path, a model ident

#### Dev Dataset Processing and Evaluation

Just like we did with the pre-trained model, the development dataset (`dev_data`) is loaded, processed, and batchified. The NER predictions are saved to a file using the `output_ner_predictions` function.

In [25]:
dev_data = Dataset(os.path.join(data_dir, 'norm_dev.json'))
prediction_file = os.path.join(output_dir, dev_pred_filename)
    
dev_samples, dev_ner = convert_dataset_to_samples(dev_data, max_span_length, ner_label2id=ner_label2id, context_window=context_window)
dev_batches = batchify(dev_samples, eval_batch_size)

output_ner_predictions(model, dev_batches, dev_data, output_file=prediction_file)

02/12/2024 19:59:41 - INFO - root - # Overlap: 0
02/12/2024 19:59:41 - INFO - root - Extracted 22631 samples from 22631 documents, with 129939 NER labels, 35.463 avg input length, 95 max length
02/12/2024 19:59:41 - INFO - root - Max Length: 95, max NER: 29
02/12/2024 20:23:41 - INFO - root - Total pred entities: 117351
02/12/2024 20:23:41 - INFO - root - Output predictions to C:\Users\odaim\Documents\PURE reproduction/tacred_models/ent-scib-ctx0/ent_pred_dev.json..


#### Test Dataset Processing and Evaluation

The test dataset (`test_data`) is loaded, processed, and batchified in the same way as the development dataset. The model is then evaluated on the test data using the `evaluate` function, and the NER predictions are saved to a file using the `output_ner_predictions` function.

In [26]:
test_data = Dataset(os.path.join(data_dir, 'norm_test.json'))
prediction_file = os.path.join(output_dir, test_pred_filename)
    
test_samples, test_ner = convert_dataset_to_samples(test_data, max_span_length, ner_label2id=ner_label2id, context_window=context_window)
test_batches = batchify(test_samples, eval_batch_size)
evaluate(model, test_batches, test_ner)
output_ner_predictions(model, test_batches, test_data, output_file=prediction_file)

02/12/2024 20:23:47 - INFO - root - # Overlap: 0
02/12/2024 20:23:47 - INFO - root - Extracted 15509 samples from 15509 documents, with 85473 NER labels, 34.755 avg input length, 96 max length
02/12/2024 20:23:47 - INFO - root - Max Length: 96, max NER: 28
02/12/2024 20:23:47 - INFO - root - Evaluating...
02/12/2024 20:39:53 - INFO - root - Accuracy: 0.994220
02/12/2024 20:39:53 - INFO - root - Cor: 67617, Pred TOT: 77218, Gold TOT: 85473
02/12/2024 20:39:53 - INFO - root - P: 0.87566, R: 0.79109, F1: 0.83123
02/12/2024 20:39:53 - INFO - root - Used time: 965.440703
02/12/2024 20:53:29 - INFO - root - Total pred entities: 77218
02/12/2024 20:53:29 - INFO - root - Output predictions to C:\Users\odaim\Documents\PURE reproduction/tacred_models/ent-scib-ctx0/ent_pred_test.json..


### Results

**Accuracy**: 99.42%

**Precision**: 87.57%\
**Recall**: 79.11%\
**F1 Score**: 83.12%

**Implications:**

- The high accuracy should be read with care: it is most likely computed over all candidate spans, most of which are not entities, so on its own it says little about entity recognition on the TACRED test set.
- The precision value means that about 87.6% of the entities the model predicts are correct, a relatively low rate of false positives.
- The recall value means that the model finds about 79.1% of the gold entities in the test set and misses roughly 21% of them.
- The F1 score, the harmonic mean of precision and recall, is 83.12%. Precision is noticeably higher than recall, so the model is somewhat conservative in what it predicts.

In summary, our entity model performs reasonably well on the TACRED dataset, with precision ahead of recall.
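
The precision, recall, and F1 values can be recomputed from the three counts in the evaluation log above (`Cor`, `Pred TOT`, `Gold TOT`). The helper below is a small sketch for checking such numbers by hand; the name `precision_recall_f1` is ours and is not part of the setup notebooks.

```python
def precision_recall_f1(n_correct, n_pred, n_gold):
    """Compute precision, recall, and F1 from raw counts, guarding against empty denominators."""
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Counts from the test-set log: Cor: 67617, Pred TOT: 77218, Gold TOT: 85473
p, r, f1 = precision_recall_f1(n_correct=67617, n_pred=77218, n_gold=85473)
print(f"P: {p:.5f}, R: {r:.5f}, F1: {f1:.5f}")
```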

### The relation model

Now that we've trained our entity model, we're ready to train the relation model.

#### Set up

First we run the relation_setup.ipynb notebook to set up our classes and functions in the kernel.

In [10]:
%run relation_model/relation_setup.ipynb

  from .autonotebook import tqdm as notebook_tqdm


#### Training and evaluating the relation model

Now we train our own relation model on the same dataset, starting from the pretrained SciBERT weights, and then evaluate it using the test data.

First we set up some variables.

In [11]:
model_name = 'allenai/scibert_scivocab_uncased'
add_new_tokens = False
no_cuda = False
do_train = True
do_eval = True
eval_test = True
do_lower_case = True
entity_output_dir = os.getcwd() + '/tacred_models/ent-scib-ctx0/'
entity_predictions_dev = 'ent_pred_dev.json'
eval_with_gold = True
context_window = 0
max_seq_length = 128
entity_predictions_test = 'ent_pred_test.json'
seed = 0
output_dir = os.getcwd() + '/tacred_models/rel-scib-ctx0/'
negative_label = 'no_relation'
task = 'tacred'
train_mode = 'random_sorted'
train_batch_size = 8
eval_batch_size = 8
num_train_epochs = 2
train_file = normal_train_data_path
eval_per_epoch = 10
learning_rate = 2e-5
prediction_file = 'predictions.json'
BertLayerNorm = torch.nn.LayerNorm
bertadam = True
warmup_proportion = 0.1
eval_metric = 'f1'
task_rel_labels = {
    'ace04': ['PER-SOC', 'OTHER-AFF', 'ART', 'GPE-AFF', 'EMP-ORG', 'PHYS'],
    'ace05': ['ART', 'ORG-AFF', 'GEN-AFF', 'PHYS', 'PER-SOC', 'PART-WHOLE'],
    'scierc': ['PART-OF', 'USED-FOR', 'FEATURE-OF', 'CONJUNCTION', 'EVALUATE-FOR', 'HYPONYM-OF', 'COMPARE'],
    'tacred': ['org:subsidiaries', 'org:political/religious_affiliation', 'per:cause_of_death', 'per:employee_of', 'org:number_of_employees/members', 'org:dissolved', 'per:city_of_birth', 'org:founded_by', 'org:alternate_names', 'org:members', 'per:stateorprovince_of_birth', 'org:founded', 'org:website', 'org:member_of', 'per:stateorprovinces_of_residence', 'per:siblings', 'per:other_family', 'per:title', 'org:city_of_headquarters', 'per:religion', 'per:charges', 'per:countries_of_residence', 'org:country_of_headquarters', 'per:stateorprovince_of_death', 'per:origin', 'per:schools_attended', 'per:spouse', 'no_relation', 'per:city_of_death', 'per:children', 'per:date_of_death', 'per:date_of_birth', 'org:shareholders', 'per:alternate_names', 'org:stateorprovince_of_headquarters', 'org:parents', 'per:age', 'per:cities_of_residence', 'per:parents', 'org:top_members/employees', 'per:country_of_birth', 'per:country_of_death']
}
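
One detail worth checking before training: the `'tacred'` list above already contains `'no_relation'`, and the training cell below builds `label_list` as `[negative_label] + task_rel_labels[task]`, so the negative label appears twice. The sketch below uses a shortened stand-in list to show how the dictionary comprehensions resolve that duplicate. Whether this affects the reported metrics depends on how the helpers in relation_setup.ipynb treat index 0, which this notebook does not show.

```python
negative_label = "no_relation"
# Shortened stand-in for task_rel_labels['tacred'], which also contains 'no_relation'.
relation_labels = ["per:title", "no_relation", "org:founded_by"]

label_list = [negative_label] + relation_labels
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

print(len(label_list), len(label2id))   # the list has one more entry than the dict
print(label2id[negative_label])         # index of the later duplicate, not 0

# One way to avoid the duplicate (a suggestion, not what this notebook ran):
deduped = [negative_label] + [label for label in relation_labels if label != negative_label]
print(deduped)
```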

Now we train the model and then evaluate it.

In [None]:
CLS = "[CLS]"
SEP = "[SEP]"

RelationModel = BertForRelation

device = torch.device("cuda" if torch.cuda.is_available() and not no_cuda else "cpu")
n_gpu = torch.cuda.device_count()

# train set
if do_train:
    train_dataset, train_examples, train_nrel = generate_relation_data(train_file, use_gold=True, context_window=context_window)
# dev set
if (do_eval and do_train) or (do_eval and not(eval_test)):
    eval_dataset, eval_examples, eval_nrel = generate_relation_data(os.path.join(entity_output_dir, entity_predictions_dev), use_gold=eval_with_gold, context_window=context_window)
# test set
if eval_test:
    test_dataset, test_examples, test_nrel = generate_relation_data(os.path.join(entity_output_dir, entity_predictions_test), use_gold=eval_with_gold, context_window=context_window)
    
setseed(seed)

if not do_train and not do_eval:
    raise ValueError("At least one of `do_train` or `do_eval` must be True.")

if not os.path.exists(output_dir):
    os.makedirs(output_dir)
if do_train:
    logger.addHandler(logging.FileHandler(os.path.join(output_dir, "train.log"), 'w'))
else:
    logger.addHandler(logging.FileHandler(os.path.join(output_dir, "eval.log"), 'w'))
    
# get label_list
if os.path.exists(os.path.join(output_dir, 'label_list.json')):
    with open(os.path.join(output_dir, 'label_list.json'), 'r') as f:
        label_list = json.load(f)
else:
    label_list = [negative_label] + task_rel_labels[task]
    with open(os.path.join(output_dir, 'label_list.json'), 'w') as f:
        json.dump(label_list, f)
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}
num_labels = len(label_list)

tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=do_lower_case)
if add_new_tokens:
    add_marker_tokens(tokenizer, task_ner_labels[task])

if os.path.exists(os.path.join(output_dir, 'special_tokens.json')):
    with open(os.path.join(output_dir, 'special_tokens.json'), 'r') as f:
        special_tokens = json.load(f)
else:
    special_tokens = {}
    
if do_eval and (do_train or not(eval_test)):
    eval_features = convert_examples_to_features(
        eval_examples, label2id, max_seq_length, tokenizer, special_tokens, unused_tokens=not(add_new_tokens))
    logger.info("***** Dev *****")
    logger.info("  Num examples = %d", len(eval_examples))
    logger.info("  Batch size = %d", eval_batch_size)
    all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
    all_sub_idx = torch.tensor([f.sub_idx for f in eval_features], dtype=torch.long)
    all_obj_idx = torch.tensor([f.obj_idx for f in eval_features], dtype=torch.long)
    eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, all_sub_idx, all_obj_idx)
    eval_dataloader = DataLoader(eval_data, batch_size=eval_batch_size)
    eval_label_ids = all_label_ids

    
if do_train:
    train_features = convert_examples_to_features(
        train_examples, label2id, max_seq_length, tokenizer, special_tokens, unused_tokens=not(add_new_tokens))
    if train_mode == 'sorted' or train_mode == 'random_sorted':
        train_features = sorted(train_features, key=lambda f: np.sum(f.input_mask))
    else:
        random.shuffle(train_features)
    all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
    all_sub_idx = torch.tensor([f.sub_idx for f in train_features], dtype=torch.long)
    all_obj_idx = torch.tensor([f.obj_idx for f in train_features], dtype=torch.long)
    train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, all_sub_idx, all_obj_idx)
    train_dataloader = DataLoader(train_data, batch_size=train_batch_size)
    train_batches = [batch for batch in train_dataloader]

    num_train_optimization_steps = len(train_dataloader) * num_train_epochs

    logger.info("***** Training *****")
    logger.info("  Num examples = %d", len(train_examples))
    logger.info("  Batch size = %d", train_batch_size)
    logger.info("  Num steps = %d", num_train_optimization_steps)

    best_result = None
    eval_step = max(1, len(train_batches) // eval_per_epoch)

    lr = learning_rate
    model = RelationModel.from_pretrained(
        'allenai/scibert_scivocab_uncased', cache_dir=str(PYTORCH_PRETRAINED_BERT_CACHE), num_rel_labels=num_labels)
    if hasattr(model, 'bert'):
        model.bert.resize_token_embeddings(len(tokenizer))
    elif hasattr(model, 'albert'):
        model.albert.resize_token_embeddings(len(tokenizer))
    else:
        raise TypeError("Unknown model class")

    model.to(device)
    if n_gpu > 1:
        model = torch.nn.DataParallel(model)

    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer
                    if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer
                    if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=lr, correct_bias=not(bertadam))
    scheduler = get_linear_schedule_with_warmup(optimizer, int(num_train_optimization_steps * warmup_proportion), num_train_optimization_steps)

    start_time = time.time()
    global_step = 0
    tr_loss = 0
    nb_tr_examples = 0
    nb_tr_steps = 0
    for epoch in range(int(num_train_epochs)):
        model.train()
        logger.info("Start epoch #{} (lr = {})...".format(epoch, lr))
        if train_mode == 'random' or train_mode == 'random_sorted':
            random.shuffle(train_batches)
        for step, batch in enumerate(train_batches):
            batch = tuple(t.to(device) for t in batch)
            input_ids, input_mask, segment_ids, label_ids, sub_idx, obj_idx = batch
            loss = model(input_ids, segment_ids, input_mask, label_ids, sub_idx, obj_idx)
            if n_gpu > 1:
                loss = loss.mean()

            loss.backward()

            tr_loss += loss.item()
            nb_tr_examples += input_ids.size(0)
            nb_tr_steps += 1

            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            global_step += 1

            if (step + 1) % eval_step == 0:
                logger.info('Epoch: {}, Step: {} / {}, used_time = {:.2f}s, loss = {:.6f}'.format(
                            epoch, step + 1, len(train_batches),
                            time.time() - start_time, tr_loss / nb_tr_steps))
                save_model = False
                if do_eval:
                    preds, result, logits = evaluate(model, device, eval_dataloader, eval_label_ids, num_labels, e2e_ngold=eval_nrel)
                    model.train()
                    result['global_step'] = global_step
                    result['epoch'] = epoch
                    result['learning_rate'] = lr
                    result['batch_size'] = train_batch_size

                    if (best_result is None) or (result[eval_metric] > best_result[eval_metric]):
                        best_result = result
          
    if eval_test: 
        eval_dataset = test_dataset
        eval_examples = test_examples
        eval_features = convert_examples_to_features(
            test_examples, label2id, max_seq_length, tokenizer, special_tokens, unused_tokens=not(add_new_tokens))
        eval_nrel = test_nrel
        logger.info(special_tokens)
        logger.info("***** Test *****")
        logger.info("  Num examples = %d", len(test_examples))
        logger.info("  Batch size = %d", eval_batch_size)
        all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
        all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
        all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
        all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
        all_sub_idx = torch.tensor([f.sub_idx for f in eval_features], dtype=torch.long)
        all_obj_idx = torch.tensor([f.obj_idx for f in eval_features], dtype=torch.long)
        eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, all_sub_idx, all_obj_idx)
        eval_dataloader = DataLoader(eval_data, batch_size=eval_batch_size)
        eval_label_ids = all_label_ids
    model = RelationModel.from_pretrained(output_dir, num_rel_labels=num_labels)
    model.to(device)
    preds, result, logits = evaluate(model, device, eval_dataloader, eval_label_ids, num_labels, e2e_ngold=eval_nrel)

    logger.info('*** Evaluation Results ***')
    for key in sorted(result.keys()):
        logger.info("  %s = %s", key, str(result[key]))

    print_pred_json(eval_dataset, eval_examples, preds, id2label, os.path.join(output_dir, prediction_file))

02/19/2024 17:08:24 - INFO - run_relation - Generate relation data from C:\Users\odaim\Documents\PURE reproduction/other_data/tacred/data/json/norm_train.json
02/19/2024 17:08:50 - INFO - run_relation - #samples: 3045890, max #sent.samples: 1406
02/19/2024 17:08:50 - INFO - run_relation - Generate relation data from C:\Users\odaim\Documents\PURE reproduction/tacred_models/ent-scib-ctx0/ent_pred_dev.json
02/19/2024 17:08:59 - INFO - run_relation - #samples: 777644, max #sent.samples: 812
02/19/2024 17:08:59 - INFO - run_relation - Generate relation data from C:\Users\odaim\Documents\PURE reproduction/tacred_models/ent-scib-ctx0/ent_pred_test.json
02/19/2024 17:09:05 - INFO - run_relation - #samples: 505502, max #sent.samples: 756
02/19/2024 17:09:06 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/allenai/scibert_scivocab_uncased/config.json from cache at C:\Users\odaim/.cache\torch\transformers\199e28e62d2210c23d

### Results

**Accuracy**: 0.9630

**Evaluation Loss**: 0.0950

**Precision**: 0.8455\
**Recall**: 0.6963\
**F1 Score**: 0.7637

**Implications:**

- The high accuracy indicates that most candidate pairs in the TACRED dataset are classified correctly, but since the large majority of pairs are labeled no_relation, accuracy alone overstates performance on actual relations.
- The F1 score suggests a reasonable balance between precision and recall, with room for improvement.
- The precision value indicates that when the model predicts a relation, it is correct about 84.6% of the time.
- The recall value shows the most room for improvement: a recall of 69.6% means that the model misses around 30.4% of the actual relations.