# V2 NER Fine Tuned Model Inference Test
In this notebook I evaluate the V2 NER model I fine-tuned, using the test set. The difference between V2 and V1 is that V1 uses the base BERT tokenizer, while V2 uses a custom tokenizer that was trained on both the wiki and conll2003 sets.

This is the same inference_test.ipynb notebook used for V1, adjusted for the new tokenizer of V2.

### V1 Results:
{'precision': 0.629440682682497,    
 'recall': 0.6295574728639021,    
 'f1': 0.6294990723562153,    
 'accuracy': 0.9027749326357921}  

### V2 Results:
 {'precision': 0.629533438456544,  
 'recall': 0.6296502458484089,  
 'f1': 0.6295918367346939,  
 'accuracy': 0.9027749326357921}    
 


The new model with the custom tokenizer achieved nearly identical results: precision, recall, and F1 each differ from V1 by roughly 0.0001, and accuracy is unchanged. This is good.
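To make the comparison concrete, the two result dictionaries above can be compared key by key. The snippet below is a small sketch that copies the reported numbers by hand; the names `v1`, `v2`, and `deltas` are introduced here for illustration only.

```python
# Metrics copied by hand from the V1 and V2 results reported above.
v1 = {"precision": 0.629440682682497, "recall": 0.6295574728639021,
      "f1": 0.6294990723562153, "accuracy": 0.9027749326357921}
v2 = {"precision": 0.629533438456544, "recall": 0.6296502458484089,
      "f1": 0.6295918367346939, "accuracy": 0.9027749326357921}

# Change per metric (V2 minus V1).
deltas = {name: v2[name] - v1[name] for name in v1}

for name, delta in deltas.items():
    print(f"{name:>9}: {delta:+.6f}")
```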

## Load Fine Tuned Model and Custom Tokenizer

In [9]:
from transformers import AutoModelForTokenClassification, BertTokenizer
import os

base_path = r"C:\Users\hunte\OneDrive\Documents\Coding Projects\Bot_Discord_Proj\NER_model\V2_NER_model"
model_path, tokenizer_path = [os.path.join(base_path, m) for m in ["model", "tokenizer"]]


#there are 9 labels for the token classification task because there are 9 ner-tags
model = AutoModelForTokenClassification.from_pretrained(model_path, num_labels=9) 
tokenizer = BertTokenizer.from_pretrained(tokenizer_path, do_lower_case=True) #custom tokenizer
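
The nine classes correspond to the conll2003 NER tags that appear later in `compute_metrics`. As a sketch, the index-to-name mapping can be written out explicitly. `id2label` and `label2id` are names chosen here for illustration; whether the saved model config already stores such a mapping is not something this notebook shows.

```python
# Label names in the same order used by compute_metrics in this notebook.
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
              'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Map class index -> label name and back (illustrative names).
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

print(len(label_list))    # number of classes, matching num_labels
print(id2label[3], label2id['B-ORG'])
```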

## Load Pre-Tokenized Dataset From Disk

In [10]:
import datasets

base_path = "C:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Bot_Discord_Proj\\NER_model\\NER_model\\NER_standard_model"

test_set_path = os.path.join(base_path, "test_set_tokenized")

#load from disk the datasets Dataset object
test_set = datasets.load_from_disk(test_set_path)

## Prepare Data for Inference

Here we run a forward pass: the input_ids and attention_mask are extracted from the dataset and converted to tensors, which are then passed into the model to produce the logits.

In [11]:
test_set

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 3453
})

Below I pull the attention mask and input ids out of the test set; attention_mask and input_ids are lists of tensors. I then convert them to a list of dicts that can be passed into the data_collator. Once collated, the batch is wrapped in a torch TensorDataset and then in a DataLoader for inference on the entire test set.

In [12]:
from transformers import DataCollatorForTokenClassification
from torch.utils.data import DataLoader, TensorDataset
import torch

attention_mask, input_ids = test_set['attention_mask'], test_set['input_ids'] #list of tensors

examples = [{"attention_mask": att_msk, "input_ids": ids} for att_msk, ids in zip(attention_mask, input_ids)] #convert them to lists of dicts 

data_collator = DataCollatorForTokenClassification(tokenizer) #create data collator with tokenizer
batch = data_collator(examples) #collate the examples into a batch; this unifies the tensor sizes,
#which is necessary to wrap them in a TensorDataset and dataloader
test_dataset = TensorDataset(batch['input_ids'], batch['attention_mask']) #create dataset from batch
test_dataloader = DataLoader(test_dataset, batch_size=8) #create dataloader from dataset
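
For intuition, the padding step the collator performs can be sketched in plain Python. This illustrates the idea only and is not the library's implementation: `pad_batch` is a hypothetical helper, and the pad id of 0 and right-side padding are assumptions (the real collator takes both from the tokenizer).

```python
def pad_batch(examples, pad_id=0):
    """Right-pad every example to the length of the longest one."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    input_ids, attention_mask = [], []
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        input_ids.append(list(ex["input_ids"]) + [pad_id] * n_pad)
        # Padding positions get mask 0 so the model ignores them.
        attention_mask.append(list(ex["attention_mask"]) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up examples of different lengths.
toy = [
    {"input_ids": [2, 15, 27, 3], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [2, 8, 3], "attention_mask": [1, 1, 1]},
]
padded = pad_batch(toy)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the whole test set is collated in a single call, every row is padded to the longest sequence in the set, which is why the logits further down have a fixed second dimension.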

This function runs through the dataloader, collects the logits of each batch in a list, and returns them concatenated into a single tensor.

In [13]:
def compute_logits():
    logit_preds = [] #list to store logit predictions

    for input_ids, attention_mask in test_dataloader: 
        with torch.no_grad(): 
            output = model(input_ids=input_ids, attention_mask=attention_mask) #forward pass
        logit_preds.append(output.logits) #append logits to list

    return torch.concat(logit_preds, dim=0) #concatenate the list of tensors into a single tensor

In [14]:
logits = compute_logits() #get logits

In [15]:
logits.shape #shape of logits

torch.Size([3453, 148, 9])
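
The three dimensions are (examples, padded sequence length, labels): 3453 test sentences, 148 token positions, and 9 scores per position. Turning scores into a predicted label id is an argmax over the last dimension. The toy example below illustrates this with made-up numbers and only 3 labels; `toy_logits` and `argmax` are names introduced here.

```python
def argmax(scores):
    """Index of the largest score in a flat list."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Made-up scores: 2 sentences x 2 token positions x 3 labels.
toy_logits = [
    [[2.1, 0.3, -1.0], [0.2, 1.7, 0.4]],
    [[-0.5, 0.1, 3.2], [1.1, 0.9, 0.8]],
]

# One predicted label id per token position.
pred_ids = [[argmax(tok) for tok in sent] for sent in toy_logits]
print(pred_ids)
```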

## Compute Metric Function
This function was used during training and is now used to evaluate the model.

In [16]:
import numpy as np
import seqeval

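#load in seqeval metric; load_metric is deprecated and removed in newer datasets releases, where evaluate.load("seqeval") replaces it
metric = datasets.load_metric("seqeval")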

def compute_metrics(p): 
    '''
        This function unpacks the predictions and labels from p. It then applies argmax to the prediction logits, which converts
        them to indices into label_list. true_predictions is a list comprehension that maps those indices to their label names,
        and true_labels is built the same way from the target label indices of each example. Positions labelled -100 are skipped
        in both. Finally, true_predictions and true_labels are evaluated for precision, recall, and f1 using the seqeval package.
    '''
    #NER labels specific to the conll2003 task
    label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'] 
        
    #unpack predictions
    predictions, labels = p 
        
    #get prediction indices for use in label_list by argmaxing logits
    predictions = np.argmax(predictions, axis=2) 
    
    #prediction indices ---> labels
    true_predictions = [ 
        [label_list[pred] for (pred, lab) in zip(prediction, label) if lab != -100] for prediction, label in zip(predictions, labels) 
    ] 
    
    #ground truth indices ---> labels
    true_labels = [ 
        [label_list[lab] for (pred, lab) in zip(prediction, label) if lab != -100] for prediction, label in zip(predictions, labels) 
    ] 
    
    #get score
    results = metric.compute(predictions=true_predictions, references=true_labels) 
    
    return { 
        "precision": results["overall_precision"], 
        "recall": results["overall_recall"], 
        "f1": results["overall_f1"], 
        "accuracy": results["overall_accuracy"], 
    } 
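
The filtering step described in the docstring can be seen on a toy example. Positions labelled `-100` (by the usual Hugging Face convention, special tokens and often sub-word continuations) are dropped from both lists before scoring. The values below are made up and use a shortened label list.

```python
# Shortened label list and made-up indices for one sentence.
label_list = ['O', 'B-PER', 'I-PER']
predictions = [[0, 1, 2, 0, 0]]
labels = [[-100, 1, 2, -100, 0]]

# Keep only positions whose ground-truth label is not -100.
true_predictions = [
    [label_list[p] for p, l in zip(pred, lab) if l != -100]
    for pred, lab in zip(predictions, labels)
]
true_labels = [
    [label_list[l] for p, l in zip(pred, lab) if l != -100]
    for pred, lab in zip(predictions, labels)
]
print(true_predictions)
print(true_labels)
```

Note that seqeval scores precision, recall, and F1 on whole entities, while accuracy is counted per token, so an accuracy well above F1 is expected for this task.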



In [17]:
logits_np = logits.numpy() #convert logits to numpy array

p = (logits_np, test_set['labels']) #create tuple of predictions and labels
result = compute_metrics(p) #compute metrics

result #output results

{'precision': 0.629533438456544,
 'recall': 0.6296502458484089,
 'f1': 0.6295918367346939,
 'accuracy': 0.9027749326357921}