# Inference Test on Unseen Data
I have just finished fine-tuning this model on SageMaker. In this notebook I test the model on data it has not seen yet.

The accuracy result on the unseen data was: 97%

## Load in fine-tuned model

In [2]:
from transformers import DistilBertConfig, DistilBertForSequenceClassification

num_labels=2

config = DistilBertConfig.from_pretrained('/home/ec2-user/SageMaker/RAC_training_sagemaker/model', num_labels=num_labels)
model = DistilBertForSequenceClassification.from_pretrained('/home/ec2-user/SageMaker/RAC_training_sagemaker/model', config=config)

## Eval of Model
I am going to perform a quick evaluation of the model here. The final eval accuracy of the model on the Wikipedia toxic comments test dataset during training was 0.96832. I had mistakenly used the test set for evaluation during training, so here I will use the validation set, which the model has not seen, to test it.

Below I load in the tokenizer I used on the dataset for training.

In [3]:
# Only the tokenizer class is needed here; the model was loaded above
from transformers import DistilBertTokenizerFast

model_path = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

In [4]:
import pandas as pd

val_set = pd.read_csv('/home/ec2-user/SageMaker/RAC_training_sagemaker/cursory_data_prep/processed_val.csv')

val_set.head()

Unnamed: 0,comment_text,toxic
0,"""\n\nI know you are (an independent). That was...",0
1,Or a big shot book critic?,0
2,Oh - and the template's tone is not appropriat...,0
3,For ex. If Warriors envade the Philippine isla...,0
4,"""\n\nand who is that """"someone"""" that you are ...",0


The functions below tokenize and encode the text into the format DistilBERT expects as input: fixed-length `input_ids` and `attention_mask` lists, padded or truncated to 128 tokens.

In [5]:
import pandas as pd
import torch
from transformers import DistilBertTokenizerFast

def encode(comment, label):
    encoded = tokenizer(comment, truncation=True, padding='max_length', max_length=128, return_tensors="pt")
    
    # return_tensors="pt" adds a batch dimension; take row 0 and convert it to a list of Python ints
    attention_mask = encoded['attention_mask'][0].tolist()
    input_ids = encoded['input_ids'][0].tolist()
    label = label.item() if isinstance(label, torch.Tensor) else label
    
    # Return data in a dictionary format
    return {
        'attention_mask': attention_mask,
        'input_ids': input_ids,
        'label': label,
        'text': comment
    }

def transform_to_dataframe(df):
    # Apply the encode function to each row of the dataframe
    encoded_data = df.apply(lambda row: encode(row['comment_text'], row['toxic']), axis=1)
    
    # Convert the Series of encoded rows to a list of dictionaries
    list_of_dicts = encoded_data.tolist()
    
    # Convert list of dictionaries to dataframe
    return pd.DataFrame(list_of_dicts)
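
To make the `padding='max_length'` and `truncation=True` settings concrete, here is a minimal pure-Python sketch of what they do to a list of token ids. The helper name `pad_or_truncate`, the default pad id of 0, and the short id list are assumptions for illustration. The real tokenizer also keeps the closing special token in place when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Illustrative helper only; it is not part of the transformers API.
    """
    kept = list(token_ids[:max_length])   # truncation: drop ids past max_length
    n_pad = max_length - len(kept)        # number of pad positions still needed
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# A made-up short sequence, padded to length 8 to keep the output readable
ids, mask = pad_or_truncate([101, 2030, 1037, 102], max_length=8)
print(ids)   # [101, 2030, 1037, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

This is the same pattern visible in the encoded output: short comments end in a run of zeros in `attention_mask`, while long comments are all ones because they fill (or were cut to) the full 128 positions.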

In [6]:
import os

os.getcwd()

'/home/ec2-user/SageMaker/RAC_training_sagemaker'

In [7]:
encoded_val_set = transform_to_dataframe(val_set)
encoded_val_set.head()

Unnamed: 0,attention_mask,input_ids,label,text
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[101, 1000, 1045, 2113, 2017, 2024, 1006, 2019...",0,"""\n\nI know you are (an independent). That was..."
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, ...","[101, 2030, 1037, 2502, 2915, 2338, 6232, 1029...",0,Or a big shot book critic?
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[101, 2821, 1011, 1998, 1996, 23561, 1005, 105...",0,Oh - and the template's tone is not appropriat...
3,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[101, 2005, 4654, 1012, 2065, 6424, 4372, 3567...",0,For ex. If Warriors envade the Philippine isla...
4,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[101, 1000, 1998, 2040, 2003, 2008, 1000, 1000...",0,"""\n\nand who is that """"someone"""" that you are ..."


## Inference
Below I pass the encoded data to the model for inference after converting it to torch tensors.

In [15]:
import torch

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device}")

# Convert 'input_ids' and 'attention_mask' columns to tensors
input_ids = torch.tensor(encoded_val_set['input_ids'].tolist())
attention_mask = torch.tensor(encoded_val_set['attention_mask'].tolist())

# Move the model to the selected device and switch to evaluation mode
model.to(device)
model.eval()

batch_size = 128  # running the whole set as one batch ran out of memory on a single GPU
num_batches = len(input_ids) // batch_size + (len(input_ids) % batch_size != 0)
print(f'There are {num_batches} batches')

all_predicted_labels = []

for i in range(num_batches):
    if i % 25 == 0:
        print(f'batch {i} completed')
    batch_input_ids = input_ids[i*batch_size:(i+1)*batch_size].to(device)
    batch_attention_mask = attention_mask[i*batch_size:(i+1)*batch_size].to(device)

    # Perform inference
    with torch.no_grad():
        # The first entry of the model output holds the logits,
        # with shape (batch_size, num_labels).
        outputs = model(input_ids=batch_input_ids, attention_mask=batch_attention_mask)
        logits = outputs[0]
        # Use the argmax function to get the predicted labels for the current batch
        predicted_labels = torch.argmax(logits, dim=1).cpu()
        all_predicted_labels.append(predicted_labels)

# Concatenate all the predicted labels
all_predicted_labels = torch.cat(all_predicted_labels)


Using cuda
There are 192 batches
batch 0 completed
batch 25 completed
batch 50 completed
batch 75 completed
batch 100 completed
batch 125 completed
batch 150 completed
batch 175 completed
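
The batch count printed above is a ceiling division: one batch per full group of 128 rows, plus a final shorter batch if any rows are left over. The sketch below shows that arithmetic and the slice bounds the loop walks through. `batch_bounds` is a hypothetical helper, and the row count of 300 is made up for illustration.

```python
import math


def batch_bounds(n_rows, batch_size):
    """Yield (start, end) slice bounds that cover n_rows in order."""
    # Same value as n_rows // batch_size + (n_rows % batch_size != 0)
    num_batches = math.ceil(n_rows / batch_size)
    for i in range(num_batches):
        start = i * batch_size
        end = min(start + batch_size, n_rows)   # the last batch may be shorter
        yield start, end


print(list(batch_bounds(300, 128)))  # [(0, 128), (128, 256), (256, 300)]
```

By the same arithmetic, 192 batches of 128 means the validation set has between 24,449 and 24,576 rows.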


Below is a simple accuracy function that compares the predicted labels with the true labels.

In [18]:
def check_accuracy(predicted_labels, val_set):
    correct = 0
    true_labels = val_set['label'].tolist()  # Assuming 'label' is the column name with true labels
    
    for i in range(len(true_labels)):
        if predicted_labels[i].item() == true_labels[i]:
            correct += 1
            
    return correct/len(true_labels)

acc = check_accuracy(all_predicted_labels, encoded_val_set)
print(acc)


0.9700459891742298


Nice, 97% accuracy.

Accuracy alone can look good when one class dominates, so let's also measure the precision and recall for the toxic class (label 1).

In [19]:
def precision_recall(predicted_labels, val_set):
    TP = 0  # True Positives
    FP = 0  # False Positives
    FN = 0  # False Negatives
    true_labels = val_set['label'].tolist()  # Assuming 'label' is the column name with true labels
    
    for i in range(len(true_labels)):
        if predicted_labels[i].item() == true_labels[i] == 1:  # Assuming positive class is labeled as '1'
            TP += 1
        elif predicted_labels[i].item() == 1 and true_labels[i] == 0:
            FP += 1
        elif predicted_labels[i].item() == 0 and true_labels[i] == 1:
            FN += 1
    
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0

    return precision, recall

prec, rec = precision_recall(all_predicted_labels, encoded_val_set)
print(f"Precision: {prec:.4f}, Recall: {rec:.4f}")


Precision: 0.8825, Recall: 0.8058
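
Precision and recall are often combined into a single F1 score, their harmonic mean. The sketch below works that out from the rounded figures printed in this notebook, so the results are approximate. It also backs out the share of toxic comments that those figures imply. Both function names are local helpers written for this example, not library imports.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_positive_rate(accuracy, precision, recall):
    """Share of positive examples implied by the three metrics.

    With P positives among N examples:
      TP = recall * P
      FP = TP * (1 - precision) / precision
      FN = (1 - recall) * P
      (FP + FN) / N = 1 - accuracy
    Solving for P / N gives the expression below.
    """
    errors_per_positive = recall * (1 - precision) / precision + (1 - recall)
    return (1 - accuracy) / errors_per_positive


# Rounded values reported in this notebook
f1 = f1_score(0.8825, 0.8058)
rate = implied_positive_rate(0.9700, 0.8825, 0.8058)
print(f"F1: {f1:.2f}, implied positive rate: {rate:.2f}")
# F1: 0.84, implied positive rate: 0.10
```

If the rounded figures are representative, roughly one comment in ten is toxic, so a model that always predicts "not toxic" would already score about 90% accuracy. That is why precision and recall are the more informative numbers here.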
