# Performance Evaluation

In this notebook, I evaluate the models on data that they have never seen before and that comes from a different distribution than the training, validation, and testing data. This will give me insight into how well the top models generalize.

In the MVP, I used a basic custom Transformer model. In the newest version, I used a DistilBERT model. After sending both versions to people, one piece of feedback I got was that the Transformer model was better. This is a surprising finding, since the DistilBERT model outperformed the Transformer on all of my metrics. Thus, I am going to take this completely different dataset and see how both models perform.

In [1]:
# Getting the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm
from sklearn.metrics import roc_auc_score, roc_curve
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import torchtext
from torchtext.data.utils import get_tokenizer
from nltk.stem import SnowballStemmer
import re
import tensorflow as tf
import math

tqdm.pandas()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
%matplotlib inline

2024-02-05 23:22:43.342668: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-05 23:22:43.342785: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-05 23:22:43.471076: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [2]:
# Getting the data
test_data = pd.read_csv('../input/training-llm-competition/hcV3-imagined-stories-with-generated.csv')
test_data.head()

Unnamed: 0,AssignmentId,story,summary,timeSinceEvent,generated_story
0,32RIADZISTQWI5XIVG5BN0VMYFRS4U,"Concerts are my most favorite thing, and my bo...",My boyfriend and I went to a concert together ...,90,I had been eagerly anticipating this concert f...
1,3IRIK4HM3B6UQBC0HI8Q5TBJZLEC61,It seems just like yesterday but today makes f...,My sister gave birth to my twin niece and neph...,150,I can hardly contain my excitement as I make m...
2,3MTMREQS4W44RBU8OMP3XSK8NMJAWZ,About a month ago I went to burning man. I was...,It is always a journey for me to go to burning...,30,It’s been a whirlwind since I returned from Bu...
3,36WLNQG780WFTLD990VT6XXEYVQEBZ,"Play stupid games, win stupid prizes road trip...",What happened is that I was on a trip with my ...,90,I can't believe it's been three months since t...
4,32Z9ZLUT1M6BWPTK368LXKUQWLLOHY,I wanted to write about one of the best days i...,Me and my girlfriend went to the zoo on a hot ...,30,"Today was such a hot day, but that didn't stop..."


In [3]:
# Getting the generated and non-generated ones
non_generated = pd.DataFrame()
non_generated['story'] = test_data['story']
non_generated['label'] = 0.0

generated = pd.DataFrame()
generated['story'] = test_data['generated_story']
generated['label'] = 1.0

testing_data = pd.concat([non_generated,generated])
testing_data

Unnamed: 0,story,label
0,"Concerts are my most favorite thing, and my bo...",0.0
1,It seems just like yesterday but today makes f...,0.0
2,About a month ago I went to burning man. I was...,0.0
3,"Play stupid games, win stupid prizes road trip...",0.0
4,I wanted to write about one of the best days i...,0.0
...,...,...
2751,I can't believe it. Today was my oldest daught...,1.0
2752,It's hard to believe that it has been 150 days...,1.0
2753,It's been five long years since my mother pass...,1.0
2754,I can hardly believe that four months have alr...,1.0


In [4]:
# Getting the training and validation datasets
train_data = pd.read_csv('../input/training-llm-competition/train.csv')
valid_data = pd.read_csv('../input/training-llm-competition/validation.csv')

## DistilBERT

In [5]:
# Getting the model
tokenizer = AutoTokenizer.from_pretrained("Skittles2821/distilbert-detector")
model = AutoModelForSequenceClassification.from_pretrained("Skittles2821/distilbert-detector")
model.to(device)

tokenizer_config.json:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/676 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [6]:
# Preprocessing
def preprocess(essay:str):
    preprocessed_essay = essay.lower()
    
    # Subbing out \n and \t
    preprocessed_essay = re.sub("\n","",preprocessed_essay)
    preprocessed_essay = re.sub("\t","",preprocessed_essay)

    # Replacing \xa0 (non-breaking space in Latin-1) with a regular space
    preprocessed_essay = preprocessed_essay.replace(u'\xa0', u' ')
    
    return preprocessed_essay

In [7]:
processed_essays_train = train_data['essay'].copy().progress_apply(preprocess)
processed_essays_valid = valid_data['essay'].copy().progress_apply(preprocess)
processed_essays_test = testing_data['story'].copy().progress_apply(preprocess)

100%|██████████| 44733/44733 [00:00<00:00, 58337.45it/s]
100%|██████████| 5195/5195 [00:00<00:00, 14464.72it/s]
100%|██████████| 5512/5512 [00:00<00:00, 91023.64it/s]


In [8]:
# Defining a function for inference on a single essay
def inference(essay:str) -> float:
    # Tokenizing the input essay (padded/truncated to 512 tokens)
    inputs = tokenizer(essay,padding='max_length',truncation=True,max_length=512,return_tensors='pt').to(device)
    
    # Getting the logit and converting it to a probability
    with torch.no_grad():
        logits = model(**inputs).logits
        probability = torch.sigmoid(logits)
    return probability.item()

In [9]:
# Getting the predictions
predictions_train = processed_essays_train.progress_apply(inference)
predictions_valid = processed_essays_valid.progress_apply(inference)
predictions_test = processed_essays_test.progress_apply(inference)

100%|██████████| 44733/44733 [13:42<00:00, 54.36it/s]
100%|██████████| 5195/5195 [01:45<00:00, 49.33it/s]
100%|██████████| 5512/5512 [01:42<00:00, 53.77it/s]


In [10]:
# Reporting the ROC AUC scores
print('DistilBERT Score')
print(f'Training ROC AUC: {roc_auc_score(train_data["LLM_written"],predictions_train)}')
print(f'Validation ROC AUC: {roc_auc_score(valid_data["LLM_written"],predictions_valid)}')
print(f'Testing ROC AUC: {roc_auc_score(testing_data["label"],predictions_test)}')

DistilBERT Score
Training ROC AUC: 0.9999537707382391
Validation ROC AUC: 0.9710747774141688
Testing ROC AUC: 0.9778706861503915
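
As a reading aid for the scores above: ROC AUC can be read as the probability that a randomly chosen LLM-generated text receives a higher score than a randomly chosen human-written one, so 0.5 corresponds to chance and 1.0 to a perfect ranking. The sketch below is a dependency-free illustration of that definition on made-up labels and scores. `pairwise_auc` is a hypothetical helper, not part of the notebook; `roc_auc_score` remains what the cells here actually use.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Toy data (made up for illustration): 1 = LLM generated, 0 = human written
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

The double loop is quadratic in the number of samples, so it is only meant for small hand-checked examples.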


## Custom Transformer

In [11]:
contractions = {
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

In [12]:
# Getting the tokenizer
tokenizer = get_tokenizer('spacy',language='en_core_web_sm')

# Getting the stemmer
stemmer = SnowballStemmer(language='english')

In [13]:
# A function for preprocessing
def preprocess(essay:str):
    preprocessed_essay = essay.lower()
    
    
    # Iterating through the contractions and replacing each with its expansion
    for contraction in contractions.keys():
        preprocessed_essay = re.sub(contraction.lower(),contractions[contraction].lower(),preprocessed_essay)
    
    # Subbing out \n and \t
    preprocessed_essay = re.sub("\n","",preprocessed_essay)
    preprocessed_essay = re.sub("\t","",preprocessed_essay)

    # Replacing \xa0 (non-breaking space in Latin-1) with a regular space
    preprocessed_essay = preprocessed_essay.replace(u'\xa0', u' ')
    
    final_preprocessed_essay = []
    
    # Running through tokenizer and returning the non-whitespace tokens
    for token in tokenizer(preprocessed_essay):
        temp_token = token.strip(" ")
        
        if temp_token != "":
            final_preprocessed_essay.append(stemmer.stem(token))
    
    return final_preprocessed_essay
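
One property of the contraction loop above is easier to see on a small input: each contraction is substituted as a plain substring, so a shorter key such as "he's" also matches inside "she's". The sketch below reproduces that on a two-entry subset of the dictionary and contrasts it with a word-boundary variant. `expand_substring` and `expand_word_boundary` are hypothetical names introduced here. It is an assumption on my part that the saved Transformer was trained with the substring behavior, in which case the evaluation should keep it as is.

```python
import re

# Two entries copied from the contractions dictionary above
subset = {
    "he's": "he has / he is",
    "she's": "she has / she is",
}

def expand_substring(text, mapping):
    """Mirror of the notebook's loop: plain substring substitution."""
    text = text.lower()
    for contraction, expansion in mapping.items():
        text = re.sub(contraction.lower(), expansion.lower(), text)
    return text

def expand_word_boundary(text, mapping):
    """Illustrative variant: only replace whole-word matches.

    Keys that start with an apostrophe (e.g. "'cause") would need
    different handling, which this sketch does not attempt.
    """
    text = text.lower()
    for contraction, expansion in mapping.items():
        pattern = r"\b" + re.escape(contraction.lower()) + r"\b"
        text = re.sub(pattern, expansion.lower(), text)
    return text

sample = "She's here"
substring_result = expand_substring(sample, subset)
boundary_result = expand_word_boundary(sample, subset)
print(substring_result)  # she has / he is here
print(boundary_result)   # she has / she is here
```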

In [14]:
# Running the training, validation, and testing essays through preprocessing
tokenized_essays_train = train_data['essay'].copy().progress_apply(preprocess)
tokenized_essays_valid = valid_data['essay'].copy().progress_apply(preprocess)
tokenized_essays_test = testing_data['story'].copy().progress_apply(preprocess)

100%|██████████| 44733/44733 [06:41<00:00, 111.52it/s]
100%|██████████| 5195/5195 [01:49<00:00, 47.64it/s] 
100%|██████████| 5512/5512 [00:41<00:00, 131.89it/s]


In [15]:
# Loading the vocab
vocabulary = torch.load('../input/llm-competition-models/vocab.pt')

In [16]:
# Function to map each tokenized essay (a list of tokens) to its vocabulary indices
def put_through_vocab(essay:list) -> list:
    return vocabulary(essay)

# Indexed
indexed_essays_train = [put_through_vocab(essay) for essay in tokenized_essays_train]
indexed_essays_valid = [put_through_vocab(essay) for essay in tokenized_essays_valid]
indexed_essays_test = [put_through_vocab(essay) for essay in tokenized_essays_test]

In [17]:
train_padded = tf.keras.utils.pad_sequences(indexed_essays_train,maxlen=512,padding='post',truncating='post',value=vocabulary['<pad>'])
valid_padded = tf.keras.utils.pad_sequences(indexed_essays_valid,maxlen=512,padding='post',truncating='post',value=vocabulary['<pad>'])
test_padded = tf.keras.utils.pad_sequences(indexed_essays_test,maxlen=512,padding='post',truncating='post',value=vocabulary['<pad>'])

In [18]:
# Positional Encoding
class PositionalEncoding(nn.Module):
    def __init__(self,emb_size:int, dropout:float, maxlen:int = 500):
        super(PositionalEncoding,self).__init__()
        den = torch.exp(-torch.arange(0,emb_size,2)*math.log(10000) / emb_size)
        pos = torch.arange(0,maxlen).reshape(maxlen,1)
        pos_embedding = torch.zeros((maxlen,emb_size))
        pos_embedding[:,0::2] = torch.sin(pos * den)
        pos_embedding[:,1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(0)

        self.dropout = nn.Dropout(dropout)

        # Saving the positional encoding in the model state dict, but making sure PyTorch doesn't "train"
        # these parameters because they don't need to be trained
        self.register_buffer('pos_embedding',pos_embedding)

    def forward(self,token_embedding):
        return self.dropout(token_embedding + self.pos_embedding)

# Transformer Model
class Model(nn.Module):
    def __init__(self,vocab_size: int, emb_size:int,nheads:int,dim_feedforward:int,dropout:float,num_layers:int,max_length:int):
        super().__init__()
        self.embed_size = emb_size
        self.embedding = nn.Embedding(vocab_size,emb_size,padding_idx=vocabulary['<pad>'])
        self.positional_encoder = PositionalEncoding(emb_size,dropout,max_length)
        self.encoder_layer = nn.TransformerEncoderLayer(emb_size,nheads,dim_feedforward,dropout,batch_first=True)
        self.transformer = nn.TransformerEncoder(self.encoder_layer,num_layers)
        self.fc1 = nn.Linear(emb_size,1)
    
    # Forward Function
    def forward(self,X,src_key_padding_mask):
        # Putting X through embedding
        output = self.embedding(X.long()) * math.sqrt(self.embed_size)
        output = self.positional_encoder(output)
        
        # Feeding through transformer encoder
        output = self.transformer(output,src_key_padding_mask=src_key_padding_mask)
        output = torch.mean(output,dim=1)
        return nn.functional.sigmoid(self.fc1(output))
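
The `PositionalEncoding` class above builds the standard sinusoidal table: even columns hold the sine of the position scaled by a per-column frequency, and odd columns hold the matching cosine. The following is a minimal pure-Python sketch of the same table, intended only for checking values by hand. `sinusoidal_table` is a hypothetical helper and assumes an even embedding size.

```python
import math

def sinusoidal_table(maxlen, emb_size):
    """Return a maxlen x emb_size list of lists with sinusoidal position values."""
    table = [[0.0] * emb_size for _ in range(maxlen)]
    for pos in range(maxlen):
        for i in range(0, emb_size, 2):
            # Same frequency term as `den` in the class above
            den = math.exp(-i * math.log(10000) / emb_size)
            table[pos][i] = math.sin(pos * den)
            table[pos][i + 1] = math.cos(pos * den)
    return table

table = sinusoidal_table(maxlen=4, emb_size=6)
print(table[0][:2])  # position 0: sin(0) and cos(0)
print(table[1][:2])  # position 1 at the highest frequency: sin(1) and cos(1)
```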

In [19]:
# Creating a mask to make sure the padding indices are masked
def create_padding_mask(X):
    return (X == vocabulary['<pad>'])
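
To make the padding and masking steps concrete, here is a small pure-Python sketch of what post-padding, post-truncation, and the resulting key padding mask look like for two short sequences. The pad index of 1 and the helper names `pad_post` and `padding_mask` are assumptions for illustration; the notebook itself relies on `tf.keras.utils.pad_sequences` and the tensor comparison above.

```python
PAD_INDEX = 1  # assumed for illustration; the notebook uses vocabulary['<pad>']

def pad_post(sequence, maxlen, value=PAD_INDEX):
    """Keep the first `maxlen` ids, then right-pad with `value`."""
    truncated = sequence[:maxlen]
    return truncated + [value] * (maxlen - len(truncated))

def padding_mask(padded):
    """True where a position holds padding and should be ignored by attention."""
    return [token_id == PAD_INDEX for token_id in padded]

short = pad_post([7, 42, 9], maxlen=5)
long = pad_post([7, 42, 9, 13, 5, 8], maxlen=5)
print(short, padding_mask(short))
print(long, padding_mask(long))
```

With `truncating='post'`, anything beyond the first `maxlen` tokens of a long text is dropped, so only the opening of each essay or story reaches the model.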

In [20]:
# Setting up the model
model = Model(vocab_size=len(vocabulary),emb_size=512,nheads=8,dim_feedforward=2048,dropout=0.2,num_layers=2,max_length=512)

model.load_state_dict(torch.load('../input/llm-competition-models/2-layer-transformer-encoder.pt'))

# Moving the model to the selected device (GPU if available)
model.to(device)

Model(
  (embedding): Embedding(25002, 512, padding_idx=1)
  (positional_encoder): PositionalEncoding(
    (dropout): Dropout(p=0.2, inplace=False)
  )
  (encoder_layer): TransformerEncoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
    )
    (linear1): Linear(in_features=512, out_features=2048, bias=True)
    (dropout): Dropout(p=0.2, inplace=False)
    (linear2): Linear(in_features=2048, out_features=512, bias=True)
    (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (dropout1): Dropout(p=0.2, inplace=False)
    (dropout2): Dropout(p=0.2, inplace=False)
  )
  (transformer): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
      

In [21]:
# Make a function for batched inference
def inference(X,y):
    X_torch = torch.from_numpy(X)
    y_torch = torch.from_numpy(y)
    dataset = TensorDataset(X_torch,y_torch)
    # Shuffling is unnecessary for evaluation; ROC AUC does not depend on sample order
    dataloader = DataLoader(dataset,batch_size=128,shuffle=False)
    preds = []
    targets = []
    model.eval()
    with torch.no_grad():
        for X_batch,y_batch in dataloader:
            # Making predictions
            X_batch = X_batch.to(device)
            pred = model(X_batch,src_key_padding_mask=create_padding_mask(X_batch))
            preds.append(pred.cpu().numpy())

            # Getting the targets
            targets.append(y_batch.numpy())
    
    # Stacking the per-batch arrays into one array each
    return np.concatenate(preds,axis=0), np.concatenate(targets,axis=0)

In [22]:
# Making Predictions
train_preds, train_targets = inference(train_padded,train_data['LLM_written'].values)
valid_preds, valid_targets = inference(valid_padded,valid_data['LLM_written'].values)
test_preds, test_targets = inference(test_padded,testing_data['label'].values)



In [23]:
# Reporting the ROC AUC scores
print('Transformer Score')
print(f'Training ROC AUC: {roc_auc_score(train_targets,train_preds)}')
print(f'Validation ROC AUC: {roc_auc_score(valid_targets,valid_preds)}')
print(f'Testing ROC AUC: {roc_auc_score(test_targets,test_preds)}')

Transformer Score
Training ROC AUC: 0.9983653983570595
Validation ROC AUC: 0.8873101185764299
Testing ROC AUC: 0.5977724284369135


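Summary: This experiment shows that DistilBERT generalizes much better than the custom Transformer. On the out-of-distribution test set, DistilBERT reaches a ROC AUC of about 0.98, while the Transformer drops to about 0.60 despite a validation ROC AUC of about 0.89. With this in mind, I should produce a prediction instead of a probability, which means I need to find a decision threshold. From my analysis, 0.5 isn't a good one.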

In [24]:
# Getting the TPR,FPR, and Thresholds for the DistilBERT model
train_fpr,train_tpr,train_thresholds = roc_curve(train_data["LLM_written"],predictions_train)
valid_fpr,valid_tpr,valid_thresholds = roc_curve(valid_data["LLM_written"],predictions_valid)
test_fpr,test_tpr,test_thresholds = roc_curve(testing_data["label"],predictions_test)

In [25]:
# Collecting (FPR, TPR, threshold, TPR - FPR) tuples for each split
training_metrics = list(zip(train_fpr,train_tpr,train_thresholds,train_tpr-train_fpr))
valid_metrics = list(zip(valid_fpr,valid_tpr,valid_thresholds,valid_tpr-valid_fpr))
test_metrics = list(zip(test_fpr,test_tpr,test_thresholds,test_tpr-test_fpr))

In [27]:
# Getting the training entry with the largest TPR - FPR difference
sorted(training_metrics,key=lambda x: x[3],reverse=True)[0]

(0.0008781173164734809,
 0.9983397897066962,
 0.5422803163528442,
 0.9974616723902228)

In [30]:
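# Averaging the best (max TPR - FPR) thresholds from the validation and testing sets
best_valid_threshold = max(valid_metrics,key=lambda x: x[3])[2]
best_test_threshold = max(test_metrics,key=lambda x: x[3])[2]
(best_valid_threshold + best_test_threshold)/2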

0.7122994661331177

Averaging the validation and testing thresholds that maximize TPR - FPR gives roughly 0.71, so that looks like the threshold to use. Anything above 0.71 should be classified as LLM generated and anything below it should be classified as student written.
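
The selection rule used in this section (pick the threshold that maximizes TPR minus FPR, often called Youden's J statistic) and the final classification step can be sketched without scikit-learn as follows. The data is made up, and `youden_threshold` and `classify` are hypothetical helpers rather than part of the notebook. Treating a score exactly equal to the threshold as LLM generated is also an assumption, chosen to mirror the `score >= threshold` convention of `roc_curve`.

```python
def youden_threshold(labels, scores):
    """Return (threshold, j) maximizing TPR - FPR when predicting 1 for score >= threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best_threshold, best_j = None, float("-inf")
    # Every distinct score is a candidate threshold, from strictest to loosest
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j

def classify(score, threshold):
    """1.0 = LLM generated, 0.0 = human written."""
    return 1.0 if score >= threshold else 0.0

# Toy data (made up): four human-written and four LLM-generated texts
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.05, 0.2, 0.3, 0.9, 0.6, 0.75, 0.8, 0.95]
threshold, j = youden_threshold(toy_labels, toy_scores)
toy_predictions = [classify(s, threshold) for s in toy_scores]
print(threshold, j)
print(toy_predictions)
```

In the toy data, one human-written text scores 0.9, so no threshold separates the classes perfectly. The rule settles on 0.6, which catches every LLM-generated text at the cost of one false positive.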