In this notebook, I clean and preprocess text data, vectorize it with TF-IDF, and train a Logistic Regression model to predict which of two model responses wins (model A, model B, or a tie). I also fine-tune a BERT model on the same task. The goal is to compare a traditional machine learning model with a transformer-based model on a large language model (LLM) response-preference task. I use log loss as the evaluation metric because it scores the predicted class probabilities rather than only the predicted label.
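As a minimal sketch of what the metric measures, the helper below computes multiclass log loss by hand. The function name `multiclass_log_loss` and the sample probabilities are illustrative and not part of the original notebook, which uses `sklearn.metrics.log_loss`.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

# Hypothetical labels (0 = model A wins, 1 = model B wins, 2 = tie)
y_true = [0, 1, 2]

# Every row puts 0.8 on the true class
all_right = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

# Same as above, but the last row is confidently wrong
one_wrong = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05]]

loss_all_right = multiclass_log_loss(y_true, all_right)
loss_one_wrong = multiclass_log_loss(y_true, one_wrong)
print(loss_all_right, loss_one_wrong)
```

A single confidently wrong row raises the average noticeably, which is why log loss rewards well-calibrated probabilities and not just correct labels.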

# Libraries

In [None]:
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import log_loss
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Download necessary NLTK data files
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\elige\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\elige\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\elige\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\elige\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Load Train Data

In [None]:
train_data = pd.read_parquet('train.parquet', engine='pyarrow')
train_data.head()

Unnamed: 0,id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie
0,30192,gpt-4-1106-preview,gpt-4-0613,"[""Is it morally right to try to have a certain...","[""The question of whether it is morally right ...","[""As an AI, I don't have personal beliefs or o...",1,0,0
1,53567,koala-13b,gpt-4-0613,"[""What is the difference between marriage lice...","[""A marriage license is a legal document that ...","[""A marriage license and a marriage certificat...",0,1,0
2,65089,gpt-3.5-turbo-0613,mistral-medium,"[""explain function calling. how would you call...","[""Function calling is the process of invoking ...","[""Function calling is the process of invoking ...",0,0,1
3,96401,llama-2-13b-chat,mistral-7b-instruct,"[""How can I create a test set for a very rare ...","[""Creating a test set for a very rare category...","[""When building a classifier for a very rare c...",1,0,0
4,198779,koala-13b,gpt-3.5-turbo-0314,"[""What is the best way to travel from Tel-Aviv...","[""The best way to travel from Tel Aviv to Jeru...","[""The best way to travel from Tel-Aviv to Jeru...",0,1,0
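The displayed values start with `["`, which suggests that each cell holds a JSON-encoded list of conversation turns. That is an assumption based only on the preview above. If it holds, a small helper like the hypothetical `flatten_turns` below could join the turns into plain text before cleaning.

```python
import json

def flatten_turns(raw):
    """Join the turns of a JSON-encoded list into one string.

    Falls back to the raw string if the value is not valid JSON.
    """
    try:
        turns = json.loads(raw)
    except (TypeError, ValueError):
        return raw if isinstance(raw, str) else ""
    if not isinstance(turns, list):
        return str(turns)
    # Skip null turns, which json.loads returns as None
    return " ".join(t for t in turns if isinstance(t, str))

example = '["What is the capital of France?", "And of Spain?"]'
print(flatten_turns(example))  # What is the capital of France? And of Spain?
```

This step is optional here, since the cleaning function drops brackets and quotes anyway when it filters out non-alphabetic tokens.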


In [None]:
def clean_text(text):
    
    '''
    Cleans the input text by performing the following steps:
    1. Converts text to lowercase.
    2. Tokenizes the text into words.
    3. Removes punctuation and non-alphabetic tokens.
    4. Removes stopwords.
    5. Lemmatizes the tokens.
    6. Joins the tokens back into a single string.

    Parameters:
    text (str): The input text to be cleaned.

    Returns:
    str: The cleaned text.
    '''

    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha()] # Remove punctuation and non-alphabetic tokens
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join the tokens back into one string (avoids shadowing the function name)
    return ' '.join(tokens)
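To make the effect of these steps easier to see without the NLTK downloads, here is a rough standard-library approximation. It is not the function above: it swaps in a regular expression for `word_tokenize`, uses a tiny illustrative stopword set, and skips lemmatization. The name `clean_text_sketch` is hypothetical.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's full English list
STOP_WORDS = {"the", "is", "a", "to", "of", "and", "in", "it"}

def clean_text_sketch(text):
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase alphabetic runs only
    return " ".join(t for t in tokens if t not in STOP_WORDS)

result = clean_text_sketch("The best way to travel from Tel-Aviv to Jerusalem is by bus!")
print(result)  # best way travel from tel aviv jerusalem by bus
```

One difference is worth noting. The regular expression splits `Tel-Aviv` into `tel` and `aviv`, whereas the `isalpha()` filter in the notebook discards a hyphenated token entirely, which would explain why the cleaned prompt in the preview further down reads `best way travel jerusalem ...`.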

In [None]:
# clean the text data
train_data['prompt'] = train_data['prompt'].apply(clean_text)
train_data['response_a'] = train_data['response_a'].apply(clean_text)
train_data['response_b'] = train_data['response_b'].apply(clean_text)

In [None]:
train_data.head()

Unnamed: 0,id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie
0,30192,gpt-4-1106-preview,gpt-4-0613,morally right try certain percentage female ma...,question whether morally right aim certain per...,ai personal belief opinion however tell questi...,1,0,0
1,53567,koala-13b,gpt-4-0613,difference marriage license marriage certifica...,marriage license legal document allows couple ...,marriage license marriage certificate two diff...,0,1,0
2,65089,gpt-3.5-turbo-0613,mistral-medium,explain function calling would call function,function calling process invoking executing fu...,function calling process invoking function pro...,0,0,1
3,96401,llama-2-13b-chat,mistral-7b-instruct,create test set rare category want build class...,creating test set rare category challenging ma...,building classifier rare category creating tes...,1,0,0
4,198779,koala-13b,gpt-3.5-turbo-0314,best way travel jerusalem car bus plane,best way travel tel aviv jerusalem depends per...,best way travel jerusalem depends personal pre...,0,1,0


In [8]:
# Combine responses for TF-IDF vectorization
train_data['combined_responses'] = train_data['response_a'] + " " + train_data['response_b']

In [9]:
train_data[['response_a', 'response_b', 'combined_responses']].head()

Unnamed: 0,response_a,response_b,combined_responses
0,question whether morally right aim certain per...,ai personal belief opinion however tell questi...,question whether morally right aim certain per...
1,marriage license legal document allows couple ...,marriage license marriage certificate two diff...,marriage license legal document allows couple ...
2,function calling process invoking executing fu...,function calling process invoking function pro...,function calling process invoking executing fu...
3,creating test set rare category challenging ma...,building classifier rare category creating tes...,creating test set rare category challenging ma...
4,best way travel tel aviv jerusalem depends per...,best way travel jerusalem depends personal pre...,best way travel tel aviv jerusalem depends per...


In [11]:
# Sample the data
sampled_data = train_data.sample(n=10000, random_state=42)

In [12]:
sampled_data

Unnamed: 0,id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie,combined_responses
37379,2785062085,gpt-3.5-turbo-0613,codellama-34b-instruct,hello world mean,hello world phrase used computer programming i...,hello world common phrase used display message...,1,0,0,hello world phrase used computer programming i...
644,48259531,mistral-medium,gpt-4-0314,ran marathon weighting fast would run lose,difficult provide exact answer question many f...,accurately estimate much faster would run mara...,1,0,0,difficult provide exact answer question many f...
48496,3622731894,gpt-4-0314,gpt-3.5-turbo-0314,instruction describes task paired input provid...,,,0,0,1,
12603,933663766,openchat-3.5,gpt-3.5-turbo-0613,run static analysis gcc qt creator,run static analysis gcc qt creator follow step...,run static analysis gcc qt creator follow step...,1,0,0,run static analysis gcc qt creator follow step...
16697,1246768370,codellama-34b-instruct,mixtral-8x7b-instruct-v0.1,music producer die,able provide information cause death publicly ...,really sorry ability access personal data ca p...,1,0,0,able provide information cause death publicly ...
...,...,...,...,...,...,...,...,...,...,...
25231,1877587731,wizardlm-70b,chatglm2-6b,write optimal sql query take data table client...,example optimal sql query retrieves sale volum...,optimal sql query provide sale volume per clie...,0,1,0,example optimal sql query retrieves sale volum...
49278,3680864910,gpt-3.5-turbo-0125,mixtral-8x7b-instruct-v0.1,difference snail cat,snail cat two different type animal key differ...,snail cat different type animal belong differe...,0,1,0,snail cat two different type animal key differ...
37785,2813285142,llama-2-13b-chat,wizardlm-13b,summarize following text really felt like stru...,sure summary requested tan donated bone marrow...,individual named tan donated bone marrow boy f...,1,0,0,sure summary requested tan donated bone marrow...
49685,3711677764,vicuna-7b,vicuna-13b,fear v intimidation,fear intimidation two related distinct feeling...,fear intimidation related concept natural emot...,1,0,0,fear intimidation two related distinct feeling...


In [13]:
# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sampled_data['combined_responses'])
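For intuition about what the vectorizer produces, the sketch below computes TF-IDF weights by hand on three toy documents. It follows the smoothed IDF formula and L2 normalization that the scikit-learn documentation describes for the default settings; the documents and the helper name `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

docs = ["fast car fast", "slow car", "fast bike"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf_vector(doc):
    counts = Counter(doc)
    # Raw term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, c in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 normalization
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(doc) for doc in tokenized]
print(vectors[1])
```

In the second document, `slow` appears in only one document and `car` in two, so `slow` receives the larger weight.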

In [15]:
# Target variable
sampled_data[['winner_model_a', 'winner_model_b', 'winner_tie']].values

array([[1, 0, 0],
       [1, 0, 0],
       [0, 0, 1],
       ...,
       [1, 0, 0],
       [1, 0, 0],
       [0, 0, 1]], dtype=int64)

In [16]:
# Encode target variable
sampled_data['winner'] = sampled_data[['winner_model_a', 'winner_model_b', 'winner_tie']].idxmax(axis=1)
sampled_data['winner'] = sampled_data['winner'].map({'winner_model_a': 0, 'winner_model_b': 1, 'winner_tie': 2})

In [17]:
# target variable
y = sampled_data['winner'].values

In [19]:
# Split into training and validation sets (80/20)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Check the shapes
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((8000, 54992), (2000, 54992), (8000,), (2000,))

### Model 1: Logistic Regression

In [21]:
# Initialize the model
model_LR = LogisticRegression(multi_class='multinomial', max_iter=1000)

# Train the model
model_LR.fit(X_train, y_train)

In [22]:
# Make predictions on the validation set
y_pred_LR = model_LR.predict_proba(X_val)

In [23]:
# Calculate log loss
log_loss_score_LR = log_loss(y_val, y_pred_LR)
print(f'Log Loss: {log_loss_score_LR}')

Log Loss: 1.1030934797200938


### Model 2: BERT

In [24]:
# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
# Tokenize the text data
def encode_data(texts):
    return tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

# Labels as a tensor; the texts are tokenized after the train/validation split below
labels = torch.tensor(sampled_data['winner'].values, dtype=torch.long)

In [26]:
# Split the data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(sampled_data['combined_responses'], labels, test_size=0.2, random_state=42)

# Tokenize the split text data
train_encodings = encode_data(train_texts.tolist())
val_encodings = encode_data(val_texts.tolist())
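Because both responses are concatenated into one string and then truncated, a long `response_a` can push `response_b` out of the model's input window. The sketch below shows one possible way to share a token budget between two sequences by always trimming the longer one. The function `truncate_pair` and the budget of 512 are illustrative assumptions, and the budget ignores special tokens. Hugging Face tokenizers can do something similar when given two text arguments with `truncation='longest_first'`.

```python
def truncate_pair(tokens_a, tokens_b, max_len=512):
    """Trim two token lists so their combined length fits in max_len,
    always removing a token from whichever list is currently longer."""
    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > max_len:
        if len(a) >= len(b):
            a.pop()
        else:
            b.pop()
    return a, b

# A long first response no longer crowds out the short second one
a, b = truncate_pair(["a"] * 700, ["b"] * 100, max_len=512)
print(len(a), len(b))  # 412 100
```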

In [None]:
# Define a custom dataset
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        """
        Initializes the CustomDataset with encodings and labels.

        Parameters:
        encodings (dict): Encoded input data.
        labels (torch.Tensor): Corresponding labels for the input data.
        """
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        """
        Retrieves the item (encoding and label) at the specified index.

        Parameters:
        idx (int): Index of the item to retrieve.

        Returns:
        dict: A dictionary containing the encoding and label for the specified index.
        """
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        """
        Returns the number of items in the dataset.
        
        """
        return len(self.labels)

In [None]:
# Create CustomDataset objects for the training and validation sets
train_dataset = CustomDataset(train_encodings, train_labels)
val_dataset = CustomDataset(val_encodings, val_labels)

In [29]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10
)
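As a quick sanity check on these settings, the arithmetic below estimates how many optimizer steps the run takes and what share of them is warmup. It assumes a single device and no gradient accumulation, neither of which is stated in the notebook.

```python
import math

n_train = 8000       # training rows, from the split shapes shown earlier
batch_size = 8       # per_device_train_batch_size
epochs = 3           # num_train_epochs
warmup_steps = 500

# Assumes one device and no gradient accumulation
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, round(warmup_fraction, 3))  # 1000 3000 0.167
```

Under those assumptions, about one sixth of training is spent ramping the learning rate up.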

In [30]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

In [31]:
# Move model to GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
# Train the model
trainer.train()

Step,Training Loss
10,1.101
20,1.1135
30,1.1148
40,1.1224
50,1.0675
60,1.1138


In [None]:
# Evaluate the model
trainer.evaluate()

{'eval_loss': 1.0958130359649658,
 'eval_runtime': 56.1869,
 'eval_samples_per_second': 35.595,
 'eval_steps_per_second': 4.449,
 'epoch': 3.0}
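To put both scores in context, the sketch below compares them with the log loss of a model that always predicts 1/3 for each of the three classes. It assumes that `eval_loss` is the mean cross-entropy on the validation set and is therefore comparable to the log loss reported for Logistic Regression.

```python
import math

uniform_baseline = math.log(3)        # log loss of predicting 1/3 for every class
lr_log_loss = 1.1030934797200938      # reported by the Logistic Regression cell
bert_eval_loss = 1.0958130359649658   # reported by trainer.evaluate()

for name, value in [("Logistic Regression", lr_log_loss), ("BERT", bert_eval_loss)]:
    diff = value - uniform_baseline
    print(f"{name}: {value:.4f} ({diff:+.4f} vs. uniform)")
```

Both values sit very close to the uniform baseline of about 1.0986, with Logistic Regression slightly above it and BERT slightly below, so neither model has learned much signal from this setup yet.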

### Test Data

In [7]:
# Load and preprocess test data
test_data = pd.read_parquet('test.parquet', engine='pyarrow')

In [None]:
# Clean the text data
test_data['prompt'] = test_data['prompt'].apply(clean_text)
test_data['response_a'] = test_data['response_a'].apply(clean_text)
test_data['response_b'] = test_data['response_b'].apply(clean_text)
test_data['combined_responses'] = test_data['response_a'] + " " + test_data['response_b']

In [None]:
# Vectorize test data
X_test = vectorizer.transform(test_data['combined_responses'])

In [None]:
# Make predictions on the test data using Logistic Regression model
y_test_pred_LR = model_LR.predict_proba(X_test)
print(f'Test Predictions (Logistic Regression): {y_test_pred_LR}')

Test Predictions (Logistic Regression): [[0.18739054 0.36258478 0.45002467]
 [0.35477295 0.4159459  0.22928115]
 [0.46346236 0.26345956 0.27307808]]


In [None]:
# Tokenize the test data
test_encodings = encode_data(test_data['combined_responses'].tolist())

In [None]:
# Move test encodings to device
test_encodings = {key: val.to(device) for key, val in test_encodings.items()}

In [None]:
# Make predictions on the test data using BERT model
with torch.no_grad():
    model.eval()
    outputs = model(**test_encodings)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1).cpu().numpy()
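The cell above sends every test row through the model in a single forward pass. That works for a handful of rows but may not fit in memory for a larger test set. A simple chunking helper like the hypothetical `batched` below is one way to split the texts; each chunk would then go through `encode_data` and the model, with the resulting probabilities concatenated at the end.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"text {i}" for i in range(10)]
sizes = [len(chunk) for chunk in batched(texts, batch_size=4)]
print(sizes)  # [4, 4, 2]
```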

In [None]:
# Prepare the submission file
submission = pd.DataFrame(predictions, columns=['winner_model_a', 'winner_model_b', 'winner_tie'])
submission.insert(0, 'id', test_data['id'])

In [None]:
# Save the submission file
submission.to_csv('submission.csv', index=False)
print('Submission file created!')

Submission file created!
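Before uploading, it can help to confirm that every row of the file holds valid probabilities that sum to 1. The sketch below runs such a check on a small in-memory CSV with made-up ids and values. The function name `rows_are_valid` is hypothetical; for the real file, the same function could be given `open('submission.csv', newline='')`.

```python
import csv
import io

# Made-up rows in the same column layout as the submission file
sample_csv = io.StringIO(
    "id,winner_model_a,winner_model_b,winner_tie\n"
    "1,0.2,0.3,0.5\n"
    "2,0.6,0.3,0.1\n"
)
prob_cols = ["winner_model_a", "winner_model_b", "winner_tie"]

def rows_are_valid(handle, tol=1e-6):
    """Return True if each row has probabilities in [0, 1] that sum to 1."""
    for row in csv.DictReader(handle):
        probs = [float(row[c]) for c in prob_cols]
        if any(p < 0 or p > 1 for p in probs) or abs(sum(probs) - 1) > tol:
            return False
    return True

is_valid = rows_are_valid(sample_csv)
print(is_valid)  # True
```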
