# GPT vs. BERT: Fine-Tuning for Downstream NLP Tasks


### In this notebook, we will:

- Fine-tune a Transformer-based model for a classification task
- Prompt-engineer a GPT model for the same classification task
- Compare the results live
- Evaluate performance with different metrics

##### We will be using a labeled dataset of tweets for sentiment classification (negative, neutral, positive). The dataset is preprocessed and split into training, validation, and test sets.


In [1]:
from datasets import load_dataset
import pandas as pd
from sklearn.model_selection import train_test_split
import openai
import os
import IPython
from langchain.llms import OpenAI
import torch
from torch.utils.data import DataLoader, TensorDataset

from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset
import numpy as np
import evaluate

import time

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

from pandas.errors import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)


# Restrict the notebook to the first GPU (os is already imported above)
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [2]:
# !pip install --upgrade accelerate
# !pip install evaluate

In [3]:
dataset = load_dataset("NischayDnk/bertvsllm_demodatav2")


Found cached dataset csv (/root/.cache/huggingface/datasets/NischayDnk___csv/NischayDnk--bertvsllm_demodatav2-3e25f0b2d59b20e9/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)



In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['textID', 'text', 'selected_text', 'sentiment'],
        num_rows: 27481
    })
})

In [5]:
df = pd.DataFrame(dataset['train'])

In [6]:
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [7]:
train_df, val_test_df = train_test_split(df, train_size=0.8, random_state=42, stratify=df['sentiment'])

val_df, test_df = train_test_split(val_test_df, test_size=0.5, random_state=42, stratify=val_test_df['sentiment'])


In [8]:
train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)



In [9]:
print(train_df.sentiment.value_counts())


sentiment
neutral     8894
positive    6865
negative    6225
Name: count, dtype: int64
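
The classes are moderately imbalanced, so it helps to know what accuracy a trivial classifier would reach before reading the model scores further down. The following is a minimal sketch that computes a majority-class baseline from the counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"neutral": 8894, "positive": 6865, "negative": 6225}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = class_counts[majority_label] / total

print(f"Always predicting '{majority_label}' gives accuracy {majority_baseline:.3f}")
```

Any model worth keeping should clear this baseline (roughly 0.40 on the training split) by a comfortable margin.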


In [10]:


accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
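
`compute_metrics` above reports accuracy only. With three classes of unequal size, a per-class view such as macro-averaged F1 can be a useful second metric. The sketch below is a dependency-free illustration of how macro F1 is computed; `macro_f1` is a hypothetical helper, not part of the notebook. In practice a library routine such as scikit-learn's `f1_score(..., average="macro")` would normally be used instead.

```python
def macro_f1(predictions, references, labels=(0, 1, 2)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denom = 2 * tp + fp + fn
        # A class with no predictions and no references scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy label ids (0 = negative, 1 = neutral, 2 = positive), not real model output.
preds = [0, 1, 1, 2, 2, 2]
refs = [0, 1, 2, 2, 2, 0]
print(macro_f1(preds, refs))
```

Because every class contributes equally to the average, a model that ignores a minority class is penalized more heavily than it would be under plain accuracy.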


In [11]:
id2label = {0: "negative", 1: "neutral", 2: "positive"}
label2id = {"negative": 0, "neutral": 1, "positive": 2}

train_df['label'] = train_df['sentiment'].map(label2id)
val_df['label'] = val_df['sentiment'].map(label2id)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3, id2label=id2label, label2id=label2id
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.

In [12]:
# valid_dataset[0]

In [17]:
train_dataset = Dataset.from_pandas(train_df[['text','label']].tail(3000), split='train')
valid_dataset = Dataset.from_pandas(val_df[['text','label']].head(500), split='valid')


In [18]:
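# Tokenize without padding here; DataCollatorWithPadding (defined above) pads
# each batch dynamically to its longest sequence at training time.
def tokenize_func(examples):
    return tokenizer(examples["text"], truncation=True)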

encoded_dataset_train = train_dataset.map(tokenize_func, batched=True)
encoded_dataset_valid = valid_dataset.map(tokenize_func, batched=True)
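
Dynamic padding means each batch is padded only to the length of its own longest sequence, not to a global maximum. The toy sketch below illustrates the idea without any library. `pad_batch` and the token ids are hypothetical, and `pad_id=0` is an assumption that happens to match the `[PAD]` id of the uncased BERT vocabulary.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Hypothetical token ids; real ids come from the tokenizer.
short_batch = [[101, 7, 102], [101, 7, 8, 102]]
long_batch = [[101, 1, 2, 3, 4, 5, 6, 102]]

ids, mask = pad_batch(short_batch)
print(ids)   # padded to length 4, the longest sequence in short_batch
print(mask)
```

Batches of short tweets stay short, which saves compute compared with padding everything to the longest tweet in the dataset.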



In [19]:
training_args = TrainingArguments(
    output_dir="exp1_bert",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset_train,
    eval_dataset=encoded_dataset_valid,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


In [20]:
trainer.train()


Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[34m[1mwandb[0m: Currently logged in as: [33mnischay[0m. Use [1m`wandb login --relogin`[0m to force relogin



You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.561331 | 0.772 |
| 2 | No log | 0.53317 | 0.796 |
| 3 | 0.545900 | 0.550707 | 0.786 |


TrainOutput(global_step=564, training_loss=0.5174399707334262, metrics={'train_runtime': 33.5153, 'train_samples_per_second': 268.534, 'train_steps_per_second': 16.828, 'total_flos': 255919159258272.0, 'train_loss': 0.5174399707334262, 'epoch': 3.0})
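
The `global_step=564` in the output follows from the training configuration. The short sketch below reproduces the arithmetic, assuming a single device and no gradient accumulation (neither is overridden in `TrainingArguments` above). The variable names are illustrative.

```python
import math

num_examples = 3000   # rows kept by .tail(3000)
batch_size = 16       # per_device_train_batch_size, single device assumed
num_epochs = 3        # num_train_epochs

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# With save_strategy="epoch", one checkpoint is written at the end of each epoch.
checkpoint_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

print(steps_per_epoch, total_steps, checkpoint_steps)
```

Under these assumptions the checkpoint directories are named after the step counts in `checkpoint_steps`, which is where a name like `checkpoint-564` comes from.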

In [24]:
del trainer, model

In [29]:
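# checkpoint-564 is the final checkpoint (end of epoch 3). Note that the lowest
# validation loss in the training log above was reached at epoch 2.
exp_dir = 'exp1_bert/checkpoint-564/'
bert_model = AutoModelForSequenceClassification.from_pretrained(exp_dir)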

In [35]:
%%time
test_texts = test_df['text'].tolist()
test_labels = test_df['sentiment'].tolist()

# Tokenize the test data and convert it to DataLoader
inputs = tokenizer(test_texts, padding=True, truncation=True, return_tensors="pt")
test_dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'])
test_loader = DataLoader(test_dataset, batch_size=16)

# Make predictions on the test data using the trained model
bert_model.eval()
predictions = []
with torch.no_grad():
    for batch in test_loader:
        input_ids, attention_mask = batch
        logits = bert_model(input_ids, attention_mask=attention_mask)[0]
        batch_predictions = torch.argmax(logits, dim=1).tolist()
        predictions.extend(batch_predictions)
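
Once `predictions` is filled, the integer ids can be compared against the string labels in `test_labels`. The sketch below shows one way to tally a confusion table and the overall accuracy. It runs on toy data rather than the real test set, and `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

id2label = {0: "negative", 1: "neutral", 2: "positive"}


def confusion_counts(actual_labels, predicted_ids):
    """Count (actual, predicted) label pairs."""
    predicted_labels = [id2label[i] for i in predicted_ids]
    return Counter(zip(actual_labels, predicted_labels))


# Toy stand-ins for test_labels and predictions.
actual = ["neutral", "negative", "neutral", "negative", "positive"]
predicted = [1, 1, 1, 0, 2]

counts = confusion_counts(actual, predicted)
accuracy = sum(n for (a, p), n in counts.items() if a == p) / len(actual)

for (a, p), n in sorted(counts.items()):
    print(f"actual={a:<8} predicted={p:<8} count={n}")
print(f"accuracy={accuracy:.2f}")
```

The off-diagonal pairs show which classes get confused with each other, for example negative tweets predicted as neutral.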




In [37]:
# Create a DataFrame to display the fancy output
output_df = pd.DataFrame({'Text': test_texts, 'Actual Label': test_labels, 'Model Prediction': predictions})
output_df['Model Prediction'] = output_df['Model Prediction'].map(id2label)

def format_output(row):
    return f"Text: {row['Text']}\nActual Label: {row['Actual Label']}\nModel Prediction: {row['Model Prediction']}\n"

formatted_output = output_df.head(10).apply(format_output, axis=1).tolist()
print("\n".join(formatted_output))

Text:  rain
Actual Label: neutral
Model Prediction: neutral

Text:  umm well i only go to house clubs and i never go to north beach so.no idea, sorry  been out 1x there 2 a now defunctlesi club
Actual Label: negative
Model Prediction: neutral

Text: getting ready to head out to Camp Allen.  Unless somethings changed, that means no phone service for about 24 hours.
Actual Label: neutral
Model Prediction: neutral

Text: Twitter won`t let me update online. My update box won`t work.
Actual Label: negative
Model Prediction: negative

Text:  is over
Actual Label: neutral
Model Prediction: neutral

Text: _Photography Good Morning! Hope you have a great day!!
Actual Label: positive
Model Prediction: positive

Text: started her new job today! aaand so stoked for may long..  and billy is awesome.
Actual Label: positive
Model Prediction: positive

Text:  I forgot about it and I already ate lunch  so I guess I`m not going.
Actual Label: neutral
Model Prediction: negative

Text: _0407 all over the 

In [39]:
%%capture
# update or install the necessary libraries
!pip install --upgrade openai
!pip install --upgrade langchain
!pip install --upgrade python-dotenv



In [40]:


# Note: this uses the pre-1.0 `openai` SDK interface (openai.ChatCompletion).
def get_completion(prompt, model="gpt-3.5-turbo", temperature=0):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,  # degree of randomness of the model's output
    )
    return response.choices[0].message["content"]

# Read the API key from the environment instead of hard-coding it in the notebook.
openai.api_key = os.environ["OPENAI_API_KEY"]
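
API calls can fail transiently, for instance when a rate limit is hit. A fixed pause between calls is one way to cope; retrying with exponential backoff is another. The sketch below is a generic, library-independent wrapper. `call_with_retries` and its parameters are hypothetical. It catches every `Exception` for simplicity, whereas real code would more likely catch only the SDK's rate-limit or timeout errors.

```python
import time


def call_with_retries(func, *args, max_attempts=4, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func, retrying on failure with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Give up after the last attempt and surface the original error.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

A call such as `call_with_retries(get_completion, prompt)` would then wait only when a request actually fails. The `sleep` parameter exists so the delay can be replaced in tests.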

In [42]:
output_df = output_df[:10].copy()  # copy so the new column is not set on a slice
output_df['gpt_pred'] = ''


In [43]:
output_df.head(2)

Unnamed: 0,Text,Actual Label,Model Prediction,gpt_pred
0,rain,neutral,neutral,
1,umm well i only go to house clubs and i never...,negative,neutral,


In [46]:
for i in range(len(output_df)):
    prompt = f"""
    you have this tweet text --> {output_df.Text[i]}.
    you have these options to predict sentiment / toxicity

    NEG - negative 
    NEU - neutral
    POS - positive

    what abbreviation corresponds to the right sentiment? 
    """
    response = get_completion(prompt)
    # Map the abbreviation in the reply back to the full label.
    # If several abbreviations appear, the last matching check wins.
    if 'NEG' in response:
        output_df.loc[i, 'gpt_pred'] = 'negative'
    if 'NEU' in response:
        output_df.loc[i, 'gpt_pred'] = 'neutral'
    if 'POS' in response:
        output_df.loc[i, 'gpt_pred'] = 'positive'

    time.sleep(20)  # crude rate limiting between API calls
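
The substring checks in the loop can misfire when the reply mentions more than one abbreviation, or contains one inside a longer word. A slightly stricter alternative is sketched below. `parse_sentiment` is a hypothetical helper that differs from the loop above: it matches abbreviations only as standalone tokens, takes the first one found, and returns `None` when there is no match.

```python
import re

ABBREVIATIONS = {"NEG": "negative", "NEU": "neutral", "POS": "positive"}
_PATTERN = re.compile(r"\b(NEG|NEU|POS)\b")


def parse_sentiment(response):
    """Return the label for the first standalone abbreviation, or None."""
    match = _PATTERN.search(response)
    return ABBREVIATIONS[match.group(1)] if match else None


print(parse_sentiment("NEU - neutral"))
print(parse_sentiment("The answer is POS, not NEG."))
print(parse_sentiment("I cannot tell"))
```

Returning `None` for unparseable replies makes such rows easy to spot, instead of silently leaving an empty string in `gpt_pred`.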

In [48]:
def format_output(row):
    return f"Text: {row['Text']}\nActual Label: {row['Actual Label']}\nModel Prediction: {row['Model Prediction']}\nGPT Prediction: {row['gpt_pred']}\n"

formatted_output = output_df.head(10).apply(format_output, axis=1).tolist()
print("\n".join(formatted_output))

Text:  rain
Actual Label: neutral
Model Prediction: neutral
GPT Prediction: neutral

Text:  umm well i only go to house clubs and i never go to north beach so.no idea, sorry  been out 1x there 2 a now defunctlesi club
Actual Label: negative
Model Prediction: neutral
GPT Prediction: neutral

Text: getting ready to head out to Camp Allen.  Unless somethings changed, that means no phone service for about 24 hours.
Actual Label: neutral
Model Prediction: neutral
GPT Prediction: neutral

Text: Twitter won`t let me update online. My update box won`t work.
Actual Label: negative
Model Prediction: negative
GPT Prediction: negative

Text:  is over
Actual Label: neutral
Model Prediction: neutral
GPT Prediction: neutral

Text: _Photography Good Morning! Hope you have a great day!!
Actual Label: positive
Model Prediction: positive
GPT Prediction: positive

Text: started her new job today! aaand so stoked for may long..  and billy is awesome.
Actual Label: positive
Model Prediction: positive
GPT Pr