README

*   Change the file paths to the directory where you store the datasets and models.
*   To only view the results, skip the training cell (`# run the entire pipeline`) and go to the results cell (`# results showing with metrics`).
*   Two best models accompany this script: `best_model_state_pap3.bin`, trained on the original pap dataset, and `best_model_state_pep3.bin`, trained on the augmented dataset (the original data plus pep3k). Both can be downloaded from [best models](https://drive.google.com/drive/folders/1asb1DWySYAU-unzVW-58LFynRm2UQ5D5?usp=sharing). Save them to your path and, in the results cell, load the one whose results you want to see.
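
One way to make the path change in the first bullet a single edit is to build every path from one base directory. This is only a sketch: `BASE_DIR` and the other constant names are hypothetical and do not appear in the notebook, which hardcodes the full paths in each cell.

```python
from pathlib import Path

# Hypothetical base directory; replace with the folder that holds your data and models.
BASE_DIR = Path("/content/drive/MyDrive")

PAP_DIR = BASE_DIR / "pap"      # train.csv, dev.csv, test.csv and the saved models
PEP3K_DIR = BASE_DIR / "pep3k"  # train.csv used for augmentation

TRAIN_CSV = PAP_DIR / "train.csv"
PEP3K_TRAIN_CSV = PEP3K_DIR / "train.csv"
BEST_MODEL_PAP = PAP_DIR / "best_model_state_pap3.bin"
BEST_MODEL_PEP = PAP_DIR / "best_model_state_pep3.bin"

print(TRAIN_CSV.as_posix())
print(BEST_MODEL_PEP.name)
```

With this layout, moving the project means changing `BASE_DIR` only.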








In [1]:
# mount Google Drive in Colab to read the data and save the models
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# install library requirements
!pip install transformers torch pandas scikit-learn



In [3]:
# imports
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
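from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated and missing from recent releases
from transformers import RobertaTokenizer, RobertaForSequenceClassification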
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score


In [4]:
# dataset class
class SemanticDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        text = str(self.texts[item])
        label = self.labels[item]

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt',
            truncation=True
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
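
The effect of `padding='max_length'` together with `truncation=True` can be illustrated without downloading a tokenizer. The sketch below is a simplification under stated assumptions: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and pad id 1 is what `roberta-base` uses to my knowledge (check `tokenizer.pad_token_id`). It also cuts the sequence naively, whereas the real tokenizer keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_len, pad_id=1):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids[:max_len])   # naive truncation (assumption, see lead-in)
    mask = [1] * len(ids)             # 1 marks real tokens
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# Hypothetical ids for a three-word sentence wrapped in start/end tokens.
example_ids = [0, 100, 200, 300, 2]
input_ids, attention_mask = pad_or_truncate(example_ids, max_len=8)
print(input_ids)       # [0, 100, 200, 300, 2, 1, 1, 1]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Because every item has the same length, the default `DataLoader` collation can stack them into a batch tensor.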


In [5]:
#parameter configuration
MAX_LEN = 128
BATCH_SIZE = 8
EPOCHS = 3  # lower to 1 if long training runs get interrupted
LEARNING_RATE = 2e-5
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')



In [6]:
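# data loader function
def create_data_loader(df, tokenizer, max_len, batch_size, mode=1):
    # mode=1: derive numeric labels from the string column 'original_label';
    # any other mode expects df to already contain a 'label_num' column
    if mode == 1:
        label_mapping = {'plausible': 1, 'implausible': 0}
        df['label_num'] = df['original_label'].map(label_mapping)
    ds = SemanticDataset(
        texts=df.text.to_numpy(),
        labels=df.label_num.to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )
    # no shuffle here: batches follow the row order of df
    return DataLoader(ds, batch_size=batch_size, num_workers=4)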


In [7]:
# evaluate model
def evaluate(model, data_loader, device, mode=1):
    model.eval()  # Set the model to evaluation mode
    predictions, true_labels = [], []
    total_loss = 0

    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits

            # Accumulate the loss
            total_loss += loss.item()

            # Convert to class predictions
            preds = torch.argmax(logits, dim=1).cpu().numpy()
            labels = labels.cpu().numpy()

            predictions.extend(preds)
            true_labels.extend(labels)

    # Calculate the average loss
    avg_loss = total_loss / len(data_loader)

    # Calculate metrics
    precision = precision_score(true_labels, predictions)
    recall = recall_score(true_labels, predictions)
    accuracy = accuracy_score(true_labels, predictions)
    f1 = f1_score(true_labels, predictions)

    # mode=1 prints the full metric report; any other value stays silent
    if mode == 1:
        print(f'Test Loss: {avg_loss}')
        print(f'Accuracy: {accuracy}')
        print(f'Precision: {precision}')
        print(f'Recall: {recall}')
        print(f'F1 Score: {f1}')

    return accuracy, avg_loss

In [8]:
# training function
def train_epoch(model, data_loader, optimizer, device, n_examples):
    model.train()
    losses = []
    correct_predictions = 0
    Trained_sample_count = 0
    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        labels = d["labels"].to(device)
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        Trained_sample_count += labels.size(0)  # count real samples; the last batch can be smaller than BATCH_SIZE

        loss = outputs.loss
        correct_predictions += torch.sum(outputs.logits.argmax(1) == labels)
        losses.append(loss.item())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # progress report whenever the running sample count is a multiple of 50
        if Trained_sample_count % 50 == 0:
            print("finished:", Trained_sample_count, "accuracy:", correct_predictions.double() / Trained_sample_count)

    return correct_predictions.double() / n_examples, sum(losses) / len(losses)
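
The running accuracy printed during training divides by the number of samples seen so far, so that count has to reflect real batch sizes. The sketch below, with a hypothetical dataset of 20 examples and the helper name `batch_sizes` invented for illustration, shows how adding a fixed `BATCH_SIZE` per step overcounts when the last batch is partial.

```python
def batch_sizes(n_examples, batch_size):
    """Sizes of the batches a non-shuffling, non-dropping loader would yield."""
    full, rest = divmod(n_examples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


n_examples, batch_size = 20, 8  # hypothetical numbers
sizes = batch_sizes(n_examples, batch_size)

naive_count = batch_size * len(sizes)  # adds batch_size on every step
actual_count = sum(sizes)              # adds the real size of each batch

print(sizes)         # [8, 8, 4]
print(naive_count)   # 24
print(actual_count)  # 20
```

Dividing the number of correct predictions by the naive count would understate the running accuracy on the final step.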


In [9]:
# main training loop
def train_model(model, train_data_loader, val_data_loader, device, n_epochs):
    best_accuracy = 0
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

    for epoch in range(n_epochs):
        print(f'Epoch {epoch + 1}/{n_epochs}')
        train_acc, train_loss = train_epoch(
            model,
            train_data_loader,
            optimizer,
            device,
            len(train_data_loader.dataset)  # number of training examples, without relying on the global train_df
        )
        print(f'Train loss {train_loss} accuracy {train_acc}')

        val_acc, val_loss = evaluate(model,val_data_loader,device,0)
        print(f'Validation loss {val_loss} accuracy {val_acc}')

        if val_acc > best_accuracy:
            torch.save(model.state_dict(), '/content/drive/MyDrive/pap/best_model_state.bin')
            best_accuracy = val_acc
            print('Saved Best Model')

    print ('model training finished')
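
The checkpoint rule in the loop above (save only when validation accuracy strictly improves) can be looked at in isolation. This is a minimal sketch with made-up accuracies; `best_epoch` is a hypothetical helper, not part of the notebook.

```python
def best_epoch(val_accuracies):
    """Return (epoch, accuracy) of the checkpoint that would be kept."""
    best_acc = 0.0
    kept = None
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:  # strict comparison: a tie keeps the earlier epoch
            best_acc = acc
            kept = epoch
    return kept, best_acc


# Hypothetical validation accuracies for three epochs.
print(best_epoch([0.70, 0.78, 0.78]))  # (2, 0.78)
```

Because the file on disk is the best epoch rather than the last one, evaluating the in-memory model right after training is not necessarily evaluating the saved checkpoint.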

In [10]:
# load data

'''
# 1 original training data: with pap dataset
train_df = pd.read_csv('/content/drive/MyDrive/pap/train.csv')
val_df = pd.read_csv('/content/drive/MyDrive/pap/dev.csv')
test_df = pd.read_csv('/content/drive/MyDrive/pap/test.csv')
'''
# 2 augmented training data: add pep3k data into original pap dataset
train_df1 = pd.read_csv('/content/drive/MyDrive/pap/train.csv')
train_df2 = pd.read_csv('/content/drive/MyDrive/pep3k/train.csv')
label_mapping = {1:'plausible',0:'implausible'}
train_df2['original_label'] = train_df2['label'].map(label_mapping)
train_df2 = train_df2[['text', 'original_label']]
train_df1 = train_df1[['text', 'original_label']]

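train_df = pd.concat([train_df1, train_df2], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
train_df = train_df.sample(frac=1).reset_index(drop=True)  # shuffle rows; the result has to be assigned back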
val_df = pd.read_csv('/content/drive/MyDrive/pap/dev.csv')
test_df = pd.read_csv('/content/drive/MyDrive/pap/test.csv')
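
The pep3k labels are integers, while the data loader expects the string labels used by pap, so the cell above maps them to strings and `create_data_loader` maps them back. The round trip can be sketched with plain dictionaries; the label list is hypothetical.

```python
INT_TO_NAME = {1: 'plausible', 0: 'implausible'}                  # as in the loading cell
NAME_TO_INT = {name: num for num, name in INT_TO_NAME.items()}    # as in create_data_loader

pep3k_labels = [1, 0, 1, 1]  # hypothetical integer labels
as_names = [INT_TO_NAME[x] for x in pep3k_labels]
round_trip = [NAME_TO_INT[n] for n in as_names]

print(as_names)
print(round_trip == pep3k_labels)

# A label outside the mapping has no entry. pandas Series.map would yield NaN
# for it rather than raising, so unexpected labels are worth checking for.
print(NAME_TO_INT.get('unknown'))
```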





In [None]:
# run the entire pipeline
try:
    train_data_loader = create_data_loader(train_df, tokenizer, MAX_LEN, BATCH_SIZE)
    val_data_loader = create_data_loader(val_df, tokenizer, MAX_LEN, BATCH_SIZE)
    test_data_loader = create_data_loader(test_df, tokenizer, MAX_LEN, BATCH_SIZE)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = RobertaForSequenceClassification.from_pretrained('roberta-base')
    # optional: uncomment one line to continue training from a saved best model instead of plain roberta-base
    #model.load_state_dict(torch.load('/content/drive/MyDrive/pap/best_model_state_pap3.bin'))
    #model.load_state_dict(torch.load('/content/drive/MyDrive/pap/best_model_state_pep3.bin'))
    model = model.to(device)

    train_model(model, train_data_loader, val_data_loader, device, EPOCHS)
    evaluate(model, test_data_loader, device)

except Exception as e:
    print(f"An error occurred: {e}")

In [12]:
# results showing with metrics
test_data_loader = create_data_loader(test_df, tokenizer, MAX_LEN, BATCH_SIZE)
model_temp = RobertaForSequenceClassification.from_pretrained('roberta-base')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_temp.load_state_dict(torch.load('/content/drive/MyDrive/pap/best_model_state_pap3.bin', map_location=device)) # best model trained on the original pap dataset
#model_temp.load_state_dict(torch.load('/content/drive/MyDrive/pap/best_model_state_pep3.bin', map_location=device)) # best model trained on pap augmented with pep3k
model_temp = model_temp.to(device)

evaluate(model_temp, test_data_loader, device);

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Test Loss: 0.44177262620492413
Accuracy: 0.8103448275862069
Precision: 0.8604651162790697
Recall: 0.7789473684210526
F1 Score: 0.8176795580110496
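
As a sanity check on the printed figures, F1 should equal the harmonic mean of precision and recall. The sketch below recomputes it from the reported values. The confusion-matrix counts are not in the notebook output: they are inferred as the smallest integers that reproduce all four reported numbers, so treat them as an assumption rather than a reported result.

```python
# Metrics as printed by the results cell.
precision = 0.8604651162790697
recall = 0.7789473684210526
reported_f1 = 0.8176795580110496
reported_accuracy = 0.8103448275862069

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(abs(f1 - reported_f1) < 1e-9)

# Inferred counts (assumption): smallest integers consistent with the output.
tp, fp, fn, tn = 74, 12, 21, 67
print(abs(tp / (tp + fp) - precision) < 1e-9)
print(abs(tp / (tp + fn) - recall) < 1e-9)
print(abs((tp + tn) / (tp + fp + fn + tn) - reported_accuracy) < 1e-9)
```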


In [13]:
# Extract incorrectly predicted samples
test_texts = test_df['text'].tolist()

# named differently from evaluate() above so the metric function is not overwritten
def collect_wrong_samples(model, data_loader, device, texts):
    model.eval()
    wrong_samples = []
    index = 0  # offset of the current batch within texts; valid because the loader does not shuffle

    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()
            labels = labels.cpu().numpy()

            # Checking and collecting incorrectly predicted samples
            for i, (pred, label) in enumerate(zip(preds, labels)):
                if pred != label:
                    wrong_samples.append((texts[index + i], pred, label))
            index += len(labels)
    return wrong_samples

test_data_loader = create_data_loader(test_df, tokenizer, MAX_LEN, BATCH_SIZE)
model_temp = RobertaForSequenceClassification.from_pretrained('roberta-base')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#model_temp.load_state_dict(torch.load('/content/drive/MyDrive/pap/best_model_state_pap3.bin', map_location=device)) # best model trained on the original pap dataset
model_temp.load_state_dict(torch.load('/content/drive/MyDrive/pap/best_model_state_pep3.bin', map_location=device)) # best model trained on pap augmented with pep3k

model_temp = model_temp.to(device)
wrong_samples = collect_wrong_samples(model_temp, test_data_loader, device, test_texts)

# print wrongly predicted samples
for text, pred, label in wrong_samples:
    print(f'Text: {text}, Predicted: {pred}, Actual: {label}')
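
The cell above recovers the text of each misclassified sample by keeping a running offset into the list of texts, which only works when the loader yields batches in the original order. The same bookkeeping can be sketched without a model; `collect_mismatches` and all sample data below are hypothetical.

```python
def collect_mismatches(texts, batched_preds, batched_labels):
    """Map per-batch predictions back to texts via a running offset."""
    wrong = []
    offset = 0  # index in texts of the first item of the current batch
    for preds, labels in zip(batched_preds, batched_labels):
        for i, (pred, label) in enumerate(zip(preds, labels)):
            if pred != label:
                wrong.append((texts[offset + i], pred, label))
        offset += len(labels)  # advance by the real batch size
    return wrong


# Hypothetical data: five texts split into batches of size 2, 2 and 1.
texts = ['t0', 't1', 't2', 't3', 't4']
batched_preds = [[1, 0], [0, 0], [1]]
batched_labels = [[1, 1], [0, 1], [1]]

for text, pred, label in collect_mismatches(texts, batched_preds, batched_labels):
    print(f'Text: {text}, Predicted: {pred}, Actual: {label}')
```

If the loader shuffled its data, the offset would no longer line up with `texts`, and the text would have to travel with each batch instead.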



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Text: ratio outnumbers name, Predicted: 0, Actual: 1
Text: growth implies ground, Predicted: 0, Actual: 1
Text: attendee disengages norm, Predicted: 0, Actual: 1
Text: pipe incorporates layer, Predicted: 0, Actual: 1
Text: philosopher compares convenience, Predicted: 0, Actual: 1
Text: philanthropy threatens propriety, Predicted: 1, Actual: 0
Text: city lives period, Predicted: 0, Actual: 1
Text: procedure compares distribution, Predicted: 0, Actual: 1
Text: couple finds cockroach, Predicted: 1, Actual: 0
Text: elephant bounces tip, Predicted: 0, Actual: 1
Text: mob overruns palace, Predicted: 0, Actual: 1
Text: journalist pinpoints attainment, Predicted: 1, Actual: 0
Text: army rediscovers shotgun, Predicted: 0, Actual: 1
Text: choice treats water, Predicted: 0, Actual: 1
Text: property characterizes bearer, Predicted: 0, Actual: 1
Text: dolphin poses snout, Predicted: 0, Actual: 1
Text: method considers manifestation, Predicted: 1, Actual: 0
Text: battleship terrifies species, Predic