# **Paraphrase Adversaries from Word Scrambling (PAWS)**

PAWS dataset: https://huggingface.co/datasets/google-research-datasets/paws

In the rapidly evolving field of Natural Language Processing (NLP), one of the pivotal challenges is understanding and generating human language in a way that is both meaningful and contextually accurate. Among the various tasks within NLP, paraphrase identification (determining whether two sentences convey the same meaning despite differences in wording) plays a crucial role in applications such as machine translation, information retrieval, and conversational agents.

The PAWS dataset, created by researchers at Google, is specifically designed to test the limits of current models by introducing challenging examples where paraphrases are distinguished by subtle yet significant differences in word order and structure.
As a benchmark we used the PAWS paper by Yuan Zhang, Jason Baldridge, and Luheng He, which is available at the following [link](https://arxiv.org/abs/1904.01130).


## **Project Outline:**
[**Task 0**](#task0): Package Importing and Dataset loading

[**Task 1**](#task1): Bag of Words

[**Task 2**](#task2): Embedding Layer

[**Task 3**](#task3): RoBERTa combined with a Siamese Network

[**Task 4**](#task4): DeBERTaV3

<a name='task0'></a>
# Task 0: Importing and Loading

In [None]:
!pip install -q --upgrade keras_nlp
!pip install -q --upgrade keras
!pip install datasets


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import string
import copy
import random

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Flatten, Embedding, Concatenate
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import keras_nlp
import tensorflow_datasets as tfds
from transformers import AutoTokenizer, AutoModel, AdamW, get_linear_schedule_with_warmup

tf.keras.mixed_precision.set_global_policy("mixed_float16")

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from datasets import load_dataset

from tqdm import tqdm


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
dataset = load_dataset("paws", "labeled_final")



Downloading readme:   0%|          | 0.00/9.79k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/8.43M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.24M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.23M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/49401 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/8000 [00:00<?, ? examples/s]

In [None]:
paws_ds = tfds.load(
    "paws_wiki",
)
paws_train, paws_valid = paws_ds["train"], paws_ds["validation"]

Downloading and preparing dataset 57.47 MiB (download: 57.47 MiB, generated: 17.96 MiB, total: 75.43 MiB) to /root/tensorflow_datasets/paws_wiki/labeled_final_tokenized/1.1.0...


Dl Completed...: 0 url [00:00, ? url/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]

Extraction completed...: 0 file [00:00, ? file/s]

Generating splits...:   0%|          | 0/3 [00:00<?, ? splits/s]

Generating train examples...:   0%|          | 0/49401 [00:00<?, ? examples/s]

Shuffling /root/tensorflow_datasets/paws_wiki/labeled_final_tokenized/incomplete.LDSX85_1.1.0/paws_wiki-train.…

Generating validation examples...:   0%|          | 0/8000 [00:00<?, ? examples/s]

Shuffling /root/tensorflow_datasets/paws_wiki/labeled_final_tokenized/incomplete.LDSX85_1.1.0/paws_wiki-valida…

Generating test examples...:   0%|          | 0/8000 [00:00<?, ? examples/s]

Shuffling /root/tensorflow_datasets/paws_wiki/labeled_final_tokenized/incomplete.LDSX85_1.1.0/paws_wiki-test.t…

Dataset paws_wiki downloaded and prepared to /root/tensorflow_datasets/paws_wiki/labeled_final_tokenized/1.1.0. Subsequent calls will reuse this data.


<a name='task1'></a>
# Task 1: Bag Of Words method

In [None]:
train_df = pd.DataFrame(dataset['train'])
val_df = pd.DataFrame(dataset['validation'])
test_df = pd.DataFrame(dataset['test'])

# Combine sentence1 and sentence2 into a single list for BoW representation
texts = train_df['sentence1'].tolist() + train_df['sentence2'].tolist()

# Prepare labels by duplicating them for both sentence1 and sentence2
labels = pd.concat([train_df['label'], train_df['label']], ignore_index=True)

# Preprocess the text: lowercasing and tokenizing
vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# Fit the vectorizer on the text data
X = vectorizer.fit_transform(texts)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train a simple classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.5231516623652649


<a name='task2'></a>
# Task 2: Embeddings Layer and Double Input model

## Preprocessing

In [None]:
### Override the previous training and testing data, this time keeping the pairs

X_train, X_test, y_train, y_test = train_test_split(train_df[['sentence1', 'sentence2']].to_numpy(), train_df['label'].to_numpy(), test_size=0.2, random_state=42);

In [None]:
### Tokenize the text

def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return tokens

tokenized_train = [(preprocess_text(sen1), preprocess_text(sen2)) for sen1, sen2 in X_train]
tokenized_test = [(preprocess_text(sen1), preprocess_text(sen2)) for sen1, sen2 in X_test]
print("Tokenized example:", tokenized_train[0])

Tokenized example: (['river', 'scridoasa', 'tributary', 'river', 'botizu', 'romania'], ['botizu', 'river', 'tributary', 'scridoasa', 'river', 'romania'])
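
The tokenized example above already hints at why order-insensitive features struggle on PAWS: both sentences contain the same tokens, only in a different order. The following minimal sketch uses that pair (copied from the printed output) to check that the two bag-of-words counts are identical even though the sequences differ. The variable names are ours and are used only for illustration.

```python
from collections import Counter

# Token lists copied from the "Tokenized example" output above.
sen1 = ['river', 'scridoasa', 'tributary', 'river', 'botizu', 'romania']
sen2 = ['botizu', 'river', 'tributary', 'scridoasa', 'river', 'romania']

# A bag-of-words representation keeps only the per-token counts.
bow1 = Counter(sen1)
bow2 = Counter(sen2)

same_bag = bow1 == bow2      # compares the counts, ignoring order
same_order = sen1 == sen2    # compares the sequences, order included

print(f"Same bag of words: {same_bag}")
print(f"Same word order:   {same_order}")
```

Any model that sees only token counts receives the same input for both sentences, so it cannot use word order to tell them apart.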


In [None]:
### Count how often each token appears and sort the tokens by frequency

all_words = [word for pair in tokenized_train for sublist in pair for word in sublist]

word_counts = Counter(all_words)
sorted_words_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True)

# Keep only the 29,999 most frequent tokens and add an "<unk>" token for everything else
sorted_words_by_frequency = sorted_words_by_frequency[:29999] + ["<unk>"]
print(len(sorted_words_by_frequency))

30000


In [None]:
### Vocabulary creation and conversion from our text to sequences of tokens

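# Assign ids in frequency order (index 0 is reserved for padding).
# Iterating over the list directly, rather than over set(...), keeps the ids reproducible between runs.
vocab = {token: idx + 1 for idx, token in enumerate(sorted_words_by_frequency)}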

def preprocess_data_2(X, vocab):
    processed_pairs = []
    for sen1, sen2 in X:
        processed_sen1 = []
        processed_sen2 = []

        for word in sen1:
            word_lower = word.lower()
            if vocab.get(word_lower) is None:
                processed_sen1.append("<unk>")
            else:
                processed_sen1.append(word_lower)

        for word in sen2:
            word_lower = word.lower()
            if vocab.get(word_lower) is None:
                processed_sen2.append("<unk>")
            else:
                processed_sen2.append(word_lower)

        processed_pairs.append((processed_sen1, processed_sen2))
    return processed_pairs

tokenized_test = preprocess_data_2(tokenized_test, vocab)
tokenized_train = preprocess_data_2(tokenized_train, vocab)

In [None]:
### Convert our Token sequences to int sequences

def convert_to_sequences(pairs_of_tokenized_texts, vocab):
    converted_sequences = []
    for sen1, sen2 in pairs_of_tokenized_texts:
        seq1 = [vocab.get(token, vocab.get("<unk>")) for token in sen1]
        seq2 = [vocab.get(token, vocab.get("<unk>")) for token in sen2]
        converted_sequences.append((seq1, seq2))
    return converted_sequences

X_train_seq = convert_to_sequences(tokenized_train, vocab)
X_test_seq = convert_to_sequences(tokenized_test, vocab)
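
The two cells above first replace out-of-vocabulary tokens with `"<unk>"` and then map tokens to integers. As a minimal sketch, both steps can be folded into a single dictionary lookup with a default value. The helpers `build_vocab` and `encode` and the toy corpus are hypothetical names introduced only for this illustration; they are not part of the notebook's pipeline.

```python
from collections import Counter

def build_vocab(token_lists, max_size):
    """Map the most frequent tokens to ids 1..max_size-1 and '<unk>' to the last id.

    Id 0 is left unused so it can serve as the padding value.
    """
    counts = Counter(tok for toks in token_lists for tok in toks)
    most_common = [tok for tok, _ in counts.most_common(max_size - 1)]
    vocab = {tok: idx + 1 for idx, tok in enumerate(most_common)}
    vocab["<unk>"] = len(vocab) + 1
    return vocab

def encode(tokens, vocab):
    """Convert tokens to ids, sending unknown tokens to the '<unk>' id."""
    unk_id = vocab["<unk>"]
    return [vocab.get(tok, unk_id) for tok in tokens]

# Toy corpus: "river" appears 3 times, "tributary" twice, "romania" once.
toy_corpus = [["river", "tributary", "river"], ["river", "tributary", "romania"]]
toy_vocab = build_vocab(toy_corpus, max_size=3)   # 2 real tokens + <unk>

print(toy_vocab)
print(encode(["river", "romania", "danube"], toy_vocab))
```

In this toy setup "romania" falls outside the size limit and "danube" was never seen, so both end up with the `"<unk>"` id.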

In [None]:
### Determine the maximum length of the sequences and pad the sequences

def pad_sequence_pairs(pairs_of_sequences, maxlen, padding='post'):
    padded_pairs = []
    for seq1, seq2 in pairs_of_sequences:
        padded_seq1 = pad_sequences([seq1], maxlen=maxlen, padding=padding)[0]
        padded_seq2 = pad_sequences([seq2], maxlen=maxlen, padding=padding)[0]
        padded_pairs.append((padded_seq1, padded_seq2))
    return padded_pairs

max_length = max(max(len(seq1), len(seq2)) for seq1, seq2 in X_train_seq)
X_train_padded = np.array(pad_sequence_pairs(X_train_seq, maxlen=max_length, padding='post'))
X_test_padded = np.array(pad_sequence_pairs(X_test_seq, maxlen=max_length, padding='post'))

X_train_padded_A = X_train_padded[:, 0, :]
X_train_padded_B = X_train_padded[:, 1, :]
X_test_padded_A = X_test_padded[:, 0, :]
X_test_padded_B = X_test_padded[:, 1, :]
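
To make the effect of `padding='post'` concrete, here is a plain-Python stand-in for the padding step. It is a sketch, not the Keras implementation: `pad_post` and `pad_pairs` are hypothetical helpers, and cutting over-long sequences from the front is an assumption on our side that we believe mirrors the Keras default (`truncating='pre'`). It assumes `maxlen` is at least 1.

```python
def pad_post(seq, maxlen, value=0):
    """Right-pad seq with value up to maxlen; keep the last maxlen items if it is longer."""
    seq = list(seq)[-maxlen:]
    return seq + [value] * (maxlen - len(seq))

def pad_pairs(pairs, maxlen):
    """Pad both sides of every (seq1, seq2) pair and return them as two separate lists."""
    side_a = [pad_post(s1, maxlen) for s1, _ in pairs]
    side_b = [pad_post(s2, maxlen) for _, s2 in pairs]
    return side_a, side_b

# Toy integer sequences standing in for the encoded sentence pairs.
toy_pairs = [([4, 7], [7, 4, 9]), ([1], [2, 3, 5, 8])]
toy_maxlen = max(max(len(a), len(b)) for a, b in toy_pairs)

padded_a, padded_b = pad_pairs(toy_pairs, toy_maxlen)
print(padded_a)
print(padded_b)
```

Because index 0 is never assigned to a real token in the vocabulary, the zeros added here can only mean "padding".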

## Model Definition

In [None]:
### Creation of the model

vocab_size = len(vocab) + 1  # +1 because index 0 is reserved for padding
embedding_dim = 50


inputA = Input(shape=(max_length,), dtype='int32', name="inputA")
inputB = Input(shape=(max_length,), dtype='int32', name="inputB")

# Each input gets its own embedding layer (the weights are not shared between the two sentences)
x1 = Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length)(inputA)
x2 = Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length)(inputB)

x1 = Flatten()(x1)
x2 = Flatten()(x2)

merge = Concatenate()([x1, x2])
merge = Dense(64, activation='relu', name="merge")(merge)
out = Dense(1, activation="sigmoid", name="output")(merge)

model = Model(inputs=[inputA, inputB], outputs=out, name='Combined')

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])



In [None]:
### Fit the model to our data

model.fit([X_train_padded_A, X_train_padded_B], y_train, epochs=1, batch_size=8);

4940/4940 ━━━━━━━━━━━━━━━━━━━━ 15s 2ms/step - accuracy: 0.5656 - loss: 0.6845


## Model Evaluation

In [None]:
### Evaluation of our model
loss, accuracy = model.evaluate([X_test_padded_A, X_test_padded_B], y_test)
print(f"Accuracy: {accuracy}")

309/309 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.6005 - loss: 0.6560
Accuracy: 0.6013561487197876


<a name='task3'></a>
# Task 3: RoBERTa + Siamese Network

This method is inspired by [this](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/sentence_embeddings_with_sbert.ipynb) Colab notebook.

## Preprocessing

In [None]:
### Prepare our dataset for our model

TRAIN_BATCH_SIZE = 6
VALIDATION_BATCH_SIZE = 8

TRAIN_NUM_BATCHES = 1000
VALIDATION_NUM_BATCHES = 30

AUTOTUNE = tf.data.experimental.AUTOTUNE

def prepare_dataset(dataset, num_batches, batch_size):
    dataset = dataset.map(
        lambda z: (
            [z["sentence1"], z["sentence2"]],
            [tf.cast(z["label"], tf.float32)],
        ),
        num_parallel_calls=AUTOTUNE,
    )
    dataset = dataset.batch(batch_size)
    dataset = dataset.take(num_batches)
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset

# Prepare the datasets
paws_train_prepared = prepare_dataset(paws_train, TRAIN_NUM_BATCHES, TRAIN_BATCH_SIZE)
paws_valid_prepared = prepare_dataset(paws_valid, VALIDATION_NUM_BATCHES, VALIDATION_BATCH_SIZE)

In [None]:
X_train = list(map(lambda x: x[0], paws_train_prepared))
y_train = list(map(lambda x: x[1], paws_train_prepared))

X_train_np = np.array(X_train)
X_train_np = X_train_np.reshape(-1, 2)
y_train_np = np.array(y_train)
y_train_np = y_train_np.reshape(-1, 1)

X_valid = list(map(lambda x: x[0], paws_valid_prepared))
y_valid = list(map(lambda x: x[1], paws_valid_prepared))

X_valid_np = np.array(X_valid)
X_valid_np = X_valid_np.reshape(-1, 2)
y_valid_np = np.array(y_valid)
y_valid_np = y_valid_np.reshape(-1, 1)

In [None]:
### Print some examples of our input data

for x, y in paws_train_prepared:
    for i, example in enumerate(x):
        print(f"Sentence 1 : {example[0]} ")
        print(f"Sentence 2 : {example[1]} ")
        print(f"Label Value : {y[i]} \n")
    break

Sentence 1 : b'Hugo K\xc3\xa4ch died on December 31 , 2003 in Schaffhausen near Flurlingen , Germany .' 
Sentence 2 : b'Hugo K\xc3\xa4ch died on 31 December 2003 in Flurlingen near Schaffhausen .' 
Label Value : [0.] 

Sentence 1 : b'In 2013 Peter married Anna Barattin while Julia is married to Nicholas Furiuele , both are members of the band Shantih Shantih .' 
Sentence 2 : b'Peter Anna Barattin married in 2013 while Julia was married to Nicholas Furiuele , both of whom are members of the band Shantih Shantih .' 
Label Value : [1.] 

Sentence 1 : b'The recent Sierra Leone Civil War was secular in nature featuring members of Tribal , Muslim , and Christian faiths fighting on both sides of the conflict .' 
Sentence 2 : b'The recent civil war in Sierra Leone was secular in nature , with members of Christian , Muslim , and tribal faith fighting on both sides of the conflict .' 
Label Value : [1.] 

Sentence 1 : b"The campus newspaper , `` The Oklahoma Daily '' , is produced daily during t

## Model Definition

In [None]:
### Import necessary tools for RoBERTa from presets

preprocessor = keras_nlp.models.RobertaPreprocessor.from_preset("roberta_base_en")
backbone = keras_nlp.models.RobertaBackbone.from_preset("roberta_base_en")

### Define the normal encoder model
inputs = tf.keras.Input(shape=(1,), dtype="string", name="sentence")
x = preprocessor(inputs)
h = backbone(x)
embedding = tf.keras.layers.GlobalAveragePooling1D(name="pooling_layer")(
    h, x["padding_mask"]
)
n_embedding = tf.keras.layers.UnitNormalization(axis=1)(embedding)
roberta_normal_encoder = tf.keras.Model(inputs=inputs, outputs=n_embedding)

roberta_normal_encoder.summary()

Downloading from https://www.kaggle.com/api/v1/models/keras/roberta/keras/roberta_base_en/2/download/metadata.json...
100%|██████████| 141/141 [00:00<00:00, 138kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/roberta/keras/roberta_base_en/2/download/preprocessor.json...
Downloading from https://www.kaggle.com/api/v1/models/keras/roberta/keras/roberta_base_en/2/download/tokenizer.json...
100%|██████████| 463/463 [00:00<00:00, 374kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/roberta/keras/roberta_base_en/2/download/assets/tokenizer/vocabulary.json...
100%|██████████| 0.99M/0.99M [00:00<00:00, 1.78MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/roberta/keras/roberta_base_en/2/download/assets/tokenizer/merges.txt...
100%|██████████| 446k/446k [00:00<00:00, 1.00MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/roberta/keras/roberta_base_en/2/download/config.json...
100%|██████████| 498/498 [00:00<00:00, 459kB/s]
Download

In [None]:
### Define a custom layer and our final model

class CosineSimilarityLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(CosineSimilarityLayer, self).__init__(**kwargs)

    def call(self, inputs):
        u, v = inputs
        u_norm = tf.nn.l2_normalize(u, axis=1)
        v_norm = tf.nn.l2_normalize(v, axis=1)
        cosine_similarity = tf.reduce_sum(tf.multiply(u_norm, v_norm), axis=1, keepdims=True)
        probabilities = tf.sigmoid(cosine_similarity)
        return probabilities

class RegressionSiamese(tf.keras.Model):
    def __init__(self, encoder, **kwargs):
        super(RegressionSiamese, self).__init__(**kwargs)
        self.encoder = encoder
        self.cosine_similarity_layer = CosineSimilarityLayer()

    def call(self, inputs):
        sen1, sen2 = tf.split(inputs, num_or_size_splits=2, axis=1)
        sen1 = tf.squeeze(sen1, axis=1)
        sen2 = tf.squeeze(sen2, axis=1)

        u = self.encoder(sen1)
        v = self.encoder(sen2)

        probabilities = self.cosine_similarity_layer([u, v])
        return probabilities

    def get_encoder(self):
        return self.encoder

    def set_trainable(self, trainable: bool):
        self.encoder.trainable = trainable
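
One property of `CosineSimilarityLayer` is worth keeping in mind: cosine similarity lies in [-1, 1], so passing it through a sigmoid can never produce values close to 0 or 1. The short sketch below, written in plain Python so that it runs without TensorFlow, computes the resulting output range and checks that thresholding the layer output at 0.5 is the same decision as thresholding the raw cosine similarity at 0. The sample cosine values are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Cosine similarity is bounded to [-1, 1], so the layer's output is bounded too.
lowest = sigmoid(-1.0)
highest = sigmoid(1.0)
print(f"sigmoid(-1) = {lowest:.4f}, sigmoid(+1) = {highest:.4f}")

# sigmoid is monotonic and sigmoid(0) = 0.5, so "output > 0.5" is equivalent to "cosine > 0".
sample_cosines = (-0.74, -0.1, 0.0, 0.1, 0.75)
decisions_agree = all((sigmoid(c) > 0.5) == (c > 0.0) for c in sample_cosines)
print(f"Threshold decisions agree: {decisions_agree}")
```

With a mean squared error loss against labels 0 and 1, the loss therefore cannot reach zero; training can only push the cosine similarity of paraphrases towards +1 and that of non-paraphrases towards -1.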

In [None]:
### See how the encoder would compute the cosine similarity before training
### BEFORE

sentences = [
    "Today is a very sunny day.",
    "I am hungry, I will get my meal.",
    "The dog is eating his food.",
]
query = ["The dog is enjoying his meal."]

encoder = roberta_normal_encoder

sentence_embeddings = encoder(tf.constant(sentences))
query_embedding = encoder(tf.constant(query))

cosine_similarity_scores = tf.matmul(query_embedding, tf.transpose(sentence_embeddings))
for i, sim in enumerate(cosine_similarity_scores[0]):
    print(f"cosine similarity score between sentence {i+1} and the query = {sim} ")

cosine similarity score between sentence 1 and the query = 0.96630859375 
cosine similarity score between sentence 2 and the query = 0.97607421875 
cosine similarity score between sentence 3 and the query = 0.9931640625 
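
The encoder ends with a `UnitNormalization` layer, which is why a plain matrix product between the query embedding and the sentence embeddings can be read as cosine similarity. A minimal sketch with made-up 3-dimensional vectors (not real RoBERTa embeddings) checks this equivalence in plain Python; all names below are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale v to length 1, as a unit-normalization layer would."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    """Textbook cosine similarity on the raw (non-normalized) vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

toy_query = [1.0, 2.0, 2.0]
toy_sentences = [[2.0, 4.0, 4.0], [2.0, -1.0, 0.0], [-1.0, -2.0, -2.0]]

q = unit(toy_query)
dot_scores = [dot(q, unit(s)) for s in toy_sentences]
cos_scores = [cosine(toy_query, s) for s in toy_sentences]

for i, (d, c) in enumerate(zip(dot_scores, cos_scores)):
    print(f"sentence {i+1}: dot of unit vectors = {d:.4f}, cosine = {c:.4f}")
```

The three toy sentences are chosen to be parallel, orthogonal, and opposite to the query, which covers the full range of possible scores.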


In [None]:
### Compile the Model

roberta_regression_siamese = RegressionSiamese(roberta_normal_encoder)

roberta_regression_siamese.compile(
    loss=tf.keras.losses.MeanSquaredError(),
    optimizer=tf.keras.optimizers.Adam(2e-5),
    jit_compile=False,
)

In [None]:
### Fit the model

roberta_regression_siamese.fit(paws_train_prepared, validation_data=paws_valid_prepared, epochs=1);

1000/1000 ━━━━━━━━━━━━━━━━━━━━ 729s 640ms/step - loss: 0.2570 - val_loss: 0.2036


## Evaluation and Visualization

In [None]:
### See how the encoder would compute the cosine similarity after training
### AFTER

sentences = [
    "Today is a very sunny day.",
    "I am hungry, I will get my meal.",
    "The dog is eating his food.",
]
query = ["The dog is enjoying his meal."]

roberta_regression_siamese.set_trainable(False)
encoder_fin = roberta_regression_siamese.get_encoder()

sentence_embeddings = encoder_fin(tf.constant(sentences))
query_embedding = encoder_fin(tf.constant(query))

cosine_similarity_scores_ex = tf.matmul(query_embedding, tf.transpose(sentence_embeddings))
for i, sim in enumerate(cosine_similarity_scores_ex[0]):
    print(f"cosine similarity score between sentence {i+1} and the query = {sim} ")

cosine similarity score between sentence 1 and the query = 0.16015625 
cosine similarity score between sentence 2 and the query = 0.38623046875 
cosine similarity score between sentence 3 and the query = 0.7451171875 


In [None]:
### Visualize the meaning of the cosine similarity scores

n = 42

sentence1_embeddings = encoder(tf.constant([[X_train_np[n][0]]]))
sentence2_embeddings = encoder(tf.constant([[X_train_np[n][1]]]))

print(f"Sentence 1: {X_train_np[n][0]}")
print(f"Sentence 2: {X_train_np[n][1]}")
print(f"Paraphrase: {y_train_np[n]}")
print(f"Cosine Similarity Calculated: {tf.matmul(sentence1_embeddings, tf.transpose(sentence2_embeddings))}")

Sentence 1: b'The Oraciu River or Orociu is a tributary of the Pustnic River in Romania .'
Sentence 2: b'The Pustnic River or Orociu River is a tributary of the River Oraciu in Romania .'
Paraphrase: [0.]
Cosine Similarity Calculated: [[-0.7407]]


In [None]:
### Evaluate on the validation set

valid_scores = []

for sen1, sen2 in X_valid_np:
    if isinstance(sen1, bytes):
        sen1 = sen1.decode('utf-8')
    if isinstance(sen2, bytes):
        sen2 = sen2.decode('utf-8')
    # The model already applies a sigmoid, so its output is a probability-like score, not a logit
    prob = roberta_regression_siamese.predict(tf.convert_to_tensor([(sen1, sen2)], dtype=tf.string), verbose=0)
    pred = 1 if prob > 0.5 else 0
    valid_scores.append(pred)

valid_scores = np.array(valid_scores)
valid_scores = valid_scores.reshape(-1, 1)
accuracy = accuracy_score(y_valid_np, valid_scores)

print(f"Accuracy: {accuracy}")

Accuracy: 0.7083333333333334


<a name='task4'></a>
# Task 4: DeBERTaV3

This part is inspired by [this](https://www.kaggle.com/code/gabrielrasskin/debertav3-quickstart) Kaggle notebook.

## Preprocessing

In [None]:
def extract_features(sample):
    sentence1 = sample['sentence1']
    sentence2 = sample['sentence2']
    label = sample['label']
    concatenated_sentences = tf.strings.join(['[CLS] ', sentence1, ' [SEP] ', sentence2, ' [SEP]'], separator='')
    return concatenated_sentences, label

train_data = paws_ds['train'].map(extract_features).take(30000)
val_data = paws_ds['validation'].map(extract_features).take(1000)
test_data = paws_ds['test'].map(extract_features).take(1000)

In [None]:
batch_size = 16

train_data = train_data.batch(batch_size)
val_data = val_data.batch(batch_size)
test_data = test_data.batch(batch_size)

train_data = train_data.shuffle(buffer_size=1000, seed=42)

train_data = train_data.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
val_data = val_data.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
test_data = test_data.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
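
In the cell above `shuffle` is applied after `batch`, so whole batches are reordered while the examples inside each batch stay together. The plain-Python analogy below illustrates the difference between the two orderings. It is only a sketch: `batched` is a hypothetical helper, and it does not reproduce the buffer-based shuffling that `tf.data` performs.

```python
import random

def batched(items, batch_size):
    """Split items into consecutive chunks of batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

examples = list(range(12))
rng = random.Random(42)

# Batch first, then shuffle: the order of the batches changes, their contents do not.
batch_then_shuffle = batched(examples, 4)
rng.shuffle(batch_then_shuffle)

# Shuffle first, then batch: examples can be mixed across batches.
shuffled = examples[:]
rng.shuffle(shuffled)
shuffle_then_batch = batched(shuffled, 4)

print("batch -> shuffle:", batch_then_shuffle)
print("shuffle -> batch:", shuffle_then_batch)
```

If example-level shuffling is wanted, calling `shuffle` before `batch` is the usual order.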

## Import Model

In [None]:
classifier = keras_nlp.models.DebertaV3Classifier.from_preset(
    "deberta_v3_small_en",
    num_classes=2
)

classifier.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(5e-5),
    jit_compile=True,
)

classifier.backbone.trainable = True

Downloading from https://www.kaggle.com/api/v1/models/keras/deberta_v3/keras/deberta_v3_small_en/2/download/task.json...
Downloading from https://www.kaggle.com/api/v1/models/keras/deberta_v3/keras/deberta_v3_small_en/2/download/preprocessor.json...


In [None]:
classifier.fit(train_data, epochs=1)

predictions = classifier.predict(val_data)

1875/1875 ━━━━━━━━━━━━━━━━━━━━ 768s 355ms/step - loss: 0.3475 - sparse_categorical_accuracy: 0.8332
63/63 ━━━━━━━━━━━━━━━━━━━━ 14s 176ms/step


## Model Evaluation

In [None]:
true_labels = []
for _, label in val_data.unbatch():
    true_labels.append(label.numpy())

# Convert predictions to class labels
predicted_labels = tf.argmax(predictions, axis=-1).numpy()

# Calculate accuracy using sklearn
true_labels = np.array(true_labels)
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Validation accuracy: {accuracy}")

Validation accuracy: 0.937
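
The classifier is compiled with `from_logits=True`, so `predictions` contains raw logits rather than probabilities. Taking the `argmax` directly is still valid because softmax preserves the ordering of the values within each row. The sketch below checks this on made-up logits in plain Python; the numbers and helper names are hypothetical and unrelated to the model's actual outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

toy_logits = [[2.1, -0.3], [-1.0, 0.5], [0.2, 0.1]]   # made-up two-class logits
toy_labels = [0, 1, 1]

pred_from_logits = [argmax(row) for row in toy_logits]
pred_from_probs = [argmax(softmax(row)) for row in toy_logits]
toy_accuracy = sum(p == t for p, t in zip(pred_from_logits, toy_labels)) / len(toy_labels)

print("argmax on logits:       ", pred_from_logits)
print("argmax on probabilities:", pred_from_probs)
print(f"Toy accuracy: {toy_accuracy:.3f}")
```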


In [None]:
print("Evaluate on test data")
results = classifier.evaluate(test_data)
print("\nTest accuracy:", results[1])

Evaluate on test data
63/63 ━━━━━━━━━━━━━━━━━━━━ 14s 160ms/step - loss: 0.1754 - sparse_categorical_accuracy: 0.9370

Test accuracy: 0.9330000281333923
