#### Link to competition
https://www.kaggle.com/competitions/runi-nlp-2023/leaderboard

#### Useful links
https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=similar

## RUN NLP 2023 course competition
#### Overview:
Welcome to the RUN NLP 2023 course competition!
This competition is designed to help you learn how to apply natural language processing (NLP) techniques and language models (LMs) to measure how similar a pair of texts is. We invite you to compete and challenge yourself.

#### Goal:
Text similarity is the task of comparing two pieces of text and measuring how close they are. It is used for a variety of purposes, including search engines, summarization, essay scoring, plagiarism detection, machine translation, and more. The right algorithm depends on the type of text and on the purpose it is used for.
Generally, text similarity algorithms compare two texts by looking at the words and phrases used in each, at the phrase or sentence structure, or at how the two documents differ in context. The algorithms may also consider other factors, such as the length of the texts, the number of words, or the number of sentences. In addition, NLP algorithms use semantic information to determine the degree of similarity between two texts. A text similarity algorithm thus gives users a fast, automatic estimate of how similar two texts are.
And this is what you'll try to achieve in this competition :)
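
As a rough illustration of the word-overlap idea described above, the sketch below computes the Jaccard similarity between the word sets of two texts. It is only a lexical baseline and not the method used later in this notebook; the function name `jaccard_similarity` and the example sentences are made up for illustration.

```python
import re


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Share of distinct lowercase words that the two texts have in common."""
    words_a = set(re.findall(r"\w+", text_a.lower()))
    words_b = set(re.findall(r"\w+", text_b.lower()))
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# 4 shared words out of 6 distinct words overall
print(jaccard_similarity("The cat sat on the mat", "A cat sat on a mat"))
```

A score like this ignores meaning, so paraphrases that use different words score low. That limitation is one reason to prefer embedding-based approaches such as the one below.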

#### Instructions:

1. Register for the competition by clicking the Register button at the top of the page.
2. Download the data and read the task description.
3. Develop your NLP language model.
4. Submit your answer.
5. Track the progress of your submission on the Leaderboard.

After the competition ends, participants with the top scores will be rewarded with certificates of recognition.
We wish you the best of luck in this competition!

~~~~

#### Acknowledgements
Special thanks to Dr. Kfir Bar, Mr. Amir Cohen, and Mr. Sahar Millis

# Install necessary libraries

In [2]:
%%capture

!pip install datasets
!pip install sentence-transformers
!pip install transformers

# Import libraries

In [3]:
import torch
from sentence_transformers import SentenceTransformer, models
from transformers import BertTokenizer
from torch.optim import Adam
from torch.utils.data import DataLoader
from tqdm import tqdm
from datasets import load_dataset
import pandas as pd

In [6]:
def read_csv_file_for_train(file_path):
    data_frame = pd.read_csv(file_path)

    data = []

    for i in range(len(data_frame)):
        data.append({
            "sentence1": data_frame["text1"][i],
            "sentence2": data_frame["text2"][i],
            "similarity_score": float(data_frame["Similarity"][i]),
            "similarity": float(data_frame["Similarity"][i])
        })

    return data

def read_csv_file_for_test(test_sentences_file_path, test_sample_label_file_path):
    test_sentences_data_frame = pd.read_csv(test_sentences_file_path)
    test_sample_label_data_frame = pd.read_csv(test_sample_label_file_path)

    data = []

    for i in range(len(test_sentences_data_frame)):
        data.append({
            "sentence1": test_sentences_data_frame["text1"][i],
            "sentence2": test_sentences_data_frame["text2"][i],
            "similarity_score": float(test_sample_label_data_frame["Category"][i]),
            "similarity": float(test_sample_label_data_frame["Category"][i])
        })

    return data

train = read_csv_file_for_train("./nlp_2023_train.csv")
test = read_csv_file_for_test("./nlp_2023_test.csv", "./nlp_2023_sample.csv")

# Fetch data for training and test, as well as the tokenizer

In [7]:
# Dataset for training
dataset = train

similarity = [i['similarity_score'] for i in dataset]
normalized_similarity = [i/1.0 for i in similarity]

# Dataset for test
test_dataset = test

# Prepare test data
sentence_1_test = [i['sentence1'] for i in test_dataset]
sentence_2_test = [i['sentence2'] for i in test_dataset]
text_cat_test = [[str(x), str(y)] for x,y in zip(sentence_1_test, sentence_2_test)]

# Set the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Define Model architecture

In [8]:
class STSBertModel(torch.nn.Module):
    def __init__(self):
        super(STSBertModel, self).__init__()

        word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=128)
        pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
        self.sts_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

        # Freeze the first 196 parameter tensors; only the remaining ones are fine-tuned
        for idx, param in enumerate(self.sts_model.parameters()):
            if idx >= 196:
                break
            param.requires_grad = False

    def forward(self, input_data):
        output = self.sts_model(input_data)

        return output
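
`models.Pooling` uses mean pooling by default, meaning the sentence embedding is the average of the token embeddings. Below is a minimal pure-Python sketch of that idea, assuming padding positions are excluded through the attention mask. The name `mean_pool` and the toy numbers are hypothetical, and the library's own implementation operates on batched tensors rather than lists.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the embeddings of real tokens (mask == 1), skipping padding (mask == 0)."""
    dim = len(token_embeddings[0])
    n_real = max(sum(attention_mask), 1)  # guard against division by zero
    pooled = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j, value in enumerate(vector):
                pooled[j] += value
    return [total / n_real for total in pooled]


# Three tokens with 2-dimensional embeddings; the last one is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```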

# Define Dataloader for training

In [9]:
class DataSequence(torch.utils.data.Dataset):
    def __init__(self, dataset):
        similarity = [i['similarity_score'] for i in dataset]
        self.label = [i/1.0 for i in similarity]
        self.sentence_1 = [i['sentence1'] for i in dataset]
        self.sentence_2 = [i['sentence2'] for i in dataset]
        self.text_cat = [[str(x), str(y)] for x,y in zip(self.sentence_1, self.sentence_2)]

    def __len__(self):
        return len(self.text_cat)

    def get_batch_labels(self, idx):
        return torch.tensor(self.label[idx])

    def get_batch_texts(self, idx):
        return tokenizer(self.text_cat[idx], padding='max_length', max_length = 128, truncation=True, return_tensors="pt")

    def __getitem__(self, idx):
        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        return batch_texts, batch_y

def collate_fn(texts):
    """Split a batched encoding into a list with one feature dict per sentence pair."""
    num_texts = len(texts['input_ids'])
    features = list()
    for i in range(num_texts):
        features.append({'input_ids': texts['input_ids'][i], 'attention_mask': texts['attention_mask'][i]})

    return features
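
To make the reshaping concrete: the `DataLoader` stacks the per-item encodings into one dictionary of batched values, and `collate_fn` turns that back into one feature dictionary per sentence pair. The sketch below mimics this with nested lists in place of tensors. `split_batch` is a stand-in name and the token ids are made up.

```python
from typing import Dict, List


def split_batch(batch: Dict[str, list]) -> List[Dict[str, list]]:
    """Turn a dict of batched fields into a list with one dict per sentence pair."""
    return [
        {"input_ids": ids, "attention_mask": mask}
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]


# A batch of 2 sentence pairs; each pair holds 2 sequences of 3 (made-up) token ids
batch = {
    "input_ids": [[[101, 7, 102], [101, 8, 102]], [[101, 9, 102], [101, 5, 0]]],
    "attention_mask": [[[1, 1, 1], [1, 1, 1]], [[1, 1, 1], [1, 1, 0]]],
}
features = split_batch(batch)
print(len(features))             # 2
print(features[1]["input_ids"])  # [[101, 9, 102], [101, 5, 0]]
```

Each resulting dictionary holds both sentences of one pair, which is why the model later returns two embeddings per feature.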

# Define loss function for training

In [10]:
class CosineSimilarityLoss(torch.nn.Module):
    def __init__(self,  loss_fct = torch.nn.MSELoss(), cos_score_transformation=torch.nn.Identity()):
        super(CosineSimilarityLoss, self).__init__()
        self.loss_fct = loss_fct
        self.cos_score_transformation = cos_score_transformation
        self.cos = torch.nn.CosineSimilarity(dim=1)

    def forward(self, input, label):
        embedding_1 = torch.stack([inp[0] for inp in input])
        embedding_2 = torch.stack([inp[1] for inp in input])

        output = self.cos_score_transformation(self.cos(embedding_1, embedding_2))

        return self.loss_fct(output, label.squeeze())
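
Here is a small worked example of what this loss computes, written in plain Python instead of torch so the arithmetic is visible: one cosine similarity per sentence pair, then the mean squared error against the gold scores. The helper names and the toy vectors are illustrative assumptions, and the sketch omits the small epsilon that `torch.nn.CosineSimilarity` uses to avoid division by zero.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(pairs: List[Tuple[Sequence[float], Sequence[float]]],
                    labels: List[float]) -> float:
    """Mean squared error between predicted cosine similarities and the labels."""
    predictions = [cosine_similarity(u, v) for u, v in pairs]
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


pairs = [([1.0, 0.0], [1.0, 0.0]),   # same direction -> cosine 1.0
         ([1.0, 0.0], [0.0, 1.0])]   # orthogonal -> cosine 0.0
labels = [1.0, 0.5]
print(cosine_mse_loss(pairs, labels))  # 0.125
```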

# Train the Model

In [11]:
def model_train(dataset, epochs, learning_rate, bs):
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    print(device)

    model = STSBertModel()

    criterion = CosineSimilarityLoss()
    optimizer = Adam(model.parameters(), lr=learning_rate)

    train_dataset = DataSequence(dataset)
    train_dataloader = DataLoader(train_dataset, num_workers=4, batch_size=bs, shuffle=True)

    if use_cuda:
        model = model.cuda()
        criterion = criterion.cuda()

    for i in range(epochs):
        # Keep the model in training mode (dropout active) for every epoch
        model.train()
        total_loss_train = 0.0

        for train_data, train_label in tqdm(train_dataloader):
            train_data['input_ids'] = train_data['input_ids'].to(device)
            train_data['attention_mask'] = train_data['attention_mask'].to(device)
            del train_data['token_type_ids']

            # Split the batched encoding into one feature dict per sentence pair
            train_data = collate_fn(train_data)

            output = [model(feature)['sentence_embedding'] for feature in train_data]

            loss = criterion(output, train_label.to(device))
            total_loss_train += loss.item()

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        # The criterion already averages within each batch, so this figure is the
        # summed batch losses divided by the number of samples, not a per-sample mean
        print(f'Epochs: {i + 1} | Loss: {total_loss_train / len(dataset): .3f}')

    return model

EPOCHS = 1
LEARNING_RATE = 1e-3
BATCH_SIZE = 64

# Train the model
trained_model = model_train(dataset, EPOCHS, LEARNING_RATE, BATCH_SIZE)

cuda


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 1094/1094 [21:34<00:00,  1.18s/it]

Epochs: 1 | Loss:  0.004





In [12]:
# Function to predict test data
def predict_sts(texts):
    """Return the cosine similarity between the two sentences in `texts`."""
    trained_model.to('cpu')
    trained_model.eval()
    test_input = tokenizer(texts, padding='max_length', max_length=128, truncation=True, return_tensors="pt")
    del test_input['token_type_ids']

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        test_output = trained_model(test_input)['sentence_embedding']
    sim = torch.nn.functional.cosine_similarity(test_output[0], test_output[1], dim=0).item()

    return sim

# Predict on test data

In [13]:
predict_sts(text_cat_test[245])

0.659848153591156

In [None]:
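# predict_sts expects a single [sentence1, sentence2] pair, so score the pairs one by one
predictions = [predict_sts(pair) for pair in text_cat_test]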