Notes on the project:
- The [CLS] vector is produced by the BERT model and is used for classification by adding a linear layer on top of it.

Preparation tasks:
- Read the Sentence-BERT paper.
- Learn the BERT model.
- Learn how to use the BERT model.

The task:
We need to implement a simplified Sentence-BERT model for similarity search (contrastive learning).
- We will use models trained on Hebrew.
- The dataset will contain examples that entail or contradict each other.
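
One way to picture the contrastive setup: treat a premise as the anchor, an entailed sentence as the positive, and a contradicted sentence as the negative, then penalize the model when the positive is not closer than the negative. The sketch below is only an illustration in plain Python with made-up 3-dimensional vectors. The cosine-distance formulation, the `margin=0.5` value, and the function names are assumptions, not choices taken from the project.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss on cosine distance: zero once the positive is closer
    to the anchor than the negative by at least `margin`."""
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy "sentence embeddings" (made-up numbers, not model outputs)
premise = [1.0, 0.0, 0.0]
entailed = [0.9, 0.1, 0.0]
contradicted = [0.0, 1.0, 0.0]

# Well-separated triplet: no penalty
print(triplet_loss(premise, entailed, contradicted))
# Swapped roles: the "positive" is far away, so the loss is large
print(triplet_loss(premise, contradicted, entailed))
```

In the notebook itself the same computation would run on batched torch tensors, so that gradients can flow back into the encoder.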

## Tasks
- Read the data.
- Check preprocessing.
- Check tokenization.
- Feed the data into the given models.
- Get the [CLS] token / check what pooling means.
- Build an objective function (triplet loss, cosine similarity, etc.).
- Create an evaluation metric: Spearman correlation.
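
The pooling and evaluation steps above can be sketched in plain Python. Pooling turns the per-token vectors of one sentence into a single sentence vector, either by taking the first ([CLS]) vector or by averaging the non-padded tokens. Spearman correlation is the Pearson correlation of the ranks of two score lists. All function names and numbers below are illustrative assumptions. In practice `scipy.stats.spearmanr` would likely replace the hand-written version.

```python
def cls_pooling(token_vectors):
    """Use the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_vectors[0])

def mean_pooling(token_vectors, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def ranks(values):
    """1-based ranks, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = avg_rank
        start = end + 1
    return result

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Three token vectors for one sentence; the last position is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(cls_pooling(tokens))         # [1.0, 2.0]
print(mean_pooling(tokens, mask))  # [2.0, 3.0]

# Made-up predicted similarities vs. gold scores
predicted = [0.1, 0.4, 0.35, 0.8]
gold = [0, 1, 2, 3]
print(spearman(predicted, gold))
```

Because Spearman only compares rankings, the predicted scores do not need to be on the same scale as the gold scores.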


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Oren-Ben/nlp-final-project/blob/main/notes.ipynb)

### Imports

In [None]:
# pip install -r requirements.txt

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments


### GPT version

In [None]:
# Load the HebrewNLI dataset
dataset = load_dataset("HebArabNlpProject/HebNLI")

# Check the structure of the dataset
print(dataset)


In [None]:
# Get the train split
train_dataset = dataset['train']

# Print the first example
print(train_dataset[0])


In [None]:
# Function to downsample the dataset
def downsample(dataset, num_samples):
    return dataset.shuffle(seed=42).select(range(num_samples))

# Define the number of samples for the POC
num_samples = 1000  # Adjust this number as needed

# Downsample the train and validation datasets
train_dataset = downsample(dataset['train'], num_samples)
val_dataset = downsample(dataset['dev'], num_samples)

print(f"Train dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(val_dataset)}")


In [None]:
from transformers import AutoTokenizer

# Load the AlephBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("onlplab/alephbert-base")

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['translation1'], examples['translation2'], padding="max_length", truncation=True)

# Tokenize the train dataset
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)

# Check the tokenized dataset
print(tokenized_train_dataset[0])


In [None]:
import torch
from torch.utils.data import DataLoader

class STSDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

# Labels are cast to float inside STSDataset.__getitem__ (regression target)
train_labels = train_dataset['label']
# STSDataset calls .items() on its encodings, so pass a plain dict of columns, not the Dataset itself
encodings = {col: tokenized_train_dataset[col] for col in ['input_ids', 'attention_mask', 'token_type_ids']}
train_dataset = STSDataset(encodings, train_labels)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)


In [None]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Load the AlephBERT model for sequence classification (regression)
model = AutoModelForSequenceClassification.from_pretrained("onlplab/alephbert-base", num_labels=1)

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Define a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # Replace with actual validation dataset
)

# Fine-tune the model
trainer.train()


In [None]:
# Evaluate the model
results = trainer.evaluate()
print(results)

# Save the model
model.save_pretrained('./alephbert_hebrewnli_sts_model')
tokenizer.save_pretrained('./alephbert_hebrewnli_sts_model')


In [None]:
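def predict_similarity(model, tokenizer, sentence1, sentence2):
    # Encode the pair as one sequence and move it to the model's device
    inputs = tokenizer(sentence1, sentence2, return_tensors="pt", truncation=True, padding=True)
    inputs = {key: val.to(model.device) for key, val in inputs.items()}
    model.eval()  # disable dropout for inference
    with torch.no_grad():  # no gradients needed for prediction
        outputs = model(**inputs)
    return outputs.logits.item()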

# Example usage
similarity_score = predict_similarity(model, tokenizer, "משפט ראשון לדוגמה", "משפט שני לדוגמה")
print(similarity_score)
