# For running everything:

* Check that the model and hyperparameters are the ones you want (see the code blocks where the relevant parameters are set).
* Go to the last block (where the fine-tuned weights are saved) -> click on the block -> select "Runtime" from the toolbar above -> click "Run before".
  * This is important: saving the weights when you don't have enough gdrive storage space available leads to severely problematic gdrive behaviour (specifically, it lags almost to the point of being unusable).
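
A rough way to check for free space before saving is sketched below. The helper name `has_enough_space`, the 1 GB figure, and the safety factor are illustrative assumptions, not part of this notebook. Whether the number reported for the mounted Drive folder matches your actual Drive quota is also an assumption, so it is worth confirming in the Drive web UI.

```python
import shutil

def has_enough_space(path, required_bytes, safety_factor=1.5):
    """Return True if the filesystem holding `path` has room for the weights.

    `safety_factor` is a hypothetical margin, not a value taken from the notebook.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes * safety_factor

# Hypothetical size: about 1 GB of weights; replace with your model's real size.
required = 1 * 1024**3

# "." is used so the sketch runs anywhere; on Colab you would pass the
# mounted Drive folder instead.
print(has_enough_space(".", required))
```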

In [None]:
!pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from datasets import Dataset, load_dataset
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from google.colab import drive
import torch

In [None]:
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


# CHANGE THE MODEL HERE

In [None]:
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/nli-MiniLM2-L6-H768')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/nli-MiniLM2-L6-H768')

# Change padding to match the transformer architecture

In [None]:
max_length = 512

In [None]:
def tokenize_function(batch):
  # Tokenize the sentence pairs, padding/truncating every example to max_length tokens
  return tokenizer(batch['sentence1'], batch['sentence2'], padding='max_length', truncation=True, max_length=max_length)
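
To make the effect of `padding='max_length'` together with `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the real tokenizer also handles sentence pairs and special tokens, and the token ids and `pad_id=0` below are made up for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy illustration of padding to a fixed length with truncation."""
    ids = token_ids[:max_length]          # truncation: drop tokens past max_length
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [pad_id] * n_pad,           # padding: fill up to max_length
        "attention_mask": [1] * n_real + [0] * n_pad,  # 1 = real token, 0 = padding
    }

# Made-up ids: one sequence shorter than max_length, one longer.
short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate(list(range(1, 9)), max_length=5)
print(short)
print(long)
```

With `max_length = 512`, every example ends up 512 tokens long, however short the original sentences are.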

# Dataset loading - Change path to match local drive configuration

In [None]:
ds_train_AAE = load_dataset('csv', data_files='/content/drive/MyDrive/fine_tuning_transformers/ft_ds.csv', encoding="UTF-8", sep=';', index_col='Unnamed: 0')



  0%|          | 0/1 [00:00<?, ?it/s]
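
When pointing the loader at a different file, a quick header check can catch a wrong separator or missing columns before tokenization. This is a minimal sketch: the helper `missing_columns` is hypothetical, the expected column names are taken from the features this notebook uses, and the demo header line is made up to stand in for the first line of the CSV.

```python
import csv
import io

EXPECTED_COLUMNS = {"sentence1", "sentence2", "label"}

def missing_columns(text_stream, sep=";"):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(text_stream, delimiter=sep))
    return EXPECTED_COLUMNS - set(header)

# Made-up header; the empty first field is the unnamed index column.
demo = io.StringIO(";sentence1;sentence2;label\n")
demo_missing = missing_columns(demo)
print(demo_missing)

# With a real file (path is yours to fill in):
# with open(path, newline="", encoding="utf-8") as f:
#     print(missing_columns(f))
```

An empty set means all three columns were found with the given separator.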

In [None]:
tokenized_train_ds = ds_train_AAE['train'].map(tokenize_function, batched=True)



# This allows the dataset to be loaded to GPU as torch tensors

In [None]:
tokenized_train_ds = tokenized_train_ds.with_format('torch')  # with_format returns a new dataset, so keep the result
tokenized_train_ds

Dataset({
    features: ['sentence1', 'sentence2', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 13415
})

# Tune hyperparameters here

* The output_dir stores checkpoints of the model! These take up most of the space on gdrive: if the weights are 1 GB, each checkpoint is at least 1 GB, and there can be many of them. Use the checkpoints if you need them, but delete them if you only want to keep the final model weights (or delete whichever checkpoints you don't want).
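
One possible way to clean up old checkpoints is sketched below. It assumes the Trainer's usual `checkpoint-<step>` folder naming inside `output_dir`; the helper `prune_checkpoints` and its `keep_last` parameter are hypothetical. `TrainingArguments` also accepts a `save_total_limit` argument that caps the number of kept checkpoints, which may be simpler if it fits your workflow.

```python
import re
import shutil
from pathlib import Path

def prune_checkpoints(output_dir, keep_last=1):
    """Delete all but the `keep_last` highest-step checkpoint-<step> folders.

    Returns the names of the folders that were removed.
    """
    pattern = re.compile(r"checkpoint-(\d+)")
    found = []
    for child in Path(output_dir).iterdir():
        match = pattern.fullmatch(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    found.sort(key=lambda item: item[0])  # oldest (lowest step) first
    to_delete = found[:-keep_last] if keep_last > 0 else found
    for _, folder in to_delete:
        shutil.rmtree(folder)
    return [folder.name for _, folder in to_delete]

# Example call (commented out because it deletes folders):
# prune_checkpoints('/content/drive/MyDrive/fine_tuning_transformers/output', keep_last=1)
```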

In [None]:
training_args = TrainingArguments(
    output_dir='/content/drive/MyDrive/fine_tuning_transformers/output',
    per_device_train_batch_size=32,
    learning_rate=1e-7,
    num_train_epochs=1
    )

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_ds
)

In [None]:
torch.cuda.is_available()

True

In [None]:
torch.cuda.empty_cache()

In [None]:
trainer.train()

Step,Training Loss


TrainOutput(global_step=420, training_loss=1.2141782488141741, metrics={'train_runtime': 599.8513, 'train_samples_per_second': 22.364, 'train_steps_per_second': 0.7, 'total_flos': 1777081844136960.0, 'train_loss': 1.2141782488141741, 'epoch': 1.0})
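
The figures in `TrainOutput` can be cross-checked against the dataset size and batch size. The sketch below assumes a single device and no gradient accumulation (neither is set in this notebook); all input numbers are copied from the outputs above.

```python
import math

num_rows = 13415          # from the Dataset summary
batch_size = 32           # per_device_train_batch_size
num_epochs = 1            # num_train_epochs
train_runtime = 599.8513  # seconds, from the metrics

# The last, partially filled batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_rows * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)                   # matches global_step=420
print(round(samples_per_second, 3))  # matches train_samples_per_second
print(round(steps_per_second, 3))    # matches train_steps_per_second
```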

# !!! Only run this if you have enough space

* Also make sure you save the weights to the intended location - overwriting already fine-tuned weights wouldn't do!

In [None]:
trainer.save_model('/content/drive/MyDrive/fine_tuning_transformers/weights/minilm2_512padding_1e7lr_32batchsize_1epochs')
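
To reduce the risk of overwriting earlier weights, the folder name can be derived from the hyperparameters and checked before saving. This is a sketch only: `weights_dir_name` and `safe_target` are hypothetical helpers that mimic the naming pattern used above (including writing a learning rate of 1e-7 as `1e7`).

```python
from pathlib import Path

def weights_dir_name(model_tag, max_length, learning_rate, batch_size, epochs):
    """Build a folder name following the pattern used for the saved weights."""
    mantissa, exponent = f"{learning_rate:.0e}".split("e")  # 1e-7 -> '1', '-07'
    lr_tag = f"{mantissa}e{abs(int(exponent))}"             # minus sign dropped, as in the name above
    return f"{model_tag}_{max_length}padding_{lr_tag}lr_{batch_size}batchsize_{epochs}epochs"

def safe_target(weights_root, name):
    """Return the target path, refusing to reuse a folder that already exists."""
    target = Path(weights_root) / name
    if target.exists():
        raise FileExistsError(f"{target} already exists; pick a new name or delete it first")
    return target

name = weights_dir_name("minilm2", 512, 1e-7, 32, 1)
print(name)

# On Colab, assuming `trainer` is defined as above:
# target = safe_target('/content/drive/MyDrive/fine_tuning_transformers/weights', name)
# trainer.save_model(str(target))
```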