# Exploration of Data Selection and Fine-tuning Strategies

In this task, you can explore alternative methods for data selection and investigate several fine-tuning techniques to adapt a pre-trained model for a specific task. The goal is to improve the model performance on our target dataset. You can use the [`task3.py`](../scripts/Task3.py) file for your implementation.

1. Data Selection Strategies
The first step in fine-tuning a model is to carefully select the training data. While the previous tasks focused on influence-based data selection, here you will experiment with other selection strategies. Pick one data selection method by yourself. Log your findings about the selected data subsets:
    - How much data is used in each strategy?
    - Compare the performance of models trained with each selection method.

2. Fine-tuning Strategies
In this section, you will implement and compare some parameter-efficient fine-tuning approaches:

    - [bitfit](https://arxiv.org/abs/2106.10199)
    - [LoRA](https://arxiv.org/abs/2106.09685) (Low-Rank Adaptation)
    - iA3 (see Section 3.3. of this [paper](https://arxiv.org/abs/2205.05638)) (Implicit Adapter)

In [1]:
from google.colab import drive
# drive.mount('/content/drive/')# Note: Commented out for local execution. Uncomment if using Google Colab.


Mounted at /content/drive/


In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.3.2-py3-none-any.whl (485 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m485.4/485.4 kB[0m [31m11.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m9.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading x

In [3]:
!pip install rdkit

Collecting rdkit
  Downloading rdkit-2024.9.5-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.0 kB)
Downloading rdkit-2024.9.5-cp311-cp311-manylinux_2_28_x86_64.whl (34.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m34.3/34.3 MB[0m [31m13.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: rdkit
Successfully installed rdkit-2024.9.5


In [4]:
!pip install transformers
!pip install datasets
!pip install sklearn
!pip install torch
!pip install wandb

Collecting sklearn
  Downloading sklearn-0.0.post12.tar.gz (2.6 kB)
  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mpython setup.py egg_info[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m See above for output.
  
  [1;35mnote[0m: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... [?25l[?25herror
[1;31merror[0m: [1mmetadata-generation-failed[0m

[31m×[0m Encountered error while generating package metadata.
[31m╰─>[0m See above for output.

[1;35mnote[0m: This is an issue with the package mentioned above, not pip.
[1;36mhint[0m: See above for details.
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.met

In [9]:
import warnings
from rdkit import RDLogger

# Suppress RDKit warnings
RDLogger.DisableLog('rdApp.*')
warnings.filterwarnings("ignore")

# Import dependencies
import torch
import torch.nn as nn
from torch.optim import AdamW
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from transformers import AutoTokenizer, AutoModel
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset

# Define the device (GPU if available, otherwise CPU)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MODEL_NAME = "ibm/MoLFormer-XL-both-10pct"

########################################################
# Data Selection Strategies
########################################################

def random_sampling(smiles_list, targets, sample_size):
    """Randomly select a subset of the data."""
    indices = np.random.choice(len(smiles_list), size=sample_size, replace=False)
    return [smiles_list[i] for i in indices], [targets[i] for i in indices]

def diversity_sampling(smiles_list, targets, sample_size):
    """Select a diverse subset of the data using molecular fingerprints."""
    # Generate Morgan fingerprints
    fingerprints = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None:
            fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
            fingerprints.append(np.array(fp))
        else:
            fingerprints.append(np.zeros(1024, dtype=int))  # Handle invalid SMILES

    fingerprints = np.array(fingerprints)

    # Perform KMeans clustering
    kmeans = KMeans(n_clusters=sample_size, random_state=42)
    kmeans.fit(fingerprints)

    # Select one sample from each cluster
    selected_indices = []
    for cluster in range(sample_size):
        cluster_indices = np.where(kmeans.labels_ == cluster)[0]
        selected_indices.append(np.random.choice(cluster_indices))

    return [smiles_list[i] for i in selected_indices], [targets[i] for i in selected_indices]

########################################################
# Custom Dataset Class
########################################################

class SMILESDataset(Dataset):
    def __init__(self, smiles_list, targets, tokenizer, max_length=128):
        self.smiles_list = smiles_list
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.smiles_list)

    def __getitem__(self, idx):
        smiles = self.smiles_list[idx]
        target = self.targets[idx]
        encoding = self.tokenizer(
            smiles,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(0),  # Remove batch dimension
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "target": torch.tensor(target, dtype=torch.float32)
        }

########################################################
# Fine-Tuning Strategies
########################################################

# Add a regression head to the model
class RegressionModel(nn.Module):
    def __init__(self, base_model, hidden_size=768):
        super(RegressionModel, self).__init__()
        self.base_model = base_model
        self.regressor = nn.Linear(hidden_size, 1)  # Output shape: (batch_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids, attention_mask)
        pooled_output = outputs.pooler_output  # Shape: (batch_size, hidden_size)
        return self.regressor(pooled_output).squeeze(-1)  # Ensure shape: (batch_size,)

def bitfit_finetune(model, train_loader, num_epochs=100, lr=5e-4):
    """Fine-tune only the bias terms of the model."""
    bias_params = [p for n, p in model.named_parameters() if 'bias' in n]
    optimizer = AdamW(bias_params, lr=lr)
    criterion = nn.MSELoss()

    for epoch in range(num_epochs):
        model.train()
        epoch_loss = 0.0

        for batch in train_loader:
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            targets = batch['target'].to(DEVICE)

            optimizer.zero_grad()
            predictions = model(input_ids, attention_mask)
            loss = criterion(predictions, targets)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()

        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss / len(train_loader):.4f}")

def lora_finetune(model, train_loader, num_epochs=100, lr=1e-5, rank=8):
    """Fine-tune the model using LoRA (Low-Rank Adaptation)."""

    lora_params = {}

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            lora_name = name.replace(".", "_")  # Fix parameter naming issue

            lora_A = nn.Parameter(torch.randn(rank, module.in_features).to(DEVICE) * 0.01)  # Scale to avoid NaNs
            lora_B = nn.Parameter(torch.randn(module.out_features, rank).to(DEVICE) * 0.01)

            module.weight.requires_grad = False  # Freeze original weights

            lora_params[f"{lora_name}_lora_A"] = lora_A
            lora_params[f"{lora_name}_lora_B"] = lora_B

    for param_name, param in lora_params.items():
        model.register_parameter(param_name, param)

    optimizer = AdamW(model.parameters(), lr=lr)
    criterion = nn.MSELoss()

    for epoch in range(num_epochs):
        model.train()
        epoch_loss = 0.0

        for batch in train_loader:
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            targets = batch['target'].to(DEVICE)

            optimizer.zero_grad()

            outputs = model.base_model(input_ids, attention_mask)
            pooled_output = outputs.pooler_output  # Shape: (batch_size, hidden_size)

            for name, module in model.named_modules():
                if isinstance(module, nn.Linear):
                    lora_name = name.replace(".", "_")
                    lora_A = getattr(model, f"{lora_name}_lora_A")
                    lora_B = getattr(model, f"{lora_name}_lora_B")

                    lora_output = torch.matmul(pooled_output, lora_A.T)
                    lora_output = torch.matmul(lora_output, lora_B.T)

                    pooled_output = pooled_output + lora_output  # Apply LoRA

            # Normalize to prevent NaNs
            pooled_output = torch.nn.functional.layer_norm(pooled_output, (pooled_output.shape[-1],))

            predictions = model.regressor(pooled_output).squeeze(-1)
            loss = criterion(predictions, targets)

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Prevent NaNs
            optimizer.step()

            epoch_loss += loss.item()

        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss / len(train_loader):.4f}")


def ia3_finetune(model, train_loader, num_epochs=100, lr=5e-4):
    """Fine-tune the model using iA3 (Implicit Adapter)."""

    ia3_params = {}

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            ia3_name = name.replace(".", "_")  # Fix parameter naming issue
            ia3_scale = nn.Parameter(torch.ones(module.out_features).to(DEVICE))

            module.weight.requires_grad = False  # Freeze original weights
            ia3_params[f"{ia3_name}_ia3_scale"] = ia3_scale

    for param_name, param in ia3_params.items():
        model.register_parameter(param_name, param)

    optimizer = AdamW(model.parameters(), lr=lr)
    criterion = nn.MSELoss()

    for epoch in range(num_epochs):
        model.train()
        epoch_loss = 0.0

        for batch in train_loader:
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            targets = batch['target'].to(DEVICE)

            optimizer.zero_grad()

            outputs = model.base_model(input_ids, attention_mask)
            pooled_output = outputs.pooler_output  # Shape: (batch_size, hidden_size)

            for name, module in model.named_modules():
                if isinstance(module, nn.Linear):
                    ia3_name = name.replace(".", "_")
                    ia3_scale = getattr(model, f"{ia3_name}_ia3_scale")

                    pooled_output = pooled_output * ia3_scale  # Apply iA3 scaling

            predictions = model.regressor(pooled_output).squeeze(-1)
            loss = criterion(predictions, targets)

            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()

        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss / len(train_loader):.4f}")

########################################################
# Evaluation Function
########################################################

def evaluate_model(model, test_loader):
    """Evaluate the model on the test set."""
    model.eval()
    all_predictions = []
    all_targets = []

    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            targets = batch['target'].to(DEVICE)

            predictions = model(input_ids, attention_mask)
            all_predictions.extend(predictions.cpu().numpy())
            all_targets.extend(targets.cpu().numpy())

    # Calculate evaluation metrics
    mse = mean_squared_error(all_targets, all_predictions)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(all_targets, all_predictions)
    r2 = r2_score(all_targets, all_predictions)

    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R² Score: {r2:.4f}")

    return mse, rmse, mae, r2

########################################################
# Main Execution
########################################################

if __name__ == "__main__":
    # Load the MoleculeNet Lipophilicity dataset from Hugging Face
    dataset = load_dataset("scikit-fingerprints/MoleculeNet_Lipophilicity")

    # Extract SMILES and targets
    smiles_list = dataset["train"]["SMILES"]
    targets = dataset["train"]["label"]

    # Split data into training and testing sets
    train_smiles, test_smiles, train_targets, test_targets = train_test_split(
        smiles_list, targets, test_size=0.2, random_state=42
    )

    # Apply data selection strategies
    random_smiles, random_targets = random_sampling(train_smiles, train_targets, sample_size=1000)
    diverse_smiles, diverse_targets = diversity_sampling(train_smiles, train_targets, sample_size=1000)

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    # Create datasets
    random_dataset = SMILESDataset(random_smiles, random_targets, tokenizer)
    diverse_dataset = SMILESDataset(diverse_smiles, diverse_targets, tokenizer)
    test_dataset = SMILESDataset(test_smiles, test_targets, tokenizer)

    # Create data loaders
    random_loader = DataLoader(random_dataset, batch_size=64, shuffle=True)
    diverse_loader = DataLoader(diverse_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

    # Load the pre-trained model and add the regression head
    base_model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True).to(DEVICE)
    model = RegressionModel(base_model).to(DEVICE)

    # Fine-tune and evaluate each method separately
    print("Evaluating BitFit...")
    bitfit_model = RegressionModel(AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True).to(DEVICE)).to(DEVICE)
    bitfit_finetune(bitfit_model, random_loader)
    evaluate_model(bitfit_model, test_loader)

    print("Evaluating LoRA...")
    lora_model = RegressionModel(AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True).to(DEVICE)).to(DEVICE)
    lora_finetune(lora_model, diverse_loader)
    evaluate_model(lora_model, test_loader)

    print("Evaluating iA3...")
    ia3_model = RegressionModel(AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True).to(DEVICE)).to(DEVICE)
    ia3_finetune(ia3_model, random_loader)
    evaluate_model(ia3_model, test_loader)

The repository for ibm/MoLFormer-XL-both-10pct contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/ibm/MoLFormer-XL-both-10pct.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
Evaluating BitFit...
Epoch [1/100], Loss: 5.2422
Epoch [2/100], Loss: 1.6940
Epoch [3/100], Loss: 1.5218
Epoch [4/100], Loss: 1.4067
Epoch [5/100], Loss: 1.4152
Epoch [6/100], Loss: 1.3792
Epoch [7/100], Loss: 1.3455
Epoch [8/100], Loss: 1.3271
Epoch [9/100], Loss: 1.3039
Epoch [10/100], Loss: 1.3003
Epoch [11/100], Loss: 1.2863
Epoch [12/100], Loss: 1.2750
Epoch [13/100], Loss: 1.2585
Epoch [14/100], Loss: 1.2607
Epoch [15/100], Loss: 1.2197
Epoch [16/100], Loss: 1.2172
Epoch [17/100], Loss: 1.2008
Epoch [18/100], Loss: 1.1858
Epoch [19/100], Loss: 1.1819
Epoch [20/100], Loss: 1.1740
Epoch [21/100], Loss: 1.1983
Epoch [22/100], Loss: 1.1608
Epoch [23/100]

The Best Fine-Tuning Strategy


{BitFit} (Bias-Term Fine-Tuning) – This method yielded the best performance, achieving a significant improvement in MSE, RMSE, and $R^2$ scores, demonstrating that fine-tuning only the bias terms is an efficient and effective approach.