This homework is about learning sentence representations and contrastive learning.

In the previous homework, we built token and sequence classification models and trained them with supervised learning only. In real-world scenarios, human annotation is costly and labor-intensive, and some annotation tasks require domain experts, for example in the medical or legal domain. However, some unsupervised methods do not need any annotations.

Contrastive learning is one of the most popular unsupervised learning approaches. It learns representations by pulling similar examples together and pushing dissimilar examples apart.

In this homework, we focus on SimCSE, a contrastive learning framework that learns sentence embeddings by comparing different views of the same sentence.
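
To make the SimCSE objective concrete, the sketch below computes an in-batch contrastive (InfoNCE) loss on toy vectors in plain Python. It assumes two embeddings per sentence (in unsupervised SimCSE these come from two forward passes with different dropout masks), cosine similarity, and a temperature of 0.05 as used in the SimCSE paper. The function names and the toy vectors are illustrative and not part of the homework code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def simcse_loss(views_a, views_b, temperature=0.05):
    """In-batch InfoNCE loss: views_a[i] and views_b[i] embed the same sentence."""
    losses = []
    for i, anchor in enumerate(views_a):
        # Similarity of the anchor to every second-view embedding in the batch
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Cross-entropy with the matching index i as the "correct class"
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two "sentences" with two views each (hypothetical values)
views_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each view is close to its own pair
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each view is close to the other sentence

loss_aligned = simcse_loss(views_a, aligned)
loss_swapped = simcse_loss(views_a, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each sentence is most similar to its own second view and large when it is more similar to another sentence in the batch, which is the signal the encoder is trained on.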

In this homework you will perform three main tasks:

1. Train a sentiment classification model on top of a pretrained model with frozen weights, that is, the pretrained model is treated as a fixed feature extractor.
2. Train a sentiment classification model on top of a pretrained model while also updating the base model's weights (fine-tuning).
3. Train the encoder with SimCSE and use the resulting sentence embeddings for linear classification.

## Import libraries

In [1]:
import torch
from torch import nn
import torch.nn.functional as F
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification, AutoModel
)
from datasets import load_dataset
import pytorch_lightning as pl
from pytorch_lightning import LightningModule, Trainer
from torch.utils.data import DataLoader
from torchmetrics import Accuracy

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE



The dataset for this homework is Wisesight-Sentiment (huggingface, github), a Thai social media dataset labeled with four classes: positive, neutral, negative, and question. It contains Thai and English text, emoji, and more, which is why we use the distilled version of multilingual BERT (mBERT), described in the DistilBERT paper, as the base model.

In [2]:
model_name = 'distilbert-base-multilingual-cased'
dataset = load_dataset('pythainlp/wisesight_sentiment')

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name) # Or a Thai-specific tokenizer if available

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['texts', 'category'],
        num_rows: 21628
    })
    validation: Dataset({
        features: ['texts', 'category'],
        num_rows: 2404
    })
    test: Dataset({
        features: ['texts', 'category'],
        num_rows: 2671
    })
})

In [19]:
dataset['train'].features

{'texts': Value(dtype='string', id=None),
 'category': ClassLabel(names=['pos', 'neu', 'neg', 'q'], id=None)}
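
The `ClassLabel` feature above stores each label as an integer whose position follows the order `pos`, `neu`, `neg`, `q`. A minimal sketch of mapping between ids and names is shown below; the helper name `decode_labels` is hypothetical, and the label list is copied from the output above.

```python
# Label names copied from the ClassLabel feature shown above;
# the list index is the integer label stored in the dataset.
LABEL_NAMES = ['pos', 'neu', 'neg', 'q']

id2label = dict(enumerate(LABEL_NAMES))
label2id = {name: idx for idx, name in id2label.items()}

def decode_labels(label_ids):
    """Convert integer class ids back to their string names."""
    return [id2label[i] for i in label_ids]

print(decode_labels([0, 2, 1, 3]))  # ['pos', 'neg', 'neu', 'q']
```

With the real dataset loaded, `dataset['train'].features['category'].int2str` offers the same conversion without a hand-written mapping.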

In [None]:
def preprocess_function(examples):
    return tokenizer(examples['texts'], padding='max_length', truncation=True)

# Apply preprocessing
encoded_dataset = dataset.map(preprocess_function, batched=True)

# Change `category` key to `labels`
encoded_dataset = encoded_dataset.map(lambda examples: {'labels': [label for label in examples['category']]}, batched=True)

     

Map: 100%|██████████| 21628/21628 [00:03<00:00, 6005.99 examples/s]
Map: 100%|██████████| 2404/2404 [00:00<00:00, 6269.95 examples/s]
Map: 100%|██████████| 2671/2671 [00:00<00:00, 6343.00 examples/s]
Map: 100%|██████████| 21628/21628 [00:00<00:00, 297418.43 examples/s]
Map: 100%|██████████| 2404/2404 [00:00<00:00, 176695.12 examples/s]
Map: 100%|██████████| 2671/2671 [00:00<00:00, 173300.12 examples/s]
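
Note that `padding='max_length'` pads every example to the tokenizer's maximum length (typically 512 tokens for DistilBERT-style models), even though social media posts are usually much shorter. As a possible alternative, the sketch below pads each batch only to its longest sequence. It is a plain-Python illustration: the function name `pad_batch`, the pad id of 0, and the token ids are assumptions made for the example.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad token-id lists to the longest sequence in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Hypothetical token ids for two short sentences
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch['input_ids'])
print(batch['attention_mask'])
```

In the Hugging Face ecosystem, `DataCollatorWithPadding` from `transformers` provides this behaviour as a `collate_fn` for the `DataLoader`; using it here would require tokenizing without `padding='max_length'` and adapting the dataset class accordingly.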


In [9]:
encoded_dataset['train']

Dataset({
    features: ['texts', 'category', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 21628
})

In [12]:
# Create PyTorch Dataset
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {
            key: torch.tensor(val) for key, val in self.encodings[idx].items()
            if key in ['input_ids', 'attention_mask']
        }
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [20]:
train_dataset = SentimentDataset(encoded_dataset['train'], encoded_dataset['train']['labels'])
val_dataset = SentimentDataset(encoded_dataset['validation'], encoded_dataset['validation']['labels'])
test_dataset = SentimentDataset(encoded_dataset['test'], encoded_dataset['test']['labels'])

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)
test_loader = DataLoader(test_dataset, batch_size=32)

## Base Model class

BaseModel is the parent class for the other models in this homework:

- Pretrained LM with a linear classifier
- Fine-tuned LM with a linear classifier
- Contrastive-learning-based (SimCSE) LM with a linear classifier

In [None]:
class BaseModel(LightningModule):
    def __init__(self, model_name: str = 'distilbert-base-multilingual-cased', learning_rate: float = 2e-5):
        super().__init__()
        self.save_hyperparameters()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.learning_rate = learning_rate
        self.hidden_size = self.encoder.config.hidden_size

    def get_embeddings(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # last_hidden_state has shape [batch, seq_len, hidden_size]
        last_hidden_state = outputs.last_hidden_state
        # Use the first ([CLS]) token as the sentence embedding
        return last_hidden_state[:, 0, :]

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate)
        return optimizer

    def forward(self, input_ids, attention_mask):
        return self.get_embeddings(input_ids, attention_mask)
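
The pooling step in `get_embeddings` can be checked on a random tensor without loading the real encoder. The sketch below uses toy sizes (not the real model's) and also shows mask-aware mean pooling, a common alternative that is not used in this homework and is included only for comparison.

```python
import torch

batch_size, seq_len, hidden_size = 2, 5, 8   # toy sizes, not the real model's
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# [CLS] pooling, as in get_embeddings: take the first token of every sequence
cls_embedding = last_hidden_state[:, 0, :]

# Alternative (not used in this homework): mean over non-padding tokens only
mask = attention_mask.unsqueeze(-1).float()            # [batch, seq_len, 1]
mean_embedding = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(cls_embedding.shape, mean_embedding.shape)
```

Both variants reduce `[batch, seq_len, hidden_size]` to `[batch, hidden_size]`, which is the shape the linear classifier expects.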
     

In [26]:
# Linear classifier + freeze option
class LMWithLinearClassfier(BaseModel):
    def __init__(
        self,
        model_name: str = "distilbert-base-multilingual-cased",
        ckpt_path: str = None,
        learning_rate: float = 2e-5,
        freeze_encoder_weights: bool = False,
        num_classes: int = 4,
    ):
        super().__init__(model_name, learning_rate)
        self.save_hyperparameters()

        # TODO 2: load encoder's weights from checkpoint (if provided)
        if ckpt_path:
            ckpt = torch.load(ckpt_path, map_location="cpu")
            self.load_state_dict(ckpt["state_dict"], strict=False)

        # TODO 3: linear classifier
        self.classifier = nn.Linear(self.hidden_size, num_classes)

        # freeze encoder weights (optional)
        if freeze_encoder_weights:
            self.freeze_weights(self.encoder)

        # metric
        self.accuracy = Accuracy(task="multiclass", num_classes=num_classes)

    # TODO 4: freeze encoder
    def freeze_weights(self, model):
        for param in model.parameters():
            param.requires_grad = False

    # TODO 5: forward pass (logits)
    def forward(self, input_ids, attention_mask):
        cls_emb = super().forward(input_ids, attention_mask)
        logits = self.classifier(cls_emb)
        return logits

    # TODO 6.1: training step
    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch["input_ids"], batch["attention_mask"], batch["labels"]
        logits = self(input_ids, attention_mask)
        loss = F.cross_entropy(logits, labels)
        acc = self.accuracy(logits, labels)
        self.log("train_loss", loss, prog_bar=True)
        self.log("train_acc", acc, prog_bar=True)
        return loss

    # TODO 6.2: validation step
    def validation_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch["input_ids"], batch["attention_mask"], batch["labels"]
        logits = self(input_ids, attention_mask)
        loss = F.cross_entropy(logits, labels)
        acc = self.accuracy(logits, labels)
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)

    # TODO 6.3: test step
    def test_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch["input_ids"], batch["attention_mask"], batch["labels"]
        logits = self(input_ids, attention_mask)
        loss = F.cross_entropy(logits, labels)
        acc = self.accuracy(logits, labels)
        self.log("test_loss", loss, prog_bar=True)
        self.log("test_acc", acc, prog_bar=True)
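
One way to confirm that freezing behaves as intended is to count trainable parameters before training. The sketch below mirrors the `freeze_weights` method as a standalone function and applies it to a tiny stand-in model; the layer sizes and the helper `count_parameters` are hypothetical and chosen so that nothing needs to be downloaded.

```python
import torch
from torch import nn

def freeze_weights(module: nn.Module) -> None:
    """Disable gradient updates for every parameter of the module."""
    for param in module.parameters():
        param.requires_grad = False

def count_parameters(module: nn.Module):
    """Return (trainable, total) parameter counts."""
    total = sum(p.numel() for p in module.parameters())
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    return trainable, total

# Tiny stand-in for encoder + classifier (hypothetical sizes)
encoder = nn.Linear(16, 8)       # 16*8 + 8 = 136 parameters
classifier = nn.Linear(8, 4)     # 8*4 + 4 = 36 parameters
model = nn.Sequential(encoder, classifier)

freeze_weights(encoder)
trainable, total = count_parameters(model)
print(trainable, total)  # 36 172
```

Applied to an `LMWithLinearClassfier` instance created with `freeze_encoder_weights=True`, the trainable count should equal the classifier's parameters only, that is `hidden_size * num_classes + num_classes`.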


In [27]:
pretrained_lm_w_linear_model = LMWithLinearClassfier(
    model_name,
    ckpt_path=None,
    freeze_encoder_weights=True
)



In [28]:
# Create a ModelCheckpoint callback (recommended way):
pretrained_lm_w_linear_checkpoint_callback = pl.callbacks.ModelCheckpoint(
    monitor="val_acc",  # Metric to monitor
    mode="max",  # "min" for loss, "max" for accuracy
    save_top_k=1,  # Save only the best model(s)
    save_weights_only=True, # Saves only weights, not the entire model
    dirpath="./checkpoints/", # Path where the checkpoints will be saved
    filename="best_pretrained_w_linear_model-{epoch}-{val_acc:.2f}", # Customized name for the checkpoint
    verbose=True,
)

# Initialize trainer
pretrained_lm_w_linear_trainer = Trainer(
    max_epochs=3,
    accelerator='auto',
    callbacks=[pretrained_lm_w_linear_checkpoint_callback], # Add the ModelCheckpoint callback
    gradient_clip_val=1.0,
    precision=16, # Mixed precision training
    devices=1,
)

# Train the model
pretrained_lm_w_linear_trainer.fit(pretrained_lm_w_linear_model, train_loader, val_loader)

d:\mini\envs\pine\Lib\site-packages\lightning_fabric\connector.py:571: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 3060 Laptop GPU') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name       | Type               | Params | Mode 
----------------------------------------------------------
0 | encoder    | DistilBertModel    | 134 M  | eval 
1 | classifier | Linear             | 3.1 K  | train
2 | a

Sanity Checking: |          | 0/? [00:00<?, ?it/s]


Epoch 0: 100%|██████████| 676/676 [01:39<00:00,  6.78it/s, v_num=0, train_loss=1.050, train_acc=0.500, val_loss=1.050, val_acc=0.537]

Epoch 0, global step 676: 'val_acc' reached 0.53702 (best 0.53702), saving model to 'D:\\NLP_learn\\NLP_learn\\constrative learning\\checkpoints\\best_pretrained_w_linear_model-epoch=0-val_acc=0.54.ckpt' as top 1


Epoch 1: 100%|██████████| 676/676 [01:40<00:00,  6.70it/s, v_num=0, train_loss=1.050, train_acc=0.429, val_loss=1.030, val_acc=0.537]

Epoch 1, global step 1352: 'val_acc' was not in top 1


Epoch 2: 100%|██████████| 676/676 [01:41<00:00,  6.68it/s, v_num=0, train_loss=0.860, train_acc=0.714, val_loss=1.030, val_acc=0.537]

Epoch 2, global step 2028: 'val_acc' reached 0.53744 (best 0.53744), saving model to 'D:\\NLP_learn\\NLP_learn\\constrative learning\\checkpoints\\best_pretrained_w_linear_model-epoch=2-val_acc=0.54.ckpt' as top 1
`Trainer.fit` stopped: `max_epochs=3` reached.




In [29]:
pretrained_lm_w_linear_result = pretrained_lm_w_linear_trainer.test(pretrained_lm_w_linear_model, test_loader)
pretrained_lm_w_linear_result

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing DataLoader 0: 100%|██████████| 84/84 [00:09<00:00,  8.88it/s]


[{'test_loss': 1.0267904996871948, 'test_acc': 0.5439910292625427}]
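
A test accuracy of about 0.54 is easier to interpret next to a majority-class baseline, that is, the accuracy of always predicting the most frequent label. The sketch below computes that baseline from a list of integer labels. The toy labels are hypothetical; with the real data one could pass `encoded_dataset['test']['labels']` instead.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_label, count = counts.most_common(1)[0]
    return most_common_label, count / len(labels)

# Hypothetical labels; replace with the real test labels to get the actual baseline
toy_labels = [1, 1, 1, 0, 2, 1, 2, 3]
label, acc = majority_baseline(toy_labels)
print(label, acc)  # 1 0.5
```

If the frozen-encoder model scores close to this baseline, the linear classifier has probably learned little beyond the class prior.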

I tried running the fine-tuning step from scratch, but my computer could not handle it.