# Build Your Custom AI/LLM With PyTorch Lightning

<img src="imgs/arch.png" alt="Cohere logo" width="800" height="500"/>

In [103]:
from config import TrainConfig
import torch
import gc
import pandas as pd
import numpy as np

# 1. Transformer

## Train Config

In [2]:
# Free up gpu vRAM from memory leaks.
torch.cuda.empty_cache()
gc.collect()

0

In [3]:
train_config = TrainConfig()

## Training Loop

In [4]:
from transformers import AutoModelForSequenceClassification


In [5]:
pretrained_model = train_config.pretrained_model
num_classes = train_config.num_classes

In [6]:
print(num_classes)

2


In [7]:
pretrained_model

'roberta-base'

In [8]:
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=pretrained_model,
    num_labels=num_classes,
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### PEFT (LoRA)

In [9]:
from peft import get_peft_model, LoraConfig, TaskType

TaskType:

Overview of the supported task types:
- SEQ_CLS: Text classification.
- SEQ_2_SEQ_LM: Sequence-to-sequence language modeling.
- CAUSAL_LM: Causal language modeling.
- TOKEN_CLS: Token classification.
- QUESTION_ANS: Question answering.
- FEATURE_EXTRACTION: Feature extraction. Provides the hidden states which can be used as embeddings or features
  for downstream tasks.

- We are interested in **SEQ_CLS**
- Set the `inference_mode` to False to enable these layers in training mode

In [10]:
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)

LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning large language models by adding trainable low-rank matrices next to the frozen pretrained weights. The sections below describe mathematically how the parameters `r`, `lora_alpha`, and `lora_dropout` enter the LoRA formulation:

---

### Core Idea Behind LoRA
The key idea is to approximate the updates to a pretrained model's weight matrix $W \in \mathbb{R}^{d \times k}$ with a low-rank decomposition:
$$
\Delta W \approx A B
$$
where:
- $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times k}$ are low-rank matrices.
- $r \ll \min(d, k)$, reducing the number of trainable parameters from $d \times k$ to $r \times (d + k)$.

The LoRA weight update is incorporated into the model as:
$$
W_{\text{effective}} = W + \Delta W = W + A B
$$
Here, $W$ is the pretrained weight matrix, and $\Delta W$ represents the learned low-rank update.
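As a quick numerical check of the decomposition above, here is a minimal NumPy sketch. The sizes `d = k = 768` and `r = 8` are illustrative choices (they match a RoBERTa-base attention projection and the rank used in this notebook), and the random matrices only stand in for trained factors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 768, 768, 8  # illustrative sizes, not read from a real model

W = rng.normal(size=(d, k))  # stand-in for a frozen pretrained weight
A = rng.normal(size=(d, r))  # low-rank factor A
B = rng.normal(size=(r, k))  # low-rank factor B

delta_W = A @ B              # same shape as W, but rank at most r
W_effective = W + delta_W

# An explicit tolerance keeps the rank estimate robust to round-off.
rank = np.linalg.matrix_rank(delta_W, tol=1e-6)

print(delta_W.shape)                  # (768, 768)
print(rank)                           # 8
print(A.size + B.size, "vs", W.size)  # 12288 vs 589824
```

Note that PEFT names the two factors the other way round: its `lora_A` weight has shape `(r, k)` and its `lora_B` weight has shape `(d, r)`, so in PEFT's naming the update is `lora_B @ lora_A`.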

---

### LoRA Parameters in Detail

#### 1. **`r` (Rank of Decomposition)**
- **Definition**: $r$ is the rank of the low-rank decomposition.
- **Mathematical Role**:
  - Controls the dimensionality of the intermediate representation in the decomposition $\Delta W = A B$.
  - A larger $r$ allows more expressive updates, but increases the number of trainable parameters ($r \times (d + k)$).
- **Parameter Count**:
  - Trainable parameters introduced by LoRA: $r \times (d + k)$.
- **Trade-off**:
  - Low $r$: Less expressive, more efficient.
  - High $r$: More expressive, less efficient.
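To make the trade-off concrete, the following sketch tabulates the LoRA parameter count for a few ranks. The layer size `768 x 768` and the list of ranks are illustrative assumptions; `lora_param_count` is a helper defined only for this example.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Number of trainable parameters in the two LoRA factors of one matrix."""
    return r * (d + k)

d, k = 768, 768  # illustrative layer size
full = d * k     # parameters in the full weight matrix

for r in (2, 4, 8, 16, 64):
    lora = lora_param_count(d, k, r)
    print(f"r={r:>2}: {lora:>6,} LoRA params ({lora / full:.2%} of {full:,})")
```

The count grows linearly in `r`, so doubling the rank doubles the number of trainable parameters per adapted matrix.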

#### 2. **`lora_alpha` (Scaling Factor)**
- **Definition**: A scalar $\alpha$ that scales the output of the low-rank update.
- **Mathematical Role**:
  - Controls the magnitude of the update relative to the pretrained weights.
  - The effective update becomes:
    $$
    \Delta W = \frac{\alpha}{r} A B
    $$
  - Dividing $\alpha$ by $r$ keeps the scale of the update roughly comparable across ranks, which reduces the need to retune other hyperparameters when $r$ changes.
- **Intuition**:
  - $\alpha$ adjusts how much influence the low-rank adaptation has on the pretrained weights $W$.

#### 3. **`lora_dropout` (Dropout for Regularization)**
- **Definition**: Dropout applied to the input of the low-rank branch during training to regularize the adaptation. The frozen path $W x$ is not affected.
- **Mathematical Role**:
  - During training, each component of the layer input $x \in \mathbb{R}^{k}$ is zeroed with probability $p$ and the surviving components are rescaled by $\frac{1}{1-p}$. With a mask $m$ whose entries are drawn from $\text{Bernoulli}(1-p)$:
    $$
    x' = \frac{1}{1-p} \left( m \odot x \right)
    $$
  - The output of the adapted layer becomes:
    $$
    h = W x + \frac{\alpha}{r} A B x'
    $$
  - At inference time dropout is disabled ($x' = x$), so the update reduces to $\Delta W = \frac{\alpha}{r} A B$.
- **Trade-off**:
  - Low dropout (small $p$): Less regularization, more prone to overfitting.
  - High dropout (large $p$): Stronger regularization, potentially underfitting.

---

### Full Expression of LoRA Weight Update
Incorporating all these components, the output of an adapted layer during training is:
$$
h = W x + \frac{\alpha}{r} A B x', \qquad x' = \frac{1}{1-p} \left( m \odot x \right)
$$
and the effective weight matrix at inference time is:
$$
W_{\text{effective}} = W + \Delta W = W + \frac{\alpha}{r} A B
$$
where:
- $W$: Pretrained weight matrix.
- $A$ and $B$: Trainable low-rank matrices.
- $r$: Rank of decomposition.
- $\alpha$: Scaling factor.
- $m$: Dropout mask applied to the input of the low-rank branch (training only).
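A minimal NumPy sketch of one adapted layer, assuming dropout acts on the input of the low-rank branch as PEFT's LoRA layers do. The layer sizes and random weights are made up for illustration; `alpha`, `r`, and `p` take the values from the `LoraConfig` above, and `lora_forward` is a hypothetical helper, not a PEFT function.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 8   # small illustrative layer; r matches the config
alpha, p = 32, 0.1    # lora_alpha and lora_dropout from the config
scaling = alpha / r   # 4.0

W = rng.normal(size=(d, k))  # frozen pretrained weight
A = rng.normal(size=(d, r))  # trainable low-rank factors
B = rng.normal(size=(r, k))

def lora_forward(x, training):
    """Output of one adapted layer for an input vector x of shape (k,)."""
    if training:
        keep = rng.random(x.shape) >= p  # Bernoulli(1 - p) keep mask
        x_branch = keep * x / (1.0 - p)  # inverted-dropout rescaling
    else:
        x_branch = x                     # dropout is off at inference
    return W @ x + scaling * (A @ (B @ x_branch))

x = rng.normal(size=k)
merged = W + scaling * (A @ B)           # effective weight at inference
print(np.allclose(lora_forward(x, training=False), merged @ x))  # True
```

Because the branch is linear, the inference-time output equals a single matrix product with the merged weight, which is why LoRA adapters can be folded into the base model after training.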

---

### Example Breakdown
Given the configuration above:
- $r = 8$: Low-rank decomposition introduces $8$ dimensions for the intermediate representation.
- $\alpha = 32$: Updates are scaled by $\frac{32}{8} = 4$ to adjust their magnitude.
- $\text{dropout} = 0.1$: A dropout probability of $0.1$ regularizes the updates.

### Impact on Parameter Efficiency
The number of trainable parameters added by LoRA is:
$$
\text{Trainable Params} = r \times (d + k)
$$
For example, if $d = 768$ and $k = 768$, and $r = 8$:
$$
\text{Trainable Params} = 8 \times (768 + 768) = 12{,}288
$$
This is significantly smaller than the $768 \times 768 = 589{,}824$ parameters of the original weight matrix.

---

### Summary
- **`r`**: Controls the expressiveness of the low-rank decomposition.
- **`lora_alpha`**: Scales the low-rank updates for stability and proportionality.
- **`lora_dropout`**: Regularizes the updates to prevent overfitting.

This formulation enables efficient fine-tuning of large models with minimal additional parameters.


In [12]:
model

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
         

In [13]:
model = get_peft_model(model, peft_config)

In [14]:
model.print_trainable_parameters()

trainable params: 887,042 || all params: 125,534,212 || trainable%: 0.7066
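The printed total can be reproduced by hand. The breakdown below is an assumption about where the trainable parameters come from: LoRA factors on the `query` and `value` projections of each of the 12 encoder layers (as the parameter listing at the end of this notebook shows), plus a fully trainable classification head (`classifier.dense` and `classifier.out_proj`). It is consistent with the number reported above.

```python
hidden, r, num_layers, num_labels = 768, 8, 12, 2

# LoRA factors: r * (d + k) per adapted matrix, query and value in every layer.
lora_per_matrix = r * (hidden + hidden)          # 12,288
lora_total = num_layers * 2 * lora_per_matrix    # 294,912

# Classification head (weights + biases), assumed to be trained in full.
head_dense = hidden * hidden + hidden            # 590,592
head_out_proj = hidden * num_labels + num_labels # 1,538

trainable = lora_total + head_dense + head_out_proj
print(f"{trainable:,}")  # 887,042
```

Under this breakdown, roughly two thirds of the trainable parameters sit in the classification head rather than in the LoRA adapters.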


- Set the learning rate

In [15]:
lr = train_config.lr
lr

0.0002

### PyTorch Lightning: `save_hyperparameters`

In PyTorch Lightning, the `save_hyperparameters` method saves the initialization arguments (hyperparameters) of a `LightningModule`. This allows the framework to store and later retrieve these hyperparameters for logging, checkpointing, and reproducibility purposes.

- When called inside `__init__`, `save_hyperparameters` captures the arguments passed to the `__init__` method of the `LightningModule` and saves them as part of the module's internal state.

- The saved hyperparameters are included in the checkpoints automatically created by PyTorch Lightning during training. This ensures that if you reload a model from a checkpoint, the hyperparameters are restored.

- Many logging frameworks, like TensorBoard, WandB, or MLFlow, can automatically log the hyperparameters saved using this method, enabling better experiment tracking.

- After calling `save_hyperparameters`, the stored arguments are accessible through the `self.hparams` attribute, which supports both attribute access (`self.hparams.lr`) and dictionary-style access (`self.hparams["lr"]`).

In [16]:
# self.save_hyperparameters("pretrained_model")

In [17]:
from argparse import Namespace

def save_hyperparameters(**kwargs):
    # Toy stand-in for LightningModule.save_hyperparameters:
    # collect the keyword arguments into a Namespace for attribute access.
    return Namespace(**kwargs)

In [18]:
# Save hyperparameters
hparams = save_hyperparameters(pretrained_model=pretrained_model, num_classes=num_classes, lr=lr)

# Access hyperparameters
print(hparams.pretrained_model)  # Outputs: "roberta-base"
print(hparams.lr)  # Outputs: 0.0002

roberta-base
0.0002


# 2. Data (LexGlueDataModule)

In [19]:
import os
from typing import Optional

import polars as pl
import torch
from datasets import Dataset, DatasetDict, load_dataset
from lightning import LightningDataModule
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

In [20]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [21]:
pretrained_model = train_config.pretrained_model
max_length = train_config.max_length
batch_size = train_config.batch_size
num_workers = train_config.num_workers
debug_mode_sample = train_config.debug_mode_sample

In [22]:
print("Pretrained model:", pretrained_model)
print("Max length:", max_length)
print("Batch size:", batch_size)
print("Number of workers:", num_workers)
print("Debug mode sample:", debug_mode_sample)

Pretrained model: roberta-base
Max length: 128
Batch size: 256
Number of workers: 32
Debug mode sample: None


Let's download the dataset

In [23]:
dsname = "lex_glue"
dsdict = DatasetDict()

In [24]:
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)

In [25]:
tokenizer

RobertaTokenizerFast(name_or_path='roberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}
)

In [26]:
tokenizer.tokenize("Hello, world!")

['Hello', ',', 'Ġworld', '!']

In [27]:
tokenizer.convert_tokens_to_ids(["Hello", ",", "world", "!"])

[31414, 6, 8331, 328]

In [28]:
!pwd

/home/david/Documents/data_science/llm/david/Fine-Tuning-Lightning-Unfair



In [29]:
# dsdict = load_dataset(dsname, "unfair_tos")
# Split data into train, validation, and test
dsdict["train"] = load_dataset(dsname, "unfair_tos", split="train")
dsdict["validation"] = load_dataset(dsname, "unfair_tos", split="validation")
dsdict["test"] = load_dataset(dsname, "unfair_tos", split="test")

In [30]:
dsdict

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 5532
    })
    validation: Dataset({
        features: ['text', 'labels'],
        num_rows: 2275
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 1607
    })
})

In [31]:
dsdict["train"]

Dataset({
    features: ['text', 'labels'],
    num_rows: 5532
})

In [32]:
df_labels = pd.DataFrame(dsdict["train"]["labels"])
df_labels

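| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN |
| ... | ... | ... | ... |
| 5527 | NaN | NaN | NaN |
| 5528 | NaN | NaN | NaN |
| 5529 | 5.0 | 6.0 | NaN |
| 5530 | NaN | NaN | NaN |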


In [33]:
mask = df_labels[0].isna()
df_labels[~mask]

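| | 0 | 1 | 2 |
|---|---|---|---|
| 8 | 4.0 | NaN | NaN |
| 10 | 2.0 | NaN | NaN |
| 12 | 2.0 | NaN | NaN |
| 15 | 4.0 | NaN | NaN |
| 33 | 3.0 | 2.0 | 1.0 |
| ... | ... | ... | ... |
| 5512 | 0.0 | NaN | NaN |
| 5513 | 0.0 | NaN | NaN |
| 5516 | 1.0 | NaN | NaN |
| 5517 | 1.0 | NaN | NaN |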


In [34]:
dsdict["train"][0]

{'text': 'notice to california subscribers : you may cancel your subscription , without penalty or obligation , at any time prior to midnight of the third business day following the date you subscribed . \n',
 'labels': []}

In [35]:
dsdict["train"][1]

{'text': 'if you subscribed using your apple id , refunds are handled by apple , not tinder . \n',
 'labels': []}

In [36]:
print(dsdict["train"].features) 

{'text': Value(dtype='string', id=None), 'labels': Sequence(feature=ClassLabel(names=['Limitation of liability', 'Unilateral termination', 'Unilateral change', 'Content removal', 'Contract by using', 'Choice of law', 'Jurisdiction', 'Arbitration'], id=None), length=-1, id=None)}


### preprocess

In [37]:
# Define helper functions
def preprocess(batch: dict) -> dict:
    """Tokenize the text field and convert labels for binary classification."""
    tokens = tokenizer(
        batch["text"],
        max_length=max_length,
        padding="max_length",
        truncation=True,
    )
    tokens["label"] = [1 if label else 0 for label in batch["labels"]]
    return tokens
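To see what `padding="max_length"`, `truncation=True`, and the label conversion do, here is a plain-Python sketch on made-up token ids. It is a simplification, not the tokenizer's implementation: a real tokenizer also keeps the closing special token when it truncates, and the padding id depends on the model (`pad_id=0` is an assumption here). `pad_or_truncate` and `binarize` are hypothetical helpers.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = ids[:max_length]          # truncation
    n_pad = max_length - len(ids)   # padding needed to reach max_length
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

def binarize(labels_per_example):
    """Map each list of clause-type ids to 1 (unfair) or 0 (fair)."""
    return [1 if labels else 0 for labels in labels_per_example]

input_ids, attention_mask = pad_or_truncate([101, 7592, 102], max_length=6)
print(input_ids)                    # [101, 7592, 102, 0, 0, 0]
print(attention_mask)               # [1, 1, 1, 0, 0, 0]
print(binarize([[], [4], [5, 6]]))  # [0, 1, 1]
```

The attention mask is what lets the model ignore padded positions, and any non-empty label list collapses to the single "unfair" class.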

In [38]:
# Simulate a batch
sample_batch = {
    "text": [
        "Tinder may terminate your account at any time without notice if it believes that you have violated this agreement.",
        "Notice to California subscribers: you may cancel your subscription, without penalty or obligation, at any time prior to midnight of the third business day following the date you subscribed.",
    ],
    # One list of clause-type ids per text, as in the dataset; an empty list means "fair".
    "labels": [[1], []],  # Example labels (1 = "Unilateral termination")
}

In [39]:
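# Define max_length and tokenizer for preprocessing.
# Note: this replaces the RoBERTa tokenizer loaded above with BERT's, so the
# token ids shown below are BERT ids ([CLS]=101, [SEP]=102, [PAD]=0).
# For real training, the tokenizer must match the model checkpoint (roberta-base).
max_length = 128
pretrained_model = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)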

In [40]:
# Run the preprocess function on the simulated batch
processed_batch = preprocess(sample_batch)

In [41]:
processed_batch

{'input_ids': [[101, 9543, 4063, 2089, 20320, 2115, 4070, 2012, 2151, 2051, 2302, 5060, 2065, 2009, 7164, 2008, 2017, 2031, 14424, 2023, 3820, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 5060, 2000, 2662, 17073, 1024, 2017, 2089, 17542, 2115, 15002, 1010, 2302, 6531, 2030, 14987, 1010, 2012, 2151, 2051, 3188, 2000, 7090, 1997, 1996, 2353, 2449, 2154, 2206, 1996, 3058, 2017, 4942, 29234, 2094, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 

In [42]:
processed_batch.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'label'])

In [43]:
np.array(processed_batch.attention_mask).shape

(2, 128)

In [44]:
np.array(processed_batch.input_ids).shape

(2, 128)

In [45]:
np.array(processed_batch.input_ids)

array([[  101,  9543,  4063,  2089, 20320,  2115,  4070,  2012,  2151,
         2051,  2302,  5060,  2065,  2009,  7164,  2008,  2017,  2031,
        14424,  2023,  3820,  1012,   102,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
      

In [46]:
np.array(processed_batch.attention_mask)

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [47]:
np.array(processed_batch.token_type_ids)

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [48]:
np.array(processed_batch.label)

array([1, 0])

### balance sample

The function `balanced_sample` creates a balanced dataset from an imbalanced one: it downsamples the fair examples to the number of unfair examples, concatenates the two groups, and then draws `sample_size` rows from the result.

In [49]:
def balanced_sample(df: pl.DataFrame, sample_size: int, seed: int = 42) -> pl.DataFrame:
    """Balance the dataset by sampling an equal number of fair and unfair examples."""
    # 1 if the clause has at least one unfair label, else 0.
    # (Series.apply is not available in recent Polars releases such as 1.17.1.)
    fairness = (df["labels"].list.len() > 0).cast(pl.Int64)
    # Downsample the fair majority class to the number of unfair examples.
    fair = df.filter(fairness.eq(0)).sample(fairness.sum(), seed=seed)
    unfair = df.filter(fairness.ne(0))
    balanced = pl.concat([fair, unfair])
    return balanced.sample(n=sample_size, seed=seed)
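The same idea can be sketched on plain Python lists, which makes the three steps easy to follow without Polars. The rows below are synthetic, and `balanced_sample_rows` is a hypothetical helper that mirrors the logic rather than the notebook's actual function. Like the original, it assumes there are at least as many fair as unfair rows and that `sample_size` does not exceed the balanced total.

```python
import random

def balanced_sample_rows(rows, sample_size, seed=42):
    """Downsample fair rows to the number of unfair rows, then sample."""
    rng = random.Random(seed)
    unfair = [row for row in rows if row["labels"]]
    fair = [row for row in rows if not row["labels"]]
    fair = rng.sample(fair, k=len(unfair))      # downsample the majority class
    balanced = fair + unfair
    return rng.sample(balanced, k=sample_size)  # shuffled draw without replacement

# Synthetic data: 10 unfair rows and 40 fair rows.
rows = [{"text": f"clause {i}", "labels": [1] if i % 5 == 0 else []} for i in range(50)]

sample = balanced_sample_rows(rows, sample_size=20)
n_unfair = sum(1 for row in sample if row["labels"])
print(len(sample), n_unfair)  # 20 10
```

Here `sample_size` equals the size of the balanced pool, so the draw contains all 10 unfair rows and exactly 10 fair ones.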

In [50]:
pl.__version__

'1.17.1'

In [51]:
split = "train"
ds = dsdict[split]
ds.data.to_pandas().head(10)

| | text | labels |
|---|---|---|
| 0 | notice to california subscribers : you may can... | [] |
| 1 | if you subscribed using your apple id , refund... | [] |
| 2 | if you wish to request a refund , please visit... | [] |
| 3 | if you subscribed using your google play store... | [] |
| 4 | key changes in this version : we 've included ... | [] |
| 5 | for a summary of our terms of use , go to summ... | [] |
| 6 | welcome to tinder , operated by match group , ... | [] |
| 7 | acceptance of terms of use agreement . \n | [] |
| 8 | by creating a tinder account or by using the t... | [4] |
| 9 | if you do not accept and agree to be bound by ... | [] |


In [52]:
df = pl.from_arrow(ds.data.table)

In [53]:
df

| text (str) | labels (list[i64]) |
|---|---|
| "notice to california subscribe…" | [] |
| "if you subscribed using your a…" | [] |
| "if you wish to request a refun…" | [] |
| "if you subscribed using your g…" | [] |
| "key changes in this version : …" | [] |
| … | … |
| "any failure by us to enforce a…" | [] |
| "a person who is not a party to…" | [] |
| "irrespective of the country fr…" | [5, 6] |
| "if you require further informa…" | [] |


In [54]:
debug_mode_sample

In [55]:
type(df.select(pl.col("labels")))

polars.dataframe.frame.DataFrame

In [56]:
# # Define the UDF
# def min_length_one(label_list):
#     return min(len(label_list), 1)

# # Apply the UDF to the 'labels' column
# df = df.with_columns(
#     pl.col("labels").map_elements(min_length_one, return_dtype=pl.Int64).alias("fairness")
# )
# fairness = df["fairness"]

# seed = 42
# fair = df.filter(fairness.eq(0)).sample(fairness.sum(), seed=seed)
# unfair = df.filter(fairness.ne(0))
# balanced = pl.concat([fair, unfair])
# balanced.sample(n=sample_size, seed=seed)

In [57]:
tokenized_ds = ds.map(
    preprocess,
    batched=True,
    batch_size=32,
    load_from_cache_file=True,
)

In [58]:
tokenized_ds

Dataset({
    features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask', 'label'],
    num_rows: 5532
})

In [59]:
tokenized_ds.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

In [60]:
tokenized_ds["input_ids"]

tensor([[  101,  5060,  2000,  ...,     0,     0,     0],
        [  101,  2065,  2017,  ...,     0,     0,     0],
        [  101,  2065,  2017,  ...,     0,     0,     0],
        ...,
        [  101, 20868,  6072,  ...,     0,     0,     0],
        [  101,  2065,  2017,  ...,     0,     0,     0],
        [  101, 14084,  1010,  ...,     0,     0,     0]])

In [61]:
tokenized_ds["input_ids"].shape


torch.Size([5532, 128])

In [62]:
tokenized_ds["label"]

tensor([0, 0, 0,  ..., 1, 0, 0])

In [63]:
df

| text (str) | labels (list[i64]) |
|---|---|
| "notice to california subscribe…" | [] |
| "if you subscribed using your a…" | [] |
| "if you wish to request a refun…" | [] |
| "if you subscribed using your g…" | [] |
| "key changes in this version : …" | [] |
| … | … |
| "any failure by us to enforce a…" | [] |
| "a person who is not a party to…" | [] |
| "irrespective of the country fr…" | [5, 6] |
| "if you require further informa…" | [] |


### Let's replicate the DataModule setup step

In [64]:
def preprocess(batch: dict) -> dict:
    """Tokenize the text field and convert labels for binary classification."""
    tokens = tokenizer(
        batch["text"],
        max_length=max_length,
        padding="max_length",
        truncation=True,
    )
    tokens["label"] = [1 if label else 0 for label in batch["labels"]]
    return tokens

# Preprocess dataset
def shared_transform(split: str):
    """Tokenize and preprocess the dataset split."""
    ds = dsdict[split]
    tokenized_ds = ds.map(
        preprocess,
        batched=True,
        load_from_cache_file=True,
    )
    tokenized_ds.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
    return tokenized_ds

In [65]:
# Prepare tokenized datasets
train_dataset = shared_transform("train")
val_dataset = shared_transform("validation")
test_dataset = shared_transform("test")

In [66]:
# Create DataLoaders
train_dataloader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    shuffle=True,
    drop_last=True,
)

val_dataloader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=True,
)

test_dataloader = DataLoader(
    dataset=test_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=True,
)
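With `drop_last=True`, an incomplete final batch is discarded. The arithmetic below uses the split sizes printed earlier (5532, 2275, and 1607 rows) and the batch size of 256 to show how many batches each loader yields and how many rows are skipped; `num_batches` is a helper defined only for this example.

```python
import math

def num_batches(num_rows: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches a DataLoader yields for a dataset of num_rows rows."""
    if drop_last:
        return num_rows // batch_size
    return math.ceil(num_rows / batch_size)

BATCH_SIZE = 256
split_sizes = {"train": 5532, "validation": 2275, "test": 1607}

for split, num_rows in split_sizes.items():
    kept = num_batches(num_rows, BATCH_SIZE, drop_last=True)
    full = num_batches(num_rows, BATCH_SIZE, drop_last=False)
    skipped = num_rows - kept * BATCH_SIZE
    print(f"{split}: {kept} batches with drop_last=True ({skipped} rows skipped), {full} without")
```

The 21 training batches match the count that Lightning reports in the training log further down. For validation and test loaders, skipped rows never contribute to the reported metrics, so `drop_last=False` may be preferable there.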

In [67]:
batch_size

256

In [68]:
# Example usage
for batch in train_dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    labels = batch["label"]
    print(input_ids.shape, attention_mask.shape, labels.shape)
    break

torch.Size([256, 128]) torch.Size([256, 128]) torch.Size([256])


In [69]:
input_ids

tensor([[  101,  3531,  2022,  ...,     0,     0,     0],
        [  101,  2065,  2017,  ...,     0,     0,     0],
        [  101,  2017,  5993,  ...,     0,     0,     0],
        ...,
        [  101,  2065,  2017,  ...,     0,     0,     0],
        [  101,  2224,  2151,  ...,     0,     0,     0],
        [  101, 19575,  1010,  ...,     0,     0,     0]])

In [70]:
attention_mask

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])

# Checkpointing

In [71]:
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint

In [73]:
# Keep the model with the highest F1 score.
checkpoint_callback = ModelCheckpoint(
    filename="{epoch}-{Val_F1_Score:.2f}",
    monitor="Val_F1_Score",
    mode="max",
    verbose=True,
    save_top_k=1,
)

# EarlyStopping

In [74]:
earlystopping = EarlyStopping(
    monitor="Val_F1_Score",
    min_delta=train_config.min_delta,
    patience=train_config.patience,
    verbose=True,
    mode="max",
)

In [75]:
train_config.min_delta

0.005

In [76]:
train_config.patience

4
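How `min_delta` and `patience` interact in `mode="max"` can be sketched in a few lines. This is a simplified model of the stopping rule, not Lightning's implementation: a score counts as an improvement only if it beats the best score so far by more than `min_delta`, and training stops after `patience` consecutive checks without improvement. The validation scores are made up, and `first_stop_epoch` is a hypothetical helper.

```python
def first_stop_epoch(scores, min_delta=0.005, patience=4):
    """Return the epoch index at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(scores):
        if score - min_delta > best:  # improvement must exceed min_delta
            best, wait = score, 0
        else:
            wait += 1                 # another check without improvement
            if wait >= patience:
                return epoch
    return None

val_f1 = [0.60, 0.70, 0.702, 0.701, 0.704, 0.703, 0.90]  # made-up scores
print(first_stop_epoch(val_f1))  # 5
```

In this example the small gains after epoch 1 never exceed `min_delta`, so training stops at epoch 5 and the large jump at epoch 6 is never reached. That is the cost of a short patience.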

# Callbacks

In [77]:
l_callbacks = [earlystopping, checkpoint_callback]

In [78]:
from lightning.pytorch.callbacks import Callback as Cb
for callback in l_callbacks:
    assert isinstance(callback, Cb), f"{callback} is not a valid Callback"

# Trainer

In [79]:
from lightning import Trainer
import logging
from lightning.pytorch.loggers import CSVLogger

In [80]:
# Set up a standard Python logger
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("training_logger")

In [81]:
csv_logger = CSVLogger(save_dir=train_config.model_checkpoint_dir, name="logs")

In [82]:
print(bool(train_config.debug_mode_sample))
print(train_config.max_epochs)
print(train_config.model_checkpoint_dir)
print(train_config.max_time)

False
10
/home/david/Documents/data_science/llm/david/model-checkpoints
{'hours': 3}


In [83]:
torch.cuda.is_available()

True

In [84]:
trainer = Trainer(
    callbacks=l_callbacks,
    default_root_dir=train_config.model_checkpoint_dir,
    fast_dev_run=bool(train_config.debug_mode_sample),
    max_epochs=train_config.max_epochs,
    max_time=train_config.max_time,
    precision="bf16-mixed" if torch.cuda.is_available() else "32-true",
    logger=csv_logger,
)

Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


# DataModule

In [85]:
from data import LexGlueDataModule

In [86]:
datamodule = LexGlueDataModule(
    pretrained_model=train_config.pretrained_model,
    max_length=train_config.max_length,
    batch_size=train_config.batch_size,
    num_workers=train_config.num_workers,
    debug_mode_sample=train_config.debug_mode_sample,
)

# Model

In [87]:
# Example usage
for batch in train_dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    labels = batch["label"]
    print(input_ids.shape, attention_mask.shape, labels.shape)
    break

torch.Size([256, 128]) torch.Size([256, 128]) torch.Size([256])


### Forward

In [88]:
batch

{'input_ids': tensor([[  101,  2065,  2017,  ...,     0,     0,     0],
         [  101,  2057,  2467,  ...,     0,     0,     0],
         [  101,  2324,  1012,  ...,     0,     0,     0],
         ...,
         [  101,  2004,  1037,  ...,     0,     0,     0],
         [  101,  2017, 13399,  ...,     0,     0,     0],
         [  101,  7858,  2038,  ...,     0,     0,     0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'label': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [89]:
classif_out = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    labels=batch["label"],
)

In [90]:
classif_out.keys()


odict_keys(['loss', 'logits'])

In [91]:
classif_out.loss

tensor(0.6561, grad_fn=<NllLossBackward0>)

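For single-label classification, Hugging Face computes the loss with `torch.nn.CrossEntropyLoss`, which applies a log-softmax to the logits followed by the Negative Log-Likelihood Loss (NLLLoss). This is why the `grad_fn` above reads `NllLossBackward0`.

$$\text{loss} = - \frac{1}{N} \sum_{i=1}^N \log p_{i, y_i}, \qquad p_{i,c} = \frac{e^{z_{i,c}}}{\sum_{c'} e^{z_{i,c'}}}$$

Here $z_{i,c}$ is the logit of class $c$ for example $i$, and $y_i$ is the correct class of example $i$.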
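The loss above can be reproduced from raw logits in a few lines of NumPy. This is a minimal sketch on made-up logits and labels, not the library's code: it applies a numerically stable log-softmax and then averages the negative log-probability of the correct class.

```python
import numpy as np

# Made-up logits of shape (N, num_classes) and integer class labels.
logits = np.array([[2.0, 0.5], [0.1, 1.2], [0.3, 0.3]])
labels = np.array([0, 1, 1])

# Log-softmax, shifted by the row maximum for numerical stability.
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Negative log-likelihood of the correct class, averaged over the batch.
nll = -log_probs[np.arange(len(labels)), labels]
loss = nll.mean()
print(round(float(loss), 4))
```

The third example has equal logits for both classes, so its contribution is $\log 2 \approx 0.693$, the loss of an uninformed binary classifier. The value 0.6561 reported for the untrained head is close to that baseline.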

In [92]:
loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(classif_out.logits, labels)

In [93]:
classif_out.logits.shape

torch.Size([256, 2])

In [94]:
labels.shape

torch.Size([256])

In [95]:
loss

tensor(0.6561, grad_fn=<NllLossBackward0>)

# Train

In [96]:
from architectures import TransformerModule

In [97]:
model = TransformerModule(
    pretrained_model=train_config.pretrained_model,
    num_classes=train_config.num_classes,
    lr=train_config.lr,
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 887,042 || all params: 125,534,212 || trainable%: 0.7066


In [None]:
# trainer.fit(model=model, datamodule=datamodule)

INFO:pytorch_lightning.utilities.rank_zero:You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                               | Params | Mode 
---------------------------------------------------------------------
0 | model | PeftModelForSequenceClassification | 125 M  | train
---------------------------------------------------------------------
887 K     Trainable params
124 M     Non-trainable params
125 M     Total params
502.137   Total estimated model params size (MB)
244       Modules in train mode
234       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Map: 100%|██████████| 2275/2275 [00:00<00:00, 16724.46 examples/s]


                                                                           

Map: 100%|██████████| 5532/5532 [00:01<00:00, 3929.34 examples/s]
/home/david/anaconda3/envs/r_unfair/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py:310: The number of training batches (21) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.


Epoch 0: 100%|██████████| 21/21 [00:04<00:00,  4.85it/s, v_num=0]


In [101]:
# All Layers
for name, param in model.named_parameters():
    trainable_status = "Trainable" if param.requires_grad else "Frozen"
    # Delete the prefix in name " model.base_model.model.roberta":
    if "roberta" in name:
        name = name.split("model.base_model.model.roberta.")[1]
    print(f"Layer: {name} | Status: {trainable_status} | Shape: {param.shape}")

Layer: embeddings.word_embeddings.weight | Status: Frozen | Shape: torch.Size([50265, 768])
Layer: embeddings.position_embeddings.weight | Status: Frozen | Shape: torch.Size([514, 768])
Layer: embeddings.token_type_embeddings.weight | Status: Frozen | Shape: torch.Size([1, 768])
Layer: embeddings.LayerNorm.weight | Status: Frozen | Shape: torch.Size([768])
Layer: embeddings.LayerNorm.bias | Status: Frozen | Shape: torch.Size([768])
Layer: encoder.layer.0.attention.self.query.base_layer.weight | Status: Frozen | Shape: torch.Size([768, 768])
Layer: encoder.layer.0.attention.self.query.base_layer.bias | Status: Frozen | Shape: torch.Size([768])
Layer: encoder.layer.0.attention.self.query.lora_A.default.weight | Status: Trainable | Shape: torch.Size([8, 768])
Layer: encoder.layer.0.attention.self.query.lora_B.default.weight | Status: Trainable | Shape: torch.Size([768, 8])
Layer: encoder.layer.0.attention.self.key.weight | Status: Frozen | Shape: torch.Size([768, 768])
Layer: encoder.laye

In [102]:
# Only trainable
for name, param in model.named_parameters():
    trainable_status = "Trainable" if param.requires_grad else "Frozen"
    # Delete the prefix in name " model.base_model.model.roberta":
    if "roberta" in name:
        name = name.split("model.base_model.model.roberta.")[1]
    if trainable_status == "Trainable":
        print(f"Layer: {name} | Status: {trainable_status} | Shape: {param.shape}")

Layer: encoder.layer.0.attention.self.query.lora_A.default.weight | Status: Trainable | Shape: torch.Size([8, 768])
Layer: encoder.layer.0.attention.self.query.lora_B.default.weight | Status: Trainable | Shape: torch.Size([768, 8])
Layer: encoder.layer.0.attention.self.value.lora_A.default.weight | Status: Trainable | Shape: torch.Size([8, 768])
Layer: encoder.layer.0.attention.self.value.lora_B.default.weight | Status: Trainable | Shape: torch.Size([768, 8])
Layer: encoder.layer.1.attention.self.query.lora_A.default.weight | Status: Trainable | Shape: torch.Size([8, 768])
Layer: encoder.layer.1.attention.self.query.lora_B.default.weight | Status: Trainable | Shape: torch.Size([768, 8])
Layer: encoder.layer.1.attention.self.value.lora_A.default.weight | Status: Trainable | Shape: torch.Size([8, 768])
Layer: encoder.layer.1.attention.self.value.lora_B.default.weight | Status: Trainable | Shape: torch.Size([768, 8])
Layer: encoder.layer.2.attention.self.query.lora_A.default.weight | Stat