# Language Detector Training with LoRA

This is the notebook that was used to train the model [dominguesm/xlm-roberta-base-lora-language-detection](https://huggingface.co/dominguesm/xlm-roberta-base-lora-language-detection), which is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset.

In this notebook, we will learn how to use LoRA from 🤗 PEFT to fine-tune a sequence classification model while training ONLY 0.32% of the model's parameters.

LoRA adds low-rank "update matrices" to certain blocks in the underlying model (in this case the attention blocks) and ONLY trains those matrices during fine-tuning. During inference, these update matrices are merged with the original model parameters. For more details, check out the original [LoRA paper](https://arxiv.org/abs/2106.09685).
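
To make the "merge" step concrete, here is a minimal pure-Python sketch (not the PEFT implementation) showing that adding the scaled low-rank product to a frozen weight gives the same output as running the base layer and the adapter path side by side. The toy sizes, the random values, and the helper names `matmul`, `add_scaled` and `transpose` are assumptions for illustration. In a real LoRA layer `B` starts at zero; it is random here so the check is non-trivial.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(a, b, scale):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transpose(a):
    return [list(col) for col in zip(*a)]

random.seed(0)
d, r, alpha = 6, 2, 4      # toy sizes; this notebook uses d=768, r=8, lora_alpha=16
scaling = alpha / r        # LoRA scales the update by alpha / r

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen weight (out x in)
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # update matrix A (r x in)
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # update matrix B (out x r)
x = [[random.gauss(0, 1) for _ in range(d)]]                    # one input row (1 x in)

# Fine-tuning view: frozen layer output plus the scaled low-rank path.
base_out = matmul(x, transpose(W))
adapter_out = matmul(matmul(x, transpose(A)), transpose(B))
y_adapter = add_scaled(base_out, adapter_out, scaling)

# Inference view: fold the update into the weight once, then do a single matmul.
W_merged = add_scaled(W, matmul(B, A), scaling)
y_merged = matmul(x, transpose(W_merged))

max_diff = max(abs(p - q) for p, q in zip(y_adapter[0], y_merged[0]))
print(f"max difference between the two views: {max_diff:.2e}")
```

Because the two views agree up to floating-point error, a merged model adds no extra latency at inference time.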

Let's get started by installing the dependencies. 

## Install dependencies

Here we install `transformers` and `peft` from source to get access to their latest features.

In [6]:
!pip install -q bitsandbytes watermark langid loralib "datasets>=2.6.1"
!pip install -q git+https://github.com/huggingface/transformers
!pip install -q git+https://github.com/huggingface/peft.git@main

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [7]:
!nvidia-smi

Mon Mar 13 16:24:35 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    49W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Check the library versions

In [8]:
import os

import torch
from datasets import load_dataset
from peft import (LoraConfig, PeftType, PrefixTuningConfig,
                  PromptEncoderConfig, get_peft_config, get_peft_model,
                  get_peft_model_state_dict, set_peft_model_state_dict)
from sklearn.metrics import accuracy_score, classification_report, f1_score
from tqdm import tqdm
from transformers import (AutoModelForSequenceClassification, AutoConfig, AutoTokenizer,
                          Trainer, TrainingArguments,
                          get_linear_schedule_with_warmup, set_seed)


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues




CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


In [9]:
%load_ext watermark
%watermark -p torch,datasets,sklearn,transformers,langid,peft

torch       : 1.13.1+cu116
datasets    : 2.10.1
sklearn     : 1.2.1
transformers: 4.27.0.dev0
langid      : 1.1.6
peft        : 0.3.0.dev0



## Params

In [30]:
MODEL_NAME_OR_PATH = "xlm-roberta-base"
MAX_LENGTH = 128
DEVICE = "cuda"
NUM_EPOCHS = 2
PADDING_SIDE = "right"
LR = 2e-5
TRAIN_BS = 64
EVAL_BS = TRAIN_BS * 2

## Load Dataset

In [11]:
datasets = load_dataset("papluca/language-identification")
print(datasets)


DatasetDict({
    train: Dataset({
        features: ['labels', 'text'],
        num_rows: 70000
    })
    test: Dataset({
        features: ['labels', 'text'],
        num_rows: 10000
    })
    validation: Dataset({
        features: ['labels', 'text'],
        num_rows: 10000
    })
})


## Prepare Tokenizer and Data

In [12]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME_OR_PATH, padding_side=PADDING_SIDE)
# Fall back to the EOS token if the tokenizer defines no padding token
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id


### Prepare Data

In [13]:
def tokenize_function(examples):
    outputs = tokenizer(examples["text"], truncation=True, max_length=MAX_LENGTH)
    return outputs

In [14]:
tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
)


In [15]:
amazon_languages = ['en', 'de', 'fr', 'es', 'ja', 'zh']
xnli_languages = ['ar', 'el', 'hi', 'ru', 'th', 'tr', 'vi', 'bg', 'sw', 'ur']
stsb_languages = ['it', 'nl', 'pl', 'pt']

all_langs = sorted(list(set(amazon_languages + xnli_languages + stsb_languages)))

In [16]:
id2label = {idx: all_langs[idx] for idx in range(len(all_langs))}
label2id = {v: k for k, v in id2label.items()}
label2id

{'ar': 0,
 'bg': 1,
 'de': 2,
 'el': 3,
 'en': 4,
 'es': 5,
 'fr': 6,
 'hi': 7,
 'it': 8,
 'ja': 9,
 'nl': 10,
 'pl': 11,
 'pt': 12,
 'ru': 13,
 'sw': 14,
 'th': 15,
 'tr': 16,
 'ur': 17,
 'vi': 18,
 'zh': 19}
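
As a rough illustration of how this mapping is used at prediction time, the sketch below turns a vector of 20 class logits into a language code and a score via softmax and argmax. The logits are made up, and `decode` is a hypothetical helper rather than part of `transformers`.

```python
import math

all_langs = sorted([
    "en", "de", "fr", "es", "ja", "zh",                           # amazon_languages
    "ar", "el", "hi", "ru", "th", "tr", "vi", "bg", "sw", "ur",   # xnli_languages
    "it", "nl", "pl", "pt",                                       # stsb_languages
])
id2label = dict(enumerate(all_langs))

def decode(logits):
    """Softmax over the logits, then map the best index back to a language code."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

logits = [0.0] * len(all_langs)   # hypothetical model output
logits[8] = 5.0                   # index 8 is 'it' in the mapping above
pred = decode(logits)
print(pred)
```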

In [19]:
tokenized_datasets = tokenized_datasets.map(lambda example: {"labels": label2id[example["labels"]]})


In [22]:
tok_train = tokenized_datasets['train']
tok_valid = tokenized_datasets['validation']
tok_test = tokenized_datasets['test']

print(f"Train / valid / test samples: {len(tok_train)} / {len(tok_valid)} / {len(tok_test)}")

Train / valid / test samples: 70000 / 10000 / 10000


## Training and Evaluation


### Define a Data Collator

In [23]:
def collate_fn(examples):
    return tokenizer.pad(examples, padding=True, return_tensors="pt")
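
The collator above pads each batch dynamically, that is, only up to the longest sequence in that batch. The pure-Python sketch below mimics the idea for `input_ids` and `attention_mask` only. `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_token_id=1` is taken from the `padding_idx=1` shown in the model printout; the real `tokenizer.pad` also carries the labels through and returns tensors.

```python
def pad_batch(examples, pad_token_id=1):
    """Right-pad every example to the length of the longest one in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    input_ids, attention_mask = [], []
    for ex in examples:
        n_tokens = len(ex["input_ids"])
        n_pad = max_len - n_tokens
        input_ids.append(ex["input_ids"] + [pad_token_id] * n_pad)
        attention_mask.append([1] * n_tokens + [0] * n_pad)   # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids, for illustration only
batch = pad_batch([
    {"input_ids": [0, 11, 12, 2]},
    {"input_ids": [0, 21, 22, 23, 24, 2]},
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to `MAX_LENGTH` keeps short batches small, which usually saves compute.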

### Evaluation Metrics

In [24]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average="weighted")
    return {
        "accuracy": acc,
        "f1": f1
        }
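
For readers unfamiliar with `average="weighted"`, the following hand-rolled sketch computes accuracy and support-weighted F1 on a tiny made-up example. It is meant to illustrate the definition, not to replace `sklearn.metrics`; the helper name and the toy labels are assumptions.

```python
def accuracy_and_weighted_f1(labels, preds):
    """Accuracy plus per-class F1 averaged with weights equal to class support."""
    pairs = list(zip(labels, preds))
    accuracy = sum(l == p for l, p in pairs) / len(labels)
    weighted_f1 = 0.0
    for cls in set(labels):
        tp = sum(l == cls and p == cls for l, p in pairs)
        fp = sum(l != cls and p == cls for l, p in pairs)
        fn = sum(l == cls and p != cls for l, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = labels.count(cls)          # number of true examples of this class
        weighted_f1 += f1 * support / len(labels)
    return accuracy, weighted_f1

# Toy data: one 'it' sentence and the only 'es' sentence are predicted as 'pt'
toy_labels = ["it", "it", "it", "it", "pt", "es"]
toy_preds = ["it", "it", "it", "pt", "pt", "pt"]
acc, f1 = accuracy_and_weighted_f1(toy_labels, toy_preds)
print(f"accuracy={acc:.4f} weighted_f1={f1:.4f}")
```

The per-class F1 scores here are 6/7 for `it`, 0.5 for `pt` and 0 for `es`, weighted by supports of 4, 1 and 1.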

### Load a Pre-Trained Checkpoint

In [25]:
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME_OR_PATH,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True, 
    return_dict=True
)


Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.dense.bias', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense

### Apply LoRA


In [26]:
peft_config = LoraConfig(
    task_type="SEQ_CLS",
    inference_mode=False,
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["classifier"],
)

In [27]:
lora_model = get_peft_model(model, peft_config)

In [28]:
lora_model.print_trainable_parameters()

trainable params: 900884 || all params: 278353940 || trainable%: 0.3236469367022432
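
The trainable-parameter count can be reproduced by hand. The sketch below assumes the standard `xlm-roberta-base` shape (12 encoder layers, hidden size 768) and a classification head made of a dense layer followed by an output projection; both are assumptions based on the model configuration rather than something computed in this notebook.

```python
hidden_size = 768       # matches the Linear(in_features=768, ...) layers in the printout
num_layers = 12         # assumption: encoder depth of xlm-roberta-base
r = 8                   # LoRA rank from LoraConfig
n_target_modules = 2    # "query" and "value" in every attention block
num_labels = 20         # one class per language

# Each adapted module gets lora_A (768 -> 8) and lora_B (8 -> 768), both without bias.
lora_per_module = hidden_size * r + r * hidden_size
lora_params = num_layers * n_target_modules * lora_per_module

# modules_to_save=["classifier"] keeps the whole classification head trainable.
classifier_dense = hidden_size * hidden_size + hidden_size     # weights + bias
classifier_out = hidden_size * num_labels + num_labels         # weights + bias
classifier_params = classifier_dense + classifier_out

trainable = lora_params + classifier_params
all_params = 278_353_940    # "all params" as reported above
trainable_pct = 100 * trainable / all_params
print(f"LoRA: {lora_params}, classifier: {classifier_params}, total: {trainable}")
print(f"trainable%: {trainable_pct:.4f}")
```

Under these assumptions, roughly a third of the trainable parameters come from the LoRA matrices and the rest from the classification head.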


In [29]:
lora_model.to(DEVICE)

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): XLMRobertaForSequenceClassification(
      (roberta): XLMRobertaModel(
        (embeddings): XLMRobertaEmbeddings(
          (word_embeddings): Embedding(250002, 768, padding_idx=1)
          (position_embeddings): Embedding(514, 768, padding_idx=1)
          (token_type_embeddings): Embedding(1, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): XLMRobertaEncoder(
          (layer): ModuleList(
            (0): XLMRobertaLayer(
              (attention): XLMRobertaAttention(
                (self): XLMRobertaSelfAttention(
                  (query): Linear(
                    in_features=768, out_features=768, bias=True
                    (lora_dropout): Dropout(p=0.1, inplace=False)
                    (lora_A): Linear(in_features=768, out_features=8, bias=False)
                    (lora_B): Linea

## Training arguments

In [31]:
logging_steps = len(tokenized_datasets["train"]) // TRAIN_BS
output_dir = "./xlm-roberta-base-lora-language-detection"
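
With these values the training loss is logged about once per epoch. The arithmetic below is a small sketch that assumes a single GPU and no gradient accumulation, so that one optimizer step consumes one batch of `TRAIN_BS` samples.

```python
import math

train_samples = 70_000
TRAIN_BS = 64
NUM_EPOCHS = 2

logging_steps = train_samples // TRAIN_BS               # floor division, as above
steps_per_epoch = math.ceil(train_samples / TRAIN_BS)   # the last partial batch still counts
total_steps = steps_per_epoch * NUM_EPOCHS
n_log_events = total_steps // logging_steps

print(f"logging_steps={logging_steps}, steps_per_epoch={steps_per_epoch}")
print(f"total_steps={total_steps}, training-loss log events={n_log_events}")
```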

In [33]:
args = TrainingArguments(
    optim='adamw_torch',
    output_dir=output_dir,
    num_train_epochs=NUM_EPOCHS,
    learning_rate=LR,
    per_device_train_batch_size=TRAIN_BS,
    per_device_eval_batch_size=EVAL_BS,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=logging_steps,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
    fp16=True,  # Remove if GPU doesn't support it
)

## Train and evaluate

In [34]:
trainer = Trainer(
    model=lora_model,  # the PEFT-wrapped model returned by get_peft_model
    args=args,
    compute_metrics=compute_metrics,
    train_dataset=tok_train,
    eval_dataset=tok_valid,
    data_collator=collate_fn,
    tokenizer=tokenizer,
)

In [35]:
train_results = trainer.train()

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 1.4403 | 0.059124 | 0.9952 | 0.995203 |
| 2 | 0.2567 | 0.027223 | 0.9955 | 0.995502 |


In [36]:
trainer.evaluate(tok_test)

{'eval_loss': 0.02986098639667034,
 'eval_accuracy': 0.9946,
 'eval_f1': 0.994605415510712,
 'eval_runtime': 3.9619,
 'eval_samples_per_second': 2524.072,
 'eval_steps_per_second': 19.94,
 'epoch': 2.0}

## Inference 

The steps below assume that you have published your model on the Hugging Face Hub. You can find more information in the [official documentation](https://huggingface.co/docs/transformers/model_sharing).

**A few important notes:**
1. Call `pipe()` inside an autocast context manager: `torch.cuda.amp.autocast()` on GPU or `torch.cpu.amp.autocast()` on CPU.
2. You will get a warning along the lines of the one below, which is **safe to ignore**.

```
The model 'PeftModelForSequenceClassification' is not supported for text-classification. Supported models are ['AlbertForSequenceClassification', 'BartForSequenceClassification', 'BertForSequenceClassification', 'BigBirdForSequenceClassification', 'BigBirdPegasusForSequenceClassification', 'BloomForSequenceClassification', 'CamembertForSequenceClassification', 'CanineForSequenceClassification', 'ConvBertForSequenceClassification', 'CTRLForSequenceClassification', ...].
```

In [37]:
repo_name = "dominguesm/xlm-roberta-base-lora-language-detection"

In [38]:
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer
from transformers import pipeline

# Load the adapter config to find out which base model the adapter was trained on
peft_config = PeftConfig.from_pretrained(repo_name)
config = AutoConfig.from_pretrained(repo_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    peft_config.base_model_name_or_path,
    config=config
)
tokenizer = AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path)
# Attach the trained LoRA weights to the base model
inference_model = PeftModel.from_pretrained(base_model, repo_name)
pipe = pipeline("text-classification", model=inference_model, tokenizer=tokenizer)


Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.dense.bias', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense


The model 'PeftModelForSequenceClassification' is not supported for text-classification. Supported models are ['AlbertForSequenceClassification', 'BartForSequenceClassification', 'BertForSequenceClassification', 'BigBirdForSequenceClassification', 'BigBirdPegasusForSequenceClassification', 'BloomForSequenceClassification', 'CamembertForSequenceClassification', 'CanineForSequenceClassification', 'ConvBertForSequenceClassification', 'CTRLForSequenceClassification', 'Data2VecTextForSequenceClassification', 'DebertaForSequenceClassification', 'DebertaV2ForSequenceClassification', 'DistilBertForSequenceClassification', 'ElectraForSequenceClassification', 'ErnieForSequenceClassification', 'ErnieMForSequenceClassification', 'EsmForSequenceClassification', 'FlaubertForSequenceClassification', 'FNetForSequenceClassification', 'FunnelForSequenceClassification', 'GPT2ForSequenceClassification', 'GPT2ForSequenceClassification', 'GPTNeoForSequenceClassification', 'GPTJForSequenceClassification', 'I

In [52]:
def detect_lang(text):
    with torch.cuda.amp.autocast():
        pred = pipe(text)
    return pred[0]

In [53]:
detect_lang("Tutte queste cose sono un sottoprodotto del fatto che stiamo allenando solo un piccolo numero di parametri.")



{'label': 'it', 'score': 0.9986590147018433}

In [54]:
detect_lang("Cada qual sabe amar a seu modo; o modo, pouco importa; o essencial é que saiba amar.")

{'label': 'pt', 'score': 0.9948723912239075}

## Benchmarking our model

So far we have only used the test set for a single `trainer.evaluate` call, so let's now use it to benchmark our model against `langid`!

In [44]:
import langid
import time

In [45]:
ds_test = datasets["test"].to_pandas()
ds_test.head(3)

| | labels | text |
|---|---|---|
| 0 | nl | Een man zingt en speelt gitaar. |
| 1 | nl | De technologisch geplaatste Nasdaq Composite I... |
| 2 | es | Es muy resistente la parte trasera rígida y lo... |


### Langid

In [46]:
# Constrain the language set
langid.set_languages(all_langs)

In [47]:
start_time = time.perf_counter()
langid_preds = [langid.classify(s)[0] for s in ds_test.text.values.tolist()]
print(f"{time.perf_counter() - start_time:.2f} seconds")

4.67 seconds


Classification report for **langid**:

In [48]:
print(classification_report(ds_test.labels.values.tolist(), langid_preds, digits=3))

              precision    recall  f1-score   support

          ar      1.000     0.996     0.998       500
          bg      0.971     0.990     0.980       500
          de      0.980     1.000     0.990       500
          el      1.000     1.000     1.000       500
          en      0.950     0.996     0.973       500
          es      0.927     0.988     0.956       500
          fr      0.986     0.996     0.991       500
          hi      1.000     0.968     0.984       500
          it      0.990     0.968     0.979       500
          ja      0.996     1.000     0.998       500
          nl      0.996     0.974     0.985       500
          pl      0.990     0.980     0.985       500
          pt      0.992     0.944     0.967       500
          ru      0.990     0.970     0.980       500
          sw      0.949     0.976     0.963       500
          th      1.000     1.000     1.000       500
          tr      1.000     0.998     0.999       500
          ur      0.998    

### Our model

In [56]:
start_time = time.perf_counter()
model_preds = [s['label'] for s in pipe(ds_test.text.values.tolist())]
print(f"{time.perf_counter() - start_time:.2f} seconds")

150.91 seconds
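
To put the two timings side by side, the short calculation below converts them into throughput over the 10,000 test sentences. It only restates the wall-clock numbers printed in this notebook and says nothing about other hardware.

```python
n_samples = 10_000
langid_seconds = 4.67     # langid timing printed earlier
model_seconds = 150.91    # pipeline timing printed above

langid_rate = n_samples / langid_seconds
model_rate = n_samples / model_seconds
slowdown = model_seconds / langid_seconds

print(f"langid   : {langid_rate:7.1f} samples/s")
print(f"our model: {model_rate:7.1f} samples/s")
print(f"our model takes about {slowdown:.1f}x longer on this test set")
```

Note that `pipe` was called here without a `batch_size` argument, so passing one may narrow the gap; that was not measured in this notebook.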


In [57]:
print(classification_report(ds_test.labels.values.tolist(), model_preds, digits=3))

              precision    recall  f1-score   support

          ar      1.000     0.998     0.999       500
          bg      0.990     1.000     0.995       500
          de      1.000     1.000     1.000       500
          el      1.000     1.000     1.000       500
          en      0.992     0.994     0.993       500
          es      0.994     0.992     0.993       500
          fr      0.998     0.998     0.998       500
          hi      0.947     1.000     0.973       500
          it      1.000     0.984     0.992       500
          ja      1.000     1.000     1.000       500
          nl      0.996     0.992     0.994       500
          pl      0.994     0.988     0.991       500
          pt      0.992     0.986     0.989       500
          ru      0.998     0.996     0.997       500
          sw      0.992     0.998     0.995       500
          th      1.000     1.000     1.000       500
          tr      1.000     1.000     1.000       500
          ur      1.000    