# [NLLB-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)


Model Fine-tuning

[NLLB-200](https://huggingface.co/docs/transformers/model_doc/nllb) is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks.

In [None]:
!nvidia-smi

Tue Sep 24 13:39:30 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
# Installing necessary libraries for data processing and model fine-tuning
!pip install datasets
!pip install -U bitsandbytes
!pip install PEFT
!pip install wandb
!pip install evaluate
!pip install sacrebleu



### Weights & Biases for results tracking

References for Weights & Biases (W&B): a [hands-on guide](https://analyticsindiamag.com/hands-on-guide-to-weights-and-biases-wandb-with-python-implementation/) and the official documentation at https://docs.wandb.ai/


In [None]:
import os

#os.environ["WANDB_DISABLED"]="true"
os.environ["WANDB_PROJECT"]   = "NLLB-200-distille-Experiments"
os.environ["WANDB_LOG_MODEL"] = "end"

# Fine-tuning the model on a translation task
We will fine-tune the Hugging Face NLLB model for French-to-Wolof translation. We will use the Baamtu dataset, a machine translation dataset compiled from various sources, including news, commentaries and books.

In [None]:
import transformers

transformers.set_seed(7)
print(transformers.__version__)

4.44.2


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Loading the dataset
We will use the [datasets](https://huggingface.co/docs/datasets/) library to load the data and the `evaluate` library to get the metric we need for evaluation.  
This is done with the functions `load_dataset` and `evaluate.load`.

In [None]:

#%cd /home/ubuntu/alain/traduction-fr-wolof/unidirection
%cd /content/drive/MyDrive/unidirection/


/content/drive/MyDrive/unidirection


In [None]:
ls

fr_wo/  wo_fr/


In [None]:
path_data_dir  = "/content/drive/MyDrive/unidirection/fr_wo/"


### Loading data

In [None]:
# loading  data
from datasets import load_dataset
data = load_dataset(path_data_dir)

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['translation', 'codes'],
        num_rows: 145000
    })
    validation: Dataset({
        features: ['translation', 'codes'],
        num_rows: 14000
    })
    test: Dataset({
        features: ['translation', 'codes'],
        num_rows: 6964
    })
})

> __NOTE:__ We've added the [Microsoft NTREX dataset](https://github.com/MicrosoftTranslator/NTREX) into the training set

In [None]:
data['train'][8]

{'translation': {'src': "être dans l'opposition ne changera rien à mon investissement.",
  'tgt': 'nekk ci kujjee gi du soppi dara ci li may def'},
 'codes': {'src': 'fr', 'tgt': 'wo'}}

In [None]:
import evaluate

metric  = evaluate.load("sacrebleu")

# Preparation

## Loading the model & tokenizer

NLLB-200 uses the same model architecture and config as the `M2M-100` implementation, but its tokenizer is modified to handle the NLLB language codes.  
Both the model and the tokenizer are loaded from the Hugging Face Hub checkpoint with the `Auto*` classes.

In [None]:
model_checkpoint = 'facebook/nllb-200-distilled-600M'

In [None]:
from transformers import AutoModelForSeq2SeqLM,AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)




# Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a `Transformers Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

* we get a tokenizer that corresponds to the model architecture we want to use,
* we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

If you downloaded the model manually, you can pass the path of its local directory instead of `model_checkpoint`.

In [None]:
# We can call this tokenizer directly on a single sentence or on a list of sentences
tokenizer(["Ceci est une phrase!", "Ceci est une autre phrase encore."])

{'input_ids': [[256047, 168269, 613, 3335, 136505, 248203, 2], [256047, 168269, 613, 3335, 36091, 136505, 30522, 248075, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}

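To prepare the targets for our model, older tutorials tokenize them inside the `as_target_tokenizer` context manager, so that the tokenizer uses the special tokens corresponding to the targets. That context manager is deprecated in recent Transformers releases in favor of the `text_target` argument. The function below tokenizes sources and targets with the same plain tokenizer call, so both sides receive the tokenizer's default language code (`eng_Latn`, ID 256047 in the outputs shown here).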

We can then write the function that will preprocess our samples. We just feed them to the tokenizer with the argument `truncation=True`. This ensures that an input longer than the limit we set is truncated to `max_length` (128 tokens here). Padding is best dealt with later on (in a `data collator`), so that examples are padded to the longest length in the batch and not in the whole dataset.
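
As a rough illustration of why per-batch padding is cheaper than padding the whole dataset at once, here is a minimal pure-Python sketch. The helper names (`truncate`, `pad_batch`) and the pad ID are hypothetical stand-ins, not the tokenizer's or the collator's actual code; a real tokenizer also keeps the end-of-sequence token when it truncates, which this toy slice does not.

```python
# Toy stand-in for truncation + dynamic padding (illustration only).
PAD_ID = 1  # assumed pad ID, chosen for the example

def truncate(ids, max_length):
    """Keep at most max_length token IDs."""
    return ids[:max_length]

def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]

sequences = [[5, 6, 7, 8, 9, 10], [5, 6], [5, 6, 7]]
truncated = [truncate(ids, max_length=4) for ids in sequences]

# Two different batches end up with two different padded lengths
batch_a = pad_batch(truncated[:2])  # longest sequence has 4 tokens
batch_b = pad_batch(truncated[1:])  # longest sequence has 3 tokens
print(batch_a)
print(batch_b)
```

The second batch is padded to length 3 rather than 4, which is the saving that dynamic padding provides.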

In [None]:
max_length = 128
max_input_length   =  128
max_target_length =  128
source_lang =  "src"
target_lang =  "tgt"


def preprocess_function(examples):
    # Each batch holds a list of {"src": ..., "tgt": ...} dicts under "translation"
    inputs  = [ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]

    # Truncate only; padding is left to the data collator (done per batch)
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the targets; their input IDs become the labels.
    # No src_lang/tgt_lang is set, so both sides get the default code (eng_Latn).
    labels = tokenizer(targets, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
preprocess_function(data['train'][:1])

{'input_ids': [[256047, 219, 55, 248116, 18, 106725, 82, 104287, 1956, 79, 5492, 702, 93, 563, 613, 134060, 22, 79, 57090, 12698, 14, 153, 153003, 702, 60088, 179594, 14351, 248075, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[256047, 548, 18818, 8054, 90701, 229968, 225503, 169, 702, 93, 12207, 118, 14733, 41, 89817, 37770, 8231, 41700, 259, 423, 923, 248105, 197, 923, 1593, 285, 2768, 248071, 2]]}

To apply this function on all the pairs of sentences in our dataset, we just use the `map` method of our dataset object we created earlier. This will apply the function on all the elements of all the splits in dataset, so our training, validation and testing data will be preprocessed in one single command.
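
To make the batched behaviour concrete, the sketch below mimics it in plain Python: the mapped function receives a dict of columns (each a list), and the columns it returns are merged back into the rows. `toy_batched_map` and `count_source_words` are hypothetical names for illustration; this is not how the `datasets` library is implemented. The sentences are taken from the cells above, with empty targets where none is given.

```python
# Toy re-implementation of a batched map (illustration only, not the datasets API).
def toy_batched_map(rows, fn, batch_size):
    mapped = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # The function sees one dict of columns, each a list of the chunk's values
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, row in enumerate(chunk):
            # Merge the returned columns back into each original row
            mapped.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return mapped

def count_source_words(batch):
    # Same access pattern as preprocess_function: iterate over batch["translation"]
    return {"n_words": [len(ex["src"].split()) for ex in batch["translation"]]}

rows = [
    {"translation": {"src": "Ceci est une phrase!", "tgt": ""}},
    {"translation": {"src": "Ceci est une autre phrase encore.", "tgt": ""}},
    {"translation": {"src": "être dans l'opposition ne changera rien à mon investissement.",
                     "tgt": "nekk ci kujjee gi du soppi dara ci li may def"}},
]

mapped = toy_batched_map(rows, count_source_words, batch_size=2)
word_counts = [row["n_words"] for row in mapped]
print(word_counts)
```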

In [None]:
tokenized_dataset = data.map(preprocess_function,
                             batched    = True,
                             batch_size = 100)

Map:   0%|          | 0/145000 [00:00<?, ? examples/s]

Map:   0%|          | 0/14000 [00:00<?, ? examples/s]

Map:   0%|          | 0/6964 [00:00<?, ? examples/s]

In [None]:
print("Tokenized dataset details:")
# Use a separate loop variable so the raw `data` DatasetDict is not overwritten
for subset, split_ds in tokenized_dataset.items():
    print(f"- {subset} set size: {len(split_ds)}")

Tokenized dataset details:
- train set size: 145000
- validation set size: 14000
- test set size: 6964


The split sizes match the raw dataset 🥳

# Fine-Tuning
Now that our data is ready, we can fine-tune the pretrained model loaded earlier. Since our task is of the sequence-to-sequence kind, we used the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method downloaded and cached the model for us.

In [None]:
from transformers import (DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments,
                          Seq2SeqTrainer,
                          EarlyStoppingCallback)

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [Seq2SeqTrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
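batch_size = 16
model_name = model_checkpoint.split("/")[-1]
# Language codes used only to name the output directory; separate names avoid
# overwriting source_lang/target_lang, which preprocess_function still reads
src_code   = 'fr'
tgt_code   = 'wo'
output_dir = "models/{}-finetuned-{}-to-{}".format(model_name, src_code, tgt_code)

args = Seq2SeqTrainingArguments(output_dir,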
                                evaluation_strategy         = "steps",
                                eval_steps                  = 1000,
                                save_steps                  = 1000,
                                learning_rate               = 2e-5,
                                per_device_train_batch_size = batch_size,
                                per_device_eval_batch_size  = batch_size,
                                weight_decay                = 0.01,
                                save_total_limit            = 5, # Only last 5 models are saved. Older ones are deleted.
                                num_train_epochs            = 120,
                                predict_with_generate       = True,
                                report_to                   = 'all',
                                load_best_model_at_end      = True
                            )



Here we set the evaluation to run every 1,000 training steps (with checkpoints saved on the same schedule), tweak the learning rate, use the `batch_size` defined at the top of the cell and customize the weight decay. Since the `Seq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to keep five checkpoints at most. We use the `predict_with_generate` option so that evaluation generates full translations, and `load_best_model_at_end` so that the best checkpoint is reloaded when training stops. Mixed precision training is not enabled here; passing `fp16=True` would turn it on.
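
To get a feel for what these settings mean in practice, here is a back-of-the-envelope calculation. It assumes a single GPU, no gradient accumulation and no dropped last batch (none of these is set explicitly in the arguments above); the variable names are ours.

```python
import math

train_rows       = 145_000  # size of the training split shown earlier
batch_size       = 16
eval_steps       = 1000
num_train_epochs = 120
patience         = 5        # early_stopping_patience used for the trainer

# Optimizer steps needed to see the whole training split once
steps_per_epoch = math.ceil(train_rows / batch_size)

# Full evaluations (and checkpoint saves) that fit in one epoch
evals_per_epoch = steps_per_epoch // eval_steps

# Upper bound on training length if early stopping never triggers
total_steps = steps_per_epoch * num_train_epochs

# Steps without improvement that early stopping tolerates
stop_window = patience * eval_steps

print(steps_per_epoch, evals_per_epoch, total_steps, stop_window)
```

Under these assumptions, one epoch is a little over 9,000 steps, so the 120-epoch budget acts as a ceiling and early stopping is what ends training in practice.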

The model will be saved under the **models/{model_name}-finetuned-{source_lang}-to-{target_lang}** directory, here `models/nllb-200-distilled-600M-finetuned-fr-to-wo`.

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels.

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer,
                                       model     = model)
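
The label padding deserves a closer look: `DataCollatorForSeq2Seq` pads labels with `-100` by default, a value the cross-entropy loss ignores, so padded positions do not contribute to training. The sketch below imitates that idea in plain Python; `pad_labels` and `scored_positions` are hypothetical helpers and the token IDs are arbitrary.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def pad_labels(label_batch, pad_value=IGNORE_INDEX):
    """Pad each label sequence to the longest one in the batch."""
    longest = max(len(ids) for ids in label_batch)
    return [ids + [pad_value] * (longest - len(ids)) for ids in label_batch]

def scored_positions(padded_labels, ignore=IGNORE_INDEX):
    """Count the label positions that would contribute to the loss."""
    return sum(1 for row in padded_labels for tok in row if tok != ignore)

label_batch = [[548, 18818, 2], [548, 2]]
padded = pad_labels(label_batch)
n_scored = scored_positions(padded)
print(padded, n_scored)
```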

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the metric we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts.

In [None]:
import numpy as np

def postprocess_text(preds, labels):
    preds  = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens   = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}

    return result
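
The three data manipulations inside `compute_metrics` can be checked on toy inputs without a model or tokenizer. This sketch uses plain lists instead of NumPy, an assumed pad ID of 1, and hypothetical helper names; it mirrors the logic above rather than replacing it.

```python
PAD_ID = 1           # assumed pad token ID for the example
IGNORE_INDEX = -100  # value used to mask padded label positions

def restore_pad_ids(label_ids, pad_id=PAD_ID):
    """Swap -100 back to the pad ID so the labels can be decoded."""
    return [[pad_id if tok == IGNORE_INDEX else tok for tok in row] for row in label_ids]

def generated_lengths(pred_ids, pad_id=PAD_ID):
    """Number of non-pad tokens in each prediction."""
    return [sum(1 for tok in row if tok != pad_id) for row in pred_ids]

def wrap_references(decoded_labels):
    """sacrebleu expects a list of reference lists, one list per prediction."""
    return [[label.strip()] for label in decoded_labels]

label_ids = [[548, 2, -100, -100], [548, 8054, 90701, 2]]
pred_ids  = [[2, 548, 2, 1], [2, 548, 8054, 2]]

restored   = restore_pad_ids(label_ids)
lengths    = generated_lengths(pred_ids)
mean_len   = sum(lengths) / len(lengths)
references = wrap_references([" nekk ci kujjee gi ", "du soppi dara"])
print(restored, lengths, mean_len, references)
```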

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`.

In [None]:
trainer = Seq2SeqTrainer(model,
                         args,
                         train_dataset   = tokenized_dataset["train"],
                         eval_dataset    = tokenized_dataset["validation"],
                         data_collator   = data_collator,
                         tokenizer       = tokenizer,
                         compute_metrics = compute_metrics,
                         callbacks       = [EarlyStoppingCallback(early_stopping_patience=5)]
                        )
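
The `EarlyStoppingCallback(early_stopping_patience=5)` passed above stops training once the monitored metric has failed to improve for five consecutive evaluations. Since no `metric_for_best_model` is set, that metric defaults to the validation loss. A rough sketch of the patience logic, with a hypothetical `stop_index` helper and made-up loss values (this is not the library's code):

```python
def stop_index(eval_losses, patience, threshold=0.0):
    """Return the index of the evaluation at which training would stop,
    or None if the patience is never exhausted."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or loss < best - threshold:
            best = loss      # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1   # no improvement at this evaluation
        if bad_evals >= patience:
            return i
    return None

# Made-up validation losses, one per evaluation
losses = [2.0, 1.5, 1.4, 1.45, 1.41, 1.42, 1.6, 1.5, 1.3]
stopped_at = stop_index(losses, patience=5)
print(stopped_at)
```

In this toy run the best loss is reached at the third evaluation, and the five evaluations that follow do not beat it, so the later improvement is never seen.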

### Training

> Note: Lots of output!

We can now fine-tune our model by just calling the `train` method.

In [None]:
trainer.train()

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step,Training Loss,Validation Loss


KeyboardInterrupt: 

Our fine-tuned model is saved under `models/nllb-200-distilled-600M-finetuned-fr-to-wo/`.

Load the model and translate some text from French to Wolof.