If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell.

The cell below installs Hugging Face Transformers and Datasets, along with sacrebleu for evaluation.

In [None]:
! pip install datasets transformers[sentencepiece] sacrebleu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.22.0-py3-none-any.whl (4.9 MB)
Collecting sacrebleu
  Downloading sacrebleu-2.2.1-py3-none-any.whl (116 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.13-py37-none-any.whl (115 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manyli

# Fine-tuning a model on a translation task

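In this exercise we use a Hugging Face Transformers model to perform translation.
A machine translation dataset consists of source-target language pairs drawn from a variety of domains.

Here we load the data with the Hugging Face Datasets module and run training with the Trainer API, both of which keep the code short.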



![Widget inference on a translation task](https://github.com/huggingface/notebooks/blob/main/examples/images/translation.png?raw=1)


In [None]:
model_checkpoint = "Helsinki-NLP/opus-mt-en-ro"



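This exercise uses a model with a sequence-to-sequence architecture (the standard Transformer with an encoder and a decoder). We use the checkpoint above from the Hugging Face model hub: `Helsinki-NLP/opus-mt-en-ro`, a Marian English-to-Romanian translation model.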

## Loading the dataset



We use data from the Hugging Face Datasets library and download the metric needed for evaluation. Both can be loaded with the `load_dataset` and `load_metric` functions.

In [None]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("wmt16", "ro-en")
metric = load_metric("sacrebleu")
# BLEU n-gram precision: predicted n-grams that match the reference / all predicted n-grams

Downloading and preparing dataset wmt16/ro-en (download: 274.05 MiB, generated: 180.62 MiB, post-processed: Unknown size, total: 454.67 MiB) to /root/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/28ebdf8cf22106c2f1e58b2083d4b103608acd7bfdb6b14313ccd9e5bc8c313a...

Dataset wmt16 downloaded and prepared to /root/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/28ebdf8cf22106c2f1e58b2083d4b103608acd7bfdb6b14313ccd9e5bc8c313a. Subsequent calls will reuse this data.

Let's look at the structure of the dataset.

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 610320
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1999
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 1999
    })
})

In [None]:
raw_datasets["train"][0]

{'translation': {'en': 'Membership of Parliament: see Minutes',
  'ro': 'Componenţa Parlamentului: a se vedea procesul-verbal'}}

Let's pick a few examples at random to look at the data in more detail. (You don't need to understand this code.)

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,translation
0,"{'en': 'Maritime policy must encompass not only obligatory regulations on oil exploration and transportation, but must also offer twofold compensation for losses resulting from accidents.', 'ro': 'Politica maritimă trebuie să cuprindă nu numai regulamente obligatorii în materie explorare și transportare a petrolului, ci să ofere, de asemenea, compensații duble pentru pierderile rezultate în urma accidentelor.'}"
1,"{'en': 'Also in science and technology news: Cyprus and European Apace Agency sign an agreement, and a Bosniak wins the gold at the Ninth Moscow International Salon of Innovations and Investments.', 'ro': 'Tot în ştirile din ştiinţă şi tehnologie: Cipru şi Agenţia Spaţială Europeană au semnat un acord, iar un bosniac a câştigat aurul la al nouălea Salon Internaţional de Inovaţii şi Investiţii de la Moscova.'}"
2,"{'en': 'The Greek designer duo Deux Hommes, Ioannis Guia and Yiorgos Eleftheriades have confirmed plans to show their collections for the fall-winter 2006-2007 season.', 'ro': 'Duo-ul de designeri greci Deux Hommes, Ioannis Guia alături de Yiorgos Eleftheriades şi-au confirmat intenţia de a prezenta colecţiile pentru sezonul toamnă-iarnă 2006-2007.'}"
3,"{'en': 'One might wonder if it is possible to operate a common policy for the 27 Member States of the Union at all.', 'ro': 'Ne-am putea întreba dacă o politică comună poate funcționa cu cele 27 de state membre.'}"
4,"{'en': 'Measures for the social inclusion of the Roma can be mainstreamed in various policy fields.', 'ro': 'Măsurile pentru incluziunea romilor poate fi extinsă în diferite domenii de politică.'}"


Next, let's look at a more detailed description of the metric we loaded.

In [None]:
metric

Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
Produces BLEU scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions (`list` of `str`): list of translations to score. Each translation should be tokenized into a list of tokens.
    references (`list` of `list` of `str`): A list of lists of references. The contents of the first sub-list are the references for the first prediction, the contents of the second sub-list are for the second prediction, etc. Note that there must be the same number of references for each prediction (i.e. all sub-lists must be of the same length).
    smooth_method (`str`): The smoothing method to use, defaults to `'exp'`. Possible values are:
        - `'none'`: no smoothing
        - `'floor'`: increment zero counts
        - `'add-k'`: increment num/deno

Let's check how the metric works. For background on BLEU, see:
https://donghwa-kim.github.io/BLEU.html

In [None]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
metric.compute(predictions=fake_preds, references=fake_labels)

{'score': 0.0,
 'counts': [4, 2, 0, 0],
 'totals': [4, 2, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'bp': 1.0,
 'sys_len': 4,
 'ref_len': 4}
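
A score of 0.0 for predictions that match their references exactly can look surprising. The likely explanation is that each sentence has only two words, so there are no 3-grams or 4-grams to count, and BLEU's geometric mean over the four n-gram orders collapses to zero. The sketch below reproduces the `counts`, `totals`, and `precisions` fields by hand. It is only an illustration, assuming plain whitespace tokenization (sacrebleu uses its own tokenizer); the helper names are made up for this example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_matches(preds, refs, n):
    """Return (matched, total) n-gram counts summed over all sentence pairs."""
    matched, total = 0, 0
    for pred, ref in zip(preds, refs):
        pred_counts = ngram_counts(pred.split(), n)
        ref_counts = ngram_counts(ref.split(), n)
        # A predicted n-gram is credited at most as often as it occurs in the reference.
        matched += sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        total += sum(pred_counts.values())
    return matched, total

preds = ["hello there", "general kenobi"]
refs = ["hello there", "general kenobi"]

stats = [clipped_matches(preds, refs, n) for n in range(1, 5)]
counts = [m for m, _ in stats]
totals = [t for _, t in stats]
# Orders with no n-grams at all are reported as 0.0 here, as in the output above.
precisions = [100.0 * m / t if t else 0.0 for m, t in stats]

print(counts, totals, precisions)
```

With longer sentences every order has at least some n-grams, so the score is no longer forced to zero.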

## Preprocessing the data

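Before training, we first process the data with a tokenizer.
`AutoTokenizer` is a class that can load any tokenizer; we use it to fetch the same tokenizer that the model checkpoint was trained with.
(As covered in yesterday's session, a tokenizer splits text into tokens.)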

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


Moving 0 files to the new cache system





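The next step applies only to mBART tokenizers (mBART is a multilingual model from Facebook), which must be told the source and target languages. The checkpoint used here is a Marian model, so the condition is false and the cell does nothing.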

In [None]:
if "mbart" in model_checkpoint:
    tokenizer.src_lang = "en-XX"
    tokenizer.tgt_lang = "ro-RO"

By default, `AutoTokenizer.from_pretrained` returns one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library when one is available for the checkpoint.

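Let's first apply the tokenizer to a single sentence.
`input_ids` is the sentence converted into a sequence of token IDs, and
`attention_mask` marks which positions hold real tokens (1) rather than padding (0).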

In [None]:
tokenizer("Hello, this one sentence!")

{'input_ids': [125, 778, 3, 63, 141, 9191, 23, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

To tokenize several sentences at once, pass them as a list.

In [None]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[125, 778, 3, 63, 141, 9191, 23, 0], [187, 32, 716, 9191, 2, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
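
The two sequences above have different lengths (8 and 6 tokens), and both attention masks are all ones because nothing has been padded yet. The sketch below shows, in plain Python, what padding to a common length looks like and how the attention mask changes. It assumes right-side padding, and `PAD_ID` is a placeholder chosen for illustration; the real value is `tokenizer.pad_token_id`.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

PAD_ID = 99999  # placeholder, not the tokenizer's actual pad id

# Token IDs copied from the tokenizer output above.
batch = pad_batch(
    [[125, 778, 3, 63, 141, 9191, 23, 0], [187, 32, 716, 9191, 2, 0]],
    pad_id=PAD_ID,
)
print(batch["attention_mask"])
```

The second mask ends in two zeros, which tells the model to ignore the two padded positions.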

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [None]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[10334, 1204, 3, 15, 8915, 27, 452, 59, 29579, 581, 23, 0], [235, 1705, 11, 32, 8, 1205, 5305, 59, 29579, 581, 2, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


If you use a T5 checkpoint, you need to prepend a task prefix like the one below to every input. That is not the case here, so `prefix` is set to `""`.



In [None]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "translate English to Romanian: "
else:
    prefix = ""

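Let's look at how the examples are processed.

First the input and target sentences are collected into lists,
then the tokenizer converts those sentences into sequences of token IDs. Passing `truncation=True` together with `max_length` cuts off any sequence longer than the limit.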

In [None]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "ro"

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]  # prepend the prefix to each source sentence
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)  # tokenize the source sentences

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)  # tokenize the target sentences

    model_inputs["labels"] = labels["input_ids"] 
    return model_inputs

Running it on a couple of examples gives the following result.

In [None]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[393, 4462, 14, 1137, 53, 216, 28636, 0], [24385, 14, 28636, 14, 4646, 4622, 53, 216, 28636, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[42140, 494, 1750, 53, 8, 59, 903, 3543, 9, 15202, 0], [36199, 6612, 9, 15202, 122, 568, 35788, 21549, 53, 8, 59, 903, 3543, 9, 15202, 0]]}

Now let's apply this function to the whole dataset.

In [None]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)  # batched=True passes batches of examples to the function

  0%|          | 0/611 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]
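
The totals in the progress bars above (611, 2, and 2) are batch counts, not example counts. They can be checked with a short calculation, assuming the default `batch_size` of 1000 that `map` uses when `batched=True`; the helper name is made up for this example.

```python
import math

def num_batches(num_rows, batch_size=1000):
    """Number of batches needed to cover num_rows examples."""
    return math.ceil(num_rows / batch_size)

# Split sizes taken from the DatasetDict printed earlier.
split_sizes = {"train": 610320, "validation": 1999, "test": 1999}
batches = {name: num_batches(rows) for name, rows in split_sizes.items()}
print(batches)
```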

The result is cached automatically, so running the same operation again later loads it quickly instead of recomputing it.

## Fine-tuning the model



Now that the data is ready, let's load the pretrained model.


In [None]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

Let's set a few arguments for training the seq2seq model.

In [None]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-{source_lang}-to-{target_lang}",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01, #regularizer
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,  # generate full translations during evaluation, since BLEU is computed on generated text
    push_to_hub=False,
)

Next we create a data collator, which is given the tokenizer and the model.
The data collator groups the inputs and labels into batches and pads them to fit each batch.
(All inputs in a batch must have the same length.)

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
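
Labels are padded differently from inputs: by default `DataCollatorForSeq2Seq` fills them with `-100` (its `label_pad_token_id`), a value the loss function ignores. The sketch below imitates only that label-padding step in plain Python; `pad_labels` is a hypothetical helper, not part of the library.

```python
LABEL_PAD_ID = -100  # default label_pad_token_id; positions with this value are ignored by the loss

def pad_labels(label_batch, label_pad_id=LABEL_PAD_ID):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(labels) for labels in label_batch)
    return [labels + [label_pad_id] * (max_len - len(labels)) for labels in label_batch]

# Label IDs copied from the preprocess_function output shown earlier.
labels = [
    [42140, 494, 1750, 53, 8, 59, 903, 3543, 9, 15202, 0],
    [36199, 6612, 9, 15202, 122, 568, 35788, 21549, 53, 8, 59, 903, 3543, 9, 15202, 0],
]
padded = pad_labels(labels)
print(padded[0])
```

The shorter sequence gains five `-100` entries so that both rows have 16 positions.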

The last thing to define is how the model should be evaluated.
The code below handles evaluation.

In [None]:
import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
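
Two steps in `compute_metrics` are easy to miss: `-100` entries in the labels are swapped back to the pad token ID before decoding, and `gen_len` is the average number of non-pad tokens per prediction. The sketch below shows both in plain Python on made-up token IDs; `PAD_ID` is a placeholder for `tokenizer.pad_token_id`, and the helper names are hypothetical.

```python
PAD_ID = 99999  # placeholder for tokenizer.pad_token_id

def restore_padding(labels, pad_id):
    """Replace the -100 loss-masking value with the pad id so the rows can be decoded."""
    return [[tok if tok != -100 else pad_id for tok in row] for row in labels]

def mean_generated_length(preds, pad_id):
    """Average number of non-pad tokens per predicted sequence."""
    lengths = [sum(1 for tok in row if tok != pad_id) for row in preds]
    return sum(lengths) / len(lengths)

# Toy IDs for illustration only; they do not come from the real tokenizer.
labels = [[5, 6, 0, -100, -100], [7, 8, 9, 10, 0]]
preds = [[PAD_ID, 5, 6, 0, PAD_ID], [PAD_ID, 7, 8, 9, 0]]

restored = restore_padding(labels, PAD_ID)
gen_len = mean_generated_length(preds, PAD_ID)
print(restored, gen_len)
```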

Now we pass everything to the trainer.

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We can now finetune our model by just calling the `train` method:

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `MarianMTModel.forward` and have been ignored: translation. If translation are not expected by `MarianMTModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 610320
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 38145


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: ignored
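
The `Total optimization steps` figure in the log above follows from the dataset size and the batch settings. The sketch below is a rough reconstruction of that arithmetic, not the trainer's actual code, and it assumes a single device; the function name is made up for this example.

```python
import math

def optimization_steps(num_examples, per_device_batch_size,
                       num_devices=1, grad_accum_steps=1, num_epochs=1):
    """Approximate number of optimizer updates for a training run."""
    total_batch_size = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / total_batch_size)
    return steps_per_epoch * num_epochs

# Values taken from the training log: 610320 examples, batch size 16, 1 epoch.
steps = optimization_steps(610320, 16)
print(steps)
```

At 38145 steps, one epoch takes a long time on a single GPU, which may be why the run above was interrupted.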

In [None]:
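import torch

# Tokenize two English sentences; padding=True pads them to the same length
model_inputs = tokenizer(['This is tranlation model', 'good quality'], max_length=max_input_length, padding=True, truncation=True)
# Move the input IDs to the same device as the model before generating
outputs = model.generate(torch.tensor(model_inputs['input_ids']).to(model.device))
# skip_special_tokens=True drops special tokens such as <pad> from the decoded text
tokenizer.decode(outputs[0], skip_special_tokens=True)

'Ăsta e modelul de translaţie.'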