# Fine-tuning a model on a text classification task

Run the cell below to install the required libraries (if the line is commented out with `#`, uncomment it first).

In [None]:
! pip install datasets transformers evaluate



Using a GPU is also recommended. To enable a GPU in Colab:

1.   From the top menu, select Runtime > Change runtime type.

2.   In the pop-up that appears, set the hardware accelerator to 'GPU'.


If the GPU is set up correctly, running the code below should display 'True'.


In [None]:
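import torch

# Use the first GPU when CUDA is available; otherwise fall back to the CPU.
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Keep this as the last expression so the cell displays True or False.
torch.cuda.is_available()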

True

Git-LFS is required to download large files. Uncomment the code below and run it.

In [None]:
# !apt install git-lfs

Check that the version of the Transformers library installed in your environment is 4.11.0 or later.

In [None]:
import transformers

print(transformers.__version__)
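
Comparing version strings directly can mislead, since `"4.9.2" < "4.11.0"` is false as a plain string comparison. The sketch below shows one way to make the check numeric. `at_least` is a hypothetical helper, not part of Transformers, and it only looks at the leading numeric components of each version.

```python
import re

def at_least(version: str, minimum: str) -> bool:
    """Return True if `version` is numerically >= `minimum` (hypothetical helper)."""
    def parse(v: str) -> tuple:
        # Keep only the leading numeric components, e.g. "4.22.0.dev0" -> (4, 22, 0).
        parts = []
        for piece in v.split("."):
            match = re.match(r"\d+", piece)
            if match is None:
                break
            parts.append(int(match.group()))
        return tuple(parts)
    return parse(version) >= parse(minimum)

print(at_least("4.22.0", "4.11.0"))  # True
print(at_least("4.9.2", "4.11.0"))   # False
```

With the library imported, the same check would read `at_least(transformers.__version__, "4.11.0")`.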

In this exercise, we fine-tune the [BERT](https://huggingface.co/docs/transformers/model_doc/bert) model on the [IMDB Dataset](https://www.imdb.com/?ref_=nv_home) text classification task. [(paper link)](https://arxiv.org/abs/1810.04805) 
![BERT_img](https://paul-hyun.github.io/assets/2020-01-02/bert-classification.png)

### **BERT (Bidirectional Encoder Representations from Transformers)**

#### ***Transformer Encoder***
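- BERT is an architecture trained using only the Encoder part of the Transformer.
- It is pretrained on two tasks: masking about 15% of the tokens in a sentence and restoring them (MLM), and classifying whether two given sentences follow one another (NSP).
- With an additional layer on top of BERT and fine-tuning, it can be applied to a wide range of tasks.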
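
The masking step of MLM can be sketched in a few lines. This is a simplified illustration, not BERT's exact procedure: the original recipe also replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here. The function name `mask_tokens`, the seed, and the example sentence are assumptions made for the sketch.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Replace roughly `mask_ratio` of the tokens with [MASK].

    Returns the corrupted sequence and a dict mapping each masked
    position to the original token the model would have to recover.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]
        corrupted[pos] = MASK_TOKEN
    return corrupted, targets

tokens = "the movie was far better than i had expected it to be".split()
corrupted, targets = mask_tokens(tokens)
print(corrupted)
print(targets)
```

During pretraining, the loss is computed only at the masked positions, so `targets` plays the role of the labels.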

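#### ***Subword tokenization (WordPiece)***
- BERT tokenizes text with WordPiece, a subword encoding closely related to Byte Pair Encoding (BPE): the vocab is built up from characters, so a word can be decomposed into smaller known pieces.
- This mitigates the OOV (Out Of Vocabulary) problem.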
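
To make the subword idea concrete, here is a minimal sketch of how a word is split at tokenization time. BERT's tokenizer uses WordPiece, a scheme closely related to BPE, which matches the longest known piece first and marks word-internal pieces with `##`. The toy vocabulary is hypothetical, and the sketch does not cover how a vocabulary is learned.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than a real one.
toy_vocab = {"un", "##watch", "##able", "watch", "##ed", "film", "##s"}

print(wordpiece_tokenize("unwatchable", toy_vocab))  # ['un', '##watch', '##able']
print(wordpiece_tokenize("films", toy_vocab))        # ['film', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```

A word that never appeared as a whole in the training corpus can still be represented by pieces that did, which is why subword vocabularies rarely fall back to the unknown token.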



The IMDB Dataset is a binary classification task: each IMDB movie review is classified as either positive or negative. 
![IMDB_img](https://miro.medium.com/max/705/1*f-bF79_zFHGXEhJvx2WPLg.jpeg)

This exercise uses the bert-base-cased model. 

In [None]:
model_checkpoint = "bert-base-cased"
batch_size = 16

## Loading the dataset

The data can be downloaded with the `load_dataset` function from the Huggingface [Datasets](https://github.com/huggingface/datasets) library. 

For performance evaluation, we use the Huggingface [Evaluate](https://huggingface.co/docs/evaluate/index) library.

In [None]:
from datasets import load_dataset
import evaluate
metric = evaluate.load("accuracy")


Load the IMDB Dataset. 

In [None]:
dataset = load_dataset("imdb")




The IMDB Dataset consists of a train dataset and a test dataset. 

In [None]:
dataset['train']

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [None]:
dataset['test']

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [None]:
dataset["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

The accuracy metric created earlier is an instance of [`evaluate.EvaluationModule`](https://huggingface.co/docs/evaluate/package_reference/main_classes#evaluate.EvaluationModule):

In [None]:
metric

EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
    

The metric's `compute` method measures how closely two inputs agree, here as the accuracy of the predictions against the references.

In [None]:
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'accuracy': 0.5625}
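
For intuition, the accuracy reported by the metric is the fraction of positions where the prediction matches the reference. The plain-Python sketch below reproduces that definition; the `accuracy` function is a hypothetical stand-in, not the Evaluate implementation, and the inputs are the ones from the usage text shown above.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)

print(accuracy([0, 1, 1, 2, 1, 0], [0, 1, 2, 0, 1, 2]))  # 0.5
```

With random binary predictions and labels, as in the cell above, the value should hover around 0.5.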

## Preprocessing the data

Before feeding text to a Huggingface Transformer model, the text must be tokenized. We can do this with the Huggingface `Tokenizer` covered in yesterday's lab session.

Let's load the tokenizer with the `AutoTokenizer.from_pretrained` method.
Passing `use_fast=True` loads the *fast* version (backed by Rust). If this raises an error, remove the argument.

In [None]:
from transformers import AutoTokenizer

# model_checkpoint = bert-base-cased
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.22.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9/vocab.txt
loading file tokenize

Check that the tokenizer loaded correctly.

In [None]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 8667, 117, 1142, 1141, 5650, 106, 102, 1262, 1142, 5650, 2947, 1114, 1122, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
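
The three fields in this output follow a simple layout for sentence pairs: `[CLS] A [SEP] B [SEP]`, with `token_type_ids` marking which sentence each position belongs to and `attention_mask` marking real tokens. The sketch below rebuilds the output from the word-level ids, assuming that 101 and 102 are the `[CLS]` and `[SEP]` ids as the output suggests. `build_pair_inputs` is a hypothetical helper, not a tokenizer method.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble BERT-style inputs for a sentence pair from token ids."""
    first = [cls_id] + ids_a + [sep_id]   # [CLS] A [SEP]
    second = ids_b + [sep_id]             # B [SEP]
    return {
        "input_ids": first + second,
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * (len(first) + len(second)),
    }

# Token ids copied from the tokenizer output above, without special tokens.
ids_a = [8667, 117, 1142, 1141, 5650, 106]
ids_b = [1262, 1142, 5650, 2947, 1114, 1122, 119]
encoded = build_pair_inputs(ids_a, ids_b)
print(encoded["input_ids"])
print(encoded["token_type_ids"])
```

For single-sentence inputs such as the IMDB reviews, `token_type_ids` is all zeros.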

Encode the IMDB dataset with the tokenizer.


In [None]:
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True)
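
With `truncation=True`, reviews longer than the model's maximum input length are cut rather than rejected. The sketch below illustrates the idea on raw id lists, assuming a limit of 512 (the `max_position_embeddings` value in the config shown earlier) and the special-token ids 101 and 102. `truncate_ids` is a hypothetical helper and the ids are made up.

```python
def truncate_ids(token_ids, max_length=512, cls_id=101, sep_id=102):
    """Add [CLS]/[SEP] and cut the sequence so it fits in `max_length`."""
    budget = max_length - 2  # reserve room for the two special tokens
    return [cls_id] + token_ids[:budget] + [sep_id]

long_review = list(range(1000, 1700))   # 700 made-up token ids
short_review = list(range(1000, 1010))  # 10 made-up token ids
print(len(truncate_ids(long_review)))   # 512
print(len(truncate_ids(short_review)))  # 12
```

Short reviews are left at their natural length here; padding is deferred until batches are formed.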

In [None]:
# Encode the entire dataset.
encoded_dataset = dataset.map(preprocess_function, batched=True)


## Fine-tuning the model

With the IMDB data prepared and encoded, we can load the pretrained model and fine-tune it.

Since our target task is sentence classification, we can use the `AutoModelForSequenceClassification` class provided by Huggingface.

For a classification task, the number of labels can be specified.

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# 0 (neg) / 1 (pos)
num_labels = 2 
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.22.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9/pytorch_model.b

Loading a checkpoint with the `AutoModelForSequenceClassification` class usually triggers a warning that some layers are newly initialized. This is expected. 

We are fine-tuning a general-purpose pretrained model such as BERT for text classification, so the task-specific part (the classification head) has no pretrained weights and still has to be trained.

Using the Huggingface `Trainer` requires a [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments) instance, an object that holds all the arguments needed during training.

The path where model checkpoints are saved is the only required option; the rest are optional.

In [None]:
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    "IMDB-finetuned-BERT",  # output directory for checkpoints (required)
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    do_train=True,
    do_eval=True,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=False,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.
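
Conceptually, loading the best model at the end means comparing the evaluation results of the saved checkpoints and keeping the one with the best value of the chosen metric. The sketch below illustrates that selection only; `pick_best_epoch` is a hypothetical helper, and the numbers are made up for illustration, not results from this notebook.

```python
def pick_best_epoch(eval_history, metric_name="accuracy", greater_is_better=True):
    """Return the (epoch, metrics) entry with the best value of `metric_name`."""
    chooser = max if greater_is_better else min
    return chooser(eval_history, key=lambda item: item[1][metric_name])

# Made-up evaluation results, one entry per epoch.
history = [
    (1, {"accuracy": 0.90, "loss": 0.27}),
    (2, {"accuracy": 0.93, "loss": 0.21}),
    (3, {"accuracy": 0.92, "loss": 0.24}),
]
best_epoch, best_metrics = pick_best_epoch(history)
print(best_epoch, best_metrics["accuracy"])
```

This is why `save_strategy` and the evaluation strategy are both set to `"epoch"` above: a checkpoint can only be restored if it was saved at the point where it was evaluated.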


Next, we need to give the `Trainer` a metric for evaluating the model. We can use the `metric` instance defined earlier. 

For each input text, the model outputs a score (logit) for each predefined class. We therefore take the argmax of the model output to extract the predicted class index. 

In [77]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
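
To see what the argmax step does, the following plain-Python sketch applies it to a few rows of scores and then measures accuracy. The logits and labels are made up, and `argmax_rows` is a hypothetical stand-in for `np.argmax(..., axis=1)`.

```python
def argmax_rows(logits):
    """Return, for each row of scores, the index of the largest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Made-up scores for three reviews; column 0 = negative, column 1 = positive.
logits = [[-1.2, 2.3], [0.8, -0.5], [0.1, 0.4]]
labels = [1, 0, 0]

predictions = argmax_rows(logits)
correct = sum(p == y for p, y in zip(predictions, labels))
print(predictions)
print(correct / len(labels))
```

Because argmax only depends on the ordering of the scores, there is no need to apply softmax before picking the class.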

준비가 끝났으면, dataset을 비롯한 모든 요소를 `Trainer`에 입력해봅시다.

In [78]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

You might wonder why we pass along the `tokenizer` when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The `tokenizer` has a pad method that will do all of this right for us, and the `Trainer` will use it. You can customize this part by defining and passing your own `data_collator` which will receive the samples like the dictionaries seen above and will need to return a dictionary of tensors.
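
The padding step can be sketched as follows. This is an illustration of the idea rather than the collator the `Trainer` uses: it assumes right padding with pad id 0 (the `pad_token_id` in the config shown earlier) and returns plain lists, whereas a real `data_collator` would return tensors. `pad_batch` is a hypothetical name.

```python
def pad_batch(samples, pad_token_id=0):
    """Pad every sample's `input_ids` on the right to the batch maximum."""
    max_len = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": []}
    for s in samples:
        n_pad = max_len - len(s["input_ids"])
        batch["input_ids"].append(s["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * len(s["input_ids"]) + [0] * n_pad)
    return batch

samples = [
    {"input_ids": [101, 8667, 102]},
    {"input_ids": [101, 1142, 5650, 2947, 102]},
]
batch = pad_batch(samples)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding each batch only to its own longest sample keeps short reviews from being stretched to the full 512 positions.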

Now we can train the model by calling the `train` method of the trainer instance.

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 25000
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 7815


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: ignored
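
The total number of optimization steps in this log follows directly from the dataset size, batch size, and epoch count. The arithmetic below assumes a single device and one gradient accumulation step, as the log reports.

```python
import math

num_examples = 25000
batch_size = 16
num_epochs = 5

steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 1563 7815
```

Doubling the batch size would roughly halve the number of steps, though not necessarily the wall-clock time per epoch.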

Once training is finished, the `evaluate` method reports the model's performance.

In [None]:
trainer.evaluate()

## Hyperparameter search

Everything from this point on is optional. 

The `Trainer` supports hyperparameter search using [optuna](https://optuna.org/) or [Ray Tune](https://docs.ray.io/en/latest/tune/). For this last section you will need either of those libraries installed, just uncomment the line you want on the next cell and run it.

In [None]:
# ! pip install optuna
# ! pip install ray[tune]

During hyperparameter search, the `Trainer` will run several trainings, so it needs to have the model defined via a function (so it can be reinitialized at each new run) instead of just having it passed. We just wrap the same `from_pretrained` call as before in a function:

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

And we can instantiate our `Trainer` like before:

In [None]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

The method we call this time is `hyperparameter_search`. Note that it can take a long time to run on the full dataset. You can try to find good hyperparameters on a portion of the training dataset by replacing the `train_dataset` line above with:
```python
train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10) 
```
for 1/10th of the dataset. Then you can run a full training on the best hyperparameters picked by the search.

In [None]:
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

The `hyperparameter_search` method returns a `BestRun` object, which contains the value of the objective maximized (by default the sum of all metrics) and the hyperparameters it used for that run.

In [None]:
best_run

You can customize the objective to maximize by passing along a `compute_objective` function to the `hyperparameter_search` method, and you can customize the search space by passing a `hp_space` argument to `hyperparameter_search`. See this [forum post](https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10) for some examples.
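
As a rough sketch of what these two hooks can look like with the Optuna backend: `hp_space` receives an Optuna trial and returns a dictionary of `TrainingArguments` values to try, and `compute_objective` receives the evaluation metrics and returns a single number. The ranges below are illustrative assumptions, not tuned recommendations, and the `eval_accuracy` key assumes the accuracy metric used in this notebook.

```python
def hp_space(trial):
    """Search space for an Optuna trial; the ranges here are illustrative."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
    }

def compute_objective(metrics):
    """Optimize evaluation accuracy alone instead of the default objective."""
    return metrics["eval_accuracy"]
```

These would then be passed along as `trainer.hyperparameter_search(hp_space=hp_space, compute_objective=compute_objective, n_trials=10, direction="maximize", backend="optuna")`.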

To reproduce the best training, just set the hyperparameters in your `TrainingArguments` before creating a `Trainer`:

In [None]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()