# Text Classification: BERT

**Text classification** is the task of assigning an input text to one of a set of predefined categories or labels.

**BERT (Bidirectional Encoder Representations from Transformers)**

## BertTokenizer

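BERT uses a WordPiece tokenizer.
- WordPiece: a method that splits words into smaller subword units
    - Mitigates the out-of-vocabulary (OOV) problem
    - Builds the token set from data -> better domain adaptability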
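
The subword splitting described above can be sketched with a greedy longest-match-first loop. This is a minimal illustration of the idea, not the `BertTokenizer` implementation; the function name `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical and exist only for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"trans", "##form", "##ers", "cool", "##er", "un", "##cool"}

for word in ["transformers", "cooler", "uncool", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Words that are not in the vocabulary as a whole are still representable as long as their pieces are, which is how subword tokenization eases the OOV problem.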

**Tokenization using BERT Tokenizer**

In [1]:
from transformers import BertTokenizer

In [2]:
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-multilingual-uncased")

In [3]:
text = "Transformers Is so COOL"

In [4]:
encoded = tokenizer(text)
print(encoded)

{'input_ids': [101, 58263, 10127, 10297, 26462, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}


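- `input_ids`: the input text converted to integer token ids
- `token_type_ids`: values that distinguish the segments when the input consists of multiple segments
- `attention_mask`: the mask used by the transformer encoder's self-attention; it tells the model which tokens to ignore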
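
To make the three fields concrete, here is a small sketch that assembles them by hand for a two-segment input padded to a fixed length. The helper `encode_pair` is hypothetical; the `[CLS]`/`[SEP]` ids 101 and 102 match the output above, the pad id 0 is an assumption for this checkpoint, and the token ids 7, 8, 9 are placeholders.

```python
def encode_pair(ids_a, ids_b, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Assemble BERT-style inputs for a two-segment example, padded to max_len."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment A (with [CLS] and its [SEP]) is marked 0; segment B and its [SEP] are marked 1.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Real tokens get 1 so that self-attention attends to them.
    attention_mask = [1] * len(input_ids)

    n_pad = max_len - len(input_ids)
    if n_pad < 0:
        raise ValueError("max_len is shorter than the encoded pair")
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # padding is masked out with 0
    }


enc = encode_pair([7, 8], [9], max_len=8)
for key, value in enc.items():
    print(key, value)
```

For the single-sentence example above, every `token_type_ids` entry is 0 and every `attention_mask` entry is 1 because there is only one segment and no padding.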

In [5]:
input_ids = encoded["input_ids"]
decoded = tokenizer.decode(input_ids)
print(decoded)

[CLS] transformers is so cool [SEP]


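- All characters are lowercased during preprocessing
    - The same word no longer appears in different forms because of letter case -> consistent data
    - Case is not distinguished, so the vocabulary is smaller -> better computational efficiency
    - Performance may drop on tasks where case matters, such as named entity recognition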
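
The effect of lowercasing on vocabulary size can be seen with a toy corpus. This sketch uses plain whitespace splitting rather than WordPiece, and the `corpus` sentences are made up for illustration.

```python
# Made-up sentences that differ mostly in letter case.
corpus = [
    "Transformers Is so COOL",
    "transformers is SO cool",
    "Cool transformers",
]

# Vocabulary when case is preserved versus when everything is lowercased.
cased_vocab = {tok for line in corpus for tok in line.split()}
uncased_vocab = {tok.lower() for line in corpus for tok in line.split()}

print("cased vocabulary size  :", len(cased_vocab))
print("uncased vocabulary size:", len(uncased_vocab))
```

The same collapse is what hurts case-sensitive tasks: after lowercasing, a proper noun and a common noun with the same spelling become indistinguishable.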

## BertModel

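1. Embedding layer
    - Converts the input text into vectors
2. Encoder layer
    - Consists of 12 transformer encoder layers
3. Pooler layer
    - Takes the final encoder output, extracts the [CLS] token's vector, and transforms it into a summary vector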
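
The pooler step can be sketched in a few lines: take the hidden vector at position 0 (the `[CLS]` token), apply a dense layer, then `tanh`. The function `pooler` and the tiny 2-dimensional weights below are hypothetical stand-ins; the real model uses 768-dimensional vectors and learned weights.

```python
import math


def pooler(hidden_states, weight, bias):
    """Return tanh(W @ h_cls + b), where h_cls is the first token's vector."""
    h_cls = hidden_states[0]  # position 0 holds the [CLS] token
    return [
        math.tanh(sum(w * h for w, h in zip(row, h_cls)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy encoder output: 3 tokens, hidden size 2 (illustrative values only).
hidden_states = [[0.5, -1.0], [0.2, 0.3], [0.9, 0.1]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, so only tanh changes the values
bias = [0.0, 0.0]

pooled = pooler(hidden_states, weight, bias)
print(pooled)
```

Only the first row of `hidden_states` influences the result, which is why the pooled output is described as a summary vector derived from `[CLS]`.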

**Structure of BERT model**

In [7]:
from transformers import BertModel

In [8]:
model = BertModel.from_pretrained("google-bert/bert-base-multilingual-uncased")

## Train Text Classification Model

**Tokenizing a Movie Review Sentiment Analysis Dataset**

In [12]:
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification

In [13]:
def preprocess_data(example, tokenizer):
    return tokenizer(example["document"], truncation=True)

In [15]:
model_name = "google-bert/bert-base-multilingual-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
dataset = load_dataset("nsmc", trust_remote_code=True)

In [17]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 50000
    })
})

In [22]:
dataset['train'][0]['document']

'아 더빙.. 진짜 짜증나네요 목소리'

In [23]:
processed_dataset = dataset.map(
    lambda example: preprocess_data(example, tokenizer),
    batched=True,
    remove_columns=["id", "document"]
).rename_column("label", "labels")

In [26]:
print(dataset)
print("---------------------------------------------------")
print(processed_dataset)
print("---------------------------------------------------")
print(dataset['train'][0])
print("---------------------------------------------------")
print(processed_dataset['train'][0])

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 50000
    })
})
---------------------------------------------------
DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})
---------------------------------------------------
{'id': '9976970', 'document': '아 더빙.. 진짜 짜증나네요 목소리', 'label': 0}
---------------------------------------------------
{'labels': 0, 'input_ids': [101, 1174, 25539, 23236, 29234, 13045, 119, 119, 87550, 97082, 25539, 1176, 25539, 24937, 13045, 16801, 72197, 47024, 1169, 70724, 22585, 13926, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask':

**DataCollatorWithPadding**

In [27]:
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

In [28]:
max_length_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding="max_length"
)

max_length_dataloader = DataLoader(
    processed_dataset["train"],
    collate_fn=max_length_collator,
    batch_size=4,
    shuffle=False
)

In [29]:
max_length_iterator = iter(max_length_dataloader)
max_length_batch = next(max_length_iterator)
print("max_length padding input id shape :", max_length_batch["input_ids"].shape)

max_length padding input id shape : torch.Size([4, 512])


In [30]:
longest_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding="longest"
)

longest_dataloader = DataLoader(
    processed_dataset["train"],
    collate_fn=longest_collator,
    batch_size=4,
    shuffle=False
)

In [31]:
longest_iterator = iter(longest_dataloader)
longest_batch = next(longest_iterator)
print("longest padding input id shape :", longest_batch["input_ids"].shape)

longest padding input id shape : torch.Size([4, 42])
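
The difference between the two padding strategies can be reproduced without the library. This is a rough sketch of the behaviour, not the `DataCollatorWithPadding` code; `pad_batch` and the short id lists are hypothetical, and the pad id 0 is an assumption.

```python
def pad_batch(sequences, strategy, max_length=512, pad_id=0):
    """Pad every sequence to a common width chosen by the strategy."""
    if strategy == "max_length":
        target = max_length  # always the model's maximum length
    elif strategy == "longest":
        target = max(len(seq) for seq in sequences)  # longest sequence in this batch
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Sequences already at or above the target are left unchanged (no truncation here).
    return [seq + [pad_id] * (target - len(seq)) for seq in sequences]


batch = [[101, 5, 102], [101, 5, 6, 7, 102], [101, 102], [101, 5, 6, 102]]

padded_max = pad_batch(batch, "max_length")
padded_longest = pad_batch(batch, "longest")

print("max_length:", len(padded_max), "x", len(padded_max[0]))
print("longest   :", len(padded_longest), "x", len(padded_longest[0]))
```

With `"longest"`, each batch is only as wide as its longest member, so far fewer pad tokens pass through the model than with a fixed width of 512.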


**Train text classification model**

In [32]:
from transformers import TrainingArguments, Trainer

In [33]:
training_args = TrainingArguments(
    output_dir="text-classification",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
    eval_steps=200,
    logging_steps=200,
    seed=42
)

In [34]:
processed_dataset["train"].select(range(10))

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 10
})

In [35]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=longest_collator,
    train_dataset=processed_dataset["train"].select(range(10000)),
    eval_dataset=processed_dataset["test"].select(range(100))
)

In [36]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 200  | 0.6968 |
| 400  | 0.6961 |
| 600  | 0.6992 |
| 800  | 0.6968 |
| 1000 | 0.6952 |
| 1200 | 0.6952 |


TrainOutput(global_step=1250, training_loss=0.6963639434814453, metrics={'train_runtime': 75.0946, 'train_samples_per_second': 133.165, 'train_steps_per_second': 16.646, 'total_flos': 416739133918560.0, 'train_loss': 0.6963639434814453, 'epoch': 1.0})
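
Two numbers in this output can be checked by hand. The sketch below assumes a single device and no gradient accumulation for the step count, and compares the reported loss with the cross-entropy of a classifier that assigns probability 0.5 to each of the two classes.

```python
import math

# Values taken from the cells above.
num_train_examples = 10_000
per_device_train_batch_size = 8
num_train_epochs = 1
reported_training_loss = 0.6963639434814453

# Optimizer steps, assuming one device and no gradient accumulation.
steps = math.ceil(num_train_examples / per_device_train_batch_size) * num_train_epochs

# Cross-entropy of a uniform guess over two classes: -ln(1/2) = ln 2.
chance_loss = math.log(2)

print("expected steps :", steps)
print("chance loss    :", round(chance_loss, 4))
print("gap to reported:", round(reported_training_loss - chance_loss, 4))
```

The logged losses stay within a few thousandths of ln 2 for the whole epoch, which suggests that in this run the classifier did not move meaningfully beyond chance level.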

**Inference text classification**

In [37]:
import torch

In [38]:
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

In [39]:
text = "진짜 재밌었어요. 또 보러 갈거에요"
inputs = tokenizer(text, return_tensors="pt")

In [40]:
inputs

{'input_ids': tensor([[  101, 87550, 97082, 25539,  1175, 26179, 22699, 97104, 13413, 97104,
         13413, 47024,   119, 35848,  1170, 29347, 41616, 20966, 12397, 40815,
         10609, 47024,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [42]:
with torch.no_grad():
    outputs = model(**inputs.to(device))
    print(outputs)
    print(outputs.logits)
    print(outputs.logits.argmax())

SequenceClassifierOutput(loss=None, logits=tensor([[0.0173, 0.0291]], device='cuda:0'), hidden_states=None, attentions=None)
tensor([[0.0173, 0.0291]], device='cuda:0')
tensor(1, device='cuda:0')
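
The logits are unnormalized scores; a softmax turns them into probabilities, which makes the model's confidence easier to read. This is a plain-Python sketch using the logits printed above. Reading label 1 as "positive" is an assumption about the dataset's label convention.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.0173, 0.0291]  # copied from the output above
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # same role as argmax

print("probabilities:", [round(p, 4) for p in probs])
print("predicted label:", predicted)
```

The two probabilities are almost equal, so although `argmax` returns 1, the prediction carries very little confidence.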


**Evaluate text classification model**

In [43]:
import evaluate

In [44]:
yhat = trainer.predict(processed_dataset["test"])

In [49]:
predictions = yhat.predictions.argmax(axis=1)

In [52]:
references = yhat.label_ids

In [54]:
metric = evaluate.load("accuracy")
accuracy = metric.compute(predictions=predictions, references=references)
print(accuracy)

{'accuracy': 0.50346}


In [55]:
metric = evaluate.load("f1")
f1 = metric.compute(predictions=predictions, references=references)
print(f1)

{'f1': 0.6697351442672236}
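
An accuracy near 0.50 together with an F1 near 0.67 is the pattern a classifier produces when it predicts the positive class for almost every example on a roughly balanced test set. The sketch below reproduces that pattern with hand-written metrics; the balanced `refs` and the all-positive `preds` are synthetic, and the balance of the real test split is an assumption.

```python
def accuracy(preds, refs):
    """Fraction of predictions that match the references."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def f1_binary(preds, refs, positive=1):
    """F1 score of the positive class."""
    tp = sum(p == positive and r == positive for p, r in zip(preds, refs))
    fp = sum(p == positive and r != positive for p, r in zip(preds, refs))
    fn = sum(p != positive and r == positive for p, r in zip(preds, refs))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


refs = [0, 1] * 50   # synthetic, perfectly balanced labels
preds = [1] * 100    # a degenerate model that always answers "positive"

acc_all_positive = accuracy(preds, refs)
f1_all_positive = f1_binary(preds, refs)
print("accuracy:", acc_all_positive)
print("f1      :", round(f1_all_positive, 4))
```

Recall is 1.0 and precision is 0.5 in this case, so F1 comes out at 2/3 even though the model has learned nothing, which is why F1 alone can look better than the accuracy warrants.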


# Summary generation: BART

**Tokenization using BART Tokenizer**
- 대규모 데이터셋을 다루거나 실시간 처리가 필요한 경우 사용

In [1]:
from transformers import BartTokenizer

In [3]:
tokenizer = BartTokenizer.from_pretrained("gogamza/kobart-base-v2")

You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels wil be overwritten to 2.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'BartTokenizer'.


In [4]:
text = "BART는 요약 모델을 학습하기에 적합하다."
encoded = tokenizer(text)
print(encoded)

{'input_ids': [0, 265, 264, 281, 283, 415, 5, 5, 461, 416, 5, 5, 416, 473, 5, 461, 415, 5, 5, 415, 5, 5, 416, 5, 464, 461, 417, 473, 5, 416, 5, 5, 417, 473, 476, 414, 5, 370, 416, 475, 5, 461, 416, 480, 5, 417, 473, 365, 417, 473, 476, 415, 468, 361, 245, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Note that this short sentence produced several dozen ids, most of them small and many of them identical. Together with the tokenizer-class warning above, this suggests that `BartTokenizer` is splitting the Korean text into byte-level pieces that the checkpoint's vocabulary does not cover. Loading the checkpoint's own tokenizer class (`PreTrainedTokenizerFast`, for example through `AutoTokenizer`) would likely yield far fewer and more meaningful tokens.


**Structure of BART model**

In [5]:
from transformers import BartForConditionalGeneration

In [6]:
model = BartForConditionalGeneration.from_pretrained("gogamza/kobart-base-v2")

In [7]:
for main_name, main_module in model.named_children():
    print("Main name = ", main_name)
    for sub_name, sub_module in main_module.named_children():
        print("L", sub_name)
        for ssub_name, ssub_module in sub_module.named_children():
            print("| L", ssub_name)
            for sssub_name, sssub_module in ssub_module.named_children():
                print("| | L", sssub_name)

Main name =  model
L shared
L encoder
| L embed_tokens
| L embed_positions
| L layers
| | L 0
| | L 1
| | L 2
| | L 3
| | L 4
| | L 5
| L layernorm_embedding
L decoder
| L embed_tokens
| L embed_positions
| L layers
| | L 0
| | L 1
| | L 2
| | L 3
| | L 4
| | L 5
| L layernorm_embedding
Main name =  lm_head


**Tokenization of movie news summary dataset**

In [11]:
from datasets import load_dataset
from transformers import BartTokenizer, BartForConditionalGeneration

In [12]:
def preprocess_data(example, tokenizer):
    return tokenizer(
        example["document"],
        text_target=example["summary"],
        truncation=True
    )

In [13]:
model_name = "gogamza/kobart-base-v2"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

In [14]:
dataset = load_dataset("daekeun-ml/naver-news-summarization-ko")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['date', 'category', 'press', 'title', 'document', 'link', 'summary'],
        num_rows: 22194
    })
    validation: Dataset({
        features: ['date', 'category', 'press', 'title', 'document', 'link', 'summary'],
        num_rows: 2466
    })
    test: Dataset({
        features: ['date', 'category', 'press', 'title', 'document', 'link', 'summary'],
        num_rows: 2740
    })
})


In [15]:
tokenizer.model_max_length = model.config.max_position_embeddings

In [18]:
processed_dataset = dataset.map(
    lambda example: preprocess_data(example, tokenizer),
    batched=True,
    remove_columns=dataset["train"].column_names
)

In [19]:
print(processed_dataset)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 22194
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2466
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2740
    })
})


In [20]:
sample = processed_dataset["train"]["labels"][0]
print(sample)
print(tokenizer.decode(sample))

[0, 416, 476, 367, 417, 473, 5, 461, 416, 463, 5, 415, 370, 476, 414, 5, 370, 461, 416, 5, 370, 415, 363, 367, 415, 5, 476, 415, 5, 5, 461, 415, 367, 5, 416, 475, 5, 416, 5, 476, 416, 364, 5, 415, 5, 5, 461, 416, 475, 5, 415, 469, 5, 461, 416, 5, 478, 416, 473, 465, 416, 5, 5, 461, 15684, 250, 416, 474, 5, 461, 415, 468, 367, 415, 479, 367, 461, 416, 480, 5, 416, 5, 5, 415, 362, 5, 461, 414, 5, 370, 415, 358, 5, 417, 473, 478, 461, 414, 370, 5, 416, 5, 5, 415, 5, 370, 243, 461, 416, 480, 473, 415, 372, 5, 414, 370, 5, 461, 417, 473, 476, 415, 370, 476, 414, 5, 370, 416, 475, 5, 461, 416, 5, 370, 415, 363, 367, 461, 414, 5, 5, 416, 480, 478, 416, 5, 476, 461, 415, 5, 464, 417, 469, 5, 415, 5, 365, 416, 5, 5, 461, 416, 5, 476, 416, 372, 478, 461, 417, 5, 473, 415, 469, 5, 415, 362, 5, 461, 416, 478, 464, 417, 473, 5, 461, 416, 5, 5, 415, 480, 362, 416, 5, 464, 461, 414, 5, 370, 416, 5, 5, 416, 5, 5, 414, 5, 370, 415, 358, 478, 461, 414, 5, 370, 416, 480, 473, 417, 473, 478, 461, 414, 370

<s>올�<unk> �<unk>반�<unk>� �<unk>�리�<unk>��<unk><unk> �<unk>�<unk>�<unk>��<unk>�<unk><unk> �<unk>�<unk> �<unk>�악�<unk><unk> 103�<unk> 달러 �<unk>�<unk><unk>�<unk> �<unk>��<unk>한 �<unk>�<unk><unk>�<unk>�, 정�<unk>�<unk> 하반�<unk>��<unk> �<unk>�리 �<unk><unk>제�<unk>� �<unk>��<unk>�<unk>��<unk><unk> �<unk>�출 �<unk>��<unk>�<unk> 위�<unk> �<unk><unk>력�<unk>� �<unk>��<unk><unk>�<unk><unk>�<unk>�로 �<unk>�정한 �<unk>�<unk><unk>�<unk>�, �<unk><unk>�<unk><unk> �<unk>�출 �<unk>�<unk>��<unk>�업�<unk>� �<unk>류�<unk>� �<unk>�<unk>��<unk> 위�<unk> �<unk>�<unk>�<unk><unk>�<unk> 규�<unk><unk>�<unk> 40조 �<unk> �<unk><unk>�<unk> �<unk>��<unk>하�<unk>� �<unk>류�<unk>� �<unk>�<unk>�<unk><unk> �<unk>�시선박 �<unk>��<unk>� 등�<unk>� �<unk>진하�<unk>�로 �<unk>다.</s>


**DataCollatorForSeq2Seq**

In [21]:
from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

In [24]:
seq2seq_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding="longest",
    return_tensors="pt"
)

seq2seq_dataloader = DataLoader(
    processed_dataset["train"],
    collate_fn=seq2seq_collator,
    batch_size=4,
    shuffle=False
)

seq2seq_iterator = iter(seq2seq_dataloader)
seq2seq_batch = next(seq2seq_iterator)

for key, value in seq2seq_batch.items():
    print(f"{key}: {value.shape}")

input_ids: torch.Size([4, 1026])
attention_mask: torch.Size([4, 1026])
labels: torch.Size([4, 515])
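
The inputs and the labels end up with different widths because they are padded separately. The sketch below imitates that behaviour in plain Python. It is not the `DataCollatorForSeq2Seq` source; `collate_seq2seq` and the toy features are hypothetical, the pad id 3 is an assumption for this checkpoint, and -100 is the label value that the loss function ignores by default.

```python
def collate_seq2seq(features, pad_id=3, label_pad_id=-100):
    """Pad inputs and labels independently to the longest item in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_input - len(f["input_ids"])
        n_lab = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_in)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_in)
        # Label padding uses -100 so padded positions do not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_lab)
    return batch


# Toy features with placeholder token ids.
features = [
    {"input_ids": [0, 11, 12, 13, 1], "attention_mask": [1, 1, 1, 1, 1], "labels": [0, 21, 1]},
    {"input_ids": [0, 14, 1], "attention_mask": [1, 1, 1], "labels": [0, 22, 23, 24, 1]},
]

toy_batch = collate_seq2seq(features)
for key, rows in toy_batch.items():
    print(key, rows)
```

Padding labels with the regular pad id instead would make the model spend part of its loss on predicting padding, which is the reason for the separate value.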


**Model train**

In [25]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

In [26]:
training_args = Seq2SeqTrainingArguments(
    output_dir="text-summarization",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
    eval_steps=200,
    logging_steps=200,
    seed=42
)

In [27]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=seq2seq_collator,
    train_dataset=processed_dataset["train"].select(range(10000)),
    eval_dataset=processed_dataset["validation"].select(range(100))
)

In [28]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 200  | 2.0119 |
| 400  | 0.5646 |
| 600  | 0.48   |
| 800  | 0.4636 |
| 1000 | 0.4172 |
| 1200 | 0.4147 |


TrainOutput(global_step=1250, training_loss=0.7132517852783203, metrics={'train_runtime': 320.3591, 'train_samples_per_second': 31.215, 'train_steps_per_second': 3.902, 'total_flos': 6109273497600000.0, 'train_loss': 0.7132517852783203, 'epoch': 1.0})

**Inference**

In [29]:
import torch

In [31]:
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(30000, 768, padding_idx=3)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(30000, 768, padding_idx=3)
      (embed_positions): BartLearnedPositionalEmbedding(1028, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): Laye

In [32]:
sample = dataset["test"][0]
document = sample["document"]
# Truncate to the tokenizer's model_max_length (set to 1026 above); the full
# document is 1415 tokens long, which exceeds the model's position limit.
inputs = tokenizer(document, truncation=True, return_tensors="pt").to(device)


In [None]:
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=1026,
        num_beams=4,
        no_repeat_ngram_size=2,
        early_stopping=True
    )
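
The `no_repeat_ngram_size=2` argument asks the decoder never to produce the same bigram twice. The idea can be sketched as follows: at each step, look at the last `n - 1` generated tokens and ban every token that previously followed that same prefix. This is a simplified illustration, not the library's implementation; `banned_next_tokens` and the token ids are hypothetical.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) + 1 < ngram_size:
        return set()  # not enough history to form a repeated n-gram
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = generated[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])  # this token followed the same prefix before
    return banned


history = [5, 7, 9, 5]  # placeholder token ids
print(banned_next_tokens(history, ngram_size=2))
```

Here the sequence ends in 5, and 5 was followed by 7 earlier, so 7 is banned as the next token. During beam search, the score of each banned token would be pushed to negative infinity before the next token is chosen.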

In [None]:
print("Source document :", document)
print("Reference summary :", sample["summary"])
print("Generated summary :", tokenizer.decode(outputs[0], skip_special_tokens=True))

# Question Answering: RoBERTa

**Tokenization machine reading dataset**

In [3]:
from datasets import load_dataset
from transformers import RobertaTokenizerFast, RobertaForQuestionAnswering

In [4]:
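def preprocess_data(example, tokenizer):
    # Encode the (question, context) pair; only the context may be truncated.
    tokenized = tokenizer(
        example["question"],
        example["context"],
        truncation="only_second",
        return_offsets_mapping=True
    )
    # Character offset of the first gold answer within the context.
    start_index = example["answers"]["answer_start"][0]
    answer_text = example["answers"]["text"][0]
    answer_tokens = tokenizer.encode(answer_text, add_special_tokens=False)
    answer_tokens_length = len(answer_tokens)

    # The first separator token marks the end of the question segment.
    start_context_tokens_index = tokenized["input_ids"].index(tokenizer.sep_token_id)
    context_offset_mapping = tokenized["offset_mapping"][start_context_tokens_index:]
    # Default to an out-of-range position, kept when the answer is not found
    # (for example, because it was truncated away).
    tokenized["start_positions"] = len(tokenized["input_ids"])
    tokenized["end_positions"] = len(tokenized["input_ids"])

    # Take the first context token that starts at or after the answer's offset;
    # the end position is estimated from the answer's own token count.
    for i, (start_offset, end_offset) in enumerate(context_offset_mapping):
        if start_offset >= start_index:
            tokenized["start_positions"] = start_context_tokens_index + i
            tokenized["end_positions"] = tokenized["start_positions"] + answer_tokens_length
            break
    return tokenized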
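
The core of this preprocessing is turning a character-level answer span into token positions using the offset mapping. The sketch below shows a different variant of that mapping than the function above: it derives both the start and the end token from character overlap instead of estimating the end from the answer's token count. The helper `char_span_to_token_span`, the English `context`, and the hand-written `offsets` are all hypothetical, and the offsets cover the context only.

```python
def char_span_to_token_span(offsets, answer_start, answer_end):
    """Map a character span [answer_start, answer_end) to inclusive token indices."""
    start_token = end_token = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == tok_end:
            continue  # special tokens conventionally carry an empty (0, 0) offset
        if tok_end <= answer_start or tok_start >= answer_end:
            continue  # this token does not overlap the answer
        if start_token is None:
            start_token = i  # first overlapping token
        end_token = i        # keeps moving to the last overlapping token
    return start_token, end_token


context = "the rain lasts 32 days"
# One (start, end) character pair per token, with special tokens at both ends.
offsets = [(0, 0), (0, 3), (4, 8), (9, 14), (15, 17), (18, 22), (0, 0)]

answer = "32 days"
answer_start = context.index(answer)
token_span = char_span_to_token_span(offsets, answer_start, answer_start + len(answer))
print(token_span)
```

If the answer lies in a part of the context that was truncated away, no token overlaps it and the function returns `(None, None)`, which plays the same role as the out-of-range default in the notebook's version.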

In [5]:
model_name = "klue/roberta-base"
tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
model = RobertaForQuestionAnswering.from_pretrained(model_name)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
dataset = load_dataset("klue", "mrc")

In [7]:
dataset["train"][0]

{'title': '제주도 장마 시작 … 중부는 이달 말부터',
 'context': '올여름 장마가 17일 제주도에서 시작됐다. 서울 등 중부지방은 예년보다 사나흘 정도 늦은 이달 말께 장마가 시작될 전망이다.17일 기상청에 따르면 제주도 남쪽 먼바다에 있는 장마전선의 영향으로 이날 제주도 산간 및 내륙지역에 호우주의보가 내려지면서 곳곳에 100㎜에 육박하는 많은 비가 내렸다. 제주의 장마는 평년보다 2~3일, 지난해보다는 하루 일찍 시작됐다. 장마는 고온다습한 북태평양 기단과 한랭 습윤한 오호츠크해 기단이 만나 형성되는 장마전선에서 내리는 비를 뜻한다.장마전선은 18일 제주도 먼 남쪽 해상으로 내려갔다가 20일께 다시 북상해 전남 남해안까지 영향을 줄 것으로 보인다. 이에 따라 20~21일 남부지방에도 예년보다 사흘 정도 장마가 일찍 찾아올 전망이다. 그러나 장마전선을 밀어올리는 북태평양 고기압 세력이 약해 서울 등 중부지방은 평년보다 사나흘가량 늦은 이달 말부터 장마가 시작될 것이라는 게 기상청의 설명이다. 장마전선은 이후 한 달가량 한반도 중남부를 오르내리며 곳곳에 비를 뿌릴 전망이다. 최근 30년간 평균치에 따르면 중부지방의 장마 시작일은 6월24~25일이었으며 장마기간은 32일, 강수일수는 17.2일이었다.기상청은 올해 장마기간의 평균 강수량이 350~400㎜로 평년과 비슷하거나 적을 것으로 내다봤다. 브라질 월드컵 한국과 러시아의 경기가 열리는 18일 오전 서울은 대체로 구름이 많이 끼지만 비는 오지 않을 것으로 예상돼 거리 응원에는 지장이 없을 전망이다.',
 'news_category': '종합',
 'source': 'hankyung',
 'guid': 'klue-mrc-v1_train_12759',
 'is_impossible': False,
 'question_type': 1,
 'question': '북태평양 기단과 오호츠크해 기단이 만나 국내에 머무르는 기간은?',
 'answers': {'answer_start': [478, 478]

In [8]:
processed_dataset = dataset.filter(lambda x: not x["is_impossible"])
processed_dataset = processed_dataset.map(
    lambda example: preprocess_data(example, tokenizer), batched=False
)

In [9]:
processed_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'context', 'news_category', 'source', 'guid', 'is_impossible', 'question_type', 'question', 'answers', 'input_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 12037
    })
    validation: Dataset({
        features: ['title', 'context', 'news_category', 'source', 'guid', 'is_impossible', 'question_type', 'question', 'answers', 'input_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 4008
    })
})

In [10]:
processed_dataset = processed_dataset.filter(
    lambda x: x["start_positions"] < tokenizer.model_max_length
)

In [11]:
processed_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'context', 'news_category', 'source', 'guid', 'is_impossible', 'question_type', 'question', 'answers', 'input_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 11095
    })
    validation: Dataset({
        features: ['title', 'context', 'news_category', 'source', 'guid', 'is_impossible', 'question_type', 'question', 'answers', 'input_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 3703
    })
})

In [12]:
processed_dataset = processed_dataset.filter(
    lambda x: x["end_positions"] < tokenizer.model_max_length
)

In [13]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['title', 'context', 'news_category', 'source', 'guid', 'is_impossible', 'question_type', 'question', 'answers'],
        num_rows: 17554
    })
    validation: Dataset({
        features: ['title', 'context', 'news_category', 'source', 'guid', 'is_impossible', 'question_type', 'question', 'answers'],
        num_rows: 5841
    })
})


In [14]:
print(processed_dataset)

DatasetDict({
    train: Dataset({
        features: ['title', 'context', 'news_category', 'source', 'guid', 'is_impossible', 'question_type', 'question', 'answers', 'input_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 11083
    })
    validation: Dataset({
        features: ['title', 'context', 'news_category', 'source', 'guid', 'is_impossible', 'question_type', 'question', 'answers', 'input_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 3696
    })
})
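
The two filters above keep only examples whose answer token positions fall inside the model's maximum input length. How those positions are derived from `offset_mapping` is decided in `preprocess_data`; the block below is only a minimal sketch of the general idea, not the notebook's function. The helper name `char_span_to_token_span`, the offsets, and the tiny `max_length` are all made up for illustration.

```python
def char_span_to_token_span(offsets, answer_start, answer_end):
    """Map the character span [answer_start, answer_end) to inclusive token indices."""
    start_token, end_token = None, None
    for idx, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == tok_end:
            continue  # special tokens carry an empty (0, 0) offset
        if start_token is None and tok_start <= answer_start < tok_end:
            start_token = idx
        if tok_start < answer_end <= tok_end:
            end_token = idx
    return start_token, end_token


# Hypothetical offsets for the context "서울은 대한민국의 수도다."
offsets = [(0, 0), (0, 2), (2, 3), (4, 8), (8, 9), (10, 12), (12, 13), (13, 14), (0, 0)]

seoul = char_span_to_token_span(offsets, 0, 2)  # characters of "서울"
korea = char_span_to_token_span(offsets, 4, 8)  # characters of "대한민국"

max_length = 3  # deliberately tiny so that one span falls outside the window
kept = [span for span in (seoul, korea) if span[0] < max_length and span[1] < max_length]
print(seoul, korea, kept)
```

A span whose start or end index is not smaller than the limit cannot be predicted by the model, which is why such rows are removed before training.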


**Train model**

In [15]:
from transformers import DataCollatorWithPadding, TrainingArguments, Trainer

In [16]:
collator = DataCollatorWithPadding(tokenizer, padding="longest")

In [17]:
training_arguments = TrainingArguments(
    output_dir="question-answering",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
    eval_steps=250,  # only takes effect when the evaluation strategy is set to "steps"
    logging_steps=250,
    seed=42
)

In [18]:
trainer = Trainer(
    model=model,
    args=training_arguments,
    data_collator=collator,
    train_dataset=processed_dataset["train"].select(range(10000)),
    eval_dataset=processed_dataset["validation"].select(range(100))
)



In [19]:
trainer.train()


| Step | Training Loss |
| --- | --- |
| 250 | 2.384 |
| 500 | 1.5038 |
| 750 | 1.4078 |
| 1000 | 1.2094 |
| 1250 | 1.1724 |



TrainOutput(global_step=1250, training_loss=1.535464208984375, metrics={'train_runtime': 139.8668, 'train_samples_per_second': 71.497, 'train_steps_per_second': 8.937, 'total_flos': 2611440614437824.0, 'train_loss': 1.535464208984375, 'epoch': 1.0})
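
The `global_step=1250` in the output above follows from the subset size and the batch size. The block below is a small arithmetic sketch that reproduces it together with the reported throughput, assuming a single device and no gradient accumulation. The helper `steps_per_epoch` is illustrative, not a `transformers` function.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1, grad_accum=1):
    """Optimizer steps needed to see every example once."""
    effective_batch = per_device_batch_size * num_devices * grad_accum
    return math.ceil(num_examples / effective_batch)


steps = steps_per_epoch(num_examples=10_000, per_device_batch_size=8)
runtime_seconds = 139.8668  # train_runtime reported in TrainOutput

samples_per_second = round(10_000 / runtime_seconds, 3)
steps_per_second = round(steps / runtime_seconds, 3)
print(steps, samples_per_second, steps_per_second)
```

With `logging_steps=250`, 1250 steps yield exactly the five loss rows logged during training.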

**Inference**

In [21]:
import torch

In [23]:
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

question = "대한민국의 수도는 어디인가요?"
context = "서울은 대한민국의 수도다."

inputs = tokenizer(question, context, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

In [28]:
inputs

{'input_ids': tensor([[    0,  4892,  2079,  4438,  2259,  4069,  2179, 18119,    35,     2,
          3671,  2073,  4892,  2079,  4438,  2062,    18,     2]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='cuda:0')}

In [24]:
outputs

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[ 0.1863, -3.9321, -5.7286, -3.9440, -5.6872, -4.8295, -5.5411, -5.0764,
         -4.7644,  0.1864,  3.8483, -3.1025, -1.0339, -4.7686, -0.6544, -5.3247,
          0.0333,  0.1865]], device='cuda:0'), end_logits=tensor([[-2.5434, -4.2888, -3.6859, -4.6698, -4.0784, -4.5356, -4.2747, -5.0802,
         -4.9306, -2.5435,  0.6271,  4.4405, -2.2527, -1.4613, -2.6562, -0.5889,
         -2.8601, -2.5436]], device='cuda:0'), hidden_states=None, attentions=None)

In [27]:
outputs["start_logits"].argmax(dim=-1).item()

10

In [29]:
start_index = outputs["start_logits"].argmax(dim=-1).item()
end_index = outputs["end_logits"].argmax(dim=-1).item()
predicted_ids = inputs["input_ids"][0][start_index: end_index]
predicted_text = tokenizer.decode(predicted_ids)
print(predicted_text)

서울
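
Taking the two `argmax` values independently works for this example, but nothing forces the end index to come after the start index. A common alternative is to score every valid (start, end) pair jointly. The block below is a minimal sketch of that idea using the logits printed above; `best_span` and `max_answer_len` are hypothetical names, not part of `transformers`.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the inclusive (start, end) pair with the highest summed logit."""
    best, best_score = (0, 0), float("-inf")
    for start, s_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):  # guarantees end >= start
            score = s_logit + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Values copied from the model output shown earlier in this section.
start_logits = [0.1863, -3.9321, -5.7286, -3.9440, -5.6872, -4.8295, -5.5411, -5.0764,
                -4.7644, 0.1864, 3.8483, -3.1025, -1.0339, -4.7686, -0.6544, -5.3247,
                0.0333, 0.1865]
end_logits = [-2.5434, -4.2888, -3.6859, -4.6698, -4.0784, -4.5356, -4.2747, -5.0802,
              -4.9306, -2.5435, 0.6271, 4.4405, -2.2527, -1.4613, -2.6562, -0.5889,
              -2.8601, -2.5436]
input_ids = [0, 4892, 2079, 4438, 2259, 4069, 2179, 18119, 35, 2,
             3671, 2073, 4892, 2079, 4438, 2062, 18, 2]

start, end = best_span(start_logits, end_logits)
print(start, end)
print(input_ids[start:end])      # slice that excludes the token at `end`
print(input_ids[start:end + 1])  # slice that includes the token at `end`
```

The notebook's slice `[start_index: end_index]` leaves out the token at `end_index`. Whether that matches the training labels depends on how `end_positions` is defined in `preprocess_data`, so the two slices are printed side by side rather than one being declared correct.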


**Evaluate**

In [30]:
from evaluate import evaluator

In [31]:
metric = evaluator("question-answering")

In [33]:
results = metric.compute(
    model,
    tokenizer=tokenizer,
    data=processed_dataset["validation"].select(range(100)),
    id_column="guid",
    question_column="question",
    context_column="context",
    label_column="answers"
)

`squad_v2_format` parameter not provided to QuestionAnsweringEvaluator.compute(). Automatically inferred `squad_v2_format` as False.


In [34]:
print(results)

{'exact_match': 3.0, 'f1': 25.43333333333334, 'total_time_in_seconds': 1.0021326879505068, 'samples_per_second': 99.78718507278029, 'latency_in_seconds': 0.010021326879505068}
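
The gap between `exact_match` and `f1` is easier to read with the two metrics written out. The block below is a simplified sketch of SQuAD-style scoring for one prediction: it lowercases and splits on whitespace only, whereas the real metric applies further normalization and reports values on a 0 to 100 scale. The example strings are made up.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 only when the normalized strings are identical."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall over whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "대한민국의 수도 서울"  # hypothetical model answer
reference = "수도 서울"             # hypothetical gold answer
em = exact_match(prediction, reference)
f1 = token_f1(prediction, reference)
print(em, f1)
```

A prediction that contains the gold answer plus extra words scores zero on exact match but still earns partial F1, which is one way the two numbers can diverge.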


# Machine Translation: T5

**Tokenizing the OPUS-100 dataset**

In [35]:
from datasets import load_dataset
from transformers import T5TokenizerFast, T5ForConditionalGeneration

In [46]:
def preprocess_data(example, tokenizer):
    translation = example["translation"]
    translation_source = ["en: " + instance["en"] for instance in translation]
    translation_target = ["ko: " + instance["ko"] for instance in translation]
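    # Note: truncation=True has no effect here unless max_length is also given,
    # because the tokenizer reports no predefined maximum length (see the map warning).
    tokenized = tokenizer(
        translation_source, text_target=translation_target, truncation=True
    )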
    return tokenized
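
Because the function is later applied with `batched=True`, `example["translation"]` is a list of dictionaries rather than a single one. The block below is a minimal sketch of that batch layout without a tokenizer; `build_pairs` and the two records are invented for illustration.

```python
def build_pairs(batch):
    """Prefix each side of a batched translation record with its language tag."""
    sources = ["en: " + pair["en"] for pair in batch["translation"]]
    targets = ["ko: " + pair["ko"] for pair in batch["translation"]]
    return sources, targets


# Made-up batch in the same shape that `datasets.map(batched=True)` passes in.
batch = {
    "translation": [
        {"en": "Hello.", "ko": "안녕하세요."},
        {"en": "Thank you.", "ko": "감사합니다."},
    ]
}

sources, targets = build_pairs(batch)
print(sources)
print(targets)
```

Passing the targets through `text_target` is what makes the tokenizer return them under the `labels` key.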

In [36]:
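model_name = "KETI-AIR/long-ke-t5-small"
tokenizer = T5TokenizerFast.from_pretrained(model_name)
# Note: the checkpoint's config declares model type "longt5" (see the warning in the
# output), so T5ForConditionalGeneration initializes the encoder self-attention weights
# from scratch; LongT5ForConditionalGeneration is the class that matches this checkpoint.
model = T5ForConditionalGeneration.from_pretrained(model_name)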

You are using a model of type longt5 to instantiate a model of type t5. This is not supported for all configurations of models and can yield errors.


Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at KETI-AIR/long-ke-t5-small and are newly initialized: ['encoder.block.0.layer.0.SelfAttention.k.weight', 'encoder.block.0.layer.0.SelfAttention.o.weight', 'encoder.block.0.layer.0.SelfAttention.q.weight', 'encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight', 'encoder.block.0.layer.0.SelfAttention.v.weight', 'encoder.block.1.layer.0.SelfAttention.k.weight', 'encoder.block.1.layer.0.SelfAttention.o.weight', 'encoder.block.1.layer.0.SelfAttention.q.weight', 'encoder.block.1.layer.0.SelfAttention.v.weight', 'encoder.block.2.layer.0.SelfAttention.k.weight', 'encoder.block.2.layer.0.SelfAttention.o.weight', 'encoder.block.2.layer.0.SelfAttention.q.weight', 'encoder.block.2.layer.0.SelfAttention.v.weight', 'encoder.block.3.layer.0.SelfAttention.k.weight', 'encoder.block.3.layer.0.SelfAttention.o.weight', 'encoder.block.3.layer.0.SelfAttention.q.weight', 'encoder.block.3.layer.0.SelfAt

In [37]:
dataset = load_dataset("Helsinki-NLP/opus-100", "en-ko")

In [45]:
dataset["train"][0]["translation"]

{'en': "They're shaped like a bus.", 'ko': '할머니처럼 만들었지만.. ? 엉망이지만..'}

In [47]:
processed_dataset = dataset.map(
    lambda example: preprocess_data(example, tokenizer),
    batched=True,
    remove_columns=dataset["train"].column_names
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [50]:
processed_dataset

DatasetDict({
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000000
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})

In [48]:
sample = processed_dataset["test"][0]

In [49]:
print(sample)
print("Decoded source text : ", tokenizer.decode(sample["input_ids"]))
print("Decoded target text : ", tokenizer.decode(sample["labels"]))

{'input_ids': [20004, 20525, 20048, 20298, 20480, 20025, 20263, 20027, 20187, 20050, 43305, 20009, 21015, 20047, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [20004, 23477, 20048, 92, 14, 4256, 11, 1363, 71, 1133, 2951, 20371, 33, 16, 75, 242, 10, 513, 20047, 1]}
Decoded source text :  en: What makes you think I want an intro to anyone?</s>
Decoded target text :  ko: 내가 너를 누구에게 소개하고 싶어한다고 생각하니?</s>


**Train model**

In [51]:
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

In [52]:
seq2seq_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding="longest",
    return_tensors="pt"
)

training_arguments = Seq2SeqTrainingArguments(
    output_dir="t5-translation",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
    eval_steps=2500,  # only takes effect when the evaluation strategy is set to "steps"
    logging_steps=2500,
    seed=42
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_arguments,
    data_collator=seq2seq_collator,
    train_dataset=processed_dataset["train"].select(range(100000)),
    eval_dataset=processed_dataset["validation"].select(range(1000))
)


In [53]:
trainer.train()

| Step | Training Loss |
| --- | --- |
| 2500 | 3.127 |
| 5000 | 2.8703 |
| 7500 | 2.8171 |
| 10000 | 2.7726 |
| 12500 | 2.7413 |



TrainOutput(global_step=12500, training_loss=2.8656383984375, metrics={'train_runtime': 998.9481, 'train_samples_per_second': 100.105, 'train_steps_per_second': 12.513, 'total_flos': 1434328128700416.0, 'train_loss': 2.8656383984375, 'epoch': 1.0})

# Text Generation: LLaMA-3.1

**Load LLaMA-3.1**

In [1]:
import torch
from transformers import BitsAndBytesConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

In [2]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False
)
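
A back-of-the-envelope calculation shows why 4-bit loading matters for a model of this size. The block below is a rough sketch that assumes about 8 billion parameters (taken from the "8B" in the model name) and counts weight storage only. It ignores quantization constants, any layers left unquantized, activations, and the KV cache.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 8e9  # rough assumption based on the model name, not a measured count

fp16_gb = weight_memory_gb(num_params, 16)
nf4_gb = weight_memory_gb(num_params, 4)
print(fp16_gb, nf4_gb)
```

Under these assumptions the weights shrink to about a quarter of their 16-bit size, which is what allows the model to fit on a single consumer GPU.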

In [3]:
token = ""  # Hugging Face access token (left blank here); the meta-llama repository is gated

In [4]:
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

In [5]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    token=token
)

In [6]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map={"": 0},
    token=token
)

**Communicate with LLaMA-3.1**

In [8]:
model.eval()

messages = [
    {"role": "user", "content": "안녕하세요."}
]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


안녕하세요. 무엇을 도와드릴까요?
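
The `temperature` and `top_p` arguments control how the next token is drawn when `do_sample=True`. The block below is a self-contained sketch of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. It illustrates the idea rather than reproducing the `transformers` implementation, and the function names and logits are made up.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_candidates(probs, top_p):
    """Smallest set of token ids, most likely first, whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    return kept


def sample(logits, temperature=0.7, top_p=0.9, rng=random):
    """Draw one token id from the filtered distribution."""
    probs = softmax(logits, temperature)
    kept = top_p_candidates(probs, top_p)
    weights = [probs[i] for i in kept]  # random.choices renormalizes the weights
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a four-token vocabulary
kept_cool = top_p_candidates(softmax(toy_logits, temperature=0.7), top_p=0.9)
kept_neutral = top_p_candidates(softmax(toy_logits, temperature=1.0), top_p=0.9)
print(kept_cool, kept_neutral)
```

Lowering the temperature sharpens the distribution, so fewer tokens are needed to reach the same cumulative mass and the candidate set shrinks.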


**Setting up LLaMA-3.1 for fine-tuning**

In [10]:
import torch
from datasets import load_dataset
from transformers import BitsAndBytesConfig
from transformers import AutoTokenizer, AutoModelForCausalLM


dataset = load_dataset("s076923/llama3-wikibook-ko")

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False
)

token = ""  # Hugging Face access token (left blank here); the meta-llama repository is gated
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    token=token
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map={"": 0},
    token=token
)

tokenizer.pad_token = tokenizer.eos_token
model.config.use_cache = False

print(dataset)
print(dataset["train"]["text"][7])

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 10
    })
})
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

위키북스의 대표 저자를 알려주세요.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

윤대희, 김동화, 송종민, 진현두<|eot_id|><|start_header_id|>assistant<|end_header_id|>
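
Each training row is a single string that already contains the chat markup. The block below is a minimal sketch of how such a string can be assembled for one question and answer pair, copying the special-token layout from the sample printed above. `format_llama3_sample` is a hypothetical helper, and unlike the printed sample it does not append a trailing empty assistant header.

```python
def format_llama3_sample(question, answer):
    """Build one user/assistant exchange in the layout of the printed sample."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{answer}<|eot_id|>"
    )


text = format_llama3_sample(
    "위키북스의 대표 저자를 알려주세요.",
    "윤대희, 김동화, 송종민, 진현두",
)
print(text)
```

Keeping the same markup at training and inference time matters, because `apply_chat_template` produces this kind of layout when the fine-tuned model is queried later.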




**LoRA configuration**

In [11]:
from peft import LoraConfig

In [12]:
peft_config = LoraConfig(
    r=128,
    lora_alpha=4,
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)
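
Two consequences of `r` and `lora_alpha` can be worked out by hand: the factor that scales the low-rank update, and the number of trainable parameters an adapter adds to one weight matrix. The block below is a small arithmetic sketch. The 4096 x 4096 projection size is an assumption used for illustration, and the standard LoRA scaling `lora_alpha / r` is assumed.

```python
r, lora_alpha = 128, 4  # values from the LoraConfig above
d_in = d_out = 4096     # assumed shape of one square projection matrix

scaling = lora_alpha / r          # factor applied to the low-rank update B @ A
lora_params = r * (d_in + d_out)  # A has shape (r, d_in), B has shape (d_out, r)
full_params = d_in * d_out        # parameters of the frozen weight matrix

ratio = lora_params / full_params
print(scaling, lora_params, full_params, ratio)
```

With `lora_alpha` much smaller than `r`, the adapter's contribution is scaled down considerably, so this pairing is worth revisiting if the fine-tuned behavior turns out weaker than expected.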

**Fine-tuning using SFT Trainer**

In [13]:
from transformers import TrainingArguments
from trl import SFTTrainer

In [14]:
training_args = TrainingArguments(
    output_dir="LLaMa-3.1",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=5,
    learning_rate=2e-4,
    max_steps=500,
    warmup_steps=100,
    logging_steps=100,
    fp16=True,
    optim="paged_adamw_8bit",
    seed=42
)
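
The combination of `learning_rate`, `warmup_steps`, and `max_steps` defines a learning-rate curve. The block below is a sketch of that curve, assuming the scheduler is left at its default linear setting: a linear ramp during warmup followed by a linear decay to zero. `linear_schedule_lr` is an illustrative helper, not a library function.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=100, max_steps=500):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


schedule = {step: linear_schedule_lr(step) for step in (0, 50, 100, 300, 500)}
for step, lr in schedule.items():
    print(step, lr)
```

The peak value of `2e-4` is reached only at step 100, so the first logged loss averages over the whole warmup phase.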

In [15]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=peft_config,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=64
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


max_steps is given, it will override any value given in num_train_epochs

In [16]:
trainer.train()


| Step | Training Loss |
| --- | --- |
| 100 | 2.2148 |
| 200 | 0.505 |
| 300 | 0.2667 |
| 400 | 0.1894 |
| 500 | 0.128 |



TrainOutput(global_step=500, training_loss=0.6607710571289063, metrics={'train_runtime': 108.3673, 'train_samples_per_second': 4.614, 'train_steps_per_second': 4.614, 'total_flos': 1161131615846400.0, 'train_loss': 0.6607710571289063, 'epoch': 50.0})

**Inference**

In [19]:
model.eval()

messages = [
    {"role": "user", "content": "위키북스의 대표 저자는 누구에요"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
        no_repeat_ngram_size=2
    )

response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


위를북스의 대표저자는 김동화, 송종민, 진현두입니다.


In [23]:
input_ids.shape

torch.Size([1, 45])

In [26]:
outputs[0].shape

torch.Size([66])
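
The two shapes explain the slicing used before decoding: for a decoder-only model, `generate` returns the prompt tokens followed by the continuation, so the prompt length has to be cut off. The block below replays that arithmetic with plain lists, where `sequence` is a stand-in for `outputs[0]`.

```python
prompt_length = 45          # input_ids.shape[-1] in the output above
sequence = list(range(66))  # stand-in for outputs[0], which has 66 entries

new_tokens = sequence[prompt_length:]  # same slice as outputs[0][input_ids.shape[-1]:]
print(len(new_tokens))
```

The continuation is shorter than `max_new_tokens=64`, which suggests that generation stopped at an end-of-sequence token rather than at the length limit.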