# HuggingFace Pipeline

In [1]:
!pip install -q transformers datasets
!pip install -q sentencepiece
!pip install -q kobert-transformers
!pip install -q python-dotenv

In [2]:
from dotenv import load_dotenv
import os

load_dotenv()
HF_TOKEN = os.getenv('HF_TOKEN')

### NLP Tasks

The simplest way to use a pretrained model for a given task is the `pipeline` function.

In [3]:
from transformers import pipeline

##### Machine Translation

https://huggingface.co/Helsinki-NLP/opus-mt-ko-en

In [4]:
!pip install tf-keras



In [5]:
translator = pipeline('translation', model='Helsinki-NLP/opus-mt-ko-en')
translator

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Device set to use cuda:0


<transformers.pipelines.text2text_generation.TranslationPipeline at 0x7c6519b6cc50>

In [6]:
translator('간식비를 향하여!!!')

[{'translation_text': 'To the snacks!'}]

In [7]:
translator([
    '공부를 열심히 하시면 좋겠어요.',
    '수업을 열심히 들으시면 좋겠고요.',
    '실습도 열심히 해주시면 좋겠어요.',
    '제발 17기 말 좀... 들으세요...'
])

[{'translation_text': 'I hope you study hard.'},
 {'translation_text': "I'd like you to take the class really hard."},
 {'translation_text': "I'd like you to work hard on your practice."},
 {'translation_text': 'Please listen to 17th century...'}]

##### Sentiment Analysis

In [8]:
sentiment_clf = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cuda:0


In [9]:
sentiment_clf("I'm very happy!!!")

[{'label': 'POSITIVE', 'score': 0.9998722076416016}]

- Korean

https://huggingface.co/sangrimlee/bert-base-multilingual-cased-nsmc

In [10]:
ko_sentiment_clf = pipeline('sentiment-analysis', model='sangrimlee/bert-base-multilingual-cased-nsmc')


Device set to use cuda:0


In [11]:
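ko_sentiment_clf([
    # "Attention, which I learned yesterday, was hard, so I had a beer."
    '어제 배운 Attention이 어려워서 맥주를 한 잔 했어.',
    # "Studying is hard, but the beer was refreshing."
    '공부는 어렵지만 맥주는 시원하더라',
    # "But I won't give up, and I'll study hard!!!"
    '하지만 나는 굴하지 않고 열심히 공부할거야!!!'
])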

[{'label': 'negative', 'score': 0.9047990441322327},
 {'label': 'negative', 'score': 0.7013086676597595},
 {'label': 'positive', 'score': 0.7580999135971069}]

##### Zero Shot Classification

- shot == example
- Inference (classification) without examples, that is, without task-specific training
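
As a rough sketch of why no task-specific examples are needed: zero-shot classification pipelines typically rely on a natural language inference (NLI) model, turn each candidate label into a hypothesis sentence, and rank the labels by how strongly the text entails each hypothesis. The snippet below illustrates only the ranking step. The entailment logits are made-up numbers, and `rank_labels` is a hypothetical helper, not part of `transformers`.

```python
import math

def rank_labels(entailment_logits):
    """Softmax over per-label entailment logits, sorted from best to worst."""
    m = max(entailment_logits.values())
    exps = {label: math.exp(v - m) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    scores = {label: e / total for label, e in exps.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up logits standing in for an NLI model's "entailment" output per label.
fake_logits = {"education": 2.5, "politics": -1.2, "business": 0.0}
ranking = rank_labels(fake_logits)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Because the scores come from a softmax across the candidate labels, they sum to 1, which matches the shape of the `scores` lists in the outputs of this section.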

In [12]:
zero_shot_clf = pipeline('zero-shot-classification')

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cuda:0


In [13]:
zero_shot_clf(
    "This is a course about the transformers library.",
    candidate_labels=["education", "politics", "business"]
)

{'sequence': 'This is a course about the transformers library.',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.9053574204444885, 0.07259681075811386, 0.022045746445655823]}

In [14]:
zero_shot_clf(
    "It has a poison and it's very dangerous",
    candidate_labels=["tiger", "squirrel", "snake"]
)

{'sequence': "It has a poison and it's very dangerous",
 'labels': ['snake', 'tiger', 'squirrel'],
 'scores': [0.6373918652534485, 0.19870366156101227, 0.16390444338321686]}

- Korean

https://huggingface.co/joeddav/xlm-roberta-large-xnli

In [15]:
ko_zero_shot_clf = pipeline('zero-shot-classification', model='joeddav/xlm-roberta-large-xnli')


Some weights of the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cuda:0


In [16]:
# "What exercise will you do in 2025?!"
sequence_to_classify = "2025년에는 어떤 운동을 하시겠습니까?!"
# Candidate labels: "politics", "economy", "health"
candidate_labels = ["정치", "경제", "건강"]

ko_zero_shot_clf(sequence_to_classify, candidate_labels)

{'sequence': '2025년에는 어떤 운동을 하시겠습니까?!',
 'labels': ['정치', '건강', '경제'],
 'scores': [0.4039548635482788, 0.3669039011001587, 0.2291412651538849]}

In [17]:
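# "What exercise will you do in 2025?!"
sequence_to_classify = "2025년에는 어떤 운동을 하시겠습니까?!"
# Candidate labels: "politics", "economy", "health"
candidate_labels = ["정치", "경제", "건강"]
# Korean hypothesis template: "This text is about {}."
hypothesis_template = "이 텍스트는 {}에 관한 내용입니다."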

ko_zero_shot_clf(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)

{'sequence': '2025년에는 어떤 운동을 하시겠습니까?!',
 'labels': ['건강', '정치', '경제'],
 'scores': [0.9117786288261414, 0.0592406764626503, 0.028980696573853493]}
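
The jump in the `건강` score above comes only from changing `hypothesis_template`. A minimal sketch of what the template controls, assuming the pipeline fills each label into the template with `str.format` and that the English default is `"This example is {}."`. `build_hypotheses` is a hypothetical helper for illustration, not a `transformers` function.

```python
def build_hypotheses(labels, template="This example is {}."):
    """Return one NLI hypothesis sentence per candidate label."""
    return [template.format(label) for label in labels]

candidate_labels = ["정치", "경제", "건강"]

# English default template wrapped around Korean labels (mixed-language hypotheses).
default_hyps = build_hypotheses(candidate_labels)
# Korean template, so the text and the hypothesis are in the same language.
korean_hyps = build_hypotheses(candidate_labels, "이 텍스트는 {}에 관한 내용입니다.")

for default_hyp, korean_hyp in zip(default_hyps, korean_hyps):
    print(default_hyp, "|", korean_hyp)
```

With the default template the model has to judge hypotheses that mix English and Korean, which is one plausible reason the first call ranked the labels poorly.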

##### Text Generation

In [18]:
text_generator = pipeline('text-generation', model='distilgpt2')


Device set to use cuda:0


In [19]:
text_generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'In this course, we will teach you how to make your own custom products with one of our best practices.\n\n\n\n\n\n\n\nWe will give you the opportunity to learn about the latest technologies available in the marketplace to create custom products, and then offer you an opportunity to help shape your own products and services with the same confidence.\n\n\nAs a student and student, we are committed to providing students with the information necessary to build, develop, and deliver a customized product.\nFor more information on how to make your own custom products, including how to make your own custom products, please visit www.cis-cis-cis-cl.com.'},
 {'generated_text': 'In this course, we will teach you how to do this for a while.\n\n\n\n\nThis course will help you to develop your own custom tools for your app. This course will help you to create a new way to make your app more user friendly. This course is for the same reason that you are giving free classes this ye
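
The warning in this output says that `max_new_tokens` (=256) takes precedence over `max_length=30`, which is why the generated texts run far past 30 tokens. A small sketch of how the two budgets differ: `new_token_budget` is a hypothetical helper, and the prompt length of 10 tokens is an assumed number for illustration.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """Number of tokens that may be generated after the prompt.

    max_length counts prompt + generated tokens; max_new_tokens counts
    generated tokens only and wins when both are given (as in the warning).
    """
    if max_new_tokens is not None:
        return max_new_tokens
    if max_length is not None:
        return max(max_length - prompt_len, 0)
    raise ValueError("set max_length or max_new_tokens")

prompt_len = 10  # assumed token count of the prompt, for illustration only
print(new_token_budget(prompt_len, max_length=30))                      # 20
print(new_token_budget(prompt_len, max_length=30, max_new_tokens=256))  # 256
```

Passing `max_new_tokens=30` to the call instead of `max_length=30` would probably be the clearer way to cap the output length here.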

- Korean

https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5

In [20]:
ko_text_generator = pipeline('text-generation', model='skt/ko-gpt-trinity-1.2B-v0.5')


Device set to use cuda:0


In [21]:
# ko_text_generator('AI란 말입니다')
ko_text_generator('오늘 점심에는')  # prompt: "For lunch today"

[{'generated_text': '오늘 점심에는 뭐 먹지?"라고 말씀해보세요."'}]

##### Fill Mask

In [22]:
unmasker = pipeline('fill-mask')

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cuda:0


In [23]:
unmasker('This course will teach you all about <mask> model.', top_k=3)

[{'score': 0.09073015302419662,
  'token': 265,
  'token_str': ' business',
  'sequence': 'This course will teach you all about business model.'},
 {'score': 0.07262531667947769,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical model.'},
 {'score': 0.049304623156785965,
  'token': 5,
  'token_str': ' the',
  'sequence': 'This course will teach you all about the model.'}]

##### NER

In [24]:
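# aggregation_strategy='simple' is the current spelling of the deprecated grouped_entities=True.
ner = pipeline('ner', model='dslim/bert-base-NER', aggregation_strategy='simple')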


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cuda:0


In [25]:
ner('My name is Squirrel and I work at Playdata in Seoul')

[{'entity_group': 'PER',
  'score': np.float32(0.98995215),
  'word': 'S',
  'start': 11,
  'end': 12},
 {'entity_group': 'PER',
  'score': np.float32(0.46767876),
  'word': '##qui',
  'start': 12,
  'end': 15},
 {'entity_group': 'PER',
  'score': np.float32(0.72662574),
  'word': '##rrel',
  'start': 15,
  'end': 19},
 {'entity_group': 'ORG',
  'score': np.float32(0.986338),
  'word': 'Playdata',
  'start': 34,
  'end': 42},
 {'entity_group': 'LOC',
  'score': np.float32(0.9994649),
  'word': 'Seoul',
  'start': 46,
  'end': 51}]
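
In the output above, `Squirrel` is split into three word pieces (`S`, `##qui`, `##rrel`) even with grouping enabled. Recent versions of `transformers` expose an `aggregation_strategy` argument on the token-classification pipeline (for example `"first"`) that is intended for this. As a library-independent sketch, the pieces can also be merged afterwards using the character offsets. `merge_word_pieces` is a hypothetical helper, and keeping the minimum score of the merged pieces is an arbitrary choice.

```python
def merge_word_pieces(entities, text):
    """Merge adjacent entities whose follow-up piece starts with '##'."""
    merged = []
    for ent in entities:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and ent["word"].startswith("##")
            and ent["start"] == prev["end"]
            and ent["entity_group"] == prev["entity_group"]
        ):
            # Extend the previous entity and re-read its surface form from the text.
            prev["end"] = ent["end"]
            prev["word"] = text[prev["start"]:prev["end"]]
            prev["score"] = min(prev["score"], ent["score"])
        else:
            merged.append(dict(ent))  # copy so the input list stays untouched
    return merged

text = "My name is Squirrel and I work at Playdata in Seoul"
# Entities copied from the output above, with scores as plain floats.
raw_entities = [
    {"entity_group": "PER", "score": 0.98995215, "word": "S", "start": 11, "end": 12},
    {"entity_group": "PER", "score": 0.46767876, "word": "##qui", "start": 12, "end": 15},
    {"entity_group": "PER", "score": 0.72662574, "word": "##rrel", "start": 15, "end": 19},
    {"entity_group": "ORG", "score": 0.986338, "word": "Playdata", "start": 34, "end": 42},
    {"entity_group": "LOC", "score": 0.9994649, "word": "Seoul", "start": 46, "end": 51},
]

merged = merge_word_pieces(raw_entities, text)
for ent in merged:
    print(ent["entity_group"], ent["word"], ent["start"], ent["end"])
```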

##### Q&A

In [26]:
qna = pipeline('question-answering')

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cuda:0


In [27]:
qna(
    question="Where do I work?",
    context="My name is Squirrel and I work at Playdata in Seoul"
)

{'score': 0.871019447396975, 'start': 34, 'end': 42, 'answer': 'Playdata'}

- Korean

In [28]:
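# klue/roberta-base is a base checkpoint; the warning below shows that its QA head (qa_outputs)
# is randomly initialized, so its answers are not meaningful without fine-tuning.
ko_qna = pipeline('question-answering', model='klue/roberta-base')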


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Device set to use cuda:0


In [29]:
ko_qna(
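    # Question: "What diagnosis did he receive in 1931?"
    # The context states that he was suddenly diagnosed with pulmonary tuberculosis (폐결핵) in 1931.
    question="1931년 어떤 진단을 받았나요?",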
    context="""그러던 1931년, 그는 갑작스럽게 폐결핵을 진단받았다. 이상은 하루에 담배를 50개피 이상 피는 것을 자신의 일과라고 표현했을 정도로 엄청난 골초였는데, 그 때문인지 병세는 날이 갈수록 악화되어 담당의가 그의 폐를 확인하고는 형체도 안 보인다며 혀를 내둘렀을 정도였다. 결국 1933년부터는 각혈까지 시작되었고, 건축기사 일을 지속하기 어렵다고 판단한 이상은 조선총독부에서 퇴사하고 황해도에 있는 배천 온천으로 요양을 간다."""
)

{'score': 0.00032637573895044625,
 'start': 105,
 'end': 114,
 'answer': '악화되어 담당의가'}
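
The tiny score (about 0.0003) and the unrelated answer span above are consistent with the load-time warning that `qa_outputs` was newly initialized. A rough sketch of why, assuming the span score is the product of a softmax probability for the start position and one for the end position: with an untrained head the logits carry almost no signal, so the best span over `n` tokens scores close to `1/n²`. The logits and the context length of 50 tokens are made-up numbers.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def best_span_score(start_logits, end_logits):
    """Highest P(start=i) * P(end=j) over spans with i <= j."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    return max(
        p_start[i] * p_end[j]
        for i in range(len(p_start))
        for j in range(i, len(p_end))
    )

n_tokens = 50  # assumed context length in tokens, for illustration only

# Untrained head: every position looks alike.
flat = [0.0] * n_tokens
print(best_span_score(flat, flat))

# Trained head (made-up logits): one start and one end position stand out.
peaked_start = [0.0] * n_tokens
peaked_end = [0.0] * n_tokens
peaked_start[12] = 10.0
peaked_end[14] = 10.0
print(best_span_score(peaked_start, peaked_end))
```

A checkpoint fine-tuned on a Korean question-answering dataset would be needed for usable answers; the English example earlier in this section works because its default model was fine-tuned on SQuAD.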

In [30]:
help(ko_qna)

Help on QuestionAnsweringPipeline in module transformers.pipelines.question_answering object:

class QuestionAnsweringPipeline(transformers.pipelines.base.ChunkPipeline)
 |  QuestionAnsweringPipeline(model: Union[ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')], tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, modelcard: Optional[transformers.modelcard.ModelCard] = None, framework: Optional[str] = None, task: str = '', **kwargs)
 |
 |  Question Answering pipeline using any `ModelForQuestionAnswering`. See the [question answering
 |  examples](../task_summary#question-answering) for more information.
 |
 |  Example:
 |
 |  ```python
 |  >>> from transformers import pipeline
 |
 |  >>> oracle = pipeline(model="deepset/roberta-base-squad2")
 |  >>> oracle(question="Where do I live?", context="My name is Wolfgang and I live in Berlin")
 |  {'score': 0.9191, 'start': 34, 'end': 40, 'answer': 'Berlin'}
 |  ```
 |
 |  Learn more about the basics of using a pipeli

In [31]:
from google.colab import files
uploaded = files.upload()


Saving insider_2000s_kor_slang_dataset.jsonl to insider_2000s_kor_slang_dataset.jsonl


##### 2000s Insider Slang Style Conversion (Custom Fine-Tuning)

In [32]:
import json
import datasets
from datasets import Dataset


data_path = "/content/insider_2000s_kor_slang_dataset.jsonl"


data = []
with open(data_path, "r", encoding="utf-8") as f:
    for line in f:
        data.append(json.loads(line))


raw = Dataset.from_list(data)

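# The Korean prompt reads: "[System] You are an AI that converts plain sentences into 2000s insider slang.",
# followed by "[Input] ..." and "[Output] ...".
def build_prompt(ex):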
    return f"[시스템] 너는 일반 문장을 2000년대 인싸 말투로 변환하는 AI야.\n[입력] {ex['input']}\n[출력] {ex['target']}"

raw = raw.map(lambda ex: {"text": build_prompt(ex)})

print(raw[0])



{'id': 1, 'input': '리뷰 부탁해.', 'target': '간지폭발 리뷰 부탁해. 가즈아 탱 ♨☆☆가능?!', 'text': '[시스템] 너는 일반 문장을 2000년대 인싸 말투로 변환하는 AI야.\n[입력] 리뷰 부탁해.\n[출력] 간지폭발 리뷰 부탁해. 가즈아 탱 ♨☆☆가능?!'}


In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

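# Use the GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load KoBART, a Korean BART (encoder-decoder) model suited to text-to-text conversion.
model_name = "gogamza/kobart-base-v2"

# Load the tokenizer and the model, then move the model to the selected device.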
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

print(f"모델 이름: {model_name}")
print(f"모델을 {device}로 로드했습니다.")




You passed `num_labels=3` which is incompatible to the `id2label` map of length `2`.



You passed `num_labels=3` which is incompatible to the `id2label` map of length `2`.
You passed `num_labels=3` which is incompatible to the `id2label` map of length `2`.



모델 이름: gogamza/kobart-base-v2
모델을 cuda로 로드했습니다.


In [3]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

def tok_fn(ex):
    return tokenizer(ex["text"], truncation=True, max_length=384)

tokd = raw.map(tok_fn, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="./insider-2000s-lora",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    num_train_epochs=2,
    fp16=True,
    logging_steps=50,
)

dcollator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
trainer = Trainer(model=model, args=args, train_dataset=tokd, data_collator=dcollator)
trainer.train()


NameError: name 'raw' is not defined

In [6]:
import json
import datasets
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
import torch

# 1. Set the file path and load the dataset
# The uploaded file is named insider_2000s_kor_slang_dataset.jsonl.
data_path = "/content/insider_2000s_kor_slang_dataset.jsonl"

# Read the JSON Lines file into a list, one record per line.
data = []
with open(data_path, "r", encoding="utf-8") as f:
    for line in f:
        data.append(json.loads(line))

# Convert the list into a Dataset.
raw = Dataset.from_list(data)

# 2. Load the model and tokenizer
# KoBART is used here to work around the memory problem.
model_name = "gogamza/kobart-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 3. Preprocess the data
# Note: build_prompt is kept from the earlier attempt but is not used by tok_fn.
def build_prompt(ex):
    return f"[시스템] 너는 일반 문장을 2000년대 인싸 말투로 변환하는 AI야.\n[입력] {ex['input']}\n[출력] {ex['target']}"

# There is no need to change 'eval_strategy' to 'evaluation_strategy'.
# The earlier TypeError occurred because 'evaluation_strategy' was renamed to 'eval_strategy';
# the new name is recognized as-is, so it can stay unchanged.

from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq

# KoBART is a text-to-text model, so the input and the target are tokenized separately.
def tok_fn(ex):
    model_inputs = tokenizer(ex["input"], max_length=128, truncation=True)
    labels = tokenizer(text_target=ex["target"], max_length=128, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# Split the dataset into training and validation sets.
split_dataset = raw.train_test_split(test_size=0.1)
tokenized_datasets = split_dataset.map(tok_fn, batched=True)

# Set the training arguments.
args = TrainingArguments(
    output_dir="./insider-2000s-model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    num_train_epochs=3,
    fp16=True,
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none"  # Added to skip the wandb login prompt.
)

# Use DataCollatorForSeq2Seq, which matches a text-to-text model.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Initialize the Trainer and start training.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator
)

trainer.train()

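# DataCollatorForSeq2Seq is the better fit for a text-to-text model.
# Note: the lines below repeat the Trainer setup and call train() a second time on the
# already-trained model, which is why two loss tables appear in this cell's output.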
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator
)

trainer.train()

You passed `num_labels=3` which is incompatible to the `id2label` map of length `2`.
You passed `num_labels=3` which is incompatible to the `id2label` map of length `2`.
You passed `num_labels=3` which is incompatible to the `id2label` map of length `2`.



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.778 | 1.427785 |
| 2 | 1.3686 | 1.201738 |
| 3 | 1.2128 | 1.166573 |


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.1132 | 1.174978 |
| 2 | 1.0797 | 1.132364 |
| 3 | 1.0768 | 1.121788 |


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=507, training_loss=1.1036167699674648, metrics={'train_runtime': 166.3964, 'train_samples_per_second': 48.679, 'train_steps_per_second': 3.047, 'total_flos': 30694038036480.0, 'train_loss': 1.1036167699674648, 'epoch': 3.0})
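
The `global_step=507` in the `TrainOutput` above can be reproduced from the settings in this cell: 2,700 training examples, a per-device batch size of 8, gradient accumulation of 2, and 3 epochs. A minimal sketch of that arithmetic, assuming a single GPU and that a partial final batch still counts as a batch. `optimizer_steps` is a hypothetical helper, not a `Trainer` API.

```python
import math

def optimizer_steps(num_examples, batch_size, grad_accum, epochs):
    """Optimizer updates for a run: batches per epoch, grouped by accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)      # 338 batches
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)   # 169 updates
    return updates_per_epoch * epochs

steps = optimizer_steps(num_examples=2700, batch_size=8, grad_accum=2, epochs=3)
print(steps)  # 507
```

The same number shows up later in the checkpoint directory name `checkpoint-507`.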

In [9]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import torch
import json
import datasets
from datasets import Dataset

# 1. Set the file path and load the dataset
data_path = "/content/insider_2000s_kor_slang_dataset.jsonl"

data = []
with open(data_path, "r", encoding="utf-8") as f:
    for line in f:
        data.append(json.loads(line))

raw = Dataset.from_list(data)

# 2. Load the model and tokenizer
model_name = "gogamza/kobart-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 3. Preprocess the data: tokenize the input and the target separately
def tok_fn(ex):
    model_inputs = tokenizer(ex["input"], max_length=128, truncation=True)
    labels = tokenizer(text_target=ex["target"], max_length=128, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

split_dataset = raw.train_test_split(test_size=0.1)
tokenized_datasets = split_dataset.map(tok_fn, batched=True)

# 4. Prepare and run training
args = TrainingArguments(
    output_dir="./insider-2000s-model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    num_train_epochs=3,
    fp16=True,
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none"
)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator
)

trainer.train()

# 5. Save and reload the final model
best_model_path = trainer.state.best_model_checkpoint
tokenizer.save_pretrained(best_model_path)
model.save_pretrained(best_model_path)

# Load the saved model and build a pipeline from it.
pipe = pipeline("text2text-generation", model=best_model_path, tokenizer=best_model_path)

# 6. Generate 2000s-style slang with the model
input_text = "오늘 점심 뭐 먹지?"
output = pipe(input_text, max_new_tokens=64, do_sample=True)
output_text = output[0]["generated_text"]

print(f"입력: {input_text}")
print(f"출력: {output_text}")

You passed `num_labels=3` which is incompatible to the `id2label` map of length `2`.
You passed `num_labels=3` which is incompatible to the `id2label` map of length `2`.
You passed `num_labels=3` which is incompatible to the `id2label` map of length `2`.



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.778 | 1.427785 |
| 2 | 1.3686 | 1.201738 |
| 3 | 1.2128 | 1.166573 |


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].
You passed `num_labels=3` which is incompatible to the `id2label` map of length `2`.
You passed `num_labels=3` which is incompatible to the `id2label` map of length `2`.
Device set to use cuda:0


입력: 오늘 점심 뭐 먹지?
출력: 오나전 오나전★오늘 점심 머 머먹지? ᄀᄀᄀ 실화냐 ♬!? ᄒᄒ!? !? ~~!? !? !? !? ~!? !? ᄒᄒ
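
The training cells above hand batching to `DataCollatorForSeq2Seq`. One thing it takes care of is padding `labels` to a common length with `-100` (its default `label_pad_token_id`), a value that PyTorch's cross-entropy loss ignores by default, so padded positions do not contribute to the loss. A minimal pure-Python sketch of that padding step, with made-up token ids. `pad_labels` is a hypothetical helper, not the collator's actual code.

```python
def pad_labels(label_batch, pad_value=-100):
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(labels) for labels in label_batch)
    return [labels + [pad_value] * (longest - len(labels)) for labels in label_batch]

# Two target sequences of different lengths (made-up token ids).
batch = [[14, 27, 9, 1], [31, 1]]
padded = pad_labels(batch)
print(padded)

# Positions that would actually contribute to the loss.
counted = sum(tok != -100 for row in padded for tok in row)
print(counted)
```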


In [13]:
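# AutoTokenizer and AutoModelForSeq2SeqLM are imported here so the cell runs on its own.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline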


model_path = "./insider-2000s-model/checkpoint-507"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

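# Prints: "This is a 2000s insider-slang converter. Enter a sentence to convert."
print("2000년대 인싸 말투 변환기입니다. 변환할 문장을 입력하세요.")
# Prints: "Enter 'q' to quit."
print("종료하려면 'q'를 입력하세요.")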

while True:
    user_input = input(">> ")
    # Accept both the advertised 'q' and '빠잉' ("bye") as exit commands.
    if user_input.strip().lower() in ('q', '빠잉'):
        print("변환기를 종료합니다.")
        break

    output = pipe(
        user_input,
        max_new_tokens=64,
        do_sample=True,
        no_repeat_ngram_size=2
    )
    output_text = output[0]["generated_text"]

    print(f"출력: {output_text}\n")

You passed `num_labels=3` which is incompatible to the `id2label` map of length `2`.
Device set to use cuda:0


2000년대 인싸 말투 변환기입니다. 변환할 문장을 입력하세요.
종료하려면 'q'를 입력하세요.
>> 노래방 갈 사람
출력: 빵터진 노래방 갈까? ᄀᄀᄀ 실화냐 쩐다 ᄏᄏ간지난다!? 심상치않음 ᄒᄒ심쿵!! ~~~^_^ᄏ갑니다잉!~쪼끔?~ᄏ심상치마!

>> 뭐하니
출력: 개간지 머 머하니? ᄂᄂ 굿 ♥ᄏᄏ심상치않음!? ~ᄏ지대로?! ᄏ♧완소!!ᄒᄒ♧♧ᄏ심쿵~~지?~ᄏ갑니다잉!~♥ᄏ!

>> 신기하다
출력: 간지폭발 쩐다 ᄏᄏ간지난다!? 심상치않음 ^_^~~ᄒᄒ~?! ~~심쿵~!ᄒᄏ가능?~ᄏᄒ완소!~지대로?ᄒ♧~맞죠

>> 빠잉
변환기를 종료합니다.
