
KE-T5 Downstreams

Downstreams

For users of the Hugging Face ecosystem, we built a module for training various downstream tasks. Using ke_t5.pipe, which ports part of Google's seqio to Hugging Face datasets, you can train tasks with a single command.

Install requirements

Install the packages the module depends on.

    pip install -r requirements.txt

Type of Model

๊ธฐ๋ณธ ์ œ๊ณต๋˜๋Š” ๋ชจ๋ธ๋“ค์€ ํฌ๊ฒŒ 2๊ฐ€์ง€๋กœ ๋ถ„๋ฅ˜๋ฉ๋‹ˆ๋‹ค.

  1. Seq2Seq Model (Generative)
  2. Encoder Model (BERT like)

Seq2Seq ๋ชจ๋ธ์€ ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์ด ๋ชจ๋‘ ํ…์ŠคํŠธ์ด๋ฉฐ, Encoder ๋ชจ๋ธ์€ Output์ด Class logtis(Token level, Sequence level)์ธ ๊ฒฝ์šฐ๊ฐ€ ๋Œ€๋ถ€๋ถ„์ž…๋‹ˆ๋‹ค.

Task์˜ ์ด๋ฆ„ ๋’ค์— _gen์ด ๋ถ™๋Š” ๊ฒฝ์šฐ๋Š” ์ด๋Ÿฌํ•œ seq2seq ๋ชจ๋ธ๋“ค์„ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” task๋“ค์ž…๋‹ˆ๋‹ค. _gen์ด ์—†๋Š” ํƒœ์Šคํฌ๋“ค์€ Encoder ๋ชจ๋ธ์— ํ—ค๋“œ๋ฅผ ๋ถ™์—ฌ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ํƒœ์Šคํฌ๋“ค์ž…๋‹ˆ๋‹ค. (seq2seq์œผ๋กœ๋งŒ ๊ฐ€๋Šฅํ•œ task์˜ ๊ฒฝ์šฐ ๋„ค์ด๋ฐ ๊ทœ์น™ ์˜ˆ์™ธ๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค)

Training Downstream

์ง€์›๋˜๋Š” ๋‹ค์šด์ŠคํŠธ๋ฆผ์€ ke_t5/task/task.py๋ฅผ ์ฐธ์กฐํ•˜์‹œ๊ธฐ ๋ฐ”๋ž๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” KE-T5๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ nikl_summarization_summary_split์„ ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒฝ์šฐ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. (summarization task๋Š” generative ๋ชจ๋ธ๋งŒ ํ•™์Šต๋‹ˆ ๊ฐ€๋Šฅํ•˜๊ธฐ ๋•Œ๋ฌธ์— _gen์ด ๋ถ™์ง€ ์•Š์Šต๋‹ˆ๋‹ค.) (์ฃผ์˜ NIKL๊ณผ ๊ฐ™์ด ์ž๋™์œผ๋กœ ๋‹ค์šด ๋ฐ›์ง€ ๋ชปํ•˜๋Š” ๋ฐ์ดํ„ฐ์…‹์€ ์ง์ ‘ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค์šด๋ฐ›์•„ ์••์ถ•์„ ํ‘ผ ํ›„ ๋ฃจํŠธ ๋””๋ ‰ํ† ๋ฆฌ์˜ ์œ„์น˜๋ฅผ --hf_data_dir๋กœ ์ž…๋ ฅํ•ด์ค˜์•ผ ํ•ฉ๋‹ˆ๋‹ค. NIKL ๋ฐ์ดํ„ฐ๋ฅผ ์ค€๋น„ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์—ฌ๊ธฐ๋ฅผ ์ฐธ๊ณ ํ•˜์‹œ๊ธฐ ๋ฐ”๋ž๋‹ˆ๋‹ค.)

python -m torch.distributed.launch --nproc_per_node=2 train_ddp.py \
    --batch 32 \
    --hf_data_dir "./data" \
    --hf_cache_dir "./cache_dir/huggingface_datasets" \
    --train_split "train[:90%]" \
    --test_split "train[90%:]" \
    --pass_only_model_io true \
    --gin_param="get_dataset.sequence_length={'inputs':512, 'targets':512}" \
    --gin_param="ke_t5.task.utils.get_vocabulary.vocab_name='KETI-AIR/ke-t5-base'" \
    --pre_trained_model="KETI-AIR/ke-t5-base" \
    --model_name "transformers:T5ForConditionalGeneration" \
    --task 'nikl_summarization_summary_split'

Setting --pass_only_model_io to true builds each mini batch from only the features used as model IO. Most generative models can be evaluated from the model's input and target tensors alone, so setting this to true avoids unnecessary computation. Some other tasks (NER, extractive QA, etc.) cannot be evaluated from the model's input and target alone; for those, set this value to false. The default is false.
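
To illustrate (a hypothetical sketch; the actual feature names depend on the task):

# --pass_only_model_io true: the mini batch keeps only model IO features,
# e.g. {'input_ids': ..., 'attention_mask': ..., 'labels': ...}
# --pass_only_model_io false: features the metric needs survive as well,
# e.g. for extractive QA something like
# {'input_ids': ..., 'attention_mask': ..., 'labels': ...,
#  'example_id': ..., 'context': ...}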

As in the example above, gin_param lets you set the sequence lengths of the inputs and targets and the name of the Hugging Face tokenizer used for preprocessing. You can also put these values in a *.gin file in advance and pass it with --gin_file.

gin/train_default.gin ํŒŒ์ผ์˜ ๊ฒฝ์šฐ๋ฅผ ์‚ดํŽด๋ณด๋ฉด, ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

get_dataset.sequence_length={'inputs':512, 'targets':512}
ke_t5.task.utils.get_vocabulary.vocab_name='KETI-AIR/ke-t5-base'

get_optimizer.optimizer_cls=@AdamW
AdamW.lr=1e-3
AdamW.betas=(0.9, 0.999)
AdamW.eps=1e-06
AdamW.weight_decay=1e-2

์œ„์—์„œ ๋ณด๋“ฏ์ด sequence length์™€ ์‚ฌ์šฉํ•  vocabulary ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์‚ฌ์šฉํ•  optimizer์™€ ํŒŒ๋ผ๋ฏธํ„ฐ๊นŒ์ง€ ์ง€์ •๋˜์–ด ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ์ด์šฉํ•˜์—ฌ ํ•™์Šต์„ ์‹œํ‚ค๋ ค๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ด ์ž…๋ ฅํ•ฉ๋‹ˆ๋‹ค.

python -m torch.distributed.launch --nproc_per_node=2 train_ddp.py \
    --batch 32 \
    --gin_file="gin/train_default.gin" \
    --pre_trained_model="KETI-AIR/ke-t5-base" \
    --model_name transformers:T5ForConditionalGeneration \
    --task 'nikl_summarization_summary_split'

KE-T5 ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค์„ ์ด์šฉํ•ด์„œ๋„ task๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. klue/roberta-small ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ klue topic classification์„ ํ•™์Šตํ•˜๋Š” ๊ฒฝ์šฐ ์•„๋ž˜์™€ ๊ฐ™์ด ์ž…๋ ฅํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ฒฝ์šฐ RoBERTa๋ฅผ ์ด์šฉํ•œ sequence classification ๋ชจ๋ธ์€ transformers package์˜ RobertaForSequenceClassification ํด๋ž˜์Šค์ž…๋‹ˆ๋‹ค. {๋ชจ๋“ˆ ๊ฒฝ๋กœ}:{ํด๋ž˜์Šค ์ด๋ฆ„} ํ˜•ํƒœ๋กœ --model_name์— ์ž…๋ ฅํ•ด์ค๋‹ˆ๋‹ค. ๊ฐœ์ธ์ด ๋งŒ๋“  ๋ชจ๋ธ๋“ค๋„ ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์ด huggingface ๋ชจ๋ธ๊ณผ ๋™์ผํ•˜๋‹ค๋ฉด ๋˜‘๊ฐ™์ด ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (gin/klue_roberta_tc.gin์—๋Š” vocabulary๋ฅผ klue/roberta-small์˜ vocabulary๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค. classification์˜ ๊ฒฝ์šฐ targets์˜ seqeunce length๋Š” ํฐ ์˜๋ฏธ๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค.)

python -m torch.distributed.launch --nproc_per_node=2 train_ddp.py \
    --batch_size 16 \
    --gin_file="gin/klue_roberta_tc.gin" \
    --pre_trained_model "klue/roberta-small" \
    --model_name transformers:RobertaForSequenceClassification \
    --task 'klue_tc'

ํ•™์Šต์„ ๋” ์ง„ํ–‰ํ•˜๊ณ  ์‹ถ๋‹ค๋ฉด --resume์— true๋‚˜ ์ฒดํฌํฌ์ธํŠธ ๊ฒฝ๋กœ๋ฅผ ์ž…๋ ฅํ•ด์ค๋‹ˆ๋‹ค. (true์˜ ๊ฒฝ์šฐ ๊ธฐ๋ณธ ๊ฒฝ๋กœ์—์„œ ์ฒดํฌํฌ์ธํŠธ๋ฅผ ๋กœ๋“œํ•จ) ๋ชจ๋ธ์€ huggingface save_pretrained ํ•จ์ˆ˜๋กœ ์ €์žฅํ•˜์—ฌ ๋‚˜์ค‘์— from_pretrained๋กœ ๋กœ๋”ฉํ•˜๋ ค๋ฉด ์ €์žฅํ•  ํด๋” ๊ฒฝ๋กœ๋ฅผ --hf_path์— ์ž…๋ ฅํ•ด์ค๋‹ˆ๋‹ค.

python -m torch.distributed.launch --nproc_per_node=2 train_ddp.py \
    --batch_size 16 \
    --gin_file="gin/klue_roberta_tc.gin" \
    --pre_trained_model "klue/roberta-small" \
    --model_name transformers:RobertaForSequenceClassification \
    --task 'klue_tc' \
    --resume true \
    --hf_path hf_out/klue_bert_tc
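
The folder written by --hf_path can then be reloaded with the standard Hugging Face API (a minimal sketch; loading the tokenizer from the base model is an assumption, since --hf_path stores the model weights):

from transformers import AutoTokenizer, RobertaForSequenceClassification

# the path matches --hf_path in the command above
model = RobertaForSequenceClassification.from_pretrained("hf_out/klue_bert_tc")
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-small")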

์ง€๊ธˆ๊นŒ์ง€์˜ ๋ชจ๋“  ์„ค๋ช…์€ distributed setting์ด์—ˆ์”๋‹ˆ๋‹ค. --nproc_per_node๊ฐ€ ํ•™์Šต์— ์‚ฌ์šฉํ•  gpu์˜ ๊ฐฏ์ˆ˜๋ฅผ ๋งํ•ด์ค๋‹ˆ๋‹ค. Single GPU๋กœ ํ•™์Šต์„ ํ•˜๋ ค๋ฉด ์ด ๊ฐ’์„ 1๋กœ ์„ค์ •ํ•˜๊ฑฐ๋‚˜, python -m torch.distributed.launch --nproc_per_node=2 ๋ถ€๋ถ„์„ python์œผ๋กœ ๋ฐ”๊ฟ”์ค๋‹ˆ๋‹ค.

Test downstream tasks

At test time you may want a generative model to use the generate function, for Hugging Face beam search, top_p, top_k, and so on. In that case set EvaluationHelper's model_fn to the name of the function you want to call and pass its keyword arguments via model_kwargs (if omitted, the kwargs specified for the task are used). model_input_keys selects which data fields are fed to the function; if omitted, only input_ids is passed. (This value is an array of the keys fed to the model.)

gin/test_default_gen.gin

get_dataset.sequence_length={'inputs':512, 'targets':512}
ke_t5.task.utils.get_vocabulary.vocab_name='KETI-AIR/ke-t5-base'

EvaluationHelper.model_fn='generate'
EvaluationHelper.model_kwargs={
                "early_stopping": True,
                "length_penalty": 2.0,
                "max_length": 200,
                "min_length": 30,
                "no_repeat_ngram_size": 3,
                "num_beams": 4,
            }
# EvaluationHelper.model_input_keys=['input_ids']
# Test!!!

python -m torch.distributed.launch --nproc_per_node=2 test_ddp.py \
    --gin_file="gin/test_default_gen.gin" \
    --model_name "transformers:T5ForConditionalGeneration" \
    --task 'nikl_summarization_summary_split' \
    --test_split test \
    --resume true

Downstream and Model List

ํ˜„์žฌ ์ง€์›๋˜๋Š” ๋‹ค์šด์ŠคํŠธ๋ฆผ๋“ค์˜ ์ผ๋ถ€์ž…๋‹ˆ๋‹ค.

Task ์ด๋ฆ„ ํ˜•ํƒœ
klue_tc_gen Generative
klue_tc Sequence Classification - single_label_classification
klue_nli_gen Generative
klue_nli Sequence Classification - single_label_classification
klue_sts_gen Generative
klue_sts_re Sequence Classification - regression
klue_sts Sequence Classification - single_label_classification
klue_re Sequence Classification - single_label_classification
klue_ner Token Classification
nikl_ner Token Classification
nikl_ner2020 Token Classification
nikl_summarization_summary Generative
nikl_summarization_topic Generative
korquad_gen Generative
korquad_gen_context_free Generative
kor_3i4k_gen Sequence Classification - single_label_classification
kor_3i4k Sequence Classification - single_label_classification

ํ˜„์žฌ ์ง€์›๋˜๋Š” T5๊ธฐ๋ฐ˜ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

๋ชจ๋ธ ์ด๋ฆ„ ํ˜•ํƒœ
transformers:T5ForConditionalGeneration Generative
T5EncoderForSequenceClassificationSimple Sequence Classification - single_label_classification
T5EncoderForSequenceClassificationMean Sequence Classification - single_label_classification
T5EncoderForTokenClassification Token Classification
T5EncoderForEntityRecognitionWithCRF Token Classification

Custom model

Huggingface model์„ ์ƒ์†๋ฐ›์•„ huggingface output type์œผ๋กœ forward์—์„œ returnํ•œ๋‹ค๋ฉด ์ด ๋ชจ๋ธ๋„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ๋“ค์–ด my_model.py๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋งŒ๋“ค์—ˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค. (๋ชจ๋ธ ์ƒ์„ฑ์€ ke_t5/models/models.py๋ฅผ ์ฐธ์กฐํ•ด ์ฃผ์„ธ์š”.)

my_model_dir/my_model.py

from transformers import T5EncoderModel
from ke_t5.models.loader import register_model

@register_model("abcdefg")
class MyModel(T5EncoderModel):
    ...

์œ„์˜ @register_model ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋Š” MyModel ํด๋ž˜์Šค๋ฅผ abcdefg๋ผ๊ณ  ๋“ฑ๋กํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ --model_name์œผ๋กœ ๋ชจ๋“ˆ ์ด๋ฆ„ ์—†์ด abcdefg๋ฅผ ์ž…๋ ฅํ•ด์ฃผ๋ฉด ๋ฉ๋‹ˆ๋‹ค. ๋งŒ์•ฝ ์ € decorator๋ฅผ ๋ถ™์ด์ง€ ์•Š์•˜์„ ๊ฒฝ์šฐ๋Š” my_model_dir.my_model:MyModel๋กœ ์ž…๋ ฅํ•ด์ฃผ๋ฉด ๋ฉ๋‹ˆ๋‹ค.

For example, among the built-in models the T5EncoderForSequenceClassificationMean class lives in ke_t5.models.models and is registered under the name T5EncoderForSequenceClassificationMean, so either T5EncoderForSequenceClassificationMean or ke_t5.models.models:T5EncoderForSequenceClassificationMean may be passed to --model_name. For classes named in camel case, you may also replace each capital letter with an underscore, as in ke_t5.models.models:t5_encoder_for_sequence_classification_mean.
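
A minimal sketch of resolving a registered name in code (assuming loader.load_model accepts the same name forms as --model_name; compare the sample code further below):

from ke_t5.models import loader, models  # importing models registers the built-in classes

model_cls = loader.load_model("T5EncoderForSequenceClassificationMean")
# equivalent, assuming the {module path}:{class name} form is also accepted:
# model_cls = loader.load_model("ke_t5.models.models:T5EncoderForSequenceClassificationMean")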

custom module์„ train_ddp script์—์„œ ๋™์ž‘ํ•˜๊ฒŒ ํ•˜๋ ค๋ฉด --module_import์— ๋ชจ๋“ˆ์˜ ๊ฒฝ๋กœ๋ฅผ ์ž…๋ ฅํ•ด์ค๋‹ˆ๋‹ค.

python -m torch.distributed.launch --nproc_per_node=2 train_ddp.py \
    --batch_size 16 \
    --gin_file="my_model.gin" \
    --pre_trained_model "path_to_pretrained_model_weights" \
    --model_name "abcdefg" \
    --task 'klue_tc' \
    --module_import "my_model_dir.my_model"

์ž์‹ ์˜ ๋ชจ๋ธ์— ๋งž๋Š” huggingface vocab path๋ฅผ ์ž…๋ ฅํ•ด์ฃผ๋Š” ๊ฒƒ์„ ์žŠ์ง€๋งˆ์„ธ์š”.

Samples

We share a few sample models.

task          model                                  base model           URL
nikl_ner      T5EncoderForEntityRecognitionWithCRF   KETI-AIR/ke-t5-base  Download
nikl_ner2020  T5EncoderForEntityRecognitionWithCRF   KETI-AIR/ke-t5-base  Download

Sample code (assuming the nikl_ner sample models above have been downloaded):

from transformers import T5Tokenizer
from ke_t5.models import loader, models

model_path = 'path_to_model_directory'
model_name = 'T5EncoderForEntityRecognitionWithCRF'
model_cls = loader.load_model(model_name)

tokenizer = T5Tokenizer.from_pretrained(model_path)
model = model_cls.from_pretrained(model_path)
id2label = model.config.id2label

# Source: 경상일보 (http://www.ksilbo.co.kr)
# author: reporter Lee Chun-bong (이춘봉), bong@ksilbo.co.kr
# URL: http://www.ksilbo.co.kr/news/articleView.html?idxno=903455
input_txt = ("울산시설공단은 다양한 꽃·나무 감상 기회를 제공해 시민들의 "
             "코로나 블루를 해소하고 이색적인 공간을 연출하기 위해 울산대공원 울산대종 "
             "뒤편 야외공연장 상단에 해바라기 정원을 조성했다고 13일 밝혔다.")

inputs = tokenizer(input_txt, return_tensors="pt")
output = model(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
    )

# Map each input token to its predicted entity label.
input_ids = inputs.input_ids[0]
predicted_classes = output.logits[0]
inp_tks = [tokenizer.decode(x) for x in input_ids]
lbls = [id2label[x] for x in predicted_classes]
print(list(zip(inp_tks, lbls)))


# --------------------------------------------------------
## NIKL NER์˜ ๊ฒฝ์šฐ
[('울산', 'B-OG'), ('시설공단', 'I-OG'), ('은', 'O'), ('다양한', 'O'),
('꽃', 'B-PT'), ('·', 'O'), ('나무', 'B-PT'), ('감상', 'O'),
('기회를', 'O'), ('제공해', 'O'), ('시민들의', 'B-CV'), ('코로나', 'O'),
('블루', 'O'), ('를', 'O'), ('해소하고', 'O'), ('이색적인', 'O'),
('공간을', 'O'), ('연출', 'O'), ('하기', 'O'), ('위해', 'O'),
('울산', 'B-LC'), ('대', 'I-LC'), ('공원', 'I-LC'), ('울산', 'B-LC'),
('대', 'I-LC'), ('종', 'I-LC'), ('뒤편', 'B-TM'), ('야외', 'O'),
('공연장', 'O'), ('상단', 'O'), ('에', 'O'), ('해바라기', 'B-PT'),
('정원을', 'O'), ('조성했다', 'O'), ('고', 'O'), ('13', 'B-DT'),
('일', 'I-DT'), ('밝혔다', 'O'), ('.', 'O'), ('</s>', 'O')]

## Output for NIKL NER 2020
[('울산', 'B-OGG_POLITICS'), ('시설공단', 'I-OGG_POLITICS'),
('은', 'O'), ('다양한', 'O'), ('꽃', 'B-PT_PART'), ('·', 'O'),
('나무', 'O'), ('감상', 'O'), ('기회를', 'O'), ('제공해', 'O'),
('시민들의', 'O'), ('코로나', 'O'), ('블루', 'O'), ('를', 'O'),
('해소하고', 'O'), ('이색적인', 'O'), ('공간을', 'O'), ('연출', 'O'),
('하기', 'O'), ('위해', 'O'), ('울산', 'B-LC_OTHERS'),
('대', 'I-LC_OTHERS'), ('공원', 'I-LC_OTHERS'),
('울산', 'B-AF_CULTURAL_ASSET'), ('대', 'I-AF_CULTURAL_ASSET'),
('종', 'I-AF_CULTURAL_ASSET'), ('뒤편', 'O'), ('야외', 'O'),
('공연장', 'O'), ('상단', 'O'), ('에', 'O'), ('해바라기', 'B-PT_FLOWER'),
('정원을', 'O'), ('조성했다', 'O'), ('고', 'O'), ('13', 'B-DT_DAY'),
('일', 'I-DT_DAY'), ('밝혔다', 'O'), ('.', 'O'), ('</s>', 'O')]
# --------------------------------------------------------

Training configurations

Task๋ณ„ ๋ชจ๋ธ ํ•™์Šต์— ์‚ฌ์šฉ๋œ configuration๋“ค ์ž…๋‹ˆ๋‹ค.

Relation Extraction (T5EncoderForSequenceClassificationMeanSubmeanObjmean)

train_RE.gin

get_dataset.sequence_length={'inputs':512, 'targets':512}
ke_t5.task.utils.get_vocabulary.vocab_name='KETI-AIR/ke-t5-base'

get_optimizer.optimizer_cls=@AdamW
AdamW.lr=1e-5
AdamW.betas=(0.9, 0.999)
AdamW.eps=1e-06
AdamW.weight_decay=1e-2

gin/RE_test_default.gin

get_dataset.sequence_length={'inputs':512, 'targets':8}
ke_t5.task.utils.get_vocabulary.vocab_name='KETI-AIR/ke-t5-base'

EvaluationHelper.model_fn='forward'
EvaluationHelper.model_input_keys=['input_ids', 'attention_mask', 'entity_token_idx']

Command

CUDA_VISIBLE_DEVICES='0' python train_ddp.py \
    --batch_size 32 \
    --gin_file="gin/train_RE.gin" \
    --pre_trained_model "KETI-AIR/ke-t5-base" \
    --model_name T5EncoderForSequenceClassificationMeanSubmeanObjmean \
    --task 'klue_re_tk_idx' \
    --epochs 50 \
    --train_split train \
    --valid_split test

EPOCHS=50
BSZ=16
NUM_PROC=8
WORKERS=0

PRE_TRAINED_MODEL="KETI-AIR/ke-t5-base"

TASK=klue_re_tk_idx
MODEL=T5EncoderForSequenceClassificationFirstSubmeanObjmean

# training
python -m torch.distributed.launch --nproc_per_node=${NUM_PROC} \
    train_ddp.py \
    --gin_file="tmp/train_RE.gin" \
    --model_name ${MODEL} --task ${TASK} \
    --train_split "train" --valid_split "test" \
    --epochs ${EPOCHS} \
    --batch_size ${BSZ} \
    --workers ${WORKERS} \
    --pre_trained_model ${PRE_TRAINED_MODEL}

# test
python -m torch.distributed.launch --nproc_per_node=${NUM_PROC} \
    test_ddp.py \
    --gin_file="tmp/test_RE.gin" \
    --model_name ${MODEL} \
    --task ${TASK} \
    --batch_size ${BSZ} \
    --pre_trained_model ${PRE_TRAINED_MODEL} \
    --resume output/${MODEL}_KETI-AIR_ke-t5-base/${TASK}/weights/best_model.pth

Performance

task     model                                                   base model           *F1mic  URL
KLUE RE  T5EncoderForSequenceClassificationFirstSubmeanObjmean   KETI-AIR/ke-t5-base  73.64   Download

* The F1mic of KLUE RE is the micro-averaged F1 score, ignoring the no_relation class.
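
A hedged sketch of that metric (scikit-learn and the label id of no_relation are assumptions; the repo may compute it differently):

from sklearn.metrics import f1_score

def klue_re_micro_f1(y_true, y_pred, no_relation_id=0):
    # micro-averaged F1 over every relation label except no_relation
    labels = [l for l in sorted(set(y_true) | set(y_pred)) if l != no_relation_id]
    return f1_score(y_true, y_pred, labels=labels, average="micro")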

Topic Classification (T5EncoderForSequenceClassificationMean)

gin/train_default.gin

get_dataset.sequence_length={'inputs':512, 'targets':512}
ke_t5.task.utils.get_vocabulary.vocab_name='KETI-AIR/ke-t5-base'

get_optimizer.optimizer_cls=@AdamW
AdamW.lr=3e-4
AdamW.betas=(0.9, 0.999)
AdamW.eps=1e-06
AdamW.weight_decay=1e-2

gin/test_default.gin

get_dataset.sequence_length={'inputs':512, 'targets':512}
ke_t5.task.utils.get_vocabulary.vocab_name='KETI-AIR/ke-t5-base'

EvaluationHelper.model_fn='forward'
EvaluationHelper.model_input_keys=['input_ids', 'attention_mask']

Command

EPOCHS=3
BSZ=24
NUM_PROC=8
WORKERS=0

PRE_TRAINED_MODEL="KETI-AIR/ke-t5-base"

TASK=klue_tc
MODEL=T5EncoderForSequenceClassificationMean

# training
python -m torch.distributed.launch --nproc_per_node=${NUM_PROC} \
    train_ddp.py \
    --gin_file="tmp/train_default.gin" \
    --model_name ${MODEL} --task ${TASK} \
    --train_split "train" --valid_split "test" \
    --epochs ${EPOCHS} \
    --batch_size ${BSZ} \
    --workers ${WORKERS} \
    --pre_trained_model ${PRE_TRAINED_MODEL}

# test
python -m torch.distributed.launch --nproc_per_node=${NUM_PROC} \
    test_ddp.py \
    --gin_file="tmp/test_default.gin" \
    --model_name ${MODEL} \
    --task ${TASK} \
    --batch_size ${BSZ} \
    --pre_trained_model ${PRE_TRAINED_MODEL} \
    --resume output/${MODEL}_KETI-AIR_ke-t5-base/${TASK}/weights/best_model.pth

Performance

task     model                                    base model  Acc.
KLUE TC  T5EncoderForSequenceClassificationMean   ke-t5-base  85.579

Natural Language Inference (T5EncoderForSequenceClassificationMean)

gin file์€ topic classification๊ณผ ๋™์ผํ•ฉ๋‹ˆ๋‹ค.

Command

EPOCHS=15
BSZ=24
NUM_PROC=8
WORKERS=0

PRE_TRAINED_MODEL="KETI-AIR/ke-t5-base"

TASK=klue_nli
MODEL=T5EncoderForSequenceClassificationMean

# training
python -m torch.distributed.launch --nproc_per_node=${NUM_PROC} \
    train_ddp.py \
    --gin_file="tmp/train_default.gin" \
    --model_name ${MODEL} --task ${TASK} \
    --train_split "train" --valid_split "test" \
    --epochs ${EPOCHS} \
    --batch_size ${BSZ} \
    --workers ${WORKERS} \
    --pre_trained_model ${PRE_TRAINED_MODEL}

# test
python -m torch.distributed.launch --nproc_per_node=${NUM_PROC} \
    test_ddp.py \
    --gin_file="tmp/test_default.gin" \
    --model_name ${MODEL} \
    --task ${TASK} \
    --batch_size ${BSZ} \
    --pre_trained_model ${PRE_TRAINED_MODEL} \
    --resume best

# save the model in Hugging Face format.
python -m torch.distributed.launch --nproc_per_node=${NUM_PROC} \
    train_ddp.py \
    --gin_file="tmp/train_default.gin" \
    --model_name ${MODEL} --task ${TASK} \
    --train_split "train" --valid_split "test" \
    --epochs ${EPOCHS} \
    --batch_size ${BSZ} \
    --workers ${WORKERS} \
    --resume true \
    --hf_path default \
    --pre_trained_model ${PRE_TRAINED_MODEL}

Performance

task      model                                    base model  Acc.
KLUE NLI  T5EncoderForSequenceClassificationMean   ke-t5-base  85

Seq Pipe

TODO: add a description of Seq Pipe.

TODO

  • Add a description of Seq Pipe
  • Add mixture tasks for generative models
  • Add coreference resolution code
