Relation Extraction (RE) is the task of identifying the semantic relation between an entity pair in a text. The relation is defined between a subject entity and an object entity, and the goal is to pick the appropriate relation between these two entities in a Korean sentence.
- Predicts the relation between a subject entity and an object entity in a given sentence.
- Uses the KLUE RE task data: 30 relation classes and roughly 32,000 training sentences.
- The label distribution is highly imbalanced: no_relation is the most frequent class (9,534 examples) and per:place_of_death (place of death) the rarest (40 examples).
- The data contained 42 duplicated sentences, 5 of which were identical but labeled differently; these were corrected to the right label.
- Evaluated with micro F1-score excluding the no_relation label (a sketch of the metric follows).
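For reference, a minimal sketch of the metric, assuming scikit-learn and that no_relation is label id 0 (adjust to your label map):

```python
from sklearn.metrics import f1_score

def klue_re_micro_f1(preds, labels, no_relation_id=0):
    # Micro F1 over all 30 classes except no_relation, scaled to 0-100.
    label_ids = [i for i in range(30) if i != no_relation_id]
    return f1_score(labels, preds, average="micro", labels=label_ids) * 100.0
```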
- Prerequisites Installation
- Quick Start
- Best Score Model
- Model Architecture
- Usage
- Augmenters
- Contributors
The dependencies listed in requirements.txt can be installed with pip:
$ pip install -r requirements.txt
- Train
python train.py
- Inference
python inference.py
- Private Score : 72.681
- Base Model : RoBERTa-large
- Hyperparameters are the same as for RoBERTa-large
- Uses BERT-style tokenization (sentence-pair input; see the example below)
- KLUE/RoBERTa-large
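A minimal sketch of loading the base model with a 30-way classification head (standard Hugging Face usage; the project's train.py may configure the head differently):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# KLUE/RoBERTa-large with one output per relation class (30 classes).
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "klue/roberta-large", num_labels=30
)
```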
"focal_loss":{
"true" : True,
"alpha" : 0.1,
"gamma" : 0.25
},
"Trainer" : {
"use_imbalanced_sampler" : true
},
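A minimal focal-loss sketch matching the alpha/gamma knobs above (a common formulation, not necessarily the project's exact implementation):

```python
import torch
import torch.nn.functional as F

class FocalLoss(torch.nn.Module):
    def __init__(self, alpha=0.1, gamma=0.25):
        super().__init__()
        self.alpha = alpha  # class-balancing weight
        self.gamma = gamma  # focusing parameter: down-weights easy examples

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)  # probability of the true class
        return (self.alpha * (1.0 - pt) ** self.gamma * ce).mean()
```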
BERT pre-training input:
[CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP] → LABEL = IsNext

BERT-style input used in this project:
[CLS][obj]변정수[/obj] 씨는 1994년 21살의 나이에 7살 연상 남편과 결혼해 슬하에 두 딸 [subj]유채원[/subj], 유정원 씨를 두고 있다. [SEP][obj][PER]변정수[/obj][subj][PER]유채원[/subj][SEP]
"dataPP" :{
"active" : true,
"entityInfo" : "entity&token",
"sentence" : "entity"
},
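A sketch of how the "entity&token" setting could produce the input shown above (function and variable names are illustrative, not the project's actual code):

```python
def build_input(sentence, subj, obj, subj_type, obj_type):
    # Wrap each entity in marker tokens inside the sentence.
    marked = sentence.replace(subj, f"[subj]{subj}[/subj]")
    marked = marked.replace(obj, f"[obj]{obj}[/obj]")
    # Second segment carrying the typed entity info; the tokenizer then
    # adds [CLS]/[SEP] around the pair, as in the example above.
    entity_info = f"[obj][{obj_type}]{obj}[/obj][subj][{subj_type}]{subj}[/subj]"
    return marked, entity_info
```

The two strings are encoded as a sentence pair, e.g. `tokenizer(marked, entity_info, ...)`.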
"aeda" : "None"
Default: applies AEDA to the 15 least frequent labels (requires Mecab; installation steps below)
"aeda" : "default"
sudo apt update
sudo apt install g++
sudo apt install default-jre
sudo apt install default-jdk
pip install konlpy
# install khaiii
cd ~
git clone https://github.com/kakao/khaiii.git
cd khaiii
mkdir build
cd build
pip install cmake  # or: sudo apt-get install cmake
cmake ..
make resource
sudo make install
make package_python
cd package_python
pip install .
cd ~
sudo apt-get install locales
sudo locale-gen en_US.UTF-8
pip install tweepy==3.7.0
# install mecab
wget https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.2.tar.gz
tar xvfz mecab-0.996-ko-0.9.2.tar.gz
cd mecab-0.996-ko-0.9.2
./configure
make
make check
sudo make install
sudo ldconfig
cd ~
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz
tar xvfz mecab-ko-dic-2.1.1-20180720.tar.gz
cd mecab-ko-dic-2.1.1-20180720
./configure
make
sudo make install
cd ~
mecab -d /usr/local/lib/mecab/dic/mecab-ko-dic
sudo apt install curl git
bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
pip install mecab-python
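After installation, a quick sanity check through konlpy (standard konlpy usage, not project code):

```python
from konlpy.tag import Mecab

mecab = Mecab()  # defaults to /usr/local/lib/mecab/dic/mecab-ko-dic
print(mecab.morphs("아버지가 방에 들어가신다"))
```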
Custom (no Mecab installation required)
Runs augmentation for labels with fewer examples than `# of no_relation * 0.4`. The sentence is split on spaces (' '), the tokens belonging to each entity are merged back into a single token, and AEDA is then applied (see the sketch after the config line below).
"aeda" : "custom"
Augmentation for the family-relation labels the model tends to confuse (a hypothetical sketch follows the config line below)
"aug_family" = true
Typed entity marker, from *An Improved Baseline for Sentence-level Relation Extraction* (Wenxuan Zhou, Muhao Chen): entities are wrapped in new special tokens that encode their NER type.
"type_ent_marker" = true
Typed entity marker (punct), from the same paper: entities are marked with punctuation characters and type words instead of new special tokens (see the sketch below).
"type_punct" = true
- RoBERTa-large
Argument | DataType | Default | Help |
---|---|---|---|
name | str | "roberta_large_stratified" | Wandb model Name |
tags | list | ["ROBERT_LARGE", "stratified", "10epoch"] | Wandb Tags |
group | str | "ROBERT_LARGE" | Wandb group Name |
- XLM-RoBERTa-large
Argument | DataType | Default | Help |
---|---|---|---|
name | str | "XLM-RoBERTa-large" | Wandb model Name |
tags | list | ["XLM-RoBERTa-large", "stratified", "10epoch"] | Wandb Tags |
group | str | "XLM-RoBERTa-large" | Wandb group Name |
Argument | DataType | Default | Help |
---|---|---|---|
true | bool | false | whether to use focal loss |
alpha | float | 0.1 | class-balancing weight |
gamma | float | 0.25 | focusing parameter (down-weights easy examples) |
- RoBERTa-large
Argument | DataType | Default | Help |
---|---|---|---|
output_dir | str | "./results" | result directory |
save_total_limit | int | 10 | limit on saved checkpoints |
save_steps | int | 100 | saving step |
num_train_epochs | int | 3 | train epochs |
learning_rate | float | 5e-5 | learning rate |
per_device_train_batch_size | int | 38 | train batch size |
per_device_eval_batch_size | int | 38 | evaluation batch size |
warmup_steps | int | 500 | lr scheduler warmup steps |
weight_decay | float | 0.01 | AdamW weight decay |
logging_dir | str | "./logs" | logging dir |
logging_steps | int | 100 | logging step |
evaluation_strategy | str | "steps" | evaluation strategy ("epoch" or "steps") |
eval_steps | int | 100 | eval steps |
load_best_model_at_end | bool | true | load best checkpoint (by eval loss) at end |
- XLM-RoBERTa-large
Argument | DataType | Default | Help |
---|---|---|---|
output_dir | str | "./results" | result directory |
save_total_limit | int | 10 | limit on saved checkpoints |
save_steps | int | 100 | saving step |
num_train_epochs | int | 10 | train epochs |
learning_rate | float | 5e-5 | learning rate |
per_device_train_batch_size | int | 31 | train batch size |
per_device_eval_batch_size | int | 31 | evaluation batch size |
warmup_steps | int | 500 | lr scheduler warmup steps |
weight_decay | float | 0.01 | AdamW weight decay |
logging_dir | str | "./logs" | logging dir |
logging_steps | int | 100 | logging step |
evaluation_strategy | str | "steps" | evaluation strategy ("epoch" or "steps") |
eval_steps | int | 100 | eval steps |
load_best_model_at_end | bool | true | load best checkpoint (by eval loss) at end |
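For reference, the RoBERTa-large defaults above map directly onto Hugging Face TrainingArguments:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    save_total_limit=10,
    save_steps=100,
    num_train_epochs=3,
    learning_rate=5e-5,
    per_device_train_batch_size=38,
    per_device_eval_batch_size=38,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True,
)
```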
Easy Data Augmentation Paper
Korean WordNet
나요한_T2073 : https://github.com/nudago
백재형_T2102 : https://github.com/BaekTree
송민재_T2116 : https://github.com/Jjackson-dev
이호영_T2177 : https://github.com/hylee-250
정찬미_T2207 : https://github.com/ChanMiJung
한진_T2237 : https://github.com/wlsl8135/
홍석진_T2243 : https://github.com/HongCu