Relation Extraction (RE) is the task of identifying the semantic relation between an entity pair in a text. The relation is defined between a subject entity and an object entity, and the goal is to pick the appropriate relation between these two entities in a Korean sentence.
- Predicts the relation between a subject entity and an object entity in a given sentence.
- Uses the KLUE RE task data: 30 relation classes and roughly 32,000 training sentences.
- The label distribution is highly imbalanced: no_relation is the most frequent class (9,534 examples) and per:place_of_death (place of death) the rarest (40 examples).
- The data contained 42 duplicated sentences, 5 of which were identical but labeled differently; these were corrected to the right label.
- Evaluated with micro F1-score excluding the no_relation label (a sketch of the metric follows).
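For reference, a minimal sketch of the metric, assuming scikit-learn and that no_relation is label id 0 (adjust to your label map):

```python
from sklearn.metrics import f1_score

def klue_re_micro_f1(preds, labels, no_relation_id=0):
    # Micro F1 over all 30 classes except no_relation, scaled to 0-100.
    label_ids = [i for i in range(30) if i != no_relation_id]
    return f1_score(labels, preds, average="micro", labels=label_ids) * 100.0
```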
- Prerequisites Installation
- Quick Start
- Best Score Model
- Model Architecture
- Usage
- Augmenters
- Contributors
The dependencies listed in requirements.txt can be installed with pip:
$ pip install -r requirements.txt
- Train
python train.py
- Inference
python inference.py
- Private Score : 72.681
- Base Model : RoBERTa-large
- Hyperparameters are the same as for RoBERTa-large
- Uses BERT-style tokenization (sentence-pair input; see the example below)
- KLUE/RoBERTa-large
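A minimal sketch of loading the base model with a 30-way classification head (standard Hugging Face usage; the project's train.py may configure the head differently):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# KLUE/RoBERTa-large with one output per relation class (30 classes).
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "klue/roberta-large", num_labels=30
)
```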
"focal_loss":{
"true" : True,
"alpha" : 0.1,
"gamma" : 0.25
},
"Trainer" : {
"use_imbalanced_sampler" : true
},
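A minimal focal-loss sketch matching the alpha/gamma knobs above (a common formulation, not necessarily the project's exact implementation):

```python
import torch
import torch.nn.functional as F

class FocalLoss(torch.nn.Module):
    def __init__(self, alpha=0.1, gamma=0.25):
        super().__init__()
        self.alpha = alpha  # class-balancing weight
        self.gamma = gamma  # focusing parameter: down-weights easy examples

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)  # probability of the true class
        return (self.alpha * (1.0 - pt) ** self.gamma * ce).mean()
```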
BERT pre-training input:
[CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP] → LABEL = IsNext

BERT-style input used in this project:
[CLS][obj]변정수[/obj] 씨는 1994년 21살의 나이에 7살 연상 남편과 결혼해 슬하에 두 딸 [subj]유채원[/subj], 유정원 씨를 두고 있다. [SEP][obj][PER]변정수[/obj][subj][PER]유채원[/subj][SEP]
"dataPP" :{
"active" : true,
"entityInfo" : "entity&token",
"sentence" : "entity"
},
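A sketch of how the "entity&token" setting could produce the input shown above (function and variable names are illustrative, not the project's actual code):

```python
def build_input(sentence, subj, obj, subj_type, obj_type):
    # Wrap each entity in marker tokens inside the sentence.
    marked = sentence.replace(subj, f"[subj]{subj}[/subj]")
    marked = marked.replace(obj, f"[obj]{obj}[/obj]")
    # Second segment carrying the typed entity info; the tokenizer then
    # adds [CLS]/[SEP] around the pair, as in the example above.
    entity_info = f"[obj][{obj_type}]{obj}[/obj][subj][{subj_type}]{subj}[/subj]"
    return marked, entity_info
```

The two strings are encoded as a sentence pair, e.g. `tokenizer(marked, entity_info, ...)`.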
"aeda" : "None"
Default: applies AEDA to the 15 least frequent labels (requires Mecab; installation steps below)
"aeda" : "default"
sudo apt update
sudo apt install g++
sudo apt install default-jre
sudo apt install default-jdk
pip install konlpy
# install khaiii
cd ~
git clone https://github.com/kakao/khaiii.git
cd khaiii
mkdir build
cd build
pip install cmake  # or: sudo apt-get install cmake
cmake ..
make resource
sudo make install
make package_python
cd package_python
pip install .
cd ~
sudo apt-get install locales
sudo locale-gen en_US.UTF-8
pip install tweepy==3.7.0
# install mecab
wget https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.2.tar.gz
tar xvfz mecab-0.996-ko-0.9.2.tar.gz
cd mecab-0.996-ko-0.9.2
./configure
make
make check
sudo make install
sudo ldconfig
cd ~
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz
tar xvfz mecab-ko-dic-2.1.1-20180720.tar.gz
cd mecab-ko-dic-2.1.1-20180720
./configure
make
sudo make install
cd ~
mecab -d /usr/local/lib/mecab/dic/mecab-ko-dic
sudo apt install curl git
bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
pip install mecab-python
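After installation, a quick sanity check through konlpy (standard konlpy usage, not project code):

```python
from konlpy.tag import Mecab

mecab = Mecab()  # defaults to /usr/local/lib/mecab/dic/mecab-ko-dic
print(mecab.morphs("아버지가 방에 들어가신다"))
```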
Custom (no Mecab installation required)
Runs augmentation for labels with fewer examples than `# of no_relation * 0.4`. The sentence is split on spaces (' '), the tokens belonging to each entity are merged back into a single token, and AEDA is then applied (see the sketch after the config line below).
"aeda" : "custom"
Augmentation for the family-relation labels the model tends to confuse (a hypothetical sketch follows the config line below)
"aug_family" = true
Typed entity marker, from *An Improved Baseline for Sentence-level Relation Extraction* (Wenxuan Zhou, Muhao Chen): entities are wrapped in new special tokens that encode their NER type.
"type_ent_marker" = true
Typed entity marker (punct), from the same paper: entities are marked with punctuation characters and type words instead of new special tokens (see the sketch below).
"type_punct" = true
- RoBERTa-large
Argument | DataType | Default | Help |
---|---|---|---|
name | str | "roberta_large_stratified" | Wandb model Name |
tags | list | ["ROBERT_LARGE", "stratified", "10epoch"] | Wandb Tags |
group | str | "ROBERT_LARGE" | Wandb group Name |
- XLM-RoBERTa-large
Argument | DataType | Default | Help |
---|---|---|---|
name | str | "XLM-RoBERTa-large" | Wandb model Name |
tags | list | ["XLM-RoBERTa-large", "stratified", "10epoch"] | Wandb Tags |
group | str | "XLM-RoBERTa-large" | Wandb group Name |
Argument | DataType | Default | Help |
---|---|---|---|
true | bool | false | whether to use focal loss |
alpha | float | 0.1 | class-balancing weight |
gamma | float | 0.25 | focusing parameter (down-weights easy examples) |
- RoBERTa-large
Argument | DataType | Default | Help |
---|---|---|---|
output_dir | str | "./results" | result directory |
save_total_limit | int | 10 | limit on saved checkpoints |
save_steps | int | 100 | saving step |
num_train_epochs | int | 3 | train epochs |
learning_rate | float | 5e-5 | learning rate |
per_device_train_batch_size | int | 38 | train batch size |
per_device_eval_batch_size | int | 38 | evaluation batch size |
warmup_steps | int | 500 | lr scheduler warmup steps |
weight_decay | float | 0.01 | AdamW weight decay |
logging_dir | str | "./logs" | logging dir |
logging_steps | int | 100 | logging step |
evaluation_strategy | str | "steps" | evaluation strategy ("epoch" or "steps") |
eval_steps | int | 100 | eval steps |
load_best_model_at_end | bool | true | load best checkpoint (by eval loss) at end |
- XLM-RoBERTa-large
Argument | DataType | Default | Help |
---|---|---|---|
output_dir | str | "./results" | result directory |
save_total_limit | int | 10 | limit on saved checkpoints |
save_steps | int | 100 | saving step |
num_train_epochs | int | 10 | train epochs |
learning_rate | float | 5e-5 | learning rate |
per_device_train_batch_size | int | 31 | train batch size |
per_device_eval_batch_size | int | 31 | evaluation batch size |
warmup_steps | int | 500 | lr scheduler warmup steps |
weight_decay | float | 0.01 | AdamW weight decay |
logging_dir | str | "./logs" | logging dir |
logging_steps | int | 100 | logging step |
evaluation_strategy | str | "steps" | evaluation strategy ("epoch" or "steps") |
eval_steps | int | 100 | eval steps |
load_best_model_at_end | bool | true | load best checkpoint (by eval loss) at end |
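For reference, the RoBERTa-large defaults above map directly onto Hugging Face TrainingArguments:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    save_total_limit=10,
    save_steps=100,
    num_train_epochs=3,
    learning_rate=5e-5,
    per_device_train_batch_size=38,
    per_device_eval_batch_size=38,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True,
)
```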
Easy Data Augmentation Paper
Korean WordNet
나요한_T2073 : https://github.com/nudago
백재형_T2102 : https://github.com/BaekTree
송민재_T2116 : https://github.com/Jjackson-dev
이호영_T2177 : https://github.com/hylee-250
정찬미_T2207 : https://github.com/ChanMiJung
한진_T2237 : https://github.com/wlsl8135/
홍석진_T2243 : https://github.com/HongCu