# Korean Sentiment Classification with ELECTRA
Pretrained model: [KcELECTRA](https://github.com/Beomi/KcELECTRA) <br>
Data: [NAVER Sentiment Movie Corpus](https://github.com/e9t/nsmc/)

# Setup

In [1]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**Loading the NSMC data**

In [2]:
from datasets import load_dataset

datasets = load_dataset("nsmc")



  0%|          | 0/2 [00:00<?, ?it/s]

In [3]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 50000
    })
})

In [4]:
# label 0: negative / 1: positive
for i in range(3):
    print("train", datasets["train"][i])
    print("test", datasets["test"][i])

train {'id': '9976970', 'document': '아 더빙.. 진짜 짜증나네요 목소리', 'label': 0}
test {'id': '6270596', 'document': '굳 ㅋ', 'label': 1}
train {'id': '3819312', 'document': '흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나', 'label': 1}
test {'id': '9274899', 'document': 'GDNTOPCLASSINTHECLUB', 'label': 0}
train {'id': '10265843', 'document': '너무재밓었다그래서보는것을추천한다', 'label': 0}
test {'id': '8544678', 'document': '뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아', 'label': 0}


**Loading the KcELECTRA model and tokenizer**

In [5]:
from transformers import AutoTokenizer, AutoModel
  
tokenizer = AutoTokenizer.from_pretrained("beomi/KcELECTRA-base")
model = AutoModel.from_pretrained("beomi/KcELECTRA-base")

Some weights of the model checkpoint at beomi/KcELECTRA-base were not used when initializing ElectraModel: ['discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
model.config

ElectraConfig {
  "_name_or_path": "beomi/KcELECTRA-base",
  "architectures": [
    "ElectraForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 768,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "tokenizer_class": "BertTokenizer",
  "transformers_version": "4.21.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 50135
}
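
A few sizes follow directly from the configuration printed above. The sketch below is only illustrative arithmetic: the `cfg` dictionary copies a handful of values from that output, and the helper name `derived_sizes` is made up for this example.

```python
# Values copied from the ElectraConfig output above.
cfg = {
    "vocab_size": 50135,
    "embedding_size": 768,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "max_position_embeddings": 512,
}


def derived_sizes(cfg):
    """Return quantities implied by the config (hypothetical helper)."""
    return {
        # Each attention head works on hidden_size / num_attention_heads dimensions.
        "head_dim": cfg["hidden_size"] // cfg["num_attention_heads"],
        # The token embedding table has one row per vocabulary entry.
        "token_embedding_params": cfg["vocab_size"] * cfg["embedding_size"],
        # Longest input (in tokens) that has a position embedding.
        "max_tokens": cfg["max_position_embeddings"],
    }


sizes = derived_sizes(cfg)
print(sizes)
```

Since `max_position_embeddings` is 512, reviews that tokenize to more than 512 tokens would have to be truncated before being passed to the model.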

# Building the dataset

**Preparing the data**

In [7]:
from tqdm import tqdm

In [8]:
# Processing all 150,000 examples takes about 8 hours, so keep train:test at 3:1 and cut the total to 1/15
ids = int((datasets['train'].num_rows)//15)
train_doc = [datasets['train']['document'][idx] for idx in tqdm(range(0, ids))]
train_label = [datasets['train']['label'][idx] for idx in tqdm(range(0, ids))]

100%|██████████| 10000/10000 [31:42<00:00,  5.26it/s]
100%|██████████| 10000/10000 [08:19<00:00, 20.01it/s]


In [9]:
ids = int((datasets['test'].num_rows)//15)
test_doc = [datasets['test']['document'][idx] for idx in tqdm(range(0, ids))]
test_label = [datasets['test']['label'][idx] for idx in tqdm(range(0, ids))]

100%|██████████| 3333/3333 [03:23<00:00, 16.42it/s]
100%|██████████| 3333/3333 [00:55<00:00, 60.21it/s]
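
The subset sizes used above come from integer division, which keeps the train:test ratio close to 3:1. The loops are slow most likely because `datasets['train']['document']` rebuilds the whole column on every iteration. Slicing the split once (`datasets['train'][:n]`, which returns a dict of column lists) avoids that. Here is a minimal sketch, where `subset_size` and `split_columns` are hypothetical helpers and a small hand-written dict stands in for the sliced split:

```python
def subset_size(num_rows, divisor=15):
    """Number of rows kept when a split is cut to 1/divisor (floor division)."""
    return num_rows // divisor


def split_columns(batch):
    """Split a dict of columns into parallel text and label lists."""
    return list(batch["document"]), list(batch["label"])


n_train = subset_size(150000)
n_test = subset_size(50000)
print(n_train, n_test, round(n_train / n_test, 2))

# Hand-written stand-in for the dict returned by datasets["train"][:2].
batch = {"id": ["a", "b"], "document": ["good", "bad"], "label": [1, 0]}
docs, labels = split_columns(batch)
print(docs, labels)
```

With the real data this would read `train_doc, train_label = split_columns(datasets['train'][:n_train])`, and likewise for the test split.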


**Text preprocessing** <br>
The pretrained KcELECTRA comes with a text preprocessing step; the same author's fine-tuned model does not use it.

In [10]:
!pip install soynlp
!pip install emoji==1.7.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [11]:
import re
import emoji
from soynlp.normalizer import repeat_normalize

In [12]:
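# In emoji 1.x, UNICODE_EMOJI is keyed by language code, so calling .keys() on it yields 'en', 'es', ...
# rather than emoji characters. Index the 'en' table so the emoji themselves end up in the whitelist.
emojis = ''.join(emoji.UNICODE_EMOJI['en'].keys())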
pattern = re.compile(f'[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-ㅣ가-힣{emojis}]+')
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')

In [13]:
def clean(x):
    x = pattern.sub(' ', x)         # replace everything except common punctuation, ASCII, Hangul, and emoji with a space
    x = url_pattern.sub('', x)      # remove URLs
    x = x.strip()                   # trim leading and trailing whitespace
    x = repeat_normalize(x, num_repeats=2)      # shorten repeated characters to at most 2 repetitions
    return x
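
To see what each step of `clean` does without installing anything, here is a standard-library approximation. It is a sketch, not the notebook's function: the emoji whitelist is left out, and `collapse_repeats` is a simplified stand-in for soynlp's `repeat_normalize` that only handles runs of a single repeated character. The names `clean_sketch` and `collapse_repeats` are made up for this example.

```python
import re

# Same whitelist as above, minus the emoji characters (an assumption made for brevity).
allowed = re.compile(r'[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-ㅣ가-힣]+')
url = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')


def collapse_repeats(text, num_repeats=2):
    """Shorten runs of one repeated character to num_repeats copies."""
    return re.sub(r'(.)\1{%d,}' % num_repeats,
                  lambda m: m.group(1) * num_repeats, text)


def clean_sketch(text):
    text = allowed.sub(' ', text)   # replace characters outside the whitelist with a space
    text = url.sub('', text)        # drop URLs
    text = text.strip()             # trim leading and trailing whitespace
    return collapse_repeats(text)   # shorten long character runs


result = clean_sketch("재밌다ㅋㅋㅋㅋㅋ ★ https://example.com/a?b=1 ")
print(result)
```

The star is outside the whitelist and becomes a space, the URL is removed, the leftover whitespace is trimmed, and the run of five `ㅋ` is shortened to two.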