# [프로젝트6] 고객 상담 문의 유형 분류를 위해 BERT 모델 적용하기

---


## 프로젝트 목표
---
- BERT 모델을 딥러닝 프레임워크로 구성
- BERT를 통하여 문의 유형 분류 모델 학습 및 평가


## 프로젝트 목차
---

1. **BERT 모델 구성:** Transformer 기반 모델인 BERT를 huggingface, torch 라이브러리로 구성합니다.

2. **BERT를 통한 문의 유형 분류 문제:** BERT에 분류자를 추가해 fine-tuning을 통하여 문의 유형 분류 문제를 해결합니다.


## 프로젝트 개요
---

이전 프로젝트에서 데이터를 전처리하였고 LSTM, Attention을 추가한 LSTM 모델을 구성하고 학습하여 분류 모델을 만들었습니다. 이때, 단어 임베딩의 정보를 활용하였습니다.

이번 프로젝트에서는 Pre-trained 된 BERT 모델을 huggingface 라이버를 통하여 불러오고, 분류자를 추가하여 fine-tuning하여 문의 유형 분류 문제를 해결해보고자 합니다.

## 1. 데이터 전처리

---

### 1.1. 라이브러리 및 데이터 불러오기


프로젝트 1에서 사용한 데이터와 모델 학습을 위해 필요한 라이브러리를 불러옵니다. 

In [1]:
!pip install torch torchtext==0.11.0

Collecting torchtext==0.11.0
  Downloading torchtext-0.11.0-cp38-cp38-manylinux1_x86_64.whl (8.0 MB)
[K     |████████████████████████████████| 8.0 MB 13.5 MB/s eta 0:00:01
Collecting torch
  Downloading torch-1.10.0-cp38-cp38-manylinux1_x86_64.whl (881.9 MB)
[K     |████████████████████████▊       | 680.6 MB 82.8 MB/s eta 0:00:03

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[K     |████████████████████████████████| 881.9 MB 80.9 MB/s eta 0:00:01[K     |████████████████████████████████| 881.9 MB 4.1 kB/s 
Installing collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 1.9.0
    Uninstalling torch-1.9.0:
      Successfully uninstalled torch-1.9.0
Successfully installed torch-1.10.0 torchtext-0.11.0
You should consider upgrading via the '/opt/conda/bin/python3.8 -m pip install --upgrade pip' command.[0m


In [2]:
import pandas as pd
import re
from konlpy.tag import Okt

import time
import random
import numpy as np
import torch

In [3]:
data = pd.read_csv('./01_data.csv', encoding='cp949')
texts = data['메모'].tolist() # 자연어 데이터를 리스트 형식으로 변환합니다
label_list = data['상담유형3_GT'].unique().tolist()
labels = data['상담유형3_GT'].apply(label_list.index).tolist() # 한국어로 써 있는 라벨을 숫자로 변경

In [4]:
def cleaning(text):
    # 정제: 한글, 공백 제외한 문자 제거
    text = re.sub('[^가-힣ㄱ-ㅎㅏ-ㅣ\\s]', '', text)
    return text

In [5]:
texts_clean = []
for i in range(len(texts)):
    text_clean = cleaning(texts[i])
    texts_clean.append(text_clean)

학습 데이터와 테스트 데이터를 구분합니다.

In [6]:
num_train = int(0.8*len(texts_clean))

texts_labels = list(zip(texts_clean,labels))
random.shuffle(texts_labels)
texts_clean, labels = zip(*texts_labels)

train_texts = texts_clean[:num_train]
train_labels = labels[:num_train]

test_texts = texts_clean[num_train:]
test_labels = labels[num_train:]

### 1.2. 데이터 전처리
---

List 형태로 저장되어 있는 데이터와 라벨을 BERT 모델에 적용할 수 있도록 전처리합니다.

In [7]:
!pip install konlpy --no-cache-dir 
!pip install 'git+https://github.com/SKTBrain/KoBERT.git#egg=kobert_tokenizer&subdirectory=kobert_hf'
!pip install torch --no-cache-dir 
!pip install transformers --no-cache-dir 
!pip install sentencepiece --no-cache-dir
!pip install gluonnlp --no-cache-dir

You should consider upgrading via the '/opt/conda/bin/python3.8 -m pip install --upgrade pip' command.[0m
Collecting kobert_tokenizer
  Cloning https://github.com/SKTBrain/KoBERT.git to /tmp/pip-install-7d8soooe/kobert-tokenizer_3173b9cbceed4b6da73a5a5f5ad7f6f2
  Running command git clone -q https://github.com/SKTBrain/KoBERT.git /tmp/pip-install-7d8soooe/kobert-tokenizer_3173b9cbceed4b6da73a5a5f5ad7f6f2
  Resolved https://github.com/SKTBrain/KoBERT.git to commit e1f2f37055e7460d8427f6912579c0162cb69831
Building wheels for collected packages: kobert-tokenizer
  Building wheel for kobert-tokenizer (setup.py) ... [?25ldone
[?25h  Created wheel for kobert-tokenizer: filename=kobert_tokenizer-0.1-py3-none-any.whl size=4649 sha256=7f8a865db4a66185ada3af6ae089788ec8d49fb593002923c47f025f64a6fdcd
  Stored in directory: /tmp/pip-ephem-wheel-cache-hhq_7h_5/wheels/f7/cb/29/1a737fe71e5108dc30b04ea4a990f78ed271fa537aaf3fce7c
Successfully built kobert-tokenizer
Installing collected packages: kob

In [8]:
import torch
from transformers import BertModel
from kobert_tokenizer import KoBERTTokenizer
import gluonnlp as nlp
from transformers import BertForSequenceClassification, TrainingArguments, Trainer

텍스트 전처리 및 데이터셋을 만듭니다.

In [9]:
tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')

Downloading:   0%|          | 0.00/363k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/244 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/432 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'XLNetTokenizer'. 
The class this function is called from is 'KoBERTTokenizer'.


In [10]:
MAX_LEN = 50

In [11]:
train_encodings = tokenizer(train_texts, padding='max_length', max_length=MAX_LEN)
test_encodings = tokenizer(test_texts, padding='max_length', max_length=MAX_LEN)

In [12]:
class ConsultDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx][:MAX_LEN]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [13]:
train_dataset = ConsultDataset(train_encodings, train_labels)
test_dataset = ConsultDataset(test_encodings, test_labels)

## 2. BERT 모델 구성
---
분류자가 추가된 BERT 모델을 구성합니다.

### [TODO] BERT 출력 임베딩 위에 넣을 분류자를 만드는 코드를 작성하세요.

In [14]:
import torch.nn as nn
import torch.nn.functional as F
class Classifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, target_size):
        super(Classifier, self).__init__()
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, target_size)

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return x

하이퍼파라미터를 설정합니다.

In [15]:
EMBEDDING_DIM = 768 # BERT 출력물 (임베딩) 차원
HIDDEN_DIM = 128 # 분류자 은닉 변수 차원
TARGET_SIZE = len(label_list) # 라벨 클래스 개수
BATCH_SIZE = 64

모델, 손실 함수, 옵티마이저를 설정합니다.

In [16]:
from torch.utils.data import DataLoader
from torch import optim

classifier = Classifier(EMBEDDING_DIM, HIDDEN_DIM, TARGET_SIZE)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
optimizer = optim.Adam(classifier.parameters(), lr=1e-3)

loss_function = nn.CrossEntropyLoss()

In [17]:
model = BertModel.from_pretrained('skt/kobert-base-v1')
model.eval()

Downloading:   0%|          | 0.00/535 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/352M [00:00<?, ?B/s]

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(8002, 768, padding_idx=1)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )

## 3. BERT 모델을 통한 문의 유형 분류 문제
---
학습을 진행합니다.

정확도를 재는 함수를 정의합니다.

In [18]:
def accuracy(prediction, label):
    prediction_argmax = prediction.max(dim=-1)[1]
    correct = (prediction_argmax == label).float()
    acc = correct.sum() / len(correct)
    return acc

### [TODO] BERT 모델을 통한 분류 학습에 대한 훈련 함수 코드를 작성하세요.

In [24]:
def train(model, loader, optimizer, loss_function):
    epoch_loss = 0
    epoch_acc = 0
    temp_loss = 0
    temp_acc = 0
    
    classifier.train()
    for idx, batch in enumerate(loader):
        optimizer.zero_grad()
        
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
        predictions = classifier(outputs.pooler_output) # BERT 출력물로부터 분류 출력값 계산
        
        loss = loss_function(predictions, labels) # 로스 계산
        acc = accuracy(predictions, labels) # 정확도 계산
        
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
        temp_loss += loss.item()
        temp_acc += acc.item()
        
        if (idx % 10 == 0) and (idx != 0):
            print(f'Epoch: {epoch+1:02} - {(idx+1)*len(labels)} texts')
            print(f'\tTrain Loss: {temp_loss/10:.3f} | Train Acc: {temp_acc/10*100:.2f}%')
            temp_loss = 0
            temp_acc = 0
            
    return epoch_loss / len(loader), epoch_acc / len(loader)

평가 함수를 정의합니다.

In [20]:
def evaluate(model, loader, loss_function):
    epoch_loss = 0
    epoch_acc = 0
    
    classifier.eval()
        
    with torch.no_grad():
        for batch in loader:
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask']
            labels = batch['labels']

            outputs = model(input_ids, attention_mask=attention_mask)
            predictions = classifier(outputs.pooler_output)
        
            loss = loss_function(predictions, labels)
            acc = accuracy(predictions, labels)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(loader), epoch_acc / len(loader)

학습을 진행합니다.

In [21]:
NUM_EPOCHS = 4
best_valid_loss = float('inf')

In [25]:
for epoch in range(NUM_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_loader, optimizer, loss_function)
    valid_loss, valid_acc = evaluate(model, test_loader, loss_function)
    epoch_time = time.time() - start_time
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        #torch.save(model.state_dict(), 'bert-best.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_time:.2f}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 - 704 texts
	Train Loss: 3.017 | Train Acc: 20.62%
Epoch: 01 - 1344 texts
	Train Loss: 2.692 | Train Acc: 17.03%
Epoch: 01 - 1984 texts
	Train Loss: 2.661 | Train Acc: 18.28%
Epoch: 01 - 2624 texts
	Train Loss: 2.599 | Train Acc: 19.69%
Epoch: 01 - 3264 texts
	Train Loss: 2.578 | Train Acc: 21.56%
Epoch: 01 - 3904 texts
	Train Loss: 2.543 | Train Acc: 21.88%
Epoch: 01 | Time: 284.96s
	Train Loss: 2.635 | Train Acc: 19.72%
	 Val. Loss: 2.485 |  Val. Acc: 23.87%
Epoch: 02 - 704 texts
	Train Loss: 2.770 | Train Acc: 25.16%
Epoch: 02 - 1344 texts
	Train Loss: 2.497 | Train Acc: 21.25%
Epoch: 02 - 1984 texts
	Train Loss: 2.452 | Train Acc: 23.44%
Epoch: 02 - 2624 texts
	Train Loss: 2.477 | Train Acc: 25.00%
Epoch: 02 - 3264 texts
	Train Loss: 2.449 | Train Acc: 21.72%
Epoch: 02 - 3904 texts
	Train Loss: 2.480 | Train Acc: 25.16%
Epoch: 02 | Time: 287.63s
	Train Loss: 2.474 | Train Acc: 23.51%
	 Val. Loss: 2.396 |  Val. Acc: 24.20%
Epoch: 03 - 704 texts
	Train Loss: 2.718 | Train A