# 승화 문서에서의 표기 오류 검출

BERT(Bidirectional Encoder Representations from Transformers)는 구글이 개발한 사전훈련된(pre-training) 모델입니다. 이 모델은 위키피디아같은 텍스트 코퍼스(말뭉치)를 사용하여 미리 학습되었다는 특징이 있습니다. 그리고 BERT의 특성으로 단어를 학습할 때 문맥을 함께 고려하기때문에 언어의 패턴을 이해한 모델이 만들어집니다.

이를 기반으로 새로운 문제에 적용하는 전이학습(transfer learning)을 수행할 수 있습니다. 미리 학습된 모델을 사용하기 때문에 적은 데이터로도 빠르게 학습이 가능하다는 이점이 있습니다.

따라서 해당 모델을 기반으로 문서에서의 잘못된 표기 오류를 검출하는 알고리즘을 개발하였습니다.


## 목표

웹사이트나 텍스트 문서는 긴 여러개의 문장으로 이루어져있습니다. 각 문장이 잘못되었는지를 검사하고 잘못된 경우 잘못된 표현이 어디에 있는지 정확한 위치를 예측하여 알려주는 것이 저희 모델의 최종 목표입니다.

따라서 저희는 BERT 모델에 linear regression를 적용한 네트워크 모델을 사용할 예정입니다. Output의 [CLS] 토큰을 통해 문장의 표기 오류를 분류하고 linear regression을 통해 그 위치를 예측할 것입니다.

또한 표기 오류가 있지만 문제가 되지 않는 경우가 있습니다.

<pre>
한국 옆에는 작은 섬이 있는데 이것은 다케시마라고 불리기도 한다. 
그러나 이것은 잘못된 것으로 올바른 표기는 독도이다.
</pre>

문장 단위로 표기 오류를 검출한다면 인공지능은 표기 오류 결과로 "다케시마"를 지목할 것입니다.
그러나 전체적인 문맥을 보면 해당 문장은 잘못된 사례를 이야기 해줄 뿐, 오류가 있는 문장이라고 말을 할수는 없습니다.

따라서 BERT 모델에 주변 문장을 함께 학습시키는 모델을 구상하였습니다.
**학습 예시**
<pre>
[index-2] None
[index-1] None
[판단할 문장] 한국 옆에는 작은 섬이 있는데 이것은 다케시마라고 불리기도 한다. 
[index+1] 그러나 이것은 잘못된 것으로 올바른 표기는 독도이다.
[index+2]
</pre>

이 알고리즘을 이용해 sentence window를 슬라이딩 시키며 학습을 진행하면 문맥을 함께 고려하는 모델을 만들 수 있을 것이라 생각하였습니다.
표기 오류는 오직 [판단할 문장]에 있는지만 체크하도록 학습을 시킬 것으로, 주변 문장의 표기 오류를 검출하여 모델이 혼잡해지는 경우를 최소화하였습니다.

# pretrained 모델 불러오기

In [1]:
import os
import torch
import pandas as pd
import torch.nn as nn
import numpy as np
import torch.nn.functional as F
from torch.optim import lr_scheduler

from sklearn import model_selection
from sklearn import metrics
import transformers
import tokenizers
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup
from tqdm.autonotebook import tqdm
import utils

class config:
    MAX_LEN = 128
    TRAIN_BATCH_SIZE = 8
    VALID_BATCH_SIZE = 4
    EPOCHS = 5
    BERT_PATH = "./bert-base-uncased/"
    MODEL_PATH = "model.bin"
    TOKENIZER = tokenizers.BertWordPieceTokenizer(
        f"{BERT_PATH}/vocab.txt", 
        lowercase=True
    )
    


In [2]:
import os
from urllib import request
if not os.path.isfile(config.BERT_PATH + "pytorch_model.bin"):
    print("사전 학습 모델 다운로드중")
    request.urlretrieve("https://share.easylab.kr/pytorch_model.bin",config.BERT_PATH + "pytorch_model.bin")
print("사전 학습 모델 다운로드 완료")

사전 학습 모델 다운로드 완료


## 데이터 불러오기
API를 통해 서버로부터 사용 가능한 학습 데이터를 불러옵니다. 

API는 [ {no, contents, errors[code,keyword] } , ...] 형태로 데이터를 보내주도록 되어있습니다.

In [3]:
import requests
class APIDokdo:
    def __init__(self, apikey):
      self.apiurl = "https://api.easylab.kr"
      self.headers =  {'authorization': apikey}
    def getTrainingData(self):
        return requests.get(self.apiurl + "/deeplearning/data/sentences", headers=self.headers).json()['list']
    def getErrorTypes(self):
        return requests.get(self.apiurl + "/error", headers=self.headers).json()['list']

In [4]:
api = APIDokdo("godapikey12")
original_data_json = api.getTrainingData()
print("불러온 문서의 개수: ", len(original_data_json))

class_list = [0]
class_list_from_code = {0:0}
for i in api.getErrorTypes():
    class_list_from_code[i['code']] = len(class_list)
    class_list.append(i['code'])


불러온 문서의 개수:  27


### 데이터 전처리
불러온 데이터를 문장 단위로 분리를 하고, 표기 오류를 검색합니다. 또한 문맥을 고려할 수 있게 5개의 문장씩 관리합니다 (0-5, 1-6, 2-7, 3-8 ...)

In [5]:
from nltk import tokenize
original_data = []
for i in original_data_json:
    sentences = []
    # text = i['contents'].replace("\r","").replace("\n","")
    #sentences = [i.strip() for i in tokenize.sent_tokenize(text)] #문장 단위로 분리 및 문장 앞뒤 공백 제거
    sentences = [i.strip() for i in i['contents'].split('\n')]
    
    # 두줄 이상 공백이 있는 경우 제거
    last = ""
    remove_indexes = []
    for j in range(0, len(sentences)):
        if last == "" and sentences[j] == "":
            remove_indexes.append(j)
        last = sentences[j]
        
    for index in sorted(remove_indexes, reverse=True):
        del sentences[index]
    
    # 문장별로 표기 오류 키워드 검색 && 2 + 1 + 2 문장 단위로 자동 구성
    # padding
    sentences = ['', ''] + sentences + ['', '']
    for index in range(2, len(sentences)):
        # 빈 문장 제거
        if (len(sentences[index]) == 0): continue
            
        y_class = 0
        y_keyword = ""
        # 현재 문장에 표기 오류가 있는지 확인
        if 'errors' not in i:
            print(i)
        error_index = 0
        for error in i['errors']:
            if (error['sentence_no'] != index - 2):
                continue
            predict_keyword = sentences[index][error['position']:(error['position']+error['length'])]
            if (predict_keyword != error['keyword']):
                print(error)
            y_class = class_list_from_code[error['code']]
            y_keyword = error['keyword']
            sequence = "sequence " + str(error_index) + ": "
            sequence = ""
            position = len(sequence) + error['position']
            original_data.append([[(sequence + sentences[j] if j == index else sentences[j]) for j in range(index-2, index+2 + 1)], y_class, y_keyword, position, error['length'], error_index])
            error_index += 1
        
        original_data.append([[("sequence "+str(error_index)+": "+ sentences[j] if j == index else sentences[j]) for j in range(index-2, index+2 + 1)],0, "", 0, 0, error_index])
        

In [6]:
len(original_data)

1689

## 데이터 전처리 2

데이터 불균형 해결

이거 없으면 소수 클래스 예측 엄청 

In [7]:
import random
random.seed(0)
random.shuffle(original_data)
index = int(len(original_data) * 0.8)
train = original_data[0:index]
test_data = original_data[index:]

#train = original_data[0:index]
#test_data = original_data[0:index]

print(len(original_data))
print(len(train))
print(len(test_data))

1689
1351
338


In [8]:
length = len(train)
for i in range(0,length):
    temp = [[train[i][0][j] for j in range(4,-1,-1)], train[i][1], train[i][2], train[i][3], train[i][4], train[i][5]]
    train.append(temp)

In [9]:
training_data = train
value_counts = [0] * len(class_list)
for i in training_data:
    value_counts[i[1]] += 1

value_counts

[1932, 0, 212, 558, 0, 0]

In [10]:
def ab():
    global training_data
    training_data = []
    for i in train:
        training_data.append(i)
    temp = [] * len(class_list)
    for i in class_list:
        temp.append([])
    value_counts = [0] * len(class_list)
    for i in training_data:
        temp[i[1]].append(i)
        value_counts[i[1]] += 1

    vv = max(value_counts)
    print(vv)
    for i in range(0,len(class_list)):
        if len(temp[i]) == 0:
            continue
        for j in range(len(temp[i]), vv, len(temp[i])):
            training_data.extend(temp[i])

    value_counts = [0] * len(class_list)
    for i in training_data:
        value_counts[i[1]] += 1

    return value_counts
    
# ab()

# 데이터 전처리 3
가공된 데이터를 BERT 모델에 넣을 수 있도록 만들어야합니다. 모델에는 자연어를 그대로 입력할 수 없으니 사전 학습된 BERT 모델의 vocabulary를 활용하여 토큰화 합니다.

In [11]:
tok_tweet = config.TOKENIZER.encode("The “Takeshima Zusetsu [Explanation of Takeshima with Maps]” edited by Tsuan Kitazono during the Horeki Period (1751-1763) contains the following description: There is an island about 40 ri north of the west island (Nishijima) of Matsushima (Takeshima) in 3 Oki county.")
print("분활된 토큰 형태: " + str(tok_tweet.tokens))
print("숫자형 토큰 형태: " + str(tok_tweet.ids))
print("offsets: " + str(tok_tweet.offsets))
print("attention_mask: " + str(tok_tweet.attention_mask))
print("special_tokens_mask: " + str(tok_tweet.special_tokens_mask))
print("special_tokens_mask: " + str(tok_tweet.special_tokens_mask))
print("overflowing: " + str(tok_tweet.overflowing))

분활된 토큰 형태: ['[CLS]', 'the', '“', 'takeshima', 'zu', '##set', '##su', '[', 'explanation', 'of', 'takeshima', 'with', 'maps', ']', '”', 'edited', 'by', 'ts', '##uan', 'kit', '##az', '##ono', 'during', 'the', 'ho', '##rek', '##i', 'period', '(', '1751', '-', '1763', ')', 'contains', 'the', 'following', 'description', ':', 'there', 'is', 'an', 'island', 'about', '40', 'ri', 'north', 'of', 'the', 'west', 'island', '(', 'ni', '##shi', '##jima', ')', 'of', 'mats', '##ush', '##ima', '(', 'takeshima', ')', 'in', '3', 'ok', '##i', 'county', '.', '[SEP]']
숫자형 토큰 형태: [101, 1996, 1523, 3, 16950, 13462, 6342, 1031, 7526, 1997, 3, 2007, 7341, 1033, 1524, 5493, 2011, 24529, 13860, 8934, 10936, 17175, 2076, 1996, 7570, 16816, 2072, 2558, 1006, 24440, 1011, 18432, 1007, 3397, 1996, 2206, 6412, 1024, 2045, 2003, 2019, 2479, 2055, 2871, 15544, 2167, 1997, 1996, 2225, 2479, 1006, 9152, 6182, 19417, 1007, 1997, 22281, 20668, 9581, 1006, 3, 1007, 1999, 1017, 7929, 2072, 2221, 1012, 102]
offsets: [(0, 0), (0,

In [12]:
def preprocessing(text, y_class, y_keyword, position, length, error_index):
    if (len(text[1]) > 50):
        before = text[1]
    else:
        before = text[0] + " " + text[1]
        
    main = text[2]
    
    if (len(text[3]) > 50):
        after = text[3]
    else:
        after = text[3] + " " + text[4]
    # before = ""
    # after = ""
    before = config.TOKENIZER.encode(before)
    main = config.TOKENIZER.encode(main)
    after = config.TOKENIZER.encode(after)


    # 토큰 기준으로 키워드가 어디있는지 확인
    keyword_position_in_token = -1
    keyword_end_in_token = -1
    if y_class != 0:
        keyword_position_in_string = position
        keyword_length_in_string = length
        for j in range(len(main.offsets)):
            if keyword_position_in_token == -1 and main.offsets[j][0] >= keyword_position_in_string:
                keyword_position_in_token = j
            if main.offsets[j][1] == 0: continue
            if main.offsets[j][1] <= (keyword_position_in_string + keyword_length_in_string):
                keyword_end_in_token = j
    else:        
        keyword_position_in_token = 0
        keyword_end_in_token = 0
        
    # ids = cls, classification number, sep, token, sep
    ids = [101, 9999, 102] + main.ids[1:] + before.ids[1:] + after.ids[1:]
    # mask = len(cls, classification number, sep, token, sep) = 1, else 0
    mask = [1] * len(ids)
    # token_type_ids len(token, sep) = 1, else 0
    token_type_ids = [0,0,0] +  [1] * (len(main) - 1) + [0] * (len(before) - 1) + [0] * (len(after) - 1)

    targets_start = keyword_position_in_token
    targets_end = keyword_end_in_token

    # offsets based on ids, token offsets (0,0)(0,0)(0,0)(0,a)...(0,0)
    offsets = main.offsets
    
    # Pad sequence if its length < `max_len`
    padding_length = 380 - len(ids)
    if padding_length > 0:
        ids = ids + ([0] * padding_length)
        mask = mask + ([0] * padding_length)
        token_type_ids = token_type_ids + ([0] * padding_length)
        padding_length = 380 - len(offsets)
        offsets = offsets + ([(0, 0)] * padding_length)
        
    temp = []
    temp = [0] * len(class_list)
    temp[y_class] = 1
            
    return {
            'ids': ids,
            'mask': mask, 
            'token_type_ids': token_type_ids,
            'targets_start': targets_start, 
            'targets_end': targets_end, 
            'orig_text': text,
            'orig_keyword': y_keyword,
            'class': y_class,
            'offsets': offsets ,
            'targets_class': temp,
            'error_index':[error_index] * 380
    }

In [13]:
i = 17
print (training_data[i])
a = preprocessing(training_data[i][0], training_data[i][1],training_data[i][2], training_data[i][3], training_data[i][4], training_data[i][5])
print(a)

[['The “Takeshima Zusetsu [Explanation of Takeshima with Maps]” edited by Tsuan Kitazono during the Horeki Period (1751-1763) contains the following description: There is an island about 40 ri north of the west island (Nishijima) of Matsushima (Takeshima) in 3 Oki county.', 'It is called Takeshima (Utsuryo Island).', 'sequence 0: This island is close to Japan and next to Joseon and is shaped like a triangle.', 'Its circumference is about 15 ri.', '… The distance from Hakushu’s Yonago to Takeshima is about 160 ri by sea.'], 0, '', 0, 0, 0]
{'ids': [101, 9999, 102, 5537, 1014, 1024, 2023, 2479, 2003, 2485, 2000, 2900, 1998, 2279, 2000, 4560, 2239, 1998, 2003, 5044, 2066, 1037, 9546, 1012, 102, 1996, 1523, 3, 16950, 13462, 6342, 1031, 7526, 1997, 3, 2007, 7341, 1033, 1524, 5493, 2011, 24529, 13860, 8934, 10936, 17175, 2076, 1996, 7570, 16816, 2072, 2558, 1006, 24440, 1011, 18432, 1007, 3397, 1996, 2206, 6412, 1024, 2045, 2003, 2019, 2479, 2055, 2871, 15544, 2167, 1997, 1996, 2225, 2479, 1

## 파이토치를 위한 데이터셋 클래스 생성
이 클래스는 파이토치에서 데이터를 로드할 때 사용되는 인터페이스입니다.

In [14]:
class TextDataset:
    """
    Dataset which stores the tweets and returns them as processed features
    """
    def __init__(self, dataset):
        self.dataset = dataset
    
    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        data = preprocessing(
            self.dataset[item][0], 
            self.dataset[item][1], 
            self.dataset[item][2], 
            self.dataset[item][3], 
            self.dataset[item][4], 
            self.dataset[item][4]
        )
        temp = []
        for i in data["targets_class"]:
            temp.append(torch.tensor(i, dtype=torch.long))
            
        # Return the processed data where the lists are converted to `torch.tensor`s
        return {
            'ids': torch.tensor(data["ids"], dtype=torch.long),
            'mask': torch.tensor(data["mask"], dtype=torch.long),
            'token_type_ids': torch.tensor(data["token_type_ids"], dtype=torch.long),
            'targets_start': torch.tensor(data["targets_start"], dtype=torch.long),
            'targets_end': torch.tensor(data["targets_end"], dtype=torch.long),
            'targets_class': torch.tensor(data["targets_class"], dtype=torch.long),
            'orig_tweet': data["orig_text"],
            'orig_selected': data["orig_keyword"],
            'sentiment': data["class"],
            'offsets': torch.tensor(data["offsets"], dtype=torch.long),
            'error_index': torch.tensor(data["error_index"], dtype=torch.long)
        }


## 파이토치 모델 생성

In [15]:
import torch
import transformers
class TweetModel(transformers.BertPreTrainedModel):
    """
    Model class that combines a pretrained bert model with a linear later
    """
    def __init__(self, conf, num_labels):
        super(TweetModel, self).__init__(conf)
        # pretrained BERT model을 불러옵니다.
        self.bert = transformers.BertModel.from_pretrained(config.BERT_PATH, config=conf)
        
        # Set 10% dropout to be applied to the BERT backbone's output
        # dropout은 은닉층에서 일정 확률로 유닛을 사용하지 않도록(=0) 합니다.
        # 따라서 해당 케이스에서는 사용된 유닛만을 이용해 loss를 구하고 grident를 수행합니다.
        # 결국 오버피팅 방지 가능!! (하나의 유닛에 의존하는 현상을 제거)
        self.drop_out = nn.Dropout(0.1)
        
        # 우리가 쓰는 bert-base-uncased 모델은 768의 hidden representation을 가지고 있음
        # 그래서 새로운 레이어를 이어 붙일 때에도 768개씩 붙여야함.
        
        # 우리의 데이터를 추가로 학습하는 용도로 사용할 추가적인 레이어가 필요함. (hidden_layer 추가)
        # 히든 레이어를 추가할수록 복잡한 딥러닝 네트워크를 만들 수 있지만... 데이터가 많이 필요할 듯
        
        # 여기에서는 단어 임베딩 결과를 활용할 수 있게 레이어 정의
        # BERT를 수행하며 나온 hidden layer 12개중, 마지막 10번째 11번째를 사용할 것임.
        # 12번째는 오버피팅 가능성이 높기 때문
        # 따라서 10번째(768) 11번째 (768) 두개의 레이어를 input으로 받을 것이기 때문에 레이어의 input은 768 * 2
        
        # layer0만으로 결과를 내기에는 제대로 학습이 안된다고 판단이 되어 같은 크기의 레이어 layer1를 추가할 예정
        # 따라서 768*2 -> 768*2 레이어 정의
        self.l0 = nn.Linear(768 * 2 + 1, 768 * 2 + 1)
        
        # l0으로부터 768*2 결과를 전달받아 최종적으로 start, end, class를 판단하기 위한 layer1를 정의
        self.l1 = nn.Linear(768 * 2 + 1, 2 + num_labels)
        
        # 사용할 activation 함수
        self.gelu = nn.GELU()
        
        # 가중치 랜덤 초기화
        torch.nn.init.normal_(self.l0.weight, std=0.02)
        torch.nn.init.normal_(self.l1.weight, std=0.02)
    
    def forward(self, ids, mask, token_type_ids, error_index):
        # BERT backbone으로부터 hidden states를 얻어옴.
        _, _, out = self.bert(
            ids,
            attention_mask=mask,
            token_type_ids=token_type_ids
        ) # bert_layers x bs x SL x (768)

        # Concatenate the last two hidden states
        # This is done since experiments have shown that just getting the last layer
        # gives out vectors that may be too taylored to the original BERT training objectives (MLM + NSP)
        # Sample explanation: https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-the-last-hidden-layer-why-second-to-last
        
        # BERT를 수행하며 나온 hidden layer의 output에서 -2번째, -1번째만 가져옴. 그리고 한줄로 이어 붙이기

        error_index = error_index.view(out[-1].shape[0],out[-1].shape[1],1)
        out = torch.cat((out[-1], out[-2], error_index), dim=-1) # bs x SL x (768 * 2)
        
        # 위에서 말했던것 처럼 10%의 노드를 제거
        out = self.drop_out(out) # bs x SL x (768 * 2)
        # The "dropped out" hidden vectors are now fed into the linear layer to output two scores
        
        # 해당 결과를 layer0에 통과
        out = self.l0(out) # bs x SL x 2
        
        # 이 결과를 바로 사용할 것은 아니기에 gelu functaion을 거치게 만듦.
        out = self.gelu(out)
        
        # layer1로 가기 전에도 똑같이 drop oup 진행
        out = self.drop_out(out)
        
        # layer1로 전달
        logits = self.l1(out)
        
        # 현재 layer1은 n개의 output을 내기때문에 이것을 분리
        # (bs x SL x n) -> (bs x SL x 1), (bs x SL x 1) ...
        outputs = list(logits.split(1, dim=-1))
        for i in range(0,len(outputs)):
            outputs[i] = outputs[i].squeeze(-1)
            

        start_logits = outputs[0] # (bs x SL)
        end_logits = outputs[1] # (bs x SL)
        class_logits = outputs[2:]
        return start_logits, end_logits, class_logits

In [16]:
def loss_fn(start_logits, end_logits, class_logits, start_positions, end_positions, class_targets):
    """
    Return the sum of the cross entropy losses for both the start and end logits
    """
    loss_fct = nn.CrossEntropyLoss()
    start_loss = loss_fct(start_logits, start_positions)
    end_loss = loss_fct(end_logits, end_positions)
    total_loss = start_loss + end_loss
        
    class_targets = class_targets.t()

    for i in range(0, len(class_list)):
        total_loss += loss_fct(class_logits[i], class_targets[i]) / len(class_list)
    return total_loss

In [17]:
def train_fn(data_loader, model, optimizer, device, scheduler=None):
    """
    Trains the bert model on the twitter data
    """
    # Set model to training mode (dropout + sampled batch norm is activated)
    model.train()
    losses = utils.AverageMeter()
    jaccards = utils.AverageMeter()

    # Set tqdm to add loading screen and set the length
    tk0 = tqdm(data_loader, total=len(data_loader))
    
    # Train the model on each batch
    for bi, d in enumerate(tk0):

        ids = d["ids"]
        token_type_ids = d["token_type_ids"]
        mask = d["mask"]
        targets_start = d["targets_start"]
        targets_end = d["targets_end"]
        sentiment = d["sentiment"]
        orig_selected = d["orig_selected"]
        orig_tweet = d["orig_tweet"][2]
        offsets = d["offsets"]
        targets_class = d["targets_class"]
        error_index = d["error_index"]

        # Move ids, masks, and targets to gpu while setting as torch.long
        ids = ids.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)
        mask = mask.to(device, dtype=torch.long)
        targets_start = targets_start.to(device, dtype=torch.long)
        targets_end = targets_end.to(device, dtype=torch.long)
        targets_class = targets_class.to(device, dtype=torch.long)
        error_index = error_index.to(device, dtype=torch.long)
        
        # Reset gradients
        model.zero_grad()
        # Use ids, masks, and token types as input to the model
        # Predict logits for each of the input tokens for each batch
        outputs_start, outputs_end, outputs_class = model(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids,
            error_index=error_index
        ) # (bs x SL), (bs x SL)
        # Calculate batch loss based on CrossEntropy
        loss = loss_fn(outputs_start, outputs_end, outputs_class, targets_start, targets_end, targets_class)
        # Calculate gradients based on loss
        loss.backward()
        # Adjust weights based on calculated gradients
        optimizer.step()
        # Update scheduler
        scheduler.step()
        
        # Apply softmax to the start and end logits
        # This squeezes each of the logits in a sequence to a value between 0 and 1, while ensuring that they sum to 1
        # This is similar to the characteristics of "probabilities"
        outputs_start = torch.softmax(outputs_start, dim=1).cpu().detach().numpy()
        outputs_end = torch.softmax(outputs_end, dim=1).cpu().detach().numpy()
        outputs_class = [torch.softmax(i, dim=1).cpu().detach().numpy() for i in outputs_class]
        
        # Calculate the jaccard score based on the predictions for this batch
        jaccard_scores = []
        for px, tweet in enumerate(orig_tweet):
            selected_tweet = orig_selected[px]
            tweet_sentiment = sentiment[px]
            ont_hot_class = [np.argmax(i[px, :]) for i in outputs_class]
            class_number = ont_hot_class.index(max(ont_hot_class))
            jaccard_score, _ = calculate_jaccard_score(
                original_tweet=tweet, # Full text of the px'th tweet in the batch
                target_string=selected_tweet, # Span containing the specified sentiment for the px'th tweet in the batch
                sentiment_val=tweet_sentiment, # Sentiment of the px'th tweet in the batch
                idx_start=np.argmax(outputs_start[px, :]), # Predicted start index for the px'th tweet in the batch
                idx_end=np.argmax(outputs_end[px, :]), # Predicted end index for the px'th tweet in the batch
                offsets=offsets[px], # Offsets for each of the tokens for the px'th tweet in the batch
                class_number=class_number
            )
            jaccard_scores.append(jaccard_score)
        # Update the jaccard score and loss
        # For details, refer to `AverageMeter` in https://www.kaggle.com/abhishek/utils
        jaccards.update(np.mean(jaccard_scores), ids.size(0))
        losses.update(loss.item(), ids.size(0))
        # Print the average loss and jaccard score at the end of each batch
        tk0.set_postfix(loss=losses.avg, jaccard=jaccards.avg)

In [18]:
def eval_fn(data_loader, model, device):
    """
    Evaluation function to predict on the test set
    """
    # Set model to evaluation mode
    # I.e., turn off dropout and set batchnorm to use overall mean and variance (from training), rather than batch level mean and variance
    # Reference: https://github.com/pytorch/pytorch/issues/5406
    model.eval()
    losses = utils.AverageMeter()
    jaccards = utils.AverageMeter()
    
    # Turns off gradient calculations (https://datascience.stackexchange.com/questions/32651/what-is-the-use-of-torch-no-grad-in-pytorch)
    with torch.no_grad():
        tk0 = tqdm(data_loader, total=len(data_loader))
        # Make predictions and calculate loss / jaccard score for each batch
        for bi, d in enumerate(tk0):
            ids = d["ids"]
            token_type_ids = d["token_type_ids"]
            mask = d["mask"]
            sentiment = d["sentiment"]
            orig_selected = d["orig_selected"]
            orig_tweet = d["orig_tweet"][2]
            targets_start = d["targets_start"]
            targets_end = d["targets_end"]
            offsets = d["offsets"].numpy()
            targets_class = d["targets_class"]
            error_index = d["error_index"]

            # Move tensors to GPU for faster matrix calculations
            ids = ids.to(device, dtype=torch.long)
            token_type_ids = token_type_ids.to(device, dtype=torch.long)
            mask = mask.to(device, dtype=torch.long)
            targets_start = targets_start.to(device, dtype=torch.long)
            targets_end = targets_end.to(device, dtype=torch.long)
            targets_class = targets_class.to(device, dtype=torch.long)
            error_index = error_index.to(device, dtype=torch.long)

            # Predict logits for start and end indexes
            outputs_start, outputs_end, outputs_class = model(
                ids=ids,
                mask=mask,
                token_type_ids=token_type_ids,
                error_index=error_index
            )
            # Calculate loss for the batch
            loss = loss_fn(outputs_start, outputs_end, outputs_class, targets_start, targets_end, targets_class)
            # Apply softmax to the predicted logits for the start and end indexes
            # This converts the "logits" to "probability-like" scores
            outputs_start = torch.softmax(outputs_start, dim=1).cpu().detach().numpy()
            outputs_end = torch.softmax(outputs_end, dim=1).cpu().detach().numpy()
            outputs_class = [torch.softmax(i, dim=1).cpu().detach().numpy() for i in outputs_class]
            
            # Calculate jaccard scores for each tweet in the batch
            jaccard_scores = []
            for px, tweet in enumerate(orig_tweet):
                selected_tweet = orig_selected[px]
                tweet_sentiment = sentiment[px]
                ont_hot_class = [np.argmax(i[px, :]) for i in outputs_class]
                class_number = ont_hot_class.index(max(ont_hot_class))
                
                jaccard_score, _ = calculate_jaccard_score(
                    original_tweet=tweet,
                    target_string=selected_tweet,
                    sentiment_val=tweet_sentiment,
                    idx_start=np.argmax(outputs_start[px, :]),
                    idx_end=np.argmax(outputs_end[px, :]),
                    offsets=offsets[px],
                    class_number=class_number
                )
                jaccard_scores.append(jaccard_score)

            # Update running jaccard score and loss
            jaccards.update(np.mean(jaccard_scores), ids.size(0))
            losses.update(loss.item(), ids.size(0))
            # Print the running average loss and jaccard score
            tk0.set_postfix(loss=losses.avg, jaccard=jaccards.avg)
    
    print(f"Jaccard = {jaccards.avg}")
    return jaccards.avg

In [19]:
def calculate_jaccard_score(
    original_tweet, 
    target_string, 
    sentiment_val, 
    idx_start, 
    idx_end, 
    offsets,
    class_number,
    verbose=False):
    """
    Calculate the jaccard score from the predicted span and the actual span for a batch of tweets
    """
    # A span's end index has to be greater than or equal to the start index
    # If this doesn't hold, the start index is set to equal the end index (the span is a single token)
    if idx_end < idx_start:
        idx_end = idx_start
    
    # Combine into a string the tokens that belong to the predicted span
    filtered_output  = ""
    for ix in range(idx_start, idx_end + 1):
        filtered_output += original_tweet[offsets[ix][0]: offsets[ix][1]]
        # If the token is not the last token in the tweet, and the ending offset of the current token is less
        # than the beginning offset of the following token, add a space.
        # Basically, add a space when the next token (word piece) corresponds to a new word
        if (ix+1) < len(offsets) and offsets[ix][1] < offsets[ix+1][0]:
            filtered_output += " "
    #print(filtered_output)
    # Set the predicted output as the original tweet when the tweet's sentiment is "neutral", or the tweet only contains one word
    if sentiment_val == 0 or len(original_tweet.split()) < 2:
        filtered_output = original_tweet
    # Calculate the jaccard score between the predicted span, and the actual span
    # The IOU (intersection over union) approach is detailed in the utils module's `jaccard` function:
    # https://www.kaggle.com/abhishek/utils
    jac = utils.jaccard(target_string.strip(), filtered_output.strip())
    if class_number != sentiment_val:
        jac *= 0.5
    return jac, filtered_output


In [20]:
train_dataset = TextDataset(training_data)

# Instantiate DataLoader with `train_dataset`
# This is a generator that yields the dataset in batches
train_data_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=config.TRAIN_BATCH_SIZE,
    num_workers=0,
    shuffle = True
)
#    shuffle = True
valid_dataset = TextDataset(test_data)

# Instantiate DataLoader with `train_dataset`
# This is a generator that yields the dataset in batches
valid_data_loader = torch.utils.data.DataLoader(
    valid_dataset,
    batch_size=config.TRAIN_BATCH_SIZE,
    num_workers=0,
    shuffle = True
)
# Set device as `cuda` (GPU)
device = torch.device("cuda")
# Load pretrained BERT (bert-base-uncased)
model_config = transformers.BertConfig.from_pretrained(config.BERT_PATH)
# Output hidden states
# This is important to set since we want to concatenate the hidden states from the last 2 BERT layers
model_config.output_hidden_states = True
# Instantiate our model with `model_config`
model = TweetModel(conf=model_config, num_labels=len(class_list))
# Move the model to the GPU
model.to(device)

# Calculate the number of training steps
num_train_steps = int(len(training_data) / config.TRAIN_BATCH_SIZE * config.EPOCHS)
# Get the list of named parameters
param_optimizer = list(model.named_parameters())
# Specify parameters where weight decay shouldn't be applied
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
# Define two sets of parameters: those with weight decay, and those without
optimizer_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
]
# Instantiate AdamW optimizer with our two sets of parameters, and a learning rate of 3e-5
optimizer = AdamW(optimizer_parameters, lr=3e-5)
# Create a scheduler to set the learning rate at each training step
# "Create a schedule with a learning rate that decreases linearly after linearly increasing during a warmup period." (https://pytorch.org/docs/stable/optim.html)
# Since num_warmup_steps = 0, the learning rate starts at 3e-5, and then linearly decreases at each training step
scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps=0, 
    num_training_steps=num_train_steps
)

# Apply early stopping with patience of 2
# This means to stop training new epochs when 2 rounds have passed without any improvement
es = utils.EarlyStopping(patience=2, mode="max")
fold = 0
print(f"Training is Starting for fold={fold}")

# I'm training only for 3 epochs even though I specified 5!!!
for epoch in range(50):
    train_fn(train_data_loader, model, optimizer, device, scheduler=scheduler)
    jaccard = eval_fn(valid_data_loader, model, device)
    print(f"Jaccard Score = {jaccard}")
    es(jaccard, model, model_path=f"model_{fold}.bin")
    if es.early_stop:
        print("Early stopping")
        break
        
del model

Training is Starting for fold=0


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=338.0), HTML(value='')))




HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=43.0), HTML(value='')))


Jaccard = 0.18818118763847602
Jaccard Score = 0.18818118763847602
Validation score improved (-inf --> 0.18818118763847602). Saving model!


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=338.0), HTML(value='')))




HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=43.0), HTML(value='')))


Jaccard = 0.19160845877746027
Jaccard Score = 0.19160845877746027
Validation score improved (0.18818118763847602 --> 0.19160845877746027). Saving model!


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=338.0), HTML(value='')))




HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=43.0), HTML(value='')))


Jaccard = 0.22075227557954677
Jaccard Score = 0.22075227557954677
Validation score improved (0.19160845877746027 --> 0.22075227557954677). Saving model!


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=338.0), HTML(value='')))




HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=43.0), HTML(value='')))


Jaccard = 0.2258219470323543
Jaccard Score = 0.2258219470323543
Validation score improved (0.22075227557954677 --> 0.2258219470323543). Saving model!


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=338.0), HTML(value='')))




HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=43.0), HTML(value='')))


Jaccard = 0.2141446398590957
Jaccard Score = 0.2141446398590957
EarlyStopping counter: 1 out of 2


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=338.0), HTML(value='')))




HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=43.0), HTML(value='')))


Jaccard = 0.2141446398590957
Jaccard Score = 0.2141446398590957
EarlyStopping counter: 2 out of 2
Early stopping


In [21]:
device = torch.device("cuda")
model_config = transformers.BertConfig.from_pretrained(config.BERT_PATH)
model_config.output_hidden_states = True

model1 = TweetModel(conf=model_config, num_labels=len(class_list))
model1.to(device)
model1.load_state_dict(torch.load("model_0.bin"))
model1.eval()

TweetModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
 

In [22]:
final_output = []

# Instantiate TweetDataset with the test data
test_dataset = TextDataset(test_data)

# Instantiate DataLoader with `test_dataset`
data_loader = torch.utils.data.DataLoader(
    test_dataset,
    shuffle=False,
    batch_size=config.VALID_BATCH_SIZE,
    num_workers=0
)

TP = 0
TN = 0
FP = 0
FN = 0
# Turn of gradient calculations
with torch.no_grad():
    tk0 = tqdm(data_loader, total=len(data_loader))
    # Predict the span containing the sentiment for each batch
    for bi, d in enumerate(tk0):
        ids = d["ids"]
        token_type_ids = d["token_type_ids"]
        mask = d["mask"]
        sentiment = d["sentiment"]
        orig_selected = d["orig_selected"]
        orig_tweet = d["orig_tweet"][2]
        targets_start = d["targets_start"]
        targets_end = d["targets_end"]
        offsets = d["offsets"].numpy()
        error_index = d["error_index"]

        ids = ids.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)
        mask = mask.to(device, dtype=torch.long)
        targets_start = targets_start.to(device, dtype=torch.long)
        targets_end = targets_end.to(device, dtype=torch.long)
        error_index = error_index.to(device, dtype=torch.long)

        # Predict start and end logits for each of the five models
        outputs_start, outputs_end, outputs_class = model1(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids,
            error_index=error_index
        )
        
        # Apply softmax to the predicted start and end logits
        outputs_start = torch.softmax(outputs_start, dim=1).cpu().detach().numpy()
        outputs_end = torch.softmax(outputs_end, dim=1).cpu().detach().numpy()
        outputs_class = [torch.softmax(i, dim=1).cpu().detach().numpy() for i in outputs_class]
        # Convert the start and end scores to actual predicted spans (in string form)
        for px, tweet in enumerate(orig_tweet):
            selected_tweet = orig_selected[px]
            tweet_sentiment = sentiment[px]
            ont_hot_class = [np.argmax(i[px, :]) for i in outputs_class]
            class_number = ont_hot_class.index(max(ont_hot_class))
            if tweet_sentiment == class_number:
                if class_number == 0:
                    TN += 1
                else:
                    TP += 1
            else:
                if class_number == 0:
                    FN += 1 # 원래는 양성인데 음성으로 예측
                else:
                    FP += 1
                    
            _, output_sentence = calculate_jaccard_score(
                original_tweet=tweet,
                target_string=selected_tweet,
                sentiment_val=tweet_sentiment,
                idx_start=np.argmax(outputs_start[px, :]),
                idx_end=np.argmax(outputs_end[px, :]),
                offsets=offsets[px],
                class_number = class_number
            )
            final_output.append([np.argmax(outputs_start[px, :]), np.argmax(outputs_end[px, :]), output_sentence, class_number])

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=85.0), HTML(value='')))




In [23]:
print("정확도: ", (TP + TN)  / (TN+TP+FN+FP))
print("Precision: ", (TP)  / (TP+FP), " (특정 오류가 있다고 예측한 것 중 실제 그럴 확률)")
print("Recall: ", (TP)  / (TP+FN), " (실제 오류중에 제대로 검출할 확률)")

print("문서 개수: ", (TN+TP+FN+FP))

print("TP: ", (TP))
print("TN: ", (TN))
print("FN: ", (FN))
print("FP: ", (FP))

정확도:  0.985207100591716
Precision:  0.9504950495049505  (특정 오류가 있다고 예측한 것 중 실제 그럴 확률)
Recall:  1.0  (실제 오류중에 제대로 검출할 확률)
문서 개수:  338
TP:  96
TN:  237
FN:  0
FP:  5


In [24]:
import json
for i in range(0, len(test_dataset)):
    if test_dataset[i]['sentiment'] !=0 and test_dataset[i]['sentiment'] != 0:
        temp = {
            "tw": test_dataset[i]['orig_tweet'][2],
            "keyword": test_dataset[i]['orig_selected'],
            "predicted_keyword": (final_output[i][2] if final_output[i][3] != 0 else ""),
            "class": test_dataset[i]['sentiment'],
            "predicted_class": final_output[i][3]
        }
        print(json.dumps(temp, indent=4))

{
    "tw": "Shortly after, the Republic of Korea stationed marine police personnel on Takeshima and has continued to occupy the islands illegally up to the present day.",
    "keyword": "Takeshima and has continued to occupy the islands illegally",
    "predicted_keyword": "Republic of Korea stationed marine police personnel on Takeshima ",
    "class": 3,
    "predicted_class": 3
}
{
    "tw": "Submarine topography of the Sea of Japan This is due to the particular characteristics of the Sea of Japan.",
    "keyword": "Sea of Japan This is due to the particular characteristics of the Sea of Japan",
    "predicted_keyword": "Sea of Japan This is due to the particular characteristics of the Sea of Japan",
    "class": 2,
    "predicted_class": 2
}
{
    "tw": "With regard to Takeshima, the Japanese government asserted: In the Proclamation, the Republic of Korea appears to assert territorial rights over the islets in the Sea of Japan known as Takeshima.",
    "keyword": "Takeshima",
    