# Seunghwa: Detecting Notation Errors in Documents

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model developed by Google. It is trained in advance on large text corpora such as Wikipedia. Because BERT takes the surrounding context into account when it learns each word, the resulting model captures patterns of language use.

Transfer learning builds on this: the pre-trained model is fine-tuned on a new task, so training is fast even with little data.

We therefore used BERT as the basis for an algorithm that detects notation errors in documents.


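## Goal

Websites and text documents consist of many sentences. The final goal of our model is to check whether each sentence contains an error and, if it does, to predict exactly where the incorrect expression is.

To do this, we plan to use a network that adds a linear layer on top of BERT. The [CLS] token of the output is used to classify whether the sentence contains a notation error, and the linear layer predicts the position of the error.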

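Some sentences contain an incorrect name without actually being wrong.

<pre>
Next to Korea there is a small island, which is sometimes called Takeshima.
However, this is incorrect, and the correct name is Dokdo.
</pre>

If errors are detected one sentence at a time, the model will flag "Takeshima" as a notation error.
In context, however, the sentence only describes an incorrect usage, so it cannot be called an erroneous sentence.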

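We therefore designed a model in which BERT is trained on the surrounding sentences as well.
**Training example**
<pre>
[index-2] None
[index-1] None
[target sentence] Next to Korea there is a small island, which is sometimes called Takeshima.
[index+1] However, this is incorrect, and the correct name is Dokdo.
[index+2]
</pre>

Sliding this sentence window over the document during training should produce a model that takes context into account.
The model is trained to check for notation errors only in the [target sentence], which keeps errors in the surrounding sentences from confusing it.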
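
The sliding window described above can be sketched as follows. This is a minimal illustration rather than the notebook's exact preprocessing; the function name `make_windows` and the use of empty strings as padding are assumptions made for the example.

```python
def make_windows(sentences, context=2):
    """Return one window per sentence: `context` sentences before, the target, `context` after."""
    # Pad both ends so the first and last sentences also get full-width windows
    padded = [""] * context + list(sentences) + [""] * context
    windows = []
    for i in range(context, len(padded) - context):
        windows.append(padded[i - context : i + context + 1])
    return windows


doc = [
    "Next to Korea there is a small island, which is sometimes called Takeshima.",
    "However, this is incorrect, and the correct name is Dokdo.",
]
for window in make_windows(doc):
    # The target sentence is always the middle element, window[2]
    print(window)
```

Each window has five elements, and only the middle one is the sentence being judged; the others supply context.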

## Loading the data
The available training data is fetched from the server through an API.

The API returns data in the form [ {no, contents, errors[code,keyword] } , ...].
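
As a rough illustration of that shape, the snippet below builds one hypothetical record and checks its fields. The field values are invented for the example, and the `sentence_no` field is an assumption taken from how the preprocessing code later reads each error entry; the real API response may contain more fields.

```python
# Hypothetical record in the shape described above (all values are made up)
sample_record = {
    "no": 1,
    "contents": "First line of the document\nSecond line of the document",
    "errors": [
        {"code": 100, "keyword": "Second line", "sentence_no": 1},
    ],
}


def check_record(record):
    """Return True if the record has the fields the preprocessing step relies on."""
    if not {"no", "contents", "errors"} <= set(record):
        return False
    return all({"code", "keyword"} <= set(error) for error in record["errors"])


print(check_record(sample_record))
```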

In [1]:
import requests
class APIDokdo:
    def __init__(self, apikey):
        self.apiurl = "https://api.easylab.kr"
        self.headers = {'authorization': apikey}
    def getTrainingData(self):
        return requests.get(self.apiurl + "/deeplearning/data/sentences", headers=self.headers).json()['list']

In [2]:
original_data = APIDokdo("godapikey12").getTrainingData()
print("Number of documents loaded: ", len(original_data))

Number of documents loaded:  11


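### Preprocessing the data
The loaded data is split into sentences, and each sentence is searched for notation errors. To take context into account, sentences are handled in windows of five (0-4, 1-5, 2-6, 3-7 ...).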
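
The blank-line cleanup performed in the following cell can be written more compactly. The sketch below is one way to do it, assuming (as the cell does) that a blank line is dropped whenever the previous line is also blank or it is the first line; `collapse_blank_lines` is a name chosen for this example.

```python
def collapse_blank_lines(lines):
    """Strip each line and drop blanks that start the text or follow another blank."""
    result = []
    previous = ""  # treating the start as blank removes leading blank lines too
    for line in (raw.strip() for raw in lines):
        if not (line == "" and previous == ""):
            result.append(line)
        previous = line
    return result


print(collapse_blank_lines(["", "Title", "", "", "  Body text ", ""]))
```

Building a new list avoids deleting from the list while tracking indexes, which is the part of the original loop that is easiest to get wrong.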

In [3]:
# from nltk import tokenize  # only needed for the sentence-tokenizer alternative below
training_data = []
for doc in original_data:
    # Alternative: split with NLTK's sentence tokenizer instead of on newlines
    # text = doc['contents'].replace("\r","").replace("\n","")
    # sentences = [s.strip() for s in tokenize.sent_tokenize(text)]
    # Split on newlines and strip whitespace around each sentence
    sentences = [s.strip() for s in doc['contents'].split('\n')]

    # Drop a blank line when the previous line is also blank (or when it is the first line)
    last = ""
    remove_indexes = []
    for j in range(0, len(sentences)):
        if last == "" and sentences[j] == "":
            remove_indexes.append(j)
        last = sentences[j]

    for index in sorted(remove_indexes, reverse=True):
        del sentences[index]

    # Search each sentence for error keywords and build 2 + 1 + 2 sentence windows
    # Pad with two empty sentences on each side
    sentences = ['', ''] + sentences + ['', '']
    for index in range(2, len(sentences) - 2):
        # Skip empty sentences
        if (len(sentences[index]) == 0): continue

        y_class = 11510  # class code used throughout this notebook for "no error"
        y_keyword = ""
        # Check whether the current sentence contains a notation error
        for error in doc['errors']:
            if (error['sentence_no'] != index - 2):
                continue
            print("Error", error['keyword'])
            if sentences[index].find(error['keyword']) != -1:
                y_class = error['code']
                y_keyword = error['keyword']

        training_data.append([[sentences[j] for j in range(index-2, index+2 + 1)], y_class, y_keyword])

Error 2002 Japan-Korea
Error Seas of Japan
Error Seas of Japan
Error Sea of Japan
Error Sea of Japan
Error Sea of Japan
Error Sea of Japan
Error Sea of Japan
Error Sea of Japan
Error Sea of Japan
Error Sea of Japan
Error Sea Of Japan
Error Sea of Japan
Error Sea of Japan
Error Sea of Japan
Error Sea of Japan (Japan Sea)
Error Sea of Japan (Japan Sea)
Error Sea of Japan (Japan Sea) is the only internationally established name
Error Sea of Japan (Japan Sea) became recognized internationally by the early 19th Century
Error Sea of Japan (Japan Sea)
Error use the name - Sea of Japan (Japan Sea) for making their nautical chart
Error Sea of Japan (Japan Sea) is established in the guidelines on names of seas, published by the International Hydrographic Organization (IHO)
Error Sea of Japan (Japan Sea) established by "Limits of Oceans and Seas"
Error IHO has consistently used Sea of Japan (Japan Sea)
Error If "East Sea" were used alongside the name - Sea of Japan (Japan Sea) in "Limits of Ocean

In [4]:
# Save the processed data separately
import json
with open("data/training_data.txt", 'w') as outfile:
    json.dump(training_data, outfile)
training_data[0:2]

[[['',
   '',
   'World Cup>Past Tournaments>2002 Japan-Korea>Overview',
   '',
   'The 2002 world cup was held in South Korea and Japan and certainly did not disappoint.'],
  100,
  '2002 Japan-Korea'],
 [['World Cup>Past Tournaments>2002 Japan-Korea>Overview',
   '',
   'The 2002 world cup was held in South Korea and Japan and certainly did not disappoint.',
   'The opening group stage consisted of 32 teams.',
   'Group E was considered the ‘group of death’ including favourites Argentina, England, Sweden and Nigeria.'],
  11510,
  '']]

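### Preprocessing the data, part 2
The processed data must be converted into a form that BERT accepts. Natural-language text cannot be fed to the model directly, so it is tokenized with the vocabulary of the pre-trained BERT model.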
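
To make the idea of vocabulary-based tokenization concrete, here is a simplified sketch of the greedy longest-match-first splitting that WordPiece tokenizers apply to a single word. The tiny vocabulary and the function name `wordpiece_split` are assumptions for illustration; the real tokenizer also handles normalization, punctuation, and special tokens.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into sub-word pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"david", "beck", "##ham", "penalty"}  # made-up vocabulary
print(wordpiece_split("beckham", toy_vocab))
```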

In [5]:
import tokenizers
tokenizer = tokenizers.BertWordPieceTokenizer(
    "data/bert-base-uncased/vocab.txt",
    lowercase=True
)
tok_tweet = tokenizer.encode("England drew their opening match against Sweden and followed that up with a glorious 1-0 victory over Argentina thanks to a David Beckham penalty in the first half.")
print("Tokens: " + str(tok_tweet.tokens))
print("Token ids: " + str(tok_tweet.ids))
print("offsets: " + str(tok_tweet.offsets))
print("attention_mask: " + str(tok_tweet.attention_mask))
print("special_tokens_mask: " + str(tok_tweet.special_tokens_mask))
print("overflowing: " + str(tok_tweet.overflowing))

Tokens: ['[CLS]', 'england', 'drew', 'their', 'opening', 'match', 'against', 'sweden', 'and', 'followed', 'that', 'up', 'with', 'a', 'glorious', '1', '-', '0', 'victory', 'over', 'argentina', 'thanks', 'to', 'a', 'david', 'beck', '##ham', 'penalty', 'in', 'the', 'first', 'half', '.', '[SEP]']
Token ids: [101, 2563, 3881, 2037, 3098, 2674, 2114, 4701, 1998, 2628, 2008, 2039, 2007, 1037, 14013, 1015, 1011, 1014, 3377, 2058, 5619, 4283, 2000, 1037, 2585, 10272, 3511, 6531, 1999, 1996, 2034, 2431, 1012, 102]
offsets: [(0, 0), (0, 7), (8, 12), (13, 18), (19, 26), (27, 32), (33, 40), (41, 47), (48, 51), (52, 60), (61, 65), (66, 68), (69, 73), (74, 75), (76, 84), (85, 86), (86, 87), (87, 88), (89, 96), (97, 101), (102, 111), (112, 118), (119, 121), (122, 123), (124, 129), (130, 134), (134, 137), (138, 145), (146, 148), (149, 152), (153, 158), (159, 163), (163, 164), (0, 0)]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [6]:
def preprocessing(text, y_class, y_keyword):
    if (len(text[1]) > 5):
        before = text[1]
    else:
        before = text[0] + " " + text[1]
    main = text[2]
    
    if (len(text[3]) > 5):
        after = text[3]
    else:
        after = text[3] + " " + text[4]
    
    before = tokenizer.encode(before)
    main = tokenizer.encode(main)
    after = tokenizer.encode(after)


    # Find where the keyword sits in the target sentence, measured in token indices
    keyword_position_in_token = -1
    keyword_end_in_token = -1
    if y_class != 11510:
        keyword_position_in_string = text[2].find(y_keyword)
        keyword_length_in_string = len(y_keyword) # keyword length in characters
        for j in range(len(main.offsets)):
            if keyword_position_in_token == -1 and main.offsets[j][0] >= keyword_position_in_string:
                keyword_position_in_token = j
            if main.offsets[j][1] == 0: continue
            if main.offsets[j][1] <= (keyword_position_in_string + keyword_length_in_string):
                keyword_end_in_token = j
    else:        
        keyword_position_in_token = 0
        keyword_end_in_token = 0
        
    # ids = cls, classification number, sep, token, sep
    ids = [101, y_class, 102] + before.ids[1:] + main.ids[1:] + after.ids[1:]
    # mask = len(cls, classification number, sep, token, sep) = 1, else 0
    mask = [1] * len(ids)
    # token_type_ids len(token, sep) = 1, else 0
    token_type_ids = [0,0,0] + [1] * (len(ids) - 3)

    targets_start = keyword_position_in_token
    targets_end = keyword_end_in_token

    # offsets based on ids, token offsets (0,0)(0,0)(0,0)(0,a)...(0,0)
    offsets = main.offsets
    
    # Pad sequence if its length < `max_len`
    padding_length = 350 - len(ids)
    if padding_length > 0:
        ids = ids + ([0] * padding_length)
        mask = mask + ([0] * padding_length)
        token_type_ids = token_type_ids + ([0] * padding_length)
        padding_length = 350 - len(offsets)
        offsets = offsets + ([(0, 0)] * padding_length)
        
        
    return {
            'ids': ids,
            'mask': mask, 
            'token_type_ids': token_type_ids,
            'targets_start': targets_start, 
            'targets_end': targets_end, 
            'orig_text': text,
            'orig_keyword': y_keyword,
            'class': y_class,
            'offsets': offsets ,
            'main_offsets': main.offsets
    }

In [7]:
i = 0
a = preprocessing(training_data[i][0], training_data[i][1], training_data[i][2])
a['targets_end']

10
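
The keyword-to-token mapping inside `preprocessing` can be illustrated in isolation. The sketch below takes a simpler route than the function above: it marks every token whose character range overlaps the keyword. The offsets are written by hand for a short phrase rather than produced by the tokenizer, and `char_span_to_token_span` is a hypothetical helper name.

```python
def char_span_to_token_span(offsets, char_start, char_end):
    """Return (first, last) token indices overlapping [char_start, char_end), or (-1, -1)."""
    hits = [
        index
        for index, (tok_start, tok_end) in enumerate(offsets)
        # Special tokens have the empty range (0, 0) and never overlap anything
        if tok_start < tok_end and tok_start < char_end and tok_end > char_start
    ]
    if not hits:
        return -1, -1
    return hits[0], hits[-1]


text = "a David Beckham penalty"
# Hand-written offsets for: [CLS] a david beck ##ham penalty [SEP]
offsets = [(0, 0), (0, 1), (2, 7), (8, 12), (12, 15), (16, 23), (0, 0)]

keyword = "David Beckham"
start = text.find(keyword)
print(char_span_to_token_span(offsets, start, start + len(keyword)))
```

The returned pair plays the role of `targets_start` and `targets_end`: the indices of the first and last tokens that cover the keyword.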

## Creating a dataset class for PyTorch
This class is the interface PyTorch uses to load the data.

In [8]:
class TextDataset:
    """
    Dataset which stores the sentence windows and returns them as processed features
    """
    def __init__(self, dataset):
        self.dataset = dataset
    
    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        data = preprocessing(
            self.dataset[item][0], 
            self.dataset[item][1], 
            self.dataset[item][2],
        )
        
        # Return the processed data where the lists are converted to `torch.tensor`s
        return {
            'ids': torch.tensor(data["ids"], dtype=torch.long),
            'mask': torch.tensor(data["mask"], dtype=torch.long),
            'token_type_ids': torch.tensor(data["token_type_ids"], dtype=torch.long),
            'targets_start': torch.tensor(data["targets_start"], dtype=torch.long),
            'targets_end': torch.tensor(data["targets_end"], dtype=torch.long),
            'orig_tweet': data["orig_text"],
            'orig_selected': data["orig_keyword"],
            'sentiment': data["class"],
            'offsets': torch.tensor(data["offsets"], dtype=torch.long)
        }


## Building the PyTorch model

In [9]:
import torch
import torch.nn as nn
import transformers
class TweetModel(transformers.BertPreTrainedModel):
    """
    Model class that combines a pretrained bert model with a linear layer
    """
    def __init__(self, conf):
        super(TweetModel, self).__init__(conf)
        # Load the pretrained BERT model.
        self.bert = transformers.BertModel.from_pretrained(config.BERT_PATH, config=conf)
        
        # Set 10% dropout to be applied to the BERT backbone's output
        # Dropout zeroes hidden units with the given probability during training,
        # so the loss and gradients for that step are computed from the remaining units only.
        # This helps prevent overfitting by reducing reliance on any single unit.
        self.drop_out = nn.Dropout(0.1)
        # 768 is the dimensionality of bert-base-uncased's hidden representations
        # Multiplied by 2 since the forward pass concatenates the last two hidden representation layers
        # The output will have two dimensions ("start_logits", and "end_logits")
        self.l0 = nn.Linear(768 * 2, 2)
        torch.nn.init.normal_(self.l0.weight, std=0.02)
    
    def forward(self, ids, mask, token_type_ids):
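        # Return the hidden states from the BERT backbone
        # Note: unpacking three values assumes BertModel returns a plain tuple
        # (sequence output, pooled output, hidden states); newer transformers releases
        # that return a ModelOutput object by default may need `return_dict=False` here
        _, _, out = self.bert(
            ids,
            attention_mask=mask,
            token_type_ids=token_type_ids
        ) # bert_layers x bs x SL x (768)

        # Concatenate the last two hidden states
        # This is done since experiments have shown that just getting the last layer
        # gives out vectors that may be too tailored to the original BERT training objectives (MLM + NSP)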
        # Sample explanation: https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-the-last-hidden-layer-why-second-to-last
        out = torch.cat((out[-1], out[-2]), dim=-1) # bs x SL x (768 * 2)
        # Apply 10% dropout to the last 2 hidden states
        out = self.drop_out(out) # bs x SL x (768 * 2)
        # The "dropped out" hidden vectors are now fed into the linear layer to output two scores
        logits = self.l0(out) # bs x SL x 2
        
        # Splits the tensor into start_logits and end_logits
        # (bs x SL x 2) -> (bs x SL x 1), (bs x SL x 1)
        start_logits, end_logits = logits.split(1, dim=-1)

        start_logits = start_logits.squeeze(-1) # (bs x SL)
        end_logits = end_logits.squeeze(-1) # (bs x SL)

        return start_logits, end_logits

In [10]:
def loss_fn(start_logits, end_logits, start_positions, end_positions):
    """
    Return the sum of the cross entropy losses for both the start and end logits
    """
    loss_fct = nn.CrossEntropyLoss()
    start_loss = loss_fct(start_logits, start_positions)
    end_loss = loss_fct(end_logits, end_positions)
    total_loss = (start_loss + end_loss)
    return total_loss
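
For intuition, the cross-entropy term that `loss_fn` computes can be written out by hand for a single example. The snippet below is a plain-Python sketch for one sequence of position logits; it does not use PyTorch, the logit values are made up, and the function name is an assumption. `nn.CrossEntropyLoss` additionally averages this quantity over the batch by default.

```python
import math


def position_cross_entropy(logits, target_index):
    """Cross entropy of a softmax over token positions against one true position."""
    # Subtract the maximum before exponentiating for numerical stability
    highest = max(logits)
    log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in logits))
    # Equivalent to -log(softmax(logits)[target_index])
    return log_sum_exp - logits[target_index]


start_logits = [0.1, 2.5, 0.3, -1.0]  # one score per token position (made-up values)
end_logits = [0.0, 0.2, 3.1, -0.5]
total = position_cross_entropy(start_logits, 1) + position_cross_entropy(end_logits, 2)
print(total)
```

The loss is small when the highest logit sits on the true start (or end) position and grows as probability mass moves to other positions.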

In [11]:
def train_fn(data_loader, model, optimizer, device, scheduler=None):
    """
    Trains the bert model on the twitter data
    """
    # Set model to training mode (dropout + sampled batch norm is activated)
    model.train()
    losses = utils.AverageMeter()
    jaccards = utils.AverageMeter()

    # Set tqdm to add loading screen and set the length
    tk0 = tqdm(data_loader, total=len(data_loader))
    
    # Train the model on each batch
    for bi, d in enumerate(tk0):

        ids = d["ids"]
        token_type_ids = d["token_type_ids"]
        mask = d["mask"]
        targets_start = d["targets_start"]
        targets_end = d["targets_end"]
        sentiment = d["sentiment"]
        orig_selected = d["orig_selected"]
        orig_tweet = d["orig_tweet"][2]
        offsets = d["offsets"]

        # Move ids, masks, and targets to gpu while setting as torch.long
        ids = ids.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)
        mask = mask.to(device, dtype=torch.long)
        targets_start = targets_start.to(device, dtype=torch.long)
        targets_end = targets_end.to(device, dtype=torch.long)

        # Reset gradients
        model.zero_grad()
        # Use ids, masks, and token types as input to the model
        # Predict logits for each of the input tokens for each batch
        outputs_start, outputs_end = model(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids,
        ) # (bs x SL), (bs x SL)
        # Calculate batch loss based on CrossEntropy
        loss = loss_fn(outputs_start, outputs_end, targets_start, targets_end)
        # Calculate gradients based on loss
        loss.backward()
        # Adjust weights based on calculated gradients
        optimizer.step()
        # Update scheduler (optional, since `scheduler` defaults to None)
        if scheduler is not None:
            scheduler.step()
        
        # Apply softmax to the start and end logits
        # This squeezes each of the logits in a sequence to a value between 0 and 1, while ensuring that they sum to 1
        # This is similar to the characteristics of "probabilities"
        outputs_start = torch.softmax(outputs_start, dim=1).cpu().detach().numpy()
        outputs_end = torch.softmax(outputs_end, dim=1).cpu().detach().numpy()
        
        # Calculate the jaccard score based on the predictions for this batch
        jaccard_scores = []
        for px, tweet in enumerate(orig_tweet):
            selected_tweet = orig_selected[px]
            tweet_sentiment = sentiment[px]
            jaccard_score, _ = calculate_jaccard_score(
                original_tweet=tweet, # Full text of the px'th tweet in the batch
                target_string=selected_tweet, # Span containing the specified sentiment for the px'th tweet in the batch
                sentiment_val=tweet_sentiment, # Sentiment of the px'th tweet in the batch
                idx_start=np.argmax(outputs_start[px, :]), # Predicted start index for the px'th tweet in the batch
                idx_end=np.argmax(outputs_end[px, :]), # Predicted end index for the px'th tweet in the batch
                offsets=offsets[px] # Offsets for each of the tokens for the px'th tweet in the batch
            )
            jaccard_scores.append(jaccard_score)
        # Update the jaccard score and loss
        # For details, refer to `AverageMeter` in https://www.kaggle.com/abhishek/utils
        jaccards.update(np.mean(jaccard_scores), ids.size(0))
        losses.update(loss.item(), ids.size(0))
        # Print the average loss and jaccard score at the end of each batch
        tk0.set_postfix(loss=losses.avg, jaccard=jaccards.avg)

In [12]:
def eval_fn(data_loader, model, device):
    """
    Evaluation function to predict on the test set
    """
    # Set model to evaluation mode
    # I.e., turn off dropout and set batchnorm to use overall mean and variance (from training), rather than batch level mean and variance
    # Reference: https://github.com/pytorch/pytorch/issues/5406
    model.eval()
    losses = utils.AverageMeter()
    jaccards = utils.AverageMeter()
    
    # Turns off gradient calculations (https://datascience.stackexchange.com/questions/32651/what-is-the-use-of-torch-no-grad-in-pytorch)
    with torch.no_grad():
        tk0 = tqdm(data_loader, total=len(data_loader))
        # Make predictions and calculate loss / jaccard score for each batch
        for bi, d in enumerate(tk0):
            ids = d["ids"]
            token_type_ids = d["token_type_ids"]
            mask = d["mask"]
            sentiment = d["sentiment"]
            orig_selected = d["orig_selected"]
            orig_tweet = d["orig_tweet"][2]
            targets_start = d["targets_start"]
            targets_end = d["targets_end"]
            offsets = d["offsets"].numpy()

            # Move tensors to GPU for faster matrix calculations
            ids = ids.to(device, dtype=torch.long)
            token_type_ids = token_type_ids.to(device, dtype=torch.long)
            mask = mask.to(device, dtype=torch.long)
            targets_start = targets_start.to(device, dtype=torch.long)
            targets_end = targets_end.to(device, dtype=torch.long)

            # Predict logits for start and end indexes
            outputs_start, outputs_end = model(
                ids=ids,
                mask=mask,
                token_type_ids=token_type_ids
            )
            # Calculate loss for the batch
            loss = loss_fn(outputs_start, outputs_end, targets_start, targets_end)
            # Apply softmax to the predicted logits for the start and end indexes
            # This converts the "logits" to "probability-like" scores
            outputs_start = torch.softmax(outputs_start, dim=1).cpu().detach().numpy()
            outputs_end = torch.softmax(outputs_end, dim=1).cpu().detach().numpy()
            # Calculate jaccard scores for each tweet in the batch
            jaccard_scores = []
            for px, tweet in enumerate(orig_tweet):
                selected_tweet = orig_selected[px]
                tweet_sentiment = sentiment[px]
                jaccard_score, _ = calculate_jaccard_score(
                    original_tweet=tweet,
                    target_string=selected_tweet,
                    sentiment_val=tweet_sentiment,
                    idx_start=np.argmax(outputs_start[px, :]),
                    idx_end=np.argmax(outputs_end[px, :]),
                    offsets=offsets[px]
                )
                jaccard_scores.append(jaccard_score)

            # Update running jaccard score and loss
            jaccards.update(np.mean(jaccard_scores), ids.size(0))
            losses.update(loss.item(), ids.size(0))
            # Print the running average loss and jaccard score
            tk0.set_postfix(loss=losses.avg, jaccard=jaccards.avg)
    
    print(f"Jaccard = {jaccards.avg}")
    return jaccards.avg

In [13]:
import os
import torch
import pandas as pd
import torch.nn as nn
import numpy as np
import torch.nn.functional as F
from torch.optim import lr_scheduler

from sklearn import model_selection
from sklearn import metrics
import transformers
import tokenizers
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup
from tqdm.autonotebook import tqdm
import utils

In [14]:
def calculate_jaccard_score(
    original_tweet, 
    target_string, 
    sentiment_val, 
    idx_start, 
    idx_end, 
    offsets,
    verbose=False):
    """
    Calculate the jaccard score from the predicted span and the actual span for a batch of tweets
    """
    # A span's end index has to be greater than or equal to the start index
    # If this doesn't hold, the end index is set to equal the start index (the span is a single token)
    if idx_end < idx_start:
        idx_end = idx_start
    
    # Combine into a string the tokens that belong to the predicted span
    filtered_output  = ""
    for ix in range(idx_start, idx_end + 1):
        filtered_output += original_tweet[offsets[ix][0]: offsets[ix][1]]
        # If the token is not the last token in the tweet, and the ending offset of the current token is less
        # than the beginning offset of the following token, add a space.
        # Basically, add a space when the next token (word piece) corresponds to a new word
        if (ix+1) < len(offsets) and offsets[ix][1] < offsets[ix+1][0]:
            filtered_output += " "
    #print(filtered_output)
    # Use the whole sentence as the predicted output when the class is 11510 (no error), or the sentence only contains one word
    if sentiment_val == 11510 or len(original_tweet.split()) < 2:
        filtered_output = original_tweet
    # Calculate the jaccard score between the predicted span, and the actual span
    # The IOU (intersection over union) approach is detailed in the utils module's `jaccard` function:
    # https://www.kaggle.com/abhishek/utils
    jac = utils.jaccard(target_string.strip(), filtered_output.strip())
    return jac, filtered_output
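
The score comes from `utils.jaccard`, which is not shown in this notebook. A word-level Jaccard similarity is commonly computed as below; this is a sketch of the general technique, not necessarily the exact code in `utils`, and the handling of two empty strings is an assumption.

```python
def word_jaccard(text_a, text_b):
    """Intersection over union of the lowercase word sets of two strings."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        return 1.0  # assumption: two empty strings count as identical
    return len(words_a & words_b) / len(union)


print(word_jaccard("Sea Of Japan", "Of Japan"))
```

Because the comparison is made on word sets, trailing spaces and letter case do not affect the score, but a prediction that misses one of three words is already penalized by a third.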


In [15]:
class config:
    MAX_LEN = 128
    TRAIN_BATCH_SIZE = 2
    VALID_BATCH_SIZE = 1
    EPOCHS = 5
    BERT_PATH = "../input/bert-base-uncased/"
    MODEL_PATH = "model.bin"
    TRAINING_FILE = "../input/tweet-train-folds/train_folds.csv"
    TOKENIZER = tokenizers.BertWordPieceTokenizer(
        f"{BERT_PATH}/vocab.txt", 
        lowercase=True
    )
    
train_dataset = TextDataset(training_data)

# Instantiate DataLoader with `train_dataset`
# This is a generator that yields the dataset in batches
train_data_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=config.TRAIN_BATCH_SIZE,
    num_workers=0
)

valid_dataset = TextDataset(training_data)

# Instantiate DataLoader with `valid_dataset`
# Note: validation reuses the training data, so the Jaccard score is not a held-out estimate
valid_data_loader = torch.utils.data.DataLoader(
    valid_dataset,
    batch_size=config.TRAIN_BATCH_SIZE,
    num_workers=0
)
# Set device as `cuda` (GPU)
device = torch.device("cuda")
# Load pretrained BERT (bert-base-uncased)
model_config = transformers.BertConfig.from_pretrained(config.BERT_PATH)
# Output hidden states
# This is important to set since we want to concatenate the hidden states from the last 2 BERT layers
model_config.output_hidden_states = True
# Instantiate our model with `model_config`
model = TweetModel(conf=model_config)
# Move the model to the GPU
model.to(device)

# Calculate the number of training steps
num_train_steps = int(len(training_data) / config.TRAIN_BATCH_SIZE * config.EPOCHS)
# Get the list of named parameters
param_optimizer = list(model.named_parameters())
# Specify parameters where weight decay shouldn't be applied
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
# Define two sets of parameters: those with weight decay, and those without
optimizer_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
]
# Instantiate AdamW optimizer with our two sets of parameters, and a learning rate of 3e-5
optimizer = AdamW(optimizer_parameters, lr=3e-5)
# Create a scheduler to set the learning rate at each training step
# "Create a schedule with a learning rate that decreases linearly after linearly increasing during a warmup period." (https://pytorch.org/docs/stable/optim.html)
# Since num_warmup_steps = 0, the learning rate starts at 3e-5, and then linearly decreases at each training step
scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps=0, 
    num_training_steps=num_train_steps
)

# Apply early stopping with patience of 2
# This means to stop training new epochs when 2 rounds have passed without any improvement
es = utils.EarlyStopping(patience=2, mode="max")
fold = 0
print(f"Training is Starting for fold={fold}")

# Train for up to 10 epochs. The schedule above is sized for config.EPOCHS (5), so the learning rate reaches zero after the fifth epoch
for epoch in range(10):
    train_fn(train_data_loader, model, optimizer, device, scheduler=scheduler)
    jaccard = eval_fn(valid_data_loader, model, device)
    print(f"Jaccard Score = {jaccard}")
    es(jaccard, model, model_path=f"model_{fold}.bin")
    if es.early_stop:
        print("Early stopping")
        break
        
del model

Training is Starting for fold=0


Jaccard = 0.07563075504492421
Jaccard Score = 0.07563075504492421
Validation score improved (-inf --> 0.07563075504492421). Saving model!
Jaccard = 0.10607780022408507
Jaccard Score = 0.10607780022408507
Validation score improved (0.07563075504492421 --> 0.10607780022408507). Saving model!
Jaccard = 0.14321486935479583
Jaccard Score = 0.14321486935479583
Validation score improved (0.10607780022408507 --> 0.14321486935479583). Saving model!
Jaccard = 0.19612623572291762
Jaccard Score = 0.19612623572291762
Validation score improved (0.14321486935479583 --> 0.19612623572291762). Saving model!
Jaccard = 0.2055464160351133
Jaccard Score = 0.2055464160351133
Validation score improved (0.19612623572291762 --> 0.2055464160351133). Saving model!
Jaccard = 0.2055464160351133
Jaccard Score = 0.2055464160351133
EarlyStopping counter: 1 out of 2
Jaccard = 0.2055464160351133
Jaccard Score = 0.2055464160351133
EarlyStopping counter: 2 out of 2
Early stopping
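
The score stops changing after the fifth epoch in the log above. One possible explanation, offered here as a hypothesis rather than a confirmed diagnosis, is the learning-rate schedule: it is sized for `config.EPOCHS` = 5 while the loop runs for up to 10 epochs. The sketch below reproduces the shape of a linear warmup-then-decay multiplier. The function name is an assumption, and the sample count (465) and batch size (2) are assumed values based on this notebook's run.

```python
import math


def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps, clamped at 0
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


num_samples, batch_size, epochs = 465, 2, 5  # assumed values for this run
num_train_steps = int(num_samples / batch_size * epochs)
steps_per_epoch = math.ceil(num_samples / batch_size)

for epoch in range(1, 8):
    step = epoch * steps_per_epoch  # optimizer steps taken by the end of this epoch
    print(epoch, linear_schedule_multiplier(step, 0, num_train_steps))
```

Under these assumptions the multiplier is 0 from the end of epoch 5 onward, so later epochs would leave the weights unchanged. Sizing the schedule to the number of epochs actually run would be one way to test this.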


In [16]:
device = torch.device("cuda")
model_config = transformers.BertConfig.from_pretrained(config.BERT_PATH)
model_config.output_hidden_states = True

model1 = TweetModel(conf=model_config)
model1.to(device)
model1.load_state_dict(torch.load("model_0.bin"))
model1.eval()

TweetModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
 

In [17]:
final_output = []

# Instantiate TextDataset (the training data is reused here as the test data)
test_dataset = TextDataset(training_data)

# Instantiate DataLoader with `test_dataset`
data_loader = torch.utils.data.DataLoader(
    test_dataset,
    shuffle=False,
    batch_size=config.VALID_BATCH_SIZE,
    num_workers=0
)

# Turn off gradient calculations
with torch.no_grad():
    tk0 = tqdm(data_loader, total=len(data_loader))
    # Predict the span containing the notation error for each batch
    for bi, d in enumerate(tk0):
        ids = d["ids"]
        token_type_ids = d["token_type_ids"]
        mask = d["mask"]
        sentiment = d["sentiment"]
        orig_selected = d["orig_selected"]
        orig_tweet = d["orig_tweet"][2]
        targets_start = d["targets_start"]
        targets_end = d["targets_end"]
        offsets = d["offsets"].numpy()

        ids = ids.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)
        mask = mask.to(device, dtype=torch.long)
        targets_start = targets_start.to(device, dtype=torch.long)
        targets_end = targets_end.to(device, dtype=torch.long)

        # Predict start and end logits with the loaded model
        outputs_start, outputs_end = model1(
            ids=ids,
            mask=mask,
            token_type_ids=token_type_ids
        )
        
        # Apply softmax to the predicted start and end logits
        outputs_start = torch.softmax(outputs_start, dim=1).cpu().detach().numpy()
        outputs_end = torch.softmax(outputs_end, dim=1).cpu().detach().numpy()

        # Convert the start and end scores to actual predicted spans (in string form)
        for px, tweet in enumerate(orig_tweet):
            selected_tweet = orig_selected[px]
            tweet_sentiment = sentiment[px]
            _, output_sentence = calculate_jaccard_score(
                original_tweet=tweet,
                target_string=selected_tweet,
                sentiment_val=tweet_sentiment,
                idx_start=np.argmax(outputs_start[px, :]),
                idx_end=np.argmax(outputs_end[px, :]),
                offsets=offsets[px]
            )
            final_output.append([np.argmax(outputs_start[px, :]), np.argmax(outputs_end[px, :]), output_sentence])





In [18]:
import json
for i in range(0, len(test_dataset)):
    if test_dataset[i]['sentiment'] != 11510:
        temp = {
            "origin": test_dataset[i]['orig_selected'],
            "output": final_output[i][2]
        }
        print(json.dumps(temp, indent=4))

{
    "origin": "2002 Japan-Korea",
    "output": "2002 Japan-Korea"
}
{
    "origin": "Seas of Japan",
    "output": "Seas of Japan "
}
{
    "origin": "Seas of Japan",
    "output": "Japan "
}
{
    "origin": "Sea of Japan",
    "output": "Sea of Japan "
}
{
    "origin": "Sea of Japan",
    "output": ""
}
{
    "origin": "Sea of Japan",
    "output": "Sea of Japan "
}
{
    "origin": "Sea of Japan",
    "output": "Sea of Japan "
}
{
    "origin": "Sea of Japan",
    "output": "Sea of Japan "
}
{
    "origin": "Sea of Japan",
    "output": "Sea of Japan "
}
{
    "origin": "Sea of Japan",
    "output": "Sea of Japan"
}
{
    "origin": "Sea of Japan",
    "output": "Sea of Japan "
}
{
    "origin": "Sea Of Japan",
    "output": "Of Japan"
}
{
    "origin": "Sea of Japan",
    "output": "Sea of Japan "
}
{
    "origin": "Sea of Japan",
    "output": "Sea of Japan "
}
{
    "origin": "Sea of Japan",
    "output": "Sea of Japan "
}
{
    "origin": "Sea of Japan (Japan Sea)",
    "output"