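## Overall implementation process and analysis

**1. Data processing** <br>
  1.1 Preprocessing the English and Korean data
- English data: parsed with NLTK's BracketParseCorpusReader module.
- Korean data: the word order is rearranged based on the syntactic tags, and then only the morphemes are extracted. The reordering reverses the order of the morphemes that follow 'NP_SBJ' (the subject noun phrase); sentences without 'NP_SBJ' are reversed in full.

1.2 Build English-Korean parallel sentence pairs, then shuffle them.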
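
The reordering rule from 1.1 can be sketched on a toy sentence. This is a minimal illustration, not the notebook's full preprocessing: the function name `reorder_after_subject` and the toy tuples are hypothetical, and only the 'NP_SBJ' tag is taken from the description above.

```python
def reorder_after_subject(tagged_morphemes):
    """Keep everything up to and including the first NP_SBJ item, reverse the rest.

    `tagged_morphemes` is a list of (syntactic_tag, morpheme) tuples.
    If no NP_SBJ is present, the whole sentence is reversed.
    """
    tags = [tag for tag, _ in tagged_morphemes]
    if 'NP_SBJ' in tags:
        cut = tags.index('NP_SBJ') + 1
        return tagged_morphemes[:cut] + tagged_morphemes[cut:][::-1]
    return tagged_morphemes[::-1]


# Hypothetical toy sentence in Korean order (subject, object, verb),
# written with English glosses for readability.
toy = [('NP_SBJ', 'I'), ('NP_OBJ', 'apple'), ('VP', 'eat')]
reordered = reorder_after_subject(toy)
print([morpheme for _, morpheme in reordered])
```

With the subject kept in place and the remainder reversed, the toy sentence moves from subject-object-verb order to subject-verb-object order, which is the effect the preprocessing aims for.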

**2. Model** <br>
"Helsinki-NLP/opus-mt-ko-en" model: pretrained from MarianMT <br>
https://huggingface.co/Helsinki-NLP/opus-mt-ko-en?text=%EB%82%98%EB%8A%94+%EB%B0%94%EB%B3%B4%EB%8B%A4  
https://huggingface.co/transformers/model_doc/marian.html

- source language(s): kor kor_Hang kor_Latn
- target language(s): eng
- encoder, decoder layers : 6 each
- Dimensionality of the layers and the pooler layer: 512
- encoder, decoder attention heads: 8 each
- vocab size: 65001
- dropout: 0.1
- tokenizer: SentencePiece-based tokenizer

**3. Fine-tuning** <br>
  3.1. Hyperparameters
- epoch: 2
- training, evaluation batch size: 12
- warmup steps: 500
- weight decay: 0.01
- load best model at end

3.2. Train, validation, and test results
- Train, Validation
  - epoch 1: (training loss) 0.018 (validation loss) 0.017
  - epoch 2: (training loss) 0.014 (validation loss) 0.015
- Test perplexity
  - epoch 1: 1.02
  - epoch 2: 1.02
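
The notebook computes perplexity as the exponential of the evaluation loss. As a rough check, and assuming the test losses are of the same magnitude as the validation losses listed above (the test losses themselves are not reported, so this is an assumption), losses this small give the same perplexity at two decimals:

```python
import math

# Validation losses from the results above, used here only as stand-ins
# for the unreported test losses.
losses = {1: 0.017, 2: 0.015}

perplexities = {epoch: math.exp(loss) for epoch, loss in losses.items()}
for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity ~ {ppl:.2f}")
```

Because exp(x) is close to 1 + x for small x, a loss difference of 0.002 is lost in two-decimal rounding, so perplexity alone cannot separate the two epochs here.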

**4. BLEU score** <br>
With fine-tuning, the BLEU score was 66.27, the best among the models we ran.

**5. Inference comparison**
- Neither the model without fine-tuning nor the fine-tuned model produced an <unk> token at inference.
- Inference quality was better with fine-tuning than without it.
- During fine-tuning, the test perplexity did not differ between epoch 1 and epoch 2, but inference quality was better after epoch 2.





## Code

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### 1. Install required packages

In [None]:
!pip install --upgrade git+https://github.com/pytorch/text #upgrading torchtext for colab
!pip install transformers
!pip install --upgrade tensorflow


In [None]:
!pip install sentencepiece



### 2. Huggingface: "Helsinki-NLP/opus-mt-ko-en" model without fine-tuning

### (1) Load the pretrained model and tokenizer

In [None]:
import torch
from transformers import MarianTokenizer, MarianMTModel

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ko-en")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-ko-en")





### (2) Inference

In [None]:
sen_list = ['inference 문장 입력']  # placeholder: replace with the Korean sentence(s) to translate

In [None]:
translate_input = tokenizer.prepare_seq2seq_batch(sen_list, return_tensors="pt")
translated = model.generate(**translate_input)
trg_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

### 3. Huggingface: "Helsinki-NLP/opus-mt-ko-en" model fine-tuning

### (1) Load the pretrained model and tokenizer

In [None]:
from transformers import MarianTokenizer, MarianMTModel

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ko-en")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-ko-en")





### (2) Load and preprocess the data

### (2.1) Korean data: includes reversing the word order to match English word order

In [None]:
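#Use regular expressions to process the Korean data
import re

ko_path = "/content/drive/MyDrive/dataset/ko-en.ko.parse"
with open (ko_path, 'r', encoding='utf-8') as f:
  data = f.read()
  # Each entry is the text between an <id ...> tag and its closing /id> tag
  contents = re.findall('<id.*?/id>', data, re.S)
  sentences = []
  for c in contents:
    # Remove the tags, then join the lines of the entry with tabs
    pattern = '<.+?>'
    sent = re.sub(pattern, '', c)
    sent = re.sub('\n', '\t', sent)
    sentences.append(sent.strip())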
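
The extraction step above can be tried on a small string. This is only a sketch: `toy_data` is a hypothetical miniature, and the real column layout of the parse file is not shown in this notebook.

```python
import re

# Hypothetical miniature of the parse file; the actual contents differ.
toy_data = "<id 1>\ncol1\tcol2\n</id>\n<id 2>\ncol3\tcol4\n</id>\n"

# Same three steps as the cell above: find entries, drop tags, join lines with tabs.
blocks = re.findall('<id.*?/id>', toy_data, re.S)
toy_sentences = []
for block in blocks:
    text = re.sub('<.+?>', '', block)
    text = re.sub('\n', '\t', text)
    toy_sentences.append(text.strip())

print(toy_sentences)
```

The `re.S` flag lets `.` match newlines, so each match spans a whole multi-line entry, while the non-greedy `.*?` stops at the first closing tag instead of swallowing the following entries.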

In [None]:
#Split each Korean entry into its tab-separated columns, and drop the empty fields that appear when one id holds two or more sentences

tabs_rm_b = []

for sents in sentences:

  # Split on tabs and discard empty fields
  split_sent = [col for col in sents.split('\t') if col != '']

  tabs_rm_b.append(split_sent)

In [None]:
#Convert to a DataFrame
import pandas as pd

ko_df = pd.DataFrame(tabs_rm_b)

In [None]:
#From the Korean data, extract the third column (syntactic tag) and the fourth column (morphemes with POS tags) and pair them as tuples: (syntactic tag, morphemes with POS tags)
ko_corpus_list = []

for i in range (0, len(ko_df.index)):
  row = ko_df.loc[i,:].dropna()
  length = len(row)

  j = 2

  ko_sen_list = []

  while j <= length-1:
    index = row[j]
    word = row[j+1]

    ko_tup = (index,word)

    ko_sen_list.append(ko_tup)

    j = j + 4

  ko_corpus_list.append(ko_sen_list)

In [None]:
#Reorder each Korean sentence. If NP_SBJ (subject noun phrase) is present, reverse the morphemes after NP_SBJ; otherwise reverse all morphemes
ordered_ko_corpus = []
corpus_len = len(ko_corpus_list)

for i in range(0, corpus_len):
  sen = ko_corpus_list[i]

  sen_mi_list = []
  
  for n in range(0, len(sen)):
    sen_morph_index = sen[n][1]
    sen_mi_list.append(sen_morph_index)

  #If the entry holds two or more sentences, split it on the POS tag SF (period, question mark, exclamation mark)

  #Position of every item containing SF; enumerate keeps identical items such as two './SF' apart
  sf_index = [k for k, x in enumerate(sen_mi_list) if 'SF' in x]

  if len(sf_index) >= 2: 

    multi_sen = []

    a = 0

    for sf_i in sf_index:
      raw_multi_sen = sen[a:sf_i+1]

      #Syntactic tags of the current sub-sentence only
      sen_index_list = [word[0] for word in raw_multi_sen]

      if 'NP_SBJ' in sen_index_list:
        ns_index = sen_index_list.index('NP_SBJ')

        raw_sen = raw_multi_sen[:ns_index+1]
        reverse_sen = raw_multi_sen[ns_index+1:][::-1]

        new_sen = raw_sen + reverse_sen

      else:
        new_sen = raw_multi_sen[::-1]
      
      multi_sen.append(new_sen)

      #The next sub-sentence starts right after this SF
      a = sf_i+1
    
    new_sen = [item for uni in multi_sen for item in uni]

  else: 
    sen_index_list = [word[0] for word in sen]
      
    if 'NP_SBJ' in sen_index_list:
      ns_index = sen_index_list.index('NP_SBJ')

      raw_sen = sen[:ns_index+1]
      reverse_sen = sen[ns_index+1:][::-1]

      new_sen = raw_sen + reverse_sen

    else:
      new_sen = sen[::-1]
  
  ordered_ko_corpus.append(new_sen)

In [None]:
#Remove the POS tags from the reordered data and keep only the morphemes
clean_ko_corpus = []

for sen in ordered_ko_corpus:

  clean_1 = []
  clean_2 = []

  for i in range(0, len(sen)):
    word = sen[i][1]

    # split('|') returns [word] unchanged when there is no '|'
    for w in word.split('|'):
      clean_1.append(w)

  
  for token in clean_1:
    # Keep the morpheme and drop the POS tag that follows '/'
    new_token = token.split('/')
    clean_2.append(new_token[0])
    
  clean_sen = " ".join(clean_2)
  
  clean_ko_corpus.append(clean_sen)
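
The tag-stripping step can be illustrated on one sentence. This is a minimal sketch: the helper name `strip_pos_tags` and the sample items are hypothetical, and the `morpheme/TAG|morpheme/TAG` layout is inferred from the splitting logic in the cell above.

```python
def strip_pos_tags(tagged_words):
    """Turn items like 'morph/TAG|morph/TAG' into a space-joined string of bare morphemes."""
    morphemes = []
    for word in tagged_words:
        for token in word.split('|'):
            # Everything before the first '/' is the morpheme itself.
            morphemes.append(token.split('/')[0])
    return " ".join(morphemes)


# Hypothetical sample in the assumed layout.
sample = ['나/NP|는/JX', '학교/NNG|에/JKB', '가/VV|ㄴ다/EF', './SF']
print(strip_pos_tags(sample))
```

Note that a morpheme that is itself a slash would come out empty under this rule, since the split happens on the first '/'.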

### (2.2) English data

In [None]:
#Use NLTK's BracketParseCorpusReader module to process the English data
#BracketParseCorpusReader gives access to the sentences, words, and POS-tagged sentences of the English data

from nltk.corpus.reader import BracketParseCorpusReader
en = BracketParseCorpusReader(root="/content/drive/MyDrive/dataset/", fileids=['ko-en.en.parse.syn'], encoding='utf-8')

In [None]:
#Compare the number of Korean and English sentences and confirm that they match
print(len(clean_ko_corpus))
print(len(en.tagged_sents(fileids='ko-en.en.parse.syn')))

330974
330974


In [None]:
#Build the list of tokens for each English sentence
en_word_list = list(en.sents(fileids='ko-en.en.parse.syn'))

In [None]:
#For each English sentence, build a text in which the tokens are separated by spaces
tokenized_sentences = [" ".join(sent) for sent in en_word_list]

### (2.3) Build the dataset of paired Korean-English sentences

In [None]:
#(English) Write a text file with one sentence per line
with open('/content/drive/MyDrive/dataset/en.txt', mode='wt', encoding='utf-8') as f:
  for sent in tokenized_sentences:
    f.write(sent)
    f.write("\n")

In [None]:
#(Korean) Write a text file with one sentence per line
with open('/content/drive/MyDrive/dataset/ko.txt', mode='wt', encoding='utf-8') as f:
  for sent in clean_ko_corpus:
    f.write(sent)
    f.write("\n")

In [None]:
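#Use a DataFrame to shuffle the data
import pandas as pd

# Read the files line by line so that every sentence stays in its own row and the two columns stay aligned
with open('/content/drive/MyDrive/dataset/ko.txt', encoding='utf-8') as f:
  ko_lines = [line.rstrip('\n') for line in f] # Korean
with open('/content/drive/MyDrive/dataset/en.txt', encoding='utf-8') as f:
  en_lines = [line.rstrip('\n') for line in f] # English

df = pd.DataFrame({'src': ko_lines, 'trg': en_lines})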



In [None]:
#Shuffle the data
df_shuffle = df.sample(frac = 1)
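
Shuffling the combined DataFrame moves whole rows, so each Korean sentence keeps its English counterpart. The same idea in plain Python, as a minimal sketch with hypothetical toy data and a fixed seed (the notebook itself does not set one):

```python
import random

# Hypothetical toy corpus; the index in each string marks which lines belong together.
ko_lines = ['ko-0', 'ko-1', 'ko-2', 'ko-3']
en_lines = ['en-0', 'en-1', 'en-2', 'en-3']

# Zip first, shuffle the pairs, then unzip: the pairing survives the shuffle.
pairs = list(zip(ko_lines, en_lines))
random.Random(42).shuffle(pairs)  # seed 42 is an assumption, used for reproducibility
shuffled_ko, shuffled_en = map(list, zip(*pairs))

print(shuffled_ko)
print(shuffled_en)
```

Shuffling the two files independently would break the alignment, which is why both columns are placed in one structure before sampling.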

In [None]:
df_src = df_shuffle['src']
df_trg = df_shuffle['trg']

df_src.to_csv('/content/drive/MyDrive/dataset/ko_shuffle.txt', sep = '\n', index = False, header=None)
df_trg.to_csv('/content/drive/MyDrive/dataset/en_shuffle.txt', sep = '\n', index = False, header=None)

### (3) Build the CustomDataset

In [None]:
#Read the shuffled Korean and English data created above
ko_dir ='/content/drive/MyDrive/dataset/ordered_data/ko_shuffle.txt'
en_dir ='/content/drive/MyDrive/dataset/ordered_data/en_shuffle.txt'

with open(ko_dir, 'r', encoding='UTF-8') as f:
  ko_text = f.readlines()
with open(en_dir, 'r', encoding='UTF-8') as f:
  en_text = f.readlines()

In [None]:
# Wraps the preprocessed data in a Dataset whose items have the form the Trainer needs to fine-tune this model
import torch
class CustomDataset(torch.utils.data.Dataset):
  def __init__(self, src, trg, tokenizer):
    super().__init__()
    # return_tensors="pt" already yields tensors, so they are stored without another torch.tensor() call
    self.features = tokenizer.prepare_seq2seq_batch(src, trg, return_tensors="pt", padding='max_length')
    self.input = self.features['input_ids']
    self.mask = self.features['attention_mask']
    self.labels = self.features['labels']


  def __len__(self):
    return len(self.input) 


  def __getitem__(self, index):
    item = {'input_ids': self.input[index], 'attention_mask': self.mask[index], 'labels': self.labels[index]} 
    return item

In [None]:
# Create the data as a CustomDataset
from transformers import MarianTokenizer, MarianMTModel

data = CustomDataset(ko_text, en_text, tokenizer)

length = len(data)

  
  


### (4) Create the train, validation, and test datasets

In [None]:
#Split into train, validation, and test datasets at a ratio of 80:10:10
from torch.utils.data import random_split

train_size = int(length*0.8)
valid_size = int(length*0.1)
test_size = length - train_size - valid_size  # the remainder goes to the test set

train_data, valid_data, test_data = random_split(data, [train_size, valid_size, test_size])
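
The sizes produced by this 80:10:10 rule can be worked out by hand. The sketch below assumes the dataset still holds the 330974 sentence pairs counted earlier in the notebook:

```python
length = 330974  # number of sentence pairs printed earlier (assumed unchanged here)

train_size = int(length * 0.8)                 # int() truncates the fractional part
valid_size = int(length * 0.1)
test_size = length - train_size - valid_size   # the remainder, so the three sizes add up exactly

print(train_size, valid_size, test_size)
```

Taking the test size as the remainder avoids losing the samples that the two truncations drop, and it is why the test set ends up one example larger than the validation set.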

### (5) Fine-tuning the model

In [None]:
#Set the TrainingArguments to pass to the Trainer

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='/content/drive/MyDrive/dataset',
    num_train_epochs=2,
    evaluation_strategy = "epoch",
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    warmup_steps=500,
    save_steps=100,
    save_total_limit=5,
    load_best_model_at_end= True,
    weight_decay=0.01
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=valid_data
)

In [None]:
#Fine-tune the model with the settings above

trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.018478 | 0.016915 |
| 2 | 0.013732 | 0.014883 |


TrainOutput(global_step=44130, training_loss=0.018092823844363354)
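
The reported `global_step=44130` is consistent with the training configuration. The check below is a sketch under stated assumptions: a training split of 264779 examples (80% of 330974), a single device, no gradient accumulation, and the final partial batch of each epoch being kept.

```python
import math

train_size = 264779   # int(330974 * 0.8), assuming the 80:10:10 split of 330974 pairs
batch_size = 12       # per_device_train_batch_size, assuming one device
epochs = 2            # num_train_epochs

# One optimizer step per batch; the last batch of an epoch may be smaller than 12.
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```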

In [None]:
#Save the model

trainer.save_model()

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
model.to(device)

MarianMTModel(
  (model): BartModel(
    (shared): Embedding(65001, 512, padding_idx=65000)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(65001, 512, padding_idx=65000)
      (embed_positions): SinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0): EncoderLayer(
          (self_attn): Attention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (1): EncoderLayer

In [None]:
# Compute the evaluation perplexity on the test set

import math

training_args.do_eval = True
# check if `do_eval` flag is set.
if training_args.do_eval:
  
  # capture output if trainer evaluate.
  eval_output = trainer.evaluate(test_data)
  # compute perplexity from model loss.
  perplexity = math.exp(eval_output["eval_loss"])
  print('\nEvaluate Perplexity: {:10,.2f}'.format(perplexity))
else:
  print('No evaluation needed. No evaluation data provided, `do_eval=False`!')


Evaluate Perplexity:       1.02


### (6) BLEU score

In [None]:
# Load the previously saved model to compute the BLEU score

import torch
model2 = MarianMTModel.from_pretrained('/content/drive/MyDrive/model/')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model2.to(device)


In [None]:
#Function that computes the BLEU score

from torchtext.data.metrics import bleu_score

# Punctuation marks that get a space inserted in front of them
PUNCTUATION = '!"#$%&\'()*+,-./:<=>?@[\\]^_`{|}~'

def calculate_bleu(src_test_data, trg_test_data, model, device, max_len = 50):
    tgt_text= []
    # Translate the source sentences two at a time
    for i in range(0, len(src_test_data), 2):
      model_generate_input= tokenizer.prepare_seq2seq_batch(src_test_data[i:i+2], return_tensors="pt").to(device)
      translated = model.generate(**model_generate_input)
      tgt_text_batch = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
      tgt_text.extend(tgt_text_batch)

    pred_trgs = [] 
    original_trgs = []

    edited_sents = []
    for sent in tgt_text:
      # Add a space before each punctuation mark so that the predicted sentence
      # has the same form as the target sentence after tokenization
      edited_sent = sent
      for mark in PUNCTUATION:
        edited_sent = edited_sent.replace(mark, ' ' + mark)
      edited_sents.append(edited_sent)

    # Candidate corpus: one token list per predicted sentence
    for i in edited_sents:
      trg_token = tokenizer.tokenize(i)
      pred_trgs.append(list(trg_token))

    # Reference corpus: one list of references (here a single one) per sentence
    for sent2 in trg_test_data:
      ori_trg_token = tokenizer.tokenize(sent2)
      original_trgs.append([ori_trg_token])
        
    return bleu_score(pred_trgs, original_trgs)

In [None]:
# Store the result under a new name so the imported bleu_score function is not overwritten
bleu = calculate_bleu(test_kotexts, test_entexts, model2, device)

print(f'BLEU score = {bleu*100:.2f}')

BLEU score = 66.27
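
To make the metric concrete, here is a simplified corpus-level BLEU in plain Python. It is a sketch, not torchtext's implementation: it assumes one reference per candidate, uniform weights over 1- to 4-grams, and no smoothing, and the function names are hypothetical. Also note that the notebook scores SentencePiece tokens, so its value is computed at the subword level.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with uniform weights and one reference per candidate."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            ref_counts = ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((cand_counts & ref_counts).values())
            totals[n - 1] += sum(cand_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, a missing n-gram order zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes candidates that are shorter than their references.
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)


reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()
score = corpus_bleu([candidate], [reference])
print(f"{score * 100:.2f}")
```

In this toy case the n-gram precisions are 5/6, 3/5, 2/4, and 1/3, and the score is their geometric mean; a single wrong word already lowers every n-gram order that contains it.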


### (7) Inference

In [None]:
sen_list = ['inference 문장 입력']  # placeholder: replace with the Korean sentence(s) to translate

In [None]:
translate_input = tokenizer.prepare_seq2seq_batch(sen_list, return_tensors="pt")
translate_input.to(device)
translated = model.generate(**translate_input)
trg_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]