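# Competition Task 4

## Task
Implement a high-accuracy English-to-Japanese translator using an RNN Encoder-Decoder.

## Target score
BLEU: 0.23
(This is only a target. Failing to reach it will not cause you to fail or significantly lower your grade.)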

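## Rules
- Do not modify cells marked "Do not modify".
- The model architecture is up to you. Models not covered in the lectures are also allowed.
- Do not use any training data other than `train` and `val`, the datasets behind `train_iter, val_iter` in the cell below.
- `id2text_en, id2text_ja` are lists: indexing them with an ID returns the corresponding word.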
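
As a small illustration of the last rule, an ID sequence can be turned back into words by indexing the list. The vocabulary and the helper `ids_to_tokens` below are hypothetical stand-ins, not the real `id2text_ja`:

```python
# Hypothetical toy vocabulary standing in for id2text_ja (TARGET.vocab.itos).
toy_id2text_ja = ["<unk>", "<pad>", "<sos>", "<eos>", "私", "は", "学生", "です", "。"]

def ids_to_tokens(ids, id2text, stop_token="<eos>"):
    """Look up each ID in the list and stop at the end-of-sentence token."""
    tokens = []
    for i in ids:
        token = id2text[i]
        if token == stop_token:
            break
        tokens.append(token)
    return tokens

example_ids = [4, 5, 6, 7, 8, 3, 1, 1]  # sentence, <eos>, then padding
decoded = ids_to_tokens(example_ids, toy_id2text_ja)
print(" ".join(decoded))
```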

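## How to submit
- You will submit a single file.
  1. Save the predictions for the test data `test_iter` as `submission4_gen.csv` and download the file.
  2. On the Homework tab, select **Day4 Pred (.csv)** and submit the file.
  3. Separately, keep the notebook that corresponds to your final submission on the iLect System, under an identifiable name such as [Final Submission].
- Top performers will be asked to present their approach at the next lecture.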

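## LeaderBoard
- During the competition, the LeaderBoard is computed from 50% of the submitted csv file.
- When the competition ends, the score is computed from the other half of the submitted csv file, which was not used for the LeaderBoard during the competition.
- As a result, rankings and scores during the competition may differ from those shown after the LeaderBoard is updated at the end.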

## Evaluation

- Predictions are evaluated with the BLEU score (up to 4-grams) against `t_test`.
- Smoothing is applied when computing BLEU. For details, see [method4 here](https://github.com/nltk/nltk/blob/7d6a8d42f6/nltk/translate/bleu_score.py#L577-L591).
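
To make the metric concrete, here is a minimal pure-Python sketch of sentence-level BLEU with clipped n-gram precisions up to 4-grams and a brevity penalty. It deliberately omits the smoothing (NLTK `method4`) used by the official scorer, so its numbers will not match the LeaderBoard. The function `simple_bleu` and the example sentences are made up for illustration:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    if not hypothesis:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precision_sum += math.log(overlap / total) / max_n
    # The brevity penalty punishes hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(log_precision_sum)

ref = "私 は 彼 の 名前 を 知り ませ ん 。".split()
hyp = "私 は 彼 の 名前 を 知ら ない 。".split()
score = simple_bleu(ref, hyp)
print(f"{score:.3f}")
```

Short hypotheses often share no 4-gram with the reference, which is why the unsmoothed score collapses to zero so easily and why the official evaluation applies smoothing.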

## Loading the data

- Do not modify this cell.
- If you modify it by mistake, copy the original file again.


- The dataset is large, so loading takes a few minutes.

In [None]:
import pandas as pd
import spacy
import torch
from janome.tokenizer import Tokenizer
from torchtext.data import Field, LabelField, TabularDataset, BucketIterator

# Tokenizer for English
spacy_en = spacy.load('en')
def tokenizer_en(text):
    return [token.text for token in spacy_en.tokenizer(text)]

# Tokenizer for Japanese
ja_t = Tokenizer()
def tokenizer_ja(text): 
    return [token for token in ja_t.tokenize(text, wakati=True)]

# Define each Field
DATA_ID = LabelField(dtype=torch.int)
SOURCE = Field(sequential=True, tokenize=tokenizer_en, init_token="<sos>", eos_token="<eos>", lower=True, include_lengths=True)
TARGET = Field(sequential=True, tokenize=tokenizer_ja, init_token="<sos>", eos_token="<eos>", lower=False, include_lengths=True)

train, val, test = TabularDataset.splits(
    path="/root/userspace/public/day4/chap08/data", 
    train="train.csv", validation="val.csv",
    test="test_homework.csv", format="csv",
    skip_header=True,
    fields=[("data_id", DATA_ID), ("source", SOURCE), ("target", TARGET)]
)

def load_dataset(batch_size, device):
    # Build the vocabularies
    DATA_ID.build_vocab(train)
    SOURCE.build_vocab(train, min_freq=2)
    TARGET.build_vocab(train, min_freq=2)
    id2text_en = SOURCE.vocab.itos
    id2text_ja = TARGET.vocab.itos
    
    # Create an iterator for each dataset
    train_iter, val_iter, test_iter = BucketIterator.splits(
    (train, val, test), batch_size=batch_size, device=device, sort=False)
    
    return train_iter, val_iter, test_iter, SOURCE, TARGET, id2text_en, id2text_ja

## Implementation

In [None]:
# Points that improved accuracy
# 1. Implemented with reference to https://github.com/bentrevett/pytorch-seq2seq (a German-to-English repository)
# 2. Changed the tokenizers

In [1]:
import pandas as pd
import spacy
import torch
from janome.tokenizer import Tokenizer
from torchtext.data import Field, LabelField, TabularDataset, BucketIterator

In [2]:
#!pip3 install spaCy==3.0

In [3]:
#!python3 -m spacy download ja_core_news_lg

In [4]:
from transformers import BertTokenizer

# Tokenizer for English (BERT WordPiece instead of spaCy)
bert_tokenizer_en = BertTokenizer.from_pretrained('bert-base-uncased')
def tokenizer_en(text):
    return bert_tokenizer_en.tokenize(text)

# Tokenizer for Japanese (spaCy instead of Janome)
spacy_ja = spacy.load('ja_core_news_lg')
def tokenizer_ja(text): 
    return [token.text for token in spacy_ja.tokenizer(text)]

SRC = Field(tokenize = tokenizer_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)

TRG = Field(tokenize = tokenizer_ja, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)
DATA_ID = LabelField(dtype=torch.int)

train_data, valid_data, test_data = TabularDataset.splits(
    path="/root/userspace/public/day4/chap08/data", 
    train="train.csv", validation="val.csv",
    test="test_homework.csv", format="csv",
    skip_header=True,
     fields=[("data_id", DATA_ID), ("source", SRC), ("target", TRG)]
)

In [5]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 35000
Number of validation examples: 7500
Number of testing examples: 7500


In [6]:
print(vars(train_data.examples[0]))

{'data_id': '0', 'source': ['i', 'can', "'", 't', 'tell', 'who', 'will', 'arrive', 'first', '.'], 'target': ['誰', 'が', '一番', 'に', '着く', 'か', '私', 'に', 'は', '分かり', 'ませ', 'ん', '。']}


In [7]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
DATA_ID.build_vocab(train_data)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
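
`min_freq = 2` keeps only tokens that occur at least twice in the training data; rarer tokens are looked up as `<unk>`. A minimal pure-Python sketch of that idea follows. The helper `build_toy_vocab` and the corpus are made up, and torchtext's own ordering and internals may differ:

```python
from collections import Counter

def build_toy_vocab(tokenized_sentences, min_freq=2,
                    specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Keep tokens seen at least min_freq times; everything else maps to <unk>."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # Sort by descending frequency, then alphabetically, for a deterministic order.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

toy_corpus = [["i", "like", "tea", "."], ["i", "like", "coffee", "."]]
itos, stoi = build_toy_vocab(toy_corpus)
unk_id = stoi["<unk>"]
# "milk" never appeared, and "tea" appeared only once, so both fall back to <unk>.
encoded = [stoi.get(tok, unk_id) for tok in ["i", "like", "milk", "."]]
print(encoded)
```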

In [8]:
print(f"Unique tokens in source (en) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (ja) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (en) vocabulary: 3748
Unique tokens in target (ja) vocabulary: 4848


In [9]:
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
     batch_size = BATCH_SIZE,
     device = device,
     sort=False)
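
Each batch produced by the iterator is padded to the longest sequence in that batch. Grouping sentences of similar length therefore spends fewer positions on `<pad>`, which is the idea behind length bucketing. A rough sketch with hypothetical helpers `pad_batch` and `padding_ratio` and made-up IDs:

```python
def pad_batch(sequences, pad_id):
    """Right-pad every ID sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

def padding_ratio(batch, pad_id):
    """Fraction of positions in the batch that hold the pad ID."""
    total = sum(len(row) for row in batch)
    pads = sum(row.count(pad_id) for row in batch)
    return pads / total

PAD = 1  # hypothetical pad index
mixed = pad_batch([[2, 7, 3], [2, 7, 8, 9, 10, 11, 3]], PAD)   # very different lengths
similar = pad_batch([[2, 7, 3], [2, 8, 9, 3]], PAD)            # similar lengths

print(padding_ratio(mixed, PAD), padding_ratio(similar, PAD))
```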

In [10]:
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import spacy
import numpy as np
import random
import math
import time

In [11]:
class Encoder(nn.Module):
    def __init__(self, 
                 input_dim, 
                 hid_dim, 
                 n_layers, 
                 n_heads, 
                 pf_dim,
                 dropout, 
                 device,
                 max_length = 100):
        super().__init__()

        self.device = device
        
        self.tok_embedding = nn.Embedding(input_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        
        self.layers = nn.ModuleList([EncoderLayer(hid_dim, 
                                                  n_heads, 
                                                  pf_dim,
                                                  dropout, 
                                                  device) 
                                     for _ in range(n_layers)])
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
        
    def forward(self, src, src_mask):
        
        #src = [batch size, src len]
        #src_mask = [batch size, 1, 1, src len]
        
        batch_size = src.shape[0]
        src_len = src.shape[1]
        
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        
        #pos = [batch size, src len]
        
        src = self.dropout((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))
        
        #src = [batch size, src len, hid dim]
        
        for layer in self.layers:
            src = layer(src, src_mask)
            
        #src = [batch size, src len, hid dim]
            
        return src

In [12]:
class EncoderLayer(nn.Module):
    def __init__(self, 
                 hid_dim, 
                 n_heads, 
                 pf_dim,  
                 dropout, 
                 device):
        super().__init__()
        
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, 
                                                                     pf_dim, 
                                                                     dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src, src_mask):
        
        #src = [batch size, src len, hid dim]
        #src_mask = [batch size, 1, 1, src len] 
                
        #self attention
        _src, _ = self.self_attention(src, src, src, src_mask)
        
        #dropout, residual connection and layer norm
        src = self.self_attn_layer_norm(src + self.dropout(_src))
        
        #src = [batch size, src len, hid dim]
        
        #positionwise feedforward
        _src = self.positionwise_feedforward(src)
        
        #dropout, residual and layer norm
        src = self.ff_layer_norm(src + self.dropout(_src))
        
        #src = [batch size, src len, hid dim]
        
        return src

In [13]:
class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout, device):
        super().__init__()
        
        assert hid_dim % n_heads == 0
        
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        self.head_dim = hid_dim // n_heads
        
        self.fc_q = nn.Linear(hid_dim, hid_dim)
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        
        self.fc_o = nn.Linear(hid_dim, hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)
        
    def forward(self, query, key, value, mask = None):
        
        batch_size = query.shape[0]
        
        #query = [batch size, query len, hid dim]
        #key = [batch size, key len, hid dim]
        #value = [batch size, value len, hid dim]
                
        Q = self.fc_q(query)
        K = self.fc_k(key)
        V = self.fc_v(value)
        
        #Q = [batch size, query len, hid dim]
        #K = [batch size, key len, hid dim]
        #V = [batch size, value len, hid dim]
                
        Q = Q.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        K = K.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        V = V.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        
        #Q = [batch size, n heads, query len, head dim]
        #K = [batch size, n heads, key len, head dim]
        #V = [batch size, n heads, value len, head dim]
                
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        
        #energy = [batch size, n heads, query len, key len]
        
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        
        attention = torch.softmax(energy, dim = -1)
                
        #attention = [batch size, n heads, query len, key len]
                
        x = torch.matmul(self.dropout(attention), V)
        
        #x = [batch size, n heads, query len, head dim]
        
        x = x.permute(0, 2, 1, 3).contiguous()
        
        #x = [batch size, query len, n heads, head dim]
        
        x = x.view(batch_size, -1, self.hid_dim)
        
        #x = [batch size, query len, hid dim]
        
        x = self.fc_o(x)
        
        #x = [batch size, query len, hid dim]
        
        return x, attention
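
Stripped of the head split and the learned projections, the core of the layer above is scaled dot-product attention with a mask. The following pure-Python sketch (toy inputs, lists instead of tensors, no dropout, hypothetical function names) mirrors the `energy`, `masked_fill`, and `softmax` steps:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V, mask=None):
    """Q: [query len][d], K: [key len][d], V: [key len][d_v]; mask[i][j] == 0 hides key j."""
    d = len(Q[0])
    out, weights = [], []
    for i, q in enumerate(Q):
        # Dot product of the query with every key, scaled by sqrt(d).
        energy = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        if mask is not None:
            energy = [e if mask[i][j] else -1e10 for j, e in enumerate(energy)]
        w = softmax(energy)
        weights.append(w)
        # Weighted sum of the value vectors.
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return out, weights

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mask = [[1, 1, 0]]  # the third key is treated as padding
out, w = attention(Q, K, V, mask)
print(w[0], out[0])
```

The masked key receives a weight of effectively zero, so its large value vector does not leak into the output.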

In [14]:
class PositionwiseFeedforwardLayer(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        
        self.fc_1 = nn.Linear(hid_dim, pf_dim)
        self.fc_2 = nn.Linear(pf_dim, hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        
        #x = [batch size, seq len, hid dim]
        
        x = self.dropout(torch.relu(self.fc_1(x)))
        
        #x = [batch size, seq len, pf dim]
        
        x = self.fc_2(x)
        
        #x = [batch size, seq len, hid dim]
        
        return x

In [15]:
class Decoder(nn.Module):
    def __init__(self, 
                 output_dim, 
                 hid_dim, 
                 n_layers, 
                 n_heads, 
                 pf_dim, 
                 dropout, 
                 device,
                 max_length = 100):
        super().__init__()
        
        self.device = device
        
        self.tok_embedding = nn.Embedding(output_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        
        self.layers = nn.ModuleList([DecoderLayer(hid_dim, 
                                                  n_heads, 
                                                  pf_dim, 
                                                  dropout, 
                                                  device)
                                     for _ in range(n_layers)])
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
        
    def forward(self, trg, enc_src, trg_mask, src_mask):
        
        #trg = [batch size, trg len]
        #enc_src = [batch size, src len, hid dim]
        #trg_mask = [batch size, 1, trg len, trg len]
        #src_mask = [batch size, 1, 1, src len]
                
        batch_size = trg.shape[0]
        trg_len = trg.shape[1]
        
        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
                            
        #pos = [batch size, trg len]
            
        trg = self.dropout((self.tok_embedding(trg) * self.scale) + self.pos_embedding(pos))
                
        #trg = [batch size, trg len, hid dim]
        
        for layer in self.layers:
            trg, attention = layer(trg, enc_src, trg_mask, src_mask)
        
        #trg = [batch size, trg len, hid dim]
        #attention = [batch size, n heads, trg len, src len]
        
        output = self.fc_out(trg)
        
        #output = [batch size, trg len, output dim]
            
        return output, attention

In [16]:
class DecoderLayer(nn.Module):
    def __init__(self, 
                 hid_dim, 
                 n_heads, 
                 pf_dim, 
                 dropout, 
                 device):
        super().__init__()
        
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.enc_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.encoder_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, 
                                                                     pf_dim, 
                                                                     dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, trg, enc_src, trg_mask, src_mask):
        
        #trg = [batch size, trg len, hid dim]
        #enc_src = [batch size, src len, hid dim]
        #trg_mask = [batch size, 1, trg len, trg len]
        #src_mask = [batch size, 1, 1, src len]
        
        #self attention
        _trg, _ = self.self_attention(trg, trg, trg, trg_mask)
        
        #dropout, residual connection and layer norm
        trg = self.self_attn_layer_norm(trg + self.dropout(_trg))
            
        #trg = [batch size, trg len, hid dim]
            
        #encoder attention
        _trg, attention = self.encoder_attention(trg, enc_src, enc_src, src_mask)
        
        #dropout, residual connection and layer norm
        trg = self.enc_attn_layer_norm(trg + self.dropout(_trg))
                    
        #trg = [batch size, trg len, hid dim]
        
        #positionwise feedforward
        _trg = self.positionwise_feedforward(trg)
        
        #dropout, residual and layer norm
        trg = self.ff_layer_norm(trg + self.dropout(_trg))
        
        #trg = [batch size, trg len, hid dim]
        #attention = [batch size, n heads, trg len, src len]
        
        return trg, attention

In [17]:
class Seq2Seq(nn.Module):
    def __init__(self, 
                 encoder, 
                 decoder, 
                 src_pad_idx, 
                 trg_pad_idx, 
                 device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device
        
    def make_src_mask(self, src):
        
        #src = [batch size, src len]
        
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)

        #src_mask = [batch size, 1, 1, src len]

        return src_mask
    
    def make_trg_mask(self, trg):
        
        #trg = [batch size, trg len]
        
        trg_pad_mask = (trg != self.trg_pad_idx).unsqueeze(1).unsqueeze(2)
        
        #trg_pad_mask = [batch size, 1, 1, trg len]
        
        trg_len = trg.shape[1]
        
        trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len), device = self.device)).bool()
        
        #trg_sub_mask = [trg len, trg len]
            
        trg_mask = trg_pad_mask & trg_sub_mask
        
        #trg_mask = [batch size, 1, trg len, trg len]
        
        return trg_mask

    def forward(self, src, trg):
        
        #src = [batch size, src len]
        #trg = [batch size, trg len]
                
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        
        #src_mask = [batch size, 1, 1, src len]
        #trg_mask = [batch size, 1, trg len, trg len]
        
        enc_src = self.encoder(src, src_mask)
        
        #enc_src = [batch size, src len, hid dim]
                
        output, attention = self.decoder(trg, enc_src, trg_mask, src_mask)
        
        #output = [batch size, trg len, output dim]
        #attention = [batch size, n heads, trg len, src len]
        
        return output, attention
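
The two masks can be pictured for a single sequence without tensors. In this sketch (plain lists, a hypothetical pad index, and function names reused only for readability), the target mask combines the lower-triangular "no peeking ahead" rule with the padding rule, as `make_trg_mask` does with `&`:

```python
def make_src_mask(src_ids, pad_id):
    """True where the source token is real, False where it is padding."""
    return [tok != pad_id for tok in src_ids]

def make_trg_mask(trg_ids, pad_id):
    """Row i may attend to column j only if j <= i and token j is not padding."""
    n = len(trg_ids)
    return [[(j <= i) and (trg_ids[j] != pad_id) for j in range(n)] for i in range(n)]

PAD = 1  # hypothetical pad index
trg = [2, 10, 11, 3, PAD]  # <sos>, two tokens, <eos>, <pad>
mask = make_trg_mask(trg, PAD)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Each printed row is one query position. The last column stays 0 in every row because that position holds padding.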

In [18]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
HID_DIM = 256
ENC_LAYERS = 3
DEC_LAYERS = 3
ENC_HEADS = 8
DEC_HEADS = 8
ENC_PF_DIM = 512
DEC_PF_DIM = 512
ENC_DROPOUT = 0.1
DEC_DROPOUT = 0.1

enc = Encoder(INPUT_DIM, 
              HID_DIM, 
              ENC_LAYERS, 
              ENC_HEADS, 
              ENC_PF_DIM, 
              ENC_DROPOUT, 
              device)

dec = Decoder(OUTPUT_DIM, 
              HID_DIM, 
              DEC_LAYERS, 
              DEC_HEADS, 
              DEC_PF_DIM, 
              DEC_DROPOUT, 
              device)

In [19]:
SRC_PAD_IDX = SRC.vocab.stoi[SRC.pad_token]
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

model = Seq2Seq(enc, dec, SRC_PAD_IDX, TRG_PAD_IDX, device).to(device)

In [20]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 7,451,376 trainable parameters
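
The reported total can be reproduced by hand from the hyperparameters in cell 18 and the vocabulary sizes printed earlier. The helper names below are hypothetical. The `scale` tensors are plain tensors rather than parameters, so they are not counted:

```python
def linear(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix + bias

def layer_norm(dim):
    return 2 * dim  # scale + shift

def attention(hid):
    return 4 * linear(hid, hid)  # fc_q, fc_k, fc_v, fc_o

def feedforward(hid, pf):
    return linear(hid, pf) + linear(pf, hid)

INPUT_DIM, OUTPUT_DIM = 3748, 4848  # vocabulary sizes printed above
HID_DIM, PF_DIM, N_LAYERS, MAX_LENGTH = 256, 512, 3, 100

encoder_layer = 2 * layer_norm(HID_DIM) + attention(HID_DIM) + feedforward(HID_DIM, PF_DIM)
decoder_layer = 3 * layer_norm(HID_DIM) + 2 * attention(HID_DIM) + feedforward(HID_DIM, PF_DIM)

# Token embedding + positional embedding + stacked layers.
encoder = INPUT_DIM * HID_DIM + MAX_LENGTH * HID_DIM + N_LAYERS * encoder_layer
# The decoder additionally has the output projection fc_out.
decoder = (OUTPUT_DIM * HID_DIM + MAX_LENGTH * HID_DIM
           + N_LAYERS * decoder_layer + linear(HID_DIM, OUTPUT_DIM))

total = encoder + decoder
print(f"{total:,}")  # 7,451,376
```

Roughly a third of the parameters sit in the target-side embedding and output projection, so the target vocabulary size has a large effect on model size.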


In [21]:
def initialize_weights(m):
    if hasattr(m, 'weight') and m.weight.dim() > 1:
        nn.init.xavier_uniform_(m.weight.data)
        
model.apply(initialize_weights);

LEARNING_RATE = 0.0005

optimizer = torch.optim.Adam(model.parameters(), lr = LEARNING_RATE)
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

In [22]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.source
        trg = batch.target
        
        optimizer.zero_grad()
        
        output, _ = model(src, trg[:,:-1])
                
        #output = [batch size, trg len - 1, output dim]
        #trg = [batch size, trg len]
            
        output_dim = output.shape[-1]
            
        output = output.contiguous().view(-1, output_dim)
        trg = trg[:,1:].contiguous().view(-1)
                
        #output = [batch size * (trg len - 1), output dim]
        #trg = [batch size * (trg len - 1)]
            
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)


def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.source
            trg = batch.target

            output, _ = model(src, trg[:,:-1])
            
            #output = [batch size, trg len - 1, output dim]
            #trg = [batch size, trg len]
            
            output_dim = output.shape[-1]
            
            output = output.contiguous().view(-1, output_dim)
            trg = trg[:,1:].contiguous().view(-1)
            
            #output = [batch size * (trg len - 1), output dim]
            #trg = [batch size * (trg len - 1)]
            
            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
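
Both `train` and `evaluate` feed `trg[:, :-1]` to the decoder and compare the output with `trg[:, 1:]`. For a single unpadded sequence, that shift pairs every input token with the token that should follow it. A small sketch with a made-up sentence and a hypothetical helper name:

```python
def shift_for_teacher_forcing(trg_tokens):
    """Split one target sequence into decoder input and prediction target."""
    decoder_input = trg_tokens[:-1]  # corresponds to trg[:, :-1]
    labels = trg_tokens[1:]          # corresponds to trg[:, 1:]
    return decoder_input, labels

toy_trg = ["<sos>", "彼", "は", "学生", "です", "。", "<eos>"]
decoder_input, labels = shift_for_teacher_forcing(toy_trg)
for step, (seen, expected) in enumerate(zip(decoder_input, labels)):
    print(f"step {step}: input={seen!r} -> predict {expected!r}")
```

The decoder never receives `<eos>` as input here and is never asked to predict `<sos>`.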

In [32]:
N_EPOCHS = 15
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), '/root/userspace/day4/homework4/tut6-model1.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} ')
    print(f'\t Val. Loss: {valid_loss:.3f} ')

Epoch: 01 | Time: 0m 33s
	Train Loss: 1.045 
	 Val. Loss: 1.607 
Epoch: 02 | Time: 0m 33s
	Train Loss: 0.961 
	 Val. Loss: 1.613 
Epoch: 03 | Time: 0m 33s
	Train Loss: 0.891 
	 Val. Loss: 1.610 
Epoch: 04 | Time: 0m 33s
	Train Loss: 0.829 
	 Val. Loss: 1.644 
Epoch: 05 | Time: 0m 33s
	Train Loss: 0.776 
	 Val. Loss: 1.634 
Epoch: 06 | Time: 0m 33s
	Train Loss: 0.730 
	 Val. Loss: 1.647 
Epoch: 07 | Time: 0m 33s
	Train Loss: 0.687 
	 Val. Loss: 1.675 
Epoch: 08 | Time: 0m 33s
	Train Loss: 0.651 
	 Val. Loss: 1.695 
Epoch: 09 | Time: 0m 33s
	Train Loss: 0.615 
	 Val. Loss: 1.718 
Epoch: 10 | Time: 0m 32s
	Train Loss: 0.587 
	 Val. Loss: 1.744 
Epoch: 11 | Time: 0m 33s
	Train Loss: 0.559 
	 Val. Loss: 1.768 
Epoch: 12 | Time: 0m 33s
	Train Loss: 0.533 
	 Val. Loss: 1.788 
Epoch: 13 | Time: 0m 33s
	Train Loss: 0.511 
	 Val. Loss: 1.820 
Epoch: 14 | Time: 0m 33s
	Train Loss: 0.490 
	 Val. Loss: 1.837 
Epoch: 15 | Time: 0m 33s
	Train Loss: 0.470 
	 Val. Loss: 1.863 
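
In the run logged above, the training loss keeps falling while the validation loss is lowest at epoch 1 and then drifts upward, which suggests overfitting. One common response is early stopping. The sketch below replays the logged validation losses through a hypothetical `early_stopping_epoch` helper; the patience value of 3 is an assumption, not something used in this notebook:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), 1-indexed; stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Validation losses copied from the log above.
logged_val_losses = [1.607, 1.613, 1.610, 1.644, 1.634, 1.647, 1.675, 1.695,
                     1.718, 1.744, 1.768, 1.788, 1.820, 1.837, 1.863]
stop_epoch, best_epoch = early_stopping_epoch(logged_val_losses, patience=3)
print(stop_epoch, best_epoch)
```

With these numbers, the remaining epochs would not have changed the saved checkpoint, since the notebook only saves when the validation loss improves.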


In [23]:
model.load_state_dict(torch.load('/root/userspace/day4/homework4/tut6-model1.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} |')

| Test Loss: 9.443 |


In [24]:
def translate_sentence(sentence, src_field, trg_field, model, device, max_len = 50):
    
    model.eval()
        
    if isinstance(sentence, str):
        # The source language is English, so reuse the tokenizer attached to the source field
        tokens = [token.lower() for token in src_field.tokenize(sentence)]
    else:
        tokens = [token.lower() for token in sentence]

    tokens = [src_field.init_token] + tokens + [src_field.eos_token]
        
    src_indexes = [src_field.vocab.stoi[token] for token in tokens]

    src_tensor = torch.LongTensor(src_indexes).unsqueeze(0).to(device)
    
    src_mask = model.make_src_mask(src_tensor)
    
    with torch.no_grad():
        enc_src = model.encoder(src_tensor, src_mask)

    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]

    for i in range(max_len):

        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)

        trg_mask = model.make_trg_mask(trg_tensor)
        
        with torch.no_grad():
            output, attention = model.decoder(trg_tensor, enc_src, trg_mask, src_mask)
        
        pred_token = output.argmax(2)[:,-1].item()
        
        trg_indexes.append(pred_token)

        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    
    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    
    return trg_tokens[1:], attention
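
`translate_sentence` performs greedy decoding: at every step it keeps only the single highest-scoring next token and stops at `<eos>` or after `max_len` steps. The loop can be sketched without a model. Here the decoder is replaced by a hypothetical lookup table, `TOY_TABLE`, whose scores are made up:

```python
def greedy_decode(next_token_scores, sos, eos, max_len=50):
    """Repeatedly append the highest-scoring next token until eos or max_len."""
    tokens = [sos]
    for _ in range(max_len):
        scores = next_token_scores(tokens)  # dict: token -> score
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == eos:
            break
    return tokens[1:]  # drop <sos>, keep <eos>, like translate_sentence

# Hypothetical stand-in for the decoder: scores depend only on the last token.
TOY_TABLE = {
    "<sos>": {"彼": 0.9, "私": 0.1},
    "彼": {"は": 0.8, "<eos>": 0.2},
    "は": {"学生": 0.6, "<eos>": 0.4},
    "学生": {"<eos>": 0.7, "は": 0.3},
}

def toy_scores(prefix):
    return TOY_TABLE[prefix[-1]]

result = greedy_decode(toy_scores, "<sos>", "<eos>")
print(result)
```

Greedy decoding is simple and fast but commits to each token immediately. Beam search, which keeps several candidate prefixes, is a frequently used alternative.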

In [25]:
example_idx = 5
src = vars(valid_data.examples[example_idx])['source']
trg = vars(valid_data.examples[example_idx])['target']
print(f'src = {src}')
print(f'trg = {trg}')
translation, attention = translate_sentence(src, SRC, TRG, model, device)

print(f'predicted trg = {translation}')

src = ['they', 'are', 'very', 'compatible', '.']
trg = ['彼', 'ら', '二人', 'は', 'よく', '肌', 'が', '合う', '。']
predicted trg = ['彼', 'ら', 'は', 'とても', '<unk>', 'です', '。', '<eos>']


In [26]:
# Read the IDs in dataset order so that they line up with the predictions,
# which are generated by looping over test_data.examples
data_id_pred = [int(example.data_id) for example in test_data.examples]

In [28]:
#save prediction output to y_pred
y_pred = []
for idx in range(len(test_data.examples)):
    src = vars(test_data.examples[idx])['source']
    translation, attention = translate_sentence(src, SRC, TRG, model, device)
    y_pred.append(translation)

In [30]:
# Drop <eos> and cap each prediction at 40 tokens, then join the tokens with spaces
MAX_TOKENS = 40
output = []
for tokens in y_pred:
    if '<eos>' in tokens:
        tokens = tokens[:tokens.index('<eos>')]
    output.append(' '.join(tokens[:MAX_TOKENS]))
print(len(output))

7500


In [31]:
output

['雨 が やん で 待とう 。',
 'あなた は 酒 を 飲む べき だ 。',
 '君 は 一 晩 中 食事 を 変え ませ ん か 。',
 'この 学校 は この 川 の <unk> です か 。',
 '彼 ら は その 仕事 を 終え て しまっ た 。',
 '私 に は お 金 の 持ち合わせ が ない 。',
 '私 は 彼 の 結婚 し た 生活 を <unk> た 。',
 '私 たち は 車 で 車 を 運転 し た 。',
 '私 は 若い 頃 、 山 に 山 に 行き たい 。',
 'あの 人 は あの 我慢 でき ない 。',
 '君 は そんな に 遅く まで 起き て い た 方 が よい 。',
 '私 は まだ 手 に いくら 持っ て い ます 。',
 '私 は この 家 に 帰っ て くる 。',
 '彼 は 私 に その 仕事 を <unk> た 。',
 '馬鹿 な 馬鹿 で は ない で 。',
 '男 は とても 素敵 だ 。',
 '暗闇 で は なく て 人生 を 見 て は いけ ない 。',
 '多く の 人々 が 日本 を 訪れ た 。',
 '彼女 は パーティー に 出席 し たい 。',
 'この 語 は 使い 出し た 。',
 'この 博物 館 へ 行く 道 は どう です か 。',
 '彼女 は 彼 ら に とても とても とても <unk> し た 。',
 'できる だけ 早く 起き なさい 。',
 '私 は どこ へ 行っ て も 行き ませ ん 。',
 '彼 は 東京 から パリ へ 出発 し た 。',
 '私 は それ に つい て 知り ませ ん 。',
 '金 は 少し しか なく なっ た 。',
 'あなた に 会う の を 楽し み に し て い まし た 。',
 '彼 ら は 彼女 に <unk> 会っ た 。',
 '私 は 来月 散髪 する 。',
 '私 に は ２人 の 娘 が いる 。',
 '明日 の 朝 電話 を かけ ます 。',
 'ジョン は 門 を 門 の 楽し み に し て き た 。',
 'それ は どの くらい の 高 さ です か 。',
 '君 は 一日 から 休み ます か 。',
 '彼 は 気 に <unk> し 

In [123]:
### Output ###
# Build the submission table from the translated sentences and save it
submission = pd.DataFrame({"data_id": data_id_pred, "pred_target": output}).sort_values("data_id")
submission.to_csv('/root/userspace/submission4_pred1.csv', header=True, index=False)
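
Before uploading, it can help to sanity-check the saved file. The sketch below uses only the standard library and a hypothetical `check_submission` helper. The checks (row count, unique sorted IDs, no leftover special tokens) are assumptions about what a clean submission looks like, not documented requirements of the grader:

```python
import csv
import io

def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a data_id / pred_target CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    ids = [int(r["data_id"]) for r in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate data_id values")
    if ids != sorted(ids):
        problems.append("rows are not sorted by data_id")
    if any("<eos>" in r["pred_target"] or "<sos>" in r["pred_target"] for r in rows):
        problems.append("special tokens left in pred_target")
    return problems

# Made-up two-row file; for the real file, read its contents and pass expected_rows=7500.
toy_csv = "data_id,pred_target\n0,雨 が やん だ 。\n1,私 は 学生 です 。\n"
print(check_submission(toy_csv, expected_rows=2))
```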