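# Competition Task 3

## Task
Implement sentiment analysis on IMDb movie reviews.

## Target
Accuracy: 80%  
(This is only a target. Missing it will not cause you to fail or significantly lower your grade.)

## Rules
- Do not modify any cell marked "Do not modify".
- Do not use any data other than the provided data for model training or inference.
- The model architecture is up to you. Models not covered in the lectures are also allowed.

## How to Submit
- You will submit one file.
  1. Save the predicted labels for the test data `test_iter` as `submission3_pred.csv` and download the file.
  2. Select **Day3 Pred (.csv)** from the Homework tab and submit it.
  3. Separately, keep the notebook that corresponds to your final submission on the iLect System, under an identifiable name such as [Final Submission].
- Top performers will be asked to present their approach at the next lecture.

## LeaderBoard
- During the competition, the LeaderBoard is computed from 50% of the submitted csv file.
- When the competition ends, the other half of the submitted csv file, which was not used for the LeaderBoard during the competition, is used to compute the score.
- As a result, rankings and scores during the competition may differ from those shown after the LeaderBoard is updated at the end.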
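
The split scoring described in this section can be sketched as follows. This is only an illustration: the helper names `accuracy` and `split_scores`, the random seed, and the toy labels are assumptions, and the actual rows used for each half are not published.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def split_scores(y_true, y_pred, seed=0):
    """Score one submission on two disjoint halves of the test rows."""
    indices = list(range(len(y_true)))
    random.Random(seed).shuffle(indices)  # hypothetical split, not the real one
    half = len(indices) // 2
    public_idx, private_idx = indices[:half], indices[half:]
    public = accuracy([y_true[i] for i in public_idx],
                      [y_pred[i] for i in public_idx])
    private = accuracy([y_true[i] for i in private_idx],
                       [y_pred[i] for i in private_idx])
    return public, private

# Toy data: 8 rows, 6 of them predicted correctly.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
public_acc, private_acc = split_scores(y_true, y_pred)
print(f"overall={accuracy(y_true, y_pred):.2f} "
      f"public={public_acc:.2f} private={private_acc:.2f}")
```

Because the two halves contain different rows, the two scores can differ even though the submission file is the same.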

## Evaluation

- Submissions are evaluated by accuracy.

## Loading the Data

- Do not modify this cell.
- If you modify it by mistake, copy the original file again.


- The data is large, so loading takes a few minutes.

In [1]:
import random
import spacy
import numpy as np
import pandas as pd
from tqdm import tqdm

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchtext import datasets
from torchtext.data import Field, LabelField, TabularDataset, BucketIterator
from sklearn.metrics import f1_score
from torch.nn.utils.rnn import pack_padded_sequence

# Define the data fields
DATA_ID = LabelField(dtype=torch.int)
TEXT = Field(tokenize='spacy', lower=True, include_lengths=True)
LABEL = LabelField(dtype=torch.float)

# Load the datasets
# All labels in the test data are masked as "pos".
train, val, test = TabularDataset.splits(
    path="/root/userspace/public/day3/homework3/data", 
    train="train.csv", validation="val.csv",
    test="test_homework.csv", format="csv",
    skip_header=True,
    fields=[("data_id", DATA_ID),("text", TEXT), ("label", LABEL)]
)
    

def load_dataset(batch_size, device, train=train, val=val, test=test):
    # Build the vocabularies
    DATA_ID.build_vocab(train)
    TEXT.build_vocab(train, max_size=25000)
    LABEL.build_vocab(["neg", "pos"]) # neg:0, pos:1
    
    # Create an iterator for each dataset
    train_iter, val_iter, test_iter = BucketIterator.splits(
        (train, val, test), batch_size=batch_size, device=device, sort=False)
    
    return train_iter, val_iter, test_iter

## Implementation

In [None]:
# Key points
# 1. Use BERT (Bidirectional Encoder Representations from Transformers) for sentiment analysis on IMDb movie reviews.
# 2. Instead of a trainable Embedding layer, obtain embeddings from a pre-trained transformer model via the transformers library.
# 3. Instead of a plain RNN, use a multi-layer bidirectional GRU.

In [2]:
# Pre-trained BERT tokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [6]:
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)

[CLS] [SEP] [PAD] [UNK]


In [7]:
init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


In [8]:
init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


In [9]:
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']

print(max_input_length)

512


In [10]:
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence) 
    tokens = tokens[:max_input_length-2]
    return tokens
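
As a rough illustration of why the cut is `max_input_length-2`: two positions are reserved for the `[CLS]` and `[SEP]` tokens that are added later. The sketch below uses a plain list as a stand-in for the output of `tokenizer.tokenize`; the names `cut` and `long_review` are hypothetical.

```python
MAX_INPUT_LENGTH = 512  # value printed for 'bert-base-uncased' in the cell above

def cut(tokens, max_input_length=MAX_INPUT_LENGTH):
    """Keep at most max_input_length - 2 tokens, leaving room for [CLS] and [SEP]."""
    return tokens[:max_input_length - 2]

long_review = ["word"] * 600  # stand-in for a tokenized review that is too long
body = cut(long_review)
wrapped = ["[CLS]"] + body + ["[SEP]"]
print(len(body), len(wrapped))  # 510 512
```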

In [11]:
batch_size=128
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, val_iterator, test_iterator = load_dataset(batch_size, device)


In [12]:
print(f"Number of training examples: {len(train)}")
print(f"Number of validation examples: {len(val)}")
print(f"Number of testing examples: {len(test)}")

Number of training examples: 20888
Number of validation examples: 3972
Number of testing examples: 20000


In [13]:
print(vars(train.examples[6]))

{'data_id': '6', 'text': ['i', "'ve", 'been', 'watching', 'this', 'movie', 'by', 'hoping', 'to', 'find', 'a', 'pretty', 'and', 'interesting', 'story', 'yet', 'the', 'story', 'line', 'was', "n't", 'good', 'at', 'all', '.', 'the', 'play', 'of', 'the', 'actors', 'were', "n't", 'any', 'better.<br', '/><br', '/>of', 'course', 'shahrukh', 'khan', 'was', 'there', 'yet', 'he', 'was', "n't", 'enough', 'to', 'make', 'this', 'movie', '"', 'credible', '"', 'and', 'interesting.<br', '/><br', "/>i've", 'read', 'that', 'this', 'movie', 'was', 'based', 'on', 'the', 'novel', 'of', 'flaubert', '"', 'madame', 'bovary', '"', 'yet', 'for', 'me', 'i', 'did', "n't", 'see', 'it', 'matching', 'with', 'the', 'indian', 'mentality.<br', '/><br', '/>in', 'general', 'we', 'buy', 'movie', 'to', 'dream', 'and', 'have', 'a', 'good', 'time', ',', 'not', 'to', 'waste', 'our', 'time', 'and', 'change', 'our', 'mood', 'into', 'worse', '.', 'i', 'just', 'ca', "n't", 'understand', 'how', 'it', 'could', 'get', 'such', 'a', '"

In [17]:
TEXT = Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

LABEL = LabelField(dtype = torch.float)

# Load the datasets
# All labels in the test data are masked as "pos".
train, val, test = TabularDataset.splits(
    path="/root/userspace/public/day3/homework3/data", 
    train="train.csv", validation="val.csv",
    test="test_homework.csv", format="csv",
    skip_header=True,
    fields=[("data_id", DATA_ID),("text", TEXT), ("label", LABEL)]
)
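
With `init_token`, `eos_token` and `pad_token` set to the BERT ids, each numericalized review is expected to be wrapped in `[CLS]`/`[SEP]` and padded to the longest sequence in its batch. The following is a minimal pure-Python sketch of that behaviour, not the torchtext implementation; `wrap_and_pad` is a hypothetical helper and the token ids are illustrative values taken from the printed example.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids printed by the tokenizer cells

def wrap_and_pad(batch):
    """Add [CLS]/[SEP] ids to each sequence, then right-pad to the longest one."""
    wrapped = [[CLS_ID] + ids + [SEP_ID] for ids in batch]
    longest = max(len(ids) for ids in wrapped)
    return [ids + [PAD_ID] * (longest - len(ids)) for ids in wrapped]

batch = [[1045, 2023, 3185], [2204]]  # illustrative token ids
padded = wrap_and_pad(batch)
for row in padded:
    print(row)
```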



In [33]:
def load_dataset(batch_size, device, train=train, val=val, test=test):
    # Build the vocabularies
    DATA_ID.build_vocab(train)
    TEXT.build_vocab(train, max_size=25000)
    LABEL.build_vocab(["neg", "pos"]) # neg:0, pos:1
    
    # Create an iterator for each dataset
    train_iter, val_iter, test_iter = BucketIterator.splits(
        (train, val, test), batch_size=batch_size, device=device, sort=False)
    
    return train_iter, val_iter, test_iter

In [34]:
batch_size=128
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, val_iterator, test_iterator = load_dataset(batch_size, device)


In [35]:
print(vars(train.examples[6]))

{'data_id': '6', 'text': [1045, 1005, 2310, 2042, 3666, 2023, 3185, 2011, 5327, 2000, 2424, 1037, 3492, 1998, 5875, 2466, 2664, 1996, 2466, 2240, 2347, 1005, 1056, 2204, 2012, 2035, 1012, 1996, 2377, 1997, 1996, 5889, 4694, 1005, 1056, 2151, 2488, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1997, 2607, 7890, 6820, 10023, 4967, 2001, 2045, 2664, 2002, 2347, 1005, 1056, 2438, 2000, 2191, 2023, 3185, 1000, 23411, 1000, 1998, 5875, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1045, 1005, 2310, 3191, 2008, 2023, 3185, 2001, 2241, 2006, 1996, 3117, 1997, 13109, 4887, 8296, 1000, 10602, 8945, 21639, 1000, 2664, 2005, 2033, 1045, 2134, 1005, 1056, 2156, 2009, 9844, 2007, 1996, 2796, 5177, 3012, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1999, 2236, 2057, 4965, 3185, 2000, 3959, 1998, 2031, 1037, 2204, 2051, 1010, 2025, 2000, 5949, 2256, 2051, 1998, 2689, 2256, 6888, 2046, 4788, 1012, 1045, 2074, 2064, 1005, 1056, 3305, 2129, 2009, 2071, 2131, 2107, 1037, 1000, 215

In [36]:
from transformers import BertTokenizer, BertModel

bert = BertModel.from_pretrained('bert-base-uncased')

In [37]:
import torch.nn as nn
### Model definition ###

class BERTGRUSentiment(nn.Module):
    def __init__(self,bert,hidden_dim,output_dim,n_layers,bidirectional,dropout):
        
        super().__init__()
        
        self.bert = bert
        
        embedding_dim = bert.config.to_dict()['hidden_size']
        
        self.rnn = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers = n_layers,
                          bidirectional = bidirectional,
                          batch_first = True,
                          dropout = 0 if n_layers < 2 else dropout)
        
        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        
        #text = [batch size, sent len]
        with torch.no_grad():
            embedded = self.bert(text)[0]  
        #embedded = [batch size, sent len, emb dim]
        _, hidden = self.rnn(embedded)
        #hidden = [n layers * n directions, batch size, hid dim]
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
        #hidden = [batch size, hid dim * n directions]
        output = self.out(hidden)
        #output = [batch size, out dim]
        
        return output
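
The indexing `hidden[-2]` and `hidden[-1]` relies on how a stacked bidirectional GRU orders its final hidden states along the first axis: layer by layer, with the forward direction before the backward one. The sketch below only enumerates that ordering in plain Python; `hidden_layout` is a hypothetical helper, not a PyTorch function.

```python
def hidden_layout(n_layers, bidirectional):
    """Return (layer, direction) labels for the first axis of the final hidden state."""
    directions = ["forward", "backward"] if bidirectional else ["forward"]
    return [(layer, direction)
            for layer in range(n_layers)
            for direction in directions]

layout = hidden_layout(n_layers=2, bidirectional=True)
for index, (layer, direction) in enumerate(layout):
    print(index, f"layer {layer}", direction)

# The classifier input concatenates the last two entries:
# the top layer's forward and backward states.
print(layout[-2], layout[-1])
```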

In [38]:
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25

model = BERTGRUSentiment(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT)

In [39]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 112,241,409 trainable parameters


In [40]:
for name, param in model.named_parameters():                
    if name.startswith('bert'):
        param.requires_grad = False

In [41]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,759,169 trainable parameters
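
The trainable count can be reproduced by hand from the GRU and linear layer shapes. The sketch below assumes an embedding size of 768 for `bert-base-uncased` (this value is not printed in the notebook); `gru_param_count` is a hypothetical helper.

```python
def gru_param_count(input_dim, hidden_dim, n_layers, bidirectional):
    """Count weights and biases of a stacked GRU (3 gates per cell)."""
    n_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers above the first receive the concatenated directions as input.
        layer_input = input_dim if layer == 0 else hidden_dim * n_directions
        per_direction = (
            3 * hidden_dim * layer_input   # weight_ih
            + 3 * hidden_dim * hidden_dim  # weight_hh
            + 2 * 3 * hidden_dim           # bias_ih and bias_hh
        )
        total += per_direction * n_directions
    return total

EMBEDDING_DIM = 768  # assumed hidden size of bert-base-uncased
HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL = 256, 1, 2, True

gru = gru_param_count(EMBEDDING_DIM, HIDDEN_DIM, N_LAYERS, BIDIRECTIONAL)
linear = (HIDDEN_DIM * 2) * OUTPUT_DIM + OUTPUT_DIM  # weights plus bias
trainable = gru + linear
print(f"{trainable:,}")                # 2,759,169
print(f"{112_241_409 - trainable:,}")  # 109,482,240 frozen BERT parameters
```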


In [42]:
for name, param in model.named_parameters():                
    if param.requires_grad:
        print(name)

rnn.weight_ih_l0
rnn.weight_hh_l0
rnn.bias_ih_l0
rnn.bias_hh_l0
rnn.weight_ih_l0_reverse
rnn.weight_hh_l0_reverse
rnn.bias_ih_l0_reverse
rnn.bias_hh_l0_reverse
rnn.weight_ih_l1
rnn.weight_hh_l1
rnn.bias_ih_l1
rnn.bias_hh_l1
rnn.weight_ih_l1_reverse
rnn.weight_hh_l1_reverse
rnn.bias_ih_l1_reverse
rnn.bias_hh_l1_reverse
out.weight
out.bias


In [43]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


In [44]:
model = model.to(device)
criterion = criterion.to(device)

In [45]:
def binary_accuracy(preds, y):
    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
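
To make the thresholding explicit, here is a pure-Python sketch of the same computation on a handful of made-up logits. It mirrors the idea of `binary_accuracy` without tensors; the values are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_accuracy(logits, labels):
    """Threshold sigmoid(logit) at 0.5 and compare with 0/1 labels."""
    # round() sends exactly 0.5 to 0 (round half to even), like torch.round.
    preds = [round(sigmoid(z)) for z in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

logits = [2.0, -1.5, 0.3, -0.2]  # made-up model outputs
labels = [1, 0, 0, 0]
print(binary_accuracy(logits, labels))  # 0.75
```

A positive logit therefore maps to the label 1 and a negative logit to the label 0.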

In [46]:
def train_model(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [47]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [48]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
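
For non-negative durations, the same split into minutes and seconds can also be written with `divmod`. This is an alternative formulation offered as a sketch, not the version used in this notebook.

```python
def epoch_time(start_time, end_time):
    """Split an elapsed duration in seconds into whole minutes and seconds."""
    elapsed_mins, elapsed_secs = divmod(int(end_time - start_time), 60)
    return elapsed_mins, elapsed_secs

print(epoch_time(0.0, 1846.7))  # (30, 46)
```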


In [49]:
N_EPOCHS = 7

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss, train_acc = train_model(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, val_iterator, criterion)
        
    end_time = time.time()
        
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
      
    # Save a checkpoint only when the validation loss improves on the best so far
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), '/root/userspace/day3/homework3/model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 30m 46s
	Train Loss: 0.423 | Train Acc: 79.86%
	 Val. Loss: 0.254 |  Val. Acc: 89.67%
Epoch: 02 | Epoch Time: 30m 46s
	Train Loss: 0.243 | Train Acc: 90.25%
	 Val. Loss: 0.238 |  Val. Acc: 90.89%
Epoch: 03 | Epoch Time: 30m 47s
	Train Loss: 0.213 | Train Acc: 91.56%
	 Val. Loss: 0.235 |  Val. Acc: 91.11%
Epoch: 04 | Epoch Time: 30m 46s
	Train Loss: 0.189 | Train Acc: 92.69%
	 Val. Loss: 0.213 |  Val. Acc: 92.07%
Epoch: 05 | Epoch Time: 30m 46s
	Train Loss: 0.169 | Train Acc: 93.53%
	 Val. Loss: 0.294 |  Val. Acc: 88.13%
Epoch: 06 | Epoch Time: 30m 46s
	Train Loss: 0.140 | Train Acc: 94.72%
	 Val. Loss: 0.240 |  Val. Acc: 91.75%
Epoch: 07 | Epoch Time: 30m 46s
	Train Loss: 0.113 | Train Acc: 95.87%
	 Val. Loss: 0.251 |  Val. Acc: 92.09%
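
Assuming the best validation loss is initialised once before the loop and updated only on improvement, the logged losses imply which epochs would write a checkpoint. The sketch below replays that rule on the values copied from the log; `saved_epochs` is a hypothetical stand-in for the calls to `torch.save`.

```python
# Validation losses copied from the training log (epochs 1 to 7).
valid_losses = [0.254, 0.238, 0.235, 0.213, 0.294, 0.240, 0.251]

best_valid_loss = float("inf")
saved_epochs = []
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        saved_epochs.append(epoch)  # stands in for torch.save(...)

print(saved_epochs, best_valid_loss)  # [1, 2, 3, 4] 0.213
```

Under this rule the checkpoint on disk after training would be the one from epoch 4, even though epoch 7 has the highest validation accuracy.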


In [50]:
# load model
model.load_state_dict(torch.load('/root/userspace/day3/homework3/model.pt'))

<All keys matched successfully>

In [52]:

### Prediction ###
model.eval()  # switch to inference mode
data_id_pred = []
y_pred = []
with torch.no_grad():  # gradients are not needed for inference
    for batch in test_iterator:
        data_id_pred += [int(DATA_ID.vocab.itos[i]) for i in batch.data_id.tolist()]
        prediction = torch.sigmoid(model(batch.text))
        y_pred += prediction.tolist()

y_pred = [num for elem in y_pred for num in elem]
print("before y_pred",y_pred)
y_pred = list(map(np.round, y_pred))
print("after y_pred",y_pred)

### Output ###
submission = pd.DataFrame({"data_id": data_id_pred, "pred_label": y_pred}).sort_values("data_id")
submission.to_csv('/root/userspace/day3/homework3/submission3_pred.csv', header=True, index=False)

before y_pred [0.772295355796814, 0.9887113571166992, 0.1335458755493164, 0.9918262958526611, 0.8413504958152771, 0.022462090477347374, 0.9939664602279663, 0.0576518177986145, 0.042590025812387466, 0.003597792936488986, 0.9639655947685242, 0.005850784480571747, 0.3441779613494873, 0.9948787689208984, 0.00447110878303647, 0.004138737451285124, 0.9147233963012695, 0.005949359852820635, 0.02822481095790863, 0.9622145891189575, 0.26262903213500977, 0.01294595468789339, 0.006594264879822731, 0.9929289221763611, 0.9727805852890015, 0.004398246295750141, 0.013986392877995968, 0.285068154335022, 0.20555506646633148, 0.0019205103162676096, 0.24195368587970734, 0.2036529928445816, 0.9931548833847046, 0.985218346118927, 0.968177318572998, 0.9865947961807251, 0.9783720374107361, 0.9798664450645447, 0.8571558594703674, 0.9565150141716003, 0.913663923740387, 0.002240422647446394, 0.978661298751831, 0.0036333585157990456, 0.07946663349866867, 0.016762612387537956, 0.004209243226796389, 0.077679097652
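
The final step, turning probabilities into labels and writing rows sorted by `data_id`, can be sketched with the standard library alone. This is an illustration under assumptions: the ids are hypothetical, the pairing of ids with probabilities is made up, and the labels are written as integers, whereas the cell above writes the rounded floats produced by `np.round`.

```python
import csv
import io

def to_submission_rows(data_ids, probabilities, threshold=0.5):
    """Pair each id with a 0/1 label and sort the rows by id."""
    rows = [(i, int(p > threshold)) for i, p in zip(data_ids, probabilities)]
    return sorted(rows)

data_ids = [12, 3, 7]                  # hypothetical ids in batch order
probabilities = [0.772, 0.134, 0.989]  # illustrative probabilities
buffer = io.StringIO()                 # in-memory stand-in for the csv file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["data_id", "pred_label"])
writer.writerows(to_submission_rows(data_ids, probabilities))
print(buffer.getvalue())
```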