# Quick demo for training a **sentiment-analysis** adapter

First, install `adapter-transformers` from GitHub, along with the other modules required by the Japanese tokenizer.

In [None]:
!pip install git+https://github.com/adapter-hub/adapter-transformers.git
!git clone https://github.com/huggingface/transformers

In [None]:
!pip install mecab-python3==0.996.5
!pip install unidic-lite
!pip install toiro

In [None]:
import dataclasses
import os
import sys
from dataclasses import dataclass, field
from typing import Dict, Optional

import numpy as np
import pandas as pd

import torch
from transformers import (
    AdapterConfig,
    AdapterType,
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoModelWithHeads,
    AutoTokenizer,
    EvalPrediction,
    GlueDataset,
    GlueDataTrainingArguments,
    Trainer,
    TrainingArguments,
    glue_compute_metrics,
    glue_output_modes,
    glue_tasks_num_labels,
    set_seed,
)
from transformers import GlueDataTrainingArguments as DataTrainingArguments

from toiro import datadownloader

Currently, the `adapter-transformers` integration supports only BERT, RoBERTa, and XLM-RoBERTa.

Here, we load a pretrained model ([Pretrained BERT from TOHOKU NLP LAB](https://www.nlp.ecei.tohoku.ac.jp/news-release/3284/)) and add a new SST-2 task adapter.

[SST-2](https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary) is a binary sentiment-analysis benchmark for English, but by shaping our own dataset to match its format we were able to make it work for Japanese reviews.

In [None]:
model_name_or_path = "cl-tohoku/bert-base-japanese-whole-word-masking"
task_name = "sst-2"
adapter_config = "pfeiffer"
set_seed(71)

In [None]:
num_labels = glue_tasks_num_labels[task_name]
output_mode = glue_output_modes[task_name]

In [None]:
config = AutoConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name
    )

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, config=config)
model.add_adapter(task_name, AdapterType.text_task, config=adapter_config)

We freeze all parameters except those of the SST-2 adapter.

In addition, the classification head can be stored together with the adapter weights for reproducibility.

In [None]:
model.train_adapter([task_name])
model.set_active_adapters([task_name])
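
To get a feel for how little remains trainable once everything but the adapter is frozen, the sketch below estimates the number of adapter weights. It rests on assumptions that are not stated in this notebook: a BERT-base encoder with hidden size 768 and 12 layers, one bottleneck adapter per layer, and a reduction factor of 16, which is our understanding of the `pfeiffer` configuration. Layer norms and the classification head are ignored, and `bottleneck_adapter_params` is a hypothetical helper, not part of `adapter-transformers`.

```python
def bottleneck_adapter_params(hidden_size: int, reduction_factor: int) -> int:
    """Weights and biases of one down-projection / up-projection adapter."""
    bottleneck = hidden_size // reduction_factor
    down = hidden_size * bottleneck + bottleneck  # weight matrix + bias
    up = bottleneck * hidden_size + hidden_size   # weight matrix + bias
    return down + up


# Assumed values for a BERT-base model with a Pfeiffer-style adapter.
HIDDEN_SIZE = 768
NUM_LAYERS = 12
REDUCTION_FACTOR = 16

per_layer = bottleneck_adapter_params(HIDDEN_SIZE, REDUCTION_FACTOR)
total = per_layer * NUM_LAYERS
print(f"adapter parameters per layer: {per_layer:,}")
print(f"adapter parameters in total:  {total:,}")
```

Under these assumptions the adapter adds fewer than one million trainable weights, a small fraction of the roughly 110 million parameters of a BERT-base model.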

Next, we download the Yahoo Movie Reviews dataset and dump it into a local folder for training.

The pretrained adapter was trained on 12500 rows of data.

On a Colab GPU, training takes about 5 minutes per epoch for every 100 rows of data.
If you just want to run through a quick demo, you can choose a smaller number such as n=125.
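
The timing figure quoted above (about 5 minutes per epoch for every 100 rows on a Colab GPU) can be turned into a rough budget. The sketch below assumes that training time scales linearly with the number of rows, which is only an approximation; `estimate_training_minutes` is a hypothetical helper introduced for illustration.

```python
def estimate_training_minutes(rows: int, epochs: int,
                              minutes_per_100_rows: float = 5.0) -> float:
    """Linear extrapolation of the '5 minutes per epoch per 100 rows' figure."""
    return rows / 100 * minutes_per_100_rows * epochs


# Quick demo: n=125 rows for the 3 epochs used later in this notebook.
quick_demo = estimate_training_minutes(rows=125, epochs=3)

# One epoch over the 12500 rows mentioned for the pretrained adapter.
full_epoch = estimate_training_minutes(rows=12500, epochs=1)

print(f"n=125, 3 epochs:  about {quick_demo:.1f} minutes")
print(f"n=12500, 1 epoch: about {full_epoch / 60:.1f} hours")
```

The numbers are only as good as the linear-scaling assumption, but they show why a small `n` is convenient for a first run.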

In [None]:
corpus = "yahoo_movie_reviews"
datadownloader.download_corpus(corpus)
train_df, dev_df, test_df = datadownloader.load_corpus(corpus, n=125)

train_df.columns = ['label','sentence']
dev_df.columns = ['label','sentence']

train_df = train_df[['sentence', 'label']]
dev_df = dev_df[['sentence', 'label']]

In [None]:
data_path = "data"

if not os.path.exists(data_path):
    os.mkdir(data_path)

train_df.to_csv(os.path.join(data_path, "train.tsv"), sep = '\t', index = False)
dev_df.to_csv(os.path.join(data_path, "dev.tsv"), sep = '\t', index = False)
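
The two files written above use a header row followed by tab-separated `sentence` and `label` columns, which, to the best of our knowledge, is the layout the SST-2 data processor reads. The sketch below is a small standard-library check of that layout. `check_sst2_tsv` is a hypothetical helper, the sample rows are made up, and the check assumes that labels are the strings `0` and `1` as in the original SST-2 files.

```python
import csv
import os
import tempfile


def check_sst2_tsv(path: str) -> int:
    """Return the number of data rows, raising ValueError on a malformed file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        if header != ["sentence", "label"]:
            raise ValueError(f"unexpected header: {header}")
        count = 0
        for line_no, row in enumerate(reader, start=2):
            if len(row) != 2 or row[1] not in {"0", "1"}:
                raise ValueError(f"bad row at line {line_no}: {row}")
            count += 1
    return count


# Self-contained demonstration on a throwaway file with made-up rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "train.tsv")
    with open(sample_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["sentence", "label"])
        writer.writerow(["とても面白い映画でした。", "1"])
        writer.writerow(["退屈で途中で寝てしまった。", "0"])
    n_rows = check_sst2_tsv(sample_path)

print(f"{n_rows} valid rows")
```

In this notebook, the same helper could be pointed at `os.path.join(data_path, "train.tsv")` and `os.path.join(data_path, "dev.tsv")` before building the datasets.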

Next, we configure the training and data arguments and define the metric used for evaluation.

In [None]:
data_args = DataTrainingArguments(
    task_name = task_name, 
    data_dir = data_path, 
    max_seq_length = 128,
    overwrite_cache = True)

train_dataset = GlueDataset(
    data_args,
    tokenizer=tokenizer)

eval_dataset = GlueDataset(
    data_args,
    tokenizer=tokenizer,
    mode="dev")

In [None]:
def compute_metrics(p: EvalPrediction) -> Dict:
    # Pick the highest-scoring class for each example, then score it with the GLUE metric.
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)
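
For the SST-2 task, our understanding is that `glue_compute_metrics` reports plain accuracy under the key `acc`. The dependency-free sketch below mirrors the argmax-then-compare logic of `compute_metrics` on made-up logits. `argmax` and `accuracy_from_logits` are hypothetical helpers, not library functions, and the numbers are illustrative rather than results from this notebook.

```python
from typing import Dict, List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in one row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits: List[Sequence[float]],
                         labels: Sequence[int]) -> Dict[str, float]:
    """Fraction of examples whose highest-scoring class equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"acc": correct / len(labels)}


# Made-up logits for four examples with two classes (negative, positive).
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.0]]
toy_labels = [0, 1, 0, 0]

metrics = accuracy_from_logits(toy_logits, toy_labels)
print(metrics)
```

Three of the four toy predictions match their labels, so the sketch reports an accuracy of 0.75.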

In [None]:
output_path = "output"

if not os.path.exists(output_path):
    os.mkdir(output_path)

training_args = TrainingArguments(
    output_dir = output_path,
    per_device_train_batch_size = 1,
    learning_rate = 1e-4,
    num_train_epochs = 3.0,
    )

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    compute_metrics = compute_metrics,
    do_save_adapters = True,
    )

Finally, we train our adapter for 3 epochs.

In [None]:
trainer.train()

In addition, we can evaluate our adapter and export the full model and its adapters to local files.

In [None]:
eval_results = {}
eval_datasets = [eval_dataset]
for eval_dataset in eval_datasets:
    eval_result = trainer.evaluate(eval_dataset=eval_dataset)
    eval_results.update(eval_result)

In [None]:
eval_results
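
If the evaluation metrics should be kept next to the exported model, one option is to write the dictionary to a JSON file. This is a minimal sketch, not something the notebook itself does: `save_metrics` and the file name `eval_results.json` are hypothetical, the key names are assumed, and the values are placeholders rather than measured results.

```python
import json
import os
import tempfile


def save_metrics(metrics: dict, output_dir: str,
                 filename: str = "eval_results.json") -> str:
    """Write a metrics dictionary to output_dir and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        # default=float converts numeric types that json cannot serialize directly.
        json.dump(metrics, f, indent=2, sort_keys=True, default=float)
    return path


# Placeholder values only; these are not results from this notebook.
placeholder_results = {"eval_loss": 0.0, "eval_acc": 0.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_metrics(placeholder_results, tmp_dir)
    with open(saved_path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded)
```

In the notebook, `save_metrics(eval_results, output_path)` would place the file in the same folder that receives the saved model.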

In [None]:
trainer.save_model()