<a href="https://colab.research.google.com/github/KazukiHirata-sun/ai_project_dev_2022/blob/main/section_2/BERT_Fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification of Japanese Sentences with BERT
[Fine-tune a pre-trained BERT model](https://towardsdatascience.com/what-exactly-happens-when-we-fine-tune-bert-f5dc32885d76) on a Japanese news dataset to classify articles by category.

## Installing the Libraries We Need


In [1]:
!pip install transformers
!pip install nlp
!pip install datasets
!pip install fugashi
!pip install ipadic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.2-py3-none-any.whl (4.7 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.21.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nlp
  Downloading nlp-0.4.0-py3-none-any.whl (1.7 MB)
Coll

## Connecting with Google Drive
Mount our Google Drive using the authorization code.

In [2]:
from google.colab import drive
drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [3]:
# Set up your own working folder
workFolder = "/content/drive/MyDrive/Colab Notebooks/AI Project Development Course/2回/"

## Loading Dataset
Load a news dataset stored on Google Drive.

In [4]:
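import glob
import os

raw_data_path = workFolder + "text/"

# Each subdirectory of raw_data_path holds the articles of one news category
dir_files = os.listdir(path=raw_data_path)
dirs = [f for f in dir_files if os.path.isdir(os.path.join(raw_data_path, f))]

# Characters removed from every article: newlines, tabs, and full-width spaces
strip_table = str.maketrans({"\n": "", "\t": "", "\r": "", "\u3000": ""})

text_label_data = []  # list of [text, label] pairs
dir_count = 0
file_count = 0

# The position of each directory in `dirs` is used as its class label
for i, dir_name in enumerate(dirs):
    files = glob.glob(os.path.join(raw_data_path, dir_name, "*.txt"))
    dir_count += 1

    for file in files:
        # The license file is not an article, so skip it
        if os.path.basename(file) == "LICENSE.txt":
            continue

        with open(file, "r", encoding="utf-8") as f:
            # Skip the first three lines of each file and keep the rest as the article text
            text = "".join(f.readlines()[3:])
            text = text.translate(strip_table)
            text_label_data.append([text, i])

        file_count += 1
        print("\rfiles: " + str(file_count) + " dirs: " + str(dir_count), end="")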

files: 7367 dirs: 9
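
The labels above are the positions of the category directories in the order returned by `os.listdir`, and that order is not guaranteed to be the same on every machine or run. A minimal sketch of one way to make the mapping explicit, assuming the category names are simply sorted alphabetically. The helper `build_label_map` is hypothetical and not part of the original notebook; the two example names are category directories that appear later in this notebook.

```python
def build_label_map(dir_names):
    """Map each category directory name to a stable integer label."""
    # Sorting removes any dependence on the filesystem listing order
    return {name: index for index, name in enumerate(sorted(dir_names))}


# Two category names from this notebook, deliberately given out of order
label_map = build_label_map(["sports-watch", "it-life-hack"])

# Reverse mapping, useful for turning a predicted index back into a name
id_to_name = {index: name for name, index in label_map.items()}

print(label_map)      # {'it-life-hack': 0, 'sports-watch': 1}
print(id_to_name[0])  # it-life-hack
```

Saving such a mapping next to the model would let the prediction step look up category names without listing the directory again.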

## Saving Data
Divide the data into training and test sets and save them as CSV files on Google Drive.

In [5]:
import csv
from sklearn.model_selection import train_test_split

# Split into training and test data (scikit-learn holds out 25% for testing by default)
news_train, news_test = train_test_split(text_label_data, shuffle=True)
data_path = workFolder + "data/"

os.makedirs(data_path, exist_ok=True)

# newline="" lets the csv module control line endings; UTF-8 keeps the Japanese text intact
with open(data_path + "news_train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(news_train)

with open(data_path + "news_test.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(news_test)
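
Because article text can contain commas and quotation marks, it may be worth confirming that rows survive a write/read round trip. The following is a small sketch using only the standard library; the sample rows and the temporary file are made up for illustration and are not taken from the dataset.

```python
import csv
import os
import tempfile

# Hypothetical [text, label] rows, including a comma and quotes inside the text
rows = [["今日は晴れ、明日は雨", 0], ['He said "hello", then left', 1]]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")

    # Write the rows the same way the notebook writes its CSV files
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

    # Read them back; csv returns every field as a string, so restore the int label
    with open(csv_path, "r", newline="", encoding="utf-8") as f:
        restored = [[text, int(label)] for text, label in csv.reader(f)]

print(restored == rows)  # True
```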

## Loading Models and Tokenizers
Load a pre-trained Japanese BERT model and its associated tokenizer.

In [6]:
from transformers import BertForSequenceClassification, BertJapaneseTokenizer

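model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"

# num_labels=9 matches the nine category directories found when loading the dataset
sc_model = BertForSequenceClassification.from_pretrained(model_name, num_labels=9)
sc_model.cuda()  # move the model to the GPU
tokenizer = BertJapaneseTokenizer.from_pretrained(model_name)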

Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialize

## Loading Data Sets
Load the saved news data and tokenize it.

In [7]:
from datasets import load_dataset

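def tokenize(batch):
    # Pad to the longest sequence in the batch and truncate to at most 128 tokens
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=128)

data_path = workFolder + "data/"

# The CSV files have no header row, so the column names are given explicitly.
# batch_size=len(...) tokenizes each split as a single batch, so all examples share one padded length.
train_data = load_dataset("csv", data_files=data_path+"news_train.csv", column_names=["text", "label"], split="train")
train_data = train_data.map(tokenize, batched=True, batch_size=len(train_data))
# attention_mask is passed along so that padding tokens are ignored by the model
train_data.set_format("torch", columns=["input_ids", "attention_mask", "label"])

test_data = load_dataset("csv", data_files=data_path+"news_test.csv", column_names=["text", "label"], split="train")
test_data = test_data.map(tokenize, batched=True, batch_size=len(test_data))
test_data.set_format("torch", columns=["input_ids", "attention_mask", "label"])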




Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-b2e3f7e5381559e0/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-b2e3f7e5381559e0/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-ad2bc0a3ab0a3b8e/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-ad2bc0a3ab0a3b8e/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.
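
To make the effect of `truncation=True`, `max_length`, and padding concrete, here is a simplified pure-Python sketch. It is not the tokenizer's implementation: the helper `pad_and_truncate` is hypothetical, it pads to a fixed length (whereas `padding=True` pads to the longest sequence in the batch), it ignores special tokens, and the padding ID of 0 is an assumption.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)                   # number of padding positions
    input_ids = kept + [pad_id] * n_pad              # padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical token IDs, with a tiny max_length so the effect is visible
short_ids, short_mask = pad_and_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_and_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5)

print(short_ids, short_mask)  # [11, 12, 13, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [11, 12, 13, 14, 15] [1, 1, 1, 1, 1]
```

The attention mask is what tells the model which positions are padding, which is why it is usually passed to the model together with `input_ids`.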



## Functions for evaluation
Use `accuracy_score` from `sklearn.metrics` to define a function for evaluating the model.

In [8]:
from sklearn.metrics import accuracy_score

def compute_metrics(result):
    labels = result.label_ids
    preds = result.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {
        "accuracy": acc,
    }
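
The function above takes the highest-scoring class for each example and compares it with the true label. A toy pure-Python sketch of the same computation, without NumPy or scikit-learn; the helper names and the logits are invented for illustration:

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == t for p, t in zip(preds, labels))
    return correct / len(labels)


# Four hypothetical examples with three classes each
toy_logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 3.0, 0.5], [1.2, 1.1, 0.9]]
toy_labels = [0, 2, 1, 1]

print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```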

## Setting up a Trainer
Use the [Trainer](https://huggingface.co/transformers/main_classes/trainer.html) and [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments) classes to configure training.


In [9]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir = "./results",
    num_train_epochs = 2,
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 32,
    warmup_steps = 500, 
    weight_decay = 0.01,
    logging_dir = "./logs",
)

trainer = Trainer(
    model = sc_model,
    args = training_args,
    compute_metrics = compute_metrics,
    train_dataset = train_data,
    eval_dataset = test_data,
)
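
To illustrate what `warmup_steps = 500` means, the sketch below computes a learning rate that rises linearly during warmup and then decays linearly to zero. It assumes the Trainer's default linear schedule and a base learning rate of 5e-5; neither is set explicitly in the cell above, so treat both as assumptions about the installed version. The total of 1382 steps is the value reported in the training log, and the function name is hypothetical.

```python
def linear_schedule_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimization step."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: shrink linearly from base_lr to 0 at the final step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


base_lr = 5e-5       # assumed default learning rate
warmup_steps = 500   # from TrainingArguments above
total_steps = 1382   # "Total optimization steps" in the training log

for step in (0, 250, 500, 941, 1382):
    lr = linear_schedule_with_warmup(step, base_lr, warmup_steps, total_steps)
    print(step, lr)
```

With only 1382 steps in total, more than a third of this run is spent in warmup under these assumptions.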

## Model training
Fine-tune the model with the settings above.

In [10]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5525
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1382


| Step | Training Loss |
|------|---------------|
| 500  | 1.0945        |
| 1000 | 0.4055        |


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1382, training_loss=0.6132484027514754, metrics={'train_runtime': 305.6511, 'train_samples_per_second': 36.152, 'train_steps_per_second': 4.521, 'total_flos': 726889972723200.0, 'train_loss': 0.6132484027514754, 'epoch': 2.0})
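
The step count in the log can be reproduced from the dataset size and batch size. A short worked calculation, using only numbers from the log and from `TrainingArguments`; the checkpoint interval of 500 steps is an assumption inferred from the checkpoint names in the log:

```python
import math

num_examples = 5525   # "Num examples" in the training log
batch_size = 8        # per_device_train_batch_size on a single device
num_epochs = 2        # num_train_epochs

# The last, partially filled batch of each epoch still counts as one step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Assumed checkpoint interval of 500 steps
checkpoint_steps = list(range(500, total_steps + 1, 500))

print(steps_per_epoch, total_steps)  # 691 1382
print(checkpoint_steps)              # [500, 1000]
```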

## Evaluating Models
Call the trainer's `evaluate()` method to evaluate the model on the test data.

In [11]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1842
  Batch size = 32


{'eval_loss': 0.3917217254638672,
 'eval_accuracy': 0.9017372421281216,
 'eval_runtime': 13.7642,
 'eval_samples_per_second': 133.825,
 'eval_steps_per_second': 4.214,
 'epoch': 2.0}
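
As a quick sanity check, the reported accuracy can be converted back into a count of correctly classified articles. The numbers below are copied from the evaluation output above; nothing else is assumed.

```python
import math

eval_accuracy = 0.9017372421281216   # 'eval_accuracy' from the output
num_eval_examples = 1842             # "Num examples" in the evaluation log
eval_batch_size = 32                 # "Batch size" in the evaluation log

num_correct = round(eval_accuracy * num_eval_examples)
num_wrong = num_eval_examples - num_correct
num_batches = math.ceil(num_eval_examples / eval_batch_size)

print(num_correct, num_wrong, num_batches)  # 1661 181 58
```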

## Viewing Results with TensorBoard
Use [TensorBoard](https://www.tensorflow.org/tensorboard) to view the training process stored in the logs folder.

In [None]:
%load_ext tensorboard
%tensorboard --logdir logs

## Save Model
Save the trained model and its tokenizer to Google Drive.

In [13]:
model_path = workFolder + "model/"

os.makedirs(model_path, exist_ok=True)

trainer.save_model(model_path)
tokenizer.save_pretrained(model_path)

Saving model checkpoint to /content/drive/MyDrive/Colab Notebooks/AI Project Development Course/2回/model/
Configuration saved in /content/drive/MyDrive/Colab Notebooks/AI Project Development Course/2回/model/config.json
Model weights saved in /content/drive/MyDrive/Colab Notebooks/AI Project Development Course/2回/model/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/Colab Notebooks/AI Project Development Course/2回/model/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/Colab Notebooks/AI Project Development Course/2回/model/special_tokens_map.json


('/content/drive/MyDrive/Colab Notebooks/AI Project Development Course/2回/model/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/AI Project Development Course/2回/model/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/AI Project Development Course/2回/model/vocab.txt',
 '/content/drive/MyDrive/Colab Notebooks/AI Project Development Course/2回/model/added_tokens.json')

## Loading a model
Load the previously saved model and tokenizer.

In [14]:
loaded_model = BertForSequenceClassification.from_pretrained(model_path)
loaded_model.cuda()
loaded_tokenizer = BertJapaneseTokenizer.from_pretrained(model_path)

loading configuration file /content/drive/MyDrive/Colab Notebooks/AI Project Development Course/2回/model/config.json
Model config BertConfig {
  "_name_or_path": "cl-tohoku/bert-base-japanese-whole-word-masking",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,

## Japanese News Classification
Classify a news article using the loaded model.

In [16]:
import os
import torch

# Load the article to be classified
file = raw_data_path + "sports-watch/sports-watch-4764756.txt"
with open(file, "r", encoding="utf-8") as f:
    sample_text = f.readlines()[3:]
    sample_text = "".join(sample_text)
    sample_text = sample_text.translate(str.maketrans({"\n":"", "\t":"", "\r":"", "\u3000":""}))

# # https://www.infoq.com/jp/articles/ai-devops-takeover/?itm_source=articles_about_ai-ml-data-eng&itm_medium=link&itm_campaign=ai-ml-data-eng
# sample_text = "開発者の多くにとって、DevOpsの次に何が来るかを予測することは、ある種の気晴らしになっています。この10年間、私たちは、私たちの業界が急速に変化するのを目の当たりにしてきました。その間には、プログラマの役割も根本から変わってきています。"

# Tokenize manually and keep at most 512 token IDs (the model's max_position_embeddings).
# Note: unlike the tokenizer call used for training, this path adds no [CLS]/[SEP]
# special tokens and uses a longer maximum length than the 128 used for training.
max_length = 512
words = loaded_tokenizer.tokenize(sample_text)
word_ids = loaded_tokenizer.convert_tokens_to_ids(words)
word_tensor = torch.tensor([word_ids[:max_length]])

# Prediction (gradients are not needed for inference)
x = word_tensor.cuda()
with torch.no_grad():
    y = loaded_model(x)
pred = y[0].argmax(-1).item()  # index of the highest-scoring class

# Display the result: the class index maps back to a category directory name
raw_data_path = workFolder + "text/"
dir_files = os.listdir(path=raw_data_path)
dirs = [f for f in dir_files if os.path.isdir(os.path.join(raw_data_path, f))]
print("Result:", dirs[pred])

Result: it-life-hack
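
The prediction cell builds its input with `tokenize()` and `convert_tokens_to_ids()`, which, as far as I know, does not add the `[CLS]` and `[SEP]` tokens that the direct tokenizer call used for training adds. A minimal sketch of what wrapping and truncating the IDs could look like; the helper `add_special_tokens` and all ID values are hypothetical and chosen only for illustration:

```python
def add_special_tokens(token_ids, cls_id, sep_id, max_length):
    """Wrap token IDs as [CLS] ... [SEP], truncating so the total fits max_length."""
    body = token_ids[: max_length - 2]  # leave room for the two special tokens
    return [cls_id] + body + [sep_id]


# Hypothetical IDs for illustration only; real values come from the tokenizer
CLS_ID, SEP_ID = 2, 3
ids = add_special_tokens([101, 102, 103, 104, 105], CLS_ID, SEP_ID, max_length=6)

print(ids)  # [2, 101, 102, 103, 104, 3]
```

With the real tokenizer, the IDs would come from `loaded_tokenizer.cls_token_id` and `loaded_tokenizer.sep_token_id`. Alternatively, calling `loaded_tokenizer(sample_text, truncation=True, max_length=128, return_tensors="pt")` should produce an already wrapped input that matches the preprocessing used during training.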
