# Libraries

Installations

In [68]:
%%capture
!pip install transformers seqeval[gpu]

In [69]:
%%capture
!pip install fugashi
!pip install ipadic

In [70]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [71]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Imports

In [72]:
import pandas as pd
import numpy as np
import json
from tqdm import tqdm

from transformers import pipeline
import torch
from transformers import AutoModel, AutoTokenizer 
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification 

from sklearn.metrics import accuracy_score

In [73]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)

cuda


In [74]:
SEED = 42

# Japanese BERT

In [75]:
checkpoint = "cl-tohoku/bert-base-japanese"

In [76]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--cl-tohoku--bert-base-japanese/snapshots/5dc6dbba88a42d21da3b71025c109c42462307f2/config.json
Model config BertConfig {
  "_name_or_path": "cl-tohoku/bert-base-japanese",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "tokenizer_class": "BertJapaneseTokenizer",
  "transformers_version": "4.24.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 32000
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--cl-tohoku--bert-base-japanese/snapshots/5dc6dbba88a42d21da3b

In [77]:
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--cl-tohoku--bert-base-japanese/snapshots/5dc6

## Example of a title (Tokenization)

### Understanding example

See tutorial: [Hugging Face](https://huggingface.co/course/chapter2/5?fw=pt)

In [78]:
sequence = "「東海道五十三次」  「三十八」「藤川」"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['「', '東海道', '五', '十', '三', '次', '」', '「', '三', '十', '八', '」', '「', '藤', '##川', '」']


In [79]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[36, 7174, 989, 714, 240, 288, 38, 36, 240, 714, 1035, 38, 36, 1408, 28698, 38]


In [80]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

「 東海道 五 十 三 次 」 「 三 十 八 」 「 藤川 」
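The `##` prefix in `'##川'` marks a subword that continues the previous token. As a rough sketch of how a greedy longest-match WordPiece split can produce such pieces: the vocabulary below is tiny and hypothetical, and the function name `wordpiece` is an illustration only. The real `BertJapaneseTokenizer` first segments the text into words before applying WordPiece, a step this sketch leaves out.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"東海道", "藤", "##川", "五", "十"}  # hypothetical, for illustration only
print(wordpiece("藤川", toy_vocab))    # ['藤', '##川']
print(wordpiece("東海道", toy_vocab))  # ['東海道']
print(wordpiece("江戸", toy_vocab))    # ['[UNK]']
```

This mirrors why `藤川` appears as `'藤', '##川'` in the token list and is joined back into `藤川` when decoding.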


1) Batch input

The following code fails (see the commented-out cell below) because we send a single sequence to the model, whereas Hugging Face Transformers models expect a batch of sequences by default:
`ValueError: not enough values to unpack (expected 2, got 1)`.

So we need to wrap the input in a batch. For example, two titles give `batched_tokens = [["「", "東海道", "五"], ["「", "藤"]]`, which become `batched_ids = [[36, 7174, 989], [36, 1408]]`.
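To make the missing batch dimension concrete, here is a minimal sketch in plain Python. The helper `shape` is hypothetical and only mimics what `tensor.shape` would report for a nested list. The error message suggests the model tries to unpack the input shape into two values (batch size and sequence length), which a flat list of IDs cannot provide.

```python
def shape(nested):
    """Return the shape of a rectangular nested list, similar to tensor.shape."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


ids = [36, 7174, 989, 714]  # first IDs of the tokenized title above
print(shape(ids))    # (4,)   -> sequence length only
print(shape([ids]))  # (1, 4) -> (batch size, sequence length)
```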

In [81]:
model_temp = AutoModelForSequenceClassification.from_pretrained(checkpoint)

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--cl-tohoku--bert-base-japanese/snapshots/5dc6

In [82]:
# input_ids = torch.tensor(ids)

# # This line will fail.
# model(input_ids)

In [83]:
input_ids = torch.tensor([ids])  # wrap in a list to add the batch dimension (e.g. batch_ids = [ids1, ids2]); very important!
print("Input IDs:", input_ids)

output = model_temp(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[   36,  7174,   989,   714,   240,   288,    38,    36,   240,   714,
          1035,    38,    36,  1408, 28698,    38]])
Logits: tensor([[0.6940, 0.1056]], grad_fn=<AddmmBackward0>)


2) Padding token

The padding token ID can be found in `tokenizer.pad_token_id`.

In [84]:
print('Padding id of Japanese Bert: ',tokenizer.pad_token_id)
print()

Padding id of Japanese Bert:  0



With padding alone, something goes wrong with the logits in our batched predictions (see the next cell): the second row should be the same as the logits for the second sequence on its own, but we get different values!

This is because the key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.

In [85]:
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model_temp(torch.tensor(sequence1_ids)).logits)
print(model_temp(torch.tensor(sequence2_ids)).logits)
print("VS")
print(model_temp(torch.tensor(batched_ids)).logits)

tensor([[ 0.2836, -0.1544]], grad_fn=<AddmmBackward0>)
tensor([[ 0.2651, -0.2056]], grad_fn=<AddmmBackward0>)
VS
tensor([[ 0.2836, -0.1544],
        [ 0.4044, -0.2048]], grad_fn=<AddmmBackward0>)


3) Attention mask

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).

In [86]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model_temp(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))

print(model_temp(torch.tensor(sequence1_ids)).logits)
print(model_temp(torch.tensor(sequence2_ids)).logits)
print("VS")
print(outputs.logits)

tensor([[ 0.2836, -0.1544]], grad_fn=<AddmmBackward0>)
tensor([[ 0.2651, -0.2056]], grad_fn=<AddmmBackward0>)
VS
tensor([[ 0.2836, -0.1544],
        [ 0.2651, -0.2055]], grad_fn=<AddmmBackward0>)
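Why the mask restores the original logits can be illustrated with a masked softmax, the operation at the heart of attention. This is a simplified sketch with hypothetical scores: `masked_softmax` is not a library function, and real implementations typically add a large negative number to masked positions rather than using an exact negative infinity.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight to positions where mask == 0."""
    # Masked positions get -inf, so exp(-inf) = 0 and they receive no attention
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]  # hypothetical attention scores; the last position is padding
print(masked_softmax(scores, [1, 1, 1]))   # padding takes a share of the attention
print(masked_softmax(scores, [1, 1, 0]))   # padding gets weight 0
print(masked_softmax(scores[:2], [1, 1]))  # same weights as the unpadded sequence
```

With the mask applied, the real tokens receive the same weights as in the unpadded sequence, which matches the near-identical logits in the output above.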


4) Shorter sequences: padding vs. longer sequences: truncation

With Transformer models, there is a limit to the lengths of the sequences we can pass to the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:

- Use a model with a longer supported sequence length.
- Truncate your sequences.

Otherwise, we recommend you truncate your sequences by specifying the `max_sequence_length` parameter:


`sequence = sequence[:max_sequence_length]`
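Padding and truncation can be combined in one step. The following is a minimal sketch on plain lists: the helper `pad_and_truncate` is hypothetical, it uses padding ID 0 as reported by `tokenizer.pad_token_id` above, and it ignores special tokens such as `[CLS]` and `[SEP]`, which the real tokenizer handles.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate sequences longer than max_length, then pad the rest to a common length."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return input_ids, attention_mask


input_ids, attention_mask = pad_and_truncate(
    [[200, 200, 200, 200, 200], [200, 200]], max_length=4
)
print(input_ids)       # [[200, 200, 200, 200], [200, 200, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```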

### Run example

In [87]:
sequences = [
   "「東海道五十三次」  「三十八」「藤川」",
   "「東都六玉顔ノ内」  「角田川」",
] # 1) batch of sentences

In [88]:
batch = tokenizer(sequences, 
                  padding=True, # 2) & 3) Shorter sequences: pad to the longest sequence in the batch and build the matching attention mask
                  truncation=True, # 4) Longer sequences: truncate to the maximum length accepted by the model
                  return_tensors="pt"
                  )
batch

{'input_ids': tensor([[    2,    36,  7174,   989,   714,   240,   288,    38,    36,   240,
           714,  1035,    38,    36,  1408, 28698,    38,     3],
        [    2,    36, 26503,  1688,  2631,  2679,   534,   186,    38,    36,
         24334,   529,    38,     3,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

---

In [89]:
# import torch
# from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification


In [90]:
# # This is new
# batch["labels"] = torch.tensor([1, 1])

# optimizer = AdamW(model_temp.parameters())
# loss = model_temp(**batch).loss
# loss.backward()
# optimizer.step()

## Fine-tuning

**Unsupervised learning**

Fine-tuning a masked language model:

We first want to fine-tune the language model on the meisho data, before training our task-specific head (NER).

See tutorial: [Hugging Face](https://huggingface.co/course/chapter7/3?fw=pt)

Further training of Japanese BERT: fine-tune the pretrained model on the meisho dataset of over 20,000 titles.

In [91]:
bert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> Japanese BERT number of parameters: {round(bert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> Japanese BERT number of parameters: 111M'
'>>> BERT number of parameters: 110M'


### Load Dataset

- Get the meisho-e dataset (over 20,000 unlabeled titles)
- Load it in a format that Hugging Face accepts, using `load_dataset` (see tutorial: [Hugging Face](https://huggingface.co/course/chapter5/2?fw=pt))

In [92]:
from datasets import load_dataset

In [93]:
url = ""
data_files = {
    #"unsupervised": url + "train_place.csv",
    "unsupervised": url + "arc_meisho_full.csv",
    #"unsupervised": url + "arc_meisho.csv",
}

meisho_dataset = load_dataset("csv", data_files=data_files)




In [94]:
meisho_dataset

DatasetDict({
    unsupervised: Dataset({
        features: ['title', 'link', 'full_title'],
        num_rows: 20346
    })
})

In [95]:
# meisho_dataset["unsupervised"]['title']

In [96]:
# extract features/columns names of the dataset
meisho_features = meisho_dataset["unsupervised"].features.keys()

meisho_features

dict_keys(['title', 'link', 'full_title'])

#### Example

In [97]:
correct_text = "「東海道五十三次」  「三十八」「藤川」"
text = "「東海道五十三[MASK]」  「三十八」「藤川」"

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits # raw prediction scores (logits) over the vocabulary, not yet probabilities

# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]

# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> 「東海道五十三次」  「三十八」「藤川」'
'>>> 「東海道五十三駅」  「三十八」「藤川」'
'>>> 「東海道五十三度」  「三十八」「藤川」'
'>>> 「東海道五十三号線」  「三十八」「藤川」'
'>>> 「東海道五十三[UNK]」  「三十八」「藤川」'
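The cell above ranks candidates by raw logits. If probabilities are wanted, a softmax over the vocabulary dimension converts the logits first; the ranking stays the same. A minimal sketch with a hypothetical four-token vocabulary (`softmax` and `top_k` here are plain-Python helpers, not the `torch` functions):

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (index, probability) pairs for the k most likely entries."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


toy_logits = [1.0, 3.0, 0.5, 2.0]  # hypothetical scores for a 4-token vocabulary
probs = softmax(toy_logits)
for index, p in top_k(probs, 2):
    print(index, round(p, 3))
```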


---

In [98]:
# from datasets import load_dataset

# url = ""
# data_files = {
#     "train": url + "train_place.csv",
#     "test": url + "test_place.csv",
#     "unsupervised": url + "arc_meisho.csv",
# }

# meisho_dataset = load_dataset("csv", data_files=data_files)

# meisho_dataset

In [99]:
meisho_dataset["unsupervised"][0]

{'title': '芳年「東海道\u3000京都之内」「大内能上覧図」',
 'link': 'https://www.arc.ritsumei.ac.jp/archive01/theater/th_image/PB/arc/Prints/arcUP/arcUP0542.jpg',
 'full_title': 'arcUP0542文久０３・・芳年「東海道\u3000京都之内」「大内能上覧図」'}

In [100]:
sample = meisho_dataset["unsupervised"].shuffle(seed=SEED).select(range(3))

for row in sample:
    print(f"\n'>>> Title: {row['title']}'")




'>>> Title: 北斎「奧津」「江尻へ一リ卅丁」'

'>>> Title: 歌麿「江戸名所十景」「王子社の雪」'

'>>> Title: 「四月中松ノ尾祭」'


### Preprocessing

Masked language modeling: 

A common preprocessing step is to concatenate all the examples and then split the whole corpus into chunks of equal size. This is quite different from our usual approach, where we simply tokenize individual examples. 

We concatenate everything together, because individual examples might get truncated if they’re too long, and that would result in losing information that might be useful for the language modeling task!

**Important:** we first tokenize our corpus as usual, but **without** setting the `truncation=True` option in our tokenizer.

##### 1) Tokenize text

In [101]:
tokenizer.is_fast

False

In [102]:
def tokenize_function(examples, sentence_colname:str = "title"):
    result = tokenizer(examples[sentence_colname])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

In [103]:
tokenized_datasets = meisho_dataset.map(
    tokenize_function, 
    batched=True,  # process examples in batches; the speed-up is largest with a fast tokenizer
    remove_columns=meisho_features  # remove all the original columns, as we only need the tokenized sentences (e.g. ["title", "link", "full_title"])
)



In [104]:
tokenized_datasets

DatasetDict({
    unsupervised: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 20346
    })
})

##### 2) Split into chunks


(Max length value is derived from the `tokenizer_config.json` file associated with a checkpoint; in this case, the context size is 512 tokens, just like with original BERT)

In [105]:
print("Japanese BERT max length of sentence:", tokenizer.model_max_length)

Japanese BERT max length of sentence: 512


In order to run our experiments on GPUs like those found on Google Colab, we’ll pick something a bit smaller that can fit in memory:

In [106]:
chunk_size = 128

###### Example


Concatenate the titles into one sequence

In [107]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["unsupervised"][:5]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Title {idx} length: {len(sample)}'")

'>>> Title 0 length: 17'
'>>> Title 1 length: 15'
'>>> Title 2 length: 18'
'>>> Title 3 length: 13'
'>>> Title 4 length: 15'


In [108]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated titles length: {total_length}'")

'>>> Concatenated titles length: 78'


Split the concatenated titles into chunks of the size given by `chunk_size`.

In [109]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 78'


The last chunk will generally be smaller than the maximum chunk size. There are two main strategies for dealing with this:

- Drop the last chunk if it’s smaller than `chunk_size` (implemented in `group_texts` below).
- Pad the last chunk until its length equals `chunk_size`.
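Both strategies can be sketched on a plain list of token IDs. The helper `chunk_tokens` and its `strategy` argument are hypothetical names used for illustration, and padding ID 0 is an assumption matching the tokenizer's `pad_token_id`.

```python
def chunk_tokens(token_ids, chunk_size, strategy="drop", pad_id=0):
    """Split token_ids into chunks; 'drop' discards a short last chunk, 'pad' fills it."""
    chunks = [token_ids[i : i + chunk_size] for i in range(0, len(token_ids), chunk_size)]
    if chunks and len(chunks[-1]) < chunk_size:
        if strategy == "drop":
            chunks = chunks[:-1]
        elif strategy == "pad":
            chunks[-1] = chunks[-1] + [pad_id] * (chunk_size - len(chunks[-1]))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return chunks


tokens = list(range(1, 11))  # 10 hypothetical token IDs
print(chunk_tokens(tokens, 4, "drop"))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(chunk_tokens(tokens, 4, "pad"))   # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0]]
```

Note that dropping is lossy for small inputs: the 78 concatenated tokens from the five sample titles would yield no chunk at all with `chunk_size = 128`.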

###### Create chunks

In [110]:
def group_texts(examples, chunk_size:int = 128):

    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}

    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]]) 

    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size

    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }

    # Create a new labels column: in masked language modeling the objective is
    # to predict randomly masked tokens in the input batch, so the labels column
    # provides the ground truth for our language model to learn from.
    result["labels"] = result["input_ids"].copy()

    return result

In [111]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets



DatasetDict({
    unsupervised: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2995
    })
})
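One subtlety worth keeping in mind: with `batched=True`, `map` calls `group_texts` once per batch of examples (1,000 by default in the `datasets` library), so the leftover tokens are dropped at the end of every batch, not only once at the end of the corpus. The sketch below simulates this with hypothetical title lengths; `count_chunks` is an illustrative helper, not part of any library.

```python
def count_chunks(lengths, chunk_size=128, map_batch_size=1000):
    """Count chunks when concatenation happens separately inside each map batch."""
    total_chunks = 0
    for start in range(0, len(lengths), map_batch_size):
        batch_tokens = sum(lengths[start : start + map_batch_size])
        total_chunks += batch_tokens // chunk_size  # the remainder of this batch is dropped
    return total_chunks


lengths = [17] * 2500  # hypothetical: 2,500 titles of 17 tokens each
print(count_chunks(lengths))  # 330 chunks when grouping batch by batch
print(sum(lengths) // 128)    # 332 chunks if everything were grouped at once
```

The loss is small relative to the corpus, so it is usually acceptable, but it explains why the chunk count can be slightly lower than the total token count divided by `chunk_size`.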

In [112]:
tokenizer.decode(lm_datasets["unsupervised"][1]["input_ids"])

'」 「 御供所 」 「 若宮 」 「 別 雷 皇 太 神宮 」 「 杉尾 社 」 「 仮 殿 」 [SEP] [CLS] 国貞 「 東海道 名所 之 内 」 「 京 加茂 」 「 山科 」 「 黒谷 」 「 吉田山 」 「 将軍 塚 」 「 比叡山 」 「 比良 」 [SEP] [CLS] 暁斎 「 東海道 名所 之 内 」 「 加茂 の 競馬 」 [SEP] [CLS] 豊国 「 東海道 名所 之 内 」 「 [UNK] 河原 」 「 [UNK] 川原 」 「 みたらし 川 」 「 河合 社 」 [SEP] [CLS] 芳 艶 「 東海道 名所 之 内 」 「 祇園 祭礼 」 [SEP] [CLS] 芳 幾 「 東海道 京都 名所 之 内'

In [113]:
tokenizer.decode(lm_datasets["unsupervised"][1]["labels"])

'」 「 御供所 」 「 若宮 」 「 別 雷 皇 太 神宮 」 「 杉尾 社 」 「 仮 殿 」 [SEP] [CLS] 国貞 「 東海道 名所 之 内 」 「 京 加茂 」 「 山科 」 「 黒谷 」 「 吉田山 」 「 将軍 塚 」 「 比叡山 」 「 比良 」 [SEP] [CLS] 暁斎 「 東海道 名所 之 内 」 「 加茂 の 競馬 」 [SEP] [CLS] 豊国 「 東海道 名所 之 内 」 「 [UNK] 河原 」 「 [UNK] 川原 」 「 みたらし 川 」 「 河合 社 」 [SEP] [CLS] 芳 艶 「 東海道 名所 之 内 」 「 祇園 祭礼 」 [SEP] [CLS] 芳 幾 「 東海道 京都 名所 之 内'

##### 3) Add [MASK] tokens to the inputs

Insert [MASK] tokens at random positions in the inputs using `DataCollatorForLanguageModeling`.

In `DataCollatorForLanguageModeling`, the `mlm_probability` argument specifies what fraction of the tokens to mask. We’ll pick 15%, which is the amount used for BERT and a common choice in the literature:

In [114]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, 
                                                mlm_probability=0.15)

Example

In [115]:
samples = [lm_datasets["unsupervised"][i] for i in range(2)]

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] 芳年 「 東海道 京都 之 [MASK] 」 「 大内 能 上覧 図 」 [SEP] [CLS] 豊国 「 東海道 京都 名所 之 内 [MASK] [MASK] 四条 河原 」 [SEP] [CLS] [MASK]斎 「 東海道 名所 之 内 」 「 御 能 拝見 之 図 」 [SEP] [CLS] 芳盛 「 東海道 」 「 京都 [MASK] 震 殿 」 [SEP] [CLS] 芳盛 「 東海 道之 内 」 「 京都 参内 」 [SEP] [CLS] 芳盛 「 東海 道 [MASK] 内 」 「 京 」 「 大内 [UNK] 之 遊覧 」 [SEP] [CLS] 豊国 「 東海道 名所 之 内 」 「 上加茂 」 [MASK] 岩 [MASK] 」 「 三本杉 」 「 片岡 [MASK] 」 「 楼門'

'>>> 」 「 [MASK]供 [MASK] [MASK] 「 若宮 [MASK] 「 [MASK] 雷 皇 太 神宮 」 「 杉尾 社 」 「 仮 殿 」 [SEP] [CLS] 国 [MASK] 「 [MASK] 名所 之 内 」 「 京 加茂 」 「 山科 」 「 黒谷 」 「 [MASK]山 」 「 将軍 塚 」 [MASK] 比叡山 」 「 比良 」 [SEP] [CLS] 暁斎 「 東海道 名所 之 [MASK] [MASK] [MASK] 加茂 [MASK] 競馬 」 [SEP] [CLS] 豊国 「 東海道 名所 之 内 戦時 [MASK] [UNK] [MASK] 」 「 [UNK] 川原 」 「 みたらし 川 」 「 河合 社 」 [SEP] [CLS] 芳 艶 「 東海道 名所 之 内 」 「 祇園 祭礼 」 [SEP] [CLS] 芳 幾 「 東海道 [MASK] 名所 [MASK] 内'
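The stray token `戦時` in the second chunk is expected: BERT-style masking replaces a selected token with `[MASK]` about 80% of the time, with a random token about 10% of the time, and leaves it unchanged otherwise. Unselected positions get the label `-100`, which the loss ignores. The following is a simplified sketch of that rule, not the collator's actual code: `mask_tokens` is a hypothetical helper, the `mask_id` value is a placeholder, and unlike the real collator it does not protect special tokens such as `[CLS]` and `[SEP]`.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Sketch of BERT-style masking: returns (masked_inputs, labels)."""
    rng = random.Random(seed)
    inputs, labels = list(input_ids), []
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            labels.append(-100)  # not selected: ignored by the loss
            continue
        labels.append(token)  # selected: the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # about 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # about 10%: replace with a random token
        # remaining cases: keep the original token
    return inputs, labels


ids = list(range(100, 140))  # 40 hypothetical token IDs
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=32000)  # mask_id is a placeholder
print(sum(label != -100 for label in labels), "of", len(ids), "tokens selected")
```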


When training models for masked language modeling, there are two masking techniques:

1) Mask individual tokens, such as `##ontas` from the word `kainrontas`.

2) Mask whole words together (e.g. the word `kainrontas`), not just individual tokens. This approach is called whole word masking. If we want to use whole word masking, we need to build a data collator ourselves.

We apply only 1) for now: whole word masking needs the `word_ids`, which are only available from a fast tokenizer, and here `tokenizer.is_fast` is `False`.
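Without `word_ids`, one possible workaround is to rebuild word groups from the `##` continuation prefix of WordPiece tokens. This is only a sketch under that assumption: `group_subwords` and `mask_whole_words` are hypothetical helpers, and the grouping may not match exactly what a fast tokenizer would report.

```python
def group_subwords(tokens):
    """Group token indices into words, attaching '##' pieces to the preceding token."""
    words = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and words:
            words[-1].append(index)
        else:
            words.append([index])
    return words


def mask_whole_words(tokens, word_indices_to_mask, mask_token="[MASK]"):
    """Replace every token of the chosen words with the mask token."""
    words = group_subwords(tokens)
    masked = list(tokens)
    for word_index in word_indices_to_mask:
        for index in words[word_index]:
            masked[index] = mask_token
    return masked


tokens = ["「", "藤", "##川", "」"]  # end of the tokenization example earlier
print(group_subwords(tokens))         # [[0], [1, 2], [3]]
print(mask_whole_words(tokens, [1]))  # ['「', '[MASK]', '[MASK]', '」']
```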

##### 4) Split dataset

In [116]:
lm_datasets["unsupervised"]

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 2995
})

In [117]:
lm_datasets

DatasetDict({
    unsupervised: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2995
    })
})

In [118]:
sample_size = lm_datasets["unsupervised"].num_rows
train_size = int(0.8 * sample_size)  # 80% for training
test_size = int(0.2 * sample_size)  # 20% for evaluation

downsampled_dataset = lm_datasets["unsupervised"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset



DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2396
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 599
    })
})

### Train Models

In [119]:
batch_size = 64
model_name = 'bert-japanese'

In [120]:
# Log the training loss roughly once per epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
if logging_steps <= 0:
    logging_steps = 1
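As a quick sanity check of the numbers involved, the arithmetic can be worked out by hand. This sketch assumes a single device, no gradient accumulation, and the three epochs that the training log reports; the variable names are chosen for illustration.

```python
import math

num_train_examples = 2396  # size of the train split shown above
batch_size = 64
num_epochs = 3  # as reported in the training log

steps_per_epoch = math.ceil(num_train_examples / batch_size)  # the last partial batch still counts
total_steps = steps_per_epoch * num_epochs
logging_steps = max(1, num_train_examples // batch_size)

print(steps_per_epoch, total_steps, logging_steps)  # 38 114 37
```

The total of 114 matches the "Total optimization steps" line in the training log. Because integer division rounds down, the loss is logged every 37 steps, slightly more often than once per 38-step epoch.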

In [121]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-meisho",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True, # to enable mixed-precision training, which gives us another boost in speed 
    logging_steps=logging_steps, #to ensure we track the training loss with each epoch
    #push_to_hub=True,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [122]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
)

Using cuda_amp half precision backend


In [123]:
import math

# compute the resulting perplexity on the test set before fine-tune

eval_results_before = trainer.evaluate() # compute the cross-entropy loss on the test set
print(f">>> Perplexity: {math.exp(eval_results_before['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 599
  Batch size = 64


>>> Perplexity: 4.66


A lower perplexity score means a better language model, so we check whether fine-tuning improves the result.
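Perplexity is the exponential of the average cross-entropy loss, which is why the cell above applies `math.exp` to `eval_loss`. A minimal sketch with hypothetical per-token probabilities (the function `perplexity` is an illustration, not a library call):

```python
import math


def perplexity(token_probabilities):
    """Exponential of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in token_probabilities]
    return math.exp(sum(nll) / len(nll))


# A model that gives every correct token probability 0.25 is as uncertain
# as a uniform guess among 4 options, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
# More confident predictions give a lower perplexity.
print(perplexity([0.9, 0.8, 0.95]))
```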

In [124]:
# Fine-tune the model (training mode)
trainer.train()

***** Running training *****
  Num examples = 2396
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 114
  Number of trainable parameters = 110650880


Epoch,Training Loss,Validation Loss
1,1.5354,1.312717
2,1.3602,1.238028
3,1.3027,1.226828


***** Running Evaluation *****
  Num examples = 599
  Batch size = 64


***** Running Evaluation *****
  Num examples = 599
  Batch size = 64
***** Running Evaluation *****
  Num examples = 599
  Batch size = 64


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=114, training_loss=1.3991141612069649, metrics={'train_runtime': 67.6707, 'train_samples_per_second': 106.22, 'train_steps_per_second': 1.685, 'total_flos': 472987207729152.0, 'train_loss': 1.3991141612069649, 'epoch': 3.0})

In [125]:
# compute the resulting perplexity on the test set
eval_results = trainer.evaluate()

***** Running Evaluation *****
  Num examples = 599
  Batch size = 64


In [126]:
print(f">>> Perplexity Before Fine-tune: {math.exp(eval_results_before['eval_loss']):.2f}")
print(f">>> Perplexity After Fine-tune: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity Before Fine-tune: 4.66
>>> Perplexity After Fine-tune: 3.40


In [127]:
# trainer.push_to_hub()

# Transfer learning

**Custom Named Entity Recognition with Japanese BERT**


In [128]:
# # create map between labels and an id
# labels_to_ids = {k: v for v, k in enumerate(['O','PLACE'])}
# ids_to_labels = {v: k for v, k in enumerate(['O','PLACE'])}

In [129]:
# labels_to_ids

- Load the data in a format that Hugging Face accepts, using `load_dataset` (see tutorial: [Hugging Face](https://huggingface.co/course/chapter5/2?fw=pt))

For more details, see the Japanese BERT fine-tuning section above.

In [130]:
# from datasets import load_dataset

# url = ""
# data_files = {
#     "train": url + "train_place.csv",
#     "test": url + "test_place.csv",
# }

# meisho_dataset = load_dataset("csv", data_files=data_files)

# meisho_dataset

In [131]:
# meisho_dataset["train"][0]

In [132]:
# sample = meisho_dataset["train"].shuffle(seed=SEED).select(range(3))

# for row in sample:
#     print(f"\n'>>> Title: {row['title']}'")

# Ideas

- Use a model called [DistilBERT](https://huggingface.co/distilbert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France.) that can be trained much faster with little to no loss in downstream performance. This model was trained using a special technique called [knowledge distillation](https://en.wikipedia.org/wiki/Knowledge_distillation), where a large “teacher model” like BERT is used to guide the training of a “student model” that has far fewer parameters. An explanation of the details of knowledge distillation would take us too far afield in this section, but if you’re interested you can read all about it in *Natural Language Processing with Transformers* (colloquially known as the Transformers textbook).