# Reference
https://www.kaggle.com/arnabs007/pretrain-a-bert-language-model-from-scratch

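# Corpus data
The large public corpora commonly used to train BERT models can be downloaded and loaded directly with the datasets library; see https://huggingface.co/docs/datasets/index.html for details. Your own corpus, or anonymized data, has to be preprocessed by hand. Put one sentence on each line, preferably with a delimiter between words.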

In [9]:
!pip install torch
!pip install tokenizers
!pip install transformers

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple


# Dataset
You can use your own text corpus or download one from OSCAR, a collection of huge multilingual corpora obtained by language classification and filtering of Common Crawl web dumps.

Keep in mind that pretraining your model on more data generally gives better results.

If you are using your own corpus, make sure that your text corpus is one sentence-per-line like this:

```
Mr. Cassius crossed the highway, and stopped suddenly.
Something glittered in the nearest red pool before him.
Gold, surely!
But, wonderful to relate, not an irregular, shapeless fragment of crude ore, fresh from Nature's crucible.
Looking at it more attentively, he saw that it bore the inscription, "May to Cass."
Like most of his fellow gold-seekers, Cass was superstitious.
```
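
If your raw text is not yet in that shape, a small script can get you most of the way. The sketch below is a deliberately naive splitter: it ends a sentence at `.`, `!` or `?` and skips a short abbreviation list. The `ABBREVIATIONS` set and the `split_sentences` name are my own assumptions for illustration; for a real corpus (and for languages other than English) you will want a proper sentence segmenter.

```python
# Hypothetical abbreviation list; extend it for your own corpus.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr."}


def split_sentences(text):
    """Split raw text into sentences with a naive punctuation rule."""
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    if not text:
        return []
    sentences, current = [], []
    for token in text.split(" "):
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?", '."', '!"', '?"'))
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


raw = (
    "Mr. Cassius crossed the highway, and stopped suddenly. "
    "Something glittered in the nearest red pool\nbefore him. Gold, surely!"
)
lines = split_sentences(raw)
print("\n".join(lines))  # one sentence per line, ready to write to a file
```

Writing `"\n".join(lines)` to a text file gives the one-sentence-per-line layout shown above.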

## Tokenization
We have to train our own tokenizer and build a vocabulary for our corpus. We use BertWordPieceTokenizer from the tokenizers library with an arbitrarily chosen vocab_size of 50,000. The vocabulary is saved to the output directory as a '<name>-vocab.txt' file.

I had a pretrained tokenizer for Bangla, so I am using that.
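
To get a feel for what the trained vocabulary is used for, here is a minimal sketch of how a WordPiece vocabulary is applied to a single word: greedy longest-match-first, with `##` marking pieces that continue a word. This is not the tokenizers library's implementation, and it says nothing about how the vocabulary itself is learned; `wordpiece_tokenize` and `toy_vocab` are made-up names for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:  # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:  # no piece fits: the whole word becomes unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"token", "##izer", "##s", "un", "##related"}
print(wordpiece_tokenize("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

A larger vocab_size means more whole words match directly and fewer words get broken into pieces.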

In [7]:
# Train a tokenizer
import tokenizers
 
bwpt = tokenizers.BertWordPieceTokenizer()
 
filepath = "./data/raw_bangla_for_BERT.txt"

bwpt.train(
    files=[filepath],
    vocab_size=50000,
    min_frequency=3,
    limit_alphabet=1000
)

bwpt.save_model('./data','bert-bangla')

['./data/bert-bangla-vocab.txt']

In [10]:
# Load the tokenizer
from transformers import BertTokenizer, LineByLineTextDataset

vocab_file_dir = './data/bert-bangla-vocab.txt' 

tokenizer = BertTokenizer.from_pretrained(vocab_file_dir)

sentence = 'শেষ দিকে সেনাবাহিনীর সদস্যরা এসব ঘর তাঁর প্রশাসনের কাছে হস্তান্তর করেন'

encoded_input = tokenizer.tokenize(sentence)
print(encoded_input)
# print(encoded_input['input_ids'])

['শেষ', 'দিকে', 'সেনাবাহিনীর', 'সদসযরা', 'এসব', 'ঘর', 'তার', 'পরশাসনের', 'কাছে', 'হসতানতর', 'করেন']
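
Look closely at the output: some marks are gone, for example সদস্যরা comes back as সদসযরা. A likely cause (my assumption, the notebook does not confirm it) is that BertTokenizer lowercases and strips accents by default, and accent stripping removes every Unicode nonspacing mark, which includes the Bangla hasanta. The sketch below reproduces that step with the standard library only; `strip_accents` here is my own helper, not the library function.

```python
import unicodedata


def strip_accents(text):
    """Drop nonspacing marks (Unicode category Mn) after NFD normalization."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The word সদস্যরা written with escapes so the hasanta (U+09CD) is easy to spot.
word = "\u09b8\u09a6\u09b8\u09cd\u09af\u09b0\u09be"
stripped = strip_accents(word)

print(word, "->", stripped)
print("category of U+09CD:", unicodedata.category("\u09cd"))
```

If this is what happens in your setup, it may be worth trying `do_lower_case=False` and `strip_accents=False` when loading BertTokenizer, and the matching `lowercase` and `strip_accents` options when training BertWordPieceTokenizer. Check these against the versions you have installed.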




In [11]:
# Note: LineByLineTextDataset has some known issues, see
# https://discuss.huggingface.co/t/how-to-train-a-language-model-from-scratch-when-my-dataset-is-bigger-than-ram/117
#
# transformers has a predefined class LineByLineTextDataset()
# which reads your text line by line and converts each line to tokens

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path='./data/raw_bangla_for_BERT.txt',
    block_size=128  # maximum sequence length
)

print('No. of lines: ', len(dataset))  # number of lines in your dataset




No. of lines:  2172033
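
Roughly speaking, the dataset class reads the file, drops blank lines, tokenizes each remaining line and truncates it to block_size. The sketch below mimics that behaviour in plain Python so you can see why the number of examples equals the number of non-blank lines. `load_line_by_line` and `toy_encode` are hypothetical stand-ins, not the transformers implementation.

```python
import os
import tempfile


def load_line_by_line(path, encode, block_size):
    """Return one truncated list of token ids per non-blank line of the file."""
    examples = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # blank lines do not become examples
                continue
            examples.append(encode(line)[:block_size])
    return examples


def toy_encode(text):
    """Hypothetical tokenizer stand-in: one id (the word length) per word."""
    return [len(word) for word in text.split()]


with tempfile.TemporaryDirectory() as tmp:
    corpus = os.path.join(tmp, "corpus.txt")
    with open(corpus, "w", encoding="utf-8") as handle:
        handle.write("Gold, surely!\n\nCass was superstitious.\n")
    examples = load_line_by_line(corpus, toy_encode, block_size=2)

print(len(examples), examples)
```

Note that everything ends up in one Python list, so the whole tokenized corpus has to fit in memory.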


## Defining model
Now that we have the training data ready to be fed into the model, let's define the model. First we define the configuration of the BERT model. vocab_size should be the size of your trained vocabulary. Keep the rest of the arguments as they are. I assume you know the transformer architecture well enough to understand these parameters.

We will use BertForMaskedLM from the transformers library, which trains with the masked language modelling (MLM) objective only and leaves out the next sentence prediction (NSP) task.

You also need to define a DataCollator. What is a DataCollator, you ask?

A DataCollator is a function that takes a list of samples from a Dataset and collates them into a batch, as a dictionary of tensors.

DataCollatorForLanguageModeling, used here, collates batches of tensors, honoring the tokenizer's pad_token, and preprocesses batches for masked language modeling.
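
The "preprocessing for masked language modeling" part is the interesting bit. Here is a plain-Python sketch of the masking recipe from the BERT paper, which is also what I understand the collator to apply: pick about 15% of the non-special tokens; of those, 80% become [MASK], 10% become a random token and 10% stay as they are. Labels are -100 everywhere else so the loss ignores those positions. The function name and the token ids are made up for illustration, and padding is left out.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    inputs, labels = list(input_ids), []
    for i, token in enumerate(input_ids):
        # Special tokens are never selected; others are selected with mlm_probability.
        if token in special_ids or rng.random() >= mlm_probability:
            labels.append(IGNORE_INDEX)
            continue
        labels.append(token)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random id
        # remaining 10%: leave the token unchanged
    return inputs, labels


CLS, SEP, MASK = 2, 3, 4  # hypothetical special-token ids
ids = [CLS, 11, 12, 13, 14, 15, 16, 17, SEP]
masked, labels = mask_tokens(ids, MASK, vocab_size=50_000,
                             special_ids={CLS, SEP}, rng=random.Random(0))
print(masked)
print(labels)
```

Because the selection is random, the same sentence gets a different mask every epoch.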

In [12]:
from transformers import BertConfig, BertForMaskedLM, DataCollatorForLanguageModeling

config = BertConfig(
    vocab_size=50000,
    hidden_size=768, 
    num_hidden_layers=6, 
    num_attention_heads=12,
    max_position_embeddings=512
)
 
model = BertForMaskedLM(config)
print('No of parameters: ', model.num_parameters())


data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

No of parameters:  81965648
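
Where does that number come from? The arithmetic below rebuilds it from the config. It assumes the BertConfig defaults for the values not set above (intermediate_size=3072, type_vocab_size=2), that BertForMaskedLM has no pooler, and that the output projection shares its weights with the word embeddings. The helper names are my own.

```python
def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def bert_mlm_parameter_count(vocab_size, hidden, layers, intermediate,
                             max_positions, type_vocab=2):
    layer_norm = 2 * hidden  # scale and shift vectors
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, then a LayerNorm.
    attention = 4 * linear(hidden, hidden) + layer_norm
    feed_forward = (linear(hidden, intermediate)
                    + linear(intermediate, hidden) + layer_norm)
    encoder = layers * (attention + feed_forward)
    # MLM head: dense + LayerNorm + output bias (output weights are tied).
    mlm_head = linear(hidden, hidden) + layer_norm + vocab_size
    return embeddings + encoder + mlm_head


total = bert_mlm_parameter_count(vocab_size=50_000, hidden=768, layers=6,
                                 intermediate=3072, max_positions=512)
print(f"{total:,}")  # 81,965,648
```

Almost half of the parameters (50,000 x 768 = 38.4M) sit in the word embedding table, so vocab_size has a big effect on model size. The number of attention heads does not change the count.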


## Defining training arguments
per_device_train_batch_size is not necessarily the same as the effective batch size of the BERT model. The two differ when you train on more than one GPU/TPU.

In practice, assuming you train on a single GPU (in Colab or on your PC), per_device_train_batch_size is the batch size of your BERT model. I have set it to 32 (the BERT paper recommends 16 or 32 for fine-tuning; its pretraining runs used a batch size of 256).
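
A quick back-of-the-envelope calculation shows what this means for the run. The helper below is a sketch with made-up names; it multiplies the per-device batch size by the number of devices and gradient accumulation steps, then derives the number of optimizer steps for the 2,172,033 lines counted earlier. It assumes the last, smaller batch is kept.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size,
                    num_devices=1, grad_accum_steps=1):
    """Return (effective batch size, optimizer steps per epoch)."""
    effective = per_device_batch_size * num_devices * grad_accum_steps
    return effective, math.ceil(num_examples / effective)


effective, steps = steps_per_epoch(2_172_033, 32)
checkpoints = steps // 10_000  # with save_steps=10_000 as in the cell below

print(effective, steps, checkpoints)  # 32 67877 6
```

So one epoch is about 68k steps on a single GPU, and with save_total_limit=2 only the last two of those checkpoints stay on disk.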

Then instantiate a trainer with the predefined model, training arguments, data collator and dataset.

In [16]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results/',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,  
)

## Train the model
We are at the last step of our language model pretraining. Call the trainer's train() method, then sit back and watch a movie, because this is going to take a long time depending on your corpus size.

Remember that Google's BERT-base was trained on 4 Cloud TPUs for 4 uninterrupted days. That is 16 Cloud TPU days, on 16 TPU chips in total!

I trained a model on a random newspaper article corpus of only 500MB, containing around 2.2M sentences and 30M words, and that took almost 4 hrs!

Don't forget to save the model! If you fall asleep (I am certain you will) and wake up to a disconnected runtime, an unsaved model is gone. RIP!

In [None]:
%%time
trainer.train()
trainer.save_model('/kaggle/working/')
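
If the runtime does die halfway, the checkpoints written every save_steps are your safety net. The Trainer stores them in output_dir as folders named checkpoint-<step>. Here is a small sketch for finding the newest one; `latest_checkpoint` is my own helper and the folders in the demo are created only for illustration.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> folder with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


with tempfile.TemporaryDirectory() as tmp:
    for step in (10000, 60000, 50000):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    found = os.path.basename(latest_checkpoint(tmp))

print(found)  # checkpoint-60000
```

In recent transformers versions you can pass such a path as `trainer.train(resume_from_checkpoint=path)`, and the library ships a similar helper, `transformers.trainer_utils.get_last_checkpoint`. Check both against your installed version.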



## Check your model's prediction
Load your pretrained model from the saved model directory and make a pipeline for the masked word prediction task.

In [None]:
from transformers import pipeline

model = BertForMaskedLM.from_pretrained('/kaggle/working/')

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer
)

In [None]:
fill_mask('লাশ উদ্ধার করে ময়নাতদন্তের জন্য কক্সবাজার [MASK] মর্গে পাঠিয়েছে পুলিশ')
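
The pipeline returns a list of candidates for the [MASK] position, each with a score, a token id and a token string. Conceptually it takes the model's logits at the masked position, turns them into probabilities with a softmax and keeps the top k. The sketch below shows that step on toy numbers; the vocabulary and logits are invented, not output from the model above.

```python
import math


def top_k_predictions(logits, id_to_token, k=5):
    """Softmax over the vocabulary, then the k most probable tokens."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"token": i, "token_str": id_to_token[i], "score": probs[i]}
            for i in ranked]


toy_vocab = ["[PAD]", "hospital", "police", "river", "school"]  # made-up vocabulary
toy_logits = [-2.0, 3.1, 0.4, -0.7, 1.2]                        # made-up logits
predictions = top_k_predictions(toy_logits, toy_vocab, k=3)

for p in predictions:
    print(f"{p['token_str']:10s} {p['score']:.3f}")
```

A well-trained model puts most of the probability mass on a few sensible words; a flat distribution over many tokens is a sign that it needs more training.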


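## Conclusion
My model did quite well. As I said earlier, BERT needs a lot of text to understand a language well. Google's BERT-base was trained on raw text containing about 3.3 billion words (roughly 110 times our training data).

I trained my model on random newspaper articles. For your task, it is better to train your BERT model on text from your specific domain; you will most likely get better results in that domain.

So, congratulations! You can now train your own BERT model, in any language.

Now a question may come to mind.

Can I train a model starting from the weights of a pretrained model?

Yes, you can. Notice that in the model definition section I defined the model this way:

model = BertForMaskedLM(config)

Here a BertConfig is passed as the argument. What you need to do instead is model = BertForMaskedLM.from_pretrained('bert-base-cased')

Or, if you want to load the model from a local directory, model = BertForMaskedLM.from_pretrained('your_model_directory'). In both cases use the tokenizer that belongs to the pretrained checkpoint, since the embedding weights are tied to its vocabulary.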