In [1]:
import pandas as pd
df_train = pd.read_csv("./contents/chinese/train.tsv", sep='\t', header=None, names=["labels", "contents"])
df_dev = pd.read_csv("./contents/chinese/dev.tsv", sep='\t', header=None, names=["labels", "contents"])
df_dev

| | labels | contents |
|---|---|---|
| 0 | game | EDG出线希望大增！有这个老师的示范还怕淘汰？ ★游戏马蹄铁原创，禁止网站或个人商用盗取，违... |
| 1 | game | CSGO外挂“大牛”制售者一脸懵逼 家中被CT抓个现行 特大喜讯！CSGO外挂“大牛”制售者... |
| 2 | game | LOL女警再遭削弱引发玩家不满，胜率都倒数了还砍？有病！ 对于很多喜欢ADC的玩家来说，前几... |
| 3 | game | 《沃克梦》：口袋妖怪既视感？你要捕获的不是精灵而是员工 安智游戏推荐：精灵球、皮卡丘、来自真... |
| 4 | game | 团战可以输他们必须死！盘点团战中最烦人的技能 大家都知道，峡谷中的每个英雄都有各自的特点，有... |
| 5 | game | LOL今日美服：养蜂人原画上线香炉大改加强！ ★游戏马蹄铁原创，禁止网站或个人商用盗取，违者... |
| 6 | game | DNF韩服新增龙枪和黑枪职业, 史诗武器属性一览 DNF韩服更新了魔枪士两个职业，龙枪和黑枪... |
| 7 | game | 英雄联盟7.18测试服补丁：多款S7决赛皮肤上线，多个英雄改动 7.18测试服为期三周的改动... |
| 8 | game | LOL解说十一左拥右抱，泽元隔空喊话记得：学学人家，再看看你！ LOL解说记得其实从当初来到... |
| 9 | game | 天下ACG是一家，除了二次元，这款音游里还有游戏玩家的“爱"。 《同步音律》（曾用名《同步音... |


In [7]:
import re

# Sentence-ending punctuation (full-width and ASCII) used as split points.
SENTENCE_END = re.compile('[？!！。.]')

def write_sentences(df, path):
    """Write one sentence fragment per line, the layout LineByLineTextDataset reads."""
    # Explicit UTF-8 so the Chinese text does not depend on the platform default encoding.
    with open(path, "w", encoding="utf-8") as f:
        for text in df["contents"]:
            # Replace sentence enders with spaces, then split on whitespace.
            for sentence in SENTENCE_END.sub(' ', text).split():
                f.write(sentence + "\n")

write_sentences(df_train, "./contents/chinese/train.txt")
write_sentences(df_dev, "./contents/chinese/dev.txt")

with open("./contents/chinese/train.txt", "r", encoding="utf-8") as f:
    contents = f.readlines()
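
To make the splitting rule above concrete, here is a minimal sketch that applies the same character class to a made-up sample string (the sample is an assumption assembled from phrases in the dev set, not an actual row). It shows one side effect worth knowing about: the ASCII `.` in the class also cuts version numbers such as `7.18`, and any existing space acts as a split point too. The second pattern is a hypothetical alternative, not what this notebook uses.

```python
import re

# Same character class as the preprocessing cell.
SENTENCE_END = re.compile('[？!！。.]')

# Hypothetical variant: only treat "." as a sentence end when no digit follows it.
SENTENCE_END_KEEP_DECIMALS = re.compile(r'[？!！。]|\.(?!\d)')

def split_sentences(text, pattern=SENTENCE_END):
    """Replace sentence enders with spaces, then split on whitespace."""
    return pattern.sub(' ', text).split()

# Made-up sample for illustration only.
sample = "英雄联盟7.18测试服补丁上线。多个英雄改动！"

as_in_notebook = split_sentences(sample)
keep_decimals = split_sentences(sample, SENTENCE_END_KEEP_DECIMALS)

print(as_in_notebook)   # ['英雄联盟7', '18测试服补丁上线', '多个英雄改动']
print(keep_decimals)    # ['英雄联盟7.18测试服补丁上线', '多个英雄改动']
```

Whether the extra fragments matter depends on the downstream task; for masked-language-model pretraining they mostly just produce shorter training lines.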

## Tokenizer

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

In [10]:
tokenizer.vocab_size

21128

In [11]:
idx = tokenizer.encode(contents[0])
tokenizer.convert_ids_to_tokens(idx)

['[CLS]', '5000', '万', ',', '投', '资', '游', '戏', '该', '有', '多', '好', '[SEP]']
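
The output above shows the general shape of BERT tokenization for Chinese: every CJK character becomes its own token, the digit run `5000` stays together, and `[CLS]`/`[SEP]` are wrapped around the sequence. The sketch below imitates only that shape with a regular expression. It is a deliberate simplification, not the real tokenizer: it does no vocabulary lookup and no WordPiece subword splitting, and the input string is an assumption reconstructed from the tokens printed above.

```python
import re

# Simplified stand-in for BERT pre-tokenization (NOT the real WordPiece algorithm):
#   1. a single CJK character, or
#   2. a run of ASCII letters/digits, or
#   3. any other single non-space character (punctuation).
TOKEN_RE = re.compile(r'[\u4e00-\u9fff]|[A-Za-z0-9]+|[^\sA-Za-z0-9\u4e00-\u9fff]')

def rough_bert_tokens(text):
    """Return an approximate token list with the special tokens BERT adds."""
    return ['[CLS]'] + TOKEN_RE.findall(text) + ['[SEP]']

# Assumed reconstruction of the first training line.
tokens = rough_bert_tokens("5000万,投资游戏该有多好")
print(tokens)
print(len(tokens))  # 13: eleven content tokens plus [CLS] and [SEP]
```

For anything beyond illustration, `tokenizer.tokenize` and `tokenizer.encode` remain the source of truth.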

## Model Definition

In [13]:
from model.mlm import BaseMLM

model = BaseMLM(vocab_size=tokenizer.vocab_size, pretrain="bert-base-chinese")()
model.num_parameters()

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


102290312
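
The parameter count above can be reproduced by hand. The sketch below assumes that `BaseMLM` returns a stock `BertForMaskedLM` (as the warning suggests) with the usual BERT-base dimensions: hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 token types, no pooler, and an output projection whose weight is tied to the word embeddings. Those dimensions are assumptions, not values read from this notebook; only the vocabulary size 21128 comes from the cells above.

```python
# Assumed BERT-base configuration; vocab size taken from tokenizer.vocab_size above.
vocab, hidden, layers, inter, max_pos, types = 21128, 768, 12, 3072, 512, 2

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings plus one LayerNorm.
embeddings = vocab * hidden + max_pos * hidden + types * hidden + layer_norm

# One encoder layer: Q, K, V and output projections, then the feed-forward block.
attention = 4 * linear(hidden, hidden) + layer_norm
feed_forward = linear(hidden, inter) + linear(inter, hidden) + layer_norm
encoder = layers * (attention + feed_forward)

# MLM head: dense transform, LayerNorm, and only a bias for the decoder
# because its weight matrix is shared with the word embeddings.
mlm_head = linear(hidden, hidden) + layer_norm + vocab

total = embeddings + encoder + mlm_head
print(total)  # 102290312
```

Under these assumptions the total matches `model.num_parameters()`, with the encoder stack accounting for about 85M parameters and the embeddings for about 16.6M.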

In [14]:
from transformers import LineByLineTextDataset
from transformers import DataCollatorForLanguageModeling

train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./contents/chinese/train.txt",
    block_size=512,  # maximum sequence length
)
print('No. of lines: ', len(train_dataset))  # number of lines in the training set

dev_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./contents/chinese/dev.txt",
    block_size=512,  # maximum sequence length
)
print('No. of lines: ', len(dev_dataset))  # number of lines in the dev set

# Randomly select 15% of the tokens in each batch as masked-LM targets.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)



No. of lines:  11772
No. of lines:  1136
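
The collator configured with `mlm_probability=0.15` follows the standard BERT masking recipe: about 15% of the non-special tokens are chosen as prediction targets; of those, roughly 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. Positions that are not targets get the label `-100`, which the loss ignores. The sketch below re-implements that recipe in plain Python for one sequence. It is an illustration, not the library code, and the token ids used (101 for `[CLS]`, 102 for `[SEP]`, 103 for `[MASK]`) are the usual BERT values assumed here; check `tokenizer.mask_token_id` and friends before relying on them.

```python
import random

def mask_tokens(input_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) following the 80/10/10 BERT masking rule."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [-100] * len(input_ids)      # -100 = ignored by the loss
    for i, token in enumerate(input_ids):
        # Never mask special tokens; pick other tokens with the given probability.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token                  # the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id            # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return masked, labels

# Assumed ids: [CLS]=101, [SEP]=102, [MASK]=103; the middle ids are arbitrary.
ids = [101] + list(range(2000, 2040)) + [102]
masked, labels = mask_tokens(ids, special_ids={101, 102}, mask_id=103,
                             vocab_size=21128, rng=random.Random(0))
print(sum(label != -100 for label in labels), "of", len(ids), "tokens selected")
```

Because masking happens inside the collator, each epoch sees a different random mask over the same 11772 lines.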


In [15]:
demo = next(iter(train_dataset))
len(demo["input_ids"])

13
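
The example above has only 13 ids even though `block_size=512`: the block size is a truncation cap, and no padding is stored in the dataset. Padding is applied later, per batch, to the length of the longest sequence in that batch. The sketch below shows that idea in plain Python; `pad_batch` is a hypothetical helper written for this illustration, and `pad_id=0` is the usual BERT `[PAD]` id, assumed rather than read from the tokenizer.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks real tokens, 0 marks padding that attention should skip.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Toy batch with made-up ids of different lengths.
batch = [[101, 7, 8, 9, 102], [101, 5, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[101, 7, 8, 9, 102], [101, 5, 102, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

With short lines like the ones produced by the sentence splitting step, per-batch padding keeps most batches far below 512 tokens wide.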

In [None]:
from train.mlm import MLMTrainer

trainer = MLMTrainer(model=model, 
                     data_collator=data_collator, 
                     train_dataset=train_dataset,
                     eval_dataset=dev_dataset,
                     output_dir="./Bert"
                    )
t = trainer()
t.train()

Saving output into ./Bert/runs
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
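
The warning above is harmless, but it repeats every time a worker process is forked. As the message itself suggests, it can be silenced by setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the line is placed in the first cell of the notebook so it runs before the tokenizer is first used:

```python
import os

# Set before the tokenizer is used and before any DataLoader workers are forked.
# "false" disables the Rust tokenizer's internal thread pool; "true" keeps it.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])  # false
```

Choosing `"false"` trades some tokenization speed for quiet, deadlock-free forking; since the datasets here are tokenized once up front, the cost should be small.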


***** Running training *****
  Num examples = 11772
  Num Epochs = 100
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 18400
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
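
The `Total optimization steps` figure in the log follows directly from the other numbers. The sketch below redoes the arithmetic, assuming a single device and that the final, smaller batch of each epoch is kept rather than dropped (the default behaviour of the Hugging Face `Trainer`, which `MLMTrainer` is assumed to wrap).

```python
import math

num_examples = 11772       # "Num examples" from the log
batch_size = 64            # "Total train batch size"
num_epochs = 100           # "Num Epochs"
grad_accum_steps = 1       # "Gradient Accumulation steps"

# The last batch of an epoch holds the leftover examples, hence the ceiling.
batches_per_epoch = math.ceil(num_examples / batch_size)
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
total_steps = steps_per_epoch * num_epochs

print(batches_per_epoch)   # 184
print(total_steps)         # 18400
```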


wandb: Currently logged in as: morris135212 (use `wandb login --relogin` to force relogin)


wandb: wandb version 0.12.21 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade


