## BERT

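BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model built on the Transformer architecture, introduced by Google in 2018. BERT learns general-purpose language representations through large-scale self-supervised pre-training on unlabeled text and is then fine-tuned on individual NLP tasks, where it delivered substantial performance gains.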

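BERT rests on three ideas:

1. **Transformer architecture**: BERT uses the Transformer encoder as its backbone, stacking multiple layers of self-attention and feed-forward networks. Because self-attention lets every position attend to every other position, this architecture captures global dependencies across the input sequence.

2. **Bidirectionality**: BERT is pre-trained on two tasks, Masked Language Model (MLM) and Next Sentence Prediction (NSP). MLM asks the model to predict tokens that have been masked out of a sentence; since the prediction can draw on both the left and the right context, the model learns bidirectional contextual representations. NSP asks the model to judge whether two sentences were adjacent in the original text, which teaches it relationships between sentences.

3. **Pre-training and fine-tuning**: BERT is first pre-trained on a large unlabeled text corpus to learn general-purpose language representations. It is then fine-tuned on a specific downstream task, where a relatively small amount of labeled data is often enough to achieve good results.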
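The MLM objective can be sketched in a few lines of plain Python. The selection rate (15% by default) and the 80/10/10 replacement split come from the original BERT paper rather than from this section; the function name `mask_tokens`, the `None` marker for ignored positions, and the toy vocabulary are assumptions made for illustration only.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for a masked-language-model step.

    labels[i] holds the original token at every selected position and None
    elsewhere, so a loss would be computed only at the selected positions.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens and unselected positions are left untouched.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels


tokens = ["[CLS]"] + list("我爱北京天安门") + ["[SEP]"]
# mask_prob is raised to 0.5 here only so that a short sentence gets some masks.
corrupted, labels = mask_tokens(tokens, vocab=list("我爱北京天安门"), mask_prob=0.5)
print(corrupted)
print(labels)
```

Because the model never knows which of the selected positions were actually altered, it has to build a representation of every token from both its left and its right neighbours.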

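BERT addresses several limitations of earlier approaches:

- **Contextual understanding**: Static word-embedding methods such as Word2Vec and GloVe assign each word a single vector regardless of the sentence it appears in. BERT's bidirectional training and Transformer architecture produce representations that depend on the surrounding context, which improves the quality of the language representation.

- **Transfer learning**: Earlier models usually had to be trained from scratch for each task. BERT's pre-train/fine-tune framework makes it much easier to apply one model to a wide range of NLP tasks, greatly improving transferability and generality.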

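BERT comes in several versions, of which BERT-base and BERT-large are the best known. Their sizes are:

- BERT-base: about 110M parameters, with 12 Transformer encoder layers, 12 attention heads per layer, and a hidden size of 768.
- BERT-large: about 340M parameters, with 24 Transformer encoder layers, 16 attention heads per layer, and a hidden size of 1024.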
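The two parameter counts can be reproduced with simple arithmetic. The sketch below assumes the configuration of the English uncased checkpoints (a vocabulary of 30522 entries, 512 positions, 2 segment types, and a feed-forward width of four times the hidden size); none of these values are stated in this section, and the helper name `bert_param_count` is hypothetical. The number of attention heads does not appear because splitting the hidden size across heads does not change the size of the projection matrices.

```python
def bert_param_count(vocab_size, hidden, layers, max_positions=512, type_vocab=2):
    """Count the parameters of a BERT encoder, including the pooler."""
    intermediate = 4 * hidden   # assumed feed-forward width
    layer_norm = 2 * hidden     # weight + bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, followed by a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), followed by a LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count(vocab_size=30522, hidden=768, layers=12)
large = bert_param_count(vocab_size=30522, hidden=1024, layers=24)
print(f"BERT-base:  {base:,}")    # 109,482,240
print(f"BERT-large: {large:,}")   # 335,141,888
```

Rounded, these are the 110M and 340M figures quoted above. With the 21128-entry vocabulary of `bert-base-chinese`, the same formula gives a smaller total because the word-embedding table shrinks.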

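There are also variants aimed at particular tasks or languages, such as multilingual BERT (BERT-multilingual). BERT's introduction greatly advanced the field of natural language processing, and at the time of its release it achieved state-of-the-art results on many tasks.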

In [49]:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

In [50]:
# Download page for the BERT model files: https://huggingface.co/bert-base-chinese

In [51]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [52]:
# Input text
input_text = "我爱北京天安门"
# Use the tokenizer to convert the text into token IDs
input_ids = tokenizer.encode(input_text, add_special_tokens=True)
print(input_ids)

[101, 2769, 4263, 1266, 776, 1921, 2128, 7305, 102]
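The nine IDs correspond to the seven characters of the input plus two special tokens: `[CLS]` (101) at the start and `[SEP]` (102) at the end. The toy encoder below mimics that behaviour with a dictionary whose IDs are copied from the output above; `TOY_VOCAB` and `toy_encode` are illustrative names, and the real tokenizer additionally handles unknown characters, Latin-script word pieces, and normalization.

```python
# IDs copied from the printed output; the real vocabulary is far larger.
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "我": 2769, "爱": 4263, "北": 1266, "京": 776,
    "天": 1921, "安": 2128, "门": 7305,
}


def toy_encode(text, add_special_tokens=True):
    """Map each Chinese character to one ID, optionally wrapping in [CLS]/[SEP]."""
    ids = [TOY_VOCAB[ch] for ch in text]
    if add_special_tokens:
        ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return ids


print(toy_encode("我爱北京天安门"))
print(toy_encode("北京", add_special_tokens=False))
```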


In [53]:
input_ids = torch.tensor([input_ids])
# AutoModelForMaskedLM returns vocabulary logits rather than hidden states:
# output[0] has shape [batch, seq_len, vocab_size]
with torch.no_grad():
    mlm_logits = model(input_ids)[0]
print(mlm_logits.shape)

torch.Size([1, 9, 21128])
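The printed shape reads as one sequence, nine tokens, and 21128 scores per token. The last dimension matches the size of the `bert-base-chinese` vocabulary, so each position carries one masked-LM score per vocabulary entry. Taking the argmax over that last dimension yields the most likely token at each position. The sketch below does this on a tiny hand-written tensor stored as nested lists; the five-token vocabulary and the scores are made up for illustration.

```python
def argmax_decode(logits, id_to_token):
    """Decode nested-list logits shaped [batch][seq_len][vocab_size]."""
    decoded = []
    for sequence in logits:
        # For every position, pick the index of the highest score.
        ids = [max(range(len(scores)), key=scores.__getitem__) for scores in sequence]
        decoded.append([id_to_token[i] for i in ids])
    return decoded


toy_vocab = ["[CLS]", "[SEP]", "天", "安", "门"]
toy_logits = [[
    [5.0, 0.1, 0.2, 0.0, 0.3],   # position 0
    [0.0, 0.2, 4.0, 1.0, 0.5],   # position 1
    [0.1, 3.5, 0.2, 0.4, 0.0],   # position 2
]]
print(argmax_decode(toy_logits, toy_vocab))
```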


In [54]:
import pandas as pd
import numpy as np
import torch.nn as nn
import torch
from transformers import BertForSequenceClassification, BertTokenizer
from sklearn import metrics
from collections import Counter

In [55]:
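# 1. Load the data
# header=None treats the CSV's own header line as data, so row 0 holds the strings
# 'label' and 'review'; it is skipped with [1:] below and the labels stay strings.
train_df = pd.read_csv('data.csv', encoding='utf-8', header=None, names=['label','review'])
# Without parentheses this prints the bound method's repr (the whole frame), not the first five rows
print(train_df.head)

sentences = list(train_df['review'][1:])
label = train_df['label'][1:].values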


<bound method NDFrame.head of       label                                             review
0     label                                             review
1         0                      商业秘密的秘密性那是维系其商业价值和垄断地位的前提条件之一
2         1  南口阿玛施新春第一批限量春装到店啦         春暖花开淑女裙、冰蓝色公主衫  ...
3         0                                   带给我们大常州一场壮观的视觉盛宴
4         0                                      有原因不明的泌尿系统结石等
...     ...                                                ...
1238      0                                    关于两英女孩嫁海门惨遭家暴之后
1239      0          美国Amazon购买$50礼品卡部分用户送$10promotionalcredit
1240      0                                   一年中这是最好的调理疾病的好时候
1241      0                              我镇强力推进“3+2+2”专项打击整治行动
1242      0                                       ~热推~全球最好的祛疤膏

[1243 rows x 2 columns]>


In [56]:
c = Counter(label)
print (dict(c))

{'0': 1120, '1': 122}
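The counts show a strongly imbalanced dataset: roughly nine negative samples for every positive one. Two quick calculations make the consequences concrete. The first is the accuracy of a classifier that always predicts the majority class. The second is a set of inverse-frequency class weights, using the common `n_samples / (n_classes * count)` heuristic. The notebook itself does not use class weights, so treat this as an optional sketch; such weights could, for example, be passed to the `weight` argument of `nn.CrossEntropyLoss`.

```python
counts = {"0": 1120, "1": 122}  # copied from the Counter output

total = sum(counts.values())
# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(counts.values()) / total
# Inverse-frequency weights: rare classes get proportionally larger weights.
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"majority-class accuracy: {majority_baseline:.4f}")
print({label: round(w, 4) for label, w in class_weights.items()})
```

An accuracy near 0.90 is therefore achievable without learning anything about the minority class, which is why per-class precision and recall are more informative than accuracy on this dataset.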


In [57]:
# 2. Token encoding

max_length = 32
sentences_tokened = tokenizer(sentences, padding=True, truncation=True, max_length=max_length, return_tensors='pt')
# Labels were read as strings ('0'/'1'); convert them to an int64 tensor
label = torch.tensor(label.astype(np.int64))
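With `padding=True` and `truncation=True`, sequences longer than `max_length` are cut and shorter ones are padded up to the longest sequence in the batch, while an attention mask records which positions hold real tokens. A simplified sketch of that bookkeeping follows. It assumes a pad ID of 0 (consistent with `padding_idx=0` in the model printout) and truncates by simply dropping the tail, whereas the real tokenizer keeps the closing `[SEP]` token; `pad_and_truncate` is a hypothetical helper, not a library function.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that attention should ignore.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask


batch = [
    [101, 2769, 4263, 102],
    [101, 1266, 776, 1921, 2128, 7305, 102],
]
padded_ids, mask = pad_and_truncate(batch, max_length=6)
print(padded_ids)
print(mask)
```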

In [58]:
# 3. Wrap the encoded data in a Dataset
from torch.utils.data import Dataset, DataLoader, random_split

class DataToDataset(Dataset):
    def __init__(self, encoding, labels):
        self.encoding = encoding
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        # Each item is (input_ids, attention_mask, label)
        return self.encoding['input_ids'][index], self.encoding['attention_mask'][index], self.labels[index]

# Build the dataset and split it 80/20 into training and validation sets
datasets = DataToDataset(sentences_tokened, label)
train_size = int(len(datasets) * 0.8)
test_size = len(datasets) - train_size
print([train_size, test_size])
train_dataset, val_dataset = random_split(dataset=datasets, lengths=[train_size, test_size])

[993, 249]
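The two sizes follow from `int(1242 * 0.8) = 993` and `1242 - 993 = 249`. Since `random_split` is called without a fixed generator, the membership of each subset changes from run to run. The plain-Python sketch below shows the same 80/20 split made reproducible with a seed; `split_indices` and the seed value 42 are illustrative choices. In PyTorch, a comparable effect can be had by passing `generator=torch.Generator().manual_seed(42)` to `random_split`.

```python
import random


def split_indices(n, train_fraction=0.8, seed=42):
    """Shuffle the indices 0..n-1 with a fixed seed and cut them into two parts."""
    train_size = int(n * train_fraction)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return indices[:train_size], indices[train_size:]


train_idx, val_idx = split_indices(1242)
print(len(train_idx), len(val_idx))
```

Note that neither this sketch nor `random_split` stratifies by label, so the share of positive samples in the validation set can differ from the overall share.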


In [59]:
BATCH_SIZE = 16
# num_workers=0 loads batches in the main process
train_loader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
# Shuffling is unnecessary for evaluation
val_loader = DataLoader(dataset=val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=0)
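With 993 training samples and a batch size of 16, one epoch consists of 63 batches, and the last of them contains a single sample. The sketch below reproduces that arithmetic with a minimal batching generator; `batches` is a stand-in written for illustration, not part of PyTorch.

```python
import math


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


n_train, batch_size = 993, 16
n_batches = math.ceil(n_train / batch_size)
last_batch = n_train - (n_batches - 1) * batch_size
print(n_batches, last_batch)

sizes = [len(b) for b in batches(list(range(n_train)), batch_size)]
print(len(sizes), sizes[-1])
```

A one-sample batch counts as much as a full batch whenever per-batch metrics are averaged, so a running average of batch accuracy is only an approximation of the true epoch accuracy. `DataLoader` offers a `drop_last` option for cases where the short final batch should be skipped.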

In [60]:
# 4. Create the model
class BertTextClassficationModel(nn.Module):
    def __init__(self):
        super(BertTextClassficationModel,self).__init__()
        self.bert=BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=2)
        
    def forward(self,ids,mask):
        out = self.bert(input_ids=ids,attention_mask=mask)
        return out[0]


mymodel=BertTextClassficationModel()


# Use the GPU if one is available, otherwise fall back to the CPU
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device=",device)
if torch.cuda.device_count()>1:
    print("Let's use ",torch.cuda.device_count(),"GPUs!")
    mymodel=nn.DataParallel(mymodel)
mymodel.to(device)

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly i

device= cuda


BertTextClassficationModel(
  (bert): BertForSequenceClassification(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(21128, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_fe

In [61]:
total_params = 0
for name, parameters in mymodel.named_parameters():
    if not parameters.requires_grad: continue
    print(name, ':', parameters.size())
    total_params += parameters.numel()
print("Number of trainable parameters:", total_params)

bert.bert.embeddings.word_embeddings.weight : torch.Size([21128, 768])
bert.bert.embeddings.position_embeddings.weight : torch.Size([512, 768])
bert.bert.embeddings.token_type_embeddings.weight : torch.Size([2, 768])
bert.bert.embeddings.LayerNorm.weight : torch.Size([768])
bert.bert.embeddings.LayerNorm.bias : torch.Size([768])
bert.bert.encoder.layer.0.attention.self.query.weight : torch.Size([768, 768])
bert.bert.encoder.layer.0.attention.self.query.bias : torch.Size([768])
bert.bert.encoder.layer.0.attention.self.key.weight : torch.Size([768, 768])
bert.bert.encoder.layer.0.attention.self.key.bias : torch.Size([768])
bert.bert.encoder.layer.0.attention.self.value.weight : torch.Size([768, 768])
bert.bert.encoder.layer.0.attention.self.value.bias : torch.Size([768])
bert.bert.encoder.layer.0.attention.output.dense.weight : torch.Size([768, 768])
bert.bert.encoder.layer.0.attention.output.dense.bias : torch.Size([768])
bert.bert.encoder.layer.0.attention.output.LayerNorm.weight : tor

In [62]:
# 5. Train the model
loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(mymodel.parameters(), lr=0.0001)

from sklearn.metrics import accuracy_score
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return accuracy_score(labels_flat, pred_flat)

epochs = 3
for epoch in range(epochs):
    # from_pretrained returns the model in eval mode; switch to train mode so dropout is active
    mymodel.train()
    train_loss = 0.0
    train_acc = 0.0
    for i, data in enumerate(train_loader):
        input_ids, attention_mask, labels = [elem.to(device) for elem in data]
        # Reset the accumulated gradients
        optimizer.zero_grad()
        # Forward pass: logits of shape [batch, num_labels]
        out = mymodel(input_ids.long(), attention_mask)
        # Compute the loss
        loss = loss_func(out, labels)
        train_loss += loss.item()
        # Backpropagate the loss
        loss.backward()
        # Update the model parameters
        optimizer.step()
        # Accumulate the batch accuracy (move tensors to the CPU before converting to NumPy)
        out = out.detach().cpu().numpy()
        labels = labels.detach().cpu().numpy()
        train_acc += flat_accuracy(out, labels)
        if (i + 1) % 10 == 0:
            print("train %d/%d epochs Batch %d Loss:%f, Acc:%f" % (epoch+1, epochs, (i+1), train_loss/(i+1), train_acc/(i+1)))
    print("train %d/%d epochs Loss:%f, Acc:%f" % (epoch+1, epochs, train_loss/(i+1), train_acc/(i+1)))

train 1/3 epochs Batch 10 Loss:0.336873, Acc:0.862500
train 1/3 epochs Batch 20 Loss:0.305873, Acc:0.884375
train 1/3 epochs Batch 30 Loss:0.224474, Acc:0.916667
train 1/3 epochs Batch 40 Loss:0.229717, Acc:0.915625
train 1/3 epochs Batch 50 Loss:0.210660, Acc:0.927500
train 1/3 epochs Batch 60 Loss:0.184388, Acc:0.936458
train 1/3 epochs Loss:0.181004, Acc:0.938492
train 2/3 epochs Batch 10 Loss:0.006369, Acc:1.000000
train 2/3 epochs Batch 20 Loss:0.051118, Acc:0.987500
train 2/3 epochs Batch 30 Loss:0.054556, Acc:0.989583
train 2/3 epochs Batch 40 Loss:0.042382, Acc:0.992188
train 2/3 epochs Batch 50 Loss:0.035039, Acc:0.993750
train 2/3 epochs Batch 60 Loss:0.042860, Acc:0.992708
train 2/3 epochs Loss:0.041098, Acc:0.993056
train 3/3 epochs Batch 10 Loss:0.007164, Acc:1.000000
train 3/3 epochs Batch 20 Loss:0.005960, Acc:1.000000
train 3/3 epochs Batch 30 Loss:0.024603, Acc:0.995833
train 3/3 epochs Batch 40 Loss:0.053970, Acc:0.990625
train 3/3 epochs Batch 50 Loss:0.101436, Acc:0
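The loss values in the log come from `nn.CrossEntropyLoss`, which takes raw logits and integer class labels and, with its default settings, returns the batch mean of `-log softmax(logits)[label]`. A plain-Python version of that calculation is sketched below on two hand-made samples; it is meant to show the formula, not to replace the PyTorch implementation.

```python
import math


def cross_entropy(logits, labels):
    """Batch mean of -log softmax(logits)[label], computed stably."""
    losses = []
    for row, label in zip(logits, labels):
        peak = max(row)  # subtracting the max avoids overflow in exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)


logits = [[2.0, 0.0],   # confident and correct for label 0
          [0.0, 0.0]]   # undecided between the two classes
labels = [0, 1]
loss = cross_entropy(logits, labels)
print(round(loss, 4))
```

The first sample contributes `log(1 + e^-2)`, about 0.127, and the undecided second sample contributes `log 2`, about 0.693, so a loss near 0.69 on a two-class problem indicates predictions no better than a coin flip.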

In [63]:
# 6. Evaluate

print("evaluate...")
pred_list = []
y_list = []
mymodel.eval()
for j, batch in enumerate(val_loader):
    val_input_ids, val_attention_mask, val_labels = [elem.to(device) for elem in batch]
    with torch.no_grad():
        pred = mymodel(val_input_ids, val_attention_mask)
        pred = pred.detach().cpu().numpy()
        pred_flat = np.argmax(pred, axis=1).flatten()
        pred_list.extend(pred_flat)
        val_labels = val_labels.detach().cpu().numpy()
        y_list.extend(val_labels)

# scikit-learn expects (y_true, y_pred), in that order
classify_report = metrics.classification_report(y_list, pred_list, digits=4)  # classification report; support = number of validation samples per class
print(classify_report)
confusion_matrix = metrics.confusion_matrix(y_list, pred_list)  # confusion matrix: rows are true labels, columns are predictions
print(confusion_matrix)

evaluate...
              precision    recall  f1-score   support

           0     0.8675    1.0000    0.9290       216
           1     0.0000    0.0000    0.0000        33

    accuracy                         0.8675       249
   macro avg     0.4337    0.5000    0.4645       249
weighted avg     0.7525    0.8675    0.8059       249

[[216   0]
 [ 33   0]]


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
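Precision, recall, and F1 are defined per class from counts of true positives, false positives, and false negatives, and they depend on which list is treated as the ground truth. The sketch below computes them by hand on a small made-up example in which the classifier predicts only the majority class; `per_class_metrics` is a hypothetical helper, and undefined ratios are reported as 0.0 here, which is a convention chosen for this sketch.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1) for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: 8 negatives, 2 positives, and a model that always predicts 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

print(per_class_metrics(y_true, y_pred, positive=0))
print(per_class_metrics(y_true, y_pred, positive=1))
# Swapping the two arguments exchanges precision and recall.
print(per_class_metrics(y_pred, y_true, positive=0))
```

On the toy data the minority class scores zero on every metric even though overall accuracy is 0.8, and passing predictions in place of the true labels makes class 0 look perfectly precise instead of perfectly recalled.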
