## RoBERTa

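RoBERTa (A Robustly Optimized BERT Pretraining Approach) is an improved pretrained language model proposed by Facebook AI to raise performance on natural language processing tasks. It keeps BERT's architecture and changes the pretraining recipe: larger batches, longer training on more data, dynamic masking, and no next-sentence objective. Together these changes give better downstream results than BERT.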

Like BERT, RoBERTa is built on the Transformer architecture and uses the Transformer encoder as its basic building block. Its main optimizations are:

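1. **Larger batch size**: RoBERTa trains with much larger mini-batches (up to 8K sequences in the paper, versus 256 for BERT), which improves both training efficiency and final performance.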

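2. **Dynamic masking**: BERT's original implementation generates the mask once during data preprocessing (static masking). RoBERTa samples a new masking pattern every time a sequence is fed to the model, so the same sentence is seen with different masked positions across epochs. The masking rate itself stays at 15%.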

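3. **Removing the NSP task**: RoBERTa drops BERT's Next Sentence Prediction (NSP) objective, because the authors found that removing it matches or slightly improves downstream performance.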

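4. **Longer training**: RoBERTa is trained for more steps and on more data (about 160GB of text versus 16GB for BERT), which leads to better convergence.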
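
Dynamic masking from item 2 can be sketched in plain Python. This is a minimal illustration, not the RoBERTa training code: the token ids for `[PAD]`, `[CLS]`, `[SEP]`, and `[MASK]` are assumptions, and the 80/10/10 replacement rule is the one described for BERT's MLM objective.

```python
import random

MASK_ID = 103                # assumed id of [MASK]
SPECIAL_IDS = {0, 101, 102}  # assumed ids of [PAD], [CLS], [SEP]
VOCAB_SIZE = 21128           # vocab_size reported in the model config
IGNORE = -100                # label value that nn.CrossEntropyLoss skips by default


def dynamic_mask(input_ids, rng, mask_prob=0.15):
    """Return (masked_ids, labels) using a freshly sampled mask."""
    masked, labels = [], []
    for tok in input_ids:
        # Special tokens are never masked; other tokens are selected with mask_prob.
        if tok in SPECIAL_IDS or rng.random() >= mask_prob:
            masked.append(tok)
            labels.append(IGNORE)
            continue
        labels.append(tok)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK_ID)                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.randrange(VOCAB_SIZE))  # 10%: replace with a random token
        else:
            masked.append(tok)                        # 10%: keep the token unchanged
    return masked, labels


rng = random.Random(0)
sentence = [101, 2769, 4263, 1266, 776, 102]  # hypothetical token ids
for epoch in range(3):
    # The mask is resampled on every call, which is what makes it "dynamic".
    print(epoch, dynamic_mask(sentence, rng))
```

Static masking would instead call `dynamic_mask` once during preprocessing and reuse the result in every epoch.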

RoBERTa addresses several weaknesses in BERT's pretraining, including:

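- **Input construction and masking**: BERT builds its pretraining inputs as sentence pairs and fixes the masked positions during preprocessing. RoBERTa packs full sentences contiguously up to 512 tokens and resamples the mask on every pass over the data.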

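- **Pretraining objective**: RoBERTa keeps masked language modeling (MLM) as its only pretraining objective. What it removes is NSP, and it pairs MLM with dynamic masking and a longer, larger-batch training schedule.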

RoBERTa was proposed in 2019 and achieved state-of-the-art results at the time on benchmarks such as GLUE, RACE, and SQuAD.

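Besides RoBERTa, related models include ALBERT, XLNet, and DistilBERT. Each changes BERT-style pretraining in its own way (parameter sharing, permutation language modeling, and knowledge distillation, respectively) and achieves good results.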

In [None]:
## RoBERTa model files can be downloaded from https://huggingface.co/hfl/chinese-roberta-wwm-ext

In [2]:
import pandas as pd
import numpy as np
import torch.nn as nn
import torch
from transformers import AutoTokenizer, RobertaForSequenceClassification

I0420 20:18:52.210010  4808 file_utils.py:39] PyTorch version 1.0.0 available.
I0420 20:18:54.384010  4808 file_utils.py:55] TensorFlow version 2.0.0 available.


In [3]:
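# 1. Load data
# The file is read with header=None, so its header line becomes row 0; [1:] below skips it.
train_df = pd.read_csv('dadata.csv', encoding='utf-8', header=None, names=['label','review'])
print(train_df.shape)

sentences = list(train_df['review'][1:])
label = train_df['label'][1:].values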

(1243, 2)


In [4]:
# 2. Token encoding
model_path = r'E:\code\chinese-roberta-wwm-ext'
tokenizer = AutoTokenizer.from_pretrained(model_path)
max_length = 32
# padding=True pads to the longest sentence; truncation=True cuts anything beyond max_length
sentences_tokened = tokenizer(sentences, padding=True, truncation=True, max_length=max_length, return_tensors='pt')
label = torch.tensor(label.astype(np.int64))

I0420 20:18:55.646010  4808 configuration_utils.py:262] loading configuration file E:\code\chinese-roberta-wwm-ext\config.json
I0420 20:18:55.648010  4808 configuration_utils.py:300] Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "directionality": "bidi",
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 21128
}

I0420 20:18:55.649010  4808 tokenization_utils_base.py:1169] Model n
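
What `padding=True`, `truncation=True`, and `max_length` do to a batch can be illustrated without the library. The sketch below is a simplified stand-in, not the tokenizer's implementation: it works on hypothetical id lists, assumes a pad id of 0 (the `pad_token_id` in the config above), and cuts sequences plainly, whereas the real tokenizer keeps the closing special token when it truncates.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should not attend to
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_and_truncate([[101, 7, 8, 102], [101, 9, 102]], max_length=32)
print(ids)   # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The `input_ids` and `attention_mask` tensors returned by the tokenizer play these same two roles in the cells that follow.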

In [6]:
# 3. Wrap the encoded data in a Dataset
from torch.utils.data import Dataset, DataLoader, random_split

class DataToDataset(Dataset):
    def __init__(self, encoding, labels):
        self.encoding = encoding
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        # One sample: token ids, attention mask, label
        return self.encoding['input_ids'][index], self.encoding['attention_mask'][index], self.labels[index]

# Wrap the data and split it 80/20 into training and validation sets
datasets = DataToDataset(sentences_tokened, label)
train_size = int(len(datasets) * 0.8)
test_size = len(datasets) - train_size
print([train_size, test_size])
train_dataset, val_dataset = random_split(dataset=datasets, lengths=[train_size, test_size])

[993, 249]
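
The `random_split` call above is not seeded, so the split changes from run to run. As a sketch of a reproducible alternative, the helper below (a hypothetical function, not part of the notebook) shuffles indices with a fixed seed and cuts them 80/20. Newer PyTorch releases also let `random_split` take a seeded `generator` argument, which may not be available in the PyTorch version logged above.

```python
import random


def seeded_split(n_samples, train_fraction=0.8, seed=42):
    """Return (train_indices, val_indices) from a shuffle with a fixed seed."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    train_size = int(n_samples * train_fraction)
    return indices[:train_size], indices[train_size:]


# 1242 samples, matching the 993 + 249 sizes printed above
train_idx, val_idx = seeded_split(1242)
print(len(train_idx), len(val_idx))  # 993 249
```

The index lists can then be used to select rows before tokenization, or wrapped with `torch.utils.data.Subset`.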


In [7]:
BATCH_SIZE = 16
# num_workers=0 loads the batches in the main process
train_loader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
# The validation set does not need to be shuffled
val_loader = DataLoader(dataset=val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=0)
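
It can help to check how many batches each epoch produces and how large the last one is. The helper below is a small illustrative sketch (the function name is made up for this example) using the training-set size of 993 printed earlier.

```python
import math


def batch_sizes(n_samples, batch_size):
    """Sizes of the batches a loader yields when incomplete batches are kept."""
    n_batches = math.ceil(n_samples / batch_size)
    return [min(batch_size, n_samples - i * batch_size) for i in range(n_batches)]


sizes = batch_sizes(993, 16)
print(len(sizes), sizes[-1])  # 63 1
```

The last batch holds a single sample. A running mean of per-batch accuracies gives that batch the same weight as a full one, so an epoch accuracy computed that way is an approximation of the true sample-level accuracy.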

In [8]:
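# 4. Create the model
# Note: the hfl/chinese-roberta-wwm-ext model card says to load this checkpoint with the Bert* classes.
# Loading it through RobertaForSequenceClassification, as done here, triggers the
# "newly initialized" weights warning shown in the output of this cell.
class RobertaTextClassficationModel(nn.Module):
    def __init__(self):
        super(RobertaTextClassficationModel, self).__init__()
        # The attribute is named "distilbert", but it holds a RoBERTa sequence classifier
        self.distilbert = RobertaForSequenceClassification.from_pretrained(model_path, num_labels=2)

    def forward(self, ids, mask):
        out = self.distilbert(input_ids=ids, attention_mask=mask)
        # out[0] holds the classification logits, shape (batch, num_labels)
        return out[0]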


mymodel=RobertaTextClassficationModel()


# Use the GPU if one is available, otherwise fall back to the CPU
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device=",device)
if torch.cuda.device_count()>1:
    print("Let's use ",torch.cuda.device_count(),"GPUs!")
    mymodel=nn.DataParallel(mymodel)
mymodel.to(device)

I0420 20:19:57.424010  4808 configuration_utils.py:262] loading configuration file E:\code\chinese-roberta-wwm-ext\config.json
I0420 20:19:57.429010  4808 configuration_utils.py:300] Model config RobertaConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "directionality": "bidi",
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 21128
}

I0420 20:19:57.433010  4808 modeling_utils.py:664] loading wei

W0420 20:20:03.795010  4808 modeling_utils.py:767] Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at E:\code\chinese-roberta-wwm-ext and are newly initialized: ['embeddings.word_embeddings.weight', 'embeddings.position_embeddings.weight', 'embeddings.token_type_embeddings.weight', 'embeddings.LayerNorm.weight', 'embeddings.LayerNorm.bias', 'encoder.layer.0.attention.self.query.weight', 'encoder.layer.0.attention.self.query.bias', 'encoder.layer.0.attention.self.key.weight', 'encoder.layer.0.attention.self.key.bias', 'encoder.layer.0.attention.self.value.weight', 'encoder.layer.0.attention.self.value.bias', 'encoder.layer.0.attention.output.dense.weight', 'encoder.layer.0.attention.output.dense.bias', 'encoder.layer.0.attention.output.LayerNorm.weight', 'encoder.layer.0.attention.output.LayerNorm.bias', 'encoder.layer.0.intermediate.dense.weight', 'encoder.layer.0.intermediate.dense.bias', 'encoder.layer.0.output.dense.weight', 'encoder.l

device= cuda


RobertaTextClassficationModel(
  (distilbert): RobertaForSequenceClassification(
    (roberta): RobertaModel(
      (embeddings): RobertaEmbeddings(
        (word_embeddings): Embedding(21128, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768, padding_idx=0)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=76

In [9]:
# 5. Train the model
loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(mymodel.parameters(), lr=0.00001)

from sklearn.metrics import accuracy_score
def flat_accuracy(preds, labels):
    # preds are logits of shape (batch, num_labels); argmax picks the predicted class
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return accuracy_score(labels_flat, pred_flat)

epochs = 3
for epoch in range(epochs):
    mymodel.train()  # enable dropout; from_pretrained returns the model in eval mode
    train_loss = 0.0
    train_acc = 0.0
    for i, data in enumerate(train_loader):
        input_ids, attention_mask, labels = [elem.to(device) for elem in data]
        # Reset the gradients
        optimizer.zero_grad()
        # Forward pass
        out = mymodel(input_ids.long(), attention_mask)
        # Compute the loss
        loss = loss_func(out, labels)
        train_loss += loss.item()
        # Backpropagate the loss
        loss.backward()
        # Update the model parameters
        optimizer.step()
        # Compute accuracy (tensors are moved to the CPU before conversion to NumPy)
        out = out.detach().cpu().numpy()
        labels = labels.detach().cpu().numpy()
        train_acc += flat_accuracy(out, labels)
        if (i + 1) % 10 == 0:
            print("train %d/%d epochs Batch %d Loss:%f, Acc:%f" % (epoch+1, epochs, (i+1), train_loss/(i+1), train_acc/(i+1)))
    print("train %d/%d epochs Loss:%f, Acc:%f" % (epoch+1, epochs, train_loss/(i+1), train_acc/(i+1)))

train 1/3 epochs Batch 10 Loss:0.511143, Acc:0.781250
train 1/3 epochs Batch 20 Loss:0.398329, Acc:0.856250
train 1/3 epochs Batch 30 Loss:0.335944, Acc:0.885417
train 1/3 epochs Batch 40 Loss:0.354465, Acc:0.882812
train 1/3 epochs Batch 50 Loss:0.338761, Acc:0.890000
train 1/3 epochs Batch 60 Loss:0.341370, Acc:0.887500
train 1/3 epochs Loss:0.339765, Acc:0.888889
train 2/3 epochs Batch 10 Loss:0.189731, Acc:0.956250
train 2/3 epochs Batch 20 Loss:0.240447, Acc:0.931250
train 2/3 epochs Batch 30 Loss:0.244970, Acc:0.916667
train 2/3 epochs Batch 40 Loss:0.231880, Acc:0.918750
train 2/3 epochs Batch 50 Loss:0.218524, Acc:0.920000
train 2/3 epochs Batch 60 Loss:0.210471, Acc:0.915625
train 2/3 epochs Loss:0.212600, Acc:0.916667
train 3/3 epochs Batch 10 Loss:0.120725, Acc:0.956250
train 3/3 epochs Batch 20 Loss:0.081612, Acc:0.975000
train 3/3 epochs Batch 30 Loss:0.075640, Acc:0.977083
train 3/3 epochs Batch 40 Loss:0.064366, Acc:0.981250
train 3/3 epochs Batch 50 Loss:0.053215, Acc:0
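
To read the loss values in the log, it helps to see what `nn.CrossEntropyLoss` computes from the logits. The plain-Python sketch below mirrors the default behavior (log-softmax followed by negative log-likelihood, averaged over the batch). It is an illustration with made-up logits, not PyTorch's implementation.

```python
import math


def cross_entropy(logits, labels):
    """Mean over the batch of -log softmax(logits)[label]."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[y]
    return total / len(labels)


# A two-class model that cannot tell the classes apart scores ln 2, about 0.693
print(cross_entropy([[0.0, 0.0]], [0]))
# A confident, correct prediction gives a loss close to 0
print(cross_entropy([[4.0, -4.0]], [0]))
```

A two-class loss well below 0.693, as in the later batches of the log, means the model is assigning high probability to the correct class on the training data.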

In [10]:
# 6. Evaluate
from sklearn import metrics

print("evaluate...")
pred_list = []
y_list = []
mymodel.eval()
for j, batch in enumerate(val_loader):
    val_input_ids, val_attention_mask, val_labels = [elem.to(device) for elem in batch]
    with torch.no_grad():
        pred = mymodel(val_input_ids, val_attention_mask)
        pred = pred.detach().cpu().numpy()
        pred_flat = np.argmax(pred, axis=1).flatten()
        pred_list.extend(pred_flat)
        val_labels = val_labels.detach().cpu().numpy()
        y_list.extend(val_labels)

# scikit-learn expects (y_true, y_pred): true labels first, predictions second
classify_report = metrics.classification_report(y_list, pred_list)  # "support" is the number of true samples per class
print(classify_report)
confusion_matrix = metrics.confusion_matrix(y_list, pred_list)  # rows are true labels, columns are predictions
print(confusion_matrix)

evaluate...
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       229
           1       1.00      0.75      0.86        20

    accuracy                           0.98       249
   macro avg       0.99      0.88      0.92       249
weighted avg       0.98      0.98      0.98       249

[[229   0]
 [  5  15]]
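
The per-class numbers in a classification report follow directly from confusion-matrix counts. The sketch below shows the arithmetic for one class. The counts are hypothetical and chosen only to keep the numbers easy to follow.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.67 0.73
```

Both `classification_report` and `confusion_matrix` in scikit-learn take the true labels first and the predictions second. Swapping them exchanges precision with recall and transposes the matrix.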
