## RoBERTa

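RoBERTa (A Robustly Optimized BERT Pretraining Approach) is an improved pretrained language model proposed by Facebook AI to raise performance on natural language processing tasks. It keeps BERT's architecture and changes the pretraining recipe: larger batches, longer training on more data, dynamic masking, and no Next Sentence Prediction objective.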

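Like BERT, RoBERTa is built from Transformer encoder layers. Its main changes are:

1. **Larger batches**: RoBERTa trains with much larger mini-batches than BERT (up to 8K sequences versus 256), which improved both masked language modeling perplexity and end-task accuracy in the authors' experiments.

2. **Dynamic masking**: the masking rate stays at 15% of tokens, but the mask pattern is generated each time a sequence is fed to the model instead of being fixed once during preprocessing, so the model sees different masked positions for the same sentence across epochs.

3. **No NSP task**: RoBERTa drops BERT's Next Sentence Prediction (NSP) objective, which the authors found did not bring a significant improvement.

4. **Longer training**: RoBERTa trains for more steps, and on more data, to reach a better optimum.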
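
Dynamic masking, as RoBERTa applies it, keeps the masking rate fixed and redraws the masked positions every time a sequence is used. The contrast with static masking can be sketched in a few lines of plain Python. This is an illustrative toy, not RoBERTa's actual preprocessing code: the helper `mask_tokens`, the example sentence, and the seed are assumptions made for the example, and it leaves out BERT's rule of sometimes substituting a random or unchanged token instead of `[MASK]`.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, rate=0.15):
    """Return a copy of tokens with about `rate` of the positions replaced by [MASK]."""
    n_mask = max(1, round(len(tokens) * rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

tokens = "the quick brown fox jumps over the lazy dog near the river bank today".split()

# Static masking: the mask is drawn once during preprocessing and reused every epoch.
static_view = mask_tokens(tokens, random.Random(0))
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a fresh mask is drawn each time the sequence is fed to the model.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```

The static rows are identical in every epoch, while the dynamic rows are drawn independently, so over many epochs the model is asked to predict many different positions of the same sentence.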

RoBERTa addresses several weaknesses of BERT's pretraining setup:

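- **Input construction**: BERT's original implementation masks each sequence once during preprocessing (static masking) and trains on sentence-pair segments built for NSP. RoBERTa redraws the mask every time a sequence is sampled and packs full sentences contiguously into inputs of up to 512 tokens.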

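- **Pretraining objective**: RoBERTa does not remove masked language modeling (MLM); MLM remains its only pretraining objective. What it removes is NSP, and it trains the MLM objective with dynamic masking, larger batches, and more steps.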

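RoBERTa was published in 2019 and at the time reported state-of-the-art results on several natural language processing benchmarks, including GLUE, RACE, and SQuAD.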

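Related models include ALBERT, XLNet, and DistilBERT. ALBERT and DistilBERT shrink BERT through parameter sharing and knowledge distillation respectively, while XLNet replaces masked language modeling with a permutation-based autoregressive objective.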

In [None]:
# RoBERTa model files can be downloaded from https://huggingface.co/hfl/chinese-roberta-wwm-ext

In [1]:
import pandas as pd
import numpy as np
import torch.nn as nn
import torch
from transformers import AutoTokenizer, RobertaForSequenceClassification
from sklearn import metrics
from collections import Counter

In [2]:
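# 1. Load the data
# header=None reads every line as data; row 0 (presumably the CSV's own header line)
# is counted in the printed shape and skipped with [1:] below.
train_df = pd.read_csv('data.csv', encoding='utf-8', header=None, names=['label','review'])
print(train_df.shape)

sentences = list(train_df['review'][1:])
label = train_df['label'][1:].values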

(1243, 2)
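
Before training it is worth checking how balanced the labels are, since the evaluation report at the end of this notebook lists far more samples for class 0 than for class 1. A minimal sketch using `collections.Counter`, which is already imported above; the label list here is a made-up stand-in for `train_df['label'][1:]`, not the real data.

```python
from collections import Counter

# Hypothetical labels standing in for train_df['label'][1:]; the real file is not shown.
toy_labels = ["0", "0", "1", "0", "0", "0", "1", "0", "0", "0"]

counts = Counter(toy_labels)
total = sum(counts.values())
for cls, n in sorted(counts.items()):
    print(f"class {cls}: {n} samples ({n / total:.0%})")

# Accuracy of always predicting the most frequent class: the baseline a model has to beat.
majority_baseline = counts.most_common(1)[0][1] / total
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

On imbalanced data, accuracy alone can look high even for a model that ignores the minority class, which is why per-class precision and recall are reported later.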


In [3]:
# 2. Token encoding
model_path = r'E:\code\chinese-roberta-wwm-ext'
tokenizer = AutoTokenizer.from_pretrained(model_path)
max_length = 32
# truncation=True cuts sequences at max_length; padding=True pads the rest to the longest one
sentences_tokened = tokenizer(sentences, padding=True, truncation=True, max_length=max_length, return_tensors='pt')
label = torch.tensor(label.astype(np.int64))
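
To make the effect of `padding=True`, `truncation=True`, and `max_length` concrete, here is a plain-Python sketch of what happens to a batch of token ids. The function `pad_and_truncate` and the ids are invented for illustration; the real tokenizer also adds special tokens such as `[CLS]` and `[SEP]`, which this toy skips.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # attention_mask is 1 for real tokens and 0 for padding
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# Hypothetical token ids for three sentences of different lengths
batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27], [31]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the dataset class below returns it alongside `input_ids`.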

In [4]:
# 3. Wrap the encoded data in a Dataset
from torch.utils.data import Dataset,DataLoader,random_split

class DataToDataset(Dataset):
    def __init__(self,encoding,labels):
        self.encoding=encoding
        self.labels=labels
        
    def __len__(self):
        return len(self.labels)
        
    def __getitem__(self,index):
        return self.encoding['input_ids'][index],self.encoding['attention_mask'][index],self.labels[index]

# Build the dataset and split it 80/20 into training and validation sets
datasets=DataToDataset(sentences_tokened,label)
train_size=int(len(datasets)*0.8)
test_size=len(datasets)-train_size
print([train_size,test_size])
train_dataset,val_dataset=random_split(dataset=datasets,lengths=[train_size,test_size])

[993, 249]
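
The printed sizes follow directly from the arithmetic: 1242 samples (1243 rows minus the skipped first row) split 80/20. The sketch below reproduces the numbers and shows one way to make such a split repeatable with a fixed seed. It uses Python's `random` module rather than `torch.utils.data.random_split`, and the seed value is an arbitrary choice for the example.

```python
import random

n_samples = 1243 - 1          # rows read from the CSV minus the skipped first row
train_size = int(n_samples * 0.8)
test_size = n_samples - train_size
print([train_size, test_size])

# Shuffle the indices with a fixed seed so the same split comes back on every run.
indices = list(range(n_samples))
random.Random(42).shuffle(indices)
train_idx, val_idx = indices[:train_size], indices[train_size:]
```

With PyTorch, passing `generator=torch.Generator().manual_seed(42)` to `random_split` serves the same purpose.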


In [5]:
BATCH_SIZE=16
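# num_workers=0 loads batches in the main process
train_loader=DataLoader(dataset=train_dataset,batch_size=BATCH_SIZE,shuffle=True,num_workers=0)
# The validation set does not need shuffling; the metrics do not depend on batch order
val_loader=DataLoader(dataset=val_dataset,batch_size=BATCH_SIZE,shuffle=False,num_workers=0)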
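
With 993 training samples and a batch size of 16, each epoch has 63 batches, the last of which holds a single sample. That is why the training log further down prints at batches 10 through 60 and then an epoch summary. A quick check of the arithmetic (the variable names are illustrative):

```python
import math

train_size, val_size, batch_size = 993, 249, 16

train_batches = math.ceil(train_size / batch_size)
val_batches = math.ceil(val_size / batch_size)
last_train_batch = train_size - (train_batches - 1) * batch_size

print(train_batches, val_batches, last_train_batch)
```

Passing `drop_last=True` to `DataLoader` would discard that final one-sample batch.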

In [6]:
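# 4. Create the model
# NOTE: the warning printed below shows that this checkpoint is of type "bert", so loading it
# through a Roberta* class leaves the listed weights newly initialized. A class that matches
# the checkpoint type (BertForSequenceClassification or AutoModelForSequenceClassification)
# would reuse the pretrained weights.
class RobertaTextClassficationModel(nn.Module):
    def __init__(self):
        super(RobertaTextClassficationModel,self).__init__()
        # Despite its name, this attribute holds a RobertaForSequenceClassification
        self.distilbert=RobertaForSequenceClassification.from_pretrained(model_path, num_labels=2)

    def forward(self,ids,mask):
        out=self.distilbert(input_ids=ids,attention_mask=mask)
        # out[0] holds the classification logits, shape (batch_size, num_labels)
        return out[0]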


mymodel=RobertaTextClassficationModel()


# Use the GPU if one is available, otherwise the CPU
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device=",device)
if torch.cuda.device_count()>1:
    print("Let's use ",torch.cuda.device_count(),"GPUs!")
    mymodel=nn.DataParallel(mymodel)
mymodel.to(device)

You are using a model of type bert to instantiate a model of type roberta. This is not supported for all configurations of models and can yield errors.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at E:\code\chinese-roberta-wwm-ext and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'embeddings.LayerNorm.bias', 'embeddings.LayerNorm.weight', 'embeddings.position_embeddings.weight', 'embeddings.token_type_embeddings.weight', 'embeddings.word_embeddings.weight', 'encoder.layer.0.attention.output.LayerNorm.bias', 'encoder.layer.0.attention.output.LayerNorm.weight', 'encoder.layer.0.attention.output.dense.bias', 'encoder.layer.0.attention.output.dense.weight', 'encoder.layer.0.attention.self.key.bias', 'encoder.layer.0.attention.self.key.weight', 'encoder.layer.0.attention.self.query.bias', 'encoder.layer.0.attention.self.query.weight', 'encoder.layer.0.a

device= cuda


RobertaTextClassficationModel(
  (distilbert): RobertaForSequenceClassification(
    (roberta): RobertaModel(
      (embeddings): RobertaEmbeddings(
        (word_embeddings): Embedding(21128, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768, padding_idx=0)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): RobertaEncoder(
        (layer): ModuleList(
          (0): RobertaLayer(
            (attention): RobertaAttention(
              (self): RobertaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): RobertaSelfOutput(
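
The warning at the top of this output says the checkpoint's configuration declares type `bert`, while the code instantiates a RoBERTa class, which is why the listed weights are reported as newly initialized. One way to avoid the mismatch is to pick the class from the `model_type` field of the checkpoint's `config.json`. The sketch below only illustrates that lookup: `config_text` is a hypothetical one-field excerpt, and `CLASS_FOR_TYPE` and `matching_class_name` are names invented for the example.

```python
import json

# Hypothetical excerpt of the checkpoint's config.json; only "model_type"
# is taken from the warning above, the rest of the file is omitted.
config_text = '{"model_type": "bert"}'

# Map from model_type to the name of the matching transformers classification class.
CLASS_FOR_TYPE = {
    "bert": "BertForSequenceClassification",
    "roberta": "RobertaForSequenceClassification",
}

def matching_class_name(config_json: str) -> str:
    """Return the class name that matches the checkpoint's declared model type."""
    model_type = json.loads(config_json)["model_type"]
    return CLASS_FOR_TYPE[model_type]

chosen = matching_class_name(config_text)
print(chosen)
```

In practice `AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)` performs this dispatch automatically.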
            

In [7]:
total_params = 0
for name, parameters in mymodel.named_parameters():
    if not parameters.requires_grad: continue
    print(name, ':', parameters.size())
    total_params += parameters.numel()
print("Number of trainable parameters:", total_params)

distilbert.roberta.embeddings.word_embeddings.weight : torch.Size([21128, 768])
distilbert.roberta.embeddings.position_embeddings.weight : torch.Size([512, 768])
distilbert.roberta.embeddings.token_type_embeddings.weight : torch.Size([2, 768])
distilbert.roberta.embeddings.LayerNorm.weight : torch.Size([768])
distilbert.roberta.embeddings.LayerNorm.bias : torch.Size([768])
distilbert.roberta.encoder.layer.0.attention.self.query.weight : torch.Size([768, 768])
distilbert.roberta.encoder.layer.0.attention.self.query.bias : torch.Size([768])
distilbert.roberta.encoder.layer.0.attention.self.key.weight : torch.Size([768, 768])
distilbert.roberta.encoder.layer.0.attention.self.key.bias : torch.Size([768])
distilbert.roberta.encoder.layer.0.attention.self.value.weight : torch.Size([768, 768])
distilbert.roberta.encoder.layer.0.attention.self.value.bias : torch.Size([768])
distilbert.roberta.encoder.layer.0.attention.output.dense.weight : torch.Size([768, 768])
distilbert.roberta.encoder.laye
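
The parameter count of each tensor is the product of its dimensions, so the sizes printed above can be totalled by hand. The sketch below does this for the embedding tensors listed in the output; the `shapes` dictionary is copied from those lines, with the `distilbert.roberta.` prefix dropped for brevity.

```python
import math

shapes = {
    "embeddings.word_embeddings.weight": (21128, 768),
    "embeddings.position_embeddings.weight": (512, 768),
    "embeddings.token_type_embeddings.weight": (2, 768),
    "embeddings.LayerNorm.weight": (768,),
    "embeddings.LayerNorm.bias": (768,),
}

# numel() of a tensor equals the product of its dimensions
counts = {name: math.prod(shape) for name, shape in shapes.items()}
for name, n in counts.items():
    print(f"{name}: {n:,}")

embedding_total = sum(counts.values())
print(f"embedding parameters: {embedding_total:,}")
```

The word embedding table alone accounts for 21128 × 768 = 16,226,304 of these parameters.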

In [7]:
# 5. Train the model
loss_func=nn.CrossEntropyLoss()
optimizer=torch.optim.Adam(mymodel.parameters(),lr=0.00001)

from sklearn.metrics import accuracy_score
def flat_accuracy(preds,labels):
    # preds: logits of shape (batch, num_labels); labels: integer class ids
    pred_flat=np.argmax(preds,axis=1).flatten()
    labels_flat=labels.flatten()
    return accuracy_score(labels_flat,pred_flat)

epochs=3
for epoch in range(epochs):
    # from_pretrained returns the model in eval mode, so switch to train mode to enable dropout
    mymodel.train()
    train_loss = 0.0
    train_acc=0.0
    for i,data in enumerate(train_loader):
        input_ids,attention_mask,labels=[elem.to(device) for elem in data]
        # Reset the accumulated gradients
        optimizer.zero_grad()
        # Forward pass: logits for the batch
        out=mymodel(input_ids.long(),attention_mask)
        # Compute the loss
        loss=loss_func(out,labels)
        train_loss += loss.item()
        # Backpropagate the loss
        loss.backward()
        # Update the model parameters
        optimizer.step()
        # Compute the batch accuracy on CPU copies of the tensors
        out=out.detach().cpu().numpy()
        labels=labels.detach().cpu().numpy()
        train_acc+=flat_accuracy(out,labels)
        if (i + 1) % 10 == 0:
            print("train %d/%d epochs Batch %d Loss:%f, Acc:%f" %(epoch+1,epochs, (i+1), train_loss/(i+1),train_acc/(i+1)))
    print("train %d/%d epochs Loss:%f, Acc:%f" %(epoch+1,epochs,train_loss/(i+1),train_acc/(i+1)))

train 1/3 epochs Batch 10 Loss:0.383828, Acc:0.825000
train 1/3 epochs Batch 20 Loss:0.396195, Acc:0.850000
train 1/3 epochs Batch 30 Loss:0.366358, Acc:0.870833
train 1/3 epochs Batch 40 Loss:0.363269, Acc:0.875000
train 1/3 epochs Batch 50 Loss:0.357927, Acc:0.877500
train 1/3 epochs Batch 60 Loss:0.352648, Acc:0.879167
train 1/3 epochs Loss:0.344397, Acc:0.882937
train 2/3 epochs Batch 10 Loss:0.294893, Acc:0.881250
train 2/3 epochs Batch 20 Loss:0.242128, Acc:0.903125
train 2/3 epochs Batch 30 Loss:0.205164, Acc:0.918750
train 2/3 epochs Batch 40 Loss:0.241803, Acc:0.918750
train 2/3 epochs Batch 50 Loss:0.221557, Acc:0.926250
train 2/3 epochs Batch 60 Loss:0.197012, Acc:0.934375
train 2/3 epochs Loss:0.188249, Acc:0.937500
train 3/3 epochs Batch 10 Loss:0.111006, Acc:0.962500
train 3/3 epochs Batch 20 Loss:0.119311, Acc:0.965625
train 3/3 epochs Batch 30 Loss:0.088671, Acc:0.975000
train 3/3 epochs Batch 40 Loss:0.076743, Acc:0.978125
train 3/3 epochs Batch 50 Loss:0.076226, Acc:0
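
The two numbers in each log line come from `nn.CrossEntropyLoss` and `flat_accuracy`. What they compute for a single batch can be sketched in plain Python. The functions `cross_entropy` and `accuracy` and the toy logits below are written for illustration only; they mirror the default mean reduction of the loss and the argmax-based accuracy, but are not the library implementations.

```python
import math

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class under softmax(logits)."""
    losses = []
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max before exponentiating, for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_sum_exp - row[y])
    return sum(losses) / len(losses)

def accuracy(logits, labels):
    """Fraction of rows whose largest logit sits at the true class index."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical logits for a batch of four samples and two classes
toy_logits = [[2.0, -1.0], [0.5, 0.3], [-1.2, 1.5], [0.0, 0.0]]
toy_labels = [0, 1, 1, 0]

print(f"loss = {cross_entropy(toy_logits, toy_labels):.4f}")
print(f"acc  = {accuracy(toy_logits, toy_labels):.2f}")
```

Note that the training loop averages these per-batch values over the number of batches, so a small final batch carries the same weight as a full one in the reported epoch figures.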

In [10]:
# 6. Evaluate
from sklearn import metrics

print("evaluate...")
pred_list = []
y_list = []
mymodel.eval()
for j,batch in enumerate(val_loader):
    val_input_ids,val_attention_mask,val_labels=[elem.to(device) for elem in batch]
    with torch.no_grad():
        pred=mymodel(val_input_ids,val_attention_mask)
        pred=pred.detach().cpu().numpy()
        pred_flat=np.argmax(pred,axis=1).flatten()
        pred_list.extend(pred_flat)
        val_labels=val_labels.detach().cpu().numpy()
        y_list.extend(val_labels)

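# scikit-learn expects the arguments in the order (y_true, y_pred)
classify_report = metrics.classification_report(y_list, pred_list, digits=4)  # classification report; "support" is the number of validation samples per true class
print(classify_report)
confusion_matrix = metrics.confusion_matrix(y_list, pred_list)  # confusion matrix; rows are true labels, columns are predictions
print(confusion_matrix)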

evaluate...
              precision    recall  f1-score   support

           0     0.9913    0.9913    0.9913       229
           1     0.9000    0.9000    0.9000        20

    accuracy                         0.9839       249
   macro avg     0.9456    0.9456    0.9456       249
weighted avg     0.9839    0.9839    0.9839       249

[[227   2]
 [  2  18]]
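
The figures in the report can be recomputed from the confusion matrix, which is a useful sanity check on how precision, recall, and F1 are defined. The sketch below uses the matrix printed above; the variable names are illustrative.

```python
# Confusion matrix printed above; it is symmetric, so it reads the same
# whether rows are taken as true labels or as predictions.
cm = [[227, 2],
      [2, 18]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

report = {}
for c in range(n_classes):
    tp = cm[c][c]
    row_sum = sum(cm[c])                               # samples whose true label is c
    col_sum = sum(cm[r][c] for r in range(n_classes))  # samples predicted as c
    precision = tp / col_sum
    recall = tp / row_sum
    f1 = 2 * precision * recall / (precision + recall)
    report[c] = (precision, recall, f1, row_sum)
    print(f"class {c}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} support={row_sum}")

macro_f1 = sum(f1 for _, _, f1, _ in report.values()) / n_classes
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}")
```

Class 1 has only 20 validation samples, so its 0.90 precision and recall rest on 2 errors in each direction and should be read with that small sample size in mind.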
