## GPT2

GPT-2（Generative Pre-trained Transformer 2）是OpenAI提出的一种预训练语言模型，是GPT系列的第二代模型。GPT-2基于Transformer架构，具有强大的文本生成能力，在自然语言处理领域取得了显著的成就。

### GPT-2的原理如下：

Transformer架构：GPT-2采用了Transformer架构，其中包括多层的自注意力机制和前馈神经网络。这种架构能够有效地捕捉输入序列中的长距离依赖关系。

自回归生成：GPT-2使用自回归生成的方式，即在生成每个词时都考虑前面生成的词，通过注意力机制来捕捉上下文信息，从而生成连贯的文本。

无监督预训练：GPT-2在大规模文本语料上进行无监督预训练，学习通用的语言表示。这使得模型在各种下游任务上都能够表现良好。

微调：在特定的下游任务上进行微调，使得模型可以适应不同的任务需求，如文本生成、文本分类等。

### GPT-2解决了传统语言模型的一些问题，包括：

生成质量：GPT-2在生成文本方面取得了很高的质量，能够生成连贯、合理的文本片段。

上下文理解：GPT-2通过自回归生成方式，能够充分利用上下文信息，生成更加准确的文本。

多样性和创造性：GPT-2能够生成多样性的文本，有一定的创造性，可以应用于文本生成、对话系统等多个领域。

GPT-2的参数规模为1.5亿个参数（small版本），最大版本的参数规模为15亿个。GPT-2在2019年发布，受到了广泛关注，并在自然语言生成领域取得了显著的进展。

GPT-3和GPT-4的参数规模分别是：

GPT-3：1750亿个参数。
GPT-4：规模暂未公布，但预计会进一步增大。

In [1]:
#GPT2模型文件https://huggingface.co/uer/gpt2-chinese-cluecorpussmall

In [2]:
import pandas as pd
import numpy as np
import torch.nn as nn
import torch
from transformers import BertTokenizer, GPT2ForSequenceClassification, GPT2Config
from sklearn import metrics
from collections import Counter

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
import transformers
print(transformers.__version__)
print(torch.__version__)

4.41.2
2.1.0+cu118


In [4]:
#1、加载数据
train_df = pd.read_csv('data.csv', encoding='utf-8', header=None, names=['label','review'])
print(train_df.shape)

sentences = list(train_df['review'][1:])
label =train_df['label'][1:].values

(1243, 2)


In [5]:
# instantiate the configuration for your model, this can be imported from transformers
configuration = GPT2Config()

In [6]:
#2 token encodding
model_path = r'E:\code\gpt2-chinese-cluecorpussmall'
tokenizer=BertTokenizer.from_pretrained(model_path)
## set up your tokenizer, just like you described, and set the pad token
tokenizer.eos_token = tokenizer.pad_token
max_length=32
sentences_tokened=tokenizer(sentences,padding=True,truncation=True,max_length=max_length, return_tensors='pt')
label=torch.tensor(label.astype(np.int64))

OSError: Incorrect path_or_model_id: 'E:\code\gpt2-chinese-cluecorpussmall'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

In [7]:
#3 encoding data
from torch.utils.data import Dataset,DataLoader,random_split

class DataToDataset(Dataset):
    def __init__(self,encoding,labels):
        self.encoding=encoding
        self.labels=labels
        
    def __len__(self):
        return len(self.labels)
        
    def __getitem__(self,index):
        return self.encoding['input_ids'][index],self.encoding['attention_mask'][index],self.labels[index]

#封装数据
datasets=DataToDataset(sentences_tokened,label)
train_size=int(len(datasets)*0.8)
test_size=len(datasets)-train_size
print([train_size,test_size])
train_dataset,val_dataset=random_split(dataset=datasets,lengths=[train_size,test_size])

NameError: name 'sentences_tokened' is not defined

In [None]:
BATCH_SIZE=16
#这里的num_workers要大于0
train_loader=DataLoader(dataset=train_dataset,batch_size=BATCH_SIZE,shuffle=True,num_workers=0)
val_loader=DataLoader(dataset=val_dataset,batch_size=BATCH_SIZE,shuffle=True,num_workers=0)#

In [None]:
#4、create model
class GPT2TextClassficationModel(nn.Module):
    def __init__(self):
        super(GPT2TextClassficationModel,self).__init__()
        self.distilbert=GPT2ForSequenceClassification(configuration).from_pretrained(model_path, num_labels=2)
        # set the pad token of the model's configuration
        self.distilbert.config.pad_token_id = self.distilbert.config.eos_token_id
        
    def forward(self,ids,mask):
        out=self.distilbert(input_ids=ids,attention_mask=mask)
        #print(out.shape)
        #print(out)
        return out[0]


mymodel=GPT2TextClassficationModel()

#获取gpu和cpu的设备信息
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device=",device)
if torch.cuda.device_count()>1:
    print("Let's use ",torch.cuda.device_count(),"GPUs!")
    mymodel=nn.DataParallel(mymodel)
mymodel.to(device)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at E:\code\gpt2-chinese-cluecorpussmall and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


device= cuda


GPT2TextClassficationModel(
  (distilbert): GPT2ForSequenceClassification(
    (transformer): GPT2Model(
      (wte): Embedding(21128, 768)
      (wpe): Embedding(1024, 768)
      (drop): Dropout(p=0.1, inplace=False)
      (h): ModuleList(
        (0): GPT2Block(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): GPT2Attention(
            (c_attn): Conv1D()
            (c_proj): Conv1D()
            (attn_dropout): Dropout(p=0.1, inplace=False)
            (resid_dropout): Dropout(p=0.1, inplace=False)
          )
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): GPT2MLP(
            (c_fc): Conv1D()
            (c_proj): Conv1D()
            (act): NewGELUActivation()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1): GPT2Block(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): GPT2Attention(
            (c_attn): Conv1D()
       

In [None]:
#5、train model
loss_func=nn.CrossEntropyLoss()
optimizer=torch.optim.Adam(mymodel.parameters(),lr=0.00001)

from sklearn.metrics import accuracy_score
def flat_accuracy(preds,labels):
    pred_flat=np.argmax(preds,axis=1).flatten()
    labels_flat=labels.flatten()
    return accuracy_score(labels_flat,pred_flat)

epochs=3
for epoch in range(epochs):
    train_loss = 0.0
    train_acc=0.0
    for i,data in enumerate(train_loader):
        input_ids,attention_mask,labels=[elem.to(device) for elem in data]
        #优化器置零
        optimizer.zero_grad()
        #得到模型的结果
        out=mymodel(input_ids.long(),attention_mask)
        #计算误差
        loss=loss_func(out,labels)
        train_loss += loss.item()
        #误差反向传播
        loss.backward()
        #更新模型参数
        optimizer.step()
        #计算acc 
        #out=out.detach().numpy()
        out=out.detach().cpu().numpy()
        #labels=labels.detach().numpy()
        labels=labels.detach().cpu().numpy()
        train_acc+=flat_accuracy(out,labels)
        if (i + 1) % 10 == 0:
                print("train %d/%d epochs Batch %d Loss:%f, Acc:%f" %(epoch+1,epochs, (i+1), train_loss/(i+1),train_acc/(i+1)))
    print("train %d/%d epochs Loss:%f, Acc:%f" %(epoch+1,epochs,train_loss/(i+1),train_acc/(i+1)))

train 1/3 epochs Batch 10 Loss:0.385499, Acc:0.781250
train 1/3 epochs Batch 20 Loss:0.316752, Acc:0.846875
train 1/3 epochs Batch 30 Loss:0.236989, Acc:0.891667
train 1/3 epochs Batch 40 Loss:0.219348, Acc:0.901563
train 1/3 epochs Batch 50 Loss:0.190997, Acc:0.917500
train 1/3 epochs Batch 60 Loss:0.172897, Acc:0.923958
train 1/3 epochs Loss:0.166283, Acc:0.927579
train 2/3 epochs Batch 10 Loss:0.069036, Acc:0.987500
train 2/3 epochs Batch 20 Loss:0.063258, Acc:0.990625
train 2/3 epochs Batch 30 Loss:0.047014, Acc:0.993750
train 2/3 epochs Batch 40 Loss:0.044587, Acc:0.993750
train 2/3 epochs Batch 50 Loss:0.050186, Acc:0.991250
train 2/3 epochs Batch 60 Loss:0.048678, Acc:0.991667
train 2/3 epochs Loss:0.047353, Acc:0.992063
train 3/3 epochs Batch 10 Loss:0.014103, Acc:0.993750
train 3/3 epochs Batch 20 Loss:0.019296, Acc:0.990625
train 3/3 epochs Batch 30 Loss:0.014419, Acc:0.993750
train 3/3 epochs Batch 40 Loss:0.022364, Acc:0.992188
train 3/3 epochs Batch 50 Loss:0.025056, Acc:0

In [None]:
#6、evaluate
from sklearn import metrics

print("evaluate...")
pred_list = []
y_list = []
mymodel.eval()
for j,batch in enumerate(val_loader):
    val_input_ids,val_attention_mask,val_labels=[elem.to(device) for elem in batch]
    with torch.no_grad():
        pred=mymodel(val_input_ids,val_attention_mask)
        pred=pred.detach().cpu().numpy()
        pred_flat=np.argmax(pred,axis=1).flatten()
        pred_list.extend(pred_flat)
        val_labels=val_labels.detach().cpu().numpy()
        y_list.extend(val_labels)

classify_report = metrics.classification_report(pred_list, y_list, digits=4) #分类报告 support测试集样本数
print(classify_report) 
confusion_matrix = metrics.confusion_matrix(pred_list, y_list) #混淆矩阵
print(confusion_matrix) 

evaluate...
              precision    recall  f1-score   support

           0     0.9909    0.9732    0.9820       224
           1     0.7931    0.9200    0.8519        25

    accuracy                         0.9679       249
   macro avg     0.8920    0.9466    0.9169       249
weighted avg     0.9710    0.9679    0.9689       249

[[218   6]
 [  2  23]]
