## Transformer

Transformer模型是一种基于自注意力机制（Self-Attention）的深度学习模型，它在2017年由Google的研究者在论文《Attention Is All You Need》中首次提出¹²³。

**特点**：
- **自注意力机制**：Transformer使用自注意力机制来处理序列数据，这使得每个元素都能直接与序列中的其他元素交互和获取信息。
- **并行计算**：与传统的循环神经网络（RNN）相比，Transformer能够更好地利用现代硬件进行并行计算，从而加快训练速度。
- **无需递归**：Transformer完全摒弃了递归结构，这减少了模型的复杂性并提高了效率。
- **多头注意力**：通过多头注意力机制，模型能够同时关注输入序列的不同部分，捕捉丰富的上下文信息。

**解决的问题**：
- **长距离依赖问题**：在处理长序列数据时，RNN和LSTM等传统模型容易受到梯度消失或爆炸的影响，难以捕捉长距离依赖关系。Transformer通过自注意力机制有效地解决了这一问题。
- **并行化难题**：RNN由于其递归特性，难以实现有效的并行化。Transformer的结构使得模型可以充分利用现代计算资源进行并行处理。

**来源**：
- Transformer模型首次出现在2017年的论文《Attention Is All You Need》中。

Transformer模型的提出标志着自然语言处理领域的一个重要转折点，它的设计理念和架构已经成为了后续许多模型的基础，如BERT、GPT等。

In [17]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.utils.data as Data
import torchvision
import numpy as np;
from gensim.models import KeyedVectors
import copy
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
import pandas as pd
from collections import Counter
from sklearn.metrics import classification_report
import math

In [2]:
#读取数据
df = pd.read_csv('分词后data.csv')
df = df.dropna()
print(df.head())

                                                  文本  标签
0                       商业秘密 秘密性 维系 商业价值 垄断 地位 前提条件    0
1  南口 阿玛施 新春 第一批 限量 春装 店 春暖花开 淑女 裙冰 蓝色 公主 衫 气质 粉小...   1
2                                 带给 常州 一场 壮观 视觉 盛宴    0
3                                     原因 不明 泌尿系统 结石    0
4                                    年 盐城 拉回来 麻麻 嫁妆    0


In [3]:
data = df['文本'].tolist()
label = df['标签'].tolist()
print(len(data), len(label)) #查看语料信息
print(Counter(label)) #查看不同标签文本数量

1241 1241
Counter({0: 1119, 1: 122})


In [4]:
texts = [each.split() for each in data]
print(data[0:5])

['商业秘密 秘密性 维系 商业价值 垄断 地位 前提条件 ', '南口 阿玛施 新春 第一批 限量 春装 店 春暖花开 淑女 裙冰 蓝色 公主 衫 气质 粉小 西装 冰丝 女王 长半裙 皇 ', '带给 常州 一场 壮观 视觉 盛宴 ', '原因 不明 泌尿系统 结石 ', '年 盐城 拉回来 麻麻 嫁妆 ']


In [5]:
#构建词表，将文本中的字符单词替换为数字索引
word_vocb=[]
word_vocb.append('')
for text in texts:
    for word in text:
        word_vocb.append(word)
word_vocb=set(word_vocb)
vocb_size=len(word_vocb)

In [6]:
print(vocb_size)

5919


In [7]:
#词表与索引的映射
word_to_idx={word:i for i,word in enumerate(word_vocb)}
idx_to_word={word_to_idx[word]:word for word in word_to_idx}

In [8]:
print(word_to_idx['商业价值'])
print(idx_to_word[222])

282
噪音


In [9]:
#演示文本最大长度设置为30
max_len = 30
#生成训练数据，删除超过max_len的部分，不够的补0
texts_with_id=np.zeros([len(texts),max_len])
for i in range(0,len(texts)):
    if len(texts[i])<max_len:
        for j in range(0,len(texts[i])):
            texts_with_id[i][j]=word_to_idx[texts[i][j]]
        for j in range(len(texts[i]),max_len):
            texts_with_id[i][j] = word_to_idx['']
    else:
        for j in range(0,max_len):
            texts_with_id[i][j]=word_to_idx[texts[i][j]]

In [10]:
print(texts_with_id.shape)
print(texts_with_id[0])

(1241, 30)
[4751. 5814. 4514.  282. 2059. 1049. 5804.    0.    0.    0.    0.    0.
    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
    0.    0.    0.    0.    0.    0.]


In [11]:
#Transformer模型
class Transformer(nn.Module):
    def __init__(self,args):
        super(Transformer, self).__init__()
        vocb_size = args['vocb_size']
        dim = args['dim'] #词向量维度
        n_class = args['n_class']
        pad_size = args['max_len'] # 每句话处理成的长度(短填长切)
        embedding_matrix=args['embedding_matrix']
        hidden_size = 128 #隐藏层单元
        num_layers = 2 #RNN层数
        dropout = 0.5 #防过拟合随机丢失
        #device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')   # 设备
        device = 'cpu'
        dim_model = 300
        hidden = 1024
        last_hidden = 512
        num_head = 5
        num_encoder = 2
        
        #需要将事先训练好的词向量载入
        self.embedding = nn.Embedding(vocb_size, dim,_weight=embedding_matrix)
        
        self.postion_embedding = Positional_Encoding(dim, pad_size, dropout, device)
        
        self.encoder = Encoder(dim_model, num_head, hidden, dropout)
        self.encoders = nn.ModuleList([
            copy.deepcopy(self.encoder)
            # Encoder(config.dim_model, config.num_head, config.hidden, config.dropout)
            for _ in range(num_encoder)])

        self.fc1 = nn.Linear(pad_size * dim_model, num_layers)
        # self.fc2 = nn.Linear(last_hidden, num_layers)
        # self.fc1 = nn.Linear(dim_model, num_layers)
        
    def forward(self, x):
        out = self.embedding(x)
        out = self.postion_embedding(out)
        for encoder in self.encoders:
            out = encoder(out)
        out = out.view(out.size(0), -1)
        # out = torch.mean(out, 1)
        out = self.fc1(out)
        return out
    
class Encoder(nn.Module):
    def __init__(self, dim_model, num_head, hidden, dropout):
        super(Encoder, self).__init__()
        self.attention = Multi_Head_Attention(dim_model, num_head, dropout)
        self.feed_forward = Position_wise_Feed_Forward(dim_model, hidden, dropout)

    def forward(self, x):
        out = self.attention(x)
        out = self.feed_forward(out)
        return out


class Positional_Encoding(nn.Module):
    def __init__(self, embed, pad_size, dropout, device):
        super(Positional_Encoding, self).__init__()
        self.device = device
        self.pe = torch.tensor([[pos / (10000.0 ** (i // 2 * 2.0 / embed)) for i in range(embed)] for pos in range(pad_size)])
        self.pe[:, 0::2] = np.sin(self.pe[:, 0::2])
        self.pe[:, 1::2] = np.cos(self.pe[:, 1::2])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = x + nn.Parameter(self.pe, requires_grad=False).to(self.device)
        out = self.dropout(out)
        return out


class Scaled_Dot_Product_Attention(nn.Module):
    '''Scaled Dot-Product Attention '''
    def __init__(self):
        super(Scaled_Dot_Product_Attention, self).__init__()

    def forward(self, Q, K, V, scale=None):
        '''
        Args:
            Q: [batch_size, len_Q, dim_Q]
            K: [batch_size, len_K, dim_K]
            V: [batch_size, len_V, dim_V]
            scale: 缩放因子 论文为根号dim_K
        Return:
            self-attention后的张量，以及attention张量
        '''
        attention = torch.matmul(Q, K.permute(0, 2, 1))
        if scale:
            attention = attention * scale
        # if mask:  # TODO change this
        #     attention = attention.masked_fill_(mask == 0, -1e9)
        attention = F.softmax(attention, dim=-1)
        context = torch.matmul(attention, V)
        return context


class Multi_Head_Attention(nn.Module):
    def __init__(self, dim_model, num_head, dropout=0.0):
        super(Multi_Head_Attention, self).__init__()
        self.num_head = num_head
        assert dim_model % num_head == 0
        self.dim_head = dim_model // self.num_head
        self.fc_Q = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_K = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_V = nn.Linear(dim_model, num_head * self.dim_head)
        self.attention = Scaled_Dot_Product_Attention()
        self.fc = nn.Linear(num_head * self.dim_head, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):
        batch_size = x.size(0)
        Q = self.fc_Q(x)
        K = self.fc_K(x)
        V = self.fc_V(x)
        Q = Q.view(batch_size * self.num_head, -1, self.dim_head)
        K = K.view(batch_size * self.num_head, -1, self.dim_head)
        V = V.view(batch_size * self.num_head, -1, self.dim_head)
        # if mask:  # TODO
        #     mask = mask.repeat(self.num_head, 1, 1)  # TODO change this
        scale = K.size(-1) ** -0.5  # 缩放因子
        context = self.attention(Q, K, V, scale)

        context = context.view(batch_size, -1, self.dim_head * self.num_head)
        out = self.fc(context)
        out = self.dropout(out)
        out = out + x  # 残差连接
        out = self.layer_norm(out)
        return out


class Position_wise_Feed_Forward(nn.Module):
    def __init__(self, dim_model, hidden, dropout=0.0):
        super(Position_wise_Feed_Forward, self).__init__()
        self.fc1 = nn.Linear(dim_model, hidden)
        self.fc2 = nn.Linear(hidden, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):
        out = self.fc1(x)
        out = F.relu(out)
        out = self.fc2(out)
        out = self.dropout(out)
        out = out + x  # 残差连接
        out = self.layer_norm(out)
        return out

## ChatGPT 生成的代码解释

这段代码定义了一个Transformer模型，用于文本分类任务。下面是对代码的解释：

1. `class Transformer(nn.Module):`：定义了一个名为Transformer的PyTorch模型类，继承自`nn.Module`。

2. `def __init__(self,args):`：模型的初始化方法，接收一个参数`args`，其中包含了模型所需的各种参数。

3. 在`__init__`方法中定义了许多参数，包括词汇表大小、词向量维度、类别数等。同时还定义了一些Transformer模型的超参数，如隐藏层单元数量、RNN层数、防止过拟合的随机丢失率等。

4. `self.embedding = nn.Embedding(vocb_size, dim,_weight=embedding_matrix)`：定义一个词嵌入层，用于将词的索引转换为词向量。这里使用了事先训练好的词向量矩阵作为初始权重。

5. `self.postion_embedding = Positional_Encoding(dim, pad_size, dropout, device)`：定义一个位置编码层，用于加入位置信息到词向量中。

6. `self.encoder = Encoder(dim_model, num_head, hidden, dropout)`：定义了一个编码器，包含多层Transformer的编码器。

7. `def forward(self, x):`：定义模型的前向传播方法，接收输入x，返回模型的输出。

8. `out = self.embedding(x)`：将输入x通过词嵌入层转换为词向量。

9. `out = self.postion_embedding(out)`：将词向量加上位置编码。

10. `for encoder in self.encoders:`：遍历多层编码器。

11. `out = encoder(out)`：将输入通过编码器得到输出。

12. `out = out.view(out.size(0), -1)`：将输出展平为一维向量。

13. `out = self.fc1(out)`：使用全连接层将输出映射到类别空间。

整个模型包含了Embedding层、Positional Encoding层、多层Encoder层和全连接层。通过这些组件，Transformer模型可以有效地捕捉输入序列中的信息，并输出用于分类的结果。

In [12]:
args = {}

word_dim = 300 #词向量的维度
n_class = 2 #类别

#textCNN调用的参数
args['vocb_size']=vocb_size
args['max_len']=max_len
args['n_class']=n_class
args['dim']=word_dim

In [13]:
#word2vec词向量
cn_model = KeyedVectors.load_word2vec_format('F:/sgns.weibo.word.bz2', binary=False)

In [14]:
#embedding层的参数大小为vocb_size*dim，即词汇表大小乘词向量的维度 又称为lookup表
embedding_matrix = np.zeros((vocb_size, word_dim))

for word, i in word_to_idx.items():
    if word in cn_model:
        embedding_vector = cn_model[word]
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
args['embedding_matrix']=torch.Tensor(embedding_matrix)

In [18]:
#构建Transformer模型
rnn=Transformer(args)
print(rnn) #输出模型结构

Transformer(
  (embedding): Embedding(5919, 300)
  (postion_embedding): Positional_Encoding(
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (encoder): Encoder(
    (attention): Multi_Head_Attention(
      (fc_Q): Linear(in_features=300, out_features=300, bias=True)
      (fc_K): Linear(in_features=300, out_features=300, bias=True)
      (fc_V): Linear(in_features=300, out_features=300, bias=True)
      (attention): Scaled_Dot_Product_Attention()
      (fc): Linear(in_features=300, out_features=300, bias=True)
      (dropout): Dropout(p=0.5, inplace=False)
      (layer_norm): LayerNorm((300,), eps=1e-05, elementwise_affine=True)
    )
    (feed_forward): Position_wise_Feed_Forward(
      (fc1): Linear(in_features=300, out_features=1024, bias=True)
      (fc2): Linear(in_features=1024, out_features=300, bias=True)
      (dropout): Dropout(p=0.5, inplace=False)
      (layer_norm): LayerNorm((300,), eps=1e-05, elementwise_affine=True)
    )
  )
  (encoders): ModuleList(
    (0): Encode

In [19]:
total_params = 0
for name, parameters in rnn.named_parameters():
    if not parameters.requires_grad: continue
    print(name, ':', parameters.size())
    total_params += parameters.numel()
print("模型需要训练参数为：", total_params)

embedding.weight : torch.Size([5919, 300])
encoder.attention.fc_Q.weight : torch.Size([300, 300])
encoder.attention.fc_Q.bias : torch.Size([300])
encoder.attention.fc_K.weight : torch.Size([300, 300])
encoder.attention.fc_K.bias : torch.Size([300])
encoder.attention.fc_V.weight : torch.Size([300, 300])
encoder.attention.fc_V.bias : torch.Size([300])
encoder.attention.fc.weight : torch.Size([300, 300])
encoder.attention.fc.bias : torch.Size([300])
encoder.attention.layer_norm.weight : torch.Size([300])
encoder.attention.layer_norm.bias : torch.Size([300])
encoder.feed_forward.fc1.weight : torch.Size([1024, 300])
encoder.feed_forward.fc1.bias : torch.Size([1024])
encoder.feed_forward.fc2.weight : torch.Size([300, 1024])
encoder.feed_forward.fc2.bias : torch.Size([300])
encoder.feed_forward.layer_norm.weight : torch.Size([300])
encoder.feed_forward.layer_norm.bias : torch.Size([300])
encoders.0.attention.fc_Q.weight : torch.Size([300, 300])
encoders.0.attention.fc_Q.bias : torch.Size([300

In [15]:
#参数设置

EPOCH = 5; #轮次，根据训练情况设置

LR = 0.001 #学习率，根据训练情况设置
optimizer = torch.optim.Adam(rnn.parameters(), lr=LR) #优化器
#损失函数
loss_function = nn.CrossEntropyLoss()
#训练批次大小，和内存显存相关
epoch_size=100;
texts_len=len(texts_with_id)
print(texts_len)
#划分训练数据和测试数据
x_train, x_test, y_train, y_test = train_test_split(texts_with_id, label, test_size=0.2, random_state=42)
 
test_x=torch.LongTensor(x_test)
test_y=torch.LongTensor(y_test)
train_x=x_train
train_y=y_train
 
test_epoch_size=200;

1241


In [16]:
print(x_train.shape)
print(label[0:10])
print(y_train[0:10])

(992, 30)
[0, 1, 0, 0, 0, 0, 1, 0, 1, 0]
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]


In [36]:
for epoch in range(EPOCH):
    train_acc_all = 0
    for i in range(0, math.ceil(len(train_x)/epoch_size)):
 
        b_x = Variable(torch.LongTensor(train_x[i*epoch_size:i*epoch_size+epoch_size]))
 
        b_y = Variable(torch.LongTensor((train_y[i*epoch_size:i*epoch_size+epoch_size])))
        output = rnn(b_x)
        loss = loss_function(output, b_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print('batch: ' + str(i) + " 损失:" + str(loss.data))
        pred_y = torch.max(output, 1)[1].data.squeeze()
        acc = (b_y == pred_y)
        acc = acc.numpy().sum()
        train_acc_all = train_acc_all + acc
 
    acc_all = 0;
    for j in range(0, math.ceil(len(test_x) / test_epoch_size)):
        b_x = Variable(torch.LongTensor(test_x[j * test_epoch_size:j * test_epoch_size + test_epoch_size]))
        b_y = Variable(torch.LongTensor((test_y[j * test_epoch_size:j * test_epoch_size + test_epoch_size])))
        test_output = rnn(b_x)
        pred_y = torch.max(test_output, 1)[1].data.squeeze()
        # print(pred_y)
        # print(test_y)
        acc = (pred_y == b_y)
        acc = acc.numpy().sum()
        #print("准确率 " + str(acc / b_y.size(0)))
        acc_all = acc_all + acc
 
    train_accuracy = train_acc_all / len(train_y)
    test_accuracy = acc_all / (test_y.size(0))
    print("epoch " + str(epoch) + " " + "训练集准确率：" + str(train_accuracy) + " 测试集准确率：" + str(test_accuracy))

batch: 0 损失:tensor(0.6320)
batch: 1 损失:tensor(2.8240)
batch: 2 损失:tensor(1.3480)
batch: 3 损失:tensor(0.2346)
batch: 4 损失:tensor(5.2503)
batch: 5 损失:tensor(0.3456)
batch: 6 损失:tensor(0.7860)
batch: 7 损失:tensor(1.2058)
batch: 8 损失:tensor(1.6717)
epoch 0 训练集准确率：0.7016129032258065 测试集准确率：0.7188755020080321
batch: 0 损失:tensor(1.0603)
batch: 1 损失:tensor(2.5234)
batch: 2 损失:tensor(1.2582)
batch: 3 损失:tensor(0.4836)
batch: 4 损失:tensor(0.2890)
batch: 5 损失:tensor(1.8217)
batch: 6 损失:tensor(0.2649)
batch: 7 损失:tensor(0.5454)
batch: 8 损失:tensor(0.9145)
epoch 1 训练集准确率：0.7358870967741935 测试集准确率：0.7188755020080321
batch: 0 损失:tensor(0.6578)
batch: 1 损失:tensor(1.8156)
batch: 2 损失:tensor(1.0508)
batch: 3 损失:tensor(0.6016)
batch: 4 损失:tensor(0.5647)
batch: 5 损失:tensor(0.4683)
batch: 6 损失:tensor(0.2923)
batch: 7 损失:tensor(0.3579)
batch: 8 损失:tensor(0.5645)
epoch 2 训练集准确率：0.8024193548387096 测试集准确率：0.7068273092369478
batch: 0 损失:tensor(0.3119)
batch: 1 损失:tensor(0.5432)
batch: 2 损失:tensor(0.4272)
batch: 3 损

In [37]:
test_output = rnn(test_x)
pred_y = torch.max(test_output, 1)[1].data.squeeze()
#输出结果报告
print(classification_report(test_y, pred_y, digits=4, target_names = ['正常短信', '垃圾短信']))

              precision    recall  f1-score   support

        正常短信     0.9819    0.9775    0.9797       222
        垃圾短信     0.8214    0.8519    0.8364        27

    accuracy                         0.9639       249
   macro avg     0.9017    0.9147    0.9080       249
weighted avg     0.9645    0.9639    0.9641       249

