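# Experiment 1: Text Classification
## 1. Dataset

- 20 News: https://github.com/hccngu/MLADA/tree/master/data
- The archive contains four datasets; we use the **20news.json** file.
- It is a **preprocessed** text classification dataset with 20 classes.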
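
Before training, it can help to check how the examples are spread over the classes. The sketch below assumes, as the loading code in this notebook does, that the file holds one JSON object per line with an integer `label` field; the helper name `count_labels` is hypothetical.

```python
import json
from collections import Counter


def count_labels(path):
    """Count how many examples each label has in a JSON-lines file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines, if any
                continue
            counts[json.loads(line)["label"]] += 1
    return counts
```

Calling `count_labels('data/20news.json')` would return a `Counter` mapping each label to its number of articles.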

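Import the required packages. For this experiment I use a Transformer model for text classification. The Transformer was originally proposed for Seq2Seq problems, and since text is a typical kind of sequence data, it also performs well on tasks such as classification.  
Here the pretrained DistilBERT tokenizer converts each news article into a sequence of at most 512 token IDs (the model's maximum input length), which the model then classifies.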

In [1]:
import json
import torch
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

# Open the JSON-lines dataset; it is read line by line in the next cell.
js = open('data/20news.json', encoding='utf-8')
# Pretrained tokenizer that matches the distilbert-base-uncased checkpoint.
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

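Read the news data, which is stored as one JSON object per line. The tokenizer handles word splitting for us, so only the original text in the `raw` field is needed. scikit-learn's `train_test_split` then divides the data into training, validation and test sets for training and evaluation. To lighten the load on my machine, only examples with labels 0 to 3 (four classes) are kept.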
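
Applying `train_test_split` twice with `test_size=.2`, as the next cell does, leaves roughly 64% of the data for training, 16% for validation and 20% for testing. A small sketch of that arithmetic follows; `split_sizes` is a hypothetical helper, and rounding the held-out part up reflects my understanding of how scikit-learn treats a fractional `test_size`, so real counts may differ by one.

```python
import math


def split_sizes(n_samples, test_frac=0.2, val_frac=0.2):
    """Return (train, val, test) sizes after two successive hold-out splits."""
    n_test = math.ceil(test_frac * n_samples)  # first split: hold out the test set
    n_rest = n_samples - n_test
    n_val = math.ceil(val_frac * n_rest)       # second split: hold out validation data
    n_train = n_rest - n_val
    return n_train, n_val, n_test


print(split_sizes(1000))  # (640, 160, 200)
```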

In [2]:
texts = []
labels = []
for line in js:  # one JSON object per line
    js_l = json.loads(line)
    if js_l["label"] < 4:  # keep only classes 0-3
        texts.append(js_l["raw"])
        labels.append(js_l["label"])
js.close()

# Hold out 20% for testing, then 20% of the remainder for validation.
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=.2)
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
# Truncate to the model's maximum length and pad to the longest sequence.
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
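
With `truncation=True, padding=True`, the tokenizer cuts each sequence to the model's maximum length and pads the shorter ones to the longest sequence in the call, marking real tokens with 1 and padding with 0 in `attention_mask`. The pure-Python sketch below mimics that behaviour on lists of token IDs. `pad_and_truncate`, `max_len` and `pad_id` are illustrative names, not part of the `transformers` API, and the real tokenizer also keeps the closing special token when truncating, which this simplification does not.

```python
def pad_and_truncate(sequences, max_len=512, pad_id=0):
    """Truncate each ID list to max_len, then pad all lists to the longest one."""
    truncated = [seq[:max_len] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy IDs; max_len=4 is larger than both sequences, so only padding applies.
batch = pad_and_truncate([[101, 7592, 102], [101, 102]], max_len=4)
print(batch["input_ids"])       # [[101, 7592, 102], [101, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```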

Build a custom Dataset so that the model can read the data later.

In [3]:
class NewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = NewsDataset(train_encodings, train_labels)
val_dataset = NewsDataset(val_encodings, val_labels)
test_dataset = NewsDataset(test_encodings, test_labels)
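
Because `__getitem__` returns a dict of tensors, PyTorch's default collation stacks each key across the batch, which is what the training loop relies on when it reads `batch['input_ids']`. A toy illustration with made-up token IDs follows; the `ToyDataset` class and its values are hypothetical and only mirror the structure of `NewsDataset`.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Same structure as NewsDataset, fed with hand-written encodings."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


encodings = {
    "input_ids": [[5, 6, 0], [7, 8, 9], [3, 0, 0]],
    "attention_mask": [[1, 1, 0], [1, 1, 1], [1, 0, 0]],
}
toy = ToyDataset(encodings, labels=[0, 2, 1])
batch = next(iter(DataLoader(toy, batch_size=2)))  # no shuffling: first two items
print(batch["input_ids"].shape)  # torch.Size([2, 3])
print(batch["labels"])           # tensor([0, 2])
```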

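The model is also pretrained: it is a ready-made Transformer, so we only need to feed in the data and fine-tune it. Note that `DistilBertForSequenceClassification` defaults to 2 labels (enough for binary tasks such as sentiment classification), so for more classes the number of labels has to be set explicitly through `num_labels`.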

In [4]:
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# num_labels=4 matches the four classes (labels 0-3) selected when loading the data.
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=4)
model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# torch.optim.AdamW is used because transformers.AdamW is deprecated in recent
# releases; weight_decay=0.0 matches the default weight decay of the old class.
optim = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)

for epoch in range(3):
    for i, batch in enumerate(train_loader):
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # Passing labels makes the model compute the cross-entropy loss itself.
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optim.step()
        print("{}/{}, epoch = {}".format(i, len(train_loader), epoch+1))
model.eval()  # switch off dropout before saving and testing

model.save_pretrained('./model')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier
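
The second warning is expected here: the classification head is not part of the pretrained checkpoint, so it starts from random weights and is learned during fine-tuning. Its output width follows `num_labels`. The sketch below checks this offline with a deliberately tiny, randomly initialised configuration; the small sizes (`dim=32`, `n_layers=1` and so on) are arbitrary choices for illustration, not the values of `distilbert-base-uncased`.

```python
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Tiny random model: no download needed, sizes chosen only to keep it fast.
config = DistilBertConfig(
    vocab_size=100, dim=32, hidden_dim=64, n_layers=1, n_heads=2, num_labels=4
)
tiny_model = DistilBertForSequenceClassification(config)
tiny_model.eval()

input_ids = torch.randint(0, 100, (2, 10))   # batch of 2, 10 tokens each
attention_mask = torch.ones_like(input_ids)  # every position is a real token
with torch.no_grad():
    logits = tiny_model(input_ids, attention_mask=attention_mask).logits
print(logits.shape)  # torch.Size([2, 4]): one score per class
```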

0/294, epoch = 1
1/294, epoch = 1
2/294, epoch = 1
...
292/294, epoch = 3
293/294, epoch = 3


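Because the Transformer is fairly large, the batch size could only be set to 8 on my GPU; anything larger caused an out-of-memory error. After a few minutes of training, the model is saved and then tested. The final accuracy on the four-class test set reaches 97.1%, and some of my earlier runs even reached 99.1%, which shows that a fine-tuned DistilBERT performs well on text classification.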

In [7]:
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)  # shuffling is not needed for testing
loss_fn = torch.nn.CrossEntropyLoss()
test_loss, correct = 0, 0
with torch.no_grad():  # gradients are not needed for evaluation
    for i, batch in enumerate(test_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        pred = model(input_ids, attention_mask=attention_mask)
        test_loss += loss_fn(pred.logits, labels).item()
        correct += (pred.logits.argmax(1) == labels).type(torch.float).sum().item()
# loss_fn returns a per-batch mean, so dividing the sum by the number of samples
# gives a value about batch_size times smaller than the per-sample mean loss.
test_loss /= len(test_loader.dataset)
correct /= len(test_loader.dataset)
print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

Test Error: 
 Accuracy: 97.1%, Avg loss: 0.013774 
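
For reuse, the evaluation loop can be wrapped in a function that disables gradient tracking and sums the per-sample losses before dividing once, so the reported loss is a true per-sample mean. This is a sketch, not the code that produced the numbers above: `evaluate` is a hypothetical helper that expects a model mapping an input tensor directly to logits, and it is demonstrated on a toy linear model with random data rather than the fine-tuned DistilBERT.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, device="cpu"):
    """Return (accuracy, mean per-sample cross-entropy) over a loader."""
    loss_fn = torch.nn.CrossEntropyLoss(reduction="sum")  # sum now, divide once later
    total_loss, correct, n = 0.0, 0, 0
    model.eval()
    with torch.no_grad():  # no gradients needed for evaluation
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            logits = model(inputs)
            total_loss += loss_fn(logits, labels).item()
            correct += (logits.argmax(dim=1) == labels).sum().item()
            n += labels.size(0)
    return correct / n, total_loss / n


# Toy check: 4 input features, 4 classes, random data.
torch.manual_seed(0)
toy_model = torch.nn.Linear(4, 4)
data = TensorDataset(torch.randn(20, 4), torch.randint(0, 4, (20,)))
acc, loss = evaluate(toy_model, DataLoader(data, batch_size=8))
print(f"Accuracy: {100 * acc:.1f}%, Avg loss: {loss:.6f}")
```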

