# Assignment 2: Text Classification, Part 1

* This assignment uses neural networks for sentiment classification of text.
* The dataset is the Stanford Sentiment Treebank collection of movie reviews.

File | Description
:-:|:-:
senti.train.tsv | Training data
senti.dev.tsv | Validation data
senti.test.tsv | Test data

* Each line of a file holds one sentence and its sentiment label, separated by a tab.
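
To make the file layout concrete, here is a minimal sketch of parsing such lines with plain Python. The two sample sentences are hypothetical (not copied from the dataset), and the labels are assumed to be the binary values 0 and 1.

```python
# Hypothetical sample lines in the "sentence<TAB>label" layout described above;
# they are illustrative and not copied from the dataset files.
sample_lines = [
    "a gripping and well acted drama\t1",
    "the plot never comes together\t0",
]

def parse_line(line):
    """Split one line into (tokens, label) on the last tab."""
    text, label = line.rstrip("\n").rsplit("\t", 1)
    # Whitespace tokenization; the label is assumed to be an integer 0 or 1.
    return text.split(), int(label)

examples = [parse_line(line) for line in sample_lines]
for tokens, label in examples:
    print(label, tokens)
```

Splitting on the last tab keeps the parse robust if a sentence itself ever contains a tab character.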

**First, import the packages needed for this assignment and set the random seed**

In [1]:
import random
from collections import defaultdict
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchtext
import tqdm

def set_random_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_random_seed(2020)
# Call is_available(); the bare function object is always truthy
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

**Set the compute device and the dataset path**

In [2]:
# Call is_available(); without the parentheses the condition is always true
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data_path = Path('/media/bnu/data/nlp-practice/sentiment-analysis/standford-sentiment-treebank')

print('PyTorch Version:', torch.__version__)
print('-' * 60)
if torch.cuda.is_available():
    print('CUDA Device Count:', torch.cuda.device_count())
    print('CUDA Device Name:')
    for i in range(torch.cuda.device_count()):
        print('\t', torch.cuda.get_device_name(i))
    print('CUDA Current Device Index:', torch.cuda.current_device())
    print('-' * 60)
print('Data Path:', data_path)

PyTorch Version: 1.4.0
------------------------------------------------------------
CUDA Device Count: 2
CUDA Device Name:
	 GeForce RTX 2080 Ti
	 GeForce RTX 2080 Ti
CUDA Current Device Index: 0
------------------------------------------------------------
Data Path: /media/bnu/data/nlp-practice/sentiment-analysis/standford-sentiment-treebank


## Data Processing

### Defining the Dataset

In [4]:
# Define the data type of each column, used when converting to Tensors
text_field = torchtext.data.Field(sequential=True, batch_first=True, include_lengths=True)
label_field = torchtext.data.LabelField(sequential=False, use_vocab=False, dtype=torch.float)

# Build the datasets from the tsv files
train_set, valid_set, test_set = torchtext.data.TabularDataset.splits(
    path=data_path,
    train='senti.train.tsv',
    validation='senti.dev.tsv',
    test='senti.test.tsv',
    format='tsv',
    fields=[('text', text_field), ('label', label_field)]
)

# Build the vocabulary from the training set
text_field.build_vocab(train_set)
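
As a rough illustration of what building a vocabulary involves, the sketch below counts tokens, places the special tokens first, and orders the remaining words by frequency. It is a simplified stand-in written in plain Python, not the library's implementation; `build_vocab` and `numericalize` here are hypothetical helpers, and the toy corpus is made up.

```python
from collections import Counter

def build_vocab(tokenized_sentences, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): special tokens first, then words by descending count."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Sort by count (descending), breaking ties alphabetically for determinism.
    words = sorted(counts, key=lambda w: (-counts[w], w))
    itos = list(specials) + words
    stoi = {word: idx for idx, word in enumerate(itos)}
    return itos, stoi

# Toy corpus, purely illustrative.
corpus = [["the", "film", "is", "good"], ["the", "film", "is", "bad"], ["the", "end"]]
itos, stoi = build_vocab(corpus)

def numericalize(tokens):
    # Words missing from the vocabulary map to the <unk> index.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

print(itos[:4])
print(numericalize(["the", "film", "is", "great"]))
```

Because only the training set feeds the vocabulary, any word that appears solely in the validation or test data is mapped to `<unk>`.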

**Quick test**

In [5]:
print('Tabular Dataset Example:')
print('Text:', valid_set[10].text)
print('Label:', valid_set[10].label)
print('-' * 60)

print('Vocab: Str -> Index')
print(list(text_field.vocab.stoi.items())[:5])
print('Vocab: Index -> Str')
print(text_field.vocab.itos[:5])
print('Vocab Size:')
print(len(text_field.vocab))

Tabular Dataset Example:
Text: ['The', 'mesmerizing', 'performances', 'of', 'the', 'leads', 'keep', 'the', 'film', 'grounded', 'and', 'keep', 'the', 'audience', 'riveted', '.']
Label: 1
------------------------------------------------------------
Vocab: Str -> Index
[('<unk>', 0), ('<pad>', 1), (',', 2), ('the', 3), ('and', 4)]
Vocab: Index -> Str
['<unk>', '<pad>', ',', 'the', 'and']
Vocab Size:
16284


### Defining the Iterators

In [6]:
train_iter, valid_iter, test_iter = torchtext.data.BucketIterator.splits(
    datasets=(train_set, valid_set, test_set),
    batch_sizes=(256, 256, 256),
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
    device=device,
)
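
The idea behind bucketing can be sketched without the library: group sentences of similar length so that each batch needs little padding. The helpers below are hypothetical and simplified (no shuffling, ascending order only); `pad_idx=1` is an assumption that matches the `<pad>` index printed in the vocabulary test above.

```python
def pad_batch(batch, pad_idx=1):
    """Pad index sequences to the batch's longest length; also return true lengths."""
    lengths = [len(seq) for seq in batch]
    max_len = max(lengths)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in batch]
    return padded, lengths

def bucket_batches(sequences, batch_size):
    """Sort by length, then cut into consecutive batches of similar-length sequences."""
    ordered = sorted(sequences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

# Toy index sequences of varying length.
seqs = [[5, 6, 7, 8, 9], [3], [4, 2], [10, 11, 12, 13], [7, 7]]
batches = bucket_batches(seqs, batch_size=2)
for batch in batches:
    padded, lengths = pad_batch(batch)
    print(padded, lengths)
```

Returning the true lengths alongside the padded batch mirrors the `include_lengths=True` setting on the text field, which yields a `(tensor, lengths)` pair.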

**Quick test**

In [7]:
print('Train Iterator:')
for batch in train_iter:
    print(batch)
    print('-' * 60, '\n')
    break
    
print('Valid Iterator:')
for batch in valid_iter:
    print(batch)
    print('-' * 60, '\n')
    break
    
print('Test Iterator:')
for batch in test_iter:
    print(batch)
    print('-' * 60, '\n')
    break

Train Iterator:

[torchtext.data.batch.Batch of size 256]
	[.text]:('[torch.cuda.LongTensor of size 256x9 (GPU 0)]', '[torch.cuda.LongTensor of size 256 (GPU 0)]')
	[.label]:[torch.cuda.FloatTensor of size 256 (GPU 0)]
------------------------------------------------------------ 

Valid Iterator:

[torchtext.data.batch.Batch of size 256]
	[.text]:('[torch.cuda.LongTensor of size 256x14 (GPU 0)]', '[torch.cuda.LongTensor of size 256 (GPU 0)]')
	[.label]:[torch.cuda.FloatTensor of size 256 (GPU 0)]
------------------------------------------------------------ 

Test Iterator:

[torchtext.data.batch.Batch of size 256]
	[.text]:('[torch.cuda.LongTensor of size 256x9 (GPU 0)]', '[torch.cuda.LongTensor of size 256 (GPU 0)]')
	[.label]:[torch.cuda.FloatTensor of size 256 (GPU 0)]
------------------------------------------------------------ 



## Defining the Models

### Word-Embedding Averaging Model

In [8]:
class EmbedAvgModel(nn.Module):
    
    def __init__(self, n_words, n_embed, p_drop, pad_idx):
        super(EmbedAvgModel, self).__init__()
        self.embed = nn.Embedding(n_words, n_embed, padding_idx=pad_idx)
        self.linear = nn.Linear(n_embed, 1)
        self.drop = nn.Dropout(p_drop)
        
    def forward(self, inputs, mask):
        # (batch, len, n_embed)
        inp_embed = self.drop(self.embed(inputs))
        # (batch, len, 1)
        mask = mask.float().unsqueeze(2)
        # (batch, len, n_embed)
        inp_embed = inp_embed * mask
        # (batch, n_embed)
        sum_embed = inp_embed.sum(1) / (mask.sum(1) + 1e-5)
        # squeeze only the last dim so a batch of size 1 keeps shape (1,)
        return self.linear(sum_embed).squeeze(-1)
        

In [9]:
model = EmbedAvgModel(
    n_words=len(text_field.vocab),
    n_embed=100,
    p_drop=0.2,
    pad_idx=text_field.vocab.stoi['<pad>']
)
model.to(device)

EmbedAvgModel(
  (embed): Embedding(16284, 100, padding_idx=1)
  (linear): Linear(in_features=100, out_features=1, bias=True)
  (drop): Dropout(p=0.2, inplace=False)
)
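
The masked average computed in `forward` can be traced by hand on a tiny example. The following is a plain-Python sketch with made-up 2-dimensional embeddings; `masked_average` is a hypothetical helper that mirrors the `sum / (mask.sum + 1e-5)` step, not part of the model.

```python
def masked_average(embeddings, mask, eps=1e-5):
    """Average the rows of `embeddings` whose mask entry is 1 (real tokens)."""
    dim = len(embeddings[0])
    total = [0.0] * dim
    for vector, keep in zip(embeddings, mask):
        for j in range(dim):
            total[j] += vector[j] * keep
    n_real = sum(mask)
    # eps mirrors the 1e-5 in the model and guards against division by zero.
    return [value / (n_real + eps) for value in total]

# One sentence with two real tokens and one padding position (toy 2-d embeddings).
embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
avg = masked_average(embeddings, mask)
print([round(v, 3) for v in avg])
```

In the model itself, `padding_idx` already keeps the padding embedding at zero, so the mask matters mainly for the denominator: dividing by the number of real tokens rather than the padded length.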

### Attention-Weighted Averaging Model

In [10]:
class AttnAvgModel(nn.Module):
    
    def __init__(self, n_words, n_embed, p_drop, pad_idx):
        super(AttnAvgModel, self).__init__()
        self.embed = nn.Embedding(n_words, n_embed, padding_idx=pad_idx)
        self.linear = nn.Linear(n_embed, 1)
        self.drop = nn.Dropout(p_drop)
        self.coef = nn.Parameter(torch.randn(1, 1, n_embed))


    def forward(self, inputs, mask):
        # (batch, len, n_embed)
        inp_embed = self.embed(inputs)
        # (batch, len)
        inp_cos = F.cosine_similarity(inp_embed, self.coef, dim=-1)
        inp_cos.masked_fill_(~mask, -1e5)
        # (batch, 1, len)
        inp_attn = F.softmax(inp_cos, dim=-1).unsqueeze(1)
        # (batch, n_embed)
        # squeeze named dims only, so a batch of size 1 keeps its batch dim
        sum_embed = torch.bmm(inp_attn, inp_embed).squeeze(1)
        sum_embed = self.drop(sum_embed)
        return self.linear(sum_embed).squeeze(-1)
    
    def calc_attention_weight(self, text):
        # (1, len, n_embed)
        inp_embed = self.embed(text)
        # (1, len)
        inp_cos = F.cosine_similarity(inp_embed, self.coef, dim=-1)
        # (1, len)
        inp_attn = F.softmax(inp_cos, dim=-1)
        return inp_attn

In [11]:
model = AttnAvgModel(
    n_words=len(text_field.vocab),
    n_embed=100,
    p_drop=0.2,
    pad_idx=text_field.vocab.stoi['<pad>']
)
model.to(device)

for batch in train_iter:
    inputs, lengths = batch.text
    mask = (inputs != text_field.vocab.stoi['<pad>'])
    outputs = model(inputs, mask)
    print(outputs.shape)
    break

torch.Size([256])
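
To see how the attention weights arise, the sketch below reproduces the two steps of `forward` in plain Python: cosine similarity between each word vector and the vector `u`, followed by a softmax. The vectors are toy values, the helpers are hypothetical, and the small epsilon that the PyTorch function uses to avoid division by zero is omitted, so zero vectors are not handled.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_average(embeddings, u):
    """Weight each word vector by softmax(cos(u, word)) and sum the result."""
    weights = softmax([cosine(vec, u) for vec in embeddings])
    dim = len(u)
    pooled = [sum(w * vec[j] for w, vec in zip(weights, embeddings)) for j in range(dim)]
    return weights, pooled

u = [1.0, 0.0]
# Cosine similarities with u are 1, 0 and -1 respectively.
embeddings = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
weights, pooled = attention_average(embeddings, u)
print([round(w, 3) for w in weights])
```

Since cosine similarity lies in [-1, 1], the ratio between the largest and smallest weight within one sentence is at most e^2, so this form of attention reweights words fairly gently.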


## Model Training

In [12]:
class TCLearner:
    def __init__(self, model):
        self.model = model
        self.model.to(device)
        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=1e-3)
        self.criterion = nn.BCEWithLogitsLoss()

    def _calc_correct_num(self, outputs, targets):
        # Threshold the sigmoid probability at 0.5 and count matching labels
        preds = torch.round(torch.sigmoid(outputs))
        return (preds == targets).int().sum().item()

    def fit(self, train_iter, valid_iter, n_epochs):
        for epoch in range(n_epochs):
            # Use self.model rather than the notebook-level global `model`
            self.model.train()
            total_loss = 0.0
            total_sents, total_correct = 0, 0

            for batch in train_iter:
                inputs, lengths = batch.text
                targets = batch.label
                mask = (inputs != text_field.vocab.stoi['<pad>'])

                outputs = self.model(inputs, mask)
                loss = self.criterion(outputs, targets)

                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()

                # Weight the batch-mean loss by batch size for a per-sentence average
                total_loss += loss.item() * len(targets)
                total_sents += len(targets)
                total_correct += self._calc_correct_num(outputs, targets)

            epoch_loss = total_loss / total_sents
            epoch_acc = total_correct / total_sents
            print(f'Epoch {epoch+1}')
            print(f'Train --> Loss: {epoch_loss:.3f}, Acc: {epoch_acc:.3f}')

            self.model.eval()
            total_loss = 0.0
            total_sents, total_correct = 0, 0
            with torch.no_grad():
                for batch in valid_iter:
                    inputs, lengths = batch.text
                    targets = batch.label
                    mask = (inputs != text_field.vocab.stoi['<pad>'])

                    outputs = self.model(inputs, mask)
                    loss = self.criterion(outputs, targets)

                    total_loss += loss.item() * len(targets)
                    total_sents += len(targets)
                    total_correct += self._calc_correct_num(outputs, targets)

            epoch_loss = total_loss / total_sents
            epoch_acc = total_correct / total_sents
            print(f'Valid --> Loss: {epoch_loss:.3f}, Acc: {epoch_acc:.3f}')

    def predict(self, test_iter):
        self.model.eval()
        total_loss = 0.0
        total_sents, total_correct = 0, 0
        with torch.no_grad():
            for batch in test_iter:
                inputs, lengths = batch.text
                targets = batch.label
                mask = (inputs != text_field.vocab.stoi['<pad>'])

                outputs = self.model(inputs, mask)
                loss = self.criterion(outputs, targets)

                total_loss += loss.item() * len(targets)
                total_sents += len(targets)
                total_correct += self._calc_correct_num(outputs, targets)

        epoch_loss = total_loss / total_sents
        epoch_acc = total_correct / total_sents
        print(f'Test --> Loss: {epoch_loss:.3f}, Acc: {epoch_acc:.3f}')
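
The loss and accuracy used by the learner can be checked by hand on a few numbers. The sketch below is a plain-Python rendering of binary cross-entropy on logits (in a numerically stable form) and of the round-the-sigmoid accuracy rule; the function names and the three logits are hypothetical, chosen only for illustration.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for one example.

    Equivalent to -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def predict_label(logit):
    # sigmoid(z) > 0.5 exactly when z > 0; a probability of exactly 0.5
    # rounds to 0 under round-half-to-even, hence the strict comparison.
    return 1.0 if logit > 0 else 0.0

logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 0.0]

losses = [bce_with_logits(z, t) for z, t in zip(logits, targets)]
mean_loss = sum(losses) / len(losses)
accuracy = sum(predict_label(z) == t for z, t in zip(logits, targets)) / len(targets)
print(round(mean_loss, 3), round(accuracy, 3))
```

Working on logits rather than probabilities avoids computing `log(sigmoid(z))` directly, which loses precision when `z` is large in magnitude.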
        
        

### Training the Word-Embedding Averaging Model

In [13]:
model = EmbedAvgModel(
    n_words=len(text_field.vocab),
    n_embed=200,
    p_drop=0.5,
    pad_idx=text_field.vocab.stoi['<pad>']
)
model.to(device)

learner = TCLearner(model)
learner.fit(train_iter, valid_iter, 10)
learner.predict(test_iter)

Epoch 1
Train --> Loss: 0.662, Acc: 0.602
Valid --> Loss: 0.635, Acc: 0.663
Epoch 2
Train --> Loss: 0.570, Acc: 0.720
Valid --> Loss: 0.560, Acc: 0.750
Epoch 3
Train --> Loss: 0.479, Acc: 0.792
Valid --> Loss: 0.508, Acc: 0.765
Epoch 4
Train --> Loss: 0.415, Acc: 0.828
Valid --> Loss: 0.474, Acc: 0.784
Epoch 5
Train --> Loss: 0.369, Acc: 0.853
Valid --> Loss: 0.454, Acc: 0.804
Epoch 6
Train --> Loss: 0.337, Acc: 0.870
Valid --> Loss: 0.444, Acc: 0.803
Epoch 7
Train --> Loss: 0.315, Acc: 0.881
Valid --> Loss: 0.434, Acc: 0.811
Epoch 8
Train --> Loss: 0.294, Acc: 0.887
Valid --> Loss: 0.429, Acc: 0.820
Epoch 9
Train --> Loss: 0.274, Acc: 0.896
Valid --> Loss: 0.428, Acc: 0.818
Epoch 10
Train --> Loss: 0.263, Acc: 0.901
Valid --> Loss: 0.427, Acc: 0.820
Test --> Loss: 0.424, Acc: 0.806


**Analysis of word-embedding L2 norms**

In [14]:
# (n_words)
embed_norm = model.embed.weight.norm(dim=1)

word_idx = list(range(len(text_field.vocab)))
word_idx.sort(key=lambda x: embed_norm[x])

print('15 words with the smallest L2 norm:')
for i in word_idx[:15]:
    print(text_field.vocab.itos[i])
print('-' * 60)

print('15 words with the largest L2 norm:')
for i in word_idx[-15:]:
    print(text_field.vocab.itos[i])

15 words with the smallest L2 norm:
<pad>
a
finishing
TV-insider
field
perpetrated
clocks
The
Nelson
cold-blooded
shirt
pic
combat
arctic
abroad
------------------------------------------------------------
15 words with the largest L2 norm:
devoid
sinks
loses
poorly
meat
gunfight
wonderful
lacks
shallow
Fails
poor
sweet
ahead
lacking
worst


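* The results above suggest that words with a small L2 norm tend to be unrelated to sentiment (`<pad>` comes first because `padding_idx` keeps its embedding at zero).
* Words with a large L2 norm are mostly words that carry sentiment.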
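
One way to read this result: in the averaging model the logit is a linear function of the mean embedding, so each word contributes `(w · e_i) / n` to it, and a word whose embedding has a small norm cannot move the logit much. The sketch below illustrates this with made-up 2-dimensional embeddings and a hypothetical weight vector; the bias and dropout are ignored.

```python
def word_contributions(embeddings, weight):
    """Per-word share of the logit: (w . e_i) / n for an n-word sentence."""
    n = len(embeddings)
    return [sum(w * x for w, x in zip(weight, vec)) / n for vec in embeddings]

# Hypothetical linear-layer weight and toy embeddings (not learned values).
weight = [1.0, -1.0]
sentence = {
    "the": [0.05, 0.04],       # small norm
    "film": [0.1, 0.1],        # small norm
    "wonderful": [2.0, -1.5],  # large norm
}

contribs = word_contributions(list(sentence.values()), weight)
for word, c in zip(sentence, contribs):
    print(f"{word}: {c:.4f}")
print("logit (without bias):", round(sum(contribs), 4))
```

By the Cauchy-Schwarz inequality, `|w · e_i|` is bounded by the product of the two norms, which is why training tends to grow the norms of words that need to influence the prediction.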

### Training the Attention-Weighted Averaging Model

In [15]:
model = AttnAvgModel(
    n_words=len(text_field.vocab),
    n_embed=200,
    p_drop=0.5,
    pad_idx=text_field.vocab.stoi['<pad>']
)
model.to(device)

learner = TCLearner(model)
learner.fit(train_iter, valid_iter, 10)
learner.predict(test_iter)

Epoch 1
Train --> Loss: 0.668, Acc: 0.591
Valid --> Loss: 0.644, Acc: 0.634
Epoch 2
Train --> Loss: 0.579, Acc: 0.712
Valid --> Loss: 0.568, Acc: 0.745
Epoch 3
Train --> Loss: 0.480, Acc: 0.788
Valid --> Loss: 0.509, Acc: 0.768
Epoch 4
Train --> Loss: 0.411, Acc: 0.831
Valid --> Loss: 0.477, Acc: 0.780
Epoch 5
Train --> Loss: 0.363, Acc: 0.855
Valid --> Loss: 0.459, Acc: 0.776
Epoch 6
Train --> Loss: 0.329, Acc: 0.871
Valid --> Loss: 0.446, Acc: 0.788
Epoch 7
Train --> Loss: 0.303, Acc: 0.883
Valid --> Loss: 0.442, Acc: 0.790
Epoch 8
Train --> Loss: 0.283, Acc: 0.893
Valid --> Loss: 0.437, Acc: 0.798
Epoch 9
Train --> Loss: 0.264, Acc: 0.901
Valid --> Loss: 0.437, Acc: 0.804
Epoch 10
Train --> Loss: 0.253, Acc: 0.904
Valid --> Loss: 0.435, Acc: 0.802
Test --> Loss: 0.425, Acc: 0.803


**Analysis of the cosine similarity between the vector u and the word embeddings**

In [16]:
# (1, n_embed)
u = model.coef.view(1, -1)
# (n_words, n_embed)
embedding = model.embed.weight
# (n_words)
cos_sim = F.cosine_similarity(u, embedding, dim=-1)

word_idx = list(range(len(text_field.vocab)))
word_idx.sort(key=lambda x: cos_sim[x])

print('15 words with the smallest cosine similarity:')
for i in word_idx[:15]:
    print(text_field.vocab.itos[i])
print('-' * 60)

print('15 words with the largest cosine similarity:')
for i in word_idx[-15:]:
    print(text_field.vocab.itos[i])

15 words with the smallest cosine similarity:
and
the
,
of
a
banter-filled
to
taking
zap
Swim
Sandra
hiatus
coming-of-age
incredibly
produces
------------------------------------------------------------
15 words with the largest cosine similarity:
uninspired
flat
tedious
violence
scooped
all-night
Hollywood-itis
poor
devoid
lacking
fallen
poorly
neither
lacks
worst


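* The results above show that words with a high cosine similarity reflect the sentiment of a sentence well; these words receive larger weights after attention.
* Words with a low cosine similarity are mostly function words (conjunctions, articles, prepositions) and some nouns, which are largely unrelated to the sentiment the text expresses.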

**Analysis of the attention weights of words in the training data**

In [17]:
train_iter, valid_iter, test_iter = torchtext.data.BucketIterator.splits(
    datasets=(train_set, valid_set, test_set),
    batch_sizes=(1, 1, 1),
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
    device=device,
)

weight_dict = defaultdict(list)

with torch.no_grad():
    for k, batch in enumerate(train_iter):
        inputs, lengths = batch.text
        attn = model.calc_attention_weight(inputs)
        inputs = inputs.view(-1)
        attn = attn.view(-1)
        # Record the attention weight of every token; this also covers one-word sentences
        for idx, weight in zip(inputs.tolist(), attn.tolist()):
            weight_dict[idx].append(weight)
        if (k + 1) % 10000 == 0:
            print(f'{k+1} sentences finish!')

10000 sentences finish!
20000 sentences finish!
30000 sentences finish!
40000 sentences finish!
50000 sentences finish!
60000 sentences finish!


In [18]:
mean_dict, std_dict = {}, {}
for k, v in weight_dict.items():
    # Keep only words that appear at least 100 times
    if len(v) >= 100:
        mean_dict[k] = np.mean(v)
        std_dict[k] = np.std(v)

In [19]:
word_idx = list(std_dict.keys())
word_idx.sort(key=lambda x: std_dict[x], reverse=True)
print('30 words with the largest attention standard deviation:')
print('-' * 60)
for i in word_idx[:30]:
    print(f'{text_field.vocab.itos[i]}, Freq:{len(weight_dict[i])}, Std:{std_dict[i]:.3f}, Mean:{mean_dict[i]:.3f}')
print()
print('30 words with the smallest attention standard deviation:')
print('-' * 60)
for i in word_idx[-30:]:
    print(f'{text_field.vocab.itos[i]}, Freq:{len(weight_dict[i])}, Std:{std_dict[i]:.3f}, Mean:{mean_dict[i]:.3f}')

30 words with the largest attention standard deviation:
------------------------------------------------------------
tedious, Freq:102, Std:0.187, Mean:0.204
stupid, Freq:132, Std:0.180, Mean:0.211
painful, Freq:108, Std:0.178, Mean:0.206
mess, Freq:149, Std:0.174, Mean:0.227
waste, Freq:114, Std:0.172, Mean:0.206
pretentious, Freq:118, Std:0.172, Mean:0.205
worse, Freq:122, Std:0.171, Mean:0.164
flat, Freq:181, Std:0.171, Mean:0.192
bland, Freq:140, Std:0.170, Mean:0.184
appealing, Freq:124, Std:0.169, Mean:0.180
provocative, Freq:103, Std:0.168, Mean:0.195
unfunny, Freq:105, Std:0.167, Mean:0.212
tired, Freq:135, Std:0.165, Mean:0.216
convincing, Freq:110, Std:0.165, Mean:0.152
hackneyed, Freq:101, Std:0.165, Mean:0.162
gorgeous, Freq:133, Std:0.164, Mean:0.173
dumb, Freq:167, Std:0.163, Mean:0.190
epic, Freq:103, Std:0.163, Mean:0.162
success, Freq:107, Std:0.163, Mean:0.163
impressive, Freq:125, Std:0.163, Mean:0.184
boring, Freq:169, Std:0.162, Mean:0.180
creepy, Freq:114, Std:0.162, Mean:0.179
stylish, Freq:108, 

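* Words with a large standard deviation in attention weight also have a large mean weight.
* These high-variance words usually carry a clear sentiment polarity, which is why their mean weight is large.
* The large standard deviation comes mainly from sentence length: the softmax normalizes over all words in a sentence, and sentences of different lengths contain different numbers of sentiment-bearing words, so the weight of the same word varies widely.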
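
The effect of sentence length can be illustrated with a small calculation. The sketch below assumes one sentiment word with a cosine score of 0.9 surrounded by neutral words with a score of -0.2 (both values are made up) and shows how the softmax weight of the same word shrinks as the sentence grows.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weight_of_first(n_words, strong=0.9, neutral=-0.2):
    """Attention weight of one strong word among n_words - 1 neutral words."""
    scores = [strong] + [neutral] * (n_words - 1)
    return softmax(scores)[0]

# The same word, with the same score, in sentences of increasing length.
for n in (1, 3, 10, 30):
    print(n, round(weight_of_first(n), 3))
```

A word that receives all of the attention in a one-word sentence gets only a small fraction in a long one, so a word that appears in both short and long sentences ends up with a large spread of weights.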