# LSTM on IMDB Sentiment Analysis

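This notebook performs sentiment classification on IMDB. It is an exercise in using torchtext and related tools and in getting familiar with the training workflow, and it tests the performance of the following model: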

* LSTM

Improvements over the previous notebook:

* Uses pretrained word vectors
* Applies pack and pad operations
* Considers the effect of num_layers, dropout, and momentum

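Current status:

* Unfinished

Open questions:
* How should BucketIterator, shuffle, and pack/pad be combined? (Currently shuffling happens only between batches, examples of similar length share a batch, and each batch is sorted by descending length.)
* After the pack/pad operations, do `hidden` and `output` no longer refer to the same positions?
* Does Adam not need an explicit learning rate?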
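
On the Adam question: `torch.optim.Adam` has a default learning rate of `1e-3`, so leaving `lr` out is legal and simply uses that default. The sketch below checks this on a throwaway parameter (`w` is an illustrative name, not part of this notebook). It also shows that Adam has no `momentum` argument; the closest knob is `betas`.

```python
import torch
from torch import optim

# A throwaway parameter, used only so the optimizers have something to manage.
w = torch.nn.Parameter(torch.zeros(3))

default_opt = optim.Adam([w])           # no lr given: the default is used
custom_opt = optim.Adam([w], lr=1e-4)   # explicit learning rate

default_lr = default_opt.param_groups[0]["lr"]
custom_lr = custom_opt.param_groups[0]["lr"]
default_betas = default_opt.param_groups[0]["betas"]  # Adam's momentum-like coefficients

print(default_lr, custom_lr, default_betas)
```

Passing `lr` explicitly, as the training code in this notebook does, keeps the value visible next to the other hyperparameters.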


References:
* [torchtext usage -- updated IMDB](https://blog.csdn.net/weixin_43301333/article/details/105745053)

## Requirement
* torchtext==0.6.0

## Import

In [1]:
import torch
from torchtext import datasets
from torchtext import data
import numpy as np
import random
from torch import nn,optim
from sklearn import metrics
import torch.nn.functional as F

use_cuda=torch.cuda.is_available()
device=torch.device("cuda" if use_cuda else "cpu")

## Tunable Hyperparameters

In [2]:
lr=1e-3
bs=3
d_embed=100
d_hidden=256
d_output=2
dropout=0.0
max_epochs=10
require_improvement=1
n_layers=1
bidirectional=True

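## Data Loading and Preprocessing

Data loading and preprocessing use the torchtext library.

Because spaCy could not be run on Colab, we tokenize by simply splitting on whitespace; spaCy will be added later when running on a server.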
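
To see what whitespace splitting costs, the sketch below compares it with a small regex tokenizer. The regex tokenizer (`simple_regex_tokenize`) is a hypothetical stand-in, not spaCy and not what this notebook uses; it only illustrates that `str.split` leaves punctuation attached to words, which inflates the vocabulary.

```python
import re

def whitespace_tokenize(text):
    # Same behaviour as the notebook's lambda: split on runs of whitespace.
    return text.split()

# Hypothetical alternative: words and single punctuation marks become separate tokens.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_regex_tokenize(text):
    return _TOKEN_RE.findall(text)

sample = "Bromwell High is a cartoon comedy. It ran at the same time."
ws_tokens = whitespace_tokenize(sample)
re_tokens = simple_regex_tokenize(sample)

print(ws_tokens)  # 'comedy.' and 'time.' keep their trailing period
print(re_tokens)  # 'comedy', '.', 'time', '.' are separate tokens
```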

In [3]:
tokenize = lambda x: x.split()
TEXT=data.Field(tokenize=tokenize,batch_first=True,include_lengths=True)
LABEL=data.LabelField(dtype=torch.long)
train_data,test_data=datasets.IMDB.splits(TEXT,LABEL)

**Below we show the number of examples and one sample.**

In [4]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')
print(vars(train_data.examples[0])['text'])

Number of training examples: 25000
Number of testing examples: 25000
['Bromwell', 'High', 'is', 'a', 'cartoon', 'comedy.', 'It', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life,', 'such', 'as', '"Teachers".', 'My', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'Bromwell', "High's", 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', '"Teachers".', 'The', 'scramble', 'to', 'survive', 'financially,', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', "teachers'", 'pomp,', 'the', 'pettiness', 'of', 'the', 'whole', 'situation,', 'all', 'remind', 'me', 'of', 'the', 'schools', 'I', 'knew', 'and', 'their', 'students.', 'When', 'I', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school,', 'I', 'immediately', 'recalled', '.........', 'at', '..........', 'High.', 'A', 'classic', 'line:

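There are 25,000 training examples and 25,000 test examples. A 1:1 train/test ratio is unusual, but the task is simple enough that we keep the split as it is.

Each example behaves like a dictionary: 'text' holds the tokenized word list and 'label' holds its label (pos or neg).

**Next we split part of the training set off as a validation set.**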

In [5]:
# fix the random seeds so the split is identical on every run
SEED = 1234
np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
if use_cuda:
    torch.cuda.manual_seed(SEED)
    
train_data,valid_data=train_data.split(split_ratio=0.8)

print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 20000
Number of validation examples: 5000
Number of testing examples: 25000


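**Next we build the vocabulary.**

**The embeddings are initialized with 100-dimensional GloVe vectors.**

Open question: should a maximum vocabulary size be specified here?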

In [6]:
TEXT.build_vocab(train_data,vectors='glove.6B.100d',unk_init=torch.Tensor.normal_)
LABEL.build_vocab(train_data)

d_vocab=len(TEXT.vocab)
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

print('20 most frequent tokens:')
print(TEXT.vocab.freqs.most_common(20))

Unique tokens in TEXT vocabulary: 244103
Unique tokens in LABEL vocabulary: 2
20 most frequent tokens:
[('the', 230093), ('a', 123651), ('and', 122106), ('of', 114226), ('to', 106104), ('is', 82638), ('in', 68335), ('I', 52549), ('that', 51704), ('this', 45786), ('it', 43665), ('/><br', 40779), ('was', 37439), ('as', 33966), ('with', 33196), ('for', 32883), ('The', 27080), ('but', 27064), ('on', 24547), ('movie', 24429)]


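The test and validation texts may contain words that never appear in the training set, and batched training requires padding all texts (or all texts in one batch) to the same length. The vocabulary built above therefore automatically includes the special tokens `<unk>` and `<pad>`, which stand for unknown tokens and padding.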
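
The role of the two special tokens can be sketched in plain Python. This is a simplified illustration, not torchtext's implementation: `build_vocab` and `numericalize` here are hypothetical helpers, and the toy sentences are made up. The index layout (`<unk>` at 0, `<pad>` at 1) mirrors what the printed batch further down suggests, where padding shows up as 1. The `max_size` cap relates to the open question about vocabulary size: torchtext's `build_vocab` accepts `max_size` and `min_freq` for the same purpose.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"

def build_vocab(token_lists, max_size=None):
    # Count tokens over the training data only.
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Specials come first, so <unk> -> 0 and <pad> -> 1; then the most frequent tokens.
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi, length):
    # Unknown tokens map to <unk>; short sequences are right-padded with <pad>.
    ids = [stoi.get(tok, stoi[UNK]) for tok in tokens]
    return ids + [stoi[PAD]] * (length - len(ids))

# Made-up training data with distinct token frequencies.
train_tokens = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "bad"],
    ["the", "movie"],
    ["the"],
]
itos, stoi = build_vocab(train_tokens, max_size=3)  # keeps 'the', 'movie', 'was'
ids = numericalize(["the", "plot", "was"], stoi, length=5)
print(itos)
print(ids)  # 'plot' is unseen, so it becomes the <unk> index
```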


**Next we build the iterators.**

In [7]:
train_iterator, valid_iterator, test_iterator =data.BucketIterator.splits(
    (train_data,valid_data,test_data),
    batch_size=bs,device=device,shuffle=True,sort_within_batch=True)

# sanity check: inspect one batch
for x in train_iterator:
    print(x.text[0].shape)
    print(x.text[0])
    break

torch.Size([3, 403])
tensor([[   756,      2,    498,  ..., 143978,   2526,   5409],
        [  2338,   5804,  10313,  ...,      1,      1,      1],
        [    49,   8307,      9,  ...,      1,      1,      1]])


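`sort_within_batch=True` makes the iterator sort every batch by length, which packed padded sequences require (with the default `enforce_sorted=True`, in descending order).

Note that **the text in the iterator has already been converted to indices**.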
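
The batching scheme described in the open questions (similar lengths share a batch, batches are shuffled as whole units, each batch is sorted longest first) can be sketched without torchtext. This is a simplified illustration of the idea, not BucketIterator's actual algorithm; `bucket_batches` and the example lengths are made up for the demonstration.

```python
import random

def bucket_batches(lengths, batch_size, seed=0):
    rng = random.Random(seed)
    # Sort example indices by length so that neighbours have similar lengths.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    # Cut the sorted order into consecutive batches.
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle whole batches, never the examples across batches.
    rng.shuffle(batches)
    # Inside each batch, put the longest example first (needed for packing).
    return [sorted(b, key=lambda i: lengths[i], reverse=True) for b in batches]

lengths = [403, 12, 250, 15, 398, 240, 11, 260, 410]
batches = bucket_batches(lengths, batch_size=3)
for b in batches:
    print([lengths[i] for i in b])
```

Grouping by length keeps the amount of padding per batch small, while shuffling at the batch level still varies the order in which the model sees the data.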


## Model

Define an LSTM model.

In [36]:
class simple_rnn(nn.Module):
    
    def __init__(self,d_vocab: int,d_embed:int ,d_hidden:int ,d_output:int,dropout=0,vectors=None,
                 n_layers=1,bidirectional=False,pad_idx=0):
        super(simple_rnn, self).__init__()
        self.bi=2 if bidirectional else 1
        self.n_layers=n_layers
        self.pad_idx=pad_idx
        self.d_hidden=d_hidden
        self.d_output=d_output
        
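        # Use the pretrained vectors passed in, falling back to the vocabulary's GloVe vectors.
        # from_pretrained freezes the embedding by default (freeze=True), so it is not fine-tuned.
        if vectors is None:
            vectors=TEXT.vocab.vectors
        self.embed=nn.Embedding.from_pretrained(vectors)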
        self.rnn=nn.LSTM(d_embed,d_hidden,batch_first=True,num_layers=n_layers,bidirectional=bidirectional,dropout=dropout)
        self.fc=nn.Linear(d_hidden*self.bi,d_output)
        self.dropout=nn.Dropout(dropout)
        
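    def forward(self,text,text_length):
        # text: (bs,length) token ids; text_length: (bs,) true lengths, sorted in descending order
        embeded=self.dropout(self.embed(text)) #(bs,length,d_embed)
        # pack_padded_sequence expects the lengths on the CPU
        packed=nn.utils.rnn.pack_padded_sequence(embeded,text_length.cpu(),batch_first=True)
        output,(hidden,cell)=self.rnn(packed)
        output,output_len=nn.utils.rnn.pad_packed_sequence(output,batch_first=True)
        # Select the output at the last real (non-pad) step of every sequence.
        # For a bidirectional LSTM the backward half at that step has only seen the final token;
        # the backward direction's final state sits at step 0 (it equals hidden[-1]).
        last=(text_length-1).to(output.device).view(-1,1,1).expand(-1,1,self.d_hidden*self.bi)
        output=torch.gather(output,1,last)   #(bs,1,d_hidden*bi)
        # Alternative: use the final hidden states of both directions directly
        # if self.bi==2:
        #     hidden=torch.cat((hidden[-2,:,:],hidden[-1,:,:]),dim=1)

        # squeeze only the time dimension so a batch of size 1 keeps its batch dimension
        return self.fc(output.squeeze(1)) #(bs,d_output)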
    
model=simple_rnn(d_vocab,d_embed,d_hidden,d_output,dropout,n_layers=n_layers,bidirectional=bidirectional,pad_idx=TEXT.vocab.stoi[TEXT.pad_token])
print(model)
if use_cuda:
    model.cuda()

simple_rnn(
  (embed): Embedding(244103, 100)
  (rnn): LSTM(100, 256, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=512, out_features=2, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
)
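
On the question of whether `hidden` and `output` line up after pack/pad: for a bidirectional LSTM they do, but at different time steps per direction. The sketch below uses a tiny randomly initialised LSTM (the sizes and the random input are illustrative, not the notebook's data) to check that the forward final state equals `output` at the last real step, while the backward final state equals `output` at step 0.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_embed, d_hidden = 4, 3
lstm = nn.LSTM(d_embed, d_hidden, batch_first=True, bidirectional=True)

# Two sequences padded to length 5; the values beyond each true length are ignored by packing.
x = torch.randn(2, 5, d_embed)
lengths = torch.tensor([5, 2])  # sorted in descending order, kept on the CPU

packed = nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True)
packed_out, (hidden, cell) = lstm(packed)
output, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)

batch_idx = torch.arange(x.size(0))
fwd_at_last = output[batch_idx, lengths - 1, :d_hidden]  # forward half, last real step
bwd_at_last = output[batch_idx, lengths - 1, d_hidden:]  # backward half, last real step
bwd_at_first = output[:, 0, d_hidden:]                   # backward half, step 0

# hidden has shape (num_directions, bs, d_hidden): index 0 is forward, index 1 is backward.
fwd_matches = torch.allclose(fwd_at_last, hidden[0])
bwd_matches = torch.allclose(bwd_at_first, hidden[1])
bwd_last_matches = torch.allclose(bwd_at_last, hidden[1])

print(output.shape, fwd_matches, bwd_matches, bwd_last_matches)
```

Gathering `output` at `text_length-1`, as the model above does, therefore feeds the classifier a backward feature that has seen only the last token. Concatenating `hidden[-2]` and `hidden[-1]` is one way to use the full context of both directions.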


Check that a forward pass runs end to end.

In [37]:
optimizer = optim.Adam(model.parameters(),lr=lr)
criterion = nn.CrossEntropyLoss()
if use_cuda:
    criterion.cuda()
with torch.no_grad():
    for batch in train_iterator:
        x,l=batch.text
        y=batch.label
        # .cuda() returns a copy instead of moving the tensor in place, so reassign via .to(device)
        x,y=x.to(device),y.to(device)
        preds=model(x,l)
        print(preds.shape)
        criterion(preds,y)
        break

torch.Size([3, 2])


## Training

In [34]:
def train(model, train_iter, dev_iter, test_iter):
    model.train()
    optimizer = optim.Adam(model.parameters(),lr=lr)
    criterion = nn.CrossEntropyLoss()
    if use_cuda:
        criterion.cuda()

    # Exponential learning-rate decay: after each epoch, lr = gamma * lr
    # scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
    dev_best_loss = float('inf')
    last_improve = 0  # epoch index of the last improvement in validation loss
    #writer = SummaryWriter(log_dir=config.log_path + '/' + time.strftime('%m-%d.%H.%M', time.localtime())+'_'+which_data+'_'+which_model+'_'+which_task+'_'+exp_number)
    
    for epoch in range(max_epochs):
        train_loss=0
        train_correct=0
        # scheduler.step()  # learning-rate decay
        for i, batch in enumerate(train_iter):
            optimizer.zero_grad()
            # include_lengths=True makes batch.text a (token ids, lengths) tuple
            x, lengths = batch.text
            y = batch.label
            # .to() is not in-place, so the result must be reassigned
            x, y = x.to(device), y.to(device)
            outputs = model(x, lengths)
            loss = criterion(outputs, y)
            loss.backward()
            optimizer.step()
            # training accuracy
            preds = outputs.argmax(dim=1)
            train_correct += (preds == y).sum().item()
            train_loss += loss.item()
        train_loss /= len(train_iter)  # mean loss per batch
        train_acc = train_correct / len(train_iter.dataset)  # fraction of correct predictions
            
        # validation set
        dev_acc, dev_loss = evaluate(model, dev_iter)
        if dev_loss < dev_best_loss:
            dev_best_loss = dev_loss
            improve = '*'
            last_improve=epoch
        else:
            improve = ''
        msg = 'Epoch: {0:>6},  Train Loss: {1:>5.2},  Train Acc: {2:>6.2%},  Val Loss: {3:>5.2},  Val Acc: {4:>6.2%} {5}'
        print(msg.format(epoch+1, train_loss, train_acc, dev_loss, dev_acc, improve))
        #writer.add_scalar("loss/train", loss.item(), total_batch)
        #writer.add_scalar("loss/dev", dev_loss, total_batch)
        #writer.add_scalar("acc/train", train_acc, total_batch)
        #writer.add_scalar("acc/dev", dev_acc, total_batch)

        if epoch - last_improve > require_improvement:
            # validation loss has not improved for more than require_improvement epochs: stop training
            print("No optimization for a long time, auto-stopping...")
            break
    #writer.close()
    test(model, test_iter)

def evaluate(model, data_iter, test=False):
    model.eval()
    loss_total = 0
    predict_all = np.array([], dtype=int)
    labels_all = np.array([], dtype=int)
    with torch.no_grad():
        for batch in data_iter:
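            # batch.text is a (token ids, lengths) tuple because include_lengths=True
            x, lengths = batch.text
            labels = batch.label
            x, labels = x.to(device), labels.to(device)
            outputs = model(x, lengths)
            loss = F.cross_entropy(outputs, labels)
            loss_total += loss.item()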
            labels = labels.data.cpu().numpy()
            predic = torch.max(outputs.data, 1)[1].cpu().numpy()
            labels_all = np.append(labels_all, labels)
            predict_all = np.append(predict_all, predic)
    model.train()
    acc = metrics.accuracy_score(labels_all, predict_all)
    
    if test:
        # take the class names from the label vocabulary so index 0/1 cannot be mislabelled
        report = metrics.classification_report(labels_all, predict_all, labels=[0,1],target_names=LABEL.vocab.itos, digits=4,output_dict=True)
        confusion = metrics.confusion_matrix(labels_all, predict_all)
        return acc, loss_total / len(data_iter), report, confusion
    
    return acc, loss_total / len(data_iter)


def test(model, test_iter):
    test_acc, test_loss, test_report, test_confusion = evaluate(model, test_iter, test=True)
    msg = 'Test Loss: {0:>5.2},  Test Acc: {1:>6.2%}'
    print(msg.format(test_loss, test_acc))
    print("Precision, Recall and F1-Score...")
    print(test_report)
    print("Confusion Matrix...")
    print(test_confusion)
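
The stopping rule in `train` is easy to misread: with `require_improvement=1` it tolerates one epoch without improvement and stops on the second. The sketch below replays the same comparison on a list of validation losses. The helper `epochs_run` and the loss values are made up for illustration.

```python
def epochs_run(val_losses, require_improvement=1):
    """Return how many epochs run before the early-stopping rule fires."""
    best = float("inf")
    last_improve = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, last_improve = loss, epoch
        # Same test as in train(): stop once the gap exceeds require_improvement.
        if epoch - last_improve > require_improvement:
            return epoch + 1
    return len(val_losses)

# Best loss at the second epoch, then two epochs without improvement.
stopped_after = epochs_run([0.70, 0.69, 0.71, 0.71, 0.68])
print(stopped_after)
```

In this made-up run the fifth epoch would have improved the loss, but training has already stopped; a larger `require_improvement` trades extra training time for a lower chance of stopping too early.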

In [35]:
#lr=1e-3
train(model,train_iterator,valid_iterator,test_iterator)

Epoch:      1,  Train Loss:  0.69,  Train Acc: 50.34%,  Val Loss:   0.7,  Val Acc: 50.94% *
Epoch:      2,  Train Loss:  0.69,  Train Acc: 49.79%,  Val Loss:   0.7,  Val Acc: 52.42% *
Epoch:      3,  Train Loss:  0.69,  Train Acc: 50.13%,  Val Loss:  0.71,  Val Acc: 51.92% 
Epoch:      4,  Train Loss:  0.69,  Train Acc: 50.41%,  Val Loss:  0.71,  Val Acc: 54.28% 
No optimization for a long time, auto-stopping...
Test Loss:   0.8,  Test Acc: 53.72%
Precision, Recall and F1-Score...
{'pos': {'precision': 0.540250756593169, 'recall': 0.49984, 'f1-score': 0.5192603365884064, 'support': 12500}, 'neg': {'precision': 0.5346483066617045, 'recall': 0.57464, 'f1-score': 0.5539232697127434, 'support': 12500}, 'accuracy': 0.53724, 'macro avg': {'precision': 0.5374495316274368, 'recall': 0.53724, 'f1-score': 0.536591803150575, 'support': 25000}, 'weighted avg': {'precision': 0.5374495316274367, 'recall': 0.53724, 'f1-score': 0.536591803150575, 'support': 25000}}
Confusion Matrix...
[[6248 6252]
 [5

In [38]:
#lr=1e-4
train(model,train_iterator,valid_iterator,test_iterator)

Epoch:      1,  Train Loss:  0.69,  Train Acc: 49.44%,  Val Loss:  0.69,  Val Acc: 49.88% *
Epoch:      2,  Train Loss:  0.69,  Train Acc: 50.35%,  Val Loss:  0.69,  Val Acc: 50.76% 
Epoch:      3,  Train Loss:  0.69,  Train Acc: 50.18%,  Val Loss:  0.69,  Val Acc: 50.18% 
No optimization for a long time, auto-stopping...
Test Loss:  0.69,  Test Acc: 50.06%
Precision, Recall and F1-Score...
{'pos': {'precision': 0.5018676627534685, 'recall': 0.15048, 'f1-score': 0.23153618906942391, 'support': 12500}, 'neg': {'precision': 0.5003293807641633, 'recall': 0.85064, 'f1-score': 0.6300663664375443, 'support': 12500}, 'accuracy': 0.50056, 'macro avg': {'precision': 0.501098521758816, 'recall': 0.50056, 'f1-score': 0.4308012777534841, 'support': 25000}, 'weighted avg': {'precision': 0.5010985217588159, 'recall': 0.50056, 'f1-score': 0.43080127775348415, 'support': 25000}}
Confusion Matrix...
[[ 1881 10619]
 [ 1867 10633]]


In [41]:
#dropout=0
train(model,train_iterator,valid_iterator,test_iterator)

Epoch:      1,  Train Loss:  0.69,  Train Acc: 49.61%,  Val Loss:  0.69,  Val Acc: 49.56% *
Epoch:      2,  Train Loss:  0.69,  Train Acc: 50.21%,  Val Loss:  0.69,  Val Acc: 51.24% *
Epoch:      3,  Train Loss:  0.69,  Train Acc: 49.29%,  Val Loss:   0.7,  Val Acc: 49.90% 
Epoch:      4,  Train Loss:  0.69,  Train Acc: 49.94%,  Val Loss:  0.69,  Val Acc: 50.38% 
No optimization for a long time, auto-stopping...
Test Loss:  0.68,  Test Acc: 52.64%
Precision, Recall and F1-Score...
{'pos': {'precision': 0.5794933655006032, 'recall': 0.19216, 'f1-score': 0.2886151997596876, 'support': 12500}, 'neg': {'precision': 0.5157995684488133, 'recall': 0.86056, 'f1-score': 0.6450007495128167, 'support': 12500}, 'accuracy': 0.52636, 'macro avg': {'precision': 0.5476464669747082, 'recall': 0.5263599999999999, 'f1-score': 0.4668079746362521, 'support': 25000}, 'weighted avg': {'precision': 0.5476464669747082, 'recall': 0.52636, 'f1-score': 0.4668079746362521, 'support': 25000}}
Confusion Matrix...
[[

## Results and Analysis

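Training performs very poorly, which matches the blog post we followed. Possible causes:
* Text length is not handled, so some inputs are very long
* No pack/pad operations are used

In addition, the following optimizations are planned:
* Use pretrained word vectors
* Tune dropout
* Tune momentum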
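
One way to handle overly long texts is to cap every sequence at a fixed number of tokens before batching. The sketch below shows the idea on plain lists; `truncate_and_pad` is a hypothetical helper and `pad_idx=1` is an assumption based on the padding index seen in the batch printout. torchtext's `Field` has a `fix_length` argument that serves a similar purpose, though it is not used in this notebook.

```python
def truncate_and_pad(token_ids, max_len, pad_idx=1):
    # Keep at most max_len tokens from the start of the text.
    clipped = token_ids[:max_len]
    # Right-pad short texts and return the true length for later packing.
    padded = clipped + [pad_idx] * (max_len - len(clipped))
    return padded, len(clipped)

long_ids, long_len = truncate_and_pad([5, 6, 7, 8, 9], max_len=3)
short_ids, short_len = truncate_and_pad([5, 6], max_len=4)
print(long_ids, long_len)
print(short_ids, short_len)
```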