<a href="https://colab.research.google.com/github/Bingle-labake/deeplearn/blob/master/RNN/5_Sentiment_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

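## Sentiment Analysis RNN
In this notebook, you'll implement a recurrent neural network that performs sentiment analysis.


> Using an RNN rather than a plain feedforward network is more accurate, because we can include information about the sequence of words.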


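We'll use a dataset of movie reviews, each with a sentiment label: positive or negative.

![alt text](http://p0.qhimg.com/t013a7fa22d276a009a.png)

### Network architecture
The network architecture is shown below:

![alt text](http://p3.qhimg.com/t01e7e71cba9a48565c.png)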


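> ***First, we pass the words into an embedding layer.*** We need an embedding layer because the dataset contains tens of thousands of words, so we need a more efficient input representation than one-hot encoded vectors. You should have seen embedding layers before, in the Word2Vec lesson. We could train the embeddings with a Skip-gram Word2Vec model and use them as input, but it is good enough to add an embedding layer and let the model learn its own embedding table. Here the embedding layer is used for dimensionality reduction rather than for learning a language representation.

> ***After the input words pass through the embedding layer, the new embeddings are passed to LSTM cells.*** The LSTM cells add recurrent connections to the network and let us include information about the sequence of words in the movie review data.

> ***Finally, the LSTM outputs go to a sigmoid output layer.*** We use a sigmoid function because positive = 1 and negative = 0, and a sigmoid outputs predicted sentiment values between 0 and 1.


We only care about the very last sigmoid output; the other outputs can be ignored. We'll calculate the loss by comparing the output at the last time step with the training label (positive or negative).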
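
As a minimal sketch of that last step, the snippet below applies a sigmoid to a few made-up per-time-step scores and keeps only the final value. The scores and the name `scores_per_step` are hypothetical and are not produced by the network in this notebook.

```python
import math

def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw scores, one per time step of a single review.
scores_per_step = [-0.4, 0.3, 1.2, 2.5]

# Apply the sigmoid at every step, then keep only the last value.
sigmoid_per_step = [sigmoid(z) for z in scores_per_step]
prediction = sigmoid_per_step[-1]

# Values above 0.5 are read as positive (1), everything else as negative (0).
label = 1 if prediction > 0.5 else 0
print(round(prediction, 3), label)
```

Only `prediction` would be compared with the training label; the earlier time steps do not contribute to the loss.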


---


### Load and visualize the data


In [0]:
import numpy as np

# read data from text files
with open('data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

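### Data pre-processing
The first step when building a neural network is getting the data into the proper form to feed into the network. Since we're using an embedding layer, we need to encode each word as an integer. We also want to clean the data up a bit.

The raw review text needs the following processing steps:


> Remove periods (and all other punctuation) and extraneous whitespace.

> The reviews are delimited by newline characters `\n`. We'll split the text into individual reviews using `\n` as the delimiter.

> Then combine all the reviews back into one long string.

First, remove all punctuation. Then take the text without newlines and split it into individual words.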

In [20]:
from string import punctuation

print(punctuation)

# get rid of punctuation
reviews = reviews.lower() # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [21]:
# split by new lines and spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)

# create a list of words
words = all_text.split()
words[:10]

['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the']

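### Encoding the words
The embedding lookup requires that we pass integers to the network. The easiest way to do this is to create a dictionary that maps the words in the vocabulary to integers. Then we can convert each review into integers so it can be passed into the network.


> Exercise: Now encode the words as integers. Build a dictionary that maps words to integers. Later we're going to pad the input vectors with zeros, so make sure the integers start at 1, not 0. Convert the reviews to integers and store them in a new list called reviews_ints.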




In [0]:
# feel free to use this import 
from collections import Counter

## Build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}


## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])
    

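#### Test your code

To check that you've implemented the dictionary correctly, print the number of unique words in the vocabulary and the contents of the first tokenized review.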

In [23]:
# stats about vocabulary
print('Unique words: ', len(vocab_to_int))  # should be ~74,000
print()

# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])

Unique words:  74072

Tokenized review: 
 [[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, 5194, 60, 154, 9, 1, 4975, 5852, 475, 71, 5, 260, 12, 21025, 308, 13, 1978, 6, 74, 2395, 5, 613, 73, 6, 5194, 1, 24103, 5, 1983, 10166, 1, 5786, 1499, 36, 51, 66, 204, 145, 67, 1199, 5194, 19869, 1, 37442, 4, 1, 221, 883, 31, 2988, 71, 4, 1, 5787, 10, 686, 2, 67, 1499, 54, 10, 216, 1, 383, 9, 62, 3, 1406, 3686, 783, 5, 3483, 180, 1, 382, 10, 1212, 13583, 32, 308, 3, 349, 341, 2913, 10, 143, 127, 5, 7690, 30, 4, 129, 5194, 1406, 2326, 5, 21025, 308, 10, 528, 12, 109, 1448, 4, 60, 543, 102, 12, 21025, 308, 6, 227, 4146, 48, 3, 2211, 12, 8, 215, 23]]
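
To see the encoding scheme at a smaller scale, here is a sketch on three made-up reviews. The `toy_*` names and the review text are hypothetical; the steps mirror the `Counter`-based approach used above.

```python
from collections import Counter

# Hypothetical, already-cleaned reviews (lowercase, no punctuation).
toy_reviews = ["the movie was good", "the movie was bad", "good good movie"]
toy_words = " ".join(toy_reviews).split()

# Most frequent words get the smallest integers; numbering starts at 1
# so that 0 stays free for padding.
toy_counts = Counter(toy_words)
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: i for i, word in enumerate(toy_vocab, 1)}

# Convert every review into a list of integers.
toy_ints = [[toy_vocab_to_int[w] for w in review.split()] for review in toy_reviews]
print(toy_vocab_to_int)
print(toy_ints)
```

Because no word maps to 0, a 0 in a padded row can never be confused with a real token.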


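### Encoding the labels
The labels are "positive" or "negative". To use these labels in the network, we need to convert them to 0 and 1.



> Exercise: Convert the labels from positive and negative to 1 and 0, respectively, and place them in a new list, encoded_labels.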



In [0]:
# 1=positive, 0=negative label conversion
labels_split = labels.split('\n')
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])

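### Removing outliers
To keep the reviews in a standard shape, there is one more pre-processing step. The network expects input text of a standard size, so we need to reshape the reviews to a specific length. We'll do this in two main steps:


1.   Remove extremely long or short reviews: the outliers.
2.   Pad or truncate the remaining data so that all reviews have the same length.

![alt text](http://p0.qhimg.com/t017e7b3384ace69fcc.png)

Before padding the reviews, check for extremely short or long reviews; outliers can interfere with training.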

In [25]:
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 1
Maximum review length: 2514


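There are a couple of issues here. One review has zero length, and the longest review has 2,514 words, which is far too many time steps for our RNN. We'll remove the empty review and truncate the very long ones. This gets rid of the outliers and lets the model train more efficiently.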
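
Before picking a cut-off, it can help to count how many reviews it would affect. This is a small sketch on made-up lengths: `truncation_stats` and `toy_lengths` are hypothetical names, and 200 is simply the sequence length this notebook adopts later.

```python
def truncation_stats(lengths, seq_length):
    """Count empty reviews and reviews longer than seq_length."""
    n_empty = sum(1 for n in lengths if n == 0)
    n_truncated = sum(1 for n in lengths if n > seq_length)
    return n_empty, n_truncated

# Hypothetical review lengths in words; the real ones would come from
# [len(x) for x in reviews_ints].
toy_lengths = [0, 12, 140, 200, 233, 2514]

n_empty, n_truncated = truncation_stats(toy_lengths, seq_length=200)
print(n_empty, n_truncated)
```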



> Exercise: First, remove any reviews with zero length from the reviews_ints list, along with their corresponding labels in encoded_labels.



In [26]:
print('Number of reviews before removing outliers: ', len(reviews_ints))

## remove any reviews/labels with zero length from the reviews_ints list.
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]

reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print('Number of reviews after removing outliers: ', len(reviews_ints))

Number of reviews before removing outliers:  25001
Number of reviews after removing outliers:  25000




---

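### Padding sequences
To deal with both short and very long reviews, we'll pad or truncate each review to a specific length. Reviews shorter than seq_length are padded with 0s. Reviews longer than seq_length are truncated to the first seq_length words. A good seq_length for this model is 200.


> Exercise: Define a function that returns an array `features` containing the reviews padded to a standard size; we'll pass this array to the network afterwards.


> 1.   The data should come from reviews_ints, since we want to feed integers to the network.
> 2.   Each row should be seq_length elements long.
> 3.   For reviews shorter than seq_length words, left-pad with 0s. That is, if the review is ['best', 'movie', 'ever'] ([117, 18, 128] as integers), the padded row is [0, 0, 0, ..., 0, 117, 18, 128].
> 4.   For reviews longer than seq_length, use only the first seq_length words as the feature vector.


As a small example, if seq_length=10 and the input review is: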

In [0]:
[117, 18, 128]

The resulting padded sequence should be:

In [0]:
[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]

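#### The final `features` array should be a 2D array, with as many rows as there are reviews and as many columns as the specified seq_length.

This kind of processing matters, and there are many ways to implement it. If you're going to build deep neural networks, you need to be comfortable preparing data this way.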
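
One of those ways, sketched in plain Python lists rather than NumPy, is shown here. `pad_left` is a hypothetical helper name; it follows the same left-pad and keep-the-first-tokens rule described above and also returns an all-zero row for an empty review.

```python
def pad_left(review, seq_length):
    """Left-pad with zeros, or truncate, so the result has seq_length items."""
    truncated = review[:seq_length]          # keep at most the first seq_length tokens
    n_pad = seq_length - len(truncated)      # zeros needed on the left
    return [0] * n_pad + truncated

print(pad_left([117, 18, 128], 10))        # short review: zeros on the left
print(pad_left(list(range(1, 15)), 10))    # long review: first 10 tokens kept
print(pad_left([], 4))                     # empty review: all zeros
```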

In [0]:
def pad_features(reviews_ints, seq_length):
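    ''' Return features of reviews_ints, where each review is left-padded with 0's
        or truncated to the input seq_length.
    '''
    ## implement function
    # getting the correct rows x cols shape
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)

    # for each review, keep at most the first seq_length tokens and
    # right-align them in the row, leaving the zeros on the left
    for i, row in enumerate(reviews_ints):
        tokens = np.array(row, dtype=int)[:seq_length]
        if len(tokens) > 0:  # `-0:` would select the whole row, so skip empty reviews
            features[i, -len(tokens):] = tokens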
     
    return features

In [28]:
# Test your implementation!

seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)

## test statements - do not change - ##
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print first 10 values of the first 30 reviews
print(features[:30,:10])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [22382    42 46418    15   706 17139  3389    47    77    35]
 [ 4505   505    15     3  3342   162  8312  1652     6  4819]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [   54    10    14   116    60   798   552    71   364     5]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    1   330   578    34     3   162   748  2731     9   325]
 [    9    11 10171  5305  1946   689   444    22   280   673]
 [    0     0     0     0     0     0     0     0     0

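### Training, validation, and test sets
With the data in good shape, we'll split it into training, validation, and test sets.


> Exercise: Create the training, validation, and test sets.

> *   You'll need to create sets for both the features and the labels, train_x and train_y, for example.


> *   Define a split fraction, split_frac, as the fraction of data to keep in the training set. This is usually set to 0.8 or 0.9.

> *   Split whatever data is left in half to create the validation and test sets.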



In [29]:
split_frac = 0.8

## split data into training, validation, and test data (features and labels, x and y)
split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		(2500, 200)


#### Check your work

With training, validation, and test fractions of 0.8, 0.1, and 0.1, the final feature data shapes should look like this:

In [0]:
                    Feature Shapes:
Train set: 		 (20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		  (2500, 200)
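
The arithmetic behind those shapes can be checked with a short sketch. `split_sizes` is a hypothetical helper, not part of the notebook; it only computes set sizes and does not touch the data.

```python
def split_sizes(n_samples, split_frac):
    """Return (train, validation, test) sizes for a given training fraction."""
    n_train = int(n_samples * split_frac)
    n_val = (n_samples - n_train) // 2       # half of the remainder
    n_test = n_samples - n_train - n_val     # whatever is left
    return n_train, n_val, n_test

print(split_sizes(25000, 0.8))
```

The three sizes always add up to the number of samples, so no review is dropped or used twice.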



---

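### DataLoaders and batching
After creating the training, test, and validation sets, we can create DataLoaders in two steps:

1.   Create a known format for accessing the data using TensorDataset, which takes an input set of data and a target set of data with the same first dimension and creates a dataset.
2.   Create DataLoaders and batch the training, validation, and test Tensor datasets.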


This is an alternative to writing a generator function that splits the data into batches.

In [0]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

In [32]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)  # the iterator's .next() method is gone in recent PyTorch

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([50, 200])
Sample input: 
 tensor([[    0,     0,     0,  ...,   102,    21,    51],
        [    0,     0,     0,  ...,     9,    67,  4033],
        [    0,     0,     0,  ...,   359,    44,     8],
        ...,
        [    0,     0,     0,  ...,    39,    12,   709],
        [    8,    13,   250,  ...,    56,   651,    53],
        [    0,     0,     0,  ...,    15, 55614, 55615]])

Sample label size:  torch.Size([50])
Sample label: 
 tensor([1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0,
        1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
        1, 1])
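
For comparison, the generator-function approach that the DataLoader replaces could look roughly like this. It is a sketch with hypothetical names (`get_batches`, `toy_x`, `toy_y`); unlike the DataLoaders above, it does not shuffle and it drops a final partial batch.

```python
def get_batches(x, y, batch_size):
    """Yield (inputs, labels) batches in order, dropping a final partial batch."""
    n_full = len(x) // batch_size
    for start in range(0, n_full * batch_size, batch_size):
        end = start + batch_size
        yield x[start:end], y[start:end]

# Hypothetical toy data: 7 already-encoded "reviews" and their labels.
toy_x = [[i, i + 1] for i in range(7)]
toy_y = [1, 0, 1, 1, 0, 0, 1]

batches = list(get_batches(toy_x, toy_y, batch_size=3))
print(len(batches))
for inputs, labels in batches:
    print(inputs, labels)
```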




---

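## Sentiment network with PyTorch
Below is where you'll define the network.

![alt text](http://p3.qhimg.com/t01e7e71cba9a48565c.png)

The layers are as follows:

1.   An [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) that converts word tokens (integers) into embeddings of a specific size.
2.   An [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) defined by a hidden_state size and a number of layers.
3.   A fully-connected output layer that maps the LSTM layer outputs to the desired output_size.
4.   A sigmoid activation layer that turns all outputs into values between 0 and 1; only the last sigmoid output is returned as the output of the network.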

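### The embedding layer
We need to add an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) because there are more than 74,000 words in the vocabulary. One-hot encoding that many classes is far too inefficient. So, instead of one-hot encoding, we use an embedding layer as a lookup table. You could train an embedding layer using Word2Vec and then load it here. It is also fine to create a new layer used only for dimensionality reduction and let the network learn the weights.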
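
The lookup-table view can be checked on a tiny example: multiplying a one-hot vector by the embedding table gives exactly the row that indexing would return. The 4 x 3 table and the helper names here are made up for illustration.

```python
# Hypothetical embedding table: 4 tokens (rows), embedding size 3 (columns).
embedding_table = [
    [0.0, 0.0, 0.0],   # token 0 (padding)
    [0.1, 0.2, 0.3],   # token 1
    [0.4, 0.5, 0.6],   # token 2
    [0.7, 0.8, 0.9],   # token 3
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_times_table(vec, table):
    """Row vector times matrix: a weighted sum of the table's rows."""
    n_cols = len(table[0])
    return [sum(vec[r] * table[r][c] for r in range(len(table)))
            for c in range(n_cols)]

token = 2
via_matmul = vec_times_table(one_hot(token, 4), embedding_table)
via_lookup = embedding_table[token]
print(via_matmul, via_lookup)
```

The multiplication touches every row of the table, while the lookup reads one row, which is why indexing is preferred for a vocabulary of this size.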

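### The LSTM layer(s)
We'll add an [LSTM](https://pytorch.org/docs/stable/nn.html#lstm) to this recurrent network. It takes an input_size, a hidden_dim, a number of layers, a dropout probability (for dropout between multiple layers), and a batch_first parameter.

Networks with more layers (typically 2-3) often perform better, because the extra layers let the network learn more complex relationships.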



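> ***Exercise:*** Complete the `__init__`, `forward`, and `init_hidden` functions.



Note: init_hidden should initialize the hidden and cell states of the LSTM layers to all zeros and move them to the GPU, if one is available.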

In [33]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


In [0]:
import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()
        
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        # define all layers
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(0.3)
        
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
 

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)

        # embeddings and lstm_out
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # keep only the output from the last time step
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden

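### Instantiate the network
Here, we'll instantiate the network. First, define the hyperparameters.

*   vocab_size: Size of the vocabulary, or the range of values for the input word tokens.
*   output_size: Size of the desired output; the number of class scores we want to output (positive/negative).
*   embedding_dim: Number of columns in the embedding lookup table; the size of the embeddings.
*   hidden_dim: Number of units in the hidden layers of the LSTM cells. Larger values often perform better, at the cost of more parameters; common values are 128, 256, and 512.
*   n_layers: Number of LSTM layers in the network, typically between 1 and 3.


> Exercise: Define the model hyperparameters.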



In [35]:
# Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SentimentRNN(
  (embedding): Embedding(74073, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)
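
As a rough check on model size, the parameters implied by the printed layers can be counted by hand. This sketch assumes PyTorch's usual LSTM layout (per layer, input-to-hidden and hidden-to-hidden weights plus two bias vectors, each covering four gates); `lstm_param_count` is a hypothetical helper, and the dropout and sigmoid layers have no parameters.

```python
def lstm_param_count(input_size, hidden_dim, n_layers):
    """Count weights and biases in a stacked, unidirectional LSTM."""
    total = 0
    for layer in range(n_layers):
        # The first layer reads the embeddings; later layers read hidden states.
        in_size = input_size if layer == 0 else hidden_dim
        # 4 gates x hidden_dim rows x (input weights + hidden weights + 2 biases)
        total += 4 * hidden_dim * (in_size + hidden_dim + 2)
    return total

vocab_size, embedding_dim, hidden_dim, n_layers, output_size = 74073, 400, 256, 2, 1

embedding_params = vocab_size * embedding_dim
lstm_params = lstm_param_count(embedding_dim, hidden_dim, n_layers)
fc_params = hidden_dim * output_size + output_size   # weights + bias
total_params = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total_params)
```

Under these assumptions the embedding table accounts for most of the parameters, which is one reason embedding_dim is worth tuning.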




---

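### Training
Below is typical training code. If you'd like to write your own, feel free to delete all of this code and implement it yourself. You can also add code to save a model by name.



> We'll use a kind of cross-entropy loss designed for a single sigmoid output. BCELoss, or binary cross-entropy loss, applies cross-entropy loss to a single value between 0 and 1.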
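
To make the loss concrete, here is a plain-Python sketch of mean binary cross-entropy, -(y·log(p) + (1-y)·log(1-p)) averaged over the batch. `bce_loss` is a hypothetical stand-in written for illustration, not the `nn.BCELoss` implementation, and the probabilities are made up.

```python
import math

def bce_loss(predictions, targets):
    """Mean binary cross-entropy for probabilities in (0, 1) and 0/1 targets."""
    total = 0.0
    for p, y in zip(predictions, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(predictions)

# Same targets, opposite predictions.
confident_right = bce_loss([0.9, 0.1], [1, 0])
confident_wrong = bce_loss([0.1, 0.9], [1, 0])
print(round(confident_right, 4), round(confident_wrong, 4))
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which pushes the sigmoid output toward the true label.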


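There are also a few more hyperparameters:


*   lr: Learning rate for the optimizer.
*   epochs: Number of passes through the training set.
*   clip: The maximum gradient norm; larger gradients are scaled down to it (this prevents exploding gradients).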
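
The idea behind `clip` can be sketched in a few lines: if the overall L2 norm of the gradients exceeds the limit, every gradient is scaled by the same factor so that the norm equals the limit. `clip_by_global_norm` is a hypothetical illustration of that idea on a flat list, not PyTorch's implementation.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)                 # already small enough: unchanged
    scale = max_norm / total_norm          # same factor for every gradient
    return [g * scale for g in grads]

# Hypothetical gradients with norm 50, clipped to norm 5.
clipped = clip_by_global_norm([30.0, 40.0], max_norm=5)
print(clipped)
```

Scaling all gradients together keeps the update direction and only shortens the step.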



In [0]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

In [37]:
# training params

epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

Epoch: 1/4... Step: 100... Loss: 0.714163... Val Loss: 0.651452
Epoch: 1/4... Step: 200... Loss: 0.557441... Val Loss: 0.613568
Epoch: 1/4... Step: 300... Loss: 0.654193... Val Loss: 0.681645
Epoch: 1/4... Step: 400... Loss: 0.678328... Val Loss: 0.678126
Epoch: 2/4... Step: 500... Loss: 0.494402... Val Loss: 0.671706
Epoch: 2/4... Step: 600... Loss: 0.508051... Val Loss: 0.526870
Epoch: 2/4... Step: 700... Loss: 0.606021... Val Loss: 0.553802
Epoch: 2/4... Step: 800... Loss: 0.651838... Val Loss: 0.495164
Epoch: 3/4... Step: 900... Loss: 0.294188... Val Loss: 0.479113
Epoch: 3/4... Step: 1000... Loss: 0.351133... Val Loss: 0.454691
Epoch: 3/4... Step: 1100... Loss: 0.266095... Val Loss: 0.479119
Epoch: 3/4... Step: 1200... Loss: 0.440874... Val Loss: 0.446265
Epoch: 4/4... Step: 1300... Loss: 0.177936... Val Loss: 0.525860
Epoch: 4/4... Step: 1400... Loss: 0.202405... Val Loss: 0.467553
Epoch: 4/4... Step: 1500... Loss: 0.428756... Val Loss: 0.568864
Epoch: 4/4... Step: 1600... Loss: 



---

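### Testing
There are a few ways to test the network.


*   Test data performance: First, see how the trained model performs on all of the test data defined above. We'll calculate the average loss and accuracy over the test data.
*   Inference on user-generated data: Second, check whether we can input just one example review at a time (without a label) and see what the trained model predicts. Looking at new user input like this and predicting an output label is called inference.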
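
Accuracy here means the fraction of reviews whose thresholded sigmoid output matches the label. This is a minimal sketch with made-up probabilities and a hypothetical `accuracy` helper.

```python
def accuracy(probabilities, labels):
    """Fraction of examples whose thresholded probability matches the label."""
    predictions = [1 if p > 0.5 else 0 for p in probabilities]
    n_correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return n_correct / len(labels)

# Hypothetical model outputs and true labels for four reviews.
toy_probs = [0.92, 0.08, 0.61, 0.45]
toy_labels = [1, 0, 0, 0]
print(accuracy(toy_probs, toy_labels))
```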



In [38]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.505
Test accuracy: 0.808


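### Inference on a test review
You can change this test_review to any text you like. Read it and decide for yourself whether it is positive or negative, then see whether the model predicts the sentiment correctly!


> Exercise: Write a predict function that takes a trained network, a plain text_review, and a sequence length, and prints a message saying whether the review is positive or negative.
> *   You can use any functions you've already defined, or define any helper functions you need to complete predict, but the function itself should take only a trained network, a text review, and a sequence length as arguments.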








In [40]:
from string import punctuation

# negative test review
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'


def tokenize_review(test_review):
    test_review = test_review.lower() # lowercase
    # get rid of punctuation
    test_text = ''.join([c for c in test_review if c not in punctuation])

    # splitting by spaces
    test_words = test_text.split()

    # tokens; words missing from the training vocabulary are skipped
    test_ints = []
    test_ints.append([vocab_to_int[word] for word in test_words if word in vocab_to_int])

    return test_ints

# test code and generate tokenized review
test_ints = tokenize_review(test_review_neg)
print(test_ints)

[[1, 247, 18, 10, 28, 108, 113, 14, 388, 2, 10, 181, 60, 273, 144, 11, 18, 68, 76, 113, 2, 1, 410, 14, 539]]


In [41]:
# test sequence padding
seq_length=200
features = pad_features(test_ints, seq_length)

print(features)

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   1 247  18  10  28
  108 113  14 388   2  10 181  60 273 144  11  18  68  76 113   2   1 410
   14 539]]


In [42]:
# test conversion to tensor and pass into your model
feature_tensor = torch.from_numpy(features)
print(feature_tensor.size())

torch.Size([1, 200])


In [0]:
def predict(net, test_review, sequence_length=200):
    ''' Prints out whether a given review is predicted to be 
        positive or negative in sentiment, using a trained model.
        
        params:
        net - A trained net 
        test_review - a review made of normal text and punctuation
        sequence_length - the padded length of a review
        '''
    
    
    # switch to evaluation mode so that dropout is disabled
    net.eval()
    
    # tokenize review
    test_ints = tokenize_review(test_review)
    
    # pad tokenized sequence
    seq_length=sequence_length
    features = pad_features(test_ints, seq_length)
    
    # convert to tensor to pass into your model
    feature_tensor = torch.from_numpy(features)
    
    batch_size = feature_tensor.size(0)
    
    # initialize hidden state
    h = net.init_hidden(batch_size)
    
    if(train_on_gpu):
        feature_tensor = feature_tensor.cuda()
    
    # get the output from the model
    output, h = net(feature_tensor, h)
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze()) 
    # printing output value, before rounding
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
    
    # print custom response
    if(pred.item()==1):
        print("Positive review detected!")
    else:
        print("Negative review detected.")
    
        

In [0]:
# positive test review
test_review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'


In [45]:
# call function
# try negative and positive reviews!
seq_length=200
predict(net, test_review_neg, seq_length)

Prediction value, pre-rounding: 0.005395
Negative review detected.


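### Try out test reviews of your own!
Now that you have a trained model and a predict function, you can pass in any kind of text and the model will predict whether it has a positive or negative sentiment. Push the model to its limits and see which words it treats as positive and which as negative.

Later, you'll learn how to deploy a model like this to a production environment so that it can make predictions on any data users enter into a web app.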