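# Building a Multilayer Perceptron Model for IMDb Sentiment Analysis

We start by building a multilayer perceptron (MLP) model for IMDb sentiment analysis. The previous chapter covered data preprocessing; in this chapter we build:

**(1) Embedding layer:** converts each list of integers into a list of vectors.

**(2) Multilayer perceptron:** processes the list of vectors with the following layers.


   $\bullet$ **Flatten layer:** 3200 units. Each input list holds 100 integers and each integer becomes a 32-dimensional vector, so flattening gives 100 × 32 = 3200 values.
   
   $\bullet$ **Hidden layer:** 256 neurons.
   
   $\bullet$ **Output layer:** a single sigmoid neuron. An output close to 1 indicates a positive review and an output close to 0 a negative one.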

# Data preprocessing

Import the required modules

In [1]:
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


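Read the IMDb dataset directory. We first define rm_tags to strip HTML tags and then read_files to load the reviews.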

In [2]:
import re  # regular-expression module
def rm_tags(text):  # rm_tags takes the raw review text as input
    re_tag = re.compile(r'<[^>]+>')  # pattern that matches any HTML tag, e.g. <br />
    return re_tag.sub('', text)  # replace every match in text with an empty string
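
As a quick check of the regular expression above, the sketch below applies the same pattern to a made-up snippet (the sample string is an assumption, not taken from the dataset). Because each match is replaced with an empty string, words separated only by a tag such as `<br />` end up joined together; substituting a space instead is one possible variation, shown here as the hypothetical `rm_tags_spaced`.

```python
import re

TAG_PATTERN = re.compile(r'<[^>]+>')  # same pattern as in rm_tags

def rm_tags(text):
    """Remove HTML tags, replacing each one with an empty string."""
    return TAG_PATTERN.sub('', text)

def rm_tags_spaced(text):
    """Hypothetical variant: replace each tag with a space so words stay separated."""
    return TAG_PATTERN.sub(' ', text)

sample = "A great film.<br /><br />Loved it."  # made-up example text
print(rm_tags(sample))         # A great film.Loved it.
print(rm_tags_spaced(sample))  # A great film.  Loved it.
```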

In [3]:
import os
def read_files(filetype):
    # filetype is 'train' when reading the training data and 'test' when reading the test data
    path = 'aclImdb/'  # root directory of the extracted dataset
    file_list = []  # paths of all review files
    
    positive_path = path + filetype + '/pos/'  # directory holding the positive reviews
    for f in os.listdir(positive_path):  # add every file under positive_path to file_list
        file_list += [positive_path + f]
        
    negative_path = path + filetype + '/neg/'  # directory holding the negative reviews
    for f in os.listdir(negative_path):  # add every file under negative_path to file_list
        file_list += [negative_path + f]
        
    print('read', filetype, 'files:', len(file_list))  # number of files found for this split
    
    all_labels = ([1] * 12500 + [0] * 12500) 
    # labels: the first 12500 files are positive (1) and the last 12500 are negative (0)
    
    all_texts = []  # review texts, in the same order as file_list
    
    # Open each file in file_list as file_input, read it with file_input.readlines(),
    # join the lines, strip the HTML tags with rm_tags, and append the result to all_texts.
    for fi in file_list:
        with open(fi, encoding='utf8') as file_input:
            all_texts += [rm_tags(' '.join(file_input.readlines()))]
            
    return all_labels, all_texts
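
The function above hard-codes 12500 labels per class. That matches the file counts printed for this dataset, but the labels would no longer line up with the texts if a directory held a different number of files. A minimal sketch of an alternative, assuming the same `pos/` and `neg/` layout, derives each label from the directory the file came from. The names `read_reviews` and `root` are hypothetical and introduced only for this illustration.

```python
import re
from pathlib import Path

TAG_PATTERN = re.compile(r'<[^>]+>')  # same tag pattern as rm_tags

def read_reviews(root, filetype):
    """Return (labels, texts) for root/filetype/pos and root/filetype/neg."""
    labels, texts = [], []
    for label, subdir in ((1, 'pos'), (0, 'neg')):
        folder = Path(root) / filetype / subdir
        for file_path in sorted(folder.iterdir()):  # sorted for a reproducible order
            text = file_path.read_text(encoding='utf8')  # whole file read at once
            texts.append(TAG_PATTERN.sub('', text))
            labels.append(label)  # label follows the directory, not a fixed count
    return labels, texts
```

Under these assumptions it would be called as `y_train, train_text = read_reviews('aclImdb', 'train')`.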

Read the training data

In [4]:
y_train, train_text = read_files('train')

read train files: 25000


Read the test data

In [5]:
y_test, test_text = read_files('test')

read test files: 25000


Build the tokenizer with a 2000-word dictionary

In [6]:
token = Tokenizer(num_words=2000)
token.fit_on_texts(train_text)

Convert the review text into lists of integers

In [7]:
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
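
To make the two steps above concrete, here is a simplified pure-Python sketch of the idea behind fitting a dictionary and converting text to integers. It is not the Keras implementation: it only lowercases and splits on whitespace, and `fit_word_index` and `texts_to_ids` are hypothetical names. Words are ranked by frequency starting at 1, and, as I understand the Keras behaviour, only indices below `num_words` are kept when converting text.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        sequences.append([i for i in ids if i < num_words])
    return sequences

corpus = ["the film was good", "the film is bad", "the plot"]  # toy data
word_index = fit_word_index(corpus)
print(word_index)
# "good" is ranked 3 or lower, and "surprised" and "me" are unknown, so all are dropped
print(texts_to_ids(["the good film surprised me"], word_index, num_words=3))  # [[1, 2]]
```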

Truncate or pad every list of integers to a length of 100

In [8]:
x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test = sequence.pad_sequences(x_test_seq, maxlen=100)
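
By default `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`), filling with 0. The pure-Python sketch below mimics that default for a single list so the effect is easy to see; `pad_or_truncate` is a hypothetical helper, and it assumes `maxlen` is a positive integer.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last maxlen items, then left-pad with value up to maxlen."""
    kept = seq[-maxlen:]                          # 'pre' truncation drops items from the front
    return [value] * (maxlen - len(kept)) + kept  # 'pre' padding adds fill values at the front

print(pad_or_truncate([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

For a long review this means the end of the text is what the model sees.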

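# Adding the embedding layer

Keras provides an Embedding layer that converts lists of integers into lists of vectors.

Import the required modules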

In [9]:
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding

Create the model

In [10]:
model = Sequential()

Add the embedding layer to the model

In [11]:
model.add(Embedding(output_dim=32, input_dim= 2000, input_length=100))
model.add(Dropout(0.2))

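Parameters:

output_dim = 32: the output dimension is 32, so each integer is converted into a 32-dimensional vector.

input_dim = 2000: the input dimension is 2000, matching the 2000-word dictionary built earlier.

input_length = 100: each list of integers contains 100 numbers.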
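
Conceptually, an embedding layer is a lookup table with one row per dictionary index. The sketch below imitates that with plain Python lists, using random values in place of learned weights and a made-up input sequence, to show where the shapes (100, 32) and 3200 come from. It illustrates the idea only and is not how Keras implements the layer.

```python
import random

vocab_size, embed_dim, seq_len = 2000, 32, 100

random.seed(0)
# Lookup table with one embed_dim-dimensional row per dictionary index
# (random here; in the real model these values are learned during training).
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)] for _ in range(vocab_size)]

sequence_ids = [0] * 97 + [9, 24, 106]             # a padded list of 100 integers (made up)
embedded = [table[i] for i in sequence_ids]        # shape (100, 32)
flattened = [x for row in embedded for x in row]   # what a flatten step produces

print(len(embedded), len(embedded[0]), len(flattened))  # 100 32 3200
```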

# Building the multilayer perceptron

Add the flatten layer, which outputs 3200 values (100 × 32)

In [12]:
model.add(Flatten())

Add the hidden layer

In [13]:
model.add(Dense(units=256, activation='relu'))
model.add(Dropout(0.35))

Add the output layer

In [14]:
model.add(Dense(units=1, activation='sigmoid'))

View the model summary

In [15]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 32)           64000     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 32)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               819456    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
Total params: 883,713
Trainable params: 883,713
Non-trainable params: 0
_________________________________________________________________
None
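
The parameter counts in the summary can be reproduced by hand: an embedding layer stores one vector per dictionary index, and a dense layer has one weight per input-unit pair plus one bias per unit. The helper names below are hypothetical; the arithmetic is a sketch that checks the numbers reported above.

```python
def embedding_params(input_dim, output_dim):
    return input_dim * output_dim            # one vector per dictionary index

def dense_params(inputs, units):
    return inputs * units + units            # weights plus one bias per unit

emb = embedding_params(2000, 32)             # embedding_1
hidden = dense_params(100 * 32, 256)         # dense_1, fed by the 3200 flattened values
out = dense_params(256, 1)                   # dense_2
print(emb, hidden, out, emb + hidden + out)  # 64000 819456 257 883713
```

Almost all of the parameters sit in the first dense layer, because it is connected to every one of the 3200 flattened values.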

# Training the model

Define the training configuration

In [16]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Start training

In [17]:
train_history = model.fit(x_train, y_train, batch_size=100, epochs=10, verbose=2, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 6s - loss: 0.4716 - acc: 0.7627 - val_loss: 0.5328 - val_acc: 0.7584
Epoch 2/10
 - 5s - loss: 0.2653 - acc: 0.8907 - val_loss: 0.5224 - val_acc: 0.7796
Epoch 3/10
 - 6s - loss: 0.1603 - acc: 0.9409 - val_loss: 0.5256 - val_acc: 0.8052
Epoch 4/10
 - 6s - loss: 0.0808 - acc: 0.9712 - val_loss: 0.9687 - val_acc: 0.7324
Epoch 5/10
 - 5s - loss: 0.0469 - acc: 0.9847 - val_loss: 1.2965 - val_acc: 0.6978
Epoch 6/10
 - 5s - loss: 0.0349 - acc: 0.9876 - val_loss: 1.2107 - val_acc: 0.7428
Epoch 7/10
 - 6s - loss: 0.0308 - acc: 0.9885 - val_loss: 1.0860 - val_acc: 0.7774
Epoch 8/10
 - 6s - loss: 0.0275 - acc: 0.9903 - val_loss: 1.1285 - val_acc: 0.7862
Epoch 9/10
 - 6s - loss: 0.0230 - acc: 0.9915 - val_loss: 1.2168 - val_acc: 0.7800
Epoch 10/10
 - 6s - loss: 0.0255 - acc: 0.9907 - val_loss: 1.4009 - val_acc: 0.7524


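Training ran for 10 epochs. The training loss fell and the training accuracy rose in almost every epoch, but the validation loss climbed from about 0.52 to 1.40 after epoch 3 while the validation accuracy fluctuated between roughly 0.70 and 0.81, which suggests the model is overfitting the training data.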
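
One way to read the log is to look for the epoch with the best validation metrics rather than the last one. The sketch below does this on the values copied from the training log above; in a live session the same lists should be available as `train_history.history['val_loss']` and `train_history.history['val_acc']` (the key names follow the metric names printed in the log).

```python
# val_loss and val_acc copied from the training log above, epochs 1 to 10
val_loss = [0.5328, 0.5224, 0.5256, 0.9687, 1.2965, 1.2107, 1.0860, 1.1285, 1.2168, 1.4009]
val_acc = [0.7584, 0.7796, 0.8052, 0.7324, 0.6978, 0.7428, 0.7774, 0.7862, 0.7800, 0.7524]

# Index of the best value, converted to a 1-based epoch number
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_acc_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1
print(best_loss_epoch, best_acc_epoch)  # 2 3
```

Since the best validation results appear within the first three epochs, training for fewer epochs, or stopping early with something like Keras's `EarlyStopping` callback, would be a reasonable thing to try.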

# Evaluating model accuracy

In [18]:
scores = model.evaluate(x_test, y_test, verbose=1)
scores[1]



0.80916

The accuracy on the test set is about 0.81.

# Making predictions

Run the prediction

In [19]:
predict = model.predict_classes(x_test)



View the first 10 predictions

In [20]:
predict[: 10]

array([[1],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1]], dtype=int32)

View the predictions as a one-dimensional array (reshape converts the two-dimensional array predict into a one-dimensional one)

In [21]:
predict_classes = predict.reshape(-1)
predict_classes[: 10]

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
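
`predict_classes` belongs to older Keras `Sequential` models and, to my knowledge, is no longer available in recent TensorFlow releases. For a single sigmoid output, the equivalent step is to threshold the predicted probability at 0.5. The sketch below shows that on made-up probabilities; `to_classes` is a hypothetical helper, and the input stands in for what `model.predict(x_test)` would return.

```python
def to_classes(probabilities, threshold=0.5):
    """Turn sigmoid outputs like [[0.93], [0.12]] into a flat list of 0/1 labels."""
    return [1 if row[0] > threshold else 0 for row in probabilities]

probs = [[0.93], [0.71], [0.12], [0.55]]  # made-up values, one row per review
print(to_classes(probs))  # [1, 1, 0, 1]
```

With a NumPy array of predictions, the usual one-liner for the same idea is `(model.predict(x_test) > 0.5).astype('int32').reshape(-1)`.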

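# Viewing predictions on the test data

We create a display_test_Sentiment function that prints a review together with its true label and the predicted sentiment.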

In [22]:
SentimentDict = {1: '正面的', 0: '负面的'}  # '正面的' = positive, '负面的' = negative
def display_test_Sentiment(i):
    print(test_text[i])
    # 'label真实值' = true label, '预测结果' = prediction
    print('label真实值：', SentimentDict[y_test[i]], '预测结果：', SentimentDict[predict_classes[i]])

Show the prediction for item 2

In [23]:
display_test_Sentiment(2)

"Miss Cast Away" is an amusing trifle, which dispenses with serious plot or character development to pack in as many gags as possible. Best enjoyed with a large audience that is open to such entertainments and perhaps, has had a few drinks. Most of the jokes are current-event based so in future years this film may become a time-capsule of turn-of-the-21st-century pop culture references.The 30i to 24p conversion of the footage does create a jerky appearance in some parts, most noticeably the opening aerial shots.The appearance of Micheal Jackson is indeed a strange non-sequiter event. But I, for one, find it encouraging that Mr. Jackson has shown a helpful interest in one of his protégés even after he (the director) has passed from the cute-preteen-boy stage.The effects work is not as bad as one review suggested. Most of it was done by a one-man crew in a brief span of time consisting of animator William Sutton, whose name seems to have been omitted from the IMDb credits. His work is an

Show the prediction for item 12502

In [24]:
display_test_Sentiment(12502)

Seriously Reality Charity TV These producers must think that the masses are full of non-thinkers.These shows are called reality, which means they are suppose to resemble something real, with truth or facts.I suppose the characters are really acting in all the pathetic-ness.At one point I wonder if these type of shows decrease or increase the collective unconsciousness.We live in a world that already contains individuals that are not authentic. Is it necessary to promote an inauthentic way of being?
label真实值： 负面的 预测结果： 正面的


# A review of Beauty and the Beast

Open the review page for Beauty and the Beast, copy a review, and store it in the variable input_text.

In [25]:
input_text = '''
I have seen the animated movie as a kid and always wanted to see it again. Decided to watch this. Maybe this is more 
enjoyable for kids and at some point I was getting a bit bored but I am glad I watched it till the end because it's a
beautiful movie and brought a tear to my eyes a few times.

'''

Convert the review text into a list of integers

In [26]:
input_seq = token.texts_to_sequences([input_text])

View the list of integers

In [27]:
print(input_seq)

[[9, 24, 106, 1, 1121, 16, 13, 3, 549, 2, 206, 469, 5, 63, 8, 170, 867, 5, 102, 10, 275, 10, 6, 49, 733, 14, 358, 2, 29, 45, 209, 9, 12, 393, 3, 223, 1095, 17, 9, 240, 1259, 9, 292, 8, 1, 126, 84, 41, 3, 303, 16, 2, 834, 3, 5, 57, 521, 3, 167, 207]]


Check the length of the list

In [28]:
len(input_seq[0])

60

Pad the list to a length of 100

In [29]:
pad_input_seq = sequence.pad_sequences(input_seq, maxlen=100)

Check the length of the list after padding

In [30]:
len(pad_input_seq[0])

100

Predict with the multilayer perceptron

In [31]:
predict_result = model.predict_classes(pad_input_seq)



View the prediction

In [32]:
predict_result

array([[1]], dtype=int32)

Read the element from the prediction result

In [33]:
predict_result[0][0]

1

Use the SentimentDict dictionary defined earlier to convert the result into text

In [34]:
SentimentDict[predict_result[0][0]]

'正面的'

So the review is predicted to be positive ('正面的').

# Predicting whether a Beauty and the Beast review is positive or negative

Collect the previous commands into a function, predict_review

In [35]:
def predict_review(input_text):
    input_seq = token.texts_to_sequences([input_text])
    pad_input_seq = sequence.pad_sequences(input_seq, maxlen=100)
    predict_result = model.predict_classes(pad_input_seq)
    print(SentimentDict[predict_result[0][0]])

Given the review text input_text, the function prints the predicted sentiment.

Predict a negative review

In [36]:
input_text = '''This movie was completely unnecessary. It does everything from the original animated film but in a 
bland and boring way. The few things added or changed had no reason to be because the story was already perfect. 
There was no need for a remake and it was obviously done for some quick cash. They should have just done a rerelease
of the original instead of a watered down version.
'''

In [37]:
predict_review(input_text)

负面的


Predict a positive review

In [38]:
input_text = '''I am writing this review because I'm impressed with such negative feedback in some reviews. I think it 
is a great remake, it is definitely a matter of taste as I loved feeling like it was the original movie, while it 
wasn't! It took me back to my childhood and it was just beautiful.

If you are looking for a change of script or some kind of surprise you will definitely be disappointed. If you just want 
to enjoy a magical movie and sing along, you will love it :)'''

In [39]:
predict_review(input_text)

正面的


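Both predictions agree with the sentiment of the reviews (the output '负面的' means negative and '正面的' means positive). You can try further reviews in the same way.

Reviews of Beauty and the Beast: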
https://www.imdb.com/title/tt2771200/reviews

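# Using a larger dictionary and longer sequences in text processing

To improve prediction accuracy, we make two changes:

$\bullet$ **Larger dictionary:** the dictionary grows from 2000 words to 3800 words.

$\bullet$ **Longer sequences:** the length to which each list of integers is truncated or padded grows from 100 to 380.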
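
A simple way to judge a candidate sequence length is to measure how many reviews would lose tokens at that length. The sketch below does this on made-up sequence lengths; `truncated_fraction` is a hypothetical helper, and in the notebook it could be applied to `x_train_seq`.

```python
def truncated_fraction(sequences, maxlen):
    """Fraction of sequences longer than maxlen, i.e. that lose tokens when truncated."""
    return sum(len(s) > maxlen for s in sequences) / len(sequences)

toy_sequences = [[1] * 60, [1] * 150, [1] * 400, [1] * 90]  # made-up lengths
print(truncated_fraction(toy_sequences, 100))  # 0.5
print(truncated_fraction(toy_sequences, 380))  # 0.25
```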

Update the data preprocessing

In [40]:
# increase the dictionary size to 3800 words
token = Tokenizer(num_words=3800)
token.fit_on_texts(train_text)

In [41]:
# convert the text into lists of integers
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)

In [42]:
# truncate or pad so that every review's sequence has length 380
x_train = sequence.pad_sequences(x_train_seq, maxlen=380)
x_test = sequence.pad_sequences(x_test_seq, maxlen=380)

Build the model

In [43]:
model = Sequential()

In [44]:
model.add(Embedding(output_dim=32, input_dim=3800, input_length=380))
model.add(Dropout(0.2))

In [45]:
model.add(Flatten())

In [46]:
model.add(Dense(units=256, activation='relu'))
model.add(Dropout(0.2))

In [47]:
model.add(Dense(units=1, activation='sigmoid'))

In [48]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 380, 32)           121600    
_________________________________________________________________
dropout_3 (Dropout)          (None, 380, 32)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 12160)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 256)               3113216   
_________________________________________________________________
dropout_4 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 257       
Total params: 3,235,073
Trainable params: 3,235,073
Non-trainable params: 0
_________________________________________________________________


Train the model

In [49]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [50]:
train_history = model.fit(x_train, y_train, batch_size=100, epochs=10, verbose=2, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 19s - loss: 0.4794 - acc: 0.7567 - val_loss: 0.4975 - val_acc: 0.7850
Epoch 2/10
 - 19s - loss: 0.2008 - acc: 0.9224 - val_loss: 0.5416 - val_acc: 0.7894
Epoch 3/10
 - 19s - loss: 0.0882 - acc: 0.9703 - val_loss: 0.5833 - val_acc: 0.8158
Epoch 4/10
 - 19s - loss: 0.0325 - acc: 0.9908 - val_loss: 0.8868 - val_acc: 0.7732
Epoch 5/10
 - 19s - loss: 0.0123 - acc: 0.9974 - val_loss: 0.8633 - val_acc: 0.8052
Epoch 6/10
 - 22s - loss: 0.0075 - acc: 0.9983 - val_loss: 0.9064 - val_acc: 0.8186
Epoch 7/10
 - 25s - loss: 0.0049 - acc: 0.9989 - val_loss: 1.0689 - val_acc: 0.8024
Epoch 8/10
 - 24s - loss: 0.0049 - acc: 0.9990 - val_loss: 1.3058 - val_acc: 0.7774
Epoch 9/10
 - 24s - loss: 0.0090 - acc: 0.9973 - val_loss: 1.0933 - val_acc: 0.8128
Epoch 10/10
 - 21s - loss: 0.0178 - acc: 0.9938 - val_loss: 1.0341 - val_acc: 0.8212


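Update the pad_sequences call to use the new length of 380. The input_seq reused here was produced in In [26] with the 2000-word tokenizer, so it lacks words ranked 2000 and above; the updated predict_review re-tokenizes the text with the new tokenizer.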

In [51]:
pad_input_seq = sequence.pad_sequences(input_seq, maxlen=380)

Update the predict_review function

In [52]:
def predict_review(input_text):
    input_seq = token.texts_to_sequences([input_text])
    pad_input_seq = sequence.pad_sequences(input_seq, maxlen=380)
    predict_result = model.predict_classes(pad_input_seq)
    print(SentimentDict[predict_result[0][0]])

Evaluate model accuracy

In [53]:
scores = model.evaluate(x_test, y_test, verbose=1)
scores[1]



0.85488

The test accuracy rises to about 0.85.

# IMDb sentiment analysis with a Keras RNN model

In this section we use SimpleRNN(units=16) to build an RNN layer with 16 neurons.

Import the required modules

In [54]:
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN

Create the model

In [55]:
model = Sequential()

Add the embedding layer

In [56]:
model.add(Embedding(output_dim=32, input_dim= 3800, input_length=380))
model.add(Dropout(0.35))

Add the RNN layer

In [57]:
model.add(SimpleRNN(units=16))

Add the hidden layer

In [58]:
model.add(Dense(units=256, activation='relu'))
model.add(Dropout(0.35))

Add the output layer

In [59]:
model.add(Dense(units=1, activation='sigmoid'))

View the model summary

In [60]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 380, 32)           121600    
_________________________________________________________________
dropout_5 (Dropout)          (None, 380, 32)           0         
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 16)                784       
_________________________________________________________________
dense_5 (Dense)              (None, 256)               4352      
_________________________________________________________________
dropout_6 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 257       
Total params: 126,993
Trainable params: 126,993
Non-trainable params: 0
_________________________________________________________________
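
The SimpleRNN layer reads the 380 embedded vectors one at a time and keeps a 16-dimensional state, which is why its output shape is (None, 16) and why it needs only 784 parameters. The pure-Python sketch below shows the recurrence with tanh (the Keras default activation for SimpleRNN) on tiny made-up weights and checks the parameter count; the function names are hypothetical and this is not the Keras implementation.

```python
import math

def simple_rnn_params(input_dim, units):
    # input weights + recurrent weights + biases
    return input_dim * units + units * units + units

def simple_rnn_last_state(inputs, W, U, b):
    """Run h_t = tanh(W x_t + U h_{t-1} + b) over the sequence and return the final h."""
    units = len(b)
    h = [0.0] * units
    for x in inputs:
        h = [
            math.tanh(
                sum(W[j][k] * x[k] for k in range(len(x)))
                + sum(U[j][k] * h[k] for k in range(units))
                + b[j]
            )
            for j in range(units)
        ]
    return h

print(simple_rnn_params(32, 16))  # 784, as reported for simple_rnn_1
print(16 * 256 + 256)             # 4352, the dense layer fed by the 16 RNN outputs

# Tiny example: 2 units, 1-dimensional inputs, made-up weights
W = [[0.5], [-0.5]]
U = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
state = simple_rnn_last_state([[1.0], [2.0]], W, U, b)
print(len(state))  # 2
```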


Define the training configuration

In [61]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Start training

In [62]:
train_history = model.fit(x_train, y_train, batch_size=100, epochs=10, verbose=2, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 19s - loss: 0.5799 - acc: 0.6846 - val_loss: 0.6068 - val_acc: 0.7380
Epoch 2/10
 - 18s - loss: 0.3596 - acc: 0.8518 - val_loss: 0.6682 - val_acc: 0.6968
Epoch 3/10
 - 18s - loss: 0.2916 - acc: 0.8849 - val_loss: 0.8436 - val_acc: 0.6688
Epoch 4/10
 - 18s - loss: 0.2463 - acc: 0.9038 - val_loss: 0.5229 - val_acc: 0.8024
Epoch 5/10
 - 18s - loss: 0.2115 - acc: 0.9199 - val_loss: 0.5676 - val_acc: 0.8006
Epoch 6/10
 - 18s - loss: 0.1726 - acc: 0.9342 - val_loss: 0.6942 - val_acc: 0.7394
Epoch 7/10
 - 18s - loss: 0.1449 - acc: 0.9447 - val_loss: 0.7185 - val_acc: 0.7776
Epoch 8/10
 - 18s - loss: 0.1360 - acc: 0.9486 - val_loss: 0.8210 - val_acc: 0.7166
Epoch 9/10
 - 18s - loss: 0.1132 - acc: 0.9577 - val_loss: 1.4753 - val_acc: 0.6718
Epoch 10/10
 - 18s - loss: 0.1000 - acc: 0.9626 - val_loss: 0.8901 - val_acc: 0.7710


Evaluate model accuracy

In [63]:
scores = model.evaluate(x_test, y_test, verbose=1)
scores[1]



0.82428

The test accuracy is about 0.82.

# IMDb sentiment analysis with a Keras LSTM model

The following code uses LSTM(32) to build an LSTM layer with 32 neurons.

In [64]:
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

In [65]:
model = Sequential()

In [66]:
model.add(Embedding(output_dim=32, input_dim=3800, input_length=380))
model.add(Dropout(0.2))

In [67]:
model.add(LSTM(32))

In [68]:
model.add(Dense(units=256, activation='relu'))
model.add(Dropout(0.2))

In [69]:
model.add(Dense(units=1, activation='sigmoid'))

In [70]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 380, 32)           121600    
_________________________________________________________________
dropout_7 (Dropout)          (None, 380, 32)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_7 (Dense)              (None, 256)               8448      
_________________________________________________________________
dropout_8 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 257       
Total params: 138,625
Trainable params: 138,625
Non-trainable params: 0
_________________________________________________________________
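
An LSTM layer has four weight blocks (input gate, forget gate, output gate, and candidate cell state), each the size of a simple recurrent layer, which accounts for the 8320 parameters in the summary. The sketch below checks that figure and the total; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # four gate/candidate blocks, each with input weights, recurrent weights and a bias
    return 4 * (input_dim * units + units * units + units)

def dense_params(inputs, units):
    return inputs * units + units   # weights plus one bias per unit

embedding = 3800 * 32               # embedding_4
lstm = lstm_params(32, 32)          # lstm_1
hidden = dense_params(32, 256)      # dense_7
output = dense_params(256, 1)       # dense_8
print(embedding, lstm, hidden, output)        # 121600 8320 8448 257
print(embedding + lstm + hidden + output)     # 138625
```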


Train the model

In [71]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [72]:
train_history = model.fit(x_train, y_train, batch_size=100, epochs=10, verbose=2, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 47s - loss: 0.5168 - acc: 0.7394 - val_loss: 0.4622 - val_acc: 0.7732
Epoch 2/10
 - 47s - loss: 0.2899 - acc: 0.8784 - val_loss: 0.4057 - val_acc: 0.8134
Epoch 3/10
 - 46s - loss: 0.2312 - acc: 0.9105 - val_loss: 0.4007 - val_acc: 0.8174
Epoch 4/10
 - 46s - loss: 0.2060 - acc: 0.9220 - val_loss: 0.4735 - val_acc: 0.8082
Epoch 5/10
 - 46s - loss: 0.1799 - acc: 0.9315 - val_loss: 0.4200 - val_acc: 0.8626
Epoch 6/10
 - 47s - loss: 0.1585 - acc: 0.9420 - val_loss: 0.6139 - val_acc: 0.8138
Epoch 7/10
 - 48s - loss: 0.1413 - acc: 0.9496 - val_loss: 0.7020 - val_acc: 0.7872
Epoch 8/10
 - 46s - loss: 0.1221 - acc: 0.9561 - val_loss: 0.4344 - val_acc: 0.8590
Epoch 9/10
 - 46s - loss: 0.1108 - acc: 0.9599 - val_loss: 0.5705 - val_acc: 0.8154
Epoch 10/10
 - 46s - loss: 0.1116 - acc: 0.9592 - val_loss: 0.5445 - val_acc: 0.8402


Evaluate model accuracy

In [73]:
scores = model.evaluate(x_test, y_test, verbose=1)
scores[1]



0.85604

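With the LSTM model the test accuracy rises to about 0.856, above the SimpleRNN model (about 0.82) and marginally above the larger multilayer perceptron (about 0.855).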