# The IMDb Movie Review Dataset and Natural Language Processing

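IMDb (Internet Movie Database) is an online database of film-related information. It launched in 1990, has been an Amazon subsidiary since 1998, and now holds data on more than 4 million titles. Website: http://www.imdb.com/

The IMDb dataset contains 50,000 "review texts", split into 25,000 for training and 25,000 for testing. Each review text is labeled as either a "positive review" or a "negative review".

The goal is to build a model that, after training on a large number of review texts, can predict whether a given review text is a positive or a negative review.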

## 1 Introduction to Natural Language Processing with Keras

The steps for processing IMDb review text with Keras are:

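1) Read the IMDb dataset

The training data has 25,000 items: items 0~12499 are positive reviews and items 12500~24999 are negative reviews.

The test data has 25,000 items: items 0~12499 are positive reviews and items 12500~24999 are negative reviews.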

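2) Build the token dictionary

A deep learning model accepts only numbers, so each "review text" must be converted into a "number list". Converting text to numbers requires a dictionary, and Keras provides the Tokenizer module for this purpose. When building the token you specify the dictionary size, for example 2,000 words. Words are ranked by how often they appear across all reviews, and the 2,000 highest-ranked English words go into the dictionary. A word that is not in the dictionary is skipped during conversion.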

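3) Use the token to convert "review text" into "number lists"

4) Truncate or pad every "number list" to a length of 100

Review texts vary in length, so the resulting number lists do too. Because the number lists are later turned into "vector lists" and fed to a deep learning model for training, their length must be fixed. The method is to truncate long lists and pad short ones. With a target length of 100, a number list of length 59 gets 41 zeros added at the front, and a number list of length 126 has its first 26 numbers removed.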

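5) The Embedding layer converts "number lists" into "vector lists"

Word embedding is a natural language processing technique that maps words to vectors in a multi-dimensional space, where words with similar meanings end up close to each other. The token numbers themselves carry no semantic relationship, so each one must be converted into a vector before relationships between words can be represented.

Keras provides an Embedding layer that converts "number lists" into "vector lists". In other words, "review text" is first converted into a "number list" and then into a "vector list".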

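6) Feed the "vector lists" into a deep learning model for training

Candidate deep learning models include the multilayer perceptron (MLP), recurrent neural network (RNN), long short-term memory (LSTM), and convolutional neural network (CNN).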
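
To make step 5 concrete, the following is a minimal sketch of what an embedding lookup does, written in plain Python rather than with the Keras Embedding layer. The names `embedding_table` and `embed` and the 4-dimensional vector size are illustrative assumptions, and the random values stand in for weights that a real layer would learn during training.

```python
import random

random.seed(0)

vocab_size = 2000      # matches the dictionary size used in this walkthrough
embedding_dim = 4      # hypothetical; kept tiny so the vectors are easy to inspect

# One row of embedding_dim floats per token id.
# A trained Embedding layer would hold learned values here, not random ones.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(number_list):
    """Replace each token id in a number list with its row of the table."""
    return [embedding_table[token_id] for token_id in number_list]

number_list = [0, 0, 308, 6, 3]     # a short, already padded number list
vector_list = embed(number_list)

print(len(vector_list), len(vector_list[0]))   # 5 4
```

Each number becomes one vector, so a number list of length 100 would become a vector list of 100 rows. Identical token ids always map to the same vector.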

## 2 Downloading the IMDb Dataset

In [79]:
# Import the required modules
import urllib.request
import os
import tarfile

In [80]:
# Download the IMDb dataset
# url="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
# filepath="data/aclImdb_v1.tar.gz"
# if not os.path.isfile(filepath):
#     result=urllib.request.urlretrieve(url,filepath)
#     print('downloaded:',result)

In [81]:
# Extract the downloaded file
# if not os.path.exists("data/aclImdb"):
#     tfile = tarfile.open("data/aclImdb_v1.tar.gz",'r:gz')
#     result=tfile.extractall('data/')

## 3 Reading the IMDb Data

In [82]:
# Import the required modules
from keras.datasets import imdb
from keras.preprocessing import sequence            # used to truncate or pad sequences
from keras.preprocessing.text import Tokenizer      # used to build the dictionary

In [83]:
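# Define rm_tags to remove HTML tags from the text
import re                                  # import the regular expression module
def rm_tags(text):                        # rm_tags takes the review text as its argument
    re_tag = re.compile(r'<[^>]+>')        # compiled pattern that matches any HTML tag
    return re_tag.sub('',text)             # replace every match in text with an empty string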
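
As a quick check of `rm_tags`, the sketch below applies the same regular expression to a made-up review string (the sample text is an assumption, not taken from the dataset). It also shows a side effect worth knowing: because tags are replaced with an empty string, the words on either side of a tag end up adjacent. `rm_tags_spaced` is a hypothetical variant, not part of the original notebook, that substitutes a space instead.

```python
import re

TAG_PATTERN = re.compile(r'<[^>]+>')

def rm_tags(text):
    """Same behavior as the notebook's rm_tags: drop tags entirely."""
    return TAG_PATTERN.sub('', text)

def rm_tags_spaced(text):
    """Hypothetical variant: replace each tag with a space to keep words apart."""
    return TAG_PATTERN.sub(' ', text)

sample = 'Great film.<br /><br />A <i>must</i> see.'

print(rm_tags(sample))                  # Great film.A must see.
print(rm_tags_spaced(sample).split())   # words stay separated
```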

In [84]:
# Define read_files to read the IMDb file directories
import os
def read_files(filetype):           # pass "train" to read the training data, "test" to read the test data
    path = "data/aclImdb/"
    file_list=[]
    
    positive_path=path+filetype+"/pos/"        # directory of positive reviews
    for f in os.listdir( positive_path):      # add every file under positive_path to file_list
        file_list+=[ positive_path+f]
    
    negative_path=path+filetype+"/neg/"        # directory of negative reviews
    for f in os.listdir( negative_path):      # add every file under negative_path to file_list
        file_list+=[ negative_path+f]
    
    print('read',filetype,'files:',len(file_list))   # show how many files were found for this filetype ("train" or "test")
    
    all_labels = ([1]*12500+[0]*12500)   # first 12500 items are positive (label 1); last 12500 are negative (label 0)
    
    all_texts = []
    for fi in file_list:                # read every file
        with open(fi,encoding='utf8') as file_input:         # open the file
            # Read the lines with file_input.readlines(), join them into one string,
            # strip HTML tags with rm_tags, and append the result to all_texts
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels,all_texts
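
`read_files` hard-codes 12500 labels per class, which is correct for the full dataset but would silently mislabel a partial copy. The following is a sketch of one alternative that derives each label from the folder a file was found in. The helper name `read_labeled_files` and the temporary directory that stands in for `data/aclImdb/train` are assumptions for illustration. It also sorts the file names, which is an added choice and means the item order may differ from what `read_files` returns.

```python
import os
import tempfile

def read_labeled_files(root):
    """Read root/pos and root/neg, labeling each file by its folder."""
    labels, texts = [], []
    for label, sub in ((1, "pos"), (0, "neg")):
        folder = os.path.join(root, sub)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf8") as fh:
                texts.append(fh.read())
            labels.append(label)       # label count always matches file count
    return labels, texts

# Build a tiny stand-in directory tree so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as root:
    for sub, reviews in (("pos", ["great film", "loved it"]), ("neg", ["dull"])):
        os.makedirs(os.path.join(root, sub))
        for i, review in enumerate(reviews):
            with open(os.path.join(root, sub, f"{i}.txt"), "w", encoding="utf8") as fh:
                fh.write(review)
    demo_labels, demo_texts = read_labeled_files(root)

print(demo_labels)   # [1, 1, 0]
print(demo_texts)    # ['great film', 'loved it', 'dull']
```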

In [85]:
# Read the IMDb dataset directories
y_train,train_text=read_files("train")      # read the training data

read train files: 25000


In [86]:
y_test,test_text=read_files("test")    # read the test data

read test files: 25000


## 4 Inspecting the IMDb Data

In [87]:
# View "review text" item 0
train_text[0]

'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'

In [88]:
# View its label
y_train[0]

1

In [89]:
# View "review text" item 12501
train_text[12501]

"Airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman Philip Stevens (James Stewart) who is flying them & a bunch of VIP's to his estate in preparation of it being opened to the public as a museum, also on board is Stevens daughter Julie (Kathleen Quinlan) & her son. The luxury jetliner takes off as planned but mid-air the plane is hi-jacked by the co-pilot Chambers (Robert Foxworth) & his two accomplice's Banker (Monte Markham) & Wilson (Michael Pataki) who knock the passengers & crew out with sleeping gas, they plan to steal the valuable cargo & land on a disused plane strip on an isolated island but while making his descent Chambers almost hits an oil rig in the Ocean & loses control of the plane sending it crashing into the sea where it sinks to the bottom right bang in the middle of the Bermuda Triangle. With air in short supply, water leaking in & having flown over 200 miles off course the problems mount for 

In [90]:
y_train[12501]

0

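## 5 Building the Token

In [91]:
# Build the token with Tokenizer; num_words=2000 keeps only the most frequent words (token ids below 2000)
token = Tokenizer(num_words=2000)
# Read all training reviews and rank words by how often they appear; the top-ranked English words form the dictionary
token.fit_on_texts(train_text)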

In [92]:
# Check how many reviews the token has read
print(token.document_count)

25000
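
The frequency ranking that `fit_on_texts` performs can be approximated with `collections.Counter`. This is a simplified sketch under stated assumptions, not the Keras implementation: it splits on whitespace only, the three-review corpus is made up, and `num_words = 3` is a hypothetical limit chosen to keep the output small. It follows the convention, as far as I understand the Keras behavior, that only token ids below `num_words` are used, so a limit of 2000 would leave 1,999 usable words.

```python
from collections import Counter

texts = ["good movie", "good movie plot", "good"]   # toy corpus (assumption)
num_words = 3                                        # hypothetical dictionary limit

# Count how often each word appears across all texts.
counts = Counter(word for text in texts for word in text.lower().split())

# The most frequent word gets id 1; id 0 is left free for padding.
word_index = {
    word: rank
    for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

# Only ids below num_words take part in the later text-to-number conversion.
kept = {word: idx for word, idx in word_index.items() if idx < num_words}

print(word_index)   # {'good': 1, 'movie': 2, 'plot': 3}
print(kept)         # {'good': 1, 'movie': 2}
```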


In [93]:
# View the token.index_word attribute
print(token.index_word)



## 6 Using the Token to Convert "Review Text" into "Number Lists"

In [94]:
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)

In [95]:
print(train_text[0])

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!


In [96]:
print(x_train_seq[0])

[308, 6, 3, 1068, 208, 8, 29, 1, 168, 54, 13, 45, 81, 40, 391, 109, 137, 13, 57, 149, 7, 1, 481, 68, 5, 260, 11, 6, 72, 5, 631, 70, 6, 1, 5, 1, 1530, 33, 66, 63, 204, 139, 64, 1229, 1, 4, 1, 222, 899, 28, 68, 4, 1, 9, 693, 2, 64, 1530, 50, 9, 215, 1, 386, 7, 59, 3, 1470, 798, 5, 176, 1, 391, 9, 1235, 29, 308, 3, 352, 343, 142, 129, 5, 27, 4, 125, 1470, 5, 308, 9, 532, 11, 107, 1466, 4, 57, 554, 100, 11, 308, 6, 226, 47, 3, 11, 8, 214]
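
The number list above is shorter than the review it came from because words outside the dictionary are dropped rather than mapped to a placeholder. The sketch below imitates that conversion with a hand-written `word_index` and a hypothetical `text_to_sequence` helper. Both are assumptions for illustration, and the whitespace-only split is simpler than what Tokenizer does.

```python
# Hand-written dictionary, ranked from most to least frequent (assumption).
word_index = {"the": 1, "movie": 2, "was": 3, "good": 4, "bad": 5}
num_words = 5   # hypothetical limit: only ids below 5 are used

def text_to_sequence(text):
    """Convert text to token ids, skipping words that are unknown or ranked too low."""
    sequence = []
    for word in text.lower().split():
        idx = word_index.get(word)
        if idx is not None and idx < num_words:
            sequence.append(idx)
    return sequence

print(text_to_sequence("The movie was surprisingly good"))   # [1, 2, 3, 4]
print(text_to_sequence("The movie was bad"))                 # [1, 2, 3]
```

"surprisingly" is missing from the dictionary and "bad" has an id that is not below `num_words`, so both are skipped.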


## 7 Making the Converted Number Lists the Same Length

In [98]:
# Make every number list exactly 100 items long
x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test = sequence.pad_sequences(x_test_seq,maxlen=100)

In [99]:
# Show number list 0 before pad_sequences
print('before pad_sequences length=',len(x_train_seq[0]))
print(x_train_seq[0])

before pad_sequences length= 106
[308, 6, 3, 1068, 208, 8, 29, 1, 168, 54, 13, 45, 81, 40, 391, 109, 137, 13, 57, 149, 7, 1, 481, 68, 5, 260, 11, 6, 72, 5, 631, 70, 6, 1, 5, 1, 1530, 33, 66, 63, 204, 139, 64, 1229, 1, 4, 1, 222, 899, 28, 68, 4, 1, 9, 693, 2, 64, 1530, 50, 9, 215, 1, 386, 7, 59, 3, 1470, 798, 5, 176, 1, 391, 9, 1235, 29, 308, 3, 352, 343, 142, 129, 5, 27, 4, 125, 1470, 5, 308, 9, 532, 11, 107, 1466, 4, 57, 554, 100, 11, 308, 6, 226, 47, 3, 11, 8, 214]


In [100]:
# Show number list 0 after pad_sequences
print('after pad_sequences length=',len(x_train[0]))
print(x_train[0])

after pad_sequences length= 100
[  29    1  168   54   13   45   81   40  391  109  137   13   57  149
    7    1  481   68    5  260   11    6   72    5  631   70    6    1
    5    1 1530   33   66   63  204  139   64 1229    1    4    1  222
  899   28   68    4    1    9  693    2   64 1530   50    9  215    1
  386    7   59    3 1470  798    5  176    1  391    9 1235   29  308
    3  352  343  142  129    5   27    4  125 1470    5  308    9  532
   11  107 1466    4   57  554  100   11  308    6  226   47    3   11
    8  214]


In [103]:
# Show number list 6 before pad_sequences
print('before pad_sequences length=',len(x_train_seq[6]))
print(x_train_seq[6])

before pad_sequences length= 88
[418, 90, 31, 494, 5, 93, 3, 547, 1779, 706, 1, 61, 7, 323, 133, 21, 88, 56, 1493, 8, 1444, 474, 235, 30, 1691, 1, 7, 1, 18, 66, 302, 1739, 2, 66, 238, 85, 72, 21, 353, 1, 18, 186, 1, 110, 6, 51, 1724, 1, 16, 148, 1639, 21, 2, 127, 21, 191, 5, 397, 21, 1531, 1, 459, 6, 48, 357, 4, 5, 4, 835, 2, 6, 48, 51, 323, 301, 54, 102, 44, 21, 22, 263, 5, 141, 2, 838, 3, 342, 61]


In [104]:
# Show number list 6 after pad_sequences
print('after pad_sequences length=',len(x_train[6]))
print(x_train[6])

after pad_sequences length= 100
[   0    0    0    0    0    0    0    0    0    0    0    0  418   90
   31  494    5   93    3  547 1779  706    1   61    7  323  133   21
   88   56 1493    8 1444  474  235   30 1691    1    7    1   18   66
  302 1739    2   66  238   85   72   21  353    1   18  186    1  110
    6   51 1724    1   16  148 1639   21    2  127   21  191    5  397
   21 1531    1  459    6   48  357    4    5    4  835    2    6   48
   51  323  301   54  102   44   21   22  263    5  141    2  838    3
  342   61]
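
The two cases above (item 0 cut from 106 to 100 numbers, item 6 padded from 88 to 100) both act on the front of the list, which matches the default `padding='pre'` and `truncating='pre'` settings of `pad_sequences`. The following pure-Python sketch reproduces that behavior on short lists. `pad_pre` is a hypothetical helper written for illustration and assumes `maxlen` is greater than zero.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last maxlen items, then left-pad with value up to maxlen."""
    seq = list(seq)[-maxlen:]                       # drop items from the front if too long
    return [value] * (maxlen - len(seq)) + seq      # add padding at the front if too short

print(pad_pre([1, 2, 3], 5))                # [0, 0, 1, 2, 3]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))    # [3, 4, 5, 6, 7]
```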
