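Sentiment analysis, also known as opinion mining, uses natural language processing, text analysis, and related methods to identify an author's attitude, sentiment, evaluation, or emotion toward a topic. Its commercial value is that a company can learn early how customers perceive the company or its products, and adjust its sales strategy accordingly.

IMDb is an online database of movie-related information that has accumulated a large amount of movie data. The IMDb dataset contains 50,000 movie reviews, 25,000 in the training set and 25,000 in the test set, and each review is labeled as either positive or negative.
We want to build a model that can tell from the review text whether a review is positive or negative.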

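# Introduction to natural language processing with Keras

The steps for processing IMDb review text with Keras are as follows.

## Reading the IMDb dataset

The IMDb dataset is split into training data and test data. In the order they are loaded here, the first half of both the training set and the test set consists of positive reviews and the second half of negative reviews.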

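## Building the token

When building a deep learning model, we must convert the review text into lists of numbers. Just as in language translation, we need a dictionary that maps words to numbers. Keras provides the Tokenizer module, which works like such a dictionary. The token is built as follows:

$\bullet$ When building the token, you must specify the size of the dictionary, for example a 2,000-word dictionary.

$\bullet$ Read the 25,000 training reviews, rank every word by how often it occurs across all reviews, and put the 2,000 most frequent English words into the dictionary. We can call this the "common-word dictionary" of the reviews.

$\bullet$ Use this dictionary to convert the review text. A word that does not appear in the dictionary is not converted and is simply left out, so only words found in the common-word dictionary are kept. The rationale is that common words influence the prediction target more, while rare words influence it less.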
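
The three steps above can be sketched in plain Python. This is only an illustration of the idea, not the Keras `Tokenizer` implementation: the helper names `build_word_index` and `to_sequence`, the toy reviews, and the simple regular-expression word splitting are all assumptions made for this sketch, and the exact cutoff convention Keras applies for `num_words` may differ slightly.

```python
import re
from collections import Counter

WORD_PATTERN = re.compile(r"[a-z0-9']+")  # simplified word splitting (assumption)

def build_word_index(texts, num_words):
    """Rank words by frequency and keep the num_words most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_PATTERN.findall(text.lower()))
    # Index 0 is left unused, so the most frequent word gets index 1.
    ranked = counts.most_common(num_words)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index):
    """Convert text to a list of indices, skipping out-of-dictionary words."""
    words = WORD_PATTERN.findall(text.lower())
    return [word_index[w] for w in words if w in word_index]

# Toy data standing in for the 25,000 training reviews.
toy_reviews = ["the movie was good", "the movie was bad", "the movie", "the"]
word_index = build_word_index(toy_reviews, num_words=3)
print(word_index)        # {'the': 1, 'movie': 2, 'was': 3}

sequence_example = to_sequence("the plot was good", word_index)
print(sequence_example)  # [1, 3]
```

In the toy run, `plot` and `good` are not among the three most frequent words, so they are dropped from the converted list.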

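## Converting review text to number lists with the token

## Truncating and padding so every number list has length 100

Because reviews vary in length, the converted number lists vary in length too, but the deep learning model must be trained on inputs of a fixed length. Here we fix the length at 100. A list shorter than 100 is padded with 0s at the front; a list longer than 100 keeps only its last 100 numbers and the earlier ones are discarded.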
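
The rule just described (pad at the front, keep the tail) can be written out as a small pure-Python sketch. The helper name `pad_to_length` is hypothetical and this is not how Keras implements `pad_sequences`; it only mirrors the behavior described above, using a length of 5 instead of 100 to keep the example readable.

```python
def pad_to_length(seq, maxlen=100, value=0):
    """Pad short lists at the front; keep only the last maxlen items of long lists."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]                      # drop the earliest items
    return [value] * (maxlen - len(seq)) + seq    # fill the front with zeros

short_result = pad_to_length([5, 6, 7], maxlen=5)
long_result = pad_to_length([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_result)  # [0, 0, 5, 6, 7]
print(long_result)   # [3, 4, 5, 6, 7]
```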

## Converting number lists to vector lists with an embedding layer
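
As a rough illustration of what this step does, an embedding can be thought of as a lookup table with one vector per dictionary index, so that a list of numbers becomes a list of vectors. The sketch below is an assumption-laden toy: the helper names, the vector size of 32, and the random initial values are chosen only for illustration, and in a real embedding layer the table entries are adjusted during training.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """Create one small random vector per dictionary index (toy initialization)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(sequence, table):
    """Replace every index in the sequence with its vector from the table."""
    return [table[i] for i in sequence]

table = make_embedding_table(vocab_size=2000, dim=32)  # dim=32 is an assumption
vectors = embed([0, 0, 1238, 41, 3], table)
print(len(vectors), len(vectors[0]))  # 5 32
```

The same index always maps to the same vector, so the two padding zeros at the front of the example share one vector.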

## Feeding the vector lists into a deep learning model for training

# Downloading the IMDb dataset

In [1]:
import urllib.request
import os
import tarfile

In [3]:
url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
filepath = 'aclImdb_v1.tar.gz'
if not os.path.isfile(filepath):
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded:', result)

downloaded: ('aclImdb_v1.tar.gz', <http.client.HTTPMessage object at 0x7f0d700ef240>)


In [5]:
if not os.path.exists('aclImdb'):  # extract only if the target directory does not exist yet
    with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tfile:  # open the archive
        tfile.extractall()  # extract into the current directory

# Reading the IMDb data

Import the required modules

In [7]:
from keras.preprocessing import sequence  # sequence is used to truncate and pad the number lists
from keras.preprocessing.text import Tokenizer  # Tokenizer is used to build the dictionary

Create the rm_tags function to remove HTML tags from the text

In [8]:
import re  # regular expression module
def rm_tags(text):  # rm_tags takes the review text as input
    re_tag = re.compile(r'<[^>]+>')  # compiled pattern '<[^>]+>' that matches an HTML tag
    return re_tag.sub('', text)  # replace every match in text with an empty string
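
A quick way to see what `rm_tags` does is to run it on a short made-up string. The sample text below is invented for illustration; the function body repeats the definition above so the snippet runs on its own.

```python
import re

def rm_tags(text):
    """Remove anything that looks like an HTML tag, e.g. <br />."""
    return re.compile(r'<[^>]+>').sub('', text)

sample = "Great cast.<br /><br />Weak ending."  # made-up review fragment
cleaned = rm_tags(sample)
print(cleaned)  # Great cast.Weak ending.
```

Because each tag is replaced with an empty string rather than a space, the sentences on either side of a tag end up joined together, which is probably why fragments such as `performance.The` appear in the cleaned reviews.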

In [9]:
import os
def read_files(filetype):
    # filetype is 'train' for the training data or 'test' for the test data
    path = 'aclImdb/'  # root directory of the extracted dataset
    file_list = []  # paths of all review files

    positive_path = path + filetype + '/pos/'  # directory holding the positive reviews
    for f in os.listdir(positive_path):  # add every file under positive_path to file_list
        file_list += [positive_path + f]

    negative_path = path + filetype + '/neg/'  # directory holding the negative reviews
    for f in os.listdir(negative_path):  # add every file under negative_path to file_list
        file_list += [negative_path + f]

    print('read', filetype, 'files:', len(file_list))  # number of files found for this filetype

    # The first 12500 items are positive (label 1), the last 12500 are negative (label 0)
    all_labels = ([1] * 12500 + [0] * 12500)

    all_texts = []  # review texts, in the same order as file_list

    # Open each file in file_list, read its lines, join them with spaces,
    # strip the HTML tags with rm_tags, and append the result to all_texts
    for fi in file_list:
        with open(fi, encoding='utf8') as file_input:
            all_texts += [rm_tags(' '.join(file_input.readlines()))]

    return all_labels, all_texts
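
`read_files` hard-codes 12,500 labels per class, which matches this dataset. A possible variant, sketched below under stated assumptions, derives the labels from the files actually found. The helper name `list_labeled_files` and the temporary demo directory are hypothetical. Note also that this sketch sorts the file names, so its file order can differ from the plain `os.listdir` order that the notebook outputs depend on.

```python
import os
import tempfile

def list_labeled_files(root, filetype):
    """Return (labels, paths), with labels derived from the files that exist."""
    labels, paths = [], []
    for label, sub in ((1, 'pos'), (0, 'neg')):
        folder = os.path.join(root, filetype, sub)
        names = sorted(os.listdir(folder))  # sorted for a reproducible order
        paths += [os.path.join(folder, name) for name in names]
        labels += [label] * len(names)
    return labels, paths

# Tiny stand-in directory tree so the sketch runs without the real dataset.
demo_root = tempfile.mkdtemp()
for sub, count in (('pos', 2), ('neg', 3)):
    folder = os.path.join(demo_root, 'train', sub)
    os.makedirs(folder)
    for i in range(count):
        with open(os.path.join(folder, f'{i}_0.txt'), 'w', encoding='utf8') as fh:
            fh.write('demo review')

demo_labels, demo_paths = list_labeled_files(demo_root, 'train')
print(demo_labels)  # [1, 1, 0, 0, 0]
```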
        

Read the training data

In [10]:
y_train, train_text = read_files('train')

read train files: 25000


Read the test data

In [11]:
y_test, test_text = read_files('test')

read test files: 25000


# Inspecting the IMDb data

View the text of review 0

In [12]:
train_text[0]

"Yeah, it's a chick flick and it moves kinda slow, but it's actually pretty good - and I consider myself a manly man. You gotta love Judy Davis, no matter what she's in, and the girl who plays her daughter gives a natural, convincing performance.The scenery of the small, coastal summer spot is beautiful and plays well with the major theme of the movie. The unknown (at least unknown to me) actors and actresses lend a realism to the movie that draws you in and keeps your attention. Overall, I give it an 8/10. Go see it."

The label of review 0 is 1, which means it is a positive review

In [13]:
y_train[0]

1

In [14]:
test_text[0]

"This is one of the best presentations of the 60's put on film. Arthur Penn, director of Bonnie and Clyde and Little Big Man, saw that Steve Tesich's outstanding script rang with truth, and from these two talents comes solid cinema. Jodi Thelin's Georgia Miles gives male viewers a hit of pained nostalgia for the archetypal beauty who is almost within our grasps, but, always just out of reach. Just see it, or you cinematic education will be incomplete."

In [15]:
y_test[12501]

0

# Building the token

Create the token with a 2,000-word dictionary and fit it on the training text

In [16]:
token = Tokenizer(num_words=2000)
token.fit_on_texts(train_text)

Check how many documents the token has read

In [17]:
print(token.document_count)

25000


In [18]:
print(token.word_index)



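From the output we can see that `the` occurs most often, followed by `and`. We will later use this dictionary to convert English words to numbers: `the` becomes 1, `and` becomes 2, `a` becomes 3, and so on.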
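
To make the word-to-number mapping concrete, the sketch below inverts a tiny dictionary and turns a number list back into words. The helper name `decode` is hypothetical, and `toy_word_index` contains only the three mappings mentioned above rather than the full `token.word_index`.

```python
def decode(sequence, word_index):
    """Map each index back to its word; unknown indices become '?'."""
    index_word = {index: word for word, index in word_index.items()}
    return ' '.join(index_word.get(i, '?') for i in sequence)

toy_word_index = {'the': 1, 'and': 2, 'a': 3}  # subset used for illustration
decoded = decode([3, 2, 1], toy_word_index)
print(decoded)  # a and the
```

Decoding a converted review this way is a simple check of which words survived the conversion and which were dropped as out-of-dictionary.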

# Converting review text to number lists with the token

Use token.texts_to_sequences to convert the review text of both the training data and the test data into number lists

In [19]:
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)

In [20]:
print(train_text[0])

Yeah, it's a chick flick and it moves kinda slow, but it's actually pretty good - and I consider myself a manly man. You gotta love Judy Davis, no matter what she's in, and the girl who plays her daughter gives a natural, convincing performance.The scenery of the small, coastal summer spot is beautiful and plays well with the major theme of the movie. The unknown (at least unknown to me) actors and actresses lend a realism to the movie that draws you in and keeps your attention. Overall, I give it an 8/10. Go see it.


In [21]:
print(x_train_seq[0])

[1238, 41, 3, 504, 2, 8, 1094, 1926, 546, 17, 41, 161, 180, 48, 2, 9, 1126, 542, 3, 128, 21, 115, 1711, 53, 547, 47, 437, 7, 2, 1, 246, 33, 294, 37, 573, 405, 3, 1244, 1074, 235, 1, 1378, 4, 1, 388, 1496, 1460, 6, 303, 2, 294, 69, 15, 1, 673, 752, 4, 1, 16, 1, 1851, 29, 218, 1851, 5, 68, 152, 2, 1502, 3, 1877, 5, 1, 16, 11, 21, 7, 2, 937, 125, 687, 442, 9, 198, 8, 31, 708, 160, 136, 63, 8]


# Making the converted number lists the same length

In [22]:
x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test = sequence.pad_sequences(x_test_seq, maxlen=100)

In [26]:
print('before pad_sequences length=', len(x_train_seq[1]))
print(x_train_seq[1])

before pad_sequences length= 684
[636, 969, 5, 108, 50, 9, 10, 233, 8, 12, 57, 5, 1, 197, 9, 215, 10, 18, 14, 1, 82, 54, 7, 19, 1, 517, 310, 1090, 19, 59, 395, 36, 10, 42, 4, 1037, 14, 157, 9, 89, 8, 3, 209, 5, 102, 1, 432, 197, 154, 97, 4, 94, 129, 5, 93, 248, 9, 65, 1964, 263, 141, 9, 127, 965, 1, 309, 11, 537, 13, 57, 510, 1473, 30, 35, 1, 2, 91, 9, 100, 28, 4, 1, 785, 11, 93, 10, 197, 1216, 1, 401, 624, 943, 5, 226, 2, 1, 4, 8, 28, 204, 176, 1, 224, 381, 291, 124, 129, 70, 97, 4, 1, 404, 442, 1, 16, 183, 3, 51, 150, 627, 700, 309, 2, 1, 11, 28, 11, 741, 1, 82, 61, 3, 185, 560, 2, 23, 318, 33, 1672, 79, 10, 269, 5, 74, 3, 986, 35, 1, 539, 34, 26, 96, 19, 23, 1792, 26, 3, 105, 440, 2, 511, 513, 83, 980, 631, 35, 1211, 46, 6, 3, 1299, 129, 23, 11, 9, 254, 1134, 242, 943, 445, 2, 144, 523, 91, 170, 130, 22, 693, 152, 275, 144, 67, 143, 1, 329, 61, 6, 1, 780, 4, 3, 150, 128, 822, 11, 43, 1672, 129, 5, 1084, 23, 242, 8, 60, 13, 26, 6, 30, 412, 115, 26, 182, 5, 24, 254, 609, 29, 3, 715, 5

In [27]:
print('after pad_sequences length=', len(x_train[1]))
print(x_train[1])

after pad_sequences length= 100
[ 352    9   58  131    1  835   61    6  114   17    9   36    1   87
   84    8  162   68 1827   34   72   51  541 1172   14  185  447    2
   44  330    3  693  185  333   41  470  131   41    5  647    1  197
    7  656   44   21   66    1  232   18    4   10  197  962 1326    6
  430    1    9  100    1   82  339  467  104  545   14    1   29  207
 1497 1783   22    1  114   17   44   21   36   97    4   94   21  140
  102   94   28   29  218  276  486  238   25  141  107   49  207    5
  102  125]


In [28]:
print('before pad_sequences length=', len(x_train_seq[0]))
print(x_train_seq[0])

before pad_sequences length= 91
[1238, 41, 3, 504, 2, 8, 1094, 1926, 546, 17, 41, 161, 180, 48, 2, 9, 1126, 542, 3, 128, 21, 115, 1711, 53, 547, 47, 437, 7, 2, 1, 246, 33, 294, 37, 573, 405, 3, 1244, 1074, 235, 1, 1378, 4, 1, 388, 1496, 1460, 6, 303, 2, 294, 69, 15, 1, 673, 752, 4, 1, 16, 1, 1851, 29, 218, 1851, 5, 68, 152, 2, 1502, 3, 1877, 5, 1, 16, 11, 21, 7, 2, 937, 125, 687, 442, 9, 198, 8, 31, 708, 160, 136, 63, 8]


In [29]:
print('after pad_sequences length=', len(x_train[0]))
print(x_train[0])

after pad_sequences length= 100
[   0    0    0    0    0    0    0    0    0 1238   41    3  504    2
    8 1094 1926  546   17   41  161  180   48    2    9 1126  542    3
  128   21  115 1711   53  547   47  437    7    2    1  246   33  294
   37  573  405    3 1244 1074  235    1 1378    4    1  388 1496 1460
    6  303    2  294   69   15    1  673  752    4    1   16    1 1851
   29  218 1851    5   68  152    2 1502    3 1877    5    1   16   11
   21    7    2  937  125  687  442    9  198    8   31  708  160  136
   63    8]
