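# Sentiment Classification on the IMDB Movie Review Dataset with an LSTM
### Data Science and Big Data Technology, Class 2&emsp;廖寒曦
#### Run locally on CPU: M1&emsp;python v3.8.6&emsp;tensorflow v2.4.0. Training ran for about two hours before the stopping condition was met and training exited; accuracy on the test set was 0.805. Adjusting the hyperparameters can shorten the training time.
#### Results:
##### INFO:tensorflow:Best Accuracy: 0.805<br/>INFO:tensorflow:Testing Accuracy not improved over 3 epochs, Early Stop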

## 1. Environment Setup and Data Preprocessing
#### (1) Import the required modules

In [14]:
#coding: utf-8
'''
Filename: Assignment_RNN.ipynb
Author  : 廖寒曦
Time    : 2022/04/02 09:59:30
'''
import os
import warnings
warnings.filterwarnings("ignore")
import tensorflow as tf
import numpy as np
import pprint
import logging
import time
from collections import Counter
from pathlib import Path
from tqdm import tqdm

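#### (2) Data preprocessing
Load the IMDB data from the keras package. It is a nested list: each sublist is one review, and each element of a sublist is the index of a word in the corresponding vocabulary, so the word indices need to be converted back to words.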

In [15]:
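# Load the IMDB review dataset from the keras package
(x_train, y_train),(x_test, y_test) = tf.keras.datasets.imdb.load_data()

print("Training set size: {}".format(len(x_train)))
print("Test set size: {}".format(len(x_test)))

# Load the original word-to-index table of the IMDB dataset
_word2idx = tf.keras.datasets.imdb.get_word_index()
# Shift every index up by three to make room for the special tokens
word2idx = {w:i+3 for w , i in _word2idx.items()}
# Padding token
word2idx['<pad>'] = 0
# Start-of-review token
word2idx['<start>'] = 1
# Token for unknown (out-of-vocabulary) words
word2idx['<unk>'] = 2
# Inverse table: keys are indices, values are words
idx2word = {i:w for w , i in word2idx.items() }

Training set size: 25000
Test set size: 25000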
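
The index shift described above can be tried on a toy table. This is a minimal sketch: `raw_index` is a hypothetical three-word stand-in for the table returned by `get_word_index()`, and `decode` is an illustrative helper that does not appear in the notebook.

```python
# Hypothetical stand-in for tf.keras.datasets.imdb.get_word_index();
# the real table maps each word to its frequency rank, starting at 1.
raw_index = {"the": 1, "movie": 2, "great": 3}

# Shift every index by 3 so that 0, 1 and 2 are free for the special tokens.
word2idx = {w: i + 3 for w, i in raw_index.items()}
word2idx.update({"<pad>": 0, "<start>": 1, "<unk>": 2})
idx2word = {i: w for w, i in word2idx.items()}

def decode(review):
    """Turn a list of word indices back into a space-separated string."""
    return " ".join(idx2word.get(i, "<unk>") for i in review)

encoded = [1, 4, 5, 6]  # <start>, the, movie, great
print(decode(encoded))  # prints "<start> the movie great"
```

Because every review begins with the `<start>` index, the notebook drops the first token when it writes the decoded text to disk.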


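First sort the reviews from shortest to longest and convert the word indices into readable text. Then build a counter that records how often each word occurs in the training reviews, and keep only the words that occur at least 10 times as the vocabulary; rarer words are treated as unknown later on. To avoid losing variables along the way, the training and test sets are saved to local files.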

In [16]:
# Sort the samples by review length
def sort_by_len(x,y):
    x,y = np.asarray(x),np.asarray(y)
    idx = sorted(range(len(x)),key = lambda i :len(x[i]))
    return x[idx],y[idx]
# Shortest reviews first, longest last
x_train,y_train = sort_by_len(x_train,y_train)
x_test,y_test = sort_by_len(x_test,y_test)

# Save the text data to a local file
def write_file(f_path,xs,ys):
    with open(f_path,'w',encoding='utf8') as f:
        for x,y in zip(xs,ys):
            # One sample per line: the label, a tab, then the review text decoded from the word indices
            # ([1:] drops the leading <start> token)
            f.write(str(y)+ '\t' + ' '.join([idx2word[i] for i in x][1:])+'\n')

write_file('/Users/liaohanxi/Desktop/sentiment_classfication/train.txt',x_train,y_train)
write_file('/Users/liaohanxi/Desktop/sentiment_classfication/test.txt',x_test,y_test)
counter = Counter()# Counter is a dict subclass: keys are the elements seen, values are their counts

with open ("/Users/liaohanxi/Desktop/sentiment_classfication/train.txt",encoding='utf-8') as f:
    for line in f:
        line = line.rstrip()# strip trailing whitespace, including the newline (lstrip strips the left side, strip both sides)
        label , words = line.split('\t')# the 0/1 before the tab is the label, the text after it is the review
        words = words.split(' ')# split the review into individual words
        counter.update(words)# update the word counts

words = ['<pad>'] + [w for w,freq in counter.most_common() if freq >= 10]
# most_common() returns a list of (element, count) tuples, most frequent first
# Keep every word that occurs at least 10 times
# words is our vocabulary; check its type and size
print('the type of words:',type(words))
print('the size of words (Vocab Size):',len(words))

the type of words: <class 'list'>
the size of words (Vocab Size): 20598
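
The frequency filter used to build the vocabulary can be illustrated on a tiny corpus. This is only a sketch: the three `reviews` and the threshold `MIN_FREQ = 2` are made up for the example, whereas the notebook counts words in train.txt and uses a threshold of 10.

```python
from collections import Counter

# Hypothetical miniature corpus; each string stands for one decoded review.
reviews = ["good movie good plot", "bad movie", "good acting bad plot"]

counter = Counter()
for review in reviews:
    counter.update(review.split(" "))

MIN_FREQ = 2  # illustrative value; the notebook uses 10
# Index 0 is reserved for the padding token, as in the notebook.
vocab = ["<pad>"] + [w for w, freq in counter.most_common() if freq >= MIN_FREQ]

print(vocab)                # "acting" occurs only once and is left out
print(counter["good"])      # 3
```

Words that fall below the threshold are not removed from the saved text files; they simply have no entry in the vocabulary and are mapped to the unknown id when the reviews are encoded.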


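Download a pre-trained model from the web. Here we use Stanford's GloVe word vectors (glove.6B.50d.txt, available for download online): each word is represented by a 50-dimensional vector, and the model covers 400,000 words. Matching the processed review vocabulary against the word vectors shows that 19676 of the 20598 vocabulary words already have a vector representation.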

In [17]:
Path('/Users/liaohanxi/Desktop/sentiment_classfication/vocab').mkdir(exist_ok=True)# create the directory, then save the vocabulary locally

with open('/Users/liaohanxi/Desktop/sentiment_classfication/vocab/word.txt','w',encoding='utf8') as f:
    for w in words:
        f.write(w+'\n')

word2idx = {}
with open('/Users/liaohanxi/Desktop/sentiment_classfication/vocab/word.txt',encoding='utf-8') as f:
    for i , line in enumerate(f):# enumerate yields (index, value) pairs from an iterable
        line = line.rstrip()
        word2idx[line] = i

embedding = np.zeros((len(word2idx)+1,50))# one extra row at the end for words that are not in the vocabulary (unk)

with open('/Users/liaohanxi/Desktop/sentiment_classfication/glove.6B/glove.6B.50d.txt',encoding='utf-8') as f:# read the pre-trained word vectors
    count = 0
    for i , line in enumerate(f):# i is the line number
        if i % 100000 ==0:
            print('-At line {}'.format(i))
        line = line.rstrip()
        sp = line.split(' ')
        word, vec = sp[0],sp[1:]
        if word in word2idx:
            count +=1
            embedding[word2idx[word]] = np.asarray(vec, dtype='float32')# store the word's vector in its row

# Report how many vocabulary words were found
print('[%d / %d]words have found in pre-trained values'%(count, len(word2idx)))
np.save('/Users/liaohanxi/Desktop/sentiment_classfication/vocab/word.npy',embedding)

-At line 0
-At line 100000
-At line 200000
-At line 300000
[19676 / 20598]words have found in pre-trained values
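
The matching step can be sketched without the real GloVe file. The example below assumes the same line format (a word followed by space-separated components) but uses a three-line in-memory stand-in, a toy vocabulary, and plain Python lists in place of the NumPy array used in the notebook.

```python
import io

DIM = 3  # illustrative; the notebook uses 50-dimensional vectors

# Hypothetical stand-in for glove.6B.50d.txt.
glove_file = io.StringIO(
    "good 0.1 0.2 0.3\n"
    "bad -0.1 -0.2 -0.3\n"
    "rare 0.5 0.5 0.5\n"
)

# Toy vocabulary; "zzyzx" has no pre-trained vector.
word2idx = {"<pad>": 0, "good": 1, "bad": 2, "zzyzx": 3}

# One row per vocabulary word plus a final all-zero row for unknown words.
embedding = [[0.0] * DIM for _ in range(len(word2idx) + 1)]

found = 0
for line in glove_file:
    word, *vec = line.rstrip().split(" ")
    if word in word2idx:
        found += 1
        embedding[word2idx[word]] = [float(v) for v in vec]

print("{} / {} vocabulary words found".format(found, len(word2idx)))
```

Rows for words without a pre-trained vector, as well as the padding and unknown rows, stay at zero, which is also what happens to the 922 unmatched words in the run above.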


## 2. Building the Model
#### (1) Data generator
Build a data generator. During training, each call yields one review sample together with its sentiment label.

In [5]:
def data_generator(f_path,params):
    with open(f_path,encoding='utf-8') as f:
        print('Reading',f_path)
        for line in f:
            line = line.rstrip()
            label, text = line.split('\t')# split each line into label and sentence
            text = text.split(' ')# split the text into words
            x = [params['word2idx'].get(w, len(params['word2idx'])) for w in text]# look up each word's id (unknown words get the last id); x represents one review
            # Truncate reviews that reach or exceed the preset maximum length
            if len(x) >= params['max_len']:
                x = x[:params['max_len']]
            else:
                x += [0] * (params['max_len'] - len(x))# pad shorter reviews with zeros; 0 is the id of <pad>
            y = int(label)
            yield x,y# yield produces the samples one at a time, in order
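
The encode, truncate and pad logic of the generator can be checked in isolation. This is a minimal sketch: `encode` is an illustrative helper and the three-entry `word2idx` is a toy vocabulary, neither of which appears in the notebook.

```python
def encode(text, word2idx, max_len):
    """Map tokens to ids, then truncate or right-pad to exactly max_len."""
    unk_id = len(word2idx)  # id reserved for out-of-vocabulary words
    ids = [word2idx.get(w, unk_id) for w in text.split(" ")]
    if len(ids) >= max_len:
        return ids[:max_len]
    return ids + [0] * (max_len - len(ids))  # 0 is the <pad> id

word2idx = {"<pad>": 0, "good": 1, "movie": 2}  # hypothetical toy vocabulary

print(encode("good movie", word2idx, 4))                  # [1, 2, 0, 0]
print(encode("a good movie indeed it was", word2idx, 4))  # [3, 1, 2, 3]
```

Every encoded review has the same length, which is what allows the dataset to declare a fixed output shape of `[max_len]`.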

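#### (2) Building the dataset
The interface is designed to handle both the training and the test set: if is_training is True, the generator reads from the training set; otherwise it reads from the test set.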

In [6]:
def dataset(is_training,params):# is_training selects the training set; pass False to generate test data
    _shapes = ([params['max_len']],())
    _types = (tf.int32, tf.int32)# words are emitted as integer indices

    if is_training:
        ds = tf.data.Dataset.from_generator(# the generator must be defined before from_generator can call it
            lambda: data_generator(params['train_path'],params),
            output_shapes = _shapes,
            output_types = _types,)
        ds = ds.shuffle(params['num_samples'])
        ds = ds.batch(params['batch_size'])
        ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
    else:
        ds = tf.data.Dataset.from_generator(
            lambda: data_generator(params['test_path'],params),
            output_shapes = _shapes,
            output_types = _types,)
        ds = ds.shuffle(params['num_samples'])# shuffle the samples
        ds = ds.batch(params['batch_size'])
        ds = ds.prefetch(tf.data.experimental.AUTOTUNE)# prefetch upcoming batches while the current one is processed; AUTOTUNE picks the buffer size dynamically
    return ds

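#### (3) Building the network
The model is built by subclassing tf.keras.Model so that it can be plugged in elsewhere later. The network has three bidirectional LSTM layers followed by fully connected layers. To reduce overfitting, dropout is applied before every layer.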

In [7]:
class Model(tf.keras.Model):
    def __init__(self,params):# initialization
        super().__init__()
        # Word embedding: load the pre-trained matrix and keep it frozen
        self.embedding = tf.Variable(np.load('/Users/liaohanxi/Desktop/sentiment_classfication/vocab/word.npy'),dtype=tf.float32,name='pretrained_emdedding',trainable=False,)
        # Dropout guards against overfitting by randomly zeroing a fraction (dropout_rate) of the activations during training
        self.drop1 = tf.keras.layers.Dropout(params['dropout_rate'])
        self.drop2 = tf.keras.layers.Dropout(params['dropout_rate'])
        self.drop3 = tf.keras.layers.Dropout(params['dropout_rate'])
        # RNN layers, using LSTM
        self.rnn1 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(params['rnn_units'],return_sequences=True))# units is the hidden size per direction; return_sequences=True passes the full hidden-state sequence to the next layer
        self.rnn2 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(params['rnn_units'],return_sequences=True))# wrapped in Bidirectional
        self.rnn3 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(params['rnn_units'],return_sequences=False))# False: only the final output is needed
        # Fully connected layers
        self.drop_fc = tf.keras.layers.Dropout(params['dropout_rate'])
        self.fc = tf.keras.layers.Dense(2*params['rnn_units'],tf.nn.elu)

        self.out_linear = tf.keras.layers.Dense(2)# two logits, one per class (negative / positive)

    def call(self,inputs,training=False):
        if inputs.dtype != tf.int32:
            inputs = tf.cast(inputs,tf.int32)

        x = tf.nn.embedding_lookup(self.embedding,inputs)# [batch_size, max_len, embedding_dim]

        x = self.drop1(x, training = training)
        x = self.rnn1(x)

        x = self.drop2(x, training = training)
        x = self.rnn2(x)

        x = self.drop3(x, training = training)
        x = self.rnn3(x)

        x = self.drop_fc(x,training = training)
        x = self.fc(x)

        x = self.out_linear(x)
        
        return x
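
As a sanity check on the architecture, the parameter counts that `model.summary()` reports for the three bidirectional layers in the training output (58880, 98816, 98816) can be reproduced by hand. The sketch below uses the standard LSTM parameter formula; the helper names are illustrative and not part of the notebook.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 gates, each with an input kernel,
    a recurrent kernel and a bias."""
    return 4 * (units * (input_dim + units) + units)

def bilstm_params(input_dim, units):
    """A Bidirectional wrapper holds one forward and one backward LSTM."""
    return 2 * lstm_params(input_dim, units)

# First layer is fed the 50-dimensional GloVe vectors.
print(bilstm_params(50, 64))   # 58880
# Later layers are fed the 128-dimensional output (2 * 64) of the previous layer.
print(bilstm_params(128, 64))  # 98816
```

The frozen embedding matrix is not counted among these trainable weights because it is created with `trainable=False`.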

## 3. Training the Model
#### (1) Hyperparameters

In [8]:
params = {
    'vocab_path':'/Users/liaohanxi/Desktop/sentiment_classfication/vocab/word.txt',
    'train_path':'/Users/liaohanxi/Desktop/sentiment_classfication/train.txt',
    'test_path':'/Users/liaohanxi/Desktop/sentiment_classfication/test.txt',
    'num_samples':25000,# dataset size
    'num_labels':2,
    'batch_size':256,
    'max_len':100,# maximum sample length, in tokens
    'rnn_units':64,# hidden units per LSTM direction in each layer
    'dropout_rate':0.2,
    'clip_norm':10.,# gradient clipping threshold (global norm), guards against exploding gradients
    'num_patience':3,# stop once accuracy has fallen 3 evaluations in a row
    'lr':3e-4# initial learning rate
}

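#### (2) Setup
Define the condition for stopping training: once more than num_patience evaluations have been recorded, training stops if the test accuracy has fallen in each of the last num_patience evaluations. A logger is also configured for the run.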

In [9]:
# Early-stopping test: True if accuracy has strictly decreased over the last num_patience evaluations
def is_descending(history: list):
    history = history[-(params['num_patience']+1):]
    for i in range(1,len(history)):
        if history[i-1] <= history[i]:
            return False
    return True

# Load the vocabulary
word2idx = {}
with open(params['vocab_path'],encoding='utf-8') as f:
    for i,line in enumerate(f):
        line = line.rstrip()
        word2idx[line] = i
params['word2idx'] = word2idx
params['vocab_size'] = len(word2idx) + 1

# Instantiate the model
model = Model(params)
model.build(input_shape=(None,None))# declare the input shape; fit can also infer it automatically

# Learning-rate decay
decay_lr = tf.optimizers.schedules.ExponentialDecay(params['lr'],1000,0.95)
optim = tf.optimizers.Adam(params['lr'])
global_step = 0

# Keep a history of accuracies so the best model can be identified
history_acc = []
best_acc = .0

t0 = time.time()
logger = logging.getLogger('tensorflow')
logger.setLevel(logging.INFO)
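
The stopping test is stricter than a plain "no improvement" rule, which the following sketch tries to make visible. `is_descending` here takes the patience as an argument instead of reading the global `params`; `no_improvement` is a hypothetical alternative added only for comparison and is not what the notebook uses.

```python
def is_descending(history, patience):
    """True if the last `patience` + 1 values are strictly decreasing."""
    tail = history[-(patience + 1):]
    return all(a > b for a, b in zip(tail, tail[1:]))

def no_improvement(history, patience):
    """Hypothetical alternative: True if the best value is at least
    `patience` evaluations old."""
    best_pos = history.index(max(history))
    return len(history) - 1 - best_pos >= patience

falling = [0.70, 0.80, 0.79, 0.78, 0.77]  # three consecutive drops
plateau = [0.70, 0.80, 0.79, 0.79, 0.78]  # no new best, but one tie

print(is_descending(falling, 3), no_improvement(falling, 3))  # True True
print(is_descending(plateau, 3), no_improvement(plateau, 3))  # False True
```

With the strictly-decreasing rule a single tie or small rebound resets the count, so training can continue well past the best epoch. Note also that a history shorter than two entries trivially counts as descending, which is why the training loop additionally checks `len(history_acc) > params['num_patience']`.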

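#### (3) Training
After each pass over the training set, the model is evaluated on the test set. Training exits only once the preset stopping condition is met. A log line is printed every 50 steps.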
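
The LR column in the training log follows from the schedule configured earlier, `ExponentialDecay(params['lr'], 1000, 0.95)`. With its default non-staircase behaviour the rate at a given step is `initial_lr * decay_rate ** (step / decay_steps)`. The helper below is a plain-Python sketch of that formula, not a call into TensorFlow.

```python
def exponential_decay(initial_lr, step, decay_steps, decay_rate):
    """Continuous (non-staircase) exponential learning-rate decay."""
    return initial_lr * decay_rate ** (step / decay_steps)

for step in (0, 50, 100, 1000, 1050):
    lr = exponential_decay(3e-4, step, 1000, 0.95)
    print("step {:>4} | LR: {:.6f}".format(step, lr))
```

The printed values (0.000300, 0.000299, 0.000298, 0.000285, 0.000284) agree with the rates logged at the same steps in the training output, so the rate drops by 5% for every 1000 optimizer steps.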

In [10]:
print(model.summary())
while True:
    for texts, labels in dataset(is_training=True, params=params):
        with tf.GradientTape() as tape:# record the forward pass so gradients can be computed
            logits = model(texts, training=True)
            loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,logits=logits)
            loss = tf.reduce_mean(loss)
        
        optim.lr.assign(decay_lr(global_step))
        grads = tape.gradient(loss,model.trainable_variables)
        grads,_ = tf.clip_by_global_norm(grads,params['clip_norm'])
        optim.apply_gradients(zip(grads,model.trainable_variables))# pair each gradient with the variable it updates

        if global_step % 50 == 0:
            logger.info("step {} | Loss: {:.4f} | Spent: {:.1f} secs | LR: {:.6f}".format(global_step,loss.numpy().item(),time.time()-t0,optim.lr.numpy().item()))
            t0 = time.time()
        global_step +=1

    m = tf.keras.metrics.Accuracy()

    for texts,labels in dataset(is_training=False,params=params):
        logits = model(texts,training=False)
        y_pred = tf.argmax(logits,axis=-1)
        m.update_state(y_true=labels,y_pred=y_pred)

    acc = m.result().numpy()
    logger.info('Evalution: Testing Accuracy: {:.3f}'.format(acc))
    history_acc.append(acc)
    
    if acc > best_acc:
        best_acc = acc
    logger.info('Best Accuracy: {:.3f}'.format(best_acc))

    if len(history_acc) > params['num_patience'] and is_descending(history_acc):
        logger.info('Testing Accuracy not improved over {} epochs, Early Stop'.format(params['num_patience']))
        break

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            multiple                  0         
_________________________________________________________________
dropout_1 (Dropout)          multiple                  0         
_________________________________________________________________
dropout_2 (Dropout)          multiple                  0         
_________________________________________________________________
bidirectional (Bidirectional multiple                  58880     
_________________________________________________________________
bidirectional_1 (Bidirection multiple                  98816     
_________________________________________________________________
bidirectional_2 (Bidirection multiple                  98816     
_________________________________________________________________
dropout_3 (Dropout)          multiple                  0     

2022-04-06 00:45:46.834535: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-04-06 00:45:46.834654: W tensorflow/core/platform/profile_utils/cpu_utils.cc:126] Failed to get CPU frequency: 0 Hz


INFO:tensorflow:step 0 | Loss: 0.7137 | Spent: 3.8 secs | LR: 0.000300
INFO:tensorflow:step 50 | Loss: 0.6410 | Spent: 62.6 secs | LR: 0.000299
Reading /Users/liaohanxi/Desktop/sentiment_classfication/test.txt
INFO:tensorflow:Evalution: Testing Accuracy: 0.697
INFO:tensorflow:Best Accuracy: 0.697
Reading /Users/liaohanxi/Desktop/sentiment_classfication/train.txt
INFO:tensorflow:step 100 | Loss: 0.5584 | Spent: 116.0 secs | LR: 0.000298
INFO:tensorflow:step 150 | Loss: 0.5446 | Spent: 61.8 secs | LR: 0.000298
Reading /Users/liaohanxi/Desktop/sentiment_classfication/test.txt
INFO:tensorflow:Evalution: Testing Accuracy: 0.723
INFO:tensorflow:Best Accuracy: 0.723
Reading /Users/liaohanxi/Desktop/sentiment_classfication/train.txt
INFO:tensorflow:step 200 | Loss: 0.5873 | Spent: 114.9 secs | LR: 0.000297
INFO:tensorflow:step 250 | Loss: 0.5088 | Spent: 63.1 secs | LR: 0.000296
Reading /Users/liaohanxi/Desktop/sentiment_classfication/test.txt
INFO:tensorflow:Evalution: Testing Accuracy: 0.733

2022-04-06 01:14:00.190566: W tensorflow/compiler/tf2mlcompute/convert/mlc_convert_utils.cc:690] ComputeTimeStepForAdam: Computing time_step from beta1_power and beta2_power gives different results, probably due to losing precision in pow or log. The time_step that comes from the larger beta_power is chosen (time_step = 960).

Reading /Users/liaohanxi/Desktop/sentiment_classfication/test.txt
INFO:tensorflow:Evalution: Testing Accuracy: 0.779
INFO:tensorflow:Best Accuracy: 0.779
Reading /Users/liaohanxi/Desktop/sentiment_classfication/train.txt


2022-04-06 01:15:18.231365: W tensorflow/compiler/tf2mlcompute/convert/mlc_convert_utils.cc:690] ComputeTimeStepForAdam: Computing time_step from beta1_power and beta2_power gives different results, probably due to losing precision in pow or log. The time_step that comes from the larger beta_power is chosen (time_step = 981).

INFO:tensorflow:step 1000 | Loss: 0.4360 | Spent: 114.2 secs | LR: 0.000285
INFO:tensorflow:step 1050 | Loss: 0.4647 | Spent: 61.6 secs | LR: 0.000284
Reading /Users/liaohanxi/Desktop/sentiment_classfication/test.txt
INFO:tensorflow:Evalution: Testing Accuracy: 0.779
INFO:tensorflow:Best Accuracy: 0.779
Reading /Users/liaohanxi/Desktop/sentiment_classfication/train.txt
INFO:tensorflow:step 1100 | Loss: 0.3899 | Spent: 114.6 secs | LR: 0.000284
INFO:tensorflow:step 1150 | Loss: 0.4698 | Spent: 61.7 secs | LR: 0.000283
Reading /Users/liaohanxi/Desktop/sentiment_classfication/test.txt
INFO:tensorflow:Evalution: Testing Accuracy: 0.782
INFO:tensorflow:Best Accuracy: 0.782
Reading /Users/liaohanxi/Desktop/sentiment_classfication/train.txt
INFO:tensorflow:step 1200 | Loss: 0.4715 | Spent: 114.2 secs | LR: 0.000282
INFO:tensorflow:step 1250 | Loss: 0.4833 | Spent: 61.8 secs | LR: 0.000281
Reading /Users/liaohanxi/Desktop/sentiment_classfication/test.txt
INFO:tensorflow:Evalution: Testing Accu