## Workflow
* Get the data
* Build the neural network: Sequential().add()
* Compile the neural network: compile()
* Train the neural network: fit()
* Evaluate the neural network: evaluate()


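## Unfamiliar concepts
* filter: roughly a weight matrix of shape (desired output dimension × (word vector dimension × number of words in the window))
* pool_size: the size of the pooling window (worth picturing as a diagram)
* pooling: extracts a summary feature from each window
* local feature: a pattern that spans only a few neighboring words
* CNN -> picks up sentiment words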
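To make the filter and pooling notes above concrete, here is a minimal NumPy sketch of what a 1D convolution with `'same'` padding and a max pooling step do to one review. The helper names (`conv1d_same`, `max_pool1d`, `global_max_pool1d`) are made up for illustration and the weights are random, so this only shows the shapes, not how Keras implements the layers internally.

```python
import numpy as np


def conv1d_same(x, kernels, bias):
    """1D convolution (stride 1, zero 'same' padding) followed by ReLU.

    x:       (steps, in_dim)  one review as a sequence of word vectors
    kernels: (kernel_size, in_dim, n_filters)
    bias:    (n_filters,)
    returns: (steps, n_filters)
    """
    kernel_size, in_dim, n_filters = kernels.shape
    pad_left = (kernel_size - 1) // 2
    pad_right = (kernel_size - 1) - pad_left
    padded = np.pad(x, ((pad_left, pad_right), (0, 0)), mode="constant")
    # Each filter flattens to kernel_size * in_dim weights: this is the
    # "(word dimension x number of words)" side of the filter matrix.
    flat_kernels = kernels.reshape(kernel_size * in_dim, n_filters)
    out = np.empty((x.shape[0], n_filters))
    for t in range(x.shape[0]):
        window = padded[t:t + kernel_size].reshape(-1)  # a few neighboring words
        out[t] = window @ flat_kernels + bias
    return np.maximum(out, 0.0)


def max_pool1d(x, pool_size):
    """Keep the maximum of every non-overlapping window of pool_size steps."""
    steps, channels = x.shape
    n_windows = steps // pool_size
    return x[:n_windows * pool_size].reshape(n_windows, pool_size, channels).max(axis=1)


def global_max_pool1d(x):
    """Keep a single maximum per filter over the whole sequence."""
    return x.max(axis=0)


rng = np.random.default_rng(0)
review = rng.normal(size=(80, 32))      # 80 words, 32-dimensional embeddings
kernels = rng.normal(size=(3, 32, 64))  # kernel_size=3, 64 filters
bias = np.zeros(64)

features = conv1d_same(review, kernels, bias)
pooled = max_pool1d(features, pool_size=2)
global_pooled = global_max_pool1d(features)

print(features.shape, pooled.shape, global_pooled.shape)
```

Windowed pooling keeps a shorter sequence per filter, while global pooling collapses each filter to one number, which is the "global -> number, max pooling -> vector" distinction.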

In [1]:
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Conv1D, Flatten, Embedding, MaxPooling1D
from keras.preprocessing import sequence
from keras.datasets import imdb


Using TensorFlow backend.


In [2]:
max_features = 20000
maxlen = 80
batch_size = 128


print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
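# Load the data: 25,000 training reviews and 25,000 test reviews
# x_train is a list of lists; each inner list is one review, and every word is replaced by an integer index based on its frequency rank (capped by num_words), so a review is just a sequence of integers
# y_train holds one label per review: its polarity (0 = negative, 1 = positive)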
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')



Loading data...
25000 train sequences
25000 test sequences
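As a rough illustration of the frequency-rank encoding described in the comments above, the toy sketch below builds a rank index from three tiny "reviews" and encodes a sentence with it. The helpers `build_rank_index` and `encode`, and the use of 0 as the out-of-vocabulary marker, are assumptions made for this example only; the real `imdb.load_data` also reserves a few low indices for special markers, which this version skips.

```python
from collections import Counter


def build_rank_index(texts):
    """Map each word to its frequency rank (1 = most frequent word)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def encode(text, rank_index, num_words, oov_index=0):
    """Replace words by their rank; unknown or too-rare words become oov_index."""
    encoded = []
    for word in text.lower().split():
        rank = rank_index.get(word)
        if rank is None or rank >= num_words:
            encoded.append(oov_index)  # hypothetical out-of-vocabulary marker
        else:
            encoded.append(rank)
    return encoded


texts = ["a great great movie", "a bad movie", "a great cast"]
rank_index = build_rank_index(texts)
encoded = encode("a great bad movie", rank_index, num_words=4)
print(rank_index)
print(encoded)
```

A smaller `num_words` keeps only the most frequent words, which is the role `max_features` plays when loading the data.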


In [3]:
x_train = x_train[:5000]
y_train = y_train[:5000]

In [4]:
x_test = x_test[:500]
y_test = y_test[:500]

In [5]:
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
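# Shorter sequences are padded with 0
# Choose maxlen based on the lengths of the texts
# x_train is the list of sequences to be padded or truncated to maxlen
# maxlen is the maximum length of each sequence
# Returns a NumPy array of shape (number of sequences, maxlen)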
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

5000 train sequences
500 test sequences
Pad sequences (samples x time)
x_train shape: (5000, 80)
x_test shape: (500, 80)
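The padding step can be sketched in plain NumPy. The function below, `pad_pre` (a hypothetical name), mimics what I understand to be the default behavior of `pad_sequences`: zeros are added at the front of short sequences, and long sequences keep only their last `maxlen` items. It is a simplified stand-in for illustration, not the library code.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]           # truncate from the front
        out[row, -len(kept):] = kept   # write at the end, padding stays in front
    return out


padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9], []], maxlen=4)
print(padded)
print(padded.shape)
```

Every row ends up with the same length, which is why the arrays above have shape (number of reviews, 80).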


In [6]:

# Record the baseline algorithm

print('Build model...')
model = Sequential()
# Layers are added in order to build the model

model.add(Embedding(max_features, 32, input_length=maxlen))
# max_features -> vocabulary size
# Use a 32-dimensional vector to represent each word
# Every review has length maxlen

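model.add(Conv1D(filters=64, kernel_size=3, padding='same', activation='relu'))
# TODO: I do not fully understand the dimensions of the convolution layer yet
# filters: integer, the dimensionality of the output space (i.e. the number of filters in the convolution)
# kernel_size: an integer, or a tuple/list of a single integer, giving the length of the 1D convolution window
# 'same' pads the input so that the output has the same length as the original input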

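model.add(MaxPooling1D(pool_size=2))  # Note: global max pooling -> one number per filter; windowed max pooling -> a vector per filter
# Max pooling takes the maximum value of each window
# pool_size: integer, the size of the max pooling window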
model.add(Flatten())
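model.add(Dense(250, activation='relu'))
model.add(Dense(125, activation='relu'))
model.add(Dense(1, activation='sigmoid'))  # Dense + sigmoid (a single probability for the binary label)
# Dense creates a fully connected layer; its first argument is the output dimension, and its weight matrix has shape (input dimension, output dimension)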


model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=5,
          validation_data=(x_test, y_test))
loss_and_metrics = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
# evaluate() returns [loss, accuracy] here, so both values are printed
print('Test accuracy:', loss_and_metrics)

Build model...
Train...
Train on 5000 samples, validate on 500 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test accuracy: [0.6015336561203003, 0.8099999947547912]
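The two numbers printed above are the binary cross-entropy loss and the accuracy. As a small sketch of what those quantities mean, the NumPy code below computes both from a handful of made-up labels and predicted probabilities. The clipping constant `eps` and the 0.5 threshold are assumptions chosen for this example.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0); eps is an assumed small constant.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))


# Made-up labels and sigmoid outputs, for illustration only.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

loss = binary_crossentropy(y_true, y_prob)
accuracy = binary_accuracy(y_true, y_prob)
print([loss, accuracy])
```

Confident wrong predictions raise the loss much more than they change the accuracy, so the two values can move somewhat independently.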


In [7]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 80, 32)            640000    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 80, 64)            6208      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 40, 64)            0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 2560)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               640250    
_________________________________________________________________
dense_2 (Dense)              (None, 125)               31375     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 126       
Total para
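The Param # column above can be checked by hand. The sketch below recomputes each layer's parameter count from the hyperparameters used in this notebook (weights plus biases for the convolution and dense layers); the variable names are only for illustration.

```python
# Hyperparameters taken from the model definition in this notebook.
vocab_size, embed_dim, maxlen = 20000, 32, 80
filters, kernel_size, pool_size = 64, 3, 2

# Flatten sees (maxlen // pool_size) time steps with `filters` values each.
flat_dim = (maxlen // pool_size) * filters

params = {
    "embedding": vocab_size * embed_dim,                   # one vector per word
    "conv1d": kernel_size * embed_dim * filters + filters, # weights + biases
    "max_pooling1d": 0,                                     # no trainable weights
    "flatten": 0,                                           # only reshapes
    "dense_1": flat_dim * 250 + 250,
    "dense_2": 250 * 125 + 125,
    "dense_3": 125 * 1 + 1,
}

for name, count in params.items():
    print(f"{name:15s} {count}")
print("sum of the rows:", sum(params.values()))
```

The embedding and the first dense layer hold almost all of the parameters, while the convolution layer is comparatively small.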