<a href="https://colab.research.google.com/github/JavanTang/Learn-a-little-tensorflow-every-day/blob/master/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E2%80%94%E2%80%94Tensorflow%E5%AD%A6%E4%B9%A0%EF%BC%88%E4%B8%89%EF%BC%89%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

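In brief, today's task resembles yesterday's: it is again classification, but this time we use TensorFlow for text classification, labeling movie reviews as positive or negative. The `dataset` comes from IMDB. Afterwards we will also download some Chinese data and analyze it, extending the official tutorial.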


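## Task overview
1. Download the IMDB dataset
2. Explore the data
3. Format the data
4. Build the model
5. Train the model
6. Validate the model
7. **Evaluate the model**
8. Practice on a Chinese dataset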

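A quick aside: every exercise in this series comes with a Colaboratory notebook, a free machine learning platform provided by Google. If you can reach Google services (for example via a VPN), open the Colaboratory link directly: studying is easier that way, the layout is more comfortable, and you can change parameters at any time.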


## The code

In [4]:
# Import packages
!pip install tensorflow==2.0.0-alpha0
import tensorflow as tf
from tensorflow import keras as k
import numpy as np
import pandas as pd
print(tf.__version__)

2.0.0-alpha0


In [105]:
imdb = k.datasets.imdb

(train_datas, train_labels), (test_datas, test_labels) = imdb.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [106]:
print(train_datas[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


Let's look at the first sample. It consists entirely of numbers because the IMDB dataset has already converted each word to an integer.

In [107]:
# Mapping from word to integer index
word_index = imdb.get_word_index()

# Above we only have integers, so invert word_index into an "index -> word" mapping
index_word = {v:k for k,v in word_index.items()}

def decode_review(text):
  '''
  Convert the numbers in the text list to words
  '''
  return ' '.join([index_word.get(i, '?') for i in text])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


In [108]:
# Use this function to turn the data back into text
print(decode_review(train_datas[0]))

the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room titillate it so heart shows to years of every never going villaronga help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but pratfalls to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other tricky in of seen over landed for anyone of gilmore's br show's to whether from than out themselves history he name half some br of 'n odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but whe
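
The decoded text above reads like word salad. A likely cause is that `imdb.load_data()` by default shifts every word index by 3 (`index_from=3`) and reserves 1 as the start marker and 2 for out-of-vocabulary words, while `get_word_index()` returns the unshifted indices. The sketch below shows an offset-aware mapping on a toy vocabulary; `toy_word_index`, `build_index_word`, and the `<PAD>`/`<START>`/`<UNK>` marker names are illustrative assumptions, not part of the original notebook.

```python
def build_index_word(word_index, index_from=3):
    """Invert a word -> id mapping, shifting every id by `index_from`."""
    index_word = {idx + index_from: word for word, idx in word_index.items()}
    # Reserved ids; the marker names are illustrative, not part of the dataset.
    index_word[0] = "<PAD>"
    index_word[1] = "<START>"
    index_word[2] = "<UNK>"
    return index_word


def decode_review(ids, index_word):
    """Turn a list of integer ids back into a space-separated string."""
    return " ".join(index_word.get(i, "?") for i in ids)


# Toy stand-in for imdb.get_word_index(); the real mapping is far larger.
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
toy_index_word = build_index_word(toy_word_index)

# An encoded review as load_data() would store it: start marker, then id + 3.
encoded = [1, 4, 5, 6, 7]
print(decode_review(encoded, toy_index_word))
```

With the real data, the same helper could be called as `build_index_word(imdb.get_word_index())` and used to decode `train_datas[0]`.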

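Next we build the model. Before doing so, recall that the image classifier in part (2) had an input_shape parameter, which we set to 28*28. That worked because every image had the same pixel matrix size, and the number of nodes inside the model was fixed as well (if this is unclear, revisit the explanation of neural networks in part (1)).

Here, to standardize the length, we use the pad_sequences function.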
```
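Function description:
Converts a list of nb_samples sequences (sequences of scalars) into a 2D numpy array of shape (nb_samples, nb_timesteps). If maxlen is given, nb_timesteps = maxlen; otherwise it is the length of the longest sequence. Sequences shorter than that are padded with 0, and sequences longer than nb_timesteps are truncated to fit. Where padding and truncation happen is controlled by padding and truncating respectively (both default to 'pre', i.e. the start of the sequence).
Parameters:
sequences: two-level nested list of floats or integers
maxlen: None or an integer, the maximum sequence length. Longer sequences are truncated and shorter ones are padded with 0.
dtype: data type of the returned numpy array
padding: 'pre' or 'post', whether to pad at the start or the end of a sequence
truncating: 'pre' or 'post', whether to truncate from the start or the end of a sequence
value: float, used for padding instead of the default value 0
Returns:
A 2D tensor of shape (nb_samples, nb_timesteps)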
```
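
To make the padding and truncation rules concrete, here is a minimal plain-Python sketch of the behavior described above. `pad_sequences_sketch` is a hypothetical helper written for illustration; it is not the Keras implementation and returns nested lists rather than a numpy array.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding="pre",
                         truncating="pre", value=0):
    """Pad or truncate each sequence to `maxlen` (illustrative sketch only)."""
    if maxlen is None:
        # Without maxlen, use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the start, 'post' drops them from the end.
            if truncating == "pre":
                seq = seq[len(seq) - maxlen:]
            else:
                seq = seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill values before the data, 'post' after it.
        seq = fill + seq if padding == "pre" else seq + fill
        result.append(seq)
    return result


reviews = [[1, 14, 22], [1, 194, 1153, 194, 8255]]
padded = pad_sequences_sketch(reviews, maxlen=4, padding="post")
print(padded)
```

Note that the second review is longer than `maxlen`, so with the default `truncating="pre"` it loses its first element even though `padding="post"` was requested.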




In [0]:
train_datas = k.preprocessing.sequence.pad_sequences(train_datas,
                                                    maxlen=256,
                                                    padding='post',
                                                    value=0
                                                        )
test_datas = k.preprocessing.sequence.pad_sequences(test_datas,
                                                   maxlen=256,
                                                   padding='post',
                                                   value=0
                                                   )

In [0]:
# Printing shows that the tail has been padded with 0, and every sample now has length 256
print(train_datas[1])

[    1   194  1153   194  8255    78   228     5     6  1463  4369  5012
   134    26     4   715     8   118  1634    14   394    20    13   119
   954   189   102     5   207   110  3103    21    14    69   188     8
    30    23     7     4   249   126    93     4   114     9  2300  1523
     5   647     4   116     9    35  8163     4   229     9   340  1322
     4   118     9     4   130  4901    19     4  1002     5    89    29
   952    46    37     4   455     9    45    43    38  1543  1905   398
     4  1649    26  6853     5   163    11  3215 10156     4  1153     9
   194   775     7  8255 11596   349  2637   148   605 15358  8003    15
   123   125    68 23141  6853    15   349   165  4362    98     5     4
   228     9    43 36893  1157    15   299   120     5   120   174    11
   220   175   136    50     9  4373   228  8255     5 25249   656   245
  2350     5     4  9837   131   152   491    18 46151    32  7464  1212
    14     9     6   371    78    22   625    64  1

In [0]:
# Build the model; let's DIY one first
# Start by trying the image classification model from part (2)

model = k.Sequential([
#     Not needed here; see the explanation in part (1)
#     keras.layers.Flatten(input_shape=(28,28)),
    k.layers.Dense(128, activation=tf.nn.relu),
    k.layers.Dense(2, activation=tf.nn.softmax)
])

In [0]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

In [0]:
history = model.fit(train_datas,
                    train_labels,
                    epochs=40,
                    batch_size=512,
                    verbose=1)


TypeError: ignored

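Accuracy is stuck at 50%, a coin flip???

Here we need to think: why does feeding sentences in produce a result like this? The main point of this series is learning to think, to find problems, and to solve them.

Consider what the data in train_datas looks like. The first sample might be [223, 13, 3, 22, 19, ...], where each number is an index from word_index. Why does feeding these indices into a neural network give a coin flip?

Some directions to consider:

1. The training data
2. The model design
3. The loss function

Training data:

The main problem here is that we have not extracted good features from the data. How do we fix that?
**Search Google, Bing, or Baidu for "neural network text feature extraction"**, then study the first five articles. This is also this session's exercise: summarize those five articles and the methods they use.

The official tutorial simply uses keras.layers.Embedding to learn word vectors.

If those five articles do not explain word vectors, see [this Zhihu explanation of word vectors](https://www.zhihu.com/question/21714667) and [Embedding](https://blog.csdn.net/wangyangzhizhou/article/details/77530479).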
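
Raw word ids are arbitrary labels: id 530 is not "larger" than id 43 in any meaningful sense, so a Dense layer fed these numbers directly has little to learn from. An embedding layer instead treats each id as a row index into a trainable table of vectors. The sketch below shows only the lookup step in plain Python; the table values are made up for illustration, and `embed` is a hypothetical helper rather than a Keras API.

```python
# Toy embedding table: one row (vector) per word id. Values are made up;
# in a real model they are learned during training.
embedding_table = [
    [0.0, 0.0],    # id 0: padding
    [0.1, -0.3],   # id 1
    [0.7, 0.2],    # id 2
    [-0.5, 0.9],   # id 3
]


def embed(ids, table):
    """Look up one vector per word id, as an embedding layer does."""
    return [table[i] for i in ids]


review = [2, 3, 1]
vectors = embed(review, embedding_table)
print(vectors)
```

Each review of length `n` becomes `n` vectors, which is why the Embedding layer in the model summary reports an output shape of `(None, None, 16)`.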

So let's rewrite the model:

In [0]:
train_datas[0]

array([    1,    14,    22,    16,    43,   530,   973,  1622,  1385,
          65,   458,  4468,    66,  3941,     4,   173,    36,   256,
           5,    25,   100,    43,   838,   112,    50,   670, 22665,
           9,    35,   480,   284,     5,   150,     4,   172,   112,
         167, 21631,   336,   385,    39,     4,   172,  4536,  1111,
          17,   546,    38,    13,   447,     4,   192,    50,    16,
           6,   147,  2025,    19,    14,    22,     4,  1920,  4613,
         469,     4,    22,    71,    87,    12,    16,    43,   530,
          38,    76,    15,    13,  1247,     4,    22,    17,   515,
          17,    12,    16,   626,    18, 19193,     5,    62,   386,
          12,     8,   316,     8,   106,     5,     4,  2223,  5244,
          16,   480,    66,  3785,    33,     4,   130,    12,    16,
          38,   619,     5,    25,   124,    51,    36,   135,    48,
          25,  1415,    33,     6,    22,    12,   215,    28,    77,
          52,     5,

In [0]:
vob_len = len(index_word)+1

print(vob_len)

model = k.Sequential([
    k.layers.Embedding(vob_len+2, 16),
    k.layers.GlobalAveragePooling1D(),
    k.layers.Dense(16, activation=tf.nn.relu),
    k.layers.Dense(1, activation=tf.nn.sigmoid)
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# The output of the next statement is important: it shows how the data changes shape and what each layer outputs.
model.summary()



88585
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          1417392   
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 1,417,681
Trainable params: 1,417,681
Non-trainable params: 0
_________________________________________________________________
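
The parameter counts in the summary can be checked by hand: an embedding layer has one weight per table cell, and a dense layer has one weight per input-output pair plus one bias per unit. A small sketch of that arithmetic, where `embedding_params` and `dense_params` are hypothetical helper names:

```python
def embedding_params(input_dim, output_dim):
    """One trainable value per cell of the (input_dim, output_dim) table."""
    return input_dim * output_dim


def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


vocab_rows = 88585 + 2  # vob_len + 2, as passed to k.layers.Embedding
counts = [
    embedding_params(vocab_rows, 16),  # embedding
    0,                                 # global average pooling has no weights
    dense_params(16, 16),              # first Dense layer
    dense_params(16, 1),               # output Dense layer
]
print(counts, sum(counts))
```

The four numbers match the `Param #` column above, and their sum matches `Total params`.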


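k.layers.Dense takes an activation parameter, which sets the activation function: [What is an activation function](https://www.zhihu.com/question/22334626)

Here is another question for you:

**How many activation functions does keras provide? List them. What do the ones we have used so far do?**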
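
As a starting point for that exercise, here is a plain-Python sketch of the three activations used so far in this notebook (relu, softmax, and sigmoid). These are the textbook definitions written out by hand, not the Keras source code.

```python
import math


def relu(x):
    """Pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any real number into (0, 1); used for the binary output."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(xs)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), relu(3.0))
print(sigmoid(0.0))
print(softmax([1.0, 2.0, 3.0]))
```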



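The part of the code above you probably have not met before is the `k.layers.GlobalAveragePooling1D` layer. A few questions:

1. Finding out what this layer does is an exercise for you; look up the material and work it out yourself.
2. Also try removing it and see what goes wrong.
3. The loss differs from part (1) of this series. Briefly, binary_crossentropy handles binary classification; look up the details online, and leave a comment if you get stuck.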
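
If you want to check your answers to points 1 and 3, the following plain-Python sketch shows the idea behind both: averaging a sequence of vectors over the time axis, and the binary cross-entropy formula. Both functions are hypothetical illustrations under simplifying assumptions (no masking, no batching) and not the Keras implementations.

```python
import math


def global_average_pooling_1d(vectors):
    """Average a (timesteps, features) list of vectors over the timestep axis."""
    steps = len(vectors)
    return [sum(column) / steps for column in zip(*vectors)]


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Three word vectors of size 2 collapse into one vector of size 2.
pooled = global_average_pooling_1d([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(pooled)

# Confident correct predictions give a much smaller loss than confident wrong ones.
loss_good = binary_crossentropy([1, 0], [0.9, 0.1])
loss_bad = binary_crossentropy([1, 0], [0.1, 0.9])
print(loss_good, loss_bad)
```

The pooling step is what turns a variable number of word vectors into the fixed-size `(None, 16)` input that the Dense layers expect.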

Now back to coding


In [0]:
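# Create a validation set: the first 10000 samples are for validation, the remaining 15000 for training
# https://www.zhihu.com/question/26588665 explains the difference between test and validation sets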
x_val = train_datas[:10000]
partial_x_train = train_datas[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

Train on 15000 samples, validate on 10000 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


In [0]:
results = model.evaluate(test_datas, test_labels)
print(results)

[0.3240479930019379, 0.87284]


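Sigh, finished at last?

Not yet. There is one more skill: evaluating the model and analyzing how it performs.

Look at the accuracy on the test set above: it is only 86-88%, yet on the training set we have already reached 99%. In the training log excerpt below, we need to find the epoch at which the validation metrics level off: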
```
Epoch 13/40
15000/15000 [==============================] - 2s 109us/sample - loss: 0.0751 - acc: 0.9863 - val_loss: 0.2831 - val_acc: 0.8877
Epoch 14/40
15000/15000 [==============================] - 2s 112us/sample - loss: 0.0700 - acc: 0.9875 - val_loss: 0.2822 - val_acc: 0.8876
Epoch 15/40
15000/15000 [==============================] - 2s 111us/sample - loss: 0.0652 - acc: 0.9882 - val_loss: 0.2833 - val_acc: 0.8872
Epoch 16/40
15000/15000 [==============================] - 2s 107us/sample - loss: 0.0606 - acc: 0.9893 - val_loss: 0.2854 - val_acc: 0.8879
Epoch 17/40
15000/15000 [==============================] - 2s 108us/sample - loss: 0.0564 - acc: 0.9907 - val_loss: 0.2867 - val_acc: 0.8869
Epoch 18/40
15000/15000 [==============================] - 2s 107us/sample - loss: 0.0527 - acc: 0.9919 - val_loss: 0.2885 - val_acc: 0.8883
Epoch 19/40
15000/15000 [==============================] - 2s 110us/sample - loss: 0.0491 - acc: 0.9925 - val_loss: 0.2918 - val_acc: 0.8859
Epoch 20/40
15000/15000 [==============================] - 2s 112us/sample - loss: 0.0461 - acc: 0.9929 - val_loss: 0.2934 - val_acc: 0.8865
Epoch 21/40
15000/15000 [==============================] - 2s 113us/sample - loss: 0.0430 - acc: 0.9934 - val_loss: 0.2954 - val_acc: 0.8867
Epoch 22/40
15000/15000 [==============================] - 2s 112us/sample - loss: 0.0405 - acc: 0.9939 - val_loss: 0.2991 - val_acc: 0.8859
Epoch 23/40
15000/15000 [==============================] - 2s 108us/sample - loss: 0.0378 - acc: 0.9949 - val_loss: 0.3007 - val_acc: 0.8861
Epoch 24/40
15000/15000 [==============================] - 2s 110us/sample - loss: 0.0353 - acc: 0.9951 - val_loss: 0.3034 - val_acc: 0.8863
Epoch 25/40
15000/15000 [==============================] - 2s 111us/sample - loss: 0.0331 - acc: 0.9955 - val_loss: 0.3056 - val_acc: 0.8864
Epoch 26/40
15000/15000 [==============================] - 2s 111us/sample - loss: 0.0310 - acc: 0.9956 - val_loss: 0.3085 - val_acc: 0.8851
Epoch 27/40
15000/15000 [==============================] - 2s 111us/sample - loss: 0.0291 - acc: 0.9961 - val_loss: 0.3121 - val_acc: 0.8852
```

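We need to find the epoch at which performance on the validation set has stabilized (converged) and stop training there, to prevent overfitting (simply put: nearly everything is right on the training set, but accuracy on the test set is not high, so the model generalizes poorly).

Put plainly, it is like only watching Yu-ge's lecture videos without ever working through math problems yourself, so you are lost as soon as a different kind of problem shows up (a joke for grad-school entrance exam candidates!!!)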
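
One way to pick that epoch is to look for the lowest validation loss, or to stop once the validation loss has not improved for a few epochs in a row. The sketch below applies both ideas to the `val_loss` values from the log excerpt above (epochs 13 to 27 only, so the result is the best epoch within that excerpt). `best_epoch` and `early_stop_epoch` are hypothetical helpers; Keras also ships a `keras.callbacks.EarlyStopping` callback (with `monitor` and `patience` arguments) that automates this during `fit`.

```python
def best_epoch(val_losses, first_epoch=1):
    """Return the epoch number with the lowest validation loss."""
    best_offset = min(range(len(val_losses)), key=val_losses.__getitem__)
    return first_epoch + best_offset


def early_stop_epoch(val_losses, patience, first_epoch=1):
    """Return the epoch at which training would stop, or None if it never does.

    Training stops after `patience` consecutive epochs without a new best loss.
    """
    best = float("inf")
    waited = 0
    for offset, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return first_epoch + offset
    return None


# val_loss values copied from the log excerpt (epochs 13 through 27).
val_losses = [0.2831, 0.2822, 0.2833, 0.2854, 0.2867, 0.2885, 0.2918, 0.2934,
              0.2954, 0.2991, 0.3007, 0.3034, 0.3056, 0.3085, 0.3121]

print(best_epoch(val_losses, first_epoch=13))
print(early_stop_epoch(val_losses, patience=3, first_epoch=13))
```

In this excerpt the validation loss rises steadily after its minimum while the training loss keeps falling, which is the overfitting pattern described above.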

In [3]:
# This command may no longer work
# Google Drive: https://drive.google.com/open?id=1OXWKZwfTpAXpuyCb4ESZq_UrycN9YvHe
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1OXWKZwfTpAXpuyCb4ESZq_UrycN9YvHe' -O dmsc.csv

# I code directly in Colab, so I use Google Drive as my working folder. See reference 7 for how to do this.
from google.colab import drive
drive.mount('/content/drive/')

# Run this to list the contents of the drive
!ls "/content/drive/My Drive/"

--2019-04-09 01:16:38--  https://docs.google.com/uc?export=download&id=1OXWKZwfTpAXpuyCb4ESZq_UrycN9YvHe
Resolving docs.google.com (docs.google.com)... 74.125.142.101, 74.125.142.138, 74.125.142.113, ...
Connecting to docs.google.com (docs.google.com)|74.125.142.101|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘dmsc.csv’

dmsc.csv                [<=>                 ]       0  --.-KB/s               dmsc.csv                [ <=>                ]   3.17K  --.-KB/s    in 0s      

2019-04-09 01:16:38 (31.0 MB/s) - ‘dmsc.csv’ saved [3244]

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20http

In [0]:
dmsc = pd.read_csv('drive/My Drive/dataset/DMSC.csv',encoding = "utf-8")

In [1]:
# Word segmentation tool 1
!pip install pkuseg
import re
import pkuseg



In [29]:
# Number of positive and negative samples to keep (each)
size = 10000

from multiprocessing import Pool

# Note: this pool is never used; the parallel attempt is commented out below
p = Pool()

dmsc_dataset = dmsc[dmsc['Star']!=3].loc[:,['Star','Comment']]
xl_data_label = dmsc_dataset['Star'].values
xl_data_review = dmsc_dataset['Comment'].values

label = []
review = []

word_index = {}
# Id 0 is reserved for the space (used as padding) and id 1 for unknown tokens
word_index[' '] = 0
word_index['<unknow>'] = 1

seg = pkuseg.pkuseg('web')

print('Starting word segmentation, total %d' % len(xl_data_review))

def seg_cut(i):
  # Replace punctuation and whitespace with a single space before segmenting
  text = re.sub(r"[\s+\.\!\/_,$%^*(+\"\']+|[。+、——！.…，。？、~@#￥%……&*（）]+", " ",xl_data_review[i])
  _build = seg.cut(text)
  result = []
  for j in _build:
    # If the word has not been seen before, assign it the next id
    if j not in word_index:
      word_index[j] = len(word_index) + 1
    result.append(word_index[j]) # from here on the text is stored as integer ids
  result = np.array(result)
  review.append(result) # collect every review as an array of ids
  label.append(xl_data_label[i])
  
post = 0
neg = 0
# Simplify the labels into a 0/1 binary classification problem
for i in range(len(xl_data_label)):
  if xl_data_label[i] > 3:
    xl_data_label[i] = 1
  elif xl_data_label[i] < 3:
    xl_data_label[i] = 0

# Collect all words and assign each one an id
for i in range(len(xl_data_review)):
#   p.apply_async(seg_cut, args=(i,))
  if xl_data_label[i] == 1 and post < size:
    post += 1
    seg_cut(i)
  if xl_data_label[i] == 0 and neg <size:
    neg += 1
    seg_cut(i)
  if post >= size and neg >= size:
    break
  if (post+neg) % 10000 == 0:
    print('Loaded %d reviews' % (post+neg))
p.close()
p.join()





Starting word segmentation, total 1650497
Loaded 10000 reviews
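
The vocabulary-building step can be separated from segmentation and tested on its own. The sketch below is a simplified variant under stated assumptions: the tokens are split by hand rather than by pkuseg, the names `build_vocab`, `encode`, `PAD_ID`, and `UNK_ID` are hypothetical, and ids are assigned contiguously (the cell above uses `len(word_index) + 1`, which leaves id 2 unused).

```python
PAD_ID = 0  # reserved for padding
UNK_ID = 1  # reserved for words not seen while building the vocabulary


def build_vocab(tokenized_texts):
    """Assign each distinct token the next free integer id."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for tokens in tokenized_texts:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Map tokens to ids, sending unknown tokens to UNK_ID."""
    return [vocab.get(tok, UNK_ID) for tok in tokens]


# Hand-split toy corpus; real input would come from a segmenter such as pkuseg.
corpus = [["还", "不错"], ["不错", "值得", "一看"]]
vocab = build_vocab(corpus)
print(vocab)
print(encode(["值得", "一看", "吗"], vocab))
```

Mapping unseen words to a reserved id keeps every encoded value inside the range the Embedding layer was built for.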


Now we can see that all of our training data has been converted to integer ids

In [0]:
from sklearn.model_selection import train_test_split
label = np.array(label)
train_x, test_x, train_y, test_y = train_test_split(review, label, test_size=0.33, random_state=42)

In [31]:
train_y.shape

(13400,)

In [0]:
train_x = k.preprocessing.sequence.pad_sequences(train_x,
                                                    maxlen=256,
                                                    padding='post',
                                                    value=0
                                                        )
test_x = k.preprocessing.sequence.pad_sequences(test_x,
                                                   maxlen=256,
                                                   padding='post',
                                                   value=0
                                                   )

In [33]:
# Write another function to turn the ids back into text
index_word = {v:k for k,v in word_index.items()}
def reduction_number(data):
  '''
  data list => text
  '''
  return ''.join([index_word[i] for i in data])

print(reduction_number(train_x[0]))

因为1对她有太多期待所以导致这部续作不尽如人意                                                                                                                                                                                                                                                  


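Finally the data is in the same form as before, so we repeat the same steps:

1. Build the model
2. Configure the optimizer, loss, and metrics
3. Train the model
4. Validate the model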

In [34]:
# Build the model
vob_len = len(word_index) + 1
dmsc_model = k.Sequential([
    k.layers.Embedding(vob_len, 16),
    k.layers.GlobalAveragePooling1D(),
    k.layers.Dense(16, activation=tf.nn.relu),
    k.layers.Dense(1, activation=tf.nn.sigmoid)

])
dmsc_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 16)          487312    
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 487,601
Trainable params: 487,601
Non-trainable params: 0
_________________________________________________________________


In [0]:
# Configure the optimizer, loss, and metrics

dmsc_model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [13]:
train_x.shape

(13400, 256)

In [36]:
# Train the model

x_val = train_x[:1000]
y_val = train_y[:1000]
partial_x_train = train_x[1000:]
partial_y_train = train_y[1000:]

dmsc_history = dmsc_model.fit(partial_x_train,
                        partial_y_train,
                        epochs=100,
                        batch_size=512,
                        validation_data=(x_val, y_val),
                        verbose=1)


Train on 12400 samples, validate on 1000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/10

In [37]:
size = 0
for i in test_y:
  if i == 1:
    size += 1
print(size*1.0/len(test_y)*100)

50.34848484848485


In [38]:
# Validate the model

result = dmsc_model.evaluate(test_x, test_y)
print(result)

[0.4826192412231908, 0.80863637]


In [39]:
dmsc_model.predict(test_x)

array([[0.92586744],
       [0.8183203 ],
       [0.9931318 ],
       ...,
       [0.4850828 ],
       [0.0115703 ],
       [0.38390815]], dtype=float32)
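
Each value returned by `predict` is the sigmoid output, which can be read as the predicted probability that a review is positive. A minimal sketch of turning those probabilities into 0/1 labels, using the six values visible in the output above (flattened into a plain list; `to_labels` is a hypothetical helper and the 0.5 threshold is the usual default choice):

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 (positive) or 0 (negative)."""
    return [1 if p > threshold else 0 for p in probabilities]


# The six probabilities shown in the predict() output above.
probs = [0.92586744, 0.8183203, 0.9931318, 0.4850828, 0.0115703, 0.38390815]
labels = to_labels(probs)
print(labels)
```

Values close to 0.5, such as 0.4850828, are the reviews the model is least sure about.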

In [40]:
text = '还不错咯，值得一看'
text = seg.cut(text)
number = [word_index.get(i,-1) for i in text]
print(number)
number = np.array([number])
number = k.preprocessing.sequence.pad_sequences(number,
                                                maxlen=256,
                                                padding='post',
                                                value=0
                                                )
print(number)

[149, 752, 3577, -1, 221, 81, 474]
[[ 149  752 3577   -1  221   81  474    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0   

In [46]:
def text2number(text):
  '''
  Convert text to integer ids
  '''
  text = seg.cut(text)
  number = [word_index.get(i,1) for i in text]
  number = np.array([number])
  number = k.preprocessing.sequence.pad_sequences(number,
                                                   maxlen=256,
                                                   padding='post',
                                                   value=0
                                                   )
  return number


# A little fun
while True:
  comment = input()
  if comment == 'q':
    print('Test finished')
    break
  comment_number = text2number(comment)
  pred = dmsc_model.predict(comment_number)
  if pred[0][0] > 0.5:
    print('Positive')
  else:
    print('Negative')

看毛线呀，这个还不如回家看
Negative
这是我今年看过最好看的电影了，点个赞
Positive
支持一波
Positive
情怀电影吧，看不看都行
Negative
特效爆炸好吗！
Positive
q
Test finished


That's it. OK: we followed the official tutorial and extended it, with training data from Douban. Mwah~




## Github

[深度学习——Tensorflow学习（三）文本分类.ipynb](https://github.com/JavanTang/Learn-a-little-tensorflow-every-day/blob/master/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E2%80%94%E2%80%94Tensorflow%E5%AD%A6%E4%B9%A0%EF%BC%88%E4%B8%89%EF%BC%89%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB.ipynb)

## Colaboratory
[深度学习——Tensorflow学习（三）文本分类.ipynb](https://colab.research.google.com/drive/16FGkMgX4p6bQibwwIdzsC71NTX6hyOdU)

## Reference

1. [Official TensorFlow tutorial](https://www.tensorflow.org/tutorials/keras/basic_text_classification) 
2. [The pad_sequences function](https://blog.csdn.net/HHTNAN/article/details/82585776)
3. [What is an activation function](https://www.zhihu.com/question/22334626)
4. [Difference between test and validation sets](https://www.zhihu.com/question/26588665)
5. [Zhihu explanation of word vectors](https://www.zhihu.com/question/21714667)
6. [Embedding](https://blog.csdn.net/wangyangzhizhou/article/details/77530479)
7. [Google Colab free GPU tutorial](https://juejin.im/post/5c05e1bc518825689f1b4948)
8. [Filtering punctuation](https://blog.csdn.net/mach_learn/article/details/41744487)
9. [Chinese movie review dataset](https://www.kaggle.com/utmhikari/doubanmovieshortcomments)


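It is time to leave work and I have not finished this article. Tomorrow I will update it below with Chinese text classification using data provided by Sina.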

