# Movie Review Classification

1. Data preprocessing
2. Sentence vectorization
3. Classification with a neural network

## Data extraction and preprocessing

In [20]:
import pandas as pd
import matplotlib.pyplot as plt
import jieba
from collections import defaultdict
from collections import Counter
import random
from six.moves import cPickle as pickle
import numpy as np

In [21]:
data_path = 'E:/MYGIT/DataSources/movie_comments.csv'
comments_pd = pd.read_csv(data_path,low_memory=False)

In [22]:
comments_pd.shape

(261497, 5)

In [23]:
comments_pd

Unnamed: 0,id,link,name,comment,star
0,1,https://movie.douban.com/subject/26363254/,战狼2,吴京意淫到了脑残的地步，看了恶心想吐,1
1,2,https://movie.douban.com/subject/26363254/,战狼2,首映礼看的。太恐怖了这个电影，不讲道理的，完全就是吴京在实现他这个小粉红的英雄梦。各种装备轮...,2
2,3,https://movie.douban.com/subject/26363254/,战狼2,吴京的炒作水平不输冯小刚，但小刚至少不会用主旋律来炒作…吴京让人看了不舒服，为了主旋律而主旋...,2
3,4,https://movie.douban.com/subject/26363254/,战狼2,凭良心说，好看到不像《战狼1》的续集，完虐《湄公河行动》。,4
4,5,https://movie.douban.com/subject/26363254/,战狼2,中二得很,1
5,6,https://movie.douban.com/subject/26363254/,战狼2,“犯我中华者，虽远必诛”，吴京比这句话还要意淫一百倍。,1
6,7,https://movie.douban.com/subject/26363254/,战狼2,脑子是个好东西，希望编剧们都能有。,2
7,8,https://movie.douban.com/subject/26363254/,战狼2,三星半，实打实的7分。第一集在爱国主旋律内部做着各种置换与较劲，但第二集才真正显露吴京的野心...,4
8,9,https://movie.douban.com/subject/26363254/,战狼2,开篇长镜头惊险大气引人入胜 结合了水平不俗的快剪下实打实的真刀真枪 让人不禁热血沸腾 特别弹...,4
9,10,https://movie.douban.com/subject/26363254/,战狼2,15/100吴京的冷峰在这部里即像成龙，又像杰森斯坦森，但体制外的同类型电影，主角总是代表个...,1


In [24]:
moive_stars = list(comments_pd['star'])
movie_comments = list(comments_pd['comment'])
#movie_name = [str(name) for name in movie_name]
#movie_comments = [str(comm) for comm in movie_comments]

In [25]:
print(len(moive_stars), len(movie_comments))

261497 261497


In [26]:
print(moive_stars[24],movie_comments[24])

4 好看，这部戏让人看的热血沸腾，打戏挺燃的，吴京演技棒呆了


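Text cleaning: drop noisy comments, including comments written in English.

Open question: should comments shorter than some threshold length also be dropped?

---

First cleaning pass: drop rows whose comment is missing or empty, or shorter than 4 characters, and rows whose star rating is not one of '1' to '5'.

---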
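The open question above, whether to drop very short comments, can be explored before committing to a threshold. The following is a minimal sketch, assuming the comments are plain Python values as in `movie_comments`; `count_removed_by_min_length` and the toy `sample` list are hypothetical and not part of the notebook.

```python
def count_removed_by_min_length(comments, thresholds):
    """For each threshold t, count comments that are not str or shorter than t characters."""
    removed = {}
    for t in thresholds:
        removed[t] = sum(
            1 for c in comments
            if not isinstance(c, str) or len(c) < t
        )
    return removed

# Toy data standing in for the real comment column (missing cells show up as float NaN)
sample = ['中二得很', '好', float('nan'), '脑子是个好东西，希望编剧们都能有。', '']
print(count_removed_by_min_length(sample, [2, 4, 6]))
```

Running the same helper on the full comment list for a few candidate thresholds shows how much data each choice would cost.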

In [27]:
%%time
removes1 = [_i for _i,comm in enumerate(movie_comments) if (not isinstance(comm, str)) 
           or (comm == '')  or (len(comm) < 4)]
removes2 = [_i for _i,comm in enumerate(moive_stars) if comm not in ['1','2','3','4','5']]
print(len(removes1), len(removes2))
removes = removes1+removes2
removes = set(removes)


13100 1
Wall time: 170 ms


In [28]:
moive_stars = [star for _i,star in enumerate(moive_stars) if _i not in removes]
movie_comments = [comm for _i,comm in enumerate(movie_comments) if _i not in removes]

In [29]:
# Convert Traditional Chinese to Simplified Chinese
from langconv import Converter
def cht_to_chs(line):
    # Converter('zh-hans') maps Traditional characters to their Simplified forms
    return Converter('zh-hans').convert(line)

In [30]:
print(len(movie_comments),len(moive_stars))

248396 248396


In [31]:
# Apply the Traditional-to-Simplified conversion to every comment
movie_comments = [cht_to_chs(comm) for comm in movie_comments]

In [33]:
print(len(movie_comments),len(moive_stars))

248396 248396


Second cleaning pass: drop comments that are not written in Chinese

---

In [34]:
import langid

In [35]:
string1 = '爱和坚持。'
print(langid.classify(string1))

('ja', -56.275206565856934)


In [37]:
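# Not used in the end: langid is too slow on this many comments
def is_en(string):
    # 'ja' is accepted too, because langid labels some short Chinese strings as Japanese (see above)
    lang, _score = langid.classify(string)
    return lang not in ('zh', 'ja')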

In [38]:
def not_ch(string):
    # Flag a comment as non-Chinese when fewer than 35% of its characters
    # fall in the CJK Unified Ideographs block (U+4E00 to U+9FFF)
    length = len(string)
    length_ch = sum(1 for ch in string if '\u4e00' <= ch <= '\u9fff')
    return length_ch < length * 0.35
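To see how the 35% rule behaves, it helps to compute the ratio for a few comments by hand. A small sketch, assuming the same character range as `not_ch`; `chinese_ratio` is a hypothetical helper introduced only for illustration.

```python
def chinese_ratio(text):
    """Share of characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    if not text:
        return 0.0
    n_ch = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
    return n_ch / len(text)

for text in ['Smith毁容鸟...', '好！好！！好！！！', '凭良心说，好看到不像续集']:
    ratio = chinese_ratio(text)
    # A ratio below 0.35 means not_ch would flag the comment
    print(f'{ratio:.2f}', ratio < 0.35, text)
```

The second string is entirely Chinese in intent, but full-width punctuation counts against it (3 of 9 characters), so punctuation-heavy short comments are removed along with English ones.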

In [39]:
%%time
for i in range(1000):
    if not_ch(movie_comments[i]):
        print(movie_comments[i])

You got a dream, you gotta protect it. Don't let your dreams be dreams.
You have a dream, you got to protect it.
好！好！！好！！！
好！好！！好！！！
MIT？
This is part of my life……
MIT？
This is part of my life……
http://www.douban.com/review/1279547/
2008.2.3
Smith毁容鸟...
2008.2.3
Smith毁容鸟...
http://v.youku.com/v_show/id_XODY3ODI3Ng==.html
真尴尬啊。大结局居然是十年后问了声“my name is hanmeimei whats your name？”
真尴尬啊。大结局居然是十年后问了声“my name is hanmeimei whats your name？”
真尴尬啊。大结局居然是十年后问了声“my name is hanmeimei whats your name？”
我只认识sandy and sue，再见
ヽ(´o｀；
ヽ(´o｀；
ヽ(´o｀；
ヽ(´o｀；
Wall time: 22 ms


In [40]:
print(len(movie_comments),len(moive_stars))

248396 248396


In [41]:
%%time
removes = [_i for _i,comm in enumerate(movie_comments) if not_ch(comm)]
removes = set(removes)
print(len(removes))

10691
Wall time: 1.77 s


In [42]:
stars = [star for _i,star in enumerate(moive_stars) if _i not in removes]
comments = [comm for _i,comm in enumerate(movie_comments) if _i not in removes]

In [43]:
print(len(comments),len(stars))

237705 237705


In [44]:
from collections import Counter
starsdict = Counter(stars)
starsdict

Counter({'1': 22176, '2': 25980, '4': 76397, '5': 52932, '3': 60220})

In [45]:
# Group the comments by star rating
from collections import defaultdict
def get_each_classdata(dataset, labels):
    datas = defaultdict(list)
    for i in range(len(labels)):
        datas[labels[i]].append(dataset[i])
    return datas

In [46]:
label_data_dict = get_each_classdata(comments, stars)

In [47]:
label_data_dict.keys()

dict_keys(['1', '2', '4', '5', '3'])

---

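Classifying directly into five star levels works poorly: the mapping from a comment to a star rating is subjective and emotional, and even a human cannot reliably place a comment at exactly 1, 2, or 3 stars, let alone a model. To get a more usable result, the task is recast as binary classification: 1-star and 2-star comments are treated as negative, 4-star and 5-star comments as positive, and 3-star comments are set aside for now.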

---
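The mapping described above can be written as a single function. This is a minimal sketch, assuming star ratings are the strings '1' to '5' as in the cleaned data; `star_to_label` and the toy `pairs` list are hypothetical.

```python
def star_to_label(star):
    """Map a star rating string to 0 (negative), 1 (positive), or None (dropped)."""
    if star in ('1', '2'):
        return 0
    if star in ('4', '5'):
        return 1
    return None  # 3-star comments are left out

pairs = [('中二得很', '1'), ('一般般吧', '3'), ('打戏挺燃的', '4')]
binary = [(c, star_to_label(s)) for c, s in pairs if star_to_label(s) is not None]
print(binary)
```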

In [49]:
data_dict = defaultdict(list)
data_dict[0] = label_data_dict['1']+ label_data_dict['2']
data_dict[1] = label_data_dict['4']+ label_data_dict['5']

In [50]:
print(len(data_dict[0]))

48156


In [51]:
# Balance the positive and negative classes
num_per_class = 45000
def get_balanced_dataset(datas, num_per_class):
    dataset = []
    labels = []
    for key in datas:
        if len(datas[key]) < num_per_class:
            print('testpoint')
            dataset += datas[key]
            labels += [key]*len(datas[key])
        else:
            dataset += random.sample(datas[key],num_per_class)
            labels += [key]* num_per_class
    return dataset, labels
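Because `random.sample` draws a different subset on every run, the balanced dataset changes between sessions. One option is a seeded local generator, sketched below; `balanced_sample` and its `seed` parameter are assumptions made for illustration, not the notebook's function.

```python
import random

def balanced_sample(groups, num_per_class, seed=0):
    """Draw up to num_per_class items per label, reproducibly."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    dataset, labels = [], []
    for key, items in groups.items():
        k = min(num_per_class, len(items))
        dataset += rng.sample(items, k)
        labels += [key] * k
    return dataset, labels

groups = {0: ['a', 'b', 'c'], 1: ['d', 'e', 'f', 'g', 'h']}
data, labs = balanced_sample(groups, 3, seed=42)
print(labs)
```

Calling it twice with the same seed returns the same sample, which makes later accuracy numbers easier to compare across runs.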

In [107]:
dataset,labels = get_balanced_dataset(data_dict, num_per_class)

In [110]:
sum(labels[45000:])

45000

In [150]:
print(len(dataset),len(labels))

90000 90000


In [53]:
print(dataset[222],labels[222])

神剧！我真的不知道该从哪里吐槽，槽点太多，剧情很神，不知道在讲什么，莫名其妙，断断续续没有主题或者说不知道想表达什么，动作戏很假，完全接不上，画面也是断断续续的，配音很不自然，演技浮夸，很尴尬…… 0


## Sentence vectorization

### Method 1: get sentence embedding from word embedding

---

![sentence_embedding.png](https://i.loli.net/2019/09/10/InrCOWD1XxQSvld.png)

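[Reference paper on sentence embeddings](https://openreview.net/pdf?id=SyK00v5xx)
- line 2: compute the vector of each sentence and store it as a column; the paper describes this as compute the weighted average of the word vector
- line 6: the paper describes this as remove the projections of the average vectors on their first singular vector (common component removal). Why the computation on line 6 amounts to common component removal is not yet clear to me; to be revisited when time allows
- The resulting matrix is M×N, where M is the word-vector dimension and N is the number of sentences, so each sentence vector is stored as a column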
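One way to read lines 2 and 6 of the pseudocode is as a weighted average followed by an orthogonal projection. The derivation below is a sketch of the standard projection argument, not the paper's own proof; the symbols $X$, $u$, and $\tilde v_s$ are notation introduced here.

```latex
% Line 2: weighted average of the word vectors in sentence s,
% where p(w) is the estimated unigram probability of word w and a > 0 is a constant
v_s = \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} \, v_w

% Stack the sentence vectors as columns:
X = [\, v_{s_1}, \dots, v_{s_N} \,] \in \mathbb{R}^{M \times N}

% Let u be the first left singular vector of X, so that \|u\| = 1.
% Line 6: subtract the projection of v_s onto u
\tilde v_s = v_s - u u^{\top} v_s = (I - u u^{\top}) \, v_s

% The result has no component left along u:
u^{\top} \tilde v_s = u^{\top} v_s - (u^{\top} u)(u^{\top} v_s) = 0
```

Among all unit vectors, $u$ maximizes $\sum_i (u^{\top} v_{s_i})^2$, so it is the single direction along which the sentence vectors, taken together, carry the most energy. That direction is shared by most sentences rather than specific to any one of them, which is a plausible reading of the name common component removal. Note that the projection uses only the first singular vector, a single column of the SVD factor, not the full factor.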

In [152]:
from gensim.models import KeyedVectors
import logging

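# Implements the pseudocode shown in the figure above
def get_sentences_vec(model_wv, sent_list, word_frequence):
    # Build one weighted-average vector per sentence, stored as a column
    a = 0.001
    total = sum(word_frequence.values())  # token count, turns counts into p(w)
    row = model_wv.vector_size
    col = len(sent_list)
    sent_mat = np.zeros((row, col))
    for i, sent in enumerate(sent_list):
        sent_vec = np.zeros(row)
        for word in sent:
            pw = word_frequence[word] / total  # estimated unigram probability p(w)
            w = a / (a + pw)
            try:
                sent_vec += w * np.array(model_wv[word])
            except KeyError:
                # word missing from the embedding vocabulary: skip it
                pass
        sent_mat[:, i] = sent_vec / len(sent)

    # Common component removal: subtract each column's projection onto
    # the first left singular vector of the sentence matrix
    u, s, vh = np.linalg.svd(sent_mat, full_matrices=False)
    u1 = u[:, :1]  # shape (row, 1)
    sent_mat = sent_mat - u1 @ (u1.T @ sent_mat)
    return sent_mat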

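def get_word_frequence(words):
    # Count how often each word occurs across all tokenized sentences.
    # No stop-word handling here: a word that has no embedding is simply
    # skipped later, when the sentence vectors are computed
    word_frequence = Counter()
    for sent in words:
        word_frequence.update(sent)
    return word_frequence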
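The weight a / (a + p(w)) gives frequent words less influence on the sentence vector. The sketch below works this out on a toy corpus; it assumes p(w) is estimated as a word's relative frequency in that same corpus, and `sif_weights` and `toy_corpus` are hypothetical names used only for illustration.

```python
from collections import Counter

def sif_weights(tokenized_sentences, a=0.001):
    """Return {word: a / (a + p(w))}, with p(w) the relative frequency of w."""
    counts = Counter()
    for sent in tokenized_sentences:
        counts.update(sent)
    total = sum(counts.values())
    return {w: a / (a + c / total) for w, c in counts.items()}

toy_corpus = [['电影', '好看'], ['电影', '一般'], ['电影', '剧情', '尴尬']]
weights = sif_weights(toy_corpus)
# Print from the most down-weighted word to the least
for word, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(word, round(weight, 4))
```

Here '电影' appears in every sentence and receives the smallest weight, while words that occur once keep a larger one. With raw counts in place of probabilities, every weight would shrink by several orders of magnitude, since a is only 0.001.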

In [153]:
stopwords = []
with open('stopwords.txt') as f:
    line = f.readline()
    while line !='':
        stopwords.append(line.strip('\n'))
        line = f.readline()
stopwords.append('\n')
stopwords.append(' ')
stopwords = set(stopwords)

In [154]:
corpus = []
removes = []
for _i,sen in enumerate(dataset):
    sen = sen.strip('\n')
    words = jieba.lcut(sen)
    temp = []
    for word in words:
        if word not in stopwords:
            temp.append(word)
    if temp == []:
        removes.append(_i)
    else:
        corpus.append(temp) 

In [159]:
corpus[1143]

['看到', '名字', '想', '一星', '没', '看好']

In [160]:
len(removes),len(corpus)

(131, 89869)

In [161]:
labels = [lab for _i,lab in enumerate(labels) if _i not in removes]

In [162]:
len(labels), len(corpus)

(89869, 89869)

In [163]:
fname = r"E:/MYGIT/model/wiki_stopwords/wiki_word2vec.kv"
model_wv = KeyedVectors.load(fname, mmap='r')


In [164]:
words_frequence = get_word_frequence(corpus)

In [165]:
%%time
comment_vecs = get_sentences_vec(model_wv, corpus, words_frequence)

Wall time: 5min 10s


In [77]:
del(model_wv)

In [166]:
comment_vecs = comment_vecs.T

In [167]:
comment_vecs.shape , comment_vecs.dtype

((89869, 500), dtype('float64'))

In [168]:
labels = np.array(labels)

In [169]:
# Shuffle samples and labels with the same permutation
def randomize(dataset, labels):
    permutation = np.random.permutation(labels.shape[0])
    shuffled_dataset = dataset[permutation,:]
    shuffled_labels = labels[permutation]
    return shuffled_dataset, shuffled_labels

In [170]:
dataset, labels = randomize(comment_vecs, labels)

In [171]:
dataset.shape, labels.shape

((89869, 500), (89869,))

---

Standardize the features here, then save the result

---

In [172]:
X_ = (dataset - np.mean(dataset, axis=0)) / np.std(dataset, axis=0)

In [173]:
# Split into training, validation, and test sets
from sklearn.model_selection import train_test_split

In [174]:
train_dataset,test_dataset,train_labels,test_labels = train_test_split(X_, labels, test_size=0.2)
valid_dataset,test_dataset,valid_labels,test_labels = train_test_split(test_dataset, test_labels,test_size=0.5)

In [175]:
print(train_dataset.shape,valid_dataset.shape, test_dataset.shape)

(71895, 500) (8987, 500) (8987, 500)
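The notebook standardizes with the mean and standard deviation of the whole dataset before splitting, so validation and test rows contribute to those statistics. A common alternative is to compute them on the training split only and reuse them for the other splits. The pure-Python sketch below shows the idea on toy rows; `fit_standardizer` and `apply_standardizer` are hypothetical helpers, and on the real arrays the same thing would be done with NumPy on `train_dataset`.

```python
import statistics

def fit_standardizer(rows):
    """Compute per-column mean and population std from the given rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_standardizer(rows, means, stds):
    """Standardize rows with statistics that were fitted elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

train = [[1.0, 10.0], [3.0, 10.0]]
valid = [[2.0, 12.0]]
means, stds = fit_standardizer(train)               # statistics come from train only
train_std = apply_standardizer(train, means, stds)
valid_std = apply_standardizer(valid, means, stds)  # reuse the train statistics
print(train_std, valid_std)
```

The zero-std guard matters for the real data as well: a feature column that is constant would otherwise produce NaN values after division.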


In [176]:
# Save the split data
from six.moves import cPickle as pickle
pickle_file = 'E:/MYGIT/DataSources/MovieComments/Comments_04'
try:
    with open(pickle_file, 'wb') as f:
        save = {
        'train_dataset': train_dataset,
        'train_labels': train_labels,
        'valid_dataset': valid_dataset,
        'valid_labels': valid_labels,
        'test_dataset': test_dataset,
        'test_labels': test_labels,
        }
        pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
except Exception as e:
    print('Unable to save data to', pickle_file, ':', e)

### Method 2: TF-IDF

---

In [54]:
stopwords = []
with open('stopwords.txt') as f:
    line = f.readline()
    while line !='':
        stopwords.append(line.strip('\n'))
        line = f.readline()
stopwords.append('\n')
stopwords = set(stopwords)

In [55]:
dataset[71], labels[71]

('本片看得那个囧啊，每个人出场都那么囧。周杰伦为什么不用双截棍啊？', 0)

In [56]:
corpus = []
for sen in dataset:
    words = jieba.lcut(sen)
    temp = ''
    for word in words:
        if word not in stopwords:
            temp += word + ' '
    corpus.append(temp) 

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Oliver\AppData\Local\Temp\jieba.cache
Loading model cost 0.905 seconds.
Prefix dict has been built succesfully.


In [72]:
corpus[7]

'麻烦 胡歌 粉装路 装 一点 '

In [58]:
'\n' in stopwords

True

In [87]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [88]:
vectorizer1 = TfidfVectorizer(max_features=2000)

In [89]:
X = vectorizer1.fit_transform(corpus)
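One detail worth checking with space-joined Chinese tokens: scikit-learn's default `token_pattern` only keeps tokens of two or more word characters, so single-character words produced by jieba are dropped before TF-IDF is computed. The sketch below reproduces that behavior with `re` alone on the `corpus[7]` string shown above; the constant names are hypothetical.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"   # scikit-learn's default token_pattern
KEEP_SINGLE_CHARS = r"(?u)\b\w+\b"   # alternative that keeps one-character tokens

doc = '麻烦 胡歌 粉装路 装 一点 '
print(re.findall(DEFAULT_PATTERN, doc))    # the single character '装' is missing
print(re.findall(KEEP_SINGLE_CHARS, doc))  # all five tokens are kept
```

Passing `token_pattern=r"(?u)\b\w+\b"` to `TfidfVectorizer` would keep such tokens. Whether that improves accuracy on this dataset has not been tested here.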

In [96]:
X[1]

<1x2000 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>

In [90]:
X.shape

(90000, 2000)

In [77]:
#X = X.toarray()

In [82]:
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

In [78]:
labels = np.array(labels)

In [79]:
a = np.array([[1,2],[3,4]])
#a =(a - np.mean(a, axis=0)) / np.std(a, axis=0)
a

array([[1, 2],
       [3, 4]])

In [91]:
from sklearn.model_selection import train_test_split
train_dataset,test_dataset,train_labels,test_labels = train_test_split(X, labels, test_size=0.2)
valid_dataset,test_dataset,valid_labels,test_labels = train_test_split(test_dataset, test_labels,test_size=0.5)

In [94]:
print(train_dataset.shape,valid_dataset.shape, test_dataset.shape)

(72000, 2000) (9000, 2000) (9000, 2000)


In [85]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier()
clf.fit(train_dataset, train_labels)
y_hat = clf.predict(valid_dataset)
acc = format(accuracy_score(valid_labels, y_hat),'.4f')
print(acc)



0.7468


In [93]:
# Save the split data
# Note: the TF-IDF vectors are saved as sparse matrices
from six.moves import cPickle as pickle
pickle_file = 'E:/MYGIT/DataSources/MovieComments/Comments_TFIDF_02'
try:
    with open(pickle_file, 'wb') as f:
        save = {
        'train_dataset': train_dataset,
        'train_labels': train_labels,
        'valid_dataset': valid_dataset,
        'valid_labels': valid_labels,
        'test_dataset': test_dataset,
        'test_labels': test_labels,
        }
        pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
except Exception as e:
    print('Unable to save data to', pickle_file, ':', e)

## Neural network model

In [2]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

In [3]:
import os
pickle_file = 'E:/MYGIT/DataSources/MovieComments/Comments_04'
with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del(save)  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

Training set (71895, 500) (71895,)
Validation set (8987, 500) (8987,)
Test set (8987, 500) (8987,)


In [12]:
# Cast the features to float32 and one-hot encode the labels
num_labels = 2
def reformat(datasets, labels):
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    datasets = datasets.astype(np.float32)
    return datasets,labels

train_dataset,train_labels1 = reformat(train_dataset,train_labels)
valid_dataset,valid_labels1 = reformat(valid_dataset,valid_labels)
test_dataset, test_labels1 = reformat(test_dataset,test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (71895, 500) (71895,)
Validation set (8987, 500) (8987,)
Test set (8987, 500) (8987,)
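The one-hot step in `reformat` relies on broadcasting: `np.arange(num_labels) == labels[:, None]` compares every label against the row `[0, 1]`. The pure-Python sketch below spells out the same result; `one_hot` is a hypothetical helper written only to make the encoding explicit.

```python
def one_hot(labels, num_labels):
    """Row i has a 1.0 in column labels[i] and 0.0 elsewhere."""
    return [[1.0 if col == lab else 0.0 for col in range(num_labels)]
            for lab in labels]

# Label 1 (positive) becomes [0.0, 1.0]; label 0 (negative) becomes [1.0, 0.0]
print(one_hot([1, 1, 0], 2))
```

The encoded labels are stored under new names (`train_labels1` and so on), so their shape is `(n, 2)` while the original label arrays keep shape `(n,)`.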


In [7]:
# Try a simple model first, as a sanity check on whether the preprocessing is reasonable
from sklearn.ensemble import RandomForestClassifier

In [8]:
clf = RandomForestClassifier(n_estimators=10,criterion='gini')

In [9]:
clf.fit(train_dataset, train_labels)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [11]:
test_labels[:10]

array([1, 1, 0, 1, 0, 1, 0, 1, 1, 0])

In [10]:
from sklearn.metrics import accuracy_score
y_hat = clf.predict(test_dataset)
acc = format(accuracy_score(test_labels, y_hat),'.4f')
print(acc)

0.6377


In [28]:
tf.train.exponential_decay??

In [14]:
# Neural network with 2 hidden layers
batch_size = 128
num_hidden_nodes1 = 1024
num_hidden_nodes2 = 128
beta_regul = 1e-3
feature_size = 500

graph = tf.Graph()
with graph.as_default():
    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, feature_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    global_step = tf.Variable(0)

    # Variables.
    # Layer 1: the input size is feature_size (the sentence-vector dimension); this layer has num_hidden_nodes1 units
    weights1 = tf.Variable(tf.truncated_normal([feature_size, num_hidden_nodes1],
                                               stddev=np.sqrt(2.0 / (feature_size))))
    biases1 = tf.Variable(tf.zeros([num_hidden_nodes1]))
    
    # Layer 2: the input size is the number of layer-1 units; this layer has num_hidden_nodes2 units. stddev sets the standard deviation of the initial weights
    weights2 = tf.Variable(tf.truncated_normal([num_hidden_nodes1, num_hidden_nodes2], 
                                               stddev=np.sqrt(2.0 / num_hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([num_hidden_nodes2]))
    # Output layer: the input size is the number of layer-2 units; the output size is the number of classes
    weights3 = tf.Variable(tf.truncated_normal([num_hidden_nodes2, num_labels], 
                                               stddev=np.sqrt(2.0 / num_hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    lay1_train = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    lay2_train = tf.nn.relu(tf.matmul(lay1_train, weights2) + biases2)
    logits = tf.matmul(lay2_train, weights3) + biases3
    loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=tf_train_labels, logits=logits)) + \
      beta_regul * (tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3))

    # Optimizer.
    learning_rate = tf.train.exponential_decay(0.1, global_step, 1000, 0.65, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    lay1_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    lay2_valid = tf.nn.relu(tf.matmul(lay1_valid, weights2) + biases2)
    valid_prediction = tf.nn.softmax(tf.matmul(lay2_valid, weights3) + biases3)
    
    lay1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    lay2_test = tf.nn.relu(tf.matmul(lay1_test, weights2) + biases2)
    test_prediction = tf.nn.softmax(tf.matmul(lay2_test, weights3) + biases3)
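With `staircase=True`, `tf.train.exponential_decay(0.1, global_step, 1000, 0.65)` multiplies the learning rate by 0.65 once every 1000 steps rather than continuously. The sketch below computes the schedule in plain Python; `staircase_decay` is a hypothetical helper that mirrors the formula, not a TensorFlow function.

```python
def staircase_decay(initial_lr, step, decay_steps, decay_rate):
    """initial_lr * decay_rate ** floor(step / decay_steps)."""
    return initial_lr * decay_rate ** (step // decay_steps)

# The rate stays flat within each block of 1000 steps and drops at the boundary
for step in (0, 999, 1000, 2000, 3600):
    print(step, staircase_decay(0.1, step, 1000, 0.65))
```

Over the 3601 training steps used in this notebook the rate drops three times, ending near 0.027.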

In [15]:
def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

In [16]:
# 3601 steps of 128 samples each: 128 * 3601 = 460928 training samples seen
num_steps = 3601
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size)]
        batch_labels = train_labels1[offset:(offset + batch_size)]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if(step % 200 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels1))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels1))

Initialized
Minibatch loss at step 0: 1.619719
Minibatch accuracy: 57.0%
Validation accuracy: 49.0%
Minibatch loss at step 200: 1.497957
Minibatch accuracy: 64.8%
Validation accuracy: 57.6%
Minibatch loss at step 400: 1.463217
Minibatch accuracy: 63.3%
Validation accuracy: 63.0%
Minibatch loss at step 600: 1.433888
Minibatch accuracy: 64.8%
Validation accuracy: 63.3%
Minibatch loss at step 800: 1.380140
Minibatch accuracy: 64.8%
Validation accuracy: 60.1%
Minibatch loss at step 1000: 1.333677
Minibatch accuracy: 71.9%
Validation accuracy: 62.9%
Minibatch loss at step 1200: 1.309573
Minibatch accuracy: 73.4%
Validation accuracy: 64.2%
Minibatch loss at step 1400: 1.314891
Minibatch accuracy: 70.3%
Validation accuracy: 64.4%
Minibatch loss at step 1600: 1.238327
Minibatch accuracy: 70.3%
Validation accuracy: 65.7%
Minibatch loss at step 1800: 1.259339
Minibatch accuracy: 66.4%
Validation accuracy: 64.2%
Minibatch l
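The training loop's own comment notes that randomization across epochs could be better: the offset formula walks the training data in the same order on every pass. One option is to reshuffle the sample order at the start of each epoch, sketched below. `minibatch_indices` is a hypothetical generator, and dropping the incomplete final batch of each epoch is a choice made here to keep the batch size fixed for the placeholder.

```python
import random

def minibatch_indices(num_samples, batch_size, num_steps, seed=0):
    """Yield index lists of length batch_size, reshuffling at the start of every epoch."""
    rng = random.Random(seed)
    order = []
    for _ in range(num_steps):
        if len(order) < batch_size:
            # Not enough unseen samples left: start a new shuffled epoch
            order = list(range(num_samples))
            rng.shuffle(order)
        batch, order = order[:batch_size], order[batch_size:]
        yield batch

batches = list(minibatch_indices(num_samples=10, batch_size=4, num_steps=5))
for b in batches:
    print(b)
```

In the loop, a batch would then be selected with `train_dataset[batch]` and `train_labels1[batch]` instead of the offset slice, since NumPy arrays accept a list of row indices.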

In [97]:
# Load the sentence vectors produced with TF-IDF
import os
pickle_file = 'E:/MYGIT/DataSources/MovieComments/Comments_TFIDF_02'
with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del(save)  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

Training set (72000, 2000) (72000,)
Validation set (9000, 2000) (9000,)
Test set (9000, 2000) (9000,)


In [99]:
train_dataset[1]

<1x2000 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [100]:
# Cast the features to float32, one-hot encode the labels, and convert the sparse matrix to a dense array
num_labels = 2
def reformat(datasets, labels):
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    datasets = datasets.astype(np.float32)
    datasets = datasets.toarray()
    return datasets,labels

train_dataset,train_labels1 = reformat(train_dataset,train_labels)
valid_dataset,valid_labels1 = reformat(valid_dataset,valid_labels)
test_dataset, test_labels1 = reformat(test_dataset,test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (72000, 2000) (72000,)
Validation set (9000, 2000) (9000,)
Test set (9000, 2000) (9000,)


In [102]:
# Neural network with 2 hidden layers
batch_size = 128
num_hidden_nodes1 = 1024
num_hidden_nodes2 = 128
beta_regul = 1e-3
feature_size = 2000

graph = tf.Graph()
with graph.as_default():
    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, feature_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    global_step = tf.Variable(0)

    # Variables.
    # Layer 1: the input size is feature_size (the TF-IDF vocabulary size); this layer has num_hidden_nodes1 units
    weights1 = tf.Variable(tf.truncated_normal([feature_size, num_hidden_nodes1],
                                               stddev=np.sqrt(2.0 / (feature_size))))
    biases1 = tf.Variable(tf.zeros([num_hidden_nodes1]))
    
    # Layer 2: the input size is the number of layer-1 units; this layer has num_hidden_nodes2 units. stddev sets the standard deviation of the initial weights
    weights2 = tf.Variable(tf.truncated_normal([num_hidden_nodes1, num_hidden_nodes2], 
                                               stddev=np.sqrt(2.0 / num_hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([num_hidden_nodes2]))
    # Output layer: the input size is the number of layer-2 units; the output size is the number of classes
    weights3 = tf.Variable(tf.truncated_normal([num_hidden_nodes2, num_labels], 
                                               stddev=np.sqrt(2.0 / num_hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    lay1_train = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    lay2_train = tf.nn.relu(tf.matmul(lay1_train, weights2) + biases2)
    logits = tf.matmul(lay2_train, weights3) + biases3
    loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=tf_train_labels, logits=logits)) + \
      beta_regul * (tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3))

    # Optimizer.
    learning_rate = tf.train.exponential_decay(0.1, global_step, 1000, 0.65, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    lay1_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    lay2_valid = tf.nn.relu(tf.matmul(lay1_valid, weights2) + biases2)
    valid_prediction = tf.nn.softmax(tf.matmul(lay2_valid, weights3) + biases3)
    
    lay1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    lay2_test = tf.nn.relu(tf.matmul(lay1_test, weights2) + biases2)
    test_prediction = tf.nn.softmax(tf.matmul(lay2_test, weights3) + biases3)

In [106]:
# 3601 steps of 128 samples each: 128 * 3601 = 460928 training samples seen
num_steps = 3601
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size)]
        batch_labels = train_labels1[offset:(offset + batch_size)]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if(step % 200 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels1))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels1))

Initialized
Minibatch loss at step 0: 1.586879
Minibatch accuracy: 41.4%
Validation accuracy: 50.7%
Minibatch loss at step 200: 1.543147
Minibatch accuracy: 63.3%
Validation accuracy: 61.7%
Minibatch loss at step 400: 1.500564
Minibatch accuracy: 57.0%
Validation accuracy: 68.6%
Minibatch loss at step 600: 1.392491
Minibatch accuracy: 77.3%
Validation accuracy: 61.5%
Minibatch loss at step 800: 1.350853
Minibatch accuracy: 73.4%
Validation accuracy: 73.2%
Minibatch loss at step 1000: 1.311717
Minibatch accuracy: 72.7%
Validation accuracy: 73.4%
Minibatch loss at step 1200: 1.213789
Minibatch accuracy: 73.4%
Validation accuracy: 72.4%
Minibatch loss at step 1400: 1.230655
Minibatch accuracy: 76.6%
Validation accuracy: 73.8%
Minibatch loss at step 1600: 1.242313
Minibatch accuracy: 64.8%
Validation accuracy: 74.3%
Minibatch loss at step 1800: 1.115149
Minibatch accuracy: 76.6%
Validation accuracy: 70.4%
Minibatch loss at step 2000: 1.137029
Minibatch accuracy: 75.8%
Validation accuracy: 

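## Result analysis

1. With the Method 1 sentence vectors, the model reaches a test accuracy of roughly 0.66; with TF-IDF features, roughly 0.76.

2. Why is neither accuracy very good?
 - The raw comments are not clearly separable to begin with. Fairly neutral comments in particular weigh on the result, since it is hard to tell whether they are positive or negative.
 - Some comments in the raw data read as positive but are labeled negative.
 - The sentence vectorization is still not good enough.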