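Using the 100-dimensional vectors that gave the best results earlier, add an LR classifier. While we're at it, also test how an SVM performs at 100 dimensions.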

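LR scores 0.871 and SVM scores 0.874, so SVM is indeed slightly better than LR. However, because SVM does not conveniently produce probability values, the later task of outputting probabilities is handed to LR.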
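As a side note on the probability point: scikit-learn's `SVC` can return probabilities when constructed with `probability=True`, which fits an extra calibration step (Platt scaling with internal cross-validation) and so slows training. A minimal sketch on synthetic data, not the review vectors used in this notebook; `make_classification` and all sizes here are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the 100-dimensional document vectors (assumption).
X, y = make_classification(n_samples=200, n_features=100, random_state=0)

# probability=True enables predict_proba at the cost of slower fitting.
svm_prob = SVC(C=1.0, kernel='rbf', probability=True, random_state=0)
svm_prob.fit(X[:150], y[:150])

# One row per sample, one column per class (order follows svm_prob.classes_).
proba = svm_prob.predict_proba(X[150:])
print(proba[:3])
```

On 25,000 training reviews the extra calibration cost would be noticeable, which is one more reason LR is the simpler choice for probability output.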

In [1]:
import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from gensim.models.doc2vec import TaggedDocument


def review_to_words(raw_review):
    # strip HTML markup
    review_text = BeautifulSoup(raw_review, 'lxml').get_text()
    # keep letters only
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    words = letters_only.lower().split()
    # a set makes stop-word lookups fast
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)


def tag_reviews(reviews, prefix):
    tagged = []
    for i, review in enumerate(reviews):
        tagged.append(TaggedDocument(words=review.split(), tags=[prefix + '_%s' % i]))
    return tagged

Using TensorFlow backend.


In [2]:
# gensim modules
from gensim.models import Doc2Vec

# numpy
import numpy as np

# classifier
from sklearn.linear_model import LogisticRegression

# random
from random import shuffle

# preprocess packages
import pandas as pd
# import sys
# sys.path.insert(0, '..')
# from utils.TextPreprocess import review_to_words, tag_reviews


'''
Training Data
'''
train = pd.read_csv("../Sentiment/data/labeledTrainData.tsv", header=0, 
                         delimiter='\t', quoting=3, error_bad_lines=False)
num_reviews = train["review"].size

print("Cleaning and parsing the training set movie reviews...")
clean_train_reviews = []
for i in range(0, num_reviews):
    clean_train_reviews.append(review_to_words(train["review"][i]))

'''
Test Data
'''
test = pd.read_csv("../Sentiment/data/testData.tsv", header = 0, delimiter = "\t", quoting = 3)

num_reviews = len(test["review"])
clean_test_reviews = []

print("Cleaning and parsing the test set movie reviews...")
for i in range(0, num_reviews):
    clean_review = review_to_words(test["review"][i])
    clean_test_reviews.append(clean_review)


# Unlabeled Train Data
unlabeled_reviews = pd.read_csv("../Sentiment/data/unlabeledTrainData.tsv", header = 0, delimiter = "\t", quoting = 3)
num_reviews = len(unlabeled_reviews["review"])
clean_unlabeled_reviews = []

print("Cleaning and parsing the unlabeled training set movie reviews...")
for i in range(0, num_reviews):
    if (i + 1) % 5000 == 0:
        print("Review %d of %d\n" % (i+1, num_reviews))
    clean_review = review_to_words(unlabeled_reviews["review"][i])
    clean_unlabeled_reviews.append(clean_review)

Cleaning and parsing the training set movie reviews...
Cleaning and parsing the test set movie reviews...
Cleaning and parsing the unlabeled training set movie reviews...
Review 5000 of 50000

Review 10000 of 50000

Review 15000 of 50000

Review 20000 of 50000

Review 25000 of 50000

Review 30000 of 50000

Review 35000 of 50000

Review 40000 of 50000

Review 45000 of 50000

Review 50000 of 50000



In [3]:
# tag all reviews
train_tagged = tag_reviews(clean_train_reviews, 'TRAIN')
test_tagged = tag_reviews(clean_test_reviews, 'TEST')
unlabeled_train_tagged = tag_reviews(clean_unlabeled_reviews, 'UNTRAIN')

In [4]:
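# model construction (dm=0 selects the DBOW variant)
# note: gensim >= 4.0 renames `size` to `vector_size` and `docvecs` to `dv`
model_dbow = Doc2Vec(min_count=1, window=10, size=100, sample=1e-3, negative=5, dm=0, workers=3)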

# build vocabulary
all_tagged = []
tag_objects = [train_tagged, test_tagged, unlabeled_train_tagged]
for tag_object in tag_objects:
    for tag in tag_object:
        all_tagged.append(tag)

model_dbow.build_vocab(all_tagged)

# train the DBOW model on the labeled and unlabeled training reviews
train_tagged2 = []
tag_objects = [train_tagged, unlabeled_train_tagged]
for tag_object in tag_objects:
    for tag in tag_object:
        train_tagged2.append(tag)

for i in range(10):
    shuffle(train_tagged2)
    model_dbow.train(train_tagged2, total_examples=len(train_tagged2), epochs=1, start_alpha=0.025, end_alpha=0.025)


train_array_dbow = []
for i in range(len(train_tagged)):
    tag = train_tagged[i].tags[0]
    train_array_dbow.append(model_dbow.docvecs[tag])

train_target = train['sentiment'].values

test_array_dbow = []
for i in range(len(test_tagged)):
    test_array_dbow.append(model_dbow.infer_vector(test_tagged[i].words))


In [5]:
from sklearn.svm import SVC

clf = SVC(C=1.0, kernel='rbf')
clf.fit(train_array_dbow, train_target)
result = clf.predict(test_array_dbow)

print("output...")
output = pd.DataFrame(data={'id': test['id'], 'sentiment': result})
output.to_csv('doc2vec_svm100.csv', index=False, quoting=3)

output...


In [6]:
from sklearn.linear_model import LogisticRegression

lr_dbow = LogisticRegression()
lr_dbow.fit(train_array_dbow, train_target)
result_dbow = lr_dbow.predict(test_array_dbow)

print("output...")
output_dbow = pd.DataFrame(data={'id': test['id'], 'sentiment': result_dbow})
output_dbow.to_csv('doc2vec_lr100.csv', index=False, quoting=3)

output...
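
One way to compare the two classifiers without writing out a prediction file is cross-validation on the labeled training vectors. The sketch below uses synthetic data in place of `train_array_dbow` and `train_target`, so the numbers it prints say nothing about the 0.871 and 0.874 scores quoted at the top:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the document vectors and sentiment labels (assumption).
X, y = make_classification(n_samples=300, n_features=100, random_state=0)

# Mean 5-fold cross-validated accuracy for each classifier.
candidates = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('svm', SVC(C=1.0, kernel='rbf')),
]
scores = {}
for name, clf in candidates:
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
print(scores)
```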


Note that both results above are still hard 0/1 predictions. Next, we produce probabilities instead.

In [7]:
result_dbow_prob = lr_dbow.predict_proba(test_array_dbow)


In [10]:
result_dbow[:10]

array([1, 0, 0, 0, 1, 1, 0, 0, 0, 1])

In [11]:
result_dbow_prob[:10]

array([[  8.98102520e-04,   9.99101897e-01],
       [  9.98541567e-01,   1.45843297e-03],
       [  7.78536606e-01,   2.21463394e-01],
       [  8.98943653e-01,   1.01056347e-01],
       [  1.05363912e-01,   8.94636088e-01],
       [  2.49067824e-01,   7.50932176e-01],
       [  9.05103156e-01,   9.48968442e-02],
       [  8.19656551e-01,   1.80343449e-01],
       [  9.74803625e-01,   2.51963749e-02],
       [  4.56141579e-01,   5.43858421e-01]])

The second column of the output is the probability of class 1, and that is the only column we need.

In [12]:
result_dbow_prob[:, 1]

array([ 0.9991019 ,  0.00145843,  0.22146339, ...,  0.12014127,
        0.99536379,  0.58438488])
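
The column order of `predict_proba` follows the fitted classifier's `classes_` attribute, so the column for label 1 can be looked up rather than assumed. A small sketch on toy data (the arrays below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-dimensional data with labels 0 and 1 (illustrative only).
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

lr = LogisticRegression().fit(X, y)

# classes_ lists the labels in the same order as the predict_proba columns.
pos_col = list(lr.classes_).index(1)
p_positive = lr.predict_proba(X)[:, pos_col]
print(lr.classes_, pos_col)
```

With labels 0 and 1 this lookup gives the same column as the `[:, 1]` slice used here, but it keeps working if the labels are ever encoded differently.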

In [13]:
print("output...")
output_dbow_prob = pd.DataFrame(data={'id': test['id'], 'sentiment': result_dbow_prob[:, 1]})
output_dbow_prob.to_csv('../Sentiment/result/doc2vec_lr100_prob.csv', index=False, quoting=3)

output...


To be able to read the sentence vectors in Part 3.5, this notebook saves the train and test sentence vectors to txt files. For now, though, I'll just save the model directly first.

In [14]:
model_dbow.save('../Sentiment/src/deep/model/doc2vec_lr100')

Check that the saved model can be loaded and used normally:

In [16]:
test_model = Doc2Vec.load('../Sentiment/src/deep/model/doc2vec_lr100')

In [17]:
test_model.docvecs['TRAIN_0']

array([ 0.14595501,  0.38399282,  0.06572972,  0.30974752,  0.67297232,
        0.13194489, -0.05424781,  0.28447962, -0.23332863,  0.44351801,
       -0.00948488, -0.87945515,  0.56327236,  0.26428932, -0.34765893,
       -0.17097975, -0.45460328, -0.12888889, -0.48940602, -0.01185165,
       -0.24453115, -0.04505147,  0.09383383, -0.16496325,  0.01960274,
       -0.29901358, -0.13207597, -0.10162185,  0.20436931,  0.13023561,
        0.22586688,  0.75536847,  0.24891821,  0.14947703,  0.00144878,
       -0.20468356, -0.31889659,  0.04161833, -0.64493978,  0.25871462,
       -0.61675662, -0.12647435,  0.84288538,  0.19948879, -0.4759973 ,
        0.12623964, -0.36842909, -0.22224943,  0.23471437, -0.07343078,
       -0.26600158,  0.08183515,  0.1728107 ,  0.56280148,  0.23905422,
        0.22810945,  0.13373871,  0.17811313,  0.02367399, -0.54043096,
        0.64316767, -0.83761817,  0.48490623, -0.30863473, -0.06078329,
       -0.18348159, -0.02447214, -0.13533106,  0.17773797, -0.08

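Hmm, it turns out that reading the vectors back still requires an object carrying the tag information, such as train_tagged. I'll just save the processed vectors directly instead, that is, train_array_dbow and test_array_dbow.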

In [21]:
np.savetxt('../Sentiment/data/train_feature_d2v.txt', train_array_dbow)

In [22]:
train_array_dbow[0]

array([ 0.14595501,  0.38399282,  0.06572972,  0.30974752,  0.67297232,
        0.13194489, -0.05424781,  0.28447962, -0.23332863,  0.44351801,
       -0.00948488, -0.87945515,  0.56327236,  0.26428932, -0.34765893,
       -0.17097975, -0.45460328, -0.12888889, -0.48940602, -0.01185165,
       -0.24453115, -0.04505147,  0.09383383, -0.16496325,  0.01960274,
       -0.29901358, -0.13207597, -0.10162185,  0.20436931,  0.13023561,
        0.22586688,  0.75536847,  0.24891821,  0.14947703,  0.00144878,
       -0.20468356, -0.31889659,  0.04161833, -0.64493978,  0.25871462,
       -0.61675662, -0.12647435,  0.84288538,  0.19948879, -0.4759973 ,
        0.12623964, -0.36842909, -0.22224943,  0.23471437, -0.07343078,
       -0.26600158,  0.08183515,  0.1728107 ,  0.56280148,  0.23905422,
        0.22810945,  0.13373871,  0.17811313,  0.02367399, -0.54043096,
        0.64316767, -0.83761817,  0.48490623, -0.30863473, -0.06078329,
       -0.18348159, -0.02447214, -0.13533106,  0.17773797, -0.08

Load the saved data back and check the result:

In [23]:
test_train_d2v = np.loadtxt('../Sentiment/data/train_feature_d2v.txt')

In [24]:
test_train_d2v.shape

(25000, 100)

In [25]:
test_train_d2v[0]

array([ 0.14595501,  0.38399282,  0.06572972,  0.30974752,  0.67297232,
        0.13194489, -0.05424781,  0.28447962, -0.23332863,  0.44351801,
       -0.00948488, -0.87945515,  0.56327236,  0.26428932, -0.34765893,
       -0.17097975, -0.45460328, -0.12888889, -0.48940602, -0.01185165,
       -0.24453115, -0.04505147,  0.09383383, -0.16496325,  0.01960274,
       -0.29901358, -0.13207597, -0.10162185,  0.20436931,  0.13023561,
        0.22586688,  0.75536847,  0.24891821,  0.14947703,  0.00144878,
       -0.20468356, -0.31889659,  0.04161833, -0.64493978,  0.25871462,
       -0.61675662, -0.12647435,  0.84288538,  0.19948879, -0.4759973 ,
        0.12623964, -0.36842909, -0.22224943,  0.23471437, -0.07343078,
       -0.26600158,  0.08183515,  0.1728107 ,  0.56280148,  0.23905422,
        0.22810945,  0.13373871,  0.17811313,  0.02367399, -0.54043096,
        0.64316767, -0.83761817,  0.48490623, -0.30863473, -0.06078329,
       -0.18348159, -0.02447214, -0.13533106,  0.17773797, -0.08

Identical, no problem.
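
Comparing printed arrays by eye only covers the first row; `np.allclose` can confirm the whole round trip. A sketch with a random array and a temporary file standing in for the real feature file (both are assumptions for illustration):

```python
import os
import tempfile

import numpy as np

# Random float32 matrix standing in for the saved document vectors.
rng = np.random.RandomState(0)
vectors = rng.uniform(-1, 1, size=(5, 100)).astype(np.float32)

# Write to a temporary text file and read it back.
path = os.path.join(tempfile.mkdtemp(), 'features.txt')
np.savetxt(path, vectors)
loaded = np.loadtxt(path)

# True only if every element matches within floating-point tolerance.
same = np.allclose(vectors, loaded)
print(loaded.shape, same)
```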

In [26]:
np.savetxt('../Sentiment/data/test_feature_d2v.txt', test_array_dbow)