# Part 2 Doc2vec (original method implementation, verbose version)

# References

[A gentle introduction to Doc2Vec](https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e)

[word2vec-sentiments](https://github.com/linanqiu/word2vec-sentiments)

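As the article above notes, doc2vec is an unsupervised method that works well for sentiment analysis. Like word2vec, once it is trained, every paragraph can be represented by a vector, so these vectors can be used as features for training a model.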

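There is one question, though. The input to doc2vec is an iterator of LabeledSentence objects. Each LabeledSentence object represents one sentence and consists of two parts: a list of words and a list of labels. But if there are labels, isn't this supervised learning? And wouldn't that make the unlabeledTrainData in the dataset unusable?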
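
On the question above: the list of labels holds a unique document identifier rather than a sentiment label, so training stays unsupervised and the unlabeled reviews can be used as well. A minimal sketch, using a standard-library `namedtuple` as a stand-in for gensim's `TaggedDocument` (the stand-in and the sample reviews are assumptions for illustration):

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which exposes the same two fields.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

labeled = TaggedDocument(words=["great", "movie"], tags=["TRAIN_0"])
unlabeled = TaggedDocument(words=["saw", "it", "yesterday"], tags=["UNTRAIN_0"])

# The tag identifies the document; it carries no sentiment information.
for doc in (labeled, unlabeled):
    print(doc.tags[0], doc.words)
```

The sentiment column only enters later, when a classifier is fitted on the learned vectors.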

# Method 1: Train only on the labeled dataset

This part follows the approach of [word2vec-sentiments](https://github.com/linanqiu/word2vec-sentiments).

In [131]:
# gensim modules
from gensim import utils
from gensim.models.doc2vec import TaggedDocument
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec

# numpy
import numpy as np

# classifier
from sklearn.linear_model import LogisticRegression

# random
from random import shuffle

# preprocess packages
import pandas as pd
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords # import the stop word list


## 1.1 Input Format

- `train-neg.txt`: 12500 negative movie reviews from the training data
- `train-pos.txt`: 12500 positive movie reviews from the training data
- `train-unsup.txt`: 50000 unlabelled movie reviews

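Below are two sample reviews. The raw text must be preprocessed first, for example lowercased and stripped of punctuation. Each sample occupies exactly one line, and lines are separated by a newline, so that the samples can be parsed correctly.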

```
once again mr costner has dragged out a movie for far longer than necessary aside from the terrific sea rescue sequences of which there are very few i just did not care about any of the characters most of us have ghosts in the closet and costner s character are realized early on and then forgotten until much later by which time i did not care the character we should really care about is a very cocky overconfident ashton kutcher the problem is he comes off as kid who thinks he s better than anyone else around him and shows no signs of a cluttered closet his only obstacle appears to be winning over costner finally when we are well past the half way point of this stinker costner tells us all about kutcher s ghosts we are told why kutcher is driven to be the best with no prior inkling or foreshadowing no magic here it was all i could do to keep from turning it off an hour in
this is an example of why the majority of action films are the same generic and boring there s really nothing worth watching here a complete waste of the then barely tapped talents of ice t and ice cube who ve each proven many times over that they are capable of acting and acting well don t bother with this one go see new jack city ricochet or watch new york undercover for ice t or boyz n the hood higher learning or friday for ice cube and see the real deal ice t s horribly cliched dialogue alone makes this film grate at the teeth and i m still wondering what the heck bill paxton was doing in this film and why the heck does he always play the exact same character from aliens onward every film i ve seen with bill paxton has him playing the exact same irritating character and at least in aliens his character died which made it somewhat gratifying overall this is second rate action trash there are countless better films to see and if you really want to see this one watch judgement night which is practically a carbon copy but has better acting and a better script the only thing that made this at all worth watching was a decent hand on the camera the cinematography was almost refreshing which comes close to making up for the horrible film itself but not quite
```
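
The normalization described above (lowercasing, dropping punctuation, one review per line) can be sketched as follows. The function name `normalize_review` and the two input strings are hypothetical; this is not the code used by word2vec-sentiments.

```python
import re

def normalize_review(text):
    """Lowercase the text and replace every run of non-letters with one space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()

reviews = [
    "Once again, Mr. Costner has dragged out a movie!",
    "This is an example\nof a review spanning two lines.",
]

# Newlines inside a review are removed by the normalization, so "\n"
# can safely act as the separator between reviews.
corpus_text = "\n".join(normalize_review(r) for r in reviews)
print(corpus_text)
```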

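For preprocessing, the original author only lowercased the text and removed punctuation. I also plan to strip HTML and stop words, which can be done directly with the function written earlier.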

In [8]:
train = pd.read_csv("../Sentiment/data/labeledTrainData.tsv", header=0, 
                         delimiter='\t', quoting=3, error_bad_lines=False)

In [5]:
def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "lxml").get_text() 
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))                  
    # 
    # 5. Remove stop words
    meaningful_words = [w for w in words if not w in stops]   
    #
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( meaningful_words )) 

In [9]:
print("Cleaning and parsing the training set movie reviews...\n")

# number of reviews
num_reviews = train['review'].size

# initialize an empty list to hold the clean reviews
clean_train_reviews = []

for i in range( 0, num_reviews):
    # If the index is evenly divisible by 5000, print a message
    if( (i+1)%5000 == 0 ):
        print("Review %d of %d\n" % (i+1, num_reviews))                                                                  
    clean_train_reviews.append( review_to_words( train["review"][i] ))

Cleaning and parsing the training set movie reviews...

Review 5000 of 25000

Review 10000 of 25000

Review 15000 of 25000

Review 20000 of 25000

Review 25000 of 25000



Apply the same processing to the test data:

In [11]:
# Test Data
test = pd.read_csv("../Sentiment/data/testData.tsv", header = 0, delimiter = "\t", quoting = 3)
num_reviews = len(test["review"])
clean_test_reviews = []

print("Cleaning and parsing the test set movie reviews...")
for i in range( 0, num_reviews):
    if( (i+1)%5000 == 0 ):
        print("Review %d of %d\n" % (i+1, num_reviews))                                                                  
    clean_review = review_to_words(test["review"][i])
    clean_test_reviews.append(clean_review)

Cleaning and parsing the test set movie reviews...
Review 5000 of 25000

Review 10000 of 25000

Review 15000 of 25000

Review 20000 of 25000

Review 25000 of 25000



Apply the same processing to the unlabeled data:

In [15]:
# Unlabeled Train Data 
unlabeled_reviews = pd.read_csv("../Sentiment/data/unlabeledTrainData.tsv", header = 0, delimiter = "\t", quoting = 3)
num_reviews = len(unlabeled_reviews["review"])
clean_unlabeled_reviews = []

print("Cleaning and parsing the unlabeled set movie reviews...")
for i in range( 0, num_reviews):
    if( (i+1)%5000 == 0 ):
        print("Review %d of %d\n" % (i+1, num_reviews))
    clean_review = review_to_words(unlabeled_reviews["review"][i])
    clean_unlabeled_reviews.append(clean_review)


Cleaning and parsing the unlabeled set movie reviews...
Review 5000 of 50000

Review 10000 of 50000

Review 15000 of 50000

Review 20000 of 50000

Review 25000 of 50000

Review 30000 of 50000

Review 35000 of 50000

Review 40000 of 50000

Review 45000 of 50000

Review 50000 of 50000



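Training Gensim's Doc2Vec requires every document or sentence to carry a label that identifies it uniquely. Gensim offers two classes for this, TaggedDocument and LabeledSentence. After some investigation, TaggedDocument turns out to be the newer one and will eventually replace LabeledSentence, so TaggedDocument is used here.

We use Gensim's built-in TaggedDocument. Tags have the form "TRAIN_i" and "TEST_i" (and "UNTRAIN_i" for the unlabeled reviews), where i is the index.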

In [178]:
def tag_reviews(reviews, prefix):
    tagged = []
    for i, review in enumerate(reviews):
        tagged.append(TaggedDocument(words=review.split(), tags=[prefix + '_%s' % i]))
    return tagged

In [179]:
train_tagged = tag_reviews(clean_train_reviews, 'TRAIN')
test_tagged = tag_reviews(clean_test_reviews, 'TEST')
unlabeled_train_tagged = tag_reviews(clean_unlabeled_reviews, 'UNTRAIN')
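
A quick sanity check on the tagging scheme above: because each collection gets its own prefix, the tags stay unique even though the index restarts at 0 for every collection. A minimal sketch with a hypothetical helper `make_tags` and small made-up corpus sizes:

```python
def make_tags(count, prefix):
    """Build tags of the form '<prefix>_<i>' for i in 0..count-1."""
    return ["%s_%s" % (prefix, i) for i in range(count)]

# Small hypothetical corpus sizes, chosen only for illustration.
tags = make_tags(3, "TRAIN") + make_tags(2, "TEST") + make_tags(4, "UNTRAIN")

# Every tag must be unique, otherwise two reviews would share one vector.
assert len(tags) == len(set(tags))
print(tags)
```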

The input data is now ready. Let's look at some of it:

In [180]:
train_tagged[0]

TaggedDocument(words=['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', 'mj', 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing', 'psychopathic', 'powerf

In [181]:
test_tagged[4]

TaggedDocument(words=['accurate', 'depiction', 'small', 'time', 'mob', 'life', 'filmed', 'new', 'jersey', 'story', 'characters', 'script', 'believable', 'acting', 'drops', 'ball', 'still', 'worth', 'watching', 'especially', 'strong', 'images', 'still', 'even', 'though', 'first', 'viewed', 'years', 'ago', 'young', 'hood', 'steps', 'starts', 'bigger', 'things', 'tries', 'things', 'keep', 'going', 'wrong', 'leading', 'local', 'boss', 'suspect', 'end', 'skimmed', 'good', 'place', 'enjoy', 'health', 'life', 'film', 'introduced', 'joe', 'pesce', 'martin', 'scorsese', 'also', 'present', 'perennial', 'screen', 'wise', 'guy', 'frank', 'vincent', 'strong', 'characterizations', 'visuals', 'sound', 'muddled', 'much', 'acting', 'amateurish', 'great', 'story'], tags=['TEST_4'])

In [182]:
unlabeled_train_tagged[8]

TaggedDocument(words=['well', 'made', 'gritty', 'science', 'fiction', 'movie', 'could', 'lost', 'among', 'hundreds', 'similar', 'movies', 'several', 'strong', 'points', 'keep', 'near', 'top', 'one', 'writing', 'directing', 'solid', 'manages', 'part', 'avoid', 'many', 'sci', 'fi', 'cliches', 'though', 'good', 'job', 'keeping', 'suspense', 'landscape', 'look', 'movie', 'appeal', 'sci', 'fi', 'fans', 'looking', 'masterpiece', 'looking', 'good', 'old', 'fashioned', 'post', 'apoc', 'gritty', 'future', 'space', 'sci', 'fi', 'good', 'suspense', 'special', 'effects', 'movie', 'thoroughly', 'enjoyable', 'good', 'ending'], tags=['UNTRAIN_8'])

Below is the same thing implemented with LabeledSentence; the result is the same.

In [None]:
def labelizeReviews(reviews, label_type):
    labelized = []
    for i, review in enumerate(reviews):
        label = '%s_%s'%(label_type, i)
        labelized.append(LabeledSentence(review.split(), [label]))
    return labelized

In [None]:
train_tagged = labelizeReviews(clean_train_reviews, 'TRAIN')
test_tagged = labelizeReviews(clean_test_reviews, 'TEST')
unlabeled_train_tagged = labelizeReviews(clean_unlabeled_reviews, 'UNTRAIN')

# Model 

## Building the Vocabulary Table

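build_vocab is used here, and it can be given all of the data, including the test set. It only builds the vocabulary, so the presence of the test set does not cause data leakage.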
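
To illustrate why no sentiment information leaks at this stage, here is a simplified sketch of what a vocabulary pass does: it counts words and registers document tags, and never sees a sentiment column. The function `scan_vocab` is hypothetical and is not gensim's actual implementation.

```python
from collections import Counter

def scan_vocab(documents, min_count=1):
    """Count word frequencies and collect document tags in one pass."""
    counts = Counter()
    tags = []
    for words, doc_tags in documents:
        counts.update(words)
        tags.extend(doc_tags)
    # Keep only words that occur at least min_count times.
    vocab = {w for w, c in counts.items() if c >= min_count}
    return vocab, tags

docs = [
    (["good", "film"], ["TRAIN_0"]),
    (["bad", "film"], ["TEST_0"]),
]
vocab, tags = scan_vocab(docs)
print(sorted(vocab), tags)
```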

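Like word2vec, doc2vec has two model variants: PV-DM, which resembles CBOW, and PV-DBOW, which resembles skip-gram. According to the article, PV-DM performs better and often reaches state-of-the-art results. Note, however, that the cells below set dm=0 and therefore train PV-DBOW (hence the name model_dbow).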

dm (int {1,0}) – Defines the training algorithm. If dm=1, ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.


In [183]:
model_dbow = Doc2Vec(min_count=1, window=10, size=100, sample=1e-3, negative=5, dm=0, workers=3)

In [61]:
model_dbow.build_vocab([train_tagged, test_tagged, unlabeled_train_tagged])

AttributeError: 'list' object has no attribute 'words'

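A problem arises here. When building the vocabulary, we want to use all three of [train_tagged, test_tagged, unlabeled_train_tagged], but neither `np.concatenate((train_tagged, test_tagged, unlabeled_train_tagged))` nor the `[train_tagged, test_tagged, unlabeled_train_tagged]` form merges them into a single input. As a workaround, the three clean_reviews lists are merged first, which directly yields one collection containing all reviews.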

In [92]:
all_clean_reviews = np.concatenate((clean_train_reviews, clean_test_reviews, clean_unlabeled_reviews))

In [93]:
all_tagged = tag_reviews(all_clean_reviews, 'ALL')

In [94]:
# train_tagged 
# test_tagged 
# unlabeled_train_tagged
# all_tagged contains the three parts above
len(all_tagged)

100000

Another approach is to create an empty list, iterate over the three tagged collections, and append every item to that list:

In [184]:
all_tagged = []

In [185]:
tag_objects = [train_tagged, test_tagged, unlabeled_train_tagged]
for tag_object in tag_objects:
    for tag in tag_object:
        all_tagged.append(tag)

In [189]:
len(all_tagged)

100000
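
The earlier `AttributeError` comes from passing a list of three lists, so each element handed to `build_vocab` is itself a list with no `words` attribute. Besides the loop above, plain list concatenation or `itertools.chain` flattens the three collections. A minimal sketch with a `namedtuple` stand-in for `TaggedDocument` and tiny hypothetical inputs:

```python
from collections import namedtuple
from itertools import chain

TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])  # stand-in

train_docs = [TaggedDocument(["good"], ["TRAIN_0"])]
test_docs = [TaggedDocument(["bad"], ["TEST_0"])]
unlabeled_docs = [TaggedDocument(["fine"], ["UNTRAIN_0"])]

nested = [train_docs, test_docs, unlabeled_docs]
print(hasattr(nested[0], "words"))  # False: each element is a list

all_docs = list(chain(train_docs, test_docs, unlabeled_docs))
# Equivalent: all_docs = train_docs + test_docs + unlabeled_docs
print(all(hasattr(d, "words") for d in all_docs))  # True
```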

Try building the vocabulary again:

In [196]:
model_dbow = Doc2Vec(min_count=1, window=10, size=100, sample=1e-3, negative=5, dm=0, workers=3, seed=2)

In [197]:
model_dbow.build_vocab(all_tagged, progress_per=20000)

Train for multiple passes; the training data should be reshuffled before each pass to improve accuracy.

In [198]:
epoch_num = 10

In [199]:
len(train_tagged)

25000

In [200]:
# training set
model_dbow.train(train_tagged, total_examples=len(train_tagged), epochs=10)

28992769

In [202]:
model_dbow.most_similar('good')

[('yamashita', 0.4659208655357361),
 ('gombon', 0.4030294716358185),
 ('choirs', 0.39732831716537476),
 ('meighan', 0.39478492736816406),
 ('bekim', 0.393124520778656),
 ('counsil', 0.3909412622451782),
 ('findus', 0.38706544041633606),
 ('rhoades', 0.38631200790405273),
 ('unwit', 0.3849732279777527),
 ('corporatization', 0.38302838802337646)]

In [204]:
model_dbow.docvecs['TRAIN_0']

array([ 0.23571487,  0.45775369, -0.19429441,  0.19739306,  0.06349532,
       -0.32546872, -0.29554409,  0.11032848, -0.33108968,  0.47355616,
       -0.18420415, -0.12026758,  0.11244816,  0.39083016,  0.07875296,
        0.0182258 ,  0.08871382, -0.46112907,  0.06538947, -0.30757257,
       -0.03677357,  0.4571391 , -0.11571738, -0.28078708, -0.3350139 ,
        0.06919808, -0.11012798,  0.29829952, -0.12432194,  0.3541871 ,
       -0.03057882,  0.06124721,  0.10388241,  0.090198  , -0.07303879,
       -0.30610693, -0.05933514,  0.01993979,  0.35669476, -0.32851699,
       -0.40991127,  0.03481026,  0.29278004,  0.07365297, -0.10998157,
        0.10443848, -0.08990712,  0.1885464 ,  0.03642843, -0.2410778 ,
        0.41257307,  0.16583849, -0.25326484,  0.09907204, -0.24363291,
       -0.72867095, -0.16245642,  0.14579734, -0.19472615, -0.28438318,
        0.29534045,  0.43244103, -0.00199401, -0.07133139, -0.14348222,
       -0.17259851,  0.08573415,  0.32934815, -0.15625125,  0.32

In [210]:
np.shape(model_dbow.docvecs)

(100000, 100)

In [218]:
model_dbow.docvecs['TEST_0']

array([ 0.00421057,  0.00187568, -0.00238172, -0.00200227,  0.00264446,
        0.00183726,  0.00278547,  0.00327837, -0.0033134 ,  0.0025815 ,
       -0.00311388, -0.00161481,  0.00120862, -0.00446639, -0.00444229,
        0.00184929,  0.00492307, -0.00024591, -0.00163187, -0.00164549,
       -0.00123777, -0.00072104,  0.00135876,  0.00248231,  0.00436443,
       -0.00171494,  0.00202779, -0.00377534,  0.00079256, -0.00202123,
        0.00443694,  0.00239019,  0.00180055, -0.0011058 , -0.00277819,
        0.00379431, -0.00499493, -0.00345632,  0.00391399, -0.00172628,
       -0.0010942 ,  0.00473932, -0.00321528, -0.00195504, -0.00295239,
       -0.0042989 ,  0.00058964, -0.00414301,  0.00104371, -0.00337555,
        0.00157633, -0.00178631, -0.00185223,  0.00257274, -0.00103555,
        0.00420367,  0.00315895,  0.00051415,  0.00232465, -0.00388533,
       -0.00181878,  0.00057972, -0.00340101, -0.00182394, -0.00126339,
        0.00448554, -0.00453214,  0.00489036,  0.00281866, -0.00

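Vectors can now be retrieved by tag as expected, possibly because I changed tag_reviews and removed the shuffle. But why does the test set have vectors too? Only train_tagged was passed to training. My guess was that the test documents were somehow used for training. A more likely explanation is that build_vocab was called on all_tagged, so a vector was allocated for every tag; the TEST_ vectors were never updated and still hold their small random initial values (every component shown above lies within ±0.005).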
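
One way to check whether a document vector was ever trained is to compare it with the range of the initial random values. The sketch below assumes that each component is initialized uniformly within ±0.5/size (an assumption about library internals, consistent with the `TEST_0` values above for size=100); the helper name `looks_untrained` is hypothetical.

```python
def looks_untrained(vector, size):
    """Return True if every component lies within the assumed initial range."""
    bound = 0.5 / size  # assumed initialization bound, not a documented guarantee
    return all(abs(x) <= bound for x in vector)

# First five components of the two vectors printed above.
train_0 = [0.23571487, 0.45775369, -0.19429441, 0.19739306, 0.06349532]
test_0 = [0.00421057, 0.00187568, -0.00238172, -0.00200227, 0.00264446]

print(looks_untrained(train_0, size=100))  # False
print(looks_untrained(test_0, size=100))   # True
```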

In [215]:
train_array = []
for i in range(len(train_tagged)):
    train_array.append(model_dbow.docvecs[i])

In [229]:
train_target = train['sentiment'].values

In [221]:
test_tagged[0]

TaggedDocument(words=['naturally', 'film', 'main', 'themes', 'mortality', 'nostalgia', 'loss', 'innocence', 'perhaps', 'surprising', 'rated', 'highly', 'older', 'viewers', 'younger', 'ones', 'however', 'craftsmanship', 'completeness', 'film', 'anyone', 'enjoy', 'pace', 'steady', 'constant', 'characters', 'full', 'engaging', 'relationships', 'interactions', 'natural', 'showing', 'need', 'floods', 'tears', 'show', 'emotion', 'screams', 'show', 'fear', 'shouting', 'show', 'dispute', 'violence', 'show', 'anger', 'naturally', 'joyce', 'short', 'story', 'lends', 'film', 'ready', 'made', 'structure', 'perfect', 'polished', 'diamond', 'small', 'changes', 'huston', 'makes', 'inclusion', 'poem', 'fit', 'neatly', 'truly', 'masterpiece', 'tact', 'subtlety', 'overwhelming', 'beauty'], tags=['TEST_0'])

## Test data, approach 1
There are many ways to obtain vectors for the test data. Here the model is used to infer a vector for each test review.

In [224]:
test_array = []
for i in range(len(test_tagged)):
    test_array.append(model_dbow.infer_vector(test_tagged[i].words))


In [225]:
test_array[0]

array([-0.16741326,  0.55852312, -0.03362535,  0.31930959,  0.08296984,
       -0.07947304, -0.07797807,  0.14376642, -0.15254921, -0.2210415 ,
       -0.02304331, -0.17412151,  0.34129703,  0.18006454,  0.34570119,
       -0.14487725,  0.02051304, -0.57484555, -0.05133546,  0.01037545,
        0.22275317,  0.38014448, -0.31867149, -0.06234517, -0.22119807,
       -0.07582545, -0.100991  ,  0.35394457,  0.13840154, -0.16435973,
       -0.0184453 , -0.13852729,  0.26403272,  0.06933542,  0.05637591,
       -0.28032136,  0.12045938,  0.27054679,  0.34062421, -0.40137991,
       -0.2334284 ,  0.49347696, -0.32019919, -0.02283208, -0.05977324,
       -0.11325342, -0.84192359,  0.13864151, -0.4063195 , -0.1884138 ,
        0.10607169,  0.28732526,  0.15027149, -0.28106824, -0.40275922,
       -0.46907219, -0.0948663 ,  0.01465809, -0.0377506 , -0.51499683,
        0.05009205,  0.36480644,  0.08054388,  0.03245149, -0.08314686,
       -0.04861232, -0.21862179,  0.11567195, -0.05528564,  0.06

## Test data, approach 2

This time, take the vectors stored in the trained model under the test tags directly:

In [241]:
test_tagged[0].tags[0]

'TEST_0'

In [242]:
test_array = []
for i in range(len(test_tagged)):
    tag = test_tagged[i].tags[0]
    test_array.append(model_dbow.docvecs[tag])

In [243]:
test_array[0]

array([ 0.00421057,  0.00187568, -0.00238172, -0.00200227,  0.00264446,
        0.00183726,  0.00278547,  0.00327837, -0.0033134 ,  0.0025815 ,
       -0.00311388, -0.00161481,  0.00120862, -0.00446639, -0.00444229,
        0.00184929,  0.00492307, -0.00024591, -0.00163187, -0.00164549,
       -0.00123777, -0.00072104,  0.00135876,  0.00248231,  0.00436443,
       -0.00171494,  0.00202779, -0.00377534,  0.00079256, -0.00202123,
        0.00443694,  0.00239019,  0.00180055, -0.0011058 , -0.00277819,
        0.00379431, -0.00499493, -0.00345632,  0.00391399, -0.00172628,
       -0.0010942 ,  0.00473932, -0.00321528, -0.00195504, -0.00295239,
       -0.0042989 ,  0.00058964, -0.00414301,  0.00104371, -0.00337555,
        0.00157633, -0.00178631, -0.00185223,  0.00257274, -0.00103555,
        0.00420367,  0.00315895,  0.00051415,  0.00232465, -0.00388533,
       -0.00181878,  0.00057972, -0.00340101, -0.00182394, -0.00126339,
        0.00448554, -0.00453214,  0.00489036,  0.00281866, -0.00

# Logistic Regression model 

In [None]:
from sklearn.linear_model import LogisticRegression


In [230]:
lr = LogisticRegression()

In [232]:
train_array = np.array(train_array)

In [233]:
train_array.shape

(25000, 100)

In [234]:
lr.fit(train_array, train_target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [235]:
result = lr.predict(test_array)

In [236]:
result

array([1, 0, 1, ..., 0, 1, 1])

In [238]:
print("output...")
output = pd.DataFrame(data={'id': test['id'], 'sentiment': result})
output.to_csv('doc2vec2.csv', index=False, quoting=3)

output...


Submitting with the first kind of test data scores 0.81. Next, try the second kind:

In [244]:
result2 = lr.predict(test_array)

In [245]:
print("output...")
output = pd.DataFrame(data={'id': test['id'], 'sentiment': result2})
output.to_csv('doc2vec3.csv', index=False, quoting=3)

output...


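The second approach scores 0.5, which is dismal. Apparently the test vectors should not be taken straight from the model, at least when the test documents were not part of training.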

# Adding more training data

When training doc2vec above, only train_tagged was used. This time both train_tagged and unlabeled_train_tagged are used.

In [246]:
train_tagged2 = []
tag_objects = [train_tagged, unlabeled_train_tagged]
for tag_object in tag_objects:
    for tag in tag_object:
        train_tagged2.append(tag)

In [249]:
model_dbow = Doc2Vec(min_count=1, window=10, size=100, sample=1e-3, negative=5, dm=0, workers=3)

In [250]:
model_dbow.build_vocab(all_tagged, progress_per=20000)

In [251]:
model_dbow.train(train_tagged2, total_examples=len(train_tagged2), epochs=10, start_alpha=0.025, end_alpha=0.005)

87255268

In [254]:
train_tagged[0].tags

['TRAIN_0']

In [255]:
train_array = []
for i in range(len(train_tagged)):
    tag = train_tagged[i].tags[0]
    train_array.append(model_dbow.docvecs[tag])

In [256]:
train_array = np.array(train_array)
train_array.shape

(25000, 100)

In [257]:
train_target

array([1, 1, 0, ..., 0, 0, 1])

In [258]:
test_array = []
for i in range(len(test_tagged)):
    test_array.append(model_dbow.infer_vector(test_tagged[i].words))


In [259]:
lr = LogisticRegression()

In [260]:
lr.fit(train_array, train_target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [261]:
result = lr.predict(test_array)

In [262]:
result

array([1, 0, 0, ..., 0, 1, 1])

In [263]:
print("output...")
output = pd.DataFrame(data={'id': test['id'], 'sentiment': result})
output.to_csv('doc2vec4.csv', index=False, quoting=3)

output...


The score is 0.855, which is decent and at least higher than the earlier bag-of-words result. Next, I try a random forest instead.

In [264]:
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100) 

forest = forest.fit( train_array, train_target)

In [265]:
result = forest.predict(test_array)

In [266]:
print("output...")
output = pd.DataFrame(data={'id': test['id'], 'sentiment': result})
output.to_csv('doc2vec5.csv', index=False, quoting=3)

output...


The score is 0.775... considerably worse.

# Model tuning

This attempt shuffles the data before each of several training passes.

In [267]:
model_dbow = Doc2Vec(min_count=1, window=10, size=100, sample=1e-3, negative=5, dm=0, workers=3)
model_dbow.build_vocab(all_tagged, progress_per=20000)

In [268]:
for i in range(10):
    shuffle(train_tagged2)
    model_dbow.train(train_tagged2, total_examples=len(train_tagged2), epochs=1, start_alpha=0.025, end_alpha=0.025)
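
The loop above keeps the learning rate fixed at 0.025 on every pass. If a decaying rate is preferred while still training one epoch at a time, each pass needs its own start and end value. A minimal sketch of splitting a linear decay across passes (the function name `alpha_schedule` is hypothetical, and the 0.025 to 0.005 range is borrowed from the earlier `train` call):

```python
def alpha_schedule(start, end, passes):
    """Split a linear decay from start to end into per-pass (start, end) pairs."""
    step = (start - end) / passes
    return [(start - i * step, start - (i + 1) * step) for i in range(passes)]

for pass_start, pass_end in alpha_schedule(0.025, 0.005, 10):
    # Each pair would be passed as start_alpha / end_alpha for one pass.
    print("%.3f -> %.3f" % (pass_start, pass_end))
```

Whether a decaying schedule actually improves the score here was not tested in this notebook.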

In [271]:
model_dbow.docvecs['TRAIN_2']

array([-0.2390765 ,  0.1142169 ,  0.20712513,  0.89075464, -0.22244544,
        0.27381375, -0.17348099, -0.00806451, -0.07966816, -0.25908321,
        0.14005575,  0.81300563,  0.42594698, -0.41442031, -0.50202197,
       -0.3608807 , -0.18744084, -0.74130625, -0.3690764 ,  0.73912692,
        0.488134  ,  0.57566327,  0.03431462, -0.11286174,  0.09621059,
       -0.53745824,  0.19251792,  0.03610734, -0.16454478,  0.46170405,
        0.32712093,  0.35736078, -0.18140447,  0.53149986,  0.71819407,
       -0.05336506, -0.17479719, -0.09747268,  0.21144409,  0.26047847,
       -0.12783214, -0.17676389, -0.22317085,  0.25071913,  0.287783  ,
        0.58708721, -0.13033538,  0.02738427, -0.17963417, -0.46462777,
        0.12726952, -0.59729439, -0.03004212, -0.04822551,  0.11562251,
        0.53002852, -0.0609221 ,  0.52630454,  0.11480941, -1.12891674,
        0.2790131 , -0.22757432,  0.42739171, -0.02386028,  0.0052669 ,
        0.58239245,  0.04886733,  0.25173002,  0.53091973, -0.30

In [272]:
model_dbow.docvecs[2]

array([-0.2390765 ,  0.1142169 ,  0.20712513,  0.89075464, -0.22244544,
        0.27381375, -0.17348099, -0.00806451, -0.07966816, -0.25908321,
        0.14005575,  0.81300563,  0.42594698, -0.41442031, -0.50202197,
       -0.3608807 , -0.18744084, -0.74130625, -0.3690764 ,  0.73912692,
        0.488134  ,  0.57566327,  0.03431462, -0.11286174,  0.09621059,
       -0.53745824,  0.19251792,  0.03610734, -0.16454478,  0.46170405,
        0.32712093,  0.35736078, -0.18140447,  0.53149986,  0.71819407,
       -0.05336506, -0.17479719, -0.09747268,  0.21144409,  0.26047847,
       -0.12783214, -0.17676389, -0.22317085,  0.25071913,  0.287783  ,
        0.58708721, -0.13033538,  0.02738427, -0.17963417, -0.46462777,
        0.12726952, -0.59729439, -0.03004212, -0.04822551,  0.11562251,
        0.53002852, -0.0609221 ,  0.52630454,  0.11480941, -1.12891674,
        0.2790131 , -0.22757432,  0.42739171, -0.02386028,  0.0052669 ,
        0.58239245,  0.04886733,  0.25173002,  0.53091973, -0.30

The vectors are still stored in the original order, as if no shuffle had taken place.
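
This is expected: the position of each tag's vector is fixed when the vocabulary is built, and `shuffle` only changes the order in which documents are visited during training. A minimal sketch of that separation, with a plain dict standing in for the model's tag index (an assumption for illustration, not gensim's internal structure):

```python
from random import shuffle

tags = ["TRAIN_%d" % i for i in range(5)]

# Index assigned once, in first-seen order, before any training.
tag_to_index = {tag: i for i, tag in enumerate(tags)}

training_order = list(tags)
shuffle(training_order)  # changes the visiting order only

print(training_order)
print(tag_to_index["TRAIN_2"])  # still 2
```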

In [273]:
train_array = []
for i in range(len(train_tagged)):
    tag = train_tagged[i].tags[0]
    train_array.append(model_dbow.docvecs[tag])

In [274]:
test_array = []
for i in range(len(test_tagged)):
    test_array.append(model_dbow.infer_vector(test_tagged[i].words))


In [275]:
lr = LogisticRegression()
lr.fit(train_array, train_target)
result = lr.predict(test_array)

In [276]:
print("output...")
output = pd.DataFrame(data={'id': test['id'], 'sentiment': result})
output.to_csv('doc2vec6.csv', index=False, quoting=3)

output...


The score is 0.87, close to the 0.88 of BOW_chi_tfidf.csv. Good, I will use this for the deep model.

Running the program produced two models, dbow and dm. The former still scores 0.87, the latter 0.75, so dbow does indeed perform better here.