# Distributed Word Vectors

[Word2vec](https://code.google.com/archive/p/word2vec/), published by Google in 2013, is a neural network implementation that learns [distributed representations](http://www.cs.toronto.edu/~bonner/courses/2014s/csc321/lectures/lec5.pdf) for words.

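Word2vec produces meaningful representations without using labels. This is very useful, because most real-world data is unlabeled. Given enough training words, the word vectors show many interesting properties. Words with similar meanings fall into clusters, and the clusters are spaced apart in a way that lets relationships between words be expressed with vector arithmetic. A famous example is king - man + woman = queen. For background, see [Learning Representations of Text using Neural Networks](https://docs.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit).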
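The vector arithmetic can be illustrated with a toy sketch. The three-dimensional vectors below are hand-picked assumptions, not output from a trained model, and `analogy` is a hypothetical helper written only for this illustration.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, hand-picked for illustration only.
# Real word2vec vectors are learned and have hundreds of dimensions.
# The dimensions loosely mean: [royalty, maleness, femaleness]
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```

With learned vectors the match is rarely this clean, which is why the result is usually reported as the nearest neighbor of the computed point rather than an exact equality.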

Distributed word vectors are useful for word prediction and translation; here we use them for sentiment analysis.

# Using word2vec in Python

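We use the gensim package for word2vec, so install it beforehand. Unlike some other deep learning algorithms, word2vec does not need a GPU, but it is still computationally heavy. Both Google's C version and the Python version rely on multithreading, so we also need Cython, which helps cut training time.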

# Preparing to Train

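First, load the data with pandas. This time we also use unlabeledTrainData.tsv, which contains 50,000 reviews without labels. When we trained the bag-of-words model in Part 1, a review without a label was useless. Word2vec, however, can learn from unlabeled data.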

In [2]:
import pandas as pd

In [3]:
train = pd.read_csv("data/labeledTrainData.tsv", header=0, 
                     delimiter="\t", quoting=3)

test = pd.read_csv( "data/testData.tsv", header=0, delimiter="\t", quoting=3 )
unlabeled_train = pd.read_csv("data/unlabeledTrainData.tsv", header=0, 
                              delimiter="\t", quoting=3 )

# Verify the number of reviews that were read (100,000 in total)
print("Read %d labeled train reviews, %d labeled test reviews, and %d unlabeled reviews\n" % (train["review"].size, test["review"].size, unlabeled_train["review"].size ))

Read 25000 labeled train reviews, 25000 labeled test reviews, and 50000 unlabeled reviews



In [4]:
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [6]:
train['review'][0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

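Next we clean the data. The steps are similar to Part 1, with a few differences. First, when training word2vec it is better not to remove stop words, because with this algorithm more words produce higher-quality word vectors, so stop-word removal is optional in the function below. It may also be better to keep numbers, although the function below still strips every non-letter character: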

In [7]:
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

def review_to_wordlist(review, remove_stopwords=False):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words.  Returns a list of words.
    
    # 1. Remove HTML
    review_text = BeautifulSoup(review, 'lxml').get_text()
      
    # 2. Remove non-letters
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    
    # 3. Convert words to lower case and split them
    words = review_text.lower().split()

    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]
    
    # 5. Return a list of words
    return(words)

Next, we need to fix the input format. Word2vec expects single sentences, each one a list of words. In other words, the input format is a list of lists.
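As a quick illustration of that format, the sketch below builds two made-up sentences and checks their shape. The helper `is_list_of_token_lists` is hypothetical and exists only for this example. A common mistake is to pass raw strings instead, in which case each string would be iterated character by character.

```python
# Two hypothetical sentences, already cleaned and tokenized.
sentences_example = [
    ["the", "movie", "was", "great"],
    ["i", "did", "not", "like", "the", "ending"],
]

def is_list_of_token_lists(obj):
    """Check that obj is a list whose items are lists of strings."""
    return (
        isinstance(obj, list)
        and all(isinstance(s, list) for s in obj)
        and all(isinstance(w, str) for s in obj for w in s)
    )

print(is_list_of_token_lists(sentences_example))        # True: a list of word lists
print(is_list_of_token_lists(["the movie was great"]))  # False: a list of raw strings
```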

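Splitting a paragraph into sentences is not straightforward. English sentences can end with "?", "!", a quotation mark, or ".", and neither spacing nor capitalization is a reliable guide. So we use NLTK's **punkt** tokenizer for sentence splitting. Install NLTK first, then run nltk.download() to fetch the punkt data files.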
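To see why a simple rule is unreliable, here is a small sketch of a naive splitter. The regular expression and the sample text are assumptions made for illustration; this is not how punkt works.

```python
import re

text = "Dr. Smith watched the film at 9 p.m. yesterday. He loved it!"

# Naive rule (an illustrative assumption, not what punkt does):
# split after ".", "!" or "?" whenever whitespace follows.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for s in naive_sentences:
    print(repr(s))
```

The text contains two sentences, but the naive rule returns four pieces because it also breaks after the abbreviations "Dr." and "p.m.". A trained sentence tokenizer is meant to handle cases like these.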

In [8]:
# Download the punkt tokenizer for sentence splitting
import nltk.data
# nltk.download()   

# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Define a function to split a review into parsed sentences
def review_to_sentences( review, tokenizer, remove_stopwords=False ):
    # Function to split a review into parsed sentences. Returns a 
    # list of sentences, where each sentence is a list of words
    
    # 1. Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    
    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append( review_to_wordlist( raw_sentence, remove_stopwords ))

    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists
    return sentences

In [14]:
# The tokenizer only splits a paragraph into sentences;
# review_to_wordlist then cleans each sentence individually
s1 = review_to_sentences(train['review'][0], tokenizer)
# Show the word list for the first sentence
s1[0]

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'mj',
 'i',
 've',
 'started',
 'listening',
 'to',
 'his',
 'music',
 'watching',
 'the',
 'odd',
 'documentary',
 'here',
 'and',
 'there',
 'watched',
 'the',
 'wiz',
 'and',
 'watched',
 'moonwalker',
 'again']

Now we apply this function to the data to prepare the input for word2vec (the cell below takes a few minutes to run):

In [13]:
sentences = []  # Initialize an empty list of sentences

print("Parsing sentences from training set")
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

print("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)

Parsing sentences from training set


  'Beautiful Soup.' % markup)
  ' that document to Beautiful Soup.' % decoded_markup


Parsing sentences from unlabeled set


  ' that document to Beautiful Soup.' % decoded_markup
  ' that document to Beautiful Soup.' % decoded_markup
  ' that document to Beautiful Soup.' % decoded_markup
  'Beautiful Soup.' % markup)
  ' that document to Beautiful Soup.' % decoded_markup
  ' that document to Beautiful Soup.' % decoded_markup


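You may see some warnings while this runs: BeautifulSoup is reporting that some sentences contain URLs. This is harmless and nothing to worry about. If we wanted to, we could strip URLs from the text during preprocessing.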
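A minimal sketch of such a URL-stripping step follows. The pattern is deliberately simple and is an assumption for illustration; `strip_urls` is a hypothetical helper, not part of the original pipeline, and it would also remove punctuation attached to the end of a URL.

```python
import re

# A deliberately simple pattern (an assumption for illustration):
# anything starting with http://, https:// or www. up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def strip_urls(text):
    """Replace URL-like tokens with a single space."""
    return URL_PATTERN.sub(" ", text)

sample = "See http://www.example.com/review for details. Great film!"
print(strip_urls(sample))
```

One option would be to call it on the raw review before the text is passed to BeautifulSoup.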

Let's look at the output and see how it differs from Part 1:

In [14]:
# Check how many sentences we have in total - should be around 850,000+
len(sentences)

795538

In [17]:
print(sentences[0])

print(sentences[1])

['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again']
['maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent']


# Training and Saving the Model

Several parameters affect both the run time and the quality of the resulting model:

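- Architecture: Architecture options are skip-gram or continuous bag of words (CBOW). Skip-gram was the default when this was written, but the default depends on the gensim version (recent releases default to CBOW, `sg=0`), so set `sg` explicitly if it matters. We found that skip-gram was very slightly slower but produced better results.

- Training algorithm: Hierarchical softmax or negative sampling. Hierarchical softmax was the default when this was written, and it worked well for us; recent gensim releases default to negative sampling (`hs=0`, `negative=5`).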

- Downsampling of frequent words: The Google documentation recommends values between .00001 and .001. For us, values closer to 0.001 seemed to improve the accuracy of the final model.

- Word vector dimensionality: More features result in longer runtimes, and often, but not always, result in better models. Reasonable values can be in the tens to hundreds; we used 300.

- Context / window size: How many words of context should the training algorithm take into account? 10 seems to work well for hierarchical softmax (more is better, up to a point).

- Worker threads: Number of parallel processes to run. This is computer-specific, but between 4 and 6 should work on most systems.

- Minimum word count: This helps limit the size of the vocabulary to meaningful words. Any word that does not occur at least this many times across all documents is ignored. Reasonable values could be between 10 and 100. In this case, since each movie occurs 30 times, we set the minimum word count to 40, to avoid attaching too much importance to individual movie titles. This resulted in an overall vocabulary size of around 15,000 words. Higher values also help limit run time.
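To make the minimum word count concrete, the sketch below filters a made-up corpus by total frequency. It only illustrates the idea and is not gensim's implementation; `filter_by_min_count` and the toy sentences are hypothetical.

```python
from collections import Counter

def filter_by_min_count(sentences, min_count):
    """Drop every word that occurs fewer than min_count times in total."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocab = {w for w, c in counts.items() if c >= min_count}
    filtered = [[w for w in sentence if w in vocab] for sentence in sentences]
    return filtered, vocab

# Hypothetical toy corpus
toy = [
    ["the", "film", "was", "good"],
    ["the", "film", "was", "bad"],
    ["moonwalker", "was", "odd"],
]
filtered, vocab = filter_by_min_count(toy, min_count=2)
print(sorted(vocab))  # ['film', 'the', 'was']
print(filtered)
```

Rare words such as a single movie title disappear from both the vocabulary and the training sentences, which is the effect the parameter aims for.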

Choosing the parameters is not easy, but once they are set, training the model is straightforward:

In [None]:
# Import the built-in logging module and configure it 
# so that Word2Vec creates nice output messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Set values for various parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model (this will take some time)
# Note: this call uses the pre-4.0 gensim API; in gensim 4.0 and later
# the `size` argument is named `vector_size`.
from gensim.models import word2vec
print("Training model...")
model = word2vec.Word2Vec(sentences, workers=num_workers, 
            size=num_features, min_count = min_word_count, 
            window = context, sample = downsampling)

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "model/300features_40minwords_10context"
model.save(model_name)

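Run `top -o cpu` in a terminal to watch CPU usage; it should be between 300% and 400%. The training log is very long, so it was cleared to keep the notebook readable. You can download the notebook and run it on your own machine; it should normally finish within a few minutes.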

# Exploring the Model Results

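With training done, let's look at what the model learned from 75,000 reviews. The doesnt_match function infers which word in a set is least similar to the others. (In gensim 4.0 and later, query methods such as doesnt_match and most_similar are called on `model.wv` rather than on the model itself.)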

In [31]:
model.doesnt_match('man woman child kitchen'.split())

'kitchen'
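The idea behind this kind of query can be sketched with toy vectors: average the normalized vectors and return the word least similar to that average. The two-dimensional vectors below are hand-picked assumptions, and `odd_one_out` is a hypothetical helper, not gensim code.

```python
import numpy as np

# Hypothetical 2-D vectors chosen by hand for illustration.
toy_vectors = {
    "man":     np.array([1.0, 0.1]),
    "woman":   np.array([0.9, 0.2]),
    "child":   np.array([0.8, 0.3]),
    "kitchen": np.array([0.1, 1.0]),
}

def odd_one_out(words, vectors):
    """Return the word least similar (by cosine) to the mean of the group."""
    # Normalize each vector to unit length so dot products are cosines.
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    similarities = unit @ mean
    return words[int(np.argmin(similarities))]

print(odd_one_out("man woman child kitchen".split(), toy_vectors))  # kitchen
```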

The model can tell differences in meaning apart. Let's try countries and capitals:

In [32]:
model.doesnt_match('france england germany berlin'.split())

'berlin'

It is not perfect, of course:

In [33]:
model.doesnt_match('paris berlin london austria'.split())

'paris'

We can use the most_similar function to look at word clusters:

In [34]:
model.most_similar('man')

[('woman', 0.6284574270248413),
 ('lady', 0.576130211353302),
 ('lad', 0.5677862167358398),
 ('monk', 0.5471663475036621),
 ('guy', 0.5225706100463867),
 ('person', 0.5145993232727051),
 ('men', 0.5144864916801453),
 ('farmer', 0.5137633085250854),
 ('politician', 0.512186586856842),
 ('soldier', 0.5106308460235596)]

In [35]:
model.most_similar('queen')

[('princess', 0.680779218673706),
 ('bride', 0.6166126728057861),
 ('belle', 0.6164517998695374),
 ('victoria', 0.5983539819717407),
 ('eva', 0.5980991721153259),
 ('matriarch', 0.5820175409317017),
 ('stepmother', 0.5744807720184326),
 ('maid', 0.5740950107574463),
 ('brigitte', 0.5711568593978882),
 ('mistress', 0.5702614188194275)]

And a test that is relevant to sentiment analysis:

In [None]:
model.most_similar('awful')

The article [A Guide to "How to Generate a Good Word Embedding?"](http://licstar.net/archives/620) gives a recipe for training a word2vec model:

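First, pick a corpus from a domain similar to your task; within that constraint, the larger the corpus the better. Then download a recent version of word2vec (updated September 2014). Use the Skip-gram model when the corpus is small (under 100 million words, roughly 500 MB of text) and the CBOW model when it is large. Finally, remember to set the number of iterations to thirty to fifty and to choose at least 50 dimensions.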
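The rule of thumb above can be written down as a small helper. This is only a sketch: `suggest_settings` is hypothetical, the dictionary keys are descriptive labels rather than parameter names of any library, and the iteration count simply takes the lower end of the suggested range.

```python
def suggest_settings(num_tokens, small_corpus_limit=100_000_000):
    """Turn the rule of thumb into a small settings dictionary.

    The keys are descriptive labels, not parameter names of any library.
    """
    architecture = "skip-gram" if num_tokens < small_corpus_limit else "cbow"
    return {
        "architecture": architecture,
        "iterations": 30,      # the guide suggests thirty to fifty
        "min_dimensions": 50,  # the guide's suggested minimum vector size
    }

print(suggest_settings(20_000_000))     # a small corpus
print(suggest_settings(2_000_000_000))  # a large corpus
```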