In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read data from files
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test = pd.read_csv( "testData.tsv", header=0, delimiter="\t", quoting=3 )
unlabeled_train = pd.read_csv( "unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )

In [3]:
# Verify the number of reviews that were read (100,000 in total)
print ("Read %d labeled train reviews, %d labeled test reviews, " \
 "and %d unlabeled reviews\n" % (train["review"].size,  
 test["review"].size, unlabeled_train["review"].size ))

Read 25000 labeled train reviews, 25000 labeled test reviews, and 104805 unlabeled reviews



我们编写的用于清理数据的函数也类似于第1部分，尽管现在有一些不同之处。首先，为了训练Word2Vec，**最好不要删除停止词，因为为了生成高质量的词向量，该算法依赖于句子的更广泛的上下文。**出于这个原因，我们将在下面的函数中使stop单词删除成为**可选的**。不删除数字可能更好，但是我们把它留给读者作为练习。

In [4]:
# Import various modules for string cleaning
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

def review_to_wordlist(review, remove_stopwords=False):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words.  Returns a list of words.
    #
    # 1. Remove HTML 删除html符号
    review_text = BeautifulSoup(review).get_text()
    #  
    # 2. Remove non-letters 删除非字母符号，后续可以考虑不删除数字
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    #
    # 3. Convert words to lower case and split them 把所有单词转换成小写然后将文本分割成单词
    words = review_text.lower().split()
    #
    # 4. Optionally remove stop words (false by default) 有选择的删除stopwords
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]
    #
    # 5. Return a list of words
    return words

接下来，我们需要一个特定的输入格式。**Word2Vec每个句子都是以单词为元素列表，文本是以句子为元素的列表，其中，橘子也是一个列表。换句话说，输入格式是列表的列表。**


如何把一段话分成句子一点也不简单。在自然语言中有各种各样的陷阱。英语句子可以以“?”结尾,“!”"""或"。此外，间距和标题化也不是可靠的指导。出于这个原因，我们将使用NLTK的punkt标记器进行句子拆分。为了使用它，您需要安装NLTK并使用NLTK .download()来下载punkt的相关训练文件。

In [5]:
# Download the punkt tokenizer for sentence splitting
import nltk.data
#nltk.download()   

# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

#print(tokenizer)

# Define a function to split a review into parsed sentences
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Function to split a review into parsed sentences. Returns a 
    # list of sentences, where each sentence is a list of words
    # 函数的作用是:将review分解成已解析的句子。返回一个句子列表，其中每个句子都是单词列表
    # 1. Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    #
    # 2. Loop over each sentence遍历
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists
    return sentences

现在，我们可以应用这个函数来准备我们的数据输入到Word2Vec(这将需要几分钟):

In [6]:
sentences = []  # Initialize an empty list of sentences

print("Parsing sentences from training set")
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

print("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)

Parsing sentences from training set


  ' Beautiful Soup.' % markup)
  ' Beautiful Soup.' % markup)
  ' that document to Beautiful Soup.' % decoded_markup


Parsing sentences from unlabeled set


  ' that document to Beautiful Soup.' % decoded_markup
  ' that document to Beautiful Soup.' % decoded_markup
  ' Beautiful Soup.' % markup)
  ' that document to Beautiful Soup.' % decoded_markup
  ' Beautiful Soup.' % markup)
  ' Beautiful Soup.' % markup)
  ' Beautiful Soup.' % markup)
  ' that document to Beautiful Soup.' % decoded_markup
  ' that document to Beautiful Soup.' % decoded_markup


AttributeError: 'float' object has no attribute 'strip'

In [None]:
type(review)

你可能会从BeautifulSoup那里得到一些关于句子中url的警告。这些都不需要担心(尽管您可能想在清理文本时删除url)。


我们可以看看输出，看看这与第1部分有什么不同:

In [39]:
# Check how many sentences we have in total - should be around 850,000+
print(len(sentences))

704530


In [40]:
print(sentences[0]) 

['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again']


In [41]:
print(sentences[1]) 

['maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent']


需要注意的一个小细节是Python列表中“+=”和“append”之间的区别。在许多应用程序中，这两者是可互换的，但在这里就不一样了。**如果你将一个列表的列表附加到另一个列表的列表中，“附加”只会附加第一个列表;您需要使用“+=”来一次性连接所有列表。**

## Training and Saving Your Model

有了这些解析良好的句子列表，我们就可以开始训练模型了。有许多参数选择会影响运行时和生成的最终模型的质量。有关下面算法的详细信息，请参阅word2vec API文档和谷歌文档。

**架构**:架构选项是跳跃图(默认)或连续的单词包。我们发现跳跃图的速度稍微慢一些，但效果更好。

**训练算法**:分级softmax(默认)或消极抽样。对我们来说，默认设置运行良好。

**对常用词进行下采样**:谷歌文档建议的值在.00001和.001之间。对于我们来说，接近0.001的值似乎可以提高最终模型的准确性。

**字向量维度**:更多的特征导致更长的运行时，并且通常(但不总是)导致更好的模型。合理的数值可以是几十到几百;我们使用300年。

**上下文/窗口大小**:训练算法应该考虑多少上下文单词?选择10在分层softmax中效果不错(更多是更好的，在一定程度上)。

**工作线程**:要运行的并行进程的数量。这是特定于计算机的，但是在大多数系统中4到6之间应该可以工作。

**最少字数**:这有助于将词汇量限制在有意义的单词上。任何在所有文档中至少出现这么多次的单词都将被忽略。合理的数值可以在10到100之间。在本例中，由于每部电影出现30次，因此我们将最小字数设置为40，以避免对单个电影标题赋予过多的重要性。结果，总的词汇量大约是15000个单词。更高的值还有助于限制运行时间。

选择参数并不容易，但是一旦我们选择了参数，创建一个Word2Vec模型就很简单了:

In [42]:
# Import the built-in logging module and configure it so that Word2Vec 
# creates nice output messages 导入内置的日志模块并对其进行配置，以便Word2Vec创建良好的输出消息
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Set values for various parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel 要并行运行的线程数
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words 下采样设置频繁的单词

# Initialize and train the model (this will take some time)
from gensim.models import word2vec
print("Training model...") 
model = word2vec.Word2Vec(sentences, workers=num_workers, 
            size=num_features, min_count = min_word_count, 
            window = context, sample = downsampling)

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "300features_40minwords_10context"
model.save(model_name)

2019-10-08 19:58:30,351 : INFO : collecting all words and their counts
2019-10-08 19:58:30,351 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-10-08 19:58:30,414 : INFO : PROGRESS: at sentence #10000, processed 225803 words, keeping 17776 word types
2019-10-08 19:58:30,457 : INFO : PROGRESS: at sentence #20000, processed 451892 words, keeping 24948 word types
2019-10-08 19:58:30,509 : INFO : PROGRESS: at sentence #30000, processed 671315 words, keeping 30034 word types


Training model...


2019-10-08 19:58:30,560 : INFO : PROGRESS: at sentence #40000, processed 897815 words, keeping 34348 word types
2019-10-08 19:58:30,598 : INFO : PROGRESS: at sentence #50000, processed 1116963 words, keeping 37761 word types
2019-10-08 19:58:30,635 : INFO : PROGRESS: at sentence #60000, processed 1338404 words, keeping 40723 word types
2019-10-08 19:58:30,684 : INFO : PROGRESS: at sentence #70000, processed 1561580 words, keeping 43333 word types
2019-10-08 19:58:30,731 : INFO : PROGRESS: at sentence #80000, processed 1780887 words, keeping 45714 word types
2019-10-08 19:58:30,779 : INFO : PROGRESS: at sentence #90000, processed 2004996 words, keeping 48135 word types
2019-10-08 19:58:30,829 : INFO : PROGRESS: at sentence #100000, processed 2226967 words, keeping 50207 word types
2019-10-08 19:58:30,875 : INFO : PROGRESS: at sentence #110000, processed 2446581 words, keeping 52081 word types
2019-10-08 19:58:30,907 : INFO : PROGRESS: at sentence #120000, processed 2668776 words, keepin

2019-10-08 19:58:33,708 : INFO : downsampling leaves estimated 11248894 word corpus (73.9% of prior 15230444)
2019-10-08 19:58:33,764 : INFO : estimated required memory for 15440 words and 300 dimensions: 44776000 bytes
2019-10-08 19:58:33,764 : INFO : resetting layer weights
2019-10-08 19:58:33,954 : INFO : training model with 4 workers on 15440 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=10
2019-10-08 19:58:34,967 : INFO : EPOCH 1 - PROGRESS: at 10.81% examples, 1206145 words/s, in_qsize 8, out_qsize 0
2019-10-08 19:58:35,978 : INFO : EPOCH 1 - PROGRESS: at 22.70% examples, 1263853 words/s, in_qsize 7, out_qsize 0
2019-10-08 19:58:36,981 : INFO : EPOCH 1 - PROGRESS: at 34.92% examples, 1297021 words/s, in_qsize 7, out_qsize 0
2019-10-08 19:58:37,967 : INFO : EPOCH 1 - PROGRESS: at 47.44% examples, 1324783 words/s, in_qsize 7, out_qsize 0
2019-10-08 19:58:38,985 : INFO : EPOCH 1 - PROGRESS: at 60.09% examples, 1343685 words/s, in_qsize 8, out_qsize 0
20

## Exploring the Model Results

祝贺你到目前为止一切顺利!让我们来看看我们从75000个培训评论中创建的模型。


“doesnt_match”函数将尝试推断出一个集合中哪些单词与其他单词最不相似:

In [44]:
model.doesnt_match("man woman child kitchen".split())

  if __name__ == '__main__':


'kitchen'

我们的模型能够区分意义上的差异!它知道，男人、女人和孩子之间的相似之处比他们在厨房里的相似之处更多。更多的探索表明，该模型对更细微的意义差异更敏感，例如国家和城市之间的差异:

In [45]:
model.doesnt_match("france england germany berlin".split())

  if __name__ == '__main__':


'berlin'

...虽然我们使用的训练集相对较小，但它肯定不是完美的:

In [46]:
model.doesnt_match("paris berlin london austria".split())

  if __name__ == '__main__':


'paris'

我们也可以使用“most_similar”函数来深入了解模型的词簇:

In [47]:
model.most_similar("man")

  if __name__ == '__main__':


[('woman', 0.6359063982963562),
 ('lady', 0.5870093703269958),
 ('lad', 0.5420260429382324),
 ('monk', 0.5376467108726501),
 ('boy', 0.5312069654464722),
 ('priest', 0.5237777233123779),
 ('person', 0.5219061374664307),
 ('men', 0.5074207782745361),
 ('millionaire', 0.5048858523368835),
 ('guy', 0.4964485168457031)]

In [48]:
model.most_similar("queen")

  if __name__ == '__main__':


[('princess', 0.6752606630325317),
 ('bride', 0.6285693645477295),
 ('fatale', 0.5853664875030518),
 ('regina', 0.580701470375061),
 ('femme', 0.5714904069900513),
 ('maid', 0.5712674260139465),
 ('belle', 0.5650311708450317),
 ('mistress', 0.5605853796005249),
 ('sultry', 0.5578803420066833),
 ('dame', 0.5563564300537109)]

考虑到我们的特殊训练集，《辣提法》与《女王》的相似度高居榜首也就不足为奇了。


或者，与情绪分析更相关的是:

In [49]:
model.most_similar("awful")

  if __name__ == '__main__':


[('terrible', 0.7532333731651306),
 ('horrible', 0.7510112524032593),
 ('dreadful', 0.707053542137146),
 ('atrocious', 0.7025836706161499),
 ('abysmal', 0.6978950500488281),
 ('appalling', 0.6969230771064758),
 ('horrendous', 0.6883354187011719),
 ('horrid', 0.6818529367446899),
 ('lousy', 0.6378222703933716),
 ('embarrassing', 0.6169452667236328)]

因此，我们似乎有了一个相当好的语义模型——至少和单词一样好。但是我们如何使用这些神奇的分布式单词向量进行监督学习呢?下一节将对此进行尝试。

## Numeric Representations of Words

现在我们已经有了一个训练有素的模型，对单词有了一定的语义理解，那么我们应该如何使用它呢?在第2部分中训练的Word2Vec模型由词汇表中每个单词的特征向量组成，存储在一个名为“syn0”的numpy数组中:

In [50]:
from gensim.models import Word2Vec

In [51]:
model = Word2Vec.load("300features_40minwords_10context")

2019-10-08 19:59:59,496 : INFO : loading Word2Vec object from 300features_40minwords_10context
2019-10-08 19:59:59,734 : INFO : loading vocabulary recursively from 300features_40minwords_10context.vocabulary.* with mmap=None
2019-10-08 19:59:59,734 : INFO : loading trainables recursively from 300features_40minwords_10context.trainables.* with mmap=None
2019-10-08 19:59:59,734 : INFO : loading wv recursively from 300features_40minwords_10context.wv.* with mmap=None
2019-10-08 19:59:59,734 : INFO : setting ignored attribute vectors_norm to None
2019-10-08 19:59:59,737 : INFO : setting ignored attribute cum_table to None
2019-10-08 19:59:59,737 : INFO : loaded 300features_40minwords_10context


In [52]:
type(model.syn0)

AttributeError: 'Word2Vec' object has no attribute 'syn0'

In [53]:
model.syn0.shape

AttributeError: 'Word2Vec' object has no attribute 'syn0'

syn0中的行数是模型词汇表中的单词数，列数对应于我们在第2部分中设置的特征向量的大小。将最小字数设置为40，我们得到的总词汇量为16,492个单词，每个单词有300个特征。个别字向量的存取方法如下:

In [54]:
model["flower"]

  if __name__ == '__main__':


array([ -8.59023705e-02,  -3.39425132e-02,  -7.29977340e-02,
        -3.47801521e-02,  -3.87294665e-02,   6.23288602e-02,
         1.45014273e-02,   2.29095016e-02,   4.18861061e-02,
        -4.63192957e-03,   6.92662969e-02,   3.67813110e-02,
        -2.85203774e-02,  -7.00469762e-02,  -2.33646948e-02,
        -8.65339935e-02,  -1.02313928e-01,   9.40788761e-02,
        -3.46443467e-02,   1.28474422e-02,  -2.49164477e-02,
        -5.74302906e-03,   2.05675513e-02,   1.33091249e-02,
         3.76211889e-02,   2.06470024e-02,   3.52376513e-02,
         2.17546653e-02,  -1.05265353e-03,   2.71668136e-02,
         8.24769586e-02,  -6.59254147e-03,  -1.90616082e-02,
         3.39015061e-03,  -1.57508515e-02,   7.15400651e-02,
         4.43138881e-03,  -6.99389949e-02,   1.15690805e-01,
        -3.39600183e-02,  -9.79051590e-02,  -7.17130443e-03,
         2.42847041e-03,   1.04541451e-01,   1.85147729e-02,
        -3.89136374e-02,   3.08787171e-02,   8.55970308e-02,
         8.79267603e-02,

## From Words To Paragraphs, Attempt 1: Vector Averaging

IMDB数据集的一个挑战是可变长度的审查。我们需要找到一种方法来提取单个的单词向量，并将它们转换成一个特征集，每个特征集的长度都是相同的。


由于每个单词都是300维空间中的向量，所以我们可以使用向量操作来组合每个review中的单词。我们尝试的一种方法是在给定的review中简单地平均单词向量(为此，我们删除了停止单词，这只会增加噪音)。


下面的代码对特征向量进行平均，构建在第2部分的代码之上。

In [55]:
import numpy as np  # Make sure that numpy is imported

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    #
    nwords = 0.
    # 
    # Index2word is a list that contains the names of the words in 
    # the model's vocabulary. Convert it to a set, for speed 
    index2word_set = set(model.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocaublary, add its feature vector to the total
    for word in words:
        if word in index2word_set: 
            nwords = nwords + 1.
            featureVec = np.add(featureVec,model[word])
    # 
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec,nwords)
    return featureVec

In [57]:
def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate 
    # the average feature vector for each one and return a 2D numpy array 
    # 
    # Initialize a counter
    counter = 0.
    # 
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    # 
    # Loop through the reviews
    for review in reviews:
       #
       # Print a status message every 1000th review
       if counter%1000. == 0.:
           print ("Review %d of %d" % (counter, len(reviews)))
       # 
       # Call the function (defined above) that makes average feature vectors
       reviewFeatureVecs[counter] = makeFeatureVec(review, model, \
           num_features)
       #
       # Increment the counter
       counter = counter + 1.
    return reviewFeatureVecs

现在，我们可以调用这些函数来创建每个段落的平均向量。以下操作需要几分钟时间:

In [59]:
# ****************************************************************
# Calculate average feature vectors for training and testing sets,
# using the functions we defined above. Notice that we now use stop word
# removal.

clean_train_reviews = []
for review in train["review"]:
    clean_train_reviews.append( review_to_wordlist( review, \
        remove_stopwords=True ))

trainDataVecs = getAvgFeatureVecs( clean_train_reviews, model, num_features )

print("Creating average feature vecs for test reviews") 
clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append( review_to_wordlist( review, \
        remove_stopwords=True ))

testDataVecs = getAvgFeatureVecs( clean_test_reviews, model, num_features )

Review 0 of 25000


AttributeError: 'Word2Vec' object has no attribute 'index2word'

接下来，使用平均段落向量来训练一个随机森林。注意，与第1部分中一样，我们只能使用标记的培训评审来培训模型。

In [60]:
# Fit a random forest to the training data, using 100 trees
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier( n_estimators = 100 )

print("Fitting a random forest to labeled training data...") 
forest = forest.fit( trainDataVecs, train["sentiment"] )

# Test & extract results 
result = forest.predict( testDataVecs )

# Write the test results 
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
output.to_csv( "Word2Vec_AverageVectors.csv", index=False, quoting=3 )

Fitting a random forest to labeled training data...


NameError: name 'trainDataVecs' is not defined

我们发现，这比碰运气的效果要好得多，但却比大言不话的效果差几个百分点。


既然矢量的元素平均没有产生惊人的结果，也许我们可以用一种更聪明的方法来做?加权单词向量的标准方法是应用“tf-idf”权值，它度量给定单词在给定文档集合中的重要性。在Python中提取tf-idf权重的一种方法是使用scikit-learn的TfidfVectorizer，它的接口类似于我们在第1部分中使用的CountVectorizer。然而，当我们尝试以这种方式来加权我们的字向量时，我们发现在性能上没有实质性的改进。

## From Words to Paragraphs, Attempt 2: Clustering 

Word2Vec创建语义相关的单词集群，因此另一种可能的方法是利用集群内单词的相似性。以这种方式分组向量称为“向量量化”。为此，我们首先需要找到词簇的中心，这可以通过使用K-Means之类的聚类算法来实现。


在K- means中，我们需要设置的一个参数是“K”，即集群的数量。我们应该如何决定创建多少集群?试错结果表明，小的聚类，平均每个聚类只有5个单词左右，比有很多单词的大聚类结果更好。集群代码如下所示。我们使用scikit-learn来执行我们的K-Means。


K-意味着聚类与大K可以非常缓慢;下面的代码在我的电脑上花了40多分钟。下面，我们围绕K-Means函数设置一个计时器，以查看需要多长时间。

In [61]:
from sklearn.cluster import KMeans
import time

start = time.time() # Start time

# Set "k" (num_clusters) to be 1/5th of the vocabulary size, or an
# average of 5 words per cluster
word_vectors = model.syn0
num_clusters = word_vectors.shape[0] / 5

# Initalize a k-means object and use it to extract centroids
kmeans_clustering = KMeans( n_clusters = num_clusters )
idx = kmeans_clustering.fit_predict( word_vectors )

# Get the end time and print how long the process took
end = time.time()
elapsed = end - start
print("Time taken for K Means clustering: ", elapsed, "seconds.") 

AttributeError: 'Word2Vec' object has no attribute 'syn0'

每个单词的集群分配现在存储在idx中，原始Word2Vec模型中的词汇表仍然存储在model.index2word中。为了方便，我们把这些压缩到一个字典如下:

In [62]:
# Create a Word / Index dictionary, mapping each vocabulary word to
# a cluster number                                                                                            
word_centroid_map = dict(zip( model.index2word, idx ))

AttributeError: 'Word2Vec' object has no attribute 'index2word'

这有点抽象，让我们仔细看看我们的集群包含什么。您的集群可能不同，因为Word2Vec依赖于随机数种子。下面是一个循环，它打印出集群0到9的单词:

In [63]:
# For the first 10 clusters
for cluster in xrange(0,10):
    #
    # Print the cluster number  
    print "\nCluster %d" % cluster
    #
    # Find all of the words for that cluster number, and print them out
    words = []
    for i in xrange(0,len(word_centroid_map.values())):
        if( word_centroid_map.values()[i] == cluster ):
            words.append(word_centroid_map.keys()[i])
    print(words) 

SyntaxError: Missing parentheses in call to 'print' (<ipython-input-63-a6cc0cc8a7c4>, line 5)

结果非常有趣:

In [64]:
Cluster 0

SyntaxError: invalid syntax (<ipython-input-64-ddaca8b2cd9f>, line 1)

In [65]:
Cluster 1

SyntaxError: invalid syntax (<ipython-input-65-5d82318982d4>, line 1)

In [66]:
Cluster 2

SyntaxError: invalid syntax (<ipython-input-66-2b0c1501d5ef>, line 1)

In [67]:
Cluster 3

SyntaxError: invalid syntax (<ipython-input-67-81de173f4fd0>, line 1)

In [68]:
Cluster 4

SyntaxError: invalid syntax (<ipython-input-68-a741daaae5af>, line 1)

In [69]:
Cluster 5

SyntaxError: invalid syntax (<ipython-input-69-6e6c286c0224>, line 1)

In [70]:
Cluster 6

SyntaxError: invalid syntax (<ipython-input-70-3939f73a1396>, line 1)

In [71]:
Cluster 7

SyntaxError: invalid syntax (<ipython-input-71-373d775b6def>, line 1)

In [72]:
Cluster 8

SyntaxError: invalid syntax (<ipython-input-72-caf8842782b4>, line 1)

In [73]:
Cluster 9

SyntaxError: invalid syntax (<ipython-input-73-05c41236b158>, line 1)

我们可以看到集群的质量是不同的。有些是有意义的——集群3主要包含名称，集群6-8包含相关的形容词(集群6是我的最爱)。另一方面，集群5有点神秘:龙虾和鹿有什么共同点(除了是两种动物)?集群0更糟糕:顶层公寓和套房似乎属于一起，但它们似乎不属于苹果和护照。集群2包含…也许战争相关的单词?也许我们的算法对形容词最有效。


无论如何，现在我们已经为每个单词分配了一个集群(或“重心”)，并且我们可以定义一个函数来将评论转换成重心。这就像一个词包，但使用语义相关的集群，而不是单独的词:

In [74]:
def create_bag_of_centroids( wordlist, word_centroid_map ):
    #
    # The number of clusters is equal to the highest cluster index
    # in the word / centroid map
    num_centroids = max( word_centroid_map.values() ) + 1
    #
    # Pre-allocate the bag of centroids vector (for speed)
    bag_of_centroids = np.zeros( num_centroids, dtype="float32" )
    #
    # Loop over the words in the review. If the word is in the vocabulary,
    # find which cluster it belongs to, and increment that cluster count 
    # by one
    for word in wordlist:
        if word in word_centroid_map:
            index = word_centroid_map[word]
            bag_of_centroids[index] += 1
    #
    # Return the "bag of centroids"
    return bag_of_centroids

上面的函数将为每个评论提供一个numpy数组，每个评论都有一些与集群数量相等的特性。最后，我们为我们的训练和测试集创建中心体包，然后训练一个随机森林并提取结果:

In [75]:
# Pre-allocate an array for the training set bags of centroids (for speed)
train_centroids = np.zeros( (train["review"].size, num_clusters), \
    dtype="float32" )

# Transform the training set reviews into bags of centroids
counter = 0
for review in clean_train_reviews:
    train_centroids[counter] = create_bag_of_centroids( review, \
        word_centroid_map )
    counter += 1

# Repeat for test reviews 
test_centroids = np.zeros(( test["review"].size, num_clusters), \
    dtype="float32" )

counter = 0
for review in clean_test_reviews:
    test_centroids[counter] = create_bag_of_centroids( review, \
        word_centroid_map )
    counter += 1

# Fit a random forest and extract predictions 
forest = RandomForestClassifier(n_estimators = 100)

# Fitting the forest may take a few minutes
print("Fitting a random forest to labeled training data...") 
forest = forest.fit(train_centroids,train["sentiment"])
result = forest.predict(test_centroids)

# Write the test results 
output = pd.DataFrame(data={"id":test["id"], "sentiment":result})
output.to_csv( "BagOfCentroids.csv", index=False, quoting=3 )

NameError: name 'num_clusters' is not defined

我们发现上面的代码给出的结果与第1部分给出的结果大致相同(甚至更差)。