**Reading the Data**

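The necessary files can be downloaded from the Data page. The first file you need is labeledTrainData.tsv, which contains 25,000 IMDB movie reviews, each with a positive or negative sentiment label.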


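Next, read the tab-delimited file into Python. To do this, we can use the pandas package, introduced in the Titanic tutorial, which provides the read_csv function for easily reading and writing data files. If you haven't used pandas before, you may need to install it.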


In [1]:
# Import the pandas package, then use the "read_csv" function to read
# the labeled training data
import pandas as pd

In [2]:
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

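Here, "header=0" indicates that the first line of the file contains column names, "delimiter=\t" indicates that the fields are separated by tabs, and quoting=3 tells Python to ignore double quotes; otherwise you may encounter errors when reading the file.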
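To see what quoting=3 does in practice, here is a minimal sketch on a made-up two-row sample (the ids and review text are hypothetical, not taken from the dataset). The value 3 is the constant csv.QUOTE_NONE from Python's standard library, so pandas treats double quotes as ordinary characters and leaves them in the fields:

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same three-column, tab-separated layout.
sample = (
    'id\tsentiment\treview\n'
    '"1_8"\t1\t"A "great" film"\n'
    '"2_3"\t0\t"Dull."\n'
)

# quoting=3 is the same as quoting=csv.QUOTE_NONE.
df = pd.read_csv(io.StringIO(sample), header=0, delimiter="\t",
                 quoting=csv.QUOTE_NONE)

print(df.shape)
print(df["review"][0])  # the surrounding and embedded quotes are kept
```

This is also why a review printed from the training set starts with a literal double quote.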


We can make sure that we read 25,000 rows and 3 columns as follows:

In [3]:
train.shape

(25000, 3)

In [4]:
train.columns.values

array(['id', 'sentiment', 'review'], dtype=object)

The three columns are called "id", "sentiment", and "review". Now that you have read in the training set, take a look at a few reviews:

In [5]:
print(train["review"][0])  # Print the first value of the "review" column

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

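There are HTML tags such as "<br />", abbreviations, and punctuation: all common issues when processing text from the web. Take some time to look through other reviews in the training set while you're at it. The next section covers how to tidy up the text for machine learning.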

**Data Cleaning and Text Preprocessing**

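Removing HTML Markup: The BeautifulSoup Package

First, we'll remove the HTML tags. For this purpose, we'll use the Beautiful Soup library. If you don't have it installed, run the following from the command line (not from within Python):

$ sudo pip install BeautifulSoup4

Then, from within Python, load the package and use it to extract the text from a review: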

In [6]:
# Import BeautifulSoup into your workspace
from bs4 import BeautifulSoup     

In [7]:
# Initialize the BeautifulSoup object on a single movie review.
# Naming the parser explicitly avoids bs4's "no parser specified" warning.
example1 = BeautifulSoup(train["review"][0], "html.parser")

In [8]:
# Print the raw review and then the output of get_text(), for 
# comparison
print(train["review"][0])
print(example1.get_text())

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

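Calling get_text() gives you the text of the review without tags or markup. If you browse the BeautifulSoup documentation, you'll see that it's a very powerful library, more powerful than we need for this dataset. However, using regular expressions to remove markup is not considered a reliable practice, so even for an application as simple as this one, it's usually best to use a package like BeautifulSoup.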
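As a small illustration (the HTML fragment is a shortened piece of the review above, and passing "html.parser" explicitly is a choice made for this sketch), get_text() drops the <br /> tags entirely, while its separator argument puts a space between the pieces of text the tags separated:

```python
from bs4 import BeautifulSoup

fragment = "drugs are bad m'kay.<br /><br />Visually impressive"

soup = BeautifulSoup(fragment, "html.parser")

plain = soup.get_text()                # tags removed, nothing put in their place
spaced = soup.get_text(separator=" ")  # text pieces joined with a space

print(plain)
print(spaced)
```

Without a separator, the words on either side of a tag can end up glued together, which is worth keeping in mind if you split on whitespace later.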

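Dealing with Punctuation, Numbers and Stopwords: NLTK and Regular Expressions

When considering how to clean the text, we should think about the data problem we are trying to solve. For many problems, it makes sense to remove punctuation. On the other hand, in this case we are tackling a sentiment analysis problem, and "!!" or ":-(" may carry sentiment and arguably should be treated as words. In this tutorial, for simplicity, we remove all punctuation, but it is something you can experiment with on your own.

Similarly, in this tutorial we will remove numbers, but there are other ways of dealing with them that make just as much sense. For example, we could treat them as words, or replace them all with a placeholder string such as "NUM".

To remove punctuation and numbers, we will use a package for dealing with regular expressions, called re. The package comes built in with Python; there is nothing to install. For a detailed description of how regular expressions work, see the package documentation. Now, try the following: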

In [9]:
import re
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for: anything that is not a letter
                      " ",                   # The replacement: a single space
                      example1.get_text() )  # The text to search
print(letters_only)

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    mi

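A full overview of regular expressions is beyond the scope of this tutorial, but for now it is sufficient to know that [] indicates group membership and ^ means "not". In other words, the re.sub() statement above says, "Find anything that is NOT a lowercase letter (a-z) or an uppercase letter (A-Z), and replace it with a space."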
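A quick sketch of the same character class on a short made-up string, along with the "NUM" placeholder alternative mentioned earlier (the sample sentence and variable names are illustrative only):

```python
import re

sample = "It's 100% great!!"

# Replace every character that is not a letter with a space.
letters_only_demo = re.sub("[^a-zA-Z]", " ", sample)
print(letters_only_demo.split())

# Alternative: keep a trace of numbers by replacing each run of digits with "NUM".
with_placeholder = re.sub("[0-9]+", "NUM", sample)
print(with_placeholder)
```

Note that the apostrophe in "It's" is also replaced, which is why stray tokens such as "s" and "ve" show up in the cleaned review.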


We'll also convert our review to lower case and split it into individual words (called "tokenization" in NLP lingo):

In [10]:
lower_case = letters_only.lower()        # Convert to lower case
words = lower_case.split()               # Split into words

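Finally, we need to decide how to deal with frequently occurring words that don't carry much meaning. Such words are called "stop words"; in English they include words such as "a", "and", "is", and "the". Conveniently, there are Python packages that come with stop word lists built in. Let's import a stop word list from the Python Natural Language Toolkit (NLTK). You'll need to install the library if you don't already have it on your computer; you'll also need to install the data packages that come with it, as follows: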

In [11]:
# import nltk
# nltk.download()  # Download text data sets, including stop words

Now we can use nltk to get a list of stop words:

In [12]:
from nltk.corpus import stopwords # Import the stopword list
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

This lets you view the list of English-language stop words. To remove stop words from our movie review, do:

In [13]:
# Remove stop words from "words"
words = [w for w in words if not w in stopwords.words("english")]
print(words)

['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', 'mj', 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing', 'psychopathic', 'powerful', 'drug', 'lord', 

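This looks at each word in our "words" list and discards anything that is found in the list of stop words. After all of these steps, your review should begin as shown in the output above.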
    
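There are many other things we could do to the data. For example, Porter Stemming and Lemmatizing (both available in NLTK) would allow us to treat "messages", "message", and "messaging" as the same word, which could certainly be useful. However, for simplicity, the tutorial will stop here.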
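For readers who want to try stemming anyway, a minimal sketch using NLTK's PorterStemmer is shown here. It is not part of the tutorial's pipeline; the stemmer is purely rule-based, so it should not need any of the downloadable NLTK data sets:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The three word forms mentioned in the text.
stems = [stemmer.stem(w) for w in ["messages", "message", "messaging"]]
print(stems)

# All three forms collapse to a single stem.
print(len(set(stems)) == 1)
```

The stem itself is not necessarily a dictionary word; what matters for a bag of words is that the variants map to the same token.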

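Putting It All Together

Now we have code to clean one review, but we need to clean 25,000 training reviews! To make our code reusable, let's create a function that can be called many times: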

In [14]:
def review_to_words(raw_review ):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by spaces,
    # and return the result.
    return " ".join(meaningful_words)

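Two elements here are new. First, we converted the stop word list to a different data type, a set. This is for speed: since we'll be calling this function tens of thousands of times, it needs to be fast, and searching sets in Python is much faster than searching lists.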
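A rough way to check the set-versus-list claim on your own machine is sketched here with the standard timeit module. The word list is a hypothetical stand-in for a stop word list of a couple of hundred entries, and the exact timings will vary:

```python
import timeit

# Hypothetical stand-in for a stop word list with a couple of hundred entries.
stop_list = ["word%d" % i for i in range(200)]
stop_set = set(stop_list)

probe = "not_a_stop_word"  # worst case for the list: every entry is compared

# Time 20,000 membership tests against each container.
list_time = timeit.timeit(lambda: probe in stop_list, number=20_000)
set_time = timeit.timeit(lambda: probe in stop_set, number=20_000)

print("list: %.4f s, set: %.4f s" % (list_time, set_time))
```

A list lookup scans entries one by one, while a set lookup uses hashing, so the gap grows with the length of the stop word list.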


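Second, we joined the words back into one paragraph. This makes the output easier to use in our Bag of Words, below. After defining the above function, if you call the function for a single review: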

In [15]:
clean_review = review_to_words(train["review"][0])
print(clean_review)

stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working

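it should give you exactly the same output as all of the individual steps we did in the preceding sections. Now let's loop through and clean all of the training set at once (this might take a few minutes depending on your computer):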

In [16]:
# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over each review; create an index i that goes from 0 to the length
# of the movie review list
for i in range(0, num_reviews):
    # Call our function for each one, and add the result to the list of
    # clean reviews
    clean_train_reviews.append( review_to_words( train["review"][i] ) )

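Sometimes it can be annoying to wait for a lengthy piece of code to run. It can be helpful to write code so that it gives status updates. To have Python print a status update after every 5000 reviews that it processes, try adding a line or two to the code above: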

In [17]:
print("Cleaning and parsing the training set movie reviews...\n")
clean_train_reviews = []
for i in range(0, num_reviews):
    # If the review number (i+1) is evenly divisible by 5000, print a message
    if (i + 1) % 5000 == 0:
        print("Review %d of %d\n" % (i + 1, num_reviews))
    clean_train_reviews.append(review_to_words(train["review"][i]))

Cleaning and parsing the training set movie reviews...

Review 5000 of 25000

Review 10000 of 25000

Review 15000 of 25000

Review 20000 of 25000

Review 25000 of 25000



**Creating Features from a Bag of Words (Using scikit-learn)**

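Now that we have our training reviews tidied up, how do we convert them to some kind of numeric representation for machine learning? One common approach is called a Bag of Words. The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. For example, consider the following two sentences:

Sentence 1: "The cat sat on the hat"

Sentence 2: "The dog ate the cat and the hat"

From these two sentences, our vocabulary is as follows:

{ the, cat, sat, on, hat, dog, ate, and }

To get our bags of words, we count the number of times each word occurs in each sentence. In Sentence 1, "the" appears twice, and "cat", "sat", "on", and "hat" each appear once, so the feature vector for Sentence 1 is:

{ the, cat, sat, on, hat, dog, ate, and }

Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }

Similarly, the features for Sentence 2 are: { 3, 1, 0, 0, 1, 1, 1, 1 }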
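The counting step for these two sentences can be sketched in a few lines of plain Python. The helper name bag_of_words is made up for this illustration; the tutorial itself uses scikit-learn for this step:

```python
from collections import Counter

vocabulary = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    # Counter returns 0 for words that never occur in the sentence.
    return [counts[word] for word in vocabulary]

sentence_1 = bag_of_words("The cat sat on the hat", vocabulary)
sentence_2 = bag_of_words("The dog ate the cat and the hat", vocabulary)

print(sentence_1)  # [2, 1, 1, 1, 1, 0, 0, 0]
print(sentence_2)  # [3, 1, 0, 0, 1, 1, 1, 1]
```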


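With the IMDB data, we have a very large number of reviews, which will give us a large vocabulary. To limit the size of the feature vectors, we should choose some maximum vocabulary size. Below, we use the 8000 most frequent words (remembering that stop words have already been removed).

We'll be using the feature_extraction module from scikit-learn to create bag-of-words features. If you did the Random Forest tutorial in the Titanic competition, you should already have scikit-learn installed; otherwise you will need to install it.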

In [18]:
print ("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,   
                             max_features = 8000) 
print(type(vectorizer))

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)
# print(train_data_features)

# Numpy arrays are easy to work with, so convert the result to an 
# array
train_data_features = train_data_features.toarray()

Creating the bag of words...

<class 'sklearn.feature_extraction.text.CountVectorizer'>


To see what the training data array now looks like, do:

In [19]:
print(train_data_features.shape)

(25000, 8000)


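It has 25,000 rows and 8,000 features (one for each vocabulary word).

Note that CountVectorizer comes with its own options to automatically do preprocessing, tokenization, and stop word removal. For each of these, instead of specifying "None", we could have used a built-in method or specified our own function to use. See the function documentation for more details. However, we wanted to write our own function for data cleaning in this tutorial to show you how it's done step by step.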
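As a rough sketch of those built-in options, here CountVectorizer lowercases, tokenizes, and removes stop words on its own. The two documents are the toy sentences from earlier. The stop word list built into scikit-learn is not the same list as NLTK's, so results on the real data would differ somewhat from the hand-written cleaning function:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the hat", "The dog ate the cat and the hat"]

# Let CountVectorizer lowercase, tokenize, and drop its built-in English stop words.
vectorizer_builtin = CountVectorizer(lowercase=True, stop_words="english")
features = vectorizer_builtin.fit_transform(docs)

# vocabulary_ maps each kept word to its column index.
print(sorted(vectorizer_builtin.vocabulary_))
print(features.toarray())
```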


Now that the Bag of Words model is trained, let's look at the vocabulary:

In [20]:
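# Take a look at the words in the vocabulary.
# Note: get_feature_names() was removed in scikit-learn 1.2; on older
# versions that lack get_feature_names_out(), use get_feature_names().
vocab = vectorizer.get_feature_names_out()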
print(vocab)



If you're interested, you can also print the counts of each word in the vocabulary:

In [21]:
import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)
# print(type(np))

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)


**Random Forest**

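At this point, we have numeric training features from the Bag of Words and the original sentiment labels for each feature vector, so let's do some supervised learning! Here, we'll use the Random Forest classifier introduced in the Titanic tutorial. The Random Forest algorithm is included in scikit-learn (Random Forest uses many tree-based classifiers to make predictions, hence the "forest"). Below, we set the number of trees to 300 as a reasonable starting value. More trees may (or may not) perform better, but will certainly take longer to run. Likewise, the more features you include for each review, the longer this will take.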
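Before committing to a run of several minutes, it can help to see the same fit and predict calls on a tiny example. The reviews and labels below are invented, and n_estimators=10 and random_state=0 are arbitrary choices for this sketch, not tuned values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-cleaned toy reviews with labels (1 = positive, 0 = negative).
toy_reviews = ["great fun film", "boring dull film",
               "fun great acting", "dull boring plot"]
toy_labels = [1, 0, 1, 0]

toy_vectorizer = CountVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_reviews).toarray()

# random_state is fixed here only so that repeated runs build the same forest.
toy_forest = RandomForestClassifier(n_estimators=10, random_state=0)
toy_forest.fit(toy_features, toy_labels)

print(len(toy_forest.estimators_))       # number of fitted trees
print(toy_forest.predict(toy_features))  # predictions on the training rows
```

Training time grows roughly with the number of trees, so timing a small forest first gives a feel for how long the full run will take.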

In [None]:
print ("Training the random forest...")
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 300 trees
forest = RandomForestClassifier(n_estimators=300) 

# Fit the forest to the training set, using the bag of words as 
# features and the sentiment labels as the response variable
#
# This may take a few minutes to run
forest = forest.fit(train_data_features, train["sentiment"])

**Creating a Submission**

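All that remains is to run the trained Random Forest on our test set and create a submission file. If you haven't already done so, download testData.tsv from the Data page. This file contains another 25,000 reviews and ids; our task is to predict the sentiment label.

Note that when we use the Bag of Words for the test set, we only call "transform", not "fit_transform" as we did for the training set. Fitting a model on the test set risks overfitting, so we keep the test set off-limits until we are ready to make predictions.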
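The difference between the two calls can be seen on a toy example. The documents are invented for illustration: transform reuses the vocabulary learned from the training documents, so the test matrix has the same columns and words never seen in training are simply ignored:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["great fun film", "boring dull film"]
test_docs = ["great boring sequel"]   # "sequel" never appears in train_docs

vec = CountVectorizer()
train_X = vec.fit_transform(train_docs)   # learns the vocabulary, then counts
test_X = vec.transform(test_docs)         # counts only, reusing that vocabulary

print(train_X.shape, test_X.shape)        # same number of columns
print("sequel" in vec.vocabulary_)        # the unseen word was not added
```

Calling fit_transform on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with the features the forest was trained on.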

In [None]:
# Read the test data
test = pd.read_csv("testData.tsv", 
                   header=0, 
                   delimiter="\t", 
                   quoting=3 )

# Verify that there are 25,000 rows and 2 columns
print (test.shape)

In [None]:
# Create an empty list and append the clean reviews one by one
num_reviews = len(test["review"])
clean_test_reviews = [] 

print ("Cleaning and parsing the test set movie reviews...\n")
for i in range(0,num_reviews):
    if((i+1) % 5000==0):
        print("Review %d of %d\n" % (i+1, num_reviews))
    clean_review = review_to_words( test["review"][i] )
    clean_test_reviews.append( clean_review )

In [None]:
# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)

# Copy the results to a pandas dataframe with an "id" column and
# a "sentiment" column
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )

# Use pandas to write the comma-separated output file
output.to_csv("Bag_of_Words_model.csv", index=False, quoting=3)

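Congratulations, you are ready to make your first submission! Try different things and see how your results change. You can clean the reviews differently, choose a different number of vocabulary words for the Bag of Words representation, try Porter Stemming, a different classifier, or any number of other things. To try out your NLP chops on a different data set, you can also head over to our Rotten Tomatoes competition. Or, if you're ready for something completely different, move along to the Deep Learning and Word Vector pages.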