## Loading the data

In [114]:
import pandas as pd

In [115]:
train = pd.read_csv('labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
test = pd.read_csv('testData.tsv', header=0, delimiter='\t', quoting=3)
unlabeled_train = pd.read_csv('unlabeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
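
A note on `quoting=3`: the value equals `csv.QUOTE_NONE` in Python's standard `csv` module, which tells the parser to treat double quotes as ordinary characters. The sketch below uses only the standard library and a made-up sample (not taken from the data files) to show the effect. It illustrates the setting itself, not how pandas parses these particular files.

```python
import csv
import io

# Hypothetical tab-separated sample with stray double quotes inside a field.
sample = 'id\treview\n"1_5"\t"He said "wow" twice."\n'

# QUOTE_NONE (numeric value 3) disables all special handling of quote characters.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

header, first_row = rows
print(csv.QUOTE_NONE)  # the constant that quoting=3 refers to
print(first_row)       # quote characters are preserved verbatim in both fields
```

With this setting the surrounding quotes stay in the text, which is consistent with the raw review shown later in this notebook beginning with a literal `"` character.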

In [116]:
print("Read %d labeled train reviews, %d labeled test reviews, and %d unlabeled reviews\n" % (train['review'].size, 
                                                                                              test['review'].size, 
                                                                                              unlabeled_train['review'].size))

Read 25000 labeled train reviews, 25000 labeled test reviews, and 104805 unlabeled reviews



## Data preprocessing

In [117]:
from bs4 import  BeautifulSoup
import re
from nltk.corpus import stopwords

In [118]:
def review_to_wordlist(review, remove_stopwords=False):
    # 1. Remove HTML; naming the parser explicitly avoids bs4's "no parser specified" warning
    review_text = BeautifulSoup(review, 'html.parser').get_text()
    # 2. Replace every non-letter character with a space
    review_text = re.sub('[^a-zA-Z]', ' ', review_text)
    # 3. Convert to lower case and split into words
    words = review_text.lower().split()
    # 4. Optionally remove English stop words (a set makes membership tests fast)
    if remove_stopwords:
        stops = set(stopwords.words('english'))
        words = [w for w in words if w not in stops]

    return words

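## Building the word2vec input (a list of lists)

**Word2Vec expects each sentence as a list of words, and the corpus as a list of sentences, where each sentence is itself a list. In other words, the input format is a list of lists.**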
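
As a minimal sketch of that shape, the snippet below builds a tiny hand-written corpus and checks that it is a list of lists of strings. The tokens and the helper `is_list_of_token_lists` are hypothetical and only illustrate the structure; the real input comes from the parsing functions in this notebook.

```python
# Hypothetical, hand-written tokens that stand in for parsed review sentences.
example_sentences = [
    ["moonwalker", "is", "part", "biography", "part", "feature", "film"],
    ["visually", "impressive"],
]

def is_list_of_token_lists(corpus):
    """Return True if corpus is a list whose items are lists of strings."""
    return isinstance(corpus, list) and all(
        isinstance(sentence, list) and all(isinstance(tok, str) for tok in sentence)
        for sentence in corpus
    )

nested_ok = is_list_of_token_lists(example_sentences)
flat_ok = is_list_of_token_lists(["visually", "impressive"])  # flat list, wrong shape

print(nested_ok)
print(flat_ok)
```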

In [119]:
# Download the punkt tokenizer for sentence splitting
import nltk.data
#nltk.download()   

# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

#print(tokenizer)

# Define a function to split a review into parsed sentences
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Function to split a review into parsed sentences. Returns a 
    # list of sentences, where each sentence is a list of words
    # 1. Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    #
    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences


In [120]:
sentences = []  # Initialize an empty list of sentences

print("Parsing sentences from training set")
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

print("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)


Parsing sentences from training set




Parsing sentences from unlabeled set




AttributeError: 'float' object has no attribute 'strip'

In [121]:
type(review)

float
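
The `float` is most likely a missing value: pandas represents an empty field as `NaN`, which is a `float` and has no `strip` method. That reading is an assumption, since the offending row is not shown here. One hedged way to guard the loop is to coerce non-string values to an empty string before tokenizing. `coerce_review` is a hypothetical helper, not part of the original notebook.

```python
def coerce_review(review):
    """Return the review unchanged if it is a string, otherwise an empty string."""
    return review if isinstance(review, str) else ""

# Stand-in values: float("nan") mimics what a missing review field looks like.
reviews = ["Good film. Loved it. ", float("nan"), " Too long."]

cleaned = [coerce_review(r).strip() for r in reviews]
skipped = sum(1 for r in reviews if not isinstance(r, str))

print(cleaned)
print(skipped)
```

An empty string yields no sentences downstream, so such rows simply contribute nothing to the corpus. Dropping them from the DataFrame before the loop would be an equally reasonable choice.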

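One detail to note is the difference between `+=` and `append` for Python lists. In many applications the two are interchangeable, but not here. **If you `append` a list of lists to another list of lists, the whole list is added as a single nested element; you need `+=` to join all of the inner lists at once.**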
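
The difference is easy to see on toy data. In this sketch the two parsed reviews are hypothetical; each is a list of sentences, and each sentence is a list of words.

```python
# Two hypothetical parsed reviews, each a list of sentences (lists of words).
review_a = [["great", "movie"], ["loved", "it"]]
review_b = [["too", "long"]]

appended = []
for parsed in (review_a, review_b):
    appended.append(parsed)   # each review is added as ONE nested element

joined = []
for parsed in (review_a, review_b):
    joined += parsed          # each sentence is added individually

print(len(appended), appended[0])
print(len(joined), joined[0])
```

Only `joined` has the list-of-lists shape described earlier: its elements are sentences, whereas the elements of `appended` are whole reviews.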

## Inspecting the split sentences before feeding them in

In [113]:
review = train['review'][0]
raw_sentences = tokenizer.tokenize(review.strip())
type(raw_sentences[0])

str

In [54]:
review_text = BeautifulSoup(raw_sentence[0]).get_text()

In [55]:
review_text = re.sub("[^a-zA-Z]"," ", review_text)

In [56]:
words = review_text.lower().split()

In [39]:
stops = set(stopwords.words("english"))
words = [w for w in words if w not in stops]

In [57]:
len(words)

32

In [58]:
type(words)

list

In [43]:
a.strip()

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [46]:
raw_sentence = tokenizer.tokenize(a.strip())

In [47]:
sentence = []

In [48]:
len(raw_sentence)

15

In [53]:
type(raw_sentence[0])

str

In [59]:
type(review)

float

In [63]:
a = train['review'][0]

In [64]:
review_text = BeautifulSoup(a).get_text()

In [65]:
type(a)

str

In [91]:
review = train['review'][0]

In [92]:
type(review)

str

In [93]:
raw_sentences = tokenizer.tokenize(review.strip())

In [77]:
raw_sentences

['"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again.',
 'Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent.',
 'Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released.',
 "Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring.",
 'Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit 

In [75]:
type(raw_sentence[0])

str

In [73]:
len(raw_sentence[0])

181

In [74]:
len(raw_sentence[0][0])

1

In [81]:
type(raw_sentences)
raw_sentence = raw_sentences[0]

list

In [84]:
review_text = BeautifulSoup(raw_sentences[0]).get_text()
review_text = re.sub("[^a-zA-Z]"," ", review_text)        
words = review_text.lower().split()

In [85]:
type(words)

list

In [86]:
type(review)

str

In [94]:
review = train['review'][0]

In [95]:
raw_sentences = tokenizer.tokenize(review.strip())

In [96]:
sentences = []

In [103]:
stops = set(stopwords.words("english"))  # build the stop-word set once, outside the loop
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        # 1. Remove HTML
        review_text = BeautifulSoup(raw_sentence, "html.parser").get_text()
        #
        # 2. Remove non-letters; keeping digits is an option to consider later
        review_text = re.sub("[^a-zA-Z]", " ", review_text)
        #
        # 3. Convert words to lower case and split the text into words
        words = review_text.lower().split()
        words = [w for w in words if w not in stops]
        sentences.append(words)
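
The cell above replaces every non-letter with a space, so digits are dropped along with punctuation. If keeping digits turns out to be useful, one possible variant is sketched below. The `tokenize` helper, its `keep_digits` flag, and the sample sentence are hypothetical and are not used elsewhere in this notebook.

```python
import re

def tokenize(text, keep_digits=False):
    """Lower-case and split text, replacing disallowed characters with spaces."""
    pattern = r"[^a-zA-Z0-9]" if keep_digits else r"[^a-zA-Z]"
    return re.sub(pattern, " ", text).lower().split()

sample = "Released in 1988, it's MJ's film."

letters_only = tokenize(sample)
with_digits = tokenize(sample, keep_digits=True)

print(letters_only)
print(with_digits)
```

Note that apostrophes are also replaced, so contractions such as "it's" split into two tokens in both variants.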

In [100]:
type(words)

list

In [101]:
type(words[0])

str

In [102]:
words[0]

'hope'

In [106]:
type(sentences[0][0])

str

In [107]:
type(review)

str