# Kaggle Sentiment analysis -- Bag of Words Meets Bags of Popcorn
## Background
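Sentiment analysis is a challenging topic in machine learning. People use many rhetorical devices when they express themselves, and the same sentence can carry different meanings in different contexts, which can mislead both humans and computers. This task provides a partially labeled set of IMDB movie reviews as training data; the goal is to predict whether each review in the test set is positive or negative.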
## Data Set
The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. Sentiment is binary: an IMDB rating < 5 gives a sentiment score of 0, and a rating >= 7 gives a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, another 50,000 IMDB reviews are provided without any rating labels.
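
The labeling rule above can be sketched as a small function. This is only an illustration of the thresholds described here: the name `rating_to_sentiment` is hypothetical, and returning `None` for ratings of 5 and 6 is an assumption, since the description gives no label for them.

```python
def rating_to_sentiment(rating):
    """Map an IMDB rating (1-10) to the binary label described above."""
    if rating < 5:
        return 0   # negative review
    if rating >= 7:
        return 1   # positive review
    return None    # assumption: ratings 5-6 get no label in this sketch

labels = [rating_to_sentiment(r) for r in (2, 4, 6, 7, 10)]
```

Here `labels` holds one entry per example rating, with `None` marking the ratings that fall between the two thresholds.
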
## Data fields
* id - Unique ID of each review
* sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
* review - Text of the review

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest,chi2

## Read data

In [5]:
train = pd.read_csv('data/labeledTrainData.tsv',header=0,delimiter="\t", quoting=3)
test = pd.read_csv('data/testData.tsv',header=0,delimiter="\t", quoting=3)
print 'The first review is:'
print train["review"][0]

The first review is:
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bi

In [None]:
import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

class KaggleWord2VecUtility(object):
    """processing raw HTML text into list for learning"""
    @staticmethod
    def review_to_wordlist( review, remove_stopwords=False ):
        # Convert a document to a list of lower-case words,
        # optionally removing stop words.
        review_text = BeautifulSoup(review, "html.parser").get_text() # remove HTML
        review_text = re.sub("[^a-zA-Z]"," ", review_text)  # replace non-letters with spaces
        words = review_text.lower().split() # lower case and split
        if remove_stopwords:
            stops = set(stopwords.words("english"))   # a set gives fast membership tests
            words = [w for w in words if w not in stops]
        return words # return list of words

    # split review into parsed sentences
    @staticmethod
    def review_to_sentences( review, tokenizer, remove_stopwords=False ):
        raw_sentences = tokenizer.tokenize(review.decode('utf8').strip()) #split the paragraph into sentences
        sentences = []
        for raw_sentence in raw_sentences:
            if len(raw_sentence) > 0:
                sentences.append( KaggleWord2VecUtility.review_to_wordlist( raw_sentence, remove_stopwords ))
        return sentences  # returns a list of lists
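
To see what `review_to_wordlist` does to a review, the sketch below reproduces its steps on a short string using only the standard library. It is not the class above: the regex tag stripper stands in for BeautifulSoup, and the tiny `STOPS` set is a made-up stand-in for NLTK's English stop word list.

```python
import re

# Hypothetical, tiny stop word set; the code above uses stopwords.words("english").
STOPS = {"the", "is", "a", "of", "this", "and", "but"}

def simple_wordlist(review, remove_stopwords=False):
    text = re.sub(r"<[^>]+>", " ", review)   # crude tag removal (stand-in for BeautifulSoup)
    text = re.sub(r"[^a-zA-Z]", " ", text)   # replace non-letters with spaces
    words = text.lower().split()             # lower-case and split on whitespace
    if remove_stopwords:
        words = [w for w in words if w not in STOPS]
    return words

sample = "Visually impressive<br /><br />but this is all about MJ!"
all_words = simple_wordlist(sample)
content_words = simple_wordlist(sample, remove_stopwords=True)
```

Comparing `all_words` with `content_words` shows which tokens the stop word filter drops; punctuation and the `<br />` tags are gone in both.
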

## Baseline by BOW & RandomForest

In [None]:
# download text data sets, including stop words
print 'Download text data sets. If you already have NLTK datasets downloaded, just close the window'
#nltk.download()  
clean_train_reviews = [] # init list

# clean train data
for i in xrange( 0, len(train["review"])):
    clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["review"][i], True)))
    
# ****** create bow from the training set
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
# convert the result to array
train_data_features = train_data_features.toarray()

# ===========train random forest using bow===========
# init a randomforest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit( train_data_features, train["sentiment"] )

clean_test_reviews = [] # init list
# clean test set
for i in xrange(0,len(test["review"])):
    clean_test_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(test["review"][i], True)))
# get bow for test set, convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()
# use random forest to make predictions
result = forest.predict(test_data_features)
# put the predictions in the submission format
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
# write the submission file (relative path, because __file__ is not defined in a notebook)
output.to_csv(os.path.join('data', 'bow.csv'), index=False, quoting=3)
print "Wrote results to bow.csv"
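
For intuition about what the bag-of-words step produces, here is a minimal pure-Python version. The function names `build_vocabulary` and `bag_of_words` are hypothetical, and this is not scikit-learn's implementation (tokenization and tie-breaking may differ); it only shows the idea of keeping the `max_features` most frequent words and counting them per document.

```python
from collections import Counter

def build_vocabulary(docs, max_features):
    counts = Counter(word for doc in docs for word in doc.split())
    # rank by descending frequency, then alphabetically so ties are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # keep the most frequent words and use alphabetical order for the columns
    return sorted(word for word, _ in ranked[:max_features])

def bag_of_words(doc, vocabulary):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]  # one count per vocabulary word

docs = ["good movie good cast", "bad movie", "good plot bad ending"]
vocab = build_vocabulary(docs, max_features=3)
vectors = [bag_of_words(d, vocab) for d in docs]
```

Each row of `vectors` has one column per vocabulary word, which is the shape of input the random forest is trained on above (with 5000 columns instead of 3).
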

## Improved by TfidfVectorizer & SVM

In [None]:
# init
clean_train_reviews = []

# clean train set 
for i in xrange( 0, len(train["review"])):
    clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["review"][i])))
    
# create tfidf
train_vectorizer = TfidfVectorizer(stop_words='english',ngram_range=(1,3), sublinear_tf=True, min_df=2)
train_tfidf_arr = train_vectorizer.fit_transform(clean_train_reviews)

# keep the k features with the highest chi-squared scores
ch2 = SelectKBest(chi2, k=200000)
train_feature = ch2.fit_transform(train_tfidf_arr, train['sentiment'])
    
# ============train svm using tfidf==================
# init svc classifier with c & linear kernel
clf_linear  = svm.SVC(C=1.0, kernel='linear').fit(train_feature, train["sentiment"]) 

# clean test set, with the same preprocessing as the train set
clean_test_reviews = []  # reset so test reviews from the previous section are not duplicated
for i in xrange(0,len(test["review"])):
    clean_test_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(test["review"][i])))

# reuse the vectorizer fitted on the train set so that the test columns
# match the features the chi2 selector and the SVM were trained on
test_tfidf_arr = train_vectorizer.transform(clean_test_reviews)

test_feature = ch2.transform(test_tfidf_arr)

# use the svm to make predictions
result = clf_linear.predict(test_feature)
# put the predictions in the submission format
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
# write the submission file (relative path, because __file__ is not defined in a notebook)
output.to_csv(os.path.join('data', 'svm.csv'), index=False, quoting=3)
print "Wrote results to svm.csv" 
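
As a rough illustration of the weighting behind `TfidfVectorizer(sublinear_tf=True)`, the sketch below computes TF-IDF for two toy documents in plain Python. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, the sublinear term frequency `1 + ln(tf)`, and L2 normalization, which is how scikit-learn documents its defaults. The function name `tfidf_vectors` is hypothetical, and n-grams, `min_df`, and stop word removal are left out.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    tokenized = [doc.split() for doc in docs]
    vocab = sorted(set(w for words in tokenized for w in words))
    n = len(tokenized)
    # document frequency: number of documents that contain each word
    df = Counter(w for words in tokenized for w in set(words))
    idf = {w: math.log((1.0 + n) / (1.0 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for words in tokenized:
        tf = Counter(words)
        # sublinear tf: 1 + ln(tf) for words that occur, 0 otherwise
        row = [(1.0 + math.log(tf[w])) * idf[w] if tf[w] else 0.0 for w in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        vectors.append([x / norm for x in row] if norm else row)  # L2-normalize
    return vocab, vectors

vocab, vectors = tfidf_vectors(["good good movie", "bad movie"])
```

In the second document, "bad" gets a larger weight than "movie" because "movie" appears in every document and so has the lowest possible IDF.
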

## Use Word2vec 

In [None]:
from nltk.corpus import stopwords
import nltk.data
from gensim.models import Word2Vec

def makeFeatureVec(words, model, num_features):
    # average the vectors of all words that are in the model's vocabulary
    featureVec = np.zeros((num_features,),dtype="float32") # preallocate for speed
    nwords = 0.
    index2word_set = set(model.index2word) # a set gives fast membership tests
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec,model[word])
    if nwords > 0:  # avoid dividing by zero when no word is in the vocabulary
        featureVec = np.divide(featureVec,nwords)
    return featureVec

def getAvgFeatureVecs(reviews, model, num_features):
    # calculate average feature vector for each one
    # return a 2D numpy array
    counter = 0  # integer, so it can be used as an array index
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32") # preallocate for speed
    for review in reviews:
        # print a status message every 1000 reviews
        if counter % 1000 == 0:
            print "Review %d of %d" % (counter, len(reviews))
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
        counter = counter + 1
    return reviewFeatureVecs

def getCleanReviews(reviews):
    clean_reviews = []
    for review in reviews["review"]:
        clean_reviews.append( KaggleWord2VecUtility.review_to_wordlist( review, remove_stopwords=True ))
    return clean_reviews

if __name__ == '__main__':
    # read data 
    train = pd.read_csv('data/labeledTrainData.tsv',header=0,delimiter="\t", quoting=3)
    test = pd.read_csv('data/testData.tsv',header=0,delimiter="\t", quoting=3)
    unlabeled_train = pd.read_csv('data/unlabeledTrainData.tsv', header=0, delimiter="\t", quoting=3)

    # confirm the number of reviews read (25,000 + 25,000 + 50,000 = 100,000)
    print "Read %d labeled train reviews, %d labeled test reviews, " \
     "and %d unlabeled reviews\n" % (train["review"].size,
     test["review"].size, unlabeled_train["review"].size )

    # Load the punkt tokenizer
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    sentences = []  # init
    # clean labeled train data
    for review in train["review"]:
        sentences += KaggleWord2VecUtility.review_to_sentences(review, tokenizer)
    
    # clean unlabeled train data
    for review in unlabeled_train["review"]:
        sentences += KaggleWord2VecUtility.review_to_sentences(review, tokenizer)

    # ======== set parameters and train word2vec ===========
    # Set parameters
    num_features = 300    # word vector dimension
    min_word_count = 40   # Min word count
    num_workers = 4       # number of threads
    context = 10          # context window size
    downsampling = 1e-3   # downsample setting for frequent words

    # Init
    model = Word2Vec(sentences, workers=num_workers, \
                size=num_features, min_count = min_word_count, \
                window = context, sample = downsampling, seed=1)

    # save the model 
    model_name = "300features_40minwords_10context"
    model.save(model_name)
    
    #========== test ==============
    model.doesnt_match("man woman child kitchen".split())
    model.doesnt_match("france england germany berlin".split())
    model.most_similar("man")
    model.most_similar("queen")

    # ======== create average vectors for the train & test set===========
    trainDataVecs = getAvgFeatureVecs( getCleanReviews(train), model, num_features )
    testDataVecs = getAvgFeatureVecs( getCleanReviews(test), model, num_features )

    # ======== use random forest to predict ============
    forest = RandomForestClassifier( n_estimators = 100 )
    forest = forest.fit( trainDataVecs, train["sentiment"] )
    result = forest.predict( testDataVecs )

    # output results
    output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
    output.to_csv( "Word2Vec_Ave.csv", index=False, quoting=3 )
    print "Wrote Word2Vec_Ave.csv"
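
The averaging step in `makeFeatureVec` can be tried without a trained model. In the sketch below a plain dict of made-up 3-dimensional vectors plays the role of the Word2Vec model; `toy_vectors` and `average_vector` are hypothetical names, and returning a zero vector when no word is known is an assumption of this sketch.

```python
def average_vector(words, vectors, num_features):
    total = [0.0] * num_features
    nwords = 0
    for word in words:
        if word in vectors:   # skip out-of-vocabulary words
            nwords += 1
            total = [t + v for t, v in zip(total, vectors[word])]
    if nwords == 0:
        return total          # assumption: zero vector if nothing matched
    return [t / nwords for t in total]

# Made-up vectors standing in for a trained model's word vectors.
toy_vectors = {"good": [1.0, 0.0, 2.0], "movie": [3.0, 2.0, 0.0]}
avg = average_vector(["good", "movie", "unknownword"], toy_vectors, 3)
empty = average_vector(["unknownword"], toy_vectors, 3)
```

Every review, whatever its length, ends up as one vector of `num_features` values, which is what lets a standard classifier such as the random forest consume it.
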