# Machine Learning Practice with Word2Vec

This tutorial uses the Word2Vec model built earlier to perform sentiment analysis.

In [64]:
from prepro import data_prepro, review_to_wordlist
import pandas as pd
from IPython.display import display
from gensim.models import Word2Vec
import numpy as np

In [2]:
train = pd.read_csv( "../src/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )

In [25]:
dev_portion = 0.2
train_count = int(train.shape[0] * (1 - dev_portion))

## Data Filtering

In [26]:
train_input_data = data_prepro(train['review'][:train_count])
train_labels = train['sentiment'][:train_count]

dev_input_data = data_prepro(train['review'][train_count:])
dev_labels = train['sentiment'][train_count:]

In [22]:
display(" ".join(train_input_data[0]))

'with all this stuff going down at the moment with mj i ve started listening to his music watching the odd documentary here and there watched the wiz and watched moonwalker again maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts is only on for minutes or s

In [23]:
display(train_labels[0])

1

## Load Model

In [12]:
model = Word2Vec.load("300features_40minwords_10context")

## Building Sentence Embedding Vectors from Word Embedding Vectors for Training

Here we build each sentence embedding in the simplest way: by averaging the vectors of the words in the sentence.
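To make the averaging concrete, the following is a minimal sketch on toy data. The 2-dimensional vectors, the `toy_vectors` dictionary, and the `average_vector` helper are hypothetical stand-ins for illustration, not the trained Word2Vec model or the functions used in this notebook.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings standing in for trained Word2Vec vectors.
toy_vectors = {
    "good": np.array([1.0, 3.0], dtype="float32"),
    "movie": np.array([3.0, 1.0], dtype="float32"),
    "bad": np.array([-1.0, -3.0], dtype="float32"),
}

def average_vector(words, vectors, num_features):
    """Average the vectors of in-vocabulary words; return zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0)

sentence = ["a", "good", "movie"]  # "a" is out of vocabulary and is skipped
sentence_vec = average_vector(sentence, toy_vectors, num_features=2)
print(sentence_vec)  # [2. 2.]
```

Words missing from the vocabulary contribute nothing to the average, so only "good" and "movie" determine the result here.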

In [14]:
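def makeFeatureVec(words, model, num_features):
    # Sum the vectors of all in-vocabulary words, then divide by their count.
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    # index2word is the gensim 3.x name; gensim 4 renamed it to index_to_key.
    index2word_set = set(model.wv.index2word)
    for word in words:
        if word in index2word_set:
            nwords += 1
            # Look vectors up through model.wv; model[word] is deprecated.
            featureVec = np.add(featureVec, model.wv[word])
    # Guard against reviews with no known words to avoid dividing by zero.
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec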

In [15]:
def getAvgFeatureVecs(reviews, model, num_features):
    # Build one averaged feature vector per review.
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for counter, review in enumerate(reviews):
        # Print progress every 1000 reviews.
        if counter % 1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
    return reviewFeatureVecs

In [30]:
num_features = 300
trainDataVecs = getAvgFeatureVecs(train_input_data, model, num_features)
devDataVecs = getAvgFeatureVecs(dev_input_data, model, num_features)

Review 0 of 20000
Review 1000 of 20000
Review 2000 of 20000
Review 3000 of 20000
Review 4000 of 20000
Review 5000 of 20000
Review 6000 of 20000
Review 7000 of 20000
Review 8000 of 20000
Review 9000 of 20000
Review 10000 of 20000
Review 11000 of 20000
Review 12000 of 20000
Review 13000 of 20000
Review 14000 of 20000
Review 15000 of 20000
Review 16000 of 20000
Review 17000 of 20000
Review 18000 of 20000
Review 19000 of 20000
Review 0 of 5000
Review 1000 of 5000
Review 2000 of 5000
Review 3000 of 5000
Review 4000 of 5000


## Training Machine Learning Models on Sentence Embedding Vectors

We now use the sentence embedding vectors built so far to predict sentiment. Here we train three models: a RandomForest, an SVM (scikit-learn's `SVR`), and Linear Regression.

In [29]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier( n_estimators = 100 )

print("Fitting a random forest to labeled training data...")
forest = forest.fit(trainDataVecs, train_labels)

Fitting a random forest to labeled training data...


In [31]:
def accuracy(predict, ground_truth):
    correct = 0
    for p, g in zip(predict, ground_truth):
        if p == g:
            correct += 1
            
    return correct / len(predict)

In [41]:
from sklearn.metrics import mean_squared_error
predict = forest.predict(trainDataVecs)
display("accuracy : " + str(accuracy(predict, train_labels)))
display("Mean Square Error : " + str(mean_squared_error(predict, train_labels)))

'accuracy : 1.0'

'Mean Square Error : 0.0'

In [42]:
dev_predict = forest.predict(devDataVecs)
display("accuracy : " + str(accuracy(dev_predict, dev_labels)))
display("Mean Square Error : " + str(mean_squared_error(dev_predict, dev_labels)))

'accuracy : 0.824'

'Mean Square Error : 0.176'

In [54]:
from sklearn.svm import SVR
svm_reg = SVR(kernel="rbf")

print("Fitting a SVR to labeled training data...")
svm_reg = svm_reg.fit(trainDataVecs, train_labels)

Fitting a SVR to labeled training data...


In [55]:
predict = svm_reg.predict(trainDataVecs)
display("accuracy : " + str(accuracy(predict, train_labels)))
display("Mean Square Error : " + str(mean_squared_error(predict, train_labels)))

'accuracy : 0.0'

'Mean Square Error : 0.211800763772'

In [56]:
dev_predict = svm_reg.predict(devDataVecs)
display("accuracy : " + str(accuracy(dev_predict, dev_labels)))
display("Mean Square Error : " + str(mean_squared_error(dev_predict, dev_labels)))

'accuracy : 0.0'

'Mean Square Error : 0.214130389543'

In [50]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

print("Fitting a Linear Regression to labeled training data...")
lin_reg = lin_reg.fit(trainDataVecs, train_labels)

Fitting a Linear Regression to labeled training data...


In [51]:
predict = lin_reg.predict(trainDataVecs)
display("accuracy : " + str(accuracy(predict, train_labels)))
display("Mean Square Error : " + str(mean_squared_error(predict, train_labels)))

'accuracy : 0.0'

'Mean Square Error : 0.113799310561'

In [53]:
dev_predict = lin_reg.predict(devDataVecs)
display("accuracy : " + str(accuracy(dev_predict, dev_labels)))
display("Mean Square Error : " + str(mean_squared_error(dev_predict, dev_labels)))

'accuracy : 0.0'

'Mean Square Error : 0.119380547427'
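The accuracy of 0.0 reported for `SVR` and `LinearRegression` does not by itself mean the models learned nothing. Both are regressors that return continuous scores, and a continuous score almost never equals the integer label 0 or 1 exactly, so the `p == g` comparison in `accuracy` fails for every sample. One common workaround is to threshold the scores before comparing. The sketch below assumes a cutoff of 0.5; the `regression_accuracy` helper and the example numbers are hypothetical and are not results from this notebook.

```python
import numpy as np

def regression_accuracy(scores, ground_truth, threshold=0.5):
    """Turn continuous regression scores into 0/1 labels, then compute accuracy."""
    predicted_labels = (np.asarray(scores) >= threshold).astype(int)
    return float(np.mean(predicted_labels == np.asarray(ground_truth)))

# Hypothetical scores and labels for illustration only.
example_scores = [0.91, 0.12, 0.55, 0.40]
example_labels = [1, 0, 0, 1]
print(regression_accuracy(example_scores, example_labels))  # 0.5
```

In this notebook the same helper could be called as `regression_accuracy(lin_reg.predict(devDataVecs), dev_labels)` to get an accuracy that is comparable with the RandomForest result.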

In [58]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

input_data = trainDataVecs
label_data = train_labels  # same 20000 rows as trainDataVecs (train['sentiment'] has all 25000)

forest_cla = RandomForestClassifier(n_estimators = 100)

forest_cla.fit(trainDataVecs, train_labels)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [61]:
dev_predict = forest_cla.predict(devDataVecs)

In [62]:
display("accuracy : " + str(accuracy(dev_predict, dev_labels)))
display("Mean Square Error : " + str(mean_squared_error(dev_predict, dev_labels)))

'accuracy : 0.8254'

'Mean Square Error : 0.1746'

## Prediction Experiment with the Model

Now we use the model trained so far to check the sentiment it predicts for a few sample reviews.

* Positive: 1, negative: 0

In [65]:
test_str_list = ["I love this movie! It was awesome when I was watching the best scenery.", 
                 "Because it was boring, I hate this movie. It was too much useless, and I don't wann recommend this.",
                 "The reason I watched this movie is every pictures are like super and amazingly wonderful! I'd like to see this again.",
                 "Even though someone recommend this, I will disagree with his thought. It's just terrible and horrible movie ever I've seen.",
                 "Greatly maded with fantastic casts! The players are facinating. Every scene was perfect to beat my heart.",
                 "I have no idea why director made this! It was totally disaster. Awful casts! Terrible plots!",
                 "Someone told me this is great because there are amazing catings and wonderful plots, but finally the worst movie ever I've seen.",
                 "Sounds like interesting!"]

test_reviews = []
for review in test_str_list:
    test_reviews.append(review_to_wordlist(review))

In [66]:
test_inputs = getAvgFeatureVecs( test_reviews, model, num_features )

Review 0 of 8


In [67]:
display(forest_cla.predict(test_inputs))

array([1, 0, 1, 0, 1, 0, 1, 0])
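To read the output more easily, the predicted integers can be mapped back to label names. This is a small sketch: the `predictions` list is copied from the output above, while the `label_names` mapping is an illustrative choice following the 1 = positive, 0 = negative convention.

```python
# Predictions copied from the output above.
predictions = [1, 0, 1, 0, 1, 0, 1, 0]
# Illustrative label names for the 1 = positive, 0 = negative convention.
label_names = {1: "positive", 0: "negative"}

readable = [label_names[p] for p in predictions]
for index, label in enumerate(readable):
    print("review %d: %s" % (index, label))
```

The seventh review (index 6) ends with "the worst movie ever I've seen" yet is predicted as positive. One plausible reason is that averaging word vectors discards word order, so the earlier praise words ("great", "amazing", "wonderful") can outweigh the final negative clause.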