# Sentiment Analysis with Word2Vec and KMeans
**Rakha Paleva Kawiswara** <br>
15/05/2019

Feel free to use this notebook for your research. I would appreciate it if you gave credit and upvoted this notebook :). Suggestions are very welcome. Thanks!

-----------------------------------------------------------------------------------------------------------------------------
# Notebook Goals
This notebook carries out sentiment analysis with an unsupervised learning approach, even though a supervised approach is possible here because the data contains target labels. The motivation is that data scraped from the web is usually raw and has no target labels, so a supervised approach cannot be applied to it directly.

Word2Vec will be used as a word representation and K-Means will be used to cluster data.

# Notebook Workflows
* [1. Import Library](#ImportLibrary)
* [2. Load Data](#LoadData)
* [3. Data Preprocessing](#DataPreprocessing)
* [4. Word2Vec Model Training](#Word2VecTraining)
* [5. Clustering with K-Means](#KMeansTraining)

# 1. Import Library <a name="ImportLibrary"></a>

In [1]:
import numpy as np
import random
import pandas as pd
import re
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
import warnings
warnings.simplefilter('ignore')
np.random.seed(0)
random.seed(0)
pd.set_option('max_colwidth', 200)



# 2. Load Data <a name="LoadData"></a>

In [2]:
data = pd.read_csv('../input/amazon_alexa.tsv', sep='\t', header=0, usecols=['verified_reviews', 'feedback'])
target = data.feedback
data = data.drop('feedback', axis=1)
data.head()

| | verified_reviews |
|---|---|
| 0 | Love my Echo! |
| 1 | Loved it! |
| 2 | Sometimes while playing a game, you can answer a question correctly but Alexa says you got it wrong and answers the same as you. I like being able to turn lights on and off while away from home. |
| 3 | I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well. |
| 4 | Music |


# 3. Data Preprocessing <a name="DataPreprocessing"></a>

In [3]:
def clean_data(text):
    # lower text
    text = text.lower()
    # remove excessive space
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text

In [4]:
data['verified_reviews'] = [clean_data(i) for i in data.verified_reviews]
data.head()

| | verified_reviews |
|---|---|
| 0 | love my echo! |
| 1 | loved it! |
| 2 | sometimes while playing a game, you can answer a question correctly but alexa says you got it wrong and answers the same as you. i like being able to turn lights on and off while away from home. |
| 3 | i have had a lot of fun with this thing. my 4 yr old learns about dinosaurs, i control the lights and play games like categories. has nice sound when playing music as well. |
| 4 | music |


In [5]:
# delete data that contain empty string
index = data[data.verified_reviews == ''].index
data = data.drop(index)
target = target.drop(index)
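
Note that `clean_data` lowercases the text and collapses whitespace but leaves punctuation in place, so after whitespace tokenization a word and the same word followed by a period become different tokens (the similarity results in section 4 include `dot.` and `it.`). A minimal sketch of a variant that also strips punctuation is shown here. The name `clean_data_no_punct` and the choice of which characters to drop are assumptions for illustration, not part of the original notebook.

```python
import re

def clean_data_no_punct(text):
    # hypothetical variant of clean_data: also removes punctuation
    text = text.lower()
    # replace anything that is not a letter, digit, apostrophe or whitespace with a space
    text = re.sub(r"[^a-z0-9'\s]", ' ', text)
    # collapse runs of whitespace, as in the original clean_data
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

print(clean_data_no_punct('Love my Echo Dot.  Loved it!'))  # love my echo dot loved it
```

Using this variant would change the vocabulary, so the numbers reported in the rest of the notebook would not be reproduced exactly.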

# 4. Word2Vec Model Training <a name="Word2VecTraining"></a>

In [6]:
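# tokenize on whitespace; punctuation stays attached to words (e.g. 'dot.')
token_review = [i.split() for i in data.verified_reviews]
# gensim < 4.0 API; in gensim >= 4.0 use vector_size=100 and wv.key_to_index
word2vec = Word2Vec(token_review, size=100, min_count=1, seed=0)
vocab = list(word2vec.wv.vocab.keys())
print('Size of vocab: ',len(vocab))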

Size of vocab:  6807


In [7]:
word2vec.wv.most_similar('echo')

[('one', 0.9998496770858765),
 ('bought', 0.9998383522033691),
 ('spot', 0.9998190999031067),
 ('new', 0.9998188018798828),
 ('my', 0.999810516834259),
 ('dot.', 0.999808669090271),
 ('the', 0.9998076558113098),
 ('first', 0.9998066425323486),
 ('plus', 0.9998033046722412),
 ('this', 0.999802827835083)]

In [8]:
word2vec.wv.most_similar('amazon')

[('or', 0.999948263168335),
 ('on', 0.9999452233314514),
 ('even', 0.9999272227287292),
 ('any', 0.9999250769615173),
 ('of', 0.9999232292175293),
 ('while', 0.9999215006828308),
 ('they', 0.9999185800552368),
 ('from', 0.999915361404419),
 ('another', 0.9999144077301025),
 ('your', 0.9999122619628906)]

In [9]:
word2vec.wv.most_similar('love')

[('like', 0.9996095895767212),
 ('now', 0.9995641112327576),
 ('it.', 0.9995486736297607),
 ('really', 0.9995301961898804),
 ('got', 0.9995297789573669),
 ('we', 0.9995282292366028),
 ('i', 0.9995241761207581),
 ('have', 0.9995133876800537),
 ('am', 0.9995023012161255),
 ('in', 0.9995008707046509)]

# 5. Clustering with K-Means <a name="KMeansTraining"></a>
Word2Vec produces a vector for each word, not for each sentence. One way to feed these vectors to K-Means is to average the word vectors of each sentence. In this notebook, however, K-Means is trained directly on the individual word vectors.

K-Means therefore predicts a cluster for each word in a sentence. I average these per-word predictions to get the final prediction for the sentence (stacking method).
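
For comparison, the sentence-averaging alternative mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `toy_vectors` is a made-up 3-dimensional embedding table standing in for `word2vec.wv`, and `sentence_vector` with its zero-vector fallback is a hypothetical helper that the notebook does not use.

```python
import numpy as np

# toy 3-dimensional "embeddings"; in the notebook these would come from word2vec.wv
toy_vectors = {
    'love': np.array([1.0, 0.0, 2.0]),
    'my': np.array([0.0, 1.0, 0.0]),
    'echo': np.array([2.0, 2.0, 1.0]),
}

def sentence_vector(tokens, vectors, dim):
    # average the vectors of the tokens that have an embedding
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # no known token: fall back to a zero vector (an arbitrary choice)
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(sentence_vector(['love', 'my', 'echo'], toy_vectors, dim=3))
```

With this approach, the matrix of sentence vectors (one row per review) would be passed to `KMeans.fit` instead of the word vectors, and each review would get a single cluster directly.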

In [10]:
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(word2vec.wv.vectors)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)

In [11]:
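def predict(x, cluster_model, embedding_model, vocab):
    # keep only tokens that exist in the Word2Vec vocabulary
    x = [i for i in x if i in vocab]
#     # to bypass data that contain empty string
#     if len(x) == 0:
#         return 0
    # look up vectors via .wv (works in gensim 3.x and 4.x)
    prediction = cluster_model.predict(embedding_model.wv[x])
    # dtype=int truncates the mean: the result is 1 only if every word is in cluster 1
    return np.mean(prediction, dtype=int)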

In [12]:
data['predicted_feedback'] = [predict(i, kmeans, word2vec, vocab) for i in token_review]
print('Predicted feedback: \n',data.predicted_feedback.value_counts())
data.head()

Predicted feedback: 
 0    2763
1     308
Name: predicted_feedback, dtype: int64


| | verified_reviews | predicted_feedback |
|---|---|---|
| 0 | love my echo! | 0 |
| 1 | loved it! | 0 |
| 2 | sometimes while playing a game, you can answer a question correctly but alexa says you got it wrong and answers the same as you. i like being able to turn lights on and off while away from home. | 0 |
| 3 | i have had a lot of fun with this thing. my 4 yr old learns about dinosaurs, i control the lights and play games like categories. has nice sound when playing music as well. | 0 |
| 4 | music | 1 |
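
Because `predict` averages the per-word cluster labels with `dtype=int`, the fractional part of the mean is discarded, so a review gets label 1 only when all of its words fall in cluster 1. The sketch below contrasts that behaviour with a majority vote on toy labels. `truncated_mean_label` and `majority_label` are hypothetical helpers, and the majority rule (with ties going to 1) is an alternative I am assuming for comparison, not what this notebook uses.

```python
def truncated_mean_label(word_labels):
    # mirrors the integer-cast mean in predict: the fraction is dropped
    return int(sum(word_labels) / len(word_labels))

def majority_label(word_labels):
    # alternative aggregation: label 1 when at least half of the words are in cluster 1
    return int(sum(word_labels) / len(word_labels) >= 0.5)

word_labels = [0, 1, 1]  # toy per-word cluster assignments for one review
print(truncated_mean_label(word_labels))  # 0
print(majority_label(word_labels))        # 1
```

This may help explain why long reviews are almost always assigned to cluster 0, while a one-word review such as `music` can end up in cluster 1.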


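K-Means assigns cluster IDs arbitrarily, so it does not know which cluster corresponds to label 0 and which to label 1. To measure accuracy, the agreement with the target is therefore computed under both possible mappings and the larger value is taken.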

In [13]:
def measure(y_target, y_pred):
    value_type_0 = np.mean(y_target == y_pred)
    value_type_1 = np.mean(y_target != y_pred)
    if value_type_0 > value_type_1:
        return value_type_0
    return value_type_1

In [14]:
print(measure(target, data.predicted_feedback))

0.8241615109084989
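
To see why taking the larger of the two values handles the arbitrary cluster numbering, here is a small pure-Python sketch on made-up labels (the lists are illustrative, not taken from the dataset). `flip_invariant_accuracy` is a hypothetical name for the same idea as `measure`.

```python
def flip_invariant_accuracy(y_true, y_pred):
    # fraction of positions where the labels agree as given
    agree = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # for two clusters, swapping the cluster names turns agreement into 1 - agree
    return max(agree, 1 - agree)

y_true = [1, 1, 1, 0]
y_pred = [0, 0, 0, 1]  # perfect clustering, but with the cluster names swapped
print(flip_invariant_accuracy(y_true, y_pred))  # 1.0
```

This shortcut only works for two clusters. With more clusters, the best mapping between cluster IDs and labels would have to be searched for explicitly.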


# Conclusion
Word2Vec word vectors can be used with the K-Means algorithm by stacking the per-word predictions into one prediction per review. A possible next project is a comparison between Word2Vec and Bag of Words with the K-Means algorithm.