## Language-agnostic sentiment analysis using mean embedding vectors
Given a comment in any language, the polarity of the sentiment expressed is determined as follows:

1.   The given comment is translated into English.
2.   Each word in the translated comment is represented by a d-dimensional embedding vector.
3.   The mean of the embedding vectors of the relevant words in the comment (those found in the embedding vocabulary) is computed.
4.   The mean embedding vector is used as input to a pre-trained classifier, which determines the sentiment polarity expressed by the comment.
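
The steps listed above can be sketched end to end with toy stand-ins. Everything in this block is hypothetical and for illustration only: `TOY_TRANSLATIONS`, `TOY_EMBEDDINGS` (2-dimensional rather than 300-dimensional) and the hand-picked weights in `toy_classifier` take the place of the translator, the word2vec model and the trained network used in the rest of the notebook.

```python
import math

# Hypothetical stand-ins, NOT the real translator, embeddings, or classifier.
TOY_TRANSLATIONS = {"muy bueno": "very good", "muy malo": "very bad"}
TOY_EMBEDDINGS = {          # word -> 2-dimensional toy vector
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "very": [0.0, 1.0],
}

def toy_translate(comment):
    """Step 1: translate into English (dictionary lookup as a stand-in)."""
    return TOY_TRANSLATIONS.get(comment, comment)

def toy_mean_vector(comment):
    """Steps 2-3: embed the known words and average them component-wise."""
    vectors = [TOY_EMBEDDINGS[w] for w in comment.split() if w in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]   # no known word
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def toy_classifier(vector):
    """Step 4: logistic score in (0, 1); the weights are made up."""
    weights, bias = [4.0, 0.0], 0.0
    z = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def toy_polarity(comment):
    """Chain the four steps for a single comment."""
    return toy_classifier(toy_mean_vector(toy_translate(comment)))

for text in ["muy bueno", "muy malo"]:
    label = "POSITIVE" if toy_polarity(text) > 0.5 else "NEGATIVE"
    print(text, "->", label)
```

Averaging discards word order, so the classifier only sees which words occur, not how they are arranged.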





## Import

In [1]:
! pip install googletrans
import pandas as pd # for data handling
import numpy as np # for linear algebra
import matplotlib.pyplot as plt # for plotting
from googletrans import Translator # for translation
from gensim.models import KeyedVectors # for pre-trained embedding
import tensorflow as tf # for neural network classifier
from sklearn.manifold import TSNE # to display comments in 2D plot

Collecting googletrans
  Downloading https://files.pythonhosted.org/packages/fd/f0/a22d41d3846d1f46a4f20086141e0428ccc9c6d644aacbfd30990cf46886/googletrans-2.4.0.tar.gz
Building wheels for collected packages: googletrans
  Building wheel for googletrans (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/50/d6/e7/a8efd5f2427d5eb258070048718fa56ee5ac57fd6f53505f95
Successfully built googletrans
Installing collected packages: googletrans
Successfully installed googletrans-2.4.0


## Define function for translating comments

In [0]:
# We use Google Translate (via the googletrans package)
translator = Translator()
def translate(comment):
  """Returns the comment translated into English."""
  return translator.translate(comment).text
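
Since `googletrans` talks to an external service, a call to `translate` may fail, for example when the network is unavailable. A minimal sketch of one way to guard against that is shown here. `make_safe_translator` is a hypothetical helper, not part of googletrans; it accepts any function mapping a string to a string, caches successful results, and falls back to the untranslated comment on error. The two stand-in translators at the bottom are for demonstration only.

```python
def make_safe_translator(translate_fn):
    """Wrap a str -> str translation function with a cache and a fallback.

    Hypothetical helper: `translate_fn` could be the `translate`
    function defined in the cell above.
    """
    cache = {}

    def safe_translate(comment):
        if not comment.strip():
            return comment                # nothing to translate
        if comment in cache:
            return cache[comment]         # reuse an earlier successful result
        try:
            result = translate_fn(comment)
        except Exception:
            return comment                # fall back; not cached, so a retry is possible
        cache[comment] = result
        return result

    return safe_translate

# Stand-in translators for demonstration only.
def flaky(comment):
    raise ConnectionError("service unavailable")

print(make_safe_translator(flaky)("hola"))        # falls back to the input text
print(make_safe_translator(str.upper)("hola"))    # stand-in "translation"
```

Falling back to the original text means a non-English comment may reach the embedding step untranslated, in which case most of its words would likely be missing from the vocabulary.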

## Load pre-trained word2vec model for embedding

In [3]:
# Retrieve embedding file using wget
# use this if embedding file is not available locally
URL = "https://s3.amazonaws.com/dl4j-distribution/" # source url
FILE = "GoogleNews-vectors-negative300.bin.gz" # source file name
SOURCE = URL+FILE # source for embedding file
DIR = "/root/input/" # directory
! wget -P "$DIR" -c "$SOURCE" # retrieve embedding file

# Load pre-trained word2vec model from embedding file
EMBEDDING_FILE = DIR + FILE 
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)

# Define vocabulary and embedding_size
vocabulary = set(word2vec.index2word) # set of words in vocabulary
embedding_size = word2vec.vector_size # dimension of word vector
print("Model contains %d words" %len(vocabulary))
print("Each word is represented by a %d dimensional vector" %embedding_size)

--2019-06-05 12:20:55--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.110.181
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.110.181|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘/root/input/GoogleNews-vectors-negative300.bin.gz’


2019-06-05 12:21:40 (35.5 MB/s) - ‘/root/input/GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]

Model contains 3000000 words
Each word is represented by a 300 dimensional vector
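
The loading cell reads the word list from `word2vec.index2word`, which is the attribute name in gensim releases before 4.0; from gensim 4.0 onward the same list is exposed as `index_to_key`. A small sketch of a helper that works with either name is given here. `vocabulary_of` is a hypothetical name, and the two `SimpleNamespace` objects are stand-ins for a loaded `KeyedVectors` instance so the block runs without the embedding file.

```python
from types import SimpleNamespace

def vocabulary_of(keyed_vectors):
    """Return the set of words, whichever attribute name the object exposes."""
    words = getattr(keyed_vectors, "index_to_key", None)   # gensim >= 4.0
    if words is None:
        words = keyed_vectors.index2word                    # gensim < 4.0
    return set(words)

# Stand-in objects for illustration; real code would pass `word2vec`.
old_style = SimpleNamespace(index2word=["good", "bad"])
new_style = SimpleNamespace(index_to_key=["good", "bad"])
print(vocabulary_of(old_style) == vocabulary_of(new_style))
```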


## Examine embedding vectors of words

In [22]:
word = input("Type in a word:")
print()
if word not in vocabulary:
  print("The word %s is not in the vocabulary" %word)
else:
  print("%s is represented by the %d dimensional embedding vector:\n"
       %(word, embedding_size))
  # Look up the word exactly as typed, matching the case-sensitive check above
  print(', '.join([str(v)[:5] for v in word2vec[word].tolist()]))

Type in a word:great

great is represented by the 300 dimensional embedding vector:

0.071, 0.208, -0.02, 0.178, 0.132, -0.09, 0.096, -0.11, -0.00, 0.148, -0.03, -0.18, 0.041, -0.08, 0.021, 0.069, 0.180, 0.222, -0.10, -0.06, 0.000, 0.160, 0.040, 0.073, 0.153, 0.067, -0.10, 0.041, 0.042, -0.11, -0.06, 0.041, 0.25, 0.212, 0.159, 0.014, -0.04, 0.013, 0.003, 0.209, 0.152, -0.07, 0.216, -0.05, -0.02, -0.00, 0.152, -0.02, 0.021, -0.15, 0.104, 0.318, -0.18, 0.036, -0.11, -0.03, -0.10, -0.12, 0.322, -0.07, -0.15, 0.267, -0.15, -0.12, 0.107, 0.066, -0.02, -0.10, -0.20, 0.117, 0.061, 0.067, 0.106, -0.07, -0.15, -0.00, -0.14, 0.253, 0.048, 0.097, -0.00, 0.112, 0.053, 0.017, -0.05, -0.33, -0.09, 0.142, -0.13, 0.022, 0.100, -0.05, -0.15, -0.00, -0.09, -0.04, 0.085, 0.306, -0.11, -0.19, -0.20, 0.081, -0.04, -0.08, -0.10, 0.292, 0.023, -0.03, 0.035, -0.10, -0.06, 0.279, -0.11, -0.01, 0.384, -0.07, -0.02, -0.13, -0.05, -0.05, -0.08, -0.02, 0.083, 0.273, -0.06, -0.04, -0.01, -0.11, -0.10, 0.202, -0

## Define function to compute mean embedding vector of words in translated comment

In [0]:
def mean_vector(comment):
  """Returns the mean of the embedding vectors of the words in comment.
  Returns a vector of zeros if none of the words appear in the vocabulary."""
  words = [w for w in comment.split() if w in vocabulary] # valid words
  if not words: # no valid word
    return np.zeros((embedding_size,), dtype="float32")
  return np.mean([word2vec[w] for w in words], axis=0)
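
Note that `comment.split()` separates on whitespace only, so a token such as `boring,` keeps its comma and is dropped unless that exact string is in the vocabulary. One possible refinement, sketched here under the assumption that only alphabetic words matter, is a regex tokenizer. `tokenize`, `TOKEN_PATTERN` and `toy_vocabulary` are hypothetical names introduced for illustration; case is left untouched because the vocabulary lookup is case-sensitive.

```python
import re

# Alphabetic runs, optionally with one internal apostrophe (e.g. "isn't").
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def tokenize(comment):
    """Split into alphabetic tokens, dropping surrounding punctuation."""
    return TOKEN_PATTERN.findall(comment)

toy_vocabulary = {"boring", "workshop", "great"}   # stand-in for `vocabulary`
comment = "Boring, boring workshop!"

# Whitespace split: "Boring," and "workshop!" do not match any vocabulary entry.
print([w for w in comment.split() if w in toy_vocabulary])
# Regex tokens: punctuation is stripped, so "workshop" is now found.
print([w for w in tokenize(comment) if w in toy_vocabulary])
```

To try this in `mean_vector`, `comment.split()` would be replaced by `tokenize(comment)`; whether it improves the classifier depends on how the training comments were tokenized, which the notebook does not state.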

## Retrieve pre-trained model for sentiment classification

In [12]:
# This model is a neural network with a single hidden layer,
# trained on the IMDB sentiment analysis data set.

def getSAmodel(weights_file):
  """Returns the trained single-hidden-layer network"""
  model = tf.keras.models.Sequential([
      tf.keras.layers.Flatten(input_shape=(300,)),
      tf.keras.layers.Dense(32, activation=tf.nn.relu),
      tf.keras.layers.Dropout(0.2),
      tf.keras.layers.Dense(2, activation=tf.nn.softmax)])
  model.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
  model.load_weights(weights_file)
  return model

weights_file = "SA.model.weights.hdf5" 
model = getSAmodel(weights_file)

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


## Define function to classify comment

In [0]:
def sentiment_polarity(comment):
  """Returns the positive-class probability and the mean vector for comment"""
  mean_v = mean_vector(translate(comment))
  return model.predict(np.array([mean_v]))[0][1], mean_v
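
The indexing `[0][1]` in `sentiment_polarity` can be read as follows: `model.predict` returns one row per input comment, and the final softmax layer makes each row a pair of probabilities that sum to 1. Row 0 is the only comment in the batch, and column 1 is the class that the notebook treats as positive. The sketch below reproduces this with a plain-Python softmax; the logits are made up for illustration and do not come from the trained model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)                       # subtract the max to avoid overflow
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one comment: [class 0 (negative), class 1 (positive)].
batch = [softmax([2.0, -1.0])]
score = batch[0][1]        # same indexing as model.predict(...)[0][1]
print(round(sum(batch[0]), 6), round(score, 4))
```

With only two classes, a score above 0.5 means class 1 is the more probable of the two.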

## Classify user specified comment

In [28]:
comment = input("Type in a comment:") # input comment in any language
score, _ = sentiment_polarity(comment) # compute polarity score
sentiment = 'POSITIVE' if score > 0.5 else 'NEGATIVE'
print("\nYour comment: %s \n\texpresses a %s sentiment \n\t(score = %4.4f)"
     %(comment, sentiment, score))

Type in a comment:I am worried that this boring workshop will drag on for another hour

Your comment: I am worried that this boring workshop will drag on for another hour 
	expresses a NEGATIVE sentiment 
	(score = 0.0001)
