In this practical you will explore different methods of converting text into numerical vector so that it can be applied to train Neural Network models. The learning task that you will be working on today is referred to as sentiment analysis and it involves predicting whether a text represents a positive or a negative sentiment. In Task 1 we will implement a toy example using a very small dataset. In Task 2 we will use a real-world dataset containing restaurant reviews obtained from Yelp.

We will be working with the Word2Vec and GloVe pre-trained word embeddings using the Gensim library. You will have to install the library (can be done via the Anaconda Navigator) and also download the Google word2vec and the GloVe pre-trained models. Please read this [Tutorial](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/) before you start working on this practical. 

NOTE: When you are loading the pre-trained word2vec or glove embeddings you can use argument $limit=50000$ in order to speed up the process. This will get only the embeddings of the 50000 most popular words. Otherwise it may take quite a while to load the models into memory.

# Task 1

In this task you will develop a model which predicts whether a short text contains positive or negative emotions. The train and test datasets are presented below. It is a 2-class classification problem, when 1 - stands for positive and 0 - stands for negative.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers
from numpy.random import seed
import tensorflow

Using TensorFlow backend.


In [2]:
train = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']

labels_train = np.array([1,1,1,1,1,0,0,0,0,0])

test = ['Amazing job!',
        'Fantastic work',
        'Good effort',
        'Could not have done better',
        'not great',
        'poor job',
        'poor job',
        'very weak',]

labels_test = np.array([1,1,1,1,0,0,0,0])

***
**T1.1 Obtaining Bag of Words (BOW) binary representation**

In [13]:
# create the vectorizer
vectorizer = CountVectorizer(binary=True)

# Converting the train data into vectors
data_train = vectorizer.fit_transform(train).toarray()

#Converting the test data into vectors
data_test = vectorizer.transform(test).toarray()

In [14]:
print(data_train)

[[0 0 1 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 1]
 [0 0 0 1 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 1]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 1 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 1]
 [1 1 1 0 0 0 0 1 0 0 0 0 0 0]]


***
**T1.2 Training and evaluating NN model with the binary BOW representation**

Now you can use your vector representation of the train/test data and the labels in order to train and evaluate a simple NN model. Implement a simple MLP model and train/test it using the obtained vector representation of the data.

In [15]:
def getModel(n_features):
    model = Sequential()
    model.add(Dense(5, input_dim = n_features, activation='relu', kernel_initializer='random_uniform', bias_initializer='zeros'))
    model.add(Dense(2, activation='sigmoid', kernel_initializer='random_uniform', bias_initializer='zeros'))
    opt = optimizers.Adam(lr=0.05, beta_1=0.9, beta_2=0.999)
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

In [16]:
model = getModel(len(vectorizer.vocabulary_))
model.fit(data_train, labels_train, epochs=20, batch_size=data_train.shape[0], verbose=0)

#Evaluating the model on the test data
loss, acc = model.evaluate(data_test, labels_test)
print('Accuracy on test dataset: %.2f' % acc)
print('Loss: ', loss)

Accuracy on test dataset: 0.75
Loss:  0.9601109623908997


***
**T1.3 BOW - word count representation**

Do the same as above, but instead of binary BOW use the word count BOW representation.

In [17]:
vectorizer = CountVectorizer()

# Converting the train data into vectors
data_train = vectorizer.fit_transform(train).toarray()

#Converting the test data into vectors
data_test = vectorizer.transform(test).toarray()

model = getModel(len(vectorizer.vocabulary_))
model.fit(data_train, labels_train, epochs=20, batch_size=data_train.shape[0], verbose=0)

#Evaluating the model on the test data
loss, acc = model.evaluate(data_test, labels_test)
print('Accuracy on test dataset: %.2f' % acc)
print('Loss: ', loss)

Accuracy on test dataset: 0.75
Loss:  1.1510188579559326


***
**T1.4 BOW - TF-IDF representation**

This time you should use the TFIDF version of the BOW method to convert the text into vectors and apply it to train and test your NN model.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

# create the vectorizer
vectorizer = TfidfVectorizer()

# Converting the train data into vectors
data_train = vectorizer.fit_transform(train).toarray()

#Converting the test data into vectors
data_test = vectorizer.transform(test).toarray()

model = getModel(len(vectorizer.vocabulary_))
model.fit(data_train, labels_train, epochs=50, batch_size=data_train.shape[0], verbose=0)

#Evaluating the model on the test data
loss, acc = model.evaluate(data_test, labels_test)
print('Accuracy on test dataset: %.2f' % acc)
print('Loss: ', loss)

Accuracy on test dataset: 0.75
Loss:  2.041787624359131


***
**T1.5 Words Embeddings - word2vec**

In this task we will use the pre-trained embeddings from the word2vec model.

First, following the [Tutorial](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/) you should download the word2vec model. You can then load it into memory as follows. 

In [3]:
from gensim.models import KeyedVectors
import re

In [7]:
#file = 'D:\PycharmProjects\RecordLinkage\Embedings Files\GoogleNews-vectors-negative300.bin'
file = 'GoogleNews-vectors-negative300.bin'
word2vec = KeyedVectors.load_word2vec_format(file, binary=True, limit=50000)
word2vec_vectors = word2vec.wv

  after removing the cwd from sys.path.


The most straightforward approach to obtaining vector representation for a sentence/document is to obtained a vector representation for each of the individual words within the document and average them. There are multiple ways of implementing it, below is just one of them.  

In [24]:
#Converting text data into vectors
def getWord2Vec(list):
    vectors = []
    for row in list:
        tokens = [w.lower() for w in re.sub(r'[^\w\s]','',row).split(' ')]
        temp = []
        for token in tokens:
            if token in word2vec_vectors:
                temp.append(word2vec[token])
        vectors.append(np.mean(temp, axis=0))
    return np.asarray(vectors)

Now you can use the getWord2Vec method to obtain vector representation of each of the train and test record.

In [25]:
#Converting train and text data into vectors
data_train = getWord2Vec(train)
print(data_train[0])
data_test = getWord2Vec(test)     
print(data_train.shape)
print(np.asarray(data_test).shape)

[-6.59179688e-02  1.31835938e-01 -9.08203125e-02  6.95800781e-02
 -1.24023438e-01 -8.37402344e-02  1.89285278e-02  3.60107422e-02
  1.18896484e-01 -6.66503906e-02 -1.13525391e-02 -1.23168945e-01
  8.36791992e-02 -1.07421875e-02 -1.22070312e-03  2.08190918e-01
  9.72290039e-02  9.70458984e-02  6.00585938e-02 -9.06066895e-02
 -4.27246094e-02  1.30859375e-01 -2.83203125e-02 -9.35058594e-02
  1.51916504e-01 -6.61010742e-02 -1.25488281e-01 -1.82617188e-01
 -1.06689453e-01  2.24609375e-02 -3.52172852e-02  4.58984375e-02
  1.19873047e-01 -1.25488281e-01  6.28356934e-02  9.20410156e-02
  1.75781250e-02 -1.19628906e-02  8.23974609e-02 -1.58691406e-02
  2.22656250e-01  2.35595703e-02  3.91235352e-02  1.72729492e-02
 -1.81274414e-02 -3.73535156e-02 -1.93359375e-01 -4.23583984e-02
  3.58886719e-02 -4.31976318e-02  6.81762695e-02  9.13085938e-02
 -4.59594727e-02 -5.61523438e-02 -5.56640625e-02  1.68457031e-02
 -1.12060547e-01 -4.46777344e-02  1.37451172e-01 -4.23812866e-02
 -7.76062012e-02  1.87988

In [26]:
model = getModel(300)
model.fit(data_train, labels_train, epochs=50, batch_size=data_train.shape[0], verbose=0)

#Evaluating the model on the test data
loss, acc = model.evaluate(data_test, labels_test)
print('Accuracy on test dataset: %.2f' % acc)
print('Loss: ', loss)

Accuracy on test dataset: 0.88
Loss:  1.7167880535125732


***
**T1.6 Words Embeddings - GloVe**

In this task you should do the same as above but instead of Word2Ves use the GloVe model. You can use the GloVe model with Gensim library as presented in the [Tutorial](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/).

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'D:\PycharmProjects\RecordLinkage\Embedings Files\Glove files\glove.6B.300d.txt'
word2vec_output_file = 'D:\PycharmProjects\RecordLinkage\Embedings Files\Glove files\glove.6B.300d_1.txt'
glove2word2vec(glove_input_file, word2vec_output_file)

In [21]:
# load the Stanford GloVe model
filename = '/Users/annajurek/Documents/Queens/word embedding/glove.6B/glove.6B.50d.txt.word2vec'
#filename = 'D:\PycharmProjects\RecordLinkage\Embedings Files\Glove files\glove.6B.300d_1.txt'
glove = KeyedVectors.load_word2vec_format(filename, binary=False, limit=50000)
glove_vectors = glove.wv

  """


In [22]:
#Converting text data into vectors
def getGlove(list):
    vectors = []
    for row in list:
        tokens = [w.lower() for w in re.sub(r'[^\w\s]','',row).split(' ')]
        temp = []
        for token in tokens:
            if token in glove_vectors:
                temp.append(glove[token])
        vectors.append(np.mean(temp, axis=0))
    return np.asarray(vectors)

In [29]:
#Converting train and text data into vectors
data_train = getGlove(train)
data_test = getGlove(test)     
print(data_train.shape)
print(np.asarray(data_test).shape)

(10, 300)
(8, 300)


In [30]:
model = getModel(300)
model.fit(data_train, labels_train, epochs=50, batch_size=data_train.shape[0], verbose=0)

#Evaluating the model on the test data
loss, acc = model.evaluate(data_test, labels_test)
print('Accuracy on test dataset: %.2f' % acc)
print('Loss: ', loss)

Accuracy on test dataset: 0.75
Loss:  0.7757281064987183


# Task 2

In this task we will repeat the same steps as in Task one but this time we will use a real-world dataset with reviews of restaurants obtained from Yelp.

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('yelp_reviews.csv',encoding = "ISO-8859-1")

#select input and output variables
data = df.values[:,0]
labels = df.values[:,1]


***
**T2.1** Split the data into train and test sets. Use any test size.

**T2.2** Obtained vector representation of train and test data using Binary BOW, TFIDF, Word2Vec and Glove

**T2.3** Implement a MLP and evaluate it with each of the data representations.

In [8]:
x_train, x_test, y_train, y_test = train_test_split(data, labels,test_size=0.5, random_state=0)

In [23]:
#Bag of Words - Binary
bow_train = vectorizer.fit_transform(x_train).toarray()
bow_test = vectorizer.transform(x_test).toarray()
bow_model = getModel1(1, [50], bow_train, y_train)
loss, acc = bow_model.evaluate(bow_test, y_test)
print('BOW Accuracy on test dataset: %.2f' % acc)
print('Loss: ', loss)

#Bag of words - TFIDF
tfidf_train = vectorizer.fit_transform(x_train).toarray()
tfidf_test = vectorizer.transform(x_test).toarray()
tfidf_model = getModel1(1, [50], tfidf_train, y_train)
loss, acc = tfidf_model.evaluate(tfidf_test, y_test)
print('TFIDF Accuracy on test dataset: %.2f' % acc)
print('Loss: ', loss)

#Embeddings - W2V
w2v_train = getWord2Vec(x_train)
w2v_test = getWord2Vec(x_test) 
w2v_model = getModel1(1, [50], w2v_train, y_train)
loss, acc = w2v_model.evaluate(w2v_test, y_test)
print('W2V Accuracy on test dataset: %.2f' % acc)
print('Loss: ', loss)

#Embeddings - GloVe
glove_train = getGlove(x_train)
glove_test = getGlove(x_test)
glove_model = getModel1(1, [50], glove_train, y_train)
loss, acc = glove_model.evaluate(glove_test, y_test)
print('Glove Accuracy on test dataset: %.2f' % acc)
print('Loss: ', loss)

BOW Accuracy on test dataset: 0.76
Loss:  1.6894555771685986
TFIDF Accuracy on test dataset: 0.76
Loss:  1.739709367713775
W2V Accuracy on test dataset: 0.83
Loss:  1.4289307043734325
Glove Accuracy on test dataset: 0.80
Loss:  1.57802675383158


In [15]:
def getModel1(h_layers_no, neurons_no, data_train, labels_train):
    model = Sequential()
    model.add(Dense(neurons_no[0], input_dim = data_train.shape[1], activation='relu', kernel_initializer='random_uniform', bias_initializer='zeros'))
    for l in range(h_layers_no-1):
        model.add(Dense(neurons_no[l], activation='relu', kernel_initializer='random_uniform', bias_initializer='zeros'))
    model.add(Dense(2, activation='sigmoid', kernel_initializer='random_uniform', bias_initializer='zeros'))
    opt = optimizers.Adam(lr=0.05, beta_1=0.9, beta_2=0.999)
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit(data_train, labels_train, epochs=100, batch_size=32, verbose=0)
    return model