In this practical you will explore different methods of converting text into numerical vectors so that it can be used to train neural network models. The learning task you will be working on today is sentiment analysis: predicting whether a text expresses a positive or a negative sentiment. In Task 1 we will implement a toy example using a very small dataset. In Task 2 we will use a real-world dataset containing restaurant reviews obtained from Yelp.

We will be working with the Word2Vec and GloVe pre-trained word embeddings using the Gensim library. You will have to install the library (can be done via the Anaconda Navigator) and also download the Google word2vec and the GloVe pre-trained models. Please read this [Tutorial](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/) before you start working on this practical. 

NOTE: When loading the pre-trained word2vec or GloVe embeddings you can pass the argument `limit=50000` to speed up the process. This reads only the first 50000 vectors in the file, which are typically the most frequent words. Otherwise, on some machines, it may take a while to load the models into memory.

# Task 1

In this task you will develop a model which predicts whether a short text expresses positive or negative emotions. The train and test datasets are presented below. It is a 2-class classification problem, where 1 stands for positive and 0 stands for negative.

In [84]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers
from numpy.random import seed
import tensorflow

In [85]:
train = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']

labels_train = np.array([1,1,1,1,1,0,0,0,0,0])

test = ['Amazing job!',
        'Fantastic work',
        'Good effort',
        'Could not have done better',
        'not great',
        'poor job',
        'poor job',
        'very weak',]

labels_test = np.array([1,1,1,1,0,0,0,0])

In [86]:
train[0]

'Well done!'

***
**T1.1 Obtaining Bag of Words (BOW) binary representation**

In [87]:
# create the vectorizer
vectorizer = CountVectorizer(binary=True)

# Converting the train data into vectors
data_train = vectorizer.fit_transform(train).toarray()

#Converting the test data into vectors
data_test = vectorizer.transform(test).toarray()
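
To see what the vectorizer has learned, it can help to inspect the vocabulary and the resulting matrices. The sketch below refits the same `CountVectorizer(binary=True)` on the texts from this task (repeated so the snippet is self-contained) and lists the test words that are missing from the training vocabulary. The names `analyzer` and `oov` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
         'Weak', 'Poor effort!', 'not good', 'poor work', 'Could have done better.']
test = ['Amazing job!', 'Fantastic work', 'Good effort', 'Could not have done better',
        'not great', 'poor job', 'poor job', 'very weak']

vectorizer = CountVectorizer(binary=True)
data_train = vectorizer.fit_transform(train).toarray()
data_test = vectorizer.transform(test).toarray()

# The vocabulary is built from the train texts only (lower-cased, punctuation dropped).
vocab = sorted(vectorizer.vocabulary_)
print(len(vocab), vocab)

# Words that occur only in the test texts are silently ignored by transform().
analyzer = vectorizer.build_analyzer()
oov = sorted({tok for text in test for tok in analyzer(text)} - set(vocab))
print(oov)

# 'Amazing job!' contains only unseen words, so its vector is all zeros.
print(data_test[0])
```

Because several test texts rely on words the vectorizer has never seen, a BOW model trained on this tiny dataset has little information to work with for those rows.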

***
**T1.2 Training and evaluating NN model with the binary BOW representation**

Now you can use your vector representation of the train/test data and the labels in order to train and evaluate a simple NN model. Implement a simple MLP model with Keras and train/test it using the obtained vector representation of your data.

In [93]:
seed(0)
tensorflow.random.set_seed(0)
model=Sequential()
model.add(Dense(12,input_dim=data_train.shape[1],activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(2, activation='sigmoid'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history=model.fit(data_train, labels_train, epochs=150, batch_size=10,verbose=0)

_, accuracy = model.evaluate(data_test, labels_test)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 37.50
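
The model above pairs a two-unit sigmoid output layer with `sparse_categorical_crossentropy`. That loss interprets each output row as a probability distribution over the classes. A softmax layer produces such a distribution by construction, whereas two independent sigmoids generally do not sum to 1. The NumPy sketch below illustrates the difference on made-up logits; it is not part of the original solution, and the effect of switching to `activation='softmax'` on the accuracies reported in this practical is not evaluated here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row maximum for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    # Mean negative log-probability assigned to the true class.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

logits = np.array([[2.0, -1.0], [0.5, 1.5]])   # hypothetical pre-activation outputs
labels = np.array([0, 1])

p_sig = sigmoid(logits)
p_soft = softmax(logits)
print(p_sig.sum(axis=1))    # rows do not sum to 1 in general
print(p_soft.sum(axis=1))   # rows sum to 1
print(sparse_categorical_crossentropy(p_soft, labels))
```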


***
**T1.3 BOW - word count representation**

Do the same as above, but instead of binary BOW use the word count BOW representation.

In [40]:
# create the vectorizer
vectorizer = CountVectorizer(token_pattern = r"(?u)\b\w+\b")

# Converting the train data into vectors
data_train = vectorizer.fit_transform(train).toarray()

#Converting the test data into vectors
data_test = vectorizer.transform(test).toarray()
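
Binary and count BOW differ only when a word occurs more than once in a text. The sketch below shows that difference on two made-up sentences, and then checks whether the two vectorizers used in T1.1 and T1.3 produce different matrices for the toy train set of this task.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ['good good work', 'poor work']   # made-up texts with a repeated word

binary_bow = CountVectorizer(binary=True).fit_transform(docs).toarray()
count_bow = CountVectorizer().fit_transform(docs).toarray()
# Columns are in alphabetical order: good, poor, work
print(binary_bow)   # rows: [1 0 1] and [0 1 1]
print(count_bow)    # rows: [2 0 1] and [0 1 1]

train = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
         'Weak', 'Poor effort!', 'not good', 'poor work', 'Could have done better.']
same = np.array_equal(
    CountVectorizer(binary=True).fit_transform(train).toarray(),
    CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(train).toarray())
print(same)   # True: no train text repeats a word or contains a one-letter token
```

Since the features are identical for this dataset, a difference in accuracy between the binary and count runs most likely comes from training randomness (for example, the seeds are set in the T1.2 cell but not in the T1.3 cell) rather than from the representation.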

In [41]:
model=Sequential()
model.add(Dense(12,input_dim=data_train.shape[1],activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(2, activation='sigmoid'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history=model.fit(data_train, labels_train, epochs=150, batch_size=10,verbose=0)

_, accuracy = model.evaluate(data_test, labels_test)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 50.00


***
**T1.4 BOW - TF-IDF representation**

This time you should use the TF-IDF version of the BOW method to convert the text into vectors and use it to train and test your NN model.

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [43]:
# create the transform
vectorizer = TfidfVectorizer()

# Converting the train data into vectors
data_train = vectorizer.fit_transform(train).toarray()

#Converting the test data into vectors
data_test = vectorizer.transform(test).toarray()
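
To make the weighting concrete, the sketch below recomputes TF-IDF by hand on a made-up three-document corpus and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings (raw term counts, smoothed IDF, L2 row normalisation); other libraries and textbooks define TF-IDF slightly differently.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['good work', 'poor work', 'not good']   # made-up mini corpus

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                    # document frequency per term

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale every row to unit L2 norm.
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Words that appear in fewer documents receive a larger IDF and therefore a larger weight than words shared by many documents.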

In [44]:
model=Sequential()
model.add(Dense(12,input_dim=data_train.shape[1],activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(2, activation='sigmoid'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history=model.fit(data_train, labels_train, epochs=150, batch_size=10,verbose=0)

_, accuracy = model.evaluate(data_test, labels_test)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 75.00


***
**T1.5 Words Embeddings - word2vec**

In this task we will use the pre-trained embeddings from the word2vec model.

First, following the [Tutorial](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/) you should download the word2vec model. You can then load it into memory as follows. 

In [45]:
from gensim.models import KeyedVectors
import re

In [46]:
file = 'GoogleNews-vectors-negative300.bin' #pathway to the file
word2vec = KeyedVectors.load_word2vec_format(file, binary=True, limit=50000)
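# KeyedVectors supports lookups and membership tests directly; `.wv` is deprecated on it (Gensim 3.x warns "use self instead")
word2vec_vectors = word2vec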



The most straightforward approach to obtaining a vector representation for a sentence/document is to obtain a vector representation for each of the individual words within the document and average or sum them. There are multiple ways of implementing this; below is just one of them.
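
The averaging idea can be illustrated without loading a pre-trained model. The sketch below uses a hypothetical dictionary of 3-dimensional vectors (`toy_embeddings`, made up for this example) and a helper named `average_vector`; real word2vec vectors have 300 dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration.
toy_embeddings = {
    'good': np.array([0.9, 0.1, 0.0]),
    'work': np.array([0.1, 0.8, 0.1]),
    'poor': np.array([-0.8, 0.1, 0.1]),
}

def average_vector(text, embeddings, dim=3):
    tokens = text.lower().split()
    known = [embeddings[t] for t in tokens if t in embeddings]
    # Fall back to a zero vector when no token has an embedding.
    return np.mean(known, axis=0) if known else np.zeros(dim)

print(average_vector('Good work', toy_embeddings))      # approximately [0.5, 0.45, 0.05]
print(average_vector('unknown words', toy_embeddings))  # all zeros
```

Note that averaging discards word order: 'good work' and 'work good' map to the same vector.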

In [97]:
#Converting text data into vectors
def getWord2Vec(texts):
    vectors = []
    for row in texts:
        # strip punctuation, lower-case and split on whitespace
        tokens = re.sub(r'[^\w\s]', '', row).lower().split()
        # keep only the tokens that have a pre-trained embedding
        temp = [word2vec[token] for token in tokens if token in word2vec]
        if temp:
            vectors.append(np.mean(temp, axis=0))
        else:
            # no known token: fall back to a zero vector
            vectors.append(np.zeros(word2vec.vector_size))
    return np.asarray(vectors)

Now you can use the getWord2Vec function to obtain a vector representation of each train and test record.

In [82]:
# Converting the train data into vectors
data_train = getWord2Vec(train)

#Converting the test data into vectors
data_test = getWord2Vec(test)


In [83]:
print(data_train[0][1])

0.13183594


In [50]:
print(data_train.shape)

(10, 300)


In [19]:
model=Sequential()
model.add(Dense(12,input_dim=data_train.shape[1],activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(2, activation='sigmoid'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history=model.fit(data_train, labels_train, epochs=150, batch_size=10,verbose=0)

_, accuracy = model.evaluate(data_test, labels_test)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 87.50


***
**T1.6 Words Embeddings - GloVe**

In this task you should do the same as above, but use the GloVe model instead of Word2Vec. You can use the GloVe model with the Gensim library as presented in the [Tutorial](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/).

In [14]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.50d.txt'
word2vec_output_file = 'glove.6B.50d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

(400000, 50)
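
The conversion is needed because the two text formats differ in a header: a word2vec text file starts with a line giving the vocabulary size and the vector dimension, while a GloVe file does not. The sketch below reproduces that idea on a tiny made-up file using only the standard library. The file names and vectors are hypothetical, and the code is meant to illustrate the format rather than replace `glove2word2vec`.

```python
import os
import tempfile

glove_lines = ['good 0.1 0.2 0.3', 'poor -0.1 0.0 0.4']   # made-up GloVe-style rows

tmp_dir = tempfile.mkdtemp()
glove_path = os.path.join(tmp_dir, 'toy_glove.txt')
w2v_path = os.path.join(tmp_dir, 'toy_glove.txt.word2vec')

with open(glove_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(glove_lines) + '\n')

# Count rows and dimensions, then write them as the header line.
with open(glove_path, encoding='utf-8') as f:
    rows = [line.rstrip('\n') for line in f if line.strip()]
vocab_size = len(rows)
dim = len(rows[0].split(' ')) - 1

with open(w2v_path, 'w', encoding='utf-8') as f:
    f.write(f'{vocab_size} {dim}\n')
    f.write('\n'.join(rows) + '\n')

with open(w2v_path, encoding='utf-8') as f:
    header = f.readline().strip()
print(header)   # 2 3
```

The pair `(400000, 50)` returned by `glove2word2vec` above is the same information: number of vectors and their dimension. Gensim 4.0 and later also accept `no_header=True` in `KeyedVectors.load_word2vec_format`, which should allow reading a GloVe text file directly without the conversion step; check the documentation of the installed version.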

In [34]:
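# NOTE: this loads the 300-dimensional file, whereas the cell above converts glove.6B.50d.txt;
# run glove2word2vec on glove.6B.300d.txt first, or load 'glove.6B.50d.txt.word2vec' instead
filename = 'glove.6B.300d.txt.word2vec'
glove = KeyedVectors.load_word2vec_format(filename, binary=False)
# KeyedVectors is used directly; the deprecated `.wv` attribute is not needed
glove_vectors = glove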



In [98]:
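#Converting text data into vectors
def getGloVe2Vec(texts):
    vectors = []
    for row in texts:
        # strip punctuation, lower-case and split on whitespace
        tokens = re.sub(r'[^\w\s]', '', row).lower().split()
        # keep only the tokens that have a pre-trained embedding
        temp = [glove[token] for token in tokens if token in glove]
        if temp:
            vectors.append(np.mean(temp, axis=0))
        else:
            # no known token: fall back to a zero vector
            vectors.append(np.zeros(glove.vector_size))
    return np.asarray(vectors)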

In [17]:
# Converting the train data into vectors
data_train = getGloVe2Vec(train)

#Converting the test data into vectors
data_test = getGloVe2Vec(test)

In [35]:
seed(0)
tensorflow.random.set_seed(0)
model=Sequential()
model.add(Dense(12,input_dim=data_train.shape[1],activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(2, activation='sigmoid'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history=model.fit(data_train, labels_train, epochs=150, batch_size=10,verbose=0)

_, accuracy = model.evaluate(data_test, labels_test)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 75.00


# Task 2

In this task we will repeat the same steps as in Task 1 but this time we will use a real-world dataset with reviews of restaurants obtained from Yelp.

In [94]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('yelp_reviews.csv',encoding = "ISO-8859-1")

#select input and output variables
data = df.values[:,0]
labels = df.values[:,1]

***
**T2.1** Split the data into train and test sets. Use any test size.

**T2.2** Obtain vector representations of the train and test data using Binary BOW, TF-IDF, Word2Vec and GloVe.

**T2.3** Implement an MLP and evaluate it with each of the data representations.

**T2.4** Investigate different architectures of your MLP to see whether you can improve the accuracy on the test dataset with any of the models.

In [95]:
x_train, x_test, y_train, y_test = train_test_split(data, labels,test_size=0.2, random_state=0)
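
With `test_size=0.2`, 20% of the reviews are held out for testing, and `random_state=0` makes the split reproducible. If the classes turn out to be imbalanced, passing `stratify=labels` keeps the class proportions the same in both parts. The sketch below shows this on made-up labels, since the class balance of `yelp_reviews.csv` is not stated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

texts = np.array([f'review {i}' for i in range(100)])   # placeholder texts
labels = np.array([1] * 70 + [0] * 30)                  # made-up 70/30 class balance

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

print(len(x_tr), len(x_te))        # 80 20
print(y_tr.mean(), y_te.mean())    # both parts keep the 70% positive share
```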

In [99]:
seed(0)
tensorflow.random.set_seed(0)
# Binary
bow_vectorizer = CountVectorizer(binary=True)
bow_train=bow_vectorizer.fit_transform(x_train).toarray()
bow_test=bow_vectorizer.transform(x_test).toarray()
model=Sequential()
model.add(Dense(12,input_dim=bow_train.shape[1],activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(2, activation='sigmoid'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history=model.fit(bow_train, y_train, epochs=150, batch_size=10,verbose=0)

_, accuracy = model.evaluate(bow_test, y_test)
print('Accuracy: %.2f' % (accuracy*100))

#TFIDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_train = tfidf_vectorizer.fit_transform(x_train).toarray()
tfidf_test = tfidf_vectorizer.transform(x_test).toarray()
model=Sequential()
model.add(Dense(12,input_dim=tfidf_train.shape[1],activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(2, activation='sigmoid'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history=model.fit(tfidf_train, y_train, epochs=150, batch_size=10,verbose=0)

_, accuracy = model.evaluate(tfidf_test, y_test)
print('Accuracy: %.2f' % (accuracy*100))

#word2vec
word2vec_train = getWord2Vec(x_train)
word2vec_test = getWord2Vec(x_test)
model=Sequential()
model.add(Dense(12,input_dim=word2vec_train.shape[1],activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(2, activation='sigmoid'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history=model.fit(word2vec_train, y_train, epochs=150, batch_size=10,verbose=0)

_, accuracy = model.evaluate(word2vec_test, y_test)
print('Accuracy: %.2f' % (accuracy*100))

#glove
glove_train = getGloVe2Vec(x_train)
glove_test = getGloVe2Vec(x_test)
model=Sequential()
model.add(Dense(12,input_dim=glove_train.shape[1],activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(2, activation='sigmoid'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history=model.fit(glove_train, y_train, epochs=150, batch_size=10,verbose=0)

_, accuracy = model.evaluate(glove_test, y_test)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 77.00

