# Sentiment Analysis of Amazon Product Reviews

The dataset contains 4 million reviews, split into 3.6 million for the training set and 400k for the test set. <br>
Reviews are labeled as negative (1-2 stars) or positive (4-5 stars). <br>
The entire dataset comes from <a href='https://www.kaggle.com/bittlingmayer/amazonreviews'>here</a>.
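
The notebook reads the label from a fixed character position in each line. A minimal sketch of that parsing, assuming every line has the fastText-style form `__label__N text` with `__label__1` for negative and `__label__2` for positive reviews (the helper name `parse_review_line` and the example sentence are made up for illustration):

```python
def parse_review_line(line):
    """Split one fastText-style line into (label, text).

    Assumes the line looks like '__label__2 Great product ...',
    where __label__1 is negative and __label__2 is positive.
    """
    tag, _, text = line.strip().partition(' ')
    label = int(tag[len('__label__'):]) - 1   # 0 = negative, 1 = positive
    return label, text


example = '__label__2 Great sound: I love these headphones.'
label, text = parse_review_line(example)
print(label, text)
```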

# Data exploration

In [1]:
with open('train.ft.txt', 'r', encoding = 'utf8') as file:
    max_len, avg_len, labels_percent, i = 0, 0, 0, 0
    for line in file:
        line = line.strip().split(' ')
        if len(line) - 1 > max_len:
            max_len = len(line) - 1 
        avg_len += len(line) - 1
        labels_percent += int(line[0][9]) - 1
        i = i + 1
    avg_len /= i
    labels_percent /= i
    
print('Maximum words in sentence in training set: {}'.format(max_len))
print('Average words in sentence in training set: {:.2f}'.format(avg_len))
print('Positive review in training set: {:.2f}%\n'.format(labels_percent * 100))

with open('test.ft.txt', 'r', encoding = 'utf8') as file:
    max_len, avg_len, labels_percent, i = 0, 0, 0, 0
    for line in file:
        line = line.strip().split(' ')
        if len(line) - 1 > max_len:
            max_len = len(line) - 1 
        avg_len += len(line) - 1
        labels_percent += int(line[0][9]) - 1
        i = i + 1
    avg_len /= i  
    labels_percent /= i
    
print('Maximum words in sentence in test set: {}'.format(max_len))
print('Average words in sentence in test set: {:.2f}'.format(avg_len))
print('Positive review in test set: {:.2f}%'.format(labels_percent * 100))

Maximum words in sentence in training set: 257
Average words in sentence in training set: 78.48
Positive review in training set: 50.00%

Maximum words in sentence in test set: 230
Average words in sentence in test set: 78.42
Positive review in test set: 50.00%


The dataset is balanced: the number of negative and positive examples is equal. The average review is around 78 words long, so later I pad/truncate the reviews to lists of 100 words.
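
To make the pad/truncate step concrete, here is a small sketch. The helper name `pad_or_truncate` is hypothetical, and I assume zeros are placed on the left of short reviews while long reviews keep only their first words:

```python
import numpy as np

def pad_or_truncate(indexes, length=100, pad_value=0):
    """Return a fixed-length array: left-padded if short, cut if long."""
    row = np.full(length, pad_value, dtype=np.int64)
    kept = indexes[:length]
    if kept:                       # guard: row[-0:] would select the whole row
        row[-len(kept):] = kept
    return row

short = pad_or_truncate([5, 8, 13], length=6)
long_ = pad_or_truncate(list(range(1, 10)), length=6)
print(short.tolist())   # [0, 0, 0, 5, 8, 13]
print(long_.tolist())   # [1, 2, 3, 4, 5, 6]
```

The explicit guard for an empty review matters because a slice starting at `-0` covers the whole row.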

# Loading Global Vectors for Word Representation

The word embeddings file is available <a href='https://nlp.stanford.edu/projects/glove/'>here</a>. It comes from the Stanford NLP Group and was trained with the unsupervised GloVe algorithm on 2 billion tweets containing 27 billion tokens. It provides embeddings for 1.2 million words as 25-, 50-, 100- or 200-dimensional vectors.

I use the following indexes for special cases:
- index 0 as blank word for padding sentences <br>
- index 1 as unknown word
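
A short sketch of how these two reserved indexes could be used when converting words to indexes. The toy vocabulary and the helper `words_to_indexes` are assumptions for illustration; real words start at index 2:

```python
PAD_INDEX = 0       # blank word used for padding
UNKNOWN_INDEX = 1   # any word missing from the vocabulary

# toy vocabulary (hypothetical); real words start at index 2
toy_word_to_index = {'good': 2, 'bad': 3, 'movie': 4}

def words_to_indexes(words, word_to_index):
    """Map each word to its index, falling back to the unknown index."""
    return [word_to_index.get(word, UNKNOWN_INDEX) for word in words]

encoded = words_to_indexes(['good', 'zzz', 'movie'], toy_word_to_index)
print(encoded)   # [2, 1, 4]
```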

In [2]:
import numpy as np

def loadGloveModel(gloveFile, vector_size, vocab_size):
    print("Loading Glove Model. May take some time")
    with open(gloveFile, 'r', encoding = 'utf-8') as file:
        i = 2
        embeddings = np.zeros((vocab_size, vector_size))
        word_to_index = {'': 0, 'unknown': 1}
        index_to_word = {0 : '', 1 : 'unknown'}
        for line in file:
            splitLine = line.split(' ')
            word = splitLine[0]
            word_to_index[word] = i
            index_to_word[i] = word
            embeddings[i] = np.asarray(splitLine[1:], dtype='float32')
            i = i + 1
        print("Done.", len(word_to_index)," words loaded!")
    return embeddings, word_to_index, index_to_word

embedding_matrix, word_to_index, index_to_word = loadGloveModel('Glove/glove.twitter.27B.100d.txt', 100, 1193516)

Loading Glove Model. May take some time
Done. 1193515  words loaded!


# Creating custom Word Embeddings

Unfortunately, the vocabulary of the GloVe word embeddings is too big to pass to the Keras Embedding layer. Therefore I am creating my own embeddings by training on the reviews from 'train.ft.txt' with the Word2Vec algorithm. <br><br>
I am using a 5-word context window, a minimum word count of 10, and downsampling of the most common words. <br>
Words are embedded in 300-dimensional vectors.
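
To illustrate what the minimum word count and the context window mean, here is a simplified pure-Python sketch. It is not how gensim implements them (the real implementation differs in details), and the helper names `build_vocab` and `context_pairs` are made up:

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

def context_pairs(sentence, window):
    """Yield (center, context) pairs within `window` words on each side."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]

toy = [['great', 'book'], ['great', 'movie'], ['bad', 'book']]
vocab = build_vocab(toy, min_count=2)
pairs = list(context_pairs(['a', 'b', 'c'], window=1))
print(sorted(vocab))   # ['book', 'great']
print(pairs)           # [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```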


In [None]:
from gensim.models import Word2Vec
from string import punctuation

num_features = 300                  
min_word_count = 10    
context = 5                                                                                          
downsampling = 1e-3 

def strip_punctuation(s):
    return ''.join(c for c in s if c not in punctuation)
 
# restartable iterable, necessary for training on big files:
# Word2Vec scans the corpus once to build the vocabulary and again for every epoch
class SentencesIterator():
    def __init__(self, inputPath):
        self.inputPath = inputPath

    def __iter__(self):
        # reopen the file on every pass so the corpus can be read repeatedly
        with open(self.inputPath, "r", encoding='utf-8') as file:
            for line in file:
                stripped = strip_punctuation(line[11:]).lower().strip()
                yield stripped.split(' ')


sentences_iterator = SentencesIterator('train.ft.txt')

print("Training Word2Vec model")
w2v = Word2Vec(sentences_iterator, workers=4, size=num_features, min_count = min_word_count,\
                 window = context, sample = downsampling, iter=100)
w2v.init_sims(replace=True)
w2v.save("w2v_300features_10minwordcounts3")
print("Vocabulary size : {}".format(len(w2v.wv.index2word)))
print("Top 10 words in vocabulary: {}".format(w2v.wv.index2word[0:10]))

# Creating preprocessed data files

The data files containing raw review text with labels are converted to lists of word indexes. The label comes first on each line, followed by the indexes separated by spaces. <br>
0 - negative review, 1 - positive review
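
As a sketch of the conversion for a single line, assuming the `__label__N text` input format and a toy vocabulary (`toy_word2index` and `encode_line` are hypothetical names; unknown words get the index one past the last known word):

```python
from string import punctuation

toy_word2index = {'great': 0, 'book': 1}
unknown_index = len(toy_word2index)    # one past the last known word

def encode_line(raw_line, word2index, unknown_index):
    """Turn '__label__N some text' into 'label idx idx ...'."""
    label = int(raw_line[9]) - 1                       # '__label__2' -> 1
    text = ''.join(c for c in raw_line[11:] if c not in punctuation)
    words = text.lower().strip().split(' ')
    indexes = [word2index.get(w, unknown_index) for w in words]
    return ' '.join(str(v) for v in [label] + indexes)

encoded_line = encode_line('__label__2 Great book, really!\n',
                           toy_word2index, unknown_index)
print(encoded_line)   # 1 0 1 2
```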

In [None]:
from gensim.models import Word2Vec
import numpy as np
from string import punctuation
w2v = Word2Vec.load("w2v_300features_10minwordcounts")

# index 169029 is used as an unknown word
voc_size = w2v.wv.vectors.shape[0] #169029

# creating dictionary word2index
word2index = {}
for key in w2v.wv.vocab:
    word2index[key] = w2v.wv.vocab[key].index

def sentence_to_indexes(sentence, voc_size):
    return [word2index.get(word, voc_size) for word in sentence.split(' ')]

def strip_punctuation(s):
    return ''.join(c for c in s if c not in punctuation)

def create_preprocessed_dataset(train_file, test_file, train_preprocessed_file, test_preprocessed_file):
    print('Creating preprocessed train dataset file. May take some time')
    with open(train_file, 'r', encoding='utf8') as file_in,\
    open(train_preprocessed_file, 'w') as file_out:
        i = 0
        for sentence in file_in:
            if (sentence == ''):
                break
            sentence_out = [int(sentence[9]) - 1]
            stripped = strip_punctuation(sentence[11:]).lower().strip()
            sentence_out += sentence_to_indexes(stripped, voc_size)
            print(*sentence_out, sep = ' ', file = file_out)
            i = i + 1
            if (i % 900000 == 0):
                print('{}% loaded'.format(100 * i / 3600000))
    print('Done!')
                
    print('Creating preprocessed test dataset file. May take some time')
    i = 0
    with open(test_file, 'r', encoding='utf8') as file_in,\
    open(test_preprocessed_file, 'w') as file_out:
        for sentence in file_in:
            if (sentence == ''):
                break
            sentence_out = [int(sentence[9]) - 1]
            stripped = strip_punctuation(sentence[11:]).lower().strip()
            sentence_out += sentence_to_indexes(stripped, voc_size)
            print(*sentence_out, sep = ' ', file = file_out)
            i = i + 1
            if (i % 100000 == 0):
                print('{}% loaded'.format(100 * i / 400000))
            
    print('Done!')

create_preprocessed_dataset('train.ft.txt', 'test.ft.txt', 'train_preprocessed2.txt', 'test_preprocessed2.txt')

# Creating generators

The datasets are too big to keep in RAM while training. Therefore, it is necessary to create generators that feed the Keras models directly from the preprocessed files. <br>

Train examples - max_length = 257, average_length = 78.46 <br>
Test examples - max_length = 230, average_length = 78.41 <br>

- The average generator works by averaging all vectors in the sentence. <br>
- In the sentence generator I pad all shorter sentences (on the left, with zeros) to a length of 100 and truncate the longer ones. Each list of 100 words contains the integer indexes of the words.
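
A tiny sketch of the averaging step for a single review, using a toy 2-dimensional embedding matrix instead of the real 300-dimensional one (the names `toy_embeddings` and `average_vector` are assumptions):

```python
import numpy as np

toy_embeddings = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [0.0, 0.0]])   # last row: zero vector for unknown words

def average_vector(indexes, embeddings):
    """Look up each word vector and average them into one sentence vector."""
    return embeddings[np.asarray(indexes, dtype=int)].mean(axis=0)

avg = average_vector([0, 1, 1, 2], toy_embeddings)
print(avg)
```

Note that unknown words still count towards the average, which pulls the sentence vector towards zero.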

In [None]:
from gensim.models import Word2Vec
import numpy as np
w2v = Word2Vec.load("w2v_300features_10minwordcounts")
vec_size = 300
embedding_matrix = np.append(w2v.wv.vectors, np.zeros((1, vec_size)), axis=0)

def indexes_to_vectors(indexes):
    return [embedding_matrix[int(index)] for index in indexes]

def average_generator(inputPath, vector_size, batch_size):
    with open(inputPath, "r") as file:
        while True:
            i = 0
            X = np.zeros((batch_size, vector_size))
            Y = np.zeros((batch_size,))
            while i < batch_size:
                line = file.readline()
                if line == '':
                    file.seek(0)
                    line = file.readline()
                line = line.strip().split(" ")    
                Y[i] = line[0]
                X[i] = np.average(indexes_to_vectors(line[1:]), axis=0)
                i = i + 1
            yield (X, Y)

Tx = 100
            
def sentences_generator(inputPath, batch_size):
    with open(inputPath, "r") as file:
        while True:
            i = 0
            X = np.zeros((batch_size, Tx))
            Y = np.zeros((batch_size,))
            while i < batch_size:
                line = file.readline()
                if line == '':
                    file.seek(0)
                    line = file.readline()
                line = line.strip().split(" ")    
                Y[i] = line[0]
                sentence_length = len(line[1:])
                if (sentence_length <= 100):
                    X[i, -sentence_length:] = line[1:]
                else:
                    X[i, :] = line[1:101]
                i = i + 1
            yield (X, Y)

# Average Vector Model (Benchmark Model)

A simple model that averages all word embeddings in a sentence. The averaged vector is then fed to an MLP network. This model is not expected to be very accurate, but it serves as a baseline for evaluating the next models.

In [None]:
import keras
from keras.layers import Input, Dense, Dropout
from keras.models import Model
from keras.optimizers import Adam

In [None]:
vec_size = 300
batch_size = 2500
train_examples = 3600000
test_examples = 400000

def average_model(vec_size):
    X_inp = Input(shape = (vec_size,))
    X = Dense(128, activation = 'relu')(X_inp)
    X = Dropout(0.3)(X)
    X = Dense(32, activation = 'relu')(X)
    X = Dropout(0.3)(X)
    X = Dense(1, activation = 'sigmoid')(X)
    
    model = Model(inputs = X_inp, outputs = X)
    return model


train_gen = average_generator('train_preprocessed2.txt', vec_size, batch_size)
test_gen = average_generator('test_preprocessed2.txt', vec_size, batch_size) 

model = average_model(vec_size)
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['acc'])
model.fit_generator(train_gen, steps_per_epoch = train_examples // batch_size,
                    validation_data=test_gen, validation_steps = test_examples // batch_size, epochs = 15)

- Epoch 1:  loss: 0.4943 - acc: 0.7587 - val_loss: 0.4584 - val_acc: 0.7821
- Epoch 2:  loss: 0.4545 - acc: 0.7860 - val_loss: 0.4326 - val_acc: 0.7981
- Epoch 15: loss: 0.4020 - acc: 0.8164 - val_loss: 0.3835 - val_acc: 0.8253

The benchmark model achieved 82.5% accuracy on the test set.

# LSTM model (custom word embeddings)

The first recurrent model has 1 LSTM layer (128 hidden units) followed by 2 Dense layers. The model contains an Embedding layer which transforms the lists of 100 word indexes into the corresponding vectors. <br>
The model uses my own embeddings of 169029 words as 300d vectors.
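
Conceptually, a frozen embedding layer is a row lookup in the embedding matrix. The NumPy sketch below shows the shape change with toy sizes; it is only an illustration of the idea, not the Keras implementation:

```python
import numpy as np

voc_size, vec_size, Tx = 6, 4, 3          # toy sizes, not the real ones
rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(voc_size, vec_size))

# a batch of 2 reviews, each a list of Tx word indexes
batch = np.array([[0, 2, 5],
                  [1, 1, 3]])

# indexing with an integer array returns one row per index
vectors = toy_matrix[batch]
print(vectors.shape)   # (2, 3, 4)
```

With the real sizes, a batch of shape (batch_size, 100) becomes (batch_size, 100, 300) before it reaches the LSTM.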

In [None]:
import keras
from keras.layers import Input, Dense, LSTM, Dropout
from keras.layers.embeddings import Embedding
from keras.models import Model
from keras.optimizers import Adam

In [None]:
w2v = Word2Vec.load("w2v_300features_10minwordcounts")
vec_size = w2v.wv.vectors.shape[1]
# append a zero row that serves as the vector of the unknown word
embedding_matrix = np.append(w2v.wv.vectors, np.zeros((1, vec_size)), axis=0)
print("Shape of embedding matrix: ", embedding_matrix.shape)
voc_size = embedding_matrix.shape[0]

embedding_layer = Embedding(voc_size, vec_size, trainable=False, weights = [embedding_matrix])

def lstm_model():
    X_inp = Input(shape = (100,))
    X = embedding_layer(X_inp)
    # Keras 2 argument names (formerly dropout_W / dropout_U)
    X = LSTM(128, return_sequences=False, dropout=0.2, recurrent_dropout=0.2)(X)
    X = Dense(10, activation = 'relu')(X)
    X = Dropout(0.2)(X)
    X = Dense(1, activation = 'sigmoid')(X)
    
    model = Model(inputs = X_inp, outputs = X)
    print(model.summary())
    return model

batch_size = 2000
train_examples = 3600000
test_examples = 400000

train_gen = sentences_generator('train_preprocessed2.txt', batch_size)
test_gen = sentences_generator('test_preprocessed2.txt', batch_size) 

model = lstm_model()
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['acc'])
model.fit_generator(train_gen, steps_per_epoch = train_examples // batch_size,
                    validation_data=test_gen, validation_steps = test_examples // batch_size, epochs = 2)

- Epoch 1: 1827s - loss: 0.4579 - acc: 0.7835 - val_loss: 0.3086 - val_acc: 0.8662
- Epoch 2: 1816s - loss: 0.3072 - acc: 0.8704 - val_loss: 0.2443 - val_acc: 0.8973

The model achieved 89.7% accuracy on the test set. It is a good result, considering that it took only 2 epochs and that training the word embeddings did not take long.
 

# LSTM with Glove Word Embeddings

The architecture of the model is identical to the one above, but this one uses 100d GloVe word embeddings. The embedding matrix is too big to feed into the embedding layer, so the word indexes are transformed into vectors in a new generator.
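
As a rough illustration of the sizes involved, the sketch below estimates the memory of the full GloVe matrix and of one generated batch. It assumes NumPy's default float64 arrays, and it says nothing about where exactly the limit of the embedding layer lies (`array_megabytes` is a hypothetical helper):

```python
import numpy as np

def array_megabytes(shape, dtype=np.float64):
    """Size in MB (10**6 bytes) of a dense array with the given shape."""
    n_items = int(np.prod(shape, dtype=np.int64))
    return n_items * np.dtype(dtype).itemsize / 1e6

glove_matrix_mb = array_megabytes((1193516, 100))   # vocabulary x vector size
batch_mb = array_megabytes((2000, 100, 100))        # batch x words x vector size
print('GloVe matrix: {:.1f} MB'.format(glove_matrix_mb))
print('One batch:    {:.1f} MB'.format(batch_mb))
```

Moving the lookup into the generator means only one batch of vectors has to be passed to the model at a time.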

In [None]:
Tx = 100
vec_size = embedding_matrix.shape[1]

def indexes_to_vectors(indexes):
    return [embedding_matrix[int(index)] for index in indexes]
            
def vectors_generator(inputPath, batch_size):
    with open(inputPath, "r") as file:
        while True:
            i = 0
            X = np.zeros((batch_size, Tx, vec_size))
            Y = np.zeros((batch_size,))
            while i < batch_size:
                line = file.readline()
                if line == '':
                    file.seek(0)
                    line = file.readline()
                line = line.strip().split(" ")    
                Y[i] = line[0]
                sentence_length = len(line[1:])
                if (sentence_length <= 100):
                    X[i, -sentence_length:] = indexes_to_vectors(line[1:])
                else:
                    X[i] = indexes_to_vectors(line[1:101])
                i = i + 1
            yield (X, Y)

In [None]:
import keras
from keras.layers import Input, Dense, LSTM, Dropout
from keras.layers.embeddings import Embedding
from keras.models import Model
from keras.optimizers import Adam

In [None]:
def lstm_model():
    X_inp = Input(shape = (100, 100))
    # Keras 2 argument names (formerly dropout_W / dropout_U)
    X = LSTM(128, return_sequences=False, dropout=0.2, recurrent_dropout=0.2)(X_inp)
    X = Dense(10, activation = 'relu')(X)
    X = Dropout(0.2)(X)
    X = Dense(1, activation = 'sigmoid')(X)
    
    model = Model(inputs = X_inp, outputs = X)
    print(model.summary())
    return model

batch_size = 2000
train_examples = 3600000
test_examples = 400000

train_gen = vectors_generator('train_preprocessed.txt', batch_size)
test_gen = vectors_generator('test_preprocessed.txt', batch_size) 

model = lstm_model()
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['acc'])
model.fit_generator(train_gen, steps_per_epoch = train_examples // batch_size,
                    validation_data=test_gen, validation_steps = test_examples // batch_size, epochs = 10)

#model.save('my_model.h5')
#model = keras.models.load_model('my_model.h5')

- Epoch 1: loss: 0.3033 - acc: 0.8720 - val_loss: 0.2074 - val_acc: 0.9171
- Epoch 2: loss: 0.2184 - acc: 0.9152 - val_loss: 0.1802 - val_acc: 0.9295
- Epoch 10: loss: 0.1680 - acc: 0.9377 - val_loss: 0.1470 - val_acc: 0.9445

The model achieved 94.5% accuracy. It is a really good score compared to other models on Kaggle, especially considering that it is only a 3-layer model. <br>

I want a deeper analysis of the model's performance. Therefore I am creating another generator that yields only the test data, without labels, for predict_generator.

In [5]:
import keras
import numpy as np

model = keras.models.load_model('my_model.h5')

Tx = 100
vec_size = embedding_matrix.shape[1]
batch_size = 2000
test_examples = 400000

def indexes_to_vectors(indexes):
    return [embedding_matrix[int(index)] for index in indexes]
            
def vectors_generator(inputPath, batch_size):
    with open(inputPath, "r") as file:
        while True:
            i = 0
            X = np.zeros((batch_size, Tx, vec_size))
            while i < batch_size:
                line = file.readline()
                if line == '':
                    file.seek(0)
                    line = file.readline()
                line = line.strip().split(" ")    
                sentence_length = len(line[1:])
                if (sentence_length <= 100):
                    X[i, -sentence_length:] = indexes_to_vectors(line[1:])
                else:
                    X[i] = indexes_to_vectors(line[1:101])
                i = i + 1
            yield X
            
test_gen = vectors_generator('test_preprocessed.txt', batch_size) 

predictions = model.predict_generator(test_gen, test_examples // batch_size)

with open('test_preprocessed2.txt', 'r') as file:
    Y_true = []
    for line in file:
        Y_true.append(line[0])       
Y_true = np.asarray(Y_true, dtype='int')

In [6]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

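# scikit-learn metrics expect (y_true, y_pred); predictions are passed first here,
# so confusion-matrix rows are predicted labels and the precision/recall columns
# of the report are swapped with respect to the true labels
print('Accuracy score: ', accuracy_score(np.round(predictions), Y_true))
# computed from rounded predictions, not from the raw probabilities
print('Area under the ROC curve score: ', roc_auc_score(np.round(predictions), Y_true))
print('Confusion matrix:')
print(confusion_matrix(np.round(predictions), Y_true))
print('Classification report:')
print(classification_report(np.round(predictions), Y_true))

Accuracy score:  0.94512
Area under the ROC curve score:  0.9451487751290625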
Confusion matrix:
[[189828  11780]
 [ 10172 188220]]
Classification report:
              precision    recall  f1-score   support

         0.0       0.95      0.94      0.95    201608
         1.0       0.94      0.95      0.94    198392

   micro avg       0.95      0.95      0.95    400000
   macro avg       0.95      0.95      0.95    400000
weighted avg       0.95      0.95      0.95    400000



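The errors are almost evenly split between positive and negative examples. Since the predictions were passed as the first argument, the rows of the confusion matrix are predicted labels: 11,780 positive reviews were classified as negative and 10,172 negative reviews as positive, so recall is slightly lower on the positive reviews. To further improve performance, I would try a different architecture or perform an error analysis to see which types of reviews are the most difficult for the model.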
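
The per-class numbers can be recomputed directly from the confusion matrix printed above. This sketch assumes rows are predicted labels and columns are true labels, which follows from the predictions being passed as the first argument:

```python
import numpy as np

# rows = predicted label, columns = true label (values copied from the output above)
cm = np.array([[189828, 11780],
               [10172, 188220]])

correct = np.diag(cm)
precision = correct / cm.sum(axis=1)   # correct / everything predicted as the class
recall = correct / cm.sum(axis=0)      # correct / everything truly in the class
accuracy = np.trace(cm) / cm.sum()

for label, name in enumerate(['negative', 'positive']):
    print('{}: precision {:.4f}, recall {:.4f}'.format(
        name, precision[label], recall[label]))
print('accuracy: {:.5f}'.format(accuracy))
```

The column sums are 200,000 each, which matches the balanced test set.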