# 8. Appendix A

## 8.1 Word Embedding with Continous Bag Of Words (CBOW) Approach

Given the greater development of python libraries for word embedding, the following appendix has been added to provide transparency into how it is performed.

We will be using the python packages of:
* numpy for computing and data prep
* keras for reducing cognitive load
* gensim for topic modeling

In [1]:
import numpy as np
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
import gensim

We have tagged each review with either a neutral, positive or negative sentiment based on the star rating. 

In [2]:
data = open('yelp.txt','r')
corona_data = [text for text in data if text.count(' ') >= 2]
vectorize = Tokenizer()
vectorize.fit_on_texts(corona_data)
corona_data = vectorize.texts_to_sequences(corona_data)
total_vocab = sum(len(s) for s in corona_data)
word_count = len(vectorize.word_index) + 1
window_size = 2

We will generate pairs of the context words and the target words.

In [3]:
def cbow_model(data, window_size, total_vocab):
    total_length = window_size*2
    for text in data:
        text_len = len(text)
        for idx, word in enumerate(text):
            context_word = []
            target   = []            
            begin = idx - window_size
            end = idx + window_size + 1
            context_word.append([text[i] for i in range(begin, end) if 
                                 0 <= i < text_len and i != idx])
            target.append(word)
            contextual = sequence.pad_sequences(context_word, 
                                                total_length=total_length)
            final_target = np_utils.to_categorical(target, total_vocab)
            yield(contextual, final_target) 

We will build the neural network model that will train the CBOW on our sample data.

In [4]:
model = Sequential()
model.add(Embedding(input_dim=total_vocab, output_dim=100, 
                    input_length=window_size*2))
model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(100,)))
model.add(Dense(total_vocab, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
for i in range(10):
    cost = 0
    for x, y in cbow_model(data, window_size, total_vocab):
        cost += model.train_on_batch(contextual, final_target)
    print(i, cost)

0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0


We will perform the vector file creation.

In [5]:
dimensions=100
vect_file = open('vectors.txt','w')
vect_file.write('{} {}\n'.format(len(vectorize.word_index.items()), dimensions))

8

In [6]:
weights = model.get_weights()[0]
for text, i in vectorize.word_index.items():
    final_vec = ' '.join(map(str, list(weights[i, :])))
    vect_file.write('{} {}\n'.format(text, final_vec))
vect_file.close()

We will use the vector file in gensim model with the example word "neutral".

In [8]:
cbow_output = gensim.models.KeyedVectors.load_word2vec_format('vectors.txt', 
                                                              binary=False)
cbow_output.most_similar(positive=['neutral'])

[('fit', 0.26622945070266724),
 ('very', 0.24694503843784332),
 ('outside', 0.22985662519931793),
 ('cycle', 0.22688570618629456),
 ('did', 0.20111840963363647),
 ('smile', 0.19902928173542023),
 ('his', 0.19888196885585785),
 ('young', 0.19812779128551483),
 ('have', 0.19288483262062073),
 ('where', 0.18997474014759064)]

The above output shows all the words that are related to the review that was tagged neutral.

In [11]:
cbow_output.most_similar(positive=['positive'])

[('nice', 0.2773880362510681),
 ('good', 0.19985781610012054),
 ('too', 0.17834430932998657),
 ('instructors', 0.16947823762893677),
 ('ideas', 0.16248522698879242),
 ('up', 0.15913422405719757),
 ('all', 0.14724114537239075),
 ('struggles', 0.1456916332244873),
 ('over', 0.14495055377483368),
 ('here', 0.14492066204547882)]

The above output shows all the words that are related to the review that was tagged positive.

After generating word vectors using CBOW, we can use Convolutional Neural Network (CNN) to train through labeled training set to capture the semantic features of the text.

Note: We just tagged each review with postive, neutral and negative just for the above example purpose to demonstrate how CBOW can generated related words. To actually build a sentiment analysis model we would have to use CNN on top of the wordvector as mentioned above.