In this notebook we code a vanilla version of CBOW using TensorFlow and Oscar Wilde's 'The Picture of Dorian Gray'.

In [1]:
import numpy as np
import tensorflow as tf
import nltk

We need to break the data into words. We are going to do it in a slightly fancier way, using a tokenizer from the nltk package.

In [2]:
from nltk.tokenize import word_tokenize

We load the book

In [3]:
book=open('./data/dorian.txt','r')

and create the list of tokens

In [4]:
token_list=word_tokenize(book.read())

Let's take a sample.

In [5]:
token_list[34:40]

['may', 'copy', 'it', ',', 'give', 'it']

Note that we obtained more than just the words: the tokenizer also returns punctuation marks, such as the comma above, as separate tokens. Let's create the list of distinct tokens.

In [6]:
tokens = list(set(token_list))
len(tokens)

8238

So there are 8238 distinct tokens. Now we need to create our training data. Recall that we are using CBOW, so we need to fix an $m$, the number of context words taken on each side of the target; for this post we use $m=2$.

In [7]:
X_data_words=[]
y_data_words=[]
for i in range(2,len(token_list)-2):
    X_data_words+=[[token_list[i-2],token_list[i-1],token_list[i+1],token_list[i+2]]]
    y_data_words+=[[token_list[i]]]
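
To see what the loop above produces, here is a small sketch on the six tokens sampled earlier. The helper name `cbow_pairs` and its parameter `m` are ours, introduced only for illustration; the slicing is meant to be equivalent to the explicit indexing above for $m=2$.

```python
def cbow_pairs(token_list, m=2):
    """Return (contexts, targets) using m tokens on each side of the target."""
    contexts, targets = [], []
    for i in range(m, len(token_list) - m):
        left = token_list[i - m:i]            # the m tokens before position i
        right = token_list[i + 1:i + m + 1]   # the m tokens after position i
        contexts.append(left + right)
        targets.append([token_list[i]])
    return contexts, targets

toy_tokens = ['may', 'copy', 'it', ',', 'give', 'it']
toy_contexts, toy_targets = cbow_pairs(toy_tokens, m=2)
print(toy_contexts)
print(toy_targets)
```

Only positions with a full window on both sides become training examples, so six tokens give just two pairs.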

We are now ready to create the pipeline. Let's recall the steps:
- One-hot encoding.
- Embedding given by U.
- Average.
- Decoding via V.
- Softmax.
- Loss function.
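
The steps above can be sketched in plain NumPy for a single example. This is only an illustration under toy assumptions: the sizes (6 tokens, 3 dimensions), the random matrices `U_toy` and `V_toy`, and the chosen indices are made up and are not the values used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 3     # toy sizes, not the notebook's 8238 and 128
U_toy = rng.normal(scale=0.5, size=(vocab_size, embed_dim))   # embedding matrix
V_toy = rng.normal(scale=0.5, size=(embed_dim, vocab_size))   # decoding matrix

context_ids = [0, 1, 3, 4]       # indices of the 2m = 4 context words
target_id = 2                    # index of the middle word

one_hot = np.eye(vocab_size)[context_ids]     # 1. one-hot encoding, shape (4, 6)
embedded = one_hot @ U_toy                    # 2. embedding given by U, shape (4, 3)
averaged = embedded.mean(axis=0)              # 3. average, shape (3,)
scores = averaged @ V_toy                     # 4. decoding via V, shape (6,)
exp_scores = np.exp(scores - scores.max())    # shift by the max for stability
probs = exp_scores / exp_scores.sum()         # 5. softmax, shape (6,)
toy_loss = -np.log(probs[target_id])          # 6. cross-entropy loss
print(probs.shape, float(toy_loss))
```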

## One-hot

In [8]:
def one_hot_enc(word):
    one_hot=np.zeros(8238)
    one_hot[tokens.index(word)]=1
    return one_hot

def one_hot_dec(vector):
    return tokens[np.argmax(vector)]
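
Note that `tokens.index(word)` scans the list on every call. A possible alternative, sketched here on a toy vocabulary, is to build a dictionary once and look indices up in it. The names `toy_vocab`, `word_to_index`, `one_hot_enc_fast` and `one_hot_dec_fast` are hypothetical and do not appear elsewhere in the notebook.

```python
import numpy as np

toy_vocab = ['may', 'copy', 'it', ',', 'give']
# Map each token to its position once, so later lookups do not scan the list.
word_to_index = {word: idx for idx, word in enumerate(toy_vocab)}

def one_hot_enc_fast(word):
    one_hot = np.zeros(len(toy_vocab))
    one_hot[word_to_index[word]] = 1
    return one_hot

def one_hot_dec_fast(vector):
    return toy_vocab[int(np.argmax(vector))]

print(one_hot_enc_fast('it'))
print(one_hot_dec_fast(one_hot_enc_fast('it')))
```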

## Some further preparations:
Unfortunately, we need to use a slightly different structure. The reason is that if we were to hold the one-hot encoding of every word in X_data_words, we would create a numpy array of shape (98379,4,8238), and if we try to create an empty array of that shape, we get

In [9]:
np.zeros((98379,4,8238))

MemoryError: 
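
The size of that array can be estimated without allocating it. The sketch below only does the arithmetic, assuming the default float64 dtype of `np.zeros`; it comes out at roughly 26 GB.

```python
import numpy as np

n_examples, n_context, vocab_size = 98379, 4, 8238
n_elements = n_examples * n_context * vocab_size      # entries in the dense array
bytes_per_element = np.dtype(np.float64).itemsize     # np.zeros defaults to float64
total_gb = n_elements * bytes_per_element / 1e9
print(n_elements, round(total_gb, 1))
```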

This is not a good sign. Instead, we use a different approach: we collect only the indices

In [10]:
X_data = np.array([[tokens.index(word) for word in _set] for _set in X_data_words])
y_data = np.array([[tokens.index(word) for word in _set] for _set in y_data_words])

and use tf.nn.embedding_lookup. (This is similar to how sparse matrices work.)
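
The reason indices are enough is that multiplying a one-hot row by U just selects a row of U. The following NumPy sketch checks this on a toy matrix; it uses NumPy row indexing as a stand-in for the lookup, and the sizes and names (`U_toy`, `ids`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
U_toy = rng.normal(size=(6, 3))           # toy embedding matrix: 6 tokens, 3 dimensions
ids = np.array([0, 1, 3, 4])              # four context indices

via_one_hot = np.eye(6)[ids] @ U_toy      # dense route: one-hot rows times U
via_lookup = U_toy[ids]                   # index route: pick the rows directly
print(np.allclose(via_one_hot, via_lookup))
```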

## The TensorFlow Graph

We first create the objects that will appear in the graph

In [11]:
X = tf.placeholder(shape=(4),dtype=tf.int32,name='INPUT')
y_ = tf.placeholder(shape=(1),dtype=tf.int32,name='OUTPUT')
U = tf.Variable(tf.random_normal([8238,128], stddev=0.5),name="U",dtype=tf.float32)
V = tf.Variable(tf.random_normal([128,8238], stddev=0.5),name="V",dtype=tf.float32)
aveg_creator = tf.constant([[0.25,0.25,0.25,0.25]],name='average_creator')

Now, we create the graph

In [12]:
u= tf.nn.embedding_lookup(U,X)
u_aveg= tf.matmul(aveg_creator,u,name='average')
v= tf.matmul(u_aveg,V)
y_hat = tf.nn.softmax(v)
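
To make the shapes in the graph concrete, here is a NumPy sketch with toy sizes (3 dimensions and 6 tokens instead of 128 and 8238). It also checks that multiplying by the row of 0.25s is the same as taking the mean of the four embedded context words. All names ending in `_toy` are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
u_toy = rng.normal(size=(4, 3))                        # the four looked-up rows of U
average_row = np.array([[0.25, 0.25, 0.25, 0.25]])     # same values as aveg_creator
u_aveg_toy = average_row @ u_toy                       # shape (1, 3)
V_toy = rng.normal(size=(3, 6))
v_toy = u_aveg_toy @ V_toy                             # shape (1, 6): one score per token
print(u_aveg_toy.shape, v_toy.shape)
print(np.allclose(u_aveg_toy, u_toy.mean(axis=0, keepdims=True)))
```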

## The loss function and training

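We need to modify the loss function a bit for technical reasons: we build the one-hot target by looking up a row of the identity matrix, we clip the softmax output away from zero before taking the logarithm, and we add L2 penalties on U and V.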

In [13]:
y=tf.nn.embedding_lookup(np.eye(8238,dtype=np.float32),y_)
loss = -tf.reduce_sum(y*tf.log(tf.clip_by_value(y_hat,1e-10,1.0)))+tf.nn.l2_loss(U)+tf.nn.l2_loss(V)
#loss = -tf.reduce_sum(y * tf.log(y_hat), reduction_indices=[1])
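
The clipping is there because the logarithm of a probability that underflows to zero is minus infinity. A common alternative is to compute the loss directly from the scores with the log-sum-exp trick. The NumPy sketch below compares both on a deliberately extreme score vector; it is an illustration of the idea, not the TensorFlow implementation, and the function names are ours.

```python
import numpy as np

def cross_entropy_from_logits(logits, target_id):
    """Cross-entropy computed in log space, without forming the probabilities."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_id]

def cross_entropy_clipped(logits, target_id):
    """Softmax first, then clip before the log, as in the cell above."""
    exp_scores = np.exp(logits - logits.max())
    probs = exp_scores / exp_scores.sum()
    return -np.log(np.clip(probs, 1e-10, 1.0))[target_id]

extreme_logits = np.array([0.0, 1000.0, -1000.0])
loss_log_space = cross_entropy_from_logits(extreme_logits, 2)
loss_clipped = cross_entropy_clipped(extreme_logits, 2)
print(loss_log_space, loss_clipped)
```

With clipping, the loss can never exceed $-\log(10^{-10}) \approx 23.03$, no matter how wrong the prediction is, while the log-space version keeps growing with the size of the error.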

We use AdamOptimizer for the gradient descent.

In [14]:
train=tf.train.AdamOptimizer(0.1).minimize(loss)

## The pipeline
We are ready to start the session

In [15]:
sess=tf.Session()
sess.run(tf.global_variables_initializer())

We run 10000 steps. Note that this is quite little for this data set because of the way we built it: there are 98379 examples and we are only going over them one at a time.

In [19]:
for i in range(10000):
    j=np.random.randint(98379)
    sess.run(train,feed_dict={X:X_data[j],y_:y_data[j]})
    if i%1000==0:
        print('step: ',i,' the loss is ',sess.run(loss,feed_dict={X:X_data[j],y_:y_data[j]}))

step:  0  the loss is  190043.0
step:  1000  the loss is  9.01697
step:  2000  the loss is  10.591
step:  3000  the loss is  11.8745
step:  4000  the loss is  12.7963
step:  5000  the loss is  15.0773
step:  6000  the loss is  15.8069
step:  7000  the loss is  19.2467
step:  8000  the loss is  17.6352
step:  9000  the loss is  18.9454


Now, let's run an example:

In [22]:
Ex=['no', 'artist','ethical', 'sympathies']
Ex_indexes=[tokens.index(word) for word in Ex]
Ex_indexes

[5560, 7075, 448, 7332]

In [34]:
vec=sess.run(y_hat,feed_dict={X:Ex_indexes})
# indices of the ten most probable tokens, most probable first
for idx in vec[0].argsort()[-10:][::-1]:
    print(tokens[idx])

drugs
clever
examining
again
water-lilies
oils
my
Bruno
house-party
crime


These words don't fit well in the middle. If you let it run for a couple of hours it may do better, but don't expect much from this model.

**Disclaimer:** This is by far not the best way to do this. Here are some things to note and improve:

- We are using online gradient descent, which amounts to mini-batch training with batches of size 1. This is slow but easier to code.
- The loss function we are using is not the best option. TensorFlow provides a softmax cross-entropy loss (for example `tf.nn.softmax_cross_entropy_with_logits`) that would allow us to avoid the clipping of the values and the regularizers.
- The data feeding can be improved.
- The network should be designed to receive batches; we are fixing the entry values.
- Checking the loss in this way is not a great idea; better to use TensorBoard.
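
As a hint for the batching point, here is a NumPy sketch of the forward pass on a whole mini-batch at once. It is only a sketch under toy assumptions: the data `X_toy`, the matrices `U_toy` and `V_toy`, and all sizes are made up, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, batch_size = 6, 3, 5          # toy sizes
U_toy = rng.normal(scale=0.5, size=(vocab_size, embed_dim))
V_toy = rng.normal(scale=0.5, size=(embed_dim, vocab_size))
X_toy = rng.integers(0, vocab_size, size=(20, 4))    # 20 fake examples, 4 context indices each

batch_ids = rng.choice(len(X_toy), size=batch_size, replace=False)
X_batch = X_toy[batch_ids]                           # shape (5, 4)
averaged = U_toy[X_batch].mean(axis=1)               # look up, then average: shape (5, 3)
scores = averaged @ V_toy                            # shape (5, 6)
scores = scores - scores.max(axis=1, keepdims=True)  # stabilise each row
batch_probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(X_batch.shape, batch_probs.shape)
```

The only change with respect to the single-example case is the extra leading axis: the average is taken over axis 1 and the softmax is normalised row by row.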

**Homework:** Learn to do this!

In [35]:
sess.close()

## Using Keras

We can also use Keras to create the model. First we import what we need.

In [11]:
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda

Using TensorFlow backend.


In [12]:
cbow_K= Sequential()
cbow_K.add(Embedding(input_dim=8238,output_dim=128,input_length=4))
cbow_K.add(Lambda(lambda x:K.mean(x,axis=1),output_shape=(128,)))
cbow_K.add(Dense(8238,activation='softmax'))

In [13]:
cbow_K.compile(loss='categorical_crossentropy',optimizer='adam')

In [20]:
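for epoch in range(10):
    loss=0  # running sum of the per-example losses
    for i in range(len(X_data)):
        # build a batch of size 1: the one-hot target and its four context indices
        one_hot=np.zeros(8238)
        one_hot[y_data[i]]=1
        one_hot=np.array([one_hot])
        XXX=np.array([X_data[i]])
        loss+=cbow_K.train_on_batch(XXX,one_hot)
        if i%1000==0:
            print("%d iteration"%i,"loss: %.3f"%loss)
        if i>5000:
            break
    break  # stop after the first partial pass to keep the run short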

0 iteration loss: 5.340
1000 iteration loss: 5010.296
2000 iteration loss: 10750.323
3000 iteration loss: 16300.109
4000 iteration loss: 22115.803
5000 iteration loss: 27895.694
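
Note that the printed value is a running sum of the per-example losses, so it grows with the number of steps even if the model is improving. Dividing by the number of examples seen gives the mean loss, which is easier to read. The sketch below does this for the last printed line and compares it with the loss of a uniform guess over the 8238 tokens; the variable names are ours.

```python
import math

cumulative_loss = 27895.694    # last value printed above
steps_so_far = 5001            # examples 0..5000 have been summed
mean_loss = cumulative_loss / steps_so_far
uniform_loss = math.log(8238)  # cross-entropy of guessing every token with equal probability
print(round(mean_loss, 3), round(uniform_loss, 3))
```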


In [26]:
vec=cbow_K.predict(np.array([Ex_indexes]))

In [27]:
# indices of the ten most probable tokens, most probable first
for idx in vec[0].argsort()[-10:][::-1]:
    print(tokens[idx])

the
.
,
of
is
a
to
in
and
The
