As a first example of deep NLP, we create a simple (and naive) text classifier.

In [1]:
import numpy as np
import tensorflow as tf

# The data
We use the [Movie Review data from Rotten Tomatoes](http://www.cs.cornell.edu/people/pabo/movie-review-data/), which consists of:

- 5331 Positive reviews.
- 5331 Negative reviews.

## Cleaning the data

Before cleaning the data let's take a quick look at it.

In [2]:
from data_helpers import load_data_and_labels

In [3]:
X_sentences,y=load_data_and_labels('./data/rt/rt-polarity.pos','./data/rt/rt-polarity.neg')

In [4]:
X_sentences[0],y[0]

("the rock is destined to be the 21st century 's new conan and that he 's going to make a splash even greater than arnold schwarzenegger , jean claud van damme or steven segal",
 array([0, 1]))

In [5]:
X_sentences[6338],y[6338]

('in comparison to his earlier films it seems a disappointingly thin slice of lower class london life despite the title amounts to surprisingly little',
 array([1, 0]))

Note that the sentences have different lengths (measured here in characters).

In [6]:
list(map(len,X_sentences))[:20]

[173,
 216,
 30,
 87,
 111,
 133,
 59,
 106,
 123,
 73,
 68,
 83,
 89,
 67,
 59,
 121,
 40,
 38,
 98,
 55]
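The numbers above are character counts, because `len` on a string counts characters. The padding step that follows works on tokens instead, so it can help to compare both measures. The snippet below is a minimal sketch on two toy sentences (hypothetical stand-ins for `X_sentences`), assuming simple whitespace tokenization.

```python
# Toy sentences standing in for X_sentences (hypothetical data).
sentences = [
    "the rock is destined to be a star",
    "surprisingly little",
]

# Characters per sentence, as in the cell above.
char_lengths = [len(s) for s in sentences]

# Whitespace-separated tokens per sentence.
token_lengths = [len(s.split()) for s in sentences]

print(char_lengths)
print(token_lengths)
```

Whichever measure is used, the lengths vary from sentence to sentence.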

To give every sentence the same length and build our vocabulary, we can use the `VocabularyProcessor` from tflearn.

In [7]:
from tflearn.data_utils import VocabularyProcessor

Let's create a constant to hold the maximum sentence length in tokens; shorter sentences are padded and longer ones are truncated.

In [8]:
sentence_len=60

In [9]:
vocab_proc = VocabularyProcessor(sentence_len)

In [10]:
X = np.array(list(vocab_proc.fit_transform(X_sentences)))

In [11]:
X[0]

array([ 1,  2,  3,  4,  5,  6,  1,  7,  8,  9, 10, 11, 12, 13, 14,  9, 15,
        5, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0])

Note how index 0 is used for padding. Let's look a little closer at the `transform` method: it returns a generator that yields the array of indices for each sentence.

In [12]:
next(vocab_proc.transform(['the reviews are']))

array([   1, 8820,  415,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0])
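To make the indexing and padding concrete, here is a simplified stand-in for what a vocabulary processor does. It is a sketch, not tflearn's implementation: the helper names `build_vocab` and `encode` are hypothetical, tokenization is plain whitespace splitting, and unknown words are mapped to 0 as an assumption of this sketch.

```python
def build_vocab(sentences):
    """Map each distinct whitespace token to an id, starting at 1 (0 is reserved)."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(sentence, vocab, max_len):
    """Convert a sentence to a fixed-length list of ids, truncating or padding with 0."""
    ids = [vocab.get(token, 0) for token in sentence.split()][:max_len]
    return ids + [0] * (max_len - len(ids))


toy_sentences = ["the rock is destined", "the reviews are mixed"]
vocab = build_vocab(toy_sentences)
print(encode("the reviews are", vocab, max_len=6))  # [1, 5, 6, 0, 0, 0]
```

As in the output above, words seen earlier get smaller ids, and the tail of the array is filled with zeros.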

We also look at the vocabulary size.

In [13]:
vocab_size = len(vocab_proc.vocabulary_)
#vocab_dict = vocab_proc.vocabulary_._mapping

# The first model

For our first model we don't use a pre-trained embedding. Instead, we let the model learn the embedding by itself. 

After the embedding layer, we have a hidden layer, and we end with a classification layer. We start by defining some hyperparameters.

In [14]:
#Global hyper-parameters
emb_dim=100
hidden_dim=50
num_classes=2

We create the placeholders to hold the data.

In [15]:
input_x = tf.placeholder(tf.int32, shape=[None, sentence_len], name="input_x")
input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
#dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")

Next, we create the variables we are going to need.

In [16]:
with tf.name_scope("embedding"):
    W = tf.Variable(tf.random_uniform([vocab_size, emb_dim], -1.0, 1.0),name="W")
    embedded_chars = tf.nn.embedding_lookup(W, input_x)
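Conceptually, the embedding lookup selects one row of the matrix per word id. The NumPy sketch below illustrates this with hypothetical toy sizes (`toy_vocab_size`, `toy_emb_dim`); it mirrors the shapes only and is not the TensorFlow implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_emb_dim = 5, 3  # hypothetical small sizes
W_toy = rng.uniform(-1.0, 1.0, size=(toy_vocab_size, toy_emb_dim))

# Two "sentences" of length 4, already converted to ids (0 is padding).
ids = np.array([[1, 2, 3, 0],
                [4, 1, 0, 0]])

# Integer-array indexing picks one row of W_toy per id.
embedded = W_toy[ids]
print(embedded.shape)  # (2, 4, 3)
```

The result has shape [batch, sentence length, embedding dimension], which is the shape of `embedded_chars` in the graph above.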

We can concatenate all of these vectors into one large vector; for this we use the reshape method.

In [17]:
with tf.name_scope("reshape"):
    emb_vec= tf.reshape(embedded_chars,shape=[-1,sentence_len*emb_dim])
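The reshape lays the word vectors of each sentence end to end. A small NumPy sketch with hypothetical toy sizes (2 sentences, 3 words, embedding size 2) shows the effect; NumPy and TensorFlow both reshape in row-major order.

```python
import numpy as np

# A toy batch: 2 sentences, 3 words each, embedding size 2 (hypothetical sizes).
embedded = np.arange(12, dtype=np.float32).reshape(2, 3, 2)

# -1 lets NumPy infer the batch dimension from the remaining size.
flat = embedded.reshape(-1, 3 * 2)
print(flat.shape)  # (2, 6)
print(flat[0])     # [0. 1. 2. 3. 4. 5.]
```

Each row of `flat` is one sentence: the vector of the first word, followed by the vector of the second word, and so on.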

We can now move on to the hidden layer, but first we need variables for its weights and bias.

In [18]:
with tf.name_scope("hidden"):
    W_h= tf.Variable(tf.random_uniform([sentence_len*emb_dim, hidden_dim], -1.0, 1.0),name="w_hidden")
    b_h= tf.Variable(tf.zeros([hidden_dim]),name="b_hidden")
    hidden_output= tf.nn.relu(tf.matmul(emb_vec,W_h)+b_h)

Finally, the output layer:

In [19]:
with tf.name_scope("output_layer"):
    W_o= tf.Variable(tf.random_uniform([hidden_dim,num_classes], -1.0, 1.0),name="w_o")
    b_o= tf.Variable(tf.zeros([num_classes]),name="b_o")
    # Note: the ReLU clips negative scores to 0; logits are usually left linear.
    score = tf.nn.relu(tf.matmul(hidden_output,W_o)+b_o)
    predictions = tf.argmax(score, 1, name="predictions")

Note that we did not apply a softmax here: `tf.nn.softmax_cross_entropy_with_logits` applies it internally when computing the loss.

In [20]:
with tf.name_scope("loss"):
    losses=tf.nn.softmax_cross_entropy_with_logits(labels=input_y, logits=score)
    loss=tf.reduce_mean(losses)
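For reference, the quantity computed per example is the cross-entropy between the one-hot label and the softmax of the scores. The NumPy function below is a minimal sketch of that formula (the name `softmax_cross_entropy` and the toy inputs are hypothetical); it subtracts the row maximum before exponentiating for numerical stability.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between one-hot labels and softmax(logits)."""
    # Subtracting the row maximum does not change the softmax but avoids overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)


logits = np.array([[2.0, 0.0], [0.0, 0.0]])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
losses = softmax_cross_entropy(logits, labels)
print(losses.mean())
```

When both scores are equal, as in the second toy example, the loss is log 2, which is about 0.69 for two classes.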

In [21]:
with tf.name_scope("accuracy"):
    correct_predictions = tf.equal(predictions, tf.argmax(input_y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
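The accuracy is the fraction of examples whose highest-scoring class matches the position of the 1 in the one-hot label. A NumPy sketch of the same computation on hypothetical toy values:

```python
import numpy as np

scores = np.array([[0.2, 1.3], [0.9, 0.1], [0.4, 0.6]])
labels = np.array([[0, 1], [1, 0], [1, 0]])

predicted = scores.argmax(axis=1)  # index of the highest score per example
actual = labels.argmax(axis=1)     # index of the 1 in each one-hot label

# Cast the boolean matches to floats and average them.
accuracy = (predicted == actual).astype(np.float32).mean()
print(accuracy)
```

Here two of the three predictions match, so the accuracy is 2/3.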

We are almost ready to start the session; we only need the training operation and the summaries for TensorBoard.

In [22]:
global_step = tf.Variable(0, name="global_step", trainable=False)
optimizer=tf.train.AdamOptimizer(1e-4).minimize(loss, global_step=global_step)
loss_summary = tf.summary.scalar("loss", loss)
acc_summary = tf.summary.scalar("accuracy", accuracy)
summary_op=tf.summary.merge([loss_summary,acc_summary])

Running the session

In [23]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
train_summary_writer = tf.summary.FileWriter('./summaries/', sess.graph)

In [24]:
for i in range(100):
    # One run per step: train and fetch the metrics and summaries together
    acc,loss_,summaries,_=sess.run([accuracy,loss,summary_op,optimizer],feed_dict={input_x:X,input_y:y})
    train_summary_writer.add_summary(summaries, i)
    print("This is step: %d, acc=%.2f, loss=%.2f"%(i,acc,loss_),end='\r')

This is step: 99, acc=0.51, loss=0.91

After starting TensorBoard with

tensorboard --logdir="./summaries"

we can navigate to http://127.0.1.1:6006/ to see what we get.