# Text Analytics Seminar - Hands-on Session on Active Learning
In the second part of this session, we are going to look at some non-deterministic classifiers, namely deep neural networks. Because their weights are initialized randomly at the beginning of the training phase, they produce different results for different random seeds, even when the hyperparameters are the same. For active learning, we will again first compute the upper bound and then take a closer look at some things to watch out for when working with deep neural networks. This session requires Keras together with a backend such as Theano (the run shown here uses TensorFlow); these are frameworks for training deep neural networks, and there are plenty more.

### General Setup
Again, we need some imports and have to set a few configuration values:

In [1]:
import argparse
import numpy as np

from keras import metrics
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import Callback

from sklearn.metrics import accuracy_score

import data_processing as dp

config = {
    'embedding':'embedding/glove.6B.50d.subset.oov.vec',
    'train':'data/train.tsv',
    'dev':'data/dev.tsv',
    'test':'data/test.tsv',
    'epochs':10, # Number of epochs to train for
    'batch_size':5, # Our batch size for one backward pass
    'random_seed':123456789, # Our random seed for the weight initialization
    'optimizer':'adagrad', # The optimizer we want to use. Basically we can use everything from keras.
    'model':'results/mlp-full' # The path to store our model in
}

# Add the random seed to our model path
model_path = config['model'] + '-' + str(config['random_seed']) + '.model'
# For keras and theano, it is ok to fix the numpy random seed.
np.random.seed(config['random_seed'])
weights_path = model_path

Using TensorFlow backend.


We again load the same dataset for subjectivity classification:

In [2]:
###########################################
#       Loading data and vectors
###########################################

embedding,embed_dim = dp.load_word2vec_embedding(config['embedding'])
    
X_train, y_train = dp.load_data(config['train'], textindex=1, labelindex=0)
X_dev, y_dev = dp.load_data(config['dev'], textindex=1, labelindex=0)
X_test, y_test = dp.load_data(config['test'], textindex=1, labelindex=0)

# Get index-word/label dicts for lookup:
vocab_dict = dp.get_index_dict(X_train + X_dev + X_test)
label_dict = {'subjective':0, 'objective':1}

# Reverse lookups: word -> index and index -> label
vocab_dict_flipped = dict((v,k) for k,v in vocab_dict.items())
label_dict_flipped = {0:'subjective', 1:'objective'}

# Get indexed data and labels
X_train_index = [[vocab_dict_flipped[word] for word in chunk] for chunk in X_train]
X_dev_index =  [[vocab_dict_flipped[word] for word in chunk] for chunk in X_dev]
X_test_index =  [[vocab_dict_flipped[word] for word in chunk] for chunk in X_test]

y_train_index = dp.get_binary_labels(label_dict, y_train)
y_dev_index = dp.get_binary_labels(label_dict, y_dev)

# Get embedding matrix:
embed_matrix = dp.get_embedding_matrix(embedding,vocab_dict)

# Represent each document by the average of its word vectors:
X_train_embedded = np.array([np.mean([embed_matrix[element] for element in example], axis=0) for example in X_train_index])
X_dev_embedded = np.array([np.mean([embed_matrix[element] for element in example], axis=0) for example in X_dev_index])
X_test_embedded = np.array([np.mean([embed_matrix[element] for element in example], axis=0) for example in X_test_index])

print("Loaded data.")

Loaded data.
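
To make the document representation concrete, here is a minimal sketch of the averaging step on toy data. The 3-dimensional vectors and the names `toy_embed_matrix` and `toy_docs` are made up for illustration; the embeddings loaded above are 50-dimensional.

```python
import numpy as np

# Hypothetical embedding matrix: one row per word index, 3 dimensions each.
toy_embed_matrix = np.array([
    [1.0, 0.0, 2.0],   # word index 0
    [3.0, 2.0, 0.0],   # word index 1
    [2.0, 4.0, 4.0],   # word index 2
])

# Two toy documents given as lists of word indices (different lengths).
toy_docs = [[0, 1], [0, 1, 2]]

# Average the word vectors of each document, as in the cell above.
toy_embedded = np.array([np.mean(toy_embed_matrix[doc], axis=0) for doc in toy_docs])

print(toy_embedded.shape)  # (2, 3)
print(toy_embedded)        # rows: [2. 1. 1.] and [2. 2. 2.]
```

Every document ends up as a vector of the same length, no matter how many words it contains. The price is that word order is lost.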


For neural networks, we train for several epochs (one epoch is one full pass through the training set) with decreasing learning rates. Training repeatedly on the same data with different learning rates helps the network to focus on different things in each epoch. After each epoch, we evaluate the current model on the development set and store its weights if it is the best-performing model so far. The following class implements this functionality as a callback that we can pass to Keras's fit() function.

In [3]:
# Callback for tracking dev accuracy during training and keeping the best weights
class AccScore(Callback):
    def on_train_begin(self, logs=None):
        self.best_acc = 0.0

    def on_epoch_end(self, epoch, logs=None):
        # Get predictions on the validation (dev) data
        predict = np.asarray(self.model.predict(self.validation_data[0], batch_size=config['batch_size']))
        # Map each output / one-hot vector to the label with the highest score
        pred = []
        true = []
        for doc_pred, doc_true in zip(predict, self.validation_data[1]):
            true.append(label_dict_flipped[doc_true.tolist().index(max(doc_true))])
            pred.append(label_dict_flipped[doc_pred.tolist().index(max(doc_pred))])
        self.accs = accuracy_score(true, pred)
        # Store the weights whenever the dev accuracy improves
        if self.accs > self.best_acc:
            self.best_acc = self.accs
            self.model.save_weights(weights_path)

accscore_met= AccScore()

Now, let us implement a simple multi-layer perceptron. Since our data is rather low-dimensional (one document is represented by the average of all 50-dimensional word vectors in it), we can keep the number of hidden units rather small.

In [4]:
model = Sequential()
# A simple dense layer with 128 hidden units. The activation function is ReLU.
model.add(Dense(128, activation='relu',input_shape=(embed_dim, ))) 
# Dropout acts as a regularizer to prevent overfitting.
model.add(Dropout(0.4)) 
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.4))
# The final layer does the prediction.
# Sigmoid is a common activation function for binary classification.
model.add(Dense(2, activation='sigmoid')) 
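
To see why this network counts as small, the number of trainable parameters can be worked out by hand. The sketch below assumes that `embed_dim` is 50, as the embedding file name suggests; the helper `dense_params` is introduced here for illustration only.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Assumed input size (50), followed by the sizes of the three Dense layers.
layer_sizes = [50, 128, 64, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)     # [6528, 8256, 130]
print(total_params)  # 14914
```

Dropout layers add no parameters of their own. Under the stated assumption, `model.summary()` should report the same total.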

Finally, we can train our network on the data:

In [5]:
# We first have to compile the model
model.compile(config['optimizer'], 'binary_crossentropy',metrics=[metrics.categorical_accuracy])
# Now we can train it:
model.fit(X_train_embedded, y_train_index, epochs=config['epochs'], batch_size=config['batch_size'], validation_data=(X_dev_embedded, y_dev_index), verbose=1, callbacks=[accscore_met])

Train on 5000 samples, validate on 1000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f01e45d9f98>

After training, we load the weights that performed best on the dev set and compute the performance on the test set.

In [6]:
model.load_weights(weights_path)
result = model.predict(X_test_embedded)

# Map each output vector to the label with the highest score
pred = []
for i in range(len(result)):
    pred.append(label_dict_flipped[result[i].tolist().index(max(result[i]))])

print("Test accuracy: ",accuracy_score(y_test, pred))
print("Done")

Test accuracy:  0.8905
Done


Great! So we have trained a simple neural network on our data. It even performs a bit better than the linear SVM using the same features. Now let's try out some different random seeds. What do you notice? Don't forget to write down the test scores together with the random seed for a comparison.
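
One way to organize this comparison is to collect the test accuracy per seed and report the mean and standard deviation. The helper `summarize_seeds` is a hypothetical name, and the accuracies below are placeholders, not measured results; replace them with the scores you observe.

```python
import numpy as np

def summarize_seeds(scores):
    """Return (mean, standard deviation, best seed) for a {seed: accuracy} dict."""
    values = np.array(list(scores.values()), dtype=float)
    best_seed = max(scores, key=scores.get)
    return values.mean(), values.std(), best_seed

# Placeholder values for illustration only; fill in your own runs.
example_scores = {1: 0.80, 2: 0.90, 3: 0.85}

mean_acc, std_acc, best_seed = summarize_seeds(example_scores)
print("mean={:.3f} std={:.3f} best seed={}".format(mean_acc, std_acc, best_seed))
```

Reporting the spread across seeds gives a fairer picture than a single run, since one lucky or unlucky initialization can shift the score.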

### Deep Active Learning
For active learning, we implement something similar to what we did for the support vector machine before. Keep in mind that we now have to set two different random seeds: a Python random seed for the random sampling and a NumPy random seed for the deep neural network.
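
The two generators are independent of each other, which the following sketch illustrates with the standard library `random` module and NumPy. The seed values are the ones used in this session; everything else is illustrative.

```python
import random
import numpy as np

random.seed(42)            # controls Python's own generator (random sampling)
np.random.seed(123456789)  # controls NumPy's global generator

first_py = random.random()
first_np = np.random.rand()

# Re-seeding only NumPy leaves the Python generator's stream untouched.
np.random.seed(123456789)
second_np = np.random.rand()
second_py = random.random()

print(first_np == second_np)  # True: the NumPy stream restarted
print(first_py == second_py)  # False: the Python stream simply continued
```

To reproduce an active learning run, both seeds therefore have to be recorded.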

In [None]:
from active_learning import Active_Learning

active_learning_config = {
    'embedding':'embedding/glove.6B.50d.subset.oov.vec',
    'train_labeled':'data/train_labeled.tsv',
    'train_unlabeled':'data/train_unlabeled.tsv',
    'dev':'data/dev.tsv',
    'test':'data/test.tsv',
    'sampling':'random', # Sampling strategy, currently implemented ['random', 'confidence']
    'c':2, # C for our SVM (left over from the SVM session; not used by the network below)
    'random_sampling_seed':42, # Random seed for the pseudo-random number generator during random sampling
    'maximum_iterations':500, # Maximum number of active learning iterations
    'active_learning_history':'results/mlp-al-random.result', # File to store the results in
    'epochs':10, # Number of epochs to train for
    'batch_size':5, # Our batch size for one backward pass
    'neural_network_random_seed':123456789, # Our random seed for the weight initialization
    'optimizer':'adagrad', # The optimizer we want to use. Basically we can use everything from keras.
    'model':'results/mlp-full' # The path to store our model in
}

# Add the random seed to our model path
model_path = active_learning_config['model'] + '-' + str(active_learning_config['neural_network_random_seed']) + '.model'
# For keras and theano, it is ok to fix the numpy random seed.
np.random.seed(active_learning_config['neural_network_random_seed'])
weights_path = model_path

###########################################
#       Loading data and vectors
###########################################

embedding,embed_dim = dp.load_word2vec_embedding(active_learning_config['embedding'])

X_train, y_train = dp.load_data(active_learning_config['train_labeled'], textindex=1, labelindex=0)
X_dev, y_dev = dp.load_data(active_learning_config['dev'], textindex=1, labelindex=0)
X_test, y_test = dp.load_data(active_learning_config['test'], textindex=1, labelindex=0)

# Active learning data
X_active, y_active = dp.load_data(active_learning_config['train_unlabeled'], textindex=1, labelindex=0)

# Get index-word/label dicts for lookup:
# NOTE: Creating a dictionary out of all data has the implicit assumption 
#       that all the words we encounter during sampling and testing we have already seen during training.
# The dev set is included as well, since its words are looked up below.
vocab_dict = dp.get_index_dict(X_train + X_dev + X_test + X_active)
label_dict = {'subjective':0, 'objective':1}

# Reverse lookups: word -> index and index -> label
vocab_dict_flipped = dict((v,k) for k,v in vocab_dict.items())
label_dict_flipped = {0:'subjective', 1:'objective'}

# Get indexed data and labels
X_train_index = [[vocab_dict_flipped[word] for word in chunk] for chunk in X_train]
X_dev_index = [[vocab_dict_flipped[word] for word in chunk] for chunk in X_dev]
X_test_index =  [[vocab_dict_flipped[word] for word in chunk] for chunk in X_test]

# Active learning data
X_active_index =  [[vocab_dict_flipped[word] for word in chunk] for chunk in X_active]

print ("Number of initial training documents: ",len(X_train))

# Get embedding matrix:
embed_matrix = dp.get_embedding_matrix(embedding,vocab_dict)

# Represent each document by the average of its word vectors:
X_train_embedded = np.array([np.mean([embed_matrix[element] for element in example], axis=0) for example in X_train_index])
X_dev_embedded = np.array([np.mean([embed_matrix[element] for element in example], axis=0) for example in X_dev_index])
X_test_embedded = np.array([np.mean([embed_matrix[element] for element in example], axis=0) for example in X_test_index])

# Active learning
X_active_embedded = np.array([np.mean([embed_matrix[element] for element in example], axis=0) for example in X_active_index])

y_train_index = dp.get_binary_labels(label_dict, y_train)
y_dev_index = dp.get_binary_labels(label_dict, y_dev)
y_active_index = dp.get_binary_labels(label_dict, y_active)

# Define our pools for active learning
pool_data = X_active_embedded[:]
pool_labels = y_active_index[:]

print("Loaded data.")

# Callback for tracking dev accuracy during training and keeping the best weights
class AccScore(Callback):
    def on_train_begin(self, logs=None):
        self.best_acc = 0.0

    def on_epoch_end(self, epoch, logs=None):
        # Get predictions on the validation (dev) data
        predict = np.asarray(self.model.predict(self.validation_data[0], batch_size=active_learning_config['batch_size']))
        # Map each output / one-hot vector to the label with the highest score
        pred = []
        true = []
        for doc_pred, doc_true in zip(predict, self.validation_data[1]):
            true.append(label_dict_flipped[doc_true.tolist().index(max(doc_true))])
            pred.append(label_dict_flipped[doc_pred.tolist().index(max(doc_pred))])
        self.accs = accuracy_score(true, pred)
        # Store the weights whenever the dev accuracy improves
        if self.accs > self.best_acc:
            self.best_acc = self.accs
            self.model.save_weights(weights_path)

accscore_met= AccScore()

###########################################
#       Implement model
###########################################

model = Sequential()
# A simple dense layer with 128 hidden units. The activation function is ReLU.
model.add(Dense(128, activation='relu',input_shape=(embed_dim, ))) 
# Dropout acts as a regularizer to prevent overfitting.
model.add(Dropout(0.4)) 
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.4))
# The final layer does the prediction.
# Sigmoid is a common activation function for binary classification.
model.add(Dense(2, activation='sigmoid')) 

###########################################
#       Compile the model
###########################################
# We first have to compile the model
model.compile(active_learning_config['optimizer'], 'binary_crossentropy',metrics=[metrics.categorical_accuracy])

###########################################
#       Start active learning
###########################################
# Active learning results for visualization
step, acc = [],[]

iteration = 0

outlog = open(active_learning_config['active_learning_history'],'w')
outlog.write('Iteration\tAccuracy\n')

while len(pool_data) > 1 and iteration < active_learning_config['maximum_iterations']:
    if len(X_train_embedded) % 50 == 0:
        print("Training on: ", len(X_train_embedded), " instances.")

    model.fit(X_train_embedded, y_train_index, epochs=active_learning_config['epochs'], batch_size=active_learning_config['batch_size'], validation_data=(X_dev_embedded, y_dev_index), verbose=0, callbacks=[accscore_met])

    # Load best weights and compute test performance
    model.load_weights(weights_path)
    result = model.predict(X_test_embedded)
    pred = []
    for i in range(len(result)):
        pred.append(label_dict_flipped[result[i].tolist().index(max(result[i]))])
    test_acc = accuracy_score(y_test,pred)
    outlog.write('{}\t{}\n'.format(iteration,test_acc))
    step.append(iteration); acc.append(test_acc)

    # Add data from the pool to the training set based on our active learning:
    al = Active_Learning(pool_data, model, active_learning_config['random_sampling_seed'])
    if active_learning_config['sampling'] == 'random':
        add_sample_data = al.get_random()
    else:
        add_sample_data = al.get_most_uncertain(active_learning_config['sampling'])

    # Get the data index from pool
    sample_index = dp.get_array_index(pool_data, add_sample_data)
        
    # Get the according label
    add_sample_label = pool_labels[sample_index]

    # Add it to the training pool
    X_train_embedded = np.vstack((X_train_embedded, add_sample_data))
    y_train_index = np.vstack((y_train_index, add_sample_label))

    # Remove labeled data from pool
    # (np.delete returns a new array, so the result has to be assigned back)
    pool_labels = np.delete(pool_labels, sample_index, axis=0)
    pool_data = np.delete(pool_data, sample_index, axis=0)

    iteration += 1

outlog.close()

print("Done with active learning")

Number of initial training documents:  2
Loaded data.
Training on:  50  instances.
Training on:  100  instances.
Training on:  150  instances.
Training on:  200  instances.
Training on:  250  instances.
Training on:  300  instances.
Training on:  350  instances.


Notice how the training time increases considerably as the training set grows? Again, let's plot the accuracy over the active learning iterations.

In [None]:
import visualize as vz

vz.plot(step, acc)
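
Besides 'random', the configuration lists 'confidence' as a sampling strategy. The `Active_Learning` class is not shown in this notebook, so the following is only a sketch of one common variant, least-confidence sampling, and not necessarily what `get_most_uncertain` does. The function name `least_confident_index` and the toy scores are made up for illustration.

```python
import numpy as np

def least_confident_index(probabilities):
    """Index of the pool instance whose top class score is lowest."""
    probabilities = np.asarray(probabilities, dtype=float)
    confidence = probabilities.max(axis=1)  # score of the predicted class
    return int(np.argmin(confidence))

# Toy model outputs for a pool of three instances (two classes each).
toy_probs = [[0.95, 0.05],
             [0.55, 0.45],
             [0.20, 0.80]]

idx = least_confident_index(toy_probs)
print(idx)  # 1
```

In a loop like the one above, the scores would come from `model.predict(pool_data)`, and the selected row would be moved from the pool to the training set.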

That is it! The code is also provided in proper classes in the code folder. Feel free to experiment with it or modify it for other purposes.