# Bag-of-words classifier with pretrained word embeddings

- If we have a trained word embeddings model, we can transfer that knowledge into a new task and model
- Initialize the weights in the classifier with pretrained word embeddings
- The code below loads pretrained vectors from `data/wiki-news-300d-1M.vec`; embeddings trained with word2vec on news data (GoogleNews-vectors-negative300.bin from https://code.google.com/archive/p/word2vec/) can be used in the same way

### Read data

In [1]:
import json
import random
with open("data/imdb_train.json") as f:
    data=json.load(f)
random.shuffle(data) 
print(data[0])

# Gather the texts and the labels into lists
texts=[one_example["text"] for one_example in data]
labels=[one_example["class"] for one_example in data]
print(texts[:2])
print(labels[:2])

{'class': 'pos', 'text': 'Her Deadly Rival (1995): Starring Harry Hamlin, Annie Potts, Lisa Zane, Tommy Hinkley, Susan Diol, Roma Maffia, Robert C. Treveiler, D. L. Anderson, William Blair, Sean Bridges, Robin Dallenbach, Wilbur Fitzgerald, Dale Frye, Stan Kelly, Deborah Hobart, David Lenthall, Lorri Lindberg, Chuck Kinlaw, Amy Parrish, Melissa Suzanne McBride, Ralph Wilcox, Al Wiggins, Jeff Sumerel, Daria Sanford....Director James Hayman, Screenplay Dan Vining.  Actor Harry Hamlin (of LA Law fame, Clash of The Titans and other films) seems perfectly cast in this \\Lifetime\\" type film directed by James Hayman and released in 1995. He and his wife Lisa Rinna would later work on a film about sex addiction. \\"Her Deadly Rival\\" is, at first glance, similar to the better known Hollywood box-office hit \\"Fatal Attraction\\". In \\"Rival\\", happily married couple Jim and Kris Lanford move into a new home in the typically beautiful suburbs. They have the seemingly perfect marriage- they

### Vectorizing the data

- When we use an embedding layer (keras.layers.Embedding) the input data must be a sequence, not a bag-of-words vector
- You can use CountVectorizer only as an analyzer without building the feature matrix
- We will then build the vectorizer part later ourselves
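To make the difference concrete, here is a minimal sketch on a toy sentence. The tokens and the `toy_vocab` mapping are made up for illustration and are not part of the IMDB data. A bag-of-words vector stores one count per vocabulary entry, while a sequence stores one vocabulary index per token in the original order.

```python
from collections import Counter

# Hypothetical toy vocabulary; index 0 is reserved, as in the vocabulary built later in this notebook
toy_vocab = {"<SPECIAL>": 0, "good": 1, "not": 2, "film": 3}
tokens = ["not", "good", "film", "not"]

# Bag-of-words: one count per vocabulary entry, word order is lost
counts = Counter(tokens)
bow_vector = [counts.get(word, 0) for word in toy_vocab]

# Sequence: one vocabulary index per token, word order is kept
sequence = [toy_vocab[token] for token in tokens]

print("Bag-of-words vector:", bow_vector)
print("Sequence of indices:", sequence)
```

The bag-of-words vector always has the length of the vocabulary, whereas the sequence has the length of the text, which is why sequences need padding later on.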

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy
analyzer=CountVectorizer(lowercase=False).build_analyzer() # includes tokenizer and preprocessing
print(analyzer(texts[0]))



['Her', 'Deadly', 'Rival', '1995', 'Starring', 'Harry', 'Hamlin', 'Annie', 'Potts', 'Lisa', 'Zane', 'Tommy', 'Hinkley', 'Susan', 'Diol', 'Roma', 'Maffia', 'Robert', 'Treveiler', 'Anderson', 'William', 'Blair', 'Sean', 'Bridges', 'Robin', 'Dallenbach', 'Wilbur', 'Fitzgerald', 'Dale', 'Frye', 'Stan', 'Kelly', 'Deborah', 'Hobart', 'David', 'Lenthall', 'Lorri', 'Lindberg', 'Chuck', 'Kinlaw', 'Amy', 'Parrish', 'Melissa', 'Suzanne', 'McBride', 'Ralph', 'Wilcox', 'Al', 'Wiggins', 'Jeff', 'Sumerel', 'Daria', 'Sanford', 'Director', 'James', 'Hayman', 'Screenplay', 'Dan', 'Vining', 'Actor', 'Harry', 'Hamlin', 'of', 'LA', 'Law', 'fame', 'Clash', 'of', 'The', 'Titans', 'and', 'other', 'films', 'seems', 'perfectly', 'cast', 'in', 'this', 'Lifetime', 'type', 'film', 'directed', 'by', 'James', 'Hayman', 'and', 'released', 'in', '1995', 'He', 'and', 'his', 'wife', 'Lisa', 'Rinna', 'would', 'later', 'work', 'on', 'film', 'about', 'sex', 'addiction', 'Her', 'Deadly', 'Rival', 'is', 'at', 'first', 'glanc

### Use gensim to read the embedding model

In [3]:
from gensim.models import KeyedVectors

vector_model=KeyedVectors.load_word2vec_format("data/wiki-news-300d-1M.vec", binary=False, limit=1000000)

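# sort based on the index to make sure they are in the correct order
# (this is the gensim < 4.0 API; gensim 4+ exposes the same ordered list as vector_model.index_to_key)
words=[k for k,v in sorted(vector_model.vocab.items(), key=lambda x:x[1].index)]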
print("Words from embedding model:",len(words))
print("First 50 words:",words[:50])

Words from embedding model: 1000000
First 50 words: [',', 'the', '.', 'and', 'of', 'to', 'in', 'a', '"', ':', ')', 'that', '(', 'is', 'for', 'on', '*', 'with', 'as', 'it', 'The', 'or', 'was', "'", "'s", 'by', 'from', 'at', 'I', 'this', 'you', '/', 'are', '=', 'not', '-', 'have', '?', 'be', 'which', ';', 'all', 'his', 'has', 'one', 'their', 'about', 'but', 'an', '|']


### Normalize the vectors

- It is easier to learn on top of these vectors when their magnitude does not vary much

In [4]:
print("Before normalization:",vector_model.get_vector("in")[:10])
vector_model.init_sims(replace=True)
print("After normalization:",vector_model.get_vector("in")[:10])

Before normalization: [-0.0234 -0.0268 -0.0838  0.0386 -0.0321  0.0628  0.0281 -0.0252  0.0269
 -0.0063]
After normalization: [-0.0163762  -0.01875564 -0.05864638  0.02701372 -0.02246478  0.04394979
  0.01966543 -0.0176359   0.01882563 -0.00440898]
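As a rough illustration of normalization, assuming it means scaling every vector to unit Euclidean length, the sketch below divides each row of a made-up matrix by its norm. This is not the gensim implementation, and `toy_vectors` is invented for the example.

```python
import numpy

# Two made-up 3-dimensional "embeddings" with very different magnitudes
toy_vectors = numpy.array([[3.0, 4.0, 0.0],
                           [0.1, 0.0, 0.0]])

# Euclidean (L2) norm of each row, kept as a column so the division broadcasts per row
norms = numpy.linalg.norm(toy_vectors, axis=1, keepdims=True)
unit_vectors = toy_vectors / norms

print("Norms before:", norms.ravel())
print("Norms after:", numpy.linalg.norm(unit_vectors, axis=1))
```

After the division both rows have length 1, so only the direction of each vector is left.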


### Expand the vocabulary using words from the embedding model

- The embedding model usually knows more words than the task-specific model, because it has seen a lot more data
- If you wish, you can use the embedding model vocabulary to expand the task-specific one

In [5]:
# init the vectorizer vocabulary using words from the embedding model
def init_vocabulary(vocab, text, text_analyzer):
    for word in text_analyzer(text): # use the analyzer passed in as an argument, not the global one
        vocab.setdefault(word, len(vocab))
    return vocab

words_from_model=" ".join(words[:50000]) # use 50K words from the embedding model to initialize the vocabulary --> expands the learned vocabulary
vocabulary={"<SPECIAL>": 0} # zero has a special meaning in sequence models, prevent using it for a normal word
vocabulary=init_vocabulary(vocabulary, words_from_model, analyzer)
print("Words from embedding model:",len(vocabulary))


Words from embedding model: 47944
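The `setdefault` idiom used for building the vocabulary can be sketched on a handful of made-up words (`toy_vocab` and the word list are illustrative only). A new word gets the next free index, and a word that is already known keeps the index it has.

```python
toy_vocab = {"<SPECIAL>": 0}  # index 0 stays reserved

for word in ["the", "film", "the", "plot"]:
    # setdefault inserts only when the word is missing,
    # so the second "the" does not get a new index
    toy_vocab.setdefault(word, len(toy_vocab))

print(toy_vocab)
```

Because duplicates are skipped, the vocabulary can end up smaller than the number of tokens fed in.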


### Vectorizer

- Build a dictionary to turn words into numbers; here we use the one we initialized with the embedding model
- Vectorizing a sequence: in our data each example is a list of words, and we need to turn each example into a list of numbers

In [6]:
def vectorizer(vocab, texts):
    vectorized_data=[] # turn text into numbers based on our vocabulary mapping
    for one_example in texts:
        vectorized_example=[]
        for word in analyzer(one_example):
            vocab.setdefault(word, len(vocab)) # add word to our vocabulary if it does not exist
            vectorized_example.append(vocab[word])
        vectorized_data.append(vectorized_example)
    
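    vectorized_data=numpy.array(vectorized_data, dtype=object) # examples have different lengths, so store them as an array of python lists (recent numpy versions require dtype=object for this)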
    return vectorized_data, vocab

vectorized_data, vocabulary=vectorizer(vocabulary, texts)

# now vectorized data is the same as feature_matrix, but in different format
print("Words in vocabulary:",len(vocabulary))
print("Vectorized data shape:",vectorized_data.shape)
print("First example vectorized:",vectorized_data[0])
inversed_vocabulary={value:key for key, value in vocabulary.items()} # inverse the dictionary
print("First example text:",[inversed_vocabulary[idx] for idx in vectorized_data[0]])
        

Words in vocabulary: 105571
Vectorized data shape: (25000,)
First example vectorized: [2153, 25519, 34132, 1537, 31813, 2955, 31591, 9649, 29157, 5760, 36714, 7137, 41480, 5357, 47944, 8077, 47945, 1416, 47946, 3558, 1268, 3629, 6299, 14666, 5868, 47947, 27319, 12643, 9201, 35157, 9965, 3626, 13102, 16718, 800, 47948, 47949, 47389, 8303, 47950, 6556, 34510, 11279, 14335, 20383, 7601, 24518, 3318, 20463, 4334, 47951, 47952, 15117, 2578, 943, 47953, 38277, 3502, 47954, 9492, 2955, 31591, 3, 6034, 1772, 4870, 27232, 3, 13, 13942, 2, 53, 1402, 724, 4468, 1789, 5, 19, 20417, 853, 272, 2255, 16, 943, 47953, 2, 752, 5, 1537, 140, 2, 27, 1238, 5760, 47955, 63, 205, 80, 9, 272, 31, 1746, 7479, 2153, 25519, 34132, 7, 18, 90, 13606, 700, 4, 1, 225, 413, 2763, 1246, 530, 625, 35418, 47956, 65, 34132, 9115, 1519, 1966, 2703, 2, 15839, 47957, 415, 70, 88, 241, 5, 1, 1994, 2615, 7806, 314, 23, 1, 4559, 2012, 1521, 54, 21, 5326, 5, 602, 951, 4005, 4819, 103, 69, 7561, 44858, 2146, 66, 1890, 9, 2703, 2

### Labels into onehot vectors

- Same as in the original BOW classifier

In [7]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_encoder=LabelEncoder() # turns class labels into integers
one_hot_encoder=OneHotEncoder(sparse=False) # turns class integers into one-hot encoding (scikit-learn >= 1.2 names this parameter sparse_output)
class_numbers=label_encoder.fit_transform(labels)
print("class_numbers shape=",class_numbers.shape)
print("class_numbers",class_numbers)
print("class labels",label_encoder.classes_)
# And now the one-hot encoding
classes_1hot=one_hot_encoder.fit_transform(class_numbers.reshape(-1,1))
print("classes_1hot",classes_1hot)


class_numbers shape= (25000,)
class_numbers [1 1 1 ... 1 0 1]
class labels ['neg' 'pos']
classes_1hot [[0. 1.]
 [0. 1.]
 [0. 1.]
 ...
 [0. 1.]
 [1. 0.]
 [0. 1.]]


## Network

- First we need to create an embedding matrix which we can then plug into the neural network
- The embedding matrix must follow the order from the vectorizer

In [8]:
def load_pretrained_embeddings(vocab, embedding_model):
    """ vocab: vocabulary from our data vectorizer, embedding_model: model loaded with gensim """
    pretrained_embeddings=numpy.random.uniform(low=-0.05, high=0.05, size=(len(vocab),embedding_model.vectors.shape[1])) # initialize new matrix (words x embedding dim)
    found=0
    for word,idx in vocab.items():
        if word in embedding_model.vocab: # gensim < 4.0 API; in gensim 4+ use embedding_model.key_to_index
            pretrained_embeddings[idx]=embedding_model.get_vector(word)
            found+=1
            
    print("Found pretrained vectors for {found} words.".format(found=found))
    return pretrained_embeddings

pretrained=load_pretrained_embeddings(vocabulary, vector_model)
print("Shape of pretrained embeddings:",pretrained.shape)
print("Vector for the word 'in':",pretrained[vocabulary["in"]][:10])


Found pretrained vectors for 93284 words.
Shape of pretrained embeddings: (105571, 300)
Vector for the word 'in': [-0.0163762  -0.01875564 -0.05864638  0.02701372 -0.02246478  0.04394979
  0.01966543 -0.0176359   0.01882563 -0.00440898]
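Why the row order matters can be shown with a small sketch using made-up two-dimensional vectors (`toy_vocab` and `toy_pretrained` are hypothetical). The embedding layer looks up row `idx` for word index `idx`, so each pretrained vector has to be written to the row given by the vectorizer vocabulary.

```python
import numpy

# Hypothetical vectorizer vocabulary and pretrained vectors
toy_vocab = {"<SPECIAL>": 0, "good": 1, "film": 2}
toy_pretrained = {"good": numpy.array([0.1, 0.2]),
                  "film": numpy.array([0.3, 0.4])}

# One row per vocabulary entry; rows without a pretrained vector stay zero in this sketch
matrix = numpy.zeros((len(toy_vocab), 2))
for word, idx in toy_vocab.items():
    if word in toy_pretrained:
        matrix[idx] = toy_pretrained[word]

# An embedding lookup is plain row indexing with the vectorized sequence
sequence = [2, 1, 0]  # "film", "good", padding
looked_up = matrix[sequence]
print(looked_up)
```

If the rows followed the embedding model's own order instead, index 2 would return the vector of some other word.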


### Sequential input

- Remember that the input data matrix had an undefined number of columns, because each example has a different length
- Now we must make it a fixed size (the same for each example)
- Padding: append zeros until each example reaches that size
- You will hear more about this next week!

In [9]:
from keras.preprocessing.sequence import pad_sequences
print("Old shape:", vectorized_data.shape)
vectorized_data_padded=pad_sequences(vectorized_data, padding='post')
print("New shape:", vectorized_data_padded.shape)
print("First example:", vectorized_data_padded[0])

Using TensorFlow backend.


Old shape: (25000,)
New shape: (25000, 2366)
First example: [ 2153 25519 34132 ...     0     0     0]
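The idea behind post-padding can be sketched in plain Python on toy sequences. This is an illustration only: `pad_post` is a hypothetical helper, not the Keras implementation.

```python
def pad_post(sequences, pad_value=0):
    """Append pad_value to every sequence until all have the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

toy_sequences = [[5, 8, 2], [7], [4, 9]]
padded = pad_post(toy_sequences)
print(padded)
```

Because index 0 is reserved for `<SPECIAL>` in the vocabulary, the padding zeros never collide with a real word.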


In [None]:
import tensorflow as tf
from keras.layers import Layer
from keras import backend as K

class MaskedAveragePooling(Layer):
    def __init__(self, **kwargs):
        self.supports_masking = True
        super(MaskedAveragePooling, self).__init__(**kwargs)

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        if mask is None:
            # no mask available: plain average over the time dimension
            return K.mean(x, axis=1)
        # mask (batch, time)
        mask = K.cast(mask, K.floatx())
        # mask (batch, x_dim, time)
        mask = K.repeat(mask, x.shape[-1])
        # mask (batch, time, x_dim)
        mask = tf.transpose(mask, perm=[0,2,1])
        x = x * mask
        # sum over time and divide by the number of unmasked time steps
        return K.sum(x, axis=1) / K.sum(mask, axis=1)

    def compute_output_shape(self, input_shape):
        # remove temporal dimension
        return input_shape[0], input_shape[2]
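
What the masked average computes can be checked with a small numpy sketch on made-up numbers. It mirrors the arithmetic of the layer above but is not the Keras code itself. Padded positions are zeroed out, and the sum is divided by the number of real tokens instead of the padded sequence length.

```python
import numpy

# Made-up batch: 2 examples, 3 time steps, 2 embedding dimensions
embedded = numpy.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
                        [[2.0, 2.0], [9.0, 9.0], [9.0, 9.0]]])
# 1 marks a real token, 0 marks padding
mask = numpy.array([[1.0, 1.0, 0.0],
                    [1.0, 0.0, 0.0]])

# Zero out the padded positions, then sum over the time dimension
masked_sum = (embedded * mask[:, :, None]).sum(axis=1)
# Number of real tokens per example, kept as a column for broadcasting
token_counts = mask.sum(axis=1, keepdims=True)
masked_average = masked_sum / token_counts

print(masked_average)
```

A plain mean over all three time steps would also include the padded positions, so short reviews would end up with different averages than they should.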

In [None]:
from keras.models import Model
from keras.layers import Input, Dense, Embedding, Activation
from keras import backend as K
from keras.layers.core import Lambda
from keras.optimizers import SGD, Adam


example_count,sequence_len=vectorized_data_padded.shape
example_count,class_count=classes_1hot.shape

vector_size=pretrained.shape[1] # embedding dim ("hidden layer") must be the same as in the pretrained model

inp=Input(shape=(sequence_len,))
embeddings=Embedding(len(vocabulary), vector_size, mask_zero=True, weights=[pretrained])(inp)
average_embeddings=MaskedAveragePooling()(embeddings)
#sums=Lambda(lambda s: K.sum(s, axis=1), output_shape=lambda s: (s[0],s[2]))(embeddings) # custom layer to sum all embeddings
tanh=Activation("tanh")(average_embeddings)
outp=Dense(class_count, activation="softmax")(tanh)
model=Model(inputs=[inp], outputs=[outp])

optimizer=Adam(lr=0.0001) # define the learning rate (newer Keras versions name this argument learning_rate)
model.compile(optimizer=optimizer,loss="categorical_crossentropy",metrics=['accuracy'])

print(model.summary())

hist=model.fit(vectorized_data_padded,classes_1hot,batch_size=100,verbose=1,epochs=100,validation_split=0.1)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 2366)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 2366, 300)         31671300  
_________________________________________________________________
masked_average_pooling_1 (Ma (None, 300)               0         
_________________________________________________________________
activation_1 (Activation)    (None, 300)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 602       
Total params: 31,671,902
Trainable params: 31,671,902
Non-trainable params: 0
_________________________________________________________________
None
Train on 22500 samples, validate on 2500 samples
Epoch 1/100
 3400/22500 [===>..........................] - ETA: 5:12 - loss:

In [None]:
import matplotlib.pyplot as plt
print("History:",hist.history["val_acc"])
print("Max accuracy:",numpy.max(hist.history["val_acc"]))
plt.ylim(0.85,1.0)
plt.plot(hist.history["val_acc"],label="Validation set accuracy")
plt.plot(hist.history["acc"],label="Training set accuracy")
plt.legend()
plt.show()