In this code demo, we will see how to use pre-trained word vectors to populate the embedding matrix and then use it to train a classifier.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
import pandas as pd
import os
BASE_DIR="/content/gdrive/MyDrive/RNN-LSTM"

In [3]:
train=pd.read_csv(os.path.join(BASE_DIR,'headlines.csv'))

In [4]:
train.head()

Unnamed: 0,ID,TITLE,CATEGORY
0,226435,Google+ rolls out 'Stories' for tricked out ph...,t
1,356684,Dov Charney's Redeeming Quality,b
2,246926,White God adds Un Certain Regard to the Palm Dog,e
3,318360,"Google shows off Androids for wearables, cars,...",t
4,277235,China May new bank loans at 870.8 bln yuan,b


In [5]:
## We will create a classifier using embedding layer architecture
X=train['TITLE']
y=train['CATEGORY']

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [7]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=200)

In [8]:
enc=LabelEncoder()

In [9]:
y_train=enc.fit_transform(y_train)

In [10]:
enc.classes_

array(['b', 'e', 'm', 't'], dtype=object)

In [11]:
y_train

array([2, 3, 3, ..., 3, 1, 2])
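As a rough illustration of what the encoder is doing here, the sketch below mimics it in plain Python: the distinct labels are sorted, and each label is replaced by its position in that sorted list. The sample labels are made up for the example; the notebook itself keeps using LabelEncoder.

```python
# Hypothetical sample labels, using the same four category codes as the dataset
labels = ['m', 't', 't', 'b', 'e', 'm']

# Sorted distinct labels play the role of enc.classes_
classes = sorted(set(labels))

# Map each label to its position in the sorted list
label_to_id = {label: i for i, label in enumerate(classes)}
encoded = [label_to_id[label] for label in labels]

print(classes)   # ['b', 'e', 'm', 't']
print(encoded)   # [2, 3, 3, 0, 1, 2]
```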

In [12]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [13]:
seq_len=16
max_words=10000

In [2]:
tokenizer=Tokenizer(num_words=max_words)
### Split the text into words and assign an integer id
tokenizer.fit_on_texts(X_train.tolist())
tokenizer.word_index

In [15]:
## Create a sequence for each entry in the title column
sequence=tokenizer.texts_to_sequences(X_train.tolist())

In [3]:
sequence
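To make the tokenizer step more concrete, here is a simplified sketch of the idea in plain Python. It assumes only lowercasing and whitespace splitting (the Keras Tokenizer also filters punctuation), and the mini-corpus and the small vocabulary limit are invented for the example. More frequent words get smaller ids, ids start at 1, and words whose id is not below the limit are dropped when the text is converted to a sequence.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the headlines
texts = ["google shows off android", "google rolls out stories", "china bank loans"]
max_words_demo = 5  # hypothetical small vocabulary limit

# Count words; this sketch only lowercases and splits on whitespace
counts = Counter(word for text in texts for word in text.lower().split())

# More frequent words get smaller ids; ids start at 1 because 0 is kept for padding
word_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

# Keep only ids below the vocabulary limit when converting text to sequences
sequences = [
    [word_index[w] for w in text.lower().split() if word_index[w] < max_words_demo]
    for text in texts
]

print(word_index["google"])  # 1
print(sequences)             # [[1, 2, 3, 4], [1], []]
```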

In [17]:
## Pad the sequences
train_features=pad_sequences(sequence,maxlen=seq_len)

In [18]:
train_features

array([[   0,    0,    0, ...,  142, 1562, 8052],
       [   0,    0,    0, ...,    4, 1671,  525],
       [   0,    0,    0, ..., 5370,    6,   47],
       ...,
       [   0,    0,    0, ..., 4732, 1042,  359],
       [   0,    0,    0, ...,   46,   41,   80],
       [   0,    0,    0, ..., 2953, 6426, 2189]], dtype=int32)

In [19]:
train_features.shape

(168967, 16)
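The leading zeros in the output above come from the default behaviour of pad_sequences, which pads and truncates at the front of each sequence. A minimal plain-Python sketch of that behaviour follows; pad_pre is a hypothetical helper written only for this illustration, and the ids are sample values.

```python
def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen ids if the sequence is too long."""
    trimmed = seq[-maxlen:]                          # drop ids from the front if too long
    return [0] * (maxlen - len(trimmed)) + trimmed   # add zeros at the front if too short

print(pad_pre([142, 1562, 8052], 6))             # [0, 0, 0, 142, 1562, 8052]
print(pad_pre([5, 6, 7, 8, 9, 10, 11, 12], 6))   # [7, 8, 9, 10, 11, 12]
```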

In [20]:
## Create test features
sequence=tokenizer.texts_to_sequences(X_test.tolist())

In [4]:
sequence

In [22]:
test_features=pad_sequences(sequence,maxlen=seq_len)

In [23]:
test_features

array([[   0,    0,    0, ...,  113,    2,   31],
       [   0,    0,    0, ...,    4, 4018, 3115],
       [   0,    0,    0, ...,  375, 5948, 4400],
       ...,
       [   0,    0,    0, ...,   11,  157, 1648],
       [   0,    0,    0, ...,   97,   76,    7],
       [   0,    0,    0, ...,  310, 3979, 5986]], dtype=int32)

In [24]:
test_features.shape

(42242, 16)

In [26]:
## Convert y_train to one-hot encoded vectors (y_test stays as labels for the final comparison)
from tensorflow.keras.utils import to_categorical

In [27]:
y_train=to_categorical(y_train)

In [28]:
y_train

array([[0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       ...,
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.]], dtype=float32)
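For intuition, the same one-hot encoding can be sketched with NumPy alone. This is only an illustration of what to_categorical produces, using a few made-up integer labels: row i of an identity matrix is the one-hot vector for class i.

```python
import numpy as np

encoded = np.array([2, 3, 3, 1])   # hypothetical integer labels
num_classes = 4                    # b, e, m, t

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype='float32')[encoded]
print(one_hot)
```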

In [30]:
import numpy as np
import time

Next, I will initialize the weights of the embedding matrix using pre-trained word embeddings. These embeddings can be downloaded from:

http://nlp.stanford.edu/data/glove.6B.zip

This link downloads a zip archive containing several text files. Each text file contains word embeddings of a different dimensionality. For this code demo, we will work with the 50-dimensional vectors, that is, each word vector has 50 elements.

Before we read the word vectors into memory and work with them, I will first show what the word vectors look like.

(open sample_word_vecs.txt)

When you download pre-trained word vectors, they usually come as a text file with one word per line: the word itself, followed by the numbers of its vector, all separated by spaces. In this example, “the” is represented by these numbers and “is” is represented by these numbers. Pre-trained word vectors typically have a dimensionality such as 50, 100, 200, or 300 (the sizes included in glove.6B.zip). This text file should be seen only as an example of how word vectors are stored once you download them onto your system.
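To illustrate the file layout, here is a small sketch that parses two lines in that format. The lines are illustrative and shortened to four numbers per word, so they are not the actual contents of sample_word_vecs.txt or of the GloVe files.

```python
import numpy as np

# Illustrative lines in the same layout, shortened to 4 numbers per word
sample_lines = [
    "the 0.418 0.24968 -0.41242 0.1217",
    "is 0.6185 0.64254 -0.46552 0.3757",
]

parsed = {}
for line in sample_lines:
    word, *numbers = line.split()                     # first token is the word
    parsed[word] = np.asarray(numbers, dtype='float32')  # remaining tokens form the vector

for word, vector in parsed.items():
    print(word, vector.shape)
```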

Here, I will read the word vectors into memory. I will populate a dictionary in which the keys are the words and the values are the corresponding word vectors. So, let me run this.

In [31]:
### Read glove word vectors
t0=time.time()
embedding_index={}
with open(os.path.join(BASE_DIR,'glove.6B.50d.txt'),encoding='utf-8') as con:
    for line in con: ##looping over each line
        values=line.split() ##splitting it by space
        word=values[0] ##first value in each line is the word itself
        vector=np.asarray(values[1:],dtype='float32') ##everything else is the word vector
        embedding_index[word]=vector ##populate the dict
t1=time.time()
print("Took {} seconds to load glove word vectors".format(t1-t0))

Took 7.981776475906372 seconds to load glove word vectors


In this code, I open the pre-trained vectors file as a text file, loop over each line, and split it on whitespace. The first value in each line is the word itself and everything else is the word vector, which I convert into a NumPy array. Each word and its vector are then added to the dictionary.

Now let’s take a look at the keys of the dictionary we have just populated. These are all the words for which we have a vector.



In [5]:
embedding_index.keys()

Among the keys there is the word “japan”. Let's look at the vector associated with it.

In [33]:
embedding_index['japan']

array([-0.31739 , -0.14033 ,  0.32292 ,  1.072   ,  0.33008 ,  0.39406 ,
       -0.016682,  0.076903, -0.74591 , -0.31521 ,  1.0033  , -0.12659 ,
        0.063252,  0.64006 ,  0.70721 ,  0.84303 , -0.68832 ,  0.47214 ,
       -0.66002 ,  0.73962 ,  1.1116  , -0.89428 , -0.90364 , -0.47281 ,
        0.88529 , -2.0194  ,  0.30623 , -0.31662 , -0.44423 , -0.52139 ,
        3.0287  ,  0.70315 ,  0.92315 ,  0.52263 , -0.62674 , -0.58995 ,
       -0.15876 , -0.078332, -1.0794  , -0.71552 , -1.2764  , -0.85554 ,
        1.2827  , -1.2134  ,  1.0125  ,  0.40329 , -0.16276 ,  0.99117 ,
        0.031016, -0.35431 ], dtype=float32)

These numbers are the word vector for “japan”.

Let’s look at its shape. As you can see, it has dimension 50: each vector contains 50 elements.

In [34]:
embedding_index['japan'].shape

(50,)

The next step is to create an embedding weight matrix whose rows hold the word vectors of the words that occur in my corpus. First, I create a matrix containing only zeros, with 10,000 rows and 50 columns.

In [35]:
## Now create an embedding matrix for 10000 words in our corpus
embedding_weight_matrix=np.zeros((max_words,50))

In [36]:
embedding_weight_matrix.shape

(10000, 50)

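Here, I loop over the words in my corpus and fill the embedding weight matrix with the vectors obtained from the pre-trained word embeddings. Row i holds the vector for the word with tokenizer index i. Words that have no pre-trained vector keep a row of zeros, as does row 0, which the tokenizer never assigns to a word.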

In [37]:
for word,i in tokenizer.word_index.items():
    if i < max_words:
        vector=embedding_index.get(word)
        if vector is not None:
            embedding_weight_matrix[i]=vector

This is what my embedding weight matrix looks like, and this is its shape. It has 10,000 rows and 50 columns.

In [38]:
embedding_weight_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.68046999, -0.039263  ,  0.30186   , ..., -0.073297  ,
        -0.064699  , -0.26043999],
       [ 0.33041999,  0.24995001, -0.60873997, ..., -0.50703001,
        -0.027273  , -0.53285003],
       ...,
       [-0.72750998,  0.85914999, -2.07520008, ..., -0.24068999,
        -0.67565   , -1.02989995],
       [ 0.32872999,  0.19727001,  1.80250001, ...,  0.86822999,
         0.30015001,  0.45583001],
       [-0.95811999,  0.56607002,  0.24886   , ..., -0.43867001,
        -0.50740999,  1.02049994]])

In [39]:
embedding_weight_matrix.shape

(10000, 50)
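The same filling logic can be tried on a toy example to see which rows stay at zero. Everything in this sketch is hypothetical (a 3-dimensional lookup, a four-word index, and a vocabulary limit of 4). It also counts how many words were actually found, which can be a useful check on real data.

```python
import numpy as np

# Hypothetical toy inputs: a 3-dimensional "pre-trained" lookup and a small word index
toy_vectors = {"google": np.array([0.1, 0.2, 0.3]), "bank": np.array([0.4, 0.5, 0.6])}
toy_word_index = {"google": 1, "bank": 2, "zzyzx": 3, "loans": 4}
toy_max_words = 4   # only ids 0..3 get a row

toy_matrix = np.zeros((toy_max_words, 3))
found = 0
for word, i in toy_word_index.items():
    if i < toy_max_words:                 # skip words outside the vocabulary limit
        vector = toy_vectors.get(word)
        if vector is not None:            # words without a vector keep a zero row
            toy_matrix[i] = vector
            found += 1

print(toy_matrix)
print("words with a pre-trained vector:", found)
```

In this toy case, row 0 (never assigned to a word) and row 3 (a word with no pre-trained vector) remain all zeros, and the word with id 4 is skipped entirely.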

In [40]:
## Now we will assemble the model
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten

In [41]:
model=Sequential()
model.add(Embedding(input_dim=max_words,output_dim=50,
                    weights=[embedding_weight_matrix],
                    input_length=seq_len))
model.add(Flatten())
model.add(Dense(1024,activation='relu'))
model.add(Dense(4,activation='softmax'))

I need Sequential, which is the basic model container in the Keras API, along with a Dense layer, an Embedding layer whose weights will be the embedding weight matrix I have just created, and a Flatten layer so that the embedding output can be connected to a dense layer.

Here I'm assembling my model. The input dimension is 10,000 because, if you remember, my maximum vocabulary size is 10,000. The output dimension is 50 because each of my word vectors has a length of 50. The input length is the length of each sequence; if you remember, I padded my sequences to a length of 16. With the weights argument, I am saying that the weights of this embedding layer are going to be embedding_weight_matrix, the object I generated above. Then I flatten the embedding output and connect it to a dense layer with 1,024 neurons and ReLU activation, and at the end I have another dense layer with 4 neurons and a softmax activation.

In [42]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 16, 50)            500000    
_________________________________________________________________
flatten (Flatten)            (None, 800)               0         
_________________________________________________________________
dense (Dense)                (None, 1024)              820224    
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 4100      
Total params: 1,324,324
Trainable params: 1,324,324
Non-trainable params: 0
_________________________________________________________________


You can see that all the parameters in my model are trainable. What I need is for the parameters in the embedding layer to stay fixed, so that the embedding layer is used in the same way the convolutional blocks were used in transfer learning. To do that, I access the layers property and, since the embedding is my first layer and is therefore at index 0, I set its trainable property to False. If I now look at my model summary, you can see that the parameters in the embedding layer are non-trainable.

In [43]:
model.layers[0].trainable=False

In [44]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 16, 50)            500000    
_________________________________________________________________
flatten (Flatten)            (None, 800)               0         
_________________________________________________________________
dense (Dense)                (None, 1024)              820224    
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 4100      
Total params: 1,324,324
Trainable params: 824,324
Non-trainable params: 500,000
_________________________________________________________________
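The parameter counts in the two summaries can be checked by hand. The short calculation below is only a sanity check using the sizes from this demo (the variable names are chosen for the illustration): the embedding layer has one 50-element vector per word id, and each dense layer has one weight per input-output pair plus one bias per output.

```python
vocab_size, emb_dim, sequence_length = 10000, 50, 16

embedding_params = vocab_size * emb_dim      # one 50-element vector per word id
flat_size = sequence_length * emb_dim        # 16 positions x 50 values after Flatten
dense_params = flat_size * 1024 + 1024       # weights + biases of the hidden layer
output_params = 1024 * 4 + 4                 # weights + biases of the softmax layer

total = embedding_params + dense_params + output_params
trainable_after_freeze = total - embedding_params

print(embedding_params, flat_size, dense_params, output_params)  # 500000 800 820224 4100
print(total, trainable_after_freeze)                             # 1324324 824324
```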


In [45]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [46]:
model.fit(train_features,y_train,epochs=3,batch_size=32,validation_split=0.20)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f7f9f5c0090>

This will take some time; I'm running my model for only 3 epochs. After the model trains, we can see it has achieved a validation accuracy of 84%. Let’s obtain predictions from this model.

In [47]:
preds=model.predict(test_features)
preds

array([[7.9241997e-01, 1.9895509e-02, 1.3075784e-02, 1.7460869e-01],
       [7.5139076e-01, 6.9886526e-05, 1.6313244e-06, 2.4853779e-01],
       [3.4345221e-10, 3.4921861e-12, 4.5142750e-14, 1.0000000e+00],
       ...,
       [9.9993479e-01, 1.8836440e-07, 7.7958804e-09, 6.4978631e-05],
       [4.1593373e-02, 1.9883511e-05, 2.3821981e-04, 9.5814860e-01],
       [3.1795645e-01, 2.8387917e-02, 1.0119211e-01, 5.5246353e-01]],
      dtype=float32)

Find the class with the maximum probability for each prediction.

In [48]:
max_labels = []
for i in preds:
  max_labels.append(np.argmax(i))

Inverse transform them into their actual labels. Then I compute the accuracy on the test data.

In [49]:
pred_labels=enc.inverse_transform(np.array(max_labels))

In [50]:
(y_test==pred_labels).sum()/pred_labels.shape

array([0.84768714])

So, the accuracy on the test data is about 84.8%.
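As an aside, the argmax and accuracy steps can also be written without an explicit loop. The sketch below uses small made-up arrays rather than the real predictions: np.argmax with axis=1 picks the most probable class in each row, and np.mean of the element-wise comparison gives the accuracy as a plain number.

```python
import numpy as np

# Hypothetical probabilities for three headlines over the classes b, e, m, t
demo_preds = np.array([[0.79, 0.02, 0.01, 0.18],
                       [0.00, 0.00, 0.00, 1.00],
                       [0.32, 0.03, 0.10, 0.55]])
demo_classes = np.array(['b', 'e', 'm', 't'])
demo_truth = np.array(['b', 't', 'b'])   # hypothetical true labels

# Index of the largest probability in each row, mapped back to its label
demo_labels = demo_classes[np.argmax(demo_preds, axis=1)]

# Fraction of predictions that match the true labels
demo_accuracy = np.mean(demo_labels == demo_truth)

print(demo_labels)     # ['b' 't' 't']
print(demo_accuracy)
```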