In this code demo, we will see how to use embedding layers in Keras.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [41]:
import pandas as pd
import os
import numpy as np
BASE_DIR="/content/gdrive/MyDrive/RNN-LSTM"

In [3]:
train=pd.read_csv(os.path.join(BASE_DIR,'headlines.csv'))

In [4]:
train.head()

Unnamed: 0,ID,TITLE,CATEGORY
0,226435,Google+ rolls out 'Stories' for tricked out ph...,t
1,356684,Dov Charney's Redeeming Quality,b
2,246926,White God adds Un Certain Regard to the Palm Dog,e
3,318360,"Google shows off Androids for wearables, cars,...",t
4,277235,China May new bank loans at 870.8 bln yuan,b


This data contains news headlines along with their corresponding categories. There are four unique categories in total, covering news about technology, business, etc.

In [5]:
## We will create a classifier using embedding layer architecture
X=train['TITLE']
y=train['CATEGORY']

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [7]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=200)

In [8]:
enc=LabelEncoder()

In [9]:
y_train=enc.fit_transform(y_train)

In [10]:
## the unique labels in the category columns
enc.classes_

array(['b', 'e', 'm', 't'], dtype=object)

In [11]:
y_train

array([2, 3, 3, ..., 3, 1, 2])
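
To make the mapping concrete, here is a minimal pure-Python sketch of what this label encoding amounts to: the sorted unique labels receive the integers 0, 1, 2, and so on. The helpers `fit_labels` and `encode` and the sample labels are made up for illustration; they are not part of scikit-learn.

```python
# Hypothetical stand-in for LabelEncoder, for illustration only.
def fit_labels(labels):
    """Return the sorted unique labels; a label's position is its integer code."""
    return sorted(set(labels))

def encode(labels, classes):
    """Map each label to its index in the sorted class list."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]

sample = ['t', 'b', 'e', 't', 'm', 'b']   # made-up labels
classes = fit_labels(sample)
codes = encode(sample, classes)
print(classes)  # ['b', 'e', 'm', 't']
print(codes)    # [3, 0, 1, 3, 2, 0]
```

With this ordering, a 2 in y_train stands for 'm' and a 3 stands for 't'.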

Now we turn our attention to the data stored in X_train and X_test. If you remember, these are the news headlines that we will use as predictors. As mentioned earlier, we are using an embedding layer, so we need to do some data preparation first: we convert each text into a sequence of integers, and then we pad or truncate those sequences to a fixed length. For that, we will use the Tokenizer class, and we will use the pad_sequences() function to zero-pad our sequences.

In [12]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [13]:
seq_len=16
max_words=10000

I have assigned 16 to a variable called seq_len. This setting decides how long each sequence will be: any sequence shorter than 16 words is zero-padded, and any sequence longer than 16 words is truncated. The second variable, max_words, is the setting for the vocabulary. Here I am saying that I only want to consider the 10,000 most frequent words in my corpus.

In [14]:
tokenizer=Tokenizer(num_words=max_words)

In [15]:
### Split the text into words and assign an integer id
tokenizer.fit_on_texts(X_train.tolist())

Here I am creating an object of the Tokenizer class and then fitting it on my training data. This creates an integer index for each word in my corpus.

In [1]:
tokenizer.word_index
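
As a rough picture of what fit_on_texts() builds, the sketch below counts words and hands out ids from 1 upward, most frequent word first. It is a simplification: it only lowercases and splits on whitespace, whereas the Keras Tokenizer also strips punctuation by default. The helper `build_word_index` and the example headlines are invented for illustration.

```python
from collections import Counter

# Hypothetical helper that imitates the idea behind Tokenizer.fit_on_texts.
def build_word_index(texts):
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Most frequent words first; ties keep their first-seen order (stable sort).
    ranked = sorted(counts, key=lambda word: -counts[word])
    # Ids start at 1, so 0 stays free to be used as the padding value.
    return {word: i for i, word in enumerate(ranked, start=1)}

headlines = [
    "google shows off android",
    "google stock rises",
    "stock market rises",
]
word_index = build_word_index(headlines)
print(word_index)
# {'google': 1, 'stock': 2, 'rises': 3, 'shows': 4, 'off': 5, 'android': 6, 'market': 7}
```

The words that appear twice get the smallest ids, and no word is ever given the id 0.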

Now, I will create integer sequences from my training data.

In [17]:
## Create a sequence for each entry in the title column
sequence=tokenizer.texts_to_sequences(X_train.tolist())

In [2]:
sequence
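
The conversion can be pictured as one dictionary lookup per word. The sketch below is a simplification with a hand-written, hypothetical `word_index`. It imitates two behaviours of the Keras Tokenizer that matter here: words never seen during fitting are dropped (no oov_token was set), and only ids below num_words are kept.

```python
# Hypothetical helper that imitates the idea behind Tokenizer.texts_to_sequences.
def texts_to_ids(texts, word_index, num_words=None):
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word was not in the fitted vocabulary: dropped
            if num_words is not None and idx >= num_words:
                continue  # word is rarer than the vocabulary cutoff: dropped
            ids.append(idx)
        sequences.append(ids)
    return sequences

# Made-up vocabulary, most frequent word first.
word_index = {"google": 1, "stock": 2, "rises": 3, "shows": 4, "android": 5}
seqs = texts_to_ids(["Google stock rises", "Apple shows android"],
                    word_index, num_words=5)
print(seqs)  # [[1, 2, 3], [4]]
```

In the second headline, "apple" is unknown and "android" has an id that is not below the cutoff, so only one id survives. This is also why the sequences end up with different lengths.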

Each sentence in my training data has now been converted into a sequence of integers according to this mapping. You can see that not all the sequences have the same length. We need to make them the same length, and we do that by calling the pad_sequences() function with maxlen set to the sequence length.

In [19]:
## Pad the sequences
train_features=pad_sequences(sequence,maxlen=seq_len)

## By default, both padding and truncating are 'pre' (applied at the start of the sequence)
## Pass padding='post' or truncating='post' to apply them at the end instead

In [20]:
train_features

array([[   0,    0,    0, ...,  142, 1562, 8052],
       [   0,    0,    0, ...,    4, 1671,  525],
       [   0,    0,    0, ..., 5370,    6,   47],
       ...,
       [   0,    0,    0, ..., 4732, 1042,  359],
       [   0,    0,    0, ...,   46,   41,   80],
       [   0,    0,    0, ..., 2953, 6426, 2189]], dtype=int32)

In [None]:
train_features.shape

(168967, 16)
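
The leading zeros in the output come from the default 'pre' behaviour. Here is a small pure-Python sketch of that behaviour, assuming the defaults padding='pre' and truncating='pre'. The helper `pad_pre` is a hypothetical stand-in, not the Keras function.

```python
# Hypothetical helper that imitates pad_sequences with its default settings.
def pad_pre(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                 # too long: keep the LAST maxlen ids
        fill = [value] * (maxlen - len(seq))      # too short: zeros go in FRONT
        padded.append(fill + seq)
    return padded

result = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(result)  # [[0, 0, 5, 8], [3, 4, 5, 6]]
```

The short sequence gains two zeros at the start, and the long one loses its first two ids, so every row ends up with exactly maxlen entries.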

Do the same for the test data:

In [21]:
## Create test features
sequence=tokenizer.texts_to_sequences(X_test.tolist())

In [3]:
sequence

In [23]:
test_features=pad_sequences(sequence,maxlen=seq_len)

In [24]:
test_features

array([[   0,    0,    0, ...,  113,    2,   31],
       [   0,    0,    0, ...,    4, 4018, 3115],
       [   0,    0,    0, ...,  375, 5948, 4400],
       ...,
       [   0,    0,    0, ...,   11,  157, 1648],
       [   0,    0,    0, ...,   97,   76,    7],
       [   0,    0,    0, ...,  310, 3979, 5986]], dtype=int32)

In [25]:
test_features.shape

(42242, 16)

At this point, if you remember, the y vector that we had contains integer labels. To use a target variable in Keras with the categorical cross-entropy loss, you need to one-hot encode it if it is not binary.

In [26]:
## Convert y_train to one-hot encoded vectors (y_test is kept as string labels for the final comparison)
from tensorflow.keras.utils import to_categorical

In [27]:
y_train=to_categorical(y_train)

In [28]:
y_train

array([[0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       ...,
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.]], dtype=float32)
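
What to_categorical() does here can be sketched in a few lines of plain Python: each integer label becomes a row of zeros with a single 1 at the label's position. The helper `one_hot` is hypothetical and only illustrates the idea.

```python
# Hypothetical helper that imitates the idea behind to_categorical.
def one_hot(labels, num_classes=None):
    if num_classes is None:
        num_classes = max(labels) + 1   # infer the class count from the largest label
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0                # a single 1 at the label's position
        rows.append(row)
    return rows

encoded = one_hot([2, 3, 1])
for row in encoded:
    print(row)
# [0.0, 0.0, 1.0, 0.0]
# [0.0, 0.0, 0.0, 1.0]
# [0.0, 1.0, 0.0, 0.0]
```

Each row has one column per category, which is the shape the four-unit softmax output layer is trained against.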

In [29]:
## Assemble the model
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten

So I'm importing the Sequential class, and I will be using Dense layers and an Embedding layer. To connect the Embedding layer to a Dense layer, I have to flatten its output, so I'm also importing the Flatten layer.

In [31]:
model=Sequential()
model.add(Embedding(input_dim=max_words,output_dim=64,input_length=seq_len))
model.add(Flatten())
model.add(Dense(1024,activation='relu'))
model.add(Dense(4,activation='softmax'))

Here, input_dim defines the number of rows in the embedding layer. If you remember, the maximum number of words that we chose was 10,000, so our embedding layer will have 10,000 rows. With output_dim, we are saying that this embedding layer will output vectors with a dimension of 64, so each word vector will be of length 64. With input_length, we are specifying the length of the inputs that will go into the embedding layer. These inputs are my train features, and each of them has a length of 16. Next, I flatten the output of this layer to connect it to a dense layer which has 1024 units, each with a ReLU activation. The last layer contains four units because my target variable has only four unique values, and its activation is softmax.
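
To see what the embedding and flatten steps do to a single headline, here is a toy sketch in plain Python. It uses small stand-in sizes (10, 4 and 3 instead of 10,000, 64 and 16) and random numbers in place of learned weights, so it only illustrates the shapes and the row lookup, not the Keras implementation.

```python
import random

vocab_size, embed_dim, seq_len = 10, 4, 3   # small stand-ins for 10000, 64, 16
random.seed(0)

# One row of embed_dim numbers per word id; in the real model these are learned.
embedding = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
             for _ in range(vocab_size)]

padded_ids = [0, 7, 2]                                # one padded headline
looked_up = [embedding[i] for i in padded_ids]        # shape (seq_len, embed_dim)
flat = [x for row in looked_up for x in row]          # shape (seq_len * embed_dim,)

print(len(looked_up), len(looked_up[0]))  # 3 4
print(len(flat))                          # 12
```

The embedding layer is therefore just a row lookup by word id. With the sizes used in this demo, each headline becomes 16 x 64 = 1024 numbers after flattening.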

In [32]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 16, 64)            640000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 1024)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1024)              1049600   
_________________________________________________________________
dense_3 (Dense)              (None, 4)                 4100      
Total params: 1,693,700
Trainable params: 1,693,700
Non-trainable params: 0
_________________________________________________________________
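
The parameter counts in this summary can be checked by hand. The short calculation below uses only the sizes chosen earlier; a Dense layer has one weight per input-output pair plus one bias per output unit.

```python
max_words, embed_dim, seq_len = 10000, 64, 16

embedding_params = max_words * embed_dim      # one 64-dim row per word id
flat_size = seq_len * embed_dim               # values per headline after Flatten
dense_params = flat_size * 1024 + 1024        # weights + biases of the hidden layer
output_params = 1024 * 4 + 4                  # weights + biases of the softmax layer
total_params = embedding_params + dense_params + output_params

print(embedding_params)  # 640000
print(flat_size)         # 1024
print(dense_params)      # 1049600
print(output_params)     # 4100
print(total_params)      # 1693700
```

Note that the Flatten layer adds no parameters of its own; it only reshapes the (16, 64) output into 1024 values.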


In [33]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
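
To give a feel for the loss chosen here, the sketch below computes a softmax and the categorical cross-entropy for one example in plain Python. It illustrates the formulas only and is not the Keras implementation. The raw scores and the helper names are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

probs = softmax([2.0, 0.5, 0.1, 1.0])             # made-up scores for 4 classes
loss_if_class_0 = categorical_crossentropy([1, 0, 0, 0], probs)
loss_if_class_2 = categorical_crossentropy([0, 0, 1, 0], probs)
print(probs)
print(loss_if_class_0, loss_if_class_2)
```

The loss is small when the one-hot target points at the class with the highest predicted probability and large when it points at an unlikely class, which is what training tries to minimise.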

In [34]:
model.fit(train_features,y_train,epochs=3,batch_size=32,validation_split=0.20)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f3321369110>

This will take some time to run. Now that the model has been trained, you can see that we are getting about 91% accuracy on our validation set.
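
The validation set here is the 20% of the training rows that validation_split=0.20 holds out. According to the Keras documentation, these rows are taken from the end of the arrays, before any shuffling. The arithmetic below is a sketch that assumes the split point is the floor of n x (1 - validation_split); the exact rounding inside Keras may differ between versions.

```python
import math

n_samples = 168967          # rows in train_features, from the shape shown earlier
validation_split = 0.20
batch_size = 32

# Assumption: the split point is the floor of n * (1 - validation_split).
split_at = math.floor(n_samples * (1 - validation_split))
n_train = split_at                      # first rows are used for fitting
n_val = n_samples - split_at            # last rows are held out for validation
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)
```

Under this assumption, roughly 135 thousand headlines are used for fitting and roughly 34 thousand for the validation accuracy.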

In [35]:
preds=model.predict(test_features)

In [36]:
preds

array([[8.6802810e-01, 2.5659227e-05, 1.5594635e-05, 1.3193056e-01],
       [3.9068091e-01, 3.7802190e-05, 3.2304013e-06, 6.0927808e-01],
       [1.5176908e-05, 4.4397647e-10, 1.3902290e-10, 9.9998486e-01],
       ...,
       [1.0000000e+00, 4.4385003e-11, 2.0820911e-13, 2.3511115e-09],
       [4.5210958e-01, 2.5955535e-02, 9.2327716e-03, 5.1270217e-01],
       [2.9743867e-06, 1.5632887e-07, 5.0865716e-09, 9.9999690e-01]],
      dtype=float32)

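This is what our predictions look like. These predictions are in the form of class probabilities, with one column per category. We first take the index of the largest probability in each row, which gives an integer label, and then we decode those integers with the inverse_transform() method of our LabelEncoder object.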

In [47]:
## Index of the highest probability in each row
max_labels = np.argmax(preds, axis=1)

In [48]:
np.array(max_labels)

array([0, 3, 3, ..., 0, 3, 3])

In [49]:
pred_labels=enc.inverse_transform(np.array(max_labels))

In [50]:
pred_labels

array(['b', 't', 't', ..., 'b', 't', 't'], dtype=object)
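
The two decoding steps can be traced by hand on the first three rows of preds, rounded to three decimals. The class order is the one reported by enc.classes_ earlier; the helper `argmax` is a hypothetical plain-Python stand-in for np.argmax.

```python
classes = ['b', 'e', 'm', 't']   # order reported by enc.classes_

# First three rows of preds, rounded for readability.
probs = [
    [0.868, 0.000, 0.000, 0.132],
    [0.391, 0.000, 0.000, 0.609],
    [0.000, 0.000, 0.000, 1.000],
]

def argmax(row):
    """Position of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

indices = [argmax(row) for row in probs]      # integer labels
labels = [classes[i] for i in indices]        # what inverse_transform returns
print(indices)  # [0, 3, 3]
print(labels)   # ['b', 't', 't']
```

These match the first three entries of max_labels and pred_labels shown in the outputs.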

Now we can see the predicted labels. Let’s check the accuracy on the test data.

In [51]:
(y_test==pred_labels).sum()/pred_labels.shape

array([0.9102552])
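
This number is simply the fraction of test headlines whose predicted label equals the true label. A plain-Python sketch of the same calculation, with a hypothetical `accuracy` helper and made-up labels:

```python
# Hypothetical helper: fraction of positions where the two label lists agree.
def accuracy(y_true, y_pred):
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

y_true = ['b', 't', 'e', 'b', 't']   # made-up true labels
y_pred = ['b', 't', 't', 'b', 't']   # made-up predictions
score = accuracy(y_true, y_pred)
print(score)  # 0.8
```

In the notebook cell, the result is displayed as a one-element array because the count is divided by pred_labels.shape, which is a tuple. Dividing by pred_labels.shape[0] would give a plain number instead.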

The accuracy on the test data is around 91%.
So, in this way, we can include an embedding layer in a text classification model.