In this code demo, we will see how to build an LSTM text classifier using Keras.

In [47]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [48]:
import pandas as pd
import os
BASE_DIR="/content/gdrive/MyDrive/RNN-LSTM"

In [49]:
train=pd.read_csv(os.path.join(BASE_DIR,'headlines.csv'))

In [50]:
train.head()

Unnamed: 0,ID,TITLE,CATEGORY
0,226435,Google+ rolls out 'Stories' for tricked out ph...,t
1,356684,Dov Charney's Redeeming Quality,b
2,246926,White God adds Un Certain Regard to the Palm Dog,e
3,318360,"Google shows off Androids for wearables, cars,...",t
4,277235,China May new bank loans at 870.8 bln yuan,b


In [51]:
## We will create a classifier using an embedding layer and a recurrent layer
X=train['TITLE']
y=train['CATEGORY']

In [52]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [53]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=200)

In [54]:
enc=LabelEncoder()

In [55]:
y_train=enc.fit_transform(y_train)

In [56]:
enc.classes_

array(['b', 'e', 'm', 't'], dtype=object)

In [57]:
y_train

array([2, 3, 3, ..., 3, 1, 2])
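
To make the mapping between the category letters and the integers above concrete, here is a minimal plain-Python sketch of the same idea. It is not scikit-learn's implementation; the toy labels are made up, and the only behaviour it relies on is that LabelEncoder assigns ids in sorted class order, which matches the enc.classes_ output above.

```python
# Plain-Python illustration of label encoding (toy labels, not from headlines.csv).
labels = ["t", "b", "e", "t", "m"]

# Sorted unique classes, analogous to enc.classes_
classes = sorted(set(labels))
to_id = {c: i for i, c in enumerate(classes)}

# Analogous to enc.fit_transform(labels)
encoded = [to_id[c] for c in labels]

# Analogous to enc.inverse_transform(encoded)
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
```

So 'b' maps to 0, 'e' to 1, 'm' to 2 and 't' to 3, and the inverse mapping is what I will use later to turn predictions back into category letters.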

In [58]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [59]:
seq_len=16
max_words=10000

In [60]:
tokenizer=Tokenizer(num_words=max_words)
### Split the text into words and assign an integer id
tokenizer.fit_on_texts(X_train.tolist())
## Create a sequence for each entry in the title column
sequence=tokenizer.texts_to_sequences(X_train.tolist())
## Pad the sequences
train_features=pad_sequences(sequence,maxlen=seq_len)
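
The three steps in this cell can be sketched in plain Python to show what is happening to each title. This is a simplified illustration, not the Keras implementation: it only lowercases and splits on whitespace (the real Tokenizer also strips punctuation by default), and it breaks frequency ties alphabetically, which is my own assumption. The function names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Give each word an integer id; the most frequent word gets id 1."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # Assumption: ties in frequency are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {w: i for i, (w, _) in enumerate(ranked, start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Replace words by ids, dropping unknown words and ids >= num_words."""
    return [[word_index[w] for w in t.lower().split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

def pad_pre(seq, maxlen):
    """Keep the last maxlen ids and left-pad with zeros (id 0 is reserved)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_titles = ["Google shows off Android", "Google stock falls"]
toy_index = fit_word_index(toy_titles)
toy_seqs = texts_to_ids(toy_titles, toy_index, num_words=100)
toy_features = [pad_pre(s, maxlen=6) for s in toy_seqs]
print(toy_features)
```

Padding and truncating at the front is the default behaviour of pad_sequences, which is why the rows of train_features start with zeros and end with the word ids.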

In [61]:
train_features

array([[   0,    0,    0, ...,  142, 1562, 8052],
       [   0,    0,    0, ...,    4, 1671,  525],
       [   0,    0,    0, ..., 5370,    6,   47],
       ...,
       [   0,    0,    0, ..., 4732, 1042,  359],
       [   0,    0,    0, ...,   46,   41,   80],
       [   0,    0,    0, ..., 2953, 6426, 2189]], dtype=int32)

In [62]:
train_features.shape

(168967, 16)

In [63]:
## Create test features
sequence=tokenizer.texts_to_sequences(X_test.tolist())
test_features=pad_sequences(sequence,maxlen=seq_len)
test_features

array([[   0,    0,    0, ...,  113,    2,   31],
       [   0,    0,    0, ...,    4, 4018, 3115],
       [   0,    0,    0, ...,  375, 5948, 4400],
       ...,
       [   0,    0,    0, ...,   11,  157, 1648],
       [   0,    0,    0, ...,   97,   76,    7],
       [   0,    0,    0, ...,  310, 3979, 5986]], dtype=int32)

In [64]:
test_features.shape

(42242, 16)

In [65]:
## Convert y_train to one-hot encoded vectors (y_test stays as string labels for the final comparison)
from tensorflow.keras.utils import to_categorical

In [66]:
y_train=to_categorical(y_train)

In [67]:
y_train

array([[0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       ...,
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.]], dtype=float32)
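
As a small illustration of what the one-hot conversion does, here is a plain-Python sketch. The helper name one_hot is hypothetical and this is not how to_categorical is implemented; it only reproduces the idea for the first three labels printed earlier (2, 3, 3).

```python
def one_hot(class_ids, num_classes):
    """Turn integer class ids into rows with a single 1.0 at the class position."""
    rows = []
    for c in class_ids:
        row = [0.0] * num_classes
        row[c] = 1.0
        rows.append(row)
    return rows

first_three = one_hot([2, 3, 3], num_classes=4)
for row in first_three:
    print(row)
```

The rows agree with the first three rows of y_train shown above.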

In [68]:
import numpy as np
import time

In [73]:
### Read GloVe word vectors
t0=time.time()
embedding_index={}
## The with-block closes the file even if parsing a line fails
with open(os.path.join(BASE_DIR,'glove.6B.100d.txt'),encoding='utf-8') as con:
    for line in con:
        values=line.split()
        word=values[0]
        vector=np.asarray(values[1:],dtype='float32')
        embedding_index[word]=vector
t1=time.time()
print("Took {} seconds to load glove word vectors".format(t1-t0))

Took 16.805164337158203 seconds to load glove word vectors


In [75]:
## Now create an embedding matrix for 10000 words in our corpus
embedding_weight_matrix=np.zeros((max_words,100))
for word,i in tokenizer.word_index.items():
    if i < max_words:
        vector=embedding_index.get(word)
        if vector is not None:
            embedding_weight_matrix[i]=vector
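
The two cells above can be traced on a toy example. The sketch below uses made-up 3-dimensional vectors and a hypothetical 4-word vocabulary, with plain lists instead of NumPy arrays, so it only illustrates the logic: each GloVe line is a word followed by its numbers, row i of the matrix belongs to the word with id i, and any word without a pretrained vector keeps an all-zero row.

```python
import io

# Made-up vectors in the GloVe text layout: word, then space-separated floats.
toy_glove = io.StringIO("google 0.1 0.2 0.3\nshows 0.4 0.5 0.6\n")
toy_vectors = {}
for line in toy_glove:
    word, *numbers = line.split()
    toy_vectors[word] = [float(x) for x in numbers]

# Hypothetical tokenizer output: word -> id, with id 0 reserved for padding.
toy_word_index = {"google": 1, "shows": 2, "androids": 3, "wearables": 4}
toy_max_words, dim = 4, 3

matrix = [[0.0] * dim for _ in range(toy_max_words)]
missing = []  # words inside the vocabulary limit that have no pretrained vector
for word, i in toy_word_index.items():
    if i < toy_max_words:
        vector = toy_vectors.get(word)
        if vector is not None:
            matrix[i] = vector
        else:
            missing.append(word)

print(matrix)
print(missing)
```

Keeping a list like missing is my own addition; it is a cheap way to check how much of the vocabulary is actually covered by the pretrained vectors.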

Now, I will start assembling my model.

LSTM model using Keras

Besides the dense layer and the embedding layer, I will also use an LSTM layer.

In [72]:
## Now we will assemble the model
from keras.models import Sequential
from keras.layers import Dense, Embedding,LSTM

In [77]:
model=Sequential()
model.add(Embedding(input_dim=max_words,output_dim=100,
                    weights=[embedding_weight_matrix],
                    input_length=seq_len))
model.add(LSTM(100))
model.add(Dense(4,activation='softmax'))

My first layer in the model is an embedding layer, where the input dimension equals the maximum vocabulary size allowed, which is 10,000. The output dimension is 100 because I'm using word vectors of dimension 100. I also have to specify the weights and the sequence length of each input. The LSTM layer takes the 100-dimensional word vectors produced by the embedding layer as its input. The 100 you see as its parameter is the number of LSTM units (the size of its hidden state), which I have set equal to the word-vector dimension; the input size itself is inferred from the previous layer.
Last is my dense layer with softmax activation, with one output per category.

I also have to make sure that the embedding layer is non-trainable, so that the pretrained GloVe weights are not updated during training:

In [78]:
model.layers[0].trainable=False

In [79]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 16, 100)           1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 100)               80400     
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 404       
Total params: 1,080,804
Trainable params: 80,804
Non-trainable params: 1,000,000
_________________________________________________________________
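
The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard formulas: an embedding layer stores one vector per vocabulary entry, an LSTM has four gates that each own an input weight matrix, a recurrent weight matrix and a bias, and a dense layer has one weight per input-output pair plus a bias per output. The variable names are mine.

```python
vocab_size, embed_dim = 10000, 100   # max_words and the GloVe dimension
units, num_classes = 100, 4          # LSTM units and number of categories

embedding_params = vocab_size * embed_dim
# 4 gates, each with (input weights + recurrent weights) per unit, plus a bias
lstm_params = 4 * (units * (embed_dim + units) + units)
dense_params = units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
trainable_params = lstm_params + dense_params  # embedding layer is frozen

print(embedding_params, lstm_params, dense_params)
print(total_params, trainable_params)
```

The numbers agree with the summary: 1,000,000 for the embedding, 80,400 for the LSTM and 404 for the dense layer.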


We can see that I have 1,000,000 non-trainable parameters, which correspond to the weights of the word vectors in the embedding layer. Next I will compile my model and then fit it on the training data.

In [80]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
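
To show what the categorical_crossentropy loss measures for a single headline, here is a small worked example in plain Python. The scores are made up, and the sketch leaves out details that Keras handles internally (such as clipping probabilities to avoid log(0) and averaging over the batch).

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

probs = softmax([1.0, 2.0, 0.5, 3.0])   # made-up scores for b, e, m, t
target = [0.0, 0.0, 0.0, 1.0]           # true class is 't'
loss = categorical_crossentropy(target, probs)

# A model that guesses uniformly over 4 classes has loss log(4).
uniform_loss = categorical_crossentropy([1.0, 0.0, 0.0, 0.0], [0.25] * 4)
print(loss, uniform_loss)
```

The loss gets smaller as the probability on the true class approaches 1, so a value well below log(4) during training means the model is doing better than guessing.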

In [81]:
model.fit(train_features, y_train, epochs=3,batch_size=32,
          validation_split=0.20)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7faa704ead10>

The validation accuracy reported during training is 91%.
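
It helps to know which rows that validation figure is computed on. With validation_split, Keras holds out the last fraction of the rows passed to fit, before any shuffling, and trains on the rest. The arithmetic below is a rough sketch of the resulting sizes; I am assuming the split index is rounded down, so treat the exact counts as approximate.

```python
import math

n_rows = 168967            # rows in train_features, from the shape printed earlier
validation_split = 0.20
batch_size = 32

# Assumption: the split index is floored.
split_at = int(math.floor(n_rows * (1.0 - validation_split)))
n_fit = split_at                   # rows used for weight updates
n_val = n_rows - split_at          # last rows, held out for validation
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)
```

Because the held-out rows are simply the last ones, this works well here only because train_test_split already shuffled the data.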
Let's look at the accuracy of this model on my test data

In [82]:
preds=model.predict(test_features)

max_labels = []
for i in preds:
  max_labels.append(np.argmax(i))

pred_labels=enc.inverse_transform(np.array(max_labels))
(y_test==pred_labels).sum()/pred_labels.shape

array([0.90876379])

The accuracy on my test data is around 91% (0.9088).
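
The evaluation step can be traced on a toy example in plain Python: pick the most probable class for each row, map it back to its category letter, and count the matches. The probabilities and true labels below are made up, and the helper name argmax is mine.

```python
toy_classes = ["b", "e", "m", "t"]          # same order as enc.classes_
toy_probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.2, 0.1, 0.6],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.3, 0.5],
]
toy_true = ["b", "t", "e", "m"]

def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)

toy_pred = [toy_classes[argmax(row)] for row in toy_probs]
accuracy = sum(p == t for p, t in zip(toy_pred, toy_true)) / len(toy_true)
print(toy_pred, accuracy)
```

In the notebook itself, np.argmax(preds, axis=1) would do the loop in one call. The result above prints as a one-element array because pred_labels.shape is a tuple; dividing by pred_labels.shape[0] would give a plain number instead.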