# Stance Detection for the Fake News Challenge

## Identifying Textual Relationships with Deep Neural Nets

### Check the problem context [here](https://drive.google.com/open?id=1KfWaZyQdGBw8AUTacJ2yY86Yxgw2Xwq0).

### Download files required for the project from [here](https://drive.google.com/open?id=10yf39ifEwVihw4xeJJR60oeFBY30Y5J8).

## Step 1: Load the given dataset

1. Mount Google Drive

2. Import GloVe embeddings

3. Import the test and train datasets

### Mount Google Drive to access the required project files

Run the commands below.

In [1]:
# from google.colab import drive

In [2]:
# drive.mount('/content/drive/')

#### Path for Project files on google drive

**Note:** Change this path to match the location where you keep the files in Google Drive.

In [3]:
# project_path = "/content/drive/My Drive/Datasets/Fake News Challenge/"
project_path = 'data/'

### Loading the GloVe embeddings

In [4]:
# from zipfile import ZipFile
# with ZipFile(project_path+'glove.6B.zip', 'r') as z:
#     z.extractall()

###### Libraries

In [5]:
import pandas as pd
import numpy as np

from nltk.tokenize import sent_tokenize
from tensorflow.keras.preprocessing.text import text_to_word_sequence

from tqdm import tqdm

from sklearn.model_selection import train_test_split

# Load the dataset [5 Marks]

1. Using [read_csv()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) in pandas, load the given training files **`train_bodies.csv`** and **`train_stances.csv`**.

2. Using the [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) command in pandas, merge the two datasets on the `Body ID` column.

Note: Save the final merged dataset in a dataframe with name **`dataset`**.
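
The merge step can be sketched on toy data. This is a minimal illustration, assuming two small hypothetical frames that only mimic the column layout of the real CSV files. The `validate='many_to_one'` argument is an optional safeguard that is not part of the assignment: it makes pandas raise an error if a `Body ID` appears more than once in the bodies table.

```python
import pandas as pd

# Hypothetical toy frames that mimic the column layout of the two CSV files.
bodies = pd.DataFrame({'Body ID': [0, 1],
                       'articleBody': ['body zero', 'body one']})
stances = pd.DataFrame({'Headline': ['h1', 'h2', 'h3'],
                        'Body ID': [0, 0, 1],
                        'Stance': ['unrelated', 'agree', 'discuss']})

# Many headlines share one body, so every stance row picks up its article text.
merged = pd.merge(stances, bodies, on='Body ID', how='inner',
                  validate='many_to_one')

print(merged.shape)
print(merged[['Body ID', 'Headline', 'articleBody']])
```

The merged frame has one row per headline, which is why the real merged dataset keeps the row count of `train_stances.csv`.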

In [6]:
train_bodies = pd.read_csv('data/train_bodies.csv')
train_stances = pd.read_csv('data/train_stances.csv')

print('train_bodies: ', train_bodies.shape)
print('train_stances: ', train_stances.shape)
#
print()
#
# print('train_bodies: \n', train_bodies.head())
# print('train_stances: \n', train_stances.head())

#### Check how many common bodies exist between train_bodies and train_stances
def common_member(a, b):
    """Return the set of values present in both a and b (empty set if none)."""
    # Always returning a set keeps len(common_member(...)) valid even with no overlap.
    return set(a) & set(b)

#
print('Unique Body ID in train_bodies: ', train_bodies['Body ID'].nunique())
print('Unique Body ID in train_stances: ', train_stances['Body ID'].nunique())
#
print('...')
print('Common entries: ', len(common_member(train_bodies['Body ID'], train_stances['Body ID'])))

train_bodies:  (1683, 2)
train_stances:  (49972, 3)

Unique Body ID in train_bodies:  1683
Unique Body ID in train_stances:  1683
...
Common entries:  1683


In [7]:
#### Merge by 'Body ID'
dataset = pd.merge(train_stances, train_bodies, on='Body ID')

#### Reorder columns
dataset = dataset.reindex(columns=['Body ID', 'articleBody', 'Headline', 'Stance']).sort_values(['Body ID'])

print(dataset.shape)

(49972, 4)



<h2> Check 1:</h2>

<h3> Running the `dataset.head()` command as given below should produce the following output </h3>

In [8]:
dataset.head()

| | Body ID | articleBody | Headline | Stance |
|---|---|---|---|---|
| 41651 | 0 | A small meteorite crashed into a wooded area i... | Soldier shot, Parliament locked down after gun... | unrelated |
| 41657 | 0 | A small meteorite crashed into a wooded area i... | Italian catches huge wels catfish; is it a rec... | unrelated |
| 41658 | 0 | A small meteorite crashed into a wooded area i... | Not coming to a store near you: The pumpkin sp... | unrelated |
| 41659 | 0 | A small meteorite crashed into a wooded area i... | One gunman killed in shooting on Parliament Hi... | unrelated |
| 41660 | 0 | A small meteorite crashed into a wooded area i... | Soldier shot at war memorial in Canada | unrelated |


## Step 2: Data pre-processing and setting the hyperparameters needed for the model


#### Run the code given below to set the required parameters.

1. `MAX_SENTS` = Maximum number of sentences to consider in an article.

2. `MAX_SENT_LENGTH` = Maximum number of words to consider in a sentence.

3. `MAX_NB_WORDS` = Maximum number of words in the total vocabulary.

4. `MAX_SENTS_HEADING` = Maximum number of sentences to consider in the heading of an article.

In [9]:
MAX_NB_WORDS = 20000
MAX_SENTS = 20
MAX_SENTS_HEADING = 1
MAX_SENT_LENGTH = 20
VALIDATION_SPLIT = 0.2
EMBEDDING_DIM = 100

### Download the `punkt` models from nltk using the commands given below. They are needed for sentence tokenization.

For more info on how to use it, read [this](https://stackoverflow.com/questions/35275001/use-of-punktsentencetokenizer-in-nltk).



In [10]:
# import nltk
# nltk.download('punkt')

# Tokenizing the text and loading the pre-trained GloVe word embeddings for each token  [5 marks] 

Keras provides the [Tokenizer API](https://keras.io/preprocessing/text/) for preparing text. Read it before going any further.

#### Import the Tokenizer from keras preprocessing text

In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer

#### Initialize the Tokenizer class with the maximum vocabulary count set to `MAX_NB_WORDS`, defined at the start of Step 2.

In [12]:
# Replace out-of-vocabulary words with token
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, split=' ', oov_token='oov_token', lower=True)

#### Now, using fit_on_texts() from the Tokenizer class, let's build the vocabulary from the data

Note: Fit on both articleBody and Headline so that the vocabulary covers all the words.

In [13]:
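# Note: the two columns are joined without a separator, so the last word of each body
# is fused with the first word of its headline before tokenization.
tokenizer_dataset = dataset['articleBody'] + dataset['Headline']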

tokenizer.fit_on_texts(tokenizer_dataset)

In [14]:
# Print the first entry of each dictionary built by fit_on_texts()
for name in ('word_counts', 'word_docs', 'word_index'):
    mapping = getattr(tokenizer, name)
    first_key = next(iter(mapping))
    print(first_key, ': ', mapping[first_key])
tokenizer.document_count

a :  478453
house :  5738
oov_token :  1


49972

#### After fit_on_texts(), the tokenizer exposes the following attributes, as documented [here](https://faroit.github.io/keras-docs/1.2.2/preprocessing/text/).

* **word_counts:** dictionary mapping words (str) to the number of times they appeared on during fit. Only set after fit_on_texts was called.

* **word_docs:** dictionary mapping words (str) to the number of documents/texts they appeared on during fit. Only set after fit_on_texts was called.

* **word_index:** dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.

* **document_count:** int. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_on_texts or fit_on_sequences was called.
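
To make these four attributes concrete, here is a minimal pure-Python sketch that computes the same kind of statistics on a toy corpus. It is a simplified imitation, not the Keras implementation: the helper name `fit_toy_tokenizer` is hypothetical, and it splits on whitespace only, whereas the Keras `Tokenizer` also filters punctuation.

```python
from collections import Counter

def fit_toy_tokenizer(texts, oov_token='oov_token'):
    """Simplified stand-in for Tokenizer.fit_on_texts (illustration only)."""
    word_counts = Counter()   # total occurrences of each word
    word_docs = Counter()     # number of texts that contain each word
    for text in texts:
        words = text.lower().split()
        word_counts.update(words)
        word_docs.update(set(words))
    # The OOV token takes index 1; the remaining words follow by descending frequency.
    ordered = [oov_token] + [word for word, _ in word_counts.most_common()]
    word_index = {word: i for i, word in enumerate(ordered, start=1)}
    return word_counts, word_docs, word_index, len(texts)

corpus = ['the cat sat', 'the dog sat down', 'the end']
word_counts, word_docs, word_index, document_count = fit_toy_tokenizer(corpus)

print(word_counts['the'], word_docs['sat'], word_index['the'], document_count)
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value in the encoded arrays.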



### Now, tokenize the sentences using nltk sent_tokenize() and encode the sentences with the ids from `tokenizer.word_index` above

Initialise two lists named `texts` and `articles`.

```
texts = []     # stores the text of each article as it is

articles = []  # stores each article text split into a list of sentences
```

In [15]:
texts = []
articles = []

In [16]:
# from nltk.tokenize import sent_tokenize

for text in tqdm(dataset['articleBody']):
    texts.append(text)
    articles.append(sent_tokenize(text))

100%|███████████████████████████████████████████████████████████████████████████| 49972/49972 [01:10<00:00, 705.40it/s]


In [17]:
# print(len(texts))
# print(len(articles))
# print(texts[0])
# print('---')
# print(articles[0])

## Check 2:

The first elements of `texts` and `articles` should be as given below.

In [18]:
texts[0]

'A small meteorite crashed into a wooded area in Nicaragua\'s capital of Managua overnight, the government said Sunday. Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city\'s airport, the Associated Press reports. \n\nGovernment spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth." House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports. \nMurillo said Nicaragua will ask international experts to help local scientists in understanding what happened.\n\nThe crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee. He said it is still not clear if the meteorite disintegrated or was buried.\n\nHumbe

In [19]:
articles[0]

["A small meteorite crashed into a wooded area in Nicaragua's capital of Managua overnight, the government said Sunday.",
 "Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city's airport, the Associated Press reports.",
 'Government spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth."',
 'House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports.',
 'Murillo said Nicaragua will ask international experts to help local scientists in understanding what happened.',
 'The crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee.',
 'He said it is still not clear if the meteorite disintegrated or was bu

# Now iterate through each article and each sentence to encode the words into ids using tokenizer.word_index  [5 marks] 

Here, to get the words from a sentence you can use `text_to_word_sequence` from keras preprocessing text.

1. Import text_to_word_sequence

2. Initialize a variable named `data` of shape (number of articles, MAX_SENTS, MAX_SENT_LENGTH) with zeros (you can use numpy [np.zeros](https://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html) for this), and then update it while iterating through the words and sentences in each article.
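
The encoding described above can be sketched on a toy example. This is a minimal illustration under stated assumptions: the vocabulary and articles are made up, the limits are far smaller than the notebook's 20 x 20, and `str.lower().split()` stands in for `text_to_word_sequence`.

```python
import numpy as np

MAX_SENTS, MAX_SENT_LENGTH = 3, 4   # toy limits for illustration
word_index = {'a': 1, 'meteorite': 2, 'crashed': 3,
              'residents': 4, 'heard': 5, 'boom': 6}   # hypothetical vocabulary

# Each article is already split into sentences.
articles = [
    ['A meteorite crashed', 'Residents heard a loud boom today'],
    ['A boom'],
]

data = np.zeros((len(articles), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
for i, article in enumerate(articles):
    for j, sent in enumerate(article[:MAX_SENTS]):                      # drop extra sentences
        for k, word in enumerate(sent.lower().split()[:MAX_SENT_LENGTH]):  # drop extra words
            data[i, j, k] = word_index.get(word, 0)                     # unknown words stay 0

print(data)
```

Short sentences and short articles are padded with zeros on the right, while long ones are truncated, so every article ends up with the same fixed shape.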

In [20]:
# from tensorflow.keras.preprocessing.text import text_to_word_sequence

data = np.zeros(shape=(len(articles), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')

for i in tqdm(range(len(articles))):
    #print(i)
    article = articles[i]
    for j, sent in enumerate(article):
        if j < MAX_SENTS:
            words = text_to_word_sequence(sent)
            k=0
            for w in words:
                if k < MAX_SENT_LENGTH:
                    if w in tokenizer.word_index:
                        data[i, j, k] = tokenizer.word_index[w]
                    k += 1

#

100%|██████████████████████████████████████████████████████████████████████████| 49972/49972 [00:36<00:00, 1364.45it/s]


In [21]:
# articles[20]

### Check 3:

Accessing the first element of `data` should give output similar to the following.

In [22]:
data[0, :, :]

array([[    4,   486,   434,  7205,    82,     4,  3733,   332,     6,
         3889,   351,     5,  1433,  2957,     2,    90,    13,   465,
            0,     0],
       [  758,    96,  1046,     4,  2676,  1751,     8,   189,     4,
         1218,  1075,  2027,   699,   159,     2,  3030,   450,     2,
          556,   244],
       [   90,  1066,  4112,  2346,    13,     4,  1093,  3301,    20,
            2,    90,     3,  1792,     2,   530,  2006,    16,    10,
            4,  3108],
       [  187,  3640,   972,   203,  2554,    44,  6771,  1720,  1251,
            6, 13307, 17922,     2,   777,    32,   739,  3987,    68,
           86,     0],
       [ 2346,    13,  1585,    39,  1095,   352,   778,     3,   368,
          261,  1776,     6,  4448,    71,   495,     0,     0,     0,
            0,     0],
       [    2,   699,   189,    20,     2,   434,    33,     4,  7412,
            5,  2257,  1251,     7,     4,  5267,     5,  1218,  1251,
           13,  3360],
       [  

# Repeat the same process for the headlines as well. Use variables named `texts_heading` and `articles_heading` accordingly. [5 marks] 

In [23]:
texts_heading = []
articles_heading = []

for text in tqdm(dataset['Headline']):
    texts_heading.append(text)
    articles_heading.append(sent_tokenize(text))

100%|█████████████████████████████████████████████████████████████████████████| 49972/49972 [00:03<00:00, 14342.73it/s]


In [24]:
# print(len(texts_heading))
# print(len(articles_heading))
print(texts_heading[0])
print('---')
print(articles_heading[0])

Soldier shot, Parliament locked down after gunfire erupts at war memorial
---
['Soldier shot, Parliament locked down after gunfire erupts at war memorial']


In [25]:
# from tensorflow.keras.preprocessing.text import text_to_word_sequence

# Use MAX_SENTS_HEADING (set to 1 in Step 2) instead of a hard-coded 1
data_heading = np.zeros(shape=(len(articles_heading), MAX_SENTS_HEADING, MAX_SENT_LENGTH), dtype='int32')

for i in tqdm(range(len(articles_heading))):
    #print(i)
    article = articles_heading[i]
    for j, sent in enumerate(article):
        if j < MAX_SENTS_HEADING:
            words = text_to_word_sequence(sent)
            k=0
            for w in words:
                if k < MAX_SENT_LENGTH:
                    if w in tokenizer.word_index:
                        data_heading[i, j, k] = tokenizer.word_index[w]
                    k += 1

#
print(articles_heading[0])
data_heading[0, :, :]

100%|█████████████████████████████████████████████████████████████████████████| 49972/49972 [00:02<00:00, 24057.62it/s]


['Soldier shot, Parliament locked down after gunfire erupts at war memorial']


array([[  734,   210,   344,  7113,   193,    35,  1335, 11488,    22,
          234,   684,     0,     0,     0,     0,     0,     0,     0,
            0,     0]])

### Now that the features are ready, let's prepare the labels for the model.

### Convert labels into one-hot vectors

You can use [get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) in pandas to create one-hot vectors.
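
As a small illustration of what `get_dummies` produces, the sketch below one-hot encodes a hypothetical handful of stance labels and then recovers the class names with `argmax`. The `dtype=int` argument is an optional choice made here so that the output shows 0/1 values.

```python
import pandas as pd

# Hypothetical sample of stance labels.
stances = pd.Series(['unrelated', 'agree', 'discuss', 'unrelated', 'disagree'])

# One column per class, in alphabetical order of the class names.
one_hot = pd.get_dummies(stances, dtype=int)
print(list(one_hot.columns))

# argmax over the columns maps each one-hot row back to its class id.
class_ids = one_hot.to_numpy().argmax(axis=1)
recovered = [one_hot.columns[i] for i in class_ids]
print(class_ids, recovered)
```

The column order matters later: the position of each class in the one-hot vector is the class id that `np.argmax` returns on the model's predictions.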

In [26]:
labels = pd.get_dummies(dataset.Stance)
print(labels.shape)
labels.head()

(49972, 4)


| | agree | disagree | discuss | unrelated |
|---|---|---|---|---|
| 41651 | 0 | 0 | 0 | 1 |
| 41657 | 0 | 0 | 0 | 1 |
| 41658 | 0 | 0 | 0 | 1 |
| 41659 | 0 | 0 | 0 | 1 |
| 41660 | 0 | 0 | 0 | 1 |


### Check 4:

The shapes of data and labels should match the numbers given below.

In [27]:
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (49972, 20, 20)
Shape of label tensor: (49972, 4)


### Shuffle the data

In [28]:
## get numbers upto no.of articles
indices = np.arange(data.shape[0])
## shuffle the numbers
np.random.shuffle(indices)

In [29]:
## shuffle the data
data = data[indices]
data_heading = data_heading[indices]
## shuffle the labels according to data
labels = labels.iloc[indices]
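
The reason for shuffling through a shared index array is that bodies, headings, and labels must stay aligned row by row. A minimal sketch on toy arrays, assuming a seeded NumPy generator (the notebook itself uses the unseeded `np.random.shuffle`):

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded only to make the sketch repeatable

# Toy arrays built so that row i can be recognised after shuffling.
labels = np.arange(5)                   # row i has label i
bodies = np.arange(10).reshape(5, 2)    # row i starts with 2 * i
headings = np.arange(5) * 100           # row i holds 100 * i

# One permutation, applied to every array, keeps the rows aligned.
indices = rng.permutation(len(labels))
bodies_s, headings_s, labels_s = bodies[indices], headings[indices], labels[indices]

print(indices)
print(labels_s, headings_s, bodies_s[:, 0])
```

Shuffling each array independently would break the pairing between an article, its headline, and its stance.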

### Split the training data into train and validation sets using an 80:20 ratio.


Use the variable names as given below:

x_train, x_val - for body of articles.

x_heading_train, x_heading_val - for heading of articles.

y_train - for training labels.

y_val - for validation labels.



In [30]:
x_train, x_val, x_heading_train, x_heading_val, y_train, y_val = train_test_split(data, data_heading, labels, 
                                                                                  test_size = 0.20, 
                                                                                  random_state = 42)

### Check 5:

The shape of x_train, x_val, y_train and y_val should match the below numbers.

In [31]:
print(x_train.shape)
print(y_train.shape)

print(x_val.shape)
print(y_val.shape)

(39977, 20, 20)
(39977, 4)
(9995, 20, 20)
(9995, 4)
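
The sizes above follow from the 80:20 split of 49972 rows. As a short worked check, assuming that `train_test_split` rounds the test size up when `test_size` is a fraction (which is consistent with the numbers shown):

```python
import math

n_samples = 49972
test_size = 0.20

# 0.2 * 49972 = 9994.4, which is rounded up to a whole number of samples.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```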


### Create embedding matrix with the glove embeddings


Run the code below to create `embedding_matrix`, which holds the GloVe embedding of every vocabulary word that is present in the GloVe word list.

In [32]:
# load the whole embedding into memory
embeddings_index = dict()
# f = open('./glove.6B.100d.txt')
# the context manager closes the file even if parsing fails
with open('data//glove pretrained embeddings//glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

# create a weight matrix for words in training docs
# NOTE: tokenizer.word_index starts at 1, so its largest index equals len(tokenizer.word_index).
# A matrix with len(tokenizer.word_index) rows has no row for that index, so the loop below
# only succeeds when that last word has no GloVe vector. Allocating one extra row would avoid
# this, but the Embedding layer's input_dim would then have to be increased to match.
embedding_matrix = np.zeros((len(tokenizer.word_index), EMBEDDING_DIM))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Loaded 400000 word vectors.
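
The construction of the weight matrix can be sketched on a toy vocabulary. This is an illustration under stated assumptions: the words and vectors are made up, and, unlike the cell above, the matrix is given `len(word_index) + 1` rows so that padding index 0 and the highest word index both have a row.

```python
import numpy as np

EMBEDDING_DIM = 3   # toy dimension for illustration
word_index = {'oov_token': 1, 'meteorite': 2, 'crater': 3}   # hypothetical vocabulary
embeddings_index = {                                          # hypothetical pre-trained vectors
    'meteorite': np.array([0.1, 0.2, 0.3], dtype='float32'),
    'crater': np.array([0.4, 0.5, 0.6], dtype='float32'),
}

# Row 0 is reserved for padding; rows 1..len(word_index) hold the words.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:          # words without a pre-trained vector stay all-zero
        embedding_matrix[i] = vector

# An embedding lookup is then just row indexing with the encoded ids.
encoded_sentence = np.array([2, 3, 0, 0])
print(embedding_matrix[encoded_sentence].shape)
```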


In [33]:
print(embedding_matrix.shape)
len(tokenizer.word_index)

(33643, 100)


33643

# Try the sequential model approach and report the accuracy score. [10 marks]  

### Import layers from Keras to build the model

REFERENCE:
- https://keras.io/examples/imdb_cnn/

In [214]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, BatchNormalization
from tensorflow.keras.layers import Bidirectional

from tensorflow.keras.layers import Input
from tensorflow.keras import Model
# concatenate is needed to join the article and heading branches of the model
from tensorflow.keras.layers import Conv1D, Conv2D, MaxPooling1D, Flatten, concatenate

from sklearn.metrics import accuracy_score, confusion_matrix,classification_report

### Model

In [39]:
"""
REFERENCE: https://stackoverflow.com/questions/46155868/keras-embedding-

In order to use words for natural language processing or machine learning tasks, 
it is necessary to first map them onto a continuous vector space, thus creating 
word vectors or word embeddings. The Keras Embedding layer is useful for constructing 
such word vectors.

input_dim : the vocabulary size. This is how many unique words are represented in your corpus.

output_dim : the desired dimension of the word vector. For example, if output_dim = 100, 
then every word will be mapped onto a vector with 100 elements, whereas if output_dim = 300, 
then every word will be mapped onto a vector with 300 elements.

input_length : the length of your sequences. For example, if your data consists of sentences, 
then this variable represents how many words there are in a sentence. As disparate sentences 
typically contain different number of words, it is usually required to pad your sequences 
such that all sentences are of equal length. The keras.preprocessing.sequence.pad_sequences function 
can be used for this (https://keras.io/preprocessing/sequence/).
"""

#### Create pre-trained embedding object
# embedding_layer = Embedding(len(tokenizer.word_index),
#                             EMBEDDING_DIM,
#                             weights=[embedding_matrix],
#                             input_length=MAX_SENT_LENGTH,
#                             trainable=False)

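# input_length is left unset because this layer is shared by two inputs of
# different lengths (400 tokens for the article, 20 tokens for the heading).
embedding_layer = Embedding(len(tokenizer.word_index),
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            trainable=False)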

In [36]:
# # embed_dim = 128 # 30 # 128
# lstm_out = 196 # 128 # 196

# model = Sequential()
# model.add(embedding_layer)
# model.add(Bidirectional(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2)))
# model.add(BatchNormalization())
# model.add(Dense(4, activation='softmax'))
# model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
# print(model.summary())

In [204]:
#
# Reference CNN: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
# Reference model: https://keras.io/getting-started/functional-api-guide/
#
# sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
sequence_input = Input(shape=(400,), dtype='int32', name='articleInput')
embedded_sequences = embedding_layer(sequence_input)
x1 = Conv1D(128, 5, activation='relu')(embedded_sequences)
# x1 = Conv2D(128, kernel_size=(3, 3), activation='relu', name='conv_1')(embedded_sequences)
x1 = MaxPooling1D(5)(x1)
x1 = Conv1D(128, 5, activation='relu')(x1)
x1 = MaxPooling1D(5)(x1)
x1 = Conv1D(128, 5, activation='relu')(x1)
x1 = MaxPooling1D(5)(x1)  # windowed max pooling (pool size 5), not global max pooling
x1 = Flatten()(x1)
x1 = Dense(128, activation='relu')(x1)

#
sequence_input2 = Input(shape=(20,), dtype='int32', name='headingInput')
embedded_sequences2 = embedding_layer(sequence_input2)
x2 = Conv1D(128, 5, activation='relu')(embedded_sequences2)
x2 = MaxPooling1D(5)(x2)
# x2 = Conv1D(128, 5, activation='relu')(x2)
# x2 = MaxPooling1D(5)(x2)
# x2 = Conv1D(128, 5, activation='relu')(x2)
# x2 = MaxPooling1D(5)(x2)  # global max pooling
x2 = Flatten()(x2)
x2 = Dense(128, activation='relu')(x2)

#
x = concatenate([x1, x2])
#
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)

In [205]:
print(x1.shape)
print(x2.shape)
x.shape

(None, 128)
(None, 128)


TensorShape([None, 64])
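
The sequence lengths inside the two convolutional branches can be traced with simple arithmetic. The sketch below assumes the Keras defaults used above: `Conv1D` with `'valid'` padding and stride 1, and `MaxPooling1D` with stride equal to the pool size. The helper names are hypothetical.

```python
def conv1d_len(n, kernel_size):
    """Output length of Conv1D with 'valid' padding and stride 1."""
    return n - kernel_size + 1

def maxpool1d_len(n, pool_size):
    """Output length of MaxPooling1D with 'valid' padding and stride == pool_size."""
    return n // pool_size

def branch_lengths(n, n_blocks, kernel_size=5, pool_size=5):
    """Sequence length after each layer of n_blocks Conv1D + MaxPooling1D pairs."""
    lengths = [n]
    for _ in range(n_blocks):
        n = conv1d_len(n, kernel_size)
        lengths.append(n)
        n = maxpool1d_len(n, pool_size)
        lengths.append(n)
    return lengths

article = branch_lengths(400, 3)   # three conv/pool blocks on the 400-token article
heading = branch_lengths(20, 1)    # one conv/pool block on the 20-token heading
print(article, heading)

# Flatten multiplies the final length by the 128 filters before the Dense(128) layer.
print(article[-1] * 128, heading[-1] * 128)
```

The article branch shrinks from 400 positions to 2, so a fourth conv/pool block with the same sizes would not fit.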

In [206]:
# preds = Dense(len(labels_index), activation='softmax')(x)
main_output = Dense(4, activation='softmax', name='stanceOutput')(x)

model = Model(inputs = [sequence_input, sequence_input2], outputs=main_output)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

#
print(model.summary())

Model: "model_4"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
articleInput (InputLayer)       [(None, 400)]        0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         multiple             3364300     articleInput[0][0]               
                                                                 headingInput[0][0]               
__________________________________________________________________________________________________
conv1d_202 (Conv1D)             (None, 396, 128)     64128       embedding_2[92][0]               
__________________________________________________________________________________________________
max_pooling1d_172 (MaxPooling1D (None, 79, 128)      0           conv1d_202[0][0]           
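
The parameter counts in the summary can be verified by hand. A short worked check using only numbers that appear in this notebook (vocabulary rows, embedding dimension, kernel size, and number of filters):

```python
# Embedding layer: one 100-dimensional vector per vocabulary row.
vocab_rows, embedding_dim = 33643, 100
embedding_params = vocab_rows * embedding_dim

# First Conv1D: 128 filters, each spanning 5 positions x 100 input channels, plus one bias each.
kernel_size, in_channels, filters = 5, embedding_dim, 128
conv_params = kernel_size * in_channels * filters + filters

print(embedding_params, conv_params)
```

Because the embedding layer is created with `trainable=False`, its weights are counted in the summary but are not updated during training.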

### Compile and fit the model

In [145]:
# batch_size = 32
# model.fit(x_train, np.array(y_train), epochs = 1, batch_size=batch_size, verbose = 1)

In [207]:
x_train = np.reshape(x_train, (x_train.shape[0], 400,))
x_val = np.reshape(x_val, (x_val.shape[0], 400,))
x_heading_train = np.reshape(x_heading_train, (x_heading_train.shape[0], 20,))
x_heading_val = np.reshape(x_heading_val, (x_heading_val.shape[0], 20,))

print(x_train.shape)
print(x_val.shape)
print(x_heading_train.shape)
print(x_heading_val.shape)

(39977, 400)
(9995, 400)
(39977, 20)
(9995, 20)
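
Reshaping from `(n, 20, 20)` to `(n, 400)` simply lays the sentences of each article end to end. A minimal sketch on a small toy array shows where each word lands:

```python
import numpy as np

n_articles, n_sents, sent_len = 2, 3, 4   # toy sizes standing in for (n, 20, 20)
data = np.arange(n_articles * n_sents * sent_len).reshape(n_articles, n_sents, sent_len)

# Row-major reshape: word k of sentence j moves to position j * sent_len + k.
flat = np.reshape(data, (data.shape[0], n_sents * sent_len))

print(flat.shape)
print(data[1, 2, 3], flat[1, 2 * sent_len + 3])
```

One consequence is that zero padding at the end of short sentences ends up in the middle of the flattened sequence, between one sentence and the next.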


In [208]:
def fitAndCheck(epochs=1):
    # happy learning!
    model.fit({'articleInput':x_train, 'headingInput':x_heading_train}, {'stanceOutput':np.array(y_train)}, 
              #validation_data=({'articleInput':x_val, 'headingInput':x_heading_val}, {'stanceOutput':y_val}),
              validation_split=0.1,
              epochs=epochs, batch_size=128, verbose=2)
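
The `validation_split=0.1` argument holds out part of the training arrays during `fit`. As a worked check, assuming the usual Keras behaviour for array inputs (the last fraction of the samples is held out, with the split point truncated to an integer):

```python
n_train_rows = 39977          # rows in x_train after the 80:20 split
validation_split = 0.1

# Rows before split_at are used for fitting; the rest are held out for validation.
split_at = int(n_train_rows * (1.0 - validation_split))
n_fit = split_at
n_holdout = n_train_rows - split_at

print(n_fit, n_holdout)
```

These counts agree with the "Train on ... samples, validate on ... samples" line in the training log. Note that this held-out slice is separate from `x_val`, which is only used for the accuracy check after each round of training.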

In [210]:
max_acc = 0

for i in tqdm(range(20)):
    fitAndCheck(5)
    y_pred = model.predict({'articleInput':x_val, 'headingInput':x_heading_val})
    #
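    # Arguments are passed as (predicted, true), so rows of the confusion matrix
    # are predicted classes and columns are true classes.
    acc = accuracy_score(np.argmax(y_pred, axis=1), np.argmax(np.array(y_val), axis=1))
    con_mat = confusion_matrix(np.argmax(y_pred, axis=1), np.argmax(np.array(y_val), axis=1))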
    print()
    print("LSTM model accuracy score  : ", acc)
    #print()
    print("LSTM model confusion matrix: \n", con_mat)
    
    if acc > max_acc:
        max_acc = acc
        print('Higher accuracy found, saving model...')
        modelBaseName = 'modelData/cnn_stance_' +str((i+1)*5) + '_' + str(int(round(acc*100, 0)))
        model.save(modelBaseName + '.h5')
        model.save_weights(modelBaseName + '_weights.h5')
    #print()
    #print("LSTM model classification report: \n", classification_report(np.argmax(y_pred, axis=1), 
    #                                                                    np.argmax(np.array(y_val), axis=1)))


  0%|                                                                                           | 0/20 [00:00<?, ?it/s]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 28s - loss: 0.2868 - acc: 0.8857 - val_loss: 0.3938 - val_acc: 0.8534
Epoch 2/5
35979/35979 - 26s - loss: 0.2349 - acc: 0.9070 - val_loss: 0.2985 - val_acc: 0.8859
Epoch 3/5
35979/35979 - 26s - loss: 0.1983 - acc: 0.9203 - val_loss: 0.3771 - val_acc: 0.8542
Epoch 4/5
35979/35979 - 27s - loss: 0.1697 - acc: 0.9328 - val_loss: 0.2338 - val_acc: 0.9185
Epoch 5/5
35979/35979 - 27s - loss: 0.1470 - acc: 0.9421 - val_loss: 0.2607 - val_acc: 0.9157

LSTM model accuracy score  :  0.9132566283141571
LSTM model confusion matrix: 
 [[ 407   21   18   31]
 [  78   87    6   16]
 [  29    4 1401   73]
 [ 205   32  354 7233]]
Higher accuracy found, saving model...
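
The reported accuracy can be reproduced from the confusion matrix itself. The sketch below uses the first matrix printed above and assumes the class order produced by `get_dummies` (alphabetical). Because the notebook passes predictions as the first argument to `confusion_matrix`, rows are predicted classes and columns are true classes.

```python
import numpy as np

classes = ['agree', 'disagree', 'discuss', 'unrelated']
con_mat = np.array([[407,  21,   18,   31],
                    [ 78,  87,    6,   16],
                    [ 29,   4, 1401,   73],
                    [205,  32,  354, 7233]])

# Accuracy is the share of samples on the diagonal.
accuracy = np.trace(con_mat) / con_mat.sum()

# Column sums are the true class counts, so diagonal / column sum is per-class recall.
true_counts = con_mat.sum(axis=0)
recall = np.diag(con_mat) / true_counts

print(accuracy)
for name, count, r in zip(classes, true_counts, recall):
    print(name, count, round(float(r), 3))
```

The per-class view shows how strongly the validation set is dominated by the `unrelated` class, so overall accuracy alone says little about the rarer `disagree` class.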



  5%|████                                                                              | 1/20 [02:29<47:15, 149.22s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 28s - loss: 0.1340 - acc: 0.9484 - val_loss: 0.2252 - val_acc: 0.9260
Epoch 2/5
35979/35979 - 27s - loss: 0.1239 - acc: 0.9530 - val_loss: 0.2203 - val_acc: 0.9310
Epoch 3/5
35979/35979 - 27s - loss: 0.1084 - acc: 0.9594 - val_loss: 0.2111 - val_acc: 0.9317
Epoch 4/5
35979/35979 - 27s - loss: 0.0955 - acc: 0.9638 - val_loss: 0.4528 - val_acc: 0.8819
Epoch 5/5
35979/35979 - 27s - loss: 0.0860 - acc: 0.9674 - val_loss: 0.3146 - val_acc: 0.9167

LSTM model accuracy score  :  0.9163581790895448
LSTM model confusion matrix: 
 [[ 574   46   33  255]
 [  29   84    0   10]
 [  42    4 1647  234]
 [  74   10   99 6854]]
Higher accuracy found, saving model...



 10%|████████▏                                                                         | 2/20 [04:59<44:51, 149.56s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 27s - loss: 0.0787 - acc: 0.9719 - val_loss: 0.4675 - val_acc: 0.9075
Epoch 2/5
35979/35979 - 27s - loss: 0.0778 - acc: 0.9731 - val_loss: 0.2566 - val_acc: 0.9322
Epoch 3/5
35979/35979 - 27s - loss: 0.0705 - acc: 0.9754 - val_loss: 0.2217 - val_acc: 0.9462
Epoch 4/5
35979/35979 - 28s - loss: 0.0625 - acc: 0.9781 - val_loss: 0.2315 - val_acc: 0.9395
Epoch 5/5
35979/35979 - 28s - loss: 0.0605 - acc: 0.9786 - val_loss: 0.3959 - val_acc: 0.9155

LSTM model accuracy score  :  0.9149574787393697
LSTM model confusion matrix: 
 [[ 518   13   34   65]
 [  52  110   24   31]
 [  25    3 1338   78]
 [ 124   18  383 7179]]



 15%|████████████▎                                                                     | 3/20 [07:32<42:38, 150.51s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 28s - loss: 0.0557 - acc: 0.9815 - val_loss: 0.3647 - val_acc: 0.9327
Epoch 2/5
35979/35979 - 28s - loss: 0.0534 - acc: 0.9816 - val_loss: 0.3430 - val_acc: 0.9240
Epoch 3/5
35979/35979 - 28s - loss: 0.0505 - acc: 0.9833 - val_loss: 0.2358 - val_acc: 0.9490
Epoch 4/5
35979/35979 - 28s - loss: 0.0491 - acc: 0.9840 - val_loss: 0.3039 - val_acc: 0.9422
Epoch 5/5
35979/35979 - 28s - loss: 0.0467 - acc: 0.9858 - val_loss: 0.2215 - val_acc: 0.9560

LSTM model accuracy score  :  0.951975987993997
LSTM model confusion matrix: 
 [[ 599   12   29   70]
 [  24  121    4   17]
 [  35    5 1659  130]
 [  61    6   87 7136]]
Higher accuracy found, saving model...



 20%|████████████████▍                                                                 | 4/20 [10:08<40:36, 152.25s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 28s - loss: 0.0420 - acc: 0.9860 - val_loss: 0.3432 - val_acc: 0.9432
Epoch 2/5
35979/35979 - 28s - loss: 0.0436 - acc: 0.9861 - val_loss: 0.2521 - val_acc: 0.9567
Epoch 3/5
35979/35979 - 28s - loss: 0.0408 - acc: 0.9869 - val_loss: 0.2878 - val_acc: 0.9517
Epoch 4/5
35979/35979 - 27s - loss: 0.0397 - acc: 0.9877 - val_loss: 0.4255 - val_acc: 0.9407
Epoch 5/5
35979/35979 - 27s - loss: 0.0416 - acc: 0.9875 - val_loss: 0.3820 - val_acc: 0.9322

LSTM model accuracy score  :  0.9326663331665833
LSTM model confusion matrix: 
 [[ 616   23   28  129]
 [  11  112    0   10]
 [  38    4 1601  221]
 [  54    5  150 6993]]



 25%|████████████████████▌                                                             | 5/20 [12:40<38:03, 152.21s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 28s - loss: 0.0431 - acc: 0.9880 - val_loss: 0.3606 - val_acc: 0.9527
Epoch 2/5
35979/35979 - 27s - loss: 0.0390 - acc: 0.9889 - val_loss: 0.4106 - val_acc: 0.9537
Epoch 3/5
35979/35979 - 28s - loss: 0.0372 - acc: 0.9892 - val_loss: 0.4419 - val_acc: 0.9295
Epoch 4/5
35979/35979 - 27s - loss: 0.0413 - acc: 0.9887 - val_loss: 0.3162 - val_acc: 0.9577
Epoch 5/5
35979/35979 - 27s - loss: 0.0380 - acc: 0.9901 - val_loss: 0.3125 - val_acc: 0.9525

LSTM model accuracy score  :  0.9559779889944973
LSTM model confusion matrix: 
 [[ 619   18   29   83]
 [  14  118    6   13]
 [  29    4 1654   93]
 [  57    4   90 7164]]
Higher accuracy found, saving model...



 30%|████████████████████████▌                                                         | 6/20 [15:13<35:32, 152.36s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 28s - loss: 0.0338 - acc: 0.9895 - val_loss: 0.4065 - val_acc: 0.9412
Epoch 2/5
35979/35979 - 28s - loss: 0.0365 - acc: 0.9903 - val_loss: 0.3274 - val_acc: 0.9520
Epoch 3/5
35979/35979 - 27s - loss: 0.0383 - acc: 0.9892 - val_loss: 0.3636 - val_acc: 0.9555
Epoch 4/5
35979/35979 - 26s - loss: 0.0332 - acc: 0.9907 - val_loss: 0.3301 - val_acc: 0.9540
Epoch 5/5
35979/35979 - 28s - loss: 0.0351 - acc: 0.9901 - val_loss: 0.4697 - val_acc: 0.9525

LSTM model accuracy score  :  0.9530765382691345
LSTM model confusion matrix: 
 [[ 562   11   14   45]
 [  16  122    2   12]
 [  49    4 1626   80]
 [  92    7  137 7216]]



 35%|████████████████████████████▋                                                     | 7/20 [17:45<32:59, 152.30s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 28s - loss: 0.0323 - acc: 0.9908 - val_loss: 0.3686 - val_acc: 0.9495
Epoch 2/5
35979/35979 - 28s - loss: 0.0330 - acc: 0.9914 - val_loss: 0.3702 - val_acc: 0.9577
Epoch 3/5
35979/35979 - 28s - loss: 0.0350 - acc: 0.9903 - val_loss: 0.3437 - val_acc: 0.9482
Epoch 4/5
35979/35979 - 27s - loss: 0.0381 - acc: 0.9912 - val_loss: 0.3402 - val_acc: 0.9567
Epoch 5/5
35979/35979 - 27s - loss: 0.0458 - acc: 0.9907 - val_loss: 0.3859 - val_acc: 0.9505

LSTM model accuracy score  :  0.9501750875437719
LSTM model confusion matrix: 
 [[ 590   12   37   67]
 [  13  118    4   13]
 [  25    4 1604   88]
 [  91   10  134 7185]]



 40%|████████████████████████████████▊                                                 | 8/20 [20:18<30:31, 152.62s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 28s - loss: 0.0338 - acc: 0.9915 - val_loss: 0.3773 - val_acc: 0.9522
Epoch 2/5
35979/35979 - 28s - loss: 0.0352 - acc: 0.9915 - val_loss: 0.3764 - val_acc: 0.9590
Epoch 3/5
35979/35979 - 28s - loss: 0.0350 - acc: 0.9908 - val_loss: 0.4311 - val_acc: 0.9527
Epoch 4/5
35979/35979 - 29s - loss: 0.0462 - acc: 0.9900 - val_loss: 0.4926 - val_acc: 0.9512
Epoch 5/5
35979/35979 - 27s - loss: 0.0362 - acc: 0.9918 - val_loss: 0.4444 - val_acc: 0.9465

LSTM model accuracy score  :  0.9463731865932966
LSTM model confusion matrix: 
 [[ 581   17   29   91]
 [  21  114    1   26]
 [  38    4 1654  126]
 [  79    9   95 7110]]



 45%|████████████████████████████████████▉                                             | 9/20 [22:52<28:03, 153.04s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 27s - loss: 0.0350 - acc: 0.9907 - val_loss: 0.3163 - val_acc: 0.9602
Epoch 2/5
35979/35979 - 27s - loss: 0.0375 - acc: 0.9913 - val_loss: 0.4804 - val_acc: 0.9422
Epoch 3/5
35979/35979 - 27s - loss: 0.0312 - acc: 0.9920 - val_loss: 0.3332 - val_acc: 0.9577
Epoch 4/5
35979/35979 - 26s - loss: 0.0378 - acc: 0.9907 - val_loss: 0.3718 - val_acc: 0.9565
Epoch 5/5
35979/35979 - 27s - loss: 0.0326 - acc: 0.9915 - val_loss: 0.4750 - val_acc: 0.9547

LSTM model accuracy score  :  0.9531765882941471
LSTM model confusion matrix: 
 [[ 621   13   30   93]
 [  13  126    1   22]
 [  43    3 1669  127]
 [  42    2   79 7111]]



 50%|████████████████████████████████████████▌                                        | 10/20 [25:21<25:16, 151.69s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 28s - loss: 0.0399 - acc: 0.9910 - val_loss: 0.3906 - val_acc: 0.9467
Epoch 2/5
35979/35979 - 26s - loss: 0.0355 - acc: 0.9915 - val_loss: 0.3047 - val_acc: 0.9547
Epoch 3/5
35979/35979 - 27s - loss: 0.0378 - acc: 0.9916 - val_loss: 0.4678 - val_acc: 0.9490
Epoch 4/5
35979/35979 - 28s - loss: 0.0308 - acc: 0.9926 - val_loss: 0.4097 - val_acc: 0.9462
Epoch 5/5
35979/35979 - 26s - loss: 0.0303 - acc: 0.9917 - val_loss: 0.4019 - val_acc: 0.9277

LSTM model accuracy score  :  0.9260630315157579
LSTM model confusion matrix: 
 [[ 568    8   17   73]
 [  25  122    8   39]
 [  18    5 1423   98]
 [ 108    9  331 7143]]



 55%|████████████████████████████████████████████▌                                    | 11/20 [27:49<22:36, 150.69s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 27s - loss: 0.0383 - acc: 0.9919 - val_loss: 0.3434 - val_acc: 0.9500
Epoch 2/5
35979/35979 - 28s - loss: 0.0380 - acc: 0.9919 - val_loss: 0.3951 - val_acc: 0.9562
Epoch 3/5
35979/35979 - 27s - loss: 0.0403 - acc: 0.9913 - val_loss: 0.4340 - val_acc: 0.9612
Epoch 4/5
35979/35979 - 27s - loss: 0.0350 - acc: 0.9924 - val_loss: 0.6390 - val_acc: 0.9502
Epoch 5/5
35979/35979 - 27s - loss: 0.0312 - acc: 0.9926 - val_loss: 0.6102 - val_acc: 0.9405

LSTM model accuracy score  :  0.9340670335167583
LSTM model confusion matrix: 
 [[ 603   12   44   99]
 [  16  118    3   31]
 [  21    3 1488   96]
 [  79   11  244 7127]]



 60%|████████████████████████████████████████████████▌                                | 12/20 [30:19<20:02, 150.33s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 28s - loss: 0.0375 - acc: 0.9917 - val_loss: 0.5766 - val_acc: 0.9575
Epoch 2/5
35979/35979 - 27s - loss: 0.0356 - acc: 0.9924 - val_loss: 0.3728 - val_acc: 0.9570
Epoch 3/5
35979/35979 - 27s - loss: 0.0316 - acc: 0.9924 - val_loss: 0.4808 - val_acc: 0.9587
Epoch 4/5
35979/35979 - 26s - loss: 0.0408 - acc: 0.9917 - val_loss: 0.6436 - val_acc: 0.9425
Epoch 5/5
35979/35979 - 25s - loss: 0.0322 - acc: 0.9929 - val_loss: 0.4743 - val_acc: 0.9537

LSTM model accuracy score  :  0.9542771385692846
LSTM model confusion matrix: 
 [[ 625    9   52   73]
 [  16  130    8   19]
 [  28    2 1652  130]
 [  50    3   67 7131]]



 65%|████████████████████████████████████████████████████▋                            | 13/20 [32:47<17:27, 149.66s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 27s - loss: 0.0378 - acc: 0.9919 - val_loss: 0.8045 - val_acc: 0.9577
Epoch 2/5
35979/35979 - 28s - loss: 0.0389 - acc: 0.9916 - val_loss: 0.4445 - val_acc: 0.9387
Epoch 3/5
35979/35979 - 27s - loss: 0.0345 - acc: 0.9922 - val_loss: 0.4641 - val_acc: 0.9592
Epoch 4/5
35979/35979 - 28s - loss: 0.0373 - acc: 0.9920 - val_loss: 0.3532 - val_acc: 0.9580
Epoch 5/5
35979/35979 - 27s - loss: 0.0386 - acc: 0.9917 - val_loss: 0.3983 - val_acc: 0.9570

LSTM model accuracy score  :  0.9565782891445723
LSTM model confusion matrix: 
 [[ 622   14   25   88]
 [  11  124    1    9]
 [  32    3 1678  119]
 [  54    3   75 7137]]
Higher accuracy found, saving model...



 70%|████████████████████████████████████████████████████████▋                        | 14/20 [35:21<15:05, 150.92s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 27s - loss: 0.0358 - acc: 0.9922 - val_loss: 0.5263 - val_acc: 0.9595
Epoch 2/5
35979/35979 - 27s - loss: 0.0367 - acc: 0.9922 - val_loss: 0.4557 - val_acc: 0.9567
Epoch 3/5
35979/35979 - 27s - loss: 0.0468 - acc: 0.9907 - val_loss: 0.4012 - val_acc: 0.9485
Epoch 4/5
35979/35979 - 27s - loss: 0.0371 - acc: 0.9913 - val_loss: 0.6385 - val_acc: 0.9550
Epoch 5/5
35979/35979 - 28s - loss: 0.0439 - acc: 0.9917 - val_loss: 0.4905 - val_acc: 0.9447

LSTM model accuracy score  :  0.9502751375687843
LSTM model confusion matrix: 
 [[ 602   26   19   87]
 [  22  107    2   13]
 [  29    3 1676  140]
 [  66    8   82 7113]]



 75%|████████████████████████████████████████████████████████████▊                    | 15/20 [37:51<12:33, 150.73s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 27s - loss: 0.0512 - acc: 0.9927 - val_loss: 1.2458 - val_acc: 0.8904
Epoch 2/5
35979/35979 - 26s - loss: 0.0402 - acc: 0.9930 - val_loss: 0.4290 - val_acc: 0.9572
Epoch 3/5
35979/35979 - 26s - loss: 0.0402 - acc: 0.9924 - val_loss: 0.3538 - val_acc: 0.9530
Epoch 4/5
35979/35979 - 27s - loss: 0.0466 - acc: 0.9916 - val_loss: 0.4197 - val_acc: 0.9620
Epoch 5/5
35979/35979 - 26s - loss: 0.0341 - acc: 0.9917 - val_loss: 1.9050 - val_acc: 0.9297

LSTM model accuracy score  :  0.9352676338169085
LSTM model confusion matrix: 
 [[ 570    7   22   56]
 [  28  126    4   11]
 [  22    3 1410   44]
 [  99    8  343 7242]]



 80%|████████████████████████████████████████████████████████████████▊                | 16/20 [40:18<09:58, 149.60s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 27s - loss: 0.0448 - acc: 0.9919 - val_loss: 0.4271 - val_acc: 0.9597
Epoch 2/5
35979/35979 - 25s - loss: 0.0344 - acc: 0.9924 - val_loss: 0.3652 - val_acc: 0.9557
Epoch 3/5
35979/35979 - 27s - loss: 0.0366 - acc: 0.9928 - val_loss: 0.3641 - val_acc: 0.9605
Epoch 4/5
35979/35979 - 26s - loss: 0.0372 - acc: 0.9924 - val_loss: 0.3954 - val_acc: 0.9605
Epoch 5/5
35979/35979 - 27s - loss: 0.0387 - acc: 0.9926 - val_loss: 0.5005 - val_acc: 0.9555

LSTM model accuracy score  :  0.9585792896448224
LSTM model confusion matrix: 
 [[ 591   10   14   51]
 [  12  125    1   12]
 [  40    3 1688  113]
 [  76    6   76 7177]]
Higher accuracy found, saving model...



 85%|████████████████████████████████████████████████████████████████████▊            | 17/20 [42:45<07:26, 148.94s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 26s - loss: 0.0291 - acc: 0.9929 - val_loss: 0.3280 - val_acc: 0.9602
Epoch 2/5
35979/35979 - 27s - loss: 0.0397 - acc: 0.9924 - val_loss: 0.7155 - val_acc: 0.9597
Epoch 3/5
35979/35979 - 26s - loss: 0.0428 - acc: 0.9944 - val_loss: 0.6640 - val_acc: 0.9537
Epoch 4/5
35979/35979 - 28s - loss: 0.3511 - acc: 0.9928 - val_loss: 0.6671 - val_acc: 0.9600
Epoch 5/5
35979/35979 - 28s - loss: 0.0367 - acc: 0.9928 - val_loss: 0.5340 - val_acc: 0.9555

LSTM model accuracy score  :  0.960280140070035
LSTM model confusion matrix: 
 [[ 593    7   22   46]
 [  41  134    7   26]
 [  34    3 1676   86]
 [  51    0   74 7195]]
Higher accuracy found, saving model...



 90%|████████████████████████████████████████████████████████████████████████▉        | 18/20 [45:15<04:58, 149.23s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 28s - loss: 0.0503 - acc: 0.9926 - val_loss: 0.4838 - val_acc: 0.9585
Epoch 2/5
35979/35979 - 27s - loss: 0.0336 - acc: 0.9932 - val_loss: 0.6353 - val_acc: 0.9577
Epoch 3/5
35979/35979 - 27s - loss: 0.0362 - acc: 0.9931 - val_loss: 0.8665 - val_acc: 0.9477
Epoch 4/5
35979/35979 - 27s - loss: 0.0460 - acc: 0.9925 - val_loss: 0.7409 - val_acc: 0.9452
Epoch 5/5
35979/35979 - 27s - loss: 0.0322 - acc: 0.9926 - val_loss: 0.7537 - val_acc: 0.9530

LSTM model accuracy score  :  0.9573786893446723
LSTM model confusion matrix: 
 [[ 623    9   22   71]
 [  20  132    3   20]
 [  39    2 1691  139]
 [  37    1   63 7123]]



 95%|████████████████████████████████████████████████████████████████████████████▉    | 19/20 [47:47<02:30, 150.07s/it]

Train on 35979 samples, validate on 3998 samples
Epoch 1/5
35979/35979 - 28s - loss: 0.0343 - acc: 0.9917 - val_loss: 0.4850 - val_acc: 0.9575
Epoch 2/5
35979/35979 - 26s - loss: 0.0301 - acc: 0.9934 - val_loss: 0.8594 - val_acc: 0.9622
Epoch 3/5
35979/35979 - 25s - loss: 0.0447 - acc: 0.9924 - val_loss: 0.9792 - val_acc: 0.9432
Epoch 4/5
35979/35979 - 27s - loss: 0.0375 - acc: 0.9936 - val_loss: 0.7541 - val_acc: 0.9637
Epoch 5/5
35979/35979 - 26s - loss: 0.0586 - acc: 0.9931 - val_loss: 0.4869 - val_acc: 0.9485

LSTM model accuracy score  :  0.9526763381690846
LSTM model confusion matrix: 
 [[ 581   14   22   60]
 [  10  101    2    3]
 [  33   11 1626   76]
 [  95   18  129 7214]]



100%|█████████████████████████████████████████████████████████████████████████████████| 20/20 [50:14<00:00, 150.74s/it]
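
Each accuracy score in the log can be recomputed from the confusion matrix printed with it: the correct predictions sit on the diagonal, so accuracy is the trace divided by the total count. The sketch below checks this for the final run, using the matrix copied from the output above. The helper names `accuracy_from_confusion` and `diagonal_ratios` are illustrative and not part of the notebook. Which of the two per-class ratios is recall and which is precision depends on the argument order used when the matrix was built, which the log alone does not show. With scikit-learn's documented order `confusion_matrix(y_true, y_pred)`, rows are true labels, so the row ratio is recall and the column ratio is precision.

```python
import numpy as np

# Confusion matrix copied from the final run (20/20) in the log above.
con_mat = np.array([
    [581, 14, 22, 60],
    [10, 101, 2, 3],
    [33, 11, 1626, 76],
    [95, 18, 129, 7214],
])


def accuracy_from_confusion(cm):
    """Share of samples on the diagonal, i.e. correctly classified."""
    return np.trace(cm) / cm.sum()


def diagonal_ratios(cm):
    """Per-class diagonal share along rows and along columns."""
    diag = np.diag(cm)
    return diag / cm.sum(axis=1), diag / cm.sum(axis=0)


acc = accuracy_from_confusion(con_mat)
by_row, by_col = diagonal_ratios(con_mat)

print(f"samples evaluated     : {con_mat.sum()}")
print(f"accuracy              : {acc:.4f}")
print("diagonal / row sums   :", np.round(by_row, 3))
print("diagonal / column sums:", np.round(by_col, 3))
```

The recomputed accuracy agrees with the logged value of 0.9527 for that run. The per-class ratios are more informative than overall accuracy here, because the fourth class holds most of the samples and dominates the total.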


In [200]:
# Evaluate on the validation set. scikit-learn metrics expect (y_true, y_pred), in that order.
# y_true = np.argmax(np.array(y_val), axis=1)
# y_hat = np.argmax(y_pred, axis=1)
# print(y_hat[0:10])
# print(y_true[0:10])
# acc = accuracy_score(y_true, y_hat)
# con_mat = confusion_matrix(y_true, y_hat)
# print("LSTM model accuracy score  : ", acc)
# print()
# print("LSTM model confusion matrix: \n", con_mat)
# print()
# print("LSTM model classification report: \n", classification_report(y_true, y_hat))

## Build the same model with attention layers included for better performance (Optional)

## Fit the model and report the accuracy score for the model with attention layer (Optional)

# EXTRAS

REFERENCES:
- https://machinelearningmastery.com/encoder-decoder-attention-sequence-to-sequence-prediction-keras/
- https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention
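
As background for the optional attention sections, the `tf.keras.layers.Attention` layer referenced above performs dot-product (Luong-style) attention: it scores each query step against each key step, applies a softmax over the key steps, and returns the weighted sum of the values. Below is a minimal NumPy sketch of that computation, not the Keras implementation itself. The function names, the tensor shapes, and the random inputs are assumptions chosen for illustration, and the key defaults to the value when it is not supplied.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def dot_product_attention(query, value, key=None):
    """Dot-product attention.

    query: (batch, Tq, dim); value and key: (batch, Tv, dim).
    Returns the context vectors (batch, Tq, dim) and weights (batch, Tq, Tv).
    """
    if key is None:
        key = value
    scores = query @ key.transpose(0, 2, 1)  # similarity of each query/key pair
    weights = softmax(scores, axis=-1)       # normalise over the key steps
    return weights @ value, weights


# Hypothetical shapes: 2 samples, 3 query steps, 5 value steps, 8 features.
rng = np.random.default_rng(0)
query = rng.normal(size=(2, 3, 8))
value = rng.normal(size=(2, 5, 8))

context, weights = dot_product_attention(query, value)
print(context.shape, weights.shape)
```

In a stance model, the query could be the encoded headline and the value the per-step LSTM outputs for the article body, so that each headline step is summarised by the body steps most similar to it. That pairing is one possible design and is not taken from the notebook.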

In [212]:
from random import randint

# generate a sequence of random integers
def generate_sequence(length, n_unique):
    return [randint(0, n_unique-1) for _ in range(length)]

# generate random sequence
sequence = generate_sequence(5, 50)
print(sequence)

[0, 47, 25, 33, 46]


In [213]:
# one hot encode sequence
def one_hot_encode(sequence, n_unique):
    encoding = list()
    for value in sequence:
        vector = [0 for _ in range(n_unique)]
        vector[value] = 1
        encoding.append(vector)
    return np.array(encoding)