# CNN Classification with MPQA Dataset
<hr>

The __modus operandi__ for text classification is to represent words with __word embeddings__ and to use a convolutional neural network (CNN) to learn how to discriminate between documents.

__Yoav Goldberg__ commented in _A Primer on Neural Network Models for Natural Language Processing, 2015._ :
> _The non-linearity of the network, as well as the ability to easily integrate pre-trained
word embeddings, often lead to superior classification accuracy._

He also commented in _Neural Network Methods for Natural Language Processing, 2017_ :
> ... _the CNN is in essence a feature-extracting architecture. ... The CNN layer's responsibility is to extract meaningful sub-structures that are useful for the overall prediction task at hand._
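
To make the "feature-extracting" view concrete, here is a minimal pure-Python sketch of a single convolution filter sliding over a sentence of word vectors, followed by max-over-time pooling (the pooling used in Yoon Kim's architecture; the models in this notebook use a pool size of 2 instead). The toy 2-dimensional vectors, the filter weights, and the helper names `conv1d_feature_map` and `max_over_time` are illustrative assumptions, not part of the trained model.

```python
def relu(x):
    # Rectified linear activation
    return x if x > 0 else 0.0

def conv1d_feature_map(sentence, kernel, bias=0.0):
    """Slide one filter over a sentence.

    sentence: list of word vectors, shape (n_words, dim)
    kernel:   list of weight vectors, shape (kernel_size, dim)
    """
    k = len(kernel)
    features = []
    for start in range(len(sentence) - k + 1):
        window = sentence[start:start + k]
        total = bias
        for word_vec, weight_vec in zip(window, kernel):
            total += sum(x * w for x, w in zip(word_vec, weight_vec))
        features.append(relu(total))
    return features

def max_over_time(features):
    # Keep only the strongest response of the filter across the sentence
    return max(features)

# Hypothetical 4-word sentence with 2-dimensional embeddings
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Hypothetical filter that reads 2 words at a time
kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d_feature_map(sentence, kernel)
pooled = max_over_time(feature_map)
print(feature_map, pooled)
```

Each entry of `feature_map` scores one window of consecutive words, so a filter with kernel size 2 behaves like a learned bigram detector.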

We will build a CNN text classification model on the MPQA dataset. Since there is no standard train/test split for this dataset, we will use 10-fold cross-validation (CV).
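
As a rough illustration of what 10-fold cross-validation does with 10,606 samples, the sketch below splits shuffled indices into ten test folds so that every sample is tested exactly once. The helper `k_fold_indices` is a hypothetical stand-in written for this example; the notebook itself relies on scikit-learn's `KFold`.

```python
import random

def k_fold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds receive one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10606, n_splits=10))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)
```

The final score is then the average of the ten per-fold test accuracies.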

The CNN model is inspired by __Yoon Kim__'s paper on using word embeddings with a CNN for text classification. The hyperparameters we use, based on his study, are as follows:
- Transfer function: rectified linear.
- Kernel sizes: 1, 2, 3, 4, 5.
- Number of filters: 100.
- Dropout rate: 0.5.
- Weight regularization (L2) constraint: 3.
- Batch size: 50.
- Update rule: Adam (Kim's paper uses Adadelta).
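
The "L2 constraint: 3" item is a max-norm constraint: whenever the L2 norm of a weight vector exceeds 3, the vector is rescaled so that its norm is exactly 3. The sketch below shows that rescaling on plain Python lists. The function name `clip_to_max_norm` is an assumption made for this example; in the models below the same idea is delegated to Keras' `MaxNorm` constraint.

```python
import math

def clip_to_max_norm(weights, max_value=3.0):
    """Rescale a weight vector if its L2 norm exceeds max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)          # already within the constraint
    scale = max_value / norm
    return [w * scale for w in weights]

small = clip_to_max_norm([1.0, 2.0])  # norm is about 2.24, left unchanged
large = clip_to_max_norm([3.0, 4.0])  # norm is 5, rescaled to norm 3
print(small, large)
```

Unlike an L2 penalty added to the loss, this constraint leaves small weights untouched and only acts on vectors that grow too large.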

## Load the library

In [2]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
import random
# from nltk.tokenize import TweetTokenizer
from sklearn.model_selection import KFold

%config IPCompleter.greedy=True
%config IPCompleter.use_jedi=False
# nltk.download('twitter_samples')

In [3]:
tf.config.list_physical_devices('GPU') 

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

## Load the Dataset

In [4]:
corpus = pd.read_pickle('../../../0_data/SUBJ/SUBJ.pkl')
corpus.label = corpus.label.astype(int)
print(corpus.shape)
corpus

(10606, 3)


| | sentence | label | split |
|---|---|---|---|
| 0 | complaining | 0 | train |
| 1 | failing to support | 0 | train |
| 2 | desperately needs | 0 | train |
| 3 | many years of decay | 0 | train |
| 4 | no quick fix | 0 | train |
| ... | ... | ... | ... |
| 10601 | urged | 1 | train |
| 10602 | strictly abide | 1 | train |
| 10603 | hope | 1 | train |
| 10604 | strictly abide | 1 | train |


In [5]:
corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10606 entries, 0 to 10605
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  10606 non-null  object
 1   label     10606 non-null  int32 
 2   split     10606 non-null  object
dtypes: int32(1), object(2)
memory usage: 207.3+ KB


In [6]:
corpus.groupby( by='label').count()

| label | sentence | split |
|---|---|---|
| 0 | 7294 | 7294 |
| 1 | 3312 | 3312 |


In [7]:
# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

In [8]:
sentences[0]

'complaining'

<!--## Split Dataset-->

# Data Preprocessing
<hr>

When preparing data for word embeddings, especially pre-trained embeddings such as Word2Vec or GloVe, __don't apply standard preprocessing steps like stemming or stopword removal__. With count-based feature extraction (e.g. TF-IDF) we cleaned the text by removing stopwords, stemming, and so on; here we keep these words, because they carry information that may help the model learn.

__Tomas Mikolov__, one of the developers of Word2Vec, in _word2vec-toolkit: google groups thread., 2015_, suggests only very minimal text cleaning is required when learning a word embedding model.

In short, what we will do is:
- Punctuation removal
- Lowercasing
- Tokenization

The steps above are handled by the __Tokenizer__ class in TensorFlow.

- <b>One way to choose the maximum sequence length is to just pick the length of the longest sentence in the training set.</b>
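
The three steps above can be approximated in plain Python as follows. This is only a sketch of the idea, not the `Tokenizer` implementation: it assumes that every ASCII punctuation character except the apostrophe is treated as a separator, and the helper names `simple_tokenize`, `build_word_index`, and `texts_to_ids` are made up for this example.

```python
import string
from collections import Counter

# Assumption: all ASCII punctuation except the apostrophe acts as a separator
PUNCT_TO_SPACE = str.maketrans(
    {ch: " " for ch in string.punctuation if ch != "'"}
)

def simple_tokenize(text):
    # Lowercase, replace punctuation with spaces, split on whitespace
    return text.lower().translate(PUNCT_TO_SPACE).split()

def build_word_index(texts, oov_token="<UNK>"):
    counts = Counter(tok for text in texts for tok in simple_tokenize(text))
    # Index 0 is left free for padding, index 1 is the out-of-vocabulary token
    word_index = {oov_token: 1}
    # More frequent words receive smaller indices
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_ids(texts, word_index, oov_token="<UNK>"):
    oov_id = word_index[oov_token]
    return [[word_index.get(tok, oov_id) for tok in simple_tokenize(text)]
            for text in texts]

texts = ["A very complicated process!", "No quick fix.", "A quick process"]
word_index = build_word_index(texts)
ids = texts_to_ids(["a QUICK unknown-word"], word_index)
print(word_index)
print(ids)
```

Words that never appeared when the index was built fall back to the `<UNK>` index, which is why the tokenizer should be fitted on training data only.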
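
Once a maximum length is chosen, shorter sequences are padded with zeros and longer ones are truncated. The sketch below mimics "post" padding and "post" truncation on plain lists; `pad_post` is a hypothetical helper written for illustration, and the notebook uses Keras' `pad_sequences` for the real work.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end and truncate at the end so every row has length maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                        # drop tokens past maxlen
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

sequences = [[5, 44, 946, 581], [7], [2, 3, 4, 5, 6, 7]]

# Option from the text: use the longest sequence as the maximum length
longest = max(len(seq) for seq in sequences)
padded = pad_post(sequences, maxlen=longest)

# A shorter maximum length truncates the long sequences instead
clipped = pad_post(sequences, maxlen=3)
print(padded)
print(clipped)
```

Picking the longest training sentence avoids any truncation, at the cost of more padding when a few sentences are much longer than the rest.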

In [9]:
# Define a function to compute the max length of sequence
def max_length(sequences):
    '''
    input:
        sequences: a 2D list of integer sequences
    output:
        max_length: the max length of the sequences
    '''
    # default=0 keeps the original behaviour for an empty list
    return max((len(seq) for seq in sequences), default=0)

In [12]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

trunc_type='post'
padding_type='post'
oov_tok = "<UNK>"

# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

# Cleaning and Tokenization
tokenizer = Tokenizer(oov_token=oov_tok)
tokenizer.fit_on_texts(sentences)

print("Example of sentence: ", sentences[8])

# Turn the text into sequence
training_sequences = tokenizer.texts_to_sequences(sentences)
max_len = max_length(training_sequences)

print('Into a sequence of int:', training_sequences[8])

# Pad the sequence to have the same size
training_padded = pad_sequences(training_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
print('Into a padded sequence:', training_padded[8])

Example of sentence:  a very complicated process
Into a sequence of int: [5, 44, 946, 581]
Into a padded sequence: [  5  44 946 581   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]


In [13]:
word_index = tokenizer.word_index
# See the first 10 words in the vocabulary
for i, word in enumerate(word_index):
    print(word, word_index.get(word))
    if i==9:
        break
vocab_size = len(word_index)+1
print(vocab_size)

<UNK> 1
the 2
of 3
to 4
a 5
and 6
not 7
is 8
in 9
be 10
6236


# Model 1: Embedding Random
<hr>

A __standard model__ for document classification is to use (quoted from __Jason Brownlee__, the author of [machinelearningmastery.com](https://machinelearningmastery.com)):
>- Word Embedding: A distributed representation of words where different words that have a similar meaning (based on their usage) also have a similar representation.
>- Convolutional Model: A feature extraction model that learns to extract salient features from documents represented using a word embedding.
>- Fully Connected Model: The interpretation of extracted features in terms of a predictive output.


Therefore, the model is comprised of the following elements:
- __Input layer__ that defines the length of input sequences.
- __Embedding layer__ set to the size of the vocabulary and 300-dimensional real-valued representations.
- __Conv1D layer__ with 100 filters and a kernel size set to the number of words to read at once.
- __MaxPooling1D layer__ to consolidate the output from the convolutional layer.
- __Flatten layer__ to reduce the three-dimensional output to two dimensions for the dense layers.

The CNN model is inspired by __Yoon Kim__'s paper on using word embeddings with a CNN for text classification. The hyperparameters we use, based on his study, are as follows:
- Transfer function: rectified linear.
- Kernel sizes: 3, 4, 5.
- Number of filters: 100.
- Dropout rate: 0.5.
- Weight regularization (L2): 3.
- Batch size: 50.
- Update rule: Adam

We will search for the best parameters using __grid search__ and 10-fold cross-validation.

## CNN Model

Now, we will build Convolutional Neural Network (CNN) models to classify encoded documents as either positive or negative.

The model takes inspiration from `CNN for Sentence Classification` by *Yoon Kim*.

We define our CNN model as follows:
- One Conv layer with 100 filters, a kernel size chosen by the grid search, and a ReLU or tanh activation function;
- One MaxPool layer with pool size = 2;
- One Dropout layer after flattening, followed by a 10-unit Dense layer, a second Dropout layer, and a sigmoid output unit;
- Optimizer: Adam (an adaptive learning-rate optimizer that is a common default choice)
- Loss function: binary cross-entropy (suited to binary classification problems)
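
To show why binary cross-entropy suits a sigmoid output, the sketch below computes the loss by hand for a confident, correct prediction and for a confident, wrong one. The helper names and the clipping constant `eps` are assumptions for this example and are not taken from the Keras implementation.

```python
import math

def sigmoid(z):
    # Squash a raw score into a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0]
confident = binary_cross_entropy(labels, [0.9, 0.1])  # correct and confident
wrong = binary_cross_entropy(labels, [0.1, 0.9])      # wrong and confident
print(round(confident, 4), round(wrong, 4))
```

The loss grows sharply as the predicted probability moves away from the true label, which penalizes confident mistakes far more than hesitant ones.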

**Note**: 
- The whole purpose of dropout layers is to tackle the problem of over-fitting and to introduce generalization to the model. Hence it is advisable to keep dropout parameter near 0.5 in hidden layers. 
- https://missinglink.ai/guides/keras/keras-conv1d-working-1d-convolutional-neural-networks-keras/
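
The dropout note above can be illustrated with a small sketch of "inverted" dropout: during training each activation is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)`, while at inference the values pass through unchanged. The function `inverted_dropout` and its `seed` parameter are assumptions made for this example.

```python
import random

def inverted_dropout(values, rate=0.5, training=True, seed=None):
    """Zero each value with probability `rate`; scale survivors by 1/(1-rate)."""
    if not training or rate == 0.0:
        return list(values)               # dropout is inactive at inference
    rng = random.Random(seed)
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]

layer_output = [1.0, 2.0, 3.0, 4.0]
train_out = inverted_dropout(layer_output, rate=0.5, seed=42)
test_out = inverted_dropout(layer_output, rate=0.5, training=False)
print(train_out)
print(test_out)
```

Scaling the surviving activations keeps their expected sum the same in training and inference, so no rescaling is needed at prediction time.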

In [14]:
from tensorflow.keras import regularizers
from tensorflow.keras.constraints import MaxNorm

def define_model(filters = 100, kernel_size = 3, activation='relu', input_dim = None, output_dim=300, max_length = None ):
    
    model = tf.keras.models.Sequential([
        # Use the input_dim argument (vocabulary size) rather than the global vocab_size
        tf.keras.layers.Embedding(input_dim=input_dim, 
                                  output_dim=output_dim, 
                                  input_length=max_length, 
                                  input_shape=(max_length, )),
        
        tf.keras.layers.Conv1D(filters=filters, kernel_size = kernel_size, activation = activation, 
                               # Conv1D kernel shape is (kernel_size, input_channels, filters);
                               # axis=[0,1] caps the L2 norm of each filter's weights at 3
                               kernel_constraint= MaxNorm( max_value=3, axis=[0,1])),
        
        tf.keras.layers.MaxPool1D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation=activation, 
                              # set axis to 0 to constrain each weight vector of length (input_dim,) in dense layer
                              kernel_constraint = MaxNorm( max_value=3, axis=0)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(units=1, activation='sigmoid')
    ])
    
    model.compile( loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    return model

In [15]:
# Pass the vocabulary size explicitly so the embedding matches the tokenizer
model_0 = define_model( input_dim=vocab_size, max_length=100)
model_0.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 300)          1870800   
_________________________________________________________________
conv1d (Conv1D)              (None, 98, 100)           90100     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 49, 100)           0         
_________________________________________________________________
flatten (Flatten)            (None, 4900)              0         
_________________________________________________________________
dropout (Dropout)            (None, 4900)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                49010     
_________________________________________________________________
dropout_1 (Dropout)          (None, 10)                0
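
The output shapes and parameter counts in the summary above can be reproduced by hand. The sketch below assumes "valid" padding and stride 1 for the convolution, non-overlapping pooling windows, a vocabulary size of 6236, and the default 300-dimensional embedding; `summarize` is a hypothetical helper written for this check.

```python
def conv1d_output_length(input_length, kernel_size):
    # 'valid' padding with stride 1: one output per full window
    return input_length - kernel_size + 1

def summarize(vocab_size, embed_dim, seq_len, filters,
              kernel_size, pool_size, dense_units):
    conv_len = conv1d_output_length(seq_len, kernel_size)
    pooled_len = conv_len // pool_size
    flat_units = pooled_len * filters
    return {
        "embedding_params": vocab_size * embed_dim,
        "conv_output": (conv_len, filters),
        # one (kernel_size x embed_dim) weight block plus one bias per filter
        "conv_params": kernel_size * embed_dim * filters + filters,
        "pool_output": (pooled_len, filters),
        "flatten_units": flat_units,
        "dense_params": flat_units * dense_units + dense_units,
    }

report = summarize(vocab_size=6236, embed_dim=300, seq_len=100,
                   filters=100, kernel_size=3, pool_size=2, dense_units=10)
for name, value in report.items():
    print(name, value)
```

Almost all of the parameters sit in the embedding layer, which is why replacing it with frozen pre-trained vectors changes the number of trainable weights so much.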

In [16]:
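class myCallback(tf.keras.callbacks.Callback):
    # Override on_epoch_end() to stop once training accuracy passes a threshold.
    # This class is kept for reference; the training loop below uses EarlyStopping instead.
    def on_epoch_end(self, epoch, logs=None):
        acc = (logs or {}).get('accuracy')
        if acc is not None and acc > 0.93:
            print("\nReached 93% accuracy so cancelling training!")
            self.model.stop_training=True


# Stop when validation accuracy has not improved for 5 epochs and restore the best weights
callbacks = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', min_delta=0, 
                                             patience=5, verbose=2, 
                                             mode='auto', restore_best_weights=True)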
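
The patience logic of early stopping can be sketched in a few lines of plain Python. The function `early_stopping_epoch` is a simplified stand-in written for illustration, not the Keras implementation, and the validation accuracies are copied from the first training log shown in the next section.

```python
def early_stopping_epoch(val_scores, patience=5):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None if the patience limit is never reached.
    """
    best_score = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # Improvement: remember this epoch and reset the counter
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# val_accuracy per epoch from the first fold's log
val_accuracy = [0.8435, 0.8549, 0.8483, 0.8435, 0.8530, 0.8464, 0.8464]
stop_epoch, best_epoch = early_stopping_epoch(val_accuracy, patience=5)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the weights from `best_epoch` are the ones used for the final evaluation of that fold.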

## Train and Test the Model

In [None]:
# Parameter Initialization
trunc_type='post'
padding_type='post'
oov_tok = "<UNK>"
activations = ['relu', 'tanh']
filters = 100
kernel_sizes = [1, 2, 3, 4, 5, 6]

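# Note: the 'Filters' column stores the kernel size tried in each run (the filter count is fixed at 100)
columns = ['Activation', 'Filters', 'acc1', 'acc2', 'acc3', 'acc4', 'acc5', 'acc6', 'acc7', 'acc8', 'acc9', 'acc10', 'AVG']
record = pd.DataFrame(columns = columns)

# prepare cross validation with 10 splits and shuffle = True
# (recent scikit-learn versions require keyword arguments here)
kfold = KFold(n_splits=10, shuffle=True)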

# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

for activation in activations:
    for kernel_size in kernel_sizes:
        # kfold.split() will return set indices for each split
        acc_list = []
        for train, test in kfold.split(sentences):
            
            train_x, test_x = [], []
            train_y, test_y = [], []
            
            for i in train:
                train_x.append(sentences[i])
                train_y.append(labels[i])

            for i in test:
                test_x.append(sentences[i])
                test_y.append(labels[i])

            # Turn the labels into a numpy array
            train_y = np.array(train_y)
            test_y = np.array(test_y)

            # Encode the data: fit the tokenizer on the training fold only,
            # so test-fold words unseen in training map to <UNK>
            tokenizer = Tokenizer(oov_token=oov_tok)
            tokenizer.fit_on_texts(train_x)

            # Turn the text into sequence
            training_sequences = tokenizer.texts_to_sequences(train_x)
            test_sequences = tokenizer.texts_to_sequences(test_x)

            max_len = max_length(training_sequences)

            # Pad the sequence to have the same size
            Xtrain = pad_sequences(training_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
            Xtest = pad_sequences(test_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)

            word_index = tokenizer.word_index
            vocab_size = len(word_index)+1

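            # Build the model for this fold
            model = define_model(filters, kernel_size, activation, input_dim=vocab_size, max_length=max_len)

            # Train the model
            # Note: the test fold also serves as validation data for early stopping,
            # so the accuracy reported below is an optimistic estimate
            model.fit(Xtrain, train_y, batch_size=50, epochs=15, verbose=2, 
                      callbacks=[callbacks], validation_data=(Xtest, test_y))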

            # evaluate the model
            loss, acc = model.evaluate(Xtest, test_y, verbose=0)
            print('Test Accuracy: {}'.format(acc*100))

            acc_list.append(acc*100)
            
        mean_acc = np.array(acc_list).mean()
        parameters = [activation, kernel_size]
        entries = parameters + acc_list + [mean_acc]

        temp = pd.DataFrame([entries], columns=columns)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        record = pd.concat([record, temp], ignore_index=True)
        print()
        print(record)
        print()



Epoch 1/15
191/191 - 12s - loss: 0.5715 - accuracy: 0.7051 - val_loss: 0.4476 - val_accuracy: 0.8435
Epoch 2/15
191/191 - 6s - loss: 0.3333 - accuracy: 0.8698 - val_loss: 0.4152 - val_accuracy: 0.8549
Epoch 3/15
191/191 - 6s - loss: 0.2107 - accuracy: 0.9331 - val_loss: 0.4450 - val_accuracy: 0.8483
Epoch 4/15
191/191 - 7s - loss: 0.1787 - accuracy: 0.9459 - val_loss: 0.5061 - val_accuracy: 0.8435
Epoch 5/15
191/191 - 7s - loss: 0.1517 - accuracy: 0.9537 - val_loss: 0.5590 - val_accuracy: 0.8530
Epoch 6/15
191/191 - 7s - loss: 0.1390 - accuracy: 0.9577 - val_loss: 0.5864 - val_accuracy: 0.8464
Epoch 7/15
191/191 - 7s - loss: 0.1331 - accuracy: 0.9580 - val_loss: 0.6178 - val_accuracy: 0.8464
Restoring model weights from the end of the best epoch.
Epoch 00007: early stopping
Test Accuracy: 85.48539280891418
Epoch 1/15
191/191 - 9s - loss: 0.5902 - accuracy: 0.6865 - val_loss: 0.4577 - val_accuracy: 0.7936
Epoch 2/15
191/191 - 7s - loss: 0.3521 - accuracy: 0.8749 - val_loss: 0.3611 - val

Test Accuracy: 88.01887035369873
Epoch 1/15
191/191 - 8s - loss: 0.6219 - accuracy: 0.6873 - val_loss: 0.5207 - val_accuracy: 0.6745
Epoch 2/15
191/191 - 7s - loss: 0.4104 - accuracy: 0.8284 - val_loss: 0.3595 - val_accuracy: 0.8547
Epoch 3/15
191/191 - 11s - loss: 0.2541 - accuracy: 0.9301 - val_loss: 0.3622 - val_accuracy: 0.8651
Epoch 4/15
191/191 - 7s - loss: 0.2082 - accuracy: 0.9427 - val_loss: 0.4042 - val_accuracy: 0.8575
Epoch 5/15
191/191 - 8s - loss: 0.1782 - accuracy: 0.9510 - val_loss: 0.4418 - val_accuracy: 0.8585
Epoch 6/15
191/191 - 8s - loss: 0.1566 - accuracy: 0.9574 - val_loss: 0.5155 - val_accuracy: 0.8453
Epoch 7/15
191/191 - 8s - loss: 0.1425 - accuracy: 0.9614 - val_loss: 0.5399 - val_accuracy: 0.8462
Epoch 8/15
191/191 - 8s - loss: 0.1327 - accuracy: 0.9616 - val_loss: 0.5727 - val_accuracy: 0.8443
Restoring model weights from the end of the best epoch.
Epoch 00008: early stopping
Test Accuracy: 86.50943636894226

  Activation Filters       acc1       acc2      

Epoch 1/15
191/191 - 9s - loss: 0.5795 - accuracy: 0.7132 - val_loss: 0.4173 - val_accuracy: 0.8519
Epoch 2/15
191/191 - 8s - loss: 0.3426 - accuracy: 0.8757 - val_loss: 0.3596 - val_accuracy: 0.8708
Epoch 3/15
191/191 - 9s - loss: 0.2087 - accuracy: 0.9321 - val_loss: 0.3973 - val_accuracy: 0.8575
Epoch 4/15
191/191 - 9s - loss: 0.1678 - accuracy: 0.9463 - val_loss: 0.4674 - val_accuracy: 0.8594
Epoch 5/15
191/191 - 9s - loss: 0.1385 - accuracy: 0.9550 - val_loss: 0.5085 - val_accuracy: 0.8623
Epoch 6/15
191/191 - 8s - loss: 0.1165 - accuracy: 0.9580 - val_loss: 0.5853 - val_accuracy: 0.8557
Epoch 7/15
191/191 - 8s - loss: 0.1015 - accuracy: 0.9642 - val_loss: 0.6634 - val_accuracy: 0.8453
Restoring model weights from the end of the best epoch.
Epoch 00007: early stopping
Test Accuracy: 87.07547187805176
Epoch 1/15
191/191 - 11s - loss: 0.5824 - accuracy: 0.7064 - val_loss: 0.4451 - val_accuracy: 0.8274
Epoch 2/15
191/191 - 10s - loss: 0.3427 - accuracy: 0.8553 - val_loss: 0.3901 - va

Epoch 7/15
191/191 - 8s - loss: 0.0950 - accuracy: 0.9697 - val_loss: 0.5684 - val_accuracy: 0.8736
Restoring model weights from the end of the best epoch.
Epoch 00007: early stopping
Test Accuracy: 88.86792659759521
Epoch 1/15
191/191 - 9s - loss: 0.5950 - accuracy: 0.6976 - val_loss: 0.4756 - val_accuracy: 0.8113
Epoch 2/15
191/191 - 8s - loss: 0.3548 - accuracy: 0.8498 - val_loss: 0.4107 - val_accuracy: 0.8009
Epoch 3/15
191/191 - 11s - loss: 0.2177 - accuracy: 0.9279 - val_loss: 0.4689 - val_accuracy: 0.7943
Epoch 4/15
191/191 - 10s - loss: 0.1593 - accuracy: 0.9470 - val_loss: 0.5767 - val_accuracy: 0.7840
Epoch 5/15
191/191 - 8s - loss: 0.1354 - accuracy: 0.9556 - val_loss: 0.6309 - val_accuracy: 0.7764
Epoch 6/15
191/191 - 8s - loss: 0.1166 - accuracy: 0.9596 - val_loss: 0.6562 - val_accuracy: 0.7877
Restoring model weights from the end of the best epoch.
Epoch 00006: early stopping
Test Accuracy: 81.13207817077637
Epoch 1/15
191/191 - 14s - loss: 0.6138 - accuracy: 0.6867 - val

191/191 - 9s - loss: 0.2496 - accuracy: 0.8770 - val_loss: 0.3522 - val_accuracy: 0.8728
Epoch 4/15
191/191 - 9s - loss: 0.2137 - accuracy: 0.8983 - val_loss: 0.3822 - val_accuracy: 0.8586
Epoch 5/15
191/191 - 10s - loss: 0.1672 - accuracy: 0.9252 - val_loss: 0.4353 - val_accuracy: 0.8605
Epoch 6/15
191/191 - 9s - loss: 0.1339 - accuracy: 0.9417 - val_loss: 0.6284 - val_accuracy: 0.8351
Epoch 7/15
191/191 - 9s - loss: 0.1132 - accuracy: 0.9629 - val_loss: 0.6229 - val_accuracy: 0.8134
Epoch 8/15
191/191 - 9s - loss: 0.0980 - accuracy: 0.9661 - val_loss: 0.6948 - val_accuracy: 0.8407
Restoring model weights from the end of the best epoch.
Epoch 00008: early stopping
Test Accuracy: 87.27615475654602
Epoch 1/15
191/191 - 13s - loss: 0.6056 - accuracy: 0.6889 - val_loss: 0.5134 - val_accuracy: 0.6613
Epoch 2/15
191/191 - 11s - loss: 0.3700 - accuracy: 0.8467 - val_loss: 0.3830 - val_accuracy: 0.8462
Epoch 3/15
191/191 - 10s - loss: 0.2254 - accuracy: 0.9298 - val_loss: 0.4095 - val_accurac

## Summary

In [None]:
record.sort_values(by='AVG', ascending=False)

In [None]:
record[['Activation', 'AVG']].groupby(by='Activation').max().sort_values(by='AVG', ascending=False)

In [None]:
report = record.sort_values(by='AVG', ascending=False)
# to_excel() writes the file and returns None, so its result is not assigned
report.to_excel('CNN_SUBJ.xlsx', sheet_name='random')

# Model 2: Word2Vec Static

__Using pre-trained embeddings__
* In this part, we will create an Embedding layer in TensorFlow Keras using a pre-trained word embedding: the 300-dimensional Word2Vec model trained on about 100 billion words from Google News.
* We will keep the embeddings fixed (static) rather than updating them during training (dynamic).
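
Conceptually, an embedding layer is a lookup table indexed by word id, and "static" means the optimizer never changes that table. The sketch below illustrates both ideas with made-up 3-dimensional vectors and a toy gradient step; `embed` and `sgd_step` are hypothetical helpers, not Keras functions.

```python
# Toy embedding table: one row per word index (values are made up)
embedding_matrix = [
    [0.0, 0.0, 0.0],   # index 0: padding
    [0.1, 0.2, 0.3],   # index 1: <UNK>
    [0.5, 0.1, 0.9],   # index 2
    [0.7, 0.3, 0.2],   # index 3
]

def embed(padded_ids, matrix):
    # The embedding layer is a row lookup: one vector per token id
    return [matrix[i] for i in padded_ids]

def sgd_step(matrix, grads, lr, trainable):
    # A frozen (static) table ignores gradients entirely
    if not trainable:
        return matrix
    return [[w - lr * g for w, g in zip(row, grad_row)]
            for row, grad_row in zip(matrix, grads)]

sentence = embed([2, 3, 0], embedding_matrix)
grads = [[1.0, 1.0, 1.0] for _ in embedding_matrix]

static_matrix = sgd_step(embedding_matrix, grads, lr=0.1, trainable=False)
dynamic_matrix = sgd_step(embedding_matrix, grads, lr=0.1, trainable=True)
print(sentence)
print(static_matrix == embedding_matrix, dynamic_matrix == embedding_matrix)
```

In the static setting only the convolutional and dense layers learn, so the pre-trained vectors keep the geometry they had after training on the larger corpus.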

1. __Load `Word2Vec` Pre-trained Word Embedding__

In [133]:
from gensim.models import KeyedVectors
word2vec = KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)

In [136]:
# Access the dense vector value for the word 'handsome'
# word2vec.word_vec('handsome') # 0.11376953
word2vec.word_vec('cool') # 1.64062500e-01

array([ 1.64062500e-01,  1.87500000e-01, -4.10156250e-02,  1.25000000e-01,
       -3.22265625e-02,  8.69140625e-02,  1.19140625e-01, -1.26953125e-01,
        1.77001953e-02,  8.83789062e-02,  2.12402344e-02, -2.00195312e-01,
        4.83398438e-02, -1.01074219e-01, -1.89453125e-01,  2.30712891e-02,
        1.17675781e-01,  7.51953125e-02, -8.39843750e-02, -1.33666992e-02,
        1.53320312e-01,  4.08203125e-01,  3.80859375e-02,  3.36914062e-02,
       -4.02832031e-02, -6.88476562e-02,  9.03320312e-02,  2.12890625e-01,
        1.72119141e-02, -6.44531250e-02, -1.29882812e-01,  1.40625000e-01,
        2.38281250e-01,  1.37695312e-01, -1.76757812e-01, -2.71484375e-01,
       -1.36718750e-01, -1.69921875e-01, -9.15527344e-03,  3.47656250e-01,
        2.22656250e-01, -3.06640625e-01,  1.98242188e-01,  1.33789062e-01,
       -4.34570312e-02, -5.12695312e-02, -3.46679688e-02, -8.49609375e-02,
        1.01562500e-01,  1.42578125e-01, -7.95898438e-02,  1.78710938e-01,
        2.30468750e-01,  

2. __Check number of training words present in Word2Vec__

In [182]:
def training_words_in_word2vector(word_to_vec_map, word_to_index):
    '''
    input:
        word_to_vec_map: a word2vec GoogleNews-vectors-negative300.bin model loaded using gensim.models
        word_to_index: word to index mapping from training set
    output:
        prints how many training-vocabulary words have a pre-trained vector
    '''
    
    # +1 accounts for index 0, which Keras reserves for padding
    vocab_size = len(word_to_index) + 1
    count = 0
    # Count the vocabulary words that have a pre-trained vector
    for word, idx in word_to_index.items():
        if word in word_to_vec_map:
            count+=1
            
    print('Found {} words present from {} training vocabulary in the set of pre-trained word vector'.format(count, vocab_size))

In [261]:
# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

# Cleaning and Tokenization
tokenizer = Tokenizer(oov_token=oov_tok)
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
training_words_in_word2vector(word2vec, word_index)

Found 4825 words present from 5100 training vocabulary in the set of pre-trained word vector


3. __Define a `pretrained_embedding_matrix` function__

In [138]:
from tensorflow.keras.layers import Embedding

def pretrained_embedding_matrix(word_to_vec_map, word_to_index):
    '''
    input:
        word_to_vec_map: a word2vec GoogleNews-vectors-negative300.bin model loaded using gensim.models
        word_to_index: word to index mapping from training set
    '''
    
    # adding 1 because Keras reserves index 0 for padding
    vocab_size = len(word_to_index) + 1
    # dimensionality of the pre-trained word vectors (= 300)
    emb_dim = word_to_vec_map.vector_size
    
    embed_matrix = np.zeros((vocab_size, emb_dim))
    
    # Set each row "idx" of the embedding matrix to be 
    # the word vector representation of the idx'th word of the vocabulary
    for word, idx in word_to_index.items():
        if word in word_to_vec_map:
            embed_matrix[idx] = word_to_vec_map.word_vec(word)
            
        # initialize words missing from Word2Vec (including <UNK>) with standard normal values
        else:
            embed_matrix[idx] = np.random.randn(emb_dim)
            
    return embed_matrix

In [311]:
# Test the function
w_2_i = {'<UNK>': 1, 'handsome': 2, 'cool': 3, 'shit': 4 }
em_matrix = pretrained_embedding_matrix(word2vec, w_2_i)
em_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.7603335 ,  0.41298582,  1.6051669 , ...,  0.07348683,
        -0.93163275, -0.64774868],
       [ 0.11376953,  0.1796875 , -0.265625  , ..., -0.21875   ,
        -0.03930664,  0.20996094],
       [ 0.1640625 ,  0.1875    , -0.04101562, ...,  0.10888672,
        -0.01019287,  0.02075195],
       [ 0.10888672, -0.16699219,  0.08984375, ..., -0.19628906,
        -0.23144531,  0.04614258]])

## CNN Model

In [312]:
from tensorflow.keras import regularizers
from tensorflow.keras.constraints import MaxNorm

def define_model_2(filters = 100, kernel_size = 3, activation='relu', 
                 input_dim = None, output_dim=300, max_length = None, emb_matrix = None):
    
    model = tf.keras.models.Sequential([
        # Use the input_dim argument (vocabulary size) rather than the global vocab_size
        tf.keras.layers.Embedding(input_dim=input_dim, 
                                  output_dim=output_dim, 
                                  input_length=max_length, 
                                  input_shape=(max_length, ),
                                  # Initialize the embedding weights with the Word2Vec embedding matrix
                                  weights = [emb_matrix],
                                  # Keep the weights frozen during training (static)
                                  trainable = False),
        
        tf.keras.layers.Conv1D(filters=filters, kernel_size = kernel_size, activation = activation, 
                               # Conv1D kernel shape is (kernel_size, input_channels, filters);
                               # axis=[0,1] caps the L2 norm of each filter's weights at 3
                               kernel_constraint= MaxNorm( max_value=3, axis=[0,1])),
        
        tf.keras.layers.MaxPool1D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation=activation, 
                              # set axis to 0 to constrain each weight vector of length (input_dim,) in dense layer
                              kernel_constraint = MaxNorm( max_value=3, axis=0)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(units=1, activation='sigmoid')
    ])
    
    model.compile( loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    return model

In [313]:
# input_dim must match the number of rows in emb_matrix
model_0 = define_model_2( input_dim=vocab_size, max_length=100, emb_matrix=np.random.rand(vocab_size, 300))
model_0.summary()

Model: "sequential_1437"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1444 (Embedding)   (None, 100, 300)          1524600   
_________________________________________________________________
conv1d_1439 (Conv1D)         (None, 98, 100)           90100     
_________________________________________________________________
max_pooling1d_1439 (MaxPooli (None, 49, 100)           0         
_________________________________________________________________
flatten_1439 (Flatten)       (None, 4900)              0         
_________________________________________________________________
dropout_2871 (Dropout)       (None, 4900)              0         
_________________________________________________________________
dense_2869 (Dense)           (None, 10)                49010     
_________________________________________________________________
dropout_2872 (Dropout)       (None, 10)            

## Train and Test the Model

In [314]:
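class myCallback(tf.keras.callbacks.Callback):
    # Override on_epoch_end() to stop once training accuracy reaches a threshold.
    # This class is kept for reference; the training loop below uses EarlyStopping instead.
    def on_epoch_end(self, epoch, logs=None):
        acc = (logs or {}).get('accuracy')
        if acc is not None and acc >= 0.9:
            print("\nReached 90% accuracy so cancelling training!")
            self.model.stop_training=True

# Stop when validation accuracy has not improved for 5 epochs and restore the best weights
callbacks = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', min_delta=0, 
                                             patience=5, verbose=2, 
                                             mode='auto', restore_best_weights=True)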

In [315]:
# Parameter Initialization
trunc_type='post'
padding_type='post'
oov_tok = "<UNK>"
activations = ['relu']
filters = 100
kernel_sizes = [1, 2, 3, 4, 5, 6, 7, 8]

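# Note: the 'Filters' column stores the kernel size tried in each run (the filter count is fixed at 100)
columns = ['Activation', 'Filters', 'acc1', 'acc2', 'acc3', 'acc4', 'acc5', 'acc6', 'acc7', 'acc8', 'acc9', 'acc10', 'AVG']
record2 = pd.DataFrame(columns = columns)

# prepare cross validation with 10 splits and shuffle = True
# (recent scikit-learn versions require keyword arguments here)
kfold = KFold(n_splits=10, shuffle=True)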

# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

for activation in activations:
    for kernel_size in kernel_sizes:
        # kfold.split() will return set indices for each split
        acc_list = []
        for train, test in kfold.split(sentences):
            
            train_x, test_x = [], []
            train_y, test_y = [], []
            
            for i in train:
                train_x.append(sentences[i])
                train_y.append(labels[i])

            for i in test:
                test_x.append(sentences[i])
                test_y.append(labels[i])

            # Turn the labels into a numpy array
            train_y = np.array(train_y)
            test_y = np.array(test_y)

            # Encode the data: fit the tokenizer on the training fold only,
            # so test-fold words unseen in training map to <UNK>
            tokenizer = Tokenizer(oov_token=oov_tok)
            tokenizer.fit_on_texts(train_x)

            # Turn the text into sequence
            training_sequences = tokenizer.texts_to_sequences(train_x)
            test_sequences = tokenizer.texts_to_sequences(test_x)

            max_len = max_length(training_sequences)

            # Pad the sequence to have the same size
            Xtrain = pad_sequences(training_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
            Xtest = pad_sequences(test_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)

            word_index = tokenizer.word_index
            vocab_size = len(word_index)+1
            
            
            emb_matrix = pretrained_embedding_matrix(word2vec, word_index)
            
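            # Build the model for this fold with the frozen Word2Vec embedding matrix
            model = define_model_2(filters, kernel_size, activation, input_dim=vocab_size, 
                                 max_length=max_len, emb_matrix=emb_matrix)

            # Train the model
            # Note: the test fold also serves as validation data for early stopping,
            # so the accuracy reported below is an optimistic estimate
            model.fit(Xtrain, train_y, batch_size=50, epochs=30, verbose=0, 
                      callbacks=[callbacks], validation_data=(Xtest, test_y))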

            # evaluate the model
            loss, acc = model.evaluate(Xtest, test_y, verbose=0)
            print('Test Accuracy: {}'.format(acc*100))

            acc_list.append(acc*100)
            
        mean_acc = np.array(acc_list).mean()
        parameters = [activation, kernel_size]
        entries = parameters + acc_list + [mean_acc]

        temp = pd.DataFrame([entries], columns=columns)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        record2 = pd.concat([record2, temp], ignore_index=True)
        print()
        print(record2)
        print()



Restoring model weights from the end of the best epoch.
Epoch 00016: early stopping
Test Accuracy: 74.0740716457367
Restoring model weights from the end of the best epoch.
Epoch 00012: early stopping
Test Accuracy: 73.01587462425232
Restoring model weights from the end of the best epoch.
Epoch 00013: early stopping
Test Accuracy: 78.04232835769653
Restoring model weights from the end of the best epoch.
Epoch 00012: early stopping
Test Accuracy: 74.60317611694336
Restoring model weights from the end of the best epoch.
Epoch 00012: early stopping
Test Accuracy: 82.80423283576965
Restoring model weights from the end of the best epoch.
Epoch 00015: early stopping
Test Accuracy: 80.90185523033142
Restoring model weights from the end of the best epoch.
Epoch 00016: early stopping
Test Accuracy: 82.49337077140808
Restoring model weights from the end of the best epoch.
Epoch 00015: early stopping
Test Accuracy: 77.98408269882202
Restoring model weights from the end of the best epoch.
Epoch 000

Restoring model weights from the end of the best epoch.
Epoch 00016: early stopping
Test Accuracy: 74.86772537231445
Restoring model weights from the end of the best epoch.
Epoch 00017: early stopping
Test Accuracy: 77.77777910232544
Restoring model weights from the end of the best epoch.
Epoch 00006: early stopping
Test Accuracy: 66.1375641822815
Restoring model weights from the end of the best epoch.
Epoch 00006: early stopping
Test Accuracy: 66.1375641822815
Restoring model weights from the end of the best epoch.
Epoch 00019: early stopping
Test Accuracy: 75.39682388305664
Restoring model weights from the end of the best epoch.
Epoch 00006: early stopping
Test Accuracy: 61.00795865058899
Restoring model weights from the end of the best epoch.
Epoch 00017: early stopping
Test Accuracy: 76.92307829856873
Restoring model weights from the end of the best epoch.
Epoch 00014: early stopping
Test Accuracy: 75.59681534767151
Restoring model weights from the end of the best epoch.
Epoch 0002

## Summary

In [316]:
record2.sort_values(by='AVG', ascending=False)

| | Activation | Filters | acc1 | acc2 | acc3 | acc4 | acc5 | acc6 | acc7 | acc8 | acc9 | acc10 | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | relu | 1 | 74.074072 | 73.015875 | 78.042328 | 74.603176 | 82.804233 | 80.901855 | 82.493371 | 77.984083 | 78.514588 | 74.270558 | 77.670414 |
| 1 | relu | 2 | 81.481481 | 77.777779 | 73.280424 | 81.481481 | 73.280424 | 75.596815 | 77.188331 | 73.209548 | 82.493371 | 75.862068 | 77.165172 |
| 4 | relu | 5 | 72.222221 | 71.164024 | 81.481481 | 76.984125 | 81.216931 | 72.679043 | 77.984083 | 75.331563 | 77.453583 | 77.984083 | 76.450114 |
| 3 | relu | 4 | 75.396824 | 79.894179 | 74.603176 | 72.751325 | 73.809522 | 78.77984 | 78.249335 | 75.06631 | 73.740053 | 75.06631 | 75.735688 |
| 2 | relu | 3 | 76.190478 | 73.809522 | 76.455027 | 73.015875 | 74.867725 | 73.209548 | 76.923078 | 77.188331 | 75.331563 | 78.77984 | 75.577099 |
| 5 | relu | 6 | 74.867725 | 77.777779 | 66.137564 | 66.137564 | 75.396824 | 61.007959 | 76.923078 | 75.596815 | 78.249335 | 69.496024 | 72.159067 |
| 7 | relu | 8 | 76.190478 | 71.693122 | 72.48677 | 65.608466 | 70.37037 | 75.06631 | 70.822281 | 72.41379 | 63.925731 | 77.984083 | 71.65614 |
| 6 | relu | 7 | 79.365081 | 63.492066 | 63.22751 | 77.248675 | 63.22751 | 70.822281 | 68.435013 | 77.71883 | 62.599468 | 75.06631 | 70.120274 |


In [317]:
record2[['Activation', 'AVG']].groupby(by='Activation').max().sort_values(by='AVG', ascending=False)

| Activation | AVG |
|---|---|
| relu | 77.670414 |


In [318]:
report = record2.sort_values(by='AVG', ascending=False)
# to_excel() writes the file and returns None, so its result is not assigned
report.to_excel('CNN_SUBJ_2.xlsx', sheet_name='static')

# Model 3: Word2Vec - Dynamic

* In this part, we fine-tune the Word2Vec embeddings during training (dynamic) instead of keeping them frozen.
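
To make the static/dynamic distinction concrete, here is a minimal NumPy sketch of one gradient step on an embedding table. It is not the Keras implementation: it uses plain SGD rather than Adam, and the function name `sgd_step_on_embeddings`, the toy sizes, and the learning rate are illustrative assumptions. In Keras the switch is simply the `trainable` flag of the `Embedding` layer.

```python
import numpy as np

def sgd_step_on_embeddings(emb, token_ids, grad_rows, lr=0.1, trainable=True):
    """Return the embedding table after one plain SGD step.

    emb        : (vocab_size, dim) embedding table
    token_ids  : ids of the tokens looked up in this batch
    grad_rows  : (len(token_ids), dim) gradient w.r.t. each looked-up vector
    trainable  : False mimics a static (frozen) embedding layer
    """
    if not trainable:
        return emb.copy()
    updated = emb.copy()
    # np.subtract.at accumulates correctly when a token id repeats in the batch
    np.subtract.at(updated, token_ids, lr * grad_rows)
    return updated

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))       # toy table: 6 words, 4 dimensions
token_ids = np.array([1, 3, 3])     # word 3 appears twice in the batch
grad_rows = np.ones((3, 4))         # made-up gradients, for illustration only

static = sgd_step_on_embeddings(emb, token_ids, grad_rows, trainable=False)
dynamic = sgd_step_on_embeddings(emb, token_ids, grad_rows, trainable=True)

print(np.allclose(static, emb))                               # static: nothing moves
print(np.allclose(dynamic[[0, 2, 4, 5]], emb[[0, 2, 4, 5]]))  # unseen words are unchanged
print(np.allclose(dynamic[3], emb[3] - 0.2))                  # word 3 received two updates of 0.1
```

Only the rows of words that occur in a batch move, which is why fine-tuning tends to adapt frequent task-specific words first.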

## CNN Model
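
The model in this section caps weight norms with `MaxNorm(max_value=3, ...)`, which is how the notebook applies the L2 constraint of 3 from the hyperparameter list. The NumPy sketch below mirrors the usual max-norm rule (compute the norm along the given axes, clip it, rescale). The helper `max_norm` and the toy arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def max_norm(w, max_value=3.0, axis=0, eps=1e-7):
    """Rescale weight slices whose L2 norm along `axis` exceeds `max_value`."""
    norms = np.sqrt(np.sum(np.square(w), axis=axis, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)
    return w * (desired / (eps + norms))

# Dense kernel: shape (input_dim, units); axis=0 gives one norm per unit.
dense_w = np.array([[3.0, 0.3],
                    [4.0, 0.4]])           # column norms: 5.0 and 0.5
dense_c = max_norm(dense_w, axis=0)
dense_norms = np.linalg.norm(dense_c, axis=0)
print(dense_norms)                         # first column shrinks to about 3, second stays about 0.5

# Conv1D kernel: shape (kernel_size, in_channels, filters);
# axis=(0, 1) gives one norm per filter (NumPy needs a tuple where Keras accepts a list).
conv_w = np.full((3, 300, 100), 0.2)       # every filter starts with norm 6
conv_c = max_norm(conv_w, axis=(0, 1))
per_filter = np.sqrt(np.sum(conv_c ** 2, axis=(0, 1)))
print(per_filter.shape, per_filter.max())
```

Weights already inside the norm ball are left essentially untouched, so the constraint only acts on slices that have grown too large.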

In [319]:
from tensorflow.keras import regularizers
from tensorflow.keras.constraints import MaxNorm

def define_model_3(filters = 100, kernel_size = 3, activation='relu', 
                 input_dim = None, output_dim=300, max_length = None, emb_matrix = None):
    
    model = tf.keras.models.Sequential([
        # input_dim must equal the number of rows in emb_matrix (the vocabulary size)
        tf.keras.layers.Embedding(input_dim=input_dim, 
                                  output_dim=output_dim, 
                                  input_length=max_length, 
                                  input_shape=(max_length, ),
                                  # Initialize the embedding weights with the word2vec embedding matrix
                                  weights = [emb_matrix],
                                  # Keep the weights trainable so they are fine-tuned (dynamic)
                                  trainable = True),
        
        tf.keras.layers.Conv1D(filters=filters, kernel_size = kernel_size, activation = activation, 
                               # axis=[0,1] spans the kernel_size and input-channel axes of the Conv1D kernel,
                               # so the norm of each filter's weights is constrained as a whole
                               kernel_constraint= MaxNorm( max_value=3, axis=[0,1])),
        
        tf.keras.layers.MaxPool1D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation=activation, 
                              # axis=0 constrains each unit's incoming weight vector of length (input_dim,)
                              kernel_constraint = MaxNorm( max_value=3, axis=0)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(units=1, activation='sigmoid')
    ])
    
    model.compile( loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
#     model.summary()
    return model

In [320]:
# Smoke test with a random embedding matrix; input_dim matches its number of rows
model_0 = define_model_3( input_dim=vocab_size, max_length=100, emb_matrix=np.random.rand(vocab_size, 300))
model_0.summary()

Model: "sequential_1518"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1525 (Embedding)   (None, 100, 300)          1527300   
_________________________________________________________________
conv1d_1520 (Conv1D)         (None, 98, 100)           90100     
_________________________________________________________________
max_pooling1d_1520 (MaxPooli (None, 49, 100)           0         
_________________________________________________________________
flatten_1520 (Flatten)       (None, 4900)              0         
_________________________________________________________________
dropout_3033 (Dropout)       (None, 4900)              0         
_________________________________________________________________
dense_3031 (Dense)           (None, 10)                49010     
_________________________________________________________________
dropout_3034 (Dropout)       (None, 10)            
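
The shapes and parameter counts in this summary can be rechecked by hand. The sketch below redoes the arithmetic in plain Python. The vocabulary size is not stated in the summary; it is inferred here from the embedding parameter count, and the helper `conv1d_output_length` is an illustrative name.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 'valid' (unpadded) 1D convolution."""
    return (input_length - kernel_size) // stride + 1

embedding_dim = 300
max_length = 100
kernel_size = 3
filters = 100
dense_units = 10

embedding_params = 1527300                       # taken from the summary above
vocab_size = embedding_params // embedding_dim   # inferred, not stated in the notebook

conv_len = conv1d_output_length(max_length, kernel_size)    # sequence length after Conv1D
conv_params = (kernel_size * embedding_dim + 1) * filters   # weights plus one bias per filter
pooled_len = conv_len // 2                                   # MaxPool1D(2) halves the length
flat_size = pooled_len * filters                             # Flatten
dense_params = (flat_size + 1) * dense_units                 # weights plus one bias per unit

print(vocab_size, conv_len, conv_params, pooled_len, flat_size, dense_params)
# 5091 98 90100 49 4900 49010
```

Because `Flatten` follows a pooling layer with a fixed window, the dense layer's size grows with `max_length`, so each fold's model can have a different parameter count.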

## Train and Test the Model

In [321]:
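class myCallback(tf.keras.callbacks.Callback):
    # Override on_epoch_end() to stop once training accuracy passes 93%.
    # Note: this callback is defined but not used in the training loop; EarlyStopping is used instead.
    def on_epoch_end(self, epoch, logs=None):
        accuracy = (logs or {}).get('accuracy')
        if accuracy is not None and accuracy > 0.93:
            print("\nReached 93% accuracy so cancelling training!")
            self.model.stop_training=True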

callbacks = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', min_delta=0, 
                                             patience=5, verbose=2, 
                                             mode='auto', restore_best_weights=True)
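
The `EarlyStopping` settings above (`patience=5`, `min_delta=0`, `restore_best_weights=True`) can be read as the following simplified rule. This is a plain-Python sketch of patience-based stopping, not the Keras source; the function name and the validation accuracies are hypothetical.

```python
def early_stopping_epoch(val_scores, patience=5, min_delta=0.0):
    """Simulate patience-based early stopping on a metric to maximise.

    Returns (stop_epoch, best_epoch), both 1-based; stop_epoch is None when
    training runs through every score without triggering.
    """
    best = float('-inf')
    best_epoch = None
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score - min_delta > best:
            # Improvement: remember this epoch and reset the patience counter
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation accuracies, for illustration only
scores = [0.70, 0.74, 0.78, 0.77, 0.78, 0.76, 0.77, 0.75, 0.74, 0.73]
stop_epoch, best_epoch = early_stopping_epoch(scores, patience=5)
print(stop_epoch, best_epoch)  # 8 3
```

Under this rule, a tie with the best score does not count as an improvement, and restoring the best weights returns the model to the state of `best_epoch`, five epochs before the stop.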

In [325]:
# Parameter Initialization
trunc_type='post'
padding_type='post'
oov_tok = "<UNK>"
activations = ['relu']
filters = 100
kernel_sizes = [1, 2, 3, 4, 5, 6, 7, 8]

# Note: the 'Filters' column records the kernel size; the number of filters is fixed at 100
columns = ['Activation', 'Filters', 'acc1', 'acc2', 'acc3', 'acc4', 'acc5', 'acc6', 'acc7', 'acc8', 'acc9', 'acc10', 'AVG']
record3 = pd.DataFrame(columns = columns)

# Prepare cross validation with 10 shuffled splits
# (recent scikit-learn versions require shuffle to be passed by keyword)
kfold = KFold(n_splits=10, shuffle=True)

# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

for activation in activations:
    for kernel_size in kernel_sizes:
        acc_list = []
        # kfold.split() yields the train and test indices for each fold
        for train, test in kfold.split(sentences):
            
            train_x, test_x = [], []
            train_y, test_y = [], []
            
            for i in train:
                train_x.append(sentences[i])
                train_y.append(labels[i])

            for i in test:
                test_x.append(sentences[i])
                test_y.append(labels[i])

            # Turn the labels into a numpy array
            train_y = np.array(train_y)
            test_y = np.array(test_y)

            # Tokenize the text; the tokenizer is fit on the training fold only,
            # so unseen test words map to the OOV token
            tokenizer = Tokenizer(oov_token=oov_tok)
            tokenizer.fit_on_texts(train_x)

            # Turn the text into sequence
            training_sequences = tokenizer.texts_to_sequences(train_x)
            test_sequences = tokenizer.texts_to_sequences(test_x)

            max_len = max_length(training_sequences)

            # Pad the sequence to have the same size
            Xtrain = pad_sequences(training_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
            Xtest = pad_sequences(test_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)

            word_index = tokenizer.word_index
            vocab_size = len(word_index)+1
            
            
            emb_matrix = pretrained_embedding_matrix(word2vec, word_index)
            
            # Build the model; input_dim is the vocabulary size of this fold
            model = define_model_3(filters, kernel_size, activation, input_dim=vocab_size, 
                                 max_length=max_len, emb_matrix=emb_matrix)

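            # Train the model. Note: the test fold also serves as validation data for
            # early stopping, so the reported accuracies are likely optimistic.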
            model.fit(Xtrain, train_y, batch_size=50, epochs=20, verbose=0, 
                      callbacks=[callbacks], validation_data=(Xtest, test_y))

            # evaluate the model
            loss, acc = model.evaluate(Xtest, test_y, verbose=0)
            print('Test Accuracy: {}'.format(acc*100))

            acc_list.append(acc*100)
            
        mean_acc = np.array(acc_list).mean()
        parameters = [activation, kernel_size]
        entries = parameters + acc_list + [mean_acc]

        temp = pd.DataFrame([entries], columns=columns)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        record3 = pd.concat([record3, temp], ignore_index=True)
        print()
        print(record3)
        print()



Restoring model weights from the end of the best epoch.
Epoch 00015: early stopping
Test Accuracy: 80.15872836112976
Restoring model weights from the end of the best epoch.
Epoch 00015: early stopping
Test Accuracy: 82.53968358039856
Restoring model weights from the end of the best epoch.
Epoch 00011: early stopping
Test Accuracy: 80.42327761650085
Restoring model weights from the end of the best epoch.
Epoch 00011: early stopping
Test Accuracy: 77.51322984695435
Restoring model weights from the end of the best epoch.
Epoch 00013: early stopping
Test Accuracy: 83.59788656234741
Restoring model weights from the end of the best epoch.
Epoch 00017: early stopping
Test Accuracy: 79.31034564971924
Restoring model weights from the end of the best epoch.
Epoch 00010: early stopping
Test Accuracy: 79.57559823989868
Restoring model weights from the end of the best epoch.
Epoch 00013: early stopping
Test Accuracy: 78.51458787918091
Restoring model weights from the end of the best epoch.
Epoch 00

Restoring model weights from the end of the best epoch.
Epoch 00012: early stopping
Test Accuracy: 77.24867463111877
Restoring model weights from the end of the best epoch.
Epoch 00016: early stopping
Test Accuracy: 79.62962985038757
Restoring model weights from the end of the best epoch.
Epoch 00014: early stopping
Test Accuracy: 78.30687761306763
Restoring model weights from the end of the best epoch.
Epoch 00010: early stopping
Test Accuracy: 78.83597612380981
Restoring model weights from the end of the best epoch.
Epoch 00014: early stopping
Test Accuracy: 81.4814805984497
Restoring model weights from the end of the best epoch.
Epoch 00011: early stopping
Test Accuracy: 80.37135004997253
Restoring model weights from the end of the best epoch.
Epoch 00014: early stopping
Test Accuracy: 83.02386999130249
Restoring model weights from the end of the best epoch.
Epoch 00014: early stopping
Test Accuracy: 80.90185523033142
Restoring model weights from the end of the best epoch.
Epoch 000
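
The training loop passes the test fold as `validation_data`, so early stopping and weight restoration are guided by the same data that is scored afterwards. One possible alternative, sketched here with NumPy only, is to carve a validation slice out of each training fold. The function name, the 10% validation fraction, and the sample count are assumptions for illustration and were not used to produce the results in this notebook.

```python
import numpy as np

def kfold_with_validation(n_samples, n_splits=10, val_fraction=0.1, seed=0):
    """Yield (train_idx, val_idx, test_idx) for each fold.

    The test fold is never seen during training; a slice of the remaining
    data is set aside as the validation set for early stopping.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, n_splits)
    for k in range(n_splits):
        test_idx = folds[k]
        rest = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        rest = rng.permutation(rest)
        n_val = max(1, int(len(rest) * val_fraction))
        yield rest[n_val:], rest[:n_val], test_idx

for train_idx, val_idx, test_idx in kfold_with_validation(n_samples=1000):
    # In the notebook's loop, these would replace the current split, e.g.:
    #   model.fit(..., validation_data=(X_val, y_val))  -> early stopping on val_idx
    #   model.evaluate(X_test, y_test)                  -> final score on test_idx
    assert not set(val_idx) & set(test_idx)
    assert not set(train_idx) & set(test_idx)
```

With this split, the accuracy reported for each fold comes from sentences that played no part in choosing the stopping epoch.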

## Summary

In [326]:
record3.sort_values(by='AVG', ascending=False)

| | Activation | Filters | acc1 | acc2 | acc3 | acc4 | acc5 | acc6 | acc7 | acc8 | acc9 | acc10 | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | relu | 2 | 80.687833 | 81.74603 | 79.365081 | 79.100531 | 80.687833 | 82.228118 | 80.901855 | 81.43236 | 80.901855 | 80.106103 | 80.71576 |
| 5 | relu | 6 | 77.248675 | 79.62963 | 78.306878 | 78.835976 | 81.481481 | 80.37135 | 83.02387 | 80.901855 | 81.962866 | 80.636603 | 80.239918 |
| 3 | relu | 4 | 80.687833 | 80.687833 | 79.365081 | 77.248675 | 80.952382 | 79.575598 | 79.045093 | 79.045093 | 81.43236 | 81.167108 | 79.920706 |
| 0 | relu | 1 | 80.158728 | 82.539684 | 80.423278 | 77.51323 | 83.597887 | 79.310346 | 79.575598 | 78.514588 | 74.801064 | 79.840851 | 79.627525 |
| 2 | relu | 3 | 81.216931 | 76.719576 | 81.74603 | 80.952382 | 76.719576 | 81.43236 | 80.106103 | 78.249335 | 77.188331 | 80.636603 | 79.496723 |
| 7 | relu | 8 | 78.042328 | 76.455027 | 82.275134 | 78.571427 | 78.042328 | 78.249335 | 78.249335 | 81.43236 | 76.657826 | 79.045093 | 78.702019 |
| 6 | relu | 7 | 80.158728 | 80.158728 | 80.687833 | 79.100531 | 80.158728 | 75.331563 | 74.270558 | 79.310346 | 79.575598 | 76.127321 | 78.487993 |
| 4 | relu | 5 | 64.550263 | 81.481481 | 80.158728 | 78.306878 | 79.62963 | 76.657826 | 75.331563 | 80.636603 | 80.901855 | 82.758623 | 78.041345 |
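
The averages alone hide how much the folds vary. As a rough check, the sketch below recomputes the mean and adds the spread for the best static row (kernel size 1) and the best dynamic row (kernel size 2), using the fold accuracies copied from the two summary tables. The folds are reshuffled on every run, so this is an unpaired, informal comparison and not a significance test.

```python
from statistics import mean, stdev

# Fold accuracies copied from the static and dynamic summary tables
static_k1 = [74.074072, 73.015875, 78.042328, 74.603176, 82.804233,
             80.901855, 82.493371, 77.984083, 78.514588, 74.270558]
dynamic_k2 = [80.687833, 81.74603, 79.365081, 79.100531, 80.687833,
              82.228118, 80.901855, 81.43236, 80.901855, 80.106103]

for name, accs in [('static, kernel size 1', static_k1),
                   ('dynamic, kernel size 2', dynamic_k2)]:
    print('{}: mean={:.2f}, std={:.2f}, min={:.2f}, max={:.2f}'.format(
        name, mean(accs), stdev(accs), min(accs), max(accs)))
```

The recomputed means agree with the `AVG` column, and the dynamic run shows a narrower spread across folds than the static one.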


In [327]:
report = record3.sort_values(by='AVG', ascending=False)
# to_excel() writes the file and returns None, so its result is not assigned
report.to_excel('CNN_SUBJ_3.xlsx', sheet_name='dynamic')