**Question**: Load, preprocess, and split the part-of-speech (POS) tagged corpora from NLTK. Use the pre-trained GoogleNews word2vec vectors as word embeddings. Build, compile, train, and evaluate a bidirectional LSTM model on top of those embeddings.




**Description** :

* Load the POS-tagged corpora from NLTK: import `brown`, `treebank`, and `conll2000` from `nltk.corpus`.

* Split the data into words (X) and tags (Y): start with two empty lists, `X` and `Y`, and append each sentence's words and tags to them.

* Convert the text to integer sequences with `texts_to_sequences`.

* Pad or truncate every sentence to a fixed length of 100.

* One-hot encode the tag classes with `to_categorical`.

* Split the data into training and testing sets (test size 0.15).

* Split the training data into training and validation sets (validation size 0.15).

* Use the pre-trained GoogleNews vectors as word embeddings. Load them with `KeyedVectors.load_word2vec_format` from the gensim library.

* Assign word vectors from the word2vec model; each word in the model is represented by a 300-dimensional vector.

* Create an empty embedding matrix and a word-to-index dictionary mapping.

* Copy the vectors from the word2vec model for the words present in the corpus.

* Build the model with the Sequential API:

    * first layer: `Embedding` with `input_dim = VOCABULARY_SIZE`, `output_dim = EMBEDDING_SIZE`, `input_length = MAX_SEQ_LENGTH`, `weights = [embedding_weights]`, and `trainable = True`

    * second layer: bidirectional `LSTM` with 64 units and `return_sequences = True`

    * third layer: `TimeDistributed` wrapper around a `Dense` layer with `NUM_CLASSES` units and softmax activation (here `NUM_CLASSES = 13`: the 12 universal tags plus index 0, which is used for padding)

* Compile the model with `categorical_crossentropy` loss, the `adam` optimizer, and accuracy as the metric.

* Fit the model on the training set with `epochs = 5`, `batch_size = 128`, and the validation set as validation data.

* Evaluate the model on the testing set to obtain loss and accuracy.


**Solution**:

In [None]:
from nltk.corpus import brown
from nltk.corpus import treebank
from nltk.corpus import conll2000
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical  # keras.utils.np_utils is not available in newer Keras releases
from sklearn.model_selection import train_test_split
from gensim.models import KeyedVectors
import numpy as np
import tensorflow as tf

In [None]:
# load POS tagged corpora from NLTK
treebank_corpus = treebank.tagged_sents(tagset='universal')
brown_corpus = brown.tagged_sents(tagset='universal')
conll_corpus = conll2000.tagged_sents(tagset='universal')
tagged_sentences = treebank_corpus + brown_corpus + conll_corpus

In [None]:
# let's look at the data
tagged_sentences[2]

In [None]:
X = []  # one list of words per sentence
Y = []  # one list of tags per sentence

for sentence in tagged_sentences:
    X_sentence = []
    Y_sentence = []
    for word, tag in sentence:  # each entry is a (word, tag) pair
        X_sentence.append(word)
        Y_sentence.append(tag)

    X.append(X_sentence)
    Y.append(Y_sentence)
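
To make the word/tag split concrete, here is a minimal sketch on toy data. The sentences below are made up for illustration and do not come from the NLTK corpora; the sketch also assumes every sentence is non-empty, since `zip(*sentence)` cannot be unpacked into two names otherwise.

```python
# Hypothetical stand-in for NLTK tagged sentences: lists of (word, tag) pairs.
toy_tagged_sentences = [
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("Hello", "X"), (".", ".")],
]

toy_X, toy_Y = [], []
for sentence in toy_tagged_sentences:
    words, tags = zip(*sentence)  # unzip the (word, tag) pairs
    toy_X.append(list(words))
    toy_Y.append(list(tags))

print(toy_X)
print(toy_Y)
```

Each position in `toy_X[i]` lines up with the same position in `toy_Y[i]`, which is the property the rest of the pipeline relies on.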

In [None]:
# number of distinct words and distinct tags (case-insensitive)
num_words = len(set([word.lower() for sentence in X for word in sentence]))
num_tags = len(set([tag.lower() for sentence in Y for tag in sentence]))

In [None]:
# encode X

word_tokenizer = Tokenizer()                      
word_tokenizer.fit_on_texts(X)                    
X_encoded = word_tokenizer.texts_to_sequences(X)  

In [None]:
# encode Y

tag_tokenizer = Tokenizer()
tag_tokenizer.fit_on_texts(Y)
Y_encoded = tag_tokenizer.texts_to_sequences(Y)
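
The sketch below imitates what the two tokenizers are doing, in plain Python. It is a simplified emulation under my own assumptions (tokens are lowercased, indices start at 1 in order of decreasing frequency, and index 0 is left free), not the Keras implementation; `build_index` and the toy sentences are hypothetical names introduced only for illustration.

```python
from collections import Counter

def build_index(sequences):
    """Map each lowercased token to an integer, most frequent token first."""
    counts = Counter(tok.lower() for seq in sequences for tok in seq)
    # most_common() sorts by count; ties keep first-seen order.
    # Indices start at 1 so that 0 stays free for padding.
    return {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common())}

toy_sentences = [["The", "dog", "barks"], ["the", "cat"]]
toy_index = build_index(toy_sentences)
toy_encoded = [[toy_index[tok.lower()] for tok in seq] for seq in toy_sentences]

print(toy_index)
print(toy_encoded)
```

Because index 0 is never assigned to a token, a vocabulary of `len(word_index)` words needs `len(word_index) + 1` rows in an embedding matrix.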

In [None]:
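# sanity check: every encoded sentence should have exactly one tag per word
different_length = [1 if len(words) != len(tags) else 0 for words, tags in zip(X_encoded, Y_encoded)]
print("{} sentences have different input and output lengths.".format(sum(different_length)))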

In [None]:
lengths = [len(seq) for seq in X_encoded]

In [None]:
MAX_SEQ_LENGTH = 100  
X_padded = pad_sequences(X_encoded, maxlen=MAX_SEQ_LENGTH, padding="pre", truncating="post")
Y_padded = pad_sequences(Y_encoded, maxlen=MAX_SEQ_LENGTH, padding="pre", truncating="post")
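
The combination `padding="pre", truncating="post"` means zeros are added at the front of short sentences, while long sentences lose tokens from the end. A minimal pure-Python sketch of that behaviour follows; `pad_pre_truncate_post` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre_truncate_post(seq, maxlen, value=0):
    """Imitate pad_sequences(..., padding="pre", truncating="post") for one sequence."""
    seq = list(seq)[:maxlen]                     # truncating="post": drop tokens from the end
    return [value] * (maxlen - len(seq)) + seq   # padding="pre": fill the front with zeros

print(pad_pre_truncate_post([5, 3, 8], 5))              # [0, 0, 5, 3, 8]
print(pad_pre_truncate_post([1, 2, 3, 4, 5, 6, 7], 5))  # [1, 2, 3, 4, 5]
```

Applying the same settings to `X_encoded` and `Y_encoded` keeps words and tags aligned position by position.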

In [None]:
X, Y = X_padded, Y_padded

In [None]:
Y = to_categorical(Y)
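
One-hot encoding turns the `(sentences, MAX_SEQ_LENGTH)` integer array into a `(sentences, MAX_SEQ_LENGTH, classes)` array. The NumPy sketch below shows the idea on a toy array with three classes; it is an illustration of the transformation, not the `to_categorical` source, and the toy values are made up.

```python
import numpy as np

# One padded toy sentence: 0 is the padding index, 1 and 2 are tag ids.
toy_tags = np.array([[0, 0, 2, 1]])
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[toy_tags]

print(one_hot.shape)  # (1, 4, 3)
print(one_hot[0])
```

Note that padding gets its own class (index 0), which is why the class count is one more than the number of distinct tags.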

In [None]:
TEST_SIZE = 0.15
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=TEST_SIZE, random_state=4)

In [None]:
VALID_SIZE = 0.15
X_train, X_validation, Y_train, Y_validation = train_test_split(X_train, Y_train, test_size=VALID_SIZE, random_state=4)
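
Because the validation split is taken from the data that remains after the test split, the validation set is 15% of 85% of the corpus, not 15% of the whole. A quick arithmetic sketch (the `*_frac` names are introduced here for illustration only):

```python
TEST_SIZE = 0.15
VALID_SIZE = 0.15

test_frac = TEST_SIZE
valid_frac = (1 - TEST_SIZE) * VALID_SIZE        # 0.85 * 0.15
train_frac = (1 - TEST_SIZE) * (1 - VALID_SIZE)  # 0.85 * 0.85

print("train: {:.4f}, validation: {:.4f}, test: {:.4f}".format(
    train_frac, valid_frac, test_frac))
```

So roughly 72% of the sentences are used for training, about 13% for validation, and 15% for testing.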

In [None]:
NUM_CLASSES = Y.shape[2]

In [None]:
path = 'C:\\Users\\gupta\\Desktop\\datasets\\GoogleNews-vectors-negative300.bin'

In [None]:
word2vec = KeyedVectors.load_word2vec_format(path, binary=True)

In [None]:
EMBEDDING_SIZE  = 300 
VOCABULARY_SIZE = len(word_tokenizer.word_index) + 1

embedding_weights = np.zeros((VOCABULARY_SIZE, EMBEDDING_SIZE))

word2id = word_tokenizer.word_index

for word, index in word2id.items():
    try:
        embedding_weights[index, :] = word2vec[word]
    except KeyError:
        pass
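
Words that are missing from the pre-trained vectors keep an all-zero row, so it can be worth counting how many there are. The sketch below uses a plain dictionary as a hypothetical stand-in for the word2vec model (with toy 4-dimensional vectors) so that it runs without the GoogleNews file; with the real model, the same membership test should work as `word in word2vec`.

```python
import numpy as np

TOY_EMBEDDING_SIZE = 4  # toy value; the GoogleNews vectors have 300 dimensions

# Hypothetical stand-ins for the word2vec model and the tokenizer's word_index.
toy_vectors = {"dog": np.ones(TOY_EMBEDDING_SIZE), "cat": np.full(TOY_EMBEDDING_SIZE, 2.0)}
toy_word2id = {"dog": 1, "cat": 2, "zyxq": 3}

# Row 0 stays zero for the padding index.
toy_weights = np.zeros((len(toy_word2id) + 1, TOY_EMBEDDING_SIZE))
missing = []
for word, index in toy_word2id.items():
    if word in toy_vectors:
        toy_weights[index, :] = toy_vectors[word]
    else:
        missing.append(word)  # no pre-trained vector: the row stays all zeros

coverage = 1 - len(missing) / len(toy_word2id)
print("missing words:", missing)
print("coverage: {:.2f}".format(coverage))
```

One thing to keep in mind: the tokenizer lowercases every word, while the GoogleNews vocabulary is, as far as I know, case-sensitive, so some capitalized words may end up in the missing list only because of their casing.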

In [None]:
# print("Embeddings shape: {}".format(embedding_weights.shape))

# embedding_weights[word_tokenizer.word_index['joy']]

In [None]:
# create architecture

model = tf.keras.Sequential()
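model.add(tf.keras.layers.Embedding(
    input_dim=VOCABULARY_SIZE,        # number of word indices, including the padding index 0
    output_dim=EMBEDDING_SIZE,        # length of each word vector
    input_length=MAX_SEQ_LENGTH,      # length of every padded input sequence
    weights=[embedding_weights],      # initialise with the word2vec vectors
    trainable=True                    # allow the vectors to be fine-tuned during training
))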
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)))
model.add(tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')))
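
As a rough check on the architecture, the trainable parameter counts of the recurrent and output layers can be worked out by hand. The sketch assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias); the result should match what `model.summary()` reports for those two layers. The embedding layer is left out because its size depends on the vocabulary.

```python
EMBEDDING_SIZE = 300
LSTM_UNITS = 64
NUM_CLASSES = 13

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias).
lstm_params = 4 * (EMBEDDING_SIZE * LSTM_UNITS + LSTM_UNITS * LSTM_UNITS + LSTM_UNITS)

# Bidirectional runs a forward and a backward LSTM and concatenates the outputs.
bilstm_params = 2 * lstm_params
bilstm_output_dim = 2 * LSTM_UNITS

# The Dense layer is shared across time steps: weights plus bias.
dense_params = bilstm_output_dim * NUM_CLASSES + NUM_CLASSES

print(bilstm_params)  # 186880
print(dense_params)   # 1677
```

With `return_sequences=True`, the bidirectional layer emits a 128-dimensional vector for each of the 100 positions, and the time-distributed dense layer maps each one to 13 class probabilities.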

In [None]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])

In [None]:
model.fit(X_train, Y_train, batch_size=128, epochs=5, validation_data=(X_validation, Y_validation))

In [None]:
loss, accuracy = model.evaluate(X_test, Y_test)

print("Test loss: {:.4f}, test accuracy: {:.4f}".format(loss, accuracy))
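
The accuracy reported above is computed over all 100 positions of every sentence, including the padded ones, which the model can learn to predict trivially. A possible complement, sketched here under the assumption that class 0 is the padding class, is to score only the real tokens; `masked_accuracy` is a hypothetical helper, not a Keras function, and it assumes at least one non-padding token is present.

```python
import numpy as np

def masked_accuracy(y_true_onehot, y_pred_probs, pad_class=0):
    """Token-level accuracy that ignores positions whose true class is padding."""
    true_ids = np.argmax(y_true_onehot, axis=-1)
    pred_ids = np.argmax(y_pred_probs, axis=-1)
    mask = true_ids != pad_class  # keep only real tokens
    return float((pred_ids[mask] == true_ids[mask]).mean())

# Toy example: one sentence with two padding positions and two real tokens.
toy_true = np.eye(3)[np.array([[0, 0, 1, 2]])]
toy_pred = np.eye(3)[np.array([[0, 0, 1, 1]])]

plain = float((toy_true.argmax(-1) == toy_pred.argmax(-1)).mean())
print(plain)                                # 0.75
print(masked_accuracy(toy_true, toy_pred))  # 0.5
```

With the objects from this notebook, the call would look like `masked_accuracy(Y_test, model.predict(X_test))`.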