
Many to Many RNNs

These networks take a sequence as input and produce a sequence as output. They can be used for problems such as machine translation, named entity recognition, and POS tagging.

In this project you will apply different types of RNNs to the task of POS tagging.

In [None]:
import nltk
nltk.download('treebank')
nltk.download('brown')
nltk.download('universal_tagset')

In [None]:
## We will use the Treebank and Brown corpora from NLTK as the dataset
from nltk.corpus import treebank
from nltk.corpus import brown

In [None]:
# load POS tagged corpora from NLTK
treebank_corpus = treebank.tagged_sents(tagset='universal')
brown_corpus = brown.tagged_sents(tagset='universal')
tagged_sentences = treebank_corpus + brown_corpus


In [None]:
print("Number of sentences: " + str(len(tagged_sentences)))
tagged_sentences[0]

This is a many-to-many problem: each data point is one sentence from the corpora.

Each data point has multiple words in its input sequence. This is what we will refer to as X.

Each word has a corresponding tag in the output sequence. This is what we will refer to as Y.



In [None]:
X = [] # store input sequence
Y = [] # store output sequence

for sentence in tagged_sentences:
    X_sentence = []
    Y_sentence = []
    for entity in sentence:         
        X_sentence.append(entity[0])  # entity[0] contains the word
        Y_sentence.append(entity[1])  # entity[1] contains corresponding tag
        
    X.append(X_sentence)
    Y.append(Y_sentence)


In [None]:
num_words = len(set(word.lower() for sentence in X for word in sentence))  # unique lower-cased words
num_tags  = len(set(tag.lower() for sentence in Y for tag in sentence))    # unique tags


In [None]:
print("Total number of tagged sentences: {}".format(len(X)))
print("Vocabulary size: {}".format(num_words))
print("Total number of tags: {}".format(num_tags))

In [None]:
## Task - 1
## Vectorize each sentence and pad each sequence to a fixed length

In [None]:
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical  # keras.utils.np_utils is not available in recent Keras releases
from keras.preprocessing.text import Tokenizer

In [None]:
word_tokenizer = Tokenizer()                      # instantiate tokeniser
                                                # fit tokeniser on data
                                                # use the tokeniser to encode input sequence
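
To make Task 1 concrete, here is a minimal pure-Python sketch of what vectorizing and padding do to a toy corpus. It is not the Keras solution: `build_index`, `encode` and `pad` are hypothetical helpers introduced only for illustration, and the toy sentences are made up. Index 0 is reserved for padding, and `pad` mirrors the `padding="pre"`, `truncating="post"` settings used later in this notebook. The Keras `Tokenizer` assigns indices by word frequency, so its actual ids will differ from the ones below.

```python
def build_index(sentences):
    """Map each lower-cased word to a positive integer; 0 is reserved for padding."""
    index = {}
    for sentence in sentences:
        for word in sentence:
            index.setdefault(word.lower(), len(index) + 1)
    return index


def encode(sentences, index):
    """Replace every word with its integer id."""
    return [[index[word.lower()] for word in sentence] for sentence in sentences]


def pad(sequence, maxlen):
    """Truncate at the end ("post") and pad with zeros at the front ("pre")."""
    truncated = sequence[:maxlen]
    return [0] * (maxlen - len(truncated)) + truncated


# Made-up toy data, not taken from the corpora
toy_X = [["The", "cat", "sat"], ["The", "dog", "sat", "down", "quietly"]]
toy_index = build_index(toy_X)
toy_encoded = encode(toy_X, toy_index)
toy_padded = [pad(seq, 4) for seq in toy_encoded]
print(toy_padded)  # [[0, 1, 2, 3], [1, 4, 3, 5]]
```

Every row now has the same length, which is what allows the sentences to be stacked into a single array for training.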

In [None]:
## Task - 2 
## Convert Y to categorical and pad it the same way as the input
tag_tokenizer = Tokenizer()


In [None]:
## Padding
# X_encoded is the encoded form of X from Task 1
MAX_SEQ_LENGTH = 100  # sequences greater than 100 in length will be truncated

X_padded = pad_sequences(X_encoded, maxlen=MAX_SEQ_LENGTH, padding="pre", truncating="post")
# Pad for Y
X, Y = X_padded, Y_padded


In [None]:
# Change Y to categorical
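
Changing Y to categorical means replacing each integer tag id with a one-hot vector, so a padded tag matrix of shape (sentences, sequence length) becomes (sentences, sequence length, classes). This is why `NUM_CLASSES` is later read from `Y.shape[2]`. A minimal pure-Python sketch on made-up tag ids follows; `one_hot` is a hypothetical helper, not a Keras function.

```python
def one_hot(tag_id, num_classes):
    """Return a list of zeros with a single 1 at position tag_id."""
    vector = [0] * num_classes
    vector[tag_id] = 1
    return vector


# Made-up padded tag ids for two sentences of length 3 (0 is the padding id)
toy_Y_padded = [[0, 1, 2], [2, 2, 1]]
toy_num_classes = max(max(row) for row in toy_Y_padded) + 1

toy_Y_categorical = [
    [one_hot(tag_id, toy_num_classes) for tag_id in row]
    for row in toy_Y_padded
]
print(toy_Y_categorical[0])  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```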

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle


In [None]:
### Split data into training and testing sets
TEST_SIZE = 0.15
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=TEST_SIZE, random_state=4)


In [None]:
print("TRAINING DATA")
print('Shape of input sequences: {}'.format(X_train.shape))
print('Shape of output sequences: {}'.format(Y_train.shape))
print("-"*50)
print("TESTING DATA")
print('Shape of input sequences: {}'.format(X_test.shape))
print('Shape of output sequences: {}'.format(Y_test.shape))


In [None]:
NUM_CLASSES = Y.shape[2]

In [None]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import Dense, Input
from keras.layers import TimeDistributed
from keras.layers import LSTM, GRU, Bidirectional, SimpleRNN, RNN
from keras.models import Model

In [None]:
X[0]

In [None]:
### Task - 3 Complete the two lines

In [None]:
rnn_model = Sequential()

# create embedding layer - usually the first layer in text problems
rnn_model.add(Embedding(num_words + 1,         # vocabulary size - number of unique words in data
                        output_dim    =  300,          # length of vector with which each word is represented
                        input_length  =  MAX_SEQ_LENGTH,          # length of input sequence
                        trainable     =  False                    # False - don't update the embeddings
))

# add any RNN layer with 64 units; it must return the full sequence (one output per time step)

# add a time-distributed layer (one output at each time step)
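
The recurrent layer has to emit an output at every time step because POS tagging needs one prediction per word, not one per sentence. The sketch below is a toy pure-Python forward pass of a simple RNN, using made-up weights and a hypothetical `rnn_forward` helper. It only illustrates the shapes involved and is not how Keras implements its layers.

```python
import math


def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple tanh RNN over a sequence and return the hidden state at every step."""
    hidden = [0.0] * len(b)
    states = []
    for x in inputs:
        new_hidden = []
        for j in range(len(b)):
            total = b[j]
            total += sum(W_x[j][i] * x[i] for i in range(len(x)))
            total += sum(W_h[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden
        states.append(hidden)
    return states


# Made-up toy values: 3 time steps, 2 input features, 2 hidden units
toy_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_W_x = [[0.5, -0.5], [0.25, 0.75]]
toy_W_h = [[0.1, 0.0], [0.0, 0.1]]
toy_b = [0.0, 0.0]

toy_states = rnn_forward(toy_inputs, toy_W_x, toy_W_h, toy_b)
print(len(toy_states), len(toy_states[0]))  # 3 2
```

A time-distributed dense layer then applies the same classifier to each of these per-step states, yielding one tag distribution per word.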


In [None]:
rnn_model.compile(loss      =  'categorical_crossentropy',
                  optimizer =  'adam',
                  metrics   =  ['acc'])


In [None]:
rnn_model.summary()


In [None]:
rnn_training = rnn_model.fit(X_train, Y_train, batch_size=256, epochs=10)


In [None]:
from matplotlib import pyplot as plt

In [None]:
plt.plot(rnn_training.history['acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train'], loc="lower right")  # only the training curve is plotted
plt.show()


In [None]:
loss, accuracy = rnn_model.evaluate(X_test, Y_test, verbose = 1)
print("Loss: {0},\nAccuracy: {1}".format(loss, accuracy))
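
One thing to keep in mind when reading the accuracy above: unless masking is enabled, padded positions count toward it as well, and with `MAX_SEQ_LENGTH = 100` many positions in a typical sentence are padding. The following is a minimal sketch of an accuracy that skips padding. It assumes tag id 0 is the padding id and that both the true and predicted tags have already been converted to integer ids (for example with an argmax over the class axis). `masked_accuracy` is a hypothetical helper and the toy values are made up.

```python
def masked_accuracy(true_ids, predicted_ids, pad_id=0):
    """Fraction of correctly predicted tags, ignoring positions whose true tag is padding."""
    correct = 0
    total = 0
    for true_row, predicted_row in zip(true_ids, predicted_ids):
        for true_tag, predicted_tag in zip(true_row, predicted_row):
            if true_tag == pad_id:
                continue  # skip padded positions
            total += 1
            correct += int(true_tag == predicted_tag)
    return correct / total if total else 0.0


# Made-up toy tag ids; 0 marks padding
toy_true = [[0, 0, 1, 2], [0, 3, 1, 2]]
toy_pred = [[0, 0, 1, 1], [0, 3, 2, 2]]

# Accuracy over all positions, padding included
plain = sum(
    int(t == p)
    for true_row, pred_row in zip(toy_true, toy_pred)
    for t, p in zip(true_row, pred_row)
) / 8

print(plain)                                # 0.75
print(masked_accuracy(toy_true, toy_pred))  # 0.6
```

In this toy case the padded positions are all predicted correctly, so including them raises the score from 0.6 to 0.75.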

In [None]:
## Task - 4 How does setting the trainable parameter of the Embedding layer to True affect the performance?

In [None]:
## Task - 5 How else can you improve the accuracy? 

In [None]:
## Task - 6 Use other RNNs available in Keras, such as LSTM, GRU, BiLSTM and BiGRU, and compare any three of these models with the simple RNN