**Initialization**
* I place these three lines at the top of each of my Notebooks. The first two enable automatic reloading of modules, which helps prevent problems when reloading and reworking a Project or Problem. The third line renders visualizations inline within the Notebook.

In [1]:
#@ Initialization:
%reload_ext autoreload
%autoreload 2 
%matplotlib inline

**Downloading the Dependencies**
* I have imported all the Libraries and Dependencies required for this Project in one cell.

In [15]:
#@ Downloading the Libraries and Dependencies:
import os, glob
from random import shuffle
from IPython.display import display

import numpy as np                                      # Module to work with Arrays.
from keras.preprocessing import sequence                # Module to handle Padding Input.
from keras.models import Sequential                     # Base Keras Neural Network Model.
from keras.layers import Dense, Dropout, Flatten        # Layers Objects to pile into Model.
from keras.layers import SimpleRNN                      # Simple Recurrent layer.
from keras.layers import Bidirectional                  # Wrapper for Bidirectional Recurrent layers.

from nltk.tokenize import TreebankWordTokenizer         # Module for Tokenization.
from gensim.models.keyedvectors import KeyedVectors

**Getting the Data**
* I have used Google Colab for this Project, so the process of downloading and reading the Data might differ on other platforms. I have used the [**Large Movie Review Dataset**](https://ai.stanford.edu/~amaas/data/sentiment/) for this Project. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. The Dataset has a set of 25,000 highly polar movie reviews for training and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided.

In [3]:
#@ Getting the Data:
def preprocess_data(filepath):
  positive_path = os.path.join(filepath, "pos")
  negative_path = os.path.join(filepath, "neg")
  pos_label = 1
  neg_label = 0
  dataset = []
  
  for filename in glob.glob(os.path.join(positive_path, '*.txt')):                            # Positive Sentiment Dataset.
    with open(filename, "r", encoding="utf-8") as f:                                          # Explicit encoding for portability.
      dataset.append((pos_label, f.read()))
  for filename in glob.glob(os.path.join(negative_path, '*.txt')):                            # Negative Sentiment Dataset.
    with open(filename, "r", encoding="utf-8") as f:
      dataset.append((neg_label, f.read()))

  shuffle(dataset)                                                                            # Shuffling the Dataset.
  return dataset 
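
The function above expects a folder that contains `pos` and `neg` subfolders of `.txt` files. The following is a minimal sketch of that layout, assuming a made-up two-review dataset written to a temporary directory; the helper name `load_reviews` and the file name `0_1.txt` are hypothetical and only illustrate the idea.

```python
import glob
import os
import tempfile
from random import shuffle


def load_reviews(root):
    """Read every *.txt file under root/pos and root/neg into (label, text) pairs."""
    dataset = []
    for label, folder in ((1, "pos"), (0, "neg")):
        pattern = os.path.join(root, folder, "*.txt")
        for filename in glob.glob(pattern):
            with open(filename, "r", encoding="utf-8") as f:
                dataset.append((label, f.read()))
    shuffle(dataset)  # Mix positive and negative samples.
    return dataset


# Build a tiny, made-up dataset in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for folder, text in (("pos", "A charming and fun show."),
                         ("neg", "Not worth watching.")):
        os.makedirs(os.path.join(root, folder))
        path = os.path.join(root, folder, "0_1.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
    toy_dataset = load_reviews(root)

print(sorted(toy_dataset))
```

Because the samples are shuffled, the order of `toy_dataset` varies between runs, which is why the example sorts it before printing.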

**Processing the Dataset**
* I have manually downloaded the Dataset from [**Large Movie Review Dataset**](https://ai.stanford.edu/~amaas/data/sentiment/). I have used a small subset of the Data.

In [4]:
#@ Processing the Dataset:
PATH = "/content/drive/My Drive/Colab Notebooks/Data/Smalltrain"                     # Path to the Dataset.
dataset = preprocess_data(PATH)                                                      # Processing the Dataset.

#@ Inspecting the Dataset:
dataset[:3]                                                                          # Inspecting the Dataset.

[(1,
  "For me too, this Christmas special is one that I remember very fondly. In 1989, I snatched up the 2 CDs I found of the soundtrack recording, giving one to my sister and keeping the other for myself. It's part of my family's Christmas tradition now, and I would love to be able to actually see the show again rather than just remember it as I listen.<br /><br />It has been noted elsewhere that John Denver made a number of appearances on the Muppet Show, and they did more than one special together. The good rapport between Denver and his fuzzy companions comes through clearly here, in a charming and fun show that is good for all ages."),
 (0,
  "Live! Yes, but not kicking.<br /><br />True story: Some time ago, a Dutch TV station made an announcement that they were going to air a new reality show. A contest rather. The main participant in this show would be a woman who was dying of something terrible and she would be donating her kidneys to one lucky person with progressive kidney f

**Tokenization and Vectorization**
* The next step is to perform Tokenization and Vectorization of the Dataset. I will use the Google News pretrained Word2vec vectors for Vectorization. The Google News Word2vec Vocabulary includes some stopwords as well. The code below loads only the first 100,000 vectors (`limit=100000`), and tokens that are not in this Vocabulary are skipped.

In [6]:
#@ Tokenization and Vectorization:
# !wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"                # Pretrained Word2vec Model.    

word_vectors = KeyedVectors.load_word2vec_format("/content/GoogleNews-vectors-negative300.bin.gz",           # Word2vec Model Vectors.
                                       binary=True, limit=100000)

#@ Function for Tokenization and Vectorization:
def tokenize_and_vectorize(dataset):
  tokenizer = TreebankWordTokenizer()                                  # Instantiating the Tokenizer.
  vectorized_data = []
  for sample in dataset:
    tokens = tokenizer.tokenize(sample[1])                             # Process for Tokenization.
    sample_vecs = []
    for token in tokens:
      try:
        sample_vecs.append(word_vectors[token])                        # Process for Vectorization.
      except KeyError:
        pass
    vectorized_data.append(sample_vecs)
  
  return vectorized_data                                               # Returning the Vectorized Data.

#@ Function for Collecting the Target Labels:
def collect_expected(dataset):
  """ Collecting the Target Labels: 0 for Negative Review and 1 for Positive Review. """
  expected=[]
  for sample in dataset:
    expected.append(sample[0])
  return expected

#@ Tokenization and Vectorization:
vectorized_data = tokenize_and_vectorize(dataset)
expected = collect_expected(dataset)
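
Since unknown tokens are skipped, a vectorized review can be shorter than its token list. The sketch below illustrates this with a hypothetical three-word dictionary standing in for the Word2vec lookup and a plain `split()` standing in for the Treebank tokenizer; neither is the real model or tokenizer.

```python
# Hypothetical stand-in for the Word2vec lookup: three words, 2-dimensional vectors.
toy_vectors = {
    "good": [0.9, 0.1],
    "bad": [-0.8, 0.2],
    "movie": [0.0, 0.5],
}


def vectorize(tokens, vectors):
    """Return the vectors of known tokens and the list of skipped tokens."""
    kept, skipped = [], []
    for token in tokens:
        if token in vectors:
            kept.append(vectors[token])
        else:
            skipped.append(token)
    return kept, skipped


tokens = "a good movie , not a bad movie".split()
kept, skipped = vectorize(tokens, toy_vectors)

print(len(tokens), len(kept))  # 8 4
print(skipped)                 # ['a', ',', 'not', 'a']
```

In this toy vocabulary the word "not" is dropped, which shows how a small vocabulary can discard tokens that carry sentiment.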


**Splitting into Training and Testing**
* Now, I will split the Dataset obtained above into a Training set and a Test set, using 80% for Training and 20% for Testing. The next code will bucket the Data into the Training set `X_train` with its labels `y_train`, and the Test set `X_test` with its labels `y_test`. A plain slice is enough here because the Dataset was already shuffled when it was loaded.

In [7]:
#@ Splitting the Dataset into Training set and Test set:
split_part = int(len(vectorized_data) * 0.8)

#@ Training set:
X_train = vectorized_data[:split_part]
y_train = expected[:split_part]

#@ Test set:
X_test = vectorized_data[split_part:]
y_test = expected[split_part:]
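
As a quick check of the split arithmetic, here is a small sketch assuming 2002 samples, the total implied by the shapes printed later in this Notebook (1601 + 401). The stand-in data and the helper name `split_80_20` are made up for illustration.

```python
from collections import Counter


def split_80_20(samples, labels, train_fraction=0.8):
    """Slice samples and labels at the same index; int() truncates the fraction."""
    split_at = int(len(samples) * train_fraction)
    train = (samples[:split_at], labels[:split_at])
    test = (samples[split_at:], labels[split_at:])
    return train, test


# Made-up stand-in data: 2002 samples with alternating labels.
samples = list(range(2002))
labels = [i % 2 for i in samples]

(train_x, train_y), (test_x, test_y) = split_80_20(samples, labels)

print(len(train_x), len(test_x))  # 1601 401
print(Counter(train_y), Counter(test_y))
```

Counting the labels in each part, as the last line does, is a cheap way to confirm that both classes appear in the Training set and the Test set.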

### **Recurrent Neural Networks**
* A Recurrent Neural Network is a class of Artificial Neural Networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from Feedforward Neural Networks, RNNs can use their internal state or memory to process variable length sequences of inputs. This makes them applicable to tasks such as Unsegmented and Connected Handwriting Recognition or Speech Recognition.
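
The idea of an internal state carried across time steps can be sketched in plain Python. This is a minimal illustration of the usual simple RNN recurrence, h_t = tanh(W_x x_t + W_h h_{t-1} + b), with arbitrary made-up weights; it is not the Keras implementation.

```python
import math


def rnn_forward(inputs, w_x, w_h, bias):
    """Apply h_t = tanh(w_x . x_t + w_h . h_(t-1) + bias) over a sequence.

    inputs: list of input vectors, one per time step
    w_x:    hidden_size x input_size weight matrix
    w_h:    hidden_size x hidden_size recurrent weight matrix
    bias:   bias vector of length hidden_size
    """
    hidden_size = len(bias)
    h = [0.0] * hidden_size  # The initial state is all zeros.
    states = []
    for x in inputs:
        new_h = []
        for i in range(hidden_size):
            total = bias[i]
            total += sum(w * v for w, v in zip(w_x[i], x))
            total += sum(w * v for w, v in zip(w_h[i], h))
            new_h.append(math.tanh(total))
        h = new_h
        states.append(h)
    return states


# Arbitrary illustrative weights: 2 inputs, 2 hidden units.
w_x = [[0.5, -0.3], [0.1, 0.8]]
w_h = [[0.2, 0.0], [-0.4, 0.6]]
bias = [0.0, 0.1]

short_states = rnn_forward([[1.0, 0.0], [0.0, 1.0]], w_x, w_h, bias)
long_states = rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w_x, w_h, bias)
print(len(short_states), len(long_states))  # 2 3
```

The same weights handle both the two-step and the three-step sequence, which is what lets an RNN process inputs of variable length.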

In [8]:
#@ Parameters of Recurrent Neural Networks:
maxlen = 500                                    # Maximum review length.
batch_size = 32                                 # Number of samples shown to the network before updating the weights.
embedding_dims = 300                            # Length of token vectors for passing in RNN.
epochs = 10                                     # Number of times for passing the training dataset.
num_neurons = 50                                # Number of neurons in the Recurrent layer.

**Padding and Truncating the Sequence**
* **Keras** has a preprocessing helper called `pad_sequences` for padding input Data, but it is intended for sequences of scalars, whereas each sample here is a sequence of vectors. So I will write a helper function to pad or truncate the input Data.

In [9]:
#@ Padding and Truncating the Token Sequence:

def pad_trunc(data, maxlen):
  """ Padding with zero Vectors or truncating each sample to exactly maxlen tokens. """
  new_data = []
  # Creating a zero vector with the same length as the Word vectors.
  zero_vector = [0.0] * len(data[0][0])

  for sample in data:
    if len(sample) > maxlen:
      temp = sample[:maxlen]                        # Truncating the long samples.
    elif len(sample) < maxlen:
      temp = list(sample)                           # Copying so that the input is not modified.
      # Append the appropriate number of zero vectors to the list.
      additional_elems = maxlen - len(sample)
      for _ in range(additional_elems):
        temp.append(zero_vector)
    else:
      temp = sample
    new_data.append(temp)
  return new_data


#@ Gathering the Truncated and Augmented Data:
X_train = pad_trunc(X_train, maxlen)
X_test = pad_trunc(X_test, maxlen)

#@ Converting the Data into Numpy Arrays:
X_train = np.reshape(X_train, (len(X_train), maxlen, embedding_dims))
y_train = np.array(y_train)
X_test = np.reshape(X_test, (len(X_test), maxlen, embedding_dims))
y_test = np.array(y_test)

#@ Inspecting the shape of the Data:
display(f"Shape of Training Data {X_train.shape, y_train.shape}")
display(f"Shape of Testing Data {X_test.shape, y_test.shape}")

'Shape of Training Data ((1601, 500, 300), (1601,))'

'Shape of Testing Data ((401, 500, 300), (401,))'
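
To make the padding and truncating behavior concrete, here is a small sketch on made-up 2-dimensional vectors. The helper `pad_or_truncate` is a hypothetical single-sample variant written for this example, not the function used above.

```python
def pad_or_truncate(sample, maxlen, vector_size):
    """Return a new list of exactly maxlen vectors; the input is not modified."""
    clipped = list(sample[:maxlen])  # Truncate long samples (this makes a copy).
    missing = maxlen - len(clipped)
    # Each padding step gets its own zero vector, so no list object is shared.
    clipped.extend([0.0] * vector_size for _ in range(missing))
    return clipped


short_sample = [[1.0, 2.0]]                                      # 1 token
long_sample = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # 4 tokens

padded = pad_or_truncate(short_sample, maxlen=3, vector_size=2)
truncated = pad_or_truncate(long_sample, maxlen=3, vector_size=2)

print(padded)             # [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
print(truncated)          # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(len(short_sample))  # 1
```

Every sample ends up with the same number of time steps, which is what allows the Data to be reshaped into a single array of shape `(samples, maxlen, embedding_dims)`.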

**Recurrent Neural Network**
* Now that the Dataset is ready, I will build the Neural Network.

In [10]:
#@ Simple Recurrent Neural Network:
model = Sequential()                                         # Standard Model definition for Keras.
model.add(SimpleRNN(                                         # Adding the Recurrent layer.
    num_neurons, return_sequences=True,
    input_shape=(maxlen, embedding_dims)
))
model.add(Dropout(0.2))                                      # Adding the Dropout layer.
model.add(Flatten())
model.add(Dense(1, activation="sigmoid"))                    # Using sigmoid activation.

#@ Compiling the Recurrent Neural Network:
model.compile(
    loss="binary_crossentropy",
    optimizer="rmsprop",
    metrics=["accuracy"]
)

#@ Training the Recurrent Neural Network:
model.fit(
    X_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(X_test, y_test)
)

#@ Inspecting the Summary of the Model:
print("\n")
model.summary()

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn (SimpleRNN)       (None, 500, 50)           17550     
_________________________________________________________________
dropout (Dropout)            (None, 500, 50)           0         
_________________________________________________________________
flatten (Flatten)            (None, 25000)             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 25001     
Total params: 42,551
Trainable params: 42,551
Non-trainable params: 0
_________________________________________________________________
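
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard counts for a simple recurrent layer (input weights, recurrent weights, and biases) and for a fully connected layer; the helper names are made up for this example.

```python
def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + biases."""
    return units * input_dim + units * units + units


def dense_params(inputs, outputs):
    """One weight per input-output pair, plus one bias per output."""
    return inputs * outputs + outputs


maxlen, embedding_dims, num_neurons = 500, 300, 50

rnn = simple_rnn_params(embedding_dims, num_neurons)
# Flatten turns the (maxlen, num_neurons) output into maxlen * num_neurons features.
dense = dense_params(maxlen * num_neurons, 1)

print(rnn, dense, rnn + dense)  # 17550 25001 42551
```

The recurrent layer's count does not depend on `maxlen`, because the same weights are reused at every time step. The Dense layer's count does, because `return_sequences=True` followed by `Flatten` feeds it one value per neuron per time step.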


**Building the Larger Network**
* I will use 100 neurons instead of 50 in the Model defined below, and train it for 5 epochs instead of 10.

In [11]:
#@ Simple Recurrent Neural Network:
num_neurons=100                                              # Adding 100 Neurons.
epochs=5
model = Sequential()                                         # Standard Model definition for Keras.
model.add(SimpleRNN(                                         # Adding the Recurrent layer.
    num_neurons, return_sequences=True,
    input_shape=(maxlen, embedding_dims)
))
model.add(Dropout(0.2))                                      # Adding the Dropout layer.
model.add(Flatten())
model.add(Dense(1, activation="sigmoid"))                    # Using sigmoid activation.

#@ Compiling the Recurrent Neural Network:
model.compile(
    loss="binary_crossentropy",
    optimizer="rmsprop",
    metrics=["accuracy"]
)

#@ Training the Recurrent Neural Network:
model.fit(
    X_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(X_test, y_test)
)

#@ Inspecting the Summary of the Model:
print("\n")
model.summary()

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_1 (SimpleRNN)     (None, 500, 100)          40100     
_________________________________________________________________
dropout_1 (Dropout)          (None, 500, 100)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 50000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 50001     
Total params: 90,101
Trainable params: 90,101
Non-trainable params: 0
_________________________________________________________________


**Saving the Recurrent Neural Network**

In [12]:
#@ Saving the Recurrent Neural Network:
model_structure = model.to_json()
with open("simplernn_model.json", "w") as json_file:
  json_file.write(model_structure)
model.save_weights("simplernn_model.h5")
print("Model saved!!")

Model saved!!


**Model Evaluation**
* Now that I have trained a Model, I will write a sample with Positive Sentiment and predict its Sentiment using the Neural Network.

In [14]:
#@ Model Evaluation:
sample_1 = "Natural Language Processing is one of the most interesting topics in Machine Learning. Many people loves to learn Natural \
            Language Processing in the modern days. Surprisingly, some people doen't like Natural Langugae Processing a lot! I can't wait \
            to learn NLP in future days. I am fond of reading NLP."

#@ Making Predictions:
vec_list = tokenize_and_vectorize([(1, sample_1)])
test_vec_list = pad_trunc(vec_list, maxlen)
test_vec = np.reshape(test_vec_list, (len(test_vec_list), maxlen, embedding_dims))

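#@ Inspecting the Prediction:
# predict_classes was removed in newer TensorFlow releases; thresholding the sigmoid output at 0.5 is equivalent.
f"The predicted sentiment by the Model is: {(model.predict(test_vec) > 0.5).astype('int32')}"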

'The predicted sentiment by the Model is: [[1]]'
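
With a single sigmoid output, the Model produces a probability, and the class label comes from comparing it with 0.5. A minimal sketch of that step, using made-up probabilities and a hypothetical helper name:

```python
def to_class(probabilities, threshold=0.5):
    """Map sigmoid outputs to 0/1 labels; values above the threshold become 1."""
    return [[int(p > threshold) for p in row] for row in probabilities]


# Made-up probabilities shaped like a batch of three predictions.
print(to_class([[0.93], [0.12], [0.5]]))  # [[1], [0], [0]]
```

A label of 1 corresponds to a Positive review and 0 to a Negative review, matching the labels assigned when the Dataset was loaded.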

**Note:**
* The Simple Recurrent Neural Network described above is not the best choice. It is relatively expensive to train and to pass new samples through compared to a Feedforward Network or a Convolutional Neural Network. At least, the results obtained above are not appreciably better.

## **Bidirectional Recurrent Neural Network**
* **Keras** provides a layer wrapper that flips the necessary inputs and outputs to automatically assemble a Bidirectional Recurrent Neural Network.

In [16]:
#@ Parameters of Recurrent Neural Network:
num_neurons = 20                                  # Number of neurons in each direction.
maxlen = 500                                      # Maximum review length; must match the padded length of X_train.
embedding_dims = 300                              # Length of token vectors for passing in RNN.

#@ Bidirectional Recurrent Neural Network:
model = Sequential()                              # Standard Model Definition for Keras.
model.add(Bidirectional(SimpleRNN(                # Bidirectional Neural Network.
    num_neurons, return_sequences=True,
    input_shape=(maxlen, embedding_dims)
)))
model.add(Dropout(0.2))                                      # Adding the Dropout layer.
model.add(Flatten())
model.add(Dense(1, activation="sigmoid"))                    # Using sigmoid activation.

#@ Compiling the Recurrent Neural Network:
model.compile(
    loss="binary_crossentropy",
    optimizer="rmsprop",
    metrics=["accuracy"]
)

#@ Training the Recurrent Neural Network:
model.fit(
    X_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(X_test, y_test)
)

#@ Inspecting the Summary of the Model:
print("\n")
model.summary()

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional (Bidirectional (None, 500, 40)           12840     
_________________________________________________________________
dropout_2 (Dropout)          (None, 500, 40)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 20000)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 20001     
Total params: 32,841
Trainable params: 32,841
Non-trainable params: 0
_________________________________________________________________
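
The summary shows 40 outputs per time step for 20 neurons because the wrapper runs one recurrent layer forward and one backward, then concatenates their outputs. The sketch below illustrates that idea with a toy running sum standing in for the recurrent pass; the helper names are made up and this is not the Keras implementation.

```python
def running_sum(sequence):
    """Toy stand-in for a recurrent pass: the state is the sum of inputs seen so far."""
    state, states = 0, []
    for x in sequence:
        state += x
        states.append(state)
    return states


def bidirectional(sequence, recurrent_pass):
    """Run the pass in both directions and pair the states step by step."""
    forward = recurrent_pass(sequence)
    # Process the reversed sequence, then flip the result back to the original time order.
    backward = recurrent_pass(sequence[::-1])[::-1]
    return list(zip(forward, backward))


paired = bidirectional([1, 2, 3, 4], running_sum)
print(paired)  # [(1, 10), (3, 9), (6, 7), (10, 4)]

# Parameter count for the wrapped layer: two independent simple RNNs of 20 units each.
units, input_dim = 20, 300
per_direction = units * input_dim + units * units + units
print(2 * per_direction)  # 12840
```

At each time step the first value summarizes everything up to that step and the second summarizes everything from that step onward, so every position sees both past and future context.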
