**Initialization**
* I use these 3 lines of code on top of my each Notebooks because it will help to prevent any problems while reloading and reworking on a same Project or Problem. And the third line of code helps to make visualization within the Notebook.

In [2]:
#@ Initialization:
%reload_ext autoreload 
%autoreload 2
%matplotlib inline

**Downloading the Dependencies**
* I have downloaded all the Libraries and Dependencies required for this Project in one particular cell.

In [3]:
#@ Downloading the Libraries and Dependencies:
import os, glob
from random import shuffle
from IPython.display import display

import numpy as np                                      # Module to work with Arrays.
from keras.preprocessing import sequence                # Module to handle Padding Input.
from keras.models import Sequential                     # Base Keras Neural Network Model.
from keras.layers import Dense, Dropout, Flatten        # Layers Objects to pile into Model.
from keras.layers import LSTM                           # Convolutional Layer and MaxPooling.

from nltk.tokenize import TreebankWordTokenizer         # Module for Tokenization.
from gensim.models.keyedvectors import KeyedVectors

**Getting the Data**
* I have used Google Colab for this Project so the process of downloading and reading the Data might be different in other platforms. I have used [**Large Moview Review Dataset**](https://ai.stanford.edu/~amaas/data/sentiment/) for this Project. This is a dataset for binary sentiment classification containing substantially more data. The Dataset has a set of 25,000 highly polar movie reviews for training and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided.

In [4]:
#@ Getting the Data:
def preprocess_data(filepath):
  positive_path = os.path.join(filepath, "pos")
  negative_path = os.path.join(filepath, "neg")
  pos_label = 1
  neg_label = 0
  dataset = []
  
  for filename in glob.glob(os.path.join(positive_path, '*.txt')):                            # Positive Sentiment Dataset.
    with open(filename, "r") as f:
      dataset.append((pos_label, f.read()))
  for filename in glob.glob(os.path.join(negative_path, '*.txt')):                            # Negative Sentiment Dataset.
    with open(filename, "r") as f:
      dataset.append((neg_label, f.read()))

  shuffle(dataset)                                                                            # Shuffling the Dataset.
  return dataset 

**Processing the Dataset**
* I have manually downloaded the Dataset from [**Large Moview Review Dataset**](https://ai.stanford.edu/~amaas/data/sentiment/). I have used the small subset of Data.

In [5]:
#@ Processing the Dataset:
PATH = "/content/drive/My Drive/Colab Notebooks/Data/Smalltrain"                     # Path to the Dataset.
dataset = preprocess_data(PATH)                                                      # Processing the Dataset.

#@ Inspecting the Dataset:
dataset[:3]                                                                          # Inspecting the Dataset.

[(0,
  'Admittedly, I find Al Pacino to be a guilty pleasure. He was a fine actor until Scent of a Woman, where he apparently overdosed on himself irreparably. I hoped this film, of which I\'d heard almost nothing growing up, would be a nice little gem. An overlooked, ahead-of-its-time, intelligent and engaging city-political thriller. It\'s not.<br /><br />City Hall is a movie that clouds its plot with so many characters, names, and "realistic" citywide issues, that for a while you think its a plot in scope so broad and implicating, that once you find out the truth, it will blow your mind. In truth, however, these subplots and digressions result ultimately in fairly tame and very familiar urban story trademarks such as Corruption of Power, Two-Faced Politicians, Mafia with Police ties, etc. And theoretically, this setup allows for some thrilling tension, the fear that none of the characters are safe, and anything could happen! But again, it really doesn\'t.<br /><br />Unfortunately, t

**Tokenization and Vectorization**
* The next step is to perform the Tokenization and Vectorization of the Dataset. I will use Google news pretrained Model Vectors for the process of Vectorization. The Google News Word2vec Vocabulary includes some stopwords as well. 

In [7]:
#@ Tokenization and Vectorization:
# !wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"                # Pretrained Word2vec Model.    

word_vectors = KeyedVectors.load_word2vec_format("/content/GoogleNews-vectors-negative300.bin.gz",           # Word2vec Model Vectors.
                                       binary=True, limit=100000)

#@ Function for Tokenization and Vectorization:
def tokenize_and_vectorize(dataset):
  tokenizer = TreebankWordTokenizer()                                  # Instantiating the Tokenizer.
  vectorized_data = []
  for sample in dataset:
    tokens = tokenizer.tokenize(sample[1])                             # Process for Tokenization.
    sample_vecs = []
    for token in tokens:
      try:
        sample_vecs.append(word_vectors[token])                        # Process for Vectorization.
      except KeyError:
        pass
    vectorized_data.append(sample_vecs)
  
  return vectorized_data                                               # Returning the Vectorized Data.

#@ Function for Collecting the Target Labels:
def collect_expected(dataset):
  """ Collecting the Target Labels: 0 for Negative Review and 1 for Positive Review. """
  expected=[]
  for sample in dataset:
    expected.append(sample[0])
  return expected

#@ Tokenization and Vectorization:
vectorized_data = tokenize_and_vectorize(dataset)
expected = collect_expected(dataset)

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


**Splitting into Training and Testing.**
* Now, I will split the above obtained Dataset into Training set and a Test set. I will split the Dataset into 80% for Training and 20% for Test set. The next code will bucket the Data into Training set X train along with correct labels y train and similarly into Test set X test along with correct labels y test.

In [23]:
#@ Splitting the Dataset into Training set and Test set:
split_part = int(len(vectorized_data) * 0.8)

#@ Training set:
X_train = vectorized_data[:split_part]
y_train = expected[:split_part]

#@ Test set:
X_test = vectorized_data[split_part:]
y_test = expected[split_part:]

### **Long Short Term Memory**
* Long Short Term Memory or LSTM is an Artificial Recurrent Neural Network or RNN architecture used in the field of Deep Learning. Unlike standard Feedforward Neural Networks, LSTM has Feedback connections. It can not only process single data points, but also entire sequences of data such as Speech or Video.

In [9]:
#@ Parameters of LSTM Neural Network:
maxlen = 500                                    # Maximum review length.
batch_size = 32                                 # Number of samples shown to the network before updating the weights.
embedding_dims = 300                            # Length of token vectors for passing in RNN.
epochs = 10                                     # Number of times for passing the training dataset.
num_neurons = 50

**Padding and Truncating the Sequence**
* **Keras** has the preprocessing helper method called pad_sequences which is used to pad the input Data. But it works only on the sequence of scalars and sequence of vectors. Now, I will write the helper function to pad the input Data.

In [10]:
#@ Padding and Truncating the Token Sequence:

def pad_trunc(data, maxlen):
  """ Padding the Dataset with zero Vectors. """
  new_data = []
  # Creating zeros vectors of length of Word vectors.
  zero_vector = []
  for _ in range(len(data[0][0])):
    zero_vector.append(0.0)

  for sample in data:
    if len(sample) > maxlen:
      temp = sample[:maxlen]
    elif len(sample) < maxlen:
      temp = sample 
      # Append the appropriate number of 0 vectors to the list.
      additional_elems = maxlen - len(sample)
      for _ in range(additional_elems):
        temp.append(zero_vector)
    else:
      temp = sample 
    new_data.append(temp)
  return new_data


#@ Gathering the Truncated and Augmented Data:
X_train = pad_trunc(X_train, maxlen)
X_test = pad_trunc(X_test, maxlen)

#@ Converting the Data into Numpy Arrays:
X_train = np.reshape(X_train, (len(X_train), maxlen, embedding_dims))
y_train = np.array(y_train)
X_test = np.reshape(X_test, (len(X_test), maxlen, embedding_dims))
y_test = np.array(y_test)

#@ Inspecting the shape of the Data:
display(f"Shape of Training Data {X_train.shape, y_train.shape}")
display(f"Shape of Testing Data {X_test.shape, y_test.shape}")

'Shape of Training Data ((1601, 500, 300), (1601,))'

'Shape of Testing Data ((401, 500, 300), (401,))'

**Long Short Term Memory**
* Now, The Dataset is ready to build the Neural Network.

In [11]:
#@ Long Short Term Memory or LSTM:
model = Sequential()                                     # Standard Model Definition for Keras.
model.add(LSTM(                                          # Adding the LSTM Layer.
    num_neurons, return_sequences=True,
    input_shape=(maxlen, embedding_dims)
))
model.add(Dropout(0.2))                                  # Adding the Dropout Layer.
model.add(Flatten())                                     # Flatten the output of LSTM.
model.add(Dense(1, activation="sigmoid"))                # Output Layer.

#@ Compiling the LSTM Neural Network:
model.compile(
    loss="binary_crossentropy",
    optimizer="rmsprop",
    metrics=["accuracy"]
)

#@ Training the LSTM Neural Network:
model.fit(
    X_train, y_train,                                     # Training Dataset.
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(X_test, y_test)                      # Validation Dataset.
)

#@ Inspecting the Summary of the Model:
print("\n")
model.summary()                                           # Summary of the Model.

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 500, 50)           70200     
_________________________________________________________________
dropout (Dropout)            (None, 500, 50)           0         
_________________________________________________________________
flatten (Flatten)            (None, 25000)             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 25001     
Total params: 95,201
Trainable params: 95,201
Non-trainable params: 0
_________________________________________________________________


**Saving the LSTM Model**

In [12]:
#@ Saving the Recurrent Neural Network:
model_structure = model.to_json()
with open("lstm.json", "w") as json_file:
  json_file.write(model_structure)
model.save_weights("lstm.h5")
print("Model saved!!")

Model saved!!


**Model Evaluation**
* Now, I have trained a Model. I will make a sentence with Positive Sentiment and I will predict the Sentiment of the sentence using the Neural Network.

In [15]:
#@ Model Evaluation:
sample_1 = """ I hate that the dismal weather had me down for so long, \ 
            when will it break! Ugh, when does happiness return? The sun is \ 
            blinding and the puffy clouds are too thin. I can't wait for the weekend."""

#@ Making Predictions:
vec_list = tokenize_and_vectorize([(1, sample_1)])
test_vec_list = pad_trunc(vec_list, maxlen)
test_vec = np.reshape(test_vec_list, (len(test_vec_list), maxlen, embedding_dims))

#@ Inspecting the Prediction:
f"The predicted sentiment by the Model is: {model.predict_classes(test_vec)}"

'The predicted sentiment by the Model is: [[0]]'

**Optimizing the Vector Size**
* Padding and Truncating each sample to 400 Tokens was important for Convolutional Neural Nets so that the filters could scan a vector with a consistent length. 

In [22]:
#@ Optimizing the Vector Size:
def test_len(data, maxlen):
  total_len = truncated = exact = padded = 0
  for sample in data:
    total_len = len(sample)
    if len(sample) > maxlen:
      truncated += 1
    elif len(sample) < maxlen:
      padded += 1
    else:
      exact += 1
  print(f"Padded: {padded}")
  print(f"Equal: {exact}")
  print(f"Truncated: {truncated}")
  print(f"Average length: {total_len/len(data)}")

#@ Applying in the Dataset:
test_len(vectorized_data, 500)

Padded: 0
Equal: 1897
Truncated: 105
Average length: 0.24975024975024976


**Optimized Long Short Term Memory**



In [24]:
#@ Parameters of LSTM Neural Network:
maxlen = 200                                    # Maximum review length.
batch_size = 32                                 # Number of samples shown to the network before updating the weights.
embedding_dims = 300                            # Length of token vectors for passing in RNN.
epochs = 10                                     # Number of times for passing the training dataset.
num_neurons = 50

#@ Gathering the Truncated and Augmented Data:
X_train = pad_trunc(X_train, maxlen)
X_test = pad_trunc(X_test, maxlen)

#@ Converting the Data into Numpy Arrays:
X_train = np.reshape(X_train, (len(X_train), maxlen, embedding_dims))
y_train = np.array(y_train)
X_test = np.reshape(X_test, (len(X_test), maxlen, embedding_dims))
y_test = np.array(y_test)

In [25]:
#@ Long Short Term Memory or LSTM:
model = Sequential()                                     # Standard Model Definition for Keras.
model.add(LSTM(                                          # Adding the LSTM Layer.
    num_neurons, return_sequences=True,
    input_shape=(maxlen, embedding_dims)
))
model.add(Dropout(0.2))                                  # Adding the Dropout Layer.
model.add(Flatten())                                     # Flatten the output of LSTM.
model.add(Dense(1, activation="sigmoid"))                # Output Layer.

#@ Compiling the LSTM Neural Network:
model.compile(
    loss="binary_crossentropy",
    optimizer="rmsprop",
    metrics=["accuracy"]
)

#@ Training the LSTM Neural Network:
model.fit(
    X_train, y_train,                                     # Training Dataset.
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(X_test, y_test)                      # Validation Dataset.
)

#@ Inspecting the Summary of the Model:
print("\n")
model.summary()                                           # Summary of the Model.

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 200, 50)           70200     
_________________________________________________________________
dropout_1 (Dropout)          (None, 200, 50)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 10001     
Total params: 80,201
Trainable params: 80,201
Non-trainable params: 0
_________________________________________________________________
