**Initialization**
* I put these three lines of code at the top of each of my Notebooks. The first two reload imported modules automatically before code runs, which prevents stale-code problems when I reload and rework a Project or Problem. The third line renders the visualizations inline within the Notebook.

In [1]:
#@ Initialization:
%reload_ext autoreload
%autoreload 2 
%matplotlib inline

**Downloading the Dependencies**
* I have imported all the Libraries and Dependencies required for this Project in one particular cell. The commented `pip install` line installs the `nlpia` package if it is missing.

In [15]:
#@ Downloading the Libraries and Dependencies:
# !pip install nlpia
import os, glob
from random import shuffle
from IPython.display import display

import numpy as np                                      # Module to work with Arrays.
from keras.preprocessing import sequence                # Module to handle Padding Input.
from keras.models import Sequential                     # Base Keras Neural Network Model.
from keras.layers import Dense, Dropout, Activation     # Layers Objects to pile into Model.
from keras.layers import Conv1D, GlobalMaxPooling1D     # Convolutional Layer and MaxPooling.

from nltk.tokenize import TreebankWordTokenizer         # Module for Tokenization.
from gensim.models.keyedvectors import KeyedVectors
from nlpia.loaders import get_data                      # Importing the NLPIA Package.

**Getting the Data**
* I have used Google Colab for this Project, so the process of downloading and reading the Data might be different on other platforms. I have used the [**Large Movie Review Dataset**](https://ai.stanford.edu/~amaas/data/sentiment/) for this Project. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. The Dataset has a set of 25,000 highly polar movie reviews for training and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and an already processed bag-of-words format are provided.

In [4]:
#@ Getting the Data:
def preprocess_data(filepath):
  positive_path = os.path.join(filepath, "pos")
  negative_path = os.path.join(filepath, "neg")
  pos_label = 1
  neg_label = 0
  dataset = []
  
  for filename in glob.glob(os.path.join(positive_path, '*.txt')):                            # Positive Sentiment Dataset.
    with open(filename, "r", encoding="utf-8") as f:                                          # Explicit encoding for portability.
      dataset.append((pos_label, f.read()))
  for filename in glob.glob(os.path.join(negative_path, '*.txt')):                            # Negative Sentiment Dataset.
    with open(filename, "r", encoding="utf-8") as f:                                          # Explicit encoding for portability.
      dataset.append((neg_label, f.read()))

  shuffle(dataset)                                                                            # Shuffling the Dataset.
  return dataset 
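
To check the loader without the full download, one option is to point it at a throwaway directory that has the same `pos`/`neg` layout. The sketch below is only an illustration: the directory, file names, and review texts are made up, and `load_labeled_reviews` is a hypothetical stand-in that mirrors `preprocess_data` above.

```python
import glob
import os
import tempfile
from random import shuffle


def load_labeled_reviews(filepath):
    """Hypothetical stand-in for preprocess_data: returns shuffled (label, text) pairs."""
    dataset = []
    for label, folder in ((1, "pos"), (0, "neg")):
        pattern = os.path.join(filepath, folder, "*.txt")
        for filename in glob.glob(pattern):
            with open(filename, "r", encoding="utf-8") as f:
                dataset.append((label, f.read()))
    shuffle(dataset)
    return dataset


# Build a tiny fake dataset (made-up reviews) with the expected pos/neg layout.
with tempfile.TemporaryDirectory() as root:
    toy_reviews = {"pos": ["Loved it.", "A great film."], "neg": ["Dull and slow."]}
    for folder, texts in toy_reviews.items():
        os.makedirs(os.path.join(root, folder))
        for i, text in enumerate(texts):
            path = os.path.join(root, folder, f"{i}_toy.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)
    toy_dataset = load_labeled_reviews(root)

print(len(toy_dataset))                              # Number of (label, text) pairs loaded.
print(sorted(label for label, _ in toy_dataset))     # Labels found, independent of shuffle order.
```

Each element is a `(label, text)` tuple, which is the shape the later tokenization and label-collection steps rely on.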

**Processing the Dataset**
* I have manually downloaded the Dataset from the [**Large Movie Review Dataset**](https://ai.stanford.edu/~amaas/data/sentiment/) page. I have used a small subset of the Data.

In [5]:
#@ Processing the Dataset:
PATH = "/content/drive/My Drive/Colab Notebooks/Data/Smalltrain"                     # Path to the Dataset.
dataset = preprocess_data(PATH)                                                      # Processing the Dataset.

#@ Inspecting the Dataset:
dataset[:3]                                                                          # Inspecting the Dataset.

[(0,
  "When Marlene Dietrich was labeled box office poison in 1938 one of a handful of actresses so named by the trades papers, it was films like The Garden Of Allah. How a film could be so breathtakingly beautiful to behold and be so insipidly dull is beyond me. Also how Marlene if she was trying to expand her range and not play a sexpot got stuck with such an old fashioned story is beyond me.<br /><br />The Garden Of Allah, one of the very first films in modern technicolor was a novel set at the turn of the last century by Robert Hitchens who then collaborated on a play adaption with Mary Anderson that ran for 241 performances in 1911-12. It then got two silent screen adaptions. The story is about a monk who runs away from the monastery out in French Tunisia to see some of what he's missed in the world. He runs into a similarly sheltered woman who was unmarried and spent her prime years caring for a sick parent. She's traveling now in the desert and the two meet on a train.<br /><br

**Tokenization and Vectorization**
* The next step is to perform the Tokenization and Vectorization of the Dataset. I will use the pretrained Google News Word2vec vectors for the Vectorization, loading only the first 100,000 vectors to save memory. The Google News Word2vec Vocabulary includes some stopwords as well. Tokens that are not in the Vocabulary are skipped.


In [7]:
#@ Tokenization and Vectorization:
# !wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"                # Pretrained Word2vec Model.    

word_vectors = KeyedVectors.load_word2vec_format("/content/GoogleNews-vectors-negative300.bin.gz",           # Word2vec Model Vectors.
                                       binary=True, limit=100000)

#@ Function for Tokenization and Vectorization:
def tokenize_and_vectorize(dataset):
  tokenizer = TreebankWordTokenizer()                                  # Instantiating the Tokenizer.
  vectorized_data = []
  for sample in dataset:
    tokens = tokenizer.tokenize(sample[1])                             # Process for Tokenization.
    sample_vecs = []
    for token in tokens:
      try:
        sample_vecs.append(word_vectors[token])                        # Process for Vectorization.
      except KeyError:
        pass
    vectorized_data.append(sample_vecs)
  
  return vectorized_data                                               # Returning the Vectorized Data.

#@ Function for Collecting the Target Labels:
def collect_expected(dataset):
  """ Collecting the Target Labels: 0 for Negative Review and 1 for Positive Review. """
  expected=[]
  for sample in dataset:
    expected.append(sample[0])
  return expected

#@ Tokenization and Vectorization:
vectorized_data = tokenize_and_vectorize(dataset)
expected = collect_expected(dataset)
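
The lookup-and-skip behaviour of the vectorization step can be seen on a toy example. This is a minimal sketch under stated assumptions: `toy_vectors` is a made-up 3-dimensional vocabulary standing in for the 300-dimensional Google News vectors, `str.split` stands in for `TreebankWordTokenizer`, and `toy_tokenize_and_vectorize` is a hypothetical name.

```python
# Hypothetical 3-dimensional "embeddings"; the real Google News vectors have 300 dimensions.
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "movie": [0.2, 0.7, 0.1],
    "bad": [-0.8, 0.1, 0.0],
}


def toy_tokenize_and_vectorize(dataset, vectors):
    """Map each (label, text) sample to a list of vectors, dropping unknown tokens."""
    vectorized = []
    for _label, text in dataset:
        tokens = text.split()                        # Stand-in for TreebankWordTokenizer.
        vectorized.append([vectors[t] for t in tokens if t in vectors])
    return vectorized


toy_dataset = [(1, "good movie"), (0, "bad zzz movie")]
toy_vectorized = toy_tokenize_and_vectorize(toy_dataset, toy_vectors)
toy_labels = [label for label, _text in toy_dataset]

print([len(sample) for sample in toy_vectorized])    # Tokens kept per sample.
print(toy_labels)                                    # Labels in the same order as the samples.
```

The out-of-vocabulary token `zzz` is dropped silently, so a sample can end up shorter than its token count. That is one reason the samples have to be padded to a common length later.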



**Splitting into Training and Testing**
* Now, I will split the Dataset obtained above into a Training set (80%) and a Test set (20%). The next code buckets the Data into the Training set `X_train` with its correct labels `y_train`, and similarly into the Test set `X_test` with its correct labels `y_test`. The Dataset was already shuffled when it was loaded, so a simple slice is enough.

In [8]:
#@ Splitting the Dataset into Training set and Test set:
split_part = int(len(vectorized_data) * 0.8)

#@ Training set:
X_train = vectorized_data[:split_part]
y_train = expected[:split_part]

#@ Test set:
X_test = vectorized_data[split_part:]
y_test = expected[split_part:]
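
As a quick check of the split arithmetic, the sketch below applies the same slicing to placeholder data. The helper name `split_80_20` and the toy lists are hypothetical; the total of 2002 samples is the one implied by the shapes printed later in this Notebook (1601 + 401).

```python
def split_80_20(samples, labels, train_fraction=0.8):
    """Slice two parallel lists into train and test parts at the same index."""
    split_part = int(len(samples) * train_fraction)
    return (samples[:split_part], labels[:split_part],
            samples[split_part:], labels[split_part:])


# Placeholder samples and alternating 0/1 labels, used only to count items.
toy_samples = list(range(2002))
toy_labels = [i % 2 for i in toy_samples]

X_tr, y_tr, X_te, y_te = split_80_20(toy_samples, toy_labels)
print(len(X_tr), len(X_te))                          # Sizes of the two parts.
```

Because `int()` truncates, 80% of 2002 (1601.6) becomes 1601 training samples, and the remaining 401 go to the test set.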

### **Convolutional Neural Networks**
* In Deep Learning, a Convolutional Neural Network is a class of Deep Neural Networks, most commonly applied to analyzing Visual Imagery. They are also known as shift invariant or space invariant Artificial Neural Networks, based on their shared-weights architecture and translation invariance characteristics. Here the convolution slides along the one-dimensional sequence of word vectors instead of a two-dimensional image. The next block of code sets most of the hyperparameters for the Convolutional Neural Network.

In [9]:
#@ Parameters of Convolutional Neural Networks:
maxlen = 500                                    # Maximum review length.
batch_size = 32                                 # Number of samples shown to the network before updating the weights.
embedding_dims = 300                            # Length of token vectors for passing in Convnet.
filters = 250                                   # Number of filters required for training.
kernel_size = 3                                 # Width of Filters.
hidden_dims = 250                               # Number of neurons in feed forward net.
epochs = 10                                     # Number of times for passing the training dataset.
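
These hyperparameters already determine the layer shapes and parameter counts. The arithmetic below is a sketch that assumes the settings used for the model in this Notebook: `"valid"` padding, a stride of one, and a single output neuron. The variable names on the left are only for illustration.

```python
maxlen = 500
embedding_dims = 300
filters = 250
kernel_size = 3
hidden_dims = 250

# "valid" padding with stride 1: the kernel fits in maxlen - kernel_size + 1 positions.
conv_output_length = maxlen - kernel_size + 1

# Each filter has kernel_size * embedding_dims weights plus one bias.
conv_params = filters * (kernel_size * embedding_dims + 1)

# Dense layers: one weight per input-output pair plus one bias per output.
hidden_params = hidden_dims * (filters + 1)          # Input is the pooled vector of 250 filters.
output_params = 1 * (hidden_dims + 1)

print(conv_output_length, conv_params, hidden_params, output_params)
# → 498 225250 62750 251
```

These values agree with the output shape `(None, 498, 250)` and the parameter counts in the model summary printed after training.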

**Padding and Truncating the Sequence**
* **Keras** has a preprocessing helper method called `pad_sequences` which is used to pad the input Data. But it is meant for sequences of scalars, and here each sample is a sequence of vectors. So, I will write a helper function to pad or truncate the input Data to `maxlen` tokens.

In [12]:
#@ Padding and Truncating the Token Sequence:

def pad_trunc(data, maxlen):
  """ Padding the Dataset with zero Vectors or truncating it to maxlen tokens. """
  new_data = []
  # Creating a zero vector with the same length as the Word vectors.
  zero_vector = [0.0] * len(data[0][0])

  for sample in data:
    if len(sample) > maxlen:
      temp = sample[:maxlen]
    elif len(sample) < maxlen:
      temp = list(sample)                         # Copying so that the input sample is not modified in place.
      # Append the appropriate number of zero vectors to the list.
      additional_elems = maxlen - len(sample)
      for _ in range(additional_elems):
        temp.append(zero_vector)
    else:
      temp = sample
    new_data.append(temp)
  return new_data


#@ Gathering the Truncated and Augmented Data:
X_train = pad_trunc(X_train, maxlen)
X_test = pad_trunc(X_test, maxlen)

#@ Converting the Data into Numpy Arrays:
X_train = np.reshape(X_train, (len(X_train), maxlen, embedding_dims))
y_train = np.array(y_train)
X_test = np.reshape(X_test, (len(X_test), maxlen, embedding_dims))
y_test = np.array(y_test)

#@ Inspecting the shape of the Data:
display(f"Shape of Training Data {X_train.shape, y_train.shape}")
display(f"Shape of Testing Data {X_test.shape, y_test.shape}")

'Shape of Training Data ((1601, 500, 300), (1601,))'

'Shape of Testing Data ((401, 500, 300), (401,))'
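
An alternative to padding nested lists and reshaping afterwards is to fill a preallocated NumPy array directly. The following is a minimal sketch of that idea, not the method used above; `pad_trunc_array` and the toy input are hypothetical.

```python
import numpy as np


def pad_trunc_array(data, maxlen, dims):
    """Return an array of shape (len(data), maxlen, dims), zero-padded or truncated."""
    out = np.zeros((len(data), maxlen, dims), dtype="float32")
    for i, sample in enumerate(data):
        kept = sample[:maxlen]                       # Truncate samples longer than maxlen.
        if len(kept) > 0:                            # Empty samples stay all zeros.
            out[i, :len(kept)] = np.asarray(kept, dtype="float32")
    return out


# Toy input: three samples with 1, 2 and 4 two-dimensional vectors.
toy = [
    [[1.0, 1.0]],
    [[2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0], [7.0, 7.0]],
]
padded = pad_trunc_array(toy, maxlen=3, dims=2)
print(padded.shape)                                  # (samples, maxlen, dims)
```

This version never modifies the input lists and also copes with a sample in which every token was out of vocabulary.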

**Convolutional Neural Network**
* Now, the Data is ready for building the Neural Network. The Convolution moves with a stride of one token, and I will be using the ReLU activation Function.

In [17]:
#@ Convolutional Neural Network:
model = Sequential()                                           # Standard Model definition for Keras.
model.add(Conv1D(                                              # Adding one Convolutional layer.
    filters, kernel_size, 
    padding="valid", activation="relu",
    strides=1,
    input_shape=(maxlen, embedding_dims)
))
model.add(GlobalMaxPooling1D())                                # Adding the Pooling layer.
model.add(Dense(hidden_dims))                                  # Fully connected Hidden layer.
model.add(Dropout(0.2))
model.add(Activation("relu"))                                  # Adding the ReLU Activation layer.
model.add(Dense(1))                                            
model.add(Activation("sigmoid"))                               # Adding the Sigmoid Activation layer for the Output.

#@ Compiling the Convolutional Neural Network:
model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

#@ Training the Convolutional Neural Network:
model.fit(
    X_train, y_train,
    batch_size=batch_size,                                     # Number of Data samples processed before backpropagation.
    epochs=epochs,
    validation_data=(X_test, y_test)
)

#@ Inspecting the Summary of the Model:
print("\n")
model.summary()

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, 498, 250)          225250    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 250)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 250)               62750     
_________________________________________________________________
dropout_1 (Dropout)          (None, 250)               0         
_________________________________________________________________
activation_2 (Activation)    (None, 250)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 251       
_______
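
To make the first two layers concrete, the sketch below reproduces in plain NumPy what a one-dimensional convolution with `"valid"` padding and a stride of one, followed by ReLU and global max pooling, computes for a single review. It is only an illustration: the input and the weights are random stand-ins, not the trained model, and `conv1d_relu_global_max` is a hypothetical helper.

```python
import numpy as np


def conv1d_relu_global_max(x, kernels, bias):
    """x: (steps, dims); kernels: (kernel_size, dims, filters); bias: (filters,)."""
    kernel_size = kernels.shape[0]
    steps = x.shape[0] - kernel_size + 1             # "valid" padding, stride of one token.
    feature_map = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        window = x[t:t + kernel_size]                # (kernel_size, dims) slice of tokens.
        feature_map[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    feature_map = np.maximum(feature_map, 0.0)       # ReLU activation.
    return feature_map, feature_map.max(axis=0)      # Global max pooling over the time axis.


rng = np.random.default_rng(0)
x = rng.normal(size=(500, 300))                      # One padded review of random stand-in vectors.
kernels = rng.normal(size=(3, 300, 250)) * 0.01      # Random stand-in filters.
bias = np.zeros(250)

feature_map, pooled = conv1d_relu_global_max(x, kernels, bias)
print(feature_map.shape, pooled.shape)
```

Global max pooling keeps only the strongest response of each filter, so the 498 positions collapse into a single vector of 250 values regardless of where in the review a pattern occurred.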

**Saving the Model**

In [18]:
#@ Saving the Model:
model_structure = model.to_json()                            # Saving the structure of the Network.
with open("cnn_model.json", "w") as json_file:
  json_file.write(model_structure)
model.save_weights("cnn_weights.h5")

**Loading the Saved Model**


In [21]:
#@ Loading the Saved Model:
from keras.models import model_from_json
with open("cnn_model.json", "r") as json_file:
  json_string = json_file.read()
model = model_from_json(json_string)

model.load_weights("cnn_weights.h5")                         # Loading the saved Model.       

**Model Evaluation**
* Now, I will write a sample text with a Positive sentiment and predict its sentiment with the help of the trained Neural Network.

In [24]:
#@ Model Evaluation:
sample_1 = "Natural Language Processing is one of the most interesting topics in Machine Learning. Many people loves to learn Natural \
            Language Processing in the modern days. Surprisingly, some people doen't like Natural Langugae Processing a lot! I can't wait \
            to learn NLP in future days. I am fond of reading NLP."

#@ Making Predictions:
vec_list = tokenize_and_vectorize([(1, sample_1)])
test_vec_list = pad_trunc(vec_list, maxlen)
test_vec = np.reshape(test_vec_list, (len(test_vec_list), maxlen, embedding_dims))

#@ Inspecting the Prediction:
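# predict_classes was removed in recent Keras/TensorFlow releases; thresholding the sigmoid output at 0.5 gives the same labels.
f"The predicted sentiment by the Model is: {(model.predict(test_vec) > 0.5).astype('int32')}"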

'The predicted sentiment by the Model is: [[1]]'
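
With a single sigmoid output, the class label comes from comparing the predicted probability with a threshold. The sketch below assumes the usual threshold of 0.5; the pre-activation values are made up, and `sigmoid` and `to_class` are hypothetical helpers rather than Keras functions.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def to_class(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 labels."""
    return (np.asarray(probabilities) > threshold).astype("int32")


# Made-up pre-activation values for three reviews.
logits = np.array([[2.0], [-1.5], [0.1]])
probs = sigmoid(logits)
print(to_class(probs))                               # 1 = Positive, 0 = Negative.
```

A probability close to 0.5 still yields a hard label, so inspecting the raw probability is useful when judging how confident the Model is.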