**Initialization**
* I put these three lines of code at the top of each of my Notebooks. The first two reload imported modules automatically, which helps to prevent problems while reloading and reworking on a Project or Problem. The third line renders visualizations inline within the Notebook.

In [6]:
#@ Initialization:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Downloading the Dependencies**
* I have installed and imported all the Libraries and Dependencies required for this Project in one particular cell.

In [8]:
#@ Downloading the Libraries and Dependencies:
# !pip install nlpia
import os, glob
from random import shuffle

import numpy as np                                      # Module to work with Arrays.
from keras.preprocessing import sequence                # Module to handle Padding Input.
from keras.models import Sequential                     # Base Keras Neural Network Model.
from keras.layers import Dense, Dropout, Activation     # Layers Objects to pile into Model.
from keras.layers import Conv1D, GlobalMaxPool1D        # Convolutional Layer and MaxPooling.

from nltk.tokenize import TreebankWordTokenizer         # Module for Tokenization.
from gensim.models.keyedvectors import KeyedVectors
from nlpia.loaders import get_data                      # Importing the NLPIA Package.

**Getting the Data**
* I have used Google Colab for this Project, so the process of downloading and reading the Data might be different on other platforms. I have used the [**Large Movie Review Dataset**](https://ai.stanford.edu/~amaas/data/sentiment/) for this Project. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. The Dataset has a set of 25,000 highly polar movie reviews for training and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided.

In [9]:
#@ Getting the Data:
def preprocess_data(filepath):
  positive_path = os.path.join(filepath, "pos")
  negative_path = os.path.join(filepath, "neg")
  pos_label = 1
  neg_label = 0
  dataset = []
  
  for filename in glob.glob(os.path.join(positive_path, '*.txt')):                            # Positive Sentiment Dataset.
    with open(filename, "r", encoding="utf-8") as f:                                          # Explicit encoding avoids platform-dependent decoding.
      dataset.append((pos_label, f.read()))
  for filename in glob.glob(os.path.join(negative_path, '*.txt')):                            # Negative Sentiment Dataset.
    with open(filename, "r", encoding="utf-8") as f:
      dataset.append((neg_label, f.read()))

  shuffle(dataset)                                                                            # Shuffling the Dataset.
  return dataset 
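
* As a minimal sketch of how this loader behaves, the code below builds a tiny made-up dataset in a temporary directory and reads it back. The function name `load_reviews`, the file name `0_1.txt` and the two review texts are assumptions for illustration only; the `pos` and `neg` folder layout and the labels 1 and 0 follow the cell above.

```python
import glob
import os
import tempfile
from random import shuffle


def load_reviews(filepath):
    """Read pos/neg review files into shuffled (label, text) pairs."""
    dataset = []
    for label, folder in ((1, "pos"), (0, "neg")):
        for filename in glob.glob(os.path.join(filepath, folder, "*.txt")):
            with open(filename, "r", encoding="utf-8") as f:
                dataset.append((label, f.read()))
    shuffle(dataset)  # Shuffle so positive and negative reviews are mixed.
    return dataset


# Build a tiny, made-up dataset in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for folder, text in (("pos", "A great film."), ("neg", "A dull film.")):
        os.makedirs(os.path.join(root, folder))
        path = os.path.join(root, folder, "0_1.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
    toy_dataset = load_reviews(root)

# Sorting removes the effect of the shuffle for display.
print(sorted(toy_dataset))  # [(0, 'A dull film.'), (1, 'A great film.')]
```

* Each element is a `(label, text)` tuple, which is the shape the later cells rely on when they read `sample[0]` and `sample[1]`.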

**Processing the Dataset**
* I have manually downloaded the Dataset from the [**Large Movie Review Dataset**](https://ai.stanford.edu/~amaas/data/sentiment/) page. I have used a small subset of the Data.

In [10]:
#@ Processing the Dataset:
PATH = "/content/drive/My Drive/Colab Notebooks/Data/Smalltrain"                     # Path to the Dataset.
dataset = preprocess_data(PATH)                                                      # Processing the Dataset.

#@ Inspecting the Dataset:
dataset[:3]                                                                          # Inspecting the Dataset.

[(0,
  "Anybody who has ever been a fan of the original series, or even has a clue about the storyline should be embarrassed by this series. The Borg does not come around until Q brings the Enterprise to the Gamma sector, the Klingons are NEVER seen until Kirk encounters them, the NCC-1701 was the FIRST ship to carry the Enterprise name....need I go on? Berman and Pilliar have made a mockery of Gene Roddenberry's creation. After he died, they only saw $$$$ and just went their own way. No wonder Majel Barrett was in every single episode of star trek until this series. I don't blame her for not being involved with this mess. Poor Bakula. He's a great actor, as are the entire cast. I like them all, but the storyline is tragic and ignores all of the precedents set by the original series. Just check the ratings. I think more people watched Deep Space 9 (which was untimely canceled)."),
 (1,
  'This first-rate western tale of the gold rush brings great excitement, romance, and James Stewart 

**Tokenization and Vectorization**
* The next step is to tokenize and vectorize the Dataset. I will use the pretrained Google News Word2vec vectors for Vectorization, loading only the first 100,000 vectors to save memory. The Google News Word2vec vocabulary includes some stopwords as well. Tokens that are not in the loaded vocabulary are skipped.


In [14]:
#@ Tokenization and Vectorization:
# !wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"                # Pretrained Word2vec Model.    

word_vectors = KeyedVectors.load_word2vec_format("/content/GoogleNews-vectors-negative300.bin.gz",           # Word2vec Model Vectors.
                                       binary=True, limit=100000)

#@ Function for Tokenization and Vectorization:
def tokenize_and_vectorize(dataset):
  tokenizer = TreebankWordTokenizer()                                  # Instantiating the Tokenizer.
  vectorized_data = []
  for sample in dataset:
    tokens = tokenizer.tokenize(sample[1])                             # Process for Tokenization.
    sample_vecs = []
    for token in tokens:
      try:
        sample_vecs.append(word_vectors[token])                        # Process for Vectorization.
      except KeyError:
        pass
    vectorized_data.append(sample_vecs)
  
  return vectorized_data                                               # Returning the Vectorized Data.

#@ Function for Collecting the Target Labels:
def collect_expected(dataset):
  """ Collecting the Target Labels: 0 for Negative Review and 1 for Positive Review. """
  expected=[]
  for sample in dataset:
    expected.append(sample[0])
  return expected

#@ Tokenization and Vectorization:
vectorized_data = tokenize_and_vectorize(dataset)
expected = collect_expected(dataset)
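
* To make the behaviour of these two functions concrete, here is a small sketch on toy data. The 3-dimensional vectors in `toy_vectors` are made up and stand in for the Word2vec vectors, and `str.split` stands in for `TreebankWordTokenizer`; both are assumptions for illustration, not the Model used above.

```python
# Hypothetical 3-dimensional embeddings standing in for the Word2vec vectors.
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "bad": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.5, 0.5],
}

# (label, text) pairs in the same shape as the Dataset above.
toy_dataset = [(1, "good movie"), (0, "bad zzzz movie")]


def toy_tokenize_and_vectorize(dataset, vectors):
    vectorized = []
    for _, text in dataset:
        sample_vecs = []
        for token in text.split():            # Stand-in for the real tokenizer.
            try:
                sample_vecs.append(vectors[token])
            except KeyError:
                pass                          # Out-of-vocabulary tokens are dropped.
        vectorized.append(sample_vecs)
    return vectorized


toy_vectorized = toy_tokenize_and_vectorize(toy_dataset, toy_vectors)
toy_expected = [label for label, _ in toy_dataset]

print([len(sample) for sample in toy_vectorized])  # [2, 2]
print(toy_expected)                                # [1, 0]
```

* The second review has three tokens but only two vectors, because `zzzz` is not in the vocabulary. Samples therefore end up with different lengths, and the labels stay in the same order as the samples.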



**Splitting into Training and Testing**
* Now, I will split the Dataset obtained above into a Training set and a Test set, using 80% for Training and 20% for Testing. The next code buckets the Data into the Training set X_train with its correct labels y_train, and the Test set X_test with its correct labels y_test.

In [15]:
#@ Splitting the Dataset into Training set and Test set:
split_part = int(len(vectorized_data) * 0.8)

#@ Training set:
X_train = vectorized_data[:split_part]
y_train = expected[:split_part]                      # Labels come from expected, not from vectorized_data.

#@ Test set:
X_test = vectorized_data[split_part:]
y_test = expected[split_part:]
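
* A quick sanity check of the 80/20 split, sketched on ten made-up samples. The names `toy_data` and `toy_labels` are hypothetical; the point is that features and labels are sliced at the same index, so each sample keeps its own label.

```python
# Ten made-up feature rows and their labels (illustration only).
toy_data = [[float(i)] for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

toy_split = int(len(toy_data) * 0.8)      # Index separating the first 80% of samples.

# Features and labels are sliced at the same index.
toy_X_train, toy_y_train = toy_data[:toy_split], toy_labels[:toy_split]
toy_X_test, toy_y_test = toy_data[toy_split:], toy_labels[toy_split:]

print(len(toy_X_train), len(toy_X_test))  # 8 2
print(toy_y_test)                         # [0, 1]
```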