# 1D Convolutional Layers

Convolutional Neural Networks (ConvNets) perform particularly well on computer vision problems because they operate convolutionally: they extract features from local input patches, which gives them representation modularity and data efficiency. The same properties that make ConvNets so effective for computer vision also make them highly relevant to sequence processing. 1D convolution layers are translation invariant in the sense that, because the same transformation is applied to every patch, a pattern learned at one position in a sentence can later be recognized at a different position. As with 2D ConvNets, pooling extracts 1D patches from an input and outputs their maximum or average value (Max Pooling and Average Pooling respectively), and it is likewise used to reduce the length of the 1D input (technically known as subsampling).
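
To make these two ideas concrete, here is a minimal pure-Python sketch of a 1D convolution followed by max pooling. It is not how Keras implements these layers; the helper names conv1d and max_pool1d, the toy sequences, and the kernel are assumptions chosen purely for illustration.

```python
def conv1d(signal, kernel):
    """Slide `kernel` over `signal` (no padding, stride 1) and return the dot products."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(values, pool_size):
    """Keep the maximum of each non-overlapping window of `pool_size` values."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]


# The same toy pattern (1, 2, 1) placed early in one sequence and late in another.
kernel = [1, 2, 1]
early = [1, 2, 1, 0, 0, 0, 0, 0]
late = [0, 0, 0, 0, 0, 1, 2, 1]

features_early = conv1d(early, kernel)
features_late = conv1d(late, kernel)

# The kernel produces the same peak response wherever the pattern occurs.
print(max(features_early), max(features_late))

# Pooling shortens the feature sequence (subsampling).
pooled = max_pool1d(features_early, 3)
print(len(early), len(features_early), len(pooled))
```

The peak response is identical for both sequences, only its position changes, and each convolution or pooling step shortens the sequence.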



In [4]:
import os
import pickle as pk
import pandas as pd

In [5]:
imdb_dir = 'C:/Ankit/Python/Sentiment Analysis/aclImdb/aclImdb'
tokenizer_path = 'C:/Ankit/Python/Sentiment Analysis/Tokenizer'

### We are using multiple datasets to train the Sentiment Analysis Classifier.
1. IMDB dataset taken from Kaggle <a href="https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews" target="_blank">Kaggle IMDB Dataset</a>

2. Yelp reviews dataset

3. Amazon product reviews

4. Random list of positive comments

5. Random list of negative comments

In [6]:
# importing IMDB dataset
imdb_data = pd.read_csv('C:/Ankit/Python/Sentiment Analysis/IMDB Dataset.csv/IMDB Dataset.csv')
print(imdb_data.shape)
imdb_data.head(10)

labels = list()
texts = list()
# column 0 holds the review text, column 1 the sentiment label
for index, row in imdb_data.iterrows():
    texts.append(row.iloc[0])
    if row.iloc[1] == 'negative':
        labels.append(0)
    else:
        labels.append(1)



#importing amazon reviews
with open('C:/Ankit/Python/Sentiment Analysis/amazon.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split('\t') for line in stripped if line)
    
    for row in lines:
            texts.append(row[0])
            if row[1] == '0':
                labels.append(0)
            else:
                labels.append(1)

#importing yelp reviews
with open('C:/Ankit/Python/Sentiment Analysis/yelp.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split('\t') for line in stripped if line)
    
    for row in lines:
            texts.append(row[0])
            if row[1] == '0':
                labels.append(0)
            else:
                labels.append(1)

# importing positive comments
with open('C:/Ankit/Python/Sentiment Analysis/positive_comments.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split('\t') for line in stripped if line)
    
    for row in lines:
            texts.append(row[0])
            labels.append(1)

# importing negative comments
with open('C:/Ankit/Python/Sentiment Analysis/negative_comments.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split('\t') for line in stripped if line)
    
    for row in lines:
            texts.append(row[0])
            labels.append(0)




(50000, 2)


In [7]:

# Tokenizing the data
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
from keras.optimizers import RMSprop

Using TensorFlow backend.


In [8]:
# cut off reviews after 500 words
max_len = 500
# train on 20000 samples
training_samples = 20000
# validate on 10000 samples
validation_samples = 10000
# consider only the top 10000 words
max_words = 10000

# create the tokenizer, keeping only the top 10000 (max_words) most frequent words
tokenizer = Tokenizer(num_words=max_words) 
# fit the tokenizer on the texts
tokenizer.fit_on_texts(texts) 
# convert the texts to sequences
sequences = tokenizer.texts_to_sequences(texts) 

# save the tokenizer
with open(os.path.join(tokenizer_path, 'tokenizer_m1.pickle'), 'wb') as handle:
    pk.dump(tokenizer, handle, protocol=pk.HIGHEST_PROTOCOL)


word_index = tokenizer.word_index
print('Found %s unique tokens. ' % len(word_index))

 # pad the sequence to the required length to ensure uniformity
data = pad_sequences(sequences, maxlen=max_len)
print('Data Shape: {}'.format(data.shape))

labels = np.asarray(labels)
print("Shape of data tensor: ", data.shape)
print("Shape of label tensor: ", labels.shape)

# shuffle the data, then split it into training, validation and test sets
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples:training_samples + validation_samples]
y_val = labels[training_samples:training_samples + validation_samples]

# test_data
x_test = data[training_samples+validation_samples:]
y_test = labels[training_samples+validation_samples:]

Found 126268 unique tokens. 
Data Shape: (62662, 500)
Shape of data tensor:  (62662, 500)
Shape of label tensor:  (62662,)
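
As a rough illustration of the padding step above: by default, Keras pad_sequences both pads and truncates at the start of each sequence (padding='pre', truncating='pre') and fills with 0. The pure-Python function pad_pre below is a hypothetical stand-in that mimics those defaults on plain lists. It is a sketch for intuition, not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # drop items from the front
        padded.append([value] * (maxlen - len(seq)) + seq)   # fill at the front
    return padded


# Toy word-index sequences: one shorter and one longer than maxlen.
toy_sequences = [[5, 8], [3, 1, 4, 1, 5, 9]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```

Under these defaults, a review longer than max_len keeps its last max_len tokens, and a shorter one is filled with zeros on the left.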


In [9]:
with open(os.path.join(tokenizer_path, 'tokenizer_m1.pickle'), 'wb') as handle:
    pk.dump(tokenizer, handle, protocol=pk.HIGHEST_PROTOCOL)

In [10]:

# decode the words (map integer indices back to words)
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i, '?') for i in sequences[0]])

In [11]:
# model definition
import keras
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Conv1D, MaxPooling1D, GlobalMaxPooling1D
from keras import Model, layers
from keras import Input

### Callback Functions

Once we call the fit() or fit_generator() method, we lose control over how the model trains on the provided dataset. If the model is not “smart” enough, we can only watch it perform badly during training, or abort the run and start all over again. This is expensive and inefficient, so we would like a model that can introspect during training and dynamically take actions that improve it. Many things cannot be predicted in advance. For instance, one cannot tell the exact number of epochs needed to reach an optimal validation loss and accuracy.

A common approach is to pick an arbitrary number of epochs, reduce it and retrain if the model overfits before that number is reached, and increase it otherwise. This is very wasteful. A much better approach is to stop training as soon as the validation loss is no longer improving, which can be achieved using a Keras callback. A callback is an object (a class instance implementing specific methods) that is passed to the model in the call to fit and that is called by the model at various points during training. It has access to all the available data about the state of the model and its performance, and it can take action: interrupt training, save a model, load a different weight set, or otherwise alter the state of the model.
Some ways in which callbacks can be used are:

* Model checkpointing: saving the current weights of the model at different points during training.
* Early stopping: interrupting training when the validation loss is no longer improving (and saving the best model obtained during training).
* Dynamically adjusting the value of certain parameters during training, such as the learning rate of the optimizer.
* Logging training and validation metrics during training, or visualizing representations learned by the model as they’re updated. (The Keras progress bar we always see in our terminal during training is a callback!)
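
The callback mechanism described above can be sketched in plain Python. This is a simplified toy, not the Keras API: the classes Callback, HistoryLogger, StopWhenGoodEnough and ToyTrainer, and the hook signature used here, are all assumptions made for illustration. The trainer "trains" by replaying a fixed list of made-up validation losses.

```python
class Callback:
    """Base class: hooks do nothing unless a subclass overrides them."""

    def on_epoch_end(self, epoch, logs, trainer):
        pass


class HistoryLogger(Callback):
    """Logging: record the metrics reported after each epoch."""

    def __init__(self):
        self.records = []

    def on_epoch_end(self, epoch, logs, trainer):
        self.records.append((epoch, dict(logs)))


class StopWhenGoodEnough(Callback):
    """Taking action: ask the trainer to stop once val_loss drops below a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs, trainer):
        if logs["val_loss"] < self.threshold:
            trainer.stop_training = True


class ToyTrainer:
    """Stand-in for a model: replays a fixed list of made-up validation losses."""

    def __init__(self, fake_val_losses):
        self.fake_val_losses = fake_val_losses
        self.stop_training = False

    def fit(self, callbacks):
        for epoch, val_loss in enumerate(self.fake_val_losses, start=1):
            logs = {"val_loss": val_loss}
            # The trainer hands control to every callback at the end of each epoch.
            for callback in callbacks:
                callback.on_epoch_end(epoch, logs, self)
            if self.stop_training:
                break


logger = HistoryLogger()
trainer = ToyTrainer([0.9, 0.6, 0.4, 0.3, 0.2])
trainer.fit([logger, StopWhenGoodEnough(threshold=0.5)])
print(len(logger.records))
```

The point of the sketch is the division of labour: the training loop only exposes its state at fixed points, and each callback decides independently whether to observe or to intervene.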

The code below shows the Keras callbacks we have used:

In [12]:
# Sample callback code
callback_list = [
    # stop training once validation accuracy stops improving
    keras.callbacks.EarlyStopping(
        patience=1,
        monitor='val_acc',
    ),

    # write logs for TensorBoard
    keras.callbacks.TensorBoard(
        log_dir='C:/Ankit/Python/Sentiment Analysis/model/log_dir_m1',
        histogram_freq=1,
        embeddings_freq=1,
    ),

    # keep only the model with the best validation loss
    keras.callbacks.ModelCheckpoint(
        monitor='val_loss',
        save_best_only=True,
        filepath='C:/Ankit/Python/Sentiment Analysis/model/movie_sentiment_m1.h5',
    ),

    # lower the learning rate when the validation loss plateaus
    keras.callbacks.ReduceLROnPlateau(
        patience=1,
        factor=0.1,
    )
]



Callbacks are passed to the model via the callbacks argument of the fit() method, which takes a list of callbacks. Any number of callbacks can be passed to it.

The monitor argument of the EarlyStopping callback names the quantity to watch, here the model’s validation accuracy, and the patience argument sets how many epochs without improvement in that quantity are tolerated before training is interrupted (in this case 1).
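
The patience logic can be sketched in a few lines of plain Python. The function early_stopping_epoch and the accuracy values are hypothetical. The rule used here (stop once the number of epochs without improvement reaches patience) is an assumption that follows recent Keras versions, and the exact off-by-one behaviour may differ between versions.

```python
def early_stopping_epoch(val_scores, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    Assumes higher scores are better (e.g. validation accuracy).
    """
    best = float("-inf")
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation accuracies: a small dip at epoch 4, then recovery.
val_acc_per_epoch = [0.70, 0.78, 0.81, 0.80, 0.83]
stop_with_patience_1 = early_stopping_epoch(val_acc_per_epoch, patience=1)
stop_with_patience_2 = early_stopping_epoch(val_acc_per_epoch, patience=2)
print(stop_with_patience_1, stop_with_patience_2)
```

With these made-up numbers, a patience of 1 stops at the first dip, while a patience of 2 rides it out. This is the trade-off to weigh when choosing the value.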

The filepath argument of the ModelCheckpoint callback sets the destination model file, to which the model is saved at the end of an epoch. Together, the monitor and save_best_only arguments mean the file is overwritten only when the validation loss (val_loss) has improved. This allows us to keep the best model seen during training.

The ReduceLROnPlateau callback reduces the learning rate when the validation loss has stopped improving. This is often an effective strategy for getting out of local minima during training. The factor argument is a float by which the learning rate is multiplied when the callback is triggered, so factor=0.1 divides the learning rate by 10.
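
A minimal sketch of this schedule, assuming the simplest form of the rule (new_lr = max(lr * factor, min_lr) after patience epochs without improvement). The function reduce_lr_on_plateau and the loss values are hypothetical, and the real callback has further options, such as a cooldown period, that are left out here.

```python
def reduce_lr_on_plateau(val_losses, initial_lr, factor, patience, min_lr=0.0):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("inf")
    wait = 0  # epochs since the validation loss last improved
    schedule = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # multiply, never go below min_lr
                wait = 0
        schedule.append(lr)
    return schedule


# Made-up validation losses with three epochs that fail to improve.
losses = [0.60, 0.50, 0.52, 0.49, 0.49, 0.50]
schedule = reduce_lr_on_plateau(losses, initial_lr=0.001, factor=0.1, patience=1)
print(schedule)
```

Each epoch that fails to beat the best loss so far shrinks the learning rate by a factor of ten in this example.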

The model below is a 1D ConvNet with five convolutional layers, each followed by max pooling. Its output is collapsed into a single vector by the GlobalMaxPooling1D layer and fed to a Dense layer. Alternatively, the Flatten layer can be used for this step. We then make our prediction by feeding the vector obtained from the Dense layer to another Dense layer with 1 unit and a sigmoid activation function. We use a sigmoid activation at the output layer because our classification task involves only two classes (positive or negative).
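
The sequence lengths and parameter counts of this architecture can be checked by hand. The sketch below assumes the Keras defaults for these layers (no padding and stride 1 for Conv1D, a pooling stride equal to the pool size for MaxPooling1D). The helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a convolution with no padding and stride 1."""
    return length - kernel_size + 1


def pool1d_out_len(length, pool_size):
    """Output length of non-overlapping pooling without padding."""
    return length // pool_size


def conv1d_params(in_channels, filters, kernel_size):
    """Weights (kernel_size * in_channels * filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


# Sequence length after the input and after each of the five Conv1D/MaxPooling1D pairs.
length = 500
lengths = [length]
for _ in range(5):
    length = conv1d_out_len(length, 3)
    lengths.append(length)
    length = pool1d_out_len(length, 3)
    lengths.append(length)
print(lengths)

# Parameter counts for the embedding and the first two convolutions.
embedding_params = 10000 * 50                 # vocabulary size * embedding dimension
first_conv_params = conv1d_params(50, 256, 3)
second_conv_params = conv1d_params(256, 256, 3)
print(embedding_params, first_conv_params, second_conv_params)
```

These values agree with the rows of the model summary printed after the model definition: 500, 498, 166, 164, 54 and 52 for the lengths, and 500000, 38656 and 196864 for the parameter counts. After the fifth pooling step only one time step remains, so a much shorter max_len would not survive five such blocks.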

In [13]:
# model developing
text_input_layer = Input(shape=(500,))
embedding_layer = Embedding(max_words, 50)(text_input_layer)
text_layer = Conv1D(256, 3, activation='relu')(embedding_layer)
text_layer = MaxPooling1D(3)(text_layer)
text_layer = Conv1D(256, 3, activation='relu')(text_layer)
text_layer = MaxPooling1D(3)(text_layer)
text_layer = Conv1D(256, 3, activation='relu')(text_layer)
text_layer = MaxPooling1D(3)(text_layer)
text_layer = Conv1D(256, 3, activation='relu')(text_layer)
text_layer = MaxPooling1D(3)(text_layer)
text_layer = Conv1D(256, 3, activation='relu')(text_layer)
text_layer = MaxPooling1D(3)(text_layer)
text_layer = GlobalMaxPooling1D()(text_layer)
text_layer = Dense(256, activation='relu')(text_layer)
output_layer = Dense(1, activation='sigmoid')(text_layer)
model = Model(text_input_layer, output_layer)
model.summary()
model.compile(optimizer=RMSprop(lr=0.001),
              loss='binary_crossentropy',
              metrics=['acc'])

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 500)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 500, 50)           500000    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 498, 256)          38656     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 166, 256)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 164, 256)          196864    
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 54, 256)           0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 52, 256)           1968

In [14]:
history = model.fit(x_train, y_train, epochs=50, batch_size=128, callbacks=callback_list,validation_data=(x_val, y_val))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 20000 samples, validate on 10000 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50


In [15]:
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
import os, pickle
import numpy as np

In [16]:
tokenizer_path = 'C:/Ankit/Python/Sentiment Analysis/Tokenizer'
model_path = 'C:/Ankit/Python/Sentiment Analysis/model'
model_file = os.path.join(model_path, 'movie_sentiment_m1.h5')
tokenizer_file = os.path.join(tokenizer_path, 'tokenizer_m1.pickle')
model = load_model(model_file)

# load tokenizer
with open(tokenizer_file, 'rb') as handle:
    tokenizer = pickle.load(handle)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


In [17]:
def review_rating(score, decoded_review):
    if float(score) >= 0.9:
        print('Review: {}\nSentiment: Strongly Positive\nScore: {}'.format(decoded_review, score))
    elif float(score) >= 0.7 and float(score) < 0.9:
        print('Review: {}\nSentiment: Positive\nScore: {}'.format(decoded_review, score))
    elif float(score) >= 0.5 and float(score) < 0.7:
        print('Review: {}\nSentiment: Okay\nScore: {}'.format(decoded_review, score))
    else:
        print('Review: {}\nSentiment: Negative\nScore: {}'.format(decoded_review, score))
    print('\n\n')

In [18]:
def decode_review(text_list):
    word_index = tokenizer.word_index
    sequences = tokenizer.texts_to_sequences(text_list)
    data = pad_sequences(sequences, maxlen=500)

    # decode the words
    reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
    decoded_review = ' '.join([reverse_word_index.get(i, '?') for i in sequences[0]])
    return decoded_review, data

In [19]:
def score_review(source=None, file_type=None):
    '''
    source: the text, as either a string or a list of strings, or a path
    when file_type is given
    file_type (str): whether source is the path of a single text file ('file')
    or of a directory of .txt files ('dir'); None means source is raw text
    '''
    text_list = list()
    if isinstance(source, str) and file_type is None:
        text_list.append(source)
        decoded_review, data = decode_review(text_list)
        # make prediction
        score = model.predict(data)[0][0]
        review_rating(score, decoded_review)
    
    if isinstance(source, list) and file_type is None:
        for item in source:
            text_list = list()
            text_list.append(item)
            decoded_review, data = decode_review(text_list)
            score = model.predict(data)[0][0]
            review_rating(score, decoded_review)
    
    if isinstance(source, str) and file_type == 'file':
        # read the whole file and make sure it is closed afterwards
        with open(source, encoding='utf-8') as f:
            file_data = f.read()
        text_list.append(file_data)
        decoded_review, data = decode_review(text_list)
        # make prediction
        score = model.predict(data)[0][0]
        review_rating(score, decoded_review)
    
    if isinstance(source, str) and file_type == 'dir':
        file_content_holder = list()
        for fname in os.listdir(source):
            if fname[-4:] == '.txt':
                f = open(os.path.join(source, fname),encoding='utf-8')
                file_content_holder.append(f.read())
                f.close()
        for item in file_content_holder:
            text_list = list()
            text_list.append(item)
            decoded_review, data = decode_review(text_list)
            score = model.predict(data)[0][0]
            review_rating(score, decoded_review)

In [20]:
# plotting the results
import matplotlib.pyplot as plt

acc = history.history.get('acc')
val_acc = history.history.get('val_acc')
loss = history.history.get('loss')
val_loss = history.history.get('val_loss')

epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training Acc')
plt.plot(epochs, val_acc, 'b', label='Validation Acc')
plt.title('Training and Validation Accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label="Training Loss")
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and Validation loss')
plt.legend()
plt.show()

[Figure: Training and Validation Accuracy]

[Figure: Training and Validation loss]

In [None]:
#imdb_dir = 'C:/Ankit/Python/Sentiment Analysis/aclImdb/aclImdb'
score_review('C:/Ankit/Python/Sentiment Analysis/aclImdb/aclImdb/test/pos', file_type='dir')

In [25]:
score_review('It was awesome')

Review: it was awesome
Sentiment: Okay
Score: 0.5106653571128845





We will use a set of separate .txt files housed in the directory pos. Each file is an IMDB movie review; the files are read one at a time and the model predicts the sentiment of each. This shows how to read individual files and run predictions on them.
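
The file-reading step on its own can be sketched with the standard library. The helper read_text_files and the sample file names and contents below are made up for illustration, and a temporary directory stands in for the pos folder.

```python
import tempfile
from pathlib import Path


def read_text_files(directory):
    """Return a {file name: contents} dict for every .txt file directly inside `directory`."""
    reviews = {}
    for path in sorted(Path(directory).glob("*.txt")):
        reviews[path.name] = path.read_text(encoding="utf-8")
    return reviews


# Demonstration on a throwaway directory with two reviews and one unrelated file.
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "0_10.txt").write_text("A wonderful film.", encoding="utf-8")
    Path(tmp_dir, "1_9.txt").write_text("Great acting and story.", encoding="utf-8")
    Path(tmp_dir, "notes.md").write_text("not a review", encoding="utf-8")
    reviews = read_text_files(tmp_dir)

print(sorted(reviews))
```

Each value in the returned dictionary is the raw text of one review, which could then be passed as a string to the score_review function defined earlier.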

In [None]:
source = 'C:/Ankit/Python/Sentiment Analysis/aclImdb/aclImdb/test/pos'
file_type = 'dir'
# holds the text of every review file read in the next cell
file_content_holder = list()

In [None]:
for fname in os.listdir(source):
    print(fname)
    print(fname[-4:])
    if fname[-4:] == '.txt':
        # open each review, echo the file handle, and collect its contents
        with open(os.path.join(source, fname), encoding='utf-8') as f:
            print(f)
            file_content_holder.append(f.read())
        print(file_content_holder)

### TensorBoard

The key purpose of TensorBoard is to help us visually monitor what goes on inside our model during training. TensorBoard gives us access to several relevant features, such as:

1. visually monitoring metrics during training
2. visualizing the architecture of our model
3. visualizing histograms of activations and gradients



In [None]:
import tensorflow

Before using TensorBoard, we first need to create a directory where the log files it generates will be stored. This can be done from a shell with $ mkdir log_dir_m1. The TensorBoard notebook extension is then loaded and started from the Jupyter notebook with the following commands.

In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir 'C:/Ankit/Python/Sentiment Analysis/model/log_dir_m1'

#With the server started, we can then browse to http://localhost:6006 and look at the model training. In addition to the training and validation metrics,
