# Movie Review Dataset
Movie Review Polarity Dataset

https://raw.githubusercontent.com/jbrownlee/Datasets/master/review_polarity.tar.gz

## Loading and Cleaning Reviews

The text data is already pretty clean; not much preparation is required. Without getting bogged
down too much in the details, we will prepare the data using the following way:
 Split tokens on white space.

 Remove all punctuation from words.

 Remove all words that are not purely comprised of alphabetical characters.

 Remove all words that are known stop words.

 Remove all words that have a length less than 1 character.

In [7]:
import string
import re
from os import listdir
from nltk.corpus import stopwords
from pickle import dump
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text


# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    tokens = ' '.join(tokens)
    return tokens


# load all docs in a directory
def process_docs(directory, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
    path = directory + '/' + filename
    doc = load_doc(path)
    # clean doc
    tokens = clean_doc(doc)
    # add to list
    documents.append(tokens)
    return documents

# load and clean a dataset
def load_clean_dataset(is_train):
    # load documents
    neg = process_docs('txt_sentoken/neg', is_train)
    pos = process_docs('txt_sentoken/pos', is_train)
    docs = neg + pos
    # prepare labels
    labels = [0 for _ in range(len(neg))] + [1 for _ in range(len(pos))]
    return docs, labels

# save a dataset to file
def save_dataset(dataset, filename):
    dump(dataset, open(filename, 'wb'))
    print('Saved: %s' % filename)
    
    
# load and clean all reviews
train_docs, ytrain = load_clean_dataset(True)
test_docs, ytest = load_clean_dataset(False)
# save training datasets
save_dataset([train_docs, ytrain], 'train.pkl')
save_dataset([test_docs, ytest], 'test.pkl')


Saved: train.pkl
Saved: test.pkl


## Develop Multi-channel Model

In this section, we will develop a multi-channel convolutional neural network for the sentiment
analysis prediction problem. This section is divided into 3 parts:
1. Encode Data
2. Define Model.
3. Complete Example.

# Define Model


A standard model for document classification is to use an Embedding layer as input, followed by
a one-dimensional convolutional neural network, pooling layer, and then a prediction output
layer. 

The kernel size in the convolutional layer defines the number of words to consider as the convolution is passed across the input text document,
providing a grouping parameter. 

A multi-channel convolutional neural network for document classification involves using multiple
versions of the standard model with different sized kernels. 

This allows the document to be processed at different resolutions or different n-grams (groups of words) at a time, whilst the
model learns how to best integrate these interpretations.

This approach was first described by Yoon Kim in his 2014 paper titled Convolutional Neural
Networks for Sentence Classification. In the paper, Kim experimented with static and dynamic
(updated) embedding layers, we can simplify the approach and instead focus only on the use of
different kernel sizes. 


In Keras, a multiple-input model can be defined using the functional API. We will define a
model with three input channels for processing 4-grams, 6-grams, and 8-grams of movie review
text. Each channel is comprised of the following elements:
 Input layer that defines the length of input sequences.
 Embedding layer set to the size of the vocabulary and 100-dimensional real-valued repre-
sentations.
 Conv1D layer with 32 filters and a kernel size set to the number of words to read at once.
 MaxPooling1D layer to consolidate the output from the convolutional layer.
 Flatten layer to reduce the three-dimensional output to two dimensional for concatenation.

In [3]:
from pickle import load
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.vis_utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Dropout
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.merge import concatenate


# load a clean dataset
def load_dataset(filename):
    return load(open(filename, 'rb'))
# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
# calculate the maximum document length
def max_length(lines):
    return max([len(s.split()) for s in lines])
# encode a list of lines
def encode_text(tokenizer, lines, length):
    # integer encode
    encoded = tokenizer.texts_to_sequences(lines)
    # pad encoded sequences
    padded = pad_sequences(encoded, maxlen=length, padding='post')
    return padded

# define the model
def define_model(length, vocab_size):
    # channel 1
    inputs1 = Input(shape=(length,))
    embedding1 = Embedding(vocab_size, 100)(inputs1)
    conv1 = Conv1D(32, 4, activation='relu')(embedding1)
    drop1 = Dropout(0.5)(conv1)
    pool1 = MaxPooling1D()(drop1)
    flat1 = Flatten()(pool1)
    # channel 2
    inputs2 = Input(shape=(length,))
    embedding2 = Embedding(vocab_size, 100)(inputs2)
    conv2 = Conv1D(32, 6, activation='relu')(embedding2)
    drop2 = Dropout(0.5)(conv2)
    pool2 = MaxPooling1D()(drop2)
    flat2 = Flatten()(pool2)
    # channel 3
    inputs3 = Input(shape=(length,))
    embedding3 = Embedding(vocab_size, 100)(inputs3)
    conv3 = Conv1D(32, 8, activation='relu')(embedding3)
    drop3 = Dropout(0.5)(conv3)
    pool3 = MaxPooling1D()(drop3)
    flat3 = Flatten()(pool3)
    # merge
    merged = concatenate([flat1, flat2, flat3])
    # interpretation
    dense1 = Dense(10, activation='relu')(merged)
    outputs = Dense(1, activation='sigmoid')(dense1)
    model = Model(inputs=[inputs1, inputs2, inputs3], outputs=outputs)
    # compile
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize
    model.summary()
    plot_model(model, show_shapes=True, to_file='model.png')
    return model


# load training dataset
trainLines, trainLabels = load_dataset('train.pkl')
# create tokenizer
tokenizer = create_tokenizer(trainLines)
# calculate max document length
length = max_length(trainLines)
print('Max document length: %d' % length)
# calculate vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary size: %d' % vocab_size)
# encode data
trainX = encode_text(tokenizer, trainLines, length)
# define model
model = define_model(length, vocab_size)
# fit model
model.fit([trainX,trainX,trainX], array(trainLabels), epochs=7, batch_size=16)
# save the model
model.save('model.h5')

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


Max document length: 369
Vocabulary size: 512
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 369)          0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 369)          0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, 369)          0                                            
__________________________________________________________________________

![title](picture5.png)

In [None]:
from pickle import load
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
# load a clean dataset
def load_dataset(filename):
    return load(open(filename, 'rb'))
# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# calculate the maximum document length
def max_length(lines):
    return max([len(s.split()) for s in lines])
# encode a list of lines

def encode_text(tokenizer, lines, length):
# integer encode
    encoded = tokenizer.texts_to_sequences(lines)
    # pad encoded sequences
    padded = pad_sequences(encoded, maxlen=length, padding='post')
    return padded

In [9]:
# load datasets
from keras.models import load_model
trainLines, trainLabels = load_dataset('train.pkl')
testLines, testLabels = load_dataset('test.pkl')
# create tokenizer
tokenizer = create_tokenizer(trainLines)
# calculate max document length
length = max_length(trainLines)
print('Max document length: %d' % length)
# calculate vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary size: %d' % vocab_size)
# encode data
trainX = encode_text(tokenizer, trainLines, length)
testX = encode_text(tokenizer, testLines, length)
# load the model
model = load_model('model.h5')
# evaluate model on training dataset
_, acc = model.evaluate([trainX,trainX,trainX], array(trainLabels), verbose=0)
print('Train Accuracy: %.2f' % (acc*100))
# evaluate model on test dataset dataset
_, acc = model.evaluate([testX,testX,testX], array(testLabels), verbose=0)
print('Test Accuracy: %.2f' % (acc*100))

Max document length: 369
Vocabulary size: 512
Train Accuracy: 100.00
Test Accuracy: 100.00


# Extensions

Different n-grams. Explore the model by changing the kernel size (number of n-grams)
used by the channels in the model to see how it impacts model skill.


 More or Fewer Channels. Explore using more or fewer channels in the model and see
how it impacts model skill.


 Shared Embedding. Explore configurations where each channel shares the same word
embedding and report on the impact on model skill.

 Deeper Network. Convolutional neural networks perform better in computer vision
when they are deeper. Explore using deeper models here and see how it impacts model
skill.

 Truncated Sequences. Padding all sequences to the length of the longest sequence
might be extreme if the longest sequence is very dierent to all other reviews. Study the
distribution of review lengths and truncate reviews to a mean length.


 Truncated Vocabulary. We removed infrequently occurring words, but still had a large
vocabulary of more than 25,000 words. Explore further reducing the size of the vocabulary
and the effect on model skill.

 Epochs and Batch Size. The model appears to fit the training dataset quickly. Explore
alternate configurations of the number of training epochs and batch size and use the test
dataset as a validation set to pick a better stopping point for training the model.

 Pre-Train an Embedding. Explore pre-training a Word2Vec word embedding in the
model and the impact on model skill with and without further fine tuning during training.

 Use GloVe Embedding. Explore loading the pre-trained GloVe embedding and the
impact on model skill with and without further fine tuning during training.


 Train Final Model. Train a final model on all available data and use it make predictions
on real ad hoc movie reviews from the internet.