In [1]:
from keras.datasets import imdb

Using TensorFlow backend.


In [2]:
(train_inputs, train_outputs), (test_inputs, test_outputs) = imdb.load_data()

Loaded the IMDB dataset, consisting of 50000 movie reviews split into 25000 training samples and 25000 test samples. Each input is a movie review encoded as a list of integers, where each integer is the index of a word ranked by its frequency in the dataset (lower indices correspond to more common words). Each output is a class label, 1 or 0, indicating a good or bad overall assessment. Note that both the training and test data are balanced, with 50% good reviews and 50% bad reviews.
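
To make the integer encoding concrete, the sketch below builds a frequency-ranked vocabulary from a toy corpus and encodes each review as a list of ranks. It is a simplified illustration, not Keras's actual preprocessing: the helper names build_rank_index and encode and the three toy reviews are hypothetical, and the real dataset additionally reserves a few low indices for special tokens.

```python
from collections import Counter

def build_rank_index(reviews):
    """Map each word to its frequency rank (1 = most frequent word)."""
    counts = Counter(word for review in reviews for word in review.split())
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review, rank_index):
    """Replace every word in a review by its frequency rank."""
    return [rank_index[word] for word in review.split()]

# Hypothetical toy corpus, used only for illustration.
toy_reviews = [
    "the film was great",
    "the film was dull",
    "the plot was great",
]
rank_index = build_rank_index(toy_reviews)
encoded = [encode(review, rank_index) for review in toy_reviews]
print(encoded)
```

Common words such as "the" receive small indices, while rare words such as "plot" receive large ones, which is why a cutoff on the index acts as a cutoff on word frequency.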

In [3]:
train_inputs.shape

(25000,)

In [4]:
test_inputs.shape

(25000,)

In [5]:
train_inputs[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 22665,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 21631,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 31050,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5

In [6]:
train_outputs[0]

1

In [7]:
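import numpy as np

DIMENSION = 10000  # vocabulary cutoff: keep only word indices below this value

def one_hot_encoding(sequences, dimension=DIMENSION):
    # One row per review; column k is 1 if word index k occurs in that review.
    result = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for index in sequence:
            if index < dimension:  # skip words beyond the vocabulary cutoff
                result[i, index] = 1
    return result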

In [8]:
train_inputs = one_hot_encoding(train_inputs)
test_inputs = one_hot_encoding(test_inputs)

Set the DIMENSION variable as the cutoff on word index. Because words are indexed by frequency rank, keeping only indices below 10000 restricts our inputs to roughly the 10000 most common words.

Transform inputs via one-hot encoding (strictly speaking a multi-hot encoding, since each review sets several entries), mapping a list of integers to a vector of length DIMENSION whose k-th entry is 1 if k appears in the list and 0 otherwise. Could also use one of sklearn's built-in transformers. sklearn.preprocessing.MultiLabelBinarizer accepts variable-length lists directly, whereas the .fit() and .transform() methods of sklearn.preprocessing.OneHotEncoder take as input an array with dimensions num_samples x num_features, so we would need to transform our training data appropriately before using it.
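
A small worked example may help here. The sketch below applies the same kind of encoding to two toy sequences with a hypothetical cutoff of 5 instead of 10000; the names multi_hot and toy_sequences are illustrative and not part of the notebook.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """Encode each integer sequence as a 0/1 vector of length `dimension`."""
    result = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Keep only indices below the cutoff; repeats collapse to a single 1.
        kept = [k for k in sequence if k < dimension]
        result[i, kept] = 1
    return result

toy_sequences = [[1, 3, 3], [0, 2, 7]]  # 7 exceeds the cutoff and is dropped
encoded = multi_hot(toy_sequences, dimension=5)
print(encoded)
```

Only the presence of a word is recorded, so word order and repeat counts are discarded. That loss is acceptable here because the Bernoulli model works with binary features.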

In [9]:
train_inputs[0]

array([0., 1., 0., ..., 0., 0., 0.])

In [10]:
train_outputs.min(axis = 0)

0

In [11]:
train_outputs.max(axis = 0)

1

In [12]:
from sklearn.naive_bayes import BernoulliNB

In [13]:
model = BernoulliNB()

In [14]:
model.get_params()

{'alpha': 1.0, 'binarize': 0.0, 'class_prior': None, 'fit_prior': True}

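The parameters of the BernoulliNB object are fairly simple.  The 'alpha' parameter is the additive smoothing strength: the default value, alpha = 1, corresponds to Laplace smoothing, values between 0 and 1 give weaker (Lidstone) smoothing, and alpha = 0 corresponds to no smoothing.  The 'binarize' parameter is the threshold used to map feature values to 0 or 1; our inputs are already binary, so the default of 0.0 leaves them unchanged.  The 'class_prior' parameter allows us to provide an array of known class priors, which are then used as given rather than adjusted to the data.  The 'fit_prior' parameter controls whether class priors are learned from the training data; if it is False, a uniform prior is used.  The defaults above supply no priors on the classes (class_prior = None) and allow the algorithm to learn these class weights (fit_prior = True).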
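
The effect of alpha is easiest to see on a single feature. The sketch below assumes the usual additive-smoothing estimate for a binary feature, (count + alpha) / (n + 2 * alpha), where count is the number of class samples containing the word and n is the class size; the function name smoothed_probability and the numbers are illustrative.

```python
def smoothed_probability(count, n, alpha):
    """Estimate P(word present | class) with additive smoothing."""
    return (count + alpha) / (n + 2 * alpha)

# A word that never appears among 10 training reviews of a class.
with_laplace = smoothed_probability(0, 10, alpha=1.0)  # 1/12, small but nonzero
without_smoothing = smoothed_probability(0, 10, alpha=0.0)  # exactly 0
print(with_laplace, without_smoothing)
```

Without smoothing, a word never seen in a class gets probability zero, which would rule out that class for any review containing the word, regardless of the other evidence.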

In [15]:
model.fit(train_inputs, train_outputs)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

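Recall that the BernoulliNB learning algorithm has simple closed-form solutions for both the class priors and the class-conditional feature probabilities: both are (smoothed) relative frequencies computed from counts over the training data.  As such, learning requires no iterative optimization and is rapidly accomplished.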
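
As a rough illustration of why fitting is cheap, the sketch below estimates the priors and the smoothed feature probabilities by counting on a toy dataset, then classifies a sample by comparing log posteriors. It is a simplified pure-Python stand-in, not sklearn's implementation; the names fit_bernoulli_nb and predict and the toy data are hypothetical.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Return class priors and smoothed P(feature = 1 | class) by counting."""
    classes = sorted(set(y))
    n_features = len(X[0])
    priors, feature_probs = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(X)
        feature_probs[c] = [
            (sum(row[j] for row in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
    return priors, feature_probs

def predict(x, priors, feature_probs):
    """Pick the class with the largest log posterior (up to a constant)."""
    def log_posterior(c):
        total = math.log(priors[c])
        for value, p in zip(x, feature_probs[c]):
            # Absent features contribute too, through 1 - p.
            total += math.log(p if value else 1.0 - p)
        return total
    return max(priors, key=log_posterior)

# Toy data: 3 binary features, labels 1 (good) and 0 (bad).
X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y = [1, 1, 0, 0]
priors, feature_probs = fit_bernoulli_nb(X, y)
print(priors)
print(feature_probs)
print(predict([1, 0, 0], priors, feature_probs))
```

Every quantity is a ratio of counts, so the whole fit amounts to one pass over the training data.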

In [16]:
model.score(test_inputs, test_outputs)

0.8404
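
For a classifier, score reports mean accuracy, i.e. the fraction of test samples whose predicted label matches the true label. A minimal sketch of that computation on hypothetical labels (the accuracy helper and the toy lists are illustrative):

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

toy_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 labels match
print(toy_accuracy)
```

On the 25000 test reviews, a score of 0.8404 corresponds to 21010 correctly classified reviews.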

Accuracy is lower than in an appropriately designed neural network model (see denseNN_binaryclass.ipynb), but the development and training of such a model are vastly more complicated.