# Validation Data

In this notebook we'll use the IMDB dataset to practice making validation data.
Then we'll tune our hyper-parameters on the validation data.

In [1]:
from keras.datasets import imdb
import numpy as np

In [2]:
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)



We import the dataset and load it into the familiar train and test sets.  Note the keyword argument `num_words = 10000`: it means that we keep only the 10,000 most commonly used words in the dataset.  The logic is that rarely used words aren't going to help classify the reviews as positive or negative.
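To make the effect of `num_words` concrete, here is a minimal sketch of the idea in plain Python. It assumes that word indices are ranked by frequency and that any index at or above the cap is replaced by an out-of-vocabulary marker (index 2, which I believe is the Keras default). The helper `cap_vocabulary` and the toy review are hypothetical, and this is a simplified picture rather than the library's actual code.

```python
def cap_vocabulary(sequence, num_words=10000, oov_index=2):
    """Replace every word index >= num_words with an out-of-vocabulary marker.

    Hypothetical helper: a simplified picture of what a vocabulary cap does.
    """
    return [w if w < num_words else oov_index for w in sequence]

# Toy review: lower indices stand for more frequent words (assumption)
toy_review = [1, 14, 22, 16000, 43, 25000]
capped = cap_vocabulary(toy_review)
print(capped)  # [1, 14, 22, 2, 43, 2]
```

The two rare words (indices 16000 and 25000) collapse into the same marker, so the model never sees them as distinct features.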

In [3]:
print ("Train shape : ",train_data.shape)
print ("Test shape : ", test_data.shape)

Train shape :  (25000,)
Test shape :  (25000,)


## Creating the validation data

I don't like to write my own splitting functions, because I'm always afraid I'll make a dumb indexing mistake.  For that reason I highly suggest using scikit-learn's built-in functions.

We will use `train_test_split`.

Note that the training data is a 1-tensor: a one-dimensional array in which each sample is just a list of numbers (word indices).  Let's look at a sample.
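As a rough illustration of that structure, the sketch below builds a tiny stand-in for `train_data`. The three reviews are made up (the real array holds 25,000), and the names `toy_reviews` and `toy_data` are hypothetical. The outer array is one-dimensional with dtype `object`, and each element is a plain Python list whose length varies from review to review.

```python
import numpy as np

# Hypothetical stand-in for train_data: three "reviews" of different lengths
toy_reviews = [[1, 14, 22, 16], [1, 194, 1153], [1, 4, 2, 8, 30, 5]]
toy_data = np.empty(len(toy_reviews), dtype=object)
for i, review in enumerate(toy_reviews):
    toy_data[i] = review  # fill element by element to keep each list intact

print(toy_data.shape)                        # (3,)
print(toy_data.dtype)                        # object
print([len(review) for review in toy_data])  # [4, 3, 6]
print(toy_data[0])                           # [1, 14, 22, 16]
```

This is why the shape printed earlier is `(25000,)` rather than a two-dimensional shape: the reviews are not padded to a common length.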

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X_train, X_valid, y_train, y_valid  = train_test_split(train_data, train_labels, test_size = .35, random_state = 10)

### A note on random states

The `random_state` parameter found in many sklearn functions and estimators (classifiers / models) is used to "freeze" the randomness, so a result can be reproduced over and over.  Anything run with a fixed `random_state` is **not** random from one run to the next.  Some models and methods require an element of randomness; shuffling and splitting the data, for example, has to be random.  However, since the results then change ever so slightly each time you re-run the method, it can be frustrating (scores will change slightly).  For this reason, I'm setting the random state here, so you can reproduce it on your computer.

By passing a fixed value such as `random_state = 10` as an argument to the method, we _actually remove all run-to-run randomness_ from the method.  In essence, the algorithm becomes deterministic (fixed) and the results will always be the same.
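A quick way to convince yourself of this is to split the same data twice with the same seed. The sketch below uses a made-up ten-sample array (the names `X`, `y`, `a_train` and so on are just for illustration) and checks that both calls return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: ten samples with matching labels (made up for illustration)
X = np.arange(10)
y = X % 2

# Two calls with the same random_state give identical splits
a_train, a_valid, _, _ = train_test_split(X, y, test_size=.35, random_state=10)
b_train, b_valid, _, _ = train_test_split(X, y, test_size=.35, random_state=10)
same_seed_matches = (np.array_equal(a_train, b_train)
                     and np.array_equal(a_valid, b_valid))
print(same_seed_matches)  # True

# The split still covers every sample exactly once
covers_all = sorted(np.concatenate([a_train, a_valid]).tolist()) == list(range(10))
print(covers_all)  # True
```

Leaving `random_state` unset would typically give a different shuffle on each call.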

I don't like this in general.
Why?

Because it gives you one slice of the model, which isn't representative of the real model. The real model has randomized parameters, so it should change every time you instantiate it.  This means that to get realistic results, you typically need to run the model multiple times (hundreds) in order to know _in general_ how it will perform.

When the difference between two models is <1%, can we be sure one model is better than the other?  Especially when we've created a static version of each model by freezing its random variables?  It's not really possible.  So I would advise that the better solution is to run the experiment (with randomness) a few hundred times and then average the results.  This would be a much better indicator of what kind of performance you can get.
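Here is a minimal sketch of that "run it many times and average" idea. To keep it fast it makes several assumptions that are not part of this notebook: the data is synthetic (`make_classification`), the model is a scikit-learn `LogisticRegression` standing in for the Keras network, and only 30 repetitions are run instead of a few hundred. Each repetition uses a different split, and the mean and standard deviation summarize the spread.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs quickly (not the IMDB data)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

scores = []
for seed in range(30):  # a few hundred in practice; 30 keeps the sketch fast
    # A different seed each run, so each run sees a different random split
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=.35,
                                              random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(clf.score(X_va, y_va))

scores = np.array(scores)
print("mean accuracy: {:.3f}".format(scores.mean()))
print("std  accuracy: {:.3f}".format(scores.std()))
```

If two candidate models differ by less than the standard deviation you see here, a single frozen run is not enough to rank them.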

In [6]:
print ("X Train shape : {}".format(X_train.shape))
print ("y Train shape : {}".format(y_train.shape))
print ("X Valid shape : {}".format(X_valid.shape))
print ("y valid shape : {}".format(y_valid.shape))

X Train shape : (16250,)
y Train shape : (16250,)
X Valid shape : (8750,)
y valid shape : (8750,)


## We have our validation data, let's do some experiments.

Last time we were trying to figure out which one was better:

* bag of words with binary features
* bag of words with count features
* bag of words with TF-IDF features

In [7]:
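def binary_bag_of_words (sequences, dimension = 10000):
    # One row per review, one column per word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Mark every word index that appears in the review at least once
        results[i, sequence] = 1.
    return results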

In [8]:
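def count_bag_of_words (sequences, dimension = 10000):
    # One row per review, one column per word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for j in sequence:
            # Add one for each occurrence of word index j in the review
            results[i, j] += 1.
    return results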
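To see the difference between the two encodings, it helps to run them on something small. The sketch below repeats the two functions so it is self-contained and applies them to two made-up reviews over a 5-word vocabulary (`toy_sequences` is hypothetical).

```python
import numpy as np

def binary_bag_of_words(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

def count_bag_of_words(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i, j] += 1.
    return results

# Two made-up reviews over a 5-word vocabulary
toy_sequences = [[0, 2, 2], [1, 1, 1, 4]]

toy_binary = binary_bag_of_words(toy_sequences, dimension=5)
toy_count = count_bag_of_words(toy_sequences, dimension=5)
print(toy_binary)
# [[1. 0. 1. 0. 0.]
#  [0. 1. 0. 0. 1.]]
print(toy_count)
# [[1. 0. 2. 0. 0.]
#  [0. 3. 0. 0. 1.]]
```

The binary version only records whether a word occurs; the count version keeps how often it occurs (word 1 appears three times in the second review).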


## Let's make 3 different datasets to try out.



In [9]:
binary_X_train = binary_bag_of_words(X_train)
binary_X_valid = binary_bag_of_words(X_valid)

In [10]:
count_X_train = count_bag_of_words(X_train)
count_X_valid = count_bag_of_words(X_valid)

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer

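transformer = TfidfTransformer()

# Fit the IDF weights on the training counts only, then reuse them for validation
tfidf_X_train = transformer.fit_transform(count_X_train)
tfidf_X_valid = transformer.transform(count_X_valid)

# The transformer returns sparse matrices; convert them to dense arrays
tfidf_X_train = tfidf_X_train.toarray()
tfidf_X_valid = tfidf_X_valid.toarray()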
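The same fit-on-train, transform-on-valid pattern can be tried on a tiny example. The count matrices below are made up, and the `toy_` names are hypothetical; the sketch relies on the `TfidfTransformer` defaults (L2 row normalization, smoothed IDF) as I understand them.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up count matrices: rows are reviews, columns are word indices
toy_count_train = np.array([[1., 0., 2., 0.],
                            [0., 3., 1., 0.],
                            [2., 1., 1., 1.]])
toy_count_valid = np.array([[0., 1., 4., 0.]])

toy_transformer = TfidfTransformer()
toy_tfidf_train = toy_transformer.fit_transform(toy_count_train).toarray()
toy_tfidf_valid = toy_transformer.transform(toy_count_valid).toarray()

print(toy_tfidf_train.shape, toy_tfidf_valid.shape)  # (3, 4) (1, 4)

# With the default settings each non-empty row is scaled to unit L2 length
row_norms = np.linalg.norm(toy_tfidf_train, axis=1)
print(row_norms)

# Column 2 appears in every training review, so it gets the smallest IDF weight
print(toy_transformer.idf_)
```

Words that show up in every review are down-weighted, which is the main thing TF-IDF adds over raw counts.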

## One helpful *magic* function -- `whos`

We can list variables with `whos` and see what we've created so far.

In [12]:
whos ndarray

Variable         Type       Data/Info
-------------------------------------
X_train          ndarray    16250: 16250 elems, type `object`, 130000 bytes (126.953125 kb)
X_valid          ndarray    8750: 8750 elems, type `object`, 70000 bytes
binary_X_train   ndarray    16250x10000: 162500000 elems, type `float64`, 1300000000 bytes (1239.776611328125 Mb)
binary_X_valid   ndarray    8750x10000: 87500000 elems, type `float64`, 700000000 bytes (667.572021484375 Mb)
count_X_train    ndarray    16250x10000: 162500000 elems, type `float64`, 1300000000 bytes (1239.776611328125 Mb)
count_X_valid    ndarray    8750x10000: 87500000 elems, type `float64`, 700000000 bytes (667.572021484375 Mb)
test_data        ndarray    25000: 25000 elems, type `object`, 200000 bytes (195.3125 kb)
test_labels      ndarray    25000: 25000 elems, type `int64`, 200000 bytes (195.3125 kb)
tfidf_X_train    ndarray    16250x10000: 162500000 elems, type `float64`, 1300000000 bytes (1239.776611328125 Mb)
tfidf_X_valid    n
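The sizes in the listing follow directly from the array shapes: the number of elements times 8 bytes for `float64`. A short check of that arithmetic is sketched below; the `float32` line is only my suggestion for comparison and is not something this notebook does.

```python
import numpy as np

rows, cols = 16250, 10000
bytes_float64 = rows * cols * np.dtype("float64").itemsize
bytes_float32 = rows * cols * np.dtype("float32").itemsize

print(bytes_float64)              # 1300000000
print(bytes_float64 / 1024 ** 2)  # 1239.776611328125
print(bytes_float32)              # 650000000
```

Each dense training matrix is therefore about 1.2 GB, so holding three of them plus the validation copies takes several gigabytes of memory.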

In [13]:
from keras import models
from keras import layers

In [15]:
def build_run_model(X_train, X_valid, y_train = y_train, y_valid = y_valid):
    # The label defaults are bound to the y_train / y_valid from the split above
    model = models.Sequential()
    model.add(layers.Dense(16, activation='relu', input_shape = (10000,)))
    model.add(layers.Dense(16, activation='relu'))
    model.add(layers.Dense(1, activation="sigmoid"))

    model.compile(optimizer='rmsprop',
                 loss = 'binary_crossentropy',
                 metrics = ['accuracy'])

    # Train on the training split, then score on the held-out validation split
    model.fit(X_train, y_train, epochs = 15, batch_size = 512)
    valid_loss, valid_acc = model.evaluate(X_valid, y_valid)
    print('valid_acc:', valid_acc)

### Let's run all three models with the different data we have.

In [16]:
build_run_model(binary_X_train, binary_X_valid)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
valid_acc: 0.8640000224113464


In [17]:
build_run_model(count_X_train, count_X_valid)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
valid_acc: 0.8755428791046143


In [18]:
build_run_model(tfidf_X_train, tfidf_X_valid)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
valid_acc: 0.8837714195251465


## Which one is best?

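On this single run, TF-IDF scored highest on the validation data (about 0.884, versus 0.876 for counts and 0.864 for binary), but it doesn't actually matter.  What's important is that you get the idea of tuning your hyper-parameters on the **validation** data so we don't accidentally overfit to the testing data.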
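In code, the selection step only ever looks at validation scores. The sketch below uses the three validation accuracies printed above, rounded to four places; the dictionary and variable names are hypothetical.

```python
# Validation accuracies from the three runs above, rounded to four places
valid_scores = {
    "binary": 0.8640,
    "count": 0.8755,
    "tfidf": 0.8838,
}

# Choose the feature set using validation accuracy only
best_features = max(valid_scores, key=valid_scores.get)
print(best_features)  # tfidf

# The gap to the runner-up is under 1%, so a single run is weak evidence
gap = valid_scores["tfidf"] - valid_scores["count"]
print("gap to runner-up: {:.4f}".format(gap))
```

The test data stays untouched during this comparison and would only be scored once, after the choice is final.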

Now of course, we could tune it a lot and get the best results on the validation data, and this might still result in bad test scores!  Why?  Because then we've overfit to the **validation** data, which is pretty bad too!

So, in the next lesson we'll start looking at how to detect overfitting so we can stop ourselves when we are doing it.