### Download data

In [238]:
!kg config -c dogs-vs-cats-redux-kernels-edition

In [239]:
!kg download

Starting new HTTPS connection (1): www.kaggle.com
downloading https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/download/test.zip

Starting new HTTPS connection (1): storage.googleapis.com
test.zip already downloaded !
downloading https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/download/train.zip

train.zip already downloaded !
downloading https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/download/sample_submission.csv

sample_submission.csv already downloaded !


In [240]:
# These commands consistently crashed my notebook, so they stay commented out.
# unzip test.zip
# unzip train.zip
!ls train | wc -l
!ls test | wc -l

25000
12500



### Organize data

In [241]:
%%bash
echo "Organizing train data"
mkdir -p train/cats train/dogs
find train/dog* -maxdepth 0 -type f | xargs -I {} mv {} train/dogs
find train/cat* -maxdepth 0 -type f | xargs -I {} mv {} train/cats

echo "Organizing test data"
mkdir -p test/unknown
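# -maxdepth 1 looks at the files directly inside test/ (-maxdepth 0 only tests the directory itself)
find test -maxdepth 1 -type f | xargs -I {} mv {} test/unknown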

echo "Creating validation data from training data"
rm -rf valid
mkdir -p valid/cats valid/dogs
find train/dogs -type f | shuf -n 1000 | xargs -I {} mv {} valid/dogs
find train/cats -type f | shuf -n 1000 | xargs -I {} mv {} valid/cats

echo "Copying sample train, validation, and test data"
rm -rf sample

mkdir -p sample/train/cats sample/train/dogs
find train/dogs -type f | shuf -n 8 | xargs -I {} cp {} sample/train/dogs
find train/cats -type f | shuf -n 8 | xargs -I {} cp {} sample/train/cats

mkdir -p sample/valid/cats sample/valid/dogs
# Draw the sample validation images from valid/ so they cannot overlap with sample/train
find valid/dogs -type f | shuf -n 4 | xargs -I {} cp {} sample/valid/dogs
find valid/cats -type f | shuf -n 4 | xargs -I {} cp {} sample/valid/cats

mkdir -p sample/test/unknown
find test/unknown -type f | shuf -n 8 | xargs -I {} cp {} sample/test/unknown

Organizing train data
Organizing test data
Creating validation data from training data
Copying sample train, validation, and test data


In [263]:
!find train | wc -l
!find valid | wc -l
!find test | wc -l
!find sample | wc -l

23003
2003
12502
33
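
Note that `find` counts directories as well as files, which is why the totals are a bit higher than the image counts (for example, 23003 = 23000 images plus `train`, `train/cats`, and `train/dogs`). A minimal sketch of a files-only count in Python; `count_files` is a hypothetical helper of my own, not part of the course utilities:

```python
import os

def count_files(root):
    """Count regular files under root, ignoring the directories themselves."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total
```

Called as `count_files('train')`, this should line up with the number of images Keras reports when it builds the training batches, assuming the directory holds nothing but images.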


### Load model dependencies

In [264]:
%matplotlib inline

In [265]:
from __future__ import division,print_function

import os, json
from glob import glob
import numpy as np
np.set_printoptions(precision=4, linewidth=100)
from matplotlib import pyplot as plt

There were a bunch of useful Python files in the class repo - copy them to this working directory.

In [22]:
!cp ~/courses/deeplearning1/nbs/*.py .

In [266]:
import utils; reload(utils)
from utils import plots

Toggle this cell to switch between the full dataset and a very small sample, which is good for iterating quickly.

In [267]:
path = "./"
# path = "sample/"

### Build and fit model

The loss function is categorical cross-entropy. The training `loss` is computed with dropout active; `val_loss` is computed with dropout turned off.
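
To make the loss concrete, here is a small NumPy sketch of categorical cross-entropy on a two-image batch. It only illustrates the formula and is not Keras's implementation; the `eps` clipping value and the example probabilities are my own assumptions.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean categorical cross-entropy over a batch.

    y_true: one-hot labels, shape (n_samples, n_classes)
    y_pred: predicted probabilities, same shape, rows summing to 1
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# Made-up example: one cat and one dog, columns are [cat, dog]
y_true = np.array([[1, 0], [0, 1]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
loss = categorical_crossentropy(y_true, y_pred)
print(loss)
```

Only the probability assigned to the true class contributes to each image's loss, so the result here is the average of `-log(0.9)` and `-log(0.8)`.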

In [268]:
batch_size=64

In [269]:
import vgg16; reload(vgg16)
from vgg16 import Vgg16

In [270]:
vgg = Vgg16()
batches = vgg.get_batches(path+'train', batch_size=batch_size)
val_batches = vgg.get_batches(path+'valid', batch_size=batch_size*2)

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.


Keras has some neat callback features, but the `fit` method defined in vgg16.py doesn't let me pass them through to the underlying Keras function. Next week, when we write our own model directly in Keras, this won't be a problem, but this time I figured I'd just monkey-patch the vgg instance.

In [271]:
def fit(self, batches, val_batches, nb_epoch=1, callbacks=None):
    """
        Fits the model on data yielded batch-by-batch by a Python generator.
        See Keras documentation: https://keras.io/models/model/
    """
    self.model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=nb_epoch,
            validation_data=val_batches, nb_val_samples=val_batches.nb_sample, callbacks=callbacks)

# This monkey-patching technique binds `self` properly
# (`vgg.fit = fit` does not - you'd have to pass `self=vgg` to invoke properly).

import types
vgg.fit = types.MethodType(fit, vgg)
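
A standalone sketch of the difference between plain assignment and `types.MethodType`. The `Model` class and `describe` function are toy stand-ins of my own, not anything from vgg16.py:

```python
import types

class Model(object):
    def __init__(self, name):
        self.name = name

def describe(self):
    return "model " + self.name

m = Model("vgg")

# Plain assignment stores an ordinary function on the instance,
# so `self` is NOT passed automatically and the call fails.
m.describe = describe
try:
    m.describe()
    bound_automatically = True
except TypeError:
    bound_automatically = False

# MethodType binds the function to this one instance.
m.describe = types.MethodType(describe, m)
result = m.describe()
print(bound_automatically, result)
```

The patch only affects the one instance; other `Model` objects (or other `Vgg16` objects, in the notebook's case) keep their original methods.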

In [272]:
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Interrupt training when validation loss stops decreasing
early_stopping = EarlyStopping(monitor='val_loss', patience=2)

# Save the weights and architecture of the model after the best epoch (by validation loss)
checkpointer = ModelCheckpoint("best-weights.hdf5", monitor='val_loss', save_best_only=True)
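
To get a feel for what `patience=2` means, here is a simplified simulation of the early-stopping rule in plain Python. This is my own sketch, not Keras's code, and the exact epoch at which Keras stops may differ slightly between versions; `epochs_run` and the loss values are made up for illustration.

```python
def epochs_run(val_losses, patience=2):
    """Return how many epochs run before stopping, given per-epoch validation losses."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

losses = [0.30, 0.25, 0.26, 0.27, 0.20]  # made-up validation losses
stopped_after = epochs_run(losses, patience=2)
print(stopped_after)
```

In this toy run the best loss is at epoch 2, which is the epoch a `save_best_only` checkpoint would keep, even though training carries on for two more epochs before stopping.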

In [None]:
vgg.finetune(batches)
vgg.fit(batches, val_batches, 50, callbacks=[checkpointer, early_stopping])

Epoch 1/50
 3904/23000 [====>.........................] - ETA: 484s - loss: 0.2260 - acc: 0.9326

### Get predictions

In [None]:
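batch_size = 4
# shuffle=False keeps the batch order aligned with batches.filenames
# (assumes get_batches forwards `shuffle` to Keras, as the course's vgg16.py does)
batches = vgg.get_batches(path+'test', shuffle=False, batch_size=batch_size)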

Compute the number of batches we'll iterate through.

Check out that Python value being interpolated into a shell command with `{...}`, and the command's output being saved back into another Python variable! That's awesome!

In [None]:
from math import ceil

command_output = !ls {path + 'test/unknown'} | wc -l
num_test_files = int(command_output[0])
print(num_test_files)
num_batches = int(ceil(num_test_files / batch_size))
print(num_batches)
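
The `ceil` above relies on the earlier `from __future__ import division`; without it, Python 2's `/` would floor before `ceil` ever sees the value. A sketch of the same rounding-up done with integer arithmetic only (`count_batches` is a hypothetical helper name):

```python
def count_batches(num_files, batch_size):
    """Number of batches needed to cover num_files, rounding up."""
    return (num_files + batch_size - 1) // batch_size

print(count_batches(12500, 4))  # the test set size and batch size used here
print(count_batches(10, 4))     # a partial final batch still counts as a batch
```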

Notes:
* `re.match` immediately returns no match if the beginning of the string doesn't match :|. `re.search` will find a match anywhere in the string.
* `batches.filenames` is key to identifying the filename of each prediction.
* As Jeremy points out at the beginning of Lesson 2, being overly confident really hurts your log loss score (https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition#evaluation). Clip 0s and 1s to be 5% less confident.
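
A quick sketch of why the clipping matters, using the per-image log loss formula from the competition's evaluation page. The probabilities are made up and `log_loss_single` is my own helper name:

```python
import math

def log_loss_single(y_true, p_dog):
    """Log loss for one image: y_true is 1 for dog, 0 for cat; p_dog is the predicted dog probability."""
    return -(y_true * math.log(p_dog) + (1 - y_true) * math.log(1 - p_dog))

# A cat (y_true=0) wrongly predicted as a dog:
overconfident = log_loss_single(0, 0.999999)  # nearly certain and wrong
clipped = log_loss_single(0, 0.95)            # same mistake after clipping

# A correct dog prediction only loses a little from clipping:
correct_clipped = log_loss_single(1, 0.95)

print(overconfident, clipped, correct_clipped)
```

The wrong-but-certain prediction costs several times more than the clipped one, while clipping a correct prediction adds only a small penalty.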

In [None]:
import csv
import re


with open('results.csv', 'w') as csvfile:
    fieldnames = ['id', 'label']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    # log_progress is a notebook progress-bar helper, assumed to be defined in an earlier cell
    for batch_index in log_progress(range(0, num_batches)):
        images, labels = next(batches)
        confidences, predictions, predictions_english = vgg.predict(images)
        
        # Being overly confident really hurts your log loss score https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition#evaluation
        predictions = np.clip(predictions, 0.05, 0.95)
        
        # plots(images, titles=predictions)  # display pictures with predictions, as a sanity check.
        
        for i in range(len(predictions)):
            # extract image ID from filename
            index = batch_index * batch_size + i
            filename = re.search(r"\d+", batches.filenames[index]).group(0)
            prediction = predictions[i]
            writer.writerow({'id': filename, 'label': round(prediction, 2)})
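
A tiny sketch of the ID extraction on its own, showing why `re.search` is needed rather than `re.match`. The filename is a hypothetical example of what `batches.filenames` might contain:

```python
import re

filename = "unknown/1234.jpg"  # hypothetical entry, relative to the test directory

match_result = re.match(r"\d+", filename)    # None: the string doesn't start with a digit
search_result = re.search(r"\d+", filename)  # finds the first run of digits anywhere
image_id = search_result.group(0)

print(match_result, image_id)
```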

### Submit results

In [None]:
from datetime import datetime
date = datetime.now().strftime("%H:%M %m/%d/%Y")
!kg submit results.csv -m "Submitted {date}"