# Dogs vs cats 

This notebook serves as the reference script for solving Kaggle's [Dogs vs Cats](https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition) classification problem.

## TODO

1. Get the dataset from Kaggle and structure it the way the VGG16 model expects
2. Train the model (training set) and validate it (validation set)
3. For each of the 12,500 images in the test set, predict the class, either "dog" or "cat"
4. Write the probabilities to a .csv file
5. Submit it to the Kaggle competition

## 1. Creating validation set and sampling

In [92]:
#import modules
import os
from shutil import copyfile

from utils import *
from vgg16 import Vgg16

%matplotlib inline

In [112]:
# Where are we ?
%pwd

data_dir = 'data/dogs-vs-cats-redux-kernels-edition/'
sample_dir = 'data/dogs-vs-cats-redux-kernels-edition/sample/'

test_dir = "data/dogs-vs-cats-redux-kernels-edition/test/"
result_dir = "./"

### Validation set

In [73]:
# Create valid directory and subdirectories
%mkdir -p $data_dir/valid/dogs
%mkdir -p $data_dir/valid/cats

In [87]:
# Move 2000 random images from train to valid (1000 dogs and 1000 cats).
# They land directly in valid/ and are sorted into valid/dogs and valid/cats further down.

g = glob(data_dir + 'train/dogs/*.jpg')
imgs = np.random.permutation(g)
for i in range(1000):
    filename = os.path.basename(imgs[i])
    os.rename(imgs[i], data_dir + 'valid/' + filename)

g = glob(data_dir + 'train/cats/*.jpg')
imgs = np.random.permutation(g)
for i in range(1000):
    filename = os.path.basename(imgs[i])
    os.rename(imgs[i], data_dir + 'valid/' + filename)
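
The two loops above differ only in the class name, so the move can be factored into a helper. Here is a minimal sketch, assuming we want the files to go straight into the class subfolder. The name `move_random_sample` and its `seed` parameter are hypothetical and not part of `utils`. It also refuses to run when fewer files are available than requested, instead of failing halfway through.

```python
import os
import random
import shutil


def move_random_sample(src_dir, dst_dir, n, seed=None):
    """Move n randomly chosen .jpg files from src_dir to dst_dir (hypothetical helper)."""
    files = sorted(f for f in os.listdir(src_dir) if f.endswith('.jpg'))
    if n > len(files):
        raise ValueError('asked for %d files but only %d available' % (n, len(files)))
    # A seeded generator makes the split reproducible between runs
    chosen = random.Random(seed).sample(files, n)
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    for name in chosen:
        shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return chosen
```

With it, the validation split could be written as `move_random_sample(data_dir + 'train/dogs', data_dir + 'valid/dogs', 1000)` and the same call for cats.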

### Sample set

In [57]:
# Create sample directory and subdirectories
%mkdir -p $data_dir/sample/train
%mkdir -p $data_dir/sample/valid

%mkdir -p $data_dir/sample/train/dogs
%mkdir -p $data_dir/sample/train/cats

%mkdir -p $data_dir/sample/valid/dogs
%mkdir -p $data_dir/sample/valid/cats

In [93]:
# Copy 200 images to sample set (100 dogs and 100 cats)
g = glob(data_dir + 'train/dogs/*.jpg')
imgs = np.random.permutation(g)
for i in range(100):
    filename = os.path.basename(imgs[i])
    copyfile(imgs[i], data_dir + 'sample/train/' + filename)

g = glob(data_dir + 'train/cats/*.jpg')
imgs = np.random.permutation(g)
for i in range(100):
    filename = os.path.basename(imgs[i])
    copyfile(imgs[i], data_dir + 'sample/train/' + filename)

# Move 50 images to valid sample set (25 dogs and 25 cats)
# Only 100 images per class were copied above, so the loops must stop at 25, not 1000
g = glob(data_dir + 'sample/train/dog*.jpg')
imgs = np.random.permutation(g)
for i in range(25):
    filename = os.path.basename(imgs[i])
    os.rename(imgs[i], data_dir + 'sample/valid/dogs/' + filename)

g = glob(data_dir + 'sample/train/cat*.jpg')
imgs = np.random.permutation(g)
for i in range(25):
    filename = os.path.basename(imgs[i])
    os.rename(imgs[i], data_dir + 'sample/valid/cats/' + filename)

In [105]:
# Move images to their respective class folders

%mv $data_dir/valid/dog*.jpg $data_dir/valid/dogs/
%mv $data_dir/valid/cat*.jpg $data_dir/valid/cats/

%mv $data_dir/sample/train/dog*.jpg $data_dir/sample/train/dogs/
%mv $data_dir/sample/train/cat*.jpg $data_dir/sample/train/cats/


mv: cannot stat 'data/dogs-vs-cats-redux-kernels-edition//valid/dog*.jpg': No such file or directory
mv: cannot stat 'data/dogs-vs-cats-redux-kernels-edition//valid/cat*.jpg': No such file or directory
mv: cannot stat 'data/dogs-vs-cats-redux-kernels-edition//sample/train/dog*.jpg': No such file or directory
mv: cannot stat 'data/dogs-vs-cats-redux-kernels-edition//sample/train/cat*.jpg': No such file or directory
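
A `cannot stat` message with a literal `*` in the path usually means the glob matched nothing, which is what you would see if the cell is run again after the files have already been moved. Rather than guessing, we can count what is in each class folder. This is a small sketch. The helper name `count_images` is made up for illustration.

```python
import os


def count_images(root):
    """Return {class_name: number of .jpg files} for each subdirectory of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if os.path.isdir(class_dir):
            counts[name] = len([f for f in os.listdir(class_dir) if f.endswith('.jpg')])
    return counts
```

For the split described in this notebook, `count_images(data_dir + 'valid')` should report 1000 images for each of `cats` and `dogs`.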


## 2. Train the model

In [1]:
#import modules

from utils import *
from vgg16 import Vgg16

 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 770 (CNMeM is disabled, cuDNN 5103)
Using Theano backend.


In [10]:
#Instantiate our pre-trained model
vgg = Vgg16()

In [110]:
# Define training constants
batch_size = 16
n_epochs = 2

In [107]:
# Load the data as batches, i.e. groups of images that are processed together.
# Shuffle the training batches so that each one mixes cats and dogs.
batches = get_batches(data_dir+'train/', batch_size=batch_size, shuffle=True)

# Define validation batches as well (their order does not matter)
valid_batches = vgg.get_batches(data_dir+"valid/", batch_size=batch_size, shuffle=False)

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
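
To make the notion of a batch concrete, here is a toy sketch in plain Python. It is not how `get_batches` works internally (that helper presumably wraps a Keras image generator). It only shows how a list of items is cut into groups of `batch_size`, with a shorter last group when the total is not a multiple of the batch size. The name `iter_batches` is hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each holding at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# The 2000 validation images split evenly into batches of 16
n_batches = len(list(iter_batches(list(range(2000)), 16)))
```

The 23000 training images do not divide evenly by 16, so the last training batch holds only 8 images.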


### Fine tuning

In [33]:
# Fine-tune the model
vgg.finetune(batches)

### Fitting

In [111]:
# It may be important to save the model's weights after each epoch
for i in range(n_epochs):

    vgg.fit(batches, valid_batches)

    # Save
    filename_weights = '{}dogs_cats_redux_w{}.h5'.format(result_dir, i + 1)
    vgg.model.save_weights(filename_weights)

    # Log info
    print('Epoch {} done. Weights saved under {}'.format(i + 1, filename_weights))

Epoch 1/1


NameError: name 'result_dir' is not defined
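
Saving one file per epoch pays off when a run is interrupted, because we can pick up from the last saved epoch. Below is a sketch of how the most recent file could be located, assuming the `dogs_cats_redux_w<N>.h5` naming used above. The helper `latest_weights` is hypothetical. It compares epoch numbers as integers so that `w10` ranks after `w2`.

```python
import os
import re


def latest_weights(directory, prefix='dogs_cats_redux_w'):
    """Return (epoch, path) of the highest-numbered weights file, or None if there is none."""
    pattern = re.compile(r'^' + re.escape(prefix) + r'(\d+)\.h5$')
    best = None
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            epoch = int(match.group(1))
            if best is None or epoch > best[0]:
                best = (epoch, os.path.join(directory, name))
    return best
```

The returned path could then be passed to `vgg.model.load_weights(...)` before continuing training.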

## 3. Make predictions

In [113]:
test_batches, preds = vgg.test(test_dir, batch_size=batch_size)

Found 12500 images belonging to 1 classes.


In [114]:
preds[:10]

array([[  1.0000e+00,   1.3986e-09],
       [  1.0000e+00,   0.0000e+00],
       [  1.0000e+00,   7.1887e-43],
       [  9.9974e-01,   2.6366e-04],
       [  8.9913e-02,   9.1009e-01],
       [  1.0000e+00,   0.0000e+00],
       [  1.0000e+00,   0.0000e+00],
       [  7.0919e-06,   9.9999e-01],
       [  1.0000e+00,   3.2951e-14],
       [  1.0000e+00,   0.0000e+00]], dtype=float32)

In [130]:
# Grab the dog prediction column (Kaggle wants the probability that an image is a dog).
# Class folders are indexed alphabetically, so column 0 is "cats" and column 1 is "dogs".
isdog = preds[:,1]

In [118]:
test_batches.filenames[:5]

['unknown/1.jpg',
 'unknown/10.jpg',
 'unknown/100.jpg',
 'unknown/1000.jpg',
 'unknown/10000.jpg']

In [121]:
print("Raw Predictions: " + str(isdog[:5]))
print("Mid Predictions: " + str(isdog[(isdog < .6) & (isdog > .4)]))
print("Edge Predictions: " + str(isdog[(isdog == 1) | (isdog == 0)]))

Raw Predictions: [  1.3986e-09   0.0000e+00   7.1887e-43   2.6366e-04   9.1009e-01]
Mid Predictions: [ 0.5205  0.4852  0.4056  0.4245  0.4385  0.5566  0.4206  0.5939  0.5188  0.4197  0.5089  0.5047
  0.4767  0.5491  0.4884  0.4264  0.5461  0.5655  0.4074  0.587   0.5759  0.5075  0.5979  0.452
  0.5117  0.598   0.5792  0.4722  0.5766  0.5097  0.4302  0.4836  0.4836  0.547   0.4466  0.477
  0.4649  0.432   0.4871  0.4541  0.5845  0.5711  0.4024  0.5439  0.4091  0.5609  0.4618  0.4489
  0.5803  0.4566  0.4851  0.4216  0.4964  0.5072  0.5727  0.5312  0.5549  0.5294  0.4972  0.4088
  0.4562  0.4841  0.502   0.4466  0.5769  0.5855  0.4891  0.5705  0.4378  0.5635  0.4246  0.4317
  0.4096  0.5204  0.5083  0.4744  0.4603  0.453   0.5787  0.4551  0.5868  0.5823  0.4977  0.5864
  0.479   0.5885  0.4875  0.5889  0.5318  0.5818  0.4026  0.5087  0.5736  0.4172  0.4435]
Edge Predictions: [ 0.  0.  0. ...,  0.  0.  0.]


In [1]:
# Limit the probabilities to the range [0.05, 0.95]: log loss punishes confident
# wrong answers very heavily (log-loss dropped from 6.92 to 0.89 !!)
isdog = isdog.clip(min=0.05, max=0.95)

NameError: name 'isdog' is not defined
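
Why does clipping help so much? Log loss takes the logarithm of the probability assigned to the true class, so a prediction of exactly 0 for an image that is actually a dog costs (almost) infinitely much. The sketch below uses made-up labels and predictions, not the competition data, to show the effect. The `log_loss` function and its `eps` value are my own assumptions for keeping the logarithm finite, not necessarily what Kaggle uses.

```python
import numpy as np


def log_loss(y_true, p_dog, eps=1e-15):
    """Mean binary cross-entropy; eps keeps log() finite (assumed value)."""
    y = np.asarray(y_true, dtype=np.float64)
    p = np.clip(np.asarray(p_dog, dtype=np.float64), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


# Hypothetical example: nine confident hits and one confident miss (last image)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
p_raw = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

raw_loss = log_loss(y_true, p_raw)
clipped_loss = log_loss(y_true, p_raw.clip(min=0.05, max=0.95))
```

In this toy case the single confident miss dominates: `raw_loss` comes out around 3.45, while `clipped_loss` is around 0.35, even though clipping makes the nine correct answers slightly worse.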

## 4. Export predictions

In [131]:
# Grab image IDs from filenames ('unknown/123.jpg' -> 123; the 8 skips len('unknown/'))
filenames = test_batches.filenames
ids = np.array([int(f[8:f.find('.')]) for f in filenames])
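
The slice `f[8:...]` relies on the test images living in a folder named exactly `unknown`. If the folder is ever renamed, a variant based on `os.path` keeps working. This is only a sketch, and `image_id` is a hypothetical helper name.

```python
import os


def image_id(path):
    """Turn e.g. 'unknown/1000.jpg' into 1000, whatever the directory is called."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return int(stem)


ids = [image_id(f) for f in ['unknown/1.jpg', 'unknown/10.jpg', 'unknown/100.jpg']]
```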

In [132]:
subm = np.stack([ids,isdog], axis=1)
subm[:5]

array([[  1.0000e+00,   5.0000e-02],
       [  1.0000e+01,   5.0000e-02],
       [  1.0000e+02,   5.0000e-02],
       [  1.0000e+03,   5.0000e-02],
       [  1.0000e+04,   9.1009e-01]])

In [133]:
# Save to csv -- Numpy is *really* powerful
np.savetxt('kaggle_submission.csv', subm, fmt='%d,%.5f', header='id,label', comments='')
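
Before uploading, it is worth reading the file back to check the header and the number formatting. The sketch below does this on three made-up rows written to a temporary directory, using the same `np.savetxt` call as above. The values are illustrative only.

```python
import csv
import os
import tempfile

import numpy as np

# Three illustrative rows (not real predictions)
ids = np.array([1, 10, 100])
isdog = np.array([0.05, 0.05, 0.91009])
subm = np.stack([ids, isdog], axis=1)

# Write to a throwaway location so the real submission file is left alone
path = os.path.join(tempfile.mkdtemp(), 'kaggle_submission.csv')
np.savetxt(path, subm, fmt='%d,%.5f', header='id,label', comments='')

# Read the file back and separate the header from the data rows
with open(path) as f:
    rows = list(csv.reader(f))
header, body = rows[0], rows[1:]
```

For the real file, `len(body)` should equal the number of test images (12500) and `header` should be `['id', 'label']`.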