## Image Recognition
Steps:

1. Organize the data in an appropriate file structure (shown below).
2. Use transfer learning to train a pretrained model (vgg16) to predict cats or dogs.
3. Generate a submission file for Kaggle.

### 1. Organize the data
Make sure the data is structured in the following format:

![Required directory structure](image_recognition_fs.png "Directory structure")

The `sample` directory is a smaller subset of the original data, created so that the code can be tested quickly.
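
A sample subset like this can be built by copying a few randomly chosen images out of each data directory. The following is a minimal sketch, not the notebook's actual preparation code: the helper name `copy_random_sample` and its parameters are hypothetical, and it assumes every image sits directly inside the source directory.

```python
import os
import random
import shutil


def copy_random_sample(src_dir, dst_dir, n, seed=None):
    """Copy n randomly chosen files from src_dir into dst_dir and return their names.

    Hypothetical helper for illustration; the originals are left in place.
    """
    rng = random.Random(seed)
    # Only consider regular files, in a stable order so a seed gives a repeatable sample.
    files = sorted(
        f for f in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, f))
    )
    chosen = rng.sample(files, min(n, len(files)))
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    for name in chosen:
        shutil.copyfile(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return chosen
```

Assuming one subdirectory per class (the names here are illustrative), a call such as `copy_random_sample(path + 'train/cats/', path + 'sample/train/cats/', 100)` would be repeated for each class and for the validation set.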

In [5]:
# Import required libraries
import shutil
import os, random
import numpy as np
# vgg contains the class to use Vgg16 model
import vgg16; reload(vgg16)
from vgg16 import Vgg16
# utils contains helper methods such as plotting images
import utils; reload(utils)
from utils import plots

Using Theano backend.
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)


In [6]:
# assign data location
path = '/home/ubuntu/courses/data/'
# use the second path instead to test the code on the smaller sample subset
#path = '/home/ubuntu/courses/data/sample/'

### 2. Use transfer learning on a pretrained model (vgg16)

In [7]:
batch_size = 64 # limited by memory capabilities
vgg = Vgg16()
batches = vgg.get_batches(path = path+'train/', batch_size = batch_size)
val_batches = vgg.get_batches(path = path+'valid/', batch_size = batch_size)
# fine-tune the last layer of the vgg16 model so that it outputs 2 class probabilities
# (cat, dog) instead of its default 1000 ImageNet class probabilities
vgg.finetune(batches)

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.


In [9]:
# train the model
vgg.fit(batches, val_batches, nb_epoch=2)

Epoch 1/2
Epoch 2/2


In [10]:
# generate prediction on test images
batches, preds = vgg.test(path+'test/', batch_size = batch_size*2)

Found 12500 images belonging to 1 classes.


### 3. Generate a submission file for Kaggle

In [12]:
# Kaggle requires submissions in a specific format:
# the file must start with the header id,label
# and every following line must contain image_id,probability_of_dog
filenames = batches.filenames
# extract the numeric test image ids from the filenames list
file_ids = [int(f.replace('images/', '').replace('.jpg', '')) for f in filenames]
isDog = preds[:,1]
# clip to avoid a high log loss, since this is the metric Kaggle uses to score submissions
isDog = isDog.clip(min=0.05, max=0.95)
# build a numeric (n, 2) array directly instead of round-tripping the ids through strings
subm = np.column_stack([file_ids, isDog])
submission_file_name = 'submission_1.csv'
np.savetxt(submission_file_name, subm, fmt='%d,%.5f', header='id,label', comments='')
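
The clipping step above can be motivated with a small log-loss calculation. The sketch below is illustrative only: the helper names `clip` and `log_loss` are hypothetical, and the `eps` guard of `1e-15` is an assumption that mirrors how log-loss implementations typically avoid `log(0)`.

```python
import math


def clip(p, low=0.05, high=0.95):
    """Limit a probability to [low, high], as the submission code does."""
    return max(low, min(high, p))


def log_loss(y_true, p, eps=1e-15):
    """Binary log loss for one prediction; y_true is 1 (dog) or 0 (cat)."""
    p = max(eps, min(1 - eps, p))  # assumed guard against log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# Confident but wrong: the image is a cat (0) and the model predicts dog with p = 1.0.
wrong_unclipped = log_loss(0, 1.0)        # about 34.5, bounded only by eps
wrong_clipped = log_loss(0, clip(1.0))    # -ln(0.05), about 3.0

# Confident and correct: clipping costs a little on every such image.
right_unclipped = log_loss(1, 1.0)        # about 0
right_clipped = log_loss(1, clip(1.0))    # -ln(0.95), about 0.05
```

Clipping trades a small, fixed penalty on correct predictions for a much smaller penalty on the rare confident mistakes, which usually lowers the average loss.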

This approach scored in the top 30% of the Kaggle competition.
To improve:

1. Train for more epochs.
2. Try other architectures, such as ResNet or Inception.