# Dogs vs Cats

This is a CNN-based model that distinguishes between images of dogs and cats, applied to the dataset of the "Dogs vs Cats" Kaggle competition. It uses the FastAI library and transfer learning from a ResNet34 model. The model is trained using stochastic gradient descent with restarts, and different learning rates are used for different layers.

### Initial setup

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

sz is the size (in pixels) to which the images are resized so that training runs quickly.

In [None]:
PATH = "data/dogscats/"
sz=224

In [None]:
torch.cuda.is_available()

In [None]:
torch.backends.cudnn.enabled

The following two sets of commands should be executed on a Crestle setup. They create symlinks to the data files.

In [None]:
 os.makedirs('data/dogscats/models', exist_ok=True)

 !ln -s /datasets/fast.ai/dogscats/train {PATH}
 !ln -s /datasets/fast.ai/dogscats/test {PATH}
 !ln -s /datasets/fast.ai/dogscats/valid {PATH}


In [None]:
 os.makedirs('/cache/tmp', exist_ok=True)
 !ln -fs /cache/tmp {PATH}

In [None]:
os.listdir(PATH)

In [None]:
os.listdir(f'{PATH}valid')

In [None]:
files = os.listdir(f'{PATH}valid/cats')[:5]
files

The model will assume that the images are kept in train and valid directories with each directory having subfolders for each class (i.e. dogs and cats).
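
The expected layout can be checked before training. The sketch below is only an illustration: missing_class_dirs is a hypothetical helper (not part of FastAI), and the folder names are assumptions taken from the description above. It builds a temporary directory tree and reports which split/class folders are absent.

```python
import tempfile
from pathlib import Path

def missing_class_dirs(root, splits=("train", "valid"), classes=("cats", "dogs")):
    """Return the expected <split>/<class> folders that do not exist under root."""
    root = Path(root)
    return [f"{split}/{cls}" for split in splits for cls in classes
            if not (root / split / cls).is_dir()]

# Build a throwaway layout to demonstrate the check (valid/dogs is left out on purpose).
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("train/cats", "train/dogs", "valid/cats"):
        (Path(tmp) / sub).mkdir(parents=True)
    missing = missing_class_dirs(tmp)

print(missing)  # ['valid/dogs']
```

On the real data, calling missing_class_dirs(PATH) should return an empty list if the folders are laid out as described.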

### First build

We use a pre-trained model (resnet34) that was trained on ImageNet. Uncomment the next command to reset the precomputed activations.

In [None]:
# shutil.rmtree(f'{PATH}tmp', ignore_errors=True)

- arch is the architecture of the CNN, in this case resnet34.
- ImageClassifierData.from_paths reads data from the provided path and creates a dataset ready for training.
- tfms_from_model takes care of resizing, image cropping, and initial normalization (creating data with (mean, stdev) of (0, 1)).
- ConvLearner.pretrained builds a learner that contains a pre-trained model. The last layer of the model is replaced with a layer of the right dimensions (2 in this case).
- In learn.fit, the first parameter is the learning rate and the second is the number of epochs, i.e. how many times the model will scan each of the images.

In [None]:
arch=resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 2)

In the output above, the first column is the epoch number (starting from 0). It is followed by the training set loss, the validation set loss, and the accuracy.

In [None]:
data.classes

The predictions are log probabilities, one column per class.

In [None]:
log_preds = learn.predict()
log_preds.shape

Converting from log probabilities to a predicted class (0 or 1) and to the probability of class 1 (dog):

In [None]:
preds = np.argmax(log_preds, axis=1)  
probs = np.exp(log_preds[:,1])    
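
The same conversion is shown below on a few made-up values. This is a sketch using only the standard library. The numbers are hypothetical, and the toy_ names are chosen so they do not overwrite the notebook's own preds and probs.

```python
import math

# Hypothetical log-probabilities for three images; columns are (cat, dog).
toy_log_preds = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
    [math.log(0.6), math.log(0.4)],
]

# Predicted class = index of the largest log-probability (0 = cat, 1 = dog).
toy_preds = [row.index(max(row)) for row in toy_log_preds]

# Probability of "dog" = exp of the log-probability in column 1.
toy_probs = [math.exp(row[1]) for row in toy_log_preds]

print(toy_preds)                          # [0, 1, 0]
print([round(p, 2) for p in toy_probs])   # [0.1, 0.8, 0.4]
```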

In [None]:
def rand_by_mask(mask): return np.random.choice(np.where(mask)[0], 4, replace=False)
def rand_by_correct(is_correct): return rand_by_mask((preds == data.val_y)==is_correct)

In [None]:
def plot_val_with_title(idxs, title):
    imgs = np.stack([data.val_ds[x][0] for x in idxs])
    title_probs = [probs[x] for x in idxs]
    print(title)
    return plots(data.val_ds.denorm(imgs), rows=1, titles=title_probs)

In [None]:
def plots(ims, figsize=(12,6), rows=1, titles=None):
    f = plt.figure(figsize=figsize)
    for i in range(len(ims)):
        sp = f.add_subplot(rows, len(ims)//rows, i+1)
        sp.axis('Off')
        if titles is not None: sp.set_title(titles[i], fontsize=16)
        plt.imshow(ims[i])

In [None]:
def load_img_id(ds, idx): return np.array(PIL.Image.open(PATH+ds.fnames[idx]))

def plot_val_with_title(idxs, title):
    imgs = [load_img_id(data.val_ds,x) for x in idxs]
    title_probs = [probs[x] for x in idxs]
    print(title)
    return plots(imgs, rows=1, titles=title_probs, figsize=(16,8))

A few incorrectly classified images, chosen at random:

In [None]:
plot_val_with_title(rand_by_correct(False), "Incorrectly classified")

In [None]:
def most_by_mask(mask, mult):
    idxs = np.where(mask)[0]
    return idxs[np.argsort(mult * probs[idxs])[:4]]

def most_by_correct(y, is_correct): 
    mult = -1 if (y==1)==is_correct else 1
    return most_by_mask(((preds == data.val_y)==is_correct) & (data.val_y == y), mult)

In [None]:
plot_val_with_title(most_by_correct(0, True), "Most correct cats")

In [None]:
plot_val_with_title(most_by_correct(0, False), "Most incorrect cats")

In [None]:
plot_val_with_title(most_by_correct(1, False), "Most incorrect dogs")

In [None]:
most_uncertain = np.argsort(np.abs(probs -0.5))[:4]
plot_val_with_title(most_uncertain, "Most uncertain predictions")
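
The selection of uncertain predictions can be illustrated on toy data: the closer a probability is to 0.5, the less certain the model is. The values below are hypothetical and the sketch uses plain Python rather than NumPy.

```python
# Hypothetical dog probabilities for six validation images.
toy_probs = [0.98, 0.52, 0.03, 0.45, 0.70, 0.49]

# Sort image indices by distance from 0.5; the closest ones are the least certain.
by_uncertainty = sorted(range(len(toy_probs)), key=lambda i: abs(toy_probs[i] - 0.5))
toy_most_uncertain = by_uncertainty[:4]

print(toy_most_uncertain)  # [5, 1, 3, 4]
```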

#### Optimizing the Learning rate

learn.lr_find() helps find an optimal learning rate. It keeps increasing the learning rate from a very small value until the loss stops decreasing. First create a new learner, since we want to know how to set the learning rate for a new (untrained) model.

In [None]:
learn = ConvLearner.pretrained(arch, data, precompute=True)

In [None]:
lrf=learn.lr_find()
learn.sched.plot_lr()

Here an iteration is one minibatch step of SGD, and one epoch consists of num_train_samples/batch_size iterations. We choose lr=1e-2 (0.01), i.e. the highest rate at which the loss is still decreasing. If the training loss is lower than the validation loss, that suggests overfitting.
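
The arithmetic behind these terms can be sketched as follows. Both helpers are hypothetical illustrations, not FastAI functions, and the geometric growth of the learning rate is an assumption about how such a sweep is typically run.

```python
import math

def iterations_per_epoch(num_train_samples, batch_size):
    """Number of minibatches needed to see every training image once."""
    return math.ceil(num_train_samples / batch_size)

def lr_sweep(start_lr, end_lr, num_iterations):
    """Learning rates growing geometrically from start_lr to end_lr."""
    ratio = (end_lr / start_lr) ** (1 / (num_iterations - 1))
    return [start_lr * ratio ** i for i in range(num_iterations)]

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
print(iterations_per_epoch(1000, 64))   # 16
sweep = lr_sweep(1e-5, 10, 7)           # one learning rate per iteration
print([f"{lr:.0e}" for lr in sweep])
```

Plotting the loss against such a sweep is what makes it possible to read off the highest rate at which the loss still falls.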

### Data Augmentation

To prevent overfitting, we randomly change the images in ways that should not affect their interpretation, such as horizontal flipping, zooming, and rotating. This effectively creates more data. We do this by passing aug_tfms (augmentation transforms) to tfms_from_model, with a list of functions that randomly change the image. For photos taken from the side (where horizontal flipping is appropriate) use transforms_side_on; for photos taken from above (such as satellite imagery) use transforms_top_down. The max_zoom parameter additionally enables random zooming of images up to the specified scale.

In [None]:
tfms = tfms_from_model(resnet34, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
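
To make the idea of a label-preserving random transform concrete, here is a minimal sketch of a random horizontal flip on a toy image stored as nested lists. random_hflip is a hypothetical helper written for illustration. It is not how FastAI implements transforms_side_on.

```python
import random

def random_hflip(image, rng, p=0.5):
    """Mirror the image left-to-right with probability p; otherwise return a copy."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]

# A toy 2x3 "image" stored as rows of pixel values.
toy_image = [[1, 2, 3],
             [4, 5, 6]]

flipped = random_hflip(toy_image, random.Random(0), p=1.0)  # p=1.0 forces the flip
print(flipped)  # [[3, 2, 1], [6, 5, 4]]
```

A cat mirrored left-to-right is still a cat, which is why this kind of change adds variety without changing the label.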

In [None]:
def get_augs():
    data = ImageClassifierData.from_paths(PATH, bs=2, tfms=tfms, num_workers=1) #bs is for batch size
    x,_ = next(iter(data.aug_dl))
    return data.trn_ds.denorm(x)[1]

In [None]:
ims = np.stack([get_augs() for i in range(6)])

In [None]:
plots(ims, rows=2)

Creating a new data object that includes this augmentation in the transforms.

In [None]:
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)

In [None]:
learn.fit(1e-2, 1)

By default, when we create a learner, all but the last layer are frozen. The precompute flag means that we use precomputed activations for all but the last layer and only update the weights of the last layer when we call fit. With data augmentation, precomputed activations cannot be used, because the augmented images are new each time.

In [None]:
learn.precompute=False

We use a technique called stochastic gradient descent with restarts (SGDR), a variant of learning rate annealing, which gradually decreases the learning rate as training progresses. From time to time we increase the learning rate again (the "restarts"), which forces the model to jump to a different part of the weight space if the current area is "spiky". The motivation is that a solution in a flat part of the loss landscape is more stable (more resilient to data perturbations) and generalizes better. The number of epochs between resets of the learning rate is set by cycle_len, and the number of times this happens is referred to as the number of cycles, which is the second parameter to fit(). So cycle_len=1 means the learning rate is reset after every epoch here.

In [None]:
learn.fit(1e-2, 3, cycle_len=1)

In [None]:
learn.sched.plot_lr()
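
The shape of such a schedule can be sketched in a few lines. This is an illustration only: sgdr_schedule is a hypothetical function, and cosine decay within each cycle is an assumption. It is not claimed to reproduce FastAI's scheduler exactly.

```python
import math

def sgdr_schedule(max_lr, n_cycles, cycle_len, iters_per_epoch, cycle_mult=1):
    """Per-iteration learning rates: cosine decay from max_lr, reset at each cycle start."""
    lrs = []
    epochs_in_cycle = cycle_len
    for _ in range(n_cycles):
        n_iters = epochs_in_cycle * iters_per_epoch
        for i in range(n_iters):
            # Cosine factor goes from 1 at the start of the cycle towards 0 at the end.
            lrs.append(max_lr * 0.5 * (1 + math.cos(math.pi * i / n_iters)))
        epochs_in_cycle *= cycle_mult
    return lrs

# Hypothetical setting: 3 cycles of 1 epoch each, 4 iterations per epoch.
toy_lrs = sgdr_schedule(1e-2, n_cycles=3, cycle_len=1, iters_per_epoch=4)
print([round(lr, 4) for lr in toy_lrs])
```

Every fourth value jumps back to the maximum rate, which is the restart.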

Since the validation loss is no longer improving much, there is little to gain from training the final layer on its own any further. We save the model at this point so that it can be reloaded.

In [None]:
learn.save('224_lastlayer')

In [None]:
learn.load('224_lastlayer')

### Fine-tuning and differential learning rate annealing

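Once the final layer has been trained, the other pre-trained layers are fine-tuned. They are frozen by default, so they must be unfrozen first. We use differential learning rates because earlier layers need less fine-tuning. Rule of thumb: use the final layer's rate for the last layer group, then divide by 10 for each earlier group.

In [None]:
learn.unfreeze()  # make the earlier (pre-trained) layers trainable
lr=np.array([1e-4,1e-3,1e-2])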

In [None]:
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
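
The divide-by-10 rule for layer groups can be written as a small helper. differential_lrs is a hypothetical name used for illustration, and the default of three groups is an assumption that mirrors the three-element array used in this notebook.

```python
def differential_lrs(final_lr, n_groups=3, factor=10):
    """Learning rates per layer group, earliest group first, each `factor` times smaller."""
    return [final_lr / factor ** (n_groups - 1 - g) for g in range(n_groups)]

# Earliest layers get the smallest rate; the last group keeps final_lr.
toy_group_lrs = differential_lrs(1e-2)
print(toy_group_lrs)
```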

In [None]:
learn.sched.plot_lr()

Setting cycle_mult=2 doubles the length of each cycle relative to the previous one.
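
As a quick check of the arithmetic, the sketch below lists the length of each cycle for the settings used above (3 cycles, cycle_len=1, cycle_mult=2). epochs_per_cycle is a hypothetical helper, not a FastAI function.

```python
def epochs_per_cycle(n_cycles, cycle_len, cycle_mult=1):
    """Length in epochs of each cycle when every cycle is cycle_mult times the previous."""
    return [cycle_len * cycle_mult ** c for c in range(n_cycles)]

toy_cycles = epochs_per_cycle(n_cycles=3, cycle_len=1, cycle_mult=2)
print(toy_cycles, sum(toy_cycles))  # [1, 2, 4] 7
```

So three cycles with doubling amount to seven epochs of training in total.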

In [None]:
learn.save('224_all')

In [None]:
learn.load('224_all')

Test time augmentation (TTA) makes predictions on a number of randomly augmented versions of each image along with the original image, and averages them. Use the learner's TTA method.

In [None]:
log_preds,y = learn.TTA()
probs = np.mean(np.exp(log_preds),0)

In [None]:
accuracy_np(probs, y)
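
The averaging step can be illustrated on toy numbers: the log-probabilities are exponentiated first and the resulting probabilities are averaged over the augmentation axis. The values and the toy_ names below are hypothetical.

```python
import math

# Hypothetical log-probabilities with shape (n_augmentations, n_images, n_classes).
toy_tta_log_preds = [
    [[math.log(0.7), math.log(0.3)], [math.log(0.4), math.log(0.6)]],  # original image
    [[math.log(0.9), math.log(0.1)], [math.log(0.2), math.log(0.8)]],  # augmented copy
]

n_aug = len(toy_tta_log_preds)
n_images = len(toy_tta_log_preds[0])
n_classes = len(toy_tta_log_preds[0][0])

# Average the probabilities (not the log-probabilities) over the augmentation axis.
toy_tta_probs = [
    [sum(math.exp(toy_tta_log_preds[a][i][c]) for a in range(n_aug)) / n_aug
     for c in range(n_classes)]
    for i in range(n_images)
]
print([[round(p, 2) for p in row] for row in toy_tta_probs])  # [[0.8, 0.2], [0.3, 0.7]]
```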

Compared with the validation accuracy obtained without TTA, this typically amounts to a 10 to 20% reduction in error on this dataset.

### Confusion Matrix

In [None]:
preds = np.argmax(probs, axis=1)
probs = probs[:,1]

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y, preds)

In [None]:
plot_confusion_matrix(cm, data.classes)
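
For reference, the counts in a confusion matrix can be reproduced by hand. The sketch below uses the convention that rows are true classes and columns are predicted classes. confusion_counts and the labels are hypothetical and only meant to show how the cells are filled.

```python
def confusion_counts(y_true, y_pred, n_classes=2):
    """cm[i][j] = number of samples whose true class is i and predicted class is j."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Hypothetical labels (0 = cat, 1 = dog).
toy_y      = [0, 0, 0, 1, 1, 1, 1]
toy_y_pred = [0, 0, 1, 1, 1, 0, 1]
toy_cm = confusion_counts(toy_y, toy_y_pred)
print(toy_cm)  # [[2, 1], [1, 3]]
```

The off-diagonal cells are the errors: here one cat was predicted as a dog and one dog as a cat.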

In [None]:
plot_val_with_title(most_by_correct(0, False), "Most incorrect cats")

In [None]:
plot_val_with_title(most_by_correct(1, False), "Most incorrect dogs")