# The Right Whale
### <a href="https://www.kaggle.com/c/noaa-right-whale-recognition">Kaggle Competition</a>


Walkthrough of the <a href="http://deepsense.io/deep-learning-right-whale-recognition-kaggle/">competition winner's solution</a>.

### The Team

![](deepsense.png)

<a href="http://www.deepsense.io">DeepSense</a> are a deep learning solutions and consultancy company established by former employees of Google, Facebook, and Microsoft.

Their machine learning team have a background in computer science and algorithms, rather than the more traditional fields of mathematics and statistics. They provide some interesting thoughts in this <a href="http://blog.kaggle.com/2016/01/29/noaa-right-whale-recognition-winners-interview-1st-place-deepsense-io/">interview</a>.

### The Competition

![](competition.png)

Given an aerial photo, identify which of the 447 individuals in the dataset it shows.

### North Atlantic Right Whales

![](rightwhale.jpg)

- So called because they were the "right whale" to hunt : **20m, 100t**
- Fewer than **five hundred** left in the wild
- Marine biologists concerned with conservation need to **identify individuals**
- **Callosity** patterns on their heads are calcified skin

### Callosity 

![background](background.png)

# The Competition

- $10,000 prize
- 11,469 images with varying resolutions and quality ( 9.5 GB )
- Resolutions : 887x460 to 7010x4674; aspect ratios : 1.35 to 1.93
- 447 individuals ( 1 - 40 images per individual )
- Dataset contains cropped and rotated duplicates "to discourage hand labelling"; these do not count towards score
- External data not allowed ( does this include pretrained models ? )
- Goal : to provide a probability distribution over whales for each image
- Scoring uses categorical cross-entropy ( multiclass logloss ) : 

$$\mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^N \sum_{j=1}^M y_{ij} \log(p_{ij})$$

where $y_{ij}$ is 1 if the observation $i$ belongs to individual $j$, and 0 otherwise; and $p_{ij}$ is the estimated probability of $i$ belonging to $j$.
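
Because each image shows exactly one whale, the inner sum keeps only the log-probability assigned to the true individual. A minimal sketch of the metric in plain Python follows; the function name and the clipping constant `eps` are illustrative assumptions, not Kaggle's exact implementation.

```python
import math

def multiclass_logloss(y_true, probs, eps=1e-15):
    """Mean categorical cross-entropy.

    y_true: list of true class indices, one per observation.
    probs:  list of rows; probs[i][j] is the predicted probability
            that observation i belongs to class j.
    eps:    clipping constant (an assumption) that avoids log(0).
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # y_ij is 1 only for the true class, so the sum over j
        # collapses to a single term per observation.
        p = min(max(row[label], eps), 1.0 - eps)
        total += math.log(p)
    return -total / len(y_true)

# A uniform guess over all 447 whales, for three made-up observations.
uniform = [[1.0 / 447] * 447 for _ in range(3)]
baseline = multiclass_logloss([0, 5, 446], uniform)
```

A uniform prediction scores log(447), roughly 6.10, which gives a baseline that any useful model has to beat.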

<a href="https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/">This</a> is a great blog post on categorical cross-entropy loss compared with MSE and accuracy loss functions.

### Issues with the Data

<img src="data.png"/>

Images varied enormously in quality : sharpness, focus, colour, exposure and lighting conditions.

*"One does not need to see many images from the dataset in order to realize that whales do not pose very well (or at least were reluctant to do so in this particular case)."*

### Compared with other Image Processing Challenges

- The part of the image relevant to the solution occupies a small part of the image and may be occluded.
- Humans can trivially tell cats from dogs from motorbikes, but even experts have trouble telling individual whales apart.

<img src="w206.jpg" width="600px" />

*Helping out our classifiers to focus on the correct features, i.e. the whale’s heads and their callosity patterns, turned out to be crucial.*

# Solution

### Software

- Python with Theano for training
- Sloth and Julia for labelling

### Hardware

- **NVIDIA Tesla K80**, 24 GB, 4992 cores at 560 MHz
- **NVIDIA GRID K520**, 8 GB, 3072 cores at 800 MHz

### General Solution : the whale "passport photo"

![](passport.png)

### Localising the Head

The goal is to produce a bounding box containing the head.

- Manually annotated all images with coordinates of a bounding box around the head
- Resized images to 256x256
- Trained a classifier over pixel bins rather than a regressor : each bounding box coordinate is quantised into bins and the network predicts the bin
- Augmentation : 10 degree rotation, 0.8 - 1.2 scale factor, colour perturbation from Krizhevsky *et al.* on <a href="http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf">ImageNet</a>
- Five CNNs : 20, 20, 40, 60, 128 bins
- "We combined the outputs from all the 5 networks."

<img src="step1b.png" style="width: 220px;"/>

![](boundingbox.jpg)
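
The bin-based localiser above can be sketched as an encode/decode pair: a pixel coordinate becomes a class label for training, and a softmax output is turned back into a pixel position. The function names are hypothetical, and decoding to the bin centre is an assumption; the authors do not say how they decoded or combined the five networks.

```python
def coord_to_bin(coord, n_bins, size=256):
    """Map a pixel coordinate in [0, size) to a bin index in [0, n_bins)."""
    idx = int(coord * n_bins / size)
    # Clamp so that coordinates on or past the border stay in range.
    return min(max(idx, 0), n_bins - 1)

def bin_to_coord(idx, n_bins, size=256):
    """Return the centre of a bin, in pixels (an assumed decoding rule)."""
    return (idx + 0.5) * size / n_bins

def decode_softmax(probs, size=256):
    """Pick the most probable bin and return its centre in pixels."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return bin_to_coord(best, len(probs), size)

# One bounding box edge at x = 100 px, encoded for the 20-bin network.
label = coord_to_bin(100, 20)
recovered = bin_to_coord(label, 20)
```

With 20 bins over 256 pixels each bin is 12.8 pixels wide, so coarse networks give robust but imprecise boxes while the 128-bin network resolves 2 pixels; that trade-off is one plausible reason for training several bin counts.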

### Aligning the Head

Images should have the tip of the bonnet ( nose ) and the blowhead aligned consistently.

- Manually annotated all images with "bonnet-tip" and "blowhead" points, and with a callosity continuity label
- Takes as input the cropped image output from the previous CNN
- Augmentation : 360 degree rotation, 4 pixel translation, 1 - 1.5 scale factor, random flip, color perturbation
- Result is a 256x256 crop of the original image, aligned on the bonnet-tip and blowhead points, which are taken as the average output over five random augmentations
- CNN also output a binary callosity continuity variable
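
Once the two points are known, the alignment itself is a similarity transform (rotation, uniform scale and translation) that sends the bonnet-tip and blowhead to fixed positions in the crop. The sketch below solves for that transform using complex arithmetic; the target positions and the example annotations are hypothetical, and the authors' own cropping code may differ.

```python
import cmath
import math

def similarity_from_two_points(src_a, src_b, dst_a, dst_b):
    """Return (a, b) such that z -> a*z + b maps src_a to dst_a and src_b to dst_b.

    Points are (x, y) tuples treated as complex numbers x + iy.
    """
    s1, s2 = complex(*src_a), complex(*src_b)
    t1, t2 = complex(*dst_a), complex(*dst_b)
    if s1 == s2:
        raise ValueError("source points must be distinct")
    a = (t2 - t1) / (s2 - s1)   # encodes rotation and scale
    b = t1 - a * s1             # encodes translation
    return a, b

def transform_point(a, b, point):
    """Apply the similarity transform to one (x, y) point."""
    z = a * complex(*point) + b
    return (z.real, z.imag)

# Hypothetical annotations in the original image (pixels) ...
bonnet_tip, blowhead = (300.0, 400.0), (500.0, 400.0)
# ... and hypothetical target positions inside the 256x256 crop.
target_tip, target_blowhead = (128.0, 40.0), (128.0, 200.0)

a, b = similarity_from_two_points(bonnet_tip, blowhead, target_tip, target_blowhead)
scale = abs(a)
angle_deg = math.degrees(cmath.phase(a))
```

For these made-up points the transform is a quarter turn with a scale of 0.8. In practice the same `a` and `b` would be handed to an image library as an affine warp.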

![](aligner.png)

![](step1.png)

![](aligned.png)

### Identifying Whales

The output is a probability distribution that an image $i$ belongs to whale $j$.

- Takes the 256x256 output from the previous CNN as input
- Visually checked on the test dataset that the head localisation and alignment networks performed well
- Augmentation : 8 degrees rotation, 4 pixel translation, 1 - 1.3 scale factor, random flip, color perturbation
- Pooling layers were 3x3 with 2-stride ( scale reduction 0.5x )
- All convolutional layers were followed by batch normalisation
- Nonlinearities were ReLU
- Authors mention a 512x512 and bifurcating net "in the final blend"
- CNN also outputs binary callosity continuity variable
- Results are averaged over 20+ augmentations
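
Averaging over augmentations at prediction time can be sketched as below. `predict` and `augment` are hypothetical callables standing in for the trained network and the augmentation pipeline; the toy versions exist only so that the example runs.

```python
import random

def predict_with_tta(predict, augment, image, n_aug=20, seed=0):
    """Average class probabilities over n_aug randomly augmented copies."""
    rng = random.Random(seed)
    rows = [predict(augment(image, rng)) for _ in range(n_aug)]
    n_classes = len(rows[0])
    # Element-wise mean keeps the result a valid probability distribution.
    return [sum(row[j] for row in rows) / n_aug for j in range(n_classes)]

# Stand-ins: a "model" that ignores its input, and an "augmentation"
# that randomly mirrors a 1-D list of pixels.
def toy_predict(image):
    return [0.2, 0.8]

def toy_augment(image, rng):
    return image[::-1] if rng.random() < 0.5 else image

averaged = predict_with_tta(toy_predict, toy_augment, [1, 2, 3])
```

Since the logloss metric punishes confident mistakes heavily, smoothing predictions over many views is a cheap way to reduce variance.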

<img src="finalnet.png" width="95px">

# Additional Points

- Initialisation : zero-mean, 0.01-std for conv layers; zero-mean, 0.001-std for dense layers
- L2 regularisation : 0.0005 for conv layers, 0.01 - 0.05 for dense layers
- SGD with 0.9 momentum, 500 - 1000 epochs
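
To make these settings concrete, here is a minimal sketch of Gaussian initialisation and one SGD-with-momentum step with an L2 penalty, on plain Python lists. The learning rate is a placeholder (the walkthrough does not state one), and writing the penalty as (l2/2)·w², so that its gradient is l2·w, is an assumed convention.

```python
import random

def init_weights(n, std, seed=0):
    """Zero-mean Gaussian initialisation, e.g. std=0.01 for conv layers."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9, l2=0.0005):
    """One update of classical momentum SGD with L2 weight decay.

    lr is a hypothetical value; momentum and l2 follow the list above.
    """
    new_w, new_v = [], []
    for wi, gi, vi in zip(w, grad, velocity):
        g = gi + l2 * wi               # data gradient plus L2 gradient
        v = momentum * vi - lr * g     # accumulate velocity
        new_v.append(v)
        new_w.append(wi + v)           # move along the velocity
    return new_w, new_v

conv_w = init_weights(4, std=0.01)
velocity = [0.0] * len(conv_w)
conv_w, velocity = sgd_momentum_step(conv_w, [0.1, -0.2, 0.0, 0.3], velocity)
```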


### Fitting

The additional callosity continuity target variable helped focus the net on the head and not elsewhere; this helped counter overfitting.

The learning rate decayed slowly, and was "kicked" at points during training. The best model was trained using Nesterov momentum for 100 epochs and then Adam.
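
One way to picture such a schedule is an exponential decay with occasional sudden increases. The sketch below assumes that a "kick" means multiplying the rate by a constant at fixed intervals; every number in it is a placeholder, since the walkthrough gives neither the decay rate nor when the kicks happened.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.995, kick_every=100, kick_factor=3.0):
    """Slowly decaying learning rate with periodic upward kicks.

    All defaults are hypothetical values chosen for illustration.
    """
    lr = base_lr
    for e in range(1, epoch + 1):
        lr *= decay                  # slow exponential decay each epoch
        if e % kick_every == 0:
            lr *= kick_factor        # the assumed "kick": a sudden increase
    return lr

schedule = [learning_rate(e) for e in range(0, 301, 50)]
```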

<img src="train.png" width="400px" />

### Validation

- 90/10 split from the training dataset
- Fixed random seed ( their favourite is 7300 )
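
A fixed seed makes the split reproducible, so every model is validated on the same held-out images. A minimal sketch using only the standard library (the function name is hypothetical, and the authors may well have split differently, for example per whale):

```python
import random

def train_valid_split(items, valid_fraction=0.1, seed=7300):
    """Shuffle deterministically, then hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_valid = int(round(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

# Image ids are stand-ins for real file names.
train_ids, valid_ids = train_valid_split(range(100))
```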

After settling on a model, retrained all models using the full dataset :

- Full retrain ( rarely )
- 50 - 100 epochs using additional data ( often )

### Results Hacks

- Due to time constraints, no "complex ensemble methods" were used
- The best model was better than an average of all models, until the predicted probabilities were raised to the power 1.45
- "Small epsilon" added to all probabilities
- "Slight skew" according to whale distribution
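
The power and epsilon adjustments listed above amount to a few lines of post-processing. In this sketch the value of `eps` and the order of operations (power, then epsilon, then renormalisation) are assumptions; the authors only describe them loosely.

```python
def sharpen(probs, power=1.45, eps=1e-4):
    """Raise probabilities to a power, add a small floor, renormalise.

    power > 1 makes the distribution more confident; eps (a hypothetical
    value) stops any class from receiving exactly zero probability.
    """
    raised = [p ** power + eps for p in probs]
    total = sum(raised)
    return [r / total for r in raised]

calibrated = sharpen([0.6, 0.3, 0.1, 0.0])
```

The floor matters because logloss is unbounded: a single true class predicted at probability zero would otherwise dominate the score.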

### Image Speed Hacks

JPG encoding turned out to be very expensive :

- Reading 111 images took 420ms
- Reading 111 images and decoding to numpy array took ~10s
- Solutions offered : decoding on GPU, or use other image formats

Images were loaded to the GPU in parallel to allow efficient GPU use.
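
The walkthrough does not say how the parallel loading was implemented. One common pattern, shown here as an assumption rather than the authors' method, is a background thread that fills a bounded queue with ready batches while the training loop consumes them.

```python
import queue
import threading

def prefetch(batches, buffer_size=4):
    """Yield items from `batches`, producing them in a background thread."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()   # marks the end of the stream

    def worker():
        for item in batches:
            q.put(item)   # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# Stand-in for an iterator that reads and decodes image batches.
loaded = list(prefetch(iter(range(5))))
```

A thread only hides decoding time if the decoder releases Python's global interpreter lock; otherwise separate processes are the usual alternative.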

### Final Words from the Authors

- Good quality crops generated in two steps ( head localisation, then alignment ) performed vastly better than when done in a single network
- Manual annotation of callosity continuity was thought to be very important
- Kicking the learning rate allowed for better training
- "Calibrating" the probabilities by a power between 1.1 and 1.6 improved logloss by ~0.1

### Word from me

- There's a great deal of very solid work here, and a number of lessons to be learnt
- There's also something close to cheating, but that's also the fault of the competition for asking for probabilities and not class labels

![](atlantic2.png)