# Performance on the Kaggle Grasp Lift set

In short:
We reach 0.97 AUC; the best reported single model in the competition reached 0.976 AUC.

Our best model uses (leaky) ReLU, max pooling, fairly long filters (length 30), and 1x1 convs before the normal convs in later layers (with identity activation in between).

I will show performances of different models and parameters:

## 5-Layer Net

### Basic architecture

Convolutions with a filter length of 30 in time, each followed by pooling. "5 layers" refers to the number of conv+pool blocks.

All tensors have the shape batch x channel x time x 1. (But note: initially the 32 input sensors/channels sit in the last dimension, the one written as **1** here, not in the channel dimension!)

|Layer #|Layer|Filter Size|Filter Stride|Nonlinearity / Pooling Mode|
|-|-|-|-|-|
|1|Conv|30x1|1x1|Identity|
|2|Conv|1x32|1x1|Leaky Relu|
|3|Pool|3x1|3x1|Max|
|4|Conv|30x1|1x1|Leaky Relu|
|5|Pool|3x1|3x1|Max|
|6|Conv|30x1|1x1|Leaky Relu|
|7|Pool|3x1|3x1|Max|
|8|Conv|30x1|1x1|Leaky Relu|
|9|Pool|3x1|3x1|Max|
|10|Conv|30x1|1x1|Leaky Relu|
|11|Pool|3x1|3x1|Max|
|12|Conv|1x1|1x1|Softmax|

(all layers have 40 filters)

The last layer can also be seen as a fully connected layer: its 1x1 filter size is determined by the input length, since the input of 3752 samples has already been reduced to length 1 in the time dimension at that layer.
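To check the claim that 3752 input samples shrink to length 1, here is a minimal sketch that traces the time dimension through the five conv+pool blocks. It assumes unpadded ("valid") convolutions and non-overlapping pooling, which the notes do not state explicitly but which is consistent with the numbers; the function names are made up for illustration.

```python
def conv_out_len(n, kernel, stride=1):
    """Length after an unpadded ('valid') convolution (assumed here)."""
    return (n - kernel) // stride + 1


def pool_out_len(n, size, stride):
    """Length after pooling without padding."""
    return (n - size) // stride + 1


def trace_time_lengths(n_samples=3752, n_blocks=5, filter_len=30, pool_len=3):
    """Return the time length after every conv and every pool."""
    lengths = [n_samples]
    n = n_samples
    for _ in range(n_blocks):
        n = conv_out_len(n, filter_len)          # 30x1 conv over time
        lengths.append(n)
        n = pool_out_len(n, pool_len, pool_len)  # 3x1 max pool, stride 3
        lengths.append(n)
    return lengths


lengths = trace_time_lengths()
print(lengths)
# [3752, 3723, 1241, 1212, 404, 375, 125, 96, 32, 3, 1]
```

The 1x32 spatial conv in layer 2 only collapses the sensor dimension, so it is left out of the trace.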



### Different activations and poolings

These tests were run before the substantial improvements gained by training on all subjects:

Leaky ReLU with max pooling was always better than squaring + mean pooling + log.

|Activation + Pooling|Test AUC (%)|
|-|-|
|Leaky Relu + Max|92.7|
|Square + Mean + Log first layer|89.7|
|Square + Mean + Log later layers|79.1|

Square + max pooling was not tested.
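For reference, the two compared nonlinearity/pooling variants can be sketched on a 1-D signal as follows. This is only an illustration: the leaky slope, the `eps` inside the log, and the order square, then mean pooling, then log (taken from the table labels) are my assumptions, and the helper names are hypothetical.

```python
import math


def leaky_relu(x, slope=0.01):
    """Leaky ReLU; the slope value is an assumption, not from the notes."""
    return [v if v > 0 else slope * v for v in x]


def pool(x, size, reduce_fn):
    """Non-overlapping pooling (length == stride) along a 1-D sequence."""
    return [reduce_fn(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def mean(window):
    return sum(window) / len(window)


def leaky_relu_max(x, pool_len=3):
    """Variant 1: leaky ReLU followed by max pooling."""
    return pool(leaky_relu(x), pool_len, max)


def square_mean_log(x, pool_len=3, eps=1e-6):
    """Variant 2: square, mean pooling, then log (eps avoids log(0))."""
    squared = [v * v for v in x]
    pooled = pool(squared, pool_len, mean)
    return [math.log(p + eps) for p in pooled]


signal = [1.0, -2.0, 3.0, -1.0, 0.5, 0.0]
print(leaky_relu_max(signal))
print(square_mean_log(signal))
```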

## Batch normalization, filter lengths, more filters/layers

When training on all subjects, we get better results. Here is an overview:

The base version is a five-layer net whose first layer is our two-step "separated" convolution: (#time_length x 1) followed by (1 x #channels). Later layers use normal convolutions. The filter length is 30(!) in all layers. Max pooling follows every layer, with length = stride = 3 (length = stride = 2 for the 6-layer and 7-layer nets).

|Comment|Test AUC (%)|Kaggle AUC (%)|
|-|-|-|
|Base version|95.23|96.1|
|Batch normalization|96.22|96.35|
|8,12,20 filter time lengths|<95.5||
|80 filters | 96.34|96.7|
|6 Layer (40 filters)|96.36|-|
|7 Layer (40 filters)| 96.02|96.15|
|pre-1x1-conv (80 filters)|96.45|96.6|
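As a side note on the "separated" first layer of the base version: because the activation between the time conv and the spatial conv is the identity, the two steps compose into one linear time x sensor filter (a sum of outer products of time and sensor weights). A toy sketch with made-up sizes (2 intermediate filters, 3 taps, 2 sensors instead of 40, 30 and 32) and hypothetical function names:

```python
def temporal_conv(x, wt):
    """x: [time][sensor]; wt: [filter][tap]. Returns h[filter][time][sensor]."""
    n_taps = len(wt[0])
    n_out = len(x) - n_taps + 1
    n_sensors = len(x[0])
    return [[[sum(w[i] * x[t + i][c] for i in range(n_taps))
              for c in range(n_sensors)]
             for t in range(n_out)]
            for w in wt]


def spatial_conv(h, ws):
    """h: [filter][time][sensor]; ws: [filter][sensor]. Returns y[time]."""
    return [sum(ws[k][c] * h[k][t][c]
                for k in range(len(h)) for c in range(len(ws[0])))
            for t in range(len(h[0]))]


def combined_kernel(wt, ws):
    """Sum over filters of outer products: kernel[tap][sensor]."""
    return [[sum(wt[k][i] * ws[k][c] for k in range(len(wt)))
             for c in range(len(ws[0]))]
            for i in range(len(wt[0]))]


def full_conv(x, kernel):
    """Single unpadded conv with a [tap][sensor] kernel. Returns y[time]."""
    n_taps, n_sensors = len(kernel), len(kernel[0])
    return [sum(kernel[i][c] * x[t + i][c]
                for i in range(n_taps) for c in range(n_sensors))
            for t in range(len(x) - n_taps + 1)]


x = [[1, 2], [0, -1], [3, 1], [2, 0], [-1, 4]]   # 5 time steps, 2 sensors
wt = [[1, 0, -1], [2, 1, 0]]                      # 2 time filters, 3 taps
ws = [[1, -1], [0, 2]]                            # 1 spatial unit over both
separated = spatial_conv(temporal_conv(x, wt), ws)
direct = full_conv(x, combined_kernel(wt, ws))
print(separated == direct)
```

The sketch only shows that both forms compute the same function; it says nothing about why the separated parameterization trains better.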

All rows below the batch normalization row also use batch normalization. My last test used a different preprocessing: only remove the baseline mean of every channel, computed over the period before the time window (previously I had also standardized the data to unit variance). This is quite similar to what the second-place team did. With a 1x1 conv before each conv and 80 filters, this led to:

|Comment|Test AUC (%)|Kaggle AUC (%)|
|-|-|-|
|Remove baseline mean|96.53|97.0|
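The baseline-mean preprocessing can be sketched roughly like this. It is a minimal illustration, not the actual pipeline: I assume here that all samples before the window start form the baseline, and the function name and the `window_start` parameter are hypothetical.

```python
def remove_baseline_mean(trial, window_start):
    """Subtract each channel's pre-window mean from the window.

    trial: [time][channel]; samples before window_start are the baseline
    (assumption: the whole pre-window period is used).
    """
    if window_start <= 0:
        raise ValueError("need at least one baseline sample before the window")
    n_channels = len(trial[0])
    baseline = trial[:window_start]
    means = [sum(row[c] for row in baseline) / len(baseline)
             for c in range(n_channels)]
    # No scaling to unit variance: only the mean is removed.
    return [[row[c] - means[c] for c in range(n_channels)]
            for row in trial[window_start:]]


trial = [[10.0, -4.0], [12.0, -6.0], [11.0, -5.0], [15.0, -1.0]]
window = remove_baseline_mean(trial, window_start=2)
print(window)
# [[0.0, 0.0], [4.0, 4.0]]
```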

## Conclusion

Fairly standard architectures can lead to quite good performance, and batch normalization seems to help.

It seems quite probable that pre-1x1-conv is not necessary. For example, 6 layers with 80 filters and baseline-mean removal might also lead to ~97 Kaggle AUC.
There are some fairly obvious things to test that I am less interested in, since I want to focus on visualizations:

* higher number of filters in earlier layers, smaller in later layers
* longer filters in earlier layers, shorter in later layers
* try smaller nets again with the improvements (training all subjects, batch normalization, etc.)

## Post-Conclusion (more filters, truly separable conv) :)

* More than 80 filters decreases test AUC by ~0.5 percentage points
* Non-separable conv in the first layer decreases test AUC by ~1.5 percentage points
* Increasing the batch size from 42 to 45 increases test AUC by ~0.3 percentage points

In the following table, all rows after the first use batch normalization, baseline removal, etc.
(the first row is copied from above for comparison).

|Comment|Test AUC (%)|Kaggle AUC (%)|
|-|-|-|
|80 filters, first time, then spatial, rest normal|96.3|96.7|
|80 Filters pre-1x1-conv, batchsize 45|96.8||
|120 Filters pre-1x1-conv, batchsize 45|96.3||
|140 Filters pre-1x1-conv, batchsize 45|96.2||
|Normal conv all layers (including first), batchsize 45|94.8||
|first real separable conv, then normal convs, batchsize 45|95.4||
|first normal conv, then pre-1x1-conv, batchsize 45|96.3||