# The effects of colour augmentation.

This notebook is a summary of the findings of our new [paper](https://arxiv.org/abs/2110.04487) on the use of colour augmentation.

The experiments described here were inspired by the fantastic SimCLR work of [Chen et al.](https://arxiv.org/abs/2002.05709)
In a nutshell, they train a network to generate an embedding for an image in an unsupervised fashion via contrastive learning / instance discrimination. They stochastically augment each image in a batch twice, encouraging similarity between the embeddings arising from differently augmented variants of the same image and dissimilarity between embeddings arising from different images.

They found that colour augmentation (also used in the [MoCo model of He et al.](https://arxiv.org/abs/1911.05722)) was essential for good performance, as without it the network could in effect *cheat* by using image colour statistics as a shortcut for achieving the instance discrimination task. The colour augmentation sufficiently alters the colour statistics to prevent them from being a reliable instance discrimination signal.

#### Single pixel colour clustering as a shortcut for minimizing consistency loss in semantic segmentation

Let us consider a semi-supervised semantic segmentation experiment in which we drive the unsupervised consistency loss
$L_{cons}$ with standard augmentation (consisting of geometric/affine warps, etc.).

- we have an unsupervised input image $x_u$
- we use the function $aug$ to augment an image $x$ by applying an affine transformation defined by the matrix $M$ that is drawn stochastically: $aug(x, M)$
- we predict per pixel class probability vectors using the student network $f_\theta$ and the teacher network $g_\phi$

Let us follow the semi-supervised semantic segmentation approach of [Perone et al.](https://arxiv.org/abs/1807.04657):
1. We start with an unsupervised input image $x_u$
2. We stochastically draw an affine transformation $M$ and augment the image: $x_u' = aug(x_u, M)$.
3. We apply the teacher network $g_\phi$ to generate per-pixel probability predictions $y_t$ for the input image: $y_t = g_\phi(x_u)$.
4. We obtain student predictions $y_s'$ by applying the student network $f_\theta$ to the augmented image: $y_s' = f_\theta(x_u')$
5. We apply the same affine transformation $M$ to the teacher predictions: $y_t' = aug(y_t, M)$.
6. Applying the same transformation $M$ to the teacher predictions and to the input presented to the student network results in the predictions $y_s'$ and $y_t'$ being aligned in a pixel-wise fashion. We can now compute the consistency loss:
$L_{cons} = |y_s' - y_t'|^2$
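
The six steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under several assumptions: the helper names (`make_toy_net`, `consistency_loss`) are hypothetical, `aug` uses a horizontal flip as a simple stand-in for a general affine warp, the "networks" are toy functions rather than real segmentation models, and the squared difference is averaged over pixels.

```python
import numpy as np

def softmax(z):
    """Softmax over the last (class) axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def aug(a, flip):
    """Hypothetical stand-in for aug(x, M): a horizontal flip, not a general affine warp."""
    return a[:, ::-1] if flip else a

def make_toy_net(rng, n_classes=3):
    """Toy 'network': 3x3 box blur (so context matters) + per-pixel linear classifier."""
    w = rng.normal(size=(3, n_classes))

    def net(x):
        h, wd = x.shape[:2]
        padded = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="edge")
        ctx = sum(padded[i:i + h, j:j + wd] for i in range(3) for j in range(3)) / 9.0
        return softmax(ctx @ w)  # (H, W, n_classes) class probabilities

    return net

def consistency_loss(student, teacher, x_u, flip):
    x_u_aug = aug(x_u, flip)      # step 2: x_u' = aug(x_u, M)
    y_t = teacher(x_u)            # step 3: teacher sees the un-augmented image
    y_s_aug = student(x_u_aug)    # step 4: student sees the augmented image
    y_t_aug = aug(y_t, flip)      # step 5: warp teacher predictions with the same M
    # step 6: squared difference, summed over classes, averaged over pixels (assumption)
    return float(np.mean(np.sum((y_s_aug - y_t_aug) ** 2, axis=-1)))

rng = np.random.default_rng(0)
x_u = rng.random((8, 8, 3))                       # step 1: unsupervised RGB image
student, teacher = make_toy_net(rng), make_toy_net(rng)
loss = consistency_loss(student, teacher, x_u, flip=True)
```

Because the same `flip` is applied to the student input and to the teacher output, the two prediction maps line up pixel by pixel before the difference is taken.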

**Shortcut:** We observe that the network can minimize $L_{cons}$ by learning to predict the class probability using *only* the RGB value of the corresponding input pixel, ignoring the context provided by surrounding pixels. It in effect learns to cluster the RGB value of a single pixel. Given that surrounding context is necessary for good semantic segmentation performance in all but the simplest of scenarios, this will result in poor performance for most benchmarks.
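
The shortcut can be illustrated with a toy example. The sketch below assumes a hypothetical `pixelwise_net` that classifies each pixel from its own RGB value only (the equivalent of a 1x1 convolution), shares weights between student and teacher for simplicity, and uses a 90 degree rotation as a special case of the affine warp. Under these assumptions the consistency loss is zero no matter what the weights are, so minimizing it provides no pressure to learn anything about context.

```python
import numpy as np

def softmax(z):
    """Softmax over the last (class) axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pixelwise_net(x, w):
    """Hypothetical context-free model: class probabilities from each pixel's RGB alone."""
    return softmax(x @ w)

def aug(a, k):
    """Geometric augmentation only: rotate by k * 90 degrees (pixel values are unchanged)."""
    return np.rot90(a, k, axes=(0, 1))

rng = np.random.default_rng(0)
x_u = rng.random((16, 16, 3))        # unsupervised RGB image
w = rng.normal(size=(3, 4))          # arbitrary, untrained weights for 4 classes

y_s_aug = pixelwise_net(aug(x_u, 1), w)    # student on the warped image
y_t_aug = aug(pixelwise_net(x_u, w), 1)    # teacher predictions, warped afterwards
l_cons = float(np.mean(np.sum((y_s_aug - y_t_aug) ** 2, axis=-1)))
print(l_cons)  # zero up to floating point rounding
```

A geometric warp only moves pixels around, so a function of the pixel value alone commutes with it; this is why the loss vanishes here.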

#### Experimental evidence

Much as [Chen et al.](https://arxiv.org/abs/2002.05709) demonstrated that colour augmentation prevents the network from using colour statistics as a shortcut in an instance discrimination task, we demonstrate with our results below that colour augmentation similarly prevents our network from using the shortcut described above to minimize $L_{cons}$ in a semantic segmentation task.



We implement our colour augmentation in the same way as the [MoCo model of He et al.](https://arxiv.org/abs/1911.05722).
This is essentially the same approach used by [Chen et al.](https://arxiv.org/abs/2002.05709), except MoCo is implemented in PyTorch.
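
For readers unfamiliar with this style of augmentation, here is a rough NumPy sketch of a colour jitter followed by random greyscale conversion. It is not the implementation used in our experiments: the function name `colour_jitter`, the jitter strengths, the greyscale probability, the fixed order of operations and the omission of hue jitter are all assumptions made for illustration.

```python
import numpy as np

GREY_WEIGHTS = np.array([0.299, 0.587, 0.114])  # standard luma weights for RGB -> grey

def colour_jitter(x, rng, brightness=0.4, contrast=0.4, saturation=0.4, p_grey=0.2):
    """Jitter a float RGB image of shape (H, W, 3) with values in [0, 1].

    All strengths and probabilities are illustrative assumptions.
    """
    # Brightness: scale every value by a random factor
    x = np.clip(x * rng.uniform(1 - brightness, 1 + brightness), 0.0, 1.0)
    # Contrast: blend with the mean grey level of the image
    mean = (x @ GREY_WEIGHTS).mean()
    x = np.clip(mean + (x - mean) * rng.uniform(1 - contrast, 1 + contrast), 0.0, 1.0)
    # Saturation: blend each pixel with its own grey value
    grey = (x @ GREY_WEIGHTS)[..., None]
    x = np.clip(grey + (x - grey) * rng.uniform(1 - saturation, 1 + saturation), 0.0, 1.0)
    # Random greyscale: occasionally discard colour information entirely
    if rng.random() < p_grey:
        x = np.repeat((x @ GREY_WEIGHTS)[..., None], 3, axis=-1)
    return x

rng = np.random.default_rng(0)
x = rng.random((4, 4, 3))
view_a = colour_jitter(x, rng)
view_b = colour_jitter(x, rng)
# The same pixel now has different RGB values in the two views
pixel_shift = float(np.abs(view_a - view_b).mean())
```

Since the RGB value of a given pixel differs between views, a fixed mapping from RGB value to class no longer produces identical predictions, which is what removes the shortcut.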


### New Results

Here are the results from our new paper:

#### Cityscapes

<table>
    <tr>
        <th># labelled</th>      <th>~ 1/30 (100)</th>    <th>~ 1/8 (372)</th>     <th>~ 1/4 (744)</th>     <th>All</th>
    </tr>
    <tr>
        <td></td><td colspan="4"><span style="font-size:0.9em">Results from other work with ImageNet pre-trained DeepLab v2</span></td>
    </tr>
    <tr>
        <th>Baseline</th>        <td>--</td>              <td>56.2%</td>           <td>60.2%</td>           <td>66.0%</td>
    </tr>
    <tr>
        <th>Adversarial</th>     <td>--</td>              <td>57.1%</td>           <td>60.5%</td>           <td>66.2%</td>
    </tr>
    <tr>
        <th>s4GAN</th>           <td>--</td>              <td>59.3%</td>           <td>61.9%</td>           <td>65.8%</td>
    </tr>
    <tr>
        <td></td><td colspan="4"><span style="font-size:0.9em">Our results: Same ImageNet pre-trained DeepLab v2 network</span></td>
    </tr>
    <tr>
        <th>Baseline</th>        <td>44.41%</td>          <td>55.25%</td>          <td>60.57%</td>          <td>67.53%</td>
    </tr>
    <tr>
        <th>Cutout</th>          <td>47.21%</td>          <td>57.72%</td>          <td>61.96%</td>          <td>67.47%</td>
    </tr>
    <tr>
        <th>+ colour aug.</th>   <td>48.28%</td>          <td>58.30%</td>          <td>62.59%</td>          <td>67.93%</td>
    </tr>
    <tr>
        <th>CutMix</th>          <td>51.20%</td>          <td>60.34%</td>          <td>63.87%</td>          <td>67.68%</td>
    </tr>
    <tr>
        <th>+ colour aug.</th>   <td><strong>51.98%</strong></td>          <td><strong>61.08%</strong></td>          <td><strong>64.61%</strong></td>          <td><strong>68.11%</strong></td>
    </tr>
</table>

The addition of colour augmentation to Cutout and CutMix contributes a slight improvement in performance. Please note that we left the error bounds out of this table.


#### Pascal VOC 2012

First we perform a full ablation of various regularizers using the DeepLab v2 network:

<table>
    <tr>
        <th># labelled</th>      <th>~ 1/100 (106)</th>   <th>~ 1/50 (212)</th>    <th>~ 1/20 (529)</th>    <th>~ 1/8 (1323)</th>    <th>All</th>
    </tr>
    <tr>
        <td></td><td colspan="5"><span style="font-size:0.9em">Results from other work with ImageNet pre-trained DeepLab v2</span></td>
    </tr>
    <tr>
        <th>Baseline</th>        <td>--</td>              <td>48.3%</td>           <td>56.8%</td>           <td>62.0%</td>           <td>70.7%</td>
    </tr>
    <tr>
        <th>Adversarial</th>     <td>--</td>              <td>49.2%</td>           <td>59.1%</td>           <td>64.3%</td>           <td>71.4%</td>
    </tr>
    <tr>
        <th>s4GAN+MLMT</th>      <td>--</td>              <td>60.4%</td>           <td>62.9%</td>           <td>67.3%</td>           <td>73.2%</td>
    </tr>
    <tr>
        <td></td><td colspan="5"><span style="font-size:0.9em">Our results: Same ImageNet pre-trained DeepLab v2 network</span></td>
    </tr>
    <tr>
        <th>Baseline</th>        <td>33.09%</td>          <td>43.15%</td>          <td>52.05%</td>          <td>60.56%</td>          <td>72.59%</td>
    </tr>
    <tr>
        <th>Std. aug.</th>       <td>32.40%</td>          <td>42.81%</td>          <td>53.37%</td>          <td>60.66%</td>          <td>72.24%</td>
    </tr>
    <tr>
        <th>+ colour aug.</th>   <td>46.42%</td>          <td>49.97%</td>          <td>57.17%</td>          <td>65.88%</td>          <td>73.21%</td>
    </tr>
    <tr>
        <th>VAT</th>             <td>38.81%</td>          <td>48.55%</td>          <td>58.50%</td>          <td>62.93%</td>          <td>72.18%</td>
    </tr>
    <tr>
        <th>+ colour aug.</th>   <td>40.05%</td>          <td>49.52%</td>          <td>57.60%</td>          <td>63.05%</td>          <td>72.29%</td>
    </tr>
    <tr>
        <th>ICT</th>             <td>35.82%</td>          <td>46.28%</td>          <td>53.17%</td>          <td>59.63%</td>          <td>71.50%</td>
    </tr>
    <tr>
        <th>+ colour aug.</th>   <td>49.14%</td>          <td>57.52%</td>          <td>64.06%</td>          <td>66.68%</td>          <td>72.91%</td>
    </tr>
    <tr>
        <th>Cutout</th>          <td>48.73%</td>          <td>58.26%</td>          <td>64.37%</td>          <td>66.79%</td>          <td>72.03%</td>
    </tr>
    <tr>
        <th>+ colour aug.</th>   <td>52.43%</td>          <td>60.15%</td>          <td>65.78%</td>          <td>67.71%</td>          <td>73.20%</td>
    </tr>
    <tr>
        <th>CutMix</th>          <td><strong>53.79%</strong></td>          <td>64.81%</td>          <td>66.48%</td>          <td>67.60%</td>          <td>72.54%</td>
    </tr>
    <tr>
        <th>+ colour aug.</th>   <td>53.19%</td>          <td><strong>65.19%</strong></td>          <td><strong>67.65%</strong></td>          <td><strong>69.08%</strong></td>          <td><strong>73.29%</strong></td>
    </tr>
</table>    

The addition of colour augmentation generally improves results across the board for all regularization techniques. It is especially noticeable when added to standard augmentation and ICT.

**A note on consistency loss weight:** Furthermore, we note that using regularizers other than Cutout and CutMix (e.g. standard augmentation and ICT) often
yielded results that were *worse* than those of our supervised baseline. Getting the positive results seen above and in our paper (even if only just positive) often required us to reduce the consistency loss weight to the point that the consistency loss was having very little effect. The addition of colour augmentation allowed us to use the same consistency loss weight of 1 for all experiments, apart from VAT, which required a lower value due to the use of KL-divergence loss rather than squared difference loss.
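
To make the role of the weight concrete, the sketch below assumes the common formulation in which the total loss is the supervised loss plus the weighted consistency loss. The names `total_loss` and `cons_weight` are hypothetical, as is the direction chosen for the KL-divergence. It also compares the two consistency losses on one pair of predictions: the squared difference between two probability vectors can never exceed 2, whereas the KL-divergence is unbounded, which is one plausible reason why a KL-based loss calls for a smaller weight.

```python
import numpy as np

def squared_diff_loss(p_s, p_t):
    """Squared difference between probability vectors, averaged over pixels."""
    return float(np.mean(np.sum((p_s - p_t) ** 2, axis=-1)))

def kl_div_loss(p_s, p_t, eps=1e-8):
    """KL(p_t || p_s) averaged over pixels (direction is an assumption)."""
    return float(np.mean(np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)))

def total_loss(l_sup, l_cons, cons_weight=1.0):
    """Assumed formulation: supervised loss plus weighted consistency loss."""
    return l_sup + cons_weight * l_cons

# One pixel on which teacher and student confidently disagree
p_t = np.array([[0.90, 0.05, 0.05]])
p_s = np.array([[0.10, 0.80, 0.10]])

sq = squared_diff_loss(p_s, p_t)   # bounded above by 2
kl = kl_div_loss(p_s, p_t)         # grows without bound as p_s -> 0 where p_t > 0

loss_sq = total_loss(l_sup=1.0, l_cons=sq, cons_weight=1.0)
loss_kl = total_loss(l_sup=1.0, l_cons=kl, cons_weight=0.1)  # 0.1 is purely illustrative
```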


Now we evaluate CutMix with optional colour augmentation using DeepLab v3+, a DenseNet-161 based Dense U-net and PSPNet:


<table>
    <tr>
        <th># labelled</th>      <th>~ 1/100 (106)</th>   <th>~ 1/50 (212)</th>    <th>~ 1/20 (529)</th>    <th>~ 1/8 (1323)</th>    <th>All</th>
    </tr>
    <tr>
        <td></td><td colspan="5"><span style="font-size:0.9em">Results from other work with ImageNet pre-trained DeepLab v3+</span></td>
    </tr>
    <tr>
        <th>Baseline</th>        <td>--</td>              <td>unstable</td>        <td>unstable</td>        <td>63.5%</td>           <td>74.6%</td>
    </tr>
    <tr>
        <th>s4GAN+MLMT</th>      <td>--</td>              <td>62.6%</td>           <td>66.6%</td>           <td>70.4%</td>           <td>74.7%</td>
    </tr>
    <tr>
        <td></td><td colspan="5"><span style="font-size:0.9em">Our results: ImageNet pre-trained DeepLab v3+ network</span></td>
    </tr>
    <tr>
        <th>Baseline</th>        <td>37.95%</td>          <td>48.35%</td>          <td>59.19%</td>          <td>66.58%</td>          <td>76.70%</td>
    </tr>
    <tr>
        <th>CutMix</th>          <td>59.52%</td>          <td><strong>67.05%</strong></td>          <td>69.57%</td>          <td>72.45%</td>          <td>76.73%</td>
    </tr>
    <tr>
        <th>+ colour aug.</th>   <td><strong>60.02%</strong></td>          <td>66.84%</td>          <td><strong>71.62%</strong></td>          <td><strong>72.96%</strong></td>          <td><strong>77.67%</strong></td>
    </tr>
    <tr>
        <td></td><td colspan="5"><span style="font-size:0.9em">Our results: ImageNet pre-trained DenseNet-161 based Dense U-net</span></td>
    </tr>
    <tr>
        <th>Baseline</th>        <td>29.22%</td>          <td>39.92%</td>          <td>50.31%</td>          <td>60.65%</td>          <td>72.30%</td>
    </tr>
    <tr>
        <th>CutMix</th>          <td><strong>54.19%</strong></td>          <td><strong>63.81%</strong></td>          <td><strong>66.57%</strong></td>          <td>66.78%</td>          <td>72.02%</td>
    </tr>
    <tr>
        <th>+ colour aug.</th>   <td>53.04%</td>          <td>62.67%</td>          <td>63.91%</td>          <td><strong>67.63%</strong></td>          <td><strong>74.16%</strong></td>
    </tr>
    <tr>
        <td></td><td colspan="5"><span style="font-size:0.9em">Our results: ImageNet pre-trained ResNet-101 based PSPNet</span></td>
    </tr>
    <tr>
        <th>Baseline</th>        <td>36.69%</td>          <td>46.96%</td>          <td>59.02%</td>          <td>66.67%</td>          <td>77.59%</td>
    </tr>
    <tr>
        <th>CutMix</th>          <td><strong>67.20%</strong></td>          <td>68.80%</td>          <td>73.33%</td>          <td>74.11%</td>          <td>77.42%</td>
    </tr>
    <tr>
        <th>+ colour aug.</th>   <td>66.83%</td>          <td><strong>72.30%</strong></td>          <td><strong>74.64%</strong></td>          <td><strong>75.40%</strong></td>          <td><strong>78.67%</strong></td>
    </tr>
</table>

Colour augmentation is generally beneficial, apart from when using the less common DenseNet-161 based Dense U-net, where the improvements are less clear cut.


#### ISIC 2017

<table>
    <tr>
        <th>Baseline (50)</th>   <th>Std. aug.</th>       <th>VAT</th>             <th>ICT</th>             <th>Cutout</th>          <th>CutMix</th>          <th>Fully sup. (2000)</th>
    </tr>
    <tr>
        <td></td><td colspan="7"><span style="font-size:0.9em">Results from <a href="https://arxiv.org/abs/1808.03887">Li et al.</a> with ImageNet pre-trained DenseUNet-161</span></td>
    </tr>
    <tr>
        <td>72.85%</td>          <td>75.31%</td>          <td>--</td>              <td>--</td>              <td>--</td>              <td>--</td>              <td>79.60%</td>
    </tr>
    <tr>
        <td></td><td colspan="7"><span style="font-size:0.9em">Our results: Same ImageNet pre-trained DenseUNet-161</span></td>
    </tr>
    <tr>
        <td>67.64%</td>          <td>71.40%</td>          <td>69.09%</td>          <td>65.45%</td>          <td>68.76%</td>          <td>74.57%</td>          <td>78.61%</td>
    </tr>
    <tr>
        <td></td><td colspan="7"><span style="font-size:0.9em">+ colour augmentation</span></td>
    </tr>
    <tr>
        <td></td>                <td>73.61%</td>          <td>61.94%</td>          <td>50.93%</td>          <td>73.70%</td>          <td>74.51%</td>          <td></td>
    </tr>
</table>

The picture emerging from the ISIC 2017 results is less clear.

Colour augmentation contributes noticeable improvements to standard augmentation and Cutout. Following our findings from the Pascal VOC 2012 dataset, we used the same consistency loss weight as with CutMix and Cutout. While this worked for standard augmentation, it did not for VAT and ICT, suggesting that further tuning would be needed to obtain positive results.

Colour augmentation caused a small but not significant drop in performance when added to CutMix. Performance is still strong, however.


#### Overall

We would advise the use of colour augmentation, unless doing so would hamper colour selectivity that might be necessary in some domains. We can still recommend the CutMix regularizer for semi-supervised semantic segmentation as it exhibits reliably strong performance.



### Effect on the narrative of our original paper

In our [original paper](https://arxiv.org/abs/1906.01916) we noted that prior work identified the cluster assumption, and specifically the low density separation assumption (in which the input data distribution features low density regions separating samples belonging to different classes), as key to the success of semi-supervised learning.

We analysed the data distribution of semantic segmentation and found that the low density separation assumption does not apply for these problems. We still believe our analysis and the method behind it to hold and we believe that this is an interesting artefact of the semantic segmentation problem.

We believed that overcoming the challenging data distribution posed by semantic segmentation requires a regularizer that produces perturbations that exhibit sufficient variety in order to adequately constrain the orientation of the decision boundary without low-density regions to guide it. We stated in the paper that we did not consider standard augmentation to provide perturbations that are sufficiently varied to achieve this. We no longer believe this criticism of standard augmentation to be true as our new results strongly suggest the 'pixel colour clustering' shortcut as the underlying mechanism that held back standard augmentation.

We should note that colour augmentation has in fact been used to solve this problem before by [Ji et al.](https://arxiv.org/abs/1807.06653),
although they mention it briefly in their paper as 'photometric augmentation' and implement it using the [`ColorJitter`](https://pytorch.org/vision/stable/transforms.html) transformation from [torchvision](https://pytorch.org/vision/stable/index.html). The ablation we have presented here shows how essential it is when using standard augmentation to drive consistency loss.

Our new results lessen the importance of the cluster assumption, leading us to conclude that it is in fact *not* essential for strong semi-supervised learning performance as stated in prior work. We believe that this strengthens the arguments in favour of using consistency regularization for semi-supervised learning. Not only is it conceptually simple and easy to implement but it is a powerful and flexible regularizer, as we have shown here.

We still believe that the data distribution present in semantic segmentation problems makes it a challenging problem that can serve as an acid test for future regularizers. As recently noted by [Aitchison et al.](https://arxiv.org/abs/2008.05913), many standard image classification datasets (e.g. CIFAR-10 and ImageNet) used ground truth annotations from multiple human observers during their construction, with ambiguous and out of distribution samples then removed from the dataset. This could in effect introduce low-density regions into the input distribution. In many practical and commercial settings, using multiple human observers could be prohibitively expensive and/or time consuming. Furthermore, models often encounter ambiguous or out of distribution samples in practical scenarios. We therefore consider challenging distributions to be worth studying, and semantic segmentation can in some fashion act as a proxy for them.
