# FixMatch

## Principle

FixMatch is a semi-supervised learning algorithm.
The idea is quite simple: train in a supervised way on the labeled dataset, and use the model being trained to produce pseudo-labels on the unlabeled dataset.
For each unlabeled image, we generate two versions: one weakly augmented, one strongly augmented.
The argmax of the prediction on the weakly augmented version is the pseudo-label. The prediction on the strongly augmented version is compared to this pseudo-label, which plays the role of an (evolving) ground truth.
That being said, the pseudo-label is kept only if its confidence, i.e. the highest predicted probability on the weakly augmented version, reaches a threshold, which is a fixed hyperparameter.
The supervised and unsupervised losses are computed separately, then combined in a weighted sum.

Formally, the supervised loss over a batch is defined as $$l_s = \frac{1}{B} \sum_{b=1}^{B}H(p_b, p_m(y|\alpha(x_b)))$$ with $x_b$ a labeled image, $\alpha(x_b)$ its weakly augmented (preprocessed) version, $B$ the batch size, $p_b$ the true label as a one-hot distribution, $p_m(y|\alpha(x_b))$ the predicted probability of each label given that input, and $H$ the cross-entropy.\
Similarly, the unsupervised loss over a batch is $$l_u = \frac{1}{\mu B} \sum_{b=1}^{\mu B} \mathbb{1} (\max(q_b) \geq \tau) H(\hat{q}_b, p_m(y|A(u_b)))$$ with $u_b$ an unlabeled image, $\mu$ the ratio of unlabeled to labeled images in a batch, $q_b = p_m(y|\alpha (u_b))$ the prediction on the weakly augmented image, $\hat{q}_b = \arg\max(q_b)$ the pseudo-label, $p_m(y|A(u_b))$ the predicted probability of each label given the strongly augmented image, and $\tau$ the threshold above which the pseudo-label is accepted.\
The final total loss is $L = l_s+\lambda_u l_u$ with $\lambda_u$ another fixed hyperparameter that weighs the unsupervised part.
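
The three formulas above can be checked by hand on a tiny batch. The following is a minimal sketch in plain Python, not the reference implementation: the function names, the list-of-probabilities inputs and the default values of `tau` and `lambda_u` are assumptions made for illustration, and the model itself is left out (the code starts from already computed probabilities).

```python
import math

def cross_entropy(target_index, probs, eps=1e-12):
    """Cross-entropy H between a one-hot target and a predicted distribution."""
    # eps avoids log(0); it is an illustrative safeguard, not part of the formula.
    return -math.log(max(probs[target_index], eps))

def fixmatch_loss(labels, probs_labeled, probs_weak, probs_strong,
                  tau=0.95, lambda_u=1.0):
    """Return (l_s, l_u, L) for one batch.

    labels        : true class indices of the B labeled images
    probs_labeled : model probabilities on the weakly augmented labeled images
    probs_weak    : model probabilities q_b on the weakly augmented unlabeled images
    probs_strong  : model probabilities on the strongly augmented unlabeled images
    tau, lambda_u : illustrative default values (assumptions of this sketch)
    """
    # Supervised part: mean cross-entropy over the B labeled images.
    l_s = sum(cross_entropy(y, p) for y, p in zip(labels, probs_labeled)) / len(labels)

    # Unsupervised part: only confident pseudo-labels contribute,
    # but the sum is still divided by the full unlabeled batch size (mu * B).
    total = 0.0
    for q, p in zip(probs_weak, probs_strong):
        confidence = max(q)
        if confidence >= tau:
            pseudo_label = q.index(confidence)  # argmax(q_b)
            total += cross_entropy(pseudo_label, p)
    l_u = total / len(probs_weak)

    return l_s, l_u, l_s + lambda_u * l_u

# One labeled image, two unlabeled images (mu = 2), two classes.
l_s, l_u, loss = fixmatch_loss(
    labels=[0],
    probs_labeled=[[0.5, 0.5]],
    probs_weak=[[0.97, 0.03], [0.6, 0.4]],    # only the first one passes tau
    probs_strong=[[0.5, 0.5], [0.1, 0.9]],
)
print(l_s, l_u, loss)
```

In this example the second unlabeled image is masked out because its confidence (0.6) is below the threshold, so it adds nothing to $l_u$ while still counting in the denominator $\mu B$.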

## Limits

Overall, FixMatch is simple and, as shown in the original paper, gives results competitive with or better than the methods of its time (UDA, ReMixMatch).\
The catch is finding the strong augmentations that yield the best results for one's dataset.

# Appendix

## Wide residual networks

Although we do not modify the architecture in this project, we explore its specifics here.

### Formula

Residual block formula:
$$
    x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l)
$$
where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th unit in the network, $\mathcal{F}$ is a residual function and $\mathcal{W}_l$ are the parameters of the block.
There are 2 types of blocks in a residual network:
* basic: batch normalization, then ReLU, then a convolution with a $3 \times 3$ kernel, all of that twice in sequence. $BN - ReLU - conv(3 \times 3) - BN - ReLU - conv(3 \times 3)$
* bottleneck: $conv(1 \times 1) - conv(3 \times 3) - conv(1 \times 1)$, with one $conv(3 \times 3)$ surrounded by a dimensionality-reducing and a dimensionality-expanding $conv(1 \times 1)$

Bottleneck blocks are not considered in WideResNet: they make blocks thinner so that networks can be deeper, which runs counter to the WideResNet architecture and its focus on the width of the neural network (NN). More specifically, the WideResNet paper focuses on widening basic blocks to improve their expressiveness.

The paper introduces two factors:
* deepening factor $l$: the number of convolutions in a basic block (not to be confused with the unit index $l$ in the formula above)
* widening factor $k$: a multiplier on the number of features in the convolutional layers

Kernel size remains equal to 3.
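
To get a feel for what the two factors cost, the sketch below counts the convolution weights of a single block. It is only an illustration under simplifying assumptions: the block keeps the same number of channels at its input and output, convolutions have no bias, and batch-normalization parameters are ignored. The function names are hypothetical.

```python
def conv3x3_weights(in_channels: int, out_channels: int) -> int:
    """Number of weights in a 3x3 convolution without bias."""
    return 3 * 3 * in_channels * out_channels

def basic_block_weights(base_width: int, k: int, l: int = 2) -> int:
    """Convolution weights of a block made of l 3x3 convolutions.

    Assumes every convolution maps base_width * k channels to
    base_width * k channels (no change of width inside the block).
    """
    width = base_width * k
    return l * conv3x3_weights(width, width)

for k in (1, 2, 4):
    print(k, basic_block_weights(16, k))
# 1 4608
# 2 18432
# 4 73728
```

Under these assumptions the count grows linearly with the deepening factor $l$ and quadratically with the widening factor $k$: doubling $k$ multiplies the weights of the block by 4.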

### Structure

| Group Name | Output Size | Block Type = $\mathcal{B}(3, 3)$ |
|------------|-------------|----------------------|
| conv1      | 32x32       | [3x3, 16]            |
| conv2      | 32x32       | [3x3, 16xk]          |
|            |             | [3x3, 16xk] xN       |
| conv3      | 16x16       | [3x3, 32xk]          |
|            |             | [3x3, 32xk] xN       |
| conv4      | 8x8         | [3x3, 64xk]          |
|            |             | [3x3, 64xk] xN       |
| avg-pool   | 1x1         | [8x8]                |

For instance, the second group (conv2) is a group of convolutions whose output size is $32\times 32$. It is made of basic blocks, each containing two $conv(3 \times 3)$ with $16\times k$ features (channels), and this block is repeated N times in sequence.

As for the notation, WRN-n-k denotes a residual network with n convolutional layers in total and a widening factor k. For this project we use WRN-28-2. The [official implementation](https://github.com/szagoruyko/wide-residual-networks) provides a [way](https://github.com/szagoruyko/wide-residual-networks/blob/ae6d0d0561484172790c7a63c8ce6ade5a5a2914/models/wide-resnet.lua#L89) to compute N = $(28-4)/6 = 4$. The 6 corresponds to the 3 groups (conv2, conv3, conv4) with 2 convolutions per basic block. But why is there a 4? I could not find a real explanation, and I am not the only [one](https://github.com/szagoruyko/wide-residual-networks/issues/54#issue-341894131). My guess is that the 4 layers are the conv1 at the beginning and the different layers at the end.
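
The relation between n, k, N and the channel widths of the structure table can be sketched in a few lines. The helper `wrn_config` is hypothetical (it is not taken from the official implementation); it only applies N = (n - 4) / 6 and the widths 16, 16k, 32k, 64k.

```python
def wrn_config(depth: int, k: int) -> dict:
    """Blocks per group (N) and channel widths for WRN-depth-k."""
    if (depth - 4) % 6 != 0:
        raise ValueError("depth must be of the form 6N + 4")
    n_blocks = (depth - 4) // 6
    # conv1 is not widened; conv2, conv3 and conv4 are multiplied by k.
    widths = {"conv1": 16, "conv2": 16 * k, "conv3": 32 * k, "conv4": 64 * k}
    return {"N": n_blocks, "widths": widths}

config = wrn_config(28, 2)
print(config["N"])       # 4
print(config["widths"])  # {'conv1': 16, 'conv2': 32, 'conv3': 64, 'conv4': 128}
```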

The WRN-28-2 structure can be written as follows, with possible variations for the end groups:

| Group Name | Output Size | Block Type                   |
|------------|-------------|------------------------------|
| conv1      | 32x32       | [3x3, 16]                    |
| conv2      | 32x32       | [3x3, 32]                    |
|            |             | [3x3, 32] x4 (N=4)           |
| conv3      | 16x16       | [3x3, 64] (stride 2)         |
|            |             | [3x3, 64] x4 (N=4)           |
| conv4      | 8x8         | [3x3, 128] (stride 2)        |
|            |             | [3x3, 128] x4 (N=4)          |
| BN+ReLU    |             |                              |
| avg-pool   | 1x1         | [8x8]                        |
| linear     |             |                              |

In addition to this structure, a dropout layer can be inserted inside each basic block, between its two convolutions (after the ReLU).

## Exponential Moving Average

The original FixMatch paper reports the performance of the trained model using an Exponential Moving Average (EMA) of the model parameters.
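
As a reminder of how such an average works, here is a minimal sketch in plain Python. The class name, the dictionary-of-floats representation of the parameters and the default `decay` value are assumptions made for illustration; a real model would hold tensors instead of floats.

```python
class ExponentialMovingAverage:
    """Keeps a smoothed copy ("shadow") of a set of named parameters."""

    def __init__(self, params: dict, decay: float = 0.999):
        self.decay = decay          # illustrative default, close to 1
        self.shadow = dict(params)  # start from the current parameter values

    def update(self, params: dict) -> None:
        """shadow <- decay * shadow + (1 - decay) * current value."""
        for name, value in params.items():
            self.shadow[name] = self.decay * self.shadow[name] + (1 - self.decay) * value

# Toy usage with a single scalar "weight".
ema = ExponentialMovingAverage({"w": 1.0}, decay=0.9)
ema.update({"w": 2.0})   # 0.9 * 1.0 + 0.1 * 2.0
ema.update({"w": 2.0})   # 0.9 * 1.1 + 0.1 * 2.0
print(ema.shadow["w"])
```

The update would be called after each optimizer step, and the shadow values (not the raw parameters) would be used at evaluation time.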