# Paper summaries
**Papers 1 to 14**

## Paper 1: Quantifying Independently Reproducible ML Research
Not being able to reproduce a paper's results may suggest a problem with the paper. The author investigates this by reproducing papers independently, that is, **reproducing them using only the information given in the paper**.

Features that may influence reproducibility:
- Unambiguous:
    - #authors, #references, etc...
- Mild subjectivity:
    - #tables, #equations, hyperparameters, pseudocode
- Subjective:
    - #conceptualization figures, rigor vs. empirical (type of research), readability, algorithm difficulty.

Most important:
- _Year of publication_: uncorrelated with reproducibility.
- _Readability_: most strongly correlated with reproducibility; papers with fewer pages seem to be less readable, which suggests that page limits negatively influence readability.
- _Pseudocode_: correlated; positive for no pseudocode, negative for in-between levels of detail.
- _Theoretical papers_: harder to reproduce
- _Primary topic_: significantly correlated to reproducibility; 'living' papers possibly more reproducible.
- _Code release_: uncorrelated.

The study has several deficiencies, such as **author bias** (the reproducer's experience, the choice of topics, all papers implemented by only one person) and no record of failures.

## Paper 2: Troubling Trends in Machine Learning Scholarship

1. **Failure to distinguish** between **explanation and speculation**; an example of speculation is "strong results lead to complacency".
2. **Failure to identify the sources of empirical gains**: emphasizing the wrong things, such as modifications that contribute little when the gains actually stem from hyperparameter tuning.
3. **Mathiness**: using mathematics to impress rather than to clarify concepts.
4. **Language misuse**: using misleading or overloaded terms to describe the paper's concepts.

The authors argue that strong results are seen as a valid excuse for weak arguments.
Combatting these problems could make ML more accessible.

## Paper 3: Do ImageNet Classifiers Generalize to ImageNet?
Image classifiers do not generalize reliably. The authors show this by evaluating image recognition models on a **new** test set that is sampled from the **same** data source and follows the **same** data cleaning protocol as described for the original dataset. Several highly ranked models are re-evaluated in this manner.

The classification models fail to reach their original accuracy scores on the new test set (a 3 to 15% drop on CIFAR-10 and an 11 to 14% drop on ImageNet). However, the ranking of the highly ranked models is mostly preserved.

There are three types of gaps identified:
- **Generalization gap**: determined solely by random sampling error. Since the new test set has 10,000 data points, this should lead to a difference in accuracy of at most ±1%. Thus it does **not explain** the drop in accuracy.
- **Adaptivity gap**: the degree to which a model has been adapted to the test set rather than to the true distribution. If this were the cause, each model's hyperparameters would have to be tuned to the test set in a similar fashion. Since the ranking of the models is preserved, this intuitively seems **unlikely to explain it**.
- **Distribution gap**: the systematic difference between the original distribution, for which the original test set is a proxy, and the new distribution, for which the new test set is a proxy. This seems the **most likely explanation**.

The paper also suggests that the gap could be explained by the models' inability to generalize to images that are slightly “harder” than those found in the original test sets.
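
The ±1% bound on the sampling error can be checked with a normal-approximation confidence interval for a binomial proportion. This is a rough sketch: the 95% level (`z = 1.96`), the helper name `accuracy_margin` and the example accuracy of 0.9 are assumptions for illustration, not values taken from the paper.

```python
import math

def accuracy_margin(accuracy, n_samples, z=1.96):
    """Half-width of a normal-approximation confidence interval for an accuracy."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

# The worst case is accuracy = 0.5, where the binomial variance is largest.
worst_case = accuracy_margin(0.5, 10_000)
# Assumed example accuracy, for illustration only.
typical = accuracy_margin(0.9, 10_000)

print(f"worst case: +/- {worst_case:.2%}, at 90% accuracy: +/- {typical:.2%}")
```

Even in the worst case the margin stays just below one percentage point, which is far smaller than the observed accuracy drops.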


**Question**: Models trained on existing datasets may not generalize to new test sets sampled from the same distribution. What was the paper’s hypothesis for that?
- “The existing test sets have been used too many times and existing models are tuned to the particular test set.” - This is what is initially assumed but later debunked by the fact that ranking is preserved.


## Paper 4: Scaling down Deep learning

Projects run at large scale require enormous amounts of **time**, **money**, and **electricity**. MNIST-1D is a minimalist, low-memory, low-compute alternative to classic deep learning benchmarks. It differentiates more clearly between linear, non-linear, and convolutional models than the original MNIST dataset does. In addition, MNIST is somewhat large for a toy dataset and hard to hack (researchers cannot easily vary parameters).

The authors state that small-scale research is important in the DL field because it permits creativity, allows deep understanding, and improves interpretability, reproducibility, and iteration speed. It makes it easy to perform ablation studies that isolate the causal mechanisms behind results (e.g. finding what contributes to a lottery ticket’s success). MNIST-1D has rapid iteration as a priority. However, large-scale research is also required to expose fertile new research territory.

Properties of MNIST-1D
- Extremely small (smaller than MNIST)
- Able to identify models with (spatial) inductive biases
- Easy to hack, extend or modify
- Analogous to large-scale problems
- Differentiates clearly between linear, non-linear, and convolutional models (e.g. logistic vs MLP vs CNN vs GRU) 



## Paper 5: Do Deep Convolutional Nets Really Need to be Deep and Convolutional?
**Goal**: check if shallow networks can be as accurate as deeper ones.

**Main contributions**: 
- There is still a substantial performance gap between deep CNNs and shallow CNNs/NNs (image recognition)
- Training via distillation yields higher model accuracy than conventional training

**Distillation technique**:
- Train the student model to mimic the real-valued output scores of a teacher model.
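
The mimicking idea can be sketched as a regression loss between student and teacher scores. This is a minimal illustration, assuming a mean-squared-error loss on the raw scores (the notes above do not state the loss); the function name `mimic_loss` and all numbers are hypothetical.

```python
def mimic_loss(student_scores, teacher_scores):
    """Mean squared error between student and teacher output scores."""
    assert len(student_scores) == len(teacher_scores)
    squared = [(s - t) ** 2 for s, t in zip(student_scores, teacher_scores)]
    return sum(squared) / len(squared)

# Hypothetical real-valued scores of a teacher for three classes.
teacher = [4.0, -1.0, 0.5]
close_student = [3.5, -0.5, 0.5]   # already mimics the teacher fairly well
far_student = [0.0, 0.0, 0.0]      # untrained student

print(mimic_loss(close_student, teacher))  # small loss
print(mimic_loss(far_student, teacher))    # large loss
```

The student is trained to drive this loss down, so it learns from the teacher's full score vector rather than from hard labels only.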

**Main finding**: 
If student models have a similar number of parameters as the deep teacher models, high accuracy cannot be achieved without multiple layers of convolution, even when the student models are trained via distillation.
- Deep convolutional nets do need to be both deep and convolutional, even when trained to mimic very accurate models via distillation.

**Conclusion**:
Yes, the networks need to be both deep and convolutional. Although shallow networks can learn to mimic deep ones for speech recognition, they cannot for image recognition.
- **A network with a single hidden layer can approximate every decision boundary**.
- Accuracy can be significantly improved using several layers of convolution.
- On high-dimensional data, shallow neural networks perform worse than deep neural networks.
- Distillation for models can lead to accurate results.



## Paper 6: Residual Networks Behave Like Ensembles of Relatively Shallow Networks
**Goal**:
- Investigate the impact of the following to confirm whether a residual network behaves like an _"ensemble"_:
    - identity **skip-connections**
    - **paths** are **not dependent** on each other
    - skip connections give **rise to large networks**
- Shortcut path -> create more complex model without vanishing gradient

**Ensemble**:
- An ensemble arranges a committee of neural networks in a simple voting scheme in which the final output predictions are averaged. It has the following features:
    - **Feature 1**: Members (paths) do not depend on each other
    - **Feature 2**: The performance increase from additional ensemble members gets smaller as the ensemble grows

**Method**:
- Introduce the unraveled view
    - residual networks can be viewed as a collection of many paths instead of a single deep network
- Do a lesion study
- Investigate the depth of residual networks

**Conclusion**:
- Unraveled view reveals that **residual networks can be viewed as a collection of many paths**, instead of a single ultra deep network.
- Lesion studies show that, although paths are trained jointly, they do **not strongly depend on each other**, but show ensemble-like behavior.
- The paths through the network that contribute gradient during training are shorter than expected. _Longer paths do not contribute to the gradient_.
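
The unraveled view can be made concrete by counting paths: at each residual block the signal either takes the skip connection or goes through the block, so a network with n blocks has 2^n paths, and the number of paths of length k is the binomial coefficient C(n, k). The sketch below uses a small, assumed n = 3 purely for illustration.

```python
from itertools import product
from math import comb

def path_length_counts(n_blocks):
    """Number of paths that pass through exactly k residual blocks, for k = 0..n."""
    return [comb(n_blocks, k) for k in range(n_blocks + 1)]

n_blocks = 3  # assumed small example
# Enumerate every path: 0 = take the skip connection, 1 = go through the block.
paths = list(product([0, 1], repeat=n_blocks))
lengths = [sum(path) for path in paths]

counts = path_length_counts(n_blocks)
print(len(paths))  # total number of paths
print(counts)      # how many paths have length 0, 1, 2, 3
```

Because the counts follow a binomial distribution, most paths have a length close to n/2, and very long or very short paths are rare.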

## Paper 7: Deep Image Prior
**Goal**:
- Recover the original image from a corrupted one. Instead of searching for the answer in image space, search for it in the space of a neural network's parameters.

**Conclusion**:
- The paper shows that constructing an implicit prior inside deep convolutional neural network architectures with randomly initialized weights is well suited for image restoration tasks.
- A randomly-initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, super-resolution, and inpainting.

## Paper 8: Approximating CNNs with Bag-of-local-Features

**Main contributions**:
- Solving ImageNet is much simpler than many have thought.
- The findings allow us to build much more interpretable and transparent image classification pipelines.
- CNNs have a bias towards texture.
- BagNet is easier to explain (Hybrid of DNN and BoF)
    - Ex: Medical imaging, autonomous vehicles
- DNNs:
    - Their improvements over bag-of-features classifiers come mostly from better fine-tuning rather than from qualitatively different decision strategies
    - Do not take spatial ordering into account
    - Are bad at handling distribution shifts

**Conclusion**:
- Deep Neural Networks can still recognise scrambled images well
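
Why a bag-of-local-features model is unaffected by scrambling can be sketched in a few lines: class evidence is computed per patch and then averaged, so the order of the patches does not matter. The local classifier `patch_evidence` and the patch values below are hypothetical stand-ins, not the BagNet architecture.

```python
import random

def patch_evidence(patch):
    """Hypothetical local classifier: returns evidence for [class 0, class 1]."""
    mean = sum(patch) / len(patch)
    return [mean, 1.0 - mean]

def bag_of_features_logits(patches):
    """Average the per-patch class evidence over all patches of an image."""
    per_patch = [patch_evidence(p) for p in patches]
    return [sum(e[c] for e in per_patch) / len(per_patch) for c in range(2)]

patches = [[0.9, 0.8], [0.1, 0.2], [0.7, 0.9], [0.6, 0.6]]
scrambled = patches[:]
random.Random(0).shuffle(scrambled)  # scramble the spatial order of the patches

original_logits = bag_of_features_logits(patches)
scrambled_logits = bag_of_features_logits(scrambled)
# The two results agree up to floating-point rounding.
print(original_logits, scrambled_logits)
```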

## Paper 9: Group Normalization

**Goal**:
- Current normalisation techniques used to normalise layer activations in deep learning include:
    - BN (batch normalization), which normalizes over the batch dimension.
    - LN (layer normalization), which normalizes over all channels of a sample.
    - IN (instance normalization), which normalizes over each individual channel.
- Make a normalisation technique **insensitive to batch size**.
    - This does not mean the results are independent of batch size, only that they are insensitive to it.

**Contribution**:
- GN (group normalisation) is between LN and IN where it uses a **given number of channels grouped together to normalise over** (as opposed to LN which does all at once and IN which does one channel at a time). 
- Useful when training with limited batch sizes (due to memory requirement):
    - image and video classification
    - segmentation problems
- BN:
    - Pre-computed statistics may differ **between training and testing**.
    - When used in a real-life scenario, the inputs are single images, so no batch normalization can take place.
        - This is an issue because the network learned with normalized data.
        - **Fix**? This is solved by GN.
            - _IN_ and _LN_ perform poorly on image-based learning.
    - Existing workarounds for the reliance on the batch dimension:
        - BR (Batch Renormalisation): improves performance, but still relies on the batch dimension.
        - Hardware scaling: increases complexity and does not fix the underlying reliance on the batch size.
- **Large batch size**: BN better
- **Small batch size**: GN better
- **GN outperforms both LN and IN**.
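
The relation between GN, LN and IN can be sketched for a single sample. The function below is a simplified illustration, not the paper's implementation: it takes one sample as a list of channels (each a list of spatial values), omits the learned per-channel scale and shift, and uses an assumed `eps` of 1e-5. Note that no batch dimension appears anywhere.

```python
from statistics import fmean, pvariance

def group_norm(channels, num_groups, eps=1e-5):
    """Normalise one sample; `channels` is a list of C lists of spatial values."""
    num_channels = len(channels)
    assert num_channels % num_groups == 0, "channels must divide evenly into groups"
    size = num_channels // num_groups
    out = []
    for start in range(0, num_channels, size):
        group = channels[start:start + size]
        values = [v for ch in group for v in ch]
        mean = fmean(values)
        var = pvariance(values, mean)  # population variance of the group
        scale = (var + eps) ** -0.5
        out.extend([[(v - mean) * scale for v in ch] for ch in group])
    return out

sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]  # 4 channels
ln_like = group_norm(sample, num_groups=1)  # one group: behaves like LN
gn = group_norm(sample, num_groups=2)       # groups of 2 channels: GN
in_like = group_norm(sample, num_groups=4)  # one channel per group: behaves like IN
```

With one group the statistics cover all channels (as in LN), and with as many groups as channels each channel is normalised on its own (as in IN); GN covers the settings in between.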

## Paper 10: Empirical Comparisons of Optimizers for DL
**Finding**
- There are many optimizers, but it is unclear how well an optimizer will generalize to new workloads (datasets).
    - **SGD** sometimes outperforms **Adam**.
        - _Is this due to a bad optimization schedule?_
- An optimizer is a **combination of an update rule** (Adam, NAdam, SGD, Momentum, Nesterov) and its hyperparameters.

<img src="./image/optimizers_comparison.png" height="75" />

- The authors:
    - propose an inclusion hierarchy / taxonomy.
    - investigate whether this inclusion taxonomy holds in practice, and therefore provide an optimization protocol for the hyperparameter search such that a more general optimizer always matches or outperforms a simpler one.
    - find that with the proposed optimization protocol the taxonomy holds for all tested datasets and models.
- General optimizers never underperform specialized ones
- Hyperparameter search may be the most important factor in optimizer rankings
- The downside is that one **needs enough computational power** to follow the proposed protocol.
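
The inclusion idea can be illustrated with the simplest pair: momentum with its momentum coefficient set to zero performs exactly the same updates as plain SGD, so a well-tuned momentum optimizer can always match SGD. The toy objective f(w) = w², the learning rate and the step count below are assumptions chosen for illustration.

```python
def grad(w):
    """Gradient of the toy objective f(w) = w**2."""
    return 2.0 * w

def run_sgd(w, lr, steps):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_momentum(w, lr, beta, steps):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)  # beta = 0 keeps only the gradient
        w = w - lr * velocity
    return w

sgd_result = run_sgd(1.0, lr=0.1, steps=10)
momentum_as_sgd = run_momentum(1.0, lr=0.1, beta=0.0, steps=10)
momentum_result = run_momentum(1.0, lr=0.1, beta=0.9, steps=10)

print(sgd_result == momentum_as_sgd)  # the two trajectories coincide
```

Searching over `beta` therefore includes the SGD solution as a special case, which is why the hyperparameter search matters so much for the ranking.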


## Paper 11: Attention Is All You Need

**Contributions**:
- Word2vec turns a discrete (one-hot-encoded) word into a dense d-dimensional vector.
- Context is learned via:
    - Skip gram (from center word to k surrounding) 
    - CBOW (continuous bag-of-words, from surrounding k words to center)
- Self-attention is a parallelizable alternative to an RNN. It is a seq2seq operation in which each input vector is converted (learned) into an output vector that contains information from the other vectors in the sequence.
- Uses similarities between words
    - n^2 comparisons via dot product; a softmax is applied and the resulting matrix is used to take a weighted sum of the original word vectors.
- The three vectors have the following functions:
    - the value is used for the weighted sum
    - the query is used to compare a word to the other vectors
    - the key is what the queries of the other words are compared against
- Transformers are used because they support parallelization, variable-length input, and distant relationships
- Has encoder, decoder, pre and post processing
- Does not use RNN, GRU or LSTM layers and is therefore faster to train than recurrent models
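
The dot-product, softmax and weighted-sum steps can be sketched as follows. This is a deliberately simplified version: the queries, keys and values are the input vectors themselves (the learned projection matrices are left out, which is an assumption for brevity), and the scores are scaled by the square root of the dimension as in scaled dot-product attention.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplest form: queries, keys and values are the input vectors themselves."""
    d = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Compare this query against every key (n comparisons per query, n^2 overall).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in vectors]
        row = softmax(scores)
        weights.append(row)
        # The output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w * value[i] for w, value in zip(row, vectors)) for i in range(d)])
    return outputs, weights

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy "word" vectors
outputs, weights = self_attention(vectors)
```

Each row of `weights` sums to one, and every output vector mixes information from all positions in the sequence, which is what allows the computation to run in parallel across positions.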

## Paper 12: Single Headed Attention RNN: Stop Thinking With Your Head
**Background**:
- Current NLP research relies heavily on Transformers

**Motivation**
- Challenge the trend by proposing a new research direction that needs less computational power and a simpler model, yet nearly reaches SOTA; thus **there should still exist competition and variety in the types of models for a task.**
- Investigate a language modelling technique that does not rely on “Transformers” (current SOTA)
- Lone GPU vs Expensive computation
- Single head vs multi-headed attention (memory intensive)

**Used**: 
- SHA-RNN: LSTM + single head attention

**Contribution**:
- A single attention head performs better than no attention head (four heads perform slightly better, but training takes twice as long)

## Paper 13: Unpaired Img2Img Translation using Cycle-Consistent Adversarial Networks
**Contributions**:
- Neural style transfer
- Cycle consistency
    - Converting a zebra to a horse and then back to a zebra should give the original image.
- Mode collapse
    - The generator produces the same label maps regardless of the input photo, i.e. only a limited variety of samples
- Struggles with transferring shapes
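
The cycle-consistency idea can be sketched as a reconstruction loss: translate an image to the other domain and back, then measure the L1 distance to the original. The three generator functions below are hypothetical stand-ins for trained networks, and the "image" is just a short list of pixel values.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(image, forward, backward):
    """Distance between an image and its reconstruction backward(forward(image))."""
    return l1_distance(backward(forward(image)), image)

# Hypothetical stand-ins for the two generators.
def zebra_to_horse(img):
    return [p * 0.5 for p in img]

def horse_to_zebra(img):
    return [p * 2.0 for p in img]        # exact inverse of zebra_to_horse

def lossy_horse_to_zebra(img):
    return [p * 1.5 for p in img]        # imperfect inverse

zebra = [0.2, 0.4, 0.8]
consistent = cycle_consistency_loss(zebra, zebra_to_horse, horse_to_zebra)
inconsistent = cycle_consistency_loss(zebra, zebra_to_horse, lossy_horse_to_zebra)
print(consistent, inconsistent)
```

A pair of generators that are inverses of each other gives zero loss, while an imperfect round trip is penalised.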

## Paper 14: Critical analysis of self-supervision
**Contributions**:
- BiGAN:
    - extension of GAN
- RotNet:
    - predict "upright" direction of image
- One image is sufficient to train early layers
- Strong supervision superior to self-supervision in deeper layers
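
The RotNet pretext task needs no human labels: each image is rotated by 0, 90, 180 and 270 degrees, and the rotation index serves as the label the network must predict. A minimal sketch of generating such training examples, using a tiny 2x2 "image" as an assumed example:

```python
def rotate90(image):
    """Rotate a 2D list of pixels by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotation_examples(image):
    """Return (rotated image, label) pairs; label k means a rotation of k * 90 degrees."""
    examples = []
    current = image
    for label in range(4):
        examples.append((current, label))
        current = rotate90(current)
    return examples

image = [[1, 2],
         [3, 4]]
examples = rotation_examples(image)
for rotated, label in examples:
    print(label, rotated)
```

Predicting the label amounts to recognising which way is "upright", which is the self-supervised signal.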
