# Using representations across domains - Transfer learning

- Presupposition up to this point: we build models specifically for the given task and always start training "from scratch".
- We do not take intuitions, insights, models or hyperparameters from one domain and apply them to another one.
- If models generalize well, we should be able to reuse the knowledge stored in one trained model in another, similar case. (What counts as a "similar task" is non-obvious!)

Main source for discussion: [Michèle Sebag: Representation Learning, Domain Adaptation and Generative Models with Deep Learning. DeepLearn2018 Summer School, Genova](https://drive.google.com/file/d/1XZjoYuOxdPUPGZFAVw2_9Ob78msCKzNF/view?usp=sharing)

A very good summary can be found at the blog of [Sebastian Ruder](http://ruder.io/transfer-learning/). 

There is a continuum between Domain Adaptation, Transfer Learning, and Multi-Task learning.

- **Domain Adaptation**: same task, but the "domain" of the task is different (adapting a spam filter from one user to another user)
- **Transfer learning**: different task (telling apart cars and telling apart animals, or even a classification and a regression), usually with the tasks done sequentially
- **Multi-Task learning**: strengthen the model's representation by giving it multiple different tasks at once (with the idea that the tasks have at least something in common)


## Motivation and basics

* Task: classification or regression
* A source domain with distribution $D_s$
* A target domain with distribution $D_t$

**Idea:**
* Source and target are “sufficiently” related
* ... one wants to use source data to improve learning from target data

### Settings

<img src="http://drive.google.com/uc?export=view&id=1nu7qrEhJz0obfwEzcTqB9FbJe4MSHK-T" width=600 heigth=600>
<img src="http://drive.google.com/uc?export=view&id=11neGgvY5dy5BaS34edW7KfjY9uCdmspl" width=600 heigth=600>

### Warning! - Transferring bias

The myth of the "Tank detector":
https://www.gwern.net/Tanks#could-something-like-it-happen

<img src="https://i.ytimg.com/vi/yvuKQkGVjJE/hqdefault.jpg" width=400 heigth=400>

It may be an "urban legend", but it shows that a mere environmental constant in a dataset can become a biasing factor, and it is easy to imagine such biases hurting performance in a different setting. (It also draws attention, once again, to the paramount importance of model validation!)

## Some baseline supervised methods:

### Learn source and target domain in union

[J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007](https://papers.nips.cc/paper/3075-correcting-sample-selection-bias-by-unlabeled-data.pdf)

- Default assumption in many learning scenarios: training and test data are drawn independently and identically (iid) from the same distribution
- In practice the available data have often been collected in a biased manner, while the test is performed over a more general target population
- Training and test data are then drawn from different distributions, commonly referred to as sample selection bias
- Special case: covariate shift (the distribution of the inputs changes, while the conditional distribution of the labels given the inputs stays the same)
- Remedy: match the distributions of the training and test sets in feature space
- Account for the difference between Pr(x, y) and Pr′(x, y) by reweighting the training points such that the means of the training and test points are close in a reproducing kernel Hilbert space (RKHS)
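
The reweighting in the last bullet can be written down compactly. As a sketch following the setup of the cited paper, let $\Phi$ be the feature map into the RKHS, $x_i$ the $m$ training points, $x'_j$ the $m'$ test points, and $\beta_i$ the weight attached to training point $i$ (the upper bound $B$ and the tolerance $\epsilon$ are tuning constants):

$$
\min_{\beta}\ \left\| \frac{1}{m}\sum_{i=1}^{m}\beta_i\,\Phi(x_i) - \frac{1}{m'}\sum_{j=1}^{m'}\Phi(x'_j) \right\|^2
\quad\text{s.t.}\quad \beta_i\in[0,B],\quad \left|\frac{1}{m}\sum_{i=1}^{m}\beta_i - 1\right|\le\epsilon
$$

Under covariate shift the ideal weights are the density ratios $\beta_i \approx \Pr'(x_i)/\Pr(x_i)$, so the $\beta$-weighted training loss approximates the loss under the test distribution. Note that no test labels are needed to estimate the weights.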

Since the source usually has much more data, the target can get "oppressed": the few target examples barely influence the jointly trained model.

###  Label based on source classifier, add that as input, learn new classifier

[H. Daumé, D. Marcu, Domain Adaptation for Statistical Classifiers](https://www.aaai.org/Papers/JAIR/Vol26/JAIR-2603.pdf)


- Situation: labeled out-of-domain data is plentiful, but labeled in-domain data is scarce
- Statistical formulation of the problem in terms of a simple mixture model
- In-domain data is treated as drawn from a mixture of two distributions: a “truly in-domain” distribution and a “general domain” distribution
- Out-of-domain data is treated as drawn from a mixture of a “truly out-of-domain” distribution and the same “general domain” distribution
- The inference algorithm is based on the technique of conditional expectation maximization


### Learn two classifiers, one for source, one for target and weight their outputs as "votes"

[H. Daumé, D. Marcu, Domain Adaptation for Statistical Classifiers](https://www.aaai.org/Papers/JAIR/Vol26/JAIR-2603.pdf)
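
A minimal sketch of the "weighted votes" baseline, assuming both classifiers output class probabilities and that the interpolation weight is chosen on a small held-out target set. The function names (`interpolate_votes`, `pick_alpha`) and all numbers are hypothetical and only serve as illustration:

```python
import numpy as np

def interpolate_votes(p_source, p_target, alpha):
    """Blend class probabilities of two classifiers; alpha weights the target model."""
    return alpha * p_target + (1.0 - alpha) * p_source

def pick_alpha(p_source, p_target, y_true, grid=np.linspace(0.0, 1.0, 11)):
    """Choose the weight with the best accuracy on held-out target data."""
    accuracies = [
        np.mean(interpolate_votes(p_source, p_target, a).argmax(axis=1) == y_true)
        for a in grid
    ]
    return float(grid[int(np.argmax(accuracies))])

# Toy held-out target set: 4 examples, 2 classes (made-up probabilities).
p_src = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.8, 0.2]])
p_tgt = np.array([[0.6, 0.4], [0.2, 0.8], [0.4, 0.6], [0.25, 0.75]])
y_dev = np.array([0, 1, 1, 1])

alpha = pick_alpha(p_src, p_tgt, y_dev)          # smallest grid weight with best accuracy
blended = interpolate_votes(p_src, p_tgt, alpha)
```

With `alpha = 0` this falls back to the source-only classifier and with `alpha = 1` to the target-only one, so the interpolation can never do worse on the held-out set than the better of the two.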

### Use source model as prior

Initialize the target classifier with the weights learned on the source, and during training regularize the weights towards these original values.
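
One way to read "regularize towards the original weights" is an extra penalty `lam * ||w - w_source||^2` on top of the target loss. The sketch below assumes a toy linear regression trained by plain gradient descent; `fit_with_source_prior`, the data and all constants are hypothetical:

```python
import numpy as np

def fit_with_source_prior(X, y, w_source, lam, lr=0.01, steps=5000):
    """Gradient descent on mean squared error + lam * ||w - w_source||^2."""
    w = w_source.copy()                      # initialize with the source weights
    for _ in range(steps):
        grad_fit = 2.0 * X.T @ (X @ w - y) / len(y)
        grad_prior = 2.0 * lam * (w - w_source)
        w -= lr * (grad_fit + grad_prior)
    return w

rng = np.random.default_rng(0)
w_source = np.array([1.0, -1.0])             # pretend this came from source training
w_true_target = np.array([1.5, -0.5])        # the target task is related, not identical
X_t = rng.normal(size=(20, 2))               # only a few target examples
y_t = X_t @ w_true_target

w_free = fit_with_source_prior(X_t, y_t, w_source, lam=0.0)    # no prior: pure target fit
w_tied = fit_with_source_prior(X_t, y_t, w_source, lam=10.0)   # strong pull to the source
```

The strength `lam` controls the trade-off: a large value keeps the model close to what the source taught it, a small value lets the scarce target data dominate.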

#### Detour: Catastrophic forgetting

- What if, while learning the new task, we do not just acquire new abilities but also "forget" what we have learned so far?
- There is no guarantee that updates based on the error gradients of the new task do not lead the model weights away from the optimum learned before (or better to say: it is nearly certain that they do...)
- It is difficult to ensure that learning does not strongly degrade prior knowledge (and to decide which prior knowledge may be degraded).

**[Overcoming catastrophic forgetting in neural networks](https://arxiv.org/abs/1612.00796)**

- One idea: make it harder to change the weights that were important for solving prior tasks.
- A quadratic penalty term can be used to keep these weights close to their original values. The authors call this "elastic weight consolidation".
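
As a sketch of such a quadratic term: each weight gets its own importance, and moving an important weight away from its old value costs more. In the paper the importance is estimated from the diagonal of the Fisher information matrix; here it is simply given as made-up numbers, and the function names are hypothetical:

```python
import numpy as np

def ewc_penalty(theta, theta_old, importance, lam):
    """Quadratic pull towards the old weights, scaled per weight by its importance."""
    return 0.5 * lam * np.sum(importance * (theta - theta_old) ** 2)

def ewc_penalty_grad(theta, theta_old, importance, lam):
    """Gradient of the penalty; it is added to the gradient of the new task's loss."""
    return lam * importance * (theta - theta_old)

theta_old = np.array([1.0, 2.0])      # weights after training on the old task
importance = np.array([10.0, 0.1])    # made-up importances (e.g. Fisher estimates)
theta = np.array([1.5, 3.0])          # weights drifting while learning the new task

penalty = ewc_penalty(theta, theta_old, importance, lam=1.0)
grad = ewc_penalty_grad(theta, theta_old, importance, lam=1.0)
```

The second weight moved twice as far as the first, yet almost all of the penalty and of the restoring gradient comes from the first, important weight.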

<img src="https://image.slidesharecdn.com/overcomingcatastrophicforgettinginneuralnetwork-170429024916/95/overcoming-catastrophic-forgetting-in-neural-network-9-638.jpg?cb=1493434190">

An interesting description of "catastrophic forgetting" in NLP context can be found [here](https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting).

### Unsupervised transfer learning - Distance of tasks 

####  Feature augmentation

[H. Daumé: Frustratingly Easy Domain Adaptation. In ACL 2007](https://arxiv.org/abs/0907.1815)
- Take each feature in the original problem and make three versions of it: a general version, a source-specific version and a target-specific version.
- Augmented source data contains only the general and source-specific versions (the target-specific version is set to zero).
- Augmented target data contains only the general and target-specific versions (the source-specific version is set to zero).
- More generally, one can define a systematic mapping between source and target data as a transformation.
- Here the "distance of tasks" is relevant: we presume that a systematic mapping transforms source to target.
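
The augmentation in the first three bullets takes only a few lines. A minimal sketch, assuming dense NumPy feature matrices; the function name `augment` and the toy data are hypothetical:

```python
import numpy as np

def augment(X, domain):
    """Map features to [general, source-specific, target-specific] copies."""
    zeros = np.zeros_like(X)
    if domain == "source":
        return np.hstack([X, X, zeros])
    if domain == "target":
        return np.hstack([X, zeros, X])
    raise ValueError("domain must be 'source' or 'target'")

X_source = np.array([[1.0, 2.0]])
X_target = np.array([[3.0, 4.0]])

# Both domains now live in one feature space and can be fed to any standard learner.
X_train = np.vstack([augment(X_source, "source"), augment(X_target, "target")])
```

A learner trained on `X_train` can put weight on the general copy of a feature when it behaves the same in both domains, and on a domain-specific copy when it does not.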

<img src="http://drive.google.com/uc?export=view&id=1lhdJQBM61Dxg8fH0nMbBezJK9qT2oQrf" width=600 heigth=600>

For one such method, see [Baochen Sun, Jiashi Feng, Kate Saenko: Return of Frustratingly Easy Domain Adaptation. AAAI 2016](https://arxiv.org/abs/1511.05547), quoted by Sebag.
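
The method of that paper (CORAL, correlation alignment) is itself a linear transformation of the source features: whiten them with the source covariance, then re-color them with the target covariance. The following is a sketch under the assumption of already normalized features; the helper names, the `reg` parameter (the identity matrix added to both covariances) and the toy data are chosen here for illustration:

```python
import numpy as np

def sym_matrix_power(C, p):
    """Power of a symmetric positive definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(C)
    return (eigvecs * eigvals ** p) @ eigvecs.T

def coral(X_source, X_target, reg=1.0):
    """Whiten source features, then re-color them with the target covariance."""
    d = X_source.shape[1]
    C_s = np.cov(X_source, rowvar=False) + reg * np.eye(d)
    C_t = np.cov(X_target, rowvar=False) + reg * np.eye(d)
    whitened = X_source @ sym_matrix_power(C_s, -0.5)
    return whitened @ sym_matrix_power(C_t, 0.5)

def cov_gap(A, B):
    """Frobenius distance between the covariance matrices of two samples."""
    return np.linalg.norm(np.cov(A, rowvar=False) - np.cov(B, rowvar=False))

rng = np.random.default_rng(0)
X_s = rng.normal(size=(500, 3))
mix = np.array([[3.0, 1.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]])
X_t = rng.normal(size=(500, 3)) @ mix        # target has a different covariance

X_s_aligned = coral(X_s, X_t)
gap_before = cov_gap(X_s, X_t)
gap_after = cov_gap(X_s_aligned, X_t)        # smaller than gap_before
```

No target labels are used: a classifier is trained on `X_s_aligned` with the source labels and then applied to the untouched target features.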

#### Learning of "general" representations

- One can argue that by learning on a big enough dataset, a model learns useful "general" features that can be reused later on.
- This is indeed one of the motivations behind training autoencoders (on large amounts of unlabeled data), or behind **re-using model representations** from discriminative models, trained e.g. on ImageNet.

**The approach capitalizes on the notion that deep networks gradually learn hierarchical representations of features, and only the last couple of layers learn a classifier.**

**Thus it is common practice to take a trained ConvNet, throw away the last 1-2 layers and train it for a new task. Remember, the last layer is a softmax over the classes of the original task...**
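
A minimal PyTorch sketch of this practice. To stay self-contained it uses a small stand-in network instead of a real ImageNet ConvNet; layer sizes, the 5-class new task and the random data are all assumptions for illustration:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a pretrained network: feature layers followed by a 1000-class head.
pretrained = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1000),              # old task-specific classifier
)

# Keep everything except the last layer and freeze it.
features = nn.Sequential(*list(pretrained.children())[:-1])
for param in features.parameters():
    param.requires_grad = False

# Attach a fresh, randomly initialized head for the new 5-class task.
new_head = nn.Linear(64, 5)
model = nn.Sequential(features, new_head)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(new_head.parameters(), lr=0.1)

x = torch.randn(8, 32)                # a batch of made-up inputs
y = torch.randint(0, 5, (8,))         # made-up labels for the new task
frozen_before = features[0].weight.clone()

loss = nn.functional.cross_entropy(model(x), y)   # softmax is part of the loss
optimizer.zero_grad()
loss.backward()
optimizer.step()                      # updates the new head only
```

With a real pretrained model the recipe is the same: replace the last layer(s), freeze the rest, and train the new head on the target data.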

### How to prevent catastrophic forgetting?

To answer the question properly, we have to think over **where and how the forgetting happens** in the first place.

Since the topmost (e.g. softmax) layers are initialized randomly, they most probably contribute most to the error. It can be argued that until they "mature" a bit, they are the main culprits we have to work on. There is growing evidence that **the top layers have to be trained first, then the lower layers can be fine-tuned.**

(This makes all the more sense because the lower the layer, the more basic the features it represents, so it most probably generalizes better; it makes no sense to hit these layers hard with big modifications at first.)

Enter:

#### Gradual unfreezing

The idea is pretty simple: in the early part of transfer-learning-based training, do not allow weight updates for the majority of the network, and only **gradually unfreeze** the layers later on.
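
A sketch of such a schedule in PyTorch, assuming the network is available as an ordered list of layers and that one more layer is unfrozen per epoch, starting from the top. The helper `set_trainable_top_layers` and the toy layer stack are hypothetical:

```python
import torch
from torch import nn

def set_trainable_top_layers(layers, n_unfrozen):
    """Freeze all layers except the last `n_unfrozen` ones."""
    first_trainable = max(len(layers) - n_unfrozen, 0)
    for index, layer in enumerate(layers):
        for param in layer.parameters():
            param.requires_grad = index >= first_trainable

# Toy stack of four layers; the last one plays the role of the new head.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(3)] + [nn.Linear(16, 4)])

trainable_per_epoch = []
for epoch in range(4):
    set_trainable_top_layers(layers, n_unfrozen=epoch + 1)   # one more layer per epoch
    n_trainable = sum(
        all(p.requires_grad for p in layer.parameters()) for layer in layers
    )
    trainable_per_epoch.append(n_trainable)
    # ... run one epoch of training here with the currently trainable parameters ...
```

Frozen parameters receive no gradient, so an optimizer built over all parameters simply skips them until they are unfrozen.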

<img src="https://humboldt-wi.github.io/blog/img/seminar/group4_ULMFiT/Figure_21.png" width=65%>

See a detailed analysis of Ruder and co.'s ULMFiT [here](https://medium.com/explorations-in-language-and-learning/transfer-learning-in-nlp-2d09c3dfaeb6).

#### Soft version: Differential learning rates

The concept of [differential learning rates](https://blog.slavv.com/differential-learning-rates-59eff5209a4f) is in a sense a soft version of freezing: instead of fixing the lower layers, a smaller learning rate is used for their updates, which - at least in theory - should help them avoid forgetting.
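
In PyTorch this can be sketched with per-layer parameter groups of the optimizer. The helper `layerwise_param_groups`, the toy layers and the factor of 10 between neighbouring layers are assumptions made for illustration, not values taken from the linked post:

```python
import torch
from torch import nn

def layerwise_param_groups(layers, top_lr, decay):
    """One optimizer group per layer; the rate shrinks by `decay` per layer going down."""
    n = len(layers)
    return [
        {"params": list(layer.parameters()), "lr": top_lr / decay ** (n - 1 - index)}
        for index, layer in enumerate(layers)
    ]

layers = nn.ModuleList([nn.Linear(16, 16), nn.Linear(16, 16), nn.Linear(16, 4)])
optimizer = torch.optim.SGD(
    layerwise_param_groups(layers, top_lr=0.01, decay=10.0), lr=0.01
)

# Lowest layer first: roughly 1e-4, 1e-3, 1e-2
learning_rates = [group["lr"] for group in optimizer.param_groups]
```

The top layer trains at the full rate, while the lowest, most general layer moves a hundred times more slowly.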

**In practice, though, many times an appropriately small (think: one order of magnitude smaller) global learning rate is sufficient.**

<img src="https://cdn-images-1.medium.com/max/2400/1*BEyoI-p1FGTV0PUoGTxz4Q.png" width=70%>

#### Verdict is still out

The development of more effective transfer learning methods is far from finished; there are quite [recent papers](https://arxiv.org/abs/1812.01640) that offer a good survey:

<img src="http://drive.google.com/uc?export=view&id=1V03HbhaHSRboTubR5afsMh0G6TTbkMMw"  width=55%>

## Detour: Weak supervision - An interesting current solution

[Exploring the Limits of Weakly Supervised Pretraining](https://research.fb.com/wp-content/uploads/2018/05/exploring_the_limits_of_weakly_supervised_pretraining.pdf)

- A research group at Facebook wanted to train a discriminative model on an exceptionally large dataset, bigger than manual tagging allows
- They decided to capitalize on "weak supervision", that is, the noisy "labeling" given by the hashtags people attach to their Instagram pictures
- They took **3.5 billion (!!!)** Instagram pictures and trained a ConvNet on them.
- Annotation quality is inferior, but the sheer amount of data lets the model beat the state of the art in transfer scenarios.

This was also a _huge_ implementation challenge, well worth a read [here](https://www.facebook.com/ross.girshick/posts/10160363792300261)  and [here](https://code.facebook.com/posts/1700437286678763/)


## Multi task learning

Basic idea: it is advantageous to force a model to learn multiple tasks concurrently!
The promise is that the model learns something general about the world, since it is constrained by a variety of data. (AGI, rings a bell? :-)
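
The simplest way to realize this is a shared network body with one small head per task, trained on the sum of the task losses. The sketch below assumes one classification and one regression task on the same inputs; the class name, layer sizes and random data are hypothetical:

```python
import torch
from torch import nn

class SharedTrunkModel(nn.Module):
    """Hard parameter sharing: one shared encoder, one small head per task."""

    def __init__(self, n_inputs, n_hidden, n_classes, n_targets):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.ReLU())
        self.classifier_head = nn.Linear(n_hidden, n_classes)
        self.regression_head = nn.Linear(n_hidden, n_targets)

    def forward(self, x):
        shared = self.trunk(x)
        return self.classifier_head(shared), self.regression_head(shared)

torch.manual_seed(0)
model = SharedTrunkModel(n_inputs=10, n_hidden=32, n_classes=3, n_targets=1)

x = torch.randn(8, 10)                 # made-up batch
y_class = torch.randint(0, 3, (8,))    # labels for the classification task
y_value = torch.randn(8, 1)            # targets for the regression task

logits, values = model(x)
# Summing the task losses sends gradients from both tasks into the shared trunk.
loss = nn.functional.cross_entropy(logits, y_class) + nn.functional.mse_loss(values, y_value)
loss.backward()
```

Because both losses flow through `trunk`, its representation has to serve both tasks, which is exactly the constraint the basic idea relies on.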


### Specific models

#### "Zero-shot translation"
[Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation](https://arxiv.org/pdf/1611.04558.pdf)

Or in a more readable format [here](https://ai.googleblog.com/2016/11/zero-shot-translation-with-googles.html)

- Trained on parallel corpora of _language pairs_
- The only addition: a separate tag naming the target language is put before the input sequence (to inform the model...)
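
The tagging step is nothing more than string preprocessing. A minimal sketch, assuming a token format like `<2es>` for "translate into Spanish" as in the paper's example; the function name is hypothetical:

```python
def add_target_language_token(sentence, target_language):
    """Prefix the source sentence with an artificial token naming the target language."""
    return f"<2{target_language}> {sentence}"

training_inputs = [
    add_target_language_token("Hello, how are you?", "es"),   # English -> Spanish
    add_target_language_token("Hola, ¿cómo estás?", "en"),    # Spanish -> English
]
```

The model architecture itself stays untouched; all language pairs share the same encoder, decoder and vocabulary.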

**More on sequence models later!**

**Result:**

1. It learns to translate between language pairs for which _it was never given data_
2. Its inner representation is _language independent_

<img src="https://2.bp.blogspot.com/-AmBczBtfi3Q/WDSB0M3InDI/AAAAAAAABbQ/1U_51u5ynl4FK4L0KOEllfRCq0Oauzy5wCEw/s640/image00.png">

This result amazed even the researchers, and much analysis went into it.

#### Multi-modality

Even more ambitious experiment: Google ["One Model To Learn Them All"](https://arxiv.org/abs/1706.05137)

- A single model that yields good results on a number of problems spanning multiple domains (language as well as visual tasks!): trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task.
- Model architecture: convolutional layers, an attention mechanism, and sparsely-gated layers
- Each of these computational blocks is crucial for a subset of the tasks the model is trained on.
- Even if a block is not crucial for a task, the authors observe that adding it never hurts performance and in most cases improves it on all tasks.
- Tasks with less data benefit largely from joint training with other tasks, while performance on large tasks degrades only slightly, if at all.

### General description

[Ruder: Multi task learning](https://arxiv.org/abs/1706.05098)
Or more "friendly" description [here](http://ruder.io/multi-task-learning-nlp/)

### Limitations

We have to consider that this question is strongly connected with generalization, be it over the train-test split or to "new", "live" data.

As such, it is also subject to the:

**[No free lunch theorem](https://en.wikipedia.org/wiki/No_free_lunch_theorem)**

_"states that any two optimization algorithms are equivalent when their performance is averaged across all possible problems"_

There is no universal model, not even a human one. **There will always be some things we don't know.**


## Outlook

It is clear that transfer learning, with its great emphasis on generalization, and especially multi-task learning are at the forefront of development towards general AI models.

A recently published, amazing example proposes to use VGG CNNs pre-trained on the standard ImageNet dataset for processing sound spectrograms, with remarkable efficiency. Material presenting state-of-the-art toolkits can be found here: [Björn Schuller: Deep Learning for signal analysis, DeepLearn2018 summer school, Genova](https://drive.google.com/file/d/1GCt1PAISMH-4L7fGuSvp_4DbzIiSXrZk/view?usp=sharing), [paper](https://www.isca-speech.org/archive/Interspeech_2017/pdfs/0434.PDF) and [toolkit](https://github.com/DeepSpectrum/DeepSpectrum). This shows that we still have to be rather humble, and that we still have much to learn about invariances across domains.

