# Multitask Learning

## Learning without forgetting

A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive Neural Networks,” arXiv, vol. cs.LG, Jun. 2016.

Yimeng: The basic idea is that, instead of fine-tuning the parameters of the older network, a new network is built for the new task, with its activations computed as a linear combination of the old network's and the new network's features. This way, older models don't get overwhelmed by new tasks.

Xingyu: Not sure whether this kind of architecture is easy to train, as the parameter space can get very large. A possible idea: since the arrangement of different tasks clearly matters, can we organize the tasks into a graph and search for relations among them as a form of supervision? This could potentially also apply to multi-scale networks. The idea is inspired by FlowWeb <https://people.eecs.berkeley.edu/~tinghuiz/projects/flowWeb/images/teaser_h.png>
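A minimal pure-Python sketch of the idea in Yimeng's note, assuming a two-layer network per task and a single linear lateral connection. The class names, layer sizes, and `U2` are hypothetical illustrations, not the paper's implementation, which may use additional adapter layers.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class Column:
    """One two-layer network (a 'column') trained on a single task."""
    def __init__(self, n_in, n_hidden, n_out, rng):
        self.W1 = rand_matrix(n_hidden, n_in, rng)
        self.W2 = rand_matrix(n_out, n_hidden, rng)

    def forward(self, x):
        h1 = relu(matvec(self.W1, x))
        return h1, matvec(self.W2, h1)

class ProgressiveColumn(Column):
    """New-task column that also reads the old column's hidden layer."""
    def __init__(self, old, n_in, n_hidden, n_out, rng):
        super().__init__(n_in, n_hidden, n_out, rng)
        self.old = old  # frozen: its weights are never updated here
        # Lateral weights (assumed name) from old hidden units to new outputs.
        self.U2 = rand_matrix(n_out, len(old.W1), rng)

    def forward(self, x):
        old_h1, _ = self.old.forward(x)
        h1 = relu(matvec(self.W1, x))
        own = matvec(self.W2, h1)
        lateral = matvec(self.U2, old_h1)
        # Output is a linear combination of new and old features.
        return h1, [a + b for a, b in zip(own, lateral)]
```

Only `W1`, `W2`, and `U2` of the new column would be trained on the new task, so the old column's outputs on its own task stay exactly as they were. The parameter count grows with every added column, which is the concern raised in Xingyu's note.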

~~~
@article{Rusu:2016tj,
author = {Rusu, A A and Rabinowitz, N C and Desjardins, G and Soyer, H and Kirkpatrick, J and Kavukcuoglu, K and Pascanu, R and Hadsell, R},
title = {{Progressive Neural Networks}},
journal = {ArXiv e-prints},
year = {2016},
volume = {cs.LG},
month = jun,
keywords = {deep learning},
read = {Yes},
rating = {3},
date-added = {2017-02-16T15:54:49GMT},
date-modified = {2017-02-17T20:09:57GMT},
url = {http://arxiv.org/abs/1606.04671},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2016/Rusu/arXiv%202016%20Rusu.pdf},
file = {{arXiv 2016 Rusu.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2016/Rusu/arXiv 2016 Rusu.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/992600EF-5369-45A1-99A9-CC9CE97738E7}}
}
~~~

## Bayesian Neural Networks for weighting tasks

Kendall, A., Gal, Y. & Cipolla, R. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. arXiv, cs.CV, 1705.07115 (2017).

Faisal Notes: When training one network to do multiple tasks, people previously either weighted the tasks equally or tuned the weights as hyperparameters. This paper does something clever: in the loss function, the weight on each task is dynamic, learned during training and determined by the network's uncertainty on that task. The idea is a bit like optimal cue integration. They decompose uncertainty into philosophical categories (epistemic vs. aleatoric, with aleatoric further split into data-dependent and task-dependent). I am unclear how this decomposition relates to the bias-variance decomposition of model error, but it is probably all a subdivision of the variance term.

* tasks: semantic segmentation, instance segmentation (assigning pixels to an individual object, as opposed to just a class), depth regression
* the uncertainty term is a task-dependent, homoscedastic uncertainty
* The loss function minimizes a negative log likelihood. The likelihood is Gaussian for regression outputs and softmax (multinomial) for classification outputs, so the parameters being optimized are the network weights and the noise variances, one per task. The mean of each Gaussian is taken to be the network output.
* The log-variance term in the loss acts as a regularizer: it penalizes large variances, so that no task weight collapses to zero (i.e., no task gets ignored).
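The bullets above can be sketched as a small function. This is an illustration under stated assumptions, not the authors' code: the function name and argument layout are made up here, and each task is parameterized by `s = log(sigma^2)`, which the paper describes as the numerically stable choice.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars, kinds):
    """Combine per-task losses using learned homoscedastic uncertainty.

    task_losses: raw loss per task (e.g. squared error, cross-entropy)
    log_vars:    s_i = log(sigma_i^2), one trainable scalar per task
    kinds:       "regression" or "classification" for each task
    """
    total = 0.0
    for loss, s, kind in zip(task_losses, log_vars, kinds):
        precision = math.exp(-s)  # 1 / sigma^2, acts as the task weight
        # Gaussian likelihood gives 1/(2 sigma^2); scaled softmax gives 1/sigma^2.
        scale = 0.5 if kind == "regression" else 1.0
        # 0.5 * s equals log(sigma): the regularizer that keeps sigma finite.
        total += scale * precision * loss + 0.5 * s
    return total
```

For a fixed regression loss `L`, the combined term is minimized at `s = log(L)`, so a task with a larger residual loss ends up with a larger variance and a smaller weight. In training, `log_vars` would be updated by gradient descent together with the network weights.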

>This approach is inspired by [32] which identifies instances using Hough votes from object parts. In this work we extend this idea by using votes from individual pixels using deep learning.

* Another cool part of this paper is that the instance segmentation uses something that looks like Alan's compositional model, where parts or pixels vote.
* Further, this task uses a density-based clustering algorithm (OPTICS) that doesn't need the number of clusters as a hyperparameter, which I am very curious about.
* The dataset (CityScapes road scenes) is the kind of data that might be seen in autonomous vehicles.
* The authors are part of the machine learning group at the University of Cambridge, and have been really pushing the idea of fusing deep networks and Bayesian inference.

~~~
@article{Kendall2017,
abstract = {Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.},
archivePrefix = {arXiv},
arxivId = {1705.07115},
author = {Kendall, Alex and Gal, Yarin and Cipolla, Roberto},
eprint = {1705.07115},
file = {:Users/faisal/Documents/Papers/1705.07115.pdf:pdf},
mendeley-groups = {Deep Learning},
month = {may},
title = {{Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics}},
url = {http://arxiv.org/abs/1705.07115},
year = {2017}
}
~~~

# Transferability

## Transferability of CNN features

J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” presented at the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 3320–3328.

Notes: Check the end of p. 2 for a summary of observations. Some may not make sense at first; check Section 4, especially my highlighted parts, to understand them.

In addition to these observations, I should add that random weights don't work on a big dataset.

4.3 Random weights

I think the network here has 8 layers, and for a given n (say n=3), they initialize the network randomly, freeze the first n layers, and then train the remaining layers.

Also, I'm not sure which training set is used here: 500 random classes, or the natural or man-made split? It should not matter, though.
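The freeze-then-train setup described above can be sketched as follows. This assumes a toy representation where each layer is a flat list of weights; the function names and the `fine_tune` flag are hypothetical and only illustrate the bookkeeping, not the paper's Caffe setup. Passing an untrained `source` gives the random-weights experiment; passing a trained base network gives the transfer experiments.

```python
import copy
import random

N_LAYERS = 8  # the notes above assume an 8-layer network

def init_network(rng, n_layers=N_LAYERS, width=4):
    """Toy stand-in: every layer is just a flat list of random weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(width)] for _ in range(n_layers)]

def build_transfer_network(source, n, rng, fine_tune=False):
    """Copy the first n layers from `source` and re-initialize the rest.

    Returns (layers, trainable), where trainable[i] says whether layer i
    receives gradient updates. fine_tune=True unfreezes the copied layers.
    """
    fresh = init_network(rng, len(source), len(source[0]))
    layers = copy.deepcopy(source[:n]) + fresh[n:]
    trainable = [fine_tune or i >= n for i in range(len(layers))]
    return layers, trainable

def sgd_step(layers, grads, trainable, lr=0.01):
    """Plain SGD update that skips frozen layers."""
    for layer, grad, train in zip(layers, grads, trainable):
        if train:
            for j, g in enumerate(grad):
                layer[j] -= lr * g
```

With `fine_tune=False` the copied layers stay fixed, which matches the frozen variants in the paper; `fine_tune=True` corresponds to the fine-tuned variants, where the copied layers only serve as initialization.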


* p.3320: Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. -- Highlighted Mar 29, 2017
* p.3320: optimization difficulties related to splitting networks between co-adapted neurons, -- Highlighted Mar 29, 2017
* p.3320: but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset. -- Highlighted Mar 29, 2017
* p.3322: Fortunately, in ImageNet we are also provided with a hierarchy of parent classes. This information allowed us to create a special split of the dataset into two halves that are as semantically different from each other as possible: -- Highlighted Mar 29, 2017
* p.3322: In Section 4.2 we will show that features transfer more poorly (i.e. they are more specific) when the datasets are less similar. -- Highlighted Mar 29, 2017
* p.3323: The main experiment has random A/B splits and is discussed in Section 4.1 -- Highlighted Mar 29, 2017
* p.3325: This performance drop is evidence that the original network contained fragile co-adapted features on successive layers, that is, features that interact with each other in a complex or fragile way such that this co-adaptation could not be relearned by the upper layers alone. -- Highlighted Mar 29, 2017
* p.3325: Gradient descent was able to find a good solution the first time, but this was only possible because the layers were jointly trained. -- Highlighted Mar 29, 2017
* p.3325: Alternately, we may say that there is less co-adaptation of features between layers 6 & 7 and between 7 & 8 than between previous layers. -- Highlighted Mar 29, 2017
* p.3325: it has not been previously observed in the literature that such optimization difficulties may be worse in the middle of a network than near the bottom or top. -- Highlighted Mar 29, 2017
* p.3325: Thanks to the BnB points, we can tell that this drop is from a combination of two separate effects: the drop from lost co-adaptation and the drop from features that are less and less general. -- Highlighted Mar 29, 2017
* p.3325: We believe this is the first time that (1) the extent to which transfer is successful has been carefully quantified layer by layer, and (2) that these two separate effects have been decoupled, showing that each effect dominates in part of the regime. -- Highlighted Mar 29, 2017
* p.3325: Thus, a plausible explanation is that even after 450k iterations of fine-tuning (beginning with completely random top layers), the effects of having seen the base dataset still linger, boosting generalization performance. It is surprising that this effect lingers through so much retraining. -- Highlighted Mar 29, 2017
* p.3325: This generalization improvement seems not to depend much on how much of the first network we keep to initialize the second network: keeping anywhere from one to seven layers produces improved performance, with slightly better performance as we keep more layers. -- Highlighted Mar 29, 2017
* p.3326: We also compare to random, untrained weights because Jarrett et al. (2009) showed — quite strikingly — that the combination of random convolutional filters, rectification, pooling, and local normalization can work almost as well as learned features. -- Highlighted Mar 29, 2017
* p.3326: It is natural to ask whether or not the nearly optimal performance of random filters they report carries over to a deeper network trained on a larger dataset -- Highlighted Mar 29, 2017
* p.3326: getting random weights to work in convolutional neural networks may not be as straightforward as it was for the smaller network size and smaller dataset used by Jarrett et al. (2009). -- Highlighted Mar 29, 2017
* p.3326: First, the transferability gap when using frozen features grows more quickly as n increases for dissimilar tasks (hexagons) than similar tasks -- Highlighted Mar 29, 2017
* p.3326: Second, transferring even from a distant task is better than using random filters. -- Highlighted Mar 29, 2017

~~~
@inproceedings{Yosinski:2014wc,
author = {Yosinski, Jason and Clune, Jeff and Bengio, Yoshua and Lipson, Hod},
title = {{How transferable are features in deep neural networks?}},
booktitle = {Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada},
year = {2014},
editor = {Ghahramani, Zoubin and Welling, Max and Cortes, Corinna and Lawrence, Neil D and Weinberger, Kilian Q},
pages = {3320--3328},
annote = {Great paper on finetuning.

},
keywords = {deep learning},
read = {Yes},
rating = {5},
date-added = {2017-03-29T19:46:20GMT},
date-modified = {2017-03-30T15:01:58GMT},
url = {http://papers.nips.cc/paper/5347-how-transferable-are-features-in-deep-neural-networks},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Yosinski/NIPS%202014%202014%20Yosinski.pdf},
file = {{NIPS 2014 2014 Yosinski.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Yosinski/NIPS 2014 2014 Yosinski.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/6FF7EA19-695A-4FE9-B6E3-4C9619FC54EA}}
}
~~~