# Omnidirectional Transfer for Quasilinear Lifelong Learning

[![Paper](https://img.shields.io/badge/Paper-arXiv-green)](https://arxiv.org/pdf/2004.12908.pdf)

## Introduction

* Biological learning is lifelong: agents continually build on past knowledge and experience, improving on many tasks given data associated with any task (e.g., learning a second language can improve an individual's performance in their native language).
* Although classical ML can optimize for multiple tasks simultaneously, it is difficult to optimize for multiple tasks sequentially.
* Catastrophic forgetting: performance on prior tasks drops precipitously upon training on new tasks.
* Biological learning doesn't suffer from catastrophic forgetting
* Two camps of approaches to overcoming catastrophic forgetting:
    * Fixed resources: reallocate existing resources (compressing representations) to incorporate new knowledge (**biologically, this is adulthood**)
    * Growing resources: add resources to incorporate new knowledge (**biologically, this is (juvenile) development**)
* The inability to transfer omnidirectionally is one of the key limiting factors of AI.
* In ProgNN, new tasks yield additional representational capacity. ProgNN can transfer forward, but it cannot transfer backward.

## Main Contributions

* Representational ensembling that enables omnidirectional transfer via an "Omni-voter" layer
* Computational time reduced from quadratic to quasilinear
* Two types of omnidirectional learning algorithms: 
    * Omnidirectional Forests (ODIF)
    * Omnidirectional Networks (ODIN)
* ODIN and ODIF are resource-building (juvenile), but since they can leverage prior representations, they can also convert into a resource-recruiting (adult) state.

## Background

### Classical ML

* Consider random variables $(X, Y) \sim P_{X,Y}$, where $X$ is the $\mathcal{X}$-valued input and $Y$ is the $\mathcal{Y}$-valued label.
* $P_{X,Y} \in \mathcal{P}_{X,Y}$ is the joint distribution of $(X, Y)$
* Let $l \colon \mathcal{Y} \times \mathcal{Y} \longrightarrow [0, \infty)$ be a loss function
* The goal of classical ML is to find the hypothesis (predictor/decision rule) $h \colon \mathcal{X} \longrightarrow \mathcal{Y}$ that minimizes the expected loss or *risk*,
    $$ R(h) = \mathbb{E}_{X, Y}[l(h(X), Y)] $$
* A learning algorithm is a function $f$ that maps a dataset $\mathbf{S}_n = \{ (X_i, Y_i) \}_{i=1}^n$ to a hypothesis $h$.
* If the $n$ samples of $(X, Y)$ are drawn i.i.d. from some true but unknown $P_{X,Y}$, the generalization error (expected risk) is given by,
    $$ \mathbb{E}[R(f(\mathbf{S}_n))]$$
* The goal: choose a learner $f$ that learns a hypothesis $h$ that has a small generalization error for the given task.
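
The quantities above can be estimated by simulation. The following is a minimal sketch, not taken from the paper: the toy distribution (two unit-variance Gaussians), the threshold learner, and every function name are assumptions made for illustration. It estimates the risk $R(h)$ on fresh samples and then averages it over many training sets to approximate the generalization error $\mathbb{E}[R(f(\mathbf{S}_n))]$.

```python
import random

def zero_one_loss(y_pred, y_true):
    """Loss l(h(X), Y): 0 for a correct label, 1 for a wrong one."""
    return 0.0 if y_pred == y_true else 1.0

def sample_dataset(n, rng):
    """Draw n i.i.d. pairs from a toy P_{X,Y} (assumed): Y ~ Bernoulli(0.5), X | Y ~ Normal(2Y - 1, 1)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(2 * y - 1, 1.0), y))
    return data

def learner(dataset):
    """f: map a dataset S_n to a hypothesis h (threshold halfway between the class means)."""
    xs0 = [x for x, y in dataset if y == 0]
    xs1 = [x for x, y in dataset if y == 1]
    # Use a fixed default mean if a class is absent from a tiny sample.
    mean0 = sum(xs0) / len(xs0) if xs0 else -1.0
    mean1 = sum(xs1) / len(xs1) if xs1 else 1.0
    threshold = (mean0 + mean1) / 2.0
    return lambda x: 1 if x > threshold else 0

def estimate_risk(h, rng, n_test=1000):
    """Monte Carlo estimate of R(h) = E[l(h(X), Y)] using fresh samples."""
    test = sample_dataset(n_test, rng)
    return sum(zero_one_loss(h(x), y) for x, y in test) / n_test

def estimate_generalization_error(n, rng, n_reps=200):
    """Monte Carlo estimate of E[R(f(S_n))]: average risk over many training sets of size n."""
    total = 0.0
    for _ in range(n_reps):
        h = learner(sample_dataset(n, rng))
        total += estimate_risk(h, rng)
    return total / n_reps

rng = random.Random(0)
err_small = estimate_generalization_error(5, rng)
err_large = estimate_generalization_error(200, rng)
print(f"n=5: {err_small:.3f}, n=200: {err_large:.3f}")
```

Comparing the two printed estimates gives a rough sense of how the generalization error of this particular learner depends on the sample size $n$.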

### Lifelong Learning (LL)

* Lifelong learning generalizes classical ML in the following ways:
    * environment of $\mathcal{T}$ tasks instead of a single task
    * data arrive sequentially, instead of in batch mode
    * computational complexity constraints on the learning algorithm and hypotheses
* Goal of LL: given new data and a new task, use all the existing data to achieve a lower generalization error on the new task, while also using the new data to obtain a lower generalization error on the previous tasks.
* Previous work: 
    * updating a fixed parametric model as new tasks arrive
    * adding resources as new tasks arrive
    * store/ replay previously encountered data to reduce forgetting
* Task-aware: the learner knows the task details for all tasks, so hypotheses take the task as an additional input, $h \colon \mathcal{X} \times \mathcal{T} \longrightarrow \mathcal{Y}$.
* Task-unaware (task-agnostic): the learner may not know that the task has changed at all, so hypotheses depend only on the input, $h \colon \mathcal{X} \longrightarrow \mathcal{Y}$.

### Reference Algorithms

**Resource Building Algorithms**: Progressive Neural Nets (ProgNN), Deconvolution-Factorized CNNs (DF-CNNs)

**Fixed Capacity Algorithms**: Elastic Weight Consolidation (EWC), Online-EWC (O-EWC), Synaptic Intelligence (SI), Learning without Forgetting (LwF), ‘None’ and two variants of exact replay (Total Replay and Partial Replay).

## Evaluation Criteria

### Transfer Efficiency

$$ TE_n^t(f) = \frac{\mathbb{E}[R^t(f(S_n^t))]}{\mathbb{E}[R^t(f(S_n))]} $$

where $t$ is the task, $n$ is the sample size, $S_n^t$ is the subset of the data associated with task $t$, and $S_n$ is all the data.

* The algorithm $f$ has transfer learned iff $TE_n^t(f) > 1$

* **Interpretation**: Transfer efficiency is the ratio of the generalization error of (i) an algorithm that has learnt only from data associated with a given task to (ii) the same learning algorithm that also has access to other data.

### Forward Transfer Efficiency

$$ FTE_n^t(f) = \frac{\mathbb{E}[R^t(f(S_n^t))]}{\mathbb{E}[R^t(f(S_n^{<t}))]} $$

* $FTE_n^t(f) > 1$ indicates that the algorithm has used data associated with past tasks to improve performance on task $t$. (forward transfers)

* **Interpretation**: Forward transfer efficiency is the ratio of the expected risk of the learning algorithm with (i) access only to task $t$ data to (ii) access to the data up to and including the last observation from task $t$.

* Measures the relative effect of previously seen out-of-task data on the performance on task $t$.

### Backward Transfer Efficiency 

$$ BTE_n^t(f) = \frac{\mathbb{E}[R^t(f(S_n^{<t}))]}{\mathbb{E}[R^t(f(S_n))]} $$

* $BTE_n^t(f) > 1$ indicates that the algorithm has used data associated with future tasks to improve performance on previous task $t$. (backward transfers)
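
The three ratios defined above can be computed from three estimated risks for the same task $t$. The sketch below is illustrative only: the function name `transfer_efficiencies` and the risk values are hypothetical placeholders, not results from the paper.

```python
def transfer_efficiencies(risk_task_only, risk_up_to_task, risk_all_data):
    """Return (TE, FTE, BTE) for one task from three expected-risk estimates.

    risk_task_only  ~ E[R^t(f(S_n^t))]      (trained on task-t data only)
    risk_up_to_task ~ E[R^t(f(S_n^{<t}))]   (trained on data up to the last task-t observation)
    risk_all_data   ~ E[R^t(f(S_n))]        (trained on all data)
    """
    if min(risk_task_only, risk_up_to_task, risk_all_data) <= 0:
        raise ValueError("risks must be strictly positive for the ratios to be defined")
    te = risk_task_only / risk_all_data
    fte = risk_task_only / risk_up_to_task
    bte = risk_up_to_task / risk_all_data
    return te, fte, bte

# Hypothetical risk estimates, chosen only to illustrate the arithmetic.
te, fte, bte = transfer_efficiencies(0.30, 0.25, 0.20)
for name, value in (("TE", te), ("FTE", fte), ("BTE", bte)):
    verdict = "transfer" if value > 1 else "no transfer"
    print(f"{name} = {value:.3f} ({verdict})")
```

A value above 1 means the extra data lowered the risk on task $t$; a value below 1 means it hurt.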

### Other 

* If we have a sequence in which tasks do not repeat, transfer efficiency for the first task is all backwards transfer, for the last task it is all forwards transfer, and for the middle tasks it is a combination of the two.

* TE factorizes into FTE and BTE: $TE_n^t(f) = FTE_n^t(f) \times BTE_n^t(f)$
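
The factorization follows directly from the definitions: the denominator of FTE is the numerator of BTE, so it cancels in the product.

$$ FTE_n^t(f) \times BTE_n^t(f) = \frac{\mathbb{E}[R^t(f(S_n^t))]}{\mathbb{E}[R^t(f(S_n^{<t}))]} \times \frac{\mathbb{E}[R^t(f(S_n^{<t}))]}{\mathbb{E}[R^t(f(S_n))]} = \frac{\mathbb{E}[R^t(f(S_n^t))]}{\mathbb{E}[R^t(f(S_n))]} = TE_n^t(f) $$

The two edge cases for non-repeating tasks can be read off the same way, writing $t = 1$ for the first task and $t = T$ for the last (indices introduced here for illustration only). For the first task the data seen so far is exactly the task's own data, and for the last task the data seen so far is all the data:

$$ S_n^{<1} = S_n^1 \;\Rightarrow\; FTE_n^1(f) = 1,\; TE_n^1(f) = BTE_n^1(f) \qquad\qquad S_n^{<T} = S_n \;\Rightarrow\; BTE_n^T(f) = 1,\; TE_n^T(f) = FTE_n^T(f) $$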

## Omnidirectional Algorithms

* The approach relies on hypotheses of the form $h( \cdot ) = w \circ v \circ u ( \cdot )$, i.e. a composition of three functions.
* Representer ($u$): maps an $\mathcal{X}$ valued input into an internal representation space $\bar{ \mathcal{X}}$. 
* Voter ($v$): maps transformed data into a posterior distribution on the response space. 
* Decider ($w$): produces a predicted label. 
* More generally, each voter can be allowed to ensemble all the existing representations, regardless of the order in which they were learnt. This is done by the **Omni-voter layer**.
* When the representers have learnt complementary properties, ensembling them can help multi-task learning.
* In ODIF: 
    * Representer: Decision forest (output: one-hot encoded vector representations)
    * Voter: Populating the cells of the partitions and taking class votes with out-of-bag samples, as in 'honest trees' (output: the posteriors)
    * Decider: Average the posterior estimates (output: the argmax)
* In ODIN: 
    * Representer: Backbone of a deep network (DN) without the final layer
    * Voter: learned via K-Nearest Neighbors
    * Decider: Fully connected layer
* In both ODIF and ODIN,
    * A new representer is built when data from a new task arrive.
    * Then a voter is built which integrates information from the existing representers. (enables forward transfer)
    * When new data of an old task arrive, the voters are updated from the new representations. (enables backward transfer)
    * New test data are passed through all the existing representers and corresponding voters to make a prediction.
    * When updating the previous task voters with the cross-task posteriors, we do not need to subsample the previous task data.
* In ODIN: 
    * Excludes lateral connections, unlike ProgNN
    * Representations are independent, avoiding interference between representations
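
The representer/voter/decider decomposition described above can be sketched in a few lines. This is a toy illustration under strong assumptions, not ODIF or ODIN: inputs are one-dimensional, each "representer" is a random partition of the line standing in for a tree, there is no honest (out-of-bag) sampling, and all task data are stored. The names `Representer`, `Voter`, `ToyOmniLearner`, and `make_task` are hypothetical.

```python
import bisect
import random

class Representer:
    """u: maps an input x in [0, 1) to a cell index of a partition (toy stand-in for a tree)."""
    def __init__(self, xs, n_cells, rng):
        # Cut points are a random subsample of the task's own inputs.
        self.cuts = sorted(rng.sample(xs, n_cells - 1))
        self.n_cells = n_cells

    def __call__(self, x):
        return bisect.bisect_right(self.cuts, x)

class Voter:
    """v: maps a cell index to class posteriors estimated from one task's data."""
    def __init__(self, representer, data, n_classes):
        # Start every count at 1 (Laplace smoothing) so empty cells give uniform posteriors.
        counts = [[1.0] * n_classes for _ in range(representer.n_cells)]
        for x, y in data:
            counts[representer(x)][y] += 1.0
        self.posteriors = [[c / sum(row) for c in row] for row in counts]

    def __call__(self, cell):
        return self.posteriors[cell]

class ToyOmniLearner:
    def __init__(self, n_cells=8, n_classes=2, seed=0):
        self.rng = random.Random(seed)
        self.n_cells, self.n_classes = n_cells, n_classes
        self.representers = []  # one representer per task seen so far
        self.task_data = {}     # task id -> list of (x, y); this sketch stores all data
        self.voters = {}        # (task id, representer index) -> Voter

    def add_task(self, task_id, data):
        self.task_data[task_id] = data
        xs = [x for x, _ in data]
        self.representers.append(Representer(xs, self.n_cells, self.rng))
        # Fit every missing (task, representer) voter: the new task votes on old
        # representers (forward transfer) and old tasks vote on the new one (backward).
        for t, d in self.task_data.items():
            for i, u in enumerate(self.representers):
                if (t, i) not in self.voters:
                    self.voters[(t, i)] = Voter(u, d, self.n_classes)

    def predict(self, task_id, x):
        # w: average the posteriors from every representer, then take the argmax.
        avg = [0.0] * self.n_classes
        for i, u in enumerate(self.representers):
            for k, p in enumerate(self.voters[(task_id, i)](u(x))):
                avg[k] += p / len(self.representers)
        return max(range(self.n_classes), key=lambda k: avg[k])

def make_task(n, flip, rng):
    """Toy task: label is 1 when x > 0.5 (or the opposite labelling when flip is True)."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        data.append((x, 1 - y if flip else y))
    return data

rng = random.Random(1)
model = ToyOmniLearner()
model.add_task("A", make_task(200, flip=False, rng=rng))
model.add_task("B", make_task(200, flip=True, rng=rng))
print(model.predict("A", 0.95), model.predict("B", 0.95))
```

Each representer is fit once and never modified afterwards; only voters are added as tasks arrive. In this sketch that is what keeps the representations independent of one another, with no lateral connections between them.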

## Experiments

Check the manuscript.

## Results

* The following have been studied. 
    * computational space and time complexity of internal representations
    * representational capacity 
* Types of lifelong learning algorithms based on the computational taxonomy.
    * parametric: 
        * algorithms with fixed resources
        * all such algorithms will eventually catastrophically forget at least some knowledge
        * EWC, SI, LwF
    * semi-parametric
        * algorithms whose representational capacity grows slower than sample size
        * have fixed representational capacity per task
        * ProgNN, DF-CNN
        * may lack representation capacity to perform well on complex tasks
        * may waste resources on simpler tasks
    * nonparametric
        * ODIF is the only nonparametric method to date

## Summary

* Quasilinear representational ensembling as an approach to omnidirectional lifelong learning

* This representation ensembling approach closely resembles the constructivist view of brain development

* Forest-based representation ensembling approaches can easily add new resources when appropriate.

* The concept of omnidirectional transfer of knowledge is proposed to overcome the issue of catastrophic forgetting. 

* Through omnidirectional transfer, it is possible to realize the goal of lifelong learning, which is to improve the performance on a new task using knowledge about existing tasks and their data, while improving the performance on the previous tasks using the knowledge about new tasks and their data. 

* This work further uses progressive learning concepts to incorporate resource building and resource recruitment into the proposed algorithms. 

## Potential Future directions? 

* How can we expand the proposed learning framework into task-agnostic situations? 
* How can we enable deep learning to dynamically add resources when appropriate?
* Can a generative model obviate the need to store all the data?
* The paradigm of ensembling representations rather than learners can be readily applied more generally (e.g., batch effects, federated learning).
* Substantial pruning during development and maturity in the brain circuitry is important for performance. This motivates future work on pruning adversarial representers to enhance the transferability among tasks even more.

