## Noisy oscillator environment

Author: [Ilya Schurov](http://github.com/ischurov/TheConsciousnessPrior).

### Basic construction

The following environment aims to provide the simplest possible interesting example for testing the Consciousness Prior.

The inputs are two-dimensional vectors $x_t \in \mathbb R^2$ that depend on $t$ in the following way. For some fixed linearly independent vectors $u_h, u_r \in \mathbb R^2$, we have:

$$x_t = u_h q_t + u_r w_t,\tag{1}$$

where $q_t = A \sin (\omega(t-B))$, $A$, $B$ and $\omega$ are constants, and $w_t$ is random noise (for example, all $w_t$'s are independent and uniformly distributed in $[-A, A]$).

This means that we have two directions. In one direction ($u_h$, $h$ for _harmonic_) we have harmonic oscillations, and in the other direction ($u_r$, $r$ for _random_) there is just noise.
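As an illustration, the construction (1) can be sketched in NumPy as follows. The function name `noisy_oscillator`, the particular vectors `u_h` and `u_r`, and the parameter values are assumptions chosen for the example, not part of the original setup.

```python
import numpy as np

def noisy_oscillator(T, u_h, u_r, A=1.0, B=0.0, omega=0.3, seed=0):
    """Return (x, q, w) for t = 0..T-1, following x_t = u_h q_t + u_r w_t."""
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    q = A * np.sin(omega * (t - B))          # harmonic component q_t
    w = rng.uniform(-A, A, size=T)           # i.i.d. uniform noise in [-A, A]
    x = np.outer(q, u_h) + np.outer(w, u_r)  # observations, shape (T, 2)
    return x, q, w

# Hypothetical directions; any linearly independent pair would do.
u_h = np.array([2.0, 1.0])
u_r = np.array([-1.0, 1.0])
x, q, w = noisy_oscillator(200, u_h, u_r)

# If the directions are known, (q_t, w_t) are just the coordinates
# of x_t in the basis (u_h, u_r).
basis = np.column_stack([u_h, u_r])          # columns are u_h and u_r
coords = np.linalg.solve(basis, x.T).T       # shape (T, 2)
```

Here `coords[:, 0]` recovers $q_t$ and `coords[:, 1]` recovers $w_t$, which is only possible because the directions are assumed to be known.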

Assume that $\omega$ is fixed forever. If one knows $u_h$ and $u_r$, it is possible to reconstruct the whole $q_t$ by looking at exactly two moments of time (as we have only two unknown parameters, namely $A$ and $B$) and thus obtain perfect predictions. By contrast, it is impossible to make any predictions of $w_t$ due to their randomness.
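The two-point reconstruction can be made concrete. Writing $q_t = a \sin(\omega t) + b \cos(\omega t)$ with $a = A\cos(\omega B)$ and $b = -A\sin(\omega B)$, two samples give a linear system for $(a, b)$. The sketch below assumes that $\omega$ is known and that the two moments satisfy $\sin(\omega(t_2 - t_1)) \ne 0$; the helper name `fit_oscillator` and the numeric values are hypothetical.

```python
import numpy as np

def fit_oscillator(t1, q1, t2, q2, omega):
    """Recover (A, B) from two samples of q_t = A sin(omega (t - B)).

    Assumes omega is known and sin(omega * (t2 - t1)) != 0.
    """
    M = np.array([[np.sin(omega * t1), np.cos(omega * t1)],
                  [np.sin(omega * t2), np.cos(omega * t2)]])
    a, b = np.linalg.solve(M, np.array([q1, q2]))
    A = np.hypot(a, b)
    B = np.arctan2(-b, a) / omega   # defined up to a multiple of 2*pi/omega
    return A, B

A_true, B_true, omega = 1.5, 2.0, 0.3

def true_q(t):
    return A_true * np.sin(omega * (t - B_true))

# Observe the oscillator at t = 0 and t = 1 only, then predict everything else.
A_hat, B_hat = fit_oscillator(0, true_q(0), 1, true_q(1), omega)
q_pred = A_hat * np.sin(omega * (np.arange(50) - B_hat))
```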

### How this environment fits into Yoshua's framework
1. Define a set of rules which refer to a set of underlying high-level variables and allow one to predict the part of the input dynamics which is predictable (which are the values of some high-level variables)
> $q_t$ is the only high-level variable. The rule is harmonic oscillation, which can be easily captured by a simple linear recurrence.
2. The high-level variables are not directly observed but are complicated functions of the input (like objects and their positions and other attributes, in a sequence of images)
> One has to know the directions $u_h$ and $u_r$ (or the projection operator onto $u_h$) to observe $q_t$. It is probably possible to reconstruct $u_h$ and $u_r$ from the data, but the exact function seems to be complicated.
3. Other aspects of the input which also vary in time are much more numerous than the high-level variables but have no coherent structure (e.g. unpredictable noise)
> We currently have only one noisy aspect of the input due to super-simplicity, but of course we can add more, e.g. by making $w_t$ multi-dimensional.

### Problems and hypothesis
1. If one knows $u_h$ and $u_r$, one can obtain $q_t$ and $w_t$ from the data and feed $q_t$ to a simple recurrent network without nonlinearities that will predict future $q_t$'s perfectly after just two steps or so. (Learning a rotation matrix.)
2. A CP-aware architecture should be able to disentangle the deterministic and random parts of the input, i.e. to reconstruct the vectors $u_h$ and $u_r$.
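To illustrate the first point: a sampled sinusoid satisfies the linear recurrence $q_{t+1} = 2\cos(\omega)\, q_t - q_{t-1}$, so a linear two-lag predictor is enough. The sketch below fits the two coefficients by least squares instead of training a recurrent network, which is a simplification made for the example; the parameter values are arbitrary.

```python
import numpy as np

A, B, omega = 1.0, 0.7, 0.3
t = np.arange(100)
q = A * np.sin(omega * (t - B))

# Fit q_{t+1} = alpha * q_t + beta * q_{t-1} by least squares.
features = np.column_stack([q[1:-1], q[:-2]])   # columns: q_t, q_{t-1}
targets = q[2:]                                 # q_{t+1}
(alpha, beta), *_ = np.linalg.lstsq(features, targets, rcond=None)

# Roll the learned recurrence forward from the first two observations only.
pred = [q[0], q[1]]
for _ in range(len(q) - 2):
    pred.append(alpha * pred[-1] + beta * pred[-2])
pred = np.array(pred)
```

The fitted coefficients should come out close to $2\cos\omega$ and $-1$, and the companion matrix $\begin{pmatrix} 2\cos\omega & -1 \\ 1 & 0 \end{pmatrix}$ is conjugate to a rotation by angle $\omega$.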

### Architecture of minimal CP-aware model
We use the notation and settings of [CP in an Observational Setting](http://nbviewer.jupyter.org/github/ischurov/TheConsciousnessPrior/blob/master/src/comments/cp-in-an-observational-settings.ipynb) and note the differences between the original approach and ours.

Let us assume for simplicity that $u_h$ and $u_r$ are always orthogonal to each other.

- The representation network is trivial: $h_t = x_t$ for every $t$, as the feature space is already very simple. (Good, nothing to implement!)
- The conscious network is a kind of "soft attention" mechanism (in contrast with the original approach, where it is "hard attention"): $c_t = C(h_t) = \langle u_c, h_t\rangle$, where $u_c$ is a fixed vector to be learned (it does not depend on $t$). In contrast with the original approach, $C$ does not depend on $h_{t-1}$ and $z_t$. The goal is to be able to predict $c_{t+1}=C(h_{t+1})$ (so there is no $A_t$ counterpart: we predict the same thing we are conscious of). This is possible if $u_c = u_h$ (or any nonzero multiple of it).
- The generator network has to predict $c_{t+1}$. To be able to reconstruct the parameters of the harmonic oscillation, we have to observe the state of the oscillator at several moments (at least three, if we have three parameters). So the generator should either be recurrent, $(\widehat {a_{t}}, g_t) = G(c_t, g_{t-1})$, where $g_t$ is an internal state of the generator, or be explicitly fed several previous values of $c_{t}$, like $\widehat {a_t} = G(c_t, c_{t-1}, c_{t-2}, c_{t-3})$.
- The objective is to minimize the MSE loss: $\sum_{t} (\widehat {a_t} - c_{t+1})^2$.
- To avoid the failure mode where $u_c$ is the zero vector and $c_t \equiv 0$, we must penalize a too-small norm $\|u_c\|$ (or add a strict constraint $\|u_c\|=1$).
- There is no need for the Mutual Information Objective, as there are no constant variables in the representation state (see the discussion of MIO below).
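A rough numerical check of this design is sketched below. It is not a trained network: the constraint $\|u_c\| = 1$ is enforced by parameterizing $u_c$ with an angle $\theta$, the generator is replaced by the best linear predictor on two lags (fitted by least squares), and $\theta$ is found by grid search rather than gradient descent. All of these are simplifying assumptions, as are the parameter values and the helper name `prediction_mse`.

```python
import numpy as np

rng = np.random.default_rng(0)
T, A, B, omega, phi = 2000, 1.0, 0.7, 0.3, 0.4
u_h = np.array([np.cos(phi), np.sin(phi)])       # unit vector at angle phi
u_r = np.array([-np.sin(phi), np.cos(phi)])      # orthogonal to u_h
t = np.arange(T)
x = (np.outer(A * np.sin(omega * (t - B)), u_h)
     + np.outer(rng.uniform(-A, A, T), u_r))

def prediction_mse(theta, x):
    """MSE of the best linear two-lag generator for c_t = <u_c, x_t>, |u_c| = 1."""
    u_c = np.array([np.cos(theta), np.sin(theta)])
    c = x @ u_c                                   # conscious state c_t
    lags = np.column_stack([c[1:-1], c[:-2]])     # (c_t, c_{t-1})
    target = c[2:]                                # c_{t+1}
    coef, *_ = np.linalg.lstsq(lags, target, rcond=None)
    return np.mean((lags @ coef - target) ** 2)

# Grid search over directions; theta and theta + pi give the same line.
thetas = np.linspace(0.0, np.pi, 181)
mse = np.array([prediction_mse(th, x) for th in thetas])
theta_best = thetas[np.argmin(mse)]
```

Under these assumptions the loss is smallest when $u_c$ is (nearly) aligned with $u_h$, i.e. `theta_best` lands within one grid step of `phi`, and largest when $u_c$ points along $u_r$.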

### Variations and generalizations
1. Both the oscillating and random parts can be made multi-dimensional.
2. Higher harmonics can be added to the oscillating part.
3. The random part can be a kind of random walk instead of independent random variables.
4. The embedding (1) can be made time-dependent (i.e. vectors $u_h$ and $u_r$ depend on time), as well as the conscious function $C$.
5. The embedding (1) can be made non-linear, as well as the conscious function $C$.

### Mutual Information objective (MIO) rationale
Let us consider the following variant of the noisy oscillator:
$$x_t=(A \sin (\omega(t-B)), w_t, D) \in \mathbb R^3,$$
where $A$, $B$, $D$ and $\omega$ are constants and $w_t$ is random noise. In this case, the three components of $x_t$ are already disentangled: harmonic, noisy and constant.

Assume that the representation network is trivial, $h_t \equiv x_t$.

Consider a kind of "hard attention" conscious function $C$, as discussed in William's diagram, that for every moment $t$ chooses some index $B_t \in \{1, 2, 3\}$ to attend to and (possibly) another index $A_t \in \{1, 2, 3\}$ to predict.

$$c_t = (B_t, b_t, A_t) = C(h_t, c_{t-1}),$$

where $b_t = h_t[B_t]$.

As our environment is super-simple, let us assume that $C$ is a deterministic function of $h_t$ and $c_{t-1}$, and therefore there is no $z_t$ in the arguments of $C$.

The main objective of the conscious function is to track "predictable" high-level features, i.e. features that can be used to predict their own future state or the state of other high-level features. In our example, there are two such features, namely the harmonic and constant parts of $x_t$. It is easy to predict a constant provided that you already know it. It is also easy to predict the behaviour of the harmonic oscillator provided we know several of its previous states, which allow us to find the constants $A$, $B$ and $\omega$. Also, these two features are independent of each other, and both are independent of the noisy part.

So the conscious function has two good targets: either $B_t=A_t=1$ (harmonic part) or $B_t=A_t=3$ (constant part) for every $t$.

Which one is better? The first one, as in this case the conscious function is able to predict "something useful". So we don't want to allow it to stick with something like "predict the value of a pixel that does not change at all".

A more natural example: recall that in Bengio's paper we have a representation network that encodes the inputs. If "predicting a constant" is allowed, the representation network has an incentive to leave some part of the representation constant, and the conscious network will choose that part forever, as it is perfectly predictable.

Now, how do we distinguish between a harmonic-aware and a constant-aware conscious function? The Mutual Information Objective comes to the rescue!

Let us recall what mutual information is about. Consider two random variables $X$ and $Y$. Mutual information measures the difference between the entropy of $X$ and the entropy of $X$ provided you know $Y$:
$$MI(X, Y)=H(X)-H(X\mid Y).$$

If $X$ and $Y$ are independent, MI is zero, because knowledge of $Y$ does not reduce the entropy of $X$. If both $X$ and $Y$ have low entropy, MI is also low, as the entropy cannot decrease much. If the entropy of $X$ is high, but knowledge of $Y$ allows one to decrease it significantly, mutual information is also high.

As mutual information deals with random variables, we have to introduce them, since our setting has been purely deterministic so far. To do so, fix some input sequence $x_t$ and then pick a random $t$ out of all possible moments of time excluding the last one. Then $b_t$ and $h_{t+1}[A_t]$ are a pair of random variables. If $B_t=A_t=3$ (constant-aware conscious function), these random variables are essentially non-random (they both equal the constant $D$). Both $b_t$ and $h_{t+1}[A_t]$ have zero entropy, and therefore the mutual information is also zero.

On the other hand, for $B_t=A_t=1$ (harmonic-aware conscious function), $b_t$ and $h_{t+1}[A_t]$ both have rather high entropy (choosing different moments of time, we pick different positions). However, if we know $b_t$ (i.e. the previous state of the oscillator), we know much about $h_{t+1}[A_t]$ (at least, the new state is somewhere near the old one). So mutual information is high.

This is why maximization of mutual information between current and future states of consciousness allows us to select "good" targets for the conscious function.
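The argument can be checked numerically with a crude plug-in estimate of mutual information from a 2D histogram. The estimator, the bin count and the parameter values below are assumptions made for the illustration (a histogram estimate is biased upwards for finite samples), and `mutual_information` is a hypothetical helper rather than a library function.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in estimate of MI(X, Y) in nats from a 2D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal of X, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal of Y, shape (1, bins)
    mask = p_xy > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

rng = np.random.default_rng(0)
T, A, B, D, omega = 5000, 1.0, 0.7, 0.5, 0.3
t = np.arange(T)
h = np.column_stack([A * np.sin(omega * (t - B)),   # harmonic
                     rng.uniform(-A, A, T),         # noise
                     np.full(T, D)])                # constant

# MI between b_t = h_t[i] and h_{t+1}[i], with t uniform over all but the last moment.
mi = {name: mutual_information(h[:-1, i], h[1:, i])
      for i, name in enumerate(["harmonic", "noise", "constant"])}
```

In this sketch the constant component gives zero, the noise component gives a value close to zero (only the estimator bias remains), and the harmonic component gives a clearly positive value, matching the reasoning above.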

#### Evaluation of MIO
The previous example is purely theoretical, but it also provides a way to check experimentally that MIO really works. To this end, we entangle the three components of $x$ using a simple linear coordinate change, just as discussed in the [Basic construction](#Basic-construction) section. Then we make the conscious function a linear function $\mathbb R^3\to \mathbb R^1$ instead of an index-choosing function. We expect that MIO allows the conscious function to learn the harmonic part of the data.
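A possible starting point for such an experiment is sketched below. The mixing matrix `M` is an arbitrary invertible matrix picked for the example, and the three candidate linear conscious functions are taken from the rows of $M^{-1}$ (something a real model would have to learn). The sketch only measures the prediction error of a linear two-lag predictor with a bias term, to show that this error alone does not separate the harmonic readout from the constant one.

```python
import numpy as np

rng = np.random.default_rng(1)
T, A, B, D, omega = 2000, 1.0, 0.7, 0.5, 0.3
t = np.arange(T)
components = np.column_stack([A * np.sin(omega * (t - B)),   # harmonic
                              rng.uniform(-A, A, T),         # noise
                              np.full(T, D)])                # constant

# Hypothetical mixing matrix (det = 8, so it is invertible).
M = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
x = components @ M.T                   # entangled observations x_t = M (q_t, w_t, D)

def two_lag_mse(c):
    """MSE of the best linear predictor of c_{t+1} from (c_t, c_{t-1}, 1)."""
    lags = np.column_stack([c[1:-1], c[:-2], np.ones(len(c) - 2)])
    coef, *_ = np.linalg.lstsq(lags, c[2:], rcond=None)
    return np.mean((lags @ coef - c[2:]) ** 2)

readouts = np.linalg.inv(M)            # row i recovers component i from x_t
names = ["harmonic", "noise", "constant"]
mse = {name: two_lag_mse(x @ row) for name, row in zip(names, readouts)}
```

Both the harmonic and the constant readouts reach (numerically) zero prediction error here, while the noise readout does not, so an additional criterion such as MIO is needed to prefer the harmonic one.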