
Commit

Merge overleaf-2019-11-07-0936 into master
AndreasMadsen committed Nov 7, 2019
2 parents 0afd4e9 + bfb7a05 commit d8d88a9
Showing 7 changed files with 113 additions and 61 deletions.
15 changes: 15 additions & 0 deletions paper/bibliography.bib
@@ -71,6 +71,21 @@ @article{NAEC
biburl = {https://dblp.org/rec/bib/journals/corr/abs-1809-08590},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{maep-madsen-johansen-2019,
author={Andreas Madsen and Alexander Rosenberg Johansen},
title={Measuring Arithmetic Extrapolation Performance},
booktitle={Science meets Engineering of Deep Learning at 33rd Conference on Neural Information Processing Systems (NeurIPS 2019)},
address={Vancouver, Canada},
journal={CoRR},
volume={abs/1910.01888},
month={October},
year={2019},
url={http://arxiv.org/abs/1910.01888},
archivePrefix={arXiv},
primaryClass={cs.LG},
eprint={1910.01888},
timestamp={Fri, 4 Oct 2019 12:00:36 UTC}
}
@article{FreivaldsL17,
author = {Karlis Freivalds and
Renars Liepins},
33 changes: 31 additions & 2 deletions paper/main.tex
@@ -3,7 +3,7 @@
\usepackage{iclr2020_conference,times}

% COMMENT for anonymous submission
\def\nonanonymous{}
% \def\nonanonymous{}

\ifdefined\nonanonymous
\iclrfinalcopy
@@ -73,7 +73,36 @@
\maketitle

\begin{abstract}
Exact arithmetic operations of real numbers present a unique learning challenge for machine learning models. Neural networks can approximate complex functions by learning from labeled data. However, when extrapolating to out-of-distribution samples neural networks often fail. Learning the underlying logic, as opposed to an approximation, is crucial for applications such as comparing, counting, and inferring physical models. Our proposed Neural Addition Unit (NAU) and Neural Multiplication Unit (NMU) can learn, using backpropagation, the underlying rules of real number addition, subtraction, and multiplication thereby performing well when extrapolating. The proposed units controls the arithmetic operation by using a sparse weight matrix, which allows the units to perform exact arithmetic operations. Through theoretical analysis, supported by empirical evidence, we justify how the NAU and NMU improve over previous methods. Our experimental setting, motivated by previous work, includes an arithmetic extrapolation task and multiplication of up to 20 MNIST digits. We show that NAU and NMU have fewer parameters, converges more consistently, learns faster, handles large hidden sizes better, and have more meaningful discrete values than previous attempts.\ifdefined\nonanonymous\footnote{Implementation is available on GitHub: \url{https://github.com/AndreasMadsen/stable-nalu}.}\fi

%Learning exact arithmetic operation of real numbers, as part of a neural network, presents a unique challenge. Neural networks can approximate complex functions by learning from labeled data. However, when extrapolating to out-of-distribution samples neural networks often fail. Learning the underlying logic, as opposed to an approximation, is crucial in applications that depends on inferring physical models, comparing, or counting as part of the model.

%Alternative
%What’s the domain?
Neural networks can approximate complex functions, but they struggle to perform exact arithmetic operations over real numbers.
%What’s the issue?
The lack of inductive bias for arithmetic operations leaves neural networks without the underlying logic needed to extrapolate on tasks such as addition, subtraction, and multiplication.
%What’s your contribution?
We present two new neural network components: the Neural Addition Unit (NAU), which can learn to add and subtract, and the Neural Multiplication Unit (NMU), which can multiply subsets of a vector.
%Why is it novel?
The NMU is, to our knowledge, the first arithmetic neural network component that can learn to multiply elements of a vector with a large hidden size.
%What’s interesting about it?
The two new components draw inspiration from a theoretical analysis of recent arithmetic components.
We find that careful initialization, restricting the parameter space, and regularizing for sparsity are important when optimizing the NAU and NMU.
%How does it perform?
Our results, compared with previous attempts, show that the NAU and NMU converge more consistently, have fewer parameters, learn faster, do not diverge with large hidden sizes, obtain sparse and meaningful weights, and can extrapolate to negative and small numbers.\ifdefined\nonanonymous\footnote{Implementation is available on GitHub: \url{https://github.com/AndreasMadsen/stable-nalu}.}\fi

%What’s the domain?
%Exact arithmetic operations of real numbers in Neural Networks present a unique learning challenge for machine learning models.
%What’s the issue?
%Neural networks can approximate complex functions by learning from labeled data. However, when extrapolating to out-of-distribution samples neural networks often fail. Learning the underlying logic, as opposed to an approximation, is crucial for applications such as comparing, counting, and inferring physical models.
%What’s your contribution?
%Our proposed Neural Addition Unit (NAU) and Neural Multiplication Unit (NMU) can learn, using backpropagation, the underlying rules of real number addition, subtraction, and multiplication thereby performing well when extrapolating.
%Why is it novel?
%The proposed units controls the arithmetic operation by using a sparse weight matrix, which allows the units to perform exact arithmetic operations.
%What’s interesting about it?
%Through theoretical analysis, supported by empirical evidence, we justify how the NAU and NMU improve over previous methods. Our experimental setting, motivated by previous work, includes an arithmetic extrapolation task and multiplication of up to 20 MNIST digits.
%How does it perform?
%We show that NAU and NMU have fewer parameters, converges more consistently, learns faster, handles large hidden sizes better, and have more meaningful discrete values than previous attempts.\ifdefined\nonanonymous\footnote{Implementation is available on GitHub: \url{https://github.com/AndreasMadsen/stable-nalu}.}\fi
\end{abstract}

\input{sections/introduction}
13 changes: 9 additions & 4 deletions paper/sections/conclusion.tex
@@ -1,7 +1,12 @@
\section{Conclusion}
By including theoretical considerations, such as initialization, gradients, and sparsity, we have developed a new multiplication unit that outperforms the state-of-the-art models on established extrapolation and sequential tasks. Our model converges more consistently, faster, and to more sparse solutions, than previously proposed models.
By including theoretical considerations, such as initialization, gradients, and sparsity, we have developed a new neural multiplication unit (NMU) that outperforms the state-of-the-art models on established extrapolation and sequential tasks.
Our model converges more consistently, faster, and to sparser solutions than previously proposed models, and unlike the NALU it supports all input ranges.

We find that performing division and multiplication concurrently is a hard problem because of division by zero that currently can not be solved. However, when it comes to multiplication, our model is capable of extrapolating in both the negative range and to very small numbers.
%A theoretical disadvantage of our multiplication unit is that it is incapable of division. However, previous publications concur that this is a problematic and generally unsolved area, due to the singularity in division. Thus our proposed model is empirically identical when it comes to division. On the other hand, when it comes to multiplication, our model is capable of extrapolating in both the negative range and to very small numbers.
A natural next step would be to extend the NMU to support division and to add gating between the NMU and NAU, making it comparable in theoretical features to the NALU.
However, we find, both experimentally and theoretically, that learning division is impractical because of the singularity when dividing by zero, and that a sigmoid gate choosing between two functions with vastly different convergence properties, such as a multiplication unit and an addition unit, cannot be learned consistently.
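To illustrate the singularity issue with a minimal sketch (a single input $x$ and the $\text{NAC}_{\bullet}$ form $\exp(w \log(|x| + \epsilon))$ described by \citet{trask-nalu}): a division weight of $w = -1$ yields
% Illustrative sketch only; the full analysis is in section \ref{sssec:nac-mul}.
\begin{equation*}
\exp\left(-\log(|x| + \epsilon)\right) = \frac{1}{|x| + \epsilon},
\end{equation*}
which approaches $1/\epsilon$ as $x \rightarrow 0$, so both the output and its gradient blow up for small inputs.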

Finally, when it comes to considering more than just two inputs to the multiplication layer, our model clearly outperforms all previously proposed models as well as variations of previous models that borrow from our model. The ability for a neural layer to consider more than just two inputs, is critical in neural networks where the desired function is unknown.
%Alternative
%An important aspect of neural networks is supporting a large hidden size with a distributed representation and redundancy.
%We find that our proposed Neural Multiplication Unit significantly outperforms previous models when increasing the hidden size of the network.
Finally, when considering more than just two inputs to the multiplication layer, our model performs significantly better than previously proposed methods and variations thereof.
The ability of a neural layer to consider more than two inputs is critical in neural networks where the desired function is unknown.
59 changes: 27 additions & 32 deletions paper/sections/introduction.tex
@@ -1,44 +1,39 @@
\section{Introduction}
When studying intelligence, insects, reptiles, and humans have been found to possess neurons with the capacity to hold integers, real numbers, and perform arithmetic operations \cite{nieder-neuronal-number,rugani-arithmetic-chicks,gallistel-numbers-in-brain}.
In our quest to mimic intelligence we have put much faith in neural networks, which in turn has provided unparalleled and often superhuman performance in tasks requiring high cognitive abilities \cite{natureGo,bert,openai-learning-dexterous}.
However, when using neural networks to learn simple arithmetic problems, such as counting, multiplication, or comparison they systematically fail to extrapolate onto unseen ranges \cite{stillNotSystematic,suzgun2019evaluating,trask-nalu}. This can be a significant drawback when comparing in question answering \cite{naturalquestions} or counting objects in visual data \cite{johnson2017clevr,drewspaper}.
However, when using neural networks to learn simple arithmetic problems, such as counting, multiplication, or comparison, they systematically fail to extrapolate onto unseen ranges \cite{stillNotSystematic,suzgun2019evaluating,trask-nalu}.
The absence of inductive bias makes it difficult for neural networks to extrapolate well on arithmetic tasks as they lack the underlying logic to represent the required operations.

In this paper, we analyze and improve parts of the recently proposed Neural Arithmetic Logic Unit (NALU) by \citet{trask-nalu}.
The NALU is a neural network layer with two sub-units; the $\text{NAC}_{+}$ for addition/subtraction and the $\text{NAC}_{\bullet}$ for multiplication/division.
The sub-units are softly gated using a sigmoid function.
The parameters, which are computed by a soft weight constraint using a tanh-sigmoid transformation, are learned by observing arithmetic input-output pairs and using backpropagation \cite{rumelhart1986learning}.
We would like a neural network component that can take an arbitrary hidden input, learn to select the appropriate elements, and apply the desired arithmetic operation.
A recent attempt to achieve this goal is the Neural Arithmetic Logic Unit (NALU) by \citet{trask-nalu}.

Our contributions are alternatives to the $\text{NAC}_{+}$ and $\text{NAC}_{\bullet}$ units that are more theoretically founded.
Our alternatives can support small and negative numbers, are more sparse, and supports a larger hidden size, while using less parameters.
We test these properties through a rigid experimental setup with $19230$ arithmetic tests and recurrent multiplication of up to 20 MNIST digits \cite{mnist}.
%More specifically;
%an alternative formulation of the soft weight constraint with a clipped linear activation, parameter regularization that biases towards a sparse solution of $\{-1,0,1\}$, and a reformulation of the multiplication unit with a partial linearity. All of which significantly improves upon the $\text{NAC}_{+}$ and $\text{NAC}_{\bullet}$ units as shown through extensive testing on static arithmetic tasks (more than $10000$ experiments) and recurrent multiplication of a sequence of MNIST digits.
The NALU consists of two sub-units: the $\text{NAC}_{+}$ for addition/subtraction and the $\text{NAC}_{\bullet}$ for multiplication/division.
The sub-units are softly gated using a sigmoid function in order to exclusively select one of the sub-units.
However, we find that the soft gating mechanism and the $\text{NAC}_{\bullet}$ are fragile and hard to learn.
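For reference, the NALU computation can be sketched as follows (our paraphrase of \citet{trask-nalu}; see the original work for the exact formulation):
% Paraphrased sketch of the NALU; notation may differ slightly from the original paper.
\begin{align*}
\mathbf{W} &= \tanh(\hat{\mathbf{W}}) \odot \sigma(\hat{\mathbf{M}}), \qquad
\text{NAC}_{+}: \mathbf{a} = \mathbf{W}\mathbf{x}, \qquad
\text{NAC}_{\bullet}: \mathbf{m} = \exp\!\left(\mathbf{W} \log(|\mathbf{x}| + \epsilon)\right), \\
\text{NALU}&: \mathbf{y} = \mathbf{g} \odot \mathbf{a} + (1 - \mathbf{g}) \odot \mathbf{m}, \qquad \mathbf{g} = \sigma(\mathbf{G}\mathbf{x}).
\end{align*}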

\begin{figure}[h]
In this paper, we analyze and improve upon the $\text{NAC}_{+}$ and $\text{NAC}_{\bullet}$ with respect to addition, subtraction, and multiplication.
Our proposed improvements, namely the Neural Addition Unit (NAU) and the Neural Multiplication Unit (NMU), are more theoretically founded and improve performance in terms of stability, speed of convergence, and interpretability of results.
Most importantly, the NMU supports a large hidden input size.

The improvements, based on a theoretical analysis of the NALU and its components, are achieved by a simplification of the parameter matrix for a better gradient signal, a sparsity regularizer, and a new multiplication unit that can be optimally initialized and supports both negative and small numbers.
The NMU does not support division.
However, we find that the $\text{NAC}_{\bullet}$ in practice also only supports multiplication and cannot learn division (for theoretical findings on why division is hard to learn, see section \ref{sssec:nac-mul}).
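As an illustrative sketch of the idea (notation simplified; the exact constraints, initialization, and regularization are specified later in the paper), the NMU computes a product in which each weight interpolates between including an input and replacing it with the multiplicative identity:
% Illustrative sketch; weights assumed constrained to [0, 1].
\begin{equation*}
z_{j} = \prod_{i} \left( W_{i,j}\, x_{i} + 1 - W_{i,j} \right), \qquad W_{i,j} \in [0, 1].
\end{equation*}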

To analyze the impact of each improvement in the NMU, we introduce several variants of the $\text{NAC}_{\bullet}$.
We find that allowing division makes optimization for multiplication harder, that linear and regularized weights improve convergence, and that the NMU style of multiplication is critical when increasing the hidden size.

Furthermore, we improve upon the existing benchmarks in \citet{trask-nalu} by expanding the ``simple function task'', using a multiplicative variant of the ``MNIST Counting and Arithmetic Tasks'', and using an improved success-criterion \citep{maep-madsen-johansen-2019}.
A success-criterion is important because the arithmetic layers are solving a logical problem.
We propose the MNIST multiplication variant as we want to test the NMU's and $\text{NAC}_{\bullet}$'s ability to learn from real data.
%Hence, the solution found is either correct or wrong.
%To test this we propose using a success-criteria to evaluate model performance.
%A success-criterion enables measuring sensitivity to the initialization seed as well as the number of iterations until convergence.
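As a sketch of the kind of criterion we have in mind (simplified; the precise procedure follows \citet{maep-madsen-johansen-2019}), a run is counted as a success only if its extrapolation error falls below a threshold obtained from a nearly-exact solution, rather than being scored relative to a random baseline:
% Simplified sketch; the threshold tau is illustrative and is defined precisely in \citet{maep-madsen-johansen-2019}.
\begin{equation*}
\text{success} \iff \mathrm{MSE}_{\text{extrapolation}} < \tau, \qquad \tau = \text{error of a slightly perturbed exact solution}.
\end{equation*}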

\begin{figure}[t]
\centering
\includegraphics[scale=0.6]{graphics/nmu.pdf}
\includegraphics[scale=0.7]{graphics/nmu.pdf}
\caption{Visualization of the NMU for a single output scalar $z_1$; this construction is repeated for every element of the output vector $\mathbf{z}$.}
\end{figure}
We motivate our work by an investigation of the NALU components: the soft gating mechanism that binds the sub-units, the sub-units $\text{NAC}_{\bullet}$ and $\text{NAC}_{+}$ themselves, and the way the parameters are constructed.
Our findings are the following:
(a) The $\text{NAC}_{\bullet}$ does not work for negative inputs.
(b) The $\text{NAC}_{\bullet}$ cannot model inputs $< \epsilon$.
(c) The optimal weight initialization for the parameters has a gradient of zero in expectation (see the sketch below).
(d) The weight construction does not enforce sparsity, and our results suggest that the learned weights rarely are sparse, which limits interpretability.
(e) The $\text{NAC}_{\bullet}$ has no optimal initialization.
(f) The expected value of the $\text{NAC}_{\bullet}$ output at initialization is exponential in the hidden size, and its variance explodes.
(g) Optimizing the $\text{NAC}_{\bullet}$ for division is close to impossible and never converges in practice.
(h) The NALU does not converge; we find that learning the gate is cumbersome due to the heterogeneity of the sub-units: the $\text{NAC}_{\bullet}$ takes orders of magnitude longer to converge than the $\text{NAC}_{+}$.%, and their estimated gradient signals has varying magnitude.
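To unpack (c) with a minimal sketch (assuming, as is standard, that $\hat{W}$ is initialized independently of $\hat{M}$ with zero mean): with the element-wise construction $W = \tanh(\hat{W})\,\sigma(\hat{M})$, the gradient with respect to $\hat{M}$ is
% Sketch only; zero-mean, independent initialization of the underlying parameters is assumed.
\begin{equation*}
\frac{\partial W}{\partial \hat{M}} = \tanh(\hat{W})\, \sigma(\hat{M}) \left(1 - \sigma(\hat{M})\right),
\qquad
\mathrm{E}\!\left[\frac{\partial W}{\partial \hat{M}}\right] = \mathrm{E}\!\left[\tanh(\hat{W})\right] \mathrm{E}\!\left[\sigma(\hat{M})\left(1 - \sigma(\hat{M})\right)\right] = 0.
\end{equation*}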

Motivated by these challenges, we attempt to solve (a-f) and leave (g-h) for future work.

Furthermore, we find that the experimental setup of \citet{trask-nalu} has the following concerns: (i) The dataset parameters for the ``simple function task'' are not defined. (ii) The evaluation metric is based on performance relative to a random baseline; this means that if the random baseline has an MSE of 1e10, then 1e7 would be considered a score of 0.1. (iii) Multiplication is only thoroughly tested in the ``simple function task'' and not on anything requiring a deep neural network.

We attempt to solve (i-iii) by proposing a much extended arithmetic task, defining a criterion for successful convergence, and multiplying MNIST digits.
% The investigation uncovers the following analytical and empirical concerns; the gradients of the weight matrix construction in $\text{NAC}_{+}$ and $\text{NAC}_{\bullet}$ have zero expectation, the $\text{NAC}_{\bullet}$ has a treacherous optimization space with unwanted global minimas near singularities, when applying the $\text{NAC}_{+}$ in isolation we observe that the wanted weight matrix values of $\{-1, 0, 1\}$ are rarely found, and our empirical results reveals that the NALU is significantly worse than hard-choosing either the $\text{NAC}_{+}$ or $\text{NAC}_{\bullet}$, indicating that the gating mechanism does not work as intended.

% We avoid using gating as we see no obvious solution to simultaneously train two vastly different sub-units, the NAU/$\text{NAC}_{+}$ and NMU/$\text{NAC}_{\bullet}$, with a soft gating mechanism. We expand upon why this is such a big challenge in section \ref{sec:methods:gatting-issue} and show empirically in \ref{} that this gating .
% We will thus assume that the desired operation is already known, or can empirically be found by varying the network architecture.

\subsection{Learning a 10 parameter function}
Consider the static function $t = (x_1 + x_2) \cdot (x_1 + x_2 + x_3 + x_4)$ for $x \in \mathbb{R}^4$. To illustrate the abilities of the $\mathrm{NAC}_{\bullet}$, the NALU, and our proposed NMU, we conduct 100 experiments for each model, where we attempt to fit this function. Table \ref{tab:very-simple-function-results} shows that the NMU has a higher success rate and converges faster.
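To make the parameter count concrete, here is a sketch of one exact solution, assuming a $4 \times 2$ addition layer followed by a $2 \times 1$ multiplication layer ($8 + 2 = 10$ weights):
% Illustrative sketch of an exact sparse solution for this task.
\begin{equation*}
\mathbf{z} = \mathbf{W}^{(1)} \mathbf{x}, \quad
\mathbf{W}^{(1)} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix}, \quad
t = \prod_{i=1}^{2} \left( W^{(2)}_{i} z_i + 1 - W^{(2)}_{i} \right), \quad
\mathbf{W}^{(2)} = \begin{bmatrix} 1 & 1 \end{bmatrix},
\end{equation*}
which recovers $t = (x_1 + x_2) \cdot (x_1 + x_2 + x_3 + x_4)$ exactly.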
