
Commit

Merge overleaf-2020-01-14-1232 into master
AndreasMadsen committed Jan 14, 2020
2 parents 3e31a1d + 3ca8546 commit 1961194
Showing 7 changed files with 43 additions and 51 deletions.
9 changes: 5 additions & 4 deletions paper/appendix/sequential-mnist.tex
@@ -3,13 +3,14 @@ \section{Sequential MNIST}
\subsection{Task and evaluation criteria}
The simple function task is a purely synthetic task that does not require a deep network. As such, it does not test whether an arithmetic layer inhibits the network's ability to be optimized using gradient descent.

The sequential MNIST task takes the numerical value of a sequence of MNIST digits and applies a binary operation recursively. Such that $t_i = Op(t_{i-1}, z_t)$, where $z_t$ is the MNIST digit's numerical value.
The sequential MNIST task takes the numerical value of a sequence of MNIST digits and applies a binary operation recursively, such that $t_i = \mathrm{Op}(t_{i-1}, z_i)$, where $z_i$ is the numerical value of the $i$'th MNIST digit. This is identical to the ``MNIST Counting and Arithmetic Tasks'' in \citet[section 4.2]{trask-nalu}. We present the addition variant to validate the NAU's ability to backpropagate, and add a multiplication variant to validate the NMU's ability to backpropagate.
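To make the recursion concrete, the following is a minimal sketch of how the targets can be generated from a sequence of digit values. The function name and the use of plain Python are illustrative assumptions, and the sketch assumes the recursion starts from the first digit's value, which may differ from the exact data pipeline.

    import operator

    def sequential_targets(digit_values, op=operator.add):
        # t_1 = z_1 and t_i = Op(t_{i-1}, z_i) for i > 1.
        targets = [digit_values[0]]
        for z in digit_values[1:]:
            targets.append(op(targets[-1], z))
        return targets

    # Addition variant: cumulative sums of the digit values.
    print(sequential_targets([3, 1, 4], operator.add))  # [3, 4, 8]
    # Multiplication variant: cumulative products.
    print(sequential_targets([3, 1, 4], operator.mul))  # [3, 3, 12]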

The performance on this task depends on the quality of the image-to-scalar network and on the arithmetic layer's ability to model the scalar. We use the mean squared error (MSE) to evaluate the joint performance of the image-to-scalar network and the arithmetic layer. To determine an MSE threshold for a correct prediction we use an empirical baseline: the arithmetic layer is fixed to the correct solution, such that only the image-to-scalar network is learned. By training this baseline over multiple seeds, an upper bound for the MSE threshold can be set. In our experiments we use the 1\% one-sided upper confidence interval, assuming a Student's t-distribution.
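As one reading of this procedure, the sketch below computes such a threshold from per-seed baseline MSEs. Interpreting the 1\% one-sided upper confidence interval as the upper end of a 99\% one-sided t-interval for the mean is an assumption here, not a quote of the paper's code.

    import numpy as np
    from scipy import stats

    def mse_threshold(baseline_mses, alpha=0.01):
        # Upper end of a one-sided (1 - alpha) confidence interval for the
        # mean baseline MSE, assuming a Student's t-distribution.
        mses = np.asarray(baseline_mses, dtype=float)
        n = mses.size
        sem = mses.std(ddof=1) / np.sqrt(n)
        return mses.mean() + stats.t.ppf(1 - alpha, df=n - 1) * sem

    # Example: per-seed MSEs from baselines where the arithmetic layer is solved.
    print(mse_threshold([0.11, 0.13, 0.10, 0.12, 0.14]))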

Similar to the simple function task, we use a success-criterion, as reporting the MSE is not interpretable and models that do not converge will obscure the mean. Furthermore, because the operation is applied recursively, natural error from the dataset accumulates over time, thus exponentially increasing the MSE. Using a baseline model and reporting the success-rate solves this interpretation challenge.

\subsection{Addition of sequential MNIST}
\label{sec:appendix:mnist:addition-experiment}

Figure \ref{fig:sequential-mnist-sum} shows results for sequential addition of MNIST digits. This experiment is identical to the ``MNIST Digit Addition Test'' from \citet[section 4.2]{trask-nalu}. The models are trained on a sequence of 10 digits and evaluated on sequences of between 1 and 1000 MNIST digits.

@@ -19,7 +20,7 @@ \subsection{Addition of sequential MNIST}
\end{equation}
To provide a fair comparison, a variant of $\mathrm{NAC}_{+}$ that also uses this regularizer is included; this variant is called $\mathrm{NAC}_{+, R_z}$. Section \ref{sec:appendix:sequential-mnist-sum:ablation} provides an ablation study of the $R_z$ regularizer.

\begin{figure}[h]
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth,trim={0 0.5cm 0 0},clip]{paper/results/sequential_mnist_sum_long.pdf}
\caption{Shows the ability of each model to learn the arithmetic operation of addition and backpropagate through the arithmetic layer in order to learn an image-to-scalar value for MNIST digits. The model is tested by extrapolating to larger sequence lengths than what it has been trained on. The NAU and $\mathrm{NAC}_{+,R_z}$ models use the $\mathrm{R}_z$ regularizer from section \ref{section:results:cumprod_mnist}.}
@@ -31,7 +32,7 @@ \subsection{Sequential addition without the \texorpdfstring{$\mathrm{R}_z$}{R\_z}

As an ablation study of the $\mathrm{R}_z$ regularizer, figure \ref{fig:sequential-mnist-sum-ablation} shows the NAU model without the $\mathrm{R}_z$ regularizer. Removing the regularizer causes a reduction in the success-rate. The reduction is likely larger than for sequential multiplication because the sequence length used for training is longer. The loss function is most sensitive to the 10th output in the sequence, as this has the largest scale. This causes some model instances to simply learn the mean, which becomes passable for very long sequences and explains why the success-rate increases for longer sequences. However, this is not a valid solution; a well-behaved model should be successful independently of the sequence length.

\begin{figure}[h]
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth,trim={0 0.5cm 0 0},clip]{paper/results/sequential_mnist_sum_long_ablation.pdf}
\caption{Same as figure \ref{fig:sequential-mnist-sum}, but where the NAU model does not use the $\mathrm{R}_z$ regularizer.}
@@ -43,7 +44,7 @@ \subsection{Sequential multiplication without the \texorpdfstring{$\mathrm{R}_z$

As an ablation study of the $\mathrm{R}_z$ regularizer, figure \ref{fig:sequential-mnist-prod-ablation} shows the NMU and $\mathrm{NAC}_{\bullet,\mathrm{NMU}}$ models without the $\mathrm{R}_z$ regularizer. The success-rate is somewhat similar to figure \ref{fig:sequential-mnist-prod-results}. However, as seen in the ``sparsity error'' plot, the solution is quite different.

\begin{figure}[h]
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth,trim={0 0.5cm 0 0},clip]{results/sequential_mnist_prod_long_ablation.pdf}
\caption{Shows the ability of each model to learn the arithmetic operation of multiplication and backpropagate through the arithmetic layer in order to learn an image-to-scalar value for MNIST digits. The model is tested by extrapolating to larger sequence lengths than what it has been trained on. The NMU and $\mathrm{NAC}_{\bullet,\mathrm{NMU}}$ models do not use the $\mathrm{R}_z$ regularizer.}
19 changes: 10 additions & 9 deletions paper/appendix/simple-function-task.tex
@@ -1,9 +1,9 @@
\section{Arithmetic task}

Our ``arithmetic task'' is identical to the ``simple function task'' in the NALU paper \cite{trask-nalu}. However, as they do not describe their dataset generation, dataset parameters, and model evaluation in details we elaborate on that here.

The aim of the ``Arithmetic task'' is to directly test an arithmetic model's ability to extrapolate beyond the training range. Additionally, our generalized version provides a high degree of flexibility in how the input is shaped and sampled, and in the problem complexity.

Our ``arithmetic task'' is identical to the ``simple function task'' in the NALU paper \cite{trask-nalu}. However, as they do not describe their setup in detail, we use the setup from \citet{maep-madsen-johansen-2019}, which provides Algorithm \ref{tab:simple-function-task-defaults}, an evaluation criterion for if and when the model has converged, the sparsity error, as well as methods for computing confidence intervals for the success-rate and the sparsity error.
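For intuition, a success-rate confidence interval can be computed from the number of converged seeds. The Clopper--Pearson interval sketched below is one standard choice and is only an assumption here; it is not necessarily the method used in \citet{maep-madsen-johansen-2019}.

    from scipy import stats

    def success_rate_interval(successes, trials, alpha=0.05):
        # Clopper-Pearson (exact binomial) confidence interval for a proportion.
        if successes == 0:
            lower = 0.0
        else:
            lower = stats.beta.ppf(alpha / 2, successes, trials - successes + 1)
        if successes == trials:
            upper = 1.0
        else:
            upper = stats.beta.ppf(1 - alpha / 2, successes + 1, trials - successes)
        return lower, upper

    # Example: 87 of 100 seeds converge.
    print(success_rate_interval(87, 100))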

\begin{figure}[h]
\centering
\includegraphics[scale=0.7]{graphics/function_task_static_problem.pdf}
@@ -136,24 +136,24 @@ \subsection{Gating convergence experiment}

In the interest of adding some understanding of what goes wrong in the NALU gate, and of the shared-weight choice that NALU employs to remedy this, we introduce the following experiment.

We train two models to fit the arithmetic task. Both uses the $\mathrm{NAC}_{+}$ in the first layer and NALU in the second layer. The only difference is that one model shares the weight between $\mathrm{NAC}_{+}$ and $\mathrm{NAC}_{\bullet}$ in the NALU, and the other just treat them as two separate models with separate weights. In both cases NALU should gate between $\mathrm{NAC}_{+}$ and $\mathrm{NAC}_{\bullet}$ and choose the appropriate operation. Note that this NALU model is different from the one presented elsewhere in this paper, including the original NALU paper \cite{trask-nalu}. The typical NALU model is just two NALU layers with shared weights.
We train two models to fit the arithmetic task. Both use $\mathrm{NAC}_{+}$ in the first layer and NALU in the second layer. The only difference is that one model shares the weights between $\mathrm{NAC}_{+}$ and $\mathrm{NAC}_{\bullet}$ in the NALU, and the other treats them as two separate units with separate weights. In both cases NALU should gate between $\mathrm{NAC}_{+}$ and $\mathrm{NAC}_{\bullet}$ and choose the appropriate operation. Note that this NALU model is different from the one presented elsewhere in this paper, including the original NALU paper \cite{trask-nalu}. The typical NALU model is just two NALU layers with shared weights.

Furthermore, we also introduce a new gated unit that simply gates between our proposed NMU and NAU, using the same sigmoid gating-mechanism as in the NALU. This combination is done with seperate weights, as NMU and NAU uses different weight constrains and can therefore not be shared.
Furthermore, we also introduce a new gated unit that simply gates between our proposed NMU and NAU, using the same sigmoid gating-mechanism as in the NALU. This combination is done with separate weights, as the NMU and NAU use different weight constraints and can therefore not be shared.
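For concreteness, the sketch below shows the sigmoid gating mechanism referred to here, wrapping two arbitrary sub-units. The class name, parameter shapes, and initialization are illustrative assumptions, not the stable-nalu implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedUnit(nn.Module):
        # Gates between two sub-units with a sigmoid gate:
        # y = g * add_unit(x) + (1 - g) * mul_unit(x), with g = sigmoid(G x).
        def __init__(self, add_unit, mul_unit, in_features, out_features):
            super().__init__()
            self.add_unit = add_unit
            self.mul_unit = mul_unit
            self.G = nn.Parameter(torch.empty(out_features, in_features))
            nn.init.xavier_uniform_(self.G)

        def forward(self, x):
            g = torch.sigmoid(F.linear(x, self.G))
            return g * self.add_unit(x) + (1 - g) * self.mul_unit(x)

    # Example with plain linear layers as stand-ins for the two sub-units.
    gated = GatedUnit(nn.Linear(4, 2, bias=False), nn.Linear(4, 2, bias=False), 4, 2)
    print(gated(torch.randn(8, 4)).shape)  # torch.Size([8, 2])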

The models are trained and evaluated over 100 different seeds on the multiplication and addition tasks. A histogram of the gate-values over all seeds is presented in figure \ref{fig:simple-function-static-nalu-gate-graph}, and table \ref{tab:simple-function-static-nalu-gate-table} contains a summary. Some noteworthy observations:

\vspace{-0.3cm}\begin{enumerate}
\item When the NALU weights are separated, far more trials converge to select $\mathrm{NAC}_{+}$ for both the addition and multiplication tasks. Sharing the weights between $\mathrm{NAC}_{+}$ and $\mathrm{NAC}_{\bullet}$ makes the gating less likely to converge for addition.
\item The performance on the addition task is dependent on NALU selecting the right operation. In the multiplication task, even when the right gate is selected, $\mathrm{NAC}_{\bullet}$ does not converge consistently, unlike our NMU, which converges more consistently.
\item Which operation the gate converges to appears to be mostly random and independent of the task. This issues caused by the sigmoid gating-mechanism and thus exists independent of the used sub-units.
\item Which operation the gate converges to appears to be mostly random and independent of the task. These issues are caused by the sigmoid gating-mechanism and thus exist independently of the sub-units used.
\end{enumerate}

These observations validates that the NALU gating-mechanism does not converge as intended. This becomes a critical issues when more gates are present, as is normally the case. E.g. when stacking multiple NALU layers together.
\vspace{-0.2cm}These observations validate that the NALU gating-mechanism does not converge as intended. This becomes a critical issue when more gates are present, as is normally the case, e.g.\ when stacking multiple NALU layers together.

\begin{figure}[h]
\centering
\includegraphics[width=0.98\linewidth]{results/function_task_static_nalu.pdf}
\caption{Shows the gating-value in the NALU layer and a variant that uses NAU/NMU instead of $\mathrm{NAC}_{+}$/$\mathrm{NAC}_{\bullet}$. Separate/shared refers to the weights in $\mathrm{NAC}_{+}$/$\mathrm{NAC}_{\bullet}$ used in NALU.}
\includegraphics[width=0.93\linewidth]{results/function_task_static_nalu.pdf}
\vspace{-0.2cm}\caption{Shows the gating-value in the NALU layer and a variant that uses NAU/NMU instead of $\mathrm{NAC}_{+}$/$\mathrm{NAC}_{\bullet}$. Separate/shared refers to the weights in $\mathrm{NAC}_{+}$/$\mathrm{NAC}_{\bullet}$ used in NALU.}
\label{fig:simple-function-static-nalu-gate-graph}
\end{figure}

@@ -196,9 +196,10 @@ \subsection{Comparing all models}
Table \ref{tab:function-task-static-defaults-all} compares all models on all operations used in NALU \cite{trask-nalu}. All variations of models and operations are trained for 100 different seeds to build confidence intervals. Some noteworthy observations are:

\begin{enumerate}
\item Division does not work for any model, including the $\mathrm{NAC}_{\bullet}$ and NALU models. This may seem surprising but is actually in line with the results from the NALU paper (\citet{trask-nalu}, table 1) where there is a large error given the interpolation range. The extrapolation range has a smaller error, but this is an artifact of their evaluation method where they normalize with a random baseline. Since a random baseline with have a higher error for the extrapolation range, a similar error will appear to be smaller. A correct solution to division should have both a small interpolation and extrapolation error.
\item Division does not work for any model, including the $\mathrm{NAC}_{\bullet}$ and NALU models. This may seem surprising but is actually in line with the results from the NALU paper (\citet{trask-nalu}, table 1) where there is a large error given the interpolation range. The extrapolation range has a smaller error, but this is an artifact of their evaluation method where they normalize with a random baseline. Since a random baseline will have a higher error for the extrapolation range, errors just appear to be smaller. A correct solution to division should have both a small interpolation and extrapolation error.
\item $\mathrm{NAC}_{\bullet}$ and NALU are barely able to learn $\sqrt{z}$, with just 2\% success-rate for NALU and 7\% success-rate for $\mathrm{NAC}_{\bullet}$.
\item NMU is fully capable of learning $z^2$. It learns this by learning the same subset twice in the NAU layer; this is also how $\mathrm{NAC}_{\bullet}$ learns $z^2$ (see the numeric sketch after this list).
\item The Gated NAU/NMU (discussed in section \ref{sec:appendix:nalu-gate-experiment}) works very poorly, because the NMU initialization assumes that $E[z_{h_{\ell-1}}] = 0$. This is usually true, as discussed in section \ref{sec:methods:moments-and-initialization}, but not in this case for the first layer. In the recommended NMU model, the NMU layer appears after NAU, which causes that assumption to be satisfied.
\end{enumerate}
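As a numeric illustration of how selecting the same input twice yields $z^2$, the sketch below applies a NAU-style linear layer followed by the NMU's multiplicative form $\prod_i (W_i h_i + 1 - W_i)$. The weight values are hand-picked for illustration, not learned.

    import numpy as np

    # Hand-picked weights illustrating z_1^2: the NAU copies z_1 into two
    # hidden units, and the NMU multiplies those two copies.
    z = np.array([3.0, 5.0])            # input vector; target is z_1^2 = 9
    W_nau = np.array([[1.0, 0.0],        # hidden unit 1 selects z_1
                      [1.0, 0.0]])       # hidden unit 2 also selects z_1
    hidden = W_nau @ z                   # -> [3., 3.]
    W_nmu = np.array([1.0, 1.0])         # the NMU selects both hidden units
    out = np.prod(W_nmu * hidden + 1 - W_nmu)
    print(out)                           # 9.0 == z[0] ** 2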

\input{results/function_task_static_all.tex}
25 changes: 6 additions & 19 deletions paper/main.tex
@@ -3,7 +3,7 @@
\usepackage{iclr2020_conference,times}

% COMMENT for anonymous submission
% \def\nonanonymous{}
\def\nonanonymous{}

\ifdefined\nonanonymous
\iclrfinalcopy
@@ -80,29 +80,16 @@
%What’s the domain?
Neural networks can approximate complex functions, but they struggle to perform exact arithmetic operations over real numbers.
%What’s the issue?
The lack of inductive bias for arithmetic operations leaves neural networks without the underlying logic needed to extrapolate on tasks such as addition, subtraction, and multiplication.
The lack of inductive bias for arithmetic operations leaves neural networks without the underlying logic necessary to extrapolate on tasks such as addition, subtraction, and multiplication.
%What’s your contribution?
We present two new neural network components: the Neural Addition Unit (NAU), which can learn to add and subtract; and Neural Multiplication Unit (NMU) that can multiply subsets of a vector.
We present two new neural network components: the Neural Addition Unit (NAU), which can learn exact addition and subtraction; and the Neural Multiplication Unit (NMU), which can multiply subsets of a vector.
%Why is it novel?
The NMU is to our knowledge the first arithmetic neural network component that can learn multiplication of a vector with a large hidden size.
The NMU is, to our knowledge, the first arithmetic neural network component that can learn to multiply elements from a vector when the hidden size is large.
%What’s interesting about it?
The two new components draw inspiration from a theoretical analysis of recent arithmetic components.
The two new components draw inspiration from a theoretical analysis of recently proposed arithmetic components.
We find that careful initialization, restricting the parameter space, and regularizing for sparsity are important when optimizing the NAU and NMU.
%How does it perform?
Our results, compared with previous attempts, show that the NAU and NMU converges more consistently, have fewer parameters, learn faster, do not diverge with large hidden sizes, obtain sparse and meaningful weights, and can extrapolate to negative and small numbers.\ifdefined\nonanonymous\footnote{Implementation is available on GitHub: \url{https://github.com/AndreasMadsen/stable-nalu}.}\fi

%What’s the domain?
%Exact arithmetic operations of real numbers in Neural Networks present a unique learning challenge for machine learning models.
%What’s the issue?
%Neural networks can approximate complex functions by learning from labeled data. However, when extrapolating to out-of-distribution samples neural networks often fail. Learning the underlying logic, as opposed to an approximation, is crucial for applications such as comparing, counting, and inferring physical models.
%What’s your contribution?
%Our proposed Neural Addition Unit (NAU) and Neural Multiplication Unit (NMU) can learn, using backpropagation, the underlying rules of real number addition, subtraction, and multiplication thereby performing well when extrapolating.
%Why is it novel?
%The proposed units controls the arithmetic operation by using a sparse weight matrix, which allows the units to perform exact arithmetic operations.
%What’s interesting about it?
%Through theoretical analysis, supported by empirical evidence, we justify how the NAU and NMU improve over previous methods. Our experimental setting, motivated by previous work, includes an arithmetic extrapolation task and multiplication of up to 20 MNIST digits.
%How does it perform?
%We show that NAU and NMU have fewer parameters, converges more consistently, learns faster, handles large hidden sizes better, and have more meaningful discrete values than previous attempts.\ifdefined\nonanonymous\footnote{Implementation is available on GitHub: \url{https://github.com/AndreasMadsen/stable-nalu}.}\fi
Our proposed units NAU and NMU, compared with previous neural units, converge more consistently, have fewer parameters, learn faster, can converge for larger hidden sizes, obtain sparse and meaningful weights, and can extrapolate to negative and small values.\ifdefined\nonanonymous\footnote{Implementation is available on GitHub: \url{https://github.com/AndreasMadsen/stable-nalu}.}\fi
\end{abstract}

\input{sections/introduction}
2 changes: 1 addition & 1 deletion paper/results/function_task_static_all.tex
@@ -209,4 +209,4 @@

\nopagebreak
\multirow{-11}{*}{\centering\arraybackslash $z^2$} & ReLU6 & $0\% {~}^{+4\%}_{-0\%}$ & --- & --- & ---\\*
\end{longtable}
