LSTM prediction is numerically inconsistent for the last few instances. #30995

Closed
Quiigi opened this issue Jul 24, 2019 · 16 comments
Assignees: ravikyram
Labels: comp:keras (Keras related issues), TF 1.12 (Issues related to TF 1.12), type:bug (Bug)

Comments


Quiigi commented Jul 24, 2019

The predictions you get may differ slightly depending on input length and position within it. E.g., if you have 11 instances of input, you get one answer for the first 8 and a different answer for the last 3.
I write "may" because it happens to me with probability around 0.4. "Slightly" means on the order of the least significant bits of the float32 mantissa.

System information

  • Yes, I have written custom code, supplied below as a reprex in R using keras.
  • Tried on two platforms, with identical results.
    Platform A:
  • Linux Ubuntu Ubuntu 16.04.5 LTS
  • TensorFlow version:
    VERSION "1.7.0"
    GIT_VERSION "v1.7.0-3-g024aecf414"
    COMPILER_VERSION "4.8.4"
  • Python version: 2.7.12

Platform B:

  • Linux Ubuntu 14.04.6 LTS
  • TensorFlow version: tried both
    VERSION "1.12.0"
    GIT_VERSION "v1.12.0-0-ga6d8ffae09"
    COMPILER_VERSION "4.8.5"
  • Python version: 2.7.6

both:

  • TensorFlow installed from binary.
  • Not a mobile device.
  • CUDA/cuDNN version: Not used.
  • GPU model and memory: Not used

Describe the current behavior
If the first dimension of x is n, "row" i will get one value if 0 <= i < (n&-4), but a possibly different value for (n&-4) <= i < n. (These are C++/Python-style 0-based indices. For R, 1-based, it's 0 < i <= bitwAnd(n, -4) versus bitwAnd(n, -4) < i <= n.)
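
For illustration, a small Python sketch of mine (not part of the original report) showing which rows fall on each side of that split:

n = 11
cutoff = n & -4               # largest multiple of 4 not exceeding n; here 8
head = list(range(cutoff))    # rows 0..7 get one value
tail = list(range(cutoff, n)) # rows 8..10 may get a slightly different value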

Describe the expected behavior
Reproducible prediction from the same input instance, independent of row number or input length. I use "row" in a generalized sense, for a slice of a tensor with a given fixed first index, e.g., x[i,,] or pred[i,].

Code to reproduce the issue
This is a reprex written in R. I'd be happy to port to other languages if that's preferable.

options(digits=8)
fake <- function(shape_) {                        # arbitrary but reproducible
   array(seq_len(prod(shape_)) %% 2.71 - 1.04, shape_)
}

library(keras)
shape <- c(30,5)
model <- keras_model_sequential() %>%
   layer_lstm(units=2, input_shape=shape) %>%
   set_weights(list(fake(c(5, 8)), fake(c(2, 8)), fake(8)))

n <- 11                                           # not a multiple of 4
x <- array(rep(fake(shape), each=n), c(n, shape)) # n copies of identical input
p <- model %>% predict(x)                         # all predictions should match
p                                                 # but last n%%4 rows differ
#>             [,1]        [,2]
#>  [1,] 0.46561426 -0.22865930
#>  [2,] 0.46561426 -0.22865930
#>  [3,] 0.46561426 -0.22865930
#>  [4,] 0.46561426 -0.22865930
#>  [5,] 0.46561426 -0.22865930
#>  [6,] 0.46561426 -0.22865930
#>  [7,] 0.46561426 -0.22865930
#>  [8,] 0.46561426 -0.22865930
#>  [9,] 0.46561423 -0.22865926
#> [10,] 0.46561423 -0.22865926
#> [11,] 0.46561423 -0.22865926
(t(p)-p[1,]) * 2**26                              # the difference is low bits
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
#> [1,]    0    0    0    0    0    0    0    0   -2    -2    -2
#> [2,]    0    0    0    0    0    0    0    0    3     3     3
stopifnot(t(p)==p[1,])                            # all *should* be equal
#> Error in eval(expr, envir, enclos): t(p) == p[1, ] are not all TRUE

##in contrast...
x12 <- array(rep(fake(shape), each=12), c(12, shape))
p12 <- model %>% predict(x12)
stopifnot(t(p12)==p12[1,])                         # ...all is well for n == 12

Created on 2019-07-25 by the reprex package (v0.2.1.9000)

Other info / logs

@ravikyram ravikyram self-assigned this Jul 25, 2019
@ravikyram ravikyram added comp:keras Keras related issues type:support Support issues labels Jul 25, 2019
@Quiigi Quiigi changed the title prediction is numerically inconsistent for last n%%4 instances LSTM prediction is numerically inconsistent for the last few instances. Jul 25, 2019
jvishnuvardhan (Contributor) commented:

This is not a Build/Installation or Bug/Performance issue. Please post this kind of support question on Stack Overflow. There is a big community to support and learn from your questions. GitHub is mainly for addressing bugs in installation and performance. Thanks!


Quiigi commented Jul 26, 2019

This is a bug report. (It is not about build, installation, or performance. It is about correctness.)

I provided a small reproducible example illustrating a bug. (In our real code, we trained a bigger model on a training set of thousands of instances, but then found that the trained model behaved oddly.) The same input should give the same response. And it does, except if the length of the input isn't divisible by 4; then the remaining 1 to 3 instances differ. To make this obvious, I repeated the same input 11 times in my reprex.

This bug in keras or tensorflow also manifests like this when applied to time series: if predicting n days, the first n-1 predictions should match what you get with 1 day less of data, generating just n-1 predictions. But the prediction on the historical data does change!

@jvishnuvardhan jvishnuvardhan added type:bug Bug and removed type:support Support issues labels Jul 30, 2019
@jvishnuvardhan jvishnuvardhan added stat:awaiting tensorflower Status - Awaiting response from tensorflower TF 1.12 Issues related to TF 1.12 labels Jul 30, 2019
jvishnuvardhan (Contributor) commented:

@Quiigi Can you provide standalone code in Python to reproduce the issue? Thanks!

@jvishnuvardhan jvishnuvardhan added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Jul 30, 2019

Quiigi commented Aug 1, 2019

After looking at Python basics, I rewrote my R example above in Python:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

def fake(shape_):                                # arbitrary but reproducible
    f = np.reshape(range(np.prod(shape_)), shape_, order="F") + 1
    return f % 2.71 - 1.04

shape = (30,5)
model = Sequential()
model.add(LSTM(units=2, input_shape=shape))
model.set_weights([fake((5, 8)), fake((2, 8)), fake(8)])

for n in [8, 7]:
    print("\nn = " + str(n))
    x= np.broadcast_to(fake(shape), (n,)+shape) # n copies of identical input
    p = model.predict(x)                        # all predictions should match
    p == p[1]                                   # but last n%4 rows differ
    (p-p[1]) * 2**26                            # the difference is low bits
    assert (p == p[1]).all()                    # fails iff n%4 > 0

I also ran two experiments: for n = 8, a multiple of 4, my check passes; for n = 7 it fails. Here's my output:

n = 8
array([[ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True]])
array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

n = 7
array([[ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [False, False],
       [False, False],
       [False, False]])
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [-2.,  3.],
       [-2.,  3.],
       [-2.,  3.]])
Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
AssertionError


qlzh727 commented Aug 1, 2019

Thanks for reporting the issue. Let me take a look.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Aug 2, 2019

Quiigi commented Sep 16, 2019

@qlzh727 What did you find?


qlzh727 commented Sep 17, 2019

Sorry for the late reply.

I was able to reproduce the issue, and I think it is somehow happening when the batch_size is not a perfect 2^n number. I can see the value difference between batch elements 0-3 and 4-6. If I change the batch size to 9, then the difference is between 0-7 and 8, and the same for batch size 17.

The cause of this might be numerical instability in the underlying numerical libraries. Also, given that the diff is so small, it is usually ignored in unit tests (the default atol and rtol for numpy's assert allclose() are about 1e-6). In fact, if I change the assert in the code to np.allclose(), the issue goes away.
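
For reference, the tolerance-based check mentioned above, applied to the p from the Python reprex earlier in this thread (my illustration, not part of the original comment):

assert np.allclose(p, p[1])   # passes: the low-bit discrepancy is well within np.allclose's default tolerances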

Could you give more details about why this issue concerns you in your application, and what specific problem it causes?


Quiigi commented Sep 18, 2019

We are predicting financial time series. In this application, snooping future data is an insidious problem: critical to avoid, but at times subtle and hard to notice. It's even possible that it is avoided in research, only to creep in in production use. For example, you need FX rates to convert prices into a common currency. The live data source for FX rates might "snap" the prices at a different time each day than your research data source and inadvertently "cheat".

We have found a "snoop test" to be a useful tool: every day, we generate
predictions not just for "tomorrow", but also for the past few days. Then
a live check can compare predictions on the overlapped days (all but the
last day). Specifically, when predicting n days, the first n-1 predictions
should match what you get if you had 1 less day of data and were generating
just n-1 predictions. This property is useful in testing and checking to
verify it catch bugs where a backtest accidentally snoops into future. If
on a new day your prediction changes for an old day, regardless of how
small the change, the cause might be snooped data, that was not available
back at the original date. But keras/tensorflow has a bug causing the final
1, 2or 3 predictions to change as you stack on additional data, even though
no change is made to the historical data!
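
A minimal Python sketch of that snoop test (my illustration; model and x are placeholders, not our production code):

import numpy as np

def snoop_test(model, x):
    """Check that predictions for the shared history do not change
    when one more day of data is appended."""
    p_full = model.predict(x)        # predictions using all n days
    p_hist = model.predict(x[:-1])   # predictions with the last day dropped
    # The overlapping n-1 rows should be bit-identical; any difference
    # suggests snooped data, or, as in this issue, a numerical
    # inconsistency in the library.
    return np.array_equal(p_full[:-1], p_hist)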

The numerical difference is very small, on the order of floating-point noise, and ordinarily completely insignificant. But it might be very significant in terms of falsely inflating your (small) edge in accurately predicting financial time series. Also: if you snoop the future data, the error you get is also very small. Finally, this issue makes it hard to use our "snoop test" idea because it introduces many false alarms that are generally indistinguishable from the real thing.

Ideally, the fix should be in the library. The way "multiple of 4" gets into it makes us think it is some sort of batch optimization that is flawed in some way. If there were a clean fix at that level, it would be ideal. (It doesn't have to be a perfect 2^n number; any multiple of 4 is immune, e.g., 12 in my original test case. We use the default batch size, 32.)

We have a workaround: we tack on an additional 0, 1, 2, or 3 irrelevant data elements when predicting, to force the length to always be a multiple of 4, and then discard as many rows from the prediction. Possibly this sort of fix could be pushed into keras, wrapped around the predict code, but hopefully there is a more elegant solution.
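
A minimal Python sketch of that workaround, assuming a Keras-style predict() interface (my illustration; names are placeholders):

import numpy as np

def predict_padded(model, x):
    """Pad the batch up to the next multiple of 4 with throwaway rows,
    predict, and drop the predictions that correspond to the padding."""
    n = x.shape[0]
    pad = (-n) % 4                              # 0, 1, 2, or 3 extra rows needed
    if pad:
        filler = np.repeat(x[-1:], pad, axis=0) # content is irrelevant
        x = np.concatenate([x, filler], axis=0)
    return model.predict(x)[:n]                 # discard predictions for the padding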


qlzh727 commented Sep 18, 2019

Thanks for the detailed explanation.

After some debugging, the lowest-level op I could track down that causes the difference was the "recurrent_activation" function (sigmoid), where inputs with the same value produce slightly different results. The underlying implementation of sigmoid for CPU goes to Eigen, of which I don't have any knowledge.

Adding @rmlarsen, who is the Eigen expert on the TF team, to this issue.


Quiigi commented Sep 19, 2019

Thank you for the update!

Trying to replicate your debugging, I wasn't able to find a node called "recurrent_activation" or "sigmoid" in my tensorflow graph. The closest I see is "Tanh", and its output (Tanh:0) shows the tiny discrepancies at the end. I see the issue in the nodes feeding it directly ("add_5") and indirectly (MatMul_6, BiasAdd_2, MatMul_2). Notably, the inputs to the latter (Enter and TensorArrayReadV3) are clean. So I would guess the difference starts around there.


qlzh727 commented Sep 19, 2019

The recurrent_activation I am talking about is at recurrent_activation='sigmoid'. I guess any of the non-linear functions might have a tiny discrepancy; it really depends on the implementation.


Quiigi commented Sep 20, 2019

I have keras.__version__ '2.2.4', and there's no "recurrent_v2.py". I see the recurrent activation at recurrent_activation='hard_sigmoid'. (And the activation is 'tanh', presumably resulting in the node from which I was able to trace the mod-4 issue back to matrix multiplication.)


Quiigi commented Oct 28, 2019

In my case, the inconsistency originates in MatMul. It's possible that the activation functions 'sigmoid' or 'hard_sigmoid' have a similar issue. This reduced example demonstrates it:

import numpy as np
import tensorflow as tf
a = np.broadcast_to(np.float32([.6, -.8, -.3, 0]), (5,4))
b = np.float32([[8, 1], [6, 4], [-9, 1], [0, 0]])
tf.matmul(a,b).eval(session=tf.Session())

My output:

array([[ 2.6999998, -2.9      ],
       [ 2.6999998, -2.9      ],
       [ 2.6999998, -2.9      ],
       [ 2.6999998, -2.9      ],
       [ 2.7      , -2.8999999]], dtype=float32)

This is an improvement because the LSTM generates a graph with 268 nodes, including While loops, whereas the reproducible example above has just one simple node. I'm not sure where exactly the discrepancy creeps in; maybe in the function evalGemm, i.e.,

  void Eigen::TensorContractionEvaluatorBase<
    Eigen::TensorEvaluator<
      Eigen::TensorContractionOp<
        Eigen::array<Eigen::IndexPair<long>, 1ul> const,
        Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const,
        Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const
        > const,
      Eigen::ThreadPoolDevice> >
  ::evalGemm<true, true, false, 0>(float*) const

but that's just a guess.


qlzh727 commented Feb 20, 2020

Due to the numerical instability, I don't think there is anything we can address here (the diff is smaller than the normal tolerance we use in tests), so I am going to close this bug.

@qlzh727 qlzh727 closed this as completed Feb 20, 2020


Quiigi commented Feb 21, 2020

I had hoped to learn where exactly the instability arises.

I assume we are seeing an artifact of optimization, trading accuracy for speed. And since the error is within your normal tolerance, the result we see must be deemed "correct". Therefore, closing this issue is appropriate.
