LSTM prediction is numerically inconsistent for the last few instances. #30995

Closed
Quiigi opened this issue Jul 24, 2019 · 16 comments
Assignees: ravikyram
Labels: comp:keras (Keras related issues), TF 1.12 (Issues related to TF 1.12), type:bug (Bug)

Comments


Quiigi commented Jul 24, 2019

The predictions you get may differ slightly depending on input length and position within it. E.g., if you have 11 instances of input, you get one answer for the first 8 and a different answer for the last 3.
I write "may" because it happens to me with probability around 0.4. "Slightly" means on the order of the least significant bits of the float32 mantissa.

System information

  • Yes, I have written custom code, supplied below as a reprex in R using keras.
  • Tried on two platforms, with identical results.
    Platform A:
  • Linux Ubuntu Ubuntu 16.04.5 LTS
  • TensorFlow version:
    VERSION "1.7.0"
    GIT_VERSION "v1.7.0-3-g024aecf414"
    COMPILER_VERSION "4.8.4"
  • Python version: 2.7.12

Platform B:

  • Linux Ubuntu 14.04.6 LTS
  • TensorFlow version: tried both
    VERSION "1.12.0"
    GIT_VERSION "v1.12.0-0-ga6d8ffae09"
    COMPILER_VERSION "4.8.5"
  • Python version: 2.7.6

both:

  • TensorFlow installed from binary.
  • Not a mobile device.
  • CUDA/cuDNN version: Not used.
  • GPU model and memory: Not used

Describe the current behavior
If the first dimension of x is n, "row" i will get one value if 0 <= i < (n&-4), but a possibly different value for (n&-4) <= i < n. (These are C++/Python-style 0-based indices. For R, 1-based, it's 0 < i <= bitwAnd(n, -4) versus bitwAnd(n, -4) < i <= n.)
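
For illustration, a small Python sketch of mine (not part of the original report) showing which rows fall on each side of that split:

n = 11
cutoff = n & -4               # largest multiple of 4 not exceeding n; here 8
head = list(range(cutoff))    # rows 0..7 get one value
tail = list(range(cutoff, n)) # rows 8..10 may get a slightly different value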

Describe the expected behavior
Reproducible prediction from the same input instance, independent of row number or input length. I use "row" in a generalized sense, for a slice of a tensor with a given fixed first index, e.g., x[i,,] or pred[i,].

Code to reproduce the issue
This is a reprex written in R. I'd be happy to port to other languages if that's preferable.

options(digits=8)
fake <- function(shape_) {                        # arbitrary but reproducible
   array(seq_len(prod(shape_)) %% 2.71 - 1.04, shape_)
}

library(keras)
shape <- c(30,5)
model <- keras_model_sequential() %>%
   layer_lstm(units=2, input_shape=shape) %>%
   set_weights(list(fake(c(5, 8)), fake(c(2, 8)), fake(8)))

n <- 11                                           # not a multiple of 4
x <- array(rep(fake(shape), each=n), c(n, shape)) # n copies of identical input
p <- model %>% predict(x)                         # all predictions should match
p                                                 # but last n%%4 rows differ
#>             [,1]        [,2]
#>  [1,] 0.46561426 -0.22865930
#>  [2,] 0.46561426 -0.22865930
#>  [3,] 0.46561426 -0.22865930
#>  [4,] 0.46561426 -0.22865930
#>  [5,] 0.46561426 -0.22865930
#>  [6,] 0.46561426 -0.22865930
#>  [7,] 0.46561426 -0.22865930
#>  [8,] 0.46561426 -0.22865930
#>  [9,] 0.46561423 -0.22865926
#> [10,] 0.46561423 -0.22865926
#> [11,] 0.46561423 -0.22865926
(t(p)-p[1,]) * 2**26                              # the difference is low bits
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
#> [1,]    0    0    0    0    0    0    0    0   -2    -2    -2
#> [2,]    0    0    0    0    0    0    0    0    3     3     3
stopifnot(t(p)==p[1,])                            # all *should* be equal
#> Error in eval(expr, envir, enclos): t(p) == p[1, ] are not all TRUE

##in contrast...
x12 <- array(rep(fake(shape), each=12), c(12, shape))
p12 <- model %>% predict(x12)
stopifnot(t(p12)==p12[1,])                         # ...all is well for n == 12

Created on 2019-07-25 by the reprex package (v0.2.1.9000)

Other info / logs

@ravikyram ravikyram self-assigned this Jul 25, 2019
@ravikyram ravikyram added comp:keras Keras related issues type:support Support issues labels Jul 25, 2019
@Quiigi Quiigi changed the title prediction is numerically inconsistent for last n%%4 instances LSTM prediction is numerically inconsistent for the last few instances. Jul 25, 2019
jvishnuvardhan (Contributor) commented:

This is not a Build/Installation or Bug/Performance issue. Please post this kind of support question on Stack Overflow. There is a big community to support and learn from your questions. GitHub is mainly for addressing bugs in installation and performance. Thanks!


Quiigi commented Jul 26, 2019

This is a bug report. (It is not about build, installation, or performance. It is about correctness.)

I provided a small reproducible example illustrating a bug. (In our real code, we trained a bigger model on a training set of thousands of instances, but then found that the trained model behaved oddly.) The same input should give the same response. And it does, except if the length of the input isn't divisible by 4; then the remaining 1 to 3 instances differ. To make this obvious, I repeated the same input 11 times in my reprex.

This bug in keras or tensorflow also manifests like this when applied to time series: if predicting n days, the first n-1 predictions should match what you get with 1 day less of data, generating just n-1 predictions. But the prediction on the historical data does change!

@jvishnuvardhan jvishnuvardhan added type:bug Bug and removed type:support Support issues labels Jul 30, 2019
@jvishnuvardhan jvishnuvardhan added stat:awaiting tensorflower Status - Awaiting response from tensorflower TF 1.12 Issues related to TF 1.12 labels Jul 30, 2019
jvishnuvardhan (Contributor) commented:

@Quiigi Can you provide standalone code in Python to reproduce the issue? Thanks!

@jvishnuvardhan jvishnuvardhan added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Jul 30, 2019

Quiigi commented Aug 1, 2019

After looking at Python basics, I rewrote my R example above in Python:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

def fake(shape_):                                # arbitrary but reproducible
    f = np.reshape(range(np.prod(shape_)), shape_, order="F") + 1
    return f % 2.71 - 1.04

shape = (30,5)
model = Sequential()
model.add(LSTM(units=2, input_shape=shape))
model.set_weights([fake((5, 8)), fake((2, 8)), fake(8)])

for n in [8, 7]:
    print("\nn = " + str(n))
    x= np.broadcast_to(fake(shape), (n,)+shape) # n copies of identical input
    p = model.predict(x)                        # all predictions should match
    p == p[1]                                   # but last n%4 rows differ
    (p-p[1]) * 2**26                            # the difference is low bits
    assert (p == p[1]).all()                    # fails iff n%4 > 0

I also ran two experiments: for n = 8, a multiple of 4, my check passes; for n = 7 it fails. Here's my output:

n = 8
array([[ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True]])
array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

n = 7
array([[ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [False, False],
       [False, False],
       [False, False]])
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [-2.,  3.],
       [-2.,  3.],
       [-2.,  3.]])
Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
AssertionError


qlzh727 commented Aug 1, 2019

Thanks for reporting the issue. Let me take a look.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Aug 2, 2019

Quiigi commented Sep 16, 2019

@qlzh727 What did you find?


qlzh727 commented Sep 17, 2019

Sorry for the late reply.

I was able to reproduce the issue, and I think it is somehow happening when the batch_size is not a perfect 2^n number. I can see the value difference between batch elements 0-3 and 4-6. If I change the batch size to 9, then the difference is between 0-7 and 8, and the same for batch size 17.

The cause of this might be numerical instability in the underlying numerical libraries. Also, given that the diff is so small, it is usually ignored in unit tests (the default atol and rtol for numpy's assert allclose() are about 1e-6). In fact, if I change the assert in the code to np.allclose(), the issue goes away.
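
For reference, the tolerance-based check mentioned above, applied to the p from the Python reprex earlier in this thread (my illustration, not part of the original comment):

assert np.allclose(p, p[1])   # passes: the low-bit discrepancy is well within np.allclose's default tolerances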

Could you give more details about why this issue concerns you in your application, and what specific problem it causes?


Quiigi commented Sep 18, 2019

We are predicting financial time series. In this application, snooping future data is an insidious problem: critical to avoid, but at times subtle and hard to notice. It's even possible that it is avoided in research, only to creep in in production use. For example, you need FX rates to convert prices into a common currency. The live data source for FX rates might "snap" the prices at a different time each day than your research data source and inadvertently "cheat".

We have found a "snoop test" to be a useful tool: every day, we generate
predictions not just for "tomorrow", but also for the past few days. Then
a live check can compare predictions on the overlapped days (all but the
last day). Specifically, when predicting n days, the first n-1 predictions
should match what you get if you had 1 less day of data and were generating
just n-1 predictions. This property is useful in testing and checking to
verify it catch bugs where a backtest accidentally snoops into future. If
on a new day your prediction changes for an old day, regardless of how
small the change, the cause might be snooped data, that was not available
back at the original date. But keras/tensorflow has a bug causing the final
1, 2or 3 predictions to change as you stack on additional data, even though
no change is made to the historical data!
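
A minimal Python sketch of that snoop test (my illustration; model and x are placeholders, not our production code):

import numpy as np

def snoop_test(model, x):
    """Check that predictions for the shared history do not change
    when one more day of data is appended."""
    p_full = model.predict(x)        # predictions using all n days
    p_hist = model.predict(x[:-1])   # predictions with the last day dropped
    # The overlapping n-1 rows should be bit-identical; any difference
    # suggests snooped data, or, as in this issue, a numerical
    # inconsistency in the library.
    return np.array_equal(p_full[:-1], p_hist)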

The numerical difference is very small, on the order of floating-point noise, and ordinarily completely insignificant. But it might be very significant in terms of falsely inflating your (small) edge in accurately predicting financial time series. Also: if you snoop the future data, the error you get is also very small. Finally, this issue makes it hard to use our "snoop test" idea because it introduces many false alarms that are generally indistinguishable from the real thing.

Ideally, the fix should be in the library. The way "multiple of 4" gets into it makes us think it is some sort of batch optimization that is flawed in some way. If there were a clean fix at that level, it would be ideal. (It doesn't have to be a perfect 2^n number; any multiple of 4 is immune, e.g., 12 in my original test case. We use the default batch size, 32.)

We have a workaround: we tack on an additional 0, 1, 2, or 3 irrelevant data elements when predicting, to force the length to always be a multiple of 4, and then discard as many rows from the prediction. Possibly this sort of fix could be pushed into keras, wrapped around the predict code, but hopefully there is a more elegant solution.
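
A minimal Python sketch of that workaround, assuming a Keras-style predict() interface (my illustration; names are placeholders):

import numpy as np

def predict_padded(model, x):
    """Pad the batch up to the next multiple of 4 with throwaway rows,
    predict, and drop the predictions that correspond to the padding."""
    n = x.shape[0]
    pad = (-n) % 4                              # 0, 1, 2, or 3 extra rows needed
    if pad:
        filler = np.repeat(x[-1:], pad, axis=0) # content is irrelevant
        x = np.concatenate([x, filler], axis=0)
    return model.predict(x)[:n]                 # discard predictions for the padding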


qlzh727 commented Sep 18, 2019

Thanks for the detailed explanation.

After some debugging, the lowest-level op I could track down that causes the difference was the "recurrent_activation" function (sigmoid), where inputs with the same value produce slightly different results. The underlying implementation of sigmoid for CPU goes to Eigen, of which I don't have any knowledge.

Adding @rmlarsen, who is the Eigen expert on the TF team, to this issue.


Quiigi commented Sep 19, 2019

Thank you for the update!

Trying to replicate your debugging, I wasn't able to find a node called "recurrent_activation" or "sigmoid" in my tensorflow graph. The closest I see is "Tanh", and its output (Tanh:0) shows the tiny discrepancies at the end. I see the issue in the nodes feeding it directly ("add_5") and indirectly (MatMul_6, BiasAdd_2, MatMul_2). Notably, the inputs to the latter (Enter and TensorArrayReadV3) are clean. So I would guess the difference starts around there.


qlzh727 commented Sep 19, 2019

The recurrent_activation I am talking about is at recurrent_activation='sigmoid'. I guess any of the non-linear functions might have a tiny discrepancy; it really depends on the implementation.


Quiigi commented Sep 20, 2019

I have keras.__version__ '2.2.4', and there's no "recurrent_v2.py". I see the recurrent activation at recurrent_activation='hard_sigmoid'. (And the activation is 'tanh', presumably resulting in the node from which I was able to trace the mod-4 issue back to matrix multiplication.)


Quiigi commented Oct 28, 2019

In my case, the inconsistency originates in MatMul. It's possible that the activation functions 'sigmoid' or 'hard_sigmoid' have a similar issue. This reduced example demonstrates it:

import numpy as np
import tensorflow as tf
a = np.broadcast_to(np.float32([.6, -.8, -.3, 0]), (5,4))
b = np.float32([[8, 1], [6, 4], [-9, 1], [0, 0]])
tf.matmul(a,b).eval(session=tf.Session())

My output:

array([[ 2.6999998, -2.9      ],
       [ 2.6999998, -2.9      ],
       [ 2.6999998, -2.9      ],
       [ 2.6999998, -2.9      ],
       [ 2.7      , -2.8999999]], dtype=float32)

This is an improvement because the LSTM generates a graph with 268 nodes, including While loops, whereas the reproducible example above has just one simple node. I'm not sure where exactly the discrepancy creeps in; maybe in the function evalGemm, i.e.,

  void Eigen::TensorContractionEvaluatorBase<
    Eigen::TensorEvaluator<
      Eigen::TensorContractionOp<
        Eigen::array<Eigen::IndexPair<long>, 1ul> const,
        Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const,
        Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const
        > const,
      Eigen::ThreadPoolDevice> >
  ::evalGemm<true, true, false, 0>(float*) const

but that's just a guess.


qlzh727 commented Feb 20, 2020

Due to the numerical instability, I don't think there is anything we can address here (the diff is smaller than the normal tolerance we use in tests), so I am going to close this bug.

@qlzh727 qlzh727 closed this as completed Feb 20, 2020


Quiigi commented Feb 21, 2020

I had hoped to learn where exactly the instability arises.

I assume we are seeing an artifact of optimization, trading accuracy for speed. And since the error is within your normal tolerance, the result we see must be deemed "correct". Therefore, closing this issue is appropriate.
