[BUG Report]: 'Invalid tape state' when training model #1221

Open
PhasonMatrix opened this issue Dec 29, 2023 · 3 comments

@PhasonMatrix

Description

I have built a U-Net convolutional network. When calling model.fit() I get an exception:

Message: 
  Tensorflow.RuntimeError : Invalid tape state.

Stack Trace: 
  Tape.ComputeGradient(Int64[] target_tensor_ids, Int64[] source_tensor_ids, UnorderedMap`2 sources_that_are_targets, List`1 output_gradients, Boolean build_default_zeros_grads)
  EagerRunner.TFE_TapeGradient(ITape tape, Tensor[] target, Tensor[] sources, List`1 output_gradients, Tensor[] sources_raw, String unconnected_gradients)
  GradientTape.gradient(Tensor target, IEnumerable`1 sources, List`1 output_gradients, String unconnected_gradients)
  Model._minimize(GradientTape tape, IOptimizer optimizer, Tensor loss, List`1 trainable_variables)
  Model.train_step(DataHandler data_handler, Tensors x, Tensors y)
  Model.train_step_function(DataHandler data_handler, OwnedIterator iterator)
  Model.FitInternal(DataHandler data_handler, Int32 epochs, Int32 verbose, List`1 callbackList, ValidationDataPack validation_data, Func`3 train_step_func)
  Model.fit(NDArray x, NDArray y, Int32 batch_size, Int32 epochs, Int32 verbose, List`1 callbacks, Single validation_split, ValidationDataPack validation_data, Int32 validation_step, Boolean shuffle, Dictionary`2 class_weight, NDArray sample_weight, Int32 initial_epoch, Int32 max_queue_size, Int32 workers, Boolean use_multiprocessing)
  UNet.Train(NDArray x, NDArray y) line 68

Reproduction Steps

Summary of the U-Net model:

Model: U-Net 
__________________________________________________________________________________________________ 
Layer (type)                     Output Shape          Param #     Connected to                    
================================================================================================== 
image(InputLayer)                (None, 256, 256, 1)   0                                           
__________________________________________________________________________________________________ 
conv2d(Conv2D)                   (None, 256, 256, 16)  160         image[0][0]                     
__________________________________________________________________________________________________ 
dropout(Dropout)                 (None, 256, 256, 16)  0           conv2d[0][0]                    
__________________________________________________________________________________________________ 
conv2d_1(Conv2D)                 (None, 256, 256, 16)  2320        dropout[0][0]                   
__________________________________________________________________________________________________ 
max_pooling2d(MaxPooling2D)      (None, 128, 128, 16)  0           conv2d_1[0][0]                  
__________________________________________________________________________________________________ 
conv2d_2(Conv2D)                 (None, 128, 128, 32)  4640        max_pooling2d[0][0]             
__________________________________________________________________________________________________ 
dropout_1(Dropout)               (None, 128, 128, 32)  0           conv2d_2[0][0]                  
__________________________________________________________________________________________________ 
conv2d_3(Conv2D)                 (None, 128, 128, 32)  9248        dropout_1[0][0]                 
__________________________________________________________________________________________________ 
max_pooling2d_1(MaxPooling2D)    (None, 64, 64, 32)    0           conv2d_3[0][0]                  
__________________________________________________________________________________________________ 
conv2d_4(Conv2D)                 (None, 64, 64, 64)    18496       max_pooling2d_1[0][0]           
__________________________________________________________________________________________________ 
dropout_2(Dropout)               (None, 64, 64, 64)    0           conv2d_4[0][0]                  
__________________________________________________________________________________________________ 
conv2d_5(Conv2D)                 (None, 64, 64, 64)    36928       dropout_2[0][0]                 
__________________________________________________________________________________________________ 
max_pooling2d_2(MaxPooling2D)    (None, 32, 32, 64)    0           conv2d_5[0][0]                  
__________________________________________________________________________________________________ 
conv2d_6(Conv2D)                 (None, 32, 32, 128)   73856       max_pooling2d_2[0][0]           
__________________________________________________________________________________________________ 
dropout_3(Dropout)               (None, 32, 32, 128)   0           conv2d_6[0][0]                  
__________________________________________________________________________________________________ 
conv2d_7(Conv2D)                 (None, 32, 32, 128)   147584      dropout_3[0][0]                 
__________________________________________________________________________________________________ 
max_pooling2d_3(MaxPooling2D)    (None, 16, 16, 128)   0           conv2d_7[0][0]                  
__________________________________________________________________________________________________ 
conv2d_8(Conv2D)                 (None, 16, 16, 256)   295168      max_pooling2d_3[0][0]           
__________________________________________________________________________________________________ 
dropout_4(Dropout)               (None, 16, 16, 256)   0           conv2d_8[0][0]                  
__________________________________________________________________________________________________ 
conv2d_9(Conv2D)                 (None, 16, 16, 256)   590080      dropout_4[0][0]                 
__________________________________________________________________________________________________ 
conv2d_transpose(Conv2DTranspose (None, 32, 32, 128)   131200      conv2d_9[0][0]                  
__________________________________________________________________________________________________ 
concatenate(Concatenate)         (None, 32, 32, 256)   0           conv2d_transpose[0][0]          
                                                                   conv2d_7[0][0]                  
__________________________________________________________________________________________________ 
conv2d_10(Conv2D)                (None, 32, 32, 128)   295040      concatenate[0][0]               
__________________________________________________________________________________________________ 
dropout_5(Dropout)               (None, 32, 32, 128)   0           conv2d_10[0][0]                 
__________________________________________________________________________________________________ 
conv2d_11(Conv2D)                (None, 32, 32, 128)   147584      dropout_5[0][0]                 
__________________________________________________________________________________________________ 
conv2d_transpose_1(Conv2DTranspo (None, 64, 64, 64)    32832       conv2d_11[0][0]                 
__________________________________________________________________________________________________ 
concatenate_1(Concatenate)       (None, 64, 64, 128)   0           conv2d_transpose_1[0][0]        
                                                                   conv2d_5[0][0]                  
__________________________________________________________________________________________________ 
conv2d_12(Conv2D)                (None, 64, 64, 64)    73792       concatenate_1[0][0]             
__________________________________________________________________________________________________ 
dropout_6(Dropout)               (None, 64, 64, 64)    0           conv2d_12[0][0]                 
__________________________________________________________________________________________________ 
conv2d_13(Conv2D)                (None, 64, 64, 64)    36928       dropout_6[0][0]                 
__________________________________________________________________________________________________ 
conv2d_transpose_2(Conv2DTranspo (None, 128, 128, 32)  8224        conv2d_13[0][0]                 
__________________________________________________________________________________________________ 
concatenate_2(Concatenate)       (None, 128, 128, 64)  0           conv2d_transpose_2[0][0]        
                                                                   conv2d_3[0][0]                  
__________________________________________________________________________________________________ 
conv2d_14(Conv2D)                (None, 128, 128, 32)  18464       concatenate_2[0][0]             
__________________________________________________________________________________________________ 
dropout_7(Dropout)               (None, 128, 128, 32)  0           conv2d_14[0][0]                 
__________________________________________________________________________________________________ 
conv2d_15(Conv2D)                (None, 128, 128, 32)  9248        dropout_7[0][0]                 
__________________________________________________________________________________________________ 
conv2d_transpose_3(Conv2DTranspo (None, 256, 256, 16)  2064        conv2d_15[0][0]                 
__________________________________________________________________________________________________ 
concatenate_3(Concatenate)       (None, 256, 256, 32)  0           conv2d_transpose_3[0][0]        
                                                                   conv2d_1[0][0]                  
__________________________________________________________________________________________________ 
conv2d_16(Conv2D)                (None, 256, 256, 16)  4624        concatenate_3[0][0]             
__________________________________________________________________________________________________ 
dropout_8(Dropout)               (None, 256, 256, 16)  0           conv2d_16[0][0]                 
__________________________________________________________________________________________________ 
conv2d_17(Conv2D)                (None, 256, 256, 16)  2320        dropout_8[0][0]                 
__________________________________________________________________________________________________ 
conv2d_18(Conv2D)                (None, 256, 256, 1)   17          conv2d_17[0][0]                 
================================================================================================== 
Total params: 1940817 
Trainable params: 1940817 
Non-trainable params: 0 
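
As a sanity check on the summary, the parameter counts can be reproduced by hand. The sketch below is only an illustration: the kernel sizes (3x3 for the regular convolutions, 2x2 for the transposed convolutions, 1x1 for the final layer) and the presence of a bias on every layer are assumptions inferred from the counts, not taken from the model code, and `conv_params` is a hypothetical helper.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel for a square-kernel convolution."""
    return (kernel * kernel * in_channels + 1) * out_channels

# (layer name, assumed kernel size, input channels, output channels), in summary order.
layers = [
    ("conv2d", 3, 1, 16), ("conv2d_1", 3, 16, 16),
    ("conv2d_2", 3, 16, 32), ("conv2d_3", 3, 32, 32),
    ("conv2d_4", 3, 32, 64), ("conv2d_5", 3, 64, 64),
    ("conv2d_6", 3, 64, 128), ("conv2d_7", 3, 128, 128),
    ("conv2d_8", 3, 128, 256), ("conv2d_9", 3, 256, 256),
    ("conv2d_transpose", 2, 256, 128),
    ("conv2d_10", 3, 256, 128), ("conv2d_11", 3, 128, 128),
    ("conv2d_transpose_1", 2, 128, 64),
    ("conv2d_12", 3, 128, 64), ("conv2d_13", 3, 64, 64),
    ("conv2d_transpose_2", 2, 64, 32),
    ("conv2d_14", 3, 64, 32), ("conv2d_15", 3, 32, 32),
    ("conv2d_transpose_3", 2, 32, 16),
    ("conv2d_16", 3, 32, 16), ("conv2d_17", 3, 16, 16),
    ("conv2d_18", 1, 16, 1),
]

# Dropout, pooling and concatenation layers have no parameters, so they are omitted.
counts = {name: conv_params(k, cin, cout) for name, k, cin, cout in layers}
total = sum(counts.values())
print(total)
```

Under these assumptions the per-layer counts and their sum agree with the summary (1940817 parameters in total), which suggests the graph was built with the intended layer shapes.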

The code that builds the model is in my previous bug report, #1219.

I call the fit method with:

_model.fit(x, y, batch_size:16, verbose:1, epochs:1, shuffle:false);

The input and labels passed in ('x' and 'y') are both NDArrays with shape (1872, 256, 256, 1). That is, 1872 greyscale images of 256x256 px.

I searched and found only one StackOverflow answer, which says that the labels (y) must be passed in. I am already doing that.

I have the same model in Python and can train it with X and Y numpy arrays of similar shape (8100, 256, 256, 1). The only difference is the number of images.

Known Workarounds

No response

Configuration and Other Information

OS: Windows 11
.NET: 6.0

SciSharp.TensorFlow.Redist 2.16.0
TensorFlow.Keras 0.15.0
TensorFlow.Net 0.150.0

@SIARIAymane

Hello,

I wish to report that I am also experiencing the issue described here, namely Tensorflow.RuntimeError: Invalid tape state.

Here are some details about my environment:

  • TensorFlow.NET Version: 0.150.0
  • Operating System: Windows 10
  • IDE: Visual Studio 2022
  • Usage Scenario: Training a U-Net model for image segmentation.

I am interested in any suggestions or solutions that may have been found since this issue was created. Moreover, if additional information from my side could help resolve this issue, I would be happy to provide it.

Thank you very much for your attention and for any effort aimed at resolving this issue. It is very important to me and my project.

Kind regards,
Aymane.

@AsakusaRinne
Collaborator

Hi, I'm one of the maintainers of TensorFlow.NET. Unfortunately, none of the main maintainers of this repo is available at the moment. We won't reject PRs, but we don't have enough time to fix bugs or add features right now. I'm sorry about that.

I ran into the same problem once during development. Generally, this bug is caused by incorrect traced graph structure information or by invalid backward ops in your model.

The tape works as follows: it records the nodes and edges of the graph, which is traced while the model runs. When gradients are requested, it pops the nodes in topological order, beginning from the output node(s).
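
To make the description above concrete, here is a minimal toy tape in plain Python. It is a sketch under stated assumptions, not TensorFlow.NET's implementation: the `Tape` class, its `watch`/`record`/`gradient` methods, the string tensor ids and the error text are all hypothetical. It records each operation as it runs, checks that every tensor reachable from the output has a record, and then walks the records backwards from the output.

```python
class Tape:
    """Toy scalar tape: records ops as they run, then replays the records backwards."""

    def __init__(self):
        self.leaves = set()   # tensors created outside any recorded op (inputs, weights)
        self.records = []     # (inputs, output, backward_fn) in execution order
        self.producer = {}    # output tensor id -> index into self.records

    def watch(self, tensor_id):
        self.leaves.add(tensor_id)

    def record(self, inputs, output, backward_fn):
        self.producer[output] = len(self.records)
        self.records.append((inputs, output, backward_fn))

    def gradient(self, target, sources):
        # Pass 1: every tensor reachable from the target must be a leaf or have a record.
        stack, seen = [target], set()
        while stack:
            t = stack.pop()
            if t in seen or t in self.leaves:
                continue
            seen.add(t)
            if t not in self.producer:
                raise RuntimeError(f"invalid tape state: no record produces {t!r}")
            stack.extend(self.records[self.producer[t]][0])

        # Pass 2: reverse execution order visits each op after all of its consumers.
        grads = {target: 1.0}
        for inputs, output, backward_fn in reversed(self.records):
            if output not in grads:
                continue
            for inp, g in zip(inputs, backward_fn(grads[output])):
                grads[inp] = grads.get(inp, 0.0) + g
        return [grads.get(s) for s in sources]


x, w = 3.0, 2.0
tape = Tape()
tape.watch("x")
tape.watch("w")
tape.record(["w", "x"], "y", lambda g: (g * x, g * w))  # y = w * x
tape.record(["y", "w"], "z", lambda g: (g, g))          # z = y + w
dz_dw, dz_dx = tape.gradient("z", ["w", "x"])

# Same graph, but the op that produces "y" was never recorded.
broken = Tape()
broken.watch("x")
broken.watch("w")
broken.record(["y", "w"], "z", lambda g: (g, g))
error_message = None
try:
    broken.gradient("z", ["w"])
except RuntimeError as err:
    error_message = str(err)
```

In the second tape the backward walk reaches `y`, finds no record that produced it, and fails. That is the kind of gap in the recorded graph that the explanation above refers to, although the real check in TensorFlow.NET may differ in detail.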

If you want to debug it, first narrow the scope by finding the smallest model structure that reproduces the problem. Then run it against the source code and inspect the records in the tape. You should eventually find which operation number is missing from the tape's records. After that, you can try to fix it. Good luck!
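
The narrowing step can be organised as a simple search over progressively larger models. The sketch below is only an outline: `smallest_failing_prefix` and `fake_reproduces` are hypothetical names, and the predicate stands in for "build a model from these layers, call fit(), and catch the exception". The trigger used in the stand-in is arbitrary and says nothing about which layer actually causes the error.

```python
def smallest_failing_prefix(layers, reproduces):
    """Return the shortest prefix of `layers` for which `reproduces` is true, or None."""
    for n in range(1, len(layers) + 1):
        if reproduces(layers[:n]):
            return layers[:n]
    return None


# Non-input layer types, in the order they first appear in the model summary.
layer_types = ["Conv2D", "Dropout", "MaxPooling2D", "Conv2DTranspose", "Concatenate"]


def fake_reproduces(prefix):
    # Arbitrary stand-in so the search has something to find; replace it with a
    # real build-and-fit attempt that returns True when the exception is raised.
    return "Dropout" in prefix


minimal = smallest_failing_prefix(layer_types, fake_reproduces)
print(minimal)
```

A linear scan is used instead of bisection so that the search does not assume that every larger model also fails.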

@SIARIAymane

Thank you for your advice.
