Using call_get_leaves inside @tf.function call in ensemble model inherits from tensorflow.keras.Model #199

Open
advahadr opened this issue Nov 13, 2023 · 10 comments

Comments

@advahadr

advahadr commented Nov 13, 2023

Hi All,

I would like your help with the following ensemble architecture.
I created this Colab notebook for your convenience.

I'm using the output of a pre-trained TF-DF model and concatenating it with the output of a dense layer. When I call the TF-DF model directly, I can concatenate its output with the output of the dense layer [please see class MyEnsembleWorking]. My problem arises when I try to concatenate the leaf indices instead, using call_get_leaves [please see class MyEnsembleLeaves].

When adding the line:
tfdf_output_leaves = tf.stop_gradient(self.tfdf_model.call_get_leaves(inputs))

It seems that the output has no static shape. I get this print:

tfdf_output_leaves: Tensor("StopGradient_1:0", shape=(None, None), dtype=int32)
so I can't work further with this output or concatenate it.

I wonder what the correct way is to ensemble the leaf predictions, rather than the probabilities, in my architecture.

Would appreciate any help,
Regards
Adva

@advahadr
Author

advahadr commented Nov 13, 2023

I updated the Colab link to point to a shareable notebook.

@rstz
Collaborator

rstz commented Nov 14, 2023

Hi, thank you for updating the link to the Colab, I will have a look!

@janpfeifer
Contributor

It seems to be something about the shape of the output of call_get_leaves. I wonder if there is a way for us to fix that in TFDF, but in the meantime, you could force it like this:

...
        #TODO: want to change this to leaves predictions
        tfdf_output = self.tfdf_model.call_get_leaves(inputs)
        tfdf_output = tf.cast(tfdf_output, tf.float32)
        tfdf_output = tf.reshape(tfdf_output, [tf.shape(inputs)[0], 5])
        tfdf_output = tf.stop_gradient(tfdf_output)
...

But you have to know in advance the number of trees (5) and hardcode that into the model.

@janpfeifer
Contributor

Btw, I simply converted the leaf numbers to float values, but if I were to combine the models, I'd definitely either embed the leaf numbers (different embedding per tree) or just add an extra NN (Dense) layer on top (which is equivalent).
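
A minimal sketch of the per-tree embedding idea, assuming leaf indices of shape [batch, num_trees] and a known upper bound on the leaf index of any tree. The class name LeafEmbedding and the parameters max_leaves and embedding_dim are illustrative and not part of TF-DF:

```python
import tensorflow as tf


class LeafEmbedding(tf.keras.layers.Layer):
    """Embeds each tree's leaf index with its own table (illustrative name)."""

    def __init__(self, num_trees, max_leaves, embedding_dim):
        super().__init__()
        # One embedding table per tree; max_leaves is an assumed upper bound
        # on the leaf index of any tree.
        self.tables = [
            tf.keras.layers.Embedding(max_leaves, embedding_dim)
            for _ in range(num_trees)
        ]

    def call(self, leaves):
        # leaves: int32 tensor of shape [batch, num_trees].
        per_tree = [table(leaves[:, i]) for i, table in enumerate(self.tables)]
        # Result: float32 tensor of shape [batch, num_trees * embedding_dim].
        return tf.concat(per_tree, axis=-1)


leaves = tf.constant([[0, 3, 1], [2, 2, 7]], dtype=tf.int32)
embedded = LeafEmbedding(num_trees=3, max_leaves=8, embedding_dim=4)(leaves)
print(embedded.shape)  # (2, 12)
```

Unlike the plain float cast, this does not impose an ordering on leaf numbers, which are arbitrary identifiers within each tree.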

@advahadr
Author

advahadr commented Nov 15, 2023

Hi, thank you for your solution!
Implementing it in the Colab worked fine. However, when I tried to implement it in our environment, I got an error.

Environment details (I followed the compatibility table here):
We are working on SageMaker Pipelines, and the configuration I used is:
tensorflow_decision_forests==1.5.0
tensorflow==2.13.0
(We can't upgrade to a higher TensorFlow version due to a SageMaker limitation.)

I got this error from call_get_leaves (regular prediction worked just fine):

File "/usr/local/lib/python3.10/site-packages/tensorflow_decision_forests/keras/core_inference.py", line 767, in call_get_leaves *
    assert len(self._models_get_leaves) == 1
TypeError: object of type 'NoneType' has no len()

This was also reproduced in the Colab notebook using the versions mentioned above.
I would appreciate your help with these limitations. Thanks!

@janpfeifer
Contributor

Hi @advahadr, I'm sorry you are having these difficulties. call_get_leaves is not often used, hence not as well tested.

I have two hypotheses:

  1. AFAIK, saved models may not work with it. Maybe @rstz could confirm? You are saying that you are having issues in the inference code, is that correct?

  2. I took a peek at your Colab, and it is missing the tf.reshape line of the fix. Here is my copy of your Colab. If that fixes it, then problem solved.

One short-term alternative, which would also work for (1) above: generate the leaf values first, as a separate step, and then concatenate the leaf values to the inputs for the Keras model. This is not convenient :(, but it will work if your environment allows this intermediary step. You could even materialize the leaf values (save them to disk along with the inputs) after training the TFDF model.
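
The two-step alternative above could look roughly like the sketch below. It does not call TF-DF: precomputed_leaves is a hypothetical stand-in for tfdf_model.predict_get_leaves(...), and NUM_TREES, the feature shapes, and the random values are assumptions made only for illustration:

```python
import numpy as np

NUM_TREES = 3  # assumed; must match the trained forest


def precomputed_leaves(features):
    """Hypothetical stand-in for tfdf_model.predict_get_leaves(...).

    Returns random int32 leaf indices of shape [num_examples, NUM_TREES].
    """
    rng = np.random.default_rng(0)
    return rng.integers(0, 8, size=(len(features), NUM_TREES), dtype=np.int32)


features = np.random.default_rng(1).normal(size=(10, 4)).astype(np.float32)

# Step 1: compute the leaf indices once, outside the Keras model.
leaves = precomputed_leaves(features)

# Step 2: append them to the inputs that the Keras model will consume.
augmented = np.concatenate([features, leaves.astype(np.float32)], axis=1)
print(augmented.shape)  # (10, 7)
```

The augmented array could then be saved next to the original inputs, so the Keras model never needs to call the TF-DF model during training.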

We'll look into this (most likely tomorrow, there is a conference going on today), and get back to you.

If you could provide more details on how you are using it in your environment (SageMaker), it would be very helpful! Is your pipeline something that reads the model and then runs inference on it in Python? Or is it using the TensorFlow C++ API? Etc.

@advahadr
Author

advahadr commented Nov 15, 2023

Hi @janpfeifer,
Thanks for your response!

Regarding 2:
I'll start with hypothesis 2 because we can eliminate it: I also worked on a copy of the Colab, since I didn't want to edit the version I shared here. I added the reshape there and it worked fine.

Regarding 1:
We use it when training the ensemble (similar to the Colab example), with the pre-trained TF-DF model as a layer. So in a sense it is technically inference of the TF-DF model, but overall it is a training step of the ensemble (that's why I can't use the predict_get_leaves API).
This is the flow of our training process:

  1. Train the TF-DF model and save it under an S3 path.
  2. Load the TF-DF model from S3 and pass it as an argument to the NN ensemble initialization.
  3. Train the ensemble.
  4. Save the ensemble model.

Later in the funnel we also run predictions with this ensemble model.
It is important to note that this flow worked fine without call_get_leaves, including while serving real traffic.

Regarding saved model problem:
I assumed that it might be related to saving and loading the TF-DF model, so I already checked it with a newly created instance and unfortunately hit the same error (I checked both in a Colab notebook and in our environment).

Regarding the alternative:
Unfortunately, persisting the outcome of the TF-DF predictions is not an option for serving (we are limited in latency since we run in a real-time environment).
In that case I want to do it on the fly, and then I need to use it inside @tf.function again, which limits me to call_get_leaves (since higher-level functions like predict_get_leaves are not available in that context).

Regarding SM environment:
We run it as Python code (importing tensorflow_decision_forests) and use the classes as shown in the notebook.

I hope it's a bit clearer now. If not, I can elaborate more.

Regards,
Adva

@achoum
Collaborator

achoum commented Nov 16, 2023

Hi Adva,

Sorry to hear about your troubles. Let me also try to help :).

Regarding the shape, call_get_leaves returns an array of shape [num_examples, num_trees]. As you noticed, the shape is not inferred when the graph is created. Instead, TensorFlow only knows the shape when the graph runs. Since we as users know the shape, we can simply set it with "set_shape". Note that "set_shape" is purely a bookkeeping operation. It does not involve any computation. This is different from tf.reshape.

def __init__(self, tfdf_model):
  ...
  self.tfdf_model = tfdf_model
  # Number of trees, read once from the trained TF-DF model.
  self.num_trees = tfdf_model.make_inspector().num_trees()

@tf.function
def call(self, inputs):
  ...
  # Leaf indices (int32); the static shape is unknown at this point.
  tfdf_output_leaves = self.tfdf_model.call_get_leaves(inputs)
  tfdf_output_leaves_casted = tf.cast(tfdf_output_leaves, tf.float32)
  # Declare the known static shape; no computation is involved.
  tfdf_output_leaves_casted.set_shape((None, self.num_trees))
  concatenated = self.concat_nn_tfdf([x, tfdf_output_leaves_casted])
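
The bookkeeping nature of set_shape can be checked in isolation. The sketch below does not use TF-DF. It assumes an int32 input with a static shape of (None, None), as call_get_leaves reported earlier in this thread, and an assumed tree count of 5. The function name cast_and_set_shape and the traced_shapes dictionary are illustrative only:

```python
import tensorflow as tf

NUM_TREES = 5  # assumed value; in the thread it comes from the inspector

# Static shapes observed while tracing, recorded as a Python side effect.
traced_shapes = {}


@tf.function(input_signature=[tf.TensorSpec(shape=[None, None], dtype=tf.int32)])
def cast_and_set_shape(leaves):
    traced_shapes["before"] = leaves.shape.as_list()
    casted = tf.cast(leaves, tf.float32)
    # Pure bookkeeping: refines the static shape without adding computation.
    casted.set_shape((None, NUM_TREES))
    traced_shapes["after"] = casted.shape.as_list()
    return casted


result = cast_and_set_shape(tf.zeros([3, NUM_TREES], dtype=tf.int32))
print(traced_shapes)  # {'before': [None, None], 'after': [None, 5]}
print(result.shape)   # (3, 5)
```

The batch dimension stays None here, so the same traced function accepts any batch size as long as the second dimension really is NUM_TREES at run time.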

stop_gradient is not necessary. The TF-DF inference operations do not propagate gradients by default.

About saving your model: saving a model (e.g. model.save) does not save the predict_get_leaves function. However, it does save the "call_get_leaves" function that you are using. For call_get_leaves to be saved, you need to make sure to call either call_get_leaves or predict_get_leaves once before saving the model.

I copied and updated your notebook with those changes. You can find it here: https://colab.research.google.com/drive/1TIPdzDN0UDLAXtcVICmsdh9YEDhW12LO?usp=sharing

Cheers,

@advahadr
Author

advahadr commented Nov 20, 2023

Hi @rstz, thank you for your great help! I'm getting there but still have some issues.

I tried to use the code you provided in our repo on SageMaker:

        tfdf_output_leaves = self.tfdf_model.call_get_leaves(inputs_for_tfdf)
        print(f'\ntfdf_output_leaves: {tfdf_output_leaves}')

        tfdf_output_leaves_casted = tf.cast(tfdf_output_leaves, tf.float32)
        tfdf_output_leaves_casted.set_shape((None, 3))
        print(f'\ntfdf_output_leaves_casted: {tfdf_output_leaves_casted}')

        concatenated = self.concat_nn_tfdf([x, tfdf_output_leaves_casted])

The printed logs show:

2023-11-20T18:23:15.784+02:00 | tfdf_output_leaves: Tensor("StatefulPartitionedCall:0", shape=(2048, None), dtype=int32)
2023-11-20T18:23:15.784+02:00 | tfdf_output_leaves_casted: Tensor("Cast:0", shape=(2048, 3), dtype=float32)

And the error I got (Incompatible shapes: [80,1] vs. [2048,1]):

ErrorMessage "tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error

Detected at node 'gradient_tape/binary_crossentropy/mul_1/Mul' defined at (most recent call last)
File "/opt/ml/code/sagemaker_training_entrypoint.py", line 430, in
trained_model, preprocessing_layer = TrainFnSageMaker.train_fn(
File "/opt/ml/code/RankingTF/Training/train_nn_fn.py", line 244, in train_fn
model.fit(ds_train_input,
File "/usr/local/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/keras/src/engine/training.py", line 1742, in fit
tmp_logs = self.train_function(iterator)
File "/usr/local/lib/python3.10/site-packages/keras/src/engine/training.py", line 1338, in train_function
return step_function(self, iterator)
File "/usr/local/lib/python3.10/site-packages/keras/src/engine/training.py", line 1322, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.10/site-packages/keras/src/engine/training.py", line 1303, in run_step
outputs = model.train_step(data)
File "/usr/local/lib/python3.10/site-packages/keras/src/engine/training.py", line 1084, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/usr/local/lib/python3.10/site-packages/keras/src/optimizers/optimizer.py", line 543, in minimize
grads_and_vars = self.compute_gradients(loss, var_list, tape)
File "/usr/local/lib/python3.10/site-packages/keras/src/optimizers/optimizer.py", line 276, in compute_gradients
grads = tape.gradient(loss, var_list)
Node: 'gradient_tape/binary_crossentropy/mul_1/Mul'
Incompatible shapes: [80,1] vs. [2048,1]
#11 [[{{node gradient_tape/binary_crossentropy/mul_1/Mul}}]] [Op:__inference_train_function_48268]

One thing to note:
In my repo the inputs (inputs_for_tfdf) are represented as a dict of tensors (the key is the feature name and the value is the tensor), so maybe this different representation is causing the issue.

This is not a blocker for me, but when calling:
self.num_trees = tfdf_model.make_inspector().num_trees()
the inspector was not available on the loaded model. I tried an alternative:
tfdf_model.get_config()['num_trees'], but the config object was an empty dict,
so eventually I set it manually.

Would appreciate your help! thank you, Adva

@advahadr
Author

Hi,
An update: I found the cause of the shape mismatch. I trained the TF-DF model on a dataset with a batch size of 2048, and when training the ensemble I used a batch size of 80. However, I'm still not sure why the batch size used for the TF-DF model matters.
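
One possible explanation, not confirmed in this thread: the logged shape (2048, None) suggests that call_get_leaves was traced with the TF-DF training batch size as a fixed dimension. The sketch below does not use TF-DF. It only illustrates, with a hypothetical to_float function, how a tf.function traced with a concrete batch size bakes that size into its static output shape, while a None batch dimension stays flexible:

```python
import tensorflow as tf


@tf.function
def to_float(leaves):
    # Hypothetical stand-in for a traced function such as call_get_leaves.
    return tf.cast(leaves, tf.float32)


# Traced with a concrete batch size: 2048 becomes part of the static shape.
fixed = to_float.get_concrete_function(tf.TensorSpec([2048, 3], tf.int32))
# Traced with an unknown batch size: the batch dimension stays None.
flexible = to_float.get_concrete_function(tf.TensorSpec([None, 3], tf.int32))

fixed_shape = fixed.outputs[0].shape.as_list()
flexible_shape = flexible.outputs[0].shape.as_list()
print(fixed_shape)     # [2048, 3]
print(flexible_shape)  # [None, 3]
```

If that is what happened here, using the same batch size in both steps would avoid the mismatch, which matches what was observed above.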
