feat: add StringSequenceToEmbedding transformer and layer #47
mruiyangyou wants to merge 2 commits into
Hey @mruiyangyou - this looks good at a glance, but we are gearing up for a big new kamae v3 release in the next few weeks or so. Can this wait until after that? I can help migrate this to Keras 3.
Parses a delimited string of pre-computed embedding vectors into a (seq_len, embedding_dim) float tensor, with optional reversal of the non-pad portion of each sequence. Includes the Spark transformer, the TensorFlow-only Keras 3 layer, unit tests, Spark/TF parity tests, and serialisation + JIT-compatibility registry entries. Authored on top of v3.0.0 (Keras 3 multi-backend migration): the layer lives in kamae.keras.tensorflow.layers, subclasses kamae.keras.core.base.BaseLayer, declares supported_backends = TENSORFLOW_ONLY and jit_compatible = False, and the transformer wraps it via get_keras_layer(). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
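For readers who want the parsing behaviour spelled out, here is a minimal pure-Python sketch of what the commit message describes. It is not the kamae implementation: the function name `parse_embedding_sequence` is hypothetical, the defaults are taken from the PR summary, and the handling of vectors that are shorter or longer than `embedding_dim` (pad or truncate per row) is an assumption.

```python
from typing import List


def parse_embedding_sequence(
    value: str,
    seq_len: int,
    embedding_dim: int,
    separator: str = "|",
    sequence_separator: str = ",",
    pad_value: str = "0",
) -> List[List[float]]:
    """Hypothetical reference: parse a delimited string into seq_len rows."""
    pad = float(pad_value)
    # Split into vector groups, drop empty groups, truncate to seq_len.
    groups = [g for g in value.split(sequence_separator) if g][:seq_len]
    rows = []
    for group in groups:
        # Assumption: rows are truncated or padded to embedding_dim.
        row = [float(x) for x in group.split(separator)][:embedding_dim]
        row += [pad] * (embedding_dim - len(row))
        rows.append(row)
    # Short sequences are padded at the tail with full pad rows.
    rows += [[pad] * embedding_dim for _ in range(seq_len - len(rows))]
    return rows


parsed = parse_embedding_sequence("1|2|3,4|5|6", seq_len=2, embedding_dim=3)
padded = parse_embedding_sequence("1|2", seq_len=3, embedding_dim=2, pad_value="-1")
```

The result is a nested list of shape (seq_len, embedding_dim); the real layer returns a float tensor instead.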
7bc4ed9 to 2bc89ff
georyetti left a comment
One main comment on padding value being assumed to be zero.
    # A row is considered padding iff all of its components are 0.
    row_norms = tf.reduce_sum(tf.abs(result), axis=-1)
    seq_lengths = tf.reduce_sum(tf.cast(row_norms > 0, tf.int32), axis=-1)
The pad value is configurable, so this only works if the pad value is 0, no? If the pad value is -1, then row_norms > 0 and the row is considered non-padded?
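The concern can be reproduced without TensorFlow. The sketch below mirrors the `reduce_sum(abs(row)) > 0` check from the snippet above in plain Python; the helper name `count_non_pad_by_norm` and the sample rows are made up for illustration.

```python
def count_non_pad_by_norm(rows):
    # A row counts as "real" when the sum of absolute components is positive,
    # which is the same test as row_norms > 0 in the reviewed snippet.
    return sum(1 for row in rows if sum(abs(x) for x in row) > 0)


# Two real vectors followed by two pad rows, padded with 0 and with -1.
zero_padded = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
minus_one_padded = [[1.0, 2.0], [3.0, 4.0], [-1.0, -1.0], [-1.0, -1.0]]

count_zero = count_non_pad_by_norm(zero_padded)        # 2: pad rows detected
count_minus = count_non_pad_by_norm(minus_one_padded)  # 4: pad rows missed
```

With a pad value of -1 every row has a positive norm, so the inferred sequence length is 4 rather than 2 and a reverse would move padding to the front. The same check would also misclassify a genuine all-zero embedding as padding.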
    # Count the number of non-pad vectors (a vector is pad iff all
    # of its components are zero). Reverse only that prefix.
    abs_sums = F.transform(
        vectors,
        lambda v: F.aggregate(
            v,
            F.lit(0.0),
            lambda acc, value: acc + F.abs(value),
        ),
Same as above: pad value = -1 breaks this, I think.
    layer = StringSequenceToEmbeddingLayer(
        name="reverse",
        seq_len=4,
        embedding_dim=2,
        reverse=True,
Can we add a test for a non-zero pad value?
Also, since pad value is a string, what happens if the user specifies pad_value = "hello"? We should validate that it's a numeric string on setting in Spark and in the constructor of the Keras layer.
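One way such a check could look is sketched below. The helper name `validate_pad_value` is hypothetical and this is not the code in the PR; rejecting `nan` and `inf` is an extra assumption, since Python's `float()` accepts both strings.

```python
import math


def validate_pad_value(pad_value: str) -> str:
    """Hypothetical check: return pad_value if it is a finite numeric string."""
    if not isinstance(pad_value, str):
        raise ValueError(f"pad_value must be a string, got {type(pad_value).__name__}")
    try:
        parsed = float(pad_value)
    except ValueError:
        raise ValueError(
            f"pad_value must be a numeric string, got {pad_value!r}"
        ) from None
    # Assumption: non-finite values are not useful as padding, so reject them.
    if not math.isfinite(parsed):
        raise ValueError(f"pad_value must be finite, got {pad_value!r}")
    return pad_value


ok = validate_pad_value("-1")

try:
    validate_pad_value("hello")
    rejected = False
except ValueError:
    rejected = True
```

Calling the same helper from both the Spark setter and the layer constructor would keep the two sides consistent, which matters for the parity tests.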
…_value

Addresses PR ExpediaGroup#47 review (georyetti):
- reverse no longer infers padding from values (norm > 0), which broke for non-zero pad values (e.g. -1). It now counts the supplied vectors positionally (non-empty groups in the original input, capped at seq_len) and reverses only that prefix, leaving appended padding at the tail. Applied to both the Keras layer and the Spark transformer.
- validate that pad_value is a numeric string in the Keras layer constructor and the Spark setPadValue/_transform; a non-numeric value like "hello" now raises ValueError instead of producing NaNs.
- add tests for a non-zero pad value and for non-numeric pad_value rejection; rewrite the reverse tests to use genuinely padded inputs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
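The positional approach described in this commit can be illustrated in plain Python. This is a sketch under assumptions, not the PR's code: the function name `reverse_supplied_prefix` is hypothetical, and the rows are assumed to be already parsed and tail-padded to `seq_len`.

```python
def reverse_supplied_prefix(rows, raw_value, seq_len, sequence_separator=","):
    """Hypothetical helper: reverse only the vectors present in the raw input."""
    # Count non-empty groups in the original string, capped at seq_len.
    # The count depends on position only, never on the parsed values.
    non_empty = sum(1 for g in raw_value.split(sequence_separator) if g)
    supplied = min(non_empty, seq_len)
    # Reverse the supplied prefix and leave the appended padding at the tail.
    return rows[:supplied][::-1] + rows[supplied:]


# Pad value -1: padding stays at the tail after the reverse.
rows = [[1.0, 2.0], [3.0, 4.0], [-1.0, -1.0], [-1.0, -1.0]]
reversed_rows = reverse_supplied_prefix(rows, "1|2,3|4", seq_len=4)

# A supplied all-zero vector is treated as real data, not as padding.
zero_rows = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
reversed_zero = reverse_supplied_prefix(zero_rows, "0|0,3|4", seq_len=3)
```

Because the count comes from the input string rather than from the parsed values, the result does not depend on which pad value is configured.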
Summary
Adds a new StringSequenceToEmbedding transformer/layer pair that parses a delimited string of pre-computed embedding vectors into a dense (seq_len, embedding_dim) float tensor.
- Spark transformer (StringSequenceToEmbeddingTransformer) and Keras layer (StringSequenceToEmbeddingLayer) with parity behaviour.
- Vectors are separated by sequence_separator (default ","), floats within a vector by separator (default "|"). Example: "1|2|3,4|5|6" with seq_len=2, embedding_dim=3 → [[1,2,3],[4,5,6]].
- Pads short sequences with pad_value (default "0"); truncates long ones.
- reverse mode reverses only the non-pad portion of each sequence (useful for chronological → recency-first ordering).
- … StringToStringListLayer convention, so (None, 1, 1) inputs produce (None, 1, seq_len, embedding_dim) outputs without a downstream squeeze.
- … pad_value before tf.strings.to_number to avoid StringToNumberOp failures at graph execution.

Test plan
- tests/kamae/tensorflow/layers/test_string_sequence_to_embedding.py: 9 unit tests covering default/custom separators, padding, truncation, reverse, trailing-1 squeeze behaviour, empty/malformed inputs, config round-trip, and invalid args.
- tests/kamae/spark/transformers/test_string_sequence_to_embedding.py: 6 tests including Spark/TF parity across separators, padding, and reverse modes.
- tests/kamae/tensorflow/test_layer_serialisation.py: added the new layer to the serialisation matrix; passes test_all_layers_tested_for_serialisation.
- … string_to_string_list tests.

🤖 Generated with Claude Code