
[QST] Summary Type Arg Meaning (BCE Task Head) #466

Closed
jacobdineen opened this issue Aug 1, 2022 · 4 comments

Comments

@jacobdineen

jacobdineen commented Aug 1, 2022

❓ Questions & Help

Details

Hello!

My team and I are in the process of leveraging a trained t4rec model as a feature extractor. We are currently using Albert, but this question applies to any encoder/decoder model offered. When instantiating a model, a user can select the model summary type (first/last/mean/etc.). This slices the hidden representation of shape (batch, sequence_length, nn_dim) along the middle axis, and the resulting tensor is what gets passed into the final nn layer of the BCE task head. I traced the dependency back to Hugging Face's SequenceSummary here, which is a straightforward implementation.
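For concreteness, here is a minimal sketch of how I understand the three summary types to reduce the middle axis (the tensors and shapes below are illustrative, not t4rec code; the real SequenceSummary also covers modes and padding handling that I am glossing over):

```python
import torch

# Illustrative shapes: (batch, sequence_length, nn_dim)
batch, seq_len, nn_dim = 4, 20, 64
hidden = torch.randn(batch, seq_len, nn_dim)

first = hidden[:, 0]       # "first": hidden state at position 0 -> (batch, nn_dim)
last = hidden[:, -1]       # "last": hidden state at the final position -> (batch, nn_dim)
mean = hidden.mean(dim=1)  # "mean": average over the sequence axis -> (batch, nn_dim)
```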

Q: If we select first as our summary_type with an encoder-only model, does that token have additional context relative to last or mean? Or should additional tokenization be built into the preprocessing stage?

This question stems from the literature: in encoder-only models, there is a special CLS (and/or end) token that is said to provide a sentence-level representation over all input tokens. The intermediate output appears to provide an nn_dim-sized embedding for each token (the middle axis) for each element of the batch, but does the above theory still hold if we don't have a tokenization framework that includes special start/stop tokens?

@rnyak
Contributor

rnyak commented Aug 2, 2022

Hello @jacobdineen. Thanks for your question.

Q1: If we select first as our summary_type with an encoder-only model, does that token have additional context relative to last or mean?

Yes, it will have context over the other tokens due to the self-attention mechanism, which computes all pairwise attention scores among the items in the current session.
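A toy illustration (plain scaled dot-product attention with illustrative names, not our internal code): in an encoder there is no causal mask, so the output at the first position is already a weighted mix of every item in the session.

```python
import torch
import torch.nn.functional as F

seq_len, dim = 5, 8
x = torch.randn(1, seq_len, dim)  # one session of 5 item embeddings

# Scaled dot-product self-attention: every position attends to every position
scores = x @ x.transpose(-2, -1) / dim ** 0.5  # (1, seq_len, seq_len) pairwise scores
weights = F.softmax(scores, dim=-1)
out = weights @ x                              # (1, seq_len, dim)

# out[:, 0] mixes all five inputs, so the "first" summary already
# aggregates context from the whole session.
print(out[:, 0].shape)  # torch.Size([1, 8])
```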

Q2: This question stems from the literature: in encoder-only models, there is a special CLS (and/or end) token that is said to provide a sentence-level representation over all input tokens. The intermediate output appears to provide an nn_dim-sized embedding for each token (the middle axis) for each element of the batch, but does the above theory still hold if we don't have a tokenization framework that includes special start/stop tokens?

This theory holds for NLP, as you explained above, but for the session-based or sequential recommendation case we do not need a special CLS separator, since for us every session is treated like a single sentence (the input to the model is a single session, not multiple sessions). Note that we might need this for the session-aware recommendation task, which we do not support yet.

@jacobdineen
Author

Thanks @rnyak!

@rnyak
Contributor

rnyak commented Aug 3, 2022

@jacobdineen for this task, "leveraging a trained t4rec model as a feature extractor," you do not need a custom model.fit() and the BC head, right? Basically, you can train the model with the HF trainer class on the next-item prediction task, and then extract the embeddings from the layer you want. In that case, you should be able to use torch.nn.parallel.DistributedDataParallel as described in #456.
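As a generic PyTorch sketch of the extraction step (not a t4rec-specific API; model and target_layer below are placeholder stand-ins for your trained model and the layer you want features from), a forward hook can capture the embeddings:

```python
import torch

features = {}

def hook(module, inputs, output):
    # Save the layer's output; detach so it is a plain feature tensor
    features["embeddings"] = output.detach()

model = torch.nn.Sequential(  # placeholder for the trained model
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)
)
target_layer = model[1]  # placeholder for the layer whose output you want
handle = target_layer.register_forward_hook(hook)

with torch.no_grad():
    model(torch.randn(8, 16))  # a dummy batch; use your real dataloader

handle.remove()
print(features["embeddings"].shape)  # torch.Size([8, 64])
```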

Please let us know how it goes.

@jacobdineen
Author

Hey @rnyak - going to respond separately on the other thread, as that answer is a bit more verbose re: what solution we have tried.

For this one, leveraging a t4rec model as a feature extractor, we still need to use the BC head, which requires us to use model.fit(). Essentially, we are looking for general-purpose features (embeddings) for downstream tasks, generated by training on an explicit target feature (conversion).

Next-item prediction in our context reduces to predicting the next item that a user will not convert on, due to the natural sparsity of our data and the infrequency of user actions. Semi-supervised learning may not be applicable to our problem for that reason. Intra-session recommendation would be a good use case for this, but the full customer journey is out of scope for our team.
