
maximum input length issue #12

Closed
gd1m3y opened this issue Feb 6, 2023 · 13 comments

Comments

@gd1m3y

gd1m3y commented Feb 6, 2023

I can see that the maximum input length is set to 512. How can I change that? Is a sequence length of more than 512 supported? What is the maximum sequence length supported?

@Harry-hash
Contributor

Hi, thanks a lot for your question!

By default, the maximum sequence length is 512. You can change that using the attribute max_seq_length. For example:

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
model.max_seq_length = 256  # lower the truncation limit from the default of 512 tokens
print(model.max_seq_length)

Hope this helps! Feel free to leave any further comments or questions!

@gd1m3y
Author

gd1m3y commented Feb 8, 2023

Hey, thank you for your answer. However, changing the max_seq_length parameter doesn't seem to be reflected in the output. For example, if I set it to 0, it still gives me an embedding of shape (1, 768), and even if I set it to a number like 4096, it doesn't seem to be reflected in the output.

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
model.max_seq_length = 256
print(model.max_seq_length)
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings.shape)

@Harry-hash
Contributor

Hi, thanks a lot for your question!

Our INSTRUCTOR calculates the sentence embedding, which is the average of token embeddings in the input text. For the embedding shape (1,768) in your example, 1 refers to the number of sentences you encode, and 768 refers to the embedding dimension.
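For instance, encoding two instruction/sentence pairs should yield a (2, 768) array regardless of max_seq_length. A minimal sketch (the second title is just an extra example added here for illustration):

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
pairs = [
    ["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"],
    ["Represent the Science title:", "Dense Passage Retrieval for Open-Domain Question Answering"],
]
embeddings = model.encode(pairs)
print(embeddings.shape)  # (2, 768): one 768-dimensional vector per input pair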

To inspect the actual length of the sequence being encoded, you may print out the shape of token_embeddings here: https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L103.

Tips:
To print out the shape of token_embeddings, one way is to install the InstructorEmbedding package from source via

pip install -e .

and add the following code after line 103, i.e., https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L103

print("The shape of token embeddings: ", token_embeddings.shape)

Then you will be able to see the sequence length.
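Alternatively, as a rough sketch that avoids editing the library source, you may be able to count the tokens directly with the model's tokenizer. This assumes the underlying tokenizer is reachable as model.tokenizer (as in sentence-transformers models), and approximates the concatenated input that gets truncated:

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
model.max_seq_length = 256
instruction = "Represent the Science title:"
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
# Token count of the concatenated input before any truncation to max_seq_length
input_ids = model.tokenizer(instruction + sentence)["input_ids"]
print("Token count before truncation:", len(input_ids))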

Hope this helps! Feel free to add any further questions or comments!

@Harry-hash
Contributor

Feel free to re-open this issue and add any further comments!

@ZQ-Dev8

ZQ-Dev8 commented Apr 12, 2023

Feel free to re-open this issue and add any further comments!

@Harry-hash does this mean that InstructOR can accept sequences of any length, because everything is mean pooled into the embedding dimension in the end? It feels like there should still be a limit governed by the model's architecture or computational resources.

@Harry-hash
Contributor

Harry-hash commented Apr 14, 2023

Hi, thanks a lot for your comments!

Theoretically, the model can embed sequences of any length. However, since the model is not specifically pre-trained on long sequences, performance may drop significantly on extremely long inputs. In addition, due to the O(n^2) computational complexity of self-attention inside the transformer model, efficiency also drops as the input length increases. Therefore, for extremely long sequences, e.g., over 10k tokens, we suggest chunking the texts before calculating embeddings with the INSTRUCTOR model.
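For illustration only, a minimal chunk-then-average sketch (embed_long_text is a hypothetical helper, not part of the InstructorEmbedding package, and it assumes the tokenizer is reachable as model.tokenizer):

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
instruction = "Represent the document for retrieval:"

def embed_long_text(text, max_tokens=256):
    # Split the text into windows of at most max_tokens tokens
    ids = model.tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [model.tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]
    # Embed each chunk with the same instruction, then average the chunk embeddings
    chunk_embeddings = model.encode([[instruction, chunk] for chunk in chunks])
    return chunk_embeddings.mean(axis=0)  # a single 768-dimensional vector for the whole text

vector = embed_long_text("some very long document " * 2000)
print(vector.shape)  # (768,)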

Hope this helps!

@Harry-hash Harry-hash reopened this Apr 14, 2023
@ZQ-Dev8

ZQ-Dev8 commented Apr 14, 2023

@Harry-hash very helpful, thanks for the quick response!

@Harry-hash
Contributor

Feel free to re-open the issue if you have any questions or comments!

@NPap0

NPap0 commented May 8, 2023

Feel free to re-open this issue and add any further comments!

@Harry-hash does this mean that InstructOR can accept sequences of any length, because everything is mean pooled into the embedding dimension in the end? It feels like there should still be a limit governed by the model's architecture or computational resources.

Can someone explain to me why this ^ does not work, then?
The model may not be pre-trained on long sequences, but you can always chunk them, and since you are going to average the embeddings afterwards, it shouldn't make a difference, should it? What am I missing here?

@AwokeKnowing

@OneCodeToRuleThemAll if you take entire books and average them into the average of their tokens, the result is going to trend toward the same thing (the frequency of words in the language). The weights are tuned/trained on a particular distribution of values, and averaging far more tokens will shift that distribution away from the distribution of the typical few tokens in the training set and toward the distribution of tokens in the entire language.
For a totally different way to think about it, imagine mixing colors randomly. Maybe the model is trained to identify the different colors that were mixed, but if you mix more and more colors, the result trends toward the same thing, so all your documents will end up with similar values for their token embeddings.
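As a toy illustration of that intuition (random vectors standing in for token embeddings, no INSTRUCTOR involved): the mean-pooled vectors of two unrelated "documents" drawn from the same token distribution become nearly identical as more tokens are pooled.

import numpy as np

rng = np.random.default_rng(0)

def mean_pooled(n_tokens, dim=768):
    # Random vectors standing in for token embeddings drawn from one shared distribution
    return rng.normal(loc=0.5, scale=1.0, size=(n_tokens, dim)).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for n in (8, 128, 4096, 65536):
    # Two independent "documents": their pooled vectors converge as n grows
    print(n, round(cosine(mean_pooled(n), mean_pooled(n)), 4))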

@NPap0

NPap0 commented May 10, 2023

@AwokeKnowing

@OneCodeToRuleThemAll if you take entire books and average them into the average of their tokens, the result is going to trend toward the same thing (the frequency of words in the language). The weights are tuned/trained on a particular distribution of values, and averaging far more tokens will shift that distribution away from the distribution of the typical few tokens in the training set and toward the distribution of tokens in the entire language. For a totally different way to think about it, imagine mixing colors randomly. Maybe the model is trained to identify the different colors that were mixed, but if you mix more and more colors, the result trends toward the same thing, so all your documents will end up with similar values for their token embeddings.

Thank you for the explanation. Follow-up question:
Averaging far more tokens will shift the distribution of the typical few tokens in the training set toward the distribution of tokens in the entire language. I'm guessing that only holds if the "far more tokens" number is really large.
Let's say we have 5-10 pages of text.
If we want to add more text/context, you say that by adding 20-30 books we will shift from the typical distribution to the distribution of the entire language, everything will then look similar, and we won't be able to distinguish documents.

But what if there is a sweet spot (this is just an idea): add maybe 1-2 books, or 10-20 extra pages, so as not to shift these values too much, while still shifting them enough to capture the meaning of the extra pages.
Does what I'm saying make any sense? If you have any papers or anything else I can read about this, let me know; I'm interested.

@PlatosTwin

@Harry-hash, sorry to reopen this topic with what I think is a basic question: is the sequence length under discussion here in characters, words, tokens, or some other unit? E.g., when splitting text, should we be splitting it into chunks less than or equal to 512 characters, words, tokens, etc.?

Thanks for the detailed answers to date!

@PlatosTwin

I think I can answer the above from line 242 in instructor.py. max_seq_length is passed to AutoTokenizer, and in the context of AutoTokenizer the maximum length is in terms of tokens, so max_seq_length=512 limits the total number of tokens (not words, characters, etc.) to 512.
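A quick way to see the difference in practice, as a sketch (again assuming the tokenizer is reachable as model.tokenizer):

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
text = "3D ActionSLAM: wearable person tracking in multi-floor environments"
print("characters:", len(text))
print("words:", len(text.split()))
print("tokens:", len(model.tokenizer(text)["input_ids"]))  # this count is what max_seq_length=512 bounds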
