
maximum input length issue #12

Closed
gd1m3y opened this issue Feb 6, 2023 · 13 comments

Comments

@gd1m3y

gd1m3y commented Feb 6, 2023

I can see that the maximum input length is set to 512. How can I change that? Is a sequence length of more than 512 supported? What is the maximum sequence length supported?

@Harry-hash
Contributor

Hi, thanks a lot for your question!

By default, the maximum sequence length is 512. You can change that using the attribute max_seq_length. For example:

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
model.max_seq_length = 256  # lower the truncation limit from the default of 512 tokens
print(model.max_seq_length)

Hope this helps! Feel free to leave any further comments or questions!

@gd1m3y
Author

gd1m3y commented Feb 8, 2023

Hey, thank you for your answer. However, changing the max_seq_length parameter doesn't seem to be reflected in the output. For example, if I set it to 0, it still gives me an embedding of shape (1, 768), and even if I set it to a number like 4096, it doesn't seem to be reflected in the output.

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
model.max_seq_length = 256
print(model.max_seq_length)
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings.shape)

@Harry-hash
Contributor

Hi, thanks a lot for your question!

Our INSTRUCTOR calculates the sentence embedding, which is the average of token embeddings in the input text. For the embedding shape (1,768) in your example, 1 refers to the number of sentences you encode, and 768 refers to the embedding dimension.
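For instance, encoding two instruction/sentence pairs should yield a (2, 768) array regardless of max_seq_length. A minimal sketch (the second title is just an extra example added here for illustration):

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
pairs = [
    ["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"],
    ["Represent the Science title:", "Dense Passage Retrieval for Open-Domain Question Answering"],
]
embeddings = model.encode(pairs)
print(embeddings.shape)  # (2, 768): one 768-dimensional vector per input pair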

To inspect the actual length of the sequence being encoded, you may print out the shape of token_embeddings here: https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L103.

Tips:
To print out the shape of token_embeddings, one way is to install the InstructorEmbedding package from source via

pip install -e .

and add the following code after line 103, i.e., https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L103

print("The shape of token embeddings: ", token_embeddings.shape)

Then you will be able to see the sequence length.
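Alternatively, as a rough sketch that avoids editing the library source, you may be able to count the tokens directly with the model's tokenizer. This assumes the underlying tokenizer is reachable as model.tokenizer (as in sentence-transformers models), and approximates the concatenated input that gets truncated:

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
model.max_seq_length = 256
instruction = "Represent the Science title:"
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
# Token count of the concatenated input before any truncation to max_seq_length
input_ids = model.tokenizer(instruction + sentence)["input_ids"]
print("Token count before truncation:", len(input_ids))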

Hope this helps! Feel free to add any further questions or comments!

@Harry-hash
Contributor

Feel free to re-open this issue and add any further comments!

@ZQ-Dev8

ZQ-Dev8 commented Apr 12, 2023

Feel free to re-open this issue and add any further comments!

@Harry-hash does this mean that InstructOR can accept sequences of any length, because everything is mean pooled into the embedding dimension in the end? It feels like there should still be a limit governed by the model's architecture or computational resources.

@Harry-hash
Contributor

Harry-hash commented Apr 14, 2023

Hi, thanks a lot for your comments!

Theoretically, the model can embed sequences of any length. However, since the model is not specifically pre-trained on long sequences, performance may drop significantly on extremely long inputs. In addition, due to the O(n^2) computational complexity of self-attention inside the transformer model, efficiency also drops as the input length increases. Therefore, for extremely long sequences, e.g., over 10k tokens, we suggest chunking the texts before calculating embeddings with the INSTRUCTOR model.
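For illustration only, a minimal chunk-then-average sketch (embed_long_text is a hypothetical helper, not part of the InstructorEmbedding package, and it assumes the tokenizer is reachable as model.tokenizer):

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
instruction = "Represent the document for retrieval:"

def embed_long_text(text, max_tokens=256):
    # Split the text into windows of at most max_tokens tokens
    ids = model.tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [model.tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]
    # Embed each chunk with the same instruction, then average the chunk embeddings
    chunk_embeddings = model.encode([[instruction, chunk] for chunk in chunks])
    return chunk_embeddings.mean(axis=0)  # a single 768-dimensional vector for the whole text

vector = embed_long_text("some very long document " * 2000)
print(vector.shape)  # (768,)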

Hope this helps!

@Harry-hash Harry-hash reopened this Apr 14, 2023
@ZQ-Dev8

ZQ-Dev8 commented Apr 14, 2023

@Harry-hash very helpful, thanks for the quick response!

@Harry-hash
Contributor

Feel free to re-open the issue if you have any questions or comments!

@NPap0

NPap0 commented May 8, 2023

Feel free to re-open this issue and add any further comments!

@Harry-hash does this mean that InstructOR can accept sequences of any length, because everything is mean pooled into the embedding dimension in the end? It feels like there should still be a limit governed by the model's architecture or computational resources.

Can someone explain to me why this ^ does not work, then?
The model may not be pre-trained on long sequences, but you can always chunk them, and since you are going to average the embeddings afterwards, it shouldn't make a difference, should it? What am I missing here?

@AwokeKnowing

@OneCodeToRuleThemAll if you take entire books and average them into the average of their tokens, the result is going to trend toward the same thing (the frequency of words in the language). The weights are tuned/trained on a particular distribution of values, and averaging far more tokens will shift that distribution away from the distribution of the typical few tokens in the training set and toward the distribution of tokens in the entire language.
For a totally different way to think about it, imagine mixing colors randomly. Maybe the model is trained to identify the different colors that were mixed, but if you mix more and more colors, the result trends toward the same thing, so all your documents will end up with similar values for their token embeddings.
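As a toy illustration of that intuition (random vectors standing in for token embeddings, no INSTRUCTOR involved): the mean-pooled vectors of two unrelated "documents" drawn from the same token distribution become nearly identical as more tokens are pooled.

import numpy as np

rng = np.random.default_rng(0)

def mean_pooled(n_tokens, dim=768):
    # Random vectors standing in for token embeddings drawn from one shared distribution
    return rng.normal(loc=0.5, scale=1.0, size=(n_tokens, dim)).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for n in (8, 128, 4096, 65536):
    # Two independent "documents": their pooled vectors converge as n grows
    print(n, round(cosine(mean_pooled(n), mean_pooled(n)), 4))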

@NPap0

NPap0 commented May 10, 2023

@AwokeKnowing

@OneCodeToRuleThemAll if you take entire books and average them into the average of their tokens, the result is going to trend toward the same thing (the frequency of words in the language). The weights are tuned/trained on a particular distribution of values, and averaging far more tokens will shift that distribution away from the distribution of the typical few tokens in the training set and toward the distribution of tokens in the entire language. For a totally different way to think about it, imagine mixing colors randomly. Maybe the model is trained to identify the different colors that were mixed, but if you mix more and more colors, the result trends toward the same thing, so all your documents will end up with similar values for their token embeddings.

Thank you for the explanation. Follow-up question:
Averaging far more tokens will shift the distribution of the typical few tokens in the training set toward the distribution of tokens in the entire language. I'm guessing that only holds if the "far more tokens" number is really large.
Let's say we have 5-10 pages of text.
If we want to add more text/context, you say that by adding 20-30 books we will shift from the typical distribution to the distribution of the entire language, everything will then look similar, and we won't be able to distinguish documents.

But what if there is a sweet spot (this is just an idea): add maybe 1-2 books, or 10-20 extra pages, so as not to shift these values too much, while still shifting them enough to capture the meaning of the extra pages.
Does what I'm saying make any sense? If you have any papers or anything else I can read about this, let me know; I'm interested.

@PlatosTwin

@Harry-hash, sorry to reopen this topic with what I think is a basic question: is the sequence length under discussion here in characters, words, tokens, or some other unit? E.g., when splitting text, should we be splitting it into chunks less than or equal to 512 characters, words, tokens, etc.?

Thanks for the detailed answers to date!

@PlatosTwin

I think I can answer the above from line 242 in instructor.py. max_seq_length is passed to AutoTokenizer, and in the context of AutoTokenizer the maximum length is in terms of tokens, so max_seq_length=512 limits the total number of tokens (not words, characters, etc.) to 512.
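A quick way to see the difference in practice, as a sketch (again assuming the tokenizer is reachable as model.tokenizer):

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
text = "3D ActionSLAM: wearable person tracking in multi-floor environments"
print("characters:", len(text))
print("words:", len(text.split()))
print("tokens:", len(model.tokenizer(text)["input_ids"]))  # this count is what max_seq_length=512 bounds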
