Hi,
First of all, thank you for your outstanding work on Evo 2. It has been incredibly valuable for my research.
I'm using the Evo 2 7B model to obtain sequence-level representations, but I'm not sure the output is in the expected form, and I would appreciate clarification on the expected behavior of the embedding vectors. It would be great if you could review my procedure and code below.
Here is the process I'm currently following to extract embeddings:
- Prepare a sequence and tokenize it into byte-level token IDs (see the sketch after this list).
- Perform a forward pass and extract the per-token embeddings from layer "blocks.28.mlp.l3".
- Take the mean over the token dimension to get a single vector representing the sequence.
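For concreteness, here is what I expect step 1 to produce (a minimal sketch; the exact IDs are my assumption that the byte-level tokenizer maps each character to its byte value):

```python
from evo2 import Evo2

model = Evo2("evo2_7b")
token_ids = model.tokenizer.tokenize("ACGT")
# My assumption: byte-level IDs, i.e. the ASCII values of the characters.
print(token_ids)  # e.g. [65, 67, 71, 84]
```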
Here is a snippet of my code for embedding inference (adapted from the provided tutorial):
```python
import numpy as np
import torch

from evo2 import Evo2

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Evo2("evo2_7b")

layer_name = "blocks.28.mlp.l3"
list_of_embs = []

model.model.eval()
with torch.no_grad():
    for seq in sequences:  # the list of sequences is defined in the original code
        inputs = torch.tensor(model.tokenizer.tokenize(seq), dtype=torch.int).unsqueeze(0).to(device)
        _, token_embs = model(inputs, return_embeddings=True, layer_names=[layer_name])
        # Mean-pool over the token dimension to get one vector per sequence.
        seq_repr = torch.mean(token_embs[layer_name], dim=1).squeeze().detach().cpu().tolist()
        list_of_embs.append(seq_repr)

emb_array = np.array(list_of_embs)
```
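As a sanity check, inside the loop one could also inspect the raw per-token embeddings before pooling (a minimal sketch, under my assumption about the returned tensor layout):

```python
# Sketch: inspect one sequence's token embeddings before mean-pooling.
# I assume token_embs[layer_name] has shape (batch=1, seq_len, hidden_dim).
emb = token_embs[layer_name]
print(emb.shape)  # expected: torch.Size([1, seq_len, hidden_dim])
print(emb.dtype)  # e.g. torch.bfloat16, which would be consistent with the coarse values below
```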
The resulting embeddings look like this:
```
array([[  90.  , -228.  ,   26.375, ..., -241.  , -113.  , -100.5  ],
       [ 181.  , -288.  ,  212.   , ..., -292.  ,  -69.5 ,   63.75 ],
       [ 119.  , -208.  ,   90.   , ..., -201.  ,  -46.25,   14.25 ],
       ...,
       [ 126.5 , -175.  ,   95.   , ..., -221.  ,  -91.5 ,   57.75 ],
       [ 162.  , -308.  ,  195.   , ..., -262.  , -106.  ,  116.5  ],
       [ 180.  , -318.  ,  202.   , ..., -276.  ,  -58.75,   75.   ]])
```
I've observed that:
- The magnitude of values within a single vector varies quite a bit.
- The same dimension across different vectors also shows large variation.
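To quantify this, here is a quick check over emb_array (a minimal sketch):

```python
import numpy as np

# Spread of values within each sequence vector (across dimensions).
within_vector_std = emb_array.std(axis=1)
# Spread of each dimension across different sequence vectors.
across_vector_std = emb_array.std(axis=0)
print(within_vector_std.mean(), across_vector_std.mean())
```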
This variance seems to cause unstable behavior in downstream tasks, so here are my questions:
- Is this expected behavior for the extracted embeddings?
- If so, is there a recommended normalization strategy before using them downstream (e.g., z-score normalization)?
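For reference, by z-score normalization I mean something like the following (a sketch; whether to fit the statistics on training data only is part of my question):

```python
from sklearn.preprocessing import StandardScaler

# Z-score normalization: zero mean and unit variance per embedding dimension.
scaler = StandardScaler()
emb_normalized = scaler.fit_transform(emb_array)
```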
Thanks again for your work and support!