Clarification on Embedding Vector Output from Evo 2 7B #161

Description

@Grivv

Hi,
First of all, thank you for your outstanding work on Evo 2. It has been incredibly valuable for my research.

I'm using the Evo 2 7B model to obtain sequence-level representations and would appreciate clarification on the expected behavior of the embedding vectors.
I'm not sure the output I'm getting is in the right form, so I would be grateful if you could check my procedure and code below.

Here is the process I'm currently following to extract embeddings:

  • Prepare a sequence and tokenize it into byte-level token IDs.
  • Perform a forward pass and extract token embeddings from layer "blocks.28.mlp.l3".
  • Take the mean over the token dimension to obtain a single vector representing the sequence.

Here is a snippet of my code for embedding inference (adapted from the provided tutorial):

import numpy as np
import torch
from evo2 import Evo2

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Evo2("evo2_7b")

layer_name = "blocks.28.mlp.l3"  # intermediate layer used for embedding extraction
list_of_embs = []

model.model.eval()
with torch.no_grad():
    for seq in sequences:  # `sequences` is a list of DNA strings defined earlier in my code
        # Tokenize into byte-level token IDs and add a batch dimension
        inputs = torch.tensor(model.tokenizer.tokenize(seq), dtype=torch.int).unsqueeze(0).to(device)
        # Forward pass, requesting embeddings from the chosen layer
        _, token_embs = model(inputs, return_embeddings=True, layer_names=[layer_name])
        # Mean-pool over the token dimension to get one vector per sequence
        seq_repr = torch.mean(token_embs[layer_name], dim=1).squeeze().detach().cpu().tolist()
        list_of_embs.append(seq_repr)

emb_array = np.array(list_of_embs)

The resulting embeddings look like this:

array([[  90.   , -228.   ,   26.375, ..., -241.   , -113.   , -100.5  ],
       [ 181.   , -288.   ,  212.   , ..., -292.   ,  -69.5  ,   63.75 ],
       [ 119.   , -208.   ,   90.   , ..., -201.   ,  -46.25 ,   14.25 ],
       ...,
       [ 126.5  , -175.   ,   95.   , ..., -221.   ,  -91.5  ,   57.75 ],
       [ 162.   , -308.   ,  195.   , ..., -262.   , -106.   ,  116.5  ],
       [ 180.   , -318.   ,  202.   , ..., -276.   ,  -58.75 ,   75.   ]])

I've observed that:

  • The magnitude of values within a single vector varies quite a bit.
  • The same dimension across different vectors also shows large variation (a rough check of this spread is sketched right below).
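
As an illustration of what I mean (reusing emb_array from the snippet above), the spread could be quantified like this:

# Spread of values within each pooled sequence vector
within_vec_std = emb_array.std(axis=1)
# Spread of each dimension across the different sequence vectors
across_seq_std = emb_array.std(axis=0)
print("mean within-vector std:", within_vec_std.mean())
print("mean across-sequence std:", across_seq_std.mean())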

This variance seems to result in unstable behavior in downstream tasks. So here are my questions:

  • Is this expected behavior for the extracted embeddings?
  • If so, is there a recommended normalization strategy before using them downstream (e.g., the z-score normalization sketched below)?
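
To be concrete about the z-score option, this is the kind of per-dimension normalization I have in mind (purely an illustration of my question, computed over my own emb_array, not something taken from your docs):

# Per-dimension z-score across my set of sequence embeddings
emb_mean = emb_array.mean(axis=0)
emb_std = emb_array.std(axis=0)
emb_array_norm = (emb_array - emb_mean) / (emb_std + 1e-8)  # small epsilon to avoid division by zero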

Thanks again for your work and support!
