# Biological Sequence Embedding Extraction Tutorial

## What Is an Embedding?

**Embedding** is a technique for converting text, sequences, or other unstructured data into numerical vectors (embedding vectors). In bioinformatics, an embedding model converts biological sequences such as DNA and protein sequences into high-dimensional numerical vectors. These vectors can:

1. **Capture semantic information of sequences**: Similar sequences tend to produce similar vectors
2. **Support machine learning**: Numerical vectors can be used directly as input to machine learning algorithms
3. **Provide a compact representation**: Complex, variable-length sequence information is compressed into fixed-length vectors
4. **Enable similarity calculation**: Similarity between sequences can be measured by the distance between their vectors

## Why Extract Embeddings?

In bioinformatics research, extracted embeddings are useful for:

- **Sequence classification**: Identify functional types of DNA sequences (such as promoters, enhancers, etc.)
- **Sequence similarity analysis**: Quickly find similar biological sequences
- **Functional prediction**: Predict protein function based on sequence embedding
- **Evolutionary analysis**: Study evolutionary relationships of sequences

## This tutorial will demonstrate:

1. How to load pre-trained [Genos-1.2B](https://huggingface.co/BGI-HangzhouAI/Genos-1.2B) or [Genos-10B](https://huggingface.co/BGI-HangzhouAI/Genos-10B) sequence models
2. How to convert DNA sequences into embedding vectors
3. How to analyze embedding features from different layers
4. How to interpret and apply embedding vectors


## Step 1: Import Required Libraries

First, we need to import the Python libraries required for processing embeddings:

- **torch**: PyTorch deep learning framework for model inference
- **transformers**: Hugging Face's transformers library, providing pre-trained models and tokenizers
  - **AutoModel**: Automatically load pre-trained models
  - **AutoTokenizer**: Automatically load corresponding tokenizers


In [None]:
import torch
from transformers import AutoModel, AutoTokenizer


## Step 2: Load Pre-trained Model

Here we load a pre-trained model specifically designed for biological sequences:

- **model_path**: Path to the pre-trained model (here, a local copy of the Genos-1.2B weights)
- **tokenizer**: Converts DNA sequences into tokens that the model can process
- **model**: The pre-trained model; setting `output_hidden_states=True` makes it return hidden states from all layers


In [None]:
# Specify model path
model_path = "/path/to/your/local/Genos-1.2B" # Replace with local model path

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)

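# Load locally downloaded Genos weights
model = AutoModel.from_pretrained(
    model_path,
    output_hidden_states=True,   # return hidden states from every layer
    torch_dtype=torch.bfloat16,  # load weights in bfloat16 to reduce GPU memory
    trust_remote_code=True,      # allow custom model code shipped with the checkpoint
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
model.cuda()  # Move model to GPU
model.eval()  # Switch model to inference mode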
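
The `flash_attention_2` setting above only works when the separate `flash-attn` package is installed. A minimal sketch of a fallback follows. It assumes the Genos model code also accepts the standard `transformers` value `"sdpa"`, which this tutorial does not confirm, and `pick_attn_implementation` is a hypothetical helper name.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    # find_spec returns None when the package is not installed.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumption: the model code supports PyTorch's built-in SDPA attention.
    return "sdpa"


attn_impl = pick_attn_implementation()
print(attn_impl)
```

The result could then be passed as `attn_implementation=attn_impl` in the `from_pretrained` call.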


## Step 3: Prepare Input Sequence

We generate a random DNA sequence as an example:

- **text**: A randomly generated sequence of 8,192 bases
- Each position is drawn uniformly from the four DNA bases (A, T, G, C)

**In practical applications**, you can replace it with any real DNA sequence.


In [None]:
import random

# Randomly select a specific number of bases
bases = ['A', 'T', 'G', 'C']
seqs = random.choices(bases, k=8192)
# Generate input base sequence
text = ''.join(seqs)
print(text)
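
To use a real sequence instead of a random one, the input usually comes from a FASTA file. The sketch below is one way to read such data with the standard library only. `read_fasta` and `is_plain_dna` are hypothetical helper names, and the inline string stands in for a real file. How the Genos tokenizer treats ambiguous bases such as `N` is not covered here, so the check simply flags them.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def is_plain_dna(seq):
    """Return True if seq is non-empty and contains only A, C, G, T."""
    return bool(seq) and set(seq) <= set("ACGT")


# Inline example data standing in for a real FASTA file.
example = io.StringIO(">seq1 toy record\nacgtac\nGGTA\n>seq2\nACGTNN\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq), is_plain_dna(seq))
```

A sequence that passes the check, for example `records[0][1]`, can be assigned to `text` and used in the following steps.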


## Step 4: Sequence Encoding

Use the tokenizer to convert DNA sequences into a format that the model can process:

- **tokenizer(text, return_tensors="pt")**: Encodes the sequence; `return_tensors="pt"` makes the tokenizer return PyTorch tensors
- **inputs**: The encoded sequence, including `input_ids`, `attention_mask`, etc.

This step converts the raw DNA string into a sequence of integer token IDs, where each base or base combination maps to one token ID.


In [None]:
# Encode text
inputs = tokenizer(text, return_tensors="pt")

# View encoded token sequence
print(inputs['input_ids'])

# Load data to GPU
inputs = {k: v.cuda() for k, v in inputs.items()}
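
To make the idea of token IDs concrete, here is a toy sketch with a hypothetical single-base vocabulary. It is not the Genos tokenizer: the real vocabulary, special tokens, and IDs come from the files loaded by `AutoTokenizer`, and `TOY_VOCAB`, `toy_encode`, and `toy_decode` exist only for this illustration.

```python
# Hypothetical vocabulary for illustration only; not the Genos vocabulary.
TOY_VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "<unk>": 4}


def toy_encode(seq):
    """Map each base to its toy ID; unknown characters map to <unk>."""
    unk = TOY_VOCAB["<unk>"]
    return [TOY_VOCAB.get(base, unk) for base in seq.upper()]


def toy_decode(ids):
    """Map toy IDs back to their token strings and join them."""
    inverse = {i: tok for tok, i in TOY_VOCAB.items()}
    return "".join(inverse[i] for i in ids)


ids = toy_encode("ACGTN")
print(ids)                  # [0, 1, 2, 3, 4]
print(toy_decode(ids[:4]))  # ACGT
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())` should show which tokens the IDs correspond to.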


## Step 5: Model Inference

Perform forward propagation through the pre-trained model to obtain embeddings:

- **torch.no_grad()**: Disables gradient computation, since we only need inference, not training
- **`model(**inputs)`**: Passes the encoded sequence to the model
- **outputs**: The model output. A base model loaded with `AutoModel` typically returns `last_hidden_state` rather than logits; with `output_hidden_states=True` it also returns `hidden_states` for all layers

This step is the core of embedding extraction, where the model converts sequences into high-dimensional vector representations based on pre-trained knowledge.


In [None]:
# Model inference
with torch.no_grad():
    outputs = model(**inputs)


## Step 6: Extract Layer-wise Embeddings

Obtain hidden states (embeddings) from each layer of the model:

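- **outputs.hidden_states**: A tuple containing the hidden states from all layers
- **hidden_states[i]**: In the standard `transformers` convention, index 0 is the output of the embedding layer and index i (for i ≥ 1) is the output of the i-th transformer layer, so the tuple has one more entry than the model has layers
- **shape**: Each entry has shape [batch_size, sequence_length, hidden_size]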

**Important Note**:
- Embeddings from different layers capture semantic information at different levels
- Shallow layers typically capture local features (such as base patterns)
- Deep layers capture more abstract semantic information (such as functional domains, structural features)


In [None]:
# Get hidden states from all layers
hidden_states = outputs.hidden_states  # Tuple containing hidden states from each layer

# Iterate through embeddings from each layer and display key information
for i, layer_embedding in enumerate(hidden_states):
    print(f"Layer {i} embedding ({layer_embedding.shape}): {layer_embedding[0, 0, :10]}")
    print("-" * 50)


---
# Using Genos Package to Directly Get Embeddings


Note: Because resources are currently limited, the API serves only the Genos-1.2B and Genos-10B models, accepts sequences up to **128k** in length, and returns embeddings from the last layer only.

- Parameter description
    - `sequence`: the input sequence, with a maximum length of 128k
    - `model_name`: the model to use, `"Genos-1.2B"` or `"Genos-10B"`
    - `pooling_method`: how token embeddings are pooled: `"mean"` (mean pooling), `"max"` (max pooling), `"min"` (min pooling), or `"last"` (embedding of the last token)
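
If a sequence is longer than the API limit, one option is to split it into windows and request one embedding per window. The sketch below is an assumption about how you might do this and is not part of the Genos package. `split_into_windows` is a hypothetical helper, and `max_len` must be set to the API's actual limit, in whatever unit the API counts length.

```python
def split_into_windows(seq, max_len, overlap=0):
    """Split seq into windows of at most max_len characters.

    Consecutive windows share `overlap` characters.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    start = 0
    while start < len(seq):
        windows.append(seq[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(seq):
            break
        start += step
    return windows


# Toy values so the behavior is easy to check by eye.
print(split_into_windows("ACGTACGTAC", max_len=4))
print(split_into_windows("ACGTACGTAC", max_len=4, overlap=2))
```

How the per-window embeddings are combined afterwards (for example, by averaging) is a downstream design choice.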


In [None]:
from genos import create_client

client = create_client(token="<your_api_key>")

result = client.get_embedding(sequence=text, model_name="Genos-1.2B", pooling_method="mean")
print(result['result']['embedding'])
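
The `pooling_method` options reduce the per-token embeddings to a single vector per sequence. The sketch below shows the conventional meaning of each option using plain Python lists, so it runs without a model or GPU. It illustrates the idea only: how the API computes pooling internally is not documented in this tutorial, and `pool` is a hypothetical helper.

```python
def pool(token_vectors, method="mean"):
    """Reduce a list of equal-length per-token vectors to one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    # Transpose: one tuple per embedding dimension.
    columns = list(zip(*token_vectors))
    if method == "mean":
        return [sum(col) / len(col) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    if method == "min":
        return [min(col) for col in columns]
    if method == "last":
        return list(token_vectors[-1])
    raise ValueError(f"unknown pooling method: {method}")


# Three tokens, each with a 2-dimensional toy embedding.
tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
for name in ("mean", "max", "min", "last"):
    print(name, pool(tokens, name))
```

For locally extracted hidden states, mean pooling of a single unpadded sequence can be written as `hidden_states[-1][0].mean(dim=0)`; batches with padding would also need the attention mask to exclude padded positions.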


---


### Applications of Embeddings

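Extracted embeddings can be used for the tasks below. The snippets use PyTorch and scikit-learn as one possible choice, and assume that `embedding1`, `embedding2`, `embeddings`, and `labels` are already defined:

1. **Sequence similarity calculation**:
   ```python
   # Cosine similarity between two pooled embeddings (1-D torch tensors)
   import torch.nn.functional as F
   similarity = F.cosine_similarity(embedding1, embedding2, dim=0)
   ```

2. **Sequence classification**:
   ```python
   # Train a scikit-learn classifier on embeddings of shape [n_samples, hidden_size]
   from sklearn.linear_model import LogisticRegression
   classifier = LogisticRegression(max_iter=1000).fit(embeddings, labels)
   ```

3. **Clustering analysis**:
   ```python
   # Cluster sequences with k-means; n_clusters=3 is an arbitrary example value
   from sklearn.cluster import KMeans
   clusters = KMeans(n_clusters=3).fit_predict(embeddings)
   ```

4. **Dimensionality reduction and visualization**:
   ```python
   # Project embeddings to 2-D with t-SNE (PCA is an alternative) for plotting
   from sklearn.manifold import TSNE
   reduced_embeddings = TSNE(n_components=2).fit_transform(embeddings)
   ```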
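
As a self-contained illustration of similarity search, the sketch below ranks a small library of vectors by cosine similarity to a query. The vectors are made-up toy values, not real Genos embeddings, and `cosine_similarity` and `rank_by_similarity` are hypothetical helpers written with the standard library only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 3-dimensional vectors standing in for pooled sequence embeddings.
library = {
    "seq_a": [1.0, 0.0, 0.0],
    "seq_b": [0.6, 0.8, 0.0],
    "seq_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]
ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

For large collections, the same ranking would normally be done with vectorized tensor operations rather than Python loops.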


## Summary

Through this tutorial, we learned:

### Core Concepts
- **Embedding**: Technique for converting biological sequences into numerical vectors
- **Pre-trained models**: Large language models specifically trained for biological sequences
- **Hierarchical features**: Different layers capture sequence information at different levels

### Technical Process
1. Load pre-trained biological sequence models
2. Encode DNA sequences into tokens
3. Obtain embeddings through model inference
4. Analyze feature representations from different layers

### Practical Value
- **Accelerate research**: Quickly analyze large amounts of biological sequences
- **Improve accuracy**: Pre-trained knowledge can improve prediction performance on downstream tasks
- **Support downstream tasks**: Provide foundation for classification, clustering, similarity analysis, etc.

### Next Examples
1. Use extracted embeddings for downstream population prediction tasks
2. Use extracted embeddings for downstream variant prediction tasks
3. RNA-seq prediction


**Congratulations!** You have mastered the basic methods of biological sequence embedding extraction!
