# Introduction to the Hyena-GLT Framework

Welcome to the Hyena-GLT (Genome Language Transformer) framework! This notebook introduces the framework and walks through its core workflow for genomic sequence modeling: configuration, tokenization, inference, and basic sequence analysis.

## What is Hyena-GLT?

Hyena-GLT is a hybrid architecture that combines:
- **BLT's Byte Latent Tokenization**: Efficient compression and tokenization
- **Savanna's Striped Hyena blocks**: Linear complexity convolutions
- **Genomic-specific adaptations**: Specialized for biological sequences

## Key Features

- 🧬 **Multi-modal genomic support**: DNA, RNA, and protein sequences
- ⚡ **Efficient processing**: O(n) scaling in sequence length, versus O(n²) for standard self-attention
- 🔄 **Dynamic tokenization**: Adaptive sequence compression
- 📊 **Multi-task capable**: Classification, generation, and analysis
- 🎯 **Fine-tuning ready**: Easy adaptation to downstream tasks
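
To make the complexity claim concrete, here is a rough back-of-the-envelope sketch. It only counts idealized operations (n versus n²) and ignores constant factors, hidden size, and hardware effects. The function names are illustrative and are not part of `hyena_glt`.

```python
def attention_pairwise_ops(n: int) -> int:
    """Token-pair interactions scored by full self-attention: n * n."""
    return n * n


def linear_ops(n: int) -> int:
    """Per-token operations for an idealized linear-time sequence mixer: n."""
    return n


# Compare the two idealized counts at a few sequence lengths.
for n in (512, 4_096, 32_768):
    quadratic = attention_pairwise_ops(n)
    linear = linear_ops(n)
    print(f"n={n:>6,}: quadratic ~{quadratic:>13,} | linear ~{linear:>6,} | gap {quadratic // linear:,}x")
```

In this idealized count the gap between the two always equals the sequence length itself, which is why the difference matters most for long genomic sequences.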

In [None]:
# Import required libraries

import matplotlib.pyplot as plt
import torch

# Hyena-GLT imports
from hyena_glt import HyenaGLT, HyenaGLTConfig
from hyena_glt.data import GenomicTokenizer
from hyena_glt.utils import analyze_tokenization, compute_sequence_statistics

print("🧬 Hyena-GLT Framework Loaded Successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"Device available: {torch.device('cuda' if torch.cuda.is_available() else 'cpu')}")

## Quick Start Example

Let's start with a simple example to demonstrate the basic workflow:

In [None]:
# 1. Create configuration
config = HyenaGLTConfig(
    vocab_size=4096,
    hidden_size=256,  # Smaller for demo
    num_layers=4,     # Fewer layers for demo
    num_heads=8,
    sequence_length=512,
    dropout=0.1
)

print("Configuration created:")
print(f"  - Vocabulary size: {config.vocab_size:,}")
print(f"  - Hidden size: {config.hidden_size}")
print(f"  - Number of layers: {config.num_layers}")
print(f"  - Sequence length: {config.sequence_length}")

In [None]:
# 2. Initialize tokenizer
tokenizer = GenomicTokenizer(
    sequence_type="dna",
    vocab_size=config.vocab_size,
    max_length=config.sequence_length
)

print(f"Tokenizer initialized for {tokenizer.sequence_type.upper()} sequences")
print(f"Vocabulary size: {len(tokenizer.vocab) if hasattr(tokenizer, 'vocab') else 'Dynamic'}")

In [None]:
# 3. Create model
model = HyenaGLT(config)
total_params = sum(p.numel() for p in model.parameters())

print(f"Model created with {total_params:,} parameters")
print(f"Model size: {total_params * 4 / 1024**2:.1f} MB (FP32)")
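
The size estimate above assumes 4 bytes per parameter (FP32). As a rough sketch, the same arithmetic can be repeated for other precisions. The parameter count below is a made-up placeholder, not the actual size of this model, and the figures cover weights only (no activations, gradients, or optimizer state).

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def model_size_mb(num_params: int, dtype: str = "fp32") -> float:
    """Approximate weight storage in MiB for a given parameter count and dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**2


example_params = 10_000_000  # hypothetical parameter count, for illustration only
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {model_size_mb(example_params, dtype):.1f} MB")
```

In the notebook itself, you could pass `total_params` from the cell above instead of the placeholder.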

## Working with Genomic Sequences

Let's explore how Hyena-GLT processes different types of genomic sequences:

In [None]:
# Example genomic sequences
sequences = {
    "Gene fragment": "ATGCGATCGATCGATCGAATTCGCTAGCTAGCTAGCTAGCTAGCTAGCTAG",
    "Promoter region": "TATAATGGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC",
    "Coding sequence": "ATGAAACGTTTCGACGACGACGACGACGACGACGACGACGACGACGACTAG",
    "Repetitive DNA": "ATATATATATATATATATATATATATATATATATATATATATATATATAT"
}

print("Example genomic sequences:")
for name, seq in sequences.items():
    print(f"  {name}: {seq[:30]}... ({len(seq)} bp)")

In [None]:
# Tokenize sequences
tokenized_sequences = {}
for name, seq in sequences.items():
    tokens = tokenizer.encode(seq)
    tokenized_sequences[name] = tokens
    print(f"{name}:")
    print(f"  Original length: {len(seq)} bp")
    print(f"  Tokenized length: {len(tokens)} tokens")
    print(f"  Compression ratio: {len(seq) / len(tokens):.2f}x")
    print(f"  Sample tokens: {tokens[:10]}...")
    print()
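
To build intuition for the compression ratio, here is a toy tokenizer that splits a sequence into fixed-size, non-overlapping k-mers. This is only an illustration: `kmer_tokenize` and `compression_ratio` are hypothetical helpers, and `GenomicTokenizer` may work quite differently (for example, with dynamic, variable-length patches).

```python
def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a sequence into non-overlapping k-mers; the last chunk may be shorter."""
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


def compression_ratio(sequence: str, tokens: list[str]) -> float:
    """Bases per token; returns 0.0 for an empty token list to avoid dividing by zero."""
    return len(sequence) / len(tokens) if tokens else 0.0


toy_sequence = "ATGCGATCGATCGATCGAAT"  # 20 bp, made up for this example
toy_tokens = kmer_tokenize(toy_sequence, k=3)

print(f"Tokens: {toy_tokens}")
print(f"{len(toy_sequence)} bp -> {len(toy_tokens)} tokens")
print(f"Compression ratio: {compression_ratio(toy_sequence, toy_tokens):.2f}x")
```

With a fixed `k`, the ratio stays close to `k` for every sequence. A dynamic tokenizer can instead give different ratios for repetitive and information-dense regions, which is what the per-sequence numbers above let you inspect.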

## Model Inference

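Now let's run inference on our genomic sequences. The model was just created from a configuration and has not been trained, so the outputs below demonstrate shapes and data flow rather than meaningful predictions: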

In [None]:
# Run inference
model.eval()
inference_results = {}

with torch.no_grad():
    for name, tokens in tokenized_sequences.items():
        # Prepare input: add a batch dimension and use integer token IDs
        input_ids = torch.tensor([tokens], dtype=torch.long)

        # Run model
        outputs = model(input_ids)

        # hidden_states may be missing, None, or empty depending on the model output
        hidden_states = getattr(outputs, 'hidden_states', None)
        has_hidden = hidden_states is not None and len(hidden_states) > 0

        # Store results
        inference_results[name] = {
            'logits_shape': outputs.logits.shape,
            'logits_mean': outputs.logits.mean().item(),
            'logits_std': outputs.logits.std().item(),
            'hidden_states_shape': hidden_states[-1].shape if has_hidden else 'N/A'
        }

        # Report from the stored results so each statistic is computed only once
        result = inference_results[name]
        print(f"{name}:")
        print(f"  Output logits shape: {result['logits_shape']}")
        print(f"  Logits statistics: mean={result['logits_mean']:.4f}, std={result['logits_std']:.4f}")
        if has_hidden:
            print(f"  Hidden states shape: {result['hidden_states_shape']}")
        print()
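
The logits returned by the model are unnormalized scores over the vocabulary. A common next step is to turn them into probabilities with a softmax. The sketch below does this in plain Python on a made-up four-entry logit vector; the toy vocabulary and values are assumptions for illustration, not real model output.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_vocab = ["A", "C", "G", "T"]      # hypothetical 4-token vocabulary
toy_logits = [2.0, 1.0, 0.1, -1.0]    # made-up scores for one position

probs = softmax(toy_logits)
best_index = max(range(len(probs)), key=probs.__getitem__)

for token, p in zip(toy_vocab, probs):
    print(f"{token}: {p:.3f}")
print(f"Most likely token: {toy_vocab[best_index]}")
```

With PyTorch tensors, `torch.softmax(outputs.logits, dim=-1)` performs the same normalization over the vocabulary dimension.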

## Sequence Analysis

Let's analyze the properties of our genomic sequences:

In [None]:
# Analyze sequences
sequence_list = list(sequences.values())
stats = compute_sequence_statistics(sequence_list)

print("Sequence Statistics:")
print(f"  Average length: {stats['avg_length']:.1f} bp")
print(f"  Length std: {stats['length_std']:.1f} bp")
print(f"  Average GC content: {stats['avg_gc_content']:.3f}")
print(f"  GC content std: {stats['gc_content_std']:.3f}")

if 'base_composition' in stats:
    print("\nBase composition:")
    for base, fraction in stats['base_composition'].items():
        print(f"  {base}: {fraction:.3f}")
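
For reference, statistics like these are simple to compute by hand. The sketch below shows one plausible way to get GC content and base composition in plain Python. The helper names are hypothetical, and `compute_sequence_statistics` may differ in details such as case handling or whether it uses the population or sample standard deviation.

```python
from collections import Counter
from statistics import mean, pstdev


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def base_composition(seqs: list[str]) -> dict[str, float]:
    """Fraction of each base across all sequences combined."""
    counts = Counter("".join(seqs).upper())
    total = sum(counts.values())
    return {base: counts[base] / total for base in sorted(counts)} if total else {}


toy_sequences = ["ATGC", "GGCC", "ATAT"]  # made-up examples
gc_values = [gc_content(s) for s in toy_sequences]

print(f"GC content per sequence: {gc_values}")
print(f"Average GC content: {mean(gc_values):.3f}")
print(f"GC content std (population): {pstdev(gc_values):.3f}")
print(f"Base composition: {base_composition(toy_sequences)}")
```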

In [None]:
# Tokenization analysis
token_stats = analyze_tokenization(tokenizer, sequence_list)

print("Tokenization Analysis:")
print(f"  Average tokens per sequence: {token_stats['avg_tokens']:.1f}")
print(f"  Token count std: {token_stats['token_std']:.1f}")
print(f"  Average compression ratio: {token_stats['compression_ratio']:.2f}x")
print(f"  Compression std: {token_stats['compression_std']:.2f}")

## Visualization

Let's create some visualizations to better understand our data:

In [None]:
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Plot 1: Sequence lengths
names = list(sequences.keys())
lengths = [len(seq) for seq in sequences.values()]

axes[0, 0].bar(range(len(names)), lengths, color='lightblue')
axes[0, 0].set_title('Sequence Lengths')
axes[0, 0].set_xlabel('Sequence')
axes[0, 0].set_ylabel('Length (bp)')
axes[0, 0].set_xticks(range(len(names)))
axes[0, 0].set_xticklabels(names, rotation=45, ha='right')

# Plot 2: Token counts
token_counts = [len(tokens) for tokens in tokenized_sequences.values()]

axes[0, 1].bar(range(len(names)), token_counts, color='lightgreen')
axes[0, 1].set_title('Token Counts')
axes[0, 1].set_xlabel('Sequence')
axes[0, 1].set_ylabel('Tokens')
axes[0, 1].set_xticks(range(len(names)))
axes[0, 1].set_xticklabels(names, rotation=45, ha='right')

# Plot 3: Compression ratios (bp per token)
compression_ratios = [length / count for length, count in zip(lengths, token_counts)]

axes[1, 0].bar(range(len(names)), compression_ratios, color='orange')
axes[1, 0].set_title('Compression Ratios')
axes[1, 0].set_xlabel('Sequence')
axes[1, 0].set_ylabel('Compression Ratio')
axes[1, 0].set_xticks(range(len(names)))
axes[1, 0].set_xticklabels(names, rotation=45, ha='right')

# Plot 4: GC content
gc_contents = []
for seq in sequences.values():
    gc_count = seq.count('G') + seq.count('C')
    gc_content = gc_count / len(seq) if len(seq) > 0 else 0
    gc_contents.append(gc_content)

axes[1, 1].bar(range(len(names)), gc_contents, color='red', alpha=0.7)
axes[1, 1].set_title('GC Content')
axes[1, 1].set_xlabel('Sequence')
axes[1, 1].set_ylabel('GC Content')
axes[1, 1].set_xticks(range(len(names)))
axes[1, 1].set_xticklabels(names, rotation=45, ha='right')
axes[1, 1].set_ylim(0, 1)

plt.tight_layout()
plt.show()

## Next Steps

This introduction showed you the basics of Hyena-GLT. Here's what to explore next:

1. **📊 Tokenization Deep Dive**: `02_tokenization.ipynb` - Learn about BLT tokenization
2. **🏗️ Model Architecture**: `03_model_architecture.ipynb` - Understand Hyena blocks
3. **🎯 Training**: `04_training.ipynb` - Train your own models
4. **📈 Evaluation**: `05_evaluation.ipynb` - Assess model performance
5. **🔧 Fine-tuning**: `06_fine_tuning.ipynb` - Adapt to specific tasks
6. **🧬 Generation**: `07_generation.ipynb` - Generate new sequences

## Resources

- **Documentation**: See `docs/` folder for comprehensive guides
- **Examples**: Check `examples/` for complete scripts
- **API Reference**: Full API documentation in `docs/API.md`
- **User Guide**: Step-by-step guide in `docs/USER_GUIDE.md`

In [None]:
print("🎉 Introduction to Hyena-GLT completed!")
print("Happy genomic modeling! 🧬")