# Genomic Tokenization and Data Preprocessing

This notebook walks through genomic data preprocessing and tokenization with Hyena-GLT: handling different sequence types, choosing and implementing tokenization strategies, and preparing data for downstream modeling tasks.

## 🎯 Learning Objectives

By the end of this notebook, you will:
- Master different tokenization strategies for genomic sequences
- Understand sequence encoding and decoding processes
- Learn advanced preprocessing techniques
- Implement custom tokenizers for specific genomic tasks
- Handle various sequence formats and data sources
- Optimize tokenization for performance

## 📋 Prerequisites

- Complete the Introduction notebook (`01_introduction_to_hyena_glt.ipynb`)
- Basic understanding of genomic sequences
- Familiarity with Python and pandas

## 1. Setup and Imports

In [None]:
# Standard library imports
import sys
import os
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Add project root to path
project_root = os.path.abspath('../..')
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Scientific computing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Deep learning
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

# Hyena-GLT components
from hyena_glt.tokenizers import (
    DNATokenizer, RNATokenizer, ProteinTokenizer,
    CodonTokenizer, KmerTokenizer
)
from hyena_glt.data import GenomicDataset, SequenceCollator
from hyena_glt.utils import get_sequence_stats, analyze_composition

# Example utilities
from examples.utils.data_utils import (
    generate_synthetic_genomic_data,
    load_fasta_sequences,
    create_classification_dataset
)
from examples.utils.visualization_utils import (
    plot_tokenization_comparison,
    plot_sequence_length_distribution,
    plot_composition_analysis
)

# Configure plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)

print("🧬 Genomic tokenization setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {'CUDA' if torch.cuda.is_available() else 'CPU'}")

## 2. Understanding Tokenization Strategies

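Genomic sequences can be tokenized in several ways: one token per base, fixed-length k-mers, or k-mers that overlap. Each choice trades vocabulary size against token count, so let's explore the strategies and their trade-offs.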
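
Before using the library tokenizers, it can help to see the k-mer idea in plain Python. The sketch below is not the Hyena-GLT implementation: the helper name `kmers` and the choice to drop a trailing fragment shorter than `k` are assumptions made for illustration.

```python
def kmers(sequence, k, overlap=True):
    """Split a sequence into k-mers.

    overlap=True slides the window one base at a time (stride 1);
    overlap=False uses stride k and drops a trailing fragment shorter than k.
    """
    stride = 1 if overlap else k
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

seq = "ATCGATCGTA"  # 10 bases
overlapping = kmers(seq, k=3, overlap=True)
non_overlapping = kmers(seq, k=3, overlap=False)

print(overlapping)      # 8 tokens: L - k + 1
print(non_overlapping)  # 3 tokens: L // k
```

For a sequence of length L, overlapping k-mers yield L - k + 1 tokens, while non-overlapping k-mers yield L // k tokens. Both use a vocabulary of up to 4^k k-mers, compared with 4 symbols at the character level.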

In [None]:
# Sample DNA sequence for demonstration
sample_dna = "ATCGATCGTAGCTAGCTAGCGATCGATCGTAGCTAGCATGAAATTTGGGCCCAAATTTCCCGGGATCGATCG"

print(f"Sample DNA sequence ({len(sample_dna)} bp):")
print(sample_dna)
print()

# 1. Character-level tokenization (k=1)
char_tokenizer = DNATokenizer(k=1)
char_tokens = char_tokenizer.encode(sample_dna)
print(f"Character tokens (k=1): {char_tokens[:20]}... (total: {len(char_tokens)})")
print(f"Vocabulary size: {char_tokenizer.vocab_size}")

# 2. K-mer tokenization (k=3)
kmer_tokenizer = DNATokenizer(k=3)
kmer_tokens = kmer_tokenizer.encode(sample_dna)
print(f"\nK-mer tokens (k=3): {kmer_tokens[:20]}... (total: {len(kmer_tokens)})")
print(f"Vocabulary size: {kmer_tokenizer.vocab_size}")

# 3. Overlapping vs non-overlapping
overlap_tokenizer = DNATokenizer(k=3, overlap=True)
overlap_tokens = overlap_tokenizer.encode(sample_dna)
print(f"\nOverlapping k-mers: {overlap_tokens[:20]}... (total: {len(overlap_tokens)})")

non_overlap_tokenizer = DNATokenizer(k=3, overlap=False)
non_overlap_tokens = non_overlap_tokenizer.encode(sample_dna)
print(f"Non-overlapping k-mers: {non_overlap_tokens[:20]}... (total: {len(non_overlap_tokens)})")

### Visualizing Tokenization Strategies

In [None]:
# Compare different tokenization strategies
strategies = {
    'Character (k=1)': char_tokens,
    'K-mer (k=3)': kmer_tokens,
    'Overlapping k=3': overlap_tokens,
    'Non-overlapping k=3': non_overlap_tokens
}

# Plot comparison
plot_tokenization_comparison(sample_dna, strategies)
plt.tight_layout()
plt.show()

## 3. Specialized Tokenizers for Different Sequence Types

DNA, RNA, and protein sequences use different alphabets, so each needs its own tokenizer and vocabulary.
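
To make the relationship between the sequence types concrete, here is a minimal plain-Python sketch of transcription (T to U) and codon splitting. The helpers `transcribe` and `split_codons` are hypothetical names for this example and are not part of Hyena-GLT; how `CodonTokenizer` treats leftover bases is not shown here.

```python
def transcribe(dna):
    """Return the RNA version of a DNA coding strand (T -> U)."""
    return dna.upper().replace("T", "U")

def split_codons(seq):
    """Split into non-overlapping triplets; return (codons, leftover_bases)."""
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return codons, seq[usable:]

dna = "ATGAAATTTGGGCCC"
rna = transcribe(dna)
codons, leftover = split_codons(dna + "A")  # one extra base that does not fill a codon

print(rna)
print(codons, repr(leftover))
```

A codon-level tokenizer has to decide what to do with leftover bases when the length is not a multiple of 3 (drop them, pad, or map them to an unknown token), which is one reason a round trip through `encode` and `decode` may not reproduce the input exactly.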

In [None]:
# Sample sequences for each type
dna_seq = "ATGAAATTTGGGCCCAAATTTCCCGGGATCGATCGTAGCTAGC"
rna_seq = "AUGAAAUUUGGGCCCAAAUUUCCCGGGAUCGAUCGUAGCUAGC"
protein_seq = "MKFGPKFPGIDRSRR"

print("Sample sequences:")
print(f"DNA:     {dna_seq}")
print(f"RNA:     {rna_seq}")
print(f"Protein: {protein_seq}")
print()

# Initialize different tokenizers
tokenizers = {
    'DNA': DNATokenizer(k=3),
    'RNA': RNATokenizer(k=3),
    'Protein': ProteinTokenizer(),
    'Codon': CodonTokenizer(),
    'K-mer': KmerTokenizer(k=6)
}

# Test tokenization
sequences = {
    'DNA': dna_seq,
    'RNA': rna_seq,
    'Protein': protein_seq,
    'Codon': dna_seq,  # Can be used for DNA coding sequences
    'K-mer': dna_seq
}

for name, tokenizer in tokenizers.items():
    if name in sequences:
        seq = sequences[name]
        try:
            tokens = tokenizer.encode(seq)
            print(f"{name} tokenizer:")
            print(f"  Input:  {seq}")
            print(f"  Tokens: {tokens[:10]}... (total: {len(tokens)})")
            print(f"  Vocab size: {tokenizer.vocab_size}")
            
            # Test decoding
            decoded = tokenizer.decode(tokens)
            print(f"  Decoded: {decoded}")
            print(f"  Match: {decoded == seq}")
            print()
        except Exception as e:
            print(f"{name} tokenizer error: {e}")
            print()

## 4. Advanced Tokenization Features

Learn about special tokens, padding, truncation, and other advanced features.
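
The sketch below illustrates what padding, truncation, and the attention mask do to a list of token IDs. The ID values and the helper `prepare` are assumptions for this example; the actual IDs and the placement of special tokens in `DNATokenizer` may differ.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 2, 3  # hypothetical IDs for this sketch

def prepare(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, pad, and build the mask."""
    body = token_ids[: max_length - 2]  # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)               # 1 = real token
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 = padding

ids, mask = prepare([10, 11, 12], max_length=8)
long_ids, long_mask = prepare(list(range(10, 30)), max_length=8)

print(ids, mask)
print(long_ids, long_mask)
```

Every output has exactly `max_length` entries, and summing the mask gives the number of real (non-padding) tokens.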

In [None]:
# Initialize tokenizer with special tokens
tokenizer = DNATokenizer(
    k=3,
    add_special_tokens=True,
    max_length=100,
    padding=True,
    truncation=True
)

print("Tokenizer with special tokens:")
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Special tokens: {tokenizer.special_tokens}")
print(f"PAD token ID: {tokenizer.pad_token_id}")
print(f"UNK token ID: {tokenizer.unk_token_id}")
print(f"CLS token ID: {tokenizer.cls_token_id}")
print(f"SEP token ID: {tokenizer.sep_token_id}")
print()

# Test with sequences of different lengths
test_sequences = [
    "ATCG",  # Very short
    "ATCGATCGTAGCTAGCTAGCGATCGATCG",  # Medium
    "ATCGATCG" * 20,  # Long (will be truncated)
    "ATCGXYZ"  # Contains unknown characters
]

for i, seq in enumerate(test_sequences):
    print(f"Test sequence {i+1} (length: {len(seq)}):")
    print(f"  Input: {seq[:50]}{'...' if len(seq) > 50 else ''}")
    
    # Encode with different options
    tokens = tokenizer.encode(seq)
    print(f"  Tokens: {tokens[:15]}{'...' if len(tokens) > 15 else ''}")
    print(f"  Token count: {len(tokens)}")
    
    # Get attention mask
    attention_mask = tokenizer.get_attention_mask(tokens)
    print(f"  Attention mask: {attention_mask[:15]}{'...' if len(attention_mask) > 15 else ''}")
    print(f"  Valid tokens: {sum(attention_mask)}")
    print()

## 5. Batch Processing and Data Loading

Learn how to efficiently process multiple sequences and create data loaders.
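
A collator turns a list of variable-length samples into one rectangular batch. The plain-Python sketch below shows the idea of dynamic padding (pad to the longest sample in the batch rather than to a global maximum). The function `collate` and its input format are assumptions for illustration, not the `SequenceCollator` implementation; a real collator would also convert the lists to tensors.

```python
def collate(batch, pad_id=0):
    """Pad each sample to the longest sequence in this batch only."""
    longest = max(len(ids) for ids, _ in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids, _ in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids, _ in batch]
    labels = [label for _, label in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Each sample is (token_ids, label)
samples = [([5, 6, 7, 8], 1), ([9, 10], 0), ([11], 2)]
batch = collate(samples)

print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch keeps tensors small when most sequences are much shorter than the global `max_length`.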

In [None]:
# Generate a dataset of sequences
np.random.seed(42)
sequences, labels = generate_synthetic_genomic_data(
    n_samples=1000,
    sequence_type='dna',
    min_length=100,
    max_length=500,
    num_classes=5
)

print(f"Generated {len(sequences)} sequences")
print(f"Sequence length range: {min(len(s) for s in sequences)} - {max(len(s) for s in sequences)}")
print(f"Label distribution: {np.bincount(labels)}")
print()

# Analyze sequence properties
lengths = [len(seq) for seq in sequences]
gc_contents = [(seq.count('G') + seq.count('C')) / len(seq) for seq in sequences]

# Plot sequence statistics
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Length distribution
axes[0, 0].hist(lengths, bins=30, alpha=0.7, color='skyblue')
axes[0, 0].set_xlabel('Sequence Length')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Sequence Length Distribution')

# GC content distribution
axes[0, 1].hist(gc_contents, bins=30, alpha=0.7, color='lightcoral')
axes[0, 1].set_xlabel('GC Content')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('GC Content Distribution')

# Label distribution
label_counts = np.bincount(labels)
axes[1, 0].bar(range(len(label_counts)), label_counts, color='lightgreen')
axes[1, 0].set_xlabel('Label')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_title('Label Distribution')

# Length vs GC content
scatter = axes[1, 1].scatter(lengths, gc_contents, c=labels, alpha=0.6, cmap='viridis')
axes[1, 1].set_xlabel('Sequence Length')
axes[1, 1].set_ylabel('GC Content')
axes[1, 1].set_title('Length vs GC Content (colored by label)')
plt.colorbar(scatter, ax=axes[1, 1])

plt.tight_layout()
plt.show()

### Creating Genomic Datasets

In [None]:
# Initialize tokenizer for dataset creation
tokenizer = DNATokenizer(k=3, max_length=512, padding=True, truncation=True)

# Create genomic dataset
dataset = GenomicDataset(
    sequences=sequences,
    labels=labels,
    tokenizer=tokenizer,
    sequence_type='dna',
    task_type='classification'
)

print(f"Dataset created with {len(dataset)} samples")
print(f"Sample data structure:")
sample = dataset[0]
for key, value in sample.items():
    if isinstance(value, torch.Tensor):
        print(f"  {key}: {value.shape} ({value.dtype})")
    else:
        print(f"  {key}: {value}")
print()

# Split dataset
from torch.utils.data import random_split

train_size = int(0.8 * len(dataset))
val_size = int(0.1 * len(dataset))
test_size = len(dataset) - train_size - val_size

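# Seed the split so train/val/test membership is reproducible across runs
train_dataset, val_dataset, test_dataset = random_split(
    dataset, [train_size, val_size, test_size],
    generator=torch.Generator().manual_seed(42)
)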

print(f"Dataset splits:")
print(f"  Training: {len(train_dataset)} samples")
print(f"  Validation: {len(val_dataset)} samples")
print(f"  Test: {len(test_dataset)} samples")

### Efficient Data Loading with Collators

In [None]:
# Create sequence collator for efficient batching
collator = SequenceCollator(
    tokenizer=tokenizer,
    max_length=512,
    padding=True,
    return_attention_mask=True,
    return_token_type_ids=False
)

# Create data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    collate_fn=collator,
    num_workers=2
)

val_loader = DataLoader(
    val_dataset,
    batch_size=32,
    shuffle=False,
    collate_fn=collator,
    num_workers=2
)

print(f"Data loaders created:")
print(f"  Training batches: {len(train_loader)}")
print(f"  Validation batches: {len(val_loader)}")
print()

# Examine a batch
batch = next(iter(train_loader))
print(f"Batch structure:")
for key, value in batch.items():
    if isinstance(value, torch.Tensor):
        print(f"  {key}: {value.shape} ({value.dtype})")
        if key == 'input_ids':
            print(f"    Range: {value.min().item()} - {value.max().item()}")
        elif key == 'labels':
            print(f"    Unique labels: {torch.unique(value).tolist()}")
    else:
        print(f"  {key}: {type(value)}")
print()

# Check attention masks
attention_mask = batch['attention_mask']
valid_tokens_per_sample = attention_mask.sum(dim=1)
print(f"Valid tokens per sample in batch:")
print(f"  Min: {valid_tokens_per_sample.min().item()}")
print(f"  Max: {valid_tokens_per_sample.max().item()}")
print(f"  Mean: {valid_tokens_per_sample.float().mean().item():.1f}")

## 6. Handling Different Data Formats

Learn how to work with common genomic file formats such as FASTA and FASTQ.
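
FASTA records often wrap one sequence over several lines, so a parser has to join lines until the next `>` header. The generator below is a minimal sketch under that assumption; `parse_fasta` is a hypothetical helper written for this example, and for real data a dedicated library is usually the safer choice.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())  # normalise soft-masked (lowercase) bases
    if header is not None:
        yield header, "".join(chunks)

fasta_text = """>seq1|gene_A|chromosome_1
ATGAAATTT
GGGCCC
>seq2|gene_B|chromosome_2
cgatcgat
"""
records = list(parse_fasta(fasta_text.splitlines()))

for header, sequence in records:
    print(header, sequence)
```

Because `parse_fasta` is a generator, it can also be fed an open file handle and will read one line at a time instead of loading the whole file.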

In [None]:
# Create sample FASTA-like data
fasta_data = [
    (">seq1|gene_A|chromosome_1", "ATGAAATTTGGGCCCAAATTTCCCGGGATCGATCGTAGCTAGC"),
    (">seq2|gene_B|chromosome_2", "CGATCGATCGTAGCTAGCTAGCGATCGATCGTAGCTAGCATG"),
    (">seq3|gene_C|chromosome_3", "TAGCTAGCGATCGATCGTAGCTAGCATGAAATTTGGGCCCAAA"),
    (">seq4|gene_D|chromosome_4", "AAATTTCCCGGGATCGATCGTAGCTAGCGATCGATCGTAGC")
]

print("Sample FASTA-like data:")
for header, sequence in fasta_data:
    print(f"{header}")
    print(f"{sequence}")
    print()

# Parse and process FASTA data
def parse_fasta_header(header):
    """Parse FASTA header to extract metadata."""
    parts = header.strip('> ').split('|')
    return {
        'seq_id': parts[0] if len(parts) > 0 else 'unknown',
        'gene': parts[1] if len(parts) > 1 else 'unknown',
        'chromosome': parts[2] if len(parts) > 2 else 'unknown'
    }

# Process FASTA data
processed_data = []
for header, sequence in fasta_data:
    metadata = parse_fasta_header(header)
    
    # Calculate sequence statistics
    stats = get_sequence_stats(sequence)
    
    processed_data.append({
        'sequence': sequence,
        'metadata': metadata,
        'stats': stats
    })

# Display processed data
print("Processed FASTA data:")
for i, data in enumerate(processed_data):
    print(f"Entry {i+1}:")
    print(f"  Sequence: {data['sequence'][:30]}...")
    print(f"  Metadata: {data['metadata']}")
    print(f"  Stats: {data['stats']}")
    print()

### Quality Score Handling (FASTQ-style)
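
In FASTQ files, each base has a Phred quality score Q, which relates to the probability P that the base call is wrong by P = 10^(-Q/10). With Phred+33 encoding, the score is the character's ASCII code minus 33. The short sketch below makes both conversions explicit; the helper names are chosen for this example only.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 quality string to a list of integer scores."""
    return [ord(char) - 33 for char in quality_string]

def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

scores = decode_phred33("!#5I")
for char, q in zip("!#5I", scores):
    print(f"'{char}' -> Q{q}, error probability {phred_to_error_prob(q):.4f}")
```

Under this encoding `!` is Q0 and `#` is Q2, both far below a typical threshold of Q20 (1% error), while `I` is Q40 (0.01% error).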

In [None]:
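# Simulate FASTQ-style data with quality scores (Phred+33 encoding).
# In Phred+33, 'I' encodes Q40 (high quality) and '#' encodes Q2 (low quality).
# Quality strings are built with repetition so they match the sequence lengths.
fastq_data = [
    {
        'sequence': 'ATGAAATTTGGGCCCAAATTTCCCGGGATCGATCG',  # 35 bases
        'quality': 'I' * 29 + '#' * 6,  # low-quality 3' end
        'header': '@read1'
    },
    {
        'sequence': 'CGATCGATCGTAGCTAGCTAGCGATCGATCGTAG',  # 34 bases
        'quality': '#' * 6 + 'I' * 10 + '#' * 5 + 'I' * 13,  # low-quality 5' end and interior dip
        'header': '@read2'
    }
]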

def parse_quality_scores(quality_string, encoding='phred33'):
    """Parse quality scores from FASTQ format."""
    if encoding == 'phred33':
        return [ord(char) - 33 for char in quality_string]
    elif encoding == 'phred64':
        return [ord(char) - 64 for char in quality_string]
    else:
        raise ValueError(f"Unknown encoding: {encoding}")

def filter_by_quality(sequence, quality_scores, min_quality=20, min_length=20):
    """Trim low-quality ends and drop reads with too few high-quality bases.

    Only leading and trailing low-quality bases are removed; low-quality
    bases between the first and last high-quality position are kept.
    """
    # Positions with quality >= min_quality
    high_quality_positions = [i for i, q in enumerate(quality_scores) if q >= min_quality]
    
    if len(high_quality_positions) < min_length:
        return None, None  # Too few high-quality bases
    
    # Keep the span from the first to the last high-quality position
    start_pos = min(high_quality_positions)
    end_pos = max(high_quality_positions) + 1
    
    filtered_sequence = sequence[start_pos:end_pos]
    filtered_quality = quality_scores[start_pos:end_pos]
    
    return filtered_sequence, filtered_quality

# Process FASTQ data
print("Processing FASTQ-style data:")
for data in fastq_data:
    sequence = data['sequence']
    quality = data['quality']
    header = data['header']
    
    # Parse quality scores
    quality_scores = parse_quality_scores(quality)
    avg_quality = np.mean(quality_scores)
    
    # Filter by quality
    filtered_seq, filtered_qual = filter_by_quality(sequence, quality_scores)
    
    print(f"\n{header}:")
    print(f"  Original: {sequence}")
    print(f"  Quality:  {quality}")
    print(f"  Quality scores: {quality_scores}")
    print(f"  Average quality: {avg_quality:.1f}")
    
    if filtered_seq:
        print(f"  Filtered: {filtered_seq}")
        print(f"  Filtered quality: {filtered_qual}")
    else:
        print(f"  Filtered: REMOVED (low quality)")

## 7. Custom Tokenizer Development

Learn how to create custom tokenizers for specialized genomic tasks.

In [None]:
class MotifAwareTokenizer:
    """Custom tokenizer that preserves important biological motifs."""
    
    def __init__(self, motifs=None, k=3):
        self.k = k
        self.motifs = motifs or [
            'TATAAA',  # TATA box
            'CAAT',    # CAAT box
            'GGGCGG',  # GC box
            'ATG',     # Start codon
            'TAA', 'TAG', 'TGA',  # Stop codons
        ]
        
        # Build vocabulary
        self._build_vocabulary()
    
    def _build_vocabulary(self):
        """Build vocabulary including motifs, k-mers, and single bases."""
        bases = ['A', 'T', 'C', 'G']
        
        # Start with special tokens
        self.vocab = {
            '[PAD]': 0,
            '[UNK]': 1,
            '[CLS]': 2,
            '[SEP]': 3
        }
        
        # Add motifs (higher priority)
        for motif in self.motifs:
            if motif not in self.vocab:
                self.vocab[motif] = len(self.vocab)
        
        # Add regular k-mers
        for i in range(4 ** self.k):
            kmer = ''
            val = i
            for _ in range(self.k):
                kmer = bases[val % 4] + kmer
                val //= 4
            
            if kmer not in self.vocab:
                self.vocab[kmer] = len(self.vocab)
        
        # Add single bases so a tail shorter than k can be encoded without loss
        for base in bases:
            if base not in self.vocab:
                self.vocab[base] = len(self.vocab)
        
        # Create reverse mapping
        self.id_to_token = {idx: token for token, idx in self.vocab.items()}
        self.vocab_size = len(self.vocab)
    
    def _find_motifs(self, sequence):
        """Find motif positions in sequence."""
        motif_positions = []
        for motif in self.motifs:
            start = 0
            while True:
                pos = sequence.find(motif, start)
                if pos == -1:
                    break
                motif_positions.append((pos, pos + len(motif), motif))
                start = pos + 1
        
        # Sort by position
        motif_positions.sort()
        return motif_positions
    
    def encode(self, sequence):
        """Encode sequence preserving motifs.

        K-mers containing characters other than A/C/G/T map to [UNK],
        which decode() skips, so such input does not round-trip.
        """
        tokens = []
        sequence = sequence.upper()
        unk_id = self.vocab['[UNK]']
        
        # Map each start position to the longest motif beginning there
        motif_at = {}
        for start, _, motif in self._find_motifs(sequence):
            if start not in motif_at or len(motif) > len(motif_at[start]):
                motif_at[start] = motif
        
        i = 0
        while i < len(sequence):
            if i in motif_at:
                # Emit the whole motif as a single token
                motif = motif_at[i]
                tokens.append(self.vocab[motif])
                i += len(motif)
            elif i + self.k <= len(sequence):
                # Regular non-overlapping k-mer tokenization
                kmer = sequence[i:i+self.k]
                tokens.append(self.vocab.get(kmer, unk_id))
                i += self.k
            else:
                # Tail shorter than k: encode base by base instead of emitting [UNK]
                tokens.extend(self.vocab.get(base, unk_id) for base in sequence[i:])
                break
        
        return tokens
    
    def decode(self, tokens):
        """Decode tokens back to sequence."""
        sequence = ''
        for token_id in tokens:
            if token_id in self.id_to_token:
                token = self.id_to_token[token_id]
                if token not in ['[PAD]', '[UNK]', '[CLS]', '[SEP]']:
                    sequence += token
        return sequence

# Test custom tokenizer
test_sequence = "CGATCGTATAAATCGATCGATGATGAAATTTGGGCCCAAATAACGATCGTAGCTAGC"
print(f"Test sequence: {test_sequence}")
print(f"Length: {len(test_sequence)}")
print()

# Initialize custom tokenizer
motif_tokenizer = MotifAwareTokenizer(k=3)
print(f"Custom tokenizer vocabulary size: {motif_tokenizer.vocab_size}")
print(f"Important motifs: {motif_tokenizer.motifs}")
print()

# Encode with custom tokenizer
custom_tokens = motif_tokenizer.encode(test_sequence)
print(f"Custom tokens: {custom_tokens}")
print(f"Token count: {len(custom_tokens)}")

# Decode and verify
decoded_sequence = motif_tokenizer.decode(custom_tokens)
print(f"Decoded: {decoded_sequence}")
print(f"Match: {decoded_sequence == test_sequence}")
print()

# Compare with regular tokenizer
regular_tokenizer = DNATokenizer(k=3)
regular_tokens = regular_tokenizer.encode(test_sequence)
print(f"Regular tokens: {regular_tokens}")
print(f"Regular token count: {len(regular_tokens)}")

print(f"\nCompression comparison:")
print(f"  Original length: {len(test_sequence)}")
print(f"  Custom tokenizer: {len(custom_tokens)} tokens")
print(f"  Regular tokenizer: {len(regular_tokens)} tokens")
print(f"  Custom compression ratio: {len(test_sequence) / len(custom_tokens):.2f}")
print(f"  Regular compression ratio: {len(test_sequence) / len(regular_tokens):.2f}")

## 8. Performance Optimization

Learn techniques to optimize tokenization performance for large datasets.
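
Besides parallelism, the per-sequence work itself can often be reduced. One common trick for overlapping k-mers is a rolling base-4 encoding: instead of re-reading k characters for every window, each k-mer ID is updated from the previous one in constant time. The sketch below is an illustration in plain Python and assumes the sequence contains only A, C, G, and T; the IDs it produces are not the vocabulary IDs used by `DNATokenizer`.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def rolling_kmer_ids(sequence, k):
    """Overlapping k-mer IDs in O(n), updating a base-4 number one base at a time."""
    modulus = 4 ** k  # IDs live in [0, 4**k); the modulo drops the oldest base
    ids, current = [], 0
    for i, base in enumerate(sequence):
        current = (current * 4 + BASE_CODE[base]) % modulus
        if i >= k - 1:  # the first full window ends at index k - 1
            ids.append(current)
    return ids

def naive_kmer_ids(sequence, k):
    """Reference implementation: re-encode every window from scratch, O(n * k)."""
    ids = []
    for i in range(len(sequence) - k + 1):
        value = 0
        for base in sequence[i:i + k]:
            value = value * 4 + BASE_CODE[base]
        ids.append(value)
    return ids

demo = "ACGTACGGT"
fast = rolling_kmer_ids(demo, k=3)
slow = naive_kmer_ids(demo, k=3)
print(fast == slow, fast)
```

The saving grows with k, since the naive version does k steps per window while the rolling version always does one.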

In [None]:
import time
import concurrent.futures

# Generate test data for performance testing
test_sequences = []
for i in range(1000):
    length = np.random.randint(100, 1000)
    seq = ''.join(np.random.choice(['A', 'T', 'C', 'G'], length))
    test_sequences.append(seq)

print(f"Generated {len(test_sequences)} test sequences for performance testing")
print(f"Average length: {np.mean([len(s) for s in test_sequences]):.1f}")
print()

# Test different tokenization approaches
tokenizer = DNATokenizer(k=3)

# 1. Sequential processing
start_time = time.time()
sequential_results = []
for seq in test_sequences:
    tokens = tokenizer.encode(seq)
    sequential_results.append(tokens)
sequential_time = time.time() - start_time

print(f"Sequential processing: {sequential_time:.3f} seconds")

# 2. Batch processing function
def process_batch(sequences):
    """Process a batch of sequences."""
    return [tokenizer.encode(seq) for seq in sequences]

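# 3. Parallel processing with a process pool
# Note: on platforms that start workers with "spawn" (Windows, macOS),
# functions defined in a notebook cell may not be importable by the worker
# processes. If this fails, move process_batch into an importable module.
def parallel_tokenize(sequences, num_workers=4, batch_size=100):
    """Tokenize sequences in parallel."""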
    # Split into batches
    batches = [sequences[i:i+batch_size] for i in range(0, len(sequences), batch_size)]
    
    # Process batches in parallel
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_workers) as executor:
        batch_results = list(executor.map(process_batch, batches))
    
    # Flatten results
    results = []
    for batch_result in batch_results:
        results.extend(batch_result)
    
    return results

# Test parallel processing
start_time = time.time()
parallel_results = parallel_tokenize(test_sequences, num_workers=2, batch_size=100)
parallel_time = time.time() - start_time

print(f"Parallel processing: {parallel_time:.3f} seconds")
print(f"Speedup: {sequential_time / parallel_time:.2f}x")
print()

# Verify results are identical
results_match = all(
    seq_tokens == par_tokens 
    for seq_tokens, par_tokens in zip(sequential_results, parallel_results)
)
print(f"Results match: {results_match}")

### Memory-Efficient Processing
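
The core idea of memory-efficient processing is to never hold the whole dataset in memory: read sequences lazily and handle them in small batches. The sketch below shows this pattern with a generator and `itertools.islice`. The helpers `batched` and `fake_reads` are hypothetical names for this example; `fake_reads` stands in for a lazy file reader.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to batch_size items without materialising the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def fake_reads(n):
    """Stand-in for a lazy file reader: yields one sequence at a time."""
    for i in range(n):
        yield "ACGT" * (i % 3 + 1)

batch_sizes = [len(batch) for batch in batched(fake_reads(7), batch_size=3)]
print(batch_sizes)
```

Only one batch exists in memory at a time, so peak usage depends on `batch_size` rather than on the size of the input.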

In [None]:
import tempfile

class MemoryEfficientTokenizer:
    """Memory-efficient tokenizer for large datasets."""
    
    def __init__(self, tokenizer, max_memory_mb=100, max_batch_size=50):
        self.tokenizer = tokenizer
        self.max_memory_mb = max_memory_mb
        self.max_batch_size = max_batch_size
    
    def _write_batch(self, batch, outfile):
        """Tokenize a batch and write one comma-separated line per sequence."""
        for seq in batch:
            tokens = self.tokenizer.encode(seq)
            outfile.write(f"{','.join(map(str, tokens))}\n")
    
    def process_file_streaming(self, sequences, output_path):
        """Tokenize an iterable of sequences in small batches to limit memory use."""
        import psutil  # third-party dependency: pip install psutil
        import gc
        
        process = psutil.Process()
        initial_memory = process.memory_info().rss / 1024 / 1024  # MB
        
        with open(output_path, 'w') as outfile:
            batch = []
            
            # `sequences` can be any iterable, e.g. a generator that reads a file lazily
            for seq in sequences:
                batch.append(seq)
                
                # Check memory growth since the start of processing
                current_memory = process.memory_info().rss / 1024 / 1024
                memory_used = current_memory - initial_memory
                
                # Flush the batch if the memory limit is reached or the batch is full
                if memory_used > self.max_memory_mb or len(batch) >= self.max_batch_size:
                    self._write_batch(batch, outfile)
                    
                    # Clear batch and force garbage collection
                    batch.clear()
                    gc.collect()
                    
                    print(f"Processed batch, memory usage: {memory_used:.1f} MB")
            
            # Process remaining sequences
            if batch:
                self._write_batch(batch, outfile)
        
        final_memory = process.memory_info().rss / 1024 / 1024
        print(f"Memory usage: {initial_memory:.1f} -> {final_memory:.1f} MB")

# Test memory-efficient processing
efficient_tokenizer = MemoryEfficientTokenizer(tokenizer, max_memory_mb=50)
output_file = os.path.join(tempfile.gettempdir(), "tokenized_sequences.txt")

print("Testing memory-efficient processing:")
efficient_tokenizer.process_file_streaming(test_sequences[:100], output_file)  # subset for demo

# Verify output
with open(output_file, 'r') as f:
    lines = f.readlines()
    print(f"\nOutput file contains {len(lines)} tokenized sequences")
    print(f"First few lines:")
    for i, line in enumerate(lines[:3]):
        tokens = list(map(int, line.strip().split(',')))
        print(f"  Line {i+1}: {tokens[:10]}... ({len(tokens)} tokens)")

## 9. Tokenization Best Practices

Summary of key best practices for genomic tokenization.

In [None]:
def tokenization_recommendations(sequence_type, task_type, sequence_length):
    """Provide tokenization recommendations based on task characteristics."""
    recommendations = {
        'tokenizer_type': '',
        'k_value': 1,
        'overlap': True,
        'special_tokens': True,
        'max_length': 512,
        'reasoning': []
    }
    
    # Sequence type recommendations
    if sequence_type == 'dna':
        recommendations['tokenizer_type'] = 'DNATokenizer'
        if task_type == 'classification':
            recommendations['k_value'] = 3
            recommendations['reasoning'].append("k=3 captures codon-like patterns for DNA classification")
        elif task_type == 'generation':
            recommendations['k_value'] = 1
            recommendations['reasoning'].append("k=1 provides flexibility for sequence generation")
    
    elif sequence_type == 'rna':
        recommendations['tokenizer_type'] = 'RNATokenizer'
        recommendations['k_value'] = 3
        recommendations['reasoning'].append("k=3 captures codon structure in RNA")
    
    elif sequence_type == 'protein':
        recommendations['tokenizer_type'] = 'ProteinTokenizer'
        recommendations['k_value'] = 1
        recommendations['overlap'] = False
        recommendations['reasoning'].append("Amino acids are natural tokens for proteins")
    
    # Task type adjustments
    if task_type in ['classification', 'regression']:
        recommendations['special_tokens'] = True
        recommendations['reasoning'].append("Special tokens help with sequence classification")
    
    # Sequence length adjustments
    if sequence_length < 100:
        recommendations['max_length'] = 128
        recommendations['reasoning'].append("Short sequences need smaller max_length")
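    elif sequence_length > 1000:
        recommendations['max_length'] = 1024
        if sequence_type in ('dna', 'rna'):
            # Larger k-mers only apply to nucleotide sequences
            recommendations['k_value'] = min(recommendations['k_value'] + 1, 6)
            recommendations['reasoning'].append("Long sequences benefit from larger k-mers and max_length")
        else:
            recommendations['reasoning'].append("Long sequences benefit from a larger max_length")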
    
    return recommendations

# Test recommendations for different scenarios
scenarios = [
    ('dna', 'classification', 500),
    ('rna', 'structure_prediction', 200),
    ('protein', 'function_prediction', 300),
    ('dna', 'generation', 1500),
    ('dna', 'variant_prediction', 50)
]

print("Tokenization Recommendations:")
print("=" * 50)

for seq_type, task_type, seq_len in scenarios:
    rec = tokenization_recommendations(seq_type, task_type, seq_len)
    
    print(f"\nScenario: {seq_type.upper()} {task_type} (length ~{seq_len})")
    print(f"  Tokenizer: {rec['tokenizer_type']}")
    print(f"  K-value: {rec['k_value']}")
    print(f"  Overlap: {rec['overlap']}")
    print(f"  Special tokens: {rec['special_tokens']}")
    print(f"  Max length: {rec['max_length']}")
    print(f"  Reasoning:")
    for reason in rec['reasoning']:
        print(f"    - {reason}")

## 10. Summary and Next Steps

Recap of key concepts and next steps for advanced genomic tokenization.

In [None]:
print("🎓 Genomic Tokenization Summary")
print("=" * 40)
print()

print("✅ Key Concepts Learned:")
concepts = [
    "Different tokenization strategies (character, k-mer, overlapping)",
    "Specialized tokenizers for DNA, RNA, and proteins",
    "Advanced features: special tokens, padding, truncation",
    "Batch processing and efficient data loading",
    "Handling different genomic file formats",
    "Custom tokenizer development for specific tasks",
    "Performance optimization techniques",
    "Memory-efficient processing for large datasets",
    "Best practices and recommendations"
]

for i, concept in enumerate(concepts, 1):
    print(f"{i:2d}. {concept}")

print()
print("🚀 Next Steps:")
next_steps = [
    "Learn about model architectures (notebook 03)",
    "Explore model training with your tokenized data (notebook 04)",
    "Apply tokenization to real genomic datasets",
    "Experiment with custom tokenizers for your specific tasks",
    "Optimize tokenization for your computational resources",
    "Integrate with existing genomic analysis pipelines"
]

for i, step in enumerate(next_steps, 1):
    print(f"{i}. {step}")

print()
print("📚 Additional Resources:")
resources = [
    "Hyena-GLT documentation: docs/",
    "Example scripts: examples/",
    "API reference: docs/API.md",
    "Performance optimization guide: docs/OPTIMIZATION.md"
]

for resource in resources:
    print(f"  • {resource}")

print()
print("Happy tokenizing! 🧬🔤")