**The Hyena Architecture: A Transformer Without Attention**

The Transformer architecture has been one of the most influential innovations in deep learning. However, the cost of its self-attention mechanism scales quadratically, O(n^2), with sequence length n, which becomes a major bottleneck for long inputs. The Hyena Hierarchy architecture, introduced in the 2023 paper "Hyena Hierarchy: Towards Larger Convolutional Language Models", presents a compelling alternative.

Hyena replaces the attention mechanism with a subquadratic operator based on long convolutions. The paper reports quality comparable to attention-based models on language modeling, with significant speed and memory advantages on long sequences.

The core ideas are:

1) Long Convolutions: A filter that extends across the entire sequence to model long-range dependencies.

2) Implicitly Parameterized Filters: Instead of learning a giant filter directly, a smaller neural network generates its weights on-the-fly.

3) Data-Controlled Gating: An input-dependent mechanism modulates the convolution's output, mimicking attention's dynamic nature.

---------------------------------------------------------------------------

**Part 1: Hyena from Scratch with NumPy**

Building a model from scratch without a framework like PyTorch means we handle every operation ourselves. This approach is an excellent way to understand the raw mechanics of the forward pass. Note that this part covers the forward pass only: a full training implementation would also require manually coding the backward pass (backpropagation) for each function.
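To give a sense of what "manually coding the backward pass" involves, here is a minimal sketch for a single linear layer, checked against a finite-difference estimate. The function names `linear_forward` and `linear_backward` and the toy squared-output loss are illustrative assumptions, not part of the Hyena code in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_forward(x, W, b):
    """y = x @ W + b, with x of shape (N, in) and W of shape (in, out)."""
    return x @ W + b

def linear_backward(x, W, grad_y):
    """Gradients of a scalar loss w.r.t. x, W and b, given dLoss/dy."""
    grad_x = grad_y @ W.T        # (N, in)
    grad_W = x.T @ grad_y        # (in, out)
    grad_b = grad_y.sum(axis=0)  # (out,)
    return grad_x, grad_W, grad_b

x = rng.standard_normal((5, 3))
W = rng.standard_normal((3, 2))
b = np.zeros(2)

# Toy loss (an assumption for illustration): sum of squared outputs,
# so dLoss/dy = 2 * y.
y = linear_forward(x, W, b)
grad_x, grad_W, grad_b = linear_backward(x, W, 2.0 * y)

# Central finite-difference check on a single weight entry.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (np.sum(linear_forward(x, W_plus, b) ** 2)
           - np.sum(linear_forward(x, W_minus, b) ** 2)) / (2 * eps)

print("analytic:", grad_W[0, 1], "numeric:", numeric)
```

Every layer in the model (GELU, LayerNorm, the FFT convolution) would need an analogous hand-written gradient, which is the work a framework's automatic differentiation removes.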



*A) Basic Building Blocks (NumPy)*

*First, we need our own versions of standard neural network layers using NumPy.*

In [1]:
import numpy as np

class Linear:
    """A fully connected layer."""
    def __init__(self, in_features, out_features):
        # Initialize weights with a common scheme (He initialization)
        self.weights = np.random.randn(in_features, out_features) * np.sqrt(2. / in_features)
        self.biases = np.zeros(out_features)

    def forward(self, x):
        # Standard matrix multiplication: X @ W + b
        return x @ self.weights + self.biases

class GELU:
    """Gaussian Error Linear Unit activation function."""
    def forward(self, x):
        # Tanh approximation of GELU (PyTorch's nn.GELU defaults to the exact erf form)
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class LayerNorm:
    """Layer normalization."""
    def __init__(self, d_model, eps=1e-5):
        self.d_model = d_model
        self.eps = eps
        self.gamma = np.ones(d_model) # Learnable scale
        self.beta = np.zeros(d_model)  # Learnable shift

    def forward(self, x):
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        x_norm = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

class Conv1d:
    """A simple 1D depth-wise convolution."""
    def __init__(self, d_model, kernel_size, padding=1):
        # For a depth-wise conv, each input channel has its own filter
        self.kernel = np.random.randn(d_model, 1, kernel_size)
        self.padding = padding

    def forward(self, x):
        # x shape: (B, L, D) -> transpose to (B, D, L) for convolution
        x = x.transpose(0, 2, 1)
        B, D, L = x.shape

        # Apply padding
        x_padded = np.pad(x, ((0, 0), (0, 0), (self.padding, self.padding)), 'constant')

        # Output keeps length L, which assumes 2 * padding == kernel_size - 1
        output = np.zeros_like(x)
        # Manually perform the convolution for each channel
        for d in range(D):
            for l in range(L):
                output[:, d, l] = np.sum(x_padded[:, d, l:l+self.kernel.shape[2]] * self.kernel[d, 0, :], axis=1)

        return output.transpose(0, 2, 1) # Transpose back to (B, L, D)

*What is this?*

This block defines the fundamental layers of a neural network (Linear, GELU, LayerNorm, Conv1d) from scratch using only NumPy. Each class initializes its own learnable parameters (like weights and biases) and implements a forward method to perform its specific mathematical operation on input data.

*Why do we do this?*


To build a neural network without a framework like PyTorch, we must first create our own tools. This code demystifies what a "layer" actually is: a self-contained object that holds state (parameters) and performs a specific transformation (e.g., matrix multiplication). This is the foundation upon which the more complex Hyena architecture is built, providing a clear view of the underlying mathematics.
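The nested Python loops in `Conv1d` are easy to read but slow. As a hedged sketch of one possible alternative, the same depth-wise convolution can be expressed with NumPy's `sliding_window_view` and `einsum`. The two helper functions below are hypothetical names introduced for this comparison; the loop version restates the logic of the `Conv1d.forward` method above so the snippet runs on its own.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def depthwise_conv1d_loop(x, kernel, padding):
    """Reference: the same nested loops as Conv1d. x: (B, L, D), kernel: (D, 1, K)."""
    x = x.transpose(0, 2, 1)                      # (B, D, L)
    B, D, L = x.shape
    K = kernel.shape[2]
    x_padded = np.pad(x, ((0, 0), (0, 0), (padding, padding)), "constant")
    out = np.zeros_like(x)
    for d in range(D):
        for l in range(L):
            out[:, d, l] = np.sum(x_padded[:, d, l:l + K] * kernel[d, 0, :], axis=1)
    return out.transpose(0, 2, 1)                 # (B, L, D)

def depthwise_conv1d_vectorized(x, kernel, padding):
    """Same result without Python loops, using a strided window view."""
    x = x.transpose(0, 2, 1)                      # (B, D, L)
    x_padded = np.pad(x, ((0, 0), (0, 0), (padding, padding)), "constant")
    # windows[b, d, l, :] is the length-K slice starting at position l
    windows = sliding_window_view(x_padded, kernel.shape[2], axis=-1)
    out = np.einsum("bdlk,dk->bdl", windows, kernel[:, 0, :])
    return out.transpose(0, 2, 1)                 # (B, L, D)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 10, 4))               # (B, L, D)
kernel = rng.standard_normal((4, 1, 3))           # (D, 1, K)

y_loop = depthwise_conv1d_loop(x, kernel, padding=1)
y_vec = depthwise_conv1d_vectorized(x, kernel, padding=1)
print(np.allclose(y_loop, y_vec))
```

Both versions compute a cross-correlation (no kernel flip), which matches the convention of `nn.Conv1d` used later in the PyTorch part.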

--------------------------------------------------------------

*B) The Core Hyena Operator (NumPy)*

*Now for the main event. We'll implement the HyenaOperator using our NumPy building blocks and NumPy's FFT library.*

In [2]:
class PositionalEmbedding:
    """Generates learnable positional embeddings for the filter."""
    def __init__(self, emb_dim, seq_len):
        self.embedding = np.random.randn(emb_dim, seq_len)

    def forward(self, L):
        return self.embedding[:, :L]

class HyenaOperator:
    """The Hyena operator from scratch in NumPy."""
    def __init__(self, d_model, max_seq_len, filter_order=64):
        self.d_model = d_model
        self.max_seq_len = max_seq_len

        # Positional embeddings for the filter
        self.filter_pos_emb = PositionalEmbedding(filter_order, max_seq_len)

        # Small FFN to generate the filter
        self.filter_net = [
            Linear(filter_order, d_model),
            GELU(),
            Linear(d_model, d_model)
        ]

        # Input/output projections and short convolution
        self.in_proj = Linear(d_model, 2 * d_model)
        self.out_proj = Linear(d_model, d_model)
        self.short_conv = Conv1d(d_model, kernel_size=3, padding=1)

    def forward(self, x):
        B, L, D = x.shape

        # --- 1. Generate the long convolutional filter ---
        pos_emb = self.filter_pos_emb.forward(L).T # Shape (L, filter_order)

        # Pass through the filter network
        k = pos_emb
        for layer in self.filter_net:
            k = layer.forward(k) # k shape: (L, D)

        # --- 2. Project the input ---
        projected_x = self.in_proj.forward(x)
        v, gate_res = np.split(projected_x, 2, axis=-1) # Each is (B, L, D)

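        # --- 3. FFT Convolution ---
        # Zero-pad to 2L so the circular FFT convolution does not wrap around;
        # keeping the first L outputs then gives a causal linear convolution.
        fft_len = 2 * L

        # Apply FFT. rfft transforms the last axis by default, so we pass
        # `axis` explicitly to transform along the sequence dimension.
        v_fft = np.fft.rfft(v, n=fft_len, axis=1)   # (B, L + 1, D)
        k_fft = np.fft.rfft(k, n=fft_len, axis=0)   # (L + 1, D)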

        # Multiply in frequency domain. Add a batch dimension to k_fft.
        y_fft = v_fft * k_fft[np.newaxis, :, :]

        # Apply inverse FFT and crop
        y = np.fft.irfft(y_fft, n=fft_len, axis=1)[:, :L, :] # Crop L dim

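        # --- 4. Apply Gating Mechanism ---
        # Note: with symmetric padding the short conv also reads position l + 1,
        # so unlike the FFT branch it is not causal.
        short_conv_out = self.short_conv.forward(x)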

        y = y * gate_res
        y = y + short_conv_out * (1 - gate_res)

        output = self.out_proj.forward(y)
        return output

*What is this?*

This is the heart of the Hyena model. The HyenaOperator defines the entire process of replacing self-attention. It uses a PositionalEmbedding and a small feed-forward network (filter_net) to generate a long convolutional filter (k) on-the-fly. It then uses the Fast Fourier Transform (FFT) to efficiently convolve this filter with the input sequence (v). Finally, it combines this result with a short-range convolution using a gating mechanism.

*Why do we do this?*

This entire operator is designed to solve the quadratic complexity problem of self-attention.

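- Implicit Filter Generation: Instead of learning a massive filter of size (seq_len, d_model) directly, we learn a small network that generates it from positional features. In this simplified version the positional embedding is itself a learned (filter_order, max_seq_len) table, so the parameter saving only appears when filter_order is well below d_model.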

- FFT Convolution: Convolving in the time domain is slow (O(L^2)). By transforming the signals to the frequency domain, convolution becomes a simple element-wise multiplication, and the transforms themselves cost only O(L log L). This is the key to Hyena's efficiency.

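- Gating: The gating mechanism lets the operator weight, for each token and channel, long-range patterns (from the FFT convolution) against local patterns (from the short convolution), making it input-dependent like attention. Note that gate_res here is a raw linear projection that is not squashed to [0, 1], so the mix is not a strict convex interpolation.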
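The FFT claim can be checked numerically. The sketch below compares a direct O(L^2) causal convolution with the zero-padded FFT version used in the operator, and also confirms that changing later inputs leaves earlier outputs untouched. The function names and the small sizes are assumptions chosen for illustration.

```python
import numpy as np

def causal_conv_direct(v, k):
    """O(L^2) reference: y[t] = sum_{s <= t} k[s] * v[t - s], per channel.

    v and k both have shape (L, D).
    """
    L, D = v.shape
    y = np.zeros((L, D))
    for t in range(L):
        for s in range(t + 1):
            y[t] += k[s] * v[t - s]
    return y

def causal_conv_fft(v, k):
    """O(L log L) version: pad to 2L, multiply spectra, keep the first L outputs."""
    L = v.shape[0]
    n = 2 * L
    v_fft = np.fft.rfft(v, n=n, axis=0)
    k_fft = np.fft.rfft(k, n=n, axis=0)
    return np.fft.irfft(v_fft * k_fft, n=n, axis=0)[:L]

rng = np.random.default_rng(0)
L, D = 32, 4
v = rng.standard_normal((L, D))
k = rng.standard_normal((L, D))

y_direct = causal_conv_direct(v, k)
y_fft = causal_conv_fft(v, k)
print("max abs difference:", np.max(np.abs(y_direct - y_fft)))

# Causality check: perturb the inputs from position 20 onward and
# compare the outputs before position 20.
v_perturbed = v.copy()
v_perturbed[20:] += 1.0
y_perturbed = causal_conv_fft(v_perturbed, k)
print("early outputs unchanged:", np.allclose(y_perturbed[:20], y_fft[:20]))
```

The padding to 2L matters: the linear convolution of two length-L signals has length 2L - 1, so a shorter FFT would wrap the tail around and mix late positions into early ones.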

-------------------------------------------------------------------------

*C) Assembling the Full Model (NumPy)*

*Finally, we stack the HyenaLayers to create the full language model.*

In [3]:
class HyenaLayer:
    """A single Hyena layer with normalization and FFN."""
    def __init__(self, d_model, max_seq_len):
        self.hyena = HyenaOperator(d_model, max_seq_len)
        self.ffn = [
            Linear(d_model, 4 * d_model),
            GELU(),
            Linear(4 * d_model, d_model)
        ]
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)

    def forward(self, x):
        x = x + self.hyena.forward(self.norm1.forward(x))

        ffn_out = self.norm2.forward(x)
        for layer in self.ffn:
            ffn_out = layer.forward(ffn_out)

        x = x + ffn_out
        return x

class HyenaLanguageModel:
    """The full language model built with NumPy."""
    def __init__(self, vocab_size, d_model, n_layers, max_seq_len):
        self.token_embedding_table = np.random.randn(vocab_size, d_model)
        self.pos_embedding_table = np.random.randn(max_seq_len, d_model)
        self.layers = [HyenaLayer(d_model, max_seq_len) for _ in range(n_layers)]
        self.output_head = Linear(d_model, vocab_size)

    def forward(self, idx):
        B, L = idx.shape
        tok_emb = self.token_embedding_table[idx] # (B, L, D)
        pos_emb = self.pos_embedding_table[:L, :]   # (L, D)
        x = tok_emb + pos_emb

        for layer in self.layers:
            x = layer.forward(x)

        logits = self.output_head.forward(x)
        return logits

# --- Example Usage (NumPy) ---
VOCAB_SIZE_NP = 1000
D_MODEL_NP = 64
N_LAYERS_NP = 2
MAX_SEQ_LEN_NP = 256

model_np = HyenaLanguageModel(VOCAB_SIZE_NP, D_MODEL_NP, N_LAYERS_NP, MAX_SEQ_LEN_NP)
input_tokens_np = np.random.randint(0, VOCAB_SIZE_NP, size=(4, 128))
output_logits_np = model_np.forward(input_tokens_np)

print("--- NumPy Implementation ---")
print(f"Input shape: {input_tokens_np.shape}")
print(f"Output logits shape: {output_logits_np.shape}")

--- NumPy Implementation ---
Input shape: (4, 128)
Output logits shape: (4, 128, 1000)


*What is this?*

This block assembles the HyenaOperator into a complete, deep language model. The HyenaLayer follows the standard Transformer block structure: it combines the main operator (HyenaOperator) with a simple feed-forward network (ffn), using residual connections and layer normalization around each. The HyenaLanguageModel class then takes token IDs as input, converts them to embeddings, and stacks multiple HyenaLayers to produce the final output logits.

*Why do we do this?*

A single layer has limited capacity, so depth is crucial for a model's performance. By stacking layers, the model builds a hierarchy of representations: early layers might capture simple local features (like word pairings), while deeper layers can combine these into more abstract concepts such as sentence structure, context, and semantics. The residual connections are vital for training deep networks: they give gradients a direct path through every layer, which mitigates the vanishing gradient problem.

----------------------------------------------------------------

**Part 2: Hyena with PyTorch**

Using a framework like PyTorch abstracts away the manual gradient calculations and provides optimized, pre-built layers. This makes building, training, and deploying models vastly more practical. This code is fully functional and trainable.

*The Hyena Operator and Model (PyTorch)*

*Here we define all the necessary components using PyTorch's nn.Module.*

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.fft import rfft, irfft

class PositionalEmbedding(nn.Module):
    def __init__(self, emb_dim: int, seq_len: int):
        super().__init__()
        self.embedding = nn.Parameter(torch.randn(emb_dim, seq_len))

    def forward(self, L: int):
        return self.embedding[:, :L]

class HyenaOperator(nn.Module):
    def __init__(self, d_model: int, max_seq_len: int, filter_order: int = 64):
        super().__init__()
        self.d_model = d_model
        self.max_seq_len = max_seq_len
        self.filter_pos_emb = PositionalEmbedding(filter_order, max_seq_len)

        self.filter_net = nn.Sequential(
            nn.Linear(filter_order, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model)
        )

        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

        self.short_conv = nn.Conv1d(
            in_channels=d_model,
            out_channels=d_model,
            kernel_size=3,
            padding=1,
            groups=d_model
        )

    def forward(self, x):
        B, L, D = x.shape

        # --- 1. Generate the long convolutional filter ---
        pos_emb = self.filter_pos_emb(L)
        k = self.filter_net(pos_emb.transpose(0, 1)) # Shape: (L, D)

        # --- 2. Project the input sequence ---
        v, gate_res = self.in_proj(x).split(self.d_model, dim=-1)

        # --- 3. Perform FFT-based convolution ---
        fft_len = 2 * L
        v_fft = rfft(v.transpose(1, 2), n=fft_len)
        k_fft = rfft(k.transpose(0, 1).unsqueeze(0), n=fft_len)
        y_fft = v_fft * k_fft
        y = irfft(y_fft, n=fft_len)[:, :, :L].transpose(1, 2)

        # --- 4. Apply the gating mechanism ---
        short_conv_out = self.short_conv(x.transpose(1, 2)).transpose(1, 2)
        y = y * gate_res
        y = y + short_conv_out * (1 - gate_res)

        output = self.out_proj(y)
        return output

class HyenaLayer(nn.Module):
    def __init__(self, d_model: int, max_seq_len: int):
        super().__init__()
        self.hyena = HyenaOperator(d_model, max_seq_len)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.hyena(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

class HyenaLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, n_layers: int, max_seq_len: int):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = nn.Parameter(torch.randn(1, max_seq_len, d_model))
        self.layers = nn.ModuleList([HyenaLayer(d_model, max_seq_len) for _ in range(n_layers)])
        self.output_head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        B, L = x.shape
        x = self.token_embedding(x)
        x = x + self.pos_embedding[:, :L, :]

        for layer in self.layers:
            x = layer(x)

        logits = self.output_head(x)
        return logits

# --- Example Usage (PyTorch) ---
VOCAB_SIZE_PT = 1000
D_MODEL_PT = 64
N_LAYERS_PT = 2
MAX_SEQ_LEN_PT = 256

model_pt = HyenaLanguageModel(VOCAB_SIZE_PT, D_MODEL_PT, N_LAYERS_PT, MAX_SEQ_LEN_PT)
input_tokens_pt = torch.randint(0, VOCAB_SIZE_PT, (4, 128))
output_logits_pt = model_pt(input_tokens_pt)

print("\n" + "--- PyTorch Implementation ---")
print(f"Model created with {sum(p.numel() for p in model_pt.parameters()):,} parameters.")
print(f"Input shape: {input_tokens_pt.shape}")
print(f"Output logits shape: {output_logits_pt.shape}")


--- PyTorch Implementation ---
Model created with 286,952 parameters.
Input shape: torch.Size([4, 128])
Output logits shape: torch.Size([4, 128, 1000])


*What is this?*

This is the practical, trainable implementation of the complete Hyena architecture using PyTorch. It mirrors the structure of the NumPy version but uses PyTorch's nn.Module as the base for all classes. This provides several key features for free:

- Automatic Parameter Tracking: Any nn.Parameter or submodule is automatically registered.

- Optimized Layers: nn.Linear, nn.Conv1d, etc., are highly optimized C++ or CUDA kernels.

- Automatic Differentiation: PyTorch's autograd engine automatically tracks all operations in the forward pass to compute gradients for the backward pass, which is essential for training.
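As a sanity check on the automatic parameter tracking, the reported total can be reproduced by hand from the layer shapes defined above. This is a plain-arithmetic sketch; the helper `linear_params` is a hypothetical name, and the hyperparameters are the ones from the example usage cell.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

vocab_size, d_model, n_layers, max_seq_len = 1000, 64, 2, 256
filter_order, kernel_size = 64, 3

operator = (
    filter_order * max_seq_len                 # filter positional embedding
    + linear_params(filter_order, d_model)     # filter_net, first Linear
    + linear_params(d_model, d_model)          # filter_net, second Linear
    + linear_params(d_model, 2 * d_model)      # in_proj
    + linear_params(d_model, d_model)          # out_proj
    + d_model * kernel_size + d_model          # depth-wise short_conv weights + bias
)
ffn = linear_params(d_model, 4 * d_model) + linear_params(4 * d_model, d_model)
norms = 2 * (2 * d_model)                      # two LayerNorms, gamma and beta each
layer = operator + ffn + norms

total = (
    vocab_size * d_model                       # token embedding
    + max_seq_len * d_model                    # positional embedding
    + n_layers * layer
    + linear_params(d_model, vocab_size)       # output head
)
print(f"{total:,}")  # 286,952
```

The breakdown also shows where the parameters sit at this small scale: the embedding table and output head together account for nearly half of the total.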

*Why do we do this?*

While the NumPy version is invaluable for understanding the mechanics, it's not practical for real-world use. The PyTorch version is for actually building and training models. It abstracts away the complex, error-prone process of manual backpropagation and provides a robust, high-performance toolkit. This allows researchers and engineers to focus on designing the model architecture (the "what") without having to reinvent the underlying machinery of training and optimization (the "how"). It also makes it trivial to run the model on a GPU for massive speedups.
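To illustrate what autograd buys in practice, here is a minimal sketch of a single next-token training step. To keep the snippet runnable on its own, it uses a hypothetical stand-in model (an embedding followed by a linear head) rather than the Hyena model; the optimizer choice, learning rate, and random data are likewise assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, d_model = 1000, 64

# Hypothetical stand-in so the snippet runs on its own. Any module that maps
# (B, L) token ids to (B, L, vocab_size) logits could be swapped in here.
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Random tokens stand in for a real dataset; targets are inputs shifted by one.
tokens = torch.randint(0, vocab_size, (4, 129))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)                                    # (4, 128, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

optimizer.zero_grad()
loss.backward()    # autograd computes every gradient; no manual backward pass
optimizer.step()

print(f"loss: {loss.item():.3f}")
```

One caveat if the Hyena model from this article is substituted for the stand-in: next-token prediction assumes the model cannot see future positions, and the short convolution above uses symmetric padding, so it would need to be made causal first.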

--------------------------------------------------------------------------

**Part 3: Reference**

The concepts and architecture are based on the original research paper. For a deeper dive, it's a highly recommended read.

Paper Title: "Hyena Hierarchy: Towards Larger Convolutional Language Models"

arXiv Link: https://arxiv.org/abs/2302.10866