# <strong> MAMBA </strong>

The paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" introduces Mamba, a novel sequence modeling architecture designed to address the computational inefficiencies of Transformers, especially when handling long sequences.

* Traditional Transformers, which underpin many deep learning applications, struggle with long sequences because self-attention has quadratic computational complexity in sequence length and a finite context window. While various subquadratic-time architectures have been proposed, they often underperform on important modalities such as language. The authors identify a key weakness of these models: their inability to perform content-based reasoning. The paper reports that Mamba achieves about 5x higher inference throughput than Transformers.

* Mamba's computation grows linearly with sequence length, whereas Transformer self-attention grows quadratically.

* Unlike LSTMs, where each step must wait for the output of the previous step, Mamba's recurrence is linear in the hidden state, so it can be computed with a parallel scan and the steps do not have to wait for each other during training.

> note: <strong> Mamba blends ideas from LSTMs (gated recurrence) and attention (content-based selection), and it builds on state space models, an older sequence modeling approach that had largely fallen out of use. </strong>

* Mamba: Linear-Time Sequence Modeling with Selective State Spaces
* https://arxiv.org/pdf/2312.00752
* VMamba: Visual State Space Model
* https://arxiv.org/abs/2401.10166

<img src="src/MAMBA.png">

# Mamba: Linear-Time Sequence Modeling with Selective State Spaces

## **Core Ideas**

### **Linear State Space Models (SSMs):**
- Mamba builds on the foundation of linear state space models, which efficiently model sequences and capture long-term dependencies.
- **SSM Formula:**
  ```math
  h_t = A h_{t-1} + B x_t, \quad y_t = C h_t + D x_t
  ```
  Where:
  - \(A, B, C, D\) are learned parameters.
  - \(x_t\): Input at time \(t\).
  - \(h_t\): Hidden state.
  - \(y_t\): Output at time \(t\).
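
The recurrence above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes a scalar input and output, an initial state of zero, and hypothetical values for \(A, B, C, D\) (a diagonal \(A\) with decay rates 0.5 and 0.9).

```python
def ssm_scan(A, B, C, D, xs):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t + D x_t over a scalar input sequence."""
    n = len(B)
    h = [0.0] * n  # assumption: the state starts at zero
    ys = []
    for x in xs:
        # State update: matrix-vector product A h plus the input contribution B x
        h = [sum(A[i][j] * h[j] for j in range(n)) + B[i] * x for i in range(n)]
        # Readout: C h plus the direct skip term D x
        ys.append(sum(C[i] * h[i] for i in range(n)) + D * x)
    return ys


# Hypothetical 2-state system driven by a unit impulse
A = [[0.5, 0.0], [0.0, 0.9]]
B = [1.0, 1.0]
C = [1.0, 1.0]
D = 0.0
ys = ssm_scan(A, B, C, D, [1.0, 0.0, 0.0])  # approximately [2.0, 1.4, 1.06]
```

Because \(A, B, C, D\) are the same at every step here, the response to the impulse simply decays; nothing in the model can react to what the input contains.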

### **Selective State Spaces (SSMs with Input-Dependent Parameters):**
- In Mamba, **parameters like \(A, B, C\)** are **functions of the input \(x_t\)**:
  ```math
  A_t = f_A(x_t), \quad B_t = f_B(x_t), \quad C_t = f_C(x_t)
  ```
- This dynamic parameterization conditions the state propagation on the input, enabling efficient handling of discrete modalities like language.
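
The formula above is a simplification. In the paper, the transition is not predicted directly: \(B_t\), \(C_t\), and a step size \(\Delta_t\) are computed from the input, while \(A\) stays a learned parameter that becomes input-dependent only through discretization. A sketch of the zero-order-hold form, where \(s_\Delta\) denotes a learned projection and \(p\) a learned bias (this notation is an assumption of these notes):

```math
\Delta_t = \mathrm{softplus}\big(p + s_\Delta(x_t)\big), \quad \bar{A}_t = \exp(\Delta_t A), \quad \bar{B}_t = (\Delta_t A)^{-1}\big(\exp(\Delta_t A) - I\big)\,\Delta_t B_t
```

```math
h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \quad y_t = C_t h_t + D x_t
```

A large \(\Delta_t\) pushes \(\bar{A}_t\) toward zero, so the state is overwritten by the current input; a small \(\Delta_t\) keeps \(\bar{A}_t\) near the identity, so the input is mostly ignored.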

### **Linear Time Complexity:**
- Mamba achieves linear complexity \(O(T)\) with sequence length \(T\) by leveraging a **parallel recurrent algorithm**, avoiding attention mechanisms.

---

## **How Mamba Works**

### **Dynamic Parameterization:**
- For each time step \(t\):
  ```math
  A_t = f_A(x_t), \quad B_t = f_B(x_t), \quad C_t = f_C(x_t)
  ```
- Lightweight feedforward networks generate these parameters dynamically.

### **Selective Information Propagation:**
- Mamba selectively **propagates** or **resets** the state \(h_t\), inspired by gating structures like in LSTMs.
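
The propagate-or-reset behaviour can be illustrated with a scalar gate. This is a toy sketch under stated assumptions, not Mamba's actual parameterization: the gate is a sigmoid of the input, and the weights `w_a` and `b_a` are hypothetical values chosen so that inputs near zero are ignored and large inputs overwrite the state.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def selective_scan(xs, w_a, b_a):
    """h_t = a_t * h_{t-1} + (1 - a_t) * x_t with a_t = sigmoid(w_a * x_t + b_a)."""
    h = 0.0
    hs = []
    for x in xs:
        a = sigmoid(w_a * x + b_a)  # input-dependent retention gate in (0, 1)
        h = a * h + (1.0 - a) * x   # a near 1 keeps the old state, a near 0 overwrites it
        hs.append(h)
    return hs


# Hypothetical weights: x near 0 gives a near 1 (propagate), x >= 1 gives a near 0 (reset)
hs = selective_scan([1.0, 0.0, 0.0, 2.0, 0.0], w_a=-10.0, b_a=5.0)
```

In this run the state stays close to 1 while the zeros pass by, then jumps to roughly 2 when the fourth input arrives, which is the kind of content-dependent memory a fixed \(A\) cannot express.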

### **Efficient Recurrence:**
- **Parallel Algorithm**: Processes sequences in block-parallel fashion for high throughput, avoiding sequential bottlenecks of RNNs.
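
The reason a recurrence can be parallelized at all is that each step \(h \mapsto a_t h + b_t\) is an affine map, and composing affine maps is associative. The sketch below checks this with a Hillis-Steele style scan in plain Python. It is an illustration only: the real Mamba kernel is a fused GPU scan, and the scalar `a` and `b` sequences here are hypothetical.

```python
def combine(left, right):
    """Compose two steps h -> a*h + b, applying `left` first and `right` second."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b):
    """Reference: run the recurrence one step at a time from h_0 = 0."""
    h, hs = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        hs.append(h)
    return hs


def parallel_style_scan(a, b):
    """Inclusive scan in about log2(T) rounds; updates within a round are independent."""
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        # Position i merges with position i - step. On parallel hardware every
        # position could be updated at once; here a list comprehension stands in.
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # With h_0 = 0 the state equals the accumulated offset of the composed map
    return [b_acc for _, b_acc in elems]


a = [0.5, 0.9, 0.1, 0.7, 0.3]
b = [1.0, 2.0, -1.0, 0.5, 4.0]
seq = sequential_scan(a, b)
par = parallel_style_scan(a, b)
```

Both functions return the same states up to floating-point rounding. This variant does more total work than the sequential loop (on the order of \(T \log T\)); work-efficient scans bring that back to \(O(T)\) while keeping the parallel depth logarithmic.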

### **Output Computation:**
- Output combines historical context and immediate input:
  ```math
  y_t = C_t h_t + D x_t
  ```

---

## **Key Optimizations in Mamba**

1. **No Attention or MLP Blocks:**
   - Simplified architecture eliminates these costly components.

2. **Parallel Algorithm:**
   - Balances recurrence with hardware-friendly parallelization.

3. **Input-Specific Computations:**
   - Dynamically modulates state propagation for better adaptability.

---

## **Performance Advantages**

1. **Efficiency:**
   - Linear scaling with \(T\), handling sequences up to 1M tokens.
   - 5x inference throughput compared to Transformers.

2. **Accuracy:**
   - Matches or outperforms Transformers in language modeling.
   - Excels across several modalities beyond text (e.g., audio, genomics).

3. **Parameter Efficiency:**
   - Similar performance with fewer parameters than Transformers.

---

## **Why Mamba Works Well**

1. **Selective State Spaces:**
   - Focuses on relevant sequence parts, ignoring irrelevant ones.

2. **Long-Term Dependencies:**
   - Recurrence inherently captures these without explicit positional encodings.

3. **Hardware Awareness:**
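   - Uses a hardware-aware scan designed for modern GPUs, keeping the expanded state in fast on-chip memory instead of writing it out to main GPU memory.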

---

## **Applications of Mamba**

- **Language Modeling:**
  - Outperforms Transformers of equivalent size and matches models twice its size.
- **Audio and Time-Series Processing:**
  - Handles long-range dependencies in time-series data efficiently.
- **Genomics:**
  - Processes long genetic sequences beyond Transformer context limits.

---

## **Mathematical Differences with Transformers**

| Feature                | Transformers                | Mamba                     |
|------------------------|-----------------------------|---------------------------|
| Complexity             | \(O(T^2)\)                 | \(O(T)\)                  |
| Long Dependencies      | Limited (context size)      | Efficient (state-based)   |
| Parameter Sharing      | Fixed across tokens         | Input-conditioned         |
| Mechanism              | Attention                  | Selective State Spaces    |
| Modality Adaptability  | High (for language)         | High (multimodal)         |
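
To make the complexity row concrete, the following sketch counts the dominant operations as the sequence grows. The function names are hypothetical and the counts ignore constant factors such as model width and state size.

```python
def attention_pairs(T):
    """Query-key pairs scored by full self-attention over T tokens."""
    return T * T


def recurrent_updates(T):
    """State updates performed by a recurrence over T tokens."""
    return T


# Doubling the sequence length quadruples attention work but only doubles recurrent work
attention_growth = attention_pairs(2000) / attention_pairs(1000)
recurrent_growth = recurrent_updates(2000) / recurrent_updates(1000)
```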

---

## **Summary**
Mamba combines:
- **State Space Models** for efficient sequence modeling.
- **Selective Mechanisms** for content-based reasoning.
- **Parallel Algorithms** for practical scalability.

It offers a **streamlined, efficient, and scalable alternative** to Transformers, excelling across multiple modalities with state-of-the-art performance.

---


<img src="src/MAMBAModeularImplementation.png">

# Using Mamba in Neural Networks

Mamba is designed to be highly modular and can be integrated into neural network architectures in a flexible way. Its linear-time state space mechanism makes it ideal for tasks requiring efficient sequence modeling across various modalities. Here's how Mamba can be used in neural network architectures, along with guidelines for modularity and block design.

---

## **Using Mamba in Neural Networks**
Mamba functions as a **sequence modeling block** that can replace traditional Transformer layers or recurrent units (like LSTMs or GRUs). It operates as a standalone component that processes sequential inputs and can be stacked or combined with other modules.

### **Key Roles in Architectures**
1. **As a Backbone:**
   - Mamba can serve as the main sequence processor in architectures for tasks like language modeling, audio processing, or time-series prediction.
   - Example: Replace Transformer layers in GPT-like models with Mamba layers.

2. **As a Hybrid Component:**
   - Mamba can be combined with Transformer blocks, CNNs, or other architectures to handle specific aspects of a task, such as long-range dependencies or efficient feature extraction.

3. **For Multimodal Tasks:**
   - Mamba's selective state spaces adapt well to multiple data types (e.g., text, audio, genomics), making it suitable for architectures handling diverse inputs.

---

## **Modularity in Mamba**
Mamba is most useful when packaged as a self-contained block that integrates with existing neural network layers.

### **Mamba Block Design**
Each **Mamba block** consists of the following components:

1. **Input Encoding Layer:**
   - Converts raw inputs into a suitable representation.
   - Can be an embedding layer (for text) or a convolutional layer (for audio).

   ```math
   z_t = \text{Embedding}(x_t) \quad \text{or} \quad z_t = \text{Conv}(x_t)
   ```

2. **Dynamic Parameter Generation:**
   - Lightweight feedforward layers compute the dynamic parameters \(A_t, B_t, C_t, D_t\) for each time step \(t\) based on the input.
   - The functions \(f_A, f_B, f_C, f_D\) are shared across time steps; only the parameter values they produce differ from token to token.

   ```math
   A_t = f_A(z_t), \quad B_t = f_B(z_t), \quad C_t = f_C(z_t), \quad D_t = f_D(z_t)
   ```

3. **State Update Module (Core SSM):**
   - The core selective state space computation is performed here. It updates the hidden state \(h_t\) and computes the output \(y_t\) for each time step.
   - Parallelized recurrence ensures efficiency.

   ```math
   h_t = A_t h_{t-1} + B_t z_t, \quad y_t = C_t h_t + D_t z_t
   ```

4. **Residual Connections (Optional):**
   - Residual connections can be added to stabilize training, similar to Transformer blocks.

   ```math
   y_t \leftarrow y_t + z_t
   ```

5. **Normalization and Output Layer:**
   - Apply layer normalization and pass the output to the next Mamba block or the task-specific head.

   ```math
   y_t = \text{LayerNorm}(y_t)
   ```

---

## **Stacking Mamba Blocks**
- Multiple Mamba blocks can be stacked to handle complex dependencies and hierarchical representations.
- Blocks run one after another in depth. Within a block, the whole input sequence is processed at once, with the recurrence over time steps handled internally for efficient state updates.

### **Single Mamba Block**
```
[Input] → [Dynamic Parameter Generation] → [State Update] → [Residual + Normalization] → [Output]
```

### **Stacked Mamba Architecture**
```
[Input] → [Embedding/Encoding] → [Mamba Block 1] → [Mamba Block 2] → ... → [Task Head]
```
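
A minimal sketch of the stacking pattern, using plain Python lists in place of tensors. `make_block` is a hypothetical stand-in for a Mamba block (a scalar recurrence with a fixed decay), used only to show how a sequence flows through the stack with residual connections.

```python
def make_block(decay):
    """Hypothetical stand-in for a Mamba block: a scalar recurrence with a fixed decay."""
    def block(xs):
        h, ys = 0.0, []
        for x in xs:
            h = decay * h + (1.0 - decay) * x
            ys.append(h)
        return ys
    return block


def run_stack(blocks, xs):
    """Feed the sequence through each block in turn, adding a residual connection."""
    for block in blocks:
        ys = block(xs)
        xs = [x + y for x, y in zip(xs, ys)]  # residual: block output added to its input
    return xs


blocks = [make_block(0.5), make_block(0.5)]
out = run_stack(blocks, [1.0, 0.0])  # sequence length is preserved through the stack
```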

---

## **Hybrid Architectures with Mamba**
Mamba can complement other architectures, such as Transformers or convolutional networks:

### **Hybrid Transformer-Mamba Architecture**
- Use Mamba for handling long-range dependencies and Transformers for local interactions.
- Alternate Mamba and Transformer blocks in a stacked architecture.

```
[Input] → [Embedding] → [Mamba Block] → [Transformer Block] → [Mamba Block] → [Task Head]
```
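
One simple way to describe such an alternating stack is a layer schedule. The helper below is a hypothetical sketch: the function name, the block labels, and the `attention_every` parameter are illustrative choices, not taken from the paper.

```python
def hybrid_layer_schedule(num_layers, attention_every):
    """List block types for a stack with a Transformer block every `attention_every` layers."""
    return [
        "transformer" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(num_layers)
    ]


schedule = hybrid_layer_schedule(num_layers=3, attention_every=2)
# schedule matches the diagram above: mamba, transformer, mamba
```

Raising `attention_every` makes the stack mostly Mamba with occasional attention, which trades some attention capacity for lower cost on long sequences.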

### **Convolution + Mamba for Audio/Time-Series**
- Combine convolutional layers for local feature extraction with Mamba for temporal modeling.

```
[Input] → [Conv Layers] → [Mamba Blocks] → [Task Head]
```

### **Multimodal Architecture**
- Use Mamba for processing sequential data (e.g., text or audio) and combine it with other branches for handling non-sequential data (e.g., images).

```
[Image Input] → [CNN]
[Text Input] → [Mamba Blocks]
[Audio Input] → [Mamba Blocks]
→ [Fusion Layer] → [Task Head]
```

---

## **Example Mamba Block Implementation (Pseudocode)**
```python
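import torch
from torch import nn


class MambaBlock(nn.Module):
    """Simplified selective-SSM block (sequential reference, not the fused kernel)."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim

        # Project the input to hidden_dim so the element-wise products below
        # have matching shapes even when input_dim != hidden_dim
        self.in_proj = nn.Linear(input_dim, hidden_dim)

        # Dynamic parameter generators
        self.f_A = nn.Linear(hidden_dim, hidden_dim)
        self.f_B = nn.Linear(hidden_dim, hidden_dim)
        self.f_C = nn.Linear(hidden_dim, hidden_dim)
        self.f_D = nn.Linear(hidden_dim, hidden_dim)

        # Layer normalization
        self.layer_norm = nn.LayerNorm(hidden_dim)

    def forward(self, x):
        # x: (seq_len, batch_size, input_dim)
        seq_len, batch_size, _ = x.shape
        z = self.in_proj(x)  # (seq_len, batch_size, hidden_dim)
        h_t = torch.zeros(batch_size, self.hidden_dim, device=x.device, dtype=x.dtype)

        outputs = []
        for t in range(seq_len):
            z_t = z[t]

            # Compute dynamic parameters; the sigmoid keeps A_t in (0, 1)
            # so the recurrence cannot blow up (a choice made for this sketch)
            A_t = torch.sigmoid(self.f_A(z_t))
            B_t = self.f_B(z_t)
            C_t = self.f_C(z_t)
            D_t = self.f_D(z_t)

            # Update hidden state and compute output (diagonal, element-wise SSM)
            h_t = A_t * h_t + B_t * z_t
            y_t = C_t * h_t + D_t * z_t

            outputs.append(y_t)

        outputs = torch.stack(outputs, dim=0)
        # Residual connection around the SSM, then layer normalization
        return self.layer_norm(outputs + z)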
```

---

## **Summary**
- Mamba is modular and flexible, suitable for standalone use or as part of hybrid architectures.
- A single block consists of dynamic parameter generation, state updates, and optional residual/normalization layers.
- You can stack Mamba blocks or combine them with other layers to handle a wide variety of tasks, from language modeling to audio processing.
- Hybrid architectures with Mamba and Transformers or CNNs can exploit the strengths of both approaches.

