# Audio Classifier Training Notebooks

This repository contains complete, end-to-end training notebooks for three audio classification models:

1. **Mamba Audio Classifier** (`train_mamba.ipynb`)
2. **Liquid S4 Audio Classifier** (`train_liquid_s4.ipynb`) 
3. **V-JEPA2 Audio Classifier** (`train_vjepa2.ipynb`)

All models are trained on the ESC-50 environmental sound classification dataset.

## 🚀 Quick Start

### Prerequisites
- Python 3.8+
- PyTorch 2.0+
- CUDA-capable GPU (recommended)
- ESC-50 dataset

### Installation
```bash
# Install dependencies
pip install torch torchaudio torchvision
pip install tqdm matplotlib seaborn scikit-learn
pip install pandas numpy

# For Mamba
pip install "causal-conv1d>=1.0.0"  # quoted so the shell does not treat >= as a redirect
pip install mamba-ssm

# For Liquid S4 (dependencies included in external_repos)
# For V-JEPA2 (dependencies included in external_repos)
```

### Dataset Setup
Download the ESC-50 dataset and place it in the expected location:
- **Expected location**: `./data/ESC-50/` (relative to the project root)
- **Structure**: 
  ```
  data/ESC-50/
  ├── audio/          # .wav files
  ├── meta/
  │   └── esc50.csv   # metadata
  └── README.md
  ```
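
Before launching a notebook, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the function name `check_esc50_layout` is hypothetical, and it assumes the metadata file has the `filename` column found in the official `esc50.csv`.

```python
import csv
from pathlib import Path


def check_esc50_layout(root="data/ESC-50"):
    """Return a list of problems found; an empty list means the layout looks right."""
    root = Path(root)
    audio_dir = root / "audio"
    meta_csv = root / "meta" / "esc50.csv"
    problems = []
    if not audio_dir.is_dir():
        problems.append(f"missing directory: {audio_dir}")
    if not meta_csv.is_file():
        problems.append(f"missing metadata file: {meta_csv}")
    if problems:
        return problems
    # Every clip listed in the metadata should exist under audio/.
    with meta_csv.open(newline="") as f:
        listed = [row["filename"] for row in csv.DictReader(f)]
    missing = [name for name in listed if not (audio_dir / name).is_file()]
    if missing:
        problems.append(f"{len(missing)} of {len(listed)} listed clips not found in {audio_dir}")
    return problems


for problem in check_esc50_layout():
    print("ESC-50 layout problem:", problem)
```

An empty result only says the files are where the notebooks expect them; it does not validate the audio content itself.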


## 📊 Model Comparison

| Model | Architecture | Input Format | Batch Size | Parameters | Memory Usage |
|-------|-------------|--------------|------------|------------|--------------|
| **Mamba** | State Space Model | Sequence [B, T, F] | 16 | ~50M | Medium |
| **Liquid S4** | Liquid State Space | Sequence [B, T, F] | 32 | ~5M | Low |
| **V-JEPA2** | Vision Transformer | Tubelets [B, C, T, H, W] | 8 | ~20M | High |

### Key Differences:

**Mamba Audio Classifier:**
- Uses selective state space models for efficient sequence modeling
- Processes mel-spectrograms as sequences
- Good balance of performance and efficiency
- Recommended for most use cases

**Liquid S4 Audio Classifier:**
- Uses liquid state space models with learnable kernels
- Most parameter-efficient
- Fastest training and inference
- Good for resource-constrained environments

**V-JEPA2 Audio Classifier:**
- Treats audio as visual data using Vision Transformers
- Uses tubelet tokenization for temporal modeling
- Most memory-intensive but potentially highest accuracy
- Best for high-performance requirements


## 🎯 Training Configuration

### Common Features:
- **Dataset**: ESC-50 (50 environmental sound classes)
- **Input**: Mel-spectrograms (128 mel bins)
- **Augmentation**: Time/frequency masking, noise, volume scaling
- **Optimizer**: AdamW with weight decay
- **Scheduler**: Cosine annealing learning rate
- **Early Stopping**: Based on validation accuracy
- **Evaluation**: Per-class metrics and confusion matrices
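
To make the cosine annealing schedule concrete, the sketch below evaluates the standard closed form (the same one PyTorch's `CosineAnnealingLR` uses without restarts). The function name and the `min_lr` default are assumptions for illustration; the notebooks may configure the scheduler differently.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` for a single cosine decay from base_lr to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: a base rate of 3e-4 decayed over 100 epochs.
for epoch in (0, 25, 50, 75, 100):
    print(epoch, f"{cosine_annealing_lr(epoch, 100, 3e-4):.2e}")
```

The rate starts at `base_lr`, passes through the midpoint between `base_lr` and `min_lr` halfway through training, and reaches `min_lr` at the final step.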

### Model-Specific Configurations:

#### Mamba Configuration:
```python
config = {
    'batch_size': 16,
    'learning_rate': 3e-4,
    'epochs': 100,
    'd_model': 512,
    'n_layer': 12,
    'pool_method': 'mean'
}
```

#### Liquid S4 Configuration:
```python
config = {
    'batch_size': 32,
    'learning_rate': 1e-3,
    'epochs': 150,
    'd_model': 64,
    'n_layers': 8,
    'd_state': 64,
    'dropout': 0.1
}
```

#### V-JEPA2 Configuration:
```python
config = {
    'batch_size': 8,
    'learning_rate': 1e-4,
    'epochs': 100,
    'embed_dim': 384,
    'depth': 8,
    'num_heads': 8,
    'patch_size': 16,
    'tubelet_size': 2
}
```
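
The `patch_size` and `tubelet_size` values determine how many tokens the transformer sees, which is likely one reason V-JEPA2 is the most memory-intensive of the three models (self-attention cost grows quadratically with token count). The sketch below shows the arithmetic; the clip shape used in the example is a hypothetical value, not taken from the notebook.

```python
def count_tubelet_tokens(frames, height, width, patch_size=16, tubelet_size=2):
    """Number of tokens when a [C, T, H, W] clip is cut into non-overlapping
    tubelets of tubelet_size frames x patch_size x patch_size pixels."""
    if frames % tubelet_size or height % patch_size or width % patch_size:
        raise ValueError("clip dimensions must be divisible by the tubelet/patch sizes")
    return (frames // tubelet_size) * (height // patch_size) * (width // patch_size)


# Hypothetical clip shape: 16 frames of 128 x 128 spectrogram "pixels".
tokens = count_tubelet_tokens(frames=16, height=128, width=128)
print(tokens)  # 8 * 8 * 8 = 512
```

Doubling `patch_size` in this example would cut the token count by a factor of four, at the cost of coarser spatial resolution.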


## 🚀 Usage Instructions

### 1. Choose Your Model
Select the notebook that best fits your requirements:
- **For general use**: Start with `train_mamba.ipynb`
- **For efficiency**: Use `train_liquid_s4.ipynb`
- **For maximum performance**: Try `train_vjepa2.ipynb`

### 2. Update Configuration
Modify the configuration in Cell 2 of each notebook:
```python
# Adjust batch size based on your GPU memory
config['batch_size'] = 16  # Reduce if OOM errors occur

# Adjust model parameters if needed
config['d_model'] = 256  # e.g. halve Mamba's default of 512 for a smaller model
```

### 3. Run Training
Execute cells sequentially:
1. **Cells 1-2**: Setup and configuration
2. **Cell 3**: Data loading and verification
3. **Cell 4**: Model creation and testing
4. **Cell 5**: Training setup
5. **Cell 6**: Training loop (this will take time!)
6. **Cell 7**: Test evaluation
7. **Cell 8**: Results and visualization

### 4. Monitor Training
Each notebook includes:
- Progress bars with real-time loss updates
- Validation accuracy tracking
- Early stopping to prevent overfitting
- Model checkpointing (best model saved automatically)
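
The early stopping and best-model checkpointing listed above usually work together. The following is a minimal sketch of that logic, not the notebooks' actual code; the class name `EarlyStopping` and the `patience` and `min_delta` parameters are assumptions.

```python
class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc):
        """Record one epoch and return (is_best, should_stop)."""
        if val_acc > self.best + self.min_delta:
            self.best = val_acc
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
for epoch, acc in enumerate([0.40, 0.55, 0.54, 0.53, 0.60]):
    is_best, should_stop = stopper.step(acc)
    if is_best:
        pass  # this is where the best checkpoint would be saved
    if should_stop:
        print(f"stopping after epoch {epoch}, best accuracy {stopper.best:.2f}")
        break
```

With `patience=2`, the loop above stops at epoch 3 and never sees the later 0.60, which illustrates the trade-off: a small patience saves time but can end training before a late improvement.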

### 5. Analyze Results
After training, you'll get:
- Training/validation curves
- Confusion matrix
- Per-class performance metrics
- Final test accuracy
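
For readers who want to see how the confusion matrix relates to the per-class metrics, here is a small dependency-free sketch. The function names are illustrative; the notebooks may well use `sklearn.metrics` instead, which follows the same convention of true classes as rows and predicted classes as columns.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class predicted correctly (None if the class is absent)."""
    return [row[i] / sum(row) if sum(row) else None for i, row in enumerate(matrix)]


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                    # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(per_class_recall(cm))  # [0.5, 1.0, 0.5]
```

Off-diagonal entries show which classes are confused with each other, which is often more informative than a single accuracy figure on a 50-class problem.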


## 🔧 Troubleshooting

### Common Issues:

#### Out of Memory (OOM) Errors:
```python
# Reduce batch size
config['batch_size'] = 8  # or even 4

# Reduce model size
config['d_model'] = 256  # for Mamba
config['embed_dim'] = 192  # for V-JEPA2
```
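
Rather than guessing a batch size by hand, one option is to halve it until a trial step fits. This is a sketch under assumptions: `try_batch` is a hypothetical callable that runs one forward/backward pass at the given batch size, and the check relies on PyTorch reporting CUDA OOM as a `RuntimeError` whose message contains "out of memory".

```python
def largest_fitting_batch_size(try_batch, start=16, minimum=1):
    """Halve the batch size until try_batch(batch_size) runs without an OOM error."""
    batch_size = start
    while batch_size >= minimum:
        try:
            try_batch(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only swallow out-of-memory errors; re-raise anything else.
            if "out of memory" not in str(err).lower():
                raise
            # With a real model, freeing cached GPU memory here
            # (torch.cuda.empty_cache()) before retrying is advisable.
            batch_size //= 2
    raise RuntimeError(f"even batch size {minimum} does not fit in memory")


def fake_step(batch_size):
    """Stand-in for a real training step: pretends anything above 4 runs out of memory."""
    if batch_size > 4:
        raise RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")


print(largest_fitting_batch_size(fake_step, start=16))  # 4
```

Keep in mind that a smaller batch size changes the optimization dynamics, so the learning rate may need retuning.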

#### Import Errors:
```bash
# Check that the external repositories are present
ls -d external_repos/mamba external_repos/liquid-s4 external_repos/vjepa2
# Any "No such file or directory" line names a missing repository
```

#### Dataset Path Issues:
```python
# Verify ESC-50 structure in data/ESC-50/:
# data/ESC-50/
#   ├── audio/          # .wav files
#   ├── meta/
#   │   └── esc50.csv   # metadata
#   └── README.md
```

#### Slow Training:
- Use GPU acceleration (CUDA)
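- Increase `num_workers` in the DataLoader so data loading does not starve the GPU (reduce it only if worker startup or CPU contention is the bottleneck)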
- Use mixed precision training (add to training loop)
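
The mixed precision tip can be sketched as follows. This is an illustrative training step, not the notebooks' loop: the function name `amp_train_step` and the stand-in linear model are assumptions. It uses `torch.autocast` with `torch.cuda.amp.GradScaler`, which is available across PyTorch 2.x (newer releases prefer `torch.amp.GradScaler("cuda")`).

```python
import torch
from torch import nn


def amp_train_step(model, inputs, targets, optimizer, scaler):
    """One optimization step with automatic mixed precision on CUDA.

    On CPU, autocast and the scaler are disabled, so this falls back
    to an ordinary full-precision step.
    """
    use_amp = inputs.is_cuda
    dtype = torch.float16 if use_amp else torch.bfloat16
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=inputs.device.type, dtype=dtype, enabled=use_amp):
        loss = nn.functional.cross_entropy(model(inputs), targets)
    # The scaler multiplies the loss to avoid float16 gradient underflow,
    # then unscales before the optimizer update.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 50).to(device)  # stand-in for a real classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
x = torch.randn(8, 128, device=device)
y = torch.randint(0, 50, (8,), device=device)
print(amp_train_step(model, x, y, optimizer, scaler))
```

Mixed precision mainly helps on GPUs with tensor cores; whether custom kernels such as those in `mamba-ssm` benefit from autocast is worth verifying on your own setup.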

### Performance Tips:

1. **Start with Liquid S4** for fastest iteration
2. **Use smaller models** for initial experiments
3. **Monitor GPU memory** usage during training
4. **Save checkpoints** regularly for long training runs
5. **Use validation set** to tune hyperparameters

### Expected Results:
- **Mamba**: ~70-80% accuracy on ESC-50
- **Liquid S4**: ~65-75% accuracy on ESC-50  
- **V-JEPA2**: ~75-85% accuracy on ESC-50

*Note: Results may vary based on hardware, hyperparameters, and random seeds.*


## 📁 File Structure

```
audio-classifier-v2/
├── train_mamba.ipynb              # Mamba training notebook
├── train_liquid_s4.ipynb          # Liquid S4 training notebook  
├── train_vjepa2.ipynb             # V-JEPA2 training notebook
├── README_Training_Notebooks.ipynb # This documentation
├── audio_utils.py                 # Data loading utilities (expects data/ESC-50/)
├── mamba_audio.py                 # Mamba model implementation
├── liquidS4_audio.py              # Liquid S4 model implementation
├── vjepa2_audio.py                # V-JEPA2 model implementation
├── data/                          # Dataset directory
│   └── ESC-50/                    # ESC-50 dataset (download here)
│       ├── audio/                 # .wav files
│       ├── meta/
│       │   └── esc50.csv         # metadata
│       └── README.md
└── external_repos/                # External model repositories
    ├── mamba/                     # Mamba SSM implementation
    ├── liquid-s4/                 # Liquid S4 implementation
    └── vjepa2/                    # V-JEPA2 implementation
```

## 🎯 Next Steps

After training your models:

1. **Compare Results**: Run all three notebooks and compare performance
2. **Hyperparameter Tuning**: Experiment with different configurations
3. **Model Ensemble**: Combine predictions from multiple models
4. **Transfer Learning**: Fine-tune on your specific audio dataset
5. **Deployment**: Export models for production use

## 📚 References

- [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752)
- [Liquid S4: Liquid Structural State-Space Models](https://arxiv.org/abs/2209.12951)
- [V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning](https://arxiv.org/abs/2506.09985)
- [ESC-50 Dataset](https://github.com/karolpiczak/ESC-50)

## 🤝 Contributing

Feel free to:
- Report issues or bugs
- Suggest improvements
- Add new model architectures
- Optimize training configurations

Happy training! 🚀
