# Identifying Out-of-Tune Instruments in Multi-Instrument Mixes using VGGish Transfer Learning

**Team Members**: [Insert names]  
**Course**: CS8321 - Advanced Machine Learning and Neural Networks  
**University**: Southern Methodist University, Dallas  
**Semester**: Spring 2025

This project investigates whether a machine learning model can identify which instrument in a polyphonic audio mixture is out of tune. Using synthetic data derived from the NSynth dataset and pre-trained VGGish audio embeddings, we aim to build a robust classifier capable of detecting tuning irregularities in complex musical environments.

## Table of Contents

1. [Motivation & Research Questions](#motivation)
2. [Related Work](#related-work)
3. [Problem Statement & Hypothesis](#problem)
4. [Dataset Description & Preprocessing](#dataset)
5. [Transfer Learning: VGGish Embeddings](#transfer)
6. [Modeling](#modeling)
7. [Methodology](#methodology)
8. [Preliminary Analysis & Results](#results)
9. [Evaluation Metrics](#evaluation)
10. [Ethical Considerations](#ethics)
11. [Future Work](#future)
12. [References](#references)


## Motivation & Research Questions <a name="motivation"></a>


## Related Work <a name="related-work"></a>

## Problem Statement & Hypothesis <a name="problem"></a>

**Problem Statement**:  
Detect and localize which instrument in a multi-instrument audio mix is out of tune.

**Hypothesis**:  
We hypothesize that VGGish embeddings retain enough frequency-shift sensitivity to enable binary (in-tune vs. out-of-tune) and multi-class (instrument identification) classification, even when audio sources are blended.



## Dataset Description & Preprocessing <a name="dataset"></a>

We used the **NSynth-train** dataset from Activeloop’s DeepLake hub and applied the following steps:

- Randomly selected 3 instruments per mix, for a total of 1,000 mixes.
- Applied a pitch shift of 1–2 semitones, up or down, to one instrument in each mix.
- Mixed the three clips to form a polyphonic sample.
- Normalized each mix and exported it as a `.wav` file.
- Saved the instrument labels and pitch-shift metadata for each mix in `labels.csv`.
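
The mixing steps above could be sketched roughly as follows. This is a minimal NumPy illustration, not the project's actual script: the function names (`naive_pitch_shift`, `make_mix`), the metadata keys, and the choice of whole-semitone shifts from {-2, -1, +1, +2} are assumptions. The pitch shift here is a simple resampling stand-in that shortens or truncates the note; a duration-preserving routine such as `librosa.effects.pitch_shift` would likely be a better fit in practice.

```python
import numpy as np

SR = 16_000  # NSynth clips are sampled at 16 kHz


def naive_pitch_shift(y, n_steps):
    """Shift pitch by resampling, then zero-pad or crop back to the input length.

    Stand-in only: reading the signal faster raises the pitch but also
    shortens the note, unlike a phase-vocoder based shifter.
    """
    ratio = 2.0 ** (n_steps / 12.0)              # frequency ratio for n semitones
    src_positions = np.arange(0, len(y), ratio)  # fractional read positions
    shifted = np.interp(src_positions, np.arange(len(y)), y)
    out = np.zeros_like(y)
    n = min(len(y), len(shifted))
    out[:n] = shifted[:n]
    return out


def make_mix(clips, rng):
    """Detune one randomly chosen clip, sum all clips, and peak-normalize.

    clips: dict mapping an instrument name to a 1-D float array (equal lengths).
    Returns the mix and a metadata dict (hypothetical keys) for labels.csv.
    """
    names = list(clips)
    target = names[rng.integers(len(names))]
    n_steps = int(rng.choice([-2, -1, 1, 2]))    # assumed shift set
    stems = [
        naive_pitch_shift(y, n_steps) if name == target else y
        for name, y in clips.items()
    ]
    mix = np.sum(stems, axis=0)
    peak = np.max(np.abs(mix))
    if peak > 0:
        mix = mix / peak                         # peak normalization to [-1, 1]
    return mix.astype(np.float32), {"detuned_instrument": target, "n_steps": n_steps}
```

Each returned metadata dict would correspond to one row of `labels.csv`, and the mix itself can be written to disk with any WAV writer.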

## Transfer Learning: VGGish Embeddings <a name="transfer"></a>

We leverage **VGGish**, a pretrained audio feature extractor developed by Google and based on the VGG architecture. It converts each 0.96 s frame of audio into a 128-dimensional embedding suitable for downstream tasks.

### Why VGGish?
- Trained on large-scale YouTube data
- Robust across audio types (speech, music, environmental sounds)
- Eliminates the need for custom feature engineering

### Preprocessing for VGGish:
- Convert `.wav` to mono, 16 kHz
- Slice or pad into 0.96 s frames
- Extract VGGish embeddings per file for classifier input
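
To make the framing arithmetic concrete, here is a small NumPy sketch of the mono downmix and the 0.96 s slicing with zero-padding. It assumes the audio is already at 16 kHz and, for multi-channel input, that the array is laid out as `(n_samples, n_channels)`; the helper names are hypothetical. Note that the reference VGGish preprocessing works on log-mel spectrogram patches rather than raw sample windows, so this only illustrates how many frames (and therefore embeddings) a clip yields.

```python
import numpy as np

SAMPLE_RATE = 16_000   # target rate expected by VGGish
FRAME_SECONDS = 0.96   # duration covered by one embedding


def to_mono(audio):
    """Average channels; assumes shape (n_samples,) or (n_samples, n_channels)."""
    audio = np.asarray(audio, dtype=np.float32)
    return audio if audio.ndim == 1 else audio.mean(axis=1)


def frame_audio(audio, sample_rate=SAMPLE_RATE, frame_seconds=FRAME_SECONDS):
    """Split mono audio into non-overlapping frames, zero-padding the last one."""
    frame_len = int(round(sample_rate * frame_seconds))   # 15,360 samples
    n_frames = max(1, int(np.ceil(len(audio) / frame_len)))
    padded = np.zeros(n_frames * frame_len, dtype=np.float32)
    padded[: len(audio)] = audio
    return padded.reshape(n_frames, frame_len)
```

With this padding scheme, a 4-second NSynth-length clip (64,000 samples) produces five frames, the last of which is mostly padding.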


## Modeling <a name="modeling"></a>

### Baseline Model:
- Multi-Layer Perceptron (MLP) classifier on mean-pooled VGGish embeddings (per-frame embeddings averaged over each file)
- Output: Multi-class classification (which instrument is out-of-tune)

### Advanced Options (Optional):
- Random Forest or Gradient Boosted Trees
- Add attention layer on top of VGGish sequence embeddings
- Use CNN over time-distributed embeddings
- Explore transformer-based classifiers for temporal audio patterns

### Input–Output Format:
- **Input**: 128-D VGGish feature vector(s) per audio mix
- **Output**: One-hot encoded label of out-of-tune instrument
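
As an illustration of the baseline and its input and output shapes, the sketch below mean-pools per-frame embeddings into one 128-D vector and trains a one-hidden-layer MLP with a softmax output using plain gradient descent. It is a self-contained NumPy approximation, not the project's implementation: the hidden size, learning rate, epoch count, and all function names are assumptions, and a library classifier would normally be used instead.

```python
import numpy as np


def mean_pool(frame_embeddings):
    """Collapse (n_frames, 128) per-frame embeddings into one 128-D vector."""
    return np.asarray(frame_embeddings).mean(axis=0)


def one_hot(labels, n_classes):
    encoded = np.zeros((len(labels), n_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def forward(params, X):
    """Return hidden activations and class probabilities."""
    W1, b1, W2, b2 = params
    hidden = np.maximum(0.0, X @ W1 + b1)  # ReLU
    return hidden, softmax(hidden @ W2 + b2)


def train_mlp(X, y, n_classes, hidden_units=32, lr=0.01, epochs=300, seed=0):
    """Full-batch gradient descent on mean cross-entropy (assumed settings)."""
    rng = np.random.default_rng(seed)
    params = [
        rng.normal(0.0, 0.1, (X.shape[1], hidden_units)), np.zeros(hidden_units),
        rng.normal(0.0, 0.1, (hidden_units, n_classes)), np.zeros(n_classes),
    ]
    targets = one_hot(y, n_classes)
    for _ in range(epochs):
        W2 = params[2]
        hidden, proba = forward(params, X)
        d_logits = (proba - targets) / len(X)        # softmax + cross-entropy gradient
        d_hidden = (d_logits @ W2.T) * (hidden > 0)  # backprop through ReLU
        grads = [X.T @ d_hidden, d_hidden.sum(axis=0),
                 hidden.T @ d_logits, d_logits.sum(axis=0)]
        params = [p - lr * g for p, g in zip(params, grads)]
    return params


def predict(params, X):
    """Index of the most probable class, i.e. the predicted out-of-tune instrument."""
    return forward(params, X)[1].argmax(axis=1)
```

Here `X` would be the stacked mean-pooled vectors, one row per mix, and `y` the integer index of the out-of-tune instrument.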


## Methodology <a name="methodology"></a>

### Step-by-Step:
1. **Data Augmentation**:
   - Generate pitch-shifted multi-instrument samples
   - Create metadata file (`labels.csv`)
2. **Feature Extraction**:
   - Extract 128-D embeddings using VGGish
3. **Label Encoding**:
   - One-hot encode instruments and tuning status
4. **Train/Test Split**:
   - 80/20 split, optionally stratified by the out-of-tune instrument label
5. **Classifier Training**:
   - Fit baseline classifier
6. **Evaluation**:
   - Generate metrics, confusion matrix, and visualizations
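
Steps 3 and 4 can be illustrated with a short NumPy sketch that maps instrument names to integer class ids and then draws an 80/20 split within each class. The function names and the example instrument names are hypothetical, and the rule of keeping at least one test sample per class is an assumption made for this sketch.

```python
import numpy as np


def encode_labels(instrument_names):
    """Map instrument names to integer ids; returns (sorted classes, ids)."""
    classes, ids = np.unique(instrument_names, return_inverse=True)
    return classes, ids


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) with test_frac of every class held out."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = max(1, int(round(test_frac * len(idx))))  # at least one per class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(sorted(train_idx)), np.array(sorted(test_idx))
```

Stratifying on the out-of-tune instrument keeps the class balance of the test set close to that of the full dataset, which matters when only a small subset of mixes is available.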


## Preliminary Analysis & Results <a name="results"></a>

We trained the initial model on a subset of 100 samples. Below are key findings:

### Results (Sample):
- Accuracy: XX%
- F1-score (macro): XX%
- Instruments like [X] show higher confusion with [Y]

### Visualization:
> *(Insert matplotlib/seaborn Confusion Matrix or PCA projection here)*

### Interpretation:
- Certain instrument combinations may mask pitch shifts.
- VGGish may be sensitive to harmonics rather than pitch center in some cases.

## Evaluation Metrics <a name="evaluation"></a>

To evaluate the classifier's ability to identify the out-of-tune instrument:

- **Accuracy**
- **Confusion Matrix**
- **Precision / Recall / F1-score** (macro and per-class)
- **Top-1 and Top-2 accuracy**
- **AUC-ROC** (optional for binary tuning detection)
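
For reference, the confusion matrix, macro F1, and top-k accuracy listed above can be computed as in the following NumPy sketch. The function names are hypothetical, and the inputs are assumed to be integer class labels plus an `(n_samples, n_classes)` array of predicted probabilities.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


def macro_f1(cm):
    """Unweighted mean of per-class F1; classes with no support or predictions score 0."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = 2 * tp + fp + fn
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return float(f1.mean())


def top_k_accuracy(proba, y_true, k):
    """Fraction of samples whose true class is among the k most probable classes."""
    top_k = np.argsort(proba, axis=1)[:, -k:]
    hits = [label in row for label, row in zip(y_true, top_k)]
    return float(np.mean(hits))
```

Top-1 accuracy computed this way equals plain accuracy, while top-2 accuracy credits predictions where the correct instrument was the runner-up.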

## Ethical Considerations <a name="ethics"></a>

- The model is trained on synthetic data and may not generalize to real-world performances
- Risk of overfitting to artifacts introduced during synthetic pitch shifting
- Use of tuning detection in artistic expression must respect creative freedom

## Future Work <a name="future"></a>

- Integrate source separation (e.g., Demucs, Spleeter) for per-stream analysis
- Test on real-world recordings from MusicNet or user-generated audio
- Build a web or real-time tool for live instrument tuning analysis
- Train a contrastive learning model using positive (in-tune) vs. negative (out-of-tune) pairs


## References <a name="references"></a>
