# Synch

Synch is a music recommendation engine built on measured signal rather than popularity, trends, or social noise.

Tracks are represented as vectors in a high-dimensional acoustic and semantic space, and recommendations are generated by proximity in that space rather than by human behavior.

Music is treated as a physical system: **sound → numbers → similarity**.

## Dataset construction

More than 5,000 tracks were collected.

The raw files were heterogeneous: some had complete metadata, others were only partially tagged, and some carried incorrect tags.

Because downstream modeling is only as good as the input, a manual data integrity phase was introduced before any automation.

### Metadata verification

All tracks were inspected and corrected using Mp3tag with MusicBrainz lookups.

This ensured that artist names, titles, albums, release years, and genres were standardized and trustworthy.

After verification, metadata was exported into a structured CSV to serve as the base layer.

The dataset schema was defined as:

- title
- track_number
- disc_number
- duration_seconds
- album
- album_artist
- contributing_artists
- genre_tagged
- year
- file_path

One critical field was still missing: a stable track identifier.

### Deterministic track identity

Since no global ID existed across the source files, a deterministic track_id was constructed from multiple attributes of each track.

The goal was not cryptographic uniqueness but collision-resistant identity for a dataset on the order of 10,000 tracks.

The ID is generated from:

- title
- track number
- disc number
- duration
- album
- album artist
- contributing artists
- genre
- year

These fields are normalized, reduced to stable characters, and combined into a reproducible identifier.

This has three consequences:

- identical files always produce the same ID
- small metadata errors become detectable
- joins between metadata and audio features are reliable

Unicode normalization and tag cleanup are applied before ID generation to avoid hidden collisions.
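The idea can be sketched as follows. This is a minimal illustration, not the project's actual scheme: the hashing step (SHA-1 truncated to 12 hex characters) and the helper names `normalize_field` and `make_track_id` are assumptions. The sample IDs shown in this README appear to follow a different, more compact format.

```python
import hashlib
import re
import unicodedata

# Fields listed above, in a fixed order so the ID is reproducible.
ID_FIELDS = (
    "title", "track_number", "disc_number", "duration_seconds", "album",
    "album_artist", "contributing_artists", "genre_tagged", "year",
)


def normalize_field(value) -> str:
    """Reduce a tag value to a stable lowercase alphanumeric string."""
    text = unicodedata.normalize("NFKC", str(value if value is not None else ""))
    text = text.casefold()
    # Keep only letters and digits so spacing and punctuation do not matter.
    return re.sub(r"[^\w]", "", text).replace("_", "")


def make_track_id(row: dict, length: int = 12) -> str:
    """Hypothetical helper: hash the normalized fields into a short hex ID."""
    key = "|".join(normalize_field(row.get(name)) for name in ID_FIELDS)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:length]
```

With this sketch, two rows that differ only in letter case, punctuation, or surrounding whitespace map to the same ID, while a change to any field's letters or digits (for example a different year) produces a different one.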

### Metadata extraction pipeline

A custom extraction script walks the audio library, reads tags using Mutagen, cleans and normalizes them, and exports a CSV.

Key processing steps include:

- Unicode normalization and encoding repair (ftfy, unicodedata)
- Artist and genre tokenization and deduplication
- Year parsing
- Track and disc number normalization
- Deterministic track_id generation

The output is a clean, stable metadata spine for the entire project.
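A few of these steps can be sketched with the standard library alone. This is an illustration under assumptions, not the extraction script itself: the function names are hypothetical, the `;` separator is inferred from the sample rows, and the Mutagen tag reading and ftfy encoding repair are left out.

```python
from __future__ import annotations

import re
import unicodedata


def clean_text(value: str) -> str:
    """Normalize Unicode and collapse whitespace (ftfy repair is not shown)."""
    text = unicodedata.normalize("NFC", value or "")
    return re.sub(r"\s+", " ", text).strip()


def split_unique(value: str, separators: str = ";") -> list[str]:
    """Tokenize a multi-value tag and drop case-insensitive duplicates."""
    tokens = re.split("[" + re.escape(separators) + "]", value or "")
    seen, result = set(), []
    for token in map(clean_text, tokens):
        key = token.casefold()
        if token and key not in seen:
            seen.add(key)
            result.append(token)
    return result


def parse_year(value: str) -> int | None:
    """Pull the first four-digit year out of strings such as '2010-03-03'."""
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", value or "")
    return int(match.group(1)) if match else None


def parse_position(value: str) -> int | None:
    """Turn track or disc tags such as '3/12' or '03' into an integer."""
    match = re.match(r"\s*(\d+)", value or "")
    return int(match.group(1)) if match else None
```

Splitting only on `;` is a deliberately conservative default: separators such as `/` or `,` also occur inside legitimate artist names, so they would need case-by-case handling.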

### Sample (metadata.csv)

| track_id | title | album | album_artist | contributing_artists | genre_tagged | year |
|----------|-------|-------|--------------|---------------------|--------------|------|
| b11119pgze10 | Broken | Plastic Beach | Gorillaz | Gorillaz | Electronic;Hip-Hop;Pop | 2010 |
| s6117pgle10 | Superfast Jellyfish | Plastic Beach | Gorillaz | Gorillaz; Gruff Rhys; De La Soul | Electronic;Hip-Hop;Pop | 2010 |

## Audio feature extraction

Once the metadata layer was stable, the next phase converted sound into numbers.

All audio files are lossless or high-bitrate, so that compression artifacts do not distort the extracted features.

Feature extraction is performed using Essentia, an open-source audio analysis library widely used in music information retrieval research.

Because Essentia is primarily built and distributed for Linux, the environment runs inside WSL for compatibility and reproducibility.

### Low-level acoustic features

The lowlevel.py pipeline computes frame-level features from each audio file and aggregates them into track-level statistics.

For each track:

1. The waveform is loaded and converted to mono.
2. The signal is split into overlapping frames.
3. Each frame is transformed into the frequency domain.
4. Core spectral features are computed.
5. Statistics are aggregated over the entire track.

Extracted features include:

| Feature | Meaning |
|---------|----------|
| MFCCs | Timbre and texture |
| Spectral centroid | Brightness |
| Spectral flatness | Tonality vs noise |
| Spectral flux | Temporal change |
| RMS | Signal energy (loudness proxy) |

Each feature is summarized by its mean and standard deviation across frames, giving one fixed-length vector per track.
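The five steps can be illustrated with a NumPy-only sketch. It does not use Essentia and is not the code in lowlevel.py: the frame size, hop size, Hann window, and function names are assumptions, MFCCs are omitted for brevity, and the centroid is returned in Hz, which may differ from the scaling used in lowlevel.csv.

```python
import numpy as np


def frame_signal(signal: np.ndarray, frame_size: int = 2048,
                 hop_size: int = 1024) -> np.ndarray:
    """Split a mono signal into overlapping frames (one frame per row)."""
    if len(signal) < frame_size:
        signal = np.pad(signal, (0, frame_size - len(signal)))
    n_frames = 1 + (len(signal) - frame_size) // hop_size
    starts = hop_size * np.arange(n_frames)
    return np.stack([signal[s:s + frame_size] for s in starts])


def lowlevel_features(audio: np.ndarray, sample_rate: int = 44100) -> dict:
    """Return mean and std of a few spectral descriptors for one track."""
    # Step 1: mix down to mono if the array is (samples, channels).
    mono = audio.mean(axis=1) if audio.ndim == 2 else audio
    # Step 2: overlapping frames. Step 3: windowed magnitude spectrum.
    frames = frame_signal(mono.astype(np.float64))
    window = np.hanning(frames.shape[1])
    spectrum = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sample_rate)
    eps = 1e-12  # avoids division by zero and log(0) on silent frames

    # Step 4: per-frame descriptors.
    centroid = (spectrum * freqs).sum(axis=1) / (spectrum.sum(axis=1) + eps)
    geometric_mean = np.exp(np.log(spectrum + eps).mean(axis=1))
    flatness = geometric_mean / (spectrum.mean(axis=1) + eps)
    flux = np.r_[0.0, np.sqrt((np.diff(spectrum, axis=0) ** 2).sum(axis=1))]
    rms = np.sqrt((frames ** 2).mean(axis=1))
    per_frame = {
        "spectral_centroid": centroid,
        "spectral_flatness": flatness,
        "spectral_flux": flux,
        "rms": rms,
    }

    # Step 5: aggregate each descriptor over the whole track.
    stats = {}
    for name, values in per_frame.items():
        stats[f"{name}_mean"] = float(values.mean())
        stats[f"{name}_std"] = float(values.std())
    return stats
```

As a sanity check, a pure sine tone should yield a centroid near its frequency and a low flatness, while white noise should yield a much higher flatness.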

### Sample (lowlevel.csv)

| track_id | mfcc_1_mean | spectral_centroid_mean | spectral_flatness_mean | rms_mean |
|----------|-------------|------------------------|------------------------|----------|
| b11119pgze10 | -533.7 | 0.132 | 0.164 | 0.321 |
| s6117pgle10 | -556.6 | 0.180 | 0.291 | 0.321 |
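Because both files share the track_id column, the feature table can be joined back to the metadata spine. The sketch below uses only the standard library. The function name is hypothetical, and the inline strings stand in for metadata.csv and lowlevel.csv, with a subset of columns taken from the sample rows.

```python
import csv
import io


def join_on_track_id(metadata_rows, feature_rows):
    """Inner-join two iterables of dict rows on their 'track_id' column."""
    features = {row["track_id"]: row for row in feature_rows}
    joined = []
    for meta in metadata_rows:
        feat = features.get(meta["track_id"])
        if feat is not None:
            joined.append({**meta, **feat})
    return joined


# Inline stand-ins for metadata.csv and lowlevel.csv.
metadata_csv = io.StringIO("track_id,title,year\nb11119pgze10,Broken,2010\n")
lowlevel_csv = io.StringIO("track_id,rms_mean\nb11119pgze10,0.321\n")

rows = join_on_track_id(csv.DictReader(metadata_csv),
                        csv.DictReader(lowlevel_csv))
```

Tracks that appear in only one of the two files are dropped by this inner join, so comparing the row counts before and after is a quick way to spot missing feature rows.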

The next stage introduces high-level features using Essentia's ML models.