# Prompt2Song – Audio Encoder & Gated Fusion

Train an audio feature projection network and a gated fusion module that blends lyric and acoustic embeddings into a unified song representation.

## Goals
- Embed song lyrics with the fine-tuned text encoder
- Learn an audio feature encoder that maps acoustic descriptors into the same emotion space
- Train a gated fusion module that balances lyric and audio signals per song
- Persist reusable models, scalers, and fused song embeddings

Imports filesystem helpers, numerical libraries, PyTorch modules, and scikit-learn utilities required for the audio pipeline.

In [11]:
import os
from pathlib import Path
from typing import List, Tuple

import numpy as np
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

from tqdm.auto import tqdm
try:
    from sklearn.preprocessing import StandardScaler
except ImportError as exc:
    raise ImportError("Install scikit-learn before running this notebook.") from exc


Identifies dataset/artifact locations, prepares output directories, and warns if the text encoder artifacts are missing.

In [12]:
NOTEBOOK_DIR = Path.cwd().resolve()
if (NOTEBOOK_DIR / "datasets").exists():
    PROJECT_ROOT = NOTEBOOK_DIR
else:
    PROJECT_ROOT = NOTEBOOK_DIR.parent

DATASET_PATH = PROJECT_ROOT / "datasets" / "song_features" / "songs_with_attributes_and_lyrics.csv"
TEXT_MODEL_DIR = PROJECT_ROOT / "artifacts" / "text_encoder" / "hf_model"
FUSION_ARTIFACTS = PROJECT_ROOT / "artifacts" / "fusion"
FUSION_ARTIFACTS.mkdir(parents=True, exist_ok=True)

if not TEXT_MODEL_DIR.exists():
    print("⚠️ Fine-tuned text encoder not found. Run 01_text_emotion_encoder.ipynb first.")

print(f"Using dataset: {DATASET_PATH}")


Using dataset: /Users/himanshu/Documents/Github/prompt2song/datasets/song_features/songs_with_attributes_and_lyrics.csv


Defines the TextEmotionEncoder wrapper so we can reuse the fine-tuned text model for lyric embeddings.

In [13]:
class TextEmotionEncoder(torch.nn.Module):
    def __init__(self, model_dir: Path, device: str | None = None):
        super().__init__()
        from transformers import AutoModel, AutoTokenizer

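        # Honor an explicit device; otherwise prefer MPS, then CUDA, then CPU.
        if device is None:
            if torch.backends.mps.is_available():
                device = "mps"
            elif torch.cuda.is_available():
                device = "cuda"
            else:
                device = "cpu"
        self.device = device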
        self.base_model = AutoModel.from_pretrained(model_dir).to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)

    @torch.no_grad()
    def encode(self, texts: List[str], batch_size: int = 32, max_length: int = 512) -> np.ndarray:
        embeddings = []
        batch_indices = range(0, len(texts), batch_size)
        for start in tqdm(batch_indices, desc='Encoding lyrics', leave=False):
            batch = texts[start:start + batch_size]
            tokens = self.tokenizer(
                batch,
                padding=True,
                truncation=True,
                max_length=max_length,
                return_tensors="pt",
            ).to(self.device)
            outputs = self.base_model(**tokens)
            token_embeddings = outputs.last_hidden_state
            attention_mask = tokens.attention_mask.unsqueeze(-1)
            summed = (token_embeddings * attention_mask).sum(dim=1)
            counts = attention_mask.sum(dim=1).clamp(min=1)  # guard against all-padding rows
            mean_pooled = summed / counts
            embeddings.append(mean_pooled.cpu().numpy())
        return np.vstack(embeddings)
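
The pooling step in `encode` averages token vectors over non-padding positions only. A minimal NumPy sketch of the same arithmetic on a toy batch, using a hypothetical helper `masked_mean_pool` that is not part of the notebook; it also clips the token count to at least 1, an added guard against rows that are entirely padding:

```python
import numpy as np


def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over positions where the mask is 1 (hypothetical helper)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                    # avoid division by zero
    return summed / counts


# Toy batch: 2 sequences, 3 token slots, hidden size 2.
tokens = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
        [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],      # no padding
    ]
)
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # the padded slot's values do not leak into the first row
```

Without the mask, the padding slot in the first sequence would dominate its average, which is why a plain `mean(dim=1)` is not used.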


Loads the song metadata CSV, fills missing lyrics, and prints coverage statistics for songs and lyrics.

In [14]:
songs_df = pd.read_csv(DATASET_PATH)
print(songs_df.columns.tolist())
print("Total songs:", len(songs_df))

songs_df["lyrics"] = songs_df["lyrics"].fillna("").astype(str)
non_empty = songs_df[songs_df["lyrics"].str.len() > 0].copy()
print("Songs with lyrics:", len(non_empty))


['id', 'name', 'album_name', 'artists', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'lyrics']
Total songs: 955320
Songs with lyrics: 955307


### Embed lyrics with the fine-tuned text encoder
This step can be time-consuming; cache the result for reuse across training runs.

Runs the text encoder to embed every available lyric and stores the embeddings on disk for reuse.

In [15]:
# Requires the fine-tuned text encoder artifacts in TEXT_MODEL_DIR
encoder = TextEmotionEncoder(TEXT_MODEL_DIR)
lyric_embeddings = encoder.encode(non_empty["lyrics"].tolist(), batch_size=16, max_length=512)
np.save(FUSION_ARTIFACTS / "lyric_embeddings.npy", lyric_embeddings)



Loads cached lyric embeddings from disk and asserts that their row count matches the number of filtered songs.

In [None]:
# If embeddings were cached previously, load them here
lyric_embeddings = np.load(FUSION_ARTIFACTS / "lyric_embeddings.npy")
assert lyric_embeddings.shape[0] == len(non_empty)
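
The encode-then-cache and load-from-cache cells above could be folded into one step. A minimal sketch, assuming a hypothetical `load_or_compute` helper (not part of the notebook) and a stand-in `fake_encode` function in place of the real text encoder:

```python
import tempfile
from pathlib import Path
from typing import Callable

import numpy as np


def load_or_compute(path: Path, compute: Callable[[], np.ndarray], expected_rows: int) -> np.ndarray:
    """Return the cached array at `path`; recompute and save when missing or misaligned."""
    if path.exists():
        cached = np.load(path)
        if cached.shape[0] == expected_rows:
            return cached
    result = compute()
    np.save(path, result)
    return result


calls = []


def fake_encode() -> np.ndarray:
    """Stand-in for the expensive encoder call; records each invocation."""
    calls.append(1)
    return np.ones((4, 3), dtype=np.float32)


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "lyric_embeddings.npy"
    first = load_or_compute(cache_path, fake_encode, expected_rows=4)   # computes and saves
    second = load_or_compute(cache_path, fake_encode, expected_rows=4)  # served from disk

print("encoder calls:", len(calls))
```

Checking the row count on load means a cache built from a different filter of the dataset is recomputed rather than silently reused.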


### Prepare acoustic features
Select the numerical columns that describe each song's audio profile.

Specifies which acoustic feature columns to use, scales them with StandardScaler, persists scaler parameters, and logs the resulting shape.

In [None]:
ACOUSTIC_FEATURES = [
    "danceability",
    "energy",
    "loudness",
    "speechiness",
    "acousticness",
    "instrumentalness",
    "liveness",
    "valence",
    "tempo",
    "mode",
    "key",
    "duration_ms",
]

# Convert infinities to NaN first so they cannot distort the medians used for imputation.
feature_df = non_empty[ACOUSTIC_FEATURES].replace([np.inf, -np.inf], np.nan)
feature_df = feature_df.fillna(feature_df.median())

scaler = StandardScaler()
scaled_features = scaler.fit_transform(feature_df.values)

np.save(FUSION_ARTIFACTS / "audio_feature_scaler_mean.npy", scaler.mean_)
np.save(FUSION_ARTIFACTS / "audio_feature_scaler_scale.npy", scaler.scale_)
feature_order_path = FUSION_ARTIFACTS / "audio_feature_order.json"
# One feature name per line (plain text, despite the .json suffix).
feature_order_path.write_text("\n".join(ACOUSTIC_FEATURES), encoding="utf-8")

print("Scaled features shape:", scaled_features.shape)
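
At inference time, the two saved arrays should be enough to standardize new feature rows without refitting, since `StandardScaler.transform` with default settings computes `(x - mean_) / scale_`. A minimal sketch with made-up two-column statistics; the helper name `apply_saved_scaler` and the toy values are assumptions for illustration:

```python
import numpy as np


def apply_saved_scaler(raw: np.ndarray, mean: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Standardize rows with previously saved per-column mean and scale."""
    return (np.asarray(raw, dtype=np.float64) - mean) / scale


# Toy "training" data with two columns (say, energy and tempo).
train = np.array([[0.2, 90.0], [0.8, 150.0], [0.5, 120.0]])
saved_mean = train.mean(axis=0)   # plays the role of audio_feature_scaler_mean.npy
saved_scale = train.std(axis=0)   # population std (ddof=0), as StandardScaler uses

new_song = np.array([[0.5, 120.0]])  # exactly at the training mean
scaled_new = apply_saved_scaler(new_song, saved_mean, saved_scale)
scaled_train = apply_saved_scaler(train, saved_mean, saved_scale)
print(scaled_new)
```

New rows must list features in the same order as `ACOUSTIC_FEATURES`, which is what the saved feature-order file is for.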


### Audio encoder dataset

Implements a Dataset that pairs acoustic feature vectors with lyric embeddings for supervised training.

In [None]:
class AudioToLyricDataset(Dataset):
    def __init__(self, features: np.ndarray, targets: np.ndarray):
        assert features.shape[0] == targets.shape[0]
        self.features = torch.from_numpy(features).float()
        self.targets = torch.from_numpy(targets).float()

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]

# dataset = AudioToLyricDataset(scaled_features.astype(np.float32), lyric_embeddings.astype(np.float32))
# train_loader = DataLoader(dataset, batch_size=64, shuffle=True)


### Neural modules

Defines the audio encoder MLP and gated fusion module that will combine lyric and audio representations.

In [None]:
class AudioEmotionEncoder(nn.Module):
    def __init__(self, input_dim: int, embedding_dim: int, hidden_dims: Tuple[int, ...] = (256, 512)):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for hidden in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(p=0.1))
            prev_dim = hidden
        layers.append(nn.Linear(prev_dim, embedding_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class GatedFusion(nn.Module):
    def __init__(self, embedding_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(embedding_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim),
        )

    def forward(self, lyric_emb: torch.Tensor, audio_emb: torch.Tensor):
        concat = torch.cat([lyric_emb, audio_emb], dim=-1)
        gate_logits = self.gate(concat)
        gate = torch.sigmoid(gate_logits)
        fused = gate * lyric_emb + (1.0 - gate) * audio_emb
        return fused, gate
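
To make the blend in `GatedFusion.forward` concrete, here is a small NumPy sketch of the same formula with hand-picked gate logits. In the real module the logits come from the gate network; the values below are illustrative only:

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def fuse(lyric: np.ndarray, audio: np.ndarray, gate_logits: np.ndarray):
    """Element-wise convex blend: gate near 1 keeps lyric, gate near 0 keeps audio."""
    gate = sigmoid(gate_logits)
    fused = gate * lyric + (1.0 - gate) * audio
    return fused, gate


lyric = np.array([1.0, 0.0, 2.0])
audio = np.array([0.0, 1.0, 4.0])
logits = np.array([20.0, -20.0, 0.0])  # strongly lyric, strongly audio, undecided

fused, gate = fuse(lyric, audio, logits)
print(gate)
print(fused)
```

Because the gate has one value per embedding dimension, each dimension of the fused vector can lean toward a different modality.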


### Training utilities

Provides a training loop for the audio encoder that minimizes MSE between predicted and target lyric embeddings.

In [None]:
def train_audio_encoder(
    model: nn.Module,
    dataloader: DataLoader,
    epochs: int = 20,
    lr: float = 1e-3,
    device: str | None = None,
    checkpoint_path: Path | None = None,
):
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.MSELoss()

    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for features, targets in dataloader:
            features = features.to(device)
            targets = targets.to(device)

            optimizer.zero_grad()
            preds = model(features)
            loss = criterion(preds, targets)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * features.size(0)
        avg_loss = epoch_loss / len(dataloader.dataset)
        print(f"[Audio] epoch={epoch+1} loss={avg_loss:.4f}")

    if checkpoint_path is not None:
        torch.save(model.state_dict(), checkpoint_path)
        print(f"Saved audio encoder to {checkpoint_path}")

    return model
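
The loop above reports training MSE only. Since the embeddings are intended for retrieval, mean cosine similarity between predicted and target embeddings may be a useful complementary check, ideally on held-out songs. A minimal sketch, assuming predictions and targets are available as NumPy arrays; `mean_cosine_similarity` is a hypothetical helper, not part of the notebook:

```python
import numpy as np


def mean_cosine_similarity(preds: np.ndarray, targets: np.ndarray, eps: float = 1e-8) -> float:
    """Average row-wise cosine similarity between two equally shaped 2-D arrays."""
    pred_norm = np.maximum(np.linalg.norm(preds, axis=1, keepdims=True), eps)
    target_norm = np.maximum(np.linalg.norm(targets, axis=1, keepdims=True), eps)
    cosines = ((preds / pred_norm) * (targets / target_norm)).sum(axis=1)
    return float(cosines.mean())


# Toy check: one row points the same way as its target, the other points the opposite way.
toy_preds = np.array([[1.0, 0.0], [0.0, 2.0]])
toy_targets = np.array([[2.0, 0.0], [0.0, -1.0]])

score_mixed = mean_cosine_similarity(toy_preds, toy_targets)
score_same = mean_cosine_similarity(toy_preds, toy_preds)
print(score_mixed, score_same)
```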


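Provides a training loop for the gated fusion module. The loss combines the MSE between the fused and lyric embeddings with a 0.1-weighted penalty that pulls the gate toward 1, so training as written favors the lyric signal.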

In [None]:
def train_gated_fusion(
    fusion: GatedFusion,
    lyric_embeddings: torch.Tensor,
    audio_embeddings: torch.Tensor,
    epochs: int = 10,
    lr: float = 1e-3,
    batch_size: int = 128,
    device: str | None = None,
    checkpoint_path: Path | None = None,
):
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    fusion = fusion.to(device)

    dataset = torch.utils.data.TensorDataset(lyric_embeddings, audio_embeddings)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    optimizer = torch.optim.AdamW(fusion.parameters(), lr=lr)
    mse_loss = nn.MSELoss()

    for epoch in range(epochs):
        fusion.train()
        epoch_loss = 0.0
        for lyric_batch, audio_batch in dataloader:
            lyric_batch = lyric_batch.to(device)
            audio_batch = audio_batch.to(device)

            optimizer.zero_grad()
            fused, gate = fusion(lyric_batch, audio_batch)
            loss = mse_loss(fused, lyric_batch) + 0.1 * mse_loss(gate, torch.ones_like(gate))
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * lyric_batch.size(0)
        avg_loss = epoch_loss / len(dataloader.dataset)
        print(f"[Fusion] epoch={epoch+1} loss={avg_loss:.4f}")

    if checkpoint_path is not None:
        torch.save(fusion.state_dict(), checkpoint_path)
        print(f"Saved fusion module to {checkpoint_path}")

    return fusion
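
The objective used in `train_gated_fusion` can be evaluated by hand for a few fixed gate values to see what it rewards. A NumPy sketch of the same two-term loss on toy vectors; the function name `fusion_loss` and the example values are illustrative:

```python
import numpy as np


def fusion_loss(lyric: np.ndarray, audio: np.ndarray, gate: np.ndarray, gate_weight: float = 0.1) -> float:
    """MSE(fused, lyric) + gate_weight * MSE(gate, 1), mirroring the training loop."""
    fused = gate * lyric + (1.0 - gate) * audio
    reconstruction = np.mean((fused - lyric) ** 2)
    gate_penalty = np.mean((gate - 1.0) ** 2)
    return float(reconstruction + gate_weight * gate_penalty)


lyric = np.array([1.0, 2.0])
audio = np.array([3.0, 0.0])

loss_all_lyric = fusion_loss(lyric, audio, np.array([1.0, 1.0]))
loss_half = fusion_loss(lyric, audio, np.array([0.5, 0.5]))
loss_all_audio = fusion_loss(lyric, audio, np.array([0.0, 0.0]))
print(loss_all_lyric, loss_half, loss_all_audio)
```

Both terms reach zero when the gate is 1 everywhere, so this objective alone does not push the module to use the audio embedding. If a more balanced blend is wanted, a different target or an extra loss term would be needed; that choice is outside what the notebook specifies.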


### Training workflow (execute after caching lyric embeddings)

Outlines an end-to-end training script showing how to fit the audio encoder, train fusion, and save resulting artifacts.

In [None]:
# Example script:
embedding_dim = lyric_embeddings.shape[1]
dataset = AudioToLyricDataset(scaled_features.astype(np.float32), lyric_embeddings.astype(np.float32))
train_loader = DataLoader(dataset, batch_size=128, shuffle=True)

audio_encoder = AudioEmotionEncoder(input_dim=len(ACOUSTIC_FEATURES), embedding_dim=embedding_dim)
audio_encoder = train_audio_encoder(
    audio_encoder,
    train_loader,
    epochs=30,
    lr=5e-4,
    checkpoint_path=FUSION_ARTIFACTS / "audio_encoder.pt",
)

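# Switch to eval mode (disables dropout) and run on the device the encoder was trained on.
audio_encoder.eval()
encoder_device = next(audio_encoder.parameters()).device
with torch.no_grad():
    audio_embeddings = audio_encoder(
        torch.from_numpy(scaled_features).float().to(encoder_device)
    ).cpu()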

fusion_module = GatedFusion(embedding_dim=embedding_dim)
fusion_module = train_gated_fusion(
    fusion_module,
    torch.from_numpy(lyric_embeddings).float(),
    audio_embeddings,
    epochs=15,
    lr=1e-3,
    checkpoint_path=FUSION_ARTIFACTS / "gated_fusion.pt",
)

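# Inputs must live on the same device as the trained fusion module.
fusion_module.eval()
fusion_device = next(fusion_module.parameters()).device
with torch.no_grad():
    fused_embeddings, gates = fusion_module(
        torch.from_numpy(lyric_embeddings).float().to(fusion_device),
        audio_embeddings.to(fusion_device),
    )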
fused_embeddings = fused_embeddings.cpu().numpy()
np.save(FUSION_ARTIFACTS / "fused_song_embeddings.npy", fused_embeddings)
np.save(FUSION_ARTIFACTS / "fusion_gates.npy", gates.cpu().numpy())
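
With close to a million songs, pushing every row through a model in a single forward pass may exhaust memory on some machines. A hedged sketch of chunked inference, assuming a hypothetical `encode_in_batches` helper (not part of the notebook) and a toy model standing in for the trained audio encoder:

```python
import torch
from torch import nn


@torch.no_grad()
def encode_in_batches(model: nn.Module, inputs: torch.Tensor, batch_size: int = 4096) -> torch.Tensor:
    """Run `model` over `inputs` in chunks on the model's device; return CPU outputs."""
    device = next(model.parameters()).device
    was_training = model.training
    model.eval()  # disable dropout so outputs are deterministic
    chunks = []
    for start in range(0, inputs.shape[0], batch_size):
        batch = inputs[start:start + batch_size].to(device)
        chunks.append(model(batch).cpu())
    model.train(was_training)  # restore the caller's mode
    return torch.cat(chunks, dim=0)


# Toy stand-in for the audio encoder: 12 input features, 4-dimensional embeddings.
toy_model = nn.Sequential(nn.Linear(12, 8), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(8, 4))
toy_inputs = torch.randn(10, 12)

toy_outputs = encode_in_batches(toy_model, toy_inputs, batch_size=3)
print(toy_outputs.shape)
```

The batch size only affects peak memory, not the result, because the model is in eval mode while encoding.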


### Metadata export

Serializes song ids, titles, artists, and lyrics to JSON so retrieval components can access consistent metadata.

In [None]:
import json

metadata = {
    "song_ids": non_empty["id"].tolist(),
    "titles": non_empty["name"].tolist(),
    "artists": non_empty["artists"].tolist(),
    "lyrics": non_empty["lyrics"].astype(str).tolist(),
}

with open(FUSION_ARTIFACTS / "song_metadata.json", "w", encoding="utf-8") as fp:
    json.dump(metadata, fp, indent=2)
print("Saved song metadata for retrieval.")
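
A downstream retrieval component would likely want to confirm that every metadata list has one entry per embedding row before indexing into it. A minimal sketch, assuming a hypothetical `load_metadata` helper and a tiny temporary file in place of the real `song_metadata.json`:

```python
import json
import tempfile
from pathlib import Path


def load_metadata(path: Path, expected_rows: int) -> dict:
    """Load the metadata JSON and verify every list has `expected_rows` entries."""
    with open(path, encoding="utf-8") as fp:
        metadata = json.load(fp)
    mismatched = {key: len(values) for key, values in metadata.items() if len(values) != expected_rows}
    if mismatched:
        raise ValueError(f"Metadata lists do not match {expected_rows} rows: {mismatched}")
    return metadata


# Toy file with the same keys as the export above; the values are made up.
toy_metadata = {
    "song_ids": ["a1", "b2"],
    "titles": ["First", "Second"],
    "artists": ["X", "Y"],
    "lyrics": ["la la", "na na"],
}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "song_metadata.json"
    toy_path.write_text(json.dumps(toy_metadata), encoding="utf-8")
    loaded = load_metadata(toy_path, expected_rows=2)
    try:
        load_metadata(toy_path, expected_rows=3)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(loaded["titles"], mismatch_detected)
```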

### Next steps
- Finish training the audio encoder and gated fusion module.
- Confirm `fused_song_embeddings.npy` exists for the retrieval notebook.
- Optionally analyse `fusion_gates.npy` to see modality balance per song.
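
The optional gate analysis could look something like the following sketch. It uses a small synthetic array in place of `fusion_gates.npy` and relies on the `GatedFusion` definition, where gate values near 1 weight the lyric embedding and values near 0 weight the audio embedding. The helper name `summarize_gates` and the 0.5 threshold are illustrative choices:

```python
import numpy as np


def summarize_gates(gates: np.ndarray, threshold: float = 0.5):
    """Return the per-song mean gate and the share of songs whose mean gate exceeds `threshold`."""
    per_song = gates.mean(axis=1)  # average over embedding dimensions
    lyric_dominant_share = float((per_song > threshold).mean())
    return per_song, lyric_dominant_share


# Synthetic stand-in: 3 songs, 3 embedding dimensions.
toy_gates = np.array(
    [
        [0.9, 0.8, 0.7],  # leans on lyrics
        [0.2, 0.4, 0.3],  # leans on audio
        [0.5, 0.5, 0.5],  # balanced
    ]
)

per_song, lyric_dominant_share = summarize_gates(toy_gates)
print(per_song, lyric_dominant_share)
```

On the real data, `toy_gates` would be replaced by `np.load(FUSION_ARTIFACTS / "fusion_gates.npy")`.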