# Generate Tags Dataset for Caption Creation

This notebook generates a synthetic dataset of music tags for caption creation. It samples tags from four predefined categories (instruments, mood, tempo, genre), weighting each tag by its frequency in the MusicCaps dataset and drawing the number of tags per category from the distribution observed there.

In [1]:
import json
import ast
import pandas as pd
import numpy as np
from pathlib import Path
from collections import Counter, defaultdict
import itertools
import random
from datasets import load_dataset

## Load Existing MusicCaps Data

Load the MusicCaps dataset and parse each `aspect_list` string into a Python list of tags.

In [2]:
ds = load_dataset("google/MusicCaps")
df = ds['train'].to_pandas()
df['aspect_list_transformed'] = df['aspect_list'].apply(ast.literal_eval)

## Define Tag Categories

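Assign each sample's tags to the four categories using the concept lists in `concepts_to_tags.json`. A song tag belongs to a category when it contains one of that category's concept tags as a substring.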

In [3]:
def extract_tags(song_tags, concept_tags):
    res = []
    for c_tag in concept_tags:
        for s_tag in song_tags:
            if c_tag in s_tag:
                res.append(s_tag)
    return list(set(res))

In [4]:
# Use a context manager so the file handle is closed after loading
with open("../data/concepts_to_tags.json", "r") as f:
    concepts = json.load(f)

for concept, tags in concepts.items():
    df[concept + '_tags'] = df['aspect_list_transformed'].apply(
        lambda x: extract_tags(x, tags)
    )
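
Because `extract_tags` tests substring containment (`c_tag in s_tag`), a short concept tag can also match longer song tags that belong to another category. This may explain why `electronic drums` shows up among the top genres in the statistics that follow. The sketch below reproduces the effect on toy data; the concept list is hypothetical and is not the content of `concepts_to_tags.json`.

```python
def extract_tags(song_tags, concept_tags):
    """Return the song tags that contain any concept tag as a substring."""
    res = []
    for c_tag in concept_tags:
        for s_tag in song_tags:
            if c_tag in s_tag:
                res.append(s_tag)
    return list(set(res))

# Hypothetical concept list, for illustration only
genre_concepts = ["rock", "electronic"]
song_tags = ["electronic drums", "rock", "male voice"]

# Substring matching pulls in "electronic drums" as a genre
matched = sorted(extract_tags(song_tags, genre_concepts))
print(matched)  # ['electronic drums', 'rock']

# A stricter variant (an assumption, not the notebook's method): exact match only
exact = sorted(t for t in song_tags if t in set(genre_concepts))
print(exact)  # ['rock']
```

Exact matching avoids the cross-category hits but would also drop legitimate variants such as a tag that merely extends a concept, so which rule is preferable depends on how the concept lists were built.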

## Analyze Tag Statistics

Calculate individual tag frequencies and the number of tags per sample in each category.

In [5]:
# Calculate tag frequencies
instrument_counts = Counter()
mood_counts = Counter()
genre_counts = Counter()
tempo_counts = Counter()

for _, row in df.iterrows():
    for tag in row['instrument_tags']:
        instrument_counts[tag.strip()] += 1
    for tag in row['mood_tags']:
        mood_counts[tag.strip()] += 1
    for tag in row['genre_tags']:
        genre_counts[tag.strip()] += 1
    for tag in row['tempo_tags']:
        tempo_counts[tag.strip()] += 1

print("Top instruments:", instrument_counts.most_common(5))
print("Top moods:", mood_counts.most_common(5))
print("Top genres:", genre_counts.most_common(5))
print("Top tempos:", tempo_counts.most_common(5))

Top instruments: [('acoustic drums', 393), ('bass guitar', 297), ('electric guitar', 297), ('acoustic guitar', 290), ('piano', 239)]
Top moods: [('emotional', 620), ('energetic', 593), ('passionate', 592), ('happy', 225), ('romantic', 177)]
Top genres: [('rock', 207), ('pop', 197), ('electronic drums', 188), ('electronic music', 128), ('dance music', 117)]
Top tempos: [('medium tempo', 595), ('slow tempo', 473), ('fast tempo', 381), ('groovy', 321), ('uptempo', 247)]
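
The same per-tag frequencies can be computed without an explicit row loop. The sketch below uses pandas `explode` and `value_counts` on a toy column; the DataFrame is a stand-in, not the MusicCaps data.

```python
import pandas as pd

# Toy stand-in for the `instrument_tags` column (one list of tags per row)
toy = pd.DataFrame({
    "instrument_tags": [["piano", "bass guitar"], ["piano "], []],
})

# explode() yields one row per tag; an empty list becomes NaN and is dropped
tag_counts = (
    toy["instrument_tags"]
    .explode()
    .dropna()
    .str.strip()
    .value_counts()
)
print(tag_counts.to_dict())  # {'piano': 2, 'bass guitar': 1}
```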


In [6]:
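# Calculate number of instruments per sample
# Caveat: the list is stringified and split on commas, so an empty list counts
# as 1 and a tag containing a comma counts as 2. len(row['instrument_tags'])
# would count exactly, but would change the distribution printed below.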
instrument_num_counts = Counter()
for _, row in df.iterrows():
    num_instruments = len([tag.strip() for tag in str(row['instrument_tags']).split(',')])
    instrument_num_counts[num_instruments] += 1

print("Distribution of number of instruments per sample:")
for num, count in sorted(instrument_num_counts.items()):
    print(f"  {num} instruments: {count} samples ({count/len(df)*100:.1f}%)")

Distribution of number of instruments per sample:
  1 instruments: 1842 samples (33.4%)
  2 instruments: 1127 samples (20.4%)
  3 instruments: 1046 samples (18.9%)
  4 instruments: 841 samples (15.2%)
  5 instruments: 451 samples (8.2%)
  6 instruments: 158 samples (2.9%)
  7 instruments: 38 samples (0.7%)
  8 instruments: 13 samples (0.2%)
  9 instruments: 4 samples (0.1%)
  11 instruments: 1 samples (0.0%)
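
The counts in this section come from `len(str(tags).split(','))` rather than `len(tags)`. The sketch below compares the two on toy inputs to show where they diverge; the example lists are made up for illustration.

```python
def count_via_str(tags):
    """Counting approach used in this section: stringify the list, split on commas."""
    return len(str(tags).split(','))

examples = {
    "empty list": [],
    "two tags": ["piano", "bass guitar"],
    "tag containing a comma": ["guitar, distorted"],
}

for name, tags in examples.items():
    print(f"{name}: str-split={count_via_str(tags)}, len={len(tags)}")
# empty list: str-split=1, len=0
# two tags: str-split=2, len=2
# tag containing a comma: str-split=2, len=1
```

In practice this means samples with no tags in a category are folded into the "1" bucket, so the generator never produces an empty category.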


In [7]:
# Calculate number of moods per sample
# Note: counted via str(...).split(','), so a sample with no mood tags counts as 1
mood_num_counts = Counter()
for _, row in df.iterrows():
    num_moods = len([tag.strip() for tag in str(row['mood_tags']).split(',')])
    mood_num_counts[num_moods] += 1

print("Distribution of number of moods per sample:")
for num, count in sorted(mood_num_counts.items()):
    print(f"  {num} moods: {count} samples ({count/len(df)*100:.1f}%)")

Distribution of number of moods per sample:
  1 moods: 3555 samples (64.4%)
  2 moods: 955 samples (17.3%)
  3 moods: 528 samples (9.6%)
  4 moods: 304 samples (5.5%)
  5 moods: 111 samples (2.0%)
  6 moods: 43 samples (0.8%)
  7 moods: 19 samples (0.3%)
  8 moods: 3 samples (0.1%)
  9 moods: 2 samples (0.0%)
  10 moods: 1 samples (0.0%)


In [8]:
# Calculate number of genres per sample
# Note: counted via str(...).split(','), so a sample with no genre tags counts as 1
genre_num_counts = Counter()
for _, row in df.iterrows():
    num_genres = len([tag.strip() for tag in str(row['genre_tags']).split(',')])
    genre_num_counts[num_genres] += 1

print("Distribution of number of genres per sample:")
for num, count in sorted(genre_num_counts.items()):
    print(f"  {num} genres: {count} samples ({count/len(df)*100:.1f}%)")

Distribution of number of genres per sample:
  1 genres: 4142 samples (75.0%)
  2 genres: 674 samples (12.2%)
  3 genres: 353 samples (6.4%)
  4 genres: 175 samples (3.2%)
  5 genres: 87 samples (1.6%)
  6 genres: 34 samples (0.6%)
  7 genres: 26 samples (0.5%)
  8 genres: 15 samples (0.3%)
  9 genres: 9 samples (0.2%)
  10 genres: 5 samples (0.1%)
  13 genres: 1 samples (0.0%)


In [9]:
# Calculate number of tempos per sample
# Note: counted via str(...).split(','), so a sample with no tempo tags counts as 1
tempo_num_counts = Counter()
for _, row in df.iterrows():
    num_tempos = len([tag.strip() for tag in str(row['tempo_tags']).split(',')])
    tempo_num_counts[num_tempos] += 1

print("Distribution of number of tempos per sample:")
for num, count in sorted(tempo_num_counts.items()):
    print(f"  {num} tempos: {count} samples ({count/len(df)*100:.1f}%)")

Distribution of number of tempos per sample:
  1 tempos: 4619 samples (83.7%)
  2 tempos: 547 samples (9.9%)
  3 tempos: 221 samples (4.0%)
  4 tempos: 94 samples (1.7%)
  5 tempos: 30 samples (0.5%)
  6 tempos: 8 samples (0.1%)
  7 tempos: 2 samples (0.0%)


## Build Probability Distributions

Create weighted probability distributions based on tag frequencies.

In [10]:
# Convert counts to probability distributions
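def counts_to_probs(counts, tags):
    """Convert counts to relative sampling weights for the given tags.

    The weights are divided by the total count over all observed tags, so they
    need not sum to 1; sample_tags renormalizes them before sampling.
    """
    total = sum(counts.values())
    # Tags without an exact match in the counts get a pseudo-count of 1
    return {tag: counts.get(tag, 1) / total for tag in tags}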

instrument_probs = counts_to_probs(instrument_counts, concepts['instrument'])
mood_probs = counts_to_probs(mood_counts, concepts['mood'])
genre_probs = counts_to_probs(genre_counts, concepts['genre'])
tempo_probs = counts_to_probs(tempo_counts, concepts['tempo'])

# Distributions for the number of tags per category
total_samples = len(df)
num_instruments_probs = {num: count/total_samples for num, count in instrument_num_counts.items()}
num_moods_probs = {num: count/total_samples for num, count in mood_num_counts.items()}
num_genres_probs = {num: count/total_samples for num, count in genre_num_counts.items()}
num_tempos_probs = {num: count/total_samples for num, count in tempo_num_counts.items()}

print("Probability distributions created")

Probability distributions created
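
One detail of `counts_to_probs` is worth a small worked example: the counts are keyed by the matched song tags, while the lookup uses the concept tags, and the divisor is the total over all counted tags. The values returned therefore need not sum to 1, which is harmless as long as they are renormalized before sampling. The toy counts below are invented for illustration.

```python
from collections import Counter

counts = Counter({"piano": 6, "grand piano": 3})  # toy counts keyed by song tags
tags = ["piano", "harp"]                          # toy concept tags; "harp" is unseen

# Same arithmetic as counts_to_probs: pseudo-count 1, divide by the overall total
total = sum(counts.values())
weights = {tag: counts.get(tag, 1) / total for tag in tags}
print(sum(weights.values()))  # 7/9, not 1

# Renormalizing over the candidate tags gives a proper distribution
z = sum(weights.values())
probs = {tag: w / z for tag, w in weights.items()}
print(probs)  # piano: 6/7, harp: 1/7
```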


## Generate New Tags Dataset

Generate new samples with realistic tag combinations.

In [11]:
def sample_tags(tags, probs, num_samples=1, temperature=1.0):
    """Sample tags with temperature-controlled randomness.
    
    Args:
        tags: List of available tags
        probs: Dictionary of tag probabilities
        num_samples: Number of distinct tags to sample (drawn without replacement)
        temperature: Controls randomness (lower = more deterministic, higher = more random)
    """
    # Apply temperature to probabilities
    prob_values = np.array([probs[tag] for tag in tags])
    prob_values = prob_values ** (1 / temperature)
    prob_values = prob_values / prob_values.sum()
    
    selected = np.random.choice(tags, size=num_samples, replace=False, p=prob_values)
    return selected.tolist()

def sample_num_from_distribution(num_probs, temperature=1.0):
    """Sample a number from a discrete distribution with temperature control.
    
    Args:
        num_probs: Dictionary mapping numbers to their probabilities
        temperature: Controls randomness (lower = more deterministic, higher = more random)
    """
    nums = list(num_probs.keys())
    prob_values = np.array(list(num_probs.values()))
    prob_values = prob_values ** (1 / temperature)
    prob_values = prob_values / prob_values.sum()
    
    selected_num = np.random.choice(nums, p=prob_values)
    return selected_num

def generate_sample(variety_factor=0.5):
    """Generate a single sample with tags.
    
    Args:
        variety_factor: Controls randomness. 0 gives temperature 1.0 (the empirical
            distribution); 1 gives temperature 3.0 (a much flatter distribution).
    """
    # Temperature increases with variety_factor
    temp = 1.0 + variety_factor * 2.0
    
    num_instruments = sample_num_from_distribution(num_instruments_probs, temp)
    num_moods = sample_num_from_distribution(num_moods_probs, temp)
    num_genres = sample_num_from_distribution(num_genres_probs, temp)
    num_tempos = sample_num_from_distribution(num_tempos_probs, temp)
    
    # Sample tags from each category
    selected_instruments = sample_tags(concepts['instrument'], instrument_probs, num_instruments, temp)
    selected_moods = sample_tags(concepts['mood'], mood_probs, num_moods, temp)
    selected_genres = sample_tags(concepts['genre'], genre_probs, num_genres, temp)
    selected_tempos = sample_tags(concepts['tempo'], tempo_probs, num_tempos, temp)
    
    # Combine all tags
    all_tags = selected_instruments + selected_moods + selected_genres + selected_tempos
    
    return {
        'instrument_tags': ', '.join(selected_instruments),
        'mood_tags': ', '.join(selected_moods),
        'genre_tags': ', '.join(selected_genres),
        'tempo_tags': ', '.join(selected_tempos),
        'aspect_list': ', '.join(all_tags)
    }

# Test generation
print("Sample with low variety:")
print(generate_sample(variety_factor=0.2))
print("\nSample with high variety:")
print(generate_sample(variety_factor=0.8))

Sample with low variety:
{'instrument_tags': 'male voice, drums', 'mood_tags': 'passionate', 'genre_tags': 'dance, pop song', 'tempo_tags': 'moderate tempo', 'aspect_list': 'male voice, drums, passionate, dance, pop song, moderate tempo'}

Sample with high variety:
{'instrument_tags': 'groovy bass, steady drumming rhythm, male voice, female singer', 'mood_tags': 'inspiring', 'genre_tags': 'rock, classical music, pop song, r&b, movie music', 'tempo_tags': 'dance groove', 'aspect_list': 'groovy bass, steady drumming rhythm, male voice, female singer, inspiring, rock, classical music, pop song, r&b, movie music, dance groove'}
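
The temperature step used by both sampling functions rescales each probability as `p ** (1 / T)` and renormalizes. The sketch below isolates that step on a toy distribution to show how `variety_factor` flattens it; `apply_temperature` is a helper name introduced here for illustration.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector as p ** (1 / T), then renormalize."""
    scaled = np.asarray(probs, dtype=float) ** (1.0 / temperature)
    return scaled / scaled.sum()

p = np.array([0.7, 0.2, 0.1])  # toy distribution

for variety_factor in (0.0, 0.5, 1.0):
    temp = 1.0 + variety_factor * 2.0  # same mapping as generate_sample
    print(variety_factor, temp, apply_temperature(p, temp).round(3))
```

At `variety_factor=0` the temperature is 1 and the distribution is unchanged, so sampling is still random. Larger values move probability mass from frequent tags toward rare ones.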


In [17]:
# Generate dataset with varying variety
def generate_dataset(num_samples=1000, seed=42):
    """Generate a complete tags dataset."""
    np.random.seed(seed)
    random.seed(seed)
    
    samples = []
    for i in range(num_samples):
        # Vary the variety factor across samples
        variety = random.uniform(0.2, 0.8)
        sample = generate_sample(variety_factor=variety)
        sample['id'] = f"sample_{i:04d}"
        samples.append(sample)
    
    return pd.DataFrame(samples)

# Generate datasets
train_size = 1000
val_size = 100
test_size = 100

print(f"Generating {train_size} training samples...")
train_df = generate_dataset(train_size, seed=42)

print(f"Generating {val_size} validation samples...")
val_df = generate_dataset(val_size, seed=43)

print(f"Generating {test_size} test samples...")
test_df = generate_dataset(test_size, seed=44)

print("\nDataset generation complete!")
print(f"Train: {len(train_df)} samples")
print(f"Validation: {len(val_df)} samples")
print(f"Test: {len(test_df)} samples")

Generating 1000 training samples...
Generating 100 validation samples...
Generating 100 test samples...

Dataset generation complete!
Train: 1000 samples
Validation: 100 samples
Test: 100 samples
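
`generate_dataset` makes runs repeatable by seeding the global `random` and `np.random` state. An alternative, shown here as a sketch and not as the notebook's approach, is to pass a local `numpy.random.Generator` so that unrelated code cannot disturb the sequence. The function name and toy data are assumptions.

```python
import numpy as np

def sample_without_replacement(tags, weights, k, rng):
    """Draw k distinct tags using a local Generator instead of the global NumPy state."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    return rng.choice(tags, size=k, replace=False, p=p).tolist()

tags = ["piano", "bass guitar", "acoustic drums", "cello"]  # toy tag list
weights = [4, 3, 2, 1]                                      # toy counts

first = sample_without_replacement(tags, weights, 2, np.random.default_rng(42))
second = sample_without_replacement(tags, weights, 2, np.random.default_rng(42))
print(first == second)  # True: same seed, same draw
```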


## Preview Generated Data

In [18]:
print("First 10 training samples:")
train_df.head(10)

First 10 training samples:


| | instrument_tags | mood_tags | genre_tags | tempo_tags | aspect_list | id |
|---|---|---|---|---|---|---|
| 0 | punchy kick | happy, passionate, scary, eerie, uplifting, em... | ambient sounds, dance, soulful | fast tempo, uptempo, dance groove | punchy kick, happy, passionate, scary, eerie, ... | sample_0000 |
| 1 | e-bass | fun | hip hop | rhythmic patter | e-bass, fun, hip hop, rhythmic patter | sample_0001 |
| 2 | no percussion, punchy kick, electric guitar, v... | chaotic | classical music | slow tempo | no percussion, punchy kick, electric guitar, v... | sample_0002 |
| 3 | acoustic drums, shimmering shakers, male voice... | mellow | pop song, chill | slow tempo | acoustic drums, shimmering shakers, male voice... | sample_0003 |
| 4 | keyboard accompaniment, shimmering cymbals, ba... | mysterious, calming, fun, sad | classical, western classical music, electronic... | slow, beatboxing, rising pattern, strong drumm... | keyboard accompaniment, shimmering cymbals, ba... | sample_0004 |
| 5 | male vocalist, no voices, bass guitar | cheerful | folk, electronic | groovy piano chords, groovy bassline, groovy r... | male vocalist, no voices, bass guitar, cheerfu... | sample_0005 |
| 6 | no singer, groovy bass line, keyboard harmony,... | energetic, emotional, inspiring, intense | chill, latin dance music, ambient, reggae | syncopated, uptempo | no singer, groovy bass line, keyboard harmony,... | sample_0006 |
| 7 | piano accompaniment, no percussion, steady dru... | heavy metal | classical, meditation | medium to uptempo, groovy double bass, afrobea... | piano accompaniment, no percussion, steady dru... | sample_0007 |
| 8 | clapping, acoustic drums, keyboard | soothing | dance | slow tempo, uptempo | clapping, acoustic drums, keyboard, soothing, ... | sample_0008 |
| 9 | percussion, cello | dark | dance | moderate tempo, fast tempo, medium tempo | percussion, cello, dark, dance, moderate tempo... | sample_0009 |


In [19]:
# Analyze generated dataset statistics
print("Generated dataset statistics:")
print("\nInstrument distribution:")
instrument_gen_counts = Counter()
for _, row in train_df.iterrows():
    for tag in row['instrument_tags'].split(', '):
        instrument_gen_counts[tag.strip()] += 1
print(instrument_gen_counts.most_common(10))

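# value_counts() on the joined string counts exact tag combinations,
# not individual tags (unlike the instrument counts above)
print("\nMood distribution:")
print(train_df['mood_tags'].value_counts().head(10))

print("\nGenre distribution:")
print(train_df['genre_tags'].value_counts().head(10))

print("\nTempo distribution:")
print(train_df['tempo_tags'].value_counts())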

print("\nNumber of instruments per sample:")
instrument_counts_gen = train_df['instrument_tags'].apply(lambda x: len(x.split(', '))).value_counts().sort_index()
for num, count in instrument_counts_gen.items():
    print(f"  {num} instruments: {count} samples ({count/len(train_df)*100:.1f}%)")

Generated dataset statistics:

Instrument distribution:
[('acoustic drums', 110), ('bass guitar', 96), ('male vocal', 94), ('punchy kick', 93), ('piano', 93), ('acoustic guitar', 91), ('electric guitar', 90), ('male voice', 90), ('punchy snare', 86), ('bass', 77)]

Mood distribution:
mood_tags
passionate     20
energetic      19
sentimental    17
calming        17
emotional      15
soft           14
soothing       13
exciting       12
happy          12
melancholic    11
Name: count, dtype: int64

Genre distribution:
genre_tags
soulful             24
pop                 19
electronic music    19
pop rock            15
rock                14
classical music     14
dance music         14
blues               13
jazz                12
classical           12
Name: count, dtype: int64

Tempo distribution:
tempo_tags
medium tempo                                                         68
slow tempo                                                           46
fast tempo                         
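
The mood, genre, and tempo tables in this cell count whole comma-joined strings, so `happy` and `happy, sad` are separate keys. If per-tag frequencies are wanted, as for the instruments, one option is to split and explode the column first. The sketch below uses a toy column in the same format.

```python
import pandas as pd

# Toy stand-in for the generated `mood_tags` column (comma-joined strings)
toy = pd.DataFrame({"mood_tags": ["happy", "happy, sad", "sad, calm, happy"]})

# Counts exact strings: each distinct combination is its own key
combo_counts = toy["mood_tags"].value_counts()

# Counts individual tags: split each string, then one row per tag
per_tag_counts = toy["mood_tags"].str.split(", ").explode().value_counts()

print(combo_counts.to_dict())
print(per_tag_counts.to_dict())  # {'happy': 3, 'sad': 2, 'calm': 1}
```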

## Save Generated Dataset

In [20]:
# Create output directory
output_dir = Path("../data/generated_tags")
output_dir.mkdir(parents=True, exist_ok=True)

# Save datasets
train_df.to_csv(output_dir / "train.csv", index=False)
val_df.to_csv(output_dir / "validation.csv", index=False)
test_df.to_csv(output_dir / "test.csv", index=False)

# Note: ids restart at sample_0000 in every split, so they are not unique in all.csv
all_df = pd.concat([train_df, val_df, test_df])
all_df.to_csv(output_dir / "all.csv", index=False)

print(f"Datasets saved to {output_dir}")
print(f"  - train.csv: {len(train_df)} samples")
print(f"  - validation.csv: {len(val_df)} samples")
print(f"  - test.csv: {len(test_df)} samples")

Datasets saved to ..\data\generated_tags
  - train.csv: 1000 samples
  - validation.csv: 100 samples
  - test.csv: 100 samples
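
Since the tag columns are stored as comma-joined strings, consumers of the CSV files need to split them again after loading. The sketch below round-trips a toy frame through an in-memory buffer and restores the lists. It also covers the case of an empty field, which pandas reads back as NaN; whether that case occurs in the generated files is an assumption made here for robustness.

```python
import io
import pandas as pd

# Toy rows in the same shape as the generated files; the second row has no mood tag
toy = pd.DataFrame({
    "mood_tags": ["happy, calm", ""],
    "id": ["sample_0000", "sample_0001"],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# An empty string is written as an empty field and read back as NaN
print(loaded["mood_tags"].isna().tolist())  # [False, True]

# Restore Python lists, mapping missing values to an empty list
loaded["mood_list"] = loaded["mood_tags"].fillna("").apply(
    lambda s: s.split(", ") if s else []
)
print(loaded["mood_list"].tolist())  # [['happy', 'calm'], []]
```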


## Upload to Hugging Face

In [21]:
data_files = {
    "train": str(output_dir / "train.csv"),
    "validation": str(output_dir / "validation.csv"),
    "test": str(output_dir / "test.csv")
}
dataset = load_dataset("csv", data_files=data_files)
dataset.push_to_hub("bsienkiewicz/random-tags-dataset", private=True)

CommitInfo(commit_url='https://huggingface.co/datasets/bsienkiewicz/random-tags-dataset/commit/eac2f962541c6e28afcdba6648b4b7c54b35670f', commit_message='Upload dataset', commit_description='', oid='eac2f962541c6e28afcdba6648b4b7c54b35670f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/bsienkiewicz/random-tags-dataset', endpoint='https://huggingface.co', repo_type='dataset', repo_id='bsienkiewicz/random-tags-dataset'), pr_revision=None, pr_num=None)