<a href="https://colab.research.google.com/github/IsitaRex/Vibe-Sorcery/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title Load the MTG Listening Models

#@markdown Have a look at how we get hold of and construct the pre-trained models
#@markdown from the MTG repository.

# Essentia for tagging the music

!pip install essentia-tensorflow

from essentia.standard import MonoLoader, TensorflowPredictEffnetDiscogs, TensorflowPredict2D
!wget https://essentia.upf.edu/models/music-style-classification/discogs-effnet/discogs-effnet-bs64-1.pb
!wget https://essentia.upf.edu/models/classification-heads/mtg_jamendo_moodtheme/mtg_jamendo_moodtheme-discogs-effnet-1.pb

embeddings_model = TensorflowPredictEffnetDiscogs(
    graphFilename="discogs-effnet-bs64-1.pb",
    output="PartitionedCall:1",
)

mood_classification_model = TensorflowPredict2D(
    graphFilename="mtg_jamendo_moodtheme-discogs-effnet-1.pb",
    output='model/Sigmoid',
)


Collecting essentia-tensorflow
  Downloading essentia_tensorflow-2.1b6.dev1110-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Downloading essentia_tensorflow-2.1b6.dev1110-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (291.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m291.4/291.4 MB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: essentia-tensorflow
Successfully installed essentia-tensorflow-2.1b6.dev1110
--2025-03-12 11:46:23--  https://essentia.upf.edu/models/music-style-classification/discogs-effnet/discogs-effnet-bs64-1.pb
Resolving essentia.upf.edu (essentia.upf.edu)... 84.89.139.43
Connecting to essentia.upf.edu (essentia.upf.edu)|84.89.139.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18366619 (18M) [application/octet-stream]
Saving to: ‘discogs-effnet-bs64-1.pb’


2025-03-12 11:46:40 (1.06 MB/s) - ‘discogs-effnet-bs64-1.pb’ saved [18366619/18366619]

-

https://huggingface.co/riffusion/riffusion-model-v1

In [2]:
#@title Functions for Using the Listening Models

#@markdown Also see how we use the models to embed an audio file into a latent
#@markdown space and then pass it through the annotation model to get a sequence
#@markdown of activation vectors that we then average over.

mood_tags = [
  "action", "adventure", "advertising", "background", "ballad", "calm",
  "children", "christmas", "commercial", "cool", "corporate",
  "dark", "deep", "documentary", "drama", "dramatic",
  "dream", "emotional", "energetic", "epic", "fast",
  "film", "fun", "funny", "game", "groovy",
  "happy", "heavy", "holiday", "hopeful", "inspiring",
  "love", "meditative", "melancholic", "melodic", "motivational",
  "movie", "nature", "party", "positive", "powerful",
  "relaxing", "retro", "romantic", "sad", "sexy",
  "slow", "soft", "soundscape", "space", "sport",
  "summer", "trailer", "travel", "upbeat", "uplifting"
]

def get_mood_activations_dict(wav_filepath):
  audio = MonoLoader(filename=wav_filepath, sampleRate=32000)()
  embeddings = embeddings_model(audio)
  activations = mood_classification_model(embeddings)
  activation_avs = []
  for i in range(0, len(activations[0])):
    vals = [activations[j][i] for j in range(0, len(activations))]
    # Note - this does the averaging bit
    activation_avs.append(sum(vals)/len(vals))
  activations_dict = {}
  for ind, tag in enumerate(mood_tags):
    activations_dict[tag] = activation_avs[ind]
  return activations_dict

In [8]:
def get_top_k_moods(mood_dict, k=5):
    """
    Returns the top k moods from a mood dictionary.

    Args:
        mood_dict (dict): A dictionary mapping mood tags to activation values.
        k (int, optional): The number of top moods to return. Defaults to 5.

    Returns:
        list: A list of the top k moods.
    """

    # Sort the mood dictionary by activation values in descending order
    sorted_moods = sorted(mood_dict.items(), key=lambda item: item[1], reverse=True)

    # Return the top k moods
    return [mood[0] for mood in sorted_moods[:k]]

In [3]:
import os

In [4]:
os.getcwd()

'/content'

In [5]:
import librosa
import librosa.display
import IPython.display as ipd

# Assuming '024.wav' is in the current directory
audio_path = '024.wav'

# Load the audio file
audio_data, sample_rate = librosa.load(audio_path)

# Display audio player widget
ipd.Audio(audio_data, rate=sample_rate)


In [7]:
mood_dict = get_mood_activations_dict(audio_path)

In [9]:
get_top_k_moods(mood_dict)

['dark', 'epic', 'action', 'space', 'soundscape']

In [21]:
# !huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineG

In [23]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="openai-community/gpt2")

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu


In [36]:
def generate_song_caption(moods, max_new_tokens=50):
    """
    Generates a song caption based on a list of moods using GPT-2.

    Args:
        moods (list): A list of mood words (e.g., ["happy", "energetic", "nostalgic"]).
        max_new_tokens (int): Maximum number of new tokens to generate.

    Returns:
        str: The generated song caption.
    """
    # Create a natural-sounding prompt
    mood_str = ", ".join(moods)
    prompt = f"What song caption that fits the moods: {mood_str} in 3 sentences wold you use?"
    # prompt = f"Create a song caption that fits the moods: {mood_str} in 3 sentences. For example, if the song is [\"happy\", \"dreamy\"] you can have a caption like \"A sun-kissed melody drifts through the sky, where golden light dances on cotton candy clouds, and every note feels like a gentle breeze of endless joy.\".\nCaption:"

    # Generate text using the pipeline
    output = pipe(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1, temperature=0.8, top_k=50, top_p=0.95, do_sample=True)

    # Extract generated text
    caption = output[0]["generated_text"].split("Caption:")[-1].strip()
    return caption

# Example Usage
moods = ['dark', 'epic', 'action']
caption = generate_song_caption(moods)
print("Generated Song Caption:", caption)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Song Caption: What song caption that fits the moods: dark, epic, action in 3 sentences wold you use?

This was one of the most interesting songs I've ever recorded. I thought it was a pretty interesting song, and that's probably what people are interested in. I'm a songwriter, and that's what I do.

We
