# SDS24 Generative AI for Well-being

## Workshop 2: Generation of relaxing and meditation sounds

## 1.) Introduction

Meditation is like a secret weapon against stress and feeling down.
When you meditate, you take a break from all the craziness around you and find peace in the moment.
By focusing on your breathing and thoughts, you relax your body and mind, letting go of stress.
Doing this regularly helps you control your emotions better and improves your well-being [1].
Introducing soothing meditation background music can further amplify these benefits by improving relaxation and concentration [2].

How cool would it be if we were able to create customized meditation background sounds based on our individual preferences?

In this workshop we will explore exactly that through a proof of concept.
To do so, we utilise a generative AI model that is able to generate sounds conditioned on text descriptions or audio samples.
Concretely, we use the MusicGen model.
We at BFH have fine-tuned this model to steer it towards an improved ability to produce relaxing sounds.

Let's get started!

PS: In the cells that follow you can play around with different parameters. The in-code comments explain how.


[1] Rubia, K. (2009). The neurobiology of meditation and its clinical effectiveness in psychiatric disorders. Biological psychology, 82(1), 1-11.

[2] Dvorak, A. L., & Hernandez-Ruiz, E. (2021). Comparison of music stimuli to support mindfulness meditation. Psychology of Music, 49(3), 498-512.


#### Important note before you continue: The notebook runs much faster when executed on a GPU. Thus, make sure to change its runtime type accordingly.

First we install and import the required libraries and define a few globals.

In [None]:
!pip install wget


In [None]:
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile  # Explicit submodule import; the name `scipy` is still bound as before.
from IPython.display import Audio
import numpy as np
import torch
import random
import matplotlib.pyplot as plt
import wget


In [None]:
# The music models used here expect audio sampled at 32 kHz. You must NOT change this.
SAMPLING_RATE = 32_000

In [None]:
# Setting a seed before every execution makes results reproducible.
def set_global_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
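
To see why seeding matters, here is a minimal sketch that uses only Python's built-in `random` module. The helper name `draw_samples` is made up for this illustration; `set_global_seeds` above applies the same idea to NumPy and PyTorch as well.

```python
import random


def draw_samples(seed, n=3):
    """Hypothetical helper: seed the generator, then draw n pseudo-random numbers."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first_run = draw_samples(seed=42)
second_run = draw_samples(seed=42)
other_seed = draw_samples(seed=7)

# Same seed, same sequence; a different seed gives a different sequence.
print(first_run == second_run)
print(first_run == other_seed)
```

This is also why commenting out the seeding call in later cells leads to a different sound on every run.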

## 2.) Trying out the default MusicGen by Meta

Before using our customized AI model, you can try out the original MusicGen model from Meta (published under the `facebook` namespace) for general-purpose music generation.
Visit https://huggingface.co/docs/transformers/model_doc/musicgen for further information about it.

In [None]:
# This may take a few minutes to download.
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
default_model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

### 2.1.) Generate custom music

In the following cell we explore how to let AI create music in general, e.g. a 90s-style guitar rock riff.

There are three things to play around with:

* sound_description
  * Describe the kind of music you would like to have generated.
* guidance_scale
  * This controls how strongly the model is guided by the text description (the higher, the stronger).
* max_new_tokens
  * This controls the length of the generated sound. Setting it to 256 corresponds to roughly 5 seconds.

The execution may take a few seconds to run. After its completion use the play button to hear your results.
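
As a rough guide for choosing `max_new_tokens`, the sketch below converts between tokens and seconds. It assumes that the model produces about 50 audio tokens per second of sound, which is why 256 tokens come out at roughly 5 seconds. `FRAME_RATE_HZ` and both helper functions are names made up for this illustration.

```python
import math

# Assumption: roughly 50 audio tokens per second of generated sound.
FRAME_RATE_HZ = 50


def seconds_for_tokens(max_new_tokens):
    """Approximate duration in seconds for a given number of new tokens."""
    return max_new_tokens / FRAME_RATE_HZ


def tokens_for_seconds(seconds):
    """Approximate number of new tokens needed for a given duration."""
    return math.ceil(seconds * FRAME_RATE_HZ)


print(seconds_for_tokens(256))
print(tokens_for_seconds(10))
```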

In [None]:
# You can alter the following text as you like.
sound_description = '90ties rock with guitar riff'

conditioned_text_input = processor(text=sound_description, padding=True, return_tensors="pt")

# Comment out the next line to get different results each time you execute the cell,
# even when using the same description.
set_global_seeds()

audio = default_model.generate(**conditioned_text_input,
                               do_sample=True,
                               guidance_scale=3,  # Value >1, best results achieved with 3.
                               max_new_tokens=256  # 256 tokens correspond to roughly 5 seconds of audio.
                              )
Audio(audio[0, 0].numpy(), rate=SAMPLING_RATE)
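
The effect of `guidance_scale` can be pictured with a toy version of classifier-free guidance. This is a simplified sketch, not the library's code: it assumes the model scores each candidate token once with the text prompt and once without it, and that the two scores are then blended. The function name `apply_guidance` and the example numbers are invented.

```python
def apply_guidance(conditional, unconditional, guidance_scale):
    """Blend per-token scores: start at the unconditional score and move
    towards the text-conditioned one by a factor of guidance_scale."""
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(conditional, unconditional)
    ]


# Invented scores for three candidate tokens.
with_text = [2.0, 1.0, 0.5]
without_text = [1.0, 1.0, 1.0]

for scale in (1, 3):
    print(scale, apply_guidance(with_text, without_text, scale))
```

With a scale of 1 the text-conditioned scores come back unchanged. Larger values exaggerate the difference the text makes, which is one way to read "the higher, the stronger".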

### 2.2.) Create meditation background music using the default MusicGen

We now use the original MusicGen to try to generate meditation sounds.
The results should already sound quite nice, but there is definitely room for improvement.

In [None]:
sound_description = 'Peaceful meditation background sound'

conditioned_text_input = processor(text=sound_description, padding=True, return_tensors="pt")

set_global_seeds()
audio = default_model.generate(**conditioned_text_input,
                               do_sample=True,
                               guidance_scale=3,  # Value >1, best results achieved with 3.
                               max_new_tokens=256  # 256 tokens correspond to roughly 5 seconds of audio.
                              )
Audio(audio[0, 0].numpy(), rate=SAMPLING_RATE)

## 3.) Using our fine-tuned model

In this section, we use a customized meditation music generation model by BFH.
The following cell downloads the model from Hugging Face automatically.

In [None]:
MODEL_VERSION = "bfh-genai/meditation-musicgen"

# This may take a few minutes.
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained(MODEL_VERSION)

### 3.1.) Generate custom relaxing sounds

Now let's hear the results coming from a dedicated meditation generation model.

Although this is a proof-of-concept, the result should be an improvement on the previous sound. But we all know that tastes differ. What is your opinion? Do you hear any significant differences?


In [None]:
# Feel free to describe sounds of your own or uncomment one of the following examples.
relaxing_description = 'Peaceful meditation background sound'
# relaxing_description = 'Calm slow piano with low pitch'
# relaxing_description = 'Relaxing, calm sound'

conditioned_text_input = processor(text=relaxing_description, padding=True, return_tensors="pt")

set_global_seeds()

audio_value = model.generate(**conditioned_text_input,
                             do_sample=True,
                             guidance_scale=3,  # Value >1, best results achieved with 3.
                             max_new_tokens=256  # 256 tokens correspond to roughly 5 seconds of audio.
                             )

Audio(audio_value[0, 0].numpy(), rate=SAMPLING_RATE)

In [None]:
# Save the generated audio as a WAV file.
scipy.io.wavfile.write("my_audio_file.wav", rate=SAMPLING_RATE, data=audio_value[0, 0].numpy())

### 3.2.) Create sounds conditioned on music

So far we have only conditioned the generation on a text description.
In this section we additionally condition it on a 5-second audio sample. The model continues that sample for another 5 seconds, so the result is 10 seconds long in total.

Select one of the following prepared audio files. Use the play button to preview the selected audio. When it is passed to the model in addition to the text, the model tries to align its generation with the selected audio.
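
Because the model continues the audio sample, the returned waveform is expected to start with the sample and end with the newly generated part. The sketch below shows that bookkeeping on a dummy list of samples. `split_continuation` is a hypothetical helper, and the assumption that the output begins with the full 5-second sample should be checked against the actual output length.

```python
SAMPLING_RATE = 32_000
PROMPT_SECONDS = 5


def split_continuation(samples, prompt_len):
    """Hypothetical helper: separate the prompt part from the continuation."""
    return samples[:prompt_len], samples[prompt_len:]


prompt_len = PROMPT_SECONDS * SAMPLING_RATE
# Dummy stand-in for a 10 s model output: 5 s of prompt, then 5 s of new audio.
dummy_output = [0.0] * prompt_len + [1.0] * prompt_len

prompt_part, new_part = split_continuation(dummy_output, prompt_len)
print(len(prompt_part) / SAMPLING_RATE, len(new_part) / SAMPLING_RATE)
```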

In [None]:
# Download the required audio files
urls = ["https://github.com/BFH-AMI/sds24/raw/ff4a0ff7958e2eefdda43391b9536fb29e8ec877/Workshop2/audio_conditioning_samples/ambient-lo-fi-pad-seasons_75bpm_C_major.wav",
        "https://github.com/BFH-AMI/sds24/raw/ff4a0ff7958e2eefdda43391b9536fb29e8ec877/Workshop2/audio_conditioning_samples/calm-lo-fi-piano-acoustic-melody_120bpm.wav",
        "https://github.com/BFH-AMI/sds24/raw/ff4a0ff7958e2eefdda43391b9536fb29e8ec877/Workshop2/audio_conditioning_samples/nostalgic-ambient-violin-classical-melody_70bpm_G_minor.wav",
        "https://github.com/BFH-AMI/sds24/raw/ff4a0ff7958e2eefdda43391b9536fb29e8ec877/Workshop2/audio_conditioning_samples/soft-ambient-piano-reflective-loop_154bpm_C_minor.wav"]
for url in urls:
    wget.download(url, out=f"/content/{url.split('/')[-1]}")

In [None]:
# audio_condition_file = 'calm-lo-fi-piano-acoustic-melody_120bpm.wav'
# audio_condition_file = 'soft-ambient-piano-reflective-loop_154bpm_C_minor.wav'
# audio_condition_file = 'nostalgic-ambient-violin-classical-melody_70bpm_G_minor.wav'
audio_condition_file = 'ambient-lo-fi-pad-seasons_75bpm_C_major.wav'

_, audio_con_data = scipy.io.wavfile.read("/content/" + audio_condition_file)
Audio(audio_con_data, rate=SAMPLING_RATE)
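
The cell above discards the sampling rate that `scipy.io.wavfile.read` returns as its first value. If you want to try your own audio files, it may be worth checking that rate first. The sketch below does so with Python's built-in `wave` module on a small silent test file that it writes itself. The file name `rate_check_demo.wav` and the helper `has_expected_rate` are made up for this illustration.

```python
import os
import tempfile
import wave

SAMPLING_RATE = 32_000


def has_expected_rate(path, expected=SAMPLING_RATE):
    """Return True if the WAV file at `path` uses the expected sampling rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == expected


# Write one second of 16-bit mono silence at 32 kHz to a temporary file.
demo_path = os.path.join(tempfile.gettempdir(), "rate_check_demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(SAMPLING_RATE)
    wav_file.writeframes(b"\x00\x00" * SAMPLING_RATE)

print(has_expected_rate(demo_path))
print(has_expected_rate(demo_path, expected=44_100))
```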

In [None]:
conditioned_text_input = processor(
    audio=audio_con_data,
    sampling_rate=SAMPLING_RATE,
    text='Peaceful meditation background sound',
    padding=True,
    return_tensors="pt",
)
set_global_seeds()
audio_value = model.generate(**conditioned_text_input,
                             do_sample=True,
                             guidance_scale=3,
                             max_new_tokens=256)
Audio(audio_value[0, 0].numpy(), rate=SAMPLING_RATE)

### 3.3.) Creating longer audio sequences

So far we have only created short sequences of relaxing sounds, but meditation is usually practiced for five minutes to an hour.
Such long sequences cannot be generated in one go.
In this section we therefore generate longer pieces by chaining short individual audio snippets together.

You can control the length using the **nbr_total_seconds** variable.
However, as this is just a PoC, the quality tends to degrade the longer the sequence gets.
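
The chaining can be pictured as a simple schedule: the first 5 seconds come from the audio sample, and every further 5-second snippet is generated from the snippet before it. The sketch below only computes that schedule, without any model. `plan_snippets` is a hypothetical helper, and it assumes that the total length is a multiple of the snippet length.

```python
SNIPPET_SECONDS = 5


def plan_snippets(nbr_total_seconds, snippet_seconds=SNIPPET_SECONDS):
    """Hypothetical helper: list (condition_start, new_start, new_end) in
    seconds for every snippet that has to be generated."""
    if nbr_total_seconds % snippet_seconds != 0:
        raise ValueError("total length should be a multiple of the snippet length")
    return [
        (start - snippet_seconds, start, start + snippet_seconds)
        for start in range(snippet_seconds, nbr_total_seconds, snippet_seconds)
    ]


for condition_start, new_start, new_end in plan_snippets(30):
    print(f"seconds {new_start}-{new_end} are generated from "
          f"seconds {condition_start}-{new_start}")
```

For 30 seconds this yields five generation steps, since the first snippet is taken from the audio sample.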

In [None]:
def smooth_transition(array_to_add, last_val, window=10):
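    """
    Smooth the transition from one snippet to the next to reduce
    clicking noises between two snippets.

    The first `window` samples of `array_to_add` are replaced in place by a
    linear ramp that starts at `last_val`, the last sample of the previous snippet.
    """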
    window_end_val = array_to_add[window]
    slope = (window_end_val - last_val) / window
    for i in range(window):
        array_to_add[i] = last_val + slope * i
    return array_to_add
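
To see what the function does, the following sketch applies the same linear ramp to a tiny made-up signal. The function is repeated so that the example runs on its own. The input values are invented, and plain Python lists stand in for the NumPy arrays used in the notebook.

```python
def smooth_transition(array_to_add, last_val, window=10):
    """Same linear ramp as above, repeated so the example is self-contained."""
    window_end_val = array_to_add[window]
    slope = (window_end_val - last_val) / window
    for i in range(window):
        array_to_add[i] = last_val + slope * i
    return array_to_add


# The previous snippet ended at 0.0; the new one would start abruptly at 1.0.
new_snippet = [1.0] * 8
smoothed = smooth_transition(new_snippet.copy(), last_val=0.0, window=4)
print(smoothed)  # [0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0]
```

The first `window` samples are replaced by a straight line from the last value of the previous snippet to the value at index `window`, which removes the jump between the two snippets.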


This may take some time... :)

In [None]:
audio_condition_file = 'ambient-lo-fi-pad-seasons_75bpm_C_major.wav'

_, audio_con_data = scipy.io.wavfile.read("/content/" + audio_condition_file)

# Adjust this if you want. Use a multiple of 5, since each snippet is 5 seconds long.
nbr_total_seconds = 30
chained_audio = np.zeros((nbr_total_seconds*SAMPLING_RATE))
chained_audio[:len(audio_con_data)] = audio_con_data

for i in range(5, nbr_total_seconds, 5):
    print(f"Creating seconds {i} to {i+5} / {nbr_total_seconds} ...")
    start_id, end_id = i*SAMPLING_RATE, (i+5)*SAMPLING_RATE

    conditioned_text_input = processor(
        audio=chained_audio[(i-5)*SAMPLING_RATE:i*SAMPLING_RATE],
        sampling_rate=SAMPLING_RATE,
        text='Peaceful meditation background sound',
        padding=True,
        return_tensors="pt",
    )
    set_global_seeds()
    audio_value = model.generate(**conditioned_text_input, do_sample=True,
                                 guidance_scale=3,
                                 max_new_tokens=256)

    whole_generated = audio_value[0, 0].numpy()
    # The output contains the 5 s prompt followed by the newly generated part.
    # Keep only the new part, trimmed to the length of one snippet.
    snippet_len = len(audio_con_data)
    if len(whole_generated) >= 2*snippet_len:
        new_part = whole_generated[snippet_len:2*snippet_len]
    else:
        # Output shorter than expected: fall back to its last snippet_len samples.
        new_part = whole_generated[-snippet_len:]
    chained_audio[start_id:end_id] = smooth_transition(new_part.copy(),
                                                       chained_audio[start_id-1])

Audio(chained_audio, rate=SAMPLING_RATE)

In [None]:
from matplotlib.ticker import FuncFormatter
from matplotlib import rc


# Have a look at what your audio looks like.

rc('text', usetex=False)
formatter = FuncFormatter(lambda x_val, tick_pos: "{:.0f}".format(x_val/SAMPLING_RATE))

fig, ax = plt.subplots(1, figsize=(12, 4))
ax.xaxis.set_major_formatter(formatter)
ax.plot(chained_audio)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.grid(True)
plt.show()

In [None]:
# Use this cell to save your audio file.
scipy.io.wavfile.write("my_audio_file.wav", rate=SAMPLING_RATE, data=chained_audio)