## MusicGen in 🤗 Transformers

**by [Sanchit Gandhi](https://huggingface.co/sanchit-gandhi)**

MusicGen is a Transformer-based model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. It was proposed in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet et al. from Meta AI.

The MusicGen model can be decomposed into three distinct stages:
1. The text descriptions are passed through a frozen text encoder model to obtain a sequence of hidden-state representations
2. The MusicGen decoder is then trained to predict discrete audio tokens, or *audio codes*, conditioned on these hidden-states
3. These audio tokens are then decoded using an audio compression model, such as EnCodec, to recover the audio waveform

The pre-trained MusicGen checkpoints use Google's [t5-base](https://huggingface.co/t5-base) as the text encoder model, and [EnCodec 32kHz](https://huggingface.co/facebook/encodec_32khz) as the audio compression model. The MusicGen decoder is a pure language model architecture,
trained from scratch on the task of music generation.

The novelty in the MusicGen model is how the audio codes are predicted. Traditionally, each codebook has to be predicted by a separate model (i.e. hierarchically) or by continuously refining the output of the Transformer model (i.e. upsampling). MusicGen uses an efficient *token interleaving pattern*, thus eliminating the need to cascade multiple models to predict a set of codebooks. Instead, it is able to generate the full set of codebooks in a single forward pass of the decoder, resulting in much faster inference.

<p align="center">
  <img src="https://github.com/sanchit-gandhi/codesnippets/blob/main/delay_pattern.png?raw=true" width="600"/>
</p>


**Figure 1:** Codebook delay pattern used by MusicGen. Figure taken from the [MusicGen paper](https://arxiv.org/abs/2306.05284).
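
To make the delay pattern in Figure 1 more concrete, here is a minimal pure-Python sketch, assuming that codebook `k` is shifted right by `k` steps. The helper names `apply_delay_pattern` and `revert_delay_pattern` and the `PAD` value are hypothetical and chosen for illustration; this is not the implementation used in 🤗 Transformers.

```python
PAD = -1  # hypothetical placeholder for positions that hold no code yet


def apply_delay_pattern(codes, pad=PAD):
    """Shift codebook k to the right by k steps.

    codes: list of K lists, each holding T integer audio codes.
    Returns K lists of length T + K - 1.
    """
    num_codebooks = len(codes)
    seq_len = len(codes[0])
    total_len = seq_len + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        left = [pad] * k
        right = [pad] * (total_len - seq_len - k)
        delayed.append(left + list(row) + right)
    return delayed


def revert_delay_pattern(delayed):
    """Undo the shift so that all codebooks are aligned in time again."""
    num_codebooks = len(delayed)
    seq_len = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + seq_len] for k, row in enumerate(delayed)]


# Toy example: 4 codebooks, 3 time steps
codes = [
    [11, 12, 13],
    [21, 22, 23],
    [31, 32, 33],
    [41, 42, 43],
]
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)
```

Each column of `delayed` holds the codes that would be predicted together at one decoder step, which is why a single decoder can cover every codebook without cascading several models.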


## Prepare the Environment

Let’s make sure we’re connected to a GPU to run this notebook. To get a GPU, click `Runtime` -> `Change runtime type`, then change `Hardware accelerator` from `None` to `GPU`. We can verify that we’ve been assigned a GPU and view its specifications through the `nvidia-smi` command:

In [1]:
!nvidia-smi

Fri Jan 19 08:27:20 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla P100-PCIE-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   29C    P0              25W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

We see here that we've got one Tesla P100 16GB GPU, although this may vary for you depending on GPU availability and GPU assignment.

Next, we install the 🤗 Transformers package from the main branch, as well as the 🤗 Datasets package to load audio files for audio-prompted generation:

In [2]:
!pip install --upgrade --quiet pip
!pip install --quiet git+https://github.com/huggingface/transformers.git datasets[audio]

## Load the Model

The pre-trained MusicGen small, medium and large checkpoints can be loaded from the [pre-trained weights](https://huggingface.co/models?search=facebook/musicgen-) on the Hugging Face Hub. Change the repo id to the checkpoint you wish to load. Here we load the stereo large checkpoint, `facebook/musicgen-stereo-large`; the small checkpoint is the fastest of the three sizes but has the lowest audio quality:

In [3]:
from transformers import MusicgenForConditionalGeneration

model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-stereo-large")


We can then place the model on our accelerator device (if available), or leave it on the CPU otherwise:

In [4]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device);

## Generation

MusicGen is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly
better results than greedy, thus we encourage sampling mode to be used where possible. Sampling is enabled by default,
and can be explicitly specified by setting `do_sample=True` in the call to `MusicgenForConditionalGeneration.generate` (see below).
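
As a rough illustration of the difference between the two modes, the toy sketch below picks a token from a single logits vector either greedily (arg-max) or by sampling from the softmax distribution. The function names and the example logits are made up for illustration; the real `.generate` method applies further processing that is not shown here.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, rng):
    """Sampling: draw a token at random, weighted by its probability."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.5, 0.1]  # toy scores for a 3-token vocabulary
rng = random.Random(0)

greedy_tokens = [greedy_pick(logits) for _ in range(5)]
sampled_tokens = [sample_pick(logits, rng) for _ in range(5)]

print(greedy_tokens)   # the same token every time
print(sampled_tokens)  # can vary from draw to draw
```

Greedy decoding is deterministic, so repeated calls return the same token, whereas sampling can explore lower-probability tokens.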

### Unconditional Generation

The inputs for unconditional (or 'null') generation can be obtained through the method `MusicgenForConditionalGeneration.get_unconditional_inputs`. We can then run auto-regressive generation using the `.generate` method, specifying `do_sample=True` to enable sampling mode:
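
A minimal sketch of these two calls is given below, wrapped in a helper so that it can be reused. The helper name `generate_unconditional` and the default `max_new_tokens=256` are assumptions made for illustration; the helper expects a `model` that has already been loaded and moved to the device as shown earlier.

```python
def generate_unconditional(model, num_samples=1, max_new_tokens=256):
    """Sample audio from a loaded MusicGen model without any text or audio prompt."""
    # Build the unconditional ('null') inputs for the requested number of samples
    unconditional_inputs = model.get_unconditional_inputs(num_samples=num_samples)
    # Run auto-regressive generation in sampling mode
    return model.generate(
        **unconditional_inputs,
        do_sample=True,
        max_new_tokens=max_new_tokens,
    )
```

With the model from the previous section, `audio_values = generate_unconditional(model)` would return the generated audio tensor.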

The audio outputs are a three-dimensional Torch tensor of shape `(batch_size, num_channels, sequence_length)`. To listen
to the generated audio samples, you can either play them in an ipynb notebook:

In [5]:
from IPython.display import Audio

# sampling_rate = model.config.audio_encoder.sampling_rate
# Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)

Or save them as a `.wav` file using a third-party library, e.g. `scipy` (note here that we also need to remove the channel dimension from our audio tensor):

In [6]:
import scipy

# scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())



The argument `max_new_tokens` specifies the number of new tokens to generate. As a rule of thumb, you can work out the length of the generated audio sample in seconds by using the frame rate of the EnCodec model:

In [102]:
number_of_seconds = 5

In [103]:
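# The EnCodec model here runs at 50 frames (tokens) per second of audio
tokens = 50 * number_of_seconds
# Add 6 extra tokens (0.12 s), bringing the total to 256 tokens
tokens += 6

tokens = int(tokens)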

audio_length_in_s = tokens / model.config.audio_encoder.frame_rate

audio_length_in_s

5.12

### Text-Conditional Generation

The model can generate an audio sample conditioned on a text prompt through use of the `MusicgenProcessor` to pre-process
the inputs. The pre-processed inputs can then be passed to the `.generate` method to generate text-conditional audio samples.
Again, we enable sampling mode by setting `do_sample=True`:

In [104]:
sampling_rate = model.config.audio_encoder.sampling_rate

In [105]:
prompts = {
    0: "Capture the beauty of Europe with a random and enchanting twist. Picture iconic European landmarks bathed in a kaleidoscope of colors, creating a mesmerizing fusion of beauty and randomness.",
    90: "Infuse a serene mountain landscape with cyberpunk elements. Picture towering peaks adorned with futuristic city lights, creating a harmonious blend of nature and high-tech urban living.",
    180: "Transform Europe's classic architecture into a canvas of beauty and randomness. Envision historic buildings infused with vibrant and unexpected colors, creating a captivating blend of the traditional and the avant-garde.",
    270: "Sail through Europe's waterways and capture the beauty of random moments. Picture charming canals and riverside scenes, with unexpected elements adding a touch of magic to the serene and picturesque settings.",
    360: "Evoke awe with a breathtaking panorama of Europe's natural wonders, enhanced by random and beautiful elements. Imagine majestic mountains and serene lakes, where the unexpected harmonizes with the breathtaking scenery.",
    450: "Merge Europe's rich cultural heritage with a touch of randomness and beauty. Visualize historic landmarks transformed by unexpected and vibrant elements, creating a tapestry of tradition and whimsy.",
    540: "Craft a scene where the elegance of Europe's sunset meets random and beautiful elements. Envision cityscapes bathed in warm hues, with unexpected details adding a touch of magic to the twilight beauty.",
    630: "Infuse Europe's charming countryside with the beauty of randomness. Picture rolling hills and meadows adorned with unexpected surprises, creating a visual feast of natural beauty and unpredictability.",
#     720: "Embark on a journey through Europe's diverse landscapes, capturing the beauty of random and captivating moments. Visualize ancient forests, coastal cliffs, and expansive fields, where the unexpected becomes an integral part of the scenic tapestry."
}

In [106]:
# Extract prompts into a Python list
prompt_list = list(prompts.values())

# Print the first prompt
print(prompt_list[0])

Capture the beauty of Europe with a random and enchanting twist. Picture iconic European landmarks bathed in a kaleidoscope of colors, creating a mesmerizing fusion of beauty and randomness.


In [107]:
from pydub.playback import play
from pydub import AudioSegment


In [109]:
# from transformers import AutoProcessor

# processor = AutoProcessor.from_pretrained("facebook/musicgen-small")

# inputs = processor(
#     text="Create a dark and foreboding musical journey with ominous strings and percussion for the Nebula of Adversaries, a dimensional realm where Loki, Thanos, and the Red Skull conspire. The Avengers, led by Captain America and Black Panther, must dispel the looming threat with their heroic might. Genre: Dark Fantasy.",
#     padding=True,
#     return_tensors="pt",
# )

# audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)

# Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)

In [110]:
import shutil
import os

folder_path = '/kaggle/working/audio'

# Check if the folder exists
if os.path.exists(folder_path):
    # Delete the folder
    shutil.rmtree(folder_path)
    
# Create the folder
os.makedirs(folder_path)


In [111]:
from transformers import AutoProcessor
from scipy.io import wavfile
from tqdm.auto import tqdm

processor = AutoProcessor.from_pretrained("facebook/musicgen-stereo-large")

for idx, prompt in enumerate(tqdm(prompt_list)):
    print(f"{idx}.Prompt - {prompt}\n")

    # Tokenize the text prompt
    inputs = processor(
        text=prompt,
        padding=True,
        return_tensors="pt",
    )

    # Generate audio in sampling mode with classifier-free guidance
    audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=tokens)

    # audio_values[0, 0] selects the first batch item and its first channel
    sampling_rate = model.config.audio_encoder.sampling_rate
    wavfile.write(f"/kaggle/working/audio/{idx}.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())

# Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)

  0%|          | 0/8 [00:00<?, ?it/s]

0.Prompt - Capture the beauty of Europe with a random and enchanting twist. Picture iconic European landmarks bathed in a kaleidoscope of colors, creating a mesmerizing fusion of beauty and randomness.

1.Prompt - Infuse a serene mountain landscape with cyberpunk elements. Picture towering peaks adorned with futuristic city lights, creating a harmonious blend of nature and high-tech urban living.

2.Prompt - Transform Europe's classic architecture into a canvas of beauty and randomness. Envision historic buildings infused with vibrant and unexpected colors, creating a captivating blend of the traditional and the avant-garde.

3.Prompt - Sail through Europe's waterways and capture the beauty of random moments. Picture charming canals and riverside scenes, with unexpected elements adding a touch of magic to the serene and picturesque settings.

4.Prompt - Evoke awe with a breathtaking panorama of Europe's natural wonders, enhanced by random and beautiful elements. Imagine majestic mounta

In [112]:
import glob
files = glob.glob("/kaggle/working/audio/*")

In [113]:
files

['/kaggle/working/audio/1.wav',
 '/kaggle/working/audio/0.wav',
 '/kaggle/working/audio/2.wav',
 '/kaggle/working/audio/4.wav',
 '/kaggle/working/audio/6.wav',
 '/kaggle/working/audio/5.wav',
 '/kaggle/working/audio/3.wav',
 '/kaggle/working/audio/7.wav']

In [114]:
import os

from pydub import AudioSegment

def merge_and_smooth(files, output_file, target_duration=48000):
    """Join WAV files with short crossfades and trim to target_duration (in ms)."""
    # Initialize an empty AudioSegment to store the final result
    final_audio = AudioSegment.silent(duration=0)

    for i, file in enumerate(files):
        # Load each audio file
        audio = AudioSegment.from_file(file)

        # Add a fade-in effect (100 ms fade-in)
        audio = audio.fade_in(100)

        if i > 0:
            # Crossfade with the previous audio segment (100 ms crossfade)
            final_audio = final_audio.fade_out(100)
            audio = audio.fade_in(100)
            final_audio = final_audio.append(audio, crossfade=100)
        else:
            # For the first segment, just append without crossfade
            final_audio += audio

    # Trim the final audio to the target duration (48 seconds by default)
    final_audio = final_audio[:target_duration]

    # Export the final result to a new WAV file
    final_audio.export(output_file, format="wav")

# glob returns files in arbitrary order, so sort by the numeric file name (0.wav, 1.wav, ...)
input_files = sorted(files, key=lambda path: int(os.path.splitext(os.path.basename(path))[0]))

# Output file name
output_file = "output_final.wav"

# Merge the segments with fades and crossfades
merge_and_smooth(input_files, output_file)


In [115]:
Audio('/kaggle/working/output_final.wav')

The `guidance_scale` is used in classifier-free guidance (CFG), setting the weighting between the conditional logits
(which are predicted from the text prompts) and the unconditional logits (which are predicted from an unconditional or
'null' prompt). A higher guidance scale encourages the model to generate samples that are more closely linked to the input
prompt, usually at the expense of poorer audio quality. CFG is enabled by setting `guidance_scale > 1`. For best results,
use `guidance_scale=3` (the default) for text and audio-conditional generation.
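
The weighting described above can be sketched with a common formulation of CFG, in which the guided logits are the unconditional logits plus the scaled difference between conditional and unconditional logits. The function name `apply_cfg` and the toy logits are assumptions for illustration; treat this as a sketch of the general technique rather than the exact code path in 🤗 Transformers.

```python
def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    """Blend conditional and unconditional logits with classifier-free guidance."""
    return [
        uncond + guidance_scale * (cond - uncond)
        for cond, uncond in zip(cond_logits, uncond_logits)
    ]


# Toy logits over a 3-token vocabulary
cond = [2.0, 0.5, -1.0]    # predicted from the text prompt
uncond = [1.0, 1.0, 0.0]   # predicted from the 'null' prompt

no_guidance = apply_cfg(cond, uncond, 1.0)  # a scale of 1 returns the conditional logits
guided = apply_cfg(cond, uncond, 3.0)       # a scale of 3 pushes further towards the prompt

print(no_guidance)
print(guided)
```

With a scale of 1 the result equals the conditional logits, which is consistent with CFG only taking effect for `guidance_scale > 1`.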