<a href="https://colab.research.google.com/github/Mahender2023/CodSoft/blob/main/MusicGen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Prepare the Environment

Let’s make sure we’re connected to a GPU to run this notebook. To get a free-tier T4 GPU, click `Connect T4` in the top right-hand corner of the screen. If you have access to Colab Pro, you can select a more performant GPU by clicking `Runtime` -> `Change runtime type`, then changing `Hardware accelerator` from `None` to your choice of GPU. We can verify that we’ve been assigned a GPU and view its specifications with the `nvidia-smi` command:

In [None]:
!nvidia-smi

We see here that we've got one Tesla T4 16GB GPU, although this may vary for you depending on GPU availability and Colab GPU assignment.

Next, we install the latest release of the 🤗 Transformers package, as well as the 🤗 Datasets package (with its audio extras) to load audio files for audio-prompted generation:

In [None]:
!pip install --upgrade --quiet pip
!pip install --upgrade --quiet transformers "datasets[audio]"

## Load the Model

The pre-trained MusicGen small, medium and large checkpoints can be loaded from the [pre-trained weights](https://huggingface.co/models?search=facebook/musicgen-) on the Hugging Face Hub. Change the repo id to match the checkpoint size you wish to load. We'll default to the small checkpoint, which is the fastest of the three but has the lowest audio quality:

In [None]:
from transformers import MusicgenForConditionalGeneration

model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

We can then place the model on our accelerator device (if available), or leave it on the CPU otherwise:

In [None]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device);

## Generation

MusicGen is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly better results than greedy, so we encourage using sampling mode where possible. Sampling is enabled by default, and can be explicitly specified by setting `do_sample=True` in the call to `MusicgenForConditionalGeneration.generate` (see below).
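
To make the difference between the two modes concrete, here is a minimal, library-free sketch. The logits, the four-token vocabulary and the helper names (`softmax`, `greedy_pick`, `sample_pick`) are made up for illustration and are not part of the Transformers API; the real model works on much larger score vectors.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: always choose the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature=1.0, rng=random):
    """Sampling: draw a token index in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.5, 0.1, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)              # seeded so the draws are repeatable

greedy_tokens = [greedy_pick(toy_logits) for _ in range(5)]
sampled_tokens = [sample_pick(toy_logits, rng=rng) for _ in range(5)]

print("greedy: ", greedy_tokens)   # the same token every time
print("sampled:", sampled_tokens)  # a mix of tokens, weighted by probability
```

Greedy decoding repeats the single most likely choice at every step, whereas sampling can explore less likely continuations.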

### Unconditional Generation

The inputs for unconditional (or 'null') generation can be obtained through the method `MusicgenForConditionalGeneration.get_unconditional_inputs`. We can then run auto-regressive generation using the `.generate` method, specifying `do_sample=True` to enable sampling mode:

In [None]:
unconditional_inputs = model.get_unconditional_inputs(num_samples=1)

audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256)

The audio outputs are a three-dimensional PyTorch tensor of shape `(batch_size, num_channels, sequence_length)`. To listen to the generated audio samples, you can either play them directly in the notebook:

In [None]:
from IPython.display import Audio

sampling_rate = model.config.audio_encoder.sampling_rate
Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)

Or save them as a `.wav` file using a third-party library, e.g. `scipy` (note here that we also need to remove the channel dimension from our audio tensor):

In [None]:
# import the submodule explicitly; a bare `import scipy` is not guaranteed to expose it
import scipy.io.wavfile

scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
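
If you would rather avoid the extra dependency, or want 16-bit PCM output, the standard-library `wave` module can do the same job. The sketch below is only an illustration: `write_wav_int16` is a hypothetical helper, the sine tone stands in for generated audio, and the 32 kHz rate is an assumed value (read the real one from `model.config.audio_encoder.sampling_rate`).

```python
import math
import os
import struct
import tempfile
import wave

def write_wav_int16(path, samples, sampling_rate):
    """Write mono float samples in [-1, 1] as a 16-bit PCM .wav file."""
    frames = bytearray()
    for value in samples:
        clipped = max(-1.0, min(1.0, value))  # guard against values outside [-1, 1]
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono, like audio_values[0, 0]
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))

# Stand-in for generated audio: 0.1 s of a 440 Hz tone at an assumed 32 kHz.
assumed_rate = 32000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / assumed_rate) for n in range(3200)]

out_path = os.path.join(tempfile.gettempdir(), "musicgen_sketch.wav")
write_wav_int16(out_path, tone, assumed_rate)

# Read the header back to confirm what was written.
with wave.open(out_path, "rb") as wav_file:
    written_frames = wav_file.getnframes()
    written_rate = wav_file.getframerate()

print(written_frames, written_rate)
```

With real model output, you could pass `audio_values[0, 0].cpu().tolist()` as `samples`.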

The argument `max_new_tokens` specifies the number of new tokens to generate. As a rule of thumb, you can work out the length of the generated audio sample in seconds by using the frame rate of the EnCodec model:

In [None]:
audio_length_in_s = 256 / model.config.audio_encoder.frame_rate

audio_length_in_s
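
The same rule of thumb can be turned around to choose `max_new_tokens` for a target duration. This is a small sketch with hypothetical helper names; the frame rate of 50 is an assumption taken from the defaults discussed in this notebook (1500 tokens for 30 s), so read the real value from `model.config.audio_encoder.frame_rate` in practice.

```python
import math

# Assumed tokens per second; prefer model.config.audio_encoder.frame_rate.
ASSUMED_FRAME_RATE = 50

def tokens_to_seconds(num_tokens, frame_rate=ASSUMED_FRAME_RATE):
    """Approximate audio length for a given number of generated tokens."""
    return num_tokens / frame_rate

def seconds_to_tokens(seconds, frame_rate=ASSUMED_FRAME_RATE):
    """Smallest token budget that covers the requested duration."""
    # round() first so floating-point noise does not push ceil() up by one
    return math.ceil(round(seconds * frame_rate, 6))

print(tokens_to_seconds(256))   # length of the 256-token samples above
print(seconds_to_tokens(10))    # token budget for roughly 10 s of audio
```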

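### Text-Conditional Generation

Text prompts are pre-processed with the model's processor, and the resulting inputs are passed to `.generate` together with a guidance scale:

In [None]: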
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)

Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)

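### Audio-Prompted Generation

An audio prompt can be passed to the processor alongside the text. Here we stream a sample from the GTZAN dataset and keep its first half as the prompt:

In [1]: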
from datasets import load_dataset

dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)
sample = next(iter(dataset))["audio"]

# take the first half of the audio sample
sample["array"] = sample["array"][: len(sample["array"]) // 2]

inputs = processor(
    audio=sample["array"],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)

Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)
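
Slicing by a fraction of the clip works, but it is sometimes easier to think of the prompt in seconds. Here is a minimal sketch of that idea; `truncate_prompt` is a hypothetical helper, and the tiny sampling rate and integer "samples" are made up so the numbers are easy to follow. The same slicing applies to a NumPy array such as `sample["array"]`.

```python
def truncate_prompt(samples, sampling_rate, max_seconds):
    """Return at most `max_seconds` of audio from the start of `samples`."""
    max_samples = int(max_seconds * sampling_rate)
    return samples[:max_samples]

toy_rate = 8                 # made-up, tiny sampling rate for readability
toy_clip = list(range(40))   # 40 samples = 5 seconds at 8 Hz

prompt = truncate_prompt(toy_clip, toy_rate, max_seconds=2)
print(len(prompt))           # number of samples kept for a 2-second prompt
```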


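Audio prompts of different lengths can also be batched. The processor pads them to a common length, and `processor.batch_decode` strips the padding from the generated audio afterwards:

In [None]: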
sample = next(iter(dataset))["audio"]

# take the first quarter of the audio sample
sample_1 = sample["array"][: len(sample["array"]) // 4]

# take the first half of the audio sample
sample_2 = sample["array"][: len(sample["array"]) // 2]

inputs = processor(
    audio=[sample_1, sample_2],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)

# post-process to remove padding from the batched audio
audio_values = processor.batch_decode(audio_values, padding_mask=inputs.padding_mask)

Audio(audio_values[0], rate=sampling_rate)
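
To see why the padding mask is needed, here is a toy version of the pad-then-strip round trip using plain Python lists. The helpers `pad_batch` and `strip_padding` are hypothetical, and the padding side and mask layout are illustrative choices; the processor's actual conventions may differ.

```python
def pad_batch(clips, pad_value=0.0):
    """Right-pad clips to a common length and return (padded, mask)."""
    longest = max(len(clip) for clip in clips)
    padded, mask = [], []
    for clip in clips:
        extra = longest - len(clip)
        padded.append(list(clip) + [pad_value] * extra)
        mask.append([1] * len(clip) + [0] * extra)  # 1 = real sample, 0 = padding
    return padded, mask

def strip_padding(padded, mask):
    """Keep only the positions that the mask marks as real audio."""
    return [
        [x for x, keep in zip(row, row_mask) if keep]
        for row, row_mask in zip(padded, mask)
    ]

short_clip = [0.1, 0.2]
long_clip = [0.3, 0.4, 0.5, 0.6]

padded, mask = pad_batch([short_clip, long_clip])
restored = strip_padding(padded, mask)

print(padded)    # both rows now have the same length
print(restored)  # the original, unpadded clips
```

Without the mask, the padded zeros in the shorter clip would be indistinguishable from real silence.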

## Generation Config

The default parameters that control the generation process, such as sampling, guidance scale and number of generated tokens, can be found in the model's generation config, and updated as desired. Let's first inspect the default generation config:

In [None]:
model.generation_config

Alright! We see that the model defaults to using sampling mode (`do_sample=True`), a guidance scale of 3, and a maximum generation length of 1500 (which is equivalent to 30s of audio). You can update any of these attributes to change the default generation parameters:

In [None]:
# increase the guidance scale to 4.0
model.generation_config.guidance_scale = 4.0

# set the max new tokens to 256
model.generation_config.max_new_tokens = 256

# set the softmax sampling temperature to 1.5
model.generation_config.temperature = 1.5
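
For intuition about what the guidance scale does, the sketch below applies a common formulation of classifier-free guidance to made-up scores. Treat it as an assumption about the general technique rather than a description of the library's internals; `apply_guidance` is a hypothetical helper.

```python
def apply_guidance(cond_logits, uncond_logits, guidance_scale):
    """Push the conditional scores further away from the unconditional ones."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

cond = [2.0, 0.5, -1.0]    # made-up scores given the text prompt
uncond = [1.0, 1.0, 1.0]   # made-up scores with the prompt dropped

unguided = apply_guidance(cond, uncond, guidance_scale=1.0)
guided = apply_guidance(cond, uncond, guidance_scale=4.0)

print(unguided)  # a scale of 1 leaves the conditional scores unchanged
print(guided)    # a larger scale exaggerates what the prompt favours
```

A higher scale makes the output follow the prompt more closely, usually at the cost of some variety.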

Re-running generation now will use the newly defined values in the generation config:

In [None]:
audio_values = model.generate(**inputs.to(device))

Note that any arguments passed to the `generate` method **supersede** those in the generation config. For example, passing `do_sample=False` in the call to `generate` overrides the value of `model.generation_config.do_sample`.
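
The precedence rule can be pictured as a simple dictionary merge. This is only a mental model with a hypothetical helper, not how Transformers implements it:

```python
def resolve_generation_settings(config_defaults, **call_kwargs):
    """Merge settings so that call-time arguments win over config defaults."""
    resolved = dict(config_defaults)  # copy so the stored defaults stay untouched
    resolved.update(call_kwargs)
    return resolved

# Values mirroring the generation config edited above.
defaults = {"do_sample": True, "guidance_scale": 4.0, "max_new_tokens": 256, "temperature": 1.5}

resolved = resolve_generation_settings(defaults, do_sample=False)
print(resolved)  # do_sample comes from the call, everything else from the defaults
```

The override applies to that one call only; the stored defaults are left as they were.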