<a href="https://colab.research.google.com/github/TheEvilSocks/colab-jukebox/blob/master/interacting-with-jukebox.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

IMPORTANT NOTE ON SYSTEM REQUIREMENTS:

If you are connecting to a hosted runtime, make sure it has a P100 GPU (optionally run !nvidia-smi to confirm). Go to Edit>Notebook Settings to set this.

Colab may first assign you a lower-memory machine if you are using a hosted runtime. If so, the first attempt to load the 5B model will run out of memory, and you will be prompted to restart with more memory (then return to the top of this Colab). If you continue to have memory issues after this (or run into issues on your own home setup), switch to the 1B model.

If you are using a local GPU, we recommend V100 or P100 with 16GB GPU memory for best performance. For GPUs with less memory, we recommend using the 1B model and a smaller batch size throughout.
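The memory guidance above can be written down as a small check. This is only a sketch: `pick_model` is a hypothetical helper, not part of Jukebox, and the 16 GB cut-off is simply the figure quoted in the recommendation above, applied to the card's nominal memory size.

```python
def pick_model(nominal_gpu_memory_gb: float) -> str:
    """Suggest a model name from the card's nominal memory (hypothetical helper).

    Assumption: 16 GB is the cut-off, taken from the recommendation above.
    """
    return "5b_lyrics" if nominal_gpu_memory_gb >= 16 else "1b_lyrics"


print(pick_model(16))  # a 16 GB card such as a P100 or V100
print(pick_model(8))   # a smaller card
```

Whatever the helper suggests, the 1B model remains the fallback if the 5B model still runs out of memory.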



In [None]:
!nvidia-smi -L

# Configuration

Mount Google Drive to save sample levels as they are generated.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

# Prepare the environment
print("Preparing environment... (This might take a while)")
!pip install git+https://github.com/openai/jukebox.git > /dev/null
print("Prepped and ready.")

Set up the configuration the model will work with.

In [None]:
model = "5b_lyrics" # "5b", "1b_lyrics" or "5b_lyrics"

# Specify where to save the samples.
sample_folder = '/content/gdrive/My Drive/Jukebox/Samples'

# Specify an audio file here.
audio_file = '/content/gdrive/My Drive/Jukebox/Primer.wav'

# Specify how many seconds of audio to prime on.
prompt_length_in_seconds=12

# Specify whether to resume from a previous checkpoint.
resume_checkpoint = False

sample_length_in_seconds = 76          # Full length of musical sample to generate - we find songs in the 1 to 4 minute
                                       # range work well, with generation time proportional to sample length.  
                                       # This total length affects how quickly the model 
                                       # progresses through lyrics (model also generates differently
                                       # depending on if it thinks it's in the beginning, middle, or end of sample)

sampling_temperature = .98


song_artist = "Rick Astley"
song_genre = "Pop"
song_lyrics = """We're no strangers to love
You know the rules and so do I
A full commitment's what I'm thinking of
You wouldn't get this from any other guy

I just wanna tell you how I'm feeling
Gotta make you understand

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

We've known each other for so long
Your heart's been aching, but
You're too shy to say it
Inside, we both know what's been going on
We know the game and we're gonna play it

And if you ask me how I'm feeling
Don't tell me you're too blind to see

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

(Ooh, give you up)
(Ooh, give you up)
Never gonna give, never gonna give
(Give you up)
Never gonna give, never gonna give
(Give you up)

We've known each other for so long
Your heart's been aching, but
You're too shy to say it
Inside, we both know what's been going on
We know the game and we're gonna play it

I just wanna tell you how I'm feeling
Gotta make you understand

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you

Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you
"""
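Because the later cells take a long time, it can help to sanity-check the configuration before running them. The sketch below is an optional illustration, not part of Jukebox: `check_config` and `ALLOWED_MODELS` are hypothetical names, the model list is copied from the comment next to `model`, and the rule that the prompt must be shorter than the full sample is an assumption about sensible settings.

```python
import os

# Model names taken from the comment next to `model` in the configuration cell.
ALLOWED_MODELS = ("5b", "1b_lyrics", "5b_lyrics")


def check_config(model, audio_file, prompt_seconds, sample_seconds):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if model not in ALLOWED_MODELS:
        problems.append(f"unknown model: {model}")
    if not os.path.isfile(audio_file):
        problems.append(f"audio file not found: {audio_file}")
    # Assumption: the primer should be shorter than the sample to generate.
    if prompt_seconds >= sample_seconds:
        problems.append("prompt is not shorter than the full sample")
    return problems


# Example with a path that does not exist on this machine.
for problem in check_config("5b_lyrics", "/no/such/primer.wav", 12, 76):
    print(problem)
```

In the notebook, the call would take `model`, `audio_file`, `prompt_length_in_seconds` and `sample_length_in_seconds` from the configuration cell.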

# Setup the AI


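Run this cell as-is. It imports Jukebox, loads the VQ-VAE and the top-level prior, and builds the labels and sampling settings from the configuration above.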

In [None]:
import jukebox
import torch as t
import librosa
import os
from IPython.display import Audio
from jukebox.make_models import make_vqvae, make_prior, MODELS, make_model
from jukebox.hparams import Hyperparams, setup_hparams
from jukebox.sample import sample_single_window, _sample, \
                           sample_partial_window, upsample, \
                           load_prompts
from jukebox.utils.dist_utils import setup_dist_from_mpi
from jukebox.utils.torch_utils import empty_cache
rank, local_rank, device = setup_dist_from_mpi()



hps = Hyperparams()
hps.sr = 44100
hps.n_samples = 8 if model=='1b_lyrics' else 3
# Specifies the directory to save the sample in.
# We set this to the Google Drive mount point.
hps.name = sample_folder
chunk_size = 32 if model=="1b_lyrics" else 16
max_batch_size = 16 if model=="1b_lyrics" else 3
hps.levels = 3
hps.hop_fraction = [.5,.5,.125]

vqvae, *priors = MODELS[model]
vqvae = make_vqvae(setup_hparams(vqvae, dict(sample_length = 1048576)), device)
top_prior = make_prior(setup_hparams(priors[-1], dict()), vqvae, device)

# Prime song creation using an arbitrary audio sample.
mode = 'primed'
codes_file=None

if resume_checkpoint:
  if os.path.exists(hps.name):
    # Identify the lowest level generated and continue from there.
    for level in [1, 2]:
      data = f"{hps.name}/level_{level}/data.pth.tar"
      if os.path.isfile(data):
        mode = 'upsample'
        codes_file = data
        print('Upsampling from level '+str(level))
        break
print('mode is now '+mode)

sample_hps = Hyperparams(dict(mode=mode, codes_file=codes_file, audio_file=audio_file, prompt_length_in_seconds=prompt_length_in_seconds))


hps.sample_length = (int(sample_length_in_seconds*hps.sr)//top_prior.raw_to_tokens)*top_prior.raw_to_tokens
assert hps.sample_length >= top_prior.n_ctx*top_prior.raw_to_tokens, 'Please choose a larger sample_length_in_seconds'

# Note: Metas can contain different prompts per sample.
# By default, all samples use the same prompt.
metas = [dict(artist = song_artist,
            genre = song_genre,
            total_length = hps.sample_length,
            offset = 0,
            lyrics = song_lyrics,
            ),
          ] * hps.n_samples
labels = [None, None, top_prior.labeller.get_batch_labels(metas, 'cuda')]

# Lower levels use fixed batch and chunk sizes. The top level reuses the
# max_batch_size and chunk_size values set earlier in this cell.
lower_batch_size = 16
lower_level_chunk_size = 32
sampling_kwargs = [dict(temp=0.99, fp16=True, max_batch_size=lower_batch_size,
                        chunk_size=lower_level_chunk_size),
                    dict(temp=0.99, fp16=True, max_batch_size=lower_batch_size,
                         chunk_size=lower_level_chunk_size),
                    dict(temp=sampling_temperature, fp16=True, 
                         max_batch_size=max_batch_size, chunk_size=chunk_size)]
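The sample length set in the cell above is rounded down to a whole number of top-level tokens and must cover at least one full context window. The arithmetic can be sketched on its own, without loading any model. The values 128 for `raw_to_tokens` and 8192 for `n_ctx` are assumptions used for illustration. In the notebook they come from `top_prior`, and they may differ between models.

```python
def round_to_token_multiple(seconds, sr, raw_to_tokens):
    """Convert seconds to raw samples, rounded down to a multiple of raw_to_tokens."""
    return (int(seconds * sr) // raw_to_tokens) * raw_to_tokens


SR = 44100            # matches hps.sr in the cell above
RAW_TO_TOKENS = 128   # assumed value of top_prior.raw_to_tokens
N_CTX = 8192          # assumed value of top_prior.n_ctx

sample_length = round_to_token_multiple(76, SR, RAW_TO_TOKENS)
prompt_length = round_to_token_multiple(12, SR, RAW_TO_TOKENS)
min_length = N_CTX * RAW_TO_TOKENS

print("sample length in raw samples:", sample_length)
print("prompt length in raw samples:", prompt_length)
print("shortest allowed sample in seconds:", min_length / SR)
```

Under these assumed values, the shortest allowed sample is a little under 24 seconds, so a smaller `sample_length_in_seconds` would trip the assertion in the cell above.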

# Start the generation process

Now we're ready to sample from the model. We'll generate the top level (2) first, followed by the first upsampling (level 1) and the second upsampling (level 0). In this Colab we load the top prior separately from the upsamplers because of memory constraints on the hosted runtimes. If you are using a local machine, you can also load all models directly with make_models, and then use sample.py's ancestral_sampling to do all of this in one step.

After each level, we decode to raw audio and save the audio files.   

This next cell will take a while (approximately 10 minutes per 20 seconds of generated audio).
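As a rough guide, the rate quoted above can be turned into an estimate for a given sample length. This is a back-of-the-envelope sketch: `estimate_top_level_minutes` is a hypothetical helper, and it assumes the time scales linearly with sample length at the quoted rate, which will vary with the GPU and model.

```python
def estimate_top_level_minutes(sample_seconds, minutes_per_20_seconds=10.0):
    """Estimate top-level sampling time, assuming linear scaling (an assumption)."""
    return sample_seconds * minutes_per_20_seconds / 20


# The default configuration in this notebook asks for 76 seconds of audio.
print(estimate_top_level_minutes(76), "minutes (rough estimate)")
```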

In [None]:
if sample_hps.mode == 'upsample':
  assert sample_hps.codes_file is not None
  # Load codes.
  data = t.load(sample_hps.codes_file, map_location='cpu')
  zs = [z.cuda() for z in data['zs']]
  assert zs[-1].shape[0] == hps.n_samples, f"Expected bs = {hps.n_samples}, got {zs[-1].shape[0]}"
  del data
  print('Falling through to the upsample step later in the notebook.')
elif sample_hps.mode == 'primed':
  assert sample_hps.audio_file is not None
  audio_files = sample_hps.audio_file.split(',')
  duration = (int(sample_hps.prompt_length_in_seconds*hps.sr)//top_prior.raw_to_tokens)*top_prior.raw_to_tokens
  x = load_prompts(audio_files, duration, hps)
  zs = top_prior.encode(x, start_level=0, end_level=len(priors), bs_chunks=x.shape[0])
  zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)
else:
  raise ValueError(f'Unknown sample mode {sample_hps.mode}.')

# Upsampling

We are now done with the large top_prior model, and instead load the upsamplers.

Please note: this next upsampling step will take several hours. On the free tier, Google Colab lets you run for 12 hours. As the upsampling completes, samples will appear in the Files tab (on the left of the Colab) under "samples" (or whatever hps.name is currently set to). Level 1 is the partially upsampled version, and level 0 is the fully upsampled result.
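Since this step runs for hours, it is useful to see which levels have already been written to the sample folder. The sketch below is a minimal illustration: `saved_levels` is a hypothetical helper, and it assumes every level stores its codes as `level_<n>/data.pth.tar`, the layout the resume logic earlier in this notebook looks for at levels 1 and 2. The demo uses a temporary folder in place of hps.name.

```python
import os
import tempfile


def saved_levels(sample_dir, levels=(2, 1, 0)):
    """Return the levels that already have a codes file under sample_dir."""
    return [
        level
        for level in levels
        if os.path.isfile(os.path.join(sample_dir, f"level_{level}", "data.pth.tar"))
    ]


# Demo: fake a folder in which only the top level has been saved so far.
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "level_2"))
    open(os.path.join(demo_dir, "level_2", "data.pth.tar"), "w").close()
    found = saved_levels(demo_dir)

print("levels saved so far:", found)
```

In the notebook, the call would be `saved_levels(hps.name)`.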

In [None]:
# Set this False if you are on a local machine that has enough memory (this allows you to do the
# lyrics alignment visualization during the upsampling stage). For a hosted runtime, 
# we'll need to go ahead and delete the top_prior if you are using the 5b_lyrics model.
if True:
  del top_prior
  empty_cache()
  top_prior=None
upsamplers = [make_prior(setup_hparams(prior, dict()), vqvae, 'cpu') for prior in priors[:-1]]
labels[:2] = [prior.labeller.get_batch_labels(metas, 'cuda') for prior in upsamplers]

# The actual upsampling bit
zs = upsample(zs, labels, sampling_kwargs, [*upsamplers, top_prior], hps)

Listen to your final sample!

In [None]:
Audio(f'{hps.name}/level_0/item_0.wav')
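The cell above plays only the first sample, while the run generates hps.n_samples of them. The sketch below lists where the others would be. `sample_paths` is a hypothetical helper, and it assumes the remaining samples follow the same `item_<i>.wav` naming as `item_0.wav`.

```python
def sample_paths(sample_dir, n_samples, level=0):
    """Build the expected .wav path for each generated sample (naming is assumed)."""
    return [f"{sample_dir}/level_{level}/item_{i}.wav" for i in range(n_samples)]


# Example with the default folder and the 3 samples used for the 5B models.
for path in sample_paths("/content/gdrive/My Drive/Jukebox/Samples", 3):
    print(path)
```

Each path can be passed to `Audio(...)` in its own cell to listen to that sample.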

In [None]:
del upsamplers
empty_cache()