# Deep Sonic
### ___Chris Pagolu, Joshua JJ Wonder___

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AllStars101-sudo/DeepSonic/blob/main/DeepSonic.ipynb)

[DeepSonic](https://github.com/AllStars101-sudo/DeepSonic) is a fully open-source deep learning music experiment that can synthesize, generate, remix, and modify music, all using the power of AI. Thanks to powerful open-source libraries such as Magenta by TensorFlow/Google and Jukebox by OpenAI, we were able to create a multi-functional AI audio engineer.

Note: This notebook runs all code natively. A cloud service is only needed if you do not have a dedicated NVIDIA GPU.

# Basic Hardware Requirements and Recommendations

The DeepSonic Experiment requires considerably powerful hardware. 

An NVIDIA GeForce RTX 2000 (Turing) Series GPU with 8 GB of VRAM is the minimum requirement. A cloud-based NVIDIA Tesla or a server NVIDIA Quadro GPU with at least 16 GB of VRAM is recommended, and a supercomputer will perform best, depending on the task. There are no explicit CPU requirements for DeepSonic, but an AMD Ryzen 3 3200 or Intel Core i3 8100 (or higher) is recommended. The more powerful, the better.

# Table of Contents:
- 1. [How to get started?](#how-to-get-started)
  - 1.1. [Quick Install Guide](#quick-install-guide)
- 2. [DeepSynth](#deepsynth)
- 3. [GANSynth](#gansynth)
- 4. [DDSP Timbre Transfer](#ddsp-timbre-transfer)
- 5. [Music Transformer](#music-transformer)
  - 5.1. [Melody-conditioned Piano Transformer](#melody-conditioned-piano-transformer)
- 6. [DeepLyrics](#deeplyrics)

# How to get started?

<a id="#how-to-get-started">To</a> get started with DeepSonic, all you have to do is install Magenta and Jukebox from their official GitHub repositories.

## Quick Install Guide:

<a id="#quick-install-guide">Run</a> the following code in your shell (taken from the official Magenta and Jukebox wiki) to install the required tools. We recommend using a Debian-based operating system. Neither Windows nor the Windows Subsystem for Linux (WSL2) appeared to work at the time of our testing, possibly because CUDA support on WSL was still at an early stage. Also note that root privileges are required to install the audio libraries.

```bash
# Required commands:

sudo apt-get update && sudo apt-get install build-essential libasound2-dev libjack-dev portaudio19-dev
curl https://raw.githubusercontent.com/tensorflow/magenta/main/magenta/tools/magenta-install.sh > /tmp/magenta-install.sh
bash /tmp/magenta-install.sh
conda create --name jukebox python=3.7.5
conda activate jukebox
conda install mpi4py=3.0.3 # if this fails, try: pip install mpi4py==3.0.3
conda install pytorch=1.4 torchvision=0.5 cudatoolkit=10.0 -c pytorch
git clone https://github.com/openai/jukebox.git
cd jukebox
pip install -r requirements.txt
pip install -e .
conda install av=7.0.01 -c conda-forge 
pip install ./tensorboardX
curl -o /path/to/dir/cs1-1pre.mid http://www.jsbach.net/midi/cs1-1pre.mid
curl -o /path/to/dir/arp.mid http://storage.googleapis.com/magentadata/papers/gansynth/midi/arp.mid
pip install -qU ddsp==1.6.5
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
sudo apt-get install apt-transport-https ca-certificates gnupg
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
sudo apt-get update && sudo apt-get install google-cloud-sdk
gsutil -q -m cp -r gs://magentadata/models/music_transformer/primers/* /home/chris/Downloads/DeepSonic/
gsutil -q -m cp gs://magentadata/soundfonts/Yamaha-C5-Salamander-JNv5.1.sf2 /home/chris/Downloads/DeepSonic/
pip install -q 'tensorflow-datasets < 4.0.0'
gsutil -q -m cp -r gs://magentadata/models/music_transformer/checkpoints/* /home/chris/Downloads/musictransformermodels/

 
# Following two commands are optional: Apex for faster training with fused_adam 

conda install pytorch=1.1 torchvision=0.3 cudatoolkit=10.0 -c pytorch
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
```
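
After running the install commands, it can help to confirm that the packages are visible to the Python environment you launch the notebook from. The snippet below is a minimal sketch, not part of the official Magenta or Jukebox instructions. The helper `missing_modules` and the `REQUIRED_MODULES` list are illustrative names; the list simply mirrors the top-level packages imported in the first notebook cell.

```python
import importlib.util

# Top-level packages imported by the first notebook cell (illustrative list).
REQUIRED_MODULES = ["magenta", "note_seq", "tensorflow", "tensorflow_datasets",
                    "ddsp", "crepe", "gin", "librosa", "jukebox", "torch"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Note that the commands above create a separate `jukebox` conda environment, so the result depends on which environment is active when you run the check.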

In [1]:
print('Importing Modules...\n')
#basic libraries
import os
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Audio
%matplotlib inline

#magenta libraries
from magenta.models.nsynth import utils
from magenta.models.nsynth.wavenet import fastgen
from note_seq.notebook_utils import colab_play as play
MIDI_SONG_DEFAULT = 'cs1-1pre.mid'
MIDI_RIFF_DEFAULT = 'arp.mid'

import time
import warnings
import IPython
import librosa
from magenta.models.nsynth.utils import load_audio
from magenta.models.gansynth.lib import flags as lib_flags
from magenta.models.gansynth.lib import generate_util as gu
from magenta.models.gansynth.lib import model as lib_model
from magenta.models.gansynth.lib import util
import note_seq
import tensorflow.compat.v1 as tf
import tensorflow_datasets as tfds
import crepe
import ddsp
import ddsp.training
from ddsp.training.postprocessing import (
    detect_notes, fit_quantile_transform
)
import gin
import pickle
from scipy.io import wavfile
tf.disable_v2_behavior()
warnings.filterwarnings("ignore")

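#music transformer libraries (needed by the Music Transformer cells: text_encoder, decoding, trainer_lib, score2perf)
from tensor2tensor import models
from tensor2tensor import problems
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.utils import decoding
from tensor2tensor.utils import trainer_lib
from magenta.models.score2perf import score2perf

print('Copying primers and Salamander piano SoundFont (via https://sites.google.com/site/soundfonts4u) from GCS...')
!gsutil -q -m cp -r gs://magentadata/models/music_transformer/primers/* /home/chris/Downloads/DeepSonic/
!gsutil -q -m cp gs://magentadata/soundfonts/Yamaha-C5-Salamander-JNv5.1.sf2 /home/chris/Downloads/DeepSonic/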


#jukebox libraries

import jukebox
import torch as t
import librosa
from jukebox.make_models import make_vqvae, make_prior, MODELS, make_model
from jukebox.hparams import Hyperparams, setup_hparams
from jukebox.sample import sample_single_window, _sample, \
                           sample_partial_window, upsample
from jukebox.utils.dist_utils import setup_dist_from_mpi
from jukebox.utils.torch_utils import empty_cache
rank, local_rank, device = setup_dist_from_mpi()

get_name = lambda f: os.path.splitext(os.path.basename(f))[0]

print('Alrighty, we are done here.')

Importing Modules...

Instructions for updating:
Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior.
Using cuda True
Sucess!! Environment is now setup.


In [2]:
!nvidia-smi #checks if cuda and nvidia drivers are working properly

Sun Aug 22 15:07:26 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:09:00.0  On |                  N/A |
| 25%   49C    P5    16W / 215W |   2234MiB /  7981MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
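
If you want to check free VRAM programmatically instead of reading the table by eye, the memory column can be pulled out with a regular expression. This is a rough sketch: `parse_gpu_memory` is a hypothetical helper, and it assumes the `used MiB / total MiB` layout shown in the output above, which is not a stable interface and may change between driver versions.

```python
import re


def parse_gpu_memory(smi_text):
    """Return a list of (used_mib, total_mib) pairs found in nvidia-smi table text."""
    pattern = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")
    return [(int(used), int(total)) for used, total in pattern.findall(smi_text)]


# The memory row copied from the output above.
sample = "| 25%   49C    P5    16W / 215W |   2234MiB /  7981MiB |      6%      Default |"

for used, total in parse_gpu_memory(sample):
    print(f"{total - used} MiB free of {total} MiB")
```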

# DeepSynth
### __Adapted from the [EZSynth Experiment](https://colab.research.google.com/notebooks/magenta/nsynth/nsynth.ipynb) by Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, Mohammad Norouzi__
<a id="#deepsynth">Neural</a> Audio Synthesis of Musical Notes with WaveNet Autoencoders

### Additional Resources (as provided in the [EZSynth Notebook](https://colab.research.google.com/notebooks/magenta/nsynth/nsynth.ipynb)):
* [Nat and Friends "Behind the scenes"](https://www.youtube.com/watch?v=BOoSy-Pg8is)
* [Original Blog Post](https://magenta.tensorflow.org/nsynth)
* [NSynth Instrument](https://magenta.tensorflow.org/nsynth-instrument)
* [Jupyter Notebook Tutorial](https://magenta.tensorflow.org/nsynth-fastgen)
* [ArXiv Paper](https://arxiv.org/abs/1704.01279)
* [Github Code](https://github.com/tensorflow/magenta/tree/main/magenta/models/nsynth)

There are two pretrained models to choose from (thanks, Google :)): one trained on the individual instrument notes of the [NSynth Dataset](https://magenta.tensorflow.org/datasets/nsynth) ("Instruments"), and another trained on a variety of voices in the wild for an art project ("Voices", a mixture of singing and speaking). The Instruments model was trained on a larger quantity of data, so it tends to generalize a bit better. Neither reconstructs audio perfectly, but both add their own unique character to sounds.

In [None]:
#Choose a Model { vertical-output: true, run: "auto" }
Model = "Instruments" #@param ["Instruments", "Voices"] {type:"string"}
ckpts = {'Instruments': 'wavenet-ckpt/model.ckpt-200000',
         'Voices': 'wavenet-voice-ckpt/model.ckpt-200000'}

ckpt_path = ckpts[Model]
print('Using model pretrained on %s.' % Model)

# Use local audio files

In the next section, you may choose to specify which audio file you want to use for the audio synthesis. Note: the larger your audio file, the longer it'll take to encode and the longer it'll take to synthesize the audio, depending on how powerful your GPU is.

In [None]:
#Set Sound Length (in Seconds) { vertical-output: true, run: "auto" }
Length = 60.0 #set the length of your synthesized audio
SR = 16000
SAMPLE_LENGTH = int(SR * Length) #audio length

In [None]:
#Upload sound files (.wav, .mp3)

try:
  file_list, audio_list = [], [] #empty Python lists that will hold the file names and the loaded audio
  fname="test.mp3" # name of the audio file
  audio = utils.load_audio(fname, sample_length=SAMPLE_LENGTH, sr=SR)  #loads audio file for magenta to process
  file_list.append(fname)
  audio_list.append(audio)
  names = [get_name(f) for f in file_list]
  # Pad and peak normalize
  for i in range(len(audio_list)):
    audio_list[i] = audio_list[i] / np.abs(audio_list[i]).max()

    if len(audio_list[i]) < SAMPLE_LENGTH:
      padding = SAMPLE_LENGTH - len(audio_list[i])
      audio_list[i] = np.pad(audio_list[i], (0, padding), 'constant')

  audio_list = np.array(audio_list)
except Exception as e:
  print("Error encountered. Sure the file is .wav or .mp3? Does your GPU have enough memory left?")
  print(e)
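
The pad-and-peak-normalize step in the cell above can be pictured with plain Python lists. This is only a sketch of the idea, not the notebook's NumPy code: `peak_normalize_and_pad` is a hypothetical helper, and the guard for an all-silent clip is an addition that the notebook cell does not have (there, a silent clip would cause a division by zero).

```python
def peak_normalize_and_pad(samples, target_length):
    """Scale samples so the loudest one has magnitude 1.0, then zero-pad to target_length."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak > 0.0:
        normalized = [s / peak for s in samples]
    else:
        # All-silent (or empty) clip: nothing to scale.
        normalized = list(samples)
    padding = max(0, target_length - len(normalized))
    return normalized + [0.0] * padding


print(peak_normalize_and_pad([0.5, -0.25], 4))
```

Clips longer than `target_length` are returned unchanged here; in the notebook, `utils.load_audio` already limits the clip to `SAMPLE_LENGTH` samples.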

The code below may take some time, depending on your GPU.

In [None]:
#Generate Encodings
audio = np.array(audio_list)
z = fastgen.encode(audio, ckpt_path, SAMPLE_LENGTH)
print('Encoded %d files' % z.shape[0])


# Start with reconstructions
z_list = [z_ for z_ in z]
name_list = ['recon_' + name_ for name_ in names]

# Add all the mean interpolations
n = len(names)
for i in range(n - 1):
  for j in range(i + 1, n):
    new_z = (z[i] + z[j]) / 2.0
    new_name = 'interp_' + names[i] + '_X_'+ names[j]
    z_list.append(new_z)
    name_list.append(new_name)

print("%d total: %d reconstructions and %d interpolations" % (len(name_list), n, len(name_list) - n))
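
To see how many outputs the cell above produces, note that every file yields one reconstruction and every unordered pair of files yields one mean interpolation, so `n` files give `n + n(n-1)/2` encodings. The sketch below reproduces only the naming logic with the standard library; `output_names` is a hypothetical helper and the file names are made up.

```python
from itertools import combinations


def output_names(names):
    """List the reconstruction and pairwise-interpolation names for the given files."""
    recon = ["recon_" + name for name in names]
    interp = ["interp_" + a + "_X_" + b for a, b in combinations(names, 2)]
    return recon + interp


names = ["flute", "voice", "drums"]  # made-up example file names
for name in output_names(names):
    print(name)
```

With a single input file, as in the upload cell above, there are no pairs, so only the reconstruction is synthesized.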

# Final Step: Synthesize

With a GPU, this should take about 4 minutes per second of audio per batch. Approximate time required for a 60-second song (960,000 iterations at 16 kHz, one per audio sample): 8-12 hours on a GeForce GPU and 2-8 hours on a Quadro or Tesla GPU. After that, your synthesized audio will appear in the same directory as this notebook as `recon_<name of audio>.wav`.

In [None]:
#Synthesize Interpolations
print('Total Iterations to Complete: %d\n' % SAMPLE_LENGTH)

encodings = np.array(z_list)
save_paths = [name + '.wav' for name in name_list]
fastgen.synthesize(encodings,
                   save_paths=save_paths,
                   checkpoint_path=ckpt_path,
                   samples_per_save=int(SAMPLE_LENGTH / 10))
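
The iteration count printed by the cell above is simply the number of audio samples, and `samples_per_save` writes the partial result to disk every tenth of the way through. A small sketch of that arithmetic, where `synthesis_plan` is an illustrative helper name:

```python
def synthesis_plan(seconds, sample_rate=16000, n_saves=10):
    """Return (total_iterations, samples_per_save) for a clip of the given length."""
    total_iterations = int(sample_rate * seconds)
    samples_per_save = int(total_iterations / n_saves)
    return total_iterations, samples_per_save


total, per_save = synthesis_plan(60.0)
print(total, per_save)  # 960000 96000
```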

# GANSynth
### __Adapted from the [GANSynth Demo Notebook](https://colab.research.google.com/notebooks/magenta/gansynth/gansynth_demo.ipynb) by the Magenta team.__

<a id="#gansynth">GANSynth</a> generates audio using Generative Adversarial Networks. GANSynth learns to produce individual instrument notes using the NSynth Dataset from Google. With pitch provided as a conditional attribute, the generator learns to use its latent space to represent different instrument timbres. This allows us to synthesize performances from MIDI files, either keeping the timbre constant, or interpolating between instruments over time. Rather than generate audio sequentially, GANSynth generates an entire sequence in parallel, synthesizing audio significantly faster than real-time on a modern GPU and ~50,000 times faster than a standard WaveNet. Unlike the WaveNet autoencoders from the original paper that used a time-distributed latent code, GANSynth generates the entire audio clip from a single latent vector, allowing for easier disentanglement of global features such as pitch and timbre. Using the NSynth dataset of musical instrument notes, we can independently control pitch and timbre.

<img src="https://storage.googleapis.com/magentadata/papers/gansynth/figures/models.jpeg" alt="GANSynth figure" width="600">


### Additional Resources (as provided in the [GANSynth Notebook](https://colab.research.google.com/notebooks/magenta/gansynth/gansynth_demo.ipynb)):
* [GANSynth ICLR paper](https://arxiv.org/abs/1902.08710)
* [Audio Examples](http://goo.gl/magenta/gansynth-examples) 

In [None]:
# GLOBALS
CKPT_DIR = '/home/chris/Downloads/all_instruments' #Load Checkpoint of model
output_dir = '/home/chris/Downloads/DeepSonic/samples/gansynth' #where you want your final audio file to be saved
BATCH_SIZE = 16
SR = 16000

# Load the model
tf.reset_default_graph()
flags = lib_flags.Flags({
    'batch_size_schedule': [BATCH_SIZE],
    'tfds_data_dir': 'gs://tfds-data/datasets',
})
model = lib_model.Model.load_from_path(CKPT_DIR, flags)

# Helper functions
def load_midi(midi_path, min_pitch=36, max_pitch=84):
  """Load midi as a notesequence."""
  midi_path = util.expand_path(midi_path)
  ns = note_seq.midi_file_to_sequence_proto(midi_path)
  pitches = np.array([n.pitch for n in ns.notes])
  velocities = np.array([n.velocity for n in ns.notes])
  start_times = np.array([n.start_time for n in ns.notes])
  end_times = np.array([n.end_time for n in ns.notes])
  valid = np.logical_and(pitches >= min_pitch, pitches <= max_pitch)
  notes = {'pitches': pitches[valid],
           'velocities': velocities[valid],
           'start_times': start_times[valid],
           'end_times': end_times[valid]}
  return ns, notes

def get_envelope(t_note_length, t_attack=0.010, t_release=0.3, sr=16000):
  """Create an attack sustain release amplitude envelope."""
  t_note_length = min(t_note_length, 3.0)
  i_attack = int(sr * t_attack)
  i_sustain = int(sr * t_note_length)
  i_release = int(sr * t_release)
  i_tot = i_sustain + i_release  # attack envelope doesn't add to sound length
  envelope = np.ones(i_tot)
  # Linear attack
  envelope[:i_attack] = np.linspace(0.0, 1.0, i_attack)
  # Linear release
  envelope[i_sustain:i_tot] = np.linspace(1.0, 0.0, i_release)
  return envelope

def combine_notes(audio_notes, start_times, end_times, velocities, sr=16000):
  """Combine audio from multiple notes into a single audio clip.

  Args:
    audio_notes: Array of audio [n_notes, audio_samples].
    start_times: Array of note starts in seconds [n_notes].
    end_times: Array of note ends in seconds [n_notes].
    velocities: Array of MIDI note velocities (0-127) [n_notes].
    sr: Integer, sample rate.

  Returns:
    audio_clip: Array of combined audio clip [audio_samples]
  """
  n_notes = len(audio_notes)
  clip_length = end_times.max() + 3.0
  audio_clip = np.zeros(int(clip_length) * sr)

  for t_start, t_end, vel, i in zip(start_times, end_times, velocities, range(n_notes)):
    # Generate an amplitude envelope
    t_note_length = t_end - t_start
    envelope = get_envelope(t_note_length)
    length = len(envelope)
    audio_note = audio_notes[i, :length] * envelope
    # Normalize
    audio_note /= audio_note.max()
    audio_note *= (vel / 127.0)
    # Add to clip buffer
    clip_start = int(t_start * sr)
    clip_end = clip_start + length
    audio_clip[clip_start:clip_end] += audio_note

  # Normalize
  audio_clip /= audio_clip.max()
  audio_clip /= 2.0
  return audio_clip

# Plotting tools
def specplot(audio_clip):
  p_min = np.min(36)
  p_max = np.max(84)
  f_min = librosa.midi_to_hz(p_min)
  f_max = 2 * librosa.midi_to_hz(p_max)
  octaves = int(np.ceil(np.log2(f_max) - np.log2(f_min)))
  bins_per_octave = 36
  n_bins = int(bins_per_octave * octaves)
  C = librosa.cqt(audio_clip, sr=SR, hop_length=2048, fmin=f_min, n_bins=n_bins, bins_per_octave=bins_per_octave)
  power = 10 * np.log10(np.abs(C)**2 + 1e-6)
  plt.matshow(power[::-1, 2:-2], aspect='auto', cmap=plt.cm.magma)
  plt.yticks([])
  plt.xticks([])

print('And...... Done!')

## Random Interpolation

These cells take the MIDI for a full song and interpolate between several random latent vectors (equally spaced in time) over the whole song. The result sounds like instruments that slowly and smoothly morph between each other.  

In [None]:
midi_file = "Bach Prelude (Default)" #name of a default midi file, provided by Google

midi_path = MIDI_SONG_DEFAULT

ns, notes = load_midi(midi_path)
print('Loaded {}'.format(midi_path))
note_seq.plot_sequence(ns)

## Generate Random Interpolation
Assign the number of seconds to take in interpolating between each random instrument. Larger numbers will have slower and smoother interpolations.

In [None]:
seconds_per_instrument = 5 

# Distribute latent vectors linearly in time
z_instruments, t_instruments = gu.get_random_instruments(
    model, notes['end_times'][-1], secs_per_instrument=seconds_per_instrument)

# Get latent vectors for each note
z_notes = gu.get_z_notes(notes['start_times'], z_instruments, t_instruments)

# Generate audio for each note
print('Generating {} samples...'.format(len(z_notes)))
audio_notes = model.generate_samples_from_z(z_notes, notes['pitches'])

# Make a single audio clip
audio_clip = combine_notes(audio_notes,
                           notes['start_times'],
                           notes['end_times'],
                           notes['velocities'])
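
To make the idea of "latent vectors distributed in time" concrete, the sketch below assigns each note a latent vector by interpolating between the two anchor instruments that surround the note's start time. It is a simplified illustration in plain Python, not the code of `gu.get_z_notes`: the name `interpolate_latents` is hypothetical, and linear interpolation is an assumption made here for readability (the library may interpolate differently, for example spherically).

```python
from bisect import bisect_left


def interpolate_latents(start_times, z_anchors, t_anchors):
    """Linearly interpolate anchor latent vectors at each note start time.

    z_anchors: list of latent vectors (lists of floats), one per anchor.
    t_anchors: sorted anchor times in seconds, same length as z_anchors.
    """
    z_notes = []
    for t in start_times:
        # Index of the anchor to the left of t, clamped to a valid segment.
        idx = bisect_left(t_anchors, t) - 1
        idx = max(0, min(idx, len(t_anchors) - 2))
        t_left, t_right = t_anchors[idx], t_anchors[idx + 1]
        weight = (t - t_left) / (t_right - t_left)
        weight = max(0.0, min(1.0, weight))
        z_left, z_right = z_anchors[idx], z_anchors[idx + 1]
        z_notes.append([(1.0 - weight) * a + weight * b
                        for a, b in zip(z_left, z_right)])
    return z_notes


# Two 2-dimensional "instruments", 10 seconds apart.
print(interpolate_latents([0.0, 5.0, 10.0], [[0.0, 0.0], [2.0, 4.0]], [0.0, 10.0]))
```

A larger `seconds_per_instrument` spreads the anchors further apart, so consecutive notes receive more similar latent vectors and the timbre changes more slowly.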

## Play Synthesized Audio
A [Constant-Q Spectrogram](https://en.wikipedia.org/wiki/Constant-Q_transform) will also be displayed.

In [None]:
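# Play the audio
print('\nAudio:')
# display() is required here because the Audio widget is not the last expression in the cell
IPython.display.display(IPython.display.Audio(audio_clip, rate=SR))
print('CQT Spectrogram:')
specplot(audio_clip)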

## Save the audio

In [None]:
fname = os.path.join(output_dir, 'generated_clip.wav') #enter desired file name
gu.save_wav(audio_clip, fname)

## Generate your own interpolation (custom interpolation)

These cells allow you to choose two latent vectors and interpolate between them over a MIDI clip.

In [None]:
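midi_file = "Arpeggio (Default)"

midi_path = MIDI_RIFF_DEFAULT

# Load the riff into its own variables (ns_2, notes_2), which the custom interpolation cell uses
ns_2, notes_2 = load_midi(midi_path)
print('Loaded {}'.format(midi_path))
note_seq.plot_sequence(ns_2)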

In [None]:
#Sample some random instruments

number_of_random_instruments = 10 #enter desired number of random instruments (max: 16)
pitch_preview = 60
n_preview = number_of_random_instruments

pitches_preview = [pitch_preview] * n_preview
z_preview = model.generate_z(n_preview)

audio_notes = model.generate_samples_from_z(z_preview, pitches_preview)
for i, audio_note in enumerate(audio_notes):
  print("Instrument: {}".format(i))
  play(audio_note, sample_rate=16000)

## Generate custom interpolation

In [None]:
instruments = [0, 2, 4, 0]
times = [0, 0.3, 0.6, 1.0]
times[0] = -0.001
times[-1] = 1.0

z_instruments = np.array([z_preview[i] for i in instruments])
t_instruments = np.array([notes_2['end_times'][-1] * t for t in times])

# Get latent vectors for each note
z_notes = gu.get_z_notes(notes_2['start_times'], z_instruments, t_instruments)

# Generate audio for each note
print('Generating {} samples...'.format(len(z_notes)))
audio_notes = model.generate_samples_from_z(z_notes, notes_2['pitches'])

# Make a single audio clip
audio_clip = combine_notes(audio_notes,
                           notes_2['start_times'],
                           notes_2['end_times'],
                           notes_2['velocities'])

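# Play the audio
print('\nAudio:')
# display() is required here because the Audio widget is not the last expression in the cell
IPython.display.display(IPython.display.Audio(audio_clip, rate=SR))
print('CQT Spectrogram:')
specplot(audio_clip)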


## Save the file

In [None]:
# Write the file
fname = os.path.join(output_dir, 'generated_clip.wav') # name of the file
gu.save_wav(audio_clip, fname)

# DDSP Timbre Transfer
### __Adapted from the [DDSP Timbre Transfer Demo](https://colab.research.google.com/github/magenta/ddsp/blob/main/ddsp/colab/demos/timbre_transfer.ipynb#scrollTo=Go36QW9AS_CD) by the Magenta team.__

<a id="#ddsp-timbre-transfer">Convert</a> audio between sound sources with pretrained models. For example, you can try turning your voice into a violin, or scratching your laptop and seeing how it sounds as a flute!
This section is a demo of timbre transfer using DDSP (Differentiable Digital Signal Processing). The model here is trained to generate audio conditioned on a time series of fundamental frequency and loudness.

<img src="https://storage.googleapis.com/ddsp/additive_diagram/ddsp_autoencoder.png" alt="DDSP process depiction">

In [None]:
sample_rate = 16000

## Load audio

In [None]:
samplerate, audio = wavfile.read('generated_clip.wav')
if len(audio.shape) == 1:
  audio = audio[np.newaxis, :]
print('\nExtracting audio features...')


# Setup the session.
ddsp.spectral_ops.reset_crepe()

# Compute features.
start_time = time.time()
audio_features = ddsp.training.metrics.compute_audio_features(audio)
audio_features['loudness_db'] = audio_features['loudness_db'].astype(np.float32)
audio_features_mod = None
print('Audio features took %.1f seconds' % (time.time() - start_time))


TRIM = -15
# Plot Features.
fig, ax = plt.subplots(nrows=3, 
                       ncols=1, 
                       sharex=True,
                       figsize=(6, 8))
ax[0].plot(audio_features['loudness_db'][:TRIM])
ax[0].set_ylabel('loudness_db')

ax[1].plot(librosa.hz_to_midi(audio_features['f0_hz'][:TRIM]))
ax[1].set_ylabel('f0 [midi]')

ax[2].plot(audio_features['f0_confidence'][:TRIM])
ax[2].set_ylabel('f0 confidence')
_ = ax[2].set_xlabel('Time step [frame]')
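
The middle plot converts the fundamental frequency from Hz to MIDI note numbers. The conversion is the standard equal-temperament formula with A4 = 440 Hz = MIDI note 69. The sketch below writes it out with the standard library so you can check values by hand; it is for illustration and is not meant to replace `librosa.hz_to_midi`.

```python
import math


def hz_to_midi(hz):
    """Convert a frequency in Hz (must be > 0) to a fractional MIDI note number."""
    return 69.0 + 12.0 * math.log2(hz / 440.0)


def midi_to_hz(midi):
    """Convert a MIDI note number to a frequency in Hz."""
    return 440.0 * 2.0 ** ((midi - 69.0) / 12.0)


print(hz_to_midi(440.0))           # 69.0 (A4)
print(round(midi_to_hz(60.0), 2))  # 261.63 (middle C)
```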

## Loading the model

In [None]:
model = 'Violin' #@param ['Violin', 'Flute', 'Flute2', 'Trumpet', 'Tenor_Saxophone', 'Upload your own (checkpoint folder as .zip)']
MODEL = model

if model in {'Violin', 'Flute', 'Flute2', 'Trumpet', 'Tenor_Saxophone'}:
  # Pretrained models.
  PRETRAINED_DIR = '/home/chris/Downloads/timbre_models'
  model_dir = PRETRAINED_DIR
  gin_file = os.path.join(model_dir, 'operative_config-0.gin')


# Load the dataset statistics.
DATASET_STATS = None
dataset_stats_file = os.path.join(model_dir, 'dataset_statistics.pkl')
print(f'Loading dataset statistics from {dataset_stats_file}')
try:
  if tf.io.gfile.exists(dataset_stats_file):
    with tf.io.gfile.GFile(dataset_stats_file, 'rb') as f:
      DATASET_STATS = pickle.load(f)
except Exception as err:
  print('Loading dataset statistics from pickle failed: {}.'.format(err))


# Parse gin config,
with gin.unlock_config():
  gin.parse_config_file(gin_file, skip_unknown=True)

# Assumes only one checkpoint in the folder, 'ckpt-[iter]'.
ckpt_files = [f for f in tf.io.gfile.listdir(model_dir) if 'ckpt' in f]
ckpt_name = ckpt_files[0].split('.')[0]
ckpt = os.path.join(model_dir, ckpt_name)

# Ensure dimensions and sampling rates are equal
time_steps_train = gin.query_parameter('F0LoudnessPreprocessor.time_steps')
n_samples_train = gin.query_parameter('Harmonic.n_samples')
hop_size = int(n_samples_train / time_steps_train)

time_steps = int(audio.shape[1] / hop_size)
n_samples = time_steps * hop_size

# print("===Trained model===")
# print("Time Steps", time_steps_train)
# print("Samples", n_samples_train)
# print("Hop Size", hop_size)
# print("\n===Resynthesis===")
# print("Time Steps", time_steps)
# print("Samples", n_samples)
# print('')

gin_params = [
    'Harmonic.n_samples = {}'.format(n_samples),
    'FilteredNoise.n_samples = {}'.format(n_samples),
    'F0LoudnessPreprocessor.time_steps = {}'.format(time_steps),
    'oscillator_bank.use_angular_cumsum = True',  # Avoids cumsum accumulation errors.
]

with gin.unlock_config():
  gin.parse_config(gin_params)


# Trim all input vectors to correct lengths 
for key in ['f0_hz', 'f0_confidence', 'loudness_db']:
  audio_features[key] = audio_features[key][:time_steps]
audio_features['audio'] = audio_features['audio'][:, :n_samples]


# Set up the model just to predict audio given new conditioning
model = ddsp.training.models.Autoencoder()
model.restore(ckpt)

# Build model by running a batch through it.
start_time = time.time()
_ = model(audio_features, training=False)
print('Restoring model took %.1f seconds' % (time.time() - start_time))
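
The "ensure dimensions and sampling rates are equal" step in the cell above boils down to integer arithmetic: the hop size comes from the training configuration, and the input audio is trimmed to a whole number of hops. A small sketch, where `resynthesis_lengths` is a hypothetical helper and the numbers are illustrative values rather than ones read from a real `operative_config-0.gin`:

```python
def resynthesis_lengths(n_audio_samples, n_samples_train, time_steps_train):
    """Return (hop_size, time_steps, n_samples) for audio of the given length."""
    hop_size = n_samples_train // time_steps_train
    time_steps = n_audio_samples // hop_size
    n_samples = time_steps * hop_size  # audio length trimmed to whole hops
    return hop_size, time_steps, n_samples


# Illustrative values: 64000 training samples over 1000 time steps, 100000 input samples.
print(resynthesis_lengths(100000, 64000, 1000))  # (64, 1562, 99968)
```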

## Modify conditioning
These models were not explicitly trained to perform timbre transfer, so they may sound unnatural if the incoming loudness and frequencies are very different than the training data (which will always be somewhat true).

## Note Detection
You can leave this at 1.0 for most cases.

In [None]:
threshold = 1

In [None]:
ADJUST = True #change this to false if you want to manually adjust the pitch and loudness

Quiet parts without notes detected (dB)

In [None]:
quiet = 20 # max= 60

Force pitch to nearest note (amount)

In [None]:
autotune = 0

## Manual adjustment

In [None]:
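pitch_shift =  0 #shifts the pitch in octaves, max = 2, min = -2
loudness_shift = 0 #shifts the loudness in dB; required by the manual shifts in the "Apply conditions" cell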

## Apply conditions

In [None]:
audio_features_mod = {k: v.copy() for k, v in audio_features.items()}
## Helper functions.
def shift_ld(audio_features, ld_shift=0.0):
  """Shift loudness by a number of dB."""
  audio_features['loudness_db'] += ld_shift
  return audio_features


def shift_f0(audio_features, pitch_shift=0.0):
  """Shift f0 by a number of octaves."""
  audio_features['f0_hz'] *= 2.0 ** (pitch_shift)
  audio_features['f0_hz'] = np.clip(audio_features['f0_hz'], 
                                    0.0, 
                                    librosa.midi_to_hz(110.0))
  return audio_features


mask_on = None

if ADJUST and DATASET_STATS is not None:
  # Detect sections that are "on".
  mask_on, note_on_value = detect_notes(audio_features['loudness_db'],
                                        audio_features['f0_confidence'],
                                        threshold)

  if np.any(mask_on):
    # Shift the pitch register.
    target_mean_pitch = DATASET_STATS['mean_pitch']
    pitch = ddsp.core.hz_to_midi(audio_features['f0_hz'])
    mean_pitch = np.mean(pitch[mask_on])
    p_diff = target_mean_pitch - mean_pitch
    p_diff_octave = p_diff / 12.0
    round_fn = np.floor if p_diff_octave > 1.5 else np.ceil
    p_diff_octave = round_fn(p_diff_octave)
    audio_features_mod = shift_f0(audio_features_mod, p_diff_octave)


    # Quantile shift the note_on parts.
    _, loudness_norm = fit_quantile_transform(
        audio_features['loudness_db'],
        mask_on,
        inv_quantile=DATASET_STATS['quantile_transform'])

    # Turn down the note_off parts.
    mask_off = np.logical_not(mask_on)
    loudness_norm[mask_off] -=  quiet * (1.0 - note_on_value[mask_off][:, np.newaxis])
    loudness_norm = np.reshape(loudness_norm, audio_features['loudness_db'].shape)
    
    audio_features_mod['loudness_db'] = loudness_norm 

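    # Auto-tune.
    # NOTE: get_tuning_factor and auto_tune are not imported by this notebook; in the
    # original DDSP demo they come from ddsp.colab.colab_utils. With autotune = 0 this
    # branch is skipped.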
    if autotune:
      f0_midi = np.array(ddsp.core.hz_to_midi(audio_features_mod['f0_hz']))
      tuning_factor = get_tuning_factor(f0_midi, audio_features_mod['f0_confidence'], mask_on)
      f0_midi_at = auto_tune(f0_midi, tuning_factor, mask_on, amount=autotune)
      audio_features_mod['f0_hz'] = ddsp.core.midi_to_hz(f0_midi_at)

  else:
    print('\nSkipping auto-adjust (no notes detected or ADJUST box empty).')

else:
  print('\nSkipping auto-adjust (box not checked or no dataset statistics found).')

# Manual Shifts.
audio_features_mod = shift_ld(audio_features_mod, loudness_shift)
audio_features_mod = shift_f0(audio_features_mod, pitch_shift)



# Plot Features.
has_mask = int(mask_on is not None)
n_plots = 3 if has_mask else 2 
fig, axes = plt.subplots(nrows=n_plots, 
                      ncols=1, 
                      sharex=True,
                      figsize=(2*n_plots, 8))

if has_mask:
  ax = axes[0]
  ax.plot(np.ones_like(mask_on[:TRIM]) * threshold, 'k:')
  ax.plot(note_on_value[:TRIM])
  ax.plot(mask_on[:TRIM])
  ax.set_ylabel('Note-on Mask')
  ax.set_xlabel('Time step [frame]')
  ax.legend(['Threshold', 'Likelihood','Mask'])

ax = axes[0 + has_mask]
ax.plot(audio_features['loudness_db'][:TRIM])
ax.plot(audio_features_mod['loudness_db'][:TRIM])
ax.set_ylabel('loudness_db')
ax.legend(['Original','Adjusted'])

ax = axes[1 + has_mask]
ax.plot(librosa.hz_to_midi(audio_features['f0_hz'][:TRIM]))
ax.plot(librosa.hz_to_midi(audio_features_mod['f0_hz'][:TRIM]))
ax.set_ylabel('f0 [midi]')
_ = ax.legend(['Original','Adjusted'])
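
The pitch shift applied above works in octaves: shifting by `k` octaves multiplies every f0 value by `2**k`, and the result is clipped to the range from 0 Hz up to MIDI note 110. The sketch below shows that behaviour on a plain list of frequencies. It mirrors the logic of `shift_f0` but is a standalone illustration (it takes a list rather than the `audio_features` dictionary), and `MAX_F0_HZ` is a name introduced here.

```python
# Frequency of MIDI note 110, the upper clipping bound used by shift_f0.
MAX_F0_HZ = 440.0 * 2.0 ** ((110.0 - 69.0) / 12.0)


def shift_f0(f0_hz, octaves=0.0):
    """Shift a list of f0 values by a number of octaves and clip to [0, MAX_F0_HZ]."""
    factor = 2.0 ** octaves
    return [min(max(f * factor, 0.0), MAX_F0_HZ) for f in f0_hz]


print(shift_f0([220.0, 440.0], 1.0))  # [440.0, 880.0]
print(shift_f0([220.0], -1.0))        # [110.0]
```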


## Finally, resynthesize audio.
Runs a batch of predictions, then plays the original and the resynthesized audio.

In [None]:
af = audio_features if audio_features_mod is None else audio_features_mod

# Run a batch of predictions.
start_time = time.time()
outputs = model(af, training=False)
audio_gen = model.get_audio_from_outputs(outputs)
print('Prediction took %.1f seconds' % (time.time() - start_time))

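# Play (display() is required because the Audio widgets are not the last expression in the cell)
print('Original')
IPython.display.display(IPython.display.Audio(audio, rate=samplerate))  # rate reported by wavfile.read

print('Resynthesis')
IPython.display.display(IPython.display.Audio(audio_gen, rate=sample_rate))  # 16 kHz, as set at the top of this section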

# Music Transformer
### __Adapted from the [Generating Piano Music with Transformer](https://colab.research.google.com/notebooks/magenta/piano_transformer/piano_transformer.ipynb) by Ian Simon, Anna Huang, Jesse Engel and Curtis "Fjord" Hawthorne__
<a id="#music-transformer">An</a> attention-based neural network that can generate music with improved long-term coherence. The models used here were trained on over 10,000 hours of piano recordings from YouTube, transcribed using [Onsets and Frames](https://magenta.tensorflow.org/onsets-frames) and represented using the event vocabulary from [Performance RNN](https://magenta.tensorflow.org/performance-rnn). The TensorFlow/Magenta team made them available for free.


## Definitions and Helper Functions
Define a few constants and helper functions.

In [None]:
SF2_PATH = 'Yamaha-C5-Salamander-JNv5.1.sf2'
SAMPLE_RATE = 16000

# Decode a list of IDs.
def decode(ids, encoder):
  ids = list(ids)
  if text_encoder.EOS_ID in ids:
    ids = ids[:ids.index(text_encoder.EOS_ID)]
  return encoder.decode(ids)
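
The only logic in `decode` besides the encoder call is cutting the ID list at the first end-of-sequence marker. The sketch below isolates that step so it can be tried without tensor2tensor. `truncate_at_eos` is a hypothetical helper, and `EOS_ID = 1` is an assumption standing in for `text_encoder.EOS_ID`.

```python
EOS_ID = 1  # assumption: stands in for text_encoder.EOS_ID


def truncate_at_eos(ids, eos_id=EOS_ID):
    """Return the IDs that come before the first end-of-sequence marker."""
    ids = list(ids)
    if eos_id in ids:
        ids = ids[:ids.index(eos_id)]
    return ids


print(truncate_at_eos([5, 7, 1, 9, 1]))  # [5, 7]
print(truncate_at_eos([5, 7]))           # [5, 7]
```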

## Setup and Load Checkpoint
Set up generation from an unconditional Transformer model.

In [None]:
model_name = 'transformer'
hparams_set = 'transformer_tpu'
ckpt_path = 'gs://magentadata/models/music_transformer/checkpoints/unconditional_model_16.ckpt'

class PianoPerformanceLanguageModelProblem(score2perf.Score2PerfProblem):
  @property
  def add_eos_symbol(self):
    return True

problem = PianoPerformanceLanguageModelProblem()
unconditional_encoders = problem.get_feature_encoders()

# Set up HParams.
hparams = trainer_lib.create_hparams(hparams_set=hparams_set)
trainer_lib.add_problem_hparams(hparams, problem)
hparams.num_hidden_layers = 16
hparams.sampling_method = 'random'

# Set up decoding HParams.
decode_hparams = decoding.decode_hparams()
decode_hparams.alpha = 0.0
decode_hparams.beam_size = 1

# Create Estimator.
run_config = trainer_lib.create_run_config(hparams)
estimator = trainer_lib.create_estimator(
    model_name, hparams, run_config,
    decode_hparams=decode_hparams)

# Create input generator (so we can adjust priming and
# decode length on the fly).
def input_generator():
  global targets
  global decode_length
  while True:
    yield {
        'targets': np.array([targets], dtype=np.int32),
        'decode_length': np.array(decode_length, dtype=np.int32)
    }

# These values will be changed by subsequent cells.
targets = []
decode_length = 0

# Start the Estimator, loading from the specified checkpoint.
input_fn = decoding.make_input_fn_from_generator(input_generator())
unconditional_samples = estimator.predict(
    input_fn, checkpoint_path=ckpt_path)

# "Burn" one.
_ = next(unconditional_samples)
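
The input generator above rebuilds its features on every request, which is what lets later cells change the primer and decode length without restarting the Estimator. The sketch below shows the same pattern in plain Python. It uses a shared dictionary (a hypothetical `state` object) instead of global variables, purely to keep the example self-contained.

```python
# Hypothetical shared state standing in for the notebook's global variables.
state = {'targets': [], 'decode_length': 0}


def input_generator(state):
    """Yield a fresh feature dict built from the current state each time."""
    while True:
        yield {
            'targets': list(state['targets']),
            'decode_length': state['decode_length'],
        }


stream = input_generator(state)

# The first item is built from the placeholder values and thrown away,
# mirroring the "burn one" step.
burned = next(stream)

# Updating the state changes what the next item contains.
state['targets'] = [10, 20, 30]
state['decode_length'] = 1024
fresh = next(stream)
print(burned, fresh)
```

In the real notebook the first prediction is discarded for the same reason: it was produced from the empty placeholder inputs.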

## Generate a piano performance from scratch
This can take a minute or so depending on the length of the performance the model ends up generating. Because we use a [representation](https://magenta.tensorflow.org/performance-rnn) where each event corresponds to a variable amount of time, the actual number of seconds generated may vary.

In [None]:
targets = []
decode_length = 1024

# Generate sample events.
sample_ids = next(unconditional_samples)['outputs']

# Decode to NoteSequence.
midi_filename = decode(
    sample_ids,
    encoder=unconditional_encoders['targets'])
unconditional_ns = note_seq.midi_file_to_note_sequence(midi_filename)

# Play and plot.
note_seq.play_sequence(
    unconditional_ns,
    synth=note_seq.fluidsynth, sample_rate=SAMPLE_RATE, sf2_path=SF2_PATH)
note_seq.plot_sequence(unconditional_ns)

note_seq.sequence_proto_to_midi_file(
    unconditional_ns, 'unconditional.mid')  # name of the generated MIDI file
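
To illustrate why the same number of events can produce very different durations, the sketch below sums only the time-shift events in two toy event lists. The tuple format and the 10 ms step size (`STEPS_PER_SECOND = 100`) are assumptions made for this example, not the encoder's actual data structures.

```python
# Assumption: time shifts are counted in 10 ms steps, so 100 steps = 1 second.
STEPS_PER_SECOND = 100


def total_seconds(events):
    """Sum the time-shift events; note-on and note-off events take no time."""
    steps = sum(value for kind, value in events if kind == 'TIME_SHIFT')
    return steps / STEPS_PER_SECOND


# Two toy sequences with the same number of events.
sparse = [('NOTE_ON', 60), ('TIME_SHIFT', 100),
          ('NOTE_OFF', 60), ('TIME_SHIFT', 100)]
dense = [('NOTE_ON', 60), ('NOTE_ON', 64),
         ('TIME_SHIFT', 25), ('NOTE_OFF', 60)]

print(len(sparse), total_seconds(sparse))
print(len(dense), total_seconds(dense))
```

Both lists contain four events, yet the first spans two seconds and the second a quarter of a second, so a fixed decode length does not map to a fixed duration.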

## Choose Priming Sequence
Here you can choose a priming sequence to be continued by the model.

Set max_primer_seconds below to trim the primer to a fixed number of seconds (this will have no effect if the primer is already shorter than max_primer_seconds).

In [None]:
filenames = {
    'C major arpeggio': 'c_major_arpeggio.mid',
    'C major scale': 'c_major_scale.mid',
    'Clair de Lune': 'clair_de_lune.mid',
}
primer = 'C major scale'  # choose one of the keys in filenames: 'C major arpeggio', 'C major scale', or 'Clair de Lune'

primer_ns = note_seq.midi_file_to_note_sequence(filenames[primer])

# Handle sustain pedal in the primer.
primer_ns = note_seq.apply_sustain_control_changes(primer_ns)

# Trim to desired number of seconds.
max_primer_seconds = 20  #@param {type:"slider", min:1, max:120}
if primer_ns.total_time > max_primer_seconds:
  print('Primer is longer than %d seconds, truncating.' % max_primer_seconds)
  primer_ns = note_seq.extract_subsequence(
      primer_ns, 0, max_primer_seconds)

# Remove drums from primer if present.
if any(note.is_drum for note in primer_ns.notes):
  print('Primer contains drums; they will be removed.')
  notes = [note for note in primer_ns.notes if not note.is_drum]
  del primer_ns.notes[:]
  primer_ns.notes.extend(notes)

# Set primer instrument and program.
for note in primer_ns.notes:
  note.instrument = 1
  note.program = 0

# Play and plot the primer.
note_seq.play_sequence(
    primer_ns,
    synth=note_seq.fluidsynth, sample_rate=SAMPLE_RATE, sf2_path=SF2_PATH)
note_seq.plot_sequence(primer_ns)
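
The primer clean-up above (trimming to a maximum length and dropping drums) can be sketched without note_seq. The `Note` dataclass and `prepare_primer` function below are hypothetical simplifications, and the exact trimming behavior of note_seq.extract_subsequence may differ from this version.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Note:
    """Hypothetical, simplified note used only for this sketch."""
    pitch: int
    start_time: float
    end_time: float
    is_drum: bool = False


def prepare_primer(notes, max_seconds):
    """Drop drum notes and notes starting after the cutoff; clip the rest."""
    kept = []
    for note in notes:
        if note.is_drum or note.start_time >= max_seconds:
            continue
        kept.append(replace(note, end_time=min(note.end_time, max_seconds)))
    return kept


primer = [
    Note(60, 0.0, 1.0),
    Note(36, 0.5, 0.6, is_drum=True),  # removed: drum
    Note(64, 19.5, 21.0),              # clipped to end at the cutoff
    Note(67, 25.0, 26.0),              # removed: starts after the cutoff
]
trimmed = prepare_primer(primer, max_seconds=20)
print(trimmed)
```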

## Generate Continuation
Continue a piano performance, starting with the chosen priming sequence.

In [None]:
targets = unconditional_encoders['targets'].encode_note_sequence(
    primer_ns)

# Remove the end token from the encoded primer.
targets = targets[:-1]

decode_length = max(0, 4096 - len(targets))
if len(targets) >= 4096:
  print('Primer has more events than maximum sequence length; nothing will be generated.')

# Generate sample events.
sample_ids = next(unconditional_samples)['outputs']

# Decode to NoteSequence.
midi_filename = decode(
    sample_ids,
    encoder=unconditional_encoders['targets'])
ns = note_seq.midi_file_to_note_sequence(midi_filename)

# Append continuation to primer.
continuation_ns = note_seq.concatenate_sequences([primer_ns, ns])

# Play and plot.
note_seq.play_sequence(
    continuation_ns,
    synth=note_seq.fluidsynth, sample_rate=SAMPLE_RATE, sf2_path=SF2_PATH)
note_seq.plot_sequence(continuation_ns)
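
The decode length in the cell above is simply whatever remains of the 4096-event budget after the primer. A small sketch of that arithmetic follows; the function name is hypothetical.

```python
MAX_SEQUENCE_LENGTH = 4096  # the limit used in the cell above


def remaining_decode_length(primer_event_count,
                            max_length=MAX_SEQUENCE_LENGTH):
    """Return how many events the model may still generate after the primer."""
    return max(0, max_length - primer_event_count)


for count in (1000, 4096, 5000):
    print(count, remaining_decode_length(count))
```

A primer that already uses the whole budget leaves nothing to generate, which is the case the warning message in the cell covers.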

## Save generated audio

In [None]:
note_seq.sequence_proto_to_midi_file(
    continuation_ns, 'continuation.mid')

# Melody-Conditioned Piano Performance Model
<a id="#melody-conditioned-piano-transformer">Set</a> up generation from a melody-conditioned Transformer model.

In [None]:
model_name = 'transformer'
hparams_set = 'transformer_tpu'
ckpt_path = 'gs://magentadata/models/music_transformer/checkpoints/melody_conditioned_model_16.ckpt'

class MelodyToPianoPerformanceProblem(score2perf.AbsoluteMelody2PerfProblem):
  @property
  def add_eos_symbol(self):
    return True

problem = MelodyToPianoPerformanceProblem()
melody_conditioned_encoders = problem.get_feature_encoders()

# Set up HParams.
hparams = trainer_lib.create_hparams(hparams_set=hparams_set)
trainer_lib.add_problem_hparams(hparams, problem)
hparams.num_hidden_layers = 16
hparams.sampling_method = 'random'

# Set up decoding HParams.
decode_hparams = decoding.decode_hparams()
decode_hparams.alpha = 0.0
decode_hparams.beam_size = 1

# Create Estimator.
run_config = trainer_lib.create_run_config(hparams)
estimator = trainer_lib.create_estimator(
    model_name, hparams, run_config,
    decode_hparams=decode_hparams)

# These values will be changed by the following cell.
inputs = []
decode_length = 0

# Create input generator.
def input_generator():
  global inputs
  while True:
    yield {
        'inputs': np.array([[inputs]], dtype=np.int32),
        'targets': np.zeros([1, 0], dtype=np.int32),
        'decode_length': np.array(decode_length, dtype=np.int32)
    }

# Start the Estimator, loading from the specified checkpoint.
input_fn = decoding.make_input_fn_from_generator(input_generator())
melody_conditioned_samples = estimator.predict(
    input_fn, checkpoint_path=ckpt_path)

# "Burn" one.
_ = next(melody_conditioned_samples)

## Choose Melody
Here you can choose a melody to be accompanied by the model. There are a few preset options to choose from: "Twinkle Twinkle Little Star", "Mary Had a Little Lamb", and "Row Row Row Your Boat". You can use your own melody too. If your MIDI file is polyphonic, the notes with the highest pitch will be used as the melody.
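
One simple way to realize "the highest pitch wins" for a polyphonic file is to keep only the top note at each onset time. The sketch below does that on plain `(start_time, pitch)` pairs. It is an illustration of the idea under that assumption, not necessarily how the Magenta encoder extracts the melody.

```python
def skyline(notes):
    """Keep the highest pitch at each onset time.

    notes: iterable of (start_time, pitch) pairs.
    Returns a list of (start_time, pitch) pairs sorted by start time.
    """
    highest = {}
    for start, pitch in notes:
        if start not in highest or pitch > highest[start]:
            highest[start] = pitch
    return sorted(highest.items())


chords = [(0.0, 60), (0.0, 72), (0.5, 64), (0.5, 55), (1.0, 67)]
melody_line = skyline(chords)
print(melody_line)
```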

In [None]:
# Tokens to insert between melody events.
event_padding = 2 * [note_seq.MELODY_NO_EVENT]

melodies = {
    'Mary Had a Little Lamb': [
        64, 62, 60, 62, 64, 64, 64, note_seq.MELODY_NO_EVENT,
        62, 62, 62, note_seq.MELODY_NO_EVENT,
        64, 67, 67, note_seq.MELODY_NO_EVENT,
        64, 62, 60, 62, 64, 64, 64, 64,
        62, 62, 64, 62, 60, note_seq.MELODY_NO_EVENT,
        note_seq.MELODY_NO_EVENT, note_seq.MELODY_NO_EVENT
    ],
    'Row Row Row Your Boat': [
        60, note_seq.MELODY_NO_EVENT, note_seq.MELODY_NO_EVENT,
        60, note_seq.MELODY_NO_EVENT, note_seq.MELODY_NO_EVENT,
        60, note_seq.MELODY_NO_EVENT, 62,
        64, note_seq.MELODY_NO_EVENT, note_seq.MELODY_NO_EVENT,
        64, note_seq.MELODY_NO_EVENT, 62,
        64, note_seq.MELODY_NO_EVENT, 65,
        67, note_seq.MELODY_NO_EVENT, note_seq.MELODY_NO_EVENT,
        note_seq.MELODY_NO_EVENT, note_seq.MELODY_NO_EVENT, note_seq.MELODY_NO_EVENT,
        72, 72, 72, 67, 67, 67, 64, 64, 64, 60, 60, 60,
        67, note_seq.MELODY_NO_EVENT, 65,
        64, note_seq.MELODY_NO_EVENT, 62,
        60, note_seq.MELODY_NO_EVENT, note_seq.MELODY_NO_EVENT,
        note_seq.MELODY_NO_EVENT, note_seq.MELODY_NO_EVENT, note_seq.MELODY_NO_EVENT
    ],
    'Twinkle Twinkle Little Star': [
        60, 60, 67, 67, 69, 69, 67, note_seq.MELODY_NO_EVENT,
        65, 65, 64, 64, 62, 62, 60, note_seq.MELODY_NO_EVENT,
        67, 67, 65, 65, 64, 64, 62, note_seq.MELODY_NO_EVENT,
        67, 67, 65, 65, 64, 64, 62, note_seq.MELODY_NO_EVENT,
        60, 60, 67, 67, 69, 69, 67, note_seq.MELODY_NO_EVENT,
        65, 65, 64, 64, 62, 62, 60, note_seq.MELODY_NO_EVENT        
    ]
}

melody = 'Twinkle Twinkle Little Star'  #@param ['Mary Had a Little Lamb', 'Row Row Row Your Boat', 'Twinkle Twinkle Little Star', 'Upload your own!']

if melody == 'Upload your own!':
  # Extract melody from user-uploaded MIDI file.
  melody_ns = upload_midi()
  melody_instrument = note_seq.infer_melody_for_sequence(melody_ns)
  notes = [note for note in melody_ns.notes
           if note.instrument == melody_instrument]
  del melody_ns.notes[:]
  melody_ns.notes.extend(
      sorted(notes, key=lambda note: note.start_time))
  for i in range(len(melody_ns.notes) - 1):
    melody_ns.notes[i].end_time = melody_ns.notes[i + 1].start_time
  inputs = melody_conditioned_encoders['inputs'].encode_note_sequence(
      melody_ns)
else:
  # Use one of the provided melodies.
  events = [event + 12 if event != note_seq.MELODY_NO_EVENT else event
            for e in melodies[melody]
            for event in [e] + event_padding]
  inputs = melody_conditioned_encoders['inputs'].encode(
      ' '.join(str(e) for e in events))
  melody_ns = note_seq.Melody(events).to_sequence(qpm=150)

# Play and plot the melody.
note_seq.play_sequence(
    melody_ns,
    synth=note_seq.fluidsynth, sample_rate=SAMPLE_RATE, sf2_path=SF2_PATH)
note_seq.plot_sequence(melody_ns)
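
The list comprehension that builds `events` does two things at once: it inserts two "no event" tokens after every melody step and shifts every real pitch up an octave. The sketch below spells this out. It assumes `MELODY_NO_EVENT` is -2, and `expand_melody` is a hypothetical helper name.

```python
MELODY_NO_EVENT = -2  # assumed value of note_seq.MELODY_NO_EVENT


def expand_melody(steps, padding=2, transpose=12):
    """Pad each step with no-event tokens and transpose the real pitches."""
    events = []
    for step in steps:
        for event in [step] + [MELODY_NO_EVENT] * padding:
            if event != MELODY_NO_EVENT:
                event += transpose
            events.append(event)
    return events


expanded = expand_melody([60, 62, MELODY_NO_EVENT])
print(expanded)
```

The padding stretches each step of the preset melodies to three events, and the no-event tokens are left untouched by the transposition.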

## Go crazy. Generate Accompaniment for Melody. 
Generate a piano performance consisting of the chosen melody plus accompaniment.

In [None]:
# Generate sample events.
decode_length = 4096
sample_ids = next(melody_conditioned_samples)['outputs']

# Decode to NoteSequence.
midi_filename = decode(
    sample_ids,
    encoder=melody_conditioned_encoders['targets'])
accompaniment_ns = note_seq.midi_file_to_note_sequence(midi_filename)

# Play and plot.
note_seq.play_sequence(
    accompaniment_ns,
    synth=note_seq.fluidsynth, sample_rate=SAMPLE_RATE, sf2_path=SF2_PATH)
note_seq.plot_sequence(accompaniment_ns)

## Save Audio


In [None]:
note_seq.sequence_proto_to_midi_file(
    accompaniment_ns, 'accompaniment.mid')

# DeepLyrics
## Using Jukebox, an open-source deep learning music generation library developed by OpenAI, to generate music from nothing but lyrics.
### Adapted from the Jukebox Colab Notebook

Note: <a href="deeplyrics">We</a> highly recommend that you follow the hardware recommendations given earlier for this segment, and that you run it on a GPU with at least 16GB of memory.

# Sample from 1B or 5B model
The 5B model is more robust and will provide better results than the 1B model. However, the 1B model is significantly faster and less resource-intensive. Use it if you have less than 16GB of GPU memory on your system.

In [None]:
model = "1b_lyrics" # Change this to "5b_lyrics" if you choose to use the 5B model
hps = Hyperparams() #load hyperparams
hps.sr = 44100 #sample rate
hps.n_samples = 3 #number of samples to generate (the same for both models)
hps.name = 'samples'
chunk_size = 16 if model=="5b_lyrics" else 32
max_batch_size = 3 if model=="5b_lyrics" else 16
hps.levels = 3
hps.hop_fraction = [.5,.5,.125]

vqvae, *priors = MODELS[model]
vqvae = make_vqvae(setup_hparams(vqvae, dict(sample_length = 1048576)), device)
top_prior = make_prior(setup_hparams(priors[-1], dict()), vqvae, device)

Specify your choice of artist, genre, lyrics, and length of musical sample. 

In [None]:
sample_length_in_seconds = 60          # Full length of musical sample to generate - we find songs in the 1 to 4 minute
                                       # range work well, with generation time proportional to sample length.  
                                       # This total length affects how quickly the model 
                                       # progresses through lyrics (model also generates differently
                                       # depending on if it thinks it's in the beginning, middle, or end of sample)

hps.sample_length = (int(sample_length_in_seconds*hps.sr)//top_prior.raw_to_tokens)*top_prior.raw_to_tokens
assert hps.sample_length >= top_prior.n_ctx*top_prior.raw_to_tokens, 'Please choose a larger sample length'
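
The sample length is rounded down to a whole number of top-level tokens and must cover at least one full context window. The sketch below works through that arithmetic. The values `raw_to_tokens = 128` and `n_ctx = 8192` are assumptions chosen for illustration; read the real values from `top_prior`.

```python
def aligned_sample_length(seconds, sample_rate, raw_to_tokens):
    """Round the requested length down to a multiple of raw_to_tokens."""
    return (int(seconds * sample_rate) // raw_to_tokens) * raw_to_tokens


sr = 44100           # sample rate set in the cell above
raw_to_tokens = 128  # assumed: audio samples per top-level token
n_ctx = 8192         # assumed: tokens in one context window

length = aligned_sample_length(60, sr, raw_to_tokens)
min_length = n_ctx * raw_to_tokens

print(length, min_length, round(min_length / sr, 2))
```

Under these assumed values, any request shorter than roughly 24 seconds would fail the assertion in the cell above.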

In [None]:
#We chose to work with Eminem's voice with lyrics from "Paid my Dues" by NF

metas = [dict(artist = "Eminem",
            genre = "Hip Hop",
            total_length = hps.sample_length,
            offset = 0,
            lyrics = """I spit it with ease, so leave it to me
            You doubt it but you better believe
            I'm on a rampage hit 'em with the record release
            Dependin' the week, I'm prolly gonna have to achieve another goal
            Let me go when I'm over the beat
            I go into beast mode like I'm ready to feast
            I'm fed up with these thieves tryna get me to bleed
            They wanna see me take an L? (yup, see what I mean)
            How many records I gotta give you to get with the program?
            Taken for granted I'm 'bout to give you the whole plan
            Open your mind up and take a look at the blueprint
            Debate if you gotta, but gotta hold it with both hands
            To pick up the bars you gotta be smart
            You really gotta dig in your heart if you wanna get to the root of an issue
            Pursuin' the mental can be dark and be difficult
            But the payoff at the end of it, can help you to get through it, hey
            """,
            ),
          ] * hps.n_samples
labels = [None, None, top_prior.labeller.get_batch_labels(metas, 'cuda')]

Optionally adjust the sampling temperature (set it around 1 for the best results).  

In [None]:
sampling_temperature = .98

lower_batch_size = 16
max_batch_size = 3 if model == "5b_lyrics" else 16
lower_level_chunk_size = 32
chunk_size = 16 if model == "5b_lyrics" else 32
sampling_kwargs = [dict(temp=.99, fp16=True, max_batch_size=lower_batch_size,
                        chunk_size=lower_level_chunk_size),
                    dict(temp=0.99, fp16=True, max_batch_size=lower_batch_size,
                         chunk_size=lower_level_chunk_size),
                    dict(temp=sampling_temperature, fp16=True, 
                         max_batch_size=max_batch_size, chunk_size=chunk_size)]
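
For intuition about what the sampling temperature does, the sketch below applies the common definition: logits are divided by the temperature before the softmax. This is a generic illustration with made-up logits, and it makes no claim about how Jukebox implements sampling internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)

for name, probs in (('0.5', sharp), ('1.0', neutral), ('2.0', flat)):
    print(name, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the most likely token, while higher temperatures spread it out, which is why values close to 1 keep the output varied without becoming erratic.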

Now we're ready to sample from the model. We'll generate the top level (2) first, followed by the first upsampling (level 1), and the second upsampling (0).  If you are using a local machine, you can also load all models directly with make_models, and then use sample.py's ancestral_sampling to put this all in one step.

After each level, we decode to raw audio and save the audio files.   

This next cell will take a while (approximately 10 minutes per 20 seconds of music sample), similar to synthesizing audio as we demonstrated earlier.

Approximate time required for a 60 second song (~1,000,000 interpolations): 30 minutes to 1 hour on a GeForce RTX GPU and 5-10 minutes on a Quadro or Tesla GPU. These audio files are compressed and will be of lower audio quality than the original. You can find them at {hps.name}/level_2/.

In [None]:
zs = [t.zeros(hps.n_samples,0,dtype=t.long, device='cuda') for _ in range(len(priors))]
zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)

In [None]:
Audio(f'{hps.name}/level_2/item_0.wav') #if you want to hear it directly from Jupyter or Colab

## Upsampling

The following code block will allow you to upsample your previously generated audio using neural networks. This process is GPU-dependent and will take a long time to complete if you do not meet the recommended hardware requirements. With a GPU, this should take about 4 minutes per 1 second of audio per batch. Approximate time required for a 60 second song (~800,000 interpolations): 8-12 hours on a GeForce GPU and 2-8 hours on a Quadro or Tesla GPU.

In [None]:
# Set this False if you are on a local machine that has enough memory (this allows you to do the
# lyrics alignment visualization during the upsampling stage). For a hosted runtime, 
# we'll need to go ahead and delete the top_prior if you are using the 5b_lyrics model.
if True:
  del top_prior
  empty_cache()
  top_prior=None
upsamplers = [make_prior(setup_hparams(prior, dict()), vqvae, 'cpu') for prior in priors[:-1]]
labels[:2] = [prior.labeller.get_batch_labels(metas, 'cuda') for prior in upsamplers]

In [None]:
zs = upsample(zs, labels, sampling_kwargs, [*upsamplers, top_prior], hps) #Note: This is the code that upsamples the previously 
                                                                          #generated low-quality audio file
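
For planning purposes, the per-second rates quoted in this section can be turned into a rough estimate. The sketch below uses only those two figures (10 minutes per 20 seconds for the top level, 4 minutes per second for upsampling). The function name is hypothetical, and the GPU-specific ranges quoted in the text vary widely, so treat the result as an order of magnitude at best.

```python
# Rates taken from the text of this section; real timings depend on the GPU.
TOP_LEVEL_MIN_PER_SECOND = 10 / 20  # "10 minutes per 20 seconds"
UPSAMPLE_MIN_PER_SECOND = 4         # "4 minutes per 1 second of audio"


def estimate_minutes(audio_seconds):
    """Return rough per-stage time estimates in minutes."""
    return {
        'top_level': audio_seconds * TOP_LEVEL_MIN_PER_SECOND,
        'upsampling': audio_seconds * UPSAMPLE_MIN_PER_SECOND,
    }


estimate = estimate_minutes(60)
print(estimate)
```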

In [None]:
Audio(f'{hps.name}/level_0/item_0.wav') #if you want to hear it directly from Jupyter or Colab

In [None]:
del upsamplers
empty_cache() #clears stored cache from all the processing

## Alright Folks, that's it! More to come soon.