<a href="https://colab.research.google.com/github/DemonFlexCouncil/DDSP-48kHz-Stereo/blob/master/ddsp/colab/timbre_transfer_48stereo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


##### Copyright 2020 Google LLC.

Licensed under the Apache License, Version 2.0 (the "License");





In [None]:
# Copyright 2020 Google LLC. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Train + Timbre Transfer--DDSP Autoencoder on GPU

Made by Google Magenta--altered by Demon Flex Council

This notebook demonstrates how to install the DDSP library and train it for synthesis based on your own data using our command-line scripts. If run inside of Colaboratory, it will automatically use a free Google Cloud GPU.

<img src="https://storage.googleapis.com/ddsp/additive_diagram/ddsp_autoencoder.png" alt="DDSP Autoencoder figure" width="700">


**Note that we prefix bash commands with a `!` inside of Colab, but you would leave them out if running directly in a terminal.**

**A Little Background**

A producer friend of mine turned me on to Magenta’s DDSP, and I’m glad he did. In my mind it represents the way forward for AI music. Finally we have a glimpse inside the black box, with access to musical parameters as well as neural net hyperparameters. And DDSP leverages decades of studio knowledge by utilizing traditional processors like synthesizers and effects. One can envision a time when DDSP-like elements will sit at the heart of production DAWs.

According to Magenta’s paper, this algorithm was intended as proof of concept, but I wanted to bend it more towards a tool for producers. I bumped the sample rate up to 48kHz and made it stereo. I also introduced a variable render length so you can feed it a loop or phrase. However, there are limits to this parameter. The total number of samples in your render length (number of seconds * 48000) must be evenly divisible by 800. In practice, this means using round-numbered or highly-divisible tempos (105, 96, 90, 72, 50…) or using material that does not depend on tempo.
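
As a rough illustration of the divisibility rule, the sketch below checks whether a loop of a given number of beats at a given tempo produces a sample count that is a whole multiple of 800. The helper name `loop_is_renderable` and the idea of measuring the loop in beats are assumptions made for this example; they are not part of the notebook or the DDSP library.

```python
from fractions import Fraction

SAMPLE_RATE = 48000  # sample rate used throughout this notebook
BLOCK = 800          # the sample count must be a multiple of this

def loop_is_renderable(bpm, beats):
    """Return (seconds, ok) for a loop of `beats` beats at integer tempo `bpm`.

    Hypothetical helper: exact rational arithmetic is used so that
    floating-point rounding cannot hide a non-integer sample count.
    """
    seconds = Fraction(60 * beats, bpm)
    samples = seconds * SAMPLE_RATE
    ok = samples.denominator == 1 and samples.numerator % BLOCK == 0
    return float(seconds), ok

for bpm in (96, 100, 110, 125):
    print(bpm, loop_is_renderable(bpm, beats=8))
```

If a loop fails the check, changing the number of beats (or padding the file to a nearby valid length) may be easier than changing the tempo.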

Also note that longer render times may require a smaller batch size, which is currently set at 8 for a 4-second render. This may diminish audio quality, so use shorter render times if at all possible.

The dataset and audio primer files must be WAVE format, stereo, and 48kHz. Most DAWs and audio editors have a 48kHz export option, including the free Audacity. There appears to be a lower limit on the total size of the dataset, somewhere around 20MB. Anything lower than that and the TFRecord maker will create blank records (0 bytes). Also, Colaboratory may throw memory errors if it encounters large single audio files—cut the file into smaller pieces if this happens.
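
Before uploading, it may help to confirm that each file really is stereo and 48kHz. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav` is made up for this example. Note that `wave` only reads PCM-encoded files, so a 32-bit float export may raise `wave.Error` even if its sample rate and channel count are correct.

```python
import wave

def check_wav(path, want_rate=48000, want_channels=2):
    """Return a list of problems found in a WAV file (empty list means OK)."""
    problems = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        if rate != want_rate:
            problems.append(f"sample rate is {rate} Hz, expected {want_rate}")
        if channels != want_channels:
            problems.append(f"{channels} channel(s), expected {want_channels}")
    return problems
```

Calling `check_wav("my_loop.wav")` (with your own file name) returns an empty list when the file matches both requirements.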

## **Step 1**--Install Dependencies
First we install the required dependencies with `pip` (takes about 5 minutes). **Warning:** do not use a TensorFlow version newer than 2.2.

In [1]:
!pip install tensorflow==2.2
!pip install mir_eval
!pip install apache_beam
!pip install crepe
!pip install pydub
!pip3 install ffmpeg-normalize
import os
import glob
import tensorflow as tf

Collecting tensorflow==2.2
  Downloading https://files.pythonhosted.org/packages/3d/be/679ce5254a8c8d07470efb4a4c00345fae91f766e64f1c2aece8796d7218/tensorflow-2.2.0-cp36-cp36m-manylinux2010_x86_64.whl (516.2MB)
     |████████████████████████████████| 516.2MB 28kB/s 
Collecting tensorboard<2.3.0,>=2.2.0
  Downloading https://files.pythonhosted.org/packages/1d/74/0a6fcb206dcc72a6da9a62dd81784bfdbff5fedb099982861dc2219014fb/tensorboard-2.2.2-py3-none-any.whl (3.0MB)
     |████████████████████████████████| 3.0MB 2.8MB/s 
Collecting tensorflow-estimator<2.3.0,>=2.2.0
  Downloading https://files.pythonhosted.org/packages/a4/f5/926ae53d6a226ec0fda5208e0e581cffed895ccc89e36ba76a8e60895b78/tensorflow_estimator-2.2.0-py2.py3-none-any.whl (454kB)
     |████████████████████████████████| 460kB 19.7MB/s 
Installing collected packages: tensorboard, tensorflow-estimator, tensorflow
  Found existing installation: tensorboard 2.3.0
    Uninstalling tensorboard-2.3.0:
      Suc

## **Step 2**--Login and mount your Google Drive

This will require an authentication code. You should then be able to see your drive in the file browser on the left panel.

In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


## **Step 3**--Set your audio directory on Drive and get DDSP repository from Github

Find a folder on Drive where you want to upload audio files and store checkpoints. Then right-click it and select "Copy path" in the file browser on the left panel--make sure you've clicked the folder icon on the far left side of the browser. Enter the path below.

In [3]:
DRIVE_DIR =  "/content/drive/My Drive/test" #@param {type:"string"}

!git clone https://github.com/DemonFlexCouncil/DDSP-48kHz-Stereo.git

AUDIO_DIR = '/content/data/audio'
!mkdir -p $AUDIO_DIR
AUDIO_FILEPATTERN = AUDIO_DIR + '/*'
AUDIO_INPUT_DIR = DRIVE_DIR + '/audio_input'
AUDIO_OUTPUT_DIR = DRIVE_DIR + '/audio_output'
CKPT_OUTPUT_DIR = DRIVE_DIR + '/ckpt'
SAVE_DIR = os.path.join(DRIVE_DIR, 'model')

%cd $DRIVE_DIR
!mkdir -p audio_input audio_output ckpt data model primers

Cloning into 'DDSP-48kHz-Stereo'...
remote: Enumerating objects: 62, done.
remote: Counting objects: 100% (62/62), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 209 (delta 33), reused 0 (delta 0), pack-reused 147
Receiving objects: 100% (209/209), 152.26 KiB | 436.00 KiB/s, done.
Resolving deltas: 100% (98/98), done.
/content/drive/My Drive/test


## **Step 4**--Set render length

In [4]:
#@markdown Determines the length of audio slices for training and resynthesis. Decimals are OK.
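RENDER_SECONDS = 4.8 #@param {type:"number", min:1, max:10}
# round() rather than int(): int() truncates, so a float product that lands
# just below a whole number would silently lose one sample.
RENDER_SAMPLES = round(RENDER_SECONDS * 48000)

if RENDER_SAMPLES % 800 != 0:
  raise ValueError("Number of samples at 48kHz must be divisible by 800.")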

## **Step 5**--Upload your audio files to Drive and create a TFRecord dataset
* Put all of your training audio files in the "audio_input" directory inside whatever directory you set as DRIVE_DIR.
 * Typically works well with audio from a single acoustic environment.

Preprocessing involves inferring the fundamental frequency (or "pitch") with [CREPE](http://github.com/marl/crepe), and computing the loudness. These features will then be stored in a sharded [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) file for easier loading. Depending on the amount of input audio, this process usually takes a few minutes.
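
Since a dataset that is too small can produce blank (0-byte) records, it may be worth checking the shards once preprocessing has finished. The sketch below is one possible check; the helper name `find_empty_shards` is an assumption for this example, and the file pattern is the one this notebook passes to the TFRecord step.

```python
import glob
import os

def find_empty_shards(file_pattern):
    """Return the paths matching `file_pattern` whose size is 0 bytes."""
    return sorted(p for p in glob.glob(file_pattern) if os.path.getsize(p) == 0)

# Pattern used by this notebook for its training shards.
empty = find_empty_shards('/content/data/train.tfrecord*')
if empty:
    print('Empty shards found, the dataset may be too small:')
    for path in empty:
        print('  ', path)
```

An empty result only shows that every shard contains some bytes; it does not verify the contents of the records.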

In [5]:
audio_files = glob.glob(os.path.join(AUDIO_INPUT_DIR, '*.wav'))

for fname in audio_files:
  target_name = os.path.join(AUDIO_DIR, 
                             os.path.basename(fname).replace(' ', '_'))
  print('Copying {} to {}'.format(fname, target_name))
  !cp "$fname" $target_name

TRAIN_TFRECORD = '/content/data/train.tfrecord'
TRAIN_TFRECORD_FILEPATTERN = TRAIN_TFRECORD + '*'

drive_data_dir = os.path.join(DRIVE_DIR, 'data') 
drive_dataset_files = glob.glob(drive_data_dir + '/*')

# Make a new dataset.
if not glob.glob(AUDIO_FILEPATTERN):
  raise ValueError('No audio files found. Please put .wav files in the '
                   '"audio_input" directory on Drive and re-run this cell.')
  
!python /content/DDSP-48kHz-Stereo/ddsp/training/data_preparation/prepare_tfrecord.py \
  --input_audio_filepatterns=$AUDIO_FILEPATTERN \
  --output_tfrecord_path=$TRAIN_TFRECORD \
  --num_shards=10 \
  --example_secs=$RENDER_SECONDS \
  --alsologtostderr

TRAIN_TFRECORD_DIR = DRIVE_DIR + '/data'
# Escape the space in "My Drive" so the shell treats the path as one argument.
TRAIN_TFRECORD_DIR = TRAIN_TFRECORD_DIR.replace("My Drive", "My\\ Drive")
!cp $TRAIN_TFRECORD_FILEPATTERN $TRAIN_TFRECORD_DIR

Copying /content/drive/My Drive/test/audio_input/Violence-full48.wav to /content/data/audio/Violence-full48.wav
I0806 23:37:58.670020 139916554712960 statecache.py:154] Creating state cache with size 100
I0806 23:37:58.671144 139916554712960 worker_handlers.py:841] Created Worker handler <apache_beam.runners.portability.fn_api_runner.worker_handlers.EmbeddedWorkerHandler object at 0x7f404c112208> for environment ref_Environment_default_environment_1 (beam:env:embedded_python:v1, b'')
I0806 23:37:58.671466 139916554712960 fn_runner.py:485] Running (((((ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/DoOnce/Impulse_23)+(ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/DoOnce/FlatMap(<lambda at core.py:2632>)_24))+(ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/DoOnce/Map(decode)_26))+(ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/InitializeWrite_27))+(ref_PCollection_PCollection_16/Write))+(ref_PCollection_PCollection_17/Write)
I0806 23:37:58.691665 139916554712

## **Step 6**--Save dataset statistics for timbre transfer

Quantile normalization helps match loudness of timbre transfer inputs to the 
loudness of the dataset, so let's calculate it here and save in a pickle file.
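
To give a feel for what quantile normalization does, here is a minimal pure-Python sketch of the general idea: each input value is replaced by the reference value that sits at the same position in its own distribution. The function `quantile_match` is hypothetical and is not the code behind `colab_utils.save_dataset_statistics`; it only illustrates the technique on small lists.

```python
from bisect import bisect_left

def quantile_match(values, reference):
    """Map each value to the reference value at the same empirical quantile."""
    ranked = sorted(values)
    ref = sorted(reference)
    matched = []
    for v in values:
        # Position of v within its own distribution, scaled to [0, 1].
        q = bisect_left(ranked, v) / max(len(ranked) - 1, 1)
        # Pick the reference value found at that same quantile.
        matched.append(ref[round(q * (len(ref) - 1))])
    return matched

# A quiet primer (in dB) mapped onto a louder dataset distribution.
primer_db = [-60.0, -40.0, -20.0]
dataset_db = [-30.0, -25.0, -20.0, -15.0, -10.0]
print(quantile_match(primer_db, dataset_db))
```

The quietest primer frame lands on the quietest dataset level and the loudest on the loudest, which is why the transfer input ends up in a loudness range the model has seen.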

In [6]:
%cd /content/DDSP-48kHz-Stereo/ddsp/

from colab import colab_utils
from training import data

TRAIN_TFRECORD = '/content/data/train.tfrecord'
TRAIN_TFRECORD_FILEPATTERN = TRAIN_TFRECORD + '*'

data_provider = data.TFRecordProvider(TRAIN_TFRECORD_FILEPATTERN, example_secs=RENDER_SECONDS)
dataset = data_provider.get_dataset(shuffle=False)

PICKLE_FILE_PATH = os.path.join(SAVE_DIR, 'dataset_statistics.pkl')

colab_utils.save_dataset_statistics(data_provider, PICKLE_FILE_PATH)

/content/DDSP-48kHz-Stereo/ddsp
Calculating dataset statistics for <training.data.TFRecordProvider object at 0x7f1bf3e083c8>
---loudness1---
---loudness2---
---loudness2---
---Average pitch f0_trimmedL---
[[220.34908  220.69112  220.05768  ... 196.78365  196.78918  196.6635  ]
 [185.97324  185.61166  186.36786  ...  36.740314  36.726173  36.72935 ]
 [123.70869  112.78811  109.3933   ... 296.5569   297.1117   296.46014 ]
 ...
 [183.70335  183.2174   183.47133  ... 220.67043  220.88762  221.20738 ]
 [191.1269   187.63937  186.09825  ... 146.61873  146.05525  146.30437 ]
 [ 73.533615  73.44043   73.53976  ... 196.13974  195.9261   195.9158  ]]
(256, 1180)
---Average pitch f0_trimmedL[mask_onL]---
[165.53345 165.62416 165.71095 ... 196.11421 196.07552 196.1745 ]
(84142,)
---frequencies1---
[195.99193 195.67447 195.4532  ... 194.70001 195.03435 194.94092]
(93750,)
---frequencies2---
tf.Tensor([195.99193 195.67447 195.4532  ... 194.70001 195.03435 194.94092], shape=(93750,), dtype=float32)
(

## **Step 7**--Train model

DDSP was designed to model a single instrument, but I've had more interesting results training it on sparse multi-timbral material. In this case, the neural network will attempt to model all timbres, but will likely associate certain timbres with different pitch and loudness conditions.

Note that [gin configuration](https://github.com/google/gin-config) files are specified for both the model architecture ([solo_instrument.gin](TODO)) and the dataset ([tfrecord.gin](TODO)), which are both predefined in the library. You could also create your own. Parameters can be overridden in the run script below (`!python ddsp/ddsp_run.py`).
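
If you prefer to assemble the overrides programmatically, one option is to build the `--gin_param` flags from a dictionary. This is only a sketch: `gin_param_flags` is a hypothetical helper, and the parameter names in the example are copied from the training cell in this notebook.

```python
def gin_param_flags(overrides):
    """Turn a dict of gin bindings into a list of --gin_param flags."""
    flags = []
    for name, value in overrides.items():
        # repr() writes strings with quotes and numbers bare, which matches
        # the Python-literal syntax gin expects on the right-hand side.
        flags.append(f"--gin_param={name}={value!r}")
    return flags

flags = gin_param_flags({
    "TFRecordProvider.file_pattern": "/content/data/train.tfrecord*",
    "batch_size": 6,
    "train_util.train.num_steps": 30000,
})
print("\n".join(flags))
```

The resulting list can be passed as separate arguments to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.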

### Training Notes:
* Models typically perform well when the loss drops to the range of ~8-9.
* Depending on the dataset this can take anywhere from 10k-60k training steps usually.
* The training cell below sets `num_steps` to 30000, but you can stop training at any time.
* On the Colaboratory GPU, this can take from around 3-20 hours.
* By default, checkpoints will be saved every 300 steps with a maximum of 10 checkpoints.
* Feel free to adjust these numbers depending on the frequency of saves you would like and the space on your drive.
* If your Colaboratory runtime has stopped, re-run all previous cells to resume training from your most recent checkpoint.
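
To judge how much work a stopped runtime could cost, the sketch below lists which checkpoint steps would still be on disk at a given point in training. It assumes that saves happen at exact multiples of `steps_per_save` and that the oldest checkpoints are dropped first; `retained_checkpoints` is a hypothetical helper, not a DDSP function.

```python
def retained_checkpoints(current_step, steps_per_save=300, keep=10):
    """Return the checkpoint steps expected on disk at `current_step`."""
    saved = list(range(steps_per_save, current_step + 1, steps_per_save))
    # Only the most recent `keep` checkpoints survive.
    return saved[-keep:]

print(retained_checkpoints(2100))
print(retained_checkpoints(6000))
```

With the settings used here, at most 299 steps of training are lost when resuming from the latest checkpoint.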

In [None]:
%cd /content/DDSP-48kHz-Stereo

TRAIN_TFRECORD = '/content/data/train.tfrecord'
TRAIN_TFRECORD_FILEPATTERN = TRAIN_TFRECORD + '*'

!python ddsp/ddsp_run.py \
  --mode=train \
  --alsologtostderr \
  --save_dir="$SAVE_DIR" \
  --gin_file=models/solo_instrument.gin \
  --gin_file=datasets/tfrecord.gin \
  --gin_param="TFRecordProvider.file_pattern='$TRAIN_TFRECORD_FILEPATTERN'" \
  --gin_param="TFRecordProvider.example_secs=$RENDER_SECONDS" \
  --gin_param="Autoencoder.n_samples=$RENDER_SAMPLES" \
  --gin_param="batch_size=6" \
  --gin_param="train_util.train.num_steps=30000" \
  --gin_param="train_util.train.steps_per_save=300" \
  --gin_param="trainers.Trainer.checkpoints_to_keep=10"

/content/DDSP-48kHz-Stereo
I0806 23:52:38.382843 139930134554496 ddsp_run.py:166] Restore Dir: /content/drive/My Drive/test/model
I0806 23:52:38.383050 139930134554496 ddsp_run.py:167] Save Dir: /content/drive/My Drive/test/model
I0806 23:52:38.387451 139930134554496 ddsp_run.py:139] Using operative config: /content/drive/My Drive/test/model/operative_config-0.gin
I0806 23:52:39.009639 139930134554496 train_util.py:57] Defaulting to MirroredStrategy
2020-08-06 23:52:39.011399: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-08-06 23:52:39.014469: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 23:52:39.015435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability

## **Step 8**--Timbre transfer imports

Now it's time to render the final audio file with the aid of an audio primer file for timbre transfer. We'll start with some basic imports.

In [10]:
%cd /content/DDSP-48kHz-Stereo/ddsp

# Ignore a bunch of deprecation warnings
import warnings
warnings.filterwarnings("ignore")

import copy
import time
import pydub
import gin
import crepe
import librosa
import matplotlib.pyplot as plt
import numpy as np
import pickle
import tensorflow as tf
import tensorflow_datasets as tfds

import core
import spectral_ops
from training import metrics
from training import models
from colab import colab_utils
from colab.colab_utils import (auto_tune, detect_notes, fit_quantile_transform, get_tuning_factor, download, play, record, specplot, upload, DEFAULT_SAMPLE_RATE)
from google.colab import files

# Helper Functions
sample_rate = 48000

print('Done!')

/content/DDSP-48kHz-Stereo/ddsp
Done!


## **Step 9**--Process audio primer

The key to transcending the sonic bounds of the dataset is the audio primer file. This file will graft its frequency and loudness information onto the rendered audio file, sort of like a vocoder. Then you can use the sliders in the "Modify Conditioning" section to further alter the rendered file.

Put your audio primer files in the "primers" directory inside whatever directory you set as DRIVE_DIR. Input the file name of the primer you want to use on the line below.

In [11]:
PRIMER_DIR = DRIVE_DIR + '/primers/'
PRIMER_FILE =  "On--48--1loop--100bpm.wav" #@param {type:"string"}
PATH_TO_PRIMER = PRIMER_DIR + PRIMER_FILE

from scipy.io.wavfile import read as read_audio
from scipy.io.wavfile import write as write_audio

primer_sample_rate, audio = read_audio(PATH_TO_PRIMER)

# Setup the session.
spectral_ops.reset_crepe()

# Compute features.
start_time = time.time()
audio_features = metrics.compute_audio_features(audio)
audio_features['loudness_dbM'] = audio_features['loudness_dbM'].astype(np.float32)
audio_features['loudness_dbL'] = audio_features['loudness_dbL'].astype(np.float32)
audio_features['loudness_dbR'] = audio_features['loudness_dbR'].astype(np.float32)
audio_features_mod = None
print('Audio features took %.1f seconds' % (time.time() - start_time))

--audio shapes after channel splittingz--
(1, 921600)
(1, 921600)
(1, 921600)
[[   71 -1826 -2118 ...  7831  8228  7494]]
--audio shapes after float32--
(1, 921600)
(1, 921600)
(1, 921600)
[[   71 -1826 -2118 ...  7831  8228  7494]]
Audio features took 18.4 seconds


## **Step 10**--Load most recent checkpoint
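
The cell below assumes the "ckpt" folder holds a single checkpoint. If you ever keep several in one folder, a numeric comparison of the step numbers is safer than sorting file names as strings (where `ckpt-900` sorts after `ckpt-2100`). A minimal sketch, with `latest_checkpoint_prefix` being a hypothetical helper and the file names following the `ckpt-<step>.*` layout seen in this notebook's output:

```python
import re

def latest_checkpoint_prefix(filenames):
    """Return the 'ckpt-<step>' prefix with the highest step, or None."""
    steps = []
    for name in filenames:
        match = re.match(r"ckpt-(\d+)\.", name)
        if match:
            steps.append(int(match.group(1)))
    return f"ckpt-{max(steps)}" if steps else None

names = [
    "ckpt-900.index",
    "ckpt-2100.index",
    "ckpt-2100.data-00000-of-00002",
    "operative_config-0.gin",
]
print(latest_checkpoint_prefix(names))
```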

In [12]:
# Copy most recent checkpoint to "ckpt" folder
%cd $DRIVE_DIR/ckpt/
!rm *
CHECKPOINT_ZIP = 'ckpt.zip'
latest_checkpoint_fname = os.path.basename(tf.train.latest_checkpoint(SAVE_DIR))
# Each `!` line runs in its own shell, so the `cd` must share a line with `zip`.
!cd "$SAVE_DIR" && zip $CHECKPOINT_ZIP $latest_checkpoint_fname* operative_config-0.gin dataset_statistics.pkl
!cp "$SAVE_DIR/$CHECKPOINT_ZIP" "$DRIVE_DIR/ckpt/"
!unzip -o "$CHECKPOINT_ZIP"
!rm "$CHECKPOINT_ZIP"
%cd $SAVE_DIR
!rm "$CHECKPOINT_ZIP"
model_dir = DRIVE_DIR + '/ckpt/'
gin_file = os.path.join(model_dir, 'operative_config-0.gin')

# Load the dataset statistics.
DATASET_STATS = None
dataset_stats_file = os.path.join(model_dir, 'dataset_statistics.pkl')
print(f'Loading dataset statistics from {dataset_stats_file}')
try:
  if tf.io.gfile.exists(dataset_stats_file):
    with tf.io.gfile.GFile(dataset_stats_file, 'rb') as f:
      DATASET_STATS = pickle.load(f)
except Exception as err:
  print('Loading dataset statistics from pickle failed: {}.'.format(err))

# Parse gin config.
with gin.unlock_config():
  gin.parse_config_file(gin_file, skip_unknown=True)

# Assumes only one checkpoint in the folder, 'ckpt-[iter]'.
ckpt_files = [f for f in tf.io.gfile.listdir(model_dir) if 'ckpt' in f]
ckpt_name = ckpt_files[0].split('.')[0]
ckpt = os.path.join(model_dir, ckpt_name)

# Ensure dimensions and sampling rates are equal
time_steps_train = gin.query_parameter('DefaultPreprocessor.time_steps')
n_samples_train = RENDER_SAMPLES
hop_size = int(n_samples_train / time_steps_train)
time_steps = int(audio_features['audioL'].shape[1] / hop_size)
n_samples = time_steps * hop_size

# Trim all input vectors to correct lengths 
for key in ['f0_hzM','f0_hzL','f0_hzR', 'f0_confidenceM', 'f0_confidenceL', 'f0_confidenceR', 'loudness_dbM', 'loudness_dbL', 'loudness_dbR']:
  audio_features[key] = audio_features[key][:time_steps]
audio_features['audioM'] = audio_features['audioM'][:, :n_samples]
audio_features['audioL'] = audio_features['audioL'][:, :n_samples]
audio_features['audioR'] = audio_features['audioR'][:, :n_samples]

# Set up the model just to predict audio given new conditioning
model = models.Autoencoder()
model.restore(ckpt)

# Build model by running a batch through it.
start_time = time.time()
_ = model(audio_features, training=False)
print('Restoring model took %.1f seconds' % (time.time() - start_time))

/content/drive/My Drive/test/ckpt
rm: cannot remove '*': No such file or directory
  adding: ckpt-2100.data-00000-of-00002 (deflated 92%)
  adding: ckpt-2100.data-00001-of-00002 (deflated 6%)
  adding: ckpt-2100.index (deflated 83%)
  adding: operative_config-0.gin (deflated 76%)
  adding: dataset_statistics.pkl (deflated 56%)
Archive:  ckpt.zip
  inflating: ckpt-2100.data-00000-of-00002  
  inflating: ckpt-2100.data-00001-of-00002  
  inflating: ckpt-2100.index         
  inflating: operative_config-0.gin  
  inflating: dataset_statistics.pkl  
/content/drive/My Drive/test/model
Loading dataset statistics from /content/drive/My Drive/test/ckpt/dataset_statistics.pkl
---dense_out---
<tensorflow.python.keras.layers.core.Dense object at 0x7fb2434bc780>
<tensorflow.python.keras.layers.core.Dense object at 0x7fb2434bca20>
<tensorflow.python.keras.layers.core.Dense object at 0x7fb243527cf8>
dense
dense
dense


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Streaming output truncated to the last 5000 lines.
        [1.4161307],
        [1.385833 ],
        [1.3321607],
        [1.2737844],
        [1.266037 ],
        [1.2705439],
        [1.2437984],
        [1.1539261],
        [1.0797682],
        [1.2274706],
        [1.3027697],
        [1.3222219],
        [1.2975447],
        [1.2234234],
        [1.1924478],
        [1.3502159],
        [1.4169871],
        [1.4303539],
        [1.403102 ],
        [1.3373408],
        [1.2534618],
        [1.1739335],
        [1.1277848],
        [1.1174515],
        [1.1067357],
        [1.0858482],
        [1.2052397],
        [1.2955754],
        [1.3248726],
        [1.3083922],
        [1.2429428],
        [1.1940536],
        [1.2987919],
        [1.3491232],
        [1.3657101],
        [1.3412519],
        [1.2776741],
        [1.222926 ],
        [1.3065748],
        [1.371465 ],
        [1.3873903],
        [1.3701413],
        [1.3339204],
        [1.2966552],
        [1.

## **Step 11**--Modify Conditioning

In [None]:
#@markdown These models were not explicitly trained to perform timbre transfer, so they may sound unnatural if the incoming loudness and frequencies are very different than the training data (which will always be somewhat true). 


#@markdown ## Note Detection

#@markdown You can leave this at 1.0 for most cases
threshold = 0.08 #@param {type:"slider", min: 0.0, max:2.0, step:0.01}


#@markdown ## Automatic

ADJUST = True #@param{type:"boolean"}

#@markdown Quiet parts without notes detected (dB)
quiet = 4 #@param {type:"slider", min: 0, max:60, step:1}

#@markdown Force pitch to nearest note (amount)
autotune = 1 #@param {type:"slider", min: 0.0, max:1.0, step:0.1}

#@markdown ## Manual


#@markdown Shift the pitch (octaves)
pitch_shift =  -2 #@param {type:"slider", min:-2, max:2, step:1}

#@markdown Adjust the overall loudness (dB)
loudness_shift = 15 #@param {type:"slider", min:-20, max:20, step:1}


audio_features_mod = {k: v.copy() for k, v in audio_features.items()}


## Helper functions.
def shift_ld(audio_features, ld_shiftL=0.0, ld_shiftR=0.0):
  """Shift loudness of both channels by a number of dB."""
  audio_features['loudness_dbL'] += ld_shiftL
  audio_features['loudness_dbR'] += ld_shiftR
  return audio_features


def shift_f0(audio_features, pitch_shiftL=0.0, pitch_shiftR=0.0):
  """Shift f0 by a number of octaves."""
  audio_features['f0_hzL'] *= 2.0 ** (pitch_shiftL)
  audio_features['f0_hzL'] = np.clip(audio_features['f0_hzL'], 
                                    0.0, 
                                    librosa.midi_to_hz(110.0))
  audio_features['f0_hzR'] *= 2.0 ** (pitch_shiftR)
  audio_features['f0_hzR'] = np.clip(audio_features['f0_hzR'], 
                                    0.0, 
                                    librosa.midi_to_hz(110.0))
  return audio_features


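# Initialise the per-channel masks so the plotting code below still works
# when auto-adjust is skipped.
mask_onL = mask_onR = None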

if ADJUST and DATASET_STATS is not None:
  # Detect sections that are "on".
  mask_onL, note_on_valueL = detect_notes(audio_features['loudness_dbL'],
                                        audio_features['f0_confidenceL'],
                                        threshold)
  
  mask_onR, note_on_valueR = detect_notes(audio_features['loudness_dbR'],
                                        audio_features['f0_confidenceR'],
                                        threshold)

  if np.any(mask_onL):
    # Shift the pitch register.
    target_mean_pitchL = DATASET_STATS['mean_pitchL']
    target_mean_pitchR = DATASET_STATS['mean_pitchR']
    pitchL = core.hz_to_midi(audio_features['f0_hzL'])
    pitchR = core.hz_to_midi(audio_features['f0_hzR'])
    pitchL = np.expand_dims(pitchL, axis=0)
    pitchR = np.expand_dims(pitchR, axis=0)
    mean_pitchL = np.mean(pitchL[mask_onL])
    mean_pitchR = np.mean(pitchR[mask_onR])
    p_diffL = target_mean_pitchL - mean_pitchL
    p_diffR = target_mean_pitchR - mean_pitchR
    p_diff_octaveL = p_diffL / 12.0
    p_diff_octaveR = p_diffR / 12.0
    round_fnL = np.floor if p_diff_octaveL > 1.5 else np.ceil
    round_fnR = np.floor if p_diff_octaveR > 1.5 else np.ceil
    p_diff_octaveL = round_fnL(p_diff_octaveL)
    p_diff_octaveR = round_fnR(p_diff_octaveR)
    audio_features_mod = shift_f0(audio_features_mod, p_diff_octaveL, p_diff_octaveR)

    # Quantile shift the note_on parts.
    _, loudness_normL = colab_utils.fit_quantile_transform(
        audio_features['loudness_dbL'],
        mask_onL,
        inv_quantile=DATASET_STATS['quantile_transformL'])
    
    # Quantile shift the note_on parts.
    _, loudness_normR = colab_utils.fit_quantile_transform(
        audio_features['loudness_dbR'],
        mask_onR,
        inv_quantile=DATASET_STATS['quantile_transformR'])

    # Turn down the note_off parts.
    mask_offL = np.logical_not(mask_onL)
    mask_offR = np.logical_not(mask_onR)
    loudness_normL = np.squeeze(loudness_normL)
    loudness_normR = np.squeeze(loudness_normR)
    loudness_normL[np.squeeze(mask_offL)] -=  quiet * (1.0 - note_on_valueL[mask_offL])
    loudness_normR[np.squeeze(mask_offR)] -=  quiet * (1.0 - note_on_valueR[mask_offR])
    loudness_normL = np.reshape(loudness_normL, audio_features['loudness_dbL'].shape)
    loudness_normR = np.reshape(loudness_normR, audio_features['loudness_dbR'].shape)
    
    audio_features_mod['loudness_dbL'] = loudness_normL
    audio_features_mod['loudness_dbR'] = loudness_normR

    # Auto-tune.
    if autotune:
      f0_midiL = np.array(core.hz_to_midi(audio_features_mod['f0_hzL']))
      f0_midiR = np.array(core.hz_to_midi(audio_features_mod['f0_hzR']))
      tuning_factorL = get_tuning_factor(f0_midiL, audio_features_mod['f0_confidenceL'], np.squeeze(mask_onL))
      tuning_factorR = get_tuning_factor(f0_midiR, audio_features_mod['f0_confidenceR'], np.squeeze(mask_onR))
      f0_midi_atL = auto_tune(f0_midiL, tuning_factorL, np.squeeze(mask_onL), amount=autotune)
      f0_midi_atR = auto_tune(f0_midiR, tuning_factorR, np.squeeze(mask_onR), amount=autotune)
      audio_features_mod['f0_hzL'] = core.midi_to_hz(f0_midi_atL)
      audio_features_mod['f0_hzR'] = core.midi_to_hz(f0_midi_atR)

  else:
    print('\nSkipping auto-adjust (no notes detected or ADJUST box empty).')

else:
  print('\nSkipping auto-adjust (box not checked or no dataset statistics found).')

# Manual Shifts.
audio_features_mod = shift_ld(audio_features_mod, loudness_shift, loudness_shift)
audio_features_mod = shift_f0(audio_features_mod, pitch_shift, pitch_shift)

TRIM = -15

# Plot Features.
has_maskL = int(mask_onL is not None)
n_plots = 3 if has_maskL else 2 
figL, axesL = plt.subplots(nrows=n_plots, 
                      ncols=1, 
                      sharex=True,
                      figsize=(2*n_plots, 8))

if has_maskL:
  ax = axesL[0]
  ax.plot(np.ones_like(np.squeeze(mask_onL)[:TRIM]) * threshold, 'k:')
  ax.plot(np.squeeze(note_on_valueL)[:TRIM])
  ax.plot(np.squeeze(mask_onL)[:TRIM])
  ax.set_ylabel('Note-on Mask--Left')
  ax.set_xlabel('Time step [frame]--Left')
  ax.legend(['Threshold', 'Likelihood','Mask'])

ax = axesL[0 + has_maskL]
ax.plot(np.squeeze(audio_features['loudness_dbL'])[:TRIM])
ax.plot(np.squeeze(audio_features_mod['loudness_dbL'])[:TRIM])
ax.set_ylabel('loudness_db--Left')
ax.legend(['Original','Adjusted'])

ax = axesL[1 + has_maskL]
ax.plot(librosa.hz_to_midi(np.squeeze(audio_features['f0_hzL'])[:TRIM]))
ax.plot(librosa.hz_to_midi(np.squeeze(audio_features_mod['f0_hzL'])[:TRIM]))
ax.set_ylabel('f0 [midi]--Left')
_ = ax.legend(['Original','Adjusted'])

has_maskR = int(mask_onR is not None)
n_plots = 3 if has_maskR else 2 
figR, axesR = plt.subplots(nrows=n_plots, 
                      ncols=1, 
                      sharex=True,
                      figsize=(2*n_plots, 8))

if has_maskR:
  ax = axesR[0]
  ax.plot(np.ones_like(np.squeeze(mask_onR)[:TRIM]) * threshold, 'k:')
  ax.plot(np.squeeze(note_on_valueR)[:TRIM])
  ax.plot(np.squeeze(mask_onR)[:TRIM])
  ax.set_ylabel('Note-on Mask--Right')
  ax.set_xlabel('Time step [frame]--Right')
  ax.legend(['Threshold', 'Likelihood','Mask'])

ax = axesR[0 + has_maskR]
ax.plot(np.squeeze(audio_features['loudness_dbR'])[:TRIM])
ax.plot(np.squeeze(audio_features_mod['loudness_dbR'])[:TRIM])
ax.set_ylabel('loudness_db--Right')
ax.legend(['Original','Adjusted'])

ax = axesR[1 + has_maskR]
ax.plot(librosa.hz_to_midi(np.squeeze(audio_features['f0_hzR'])[:TRIM]))
ax.plot(librosa.hz_to_midi(np.squeeze(audio_features_mod['f0_hzR'])[:TRIM]))
ax.set_ylabel('f0 [midi]--Right')
_ = ax.legend(['Original','Adjusted'])
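
The pitch handling in the cell above rests on two small pieces of arithmetic: converting between Hz and MIDI note numbers, and shifting by octaves by multiplying the frequency by a power of two. The sketch below restates that arithmetic with the standard library only. The function names are chosen for this example (the notebook itself uses `core.hz_to_midi`, `core.midi_to_hz` and `librosa`), and the conversion assumes the usual A4 = 440 Hz = MIDI note 69 tuning.

```python
import math

def hz_to_midi(hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * math.log2(hz / 440.0)

def midi_to_hz(midi):
    """Convert a MIDI note number to a frequency in Hz."""
    return 440.0 * 2.0 ** ((midi - 69.0) / 12.0)

def shift_octaves(f0_hz, octaves, max_midi=110.0):
    """Scale each f0 by 2**octaves, clipped to [0, midi_to_hz(max_midi)]."""
    ceiling = midi_to_hz(max_midi)
    return [min(max(f * 2.0 ** octaves, 0.0), ceiling) for f in f0_hz]

# Two octaves down, as with pitch_shift = -2 in the cell above.
print(shift_octaves([440.0, 220.0], -2))
print(hz_to_midi(440.0))
```

Each octave corresponds to 12 MIDI notes, which is why the cell divides the pitch difference in MIDI by 12 before rounding to a whole number of octaves.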

## **Step 12**--Render audio

After running this cell, your final rendered file should be downloaded automatically. If not, look for it in the "audio_output" directory inside whatever directory you set as DRIVE_DIR.

In [13]:
%cd $AUDIO_OUTPUT_DIR

af = audio_features if audio_features_mod is None else audio_features_mod

# Run a batch of predictions.
start_time = time.time()
audio_genM, audio_genL, audio_genR = model(af, training=False)
print('Prediction took %.1f seconds' % (time.time() - start_time))

audio_genL = np.expand_dims(np.squeeze(audio_genL.numpy()), axis=1)
audio_genR = np.expand_dims(np.squeeze(audio_genR.numpy()), axis=1)
audio_genS = np.concatenate((audio_genL, audio_genR), axis=1)
audio_genM = np.expand_dims(np.squeeze(audio_genM.numpy()), axis=1)

# Ear test (normalization), also make sure render.wav is stereo
write_audio("renderS.wav", 48000, audio_genS)
write_audio("renderM.wav", 48000, audio_genM)

!ffmpeg-normalize renderS.wav -o render.wav -t -15

!rm renderS.wav renderM.wav

colab_utils.download("render.wav")

## **Step 13** (optional)--Download your model for later use

In [14]:
%cd $CKPT_OUTPUT_DIR
!zip -r checkpoint.zip *
colab_utils.download('checkpoint.zip')
!rm checkpoint.zip

/content/drive/My Drive/test/ckpt
  adding: ckpt-2100.data-00000-of-00002 (deflated 92%)
  adding: ckpt-2100.data-00001-of-00002 (deflated 6%)
  adding: ckpt-2100.index (deflated 83%)
  adding: dataset_statistics.pkl (deflated 56%)
  adding: operative_config-0.gin (deflated 76%)

