<a href="https://colab.research.google.com/github/JustinYuu/CBIR_pytorch/blob/master/CMT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Demo of Controllable Music Transformer**

We provide a Colab notebook for running inference with CMT. You can use it to upload a video and generate background music for it.

# 1. Preparation

Clone the repo

In [None]:
import os
from google.colab import files
import json

os.chdir('/content')
!git clone https://github.com/wzk1015/video-bgm-generation
os.chdir('/content/video-bgm-generation')

Download checkpoint and soundfont



In [None]:
!gsutil -m cp gs://cmt/loss_8_params.pt /content/video-bgm-generation/exp/
!gsutil -m cp gs://magentadata/soundfonts/SGM-v2.01-Sal-Guit-Bass-V1.3.sf2 /content/video-bgm-generation/

Install dependencies

In [None]:
!apt-get update && apt-get install libfluidsynth1 build-essential libasound2-dev libjack-dev fluidsynth

In [None]:
!pip install --upgrade pip
# this may take ~15 minutes
!pip install pytorch-fast-transformers==0.3.0
# Note: the right pytorch-fast-transformers version depends on the GPU Colab assigns you;
# it could be 0.3.0, 0.4.0, or another release.
# A mismatched version can raise errors or produce very poor results for unknown reasons,
# so try a different version if that happens, or refer to https://github.com/idiap/fast-transformers
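# `!cd` runs in a throwaway subshell and does not change the notebook's directory, so use os.chdir
os.chdir("/content/video-bgm-generation")
!pip install -r py3_requirements.txt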
os.chdir("/content/video-bgm-generation/src/video2npz/visbeat3/")
!python setup.py install
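
Since the note above says the working fast-transformers version varies with the environment, it can help to record what was actually installed before debugging poor output. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of the repository.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages whose versions matter for the note above.
for name in ("pytorch-fast-transformers", "torch"):
    print(name, "->", installed_version(name) or "not installed")
```

If generation fails or sounds wrong, the printed versions are a useful starting point when trying a different release.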

# 2. Process input video

In [None]:
os.chdir("/content/video-bgm-generation/")
!pip install -r /content/video-bgm-generation/py3_requirements.txt

Upload your video

Videos **shorter than 2 minutes** are recommended; longer videos make processing very slow.
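
To check the length recommendation programmatically, one option is to ask `ffprobe` for the duration after uploading. This is a sketch that assumes `ffprobe` is on the PATH (it normally ships with ffmpeg); the helper names and the `MAX_SECONDS` constant are hypothetical, not part of the repository.

```python
import subprocess

MAX_SECONDS = 120  # the 2-minute recommendation


def ffprobe_command(path):
    """Build an ffprobe call that prints only the container duration in seconds."""
    return ["ffprobe", "-v", "error", "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1", path]


def video_duration(path):
    """Run ffprobe and return the duration of `path` in seconds."""
    result = subprocess.run(ffprobe_command(path), capture_output=True,
                            text=True, check=True)
    return float(result.stdout.strip())


def is_short_enough(seconds, limit=MAX_SECONDS):
    """True when the video is under the recommended length."""
    return seconds < limit
```

After the upload cell has run, `is_short_enough(video_duration("videos/test_raw.mp4"))` tells you whether the video is within the recommended length.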

In [None]:
os.chdir("/content/video-bgm-generation/")
uploaded = files.upload()
assert len(uploaded) == 1, "upload one video file only"
filename = list(uploaded.keys())[0]
# os.replace also works for filenames containing spaces or shell metacharacters
os.replace(filename, 'videos/test_raw.mp4')

Convert to 360p to speed up extracting optical flow and visbeats
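
To get a feel for how much work the downscale saves, the sketch below estimates the frame size after scaling to 360 rows while keeping the aspect ratio. `scaled_size` is a hypothetical helper. Rounding the width to an even number is an assumption made here because many H.264 encoders reject odd dimensions.

```python
def scaled_size(width, height, target_height=360):
    """Estimate the frame size after scaling to `target_height` rows,
    keeping the aspect ratio and rounding the width to the nearest even number."""
    exact_width = width * target_height / height
    even_width = max(2, 2 * round(exact_width / 2))
    return even_width, target_height


print(scaled_size(1920, 1080))  # (640, 360)
print(scaled_size(1280, 720))   # (640, 360)
```

A 1920x1080 frame has nine times as many pixels as a 640x360 one, so per-frame processing such as optical flow has roughly that much less data to handle.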

In [None]:
os.chdir("/content/video-bgm-generation/videos/")
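# -f avoids an error when test.mp4 does not exist yet
!rm -f test.mp4
# scale=-2:360 keeps the aspect ratio and rounds the width to an even number, which the H.264 encoder requires
!ffmpeg -i test_raw.mp4 -strict -2 -vf scale=-2:360 test.mp4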

Extract optical flow and visbeats, and convert the video into an npz file

In [None]:
os.chdir("/content/video-bgm-generation/src/video2npz/")
# remove outputs of any previous run; -f avoids errors when they do not exist
!rm -rf VisBeatAssets/ fig/ flow/ image/ optical_flow/
# extracting optical flow and visbeats may be slow
!bash video2npz.sh ../../videos/test.mp4

# 3. Run the model to generate background music

Run inference to generate MIDI (.mid) output

In [None]:
os.chdir("/content/video-bgm-generation/src/")
!python gen_midi_conditional.py -f "../inference/test.npz" -c "../exp/loss_8_params.pt" -n 1

Convert MIDI into audio: use **GarageBand (recommended)** or midi2audio

Remember to **set the tempo to the `tempo` value in video2npz/metadata.json**

In [None]:
os.chdir("/content/video-bgm-generation/src/")
files.download('../inference/test.npz_0.mid')

with open("video2npz/metadata.json") as f:
    tempo = json.load(f)['tempo']
    print("tempo:", tempo)

Generate audio with midi2audio

Instead of running this cell, we recommend using GarageBand or other software, since their soundfonts are better. That said, this cell also works fine.

In [None]:
import note_seq
from pretty_midi import PrettyMIDI
import midi2audio

SAMPLE_RATE = 16000
SF2_PATH = '/content/video-bgm-generation/SGM-v2.01-Sal-Guit-Bass-V1.3.sf2'
os.chdir("/content/video-bgm-generation/inference/")

input_mid = 'test.npz_0.mid'
midi_obj = PrettyMIDI(input_mid)
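# convert tempo: rescale all note times by a factor of 120/tempo
# (`tempo` was read from video2npz/metadata.json in the previous cell)
midi_length = midi_obj.get_end_time()
midi_obj.adjust_times([0, midi_length], [0, midi_length*120/tempo])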
processed_mid = input_mid[:-4] + "_processed.mid"
midi_obj.write(processed_mid)
print("converting into mp3")
fs = midi2audio.FluidSynth(SF2_PATH, sample_rate=SAMPLE_RATE)
fs.midi_to_audio(processed_mid, "music.mp3")

print("playing music")
ns = note_seq.midi_io.midi_to_note_sequence(midi_obj)
note_seq.play_sequence(ns, synth=note_seq.fluidsynth, sample_rate=SAMPLE_RATE, sf2_path=SF2_PATH)
note_seq.plot_sequence(ns)
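
The tempo conversion in the cell above maps every timestamp linearly from `[0, midi_length]` to `[0, midi_length*120/tempo]`. The factor suggests the generated MIDI is timed as if it were at 120 BPM, but that reading is an assumption based on the code, not something the notebook states. The sketch below shows the same arithmetic for a single timestamp; `rescale_time` is a hypothetical helper.

```python
def rescale_time(seconds, target_tempo, source_tempo=120.0):
    """Map a timestamp from `source_tempo` BPM to `target_tempo` BPM.

    A slower target tempo stretches the time; a faster one compresses it.
    """
    return seconds * source_tempo / target_tempo


print(rescale_time(10.0, 60.0))   # 20.0
print(rescale_time(10.0, 240.0))  # 5.0
```

A note at 10 seconds moves to 20 seconds when the target tempo is 60 BPM, and to 5 seconds at 240 BPM.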


Combine original video and audio into video with BGM

Generate or upload the audio file under `inference`, name it `music.mp3`, and run this cell to combine the video and music
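
Before combining, it can be worth confirming that both input files exist, since a missing `music.mp3` is the most likely cause of failure at this step. The sketch below also builds the ffmpeg call as an argument list. Both helper names are hypothetical, and the `-y` flag (overwrite the output without asking) is an addition made here, not part of the notebook's command.

```python
import os


def missing_inputs(*paths):
    """Return the subset of `paths` that are not existing files."""
    return [p for p in paths if not os.path.isfile(p)]


def mux_command(video, audio, output):
    """Build an ffmpeg call that takes the video stream from `video` and the
    audio stream from `audio`, copying the video and encoding the audio as AAC."""
    return ["ffmpeg", "-y", "-i", video, "-i", audio,
            "-map", "0:v:0", "-map", "1:a:0",
            "-c:v", "copy", "-c:a", "aac", output]


print(mux_command("../videos/test_raw.mp4", "music.mp3", "output.mp4"))
```

Here `-map 0:v:0` selects the first video stream of the first input and `-map 1:a:0` the first audio stream of the second, so any audio track in the original video is dropped. The list can be passed to `subprocess.run(..., check=True)` once `missing_inputs` returns an empty list.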

In [None]:
os.chdir("/content/video-bgm-generation/inference/")
# -f avoids an error when output.mp4 does not exist yet
!rm -f output.mp4
!ffmpeg -i ../videos/test_raw.mp4 -i music.mp3 -c:v copy -c:a aac -strict experimental -map 0:v:0 -map 1:a:0 output.mp4
files.download('output.mp4')