# Generating Audio with LSTM 

This notebook lets you download a song from YouTube and model it with an **LSTM**.

The model learns how each spectral frame follows the previous ones, and it can then be used to generate new raw audio.
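To make the idea of one spectral frame following another concrete, here is a minimal NumPy sketch of how a spectrogram could be cut into (input window, next frame) training pairs. It is an illustration under assumptions, not the repository's actual data pipeline: the helper name `make_training_pairs`, the random stand-in spectrogram and the 513-bin frame size are hypothetical. Only the window length of 40 is taken from `args.sequence_length` in this notebook.

```python
import numpy as np

def make_training_pairs(frames, sequence_length):
    """Split a (num_frames, num_bins) array into sliding windows and their next frames.

    Hypothetical helper for illustration; not part of the cloned repository.
    """
    frames = np.asarray(frames)
    num_windows = len(frames) - sequence_length
    if num_windows <= 0:
        raise ValueError("need more frames than sequence_length")
    # X[k] holds frames k .. k+sequence_length-1, y[k] is the frame right after them
    X = np.stack([frames[k:k + sequence_length] for k in range(num_windows)])
    y = frames[sequence_length:]
    return X, y

# Stand-in for a real magnitude spectrogram: 100 frames of 513 bins (assumed sizes)
rng = np.random.default_rng(0)
spectrogram = rng.random((100, 513))

X, y = make_training_pairs(spectrogram, sequence_length=40)
print(X.shape, y.shape)  # (60, 40, 513) (60, 513)
```

A sequence model trained on pairs like these learns to map a window of past frames to the frame that comes next.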

## Install prerequisites and get code

In [None]:
%tensorflow_version 1.x
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # silence TensorFlow's C++ log output (info, warnings, errors)
import tensorflow as tf

In [None]:
!git clone https://github.com/ual-cci/music_gen_interaction_RTML.git

In [None]:
# python libraries
!pip install Pillow numpy opencv-python PyWavelets tqdm slugify
!pip install -U Flask
!pip install lws==1.2.6
!pip install tflearn
!pip install librosa==0.7.2
!pip install numba==0.48
!pip install mock

In [None]:
%cd /content/music_gen_interaction_RTML

## Download a sample audio:

Note: replace the URL with any music video you like, or upload a file directly. You can use ffmpeg to convert it to WAV later.
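If you upload your own file instead, the conversion step could look like the sketch below. It only builds the ffmpeg command, reusing the flags from the download cell further down. The helper name `ffmpeg_to_wav_command` and the file name `my_upload.mp3` are placeholders, not part of the repository.

```python
import shlex

def ffmpeg_to_wav_command(input_path, output_path="full.wav"):
    """Build an ffmpeg command that converts an audio file to a stereo WAV.

    Hypothetical helper; the flags mirror the ffmpeg call used in this notebook.
    """
    return ["ffmpeg", "-i", input_path, "-ac", "2", "-f", "wav", output_path]

# "my_upload.mp3" is a placeholder for whatever file you uploaded
# (in Colab, one way to upload is: from google.colab import files; files.upload())
cmd = ffmpeg_to_wav_command("my_upload.mp3")
print(" ".join(shlex.quote(part) for part in cmd))
# prints: ffmpeg -i my_upload.mp3 -ac 2 -f wav full.wav

# To actually run the conversion:
# import subprocess; subprocess.run(cmd, check=True)
```

Writing to `full.wav` keeps the rest of the notebook unchanged, since the training cell cuts its sample from that file.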

In [None]:
# get a youtube downloader
!sudo curl -L https://yt-dl.org/downloads/latest/youtube-dl -o /usr/local/bin/youtube-dl
!sudo chmod a+rx /usr/local/bin/youtube-dl

# Set up training

In [None]:

%cd /content/music_gen_interaction_RTML/

from unittest.mock import Mock, MagicMock
args = MagicMock(name='method')
sample_rate = 22050

# keep the same settings as the model you used when training:
args.lstm_layers = 3
args.lstm_units = 128
args.griffin_iterations = 60
args.sample_rate = sample_rate
args.sequence_length = 40
args.async_loading = True
args.amount_epochs = -1

# Pick a new song from YouTube

## Insert your own URL below and train



In [None]:
youtube_url = "https://www.youtube.com/watch?v=4cIWu5m8UmA"

!youtube-dl -ci -f "bestaudio[ext=m4a]" $youtube_url -o 'youtube_audio.m4a'
!ffmpeg -i 'youtube_audio.m4a' -ac 2 -f wav full.wav

### You will probably need to train for 300 epochs

In [None]:
number_of_epochs = 150 # quick run, takes roughly 4 min
number_of_epochs = 300 # recommended, takes roughly 8 min (overrides the line above)

# Train Model

In [None]:
# ----[keep the same below]-------------------------------------------------------------------
!ffmpeg -ss 60 -i full.wav -t 60 -c copy sample.wav
#"""
!mkdir __music_samples
!mkdir __music_samples/sample/
!mv sample.wav __music_samples/sample/
!mkdir __saved_models/

# takes time!
!python training_handler.py -target_file __music_samples/sample/ -amount_epochs $number_of_epochs -batch_size 512
from IPython.display import clear_output 
clear_output()

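import glob
import numpy as np
from IPython.display import Audio, Image

# pick one of the WAV files found in __saved_models at random
wav_files = glob.glob("__saved_models/*.wav")
random_wav = wav_files[np.random.randint(len(wav_files))]
# show the first PNG found in __saved_models
png_files = glob.glob("__saved_models/*.png")
display(Audio(random_wav))
display(Image(png_files[0]))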
#"""

# Generate

As Vit mentioned in the lecture, we start from a place in the original audio track, then use the model to keep predicting new audio frames. 

Unfortunately, it can sometimes get stuck in a loop or stop generating anything interesting, so we can give it a kick by jumping to a new point in the song.

Once we've switched, we keep using the model to generate new audio frames!
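The loop described above can be sketched with a toy stand-in for the trained model. Everything here is an assumption made for illustration: `generate_frames` and `toy_predict_next` are hypothetical names, the "model" just averages the last two frames, and the frames are random numbers. The real generation in this notebook goes through `ServerHandler.generate_audio_sample` from the repository.

```python
import numpy as np

def generate_frames(frames, predict_next, start_fraction, num_frames, sequence_length=40):
    """Seed a window from the original frames, then keep predicting new frames."""
    start = int(start_fraction * (len(frames) - sequence_length))
    window = frames[start:start + sequence_length].copy()
    generated = []
    for _ in range(num_frames):
        next_frame = predict_next(window)
        generated.append(next_frame)
        # slide the window: drop the oldest frame, append the prediction
        window = np.vstack([window[1:], next_frame])
    return np.array(generated)

def toy_predict_next(window):
    # placeholder for a trained LSTM: average of the two most recent frames
    return window[-2:].mean(axis=0)

rng = np.random.default_rng(0)
original = rng.random((500, 8))  # 500 made-up frames with 8 bins each

part_a = generate_frames(original, toy_predict_next, 0.1, 20)  # start 10% in
part_b = generate_frames(original, toy_predict_next, 0.6, 30)  # "kick": jump to 60%
output = np.concatenate([part_a, part_b])
print(output.shape)  # (50, 8)
```

Each call reseeds the window from the original track, which is what the jump to a new point in the song amounts to. After the seed, the model only ever sees its own predictions.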

Below, you can specify a sequence of lengths and changes to the generation process to create a new audio output. 

The `sequence` array stores these points. Each item in it is an array holding a **start position** and a **segment length**. The default has three items, but you can add as many as you want and/or change the existing values.

```
sequence = [
  [start1, length1],
  [start2, length2],
  [start3, length3],
  etc....
]
```


The default code:

* Starts generating 10% through the song and generates 200 frames 

* Moves to 60% through the song and generates 300 frames

* Moves to 90% through the song and generates 150 frames

**Try with your own!**
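Before running the generation cell, it can help to sanity-check your own `sequence`. The sketch below is an assumption-laden helper rather than anything from the repository: `summarise_sequence` is a hypothetical name, and it assumes start positions are fractions of the song between 0 and 1 (as in the defaults) and that lengths are positive frame counts.

```python
def summarise_sequence(sequence):
    """Validate [start, length] pairs and return the total number of frames requested.

    Hypothetical helper; the valid ranges checked here are assumptions.
    """
    total_frames = 0
    for start, length in sequence:
        if not 0.0 <= start <= 1.0:
            raise ValueError(f"start position {start} must be a fraction between 0 and 1")
        if length <= 0:
            raise ValueError(f"segment length {length} must be positive")
        total_frames += length
    return total_frames

sequence = [
    [0.1, 200],
    [0.6, 300],
    [0.9, 150],
]
total = summarise_sequence(sequence)
print(total)  # 650
```

With the default values, the generated output is 650 frames long in total.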

In [None]:
from server_handler import ServerHandler
import settings

my_settings = settings.Settings(args)
my_settings.print_settings()

generation_handler = ServerHandler(my_settings)

# slightly experimental interpolation through the latents while generating ...

generation_handler.change_impulse(0.2) # set to 20% sharp

sequence = [
  # Starts generating 10% through the song and generates 200 frames
  [0.1, 200],
  # Moves to 60% through the song and generates 300 frames
  [0.6, 300],
  # Moves to 90% through the song and generates 150 frames
  [0.9, 150]
]

output_audio = []

for position_in_the_song, requested_length in sequence:
  generation_handler.change_impulse_smoothly_start(position_in_the_song) # allow interpolation
  audio, predict, reconstruct = generation_handler.generate_audio_sample(requested_length, interactive_i=position_in_the_song)
  output_audio.append(audio)

clear_output()

# Play Audio

If you want to download the audio, use the Colab file explorer in the left sidebar.

Find the file `generated_output_exp_concat.wav` and select **Download**.

In [None]:
import librosa
from IPython.display import Audio, Image
output_audio = np.concatenate(output_audio)

out_name = 'generated_output_exp_concat.wav'
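# librosa.output.write_wav is available in the pinned librosa 0.7.2 (it was removed in librosa 0.8)
librosa.output.write_wav(out_name, output_audio, sr=sample_rate)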
Audio(out_name)

# Clean up if you are going to train a new model from a different YouTube video

In [None]:
# fast cleanup
!mkdir unused
!mv __music_samples unused/
!mv __saved_models unused/
!mv *.wav unused/
!mv *.m4a unused/