# Introduction

These blog posts detail my research into AI audio generation for games. The plan is to experiment with different machine learning algorithms for generating audio, find the best algorithm for generating short sound effects, then extend that algorithm to facilitate easy search, interpolation and possibly other modifications to a generated audio sample. The final algorithm should make generating and finding appropriate sound effects an easy task, even for non-experts in the area of audio engineering.

# Audio Samples for Training

It is important to have good training examples when training deep neural networks. Therefore, I opted to purchase some high-quality, game-specific sound effect libraries for this project. So far I have one for sci-fi weaponry and another for sword fighting.

# Machine Learning Algorithm

Having gotten some data to train with, I selected an initial machine learning algorithm for testing. The algorithm I selected is called WaveGAN [insert reference here]; the implementation I used is available here (https://github.com/chrisdonahue/wavegan). It is a Generative Adversarial Network for generating fixed-size audio clips using methods that have been successful in the field of photorealistic image generation.

# Audio Format

After getting all the dependencies set up, I attempted to run the make_tfrecord.py script to convert my wav files into TFRecords for the algorithm. However, I encountered a problem. The script reported that the files were not in the format it was expecting (standard PCM_16 wav format) and were instead in an extended format known as Broadcast Wave Format (BWF). This format allows for storing extra metadata that the standard wav format does not support. I also found that the audio format itself was PCM_24 (24-bit audio), which is not supported by many applications and scripts. Fortunately, there is a good Python library available for reading many kinds of audio files (PySoundFile), and it supports PCM_24 wav audio. It does not support reading the extra metadata provided by the BWF format out of the box. However, there is an extension available, provided by danrossi on his GitHub page (https://github.com/danrossi/bwfsoundlib), which extends PySoundFile to support BWF metadata.
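
To make the format issue a little more concrete, here is a minimal sketch of how one could check a wav file for a BWF `bext` chunk and read its bit depth using only the Python standard library. This is not what make_tfrecord.py or bwfsoundlib do internally; the function name `scan_wav_chunks` and the in-memory demo file are my own assumptions for illustration, and the sketch only handles plain RIFF/WAVE files.

```python
import io
import struct


def scan_wav_chunks(stream):
    """Return (chunk_ids, fmt_info) for a binary stream holding a RIFF/WAVE file."""
    header = stream.read(12)
    if len(header) < 12 or header[:4] != b'RIFF' or header[8:12] != b'WAVE':
        raise ValueError('not a RIFF/WAVE file')

    chunk_ids, fmt_info = [], None
    while True:
        chunk_header = stream.read(8)
        if len(chunk_header) < 8:
            break  # end of file
        chunk_id, chunk_size = struct.unpack('<4sI', chunk_header)
        chunk_ids.append(chunk_id.decode('ascii', errors='replace'))
        if chunk_id == b'fmt ':
            body = stream.read(chunk_size)
            if len(body) >= 16:
                tag, channels, rate, _, _, bits = struct.unpack('<HHIIHH', body[:16])
                fmt_info = {'format_tag': tag, 'channels': channels,
                            'samplerate': rate, 'bits_per_sample': bits}
            if chunk_size % 2:
                stream.seek(1, io.SEEK_CUR)  # chunks are padded to an even size
        else:
            # Skip the payload (plus the pad byte for odd sizes).
            stream.seek(chunk_size + chunk_size % 2, io.SEEK_CUR)
    return chunk_ids, fmt_info


def _chunk(chunk_id, payload):
    """Build one RIFF chunk (hypothetical helper used only for the demo below)."""
    pad = b'\x00' if len(payload) % 2 else b''
    return chunk_id + struct.pack('<I', len(payload)) + payload + pad


# Demo: a tiny in-memory file with a zero-filled placeholder 'bext' chunk,
# a 24-bit stereo 48 kHz 'fmt ' chunk and a single audio frame.
fmt_payload = struct.pack('<HHIIHH', 1, 2, 48000, 48000 * 2 * 3, 6, 24)
body = (b'WAVE' + _chunk(b'bext', bytes(602))
        + _chunk(b'fmt ', fmt_payload) + _chunk(b'data', bytes(6)))
demo = b'RIFF' + struct.pack('<I', len(body)) + body

chunk_ids, fmt_info = scan_wav_chunks(io.BytesIO(demo))
print('has bext chunk:', 'bext' in chunk_ids)
print('bits per sample:', fmt_info['bits_per_sample'])
```

For a file on disk, the same function can be called on a handle opened with `open(path, 'rb')`.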

# Test Reading Metadata from BWF Wave file

Here are my initial tests for reading metadata using the bwfsoundlib extension.

In [2]:
from bwfsoundfile import BwfSoundFile
import librosa
from IPython.display import display, Audio

In [3]:
file = BwfSoundFile('../ml/wavegan/data/Lethal_Energies_orig/Designed Weapons/DS_Heavy/DS_Heavy_01_Shot-1.wav')

In [4]:
file.read_metadata()

In [6]:
print(file.bext_info)

{'origination_date': '2015', 'originator_reference': '', 'originator': '', 'origination_time': '', 'version': 0, 'timereference_translated': '00:00:00.000000', 'timereference': 0, 'description': 'sci-fi, futuristic, weapon, gun, cannon, launcher, blast, heavy, huge, massive', 'coding_history': '', 'umid': ''}


In [7]:
print(file.get_core_info())

{'bit_depth_info': 'Signed 24 bit PCM', 'bit_depth': 24, 'channels': 2, 'duration': '00:00:02', 'samplerate': 48000}
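
The `description` field above looks like a comma-separated keyword list, which could be handy later for searching sound effects. As a rough sketch (the helper name `description_to_tags` is hypothetical, and I am assuming other files in the library follow the same comma-separated convention), it could be split into tags like this:

```python
def description_to_tags(description):
    """Split a comma-separated description string into clean, lowercase tags."""
    return [tag.strip().lower() for tag in description.split(',') if tag.strip()]


# Sample value copied from the bext_info output above.
bext_info = {
    'description': 'sci-fi, futuristic, weapon, gun, cannon, launcher, '
                   'blast, heavy, huge, massive',
}

tags = description_to_tags(bext_info['description'])
print(tags)
```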


# Test Downsample to 16 kHz

WaveGAN was originally designed to work on audio data sampled at 16 kHz. Using this sample rate, it outputs just over 1 second of audio. In order to keep things as simple as possible for initial testing, I have opted to downsample my audio clips to match WaveGAN's original implementation. To do this I use an audio processing and analysis library called librosa. Below is the result of downsampling a clip from the Lethal Energies sound effect library.

In [10]:
data = file.read(-1, dtype='float32')
data = data.T # PySoundFile and librosa audio matrices (2 channels per audio sample) are the transpose of each other.

In [12]:
samplerate = file.samplerate
display(Audio(data, rate=samplerate))

In [13]:
resamplerate = 16000
data_resampled = librosa.resample(data, orig_sr=samplerate, target_sr=resamplerate)  # keyword arguments are required in recent librosa versions
display(Audio(data_resampled, rate=resamplerate))

You can hear that this crushes the audio signal quite a bit. In my final implementation I would like to be able to train and generate at a higher sample rate to preserve quality.
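
To put rough numbers on this trade-off, here is a small sketch of the arithmetic. It assumes WaveGAN's default output length is 16384 samples (that figure is not stated above, but it is consistent with "just over 1 second" at 16 kHz), and it uses the fact that a signal sampled at a given rate can only represent frequencies up to half that rate (the Nyquist frequency).

```python
WAVEGAN_OUTPUT_SAMPLES = 16384  # assumption: default WaveGAN output length


def clip_duration_seconds(num_samples, samplerate):
    """Duration of a clip with num_samples samples per channel."""
    return num_samples / samplerate


def nyquist_hz(samplerate):
    """Highest frequency that can be represented at this sample rate."""
    return samplerate / 2


for rate in (16000, 48000):
    duration = clip_duration_seconds(WAVEGAN_OUTPUT_SAMPLES, rate)
    print(f'{rate} Hz: {duration:.3f} s per clip, content up to {nyquist_hz(rate):.0f} Hz')
```

So at 16 kHz everything above 8 kHz is discarded, which is a large part of why the downsampled clip sounds dull. Keeping the same number of output samples at 48 kHz would preserve the high frequencies but shorten each clip to roughly a third of a second.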

# Bulk Resample

What follows is the code I used to bulk resample my audio data. I have excluded extracting metadata for now as I am not yet using it.

In [16]:
from bwfsoundfile import BwfSoundFile
import soundfile as sf
import librosa
from IPython.display import display, Audio
import glob
import os

In [17]:
resamplerate = 16000
out_dir = '../ml/wavegan/data/Swordfighter_Preprocessed/'
os.makedirs(out_dir, exist_ok=True)  # also creates missing parent directories

In [18]:
file_list = []
for filename in glob.iglob('../ml/wavegan/data/Swordfighter_orig/Swordfighter - Video Games/wav/**/*.wav', recursive=True):
    file_list.append(filename)

In [19]:
for filename in file_list:
    with BwfSoundFile(filename) as bwf_file:
        # Read
        samplerate = bwf_file.samplerate
        data = bwf_file.read(-1, dtype='float32')
        data = data.T # Librosa uses shape (nb_channels, nb_samples), PySoundFile uses shape (nb_samples, nb_channels)
        
        # Resample
        data_resampled = librosa.resample(data, orig_sr=samplerate, target_sr=resamplerate)
        
        # Write
        out_filename = os.path.basename(bwf_file.name)  # works with the path separator of the current OS
        sf.write(os.path.join(out_dir, out_filename), data_resampled.T, resamplerate, subtype='PCM_16')
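
One thing to watch with the loop above is that it flattens a recursive folder tree into a single output directory, so two files with the same name in different subfolders would overwrite each other. I have not checked whether that happens in my libraries, but here is a sketch of one way to avoid it by folding the subfolder names into the output filename. The helper name `flat_output_name` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def flat_output_name(filename, root, sep='__'):
    """Build a flat filename from the path components below root.

    Backslashes are treated as separators so that paths returned by glob
    on Windows (which can mix '/' and '\\') are handled as well. This
    assumes no filename contains a literal backslash.
    """
    path = PurePosixPath(filename.replace('\\', '/'))
    root_path = PurePosixPath(root.replace('\\', '/'))
    relative = path.relative_to(root_path)  # raises ValueError if not under root
    return sep.join(relative.parts)


# Hypothetical example paths, one in each separator style.
print(flat_output_name('data/wav\\Hits\\Sword_01.wav', 'data/wav'))
print(flat_output_name('data/wav/Swings/Sword_01.wav', 'data/wav'))
```

The resulting name can then be passed to `os.path.join(out_dir, ...)` in place of the plain basename.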

# Results

I already have some initial results from running the WaveGAN algorithm, which I will talk about in my next blog post. For now, here are some samples of audio I have managed to generate so far.

In [20]:
file_list = []
for filename in glob.iglob('../ml/wavegan/generate/**/*.wav', recursive=True):
    file_list.append(filename)

In [23]:
for filename in file_list:
    data, samplerate = sf.read(filename)
    display(Audio(data.T, rate=samplerate))