# Music Style Transfer using Convolutional Neural Network

**Pratik Mahajan**<br/>
Information Systems Department<br/>
Northeastern University<br/>
mahajan.pr@husky.neu.edu<br/>

## Abstract

With the recent success of Convolutional Neural Networks (CNNs) in style transfer for images, we apply similar algorithms to style transfer in the musical domain. In this paper, we use a Convolutional Neural Network to transfer musical style from one song to another. The generated music has features of both the content song and the style song. The neural style algorithm [1] is used to separate the style and content of each song and recombine them to form new music. We also discuss how using Convolutional Neural Networks for audio processing results in imperfect output.

## Introduction

In the last decade, Deep Neural Networks (DNNs) have gained a reputation for solving complex problems and have emerged as a state-of-the-art solution for several Artificial Intelligence tasks such as image classification, driverless cars, and automation.

Following the success of Convolutional Neural Networks in image style transfer, we apply the same technique to music and examine its advantages and the challenges involved. In this paper, we apply convolutional neural networks to spectrograms of sound waves and assess whether applying CNNs to sound in this way is a good choice.

It is difficult to define music style because of its multi-level and multi-modal character. Musical content can usually be denoted by the melody, and style by the harmonization. We also discuss the current limitations of music style modelling and what can be done to make the transfer more accurate and feasible.

## Music Style Transfer: Process

By separating and recombining musical content and musical style, it is possible to generate new music that is both creative and human-like. In other words, we can still use our favorite data-driven algorithms but adapt the constraints or optimizations by applying them separately to different aspects of music (i.e., content and style) [2].

1. We convert the raw audio signal into its spectrogram using the Short-Time Fourier Transform. A spectrogram is a 2D representation of a 1D signal, so we can treat it as an image. Equivalently, it can be viewed as a 1xT image with F channels, where T is the number of time frames and F is the number of frequency bins.

2. We cannot use VGG-19, because its 3x3 convolutions do not suit our 1D problem; we have to use 1D convolutions instead. We therefore use a network initialized with random weights rather than a pretrained one. It has only one layer of 4096 filters.

3. We reconstruct the music file from the resulting spectrogram using the Griffin-Lim algorithm [12].
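The first two steps can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the implementation used for the results: the function names, the STFT parameters (`n_fft=512`, `hop=128`), the filter width of 11 frames, and the use of 64 filters instead of 4096 (to keep the example fast) are all choices made for this sketch. The content and style terms follow the general form of the neural style algorithm [1], with a Gram matrix of the filter responses as the style statistic. The optimization of the result spectrogram, which would need automatic differentiation, is omitted.

```python
import numpy as np


def log_spectrogram(signal, n_fft=512, hop=128):
    """Return a log-magnitude STFT of shape (freq_bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1))).T


def random_conv1d_features(spec, n_filters=64, width=11, seed=0):
    """Apply one layer of random 1D filters along time, followed by ReLU.

    Frequency bins act as input channels, so each filter spans all
    frequencies and `width` consecutive frames.
    """
    n_channels, n_frames = spec.shape
    rng = np.random.default_rng(seed)  # fixed seed: same filters for every input
    scale = np.sqrt(2.0 / (n_channels * width))
    kernels = rng.standard_normal((n_filters, n_channels, width)) * scale
    out_frames = n_frames - width + 1
    # windows[c, t, w] == spec[c, t + w]
    windows = np.stack([spec[:, w:w + out_frames] for w in range(width)], axis=-1)
    return np.maximum(np.einsum("fcw,ctw->ft", kernels, windows), 0.0)


def gram_matrix(features):
    """Filter-by-filter correlations averaged over time (the style statistic)."""
    return features @ features.T / features.shape[1]


def transfer_loss(result, content, style, style_weight=1.0):
    """Content term on raw features plus a weighted style term on Gram matrices."""
    f_result = random_conv1d_features(result)
    f_content = random_conv1d_features(content)
    f_style = random_conv1d_features(style)
    content_loss = np.mean((f_result - f_content) ** 2)
    style_loss = np.mean((gram_matrix(f_result) - gram_matrix(f_style)) ** 2)
    return content_loss + style_weight * style_loss


# Synthetic stand-ins for the two songs (hypothetical test tones).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
content_spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
style_spec = log_spectrogram(np.sign(np.sin(2 * np.pi * 220 * t)))

print(content_spec.shape)  # (freq_bins, frames)
print(transfer_loss(content_spec, content_spec, style_spec))
```

Starting the result at the content spectrogram makes the content term zero, so the printed loss is the style term alone. Because the Gram matrix averages over time, the style clip does not need to have the same length as the content clip.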

## Results

![Spectrogram generated from the content and style audio](result.png)

The result shown above is a new spectrogram generated from the content and style files. We convert this spectrogram to an audio file using the Griffin-Lim algorithm. The result has features of both the content and the style audio. As we increase the number of iterations, the features from both content and style become more prominent.
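The conversion from a magnitude spectrogram back to audio can be sketched as follows. This is a minimal NumPy version of the Griffin-Lim iteration, written under assumptions: the helper names, the Hann window, `n_fft=512`, `hop=128`, and the random initial phase are choices made for this sketch, and the input is assumed to be a linear (not log-scaled) magnitude of shape (frames, freq_bins). A log-scaled spectrogram would first need its scaling undone. The idea is to alternate between resynthesizing audio from the current phase estimate and re-analyzing that audio to obtain a phase that is consistent with a real signal.

```python
import numpy as np


def stft(signal, n_fft=512, hop=128):
    """Complex STFT of shape (frames, freq_bins) using a Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spectrum, n_fft=512, hop=128):
    """Least-squares inverse STFT via windowed overlap-add."""
    window = np.hanning(n_fft)
    n_frames = spectrum.shape[0]
    length = n_fft + hop * (n_frames - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spectrum, n=n_fft, axis=1)
    for i in range(n_frames):
        start = i * hop
        signal[start:start + n_fft] += frames[i] * window
        norm[start:start + n_fft] += window ** 2
    return signal / np.maximum(norm, 1e-8)


def griffin_lim(magnitude, n_iter=50, n_fft=512, hop=128, seed=0):
    """Estimate a signal whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random start
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop)
        # Keep the target magnitude, adopt the phase of the resynthesized audio.
        phase = np.exp(1j * np.angle(stft(signal, n_fft, hop)))
    return istft(magnitude * phase, n_fft, hop)


def spectral_error(signal, magnitude, n_fft=512, hop=128):
    """Relative mismatch between the signal's STFT magnitude and the target."""
    estimate = np.abs(stft(signal, n_fft, hop))
    return np.linalg.norm(estimate - magnitude) / np.linalg.norm(magnitude)


# Hypothetical test tone standing in for the result spectrogram.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
original = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
target = np.abs(stft(original))

random_phase_audio = griffin_lim(target, n_iter=0)
refined_audio = griffin_lim(target, n_iter=50)
print(spectral_error(random_phase_audio, target))
print(spectral_error(refined_audio, target))
```

The second printed error should be lower than the first, since each iteration moves the estimate toward a spectrogram that a real signal could have produced. The phase is still only an estimate, which is one source of audible artifacts in the reconstructed audio.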

## Conclusion

The output of music style transfer has features of both the content and the style audio. However, the different instruments are fused in the output, so it is difficult to tell apart the instruments from the two songs. As far as music style transfer is concerned, we were successful in transferring style from one song to another. The output is sometimes melodious and sometimes not. The latter typically happens when the song contains an ensemble of multiple instruments, which makes it sound as if random instruments from both songs are playing. When we use songs with a single instrument, however, the output is melodious and combines style and content well.

## Problems in Music Style Transfer

A fundamental problem is that music style is a fuzzy term that can refer to any aspect of music, ranging from high-level features such as tone and chord sequence to low-level features such as sound texture and timbre [2].

### Image Vs Sound

In images, the concepts of content and style are intuitive. Content describes the objects present, such as trees, faces, and animals, while style is conveyed by colors, lighting, texture, edges, etc. Music, however, is semantically abstract and multi-dimensional in nature. Thus, musical content can mean different things in different contexts. Musical content can usually be denoted by the melody, and style by the harmonization. But we can also associate the lyrics with the content, and the different melodies used to compose the song with the style.

### The axes of spectrograms do not carry the same meaning as those of images

In an image, both axes measure spatial position, so a pattern keeps its meaning wherever it appears. In a spectrogram, one axis is time and the other is frequency. Shifting a pattern along the time axis only changes when the sound occurs, whereas shifting it along the frequency axis changes the sound itself, for example its pitch [4].

### Sound is transparent, images are not.

One challenge posed in the comparison between visual images and spectrograms is the fact that visual objects and sound events do not accumulate in the same manner. To use a visual analogy, one could say that sounds are always transparent [3] whereas most visual objects are opaque.

When encountering a pixel of a certain color in an image, it can most often be assumed to belong to a single object. Discrete sound events do not separate into layers on a spectrogram; instead, they all sum together into a distinct whole. That means that a particular observed frequency in a spectrogram cannot be assumed to belong to a single sound, as the magnitude of that frequency could have been produced by any number of accumulated sounds or even by complex interactions between sound waves such as phase cancellation. This makes it difficult to separate simultaneous sounds in spectrogram representations [4].
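A small numerical experiment illustrates why magnitudes in a spectrum cannot be attributed to individual sounds. The tones, sample rate, and helper name below are hypothetical choices for this sketch. Three sine tones share the same pitch and amplitude and differ only in phase, yet the magnitude of their sums at that pitch ranges from almost zero to well above one.

```python
import numpy as np

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate  # one second of audio

tone_a = np.sin(2 * np.pi * 440 * t)
tone_b = np.sin(2 * np.pi * 440 * t + np.pi)      # same pitch, opposite phase
tone_c = np.sin(2 * np.pi * 440 * t + np.pi / 2)  # same pitch, quarter-cycle shift


def magnitude_at(signal, freq_hz):
    """Amplitude of the spectrum at `freq_hz` (1.0 for a unit sine)."""
    spectrum = np.abs(np.fft.rfft(signal)) / (len(signal) / 2)
    return spectrum[int(round(freq_hz * len(signal) / sample_rate))]


cancelled = tone_a + tone_b
shifted = tone_a + tone_c

print(magnitude_at(tone_a, 440))     # about 1.0
print(magnitude_at(cancelled, 440))  # about 0.0: phase cancellation
print(magnitude_at(shifted, 440))    # about 1.414, not 1.0 + 1.0
```

Given only the magnitude at 440 Hz, there is no way to tell how many sources contributed to it or how loud each one was.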

### The spectral properties of sounds are non-local

In images, similar neighboring pixels can often be assumed to belong to the same visual object but in sound, frequencies are most often non-locally distributed on the spectrogram [3]. Periodic sounds are typically comprised of a fundamental frequency and a number of harmonics which are spaced apart by relationships dictated by the source of the sound. It is the mixture of these harmonics that determines the timbre of the sound [4].
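The non-local layout of a single periodic sound can be seen in a short NumPy sketch. The fundamental frequency, harmonic amplitudes, and threshold below are hypothetical values chosen for illustration. One note with a 220 Hz fundamental and three harmonics puts its energy into frequency bins that are far apart rather than adjacent.

```python
import numpy as np

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate  # one second, so bins are 1 Hz apart

f0 = 220                                  # fundamental frequency in Hz
amplitudes = [1.0, 0.5, 0.25, 0.125]      # fundamental plus three harmonics
tone = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
           for k, a in enumerate(amplitudes))

spectrum = np.abs(np.fft.rfft(tone)) / (len(tone) / 2)
peak_bins = np.flatnonzero(spectrum > 0.05)
peak_freqs = peak_bins * sample_rate / len(tone)

print(peak_freqs)           # frequencies carrying the energy of this one note
print(spectrum[peak_bins])  # their relative strengths, which shape the timbre
```

The peaks sit at integer multiples of the fundamental, 220 Hz apart, so a small convolution kernel sliding along the frequency axis would never see all the components of this one sound at once.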

## References

[1] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, "A Neural Algorithm of Artistic Style"<br/>
[2] Shuqi Dai, Zheng Zhang, Gus G. Xia, "Music Style Transfer: A Position Paper"<br/>
[3] L. Wyse, "Audio Spectrogram Representations for Processing with Convolutional Neural Networks," vol. 1, no. 1, pp. 37–41, 2017.<br/>
[4] "What’s wrong with CNNs and spectrograms for audio processing?" https://towardsdatascience.com/whats-wrong-with-spectrograms-and-cnns-for-audio-processing-311377d7ccd<br/>
[5] https://github.com/vadim-v-lebedev/audio_style_tranfer <br/>
[6] https://github.com/Lasagne/Recipes/blob/master/examples/styletransfer/Art%20Style%20Transfer.ipynb <br/>
[7] Extreme Style Machines: Using Random Neural Networks to Generate Textures https://nucl.ai/blog/extreme-style-machines/ <br/>
[8] Ivan Ustyuzhaninov, Wieland Brendel, Leon A. Gatys, Matthias Bethge, "Texture Synthesis Using Shallow Convolutional Networks with Random Filters"<br/>
[9] Kun He, Yan Wang, John Hopcroft, "A Powerful Generative Model Using Random Weights for the Deep Image Representation"<br/>
[10] Shaun Barry, Youngmoo Kim, "“Style” Transfer for Musical Audio Using Multiple Time-Frequency Representations"<br/>
[11] Dmitry Ulyanov, Vadim Lebedev, "Audio texture synthesis and style transfer"<br/>
[12] D. Griffin, Jae Lim, "Signal estimation from modified short-time Fourier transform"<br/>

last updated: 08/14/2018