# WELCOME TO THE ML @ SAPIENZA HACKATHON 2023/24!

This is a short welcome kit to help you get familiar with the topic of the Hackathon: *digital music* 🎹

Please read this notebook from start to end, because you'll need some of these functions and tools!

# Your local machine

Please note that this version of the notebook is meant to be run on your local computer.

If you want to run the tutorial on Google Colab, please use the other notebook.

Make sure you have followed all the instructions in the readme, so that you have a properly configured virtual environment to run your code in.

## Fundamentals

Let's first import the required libraries:

In [None]:
from pydub import AudioSegment
import librosa

**NOTE:** If you *don't* use Colab and want to use your laptop's GPU, make sure to install `torch` with CUDA enabled:

```
pip3 install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
```

Visit [the official website](https://pytorch.org/) to get the exact string for your OS.

And here are some utility functions that we will need in this demo. You don't need to understand how they work to complete this notebook!

In [None]:
from utils import *

Let's start playing some sounds! 🎶

We use the `AudioSegment` class to load sounds and play them. Execute the following cell to get a small embedded player:

In [None]:
audio = AudioSegment.from_wav("./test.wav")
player(audio, zoom=1, title="test waveform")

# push the play button!

Did you enjoy those smooth notes?

You might be wondering what the plot below the player is. If you already know, a brief refresher doesn't hurt. If this is the first time you delve into this, welcome to the world of **Digital Sound Processing**!! 🕺

A simple way to represent sound is as a **list of samples**. Let's try to zoom into the plot above:

In [None]:
player(audio, zoom=60., title="test waveform")

The zoom only shows a small portion of the entire soundwave. On the *x*-axis we see time; on the *y*-axis, we have the **sample** amplitude. Therefore, in the plot above we are looking at a window of roughly $3,500$ samples.

How many samples in time are we using to represent the entire audio file? Let's see:

In [None]:
print(len(audio.get_array_of_samples()))

Wow, that's a lot of samples for just $8$ seconds of music! If each of these samples were a letter of the alphabet, we would have a book of $150$ pages!

That's because we are representing sound using $24,000$ samples per second; in signal processing we say that the **sample rate** (or **frame rate**) of this example is $24$ kHz.

In [None]:
print(audio.frame_rate)
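
To make the "list of samples" idea concrete, here is a minimal sketch that builds one second of a pure tone by hand, using only the standard library. The tone frequency, the amplitude, and the helper name `sine_samples` are illustrative assumptions, not part of the hackathon utilities.

```python
import math

def sine_samples(freq_hz, duration_s, sample_rate):
    """Return a list of 16-bit integer samples of a sine tone (illustrative helper)."""
    n_samples = int(duration_s * sample_rate)
    amplitude = 32767 * 0.5  # half of the 16-bit signed maximum
    return [
        int(amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate))
        for n in range(n_samples)
    ]

sample_rate = 24000  # 24 kHz, the sample rate discussed above
samples = sine_samples(freq_hz=440.0, duration_s=1.0, sample_rate=sample_rate)

print(len(samples))                # number of samples in one second
print(len(samples) / sample_rate)  # duration in seconds = samples / sample rate
```

The last line is the relation to remember: duration equals the number of samples divided by the sample rate.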

Sometimes you need to process audio more quickly or save some memory, so it can be useful to decrease the sample rate. Let's do this:

In [None]:
audio_downsampled = audio.set_frame_rate(4000)
player(audio_downsampled, zoom=1., title="downsampled waveform")

The sound quality has decreased, but it sounds acceptable given the 6x reduction in memory!
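
The factor of 6 follows directly from the ratio of the two sample rates. The sketch below works through that arithmetic and shows the crudest possible downsampler, which keeps one sample out of every six. It is only an illustration: it assumes mono audio with 2 bytes per sample, `naive_decimate` is a hypothetical helper, and a real resampler would typically filter the signal first to limit aliasing.

```python
def naive_decimate(samples, factor):
    """Keep every `factor`-th sample (no anti-aliasing filter; illustration only)."""
    return samples[::factor]

original_rate = 24000
target_rate = 4000
factor = original_rate // target_rate  # how many input samples per output sample

# Stand-in signal: one second of dummy samples at the original rate.
samples = list(range(original_rate))
downsampled = naive_decimate(samples, factor)

bytes_per_sample = 2  # assumption: 16-bit mono audio
print(factor)
print(len(samples) * bytes_per_sample)      # bytes before downsampling
print(len(downsampled) * bytes_per_sample)  # bytes after downsampling
```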

Finally, each of the $33,920$ samples of this downsampled soundwave is stored with a certain precision; is it just $1$ byte per sample? $4$ bytes? Are these integer values, or floating point numbers? Are these signed or unsigned numbers?

Let's check:

In [None]:
print(audio.sample_width)

The **sample width** is the number of bytes used to represent each sample. As you can imagine, it determines the sound quality. The wider, the merrier!

Let's try to reduce the sample width to just $1$ byte per sample:

In [None]:
audio_lowres = audio.set_sample_width(1)
player(audio_lowres, zoom=1., title="lowres waveform")

You can still hear what's going on, but there's strong background noise!
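
That background noise is quantization noise: with $1$ byte there are only $256$ possible amplitude levels, so every sample gets rounded to a coarse grid. Here is a minimal sketch of the idea, assuming the original samples are 16-bit signed integers (the actual width is whatever the earlier cell printed). The helpers `to_8_bit` and `back_to_16_bit` are hypothetical and are not meant to mirror how `set_sample_width` is implemented.

```python
def to_8_bit(sample_16):
    """Reduce a 16-bit signed sample to 8 bits by dropping the low byte."""
    return sample_16 >> 8  # arithmetic shift keeps the sign

def back_to_16_bit(sample_8):
    """Scale an 8-bit sample back to the 16-bit range (the lost detail stays lost)."""
    return sample_8 << 8

original = [0, 100, 255, 256, 1000, -1000, 32767, -32768]
restored = [back_to_16_bit(to_8_bit(s)) for s in original]
errors = [o - r for o, r in zip(original, restored)]

print(restored)
print(errors)  # the rounding error introduced for each sample
```

Quiet passages suffer the most: any sample with a magnitude below $256$ collapses to one of just two levels.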

To conclude, there is a trade-off between sound quality and efficiency when we deal with digital audio; however, for an ML project, it might be good to reduce quality whenever needed, if it makes things easier to test!

## Spectrograms

Representing sound as a scalar function over time is not the only possibility.

In fact, it is very common in sound processing to use **spectrograms**. Let's compute one right away! First, we choose some parameters for the spectrogram computation:

In [None]:
device = "cuda"  # change to "cpu" if you don't have a GPU

# NOTE: these parameters are optimized for a sample rate of 44.1 kHz
params = SpectrogramParams(
            sample_rate=44100,
            stereo=False,
            step_size_ms=10,
            min_frequency=20,
            max_frequency=20000,
            num_frequencies=512,
        )

converter = SpectrogramConverter(params=params, device=device)

And here's the computed spectrogram:

In [None]:
audio = audio.set_frame_rate(params.sample_rate)
spectrogram = converter.spectrogram_from_audio(audio)

plt.figure(figsize=(6,5))
plt.imshow(librosa.power_to_db(spectrogram.squeeze()), origin='lower', aspect='auto', interpolation="nearest")
plt.ylabel('Freq. bin')
plt.xlabel('Time frame (10 ms steps)')  # each column spans step_size_ms = 10 ms
plt.title('Original', fontsize=10)

From the image above, we observe a few things:

- The *x*-axis represents **time**.
- The *y*-axis seems to represent musical notes! Look at the yellow arrow-like traces: they go up and down just like the notes in the short music piece.
- There is an empty region on the right, corresponding to *silence*.

The *y*-axis seems to represent musical notes because it actually represents **frequency**, i.e. the oscillation speed of the soundwave. And in fact, according to music theory, **different notes correspond to different frequencies**.
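
As a concrete illustration, in the common twelve-tone equal temperament each semitone multiplies the frequency by $2^{1/12}$, with the note A4 conventionally tuned to $440$ Hz. The sketch below uses MIDI note numbers (A4 is note $69$) purely as a convenient indexing scheme; `note_to_frequency` is an illustrative helper, not something provided by the hackathon code.

```python
def note_to_frequency(midi_note, a4_hz=440.0):
    """Frequency in Hz of a MIDI note number under 12-tone equal temperament."""
    return a4_hz * 2 ** ((midi_note - 69) / 12)

for name, midi_note in [("C4", 60), ("E4", 64), ("G4", 67), ("A4", 69), ("A5", 81)]:
    print(f"{name}: {note_to_frequency(midi_note):.2f} Hz")
```

Note that going up one octave (A4 to A5) exactly doubles the frequency.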

Can we go back from spectrogram to soundwave? Sure:

In [None]:
audio_recovered = converter.audio_from_spectrogram(spectrogram, do_apply_filters=False)
player(audio_recovered, zoom=1., title="recovered waveform")

The spectrogram transform is *invertible*, so we can always get back to a soundwave. At the same time, it may be a better representation for ML pipelines, since it is essentially a 2D image!

Also, it's easier to **manipulate**. For example, you can apply image processing techniques like smoothing, interpolation and deformation, then convert back to the time domain of the soundwave, and listen to what new sounds you created. It's actually pretty fun! 🤪
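
To make the "2D image" view concrete, here is a bare-bones sketch of how a magnitude spectrogram can be built: cut the soundwave into short frames and take the discrete Fourier transform of each one. This is a simplified stand-in (no windowing, no perceptual frequency scale, a slow pure-Python DFT) and not the implementation behind `SpectrogramConverter`; the function name, frame size, and hop size are all assumptions chosen for illustration.

```python
import cmath
import math

def magnitude_spectrogram(samples, frame_size, hop_size):
    """Return a list of columns; each column holds the DFT magnitudes of one frame."""
    columns = []
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frame = samples[start:start + frame_size]
        column = []
        for k in range(frame_size // 2 + 1):  # non-negative frequencies only
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            column.append(abs(coeff))
        columns.append(column)
    return columns

sample_rate = 8000
tone_hz = 1000
samples = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(512)]

spec = magnitude_spectrogram(samples, frame_size=64, hop_size=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])

print(len(spec), len(spec[0]))            # time frames x frequency bins
print(peak_bin * sample_rate / 64, "Hz")  # frequency of the strongest bin
```

Each column is one moment in time and each row is one frequency bin, which is why the result can be treated as an image.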

## Features

In addition to the spectrogram, there are many other **audio features** you can extract! Things like the *spectral centroid*, or the *bandwidth*, or the *zero crossing rate* are informative and can help in ML tasks.

The `librosa` library comes with ready-to-use functions to compute audio features. [Check the docs!](https://librosa.org/doc/0.10.2/feature.html)
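
To give a flavor of what such a feature measures, here is a hand-rolled sketch of the zero crossing rate: the fraction of consecutive sample pairs whose signs differ. It is a simplified illustration with a hypothetical helper name; `librosa` computes this feature frame by frame and its exact conventions may differ.

```python
import math

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs that change sign."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)

sample_rate = 8000
# A small phase offset keeps samples from landing exactly on zero.
low = [math.sin(2 * math.pi * 100 * n / sample_rate + 0.1) for n in range(sample_rate)]
high = [math.sin(2 * math.pi * 1000 * n / sample_rate + 0.1) for n in range(sample_rate)]

print(zero_crossing_rate(low))   # low-pitched tone: few sign changes
print(zero_crossing_rate(high))  # high-pitched tone: many sign changes
```

For a pure tone the rate is roughly twice the frequency divided by the sample rate, so it acts as a cheap proxy for how "high" or "noisy" a sound is.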

## Composing music

Don't worry, you don't have to be a great composer -- you just have to be a good listener! 🎧

Let's load a song:

In [None]:
from modsong import MODSong, Sample

song = MODSong()
song.load_from_file("./demo.mod")

player(song, zoom=1., title="demo song")

The song is already composed for you, but you can change a few things!

For example, here's a list of its instruments (here they are called *samples*, but don't confuse these samples with the soundwave samples we explained above!):

In [None]:
for i in range(len(song.samples)):
    print(f"Sample {i}: {song.samples[i].name}")

As you can see, the song has $31$ samples in total, but not all of them are used! The used ones have a name. For example, sample n. $6$ is called "polysynth", which is some kind of synthetic sound.

Let's have a look at it!

In [None]:
s6 = AudioSegment(data=song.samples[6].waveform, sample_width=2, frame_rate=44100, channels=1)
player(s6, zoom=1., title=f"sample 6: {song.samples[6].name}")

And here's the bassdrum, corresponding to sample $14$:

In [None]:
s14 = AudioSegment(data=song.samples[14].waveform, sample_width=2, frame_rate=44100, channels=1)
player(s14, zoom=1., title=f"sample 14: {song.samples[14].name}")

How about we modify the song, so that it only plays the bassdrum?

In [None]:
song.keep_sample(14)
player(song, zoom=1., title="demo song with bassdrum only")

Note that the song object is *stateful*! We can't recover the samples we removed. So we need to reload the song:

In [None]:
song = MODSong()
song.load_from_file("./demo.mod")

Let's now remove all the drums, and play the song again:

In [None]:
song.remove_sample(12)
song.remove_sample(13)
song.remove_sample(14)

player(song, zoom=1., title="demo song without drums")

Finally, it would be cool to load our own instruments into the song!

Let's load a "clap" sound, and put it where the bassdrum was before:

In [None]:
clap = AudioSegment.from_wav("./clap.wav")
player(clap, zoom=1, title="clap waveform")

It sounds a bit harsh, right? That's because we need to play it at a higher note! Let's put it inside the song to see how it sounds, and then we'll change its pitch:

In [None]:
song.samples[14] = Sample()  # replace the bassdrum with a new sample
song.samples[14].name = "clap"
song.samples[14].waveform = clap.get_array_of_samples()

player(song, zoom=1., title="demo song with claps")

Let's clean it up 🧹 We are going to turn down its volume a bit, and raise its pitch by $12$ semitones:

In [None]:
song.samples[14].volume = 40  # the maximum volume is 64!
song.tune_sample(sample_idx=14, semitone=12)  # raise the pitch by 12 semitones

player(song, zoom=1., title="demo song with claps")
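
Two bits of arithmetic sit behind this cell. Under equal temperament, shifting by $n$ semitones scales the frequency by $2^{n/12}$, so $+12$ semitones is exactly one octave. And if the volume scale is linear with a maximum of $64$ (an assumption here; the comment in the cell only states the maximum), a volume of $40$ corresponds to a gain of roughly $-4$ dB. Both helper names are illustrative and not part of `modsong`.

```python
import math

def semitone_ratio(semitones):
    """Frequency multiplier for a pitch shift of `semitones` (equal temperament)."""
    return 2 ** (semitones / 12)

def volume_to_db(volume, max_volume=64):
    """Gain in decibels, assuming volume maps linearly to amplitude."""
    return 20 * math.log10(volume / max_volume)

print(semitone_ratio(12))            # one octave up
print(round(semitone_ratio(-3), 4))  # three semitones down
print(round(volume_to_db(40), 2))    # volume 40 out of 64
```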

Of course, we can also play with pitches to modify the main melody.

For example, below we *lower* the pitch of the lead melody by $3$ semitones:

In [None]:
song.tune_sample(sample_idx=4, semitone=-3)  # lower the pitch by 3 semitones

player(song, zoom=1., title="demo song out of tune")

The effect is that the main melody is now out of tune with the rest of the song 🎶👎

It seems useless here, but being able to fine-tune the pitch of individual instruments will be important for the main challenge of this hackathon.