# **Sutton Trust Music & Science Workshop**

Instructor: Huw Cheston, PhD researcher @ Centre for Music & Science, University of Cambridge

![ST](https://summerschools.suttontrust.com/wp-content/themes/sutton-trust-summer-programme/assets/img/summer_school_logo.png)

© Huw Cheston 2023, hwc31@cam.ac.uk

# Automatic music transcription using neural networks

![NN](https://magenta.tensorflow.org/assets/transcription-with-transformers/mt3_diagram.png)

If you've ever tried to listen to a recording of a piece of music and transcribe it into notation, you'll know that this can be a time-consuming and difficult process. Neural networks can automate the transcription of many forms of recorded music, and can even work on polyphonic instruments like the piano. By learning from a large number of musical examples, these networks become adept at recognizing notes, chords, and rhythms, essentially learning how to translate sound waves into written symbols, much like translating a language. This technology opens up exciting possibilities for musicians, composers, and musicologists, giving them tools to analyze, recreate, and build upon music that would once have taken many hours to transcribe by hand.


In this workbook, we're going to use an automatic music transcription library called [BasicPitch](https://github.com/spotify/basic-pitch), which was developed by [Spotify](https://www.spotify.com). We'll be working with tracks ripped directly from YouTube, so you won't need to download anything beforehand. You also don't need any programming experience to use this workbook: all the options will be explained as you go.


## Setup

**Before you do anything else**, hit the *Play* button below, next to the **Show code** line. You may need to move your mouse for this to appear. Please let me know if you get any errors when running this! This may take a minute or so to run, and you'll see some code in the window as this happens: you'll need to wait until the wheel stops spinning and is replaced by a green tick before moving on.

In [None]:
# @title
!apt install -y ffmpeg
!sudo apt install -y fluidsynth
!pip install basic-pitch yt-dlp pretty_midi pyfluidsynth

import tensorflow as tf
import basic_pitch.inference as bp
from basic_pitch import ICASSP_2022_MODEL_PATH
import pretty_midi
from pretty_midi.pretty_midi import PrettyMIDI
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Audio
import numpy as np
import collections
import pandas as pd

SAMPLE_RATE = 44100
HOP_LENGTH = 512
BASIC_PITCH_MODEL = tf.saved_model.load(str(ICASSP_2022_MODEL_PATH))

## Run the model

First things first, go to [YouTube](https://youtube.com) and choose a recording you want to work with. We're looking for recordings that contain **solo piano only** – no other instruments allowed!

Aside from this, there are no restrictions on genre here, so choose whatever you think might lead to some interesting results! You could try a classical piano piece, an unaccompanied jazz solo, or an arrangement of a pop song. Once you've found a track, copy the link into the field *yt_link* below. It should look something like https://www.youtube.com/watch?v=NlZ0e5FqZEU

If the track takes a while to start (maybe it has a long intro), you can use the starting_position slider to skip ahead in the track. So, if the music starts 10 seconds into the video, you'd set the slider to 10. The workbook then transcribes the 60 seconds that follow this position.

Next, you'll need to experiment with setting the other parameters. These stand for:

* note_threshold: Minimum energy required for a note to be considered present. If you're finding that you get lots of repeated notes, increase this value!
* minimum_note_length: The minimum allowed note length in milliseconds.
* minimum_freq: Minimum allowed audio frequency, in Hz. You'll probably want to set this fairly low: remember that the low A on a piano is ~27Hz!
* maximum_freq: Maximum allowed audio frequency, in Hz.
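
To get a feel for what these four settings do, here is a toy sketch. It is not how BasicPitch works internally: the candidate notes, their confidence values, and the `keep_note` helper are all made up for illustration. The only standard fact used is the conversion from a MIDI note number to a frequency (A4 = MIDI note 69 = 440 Hz).

```python
# Toy illustration only: this is NOT BasicPitch's real algorithm.
# It just shows the kind of note each of the four settings filters out.

def midi_to_hz(note_number: int) -> float:
    """Convert a MIDI note number to a frequency in Hz (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note_number - 69) / 12)

# Hypothetical candidate notes: (MIDI pitch, confidence from 0 to 1, length in ms)
candidates = [
    (21, 0.95, 400),   # A0, the lowest piano key (27.5 Hz)
    (60, 0.90, 250),   # middle C
    (64, 0.40, 300),   # low confidence: probably a false alarm
    (67, 0.85, 40),    # very short: probably a glitch
    (108, 0.90, 200),  # C8, the highest piano key (about 4186 Hz)
]

def keep_note(pitch, confidence, length_ms,
              note_threshold=0.8, minimum_note_length=100,
              minimum_freq=100, maximum_freq=8300):
    """Return True if a candidate note passes all four settings."""
    freq = midi_to_hz(pitch)
    return (confidence >= note_threshold
            and length_ms >= minimum_note_length
            and minimum_freq <= freq <= maximum_freq)

kept = [c for c in candidates if keep_note(*c)]
print([pitch for pitch, _, _ in kept])  # prints [60, 108]
```

Notice that the low A is thrown away here because 27.5 Hz is below a `minimum_freq` of 100; calling `keep_note(21, 0.95, 400, minimum_freq=20)` would keep it.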

Once you've set all the parameters, hit the big "Play" icon as before and wait a minute for the recording to process.

In [None]:
yt_link = 'https://www.youtube.com/watch?v=X5Sg0WGy9YA' # @param {type:"string"}
starting_position = 1 # @param {type:"slider", min:1, max:100, step:1}
note_threshold = 0.8 # @param {type:"slider", min:0, max:1, step:0.01}
minimum_note_length = 100 # @param {type:"slider", min:0, max:1000, step:10}
minimum_freq = 100 # @param {type:"slider", min:20, max:44100, step:10}
maximum_freq = 8300 # @param {type:"slider", min:100, max:44100, step:100}
if minimum_freq > maximum_freq:
  raise ValueError('Minimum frequency cannot be above maximum frequency!')

end_position = starting_position + 60
!yt-dlp -f "bestaudio" --extract-audio --force-overwrites --audio-format mp3 -o youtube.mp3 --postprocessor-args "-ar 44100" "$yt_link"
!ffmpeg -y -hide_banner -loglevel error -i youtube.mp3 -ss $starting_position -to $end_position -c copy cut.mp3

print('Starting BasicPitch ...')
model_output, midi_data, note_events = bp.predict(
    'cut.mp3',
    BASIC_PITCH_MODEL,
    minimum_frequency=minimum_freq,
    maximum_frequency=maximum_freq,
    onset_threshold=note_threshold,
    frame_threshold=0.3,
    minimum_note_length=minimum_note_length,
    melodia_trick=True,
    multiple_pitch_bends=False
)
print('... done!')

print('↓↓↓ Listen to synthesized MIDI below (may take a second to appear) ↓↓↓')
midi_data.instruments[0].program = 0
midi_data.remove_invalid_notes()
Audio(midi_data.fluidsynth(fs=SAMPLE_RATE), rate=SAMPLE_RATE)

## Create a graph

Once you're happy with how the transcription sounds, you can press the play button on the next cell to create graphs showing the first 100 notes in the recording and the distribution of their pitches and durations.
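
The next cell turns each transcribed note into a note name and a duration before plotting. Here is a minimal sketch of that bookkeeping, using a few made-up notes and plain Python rather than the workbook's own functions (the `notes` list and the `pitch_class` helper are illustrative assumptions):

```python
from collections import Counter

# The twelve note names within one octave, starting from C
NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def pitch_class(midi_pitch: int) -> str:
    """Name of a MIDI pitch without its octave, e.g. 60 (middle C) -> 'C'."""
    return NOTE_NAMES[midi_pitch % 12]

# Made-up notes: (MIDI pitch, start time in s, end time in s)
notes = [(60, 0.0, 0.5), (64, 0.5, 1.0), (67, 1.0, 2.0), (72, 2.0, 2.25)]

# A note's duration is simply its end time minus its start time
durations = [end - start for _, start, end in notes]

# Count how often each note name appears, ignoring the octave
name_counts = Counter(pitch_class(pitch) for pitch, _, _ in notes)

print(durations)         # prints [0.5, 0.5, 1.0, 0.25]
print(name_counts['C'])  # prints 2 (MIDI 60 and 72 are both a C)
```

The histograms are these same two lists, `name_counts` and `durations`, drawn as bars.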

In [None]:
# @title
def midi_to_notes(pm) -> pd.DataFrame:
  instrument = pm.instruments[0]
  notes = collections.defaultdict(list)
  # Sort the notes by start time
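  sorted_notes = sorted(instrument.notes, key=lambda note: note.start)
  # Stop with a readable message if the transcription came back empty
  if not sorted_notes:
    raise ValueError('No notes found in the transcription: try lowering note_threshold.')
  prev_start = sorted_notes[0].start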
  for note in sorted_notes:
    start = note.start
    end = note.end
    notes['pitch'].append(note.pitch)
    notes['start'].append(start)
    notes['end'].append(end)
    notes['step'].append(start - prev_start)
    notes['duration'].append(end - start)
    prev_start = start
  return pd.DataFrame({name: np.array(value) for name, value in notes.items()})

def plot_piano_roll(notes: pd.DataFrame, count = None):
  if count:
    title = f'First {count} notes'
  else:
    title = 'Whole track'
    count = len(notes['pitch'])
  plt.figure(figsize=(10, 4))
  plot_pitch = np.stack([notes['pitch'], notes['pitch']], axis=0)
  plot_start_stop = np.stack([notes['start'], notes['end']], axis=0)
  plt.plot(
      plot_start_stop[:, :count], plot_pitch[:, :count], color="b", marker='.')
  plt.xlabel('Time [s]')
  plt.ylabel('Pitch')
  plt.title(title)

raw_notes = midi_to_notes(midi_data)
get_note_names = np.vectorize(pretty_midi.note_number_to_name)
raw_notes['note_name'] = get_note_names(raw_notes['pitch'])
# Strip the octave number (e.g. 'C#4' -> 'C#') so that notes are grouped by name
raw_notes['note_name'] = raw_notes['note_name'].str.replace(r'-?\d+', '', regex=True)
plot_piano_roll(raw_notes, count=100)

def plot_distributions(notes: pd.DataFrame, drop_percentile=2.5):
  fig, ax = plt.subplots(nrows=1, ncols=2, sharex=False, sharey=False, figsize=(10, 4))
  sns.histplot(notes.sort_values(by='note_name'), x="note_name", bins=12, ax=ax[0])
  ax[0].set(xlabel='Note name')
  max_duration = np.percentile(notes['duration'], 100 - drop_percentile)
  sns.histplot(notes, x="duration", bins=np.linspace(0, max_duration, 12), ax=ax[1])
  ax[1].set(xlabel='Note duration [s]', ylabel='')

plot_distributions(raw_notes.head(100))
plt.show()

## Evaluate the output

Congratulations, you just used a neural network for the first time! How do the results sound? You can try different combinations of parameters (or different videos) by changing the parameters above and pressing the "Play" button once again.
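
Listening is the main test, but a few summary numbers can also help when comparing two runs. The sketch below is one possible way to do this, not part of BasicPitch: `summarise_notes` is a hypothetical helper, and the two runs are made-up examples. In the workbook, you could build similar `(pitch, start, end)` tuples from the `pitch`, `start` and `end` values of each note in `midi_data.instruments[0].notes`.

```python
from statistics import mean

def summarise_notes(notes):
    """Summarise a list of (MIDI pitch, start in s, end in s) tuples."""
    if not notes:
        return {'count': 0}
    pitches = [pitch for pitch, _, _ in notes]
    durations = [end - start for _, start, end in notes]
    return {
        'count': len(notes),
        'lowest_pitch': min(pitches),
        'highest_pitch': max(pitches),
        'mean_duration': mean(durations),
    }

# Made-up results from two runs on the same clip
strict_run = [(60, 0.0, 0.5), (64, 0.5, 1.0)]
lenient_run = [(60, 0.0, 0.5), (61, 0.1, 0.15), (64, 0.5, 1.0), (100, 0.7, 0.75)]

print(summarise_notes(strict_run))
print(summarise_notes(lenient_run))
```

In this invented example, the lenient run has twice as many notes, a much higher top pitch, and a shorter average duration, which hints that it has picked up some very short spurious notes.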

If you can't think of which tracks to use, you can try the following:

*   Classical: https://www.youtube.com/watch?v=o5dL-65mKe0
*   Impressionist: https://www.youtube.com/watch?v=cVMGwPDP-Yk
*   Pop: https://www.youtube.com/watch?v=b3E6E6hYSSI
*   Jazz: https://www.youtube.com/watch?v=X5Sg0WGy9YA

## Discussion questions

1.   Which combination of parameters leads to the best results? Which combination leads to the worst results?
2.   Do parameters that work well for one recording transfer to another? Why (or why not)?
3.   Are the results consistent across different genres? What about different songs?
4.   What might the distribution graphs above tell us about a particular performance, and performance in general?




