<a href="https://colab.research.google.com/github/Juanvr/Dathoven/blob/main/notebooks/2%20-%20Building%20the%20Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building the Dataset

## Downloading the data

In [1]:
import requests

url = 'https://github.com/Juanvr/Dathoven/raw/main/compressed_data/data.zip'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # fail early if the download did not succeed

with open('data.zip', 'wb') as f:
    f.write(r.content)

In [2]:
import zipfile
with zipfile.ZipFile('data.zip', 'r') as zip_ref:
    zip_ref.extractall('data')

## Importing dependencies

My first approach is to use the absolute pitch of the notes to build the model.

In [3]:
import glob, os
import numpy as np
from matplotlib import pyplot as plt

In [4]:
! pip install --upgrade music21

Requirement already up-to-date: music21 in /usr/local/lib/python3.7/dist-packages (6.7.1)


In [5]:
from music21 import converter, corpus, instrument, midi, note, chord, pitch, stream, interval

In [6]:
!apt install fluidsynth
!cp /usr/share/sounds/sf2/FluidR3_GM.sf2 ./font.sf2

Reading package lists... Done
Building dependency tree       
Reading state information... Done
fluidsynth is already the newest version (1.1.9-1).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.


In [7]:
from IPython.display import Audio

## Absolute approach

### Getting the notes

#### From midi to notes and chords

We create a function that reads a midi file and drops the drum events (channel 10), and a second one that turns the resulting stream into a list of note and chord names:


In [8]:
def get_stream_from_midi_without_drums(midi_path):
    mf = midi.MidiFile()
    mf.open(midi_path)
    mf.read()
    mf.close()
    
    for i in range(len(mf.tracks)):
        mf.tracks[i].events = [ev for ev in mf.tracks[i].events if ev.channel != 10]          

    return midi.translate.midiFileToStream(mf)
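
The filter above relies on the General MIDI convention that channel 10 carries percussion. A minimal sketch of the same list comprehension on stand-in objects (the `SimpleNamespace` events and the `without_drums` helper are hypothetical, not music21 classes) suggests how it behaves: events on other channels, and events with no channel at all, are kept.

```python
from types import SimpleNamespace

DRUM_CHANNEL = 10  # assumed: General MIDI percussion channel, counted from 1

# Hypothetical stand-ins for midi events; real music21 events have more fields.
events = [
    SimpleNamespace(kind="NOTE_ON", channel=1),
    SimpleNamespace(kind="NOTE_ON", channel=10),   # drum hit
    SimpleNamespace(kind="DELTA_TIME", channel=None),
    SimpleNamespace(kind="NOTE_OFF", channel=10),  # drum release
]

def without_drums(events):
    """Return the events that are not on the percussion channel."""
    return [ev for ev in events if ev.channel != DRUM_CHANNEL]

kept = without_drums(events)
print([ev.kind for ev in kept])  # ['NOTE_ON', 'DELTA_TIME']
```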
        

In [9]:
def stream_to_array_of_notes_strings (stream):
    result = []
    for element in stream.flat.notes:
        stringRepresentationOfElement = ''
        if isinstance(element, note.Note):
            stringRepresentationOfElement = element.nameWithOctave
        else: # it's a chord
            # `n` avoids reusing the name of the imported `note` module
            listOfNotesWithOctaves = [n.nameWithOctave for n in element.notes]
            stringRepresentationOfElement = ' '.join(listOfNotesWithOctaves)
        result.append(stringRepresentationOfElement)
    return result

In [10]:
def from_midi_to_array_of_notes (midi_path):
    return stream_to_array_of_notes_strings(get_stream_from_midi_without_drums(midi_path))

#### From notes and chords to midi

We create the function that takes an array of notes and chords and creates a midi file:

In [11]:
def from_array_of_elements_to_midi ( elements, midi_path ):
    streamResult = stream.Stream()
    for element in elements:
        if ' ' not in element:
            streamResult.append(note.Note(element))
        else:
            streamResult.append(chord.Chord(element))
    
    streamResult.write('midi', fp=midi_path)
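
The convention used above, where a token containing a space is a chord and any other token is a single note, can be tried without music21. The sketch below is only an illustration of that convention; `parse_token` is a hypothetical helper, not part of the notebook.

```python
def parse_token(token):
    """Classify a token of the string notation as a note or a chord."""
    names = token.split(' ')
    if len(names) == 1:
        return ('note', names[0])
    return ('chord', names)

tokens = ["C4", "E4 G4 C5", "F#5"]
parsed = [parse_token(token) for token in tokens]
print(parsed)
# [('note', 'C4'), ('chord', ['E4', 'G4', 'C5']), ('note', 'F#5')]
```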

In [12]:
from_array_of_elements_to_midi(["C4", "D4", "E4", "F4", "G4"], "test1.mid")

Let's hear it:

In [13]:
!fluidsynth -ni font.sf2 test1.mid -F output.wav -r 44100

FluidSynth version 1.1.9
Copyright (C) 2000-2018 Peter Hanappe and others.
Distributed under the LGPL license.
SoundFont(R) is a registered trademark of E-mu Systems, Inc.

Rendering audio to file 'output.wav'..


In [14]:
Audio('output.wav')

Let's check that we can convert to our notation and back to midi:

In [15]:
url = 'https://github.com/Juanvr/Dathoven/raw/main/examples/silent_night_easy.mid'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # fail early if the download did not succeed

with open('silent_night_easy.mid', 'wb') as f:
    f.write(r.content)

In [16]:
stream_silent_night = get_stream_from_midi_without_drums("silent_night_easy.mid")

In [17]:
from_array_of_elements_to_midi(stream_to_array_of_notes_strings(stream_silent_night), "silent_night.mid")

In [18]:
!fluidsynth -ni font.sf2 silent_night.mid -F output.wav -r 44100

FluidSynth version 1.1.9
Copyright (C) 2000-2018 Peter Hanappe and others.
Distributed under the LGPL license.
SoundFont(R) is a registered trademark of E-mu Systems, Inc.

Rendering audio to file 'output.wav'..


In [19]:
Audio('output.wav')

As we can hear, it's a version of Silent Night in which every note has the same duration and the notes are evenly spaced. This is expected: our string notation keeps only the pitches and discards the timing.
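
A small sketch of why the timing disappears: once only the note names are kept, two different rhythms over the same pitches become indistinguishable. The note lists and the `to_tokens` helper are hypothetical, made up for this illustration.

```python
# Hypothetical (name, duration in quarter notes) pairs for two different rhythms.
slow = [("G4", 1.5), ("A4", 0.5), ("G4", 1.0), ("E4", 3.0)]
fast = [("G4", 0.5), ("A4", 0.5), ("G4", 0.5), ("E4", 0.5)]

def to_tokens(notes):
    """Keep only the pitch names, as the string notation does."""
    return [name for name, _duration in notes]

print(to_tokens(slow))                     # ['G4', 'A4', 'G4', 'E4']
print(to_tokens(slow) == to_tokens(fast))  # True
```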

#### Iterating through song data folder

We create a function that goes through the midi files matching a glob pattern (for example, every `.mid` file in a folder) and reads the notes and chords of each one:

In [20]:
def get_folder_songs(folder_path):
    songs = []
    for file in glob.glob(folder_path):
        songs.append(from_midi_to_array_of_notes(file))
    return songs

In [21]:
# Test
songs = get_folder_songs("data/*.mid")  # This step takes a while
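
Two details may matter when iterating over many files: `glob.glob` does not guarantee any particular order, and a single unreadable file stops the whole loop. The sketch below is one possible variant, not the notebook's method; `load_songs` and `fake_parse` are hypothetical names, and the parser is passed in as an argument so that the example runs without music21.

```python
import glob
import os
import tempfile

def load_songs(pattern, parse):
    """Parse every file matching `pattern`, collecting failures instead of stopping."""
    parsed, failed = [], []
    for path in sorted(glob.glob(pattern)):  # sorted: same order on every run
        try:
            parsed.append(parse(path))
        except Exception as error:  # broad on purpose: parsers can fail in many ways
            failed.append((path, error))
    return parsed, failed

def fake_parse(path):
    """Stand-in parser: rejects empty files, otherwise returns the file name."""
    if os.path.getsize(path) == 0:
        raise ValueError("empty file")
    return os.path.basename(path)

with tempfile.TemporaryDirectory() as folder:
    for name, content in [("b.mid", b"x"), ("a.mid", b"x"), ("c.mid", b"")]:
        with open(os.path.join(folder, name), "wb") as handle:
            handle.write(content)
    songs_demo, failed_demo = load_songs(os.path.join(folder, "*.mid"), fake_parse)

print(songs_demo)        # ['a.mid', 'b.mid']
print(len(failed_demo))  # 1
```

In the notebook, the `parse` argument would be a function such as `from_midi_to_array_of_notes`.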

In [22]:
songs[:1]

[['B4',
  'B2',
  'B2',
  'F#5',
  'B2',
  'G#4',
  'E2',
  'E2',
  'E2',
  'B4',
  'B2',
  'B2',
  'B5',
  'F#5',
  'B2',
  'G#4',
  'E2',
  'E2',
  'E2',
  'B4',
  'B2',
  'B2',
  'F#5',
  'B2',
  'G#4',
  'E2',
  'E2',
  'E2',
  'E-5',
  'B2',
  'B2',
  'F#5',
  'B4',
  'B2',
  'G#4',
  'E2',
  'E2',
  'E2',
  'B4',
  'B2',
  'B2',
  'F#5',
  'B2',
  'G#4',
  'E2',
  'E2',
  'E2',
  'B4',
  'B2',
  'B2',
  'B5',
  'F#5',
  'B2',
  'G#4',
  'E2',
  'E2',
  'E2',
  'B4',
  'B2',
  'B2',
  'F#5',
  'B2',
  'G#4',
  'E2',
  'E2',
  'E2',
  'E-5',
  'B2',
  'B2',
  'F#5',
  'B4',
  'B2',
  'G#4',
  'E2',
  'E2',
  'E2']]

Up to this point we have a list of all the songs. For each song we have a list of all the notes and chords that are part of the song.

We save the resulting list in a pickle for future use:

In [23]:
# Save into a pickle file.
import pickle

with open("songs_notes.p", "wb") as f:  # the context manager closes the file
    pickle.dump(songs, f)
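
For later use, the list can be read back with `pickle.load`. The round trip below is a minimal sketch on made-up data written to a temporary folder, so it does not touch the real `songs_notes.p`; the variable names are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical data with the same shape as `songs`: a list of lists of strings.
songs_example = [["B4", "B2", "F#5"], ["C4", "E4 G4 C5"]]

path = os.path.join(tempfile.mkdtemp(), "songs_example.p")
with open(path, "wb") as handle:
    pickle.dump(songs_example, handle)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == songs_example)  # True
```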

### Getting the notes with offset and duration

In [24]:
def stream_to_array_of_pitches_with_time (stream):
    result = []
    offsets = set()  # time positions that already hold a note
    for item in stream.flat.notes:
        if isinstance(item, note.Note):
            element = item
        else: # it's a chord: keep only its first note
            element = item.notes[0]

        resultElement = {
            # Timing is read from the stream item itself: the notes inside a
            # chord are not placed in the stream, so their own offset may not
            # reflect the chord's position.
            'absolute_offset': item.offset,
            'pitch': element.pitch.ps,
            'name': element.nameWithOctave,
            'duration': item.duration.quarterLength,
        }

        # Only keep one note per time position, and skip zero-length notes.
        if resultElement['absolute_offset'] not in offsets and resultElement['duration'] != 0:
            result.append(resultElement)
            offsets.add(resultElement['absolute_offset'])
    result.sort(key=lambda x: x['absolute_offset'])
    return result

In [25]:
def from_midi_to_array_of_pitches_with_time (midi_path):
    return stream_to_array_of_pitches_with_time(get_stream_from_midi_without_drums(midi_path))

In [26]:
def get_folder_songs_with_time(folder_path):
    songs = []
    for file in glob.glob(folder_path):
        songs.append(from_midi_to_array_of_pitches_with_time(file))
    return songs

In [27]:
songs_with_time = get_folder_songs_with_time("data/*.mid")  # This step takes a while

In [28]:
songs_with_time[:1]

[[{'absolute_offset': 0.0, 'duration': 3.0, 'name': 'B4', 'pitch': 71.0},
  {'absolute_offset': 1.5, 'duration': 0.5, 'name': 'B2', 'pitch': 47.0},
  {'absolute_offset': 3.0, 'duration': 1.0, 'name': 'F#5', 'pitch': 78.0},
  {'absolute_offset': 4.0, 'duration': 2.0, 'name': 'G#4', 'pitch': 68.0},
  {'absolute_offset': 5.5, 'duration': 0.5, 'name': 'E2', 'pitch': 40.0},
  {'absolute_offset': 7.0, 'duration': 0.5, 'name': 'E2', 'pitch': 40.0},
  {'absolute_offset': 8.0, 'duration': 2.0, 'name': 'B4', 'pitch': 71.0},
  {'absolute_offset': 9.5, 'duration': 0.5, 'name': 'B2', 'pitch': 47.0},
  {'absolute_offset': 10.0, 'duration': 0.5, 'name': 'B5', 'pitch': 83.0},
  {'absolute_offset': 11.0, 'duration': 0.5, 'name': 'F#5', 'pitch': 78.0},
  {'absolute_offset': 12.0, 'duration': 2.0, 'name': 'G#4', 'pitch': 68.0},
  {'absolute_offset': 13.5, 'duration': 0.5, 'name': 'E2', 'pitch': 40.0},
  {'absolute_offset': 15.0, 'duration': 0.5, 'name': 'E2', 'pitch': 40.0},
  {'absolute_offset': 16.0, '

For each song we have a list of all the notes that are part of the song, their time position and their duration. We save it in a pickle file.

In [29]:
with open("songs_notes_with_time.p", "wb") as f:
    pickle.dump(songs_with_time, f)
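
The filtering rule inside `stream_to_array_of_pitches_with_time` (one note per time position, no zero-length notes) can be tried in isolation on plain dictionaries. This is a sketch on made-up input; `keep_first_per_offset` and `notes_demo` are hypothetical names.

```python
def keep_first_per_offset(notes):
    """Keep the first note found at each offset and drop zero-length notes."""
    seen_offsets = set()
    kept = []
    for current in notes:
        offset = current['absolute_offset']
        if offset in seen_offsets or current['duration'] == 0:
            continue
        seen_offsets.add(offset)
        kept.append(current)
    return sorted(kept, key=lambda n: n['absolute_offset'])

# Hypothetical input: two notes share offset 0.0 and one note has no duration.
notes_demo = [
    {'absolute_offset': 0.0, 'duration': 3.0, 'name': 'B4', 'pitch': 71.0},
    {'absolute_offset': 0.0, 'duration': 1.0, 'name': 'B2', 'pitch': 47.0},
    {'absolute_offset': 1.0, 'duration': 0.0, 'name': 'E2', 'pitch': 40.0},
    {'absolute_offset': 1.5, 'duration': 0.5, 'name': 'B2', 'pitch': 47.0},
]
print([n['name'] for n in keep_first_per_offset(notes_demo)])  # ['B4', 'B2']
```

The result is a single melodic line, which is simpler to model but loses any notes played at the same time.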

### Getting the notes by track

In [30]:
def get_streams_from_midi_without_drums(midi_path):
    mf = midi.MidiFile()
    mf.open(midi_path)
    mf.read()
    mf.close()
    
    streams = []
    for track in mf.tracks:
        if (len(track.events) >0):
            track.events = [ev for ev in track.events if ev.channel != 10]
            if (len(track.events) > 0):
                streams.append(midi.translate.midiTrackToStream(track))          
    return streams

In [31]:
def from_midi_to_array_of_tracks(midi_path):
    streams = get_streams_from_midi_without_drums(midi_path)
    result = [stream_to_array_of_pitches_with_time(s) for s in streams if len(s.flat.notes) > 0]
    return result

In [32]:
def get_folder_songs_with_tracks(folder_path):
    songs = []
    for file in glob.glob(folder_path):
        songs.append(from_midi_to_array_of_tracks(file))
    return songs

In [33]:
songs_with_tracks = get_folder_songs_with_tracks("data/*.mid")  # This step takes a while


In [34]:
songs_with_tracks[100:101]

[[[{'absolute_offset': 0.0, 'duration': 1.0, 'name': 'E3', 'pitch': 52.0}],
  [{'absolute_offset': 0.0, 'duration': 1.0, 'name': 'B3', 'pitch': 59.0},
   {'absolute_offset': Fraction(5, 3),
    'duration': 0.25,
    'name': 'B3',
    'pitch': 59.0},
   {'absolute_offset': 2.25, 'duration': 0.25, 'name': 'A3', 'pitch': 57.0},
   {'absolute_offset': 3.0, 'duration': 0.25, 'name': 'B3', 'pitch': 59.0},
   {'absolute_offset': Fraction(11, 3),
    'duration': 0.25,
    'name': 'A3',
    'pitch': 57.0},
   {'absolute_offset': 4.5, 'duration': 0.25, 'name': 'G4', 'pitch': 67.0},
   {'absolute_offset': 4.75, 'duration': 0.25, 'name': 'F#4', 'pitch': 66.0},
   {'absolute_offset': 5.0, 'duration': 0.25, 'name': 'F#4', 'pitch': 66.0},
   {'absolute_offset': 5.25, 'duration': 0.25, 'name': 'D4', 'pitch': 62.0},
   {'absolute_offset': Fraction(16, 3),
    'duration': 0.25,
    'name': 'F#4',
    'pitch': 66.0},
   {'absolute_offset': 5.5, 'duration': 0.25, 'name': 'G4', 'pitch': 67.0},
   {'absolut

We have a list of songs. For each song we have a list of tracks, and for each track we have a list of notes with their offsets and durations.

In [35]:
with open("songs_notes_with_tracks.p", "wb") as f:
    pickle.dump(songs_with_tracks, f)
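
The output above mixes plain floats with `Fraction` offsets such as `Fraction(5, 3)`. The sketch below, which uses only the standard library, shows how the two types interact in arithmetic and one possible way to normalize them; converting to `float` is an assumption about what a later model may need, not something the notebook does.

```python
from fractions import Fraction

# Offsets copied from the output above: exact thirds next to plain floats.
offsets = [0.0, Fraction(5, 3), 2.25, 3.0, Fraction(11, 3)]

# Fraction minus Fraction stays exact ...
exact_gap = Fraction(11, 3) - Fraction(5, 3)
# ... but mixing a float in falls back to floating point.
mixed_gap = 2.25 - Fraction(5, 3)

print(exact_gap)                 # 2
print(type(mixed_gap).__name__)  # float

# One hypothetical normalization: convert every offset to float.
as_floats = [float(offset) for offset in offsets]
print(all(isinstance(offset, float) for offset in as_floats))  # True
```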

## Incremental approach

### Getting the intervals

In [36]:
songs_with_time[:1]

[[{'absolute_offset': 0.0, 'duration': 3.0, 'name': 'B4', 'pitch': 71.0},
  {'absolute_offset': 1.5, 'duration': 0.5, 'name': 'B2', 'pitch': 47.0},
  {'absolute_offset': 3.0, 'duration': 1.0, 'name': 'F#5', 'pitch': 78.0},
  {'absolute_offset': 4.0, 'duration': 2.0, 'name': 'G#4', 'pitch': 68.0},
  {'absolute_offset': 5.5, 'duration': 0.5, 'name': 'E2', 'pitch': 40.0},
  {'absolute_offset': 7.0, 'duration': 0.5, 'name': 'E2', 'pitch': 40.0},
  {'absolute_offset': 8.0, 'duration': 2.0, 'name': 'B4', 'pitch': 71.0},
  {'absolute_offset': 9.5, 'duration': 0.5, 'name': 'B2', 'pitch': 47.0},
  {'absolute_offset': 10.0, 'duration': 0.5, 'name': 'B5', 'pitch': 83.0},
  {'absolute_offset': 11.0, 'duration': 0.5, 'name': 'F#5', 'pitch': 78.0},
  {'absolute_offset': 12.0, 'duration': 2.0, 'name': 'G#4', 'pitch': 68.0},
  {'absolute_offset': 13.5, 'duration': 0.5, 'name': 'E2', 'pitch': 40.0},
  {'absolute_offset': 15.0, 'duration': 0.5, 'name': 'E2', 'pitch': 40.0},
  {'absolute_offset': 16.0, '

In [37]:
def from_pitches_to_intervals_with_time (array_of_pitches_with_time):
    intervals_with_time = []
    # Start at 1 so that every note, including the last one, is paired with its predecessor.
    for i in range(1, len(array_of_pitches_with_time)):

        first_element = array_of_pitches_with_time[i-1]
        second_element = array_of_pitches_with_time[i]
        resultElement = {
            'relative_offset': second_element['absolute_offset'] - first_element['absolute_offset'],
            'interval': second_element['pitch'] - first_element['pitch'],
            'duration': second_element['duration'], 
            'name': second_element['name'], 
            #'element': second_element['element']
        }
        intervals_with_time.append(resultElement)
    return intervals_with_time
  

In [38]:
songs_intervals = [from_pitches_to_intervals_with_time(song) for song in songs_with_time]

In [39]:
songs_intervals[:1]

[[{'duration': 0.5, 'interval': -24.0, 'name': 'B2', 'relative_offset': 1.5},
  {'duration': 1.0, 'interval': 31.0, 'name': 'F#5', 'relative_offset': 1.5},
  {'duration': 2.0, 'interval': -10.0, 'name': 'G#4', 'relative_offset': 1.0},
  {'duration': 0.5, 'interval': -28.0, 'name': 'E2', 'relative_offset': 1.5},
  {'duration': 0.5, 'interval': 0.0, 'name': 'E2', 'relative_offset': 1.5},
  {'duration': 2.0, 'interval': 31.0, 'name': 'B4', 'relative_offset': 1.0},
  {'duration': 0.5, 'interval': -24.0, 'name': 'B2', 'relative_offset': 1.5},
  {'duration': 0.5, 'interval': 36.0, 'name': 'B5', 'relative_offset': 0.5},
  {'duration': 0.5, 'interval': -5.0, 'name': 'F#5', 'relative_offset': 1.0},
  {'duration': 2.0, 'interval': -10.0, 'name': 'G#4', 'relative_offset': 1.0},
  {'duration': 0.5, 'interval': -28.0, 'name': 'E2', 'relative_offset': 1.5},
  {'duration': 0.5, 'interval': 0.0, 'name': 'E2', 'relative_offset': 1.5},
  {'duration': 2.0, 'interval': 31.0, 'name': 'B4', 'relative_offset

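We have a list of songs. For each song we have a list of the intervals between consecutive notes: the pitch difference in semitones and the time elapsed since the previous note, together with the duration and name of the later note.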

In [40]:
with open("songs_intervals.p", "wb") as f:
    pickle.dump(songs_intervals, f)
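
As a worked example, the first three notes of the first song shown above (B4, B2, F#5) give two intervals. The sketch below recomputes them with a compact helper that pairs each note with its predecessor; `pairwise_intervals` is a hypothetical name used only for this illustration.

```python
# First three notes of the first song, copied from the output above.
first_notes = [
    {'absolute_offset': 0.0, 'duration': 3.0, 'name': 'B4', 'pitch': 71.0},
    {'absolute_offset': 1.5, 'duration': 0.5, 'name': 'B2', 'pitch': 47.0},
    {'absolute_offset': 3.0, 'duration': 1.0, 'name': 'F#5', 'pitch': 78.0},
]

def pairwise_intervals(notes):
    """Describe each note relative to the note that precedes it."""
    return [
        {
            'relative_offset': current['absolute_offset'] - previous['absolute_offset'],
            'interval': current['pitch'] - previous['pitch'],
            'duration': current['duration'],
            'name': current['name'],
        }
        for previous, current in zip(notes, notes[1:])
    ]

for step in pairwise_intervals(first_notes):
    print(step['name'], step['interval'], step['relative_offset'])
# B2 -24.0 1.5
# F#5 31.0 1.5
```

B4 to B2 is two octaves down (-24 semitones), and B2 to F#5 is 31 semitones up, which matches the first two entries of `songs_intervals`.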

### Getting the intervals by track

In [41]:
def get_intervals_from_song(song):
    return [from_pitches_to_intervals_with_time(track) for track in song if len(track) > 2]

In [42]:
songs_intervals_by_track = [get_intervals_from_song(song) for song in songs_with_tracks]

In [43]:
songs_intervals_by_track[:1]

[[[{'duration': 0.25, 'interval': 7.0, 'name': 'F#5', 'relative_offset': 0.25},
   {'duration': 0.25,
    'interval': -10.0,
    'name': 'G#4',
    'relative_offset': 0.08333333333333331},
   {'duration': 0.25,
    'interval': 3.0,
    'name': 'B4',
    'relative_offset': 0.4166666666666667},
   {'duration': 0.25, 'interval': 12.0, 'name': 'B5', 'relative_offset': 0.25},
   {'duration': 0.25, 'interval': -12.0, 'name': 'B4', 'relative_offset': 0.5},
   {'duration': 0.25, 'interval': 7.0, 'name': 'F#5', 'relative_offset': 0.25},
   {'duration': 0.25, 'interval': -3.0, 'name': 'E-5', 'relative_offset': 0.5},
   {'duration': 0.25, 'interval': 3.0, 'name': 'F#5', 'relative_offset': 0.25},
   {'duration': 0.25,
    'interval': -10.0,
    'name': 'G#4',
    'relative_offset': 0.16666666666666652},
   {'duration': 0.25,
    'interval': 3.0,
    'name': 'B4',
    'relative_offset': 0.3333333333333335},
   {'duration': 0.25, 'interval': 7.0, 'name': 'F#5', 'relative_offset': 0.25},
   {'duratio

We have a list of songs. For each song we have a list of tracks. For each track we have a list of intervals that tell us the variations between the notes on each track. We save it to a pickle: 

In [44]:
with open("songs_intervals_by_track.p", "wb") as f:
    pickle.dump(songs_intervals_by_track, f)
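
To turn a sequence of intervals back into absolute pitches, a starting pitch is needed; the rest follows from a running sum. The sketch below assumes the starting pitch is known and uses a hypothetical helper, `intervals_to_pitches`, which is not part of the notebook.

```python
from itertools import accumulate

def intervals_to_pitches(first_pitch, intervals):
    """Rebuild absolute pitches by summing the intervals from a starting pitch."""
    return list(accumulate([first_pitch] + list(intervals)))

# Values taken from the first song shown above: B4 (71.0), then -24, +31 and -10 semitones.
pitches = intervals_to_pitches(71.0, [-24.0, 31.0, -10.0])
print(pitches)  # [71.0, 47.0, 78.0, 68.0]
```

The same running sum applied to `relative_offset` would recover the absolute time positions.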