# Assignment 2 - Convolutions with MIDI

In this assignment, you're going to play around with the MIDI notebook we've been building in class.

The code should run on mltgpu at the time of submission, but you do not need to use the GPU for this assignment.  (You can if you want to.)

When testing on your own machine, in addition to the full PyTorch stack, you'll need the mido module.  Installing the scamp module is necessary if you want to listen to anything. If using Linux, you will have to install fluidsynth.

You will use the [lakh](https://colinraffel.com/projects/lmd/) MIDI corpus.  A copy will be placed in the scratch directory of mltgpu; information will be provided via Canvas announcement.

This assignment is due on November 1, 2022, at 23:59.  There are **25 points** and **29 bonus points** (!!!) available on this assignment.

In [1]:
import sys
import os
import mido
import glob
import random

In [2]:
from mido import MidiFile, KeySignatureError, Message
import os
import sys
from torch.utils.data import Dataset, DataLoader

## Part 1 -- improve data handling and representation (4 points)

Here you will take the `MessageSequence` we created in class and make the following improvements:

1. Change the representation so that it can accommodate start and end symbols, as appropriate for your modeling in part 2.

2. Allow for the loading of multiple channels (2 or more, possibly randomly selected), with a reasonable cutoff.  To make things simple, you can make the very wrong assumption that every note is of the same duration and therefore aligned one-by-one, and you can thus ignore duration and offset information.

In [3]:
from collections import defaultdict
from dataclasses import dataclass
import torch
import torch.nn.functional as functional
from pprint import pprint


class MIDITrackError(Exception):
    pass


@dataclass
class Note:
    note: int
    timestamp: int
    offset: int
    duration: int


class MessageSequence:
    START_NOTE = Note(128, 0, 0, 0)
    END_NOTE = Note(129, 0, 0, 0)
    PADDING_NOTE = Note(130, 130, 130, 130)
    NOTE_DB = functional.one_hot(torch.arange(0, 128 + 3))

    def __init__(self, mid: MidiFile, number_of_channels: int = 1, random_channels: bool = True):
        messages = self._collect_notes(mid)

        if len(messages) < number_of_channels:
            raise MIDITrackError(f'File has only {len(messages)} channels')

        if random_channels:
            messages = {key: messages[key] for key in random.sample(list(messages.keys()), k=number_of_channels)}
        else:
            messages = {key: messages[key] for key in list(messages.keys())[:number_of_channels]}

        self.sequences = self._calculate_note_durations(messages)

    def _collect_notes(self, mid):
        messages: defaultdict[list[Message]] = defaultdict(list)
        try:
            merged_track = mido.merge_tracks(mid.tracks)

            for i in merged_track:
                if i.type in ['note_on', 'note_off']:
                    messages[i.channel].append(i)
        except IndexError as e:
            raise MIDITrackError from e
        except TypeError as e:
            raise MIDITrackError from e

        # delete channels with unequal number of note_on and note_off
        for channel in list(messages.keys()):
            note_ons = 0
            note_offs = 0
            for note in messages[channel]:
                if note.type == 'note_on':
                    note_ons += 1
                elif note.type == 'note_off':
                    note_offs += 1

            if note_offs != note_ons:
                del messages[channel]
        return messages

    def _calculate_note_durations(self, messages):
        lengths = list(sorted([(channel, len(messages[channel])) for channel in messages], key=lambda x: x[1]))
        max_length_channel, max_length = lengths[-1]

        sequences = {}
        timecounter = 0
        notedict = {}
        real_sequence = []
        for message in messages[max_length_channel]:
            timecounter += message.time
            if message.type == "note_on":
                notedict[message.note] = timecounter

            if message.type == "note_off":
                if message.note not in notedict:
                    raise MIDITrackError(f'Note {message.note} turned off before turned on')

                duration = timecounter - notedict[message.note]
                real_sequence.append(Note(
                    note=message.note,
                    timestamp=notedict[message.note],
                    offset=message.time,
                    duration=duration
                ))

        real_sequence.insert(0, self.START_NOTE)
        real_sequence.append(self.END_NOTE)
        sequences[max_length_channel] = real_sequence

        for channel in messages:
            if channel == max_length_channel:
                continue

            timecounter = 0
            notedict = {}
            real_sequence = []
            for index, message in enumerate(list(filter(lambda x: x.type == 'note_on', messages[channel]))):
                reference_message = sequences[max_length_channel][index + 1]

                real_sequence.append(Note(
                    note=message.note,
                    timestamp=reference_message.timestamp,
                    offset=reference_message.offset,
                    duration=reference_message.duration,
                ))

            real_sequence.extend([self.PADDING_NOTE] * (int(max_length / 2) - len(real_sequence)))
            real_sequence.insert(0, self.START_NOTE)
            real_sequence.append(self.END_NOTE)
            sequences[channel] = real_sequence

        return sequences

    def midi_reencode(self):
        reencoded = []
        for channel in self.sequences:
            active_notes = {}
            timecounter = 0
            for note in self.sequences[channel]:
                note_order = sorted(active_notes.keys(), key=lambda x: active_notes[x][0])  # timestamp is tuple item 0
                for active_note in note_order:
                    if active_notes[active_note][0] < note.timestamp:
                        reencoded.append(mido.Message("note_off",
                                                      channel=channel,
                                                      note=active_note,
                                                      velocity=95,
                                                      time=active_notes[active_note][1]))
                        timecounter += active_notes[active_note][1]
                        del active_notes[active_note]
                reencoded.append(mido.Message("note_on",
                                              channel=channel,
                                              note=note.note,
                                              velocity=95,
                                              time=note.timestamp-timecounter))
                active_notes[note] = (note.timestamp+note.duration, note.offset, note.duration)
                timecounter = note.timestamp

            note_order = sorted(active_notes.keys(), key=lambda x: active_notes[x][0])  # timestamp is tuple item 0
            for active_note in note_order:
                reencoded.append(mido.Message("note_off",
                                              channel=channel,
                                              note=active_note,
                                              velocity=95,
                                              time=active_notes[active_note][1]))
                del active_notes[active_note]

        return reencoded

    def vector_encode(self):
        channels = list(self.sequences.values())
        encoded = [[] for _ in range(len(channels[0]))]
        for channel in channels:
            for note_index, note in enumerate(channel):
                encoded[note_index].append(self.encode_note(note))
        return encoded

    def encode_note(self, note):
        offset = note.offset
        duration = note.duration
        note_value = note.note

        # note_vec = self.NOTE_DB[note.note].clone().detach()

        if offset > 100:
            offset = 100
        if duration > 4000:
            duration = 4000

        offset = offset/100
        duration = duration/4000
        note_value = note_value/131

        return {
            'note': torch.Tensor([note.note]),
            'note_vector': torch.Tensor([note_value]),
            'offset': torch.Tensor([offset]),
            'duration': torch.Tensor([duration])
        }


In [4]:
file = MidiFile('data/lmd_full/8/834d6b48ccaaa7077f18c566202dbf8f.mid')
seq = MessageSequence(file, 4, False)

### Describe your changes and any special motivations for them here (in notebook Markdown):

I found that tracks in the MIDI files contained sometimes multiple channels, while some channels were spread over multiple tracks. To simplify, I merged all tracks into one and extracted the channels. During the extraction of the channels, I also excluded files that were not fitting the simplifications or the model (e.g. files with no 'note_off' or different numbers of 'note_on' and 'note_off').

In the next step, I identified the channel with the most notes, and then aligned the other channels along this reference channel. The durations and offsets are therefore based on the reference channel and the same for all channels. This, of course is a simplification and breaks the music. The missing notes in the end of a shorter channel are filled with PADDING_NOTES. After that, all channels are prefixed with a START_NOTE and suffixed with and END_NOTE.

For the vector encoding, I decided to not use a one-hot encoding for the notes, but rely on a single float as for the offset and the duration. The reason for that is that I wanted to use these three values as channels in the CNN layer. Therefore, they needed to have the same size.

## Part 2 - Convolutional Model (8 points)

Replace the model below with a model with the following characteristics:

1. It should include an ensemble of parallel 1D-convolutional layers (2 or more)
2. The layers should combine into a single output representation.
3. The layers should be different (have different kernels, windows, or strides).
4. The layers should be able to handle multiple channels. 
5. The input will be the song representation up to time step n, and the output will be a representation of notes for a single time step across the channels at n+1. (This means that an instance will be prediction of the next note, and a song will have to be run n times to predict n characters.)

Training the model will take longer than the n-gram model, especially if you're not using the GPU.

You have a free hand in all other aspects of the model, as long as you explain any significant design decisions (i.e., not every minor choice, but ones with real design impact).

(A bit of advice: the biggest problem here will be keeping the matrix/tensor dimensions straight...)

In [5]:
import torch.nn as nn

In [6]:
import torch

In [7]:
import torch.optim as optim
from torch.nn.functional import pad

In [8]:
class MIDIModel(nn.Module):
    def __init__(self, dropout_prob=0.2):
        super().__init__()

        self.song_length = 4000

        self.dropout = nn.Dropout(dropout_prob)

        self.conv1 = nn.Conv1d(3, 3, kernel_size=2, stride=1)
        self.conv2 = nn.Conv1d(3, 3, kernel_size=4, stride=1)
        self.conv3 = nn.Conv1d(3, 3, kernel_size=10, stride=1)
        self.conv4 = nn.Conv1d(3, 3, kernel_size=10, stride=10)

        combined_length = 12387

        self.linear_sequential = nn.Sequential(
            nn.Linear(combined_length, int(combined_length / 2)),
            nn.Dropout(dropout_prob),
            nn.Sigmoid(),
            nn.Linear(int(combined_length / 2), int(combined_length / 4)),
            nn.Dropout(dropout_prob),
            nn.Sigmoid(),
            nn.Linear(int(combined_length / 4), int(combined_length / 8)),
            nn.Sigmoid(),
        )

        self.linear_notes = nn.Linear(int(combined_length / 8), 131)
        self.linear_offset = nn.Linear(int(combined_length / 8), 1)
        self.linear_duration = nn.Linear(int(combined_length / 8), 1)

        self.logsoftmax = nn.LogSoftmax(dim=0)
        self.sigmoid_offset = nn.Sigmoid()
        self.sigmoid_duration = nn.Sigmoid()

    def forward(self, data):
        # Tensor (batch, CNN channel, MIDI channel, song length) -> (MIDI channel, batch, CNN channel, song length)
        by_midi_channel = data.permute(2, 0, 1, 3)
        padded = pad(by_midi_channel, (0, self.song_length - by_midi_channel.size(3)), value=130)

        note_outputs = []
        offset_outputs = []
        duration_outputs = []
        for channel in padded:
            conv1 = self.dropout(self.conv1(channel))
            conv2 = self.dropout(self.conv2(channel))
            conv3 = self.dropout(self.conv3(channel))
            conv4 = self.dropout(self.conv4(channel))

            combined = torch.cat((conv1, conv2, conv3, conv4), dim=2)

            linear = self.linear_sequential(combined)

            note_output = self.linear_notes(linear[:, 0])
            offset_output = self.linear_offset(linear[:, 1])
            duration_output = self.linear_duration(linear[:, 2])

            note_outputs.append(self.logsoftmax(note_output))
            offset_outputs.append(self.sigmoid_offset(offset_output))
            duration_outputs.append(self.sigmoid_duration(duration_output))

        return (
            torch.stack(note_outputs).permute(1, 0, 2),
            torch.stack(offset_outputs).permute(1, 0, 2),
            torch.stack(duration_outputs).permute(1, 0, 2)
        )


### Explain your design choices below.

As already mentioned above, the goal was, to use note_value, offset and duration as channels for the CNN layer. For this reason, all samples fed to the Model needed to have the same length. This length, I defined as maximum 4000 notes in a song. All samples are padded to this length.

After that I predict notes for each MIDI channel separately. Each channel is run through 4 convolutional layers. Two of these should find smaller patterns with a small kernel size and the other two bigger patterns. The outputs are then concateneated and run through multiple linear layers to reduce the dimensions. Finally, the note value channel is reduced to 131 (notes + special tokens) dimensions with a final logarithmic softmax function. With this, it is ensured that only 'legal' not values can be predicted. The offset and duration channel are reduced to a single number, since they can be calculated back to a real offset/duration.

I applied dropout to all layers, to reduce overfitting to my dataset. For lack of memory, I could only use a very small dataset, which is why the model does not see very varied data.

## Part 3 - Dataset sampling (4 points)

Consider how the model is designed above and design a dataset generator capable of producing sample prefixes and next-characters for each time step for each song.  You can replace all the code from the original MIDI notebook with whatever you want.  Consider that there are more and less efficient ways of doing this, and that it may also be worth seeing if it's easier to do this in iterator mode where you can select random prefixes from random songs at each iteration.  You can even choose not to use the torch Dataset class at all, though it means you will have to rewrite the training loop not to use it.

In [9]:
@dataclass
class Sample:
    # (L, (note_vector: (128 + 3) + offset: 1 + duration: 1) = 133)
    song_beginning: torch.Tensor
    note: torch.Tensor
    offset: torch.Tensor
    duration: torch.Tensor


def generate_samples_per_song(song: MessageSequence) -> list[Sample]:
    vectors = song.vector_encode()

    number_midi_channels = len(vectors[0])
    number_cnn_channels = 3
    length_song = len(vectors)

    padding_note = song.encode_note(song.PADDING_NOTE)
    samples: list[Sample] = []
    for i in range(1, length_song):
        song_beginning: list[list[torch.Tensor]] = [[[] for _ in range(number_midi_channels)] for _ in range(number_cnn_channels)]
        for note in vectors[0:i]:
            for channel_index, channel_note in enumerate(note):
                song_beginning[0][channel_index].append(channel_note['note_vector'])
                song_beginning[1][channel_index].append(channel_note['offset'])
                song_beginning[2][channel_index].append(channel_note['duration'])
        
        # pad song_beginning
        for midi_channel in range(number_midi_channels):
            song_beginning[0][midi_channel].extend([padding_note['note_vector']] * (length_song - i))
            song_beginning[1][midi_channel].extend([padding_note['offset']] * (length_song - i))
            song_beginning[2][midi_channel].extend([padding_note['duration']] * (length_song - i))


        notes: list[torch.Tensor] = [[] for _ in range(number_midi_channels)]
        offsets: list[torch.Tensor] = [[] for _ in range(number_midi_channels)]
        durations: list[torch.Tensor] = [[] for _ in range(number_midi_channels)]
        for channel, channel_note in enumerate(vectors[i]):
            notes[channel] = channel_note['note']
            offsets[channel] = channel_note['offset']
            durations[channel] = channel_note['duration']

        samples.append(Sample(
            song_beginning=torch.stack([torch.stack([torch.cat(midi_channel) for midi_channel in cnn_channels]) for cnn_channels in song_beginning]),
            note=torch.stack(notes),
            offset=torch.stack(offsets),
            duration=torch.stack(durations)
        ))

    return samples


def generate_samples(songlist: list[MessageSequence]) -> list[Sample]:
    samples: list[Sample] = []
    for song in songlist:
        samples += generate_samples_per_song(song)

    return samples


In [10]:
class MIDINotesDataset(Dataset):
    def __init__(self, mididir: str = None, index_file: str = None, maximum: int = 500) -> None:
        if mididir is None and index_file is None:
            raise ValueError("Must provide at least one of --mididir or --index_file")

        write_index = False

        if mididir is not None:
            search_pattern = mididir + '/**//*.mid'
            file_list = glob.glob(search_pattern, recursive=True)
            if index_file is not None:
                write_index = True
        else:
            with open(index_file, 'r') as f:
                file_list = f.read().splitlines()

        mss: list[MessageSequence] = []
        count = 0
        successfull_files: set[str] = set()
        for filename in file_list:
            try:
                midifile = MidiFile(filename)
                ms = MessageSequence(midifile, 2)
                successfull_files.add(filename)
            except MIDITrackError:
                continue
            except EOFError:
                continue
            except KeySignatureError:
                continue
            except OSError:
                continue
            except IndexError:
                continue
            except ValueError:
                continue

            mss.append(ms)

            count += 1
            if count == maximum:
                break

        if write_index:
            with open(index_file, 'w') as f:
                f.write('\n'.join(successfull_files))
        self.notes = generate_samples(mss)

    def __getitem__(self, i) -> Sample:
        return self.notes[i]

    def __len__(self) -> int:
        return len(self.notes)


In [11]:
maximum_files = 100
full_dataset = MIDINotesDataset(index_file='index_lmd_matched.txt', maximum=maximum_files)

In [12]:
split = int(0.8 * len(full_dataset))

train_dataset, test_dataset = torch.utils.data.random_split(full_dataset, [split, len(full_dataset) - split])
print(f'Full: {len(full_dataset)}')
print(f'Train: {len(train_dataset)}')
print(f'Test: {len(test_dataset)}')

del full_dataset

Full: 65872
Train: 52697
Test: 13175


In [13]:
import pickle

with open(f'saved/train_dataset_{maximum_files}', 'wb') as f:
    pickle.dump(train_dataset, f)

with open(f'saved/test_dataset_{maximum_files}', 'wb') as f:
    pickle.dump(test_dataset, f)

In [None]:
import pickle

with open(f'saved/train_dataset_100', 'rb') as f:
    train_dataset = pickle.load(f)

with open(f'saved/test_dataset_100', 'rb') as f:
    pickle.load(f)
    test_dataset = pickle.load(f)

In [13]:
print(train_dataset[1].song_beginning.size())

torch.Size([3, 2, 4813])


### Describe any significant choices you made in designing the mode of access to the dataset.

The biggest change I made was the representation of the sample. Each sample contains of a beginning of the song until the specific note and the representation of the note itself (note_value, offset, duration). Also the representation of the song beginning is a tensor of multiple note representations. Since I wanted to combine everything in one tensor, all song beginnings needed to have the same size. Therfore, I padded them with padding notes on the right side. The song beginning has the shape (CNN channel, MIDI channel, song length) where the CNN channels are note_value, offset and duration. With this representation, I get very big tensors that occupy a lot of memory. That's why, I was able to only use a very small dataset of 100 songs which produced around 76000 samples in total.

Furthermore, I implemented a solution, to index processable midi files from the dataset, to reload them faster.

## Part 4 - Training loop (2 points)

Adapt the training loop to the way you organized access to the dataset and to the model you wrote.  Make any other improvements, such as trying out a different optimizer.  Make sure it is possible to vary the batch size as well as the epochs.

In [14]:
def custom_collate(data):
    song_beginnings = []
    notes = []
    offsets = []
    durations = []

    max_length = 0
    for sample in data:
        song_beginnings.append(sample.song_beginning)
        max_length = max(max_length, sample.song_beginning.size(2))

        notes.append(sample.note)
        offsets.append(sample.offset)
        durations.append(sample.duration)
    
    padded_song_beginnings = [pad(song_beginning, (0, max_length - song_beginning.size(2)), value=130) for song_beginning in song_beginnings]

    return {
        'song_beginnings': torch.stack(padded_song_beginnings),
        'notes': torch.stack(notes),
        'offsets': torch.stack(offsets),
        'durations': torch.stack(durations)
    }


In [15]:
import math


def train(dataloader, epochs=10, device=torch.device('cpu'), dropout_prob=0.2, learning_rate=0.001):
    mm = MIDIModel(dropout_prob).to(device)
    optimizer = optim.Adam(mm.parameters(), lr=learning_rate)
    note_criterion = nn.NLLLoss()
    offset_citerion = nn.L1Loss()
    duration_criterion = nn.L1Loss()
    print(f'{epochs} EPOCHS - {math.floor(len(dataloader.dataset) / dataloader.batch_size)} BATCHES PER EPOCH')
    for epoch in range(epochs):
        total_loss = 0
        for i, batch in enumerate(dataloader):
            song_beginnings = batch['song_beginnings'].to(device)
            notes = batch['notes'].type(torch.LongTensor).to(device)
            offsets = batch['offsets'].to(device)
            durations = batch['durations'].to(device)

            optimizer.zero_grad()
            (note_output, offset_output, duration_output) = mm(song_beginnings)
            note_loss = torch.exp(-note_criterion(note_output.permute(0, 2, 1), notes.squeeze()))
            offset_loss = offset_citerion(offset_output, offsets)
            duration_loss = duration_criterion(duration_output, durations)

            loss = note_loss + offset_loss + duration_loss
            total_loss += loss.item()
            loss.backward()
            optimizer.step()

            sys.stdout.write(f'\repoch {epoch}, batch {i}: {round(total_loss / (i + 1), 4)}')

        print()
    return mm.to('cpu')


In [16]:
# dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True, drop_last=True, collate_fn=custom_collate)
# model = train(dataloader, 10, torch.device('cuda:0'))

### If there are any remarks you have on the training loop, put them here:

For the training, I switched to the Adam optimizer, which produced much better results. Aditionally, I used the L1-Loss for offset and duration, to deal easily with multiple MIDI channels. Furthermore, I changed some code, to get better logs and see, how the loss is changing during training. This didn't have an effect on the training itself.

## Part 5 - Evaluation (7 points)

Actually predicting accuracy of note prediction in a set of songs is probably unlikely to work.  So instead we will calculate the perplexity of your model under different training assumptions (for example, epochs, dropout probability -- if you used dropout -- and/or hidden layer size).  Divide your dataset into training and validation sets and use the validation for the perplexity calculation.  (Note that you are predicting notes across multiple channels, so will have to combine perplexities across the channels.)

In [17]:
parameters = [
    # (epochs, dropout, learning_rate)
    (5, 0.2, 0.0001),
    (5, 0.3, 0.0001),
    (5, 0.2, 0.00001),
    (3, 0,   0.0001),
]

dataloader = DataLoader(train_dataset, batch_size=128, shuffle=True, drop_last=True, collate_fn=custom_collate)

models = {}
for parameter in parameters:
    print(parameter)
    models[parameter] = train(dataloader, parameter[0], torch.device('cuda:0'), parameter[1], parameter[2])  
    models[parameter].eval()  

(5, 0.2, 0.0001)
5 EPOCHS - 411 BATCHES PER EPOCH
epoch 0, batch 410: 0.4271
epoch 1, batch 410: 0.4001
epoch 2, batch 410: 0.3993
epoch 3, batch 410: 0.3998
epoch 4, batch 410: 0.3994
(5, 0.3, 0.0001)
5 EPOCHS - 411 BATCHES PER EPOCH
epoch 0, batch 410: 0.4354
epoch 1, batch 410: 0.4011
epoch 2, batch 410: 0.3995
epoch 3, batch 410: 0.3987
epoch 4, batch 410: 0.3982
(5, 0.2, 1e-05)
5 EPOCHS - 411 BATCHES PER EPOCH
epoch 0, batch 410: 0.4572
epoch 1, batch 410: 0.4221
epoch 2, batch 410: 0.3972
epoch 3, batch 410: 0.3833
epoch 4, batch 410: 0.3782
(3, 0, 0.0001)
3 EPOCHS - 411 BATCHES PER EPOCH
epoch 0, batch 410: 0.4149
epoch 1, batch 410: 0.3844
epoch 2, batch 410: 0.3849


In [28]:
device = torch.device('cuda:0')
loss = torch.nn.CrossEntropyLoss()
with torch.no_grad():
    for parameter, model in models.items():
        model.to(device)
        losses = []
        for batch in DataLoader(test_dataset, batch_size=128, drop_last=True, collate_fn=custom_collate):
            song_beginnings = batch['song_beginnings'].to(device)
            notes = batch['notes'].type(torch.LongTensor).to(device)

            (note_output, _, _) = model(song_beginnings)
            note_loss = loss(torch.exp(note_output).permute(0, 2, 1), notes.squeeze())
            
            losses.append(note_loss)
        average = torch.mean(torch.stack(losses))
        perplexity = torch.exp(average).item()
        print(parameter, perplexity)    

        model.to(torch.device('cpu'))

(5, 0.2, 0.0001) 131.1883544921875
(5, 0.3, 0.0001) 131.12937927246094
(5, 0.2, 1e-05) 131.22671508789062
(3, 0, 0.0001) 131.2026824951172


### Your remarks on your evaluation here:

The dataset occupied a lot of memory, so I could train the model with only a very small set of songs (100 songs). Even though they produced around 76.000 samples, it is not much data to learn from. The data for example learns only 100 times per epoch, how to predict the start of a song. The same applies also to the GPU memory, and GPU time which is why I run the model with a maximum of 5 epochs. For that reason and the fact that the data is very simplified, I didn't expect very good results. Anyway, the loss decreased all the time, which indicated that the model is learning despite these limitations.

During my training, the model mostly only improved in the first epoch. All further epochs give only a slight better loss. Changing the learning rate from 0.001 to 0.0001 also improved the learning process. As seen in model 3, an even smaller learning rate worsened the model in the first epochs a little, but with more epochs it, lowered the loss the most. 

When we look at the perplexities, it can be seen that they are very high while there is not much difference between the models. The best model here is the model with a higher dropout (0,3) and a lower learning rate (0.001). The difference is so small that this also is a result of different randomized datasets or initilizations. 

There may be a few reasons for these performance. As mentioned, I think a less simplified and bigger dataset might improve the model a lot. With the current limited data, it is probably diffcult for the model to find recurrent patterns in all songs. Futhermore, the big amount of padding is maybe also a limiting factor. There is padding to pad all midi channels to the same size, to make the batches the same size and to feed the same length to the CNNs. The actual data then only is a very small part. For this, it may be an idea, to decrease it, by for example using a different method to align channels.

## Bonus Part 1 -- "Music" (3 points)

You will have to properly install [scamp](http://scamp.marcevanstein.com/) to do this bonus. You can rewrite the mode of song generation here to take into account your convolutional process.  Then use scamp to play the (multi-channel/simultaneous note music back).  Try to see if you get any quality improvement at all by using better parameters. (It will probably sound awful no matter what.)  If you want to train on mltgpu and play music on your own computer, you'll have to also write a way to save and load the model.

In [None]:
from numpy.random import choice

# This is just to get the first two notes out of the development song.
vecs = x.vector_encode()

def generate_music(model, note1, note2, length=30, diversity=5):
    note_db = functional.one_hot(torch.arange(0, 128))
    newsong = [note1, note2]
    model.eval()
    with torch.no_grad():
        for i in range(length):
            notepair = torch.cat((note1, note2))
            fake_batch = torch.stack([notepair] + [torch.randn(260) for _ in range(24)])
            (note_output, offset_output, duration_output) = model(fake_batch)
            note_output = note_output[0]
            offset_output = offset_output[0]
            duration_output = duration_output[0]
            print("note_output: {}".format(note_output))
            notesort = torch.argsort(note_output, descending=True)
            print("notesort: {}".format(notesort))
            noteset = notesort[:diversity]
            print("noteset: {}".format(noteset))
            notenum = int(choice(noteset.numpy()))
            print("notenum: {}".format(notenum))
            note1 = note2
            print("testgen {} {} {}".format(note_db[notenum].clone().detach(), offset_output, duration_output))
            note2 = torch.cat((note_db[notenum].clone().detach(), torch.Tensor([offset_output]), 
                                                                               torch.Tensor([duration_output])))
            newsong.append(note2)
    return newsong

In [None]:
def reconvert_song(notetensors):
    return [(int(torch.argmax(x[0:128])), int(torch.floor(x[128] * 100)), int(torch.floor(x[129] * 4000))) for x in notetensors]

In [None]:
def get_sequence_back(model_ouptut, starting_time):
    sequence = []
    for (note, offset, duration) in model_ouptut:
        sequence.append((note, starting_time, offset, duration))
        starting_time += duration - offset
        
    return sequence

In [None]:
from scamp import *
import time

In [None]:
sess = Session().run_as_server()

In [None]:
clarinet = sess.new_part("clarinet")

Using preset Clarinet for clarinet


In [None]:
for n in converted_song:
    clarinet.play_note(n[0], 0.8, n[2]/1000)
    time.sleep(n[2]/1000 + 0.01)

### Your remarks on the quality of the music.

## Bonus Part 2 - 2D-convolutions (6 points)

Define a model as in part 2 that restructures your representation as an ensemble of 2D convolutional models (using the additional dimension to handle multiple MIDI channels).  This will probably require that you rebuild other parts of the pipeline to accommodate it.

Do an evaluation of the output in terms of perplexity (and, optionally, musical quality).

### Your code here (in as many cells as you need):

### Your remarks:

## Bonus Part 3 - Durations (20 points)

Starting from the song representation, find a way to properly handle durations across multiple channels so that your code is not reliant on an incorrect alignment of the sequence of notes.  Evaluate as in Bonus Part 2.

### Your code here:

### Your remarks:

## Submission

Submit a filled-out version of this notebook via Canvas.