# Assignment 2 - Convolutions with MIDI

In this assignment, you're going to play around with the MIDI notebook we've been building in class.

The code should run on mltgpu at the time of submission, but you do not need to use the GPU for this assignment.  (You can if you want to.)

When testing on your own machine, in addition to the full PyTorch stack, you'll need the mido module.  Installing the scamp module is necessary if you want to listen to anything. If using Linux, you will have to install fluidsynth.

You will use the [lakh](https://colinraffel.com/projects/lmd/) MIDI corpus.  A copy will be placed in the scratch directory of mltgpu; information will be provided via Canvas announcement.

This assignment is due on November 1, 2022, at 23:59.  There are **25 points** and **29 bonus points** (!!!) available on this assignment.

In [1]:
import sys
import os
import mido

In [2]:
from mido import MidiFile
import os
import sys
from torch.utils.data import Dataset, DataLoader
import numpy as np
from torch.nn.functional import pad

## Part 1 -- improve data handling and representation (4 points)

Here you will take the `MessageSequence` we created in class and make the following improvements:

1. Change the representation so that it can accommodate start and end symbols, as appropriate for your modeling in part 2.

2. Allow for the loading of multiple channels (2 or more, possibly randomly selected), with a reasonable cutoff.  To make things simple, you can make the very wrong assumption that every note is of the same duration and therefore aligned one-by-one, and you can thus ignore duration and offset information.

In [3]:
import torch
import torch.nn.functional as functional

class MIDITrackError(Exception):
    pass

class MessageSequence:
    def __init__(self, mid, number_of_channels = None):
        self.messages = []
        self.max_time = 0
        count = 0
        try:
            if number_of_channels > len(mid.tracks):
                number_of_channels = -1
            for u in mid.tracks[1:number_of_channels]: #another layer, Channels must be fixed across songs, Piano, Guitar, Drums
                #fix in the layering that the metamessage must have that instrument.
#                 if count >= number_of_channels:
#                     break
                channel = []
                for k in u:
                    if k.type in ['note_on', 'note_off']:
                        channel.append(k)
                self.messages.append(channel)
                    
                
        
        except IndexError:
            raise MIDITrackError
        #calculate note durations
        timecounter = 0
        notedict = {}
        real_sequence = []
        for channel in self.messages:
            channel_2 = []
#             channel_2.append('SOS')
            for message in channel:
                timecounter += message.time
                if message.type == "note_on":
                    notedict[message.note] = timecounter

                if message.type == "note_off":
                    duration = timecounter - notedict[message.note]
                    channel_2.append((message.note, notedict[message.note], message.time, duration))
            real_sequence.append(channel_2)
        self.sequence = real_sequence
        
    def midi_reencode(self):
        reencoded = []
        active_notes = {}
        timecounter = 0
        for channel in self.sequence:
            channel_3 = []
#             channel_3.append("SOS")
            for (note, timestamp, offset, duration) in channel: # requires aanother level of iteration
                note_order = sorted(active_notes.keys(), key=lambda x: active_notes[x][0]) #timestamp is tuple item 0
                for active_note in note_order:
                    if active_notes[active_note][0] < timestamp:
                        channel_3.append(mido.Message("note_off", 
                                                      channel=2, #add
                                                      note=active_note, 
                                                      velocity=95, 
                                                      time=active_notes[active_note][1]))
                        timecounter += active_notes[active_note][1]
                        del active_notes[active_note]
                channel_3.append(mido.Message("note_on", 
                                              channel=2,  #channels
                                              note=note,
                                              velocity=95, 
                                              time=timestamp-timecounter))
                active_notes[note] = (timestamp+duration, offset, duration)
                timecounter = timestamp


            note_order = sorted(active_notes.keys(), key=lambda x: active_notes[x][0]) #timestamp is tuple item 0
            for active_note in note_order:
                channel_3.append(mido.Message("note_off", 
                                              channel=2, #add
                                              note=active_note, 
                                              velocity=95, 
                                              time=active_notes[active_note][1]))
                del active_notes[active_note]
            channel_3.append("SOS")
            channel_3.append("EOS")
            reencoded.append(channel_3)
        return reencoded
    
    def vector_encode(self):
        note_db = functional.one_hot(torch.arange(0, 130)).float()
        encoded = []
        f_off = []
        f_dur = []
        
        self.start_vector_token = note_db[128]
        self.end_vector_token = note_db[129]
        for channel in self.sequence:
            channel_4 = []
            acc_offset = []
            acc_dur = []
            channel_4.append(torch.cat((self.start_vector_token, torch.zeros(1,), torch.zeros(1,))))
            for (note, _, offset, duration) in channel:

                note_vec = note_db[note].clone().detach()
                if offset > 100:
                    offset = 100
                if duration > 4000:
                    duration = 4000
                offset = offset/100
                duration = duration/4000
                channel_4.append(torch.cat((note_vec ,torch.Tensor([offset]),  torch.Tensor([duration]))))

            channel_4 = torch.stack(channel_4)
            encoded.append(channel_4)

        return encoded 

In [4]:
def padding(song):
    padded_vectors = []
    #Pads each channel to a fixed point:
    
    data = torch.nn.utils.rnn.pad_sequence(song, batch_first=True)
    data[:, -1, -3] = 1
    return data


In [5]:
store = []
# for filename in os.listdir("""../lt2326-h22-resources/clean_midi/"Weird Al" Yankovic"""):
#     midi_file = MessageSequence(mido.MidiFile("""../lt2326-h22-resources/clean_midi/"Weird Al" Yankovic/"""+ filename), 100)
#     store.append(midi_file)

for filename in os.listdir("""../Wagner"""):
    midi_file = MessageSequence(mido.MidiFile("""../Wagner/"""+ filename), 100)
    store.append(midi_file)
    
vault = []
for song in store:
    vector_raw = song.vector_encode()
    
    t_tensor = padding(vector_raw)
    vault.append(t_tensor)
    


### Describe your changes and any special motivations for them here (in notebook Markdown):

Changed MessageSequence to allow a specific number of channels rather than all-accept

Changed MessageSequence to cat notes, off, and dur and stack them.

Initially, a "start" and "end" token are inserted into the one hot. The Start token is added and have two empty values added to represent off and dur. The end token is inserted once padding in 3rd from last.

Each group of channels are all padded to the longest

Stored all in a container per Weird Al Yanko

WHAT I WOULD HAVE LIKED TO DO:
- Specify instrument, but also in terms of future training.
- padded according to real-time and not wrongfully assumed beginnings and ends.
- Retrieved more info from the notes; theres guaranteed more good stuff in there . . . 


## Part 2 - Convolutional Model (8 points)

Replace the model below with a model with the following characteristics:

1. It should include an ensemble of parallel 1D-convolutional layers (2 or more)
2. The layers should combine into a single output representation.
3. The layers should be different (have different kernels, windows, or strides).
4. The layers should be able to handle multiple channels. 
5. The input will be the song representation up to time step n, and the output will be a representation of notes for a single time step across the channels at n+1. (This means that an instance will be prediction of the next note, and a song will have to be run n times to predict n characters.)

Training the model will take longer than the n-gram model, especially if you're not using the GPU.

You have a free hand in all other aspects of the model, as long as you explain any significant design decisions (i.e., not every minor choice, but ones with real design impact).

(A bit of advice: the biggest problem here will be keeping the matrix/tensor dimensions straight...)

In [6]:
import torch.nn as nn

In [7]:
import torch

In [8]:
import torch.optim as optim

In [9]:
class MIDIModel(nn.Module):
    def __init__(self, drop_out):
        super().__init__()
        
        
        self.conv1 = nn.Conv1d(132 , 132 , kernel_size=1, stride=1)
        self.conv2 = nn.Conv1d(132 , 132   , kernel_size=5, stride=5)
        self.conv3 = nn.Conv1d(132 , 132, kernel_size=7, stride=7)
        self.conv4 = nn.Conv1d(132 , 132, kernel_size=10, stride=10)
        
#         torch.Size([25, 132, 30000]) conv1
#         torch.Size([25, 132, 6000])  conv2
#         torch.Size([25, 132, 4285])  conv3
#         torch.Size([25, 132, 3000])  conv4

#       if you change the padding in max-song-length, remember to take the outs of the convlayers 
#        and replace the value of data-length

        
        self.data_length = 43285

        self.drop_out = nn.Dropout(drop_out)
    
        self.combined_length = 3300
        
        self.max_song_length = 30000
        
        self.fc1 = nn.Linear(self.data_length, self.data_length//2)
        self.fc2 = nn.Linear(self.data_length//2 , self.data_length//4)
        self.fc3 = nn.Linear(self.data_length//4 , self.data_length//8)
        
        self.Sigmoid_1 = nn.Sigmoid()
        self.Sigmoid_2 = nn.Sigmoid()
        self.Sigmoid_3 = nn.Sigmoid()
        
        
        self.logsoftmax = nn.LogSoftmax(dim=0)
        self.Sigmoid_offset = nn.Sigmoid()
        self.Sigmoid_duration = nn.Sigmoid()
        
        
        self.note_fc = nn.Linear(self.data_length//8, 130)
        self.offset_fc = nn.Linear(self.data_length//8, 1)
        self.duration_fc = nn.Linear(self.data_length//8, 1)
        
        self.final_fc = nn.Linear(130, 1)
        
    def forward(self, data):
        note_final = []
        dur_final = []
        off_final = []
        
        data = torch.permute(data, (0,2,1))
        data = pad(data, (0, self.max_song_length-data.shape[2]))
        self.data_length = data.shape[2]
        
        conv1 = self.drop_out(self.conv1(data))
        conv2 = self.drop_out(self.conv2(data))
        conv3 = self.drop_out(self.conv3(data))
        conv4 = self.drop_out(self.conv4(data))
        parallel_ensemble = torch.cat((conv1, conv2, conv3, conv4), dim=2)
     

        parallel_ensemble = self.fc1(parallel_ensemble)
        parallel_ensemble = self.Sigmoid_1(parallel_ensemble)
        parallel_ensemble = self.fc2(parallel_ensemble)
        parallel_ensemble = self.Sigmoid_2(parallel_ensemble)
        parallel_ensemble = self.fc3(parallel_ensemble)
        parallel_ensemble = self.Sigmoid_3(parallel_ensemble)
        
        #opdeling af note, dur, offset
        note_out = parallel_ensemble[:, :-2]
        dur_out = parallel_ensemble[: , -1]
        off_out = parallel_ensemble[: , -2]
        

        #Som alle tages igennem deres eget respektive linear FC layer
        note_out = self.note_fc(note_out)
        dur_out = self.duration_fc(dur_out)
        off_out = self.offset_fc(off_out)
        
        final_note = self.final_fc(note_out.permute(0,2,1))
        final_note = final_note.squeeze(2)

        #Disse lag tages bliver softmaxet OG SIGMOIDES
        dur_out = self.Sigmoid_duration(dur_out)
        off_out = self.Sigmoid_offset(off_out)

        return final_note, off_out, dur_out

### Explain your design choices below.

Each song is padded to the length of the its longest channel. Each channels note also have concatenated their individual offset and duration.

The CNN-model has been initiated with four CNN-layers, three and four respective fully connected layers, and activation function-layers to offset and duration respectively to the input.

The CNN-model handles channels by flattening the input. Otherwise, it could have been done by collecting each channel and then stacking them following a for-loop. The code, then, runs four parallel conv-layers as per the assignment and concatenates the results of these features maps. It alters the formats of these from time to time. The result is then brought down to a size- (130/1) portion and then given an activation function. The note-size had an additional FC-layer added as I realised in the for-loop that I had to output 1 note and note 130. The dimension/understanding the dimensions here were indeed slippery.



## Part 3 - Dataset sampling (4 points)

Consider how the model is designed above and design a dataset generator capable of producing sample prefixes and next-characters for each time step for each song.  You can replace all the code from the original MIDI notebook with whatever you want.  Consider that there are more and less efficient ways of doing this, and that it may also be worth seeing if it's easier to do this in iterator mode where you can select random prefixes from random songs at each iteration.  You can even choose not to use the torch Dataset class at all, though it means you will have to rewrite the training loop not to use it.

In [10]:
def generate_samples_per_song(song):
    contain = []
    super_song_vector = torch.flatten(song,0,1)

    song_history = []
    for i,note in enumerate(super_song_vector[:-1]):
#         song_history.append([note, super_song_vector[i+1]])
        song_history.append(note)
        contain.append((torch.stack(song_history.copy()), super_song_vector[i+1]))
#     print(contain[:3])
#         if i > 3:
#             break
        
    return contain
    
    #Write comments, flatting song, concatenating drums and vocals.

def generate_samples(songlist, cap):
    samples = []
    for song in songlist:
        samples += generate_samples_per_song(song)
    return samples[:cap] #cap for speed!

In [11]:
class MIDINotesDataset(Dataset):
    def __init__(self, mididir, maximum=500):
        items = os.walk(mididir)
        
        store = []
        for filename in os.listdir(mididir):
            midi_file = MessageSequence(mido.MidiFile(mididir+ filename), 100)
            store.append(midi_file)

        vault = []
        for song in store:
            vector_raw = song.vector_encode()

            t_tensor = padding(vector_raw)
            vault.append(t_tensor)

        self.gen_song_list = generate_samples(vault, maximum)
        
    
        
    def __getitem__(self, i):
        return self.gen_song_list[i]
    
    def __len__(self):
        return len(self.gen_song_list) #change

In [12]:
dataset = MIDINotesDataset("../Wagner/", 20)
# print(len(dataset.gen_song_list))

### Describe any significant choices you made in designing the mode of access to the dataset.

I removed one of the data-set creaters. I believed it was a bit silly to have a dataloader to create a dataloader considering the rather simplicity of the format of the samples. 

The generator is also rather simple. It iterates songs and extracts sample out and stores it in a list.

Now, the way it generates samples is that it accumulates samples according to how far in the song the n-note is. The data-sample will be the song until that very given point in the song concatenated with n+1 as a sep. value. (what is the i and o in the loop below)

The iteration therefore runs from 1 through [:-1] so that the final note of the song doesnt break the loop.

There is a hinge to the iteration, which I wanted to change: I concatenated all notes of a song and essentially flatten the otherwise stacked container of tensors from a 3-dim to a 2-dim. This means that all notes of the song's total channels are found in this container. This essentially also means that the coming model-training wont be able to distinguish between vocals and drums. 

The dataset current runs a rather small pool of songs for sheer performance. There might be some error-handling to explore if larger Midi-files are entered - these structures are generally not very well produced / alligned. 


## Part 4 - Training loop (2 points)

Adapt the training loop to the way you organized access to the dataset and to the model you wrote.  Make any other improvements, such as trying out a different optimizer.  Make sure it is possible to vary the batch size as well as the epochs.

In [13]:
def custom_collate(data):
    iput = []
    oput = []
    max_len = 0
    for sample in data:
        max_len = max(max_len, sample[0].shape[0])
    for i in data:
        padded_input = pad(i[0], (0,0, 0, max_len - i[0].shape[0])) # each batch has its own padding-length! otherwise, the stack would increase by accumulating notes)
        iput.append(padded_input)
        oput.append(i[1])
    
    return torch.stack(iput), torch.stack(oput)

The custom_collate function above serves as the shaper of data in the data_loader function.

The purpose of the Collate song is to find the max-length of the given song-input and pad each batch to its respective max-length. This is due to the fact that each note-sample is a n+1 iteration, so the size is varying, which indeed will not work in the training loop. The Custom Collate also serves as output generating for the DataLoader insothat the forloop is a simply i,o in/out.


In [14]:
def train(data, epochs=10):
    mm = MIDIModel(0.2)
    optimizer = optim.SGD(mm.parameters(), lr=0.001, momentum=0.9)
    note_criterion = nn.CrossEntropyLoss()
#     note_criterion = nn.NLLLoss()
    for epoch in range(epochs):
        losses = []
        
        loader = DataLoader(data, batch_size=15, shuffle=True, drop_last=True, collate_fn=custom_collate) #collate ensures that the data on fetch is actually split
        #collate here is inserted to deal with the say the dataloader was made; it made data as a tuple which would otherwise break
        #Collate manipulates data once creating the batch . . . 
        for i, o in loader:
            optimizer.zero_grad()
            (note_output, offset_output, duration_output) = mm(i)
#             print(note_output.shape, offset_output.shape, duration_output.shape)
            #print("no: {}, oo: {}, do: {}".format(note_output, offset_output, duration_output))
    
#             o = o.type(torch.LongTensor)
#             note_output= note_output.type(torch.LongTensor)
            note_loss = torch.exp(-note_criterion(note_output , o[:, :130]))
            offset_loss = torch.abs(o[:, 130:131] - offset_output)
            duration_loss = torch.abs(o[:, 131:132] - duration_output)
            #print("nl: {}, ol: {}, dl: {}".format(note_loss, offset_loss, duration_loss))
            loss = note_loss + offset_loss + duration_loss
            losses.append(sum(loss))
            sum(loss).backward()
            optimizer.step()
        print("mean loss in epoch {} is {}".format(epoch, float(torch.mean(torch.stack(losses)))))
    return mm

### If there are any remarks you have on the training loop, put them here:

There are not really any hard changes to the training loop other than the NLLLoss was changed for a cross-entropy. The reason was sheer misunderstanding of formats (adding an activation in the model, tensor-types, etc.) which eventually just ended up being corrected in the forloop with the CEL. 

The loss is also changed abit to fit the format of the Midi-vectors/one-hot vectors.



In [15]:
train_dataset = train(dataset)


mean loss in epoch 0 is 12.449592590332031
mean loss in epoch 1 is 5.265387535095215
mean loss in epoch 2 is 4.929225444793701
mean loss in epoch 3 is 5.4242377281188965
mean loss in epoch 4 is 5.882735729217529
mean loss in epoch 5 is 4.465993881225586
mean loss in epoch 6 is 5.637473106384277
mean loss in epoch 7 is 5.470832824707031
mean loss in epoch 8 is 6.7103118896484375
mean loss in epoch 9 is 6.770566463470459


## Part 5 - Evaluation (7 points)

Actually predicting accuracy of note prediction in a set of songs is probably unlikely to work.  So instead we will calculate the perplexity of your model under different training assumptions (for example, epochs, dropout probability -- if you used dropout -- and/or hidden layer size).  Divide your dataset into training and validation sets and use the validation for the perplexity calculation.  (Note that you are predicting notes across multiple channels, so will have to combine perplexities across the channels.)

In [16]:
def new_train(data, epochs=input, l_r=input, drop_out=input, b_size=input):
    mm = MIDIModel(drop_out)
    optimizer = optim.SGD(mm.parameters(), lr=l_r, momentum=0.9)
    note_criterion = nn.CrossEntropyLoss()
#     note_criterion = nn.NLLLoss()
    for epoch in range(epochs):
        losses = []
        
        loader = DataLoader(data, batch_size=b_size, shuffle=True, drop_last=True, collate_fn=custom_collate) #collate ensures that the data on fetch is actually split
        #collate here is inserted to deal with the say the dataloader was made; it made data as a tuple which would otherwise break
        #Collate manipulates data once creating the batch . . . 
        for i, o in loader:
            optimizer.zero_grad()
            (note_output, offset_output, duration_output) = mm(i)
#             print(note_output.shape, offset_output.shape, duration_output.shape)
            #print("no: {}, oo: {}, do: {}".format(note_output, offset_output, duration_output))
    
#             o = o.type(torch.LongTensor)
#             note_output= note_output.type(torch.LongTensor)
            note_loss = torch.exp(-note_criterion(note_output , o[:, :130]))
            offset_loss = torch.abs(o[:, 130:131] - offset_output)
            duration_loss = torch.abs(o[:, 131:132] - duration_output)
            #print("nl: {}, ol: {}, dl: {}".format(note_loss, offset_loss, duration_loss))
            loss = note_loss + offset_loss + duration_loss
            losses.append(sum(loss))
            sum(loss).backward()
            optimizer.step()
        print("mean loss in epoch {} is {}".format(epoch, float(torch.mean(torch.stack(losses)))))
    return mm

The trainer-function here has been altered so that all parameters are controlled in the initial call and then forwarded to the respective functions/model. The parameters which are controlled are epoch, learning rate, dropout and batchsize 

The batchsize is hard-capped in the custom-collate, which is why it is also hard-capped here to => 20

Below I made a  parameter change-list in a forloop for simplicity.
The loss of both the new and the original seem to correspond correctly to how a solid training ought to loop (despite the models arent any good). 

In [17]:
dataset = MIDINotesDataset("../Wagner/", 20) #find new artist 
parameter_list = [[7, 0.1, 0.1],[10, 0.0001, 0.05],[2, 0.000001, 0.3]  ]

#                  (Epochs, learning_rate, drop_out )


packing_models = []
for parameter in parameter_list:
    print('Parameter of given model: Epochs, Learning Rate, Drop-out: ',parameter)
    model = new_train(dataset, parameter[0], parameter[1], parameter[2], b_size=15)
    packing_models.append([model, parameter])
    
  
    


Parameter of given model: Epochs, Learning Rate, Drop-out:  [7, 0.1, 0.1]
mean loss in epoch 0 is 10.06298542022705
mean loss in epoch 1 is 8.558642387390137
mean loss in epoch 2 is 6.319942474365234
mean loss in epoch 3 is 8.491042137145996
mean loss in epoch 4 is 8.383999824523926
mean loss in epoch 5 is 7.826999664306641
mean loss in epoch 6 is 6.581999778747559
Parameter of given model: Epochs, Learning Rate, Drop-out:  [10, 0.0001, 0.05]
mean loss in epoch 0 is 8.860145568847656
mean loss in epoch 1 is 7.8011794090271
mean loss in epoch 2 is 5.4920125007629395
mean loss in epoch 3 is 5.070052623748779
mean loss in epoch 4 is 4.92282772064209
mean loss in epoch 5 is 5.019260406494141
mean loss in epoch 6 is 4.850572109222412
mean loss in epoch 7 is 5.626389026641846
mean loss in epoch 8 is 5.713675498962402
mean loss in epoch 9 is 5.680151462554932
Parameter of given model: Epochs, Learning Rate, Drop-out:  [2, 1e-06, 0.3]
mean loss in epoch 0 is 10.558037757873535
mean loss in epo

In [18]:
validation_set = MIDINotesDataset("../Wagner/", 20) #find new artist 

In [19]:

loss = torch.nn.CrossEntropyLoss()
with torch.no_grad():
    for parameter_model in packing_models:
        model = parameter_model[0]
        parameter = parameter_model[1]
        losses = []
        for i, o  in DataLoader(validation_set, batch_size=15, drop_last=True, collate_fn=custom_collate):
            (note_output, offset_output, duration_output) = model(i)
            note_loss = loss(note_output , o[:, :130])
            losses.append(note_loss)
        average = torch.mean(torch.stack(losses))
        model_perplexity = torch.exp(average).item()
        print(parameter, model_perplexity)    
            

[7, 0.1, 0.1] 1.383132295854212e+27
[10, 0.0001, 0.05] 133.8363037109375
[2, 1e-06, 0.3] 125.85550689697266


### Your remarks on your evaluation here:

It is apparent from the perplexity calculation that the model is indeed live and solid but the quality indeed is not. There are numerous reasons for this: 

1. The data is limited to one artist with one song
2. The padding of the vectors initially in the vector_code disregards the parallelism of channels are assumes a common point of initialisation
3. The chosen artist's one song is capped to the first 20 accumulating notes+ n+1 for performance (the vectors got extremely long)
4. each song in the model is padded to a concatenated CNN- output of 43000, which could be reduced significantly. 

The model does however work and are capable of predicting future notes given input.

The parameters / perplexity training shows that the given model indeed enjoys a low drop-out, a very slow learning rate and few epoch-iterations. 

## Bonus Part 1 -- "Music" (3 points)

You will have to properly install [scamp](http://scamp.marcevanstein.com/) to do this bonus. You can rewrite the mode of song generation here to take into account your convolutional process.  Then use scamp to play the (multi-channel/simultaneous note music back).  Try to see if you get any quality improvement at all by using better parameters. (It will probably sound awful no matter what.)  If you want to train on mltgpu and play music on your own computer, you'll have to also write a way to save and load the model.

In [None]:
from numpy.random import choice

# This is just to get the first two notes out of the development song.
vecs = x.vector_encode()

def generate_music(model, note1, note2, length=30, diversity=5):
    note_db = functional.one_hot(torch.arange(0, 128))
    newsong = [note1, note2]
    model.eval()
    with torch.no_grad():
        for i in range(length):
            notepair = torch.cat((note1, note2))
            fake_batch = torch.stack([notepair] + [torch.randn(260) for _ in range(24)])
            (note_output, offset_output, duration_output) = model(fake_batch)
            note_output = note_output[0]
            offset_output = offset_output[0]
            duration_output = duration_output[0]
            print("note_output: {}".format(note_output))
            notesort = torch.argsort(note_output, descending=True)
            print("notesort: {}".format(notesort))
            noteset = notesort[:diversity]
            print("noteset: {}".format(noteset))
            notenum = int(choice(noteset.numpy()))
            print("notenum: {}".format(notenum))
            note1 = note2
            print("testgen {} {} {}".format(note_db[notenum].clone().detach(), offset_output, duration_output))
            note2 = torch.cat((note_db[notenum].clone().detach(), torch.Tensor([offset_output]), 
                                                                               torch.Tensor([duration_output])))
            newsong.append(note2)
    return newsong

In [None]:
def reconvert_song(notetensors):
    return [(int(torch.argmax(x[0:128])), int(torch.floor(x[128] * 100)), int(torch.floor(x[129] * 4000))) for x in notetensors]

In [None]:
def get_sequence_back(model_ouptut, starting_time):
    sequence = []
    for (note, offset, duration) in model_ouptut:
        sequence.append((note, starting_time, offset, duration))
        starting_time += duration - offset
        
    return sequence

In [None]:
from scamp import *
import time

In [None]:
sess = Session().run_as_server()

In [None]:
clarinet = sess.new_part("clarinet")

In [None]:
for n in converted_song:
    clarinet.play_note(n[0], 0.8, n[2]/1000)
    time.sleep(n[2]/1000 + 0.01)

### Your remarks on the quality of the music.

## Bonus Part 2 - 2D-convolutions (6 points)

Define a model as in part 2 that restructures your representation as an ensemble of 2D convolutional models (using the additional dimension to handle multiple MIDI channels).  This will probably require that you rebuild other parts of the pipeline to accommodate it.

Do an evaluation of the output in terms of perplexity (and, optionally, musical quality).

### Your code here (in as many cells as you need):

### Your remarks:

## Bonus Part 3 - Durations (20 points)

Starting from the song representation, find a way to properly handle durations across multiple channels so that your code is not reliant on an incorrect alignment of the sequence of notes.  Evaluate as in Bonus Part 2.

### Your code here:

### Your remarks:

## Submission

Submit a filled-out version of this notebook via Canvas.