In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pickle
import gc


import torch
from torch.utils.data import DataLoader, Dataset


from Preprocessing import *
from CNN_ExtractGenre import *
from PolyphonicPreprocessing import *

from DatasetLoader import collate_fn as cf
import DatasetLoader as DL
import Model as M

# Cleaning data

Apply the functions in **Preprocessing.py** to clean the MIDI dataset, since multiple files are corrupted or duplicated.
For this analysis we also use only MIDI files with a 4/4 time signature, as in the reference paper. This filtering is done in CleaningData().
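A minimal sketch of the two cleaning steps, assuming duplicates are exact byte-for-byte copies and that each file's time signatures have already been extracted as (numerator, denominator) tuples by a MIDI parser. The names find_duplicates and is_four_four are hypothetical and are not the functions in **Preprocessing.py**.

```python
import hashlib

def find_duplicates(files):
    """Return the names of files whose bytes exactly match an earlier file.

    `files` maps a file name to its raw bytes (hypothetical input format).
    MD5 is used only as a fast content fingerprint, not for security.
    """
    seen = {}        # content hash -> first file name with that content
    duplicates = []
    for name, data in files.items():
        digest = hashlib.md5(data).hexdigest()
        if digest in seen:
            duplicates.append(name)
        else:
            seen[digest] = name
    return duplicates

def is_four_four(time_signatures):
    """Keep a file only if every time signature it declares is 4/4.

    A file with no time-signature event is treated as 4/4, the default
    assumed by the Standard MIDI File specification.
    """
    return all(ts == (4, 4) for ts in time_signatures)

files = {"a.mid": b"MThd...1", "b.mid": b"MThd...2", "a_copy.mid": b"MThd...1"}
print(find_duplicates(files))          # ['a_copy.mid']
print(is_four_four([(4, 4)]))          # True
print(is_four_four([(4, 4), (3, 4)]))  # False
print(is_four_four([]))                # True
```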

In [3]:
DeleteDuplicates()
CleaningData()

Deleting Duplicates:: 100%|██████████| 2201/2201 [00:00<00:00, 2227.41it/s]
100%|██████████| 2201/2201 [06:27<00:00,  5.68it/s]


# Preprocessing data:

First we reconstruct the database, transforming all the polyphonic tracks into monophonic ones while keeping the information about the tracks (instruments) in each MIDI file. This is done by keeping only the highest pitch of each chord.
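A minimal sketch of this highest-pitch ("skyline") reduction, assuming each track is a list of (onset, pitch) pairs in ticks: notes that start at the same onset form a chord, and only the highest pitch survives. The function name to_monophonic and the input format are hypothetical; the actual implementation may also handle note durations and overlapping notes.

```python
from collections import defaultdict

def to_monophonic(notes):
    """Keep only the highest pitch among notes that share an onset time."""
    highest = defaultdict(lambda: -1)   # onset -> highest pitch seen so far
    for onset, pitch in notes:
        highest[onset] = max(highest[onset], pitch)
    return sorted(highest.items())      # [(onset, pitch), ...] in time order

# A C-major chord at tick 0 followed by a single note at tick 480
track = [(0, 60), (0, 64), (0, 67), (480, 62)]
print(to_monophonic(track))  # [(0, 67), (480, 62)]
```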

In [None]:
# RecreateDatabase()  # already run (slow); uncomment to rebuild the database

Recreating Database: 100%|██████████| 2059/2059 [08:41<00:00,  3.95it/s]


The input of the model has to be a 128x16 matrix (128 MIDI pitches by 16 time steps, i.e. one 4/4 bar at sixteenth-note resolution), as in the paper. The following function classifies the MIDI tracks into instrument classes:
- Guitar
- Percussion
- Organ
- Sound Effects
- Bass
- Piano
- Synth Lead
- Chromatic Percussion
- Synth Pad
- Percussive
- Synth Effects
- Ethnic
- Ensemble
- Reed
- Brass
- Pipe
- Strings
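These class names are General MIDI instrument families: GM groups its 128 programs into 16 families of 8, and drum tracks on channel 10 are handled separately. A minimal sketch of that mapping and of the 128x16 bar encoding, assuming 0-based program numbers, drums on 0-based channel 9, and notes given as (pitch, start_step, end_step) in sixteenth-note steps within the bar. The names instrument_class and bar_to_matrix are hypothetical, not the functions in **Preprocessing.py**.

```python
import numpy as np

# General MIDI groups the 128 programs into 16 families of 8 programs each.
GM_FAMILIES = [
    "Piano", "Chromatic Percussion", "Organ", "Guitar",
    "Bass", "Strings", "Ensemble", "Brass",
    "Reed", "Pipe", "Synth Lead", "Synth Pad",
    "Synth Effects", "Ethnic", "Percussive", "Sound Effects",
]

def instrument_class(program, channel):
    """Map a 0-based GM program number and MIDI channel to a class name."""
    if channel == 9:  # channel 10 (0-based 9) is reserved for drums in GM
        return "Percussion"
    return GM_FAMILIES[program // 8]

def bar_to_matrix(notes, n_pitches=128, n_steps=16):
    """Encode one 4/4 bar as a binary pitch x time matrix.

    `notes` holds (pitch, start_step, end_step) with steps in sixteenth
    notes relative to the bar start (hypothetical format).
    """
    bar = np.zeros((n_pitches, n_steps), dtype=np.int8)
    for pitch, start, end in notes:
        bar[pitch, start:min(end, n_steps)] = 1
    return bar

print(instrument_class(33, 0))   # Bass
print(instrument_class(0, 9))    # Percussion
bar = bar_to_matrix([(60, 0, 4), (62, 4, 8), (64, 8, 16)])
print(bar.shape, bar.sum())      # (128, 16) 16
```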

In [4]:
Dataset = PreProcessing(nDir = 2000)
torch.save(Dataset, 'Dataset_Percussion1.pt')

Preprocessing:  60%|█████▉    | 1192/2000 [04:44<03:12,  4.20it/s]


OSError: MThd not found. Probably not a MIDI file

In [4]:
# Dataset = torch.load('Dataset.pt', map_location='cpu', weights_only=False)

for key in Dataset.keys():
    print(key, '', len(Dataset[key]))

# Free some memory, since the data is reloaded later through the DataLoader
# del Dataset
# gc.collect()

Bass  115561
Piano  58756
Guitar  185958
Organ  44142
Ensemble  62419
Synth Lead  21766
Reed  30971
Sound Effects  8824
Brass  33210
Chromatic Percussion  29047
Pipe  18956
Percussion  6663
Synth Pad  17083
Strings  33716
Percussive  6366
Synth Effects  7385
Ethnic  3040



# Monophonic Model and Architecture

The class MonophonicDataset in **DatasetLoader.py** allows us to choose which instrument's bars to load.

In [2]:
# Select the guitar bars from the dataset
Data = DL.MonophonicDataset(Instrument='Guitar')
trainData = DataLoader(Data, batch_size=10, shuffle=True, num_workers=0, collate_fn=cf)
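To make the batching step concrete, here is a minimal NumPy sketch of what a collate function of this kind might do: turn a list of sparsely stored bars into one dense (batch, 128, 16) array, with the items shuffled and grouped as DataLoader does for shuffle=True and batch_size=10. The storage format (pairs of row and column index arrays) and the names collate_dense and iterate_batches are hypothetical stand-ins, not the collate_fn in **DatasetLoader.py**.

```python
import numpy as np

def collate_dense(batch, shape=(128, 16)):
    """Stack a list of sparse bars into one dense (B, 128, 16) array.

    Each bar is a (rows, cols) pair of index arrays marking active cells
    (hypothetical format; the real collate_fn may differ).
    """
    out = np.zeros((len(batch), *shape), dtype=np.float32)
    for i, (rows, cols) in enumerate(batch):
        out[i, rows, cols] = 1.0
    return out

def iterate_batches(items, batch_size, rng):
    """Shuffle the items, then yield collated batches (the last may be smaller)."""
    order = rng.permutation(len(items))
    for start in range(0, len(items), batch_size):
        yield collate_dense([items[j] for j in order[start:start + batch_size]])

rng = np.random.default_rng(0)
bars = [(np.array([60 + k]), np.array([k % 16])) for k in range(25)]
sizes = [b.shape[0] for b in iterate_batches(bars, batch_size=10, rng=rng)]
print(sizes)  # [10, 10, 5]
```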

# Genre detection using CNN

The idea is to train a Convolutional Neural Network (CNN) to understand the structure of the songs and to implement a classifier capable of identifying the genre of each song in our MIDI dataset.

However, we cannot train the CNN directly on our MIDI dataset: a CNN genre classifier is trained with supervised learning, and our dataset does not include genre labels. For this reason, we found another dataset containing 100 songs in .wav format for each of the following musical genres:

- metal
- disco
- classical
- hiphop
- jazz
- country
- pop
- blues
- reggae
- rock

The idea is to train the CNN using this labeled dataset. Before doing that, we need to perform some preprocessing, since some songs in the dataset are corrupted. Additionally, the audio clips are only a few seconds long, so we preprocess each song to have a fixed length and a consistent format. The preprocessing functions are implemented in the file **CNN_ExtractGenre.py**.
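A minimal sketch of the validity check and fixed-length step, assuming mono waveforms stored as NumPy arrays: empty clips or clips with NaN/infinite samples are treated as corrupted, longer clips are cropped and shorter ones are zero-padded at the end. The sample rate, the 3-second target, and the names is_valid and fix_length are illustrative assumptions, not the code in **CNN_ExtractGenre.py**.

```python
import numpy as np

def is_valid(waveform):
    """Reject empty clips and clips containing NaN or infinite samples."""
    return waveform.size > 0 and bool(np.all(np.isfinite(waveform)))

def fix_length(waveform, target_len):
    """Crop or zero-pad a 1-D waveform to exactly target_len samples."""
    if len(waveform) >= target_len:
        return waveform[:target_len]
    return np.pad(waveform, (0, target_len - len(waveform)))

sr = 22050               # assumed sample rate
target = 3 * sr          # illustrative 3-second clips
clips = [np.ones(2 * sr), np.ones(5 * sr), np.array([]), np.array([0.1, np.nan])]
fixed = [fix_length(c, target) for c in clips if is_valid(c)]
print(len(fixed), [c.shape for c in fixed])  # 2 [(66150,), (66150,)]
```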

After preprocessing, we define the CNN model and the data loader in **Model.py** and **DatasetLoader.py**, respectively. The model is trained on Google Colab (not on the local machine), and we later load the trained weights through its state_dict.

The CNN achieves a strong validation accuracy of 84%, as shown in the accompanying paper.

Once the model is trained on the labeled dataset, we use it to classify our own songs. This is a complex process because our songs are in .mid format, while the model expects spectrograms computed from audio as input. Therefore, each MIDI file must be synthesized into audio, transformed into a spectrogram, and then classified by the CNN.
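The spectrogram step can be sketched with a plain short-time Fourier transform in NumPy. This is an illustration under assumptions: the MIDI-to-audio synthesis is not shown (a 440 Hz sine stands in for synthesized audio), and the window size, hop length, and log-magnitude scaling are hypothetical choices; the spectrogram type and parameters the CNN was actually trained on may differ (e.g. a mel spectrogram).

```python
import numpy as np

def log_spectrogram(waveform, n_fft=1024, hop=512):
    """Log-magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([waveform[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    return np.log1p(magnitude).T                      # (freq bins, time frames)

sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)   # 1 s of A4 standing in for synthesized MIDI
spec = log_spectrogram(tone)
print(spec.shape)  # (513, 42)
```

With these settings the energy concentrates around frequency bin 440 / (22050 / 1024), i.e. about bin 20.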

After classifying each song, we save a file containing the song’s name and its predicted genre. From there, we proceed as before: we separate our dataset by genre, and within each genre, we further separate the songs by instrument.


N.B. All the functions in **CNN_ExtractGenre.py** have already been run, since the computation is quite long. The following cell shows the final result.

In [4]:
# Load the preprocessed and classified dataset
with open('GenreDataset.pkl', 'rb') as f:
    GenreDataset = pickle.load(f)

# Map each genre to an integer label for classification
GenreMapping = {'metal': 0, 'disco': 1, 'classical': 2, 'hiphop': 3, 'jazz': 4,
                'country': 5, 'pop': 6, 'blues': 7, 'reggae': 8, 'rock': 9}

# Print the first 11 entries: author/song name, (genre label, confidence)
for i, (key, value) in enumerate(GenreDataset.items()):
    print(key, value)
    if i >= 10:
        break

Gordon Lightfoot/Sundown (9, 0.5)
Gordon Lightfoot/Sundown.1 (9, 1.0)
Gordon Lightfoot/Rainy Day People (9, 0.5)
Gordon Lightfoot/Carefree Highway (9, 0.92)
Gordon Lightfoot/Beautiful (4, 0.67)
Gounod Charles/Ave Maria.1 (2, 0.92)
Gounod Charles/Marche funebre d'une marionnette (2, 0.42)
Gounod Charles/Waltz from Faust (2, 1.0)
Grace Jones/Slave to the Rhythm (6, 0.58)
Grand Funk Railroad/Some Kind of Wonderful (1, 1.0)
Grand Funk Railroad/I'm Your Captain (Closer to Home) (9, 0.67)
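A minimal sketch of the next step, splitting songs by predicted genre, assuming GenreDataset maps each song name to a (genre label, confidence) tuple as printed above. The inverse dictionary converts labels back to genre names; the min_confidence filter is a hypothetical option for discarding uncertain predictions and is not something the original pipeline is stated to do.

```python
from collections import defaultdict

GenreMapping = {'metal': 0, 'disco': 1, 'classical': 2, 'hiphop': 3, 'jazz': 4,
                'country': 5, 'pop': 6, 'blues': 7, 'reggae': 8, 'rock': 9}
IndexToGenre = {v: k for k, v in GenreMapping.items()}

def group_by_genre(genre_dataset, min_confidence=0.0):
    """Group song names by predicted genre, optionally dropping low-confidence songs."""
    groups = defaultdict(list)
    for song, (genre_id, confidence) in genre_dataset.items():
        if confidence >= min_confidence:
            groups[IndexToGenre[genre_id]].append(song)
    return dict(groups)

# A few of the entries printed above
sample = {
    "Gordon Lightfoot/Sundown": (9, 0.5),
    "Gordon Lightfoot/Beautiful": (4, 0.67),
    "Gounod Charles/Waltz from Faust": (2, 1.0),
    "Grand Funk Railroad/Some Kind of Wonderful": (1, 1.0),
}
print(group_by_genre(sample, min_confidence=0.6))
```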


# Polyphonic Music Generator

So far we have only considered monophonic tracks, which loses all the information about simultaneous notes of the same instrument and the correlation between instruments. Therefore we would like to implement a polyphonic music generator.

The strategy is the same as before. Instead of a 128x16 matrix, each bar is a 4x128x16 tensor, where 4 is the maximum number of instruments that can play at the same time. Each 128x16 slice no longer encodes a single note per time step as before, but allows multiple simultaneous notes of the same instrument.
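A minimal sketch of the 4x128x16 encoding, assuming each note arrives as (track, pitch, start_step, end_step) with the track already assigned to one of the 4 slots. How **PolyphonicPreprocessing.py** chooses the 4 instruments is not shown here, and the function name bar_to_polyphonic is hypothetical.

```python
import numpy as np

def bar_to_polyphonic(notes, n_tracks=4, n_pitches=128, n_steps=16):
    """Encode one bar as a (tracks, pitches, steps) binary tensor.

    `notes` holds (track, pitch, start_step, end_step); several notes of the
    same track may be active at the same step, so chords are kept.
    """
    bar = np.zeros((n_tracks, n_pitches, n_steps), dtype=np.int8)
    for track, pitch, start, end in notes:
        if track < n_tracks:  # tracks beyond the first n_tracks are dropped (assumption)
            bar[track, pitch, start:min(end, n_steps)] = 1
    return bar

# A C-major chord on track 0 plus a bass note on track 1, all held for 4 steps
notes = [(0, 60, 0, 4), (0, 64, 0, 4), (0, 67, 0, 4), (1, 36, 0, 4)]
bar = bar_to_polyphonic(notes)
print(bar.shape, bar[0, :, 0].sum())  # (4, 128, 16) 3
```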

The file **PolyphonicPreprocessing.py** contains all the functions used to preprocess the clean_midi dataset and build the dataset of mapped songs, categorized by genre (using the genre recognition dataset built before). Here we process and store the polyphonic dataset.

In [2]:
PolyphonicDataset = PolyphonicPreProcessing(nDir = 2000)
torch.save(PolyphonicDataset, 'PolyphonicDataset_Percussion.pt')

Preprocessing:   3%|▎         | 62/2000 [00:12<06:27,  5.00it/s]


KeyboardInterrupt: 

These are all the genres and the number of bars for each genre:

In [6]:
PolyphonicDataset = torch.load('PolyphonicDataset.pt', weights_only=False)

for key in PolyphonicDataset:
    print(key, '', len(PolyphonicDataset[key]))

# The data is reloaded later through the DataLoader class, so free some memory
# del PolyphonicDataset
# gc.collect()

disco  14828
country  9742
rock  62342
jazz  9811
classical  9943
pop  8046
blues  1513
raggae  3262
hiphop  228
metal  88


In [13]:
PolyphonicDataset['disco'][0]['Bars'][0]

tensor(indices=tensor([[ 2,  2,  2],
                       [56, 61, 63],
                       [ 6,  4,  0]]),
       values=tensor([1, 1, 1]),
       size=(4, 128, 16), nnz=3, layout=torch.sparse_coo)

In [11]:
np.shape(PolyphonicDataset['disco'][0]['Bars'][0].to_dense())

torch.Size([4, 128, 16])
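The sparse COO layout stores only the non-zero cells: a dense bar has 4 x 128 x 16 = 8192 cells, while the bar printed above has nnz=3. As a sketch of what to_dense() reconstructs, the same indices and values can be scattered into a NumPy array; reading the three index rows as (track, pitch, time step) follows from the (4, 128, 16) size.

```python
import numpy as np

# Indices and values copied from the sparse bar printed above:
# row 0 = track, row 1 = pitch, row 2 = time step.
indices = np.array([[2, 2, 2],
                    [56, 61, 63],
                    [6, 4, 0]])
values = np.array([1, 1, 1])

dense = np.zeros((4, 128, 16), dtype=values.dtype)
dense[indices[0], indices[1], indices[2]] = values  # scatter the non-zeros
print(dense.shape, int(dense.sum()))   # (4, 128, 16) 3
print(np.argwhere(dense[2]).tolist())  # [[56, 6], [61, 4], [63, 0]]
```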

We can repeat the same preprocessing while also saving the velocity information of the bars:

In [7]:
PolyphonicDatasetVelocity = PolyphonicPreProcessing(nDir = 2000, Velocity=True)
torch.save(PolyphonicDatasetVelocity, 'PolyphonicDatasetVelocity.pt')

Preprocessing: 100%|██████████| 2000/2000 [07:19<00:00,  4.55it/s] 
100%|██████████| 10/10 [00:00<00:00, 12.09it/s]
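One plausible velocity encoding, sketched here under assumptions, stores each note's MIDI velocity (1-127) in the cells it occupies instead of a 1, and rescales to [0, 1] before feeding a network. The note format, the function name bar_with_velocity, and the division by 127 are hypothetical; the Velocity=True option of PolyphonicPreProcessing may store velocity differently.

```python
import numpy as np

def bar_with_velocity(notes, n_tracks=4, n_pitches=128, n_steps=16):
    """Like the binary encoding, but store each note's MIDI velocity (1-127)."""
    bar = np.zeros((n_tracks, n_pitches, n_steps), dtype=np.uint8)
    for track, pitch, start, end, velocity in notes:
        bar[track, pitch, start:min(end, n_steps)] = velocity
    return bar

bar = bar_with_velocity([(0, 60, 0, 4, 100), (1, 36, 0, 8, 64)])
scaled = bar.astype(np.float32) / 127.0   # rescale to [0, 1] for a network input
print(bar.max(), round(float(scaled.max()), 3))  # 100 0.787
```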


We can then load the dataset with the PolyphonicDataset class in **DatasetLoader.py**, selecting a genre:

In [None]:
PolData = DL.PolyphonicDataset(Genre = 'jazz')
PolTrainData = DataLoader(PolData, batch_size=30, shuffle=True, num_workers=0, collate_fn=cf)