<a href="https://colab.research.google.com/github/SirawitC/NLP-based-music-processing-for-composer-classification/blob/main/Musical_AI_Composer_classification_with_NLP_based_approaches.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Copyright © 2022 Somrudee Deepaisarn, Sirawit Chokphantavee, Sorawit Chokphantavee, Phuriphan Prathipasen, Suphachok Buaruk, and Virach Sornlertlamvanich, the authors of this computer program, which was contributed to the project 'Natural processing of music for composer classification'. All rights reserved.

# Musical AI: Composer classification with NLP-based approaches

## Dependencies

### Libraries installation

We install three additional packages here using **pip**, the Python package installer:


*   **pretty_midi** - a library that contains various utility functions for handling MIDI files
 *   reference: [Intuitive Analysis, Creation and Manipulation of MIDI Data with pretty_midi](https://colinraffel.com/publications/ismir2014intuitive.pdf)
*   **sentencepiece** - an unsupervised text tokenizer that allows end-to-end text tokenization without language dependency
 *   reference: [SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing](https://aclanthology.org/D18-2012.pdf)
*   **tensorflow-addons** - a library of useful extra functionality for TensorFlow
 *   reference: [TensorFlow Addons](https://www.tensorflow.org/addons)






In [None]:
!pip install pretty_midi
!pip install sentencepiece
!pip install tensorflow-addons

### Import libraries

In [None]:
from google.colab import drive
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pretty_midi
import collections
from typing import Dict, List, Optional, Sequence, Tuple
import math
import string
import gensim
import ast
import sentencepiece as spm
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

import tensorflow as tf
import tensorflow_addons as tfa
from keras.wrappers.scikit_learn import KerasClassifier
import keras.backend as K

## Load data

Here, we mount Google Drive in our Colab notebook to access all the data.

In [None]:
drive.mount('/content/drive/')

Next, we specify our `music_path_prefix` which will be used later in this notebook.

In [None]:
music_path_prefix = "/content/drive/MyDrive/Musical AI/Dataset/maestro-v3.0.0/"

Then we read **maestroV3.csv**, which contains all the information in the MAESTRO Dataset v3, using the `read_csv` function from the pandas library. 

*   **maestro_data** - contains all the information from the MAESTRO Dataset in `pandas.DataFrame` format
*   **composer_100** - contains the rows for the 5 composers of interest: Franz Liszt, Franz Schubert, Frédéric Chopin, Johann Sebastian Bach, and Ludwig van Beethoven.



In [None]:
maestro_data = pd.read_csv("/content/drive/MyDrive/Musical AI/Dataset/maestroV3.csv")
composer_100 = maestro_data[maestro_data["canonical_composer"].isin(["Franz Liszt","Franz Schubert","Frédéric Chopin","Johann Sebastian Bach","Ludwig van Beethoven"])]

In [None]:
# all_gram_list = np.load("/content/drive/MyDrive/Musical AI/Dataset/all_gram_list.npy")
# all_gram_list_vel = np.load("/content/drive/MyDrive/Musical AI/Dataset/all_gram_list_vel.npy")

## Create utility functions

**midi_to_notes** - this function extracts useful musical characteristics such as "pitch", "note start", "note end", and "velocity" from music in MIDI format.
*   Input - file path of the MIDI file of interest
*   Output - `pandas.DataFrame` containing the musical features (pitch, start, end, step, duration, velocity)






In [None]:
def midi_to_notes(midi_file: str) -> pd.DataFrame:
  pm = pretty_midi.PrettyMIDI(midi_file)
  instrument = pm.instruments[0]
  notes = collections.defaultdict(list)

  sorted_notes = sorted(instrument.notes, key=lambda note: note.start)
  prev_start = sorted_notes[0].start

  for note in sorted_notes:
    start = note.start
    end = note.end
    notes['pitch'].append(note.pitch)
    notes['start'].append(start)
    notes['end'].append(end)
    notes['step'].append(start - prev_start)
    notes['duration'].append(end - start)
    notes['velocity'].append(note.velocity)
    prev_start = start
  return pd.DataFrame({name: np.array(value) for name, value in notes.items()})

**extract_gram** - this function extracts the MIDI information of each note into a tuple `(Pi, Di)` \
> where: \
 Pi is the pitch number of the note \
 Di is the duration for which the note is played, rounded to two decimal places

*   Input - `midi_frame`, the DataFrame containing the note information of one music piece
*   Output - a `list` containing all the extracted tuples of that piece



In [None]:
def extract_gram(midi_frame):
  gram_list = []
  temp = []
  s_time = 0
  for i in range(midi_frame.shape[0]):
    pitch = midi_frame["pitch"][i]
    ti = round(midi_frame["duration"][i],2)
    gram = (pitch,ti)
    
    if((not temp) or (midi_frame["start"][i] - s_time <= 0.003)):
      temp.append(gram)
      if(len(temp) == 1):
        s_time = midi_frame["start"][i]
      if(i == midi_frame.shape[0] - 1):
        gram_list += temp
    else:
      sorted_list = sorted(temp, key=lambda tup: tup[0], reverse=True)
      sorted_list.append(gram)
      gram_list += sorted_list
      temp.clear()
      s_time = 0

  return gram_list
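The idea of treating near-simultaneous notes as a chord and ordering them from highest to lowest pitch can be sketched on toy data. This is a simplified illustration and does not reproduce the exact control flow of `extract_gram`; the function name `group_chords`, the `Note` alias, and the toy values are assumptions made for the example, while the 0.003 s tolerance and the two-decimal rounding are taken from the cell above.

```python
from typing import List, Tuple

Note = Tuple[int, float, float]  # (pitch, start, end); toy values only

def group_chords(notes: List[Note], tol: float = 0.003) -> List[Tuple[int, float]]:
    """Return (pitch, duration) tuples, ordering near-simultaneous notes high to low."""
    grams: List[Tuple[int, float]] = []
    chord: List[Tuple[int, float]] = []
    chord_start = 0.0
    for pitch, start, end in sorted(notes, key=lambda n: n[1]):
        gram = (pitch, round(end - start, 2))
        if chord and start - chord_start > tol:
            # Onset is too far from the current chord: flush it and start a new one
            grams += sorted(chord, key=lambda g: g[0], reverse=True)
            chord = []
        if not chord:
            chord_start = start
        chord.append(gram)
    # Flush the last chord
    grams += sorted(chord, key=lambda g: g[0], reverse=True)
    return grams

# A C major triad played (almost) together, followed by a single higher note
toy_notes = [(60, 0.000, 0.50), (64, 0.001, 0.50), (67, 0.002, 0.50), (72, 0.600, 0.85)]
grams = group_chords(toy_notes)
print(grams)  # [(67, 0.5), (64, 0.5), (60, 0.5), (72, 0.25)]
```

Ordering the notes of a chord by pitch gives the same chord the same token sequence regardless of tiny timing differences between its onsets.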

**encodeChinese** - this function encodes an integer (`int`) as a Chinese Unicode character.
*   Input - an integer index (offset from code point `0x4e00`)
*   Output - Chinese Unicode character



In [None]:
def encodeChinese(index_number):
  val = index_number + 0x4e00
  return chr(val)
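As a quick check of this encoding, the sketch below re-declares the same offset mapping under the hypothetical name `encode_index` and adds an assumed inverse, `decode_char`, which is not part of the notebook. Offsets 0 through 20,991 fall inside the CJK Unified Ideographs block (U+4E00 to U+9FFF); larger offsets still yield valid characters, only outside that block.

```python
CJK_BASE = 0x4E00  # first code point of the CJK Unified Ideographs block
CJK_LAST = 0x9FFF  # last code point of the same block

def encode_index(index_number: int) -> str:
    """Map an integer offset to a single Unicode character."""
    return chr(index_number + CJK_BASE)

def decode_char(ch: str) -> int:
    """Hypothetical inverse of encode_index (not in the notebook)."""
    return ord(ch) - CJK_BASE

block_capacity = CJK_LAST - CJK_BASE + 1
first = encode_index(0)
round_trip = [decode_char(encode_index(i)) for i in range(5)]
print(first, block_capacity, round_trip)
```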

**get_sentence_vec_avg** - this function transforms each tokenized word/subword list into its average vector representation
*   Input - list of tokenized word/subword lists and the `Word2Vec` model
*   Output - list of vector representations, one per music piece



In [None]:
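def get_sentence_vec_avg(sentences,model):
  l = []
  for sentence in sentences:
    # Start from a zero vector for every sentence and sum its token vectors
    # (model.wv[...] works with both gensim 3.x and 4.x)
    temp = np.zeros(model.wv.vector_size)
    for word in sentence:
      try:
        temp += model.wv[word]
      except KeyError:
        print("Not in vocab")
    # Divide the sum by the number of tokens to get the average vector
    l.append(temp/len(sentence))
  return l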

**get_sentence_vec_avg_with_cov2** - this function transforms each tokenized word/subword list into its average vector concatenated with its standard deviation (SD) vector
*   Input - list of tokenized word/subword lists and the `Word2Vec` model
*   Output - list of vector representations, one per music piece



In [None]:
def get_sentence_vec_avg_with_cov2(sentences,model):
  l = []
  for sentence in sentences:
    # Reset the accumulators for every sentence so that each song
    # is summarized from its own tokens only
    temp = np.zeros(model.wv.vector_size)
    cov = []
    for word in sentence:
      try:
        temp += model.wv[word]
        cov.append(model.wv[word])
      except KeyError:
        print("Not in vocab")
    data = np.array(cov)
    sd = np.std(data,axis=0)
    z = temp/len(sentence)
    # Concatenate the average vector with the SD vector
    z = np.concatenate([z, sd])
    l.append(z)
  return l

**get_sentence_vec_SD_only** - this function transforms each tokenized word/subword list into its standard deviation (SD) vector representation
*   Input - list of tokenized word/subword lists and the `Word2Vec` model
*   Output - list of vector representations, one per music piece



In [None]:
def get_sentence_vec_SD_only(sentences,model):
  l = []
  for sentence in sentences:
    # Collect the token vectors of this sentence only
    cov = []
    for word in sentence:
      try:
        cov.append(model.wv[word])
      except KeyError:
        print("Not in vocab")
    data = np.array(cov)
    # Per-dimension standard deviation over the tokens of the sentence
    sd = np.std(data,axis=0)
    l.append(sd)
  return l
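The three pooling schemes described above (average, SD, and average concatenated with SD) can be illustrated on toy vectors without a trained model. This is a minimal sketch: the 3-dimensional `token_vectors` are made-up values standing in for the Word2Vec vectors of one song's tokens, and `np.std` is used with its default population formula, as in the functions above.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for the three tokens of one song
token_vectors = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [5.0, 2.0, 2.0],
])

avg_vec = token_vectors.mean(axis=0)            # average pooling
sd_vec = token_vectors.std(axis=0)              # SD pooling (population SD)
avg_sd_vec = np.concatenate([avg_vec, sd_vec])  # average concatenated with SD

print(avg_vec)           # [3. 2. 2.]
print(sd_vec)
print(avg_sd_vec.shape)  # (6,)
```

The concatenated representation has twice the embedding dimension, so a classifier trained on it sees both the central tendency and the spread of the token vectors of a song.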

**createLabel** - this function creates the list of composer labels associated with each music piece represented as a vector
*   Input - list of the vector representations of all music pieces 
*   Output - the music vectors of the selected composers and their associated classification labels




In [None]:
def createLabel(sentenceslst):
  composer_music = maestro_data.groupby(["canonical_composer"])["canonical_title"].count().to_frame()
  composer_music.columns = ["canonical_title"]
  composer_music.reset_index(inplace=True)
  composer_list= []
  for i in range(composer_music.shape[0]):
    composer_list.append(composer_music.iloc[i][0])
  composer_map = { j:i for i,j in enumerate(composer_list)}
  data = []
  label = []
  for i in composer_100.index.tolist():
    try:
      data.append(sentenceslst[i])
      label.append(composer_map[maestro_data.iloc[i]["canonical_composer"]])
    except:
      print("Error",i)
  return data, label

**get_f1** - this function calculates the F1-score, which is used as the evaluation metric for the MLP model.


*   Inputs - ground truth labels, prediction results 
*   Output - F1-score (ranging from 0.00 to 1.00)



In [None]:
def get_f1(y_true, y_pred): #taken from old keras source code
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
    return f1_val
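To make the arithmetic of this metric concrete, the sketch below mirrors it in plain Python on a toy batch. The name `batch_f1` and the sample values are assumptions for illustration, and `eps` stands in for `K.epsilon()`. Note that probabilities are rounded, so a softmax row with no entry above 0.5 contributes no predicted positive at all.

```python
from typing import List

def batch_f1(y_true: List[List[float]], y_pred: List[List[float]], eps: float = 1e-7) -> float:
    """Plain-Python mirror of the rounding-based F1, computed over the whole batch."""
    def clip_round(v: float) -> int:
        return round(min(max(v, 0.0), 1.0))

    true_positives = sum(clip_round(t * p)
                         for row_t, row_p in zip(y_true, y_pred)
                         for t, p in zip(row_t, row_p))
    possible_positives = sum(clip_round(t) for row in y_true for t in row)
    predicted_positives = sum(clip_round(p) for row in y_pred for p in row)
    precision = true_positives / (predicted_positives + eps)
    recall = true_positives / (possible_positives + eps)
    return 2 * precision * recall / (precision + recall + eps)

# Four one-hot targets and four toy softmax outputs
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.4, 0.3, 0.3]]

score = batch_f1(y_true, y_pred)
print(round(score, 4))  # precision 2/3, recall 2/4, F1 = 4/7
```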

## Data preparation

### Variable initialization

Extract all tuples `(Pi, Di)` from all songs in the dataset and record them into a list called **"all_gram_list"**.

In [None]:
all_gram_list = []
for i in range(maestro_data.shape[0]):
  suffix_path = maestro_data["midi_filename"][i]
  path = music_path_prefix + suffix_path
  frame = midi_to_notes(path)
  gram_list = extract_gram(frame)
  all_gram_list = all_gram_list + gram_list

Then, save **"all_gram_list"** as a `numpy.array` so that it can be reloaded later without repeating the extraction.

In [None]:
np.save("/content/drive/MyDrive/Musical AI/Dataset/all_gram_list_note_as_char2.npy",all_gram_list)

Load the saved `numpy.array` into the variable called **"all_gram_list"**.

In [None]:
all_gram_list = np.load("/content/drive/MyDrive/Musical AI/Dataset/all_gram_list_note_as_char2.npy")

Sanity check: try printing out the partial result of **"all_gram_list"** that we loaded.

In [None]:
all_gram_list[0:20]

Here, we sort the **"all_gram_list"**, then use that sorted list to create a dictionary for mapping the note tuple into a Chinese character and vice versa. 

In [None]:
sorted_gram_list = sorted(set(tuple(i) for i in all_gram_list.tolist()))
note2Ch = { j:encodeChinese(i) for i,j in enumerate(sorted_gram_list)}
Ch2note =  { encodeChinese(i):j for i,j in enumerate(sorted_gram_list)}
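The effect of these two dictionaries can be seen on a handful of made-up note tuples. In this sketch `toy_grams`, `gram_to_char`, and `char_to_gram` are hypothetical stand-ins for `all_gram_list`, `note2Ch`, and `Ch2note`; a song becomes a string with one character per note and can be decoded back without loss.

```python
BASE = 0x4E00  # same offset as in encodeChinese

# Hypothetical (pitch, duration) tuples of one short song
toy_grams = [(60, 0.5), (64, 0.25), (60, 0.5), (67, 1.0), (64, 0.25)]

# Sort the distinct tuples so that the mapping is deterministic
vocab = sorted(set(toy_grams))
gram_to_char = {g: chr(BASE + i) for i, g in enumerate(vocab)}
char_to_gram = {c: g for g, c in gram_to_char.items()}

encoded = "".join(gram_to_char[g] for g in toy_grams)
decoded = [char_to_gram[c] for c in encoded]

print(encoded)
print(decoded == toy_grams)  # True
```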

### Build corpus

**extractInfoToTxt** - this function will extract the note tuples, map them into Chinese characters, and record them into a text file.


*   Input - name of the file
*   Output - a file with the specified name in `.txt` format



In [None]:
def extractInfoToTxt(filename):
  text = ''
  for i in range(maestro_data.shape[0]):
    suffix_path = maestro_data["midi_filename"][i]
    path = music_path_prefix + suffix_path
    frame = midi_to_notes(path)
    gram_list = extract_gram(frame)
    for j in gram_list:
      text += note2Ch[j]
    text += '\n'
  # Write as UTF-8 so that the Chinese characters are stored consistently
  with open(filename, "w", encoding="utf-8") as f:
    f.write(text)

Utilize the function to build the text corpus.

In [None]:
extractInfoToTxt("CorpusNoteAsChar2.txt")

## Data preprocessing

### SentencePiece

**sentencePiece** - this function builds a subword tokenizer model following the NLP-based approach SentencePiece ([Taku Kudo & John Richardson](https://arxiv.org/abs/1808.06226)), then uses the model to tokenize the corpus.


*   Inputs - file path of the corpus, preferred name of the model, size of the vocabulary, and maximum sentence length 
*   Output - list of all tokenized sentences (songs)



In [None]:
def sentencePiece(corpus, modelName, vocabSize, maxSenLength):
  # Train the SentencePiece model on the corpus, then load it
  spm.SentencePieceTrainer.train(input=corpus, model_prefix=modelName, vocab_size=vocabSize, max_sentence_length=maxSenLength)
  sp = spm.SentencePieceProcessor()
  sp.load(modelName+".model")
  # Tokenize the corpus line by line (one line per song) instead of
  # relying on a hard-coded number of songs
  sentences = []
  with open(corpus, 'r', encoding='utf-8') as f1:
    for line in f1:
      sentences.append(sp.encode_as_pieces(line))
  return sentences

### Word2Vec

**Word2Vec** - this function trains a Word2Vec model, as proposed by [Mikolov et al.](https://arxiv.org/pdf/1301.3781.pdf), on the tokenized sentences and then pools the subword vectors of each sentence into a single representation vector. Note that here we use the Word2Vec approach with the skip-gram algorithm. 


*   Inputs - skip-gram window size, list of all tokenized sentences, `Avg` (boolean), `SD` (boolean) 


All valid combinations of Avg and SD:
> `Avg = True, SD = False` : each song is the average vector of all its subword vectors. \
`Avg = False, SD = True`: each song is the SD vector of all its subword vectors. \
`Avg = True, SD = True` : each song is the average vector concatenated with the SD vector. 

With `Avg = False, SD = False` the function returns `None`.

*   Output - list of the vector representations of all songs.



In [None]:
def Word2Vec(Window, sentences, Avg = False, SD = False):
  model = gensim.models.Word2Vec(
    sentences=sentences,
    window=Window,
    min_count=1,
    workers=4,
    sg = 1
  )
  if(Avg and SD):
      sentenceLstAvgwithCov = get_sentence_vec_avg_with_cov2(sentences,model)
      return sentenceLstAvgwithCov
  elif(Avg and (not SD)):
      sentencesLstAvg = get_sentence_vec_avg(sentences,model)
      return sentencesLstAvg
  elif((not Avg) and SD):
      sentencesLstSD = get_sentence_vec_SD_only(sentences,model)
      return sentencesLstSD

## Classification

### Main function

**Main** - this function integrates the subfunctions and utility functions so that experiments with varying parameters can be run easily. The classification algorithms used in this function are K-Nearest Neighbor (KNN), Random Forest Classifier (RFC), Logistic Regression (LR), Support Vector Machine (SVM), and Multilayer Perceptron (MLP).


*   Inputs - corpus file path, model name (the examples use "m"), vocabulary size for SentencePiece, maximum sentence length for SentencePiece, skip-gram window size, and the `Avg` and `cov` flags (`cov` is passed to `Word2Vec` as `SD`), combined as described in the Word2Vec section 
*   Outputs - test F1-scores and the `Avg` flag, returned in the order KNN, RFC, LR, `Avg`, SVM, MLP



In [None]:
def Main(corpus, modelName, vocabSize, maxSenLength, window, Avg=False, cov=False):
  sentences = sentencePiece(corpus, modelName, vocabSize, maxSenLength)
  sentencesLst = Word2Vec(window, sentences, Avg, cov)
  data, label = createLabel(sentencesLst)
  sentence_w_label_100 = pd.DataFrame({"sentence": data, "label":label})
  map_label = {11:0, 14:1, 18:2, 31:3, 40:4}
  # Create training data
  X = []
  y = []
  for i in composer_100.index.tolist():
      X.append(sentencesLst[i])
  for i in sentence_w_label_100["label"].tolist():
    y.append(map_label[i])
  PredictorScaler=StandardScaler()
  PredictorScalerFit=PredictorScaler.fit(X)
  X=PredictorScalerFit.transform(X)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
  y_train_mlp = tf.one_hot(y_train, 5)
  y_test_mlp = tf.one_hot(y_test, 5)
  y_train = np.array(y_train)
  y_test = np.array(y_test)
  #KNN
  clf = KNeighborsClassifier(n_neighbors=5)
  KNN=clf.fit(X_train,y_train)
  prediction=KNN.predict(X_test)
  F1_Score_knn=metrics.f1_score(y_test, prediction, average='weighted')
  #Random forest
  clf = RandomForestClassifier(max_depth=500, random_state=0)
  RFC = clf.fit(X_train, y_train)
  prediction=RFC.predict(X_test)
  F1_Score_RFC=metrics.f1_score(y_test, prediction, average='weighted')
  #Logistic regression
  log = LogisticRegression(random_state=0, max_iter=300).fit(X_train, y_train)
  prediction=log.predict(X_test)
  F1_Score_Logistic=metrics.f1_score(y_test, prediction, average='weighted')
  #SVM
  svm = SVC(random_state=0).fit(X_train, y_train)
  predictionSVM = svm.predict(X_test)
  F1_Score_SVM=metrics.f1_score(y_test, predictionSVM, average='weighted')
  #MLP
  array_shape = len(X[0])
  model = tf.keras.models.Sequential([
  tf.keras.layers.Input(shape=(array_shape,)),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.BatchNormalization(),
  tf.keras.layers.Dense(256,activation='tanh',
    kernel_regularizer = tf.keras.regularizers.l2(1e-4)),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.BatchNormalization(),
  tf.keras.layers.Dense(128,activation='tanh',
    kernel_regularizer = tf.keras.regularizers.l2(1e-4)),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.BatchNormalization(),
  tf.keras.layers.Dense(5,activation='softmax')
  ])
  model.compile(optimizer='adam', 
             loss=tf.keras.losses.CategoricalCrossentropy(),
             metrics=[get_f1])
  model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
             filepath = '/content/checkpoint',
             save_weights_only = True,
             monitor = 'val_get_f1',
             mode = 'max',
             save_best_only = True
  )
  history = model.fit(X_train, y_train_mlp, epochs=200, batch_size=256,
                    shuffle=True, validation_split=0.2, callbacks=[model_checkpoint_callback])
  model.load_weights('/content/checkpoint')
  test_loss, F1_Score_MLP = model.evaluate(X_test,y_test_mlp,verbose=0)
  return F1_Score_knn, F1_Score_RFC, F1_Score_Logistic, Avg, F1_Score_SVM, F1_Score_MLP
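The scikit-learn classifiers above are scored with `average='weighted'`, which averages the per-class F1-scores using the number of true samples of each class as weights. The sketch below reproduces that calculation in plain Python on toy labels; the function name `weighted_f1` and the labels are assumptions for illustration, not part of the notebook.

```python
from collections import Counter
from typing import Sequence

def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Support-weighted mean of the per-class F1-scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n  # weight each class by its number of true samples
    return total / len(y_true)

# Toy labels for three classes with supports 3, 2, and 1
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

result = weighted_f1(y_true, y_pred)
print(round(result, 4))  # per-class F1: 2/3, 0.8, 0.0 -> weighted mean 0.6
```

Because the weights follow class frequency, a composer with few test pieces has little influence on this score, which is worth keeping in mind when comparing it with the batch-level F1 used for the MLP.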

## Implementation example

AVG

In [None]:
KNN, RFC, Logis, Avg, SVM, MLP = Main("CorpusNoteAsChar2.txt","m",13000,5000,5,True,False)
print("Avg:", Avg)
print("F1-score of KNN:",KNN)
print("F1-score of RFC:",RFC)
print("F1-score of Logistic:",Logis)
print("F1-score of SVM:",SVM)
print("F1-score of MLP:",MLP)

SD

In [None]:
KNN, RFC, Logis, Avg, SVM, MLP = Main("CorpusNoteAsChar2.txt","m",13000,5000,5,False,True)
print("Avg:", Avg)
print("F1-score of KNN:",KNN)
print("F1-score of RFC:",RFC)
print("F1-score of Logistic:",Logis)
print("F1-score of SVM:",SVM)
print("F1-score of MLP:",MLP)

AVG + SD

In [None]:
KNN, RFC, Logis, Avg, SVM, MLP = Main("CorpusNoteAsChar2.txt","m",13000,5000,5,True,True)
print("Avg:", Avg)
print("F1-score of KNN:",KNN)
print("F1-score of RFC:",RFC)
print("F1-score of Logistic:",Logis)
print("F1-score of SVM:",SVM)
print("F1-score of MLP:",MLP)
