# Training and Evaluation

This notebook trains and evaluates a naive Bayes model. The model is evaluated with 10 repetitions of 7-fold cross-validation, and the result is the mean accuracy across all repetitions.

## The Model


The model to be trained is a naive Bayes classifier. The features were extracted using the procedure explained in:
https://colab.research.google.com/drive/1FFOlA5Q2O5TdxpVnBq7qKkHAbK6a0fvj?usp=sharing

Each frequency energy is a real number assumed to be normally distributed, and the paired family_pitch labels are assumed to follow a categorical distribution.

For each feature $X_i$ of the feature vector $X$, the likelihood is the normal density: $$\mathcal{N}(X_i;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(X_i - \mu)^2}{2\sigma^2}}$$

And for the paired labels $Y$: $$Y \leftarrow \arg\max_{y_k} P(Y = y_k)\prod_{i=1}^{d}{P(X_i|Y = y_k)}$$
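
Because the logarithm is monotonic, the same decision rule can also be written in log space, which turns the product into a sum and is a common way to avoid numerical underflow when $d$ is large. This is a standard reformulation rather than what the code in this notebook does; $\mu_{ik}$ and $\sigma_{ik}^2$ denote the mean and variance of feature $i$ for class $y_k$ (the indices are added here for the sketch):

$$Y \leftarrow \arg\max_{y_k} \left[\log P(Y = y_k) + \sum_{i=1}^{d}\left(-\frac{1}{2}\log(2\pi\sigma_{ik}^2) - \frac{(X_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right)\right]$$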

## Implementation


In [0]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
%matplotlib inline
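random.seed(2020)
# The cross-validation below shuffles with np.random.shuffle, so NumPy's
# generator needs its own seed for the runs to be reproducible
np.random.seed(2020)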

In [21]:
# We get the feature dataset
!wget -O family_note_features.csv https://raw.githubusercontent.com/Sirivasv/MCC-AA/master/ProyectoFinal/family_note_features.csv

--2020-06-04 09:00:31--  https://raw.githubusercontent.com/Sirivasv/MCC-AA/master/ProyectoFinal/family_note_features.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43994342 (42M) [text/plain]
Saving to: ‘family_note_features.csv’


2020-06-04 09:00:40 (5.16 MB/s) - ‘family_note_features.csv’ saved [43994342/43994342]



In [2]:
# We show the head of the dataset
family_notes_df = pd.read_csv("family_note_features.csv")
family_notes_df.head()

Unnamed: 0,segment_name,note_24,note_25,note_26,note_27,note_28,note_29,note_30,note_31,note_32,...,note_99,note_100,note_101,note_102,note_103,note_104,note_105,note_106,note_107,NOTE_CLASS
0,string_acoustic_011-026-127_seg_0,0.060915,0.078695,0.04824,0.102672,0.00266,0.067364,0.016569,0.055188,0.04317,...,1e-06,1.400782e-07,1.129644e-07,7.845241e-07,3.064818e-07,8.880471e-07,3.291851e-07,7.803614e-07,5.115749e-07,family_string_note_26
1,string_acoustic_011-026-127_seg_1,0.055939,0.078754,0.068669,0.108313,0.03569,0.061008,0.031879,0.059526,0.042898,...,0.00091,0.001075201,0.0004154219,0.0001315951,6.02934e-05,4.100908e-05,0.0001625531,0.000225503,0.0001986155,family_string_note_26
2,string_acoustic_011-026-127_seg_2,0.046108,0.077479,0.109768,0.124337,0.054667,0.049494,0.049995,0.061188,0.055727,...,0.001135,0.001278791,0.0002765147,0.000701275,0.0008349395,0.0006810263,0.0006534754,0.0004423557,0.0009610463,family_string_note_26
3,string_acoustic_011-026-127_seg_3,0.04105,0.077621,0.157035,0.150372,0.053375,0.045698,0.056476,0.05591,0.063097,...,0.00344,0.005911199,0.005680938,0.004629928,0.003378918,0.001862395,0.001057562,0.0009777358,0.002081144,family_string_note_26
4,string_acoustic_011-026-127_seg_4,0.045203,0.084967,0.202651,0.18241,0.046832,0.054968,0.051309,0.043172,0.059405,...,0.003629,0.001441651,0.001284077,0.003948394,0.002435748,0.0006445605,0.0005838733,0.001244309,0.002766589,family_string_note_26


Each category has the same number of samples, so under a categorical distribution the prior $P(Y=y)$ of the naive Bayes classifier is the same for every $y\in Y$: it is $\frac{N_y}{N}$, where $N_y$ is the number of samples labeled $y$ and $N$ is the total number of samples.
$$ \frac{224}{24864} = \frac{1}{111} \approx 0.009009$$
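
The balanced-class claim can be checked directly from the label column. The following is a minimal sketch on a toy label series; the labels and counts are made up for illustration and stand in for the `NOTE_CLASS` column.

```python
import pandas as pd

# Toy stand-in for the NOTE_CLASS column (hypothetical labels, 4 samples each)
toy_labels = pd.Series(
    ["family_string_note_24"] * 4
    + ["family_guitar_note_24"] * 4
    + ["family_brass_note_24"] * 4,
    name="NOTE_CLASS",
)

# Empirical prior P(Y = y) = N_y / N for every class
priors = toy_labels.value_counts(normalize=True)

# With balanced classes every prior equals 1 / (number of classes)
is_uniform = bool((priors - 1.0 / priors.size).abs().max() < 1e-12)
print(priors)
print("uniform prior:", is_uniform)
```

On the real data the same check would presumably be `family_notes_df["NOTE_CLASS"].value_counts(normalize=True)`, where each of the 111 classes should come out at 224/24864.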

In [0]:
# Define hyperparameters and probability for all classes
p_y_num = 224.0
p_y_den = 24864.0

k_cv = 7 # k-fold cross-validation
cv_repetitions = 10 # Repetitions for the cross-validation
family_names = ["string", "guitar", "brass"] # The instrument families to identify
n_families = 3
n_pitches = 37 # The number of pitches in the identifiable range for the classifier
start_pitch = 24 # The starting pitch in MIDI notation (e.g. 24 for C1)
feature_pitches = 84 # The range of pitches used as features
n_frames = 16.0 # Frames taken from each pitch for each instrument

# We initialize the dictionaries to store the means and variances of each
# category
means_per_category = {}
variances_per_category = {}

In [0]:
# Get the instruments present for each family-pitch paired label
# (groupby is used because DataFrame.append was removed in pandas 2.0)
instrument_samples_per_category = {}
for i in range(n_pitches):
  note_i = i + start_pitch
  for family_name in family_names:
    category_name = "family_" + family_name + "_note_" + str(note_i)
    samples_in_family_note = family_notes_df[family_notes_df["NOTE_CLASS"] == category_name]
    # The instrument source is the first three "_"-separated fields of segment_name
    instrument_source_names = samples_in_family_note["segment_name"].str.split("_").str[:3].str.join("_")
    # Group the segments by instrument source; the original row index is kept
    instrument_samples_per_category[category_name] = {
        source_name: source_df
        for source_name, source_df in samples_in_family_note.groupby(instrument_source_names, sort=False)
    }

In [0]:
# We build the list of feature column names (note_24 ... note_107)
feature_names = []
for feature_i in range(feature_pitches):
  feature_name = "note_" + str(start_pitch + feature_i)
  feature_names.append(feature_name)

In [0]:
# Get the family_notes key list
family_note_category_keys = np.array(list(instrument_samples_per_category.keys()))
instrument_sources_per_category_keys = {}
for family_note_category_key in family_note_category_keys:
  instrument_sources_per_category_keys[family_note_category_key] = np.array(list(instrument_samples_per_category[family_note_category_key].keys()))

In [0]:
# Define required functions
def Gaussian_PDF(x, mean_y, variance_y):
    # Evaluate the normal probability density with the given mean and variance at x
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))

    return p
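
A product of 84 densities can fall below the smallest representable float. The following is a minimal sketch of a log-density variant; `gaussian_log_pdf` is a hypothetical helper that is not part of this notebook, and the inputs are made-up values chosen to trigger the underflow.

```python
import numpy as np

def gaussian_log_pdf(x, mean_y, variance_y):
    # Logarithm of the normal density: -0.5*log(2*pi*var) - (x - mean)^2 / (2*var)
    return -0.5 * np.log(2.0 * np.pi * variance_y) - (x - mean_y) ** 2 / (2.0 * variance_y)

# Made-up example: 84 features, each observed 5 standard deviations from its mean
x = np.full(84, 0.5)
mean = np.zeros(84)
variance = np.full(84, 1e-2)

log_densities = gaussian_log_pdf(x, mean, variance)

# Multiplying the raw densities underflows to zero in float64 ...
product_of_densities = np.prod(np.exp(log_densities))
# ... while the sum of log densities stays finite and comparable across classes
sum_of_log_densities = np.sum(log_densities)

print(product_of_densities, sum_of_log_densities)
```

When every class score underflows to `0.0`, the comparison `current_prob > max_prob_seen` can no longer tell the classes apart, which is the motivation for comparing sums of logs instead.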

In [0]:
#for index, segment in instrument_samples_per_category['family_brass_note_60']['brass_acoustic_006-060-127'][feature_names].iterrows():
  #print(segment)
#print(means_per_category['family_string_note_24'])
#print(instrument_samples_per_category['family_brass_note_60']['brass_acoustic_006-060-127'][feature_names].mean())

In [80]:
# We initialize the lists where we save the accuracy percentages in both 
# test and training
ac_percentage_per_repetition_train = []
ac_percentage_per_repetition_test = []

# Run the training/evaluation loop
for i_repetition in range(cv_repetitions):

  # We traverse each category
  splited_instrument_sources_keys_per_category = {}
  for family_note_category_key in family_note_category_keys:
    # Randomly shuffle its instrument source keys 
    np.random.shuffle(
        instrument_sources_per_category_keys[family_note_category_key]
        )
    
    # Split the instrument sources in k_cv partitions
    splited_instrument_sources_keys_per_category[family_note_category_key] = \
      np.array(np.array_split(
          instrument_sources_per_category_keys[family_note_category_key],
          k_cv
          ))
    
  # We initialize the lists where we save the accuracy percentages in both 
  # test and training per fold
  ac_percentage_per_fold_train = []
  ac_percentage_per_fold_test = []

  # We traverse the partitions
  for test_partition in range(k_cv):

    # We initialize the dictionaries to store the means and variances of each
    # category
    means_per_category = {}
    variances_per_category = {}

    # Identifiers for test and train per category
    test_keys_per_category = {}
    train_keys_per_category = {}

    # We traverse the categories
    for family_note_category_key in family_note_category_keys:
      
      # We define the test and train keys
      test_keys_per_category[family_note_category_key] = \
        splited_instrument_sources_keys_per_category[family_note_category_key][test_partition]

      train_keys_per_category[family_note_category_key] = []
      for temp_partition_i in range(len(splited_instrument_sources_keys_per_category[family_note_category_key])):
        if (temp_partition_i == test_partition):
          continue        
        for element in splited_instrument_sources_keys_per_category[family_note_category_key][temp_partition_i]:
          train_keys_per_category[family_note_category_key].append(element)

      # We concatenate the training samples of this category into one dataframe
      # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
      current_category_train_df = pd.concat([
          instrument_samples_per_category[family_note_category_key][category_train_key][feature_names]
          for category_train_key in train_keys_per_category[family_note_category_key]
      ])
      
      # We concatenate the test samples of this category into one dataframe
      current_category_test_df = pd.concat([
          instrument_samples_per_category[family_note_category_key][category_test_key][feature_names]
          for category_test_key in test_keys_per_category[family_note_category_key]
      ])

      variances_per_category[family_note_category_key] = current_category_train_df.var()
      means_per_category[family_note_category_key] = current_category_train_df.mean()
     
    
    # Once we obtained the means and variances we can now traverse the train and test splits to get the predictions
    correct_predictions_in_partition_test = 0
    correct_predictions_in_partition_train = 0
    
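    # We traverse the train segments
    # NOTE: at this point current_category_train_df and current_category_test_df
    # hold only the last category visited by the loop above, so the accuracies
    # below are measured on that category's segments alone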
    for index, segment in current_category_train_df.iterrows():
      # We initialize the max values we have seen
      max_likely_cat = ""
      max_prob_seen = -1.0

      # We again traverse all categories to see which is the most likely
      for prediction_famnote_label in family_note_category_keys:
        current_prob = 1.0
        
        for feature_name in feature_names:
          current_prob *= Gaussian_PDF(
                segment[feature_name],
                means_per_category[prediction_famnote_label][feature_name],
                variances_per_category[prediction_famnote_label][feature_name]
              )
        
        current_prob *= p_y_num
        current_prob /= p_y_den

        if (current_prob > max_prob_seen):
          max_prob_seen = current_prob
          max_likely_cat = prediction_famnote_label
          
      if (max_likely_cat == family_notes_df.loc[index]['NOTE_CLASS']):
        correct_predictions_in_partition_train += 1
        
    # We traverse the test segments
    for index, segment in current_category_test_df.iterrows():
      # We initialize the max values we have seen
      max_likely_cat = ""
      max_prob_seen = -1.0

      # We again traverse all categories to see which is the most likely
      for prediction_famnote_label in family_note_category_keys:
        current_prob = 1.0
        
        for feature_name in feature_names:
          current_prob *= Gaussian_PDF(
                segment[feature_name],
                means_per_category[prediction_famnote_label][feature_name],
                variances_per_category[prediction_famnote_label][feature_name]
              )
        
        current_prob *= p_y_num
        current_prob /= p_y_den

        if (current_prob > max_prob_seen):
          max_prob_seen = current_prob
          max_likely_cat = prediction_famnote_label
          
      if (max_likely_cat == family_notes_df.loc[index]['NOTE_CLASS']):
        correct_predictions_in_partition_test += 1

    # We save per fold results
    ac_percentage_per_fold_train.append(correct_predictions_in_partition_train * 100 / current_category_train_df.shape[0])
    ac_percentage_per_fold_test.append(correct_predictions_in_partition_test * 100 / current_category_test_df.shape[0])
    
    # We print current result
    print ('Repetition {0}/{1} - Partition {2}/{3} - Accuracy TRAIN {4}% - Accuracy TEST {5}%'.format( 
        (i_repetition + 1), 
        cv_repetitions, 
        (test_partition + 1), 
        k_cv,
        ac_percentage_per_fold_train[-1],
        ac_percentage_per_fold_test[-1]
        )
    )

  # We save per repetition results
  ac_percentage_per_repetition_train.append(np.array(ac_percentage_per_fold_train).mean())
  ac_percentage_per_repetition_test.append(np.array(ac_percentage_per_fold_test).mean())

Repetition 1/10 - Partition 1/7 - Accuracy TRAIN 94.79166666666667% - Accuracy TEST 90.625%
Repetition 1/10 - Partition 2/7 - Accuracy TRAIN 91.66666666666667% - Accuracy TEST 81.25%
Repetition 1/10 - Partition 3/7 - Accuracy TRAIN 93.22916666666667% - Accuracy TEST 84.375%
Repetition 1/10 - Partition 4/7 - Accuracy TRAIN 94.27083333333333% - Accuracy TEST 87.5%
Repetition 1/10 - Partition 5/7 - Accuracy TRAIN 93.22916666666667% - Accuracy TEST 96.875%
Repetition 1/10 - Partition 6/7 - Accuracy TRAIN 93.75% - Accuracy TEST 96.875%
Repetition 1/10 - Partition 7/7 - Accuracy TRAIN 94.27083333333333% - Accuracy TEST 87.5%
Repetition 2/10 - Partition 1/7 - Accuracy TRAIN 95.3125% - Accuracy TEST 87.5%
Repetition 2/10 - Partition 2/7 - Accuracy TRAIN 93.75% - Accuracy TEST 84.375%
Repetition 2/10 - Partition 3/7 - Accuracy TRAIN 92.70833333333333% - Accuracy TEST 90.625%
Repetition 2/10 - Partition 4/7 - Accuracy TRAIN 92.70833333333333% - Accuracy TEST 100.0%
Repetition 2/10 - Partition 5/
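
The per-segment, per-category, per-feature loops above can also be expressed with NumPy broadcasting. The following is a minimal sketch on synthetic two-class data, working in log space; `predict_log_space` and all the toy values are assumptions for illustration, not the code that produced the results in this notebook.

```python
import numpy as np

def predict_log_space(X, means, variances, log_priors):
    # X: (n_samples, d); means and variances: (n_classes, d); log_priors: (n_classes,)
    diff = X[:, None, :] - means[None, :, :]
    # Per-feature Gaussian log densities, shape (n_samples, n_classes, d)
    log_lik = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    # Sum over features and add the log prior: one score per sample and class
    scores = log_lik.sum(axis=2) + log_priors[None, :]
    return scores.argmax(axis=1)

# Synthetic data: two well-separated classes with three features each
rng = np.random.default_rng(2020)
class_means = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
X_train = np.vstack([rng.normal(m, 1.0, size=(50, 3)) for m in class_means])
y_train = np.repeat([0, 1], 50)

# Per-class means and sample variances (ddof=1 matches the pandas .var() default)
means = np.array([X_train[y_train == k].mean(axis=0) for k in range(2)])
variances = np.array([X_train[y_train == k].var(axis=0, ddof=1) for k in range(2)])
log_priors = np.log(np.array([0.5, 0.5]))

predictions = predict_log_space(X_train, means, variances, log_priors)
accuracy = (predictions == y_train).mean() * 100
print("Accuracy TRAIN {0}%".format(accuracy))
```

With a uniform prior, as in this notebook, the `log_priors` term does not change the argmax; it is kept so the sketch follows the decision rule as stated.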

In [81]:
# We report the average percentage per repetition in train
ac_percentage_per_repetition_train

[93.60119047619048,
 93.75,
 93.0059523809524,
 93.67559523809523,
 93.37797619047619,
 93.52678571428574,
 94.04761904761905,
 93.67559523809523,
 93.75,
 93.97321428571429]

In [82]:
# We report the average percentage per repetition in test
ac_percentage_per_repetition_test

[89.28571428571429,
 90.17857142857143,
 87.5,
 87.94642857142857,
 87.94642857142857,
 88.39285714285714,
 87.94642857142857,
 89.73214285714286,
 88.39285714285714,
 88.83928571428571]

In [84]:
# We report the average of all repetitions in train
np.array(ac_percentage_per_repetition_train).mean()

93.63839285714286

In [85]:
# We report the average of all repetitions in test
np.array(ac_percentage_per_repetition_test).mean()

88.61607142857142
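
The spread across repetitions can be reported alongside the mean. The following is a small sketch using the per-repetition test accuracies printed above; the choice of median and sample standard deviation as extra summaries is an assumption about what is useful to report.

```python
import numpy as np

# Per-repetition test accuracies copied from the output above
test_accuracies = np.array([
    89.28571428571429, 90.17857142857143, 87.5, 87.94642857142857,
    87.94642857142857, 88.39285714285714, 87.94642857142857,
    89.73214285714286, 88.39285714285714, 88.83928571428571,
])

mean_accuracy = test_accuracies.mean()
median_accuracy = np.median(test_accuracies)
std_accuracy = test_accuracies.std(ddof=1)  # sample standard deviation

print("mean   : {:.2f}%".format(mean_accuracy))
print("median : {:.2f}%".format(median_accuracy))
print("std    : {:.2f}".format(std_accuracy))
```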

In [0]:
# We export the last model's means and variances for further inference tests
means_last_model_df = pd.DataFrame(means_per_category).T
means_last_model_df.to_csv('means_last_model.csv', index=True)

variances_last_model_df = pd.DataFrame(variances_per_category).T
variances_last_model_df.to_csv('variances_last_model.csv', index=True)
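
To reuse the exported parameters later, the CSV files can be read back with the category names as the row index. The following is a minimal round-trip sketch on a toy table written to a temporary directory; the values and the two-category layout are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy parameters shaped like the exported tables: one row per category,
# one column per feature (values are made up for illustration)
toy_means_df = pd.DataFrame(
    {"note_24": [0.5, 0.25], "note_25": [0.125, 0.75]},
    index=["family_string_note_24", "family_brass_note_24"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "means_last_model.csv")
    toy_means_df.to_csv(csv_path, index=True)
    # index_col=0 restores the category names as the row index
    reloaded_means_df = pd.read_csv(csv_path, index_col=0)

# Check that labels and values survived the round trip
roundtrip_ok = (
    list(reloaded_means_df.index) == list(toy_means_df.index)
    and list(reloaded_means_df.columns) == list(toy_means_df.columns)
    and np.allclose(reloaded_means_df.values, toy_means_df.values)
)
print(reloaded_means_df.loc["family_brass_note_24", "note_25"])
```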