## Visualizing Model Performance

To explain the performance and classification results of each model, we designed an interactive visual analytics system that shows the feature space of the data set alongside the predictions of each model.

We begin with some simple preprocessing of the data features: we standardize them and use UMAP to reduce them to two dimensions, so that each sample can be plotted on a 2D scatterplot.

### Data Preprocessing

In [None]:
import pandas as pd
import json
import numpy as np
from sklearn.preprocessing import StandardScaler
import umap

We load the original data samples in our testing split. The per-model predictions for these samples are attached in a later cell.

In [None]:
# Load original data
df_original = pd.read_csv('../public/all-models/lin_reg-test-original.csv')

original_cols = df_original.columns.values.tolist()

Next, we load the predictions our models made on some of our own music. These samples are unlabeled pieces that were released more recently.

In [None]:
# Load our models' predictions on some unlabeled data
df_test = pd.read_csv('../public/test-data-prediction-results.csv')

# These samples have no ground-truth label, so mark their genre as -1
df_test['genre'] = -1

df_test = df_test.rename(columns={'linear_regression': 'lin_reg_predicted',
                                  'logistic_regression': 'log_reg_predicted',
                                  'random_forest': 'rf_predicted',
                                  'svm': 'svm_clf_predicted',
                                  'multi-layer_perceptron': 'mlp_predicted',
                                  'gaussian_process': 'gpc_predicted',
                                  'gaussian_naive_bayes': 'gnb_predicted',
                                  'knn': 'knn_predicted',
                                  'gaussian_mixture_model': 'gmm_predicted'})

df_test_cols = df_test.columns.values.tolist()

# Keep only the feature columns of the test data
df_test_features = df_test.drop(columns=['lin_reg_predicted', 'log_reg_predicted', 'rf_predicted',
                                         'svm_clf_predicted', 'mlp_predicted', 'gpc_predicted',
                                         'gnb_predicted', 'knn_predicted','gmm_predicted',
                                         'index', 'genre', 'filepath', 'filename'])
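
The explicit `drop` list above has to be kept in sync with the model names by hand. As a minimal sketch of an alternative, assuming every prediction column ends in `_predicted` and the metadata columns are exactly the four dropped above, the feature columns could be selected by name pattern. The frame `df_demo`, its feature names (`tempo`, `spectral_centroid`), and the helper `feature_columns` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_test; feature names and values are invented.
df_demo = pd.DataFrame({
    "index": [0, 1],
    "filepath": ["a.wav", "b.wav"],
    "filename": ["a", "b"],
    "genre": [-1, -1],
    "tempo": [120.0, 98.5],
    "spectral_centroid": [1500.2, 2210.7],
    "rf_predicted": [3, 1],
    "knn_predicted": [3, 2],
})

# Assumption: these are the only non-feature, non-prediction columns.
METADATA_COLS = ["index", "genre", "filepath", "filename"]


def feature_columns(df):
    """Return the columns that are neither metadata nor model predictions."""
    return [
        col for col in df.columns
        if col not in METADATA_COLS and not col.endswith("_predicted")
    ]


feature_cols = feature_columns(df_demo)
df_demo_features = df_demo[feature_cols]
```

With this approach, adding another model's `*_predicted` column would not require touching the feature-selection code.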

In [None]:
# Create the UMAP embedding
df_features = df_original.drop(columns=["index", "genre", "filepath", "filename"])

# Concat original and test data
df_joined = pd.concat([df_features, df_test_features])

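# UMAP with default settings reduces the features to 2 dimensions
reducer = umap.UMAP()

# Standardize each feature to zero mean and unit variance before embedding
scaled_data = StandardScaler().fit_transform(df_joined)
embedding = reducer.fit_transform(scaled_data)
coords = pd.DataFrame(embedding)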

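# Embedding rows follow the concat order: original samples first, then test samples
n_original = len(df_features)

# Original coords
original_coords = coords.iloc[:n_original, :]

# Test coords
test_coords = coords.iloc[n_original:, :].reset_index(drop=True)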

# Get new column names
new_cols = original_cols + ['x', 'y']
new_test_cols = df_test_cols + ['x', 'y']

# Add coords to test data
df_test_final = pd.concat([df_test, test_coords], axis=1)
df_test_final.columns = new_test_cols

# Add coords and predictions to original data
df_original_final = pd.concat([df_original, original_coords], axis=1)

# Append each model's predicted genre as a new column
models = ['gmm', 'gnb', 'gpc', 'knn', 'lin_reg', 'log_reg', 'mlp', 'rf', 'svm_clf']

for m in models:
    df_predict = pd.read_csv(f'../public/all-models/{m}-test-predict.csv')
    predicted = df_predict["genre"]

    new_cols = new_cols + [f'{m}_predicted']

    df_original_final = pd.concat([df_original_final, predicted], axis=1)

df_original_final.columns = new_cols

# Concat original and test data
df_final = pd.concat([df_test_final, df_original_final]).reset_index(drop=True)
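
The cell above relies on the embedding rows keeping the order of the concatenated input, so the coordinates can be split back by row count. The sketch below illustrates that bookkeeping on toy data. It is not the notebook's code: the random array stands in for the output of `reducer.fit_transform`, and the frames `labeled` and `unlabeled` and the helper `attach_coords` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 3 labeled rows and 2 unlabeled rows.
labeled = pd.DataFrame({"filename": ["a", "b", "c"], "genre": [0, 1, 2]})
unlabeled = pd.DataFrame({"filename": ["d", "e"], "genre": [-1, -1]})

# Stand-in for the 2D embedding: one (x, y) row per sample,
# labeled rows first, then unlabeled rows (the concat order).
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(labeled) + len(unlabeled), 2))


def attach_coords(df, coords):
    """Return a copy of df with 'x' and 'y' columns taken from an (n, 2) array."""
    if len(df) != len(coords):
        raise ValueError("row count of df and coords must match")
    out = df.reset_index(drop=True).copy()
    out["x"] = coords[:, 0]
    out["y"] = coords[:, 1]
    return out


# Split by the actual number of labeled rows instead of a fixed constant.
n_labeled = len(labeled)
labeled_xy = attach_coords(labeled, embedding[:n_labeled])
unlabeled_xy = attach_coords(unlabeled, embedding[n_labeled:])

# Same stacking order as df_final: unlabeled samples first, then labeled ones.
combined = pd.concat([unlabeled_xy, labeled_xy], ignore_index=True)
```

Deriving the split point from the frame length and checking row counts explicitly makes a mismatch fail loudly rather than silently misaligning coordinates.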

In [None]:
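# Uncomment to export the combined data
# (note: the widget cell below reads ../public/all-models-with-coords.json)
# df_final.to_json("./all-models-with-coords.json", orient="records")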

### Visual Explorer

Finally, we create a visual analytics tool that explores the feature space of the data set and the classification results of the different models. We now demonstrate the visualization by walking through some examples.

In [6]:
# Import widget
from ReactWidget import Test

In [7]:
Test

ReactWidget.test.Test

In [8]:
import pandas as pd

df2 = pd.read_json("../public/all-models-with-coords.json")

result2 = df2.to_dict(orient="records")
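
For readers unfamiliar with the `records` orientation, the sketch below shows the shape of the data handed to the widget: a list with one dictionary per song. The rows are made up for illustration, and the JSON is kept in memory with `io.StringIO` instead of being written to a file.

```python
import io

import pandas as pd

# Made-up rows; the real file has one entry per song with all feature columns.
df_small = pd.DataFrame({
    "filename": ["Song-1.wav", "Song-2.wav"],
    "genre": [-1, 4],
    "x": [0.5, -1.25],
    "y": [2.0, 0.75],
})

# orient="records" serializes the frame as a JSON array of row objects.
json_text = df_small.to_json(orient="records")

# Reading it back and converting with to_dict(orient="records")
# yields a plain Python list of dictionaries, one per row.
records = pd.read_json(io.StringIO(json_text), orient="records").to_dict(orient="records")
```

Each element of `records` maps column names to that row's values, which is a convenient shape for passing data to a JavaScript component as props.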

In [10]:
# Basic visualization interface:
# - Use the drop-down to select a model.
# - Each song is plotted as a point on the 2D scatterplot.
# - The fill color shows the labeled genre;
#   the stroke (outline) color shows the predicted genre.
# - Zoom in to inspect the details.

Test(result2)

Test(component='Test', props={'data': [{'index': 0, 'filepath': 'data/test-dataset/Song-8.wav', 'filename': 'S…
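
In the scatterplot, a point whose stroke color differs from its fill color is a misclassified song. As a rough numerical companion to the visual, the sketch below counts how often each model's prediction matches the label. It assumes records shaped like `result2`, with a `genre` column, `-1` for unlabeled songs, and prediction columns ending in `_predicted` that use the same encoding as `genre`. The data and the helper `accuracy_by_model` are hypothetical.

```python
import pandas as pd

# Hypothetical records in the same shape as result2; values are made up.
records_demo = [
    {"filename": "a", "genre": 0, "rf_predicted": 0, "knn_predicted": 0},
    {"filename": "b", "genre": 1, "rf_predicted": 1, "knn_predicted": 1},
    {"filename": "c", "genre": 2, "rf_predicted": 0, "knn_predicted": 2},
    {"filename": "d", "genre": -1, "rf_predicted": 2, "knn_predicted": 2},
]


def accuracy_by_model(records, unlabeled_value=-1):
    """Return {model_name: fraction of labeled rows predicted correctly}."""
    df = pd.DataFrame(records)
    # Unlabeled songs have no ground truth, so exclude them.
    labeled = df[df["genre"] != unlabeled_value]
    pred_cols = [c for c in df.columns if c.endswith("_predicted")]
    return {
        col[: -len("_predicted")]: float((labeled[col] == labeled["genre"]).mean())
        for col in pred_cols
    }


accuracy = accuracy_by_model(records_demo)
```

This exact-match comparison only makes sense for models whose outputs are discrete genre labels; continuous outputs would need rounding or thresholding first.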