# Volume Enhancer Mod
## An analysis of the viability of a deep learning model for predicting preferred user volume when listening to music of various origins.

Some context for this project ...

- When listening to music on streaming services, a closer look at users' volume preferences reveals a pattern. Most of us change our listening volume based on the origin of the song, e.g., classical piano is more pleasant to the ear at a lower volume, while heavy metal will likely have you turning the knob the other way.

- These patterns don't appear to follow any simple regression curve. While the baseline varies from person to person based on their hearing comfort, the general trend seemingly remains the same.

- Because this unstructured distribution of data eludes accurate prediction by regression algorithms, a neural network is a natural candidate for evaluation. This experiment seeks to prove or disprove that viability.

In [1]:
# library imports
import librosa
import torch as tc
import numpy as np
import pandas as pd
from torch import nn
import seaborn as sns
from torch import optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
from sklearn import preprocessing

# Dataset Analysis :
In order to evaluate the viability of the concept behind the model, as well as the suitability of the independent variables for it, we have to test our audio features against the dependent variable, i.e., user volume.

To carry out this evaluation, we will start with data import & preliminary datatype evaluation. We're ideally looking for float values here.

In [2]:
# data upload
dataPath = '../input/random-testing-data1/features_30_sec_randEdit.csv'
audioData = pd.read_csv(dataPath)
audioData[:10]

In [3]:
# display datatypes
audioData.dtypes

# Correlation testing :
At this stage, we need to find the correlation coefficients of all possible pairings of independent & dependent variables, i.e., find the Pearson's correlation coefficient of every variable paired with user volume.

As we only need to find the pairs most suitable as model inputs with respect to training efficiency, and aren't focused on the exact numbers behind said pairs, we can filter out all reflexive & symmetric pairs and highlight just the cells with an absolute R value greater than 0.8 (tentative).

Do note that correlation analysis doesn't take into account the effect of variables outside the system being evaluated. Pearson's coefficient also measures linear association only, so it is less reliable when testing for non-linear relationships. Hence, some leeway in the threshold R value may be warranted.

That being said, we're only interested in highlighted cells in the last row here.
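The last-row filter described above can also be read off numerically instead of from a heatmap. The sketch below runs on a toy frame: the columns `feature_a` and `feature_b` and all of their values are made up for illustration, and only the `volume` column name comes from the dataset.

```python
import numpy as np
import pandas as pd

# toy stand-in for audioData; feature names and values are illustrative only
rng = np.random.default_rng(0)
volume = rng.integers(0, 101, size=200).astype(float)
toy = pd.DataFrame({
    'feature_a': volume * 0.5 + rng.normal(0, 1, size=200),  # strongly tied to volume
    'feature_b': rng.normal(0, 1, size=200),                 # unrelated noise
    'volume': volume,
})

threshold = 0.8  # tentative, as in the text

# Pearson correlation of every numeric column with 'volume'
volumeCorr = toy.select_dtypes(include='number').corr(method='pearson')['volume']
volumeCorr = volumeCorr.drop('volume')  # drop the reflexive pair

# keep only the columns whose |R| clears the threshold
selected = volumeCorr[volumeCorr.abs() > threshold].index.tolist()
print(selected)
```

The resulting list could then serve as the set of candidate input columns, with the threshold loosened if too few features survive.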

In [4]:
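# correlogram keeping only pairs with a correlation coefficient greater than 0.8 or less than -0.8
# non-numeric columns (e.g. the genre label) are excluded, as Pearson's R is undefined for them
corrDataFrame = audioData.select_dtypes(include='number').corr(method='pearson')
corrDataFrame[np.abs(corrDataFrame) < 0.8] = 0

# mask the upper triangle (including the diagonal) to hide reflexive & symmetric pairs
mask = np.zeros_like(corrDataFrame, dtype=bool) # np.bool is no longer available in recent NumPy releases
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize = (20,20))
sns.heatmap(corrDataFrame, cmap="coolwarm", mask=mask)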

# consider applying transforms to the dataset to try and squeeze out any correlations

# Model design : 
From the correlogram above, we shall choose **only the highlighted columns** from the 'volume' row to be our independent variables. As observed from the datatype test from earlier, the only categorical variable in the dataset is genre. 

Intuitively, genre **does play a significant role** in our preferences for ideal volume. Therefore, we shall include it as a label encoded input.

As most electronic devices represent volume as an **integer between 0 & 100**, we shall construct a **classifier** to predict the volume and will have to one-hot encode the output as such.
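One way to keep the width of the one-hot output independent of which volumes happen to land in each split is to encode against the full 0 to 100 range. This differs from `pd.get_dummies`, which only creates columns for the values present in the data it is given. A minimal sketch, assuming integer volumes and a hypothetical helper named `oneHotVolume`:

```python
import numpy as np

NUM_CLASSES = 101  # integers 0..100 inclusive (assumed range)

def oneHotVolume(volumes, numClasses=NUM_CLASSES):
    """Return a (len(volumes), numClasses) float32 one-hot matrix."""
    volumes = np.asarray(volumes, dtype=np.int64)
    if volumes.min() < 0 or volumes.max() >= numClasses:
        raise ValueError('volume outside the expected range')
    encoded = np.zeros((len(volumes), numClasses), dtype=np.float32)
    # set the column matching each row's volume to 1
    encoded[np.arange(len(volumes)), volumes] = 1.0
    return encoded

Y = oneHotVolume([0, 35, 100])
print(Y.shape)
print(Y.argmax(axis=1))
```

With a fixed class list, the column position equals the volume itself, so a predicted class index can be read directly as a volume.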

In [5]:
# train/test data loading
df = pd.read_csv(dataPath)
train = df.sample(frac=0.5,random_state=100) # random state is a seed value
test = df.drop(train.index)

# dependent variable, categorical (0-100) (one hot encoded)
Ytest = test['volume']
Ytest = pd.get_dummies(Ytest)
Ytest = Ytest.to_numpy()

Ytrain = train['volume']
Ytrain = pd.get_dummies(Ytrain)
Ytrain = Ytrain.to_numpy()

# independent variable 1, categorical (label encoded)
# note : [blues:0, classical:1, country:2, disco:3, hiphop:4, jazz:5, metal:6, pop:7, reggae:8, rock:9]
le = preprocessing.LabelEncoder()
X1train = le.fit_transform(train['label']) # fit the encoding on the training split only
X1test = le.transform(test['label']) # reuse the same encoding for the testing split

# independent variable 2, numerical
X2test = test['chroma_stft_mean'].to_numpy()
X2train = train['chroma_stft_mean'].to_numpy()

# add on the remaining relevant variables here

In [6]:
# declaring constants
numInputs = 2 # number of input variables based on correlation test
Ntest = len(test) # number of testing entries
Ntrain = len(train) # number of training entries

In [7]:
# data reshaping
Xtest = np.dstack([X1test,X2test]) # add on all the independent testing variables in this list
Xtest = Xtest.reshape(Ntest,numInputs) 

Xtrain = np.dstack([X1train,X2train]) # add on all the independent training variables in this list
Xtrain = Xtrain.reshape(Ntrain,numInputs) 

if (Ytest.shape[1] == Ytrain.shape[1]) :
    
    # declaring number of outputs based on the results of one hot encoding
    numOutputs = Ytrain.shape[1]
    print(f'Number of classes : {numOutputs}\n')
    
else :
    
    # the splits contain different sets of volume values, so their encodings do not line up
    raise ValueError('Re-execute test/train split')
    
Ytest = Ytest.reshape(Ntest, numOutputs) # one dependent test variable
Ytrain = Ytrain.reshape(Ntrain, numOutputs) # one dependent train variable

# data type conversion
xtest = tc.from_numpy(Xtest.astype(np.float32))
xtrain = tc.from_numpy(Xtrain.astype(np.float32))

ytest = tc.from_numpy(Ytest.astype(np.float32))
ytrain = tc.from_numpy(Ytrain.astype(np.float32))


# Model architecture :

As the data is organizable as a plain stack of layers with singular input & output tensors, a sequential model is suitable for our purposes.

- ReLU activation will suffice for the hidden layers.
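- As the model is intended to be a classifier, a sigmoid activation function seems to be the intuitive choice of activation for the output layer, but testing has shown that a ReLU produces higher accuracies. Note that the model defined below applies no activation after its final linear layer and returns raw logits.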
- Dropouts are preferable to prevent overfitting.

As such, we shall build a **sequential linear ReLU stack**.
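Dropout behaves differently in training and evaluation mode, which is why a model has to be switched with `model.eval()` before measuring test accuracy. The toy sketch below uses a standalone `nn.Dropout` layer, not the notebook's network, to show the two modes; the tensor size is arbitrary.

```python
import torch as tc
from torch import nn

tc.manual_seed(0)
drop = nn.Dropout(0.25)
x = tc.ones(1, 1000)

drop.train()
trainOut = drop(x)  # about a quarter of the entries zeroed, the rest scaled by 1/(1-0.25)

drop.eval()
evalOut = drop(x)   # dropout is the identity in evaluation mode

zeroFraction = (trainOut == 0).float().mean().item()
print(zeroFraction)
print(tc.equal(evalOut, x))
```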

In [8]:
# model definition
class NeuralNetwork(nn.Module):
    def __init__(self, numInputs, numOutputs):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(numInputs, 512),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(128, numOutputs),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

# Initializing the training parameters :
There are two choices to consider at this stage -

- **Loss function :** As this is a multi-class classification problem that requires small deviations from the true value to be exaggerated, a Kullback-Leibler divergence loss function is most appropriate. Preliminary testing confirms its positive impact on model accuracy.
- **Optimizer :** As one-hot encoding around 100 classes results in a sparse data distribution, Adam is seemingly preferable to stochastic gradient descent.
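PyTorch's `KLDivLoss` expects log-probabilities as its input and probabilities as its target, so raw logits need a `log_softmax` first. The sketch below, on toy sizes chosen for illustration, also shows that with one-hot targets the KL divergence reduces to the usual cross-entropy, since the entropy of a one-hot target is zero.

```python
import torch as tc
import torch.nn.functional as F

tc.manual_seed(0)
logits = tc.randn(4, 5)                 # 4 samples, 5 classes (toy sizes)
targets = tc.tensor([0, 3, 1, 4])
oneHot = F.one_hot(targets, num_classes=5).float()

# KLDivLoss takes log-probabilities as input and probabilities as target
klLoss = tc.nn.KLDivLoss(reduction='batchmean')
kl = klLoss(F.log_softmax(logits, dim=1), oneHot)

# cross-entropy on the same logits, using class indices as targets
ce = F.cross_entropy(logits, targets)

print(kl.item(), ce.item())
```

The two printed values agree up to floating point error, so for purely one-hot targets either loss gives the same gradients.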

In [9]:
# error function & optimizer
model = NeuralNetwork(numInputs, numOutputs)
e_func = tc.nn.KLDivLoss(reduction='batchmean') # 'batchmean' matches the mathematical definition of KL divergence
optim = tc.optim.Adam(model.parameters(), lr=0.0001)

**Note :** This script is GPU compatible.

In [10]:
# move model to GPU
device = tc.device("cuda" if tc.cuda.is_available() else "cpu")
model.to(device)

In [11]:
# model training/testing loop
ep = 1000
train_losses, test_losses, accuracies = [], [], []

# move the data to the model's device once, before the loop
xtrain, ytrain = xtrain.to(device), ytrain.to(device)
xtest, ytest = xtest.to(device), ytest.to(device)

for e in range(ep+1) :
    
    # training model (full batch : one optimizer step per epoch)
    model.train()
    optim.zero_grad() 
    
    output = model(xtrain) 
    # KLDivLoss expects log-probabilities as input, so the logits go through log_softmax
    loss = e_func(F.log_softmax(output, dim=1), ytrain)
    
    loss.backward() 
    optim.step() 
    
    # the loss function already averages the loss, so no further division is needed
    train_loss = loss.item()
    train_losses.append(train_loss)
    
    # testing model
    model.eval()
    
    with tc.no_grad():
            
        output = model(xtest) 
        # .item() stores a plain float, which keeps the history plottable
        test_loss = e_func(F.log_softmax(output, dim=1), ytest).item()
        
        _, predictions = tc.max(output, 1)
        _, targets = tc.max(ytest, 1)
        accuracy = (predictions == targets).sum().item() / len(ytest)
        
    test_losses.append(test_loss)
    accuracies.append(accuracy)

    print(f'Epoch: {e}/{ep}\n',
          f'Training loss: {train_loss}\n',
          f'Test loss: {test_loss}\n',
          f'Accuracy: {accuracy}\n')

In [12]:
# plot train & test loss per iteration
plt.plot(train_losses, label='Training loss')
plt.plot(test_losses, label='Validation loss')
plt.legend(frameon=False)

In [13]:
# plot testing accuracy
plt.plot(accuracies, label='Model accuracy')
plt.legend(frameon=False)

In [14]:
# save model
PATH = '../output/model.pth' # path at which the model is to be saved
tc.save(model, PATH)

# Testing model deployment :
This section is implemented as follows -

- The model class is declared & the model is loaded from storage.
- A sample MP3 file is loaded using librosa & visualized.
- The data is processed to extract the relevant features identified in the correlogram earlier.
- The processed data is converted to the appropriate tensors for input into the model.
- A prediction is obtained from this data.
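An alternative to pickling the whole model object is to save only its parameters and load them into a freshly built instance of the class, which is the approach PyTorch's documentation recommends. The sketch below is not the notebook's method: it uses a small stand-in network and an in-memory buffer instead of `NeuralNetwork` and a file path, and `buildModel` is a hypothetical helper.

```python
import io
import torch as tc
from torch import nn

# small stand-in network; the notebook's NeuralNetwork class would be used the same way
def buildModel():
    return nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 3))

tc.manual_seed(0)
trained = buildModel()

# save only the parameters (an in-memory buffer stands in for a file path here)
buffer = io.BytesIO()
tc.save(trained.state_dict(), buffer)
buffer.seek(0)

# rebuild the architecture, then load the parameters into it
restored = buildModel()
restored.load_state_dict(tc.load(buffer, map_location='cpu'))
restored.eval()

x = tc.randn(5, 2)
with tc.no_grad():
    same = tc.allclose(trained.eval()(x), restored(x))
print(same)
```

Passing `map_location='cpu'` lets parameters saved from a GPU be loaded on a machine without one.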

In [None]:
# model class
class NeuralNetwork(nn.Module):
    def __init__(self, numInputs, numOutputs):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(numInputs, 512),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(128, numOutputs),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

In [None]:
# load model
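PATH = '../output/model.pth' # path at which the model is saved
# the whole model object was pickled by tc.save(model, PATH), so the NeuralNetwork class must be declared before loading
# on PyTorch 2.6 and later, tc.load defaults to weights_only=True and needs weights_only=False for such a file
model = tc.load(PATH)
model.eval()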

In [None]:
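# loading sample file for testing
import librosa.display # imported explicitly, as older librosa releases do not load it with `import librosa`

fname = '../input/test-audio/Kalimba.mp3'
genre = '' # specify genre
SR = 22050
data, _ = librosa.load(fname, sr=SR, mono=True)

# visualizing sample mp3
plt.figure(figsize = (16, 6))
librosa.display.waveshow(data, sr = SR, color = "#A300F9") # waveshow replaces waveplot, which newer librosa releases no longer provide
plt.title("Sound Waves in sample", fontsize = 10)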

# The prediction function :
Implementation guidelines -

- Input : librosa loaded audio file, string containing genre.
- Relevant feature extraction using librosa's in-built functions.
- Genre is encoded as follows : [blues:0, classical:1, country:2, disco:3, hiphop:4, jazz:5, metal:6, pop:7, reggae:8, rock:9]
- The data is processed in an identical manner to the data loading & data reshaping sections of the model design chapter of this notebook to produce the necessary tensors.
- The tensor(s) are fed to the model and a prediction is returned.
- Output : Integer value from 0 to 100.

In [None]:
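# function to take librosa loaded audio & its genre as parameters and return predicted volume by calling saved model
def predictVolume(data, genre) :
    
    # getting all necessary model input values in the correct data format
    # (feature extraction & tensor construction are yet to be implemented)
    raise NotImplementedError('predictVolume is a placeholder')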
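The guidelines above could be filled in roughly as follows. This is a sketch under several assumptions rather than the notebook's implementation: the helper names `buildInput` and `predictFromFeatures` are hypothetical, the model is an untrained stand-in, only the two inputs used during training (genre and `chroma_stft_mean`) are assembled, and `classValues` assumes one class per integer volume from 0 to 100. With the notebook's `pd.get_dummies` encoding, `classValues` would instead have to be the sorted volume values seen in training.

```python
import numpy as np
import torch as tc
from torch import nn

# genre encoding as listed in the guidelines
GENRES = ['blues', 'classical', 'country', 'disco', 'hiphop',
          'jazz', 'metal', 'pop', 'reggae', 'rock']

def buildInput(genre, chromaMean):
    """Assemble the (1, 2) float32 input tensor: [genre code, chroma_stft_mean]."""
    if genre not in GENRES:
        raise ValueError(f'unknown genre: {genre}')
    features = np.array([[GENRES.index(genre), chromaMean]], dtype=np.float32)
    return tc.from_numpy(features)

def predictFromFeatures(model, genre, chromaMean, classValues):
    """Return the volume whose class receives the highest score."""
    model.eval()
    with tc.no_grad():
        logits = model(buildInput(genre, chromaMean))
    classIndex = int(logits.argmax(dim=1).item())
    # classValues maps a one hot column position back to the volume it encodes
    return int(classValues[classIndex])

# stand-in model and class list so the sketch runs without the trained weights
tc.manual_seed(0)
standInModel = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 101))
classValues = np.arange(101)  # assumption: one class per integer volume 0..100

# chromaMean would come from e.g. librosa.feature.chroma_stft(y=data, sr=SR).mean()
predicted = predictFromFeatures(standInModel, 'jazz', 0.35, classValues)
print(predicted)
```

Keeping the feature extraction outside these helpers means the tensor assembly can be checked without loading any audio.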