# A10-2 Sound and music description, revisited

## Introduction

(This assignment requires a working installation of the Essentia library and a good amount of independent programming.)
In this assignment, you will extend the sound and music description task from A9 to a larger set of instruments and explore possible improvements to instrument identification from single note/stroke sounds. By doing this assignment, you will gain hands-on experience with Essentia and better insight into the complexities of a real-world Music Information Retrieval problem, with a larger set of descriptors and more instrument classes. You will present the results and findings as a short report.

## Guidelines

In A9, you explored the tasks of clustering and classification with sound excerpts of three instruments. Typically, as we add more instruments for clustering, the average performance degrades. In such situations, clustering performance can be improved in several ways, two of which you will explore in this assignment:
- Improving descriptor selection: We can use a better set of descriptors, in addition to the ones you used in A9. To improve performance, we typically also need to increase the number of descriptors used for clustering as the number of instrument classes increases.
- Improving descriptor computation: In A9, each descriptor you used was the time-averaged mean of a feature computed over short frames of the audio file. However, there are segments in an audio file (typically at the beginning and the end) that contain silence or low-energy background noise. Such segments should be discarded when computing the global statistics of the descriptors. For example, low-energy regions (silence) have a higher spectral centroid and, if included, bias the average spectral centroid upward.

You will use Essentia to implement both these improvements. You can use the scripts provided with A9 as a base (get A9 scripts) and build your code on them. You first need to install Essentia to compute some of the descriptors that you will explore in this task. You can find the download and installation instructions for Essentia here: http://essentia.upf.edu/. Essentia has extensive documentation that will be useful in this assignment: http://essentia.upf.edu/documentation/index.html.

The questions in the assignment are presented separately for ease of evaluation, but you will write the answers to all the questions together in one document and upload your report (PDF, 2 pages max., excluding plots/illustrations/parameter listings) in Question 1. You must also upload the code that you write. You will evaluate a minimum of three other peers in this assignment.

### Question 1: Downloading sounds

Choose at least 10 different instrument sound classes from the following: violin, guitar, bassoon, trumpet, clarinet, cello, naobo, snare drum, flute, mridangam, daluo, xiaoluo. For each instrument class, use the soundDownload.py script from A9 to download the audio and descriptors of 20 representative single notes/strokes. Since you will also use the sounds to extract descriptors with Essentia, make sure you download the high-quality mp3 and not the low-quality mp3 preview. To do this, in the soundDownload.py script replace fs.FSRequest.retrieve(sound.previews.preview_lq_mp3, fsClnt, mp3Path) with fs.FSRequest.retrieve(sound.previews.preview_hq_mp3, fsClnt, mp3Path).
In the report, explain your choices, the query text, and the tags you used. Include the Freesound links to the downloaded sounds.
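Before moving on, it can help to confirm that every class really ended up with 20 sounds. The following is a minimal sketch, assuming the download folder is laid out as `<outputDir>/<class>/<sound_id>/<descriptor>.json` (the layout that `fetchDataDetails` in soundAnalysis.py appears to expect). The function names `count_descriptor_files` and `report_missing` are hypothetical helpers, not part of the A9 scripts.

```python
import os
from collections import defaultdict


def count_descriptor_files(input_dir, desc_ext=".json"):
    """Count sounds per class, assuming input_dir/<class>/<sound_id>/<file>.json."""
    counts = defaultdict(int)
    for path, _dirs, fnames in os.walk(input_dir):
        parts = os.path.relpath(path, input_dir).split(os.sep)
        if len(parts) != 2:  # only <class>/<sound_id> folders hold descriptors
            continue
        if any(f.lower().endswith(desc_ext) for f in fnames):
            counts[parts[0]] += 1
    return dict(counts)


def report_missing(counts, expected_classes, n_expected=20):
    """Return {class: count} for every class with fewer than n_expected sounds."""
    return {c: counts.get(c, 0) for c in expected_classes
            if counts.get(c, 0) < n_expected}
```

For example, `report_missing(count_descriptor_files('test_download/'), ['violin', 'cello'])` would return an empty dictionary when both classes are complete.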

In [None]:
import soundDownload as SD

API_key = 'OSq4Ic5Wg4zdVdTuhNZaiT4kKlKzsnnFvIIzU6p6'

instruments_lst = ['violin', 'guitar', 'cello', 'clarinet', 'flute', 'bassoon', 'trumpet', 'naobo', 'snare drum', 'daluo']

In [None]:
# download the violin
SD.downloadSoundsFreesound(queryText='violin',
                           API_Key=API_key, outputDir='test_download/', topNResults=20,
                           duration=(0.01, 10), tag="(single-note AND strings)")

In [None]:
# download the cello
SD.downloadSoundsFreesound(queryText='cello',
                           API_Key=API_key, outputDir='test_download/', topNResults=20,
                           duration=(0.01, 10), tag="(single-note AND strings)")

In [None]:
# download the guitar
SD.downloadSoundsFreesound(queryText='guitar',
                           API_Key=API_key, outputDir='test_download/', topNResults=20,
                           duration=(0.01, 10), tag="(string AND simplesamples)")

In [None]:
# download the clarinet
SD.downloadSoundsFreesound(queryText='clarinet',
                           API_Key=API_key, outputDir='test_download/', topNResults=20,
                           duration=(0.03, 10), tag="(single-note AND clarinet AND multisample)")

In [None]:
# download the flute
SD.downloadSoundsFreesound(queryText='flute',
                           API_Key=API_key, outputDir='test_download/', topNResults=20,
                           duration=(6, 10), tag="(single-note AND flute AND good-sounds "
                                                    "AND multisample AND neumann-U87)")

In [None]:
# download the bassoon
SD.downloadSoundsFreesound(queryText='bassoon',
                           API_Key=API_key, outputDir='test_download/', topNResults=20,
                           duration=(0, 10), tag="(single-note)")

In [None]:
# download the trumpet
SD.downloadSoundsFreesound(queryText='trumpet',
                           API_Key=API_key, outputDir='test_download/', topNResults=20,
                           duration=(4, 8), tag="(single-note AND multisample AND trumpet AND neumann-u87)")

In [None]:
# download the naobo
SD.downloadSoundsFreesound(queryText='naobo',
                           API_Key=API_key, outputDir='test_download/', topNResults=20,
                           duration=(0.01, 6.5), tag="(beijing-opera AND chinese AND icassp2014-dataset)")

In [None]:
# download the snare drum
SD.downloadSoundsFreesound(queryText='snaredrum',
                           API_Key=API_key, outputDir='test_download/', topNResults=20,
                           duration=(0, 2), tag="(snare AND drum AND 1-shot AND velocity AND multisample)")

In [None]:
# download the daluo
SD.downloadSoundsFreesound(queryText='daluo',
                           API_Key=API_key, outputDir='test_download/', topNResults=20,
                           duration=(0, 9), tag="(beijing-opera AND chinese AND icassp2014-dataset AND qmul AND compmusic)")

### Question 2: Obtaining a baseline clustering performance

Visualize different pairs of descriptors and choose a subset of the descriptors you downloaded along with the audio (same as A9) for a good separation between classes. Run a k-means clustering task with the 10 instrument dataset using the chosen subset of descriptors. You can use the soundAnalysis.py script from A9 for this task. Use the same number of clusters as the number of different instruments.

Report the subset of descriptors used and the clustering accuracy you obtained. Since the k-means algorithm is randomly initialized and gives a different result every time it is run, report the average performance over 10 runs of the algorithm. This performance result acts as your baseline, which you will improve on in Question 3.
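The accuracy measure and the averaging over runs can be sketched in a few lines. This is an illustrative version using only the standard library: a sound counts as correct when its class matches the majority class of its cluster, and the per-run accuracy is averaged over several runs. The names `majority_vote_accuracy` and `mean_accuracy` are hypothetical and not part of soundAnalysis.py.

```python
from collections import Counter, defaultdict
from statistics import mean


def majority_vote_accuracy(cluster_ids, true_classes):
    """Percentage of sounds whose class equals the majority class of their cluster."""
    members = defaultdict(list)
    for cid, cls in zip(cluster_ids, true_classes):
        members[cid].append(cls)
    # In each cluster, only the sounds of the most common class count as correct.
    correct = sum(Counter(classes).most_common(1)[0][1]
                  for classes in members.values())
    return 100.0 * correct / len(true_classes)


def mean_accuracy(run_once, n_runs=10):
    """Average the accuracy returned by run_once() over n_runs independent runs."""
    return mean(run_once() for _ in range(n_runs))
```

Assuming `clusterSounds` is modified to return the accuracy it prints, a call such as `mean_accuracy(lambda: SA.clusterSounds('test_download/', nCluster=10, descInput=[3, 11, 13]))` would give the 10-run average for that descriptor subset.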

Obtaining a baseline performance is necessary to suggest and evaluate improvements. For the 10 instrument class problem, the random baseline is 10% (randomly choosing one out of the ten classes). As you will see, the baseline you obtain will be higher than 10%, but lower than the one you obtained for three instruments in A9 (with a careful selection of descriptors).
You will upload a single PDF file containing a report answering all questions of this assignment.

In [None]:
import soundAnalysis as SA
SA.descriptorPairScatterPlot('test_download/', descInput=(2, 12))
SA.showDescriptorMapping()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import os, sys
import json
from scipy.cluster.vq import vq, kmeans, whiten

# from scipy.stats import mode

# Mapping of descriptors
descriptorMapping = {0: 'lowlevel.spectral_centroid.mean',
                     1: 'lowlevel.dissonance.mean',
                     2: 'lowlevel.hfc.mean',
                     3: 'sfx.logattacktime.mean',
                     4: 'sfx.inharmonicity.mean',
                     5: 'lowlevel.spectral_contrast.mean.0',
                     6: 'lowlevel.spectral_contrast.mean.1',
                     7: 'lowlevel.spectral_contrast.mean.2',
                     8: 'lowlevel.spectral_contrast.mean.3',
                     9: 'lowlevel.spectral_contrast.mean.4',
                     10: 'lowlevel.spectral_contrast.mean.5',
                     11: 'lowlevel.mfcc.mean.0',
                     12: 'lowlevel.mfcc.mean.1',
                     13: 'lowlevel.mfcc.mean.2',
                     14: 'lowlevel.mfcc.mean.3',
                     15: 'lowlevel.mfcc.mean.4',
                     16: 'lowlevel.mfcc.mean.5'
                     }


def showDescriptorMapping():
    """
  This function prints the mapping of integers to sound descriptors.
  """
    for key in descriptorMapping.keys():
        print ("Number %d is for %s" % (key, descriptorMapping[key]))


def descriptorPairScatterPlot(inputDir, descInput=(0, 0), anotOn=0):
    """
  This function does a scatter plot of the chosen feature pairs for all the sounds in the
  directory inputDir. The chosen features are specified in descInput as a tuple.
  Additionally, you can annotate the sound id on the scatter plot by setting anotOn = 1

  Input:
    inputDir (string): path to the directory where the sound samples and descriptors are present
    descInput (tuple): pair of descriptor indices (see descriptorMapping for mapping between
                       indices and descriptor names)
    anotOn (int): Set this flag to 1 to annotate the scatter points with the sound id. (Default = 0)

  Output:
    scatter plot of the chosen pair of descriptors for all the sounds in the directory inputDir
  """
    if max(descInput) >= len(descriptorMapping.keys()):
        print("Please select a descriptor index that is within the range. Maximum descriptor index can be " +
              str(len(descriptorMapping) - 1) + ". Check the descriptor index mapping again using function "
                                                "showDescriptorMapping().")
        return  # stop here: an out-of-range index would fail below

    dataDetails = fetchDataDetails(inputDir)
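    # One color per class; 12 entries cover every instrument class offered in this assignment
    colors = ['r', 'g', 'c', 'b', 'k', 'm', 'y', 'orange', 'purple', 'brown', 'pink', 'gray']
    plt.figure()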
    legArray = []
    catArray = []
    for ii, category in enumerate(dataDetails.keys()):
        catArray.append(category)
        for soundId in dataDetails[category].keys():
            filepath = os.path.join(inputDir, category, soundId, dataDetails[category][soundId]['file'])
            descSound = convFtrDict2List(json.load(open(filepath, 'r')))
            x_cord = descSound[descInput[0]]
            y_cord = descSound[descInput[1]]

            plt.scatter(x_cord, y_cord, c=colors[ii], s=200, alpha=0.75)
            if anotOn == 1:
                plt.annotate(soundId, xy=(x_cord, y_cord), xytext=(x_cord, y_cord))

        circ = Line2D([0], [0], linestyle="none", marker="o", alpha=0.75, markersize=10, markerfacecolor=colors[ii])
        legArray.append(circ)

    plt.ylabel(descriptorMapping[descInput[1]], fontsize=16)
    plt.xlabel(descriptorMapping[descInput[0]], fontsize=16)
    plt.legend(legArray, catArray, numpoints=1, bbox_to_anchor=(0., 1.02, 1., .102), loc=3, ncol=len(catArray),
               mode="expand", borderaxespad=0.)

    plt.show()


def convFtrDict2List(ftrDict):
    """
  This function converts descriptor dictionary to an np.array. The order in the numpy array (indices)
  are same as those mentioned in descriptorMapping dictionary.

  Input:
    ftrDict (dict): dictionary containing descriptors downloaded from the freesound
  Output:
    ftr (np.ndarray): Numpy array containing the descriptors for processing later on
  """
    ftr = []
    for key in range(len(descriptorMapping.keys())):
        try:
            ftrName, ind = '.'.join(descriptorMapping[key].split('.')[:-1]), int(descriptorMapping[key].split('.')[-1])
            ftr.append(ftrDict[ftrName][0][ind])
        except:
            ftr.append(ftrDict[descriptorMapping[key]][0])
    return np.array(ftr)


def computeSimilarSounds(queryFile, targetDir, descInput=[]):
    """
  This function returns similar sounds for a specific queryFile. Given a queryFile this function
  computes the distance of the query to all the sounds found in the targetDir and sorts them in
  the increasing order of the distance. This way we can obtain similar sounds to a query sound.

  Input:
    queryFile (string): Descriptor file (.json, unless changed)
    targetDir (string): Target directory to search for similar sounds (using their descriptor files)
    descInput (list) : list of indices of the descriptors to be used for similarity/distance computation
                       (see descriptorMapping)
  Output:
    List containing an ordered list of similar sounds.
  """

    dataDetails = fetchDataDetails(targetDir)

    # reading query feature dictionary
    qFtr = json.load(open(queryFile, 'r'))

    dist = []
    # Iterating over classes
    for cname in dataDetails.keys():
        # Iterating over sounds
        for sname in dataDetails[cname].keys():
            eucDist = eucDistFeatures(qFtr, dataDetails[cname][sname]['feature'], descInput)
            dist.append([eucDist, sname, cname])

    # Sorting the array based on the distance
    indSort = np.argsort(np.array(dist)[:, 0])
    return (np.array(dist)[indSort, :]).tolist()


def classifySoundkNN(queryFile, targetDir, K, descInput=[]):
    """
  This function performs the KNN classification of a sound. The nearest neighbors are chosen from
  the sounds in the targetDir.

  Input:
    queryFile (string): Descriptor file (.json, unless changed)
    targetDir (string): Target directory to search for similar sounds (using their descriptor files)
    K (int) : Number of nearest neighbors to consider for KNN classification.
    descInput (list) : List of indices of the descriptors to be used for similarity/distance computation
                      (see descriptorMapping)
  Output:
    predClass (string): Predicted class of the query sound
  """
    distances = computeSimilarSounds(queryFile, targetDir, descInput)

    if len(np.where((np.array(distances)[:, 0].astype(np.float64)) == 0)[0]) > 0:
        print("Warning: We found an exact copy of the query file in the target directory. "
              "Beware of duplicates while doing KNN classification.")

    classes = (np.array(distances)[:K, 2]).tolist()
    freqCnt = []
    for ii in range(K):
        freqCnt.append(classes.count(classes[ii]))
    indMax = np.argmax(freqCnt)
    predClass = classes[indMax]
    print ("This sample belongs to class: " + str(predClass))
    return predClass


def clusterSounds(targetDir, nCluster=-1, descInput=[]):
    """
  This function clusters all the sounds in targetDir using kmeans clustering.

  Input:
    targetDir (string): Directory where sound descriptors are stored (all the sounds in this
                        directory will be used for clustering)
    nCluster (int): Number of clusters to be used for kmeans clustering.
    descInput (list) : List of indices of the descriptors to be used for similarity/distance
                       computation (see descriptorMapping)
  Output:
    Prints the class of each cluster (computed by a majority vote), number of sounds in each
    cluster and information (sound-id, sound-class and classification decision) of the sounds
    in each cluster. Optionally, you can uncomment the return statement to return the same data.
  """

    dataDetails = fetchDataDetails(targetDir)

    ftrArr = []
    infoArr = []

    if nCluster == -1:
        nCluster = len(dataDetails.keys())
    for cname in dataDetails.keys():
        # iterating over sounds
        for sname in dataDetails[cname].keys():
            ftrArr.append(convFtrDict2List(dataDetails[cname][sname]['feature'])[descInput])
            infoArr.append([sname, cname])

    ftrArr = np.array(ftrArr)
    infoArr = np.array(infoArr)

    ftrArrWhite = whiten(ftrArr)
    centroids, distortion = kmeans(ftrArrWhite, nCluster)
    clusResults = -1 * np.ones(ftrArrWhite.shape[0])

    for ii in range(ftrArrWhite.shape[0]):
        diff = centroids - ftrArrWhite[ii, :]
        diff = np.sum(np.power(diff, 2), axis=1)
        indMin = np.argmin(diff)
        clusResults[ii] = indMin

    ClusterOut = []
    classCluster = []
    globalDecisions = []
    for ii in range(nCluster):
        ind = np.where(clusResults == ii)[0]
        if len(ind) == 0:
            # kmeans may return fewer centroids than requested; skip empty clusters
            continue
        freqCnt = []
        for elem in infoArr[ind, 1]:
            freqCnt.append(infoArr[ind, 1].tolist().count(elem))
        indMax = np.argmax(freqCnt)
        classCluster.append(infoArr[ind, 1][indMax])

        print("\n(Cluster: " + str(ii) + ") Using majority voting as a criterion this cluster belongs to " +
              "class: " + classCluster[-1])
        print ("Number of sounds in this cluster are: " + str(len(ind)))
        decisions = []
        for jj in ind:
            if infoArr[jj, 1] == classCluster[-1]:
                decisions.append(1)
            else:
                decisions.append(0)
        globalDecisions.extend(decisions)
        print ("sound-id, sound-class, classification decision")
        ClusterOut.append(np.hstack((infoArr[ind], np.array([decisions]).T)))
        print (ClusterOut[-1])
    globalDecisions = np.array(globalDecisions)
    totalSounds = len(globalDecisions)
    nIncorrectClassified = len(np.where(globalDecisions == 0)[0])
    print("Out of %d sounds, %d sounds are incorrectly classified considering that one cluster should "
          "ideally contain sounds from only a single class" % (totalSounds, nIncorrectClassified))
    print("You obtain a classification (based on obtained clusters and majority voting) accuracy "
          "of %.2f percentage" % round(float(100.0 * float(totalSounds - nIncorrectClassified) / totalSounds), 2))

    acc = float(100.0 * float(totalSounds - nIncorrectClassified) / totalSounds)
    return acc


def fetchDataDetails(inputDir, descExt='.json'):
    """
  This function is used by other functions to obtain the information regarding the directory structure
  and the location of descriptor files for each sound
  """
    dataDetails = {}
    for path, dname, fnames in os.walk(inputDir):
        for fname in fnames:
            if descExt in fname.lower():
                remain, rname, cname, sname = path.split('/')[:-3], path.split('/')[-3], path.split('/')[-2], \
                path.split('/')[-1]
                if cname not in dataDetails:
                    dataDetails[cname] = {}
                fDict = json.load(open(os.path.join('/'.join(remain), rname, cname, sname, fname), 'r'))
                dataDetails[cname][sname] = {'file': fname, 'feature': fDict}
    return dataDetails


def eucDistFeatures(ftrDict1, ftrDict2, ftrInds):
    """
  This function computes Euclidean distance between two descriptor vectors (input as dictionaries).
  Additionally, also provide a list of the indices of the descriptor vectors that need to be used
  in the distance computation.

  Input:
    ftrDict1 (dict): Feature vector dictionary 1
    ftrDict2 (dict): Feature vector dictionary 2
    ftrInds (list): List of indices of descriptor vectors to be used in
                    distance computation (see descriptorMapping)
  """
    f1 = convFtrDict2List(ftrDict1)
    f2 = convFtrDict2List(ftrDict2)
    return eucDist(f1[ftrInds], f2[ftrInds])


def eucDist(vec1, vec2):
    """
  Computes the euclidean distance between two vectors
  """
    return np.sqrt(np.sum(np.power(np.array(vec1) - np.array(vec2), 2)))


In [None]:
import itertools

def cluster_sounds_helper(desc_lst):
    return clusterSounds('test_download/', nCluster=10, descInput=desc_lst)

elements = range(17)
k = 15
#given_comb = [3] # 48.02
#given_comb = [3, 11] # 61.20
#given_comb = [3, 11, 13] # 71.70
#given_comb = [1, 3, 11, 13] # 75.63
#given_comb = [1, 3, 9, 11, 13] # 75.60
#given_comb = [1, 3, 9, 11, 13, 16] # 73.90
#given_comb = [1, 3, 9, 10, 11, 13, 16] # 72.90
#given_comb = [1, 3, 7, 9, 10, 11, 13, 15, 16] # 71.00
#given_comb = [1, 3, 6, 7, 9, 10, 11, 13, 15, 16] #69.17
#given_comb = [1, 3, 6, 7, 8, 9, 10, 11, 13, 15, 16] #68.27
#given_comb = [0, 1, 3, 6, 7, 8, 9, 10, 11, 13, 15, 16]  # 68.23
#given_comb = [0, 1, 3, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16] # 68.10
#given_comb = [0, 1, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] # 67.83
given_comb = [0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] # 66.10

combinations = list(itertools.combinations(elements, k))
filtered_combinations = [comb for comb in combinations if set(given_comb).issubset(set(comb))]
result = filtered_combinations
#result = combinations

n_runs = 15  # number of k-means runs averaged per descriptor subset
max_mean = 0.0
best_desc = []
des_max_acc = 0.0

# Quick pre-pass: best subset according to a single k-means run each
for desc_list in result:
    desc_list = list(desc_list)
    acc = cluster_sounds_helper(desc_list)
    if acc >= des_max_acc:
        best_desc = desc_list
        des_max_acc = acc

# Main search: rank subsets by their mean accuracy over n_runs runs
# (this overwrites best_desc from the pre-pass)
for desc_list in result:
    desc_list = list(desc_list)
    acc_sum = 0.0
    for _ in range(n_runs):
        acc_sum += cluster_sounds_helper(desc_list)
    mean_acc = acc_sum / n_runs
    if mean_acc >= max_mean:
        best_desc = desc_list
        max_mean = mean_acc

print('**************************** Best Acc. Mean = %.2f ****************************' % max_mean)
print('**************************** Best Desc = ', best_desc)
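The commented-out `given_comb` lines in the cell above appear to record a manual greedy search: each step fixes the best subset found so far and tries adding one more descriptor. A minimal sketch of how that search could be automated follows. The function `greedy_forward_selection` and its `score` argument are hypothetical; `score` stands for any callable that maps a list of descriptor indices to a number where higher is better.

```python
def greedy_forward_selection(candidates, score, max_features=None):
    """Grow a descriptor subset one index at a time, keeping the best addition.

    candidates: iterable of descriptor indices.
    score: callable taking a sorted list of indices, returning a number
           (higher is better).
    Returns (best_subset, best_score, history); history holds the
    (subset, score) pair chosen at every subset size.
    """
    remaining = list(candidates)
    selected, history = [], []
    best_subset, best_score = [], float("-inf")
    limit = len(remaining) if max_features is None else min(max_features, len(remaining))
    while len(selected) < limit:
        # Score every one-descriptor extension of the current subset.
        scored = [(score(sorted(selected + [c])), c) for c in remaining]
        step_score, step_choice = max(scored)
        selected.append(step_choice)
        remaining.remove(step_choice)
        history.append((sorted(selected), step_score))
        if step_score > best_score:
            best_subset, best_score = sorted(selected), step_score
    return best_subset, best_score, history
```

Here `score` could wrap `cluster_sounds_helper` and average several runs, since a single k-means run is noisy. Greedy selection evaluates far fewer subsets than an exhaustive search over `itertools.combinations`, but it is not guaranteed to find the best subset overall.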

### Question 3: Suggest improvements

As you can observe, the clustering performance is poorer with 10 instruments. Using Essentia, you will implement the two different improvements described in the introduction of this assignment:
- Better and more features: Shortlist a set of descriptors based on the sound characteristics of the instruments such that they can differentiate between the instruments. The choice of the descriptors computed is up to you. We suggest you compute many different descriptors similar to the ones returned by the Freesound API, plus additional ones described in the class lectures. The descriptors you used in A9 (but now computed using Essentia) are a good starting point. You can use the Essentia extractors that compute many frame-wise low-level descriptors together (http://essentia.upf.edu/documentation/algorithms_overview.html#extractors). You can then use a subset of them for clustering to improve clustering performance.
- Computing the descriptors stripping the silences and noise at the beginning/end: For each sound, compute the energy of each frame of audio. You can then detect the low energy frames (silence) using a threshold on the energy of the frame. Since most of the single notes you will use are well recorded, the energy of silence regions is very low and a single threshold might work well for all the sounds. Plot the frame energy over time for a few sounds to determine a meaningful energy threshold. Subsequently, compute the mean descriptor value discarding these silent frames.
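The silence-stripping step can be illustrated with a small sketch. It uses plain NumPy rather than Essentia so that it runs on its own; the frame size, hop size, and the -60 dB threshold are assumptions chosen for the synthetic signal below, and the spectral centroid stands in for whichever descriptors you compute. All function names are hypothetical.

```python
import numpy as np


def frame_signal(x, frame_size=1024, hop_size=512):
    """Split a mono signal into overlapping frames (one frame per row)."""
    n_frames = 1 + max(0, (len(x) - frame_size) // hop_size)
    return np.stack([x[i * hop_size:i * hop_size + frame_size]
                     for i in range(n_frames)])


def frame_energy_db(frames, eps=1e-12):
    """Mean-square energy of each frame, in dB."""
    return 10.0 * np.log10(np.mean(frames ** 2, axis=1) + eps)


def spectral_centroid(frames, fs):
    """Magnitude-weighted mean frequency (Hz) of each Hann-windowed frame."""
    mag = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    return (mag @ freqs) / (np.sum(mag, axis=1) + 1e-12)


def mean_descriptor_without_silence(x, fs, threshold_db=-60.0):
    """Mean centroid over non-silent frames, mean over all frames, and the mask."""
    frames = frame_signal(x)
    keep = frame_energy_db(frames) > threshold_db  # True for non-silent frames
    centroid = spectral_centroid(frames, fs)
    return centroid[keep].mean(), centroid.mean(), keep


# Synthetic check: 0.5 s noise floor, 1 s of a 440 Hz tone, 0.5 s noise floor.
fs = 44100
rng = np.random.default_rng(0)
signal = 1e-4 * rng.standard_normal(2 * fs)
t = np.arange(fs) / fs
signal[fs // 2:fs // 2 + fs] += 0.5 * np.sin(2 * np.pi * 440 * t)

trimmed_mean, raw_mean, keep = mean_descriptor_without_silence(signal, fs)
print(f"kept {keep.sum()} of {keep.size} frames")
print(f"mean centroid: {raw_mean:.0f} Hz (all frames), {trimmed_mean:.0f} Hz (non-silent)")
```

Plotting `frame_energy_db(frame_signal(x))` for a few real sounds is one way to pick a threshold. If no frame passes the threshold, the trimmed mean is NaN, so a real implementation would need to handle that case. Essentia provides its own framing and energy algorithms; consult its documentation for the exact names and parameters.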

Report the set of descriptors you computed and the performance it achieves, along with a brief explanation of your observations. You can also report the results for several combinations of features and finally report the best performance you achieved. Upload the code for computing the non-silent regions and for computing the descriptors that you used. Apart from the two enhancements suggested above, you are free to try further enhancements that improve clustering performance. In your report, describe these enhancements and the improvement they resulted in.

Please upload the code now. You will be evaluated on the code you upload, and on the observations and discussion in your report.