# Classification in PySpark's MLlib Project Solution

### Genre classification
For this project we will be classifying songs based on a number of characteristics into a set of 23 electronic genres. This technology could be used by an application like Pandora to recommend songs to users or just create meaningful channels.

### Dataset
*beatsdataset.csv*
Each row is an electronic music song. The dataset contains 100 songs for each of 23 electronic music genres; they were the top 100 songs of their genres in November 2016. The 71 columns are audio features extracted from a random two-minute sample of each audio file. These features were extracted using pyAudioAnalysis (https://github.com/tyiannak/pyAudioAnalysis).

### Your task
Create an algorithm that classifies songs into the 23 genres provided. Test out several different models and select the highest performing one. Also play around with feature selection methods and finally try to make a recommendation to a user.  

### Source
https://www.kaggle.com/caparrini/beatsdataset

In [1]:
import findspark
findspark.init()

import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
# May take awhile locally
spark = SparkSession.builder.appName("ClassificationPS").getOrCreate()

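# Note: this counts the entries in the executor memory status map, which is not necessarily the number of CPU cores
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")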
spark

You are working with 1 core(s)


In [2]:
# Read in dependencies
from pyspark.ml.feature import VectorAssembler, StringIndexer, MinMaxScaler
from pyspark.sql.types import * 

from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.sql.functions import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

Read in our dataset

In [3]:
path ="C:\\Users\\Suraj\\PySpark\\SparkMLLib\\beatsdataset.csv\\"
df = spark.read.csv(path+'beatsdataset.csv',inferSchema=True,header=True)

### Check out the dataset

Let's produce a print out of the dataframe so we know what we are working with.

In [4]:
df.limit(6).toPandas()

Unnamed: 0,_c0,1-ZCRm,2-Energym,3-EnergyEntropym,4-SpectralCentroidm,5-SpectralSpreadm,6-SpectralEntropym,7-SpectralFluxm,8-SpectralRolloffm,9-MFCCs1m,...,63-ChromaVector8std,64-ChromaVector9std,65-ChromaVector10std,66-ChromaVector11std,67-ChromaVector12std,68-ChromaDeviationstd,69-BPM,70-BPMconf,71-BPMessentia,class
0,0,0.13644,0.088861,3.201201,0.262825,0.249212,1.114423,0.007003,0.256682,-22.723259,...,0.003431,0.004981,0.010818,0.024001,0.005201,0.015056,133.333333,0.132792,128.0,BigRoom
1,1,0.117039,0.108389,3.194001,0.247657,0.250288,1.065668,0.005387,0.199821,-21.775871,...,0.004461,0.006441,0.007469,0.015499,0.005589,0.019339,120.0,0.112767,126.0,BigRoom
2,2,0.085308,0.128525,3.123837,0.217205,0.228652,0.789647,0.008247,0.156822,-22.472722,...,0.001529,0.004556,0.007723,0.017482,0.002901,0.022201,133.333333,0.123373,129.0,BigRoom
3,3,0.10305,0.167042,3.15083,0.233593,0.245032,0.967082,0.006571,0.168083,-21.470751,...,0.001591,0.003514,0.009477,0.023162,0.004165,0.015379,133.333333,0.158876,129.0,BigRoom
4,4,0.15173,0.148405,3.194498,0.29373,0.267231,1.353005,0.003872,0.292055,-21.371157,...,0.003945,0.004131,0.01133,0.028188,0.002639,0.019079,133.333333,0.190708,129.0,BigRoom
5,5,0.127047,0.153488,3.221987,0.261693,0.257361,1.090034,0.004943,0.230099,-21.234846,...,0.002986,0.006533,0.010347,0.025008,0.003035,0.019479,133.333333,0.168933,129.0,BigRoom


In [5]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- 1-ZCRm: double (nullable = true)
 |-- 2-Energym: double (nullable = true)
 |-- 3-EnergyEntropym: double (nullable = true)
 |-- 4-SpectralCentroidm: double (nullable = true)
 |-- 5-SpectralSpreadm: double (nullable = true)
 |-- 6-SpectralEntropym: double (nullable = true)
 |-- 7-SpectralFluxm: double (nullable = true)
 |-- 8-SpectralRolloffm: double (nullable = true)
 |-- 9-MFCCs1m: double (nullable = true)
 |-- 10-MFCCs2m: double (nullable = true)
 |-- 11-MFCCs3m: double (nullable = true)
 |-- 12-MFCCs4m: double (nullable = true)
 |-- 13-MFCCs5m: double (nullable = true)
 |-- 14-MFCCs6m: double (nullable = true)
 |-- 15-MFCCs7m: double (nullable = true)
 |-- 16-MFCCs8m: double (nullable = true)
 |-- 17-MFCCs9m: double (nullable = true)
 |-- 18-MFCCs10m: double (nullable = true)
 |-- 19-MFCCs11m: double (nullable = true)
 |-- 20-MFCCs12m: double (nullable = true)
 |-- 21-MFCCs13m: double (nullable = true)
 |-- 22-ChromaVector1m: double (null

### How many classes do we have and are they balanced?

Just making sure :) 

We have a perfectly balanced dataset; however, we only have 100 examples from each class, which may make training a decent model challenging. But let's see what we can do!

*Note: This almost never happens in real life :)*

In [6]:
df.groupBy("class").count().show(100)

+--------------------+-----+
|               class|count|
+--------------------+-----+
|           PsyTrance|  100|
|           HardDance|  100|
|              Breaks|  100|
|  HardcoreHardTechno|  100|
|   IndieDanceNuDisco|  100|
|              Trance|  100|
|           DeepHouse|  100|
|ElectronicaDowntempo|  100|
|           ReggaeDub|  100|
|             Minimal|  100|
|         DrumAndBass|  100|
|             Dubstep|  100|
|             BigRoom|  100|
|              Techno|  100|
|               House|  100|
|         FutureHouse|  100|
|        ElectroHouse|  100|
|           GlitchHop|  100|
|           TechHouse|  100|
|              HipHop|  100|
|           FunkRAndB|  100|
|               Dance|  100|
|    ProgressiveHouse|  100|
+--------------------+-----+
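
With 23 equally sized classes, it helps to know what accuracy pure guessing would give before reading any model scores. The sketch below uses plain Python and hypothetical placeholder genre names (not the real labels) to check balance and compute that chance-level baseline.

```python
from collections import Counter

# Hypothetical labels mirroring the layout above: 23 genres x 100 songs each.
genres = [f"genre_{i}" for i in range(23)]
labels = [g for g in genres for _ in range(100)]

counts = Counter(labels)

# The dataset is balanced when every class has the same count.
is_balanced = len(set(counts.values())) == 1

# Always predicting the most frequent class gives this accuracy (in percent).
chance_accuracy = max(counts.values()) / len(labels) * 100

print(is_balanced, round(chance_accuracy, 2))
```

Any classifier scoring well above this figure is learning something from the audio features.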



## Set up our Data Formatting Function 

Remember that MLlib requires all input columns of your dataframe to be assembled into a single vector column, and our dependent variable needs to be a zero-indexed numeric label. We can do that using our handy dandy function we developed in the lecture. Feel free to make this your own!

For example, this time around I added a print statement to show the new label values for each class. 
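
To make the zero-indexing step less of a black box, here is a plain Python sketch of how a string-to-index mapping can be built. It assumes the ordering used by recent Spark versions of `StringIndexer` (most frequent value first, ties broken alphabetically); `string_index` and the sample list are hypothetical names made up for this illustration.

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to a float index.

    Assumed ordering: descending frequency, ties broken by string order.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Small made-up sample: "House" is most frequent, the rest are tied.
sample = (["Trance"] * 2 + ["BigRoom"] * 2 + ["HardcoreHardTechno"] * 2
          + ["HardDance"] * 2 + ["House"] * 3)

mapping = string_index(sample)
for genre, index in sorted(mapping.items(), key=lambda kv: kv[1]):
    print(genre, index)
```

Because every genre in our dataset has exactly 100 songs, the tie-break decides everything, which is why the label values printed by the function follow alphabetical order (uppercase letters sort before lowercase ones, so `HardDance` comes before `HardcoreHardTechno`).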

In [7]:
# Data Prep function
def MLClassifierDFPrep(df,input_columns,dependent_var,treat_outliers=True,treat_neg_values=True):
    
    # Cast the label (class variable) to string type to prep for reindexing
    # PySpark expects a zero-indexed numeric label column.
    # In case our data is not in that format, we index it with the built-in StringIndexer
    renamed = df.withColumn("label_str", df[dependent_var].cast(StringType())) # Add a string-typed copy of the label
    indexer = StringIndexer(inputCol="label_str", outputCol="label") # PySpark expects this naming convention
    indexed = indexer.fit(renamed).transform(renamed)
    print(indexed.groupBy("class","label").count().show(100))

    # Convert all string type data in the input column list to numeric
    # Otherwise the Algorithm will not be able to process it
    numeric_inputs = []
    string_inputs = []
    for column in input_columns:
        if isinstance(indexed.schema[column].dataType, StringType): # type check that does not depend on how the type prints
            indexer = StringIndexer(inputCol=column, outputCol=column+"_num") 
            indexed = indexer.fit(indexed).transform(indexed)
            new_col_name = column+"_num"
            string_inputs.append(new_col_name)
        else:
            numeric_inputs.append(column)
            
    if treat_outliers == True:
        print("We are correcting for non normality now!")
        # empty dictionary d
        d = {}
        # Create a dictionary of quantiles
        for col in numeric_inputs: 
            d[col] = indexed.approxQuantile(col,[0.01,0.99],0.25) #if you want to make it go faster increase the last number
        #Now fill in the values
        for col in numeric_inputs:
            skew = indexed.agg(skewness(indexed[col])).collect() #check for skewness
            skew = skew[0][0]
            # This will floor, cap and then log+1 (just in case there are 0 values)
            if skew > 1:
                indexed = indexed.withColumn(col, \
                log(when(indexed[col] < d[col][0],d[col][0])\
                .when(indexed[col] > d[col][1], d[col][1])\
                .otherwise(indexed[col] ) +1).alias(col))
                print(col+" has been treated for positive (right) skewness. (skew =",skew,")")
            elif skew < -1:
                indexed = indexed.withColumn(col, \
                exp(when(indexed[col] < d[col][0],d[col][0])\
                .when(indexed[col] > d[col][1], d[col][1])\
                .otherwise(indexed[col] )).alias(col))
                print(col+" has been treated for negative (left) skewness. (skew =",skew,")")

            
    # Warn the user if the dataframe contains negative values, since Naive Bayes cannot be used on them.
    # Note: we only need to check the numeric input values since anything that is indexed won't have negative values
    minimums = df.select([min(c).alias(c) for c in df.columns if c in numeric_inputs]) # Calculate the mins for all columns in the df
    min_array = minimums.select(array(numeric_inputs).alias("mins")) # Create an array for all mins and select only the input cols
    df_minimum = min_array.select(array_min(min_array.mins)).collect() # Collect global min as Python object
    df_minimum = df_minimum[0][0] # Slice to get the number itself

    features_list = numeric_inputs + string_inputs
    assembler = VectorAssembler(inputCols=features_list,outputCol='features')
    output = assembler.transform(indexed).select('features','label')

#     final_data = output.select('features','label') #drop everything else
    
    # Now check for negative values and ask user if they want to correct that? 
    if df_minimum < 0:
        print(" ")
        print("WARNING: The Naive Bayes Classifier will not be able to process your dataframe as it contains negative values")
        print(" ")
    
    if treat_neg_values == True:
        print("You have opted to correct that by rescaling all your features to a range of 0 to 1")
        print(" ")
        print("We are rescaling your dataframe....")
        scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

        # Compute summary statistics and generate MinMaxScalerModel
        scalerModel = scaler.fit(output)

        # rescale each feature to the range [min, max] (defaults to [0, 1]).
        scaled_data = scalerModel.transform(output)
        final_data = scaled_data.select('label','scaledFeatures') # keep only the label and the scaled features
        final_data = final_data.withColumnRenamed('scaledFeatures','features')
        print("Done!")

    else:
        print("You have opted not to correct that therefore you will not be able to use the Naive Bayes classifier")
        print("We will return the dataframe unscaled.")
        final_data = output
    
    return final_data
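
The outlier and skew treatment inside the function can be hard to picture, so here is a minimal plain Python sketch of the same idea on one made-up column: measure skewness, floor and cap the values at two bounds, then apply log(x + 1). The helper names, the sample values, and the bounds are all hypothetical; the skewness formula used is the population form (third central moment divided by the 1.5 power of the second), which I am assuming is comparable to what Spark's `skewness` reports.

```python
import math

def skewness(xs):
    """Population skewness: m3 / m2 ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def floor_cap_log(xs, low, high):
    """Clamp each value to [low, high], then take log(x + 1)."""
    return [math.log(min(max(x, low), high) + 1) for x in xs]

# Made-up right-skewed column with one extreme value.
values = [0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 50.0]

before = skewness(values)
treated = floor_cap_log(values, low=0.0, high=10.0)
after = skewness(treated)

print(round(before, 2), round(after, 2))
```

Note that log(x + 1) is only defined for values greater than -1, so the lower bound matters for columns that contain negative numbers.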

## Set up our Training and Evaluation Function

In [10]:
def ClassTrainEval(classifier,features,classes,folds,train,test):
    
    def FindMtype(classifier):
        # The classifier is passed in already instantiated
        M = classifier
        # Learn what it is
        Mtype = type(M).__name__
        
        return Mtype
    
    Mtype = FindMtype(classifier)
    

    def IntanceFitModel(Mtype,classifier,classes,features,folds,train):
        
        if Mtype == "OneVsRest":
            # instantiate the base classifier.
            lr = LogisticRegression()
            # instantiate the One Vs Rest Classifier.
            OVRclassifier = OneVsRest(classifier=lr)
#             fitModel = OVRclassifier.fit(train)
            # Add parameters of your choice here:
            paramGrid = ParamGridBuilder() \
                .addGrid(lr.regParam, [0.1, 0.01]) \
                .build()
            #Cross Validator requires the following parameters:
            crossval = CrossValidator(estimator=OVRclassifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=folds) # 3 is best practice
            # Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
        if Mtype == "MultilayerPerceptronClassifier":
            # specify layers for the neural network:
            # input layer of size features, two intermediate of features+1 and same size as features
            # and output of size number of classes
            # Note: crossvalidator cannot be used here
            features_count = len(features[0][0])
            layers = [features_count, features_count+1, features_count, classes]
            MPC_classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
            fitModel = MPC_classifier.fit(train)
            return fitModel
        if Mtype in("LinearSVC","GBTClassifier") and classes != 2: # These classifiers currently only accept binary classification
            print(Mtype," could not be used because PySpark currently only accepts binary classification data for this algorithm")
            return
        if Mtype in("LogisticRegression","NaiveBayes","RandomForestClassifier","GBTClassifier","LinearSVC","DecisionTreeClassifier"):
  
            # Add parameters of your choice here:
            if Mtype in("LogisticRegression"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .addGrid(classifier.maxIter, [10, 15,20])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("NaiveBayes"):
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6]) \
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("RandomForestClassifier"):
                paramGrid = (ParamGridBuilder() \
                               .addGrid(classifier.maxDepth, [2, 5, 10])
#                                .addGrid(classifier.maxBins, [5, 10, 20])
#                                .addGrid(classifier.numTrees, [5, 20, 50])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("GBTClassifier"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
#                              .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                             .addGrid(classifier.maxIter, [10, 15,50,100])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("LinearSVC"):
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.maxIter, [10, 15]) \
                             .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .build())
            
            # Add parameters of your choice here:
            if Mtype in("DecisionTreeClassifier"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
                             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                             .build())
            
            #Cross Validator requires all of the following parameters:
            crossval = CrossValidator(estimator=classifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=folds) # 3 + is best practice
            # Fit Model: Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
    
    fitModel = IntanceFitModel(Mtype,classifier,classes,features,folds,train)
    
    # Print feature selection metrics
    if fitModel is not None:
        
        if Mtype in("OneVsRest"):
            # Get Best Model
            BestModel = fitModel.bestModel
            global OVR_BestModel
            OVR_BestModel = BestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            # Extract list of binary models
            models = BestModel.models
            for model in models:
                print('\033[1m' + 'Intercept: '+ '\033[0m',model.intercept)
                print('\033[1m' + 'Top 20 Coefficients:'+ '\033[0m')
                coeff_array = model.coefficients.toArray()
                coeff_scores = []
                for x in coeff_array:
                    coeff_scores.append(float(x))
                # Then zip with input_columns list and create a df
                result = spark.createDataFrame(zip(input_columns,coeff_scores), schema=['feature','coeff'])
                print(result.orderBy(result["coeff"].desc()).show(truncate=False))


        if Mtype == "MultilayerPerceptronClassifier":
            print("")
            print('\033[1m' + Mtype + '\033[0m')
            print('\033[1m' + "Model Weights: "+ '\033[0m',fitModel.weights.size)
            print("")
            global MLPC_BestModel
            MLPC_BestModel = fitModel

        if Mtype in("DecisionTreeClassifier", "GBTClassifier","RandomForestClassifier"):
            # FEATURE IMPORTANCES
            # Estimate of the importance of each feature.
            # Each feature’s importance is the average of its importance across all trees
            # in the ensemble. The importance vector is normalized to sum to 1.
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Top 20 Feature Importances"+ '\033[0m')
            print("(Scores add up to 1)")
            print("Lowest score is the least important")
            print(" ")
            featureImportances = BestModel.featureImportances.toArray()
            # Convert from numpy array to list
            imp_scores = []
            for x in featureImportances:
                imp_scores.append(float(x))
            # Then zip with input_columns list and create a df
            result = spark.createDataFrame(zip(input_columns,imp_scores), schema=['feature','score'])
            print(result.orderBy(result["score"].desc()).show(truncate=False))
            
            # Save the feature importance values and the models
            if Mtype in("DecisionTreeClassifier"):
                global DT_featureimportances
                DT_featureimportances = BestModel.featureImportances.toArray()
                global DT_BestModel
                DT_BestModel = BestModel
            if Mtype in("GBTClassifier"):
                global GBT_featureimportances
                GBT_featureimportances = BestModel.featureImportances.toArray()
                global GBT_BestModel
                GBT_BestModel = BestModel
            if Mtype in("RandomForestClassifier"):
                global RF_featureimportances
                RF_featureimportances = BestModel.featureImportances.toArray()
                global RF_BestModel
                RF_BestModel = BestModel

        # Print the coefficients
        if Mtype in("LogisticRegression"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            print("Intercept: " + str(BestModel.interceptVector))
            print('\033[1m' + " Top 20 Coefficients"+ '\033[0m')
            print("You should compare these relative to each other")
            # Convert from numpy array to list
            # Note: only the first row of the coefficient matrix (the coefficients for class 0) is shown
            coeff_array = BestModel.coefficientMatrix.toArray()
            coeff_scores = []
            for x in coeff_array[0]:
                coeff_scores.append(float(x))
            # Then zip with input_columns list and create a df
            result = spark.createDataFrame(zip(input_columns,coeff_scores), schema=['feature','coeff'])
            print(result.orderBy(result["coeff"].desc()).show(truncate=False))
            # Save the coefficient values and the models
            global LR_coefficients
            LR_coefficients = BestModel.coefficientMatrix.toArray()
            global LR_BestModel
            LR_BestModel = BestModel

        # Print the Coefficients
        if Mtype in("LinearSVC"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            print("Intercept: " + str(BestModel.intercept))
            print('\033[1m' + "Top 20 Coefficients"+ '\033[0m')
            print("You should compare these relative to each other")
#             print("Coefficients: \n" + str(BestModel.coefficients))
            coeff_array = BestModel.coefficients.toArray()
            coeff_scores = []
            for x in coeff_array:
                coeff_scores.append(float(x))
            # Then zip with input_columns list and create a df
            result = spark.createDataFrame(zip(input_columns,coeff_scores), schema=['feature','coeff'])
            print(result.orderBy(result["coeff"].desc()).show(truncate=False))
            # Save the coefficient values and the models
            global LSVC_coefficients
            LSVC_coefficients = BestModel.coefficients.toArray()
            global LSVC_BestModel
            LSVC_BestModel = BestModel
        
   
    # Set the column names to match the external results dataframe that we will join with later:
    columns = ['Classifier', 'Result']
    
    if Mtype in("LinearSVC","GBTClassifier") and classes != 2:
        Mtype = [Mtype] # make this a list
        score = ["N/A"]
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
    else:
        predictions = fitModel.transform(test)
        MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy") # predictionCol="prediction",
        accuracy = (MC_evaluator.evaluate(predictions))*100
        Mtype = [Mtype] # make this a list
        score = [str(accuracy)] # convert to a string and wrap in a list
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
        result = result.withColumn('Result',result.Result.substr(0, 5))
        
    return result
    #Also returns the fit model important scores or p values
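
To get a feel for how much work the cross-validation step in this function does, the plain Python sketch below expands a parameter grid into every combination and counts the resulting model fits. `build_grid` is a hypothetical stand-in for what a grid builder produces, and the count assumes one fit per combination per fold plus one final refit of the best combination on the full training set.

```python
from itertools import product

def build_grid(**param_values):
    """Return one dict per combination of the supplied parameter values."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in product(*param_values.values())]

# Same values as the LinearSVC grid in the function above.
grid = build_grid(maxIter=[10, 15], regParam=[0.1, 0.01])
folds = 2

# One fit per (combination, fold), plus an assumed final refit on all training data.
fits = len(grid) * folds + 1

for params in grid:
    print(params)
print("model fits:", fits)
```

This is why uncommenting extra `addGrid` lines can slow a run down quickly: each added parameter multiplies the number of combinations.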

## Testing Time! 


In [11]:
# Set up independent and dependent vars
input_columns = df.columns
input_columns = input_columns[1:-1] # keep only relevant columns: everything but the first and last cols
dependent_var = 'class'

# Learn how many classes there are in order to specify evaluation type based on binary or multi and turn the df into an object
class_count = df.select(countDistinct("class")).collect()
classes = class_count[0][0]
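
The feature count and `classes` also determine the shape of the network that the MultilayerPerceptronClassifier branch of `ClassTrainEval` builds. As a rough sketch in plain Python, the snippet below reproduces that layer list for 71 features and 23 classes and estimates the number of trainable values. The helper names are hypothetical, and the count assumes fully connected layers with one bias per unit, which may not match exactly what the fitted model reports.

```python
def mlp_layers(features_count, classes):
    """Layer sizes used above: input, two hidden layers, output."""
    return [features_count, features_count + 1, features_count, classes]

def weight_count(layers):
    """Weights plus one bias per unit for each pair of consecutive layers (assumption)."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

layers = mlp_layers(71, 23)
print(layers, weight_count(layers))
```

With only 2,300 songs in total, a network of this size has several trainable values per training example, which is worth keeping in mind when reading its score.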

### Test 1: Without outlier treatment, skew or negative value treatment

This first run will act as our baseline so we can understand how our treatments affect the results. I will opt out of outlier, skew, and negative value treatment, which means we cannot use the Naive Bayes classifier. 
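
One caveat when comparing against a baseline: an unseeded random split puts different songs in the test set on every run, so small differences between tests can be noise. The plain Python sketch below shows the idea of a reproducible 70/30 split using a fixed seed; `seeded_split` is a hypothetical helper, and unlike Spark's `randomSplit` (which accepts an optional `seed` argument and produces only approximately sized splits) it cuts at an exact position.

```python
import random

def seeded_split(rows, train_fraction, seed):
    """Shuffle a copy of rows with a fixed seed and cut it into train/test."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(2300))  # stand-in for the 2,300 songs

train_a, test_a = seeded_split(rows, 0.7, seed=42)
train_b, test_b = seeded_split(rows, 0.7, seed=42)

print(len(train_a), len(test_a), train_a == train_b)
```

Using the same seed for every test keeps the train and test rows fixed, so score changes reflect the treatments rather than the split.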

In [12]:
# Call on data prep, train and evaluate functions
test1_data = MLClassifierDFPrep(df,input_columns,dependent_var,treat_outliers=False,treat_neg_values=False)
test1_data.limit(5).toPandas()

# Comment out Naive Bayes if your data still contains negative values
classifiers = [
                LogisticRegression()
                ,OneVsRest()
               ,LinearSVC()
#                ,NaiveBayes()
               ,RandomForestClassifier()
               ,GBTClassifier()
               ,DecisionTreeClassifier()
               ,MultilayerPerceptronClassifier()
              ] 

train,test = test1_data.randomSplit([0.7,0.3])
features = test1_data.select(['features']).collect()
folds = 2 # because we have limited data

#set up your results table
columns = ['Classifier', 'Result']
vals = [("Place Holder","N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier,features,classes,folds,train,test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
print("!!!!!Final Results!!!!!!!!")
results.show(100,False)

+--------------------+-----+-----+
|               class|label|count|
+--------------------+-----+-----+
|           ReggaeDub| 19.0|  100|
|           FunkRAndB|  8.0|  100|
|              Trance| 22.0|  100|
|              Techno| 21.0|  100|
|             Dubstep|  5.0|  100|
|   IndieDanceNuDisco| 15.0|  100|
|ElectronicaDowntempo|  7.0|  100|
|               House| 14.0|  100|
|  HardcoreHardTechno| 12.0|  100|
|              HipHop| 13.0|  100|
|           GlitchHop| 10.0|  100|
|         DrumAndBass|  4.0|  100|
|           PsyTrance| 18.0|  100|
|    ProgressiveHouse| 17.0|  100|
|        ElectroHouse|  6.0|  100|
|           HardDance| 11.0|  100|
|             Minimal| 16.0|  100|
|           TechHouse| 20.0|  100|
|           DeepHouse|  3.0|  100|
|             BigRoom|  0.0|  100|
|         FutureHouse|  9.0|  100|
|               Dance|  2.0|  100|
|              Breaks|  1.0|  100|
+--------------------+-----+-----+

None
 
 
You have opted not to correct that therefore 

+---------------------+------------------+
|feature              |coeff             |
+---------------------+------------------+
|27-ChromaVector6m    |34.87618719719077 |
|68-ChromaDeviationstd|31.451543810532648|
|34-ChromaDeviationm  |29.969875060075655|
|65-ChromaVector10std |28.544488497205403|
|26-ChromaVector5m    |12.833859453747076|
|36-Energystd         |11.549608016111884|
|61-ChromaVector6std  |9.049250764204633 |
|31-ChromaVector10m   |8.928990133390979 |
|24-ChromaVector3m    |7.401049242973041 |
|25-ChromaVector4m    |5.57139070033116  |
|53-MFCCs11std        |5.469320230158945 |
|35-ZCRstd            |5.089628063234102 |
|47-MFCCs5std         |4.909626328398783 |
|52-MFCCs10std        |4.48174155432709  |
|8-SpectralRolloffm   |4.027000000173122 |
|30-ChromaVector9m    |3.770775916809555 |
|4-SpectralCentroidm  |3.117634251830983 |
|49-MFCCs7std         |3.0600173029368283|
|51-MFCCs9std         |3.0016250085154104|
|48-MFCCs6std         |2.4800091238956394|
+----------

+--------------------+------------------+
|feature             |coeff             |
+--------------------+------------------+
|63-ChromaVector8std |81.52846760721211 |
|27-ChromaVector6m   |39.23217700708037 |
|60-ChromaVector5std |19.589965469479864|
|36-Energystd        |15.884978320775689|
|34-ChromaDeviationm |15.094809577747059|
|55-MFCCs13std       |13.293501407182811|
|33-ChromaVector12m  |8.554954493576886 |
|54-MFCCs12std       |8.54080584005462  |
|66-ChromaVector11std|8.241812316420988 |
|67-ChromaVector12std|7.630556160916408 |
|48-MFCCs6std        |6.091560910218815 |
|44-MFCCs2std        |3.649970287937713 |
|53-MFCCs11std       |3.625850576862127 |
|2-Energym           |3.117209401993878 |
|52-MFCCs10std       |3.0016138064300213|
|24-ChromaVector3m   |2.1638597340124934|
|18-MFCCs10m         |1.1492784192590582|
|58-ChromaVector3std |1.0156758961478314|
|3-EnergyEntropym    |0.885677265634934 |
|15-MFCCs7m          |0.7474046225123673|
+--------------------+------------

+----------------------+------------------+
|feature               |coeff             |
+----------------------+------------------+
|33-ChromaVector12m    |53.529794219624634|
|65-ChromaVector10std  |43.78333249256184 |
|22-ChromaVector1m     |29.971968504164593|
|56-ChromaVector1std   |28.832406123086756|
|68-ChromaDeviationstd |13.88183724927122 |
|30-ChromaVector9m     |12.654810683627893|
|31-ChromaVector10m    |11.054122947940648|
|36-Energystd          |10.171553228388785|
|38-SpectralCentroidstd|9.206206015343517 |
|59-ChromaVector4std   |8.556228416780115 |
|23-ChromaVector2m     |7.663544527046133 |
|41-SpectralFluxstd    |7.307079324656791 |
|39-SpectralSpreadstd  |6.407327807378699 |
|66-ChromaVector11std  |5.023241625740009 |
|21-MFCCs13m           |4.356361468172659 |
|28-ChromaVector7m     |4.2607281696173525|
|61-ChromaVector6std   |4.2444568531078275|
|25-ChromaVector4m     |3.3980441242469164|
|42-SpectralRolloffstd |3.206583519886723 |
|20-MFCCs12m           |3.031574

### Test 2: Test treatments

Train and evaluate models on treated data (outliers, skewness and negative values) and compare to baseline (#1).
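
The negative value treatment is a min-max rescale of every feature, which is what makes Naive Bayes usable here. A minimal plain Python sketch of that rescaling on a single made-up column follows; `min_max_scale` is a hypothetical helper, and returning 0.5 for a constant column is my reading of how the default [0, 1] range handles that edge case.

```python
def min_max_scale(column):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: assumed to map to the midpoint of [0, 1].
        return [0.5 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Made-up column containing negative values.
scaled = min_max_scale([-2.0, 0.0, 2.0])
print(scaled)
```

After rescaling, no feature value is negative, at the cost of making each feature's scale depend on its observed minimum and maximum.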

In [13]:
# Call on data prep, train and evaluate functions
test2_data = MLClassifierDFPrep(df,input_columns,dependent_var,treat_outliers=True,treat_neg_values=True)
test2_data.limit(5).toPandas()

# Comment out Naive Bayes if your data still contains negative values
classifiers = [
                LogisticRegression()
                ,OneVsRest()
               ,LinearSVC()
               ,NaiveBayes()
               ,RandomForestClassifier()
               ,GBTClassifier()
               ,DecisionTreeClassifier()
               ,MultilayerPerceptronClassifier()
              ] 

train,test = test2_data.randomSplit([0.7,0.3])
features = test2_data.select(['features']).collect()
folds = 2

#set up your results table
columns = ['Classifier', 'Result']
vals = [("Place Holder","N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier,features,classes,folds,train,test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
print("!!!!!Final Results!!!!!!!!")
results.show(100,False)

+--------------------+-----+-----+
|               class|label|count|
+--------------------+-----+-----+
|           ReggaeDub| 19.0|  100|
|           FunkRAndB|  8.0|  100|
|              Trance| 22.0|  100|
|              Techno| 21.0|  100|
|             Dubstep|  5.0|  100|
|   IndieDanceNuDisco| 15.0|  100|
|ElectronicaDowntempo|  7.0|  100|
|               House| 14.0|  100|
|  HardcoreHardTechno| 12.0|  100|
|              HipHop| 13.0|  100|
|           GlitchHop| 10.0|  100|
|         DrumAndBass|  4.0|  100|
|           PsyTrance| 18.0|  100|
|    ProgressiveHouse| 17.0|  100|
|        ElectroHouse|  6.0|  100|
|           HardDance| 11.0|  100|
|             Minimal| 16.0|  100|
|           TechHouse| 20.0|  100|
|           DeepHouse|  3.0|  100|
|             BigRoom|  0.0|  100|
|         FutureHouse|  9.0|  100|
|               Dance|  2.0|  100|
|              Breaks|  1.0|  100|
+--------------------+-----+-----+

None
We are correcting for non normality now!
7-Spectr

+----------------------+------------------+
|feature               |coeff             |
+----------------------+------------------+
|29-ChromaVector8m     |4.236737521849286 |
|70-BPMconf            |2.511238796608961 |
|10-MFCCs2m            |2.458518398375648 |
|55-MFCCs13std         |2.03292627132015  |
|23-ChromaVector2m     |1.7283455684029672|
|36-Energystd          |1.6611577509964706|
|67-ChromaVector12std  |1.6607208274476601|
|26-ChromaVector5m     |1.5944692176090174|
|38-SpectralCentroidstd|1.560373830330383 |
|53-MFCCs11std         |1.350284864845501 |
|16-MFCCs8m            |1.0842456361305908|
|22-ChromaVector1m     |1.0171976058874173|
|58-ChromaVector3std   |0.9742735237898289|
|64-ChromaVector9std   |0.9479167410584675|
|71-BPMessentia        |0.9423620665847317|
|34-ChromaDeviationm   |0.9274129912544166|
|11-MFCCs3m            |0.8290773246075739|
|47-MFCCs5std          |0.6914546912620478|
|50-MFCCs8std          |0.5981184158517683|
|28-ChromaVector7m     |0.580205

+---------------------+------------------+
|feature              |coeff             |
+---------------------+------------------+
|70-BPMconf           |6.002418885403301 |
|63-ChromaVector8std  |2.8722325935765163|
|2-Energym            |2.241260346729358 |
|69-BPM               |2.0970194111748106|
|67-ChromaVector12std |1.9460078780294623|
|3-EnergyEntropym     |1.8088172875238948|
|18-MFCCs10m          |1.6140882491348372|
|48-MFCCs6std         |1.5735594655701914|
|9-MFCCs1m            |1.5645572975227586|
|13-MFCCs5m           |1.4775353937247908|
|28-ChromaVector7m    |1.3420876709815148|
|59-ChromaVector4std  |1.3073336610621515|
|71-BPMessentia       |1.2882026426190192|
|46-MFCCs4std         |1.2578297726321548|
|68-ChromaDeviationstd|1.0992078544186905|
|5-SpectralSpreadm    |1.0167888797135514|
|16-MFCCs8m           |0.876571088868233 |
|25-ChromaVector4m    |0.8577356860777159|
|15-MFCCs7m           |0.7554656079378208|
|52-MFCCs10std        |0.7161216278251732|
+----------

+---------------------+------------------+
|feature              |coeff             |
+---------------------+------------------+
|59-ChromaVector4std  |4.429713460808254 |
|47-MFCCs5std         |2.8364482610936106|
|39-SpectralSpreadstd |2.6495515211130725|
|66-ChromaVector11std |2.3994241807069   |
|54-MFCCs12std        |2.2454061205534988|
|15-MFCCs7m           |2.197352642674322 |
|55-MFCCs13std        |2.1152315222377083|
|25-ChromaVector4m    |1.924112057241509 |
|45-MFCCs3std         |1.9090951360823352|
|63-ChromaVector8std  |1.6760972783168258|
|51-MFCCs9std         |1.653926173486973 |
|67-ChromaVector12std |1.619966508708444 |
|12-MFCCs4m           |1.6023507518145685|
|37-EnergyEntropystd  |1.5810108151953322|
|53-MFCCs11std        |1.557761405617503 |
|68-ChromaDeviationstd|1.422923791489448 |
|46-MFCCs4std         |1.3244596403914526|
|34-ChromaDeviationm  |1.2108285385604607|
|50-MFCCs8std         |1.1490267220932047|
|27-ChromaVector6m    |1.0323491453836227|
+----------

## Test 3: Feature Selection

It looks like all of our models improved a little from the transformations we applied, and the best performer overall was One vs Rest. Let's try some feature selection to make it even better!

Since we are using the One vs Rest algorithm backed by Logistic Regression, we can use PySpark's Chi Squared Selector to select our features. I intentionally did not cover this in the previous lectures so that you get used to exploring the PySpark documentation.

Here is the link to the documentation on the Chi Squared Selector for more details if you need it: https://spark.apache.org/docs/latest/ml-features#chisqselector
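
To make the selector's ranking less of a black box, here is a minimal plain-Python sketch of the Pearson chi-squared statistic for one feature against the label. The counts are made-up illustrative data, and `chi_squared_statistic` is a hypothetical helper, not part of PySpark. The Spark documentation describes ChiSqSelector as working on categorical features, so its ranking of continuous audio features is probably best treated with some caution.

```python
def chi_squared_statistic(observed):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count we would expect if feature and label were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = feature value (low / high), columns = genre (A / B)
dependent = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]

print(chi_squared_statistic(dependent))    # 20.0
print(chi_squared_statistic(independent))  # 0.0
```

A larger statistic means the feature's distribution differs more across labels, which is the kind of feature a chi-squared selector ranks higher.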

In [14]:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

classifiers = [OneVsRest()] 

#Select the top n features and view results
maximum = len(input_columns)
for n in range(10,maximum,10):
    print("Testing top n = ",n," features")
    
    # For Tree classifiers
#     best_n_features = RF_featureimportances.argsort()[-n:][::-1]
#     best_n_features= best_n_features.tolist() # convert to a list
#     vs = VectorSlicer(inputCol="features", outputCol="best_features", indices=best_n_features)
#     bestFeaturesDf = vs.transform(test2_data)

    # For Logistic regression or One vs Rest
    selector = ChiSqSelector(numTopFeatures=n, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="label")
    bestFeaturesDf = selector.fit(test2_data).transform(test2_data)
    bestFeaturesDf = bestFeaturesDf.select("label","selectedFeatures")
    bestFeaturesDf = bestFeaturesDf.withColumnRenamed("selectedFeatures","features")

    # Collect features
    features = bestFeaturesDf.select(['features']).collect()

    # Split
    train,test = bestFeaturesDf.randomSplit([0.7,0.3])
    
    # Specify folds
    folds = 2

    #set up your results table
    columns = ['Classifier', 'Result']
    vals = [("Place Holder","N/A")]
    results = spark.createDataFrame(vals, columns)

    for classifier in classifiers:
        new_result = ClassTrainEval(classifier,features,classes,folds,train,test)
        results = results.union(new_result)
    results = results.where("Classifier!='Place Holder'")
    results.show(100,False)

Testing top n =  10  features
 
OneVsRest
Intercept: -5.315298561576743
Top 20 Coefficients:
+-------------------+--------------------+
|feature            |coeff               |
+-------------------+--------------------+
|1-ZCRm             |4.191066389900496   |
|10-MFCCs2m         |2.3781878863996253  |
|2-Energym          |1.0640501876795847  |
|8-SpectralRolloffm |0.8563738706853892  |
|4-SpectralCentroidm|0.04042110421820682 |
|7-SpectralFluxm    |-0.07914725345111266|
|3-EnergyEntropym   |-2.1061578098895803 |
|9-MFCCs1m          |-2.3908357052441027 |
|5-SpectralSpreadm  |-2.629787484361112  |
|6-SpectralEntropym |-6.601570188029839  |
+-------------------+--------------------+

None
Intercept: -8.488730125571474
Top 20 Coefficients:
+-------------------+--------------------+
|feature            |coeff               |
+-------------------+--------------------+
|9-MFCCs1m          |3.8104977131958857  |
|10-MFCCs2m         |3.20409622492

+-------------------+-------------------+
|feature            |coeff              |
+-------------------+-------------------+
|9-MFCCs1m          |4.55502866782088   |
|1-ZCRm             |3.1495591169568087 |
|8-SpectralRolloffm |3.0928004842330354 |
|7-SpectralFluxm    |2.8741863765044053 |
|4-SpectralCentroidm|0.7029388865003644 |
|5-SpectralSpreadm  |-0.2524211799312943|
|3-EnergyEntropym   |-0.7924434959155175|
|2-Energym          |-1.0240697150125897|
|10-MFCCs2m         |-2.2783359597817943|
|6-SpectralEntropym |-2.5070070508775935|
+-------------------+-------------------+

None
Intercept: -2.909248722176174
Top 20 Coefficients:
+-------------------+--------------------+
|feature            |coeff               |
+-------------------+--------------------+
|7-SpectralFluxm    |2.011890746145125   |
|10-MFCCs2m         |1.8912569273581632  |
|5-SpectralSpreadm  |1.7798848409879109  |
|1-ZCRm             |0.6580006084675952  |
|2-Energym          |0.05880035730871

+-------------------+---------------------+
|feature            |coeff                |
+-------------------+---------------------+
|1-ZCRm             |3.3287168771712934   |
|13-MFCCs5m         |3.2276920137792637   |
|12-MFCCs4m         |2.264183460145035    |
|17-MFCCs9m         |2.180115496717452    |
|16-MFCCs8m         |1.9602275614333529   |
|5-SpectralSpreadm  |1.4464016922852965   |
|14-MFCCs6m         |1.3899771937075251   |
|9-MFCCs1m          |0.7866797903630256   |
|8-SpectralRolloffm |0.6964501690756549   |
|18-MFCCs10m        |0.5717857576932047   |
|11-MFCCs3m         |1.2963931436499752E-4|
|15-MFCCs7m         |-0.16648929562820078 |
|4-SpectralCentroidm|-0.3959345551939117  |
|2-Energym          |-0.5712428374141691  |
|20-MFCCs12m        |-0.7161648639124066  |
|10-MFCCs2m         |-0.7463112820012415  |
|6-SpectralEntropym |-0.9515427647277604  |
|3-EnergyEntropym   |-1.4261788588208852  |
|7-SpectralFluxm    |-1.9638681926629973  |
|19-MFCCs11m        |-4.52466134

+-------------------+--------------------+
|feature            |coeff               |
+-------------------+--------------------+
|14-MFCCs6m         |3.378068581648291   |
|12-MFCCs4m         |2.687108271378355   |
|19-MFCCs11m        |2.6238515956639947  |
|11-MFCCs3m         |1.9903774633286462  |
|8-SpectralRolloffm |1.5062249944975212  |
|3-EnergyEntropym   |1.4192353006918013  |
|13-MFCCs5m         |0.8438347469995892  |
|18-MFCCs10m        |0.5185142241182289  |
|15-MFCCs7m         |0.3960935584015652  |
|9-MFCCs1m          |0.2516628209666073  |
|4-SpectralCentroidm|0.22358004563518338 |
|7-SpectralFluxm    |0.06304825213180275 |
|5-SpectralSpreadm  |-0.35079414367273987|
|6-SpectralEntropym |-0.48681193997613487|
|10-MFCCs2m         |-0.6922219342451972 |
|1-ZCRm             |-1.1281179130182148 |
|16-MFCCs8m         |-1.19749665995756   |
|17-MFCCs9m         |-1.8222609088574935 |
|2-Energym          |-2.1041953958429    |
|20-MFCCs12m        |-3.7828096852211854 |
+----------

+-------------------+--------------------+
|feature            |coeff               |
+-------------------+--------------------+
|20-MFCCs12m        |8.470208903279604   |
|2-Energym          |3.1256935989569175  |
|19-MFCCs11m        |2.409806969944708   |
|8-SpectralRolloffm |2.25418147657116    |
|18-MFCCs10m        |2.002500369321189   |
|6-SpectralEntropym |1.4999946801451782  |
|17-MFCCs9m         |0.31561900074323024 |
|7-SpectralFluxm    |-0.07009104221955224|
|1-ZCRm             |-0.29281079272861105|
|11-MFCCs3m         |-0.7102519049194459 |
|4-SpectralCentroidm|-1.1047631070049628 |
|14-MFCCs6m         |-1.1467277287913278 |
|9-MFCCs1m          |-2.1001093959941555 |
|15-MFCCs7m         |-2.2958923299394898 |
|10-MFCCs2m         |-2.3068955024175564 |
|16-MFCCs8m         |-2.3386278899301023 |
|5-SpectralSpreadm  |-2.3827366858365084 |
|3-EnergyEntropym   |-2.4888318694652103 |
|12-MFCCs4m         |-3.0148052924010673 |
|13-MFCCs5m         |-3.747615425632214  |
+----------

Py4JJavaError: An error occurred while calling o142906.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38495.0 failed 1 times, most recent failure: Lost task 0.0 in stage 38495.0 (TID 41898, LAPTOP-H2N58RL5, executor driver): TaskResultLost (result lost from block manager)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
	at org.apache.spark.rdd.PairRDDFunctions.$anonfun$countByKey$1(PairRDDFunctions.scala:366)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
	at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:366)
	at org.apache.spark.rdd.RDD.$anonfun$countByValue$1(RDD.scala:1273)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
	at org.apache.spark.rdd.RDD.countByValue(RDD.scala:1273)
	at org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredFeatures(ChiSqTest.scala:124)
	at org.apache.spark.mllib.stat.Statistics$.chiSqTest(Statistics.scala:192)
	at org.apache.spark.mllib.feature.ChiSqSelector.fit(ChiSqSelector.scala:253)
	at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:211)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Unknown Source)


# Results

Did any feature subset improve on the 45.94% result from above?

- 10 features = 32.27% (below baseline)
- 20 features = 39.23% (below baseline)
- 30 features = 39.24% (below baseline)
- 40 features = 42.61% (below baseline)
- 50 features = 43.52% (below baseline)
- 60 features = 42.42% (below baseline)
- 70 features = 40.22% (below baseline)
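
The comparison against the baseline can be checked with a few lines of plain Python. This is a minimal sketch: the scores are copied from the list above, 45.94 is the earlier result being compared against, and the variable names are only illustrative.

```python
baseline_score = 45.94  # earlier result quoted above, in percent

# Scores (percent) per number of selected features, copied from the list above
score_by_n = {10: 32.27, 20: 39.23, 30: 39.24, 40: 42.61,
              50: 43.52, 60: 42.42, 70: 40.22}

best_n = max(score_by_n, key=score_by_n.get)
gap = baseline_score - score_by_n[best_n]
beats_baseline = [n for n, score in score_by_n.items() if score > baseline_score]

print(f"Best subset: top {best_n} features at {score_by_n[best_n]:.2f}%")
print(f"Gap to baseline: {gap:.2f} points")
print("Subsets beating the baseline:", beats_baseline)
```

The best subset (top 50 features) still trails the baseline by about 2.4 points, and the list of subsets beating it is empty.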

**Takeaway:** no feature subset beat our 45.94% baseline, so let's stick with all 71 features. Setting a random seed on the train/test split might have reduced the run-to-run variability we saw here.
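
To illustrate why a seed helps, here is a small plain-Python sketch of a reproducible 70/30 split. `seeded_split` is a hypothetical helper written for this example, and the integer rows stand in for the 2,300 songs (23 genres x 100). In PySpark itself the analogous step would be passing a `seed` argument to `randomSplit`.

```python
import random

def seeded_split(rows, train_fraction, seed):
    """Shuffle a copy of rows with a fixed seed, then cut it into train and test."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(2300))  # stand-in for the songs in the dataset

train_a, test_a = seeded_split(rows, 0.7, seed=42)
train_b, test_b = seeded_split(rows, 0.7, seed=42)
train_c, test_c = seeded_split(rows, 0.7, seed=7)

print(train_a == train_b)  # same seed gives the same split
print(train_a == train_c)  # a different seed gives a different split
```

With a fixed seed, every value of n is evaluated on the same rows, so differences between runs reflect the feature subset rather than the luck of the split.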

## Train final model

Now we can finally train our final model with the best-performing set of features: all 71 of them.

In [15]:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

classifiers = [OneVsRest()] 

# Keep all 71 features: with n equal to the full feature count,
# the selector below should retain every feature
n = 71

# For Logistic regression or One vs Rest
selector = ChiSqSelector(numTopFeatures=n, featuresCol="features",
                     outputCol="selectedFeatures", labelCol="label")
bestFeaturesDf = selector.fit(test2_data).transform(test2_data)
bestFeaturesDf = bestFeaturesDf.select("label","selectedFeatures")
bestFeaturesDf = bestFeaturesDf.withColumnRenamed("selectedFeatures","features")

# Collect features
features = bestFeaturesDf.select(['features']).collect()

# Split
train,test = bestFeaturesDf.randomSplit([0.7,0.3])

# Specify folds
folds = 2

#set up your results table
columns = ['Classifier', 'Result']
vals = [("Place Holder","N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier,features,classes,folds,train,test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
results.show(100,False)

 
OneVsRest
Intercept: -4.003427801470852
Top 20 Coefficients:
+--------------------+------------------+
|feature             |coeff             |
+--------------------+------------------+
|9-MFCCs1m           |4.042674420734869 |
|71-BPMessentia      |2.474342508793368 |
|5-SpectralSpreadm   |1.9682886062154266|
|52-MFCCs10std       |1.6927657857132412|
|11-MFCCs3m          |1.5773684906467658|
|24-ChromaVector3m   |1.5332623045004359|
|43-MFCCs1std        |1.515348653157018 |
|63-ChromaVector8std |1.5061966576175088|
|67-ChromaVector12std|1.3740187344539272|
|54-MFCCs12std       |1.3117512382821122|
|2-Energym           |1.2660160603412804|
|12-MFCCs4m          |1.1270795184283395|
|50-MFCCs8std        |1.045071722718406 |
|37-EnergyEntropystd |0.9552452325419726|
|22-ChromaVector1m   |0.9365209765179571|
|18-MFCCs10m         |0.8911960215143165|
|53-MFCCs11std       |0.8771668891911917|
|13-MFCCs5m          |0.7325555672727921|
|30-ChromaVector9m   |0.722290

+---------------------+------------------+
|feature              |coeff             |
+---------------------+------------------+
|11-MFCCs3m           |3.3695528553595895|
|64-ChromaVector9std  |2.302162517354526 |
|62-ChromaVector7std  |2.226010125874736 |
|17-MFCCs9m           |2.152566269900108 |
|40-SpectralEntropystd|2.0935917987793795|
|57-ChromaVector2std  |2.0158844657759034|
|60-ChromaVector5std  |1.9473362737731892|
|48-MFCCs6std         |1.7365808369581481|
|12-MFCCs4m           |1.521498483341436 |
|55-MFCCs13std        |1.4902449083886826|
|49-MFCCs7std         |1.4631104751617237|
|13-MFCCs5m           |1.4027185123627053|
|19-MFCCs11m          |1.3890475098333301|
|65-ChromaVector10std |1.3301783013797224|
|68-ChromaDeviationstd|1.3167018683348857|
|63-ChromaVector8std  |1.2435459670058886|
|31-ChromaVector10m   |1.09745953465758  |
|22-ChromaVector1m    |1.0808354160355786|
|39-SpectralSpreadstd |0.9719031814036848|
|32-ChromaVector11m   |0.9596414731807011|
+----------

+----------------------+------------------+
|feature               |coeff             |
+----------------------+------------------+
|41-SpectralFluxstd    |3.910783573463649 |
|29-ChromaVector8m     |3.834528683199506 |
|43-MFCCs1std          |3.173521199635303 |
|7-SpectralFluxm       |2.8423506885685232|
|21-MFCCs13m           |2.446408012529311 |
|71-BPMessentia        |2.130072780249942 |
|38-SpectralCentroidstd|1.9441945659065236|
|45-MFCCs3std          |1.692256398848676 |
|39-SpectralSpreadstd  |1.6167937020576884|
|25-ChromaVector4m     |1.508471116351702 |
|37-EnergyEntropystd   |1.3489791516089593|
|34-ChromaDeviationm   |1.3440833083199828|
|28-ChromaVector7m     |1.2497784918268462|
|23-ChromaVector2m     |1.0995649092610245|
|48-MFCCs6std          |0.9030350021382647|
|59-ChromaVector4std   |0.8626396902231167|
|30-ChromaVector9m     |0.8025944671777787|
|27-ChromaVector6m     |0.738671174754509 |
|32-ChromaVector11m    |0.6825255494605555|
|10-MFCCs2m            |0.676548