# Classification in PySpark's MLlib Project Solution

### Genre classification
Now it's time to apply what we learned in the lectures to a REAL classification project! Have you ever wondered what makes us humans able to tell apart two songs of different genres? How do we inherently know the difference between a pop song and heavy metal? This type of classification may seem easy for us, but it's a very difficult challenge for a computer. So the question is, could an automatic genre classification model be possible?

For this project we will be classifying songs based on a number of characteristics into a set of 23 electronic genres. This technology could be used by an application like Pandora to recommend songs to users or just create meaningful channels. Super fun!

### Dataset
*beatsdataset.csv*
Each row is an electronic music song. The dataset contains 100 songs for each of 23 electronic music genres; they were the top 100 songs of their genres in November 2016. The 71 columns are audio features extracted from a random two-minute sample of each audio file. These features were extracted using pyAudioAnalysis (https://github.com/tyiannak/pyAudioAnalysis).

### Your task
Create an algorithm that classifies songs into the 23 genres provided. Test out several different models and select the highest performing one. Also play around with feature selection methods and finally try to make a recommendation to a user.  

For the feature selection aspect of this project, you may need to get a bit creative if you want to select features from a non-tree algorithm. I intentionally did not go over this aspect of PySpark in the previous lectures, to give you a chance to get used to researching the PySpark documentation page. Here is the link to the Feature Selectors section of the documentation that just might come in handy: https://spark.apache.org/docs/latest/ml-features.html#feature-selectors
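
To make the idea of a non-tree feature selector concrete, here is a minimal plain-Python sketch (not PySpark code) of one of the simplest strategies: dropping features that barely vary. The Feature Selectors page lists a `VarianceThresholdSelector` built on this idea, so check the documentation for its exact parameters. The function name, the toy column names and the data below are all hypothetical.

```python
from statistics import variance

def select_by_variance(rows, names, threshold=0.0):
    """Keep only the features whose sample variance is greater than threshold."""
    columns = list(zip(*rows))  # transpose rows into per-feature columns
    kept = [i for i, column in enumerate(columns) if variance(column) > threshold]
    kept_names = [names[i] for i in kept]
    reduced = [[row[i] for i in kept] for row in rows]
    return kept_names, reduced

# Hypothetical toy data: "const" never changes, so it carries no signal.
names = ["bpm", "const", "energy"]
rows = [
    [128.0, 1.0, 0.10],
    [126.0, 1.0, 0.30],
    [140.0, 1.0, 0.20],
]
kept_names, reduced = select_by_variance(rows, names)
print(kept_names)  # ['bpm', 'energy']
```

A selector like this never looks at the label, so it is cheap but blunt. The label-aware selectors on the documentation page are usually a better second step.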

Good luck! Have fun :)


### My approach

I decided to approach this analysis in 4 main steps. 

1. **Create Baseline:** Train and evaluate models on raw data without pre-treating it for outliers, skewness or negative values. This way we can clearly see what effect our transformations have on our analysis. 

2. **Test treatments:** Train and evaluate models on treated data (outliers, skewness and negative values) and compare to baseline (#1).

3. **Feature Selection:** Select the best performing models from the previous two approaches and perform feature selection on them to fine-tune them. 

4. **Make a recommendation to a user:** Create a script to make a recommendation to a user. I intentionally left this part of the project a bit ambiguous.

### Source
https://www.kaggle.com/caparrini/beatsdataset

First things first... let's create our PySpark instance.

In [3]:
# import findspark
# findspark.init()

import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
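# May take a while locally
spark = SparkSession.builder.appName("ClassificationPS").getOrCreate()

# Note: this counts the entries in the executor memory status map (executors,
# driver included), so treat the number as a rough proxy rather than an exact core count
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")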
spark

You are working with 1 core(s)


In [4]:
# Read in dependencies
from pyspark.ml.feature import VectorAssembler, StringIndexer, MinMaxScaler
from pyspark.sql.types import * 

from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.sql.functions import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

Read in our dataset

In [5]:
path ="Datasets/"
df = spark.read.csv(path+'beatsdataset.csv',inferSchema=True,header=True)

### Check out the dataset

Let's produce a print out of the dataframe so we know what we are working with.

In [4]:
df.limit(6).toPandas()

Unnamed: 0,_c0,1-ZCRm,2-Energym,3-EnergyEntropym,4-SpectralCentroidm,5-SpectralSpreadm,6-SpectralEntropym,7-SpectralFluxm,8-SpectralRolloffm,9-MFCCs1m,...,63-ChromaVector8std,64-ChromaVector9std,65-ChromaVector10std,66-ChromaVector11std,67-ChromaVector12std,68-ChromaDeviationstd,69-BPM,70-BPMconf,71-BPMessentia,class
0,0,0.13644,0.088861,3.201201,0.262825,0.249212,1.114423,0.007003,0.256682,-22.723259,...,0.003431,0.004981,0.010818,0.024001,0.005201,0.015056,133.333333,0.132792,128.0,BigRoom
1,1,0.117039,0.108389,3.194001,0.247657,0.250288,1.065668,0.005387,0.199821,-21.775871,...,0.004461,0.006441,0.007469,0.015499,0.005589,0.019339,120.0,0.112767,126.0,BigRoom
2,2,0.085308,0.128525,3.123837,0.217205,0.228652,0.789647,0.008247,0.156822,-22.472722,...,0.001529,0.004556,0.007723,0.017482,0.002901,0.022201,133.333333,0.123373,129.0,BigRoom
3,3,0.10305,0.167042,3.15083,0.233593,0.245032,0.967082,0.006571,0.168083,-21.470751,...,0.001591,0.003514,0.009477,0.023162,0.004165,0.015379,133.333333,0.158876,129.0,BigRoom
4,4,0.15173,0.148405,3.194498,0.29373,0.267231,1.353005,0.003872,0.292055,-21.371157,...,0.003945,0.004131,0.01133,0.028188,0.002639,0.019079,133.333333,0.190708,129.0,BigRoom
5,5,0.127047,0.153488,3.221987,0.261693,0.257361,1.090034,0.004943,0.230099,-21.234846,...,0.002986,0.006533,0.010347,0.025008,0.003035,0.019479,133.333333,0.168933,129.0,BigRoom


In [5]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- 1-ZCRm: double (nullable = true)
 |-- 2-Energym: double (nullable = true)
 |-- 3-EnergyEntropym: double (nullable = true)
 |-- 4-SpectralCentroidm: double (nullable = true)
 |-- 5-SpectralSpreadm: double (nullable = true)
 |-- 6-SpectralEntropym: double (nullable = true)
 |-- 7-SpectralFluxm: double (nullable = true)
 |-- 8-SpectralRolloffm: double (nullable = true)
 |-- 9-MFCCs1m: double (nullable = true)
 |-- 10-MFCCs2m: double (nullable = true)
 |-- 11-MFCCs3m: double (nullable = true)
 |-- 12-MFCCs4m: double (nullable = true)
 |-- 13-MFCCs5m: double (nullable = true)
 |-- 14-MFCCs6m: double (nullable = true)
 |-- 15-MFCCs7m: double (nullable = true)
 |-- 16-MFCCs8m: double (nullable = true)
 |-- 17-MFCCs9m: double (nullable = true)
 |-- 18-MFCCs10m: double (nullable = true)
 |-- 19-MFCCs11m: double (nullable = true)
 |-- 20-MFCCs12m: double (nullable = true)
 |-- 21-MFCCs13m: double (nullable = true)
 |-- 22-ChromaVector1m: double (null

### How many classes do we have and are they balanced?

Just making sure :) 

We have a perfectly balanced dataset. However, we only have 100 examples from each class, which may make training a decent model challenging. But let's see what we can do!

*Note: This almost never happens in real life :)*
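
A balanced dataset also gives us an easy yardstick for the accuracy scores we will see later. The plain-Python sketch below (with hypothetical stand-in genre names) checks the balance and computes the accuracy we would expect from pure guessing.

```python
from collections import Counter

# Hypothetical label list mirroring the dataset: 23 genres, 100 songs each.
genres = [f"genre_{i}" for i in range(23)]
labels = [genre for genre in genres for _ in range(100)]

counts = Counter(labels)
n_classes = len(counts)
is_balanced = len(set(counts.values())) == 1

# On balanced data, uniform random guessing is expected to score 1 / n_classes.
chance_accuracy = 100.0 / n_classes
print(n_classes, is_balanced, f"{chance_accuracy:.2f}")  # 23 True 4.35
```

So any model scoring well above roughly 4% is learning something, even if the absolute number looks modest.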

In [6]:
df.groupBy("class").count().show(100)

+--------------------+-----+
|               class|count|
+--------------------+-----+
|           PsyTrance|  100|
|           HardDance|  100|
|              Breaks|  100|
|  HardcoreHardTechno|  100|
|   IndieDanceNuDisco|  100|
|              Trance|  100|
|           DeepHouse|  100|
|ElectronicaDowntempo|  100|
|           ReggaeDub|  100|
|             Minimal|  100|
|         DrumAndBass|  100|
|             Dubstep|  100|
|             BigRoom|  100|
|              Techno|  100|
|               House|  100|
|         FutureHouse|  100|
|        ElectroHouse|  100|
|           GlitchHop|  100|
|           TechHouse|  100|
|              HipHop|  100|
|           FunkRAndB|  100|
|               Dance|  100|
|    ProgressiveHouse|  100|
+--------------------+-----+



## Set up our Data Formatting Function 

Remember that MLlib requires all input columns of your dataframe to be vectorized and our dependent variable needs to be zero indexed. We can do that using our handy dandy function we developed in the lecture. Feel free to make this your own!

For example, this time around I added a print statement to show the new label values for each class. 
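
If you are wondering where the new label values come from: by default `StringIndexer` orders labels by frequency, with the most frequent label getting index 0. Here is a small plain-Python sketch of that idea (not the Spark implementation). The alphabetical tie-break is an assumption made for illustration. Since every genre in our dataset has exactly 100 songs, everything is a tie, and the order you see in your own printout depends on how your Spark version breaks ties.

```python
from collections import Counter

def index_labels(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Assumption: ties are broken alphabetically; Spark's tie order may differ by version.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

mapping = index_labels(["Techno", "House", "Techno", "Trance", "House", "Techno"])
print(mapping)  # {'Techno': 0.0, 'House': 1.0, 'Trance': 2.0}
```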

In [6]:
# Data Prep function
def MLClassifierDFPrep(df,input_columns,dependent_var,treat_outliers=True,treat_neg_values=True):
    
    # Cast the label (class variable) to string type to prep for re-indexing.
    # PySpark expects a zero-indexed numeric label column, so just in case our data
    # is not in that format we treat it with the built-in StringIndexer.
    renamed = df.withColumn("label_str", df[dependent_var].cast(StringType())) # Copy the label into a string-typed column
    indexer = StringIndexer(inputCol="label_str", outputCol="label") # PySpark expects this naming convention
    indexed = indexer.fit(renamed).transform(renamed)
    # show() prints the table and returns None, so the outer print() also emits a line reading "None"
    print(indexed.groupBy(dependent_var,"label").count().show(100))

    # Convert all string type data in the input column list to numeric
    # Otherwise the Algorithm will not be able to process it
    numeric_inputs = []
    string_inputs = []
    for column in input_columns:
        if isinstance(indexed.schema[column].dataType, StringType): # more robust than comparing str() output, which varies across Spark versions
            indexer = StringIndexer(inputCol=column, outputCol=column+"_num") 
            indexed = indexer.fit(indexed).transform(indexed)
            new_col_name = column+"_num"
            string_inputs.append(new_col_name)
        else:
            numeric_inputs.append(column)
            
    if treat_outliers == True:
        print("We are correcting for non normality now!")
        # empty dictionary d
        d = {}
        # Create a dictionary of quantiles
        for col in numeric_inputs: 
            d[col] = indexed.approxQuantile(col,[0.01,0.99],0.25) #if you want to make it go faster increase the last number
        #Now fill in the values
        for col in numeric_inputs:
            skew = indexed.agg(skewness(indexed[col])).collect() #check for skewness
            skew = skew[0][0]
            # Floor and cap at the quantiles above, then take log(x + 1) (the +1 is just in case there are 0 values)
            if skew > 1:
                indexed = indexed.withColumn(col, \
                log(when(indexed[col] < d[col][0],d[col][0])\
                .when(indexed[col] > d[col][1], d[col][1])\
                .otherwise(indexed[col] ) +1).alias(col))
                print(col+" has been treated for positive (right) skewness. (skew =",skew,")")
            # For left skew: floor and cap the same way, then exponentiate
            elif skew < -1:
                indexed = indexed.withColumn(col, \
                exp(when(indexed[col] < d[col][0],d[col][0])\
                .when(indexed[col] > d[col][1], d[col][1])\
                .otherwise(indexed[col] )).alias(col))
                print(col+" has been treated for negative (left) skewness. (skew =",skew,")")

            
    # Produce a warning if there are negative values in the dataframe, since Naive Bayes cannot be used with them. 
    # Note: we only need to check the numeric input values since anything that is indexed won't have negative values
    minimums = df.select([min(c).alias(c) for c in df.columns if c in numeric_inputs]) # Calculate the minimum of each numeric input column
    min_array = minimums.select(array(numeric_inputs).alias("mins")) # Gather the per-column minimums into a single array column
    df_minimum = min_array.select(array_min(min_array.mins)).collect() # Collect the global minimum as a Python object
    df_minimum = df_minimum[0][0] # Slice to get the number itself

    features_list = numeric_inputs + string_inputs
    assembler = VectorAssembler(inputCols=features_list,outputCol='features')
    output = assembler.transform(indexed).select('features','label')

#     final_data = output.select('features','label') #drop everything else
    
    # Now check for negative values and warn the user if any are present
    if df_minimum < 0:
        print(" ")
        print("WARNING: The Naive Bayes Classifier will not be able to process your dataframe as it contains negative values")
        print(" ")
    
    if treat_neg_values == True:
        print("You have opted to correct that by rescaling all your features to a range of 0 to 1")
        print(" ")
        print("We are rescaling your dataframe....")
        scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

        # Compute summary statistics and generate MinMaxScalerModel
        scalerModel = scaler.fit(output)

        # rescale each feature to range [min, max].
        scaled_data = scalerModel.transform(output)
        final_data = scaled_data.select('label','scaledFeatures') # added class to the selection
        final_data = final_data.withColumnRenamed('scaledFeatures','features')
        print("Done!")

    else:
        print("You have opted not to correct that therefore you will not be able to use the Naive Bayes classifier")
        print("We will return the dataframe unscaled.")
        final_data = output
    
    return final_data
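
The skew treatment inside the function is easier to follow on a tiny example. Below is a plain-Python sketch (not the PySpark code itself) of the floor, cap and log(x + 1) step applied to right-skewed columns. The function name, the bounds and the sample values are hypothetical; in the notebook function the bounds come from `approxQuantile`.

```python
import math

def floor_cap_log(values, lower, upper):
    """Clamp each value to [lower, upper], then apply log(x + 1)."""
    treated = []
    for v in values:
        # Same order as the when/when/otherwise chain: floor first, then cap.
        clamped = lower if v < lower else upper if v > upper else v
        treated.append(math.log(clamped + 1.0))  # requires clamped > -1
    return treated

# Hypothetical right-skewed feature with one extreme value.
raw = [0.0, 1.0, 2.0, 3.0, 1000.0]
treated = floor_cap_log(raw, lower=0.0, upper=3.0)
print([f"{v:.3f}" for v in treated])  # ['0.000', '0.693', '1.099', '1.386', '1.386']
```

Notice that the extreme value ends up with the same treated value as the cap. Also keep in mind that log(x + 1) is only defined for x greater than -1, so a column whose floor sits below that would need a different shift.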

## Set up our Training and Evaluation Function

Let's use our handy dandy function to train and test all the classifiers available to us! I made two modifications here to show you just how flexible this framework is. 

1. You'll see in the area where we print our coefficients and feature importances values, that I created a new output format that joins the values with their corresponding feature names so we can clearly see which value belongs to which feature. I did this by zipping the lists together and creating a Spark dataframe to allow the output to be viewed in column row format as opposed to a messy list. 
2. The second change I added was a parameter for the folds so that I could specify the folds more easily with each iteration. 
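
The zipping trick from the first modification can be sketched in plain Python like this. The helper name and the scores are hypothetical. I also added a length check as a precaution, because `zip` silently stops at the shorter list, which would hide a mismatch between the name list and the score vector.

```python
def rank_features(names, scores, top_n=None):
    """Pair each feature name with its score and sort, highest score first."""
    if len(names) != len(scores):
        raise ValueError("names and scores must have the same length")
    pairs = sorted(
        zip(names, (float(s) for s in scores)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return pairs if top_n is None else pairs[:top_n]

# Hypothetical importance scores for four of the dataset's feature names.
names = ["1-ZCRm", "2-Energym", "69-BPM", "71-BPMessentia"]
scores = [0.0231, 0.0237, 0.0455, 0.0798]
top_two = rank_features(names, scores, top_n=2)
print(top_two)  # [('71-BPMessentia', 0.0798), ('69-BPM', 0.0455)]
```

One thing to watch for: the pairing is only right when the names are in the same order as the assembled feature vector. In the data prep function the vector is built from the numeric columns followed by the indexed string columns, so the order could differ from `input_columns` if your inputs include string columns.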

Let's check it out!

For more info on available hyper parameters visit: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification
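
Before picking the folds and the parameter grid, it helps to know how many models you are asking Spark to train. This plain-Python sketch (hypothetical helper names, not the `CrossValidator` implementation) builds a small grid and a deterministic k-fold split to show the arithmetic. Spark assigns rows to folds randomly, so the exact split below is for illustration only.

```python
from itertools import product

def build_grid(param_values):
    """Expand {param: [values]} into a list of every parameter combination."""
    keys = list(param_values)
    combos = product(*(param_values[key] for key in keys))
    return [dict(zip(keys, combo)) for combo in combos]

def kfold_indices(n_rows, folds):
    """Yield (train_idx, valid_idx); each row is used for validation exactly once."""
    for fold in range(folds):
        valid = [i for i in range(n_rows) if i % folds == fold]
        train = [i for i in range(n_rows) if i % folds != fold]
        yield train, valid

grid = build_grid({"regParam": [0.1, 0.01], "maxIter": [10, 15]})
splits = list(kfold_indices(n_rows=6, folds=2))
total_fits = len(grid) * len(splits)
print(len(grid), total_fits)  # 4 8
print(splits[0])  # ([1, 3, 5], [0, 2, 4])
```

So a grid of 4 combinations with 2 folds means 8 training runs, plus a final refit of the winning combination on the full training set. That adds up quickly, which is one reason to keep the grids small on a local machine.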

In [7]:
def ClassTrainEval(classifier,features,classes,folds,train,test):
    
    def FindMtype(classifier):
        # Reference the classifier instance
        M = classifier
        # Learn what it is
        Mtype = type(M).__name__
        
        return Mtype
    
    Mtype = FindMtype(classifier)
    

    def IntanceFitModel(Mtype,classifier,classes,features,folds,train):
        
        if Mtype == "OneVsRest":
            # instantiate the base classifier.
            lr = LogisticRegression()
            # instantiate the One Vs Rest Classifier.
            OVRclassifier = OneVsRest(classifier=lr)
#             fitModel = OVRclassifier.fit(train)
            # Add parameters of your choice here:
            paramGrid = ParamGridBuilder() \
                .addGrid(lr.regParam, [0.1, 0.01]) \
                .build()
            #Cross Validator requires the following parameters:
            crossval = CrossValidator(estimator=OVRclassifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=folds) # 3 is best practice
            # Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
        if Mtype == "MultilayerPerceptronClassifier":
            # specify layers for the neural network:
            # input layer of size features, two intermediate of features+1 and same size as features
            # and output of size number of classes
            # Note: crossvalidator cannot be used here
            features_count = len(features[0][0])
            layers = [features_count, features_count+1, features_count, classes]
            MPC_classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
            fitModel = MPC_classifier.fit(train)
            return fitModel
        if Mtype in("LinearSVC","GBTClassifier") and classes != 2: # These classifiers currently only accept binary classification
            print(Mtype," could not be used because PySpark currently only accepts binary classification data for this algorithm")
            return
        if Mtype in("LogisticRegression","NaiveBayes","RandomForestClassifier","GBTClassifier","LinearSVC","DecisionTreeClassifier"):
  
            # Add parameters of your choice here:
            if Mtype in("LogisticRegression"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .addGrid(classifier.maxIter, [10, 15,20])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("NaiveBayes"):
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6]) \
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("RandomForestClassifier"):
                paramGrid = (ParamGridBuilder() \
                               .addGrid(classifier.maxDepth, [2, 5, 10])
#                                .addGrid(classifier.maxBins, [5, 10, 20])
#                                .addGrid(classifier.numTrees, [5, 20, 50])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("GBTClassifier"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
#                              .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                             .addGrid(classifier.maxIter, [10, 15,50,100])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("LinearSVC"):
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.maxIter, [10, 15]) \
                             .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .build())
            
            # Add parameters of your choice here:
            if Mtype in("DecisionTreeClassifier"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
                             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                             .build())
            
            #Cross Validator requires all of the following parameters:
            crossval = CrossValidator(estimator=classifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=folds) # 3 + is best practice
            # Fit Model: Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
    
    fitModel = IntanceFitModel(Mtype,classifier,classes,features,folds,train)
    
    # Print feature selection metrics
    if fitModel is not None:
        
        if Mtype in("OneVsRest"):
            # Get Best Model
            BestModel = fitModel.bestModel
            global OVR_BestModel
            OVR_BestModel = BestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            # Extract list of binary models
            models = BestModel.models
            for model in models:
                print('\033[1m' + 'Intercept: '+ '\033[0m',model.intercept)
                print('\033[1m' + 'Top 20 Coefficients:'+ '\033[0m')
                coeff_array = model.coefficients.toArray()
                coeff_scores = []
                for x in coeff_array:
                    coeff_scores.append(float(x))
                # Then zip with input_columns list and create a df
                result = spark.createDataFrame(zip(input_columns,coeff_scores), schema=['feature','coeff'])
                print(result.orderBy(result["coeff"].desc()).show(truncate=False))


        if Mtype == "MultilayerPerceptronClassifier":
            print("")
            print('\033[1m' + Mtype + '\033[0m')
            print('\033[1m' + "Model Weights: "+ '\033[0m',fitModel.weights.size)
            print("")
            global MLPC_BestModel # declare the same name we assign below so the model is saved globally
            MLPC_BestModel = fitModel

        if Mtype in("DecisionTreeClassifier", "GBTClassifier","RandomForestClassifier"):
            # FEATURE IMPORTANCES
            # Estimate of the importance of each feature.
            # Each feature’s importance is the average of its importance across all trees 
            # in the ensemble. The importance vector is normalized to sum to 1. 
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Top 20 Feature Importances"+ '\033[0m')
            print("(Scores add up to 1)")
            print("Lowest score is the least important")
            print(" ")
            featureImportances = BestModel.featureImportances.toArray()
            # Convert from numpy array to list
            imp_scores = []
            for x in featureImportances:
                imp_scores.append(float(x))
            # Then zip with input_columns list and create a df
            result = spark.createDataFrame(zip(input_columns,imp_scores), schema=['feature','score'])
            print(result.orderBy(result["score"].desc()).show(truncate=False))
            
            # Save the feature importance values and the models
            if Mtype in("DecisionTreeClassifier"):
                global DT_featureimportances
                DT_featureimportances = BestModel.featureImportances.toArray()
                global DT_BestModel
                DT_BestModel = BestModel
            if Mtype in("GBTClassifier"):
                global GBT_featureimportances
                GBT_featureimportances = BestModel.featureImportances.toArray()
                global GBT_BestModel
                GBT_BestModel = BestModel
            if Mtype in("RandomForestClassifier"):
                global RF_featureimportances
                RF_featureimportances = BestModel.featureImportances.toArray()
                global RF_BestModel
                RF_BestModel = BestModel

        # Print the coefficients
        if Mtype in("LogisticRegression"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            print("Intercept: " + str(BestModel.interceptVector))
            print('\033[1m' + " Top 20 Coefficients"+ '\033[0m')
            print("You should compare these relative to each other")
            # Convert from numpy array to list
            # Note: coefficientMatrix has one row per class; only the first row is shown here
            coeff_array = BestModel.coefficientMatrix.toArray()
            coeff_scores = []
            for x in coeff_array[0]:
                coeff_scores.append(float(x))
            # Then zip with input_columns list and create a df
            result = spark.createDataFrame(zip(input_columns,coeff_scores), schema=['feature','coeff'])
            print(result.orderBy(result["coeff"].desc()).show(truncate=False))
            # Save the coefficient values and the models
            global LR_coefficients
            LR_coefficients = BestModel.coefficientMatrix.toArray()
            global LR_BestModel
            LR_BestModel = BestModel

        # Print the Coefficients
        if Mtype in("LinearSVC"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            print("Intercept: " + str(BestModel.intercept))
            print('\033[1m' + "Top 20 Coefficients"+ '\033[0m')
            print("You should compare these relative to each other")
#             print("Coefficients: \n" + str(BestModel.coefficients))
            coeff_array = BestModel.coefficients.toArray()
            coeff_scores = []
            for x in coeff_array:
                coeff_scores.append(float(x))
            # Then zip with input_columns list and create a df
            result = spark.createDataFrame(zip(input_columns,coeff_scores), schema=['feature','coeff'])
            print(result.orderBy(result["coeff"].desc()).show(truncate=False))
            # Save the coefficient values and the models
            global LSVC_coefficients
            LSVC_coefficients = BestModel.coefficients.toArray()
            global LSVC_BestModel
            LSVC_BestModel = BestModel
        
   
    # Set the column names to match the external results dataframe that we will join with later:
    columns = ['Classifier', 'Result']
    
    if Mtype in("LinearSVC","GBTClassifier") and classes != 2:
        Mtype = [Mtype] # make this a list
        score = ["N/A"]
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
    else:
        predictions = fitModel.transform(test)
        MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy") # uses the default predictionCol="prediction"
        accuracy = (MC_evaluator.evaluate(predictions))*100
        Mtype = [Mtype] # make this a list
        score = [str(accuracy)] # make this a string and wrap it in a list
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
        result = result.withColumn('Result',result.Result.substr(0, 5))
        
    return result
    # Note: the best models, feature importances and coefficients are saved as global variables above

## Testing Time! 

Now we can use the functions above to train and evaluate our models in different ways. I'll show you my approach here, but everyone has their own style, and this process will likely be much more involved in a real use case. I just want to quickly show you what's possible. 
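
The Result column our evaluation function reports is accuracy expressed as a percentage, stored as a string and trimmed to five characters. Here is a plain-Python sketch of that calculation on hypothetical predictions. The string slice stands in for the `substr(0, 5)` call, and it shows that trimming truncates rather than rounds.

```python
def accuracy_percent(predictions, labels):
    """Share of predictions that match the true labels, as a percentage."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equally long")
    correct = len([1 for p, y in zip(predictions, labels) if p == y])
    return 100.0 * correct / len(labels)

# Hypothetical predicted and true label indexes for six songs.
preds = [0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
truth = [0.0, 1.0, 1.0, 2.0, 1.0, 2.0]
acc = accuracy_percent(preds, truth)
print(f"{acc:.2f}")   # 66.67 (rounded)
print(str(acc)[:5])   # 66.66 (truncated, like the trimmed Result column)
```

The difference is tiny, but it is worth knowing about when you compare two models whose scores are very close.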

In [8]:
# Set up independent and dependent vars
input_columns = df.columns
input_columns = input_columns[1:-1] # keep only relevant columns: everything but the first and last cols
dependent_var = 'class'

# Learn how many classes there are in order to specify evaluation type based on binary or multi and turn the df into an object
class_count = df.select(countDistinct("class")).collect()
classes = class_count[0][0]

### Test 1: Without outlier treatment, skew or negative value treatment

This first run will act as our baseline so we can understand how our treatments affect our analysis. I will opt out of outlier treatment, skew treatment and negative value treatment, which means we cannot use the Naive Bayes classifier. 

In [12]:
# Call on data prep, train and evaluate functions
test1_data = MLClassifierDFPrep(df,input_columns,dependent_var,treat_outliers=False,treat_neg_values=False)
test1_data.limit(5).toPandas()

# Comment out Naive Bayes if your data still contains negative values
classifiers = [
                LogisticRegression()
                ,OneVsRest()
               ,LinearSVC()
#                ,NaiveBayes()
               ,RandomForestClassifier()
               ,GBTClassifier()
               ,DecisionTreeClassifier()
               ,MultilayerPerceptronClassifier()
              ] 

train,test = test1_data.randomSplit([0.7,0.3])
features = test1_data.select(['features']).collect()
folds = 2 # because we have limited data

#set up your results table
columns = ['Classifier', 'Result']
vals = [("Place Holder","N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier,features,classes,folds,train,test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
print("!!!!!Final Results!!!!!!!!")
results.show(100,False)

+--------------------+-----+-----+
|               class|label|count|
+--------------------+-----+-----+
|   IndieDanceNuDisco| 11.0|  100|
|         FutureHouse| 14.0|  100|
|           PsyTrance|  4.0|  100|
|           ReggaeDub| 19.0|  100|
|              HipHop| 20.0|  100|
|             Minimal|  5.0|  100|
|              Breaks| 15.0|  100|
|           TechHouse| 13.0|  100|
|           GlitchHop|  7.0|  100|
|  HardcoreHardTechno| 12.0|  100|
|    ProgressiveHouse|  6.0|  100|
|             BigRoom| 21.0|  100|
|               House| 17.0|  100|
|              Techno|  9.0|  100|
|             Dubstep|  8.0|  100|
|         DrumAndBass| 10.0|  100|
|        ElectroHouse| 18.0|  100|
|           FunkRAndB| 16.0|  100|
|              Trance|  2.0|  100|
|               Dance|  3.0|  100|
|ElectronicaDowntempo|  0.0|  100|
|           HardDance| 22.0|  100|
|           DeepHouse|  1.0|  100|
+--------------------+-----+-----+

None
 
 
You have opted not to correct that therefore 

+---------------------+------------------+
|feature              |coeff             |
+---------------------+------------------+
|56-ChromaVector1std  |112.55717723864902|
|67-ChromaVector12std |21.11776647737247 |
|26-ChromaVector5m    |20.708275579456714|
|65-ChromaVector10std |13.705652241544923|
|62-ChromaVector7std  |13.098905388166813|
|41-SpectralFluxstd   |13.020892326083935|
|27-ChromaVector6m    |12.926000800767094|
|66-ChromaVector11std |12.836735382024687|
|39-SpectralSpreadstd |9.43627791003836  |
|30-ChromaVector9m    |8.369285230203037 |
|59-ChromaVector4std  |8.369020641865156 |
|68-ChromaDeviationstd|6.889158628084487 |
|32-ChromaVector11m   |6.872671241921265 |
|58-ChromaVector3std  |6.754922026317207 |
|50-MFCCs8std         |6.407469365950642 |
|35-ZCRstd            |4.061108263162837 |
|57-ChromaVector2std  |4.056607400100024 |
|49-MFCCs7std         |3.6245281904821374|
|48-MFCCs6std         |3.3700791174101936|
|53-MFCCs11std        |2.764403948832574 |
+----------

+---------------------+------------------+
|feature              |coeff             |
+---------------------+------------------+
|34-ChromaDeviationm  |14.578916875868794|
|5-SpectralSpreadm    |8.18292514948516  |
|61-ChromaVector6std  |7.566068104110049 |
|28-ChromaVector7m    |6.589319291416576 |
|65-ChromaVector10std |6.408800958991922 |
|66-ChromaVector11std |6.324476639566626 |
|29-ChromaVector8m    |5.990933344309503 |
|49-MFCCs7std         |3.4856544014957667|
|42-SpectralRolloffstd|3.3825129947601287|
|24-ChromaVector3m    |3.05972164154249  |
|52-MFCCs10std        |3.0395155435503085|
|46-MFCCs4std         |2.9344592172513786|
|40-SpectralEntropystd|2.8986740848920887|
|44-MFCCs2std         |2.7038373681969903|
|32-ChromaVector11m   |2.221268590717105 |
|64-ChromaVector9std  |2.0403084217376777|
|60-ChromaVector5std  |1.8709893446043064|
|45-MFCCs3std         |1.824290321988405 |
|54-MFCCs12std        |0.846726793405148 |
|48-MFCCs6std         |0.5789613446984917|
+----------

LinearSVC  could not be used because PySpark currently only accepts binary classification data for this algorithm
 
RandomForestClassifier  Top 20 Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
+----------------------+--------------------+
|feature               |score               |
+----------------------+--------------------+
|71-BPMessentia        |0.07982059851901938 |
|69-BPM                |0.04545732736332575 |
|70-BPMconf            |0.03194386448055748 |
|2-Energym             |0.023723916760203374|
|1-ZCRm                |0.02310645820099108 |
|39-SpectralSpreadstd  |0.022343659015138507|
|9-MFCCs1m             |0.02057059555177828 |
|44-MFCCs2std          |0.020515646444789714|
|52-MFCCs10std         |0.019875911200021322|
|3-EnergyEntropym      |0.01900969677980962 |
|35-ZCRstd             |0.01872732585934679 |
|43-MFCCs1std          |0.01752233538614241 |
|54-MFCCs12std         |0.01731251376180032 |
|7-SpectralFluxm       |0.0171

### Test 2: Test treatments

Train and evaluate models on treated data (outliers, skewness and negative values) and compare to baseline (#1).
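
The negative value treatment is just min-max rescaling, which is what lets Naive Bayes join the party in this test. Below is a plain-Python sketch of the idea for a single feature column (hypothetical function name and values). Mapping a constant column to 0.5 follows the behavior described in the `MinMaxScaler` documentation for the default 0 to 1 range.

```python
import builtins  # explicit, because `from pyspark.sql.functions import *` shadows min and max

def min_max_scale(column):
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = builtins.min(column), builtins.max(column)
    if hi == lo:
        # A constant column has no spread, so every value maps to the midpoint.
        return [0.5 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical feature containing negative values.
scaled = min_max_scale([-2.0, 0.0, 2.0])
print(scaled)  # [0.0, 0.5, 1.0]
```

After rescaling, every value sits between 0 and 1, so no negative values remain. Keep in mind that the minimum and maximum are sensitive to outliers, which is one reason the outlier treatment runs first.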

In [9]:
# Call on data prep, train and evaluate functions
test2_data = MLClassifierDFPrep(df,input_columns,dependent_var,treat_outliers=True,treat_neg_values=True)
test2_data.limit(5).toPandas()

# Comment out Naive Bayes if your data still contains negative values
classifiers = [
                LogisticRegression()
                ,OneVsRest()
               ,LinearSVC()
               ,NaiveBayes()
               ,RandomForestClassifier()
               ,GBTClassifier()
               ,DecisionTreeClassifier()
               ,MultilayerPerceptronClassifier()
              ] 

train,test = test2_data.randomSplit([0.7,0.3])
features = test2_data.select(['features']).collect()
folds = 2

#set up your results table
columns = ['Classifier', 'Result']
vals = [("Place Holder","N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier,features,classes,folds,train,test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
print("!!!!!Final Results!!!!!!!!")
results.show(100,False)

+--------------------+-----+-----+
|               class|label|count|
+--------------------+-----+-----+
|   IndieDanceNuDisco| 11.0|  100|
|         FutureHouse| 14.0|  100|
|           PsyTrance|  4.0|  100|
|           ReggaeDub| 19.0|  100|
|              HipHop| 20.0|  100|
|             Minimal|  5.0|  100|
|              Breaks| 15.0|  100|
|           TechHouse| 13.0|  100|
|           GlitchHop|  7.0|  100|
|  HardcoreHardTechno| 12.0|  100|
|    ProgressiveHouse|  6.0|  100|
|             BigRoom| 21.0|  100|
|               House| 17.0|  100|
|              Techno|  9.0|  100|
|             Dubstep|  8.0|  100|
|         DrumAndBass| 10.0|  100|
|        ElectroHouse| 18.0|  100|
|           FunkRAndB| 16.0|  100|
|              Trance|  2.0|  100|
|               Dance|  3.0|  100|
|ElectronicaDowntempo|  0.0|  100|
|           HardDance| 22.0|  100|
|           DeepHouse|  1.0|  100|
+--------------------+-----+-----+

None
We are correcting for non normality now!
7-Spectr

+---------------------+------------------+
|feature              |coeff             |
+---------------------+------------------+
|71-BPMessentia       |6.136888001132321 |
|9-MFCCs1m            |2.918368303219135 |
|69-BPM               |2.914313647927189 |
|7-SpectralFluxm      |2.688337180594859 |
|41-SpectralFluxstd   |2.6541976143483166|
|62-ChromaVector7std  |2.5781689641959513|
|29-ChromaVector8m    |2.214212358296934 |
|57-ChromaVector2std  |1.8162633702219924|
|68-ChromaDeviationstd|1.8138955548371414|
|66-ChromaVector11std |1.7067512951064963|
|31-ChromaVector10m   |1.5779863082184844|
|63-ChromaVector8std  |1.103801681302252 |
|17-MFCCs9m           |1.048562908815367 |
|49-MFCCs7std         |1.0101579471248427|
|61-ChromaVector6std  |0.9274776411353771|
|36-Energystd         |0.8808795426530354|
|27-ChromaVector6m    |0.8635727007769014|
|67-ChromaVector12std |0.8441178587889355|
|20-MFCCs12m          |0.777516166981001 |
|15-MFCCs7m           |0.7215291660844417|
+----------

+----------------------+------------------+
|feature               |coeff             |
+----------------------+------------------+
|67-ChromaVector12std  |3.1251164817700645|
|71-BPMessentia        |2.347523982510963 |
|38-SpectralCentroidstd|2.211846669605227 |
|29-ChromaVector8m     |2.1778564328891568|
|43-MFCCs1std          |2.089641725133536 |
|11-MFCCs3m            |2.0882279628959535|
|41-SpectralFluxstd    |1.8515225151157935|
|13-MFCCs5m            |1.755866871595872 |
|15-MFCCs7m            |1.7186721819126656|
|44-MFCCs2std          |1.6015240662878747|
|35-ZCRstd             |1.4218626813420532|
|65-ChromaVector10std  |1.4173918808030324|
|42-SpectralRolloffstd |1.3419326519690107|
|12-MFCCs4m            |1.3383259882117287|
|25-ChromaVector4m     |1.2823027949726153|
|30-ChromaVector9m     |1.2812508961948237|
|47-MFCCs5std          |0.957014821843279 |
|60-ChromaVector5std   |0.8582638822532481|
|4-SpectralCentroidm   |0.824434760748972 |
|37-EnergyEntropystd   |0.770712

+---------------------+------------------+
|feature              |coeff             |
+---------------------+------------------+
|70-BPMconf           |6.38592247611383  |
|63-ChromaVector8std  |2.8293385508218543|
|2-Energym            |2.356783502816937 |
|67-ChromaVector12std |2.2515917792339133|
|69-BPM               |2.0488497778985386|
|9-MFCCs1m            |1.612499512882857 |
|68-ChromaDeviationstd|1.606228270496342 |
|46-MFCCs4std         |1.5642472954358542|
|71-BPMessentia       |1.4966720202632682|
|3-EnergyEntropym     |1.4573823482447466|
|16-MFCCs8m           |1.0978457696731267|
|25-ChromaVector4m    |1.0733561343722542|
|14-MFCCs6m           |1.0640563462421029|
|49-MFCCs7std         |0.9785240824415337|
|48-MFCCs6std         |0.9259991660126768|
|28-ChromaVector7m    |0.924417293162491 |
|5-SpectralSpreadm    |0.796256291514212 |
|58-ChromaVector3std  |0.7606174576666069|
|26-ChromaVector5m    |0.7136516794059131|
|1-ZCRm               |0.6554458330809071|
+----------

## Test 3: Feature Selection

Looks like all of our models saw a little bit of improvement from the transformations we applied, and overall the best performing model was One vs Rest. Let's try some feature selection to make it even better!

Since we are using the One vs Rest algorithm backed by Logistic Regression, we can use PySpark's Chi Squared Selector to select our features. I intentionally did not go over this in the previous lectures to get you used to exploring the PySpark documentation.

Here is the link to the documentation on the Chi Squared Selector for more details if you need it: https://spark.apache.org/docs/latest/ml-features#chisqselector
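If you are curious what a chi-squared score measures, here is a small plain-Python sketch that computes the Pearson chi-squared statistic between one categorical feature and the labels, then keeps the top-scoring feature. This is only an illustration of the idea, not Spark's implementation: the function name and toy data are made up, and ranking by the raw statistic is a simplification. Also keep in mind that the Spark documentation describes `ChiSqSelector` as operating on categorical features, while our audio features are continuous.

```python
from collections import Counter

def chi2_statistic(feature, labels):
    """Pearson chi-squared statistic for one categorical feature vs. the labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))  # counts per (feature value, label) cell
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            expected = f_count * l_count / n  # count expected under independence
            stat += (observed[(f_value, l_value)] - expected) ** 2 / expected
    return stat

# Toy data (hypothetical): one feature follows the label, the other does not
labels = [0, 0, 1, 1]
features = {
    "tracks_label": [0, 0, 1, 1],
    "unrelated": [0, 1, 0, 1],
}

scores = {name: chi2_statistic(col, labels) for name, col in features.items()}
top_1 = sorted(scores, key=scores.get, reverse=True)[:1]
print(scores)
print(top_1)
```

A higher score means the feature's values are less independent of the label, which is why the selector prefers it.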

In [None]:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

classifiers = [OneVsRest()] 

#Select the top n features and view results
maximum = len(input_columns)
for n in range(10,maximum,10):
    print("Testing top n = ",n," features")
    
    # For tree classifiers (uncomment to use)
    # best_n_features = RF_featureimportances.argsort()[-n:][::-1]
    # best_n_features = best_n_features.tolist() # convert to a list
    # vs = VectorSlicer(inputCol="features", outputCol="best_features", indices=best_n_features)
    # bestFeaturesDf = vs.transform(test2_data)

    # For Logistic regression or One vs Rest
    selector = ChiSqSelector(numTopFeatures=n, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="label")
    bestFeaturesDf = selector.fit(test2_data).transform(test2_data)
    bestFeaturesDf = bestFeaturesDf.select("label","selectedFeatures")
    bestFeaturesDf = bestFeaturesDf.withColumnRenamed("selectedFeatures","features")

    # Collect features
    features = bestFeaturesDf.select(['features']).collect()

    # Split
    train,test = bestFeaturesDf.randomSplit([0.7,0.3])
    
    # Specify folds
    folds = 2

    #set up your results table
    columns = ['Classifier', 'Result']
    vals = [("Place Holder","N/A")]
    results = spark.createDataFrame(vals, columns)

    for classifier in classifiers:
        new_result = ClassTrainEval(classifier,features,classes,folds,train,test)
        results = results.union(new_result)
    results = results.where("Classifier!='Place Holder'")
    results.show(100,False)

### Results

Let's see if we can improve upon the earlier result of 45.94%.

- 10 features = 32.27% (below baseline)
- 20 features = 39.23% (below baseline)
- 30 features = 39.24% (below baseline)
- 40 features = 42.61% (below baseline)
- 50 features = 43.52% (below baseline)
- 60 features = 42.42% (below baseline)
- 70 features = 40.22% (below baseline)
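The comparison against the 45.94% baseline can also be checked with a few lines of plain Python. The numbers are copied from the list above; the variable names are just for illustration.

```python
# Scores reported above, keyed by the number of selected features
baseline = 45.94
results = {10: 32.27, 20: 39.23, 30: 39.24, 40: 42.61, 50: 43.52, 60: 42.42, 70: 40.22}

best_n = max(results, key=results.get)  # feature count with the highest score
beats_baseline = [n for n, score in results.items() if score > baseline]
gap = round(baseline - results[best_n], 2)  # distance from the best subset to the baseline

print(best_n, results[best_n])
print(beats_baseline)
print(gap)
```

The score peaks at 50 features and then drops again, but no subset reaches the baseline.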

**Takeaway:** looks like we never actually beat our baseline, so let's stick with all 71 features. Each run uses a new random train/test split, so setting a random seed would have reduced the run-to-run variability we saw here.
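To see why a seed helps, here is a plain-Python sketch of a seeded 70/30 split. It only imitates the idea and is not how Spark splits data internally; the `split_rows` helper is hypothetical. In PySpark itself, `randomSplit` accepts a `seed` argument, e.g. `randomSplit([0.7, 0.3], seed=42)`.

```python
import random

def split_rows(rows, train_fraction, seed):
    """Assign each row to train or test using a seeded random generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(2300))  # stand-in for 23 genres x 100 songs

first = split_rows(rows, 0.7, seed=42)
second = split_rows(rows, 0.7, seed=42)

print(first == second)  # same seed, same split
```

With the split held fixed, any change in the score between runs comes from the feature subset rather than from which songs landed in the test set.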

## Train final model

Now we can finally train our final model with the optimal set of features!

In [None]:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

classifiers = [OneVsRest()] 

#Select the top n features and view results
n = 71

# For Logistic regression or One vs Rest
selector = ChiSqSelector(numTopFeatures=n, featuresCol="features",
                     outputCol="selectedFeatures", labelCol="label")
bestFeaturesDf = selector.fit(test2_data).transform(test2_data)
bestFeaturesDf = bestFeaturesDf.select("label","selectedFeatures")
bestFeaturesDf = bestFeaturesDf.withColumnRenamed("selectedFeatures","features")

# Collect features
features = bestFeaturesDf.select(['features']).collect()

# Split
train,test = bestFeaturesDf.randomSplit([0.7,0.3])

# Specify folds
folds = 2

#set up your results table
columns = ['Classifier', 'Result']
vals = [("Place Holder","N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier,features,classes,folds,train,test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
results.show(100,False)

## Make a recommendation to a user!

Imagine that your new model is being deployed in the next release of an online radio station that has purchased this model from you. Existing users have gotten used to the songs in certain stations, but your model may now add new songs to those stations.

Can you show an output that lists songs that will be new to listeners of the BigRoom station (i.e. songs that used to belong to another station but are now classified by your model as BigRoom)? Also show how many songs this would include. You can use the test dataframe to do this. Don't overthink it.

In [None]:
predictions = OVR_BestModel.transform(test)

In [None]:
# From the output earlier we saw that the label for BigRoom is 21.0
# Keep songs from other stations that the model now assigns to BigRoom
new_to_bigroom = predictions.filter("label != 21.0 AND prediction == 21.0")
print(new_to_bigroom.count())
new_to_bigroom.show()
# predictions.show()
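The same filter logic can be sketched in plain Python, looking the label up by genre name instead of hardcoding 21.0. The rows and song names below are hypothetical stand-ins for the predictions dataframe; the label values are taken from the class/label table printed earlier.

```python
# Subset of the class -> label mapping shown in the output above
label_index = {"BigRoom": 21.0, "House": 17.0, "Trance": 2.0}

# Hypothetical rows standing in for the predictions dataframe
predictions_rows = [
    {"song": "a", "label": 17.0, "prediction": 21.0},
    {"song": "b", "label": 21.0, "prediction": 21.0},
    {"song": "c", "label": 2.0, "prediction": 21.0},
    {"song": "d", "label": 17.0, "prediction": 17.0},
]

target = label_index["BigRoom"]

# New to the station: originally another genre, now predicted as BigRoom
new_to_station = [
    row for row in predictions_rows
    if row["label"] != target and row["prediction"] == target
]

print(len(new_to_station))
print([row["song"] for row in new_to_station])
```

Song "b" is excluded because it was already a BigRoom song, so listeners of that station would not see it as new.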

## Conclusions

#### Other types of analyses to consider

1. **Training and evaluation splits:** we did a 70/30 split but it's a good idea to play around with other split ratios like 80/20 or 75/25
2. **Different transformations:** you could also play around with other types of transformations on the data like normalizing or standard scaling:
    - normalizing: https://spark.apache.org/docs/2.1.0/ml-features.html#normalizer
    - standard scaling: https://spark.apache.org/docs/2.1.0/ml-features.html#standardscaler
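The two transformations are easy to mix up, so here is a rough plain-Python sketch of the difference: normalizing rescales each row (one song) to unit length, while standard scaling rescales each column (one feature) across all songs. The helper names are made up, and the sketch makes two assumptions: an L2 norm for normalizing, and mean centering plus the sample standard deviation for scaling. Spark's `StandardScaler` does not center by default, so check its `withMean` and `withStd` parameters before relying on this.

```python
import math
import statistics

def l2_normalize(row):
    """Scale one row (a single song's feature vector) to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else list(row)

def standard_scale(column):
    """Center one column (a single feature) and scale it to unit sample std."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)
    return [(v - mean) / std for v in column]

print(l2_normalize([3.0, 4.0]))         # [0.6, 0.8]
print(standard_scale([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
```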