# Classification in PySpark's MLlib Project Solution

### Genre classification
Now it's time to apply what we learned in the lectures to a REAL classification project! Have you ever wondered what makes us humans able to tell apart two songs of different genres? How do we inherently know the difference between a pop song and heavy metal? This type of classification may seem easy for us, but it's a very difficult challenge for a computer. So the question is: could an automatic genre classification model be possible?

For this project we will be classifying songs based on a number of characteristics into a set of 23 electronic genres. This technology could be used by an application like Pandora to recommend songs to users or just create meaningful channels. Super fun!

### Dataset
*beatsdataset.csv*
Each row is an electronic music song. The dataset contains 100 songs for each of 23 electronic music genres; they were the top 100 songs of their genres in November 2016. The 71 columns are audio features extracted from a random two-minute sample of each audio file. These features were extracted using pyAudioAnalysis (https://github.com/tyiannak/pyAudioAnalysis).

### Your task
Create an algorithm that classifies songs into the 23 genres provided. Test out several different models and select the highest performing one. Also play around with feature selection methods and finally try to make a recommendation to a user.  

For the feature selection aspect of this project, you may need to get a bit creative if you want to select features from a non-tree algorithm. I intentionally did not cover this aspect of PySpark in the previous lectures, to give you a chance to get used to researching the PySpark documentation. Here is the link to the Feature Selectors section of the documentation, which just might come in handy: https://spark.apache.org/docs/latest/ml-features.html#feature-selectors

Good luck! Have fun :)

### Source
https://www.kaggle.com/caparrini/beatsdataset

In [None]:
# First let's create our PySpark instance
# import findspark
# findspark.init()
!pip install pyspark
import pyspark # if you use findspark locally, run findspark.init() first
from pyspark.sql import SparkSession
# May take a while locally
spark = SparkSession.builder.appName("Review2").getOrCreate()


spark
# Click the hyperlinked "Spark UI" link to view details about your Spark session

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
     281.4/281.4 MB 3.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     199.7/199.7 KB 11.9 MB/s eta 0:00:00
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824025 sha256=a319ff87f3bf257239cf5e1c7016f5ec64d2dbde8f6aa5e41df510de0fd92acc
  Stored in directory: /root/.cache/pip/wheels/b1/59/a0/a1a0624b5e865fd389919c1a10f53aec9b12195d6747710baf
Successfully built pyspark
Installing collected packages: py4j, pyspa

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Import the data and examine it

In [None]:
path ="drive/MyDrive/5. Spark/spark-scripts/section3/Datasets/"
df = spark.read.csv(path+'beatsdataset.csv',inferSchema=True,header=True)

In [None]:
import pandas as pd

pd.set_option('display.max_columns',None)
pd.set_option('display.max_colwidth',None)
pd.set_option('display.max_rows',None)

In [None]:
df.limit(6).toPandas()

Unnamed: 0,_c0,1-ZCRm,2-Energym,3-EnergyEntropym,4-SpectralCentroidm,5-SpectralSpreadm,6-SpectralEntropym,7-SpectralFluxm,8-SpectralRolloffm,9-MFCCs1m,10-MFCCs2m,11-MFCCs3m,12-MFCCs4m,13-MFCCs5m,14-MFCCs6m,15-MFCCs7m,16-MFCCs8m,17-MFCCs9m,18-MFCCs10m,19-MFCCs11m,20-MFCCs12m,21-MFCCs13m,22-ChromaVector1m,23-ChromaVector2m,24-ChromaVector3m,25-ChromaVector4m,26-ChromaVector5m,27-ChromaVector6m,28-ChromaVector7m,29-ChromaVector8m,30-ChromaVector9m,31-ChromaVector10m,32-ChromaVector11m,33-ChromaVector12m,34-ChromaDeviationm,35-ZCRstd,36-Energystd,37-EnergyEntropystd,38-SpectralCentroidstd,39-SpectralSpreadstd,40-SpectralEntropystd,41-SpectralFluxstd,42-SpectralRolloffstd,43-MFCCs1std,44-MFCCs2std,45-MFCCs3std,46-MFCCs4std,47-MFCCs5std,48-MFCCs6std,49-MFCCs7std,50-MFCCs8std,51-MFCCs9std,52-MFCCs10std,53-MFCCs11std,54-MFCCs12std,55-MFCCs13std,56-ChromaVector1std,57-ChromaVector2std,58-ChromaVector3std,59-ChromaVector4std,60-ChromaVector5std,61-ChromaVector6std,62-ChromaVector7std,63-ChromaVector8std,64-ChromaVector9std,65-ChromaVector10std,66-ChromaVector11std,67-ChromaVector12std,68-ChromaDeviationstd,69-BPM,70-BPMconf,71-BPMessentia,class
0,0,0.13644,0.088861,3.201201,0.262825,0.249212,1.114423,0.007003,0.256682,-22.723259,1.594074,0.011276,0.204468,0.042072,0.048552,0.158505,0.118984,-0.147956,-0.186152,-0.026418,-0.007264,-0.0179,0.011581,0.008747,0.041081,0.014497,0.025711,0.012587,0.06017,0.002864,0.004631,0.009576,0.026079,0.004161,0.032185,0.050143,0.047313,0.102995,0.041285,0.017725,0.414831,0.005867,0.133778,0.838302,0.505911,0.356206,0.336074,0.288888,0.278649,0.283437,0.300305,0.287688,0.296692,0.258531,0.238352,0.194701,0.013138,0.011665,0.032049,0.015464,0.020453,0.012943,0.046397,0.003431,0.004981,0.010818,0.024001,0.005201,0.015056,133.333333,0.132792,128.0,BigRoom
1,1,0.117039,0.108389,3.194001,0.247657,0.250288,1.065668,0.005387,0.199821,-21.775871,1.261364,-0.113015,0.001718,-0.052682,0.20413,0.153013,0.067214,-0.013227,-0.05944,-0.008604,0.114257,0.171009,0.006535,0.002646,0.086485,0.008391,0.016442,0.009006,0.087948,0.002472,0.006549,0.007412,0.015386,0.005978,0.041116,0.043713,0.043721,0.099449,0.039386,0.018946,0.407164,0.003613,0.110334,0.624185,0.476993,0.353151,0.33555,0.283832,0.269621,0.24415,0.24666,0.25719,0.272036,0.269477,0.222393,0.187471,0.006761,0.003152,0.058923,0.009012,0.016106,0.009386,0.071726,0.004461,0.006441,0.007469,0.015499,0.005589,0.019339,120.0,0.112767,126.0,BigRoom
2,2,0.085308,0.128525,3.123837,0.217205,0.228652,0.789647,0.008247,0.156822,-22.472722,1.425185,0.186749,0.417114,0.076406,0.190803,-0.016302,0.075038,0.10787,0.216874,0.095604,0.020977,-0.037011,0.007143,0.00296,0.220526,0.005639,0.010151,0.007453,0.043907,0.00124,0.004347,0.007989,0.017622,0.002636,0.066049,0.03292,0.037618,0.117704,0.041509,0.022645,0.34013,0.007697,0.085784,1.02874,0.449133,0.297935,0.266731,0.258299,0.275012,0.218368,0.19839,0.210177,0.212533,0.204458,0.197634,0.16491,0.007836,0.003079,0.093865,0.005692,0.008212,0.005451,0.0429,0.001529,0.004556,0.007723,0.017482,0.002901,0.022201,133.333333,0.123373,129.0,BigRoom
3,3,0.10305,0.167042,3.15083,0.233593,0.245032,0.967082,0.006571,0.168083,-21.470751,1.463686,0.226548,0.404531,0.117699,0.081861,0.053974,0.164865,0.014919,0.11709,0.027778,-0.063173,-0.052606,0.010724,0.00334,0.125459,0.005728,0.014695,0.006322,0.072154,0.001628,0.003493,0.011463,0.032204,0.004738,0.046159,0.036349,0.06196,0.134908,0.032564,0.020036,0.365068,0.005215,0.086336,0.769981,0.425496,0.245312,0.260132,0.22422,0.207597,0.199472,0.207818,0.189912,0.185509,0.187273,0.177629,0.16474,0.00833,0.003528,0.061426,0.005443,0.012382,0.004985,0.057999,0.001591,0.003514,0.009477,0.023162,0.004165,0.015379,133.333333,0.158876,129.0,BigRoom
4,4,0.15173,0.148405,3.194498,0.29373,0.267231,1.353005,0.003872,0.292055,-21.371157,1.187854,0.184415,0.363724,0.232119,0.112277,0.107335,0.159296,0.067213,-0.018713,0.091529,0.117344,0.091616,0.009624,0.004031,0.076133,0.008175,0.016267,0.009927,0.088364,0.002645,0.004054,0.011083,0.023926,0.002248,0.036761,0.055214,0.041139,0.122271,0.036637,0.017732,0.45874,0.0029,0.148464,1.023154,0.431075,0.352099,0.327842,0.23718,0.230675,0.236538,0.25721,0.27038,0.242086,0.229678,0.211439,0.179589,0.01086,0.00539,0.046999,0.008598,0.015579,0.00918,0.069485,0.003945,0.004131,0.01133,0.028188,0.002639,0.019079,133.333333,0.190708,129.0,BigRoom
5,5,0.127047,0.153488,3.221987,0.261693,0.257361,1.090034,0.004943,0.230099,-21.234846,1.541917,0.049064,0.194576,0.063895,0.058361,0.050883,0.071518,0.006431,0.011621,0.068274,0.130295,0.104191,0.009467,0.005906,0.079738,0.01158,0.014767,0.012332,0.078991,0.002105,0.005256,0.012025,0.031829,0.002447,0.041918,0.05022,0.048121,0.073129,0.043929,0.019323,0.461871,0.003707,0.138257,0.872498,0.418116,0.323801,0.332065,0.240972,0.212895,0.224638,0.225902,0.236724,0.227541,0.2327,0.216201,0.193793,0.009599,0.008033,0.058464,0.011039,0.011853,0.01143,0.059939,0.002986,0.006533,0.010347,0.025008,0.003035,0.019479,133.333333,0.168933,129.0,BigRoom


In [None]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- 1-ZCRm: double (nullable = true)
 |-- 2-Energym: double (nullable = true)
 |-- 3-EnergyEntropym: double (nullable = true)
 |-- 4-SpectralCentroidm: double (nullable = true)
 |-- 5-SpectralSpreadm: double (nullable = true)
 |-- 6-SpectralEntropym: double (nullable = true)
 |-- 7-SpectralFluxm: double (nullable = true)
 |-- 8-SpectralRolloffm: double (nullable = true)
 |-- 9-MFCCs1m: double (nullable = true)
 |-- 10-MFCCs2m: double (nullable = true)
 |-- 11-MFCCs3m: double (nullable = true)
 |-- 12-MFCCs4m: double (nullable = true)
 |-- 13-MFCCs5m: double (nullable = true)
 |-- 14-MFCCs6m: double (nullable = true)
 |-- 15-MFCCs7m: double (nullable = true)
 |-- 16-MFCCs8m: double (nullable = true)
 |-- 17-MFCCs9m: double (nullable = true)
 |-- 18-MFCCs10m: double (nullable = true)
 |-- 19-MFCCs11m: double (nullable = true)
 |-- 20-MFCCs12m: double (nullable = true)
 |-- 21-MFCCs13m: double (nullable = true)
 |-- 22-ChromaVector1m: double (null

In [None]:
input_features = df.columns[0:-1]
label = 'class'

In [None]:
df.groupBy('class').count().show(100)

+--------------------+-----+
|               class|count|
+--------------------+-----+
|           PsyTrance|  100|
|           HardDance|  100|
|              Breaks|  100|
|  HardcoreHardTechno|  100|
|   IndieDanceNuDisco|  100|
|              Trance|  100|
|           DeepHouse|  100|
|ElectronicaDowntempo|  100|
|           ReggaeDub|  100|
|             Minimal|  100|
|         DrumAndBass|  100|
|             Dubstep|  100|
|             BigRoom|  100|
|              Techno|  100|
|               House|  100|
|         FutureHouse|  100|
|        ElectroHouse|  100|
|           GlitchHop|  100|
|           TechHouse|  100|
|              HipHop|  100|
|           FunkRAndB|  100|
|               Dance|  100|
|    ProgressiveHouse|  100|
+--------------------+-----+



In [None]:
classesCount = df.groupBy('class').count().count()
print('We have ',classesCount,' balanced classes')

We have  23  balanced classes
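With 23 equally sized classes, a model that guesses uniformly at random is right about 1 time in 23, which gives a useful floor for judging the accuracies reported later. Here is a minimal plain-Python sketch of that arithmetic (the variable names are illustrative and not part of the notebook):

```python
# Chance-level accuracy for a balanced multiclass problem.
n_classes = 23          # genres in beatsdataset.csv
songs_per_class = 100   # rows per genre, as shown by the groupBy above

total_rows = n_classes * songs_per_class
chance_accuracy = 100.0 / n_classes  # in percent

print(f"{total_rows} rows, chance accuracy = {chance_accuracy:.2f}%")
```

Any classifier scoring well above this figure is learning something about the genres.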


# Preprocessing

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import * 
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import MinMaxScaler

In [None]:
string_inputs = []
numeric_inputs = []
for col in input_features:
  # isinstance is robust across Spark versions, unlike comparing str(dataType)
  if isinstance(df.schema[col].dataType, StringType):
    string_inputs.append(col)
  else:
    numeric_inputs.append(col)

print('Total number of columns: ',len(input_features))
print('Number of string columns:',len(string_inputs))
print('Number of non-string columns (double):',len(numeric_inputs))


Total number of columns:  72
Number of string columns: 0
Number of non-string columns (double): 72


### Use StringIndexer on the label column

In [None]:
indexer = StringIndexer(inputCol="class", outputCol="label") # PySpark expects this naming convention
indexed = indexer.fit(df).transform(df)
indexed.limit(5).show()

+---+---------------+---------------+----------------+-------------------+-----------------+------------------+----------------+------------------+--------------+-------------+---------------+----------------+---------------+---------------+----------------+---------------+----------------+----------------+-----------------+-----------------+----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+------------------+------------------+-------------------+---------------+---------------+-------------------+----------------------+--------------------+---------------------+------------------+---------------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+-------------------+-------------------+------
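To make the label encoding concrete, here is a plain-Python sketch of how string labels can be mapped to numeric indices. It assumes the ordering I understand to be StringIndexer's default (most frequent label first, ties broken alphabetically); the helper `index_labels` and the toy labels are hypothetical and not part of the Spark API.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to a float index, most frequent first.

    Ties are broken alphabetically. This mirrors the assumed default
    ordering of StringIndexer; it is a sketch, not the Spark implementation.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

# Toy labels: a hypothetical mini-sample, not the real dataset
sample = ["Techno", "BigRoom", "Techno", "House", "BigRoom", "Techno"]
mapping = index_labels(sample)
print(mapping)
```

Because every genre in this dataset has exactly 100 songs, under this assumption the indices would simply follow alphabetical order.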

### Treat skewness with log and exponential transforms
### Treat outliers with flooring and capping

In [None]:
d = {}
# Create a dictionary of quantiles from your numeric cols
# I'm doing the top and bottom 1% but you can adjust if needed
for col in numeric_inputs: 
    d[col] = indexed.approxQuantile(col,[0.01,0.99],0.25) # last argument is the relative error: a larger value is faster but less accurate

#Now check for skewness for all numeric cols
for col in numeric_inputs:
    skew = indexed.agg(skewness(col)).collect() #check for skewness
    skew = skew[0][0]
    # If skewness is found,
    # This function will make the appropriate corrections
    if skew > 1: # If right skew, floor, cap and log(x+1)
        indexed = indexed.withColumn(col, \
        log(when(indexed[col] < d[col][0],d[col][0])\
        .when(indexed[col] > d[col][1], d[col][1])\
        .otherwise(indexed[col] ) +1).alias(col))
        print(col+" has been treated for positive (right) skewness. (skew =",skew,")")
    elif skew < -1: # If left skew floor, cap and exp(x)
        indexed = indexed.withColumn(col, \
        exp(when(indexed[col] < d[col][0],d[col][0])\
        .when(indexed[col] > d[col][1], d[col][1])\
        .otherwise(indexed[col] )).alias(col))
        print(col+" has been treated for negative (left) skewness. (skew =",skew,")")

7-SpectralFluxm has been treated for positive (right) skewness. (skew = 1.6396138160129063 )
22-ChromaVector1m has been treated for positive (right) skewness. (skew = 2.4162415204309258 )
23-ChromaVector2m has been treated for positive (right) skewness. (skew = 4.154796693680583 )
24-ChromaVector3m has been treated for positive (right) skewness. (skew = 1.1974019617504328 )
25-ChromaVector4m has been treated for positive (right) skewness. (skew = 2.446635863594906 )
26-ChromaVector5m has been treated for positive (right) skewness. (skew = 2.154482876187508 )
27-ChromaVector6m has been treated for positive (right) skewness. (skew = 2.01234064472543 )
28-ChromaVector7m has been treated for positive (right) skewness. (skew = 1.1829228989215521 )
29-ChromaVector8m has been treated for positive (right) skewness. (skew = 3.7372643733999955 )
30-ChromaVector9m has been treated for positive (right) skewness. (skew = 2.4117416421548645 )
31-ChromaVector10m has been treated for positive (right) 
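The treatment applied above can be hard to picture from the Spark expression alone, so here is a plain-Python sketch of the same idea for a right-skewed column: clamp values into a range, then apply log(x + 1). The data and the bounds are made up for illustration, and `skewness` here is a simple population-skewness helper rather than the Spark function.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def floor_cap_log(values, low, high):
    """Clamp each value into [low, high], then apply log(x + 1)."""
    return [math.log(min(max(v, low), high) + 1.0) for v in values]

# Hypothetical right-skewed feature with one extreme value
raw = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.90]
treated = floor_cap_log(raw, low=0.01, high=0.05)  # bounds chosen by hand

print("skew before:", round(skewness(raw), 3))
print("skew after: ", round(skewness(treated), 3))
```

In this toy case most of the improvement comes from capping the extreme value; the log transform matters more when values span several orders of magnitude.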

In [None]:
minimums = df.select([min(c).alias(c) for c in df.columns if c in numeric_inputs]) 
# Create an array for all mins and select only the input cols
min_array = minimums.select(array(numeric_inputs).alias("mins")) 
# Collect global min as Python object
df_minimum = min_array.select(array_min(min_array.mins)).collect() 
# Slice to get the number itself
df_minimum = df_minimum[0][0] 

# If there are ANY negative vals found in the df, print a warning message
if df_minimum < 0:
    print("WARNING: The Naive Bayes Classifier will not be able to process your dataframe as it contains negative values")
else:
    print("No negative values were found in your dataframe.")



### As shown above, some values are negative, so we will normalize the data using **MinMaxScaler** after applying the **VectorAssembler**

In [None]:
assembler = VectorAssembler(inputCols=input_features,outputCol='features')

output = assembler.transform(indexed).select('features','label')

In [None]:
output.show(5,truncate = False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures",min=0,max=500)
print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))

# Compute summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(output)

# rescale each feature to range [min, max].
scaled_data = scalerModel.transform(output)
final_data = scaled_data.select('label','scaledFeatures')
# Rename to default value
final_data = final_data.withColumnRenamed("scaledFeatures","features")
final_data.show(5, truncate=False)

Features scaled to range: [0.000000, 500.000000]
+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
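For intuition, min-max scaling is a linear map of each feature from its observed range onto the target range, which is what removes the negative values. Below is a plain-Python sketch of that formula for a single column, assuming the same [0, 500] target range used above; `min_max_scale` is an illustrative helper, not the Spark implementation.

```python
def min_max_scale(column, out_min=0.0, out_max=500.0):
    """Rescale one feature column linearly into [out_min, out_max]."""
    col_min, col_max = min(column), max(column)
    if col_max == col_min:
        # Constant column: map every value to the midpoint of the range
        return [0.5 * (out_min + out_max)] * len(column)
    scale = (out_max - out_min) / (col_max - col_min)
    return [(x - col_min) * scale + out_min for x in column]

# Hypothetical column containing a negative value
print(min_max_scale([-1.0, 0.0, 1.0]))  # [0.0, 250.0, 500.0]
```

Note that the scaling is computed per feature, so every column ends up spanning the full 0 to 500 range regardless of its original units.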

### Split the data

In [None]:
final_data.count()

2300

In [None]:
train,test = final_data.randomSplit([0.8,0.2])

In [None]:
train.count()

1827

In [None]:
test.count()

473
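Notice that the split is 1827 / 473 rather than exactly 1840 / 460. As far as I understand, `randomSplit` assigns each row at random according to the weights, so the sizes are only approximate and change from run to run unless a seed is supplied (for example `final_data.randomSplit([0.8, 0.2], seed=42)`). The plain-Python sketch below illustrates that behaviour under this assumption; `random_split` is a hypothetical stand-in, not the Spark method.

```python
import random

def random_split(rows, train_weight=0.8, seed=42):
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train_part, test_part = [], []
    for row in rows:
        (train_part if rng.random() < train_weight else test_part).append(row)
    return train_part, test_part

rows = list(range(2300))  # stand-in for the 2300 songs
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))  # close to, but rarely exactly, 1840 / 460
```

Fixing the seed is worth doing here because, with only about 20 test songs per genre, accuracy can move by a few points between splits.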

# Modeling

In [None]:
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.sql.functions import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [None]:
def ClassTrainEval(classifier,features,classes,train,test):

    def FindMtype(classifier):
        # Instantiate Model
        M = classifier
        # Learn what it is
        Mtype = type(M).__name__
        
        return Mtype
    
    Mtype = FindMtype(classifier)
    

    def IntanceFitModel(Mtype,classifier,classes,features,train):
        
        if Mtype == "OneVsRest":
            # instantiate the base classifier.
            lr = LogisticRegression()
            # instantiate the One Vs Rest Classifier.
            OVRclassifier = OneVsRest(classifier=lr)
#             fitModel = OVRclassifier.fit(train)
            # Add parameters of your choice here:
            paramGrid = ParamGridBuilder() \
                .addGrid(lr.regParam, [0.1, 0.01]) \
                .build()
            #Cross Validator requires the following parameters:
            crossval = CrossValidator(estimator=OVRclassifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=2) # 3 is best practice
            # Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
        if Mtype == "MultilayerPerceptronClassifier":
            # specify layers for the neural network:
            # input layer of size features, two intermediate of features+1 and same size as features
            # and output of size number of classes
            # Note: crossvalidator cannot be used here
            features_count = len(features[0][0])
            layers = [features_count, features_count+1, features_count, classes]
            MPC_classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
            fitModel = MPC_classifier.fit(train)
            return fitModel

        if Mtype in("LogisticRegression","NaiveBayes","RandomForestClassifier","GBTClassifier","LinearSVC","DecisionTreeClassifier"):
  
            # Add parameters of your choice here:
            if Mtype in("LogisticRegression"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .addGrid(classifier.maxIter, [10, 15,20])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("NaiveBayes"):
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6]) \
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("RandomForestClassifier"):
                paramGrid = (ParamGridBuilder() \
                               .addGrid(classifier.maxDepth, [2, 5, 10])
#                                .addGrid(classifier.maxBins, [5, 10, 20])
#                                .addGrid(classifier.numTrees, [5, 20, 50])
                             .build())            
            
            # Add parameters of your choice here:
            if Mtype in("DecisionTreeClassifier"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
                             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                             .build())
            
            #Cross Validator requires all of the following parameters:
            crossval = CrossValidator(estimator=classifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=2) # 3 + is best practice
            # Fit Model: Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
    
    fitModel = IntanceFitModel(Mtype,classifier,classes,features,train)
    
    # Print feature selection metrics
    if fitModel is not None:
        
        if Mtype in("OneVsRest"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            # Extract list of binary models
            models = BestModel.models
            for model in models:
                print('\033[1m' + 'Intercept: '+ '\033[0m',model.intercept,'\033[1m' + '\nCoefficients:'+ '\033[0m',model.coefficients)

        if Mtype == "MultilayerPerceptronClassifier":
            print("")
            print('\033[1m' + Mtype," Weights"+ '\033[0m')
            print('\033[1m' + "Model Weights: "+ '\033[0m',fitModel.weights.size)
            print("")

        if Mtype in("DecisionTreeClassifier","RandomForestClassifier"):
            # FEATURE IMPORTANCES
            # Estimate of the importance of each feature.
            # Each feature’s importance is the average of its importance across all trees 
            # in the ensemble. The importance vector is normalized to sum to 1. 
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Feature Importances"+ '\033[0m')
            print("(Scores add up to 1)")
            print("Lowest score is the least important")
            print(" ")
            featureImportances = BestModel.featureImportances.toArray()
            print(featureImportances)
            
            if Mtype in("DecisionTreeClassifier"):
                global DT_featureImportances
                DT_featureImportances = BestModel.featureImportances.toArray()
                global DT_BestModel
                DT_BestModel = BestModel
            if Mtype in("RandomForestClassifier"):
                global RF_featureImportances
                RF_featureImportances = BestModel.featureImportances.toArray()
                global RF_BestModel
                RF_BestModel = BestModel

        if Mtype in("LogisticRegression"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Coefficient Matrix"+ '\033[0m')
            print("You should compares these relative to eachother")
            print("Coefficients: \n" + str(BestModel.coefficientMatrix))
            print("Intercept: " + str(BestModel.interceptVector))
            global LR_coefficients
            LR_coefficients = BestModel.coefficientMatrix.toArray()
            global LR_BestModel
            LR_BestModel = BestModel


        
   
    # Set the column names to match the external results dataframe that we will join with later:
    columns = ['Classifier', 'Result']
    
    if Mtype in("LinearSVC","GBTClassifier") and classes != 2:
        Mtype = [Mtype] # make this a list
        score = ["N/A"]
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
    else:
        predictions = fitModel.transform(test)
        MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy") # predictionCol="prediction",
        accuracy = (MC_evaluator.evaluate(predictions))*100
        Mtype = [Mtype] # make this a list
        score = [str(accuracy)] # make this a string and wrap it in a list
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
        result = result.withColumn('Result',result.Result.substr(0, 5))
        
    return result
    # Note: feature importances and coefficients are not returned; they are stored in the globals set above
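The parameter grids in the function above are kept small on purpose: a grid expands to the Cartesian product of its value lists, and cross-validation fits one model per combination per fold, plus a final refit of the best combination on the full training set. The sketch below counts those fits in plain Python; `build_param_grid` and `count_fits` are hypothetical helpers that imitate the idea, not `ParamGridBuilder` itself.

```python
from itertools import product

def build_param_grid(grids):
    """Expand {param: [values]} into a list of {param: value} dicts."""
    names = list(grids)
    combos = product(*(grids[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

def count_fits(grids, num_folds):
    """Fits during cross-validation, plus one final refit on all training data."""
    return len(build_param_grid(grids)) * num_folds + 1

# Grid used for the random forest in the function above
rf_grid = {"maxDepth": [2, 5, 10]}
print(count_fits(rf_grid, num_folds=2))  # 3 * 2 + 1 = 7

# Enabling the commented-out maxBins and numTrees grids multiplies the work
full_grid = {"maxDepth": [2, 5, 10], "maxBins": [5, 10, 20], "numTrees": [5, 20, 50]}
print(count_fits(full_grid, num_folds=3))  # 27 * 3 + 1 = 82
```

This is why widening the grids or raising `numFolds` can make the training cell dramatically slower.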

In [None]:
# Run!
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.sql import functions
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Comment out Naive Bayes if your data still contains negative values
classifiers = [
                LogisticRegression()
                ,OneVsRest()
               ,NaiveBayes()
               ,RandomForestClassifier()
               ,DecisionTreeClassifier()
               ,MultilayerPerceptronClassifier()
              ] 

features = final_data.select(['features']).collect()
# Learn how many classes there are in order to specify evaluation type based on binary or multi and turn the df into an object
class_count = final_data.select(countDistinct("label")).collect()
classes = class_count[0][0]

#set up your results table
columns = ['Classifier', 'Result']
vals = [("Place Holder","N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier,features,classes,train,test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
results.show(100,False)

 
[1mLogisticRegression  Coefficient Matrix[0m
You should compares these relative to eachother
Coefficients: 
DenseMatrix([[-0.04550155,  0.00037109,  0.0095118 , ..., -0.01063309,
               0.00580656,  0.01260533],
             [-0.04114579, -0.00553871,  0.00664433, ...,  0.00796034,
              -0.02156525,  0.00368964],
             [-0.04283345,  0.00020943, -0.00316317, ...,  0.00052396,
               0.00477326, -0.01695208],
             ...,
             [ 0.04412127,  0.00129012, -0.00266503, ..., -0.00847819,
               0.00645421,  0.0051082 ],
             [ 0.04821708,  0.00175001, -0.00134773, ...,  0.00246822,
               0.00552117, -0.00162427],
             [ 0.03494286,  0.00182118, -0.00313703, ..., -0.00909452,
               0.00619095,  0.00711705]])
Intercept: [9.654171063050988,12.800938803319058,2.475509948209058,14.101058621420181,5.840422767769387,-9.231707743553699,2.4582418330319387,28.086656316495855,8.375092785994353,-10.14636996216457
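For the feature selection part of the task, one simple approach is to rank features by the importances that the tree models expose and keep the top few. Here is a plain-Python sketch of that ranking; the scores are invented for illustration, and in the notebook you would presumably pass `input_features` together with the `RF_featureImportances` global set by `ClassTrainEval`.

```python
def top_k_features(names, importances, k):
    """Return the k (name, importance) pairs with the largest importance."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical importances for four of the notebook's columns
names = ["1-ZCRm", "2-Energym", "69-BPM", "71-BPMessentia"]
importances = [0.10, 0.05, 0.25, 0.60]

for name, score in top_k_features(names, importances, k=2):
    print(f"{name}: {score:.2f}")
```

The ranking relies on the importance vector being in the same order as the columns given to the VectorAssembler.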

In [None]:
results.orderBy(results['result'].desc()).show(50,truncate = False)

+------------------------------+------+
|Classifier                    |Result|
+------------------------------+------+
|RandomForestClassifier        |75.26 |
|DecisionTreeClassifier        |64.27 |
|LogisticRegression            |61.09 |
|OneVsRest                     |57.08 |
|NaiveBayes                    |47.78 |
|MultilayerPerceptronClassifier|27.90 |
+------------------------------+------+
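One caveat about the ordering above: `Result` is a string column, so the sort is lexicographic. It happens to look right here because every score has two digits before the decimal point. The plain-Python sketch below shows how a single-digit score (hypothetical, added only to expose the problem) would be misplaced; in Spark, casting first with something like `results['Result'].cast('double')` should avoid this.

```python
# Accuracy values stored as text, as in the Result column above,
# plus one hypothetical single-digit score to expose the problem
scores = ["75.26", "64.27", "9.50"]

as_text = sorted(scores, reverse=True)
as_numbers = sorted(scores, key=float, reverse=True)

print(as_text)     # ['9.50', '75.26', '64.27']
print(as_numbers)  # ['75.26', '64.27', '9.50']
```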



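### Adding the `_c0` column to the feature vector raised the accuracy from 46% to 75%. `_c0` is just the row index, and the first rows shown earlier are all BigRoom, so the file appears to be sorted by genre; the index then most likely leaks the label, which would make 46% the more honest estimate.
### With `_c0` included, the random forest classifier performs the best.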
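To see how much a row index can give away, here is a plain-Python sketch that "classifies" songs using nothing but their position in the file. It rests on an assumption suggested by the first rows shown earlier: that the rows are stored genre by genre, 100 songs each. The names below are illustrative only.

```python
# Assumption: rows are stored genre by genre, 100 songs each, so the
# row index (_c0) runs 0-99 for the first genre, 100-199 for the next, etc.
SONGS_PER_GENRE = 100
N_GENRES = 23

row_index = list(range(SONGS_PER_GENRE * N_GENRES))
true_genre = [i // SONGS_PER_GENRE for i in row_index]

def predict_from_index(i):
    """'Classify' a song using nothing but its row number."""
    return i // SONGS_PER_GENRE

hits = sum(predict_from_index(i) == g for i, g in zip(row_index, true_genre))
accuracy = hits / len(row_index)
print(f"accuracy from the index alone: {accuracy:.0%}")
```

If that assumption holds, tree models can split on `_c0` to recover the genre, so excluding it (for example with `df.columns[1:-1]`) should give a fairer comparison between classifiers.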