# Classification in PySpark's MLlib Project Solution

### Genre classification
Now it's time to apply what we learned in the lectures to a REAL classification project! Have you ever wondered what makes us humans able to tell apart two songs of different genres? How do we inherently know the difference between a pop song and heavy metal? This type of classification may seem easy for us, but it's a very difficult challenge for a computer. So the question is, could an automatic genre classification model be possible?

For this project we will be classifying songs based on a number of characteristics into a set of 23 electronic genres. This technology could be used by an application like Pandora to recommend songs to users or just create meaningful channels. Super fun!

### Dataset
*beatsdataset.csv*
Each row is an electronic music song. The dataset contains 100 songs for each of 23 electronic music genres; they were the top 100 songs of their genres in November 2016. The 71 columns are audio features extracted from a random two-minute sample of the audio file. These features were extracted using pyAudioAnalysis (https://github.com/tyiannak/pyAudioAnalysis).
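
Since every genre has the same number of songs, it helps to know what "guessing" would score before looking at any model. Here is a minimal plain-Python sketch (no Spark needed); the genre names are hypothetical placeholders, and only the 23 x 100 shape comes from the dataset description above.

```python
from collections import Counter

# Hypothetical stand-in for the `class` column: 23 genres x 100 songs each.
genres = [f"genre_{i}" for i in range(23)]
labels = [g for g in genres for _ in range(100)]

counts = Counter(labels)
total = len(labels)

# Share of each genre in percent (identical for all genres when balanced).
share = {g: 100 * n / total for g, n in counts.items()}

# Always predicting the most common genre is right max_count / total of the time.
chance_accuracy = 100 * max(counts.values()) / total
print(f"{total} songs, {len(counts)} genres, chance accuracy = {chance_accuracy:.2f}%")
# prints: 2300 songs, 23 genres, chance accuracy = 4.35%
```

Any accuracy we report later should be read against this roughly 4.35% baseline rather than against 50%.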

### Your task
Create an algorithm that classifies songs into the 23 genres provided. Test out several different models and select the highest performing one. Also play around with feature selection methods and finally try to make a recommendation to a user.  

For the feature selection aspect of this project, you may need to get a bit creative if you want to select features from a non-tree algorithm. I intentionally did not go over this aspect of PySpark in the previous lectures, to give you a chance to get used to researching the PySpark documentation. Here is the link to the Feature Selectors section of the documentation, which just might come in handy: https://spark.apache.org/docs/latest/ml-features.html#feature-selectors
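
Whatever selector you end up using, the core idea is usually "score every feature, keep the best k". The sketch below shows just that ranking step in plain Python. The helper `top_k_features`, the feature names chosen, and the scores are all hypothetical and for illustration only; they are not results from this notebook. With a tree model the scores could come from `featureImportances`, and the chosen positions could then be handed to something like `VectorSlicer`.

```python
def top_k_features(names, importances, k):
    """Return the k feature names with the largest importance scores."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical scores, for illustration only (not results from this notebook).
names = ["69-BPM", "1-ZCRm", "2-Energym", "7-SpectralFluxm"]
importances = [0.40, 0.10, 0.35, 0.15]

selected = top_k_features(names, importances, k=2)
print(selected)
# prints: ['69-BPM', '2-Energym']
```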

Good luck! Have fun :)

### Source
https://www.kaggle.com/caparrini/beatsdataset

In [1]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


In [2]:
# Install pyspark
!pip install pyspark
# import findspark
# findspark.init()
import os
import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
# May take awhile locally
spark = SparkSession.builder.appName("Classification").getOrCreate()

cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")
spark


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824025 sha256=ee2e329159368baca84b5f508d655531f9dbb55dbc97a521ef28d8f556fdfe5c
  Stored in directory: /root/.cache/pip/wheels/b1/59/a0/a1a0624b5e865fd389919c1a10f53aec9b12195d6747710baf
Successfully built pyspark
Installing collected packages: py4j, pyspa

## Reading The Data

In [3]:
path = '/content/drive/MyDrive/Data Science Intake43/5. Spark/spark-scripts/section3/Datasets/'
os.listdir(path)

['beatsdataset.csv',
 'kickstarter.csv',
 'housing.csv',
 'fake_job_postings.csv',
 'Toddler Autism dataset July 2018.csv',
 'Concrete_Data.csv']

In [4]:
df = spark.read.csv(path+'beatsdataset.csv',inferSchema=True,header=True)
df.limit(6).toPandas()

Unnamed: 0,_c0,1-ZCRm,2-Energym,3-EnergyEntropym,4-SpectralCentroidm,5-SpectralSpreadm,6-SpectralEntropym,7-SpectralFluxm,8-SpectralRolloffm,9-MFCCs1m,...,63-ChromaVector8std,64-ChromaVector9std,65-ChromaVector10std,66-ChromaVector11std,67-ChromaVector12std,68-ChromaDeviationstd,69-BPM,70-BPMconf,71-BPMessentia,class
0,0,0.13644,0.088861,3.201201,0.262825,0.249212,1.114423,0.007003,0.256682,-22.723259,...,0.003431,0.004981,0.010818,0.024001,0.005201,0.015056,133.333333,0.132792,128.0,BigRoom
1,1,0.117039,0.108389,3.194001,0.247657,0.250288,1.065668,0.005387,0.199821,-21.775871,...,0.004461,0.006441,0.007469,0.015499,0.005589,0.019339,120.0,0.112767,126.0,BigRoom
2,2,0.085308,0.128525,3.123837,0.217205,0.228652,0.789647,0.008247,0.156822,-22.472722,...,0.001529,0.004556,0.007723,0.017482,0.002901,0.022201,133.333333,0.123373,129.0,BigRoom
3,3,0.10305,0.167042,3.15083,0.233593,0.245032,0.967082,0.006571,0.168083,-21.470751,...,0.001591,0.003514,0.009477,0.023162,0.004165,0.015379,133.333333,0.158876,129.0,BigRoom
4,4,0.15173,0.148405,3.194498,0.29373,0.267231,1.353005,0.003872,0.292055,-21.371157,...,0.003945,0.004131,0.01133,0.028188,0.002639,0.019079,133.333333,0.190708,129.0,BigRoom
5,5,0.127047,0.153488,3.221987,0.261693,0.257361,1.090034,0.004943,0.230099,-21.234846,...,0.002986,0.006533,0.010347,0.025008,0.003035,0.019479,133.333333,0.168933,129.0,BigRoom


In [5]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- 1-ZCRm: double (nullable = true)
 |-- 2-Energym: double (nullable = true)
 |-- 3-EnergyEntropym: double (nullable = true)
 |-- 4-SpectralCentroidm: double (nullable = true)
 |-- 5-SpectralSpreadm: double (nullable = true)
 |-- 6-SpectralEntropym: double (nullable = true)
 |-- 7-SpectralFluxm: double (nullable = true)
 |-- 8-SpectralRolloffm: double (nullable = true)
 |-- 9-MFCCs1m: double (nullable = true)
 |-- 10-MFCCs2m: double (nullable = true)
 |-- 11-MFCCs3m: double (nullable = true)
 |-- 12-MFCCs4m: double (nullable = true)
 |-- 13-MFCCs5m: double (nullable = true)
 |-- 14-MFCCs6m: double (nullable = true)
 |-- 15-MFCCs7m: double (nullable = true)
 |-- 16-MFCCs8m: double (nullable = true)
 |-- 17-MFCCs9m: double (nullable = true)
 |-- 18-MFCCs10m: double (nullable = true)
 |-- 19-MFCCs11m: double (nullable = true)
 |-- 20-MFCCs12m: double (nullable = true)
 |-- 21-MFCCs13m: double (nullable = true)
 |-- 22-ChromaVector1m: double (null

In [6]:
# Read in functions we will need
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
# Note: this wildcard import shadows Python built-ins such as min and max with the Spark column functions
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import MinMaxScaler

# Read in dependencies
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [7]:
df.select('class').distinct().collect()

[Row(class='PsyTrance'),
 Row(class='HardDance'),
 Row(class='Breaks'),
 Row(class='HardcoreHardTechno'),
 Row(class='IndieDanceNuDisco'),
 Row(class='Trance'),
 Row(class='DeepHouse'),
 Row(class='ElectronicaDowntempo'),
 Row(class='ReggaeDub'),
 Row(class='Minimal'),
 Row(class='DrumAndBass'),
 Row(class='Dubstep'),
 Row(class='BigRoom'),
 Row(class='Techno'),
 Row(class='House'),
 Row(class='FutureHouse'),
 Row(class='ElectroHouse'),
 Row(class='GlitchHop'),
 Row(class='TechHouse'),
 Row(class='HipHop'),
 Row(class='FunkRAndB'),
 Row(class='Dance'),
 Row(class='ProgressiveHouse')]

## Checking The Class Balance

In [8]:
tot = df.count()
df.groupBy("class").count().withColumnRenamed('count', 'cnt_per_group').withColumn('perc_of_count_total', (col('cnt_per_group') / tot) * 100 ).show(100)

+--------------------+-------------+-------------------+
|               class|cnt_per_group|perc_of_count_total|
+--------------------+-------------+-------------------+
|           PsyTrance|          100| 4.3478260869565215|
|           HardDance|          100| 4.3478260869565215|
|              Breaks|          100| 4.3478260869565215|
|  HardcoreHardTechno|          100| 4.3478260869565215|
|   IndieDanceNuDisco|          100| 4.3478260869565215|
|              Trance|          100| 4.3478260869565215|
|           DeepHouse|          100| 4.3478260869565215|
|ElectronicaDowntempo|          100| 4.3478260869565215|
|           ReggaeDub|          100| 4.3478260869565215|
|             Minimal|          100| 4.3478260869565215|
|         DrumAndBass|          100| 4.3478260869565215|
|             Dubstep|          100| 4.3478260869565215|
|             BigRoom|          100| 4.3478260869565215|
|              Techno|          100| 4.3478260869565215|
|               House|         

## Building The Functions And Models

In [10]:
# Data Prep function
def MLClassifierDFPrep(df,input_columns,dependent_var,treat_outliers=True,treat_neg_values=True):
    
    # change label (class variable) to string type to prep for reindexing
    # PySpark expects a zero-indexed numeric label column.
    # Just in case our data is not in that format, we treat it with the built-in StringIndexer
    renamed = df.withColumn("label_str", df[dependent_var].cast(StringType())) # Copy the label and cast it to string type
    indexer = StringIndexer(inputCol="label_str", outputCol="label") # PySpark expects this naming convention
    indexed = indexer.fit(renamed).transform(renamed)

    # Convert all string type data in the input column list to numeric
    # Otherwise the Algorithm will not be able to process it
    numeric_inputs = []
    string_inputs = []
    for column in input_columns:
        if isinstance(indexed.schema[column].dataType, StringType): # isinstance is safer than comparing str(), whose format varies across PySpark versions
            indexer = StringIndexer(inputCol=column, outputCol=column+"_num") 
            indexed = indexer.fit(indexed).transform(indexed)
            new_col_name = column+"_num"
            string_inputs.append(new_col_name)
        else:
            numeric_inputs.append(column)
            
    if treat_outliers == True:
        print("We are correcting for non normality now!")
        # empty dictionary d
        d = {}
        # Create a dictionary of quantiles
        for col in numeric_inputs: 
            d[col] = indexed.approxQuantile(col,[0.01,0.99],0.25) #if you want to make it go faster increase the last number
        #Now fill in the values
        for col in numeric_inputs:
            skew = indexed.agg(skewness(indexed[col])).collect() #check for skewness
            skew = skew[0][0]
            # Floor and cap at the approximate 1st/99th percentiles, then take log(x + 1) (just in case there are 0 values)
            if skew > 1:
                indexed = indexed.withColumn(col, \
                log(when(indexed[col] < d[col][0],d[col][0])\
                .when(indexed[col] > d[col][1], d[col][1])\
                .otherwise(indexed[col] ) +1).alias(col))
                print(col+" has been treated for positive (right) skewness. (skew =)",skew,")")
            # For left skew, floor and cap the same way, then take exp(x)
            elif skew < -1:
                indexed = indexed.withColumn(col, \
                exp(when(indexed[col] < d[col][0],d[col][0])\
                .when(indexed[col] > d[col][1], d[col][1])\
                .otherwise(indexed[col] )).alias(col))
                print(col+" has been treated for negative (left) skewness. (skew =",skew,")")

            
    # Warn the user if the dataframe contains negative values, since Naive Bayes cannot process them.
    # Note: we only need to check the numeric input values since anything that is indexed won't have negative values
    minimums = df.select([min(c).alias(c) for c in df.columns if c in numeric_inputs]) # Calculate the mins for all columns in the df
    min_array = minimums.select(array(numeric_inputs).alias("mins")) # Create an array for all mins and select only the input cols
    df_minimum = min_array.select(array_min(min_array.mins)).collect() # Collect global min as Python object
    df_minimum = df_minimum[0][0] # Slice to get the number itself

    features_list = numeric_inputs + string_inputs
    assembler = VectorAssembler(inputCols=features_list,outputCol='features')
    output = assembler.transform(indexed).select('features','label')

#     final_data = output.select('features','label') #drop everything else
    
    # Now check for negative values and ask user if they want to correct that? 
    if df_minimum < 0:
        print(" ")
        print("WARNING: The Naive Bayes Classifier will not be able to process your dataframe as it contains negative values")
        print(" ")
    
    if treat_neg_values == True:
        print("You have opted to correct that by rescaling all your features to a range of 0 to 1")
        print(" ")
        print("We are rescaling your dataframe....")
        scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

        # Compute summary statistics and generate MinMaxScalerModel
        scalerModel = scaler.fit(output)

        # rescale each feature to range [min, max].
        scaled_data = scalerModel.transform(output)
        final_data = scaled_data.select('label','scaledFeatures')
        final_data = final_data.withColumnRenamed('scaledFeatures','features')
        print("Done!")

    else:
        print("You have opted not to correct that therefore you will not be able to use to Naive Bayes classifier")
        print("We will return the dataframe unscaled.")
        final_data = output
    
    return final_data
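
If the skewness treatment inside `MLClassifierDFPrep` feels a bit opaque, here is a small plain-Python sketch of the same idea for a right-skewed column: clip to a lower and upper bound, then take log(x + 1). The helper names, the toy values, and the bounds are hypothetical; in the function above the bounds come from `approxQuantile` at the 1st and 99th percentiles. The skewness formula used here is the population version, which is an assumption about what you want to measure rather than a claim about Spark's internals.

```python
import math

def skewness(xs):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def floor_cap_log(xs, lower, upper):
    """Clip each value to [lower, upper], then apply log(x + 1)."""
    return [math.log(min(max(x, lower), upper) + 1) for x in xs]

# Hypothetical right-skewed column: mostly small values plus one large outlier.
values = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 9.0]
treated = floor_cap_log(values, lower=0.1, upper=0.5)

print("skew before:", round(skewness(values), 2))
print("skew after: ", round(skewness(treated), 2))
```

The outlier gets capped at the upper bound, so the treated column ends up far less skewed than the original.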

In [11]:
def ClassTrainEval(classifier,features,classes,train,test):

    def FindMtype(classifier):
        # Instantiate Model
        M = classifier
        # Learn what it is
        Mtype = type(M).__name__
        
        return Mtype
    
    Mtype = FindMtype(classifier)
    

    def IntanceFitModel(Mtype,classifier,classes,features,train):
        
        if Mtype == "OneVsRest":
            # instantiate the base classifier.
            lr = LogisticRegression()
            # instantiate the One Vs Rest Classifier.
            OVRclassifier = OneVsRest(classifier=lr)
#             fitModel = OVRclassifier.fit(train)
            # Add parameters of your choice here:
            paramGrid = ParamGridBuilder() \
                .addGrid(lr.regParam, [0.1, 0.01]) \
                .build()
            #Cross Validator requires the following parameters:
            crossval = CrossValidator(estimator=OVRclassifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=2) # 3 is best practice
            # Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
        if Mtype == "MultilayerPerceptronClassifier":
            # specify layers for the neural network:
            # input layer of size features, two intermediate of features+1 and same size as features
            # and output of size number of classes
            # Note: crossvalidator cannot be used here
            features_count = len(features[0][0])
            layers = [features_count, features_count+1, features_count, classes]
            MPC_classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
            fitModel = MPC_classifier.fit(train)
            return fitModel
        if Mtype in("LinearSVC","GBTClassifier") and classes != 2: # These classifiers currently only accept binary classification
            print(Mtype," could not be used because PySpark currently only accepts binary classification data for this algorithm")
            return
        if Mtype in("LogisticRegression","NaiveBayes","RandomForestClassifier","GBTClassifier","LinearSVC","DecisionTreeClassifier"):
  
            # Add parameters of your choice to the grid for each model type.
            # Note: `Mtype in ("X")` would be a substring test, because ("X") is just a string, so we compare with == instead.
            if Mtype == "LogisticRegression":
                paramGrid = (ParamGridBuilder()
#                              .addGrid(classifier.regParam, [0.1, 0.01])
                             .addGrid(classifier.maxIter, [10, 15, 20])
                             .build())

            elif Mtype == "NaiveBayes":
                paramGrid = (ParamGridBuilder()
                             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6])
                             .build())

            elif Mtype == "RandomForestClassifier":
                paramGrid = (ParamGridBuilder()
                             .addGrid(classifier.maxDepth, [2, 5, 10])
#                              .addGrid(classifier.maxBins, [5, 10, 20])
#                              .addGrid(classifier.numTrees, [5, 20, 50])
                             .build())

            elif Mtype == "GBTClassifier":
                paramGrid = (ParamGridBuilder()
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30])
#                              .addGrid(classifier.maxBins, [10, 20, 40, 80, 100])
                             .addGrid(classifier.maxIter, [10, 15, 50, 100])
                             .build())

            elif Mtype == "LinearSVC":
                paramGrid = (ParamGridBuilder()
                             .addGrid(classifier.maxIter, [10, 15])
                             .addGrid(classifier.regParam, [0.1, 0.01])
                             .build())

            elif Mtype == "DecisionTreeClassifier":
                paramGrid = (ParamGridBuilder()
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30])
                             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100])
                             .build())
            
            #Cross Validator requires all of the following parameters:
            crossval = CrossValidator(estimator=classifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=2) # 3 + is best practice
            # Fit Model: Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
    
    fitModel = IntanceFitModel(Mtype,classifier,classes,features,train)
    
    # Print feature selection metrics
    if fitModel is not None:
        
        if Mtype in("OneVsRest"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            # Extract list of binary models
            models = BestModel.models
            for model in models:
                print('\033[1m' + 'Intercept: '+ '\033[0m',model.intercept,'\033[1m' + '\nCoefficients:'+ '\033[0m',model.coefficients)

        if Mtype == "MultilayerPerceptronClassifier":
            print("")
            print('\033[1m' + Mtype," Weights"+ '\033[0m')
            print('\033[1m' + "Model Weights: "+ '\033[0m',fitModel.weights.size)
            print("")

        if Mtype in("DecisionTreeClassifier", "GBTClassifier","RandomForestClassifier"):
            # FEATURE IMPORTANCES
            # Estimate of the importance of each feature.
            # Each feature’s importance is the average of its importance across all trees 
            # in the ensemble The importance vector is normalized to sum to 1. 
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Feature Importances"+ '\033[0m')
            print("(Scores add up to 1)")
            print("Lowest score is the least important")
            print(" ")
            featureImportances = BestModel.featureImportances.toArray()
            print(featureImportances)
            
            if Mtype in("DecisionTreeClassifier"):
                global DT_featureImportances
                DT_featureImportances = BestModel.featureImportances.toArray()
                global DT_BestModel
                DT_BestModel = BestModel
            if Mtype in("GBTClassifier"):
                global GBT_featureImportances
                GBT_featureImportances = BestModel.featureImportances.toArray()
                global GBT_BestModel
                GBT_BestModel = BestModel
            if Mtype in("RandomForestClassifier"):
                global RF_featureImportances
                RF_featureImportances = BestModel.featureImportances.toArray()
                global RF_BestModel
                RF_BestModel = BestModel

        if Mtype in("LogisticRegression"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Coefficient Matrix"+ '\033[0m')
            print("You should compares these relative to eachother")
            print("Coefficients: \n" + str(BestModel.coefficientMatrix))
            print("Intercept: " + str(BestModel.interceptVector))
            global LR_coefficients
            LR_coefficients = BestModel.coefficientMatrix.toArray()
            global LR_BestModel
            LR_BestModel = BestModel

        if Mtype in("LinearSVC"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Coefficients"+ '\033[0m')
            print("You should compares these relative to eachother")
            print("Coefficients: \n" + str(BestModel.coefficients))
            global LSVC_coefficients
            LSVC_coefficients = BestModel.coefficients.toArray()
            global LSVC_BestModel
            LSVC_BestModel = BestModel
        
   
    # Set the column names to match the external results dataframe that we will join with later:
    columns = ['Classifier', 'Result']
    
    if Mtype in("LinearSVC","GBTClassifier") and classes != 2:
        Mtype = [Mtype] # make this a list
        score = ["N/A"]
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
    else:
        predictions = fitModel.transform(test)
        MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy") # predictionCol="prediction",
        accuracy = (MC_evaluator.evaluate(predictions))*100
        Mtype = [Mtype] # make this a list
        score = [str(accuracy)] #make this a string and convert to a list
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
        result = result.withColumn('Result',result.Result.substr(0, 5))
        
    return result
    #Also returns the fit model important scores or p values
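
A quick note on cost: every value you add to a parameter grid multiplies the number of models `CrossValidator` has to train. The plain-Python sketch below counts the fits for a grid, assuming one fit per (parameter combination, fold) plus one final refit of the best combination on the full training set. The helper `count_fits` is hypothetical; the first grid mirrors the LogisticRegression grid above, and the second is a made-up two-parameter grid for comparison.

```python
from itertools import product

def count_fits(param_grid, num_folds):
    """Return (number of combinations, number of model fits including the final refit)."""
    combos = list(product(*param_grid.values()))
    return len(combos), len(combos) * num_folds + 1

# Mirrors the LogisticRegression grid used above: three maxIter values, 2 folds.
lr_grid = {"maxIter": [10, 15, 20]}
print(count_fits(lr_grid, num_folds=2))
# prints: (3, 7)

# Hypothetical larger grid: 3 x 2 combinations with 3 folds.
big_grid = {"maxIter": [10, 15, 20], "regParam": [0.1, 0.01]}
print(count_fits(big_grid, num_folds=3))
# prints: (6, 19)
```

This is why several grid lines above are commented out: on a small Colab machine, a few extra values per parameter quickly turn into a long wait.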

## Shuffle The Data So It Does Not Enter The Model In Order

In [12]:
df = df.orderBy(rand())
df.show(20)


# For This Part I Applied Two Test Cases
1. With treatment of outliers and skewness (set to True)
2. Without treatment (set to False)

## Prepare The Data To Be Vectorized And Processed
### Note the treatments are set to **True**

In [13]:
# input_columns = df.columns # Collect the column names as a list
# input_columns = input_columns[8:] # keep only relevant columns: from column 8 until the end
input_columns = list(df.columns)[1:-1]
dependent_var = 'class'
final_data = MLClassifierDFPrep(df, input_columns, dependent_var, treat_outliers=True, treat_neg_values=True)
final_data.limit(5).toPandas()

We are correcting for non normality now!
7-SpectralFluxm has been treated for positive (right) skewness. (skew =) 1.6396138160129037 )
22-ChromaVector1m has been treated for positive (right) skewness. (skew =) 2.416241520430935 )
23-ChromaVector2m has been treated for positive (right) skewness. (skew =) 4.154796693680598 )
24-ChromaVector3m has been treated for positive (right) skewness. (skew =) 1.197401961750432 )
25-ChromaVector4m has been treated for positive (right) skewness. (skew =) 2.44663586359492 )
26-ChromaVector5m has been treated for positive (right) skewness. (skew =) 2.1544828761874895 )
27-ChromaVector6m has been treated for positive (right) skewness. (skew =) 2.0123406447254215 )
28-ChromaVector7m has been treated for positive (right) skewness. (skew =) 1.1829228989215568 )
29-ChromaVector8m has been treated for positive (right) skewness. (skew =) 3.737264373400017 )
30-ChromaVector9m has been treated for positive (right) skewness. (skew =) 2.411741642154868 )
31-Chrom

Unnamed: 0,label,features
0,2.0,"[0.4380109731804216, 0.2928301757103344, 0.793..."
1,10.0,"[0.4086518817284318, 0.10793781567608038, 0.46..."
2,14.0,"[0.5612636368237589, 0.2396613490944655, 0.623..."
3,9.0,"[0.39202816597363077, 0.37864654950047977, 0.7..."
4,16.0,"[0.2648663551081497, 0.20736217792303355, 0.58..."
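
Notice that every value in `features` now sits between 0 and 1. That is the min-max rescaling at work. Here is a plain-Python sketch of the formula for a single column; the helper `min_max_scale` and the toy values are hypothetical, and the constant-column rule (map to 0.5) follows the behaviour described in the Spark `MinMaxScaler` docs for the default [0, 1] range.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.5."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.5 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical MFCC-like column containing negative values.
column = [-22.7, -21.8, -22.5, -21.5]
scaled = min_max_scale(column)
print([round(v, 3) for v in scaled])
```

The smallest value becomes 0, the largest becomes 1, and nothing is negative any more, which is what Naive Bayes needs.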


In [14]:
train,test = final_data.randomSplit([0.7,0.3])
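
One thing to keep in mind: `randomSplit` samples rows at random, so the 70/30 proportions are approximate and change from run to run unless you pass a seed. The plain-Python sketch below imitates that behaviour; `random_split` is a hypothetical helper, not the Spark implementation, and the seed value is arbitrary.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train with probability train_fraction, independently."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        (train_rows if rng.random() < train_fraction else test_rows).append(row)
    return train_rows, test_rows

rows = list(range(2300))  # stand-in for the 2,300 songs

train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

# Sizes are close to 70/30 but not exact; the same seed reproduces the same split.
print(len(train_a), len(test_a), train_a == train_b)
```

In the notebook, passing `seed=` to `randomSplit` would make the accuracy numbers below repeatable between runs.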

In [15]:
train.groupBy("label").count().show(100)

+-----+-----+
|label|count|
+-----+-----+
|  8.0|   64|
|  0.0|   71|
|  7.0|   71|
| 18.0|   64|
|  1.0|   64|
|  4.0|   68|
| 11.0|   76|
| 21.0|   70|
| 14.0|   73|
| 22.0|   72|
|  3.0|   69|
| 19.0|   78|
|  2.0|   67|
| 17.0|   68|
| 10.0|   68|
| 13.0|   71|
|  6.0|   70|
| 20.0|   60|
|  5.0|   75|
| 15.0|   77|
|  9.0|   71|
| 16.0|   67|
| 12.0|   73|
+-----+-----+
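
Wondering which genre hides behind each numeric label? `StringIndexer` assigns indices by descending frequency by default, and the Spark 3.x docs describe an alphabetical tie-break. Here is a plain-Python sketch of that rule; `index_labels` and the toy column are hypothetical, and the tie-break is an assumption you should confirm on your Spark version (the fitted indexer's `labels` attribute shows the actual order).

```python
from collections import Counter

def index_labels(labels):
    """Map each label to a float index: most frequent first, ties alphabetical."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical toy column; in the balanced dataset every genre ties on frequency.
toy = ["Techno", "House", "Techno", "BigRoom", "House", "Techno", "Dance"]
mapping = index_labels(toy)
print(mapping)
# prints: {'Techno': 0.0, 'House': 1.0, 'BigRoom': 2.0, 'Dance': 3.0}
```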



### Check class balance within the split data (to ensure the model sees all classes roughly equally)

In [16]:
tot = train.count()
train.groupBy("label").count().withColumnRenamed('count', 'cnt_per_group').withColumn('perc_of_count_total', (col('cnt_per_group') / tot) * 100 ).show(100)

+-----+-------------+-------------------+
|label|cnt_per_group|perc_of_count_total|
+-----+-------------+-------------------+
|  8.0|           64| 3.9825762289981332|
|  0.0|           71|  4.418170504044804|
|  7.0|           71|  4.418170504044804|
| 18.0|           64| 3.9825762289981332|
|  1.0|           64| 3.9825762289981332|
|  4.0|           68|  4.231487243310516|
| 11.0|           76|  4.729309271935283|
| 21.0|           70|  4.355942750466708|
| 14.0|           73|  4.542626011200996|
| 22.0|           72|    4.4803982576229|
|  3.0|           69|  4.293714996888612|
| 19.0|           78|  4.853764779091475|
|  2.0|           67|  4.169259489732421|
| 17.0|           68|  4.231487243310516|
| 10.0|           68|  4.231487243310516|
| 13.0|           71|  4.418170504044804|
|  6.0|           70|  4.355942750466708|
| 20.0|           60| 3.7336652146857494|
|  5.0|           75|  4.667081518357187|
| 15.0|           77|  4.791537025513379|
|  9.0|           71|  4.418170504

In [17]:
tot = test.count()
test.groupBy("label").count().withColumnRenamed('count', 'count_per_group').withColumn('perc_of_count_total', (col('count_per_group') / tot) * 100 ).show(100)

+-----+---------------+-------------------+
|label|count_per_group|perc_of_count_total|
+-----+---------------+-------------------+
|  8.0|             36|  5.194805194805195|
|  0.0|             29|  4.184704184704184|
|  7.0|             29|  4.184704184704184|
| 18.0|             36|  5.194805194805195|
|  1.0|             36|  5.194805194805195|
|  4.0|             32|  4.617604617604617|
| 11.0|             24|  3.463203463203463|
| 21.0|             30|  4.329004329004329|
| 14.0|             27|  3.896103896103896|
| 22.0|             28|  4.040404040404041|
|  3.0|             31|  4.473304473304474|
| 19.0|             22| 3.1746031746031744|
|  2.0|             33|  4.761904761904762|
| 17.0|             32|  4.617604617604617|
| 10.0|             32|  4.617604617604617|
| 13.0|             29|  4.184704184704184|
|  6.0|             30|  4.329004329004329|
| 20.0|             40|  5.772005772005772|
|  5.0|             25| 3.6075036075036073|
| 15.0|             23|  3.31890

## Fit a baseline model first (Logistic Regression)

In [18]:
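# Set up our evaluation objects
# Note: BinaryClassificationEvaluator assumes a two-class label, so on this 23-class problem
# the AUC it reports is not a meaningful metric; rely on the multiclass accuracy instead.
Bin_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction') #labelCol='label'
# Bin_evaluator = BinaryClassificationEvaluator() #labelCol='label'
MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy") # predictionCol="prediction",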

In [19]:
# This is the most simplistic approach which does not use cross validation
# Let's go ahead and train a Logistic Regression Algorithm
classifier = LogisticRegression()
fitModel = classifier.fit(train)

# Evaluation method for binary classification problem
predictionAndLabels = fitModel.transform(test)
auc = Bin_evaluator.evaluate(predictionAndLabels)
print("AUC:",auc)

# Evaluation for a multiclass classification problem
# predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictionAndLabels))*100
print("Accuracy: {0:.2f}".format(accuracy),"%") #     print("Test Error = %g " % (1.0 - accuracy))
print(" ")

AUC: 0.7633464894058994
Accuracy: 41.27 %
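
For reference, the accuracy metric is simply the share of test rows whose predicted label equals the true label. A plain-Python sketch, using hypothetical labels and predictions rather than this notebook's output:

```python
def accuracy(labels, predictions):
    """Fraction of rows where the predicted class equals the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical labels and predictions, not taken from this notebook's output.
labels = [2.0, 10.0, 14.0, 9.0, 16.0]
predictions = [2.0, 10.0, 3.0, 9.0, 0.0]

print(f"Accuracy: {100 * accuracy(labels, predictions):.2f} %")
# prints: Accuracy: 60.00 %
```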
 


## Now train and evaluate the rest of the models

In [21]:
# Comment out Naive Bayes if your data still contains negative values
classifiers = [
    LogisticRegression(),
    OneVsRest(),
    LinearSVC(),
    RandomForestClassifier(),
    # GBTClassifier(),
    DecisionTreeClassifier(),
    MultilayerPerceptronClassifier(),
]

# train, test = final_data.randomSplit([0.7, 0.3])
features = final_data.select(["features"]).collect()
class_count = final_data.select(countDistinct("label")).collect()
classes = class_count[0][0]
folds = 2  # because we have limited data

# set up your results table
columns = ["Classifier", "Result"]
vals = [("Place Holder", "N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier, features, classes, train, test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
print("!!!!!Final Results!!!!!!!!")
results.show(100, False)

 
[1mLogisticRegression  Coefficient Matrix[0m
You should compares these relative to eachother
Coefficients: 
DenseMatrix([[-0.04324422,  5.09107828,  1.96197063, ..., -3.88036697,
              -2.33682972,  3.89486709],
             [-1.75747626, -1.22239717,  2.08489531, ...,  8.64556005,
              -9.85462322,  5.63376547],
             [-0.30973076,  0.32463026,  4.74931335, ..., -3.82603477,
               0.50805853, -3.85062082],
             ...,
             [ 0.07823949, -1.44398873, -0.31353642, ..., -5.85611098,
               1.75617804,  1.88442984],
             [ 0.5890845 , -1.3035361 , -3.01174128, ..., -4.21974003,
               1.94556759,  3.03720769],
             [ 0.90387302, -3.24416927,  1.92224375, ..., -2.63062344,
               3.44099242,  2.42861158]])
Intercept: [-2.09276276694162,-1.8700292993015926,-3.5741204465889176,8.227174694393415,-2.235879393241644,-14.105212970923226,-5.424535993001208,22.340403176340445,-5.552849368952514,-5.1173610164

# Note that for the first test case, the highest accuracy was around 45%, achieved by the Random Forest classifier

## Prepare The Data To Be Vectorized And Processed again
### Note that this time the treatments are set to **False**

In [20]:
# input_columns = df.columns # Collect the column names as a list
# input_columns = input_columns[8:] # keep only relevant columns: from column 8 until the end
input_columns = list(df.columns)[1:-1]
dependent_var = 'class'
final_data = MLClassifierDFPrep(df, input_columns, dependent_var, treat_outliers=False, treat_neg_values=False)
final_data.limit(5).toPandas()

 
 
You have opted not to correct that therefore you will not be able to use to Naive Bayes classifier
We will return the dataframe unscaled.


| | features | label |
|---|---|---|
| 0 | [0.117661826981, 0.08594057182, 3.14946566219,...] | 2.0 |
| 1 | [0.110922843634, 0.0348020291235, 2.9820995185...] | 10.0 |
| 2 | [0.145952813067, 0.0712348462318, 3.0627803637...] | 14.0 |
| 3 | [0.10710709396, 0.109676135011, 3.13823230705,...] | 9.0 |
| 4 | [0.0779188158691, 0.062301364374, 3.0450174995...] | 16.0 |


In [21]:
train,test = final_data.randomSplit([0.7,0.3])
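
Without a fixed seed, a random split is not reproducible from one run to the next, which makes it harder to compare the two test cases on equal footing. PySpark's `DataFrame.randomSplit` accepts an optional `seed` argument for this purpose. The idea can be sketched in plain Python; the helper name `seeded_split` and the stand-in rows are illustrative assumptions, not part of the notebook, and the exact 70/30 cut here differs from Spark's sampling-based split:

```python
import random


def seeded_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut it into train/test."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 23 genres x 100 songs described in the dataset section
rows = list(range(2300))

train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)

# The same seed gives the same partition on every call
print(train_a == train_b, len(train_a), len(test_a))
```

Reusing one seed for both test cases would mean that any change in accuracy comes from the treatments rather than from a different draw of test rows.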

## Test a baseline model (LR) first, again

In [22]:
# Set up our evaluation objects
Bin_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction') #labelCol='label'
# Bin_evaluator = BinaryClassificationEvaluator() #labelCol='label'
MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy")  # predictionCol="prediction",

# This is the most simplistic approach which does not use cross validation
# Let's go ahead and train a Logistic Regression Algorithm
classifier = LogisticRegression()
fitModel = classifier.fit(train)

# Evaluation method for binary classification problem
predictionAndLabels = fitModel.transform(test)
auc = Bin_evaluator.evaluate(predictionAndLabels)
print("AUC:",auc)

# Evaluation for a multiclass classification problem
# predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictionAndLabels))*100
print("Accuracy: {0:.2f}".format(accuracy),"%") #     print("Test Error = %g " % (1.0 - accuracy))
print(" ")

AUC: 0.6741970802919708
Accuracy: 41.28 %
 


For the baseline model, the accuracy is the same as in the first test case, but the AUC score is lower.
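
The `accuracy` metric is the fraction of test rows whose predicted label equals the true label. Since `BinaryClassificationEvaluator` is designed for two-class labels, accuracy is probably the more dependable of the two numbers for a 23-genre problem. Below is a minimal plain-Python sketch of the calculation, together with a per-class breakdown; the function names and the toy labels are made up for illustration and are not taken from the dataset:

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction matches the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def per_class_recall(labels, predictions):
    """For each true class, the share of its rows that were predicted correctly."""
    totals, hits = {}, {}
    for y, p in zip(labels, predictions):
        totals[y] = totals.get(y, 0) + 1
        if y == p:
            hits[y] = hits.get(y, 0) + 1
    return {c: hits.get(c, 0) / totals[c] for c in totals}


# Toy example with three classes (hypothetical values)
labels = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
predictions = [0.0, 1.0, 1.0, 1.0, 2.0, 2.0, 0.0, 2.0]

acc = accuracy(labels, predictions)
recalls = per_class_recall(labels, predictions)
print("Accuracy: {0:.2f} %".format(acc * 100))
print(recalls)
```

In PySpark, `MulticlassClassificationEvaluator` also accepts `metricName` values such as `"f1"`, `"weightedPrecision"`, and `"weightedRecall"`, which may give a fuller picture than accuracy alone.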

In [25]:
# Comment out Naive Bayes if your data still contains negative values
classifiers = [
    LogisticRegression(),
    OneVsRest(),
    LinearSVC(),
    RandomForestClassifier(),
    # GBTClassifier(),
    DecisionTreeClassifier(),
    MultilayerPerceptronClassifier(),
]

# train, test = final_data.randomSplit([0.7, 0.3])
features = final_data.select(["features"]).collect()
class_count = final_data.select(countDistinct("label")).collect()
classes = class_count[0][0]
folds = 2  # because we have limited data

# set up your results table
columns = ["Classifier", "Result"]
vals = [("Place Holder", "N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier, features, classes, train, test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
print("!!!!!Final Results!!!!!!!!")
results.show(100, False)

 
LogisticRegression Coefficient Matrix
You should compares these relative to eachother
Coefficients: 
DenseMatrix([[-3.16166366e-01,  1.40430109e+01,  4.14546844e+00, ...,
              -9.33264080e-03, -3.13514231e+00,  2.79929892e-02],
             [-5.39081819e+00,  3.13427452e+00,  2.97250720e+00, ...,
               1.89643515e-02, -2.38554190e+01,  2.50535024e-02],
             [-2.81268752e+00,  1.54651223e+00,  8.22340053e+00, ...,
              -1.08676013e-02,  3.32000899e+00, -1.45058488e-02],
             ...,
             [ 1.89855266e+00, -7.45752541e+00, -2.00355637e-01, ...,
              -1.22342031e-02,  3.87708583e+00,  2.20303731e-02],
             [ 2.49861808e+00, -1.74543778e+00, -2.67356917e+00, ...,
              -5.26101564e-03,  7.38976020e+00,  1.39746814e-02],
             [ 3.72949946e+00, -9.85845810e+00,  3.85649245e+00, ...,
              -5.48695647e-03,  8.07004500e+00,  2.16142925e-02]])
Intercept: [-2.8630402730459834,-23.159664116529843,-
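
The placeholder row in the results table can be avoided by collecting one `(classifier, score)` pair per model and building the table once at the end. The following plain-Python sketch shows that pattern; `evaluate_stub`, the model names, and the scores are hypothetical stand-ins for `ClassTrainEval`, which in the notebook returns a Spark DataFrame:

```python
def evaluate_stub(name):
    """Hypothetical stand-in for training and scoring one classifier."""
    made_up_scores = {"ModelA": 0.40, "ModelB": 0.45, "ModelC": 0.31}
    return name, made_up_scores[name]


# Collect one (name, score) pair per model; no placeholder row is needed
results = []
for name in ["ModelA", "ModelB", "ModelC"]:
    results.append(evaluate_stub(name))

# Sort from best to worst so the top performer comes first
results.sort(key=lambda row: row[1], reverse=True)
best_name, best_score = results[0]

print("Best:", best_name)
for name, score in results:
    print("{0:<10} {1:.2f}".format(name, score))
```

A list like this could then be passed to `spark.createDataFrame(results, ["Classifier", "Result"])` in a single call, which removes the need for the `where` filter afterwards.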

# To Conclude:
1. There is no class imbalance.
2. The model was tested on two test cases:
   - One with treatments set to True
   - One without treatments (set to False)
3. The best accuracy was nearly the same in both cases, around 45%, and was achieved by the Random Forest classifier.
4. I believe the first case is better, since it resulted in a better AUC score along with the same accuracy.
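
The class-balance claim in point 1 can be checked by counting rows per label. The sketch below uses synthetic labels shaped like the dataset description (23 genres with 100 songs each) rather than the real file, so it only illustrates the check; in the notebook the equivalent count would come from grouping `df` by the `class` column.

```python
from collections import Counter

# Synthetic labels: 23 genres x 100 songs, as stated in the dataset description
labels = [genre for genre in range(23) for _ in range(100)]

counts = Counter(labels)
smallest = min(counts.values())
largest = max(counts.values())

# A ratio of 1.0 means every class has the same number of rows
imbalance_ratio = largest / smallest
print(len(counts), smallest, largest, imbalance_ratio)
```

A ratio well above 1.0 would suggest that accuracy alone could be misleading and that per-class metrics deserve a closer look.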