# Classification in PySpark's MLlib Project Solution

### Genre classification
Now it's time to leverage what we learned in the lectures to a REAL classification project! Have you ever wondered what makes us, humans, able to tell apart two songs of different genres? How we do we inherenly know the difference between a pop song and heavy metal? This type of classifcation may seem easy for us, but it's a very difficult challenge for a computer to do. So the question is, could an automatic genre classifcation model be possible? 

For this project we will be classifying songs based on a number of characteristics into a set of 23 electronic genres. This technology could be used by an application like Pandora to recommend songs to users or just create meaningful channels. Super fun!

### Dataset
*beatsdataset.csv*
Each row is an electronic music song. The dataset contains 100 song for each genre among 23 electronic music genres, they were the top (100) songs of their genres on November 2016. The 71 columns are audio features extracted of a two random minutes sample of the file audio. These features have been extracted using pyAudioAnalysis (https://github.com/tyiannak/pyAudioAnalysis).

### Your task
Create an algorithm that classifies songs into the 23 genres provided. Test out several different models and select the highest performing one. Also play around with feature selection methods and finally try to make a recommendation to a user.  

For the feature selection aspect of this project, you may need to get a bit creative if you want to select features from a non-tree algorithm. I did not go over this aspect of PySpark intentionally in the previous lectures to give you chance to get used to researching the PySpark documentation page. Here is the link to the Feature Selectors section of the documentation that just might come in handy: https://spark.apache.org/docs/latest/ml-features.html#feature-selectors

Good luck! Have fun :)

### Source
https://www.kaggle.com/caparrini/beatsdataset

In [18]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ML Project').getOrCreate()

In [74]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, IndexToString
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression
from sklearn.metrics import classification_report

In [20]:
df = spark.read.csv('Datasets/beatsdataset.csv', inferSchema=True, header=True)
df.limit(5).toPandas()

Unnamed: 0,_c0,1-ZCRm,2-Energym,3-EnergyEntropym,4-SpectralCentroidm,5-SpectralSpreadm,6-SpectralEntropym,7-SpectralFluxm,8-SpectralRolloffm,9-MFCCs1m,...,63-ChromaVector8std,64-ChromaVector9std,65-ChromaVector10std,66-ChromaVector11std,67-ChromaVector12std,68-ChromaDeviationstd,69-BPM,70-BPMconf,71-BPMessentia,class
0,0,0.13644,0.088861,3.201201,0.262825,0.249212,1.114423,0.007003,0.256682,-22.723259,...,0.003431,0.004981,0.010818,0.024001,0.005201,0.015056,133.333333,0.132792,128.0,BigRoom
1,1,0.117039,0.108389,3.194001,0.247657,0.250288,1.065668,0.005387,0.199821,-21.775871,...,0.004461,0.006441,0.007469,0.015499,0.005589,0.019339,120.0,0.112767,126.0,BigRoom
2,2,0.085308,0.128525,3.123837,0.217205,0.228652,0.789647,0.008247,0.156822,-22.472722,...,0.001529,0.004556,0.007723,0.017482,0.002901,0.022201,133.333333,0.123373,129.0,BigRoom
3,3,0.10305,0.167042,3.15083,0.233593,0.245032,0.967082,0.006571,0.168083,-21.470751,...,0.001591,0.003514,0.009477,0.023162,0.004165,0.015379,133.333333,0.158876,129.0,BigRoom
4,4,0.15173,0.148405,3.194498,0.29373,0.267231,1.353005,0.003872,0.292055,-21.371157,...,0.003945,0.004131,0.01133,0.028188,0.002639,0.019079,133.333333,0.190708,129.0,BigRoom


In [21]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- 1-ZCRm: double (nullable = true)
 |-- 2-Energym: double (nullable = true)
 |-- 3-EnergyEntropym: double (nullable = true)
 |-- 4-SpectralCentroidm: double (nullable = true)
 |-- 5-SpectralSpreadm: double (nullable = true)
 |-- 6-SpectralEntropym: double (nullable = true)
 |-- 7-SpectralFluxm: double (nullable = true)
 |-- 8-SpectralRolloffm: double (nullable = true)
 |-- 9-MFCCs1m: double (nullable = true)
 |-- 10-MFCCs2m: double (nullable = true)
 |-- 11-MFCCs3m: double (nullable = true)
 |-- 12-MFCCs4m: double (nullable = true)
 |-- 13-MFCCs5m: double (nullable = true)
 |-- 14-MFCCs6m: double (nullable = true)
 |-- 15-MFCCs7m: double (nullable = true)
 |-- 16-MFCCs8m: double (nullable = true)
 |-- 17-MFCCs9m: double (nullable = true)
 |-- 18-MFCCs10m: double (nullable = true)
 |-- 19-MFCCs11m: double (nullable = true)
 |-- 20-MFCCs12m: double (nullable = true)
 |-- 21-MFCCs13m: double (nullable = true)
 |-- 22-ChromaVector1m: double (null

In [22]:
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).toPandas().T.sort_values(0, ascending=False)

Unnamed: 0,0
_c0,0
37-EnergyEntropystd,0
53-MFCCs11std,0
52-MFCCs10std,0
51-MFCCs9std,0
...,...
24-ChromaVector3m,0
23-ChromaVector2m,0
22-ChromaVector1m,0
21-MFCCs13m,0


In [23]:
df.select('class').groupBy('class').count().show()

+--------------------+-----+
|               class|count|
+--------------------+-----+
|           PsyTrance|  100|
|           HardDance|  100|
|              Breaks|  100|
|  HardcoreHardTechno|  100|
|   IndieDanceNuDisco|  100|
|              Trance|  100|
|           DeepHouse|  100|
|ElectronicaDowntempo|  100|
|           ReggaeDub|  100|
|             Minimal|  100|
|         DrumAndBass|  100|
|             Dubstep|  100|
|             BigRoom|  100|
|              Techno|  100|
|               House|  100|
|         FutureHouse|  100|
|        ElectroHouse|  100|
|           GlitchHop|  100|
|           TechHouse|  100|
|              HipHop|  100|
+--------------------+-----+
only showing top 20 rows



In [24]:
string_indexer = StringIndexer(inputCol='class', outputCol='class_num')
df = string_indexer.fit(df).transform(df)

In [26]:
df.select('class','class_num').groupBy('class','class_num').count().show()

+--------------------+---------+-----+
|               class|class_num|count|
+--------------------+---------+-----+
|           ReggaeDub|     19.0|  100|
|           FunkRAndB|      8.0|  100|
|              Trance|     22.0|  100|
|              Techno|     21.0|  100|
|             Dubstep|      5.0|  100|
|   IndieDanceNuDisco|     15.0|  100|
|ElectronicaDowntempo|      7.0|  100|
|               House|     14.0|  100|
|  HardcoreHardTechno|     12.0|  100|
|              HipHop|     13.0|  100|
|           GlitchHop|     10.0|  100|
|         DrumAndBass|      4.0|  100|
|           PsyTrance|     18.0|  100|
|    ProgressiveHouse|     17.0|  100|
|        ElectroHouse|      6.0|  100|
|           HardDance|     11.0|  100|
|             Minimal|     16.0|  100|
|           TechHouse|     20.0|  100|
|           DeepHouse|      3.0|  100|
|             BigRoom|      0.0|  100|
+--------------------+---------+-----+
only showing top 20 rows



In [64]:
df_train, df_test = df.randomSplit([0.8, 0.2], seed=42)

# Functions in the pipeline
evaluator = MulticlassClassificationEvaluator(labelCol='class_num', metricName='f1')
assembler = VectorAssembler(inputCols=[c for c in df.columns if c not in ['class','class_num']], outputCol='features')
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures', withStd=True, withMean=True)

# classifiers
rfc = RandomForestClassifier(featuresCol='scaledFeatures', labelCol='class_num')
lr = LogisticRegression(featuresCol='scaledFeatures', labelCol='class_num')

# Grid for each classifier
paramGrid_rfc = ParamGridBuilder().addGrid(rfc.numTrees, [50,100]).build()
paramGrid_lr = ParamGridBuilder().addGrid(lr.maxIter, [10]).addGrid(lr.regParam, [0.3]).build()

def test_classifiers(clf, paramGrid):
    # Definition of pipeline
    pipeline = Pipeline(stages=[assembler, scaler, clf])
    # Definition of the cross validator
    cv = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=paramGrid, 
        evaluator=evaluator, 
        numFolds=5)
    print('training classifier: {}\n'.format(clf))
    model = cv.fit(df_train)
    pred_ml_test = model.transform(df_test)
    pred_ml_test_pd = pred_ml_test.toPandas()
    print(classification_report(pred_ml_test_pd.prediction, pred_ml_test_pd.class_num))
    return model

In [65]:
# train / test all classifiers
clf_to_test = [rfc, lr]
grid_to_test = [paramGrid_rfc, paramGrid_lr]
models_list = [test_classifiers(models, grids) for models, grids in zip(clf_to_test, grid_to_test)]

training classifier: RandomForestClassifier_25924c0e9027

              precision    recall  f1-score   support

         0.0       1.00      0.81      0.89        21
         1.0       0.86      0.80      0.83        15
         2.0       0.56      0.90      0.69        10
         3.0       0.91      0.77      0.83        13
         4.0       1.00      0.73      0.84        22
         5.0       0.71      0.79      0.75        19
         6.0       0.71      1.00      0.83        12
         7.0       0.59      0.67      0.62        15
         8.0       0.82      0.56      0.67        25
         9.0       0.75      0.75      0.75        16
        10.0       0.83      0.95      0.88        20
        11.0       0.71      0.71      0.71        17
        12.0       0.96      0.79      0.86        28
        13.0       0.53      0.82      0.64        11
        14.0       0.44      0.78      0.56         9
        15.0       0.35      0.50      0.41        12
        16.0       0.60

In [90]:
result_test = models_list[0].transform(df_test)
result_test = result_test.withColumn('prediction_results', result_test.class_num - result_test.prediction)

res = udf(lambda x: True if x==0 else False)
result_test = result_test.withColumn('prediction_results', res(result_test.prediction_results))

result_test.select('class','class_num','prediction','prediction_results').orderBy(rand()).show(15)

+--------------------+---------+----------+------------------+
|               class|class_num|prediction|prediction_results|
+--------------------+---------+----------+------------------+
|           ReggaeDub|     19.0|      19.0|              true|
|           ReggaeDub|     19.0|      19.0|              true|
|           TechHouse|     20.0|      20.0|              true|
|  HardcoreHardTechno|     12.0|      12.0|              true|
|               House|     14.0|      14.0|              true|
|              Techno|     21.0|      21.0|              true|
|             BigRoom|      0.0|       0.0|              true|
|              Trance|     22.0|      22.0|              true|
|           DeepHouse|      3.0|       3.0|              true|
|              Breaks|      1.0|       1.0|              true|
|ElectronicaDowntempo|      7.0|       7.0|              true|
|         DrumAndBass|      4.0|       4.0|              true|
|           ReggaeDub|     19.0|      10.0|            