##### Classifying flower types with 3 kinds of classification models

[Dataset](http://scalableml.com/iris-bezdekIris.php)        
[Documentation](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html)

In [1]:
import pandas as pd 
import numpy
import matplotlib.pyplot as plt 
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import regexp_extract,lit,udf,col
from pyspark.sql.types import IntegerType
# Create the SparkSession (or reuse one that is already running)
spark = SparkSession \
    .builder \
    .appName("Pysparkexample") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sc = spark.sparkContext

In [2]:
df = spark.read.csv("./iris_bezdekIris.csv", inferSchema=True)\
.toDF("sep_len", "sep_wid", "pet_len", "pet_wid", "label")
df.show(5)

+-------+-------+-------+-------+-----------+
|sep_len|sep_wid|pet_len|pet_wid|      label|
+-------+-------+-------+-------+-----------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|
|    4.7|    3.2|    1.3|    0.2|Iris-setosa|
|    4.6|    3.1|    1.5|    0.2|Iris-setosa|
|    5.0|    3.6|    1.4|    0.2|Iris-setosa|
+-------+-------+-------+-------+-----------+
only showing top 5 rows



The 4 feature columns are combined into a single column with **VectorAssembler**, which takes the *input* columns and produces one *output* column holding a vector of all their values.

In [3]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

vector_assembler = VectorAssembler(inputCols=["sep_len", "sep_wid", "pet_len", "pet_wid"],\
outputCol="features")
df_temp = vector_assembler.transform(df)
df_temp.show(3)

+-------+-------+-------+-------+-----------+-----------------+
|sep_len|sep_wid|pet_len|pet_wid|      label|         features|
+-------+-------+-------+-------+-----------+-----------------+
|    5.1|    3.5|    1.4|    0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|    4.9|    3.0|    1.4|    0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
|    4.7|    3.2|    1.3|    0.2|Iris-setosa|[4.7,3.2,1.3,0.2]|
+-------+-------+-------+-------+-----------+-----------------+
only showing top 3 rows



The original feature columns are now redundant, so they are dropped.

In [4]:
df = df_temp.drop('sep_len', 'sep_wid', 'pet_len', 'pet_wid')
df.show(3)

+-----------+-----------------+
|      label|         features|
+-----------+-----------------+
|Iris-setosa|[5.1,3.5,1.4,0.2]|
|Iris-setosa|[4.9,3.0,1.4,0.2]|
|Iris-setosa|[4.7,3.2,1.3,0.2]|
+-----------+-----------------+
only showing top 3 rows



Next, the text column *label* is converted to a numeric one with **StringIndexer**.

In [5]:
from pyspark.ml.feature import StringIndexer

l_indexer = StringIndexer(inputCol="label", outputCol="labelIndex")
df = l_indexer.fit(df).transform(df)
df.show(3)

+-----------+-----------------+----------+
|      label|         features|labelIndex|
+-----------+-----------------+----------+
|Iris-setosa|[5.1,3.5,1.4,0.2]|       0.0|
|Iris-setosa|[4.9,3.0,1.4,0.2]|       0.0|
|Iris-setosa|[4.7,3.2,1.3,0.2]|       0.0|
+-----------+-----------------+----------+
only showing top 3 rows
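To make the index assignment less opaque, here is a minimal pure-Python sketch of the idea. By default, StringIndexer orders labels by descending frequency, so the most frequent label gets index 0.0. The helper `index_labels` and the alphabetical tie-breaking rule are assumptions of this sketch, not Spark's code. The alphabetical tie-break is at least consistent with the outputs in this notebook, where Iris-setosa maps to 0.0 and Iris-versicolor to 1.0.

```python
from collections import Counter


def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(labels)
    # Sort by descending count, then by label text for equal counts.
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    mapping = {lab: float(i) for i, lab in enumerate(ordered)}
    return mapping, [mapping[lab] for lab in labels]


labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]
mapping, indexed = index_labels(labels)
print(mapping)
print(indexed)
```

The mapping is learned from the data at `fit` time, so the same fitted indexer has to be reused on any new data for the indices to stay comparable.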



Once preprocessed, the data is split into training (*train*) and test (*test*) sets, on which the different classification models are then fitted and evaluated.

In [6]:
(trainingData, testData) = df.randomSplit([0.7, 0.3])
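Note that `randomSplit` assigns rows to splits by random sampling, so the 70/30 proportions are only approximate. The errors reported further down (0.0227273 = 1/44 and 0.0454545 = 2/44) point to a 44-row test set. `randomSplit` also accepts an optional `seed` argument; without it, every run yields a different split and therefore different metrics. The pure-Python sketch below illustrates both points. The helper `bernoulli_split` is hypothetical and only mimics the idea, not Spark's implementation.

```python
import random


def bernoulli_split(rows, weights, seed):
    """Assign each row independently to one split, with probability
    proportional to that split's weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds of each split on the [0, 1) interval.
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point leftovers.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(150))
train, test = bernoulli_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # close to 105 and 45, but not guaranteed
```

Fixing the seed makes the sketch repeatable: calling `bernoulli_split` again with `seed=42` returns the same two lists.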

Three algorithms are implemented below:
- Decision tree classifier
- Random forest classifier
- Naive Bayes

______________________________________________________________________________________________________________

### Decision tree classifier
**DecisionTreeClassifier**(<span style="color:green;">featuresCol='features'</span>, <span style="color:green;">labelCol='label'</span>, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity='gini', <span style="color:darkolivegreen;">seed=None</span>)    

The model is **fit** on the **training** data, and **predictions** are then made on the **test** data.

In [7]:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

dt = DecisionTreeClassifier(labelCol="labelIndex", featuresCol="features")
model = dt.fit(trainingData)
predictions = model.transform(testData)

In [8]:
predictions.select("prediction", "labelIndex").show(5)

+----------+----------+
|prediction|labelIndex|
+----------+----------+
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
+----------+----------+
only showing top 5 rows



In [12]:
evaluator = MulticlassClassificationEvaluator(\
labelCol="labelIndex", predictionCol="prediction",\
metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
print("Test Accuracy = %g " % (accuracy*100)+'%')
evaluatorf1 = MulticlassClassificationEvaluator(labelCol="labelIndex", predictionCol="prediction", metricName="f1")
f1 = evaluatorf1.evaluate(predictions)
print("f1 = %g" % f1)
 
evaluatorwp = MulticlassClassificationEvaluator(labelCol="labelIndex", predictionCol="prediction", metricName="weightedPrecision")
wp = evaluatorwp.evaluate(predictions)
print("weightedPrecision = %g" % wp)
 
evaluatorwr = MulticlassClassificationEvaluator(labelCol="labelIndex", predictionCol="prediction", metricName="weightedRecall")
wr = evaluatorwr.evaluate(predictions)
print("weightedRecall = %g" % wr)
print(model)

Test Error = 0.0227273 
Test Accuracy = 97.7273 %
f1 = 0.977206
weightedPrecision = 0.978535
weightedRecall = 0.977273
DecisionTreeClassificationModel (uid=DecisionTreeClassifier_ea78cb93e6b0) of depth 5 with 15 nodes
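To clarify what the four metrics measure, the sketch below computes them in plain Python from a list of true labels and a list of predictions. The function `weighted_metrics` and the toy labels are hypothetical; the formulas are the usual support-weighted averages, which is how the `weightedPrecision`, `weightedRecall` and `f1` metrics are understood here. One consequence is that weighted recall always equals accuracy, which matches the identical values (0.977273) printed above.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)      # true rows per class
    predicted = Counter(y_pred)    # predicted rows per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    result = {"accuracy": sum(correct.values()) / n,
              "weightedPrecision": 0.0, "weightedRecall": 0.0, "f1": 0.0}
    for label, count in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = count / n         # share of rows that truly belong to this class
        result["weightedPrecision"] += weight * precision
        result["weightedRecall"] += weight * recall
        result["f1"] += weight * f1
    return result


# Toy labels (not taken from the notebook): one class-1 row is predicted as class 2.
y_true = [0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 2.0]
y_pred = [0.0, 0.0, 0.0, 1.0, 2.0, 2.0, 2.0, 2.0]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```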


### Random forest classifier

**RandomForestClassifier**(<span style="color:green;">featuresCol='features'</span>, <span style="color:green;">labelCol='label'</span>, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity='gini', <span style="color:green;">numTrees=20</span>, featureSubsetStrategy='auto', <span style="color:darkolivegreen;">seed=None</span>, subsamplingRate=1.0)   

The model is **fit** on the **training** data, and **predictions** are then made on the **test** data.

In [20]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="labelIndex",\
featuresCol="features", numTrees=10)
model = rf.fit(trainingData)
predictions = model.transform(testData)

In [21]:
predictions.select("prediction", "labelIndex").show(5)

+----------+----------+
|prediction|labelIndex|
+----------+----------+
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
+----------+----------+
only showing top 5 rows



In [22]:
evaluator =\
MulticlassClassificationEvaluator(labelCol="labelIndex",\
predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
print("Test Accuracy = %g " % (accuracy*100)+'%')
evaluatorf1 = MulticlassClassificationEvaluator(labelCol="labelIndex", predictionCol="prediction", metricName="f1")
f1 = evaluatorf1.evaluate(predictions)
print("f1 = %g" % f1)
 
evaluatorwp = MulticlassClassificationEvaluator(labelCol="labelIndex", predictionCol="prediction", metricName="weightedPrecision")
wp = evaluatorwp.evaluate(predictions)
print("weightedPrecision = %g" % wp)
 
evaluatorwr = MulticlassClassificationEvaluator(labelCol="labelIndex", predictionCol="prediction", metricName="weightedRecall")
wr = evaluatorwr.evaluate(predictions)
print("weightedRecall = %g" % wr)
print(model)

Test Error = 0.0227273
Test Accuracy = 97.7273 %
f1 = 0.977206
weightedPrecision = 0.978535
weightedRecall = 0.977273
RandomForestClassificationModel (uid=RandomForestClassifier_3b72d38ca990) with 10 trees


### Naive Bayes classifier

**NaiveBayes**(<span style="color:green;">featuresCol='features'</span>, <span style="color:darkolivegreen;">labelCol='label'</span>, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', <span style="color:green;">smoothing=1.0</span>, <span style="color:darkolivegreen;">modelType='multinomial'</span>, thresholds=None, weightCol=None)   

Naive Bayes supports both the multinomial and the Bernoulli model types.
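As a rough illustration of what `modelType='multinomial'` and `smoothing=1.0` mean, here is a textbook multinomial Naive Bayes with additive smoothing in plain Python. It is a sketch, not Spark's implementation: the functions `fit_multinomial_nb` and `predict_nb` are hypothetical, and the rows for class 1.0 are made-up toy values. The multinomial model expects non-negative feature values, which the Iris measurements satisfy.

```python
import math


def fit_multinomial_nb(X, y, smoothing=1.0):
    """Estimate smoothed log-priors and per-class log feature weights."""
    labels = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_theta = {}, {}
    for c in labels:
        rows = [x for x, lab in zip(X, y) if lab == c]
        log_prior[c] = math.log(
            (len(rows) + smoothing) / (len(y) + smoothing * len(labels)))
        # Total "mass" of each feature within the class, plus smoothing.
        sums = [sum(r[j] for r in rows) for j in range(n_features)]
        total = sum(sums) + smoothing * n_features
        log_theta[c] = [math.log((s + smoothing) / total) for s in sums]
    return log_prior, log_theta


def predict_nb(model, x):
    """Return the class with the highest log-posterior score."""
    log_prior, log_theta = model
    scores = {c: log_prior[c] + sum(xj * t for xj, t in zip(x, log_theta[c]))
              for c in log_prior}
    return max(scores, key=scores.get)


X = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2],   # class 0.0 rows shown earlier
     [6.3, 3.3, 6.0, 2.5], [5.8, 2.7, 5.1, 1.9]]   # class 1.0: toy values
y = [0.0, 0.0, 1.0, 1.0]
nb_model = fit_multinomial_nb(X, y, smoothing=1.0)
print(predict_nb(nb_model, [5.0, 3.6, 1.4, 0.2]))
print(predict_nb(nb_model, [6.0, 3.0, 5.5, 2.0]))
```

Smoothing keeps every feature weight strictly positive, so a feature that sums to zero within a class cannot drive that class's score to minus infinity.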

Again, the model is **fit** on the **training** data, and **predictions** are then made on the **test** data.

In [17]:
from pyspark.ml.classification import NaiveBayes

nb = NaiveBayes(labelCol="labelIndex",\
featuresCol="features", smoothing=1.0,\
modelType="multinomial")
model = nb.fit(trainingData)
predictions = model.transform(testData)

In [18]:
predictions.select("label", "labelIndex",
"probability", "prediction").show()

+---------------+----------+--------------------+----------+
|          label|labelIndex|         probability|prediction|
+---------------+----------+--------------------+----------+
|    Iris-setosa|       0.0|[0.67589053383063...|       0.0|
|    Iris-setosa|       0.0|[0.58129776152275...|       0.0|
|    Iris-setosa|       0.0|[0.69413503465606...|       0.0|
|    Iris-setosa|       0.0|[0.72409196648375...|       0.0|
|    Iris-setosa|       0.0|[0.67528678942080...|       0.0|
|    Iris-setosa|       0.0|[0.71015112065611...|       0.0|
|    Iris-setosa|       0.0|[0.68156376940335...|       0.0|
|    Iris-setosa|       0.0|[0.62640031158086...|       0.0|
|    Iris-setosa|       0.0|[0.78268469801517...|       0.0|
|    Iris-setosa|       0.0|[0.73240810290220...|       0.0|
|    Iris-setosa|       0.0|[0.79316053353830...|       0.0|
|    Iris-setosa|       0.0|[0.73833522522100...|       0.0|
|Iris-versicolor|       1.0|[0.11308178808401...|       1.0|
|Iris-versicolor|       

In [19]:
evaluator =\
MulticlassClassificationEvaluator(labelCol="labelIndex",\
predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print('Test Error = '  + str(1-accuracy))
print("Test Accuracy = %g " % (accuracy*100)+'%')
evaluatorf1 = MulticlassClassificationEvaluator(labelCol="labelIndex", predictionCol="prediction", metricName="f1")
f1 = evaluatorf1.evaluate(predictions)
print("f1 = %g" % f1)
 
evaluatorwp = MulticlassClassificationEvaluator(labelCol="labelIndex", predictionCol="prediction", metricName="weightedPrecision")
wp = evaluatorwp.evaluate(predictions)
print("weightedPrecision = %g" % wp)
 
evaluatorwr = MulticlassClassificationEvaluator(labelCol="labelIndex", predictionCol="prediction", metricName="weightedRecall")
wr = evaluatorwr.evaluate(predictions)
print("weightedRecall = %g" % wr)
print(model)

Test Error = 0.045454545454545414
Test Accuracy = 95.4545 %
f1 = 0.954545
weightedPrecision = 0.959893
weightedRecall = 0.954545
NaiveBayes_e184d56042cc


### Conclusion

**MulticlassClassificationEvaluator**(predictionCol='prediction', labelCol='label', metricName='f1')   
*metricName:* f1, weightedPrecision, weightedRecall and accuracy

<u>Accuracy of each model:</u>

- Decision Tree = <span style="color:green;">97.7 %</span>
- Random Forest = <span style="color:green;">97.7 %</span>
- Naive Bayes = <span style="color:red;">95.5 %</span>

<u>Weighted precision:</u>

- Decision Tree = <span style="color:green;">97.85 %</span>
- Random Forest = <span style="color:green;">97.85 %</span>
- Naive Bayes = <span style="color:red;">95.99 %</span>

<u>Weighted recall:</u>

- Decision Tree = <span style="color:green;">97.73 %</span>
- Random Forest = <span style="color:green;">97.73 %</span>
- Naive Bayes = <span style="color:red;">95.45 %</span>