## Classifying iris species based on their features

This notebook classifies iris species using the Naive Bayes algorithm from Spark MLlib.

## Dataset structure

This dataset describes three iris species, with 50 instances each. Each instance has 4 features:

- Sepal Length in cm
- Sepal Width in cm
- Petal Length in cm
- Petal Width in cm


Additionally, each instance is labeled with its species.

*[The dataset has been taken from the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris)*

### Importing all the necessary libraries

In [52]:
import findspark
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

### Initializing the Spark session

In [53]:
findspark.init()
spark = SparkSession.builder.getOrCreate()

### Loading the dataset

The dataset is read from a CSV file, and a fragment of it is printed.

In [54]:
irisDS = spark.read.format("csv").option("header", "true").option("delimiter", ",").option("quote", "").option("inferSchema", "true").load("/home/jovyan/work/iris.csv")
irisDS.show()

+-----------+----------+-----------+----------+-----------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|    Species|
+-----------+----------+-----------+----------+-----------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|
|        4.7|       3.2|        1.3|       0.2|Iris-setosa|
|        4.6|       3.1|        1.5|       0.2|Iris-setosa|
|        5.0|       3.6|        1.4|       0.2|Iris-setosa|
|        5.4|       3.9|        1.7|       0.4|Iris-setosa|
|        4.6|       3.4|        1.4|       0.3|Iris-setosa|
|        5.0|       3.4|        1.5|       0.2|Iris-setosa|
|        4.4|       2.9|        1.4|       0.2|Iris-setosa|
|        4.9|       3.1|        1.5|       0.1|Iris-setosa|
|        5.4|       3.7|        1.5|       0.2|Iris-setosa|
|        4.8|       3.4|        1.6|       0.2|Iris-setosa|
|        4.8|       3.0|        1.4|       0.1|Iris-setosa|
|        4.3|       3.0|        1.1|    

### Checking if Spark has correctly recognized all the columns

All columns except *Species* should contain numerical data.

In [55]:
irisDS.printSchema()

root
 |-- SepalLength: double (nullable = true)
 |-- SepalWidth: double (nullable = true)
 |-- PetalLength: double (nullable = true)
 |-- PetalWidth: double (nullable = true)
 |-- Species: string (nullable = true)



### Creating a StringIndexer and VectorAssembler

These objects prepare the dataset for the classifier, which expects two columns: *features* and *label*. The *features* column contains a feature vector, and the *label* column contains the class number. The *StringIndexer* transforms species names into numerical values. The *VectorAssembler* groups the four feature columns into a single vector.

In [56]:
labelIndexer = StringIndexer(inputCol="Species", outputCol="label")

vecAssembler = VectorAssembler(inputCols=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"], outputCol="features")
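
To make the two transformations concrete, here is a minimal plain-Python sketch of what they do conceptually. It is not Spark's implementation: the helpers `index_labels` and `assemble` are hypothetical names, the labels `"a"`, `"b"`, `"c"` are placeholders, and the ordering rule (most frequent label first, ties broken alphabetically) is an assumption about *StringIndexer*'s default behaviour.

```python
from collections import Counter


def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically; this mirrors StringIndexer's default
    ordering as an assumption, not a guarantee.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return {name: float(i) for i, name in enumerate(ordered)}


def assemble(row, input_cols):
    """Collect the chosen columns of one row into a single feature list."""
    return [row[col] for col in input_cols]


# First row of the dataset printed earlier in the notebook.
row = {"SepalLength": 5.1, "SepalWidth": 3.5, "PetalLength": 1.4,
       "PetalWidth": 0.2, "Species": "Iris-setosa"}
feature_cols = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]

features = assemble(row, feature_cols)          # one vector per instance
mapping = index_labels(["b", "a", "a", "c"])    # placeholder labels
```

In the real pipeline these steps run on whole DataFrame columns, and the resulting *label* and *features* columns are what the classifier consumes.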

### Splitting the dataset into train and test datasets

The dataset is randomly split into two new sets. The *train* dataset receives roughly 70% of the original instances and is used to build the classification model. The *test* dataset receives the remaining instances, roughly 30%, and is used to evaluate the model. Because `randomSplit` treats the weights as sampling probabilities, the resulting sizes are approximate. After the split, the number of instances in each set is printed.

In [57]:
(trainingData, testData) = irisDS.randomSplit([0.7, 0.3], seed=100)

print(f"Amount of instances in the train dataset: {trainingData.count()}")
print(f"Amount of instances in the test dataset: {testData.count()}")

Amount of instances in the train dataset: 104
Amount of instances in the test dataset: 46
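
The sizes are 104 and 46 rather than exactly 105 and 45 because each row is assigned to a split at random. The following plain-Python sketch illustrates that idea under a simplifying assumption (one uniform draw per row, compared against cumulative weights). The helper `random_split` is hypothetical, it is not Spark's sampler, and it will not reproduce the partition above even with the same seed.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds of each split on [0, 1), roughly [0.7, 1.0] here.
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point leftover.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(150))  # stand-ins for the 150 iris instances
train, test = random_split(rows, [0.7, 0.3], seed=100)
print(len(train), len(test))  # sizes vary around 105 and 45
```

Every row ends up in exactly one split, but the split sizes fluctuate from seed to seed.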


### Creating and training a multiclass classifier with the use of Naive Bayes algorithm

The classifier tries to predict the category to which a given instance belongs, based on the instance's features. **It's important to remember that this algorithm *assumes* independence between features. Additionally, this algorithm requires features to be non-negative**. An MLlib pipeline chains the indexer, the assembler, and the classifier, and is fitted to the training dataset.

In [58]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

pipeline = Pipeline(stages=[labelIndexer, vecAssembler, nb])

model = pipeline.fit(trainingData)
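
For intuition about what `modelType="multinomial"` with `smoothing=1.0` means, below is a minimal plain-Python sketch of textbook multinomial Naive Bayes with additive smoothing. It is an illustration, not Spark's code: the functions `fit_multinomial_nb` and `predict` are hypothetical, and the toy data is made up rather than taken from the iris dataset.

```python
import math
from collections import defaultdict


def fit_multinomial_nb(X, y, smoothing=1.0):
    """Estimate a log prior and log feature weights for each class."""
    n_features = len(X[0])
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    for features, label in zip(X, y):
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_sums[label][i] += value

    n_classes = len(class_counts)
    model = {}
    for label, count in class_counts.items():
        log_prior = math.log((count + smoothing) / (len(y) + smoothing * n_classes))
        # Smoothing keeps every weight strictly positive, so the log is defined.
        total = sum(feature_sums[label]) + smoothing * n_features
        log_theta = [math.log((s + smoothing) / total) for s in feature_sums[label]]
        model[label] = (log_prior, log_theta)
    return model


def predict(model, features):
    """Return the class with the highest log joint score."""
    def score(label):
        log_prior, log_theta = model[label]
        # Independence assumption: per-feature contributions simply add up.
        return log_prior + sum(x * w for x, w in zip(features, log_theta))
    return max(model, key=score)


# Made-up toy data (not iris measurements): class 0.0 is heavy in feature 0,
# class 1.0 is heavy in feature 1.
X = [[5.0, 1.0], [4.0, 1.0], [1.0, 4.0], [1.0, 5.0]]
y = [0.0, 0.0, 1.0, 1.0]
toy_model = fit_multinomial_nb(X, y, smoothing=1.0)
```

The sketch also shows why negative features are a problem for this variant: feature values are summed like counts, and a negative sum could make the argument of the logarithm non-positive.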

### Running the classification model on the test dataset

Based on the knowledge gained during training, the model predicts the category to which each instance from the test dataset belongs.

In [59]:
predictions = model.transform(testData)

### Printing the resulting dataset schema



In [60]:
predictions.printSchema()

root
 |-- SepalLength: double (nullable = true)
 |-- SepalWidth: double (nullable = true)
 |-- PetalLength: double (nullable = true)
 |-- PetalWidth: double (nullable = true)
 |-- Species: string (nullable = true)
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



### Presenting predictions

The *label* column holds the actual class number of each instance, and the *features* column holds its feature vector. The *rawPrediction* column holds a vector of *raw* prediction values, one per class. The *probability* column holds a vector with the probability of belonging to each class. The *prediction* column holds the predicted class of each instance.
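
The relationship between the *probability* and *prediction* columns can be checked by hand. The short sketch below, in plain Python, takes the first probability vector printed in this section's output and picks the index of its largest entry.

```python
# First probability vector from the predictions output in this section.
probability = [0.20253050754778992, 0.11205481371745843, 0.6854146787347515]

# The predicted class is the index of the largest probability.
prediction = float(max(range(len(probability)), key=probability.__getitem__))

print(prediction)                   # 2.0
print(round(sum(probability), 6))   # 1.0
```

The result, 2.0, matches the *prediction* value in that row, and the three probabilities sum to one.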

In [61]:
predictions.select("label", "prediction", "probability").show(50, truncate=False)

+-----+----------+-------------------------------------------------------------+
|label|prediction|probability                                                  |
+-----+----------+-------------------------------------------------------------+
|2.0  |2.0       |[0.20253050754778992,0.11205481371745843,0.6854146787347515] |
|2.0  |2.0       |[0.22610513002031704,0.12516423426180695,0.6487306357178761] |
|2.0  |2.0       |[0.20155443460425587,0.10855263459197305,0.6898929308037711] |
|2.0  |2.0       |[0.20855321122455778,0.11231546581200702,0.6791313229634351] |
|2.0  |2.0       |[0.2398081472107575,0.13230333720250126,0.6278885155867413]  |
|2.0  |2.0       |[0.2018436957183859,0.10766518303273141,0.6904911212488827]  |
|2.0  |2.0       |[0.2018436957183859,0.10766518303273141,0.6904911212488827]  |
|2.0  |2.0       |[0.17954001348424262,0.0939934849813415,0.7264665015344159]  |
|2.0  |2.0       |[0.18366447593700436,0.09631770152992547,0.7200178225330701] |
|2.0  |2.0       |[0.1917821

### Evaluation

A *MulticlassClassificationEvaluator* from MLlib is used to compute the accuracy of the model on the test predictions.

In [62]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Model's accuracy is around {round(accuracy * 100, 2)}%")

Model's accuracy is around 91.3%
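
Accuracy is the fraction of instances whose predicted class equals the actual class. With 46 test instances, the reported 91.3% is consistent with 42 correct predictions (42 / 46 ≈ 0.913). That count is inferred from the rounded figure, not read from the notebook output. The sketch below uses a hypothetical `accuracy` helper and illustrative lists to show the arithmetic.

```python
def accuracy(labels, predictions):
    """Fraction of instances whose predicted class equals the true class."""
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)


# Illustrative lists only: 46 instances with 4 mismatches, chosen to match
# the reported figure; the real per-row predictions are not reproduced here.
labels = [2.0] * 46
predictions = [2.0] * 42 + [1.0] * 4

print(round(accuracy(labels, predictions) * 100, 2))  # 91.3
```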


### Closing the Spark session

In [63]:
spark.stop()