<a href="https://colab.research.google.com/github/RealElvince/iris-flower-classification-using-apache-spark-mllib/blob/main/iris_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Upload Iris Dataset

In [1]:
from google.colab import files

def upload_files():
  # Open Colab's file picker and return the uploaded files as a {filename: bytes} dict
  return files.upload()

uploaded = upload_files()

Saving Iris.csv to Iris.csv


### Import and Start Spark

In [4]:
from pyspark.sql import SparkSession
spark = SparkSession\
        .builder\
        .appName("IrisClassification")\
        .getOrCreate()

#### Load Dataset

In [6]:
iris = spark.read.csv("/content/Iris.csv", header=True, inferSchema=True)
iris.show(5, truncate=False)

+---+-------------+------------+-------------+------------+-----------+
|Id |SepalLengthCm|SepalWidthCm|PetalLengthCm|PetalWidthCm|Species    |
+---+-------------+------------+-------------+------------+-----------+
|1  |5.1          |3.5         |1.4          |0.2         |Iris-setosa|
|2  |4.9          |3.0         |1.4          |0.2         |Iris-setosa|
|3  |4.7          |3.2         |1.3          |0.2         |Iris-setosa|
|4  |4.6          |3.1         |1.5          |0.2         |Iris-setosa|
|5  |5.0          |3.6         |1.4          |0.2         |Iris-setosa|
+---+-------------+------------+-------------+------------+-----------+
only showing top 5 rows



In [7]:
iris.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- SepalLengthCm: double (nullable = true)
 |-- SepalWidthCm: double (nullable = true)
 |-- PetalLengthCm: double (nullable = true)
 |-- PetalWidthCm: double (nullable = true)
 |-- Species: string (nullable = true)



### Exploratory Data Analysis

#### Check Null Values

In [9]:
from pyspark.sql.functions import col, sum as _sum

# Count nulls per column: cast each isNull() flag to 0/1 and sum it
iris.select([
    _sum(col(c).isNull().cast("int")).alias(c)
    for c in iris.columns
]).show()

+---+-------------+------------+-------------+------------+-------+
| Id|SepalLengthCm|SepalWidthCm|PetalLengthCm|PetalWidthCm|Species|
+---+-------------+------------+-------------+------------+-------+
|  0|            0|           0|            0|           0|      0|
+---+-------------+------------+-------------+------------+-------+



#### Summary Statistics

In [11]:
iris.describe(["SepalLengthCm","SepalWidthCm","PetalLengthCm","PetalWidthCm"]).show()

+-------+------------------+-------------------+------------------+------------------+
|summary|     SepalLengthCm|       SepalWidthCm|     PetalLengthCm|      PetalWidthCm|
+-------+------------------+-------------------+------------------+------------------+
|  count|               150|                150|               150|               150|
|   mean| 5.843333333333335| 3.0540000000000007|3.7586666666666693|1.1986666666666672|
| stddev|0.8280661279778637|0.43359431136217375| 1.764420419952262|0.7631607417008414|
|    min|               4.3|                2.0|               1.0|               0.1|
|    max|               7.9|                4.4|               6.9|               2.5|
+-------+------------------+-------------------+------------------+------------------+



#### Label (Species) Distribution

In [13]:
iris.groupBy("Species").count().show()

+---------------+-----+
|        Species|count|
+---------------+-----+
| Iris-virginica|   50|
|    Iris-setosa|   50|
|Iris-versicolor|   50|
+---------------+-----+



### Feature Engineering

#### Encode Species

In [15]:
from pyspark.ml.feature import StringIndexer

# Estimator that maps each Species string to a numeric label index; it must be fit before use
species_indexer = StringIndexer(inputCol="Species", outputCol="label")
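
The indexer is only configured at this point; it assigns indices once it is fit on the data. As a rough illustration of the mapping it is documented to produce by default (`stringOrderType="frequencyDesc"`, with ties broken alphabetically in Spark 3.0 and later), here is a plain-Python sketch that does not use Spark. The helper name `index_labels` is hypothetical, and the counts are taken from the species distribution shown earlier.

```python
from collections import Counter

def index_labels(values):
    """Map each distinct label to an index: most frequent first, ties alphabetical.

    This mimics StringIndexer's default ordering under the assumption stated
    above; it is an illustration, not Spark's implementation.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # StringIndexer writes its indices as doubles, so floats are used here too
    return {label: float(i) for i, label in enumerate(ordered)}

# 50 rows per species, matching the label distribution above
species = ["Iris-virginica"] * 50 + ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50
mapping = index_labels(species)
print(mapping)
```

With 50 rows per species, every class ties on frequency, so under this assumption the order falls back to alphabetical. Tie order may differ on older Spark versions, so inspecting the fitted model's `labels` attribute is the reliable way to confirm the actual mapping.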

#### Assemble Features

In [16]:
from pyspark.ml.feature import VectorAssembler

# Combine the four measurement columns into a single "features" vector column
feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
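
Conceptually, `VectorAssembler` concatenates the listed input columns, in order, into one vector per row. The plain-Python sketch below mimics that for the first row of the dataset. `assemble_row` is a hypothetical helper for illustration, not part of Spark, and a plain list stands in for Spark's vector type.

```python
feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

def assemble_row(row, input_cols):
    """Collect the values of input_cols, in that order, into one feature list."""
    return [float(row[c]) for c in input_cols]

# First row of Iris.csv as displayed by iris.show(5) above
first_row = {
    "Id": 1,
    "SepalLengthCm": 5.1,
    "SepalWidthCm": 3.5,
    "PetalLengthCm": 1.4,
    "PetalWidthCm": 0.2,
    "Species": "Iris-setosa",
}

features = assemble_row(first_row, feature_cols)
print(features)  # [5.1, 3.5, 1.4, 0.2]
```

`Id` and `Species` stay out of the vector because they are not listed in `feature_cols`. `Id` is only a row number, and `Species` is the target that the indexer encodes separately.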