## Introduction

Most machine learning tasks fall into one of two approaches: **supervised** learning, where the dataset includes a label (**the column to be predicted**), and **unsupervised** learning, where there is no label column and we have to form meaningful groups from the data under some criterion (**for example, choosing a suitable K and assigning each data point to its nearest centroid**).

In this article we use an unsupervised method, specifically KMeans, to **divide wheat seeds into clusters**. **We have the features of every wheat seed but do not know which category each one belongs to, so a clustering technique can help us segregate them**.



## About the dataset

Before working on any problem statement it is essential to **understand the background and source of the dataset** so that we can trust its authenticity. This dataset covers three varieties of wheat: **Canadian, Kama and Rosa**. For the experiment, **70 kernels** were selected from each variety, which gives the 210 records we will see later.


Image quality is a key factor in the accuracy of such an experiment. Here the kernels were captured with **high quality visualization** using a **soft X-ray technique**, and the images were recorded on **X-ray KODAK plates**.

**For more details about this dataset, please visit this [link](https://archive.ics.uci.edu/ml/datasets/seeds)**.




This dataset is a good example because it can be used for **clustering** as well as **classification**: we can either group similar wheat seeds together or predict which variety a given seed belongs to.


**Features Information:**

Each kernel is described by **7 geometric measurements**. They are as follows:

1. **Area:** Denoted by A, the area of the wheat kernel.
2. **Perimeter:** Denoted by P, the perimeter of the kernel.
3. **Compactness:** Denoted by C and calculated as `C = 4*pi*A/P^2`.
4. **Length:** Length of the kernel.
5. **Width:** Width of the kernel.
6. **Asymmetry coefficient:** The asymmetry coefficient of the kernel.
7. **Length of kernel groove:** Length of the kernel groove.
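
The compactness formula can be checked with a few lines of plain Python. This is only a small sketch: the helper name `compactness` is made up for illustration, and the area and perimeter values are those of the first kernel shown in the dataset output later in this article.

```python
import math


def compactness(area: float, perimeter: float) -> float:
    """Compactness C = 4*pi*A / P^2 (equals 1.0 for a perfect circle)."""
    return 4 * math.pi * area / perimeter ** 2


# Area and perimeter of the first kernel in the dataset output.
c = compactness(15.26, 14.84)
print(round(c, 3))  # 0.871, matching the compactness column of that row
```

Because the value is 1.0 for a circle and smaller for elongated shapes, compactness is a scale-free measure of how round the kernel outline is.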


Now our main goal is to cluster the wheat seeds into 3 groups using K-means clustering.

In [2]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=2dccc2528000832f72221457130cea869f1c9ebd3d5f5ada885b457e3ac9cb5b
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('wheat_seed').getOrCreate()
spark

**Inference:** Readers following my PySpark series will already be familiar with this mandatory step, which sets up the Spark environment through its PySpark distribution.

Here we named the session **wheat_seed** and created it using **builder** and the **getOrCreate()** method.

In [4]:
from pyspark.ml.clustering import KMeans

**Inference:** It is good practice to import libraries up front so that everything we need is available when we reach the modelling step.

Here we import the KMeans algorithm from the clustering module of PySpark's MLlib. It takes a **features** column as **input** and **returns** the **predictions**, i.e. a cluster tag for each row.

KMeans is not the only option in the clustering module; it also offers **LDA, Bisecting KMeans, Gaussian Mixture Model, and Power Iteration Clustering**.

## Reading the dataset

Let's read the wheat seed dataset, which we have in **CSV format**. Before reading it, let's recall a few major points about this dataset.

1. It has a total of **7 features**, or in other words **7 measurements of wheat kernels**.
2. We already know that the dataset contains **3 types of seeds**, so clustering only needs to assign each seed a tag.

In [5]:
dataset = spark.read.csv("seeds_dataset.csv",header=True,inferSchema=True)
dataset.show()

+-----+---------+-----------+------------------+------------------+---------------------+------------------+
| area|perimeter|compactness|  length_of_kernel|   width_of_kernel|asymmetry_coefficient|  length_of_groove|
+-----+---------+-----------+------------------+------------------+---------------------+------------------+
|15.26|    14.84|      0.871|             5.763|             3.312|                2.221|              5.22|
|14.88|    14.57|     0.8811| 5.553999999999999|             3.333|                1.018|             4.956|
|14.29|    14.09|      0.905|             5.291|3.3369999999999997|                2.699|             4.825|
|13.84|    13.94|     0.8955|             5.324|3.3789999999999996|                2.259|             4.805|
|16.14|    14.99|     0.9034|5.6579999999999995|             3.562|                1.355|             5.175|
|14.38|    14.21|     0.8951|             5.386|             3.312|   2.4619999999999997|             4.956|
|14.69|    14.49|  

**Inference:** As usual we used PySpark's **read.csv** function to **read the data from the CSV file**. We set the **header** parameter to **True** so that the first row of the file is treated as the column headings, and **inferSchema** to **True** so that Spark infers the type of each column instead of reading everything as strings.


**Note:** If we look closely at the above output, the columns are on different scales (area is around 14 to 16 while compactness is below 1), so **this dataset benefits from "standard scaling" of columns**, which is done in a later section of this article.

In [6]:
dataset.head()

Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22)

**Inference:** To see the column names together with their values, the **head**() function is a convenient choice: called without arguments it returns the first record as a **Row** object, and **head(n)** returns a list of the first n rows.

In [7]:
dataset.describe().show()

+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+
|summary|              area|         perimeter|         compactness|   length_of_kernel|   width_of_kernel|asymmetry_coefficient|   length_of_groove|
+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+
|  count|               210|               210|                 210|                210|               210|                  210|                210|
|   mean|14.847523809523816|14.559285714285718|  0.8709985714285714|  5.628533333333335| 3.258604761904762|   3.7001999999999997|  5.408071428571429|
| stddev|2.9096994306873647|1.3059587265640225|0.023629416583846364|0.44306347772644983|0.3777144449065867|   1.5035589702547392|0.49148049910240543|
|    min|             10.59|             12.41|              0.8081|              4.899|            

**Inference:** The **describe()** method is the go-to function in PySpark when we want **summary statistics** of the dataset. In the above output the count is **210** for every column; since count only includes non-null values, this means there are **no null values**.

## Formatting the data for MLlib

In MLlib we can't feed separate feature columns to the model directly; we first have to **combine all the columns into a single vector column** that the model reads as its input. This task is done by **VectorAssembler**.



In [8]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

dataset.columns

['area',
 'perimeter',
 'compactness',
 'length_of_kernel',
 'width_of_kernel',
 'asymmetry_coefficient',
 'length_of_groove']

**Inference:** To format our data we import **Vectors** from PySpark's **ml.linalg** module and **VectorAssembler** from its **feature** module. We then list all the available columns, which will help us in the following code.

In [9]:
vec_assembler = VectorAssembler(inputCols = dataset.columns, outputCol='features')
final_data = vec_assembler.transform(dataset)
final_data.show()

+-----+---------+-----------+------------------+------------------+---------------------+------------------+--------------------+
| area|perimeter|compactness|  length_of_kernel|   width_of_kernel|asymmetry_coefficient|  length_of_groove|            features|
+-----+---------+-----------+------------------+------------------+---------------------+------------------+--------------------+
|15.26|    14.84|      0.871|             5.763|             3.312|                2.221|              5.22|[15.26,14.84,0.87...|
|14.88|    14.57|     0.8811| 5.553999999999999|             3.333|                1.018|             4.956|[14.88,14.57,0.88...|
|14.29|    14.09|      0.905|             5.291|3.3369999999999997|                2.699|             4.825|[14.29,14.09,0.90...|
|13.84|    13.94|     0.8955|             5.324|3.3789999999999996|                2.259|             4.805|[13.84,13.94,0.89...|
|16.14|    14.99|     0.9034|5.6579999999999995|             3.562|                1.355| 

**Code breakdown:**

1. Creating a **VectorAssembler** object, passing all the columns of the dataset as the **input** **columns** parameter and naming the output column **features**.
2. Calling **transform**, which returns a new DataFrame (**final_data**) with the assembled column appended; the original **dataset** is left unchanged.
3. Finally, when we look at **final_data**, the last column is the **collection of all the features in one vector**.
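
To make the idea of assembling concrete, here is a plain-Python sketch of what happens to a single record. It is not how PySpark implements **VectorAssembler**; the `assemble` helper and the dictionary standing in for a row are hypothetical, and the values are taken from the first record shown above.

```python
# Hypothetical stand-in for one row of the DataFrame (first record above).
row = {
    "area": 15.26,
    "perimeter": 14.84,
    "compactness": 0.871,
    "length_of_kernel": 5.763,
    "width_of_kernel": 3.312,
    "asymmetry_coefficient": 2.221,
    "length_of_groove": 5.22,
}


def assemble(record, input_cols):
    """Collect the chosen columns, in order, into one list of floats."""
    return [float(record[col]) for col in input_cols]


features = assemble(row, list(row))
print(features)
```

The order of the input columns fixes the position of each measurement in the vector, which is why the same column list has to be used for every record.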

## Scaling the data

Scaling is an optional step of data preprocessing in general, but it becomes important **depending on the nature of the dataset**. KMeans groups points by distance, so bringing all features to the same scale prevents columns with large values, such as area, from dominating columns with small values, such as compactness, and usually improves the quality of the clusters.

In [10]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

scalerModel = scaler.fit(final_data)

final_data = scalerModel.transform(final_data)

**Code breakdown:**

1. Importing the **StandardScaler** class from the **ml.feature** module of PySpark.

2. Then passing **features** as the **input** column and **scaledFeatures** as the **output** column. The main thing to note here is that we scale the data by its **standard deviation (withStd=True)** but do not center it on the **mean (withMean=False)**.

3. In the third step we compute the **summary statistics** (the standard deviation of each feature) by using the **fit function**.

4. In the last step the fitted **scaler model** transforms the data so that every feature has unit standard deviation.
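
As a rough illustration of what `withStd=True, withMean=False` means for a single column, the sketch below divides each value by the sample standard deviation without subtracting the mean. It uses only the standard library, the helper name `scale_to_unit_std` is made up, and the five values are the first area readings from the output above. It assumes the sample (n-1) standard deviation, which is what Spark documents for **StandardScaler**.

```python
import statistics


def scale_to_unit_std(column):
    """Divide every value by the sample standard deviation; keep the mean."""
    std = statistics.stdev(column)  # sample (n-1) standard deviation
    return [value / std for value in column]


# First five area values from the dataset output (illustration only).
area = [15.26, 14.88, 14.29, 13.84, 16.14]
scaled = scale_to_unit_std(area)

print(round(statistics.stdev(scaled), 6))  # 1.0
print(statistics.mean(scaled) > 0)         # True: the column is not centered
```

After this step every feature has the same spread, so no single measurement dominates the distance computation purely because of its units.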

## Training and evaluating the model

Now we are in the **model development phase**: first we build the **KMeans clustering model**, and then we **evaluate the model** using a relevant metric that tells us how well it performed.

In [13]:
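# Cluster on the scaled vectors; k=3 because the dataset holds three wheat varieties
kmeans = KMeans(featuresCol='scaledFeatures',k=3)
model = kmeans.fit(final_data)
# Note: transform() returns a DataFrame with a 'prediction' column, so from here on `model` is that DataFrame
model = model.transform(final_data)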

**Inference:** The first thing to note is that in the training phase we pass the k value, i.e. the **number of clusters, as 3** because we already know that there are 3 types of seeds.

The **fit** call trains the model **on the whole dataset (as there are no labels, there is no train/test split)**, and the **transform** call then assigns every record to its nearest cluster in a new **prediction** column.
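
For readers who want to see what the algorithm does under the hood, here is a minimal plain-Python sketch of the classic assign-and-update loop behind KMeans. It is not PySpark's implementation: the function names and the toy points are invented for illustration, and it uses a simple farthest-first initialisation so that the result is deterministic, whereas PySpark uses its own initialisation strategy.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points, k, max_iterations=20):
    # Farthest-first initialisation (an assumption, chosen for determinism):
    # start from the first point, then keep adding the point that is
    # farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(squared_distance(p, c) for c in centroids)))

    for _ in range(max_iterations):
        # Assignment step: tag every point with its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            updated.append([sum(dim) / len(members) for dim in zip(*members)] if members else centroids[j])
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return labels, centroids


# Toy data: three well separated groups of three points each.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
          [5.0, 5.0], [5.2, 4.9], [4.8, 5.1],
          [9.0, 1.0], [9.1, 0.8], [8.9, 1.2]]
labels, centroids = kmeans(points, k=3)
print(labels)
```

The cluster numbers themselves carry no meaning; only which points share a number matters, which is also true for the **prediction** column produced by PySpark.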

In [19]:
from pyspark.ml.evaluation import ClusteringEvaluator
evaluator = ClusteringEvaluator()

silhouette_3_groups = evaluator.evaluate(model)
print("Silhouette evaluation results for wheat seed segmentation= " + str(silhouette_3_groups))

Silhouette evaluation results for wheat seed segmentation= 0.630000103338996


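**Inference:** Here comes the model evaluation phase, where we import the **ClusteringEvaluator** class to check statistically how well the model performed using the **Silhouette** measure, which ranges from -1 to 1 with higher values indicating better separated clusters. Note that **ClusteringEvaluator()** reads the **features** column by default, so the score above is computed on the unscaled vectors; passing `featuresCol='scaledFeatures'` would evaluate the clusters in the same space the model was trained in. **The results are neither too good nor too bad.** One could tune the model and see whether that leads to better results.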
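
To give a feel for what the silhouette measure captures, the following plain-Python sketch computes it for a tiny hand-made example. It is a simplified illustration rather than PySpark's implementation: the function names and toy points are invented, and squared Euclidean distance is used on the assumption that it mirrors the evaluator's default distance measure.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def silhouette_score(points, labels):
    """Mean silhouette over all points; expects at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(squared_distance(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the closest other cluster
        b = min(
            sum(squared_distance(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: two tight, well separated pairs of points.
toy_points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # mixed-up grouping
print(good > bad)  # True
```

A sensible grouping scores close to 1, while a grouping that mixes distant points scores below 0, which is why a value around 0.63 reads as a moderate result.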

In [18]:
model.select('prediction').show()

+----------+
|prediction|
+----------+
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         2|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         0|
+----------+
only showing top 20 rows



**Inference:** To see the tag, i.e. **which record belongs to which cluster**, look at the **"prediction"** column, as shown in the above output.

## Conclusion

In this final part of the article we briefly go through each step that helped us segregate the three types of wheat seeds through KMeans clustering.

1. First we went through the theory and learnt about the dataset, then followed a few compulsory steps like **starting the Spark session and reading the dataset using PySpark.**

2. Then, after some analysis of the data, we formatted it to make it ready for the machine learning algorithm, **KMeans clustering**.

3. When we looked closely at the data we found that it needs standard scaling as well, so after **scaling** the data we trained the model, went through the **evaluation** part, and reached the conclusion that the model **performed** **moderately** well.