# **Dataset Information**

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. 

High-quality visualization of the internal kernel structure was obtained using a soft X-ray technique. This technique is non-destructive and considerably cheaper than more sophisticated imaging techniques such as scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates.

Studies were conducted using combine harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.


# **Attribute Information**

To construct the data, seven geometric parameters of wheat kernels were measured:
1. area A
2. perimeter P
3. compactness C = 4*(pi)*(A)/(P^2)
4. length of kernel
5. width of kernel
6. asymmetry coefficient
7. length of kernel groove

All of these parameters are real-valued and continuous.
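
As a quick check of the compactness formula, the sketch below evaluates C = 4*pi*A/P^2 for two simple shapes. The helper name `compactness` and the example shapes are illustrative assumptions, not part of the dataset; the point is only that a circle gives C = 1 and elongated shapes give smaller values.

```python
import math


def compactness(area: float, perimeter: float) -> float:
    """Return C = 4*pi*A / P^2 (hypothetical helper, for illustration only)."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    return 4.0 * math.pi * area / perimeter ** 2


# A circle of radius 1: area = pi, perimeter = 2*pi, so C equals 1.
circle_c = compactness(math.pi * 1.0 ** 2, 2.0 * math.pi * 1.0)

# A 1 x 4 rectangle: area = 4, perimeter = 10, so C is well below 1.
rect_c = compactness(1.0 * 4.0, 2.0 * (1.0 + 4.0))

print(round(circle_c, 3), round(rect_c, 3))
```

Values close to 1 therefore indicate a rounder outline, which is why compactness is a useful shape descriptor alongside area and perimeter.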

In [5]:
import os
import warnings
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

%matplotlib inline
warnings.filterwarnings('ignore')

!pip install plotly

from IPython.display import HTML
import plotly.express as px



In [6]:
df = pd.read_excel('Wheat_dataset1.xlsx')
df1 = df.copy(deep = True)
df2 = df.copy(deep = True)


In [None]:
df1

In [None]:
sns.pairplot(df1)

K-means and other distance-based clustering algorithms are sensitive to the scale of the features, so the data should be scaled during preprocessing. Because these algorithms compare data points by the distance between them, a feature measured in large units would otherwise dominate features measured in small units. Scaling brings features with different units onto a common scale.
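
The effect of scale on distances can be seen with a tiny sketch. The four rows below are made-up numbers (one large-scale column, one small-scale column), not rows from the wheat dataset, and the standardization is written out by hand with NumPy under the assumption that it should mirror zero-mean, unit-variance scaling.

```python
import numpy as np

# Made-up rows: column 0 is on a large scale, column 1 on a small scale.
X = np.array([
    [15.0, 0.87],
    [15.2, 0.81],
    [19.0, 0.87],
    [12.0, 0.90],
])

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)


def small_column_share(a, b):
    """Fraction of the squared Euclidean distance contributed by column 1."""
    diff = a - b
    return float(diff[1] ** 2 / np.sum(diff ** 2))


raw_share = small_column_share(X[0], X[1])
std_share = small_column_share(X_std[0], X_std[1])

print(f"share before scaling: {raw_share:.3f}")
print(f"share after scaling:  {std_share:.3f}")
```

Before scaling, the small-scale column contributes only a small part of the distance between the first two rows; after scaling, its contribution is no longer drowned out by the large-scale column.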

In [None]:
# STANDARDIZING
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_scaled = scaler.fit_transform(df1)
x_scaled = pd.DataFrame(x_scaled, columns = df1.columns)

In [None]:
x_scaled

To keep the example simple and to visualize the clustering on a 2-D graph, we will use only two attributes.

Later we will see how to cluster on more than two attributes and still visualize the results in 2-D with the help of Principal Component Analysis (PCA).

In [None]:
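# Take an explicit copy so that adding a column later does not raise pandas' SettingWithCopyWarning
x_scaled1 = x_scaled[['compactness', 'asymmetry_coef']].copy()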

In [None]:
x_scaled2 = x_scaled1.copy(deep = True)


Let us see how to apply K-Means in scikit-learn to group the data into two clusters (labelled 0 and 1).

In [None]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters = 2)
pred = model.fit_predict(x_scaled1)

In [None]:
pred

Add the clusters to their respective data points

In [None]:
x_scaled1['Clusters'] = pred
print(x_scaled1)

The scatter plot below suggests that the data could be grouped into more than two clusters. But how do we know how many clusters to use?

In [None]:
sns.scatterplot(x = "compactness", y = "asymmetry_coef", hue = 'Clusters',  data = x_scaled1 ,palette = 'viridis')


The tricky part of K-Means clustering is that you do not know in advance how many clusters the data should be divided into; because the algorithm is unsupervised, there are no labels to tell you. Trial and error is one option, but let us look at more systematic techniques.

A.) The Elbow Method is a popular technique for choosing the number of clusters. Here, we calculate the Within-Cluster Sum of Squared Errors (WCSS, also written WSS) for various values of k and choose the k at which the decrease in WCSS starts to level off.

  1.) The Squared Error for a data point is the square of the distance between the point and its cluster center.

  2.) The WCSS score is the sum of the Squared Errors over all data points.

  3.) K-Means uses the Euclidean distance, and scikit-learn reports the resulting WCSS as `inertia_`. Other metrics, such as the Manhattan distance, lead to variants of the algorithm.
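
The definition above can be written out directly. The sketch below is a minimal NumPy version using four made-up points and hand-assigned labels (both are assumptions for illustration); `wcss` is a hypothetical helper, and it computes the same kind of quantity that scikit-learn exposes as `inertia_`.

```python
import numpy as np


def wcss(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))


# Four made-up points forming two obvious groups.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 2: each group gets its own center (the mean of its members).
labels_2 = np.array([0, 0, 1, 1])
centers_2 = np.array([points[labels_2 == j].mean(axis=0) for j in range(2)])
wcss_2 = wcss(points, labels_2, centers_2)

# k = 1: a single center at the overall mean.
labels_1 = np.zeros(len(points), dtype=int)
centers_1 = points.mean(axis=0, keepdims=True)
wcss_1 = wcss(points, labels_1, centers_1)

print(wcss_1, wcss_2)
```

Going from one cluster to two reduces the WCSS sharply here because the two groups are far apart; adding further clusters would reduce it only slightly, which is what produces the elbow shape.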

Continuing with our example, we fit K-Means for k = 2 to 11 and record the WCSS for each value of k.

In [None]:
K = range(2, 12)  # k = 2, 3, ..., 11
wss = []

for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(x_scaled2)
    # inertia_ is the WCSS of the fitted model
    wss.append(kmeans.inertia_)

In [None]:
wss

The plot below shows an elbow at around K=3 or K=4, i.e. the point after which the WCSS no longer decreases much as K increases.

In [None]:
import matplotlib.pyplot as plt

plt.xlabel('K')
plt.ylabel('Within-Cluster-Sum of Squared Errors (WSS)')
plt.xticks(K)  # ticks match the k values that were actually fitted
plt.plot(K, wss, '-o')

To locate the elbow point programmatically, we use the KneeLocator class from the kneed library.

In [None]:
!pip install kneed
from kneed import KneeLocator

kl = KneeLocator(range(2, 12), wss, curve="convex", direction="decreasing")
print('the elbow point is : ', kl.elbow)
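
To build some intuition for what an elbow detector looks for, here is a rough geometric heuristic: rescale both axes to [0, 1] and pick the point farthest from the straight line joining the first and last points of the curve. This is only a sketch under that assumption and is not claimed to be the algorithm kneed implements; `elbow_by_chord` and the cost values are made up for illustration.

```python
import numpy as np


def elbow_by_chord(ks, costs):
    """Return the k whose (k, cost) point lies farthest from the first-to-last chord."""
    ks = np.asarray(ks, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # Rescale both axes to [0, 1] so that neither dominates the distance.
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (costs - costs.min()) / (costs.max() - costs.min())
    # Perpendicular distance of every point from the line through the end points.
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    dist = np.abs(dx * (y[0] - y) - (x[0] - x) * dy) / np.hypot(dx, dy)
    return int(ks[np.argmax(dist)])


# Made-up, steadily decreasing costs for k = 2..7.
example_ks = range(2, 8)
example_costs = [100, 30, 20, 15, 12, 10]
elbow = elbow_by_chord(example_ks, example_costs)
print('the elbow point is : ', elbow)
```

A dedicated library is still preferable in practice, since real cost curves are noisier than this toy example.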

B.) The silhouette value measures how similar a data point is to its own cluster compared with the other clusters. It ranges from -1 to +1, and higher values indicate better clustering.

Below we calculate the Silhouette Score for k=2 to 12; the highest scores are obtained for k=2 and k=3.
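
For reference, the silhouette value of a point is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below computes it by hand with NumPy on four made-up points; `silhouette_values` is a hypothetical helper that assumes at least two clusters and no duplicate points, and the overall Silhouette Score is the mean of these values.

```python
import numpy as np


def silhouette_values(points, labels):
    """Per-point silhouette values using Euclidean distance."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # a single-member cluster is given a value of 0
        # a: mean distance to the other members of the same cluster.
        a = dists[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_values(pts, [0, 0, 1, 1]).mean()  # groups match the geometry
bad = silhouette_values(pts, [0, 1, 0, 1]).mean()   # groups cut across it

print(f"good labelling: {good:.3f}, bad labelling: {bad:.3f}")
```

The labelling that follows the geometry scores close to +1, while the labelling that mixes the two groups scores below zero.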

In [None]:
import sklearn.metrics as metrics

for i in range(2, 13):
    labels = KMeans(n_clusters=i, random_state=200).fit(x_scaled2).labels_
    score = metrics.silhouette_score(x_scaled2, labels, metric="euclidean")
    print(f"Silhouette score for k = {i} is {score}")

In [None]:
x_scaled2

In [None]:
model1 = KMeans(n_clusters = 4)
model1.fit(x_scaled2)

In [None]:
model1.predict(x_scaled2)

In [None]:
x_scaled2['Clusters'] = model1.labels_
x_scaled2

In [None]:
sns.scatterplot(x="compactness", y="asymmetry_coef", hue = 'Clusters',  data=x_scaled2, palette='viridis')

# Clustering with more than 2 features
In the above example, we used only two attributes to perform clustering because it is easier to visualize the results in a 2-D graph.

We cannot plot more than three attributes at once, and real-world datasets can have hundreds of attributes. So how can we visualize the clustering results?

One way is to apply principal component analysis (PCA) to reduce the dataset to two or three dimensions while preserving as much of the variance as possible.

Clustering can then be applied to the transformed dataset and the result visualized in a 2-D or 3-D plot. Moreover, PCA can also help to mitigate the curse of dimensionality.

# What is Dimensionality reduction?

Dimensionality reduction algorithms project high-dimensional data to a low-dimensional space while keeping as much of the variance in the original dataset as possible.

# Why do we need Dimensionality Reduction algorithms?
A large number of features requires more computing resources and a longer training time. Distance calculations between data points also become more expensive and less informative when the number of dimensions is very high.

This problem is often referred to as the curse of dimensionality in machine learning.

Once the dimensionality has been reduced, machine learning algorithms can perform these calculations more efficiently during training.
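
One symptom of the curse of dimensionality is that distances become harder to tell apart as the number of dimensions grows. The sketch below illustrates this on uniformly random synthetic points (an assumption for illustration, not the wheat data): it measures how much farther the farthest point is than the nearest one, relative to the nearest. `relative_contrast` is a hypothetical helper name.

```python
import numpy as np


def relative_contrast(n_points, n_dims, seed=0):
    """(max distance - min distance) / min distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return float((d.max() - d.min()) / d.min())


low_dim = relative_contrast(500, 2)
high_dim = relative_contrast(500, 1000)

print(f"2 dimensions:    {low_dim:.2f}")
print(f"1000 dimensions: {high_dim:.2f}")
```

In two dimensions the nearest and farthest points differ by a large factor; in a thousand dimensions they are almost equally far away, which makes distance-based clustering less discriminating.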

# What is PCA?

PCA is a linear dimensionality reduction technique. It transforms a set of p variables into a smaller set of k (k < p) new variables called "principal components" while retaining as much of the variation in the original dataset as possible.
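
To make the idea concrete, here is a minimal sketch of PCA using a singular value decomposition in NumPy. It assumes the data only needs centering (no scaling) and uses synthetic, strongly correlated columns; `pca_svd` is a hypothetical helper and is not meant as a replacement for a library implementation.

```python
import numpy as np


def pca_svd(X, n_components):
    """Project X onto its first n_components principal directions."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T
    # Sample variance along each direction, as a share of the total variance.
    explained_variance = S ** 2 / (len(X) - 1)
    ratio = explained_variance / explained_variance.sum()
    return scores, components, ratio[:n_components]


# Synthetic data: three columns that mostly vary along a single direction.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.1 * rng.normal(size=200),
    -t + 0.1 * rng.normal(size=200),
])

scores, components, ratio = pca_svd(X, n_components=2)
print(scores.shape)
print(np.round(ratio, 4))
```

Because the three columns are nearly multiples of each other, the first principal component alone captures almost all of the variance, so p = 3 variables can be summarized by k = 1 or 2 components with little loss.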

In [None]:
df2

In [None]:
scaler2 = StandardScaler()
df2 = pd.DataFrame(scaler2.fit_transform(df2), columns = df2.columns)
df2.head()

First we fit PCA with all components to see how much of the variance each principal component explains. Then we keep only the leading components in the actual PCA.

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(df2)

In [None]:
print(pca.n_components_)
print(pca.explained_variance_)

We see that the first 3 principal components explain almost 99% of the variance present in the dataset.


In [None]:
np.cumsum(pca.explained_variance_ratio_ * 100)
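
The cumulative sum can also be used to pick the number of components automatically. The sketch below returns the smallest number of components whose cumulative explained-variance ratio reaches a chosen threshold; `components_for_threshold` is a hypothetical helper and the ratios are made-up values, not the output of the cell above.

```python
import numpy as np


def components_for_threshold(explained_variance_ratio, threshold=0.99):
    """Smallest number of leading components whose cumulative ratio >= threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value that reaches the threshold, as a count.
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding leaving the final cumulative value just below 1.
    return min(count, len(cumulative))


# Made-up ratios for four components (they sum to 1).
example_ratios = np.array([0.72, 0.17, 0.10, 0.01])
n_keep = components_for_threshold(example_ratios, threshold=0.95)
print(n_keep)
```

With a fitted scikit-learn PCA object, the same helper could be called on `pca.explained_variance_ratio_`.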

In [None]:
# PLOT THE VARIANCE EXPLAINED BY EACH PRINCIPAL COMPONENT
import matplotlib.pyplot as plt
features = range(0, pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('principal component')
plt.ylabel('explained variance')

SELECT THE COMPONENTS WITH HIGH VARIANCE

In [None]:
pca1 = PCA(n_components = 3)
pca1.fit(x_scaled)
pca_features = pd.DataFrame(pca1.transform(x_scaled), columns = ['pca1', 'pca2', 'pca3'])
print(pca_features.shape)

In [None]:
pca_features

In [None]:
########## K MEANS CLUSTERING ############

from sklearn.cluster import KMeans

cost = []
for i in range(2,13):
    model = KMeans(n_clusters = i)
    model.fit(pca_features)
    cost.append(model.inertia_)

In [None]:
cost

In [None]:
# ELBOW PLOT
plt.plot(range(2,13), cost, '-o')
plt.xlabel('k value')
plt.ylabel('cost value')
plt.xticks(range(2,13))
plt.show()

In [None]:
from kneed import KneeLocator

kl = KneeLocator(range(2, 13), cost, curve="convex", direction="decreasing")
print('the elbow point is : ', kl.elbow)

In [None]:
for i in range(2, 13):
    labels = KMeans(n_clusters=i, random_state=200).fit(pca_features).labels_
    score = metrics.silhouette_score(pca_features, labels, metric="euclidean")
    print(f"Silhouette score for k = {i} is {score}")

In [None]:
# TRAIN WITH KNEE VALUE
model2 = KMeans(n_clusters = 3)
model2.fit(pca_features)

In [None]:
pred2 = model2.predict(pca_features)

In [None]:
pca_features['Clusters'] = pred2

In [None]:
pca_features

In [None]:
sns.scatterplot(x="pca1", y="pca2", hue = 'Clusters',  data = pca_features, palette='viridis')

In [None]:
sns.scatterplot(x="pca1", y="pca3", hue = 'Clusters',  data = pca_features, palette='viridis')

In [None]:
sns.scatterplot(x="pca2", y="pca3", hue = 'Clusters',  data = pca_features, palette='viridis')

In [None]:
from mpl_toolkits import mplot3d

fig = plt.figure(figsize = (12, 8))
ax = plt.axes(projection = '3d')

sctt = ax.scatter3D(pca_features['pca1'], pca_features['pca2'], pca_features['pca3'], c = pca_features['Clusters'], s = 50, alpha = 0.6)

plt.title('3D scatterplot of clusters')
ax.set_xlabel('pca1')
ax.set_ylabel('pca2')
ax.set_zlabel('pca3')
plt.savefig('3dplot.png')

In [None]:
fig = px.scatter_3d(pca_features, x = 'pca1', y = 'pca2', z = 'pca3', color = 'Clusters')
fig.show()


In [None]:
pca_features

Testing on unseen data

In [None]:
test_df = pd.read_excel('test_data.xlsx')

In [None]:
test_df

In [None]:
test_df = pd.DataFrame(scaler2.transform(test_df), columns = test_df.columns)

In [None]:
test_pca_df = pd.DataFrame(pca1.transform(test_df), columns = ['pca1', 'pca2', 'pca3'])

In [None]:
test_pca_df

In [None]:
test_pred = model2.predict(test_pca_df)

In [None]:
test_df['Clusters'] = test_pred

In [None]:
test_df
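
Two things happen when new data is scored: it is transformed with the statistics learned from the training data (not refitted), and each point is assigned to the nearest cluster center. The sketch below shows both steps by hand in NumPy. The training rows, the two hand-made centers, and the helper `assign_to_nearest` are all assumptions for illustration; in the notebook the centers come from the fitted K-Means model.

```python
import numpy as np


def assign_to_nearest(new_points, centers):
    """Label each point with the index of its nearest center (Euclidean distance)."""
    new_points = np.asarray(new_points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    d = np.linalg.norm(new_points[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)


# Made-up training data with two obvious groups.
train = np.array([[1.0, 10.0], [2.0, 12.0], [9.0, 30.0], [10.0, 32.0]])

# Scaling statistics are learned from the training data only.
train_mean, train_std = train.mean(axis=0), train.std(axis=0)
train_scaled = (train - train_mean) / train_std

# Hand-made centers: the mean of each group in the scaled space.
centers = np.array([train_scaled[:2].mean(axis=0), train_scaled[2:].mean(axis=0)])

# New data reuses the training mean and standard deviation.
new = np.array([[1.5, 11.0], [9.5, 31.0]])
new_scaled = (new - train_mean) / train_std
new_labels = assign_to_nearest(new_scaled, centers)
print(new_labels)
```

Refitting the scaler or the PCA on the test data would place the new points in a different coordinate system from the cluster centers, so the assignments would no longer be meaningful.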