# ML Application Examples
## Clustering: Seeds Data Set 

The goal of this example is to apply a clustering algorithm within an ML pipeline (data loading, data analysis, visualisation, model selection and optimisation, prediction) to a specific dataset.


## Dataset 
The notebook downloads a publicly available dataset: https://archive.ics.uci.edu/ml/datasets/seeds

Measurements of geometrical properties of kernels belonging to three different varieties of wheat. A soft X-ray technique and GRAINS package were used to construct all seven, real-valued attributes.

  <b>Source:</b>
        
  Małgorzata Charytanowicz, Jerzy Niewczas <br />
  Institute of Mathematics and Computer Science, <br /> 
  The John Paul II Catholic University of Lublin, Konstantynów 1 H, PL 20-708 Lublin, Poland <br />
        e-mail: mchmat@kul.lublin.pl , jniewczas@kul.lublin.pl <br />

  Piotr Kulczycki, Piotr A. Kowalski, Szymon Lukasik, Slawomir Zak <br />
  Department of Automatic Control and Information Technology, <br />
  Cracow University of Technology, Warszawska 24, PL 31-155 Cracow, Poland <br />
  and <br />
  Systems Research Institute, Polish Academy of Sciences, Newelska 6, PL 01-447 Warsaw, Poland <br />
  e-mail: kulczycki@ibspan.waw.pl , pakowal@ibspan.waw.pl , slukasik@ibspan.waw.pl , slzak@ibspan.waw.pl
    
   <br/>
    <b>Data Set Information:</b>
The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.
        
<b>Attribute Information:</b>
    All attributes are continuous, with the exception of the class attribute.
    To construct the data, seven geometric parameters of wheat kernels were measured:
    <ul>
    <li> area A, </li>
    <li> perimeter P, </li>
    <li> compactness C = 4*pi*A/P^2, </li>
    <li> length of kernel, </li>
    <li>width of kernel, </li>
    <li> asymmetry coefficient </li>
    <li> length of kernel groove. </li>
    </ul>
All of these parameters were real-valued continuous.
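
The compactness formula above can be checked on shapes whose area and perimeter are known in closed form. The following is a minimal sketch; the helper name `compactness` and the example shapes are illustrative assumptions, not part of the dataset description.

```python
import math

def compactness(area, perimeter):
    """Compactness C = 4*pi*A / P^2, as defined in the attribute list."""
    return 4 * math.pi * area / perimeter ** 2

# A circle of radius r has A = pi*r^2 and P = 2*pi*r, so C = 1.
r = 2.0
circle_c = compactness(math.pi * r ** 2, 2 * math.pi * r)

# A square of side s has A = s^2 and P = 4*s, so C = pi/4.
s = 3.0
square_c = compactness(s ** 2, 4 * s)

print(round(circle_c, 6), round(square_c, 6))
```

A circle reaches the maximum value of 1, and less round shapes score lower, so the attribute can be read as a measure of how round a kernel is.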

    


In [None]:
# algebra
import numpy as np
# data structure
import pandas as pd
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
#file handling
from pathlib import Path




# Data load
The process consists of downloading the data if needed and loading it as a pandas DataFrame.

In [None]:
filename  = "seeds_dataset.txt"

# download the dataset if it is not already in the working directory
my_file = Path(filename)
if not my_file.is_file():
  print("Downloading dataset")
  !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt
 


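# some rows of the file may contain repeated tabs, so match any run of
# whitespace instead of a single tab (otherwise those rows are misaligned)
sep = r'\s+'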
columns = ['area',
 'perimeter',
 'compactness',
 'length_kernel',
 'width',
 'asymmetry',
 'length_groove',
 'unknown_class']

all_data          = pd.read_csv(filename,sep=sep,names=columns,usecols=np.arange(8),index_col=None)
all_data          = all_data.dropna()

data          = all_data[columns[:7]]
unknown_class = all_data[columns[7]]

print("n. samples: {}".format(data.shape[0]))
print("n. columns: {}".format(data.shape[1]))
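
The `!wget` line above only works inside IPython/Jupyter and needs `wget` on the system. As a hedged alternative, the same "download if missing" step can be sketched with the standard library alone. The helper name `ensure_dataset` and its `fetch` parameter are hypothetical, introduced here only for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

def ensure_dataset(filename, url, fetch=urlretrieve):
    """Download `url` to `filename` unless the file already exists.

    `fetch` is any callable taking (url, destination); it defaults to
    urllib's urlretrieve and can be swapped out, e.g. for testing.
    """
    path = Path(filename)
    if not path.is_file():
        print("Downloading dataset")
        fetch(url, str(path))
    return path

# Usage with the names from the cell above (not executed here):
# ensure_dataset(
#     "seeds_dataset.txt",
#     "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt",
# )
```

Passing the download function as an argument keeps the helper usable offline, because a stub can replace the real network call.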

# Data Analysis and Visualization
In this section we get familiar with the data by inspecting and plotting it (rows with missing values were already dropped at load time).

In [None]:
# What does the dataset look like?
print(all_data.head())
print('Columns names:',all_data.columns.values)
print("Classes to predict in the data:",all_data['unknown_class'].unique())

In [None]:
sns.pairplot(all_data,hue = 'unknown_class')

# Machine Learning
## Clustering metrics



The sklearn.metrics.cluster submodule contains evaluation metrics for cluster analysis results. There are two forms of evaluation:

- supervised, which uses ground truth class values for each sample.

- unsupervised, which does not and measures the ‘quality’ of the model itself.

See the Clustering performance evaluation section of the scikit-learn user guide for further details: https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation
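
To make the two forms of evaluation concrete, here is a minimal sketch on a hypothetical toy dataset (the arrays `X_toy`, `y_true` and `y_pred` are made up for illustration). The supervised score needs the true labels, while the unsupervised one only needs the data and the predicted labels.

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy data: two well-separated groups of three points each.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y_true = [0, 0, 0, 1, 1, 1]
# Same partition as y_true, but with the cluster ids swapped.
y_pred = [1, 1, 1, 0, 0, 0]

# Supervised: compares the predicted partition with the ground truth.
ari = metrics.adjusted_rand_score(y_true, y_pred)
# Unsupervised: uses only the data and the predicted labels.
sil = metrics.silhouette_score(X_toy, y_pred)

print("adjusted Rand index:", ari)
print("silhouette score:", sil)
```

The adjusted Rand index compares partitions, not label names, so swapping the cluster ids does not lower the score. This matters for k-means, whose cluster numbering is arbitrary.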

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# normalisation
# k-means relies on Euclidean distances, so the features are standardised
# (zero mean, unit variance) to give each of them a comparable weight
scaler_X = StandardScaler().fit(data)
X = scaler_X.transform(data)
# check that no NaN is present after normalisation to avoid trouble later
if np.isnan(X).any(): print("Error nan")
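
As a small illustration of what the normalisation cell does, the sketch below applies `StandardScaler` to a hypothetical two-column matrix (`X_toy`, made up for this example) whose features differ in scale by a factor of 1000.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: two features on very different scales.
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0],
                  [4.0, 4000.0]])

X_scaled = StandardScaler().fit_transform(X_toy)

# After scaling, every column has mean 0 and (population) standard deviation 1.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means, col_stds)
```

Without this step the second column would dominate any Euclidean distance; after scaling both columns contribute equally.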

## Supervised metrics

In [None]:
from sklearn.cluster import KMeans
from sklearn import metrics

plot_flag = False
max_k = 7

# one k-means estimator per candidate number of clusters;
# random_state is fixed so that the scores are reproducible
estimators = []
for k in range(2,max_k+1):
  estimators.append(('k_means_'+str(k), KMeans(n_clusters=k, random_state=0)))


try_scores = ['adjusted_rand_score',
              'adjusted_mutual_info_score']

results = {s:dict() for s in try_scores }
for name, est in estimators:
    
  est.fit(X)
  labels_pred = est.labels_

  # SUPERVISED Metrics
  # [...]  the (adjusted or unadjusted) Rand index requires knowledge of the 
  # ground truth classes which is almost never available in practice or requires
  # manual assignment by human annotators (as in the supervised learning setting).
  results['adjusted_rand_score'][name] = metrics.adjusted_rand_score(unknown_class, labels_pred)

  results['adjusted_mutual_info_score'][name] = metrics.adjusted_mutual_info_score(unknown_class, labels_pred)

  if plot_flag:

    df = pd.DataFrame(X)
    df['predicted labels'] = labels_pred
    sns.pairplot(df,hue='predicted labels',diag_kws={'palette':'Set2'},plot_kws={'palette':'Set2'})

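if plot_flag:
  # True classes
  df = pd.DataFrame(X)
  # unknown_class keeps the index of all_data, which may have gaps after
  # dropna(); .values avoids a misaligned (NaN-filled) column
  df['true labels'] = unknown_class.values
  sns.pairplot(df,hue='true labels',diag_kws={'palette':'Set2'},plot_kws={'palette':'Set2'})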

# one bar chart per score, with one bar per number of clusters
n = len(results)
fig,axs = plt.subplots(1,n,figsize=[8*n,4*2],squeeze=False)
for ax,score_name in zip(axs.flatten(),results):
    ax.bar(list(results[score_name].keys()),list(results[score_name].values()))
    ax.set_title(score_name)
    ax.set_xticks(ax.get_xticks())
    ax.set_xticklabels(results[score_name].keys(),rotation=15)

In [None]:
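# let's plot the results in two dimensions using a t-SNE embedding
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2,random_state=0)
X_r = tsne.fit_transform(X)
# estimators[1] is the 3-cluster k-means; it is refitted here on the 2-D embedding
est = estimators[1][1]
pred_y = est.fit_predict(X_r)

fig = plt.figure(figsize=[10,10])
# points are coloured by their true class; red dots mark the cluster centres
plt.scatter(X_r[:,0], X_r[:,1], c=unknown_class.values, cmap='viridis', linewidth=0.5);
plt.scatter(est.cluster_centers_[:, 0], est.cluster_centers_[:, 1], s=300, c='red')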
plt.show()

## Unsupervised metrics

In [None]:
# add the two t-SNE components as the first two features
# (note: unlike the other columns they are not standardised, and re-running
# this cell prepends them again)
X = np.column_stack([X_r, X])

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient ranges from -1 to 1, but in this example all
    # values lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:,1], marker='o', s=80, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("1st feature (t-SNE component 1)")
    ax2.set_ylabel("2nd feature (t-SNE component 2)")

    plt.suptitle(("Silhouette analysis for KMeans clustering on the seeds data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()
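
The silhouette coefficient plotted above can be written out explicitly. For a sample, let `a` be its mean distance to the other samples of its own cluster and `b` its mean distance to the samples of the nearest other cluster; the coefficient is then `(b - a) / max(a, b)`. The sketch below computes it by hand on a hypothetical toy dataset and compares the result with `silhouette_samples`. The helper `silhouette_by_hand` and the toy arrays are illustrative assumptions, and Euclidean distance is assumed (the scikit-learn default).

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_by_hand(X, labels):
    """Per-sample silhouette coefficient with Euclidean distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Matrix of pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            continue  # a single-sample cluster keeps a score of 0
        a = dist[i, same].mean()
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical toy data: two compact groups plus one point in between.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [6.0, 6.0], [6.0, 7.0], [3.0, 3.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

by_hand = silhouette_by_hand(X_toy, labels_toy)
reference = silhouette_samples(X_toy, labels_toy)
print(np.round(by_hand, 3))
```

Values close to 1 mean a sample sits well inside its cluster, values near 0 mean it lies between two clusters, and negative values suggest it may have been assigned to the wrong one.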