# Clustering examples

This notebook is for developing the fuzzy clustering package and demonstrating how to use it with scikit-learn.

The basic idea is to create some scikit-learn compatible clustering estimators and a group of scoring functions. These can then be combined through scikit-learn for two main purposes:

* [Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) : Chain together pre-processing, clustering and scoring steps into one object
* [Model evaluation](https://scikit-learn.org/stable/model_selection.html#model-selection) : Cross validation of models to find optimum fitting parameters as scored against various scoring functions
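
As a rough illustration of the pipeline idea, the sketch below chains a scaler and a clusterer into one object. `KMeans` is only a stand-in here, on the assumption that a fuzzy estimator from this package would slot into the same position; the step names `"scale"` and `"cluster"` are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; KMeans stands in for a fuzzy clustering estimator
X, _ = make_blobs(n_samples=300, n_features=2, centers=3, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("cluster", KMeans(n_clusters=3, n_init=10, random_state=0)),
])

# fit_predict fits the scaler, transforms X, then fits the clusterer
# and returns its hard labels
pipe_labels = pipe.fit_predict(X)

# Score on the scaled features, which is what the clusterer actually saw
X_scaled = pipe.named_steps["scale"].transform(X)
pipe_score = silhouette_score(X_scaled, pipe_labels)
print(pipe_score)
```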

We aim to build clustering estimators, store them in `./Models`, and possibly add:

* Ensemble methods (run many times to assess stability)
* I/O helper functions (read csv, netcdf, whatever)
* Suite of scoring metrics (silhouette score, fuzzy partition matrix, etc.)
* Default, pretrained models (one for each common dataset, like OLCI, OC-CCI etc)

all in a way that generalises across models. Every estimator must be a scikit-learn compatible object, and nothing here should reinvent the wheel.
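
To make "scikit-learn compatible" concrete, here is a minimal sketch of the conventions such an object is expected to follow. `NearestSeedClusterer` is a hypothetical toy class, not part of this package: it picks random samples as centres and assigns each point to the nearest one, purely to show the constructor, `fit` and `predict` contract.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin, clone


class NearestSeedClusterer(ClusterMixin, BaseEstimator):
    """Hypothetical toy clusterer used only to illustrate the estimator API."""

    def __init__(self, n_clusters=3, random_state=None):
        # Store constructor arguments unchanged so that get_params,
        # set_params and clone work without extra code
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(self.random_state)
        idx = rng.choice(X.shape[0], size=self.n_clusters, replace=False)
        # Attributes learned during fit carry a trailing underscore
        self.cluster_centers_ = X[idx]
        self.labels_ = self.predict(X)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        diff = X[:, None, :] - self.cluster_centers_[None, :, :]
        return np.linalg.norm(diff, axis=2).argmin(axis=1)


X_demo = np.random.default_rng(0).normal(size=(50, 2))
est = NearestSeedClusterer(n_clusters=4, random_state=0)
labels_demo = est.fit_predict(X_demo)   # provided by ClusterMixin

# clone and set_params are what grid searches rely on internally
est_copy = clone(est).set_params(n_clusters=2)
```

Keeping `__init__` free of logic is the main design constraint: a grid search rebuilds the estimator from `get_params()` for every parameter combination.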

In [1]:
from CmeansPython import CmeansModel

In [81]:
from sklearn.datasets import make_blobs
import pandas as pd
import hvplot.pandas
import holoviews as hv
import xarray as xr
import hvplot.xarray
import numpy as np
from functools import reduce

## Create a blobby dataset

In [228]:
blobs, labels = make_blobs(n_samples=2000, n_features=2)

In [229]:
df = pd.DataFrame(blobs, columns=['x','y'])

In [230]:
cmeans = CmeansModel(c=3)

In [231]:
cmeans.fit(df)

CmeansModel(c=3)

In [232]:
df.hvplot(kind='scatter',x='x',y='y',c=labels, cmap='rainbow') * hv.Points(cmeans.cntr_)

## Performance metrics

If an algorithm comes with its own scoring metric, that metric is specific to its class. However, some metrics apply across all methods.

Scikit-learn has a [suite of metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics).

Probably the best approach is to build a scorer object from a scoring function with `sklearn.metrics.make_scorer`, which can be applied after the clustering step of a pipeline that is then passed to a `GridSearchCV`-like object.
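
A minimal sketch of that idea follows, again with `KMeans` as a stand-in estimator. Note that `make_scorer` wraps a metric of the form `metric(y_true, y_pred)`, so this route needs reference labels. Metrics that work on `(X, labels)` alone, such as the silhouette score, are instead passed as a plain callable with the signature `scorer(estimator, X, y)`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score, make_scorer
from sklearn.model_selection import GridSearchCV

X, y_true = make_blobs(n_samples=300, n_features=2, centers=3, random_state=0)

# The scorer calls estimator.predict(X) and compares the result with y_true
ami_scorer = make_scorer(adjusted_mutual_info_score)

search = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": [2, 3, 4]},
    scoring=ami_scorer,
    cv=3,
)
search.fit(X, y_true)
print(search.best_params_, search.best_score_)
```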

In [8]:
from sklearn.metrics.cluster import adjusted_mutual_info_score, calinski_harabasz_score, davies_bouldin_score, contingency_matrix, normalized_mutual_info_score, silhouette_samples, silhouette_score

In [9]:
metrics = [adjusted_mutual_info_score, calinski_harabasz_score, davies_bouldin_score, contingency_matrix, normalized_mutual_info_score, silhouette_samples, silhouette_score]

In [150]:
for metric in metrics:
    try:
        print(metric.__name__, metric(df, cmeans.labels_))
    except ValueError:
        # Skip metrics that expect reference labels rather than the feature matrix
        pass

calinski_harabasz_score 25371.806013966107
davies_bouldin_score 0.3762313694477608
silhouette_samples [0.67759496 0.56706305 0.73504549 ... 0.49490367 0.75514517 0.74937199]
silhouette_score 0.7347101729903256


In [12]:
ss = silhouette_samples(df, cmeans.labels_)
hv.Bars(
    np.hstack([sorted(ss[cmeans.labels_==i]) for i in range(3)])
)

In [99]:
from sklearn.metrics import pairwise_distances

In [19]:
from sklearn.metrics import make_scorer
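
For reference, the Xie-Beni index is commonly defined as below. The symbols are assumptions chosen to match the variable names used in this notebook: $u_{ik}$ is the membership of sample $x_k$ in cluster $i$, $v_i$ is a cluster centre, $m$ is the fuzzifier, $c$ is the number of clusters and $n$ the number of samples.

$$
XB = \frac{\sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^{m}\,\lVert x_k - v_i \rVert^{2}}{n\,\min_{i \neq j}\lVert v_i - v_j \rVert^{2}}
$$

Lower values indicate compact, well separated clusters, so a scorer built on this index has to return its negative for scikit-learn to maximise.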

In [181]:
from sklearn.metrics import pairwise_distances, silhouette_score

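def xie_beni(Estimator, X, y=None):
    """ Xie-Beni scoring function
        for cmeans.
        
        Output is negative as scikit-learn
        by default maximizes scores """
    
    u = Estimator.predict(X)   # membership matrix, shape (c, n)
    v = Estimator.cntr_        # cluster centres, shape (c, n_features)
    m = Estimator.m            # fuzzifier exponent
    
    n = X.shape[0]

    um = u**m
    
    # Xie-Beni is defined on squared Euclidean distances
    d2 = pairwise_distances(X, v)**2
    v2 = pairwise_distances(v, v)**2
    
    # Ignore the zero self-distances when taking the minimum centre separation
    v2[v2 == 0.0] = np.inf
   
    # Lower Xie-Beni is better, so negate it for scikit-learn
    return -np.sum(um.T*d2)/(n*np.min(v2))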

In [190]:
# FIXME : WOULD A DECORATOR FUNCTION SUFFICE HERE?

def hard_silouette(Estimator, X, y=None):
    """ A hard silhouette scoring function
        for clustering algorithms. Built on
        `sklearn.metrics.silhouette_score`
        
        Uses np.argmax() to convert soft
        clusters into hard clusters prior
        to evaluation."""
        
    u = Estimator.predict(X)
    # Membership matrix (c, n) -> hard labels (n,); 1-D labels pass through
    if np.ndim(u) > 1:
        u = np.argmax(u, axis=0)

    return silhouette_score(X, u)

def fuzzy_partition_coef(Estimator, X, y=None):
    """ Fuzzy partition coefficient (FPC) 
        for fuzzy clustering algorithms"""
    
    u = Estimator.predict(X)
    return Estimator.fpc_

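def calinski_harabasz(Estimator, X, y=None):
    """ Calinski-Harabasz scoring function
        for clustering algorithms. Built on
        `sklearn.metrics.calinski_harabasz_score`
        
        Soft clusters are converted to hard
        clusters with np.argmax() first."""
    
    u = Estimator.predict(X)
    
    # Membership matrix (c, n) -> hard labels (n,); 1-D labels pass through
    if np.ndim(u) > 1:
        u = np.argmax(u, axis=0)

    return calinski_harabasz_score(X, u)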


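def davies_bouldin(Estimator, X, y=None):
    """ Davies-Bouldin scoring function
        for clustering algorithms. Built on
        `sklearn.metrics.davies_bouldin_score`
        
        Output is negative because lower
        Davies-Bouldin values are better."""
    
    u = Estimator.predict(X)
    
    # Membership matrix (c, n) -> hard labels (n,); 1-D labels pass through
    if np.ndim(u) > 1:
        u = np.argmax(u, axis=0)
    
    return -davies_bouldin_score(X, u)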
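
On the FIXME about a decorator: one possible answer is a small factory that wraps any `metric(X, labels)` function into a scorer, so the argmax step is written once. This is only a sketch. `hard_label_scorer` and its `greater_is_better` flag are hypothetical names, `KMeans` is a stand-in estimator, and a membership matrix is assumed to have shape `(n_clusters, n_samples)` as elsewhere in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score


def hard_label_scorer(metric, greater_is_better=True):
    """Hypothetical factory: wrap metric(X, labels) as scorer(estimator, X, y)."""
    sign = 1.0 if greater_is_better else -1.0

    def scorer(estimator, X, y=None):
        u = np.asarray(estimator.predict(X))
        # Assumed layout: memberships are (n_clusters, n_samples)
        if u.ndim > 1:
            u = np.argmax(u, axis=0)
        # Flip the sign of "lower is better" metrics so larger is always better
        return sign * metric(X, u)

    scorer.__name__ = f"hard_{metric.__name__}"
    return scorer


sil_scorer = hard_label_scorer(silhouette_score)
db_scorer = hard_label_scorer(davies_bouldin_score, greater_is_better=False)

X, _ = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(sil_scorer(km, X), db_scorer(km, X))
```

Scorers built this way can go straight into a `scoring` dictionary for `GridSearchCV`.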

## Grid search of fitting parameters by cross validation

In [191]:
from sklearn.model_selection import cross_validate, GridSearchCV

In [233]:
scoring = {
    'XB': xie_beni,
    'SIL': hard_silouette,
    'FPC': fuzzy_partition_coef,
#     'CH': calinski_harabasz,
    'DB': davies_bouldin,
}

gs = GridSearchCV(cmeans,
                  param_grid={'c': range(2,10), 'm':[1.1,1.5,2.0,2.5,3.0]},
                  scoring=scoring, refit='SIL')
gs.fit(df, None)
results = gs.cv_results_

In [234]:
dfr = pd.DataFrame(results)

## Visualisation of results with `hvplot`

In [235]:
dfr['mean_total_score'] = dfr[[f'mean_test_{x}' for x in scoring.keys()]].mean(axis=1)

### Scores on the doors

In [236]:
dfr.hvplot(
    groupby=['param_m'],
    x='param_c',
    y=[f'mean_test_{x}' for x in scoring.keys()] + ['mean_total_score']
)

### Scatter plot dominant class

In [237]:
df.hvplot(kind='scatter',x='x',y='y',c=np.argmax(gs.predict(df),axis=0), cmap='rainbow') * hv.Points(gs.best_estimator_.cntr_)
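
The `np.argmax(..., axis=0)` call above relies on the membership matrix having one row per cluster and one column per sample. A tiny worked example with made-up numbers shows the conversion to a dominant class, along with the usual definition of the fuzzy partition coefficient as the mean of the squared memberships.

```python
import numpy as np

# Made-up membership matrix: 3 clusters (rows) x 4 samples (columns).
# Each column sums to 1.
u = np.array([
    [0.7, 0.2, 0.1, 0.5],
    [0.2, 0.5, 0.1, 0.3],
    [0.1, 0.3, 0.8, 0.2],
])

# Dominant cluster for each sample: the row index of the column maximum
hard = np.argmax(u, axis=0)
print(hard)  # [0 1 2 0]

# Partition coefficient: sum of squared memberships divided by n samples.
# It is 1 for a crisp partition and 1/c for a completely fuzzy one.
fpc = np.trace(u @ u.T) / u.shape[1]
print(fpc)
```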

In [241]:
dfu = pd.DataFrame(gs.best_estimator_.u_.T, columns=[f'cluster {x}' for x in range(gs.best_estimator_.u_.shape[0])])

In [243]:
dfu.hvplot(kind='hist', bins=100, 
           width=500, 
#            subplots=True,
           alpha=1/gs.best_estimator_.u_.shape[0]
          )