In [None]:
#| default_exp cluster_grid_search

# cluster_grid_search module

> Functions for optimising clustering

In [None]:
#| hide
from nbdev.showdoc import show_doc
from sklearn import datasets

In [None]:
#| export
from sklearn import cluster
import pandas as pd
import numpy as np
from typing import Union
from sklearn.base import ClusterMixin
from sklearn import metrics
from sklearn.metrics import silhouette_score
from tqdm import tqdm
from sklearn.model_selection import ParameterGrid

The [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) is used below as example data to cluster.

In [None]:
iris = datasets.load_iris(as_frame=True)
X = iris.data
X.head(5)

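|    | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| -- | ----------------- | ---------------- | ----------------- | ---------------- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |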


## Clustering

In [None]:
#| export
def assign_cluster_labels(
    data_to_cluster: Union[pd.DataFrame, np.ndarray, list], # Data points to cluster
    cluster_model: ClusterMixin, # Sk-learn clustering model. Must be specified in the format `cluster.model`, e.g `cluster.KMeans`
    model_kwargs: dict, # Keyword arguments for sk-learn clustering model
) -> np.ndarray: # Label assignments
    """Assign cluster labels to `data_to_cluster` using specified `cluster_model`"""
    cluster_model = cluster_model(**model_kwargs)
    return cluster_model.fit_predict(data_to_cluster)

In [None]:
show_doc(assign_cluster_labels)

---

#### assign_cluster_labels

>      assign_cluster_labels
>                             (data_to_cluster:Union[pandas.core.frame.DataFrame
>                             ,numpy.ndarray,list],
>                             cluster_model:sklearn.base.ClusterMixin,
>                             model_kwargs:dict)

Assign cluster labels to `data_to_cluster` using specified `cluster_model`

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| data_to_cluster | typing.Union[pandas.core.frame.DataFrame, numpy.ndarray, list] | Data points to cluster |
| cluster_model | ClusterMixin | Sk-learn clustering model. Must be specified in the format `cluster.model`, e.g `cluster.KMeans` |
| model_kwargs | dict | Keyword arguments for sk-learn clustering model |
| **Returns** | **ndarray** | **Label assignments** |

We can use the `assign_cluster_labels` function to assign cluster labels. It accepts any scikit-learn clustering model, passed as the class itself (for example `cluster.KMeans`) rather than an instance, and any parameters for that model can be supplied through the `model_kwargs` argument.
The example below assigns labels to the iris dataset:

In [None]:
labels = assign_cluster_labels(X, cluster.KMeans, {"n_clusters": 4})
labels

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 3, 3, 3, 0, 3, 0, 3, 0, 3, 0, 0, 0, 0, 3, 0, 3,
       0, 0, 3, 0, 3, 0, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 3, 0, 3, 3, 3,
       0, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 2, 3, 2, 2, 2, 2, 0, 2, 2, 2,
       3, 3, 2, 3, 3, 2, 2, 2, 2, 3, 2, 3, 2, 3, 2, 2, 3, 3, 2, 2, 2, 2,
       2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 3, 3, 2, 3], dtype=int32)
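K-means starts from randomly chosen centroids, so the labels above (and which integer each cluster receives) can change from run to run. The sketch below shows one way to make the assignment repeatable by passing `random_state` through the keyword arguments, and how the same call pattern works with a different model. `fit_labels` is a hypothetical stand-in for `assign_cluster_labels` so that the block runs on its own, and the data is loaded as a plain array rather than a DataFrame to keep dependencies small.

```python
import numpy as np
from sklearn import cluster, datasets

# Iris measurements as a plain NumPy array of shape (150, 4)
X_arr = datasets.load_iris().data

def fit_labels(model_class, model_kwargs):
    """Hypothetical stand-in for `assign_cluster_labels`: build the model, then fit and predict."""
    return model_class(**model_kwargs).fit_predict(X_arr)

# Fixing `random_state` (and `n_init`) makes repeated K-means runs agree
kmeans_kwargs = {"n_clusters": 4, "random_state": 0, "n_init": 10}
labels_a = fit_labels(cluster.KMeans, kmeans_kwargs)
labels_b = fit_labels(cluster.KMeans, kmeans_kwargs)
same_labels = np.array_equal(labels_a, labels_b)

# Number of points assigned to each cluster
cluster_sizes = np.bincount(labels_a)

# Any other clustering class can be swapped in with its own keyword arguments
agglo_labels = fit_labels(cluster.AgglomerativeClustering, {"n_clusters": 4})

print(same_labels, cluster_sizes, np.bincount(agglo_labels))
```

The integer labels are arbitrary identifiers: two models can find the same grouping while numbering the clusters differently, so cluster sizes or a validity score are usually a better basis for comparison than the raw label arrays.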

## Silhouette Score -- Validity of Clusters

In [None]:
#| export
def calculate_silhouette_avg(
    data: Union[pd.DataFrame, np.ndarray, list], # The data that was used for the clustering
    cluster_labels: np.ndarray, # Cluster labels
) -> float: # Average silhouette score
    """Calculate the average silhouette score"""
    return silhouette_score(data, cluster_labels)

In [None]:
show_doc(calculate_silhouette_avg)

---

#### calculate_silhouette_avg

>      calculate_silhouette_avg
>                                (data:Union[pandas.core.frame.DataFrame,numpy.ndarray,list],
>                                cluster_labels:numpy.ndarray)

Calculate the average silhouette score

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| data | typing.Union[pandas.core.frame.DataFrame, numpy.ndarray, list] | The data that was used for the clustering |
| cluster_labels | ndarray | Cluster labels |
| **Returns** | **float** | **Average silhouette score** |

We can use the average [silhouette score](https://www.sciencedirect.com/science/article/pii/0377042787901257?via%3Dihub) as a measure of the validity of our clusters. It can be used to select an appropriate number of clusters. Values range from -1 to +1. A higher value indicates that points are, on average, close to the other members of their own cluster and far from the nearest other cluster. Values near 0 indicate overlapping clusters, and negative values indicate that many points may have been assigned to the wrong cluster.

Below is an example that applies `calculate_silhouette_avg` to the iris dataset and the cluster labels assigned above.

In [None]:
calculate_silhouette_avg(X, labels)

0.49805050499728776
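To make the score less of a black box, here is a minimal sketch that computes the average silhouette directly from its definition using only the standard library. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the members of the nearest other cluster, and the point's score is `(b - a) / max(a, b)`. The function name `silhouette_avg_from_definition` and the four-point toy dataset are illustrative assumptions, and the quadratic-time loop is meant for reading rather than for real workloads.

```python
from math import dist
from statistics import mean

def silhouette_avg_from_definition(points, labels):
    """Average silhouette score (Euclidean distance), written out from the definition."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("The silhouette score needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the *other* members of the point's own cluster
        # (the point itself contributes a distance of 0 to the sum)
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            mean(dist(point, q) for q in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)

# Toy data: two tight pairs of points that are far apart from each other
toy_points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good_score = silhouette_avg_from_definition(toy_points, [0, 0, 1, 1])
bad_score = silhouette_avg_from_definition(toy_points, [0, 1, 0, 1])
print(good_score, bad_score)
```

Grouping the nearby points together gives a score of roughly 0.8, while pairing each point with a distant one gives a negative score, which matches the interpretation of the range given earlier.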

## Grid Search

In [None]:
#| export
def grid_search(
    data_to_cluster: Union[pd.DataFrame, np.ndarray, list], # Data points to cluster
    cluster_models: list, # List of Sk-learn clustering models to iterate through. Each model must be specified in the format `cluster.model`, e.g `cluster.KMeans`
    model_kwargs_list: list, # List of dicts of keyword arguments for the sk-learn clustering models to iterate through.
    highlight: bool, # True to highlight highest avg_silhouette_score
    sort: bool, # True to sort by highest avg_silhouette_score
) -> pd.DataFrame: # Table showing avg_silhouette_score for each model and parameter specified
    """Perform grid search for the specified clustering models and parameters"""
    model_names = []
    model_parameters = []
    # Named in the plural so the imported `silhouette_score` function is not shadowed
    silhouette_scores = []
    for model, model_kwargs in tqdm(zip(cluster_models, model_kwargs_list)):
        # Expand the dict of candidate values into every parameter combination
        for parameters in tqdm(ParameterGrid(model_kwargs)):
            labels = assign_cluster_labels(
                data_to_cluster,
                model,
                parameters
            )
            avg_silhouette_score = calculate_silhouette_avg(
                data_to_cluster,
                labels
            )
            model_names.append(model.__name__)  # class name, e.g. "KMeans"
            model_parameters.append(parameters)
            silhouette_scores.append(avg_silhouette_score)
    results = pd.DataFrame.from_dict({
        "cluster_model": model_names,
        "model_params": model_parameters,
        "avg_silhouette_score": silhouette_scores
    })
    if sort:
        results = results.sort_values("avg_silhouette_score", ascending=False)
    if highlight:
        results = results.style.highlight_max(
            subset = ["avg_silhouette_score"],
            color = 'lightgreen', axis = 0
        )
    return results

In [None]:
show_doc(grid_search)

---

#### grid_search

>      grid_search
>                   (data_to_cluster:Union[pandas.core.frame.DataFrame,numpy.nda
>                   rray,list], cluster_models:list, model_kwargs_list:list,
>                   highlight:bool, sort:bool)

Perform grid search for the specified clustering models and parameters

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| data_to_cluster | typing.Union[pandas.core.frame.DataFrame, numpy.ndarray, list] | Data points to cluster |
| cluster_models | list | List of Sk-learn clustering models to iterate through. Each model must be specified in the format `cluster.model`, e.g `cluster.KMeans` |
| model_kwargs_list | list | List of dicts of keyword arguments for the sk-learn clustering models to iterate through. |
| highlight | bool | True to highlight highest avg_silhouette_score |
| sort | bool | True to sort by highest avg_silhouette_score |
| **Returns** | **DataFrame** | **Table showing avg_silhouette_score for each model and parameter specified** |

We can use the `grid_search` function to find the highest silhouette score across different scikit-learn clustering models and different parameters. An example is below.

The `cluster_models` and `model_kwargs_list` arguments passed to `grid_search` should take the form below. The two lists are paired by position, so the first dict applies to the first model, the second dict to the second model, and so on. Each dict maps a parameter name to the list of values to try for that parameter.

In [None]:
cluster_models = [
    cluster.KMeans,
    cluster.AffinityPropagation
]

model_kwargs_list = [
    {"n_clusters": [2, 3, 4], "init": ["k-means++", "random"]},
    {"damping": [0.6, 0.7, 0.8]}
]
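Each dict in `model_kwargs_list` is expanded into every combination of its values before the models are fitted. The sketch below mimics that expansion with `itertools.product` so the number of fits is easy to see. `expand_grid` is a hypothetical helper written for illustration, not part of this module; inside `grid_search` the expansion is done by scikit-learn's `ParameterGrid`.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every combination of the candidate values in `param_grid` as a list of dicts."""
    keys = sorted(param_grid)  # fixed key order so the output is predictable
    value_lists = [param_grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]

# Same grids as in the example above
example_grids = [
    {"n_clusters": [2, 3, 4], "init": ["k-means++", "random"]},
    {"damping": [0.6, 0.7, 0.8]},
]

combinations_per_model = [expand_grid(grid) for grid in example_grids]
total_fits = sum(len(combos) for combos in combinations_per_model)

for combos in combinations_per_model:
    print(len(combos), combos[0])
print("total fits:", total_fits)
```

With these grids the first model is fitted 3 x 2 = 6 times and the second 3 times, 9 fits in total. Since the count is the product of the list lengths within each dict, adding values to several parameters at once grows the search quickly.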

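We can then use the `grid_search` function to search over the models and kwargs for the highest silhouette score. With `highlight=True`, the returned object is a pandas `Styler` wrapping the results table rather than a plain `DataFrame`: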

In [None]:
grid_search(
    data_to_cluster=X,
    cluster_models=cluster_models,
    model_kwargs_list=model_kwargs_list,
    sort=True,
    highlight=True
)


|    | cluster_model | model_params | avg_silhouette_score |
| -- | ------------- | ------------ | -------------------- |
| 0 | KMeans | {'init': 'k-means++', 'n_clusters': 2} | 0.681046 |
| 3 | KMeans | {'init': 'random', 'n_clusters': 2} | 0.681046 |
| 1 | KMeans | {'init': 'k-means++', 'n_clusters': 3} | 0.552819 |
| 4 | KMeans | {'init': 'random', 'n_clusters': 3} | 0.552819 |
| 5 | KMeans | {'init': 'random', 'n_clusters': 4} | 0.498051 |
| 2 | KMeans | {'init': 'k-means++', 'n_clusters': 4} | 0.497455 |
| 7 | AffinityPropagation | {'damping': 0.7} | 0.474338 |
| 8 | AffinityPropagation | {'damping': 0.8} | 0.468801 |
| 6 | AffinityPropagation | {'damping': 0.6} | 0.345462 |
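If the results are needed programmatically rather than for display, one option is to call `grid_search` with `highlight=False` so that a plain `DataFrame` comes back, and then select rows with ordinary pandas operations. The sketch below assumes that shape of result. To stay self-contained it rebuilds a three-row table by hand, with the values copied from the output above, instead of calling `grid_search`.

```python
import pandas as pd

# Hand-built stand-in for a `grid_search(..., highlight=False)` result,
# using three rows copied from the table above
results = pd.DataFrame({
    "cluster_model": ["KMeans", "KMeans", "AffinityPropagation"],
    "model_params": [
        {"init": "k-means++", "n_clusters": 2},
        {"init": "k-means++", "n_clusters": 3},
        {"damping": 0.7},
    ],
    "avg_silhouette_score": [0.681046, 0.552819, 0.474338],
})

# Row with the highest score overall
best = results.loc[results["avg_silhouette_score"].idxmax()]

# Best-scoring row for each model type
best_index_per_model = results.groupby("cluster_model")["avg_silhouette_score"].idxmax()
best_per_model = results.loc[best_index_per_model]

print(best["cluster_model"], best["model_params"])
print(best_per_model)
```

The `model_params` entry of the selected row is an ordinary dict, so it can be passed straight back as the `model_kwargs` argument of `assign_cluster_labels` to refit the chosen configuration.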
