# Example of a practical use case

We present in this notebook a practical use case. Concretely, we will:
1. Create a synthetic mixed dataset
2. Compute the meta-features of the dataset
3. Load one of the pre-trained meta-learners (namely _KNN_ for K-Medoids algorithm) and the scaler
4. Predict the ranking of similarity measures pair (that can be computed on the dataset)
5. Run the K-Medoids algorithm with the 5 top ranked pairs and compare their results with the literature baseline (*manhattan_hamming*)

### Imports

In [212]:
import numpy as np
import pandas as pd
import pickle
from mixed_metrics import get_valid_similarity_pairs
from meta_features import compute_meta_features
from sklearn.datasets import make_blobs
from mixed_metrics import WeightedAverage
from sklearn.preprocessing import OneHotEncoder, minmax_scale
from base_metrics import get_available_metrics
from kmedoids import fasterpam
from experiments.utils import get_score

### Create a synthetic dataset

We first create a numeric dataset using the `make_blobs` function in _scikit-learn_. Then, we transform some numeric attributes into categorical ones by discretizing their values.

We also create a one-hot encoding representation of the categorical attributes that will be used by binary similarity measures. One-hot encoding consist in transforming each categorical attribute in several binary attributes, each one corresponding to one category of the transformed categorical attribute.

In [210]:
def discretize(x, n):
    """Discretize a list/array x by dividing its values into n intervals. 
    Each value in x is then associated with a category corresponding to one of the intervals.

    Parameters
    ----------
    x : list / 1D array
        The values to discretize
    n : int
        The number of categories

    Returns
    -------
    numpy array
        The discretized values
    """
    eps=1e-5
    bins = np.linspace(min(x), max(x)+eps, n+1)
    x_discrete = np.digitize(x, bins) - 1
    permutation = np.random.permutation(n)
    return permutation[x_discrete]

# Create a synthetic dataset with the following parameters
n_clusters = 5
n_att = 3
c_att = 7
n_feats = n_att + c_att
n_samples = 200

# We use the make_blobs function in sklearn to create an initial dataset with the total number of features
X, y = make_blobs(
    centers=n_clusters,
    n_samples=n_samples,
    n_features=n_feats,
    cluster_std=5,
    random_state=0
)
y = y.flatten()

# The we separate the numeric and categorical parts
Xnum = X[:, :n_att]
Xcat = np.zeros(shape=(n_samples, c_att))

# Finally we discretize the categorical attributes
for j in range(c_att):
    n_cat = np.random.randint(2, 10)
    Xcat[:, j] = discretize(X[:, n_att+j], n_cat)

# We create the one hot encoding representation of the Xcat which will be used with binary similarity measures
enc = OneHotEncoder(handle_unknown='ignore')
Xdummy = enc.fit_transform(Xcat).toarray()

### Compute the meta-feature vector of the dataset

To compute the meta-features vector, you can simply use the `compute_meta_features` function in [meta_features.py](meta_features.py).

In [None]:
# Important: Normalize the numeric part before computing the meta-features or performing clustering
Xnum = minmax_scale(Xnum)

# create the meta-features vector of your dataset
mf_vector = compute_meta_features(Xnum, Xcat)

### Load the pre-trained model and the scaler

In [190]:
# load the KNN model
with open("models/KMedoids/KNN.pickle", "rb") as f:
    ranker = pickle.load(f)

# load the scaler. The scaler is used to transform the meta-feature vector before passing it to the meta-learner.
with open("models/KMedoids/scaler.pickle", "rb") as f:
    scaler = pickle.load(f)

### Predict the ranking of the similarity measures pairs

In [186]:
# get the valid similarity measures pairs the dataset
valid_pairs = get_valid_similarity_pairs(Xnum, Xcat)

# predict the ranks/scores of all similarity measures pairs
y_pred = ranker.predict(scaler.transform([mf_vector]))[0]

# get a ranked list of similarity measures pairs
ranked_pairs = ranker.similarity_pairs_[np.argsort(-y_pred)]

# keep only valid similarity measures pairs for your dataset
ranked_pairs = [sim_pair for sim_pair in ranked_pairs if sim_pair in valid_pairs]
print(ranked_pairs[:5])

['cosine_eskin', 'sqeuclidean_co-oc', 'manhattan_jaccard', 'lorentzian_jaccard', 'chebyshev_jaccard']


In [204]:
# for pair_name in ranked_pairs[:5]:
#     print(pair_name)
#     X = np.c_[Xnum, Xcat] if pair_name.split("_")[1] in \
#         get_available_metrics(data_type="categorical") else np.c_[Xnum, Xdummy]
#     weights = np.linspace(0, 1, 11)
#     plt.figure(figsize=(len(weights)*3, 3))
#     for i, w in tqdm(enumerate(weights)):
#         plt.subplot(1, len(weights), i+1)
#         m = WeightedAverage(pair_name, w=w)
#         m.fit(X, categorical=np.arange(Xnum.shape[1], X.shape[1]))
#         D = m.pairwise(X, categorical=np.arange(Xnum.shape[1], X.shape[1]))
#         Xemb = MDS(dissimilarity="precomputed", normalized_stress="auto").fit_transform(D)
#         # Xemb = TSNE(metric="precomputed", init='random').fit_transform(D)
#         plt.scatter(Xemb[:,0], Xemb[:,1], c=y, s=30)
#         plt.xticks([])
#         plt.yticks([])
#     plt.tight_layout(w_pad=0)
#     plt.show()

### Run K-Medoids with top ranked pairs and compare to the literature baseline

In [None]:
def run_kmedoids(D, n_clusters, n_init=10):
    """Run n_init times the K-Medoids algorithm with random initialization and return the result that minimize intra-cluster distances.
    We consider the fasterpam version of K-Medoids implemented in the kmedoids library.

    Parameters
    ----------
    D : 2D numpy array
        The pairwise dissimilarity matrix
    n_clusters : int
        Number of clusters
    n_init : int, optional
        Number of random initialization, by default 10

    Returns
    -------
    list
        The cluster labels of all samples
    """
    final_clusters = None
    best_score = np.inf
    for random_state in range(n_init):
        try:
            res = fasterpam(D, n_clusters, random_state=random_state, n_cpu=1)
            clusters = res.labels
            score = res.loss
            if score < best_score:
                final_clusters = [val for val in clusters]
        except:
            print(f"Error : k-medoids; random_state={random_state}")
    return final_clusters

We run K-Medoids using the literature baseline and the top ranked similarity measures pairs. For each pair we run the algorithm using different weights for the combination of the two similarity measures of the pair. We show the clustering accuracy score obtained with the best weight.

In [207]:
best_scores = []
labels = ["manhattan_hamming"] + ranked_pairs[:5]

# For each similarity measures pair
for pair_name in labels:
    # print(f"{pair_name}:", end=" ")
    X = np.c_[Xnum, Xcat] if pair_name.split("_")[1] in \
        get_available_metrics(data_type="categorical") else np.c_[Xnum, Xdummy]
    weights = np.linspace(0, 1, 21)
    scores = []
    
    # We run the algorithm for different weight
    for i, w in enumerate(weights):
        m = WeightedAverage(pair_name, w=w)
        m.fit(X, categorical=np.arange(Xnum.shape[1], X.shape[1]))
        D = m.pairwise(X, categorical=np.arange(Xnum.shape[1], X.shape[1]))
        clusters = run_kmedoids(D, n_clusters=n_clusters)
        scores.append(get_score(y, clusters, eval_metric="acc"))

    # We store the best accuracy score
    best_score = max(scores)
    # print(best_score)
    best_scores.append(best_score)
pd.DataFrame(index=labels, data=best_scores, columns=["score"])

Unnamed: 0,score
manhattan_hamming,0.6
cosine_eskin,0.59
sqeuclidean_co-oc,0.74
manhattan_jaccard,0.59
lorentzian_jaccard,0.605
chebyshev_jaccard,0.595


We can observe that in this use case, one of the top ranked pairs has an accuracy of 0.74 while the literature baseline has only 0.6