# Example of a practical use case

We present in this notebook a practical use case. Concretely, we will:
1. Create a synthetic mixed dataset
2. Compute the meta-features of the dataset
3. Load one of the pre-trained meta-learners (namely _KNN_ for K-Prototypes algorithm) and the scaler
4. Use the loaded meta-learner to predict the ranking of similarity measures pair.
5. Run the K-Prototypes algorithm with the 5 top ranked pairs and compare their results with the literature baseline (*squared euclidean distance*, *hamming distance*)

### Imports

In [9]:
import numpy as np
import pandas as pd
import pickle
from mixed_metrics import get_valid_similarity_pairs
from meta_features import compute_meta_features
from sklearn.datasets import make_blobs
from mixed_metrics import WeightedAverage
from sklearn.preprocessing import OneHotEncoder, minmax_scale
from base_metrics import get_available_metrics, get_metric
from kmodes.kprototypes import KPrototypes
from experiments.utils import get_score

### Create a synthetic dataset

We first create a numeric dataset using the `make_blobs` function in _scikit-learn_. Then, we transform some numeric attributes into categorical ones by discretizing their values.

We also create a one-hot encoding representation of the categorical attributes that will be used by binary similarity measures. One-hot encoding consist in transforming each categorical attribute in several binary attributes, each one corresponding to one category of the transformed categorical attribute.

In [2]:
def discretize(x, n):
    """Discretize a list/array x by dividing its values into n intervals. 
    Each value in x is then associated with a category corresponding to one of the intervals.

    Parameters
    ----------
    x : list / 1D array
        The values to discretize
    n : int
        The number of categories

    Returns
    -------
    numpy array
        The discretized values
    """
    eps=1e-5
    bins = np.linspace(min(x), max(x)+eps, n+1)
    x_discrete = np.digitize(x, bins) - 1
    permutation = np.random.permutation(n)
    return permutation[x_discrete]

# Create a synthetic dataset with the following parameters
n_clusters = 5
n_att = 3
c_att = 7
n_feats = n_att + c_att
n_samples = 200

# We use the make_blobs function in sklearn to create an initial dataset with the total number of features
X, y = make_blobs(
    centers=n_clusters,
    n_samples=n_samples,
    n_features=n_feats,
    cluster_std=5,
    random_state=0
)
y = y.flatten()

# The we separate the numeric and categorical parts
Xnum = X[:, :n_att]
Xcat = np.zeros(shape=(n_samples, c_att))

# Finally we discretize the categorical attributes
for j in range(c_att):
    n_cat = np.random.randint(2, 10)
    Xcat[:, j] = discretize(X[:, n_att+j], n_cat)

# We create the one hot encoding representation of the Xcat which will be used with binary similarity measures
enc = OneHotEncoder(handle_unknown='ignore')
Xdummy = enc.fit_transform(Xcat).toarray()

### Compute the meta-feature vector of the dataset

To compute the meta-features vector, you can simply use the `compute_meta_features` function in [meta_features.py](meta_features.py).

In [3]:
# Important: Normalize the numeric part before computing the meta-features or performing clustering
Xnum = minmax_scale(Xnum)

# create the meta-features vector of your dataset
mf_vector = compute_meta_features(Xnum, Xcat)

### Load the pre-trained model and the scaler

In [4]:
# load the KNN model
with open("models/KPrototypes/KNN.pickle", "rb") as f:
    ranker = pickle.load(f)

# load the scaler. The scaler is used to transform the meta-feature vector before passing it to the meta-learner.
with open("models/KPrototypes/scaler.pickle", "rb") as f:
    scaler = pickle.load(f)

### Predict the ranking of the similarity measures pairs

In [5]:
# get the valid similarity measures pairs the dataset
valid_pairs = get_valid_similarity_pairs(Xnum, Xcat)

# predict the ranks/scores of all similarity measures pairs
y_pred = ranker.predict(scaler.transform([mf_vector]))[0]

# get a ranked list of similarity measures pairs
ranked_pairs = ranker.similarity_pairs_[np.argsort(-y_pred)]

# keep only valid similarity measures pairs for your dataset
ranked_pairs = [sim_pair for sim_pair in ranked_pairs if sim_pair in valid_pairs]
print(ranked_pairs[:5])

['canberra_sokalsneath', 'divergence_sokalsneath', 'divergence_jaccard', 'canberra_jaccard', 'canberra_dice']


### Run K-Prototypes using the top ranked pairs and compare to the literature baseline

We run K-Prototypes using the literature baseline and the 5 top ranked similarity measures pairs. For each pair, we run the algorithm using different weights for the combination of the two similarity measures of the pair, and show the clustering accuracy score obtained with the best weight.

In [31]:
literature_baseline = "sqeuclidean_hamming"

X = np.c_[Xnum, Xcat]
num_metric = get_metric(literature_baseline.split("_")[0]).fit(Xnum)
cat_metric = get_metric(literature_baseline.split("_")[0]).fit(Xcat)
weights = np.concatenate((np.linspace(0, 1, 6), np.arange(2, 10)))
scores = []
# We run the algorithm for different weight
for i, gamma in enumerate(weights):
    kp = KPrototypes(
        n_clusters=n_clusters,
        gamma=gamma, 
        num_dissim=num_metric.flex,
        cat_dissim=cat_metric.flex,
        random_state=0,
        n_init=10,
        init='huang',
        n_jobs=-1
    )
    clusters = kp.fit_predict(X, categorical=list(range(Xnum.shape[1], X.shape[1])))
    scores.append(get_score(y, clusters, eval_metric="acc"))

best_score = max(scores)
print("Score of the literature baseline:", best_score)

Score of the literature baseline: 0.55


In [33]:
best_scores = {}
# For each of the 5 top ranked similarity measures pairs
for k, pair_name in enumerate(ranked_pairs[:5]):
    # print(f"{pair_name}:", end=" ")
    X = np.c_[Xnum, Xcat] if pair_name.split("_")[1] in \
        get_available_metrics(data_type="categorical") else np.c_[Xnum, Xdummy]
    num_metric = get_metric(pair_name.split("_")[0]).fit(Xnum)
    cat_metric = get_metric(pair_name.split("_")[0]).fit(X[:, Xnum.shape[1]:])
    weights = list(np.linspace(0, 0.9, 10)) + list(np.linspace(1, 10, 19))
    scores = []
    
    # We run the algorithm for different weights
    for i, gamma in enumerate(weights):
        kp = KPrototypes(
            n_clusters=n_clusters,
            gamma=gamma, 
            num_dissim=num_metric.flex,
            cat_dissim=cat_metric.flex,
            random_state=0,
            n_init=10,
            init='huang',
            n_jobs=-1
        )
        clusters = kp.fit_predict(X, categorical=list(range(Xnum.shape[1], X.shape[1])))
        scores.append(get_score(y, clusters, eval_metric="acc"))

    # And store the best accuracy score
    best_score = max(scores)
    # print(best_score)
    best_scores[k+1] = {
        "Name": pair_name,
        "Score": best_score
    }
df = pd.DataFrame.from_dict(best_scores, orient="index")
print("Scores of the 5 top ranked similarity measure pairs")
df

Scores of the 5 top ranked similarity measure pairs


Unnamed: 0,Name,Score
1,canberra_sokalsneath,0.62
2,divergence_sokalsneath,0.615
3,divergence_jaccard,0.615
4,canberra_jaccard,0.62
5,canberra_dice,0.62


We can observe that in this use case, the top ranked pairs has an accuracy of 0.62 while the literature baseline has only 0.55