# Create Binary Dataset

The first step is to build the binary data set. Let $X^{(b)} = \{x^{(b)}_l \mid 1 \le l \le n\}$ be a binary data set derived from the set of $r$ basic partitionings as follows, where $L_i(x_l)$ is the cluster label that the $i$-th basic partitioning assigns to object $x_l$ and $K_i$ is the number of clusters in that partitioning:

$$
\begin{aligned}
x^{(b)}_l &= \left( x^{(b)}_{l,1}, \ldots, x^{(b)}_{l,i}, \ldots, x^{(b)}_{l,r} \right), \\
x^{(b)}_{l,i} &= \left( x^{(b)}_{l,i1}, \ldots, x^{(b)}_{l,ij}, \ldots, x^{(b)}_{l,iK_i} \right), \\
x^{(b)}_{l,ij} &= \begin{cases} 1, & \text{if } L_i(x_l) = j \\ 0, & \text{otherwise} \end{cases}
\end{aligned}
$$
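
As a small illustration of this definition, the sketch below builds the binary data set for four objects and two basic partitionings. The label arrays are made up for the example, and the labels are 0-based (NumPy convention) rather than 1-based as in the formula.

```python
import numpy as np

# Hypothetical labels for n = 4 objects from r = 2 basic partitionings
labels_1 = np.array([0, 1, 1, 0])  # K_1 = 2 clusters
labels_2 = np.array([2, 0, 1, 2])  # K_2 = 3 clusters

# One block of K_i indicator columns per basic partitioning
block_1 = np.eye(2, dtype=int)[labels_1]
block_2 = np.eye(3, dtype=int)[labels_2]

# Concatenate the blocks column-wise: shape (n, K_1 + K_2)
X_b_demo = np.hstack([block_1, block_2])
print(X_b_demo)
```

Each row contains exactly $r$ ones, one per basic partitioning, and the number of columns is $\sum_i K_i$.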

## Load Iris Dataset

In [43]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import os
# Move to the project root when the notebook is run from the notebook folder
if "notebook" in os.getcwd():
    os.chdir("..")
import config
from tqdm import tqdm
# np.random.seed(42)

# Load data
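X = pd.read_csv(os.path.join(config.DATA_FOLDER, 'iris.csv'))
X = X.sample(frac=1).reset_index(drop=True)  # shuffle the rows (unseeded, so the order changes per run)
y = X.pop('target')  # ground-truth labels
X = (X - X.mean()) / X.std()  # standardize each feature (z-score)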

print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (150, 4)
y shape: (150,)


## Create Basic Partitionings

- Basic partitionings (BPs) are generated with k-means using squared Euclidean distance, the setting used for the UCI data sets.
- Number of clusters:
  - [Default] Random Parameter Selection (RPS): the number of clusters of each BP is drawn at random from the interval $[K, \sqrt{n}]$.
  - Random Feature Selection (RFS): two features are selected at random for each BP, and the number of clusters for k-means is set to $K$.
- For each data set, 100 BPs are typically generated for consensus clustering (i.e., $r = 100$), and all BPs have the same weight.
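
The two strategies differ only in what is randomized for each BP. The sketch below illustrates that choice in isolation; the helper name `sample_bp_setting` and its arguments are hypothetical (not part of the notebook), and it assumes that $\sqrt{n}$ is rounded down and that $K \le \lfloor\sqrt{n}\rfloor$.

```python
import numpy as np

def sample_bp_setting(strategy, n, n_features, K, rng):
    """Return (k, feature_idx) for one basic partitioning (hypothetical helper)."""
    if strategy == "RPS":
        # Draw k from the integers in [K, floor(sqrt(n))] and keep all features.
        # The upper bound of rng.integers is exclusive, hence the +1.
        k = int(rng.integers(K, int(np.sqrt(n)) + 1))
        feature_idx = np.arange(n_features)
    elif strategy == "RFS":
        # Keep k = K and pick two distinct features at random
        k = K
        feature_idx = np.sort(rng.choice(n_features, size=2, replace=False))
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return k, feature_idx

rng = np.random.default_rng(0)
k_rps, feats_rps = sample_bp_setting("RPS", n=150, n_features=4, K=3, rng=rng)
k_rfs, feats_rfs = sample_bp_setting("RFS", n=150, n_features=4, K=3, rng=rng)
```

The returned column indices would then be used to subset the data before running k-means, for example `X.iloc[:, feature_idx]` for a pandas DataFrame.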

In [44]:
def one_hot_encode(labels, num_classes):
    one_hot = np.zeros((len(labels), num_classes))
    one_hot[np.arange(len(labels)), labels] = 1
    return one_hot

r = config.R
n = len(X)
classes = len(y.unique())

max_k = int(n**0.5) + 1
ls_partitions = []
ls_partitions_labels = []
total_k = 0
for i in tqdm(range(r), desc="Clustering Progress"):
    k = np.random.randint(classes, max_k)  # upper bound is exclusive, so k lies in [classes, floor(sqrt(n))]
    partition_i = KMeans(n_clusters=k, n_init=10, init='k-means++').fit(X).labels_
    ls_partitions_labels.append(partition_i)
    ls_partitions.append(one_hot_encode(partition_i, k))
    total_k += k

X_b = np.hstack(ls_partitions)

print("X_b shape:", X_b.shape)

assert X_b.shape == (n, total_k), "Error in shape of X_b"

Clustering Progress: 100%|██████████| 100/100 [00:05<00:00, 19.06it/s]

X_b shape: (150, 739)





In [45]:
X_b

array([[1., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 1., 0., 0.],
       ...,
       [0., 1., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

In [46]:
ls_partitions_labels[:2]

[array([0, 2, 1, 1, 4, 5, 1, 2, 3, 3, 5, 0, 4, 2, 0, 0, 1, 4, 2, 4, 2, 4,
        1, 2, 2, 5, 1, 2, 4, 2, 5, 4, 4, 2, 0, 5, 0, 2, 6, 1, 0, 1, 6, 0,
        1, 2, 1, 4, 4, 5, 3, 6, 0, 1, 0, 1, 1, 6, 4, 5, 2, 2, 5, 0, 2, 1,
        3, 2, 1, 4, 3, 1, 2, 0, 5, 6, 2, 5, 1, 0, 1, 2, 6, 0, 3, 5, 4, 1,
        5, 0, 4, 4, 0, 1, 2, 2, 3, 3, 5, 2, 0, 4, 1, 2, 5, 4, 0, 0, 4, 1,
        1, 6, 2, 3, 2, 0, 2, 5, 4, 0, 5, 0, 0, 0, 2, 0, 0, 0, 5, 3, 4, 2,
        2, 0, 2, 0, 5, 2, 3, 2, 2, 4, 2, 2, 2, 1, 3, 1, 0, 2]),
 array([1, 0, 2, 3, 0, 2, 2, 0, 1, 1, 2, 1, 0, 1, 1, 1, 2, 0, 0, 0, 0, 0,
        2, 0, 1, 2, 3, 1, 0, 0, 2, 0, 0, 0, 1, 2, 1, 0, 3, 3, 1, 3, 3, 1,
        3, 0, 3, 0, 0, 2, 1, 3, 1, 3, 1, 2, 3, 3, 0, 2, 0, 0, 2, 1, 0, 3,
        1, 0, 2, 0, 1, 3, 0, 1, 2, 3, 0, 2, 3, 1, 3, 1, 3, 1, 1, 2, 0, 3,
        2, 1, 0, 0, 1, 2, 0, 0, 1, 1, 2, 1, 1, 0, 2, 0, 2, 0, 1, 1, 0, 3,
        3, 3, 0, 1, 0, 1, 0, 2, 0, 1, 2, 1, 1, 1, 0, 1, 1, 1, 2, 1, 0, 1,
        0, 1, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 3

# Cluster the Binary Dataset

- Clustering tools: three types of consensus clustering methods were employed for comparison, namely the K-means-based algorithm (KCC), the graph partitioning algorithm (GP), and the hierarchical algorithm (HCC).
**In our work, only KCC is used, and its results are compared with those reported in the paper.**

- Define utility functions



**Compare different methods for contingency table**

In [47]:
from scipy.stats.contingency import crosstab
from sklearn.metrics.cluster import contingency_matrix
# pi = KMeans(n_clusters=classes, n_init=10, init='random').fit(X_b).labels_

In [48]:
# %%timeit
# data_crosstab = pd.crosstab(pi, 
# 							ls_partitions_labels[0], 
# 							margins = False)
# 2.89 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [49]:
# %%timeit
# contingency_matrix(pi, ls_partitions_labels[0])
# 75.5 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [50]:
# %%timeit
# crosstab(pi, ls_partitions_labels[0]).count
# 22.7 µs ± 347 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

**Extract Information from Contingency Table**

The information that we need to extract from the contingency table is:
- $p_{kj}^{(i)}$ = the fraction of all objects that fall in consensus cluster $k$ and in cluster $j$ of partition $i$
- $p_k$ (i.e. $p_{k+}$) = the fraction of objects in consensus cluster $k$
- $P_k^{(i)}$ = the vector with entries $p_{kj}^{(i)}/p_k$, i.e. the distribution over the clusters of partition $i$ within consensus cluster $k$
- $P^{(i)}$ = the vector with the fraction of objects in each cluster of partition $i$

In [51]:
# Calculate contingency matrix (n)
pi = KMeans(n_clusters=classes, n_init=10, init='k-means++').fit(X_b).labels_

cont_i = crosstab(pi, ls_partitions_labels[0]).count
cont_i


array([[ 0,  0, 35,  0, 21,  0,  0],
       [30,  0,  2, 12,  0,  0,  0],
       [ 0, 25,  0,  0,  0, 18,  7]])

In [52]:
# Get p-contingency matrix
p_cont_i = cont_i / cont_i.sum()
p_cont_i

array([[0.        , 0.        , 0.23333333, 0.        , 0.14      ,
        0.        , 0.        ],
       [0.2       , 0.        , 0.01333333, 0.08      , 0.        ,
        0.        , 0.        ],
       [0.        , 0.16666667, 0.        , 0.        , 0.        ,
        0.12      , 0.04666667]])

In [53]:
# Calculate the p_k for each cluster in PI
p_k = p_cont_i.sum(axis=1)
p_k = p_k.reshape(-1, 1)
p_k

array([[0.37333333],
       [0.29333333],
       [0.33333333]])

In [54]:
# Get the P_k^i for each cluster in the partition i
P_k_i = p_cont_i / p_k
P_k_i

array([[0.        , 0.        , 0.625     , 0.        , 0.375     ,
        0.        , 0.        ],
       [0.68181818, 0.        , 0.04545455, 0.27272727, 0.        ,
        0.        , 0.        ],
       [0.        , 0.5       , 0.        , 0.        , 0.        ,
        0.36      , 0.14      ]])

In [55]:
# This is a constant and it's the ratio of points in each cluster in pi_i
P_i = p_cont_i.sum(axis=0)
P_i

array([0.2       , 0.16666667, 0.24666667, 0.08      , 0.14      ,
       0.12      , 0.04666667])

In [56]:
# Squared L2 norm of each row of P_k^i
norm_P_k_i = np.linalg.norm(P_k_i, axis=1, ord=2).reshape(-1, 1) ** 2
norm_P_k_i

array([[0.53125   ],
       [0.54132231],
       [0.3992    ]])

In [57]:
norm_P_i = np.linalg.norm(P_i, ord=2) ** 2
norm_P_i

0.17120000000000002

In [58]:
p_k * norm_P_k_i

array([[0.19833333],
       [0.15878788],
       [0.13306667]])

In [59]:
utility = (p_k * norm_P_k_i).sum() - norm_P_i
utility

0.31898787878787876
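
Written out, the quantity computed step by step in the cells above is the following, where $K$ denotes the number of consensus clusters and the name $U_c$ is taken from the notebook's `compute_U_c_i` function:

$$
U_c(\pi, \pi_i) = \sum_{k=1}^{K} p_k \left\| P_k^{(i)} \right\|_2^2 - \left\| P^{(i)} \right\|_2^2
$$

With the values printed above, this gives $0.19833333 + 0.15878788 + 0.13306667 - 0.1712 \approx 0.31898788$.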

In [60]:
# from sklearn.metrics.cluster import adjusted_rand_score
# adjusted_rand_score(y, )

**Average the utility to get the final consensus**

In [61]:
ls_partitions[0].shape

(150, 7)

In [62]:
def extract_p(pi, pi_i):
    '''
    Extract from the contingency table of pi (consensus) and pi_i (partition i):
        - p_kj^(i): fraction of all objects in consensus cluster k and in cluster j of partition i
        - p_k (p_k+): fraction of objects in consensus cluster k
        - P_k^(i): vector with entries p_kj^(i) / p_k
        - P^(i): vector with the fraction of objects in each cluster of partition i
    '''
    # Calculate contingency matrix (n)
    cont_i = crosstab(pi, pi_i).count

    # Get p-contingency matrix
    p_cont_i = cont_i / cont_i.sum()

    # Calculate the p_k for each cluster in PI
    p_k = p_cont_i.sum(axis=1)
    p_k = p_k.reshape(-1, 1)

    # Get the P_k^(i)
    P_k_i = p_cont_i / p_k

    # This is a constant and it's the ratio of points in each cluster in pi_i
    P_i = p_cont_i.sum(axis=0)

    return p_k, P_k_i, P_i
    
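def compute_U_c_i(p_k, P_k_i, P_i):
    # U_c uses squared L2 norms, as in the step-by-step cells above.
    # The mean recorded below comes from a run without the squares and will differ on re-run.
    norm_P_k_i = np.linalg.norm(P_k_i, axis=1, ord=2).reshape(-1, 1) ** 2
    norm_P_i = np.linalg.norm(P_i, ord=2) ** 2
    utility = (p_k * norm_P_k_i).sum() - norm_P_i
    return utility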

ls_utilities = []
for pi_i in ls_partitions_labels:
    p_k, P_k_i, P_i = extract_p(pi, pi_i)
    utility = compute_U_c_i(p_k, P_k_i, P_i)
    ls_utilities.append(utility)

np.mean(ls_utilities)

0.2630467145245428
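
As a cross-check, the per-partition utility can also be computed directly from a contingency table in a few NumPy lines. The sketch below is only an illustration: the helper name `utility_from_contingency` is hypothetical, it uses squared L2 norms as in the step-by-step cells, and it assumes that no consensus cluster is empty (otherwise the row normalization divides by zero). The table is the one printed earlier for the first basic partitioning.

```python
import numpy as np

def utility_from_contingency(cont):
    """U_c for one basic partitioning, from a (K x K_i) table of counts."""
    p = cont / cont.sum()            # joint proportions p_kj
    p_k = p.sum(axis=1)              # consensus-cluster proportions p_k
    p_j = p.sum(axis=0)              # proportions P^(i) of the basic partitioning
    P_k = p / p_k[:, None]           # row-wise conditional distributions P_k^(i)
    return float((p_k * (P_k ** 2).sum(axis=1)).sum() - (p_j ** 2).sum())

# Contingency table printed earlier for the first basic partitioning
cont_demo = np.array([
    [0, 0, 35, 0, 21, 0, 0],
    [30, 0, 2, 12, 0, 0, 0],
    [0, 25, 0, 0, 0, 18, 7],
])
u_demo = utility_from_contingency(cont_demo)
print(round(u_demo, 6))  # 0.318988
```

Averaging this value over all $r$ basic partitionings, with equal weights, gives the consensus score.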

In [63]:
from sklearn.metrics.cluster import adjusted_rand_score
adjusted_rand_score(pi, y)

0.5923326221845838