# Create Binary Dataset

The first step is to build the binary data set. Let $X^{(b)} = \{x^{(b)}_l \mid 1 \le l \le n\}$ be a binary data set derived from the set of $r$ basic partitionings as follows:

$$
\begin{aligned}
x^{(b)}_l &= \left( x^{(b)}_{l,1}, \ldots, x^{(b)}_{l,i}, \ldots, x^{(b)}_{l,r} \right) \\
x^{(b)}_{l,i} &= \left( x^{(b)}_{l,i1}, \ldots, x^{(b)}_{l,ij}, \ldots, x^{(b)}_{l,iK_i} \right) \\
x^{(b)}_{l,ij} &= \begin{cases} 1, & \text{if } L_i(x_l) = j \\ 0, & \text{otherwise} \end{cases}
\end{aligned}
$$
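
As a small illustration of the definition above, the sketch below builds a binary data set from two toy label vectors. The helper name `build_binary_dataset` and the toy data are assumptions made for this example only; labels are taken to be 0-based (`0..K_i-1`), whereas the formula indexes clusters from 1.

```python
import numpy as np

def build_binary_dataset(partitions):
    """Stack one-hot encodings of r label vectors into an (n, sum of K_i) matrix."""
    blocks = []
    for labels in partitions:
        labels = np.asarray(labels)
        k_i = labels.max() + 1  # assumes labels are 0..K_i-1
        block = np.zeros((labels.size, k_i))
        block[np.arange(labels.size), labels] = 1
        blocks.append(block)
    return np.hstack(blocks)

# Two toy basic partitionings of n = 4 objects (K_1 = 2, K_2 = 3)
toy_partitions = [np.array([0, 0, 1, 1]), np.array([2, 0, 1, 2])]
toy_X_b = build_binary_dataset(toy_partitions)
print(toy_X_b)
```

Each row contains exactly one 1 per basic partitioning, so every row sums to $r$ (here 2).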

## Load Iris Dataset

In [20]:
import pandas as pd
import numpy as np
# from sklearn.cluster import KMeans
import os
from tqdm import tqdm

# Move to the project root when running from the notebook folder
if "notebook" in os.getcwd():
    os.chdir("..")
from src.KMeans import KMeans
import config

# np.random.seed(42)

# Load data
X = pd.read_csv(os.path.join(config.DATA_FOLDER, 'iris.csv'))
X = X.sample(frac=1).reset_index(drop=True)
y = X.pop('target')
X = (X - X.mean()) / X.std()
X = X.values

print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (150, 4)
y shape: (150,)


## Create Basic Partitionings

- To generate basic partitionings (BPs), K-means with squared Euclidean distance is used for UCI data sets.
- Number of clusters:
  - [Default] Random Parameter Selection (RPS): the number of clusters for each basic clustering is drawn at random from the interval $[K, \sqrt{n}]$.
  - Random Feature Selection (RFS): two features are selected at random for each BP, and the number of clusters for K-means is set to $K$.
- For each data set, 100 BPs are typically generated for consensus clustering (that is, $r = 100$), and all BPs are given equal weight.
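
The two sampling strategies listed above can be sketched as follows. This only shows how the random parameters might be drawn, not the clustering itself; the function names `sample_rps_k` and `sample_rfs_features` and the fixed seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch reproducible

def sample_rps_k(K, n, rng):
    """RPS: draw the number of clusters from the integers in [K, floor(sqrt(n))]."""
    upper = int(np.sqrt(n))
    return int(rng.integers(K, upper + 1))  # high is exclusive, hence the +1

def sample_rfs_features(d, rng, n_features=2):
    """RFS: pick n_features distinct column indices out of d."""
    return rng.choice(d, size=n_features, replace=False)

K, n, d = 3, 150, 4  # Iris: 3 classes, 150 samples, 4 features
ks = [sample_rps_k(K, n, rng) for _ in range(100)]
cols = sample_rfs_features(d, rng)
print(min(ks), max(ks), cols)
```

With RFS, each basic partitioning would be fitted on `X[:, cols]` with `K` clusters instead of on the full feature matrix.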

In [2]:
def one_hot_encode(labels, num_classes):
    one_hot = np.zeros((len(labels), num_classes))
    one_hot[np.arange(len(labels)), labels] = 1
    return one_hot

r = config.R
n = len(X)
classes = len(y.unique())

max_k = int(n**0.5) + 1
ls_partitions = []
ls_partitions_labels = []
total_k = 0
for i in tqdm(range(r), desc="Clustering Progress"):
    k = np.random.randint(classes, max_k)  # high is exclusive, so k lies in [classes, int(sqrt(n))]
    model = KMeans(k, n_init=10)
    model.fit(X)
    partition_i = model.labels_
    ls_partitions_labels.append(partition_i)
    ls_partitions.append(one_hot_encode(partition_i, k))
    total_k += k

X_b = np.hstack(ls_partitions)

print("X_b shape:", X_b.shape)

assert X_b.shape == (n, total_k), "Error in shape of X_b"

Clustering Progress: 100%|██████████| 100/100 [00:00<00:00, 485.84it/s]

X_b shape: (150, 709)





In [3]:
k

8

In [4]:
np.unique(partition_i)

array([0, 1, 2, 3, 4, 5, 6, 7], dtype=int64)

In [5]:
X_b

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [6]:
ls_partitions_labels[:2]

[array([9, 1, 7, 1, 1, 7, 8, 9, 9, 0, 3, 8, 1, 8, 4, 7, 6, 4, 9, 0, 6, 4,
        6, 5, 3, 0, 9, 1, 5, 5, 6, 9, 9, 9, 6, 3, 3, 7, 1, 9, 9, 9, 3, 8,
        5, 7, 8, 9, 6, 5, 7, 3, 3, 4, 6, 1, 6, 0, 3, 3, 8, 4, 8, 1, 8, 9,
        5, 9, 6, 7, 3, 7, 4, 6, 8, 5, 9, 9, 6, 9, 9, 1, 3, 0, 8, 8, 4, 1,
        3, 8, 5, 0, 9, 6, 1, 9, 5, 1, 3, 5, 5, 4, 8, 0, 9, 1, 5, 3, 9, 0,
        1, 9, 9, 6, 9, 4, 3, 3, 6, 1, 7, 7, 3, 0, 8, 3, 6, 5, 8, 5, 9, 4,
        1, 7, 1, 7, 0, 5, 7, 0, 6, 5, 9, 4, 1, 9, 5, 9, 6, 5], dtype=int64),
 array([1, 5, 5, 5, 3, 5, 1, 1, 1, 0, 3, 1, 5, 7, 2, 5, 4, 2, 1, 0, 4, 2,
        4, 1, 3, 0, 4, 3, 1, 1, 4, 4, 1, 1, 4, 3, 3, 5, 3, 1, 1, 1, 3, 1,
        4, 5, 7, 1, 4, 1, 5, 3, 3, 2, 4, 3, 4, 2, 3, 3, 7, 7, 1, 5, 1, 1,
        2, 1, 4, 5, 3, 5, 2, 3, 7, 1, 1, 1, 4, 1, 4, 3, 3, 0, 7, 7, 2, 3,
        3, 7, 1, 2, 1, 4, 5, 4, 1, 5, 3, 1, 1, 2, 2, 2, 1, 5, 1, 3, 4, 0,
        5, 4, 1, 4, 1, 2, 3, 3, 4, 3, 5, 5, 3, 0, 7, 3, 4, 1, 1, 1, 4, 2,
        5, 5, 5, 5, 0, 1, 5, 2, 4, 

# Cluster the Binary Dataset

- Clustering tools: the paper compares three types of consensus clustering methods, namely the K-means-based algorithm (KCC), the graph partitioning algorithm (GP), and the hierarchical algorithm (HCC).
**In our work, only KCC is used, and its results are compared with those reported in the paper.**

- Define utility functions




**Compare different methods for contingency table**

In [7]:
from scipy.stats.contingency import crosstab
from sklearn.metrics.cluster import contingency_matrix
pi = KMeans(classes, n_init=10)
pi.fit(X_b)
pi = pi.labels_

In [8]:
# %%timeit
# data_crosstab = pd.crosstab(pi, 
# 							ls_partitions_labels[0], 
# 							margins = False)

# 2.58 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]:
# %%timeit
# contingency_matrix(pi, ls_partitions_labels[0])
# 79.1 µs ± 1.12 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [10]:
# %%timeit
# crosstab(pi, ls_partitions_labels[0]).count
# 23 µs ± 318 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
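
For reference, a contingency table can also be built with NumPy alone, which is handy for cross-checking the library functions timed above. This is a minimal sketch; the name `contingency_numpy` is an assumption for this example and it has not been benchmarked against the three methods.

```python
import numpy as np

def contingency_numpy(a, b):
    """Count co-occurrences of the labels in a (rows) and b (columns)."""
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=int)
    np.add.at(table, (a_idx, b_idx), 1)  # unbuffered add handles repeated index pairs
    return table

a = np.array([0, 0, 1, 1, 1])
b = np.array([0, 1, 1, 1, 0])
toy_table = contingency_numpy(a, b)
print(toy_table)
```

Entry `(k, j)` counts the objects that carry the `k`-th distinct label of `a` and the `j`-th distinct label of `b`.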

**Extract Information from Contingency Table**

The information that we need to extract from the contingency table is:
- $p_{kj}^{(i)}$ = the proportion of all objects that fall in consensus cluster $k$ and in cluster $j$ of partition $i$
- $p_k$ = the proportion of objects in consensus cluster $k$
- $P_k^{(i)}$ = the vector with entries $p_{kj}^{(i)}/p_k$, i.e. the distribution over the clusters of partition $i$ within consensus cluster $k$
- $P^{(i)}$ = the vector with the proportion of objects in each cluster of partition $i$
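
Written out in terms of the contingency counts, the quantities computed in the cells that follow are as shown here. The symbol $n_{kj}^{(i)}$ for the count in row $k$, column $j$ of the contingency table is an assumption introduced for this summary, and $U$ denotes the utility evaluated in the last cell.

$$
\begin{aligned}
p_{kj}^{(i)} &= \frac{n_{kj}^{(i)}}{n}, \qquad p_k = \sum_{j=1}^{K_i} p_{kj}^{(i)} \\
P_k^{(i)} &= \left( \frac{p_{k1}^{(i)}}{p_k}, \ldots, \frac{p_{kK_i}^{(i)}}{p_k} \right), \qquad
P^{(i)} = \left( \sum_k p_{k1}^{(i)}, \ldots, \sum_k p_{kK_i}^{(i)} \right) \\
U(\pi, \pi_i) &= \sum_k p_k \left\lVert P_k^{(i)} \right\rVert_2^2 - \left\lVert P^{(i)} \right\rVert_2^2
\end{aligned}
$$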

In [11]:
# Calculate contingency matrix (n)
pi = KMeans(classes)
pi.fit(X_b)
pi = pi.labels_

cont_i = crosstab(pi, ls_partitions_labels[0]).count
cont_i


array([[11, 18, 18, 11, 18, 17, 13, 15, 29]])

In [12]:
# Get p-contingency matrix
p_cont_i = cont_i / cont_i.sum()
p_cont_i

array([[0.07333333, 0.12      , 0.12      , 0.07333333, 0.12      ,
        0.11333333, 0.08666667, 0.1       , 0.19333333]])

In [13]:
# Calculate the p_k for each cluster in PI
p_k = p_cont_i.sum(axis=1)
p_k = p_k.reshape(-1, 1)
p_k

array([[1.]])

In [14]:
# Get the P_k^i for each cluster in the partition i
P_k_i = p_cont_i / p_k
P_k_i

array([[0.07333333, 0.12      , 0.12      , 0.07333333, 0.12      ,
        0.11333333, 0.08666667, 0.1       , 0.19333333]])

In [15]:
# This is a constant and it's the ratio of points in each cluster in pi_i
P_i = p_cont_i.sum(axis=0)
P_i

array([0.07333333, 0.12      , 0.12      , 0.07333333, 0.12      ,
       0.11333333, 0.08666667, 0.1       , 0.19333333])

In [16]:
# Squared L2 norm of each row of P_k^i
norm_P_k_i = np.linalg.norm(P_k_i, axis=1, ord=2).reshape(-1, 1) ** 2
norm_P_k_i

array([[0.12168889]])

In [17]:
norm_P_i = np.linalg.norm(P_i, ord=2) ** 2
norm_P_i

0.12168888888888892

In [18]:
p_k * norm_P_k_i

array([[0.12168889]])

In [19]:
# Utility for partition i: sum_k p_k * ||P_k^i||^2 - ||P^i||^2
utility = (p_k * norm_P_k_i).sum() - norm_P_i
utility

-4.163336342344337e-17
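
The cells above can be collected into a single helper, sketched below with NumPy only. The name `category_utility` and the toy labels are assumptions for this example. The sketch also illustrates why the value above is zero up to floating-point error: the contingency table has a single row, so $p_k = 1$ and $P_k^{(i)}$ coincides with $P^{(i)}$.

```python
import numpy as np

def category_utility(consensus, basic):
    """Return sum_k p_k * ||P_k||^2 - ||P||^2 for one basic partitioning."""
    _, c_idx = np.unique(consensus, return_inverse=True)
    _, b_idx = np.unique(basic, return_inverse=True)
    cont = np.zeros((c_idx.max() + 1, b_idx.max() + 1))
    np.add.at(cont, (c_idx, b_idx), 1)      # contingency counts
    p = cont / cont.sum()                   # joint proportions p_kj
    p_k = p.sum(axis=1, keepdims=True)      # proportion per consensus cluster
    P_k = p / p_k                           # row-wise conditional distributions
    P = p.sum(axis=0)                       # proportion per basic cluster
    return float((p_k.ravel() * (P_k ** 2).sum(axis=1)).sum() - (P ** 2).sum())

basic = np.array([0, 0, 1, 1, 2, 2])
one_cluster = np.zeros(6, dtype=int)

u_single = category_utility(one_cluster, basic)  # single consensus cluster
u_same = category_utility(basic, basic)          # consensus identical to the BP
print(u_single, u_same)
```

In this toy case the single-cluster consensus gives a utility of about 0, while a consensus identical to the basic partitioning gives about 2/3.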