## Applying our PreDeCon implementation to the IMDB-BINARY dataset

In [None]:
import numpy as np
import random
import csv
import sys
sys.path.append('..')
sys.path.append('../tudataset/tud_benchmark/')

from predecon import PreDeCon
from scipy.sparse import linalg
from sklearn.metrics import normalized_mutual_info_score as NMI
from pathlib import Path
from auxiliarymethods.auxiliary_methods import normalize_feature_vector as normalize_v
from auxiliarymethods.auxiliary_methods import normalize_gram_matrix as normalize_m
from auxiliarymethods.datasets import get_dataset as labels
from util.utility_functions import load_sparse as load_v
from util.utility_functions import load_csv as load_m
from util.dimensionality_reduction import truncatedSVD as svd
from util.dimensionality_reduction import kernelPCA as pca

We measure the quality of a clustering by its normalized mutual information (NMI) with the ground-truth labels.
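To make the metric concrete, here is a minimal pure-Python sketch of what NMI computes: the mutual information of two labelings divided by the arithmetic mean of their entropies. The function names `entropy` and `nmi_sketch` are ours, for illustration only; the notebook itself uses scikit-learn's `normalized_mutual_info_score`, whose default normalization is, to our understanding, also the arithmetic mean.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi_sketch(labels_a, labels_b):
    """Mutual information divided by the arithmetic mean of both entropies."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    mean_entropy = (entropy(labels_a) + entropy(labels_b)) / 2
    # Convention chosen for this sketch: two constant labelings count as identical.
    return mi / mean_entropy if mean_entropy > 0 else 1.0


truth = [0, 0, 0, 1, 1, 1]
print(nmi_sketch(truth, [1, 1, 1, 0, 0, 0]))  # same partition, cluster names swapped
print(nmi_sketch(truth, [0, 1, 2, 0, 1, 2]))  # statistically independent of the truth
```

The first call yields 1 and the second 0 (up to floating-point rounding): NMI only compares the partitions, so it does not matter which integer a cluster happens to be assigned.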

In [None]:
labels_path = '../tudataset/datasets/IMDB-BINARY/IMDB-BINARY/raw/IMDB-BINARY_graph_labels.txt'
true_labels = np.loadtxt(labels_path, dtype=int)

Exploratory data analysis showed a significant number of outliers that were fully connected graphs. To exploit this, we implemented an option that strips high-density graphs from the dataset before clustering. (This option is not used in our best clustering to date.)

In [None]:
densities = np.fromfile('graph_densities')
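For reference, a sketch of the usual density definition for an undirected simple graph: the number of existing edges divided by the number of possible edges, so that a fully connected graph has density 1. How the `graph_densities` file was actually produced is not shown in this notebook, so the helper `graph_density` below is an assumption about the definition, not the code that generated the file.

```python
def graph_density(num_nodes, num_edges):
    """Density of an undirected simple graph: existing edges / possible edges."""
    if num_nodes < 2:
        return 0.0  # convention for graphs too small to contain any edge
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


print(graph_density(5, 10))  # complete graph on 5 nodes
print(graph_density(5, 4))   # a path on 5 nodes
```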

We do not yet know for which kernel and representation (feature vectors or Gram matrix) the algorithm works best; the same goes for the number of dimensions that remain after applying SVD or KernelPCA. Accordingly, we treat these inputs as additional parameters of the algorithm.

In [None]:
def predecon_config(kernel, format, dims, minPts, eps, delta, lambda_, kappa, strip_densities=1):
    imdb = Path('../data/IMDB-BINARY/')
    vector_path = imdb / f'IMDB-BINARY_vectors_{kernel}.npz'
    matrix_path = imdb / f'IMDB-BINARY_gram_matrix_{kernel}.csv'

    if format == 'vector':
        vectors = normalize_v(load_v(vector_path))
        data = svd(vectors, dims)
    else:
        matrix = normalize_m(load_m(matrix_path))
        data = pca(matrix, dims)
    
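    if strip_densities < 1:
        # Optionally drop highly connected graphs. predecon.labels then covers
        # only the kept graphs, so the ground truth must be filtered with the
        # same mask before computing the NMI.
        data = data[densities < strip_densities]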
    
    predecon = PreDeCon(minPts=minPts, eps=eps, delta=delta, lambda_=lambda_, kappa=kappa)
    predecon.fit(data)
    return predecon
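One caveat of the `strip_densities` option: once rows are removed from `data`, the resulting cluster labels no longer line up with the full ground-truth array. A minimal sketch of keeping both in sync by reusing one boolean mask follows; the arrays are small made-up stand-ins, not the notebook's data.

```python
import numpy as np

# Made-up stand-ins for the notebook's arrays, for illustration only.
densities = np.array([0.2, 1.0, 0.5, 1.0, 0.3])
true_labels = np.array([0, 1, 0, 1, 1])
data = np.arange(10).reshape(5, 2)  # one row per graph

strip_densities = 0.99
keep = densities < strip_densities  # boolean mask, one entry per graph

# Apply the same mask to the data and to the labels so that row i of
# data_stripped still belongs to entry i of labels_stripped.
data_stripped = data[keep]
labels_stripped = true_labels[keep]

print(data_stripped.shape[0], labels_stripped.shape[0])
```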

Finding parameters for which the algorithm returns a usable clustering is not straightforward. In many cases, the entire dataset ends up in one big cluster; in others, every single point is marked as noise.
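The two degenerate outcomes described above can be told apart programmatically. The sketch below assumes that noise points carry the label `-1`, as in scikit-learn's DBSCAN; whether our `PreDeCon` class uses the same convention is an assumption here, and `describe_clustering` is a hypothetical helper rather than part of the implementation.

```python
def describe_clustering(labels, noise_label=-1):
    """Classify a label assignment as 'all noise', 'single cluster' or 'usable'."""
    distinct = set(labels)
    clusters = distinct - {noise_label}  # assumption: noise_label marks noise
    if not clusters:
        return "all noise"
    if len(distinct) == 1:
        return "single cluster"
    return "usable"


print(describe_clustering([-1, -1, -1]))
print(describe_clustering([0, 0, 0]))
print(describe_clustering([0, 1, -1, 1]))
```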

We have implemented a random parameter space search to get usable results (shown below). One of the first successful configurations that at least came up with the correct number of clusters (2 classes + noise) is shown here: 

In [None]:
kernel = 'wl3'
format = 'matrix'
dims = 50

predecon = predecon_config(kernel, format, dims, 25, 0.75, 1, 50, 10)
print("Clusters found:", set(predecon.labels))
print(f"NMI: {NMI(true_labels, predecon.labels)}")

An NMI of 0.02 means the clustering is hardly better than random.

After some iterations of using the results to limit our parameter search space to more relevant ranges, we found this configuration, the best one yet. The NMI of 0.13 is still not exactly great, but it shows improvement.

One interesting aspect of this configuration is that it returns 12 clusters, many more than the two classes in the ground truth, so there is still room for improvement.

In [None]:
kernel = 'wl1'
format = 'vector'
dims = 25

predecon = predecon_config(kernel, format, dims, 7, 5, 20, 25, 100)
print("Clusters found:", set(predecon.labels))
print(f"NMI: {NMI(true_labels, predecon.labels)}")

The random parameter search is implemented here. The search ranges were initially chosen arbitrarily and were refined once we found configurations that worked significantly better than average.

In [None]:
all_kernels = ['wl1', 'wl2', 'wl3', 'wl4', 'wl5', 'graphlet', 'shortestpath']
all_formats = ['vector', 'matrix']
all_dims    = [25, 50, 75]

all_minPts  = [2, 5, 7, 10, 15, 25]
all_eps     = [0.25, 0.75, 2, 5, 50]
all_deltas  = [0.1, 0.25, 0.5, 1, 5, 20]
all_lambdas = [5, 15, 30, 50, 75]
all_kappas  = [10, 100, 1000]
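To get a feeling for how sparsely a random search covers these lists, we can count the number of distinct configurations they span. This is a small sketch; the dictionary `search_space` simply restates the lists above.

```python
from math import prod

search_space = {
    "kernel":  ['wl1', 'wl2', 'wl3', 'wl4', 'wl5', 'graphlet', 'shortestpath'],
    "format":  ['vector', 'matrix'],
    "dims":    [25, 50, 75],
    "minPts":  [2, 5, 7, 10, 15, 25],
    "eps":     [0.25, 0.75, 2, 5, 50],
    "delta":   [0.1, 0.25, 0.5, 1, 5, 20],
    "lambda_": [5, 15, 30, 50, 75],
    "kappa":   [10, 100, 1000],
}

# Number of distinct configurations in the full grid
total = prod(len(values) for values in search_space.values())
print(total)  # 113400
```

A handful of trials therefore samples only a tiny fraction of the grid, and because every parameter is drawn independently, the same configuration can in principle be drawn more than once.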

For each trial, we pick one random element from each of the parameter lists and use the result as the configuration. If the configuration returns a usable clustering, the parameters and the resulting NMI are appended to a CSV file. We can use this file to find patterns among the successful clusterings and narrow the search ranges for future runs.

In [None]:
# randomized parameter space search

num_trials = 10

for trial in range(num_trials):
    print(f"Trial {trial}: ", end='')

    kernel  = random.choice(all_kernels)
    format  = random.choice(all_formats)
    dims    = random.choice(all_dims)

    minPts  = random.choice(all_minPts)
    eps     = random.choice(all_eps)
    delta   = random.choice(all_deltas)
    lambda_ = random.choice(all_lambdas)
    kappa   = random.choice(all_kappas)

    predecon = predecon_config(kernel=kernel, format=format, dims=dims,
                               minPts=minPts, eps=eps, delta=delta, lambda_=lambda_, kappa=kappa)
    
    # true_labels_stripped = true_labels[densities < 0.99]
    
    if len(set(predecon.labels)) > 1:
        nmi = NMI(true_labels, predecon.labels)

        print("\n ", kernel, format, dims)
        print(" ", minPts, eps, delta, lambda_, kappa)
        print(" ", set(predecon.labels))
        print("  NMI:", nmi)
        print(f"  time: {predecon._performance['fit'] / 1_000_000_000:.4f}s")

        # newline='' keeps the csv module from writing blank rows on Windows
        with open('parameters.csv', 'a', newline='') as f:
            csv.writer(f).writerow([nmi, kernel, format, dims, minPts, eps, delta, lambda_, kappa])
    else:
        print("No usable clustering found")
    
    # flag runs whose fit took longer than 60 s (the timer is divided by 1e9 to get seconds)
    if predecon._performance['fit'] > 60 * 1_000_000_000:
        print("  Took too long…")
        print("  ", kernel, format, dims)
        print("  ", minPts, eps, delta, lambda_, kappa)
        print(f"  time: {predecon._performance['fit'] / 1_000_000_000:.4f}s")
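The CSV file written by the loop above can then be mined for patterns. Below is a minimal sketch that reports the best NMI seen for each value of one parameter. The column order is taken from the `writerow` call; `best_nmi_per` is a hypothetical helper, and the two sample rows are the two configurations reported earlier, with their NMI rounded as in the text. In practice the rows would come from `open('parameters.csv', newline='')`.

```python
import csv
import io
from collections import defaultdict

# Column order as written by the search loop
COLUMNS = ["nmi", "kernel", "format", "dims", "minPts", "eps", "delta", "lambda_", "kappa"]


def best_nmi_per(rows, column):
    """Return {value of `column`: highest NMI seen} for the logged rows."""
    best = defaultdict(float)  # NMI is never negative, so 0.0 is a safe start
    for row in rows:
        record = dict(zip(COLUMNS, row))
        value = record[column]
        best[value] = max(best[value], float(record["nmi"]))
    return dict(best)


# Stand-in for the real file: the two configurations discussed above.
sample = io.StringIO(
    "0.02,wl3,matrix,50,25,0.75,1,50,10\n"
    "0.13,wl1,vector,25,7,5,20,25,100\n"
)
print(best_nmi_per(csv.reader(sample), "kernel"))
```

Running the same helper with `"format"`, `"dims"` or any other column name gives a quick view of which values tend to appear in the better clusterings.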