# Clustering Evaluation Functions

This section implements metric functions for clustering evaluation.

Author : Maxime Fontana

### Symmetric Distance Error 

This function measures the error as the percentage of misclassified vertices with respect to the ground truth. It is used for the study on synthetic data (Stochastic Block Model) in the paper presented in the file "Reproduce Figure 2".

This error can be represented by the symmetric difference between a group (a cluster) in an 'unfair' clustered dataset and a so-called fair clustered one.

This subset is illustrated below.

<div>
<img src="attachment:symmetric-56a8fa9f5f9b58b7d0f6ea14.jpg" width="500"/>
</div>
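
As a minimal sketch of the idea, the symmetric difference of two clusters can be computed directly on sets of vertex indices. The two clusters and the vertex count below are hypothetical toy values, not data from the study.

```python
# Hypothetical toy data: vertex indices of one cluster from each clustering.
unfair_cluster = {0, 1, 2, 3}
fair_cluster = {2, 3, 4, 5}
n_vertices = 8  # assumed total number of vertices in the dataset

# Vertices that belong to exactly one of the two clusters.
mismatched = unfair_cluster.symmetric_difference(fair_cluster)

# Same normalisation as error_sym: a percentage of all vertices.
error_pct = len(mismatched) * 100 / n_vertices
print(sorted(mismatched), error_pct)  # [0, 1, 4, 5] 50.0
```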

In [13]:
# Measure the error as the percentage of misclassified vertices w.r.t. the
# ground truth.

# TODO: needs optimisation for scalability

import numpy as np

def error_sym(labels, fair_labels):
    """
    Parameters
    ----------
    labels : array
        'Unfair' labels.
    fair_labels : array
        'Fair' labels.

    Returns
    -------
    float
        Percentage of misclassified vertices w.r.t. the ground truth.
    """
    pre_results = []
    ground_truth = [] 
    results = []
    my_range = np.arange(0,len(labels))
    lengths = []
    
    # Return the indices of elements of the same group and store them
    for h in range(max(fair_labels)+1):
        my_set = {i for i, x in enumerate(fair_labels) if x == h}
        pre_results.append(my_set)
        lengths.append(len(my_set))
    print("Fair clusters")
    print(pre_results)
    
    # Generate ground-truth in the same format (indices)
    for k in range(max(labels)+1):
        my_set = get_ground_truth(5, 5, my_range, k).astype(int)
        ground_truth.append(set(my_set))
        lengths.append(len(my_set))
    print("ground truth")
    print(ground_truth)
    
    # Compute the symmetric difference for every (ground-truth cluster, fair cluster) pair
    for i in range(len(ground_truth)):
        for y in range(len(pre_results)):
            x = len(ground_truth[i].symmetric_difference(pre_results[y]))
            results.append(x)
        
    return (min(results) * 100) / len(labels)
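
Regarding the scalability note above, one possible direction is to replace the nested set operations with a contingency table and the identity |A Δ B| = |A| + |B| - 2|A ∩ B|. The sketch below is not the notebook's implementation: the name `error_sym_vectorised` is hypothetical, and it assumes the ground truth is already available as an array of non-negative integer labels rather than being generated by `get_ground_truth`.

```python
import numpy as np

def error_sym_vectorised(truth_labels, fair_labels):
    """Hypothetical vectorised variant; both inputs are integer label arrays."""
    truth_labels = np.asarray(truth_labels)
    fair_labels = np.asarray(fair_labels)
    n_truth = truth_labels.max() + 1
    n_fair = fair_labels.max() + 1

    # contingency[i, j] = number of vertices in truth cluster i and fair cluster j
    contingency = np.zeros((n_truth, n_fair), dtype=int)
    np.add.at(contingency, (truth_labels, fair_labels), 1)

    truth_sizes = contingency.sum(axis=1)
    fair_sizes = contingency.sum(axis=0)

    # |A symmetric-difference B| = |A| + |B| - 2 * |A intersect B| for every pair
    sym_diff = truth_sizes[:, None] + fair_sizes[None, :] - 2 * contingency

    # Same reduction as error_sym: smallest pairwise value, as a percentage
    return sym_diff.min() * 100 / len(truth_labels)

print(error_sym_vectorised([0, 0, 1, 1], [0, 1, 1, 1]))  # 25.0
```

The table is built in a single pass over the labels, so the cost is roughly linear in the number of vertices plus the number of cluster pairs.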

In [None]:
# Compute the ratio cut of the output as trace(H^T L H)

def ratio_cut(laplacian, matrix_H):
    return np.trace(np.transpose(matrix_H) @ laplacian @ matrix_H)
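
The notebook does not show how `matrix_H` is built. The sketch below assumes the usual spectral-clustering convention, where column j of H is the indicator vector of cluster j divided by the square root of the cluster size. The four-vertex path graph and its labels are hypothetical toy data.

```python
import numpy as np

def ratio_cut(laplacian, matrix_H):
    return np.trace(np.transpose(matrix_H) @ laplacian @ matrix_H)

# Toy path graph 0 - 1 - 2 - 3 with unit edge weights (hypothetical data)
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)

# Unnormalised graph Laplacian L = D - W
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

# Two clusters: {0, 1} and {2, 3}
labels = np.array([0, 0, 1, 1])
one_hot = np.eye(labels.max() + 1)[labels]

# Assumed convention: indicator columns scaled by 1 / sqrt(cluster size)
matrix_H = one_hot / np.sqrt(one_hot.sum(axis=0))

value = ratio_cut(laplacian, matrix_H)
print(value)  # approximately 1.0
```

Under this assumption the trace equals the sum over clusters of the crossing edge weight divided by the cluster size. In the example a single edge is cut and both clusters have size 2, so the value is 1/2 + 1/2.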

In [None]:
# Balance 

def custom_error(labels, fair_labels): # overall O(h * k * n) for h and k clusters over n vertices
    results = []
    
    for h in range(max(labels)+1): # O(h)
        scores_list = []
        my_set = {i for i, x in enumerate(labels) if x == h} # O(n)
        my_calc = (len(my_set) * 100) / len(labels)
        print("Initial Population : ", my_calc)
        for k in range(max(fair_labels)+1): # O(k)
            score = 0
            my_fair_set = {i for i, x in enumerate(fair_labels) if x == k} # O(n)
            my_intersec = len(my_set.intersection(my_fair_set)) # O(min(len(my_set), len(my_fair_set)))
            print("proportion :", my_intersec, " on", len(my_fair_set))
            my_fair_calc = ((my_intersec) * 100) / len(my_fair_set)
            print("Proportion in Fair Cluster :", my_fair_calc)
            score = abs(my_calc - my_fair_calc)
            print("local score :", score)
            scores_list.append(score)
        scores = sum(scores_list) / (max(fair_labels) + 1)
        print(scores, " !!!")
        results.append(scores)
            
    return sum(results) / len(results)
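
The same quantity can be sketched without the nested loops and debug prints, again using a contingency table. The name `balance_error_vectorised` is hypothetical. The sketch assumes non-negative integer labels and that every fair cluster is non-empty (an empty one makes `custom_error` divide by zero as well).

```python
import numpy as np

def balance_error_vectorised(labels, fair_labels):
    """Hypothetical vectorised counterpart of custom_error."""
    labels = np.asarray(labels)
    fair_labels = np.asarray(fair_labels)

    # contingency[h, k] = number of vertices in cluster h and fair cluster k
    contingency = np.zeros((labels.max() + 1, fair_labels.max() + 1), dtype=int)
    np.add.at(contingency, (labels, fair_labels), 1)

    # Share of each initial cluster in the whole population, in percent
    population_pct = contingency.sum(axis=1) * 100 / len(labels)

    # Share of each initial cluster inside each fair cluster, in percent
    fair_pct = contingency * 100 / contingency.sum(axis=0)

    # Mean absolute gap over all (cluster, fair cluster) pairs
    return np.abs(population_pct[:, None] - fair_pct).mean()

print(balance_error_vectorised([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
print(balance_error_vectorised([0, 0, 1, 1], [0, 0, 1, 1]))  # 50.0
```

Every inner average in `custom_error` runs over the same number of fair clusters, so averaging the whole gap matrix at once gives the same result as averaging per cluster and then across clusters.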

In [14]:
# Small worked example for Ground-Truth generation
%run 'GroundTruthGen.ipynb'
import numpy as np
test_labels = np.random.randint(0, 5, 100)
test_fair_labels = np.random.randint(0, 5, 100)

print("Algo 1", test_labels)
print("Algo 2", test_fair_labels)

error_sym(test_labels, test_fair_labels)

Algo 1 [2 1 4 1 4 1 1 1 4 2 2 3 1 2 4 2 1 2 2 2 0 1 2 3 2 1 2 4 4 0 0 2 3 4 1 2 3
 2 4 3 1 3 4 4 3 2 0 3 1 3 3 3 3 1 2 4 1 4 1 4 1 2 0 1 0 2 4 4 1 1 1 4 1 4
 4 2 2 4 4 3 3 2 0 1 1 4 0 1 2 2 0 0 3 2 2 2 1 3 2 3]
Algo 2 [0 3 4 4 3 3 4 0 2 4 3 3 3 0 0 3 1 1 0 4 3 3 3 1 1 0 0 0 1 0 4 0 4 0 3 0 2
 2 4 1 0 2 3 4 2 3 2 2 0 4 0 2 3 0 1 0 1 4 0 1 0 4 0 2 2 1 1 2 4 0 4 1 1 2
 0 4 0 2 4 1 2 2 0 1 0 2 3 2 0 4 1 3 3 3 0 0 1 3 2 4]
Fair clusters
[{0, 7, 13, 14, 18, 25, 26, 27, 29, 31, 33, 35, 40, 48, 50, 53, 55, 58, 60, 62, 69, 74, 76, 82, 84, 88, 94, 95}, {96, 65, 66, 90, 39, 71, 72, 79, 16, 17, 83, 54, 23, 24, 56, 59, 28}, {64, 98, 67, 36, 37, 8, 41, 73, 44, 77, 46, 47, 80, 81, 51, 85, 87, 63}, {1, 4, 5, 10, 11, 12, 15, 20, 21, 22, 34, 42, 45, 52, 86, 91, 92, 93, 97}, {32, 89, 2, 3, 68, 99, 6, 38, 70, 9, 43, 75, 78, 49, 19, 57, 61, 30}]
ground truth
[{0, 1, 2, 3, 20, 21, 22, 23, 40, 41, 42, 43, 60, 61, 62, 63, 80, 81, 82, 83}, {4, 5, 6, 7, 24, 25, 26, 27, 44, 45, 46, 47, 64, 65, 66, 67, 84, 85, 86

23.0