# Clustering Analysis

## Pre-Analysis

### Loading Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

In [2]:
# Best params
RESULTS_PATH = os.path.join('..', 'results')
HIST_BEST = os.path.join(RESULTS_PATH, 'hist_best_params.csv')
LATENT_BEST = os.path.join(RESULTS_PATH, 'latent_best_params.csv')

In [3]:
# Data features
FEATURES_PATH = os.path.join('..', 'features')
LATENT_PATH = os.path.join(FEATURES_PATH, 'incv1_feats.csv')
HIST_PATH = os.path.join(FEATURES_PATH, 'color_hist.csv')

In [4]:
# Best techniques
BEST_TECHNIQUES = os.path.join(RESULTS_PATH, 'votes_summary.csv')

#### Clustering Parameters

We load the best parameters found for each feature set and combine them into a single dataframe for easy inspection.

In [5]:
hist_params = pd.read_csv(HIST_BEST)
latent_params = pd.read_csv(LATENT_BEST)

In [6]:
dbscan_best_params = pd.concat([hist_params, latent_params], axis=0, ignore_index=True)
dbscan_best_params

Unnamed: 0,m,e,data,scaled,similarity,sscore,clusters,instances
0,2,0.16,hist,False,cosine,0.44,2,147
1,2,0.19,hist,False,cosine,0.37,3,163
2,4,11.8,latent,False,euclid,0.46,9,140
3,3,0.2,latent,False,cosine,0.68,11,144


#### Histogram Features, Latent Features

We load the original features so we can cluster them using the best parameters.

In [7]:
latent_feats = pd.read_csv(LATENT_PATH)
latent_feats.head(3)

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8,...,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023
0,1222__pool_table__0.9999995.jpg,0.882798,0.896023,0.123852,0.257982,0.03605,0.108023,0.633841,0.457301,1.684949,...,0.422634,0.346122,0.111589,1.441579,0.198722,0.246648,0.295942,0.56095,0.058328,0.117393
1,1328__coil__0.99999607.jpg,0.483815,0.134309,0.021849,0.367267,0.08925,0.007518,0.069921,0.219347,0.08926,...,0.049852,0.00414,0.199223,0.718976,0.0,0.0,0.0,0.159411,0.012007,0.001601
2,134__zebra__0.9999949.jpg,0.291067,0.375913,0.217742,1.269691,0.384181,0.07647,0.66207,0.662391,0.827774,...,0.018289,0.0,0.000775,0.903884,0.589769,0.016957,0.418493,0.00535,0.004198,0.18546


In [8]:
hist_feats = pd.read_csv(HIST_PATH)
hist_feats.head(3)

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8,...,758,759,760,761,762,763,764,765,766,767
0,1222__pool_table__0.9999995.jpg,178,51,43,49,37,40,54,57,57,...,8,5,12,9,13,14,12,12,7,51
1,1328__coil__0.99999607.jpg,47,39,66,118,112,134,143,164,194,...,97,114,127,188,211,172,121,90,61,186
2,134__zebra__0.9999949.jpg,0,0,1,1,4,4,7,5,12,...,34,17,40,14,25,12,2,4,2,13


#### Best Techniques

The best-techniques file lists, for each image, the number of votes cast for each interpretation technique, together with the name of the best technique (the one with the most votes). We will convert this dataframe into a simple dictionary that maps the name of each image to the best technique selected for that image.

In [9]:
techniques = pd.read_csv(BEST_TECHNIQUES, dtype='object', index_col=0)
techniques.head(3)

Unnamed: 0,ig,lime,xrai,anchor,best
1222__pool_table__0.9999995.jpg,12,13,3,1,lime
1328__coil__0.99999607.jpg,17,4,3,2,ig
134__zebra__0.9999949.jpg,14,1,8,2,ig


In [10]:
def gen_name_technique_tuples(x):
    return [x.name, x['best']]

In [11]:
foo = techniques.apply(gen_name_technique_tuples, axis=1)
foo.values[:3] # Values of Series obj

array([list(['1222__pool_table__0.9999995.jpg', 'lime']),
       list(['1328__coil__0.99999607.jpg', 'ig']),
       list(['134__zebra__0.9999949.jpg', 'ig'])], dtype=object)

In [12]:
name_tech_map = {name: tech for name, tech in foo.values}

The dictionary has 198 entries, exactly the number of images in our analysis.

In [13]:
len(name_tech_map.keys())

198
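
As a side note, the same image-to-technique mapping can probably be obtained without the intermediate `apply` step, by reading the `best` column directly. The sketch below rebuilds the three rows shown by `techniques.head(3)`; the names `techniques_example` and `name_tech_map_example` are made up for this illustration and are not part of the notebook.

```python
import pandas as pd

# Rebuild the three rows displayed by techniques.head(3) (values kept as strings,
# since the notebook reads the file with dtype='object').
techniques_example = pd.DataFrame(
    {
        'ig': ['12', '17', '14'],
        'lime': ['13', '4', '1'],
        'xrai': ['3', '3', '8'],
        'anchor': ['1', '2', '2'],
        'best': ['lime', 'ig', 'ig'],
    },
    index=[
        '1222__pool_table__0.9999995.jpg',
        '1328__coil__0.99999607.jpg',
        '134__zebra__0.9999949.jpg',
    ],
)

# The index holds the image names, so the 'best' column converts straight
# into an {image name: best technique} dictionary.
name_tech_map_example = techniques_example['best'].to_dict()
print(name_tech_map_example)
```

This relies on the image names being the dataframe index, which is what `index_col=0` sets up when the file is loaded.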

We can also see which technique is the most popular.

In [14]:
# IG is the most voted technique, followed by XRAI, LIME and ANCHOR
np.unique(list(name_tech_map.values()), return_counts=True)

(array(['anchor', 'ig', 'lime', 'xrai'], dtype='<U6'),
 array([  1, 147,  18,  32], dtype=int64))

## Clustering

Here we will perform clustering using the best parameters for DBSCAN and identify the cluster to which every instance is assigned.
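
To make the meaning of the DBSCAN labels concrete, here is a minimal sketch on made-up 2D points (not the notebook's features): two tight groups and one isolated point. The values of `eps` and `min_samples` are chosen only for this toy data. Instances that do not belong to any dense region receive the label `-1`, which is the "noise" label that appears in the results further down.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two tight groups of three points and one isolated point.
toy_points = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],   # first group
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],   # second group
    [10.0, 0.0],                          # isolated point
])

# eps is the neighbourhood radius; min_samples is the number of points
# (including the point itself) needed to form a dense region.
toy_dbscan = DBSCAN(eps=0.3, min_samples=2, metric='euclidean').fit(toy_points)
toy_labels = toy_dbscan.labels_
print(toy_labels)  # one label per point; -1 marks noise
```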

In [15]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

In [16]:
feats_map = {
    'hist': hist_feats,
    'latent': latent_feats
}
sim_metric_map = {
    'euclid': 'euclidean',
    'cosine': 'cosine'
}

In [17]:
def get_indiv_clustering_results(param_set):
    '''Return a dictionary mapping the name of each image
    to the label of the cluster it belongs to'''
    # Prepare parameters
    data = feats_map[param_set[2]]
    img_names = data.values[:, 0]
    instances = data.values[:, 1:]
    metric = sim_metric_map[param_set[4]]
    # Do clustering
    # TODO: Accept previously created sim. matrices
    dbscan = DBSCAN(min_samples=param_set[0], eps=param_set[1], metric=metric)
    dbscan = dbscan.fit(instances)
    # Generate {img_name : label} mapping
    name_label_map = {name: label for name, label in zip(img_names, dbscan.labels_)}
    return name_label_map

def get_global_clustering_results(params):
    '''Return a dictionary mapping the index of every param. set
    in 'params' to the clustering results generated with that param. set'''
    results = {}
    for i, param_set in enumerate(params):
        results[i] = get_indiv_clustering_results(param_set)
    return results

In [18]:
cl_results = get_global_clustering_results(dbscan_best_params.values)

The clustering results are stored in a dictionary that maps the index of every DBSCAN parameter set to the clustering labels it produces. Each set of clustering labels is, in turn, a dictionary that maps the name of an image (i.e. an instance) to the number of the cluster it was assigned to.
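
A nested dictionary of this shape can be hard to read when printed. One possible way to inspect it, sketched here with made-up image names and labels rather than the real results, is to turn it into a dataframe with one row per image and one column per parameter set. The names `cl_results_example` and `labels_df` are hypothetical.

```python
import pandas as pd

# Made-up results with the same shape as the real dictionary:
# {param. set index: {image name: cluster label}}
cl_results_example = {
    0: {'img_a.jpg': -1, 'img_b.jpg': 0, 'img_c.jpg': 0},
    1: {'img_a.jpg': 0, 'img_b.jpg': 0, 'img_c.jpg': 1},
}

# Outer keys become columns, inner keys become the row index.
labels_df = pd.DataFrame(cl_results_example)
labels_df.columns.name = 'param_set'
print(labels_df)
```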

In [19]:
cl_results

{0: {'1222__pool_table__0.9999995.jpg': -1,
  '1328__coil__0.99999607.jpg': 0,
  '134__zebra__0.9999949.jpg': -1,
  '2377471__pizza__0.9999988.jpg': 0,
  '2377620__zebra__0.9999882.jpg': 0,
  '2377698__zebra__0.9999999.jpg': 0,
  '2378170__zebra__0.9999902.jpg': -1,
  '2378358__park_bench__0.99999833.jpg': 0,
  '2378523__banana__0.99999785.jpg': 0,
  '2379086__zebra__0.9999975.jpg': 0,
  '2379489__parking_meter__0.9999989.jpg': 0,
  '2380017__zebra__0.9999995.jpg': 0,
  '2380019__zebra__0.9999926.jpg': 1,
  '2380189__zebra__0.9999993.jpg': 0,
  '2380319__broccoli__0.9999957.jpg': 0,
  '2380447__bullet_train__0.9999869.jpg': 0,
  '2380669__parking_meter__0.9999993.jpg': 0,
  '2380865__traffic_light__0.99999714.jpg': 0,
  '2380905__gondola__0.9999888.jpg': -1,
  '2380925__zebra__0.9999987.jpg': 0,
  '2381648__zebra__0.9999995.jpg': 0,
  '2381879__zebra__0.99999523.jpg': 0,
  '2381932__traffic_light__0.99999964.jpg': -1,
  '2381941__zebra__0.9999914.jpg': -1,
  '2381968__ski__0.999984.jpg

### Instances per cluster

We can calculate how many instances were assigned to each identified cluster, so we can detect any imbalance in cluster sizes.

In [20]:
for i, res in enumerate(list(cl_results.values())):
    counts = np.unique(list(res.values()), return_counts=True)
    print('PARAM. SET #', i, ':')
    print('Clusters:             ', counts[0])
    print('Instances per cluster:', counts[1])

PARAM. SET # 0 :
Clusters:              [-1  0  1]
Instances per cluster: [ 50 144   4]
PARAM. SET # 1 :
Clusters:              [-1  0  1  2]
Instances per cluster: [ 35 157   4   2]
PARAM. SET # 2 :
Clusters:              [-1  0  1  2  3  4  5  6  7  8]
Instances per cluster: [58 78 11  5 18  6  5  8  4  5]
PARAM. SET # 3 :
Clusters:              [-1  0  1  2  3  4  5  6  7  8  9 10]
Instances per cluster: [54 78 11  6 14  5  8  8  5  3  3  3]


Param. sets # 2 and # 3 produce the most varied clusterings so far. We still need a way to enforce a minimum number of instances per cluster, because right now every clustering obtained has one cluster with many instances while the remaining clusters contain only a few each.
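
One possible way to impose such a minimum after the fact, shown here only as a sketch and not as the approach adopted in this notebook, is to relabel every cluster below a size threshold as noise. The function name `relabel_small_clusters`, the `min_size` parameter and the toy data are all assumptions made for this illustration.

```python
from collections import Counter

NOISE_LABEL = -1  # label DBSCAN gives to instances outside any cluster

def relabel_small_clusters(name_label_map, min_size):
    '''Return a copy of the {image name: cluster label} mapping in which
    clusters with fewer than min_size instances are relabelled as noise.'''
    sizes = Counter(name_label_map.values())
    return {
        name: label if label == NOISE_LABEL or sizes[label] >= min_size else NOISE_LABEL
        for name, label in name_label_map.items()
    }

# Toy mapping: cluster 0 has three instances, cluster 1 has only one.
toy_map = {'img_a': 0, 'img_b': 0, 'img_c': 0, 'img_d': 1, 'img_e': -1}
filtered_map = relabel_small_clusters(toy_map, min_size=2)
print(filtered_map)
```

This keeps the large clusters untouched and only grows the noise set, so any score that ignores noise would then be computed over fewer instances.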

## Clustering Analysis

Once we have the clusters produced by every set of clustering parameters, we can analyse the proportion of interpretation techniques selected for the clustered instances, and also how "pure" or "accurate" the clusters are.

### Clustering Proportions

In [21]:
def get_indiv_clustering_proportions(name_label_map, name_best_tech_map=name_tech_map):
    # Clustering proportions
    cl_props = {}
    for name, cl_label in name_label_map.items():
        # Add cluster label to clustering proportions if it hasn't been added yet
        if (cl_label not in cl_props.keys()): cl_props[cl_label] = {}
        # If best technique is in current cluster
        technique = name_best_tech_map[name]
        if (technique in cl_props[cl_label].keys()):
            cl_props[cl_label][technique] += 1 # add one...
        else:
            cl_props[cl_label][technique] = 1 # else, create with one
    return cl_props

def get_global_clustering_proportions(cl_results):
    '''Returns a dictionary mapping the index of every param set
    used for the clustering results with the best technique proportions
    of every individual cluster'''
    results = {}
    for param_set_idx, name_label_map in cl_results.items():
        results[param_set_idx] = get_indiv_clustering_proportions(name_label_map)
    return results

In [22]:
cl_props = get_global_clustering_proportions(cl_results)

In [23]:
for param_set_idx, cl_prop in cl_props.items():
    print('CLUSTERING PROPORTIONS FOR PARAM. SET #', param_set_idx)
    for cl_label, ind_cl_prop in cl_prop.items():
        print(cl_label, '-', ind_cl_prop)

CLUSTERING PROPORTIONS FOR PARAM. SET # 0
-1 - {'lime': 9, 'ig': 31, 'xrai': 10}
0 - {'ig': 113, 'lime': 9, 'xrai': 21, 'anchor': 1}
1 - {'ig': 3, 'xrai': 1}
CLUSTERING PROPORTIONS FOR PARAM. SET # 1
0 - {'lime': 12, 'ig': 120, 'xrai': 24, 'anchor': 1}
-1 - {'ig': 22, 'lime': 6, 'xrai': 7}
1 - {'ig': 3, 'xrai': 1}
2 - {'ig': 2}
CLUSTERING PROPORTIONS FOR PARAM. SET # 2
-1 - {'lime': 11, 'ig': 38, 'xrai': 9}
0 - {'ig': 60, 'xrai': 15, 'lime': 2, 'anchor': 1}
1 - {'ig': 11}
2 - {'lime': 2, 'ig': 3}
3 - {'ig': 12, 'xrai': 6}
4 - {'ig': 5, 'xrai': 1}
5 - {'ig': 3, 'lime': 2}
8 - {'ig': 5}
6 - {'ig': 8}
7 - {'ig': 2, 'lime': 1, 'xrai': 1}
CLUSTERING PROPORTIONS FOR PARAM. SET # 3
-1 - {'lime': 7, 'ig': 39, 'xrai': 8}
0 - {'ig': 60, 'xrai': 15, 'lime': 2, 'anchor': 1}
1 - {'ig': 11}
2 - {'lime': 3, 'ig': 3}
3 - {'ig': 8, 'xrai': 6}
10 - {'lime': 1, 'ig': 1, 'xrai': 1}
4 - {'ig': 3, 'lime': 2}
5 - {'lime': 2, 'ig': 6}
6 - {'ig': 8}
7 - {'ig': 4, 'xrai': 1}
8 - {'ig': 1, 'lime': 1, 'xrai': 1}


Almost every cluster has a majority of instances best explained with the IG technique. This is not surprising, because we already knew that IG was by far the most popular interpretation technique, but we also wanted to see whether some clusters preferred other techniques (i.e. whether their proportions favored LIME, XRAI or ANCHOR).   
These proportions will be visualized using stacked bar graphs in later stages.
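
The raw counts can be turned into per-cluster shares, which makes clusters of different sizes easier to compare. The sketch below uses the counts printed above for param. set # 0; the names `cl_prop_set0`, `counts_df` and `shares_df` are introduced here for illustration only.

```python
import pandas as pd

# Technique counts per cluster for param. set # 0, copied from the output above.
cl_prop_set0 = {
    -1: {'lime': 9, 'ig': 31, 'xrai': 10},
    0: {'ig': 113, 'lime': 9, 'xrai': 21, 'anchor': 1},
    1: {'ig': 3, 'xrai': 1},
}

# One row per cluster, one column per technique; missing techniques count as 0.
counts_df = pd.DataFrame.from_dict(cl_prop_set0, orient='index').fillna(0)

# Divide every row by its total so that each cluster's shares sum to 1.
shares_df = counts_df.div(counts_df.sum(axis=1), axis=0)
print(shares_df.round(3))
```

A dataframe in this layout could later be passed to `shares_df.plot(kind='bar', stacked=True)` to draw the stacked bars.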

### Clustering Custom Score ("Clustering Accuracy")

Cluster proportions reflect the balance between the majority technique of every cluster and all the other techniques or, in other words, how "pure" the clusters are. We can condense all that information into a single metric.   
That is the motivation for the clustering accuracy.
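
Written out, the score can be summarised as follows. The notation is introduced here and does not appear in the notebook's code: $n_{c,t}$ is the number of instances in cluster $c$ whose best technique is $t$, and $C$ is the set of clusters excluding the noise cluster $-1$.

$$
\text{accuracy} = \frac{\sum_{c \in C} \max_{t} n_{c,t}}{\sum_{c \in C} \sum_{t} n_{c,t}}
$$

When noise is taken into account, the numerator stays the same and the noise instances are added to the denominator only:

$$
\text{accuracy}_{\text{noise}} = \frac{\sum_{c \in C} \max_{t} n_{c,t}}{\sum_{c \in C} \sum_{t} n_{c,t} + \sum_{t} n_{-1,t}}
$$

For param. set # 0, using the proportions printed above, this gives $(113 + 3) / (144 + 4) = 116 / 148 \approx 0.784$ without noise and $116 / (148 + 50) = 116 / 198 \approx 0.586$ with noise.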

In [24]:
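def get_indiv_clust_accuracy(cl_prop, ignore_noise_cluster=True):
    '''Return the fraction of instances whose best technique is the majority
    technique of their cluster. Noise (-1) never adds to the majority; it adds
    to the total only when ignore_noise_cluster is False'''
    majority, total = 0, 0
    # Get cluster label and cluster proportion from clustering proportion
    for cl_label, props in cl_prop.items():
        is_noise = cl_label == -1
        # Skip the noise cluster entirely when requested
        if is_noise and ignore_noise_cluster: continue
        # Every remaining instance counts towards the total
        total += sum(props.values())
        # Noise instances never count towards the majority
        if not is_noise: majority += max(props.values())
    print(majority, total) # DEBUG
    return majority / total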

def get_global_clust_accuracy(cl_props, ignore_noise_clusters=True):
    results = {}
    for param_set_idx, cl_prop in cl_props.items():
        results[param_set_idx] = get_indiv_clust_accuracy(cl_prop, ignore_noise_clusters)
    return results

We calculate clustering accuracies both with and without noise instances.

In [25]:
cl_scores = get_global_clust_accuracy(cl_props)

116 148
125 163
109 140
108 144


In [27]:
cl_scores_with_noise = get_global_clust_accuracy(cl_props, ignore_noise_clusters=False)

116 198
125 198
109 198
108 198


We get the following clustering accuracies for data without and with noise:

In [34]:
total = 0
print('SCORES WITHOUT NOISE')
for param_set_idx, score in cl_scores.items():
    print(param_set_idx, '-', score)
    total += score
print('Avg -', total/len(cl_scores.values()))

SCORES WITHOUT NOISE
0 - 0.7837837837837838
1 - 0.7668711656441718
2 - 0.7785714285714286
3 - 0.75
Avg - 0.769806594499846


In [33]:
total = 0
print('SCORES WITH NOISE')
for param_set_idx, score in cl_scores_with_noise.items():
    print(param_set_idx, '-', score)
    total += score
print('Avg -', total/len(cl_scores_with_noise.values()))

SCORES WITH NOISE
0 - 0.5858585858585859
1 - 0.6313131313131313
2 - 0.5505050505050505
3 - 0.5454545454545454
Avg - 0.5782828282828283


Both the scores with noise and the scores without noise are fairly stable across parameter sets. That allows us to interpret, more generally, how clustering works for the selection of interpretation techniques.   
The avg. clustering accuracy for data without noise is approx. 77%, which means that, provided the image you want to explain is assigned to a cluster (i.e. it is not labelled as noise), there is a 77% chance that the most frequent interpretation technique in that cluster is the best technique for your image.
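
The reading above amounts to a simple lookup rule: for an image that lands in a given cluster, suggest that cluster's most frequent technique. A minimal sketch of that rule follows, using the proportions of param. set # 0; the function name `majority_technique_per_cluster` is an assumption made for this example, and ties are resolved arbitrarily in favor of the technique listed first.

```python
def majority_technique_per_cluster(cl_prop):
    '''Map every non-noise cluster label to its most frequent technique.'''
    return {
        cl_label: max(counts, key=counts.get)
        for cl_label, counts in cl_prop.items()
        if cl_label != -1  # noise instances get no suggestion
    }

# Technique counts per cluster for param. set # 0, copied from the output above.
cl_prop_set0 = {
    -1: {'lime': 9, 'ig': 31, 'xrai': 10},
    0: {'ig': 113, 'lime': 9, 'xrai': 21, 'anchor': 1},
    1: {'ig': 3, 'xrai': 1},
}

suggested = majority_technique_per_cluster(cl_prop_set0)
print(suggested)
```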

In [29]:
dbscan_best_params.iloc[1]

m                  2
e               0.19
data            hist
scaled         False
similarity    cosine
sscore          0.37
clusters           3
instances        163
Name: 1, dtype: object