# Cluster-based Vote Count Prediction for new images

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import sklearn

## Loading Data

In [2]:
# Paths to the similarity matrix (built from IncV3 latent features, per the file name) and to the votes summary
SIM_MX_FILE_PATH = os.path.join('..', 'results', 'matrices', 'incv3_feats_cosine_sim_matrix.csv')
VOTES_FILE_PATH = os.path.join('..', 'results', 'votes_summary.csv')

#### Data (Sim. Matrix between images)

In [3]:
sim_mx_df = pd.read_csv(SIM_MX_FILE_PATH, index_col=0)
sim_mx_df.head(3)

Unnamed: 0,1222__pool_table__0.9999995.jpg,1328__coil__0.99999607.jpg,134__zebra__0.9999949.jpg,2377471__pizza__0.9999988.jpg,2377620__zebra__0.9999882.jpg,2377698__zebra__0.9999999.jpg,2378170__zebra__0.9999902.jpg,2378358__park_bench__0.99999833.jpg,2378523__banana__0.99999785.jpg,2379086__zebra__0.9999975.jpg,...,2417881__zebra__0.9999945.jpg,2417938__banana__0.9999944.jpg,4099__pool_table__0.9999945.jpg,4339__manhole_cover__0.99999416.jpg,4534__viaduct__0.9999877.jpg,4573__barrel__0.9999974.jpg,4673__triumphal_arch__0.9999893.jpg,576__gondola__0.9999993.jpg,577__gondola__0.9999962.jpg,691__cheetah__0.99999213.jpg
1222__pool_table__0.9999995.jpg,0.0,0.766186,0.681436,0.771736,0.731295,0.778913,0.706014,0.751689,0.810367,0.738298,...,0.753613,0.762703,0.06842,0.76917,0.747195,0.747668,0.744787,0.828638,0.826484,0.718752
1328__coil__0.99999607.jpg,0.766186,0.0,0.655775,0.620482,0.643737,0.673256,0.650034,0.668247,0.682593,0.656279,...,0.659462,0.672215,0.75713,0.66209,0.672293,0.739528,0.640632,0.66133,0.660451,0.61215
134__zebra__0.9999949.jpg,0.681436,0.655775,0.0,0.645445,0.126298,0.108648,0.095005,0.656602,0.708973,0.102951,...,0.079798,0.637221,0.648792,0.660485,0.67203,0.660639,0.688565,0.560902,0.557332,0.563768


#### Votes

In [4]:
votes_df = pd.read_csv(VOTES_FILE_PATH, index_col=0)
votes_df.head(3)

Unnamed: 0,ig,lime,xrai,anchor,best
1222__pool_table__0.9999995.jpg,12,13,3,1,lime
1328__coil__0.99999607.jpg,17,4,3,2,ig
134__zebra__0.9999949.jpg,14,1,8,2,ig


As a sanity check, we compare the vote proportions in our dataset with those reported in the original XAI-CBR paper:
- IG: 45%
- XRAI: 30%
- LIME: 18%
- ANCHOR: 7%

Also, IG was the most-voted technique, at least under hard-voting aggregation, winning in 62% of the images.


In [5]:
vote_cols = ['ig', 'lime', 'xrai', 'anchor']
votes_df[vote_cols].sum() / votes_df[vote_cols].values.sum() # divide by the total number of votes (2867)

ig        0.488315
lime      0.183467
xrai      0.271713
anchor    0.056505
dtype: float64

These proportions differ slightly from the ones presented in the paper: IG gains about 4 percentage points, while XRAI and ANCHOR lose about 3 and 1 points respectively. It seems that some votes drifted from XRAI and ANCHOR to IG. We will check this later; it should not be of great importance for the experiments in this notebook.
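
The size of the shift can be made explicit by subtracting the paper's proportions from the observed ones. The sketch below hardcodes both sets of numbers from the text and output above; the names `paper`, `observed` and `diff` are only illustrative.

```python
import pandas as pd

# Proportions reported in the paper (from the list above)
paper = pd.Series({'ig': 0.45, 'lime': 0.18, 'xrai': 0.30, 'anchor': 0.07})
# Proportions observed in this dataset (output of the previous cell)
observed = pd.Series({'ig': 0.488315, 'lime': 0.183467, 'xrai': 0.271713, 'anchor': 0.056505})

# Positive values: the technique gained vote share with respect to the paper
diff = (observed - paper).round(3)
print(diff)
```

A positive entry means the technique has a larger share here than in the paper; a negative one means it lost share.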

### Data Preprocessing

In [6]:
X = sim_mx_df.values # Values from sim. matrix
X_names = sim_mx_df.index.values # Names of every image
y = votes_df.values[:, :4] # Vote count for each image
best = votes_df.values[:, -1] # Most voted technique for each image

In [7]:
print(X.shape, X_names.shape, y.shape, best.shape)

(198, 198) (198,) (198, 4) (198,)


#### Instance deletion
Stratified subsampling cannot be performed on the dataset as it is, because only one instance is best explained by ANCHOR and stratification needs at least two instances per class. Since that single instance carries very little weight in the dataset, we will locate it and remove it before continuing.
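
The limitation can be reproduced on toy data. The following is a minimal sketch, assuming made-up labels in which 'anchor' appears only once; `labels`, `features` and `keep` are illustrative names, not variables of this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels: 'anchor' appears only once, mirroring the situation described above
labels = np.array(['ig'] * 6 + ['lime'] * 3 + ['anchor'])
features = np.zeros((len(labels), 1))  # placeholder features, their content is irrelevant here

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
try:
    next(splitter.split(features, labels))
    failed = False
except ValueError as err:
    failed = True
    print(err)  # the least populated class has too few members

# Dropping the singleton class makes the stratified split possible
keep = labels != 'anchor'
train_idx, test_idx = next(splitter.split(features[keep], labels[keep]))
print(len(train_idx), len(test_idx))
```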

In [8]:
# At what index is the anchor instance located?
anchor_idxs = np.argwhere(best == 'anchor')[0]
anchor_idxs

array([155], dtype=int64)

In [9]:
# What's the name of that image and its associated technique?
X_names[anchor_idxs], best[anchor_idxs]

(array(['2411942__zebra__0.99999654.jpg'], dtype=object),
 array(['anchor'], dtype=object))

In [10]:
# Delete that instance from all data partitions (X, y, etc.)
X = np.delete(X, anchor_idxs, axis=0)
X = np.delete(X, anchor_idxs, axis=1) # Twice in sim. matrix (both rows and columns)
X_names = np.delete(X_names, anchor_idxs, axis=0)
y = np.delete(y, anchor_idxs, axis=0)
best = np.delete(best, anchor_idxs, axis=0)

In [11]:
print(X.shape, X_names.shape, y.shape, best.shape)

(197, 197) (197,) (197, 4) (197,)
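
Removing an item from a pairwise matrix requires dropping both its row and its column, otherwise the matrix stops being square. A small sketch on a made-up 4x4 matrix (the values and the `drop` index are hypothetical):

```python
import numpy as np

# Toy symmetric distance matrix for 4 items
dist = np.array([
    [0.0, 0.1, 0.7, 0.8],
    [0.1, 0.0, 0.6, 0.9],
    [0.7, 0.6, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])
drop = [2]  # index of the item to remove

# Remove the item's row first, then its column
reduced = np.delete(np.delete(dist, drop, axis=0), drop, axis=1)

print(reduced.shape)
print(np.allclose(reduced, reduced.T))  # checks that the result is still symmetric
```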


## Splitting and Fold Creation

In [12]:
from sklearn.model_selection import StratifiedShuffleSplit as SSS
from sklearn.model_selection import ShuffleSplit as SS

#### TODO: Should I perform stratified subsampling or standard subsampling?

In [13]:
STRATIFIED = True

In [14]:
# Perform split
splitter = None
if STRATIFIED: splitter = SSS(n_splits=5, test_size=0.2, random_state=42)
else: splitter = SS(n_splits=5, test_size=0.2, random_state=42)
splits = splitter.split(X, best)
splits = list(splits)

In [15]:
splits[0]

(array([192, 147, 177,  11, 140,  51, 127, 118, 172, 191,  62, 124, 115,
         80, 190, 142,  92,  69,  25,  14,  42,   3, 185,  90,  10,  76,
        176, 114,  44,  98, 166, 121,  79, 170,   1, 183,  28,  31, 155,
         75, 156, 101, 171,  13, 110, 122,  38,  27, 136,  20,   6,  56,
         35,  59, 139,  33,  78,  82,  21, 167, 117,  12,  49,  15,   5,
        152, 132,  81,  61, 163, 175,  91,   7, 174, 135,  74, 193, 129,
         60,  96,  50, 161, 159, 145, 126,  19,  65, 188,  73,  89, 133,
        179,  40,  86, 112,  26, 168, 189, 149,  94, 194,  18, 138, 169,
        102,  97,  71, 130,  53,  99, 148, 154,   8,  34, 182, 105,  55,
         95, 153,  72, 144,  77,  52,  30,   9,  37,   4,  93, 128, 137,
        195, 160, 111,  45, 164, 151,  29,  48,  70,  43,  57, 157,  39,
        141,  85, 150,  67,   0,  47, 113,  32,  17, 131, 180,  66, 100,
        186], dtype=int64),
 array([ 54, 187, 103,  23, 104, 108, 181,  64, 109, 134,  16, 146,   2,
        116, 106, 119, 

## Clustering

In [16]:
clusterable_params = []

In [17]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

In [18]:
def get_sim_mx_subset(sim_mx_values, filter_idxs):
    return sim_mx_values.take(filter_idxs, axis=0).take(filter_idxs, axis=1)
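
To illustrate what `get_sim_mx_subset` does, the sketch below applies the same two `take` calls to a small stand-in matrix and compares the result with `np.ix_` indexing, which is an equivalent way of writing it. The matrix `mx` and the indices `idxs` are made up for the example.

```python
import numpy as np

mx = np.arange(16).reshape(4, 4)   # stand-in for a 4x4 pairwise matrix
idxs = np.array([3, 0, 2])         # e.g. the training indices of a split

# Same operation as get_sim_mx_subset: select rows, then columns
subset_take = mx.take(idxs, axis=0).take(idxs, axis=1)
# Equivalent form using open-mesh indexing
subset_ix = mx[np.ix_(idxs, idxs)]

print(subset_take)
print(np.array_equal(subset_take, subset_ix))
```

Note that plain `mx[idxs, idxs]` would not work here: it pairs the indices element by element and returns only three diagonal values instead of a 3x3 block.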

In [19]:
def fit_dbscan_sim_mx(data, min_samples, eps_values, 
               min_no_clusters=5, max_no_clusters=np.inf,
               min_clust_instances=None, min_clust_instances_pct=0.85,
               max_clust_instances=np.inf):
    # Condition precalculation
    if min_clust_instances_pct: # If % was defined
        min_clust_instances = round(data.shape[0] * min_clust_instances_pct)
    elif not min_clust_instances: # Else, if nominal amount was not specified
        min_clust_instances = 100
    # Code
    scores, clusters, instances = [], [], []
    for m in min_samples:
        row_scores, row_clusters, row_instances = [], [], []
        for e in eps_values:
            db = DBSCAN(min_samples=m, eps=e, metric='precomputed').fit(data)
            # Get only non anomalous instances and indices
            non_a = db.labels_ != -1 # [False, ..., False] if all are outliers
            non_a_idxs = np.argwhere(non_a==True)
            non_a_idxs = non_a_idxs.reshape(non_a_idxs.shape[0])
            # Calculate conditions
            n_clusters = len(np.unique(db.labels_[non_a])) # 0 if all are outliers
            n_instances = len(db.labels_[non_a]) # 0 if all are outliers
            # Apply conditions (a None score shows up as NaN because pandas stores missing values in numeric columns as NaN)
            valid_n_clusters = n_clusters >= min_no_clusters and n_clusters <= max_no_clusters
            valid_n_cl_instances = n_instances >= min_clust_instances and n_instances <= max_clust_instances
            if (valid_n_clusters and valid_n_cl_instances):
                non_a_data = get_sim_mx_subset(data, non_a_idxs)
                score = silhouette_score(non_a_data, db.labels_[non_a], metric='precomputed')
            else:
                score = None
            # Store results
            row_scores.append(score)
            row_clusters.append(n_clusters)
            row_instances.append(n_instances)
        # Store row results
        scores.append(row_scores)
        clusters.append(row_clusters)
        instances.append(row_instances)
    # Prepare and return values
    ms_axis = pd.Index(min_samples, name='Min_samples')
    eps_axis = pd.Index(eps_values, name='Epsilon')
    df_scores = pd.DataFrame(scores, index=ms_axis, columns=eps_axis)
    df_clusters = pd.DataFrame(clusters, index=ms_axis, columns=eps_axis)
    df_instances = pd.DataFrame(instances, index=ms_axis, columns=eps_axis)
    return df_scores, df_clusters, df_instances
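
The core step of `fit_dbscan_sim_mx` can be tried in isolation. This is a minimal sketch on seven made-up points on a line, assuming an absolute-difference distance matrix; it runs DBSCAN on the precomputed matrix and scores only the non-noise points, as the function above does.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Seven points on a line: two tight groups and one isolated point
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0])
dist = np.abs(points[:, None] - points[None, :])  # pairwise distance matrix

db = DBSCAN(min_samples=2, eps=0.15, metric='precomputed').fit(dist)
clustered = db.labels_ != -1            # False for noise points
idxs = np.flatnonzero(clustered)

# The silhouette is computed on clustered points only
sub_dist = dist[np.ix_(idxs, idxs)]
score = silhouette_score(sub_dist, db.labels_[clustered], metric='precomputed')

print(db.labels_, round(score, 3))
```

The isolated point is labelled -1 (noise) and is excluded before scoring, which is why the silhouette can stay high even when several instances are left unclustered.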

In [20]:
def print_results(m, eps, scores_df, instances_df, clusters_df):
    score = round(scores_df.loc[m][eps], 4)
    instances = instances_df.loc[m][eps]
    clusters = clusters_df.loc[m][eps]
    print(f'DBSCAN using parameters m={m} and eps={eps} yields the next clustering results:')
    print()
    print(f'- Sil. score: {score}')
    print(f'- {instances} clustered instances into {clusters} clusters')
    print(f'- Avg. of {round(instances/clusters, 2)} instances per cluster')

In [21]:
X[splits[0][0]].shape[0] * 0.85 # about 133 clustered instances are needed

133.45

#### Split #0

In [22]:
X_split_0 = get_sim_mx_subset(X, splits[0][0])
X_split_0.shape

(157, 157)

In [26]:
dfs, dfc, dfi = fit_dbscan_sim_mx(X_split_0, range(2, 6), np.arange(0.15, 0.6, 0.05))
dfs

Min_samples \ Epsilon,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55
2,,0.803143,0.799461,0.799461,0.795223,0.795223,0.771984,0.726677,0.60593
3,,,0.796563,0.796563,0.792157,0.792157,0.767886,0.72055,0.598963
4,,,,,,,,,0.608909
5,,,,,,,,,


In [27]:
print_results(2, 0.2, dfs, dfi, dfc)

DBSCAN using parameters m=2 and eps=0.2 yields the next clustering results:

- Sil. score: 0.8031
- 137 clustered instances into 15 clusters
- Avg. of 9.13 instances per cluster


In [28]:
clusterable_params.append([2, 0.2, 0])

#### Split #1

In [29]:
X_split_1 = get_sim_mx_subset(X, splits[1][0])
X_split_1.shape

(157, 157)

In [33]:
dfs, dfc, dfi = fit_dbscan_sim_mx(X_split_1, range(2, 5), np.arange(0.15, 0.6, 0.05))
dfs

Min_samples \ Epsilon,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55
2,,0.80996,0.80996,0.80996,0.805387,0.805387,0.781744,0.735845,0.534037
3,,,,,,,,,0.5277
4,,,,,,,,,0.536607


In [34]:
print_results(2, 0.2, dfs, dfi, dfc)

DBSCAN using parameters m=2 and eps=0.2 yields the next clustering results:

- Sil. score: 0.81
- 139 clustered instances into 16 clusters
- Avg. of 8.69 instances per cluster


In [35]:
clusterable_params.append([2, 0.2, 1])

#### Split #2

In [36]:
X_split_2 = get_sim_mx_subset(X, splits[2][0])
X_split_2.shape

(157, 157)

In [38]:
dfs, dfc, dfi = fit_dbscan_sim_mx(X_split_2, range(2, 5), np.arange(0.15, 0.6, 0.05))
dfs

Min_samples \ Epsilon,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55
2,,0.806571,0.80183,0.80183,0.797586,0.797586,0.797586,0.734346,0.554357
3,,,0.807863,0.807863,0.803408,0.803408,0.803408,0.734158,0.552517
4,,,,,,,,,0.558464


In [40]:
print_results(3, 0.25, dfs, dfi, dfc)

DBSCAN using parameters m=3 and eps=0.25 yields the next clustering results:

- Sil. score: 0.8079
- 134 clustered instances into 12 clusters
- Avg. of 11.17 instances per cluster


In [41]:
clusterable_params.append([3, 0.25, 2])

#### Split #3

In [42]:
X_split_3 = get_sim_mx_subset(X, splits[3][0])
X_split_3.shape

(157, 157)

In [44]:
dfs, dfc, dfi = fit_dbscan_sim_mx(X_split_3, range(2, 5), np.arange(0.15, 0.6, 0.05))
dfs

Min_samples \ Epsilon,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55
2,,0.806117,0.803959,0.803959,0.799261,0.799261,0.784283,0.730971,0.484026
3,,,,,,,0.780785,0.725251,0.478951
4,,,,,,,,,0.490822


In [45]:
print_results(2, 0.2, dfs, dfi, dfc)

DBSCAN using parameters m=2 and eps=0.2 yields the next clustering results:

- Sil. score: 0.8061
- 138 clustered instances into 15 clusters
- Avg. of 9.2 instances per cluster


In [46]:
clusterable_params.append([2, 0.2, 3])

#### Split #4

In [47]:
X_split_4 = get_sim_mx_subset(X, splits[4][0])
X_split_4.shape

(157, 157)

In [49]:
dfs, dfc, dfi = fit_dbscan_sim_mx(X_split_4, range(2, 5), np.arange(0.15, 0.6, 0.05))
dfs

Min_samples \ Epsilon,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55
2,,0.806661,0.80185,0.80185,0.797538,0.797538,0.774407,0.726842,0.636311
3,,,,,0.794926,0.794926,0.770752,0.721169,0.62965
4,,,,,,,,,0.660789


In [50]:
print_results(2, 0.2, dfs, dfi, dfc)

DBSCAN using parameters m=2 and eps=0.2 yields the next clustering results:

- Sil. score: 0.8067
- 136 clustered instances into 15 clusters
- Avg. of 9.07 instances per cluster


In [51]:
clusterable_params.append([2, 0.2, 4])

#### Clusterable parameters for each split

In [52]:
clusterable_params

[[2, 0.2, 0], [2, 0.2, 1], [3, 0.25, 2], [2, 0.2, 3], [2, 0.2, 4]]

## Clustering Results

In [53]:
def get_indiv_clustering_results(params):
    '''Returns a dictionary mapping the name of an image
    with the cluster it belongs to'''
    # Preconditions
    split_idx = params[2]
    train_idxs = splits[split_idx][0]
    # Prepare data (always X, not feats_df)
    sim_mx_subset = get_sim_mx_subset(X, train_idxs)
    img_names = X_names[train_idxs]
    # Perform clustering
    dbscan = DBSCAN(min_samples=params[0], eps=params[1], metric='precomputed')
    dbscan = dbscan.fit(sim_mx_subset)
    # Generate {img_name : label} mapping
    name_label_map = {name: label for name, label in zip(img_names, dbscan.labels_)}
    return name_label_map

def get_global_clustering_results(params_set):
    '''Returns a dictionary mapping the index of every param set
    in 'params' arg. with the clustering results generated with that param. set'''
    results = {}
    for i, params in enumerate(params_set):
        results[i] = get_indiv_clustering_results(params)
    return results

In [54]:
cl_results = get_global_clustering_results(clusterable_params)

In [55]:
cl_results

{0: {'4573__barrel__0.9999974.jpg': -1,
  '2411372__parking_meter__0.999995.jpg': 0,
  '2415910__zebra__0.9999962.jpg': 1,
  '2380017__zebra__0.9999995.jpg': 1,
  '2410410__ski__0.99999356.jpg': 2,
  '2387305__traffic_light__1.0.jpg': 3,
  '2408884__zebra__0.9999913.jpg': 1,
  '2406581__zebra__0.9999939.jpg': 1,
  '2415102__zebra__0.9999876.jpg': 1,
  '4534__viaduct__0.9999877.jpg': -1,
  '2391862__broccoli__0.99999714.jpg': 4,
  '2408592__goose__0.999998.jpg': -1,
  '2405479__traffic_light__0.9999939.jpg': 3,
  '2396034__remote_control__0.9999856.jpg': -1,
  '4339__manhole_cover__0.99999416.jpg': -1,
  '2410779__parking_meter__0.99999917.jpg': 0,
  '2401383__slug__0.9999933.jpg': -1,
  '2392579__zebra__0.9999969.jpg': 1,
  '2382183__pizza__0.99998593.jpg': 5,
  '2380319__broccoli__0.9999957.jpg': 4,
  '2385461__zebra__0.99998415.jpg': 1,
  '2377471__pizza__0.9999988.jpg': 5,
  '2417421__parking_meter__0.9999999.jpg': 0,
  '2401217__traffic_light__0.9999895.jpg': 3,
  '2379489__parking

In [56]:
# A little sanity check...
# Number of elements should be the same as clusters detected in clustering phase
for i in range(5): print(len(np.unique(list(cl_results[i].values())))-1)

15
16
12
15
15


## Clustering Prototypes

In our experiment, we want to predict the vote count for a new image based on its proximity to the available clusters. Each cluster is composed of many data points, so the proximity of a new data point to a cluster can be measured in different ways, for example as the distance between the new point and the nearest clustered point in the dataset.   
However, this approach can be biased, because a new point gets associated with a cluster on the basis of a single nearby point rather than the overall position of the cluster. To avoid this, we calculate a "prototype" for each cluster: a data point that is the centroid of all the data points in the cluster. This way, the distance to the general position of a cluster can be measured more reliably.
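
In the cells that follow, the prototype is built over vote counts: the per-technique mean of the votes of all images in a cluster, rounded to whole votes. A minimal sketch of that calculation, assuming (only for illustration) that the three images shown in `votes_df.head(3)` belong to the same cluster:

```python
import numpy as np

# Vote counts (ig, lime, xrai, anchor) of three images assumed to share a cluster
cluster_votes = np.array([
    [12, 13, 3, 1],
    [17,  4, 3, 2],
    [14,  1, 8, 2],
])

# Prototype: per-technique mean, rounded to whole votes
prototype = np.round(cluster_votes.mean(axis=0)).astype(int)
print(prototype)
```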

In [57]:
votes_df.loc['1222__pool_table__0.9999995.jpg'].values[:-1]

array([12, 13, 3, 1], dtype=object)

In [58]:
def gen_indiv_cl_prototypes(cl_result, ignore_noise=True):
    # Separate image votes according to the clusters they belong to
    votes_by_cluster = {}
    for img_name, cl_idx in cl_result.items():
        if ignore_noise and cl_idx == -1: continue # ignore noise cluster
        img_votes = votes_df.loc[img_name].values[:-1]
        if cl_idx not in votes_by_cluster.keys(): votes_by_cluster[cl_idx] = [img_votes]
        else: votes_by_cluster[cl_idx].append(img_votes)
    # For each cluster, calculate their vote prototype
    vote_prts_by_cluster = {}
    for cl_idx, cl_votes in votes_by_cluster.items():
        unrounded_prt = np.average(np.array(cl_votes,'uint8'), axis=0)
        vote_prts_by_cluster[cl_idx] = np.array(np.round(unrounded_prt), 'int')
    return vote_prts_by_cluster
    
def get_global_cl_prototypes(cl_results, ignore_noise=True):
    global_prototypes = {}
    for i, cl_result in cl_results.items():
        global_prototypes[i] = gen_indiv_cl_prototypes(cl_result, ignore_noise=ignore_noise)
    return global_prototypes

In [59]:
global_prototypes = get_global_cl_prototypes(cl_results)

In [60]:
# Sanity check: No. of elements should be the same as no. of clusters detected in clustering phase
global_prototypes[3]

{0: array([5, 4, 4, 0]),
 1: array([7, 2, 5, 1]),
 2: array([8, 2, 3, 2]),
 3: array([ 8, 10,  2,  0]),
 4: array([7, 1, 6, 0]),
 5: array([5, 6, 1, 1]),
 6: array([8, 2, 2, 1]),
 7: array([7, 4, 2, 0]),
 8: array([9, 2, 1, 2]),
 9: array([6, 5, 3, 0]),
 10: array([6, 3, 4, 0]),
 11: array([11,  2,  6,  0]),
 12: array([8, 0, 4, 0]),
 13: array([5, 2, 5, 0]),
 14: array([8, 2, 2, 0])}

## Vote Count Prediction

In [61]:
np.average(np.array(list(global_prototypes[0].values())), axis=0)

array([7.33333333, 3.53333333, 3.13333333, 0.66666667])

In [62]:
foo = np.array([4, 5, 6, 7])
bar = np.array([8, 9, 10, 10])
np.sum(np.abs(bar - foo))

15

In [63]:
def calc_vote_dist(p1, p2, vote_dist_metric):
    if vote_dist_metric == 'euclidian': return np.sqrt(np.sum(np.square(p1 - p2)))
    elif vote_dist_metric == 'manhattan': return np.sum(np.abs(p1 - p2))
    else: raise ValueError(f'Unknown metric type: {vote_dist_metric}')

def get_img_idxs_per_cluster(cl_result, ignore_noise=True):
    img_idxs_per_cluster = {}
    for img_name, cl_idx in cl_result.items():
        if cl_idx==-1 and ignore_noise: continue # ignore noise cluster
        img_idx = np.argwhere(X_names == img_name)[0][0]
        if cl_idx not in img_idxs_per_cluster.keys():
            img_idxs_per_cluster[cl_idx] = [img_idx]
        else:
            img_idxs_per_cluster[cl_idx].append(img_idx)
    return img_idxs_per_cluster

def get_dist_to_clusters(img_idx, img_idxs_per_cluster):
    dist_to_clusters = {}
    for cl_idx, img_idxs in img_idxs_per_cluster.items():
        distances = X[img_idx, img_idxs]
        dist_to_clusters[cl_idx] = np.average(distances)
    return dist_to_clusters

def get_nearest_clusters_indices(dist_to_clusters, k):
    if k >= len(dist_to_clusters): nearest_cls_idxs = list(dist_to_clusters.keys())
    else:
        nearest_cls_idxs = []
        for i in range(k): # K times...
            nearest_cl_idx, min_dist = None, np.inf
            # ...iterate searching the nearest cluster
            for cl_idx, dist in dist_to_clusters.items():
                if cl_idx in nearest_cls_idxs: continue # ignore prev. found nearest clusters
                if dist < min_dist: nearest_cl_idx, min_dist = cl_idx, dist
            nearest_cls_idxs.append(nearest_cl_idx)
    return nearest_cls_idxs

def get_indiv_vote_distances(prototypes, cl_result, split_idx, k=3, vote_dist_metric='manhattan'):
    vote_distances = {}
    # Prepare data
    test_idxs = splits[split_idx][1]
    img_idxs_per_cluster = get_img_idxs_per_cluster(cl_result)
    # For each test image...
    for test_img_idx in test_idxs:
        # Measure average distances to each cluster
        dist_to_clusters = get_dist_to_clusters(test_img_idx, img_idxs_per_cluster)
        # Using those distances, find the nearest k clusters
        kn_clusters_idxs =  get_nearest_clusters_indices(dist_to_clusters, k=k)
        # Aggregate the vote count prototypes of the clusters associated with those distances
        nearest_prototypes = [prototypes[kn_cl_idx] for kn_cl_idx in kn_clusters_idxs]
        unrounded_vcp = np.average(np.array(nearest_prototypes), axis=0)
        vote_count_prediction = np.round(unrounded_vcp) # int parsing needed?
        # Measure vote distance of test image real vote count vs. VCP of test_image
        test_img_name = X_names[test_img_idx]
        test_img_votes = y[test_img_idx]
        vote_dist = calc_vote_dist(test_img_votes, vote_count_prediction, vote_dist_metric)
        vote_distances[test_img_name] = vote_dist
    return vote_distances

def get_global_vote_distances(all_prototypes, all_cl_results, k=3, vote_dist_metric='manhattan'):
    global_vote_distances = {}
    for split_idx in range(len(all_cl_results)):
        global_vote_distances[split_idx] = get_indiv_vote_distances(all_prototypes[split_idx], all_cl_results[split_idx], split_idx, k=k, vote_dist_metric=vote_dist_metric)
    return global_vote_distances
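
The prediction pipeline above can be followed on a tiny example. The sketch below is a simplified stand-in, not the notebook's functions: the distances, cluster memberships, prototypes and true votes are all hypothetical, and `predict_votes` is an illustrative helper.

```python
import numpy as np

# Hypothetical distances from one test image to 5 training images
dist_to_train = np.array([0.10, 0.12, 0.60, 0.65, 0.90])
# Hypothetical cluster membership of those training images
members = {0: [0, 1], 1: [2, 3], 2: [4]}
# Hypothetical vote prototypes (ig, lime, xrai, anchor) per cluster
prototypes = {0: np.array([8, 2, 3, 1]), 1: np.array([4, 6, 1, 1]), 2: np.array([6, 6, 0, 0])}

# Average distance from the test image to each cluster
avg_dist = {c: dist_to_train[idxs].mean() for c, idxs in members.items()}

def predict_votes(k):
    # Pick the k clusters with the smallest average distance
    nearest = sorted(avg_dist, key=avg_dist.get)[:k]
    # Average their prototypes and round to whole votes
    return np.round(np.mean([prototypes[c] for c in nearest], axis=0))

true_votes = np.array([9, 2, 2, 1])
for k in (1, 2):
    pred = predict_votes(k)
    print(k, pred, np.sum(np.abs(true_votes - pred)))  # Manhattan vote distance
```

In this toy case, adding the second-nearest cluster pulls the prediction away from the true votes, which is the kind of effect the evaluation below measures across the real splits.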

In [64]:
d = 'manhattan' # 'euclidian' or 'manhattan'
global_vote_distances_k1 = get_global_vote_distances(global_prototypes, cl_results, k=1, vote_dist_metric=d)
global_vote_distances_k3 = get_global_vote_distances(global_prototypes, cl_results, k=3, vote_dist_metric=d)
global_vote_distances_k5 = get_global_vote_distances(global_prototypes, cl_results, k=5, vote_dist_metric=d)
global_vote_distances_k7 = get_global_vote_distances(global_prototypes, cl_results, k=7, vote_dist_metric=d)

In [65]:
global_vote_distances_k5

{0: {'2388889__hotdog__0.99999714.jpg': 12.0,
  '2417881__zebra__0.9999945.jpg': 4.0,
  '2403403__banana__0.9999926.jpg': 4.0,
  '2381941__zebra__0.9999914.jpg': 8.0,
  '2403741__zebra__0.99999523.jpg': 5.0,
  '2404281__zebra__0.999998.jpg': 9.0,
  '2416627__zebra__0.9999987.jpg': 6.0,
  '2391964__flamingo__1.0.jpg': 9.0,
  '2404583__umbrella__0.99999297.jpg': 8.0,
  '2409637__four-poster__0.99999464.jpg': 7.0,
  '2380669__parking_meter__0.9999993.jpg': 9.0,
  '2411196__crane__0.9999995.jpg': 11.0,
  '134__zebra__0.9999949.jpg': 14.0,
  '2405905__traffic_light__0.99999535.jpg': 2.0,
  '2404127__zebra__0.9999933.jpg': 3.0,
  '2406857__zebra__0.9999894.jpg': 5.0,
  '2414277__zebra__0.9999908.jpg': 6.0,
  '2385298__parking_meter__0.9999865.jpg': 6.0,
  '2406887__ski__0.99999785.jpg': 7.0,
  '2416228__parking_meter__0.9999914.jpg': 13.0,
  '2415402__zebra__0.99998903.jpg': 5.0,
  '2408701__zebra__0.9999981.jpg': 5.0,
  '2389484__street_sign__0.99998474.jpg': 3.0,
  '2417395__zebra__0.99999

#### TODO: Should distance be vote-based (i.e. nominal) or vote proportion-based (i.e. relative)?

## Metric Evaluation

In [66]:
def eval_indiv_rmse_vote_dist(vote_distances):
    vote_distances = list(vote_distances.values())
    metrics = {
        'average': round(np.average(vote_distances), 2),
        'std. dev.': round(np.std(vote_distances), 2),
        'range': [round(np.min(vote_distances), 2), round(np.max(vote_distances), 2)],
    }
    return metrics

def eval_global_vote_dist(global_vote_distances, mode='rmse'):
    global_metrics = {}
    # Calculate metrics for each split
    for cl_key, vote_distances in global_vote_distances.items():
        if mode=='rmse': metrics = eval_indiv_rmse_vote_dist(vote_distances)
        else: raise NotImplementedError(f'Unsupported mode: {mode}') # technique-wise vote distances are not implemented yet
        global_metrics[cl_key] = metrics
    # Aggregate metrics for all splits
    global_metrics['global'] = {}
    for metric_type in global_metrics[0].keys():
        metrics_per_type = [metrics[metric_type] for split_key, metrics in global_metrics.items() if split_key != 'global']
        avgd_metrics_per_type = np.round(np.average(np.array(metrics_per_type), axis=0), 2)
        if metric_type == 'range': avgd_metrics_per_type = list(avgd_metrics_per_type)
        global_metrics['global'][metric_type] = avgd_metrics_per_type
    return global_metrics

In [67]:
global_vote_metrics_k1 = eval_global_vote_dist(global_vote_distances_k1)
global_vote_metrics_k3 = eval_global_vote_dist(global_vote_distances_k3)
global_vote_metrics_k5 = eval_global_vote_dist(global_vote_distances_k5)
global_vote_metrics_k7 = eval_global_vote_dist(global_vote_distances_k7)

In [68]:
global_vote_metrics_k1['global']

{'average': 5.83, 'std. dev.': 2.97, 'range': [1.6, 14.8]}

In [69]:
global_vote_metrics_k3['global']

{'average': 6.38, 'std. dev.': 2.87, 'range': [1.4, 13.8]}

In [70]:
global_vote_metrics_k5['global']

{'average': 6.85, 'std. dev.': 2.76, 'range': [2.4, 13.6]}

In [71]:
global_vote_metrics_k7['global']

{'average': 6.98, 'std. dev.': 2.78, 'range': [1.8, 13.8]}

The previous results shed light on the viability of predicting the vote count for a new image from the vote prototypes of previously generated image clusters.   

On average, the predicted vote count for a new image differs from the real vote count by about 6 to 7 votes (Manhattan distance), from 5.83 with k=1 to 6.98 with k=7. The difference grows with the number of nearest clusters used in the prediction, although the growth rate is small. Consequently, when predicting the vote count for a new image, it is advisable to use the vote prototype of the nearest cluster only.   
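
The trend across k can be read directly from the global averages printed above. The short sketch below copies those four values by hand; `avg_by_k`, `best_k` and `steps` are illustrative names.

```python
# Global average vote distance per k, copied from the outputs above
avg_by_k = {1: 5.83, 3: 6.38, 5: 6.85, 7: 6.98}

# k with the smallest average vote distance
best_k = min(avg_by_k, key=avg_by_k.get)

# Increase of the average distance between consecutive values of k
ks = sorted(avg_by_k)
steps = [round(avg_by_k[b] - avg_by_k[a], 2) for a, b in zip(ks, ks[1:])]

print(best_k, steps)
```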

Additional metrics suggest that the distribution of vote count differences has a roughly Gaussian shape with a slight skew to the right (i.e. towards higher vote differences). The standard deviation indicates that most vote differences lie within about ±3 of the average vote difference. Given that the average vote difference is about 6, most vote differences fall in the 3-9 range.

Taking into account that around 30 votes were cast for every image, the difference in vote count prediction is fairly large: a difference of 6 votes can be really important. However, we still need to calculate differences in vote count proportions because, at the end of the day, proportions are also an important factor in deciding which techniques are better for new images.
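
One possible way to compare proportions instead of raw counts is to normalise each vote vector by its total before measuring the distance. The helper `proportion_dist` below is hypothetical and only sketches the idea; it assumes every image has at least one vote.

```python
import numpy as np

def proportion_dist(votes_a, votes_b):
    """Manhattan distance between vote proportions (hypothetical helper)."""
    p_a = np.asarray(votes_a, dtype=float) / np.sum(votes_a)
    p_b = np.asarray(votes_b, dtype=float) / np.sum(votes_b)
    return np.sum(np.abs(p_a - p_b))

real = np.array([12, 13, 3, 1])   # first row of votes_df
scaled = real * 2                 # same proportions, twice as many votes

count_dist = np.sum(np.abs(real - scaled))  # count-based Manhattan distance
prop_dist = proportion_dist(real, scaled)   # proportion-based Manhattan distance
print(count_dist, prop_dist)
```

The count-based distance penalises the second vector for having more votes, while the proportion-based one treats both as identical. A proportion-based Manhattan distance always lies between 0 and 2, which also makes results comparable across images with different numbers of votes.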

#### TODO: Predict techniques with hard voting using vote count predictions
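
A possible starting point for this TODO is to take the technique with the highest predicted vote count. The sketch below is an assumption about how hard voting could be applied, not an implemented part of this notebook; `TECHNIQUES` and `hard_vote` are hypothetical names, and ties are resolved in favour of the first technique in column order.

```python
import numpy as np

TECHNIQUES = np.array(['ig', 'lime', 'xrai', 'anchor'])  # column order of votes_df

def hard_vote(vote_counts):
    """Return the technique with the most votes (ties go to the first one)."""
    return TECHNIQUES[int(np.argmax(vote_counts))]

predicted = np.array([8, 10, 2, 0])   # prototype of cluster 3 in split 3, shown above
real = np.array([12, 13, 3, 1])       # first row of votes_df, whose 'best' is 'lime'
print(hard_vote(predicted), hard_vote(real))
```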