# Cluster-based Vote Count Prediction for new images

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import sklearn

## Loading Data

In [2]:
# Paths to the IncV3 feature similarity matrix (Euclidean) and to the votes summary
SIM_MX_FILE_PATH = os.path.join('..', 'results', 'matrices', 'incv3_feats_euclid_sim_matrix.csv')
VOTES_FILE_PATH = os.path.join('..', 'results', 'votes_summary.csv')

#### Data (Sim. Matrix between images)

In [3]:
sim_mx_df = pd.read_csv(SIM_MX_FILE_PATH, index_col=0)
sim_mx_df.head(3)

Unnamed: 0,1222__pool_table__0.9999995.jpg,1328__coil__0.99999607.jpg,134__zebra__0.9999949.jpg,2377471__pizza__0.9999988.jpg,2377620__zebra__0.9999882.jpg,2377698__zebra__0.9999999.jpg,2378170__zebra__0.9999902.jpg,2378358__park_bench__0.99999833.jpg,2378523__banana__0.99999785.jpg,2379086__zebra__0.9999975.jpg,...,2417881__zebra__0.9999945.jpg,2417938__banana__0.9999944.jpg,4099__pool_table__0.9999945.jpg,4339__manhole_cover__0.99999416.jpg,4534__viaduct__0.9999877.jpg,4573__barrel__0.9999974.jpg,4673__triumphal_arch__0.9999893.jpg,576__gondola__0.9999993.jpg,577__gondola__0.9999962.jpg,691__cheetah__0.99999213.jpg
1222__pool_table__0.9999995.jpg,0.0,24.89917,22.871903,25.031346,23.751015,25.26349,23.017009,23.726696,26.78418,23.866859,...,23.893972,28.554533,8.689397,23.648353,23.595179,24.177,23.629427,26.545261,27.091866,23.200351
1328__coil__0.99999607.jpg,24.89917,0.0,17.500986,18.181155,17.596692,19.09994,17.112792,17.488145,20.532969,17.789077,...,17.571292,23.570307,21.901239,16.987964,17.437562,19.139305,17.123877,19.577529,20.261407,16.637508
134__zebra__0.9999949.jpg,22.871903,17.500986,0.0,17.436307,7.257223,7.579433,6.056684,16.007572,19.907645,6.56977,...,5.660697,22.369222,19.397503,15.556074,16.074305,16.874761,16.384309,17.144394,17.840449,14.699963


#### Votes

In [4]:
votes_df = pd.read_csv(VOTES_FILE_PATH, index_col=0)
votes_df.head(3)

Unnamed: 0,ig,lime,xrai,anchor,best
1222__pool_table__0.9999995.jpg,12,13,3,1,lime
1328__coil__0.99999607.jpg,17,4,3,2,ig
134__zebra__0.9999949.jpg,14,1,8,2,ig


Here is a sanity check of the vote proportions in our dataset. In the original XAI-CBR paper, the vote proportions were:
- IG: 45%
- XRAI: 30%
- LIME: 18%
- ANCHOR: 7%

IG was also the most voted technique under hard-voting aggregation, being the winning technique in 62% of the images.


In [5]:
vote_cols = ['ig', 'lime', 'xrai', 'anchor']
votes_df[vote_cols].sum() / votes_df[vote_cols].values.sum()  # total is 2867 votes

ig        0.488315
lime      0.183467
xrai      0.271713
anchor    0.056505
dtype: float64

These proportions differ slightly from the ones presented in the paper: some of the vote share of the XRAI and ANCHOR techniques appears to have shifted to IG. We will look into this later; it should not be of great importance for the experiments in this notebook.
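
As a minimal sketch of the two aggregations mentioned above, the snippet below computes the share of votes per technique (soft aggregation) and the share of images won by each technique (hard voting). It uses only the three rows visible in the `votes_df.head(3)` output, so the numbers illustrate the computation and are not the dataset-wide figures; the variable names are illustrative assumptions.

```python
import numpy as np

techniques = np.array(['ig', 'lime', 'xrai', 'anchor'])
# The three rows visible in the votes_df.head(3) output (not the full dataset).
sample_votes = np.array([
    [12, 13, 3, 1],
    [17, 4, 3, 2],
    [14, 1, 8, 2],
])

# Soft aggregation: share of all cast votes that each technique received.
vote_proportions = sample_votes.sum(axis=0) / sample_votes.sum()

# Hard aggregation: each image counts once, for its most voted technique.
winners = techniques[sample_votes.argmax(axis=1)]
hard_share = np.array([(winners == t).mean() for t in techniques])

for name, soft, hard in zip(techniques, vote_proportions, hard_share):
    print(f'{name:>6}: votes {soft:.3f}, images won {hard:.3f}')
```

The two aggregations can disagree: in this sample the first image is won by LIME even though IG collects the most votes overall.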

### Data Preprocessing

In [6]:
X = sim_mx_df.values # Values from sim. matrix
X_names = sim_mx_df.index.values # Names of every image
y = votes_df.values[:, :4] # Vote count for each image
best = votes_df.values[:, -1] # Most voted technique for each image

In [7]:
print(X.shape, X_names.shape, y.shape, best.shape)

(198, 198) (198,) (198, 4) (198,)


#### Instance deletion
Stratified subsampling cannot be applied to the dataset as it is, because only one instance is best explained by ANCHOR. Since that single instance carries very little weight in the dataset, we will continue without it (i.e. we will find that instance and remove it from the dataset).
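
The limitation can be reproduced on toy data. The sketch below assumes made-up labels in which one class has a single member and checks whether scikit-learn's `StratifiedShuffleSplit` can draw a split before and after that member is removed. The helper `can_stratify` and the toy arrays are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels (an assumption for illustration): one class has a single member.
toy_best = np.array(['ig'] * 12 + ['lime'] * 7 + ['anchor'])
toy_X = np.zeros((len(toy_best), 1))  # the features are irrelevant here


def can_stratify(features, labels):
    """Return True if a stratified 80/20 split can be drawn."""
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    try:
        next(splitter.split(features, labels))
        return True
    except ValueError:
        # scikit-learn refuses to stratify when a class has fewer than 2 members.
        return False


keep = toy_best != 'anchor'
print(can_stratify(toy_X, toy_best))              # singleton class present
print(can_stratify(toy_X[keep], toy_best[keep]))  # singleton class removed
```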

In [8]:
# At what index is the anchor instance located?
anchor_idxs = np.flatnonzero(best == 'anchor') # every index whose best technique is ANCHOR
anchor_idxs

array([155], dtype=int64)

In [9]:
# What's the name of that image and its associated technique?
X_names[anchor_idxs], best[anchor_idxs]

(array(['2411942__zebra__0.99999654.jpg'], dtype=object),
 array(['anchor'], dtype=object))

In [10]:
# Delete that instance from all data partitions (X, y, etc.)
X = np.delete(X, anchor_idxs, axis=0)
X = np.delete(X, anchor_idxs, axis=1) # Twice in sim. matrix (both rows and columns)
X_names = np.delete(X_names, anchor_idxs, axis=0)
y = np.delete(y, anchor_idxs, axis=0)
best = np.delete(best, anchor_idxs, axis=0)

In [11]:
print(X.shape, X_names.shape, y.shape, best.shape)

(197, 197) (197,) (197, 4) (197,)
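
Because the similarity matrix is indexed by image on both axes, an image has to be dropped from the rows and from the columns for the matrix to stay square and aligned with the labels. The following sketch shows this on a toy 4x4 matrix with illustrative values, together with an equivalent formulation based on `np.ix_`; none of the names below come from the notebook.

```python
import numpy as np

# Toy symmetric distance matrix for four images (illustrative values).
toy_dist = np.array([
    [0.0, 2.0, 5.0, 1.0],
    [2.0, 0.0, 3.0, 4.0],
    [5.0, 3.0, 0.0, 6.0],
    [1.0, 4.0, 6.0, 0.0],
])
drop_idxs = np.array([2])

# Delete the row and then the column of the dropped image.
reduced = np.delete(np.delete(toy_dist, drop_idxs, axis=0), drop_idxs, axis=1)

# Equivalent formulation: keep all other indices on both axes at once.
keep_idxs = np.setdiff1d(np.arange(toy_dist.shape[0]), drop_idxs)
reduced_alt = toy_dist[np.ix_(keep_idxs, keep_idxs)]

print(reduced.shape)
```

A quick check that the result is still symmetric with a zero diagonal is a cheap way to catch a deletion applied to only one axis.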


## Splitting and Fold Creation

In [12]:
from sklearn.model_selection import StratifiedShuffleSplit as SSS
from sklearn.model_selection import ShuffleSplit as SS

#### TODO: Should I perform stratified subsampling or standard subsampling?

In [13]:
STRATIFIED = True

In [14]:
# Perform split
if STRATIFIED:
    splitter = SSS(n_splits=5, test_size=0.2, random_state=42)
else:
    splitter = SS(n_splits=5, test_size=0.2, random_state=42)
splits = list(splitter.split(X, best))
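
One way to inform the stratified-versus-standard question is to compare how the minority class is represented in the test folds under each splitter. The sketch below does so on toy, imbalanced labels (an assumption, not the notebook's data); `minority_share_per_split` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Toy, imbalanced labels (illustrative): 80% 'ig', 20% 'lime'.
toy_labels = np.array(['ig'] * 80 + ['lime'] * 20)
toy_feats = np.zeros((len(toy_labels), 1))


def minority_share_per_split(splitter):
    """Fraction of 'lime' images in the test fold of every split."""
    return [
        float(np.mean(toy_labels[test_idxs] == 'lime'))
        for _, test_idxs in splitter.split(toy_feats, toy_labels)
    ]


stratified = minority_share_per_split(
    StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42))
plain = minority_share_per_split(
    ShuffleSplit(n_splits=5, test_size=0.2, random_state=42))

print('stratified:', stratified)
print('plain:     ', plain)
```

With stratification every test fold keeps the 20% minority share, whereas the plain splitter lets it vary from fold to fold, which matters more the smaller the dataset is.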

In [15]:
splits[0]

(array([192, 147, 177,  11, 140,  51, 127, 118, 172, 191,  62, 124, 115,
         80, 190, 142,  92,  69,  25,  14,  42,   3, 185,  90,  10,  76,
        176, 114,  44,  98, 166, 121,  79, 170,   1, 183,  28,  31, 155,
         75, 156, 101, 171,  13, 110, 122,  38,  27, 136,  20,   6,  56,
         35,  59, 139,  33,  78,  82,  21, 167, 117,  12,  49,  15,   5,
        152, 132,  81,  61, 163, 175,  91,   7, 174, 135,  74, 193, 129,
         60,  96,  50, 161, 159, 145, 126,  19,  65, 188,  73,  89, 133,
        179,  40,  86, 112,  26, 168, 189, 149,  94, 194,  18, 138, 169,
        102,  97,  71, 130,  53,  99, 148, 154,   8,  34, 182, 105,  55,
         95, 153,  72, 144,  77,  52,  30,   9,  37,   4,  93, 128, 137,
        195, 160, 111,  45, 164, 151,  29,  48,  70,  43,  57, 157,  39,
        141,  85, 150,  67,   0,  47, 113,  32,  17, 131, 180,  66, 100,
        186], dtype=int64),
 array([ 54, 187, 103,  23, 104, 108, 181,  64, 109, 134,  16, 146,   2,
        116, 106, 119, 

## Vote Count Prediction

In [34]:
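def calc_vote_dist(p1, p2, vote_dist_metric):
    if vote_dist_metric == 'euclidian':  # spelling kept for existing callers
        return np.sqrt(np.sum(np.square(p1 - p2)))
    elif vote_dist_metric == 'manhattan':
        return np.sum(np.abs(p1 - p2))
    # Fail loudly instead of printing a message and silently returning None
    raise ValueError(f'Unknown metric type: {vote_dist_metric!r}')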

def get_nearest_instances_indices(dist_to_train_imgs, train_idxs, k):
    # Return the dataset indices of the k training images nearest to the test
    # image (all of them if k is at least the size of the training set).
    # A stable sort keeps the first-listed image when distances tie.
    order = np.argsort(dist_to_train_imgs, kind='stable')[:k]
    return np.asarray(train_idxs)[order]

def get_indiv_vote_distances(split_idx, k=3, vote_dist_metric='manhattan'):
    vote_distances = {}
    # Prepare data
    train_idxs = splits[split_idx][0]
    test_idxs = splits[split_idx][1]
    # For each test image...
    for test_img_idx in test_idxs:
        # Get distances from test img to each train image
        dist_to_train_imgs = X[test_img_idx, train_idxs]
        # Using those distances, find the k nearest training images
        kn_train_idxs = get_nearest_instances_indices(dist_to_train_imgs, train_idxs, k=k)
        # Aggregate the vote counts associated with those instances
        nearest_vote_counts = [y[train_idx] for train_idx in kn_train_idxs]
        unrounded_vcp = np.average(np.array(nearest_vote_counts, np.float64), axis=0)
        vote_count_prediction = np.round(unrounded_vcp) # int parsing needed?
        # Measure vote distance of test image real vote count vs. VCP of test_image
        test_img_name = X_names[test_img_idx]
        test_img_votes = y[test_img_idx]
        vote_dist = calc_vote_dist(test_img_votes, vote_count_prediction, vote_dist_metric)
        vote_distances[test_img_name] = vote_dist
    return vote_distances

def get_global_vote_distances(no_of_splits, k=3, vote_dist_metric='manhattan'):
    global_vote_distances = {}
    for split_idx in range(no_of_splits):
        global_vote_distances[split_idx] = get_indiv_vote_distances(split_idx, k=k, vote_dist_metric=vote_dist_metric)
    return global_vote_distances
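
To make the prediction step concrete, here is a minimal worked example for a single test image, assuming toy distances and vote counts (none of these values come from the dataset). It follows the same steps as the functions above: pick the k nearest training images, average their vote counts, round, and measure the Manhattan distance to the real vote count.

```python
import numpy as np

# Toy data (illustrative): distances from one test image to four training
# images and the vote counts (ig, lime, xrai, anchor) of those images.
dist_to_train = np.array([4.0, 1.0, 3.0, 2.0])
train_votes = np.array([
    [20, 5, 4, 1],
    [10, 10, 8, 2],
    [14, 6, 9, 1],
    [12, 9, 7, 2],
])
true_votes = np.array([13, 8, 8, 1])
k = 3

# Positions of the k smallest distances (stable sort keeps the first on ties).
nearest = np.argsort(dist_to_train, kind='stable')[:k]

# Average the neighbours' vote counts and round to whole votes.
predicted_votes = np.round(train_votes[nearest].mean(axis=0))

# Manhattan distance between the real and the predicted vote counts.
vote_dist = np.abs(true_votes - predicted_votes).sum()
print(nearest, predicted_votes, vote_dist)
```

Note that `np.round` rounds exact halves to the nearest even number, and that the rounded prediction does not necessarily sum to the same number of votes as the real count.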

In [35]:
no_of_splits = 5
d = 'manhattan' # 'euclidian' or 'manhattan'
global_vote_distances_k1 = get_global_vote_distances(no_of_splits=no_of_splits, k=1, vote_dist_metric=d)
global_vote_distances_k3 = get_global_vote_distances(no_of_splits=no_of_splits, k=3, vote_dist_metric=d)
global_vote_distances_k5 = get_global_vote_distances(no_of_splits=no_of_splits, k=5, vote_dist_metric=d)
global_vote_distances_k7 = get_global_vote_distances(no_of_splits=no_of_splits, k=7, vote_dist_metric=d)
global_vote_distances_k9 = get_global_vote_distances(no_of_splits=no_of_splits, k=9, vote_dist_metric=d)
global_vote_distances_k11 = get_global_vote_distances(no_of_splits=no_of_splits, k=11, vote_dist_metric=d)

In [36]:
global_vote_distances_k5

{0: {'2388889__hotdog__0.99999714.jpg': 10.0,
  '2417881__zebra__0.9999945.jpg': 3.0,
  '2403403__banana__0.9999926.jpg': 3.0,
  '2381941__zebra__0.9999914.jpg': 7.0,
  '2403741__zebra__0.99999523.jpg': 9.0,
  '2404281__zebra__0.999998.jpg': 8.0,
  '2416627__zebra__0.9999987.jpg': 3.0,
  '2391964__flamingo__1.0.jpg': 6.0,
  '2404583__umbrella__0.99999297.jpg': 5.0,
  '2409637__four-poster__0.99999464.jpg': 8.0,
  '2380669__parking_meter__0.9999993.jpg': 6.0,
  '2411196__crane__0.9999995.jpg': 15.0,
  '134__zebra__0.9999949.jpg': 16.0,
  '2405905__traffic_light__0.99999535.jpg': 4.0,
  '2404127__zebra__0.9999933.jpg': 6.0,
  '2406857__zebra__0.9999894.jpg': 1.0,
  '2414277__zebra__0.9999908.jpg': 4.0,
  '2385298__parking_meter__0.9999865.jpg': 2.0,
  '2406887__ski__0.99999785.jpg': 6.0,
  '2416228__parking_meter__0.9999914.jpg': 12.0,
  '2415402__zebra__0.99998903.jpg': 6.0,
  '2408701__zebra__0.9999981.jpg': 3.0,
  '2389484__street_sign__0.99998474.jpg': 1.0,
  '2417395__zebra__0.99999

#### TODO: Should distance be vote-based (i.e. nominal) or vote proportion-based (i.e. relative)?

## Metric Evaluation

In [37]:
def eval_indiv_vote_dist(vote_distances):
    vote_distances = list(vote_distances.values())
    metrics = {
        'average': round(np.average(vote_distances), 2),
        'std. dev.': round(np.std(vote_distances), 2),
        'range': [round(np.min(vote_distances), 2), round(np.max(vote_distances), 2)],
    }
    return metrics

def eval_global_vote_dist(global_vote_distances):
    global_metrics = {}
    # Calculate metrics for each split
    for cl_key, vote_distances in global_vote_distances.items():
        metrics = eval_indiv_vote_dist(vote_distances)
        global_metrics[cl_key] = metrics
    # Aggregate metrics for all splits
    global_metrics['global'] = {}
    for metric_type in global_metrics[0].keys():
        metrics_per_type = [metrics[metric_type] for split_key, metrics in global_metrics.items() if split_key != 'global']
        avgd_metrics_per_type = np.round(np.average(np.array(metrics_per_type), axis=0), 2)
        if metric_type == 'range': avgd_metrics_per_type = list(avgd_metrics_per_type)
        global_metrics['global'][metric_type] = avgd_metrics_per_type
    return global_metrics

In [38]:
global_vote_metrics_k1 = eval_global_vote_dist(global_vote_distances_k1)
global_vote_metrics_k3 = eval_global_vote_dist(global_vote_distances_k3)
global_vote_metrics_k5 = eval_global_vote_dist(global_vote_distances_k5)
global_vote_metrics_k7 = eval_global_vote_dist(global_vote_distances_k7)
global_vote_metrics_k9 = eval_global_vote_dist(global_vote_distances_k9)
global_vote_metrics_k11 = eval_global_vote_dist(global_vote_distances_k11)

In [39]:
global_vote_metrics_k1['global']

{'average': 7.59, 'std. dev.': 3.87, 'range': [1.4, 16.8]}

In [40]:
global_vote_metrics_k3['global']

{'average': 6.19, 'std. dev.': 3.24, 'range': [0.8, 14.8]}

In [41]:
global_vote_metrics_k5['global']

{'average': 5.9, 'std. dev.': 3.2, 'range': [1.0, 15.2]}

In [42]:
global_vote_metrics_k7['global']

{'average': 5.82, 'std. dev.': 3.21, 'range': [1.0, 15.8]}

In [43]:
global_vote_metrics_k9['global']

{'average': 5.68, 'std. dev.': 3.11, 'range': [1.0, 15.2]}

In [45]:
global_vote_metrics_k11['global']

{'average': 5.87, 'std. dev.': 3.13, 'range': [1.2, 15.8]}
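
The global averages printed above can be put side by side to see how the error changes with k. The snippet below only copies those six reported averages into a dictionary (the name `avg_vote_dist_by_k` is an illustrative assumption) and compares them; it does not recompute anything.

```python
# Global averages copied from the cell outputs above (Manhattan distance).
avg_vote_dist_by_k = {1: 7.59, 3: 6.19, 5: 5.9, 7: 5.82, 9: 5.68, 11: 5.87}

best_k = min(avg_vote_dist_by_k, key=avg_vote_dist_by_k.get)
baseline = avg_vote_dist_by_k[1]

for k, avg in avg_vote_dist_by_k.items():
    gain = baseline - avg  # reduction relative to using only the nearest image
    print(f'k={k:>2}: average {avg:.2f} (reduction vs. k=1: {gain:.2f})')
print('lowest average at k =', best_k)
```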

The previous results shed light on the viability of predicting the vote count for a new image from the vote prototypes of previously generated image clusters.

On average, the predicted vote count for a new image differs from the real vote count by about 6 votes. The difference does not grow with the number of nearest neighbours (k) used in the prediction: it drops from 7.59 at k=1 to 6.19 at k=3, keeps decreasing slowly to 5.68 at k=9, and rises slightly to 5.87 at k=11. Most of the improvement comes from moving beyond a single neighbour, and the changes after k=3 are small. In the end, this means that when predicting the vote count for a new image, it is advisable to aggregate several nearest neighbours (k between 5 and 9 in these runs) rather than to rely only on the nearest one.

The additional metrics suggest that the distribution of vote count differences is roughly bell-shaped with a slight skew to the right (i.e. towards higher vote differences): the averaged range of roughly 1 to 15 extends much further above the average than below it. With a standard deviation of about 3 and an average vote difference of about 6, most vote differences fall inside the 3-9 range.
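
The shape of the distribution can be checked directly instead of being inferred from the summary metrics. The sketch below computes the sample skewness and the share of values within one standard deviation of the mean, using only the entries visible in the truncated `global_vote_distances_k5` output for split 0, so it is a partial sample; the variable names are illustrative assumptions.

```python
import numpy as np

# Entries visible in the truncated global_vote_distances_k5[0] output above;
# the fold contains more images, so this is only a partial sample.
visible_dists = np.array([10, 3, 3, 7, 9, 8, 3, 6, 5, 8, 6, 15, 16, 4, 6, 1,
                          4, 2, 6, 12, 6, 3, 1], dtype=float)

mean, std = visible_dists.mean(), visible_dists.std()

# Sample skewness (third standardized moment): > 0 means a longer right tail.
skewness = np.mean(((visible_dists - mean) / std) ** 3)

# Share of distances within one standard deviation of the mean.
within_one_std = np.mean(np.abs(visible_dists - mean) <= std)

print(f'mean={mean:.2f} std={std:.2f} skew={skewness:.2f} '
      f'within 1 std={within_one_std:.2f}')
```

Running the same computation on the pooled distances of all splits, or plotting a histogram with `matplotlib`, would give a more reliable picture than this partial sample.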

Taking into account that around 30 votes were cast for every image, the difference in vote count prediction is fairly large: a difference of 6 votes can be really important. However, we still need to calculate the differences in vote count proportions because, at the end of the day, proportions are also an important factor in deciding which techniques are better for new images.
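
A proportion-based distance could be sketched as follows, assuming every image has at least one vote. Each vote vector is normalised to proportions before the Manhattan distance is taken, so the result lies between 0 and 2 and does not depend on how many votes an image received. The function `proportion_distance` and the vote counts are hypothetical and only illustrate the idea.

```python
import numpy as np


def proportion_distance(votes_a, votes_b):
    """Manhattan distance between two vote vectors after normalising each
    one to proportions; the result lies in [0, 2]."""
    p_a = np.asarray(votes_a, dtype=float) / np.sum(votes_a)
    p_b = np.asarray(votes_b, dtype=float) / np.sum(votes_b)
    return float(np.abs(p_a - p_b).sum())


# Illustrative vote counts (ig, lime, xrai, anchor), not taken from the dataset.
real_votes = [15, 6, 7, 2]       # 30 votes
predicted_votes = [12, 8, 8, 2]  # 30 votes
scaled_votes = [30, 12, 14, 4]   # same proportions as real_votes, 60 votes

print(proportion_distance(real_votes, predicted_votes))
print(proportion_distance(real_votes, scaled_votes))
```

In the first comparison the count-based Manhattan distance is 6 votes out of 30, which corresponds to a proportion distance of 0.2; in the second, the counts differ by 30 votes although the proportions are identical.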

#### TODO: Predict techniques with hard voting using vote count predictions
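
A possible starting point for this TODO, under the assumption that predicted and real vote counts are available as arrays with one row per image: take the most voted technique of each prediction and compare it with the most voted technique of the real counts. All values below are made up for illustration.

```python
import numpy as np

techniques = np.array(['ig', 'lime', 'xrai', 'anchor'])

# Illustrative predicted and real vote counts for three images (made-up values).
predicted_counts = np.array([
    [12, 8, 8, 2],
    [5, 14, 9, 1],
    [9, 3, 11, 2],
])
real_counts = np.array([
    [13, 8, 8, 1],
    [11, 10, 7, 2],
    [8, 4, 12, 1],
])

# Hard voting: the predicted technique is the one with the most predicted votes.
# np.argmax returns the first maximum, so ties go to the earlier column.
predicted_best = techniques[predicted_counts.argmax(axis=1)]
real_best = techniques[real_counts.argmax(axis=1)]

accuracy = float(np.mean(predicted_best == real_best))
print(predicted_best, real_best, accuracy)
```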