# Instance-based Vote Count Prediction for new images

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import sklearn

## Loading Data

In [2]:
# Paths to the similarity matrix (color histograms, Euclidean distance) and to the vote summary
SIM_MX_FILE_PATH = os.path.join('..', 'results', 'matrices', 'color_hist_euclid_sim_matrix.csv')
VOTES_FILE_PATH = os.path.join('..', 'results', 'votes_summary.csv')

#### Data (Sim. Matrix between images)

In [3]:
sim_mx_df = pd.read_csv(SIM_MX_FILE_PATH, index_col=0)
sim_mx_df.head(3)

Unnamed: 0,1222__pool_table__0.9999995.jpg,1328__coil__0.99999607.jpg,134__zebra__0.9999949.jpg,2377471__pizza__0.9999988.jpg,2377620__zebra__0.9999882.jpg,2377698__zebra__0.9999999.jpg,2378170__zebra__0.9999902.jpg,2378358__park_bench__0.99999833.jpg,2378523__banana__0.99999785.jpg,2379086__zebra__0.9999975.jpg,...,2417881__zebra__0.9999945.jpg,2417938__banana__0.9999944.jpg,4099__pool_table__0.9999945.jpg,4339__manhole_cover__0.99999416.jpg,4534__viaduct__0.9999877.jpg,4573__barrel__0.9999974.jpg,4673__triumphal_arch__0.9999893.jpg,576__gondola__0.9999993.jpg,577__gondola__0.9999962.jpg,691__cheetah__0.99999213.jpg
1222__pool_table__0.9999995.jpg,0.0,6285.621051,14593.537679,5772.380618,11441.225721,11880.166834,9123.23835,5983.605101,7356.368941,8848.581129,...,10007.829435,8019.789274,10733.466448,9367.552722,8243.123316,7766.734835,7264.524073,9446.944797,7846.279883,8571.525069
1328__coil__0.99999607.jpg,6285.621051,0.0,12651.202235,4392.810718,9290.19031,10232.351538,6772.767824,4410.541237,7820.760705,6133.401177,...,7638.769273,5194.897304,8131.912321,6862.227918,6088.735008,7629.691999,3859.095749,7329.847611,5987.893787,6005.23505
134__zebra__0.9999949.jpg,14593.537679,12651.202235,0.0,13183.485427,15001.600315,14800.066284,12944.626144,14127.997947,13790.634141,13337.084014,...,13134.598129,13174.04896,11249.446564,13218.859331,11597.730726,14856.417065,11823.161168,13457.201195,13181.74116,12538.407634


#### Votes

In [4]:
votes_df = pd.read_csv(VOTES_FILE_PATH, index_col=0)
votes_df.head(3)

Unnamed: 0,ig,lime,xrai,anchor,best
1222__pool_table__0.9999995.jpg,12,13,3,1,lime
1328__coil__0.99999607.jpg,17,4,3,2,ig
134__zebra__0.9999949.jpg,14,1,8,2,ig


Here's a sanity check of the vote proportions in our dataset. In the original XAI-CBR paper, the vote proportions were:
- IG: 45%
- XRAI: 30%
- LIME: 18%
- ANCHOR: 7%

Also, IG was the most voted technique, at least under hard voting aggregation, winning in 62% of the images.


In [5]:
# 2867 is the total number of votes in the dataset
votes_df[['ig','lime','xrai','anchor']].sum() / 2867

ig        0.488315
lime      0.183467
xrai      0.271713
anchor    0.056505
dtype: float64

These proportions differ slightly from the ones reported in the paper: IG gains about 4 points (48.8% vs. 45%), while XRAI and ANCHOR lose roughly 3 and 1 points respectively, and LIME stays at about 18%. It seems that some XRAI and ANCHOR votes drifted to IG. We'll look into this later; it should not matter much for the experiments in this notebook.
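
As a hedged illustration of this sanity check, the sketch below recomputes the proportions on the three rows printed by `votes_df.head(3)`, dividing by the summed votes instead of a hard-coded total, and derives the hard-voting winner per image. The names `sample_votes` and `techniques` are illustrative and do not come from the notebook.

```python
import numpy as np

# Vote counts for the three images shown by votes_df.head(3)
# (column order: ig, lime, xrai, anchor)
techniques = np.array(['ig', 'lime', 'xrai', 'anchor'])
sample_votes = np.array([
    [12, 13, 3, 1],
    [17, 4, 3, 2],
    [14, 1, 8, 2],
])

# Share of all votes received by each technique
proportions = sample_votes.sum(axis=0) / sample_votes.sum()

# Hard voting: the technique with the most votes wins each image
winners = techniques[sample_votes.argmax(axis=1)]
ig_majority_share = np.mean(winners == 'ig')

print(dict(zip(techniques.tolist(), proportions.round(4).tolist())))
print(winners, round(float(ig_majority_share), 2))
```

On these three rows the winners agree with the `best` column of the printed table (`lime`, `ig`, `ig`).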

### Data Preprocessing

In [6]:
X = sim_mx_df.values # Values from sim. matrix
X_names = sim_mx_df.index.values # Names of every image
y = votes_df.values[:, :4] # Vote count for each image
best = votes_df.values[:, -1] # Most voted technique for each image

In [7]:
print(X.shape, X_names.shape, y.shape, best.shape)

(198, 198) (198,) (198, 4) (198,)


#### Instance deletion
Stratified subsampling cannot be performed on the full dataset because only one instance is best explained by ANCHOR, and a stratified split needs at least two instances per class. Since that single instance carries very little weight in the dataset, we find it and remove it before continuing.

In [8]:
# At what index is the anchor instance located?
anchor_idxs = np.argwhere(best == 'anchor')[0]
anchor_idxs

array([155], dtype=int64)

In [9]:
# What's the name of that image and its associated technique?
X_names[anchor_idxs], best[anchor_idxs]

(array(['2411942__zebra__0.99999654.jpg'], dtype=object),
 array(['anchor'], dtype=object))

In [10]:
# Delete that instance from all data partitions (X, y, etc.)
X = np.delete(X, anchor_idxs, axis=0)
X = np.delete(X, anchor_idxs, axis=1) # Twice in sim. matrix (both rows and columns)
X_names = np.delete(X_names, anchor_idxs, axis=0)
y = np.delete(y, anchor_idxs, axis=0)
best = np.delete(best, anchor_idxs, axis=0)

In [11]:
print(X.shape, X_names.shape, y.shape, best.shape)

(197, 197) (197,) (197, 4) (197,)
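
The same deletion can be sketched with a boolean mask, which also covers the case where more than one instance has to go. This is a minimal sketch on made-up data; `drop_instances` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def drop_instances(dist_mx, names, labels, drop_label):
    """Remove every instance whose label equals drop_label (hypothetical helper)."""
    keep = labels != drop_label
    # np.ix_ selects the same rows and columns, so the matrix stays square
    return dist_mx[np.ix_(keep, keep)], names[keep], labels[keep]

# Toy symmetric distance matrix for four images (made-up values)
dist_mx = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 4.0, 5.0],
    [2.0, 4.0, 0.0, 6.0],
    [3.0, 5.0, 6.0, 0.0],
])
names = np.array(['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'])
labels = np.array(['ig', 'lime', 'anchor', 'ig'])

new_mx, new_names, new_labels = drop_instances(dist_mx, names, labels, 'anchor')
print(new_mx.shape, new_names, new_labels)
```

Selecting rows and columns with the same mask keeps the matrix symmetric and aligned with the name and label arrays.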


## Splitting and Fold Creation

In [12]:
from sklearn.model_selection import StratifiedShuffleSplit as SSS
from sklearn.model_selection import ShuffleSplit as SS

#### TODO: Should I perform stratified subsampling or standard subsampling?

In [13]:
STRATIFIED = True

In [14]:
# Perform split
splitter_cls = SSS if STRATIFIED else SS
splitter = splitter_cls(n_splits=5, test_size=0.2, random_state=42)
# `best` is only used for stratification; ShuffleSplit ignores it
splits = list(splitter.split(X, best))
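
A quick way to check that a split is usable, sketched here on toy data: train and test indices should be disjoint and, with stratification, the class shares should be similar in both parts. The helper `class_shares` and the toy arrays are assumptions for illustration; in the notebook the inputs would be one element of `splits` and the `best` array.

```python
import numpy as np

def class_shares(idxs, labels):
    """Fraction of each class among the selected indices (hypothetical helper)."""
    selected = labels[idxs]
    return {str(c): float(np.mean(selected == c)) for c in np.unique(labels)}

# Toy labels and a hand-made split, for illustration only
labels = np.array(['ig'] * 6 + ['xrai'] * 4)
train_idxs = np.array([0, 1, 2, 6, 7])
test_idxs = np.array([3, 4, 5, 8, 9])

# Train and test indices must not overlap
overlap = np.intersect1d(train_idxs, test_idxs)

train_shares = class_shares(train_idxs, labels)
test_shares = class_shares(test_idxs, labels)
print(len(overlap), train_shares, test_shares)
```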

In [15]:
splits[0]

(array([192, 147, 177,  11, 140,  51, 127, 118, 172, 191,  62, 124, 115,
         80, 190, 142,  92,  69,  25,  14,  42,   3, 185,  90,  10,  76,
        176, 114,  44,  98, 166, 121,  79, 170,   1, 183,  28,  31, 155,
         75, 156, 101, 171,  13, 110, 122,  38,  27, 136,  20,   6,  56,
         35,  59, 139,  33,  78,  82,  21, 167, 117,  12,  49,  15,   5,
        152, 132,  81,  61, 163, 175,  91,   7, 174, 135,  74, 193, 129,
         60,  96,  50, 161, 159, 145, 126,  19,  65, 188,  73,  89, 133,
        179,  40,  86, 112,  26, 168, 189, 149,  94, 194,  18, 138, 169,
        102,  97,  71, 130,  53,  99, 148, 154,   8,  34, 182, 105,  55,
         95, 153,  72, 144,  77,  52,  30,   9,  37,   4,  93, 128, 137,
        195, 160, 111,  45, 164, 151,  29,  48,  70,  43,  57, 157,  39,
        141,  85, 150,  67,   0,  47, 113,  32,  17, 131, 180,  66, 100,
        186], dtype=int64),
 array([ 54, 187, 103,  23, 104, 108, 181,  64, 109, 134,  16, 146,   2,
        116, 106, 119, 

## Vote Count Prediction

In [16]:
def get_nearest_instances_indices(dist_to_train_imgs, train_idxs, k):
    if k >= len(dist_to_train_imgs): nearest_train_idxs = train_idxs
    else:
        nearest_train_idxs = []
        for i in range(k): # K times...
            nearest_train_idx, min_dist = None, np.inf
            # ...iterate searching for the nearest image
            for dist, train_idx in zip(dist_to_train_imgs, train_idxs):
                if train_idx in nearest_train_idxs: continue # ignore previously found nearest images
                if dist < min_dist: nearest_train_idx, min_dist = train_idx, dist
            nearest_train_idxs.append(nearest_train_idx)
        nearest_train_idxs = np.array(nearest_train_idxs)
    return nearest_train_idxs

def get_indiv_vote_predictions(split_idx, k):
    vote_predictions = {}
    # Prepare data
    train_idxs = splits[split_idx][0]
    test_idxs = splits[split_idx][1]
    # For each test image...
    for test_img_idx in test_idxs:
        # Get distances from test img to each train image
        dist_to_train_imgs = X[test_img_idx, train_idxs]
        # Using those distances, find the k nearest training images
        kn_train_idxs = get_nearest_instances_indices(dist_to_train_imgs, train_idxs, k=k)
        # Aggregate the vote counts associated with those instances
        nearest_vote_counts = [y[train_idx] for train_idx in kn_train_idxs]
        unrounded_vcp = np.average(np.array(nearest_vote_counts, np.float64), axis=0)
        vote_count_prediction = np.round(unrounded_vcp) # int parsing needed?
        # Associate an image's name with its vote prediction
        test_img_name = X_names[test_img_idx]
        vote_predictions[test_img_name] = vote_count_prediction
    return vote_predictions

def get_global_vote_predictions(no_of_splits, k):
    global_vote_predictions = {}
    for split_idx in range(no_of_splits):
        global_vote_predictions[split_idx] = get_indiv_vote_predictions(split_idx, k=k)
    return global_vote_predictions
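
The neighbor search above scans the training distances k times. A possible alternative, sketched here with `np.argsort`, sorts them once; this is not the notebook's implementation, and `nearest_train_indices`, `dists` and `votes` are illustrative names with made-up values.

```python
import numpy as np

def nearest_train_indices(dist_to_train_imgs, train_idxs, k):
    """Return the k training indices with the smallest distances (sketch)."""
    dist_to_train_imgs = np.asarray(dist_to_train_imgs)
    train_idxs = np.asarray(train_idxs)
    if k >= len(train_idxs):
        return train_idxs
    # A stable sort keeps the earlier image first when two distances tie
    order = np.argsort(dist_to_train_imgs, kind='stable')
    return train_idxs[order[:k]]

# Toy example: distances from one test image to five training images
train_idxs = np.array([10, 11, 12, 13, 14])
dists = np.array([5.0, 1.0, 3.0, 1.0, 9.0])
votes = {10: [1, 0, 0, 0], 11: [4, 2, 0, 0], 12: [2, 2, 2, 0],
         13: [6, 0, 2, 0], 14: [0, 0, 0, 8]}

neighbours = nearest_train_indices(dists, train_idxs, k=3)
# Average the neighbours' vote counts and round, as in the prediction step
prediction = np.round(np.mean([votes[int(i)] for i in neighbours], axis=0))
print(neighbours, prediction)
```

The stable sort mirrors the tie-breaking of the sequential scan, which keeps the first image it encounters at a given distance.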

In [17]:
no_of_splits = 5
global_vote_predictions_k1 = get_global_vote_predictions(no_of_splits=no_of_splits, k=1)
global_vote_predictions_k3 = get_global_vote_predictions(no_of_splits=no_of_splits, k=3)
global_vote_predictions_k5 = get_global_vote_predictions(no_of_splits=no_of_splits, k=5)
global_vote_predictions_k7 = get_global_vote_predictions(no_of_splits=no_of_splits, k=7)
global_vote_predictions_k9 = get_global_vote_predictions(no_of_splits=no_of_splits, k=9)
global_vote_predictions_k10 = get_global_vote_predictions(no_of_splits=no_of_splits, k=10)
global_vote_predictions_k11 = get_global_vote_predictions(no_of_splits=no_of_splits, k=11)
global_vote_predictions_k13 = get_global_vote_predictions(no_of_splits=no_of_splits, k=13)
global_vote_predictions_k15 = get_global_vote_predictions(no_of_splits=no_of_splits, k=15)
global_vote_predictions_k17 = get_global_vote_predictions(no_of_splits=no_of_splits, k=17)
global_vote_predictions_k19 = get_global_vote_predictions(no_of_splits=no_of_splits, k=19)
global_vote_predictions_k20 = get_global_vote_predictions(no_of_splits=no_of_splits, k=20)

In [18]:
global_vote_predictions_k5

{0: {'2388889__hotdog__0.99999714.jpg': array([10.,  3.,  3.,  0.]),
  '2417881__zebra__0.9999945.jpg': array([6., 1., 5., 2.]),
  '2403403__banana__0.9999926.jpg': array([9., 1., 3., 1.]),
  '2381941__zebra__0.9999914.jpg': array([5., 3., 5., 1.]),
  '2403741__zebra__0.99999523.jpg': array([9., 2., 4., 2.]),
  '2404281__zebra__0.999998.jpg': array([7., 1., 4., 3.]),
  '2416627__zebra__0.9999987.jpg': array([7., 2., 3., 0.]),
  '2391964__flamingo__1.0.jpg': array([7., 2., 4., 0.]),
  '2404583__umbrella__0.99999297.jpg': array([5., 3., 6., 1.]),
  '2409637__four-poster__0.99999464.jpg': array([6., 3., 3., 2.]),
  '2380669__parking_meter__0.9999993.jpg': array([7., 1., 5., 0.]),
  '2411196__crane__0.9999995.jpg': array([7., 2., 4., 2.]),
  '134__zebra__0.9999949.jpg': array([7., 2., 3., 2.]),
  '2405905__traffic_light__0.99999535.jpg': array([7., 3., 5., 1.]),
  '2404127__zebra__0.9999933.jpg': array([6., 2., 5., 0.]),
  '2406857__zebra__0.9999894.jpg': array([8., 3., 2., 1.]),
  '241427

## Metric Evaluation

In [19]:
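def calc_distance(p1, p2, metric):
    # For 'rmse' this returns the sum of squared errors of one image;
    # the mean over images and the square root are applied in eval_indiv_vote_preds
    if metric == 'rmse': return np.sum(np.square(p1 - p2))
    elif metric == 'manhattan': return np.sum(np.abs(p1 - p2))
    else: raise ValueError(f'Unknown metric type: {metric}')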

def eval_indiv_vote_preds(vote_predictions, metric):
    vote_distances = []
    for img_name, vote_pred in vote_predictions.items():
        # Fetch real votes and compare with vote predictions
        real_votes = votes_df.loc[img_name].values[:4]
        distance = calc_distance(real_votes, vote_pred, metric)
        vote_distances.append(distance)
    vote_distances = np.array(vote_distances)
    if metric=='rmse': metrics = {'rmse': round(np.sqrt(np.average(vote_distances)), 2)}
    elif metric=='manhattan': metrics = {'manhattan': round(np.average(vote_distances), 2)}
    else:
        metrics = {
            'average': round(np.average(vote_distances), 2),
            'std. dev.': round(np.std(vote_distances), 2),
            'range': [round(np.min(vote_distances), 2), round(np.max(vote_distances), 2)],
        }
    return metrics

def eval_global_vote_preds(global_vote_predictions, metric):
    global_metrics = {}
    # Calculate metrics for each split
    for cl_key, vote_predictions in global_vote_predictions.items():
        metrics = eval_indiv_vote_preds(vote_predictions, metric)
        global_metrics[cl_key] = metrics
    # Aggregate metrics for all splits
    global_metrics['global'] = {}
    for metric_type in global_metrics[0].keys():
        metrics_per_type = [metrics[metric_type] for split_key, metrics in global_metrics.items() if split_key != 'global']
        avgd_metrics_per_type = np.round(np.average(np.array(metrics_per_type), axis=0), 2)
        if metric_type == 'range': avgd_metrics_per_type = list(avgd_metrics_per_type)
        global_metrics['global'][metric_type] = avgd_metrics_per_type
    return global_metrics
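
To make the two metrics concrete, here is a small worked example. The real votes are the first two rows of the `votes_df.head(3)` output; the predictions are hypothetical values chosen only for the arithmetic.

```python
import numpy as np

# Real votes (ig, lime, xrai, anchor) from the votes table shown earlier;
# the predictions are hypothetical values chosen for the example.
real = np.array([[12, 13, 3, 1], [17, 4, 3, 2]], dtype=np.float64)
pred = np.array([[10, 3, 3, 0], [9, 1, 3, 1]], dtype=np.float64)

# Per-image errors
squared_error = np.sum(np.square(real - pred), axis=1)  # 105 and 74
manhattan = np.sum(np.abs(real - pred), axis=1)         # 13 and 12

# Aggregate over images: root of the mean squared error, mean absolute error
rmse = np.sqrt(np.mean(squared_error))
mean_manhattan = np.mean(manhattan)
print(round(float(rmse), 2), round(float(mean_manhattan), 2))
```

Because the squared error is summed over the four techniques before averaging, the RMSE here measures the typical Euclidean distance between a predicted and a real vote vector, not the error per technique.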

In [20]:
METRIC = 'rmse'
global_vote_metrics_k1 = eval_global_vote_preds(global_vote_predictions_k1, metric=METRIC)
global_vote_metrics_k3 = eval_global_vote_preds(global_vote_predictions_k3, metric=METRIC)
global_vote_metrics_k5 = eval_global_vote_preds(global_vote_predictions_k5, metric=METRIC)
global_vote_metrics_k7 = eval_global_vote_preds(global_vote_predictions_k7, metric=METRIC)
global_vote_metrics_k9 = eval_global_vote_preds(global_vote_predictions_k9, metric=METRIC)
global_vote_metrics_k10 = eval_global_vote_preds(global_vote_predictions_k10, metric=METRIC)
global_vote_metrics_k11 = eval_global_vote_preds(global_vote_predictions_k11, metric=METRIC)
global_vote_metrics_k13 = eval_global_vote_preds(global_vote_predictions_k13, metric=METRIC)
global_vote_metrics_k15 = eval_global_vote_preds(global_vote_predictions_k15, metric=METRIC)
global_vote_metrics_k17 = eval_global_vote_preds(global_vote_predictions_k17, metric=METRIC)
global_vote_metrics_k19 = eval_global_vote_preds(global_vote_predictions_k19, metric=METRIC)
global_vote_metrics_k20 = eval_global_vote_preds(global_vote_predictions_k20, metric=METRIC)

In [21]:
global_vote_metrics_k1['global']

{'rmse': 6.03}

In [22]:
global_vote_metrics_k3['global']

{'rmse': 4.93}

In [23]:
global_vote_metrics_k5['global']

{'rmse': 4.66}

In [24]:
global_vote_metrics_k7['global']

{'rmse': 4.52}

In [25]:
global_vote_metrics_k9['global']

{'rmse': 4.51}

In [26]:
global_vote_metrics_k10['global']

{'rmse': 4.53}

In [27]:
global_vote_metrics_k11['global']

{'rmse': 4.49}

In [28]:
global_vote_metrics_k13['global']

{'rmse': 4.46}

In [29]:
global_vote_metrics_k15['global']

{'rmse': 4.44}

In [30]:
global_vote_metrics_k17['global']

{'rmse': 4.49}

In [31]:
global_vote_metrics_k19['global']

{'rmse': 4.44}

In [32]:
global_vote_metrics_k20['global']

{'rmse': 4.43}
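
The per-k results printed above can be gathered in one place to make the trend easier to read. The sketch below only copies the values shown in the outputs; `rmse_by_k` is an illustrative name.

```python
# Global RMSE per k, copied from the outputs above
rmse_by_k = {1: 6.03, 3: 4.93, 5: 4.66, 7: 4.52, 9: 4.51, 10: 4.53,
             11: 4.49, 13: 4.46, 15: 4.44, 17: 4.49, 19: 4.44, 20: 4.43}

# k with the lowest error, and how much lower it is than for k=1
best_k = min(rmse_by_k, key=rmse_by_k.get)
relative_gain = (rmse_by_k[1] - rmse_by_k[best_k]) / rmse_by_k[1]

for k, rmse in rmse_by_k.items():
    print(f'k={k:>2}  rmse={rmse:.2f}')
print(f'lowest RMSE at k={best_k}; {relative_gain:.0%} lower than k=1')
```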

The previous results shed light on the viability of predicting the vote count for a new image from the vote counts of its nearest training images.

For k=1, the RMSE between the predicted and the real vote counts is about 6 votes (6.03). The error decreases as more neighbors are used: it drops to 4.93 for k=3 and 4.66 for k=5, and then levels off between 4.43 and 4.53 for k of 7 or more. In the end, this means that predicting from only the nearest instance gives the largest error here, and that using around 7 or more neighbors is preferable, with little to gain beyond that.

Additional metrics also show that the distribution of vote count differences has a Gaussian shape with a slight skew to the right (i.e. towards higher vote differences). The standard deviation shows that most vote differences lie within ±3 of the average vote difference. Given that the average vote difference is 6, most vote differences fall in the 3-9 range.

Taking into account that each image received about 14 votes on average (2867 votes over 198 images), the error in the vote count prediction is pretty large. A difference of 4 to 6 votes when predicting votes can be really important. However, we still need to calculate the differences in vote count proportions because, at the end of the day, proportions are also an important factor in deciding which techniques are better for new images.
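
One possible way to compare proportions instead of raw counts is sketched below: each vote vector is scaled to sum to 1 before measuring the distance. The helper `to_proportions` is hypothetical, the real votes come from the first row of the votes table, and the prediction is a made-up value.

```python
import numpy as np

def to_proportions(vote_counts):
    """Scale a vote-count vector so it sums to 1 (all-zero vectors stay zero)."""
    vote_counts = np.asarray(vote_counts, dtype=np.float64)
    total = vote_counts.sum()
    return vote_counts / total if total > 0 else vote_counts

# Real votes from the votes table; the prediction is a hypothetical value
real = to_proportions([12, 13, 3, 1])
pred = to_proportions([10, 3, 3, 0])

# Manhattan distance between two proportion vectors lies in [0, 2]
proportion_gap = np.sum(np.abs(real - pred))
print(real.round(3), pred.round(3), round(float(proportion_gap), 3))
```

Working with proportions removes the effect of images having different numbers of votes, so the errors of different images become comparable.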

#### TODO: Predict techniques with hard voting using vote count predictions
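
A minimal sketch of how this could look, assuming hard voting simply picks the technique with the highest predicted count. `hard_vote` is a hypothetical helper; the two predictions are copied from the k=5 output shown earlier.

```python
import numpy as np

# Column order used throughout the notebook
techniques = np.array(['ig', 'lime', 'xrai', 'anchor'])

def hard_vote(vote_count_prediction):
    """Pick the technique with the highest predicted count.

    np.argmax breaks ties in favour of the earliest column.
    """
    return str(techniques[int(np.argmax(vote_count_prediction))])

# Predicted counts for k=5, taken from the output shown earlier
predictions = {
    '134__zebra__0.9999949.jpg': np.array([7., 2., 3., 2.]),
    '2404583__umbrella__0.99999297.jpg': np.array([5., 3., 6., 1.]),
}
predicted_best = {name: hard_vote(v) for name, v in predictions.items()}
print(predicted_best)
```

The predicted technique could then be compared with the `best` column to obtain an accuracy per split.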