# Downsampling data

The number of images best explained by each interpretation technique is as follows:
- IG: 147 (74% of total images)
- XRAI: 32 (16% of total images)
- LIME: 18 (9% of total images)
- ANCHOR: 1 (<1% of total images)
- Total images: 198

Because IG images are the majority, predictions for new images are likely to be biased towards the IG technique. It is necessary to even out the proportions of the selected techniques in the dataset.
In this notebook, we address that problem by downsampling our dataset (i.e. picking a subset of instances that evens out the proportion of techniques).

We'll remove IG images from the dataset so that IG accounts for ~60% of the dataset instead of 74%, giving the other techniques more weight when predicting interpretation techniques for new images. After downsampling, the image counts should look like this:

- IG: 77 (60% of total images, 52% of original IG count)
- XRAI: 32 (25% of total images)
- LIME: 18 (14% of total images)
- ANCHOR: 1 (<1% of total images)
- Total images: 128

If we instead lower the IG proportion to ~50%, the downsampled dataset would look like this:
- IG: 51 (50% of total images, 35% of original IG count)
- XRAI: 32 (31% of total images)
- LIME: 18 (18% of total images)
- ANCHOR: 1 (1% of total images)
- Total images: 102
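
The two tables above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original notebook: `downsampled_counts` is a hypothetical helper, and rounding the IG count up with `math.ceil` is an assumption chosen because it matches the counts listed above.

```python
import math

def downsampled_counts(ig_count, other_count, target_ig_share):
    """Return (ig_kept, new_total) so that IG is roughly target_ig_share of the data."""
    # Solve ig_kept / (ig_kept + other_count) = target_ig_share for ig_kept,
    # rounding up to a whole number of images (assumption, see lead-in)
    ig_kept = math.ceil(target_ig_share * other_count / (1 - target_ig_share))
    ig_kept = min(ig_kept, ig_count)  # cannot keep more IG images than exist
    return ig_kept, ig_kept + other_count

others = 32 + 18 + 1  # XRAI + LIME + ANCHOR images, which are all kept
print(downsampled_counts(147, others, 0.60))  # (77, 128)
print(downsampled_counts(147, others, 0.50))  # (51, 102)
```

Only the IG count changes; the other techniques keep all of their images, so their shares grow as a side effect.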

We'll only use the latent features dataset for now, because the other similarity metrics do not perform as well as Euclidean distance between latent features. However, it is possible to extend this notebook to subsample datasets based on other similarity metrics.
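
For reference, Euclidean distance between latent feature vectors can be computed as sketched below. This is an illustration on toy data, not the notebook's own similarity code: the array `feats` and its values are made up, and the real vectors have 1024 dimensions.

```python
import numpy as np

# Toy latent features: 3 images x 2 dimensions (hypothetical values)
feats = np.array([[0.0, 0.0],
                  [3.0, 4.0],
                  [6.0, 8.0]])

# Pairwise differences via broadcasting, then the L2 norm over the feature axis
diffs = feats[:, None, :] - feats[None, :, :]
dist = np.linalg.norm(diffs, axis=-1)  # dist[i, j] is the distance between images i and j

print(dist[0, 1])  # 5.0
```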

### Some subsampling constants

In [6]:
p = 0.55 # DESIRED_IG_PROP
ig_rem = (147-198*p)/(1-p) # No. of IG imgs. to remove
ig_keep = 147 - ig_rem
ig_pct_keep = 1 - ig_rem/147 # % of IG imgs. to keep
new_len = 198 - ig_rem # New length for downsampled dataset
print(ig_rem, ig_keep, ig_pct_keep, new_len)

84.66666666666666 62.33333333333334 0.4240362811791384 113.33333333333334
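
The expression for `ig_rem` follows from requiring that IG make up a fraction $p$ of the downsampled dataset. Using symbols introduced here only for illustration, let $N = 198$ be the total number of images, $G = 147$ the number of IG images, and $r$ the number of IG images removed:

$$
\frac{G - r}{N - r} = p
\;\Longrightarrow\;
G - r = pN - pr
\;\Longrightarrow\;
r = \frac{G - pN}{1 - p}
$$

With $p = 0.55$ this gives $r = (147 - 108.9)/0.45 \approx 84.67$, which matches the printed value. Since only whole images can be removed, $r$ has to be rounded in practice.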


## Loading Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import sklearn

In [2]:
FEATS_FILE_PATH = os.path.join('..', 'features', 'incv1_feats.csv')
BEST_TECHNIQUES = os.path.join('..', 'results', 'votes_summary.csv')

In [3]:
feats_df = pd.read_csv(FEATS_FILE_PATH, index_col=0)
feats_df.head(3)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1014 | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1222__pool_table__0.9999995.jpg | 0.882798 | 0.896023 | 0.123852 | 0.257982 | 0.03605 | 0.108023 | 0.633841 | 0.457301 | 1.684949 | 0.285681 | ... | 0.422634 | 0.346122 | 0.111589 | 1.441579 | 0.198722 | 0.246648 | 0.295942 | 0.56095 | 0.058328 | 0.117393 |
| 1328__coil__0.99999607.jpg | 0.483815 | 0.134309 | 0.021849 | 0.367267 | 0.08925 | 0.007518 | 0.069921 | 0.219347 | 0.08926 | 0.046694 | ... | 0.049852 | 0.00414 | 0.199223 | 0.718976 | 0.0 | 0.0 | 0.0 | 0.159411 | 0.012007 | 0.001601 |
| 134__zebra__0.9999949.jpg | 0.291067 | 0.375913 | 0.217742 | 1.269691 | 0.384181 | 0.07647 | 0.66207 | 0.662391 | 0.827774 | 0.115826 | ... | 0.018289 | 0.0 | 0.000775 | 0.903884 | 0.589769 | 0.016957 | 0.418493 | 0.00535 | 0.004198 | 0.18546 |


In [4]:
techniques_df = pd.read_csv(BEST_TECHNIQUES, dtype='object', index_col=0)
techniques_df.head(3)

| | ig | lime | xrai | anchor | best |
|---|---|---|---|---|---|
| 1222__pool_table__0.9999995.jpg | 12 | 13 | 3 | 1 | lime |
| 1328__coil__0.99999607.jpg | 17 | 4 | 3 | 2 | ig |
| 134__zebra__0.9999949.jpg | 14 | 1 | 8 | 2 | ig |
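
The per-technique counts and percentages quoted at the top of the notebook can be obtained from the `best` column with `value_counts`. The sketch below uses a small made-up frame (`toy_df`) in place of `techniques_df`, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for techniques_df; labels and values are hypothetical
toy_df = pd.DataFrame(
    {"best": ["lime", "ig", "ig", "xrai", "ig"]},
    index=["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
)

counts = toy_df["best"].value_counts()                # absolute count per technique
shares = toy_df["best"].value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares.round(2))
```

Running the same two calls on `techniques_df['best']` should give the class distribution of the full dataset.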


## Subsampling

In [52]:
# Create is_ig as a True/False masking array
best = techniques_df['best'].values
is_ig = best == 'ig'
is_ig[:20]

array([False,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])

In [29]:
# Use masking array to filter names of IG images
ig_img_names = techniques_df[is_ig].index.values
ig_img_names[:5]

array(['1328__coil__0.99999607.jpg', '134__zebra__0.9999949.jpg',
       '2377471__pizza__0.9999988.jpg', '2377620__zebra__0.9999882.jpg',
       '2377698__zebra__0.9999999.jpg'], dtype=object)

In [36]:
# Downsample IG images so proportion relative to the whole dataset is near ~60%
from sklearn.model_selection import train_test_split
ig_img_split = train_test_split(ig_img_names, train_size=0.52,
                                random_state=42, shuffle=True, stratify=None)
selected_ig_img_names = ig_img_split[0]

In [37]:
# About 77 images were planned; 76 are selected because 0.52 * 147 = 76.44 is rounded down
len(selected_ig_img_names)

76
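
The result is 76 rather than 77 because of rounding. To my understanding, `train_test_split` rounds a fractional `train_size` down to a whole number of samples; the arithmetic is sketched below with illustrative variable names.

```python
import math

n_ig = 147          # IG images before downsampling
train_size = 0.52   # fraction passed to train_test_split

# 0.52 * 147 = 76.44, which is rounded down to a whole number of images
n_selected = math.floor(train_size * n_ig)
print(n_selected)  # 76
```

If exactly 77 images are wanted, `train_size` also accepts an integer, which is interpreted as an absolute number of samples (e.g. `train_size=77`).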

### Creating new subsampled datasets

We'll create a new dataset containing the subsampled IG images plus all images associated with the other techniques. It will be reused in later steps.

In [64]:
# We already know the names of the subsampled IG images
selected_ig_img_names[:5]

array(['2405479__traffic_light__0.9999939.jpg',
       '2392818__park_bench__0.99999.jpg',
       '2411665__zebra__0.99998856.jpg', '2401224__zebra__0.9999882.jpg',
       '2390296__umbrella__0.99999106.jpg'], dtype=object)

In [65]:
# Then we get the names of images NOT explained with IG
is_not_ig = ~is_ig  # element-wise negation of the boolean mask
not_ig_img_names = techniques_df[is_not_ig].index.values
not_ig_img_names[:5]

array(['1222__pool_table__0.9999995.jpg',
       '2378523__banana__0.99999785.jpg',
       '2381932__traffic_light__0.99999964.jpg',
       '2382792__umbrella__0.9999838.jpg',
       '2385767__zebra__0.9999958.jpg'], dtype=object)

In [68]:
# We join the two arrays of names...
subsampled_img_names = list(selected_ig_img_names) + list(not_ig_img_names)
len(subsampled_img_names)

127

In [69]:
# ...and use it to create new datasets (for features and techniques)
sub_feats_df = feats_df.loc[subsampled_img_names]
sub_techniques_df = techniques_df.loc[subsampled_img_names]
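
Before indexing with a hand-built list of names, it can help to confirm that the list has no duplicates and that every name exists in the index. The helper below is a hypothetical sketch on toy labels, not something the original notebook defines.

```python
def check_names(names, index):
    """Return (duplicates, missing) for a list of row labels."""
    seen, duplicates = set(), set()
    for name in names:
        if name in seen:
            duplicates.add(name)  # label appears more than once in the list
        seen.add(name)
    missing = seen - set(index)   # labels that the index does not contain
    return duplicates, missing

# Toy labels standing in for the real image names
index = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
names = ["b.jpg", "c.jpg"] + ["a.jpg"]

duplicates, missing = check_names(names, index)
print(duplicates, missing)  # set() set()
```

In the notebook, the same check could be called as `check_names(subsampled_img_names, feats_df.index)`.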

In [74]:
sub_feats_df.head(3)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1014 | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2405479__traffic_light__0.9999939.jpg | 0.235278 | 0.093458 | 0.096435 | 0.431311 | 0.061168 | 0.201984 | 0.105483 | 0.098843 | 0.309048 | 0.096289 | ... | 0.071269 | 0.0801 | 0.356105 | 0.096008 | 0.136125 | 0.407917 | 0.707833 | 0.787253 | 0.097295 | 0.010121 |
| 2392818__park_bench__0.99999.jpg | 0.552391 | 0.528561 | 0.226194 | 1.417185 | 0.005555 | 0.519028 | 0.700005 | 0.099091 | 0.123433 | 0.164651 | ... | 1.73041 | 0.14895 | 0.0 | 0.065658 | 0.586209 | 0.083003 | 0.054312 | 0.944901 | 0.130074 | 0.047629 |
| 2411665__zebra__0.99998856.jpg | 0.395213 | 1.130813 | 0.120035 | 1.675753 | 0.145812 | 0.11705 | 0.25309 | 0.129668 | 1.04725 | 0.137647 | ... | 0.000423 | 0.009721 | 0.02659 | 0.914146 | 0.275013 | 0.020816 | 0.503529 | 0.015693 | 0.036215 | 0.005369 |


In [75]:
sub_techniques_df.head(3)

| | ig | lime | xrai | anchor | best |
|---|---|---|---|---|---|
| 2405479__traffic_light__0.9999939.jpg | 6 | 5 | 3 | 0 | ig |
| 2392818__park_bench__0.99999.jpg | 7 | 2 | 2 | 2 | ig |
| 2411665__zebra__0.99998856.jpg | 5 | 2 | 4 | 2 | ig |


As we can see, the number of images in the dataset has changed, and with it the proportions of the techniques.

In [79]:
np.unique(sub_techniques_df['best'].values, return_counts=True)

(array(['anchor', 'ig', 'lime', 'xrai'], dtype=object),
 array([ 1, 76, 18, 32], dtype=int64))
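
To see how much the balance shifted, the counts can be converted to percentages. This is a small sketch using the counts reported in this notebook; `shares` is a hypothetical helper name.

```python
# Counts before downsampling (from the introduction) and after (from np.unique above)
before = {"anchor": 1, "ig": 147, "lime": 18, "xrai": 32}
after = {"anchor": 1, "ig": 76, "lime": 18, "xrai": 32}

def shares(counts):
    """Percentage of the dataset per technique, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

print(shares(before)["ig"])  # 74.2
print(shares(after)["ig"])   # 59.8
```

IG drops from about 74% to about 60% of the dataset, in line with the target stated at the start.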

### Saving new datasets

In [81]:
# New file names
SUB_FEATS_FILE_PATH = os.path.join('..', 'features', 'sub_incv1_feats.csv')
SUB_BEST_TECHNIQUES = os.path.join('..', 'results', 'sub_votes_summary.csv')

In [82]:
# Saving datasets
sub_feats_df.to_csv(SUB_FEATS_FILE_PATH)
sub_techniques_df.to_csv(SUB_BEST_TECHNIQUES)
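
A quick round-trip check can confirm that a saved file reloads with the image names as its index. The sketch below writes a made-up frame to a temporary directory instead of touching the real output paths; `toy` and its values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for sub_techniques_df
toy = pd.DataFrame({"ig": [12, 17], "best": ["lime", "ig"]},
                   index=["a.jpg", "b.jpg"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sub_votes_summary.csv")
    toy.to_csv(path)                            # index is written as the first column
    reloaded = pd.read_csv(path, index_col=0)   # same loading call as earlier in the notebook

same_index = list(reloaded.index) == list(toy.index)
same_best = list(reloaded["best"]) == list(toy["best"])
print(same_index, same_best)  # True True
```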