This notebook has two aims:
 - simulate drifts on images in order to generate synthetic drifted data. We simulate Gaussian noise and a shift in grey-level intensity;
 - find thresholds for the drift metrics by applying them to permutations of the training dataset.

In [None]:
import sys, os
sys.path.append(os.path.abspath('Utils'))
sys.path.append(os.path.abspath('data'))
sys.path.append(os.path.abspath('thresholds_and_results'))



from utils_driftSimulating import (
    only_image_folder,
    create_black_folder,
    create_gaussian_folder,
    create_intensity_folder,
)
from utils_thresholds import (
    thresholds_PCA_images,
    thresholds_UMAP_images,
    thresholds_AUTOENCODER_U_images,
    thresholds_AUTOENCODER_T_images,
    thresholds_None,
)
from utils_resNet import init_resnet, df_from_folder


### Settings

Remove Zone.Identifier files

In [None]:
starting_path = 'data/original_data/training_images/'
# only_image_folder(starting_path)
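
The body of `only_image_folder` is not shown in this notebook. As a rough sketch of what such a cleanup step might look like, assuming the unwanted files are the `name:Zone.Identifier` sidecar files that appear next to images copied from Windows, the hypothetical helper `remove_zone_identifier_files` below deletes them from a folder. The demo runs on a temporary directory, so no real data is touched.

```python
import tempfile
from pathlib import Path


def remove_zone_identifier_files(folder):
    """Hypothetical helper (not the notebook's only_image_folder):
    delete every file in `folder` whose name ends with 'Zone.Identifier'
    and return the names of the deleted files."""
    removed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.name.endswith("Zone.Identifier"):
            path.unlink()
            removed.append(path.name)
    return removed


# Demo on a throwaway directory containing one image and one sidecar file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "img_001.png").write_bytes(b"")
    (Path(tmp) / "img_001.png:Zone.Identifier").write_text("[ZoneTransfer]")
    removed = remove_zone_identifier_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```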

Apply a central black circle to each image and save the results in the "black" folder

In [None]:
black_path = 'data/synthetic_data/black/'
# create_black_folder(starting_path, black_path)
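
The implementation of `create_black_folder` is not shown here. The sketch below illustrates one way a central black circle could be drawn on a single image with NumPy. The function name `apply_central_black_circle`, the `radius_fraction` parameter, and the choice to zero the pixels inside the disc are all assumptions made for illustration.

```python
import numpy as np


def apply_central_black_circle(image, radius_fraction=0.25):
    """Hypothetical helper: return a copy of `image` (H x W or H x W x C)
    with the pixels inside a centred disc set to 0 (black).
    The radius is `radius_fraction` times the shorter image side."""
    height, width = image.shape[:2]
    rows, cols = np.ogrid[:height, :width]
    center_row, center_col = (height - 1) / 2, (width - 1) / 2
    radius = radius_fraction * min(height, width)
    inside = (rows - center_row) ** 2 + (cols - center_col) ** 2 <= radius ** 2
    masked = image.copy()
    masked[inside] = 0  # the 2-D boolean mask applies to every channel
    return masked


# Demo on a uniform grey RGB image.
image = np.full((64, 64, 3), 200, dtype=np.uint8)
masked = apply_central_black_circle(image)
```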

### Simulate drift and save a folder for each drift type

In [None]:
intensity_path = 'data/synthetic_data/drift_intensity/'
# create_intensity_folder(starting_path, intensity_path, shiftValue = 40)
gaussian_path = 'data/synthetic_data/drift_gaussian_1/'
# create_gaussian_folder(starting_path, gaussian_path, sigma=1)
gaussian_path = 'data/synthetic_data/drift_gaussian_10/'
# create_gaussian_folder(starting_path, gaussian_path, sigma=10)
gaussian_path = 'data/synthetic_data/drift_gaussian_100/'
# create_gaussian_folder(starting_path, gaussian_path, sigma=100)
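
The two drift types can be illustrated on a single image. This is a minimal sketch, not the code of `create_gaussian_folder` or `create_intensity_folder`: it assumes 8-bit images, additive zero-mean Gaussian noise with standard deviation `sigma`, and an intensity drift that adds a constant to every pixel. The function names `add_gaussian_noise` and `shift_intensity` are hypothetical.

```python
import numpy as np


def add_gaussian_noise(image, sigma, rng):
    """Hypothetical helper: add zero-mean Gaussian noise with standard
    deviation `sigma`, then clip back to the valid 8-bit range."""
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


def shift_intensity(image, shift_value):
    """Hypothetical helper: add a constant to every pixel, clipping to [0, 255]."""
    shifted = image.astype(np.int16) + shift_value
    return np.clip(shifted, 0, 255).astype(np.uint8)


# Demo on a uniform grey image.
rng = np.random.default_rng(10)
image = np.full((32, 32), 128, dtype=np.uint8)
noisy = add_gaussian_noise(image, sigma=10, rng=rng)
brighter = shift_intensity(image, 40)
saturated = shift_intensity(np.full((2, 2), 240, dtype=np.uint8), 40)
```

Clipping matters for the larger settings used above: with `sigma=100`, many noisy pixels would otherwise fall outside the 0 to 255 range.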

### Data preprocessing

Create dataframe from black folder

In [None]:
# set seeds
seed_split = 1
seed_drift = 10
seed_metrics = 100
seeds = [seed_split, seed_drift, seed_metrics]


In [None]:
# Define configs
k=6
fileName='thresholds_and_results/6dim/thresholds_6d'

In [None]:
# Model initialization: a ResNet-18 is used to preprocess the images before the dimensionality reduction step
model = init_resnet(seed_split)

In [None]:
# Preprocess images by:
# - applying a black mask over them
# - applying the ResNet to extract features from each image (so that raw image pixels are not used directly in the governance process)
# - storing an array with shape (1,512) for each image in a dataframe to be used in the governance process (a df was also the input for the process applied to tabular data)
black_df = df_from_folder(black_path, model)

### Thresholds definition

We are going to apply drift detection metrics to a new dataset, and we want to select the metrics that are able to detect drift on it. To interpret the values these metrics return, we need threshold values: if the value returned by a metric is more extreme than the corresponding threshold, a drift is detected; otherwise it is not. These thresholds are not global but depend on the dataset, so we compute them separately for each dataset. As the threshold we propose the 5th percentile of the metric's results, computed on permutations of batches of the dataset under consideration.
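
The idea can be sketched with NumPy. This is an illustration under stated assumptions, not the code in `utils_thresholds`: the rows of the feature matrix are shuffled, split into two batches, and scored by a metric, and the 5th percentile of the scores over many permutations is taken as the threshold. The metric `mean_gap_score`, the function `permutation_threshold`, and its parameters are hypothetical. The metric is chosen so that lower values mean the batches differ more, which makes "more extreme" mean "below the threshold".

```python
import numpy as np


def mean_gap_score(batch_a, batch_b):
    """Hypothetical metric: negated Euclidean distance between batch means.
    Lower values mean the two batches differ more."""
    return -float(np.linalg.norm(batch_a.mean(axis=0) - batch_b.mean(axis=0)))


def permutation_threshold(features, metric, n_permutations=200, q=5, seed=100):
    """Score `metric` on random half/half splits of `features` and return
    the q-th percentile of the scores together with all the scores."""
    rng = np.random.default_rng(seed)
    half = len(features) // 2
    scores = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(features)  # shuffles rows, returns a copy
        scores[i] = metric(shuffled[:half], shuffled[half:])
    return float(np.percentile(scores, q)), scores


# Demo: 200 reference samples with 6 features, then a strongly shifted batch.
reference = np.random.default_rng(1).normal(size=(200, 6))
threshold, scores = permutation_threshold(reference, mean_gap_score)
drifted = reference + 3.0
drift_detected = mean_gap_score(reference, drifted) < threshold
```

For a metric where larger values indicate more drift, the same procedure would use a high percentile (for example `q=95`) and flag values above the threshold.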

Without dimensionality reduction

In [None]:
thresholds_None(seeds, black_df, fileName, k)

With dimensionality reduction techniques

With PCA

In [None]:
thresholds_PCA_images(seeds, black_df, fileName,k)

With UMAP

In [None]:
thresholds_UMAP_images(seeds, black_df, fileName, k)

With U_AUTOENCODER (untrained)

In [None]:
thresholds_AUTOENCODER_U_images(seeds, black_df, fileName,k)

With T_AUTOENCODER (trained)

In [None]:
thresholds_AUTOENCODER_T_images(seeds, black_df, fileName,k)