# Identify outliers with Easy-ICD

This notebook shows how to use a trained outlier detector to identify and mark outliers in a scraped image dataset.

## Required imports

In [11]:
import torch
import numpy as np
import matplotlib.pyplot as plt

from torch.utils.data import Dataset, DataLoader

from easy_icd.utils.datasets import create_dataset
from easy_icd.utils.augmentation import RandomImageAugmenter
from easy_icd.utils.models import ResNet
from easy_icd.outlier_detection.detect_outliers import analyze_data
from easy_icd.outlier_removal.remove_outliers import remove_outliers

First, we need to create a Dataset object that contains all of the images we want to analyze:

In [8]:
img_dir = 'marine_animals'
class_names = ['hammerhead shark', 'orca whale', 'manta ray', 'jellyfish', 'axolotl']
one_hot_labels = False
scale_images = True

ds = create_dataset(img_dir, class_names, one_hot_labels, scale_images)
dataloader = DataLoader(ds, batch_size=4, shuffle=True)
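The class names above contain spaces, while the analysis log further down prints them with underscores (for example `hammerhead_shark`). The sketch below shows one way to map a class name to the folder that presumably holds its images. The helper `class_dir` is hypothetical and not part of Easy-ICD, and the space-to-underscore folder layout is an assumption inferred from that log output.

```python
from pathlib import Path


def class_dir(img_dir: str, class_name: str) -> Path:
    """Hypothetical helper: guess the folder that holds one class's images."""
    # Assumption: spaces in the class name become underscores on disk.
    return Path(img_dir) / class_name.replace(' ', '_')


class_names = ['hammerhead shark', 'orca whale', 'manta ray', 'jellyfish', 'axolotl']
class_dirs = [class_dir('marine_animals', name) for name in class_names]

for d in class_dirs:
    print(d.as_posix())
```

Checking that these folders exist before building the dataset can catch a misspelled class name early.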

Now we create a model using the same architecture as the model we trained as an outlier detector, and load the trained weights:

In [9]:
num_layers = 3
num_blocks = [1, 1, 1]
in_channels = 3
out_channels = [16, 32, 64]
linear_sizes = [128, 32]
supervised = False

model = ResNet(num_layers, num_blocks, in_channels, out_channels, linear_sizes, supervised)

In [10]:
model.load_state_dict(torch.load('marine_animals_model_training/model_state_epoch_1.pt'))

<All keys matched successfully>

Next, we run the trained model over the images in each class with `analyze_data`:

In [13]:
dataset_means_and_stds = [[0.3128, 0.3886, 0.5122], [0.2856, 0.2370, 0.2553]]
image_size = (512, 512)
num_hardness_loss_samples = 5
gpu = False

analyze_data(model, img_dir, class_names, dataset_means_and_stds, image_size, num_hardness_loss_samples, gpu)

Analyzing images in class: hammerhead_shark
Analyzing images in class: orca_whale
Analyzing images in class: manta_ray
Analyzing images in class: jellyfish
Analyzing images in class: axolotl
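Judging by its name, `dataset_means_and_stds` holds the per-channel means followed by the per-channel standard deviations of the dataset. How `analyze_data` uses them internally is not shown in this notebook. The sketch below assumes the common convention of standardizing each RGB channel of a [0, 1]-scaled image as `(value - mean) / std`; the function `normalize_pixel` is illustrative only.

```python
def normalize_pixel(pixel, means, stds):
    """Standardize one RGB pixel channel by channel: (value - mean) / std."""
    return [(v - m) / s for v, m, s in zip(pixel, means, stds)]


# Values copied from the cell above; interpreting them as [means, stds] is an assumption.
means, stds = [0.3128, 0.3886, 0.5122], [0.2856, 0.2370, 0.2553]

# A pixel equal to the dataset mean maps to zero in every channel.
mean_pixel = normalize_pixel(means, means, stds)

# A pixel one standard deviation above the mean maps to roughly one.
bright_pixel = normalize_pixel([m + s for m, s in zip(means, stds)], means, stds)

print(mean_pixel, bright_pixel)
```

Under that assumption, the statistics should be computed on the same dataset, and with the same pixel scaling, that was used to train the detector.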


Once we have analyzed the images, we can mark the detected outliers for exclusion from the cleaned dataset. We do this by first choosing how many images we want to retain in the cleaned version of each class:

In [15]:
desired_images_per_class = [15] * len(class_names)

remove_outliers(img_dir, class_names, desired_images_per_class)
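To make the idea of retaining a fixed number of images per class concrete, here is a minimal sketch of one plausible selection rule: rank the images of a class by an outlier score and keep the lowest-scoring ones. This is not necessarily how `remove_outliers` works. The function `keep_most_typical`, the file names, and the scores are all hypothetical.

```python
def keep_most_typical(scores, num_to_keep):
    """Return the file names with the lowest outlier scores, best first."""
    ranked = sorted(scores, key=scores.get)
    return ranked[:num_to_keep]


# Hypothetical per-image scores for one class (higher = more likely an outlier).
jellyfish_scores = {'0.jpg': 0.12, '1.jpg': 0.87, '2.jpg': 0.05, '3.jpg': 0.33}

kept = keep_most_typical(jellyfish_scores, 2)
excluded = sorted(set(jellyfish_scores) - set(kept))

print(kept, excluded)
```

With a rule like this, asking for more images than a class contains simply keeps every image, since slicing past the end of a list is safe in Python.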

Then, constructing the cleaned dataset is as simple as passing the argument `cleaned=True` to the `create_dataset` function:

In [17]:
img_dir = 'marine_animals'
class_names = ['hammerhead shark', 'orca whale', 'manta ray', 'jellyfish', 'axolotl']
one_hot_labels = False
scale_images = True
cleaned = True

ds = create_dataset(img_dir, class_names, one_hot_labels, scale_images, cleaned=cleaned)
dataloader = DataLoader(ds, batch_size=4, shuffle=True)