Supervised Outlier Detection (Python 3)

In this problem, images are provided in two categories: inliers and outliers. The inlier images heavily outnumber the outlier images. Inlier images are those that have some text in their main body. The aim of this solution is to classify the test images as outliers (1) or inliers (0).

1. Getting Started

The provided images are heavily imbalanced: the number of inliers is much higher than the number of outliers. Hence, data augmentation is required for the outlier class. The outlier images are altered with respect to size, zoom, rotation, etc. to create additional training samples.
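As an illustration of the kind of augmentation described above, here is a minimal sketch that uses only NumPy. It is not the code from the notebook: the README does not say which augmentation API is used there, and the `augment` helper, the random stand-in image and the 80% crop are assumptions made for this example.

```python
import numpy as np


def augment(image):
    """Return a few simple variants of an H x W x C image array."""
    h, w = image.shape[:2]
    # Center crop that keeps the middle 80% of each side (a crude zoom-in).
    dh, dw = h // 10, w // 10
    zoomed = image[dh:h - dh, dw:w - dw]
    return [
        np.fliplr(image),      # horizontal mirror
        np.rot90(image, k=1),  # 90-degree counter-clockwise rotation
        np.rot90(image, k=2),  # 180-degree rotation
        zoomed,                # center crop; resize afterwards if a fixed size is needed
    ]


# Hypothetical stand-in for one outlier image (random pixels, 100 x 60 RGB).
rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(100, 60, 3), dtype=np.uint8)
variants = augment(sample)
shapes = [v.shape for v in variants]
print(shapes)
```

Each outlier image yields four extra arrays here. Rotations by arbitrary angles and random zoom factors would need an image library such as PIL or cv2.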

1.1. Data Description

Three folders are given (inlier_train, outlier_train and test), which together contain all the images. The outlier_train folder contains about 800 images, the inlier_train folder about 4000 images, and the test folder about 1300 images.
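To make the imbalance concrete, the short calculation below uses the approximate counts quoted above. It shows how many augmented copies per outlier image would be needed to roughly match the inlier count. This is only an arithmetic sketch: the README does not state how many augmented images the notebook actually generates.

```python
import math

# Approximate counts quoted in the data description.
n_inliers = 4000
n_outliers = 800

imbalance_ratio = n_inliers / n_outliers

# Augmented copies per outlier image so that originals plus copies
# roughly match the number of inliers.
copies_per_outlier = math.ceil(n_inliers / n_outliers) - 1
balanced_outliers = n_outliers * (1 + copies_per_outlier)

print(imbalance_ratio, copies_per_outlier, balanced_outliers)
```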

1.2. Prerequisites

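The following libraries are required to run the code. The third-party ones can be installed via conda or pip; os, sys and shutil are part of the Python standard library and need no installation:

pandas
numpy
os, sys, cv2 (cv2 is installed as opencv-python)
keras
sklearn (installed as scikit-learn)
PIL (installed as Pillow)
shutil

Place all three given folders in the same directory as the code.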
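Before opening the notebook, it can help to check which of these modules are importable in the current environment. The snippet below is a small sketch using the standard library's `importlib.util.find_spec`; the `missing_modules` helper is a name made up for this example.

```python
import importlib.util

# Import names used by the notebook. cv2, sklearn and PIL are installed
# under the package names opencv-python, scikit-learn and Pillow.
REQUIRED = ["pandas", "numpy", "cv2", "keras", "sklearn", "PIL", "os", "sys", "shutil"]


def missing_modules(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
print("Missing:", missing or "none")
```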

1.3. Running the Code

The code can be run in Jupyter Notebook.

2. Methodology

The following steps are followed to find the outliers in the given test image folder:

Import all the necessary libraries.
Read all the filenames in the inlier train and outlier train folders.
Make two folders: one for storing the augmented images and one for the final train images.
For each outlier image, create augmented images and save them in the augmented folder.
Move all the images from the augmented folder into the train folder, and do the same for the images in the outlier folder.
Read all the images within the train folder (augmented and original outliers) and give them a label of 1.
Add a label of 0 to the inlier filenames and then move those images to the train folder as well.
Extract features from the training images with the ResNet50 model.
Do the same for the test images.
Scale both the train and test features with StandardScaler.
Train the KNN classifier (2 classes) on the scaled train features and the labels (0, 1).
Predict the labels for the test images.
Write the results to a CSV file in the required format.
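The scaling and classification steps above can be sketched as follows. This is not the notebook's code. Random arrays stand in for the ResNet50 features, the feature dimension is cut to 16 for brevity, and `n_neighbors=5` is an assumed value. Fitting the scaler on the training features only is also an assumption, since the steps above do not say how it is fitted.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for extracted image features (assumption: 16 dimensions).
n_per_class, n_features = 200, 16
inlier_feats = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, n_features))
outlier_feats = rng.normal(loc=3.0, scale=1.0, size=(n_per_class, n_features))

X_train = np.vstack([inlier_feats, outlier_feats])
y_train = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = inlier, 1 = outlier

# Hypothetical test set: 20 inlier-like rows followed by 10 outlier-like rows.
X_test = np.vstack([
    rng.normal(0.0, 1.0, size=(20, n_features)),
    rng.normal(3.0, 1.0, size=(10, n_features)),
])

# Fit the scaler on the training features, then reuse it for the test features.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors is an assumed value
knn.fit(X_train_scaled, y_train)
predictions = knn.predict(X_test_scaled)

print("Predicted outliers:", int(predictions.sum()), "of", len(predictions))
```

Reusing the fitted scaler for the test features keeps both sets on the same scale, which matters for a distance-based model such as KNN.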

For this problem I used a KNN classifier as the model. SVC labelled very few test images as outliers (1). Although OneClassSVM is commonly used for such datasets, it also flagged very few outliers on this particular dataset, so I did not use it.
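A comparison of this kind can be set up by counting how many test samples each model flags as outliers. The sketch below runs on synthetic, well-separated features, so it only shows the mechanics and will not reproduce the behaviour reported above. All data, hyperparameters and the `count_outliers_supervised` helper are assumptions for illustration. Note that OneClassSVM is fitted on inliers only and returns -1 for outliers rather than 1.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-ins for scaled image features (assumption: 8 dimensions).
inliers = rng.normal(0.0, 1.0, size=(200, 8))
outliers = rng.normal(4.0, 1.0, size=(200, 8))
X_train = np.vstack([inliers, outliers])
y_train = np.array([0] * 200 + [1] * 200)  # 0 = inlier, 1 = outlier
X_test = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 8)),
    rng.normal(4.0, 1.0, size=(15, 8)),
])


def count_outliers_supervised(model):
    """Fit a two-class model and count test samples predicted as label 1."""
    return int(model.fit(X_train, y_train).predict(X_test).sum())


counts = {
    "KNN": count_outliers_supervised(KNeighborsClassifier(n_neighbors=5)),
    "SVC": count_outliers_supervised(SVC()),
}

# OneClassSVM sees inliers only and predicts +1 (inlier) or -1 (outlier).
ocsvm = OneClassSVM(nu=0.1, gamma="scale").fit(inliers)
counts["OneClassSVM"] = int((ocsvm.predict(X_test) == -1).sum())

print(counts)
```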

3. Author

BANERJEE, Rohini - HKUST MSc BDT (Student ID: 20543577)
