This problem provides images from two categories, inliers and outliers, with the inliers heavily outnumbering the outliers. Inlier images are those that have some text in their main body. The aim of this solution is to classify the test images efficiently as outliers (1) or inliers (0).
The provided data is heavily imbalanced: the number of inliers is much higher than the number of outliers. Hence data augmentation is required. The outlier images are augmented by changing their dimensions, zoom, rotation, etc.
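As a rough illustration of the rotation and zoom augmentations mentioned above, here is a minimal sketch using Pillow. The function name `augment`, the 15-degree angle, and the 1.2 zoom factor are assumptions chosen for the example; they are not taken from the original code, which may use a different library or different settings.

```python
from PIL import Image


def augment(img, angle=15, zoom=1.2):
    """Return a rotated and a zoomed variant of a PIL image, both the same size as the input."""
    width, height = img.size

    # Rotation: rotate about the centre and keep the original canvas size.
    rotated = img.rotate(angle)

    # Zoom: crop a centred window and scale it back up to the original size.
    crop_w, crop_h = int(width / zoom), int(height / zoom)
    left, top = (width - crop_w) // 2, (height - crop_h) // 2
    zoomed = img.crop((left, top, left + crop_w, top + crop_h)).resize((width, height))

    return [rotated, zoomed]


# Stand-in for an outlier image that would normally be loaded with Image.open(path).
sample = Image.new("RGB", (120, 80), color=(200, 30, 30))
variants = augment(sample)
print([v.size for v in variants])
```

Each variant could then be saved with `variant.save(...)` into the folder used for augmented images.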
Three folders are given (inlier_train, outlier_train and test), which contain all the images. The outlier folder contains about 800 images, the inlier folder about 4000 images, and the test folder about 1300 images.
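To give a feel for the size of the imbalance, the short calculation below works out how many augmented copies per outlier image would be needed to match the inlier count. It uses the approximate folder sizes quoted above and assumes exact balance is the goal; the number of copies actually generated by the original code is not stated.

```python
import math

n_outliers = 800   # approximate count from the outlier folder
n_inliers = 4000   # approximate count from the inlier folder

# Extra outlier images needed to match the inlier count.
shortfall = n_inliers - n_outliers

# Augmented copies to generate per original outlier image (rounded up).
copies_per_outlier = math.ceil(shortfall / n_outliers)

print(shortfall, copies_per_outlier)  # 3200 4
```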
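The following libraries are required to run the code. The third-party ones can be installed via conda or pip; os, sys and shutil are part of the Python standard library and need no installation:
- pandas
- numpy
- os, sys, cv2 (cv2 is provided by OpenCV, e.g. the opencv-python package on pip)
- keras
- sklearn (installed as scikit-learn)
- PIL (installed as Pillow)
- shutil
Add all three given folders to the same directory where the code is located.
The code can be run in Jupyter Notebook.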
The following steps are followed to find the outliers in the given test image folder:
- Import all the necessary libraries
- Read all the filenames in inlier_train and outlier_train
- Make two folders for storing the augmented images and the final train images
- For each outlier image, create augmented images and save them in the augmented folder
- Move all the images in the augmented folder into the train folder, and do the same for the images in the outlier folder
- Read all the images in the train folder (augmented and outliers) and give them a label of 1
- Add a label of 0 to the inlier filenames and then move the inlier images to the train folder as well
- Extract features from the training images with the ResNet50 model
- Do the same for the test images
- Scale both the train and test features with StandardScaler
- Train the KNN classifier (2 classes) on the scaled train features and the labels (0, 1)
- Predict the labels for the test images
- Write the results to a CSV file in the required format
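The file-handling and labelling steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the original code: the helper `build_train_folder`, the folder names `augmented` and `train`, and the filename prefixing are hypothetical, and the sketch copies files rather than moving them so the source folders stay intact.

```python
import os
import shutil
import tempfile


def build_train_folder(inlier_dir, outlier_dir, augmented_dir, train_dir):
    """Copy all training images into train_dir and return a {filename: label} dict."""
    os.makedirs(train_dir, exist_ok=True)
    labels = {}
    # Outliers and their augmented copies get label 1, inliers get label 0.
    for source_dir, label in ((outlier_dir, 1), (augmented_dir, 1), (inlier_dir, 0)):
        prefix = os.path.basename(os.path.normpath(source_dir))
        for name in sorted(os.listdir(source_dir)):
            # Prefix with the source folder name so equal filenames cannot collide.
            target = f"{prefix}_{name}"
            shutil.copy(os.path.join(source_dir, name), os.path.join(train_dir, target))
            labels[target] = label
    return labels


# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    layout = {
        "inlier_train": ["a.jpg", "b.jpg"],
        "outlier_train": ["a.jpg"],
        "augmented": ["a_rot.jpg"],
    }
    for folder, names in layout.items():
        os.makedirs(os.path.join(root, folder))
        for name in names:
            open(os.path.join(root, folder, name), "w").close()

    labels = build_train_folder(
        os.path.join(root, "inlier_train"),
        os.path.join(root, "outlier_train"),
        os.path.join(root, "augmented"),
        os.path.join(root, "train"),
    )
    n_copied = len(os.listdir(os.path.join(root, "train")))

print(labels, n_copied)
```

Keeping the labels in a dictionary keyed by the final filename makes it straightforward to line them up with the feature vectors extracted later.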
For this problem I used a KNN classifier as the model. SVC predicted very few outliers (1). Although OneClassSVM is usually used for such datasets, it also predicted very few outliers on this particular dataset, so I did not use it.
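A minimal sketch of the scaling and KNN step with scikit-learn is shown below. The feature arrays are synthetic stand-ins for the ResNet50 features, and `n_neighbors=5` is an assumed value; the original settings are not stated. The sketch fits the scaler on the training features only and reuses it for the test features, which is common practice but may differ from what the original code does.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for ResNet50 feature vectors (real ones would be much wider).
train_features = np.vstack([
    rng.normal(0.0, 1.0, size=(40, 16)),   # inliers, label 0
    rng.normal(3.0, 1.0, size=(40, 16)),   # outliers and augmented copies, label 1
])
train_labels = np.array([0] * 40 + [1] * 40)
test_features = np.vstack([
    rng.normal(0.0, 1.0, size=(5, 16)),
    rng.normal(3.0, 1.0, size=(5, 16)),
])

# Fit the scaler on the training features, then apply it to both sets.
scaler = StandardScaler().fit(train_features)
knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors is an assumed value
knn.fit(scaler.transform(train_features), train_labels)

# One predicted label (0 = inlier, 1 = outlier) per test feature vector.
predictions = knn.predict(scaler.transform(test_features))
print(predictions)
```

Swapping `KNeighborsClassifier` for `sklearn.svm.SVC` or `sklearn.svm.OneClassSVM` in this sketch is one way to repeat the comparison described above on the real features.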
BANERJEE, Rohini - HKUST MSc BDT (Student ID: 20543577)