<a href="https://colab.research.google.com/github/JoeMGomes/VCOM-FEUP/blob/main/VCOM_Proj.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# VCOM projects

* Multiclass classification of images
* Multilabel classification of images
* Image detection

## Colab data setup (mount drive and download to Colab)

### Mount Google Drive



To check whether Drive is already mounted:
1. Open Files in the left sidebar.
2. Check whether there is a folder called `gdrive`.

If not, run the following cells.
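
The same check can be done from a cell. This is a minimal sketch, assuming the mount point is `/content/gdrive` as used later in this notebook; `is_drive_mounted` is a hypothetical helper name, not part of `google.colab`.

```python
import os

def is_drive_mounted(mount_point="/content/gdrive"):
    """Return True if the mount point exists and is not empty."""
    return os.path.isdir(mount_point) and len(os.listdir(mount_point)) > 0

print("Drive mounted:", is_drive_mounted())
```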

------------------

The dataset is available on [Kaggle](https://www.kaggle.com/c/imet-2019-fgvc6).

A copy is stored in [UP Google Drive](https://drive.google.com/drive/folders/16iBIVzeiW9DLSHXIpaK-vf1ZH05x9O23?usp=sharing).

A shortcut to this folder must be added to the user's MyDrive folder in order to access the pictures.

In [None]:
from google.colab import drive
drive.flush_and_unmount()
drive.mount('/content/gdrive', force_remount=True)

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"

### Check the file navigation

In [None]:

%cd /content/gdrive/My Drive/Kaggle

In [None]:
!pwd # should be ^^

### Download data from Drive to Colab
---
To make the images faster to access, the dataset is copied into this Colab runtime. (This takes a few minutes to run.)

Use for the multiclass problem:



In [None]:
driveDir = '/content/gdrive/My Drive/Kaggle/'
newDir = '/content/proj/'
import shutil
import os
if not os.path.exists(newDir):
  os.mkdir(newDir)

shutil.copy(driveDir+"data.zip", newDir+'data.zip', follow_symlinks=True)
shutil.copy(driveDir+"multilabel.csv", newDir)
shutil.copy(driveDir+"vocabulary.py", newDir)
shutil.copy(driveDir+"MultiThreadedLoader.py" ,newDir)
shutil.copy(driveDir+"multiclass.csv",newDir)
shutil.unpack_archive(newDir+"data.zip",newDir)
os.remove(newDir+"data.zip")

os.chdir(newDir)

Use for the multilabel problem:

In [None]:
driveDir = '/content/gdrive/My Drive/Kaggle/'
newDir = '/content/proj/'
import shutil
import os
if not os.path.exists(newDir):
  os.mkdir(newDir)

shutil.copy(driveDir+"multilabel.zip", newDir)
shutil.unpack_archive(newDir+"multilabel.zip",newDir)
os.remove(newDir+"multilabel.zip")
shutil.copy(driveDir+"multilabel.csv", newDir)
shutil.copy(driveDir+"vocabulary.py", newDir)
shutil.copy(driveDir+"MultiThreadedLoader.py" ,newDir)
shutil.copy(driveDir+"multiclass.csv",newDir)

os.chdir(newDir)

### Test the file usage 


In [None]:
!ls data/ | grep 1a1e777b14d78e2c.png # should not return error, but the name of the file

In [None]:
import cv2 as cv 
from google.colab.patches import cv2_imshow as show

image = cv.imread('data/1a1e777b14d78e2c.png')
show(image)

## Multiclass classification - Task 1

### Treating data


In [None]:
import pandas as pd

imgDir = 'data/'
df = pd.read_csv('multiclass.csv')

df.head()

#### Original dataset class distribution 

As the plot below shows, the dataset is highly imbalanced, with some classes containing just 10 samples.

Several balancing techniques, explained below, can be applied.


In [None]:
import matplotlib.pyplot as plt

df.attribute_ids.value_counts().plot(kind='bar', figsize=(12,3))
plt.title("Number of images per label")
plt.ylabel('Number of images')
plt.xlabel('Label')

In [None]:
import numpy as np
from collections import Counter

X = df.id.to_numpy().reshape(-1, 1)
y = df.attribute_ids.to_numpy()

print(Counter(y))

##### Over-sampling 

Over-sampling balances an imbalanced dataset by duplicating, as many times as necessary, the examples of the least represented class(es).

In this particular case, the most frequent class has 9151 examples and the least frequent ones have 10, so over-sampling those 10 up to 9151 means that each example is repeated about 915 times on average (the method is random, so the exact count varies).
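
To make the duplication factor concrete, here is a small calculation using the two class sizes quoted above (9151 and 10). The variable names are illustrative only.

```python
majority_size = 9151   # most frequent class, from the counts above
minority_size = 10     # least frequent classes

# Random over-sampling draws with replacement until the class reaches the
# majority size, so on average each original example appears
# majority_size / minority_size times.
expected_copies = majority_size / minority_size
print(f"Each minority example appears about {expected_copies:.0f} times on average")
```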

In [None]:
from imblearn.over_sampling import RandomOverSampler

oversample = RandomOverSampler(sampling_strategy='not majority')
X_over, y_over = oversample.fit_resample(X, y)

print(Counter(y_over))

##### Under-sampling

In contrast to over-sampling, under-sampling balances the class distribution by randomly removing examples of the most frequent class(es).

This method is not appropriate here either, because every class would be left with only 10 instances. With so few examples, training may under- or over-fit and is unlikely to yield a good classifier.
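
A toy illustration of the effect, written in plain Python rather than with `imblearn`: every class is cut down to the size of the smallest one. The labels 7 and 99 are made up; the sizes 9151, 7463 and 10 are the ones quoted in this notebook.

```python
import random
from collections import Counter

random.seed(0)
# Hypothetical label list: two large classes and two tiny ones
labels = [13] * 9151 + [51] * 7463 + [7] * 10 + [99] * 10

counts = Counter(labels)
target = min(counts.values())  # size of the least frequent class

# Keep `target` randomly chosen indices per class and discard the rest
kept = []
for cls in counts:
    indices = [i for i, label in enumerate(labels) if label == cls]
    kept.extend(random.sample(indices, target))

under_counts = Counter(labels[i] for i in kept)
print(under_counts)
print(f"Kept {len(kept)} of {len(labels)} examples")
```

Almost all of the available data is thrown away, which is why this option is discarded here.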

In [None]:
from imblearn.under_sampling import RandomUnderSampler

undersample = RandomUnderSampler(sampling_strategy='not minority')
X_under, y_under = undersample.fit_resample(X, y)

print(Counter(y_under))

##### Joining classes

As an alternative, our professor suggested translating the dataset into a 3-class distribution: class 13 (9151 examples), class 51 (7463) and all the remaining classes merged into one (5473). For an example that belongs to one of the merged classes, this classification is only an intermediate step: after the initial prediction, a second classifier is needed to distinguish between the under-represented classes.
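
The two-stage idea can be sketched as follows. `first_stage` and `second_stage` stand in for any trained classifiers and are hypothetical, as is `predict_two_stage`; the merged class is encoded as 1, as in `joinLabels`.

```python
MERGED = 1  # label used for "all remaining classes"

def predict_two_stage(sample, first_stage, second_stage):
    """Route a sample through the 3-class model, then refine merged predictions."""
    coarse = first_stage(sample)      # returns 13, 51 or MERGED
    if coarse != MERGED:
        return coarse                 # frequent class: nothing more to do
    return second_stage(sample)       # rare class: ask the specialised model

# Dummy stand-ins, for demonstration only
first = lambda s: 13 if s == "a" else MERGED
second = lambda s: 42
print(predict_two_stage("a", first, second))
print(predict_two_stage("b", first, second))
```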

In [None]:
def joinLabels(row):
  if row.attribute_ids == 13 or row.attribute_ids == 51:
    return row.attribute_ids
  else: 
    return 1

df['joinedLabel'] = df.apply(lambda row: joinLabels(row), axis=1)
df.joinedLabel.value_counts().plot(kind='bar', figsize=(5, 3))

#### Applying the techniques in different ways

The goal here is to combine the techniques described above to balance the dataset, depending on the chosen approach.

The classes `RandomOverSampler` and `RandomUnderSampler` were used previously. Although they work as expected, they are not well suited to this use case, so we implemented our own:

In [None]:
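def os_us_to_value(df, value):
  """Resamples every class in df to exactly `value` rows."""
  lst = []
  for class_index, group in df.groupby('attribute_ids'):
    if len(group) > value:
      # Under-sample: draw distinct rows so no duplicates are introduced
      lst.append(group.sample(value, replace=False))
    else:
      # Over-sample: keep every original row and add random repeats
      lst.append(group)
      lst.append(group.sample(value-len(group), replace=True))
  return pd.concat(lst)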

In [None]:
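def os_us_to_value_bound(df, limit, bounds):
  """Resamples classes with more than `limit` rows to bounds[1], all others to bounds[0]."""
  lst = []
  for class_index, group in df.groupby('attribute_ids'):
    size = len(group)
    bound = bounds[0]
    if size > limit:
      bound = bounds[1]

    if size < bound:
      # Over-sample: keep every original row and add random repeats
      lst.append(group)
      lst.append(group.sample(bound-size, replace=True))
    else:
      # Under-sample: draw distinct rows so no duplicates are introduced
      lst.append(group.sample(bound, replace=False))

  return pd.concat(lst)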

##### Over-sampling and Under-sampling to 200 examples each (OS&US-200)

In the next cells, `os_us_to_value` is used to balance the dataset at 200 examples per class.
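
As a quick sanity check of the resampling idea on made-up data: classes larger than the target are cut down and smaller ones are padded with repeated rows. `resample_to` is a simplified, hypothetical stand-in for `os_us_to_value`, and the toy frame is invented.

```python
import pandas as pd

def resample_to(df, value, label_col="attribute_ids", seed=0):
    """Return a frame with exactly `value` rows per class."""
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) >= value:
            # Under-sample: pick distinct rows
            parts.append(group.sample(value, replace=False, random_state=seed))
        else:
            # Over-sample: keep every row, then add random repeats
            extra = group.sample(value - len(group), replace=True, random_state=seed)
            parts.append(pd.concat([group, extra]))
    return pd.concat(parts, ignore_index=True)

toy = pd.DataFrame({
    "id": [f"img{i}" for i in range(13)],
    "attribute_ids": [13] * 10 + [51] * 3,
})
balanced = resample_to(toy, 5)
print(balanced.attribute_ids.value_counts().to_dict())
```

Every original row of the small class is kept, so over-sampling never loses information; only the large class is reduced.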


In [None]:
osus200 = os_us_to_value(df, 200)
osus200.attribute_ids.value_counts().plot(kind='bar', figsize=(12, 3))

##### Over-sampling and Under-sampling to 500 examples each (OS&US-500)

In [None]:
osus500 = os_us_to_value(df, 500)
osus500.attribute_ids.value_counts().plot(kind='bar', figsize=(12, 3))

##### 3-class approach (with OS and US) 3C-OS&US

Merging the least frequent classes and balancing the merged set against the other two still does not fix the imbalance inside the third class. Here we therefore try to balance both the 3-class distribution and the merged classes at the same time.
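
A short sketch of how the target size per class is chosen in the bounded variant, assuming the parameters used in the next cell (`limit=1000`, `bounds=(150, 7000)`). `target_size` is a hypothetical helper that mirrors the bound selection in `os_us_to_value_bound`; the class size 300 is invented for illustration.

```python
def target_size(class_size, limit, bounds):
    """Classes above `limit` go to the upper bound, all others to the lower bound."""
    low, high = bounds
    return high if class_size > limit else low

for size in (10, 300, 7463, 9151):
    print(size, "->", target_size(size, limit=1000, bounds=(150, 7000)))
```

With these settings the two frequent classes end up at 7000 rows each, while any class with at most 1000 rows is brought to 150 rows.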

In [None]:
osus3c = os_us_to_value_bound(df, 1000, (150, 7000))
osus3c['merged3C'] = osus3c.apply(lambda row: joinLabels(row), axis=1)
osus3c.merged3C.value_counts().plot(kind='bar', figsize=(5, 3))

#### Splitting into training and testing datasets and Loading the images


In [None]:
from sklearn.model_selection import train_test_split

TEST_SIZE = 0.3
train, test = train_test_split(osus200, test_size=TEST_SIZE)

In [None]:
from MultiThreadedLoader import LoadImagesToMemory

# Loads every image into memory for easier use (avoids slow IO operations later on) 
train_imgs = []
test_imgs = []

train_list = train.id.unique()
# Tuple list [(id,img),...]
train_imgs = LoadImagesToMemory(imgDir,train_list,verbose=False) # Set verbose to true to print every loaded file path

test_list = test.id.unique()
# Tuple list [(id,img),...]
test_imgs = LoadImagesToMemory(imgDir,test_list, verbose=False) # Set verbose to true to print every loaded file path 

train_df = pd.DataFrame(train_imgs, columns=['id','image'])
test_df = pd.DataFrame(test_imgs, columns=['id','image'])

# Dataframe (id, attribute_ids, joinedLabel, image)
train_df = train.merge(train_df, how='left', on="id")

# Dataframe (id, attribute_ids, joinedLabel, image)
test_df = test.merge(test_df, how='left', on="id")

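The **train_df** and **test_df** variables are the main data containers of this notebook. Both hold the ids, the image data, the attributes and, later on, the descriptors.

Run this if RAM problems arise. Note that the Bag of Words section below still reads `train_imgs`, so skip this cell if you plan to run that section.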

In [None]:
del train_imgs
del test_imgs

### Bag of Words
The Bag of Words approach is simple and straightforward.  
For every image in the training dataset, feature descriptors are computed and added to a list. These descriptors are then clustered to create a vocabulary for the training dataset (each cluster should correspond to a visual word).
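
To illustrate the clustering step, here is a minimal sketch that builds a vocabulary with a plain NumPy k-means (Lloyd's algorithm). It is a swapped-in illustration, not the `cv.BOWKMeansTrainer` used in this notebook; `build_vocabulary` is a hypothetical name, and the descriptors are random placeholders with 64 dimensions, which is assumed here to match the default KAZE descriptor length.

```python
import numpy as np

def build_vocabulary(descriptors, n_words, n_iter=10, seed=0):
    """Plain k-means: returns `n_words` cluster centres (the visual words)."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen descriptors (fancy indexing returns a copy)
    centres = descriptors[rng.choice(len(descriptors), n_words, replace=False)]
    for _ in range(n_iter):
        # Assign each descriptor to its nearest centre
        dists = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # Move each centre to the mean of its members (unchanged if the cluster is empty)
        for k in range(n_words):
            members = descriptors[nearest == k]
            if len(members) > 0:
                centres[k] = members.mean(axis=0)
    return centres

# Stack the descriptors of all (placeholder) images, then cluster them
rng = np.random.default_rng(1)
per_image = [rng.normal(size=(30, 64)) for _ in range(4)]
all_descriptors = np.vstack(per_image)
vocabulary = build_vocabulary(all_descriptors, n_words=8)
print(vocabulary.shape)
```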

In [None]:
# detects descriptors in training dataset
def addDescriptor(image_tuple):
    keypoints, descriptors = detector.detectAndCompute(image_tuple[1], None)
    return (image_tuple[0],descriptors)


In [None]:
import multiprocessing 
import cv2 as cv
detector = cv.KAZE_create() # Used to detect keypoints and descriptors from an image

# This solution utilizes multiprocessing to speed up the processing times
# Unfortunately Google Colab only provides two computational cores
cpu_count = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes = cpu_count)
descriptor_list = []

print("CPU Cores:",cpu_count)

try:
    descriptor_list = pool.map(addDescriptor, train_imgs)
finally:
    pool.close()
    pool.join()



In [None]:
## Adds descriptors to train_df
descriptors_df = pd.DataFrame(descriptor_list, columns=['id','descriptors'])
train_df = pd.merge(train_df,descriptors_df, on="id")
train_df.head()

In [None]:
bowTrainer = cv.BOWKMeansTrainer(100) # Used to cluster descriptors 

# Adds found descriptors to Trainer Object
for index, row in train_df.iterrows():
  if row["descriptors"] is not None:
      bowTrainer.add(row["descriptors"])

vocabulary = bowTrainer.cluster()

Creating Histograms and Standardization of features
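
A minimal NumPy sketch of both steps, assuming a vocabulary of cluster centres is already available. The vocabulary and descriptors are random placeholders, not KAZE output, and `bow_histogram` is a hypothetical name. The standardization is written out as z = (x - u)/s per feature column, using the population standard deviation.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Count how many descriptors fall closest to each visual word."""
    # Pairwise squared distances, shape (n_descriptors, n_words)
    diffs = descriptors[:, None, :] - vocabulary[None, :, :]
    distances = (diffs ** 2).sum(axis=2)
    words = distances.argmin(axis=1)  # nearest word per descriptor
    return np.bincount(words, minlength=len(vocabulary))

rng = np.random.default_rng(0)
vocabulary = rng.normal(size=(5, 64))  # 5 placeholder visual words
# Three placeholder images with 20, 35 and 50 descriptors each
histograms = np.stack(
    [bow_histogram(rng.normal(size=(n, 64)), vocabulary) for n in (20, 35, 50)]
)

# z = (x - u) / s, computed independently for every visual word
mean = histograms.mean(axis=0)
std = histograms.std(axis=0)
std[std == 0] = 1.0  # avoid division by zero for constant columns
standardized = (histograms - mean) / std
print(histograms)
print(standardized.round(2))
```

Each histogram sums to the number of descriptors in its image, and after standardization every column has zero mean.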

In [None]:
from sklearn.preprocessing import StandardScaler
from scipy.cluster.vq import vq
import numpy as np

## Creates an "histogram"/feature count list for each image
histograms_list = np.zeros((len(train_df),100),"float32")
for index,row in train_df.iterrows(): 
    if row["descriptors"] is not None:
        words,distance= vq(row["descriptors"],vocabulary)
        for w in words:
            histograms_list[index][w]+=1

## Standardizes the features range as z = (x - u)/s 
## Where x is the actual value, u is the mean and s is the standard deviation
standardized_range = StandardScaler().fit(histograms_list)
histograms_list= standardized_range.transform(histograms_list)

Classification model with SVM

In [None]:
from sklearn.svm import LinearSVC

clf=LinearSVC(max_iter=1000)
clf.fit(histograms_list,np.array(train_df.attribute_ids)) 


In [None]:
descriptor_list_test=[]


for index, row in test_df.iterrows():
    keypoints, descriptors = detector.detectAndCompute(row.image, None)

    descriptor_list_test.append((row.id,descriptors))   

In [None]:
## Adds descriptors to test_df
# descriptor_list_test follows the row order of test_df, so assign it directly;
# merging on "id" would multiply the rows of ids duplicated by over-sampling
test_df["descriptors"] = [descriptors for _, descriptors in descriptor_list_test]
test_df.head()

In [None]:
from scipy.cluster.vq import vq

## Creates an "histogram"/feature count list for each image
histogram_test_list = np.zeros((len(descriptor_list_test),100),"float32")
for index,row in test_df.iterrows(): 
    if row["descriptors"] is not None:
        words,distance= vq(row["descriptors"],vocabulary) 
        for w in words:
            histogram_test_list[index][w]+=1

standardized_test_features = standardized_range.transform(histogram_test_list)

#### Get Results 

In [None]:
import matplotlib.pyplot as plt
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay,accuracy_score,fbeta_score, recall_score, precision_score
predictions = clf.predict(standardized_test_features)

accuracy = accuracy_score(test_df.attribute_ids.values, predictions)
recall = recall_score(test_df.attribute_ids.values, predictions, average="macro")
precision= precision_score(test_df.attribute_ids.values, predictions, average="macro")
fscore = fbeta_score(test_df.attribute_ids.values, predictions,average="macro", beta=2)

# Pass the axes explicitly so the figure size is actually applied
fig, ax = plt.subplots(figsize=(20,20))
ConfusionMatrixDisplay.from_estimator(clf, standardized_test_features, test_df.attribute_ids.values, ax=ax)

plt.show()
print("Accuracy:",accuracy)
print("Recall:",recall)
print("Precision:",precision)
print("F2 Score:",fscore)

### CNN

In [None]:
import tensorflow as tf
import cv2 as cv
from tensorflow.keras import layers, models
import numpy as np

In [None]:
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_df.image[i])
plt.show()

In [None]:
labels = train_df.attribute_ids.unique() # with osus200 and osus500
# labels = train_df.merged3C.unique() # with osus3C
def id2Label(row):
  return np.where(labels==row.attribute_ids)[0][0] #osus200 and osus500
  # return np.where(labels==row.merged3C)[0][0] # osus3c

train_df["label"] = train_df.apply(lambda row: id2Label(row), axis=1)
test_df["label"] = test_df.apply(lambda row: id2Label(row), axis=1)

In [None]:
from tensorflow.keras.layers import BatchNormalization # same package as layers/models above
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), input_shape=(300,300,1), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))

model.add(layers.Conv2D(32, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Dropout(0.25))

model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.Dropout(0.25))

model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Dropout(0.25))

model.add(layers.Conv2D(256, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Dropout(0.25))

model.add(layers.Flatten())

model.add(layers.Dense(512, activation='relu'))
model.add(BatchNormalization())
model.add(layers.Dropout(0.3))

model.add(layers.Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(layers.Dropout(0.4))

model.add(layers.Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(layers.Dropout(0.5))

model.add(layers.Dense(labels.size, activation='softmax'))
model.summary()

In [None]:
# Integer class labels with a softmax output call for sparse categorical cross-entropy
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

history = model.fit(np.concatenate(train_df.image).reshape((-1, 300, 300 ,1)), train_df.label, epochs=200, validation_split=0.1)
print(history.history)

y_pred = model.predict(np.concatenate(test_df.image).reshape((-1, 300, 300 ,1)), batch_size=32, verbose=1)
y_pred_bool = np.argmax(y_pred, axis=1)

In [None]:
## PREDICTIONS
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score,fbeta_score, recall_score, precision_score

accuracy = accuracy_score(test_df.label, y_pred_bool)
recall = recall_score(test_df.label, y_pred_bool, average="macro")
precision= precision_score(test_df.label, y_pred_bool, average="macro")
fscore = fbeta_score(test_df.label, y_pred_bool,average="macro", beta=2)

# TODO: get other measurements
print("Accuracy:",accuracy)
print("Recall:",recall)
print("Precision:",precision)
print("F2 Score:",fscore)

## Multilabel classification - Task 2 Option 1


In [None]:
import pandas as pd

imgDir = 'data/'
df = pd.read_csv('multilabel.csv')
df.head()

In [None]:
shutil.copy(driveDir+"multilabel.csv", newDir)

In [None]:
df.loc[df["id"]== "1fad69ed268ac538" ]

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

def convert2List(row):
  return row.attribute_ids.split(' ')

df['label'] = df.apply(lambda row: convert2List(row), axis=1)
one_hot = MultiLabelBinarizer()
df['one_hot'] = list(one_hot.fit_transform(df.label))
df.head()

In [None]:
from sklearn.model_selection import train_test_split

TEST_SIZE = 0.3
train, test = train_test_split(df, test_size=TEST_SIZE)
print(len(train))
print(len(test))

In [None]:
import glob

all_images = set(glob.glob('data/*.png')) # a set makes the membership tests below fast
train_ids = train.id.values
test_ids = test.id.values
# Keep only the ids whose image file actually exists
train_both = [x for x in train_ids if 'data/' + x +'.png' in all_images]
test_both = [x for x in test_ids if 'data/' + x +'.png' in all_images]

print(len(train_both), len(test_both), len(all_images) )

In [None]:
from MultiThreadedLoader import LoadImagesToMemory

# Loads every image into memory for easier use (avoids slow IO operations later on) 
train_imgs = []
test_imgs = []

train_list = train.id.unique()
# Tuple list [(id,img),...]
train_imgs = LoadImagesToMemory(imgDir,train_both,verbose=False) # Set verbose to true to print every loaded file path

test_list = test.id.unique()
# Tuple list [(id,img),...]
test_imgs = LoadImagesToMemory(imgDir,test_both, verbose=False) # Set verbose to true to print every loaded file path 

train_df = pd.DataFrame(train_imgs, columns=['id','image'])
test_df = pd.DataFrame(test_imgs, columns=['id','image'])
print(train_df.shape)
# Dataframe (id, image, attribute_ids, label, one_hot)
train_df = train_df.merge(train, how='left', on="id")

# Dataframe (id, image, attribute_ids, label, one_hot)
test_df = test_df.merge(test, how='left', on="id")

In [None]:
import tensorflow as tf
import cv2 as cv
from tensorflow.keras import layers, models
import numpy as np
from tensorflow.keras.layers import BatchNormalization # same package as layers/models above

In [None]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), input_shape=(300,300,1), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))

model.add(layers.Conv2D(32, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Dropout(0.25))

model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.Dropout(0.25))

model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Dropout(0.25))

model.add(layers.Conv2D(256, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Dropout(0.25))

model.add(layers.Flatten())

model.add(layers.Dense(512, activation='relu'))
model.add(BatchNormalization())
model.add(layers.Dropout(0.3))

model.add(layers.Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(layers.Dropout(0.4))

model.add(layers.Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(len(df.one_hot[0]), activation='sigmoid'))
model.summary()

In [None]:
from tensorflow.keras import backend as K # use the backend bundled with tf.keras, matching the model

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))

def f2_score(y_true, y_pred):
    y_true = tf.cast(y_true, "int32")
    y_pred = tf.cast(tf.round(y_pred), "int32") # implicit 0.5 threshold via tf.round
    y_correct = y_true * y_pred
    sum_true = tf.reduce_sum(y_true, axis=1)
    sum_pred = tf.reduce_sum(y_pred, axis=1)
    sum_correct = tf.reduce_sum(y_correct, axis=1)
    precision = sum_correct / sum_pred
    recall = sum_correct / sum_true
    f_score = 5 * precision * recall / (4 * precision + recall)
    f_score = tf.where(tf.math.is_nan(f_score), tf.zeros_like(f_score), f_score)
    return tf.reduce_mean(f_score)

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy',  metrics=['acc',f1_m,precision_m, recall_m, f2_score])

history = model.fit(np.concatenate(train_df.image).reshape((-1, 300, 300 ,1)), pd.DataFrame(np.concatenate(train_df.one_hot.values).reshape(-1, len(df.one_hot[0]))), epochs=100, validation_split=0.1)

print(history.history)

In [None]:
## PREDICTIONS
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score,fbeta_score, recall_score, precision_score

predictions = model.predict(np.concatenate(test_df.image).reshape((-1, 300, 300 ,1)), batch_size=32, verbose=1)
y_pred = (predictions > 0.5) 
y_pred_bool = np.argmax(predictions, axis=1)

print(y_pred)

# Stack the per-row one-hot vectors into one (n_samples, n_labels) array
y_true = np.stack(test_df.one_hot.values)

accuracy = accuracy_score(y_true, y_pred) # subset accuracy: all labels of an image must match
recall = recall_score(y_true, y_pred, average="macro")
precision= precision_score(y_true, y_pred, average="macro")
fscore = fbeta_score(y_true, y_pred,average="macro", beta=2)

# TODO: get other measurements
print("Accuracy:",accuracy)
print("Recall:",recall)
print("Precision:",precision)
print("F2 Score:",fscore)
