### Get the Data Files

If you haven't already, please download dogs-vs-cats.zip from __[here](https://www.kaggle.com/competitions/dogs-vs-cats/data)__ and extract only train.zip from it into the current project folder. Keep train.zip as a zip file; do not extract its contents yet.

In [1]:
import os
import zipfile

If train.zip is available in the current folder, the code below extracts its contents into a subfolder called __train__. The contents of that folder are then matched against __trainfiles.txt__. <br> This makes sure that you have all 25000 image files required for this experiment.

In [2]:
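data_path = os.path.join(os.getcwd(),"train")
#Read the expected file names; the with-block closes the file afterwards
with open("trainfiles.txt", "r") as f:
    trainfiles = f.read().split(",")
target_zip = "train.zip"
#Default value, so that unzip is defined even when some files are missing
unzip = False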

try:
    
    #if train folder exists, check all file names with trainfiles.txt 
    avlblfiles = os.listdir(data_path)
    chk_data = [file for file in trainfiles if file not in avlblfiles]
    
    if len(chk_data)==0:
        
        unzip = False
        print("All Train Files Found, please proceed")
    else:
        
        print("Some train files are missing")
                
except FileNotFoundError:
    
    unzip = True
    print("Folder train not found, attempting unzip")
    
#If train folder is not found, look for train.zip and
#unzip its contents into a folder train
    
if unzip:
    
    try:
        
        with zipfile.ZipFile(target_zip) as zip_file:
            print("Unzipping data")
            zip_file.extractall()
        print("Train Data unzipped, please proceed")
        
    except FileNotFoundError:
        
        print("File train.zip not found, please download from Kaggle ")

All Train Files Found, please proceed
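The same completeness check can be phrased as a set difference, which also reports exactly which files are missing. This is only a minimal sketch: the helper name `missing_files` and the file names are made up for illustration and are not part of the notebook.

```python
def missing_files(expected, available):
    """Return the expected file names that are absent from `available`, sorted."""
    return sorted(set(expected) - set(available))

# Made-up file names, for illustration only
expected = ["cat.0.jpg", "cat.1.jpg", "dog.0.jpg"]
available = ["cat.0.jpg", "dog.0.jpg"]

missing = missing_files(expected, available)
print(missing)
```

Set lookups are constant time on average, so this stays fast for 25000 names, whereas testing membership in a list scans the list on every lookup.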


### Import Libraries

In [3]:
import tensorflow as tf
from tensorflow.image import resize
from tensorflow import keras
from tensorflow.keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split

from tensorflow.keras.utils import to_categorical
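from tensorflow.keras.metrics import Recall, CategoricalAccuracy
#Adam is used when compiling the model below
from tensorflow.keras.optimizers import Adam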
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import entropy
import os

#A custom library for helper functions
from src.helper import *
np.random.seed(0)

### Build Datasets

The dataset is one large collection of images of cats and dogs. The label can be identified from the file name. We load all the 25000 file names and the corresponding labels into an array.  

In [4]:
label_dict={'cat':0,'dog':1}
dataset=np.array([(os.path.join(data_path,i),label_dict[i.split('.')[0]]) for i in os.listdir(data_path)])

In [5]:
dataset[0:3]

array([['C:\\Users\\arind\\Documents\\ActiveLearning\\train\\cat.0.jpg',
        '0'],
       ['C:\\Users\\arind\\Documents\\ActiveLearning\\train\\cat.1.jpg',
        '0'],
       ['C:\\Users\\arind\\Documents\\ActiveLearning\\train\\cat.10.jpg',
        '0']], dtype='<U59')
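Note that the labels in the output above are strings (`'0'`), not integers. NumPy stores a mixed list of strings and integers in a single string-typed array, which is why the labels have to be cast back to integers before they are one-hot encoded. A minimal sketch with made-up paths:

```python
import numpy as np

# Made-up (path, label) pairs, for illustration only
pairs = [("train/cat.0.jpg", 0), ("train/dog.0.jpg", 1)]
arr = np.array(pairs)

# Mixing str and int yields a unicode string array
print(arr.dtype.kind)

# Cast the label column back to integers
labels = arr[:, 1].astype(int)
print(labels)
```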

We assign the file names to X and the labels to y. Since we have two classes, we use the Keras `to_categorical` method to one-hot encode the labels. <br> The files aren't shuffled: all the cat files appear first, followed by the dog files, so we shuffle X and y. <br>

We reserve roughly 10% of the samples each for the validation and test sets, and train on the rest.

In [6]:
X,y=dataset[::,0],dataset[::,1]
y = y.astype(int)

#One hot encode the labels
y = to_categorical(y)

#Shuffle the dataset
p = np.random.permutation(len(X))
X,y = X[p], y[p]

#Strip off 10% samples for hold out test set
test_idxs = np.random.choice(len(X), size=int(0.1*len(X)), replace=False, p=None)
x_test, y_test = X[test_idxs],y[test_idxs]

#Delete the test set samples from X,y 
X = np.delete(X, test_idxs)
y = np.delete(y, test_idxs, axis = 0)

#Usual train-val split. We use 0.11 here just to match the validation set size to the test set size
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.11, random_state=42)

A quick check on the sample counts

In [7]:
print(f"Samples in Training set: {x_train.shape[0]}")
print(f"Samples in Validation set: {x_val.shape[0]}")
print(f"Samples in Test set: {x_test.shape[0]}")

Samples in Training set: 20025
Samples in Validation set: 2475
Samples in Test set: 2500
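The counts above follow directly from the two split fractions. The sketch below reproduces the arithmetic in plain Python. It assumes the validation count is obtained by rounding to the nearest integer; the exact rounding rule inside `train_test_split` may differ, although it gives the same numbers here.

```python
total = 25000

# 10% of all samples are held out for the test set
n_test = int(0.1 * total)

# 11% of what is left goes to validation (rounding is an assumption)
remaining = total - n_test
n_val = round(0.11 * remaining)

# Everything else is used for training
n_train = remaining - n_val

print(n_train, n_val, n_test)
```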


A quick check for data imbalance. 

In [8]:
for i in [y_train, y_test, y_val]:
    print(np.unique(i, return_counts = True, axis = 0))

(array([[0., 1.],
       [1., 0.]], dtype=float32), array([10085,  9940], dtype=int64))
(array([[0., 1.],
       [1., 0.]], dtype=float32), array([1211, 1289], dtype=int64))
(array([[0., 1.],
       [1., 0.]], dtype=float32), array([1204, 1271], dtype=int64))


We use the helper function to convert the data into TensorFlow dataset objects. Note that the __repeat__ flag is true by default and is needed only for the train set, so we switch it off for the validation and test sets.

In [9]:
#build_dataset is a custom function that returns tensor batches

val_dataset=build_dataset(x_val,y_val,repeat=False,batch=256)
test_dataset=build_dataset(x_test,y_test,repeat=False,batch=256)

BATCH_SIZE=16
#Integer division, since the number of steps per epoch must be a whole number
STEPS_PER_EPOCH=len(x_train)//BATCH_SIZE

train_dataset=build_dataset(x_train,y_train,batch=BATCH_SIZE)
input_shape=train_dataset.element_spec[0].shape[1:]
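Because the train dataset repeats indefinitely, the number of steps per epoch decides how many batches count as one pass over the data. The sketch below works through that arithmetic for the sample count and batch size used here; the variable names are illustrative only.

```python
import math

n_train = 20025
batch_size = 16

# Floor: only full batches are counted, the last 9 samples spill into the next epoch
steps_floor = n_train // batch_size

# Ceiling: every sample is covered at least once per epoch
steps_ceil = math.ceil(n_train / batch_size)

print(steps_floor, steps_ceil)  # 1251 1252
```

Either choice is workable with a repeating dataset; what matters is that the value is an integer.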

### Model Building

This part is quite standard: we use a helper function to build a simple convolutional neural network and then compile it.

In [10]:
model=simple_model(input_shape)
model.compile(
        loss = "categorical_crossentropy",
        optimizer = Adam(),
        metrics = [CategoricalAccuracy()]
    )
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 64, 64, 32)        896       
                                                                 
 batch_normalization (BatchN  (None, 64, 64, 32)       128       
 ormalization)                                                   
                                                                 
 max_pooling2d (MaxPooling2D  (None, 32, 32, 32)       0         
 )                                                               
                                                                 
 dropout (Dropout)           (None, 32, 32, 32)        0         
                                                                 
 conv2d_1 (Conv2D)           (None, 30, 30, 64)        18496     
                                                                 
 batch_normalization_1 (Batc  (None, 30, 30, 64)       2

In [11]:
checkpoint=ModelCheckpoint(filepath='model/model_baseline.h5',
                           monitor='val_loss',save_best_only=True,verbose=1)

csv_logger=keras.callbacks.CSVLogger('logger/trainlog_baseline.csv',
                                     separator=',',append=False)

early_stopper=keras.callbacks.EarlyStopping(monitor='val_loss',
                                            min_delta=0.001,
                                            restore_best_weights=True,
                                            patience=10)

callbacks_list=[checkpoint,early_stopper,csv_logger]

In [12]:
model.fit(train_dataset,steps_per_epoch=STEPS_PER_EPOCH,epochs=200,
          validation_data=val_dataset,validation_steps=None,
          callbacks=callbacks_list)

Epoch 1/200
Epoch 1: val_loss improved from inf to 0.65976, saving model to model\model_baseline.h5
Epoch 2/200
Epoch 2: val_loss improved from 0.65976 to 0.59598, saving model to model\model_baseline.h5
Epoch 3/200
Epoch 3: val_loss improved from 0.59598 to 0.49338, saving model to model\model_baseline.h5
Epoch 4/200
Epoch 4: val_loss did not improve from 0.49338
Epoch 5/200
Epoch 5: val_loss did not improve from 0.49338
Epoch 6/200
Epoch 6: val_loss improved from 0.49338 to 0.36065, saving model to model\model_baseline.h5
Epoch 7/200
Epoch 7: val_loss did not improve from 0.36065
Epoch 8/200
Epoch 8: val_loss did not improve from 0.36065
Epoch 9/200
Epoch 9: val_loss did not improve from 0.36065
Epoch 10/200
Epoch 10: val_loss did not improve from 0.36065
Epoch 11/200
Epoch 11: val_loss improved from 0.36065 to 0.33583, saving model to model\model_baseline.h5
Epoch 12/200
Epoch 12: val_loss did not improve from 0.33583
Epoch 13/200
Epoch 13: val_loss did not improve from 0.33583
Epoc

<keras.callbacks.History at 0x1ff857f7c40>

### Model Evaluation

So we have trained a model on the full training set of 20025 samples. How does it perform on the test set?

In [13]:
model = keras.models.load_model("model/model_baseline.h5")

In [14]:
print("-" * 100)
print(model.evaluate(test_dataset, verbose=0,return_dict=True))

----------------------------------------------------------------------------------------------------
{'loss': 0.2882618308067322, 'categorical_accuracy': 0.8831999897956848}


### Measuring Uncertainties

In this section we evaluate three metrics for measuring uncertainty: least confidence, margin of confidence and entropy. For each metric, we apply its formula to the prediction probabilities and pick out the 10 test samples with the most uncertainty.

In [15]:
y_test_proba = model.predict(test_dataset)



Now that we have the prediction probabilities for the entire test set, we can apply each formula to calculate the uncertainty metric and select the 10 most uncertain samples.<br> Let us start with Least Confidence, or __LC__:

$$ LC_i = 1 - P_{imax}  $$

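Here $P_{imax}$ is the highest predicted class probability of the $i$-th sample. The closer LC is to its maximum, the less confident the model is about that sample.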

In [16]:
#Calculate Least Confidence
y_test_uncert = 1 - y_test_proba.max(axis=1)
#Indices of the top 10 Least Confidence
y_test_top_lc = np.argsort(y_test_uncert)[-10:]
#Print the predictions for the top 10 least confidence
print(y_test_proba[y_test_top_lc])

[[0.49423876 0.50576127]
 [0.49471268 0.5052873 ]
 [0.50421554 0.4957844 ]
 [0.50365025 0.49634972]
 [0.503      0.49699995]
 [0.497582   0.50241804]
 [0.49760765 0.50239235]
 [0.49801224 0.50198776]
 [0.50127554 0.49872446]
 [0.5001126  0.49988738]]
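To see the least-confidence ranking on a scale that is easy to check by hand, here is a small sketch on four made-up probability rows (not taken from the model above).

```python
import numpy as np

# Made-up predictions for four samples and two classes
proba = np.array([
    [0.90, 0.10],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
])

# Least confidence: one minus the highest class probability
lc = 1 - proba.max(axis=1)

# argsort is ascending, so the last entries are the most uncertain samples
top2 = np.argsort(lc)[-2:]

print(lc)
print(top2)
```

Rows 1 and 3 are selected because their highest probabilities (0.55 and 0.51) are closest to 0.5.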


The margin of confidence of a sample is the difference between its highest ($P_{i1}$) and second-highest ($P_{i2}$) prediction probabilities. The smaller the margin, the more uncertain the prediction.

$$ MC_i = P_{i1} - P_{i2} $$

In [17]:
part = np.partition(-y_test_proba, 1, axis=1)
# margin calculation
margin = - part[:, 0] + part[:, 1]
# indices of the lowest margin scores
y_test_least_mc = np.argsort(margin)[:10]
#Print the predictions for the 10 least margins
print(y_test_proba[y_test_least_mc])

[[0.5001126  0.49988738]
 [0.50127554 0.49872446]
 [0.49801224 0.50198776]
 [0.49760765 0.50239235]
 [0.497582   0.50241804]
 [0.503      0.49699995]
 [0.50365025 0.49634972]
 [0.50421554 0.4957844 ]
 [0.49471268 0.5052873 ]
 [0.49423876 0.50576127]]
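The margin becomes more informative with more than two classes, where the second-highest probability is no longer fixed by the highest one. The sketch below uses made-up three-class predictions and a full sort instead of `np.partition`, which is slower but easier to read.

```python
import numpy as np

# Made-up predictions for three samples and three classes
proba = np.array([
    [0.50, 0.30, 0.20],
    [0.40, 0.39, 0.21],
    [0.80, 0.15, 0.05],
])

# Sort each row ascending and keep the two largest probabilities
top2 = np.sort(proba, axis=1)[:, -2:]

# Margin: highest minus second-highest probability
margin = top2[:, 1] - top2[:, 0]

# Smallest margin first, i.e. most uncertain first
order = np.argsort(margin)

print(margin)
print(order)
```

Sample 1 comes first: its two leading classes are only 0.01 apart.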


Finally, the entropy of the $i$-th sample is given by the formula below, where $p_{in}$ is the predicted probability of class $n$ for that sample. Thankfully, we don't have to write the code for this calculation, as scipy provides a neat function called __entropy__ that does precisely that.

$$ E_{i} = -\sum_{n} p_{in} \log p_{in} $$

In [18]:
#indices of the predictions with 10 largest entropies
y_test_max_ents = np.argsort(entropy(y_test_proba.T))[-10:]
#Print the 10 predictions with largest entropies
print(y_test_proba[y_test_max_ents])

[[0.49423876 0.50576127]
 [0.49471268 0.5052873 ]
 [0.50421554 0.4957844 ]
 [0.50365025 0.49634972]
 [0.503      0.49699995]
 [0.497582   0.50241804]
 [0.49760765 0.50239235]
 [0.49801224 0.50198776]
 [0.50127554 0.49872446]
 [0.5001126  0.49988738]]
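The same ten rows show up for all three metrics (in reverse order for the margin). This is expected with two classes: least confidence, margin and entropy are all monotonic functions of the highest class probability, so they rank the samples identically. The sketch below checks this on made-up predictions, with the entropy written out in NumPy using the natural logarithm, which is also the default in scipy's `entropy`. The helper name `sample_entropy` and its `eps` clipping are assumptions made for this illustration.

```python
import numpy as np

def sample_entropy(proba, eps=1e-12):
    """Per-row entropy -sum(p * log p), clipping p to avoid log(0)."""
    p = np.clip(proba, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)

# Made-up predictions for four samples and two classes
proba = np.array([
    [0.90, 0.10],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
])

ent = sample_entropy(proba)
lc = 1 - proba.max(axis=1)
sorted_p = np.sort(proba, axis=1)
margin = sorted_p[:, -1] - sorted_p[:, -2]

# Most uncertain sample first under each metric
rank_ent = np.argsort(-ent)
rank_lc = np.argsort(-lc)
rank_margin = np.argsort(margin)

print(rank_ent, rank_lc, rank_margin)
```

With three or more classes the rankings can diverge, so the choice of metric starts to matter.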
