<a href="https://colab.research.google.com/github/SamuelBFG/DL-studies/blob/master/2_similarity_search_level_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Similarity Search**

Before we start, add the Caltech 101 dataset to your Google Colab environment. The dataset is available on [Kaggle](https://www.kaggle.com/ceciliala/caltech-101).

In [1]:
# Kaggle dependencies will already be installed so there is no need for "!pip install kaggle"
# You'll need to upload your kaggle.json file though

from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"samuelbfg","key":"<redacted>"}'}

In [2]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# Restrict the permissions so that only the owner can read and write the API key file
!chmod 600 ~/.kaggle/kaggle.json

In [3]:
!kaggle datasets download -d ceciliala/caltech-101

Downloading caltech-101.zip to /content
 89% 103M/115M [00:00<00:00, 114MB/s]  
100% 115M/115M [00:01<00:00, 108MB/s]


In [4]:
from zipfile import ZipFile
file_name = "caltech-101.zip"

# Use a name other than `zip` so the built-in function is not shadowed
with ZipFile(file_name, 'r') as archive:
  archive.extractall()
  print('Done')

Done


Here we use the files in the ***data*** folder produced by the previous notebook, ***1-feature-extraction***.

The files that we are going to need are:

```
   class_ids-caltech101.pickle
   features-caltech101-resnet.pickle
   features-caltech101-resnet-finetuned.pickle
   filenames-caltech101.pickle
```

In [5]:
from zipfile import ZipFile
file_name = "data.zip"

# Use a name other than `zip` so the built-in function is not shadowed
with ZipFile(file_name, 'r') as archive:
  archive.extractall()
  print('Done')

Done


### **Level 3**

So far we have visualized the results with two techniques, t-SNE and PCA. Now we will calculate the accuracy of the features obtained from the pretrained and finetuned models. The finetuning here follows the same technique we learnt in Chapter 2.

In [6]:
import numpy as np
import pickle
from tqdm import tqdm, tqdm_notebook
import random
import time
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import PIL
from PIL import Image
from sklearn.neighbors import NearestNeighbors

import glob
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

For these experiments we will use the same Caltech101 features as before.

Let's start by loading the features from the pretrained model.

In [11]:
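# Load the file paths, the pretrained ResNet features, and the class ids,
# closing each file as soon as it has been read
with open('data/filenames-caltech101.pickle', 'rb') as f:
    filenames = pickle.load(f)
with open('data/features-caltech101-resnet.pickle', 'rb') as f:
    feature_list = pickle.load(f)
with open('data/class_ids-caltech101.pickle', 'rb') as f:
    class_ids = pickle.load(f)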

num_images = len(filenames)
num_features_per_image = len(feature_list[0])
print("Number of images = ", num_images)
print("Number of features per image = ", num_features_per_image)

Number of images =  8677
Number of features per image =  2048


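First, let's write a helper function that measures the accuracy of the features with a brute-force nearest-neighbors search: for each image, it retrieves the nearest neighbors and counts how many of them belong to the same class as the query image.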

In [12]:
# Helper function to get the class name (the parent directory of the file)
def classname(path):
    return path.split('/')[-2]


# Helper function to get the class name and the file name
def classname_filename(path):
    return path.split('/')[-2] + '/' + path.split('/')[-1]


def calculate_accuracy(feature_list):
    num_nearest_neighbors = 5
    correct_predictions = 0
    incorrect_predictions = 0
    neighbors = NearestNeighbors(n_neighbors=num_nearest_neighbors,
                                 algorithm='brute',
                                 metric='euclidean').fit(feature_list)
    for i in tqdm_notebook(range(len(feature_list))):
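        distances, indices = neighbors.kneighbors([feature_list[i]])
        # Start at 1 to skip the first neighbor, which is normally the query
        # image itself, so 4 neighbors are scored per image
        for j in range(1, num_nearest_neighbors):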
            if (classname(filenames[i]) == classname(
                    filenames[indices[0][j]])):
                correct_predictions += 1
            else:
                incorrect_predictions += 1
    print(
        "Accuracy is ",
        round(
            100.0 * correct_predictions /
            (1.0 * correct_predictions + incorrect_predictions), 2))
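
To make the metric concrete: for every image, the helper retrieves its five nearest neighbors, skips the first one (normally the query itself, at distance 0), and counts how many of the remaining four share the query's class. The sketch below reproduces that logic in a batched form on a small synthetic dataset. The names `neighbor_accuracy`, `toy_features`, and `toy_labels` are hypothetical and not part of the notebook, and the toy result says nothing about Caltech101.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def neighbor_accuracy(features, labels, num_nearest_neighbors=5):
    """Percentage of retrieved neighbors (query excluded) sharing the query's label."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    knn = NearestNeighbors(n_neighbors=num_nearest_neighbors,
                           algorithm='brute',
                           metric='euclidean').fit(features)
    # One batched query instead of one call per image
    _, indices = knn.kneighbors(features)
    # Column 0 is normally the query itself, so compare columns 1..k-1 only
    matches = labels[indices[:, 1:]] == labels[:, None]
    return 100.0 * matches.mean()


# Two well-separated synthetic clusters stand in for two image classes
rng = np.random.default_rng(0)
toy_features = np.vstack([rng.normal(0.0, 0.1, size=(20, 8)),
                          rng.normal(5.0, 0.1, size=(20, 8))])
toy_labels = np.array([0] * 20 + [1] * 20)

toy_accuracy = neighbor_accuracy(toy_features, toy_labels)
print("Toy accuracy:", round(toy_accuracy, 2))
```

Querying all rows in one `kneighbors` call avoids the per-image Python loop, at the cost of holding the full index matrix in memory.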

### **1. Accuracy of Brute Force over Caltech101 features**

In [13]:
# Calculate accuracy
calculate_accuracy(feature_list[:])

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`



Accuracy is  88.36


### **2. Accuracy of Brute Force over the PCA compressed Caltech101 features**

In [14]:
num_feature_dimensions = 100
pca = PCA(n_components=num_feature_dimensions)
pca.fit(feature_list)
feature_list_compressed = pca.transform(feature_list[:])
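
One way to judge how much information a 100-dimensional projection keeps is the fitted PCA object's `explained_variance_ratio_` attribute. The sketch below shows the check on random data standing in for the real features; `demo_features`, `demo_pca`, and `retained` are hypothetical names, and the printed value describes only the synthetic data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 500 samples with 256 dimensions, generated from
# 20 latent factors plus a small amount of noise
latent = rng.normal(size=(500, 20))
mixing = rng.normal(size=(20, 256))
demo_features = latent @ mixing + 0.01 * rng.normal(size=(500, 256))

demo_pca = PCA(n_components=100)
demo_compressed = demo_pca.fit_transform(demo_features)

# Sum of the per-component ratios = share of total variance that survives
retained = float(demo_pca.explained_variance_ratio_.sum())
print("Compressed shape:", demo_compressed.shape)
print("Fraction of variance retained:", round(retained, 4))
```

Running the same two lines on the notebook's own `pca` object after `pca.fit(feature_list)` would report the retained variance for the real features.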


Let's calculate accuracy over the compressed features.

In [15]:
calculate_accuracy(feature_list_compressed[:])


Accuracy is  88.5


### **3. Accuracy of Brute Force over the finetuned Caltech101 features**

In [16]:
# Use the features from the finetuned model
with open('data/filenames-caltech101.pickle', 'rb') as f:
    filenames = pickle.load(f)
with open('data/features-caltech101-resnet-finetuned.pickle', 'rb') as f:
    feature_list = pickle.load(f)
with open('data/class_ids-caltech101.pickle', 'rb') as f:
    class_ids = pickle.load(f)

In [17]:
num_images = len(filenames)
num_features_per_image = len(feature_list[0])
print("Number of images = ", num_images)
print("Number of features per image = ", num_features_per_image)

Number of images =  8677
Number of features per image =  101


In [18]:
# Calculate accuracy
calculate_accuracy(feature_list[:])


Accuracy is  95.53


### **4. Accuracy of Brute Force over the PCA compressed finetuned Caltech101 features**

In [19]:
# Perform PCA
num_feature_dimensions = 100
pca = PCA(n_components=num_feature_dimensions)
pca.fit(feature_list)
feature_list_compressed = pca.transform(feature_list[:])

In [20]:
# Calculate accuracy over the compressed features
calculate_accuracy(feature_list_compressed[:])


Accuracy is  95.54
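
The finetuned features have only 101 dimensions, so projecting them onto 100 components discards a single direction, which is consistent with the accuracy barely moving (95.53 versus 95.54). If the 101 values are class probabilities that sum to 1 (an assumption; the notebook does not say how they were produced), the data already lies in a 100-dimensional subspace and the projection loses essentially nothing. A minimal sketch of that effect on synthetic probability vectors, with hypothetical names throughout:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for softmax outputs (assumption): 300 rows of
# 101 probabilities, each row summing to 1
logits = rng.normal(size=(300, 101))
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)

pca_100 = PCA(n_components=100, svd_solver='full')
probs_compressed = pca_100.fit_transform(probs)

# Rows that sum to a constant have no variance along the all-ones
# direction, so 100 components can capture all of the variance
variance_kept = float(pca_100.explained_variance_ratio_.sum())
print("Fraction of variance retained:", variance_kept)

# Map back to 101 dimensions to confirm that nothing measurable was lost
reconstructed = pca_100.inverse_transform(probs_compressed)
max_reconstruction_error = float(np.abs(reconstructed - probs).max())
print("Largest reconstruction error:", max_reconstruction_error)
```

Because a lossless orthogonal projection preserves Euclidean distances, the nearest-neighbor rankings would be unchanged in that case, apart from floating-point effects.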
