<table class="tfo-notebook-buttons" align="center">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/practicaldl/Practical-Deep-Learning-Book/blob/master/code/chapter-4/2-similarity-search-level-2.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/practicaldl/Practical-Deep-Learning-Book/blob/master/code/chapter-4/2-similarity-search-level-2.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

This code is part of [Chapter 4 - Building a Reverse Image Search Engine: Understanding Embeddings](https://learning.oreilly.com/library/view/practical-deep-learning/9781492034858/ch04.html).

Note: In order to run this notebook on Google Colab you need to [follow these instructions](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb#scrollTo=WzIRIt9d2huC) so that the local data such as the images are available in your Google Drive.

In [1]:
try:
    # Mount Google Drive
    from google.colab import drive

    drive.mount("/content/gdrive")

    IS_COLAB_ENV = True
except Exception:
    # Not running on Colab (or Google Drive could not be mounted)
    IS_COLAB_ENV = False
IS_COLAB_ENV

Mounted at /content/gdrive


True

# Similarity Search

## Level 2

We benchmark the algorithms based on the time it takes to index images and locate the most similar image based on its features using the Caltech-101 dataset. We also experiment with t-SNE and PCA.

### Understanding the time it takes to index images and locate the most similar image based on its features

For these experiments we will use previously extracted features of the Caltech101 dataset, which we load from pickle files below.

First, let's choose a random image to experiment with. We will be using the same image for all the following experiments. Note: the results may change if the image is changed.

In [2]:
if IS_COLAB_ENV:
    !mkdir -p ../../datasets
    !pip install gdown
    !gdown https://drive.google.com/uc?id=137RyRjvTBkBiIfeYBNZBtViDHQ6_Ewsp --output ../../datasets/caltech101.tar.gz
    !tar -xvzf ../../datasets/caltech101.tar.gz --directory ../../datasets
    !mv ../../datasets/101_ObjectCategories ../../datasets/caltech101
    !rm -rf ../../datasets/caltech101/BACKGROUND_Google

Streaming output truncated to the last 5000 lines.
101_ObjectCategories/chair/image_0005.jpg
101_ObjectCategories/chair/image_0006.jpg
101_ObjectCategories/chair/image_0007.jpg
101_ObjectCategories/chair/image_0008.jpg
101_ObjectCategories/chair/image_0010.jpg
101_ObjectCategories/chair/image_0011.jpg
101_ObjectCategories/chair/image_0012.jpg
101_ObjectCategories/chair/image_0013.jpg
101_ObjectCategories/chair/image_0014.jpg
101_ObjectCategories/chair/image_0016.jpg
101_ObjectCategories/chair/image_0017.jpg
101_ObjectCategories/chair/image_0018.jpg
101_ObjectCategories/chair/image_0019.jpg
101_ObjectCategories/chair/image_0020.jpg
101_ObjectCategories/chair/image_0022.jpg
101_ObjectCategories/chair/image_0023.jpg
101_ObjectCategories/chair/image_0024.jpg
101_ObjectCategories/chair/image_0025.jpg
101_ObjectCategories/chair/image_0026.jpg
101_ObjectCategories/chair/image_0028.jpg
101_ObjectCategories/chair/image_0029.jpg
101_ObjectCategories/chair/image_0030.jpg
101_ObjectC

In [3]:
import numpy as np
import pickle
from tqdm import tqdm, tqdm_notebook
import random
import time
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import PIL
from PIL import Image
from sklearn.neighbors import NearestNeighbors

import glob
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

%matplotlib inline

In [4]:
model_architecture = "resnet"
model_features_path = f"data/features-caltech101-{model_architecture}.pickle"
generator_classes_path = "data/class_ids-caltech101.pickle"
filenames_path = "data/filenames-caltech101.pickle"

if IS_COLAB_ENV:
    generator_classes_path = f"/content/gdrive/MyDrive/Practical-Deep-Learning-Book/code-outputs/chapter-4/{generator_classes_path}"
    filenames_path = f"/content/gdrive/MyDrive/Practical-Deep-Learning-Book/code-outputs/chapter-4/{filenames_path}"
    model_features_path = f"/content/gdrive/MyDrive/Practical-Deep-Learning-Book/code-outputs/chapter-4/{model_features_path}"

In [5]:
filenames = pickle.load(open(filenames_path, "rb"))
feature_list = pickle.load(open(model_features_path, "rb"))
class_ids = pickle.load(open(generator_classes_path, "rb"))

In [6]:
num_images = len(filenames)
num_features_per_image = len(feature_list[0])
print("Number of images = ", num_images)
print("Number of features per image = ", num_features_per_image)

Number of images =  8677
Number of features per image =  2048


In [7]:
# random.randint includes both endpoints, so the upper bound must be num_images - 1
random_image_index = random.randint(0, num_images - 1)
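
Because the results depend on which image is picked, it can help to make the choice repeatable across runs. The following is a minimal sketch, assuming a fixed seed; the seed value `42` and the helper name `pick_image_index` are illustrative and not part of the original notebook.

```python
import random


def pick_image_index(num_images, seed=42):
    """Return a reproducible index in the range [0, num_images - 1]."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    # randrange excludes the upper bound, so the index is always valid
    return rng.randrange(num_images)


index = pick_image_index(8677)
print(index)
```

Calling the helper again with the same seed returns the same index, so every timing experiment can be rerun on the same query image.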

### Standard features

The following experiments are based on the ResNet-50 features derived from the images of the Caltech101 dataset.

### Standard features + Brute Force Algorithm on one image

We will time the indexing step for several nearest neighbors algorithms, starting with the brute force algorithm. IPython's `%timeit` magic runs the statement repeatedly and discards the result, so we rerun the same statement afterwards to store the fitted model in a variable.

In [8]:
%timeit NearestNeighbors(n_neighbors=5, algorithm='brute', metric='euclidean').fit(feature_list)
neighbors = NearestNeighbors(n_neighbors=5, algorithm="brute", metric="euclidean").fit(
    feature_list
)

7.21 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
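
Outside IPython, or when the result and the timing are both wanted from one call, a small helper can do both. This is a sketch only: the name `timed` and the stand-in workload are assumptions, and it reports the best run rather than the mean and standard deviation that `%timeit` prints.

```python
import time


def timed(fn, *args, repeats=5, **kwargs):
    """Call fn repeatedly; return (last_result, best_seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in workload; in the notebook this could be a call that fits an index
total, seconds = timed(sum, range(100_000))
print(f"result={total}, best of 5 runs: {seconds * 1e3:.3f} ms")
```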


Now, let's look at the time it takes to search for the nearest neighbors for the selected random image using the trained model with the Brute force algorithm.

In [9]:
%timeit neighbors.kneighbors([feature_list[random_image_index]])

48.8 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
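
To make concrete what a brute-force search does, here is a minimal pure-Python sketch: it computes the Euclidean distance from the query to every indexed vector and keeps the k smallest. The function name `brute_force_knn` and the toy vectors are illustrative; scikit-learn's implementation is vectorized and much faster.

```python
import heapq
import math


def brute_force_knn(features, query, k=5):
    """Return the k (distance, index) pairs closest to query."""
    # One distance computation per indexed vector: O(n * d) work per query
    distances = ((math.dist(vec, query), i) for i, vec in enumerate(features))
    return heapq.nsmallest(k, distances)


toy_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
nearest = brute_force_knn(toy_features, [0.9, 0.1], k=2)
print(nearest)
```

There is no structure to build ahead of time, which is why indexing is cheap for brute force and the cost is paid at query time.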


###  Standard features + k-d Tree Algorithm  on one image

Now let's turn our attention to the next nearest neighbors algorithm, the k-d tree. Let's time the indexing for the k-d tree algorithm.

In [10]:
%timeit NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(feature_list)
neighbors = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(feature_list)

3.46 s ± 453 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Now, time the search for the same random image using the k-d tree trained model.

In [11]:
%timeit neighbors.kneighbors([feature_list[random_image_index]])

49.7 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


###  Standard features + Ball Tree Algorithm  on one image

Finally, it's time for our last nearest neighbors algorithm: the Ball Tree algorithm. As before, let's time how long it takes to build the index.

In [12]:
%timeit NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(feature_list)
neighbors = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(feature_list)

2.78 s ± 492 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


As before, let's time the search for the Ball Tree model.

In [13]:
%timeit neighbors.kneighbors([feature_list[random_image_index]])

29 ms ± 927 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Next, we increase the number of query images to see how the search time of each nearest neighbors algorithm scales. The code below samples a random set of 1000 images; the benchmark tables at the end also report results for 100 images.

Note: the results may change if any of the images are changed.

Generate a list of images to do the next set of experiments on.

In [14]:
random_image_indices = random.sample(range(0, num_images), 1000)
random_feature_list = [feature_list[each_index] for each_index in random_image_indices]

### Standard features + Brute Force Algorithm on a set of images

Time the search for the Brute force algorithm.

In [15]:
neighbors = NearestNeighbors(n_neighbors=5, algorithm="brute", metric="euclidean").fit(
    feature_list
)
%timeit neighbors.kneighbors(random_feature_list)

1.3 s ± 341 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Standard features +  k-d Tree Algorithm on a set of images

Time the search for the k-d tree algorithm.

In [16]:
neighbors = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(feature_list)
%timeit neighbors.kneighbors(random_feature_list)

40.4 s ± 89.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Standard features +  Ball Tree Algorithm on a set of images

Time the search for the Ball Tree algorithm.

In [17]:
neighbors = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(feature_list)
%timeit neighbors.kneighbors(random_feature_list)

28.4 s ± 43.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### PCA

We have now seen how long it takes to index and search with nearest neighbor algorithms on the full-length features. PCA lets us compress the features and reduce that time. As before, we first set the intended number of feature dimensions.

In [18]:
num_feature_dimensions = 100
num_feature_dimensions = min(num_images, num_feature_dimensions, len(feature_list[0]))

Train the PCA model with the number of desired feature dimensions.

In [19]:
pca = PCA(n_components=num_feature_dimensions)
pca.fit(feature_list)
feature_list_compressed = pca.transform(feature_list)
feature_list_compressed = feature_list_compressed.tolist()

Let's try to understand the importance of each of the resultant features. The numbers displayed below are the fraction of the total variance explained by each of the first 20 principal components.

In [20]:
print(pca.explained_variance_ratio_[0:20])

[0.0610047  0.04370117 0.04059478 0.032334   0.02126303 0.01967745
 0.01750107 0.01519594 0.01503152 0.01316031 0.01260053 0.01227114
 0.01133631 0.01058025 0.00960155 0.00940311 0.00868717 0.00850451
 0.00839056 0.00774207]
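
Summing these ratios shows how much of the total variance the leading components retain. The sketch below uses only the 20 values printed above; the variable names are illustrative.

```python
from itertools import accumulate

# Explained-variance ratios of the first 20 components, copied from the output above
ratios = [
    0.0610047, 0.04370117, 0.04059478, 0.032334, 0.02126303,
    0.01967745, 0.01750107, 0.01519594, 0.01503152, 0.01316031,
    0.01260053, 0.01227114, 0.01133631, 0.01058025, 0.00960155,
    0.00940311, 0.00868717, 0.00850451, 0.00839056, 0.00774207,
]

cumulative = list(accumulate(ratios))
for n in (1, 5, 10, 20):
    print(f"first {n:2d} components: {cumulative[n - 1]:.1%} of the variance")
```

For these values, the first 20 components together account for roughly 38% of the variance.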


Repeat the timing experiments. We use the same random image to experiment.
Note: the results may change if the image is changed.

### PCA + Brute Force Algorithm on one image

Let's time the indexing for the brute force algorithm.

In [21]:
%timeit NearestNeighbors(n_neighbors=5, algorithm='brute', metric='euclidean').fit(feature_list_compressed)
neighbors = NearestNeighbors(n_neighbors=5, algorithm="brute", metric="euclidean").fit(
    feature_list_compressed
)

81.8 ms ± 23.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


We will now time the search for the brute force algorithm.

In [22]:
%timeit neighbors.kneighbors([feature_list_compressed[random_image_index]])

1.7 ms ± 820 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


###  PCA + k-d Tree Algorithm  on one image

Time the indexing for the k-d tree algorithm.

In [23]:
%timeit NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(feature_list_compressed)
neighbors = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(
    feature_list_compressed
)

158 ms ± 45.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Time the search for the k-d tree algorithm.

In [24]:
%timeit neighbors.kneighbors([feature_list_compressed[random_image_index]])

967 µs ± 184 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


###  PCA + Ball Tree Algorithm  on one image

Time the indexing for the ball tree algorithm.

In [25]:
%timeit NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(feature_list_compressed)
neighbors = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(
    feature_list_compressed
)

138 ms ± 37.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Time the search for the ball tree algorithm.

In [26]:
%timeit neighbors.kneighbors([feature_list_compressed[random_image_index]])

1.45 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


We use the same random indices to experiment. Note: the results may change if any of the images are changed.

Generate a list of images to do the next set of experiments on.

In [27]:
random_feature_list_compressed = [
    feature_list_compressed[each_index] for each_index in random_image_indices
]

### PCA + Brute Force Algorithm on a set of images

Time the search for the brute force algorithm.

In [28]:
neighbors = NearestNeighbors(n_neighbors=5, algorithm="brute", metric="euclidean").fit(
    feature_list_compressed
)
%timeit neighbors.kneighbors(random_feature_list_compressed)

81.7 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### PCA + k-d Tree Algorithm on a set of images

Time the search for the k-d tree algorithm.

In [29]:
neighbors = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(
    feature_list_compressed
)
%timeit neighbors.kneighbors(random_feature_list_compressed)

1.37 s ± 210 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### PCA + Ball Tree Algorithm on a set of images

Time the search for the Ball Tree algorithm.

In [30]:
neighbors = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(
    feature_list_compressed
)
%timeit neighbors.kneighbors(random_feature_list_compressed)

1.19 s ± 121 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Annoy

Make sure you have `annoy` installed. You can install it using pip, by executing the command below.

In [31]:
!pip install annoy

Collecting annoy
  Downloading annoy-1.17.3.tar.gz (647 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.17.3-cp310-cp310-linux_x86_64.whl size=552450 sha256=777c00640c6e47f50b28e2ab891a083648943e00ef3a2c57ac9f1bffc454aaca
  Stored in directory: /root/.cache/pip/wheels/64/8a/da/f714bcf46c5efdcfcac0559e63370c21abe961c48e3992465a
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.17.3


In [32]:
from annoy import AnnoyIndex

In [33]:
# Time adding the items to the Annoy index
# (note: the timer stops before t.build, so tree construction is not included)
t = AnnoyIndex(2048, metric="angular")  # Length of item vector that will be indexed
starttime = time.time()
for i in range(num_images):
    feature = feature_list[i]
    t.add_item(i, feature)
endtime = time.time()
print(endtime - starttime)
t.build(40)  # 40 trees

annoy_index_path = "data/caltech101index.ann"
if IS_COLAB_ENV:
    annoy_index_path = f"/content/gdrive/MyDrive/Practical-Deep-Learning-Book/code-outputs/chapter-4/{annoy_index_path}"
t.save(annoy_index_path)

1.7003514766693115


True

### Annoy on one image

Load the saved annoy index and time the search for one image for Annoy.

In [34]:
u = AnnoyIndex(2048, metric="angular")
u.load(annoy_index_path)
%timeit u.get_nns_by_vector(feature_list[random_image_index], 5, include_distances=True)
indexes = u.get_nns_by_vector(
    feature_list[random_image_index], 5, include_distances=True
)

616 µs ± 5.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
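
Annoy's `angular` metric is documented as the Euclidean distance between the normalized vectors, which equals sqrt(2 · (1 − cos θ)). The pure-Python sketch below evaluates that formula and may help when interpreting the values returned with `include_distances=True`. The function name `angular_distance` is illustrative, and this is not Annoy's internal code.

```python
import math


def angular_distance(u, v):
    """sqrt(2 * (1 - cosine similarity)) between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to guard against tiny floating-point overshoot above 1.0
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


print(angular_distance([1.0, 0.0], [2.0, 0.0]))  # same direction
print(angular_distance([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

Under this formula, distances range from 0 for vectors pointing the same way to 2 for vectors pointing in opposite directions.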


Helper function to time the search for multiple images for Annoy. Perform the search for the same image multiple times to get an average value.


In [35]:
def calculate_annoy_time():
    for i in range(0, 100):
        indexes = u.get_nns_by_vector(
            feature_list[random_image_index], 5, include_distances=True
        )

### Annoy on a set of images

Time 100 repeated searches of the same image with Annoy, using the helper function above.

In [36]:
%time calculate_annoy_time()

CPU times: user 67.3 ms, sys: 2 µs, total: 67.3 ms
Wall time: 66.4 ms


### PCA + Annoy

Now, let's time the indexing for Annoy for the PCA generated features.

In [37]:
starttime = time.time()
# Length of item vector that will be indexed
t = AnnoyIndex(num_feature_dimensions, metric="angular")

for i in range(num_images):
    feature = feature_list_compressed[i]
    t.add_item(i, feature)
endtime = time.time()
print(endtime - starttime)
t.build(40)  # 40 trees

annoy_index_compressed_path = "data/caltech101index_compressed.ann"
if IS_COLAB_ENV:
    annoy_index_compressed_path = f"/content/gdrive/MyDrive/Practical-Deep-Learning-Book/code-outputs/chapter-4/{annoy_index_compressed_path}"
t.save(annoy_index_compressed_path)

0.026437759399414062


True

### PCA + Annoy for one image

Load the saved annoy index and time the search for one image for Annoy.

In [38]:
u = AnnoyIndex(num_feature_dimensions, metric="angular")
u.load(annoy_index_compressed_path)
%timeit u.get_nns_by_vector(feature_list_compressed[random_image_index], 5, include_distances=True)
indexes = u.get_nns_by_vector(
    feature_list_compressed[random_image_index], 5, include_distances=True
)

24.3 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Helper function to time the search for multiple images for Annoy. Perform the search for the same image multiple times to get an average value.


In [39]:
def calculate_annoy_time():
    for i in range(0, 100):
        indexes = u.get_nns_by_vector(
            feature_list_compressed[random_image_index], 5, include_distances=True
        )

### PCA + Annoy on a set of images

Time 100 repeated searches of the same image with Annoy, using the helper function above.

In [40]:
%time calculate_annoy_time()

CPU times: user 4.89 ms, sys: 0 ns, total: 4.89 ms
Wall time: 4.84 ms


### NMS Lib

Make sure you have `nmslib` installed. You can install it using pip, by executing the command below.

In [41]:
!pip install nmslib

Collecting nmslib
  Downloading nmslib-2.1.1.tar.gz (188 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11<2.6.2 (from nmslib)
  Using cached pybind11-2.6.1-py2.py3-none-any.whl (188 kB)
Building wheels for collected packages: nmslib
  Building wheel for nmslib (setup.py) ... done
  Created wheel for nmslib: filename=nmslib-2.1.1-cp310-cp310-linux_x86_64.whl size=13578646 sha256=344fce7f4b974ac1d3d53a6bd9bf06650ac41400af3e4100c3c962f72bcab3c7
  S

In [42]:
import nmslib

In [43]:
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(feature_list_compressed)
index.createIndex({"post": 2}, print_progress=True)

### NMS Lib on one image

In [44]:
# Query for the nearest neighbors of the selected random image
%timeit index.knnQuery(feature_list_compressed[random_image_index], k=5)
ids, distances = index.knnQuery(feature_list_compressed[random_image_index], k=5)

28.2 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


### NMS Lib on a set of images

In [45]:
# Get the nearest neighbors for all the datapoints
# using a pool of 16 threads to compute
%timeit index.knnQueryBatch(feature_list_compressed, k=5, num_threads=16)
neighbors = index.knnQueryBatch(feature_list_compressed, k=5, num_threads=16)

156 ms ± 39.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
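
`knnQueryBatch` returns one `(ids, distances)` pair per query. The sketch below shows one way to turn such a result into filenames. It uses a small hand-written stand-in for the batch result and hypothetical filenames so that it runs without `nmslib`. Because the query set here is the indexed data itself, each item is normally expected to appear among its own neighbors, so the sketch drops it.

```python
def neighbors_to_filenames(batch_result, filenames):
    """Map each query's neighbor ids to filenames, skipping the query itself."""
    matches = {}
    for query_id, (ids, _distances) in enumerate(batch_result):
        matches[filenames[query_id]] = [
            filenames[i] for i in ids if i != query_id
        ]
    return matches


# Hypothetical stand-in for index.knnQueryBatch(...) on three items with k=2
fake_result = [([0, 2], [0.0, 0.1]), ([1, 0], [0.0, 0.3]), ([2, 0], [0.0, 0.1])]
fake_filenames = ["a.jpg", "b.jpg", "c.jpg"]
print(neighbors_to_filenames(fake_result, fake_filenames))
```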


### Falconn

Make sure you have `falconn` installed. You can install it using pip, by executing the command below.

In [46]:
!pip install falconn

Collecting falconn
  Downloading FALCONN-1.3.1.tar.gz (1.4 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: falconn
  Building wheel for falconn (setup.py) ... done
  Created wheel for falconn: filename=FALCONN-1.3.1-cp310-cp310-linux_x86_64.whl size=14919686 sha256=a1da7b4d48284d661698ca9bcd16a50d5ee6b53c849c5018c5e581b8ff18a90e
  Stored in directory: /root/.cache/pip/wheels/0b/4a/bc/68ac1e3cd3f263c47dfde8586fc3fdf704014ee3db0e5eb651
Successfully built falconn
Installing collected packages: falconn
Successfully installed falconn-1.3.1


In [47]:
import falconn

In [48]:
# Set up the parameters for Falconn
parameters = falconn.LSHConstructionParameters()
num_tables = 1
parameters.l = num_tables
parameters.dimension = num_feature_dimensions
parameters.distance_function = falconn.DistanceFunction.EuclideanSquared
parameters.lsh_family = falconn.LSHFamily.CrossPolytope
parameters.num_rotations = 1
parameters.num_setup_threads = 1
parameters.storage_hash_table = falconn.StorageHashTable.BitPackedFlatHashTable

# Set the number of hash functions so that each table uses 16 hash bits
falconn.compute_number_of_hash_functions(16, parameters)

### Falconn on a set of images

In [49]:
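dataset = np.array(feature_list_compressed)
# Note: the compressed features are overwritten here with random unit-norm
# vectors of the same shape, so the timings below are measured on random data.
a = np.random.randn(8677, 100)
a /= np.linalg.norm(a, axis=1).reshape(-1, 1)
dataset = a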

index = falconn.LSHIndex(parameters)
%time index.setup(dataset)

query_object = index.construct_query_object()
num_probes = 1
query_object.set_num_probes(num_probes)

searchQuery = np.array(feature_list_compressed[random_image_index])
# Note: the query is replaced with the first random vector from `a`.
searchQuery = a[0]
%timeit query_object.find_k_nearest_neighbors(searchQuery, 5)

CPU times: user 8.78 ms, sys: 0 ns, total: 8.78 ms
Wall time: 9.41 ms
3.43 µs ± 51.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


### Some benchmarks on different algorithms to see relative speeds

The tables below summarize indexing and search times for each algorithm on Caltech101. Repeating the Level 2 experiments on the Caltech256 features gives a second benchmark.

Benchmarking the different models on Caltech101. (Values are approximate.)

| Algorithm | Number of features indexed | Time to search 1 image (ms) | Time to search 100 images (ms) | Time to search 1000 images (ms) | Time to index (ms) |
|---|---|---|---|---|---|
| Brute force | 2048 | 14 | 38 | 240 | 22 |
| k-d tree | 2048 | 16 | 2270 | 24100 | 1020 |
| Ball tree | 2048 | 15 | 1690 | 17000 | 1090 |
| PCA + brute force | 100 | 1 | 13 | 135 | 0.334 |
| PCA + k-d tree | 100 | 1 | 77 | 801 | 20 |
| PCA + ball tree | 100 | 1 | 80 | 761 | 23 |
| Annoy | 2048 | 0.16 | 40 | 146 | 1420 |
| PCA + Annoy | 100 | **0.008** | **2.3** | **20.3** | 109 |
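
The relative speeds are easier to compare as ratios. The short sketch below uses the 1000-image search times from the Caltech101 table; the dictionary simply restates those numbers, and the variable names are illustrative.

```python
# Time to search 1000 images (ms), copied from the Caltech101 table
search_1000_ms = {
    "Brute force": 240,
    "k-d tree": 24100,
    "Ball tree": 17000,
    "PCA + brute force": 135,
    "PCA + k-d tree": 801,
    "PCA + ball tree": 761,
    "Annoy": 146,
    "PCA + Annoy": 20.3,
}

fastest = min(search_1000_ms, key=search_1000_ms.get)
ratios_to_fastest = {
    name: ms / search_1000_ms[fastest] for name, ms in search_1000_ms.items()
}
for name, factor in sorted(ratios_to_fastest.items(), key=lambda item: item[1]):
    print(f"{name:18s} {factor:8.1f}x the time of {fastest}")
```

On these numbers, brute force on the full features takes about 12 times as long as PCA + Annoy, and the k-d tree on the full features takes over a thousand times as long.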


Benchmarking the different models on Caltech256. (Values are approximate.)


| Algorithm | Number of features indexed | Time to search 1 image (ms) | Time to search 100 images (ms) | Time to search 1000 images (ms) | Time to index (ms) |
|---|---|---|---|---|---|
| Brute force | 2048 | 16 | 135 | 747 | 23 |
| k-d tree | 2048 | 15 | 7400 | 73000 | 4580 |
| Ball tree | 2048 | 15 | 5940 | 59700 | 4750 |
| PCA + brute force | 100 | 6.42 | 43.8 | 398 | 1.06 |
| PCA + k-d tree | 100 | 6.46 | 530 | 5200 | 89.6 |
| PCA + ball tree | 100 | 6.43 | 601 | 6000 | 104 |
| Annoy | 2048 | 0.156 | 41.6 | 166 | 4642 |
| PCA + Annoy | 100 | **0.0076** | **2.68** | **23.8** | 296 |