# Search Similar Images

Given an example image, can we find similar images without any labels? With Jina, we can build a search for
similar images without using labels or any textual information about the images.

In this tutorial we are going to create an image search system that retrieves similar images. We are going to
use the test split of the [Dogs vs. Cats](https://www.kaggle.com/c/dogs-vs-cats/data?select=test1.zip) dataset, which we
will subsequently refer to as the pets dataset. It contains 12.5K images of cats and dogs. We can now define our
problem as selecting an image of a cat or a dog and getting back images of similar cats or dogs, respectively.

Jina searches semantically, and the results will vary depending on the neural network that we use for image encoding. Our
task is to search for similar images so we will consider visually-similar images as semantically-related.

## Build the Flow

The solution uses a simple pipeline that can be subdivided into two steps: **Index** and **Query**.

### Index

To search the full dataset, we first need to index the data. This means that we store the embeddings
of all the images from the dataset in some form of storage. Each image is read as a numpy array, which is then
fed to the neural network of our choice. This neural network encodes each input image into a vector in a latent space; we call
these vectors "embeddings". We then use an **Indexer** to store these embeddings so they can be searched later.

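The indexing step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `toy_encode` is a hypothetical stand-in for the neural network (it averages each color channel), and `build_index` is a hypothetical helper that keeps the embeddings in a dictionary; neither is part of Jina.

```python
from typing import Callable, Dict, List

# An image as nested lists: height x width x channels.
Image = List[List[List[float]]]


def toy_encode(image: Image) -> List[float]:
    """Hypothetical stand-in for the neural network: mean value per channel."""
    pixels = [pixel for row in image for pixel in row]
    n_channels = len(pixels[0])
    return [sum(p[c] for p in pixels) / len(pixels) for c in range(n_channels)]


def build_index(
    images: Dict[str, Image], encode: Callable[[Image], List[float]]
) -> Dict[str, List[float]]:
    """Encode every image once and keep the embeddings, keyed by image id."""
    return {image_id: encode(image) for image_id, image in images.items()}


images = {
    "red": [[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]],   # 1x2 image, pure red
    "gray": [[[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]],  # 1x2 image, mid gray
}
index = build_index(images, toy_encode)
print(index["red"])  # [1.0, 0.0, 0.0]
```

In the tutorial itself the encoder is a ResNet101 and the storage is handled by an Indexer Executor, but the data flow is the same: encode once, store, search later.
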
### Query

Once the data is indexed, i.e. our database is built, we simply need to feed our query (an image or set of
images) to the model to encode it into embeddings and then use the **Indexer** to retrieve matching images. The matching
can be based on any distance metric. Without going deeper into this, we will use the cosine distance between
two embeddings (corresponding to two images), which is the default metric of the `match` function of `DocumentArray`.
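
To make the notion of a distance between two embeddings concrete, here is a minimal sketch in plain Python that computes the Euclidean and the cosine distance. The helper names and the toy two-dimensional vectors are assumptions made for illustration; real embeddings have many more dimensions.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # same direction as the query, larger magnitude
orthogonal = [0.0, 1.0]      # perpendicular to the query

print(euclidean_distance(query, same_direction))  # 2.0
print(cosine_distance(query, same_direction))     # 0.0
print(euclidean_distance(query, orthogonal))
print(cosine_distance(query, orthogonal))         # 1.0
```

The two metrics can rank candidates differently: by Euclidean distance the orthogonal vector is closer to the query, while by cosine distance the vector pointing in the same direction is closer, because cosine distance ignores magnitude.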

We will use the **SimpleIndexer** Executor as
our indexer (the one that stores and retrieves data). This Executor also returns the matching `Document`s when we make
a query. The search part is done using the built-in `match` function of `DocumentArray`. To encode the images into
embeddings we will use our own Executor which uses the pre-trained `ResNet101` model.


## Insights

Our first task is to get the image data and wrap it as `Document`s in a `DocumentArray`. Once the dataset is
downloaded, this takes a single call: `from_files` iterates over the image paths matching a glob pattern and creates one `Document` per file.

First, set up your Kaggle account and download your API credentials to `$HOME/.kaggle/kaggle.json`. You can follow this [tutorial](https://www.kaggle.com/docs/api) for more details.

We are going to use the [Dogs vs. Cats](https://www.kaggle.com/c/dogs-vs-cats/data?select=test1.zip) dataset, so please accept the competition rules on the Kaggle website first to be able to download the dataset.

In [None]:
!kaggle competitions download -c dogs-vs-cats --force
!unzip -o dogs-vs-cats.zip
!unzip -o train.zip
!unzip -o test1.zip

In [None]:
from docarray import Document, DocumentArray

image_format = "jpg"
docs_array = DocumentArray.from_files(f"test1/*.{image_format}")

Once the images are wrapped as `Document`s, our next step is to encode them into embeddings. You can use
Executors from [Jina Hub](https://hub.jina.ai) off-the-shelf or you can define an Executor of your own in
just a few steps. For this tutorial we will write our own Executor:

In [None]:
import torch
import torch.nn as nn
from jina import Executor, requests
from docarray import Document, DocumentArray
from torchvision import models


class ImageEncoder(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
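        self._embedder = models.resnet101(pretrained=True)
        # Replace the final fully connected layer (`fc`) so that the model outputs
        # the embedding vector instead of the classification logits.
        self._embedder.fc = nn.Identity()
        self._embedder.eval()  # use inference behavior for the batch norm layers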

    def _uri_to_torch_tensor(self, doc: Document):
        return (
            doc.load_uri_to_image_tensor()
            .set_image_tensor_shape(shape=(224, 224))
            .set_image_tensor_normalization()
            .set_image_tensor_channel_axis(-1, 0)
        )
        

    @requests
    @torch.inference_mode()
    def predict(self, docs: DocumentArray, **kwargs):
        docs.apply(lambda d: self._uri_to_torch_tensor(d))  # load each image from its URI and preprocess it
        embeds = self._embedder(torch.from_numpy(docs.tensors))  # embed the batch with ResNet101
        docs.embeddings = embeds  # store the embeddings in the Documents
        del docs[:, 'tensor']  # drop the image tensors; only the embeddings are needed for indexing

To build an encoder Executor we inherit from the base `Executor` class and use the `requests` decorator
to define endpoints. As this `requests` decorator is used without an `on` argument, the function is called regardless of the
endpoint invoked, i.e., on both the `/index` and `/search` endpoints. We
leverage [torchvision](https://pytorch.org/vision/stable/index.html) to use the pre-trained `ResNet101` model for
getting the embeddings. You can replace this model with any other pre-trained model of your choice. When this
Executor is instantiated, the pre-trained weights are downloaded automatically.
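
The preprocessing in `_uri_to_torch_tensor` can be illustrated with a small plain-Python sketch. It assumes that the normalization step applies the standard ImageNet channel statistics expected by torchvision's pre-trained models; the helper names `normalize_pixel` and `hwc_to_chw` are hypothetical and only mirror the idea, not docarray's implementation.

```python
# Standard ImageNet channel statistics used with torchvision's pre-trained models.
# Assumption: the normalization step in the Executor uses these same values.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale a uint8 RGB pixel to [0, 1], then standardize each channel."""
    return [
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    ]


def hwc_to_chw(image):
    """Move the channel axis from last to first: (H, W, C) -> (C, H, W)."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


image = [[[255, 0, 0], [0, 255, 0]]]  # 1x2 RGB image: a red and a green pixel
normalized = [[normalize_pixel(pixel) for pixel in row] for row in image]
chw = hwc_to_chw(normalized)
print(len(chw), len(chw[0]), len(chw[0][0]))  # 3 1 2
```

The channel-first layout is what PyTorch convolutional models expect, which is why the Executor moves the channel axis from the last position to the first before calling the network.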

Finally comes the storage/retrieval step. We do this with the **Indexer** Executor. You can use any of the
available indexers on [Jina Hub](https://hub.jina.ai) or define your own. An **Indexer** needs two
endpoints: `/index` and `/search`. For this tutorial we will define a `SimpleIndexer`, which is [also available on Jina
Hub](https://hub.jina.ai/executor/zb38xlt4).

In [None]:
from jina import Executor, requests
from docarray import DocumentArray

class SimpleIndexer(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._index = DocumentArray(
            storage='sqlite',
            config={'connection': 'index.db','table_name':'image_to_image'},
        )

    @requests(on='/index')
    def index(self, docs: DocumentArray, **kwargs):
        self._index.extend(docs)

    @requests(on='/search')
    def search(self, docs: DocumentArray, **kwargs):
        docs.match(self._index)

`SimpleIndexer` stores all the Documents on disk with the [SQLite backend](https://docarray.jina.ai/advanced/document-store/sqlite/?highlight=sqlite) when invoked with the `/index` endpoint. During the search
Flow, it matches the query `Document`s against the indexed `Document`s using the built-in `match` function
of `DocumentArray`.
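
Conceptually, matching is a nearest-neighbor search over the stored embeddings. The following plain-Python sketch shows a brute-force version under stated assumptions: the `match` function here is a hypothetical helper written for illustration, not docarray's implementation, and the toy index holds made-up two-dimensional embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def match(
    query: Sequence[float], index: Dict[str, List[float]], limit: int = 3
) -> List[Tuple[str, float]]:
    """Score the query against every indexed embedding and keep the closest ones."""
    scored = [(doc_id, cosine_distance(query, emb)) for doc_id, emb in index.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:limit]


# Made-up embeddings for illustration only.
index = {
    "cat_1": [0.9, 0.1],
    "cat_2": [0.8, 0.3],
    "dog_1": [0.1, 0.9],
}
matches = match([1.0, 0.0], index, limit=2)
print([doc_id for doc_id, _ in matches])  # ['cat_1', 'cat_2']
```

Each query is compared against every indexed embedding, so the cost grows linearly with the size of the index. That is acceptable for a few thousand images.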

## Putting it all together in a Flow

We define a single Flow for this tutorial. Its Executors handle requests to `/index` and `/search` differently by
binding functions to different endpoints with the `requests` decorator. The Flow consists of an `Encoder` that encodes
the images as the first step, followed by an `Indexer` that stores and retrieves data.

So far we have seen the individual components of the Flow and how to define them. Next, we put them together in a Flow:

In [None]:
from jina import Flow

f = (
    Flow(cors=True, port=12345, protocol="http")
    .add(uses=ImageEncoder, name="Encoder")
    .add(uses=SimpleIndexer, name="Indexer")
)

### Start the Flow and Index data

Here we only index 1000 images. If you want to index more, consider using a GPU (see [this section](https://docs.jina.ai/how-to/gpu-executor/?highlight=gpu) to learn how to use an Executor with a GPU).

In [None]:
with f:
    f.post("/index", inputs=docs_array.shuffle()[0:1000], show_progress=True)

### Query from Python

Keeping the server running, we can start a simple client to make a query:

In [None]:
from jina import Client, Document
from docarray import DocumentArray


def print_matches(resp):  # the callback function invoked when task is done
    resp.docs.plot_image_sprites()
    for doc in resp.docs:
        for idx, d in enumerate(doc.matches[:3]):  # print top-3 matches
            print(f'[{idx}] {d.scores["cosine"].value:.2f}')

        DocumentArray(doc.matches[:3]).plot_image_sprites()

with f:
    c = Client(protocol="http", port=12345)  # connect to localhost:12345
    c.post("/search", inputs=docs_array[0:2], on_done=print_matches)

## Results

The returned response contains the matching `Document`s, which in turn contain the `uri` of the images. Below we can see the
returned matching images of the query as well as the cosine distance score (a lower score means a more similar image):

