## Index

* [Introduction](#intro)
* [Preparation](#preparation)
* [Reverse Image Search](#reverse-image-search)
    * [Configuration](#configuration)
    * [Embedding pipeline](#embedding-pipeline)
    * [Steps](#steps)
        * [1. Create Milvus collection](#step1)
        * [2. Insert data](#step2)


# Reverse Image Search powered by Towhee & Milvus <a class="anchor" id="intro"></a>

Reverse image search takes an image as input and retrieves the most similar images based on its content. The basic idea behind semantic image search is to represent each image as an embedding of features extracted by a pretrained deep learning model. Image retrieval can then be performed by storing and comparing image embeddings.
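
To make the idea of comparing embeddings concrete, here is a minimal sketch in plain Python. The three-dimensional vectors, the file names, and the helpers `l2_distance` and `top_k` are all made up for illustration. Real embeddings come from a pretrained model and have far more dimensions, and a vector database such as Milvus does the ranking at scale.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, candidates, k):
    """Return the names of the k candidates closest to the query embedding."""
    ranked = sorted(candidates, key=lambda name: l2_distance(query, candidates[name]))
    return ranked[:k]

# Toy "embeddings" (hypothetical values, not produced by any model).
candidates = {
    'cat_1.jpg': [0.9, 0.1, 0.0],
    'cat_2.jpg': [0.8, 0.2, 0.1],
    'car_1.jpg': [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

nearest = top_k(query, candidates, k=2)
print(nearest)
```

A smaller L2 distance means a closer match, so the two cat images rank ahead of the car image for this query.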

This notebook illustrates how to build a reverse image search engine from scratch using [Towhee](https://towhee.io/) and [Milvus](https://milvus.io/). We will walk through each procedure with example data. With this tutorial, you will learn how to build and evaluate a reverse image search system.

<img src="https://github.com/towhee-io/examples/raw/main/image/reverse_image_search/workflow.png" width = "60%" height = "60%" align=center />

## Preparation <a class="anchor" id="preparation"></a>

Before building the image search engine, we need to install some Python packages, download the example data, and start the Milvus service.

**Install dependencies**

First, install the dependencies: towhee, opencv-python, and pillow. Choose versions that suit your environment.

| package | version |
| -- | -- |
| towhee | 1.1.0 |
| opencv-python | |
| pillow | |

In [1]:
! python -m pip install -q towhee opencv-python pillow

**Prepare data**

Here we use a subset of the [ImageNet](https://www.image-net.org/) dataset (100 classes). The example data is available on [GitHub](https://github.com/towhee-io/examples/releases/download/data/reverse_image_search.zip). Download and unzip it into the working directory before running the cells below. The example data is organized as follows:

- train: directory of candidate images, 10 images per class from ImageNet train data
- test: directory of query images, 1 image per class from ImageNet test data
- reverse_image_search.csv: a csv file containing *id, path, and label* for each candidate image
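
As a quick sanity check on this layout, the sketch below counts candidate images per label. It assumes the CSV has a header row followed by rows of `id, path, label`. The helper `count_images_per_label` and the sample rows are hypothetical and stand in for the real file.

```python
import csv
import io
from collections import Counter

def count_images_per_label(csv_file):
    """Count candidate images per label from rows of (id, path, label)."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row (assumed to be present)
    return Counter(row[2] for row in reader)

# Hypothetical sample rows standing in for reverse_image_search.csv.
sample = io.StringIO(
    "id,path,label\n"
    "0,./train/goldfish/a.JPEG,goldfish\n"
    "1,./train/goldfish/b.JPEG,goldfish\n"
    "2,./train/tabby/c.JPEG,tabby\n"
)
counts = count_images_per_label(sample)
print(counts)
```

With the real data you would pass `open('reverse_image_search.csv')` instead of the sample. Given the layout described above, each label should then appear 10 times.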

In [3]:
! python -m pip install -q pymilvus==2.3.1


## Reverse Image Search <a class="anchor" id="reverse-image-search"></a>

In this section, we will learn how to build the image search engine using Towhee. Towhee is a framework that provides ETL for unstructured data using SoTA machine learning models. It lets you create data processing pipelines and ships with built-in operators for different purposes, such as generating image embeddings, inserting data into a Milvus collection, and querying a Milvus collection.


### Configuration <a class="anchor" id="configuration"></a>

For later use, we import packages and set parameters at the beginning. You can change the parameters to suit your needs and environment. Please note that the embedding dimension `DIM` must match the selected model name `MODEL`.

By default, this tutorial uses the pretrained model 'resnet50' to extract image embeddings. It sets ['IVF_FLAT'](https://milvus.io/docs/v2.0.x/index.md#IVF_FLAT) as the index and ['L2'](https://milvus.io/docs/v2.0.x/metric.md#Euclidean-distance-L2) as the distance metric for the Milvus configuration. `TOPK` determines how many search results are returned and is set to 5 below.

In [None]:
import csv
from glob import glob
from pathlib import Path
from statistics import mean

from towhee import pipe, ops, DataCollection
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility


# Towhee parameters
MODEL = 'resnet50'
DEVICE = None # if None, use default device (cuda is enabled if available)

# Milvus parameters
HOST = ''  # Add your Milvus gRPC host, without port and protocol
PORT = ''  # Add your Milvus gRPC port
SERVER_NAME = 'localhost'
USER = ''
# More information on how to create an IBM API key is available at
# https://www.ibm.com/docs/en/mas-cd/continuous-delivery?topic=cli-creating-your-cloud-api-key
PASSWORD = ''
TOPK = 5
DIM = 2048 # dimension of embedding extracted by MODEL
COLLECTION_NAME = 'reverse_image_search'
INDEX_TYPE = 'IVF_FLAT'
METRIC_TYPE = 'L2'
 
# path to csv (column_1 indicates image path) OR a pattern of image paths
INSERT_SRC = 'reverse_image_search.csv'
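
Because a mismatch between `DIM` and `MODEL` only surfaces later, at insert time, it can help to check the pair up front. The sketch below is one way to do that. The helper `check_dim` and the lookup table `KNOWN_DIMS` are hypothetical, and the table lists only a few ResNet variants with their pooled feature sizes. Extend it for any other model you select.

```python
# Pooled feature dimensions for a few ResNet variants (illustrative, not exhaustive).
KNOWN_DIMS = {
    'resnet18': 512,
    'resnet34': 512,
    'resnet50': 2048,
    'resnet101': 2048,
}

def check_dim(model_name, dim):
    """Raise if dim does not match the known embedding size of model_name."""
    expected = KNOWN_DIMS.get(model_name)
    if expected is None:
        raise KeyError(f'No known dimension for {model_name!r}; add it to KNOWN_DIMS')
    if expected != dim:
        raise ValueError(f'{model_name} produces {expected}-d embeddings, but DIM={dim}')
    return dim

# The defaults used in this notebook pass the check.
checked = check_dim('resnet50', 2048)
print(checked)
```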

### Embedding pipeline <a class="anchor" id="embedding-pipeline"></a>

As mentioned above, similarity search operates on vectors, so we need to convert each image into an embedding. To feed image paths into the image embedding operator, we use a generator function that streams image paths from either a glob pattern or a csv file. The embedding pipeline therefore generates image embeddings given a pattern or a csv of image path(s).

In [5]:
# Load image path
def load_image(x):
    if x.endswith('csv'):
        with open(x) as f:
            reader = csv.reader(f)
            next(reader)
            for item in reader:
                yield item[1]
    else:
        for item in glob(x):
            yield item
            
# Embedding pipeline
p_embed = (
    pipe.input('src')
        .flat_map('src', 'img_path', load_image)
        .map('img_path', 'img', ops.image_decode())
        .map('img', 'vec', ops.image_embedding.timm(model_name=MODEL, device=DEVICE))
)
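
To see what `load_image` yields for each kind of input, the sketch below runs it against a temporary directory, without Towhee or a model. The function body is copied from the cell above. The file names and CSV rows are made up for the demonstration.

```python
import csv
import os
import tempfile
from glob import glob

def load_image(x):
    """Yield image paths from a csv (second column) or from a glob pattern."""
    if x.endswith('csv'):
        with open(x) as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            for item in reader:
                yield item[1]
    else:
        for item in glob(x):
            yield item

with tempfile.TemporaryDirectory() as tmp:
    # Case 1: a csv whose second column holds the image path.
    csv_path = os.path.join(tmp, 'demo.csv')
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'path', 'label'])
        writer.writerow([0, './train/a.JPEG', 'goldfish'])
        writer.writerow([1, './train/b.JPEG', 'tabby'])
    paths_from_csv = list(load_image(csv_path))

    # Case 2: a glob pattern matching files on disk.
    for name in ('x.JPEG', 'y.JPEG'):
        open(os.path.join(tmp, name), 'w').close()
    pattern = os.path.join(tmp, '*.JPEG')
    paths_from_glob = sorted(os.path.basename(p) for p in load_image(pattern))

print(paths_from_csv)
print(paths_from_glob)
```

In the pipeline, `flat_map` is expected to turn each yielded path into its own row, so one input source fans out into one embedding per image.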

### Steps <a class="anchor" id="steps"></a>

With the work above done, we are ready to build and try the image search engine. The core procedure includes 2 steps:

1. create a Milvus collection
2. insert data into collection


#### 1. Create Milvus collection <a class="anchor" id="step1"></a>

Before inserting or searching data, we need a collection. This step creates a new collection using the configuration above. Please note that it will delete the collection first if it already exists.

In [6]:
# Connect to the Milvus service
try:
    connections.connect(host=HOST, port=PORT, secure=True, server_name=SERVER_NAME, user=USER, password=PASSWORD)
    print('Milvus Database connected successfully.')
except Exception as e:
    print(f'Error connecting to Milvus Database: {e}')

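
Connection failures are often caused by malformed parameters, for example an empty `HOST` or a host string that still carries a protocol or port. The sketch below checks for these before connecting. The helper `validate_connection_params` is hypothetical and not part of pymilvus. It covers only the simple cases named in the configuration comments, and it does not handle IPv6 literals.

```python
def validate_connection_params(host, port):
    """Return a list of problems found in the Milvus connection parameters."""
    problems = []
    if not host:
        problems.append('HOST is empty')
    elif '://' in host:
        problems.append('HOST should not include a protocol')
    elif ':' in host:
        problems.append('HOST should not include a port')
    if not str(port).isdigit():
        problems.append('PORT should be a number')
    return problems

# With the placeholder values from the configuration cell, both checks fail.
placeholder_problems = validate_connection_params('', '')
print(placeholder_problems)
```

An empty list means the parameters look plausible. It does not guarantee that the server is reachable or that the credentials are valid.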


In [25]:
# Create milvus collection (delete first if exists)
def create_milvus_collection(collection_name, dim):
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)
    
    fields = [
        FieldSchema(name='path', dtype=DataType.VARCHAR, description='path to image', max_length=500, 
                    is_primary=True, auto_id=False),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='image embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='reverse image search')
    collection = Collection(name=collection_name, schema=schema)

    index_params = {
        'metric_type': METRIC_TYPE,
        'index_type': INDEX_TYPE,
        'params': {"nlist": 2048}
    }
    collection.create_index(field_name='embedding', index_params=index_params)
    return collection
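
The `nlist` parameter sets how many clusters IVF_FLAT partitions the vectors into. A commonly cited rule of thumb, treated here as an assumption rather than a hard rule, is roughly `4 * sqrt(n)` for `n` stored vectors. The sketch below applies that heuristic. The helper `suggest_nlist` is hypothetical, and it clamps the result to the range 1 to 65536.

```python
import math

def suggest_nlist(num_entities, minimum=1, maximum=65536):
    """Heuristic nlist of about 4 * sqrt(n), clamped to [minimum, maximum]."""
    return max(minimum, min(maximum, round(4 * math.sqrt(num_entities))))

# The example data holds 100 classes x 10 candidate images = 1000 vectors.
nlist_for_example = suggest_nlist(1000)
print(nlist_for_example)
```

By this heuristic, the value 2048 used above is generous for a collection of about 1000 images. It may be worth experimenting with smaller values on a dataset of this size.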

Connect to Milvus with `HOST` & `PORT` and create collection with `COLLECTION_NAME` & `DIM`:

In [26]:
# Connect to the Milvus service and create a collection
try:
    connections.connect(host=HOST, port=PORT, secure=True, server_name=SERVER_NAME, user=USER, password=PASSWORD)
    print('Milvus Database connected successfully.')

    # Create collection
    collection = create_milvus_collection(COLLECTION_NAME, DIM)
    print(f'A new collection created: {COLLECTION_NAME}')

except Exception as e:
    print(f'Error connecting to Milvus Database: {e}')



#### 2. Insert data <a class="anchor" id="step2"></a>

This step uses an **insert pipeline** to insert image embeddings into the Milvus collection. The insert pipeline consists of the embedding pipeline followed by the Milvus insert operator.

In [27]:
# Insert pipeline

p_insert = (
    p_embed.map(('img_path', 'vec'), 'mr', ops.ann_insert.milvus_client(
                        host=HOST,
                        port=PORT,
                        user=USER, 
                        password=PASSWORD,
                        collection_name=COLLECTION_NAME
                        ))
              .output('mr')
    )


Insert all candidate images from `INSERT_SRC`:

In [28]:
# Insert data
p_insert(INSERT_SRC)


<towhee.runtime.data_queue.DataQueue at 0x33462d1f0>
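
After the insert finishes, you can confirm how many entities reached the collection. The sketch below relies only on `flush()` and `num_entities`, which a pymilvus `Collection` provides. The helper `count_entities` and the `FakeCollection` stand-in are hypothetical, and the stand-in exists only so the example runs without a Milvus server.

```python
def count_entities(collection):
    """Flush pending inserts, then return the number of stored entities."""
    collection.flush()
    return collection.num_entities

class FakeCollection:
    """Stand-in exposing flush() and num_entities, for running without a server."""

    def __init__(self, pending):
        self._pending = pending
        self._stored = 0

    def flush(self):
        # Move buffered rows into the stored count, as a flush would persist them.
        self._stored += self._pending
        self._pending = 0

    @property
    def num_entities(self):
        return self._stored

inserted = count_entities(FakeCollection(pending=1000))
print(inserted)
```

In the notebook you would call `count_entities(collection)` with the collection created in step 1. With the example data (100 classes, 10 candidate images each) the expected count is 1000.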