# Jina Workshop @ TUM.ai: Building a Neural Image Search Engine

In this workshop we will build a neural search engine for Pokémon images.

# Downloading data and model

Skip this section if you have already downloaded the data and the model.

## Download and Extract Data

For this example we're using Pokémon sprites from [veekun.com](https://veekun.com/dex/downloads). To download them, run:

```sh
sh ./get_data.sh
```

## Download and Extract Pretrained Model

In this example we use the [BiT (Big Transfer) model](https://github.com/google-research/big_transfer). To download it, run:

```sh
sh ./download.sh
```

# Code

Required imports

In [2]:
import os
import sys
from shutil import rmtree

from jina.flow import Flow
from components import *

Some configuration options:

- the maximum number of documents to index
- the path to the images

In [3]:
num_docs = int(os.environ.get('JINA_MAX_DOCS', 50000))
image_src = 'data/**/*.png'
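
To preview which files the pattern will pick up before indexing, the following sketch uses only the standard library. The helper name `list_images` is hypothetical and not part of Jina; it simply mirrors the two settings above (a recursive glob pattern and a cap on the number of documents).

```python
import glob
from itertools import islice


def list_images(pattern, limit):
    """Return at most `limit` paths matching a recursive glob pattern."""
    # recursive=True lets '**' match nested sub-directories
    matches = sorted(glob.iglob(pattern, recursive=True))
    return list(islice(matches, limit))


if __name__ == '__main__':
    paths = list_images('data/**/*.png', 5)
    print(f'{len(paths)} image(s) found, e.g. {paths[:2]}')
```

If this prints zero images, the download step above probably has not been run yet.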

Environment variables

- workspace (folder where the encoded data will be stored)
- port we will listen on

In [4]:
workspace = './workspace'
os.environ['JINA_WORKSPACE'] = workspace
os.environ['JINA_PORT'] = os.environ.get('JINA_PORT', str(45678))

We need to make sure we do not index on top of an existing workspace.

Doing so can cause problems if the configuration options differ between the two runs.

In [10]:
if os.path.exists(workspace):
    print(f'Workspace at {workspace} exists. Will delete')
    rmtree(workspace)

# Flows

The Flow is the main pipeline in Jina. It describes how data is loaded, processed, and stored within the system.

It is made up of components (called Pods), each of which performs one specific task.

For example, an Encoder Pod loads the model and *encodes* the data; other Pods act as crafters, segmenters, and so on.
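
As a mental model only, a Flow can be pictured as a chain of steps that each document passes through in order. The sketch below is plain Python, not Jina's API: the names `craft`, `encode`, `index` and `run_flow` are hypothetical stand-ins for a crafter Pod, an encoder Pod and an indexer Pod.

```python
def craft(doc):
    # Stand-in for a crafter Pod: scale raw pixel values to [0, 1]
    return {**doc, 'pixels': [p / 255 for p in doc['pixels']]}


def encode(doc):
    # Stand-in for an encoder Pod: turn the input into a (tiny) vector
    pixels = doc['pixels']
    return {**doc, 'embedding': [sum(pixels) / len(pixels), max(pixels)]}


def index(doc, storage):
    # Stand-in for an indexer Pod: store the embedding under the doc id
    storage[doc['id']] = doc['embedding']
    return doc


def run_flow(docs):
    # Pass every document through craft -> encode -> index, in that order
    storage = {}
    for doc in docs:
        index(encode(craft(doc)), storage)
    return storage


if __name__ == '__main__':
    print(run_flow([{'id': 'pikachu', 'pixels': [0, 255, 255, 0]}]))
```

In the real Flow each of these steps runs as its own Pod and the wiring between them comes from the YAML configuration rather than from nested function calls.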

## Index Flow

Depending on your needs, the Flow can be configured in different ways.

While indexing (storing) data, we can optimize the pipeline to process the data in parallel.

In [30]:
f = Flow.load_config('flows/index.yml')

In [31]:
f.plot('index.png')

The Flow is a context manager (like a file handle), so we use it inside a `with` block.

We load data into the pipeline from the directory we provided above.

`request_size` dictates how many images are sent in one request (roughly, the batch size).

In [16]:
with f:
    f.index_files(image_src, request_size=64, read_mode='rb', size=num_docs)

        crafter@120786[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
        crafter@120786[I]:input tcp://0.0.0.0:46857 (SUB_CONNECT) output tcp://0.0.0.0:44729 (PUSH_CONNECT) control over tcp://0.0.0.0:58003 (PAIR_BIND)
        encoder@120795[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
        encoder@120795[I]:input tcp://0.0.0.0:44729 (PULL_BIND) output tcp://0.0.0.0:35963 (PUSH_CONNECT) control over tcp://0.0.0.0:57231 (PAIR_BIND)
        vec_idx@120804[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
        vec_idx@120804[I]:input tcp://0.0.0.0:35963 (PULL_BIND) output tcp://0.0.0.0:44061 (PUSH_CONNECT) control over tcp://0.0.0.0:44019 (PAIR_BIND)
BigTransferEncoder@120795[I]:post_init may take some time...
        doc_idx@120816[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
    ImageReader@120786[I]:post_init may take some time...
        doc_idx@120816[I]:inp
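
To make the effect of `request_size` concrete, here is a minimal sketch of batching in plain Python. The `batched` helper is hypothetical and only illustrates the idea; how Jina splits requests internally may differ.

```python
from itertools import islice


def batched(items, request_size):
    """Yield successive lists of at most `request_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, request_size))
        if not batch:
            return
        yield batch


if __name__ == '__main__':
    # 150 documents with request_size=64 give two full batches and a remainder
    sizes = [len(batch) for batch in batched(range(150), 64)]
    print(sizes)  # [64, 64, 22]
```

A larger `request_size` means fewer, bigger requests, which usually suits an encoder that processes a whole batch at once, at the cost of more memory per request.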

# Searching

When searching, we need to make sure the data is processed serially.

In [32]:
f = Flow.load_config('flows/query.yml')

In [33]:
f.plot('search.png')

Running the Flow below activates the REST API.

You can use [Jinabox.js](https://jina.ai/jinabox.js/) to find the Pokémon that matches most closely. Set the endpoint to `http://127.0.0.1:45678/api/search`, then drag an image from the thumbnails on the left or from your file manager.

In [21]:
with f:
    f.block()

        crafter@121602[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
        crafter@121602[I]:input tcp://0.0.0.0:50975 (PULL_BIND) output tcp://0.0.0.0:60921 (PUSH_CONNECT) control over tcp://0.0.0.0:33117 (PAIR_BIND)
        encoder@121611[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
        encoder@121611[I]:input tcp://0.0.0.0:60921 (PULL_BIND) output tcp://0.0.0.0:47887 (PUSH_CONNECT) control over tcp://0.0.0.0:34797 (PAIR_BIND)
   vec_idx/tail@121620[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
   vec_idx/tail@121620[I]:input tcp://0.0.0.0:50897 (PULL_BIND) output tcp://0.0.0.0:51633 (PUSH_CONNECT) control over tcp://0.0.0.0:33365 (PAIR_BIND)
    ImageReader@121602[I]:post_init may take some time...
BigTransferEncoder@121611[I]:post_init may take some time...
        vec_idx@121629[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
        vec_idx@121629[I]:input
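
If you prefer to query the endpoint from a script instead of Jinabox.js, something like the following sketch may work. The JSON body (`top_k`, `mode`, and a list of data URIs under `data`) is an assumption about what this Jina version's REST gateway accepts, and `build_search_request` is a hypothetical helper, so check both against your installation before relying on them.

```python
import base64
import json
import urllib.request

ENDPOINT = 'http://127.0.0.1:45678/api/search'  # endpoint given above


def build_search_request(image_bytes, top_k=10, endpoint=ENDPOINT):
    """Wrap raw PNG bytes in a POST request for the search endpoint."""
    # ASSUMPTION: the gateway accepts 'top_k', 'mode' and a list of data URIs
    # under 'data'; verify this against your Jina version.
    encoded = base64.b64encode(image_bytes).decode('ascii')
    body = json.dumps({
        'top_k': top_k,
        'mode': 'search',
        'data': ['data:image/png;base64,' + encoded],
    })
    return urllib.request.Request(
        endpoint,
        data=body.encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


if __name__ == '__main__':
    request = build_search_request(b'placeholder, not a real PNG', top_k=5)
    print(request.get_method(), request.full_url)
    # With the query Flow running, the request could be sent like this:
    # with urllib.request.urlopen(request) as response:
    #     print(json.load(response))
```

Building the request is kept separate from sending it, so the payload can be inspected without a running Flow.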

# Advanced Topics


**NOTE**: After configuring these, you will need to re-index your data and search again. 

## 1. Changing Encoders

We can switch the `Encoder` easily.

The Encoder is the component that wraps the actual **model**. It encodes each image as a vector, and vectors can then be compared using cosine similarity (or other linear algebra operations).


`pods/encode.yml`:

```yaml
!ImageKerasEncoder
with:
  model_name: ResNet50V2 # any model could go here
  pool_strategy: avg
  channel_axis: -1
```
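
To illustrate what comparing encoded images by cosine similarity means, here is a minimal pure-Python sketch. The three-dimensional vectors and the names `cosine_similarity` and `top_k` are made up for illustration; real embeddings have many more dimensions, and Jina's vector indexer performs this step for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


if __name__ == '__main__':
    # Toy "embeddings" keyed by image name
    index = {
        'pikachu': [0.9, 0.1, 0.0],
        'raichu': [0.8, 0.3, 0.1],
        'onix': [0.0, 0.2, 0.9],
    }
    for name, score in top_k([1.0, 0.1, 0.0], index):
        print(f'{name}: {score:.3f}')
```

Swapping the encoder changes the vectors, and therefore the ranking, which is why the data has to be re-indexed afterwards.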

## 2. Changing Crafters

These are the components that transform your data. In this case, we crop and resize the image. You can try out other alterations to the images and see if you get better results.

In `pods/craft.yml`, remove `target_size: 96` from `ImageNormalizer`, then add a `CenterImageCropper` component:

```yaml
- !CenterImageCropper
  with:
    target_size: 96
    channel_axis: -1
  metas:
    name: img_cropper
```

We also need to specify the request paths, both for `IndexRequest` and for `SearchRequest`:

```yaml
      - !CraftDriver
        with:
          traversal_paths: ['r']
          executor: img_cropper
```

We can save an intermediary file to examine the cropped image to see if everything looks as expected. Add this to the `IndexRequest`:

```yaml
      - !PngToDiskDriver
        with:
          prefix: 'crop'
```

Now you can find the intermediary forms of the file in `workspace/`, under the folders with the given prefix.
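
For intuition about what a center crop with `target_size: 96` does, the sketch below computes the crop box by hand. It is plain Python with hypothetical helper names (`center_crop_box`, `center_crop`); Jina's `CenterImageCropper` may round or handle edge cases differently.

```python
def center_crop_box(height, width, target_size):
    """Return (top, left, bottom, right) of a centered square crop."""
    if target_size > min(height, width):
        raise ValueError('target_size must fit inside the image')
    top = (height - target_size) // 2
    left = (width - target_size) // 2
    return top, left, top + target_size, left + target_size


def center_crop(image, target_size):
    """Crop a list-of-rows image; per-pixel channel data is left untouched."""
    top, left, bottom, right = center_crop_box(len(image), len(image[0]), target_size)
    return [row[left:right] for row in image[top:bottom]]


if __name__ == '__main__':
    print(center_crop_box(120, 160, 96))  # (12, 32, 108, 128)
```

Comparing this box with the files written by `PngToDiskDriver` is a quick way to check that the crop keeps the part of the sprite you care about.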

## 3. Optimization

Explain what the end goal is.

Two parameters.yml files: one with a smaller subset of models and one with many options.