# Jina Workshop @ TUM.ai: Building a Neural Image Search Engine

In this workshop we will build a neural search engine for images of Pokemon.

# Downloading data and model

## Download and Extract Data

For this example we're using Pokemon sprites from [veekun.com](https://veekun.com/dex/downloads). To download them, run:

```sh
sh ./get_data.sh
```

## Download and Extract Pretrained Model

In this example we use the [BiT (Big Transfer) model](https://github.com/google-research/big_transfer). To download it, run:

```sh
sh ./download.sh
```

# Code

Required imports

In [1]:
import os
import sys

from jina.flow import Flow
from components import *

Some configuration options:

- the maximum number of documents we index
- the path to the images

In [2]:
num_docs = int(os.environ.get('JINA_MAX_DOCS', 50000))
image_src = 'data/**/*.png'
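The `**` in `image_src` only reaches nested folders when the pattern is expanded recursively. As a minimal sketch of how such a pattern and the `num_docs` cap interact, using only the standard library (the helper `list_images` is hypothetical and not part of Jina; Jina's own file reader may order or filter files differently):

```python
import glob
import itertools
import os
import tempfile


def list_images(pattern, limit):
    """Return at most `limit` paths matching `pattern`, searching subfolders."""
    # iglob yields lazily, so islice stops as soon as `limit` paths are found.
    matches = glob.iglob(pattern, recursive=True)
    return sorted(itertools.islice(matches, limit))


# Build a throwaway directory tree to try the pattern on.
root = tempfile.mkdtemp()
for sub, name in [("a", "1.png"), ("a/b", "2.png"), ("c", "3.png"), ("c", "notes.txt")]:
    os.makedirs(os.path.join(root, sub), exist_ok=True)
    open(os.path.join(root, sub, name), "wb").close()

pattern = os.path.join(root, "**", "*.png")
all_pngs = list_images(pattern, limit=50000)  # every PNG at any depth
capped = list_images(pattern, limit=2)        # stops after two matches
```

Setting `JINA_MAX_DOCS` to a small number before running the notebook has the same effect as the `limit` argument here: it is a quick way to test the pipeline on a handful of images.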

Environment variables

- sharding
- workspace (folder where the encoded data will be stored)
- port we will listen on

In [3]:
def config():
    num_encoders = 1  # same value for indexing and querying
    shards = 8

    os.environ['JINA_SHARDS'] = str(num_encoders)
    os.environ['JINA_SHARDS_INDEXERS'] = str(shards)
    os.environ['JINA_WORKSPACE'] = './workspace'
    os.environ['JINA_PORT'] = os.environ.get('JINA_PORT', str(45678))

In [4]:
config()

In [5]:
workspace = os.environ['JINA_WORKSPACE']
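Note that `config()` only falls back to port 45678 when `JINA_PORT` is not already set, while the other variables are always overwritten. The "default unless already set" pattern can be sketched as follows; `apply_defaults` is a hypothetical helper for illustration, and a plain dict stands in for `os.environ` so the example has no side effects:

```python
def apply_defaults(env, defaults):
    """Fill in missing keys without overwriting values that are already set."""
    for key, value in defaults.items():
        # Environment variables are always strings, hence str().
        env.setdefault(key, str(value))
    return env


# JINA_PORT was set beforehand (for example in the shell), JINA_WORKSPACE was not.
preset = {"JINA_PORT": "8080"}
env = apply_defaults(preset, {"JINA_PORT": 45678, "JINA_WORKSPACE": "./workspace"})
```

Here `env["JINA_PORT"]` keeps the preset value `"8080"`, and `env["JINA_WORKSPACE"]` receives the default.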

We need to make sure not to index on top of an existing workspace.

Doing so can cause problems if the configuration options differ between the two runs.

In [6]:
if os.path.exists(workspace):
    print(f'\n +---------------------------------------------------------------------------------+ \
            \n |                                   🤖🤖🤖                                        | \
            \n | The directory {workspace} already exists. Please remove it before indexing again. | \
            \n |                                   🤖🤖🤖                                        | \
            \n +---------------------------------------------------------------------------------+')
    sys.exit()


 +---------------------------------------------------------------------------------+             
 |                                   🤖🤖🤖                                        |             
 | The directory ./workspace already exists. Please remove it before indexing again. |             
 |                                   🤖🤖🤖                                        |             
 +---------------------------------------------------------------------------------+


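If you would rather clear the old workspace from Python than by hand, a sketch along these lines could work. `reset_workspace` is a hypothetical helper, not part of the workshop code, and it permanently deletes the indexed data, so use it only when you intend to re-index:

```python
import os
import shutil
import tempfile


def reset_workspace(path):
    """Delete `path` if it exists; return True when something was removed."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


# Demonstrate on a temporary directory rather than a real workspace.
demo = os.path.join(tempfile.mkdtemp(), "workspace")
os.makedirs(demo)
removed_first = reset_workspace(demo)   # the directory existed
removed_second = reset_workspace(demo)  # nothing left to remove
```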


# Flows

The Flow is the main pipeline in Jina. It describes how data is loaded, processed, stored, and so on within the system.

It is made up of components called Pods, each of which performs one specific task.

For example, an Encoder Pod loads the model and *encodes* the data.
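To make the idea concrete, here is a toy pipeline in plain Python. It does not use Jina's API; the names `crafter`, `encoder`, `ToyIndexer`, and `run_pipeline` are made up for illustration, and the "embedding" is just the content length. It only shows the shape of the idea: each stage does one job and hands the document to the next.

```python
def crafter(doc):
    # Stand-in for preprocessing (e.g. resizing or normalizing an image).
    return {**doc, "crafted": True}


def encoder(doc):
    # Stand-in for a model: "embed" the document as the length of its content.
    return {**doc, "embedding": [float(len(doc["content"]))]}


class ToyIndexer:
    """Stand-in for storage: keeps every document it receives in a list."""

    def __init__(self):
        self.store = []

    def __call__(self, doc):
        self.store.append(doc)
        return doc


def run_pipeline(docs, stages):
    """Pass each document through every stage, in order."""
    for doc in docs:
        for stage in stages:
            doc = stage(doc)
        yield doc


indexer = ToyIndexer()
docs = [{"content": b"abc"}, {"content": b"hello"}]
results = list(run_pipeline(docs, [crafter, encoder, indexer]))
```

In Jina the stages run as separate Pods and exchange messages over the network, which is what makes it possible to scale an individual stage.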

## Index Flow

Depending on your needs, the Flow can be configured in different ways. While indexing (storing) data, we can optimize the pipeline to process the data in parallel.

In [9]:
# for index
f = Flow.load_config('flows/index.yml')

In [13]:
f.plot('index.png', inline_display=True)

           JINA@59807[E]:can not download image, please check your graph and the network connections


The Flow is a context manager (like a file handler).

We load data into the pipeline from the directory we provided above. 

`request_size` sets how many images are sent in one request (the batch size).

In [None]:
with f:
    f.index_files(image_src, request_size=64, read_mode='rb', size=num_docs)
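The effect of `request_size` can be pictured as splitting the stream of documents into fixed-size chunks. The following is a minimal sketch of that idea, not Jina's actual implementation; `batched` is a hypothetical helper:

```python
from itertools import islice


def batched(items, request_size):
    """Yield lists of up to `request_size` items until `items` is exhausted."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, request_size))
        if not batch:
            return
        yield batch


# 150 documents with request_size=64 give two full requests and one partial one.
batches = list(batched(range(150), request_size=64))
sizes = [len(batch) for batch in batches]
```

Larger requests mean fewer round trips, while smaller ones keep memory use per request lower.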

# Searching

When searching, we need to make sure the data is processed serially.

In [14]:
f = Flow.load_config('flows/query.yml')

In [16]:
f.plot('search.png')

Running the cell below activates the REST API.

 - You can use the [Jinabox.js](https://jina.ai/jinabox.js/) frontend to find the Pokemon that matches most closely. Just set the endpoint to `http://127.0.0.1:45678/api/search` and drag an image from the thumbnails on the left or from your file manager.
 - Or you can query it with an HTTP POST request from `curl`, JavaScript, or any other client. [Details here](#query-via-rest-api).
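As a rough sketch of the second option in Python: the payload layout below (`top_k`, `mode`, and a `data` list holding a base64 data URI) is an assumption based on Jina's REST examples from this period, so check it against the version you have installed. `build_search_request` is a hypothetical helper; it only builds the request and does not send anything.

```python
import base64
import json
import urllib.request


def build_search_request(image_bytes, top_k=10,
                         endpoint="http://127.0.0.1:45678/api/search"):
    """Build (but do not send) a POST request carrying one PNG image."""
    data_uri = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    # Assumed payload shape; adjust if your Jina version expects something else.
    payload = {"top_k": top_k, "mode": "search", "data": [data_uri]}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder bytes; in practice read them with open(path, "rb").read().
request = build_search_request(b"\x89PNG placeholder bytes", top_k=5)

# To send it while the query Flow is running:
# with urllib.request.urlopen(request) as response:
#     matches = json.load(response)
```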

In [17]:
with f:
    f.block()

        crafter@60431[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
        crafter@60431[I]:input tcp://0.0.0.0:44617 (PULL_BIND) output tcp://0.0.0.0:47577 (PUSH_CONNECT) control over tcp://0.0.0.0:40513 (PAIR_BIND)
      tf_encode@60440[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
      tf_encode@60440[I]:input tcp://0.0.0.0:47577 (PULL_BIND) output tcp://0.0.0.0:57581 (PUSH_CONNECT) control over tcp://0.0.0.0:57207 (PAIR_BIND)
   vec_idx/head@60449[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
   vec_idx/head@60449[I]:input tcp://0.0.0.0:57581 (PULL_BIND) output tcp://0.0.0.0:54213 (PUB_BIND) control over tcp://0.0.0.0:35763 (PAIR_BIND)
    ImageReader@60431[I]:post_init may take some time...
BigTransferEncoder@60440[I]:post_init may take some time...
   vec_idx/tail@60460[I]:starting jina.peapods.runtimes.zmq.zed.ZEDRuntime...
    ImageReader@60431[I]:post_init may take 

# Advanced Topics


After changing any of the settings described below, you will need to re-index your data and search again.

## Changing Encoders

We can switch the `Encoder` easily.

`pods/encode.yml`:

```yaml
!ImageKerasEncoder
with:
  model_name: ResNet50V2 # any model could go here
  pool_strategy: avg
  channel_axis: -1
```

## Changing Crafters

In `pods/craft.yml`:

- remove `target_size: 96` from `ImageNormalizer`
- add a `CenterImageCropper` that uses this target size:

```yaml
- !CenterImageCropper
  with:
    target_size: 96
    channel_axis: -1
  metas:
    name: img_cropper
```
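For intuition about what a center crop with `target_size: 96` does, here is a plain-Python sketch on a channels-last image (matching `channel_axis: -1`). It illustrates the geometry only and is not Jina's implementation; `center_crop_box` and `center_crop` are hypothetical names.

```python
def center_crop_box(height, width, target_size):
    """Return (top, left, bottom, right) of a centered square crop."""
    if target_size > height or target_size > width:
        raise ValueError("target_size must fit inside the image")
    top = (height - target_size) // 2
    left = (width - target_size) // 2
    return top, left, top + target_size, left + target_size


def center_crop(image, target_size):
    """Crop a height x width x channels nested list around its center."""
    top, left, bottom, right = center_crop_box(len(image), len(image[0]), target_size)
    return [row[left:right] for row in image[top:bottom]]


# A 100 x 120 image whose pixels record their own (y, x) position.
image = [[[y, x, 0] for x in range(120)] for y in range(100)]
cropped = center_crop(image, 96)
```

For this 100 x 120 input, the crop starts at row 2 and column 12, so the borders are trimmed evenly on each side.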

We also need to specify the request paths, both for `IndexRequest` and for `SearchRequest`:

```yaml
      - !CraftDriver
        with:
          traversal_paths: ['r']
          executor: img_cropper
```

We can save an intermediary file to examine the cropped image to see if everything looks as expected. Add this to the `IndexRequest`:

```yaml
      - !PngToDiskDriver
        with:
          prefix: 'crop'
```

Now you can find the intermediary forms of the file in `workspace/`, under the folders with the given prefix.
