# Search

Elasticsearch is a data store built for search.

At its core, it runs a variant of [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) called [BM25](https://en.m.wikipedia.org/wiki/Okapi_BM25), and the way it stores data is built to support that algorithm's performance.
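To make the scoring idea concrete, here is a minimal pure-Python sketch of the textbook BM25 formula, using the commonly quoted defaults `k1=1.2` and `b=0.75`. It is not Elasticsearch's actual implementation; the `bm25_scores` function and the three toy documents are invented for illustration only.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue  # documents without the term get no credit for it
            n_t = doc_freq[term]
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Longer-than-average documents are penalised.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus, tokenised by splitting on whitespace.
docs = [
    "a dog chasing a cat".split(),
    "portrait of a dog".split(),
    "a street in london".split(),
]
scores = bm25_scores(["dog"], docs)
```

Both of the first two documents contain "dog" once, but the shorter one scores higher because of the length normalisation, and the third scores zero because it doesn't contain the term at all.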

Elasticsearch allows developers to compose complex queries over a mix of structured and unstructured data: the _structure_ comes from the hierarchy of different fields, while the _unstructured_ data (like long strings of text) usually sits within those fields.

At query time, a user's **search terms** are injected into a structured piece of JSON called a **query**. That query is run against an **index** of data, whose fields are structured according to another bit of JSON called a **mapping**. For every **document** in the index which matches the search terms (i.e. contains the same terms), a numeric **score** is calculated.
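As a rough sketch of how those pieces relate, the snippet below defines a tiny mapping and a function which injects search terms into a query. The field names (`title`, `description`) and the `build_query` helper are hypothetical, and are much simpler than the real index used in the rest of this notebook.

```python
# Hypothetical mapping: declares which fields exist and how they are indexed.
example_mapping = {
    "properties": {
        "title": {"type": "text"},
        "description": {"type": "text"},
    }
}


def build_query(search_terms, field="title"):
    """Inject the user's search terms into a structured `match` query."""
    return {"query": {"match": {field: search_terms}}}


query = build_query("dog")
```

The resulting `query` dictionary is the kind of JSON body which gets sent to the index; the mapping is supplied separately, when the index is created.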

We're able to sort the search results according to their relevance, because in theory, the most relevant documents should be those with the highest score.  

By changing parts of the query or the mapping, developers can tune the system to produce more appropriate scores, and thereby bring more relevant results to the top of the list.

## Queries

The simplest thing we can change is the query. Let's connect to the Elasticsearch index and try that out.

In [None]:
import os
from elasticsearch import Elasticsearch
from piffle.iiif import IIIFImageClient 
from io import BytesIO
import httpx
from PIL import Image

In [None]:
local_es = Elasticsearch(
    hosts=os.environ['LOCAL_HOST'],
    http_auth=(
        os.environ['LOCAL_USER'],
        os.environ['LOCAL_PASS']
    )
)

## Get
The simplest search we can do is a straightforward GET. We tell the cluster the exact ID of the document we're looking for and the index we know it's in, and Elasticsearch gives us all of the data it has about it:

In [None]:
response = local_es.get(
    index=os.environ['INDEX_NAME'],
    id='agq44vu9'
)

response

## Search
But Elasticsearch supports a rich query syntax, and we can do much, much more than that. The simplest place to start is a [query string query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html). This parses the search terms with a built-in syntax (e.g. boolean operators, wildcards and phrases) and, by default, is applied to all of the searchable fields in the mapping; any stemming comes from the analyzers configured on those fields. For a simple index with simple search intents, this is usually the best place to start.

In [None]:
search_terms = "dog"

In [None]:
response = local_es.search(
    index=os.environ['INDEX_NAME'],
    body={
        "query": {
            "query_string": {
                "query": search_terms
            }
        }
    }
)

In [None]:
print(
    f"Found {response['hits']['total']['value']} "
    f"results in {response['took'] / 1000}s"
)

The index returned lots of matching results, sorted according to their BM25 scores. We'd expect the first result to be pretty relevant. Let's take a look at the image associated with the record:

In [None]:
first_result = response['hits']['hits'][0]
iiif_url = first_result['_source']['state']['derivedData']['thumbnail']['url']
image_url = str(IIIFImageClient().init_from_url(iiif_url).size(width=500))
Image.open(BytesIO(httpx.get(image_url).content))

Looks like a pretty good match for our search terms to me!
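The query above leaves every option at its default. As a sketch of a few options which `query_string` accepts, the hypothetical helper below restricts the query to named fields and sets `default_operator`, which controls whether a multi-word search needs to match any term (`OR`) or all of them (`AND`). The helper name and its arguments are illustrative and are not used elsewhere in this notebook.

```python
def query_string_body(search_terms, fields=None, default_operator="OR"):
    """Build a `query_string` request body, optionally limited to some fields."""
    query = {"query": search_terms, "default_operator": default_operator}
    if fields:
        # Without `fields`, the query falls back to the index's default fields.
        query["fields"] = list(fields)
    return {"query": {"query_string": query}}


body = query_string_body(
    "dog cat",
    fields=["source.canonicalWork.data.title"],
    default_operator="AND",
)
```

The `body` can then be passed to `local_es.search` in the same way as the queries above.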

Let's try again with a different search term:

In [None]:
search_terms = "cat"

response = local_es.search(
    index=os.environ['INDEX_NAME'],
    body={
        "query": {
            "query_string": {
                "query": search_terms
            }
        }
    }
)

first_result = response['hits']['hits'][0]
iiif_url = first_result['_source']['state']['derivedData']['thumbnail']['url']
image_url = str(IIIFImageClient().init_from_url(iiif_url).size(width=500))
Image.open(BytesIO(httpx.get(image_url).content))

In [None]:
first_result['_source']['source']['canonicalWork']['data']['title']

Not so good. We can tweak the query to only match search terms in the title field:

In [None]:
response = local_es.search(
    index=os.environ['INDEX_NAME'],
    body={
        "query": {
            "match": {
                "source.canonicalWork.data.title": search_terms
            }
        }
    }
)


first_result = response['hits']['hits'][0]
iiif_url = first_result['_source']['state']['derivedData']['thumbnail']['url']
image_url = str(IIIFImageClient().init_from_url(iiif_url).size(width=500))
Image.open(BytesIO(httpx.get(image_url).content))

In [None]:
first_result['_source']['source']['canonicalWork']['data']['title']

The title's much more obviously related to the query's search terms, but the image still seems like an odd choice.

We can search across multiple fields:

In [None]:
response = local_es.search(
    index=os.environ['INDEX_NAME'],
    body={
        "query": {
            "multi_match": {
                "query": search_terms,
                "fields": [
                    "source.canonicalWork.data.title",
                    "source.canonicalWork.data.description"
                ]
            }
        }
    }
)


first_result = response['hits']['hits'][0]
iiif_url = first_result['_source']['state']['derivedData']['thumbnail']['url']
image_url = str(IIIFImageClient().init_from_url(iiif_url).size(width=500))
Image.open(BytesIO(httpx.get(image_url).content))

And we can add varying levels of **boost** to each field. If we believe that an image's description is more important than its title, we can add a higher boost to that field.

In [None]:
response = local_es.search(
    index=os.environ['INDEX_NAME'],
    body={
        "query": {
            "multi_match": {
                "query": search_terms,
                "fields": [
                    "source.canonicalWork.data.title^5",
                    "source.canonicalWork.data.description^20"
                ]
            }
        }
    }
)


first_result = response['hits']['hits'][0]
iiif_url = first_result['_source']['state']['derivedData']['thumbnail']['url']
image_url = str(IIIFImageClient().init_from_url(iiif_url).size(width=500))
Image.open(BytesIO(httpx.get(image_url).content))

Hooray, we're finally seeing a cat! Maybe we've improved the structure of our query! Let's search again for "dog".

In [None]:
search_terms = "dog"

In [None]:
response = local_es.search(
    index=os.environ['INDEX_NAME'],
    body={
        "query": {
            "multi_match": {
                "query": search_terms,
                "fields": [
                    "source.canonicalWork.data.title^5",
                    "source.canonicalWork.data.description^20"
                ]
            }
        }
    }
)


first_result = response['hits']['hits'][0]
iiif_url = first_result['_source']['state']['derivedData']['thumbnail']['url']
image_url = str(IIIFImageClient().init_from_url(iiif_url).size(width=500))
Image.open(BytesIO(httpx.get(image_url).content))

Ah, again, we've lost the relevant results because the query isn't quite right. Hopefully this demonstrates the balancing act that we have to practice when tuning search. Because queries are run against the _whole_ index, we need to consider how a change will affect _everything_.
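One way to keep an eye on that balance is to run every candidate query against a fixed set of search terms and compare the top hits side by side. The sketch below assumes a `search_fn` callable which takes a request body and returns an Elasticsearch-style response; `multi_match_body` and `compare_boosts` are hypothetical helpers, not part of the Elasticsearch client.

```python
def multi_match_body(search_terms, boosts):
    """Build a `multi_match` request body from a {field: boost} dictionary."""
    fields = [f"{field}^{boost}" for field, boost in boosts.items()]
    return {"query": {"multi_match": {"query": search_terms, "fields": fields}}}


def compare_boosts(search_fn, terms, candidates):
    """Return {(candidate name, search term): ID of the top hit, or None}."""
    results = {}
    for name, boosts in candidates.items():
        for term in terms:
            response = search_fn(multi_match_body(term, boosts))
            hits = response["hits"]["hits"]
            results[(name, term)] = hits[0]["_id"] if hits else None
    return results


# Two candidate boost settings to compare across the same search terms.
candidates = {
    "title_heavy": {
        "source.canonicalWork.data.title": 20,
        "source.canonicalWork.data.description": 5,
    },
    "description_heavy": {
        "source.canonicalWork.data.title": 5,
        "source.canonicalWork.data.description": 20,
    },
}
```

With the client from earlier in the notebook, `search_fn` could be `lambda body: local_es.search(index=os.environ['INDEX_NAME'], body=body)`, and `compare_boosts(search_fn, ["dog", "cat"], candidates)` would show at a glance which setting helps one search term while hurting another.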

# Similarity
We can also run 'searches' using data from fields within the index, not supplying any new terms apart from a target work ID. These ["more like this"](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html) queries are used to find results which are similar to the query work.

Here's the title of the original work which we'll run the query with:

In [None]:
local_es.get(
    index=os.environ['INDEX_NAME'],
    id='agq44vu9'
)['_source']['source']['canonicalWork']['data']['title']

If we structure the query to look only at the title, we get a first result with a very similar title:

In [None]:
response = local_es.search(
    index=os.environ['INDEX_NAME'],
    body={
      "query": {
        "more_like_this": {
          "fields": [ "source.canonicalWork.data.title" ],
          "like": [
            {
              "_index": os.environ['INDEX_NAME'],
              "_id": "agq44vu9"
            }
          ],
        }
      }
    }
)


first_result = response['hits']['hits'][0]
first_result['_source']['source']['canonicalWork']['data']['title']

As a result, the image content is fairly similar too!

In [None]:
iiif_url = first_result['_source']['state']['derivedData']['thumbnail']['url']
image_url = str(IIIFImageClient().init_from_url(iiif_url).size(width=500))
Image.open(BytesIO(httpx.get(image_url).content))
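The `more_like_this` query above relies on the default term-selection settings. The Elasticsearch documentation lists `min_term_freq`, `min_doc_freq` and `max_query_terms` as parameters which control which terms are picked from the source document; with short fields like titles, where few words repeat, lowering the first two is a common adjustment. The helper below is a hypothetical sketch, and the values of 1 are illustrative starting points rather than recommendations.

```python
def more_like_this_body(index, doc_id, fields,
                        min_term_freq=1, min_doc_freq=1, max_query_terms=25):
    """Build a `more_like_this` request body for a document already in the index."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": [{"_index": index, "_id": doc_id}],
                # Ignore terms which appear fewer times than this in the source document.
                "min_term_freq": min_term_freq,
                # Ignore terms which appear in fewer documents than this.
                "min_doc_freq": min_doc_freq,
                # Cap on the number of terms selected to form the query.
                "max_query_terms": max_query_terms,
            }
        }
    }


# "my-index" is a placeholder; the notebook reads the real name from INDEX_NAME.
body = more_like_this_body(
    "my-index", "agq44vu9", ["source.canonicalWork.data.title"]
)
```

As with any other change to the query, it's worth checking the effect across several different source works rather than just one.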