# A brief introduction to Elasticsearch

## 1. Run Elasticsearch in a Docker container

First, make sure Docker is installed on your machine. Then, run the following command to start an Elasticsearch container:

```bash
docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```

Learn more about setting up Elasticsearch [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/run-elasticsearch-locally.html).
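
Once the container is running, it can be handy to confirm that the node answers on port 9200 before moving on. The following is a minimal sketch using only the standard library; the helper name `get_es_version` is hypothetical, and it assumes security is disabled as in the command above, so that the root endpoint returns a JSON document containing a `version.number` field.

```python
import json
from http.client import HTTPException
from urllib.error import URLError
from urllib.request import urlopen


def get_es_version(base_url="http://localhost:9200", timeout=2.0):
    """Return the version string reported by the node, or None if it is unreachable."""
    try:
        with urlopen(base_url, timeout=timeout) as resp:
            info = json.load(resp)
    except (URLError, OSError, ValueError, HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON reply
        return None
    if not isinstance(info, dict):
        return None
    return info.get("version", {}).get("number")


# Prints the version when the container is up, otherwise None
print(get_es_version())
```

If this prints `None` right after `docker run`, the node may simply still be starting; waiting a few seconds and trying again is usually enough.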

## 2. The Elasticsearch client

The official Python client exposes the Elasticsearch REST APIs as Python methods. We will ingest our book data from `data/books.json` into an Elasticsearch **index**. To create the index, we define:

- Settings: determine how the index is set up and managed (here, one shard and no replicas)
- Mappings: specify the data type of each field in our documents

In [1]:
from elasticsearch import Elasticsearch, helpers
import json

with open('./data/books.json', 'r') as f:
    data = json.load(f)

es = Elasticsearch("http://localhost:9200")

index_name = 'books'

# Delete the index if it already exists
if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)


body = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "id": {"type": "integer"},
            "book": {"type": "keyword"},
            "book_name": {"type": "text"},
            "edition": {"type": "keyword"},
            "author": {"type": "text"},
            "publication_year": {"type": "integer"},
            "publication_month": {"type": "integer"},
            "publication_day": {"type": "integer"}
        }
    }
}

es.indices.create(index=index_name, body=body)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'books'})

Now we build the input for the `helpers.bulk` method: a list of dictionaries, one per document, with the keys `_index`, `_id`, and `_source`. Then we call `bulk` to ingest the data into our index. It returns the number of successfully indexed documents and a list of errors.

In [2]:
actions = [
    {
        "_index": index_name,
        "_id": entry['id'],
        "_source": entry
    }
    for entry in data
]

helpers.bulk(es, actions)

(30, [])
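
For larger datasets, the same action structure can be produced lazily with a generator instead of a list, since `helpers.bulk` accepts any iterable of actions. The sketch below is illustrative only: `make_actions` is a hypothetical helper, and the two sample documents carry just a subset of the fields defined in the mapping.

```python
def make_actions(docs, index_name):
    """Yield one bulk action per document, using the document's own id as _id."""
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": doc["id"],
            "_source": doc,
        }


# Two hand-written sample documents (a subset of the real fields)
sample_docs = [
    {"id": 1, "book_name": "Cosmos", "author": "Carl Sagan"},
    {"id": 2, "book_name": "A Brief History of Time", "author": "Stephen Hawking"},
]

actions = list(make_actions(sample_docs, "books"))
print(actions[0])
```

With a running cluster, the generator could be passed straight to the helper, for example `helpers.bulk(es, make_actions(data, index_name))`.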

## 3. Basic queries and clauses

In this section we run simple queries with the Elasticsearch Python client. We start with a plain `match` query and then cover the *must*, *must_not*, and *should* clauses. These clauses are part of the *bool* query, which combines multiple conditions.

### `match`

A basic **match** query analyzes the given text and returns documents whose field contains at least one of the resulting terms. For example, let's find all books with the word "Python" in their `book_name`:

In [3]:
query = {
    "query": {
        "match": {
            "book_name": "Python"
        }
    }
}

response = es.search(index=index_name, body=query)
response['hits']['hits']

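[]

The empty result is most likely a timing artifact rather than a true miss: the `should` example further down finds "Python Crash Course" through the same `book_name` match. Newly indexed documents only become searchable after the index refreshes, which happens periodically (every second by default), so a search issued immediately after `bulk` can come back empty. Calling `es.indices.refresh(index=index_name)` before searching avoids this.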

To streamline our workflow, we define a utility function, `search_and_print`, that runs a query and prints the ID, score, and selected source fields of each hit. Without it, we would have to inspect `response['hits']['hits']` manually every time we run a query.

In [4]:
def search_and_print(query, keys=None, es_client=es, index_name=index_name):
    """Run `query` against the index and print each hit; `keys` limits the printed fields."""
    response = es_client.search(index=index_name, body=query)
    hits = response['hits']['hits']

    print("Search Results:")
    for hit in hits:
        print(f"ID: {hit['_id']}")
        print(f"Score: {hit['_score']}")
        print("Source:")
        for key, value in hit['_source'].items():
            # Print every field when no keys are given, otherwise only the requested ones
            if keys is None or key in keys:
                print(f"  {key}: {value}")
        print("")
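
To make the shape of a search response concrete, here is a small sketch that walks the same `hits.hits` structure without needing a running cluster. The response dictionary is hand-written for illustration (it is not real output), and `extract_hits` is a hypothetical helper that returns rows instead of printing them.

```python
def extract_hits(response, keys=None):
    """Flatten a search response into a list of dicts with _id, _score and chosen fields."""
    rows = []
    for hit in response["hits"]["hits"]:
        fields = {
            key: value
            for key, value in hit["_source"].items()
            if keys is None or key in keys
        }
        rows.append({"_id": hit["_id"], "_score": hit["_score"], **fields})
    return rows


# Hand-written stand-in for the object returned by es.search(...)
fake_response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_index": "books",
                "_id": "20",
                "_score": 3.6756785,
                "_source": {
                    "id": 20,
                    "book_name": "Python Crash Course",
                    "author": "Eric Matthes",
                },
            }
        ],
    }
}

rows = extract_hits(fake_response, keys=["book_name", "author"])
print(rows)
```

Returning rows rather than printing them makes the results easier to reuse, for instance to load them into a table or to compare two queries.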

### `must`

The `must` clause requires that every listed condition is satisfied, and each condition contributes to the relevance score. Let's find all books authored by "Seth Godin":

In [5]:
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"author": "Seth Godin"}}
            ]
        }
    }
}

search_and_print(query, keys=['book_name', 'author'])

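Search Results:

No hits are printed here, although the index does contain books by Seth Godin (they appear in the `should` example below). A search that runs before the index has refreshed can miss recently indexed documents; calling `es.indices.refresh(index=index_name)` after ingestion avoids this.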


### `must_not`

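The `must_not` clause, conversely, excludes documents that match the specified conditions. It runs in filter context, so every hit gets a score of 0, and only the first 10 hits are returned because that is the default `size`: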

In [6]:
query = {
    "query": {
        "bool": {
            "must_not": [
                {"match": {"author": "Seth Godin"}}
            ]
        }
    }
}

search_and_print(query, keys=['book_name', 'author'])

Search Results:
ID: 1
Score: 0.0
Source:
  book_name: Cosmos
  author: Carl Sagan

ID: 2
Score: 0.0
Source:
  book_name: A Brief History of Time
  author: Stephen Hawking

ID: 3
Score: 0.0
Source:
  book_name: The Elegant Universe
  author: Brian Greene

ID: 4
Score: 0.0
Source:
  book_name: The Art of Electronics
  author: Paul Horowitz, Winfield Hill

ID: 5
Score: 0.0
Source:
  book_name: Make: Electronics
  author: Charles Platt

ID: 6
Score: 0.0
Source:
  book_name: Practical Electronics for Inventors
  author: Paul Scherz, Simon Monk

ID: 7
Score: 0.0
Source:
  book_name: Clean Code
  author: Robert C. Martin

ID: 8
Score: 0.0
Source:
  book_name: The Pragmatic Programmer
  author: Andrew Hunt, David Thomas

ID: 9
Score: 0.0
Source:
  book_name: Introduction to the Theory of Computation
  author: Michael Sipser

ID: 10
Score: 0.0
Source:
  book_name: Black's Law Dictionary
  author: Bryan A. Garner



### `should`

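The `should` clause specifies optional conditions: a document does not have to satisfy all of them, but each condition it satisfies raises its score. When a `bool` query has no `must` or `filter` clause, as here, at least one `should` condition must match: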

In [7]:
query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"author": "Seth Godin"}},
                {"match": {"book_name": "Python"}}
            ]
        }
    }
}

search_and_print(query, keys=['book_name', 'author'])

Search Results:
ID: 24
Score: 5.850758
Source:
  book_name: Purple Cow: Transform Your Business by Being Remarkable
  author: Seth Godin

ID: 25
Score: 5.850758
Source:
  book_name: This Is Marketing: You Can't Be Seen Until You Learn to See
  author: Seth Godin

ID: 20
Score: 3.6756785
Source:
  book_name: Python Crash Course
  author: Eric Matthes
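
Since the `must`, `must_not`, and `should` clauses all live under the same `bool` key, the request body can be assembled by a small helper. This is only a sketch: `bool_query` is a hypothetical function, and the particular combination of clauses is chosen for illustration rather than taken from the examples above.

```python
import json


def bool_query(must=None, must_not=None, should=None, size=None):
    """Build a search body with a bool query, leaving out clauses that are empty."""
    bool_body = {}
    if must:
        bool_body["must"] = list(must)
    if must_not:
        bool_body["must_not"] = list(must_not)
    if should:
        bool_body["should"] = list(should)

    body = {"query": {"bool": bool_body}}
    if size is not None:
        body["size"] = size
    return body


# Optional matches on author or title, excluding one specific author
query = bool_query(
    should=[
        {"match": {"author": "Seth Godin"}},
        {"match": {"book_name": "Python"}},
    ],
    must_not=[{"match": {"author": "Eric Matthes"}}],
    size=5,
)
print(json.dumps(query, indent=2))
```

The resulting dictionary can be passed to `search_and_print(query, keys=['book_name', 'author'])` in the same way as the hand-written queries.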



### Boosting

Boosting in Elasticsearch lets you give more weight to certain conditions in your query, so that documents matching them rank higher in the search results. In this example, we search for programming books (a `must` condition on the `book` field) and use a `should` clause with a `boost` of 2 to favor books whose author name contains "Simpson". The `size` parameter limits the output to the top three hits.

In [8]:
query = {
    "size": 3,
    "query": {
        "bool": {
            "must": [
                {"match": {"book": "Programming"}}
            ],
            "should": [
                {
                    "match": {
                        "author": {
                            "query": "Simpson",
                            "boost": 2
                        }
                    }
                }
            ]
        }
    }
}

search_and_print(query, keys=['author', 'book_name', 'publication_year'])

Search Results:
ID: 21
Score: 8.600027
Source:
  book_name: You Don't Know JS: Scope & Closures
  author: Kyle Simpson
  publication_year: 2014

ID: 7
Score: 1.562185
Source:
  book_name: Clean Code
  author: Robert C. Martin
  publication_year: 2008

ID: 8
Score: 1.562185
Source:
  book_name: The Pragmatic Programmer
  author: Andrew Hunt, David Thomas
  publication_year: 2019
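
The scores above can be unpacked with a little arithmetic. A `bool` query adds up the scores of its matching clauses, and `boost` multiplies the score of the clause it is attached to. The sketch below assumes that the `must` clause contributes the same score to every hit, which is plausible here because IDs 7 and 8 match only that clause and share a score; the figures are therefore approximate.

```python
top_score = 8.600027        # ID 21: matches both the must and the should clause
must_only_score = 1.562185  # IDs 7 and 8: match only the must clause
boost = 2

# What the boosted should clause added to the top hit
should_contribution = top_score - must_only_score

# What the same clause would have added with the default boost of 1
unboosted_contribution = should_contribution / boost

print(round(should_contribution, 6))     # 7.037842
print(round(unboosted_contribution, 6))  # 3.518921
```

Under these assumptions, the top hit would still rank first without the boost, but with a score of roughly 5.08 instead of 8.60.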



### `terms`

The `terms` query finds documents whose field contains at least one of the listed exact values. Unlike `match`, the search values are not analyzed, so they must equal the indexed terms exactly.

In [9]:
query = {
    "size": 4,
    "query": {
        "terms": {
            "publication_year": [2017, 2019, 2020]
        }
    }
}

search_and_print(query, keys=['author', 'book_name', 'publication_year'])

Search Results:
ID: 8
Score: 1.0
Source:
  book_name: The Pragmatic Programmer
  author: Andrew Hunt, David Thomas
  publication_year: 2019

ID: 10
Score: 1.0
Source:
  book_name: Black's Law Dictionary
  author: Bryan A. Garner
  publication_year: 2019

ID: 12
Score: 1.0
Source:
  book_name: Constitutional Law: Principles and Policies
  author: Erwin Chemerinsky
  publication_year: 2019

ID: 16
Score: 1.0
Source:
  book_name: Astrophysics for People in a Hurry
  author: Neil deGrasse Tyson
  publication_year: 2017
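
The difference between analyzed and exact matching is easiest to see on strings. The following pure-Python sketch imitates it under strong simplifications: `analyze` is a crude stand-in for a real analyzer (lowercase, split on non-alphanumeric characters), and all function names are hypothetical. It models a `text` field as a set of analyzed tokens and a `keyword` field as a single unmodified term.

```python
import re


def analyze(text):
    """Crude approximation of an analyzer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def text_terms(value):
    """Terms indexed for a `text` field: the analyzed tokens."""
    return set(analyze(value))


def keyword_terms(value):
    """Terms indexed for a `keyword` field: the whole value, unchanged."""
    return {value}


def match_like(indexed_terms, query_text):
    """`match`: the query text is analyzed too; any shared term is a hit."""
    return bool(indexed_terms & set(analyze(query_text)))


def terms_like(indexed_terms, values):
    """`terms`: the values are compared as-is against the indexed terms."""
    return bool(indexed_terms & set(values))


title = text_terms("Python Crash Course")
category = keyword_terms("Programming")

print(match_like(title, "PYTHON"))             # True: both sides are lowercased
print(terms_like(title, ["Python"]))           # False: the indexed token is "python"
print(terms_like(title, ["python"]))           # True
print(terms_like(category, ["Programming"]))   # True: exact value
print(terms_like(category, ["programming"]))   # False: case differs
```

This is why `terms` is a natural fit for numeric and `keyword` fields such as `publication_year`, while free-text fields such as `book_name` are usually searched with `match`.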



### `range`

The `range` query matches documents whose field value falls within the given bounds (`gt`, `gte`, `lt`, `lte`), which is useful for numeric and date fields. For example, let's find all books published after 2019:

In [10]:
query = {
    "query": {
        "range": {
            "publication_year": {
                "gt": 2019
            }
        }
    }
}

search_and_print(query, keys=['author', 'book_name', 'publication_year'])

Search Results:
ID: 26
Score: 1.0
Source:
  book_name: The Astrophysics of Planet Formation
  author: Phil Armitage
  publication_year: 2020

ID: 28
Score: 1.0
Source:
  book_name: Embedded Systems: Introduction to Arm Cortex-M Microcontrollers
  author: Jonathan W. Valvano
  publication_year: 2021
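
Lower and upper bounds can be combined in a single `range` clause. As a sketch, the hypothetical helper below builds such a body from whichever bounds are supplied; the 2010 to 2019 window is an arbitrary example.

```python
def range_query(field, gt=None, gte=None, lt=None, lte=None):
    """Build a range query body from the bounds that are not None."""
    candidates = {"gt": gt, "gte": gte, "lt": lt, "lte": lte}
    bounds = {name: value for name, value in candidates.items() if value is not None}
    if not bounds:
        raise ValueError("at least one bound is required")
    return {"query": {"range": {field: bounds}}}


# Books published from 2010 through 2019, both ends included
query = range_query("publication_year", gte=2010, lte=2019)
print(query)
```

As with the earlier examples, the result can be passed to `search_and_print(query, keys=['author', 'book_name', 'publication_year'])`.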

