# Exploring ElasticSearch

**Mehdi Boustani** - S221594  
**Nicolas Schneiders** - S203005  
**Maxim Piron** - S211493  
**Andreas Stistrup** - S212891  

*Faculty of Applied Sciences, University of Liège*

May 17, 2025


# Introduction

In today’s digital world, users interact with vast amounts of data through search interfaces—whether it's browsing e-commerce platforms, reading documentation, or exploring media archives. They expect search engines to return accurate, relevant results instantly, even when queries include misspellings, vague phrasing, or synonyms. These expectations have grown alongside the exponential increase in the volume, variety, and velocity of data generated across platforms and applications.

This surge in data (structured, semi-structured, and unstructured) presents a major challenge. Traditional **relational databases**, while excellent for transactional operations and structured data storage, are ill-suited for flexible, large-scale search. Their rigid schemas, exact-match query requirements, and vertical scaling limitations make them inadequate for handling modern search demands, especially when working with diverse data types like JSON, logs, and text-heavy content.

This is where **ElasticSearch** comes in. ElasticSearch is a powerful, open-source, distributed search and analytics engine designed to operate at scale with low latency. It is built on top of [Apache Lucene](https://lucene.apache.org/), and provides robust full-text search capabilities, fault-tolerant horizontal scalability, and a flexible document-based data model. Its ability to handle fuzzy matching, relevance scoring, autocomplete, and real-time analytics makes it a core component of many data-driven applications.

Unlike traditional SQL-based systems, ElasticSearch uses a **schema-free JSON format** for indexing and querying, allowing for more dynamic data ingestion and flexible exploration. Features like *match*, *fuzzy match*, and *filtering* mechanisms allow users to retrieve relevant information, even in the presence of errors or ambiguous input, while keeping response times fast and consistent.

In this tutorial, we will introduce the core concepts of ElasticSearch, walk through its installation and setup, and demonstrate how to import and query data using real-world examples. Along the way, we will highlight key features such as full-text search, filtering, boosting, and aggregations, and also explore its limitations.

# Real-life use cases

ElasticSearch is widely adopted across industries for its efficiency, scalability, and flexibility in handling large volumes of data. Here are some common real-world applications:

### E-commerce/Product Catalog Search
ElasticSearch powers fast, relevant, and up-to-date results in e-commerce product searches, supporting faceted navigation. This requires inventory synchronization, user behavior tracking, and results caching.

- [Netflix - ElasticSearch Indexing Strategy in Asset Management Platform](https://netflixtechblog.com/elasticsearch-indexing-strategy-in-asset-management-platform-amp-99332231e541) (article)
- [eBay - ElasticSearch as a Service](https://www.elastic.co/elasticon/conf/2017/sf/elasticsearch-as-a-service-at-ebay) (webinar)
- [Ticketmaster - Revolutionizing the Fan Experience with Search](https://www.elastic.co/elasticon/tour/2017/los-angeles/revolutionizing-the-fan-experience-with-search-at-ticketmaster) (webinar)
- [BMW - BMW picks Elastic to drive new marketing and sales strategies](https://www.elastic.co/customers/bmw) (case study)

### Workplace/Knowledge Base Search
In enterprise settings, ElasticSearch enables efficient searching across various data sources while enforcing permissions. It integrates with third-party connectors, ensures document-level security, and supports role-based access.

- [Airbus - Airbus ADNS: Powering the Search for Near Real-Time Access to Aircraft Technical Documents](https://www.elastic.co/customers/airbus) (case study)
- [Pfizer - Elastic as a Fundamental Core to Pfizer’s Scientific Data Cloud](https://www.elastic.co/elasticon/tour/2019/boston/elastic-as-a-fundamental-core-to-pfizers-scientific-data-cloud) (webinar)

### Website Search
Website search functionality is enhanced with ElasticSearch for delivering relevant, up-to-date results. It involves web crawling, incremental indexing, and query caching.

- [GitHub - Accelerating software development](https://www.elastic.co/customers/github) (case study)
- [Wikimedia - Navigating the World's Encyclopedia](https://www.elastic.co/elasticon/conf/2015/sf/navigating-through-worlds-encyclopedia) (webinar)
- [City of Portland - Better Search Means Happier Portlanders](https://www.elastic.co/customers/city-of-portland) (case study)

### Customer Support Search
ElasticSearch is used to surface relevant solutions and manage customer support queries, with features such as knowledge graphs, role-based access, and analytics tracking.

- [Airbnb - How Airbnb manages to monitor customer issues at scale](https://medium.com/airbnb-engineering/how-airbnb-manages-to-monitor-customer-issues-at-scale-b883301ca461) (article)
- [Shopify - Powering the search for better help documentation](https://www.elastic.co/customers/shopify) (case study)

### Chatbots and Retrieval-Augmented Generation (RAG)
ElasticSearch supports chatbots and RAG applications by enabling natural conversations, providing context, and maintaining knowledge. It leverages vector search, machine learning models, and knowledge base integration.

- [Stack Overflow - Stack Overflow rolls out generative AI using ElasticSearch and Azure Open AI](https://www.elastic.co/customers/stack-overflow) (case study)

### Geospatial Search
Geospatial search in ElasticSearch handles location-based queries, sorts results by proximity, and filters by area. It utilizes geo-mapping, spatial indexing, and distance calculations.

- [Uber - Engineering Uber Predictions in Real Time with ELK](https://www.uber.com/en-BE/blog/elk/) (article)
- [Google, Yelp, Zomato - Find Nearby Businesses: Geospatial Index](https://medium.com/@jagriti.bansal/how-google-search-yelp-zomato-find-nearby-businesses-geospatial-index-06c78f4f935b) (article)

# How ElasticSearch works

## Data format
ElasticSearch organizes data into indexes, documents, and fields. These are roughly comparable to tables, rows, and columns in a traditional SQL database management system. Documents are stored as JSON, which allows ElasticSearch to have a flexible schema.

In addition to the data stored in a document, ElasticSearch uses [metadata fields](https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/document-metadata-fields) to assist with querying. These fields are prefixed with an underscore (`_`) to distinguish them from regular data fields. Examples include `_id`, which represents the document’s unique identifier, and `_index`, which indicates the index the document belongs to.

Here is an [example of a document](https://www.elastic.co/docs/manage-data/data-store/index-basics):

```json
{
  "_index": "my-first-elasticsearch-index",
  "_id": "DyFpo5EBxE8fzbb95DOa",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "email": "john@smith.com",
    "first_name": "John",
    "last_name": "Smith",
    "info": {
      "bio": "Eco-warrior and defender of the weak",
      "age": 25,
      "interests": [
        "dolphins",
        "whales"
      ]
    },
    "join_date": "2024/05/01"
  }
}
```

## Distributed systems
Since ElasticSearch is built to work on distributed systems at its core, it utilizes **clusters** (groups of machines), **nodes** (machines), and **shards** (pieces of indexes).

Sharding is done by splitting each index into multiple parts called shards, which are distributed across different nodes in the cluster. Each shard is a fully functional Lucene index that can be searched and queried independently. When a document is indexed, ElasticSearch uses a hashing algorithm on the document’s ID to determine which primary shard it should be stored in. Replica shards are also created for fault tolerance and load balancing.

There are two types of shards: **primaries** and **replicas**. Each document in an index belongs to one primary shard. A replica shard is a copy of a primary shard. Replicas maintain redundant copies of data across the nodes in a cluster. This protects against hardware failure and increases capacity to serve read requests like searching or retrieving a document.

Documents are routed to shards as whole units. By default, the routing value is the document ID: it is hashed, and the hash modulo the number of primary shards determines which primary shard receives the document.
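The routing rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ElasticSearch uses a Murmur3 hash internally, whereas the sketch substitutes `zlib.crc32` from the standard library, and `pick_shard` is a hypothetical helper name, not part of any client API.

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document ID to a primary shard number (illustrative only)."""
    # Stand-in hash: ElasticSearch itself uses Murmur3, not CRC32.
    hashed = zlib.crc32(doc_id.encode("utf-8"))
    return hashed % num_primary_shards


# Hypothetical document IDs routed across 3 primary shards
for doc_id in ["A Hazy Harvest Moon", "Rubin's Galaxy", "Iridescent Clouds over Sweden"]:
    print(f"{doc_id!r} -> shard {pick_shard(doc_id, 3)}")
```

Because the result depends on the number of primary shards, changing that number would send existing IDs to different shards. This is why the primary shard count is chosen when the index is created.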

This allows ElasticSearch to scale **horizontally**, meaning that adding more nodes to the cluster can distribute data and workload more effectively, leading to better performance, fault tolerance, and high availability.

## Representational State Transfer (REST)

ElasticSearch follows the **REST** convention, which means that:
- It is **stateless**: Each request from the client to the server must contain all the information needed to understand and process the request.
- It follows a **client-server model**: The client (such as [Kibana](https://www.elastic.co/fr/kibana) or an HTTP client) sends requests to the server (ElasticSearch), which processes and returns responses.
- It has a **uniform interface**: Standard HTTP methods like `GET`, `POST`, `PUT`, and `DELETE` are used to perform actions on resources (e.g., documents or indexes).
- It is **resource-based**: Data is organized and accessed via URLs that represent resources (e.g., `/my_index/_doc/1` represents document ID 1 in the index `my_index`).
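To make the mapping between HTTP methods and resources concrete, the sketch below builds (but does not send) a few requests with the standard library. The base URL assumes a local single-node instance, `my_index` is the illustrative index name used above, and `build_request` is a hypothetical helper.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: local single-node instance


def build_request(method, path, body=None):
    """Build an HTTP request object for a resource path, without sending it."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


prepared = [
    build_request("PUT", "/my_index/_doc/1", {"title": "hello"}),  # create or replace
    build_request("GET", "/my_index/_doc/1"),                      # read
    build_request("DELETE", "/my_index/_doc/1"),                   # remove
]

for req in prepared:
    print(req.get_method(), req.full_url)

# Sending is left out on purpose: urllib.request.urlopen(req) would do it
# against a running server.
```

Each request carries everything the server needs (method, resource URL, optional JSON body), which is what statelessness means in practice.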

## Domain Specific Language (DSL)
ElasticSearch uses its own **Domain Specific Language (DSL)** for querying and manipulating data. This JSON-based query language allows users to build rich and expressive queries to filter, sort, aggregate, and search data.

DSL is structured in a way that enables:
- Full-text search
- Structured filtering
- Complex aggregations (like averages, histograms, and term counts)
- Boolean logic (`must`, `should`, `must_not`)
- Nested and range queries

This powerful query language is a key feature that distinguishes ElasticSearch from traditional databases, making it ideal for full-text search, analytics, and log aggregation use cases.
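As a rough illustration of how these building blocks combine, the sketch below assembles a DSL request as a plain Python dictionary and prints the resulting JSON. The field names (`explanation`, `title`, `date`) are assumptions chosen for illustration, and nothing is sent to a server.

```python
import json

# A bool query mixing full-text search, structured filtering and boolean logic,
# plus an aggregation that counts matching documents per year.
dsl_request = {
    "size": 5,
    "query": {
        "bool": {
            "must": [{"match": {"explanation": "galaxy"}}],             # full-text, scored
            "filter": [{"range": {"date": {"gte": "2020-01-01"}}}],     # structured, not scored
            "must_not": [{"match": {"title": "comet"}}],                # exclusion
        }
    },
    "aggs": {
        "docs_per_year": {
            "date_histogram": {"field": "date", "calendar_interval": "year"}
        }
    },
}

print(json.dumps(dsl_request, indent=2))
```

With a live client, a dictionary like this could be passed as the request body of a search call.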



## Inverted indexes
An **inverted index** is a core data structure used by ElasticSearch to enable fast and efficient full-text searches.

An inverted index can be compared to the index at the back of a book:
- A regular (forward) index maps documents to the words (terms) they contain.
- An inverted index maps each word (term) to the list of documents that contain it.

<img src="https://i0.wp.com/spotintelligence.com/wp-content/uploads/2023/10/inverted-index.png?resize=1024%2C576&ssl=1" alt="Forward vs. inverted index" width="700" height="400"/>

(Image source: [How To Implement Inverted Indexing [Top 10 Tools & Future Trends]](https://spotintelligence.com/2023/10/30/inverted-indexing/) )

For example, in the forward index (left), each document ID points to the terms it contains, which is useful for storing what is in each document. In contrast, the inverted index (right) flips this around: each term points to the document IDs where it appears. This reversal enables fast full-text searches, since you can quickly find all documents containing a specific word like "Cat" (which appears in documents 1, 3, and 6).


An inverted index is composed of:
1. A **term dictionary** that maps terms to postings lists, implemented as a [finite state transducer](https://en.wikipedia.org/wiki/Finite-state_transducer)[<sup>1</sup>](#invertedindex);
2. **Postings lists** which are lists of document IDs in which the terms appear. There is additional metadata associated with the document IDs, such as:
    - *Term frequency*: How often the term appears in each document.
    - *Positions*: The positions within the document where the term appears (for phrase queries).
    - *Offsets*: Character offsets of the term in the original text (used, for example, for highlighting).

This reversal allows ElasticSearch to quickly find all documents that match a search term.
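The structure described above can be mimicked with a toy inverted index in Python. This is a minimal sketch, not how Lucene stores its data: the three sample documents are made up, the tokenizer is a plain lowercase split, and a dictionary stands in for the term dictionary.

```python
from collections import defaultdict

# Made-up sample documents: doc_id -> text
docs = {
    1: "The cat sat",
    2: "The dog barked",
    3: "A cat and a dog",
}


def build_inverted_index(documents):
    """Return a mapping term -> {doc_id: [positions]} (a toy postings list)."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


inverted = build_inverted_index(docs)

print(sorted(inverted["cat"]))  # [1, 3]  -> documents containing "cat"
print(inverted["a"][3])         # [0, 3]  -> positions of "a" in document 3
print(len(inverted["a"][3]))    # 2       -> term frequency of "a" in document 3
```

Looking up a term is a single dictionary access, regardless of how many documents exist. That is the property that makes full-text search fast.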



## Relevance scoring[<sup>2</sup>](#bm25)
ElasticSearch outputs results based on relevance. The relevance score is computed using the following function:

$$
\text{score}(D, T) = \sum_{i=1}^{n} \text{IDF}(t_i) \cdot \frac{f(t_i, D) \cdot (k_1 + 1)}{f(t_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
$$

where:
- $D$ is the document being scored;
- $T = (t_1, \dots, t_n)$ is the list of tokens of the query;
- $\text{IDF}(t_i)$ is the inverse document frequency (it measures how rare a term is across all documents, so rare terms weigh more);
- $f(t_i, D)$ is the query term's frequency within $D$;
- $|D|$ is the length of $D$ in tokens;
- $k_1$ is a parameter controlling how much the query term's frequency affects the score;
- $b$ is a normalization parameter controlling how much document length affects the score;
- $\text{avgdl}$ is the average document length.

The inverse document frequency is defined as follows:
$$\text{IDF}(t_i) = \ln \left( 1 + \frac{\text{docCount} - f(t_i) + 0.5}{f(t_i) + 0.5} \right)$$

where:
- $\text{docCount}$ is the total number of documents;
- $f(t_i)$ is the number of documents that contain the query term $t_i$.

This ranking function is known as the [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) ranking function.
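To see the formula at work, here is a minimal Python sketch that follows the two equations above term by term. It is not Lucene's actual implementation. The tiny tokenized corpus is made up, and the defaults `k1=1.2` and `b=0.75` match the default values ElasticSearch uses for BM25.

```python
import math


def idf(doc_count, docs_with_term):
    """Inverse document frequency, as defined above."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a list of query terms."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    length_norm = 1 - b + b * len(doc_tokens) / avgdl
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)                        # f(t_i, D)
        docs_with_term = sum(1 for d in corpus if term in d)  # f(t_i)
        score += idf(len(corpus), docs_with_term) * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Made-up, already tokenized corpus
corpus = [
    ["harvest", "moon", "over", "castle"],
    ["super", "moon", "vs", "micro", "moon"],
    ["comet", "over", "the", "sea"],
]

for doc in corpus:
    print(doc, round(bm25_score(["moon"], doc, corpus), 3))
```

The second document mentions "moon" twice and scores highest, while the third does not contain the term and scores zero.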

# Installation & configuration


## Docker

### Installing ElasticSearch[<sup>3</sup>](#elasticlab)
If you don't have Docker installed yet, you can download and install it from the [official website](https://www.docker.com/). 

Once Docker is running on your machine, launch ElasticSearch using the following command:

In [None]:
!docker run -p 127.0.0.1:9200:9200 -d --name elasticsearch \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "xpack.license.self_generated.type=trial" \
  -v "elasticsearch-data:/usr/share/elasticsearch/data" \
  docker.elastic.co/elasticsearch/elasticsearch:8.15.0


## Dependencies  
Let's install all the necessary Python packages we will be using throughout this tutorial.


In [1]:
# requests       → to interact with the ElasticSearch REST API
# elasticsearch  → official ElasticSearch Python client

!pip install requests elasticsearch==8.15.0






## Connection

In [7]:
from pprint import pprint
from elasticsearch import Elasticsearch, helpers

# The Docker container can take around 30 seconds to start;
# if the connection fails, wait a little and re-run this cell
es = Elasticsearch('http://localhost:9200')
info = es.info()

print('Connected to ElasticSearch!')
pprint(info.body)

Connected to ElasticSearch!
{'cluster_name': 'docker-cluster',
 'cluster_uuid': '3XcS5eLWRbqa3dMix8YLlQ',
 'name': 'f5202efe872c',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2024-08-05T10:05:34.233336849Z',
             'build_flavor': 'default',
             'build_hash': '1a77947f34deddb41af25e6f0ddb8e830159c179',
             'build_snapshot': False,
             'build_type': 'docker',
             'lucene_version': '9.11.1',
             'minimum_index_compatibility_version': '7.0.0',
             'minimum_wire_compatibility_version': '7.17.0',
             'number': '8.15.0'}}


## Importing data with the Bulk API

To efficiently load a large dataset[<sup>4</sup>](#freecodecamp) into ElasticSearch, we use the **Bulk API**. This method allows us to insert multiple documents in a single request, which is much faster and more efficient than indexing documents one by one. In this example, we will import the contents of our `apod.json` file—where each element is a document—into a new index called `apod`.

In [8]:
import json

# Open the file
with open("apod.json", "r", encoding="utf-8") as f:  # the data contains non-ASCII characters
    data = json.load(f)

# Prepare the actions for the bulk
actions = [
    {
        "_index": "apod",
        # Use the title as the document id. It must be unique,
        # otherwise a later document overwrites an earlier one.
        "_id": doc["title"],
        "_source": doc
    }
    for doc in data
]

# Import the data in bulk
try:
    helpers.bulk(es, actions)
    print("Bulk import complete!")
except Exception as e:
    print(e)

Bulk import complete!
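For larger files, building the whole list of actions in memory may not be desirable. One possible variant, sketched below, produces the actions lazily with a generator. `generate_actions` is a hypothetical helper name and the two sample documents are made up. The sketch itself does not contact ElasticSearch.

```python
def generate_actions(docs, index_name):
    """Yield one bulk action per document, using the title as the document id."""
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": doc["title"],
            "_source": doc,
        }


# Made-up sample documents, standing in for the contents of apod.json
sample_docs = [
    {"title": "Doc A", "date": "2024-01-01"},
    {"title": "Doc B", "date": "2024-01-02"},
]

preview = list(generate_actions(sample_docs, "apod"))
print(preview[0])

# With a live client `es`, the generator could be passed directly:
# helpers.bulk(es, generate_actions(data, "apod"))
```

`helpers.bulk` accepts any iterable of actions, so a generator can be used in place of a list.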


## Basic queries in ElasticSearch

ElasticSearch exposes a **RESTful API**, which means you interact with it using standard HTTP methods. Here are the most common operations:

- **GET**: Read a document or perform a search
- **POST**: Add a new document
- **PUT**: Create or replace a document or an index
- **DELETE**: Remove a document or an index

### GET method

The **GET method** retrieves a document from the index by its id. If no document with the specified id exists, the Python client raises a `NotFoundError`.

In [9]:
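from elasticsearch import NotFoundError

try:
    doc = es.get(index="apod", id="A Hazy Harvest Moon")
    pprint(doc['_source'])

except NotFoundError:
    # Catch only the "missing document" case instead of every exception
    print("A document with this id doesn't exist!")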

{'authors': 'Petr Horálek, Institute of Physics in Opava\n',
 'date': '2024-09-20',
 'explanation': "Explanation: For northern hemisphere dwellers, September's "
                'Full Moon was the Harvest Moon. On September 17/18 the sunlit '
                "lunar nearside passed into shadow, just grazing Earth's "
                "umbra, the planet's dark, central shadow cone, in a partial "
                'lunar eclipse. Over the two and a half hours before dawn a '
                'camera fixed to a tripod was used to record this series of '
                'exposures as the eclipsed Harvest Moon set behind Spiš Castle '
                'in the hazy morning sky over eastern Slovakia. Famed in '
                'festival, story, and song, Harvest Moon is just the '
                'traditional name of the full moon nearest the autumnal '
                'equinox. According to lore the name is a fitting one. Despite '
                'the diminishing daylight hours as the growing se

### POST method

The **POST method** is used to create a new document. When `index()` is called without an id, ElasticSearch generates one automatically (so this document's id is not its title, unlike the documents imported earlier). In the ElasticSearch Python client, both POST and PUT (described next) are exposed through the same `index()` method.

In [10]:
from datetime import datetime

new_id = "A New APOD"

new_doc = {
    "date": datetime.now().strftime("%Y-%m-%d"),
    "title": new_id,
    "explanation": "This is a new document added via POST.",
    "image_url": "https://apod.nasa.gov/apod/image/2410/new_apod.jpg",
    "authors": "Mehdi Boustani"
}

res = es.index(index="apod", document=new_doc)

print("Document added successfully")

Document added successfully


### PUT method

The **PUT method** is used to create or replace a document at a specified id. If a document with that id already exists, it will be overwritten.

In [11]:
replaced_id = "Replaced APOD"

doc = {
    "date": "2024-10-02",
    "title": replaced_id,
    "explanation": "This document replaces any previous one with the same ID.",
    "image_url": "https://apod.nasa.gov/apod/image/2410/new_apod.jpg",
    "authors": "Mehdi Boustani"
}

# Create the document under the id "Replaced APOD"
# (an existing document with this id would be overwritten)
es.index(index="apod", id=doc["title"], document=doc)

print(f"Document with id '{replaced_id}' created or replaced")

Document with id 'Replaced APOD' created or replaced


### DELETE method

The **DELETE method** is used to delete a document by its id. The id must be known and specified in the request.

In [12]:
from elasticsearch import NotFoundError

try:
    es.delete(index="apod", id=replaced_id)
    print(f"Document with ID {replaced_id} deleted.")

except NotFoundError:
    print("The specified document to delete doesn't exist")

# Delete the entire index (be careful, this command is irreversible)
# es.indices.delete(index="apod")
# print("Index 'apod' deleted.")

Document with ID Replaced APOD deleted.


## DSL vs SQL

ElasticSearch doesn't use traditional SQL language to query data, but rather a **DSL (Domain Specific Language)** based on JSON.

### Main differences

1. **Query Structure**
   - **SQL**: Uses a strict syntax with clauses like `SELECT`, `FROM`, `WHERE`
   - **DSL**: Uses a nested JSON format, offering more flexibility in how queries are expressed

2. **Search Types**
   - **SQL**: Focuses mainly on exact matches
   - **DSL**: Supports advanced search techniques like full-text search, fuzzy matching, and range queries

Let's explore practical examples comparing SQL concepts with ElasticSearch's DSL.

### Full-text search

Let's run a full-text search to find documents whose "title" field matches the word "moon", then print the content of each matching document.

In [14]:
query = {
    "query": {
        "match": {
            "title": "moon"
        }
    }
}

response = es.search(index="apod", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"])

{'date': '2016-11-13', 'title': 'Super Moon vs Micro Moon', 'explanation': "Explanation: What is so super about tomorrow's supermoon? Tomorrow, a full moon will occur that appears slightly larger and brighter than usual. The reason is that the Moon's fully illuminated phase occurs within a short time from perigee - when the Moon is its closest to the Earth in its elliptical orbit. Although the precise conditions that define a supermoon vary, tomorrow's supermoon will undoubtedly qualify because it will be the closest, largest, and brightest full moon in over 65 years. One reason supermoons are popular is because they are so easy to see -- just go outside at sunset and watch an impressive full moon rise! Since perigee actually occurs tomorrow morning, tonight's full moon, visible starting at sunset, should also be impressive. Pictured here, a supermoon from 2012 is compared to a micromoon -- when a full Moon occurs near the furthest part of the Moon's orbit -- so that it appears smaller

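**Closest SQL equivalent:** `SELECT * FROM apod WHERE title LIKE '%moon%'` (only an approximation: `match` compares analyzed tokens, whereas `LIKE` matches raw substrings)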

### Exact match with term

Now, we search for documents where the title exactly matches "A Hazy Harvest Moon" using the keyword field (i.e., not full-text), and print the matching documents.

In [13]:
query = {
    "query": {
        "term": {
            "title.keyword": {
                "value": "A Hazy Harvest Moon"
            }
        }
    }
}

response = es.search(index="apod", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"])


{'date': '2024-09-20', 'title': 'A Hazy Harvest Moon', 'explanation': "Explanation: For northern hemisphere dwellers, September's Full Moon was the Harvest Moon. On September 17/18 the sunlit lunar nearside passed into shadow, just grazing Earth's umbra, the planet's dark, central shadow cone, in a partial lunar eclipse. Over the two and a half hours before dawn a camera fixed to a tripod was used to record this series of exposures as the eclipsed Harvest Moon set behind Spiš Castle in the hazy morning sky over eastern Slovakia. Famed in festival, story, and song, Harvest Moon is just the traditional name of the full moon nearest the autumnal equinox. According to lore the name is a fitting one. Despite the diminishing daylight hours as the growing season drew to a close, farmers could harvest crops by the light of a full moon shining on from dusk to dawn. This September's Harvest Moon was also known to some as a supermoon, a term becoming a traditional name for a full moon near perige

**SQL equivalent:** `SELECT * FROM apod WHERE title = 'A Hazy Harvest Moon'`
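The different behavior of `match` on `title` and `term` on `title.keyword` comes from how each field is indexed. The sketch below imitates this with a deliberately simplified analyzer (lowercase, then split on non-word characters). It is an assumption-laden stand-in for the real standard analyzer, and all function names are hypothetical.

```python
import re


def toy_analyze(text):
    """Simplified stand-in for an analyzer: lowercase, split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def match_on_text(query, stored_title):
    """Imitates `match` on a text field: analyzed query tokens vs. analyzed stored tokens."""
    stored_tokens = toy_analyze(stored_title)
    return any(token in stored_tokens for token in toy_analyze(query))


def term_on_keyword(value, stored_title):
    """Imitates `term` on a keyword field: the whole string is a single, unmodified term."""
    return value == stored_title


title = "A Hazy Harvest Moon"
print(toy_analyze(title))                               # ['a', 'hazy', 'harvest', 'moon']
print(match_on_text("moon", title))                     # True
print(term_on_keyword("moon", title))                   # False
print(term_on_keyword("A Hazy Harvest Moon", title))    # True
```

A single word is enough for the analyzed field, while the keyword field only matches the complete, exact string.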

### Range Query (Numeric/date filtering)

We can also search for documents with a date between January 1st and January 15th, 2020, and print each matching document’s content.

In [15]:
query = {
    "query": {
        "range": {
            "date": {
                "gte": "2020-01-01",
                "lte": "2020-01-15"
            }
        }
    }
}

response = es.search(index="apod", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"])


{'date': '2020-01-15', 'title': 'Iridescent Clouds over Sweden', 'explanation': 'Explanation: Why would these clouds multi-colored? A relatively rare phenomenon in clouds known as iridescence can bring up unusual colors vividly or even a whole spectrum of colors simultaneously. These polar stratospheric clouds clouds, also known as nacreous and mother-of-pearl clouds, are formed of small water droplets of nearly uniform size. When the Sun is in the right position and, typically, hidden from direct view, these thin clouds can be seen significantly diffracting sunlight in a nearly coherent manner, with different colors being deflected by different amounts. Therefore, different colors will come to the observer from slightly different directions. Many clouds start with uniform regions that could show iridescence but quickly become too thick, too mixed, or too angularly far from the Sun to exhibit striking colors. The featured image and an accompanying video were taken late last year over O

**SQL equivalent:** `SELECT * FROM apod WHERE date BETWEEN '2020-01-01' AND '2020-01-15'`

### Fuzzy Query (Typo-tolerant search)

And finally, ElasticSearch allows us to search for documents where the title **approximately** matches "Galaxi", using fuzzy matching to handle potential typos, and to print the matching documents.

In [16]:
# Typo-tolerant search with fuzzy
query = {
    "query": {
        "fuzzy": {
            "title": {
                "value": "Galaxi",
                "fuzziness": "AUTO"
            }
        }
    }
}

response = es.search(index="apod", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"])

{'date': '2020-01-25', 'title': "Rubin's Galaxy", 'explanation': "Explanation: In this Hubble Space Telescope image the bright, spiky stars lie in the foreground toward the heroic northern constellation Perseus and well within our own Milky Way galaxy. In sharp focus beyond is UGC 2885, a giant spiral galaxy about 232 million light-years distant. Some 800,000 light-years across compared to the Milky Way's diameter of 100,000 light-years or so, it has around 1 trillion stars. That's about 10 times as many stars as the Milky Way. Part of a current investigation to understand how galaxies can grow to such enormous sizes, UGC 2885 was also part of astronomer Vera Rubin's pioneering study of the rotation of spiral galaxies. Her work was the first to convincingly demonstrate the dominating presence of dark matter in our universe.", 'image_url': 'https://apod.nasa.gov/apod/image/2001/RubinsGalaxy_hst1024.jpg', 'authors': 'NASA, ESA, B. Holwerda (University of Louisville)\n'}
{'date': '2022-07

**SQL equivalent:** No direct equivalent. The closest idea would be a `LIKE` that tolerates typos.
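Fuzzy matching is based on edit distance, that is, the number of single-character changes needed to turn one term into another. The sketch below computes a plain Levenshtein distance and applies the length thresholds documented for `"fuzziness": "AUTO"` (0 edits for terms of 1 or 2 characters, 1 edit for 3 to 5 characters, 2 edits beyond that). It is a simplification: ElasticSearch also counts a transposition of two adjacent characters as a single edit by default, the terms are assumed to be lowercase, and both function names are hypothetical.

```python
def levenshtein(a, b):
    """Number of insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Maximum number of edits allowed by the AUTO setting for a given term length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


query_term = "galaxi"
for candidate in ["galaxy", "galaxies", "moon"]:
    distance = levenshtein(query_term, candidate)
    print(candidate, distance, distance <= auto_fuzziness(query_term))
```

Under these assumptions, "galaxy" (1 edit) and "galaxies" (2 edits) fall within the allowed distance for a six-letter term, while "moon" does not.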

# ElasticSearch as a search engine

The goal of ElasticSearch is to empower client workflows to retrieve data from your database using powerful, flexible queries. To customize search behavior, ElasticSearch offers several fine-tuning parameters. In this section, we will explore the **filter**, **must**, **must_not**, and **should** clauses[<sup>5</sup>](#elasticlabBoolQueries).

A key concept here is document **scoring**[<sup>6</sup>](#elasticscore). When you run a query, ElasticSearch calculates a relevance score for each candidate document and orders results accordingly. Then, it returns the top n documents based on that ranking. To further control how scores influence ordering, you can use the boost parameter to adjust relevance and achieve custom ranking.
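As an illustration of the boost parameter, the sketch below builds a query in which a match in the title counts more than a match in the explanation. The boost value of 3.0 is an arbitrary choice for the example, and the dictionary is only constructed here, not sent to a server.

```python
import json

# Two optional ("should") clauses; the first one has its score contribution multiplied by 3.
boosted_query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"title": {"query": "comet", "boost": 3.0}}},
                {"match": {"explanation": {"query": "comet"}}},
            ]
        }
    }
}

print(json.dumps(boosted_query, indent=2))

# With a live client `es`, it could be run like the other queries in this tutorial:
# response = es.search(index="apod", body=boosted_query)
```

Documents matching either clause are returned, and those with "comet" in the title should tend to rank higher because that clause carries more weight.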

## Filter requests

When you apply a filter criterion to your query, you define one or more clauses that documents **must satisfy** to be included. Filters are **score-neutral**: they do not alter a document's relevance score, they only prune non-matching hits. Below we explore a selection of the most common filter clauses.

In [17]:
index_name = "apod"

queries = []

# This query will filter out documents that do not match the date "2024-09-27"
term_query = {
    "term": {"date": "2024-09-27"}
}
queries.append(("=== Term Query on date ===", term_query))

# This query will filter documents with a date between "2024-09-09" and "2024-09-30"
range_query = {
    "range": {
        "date": {"gte": "2024-09-09", "lte": "2024-09-30"}
    }
}
queries.append(("=== Range Query on date ===", range_query))

# This query will filter documents that have a non-null value for the field "note"
exists_query = {
    "exists": {"field": "note"}
}
queries.append(("=== Exists Query on note ===", exists_query))

# This query will filter documents whose "authors" field is exactly "David Martinez Delgado et al."
# (the stored value ends with a newline, hence the trailing "\n" in the query)
term_authors_query = {
    "term": {"authors.keyword": "David Martinez Delgado et al.\n"}
}
queries.append(("=== Exact Term Query on authors ===", term_authors_query))

# This query will filter documents that have a title starting with "Comet"
prefix_query = {
    "prefix": {"title.keyword": "Comet"}
}
queries.append(("=== Prefix Query on title ===", prefix_query))


for title, query in queries:
    print(title)
    response = es.search(index=index_name, body={"query": {"bool": {"filter": [query]}}})
    for hit in response["hits"]["hits"]:
        pprint(hit["_source"])
    print("\n")


=== Term Query on date ===
{'authors': 'David Martinez Delgado et al.\n',
 'date': '2024-09-27',
 'explanation': 'Explanation: The twenty galaxies arrayed in these panels are '
                'part of an ambitious astronomical survey of tidal stellar '
                'streams. Each panel presents a composite view; a deep, '
                'inverted image taken from publicly available imaging surveys '
                'of a field that surrounds a nearby massive galaxy image. The '
                'inverted images reveal faint cosmic structures, star streams '
                'hundreds of thousands of light-years across, that result from '
                'the gravitational disruption and eventual merger of satellite '
                'galaxies in the local universe. Such surveys of mergers and '
                'gravitational tidal interactions between massive galaxies and '
                'their dwarf satellites are crucial guides for current models '
                'of galaxy for

## Must requests

The must criterion works much like filter in that it first determines which records are eligible, with one key difference: a matching must clause **contributes to the document's relevance score**. You can include multiple must clauses, and they are combined with a **logical AND** (i.e., a document must satisfy all of them).

In [18]:
must_title = {
    "match": {"title": "Comet"}
}

must_explanation = {
    "match": {"explanation": "nebula"}
}

print("=== Must: Title Contains 'Comet' (size=2) ===")
res = es.search(
    index=index_name,
    # The size parameter limits the number of results returned. Default is 10.
    body={
        "size": 2,
        "query": {
            "bool": {
                "must": [must_title]
            }
        }
    }
)
for hit in res["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}")
    pprint(hit["_source"])


print("\n=== Must: Title Contains 'Comet' AND Explanation Contains 'nebula' (size=2) ===")
res = es.search(
    index=index_name,
    body={
        "size": 2,
        "query": {
            "bool": {
                "must": [
                    must_title,
                    must_explanation
                ]
            }
        }
    }
)
for hit in res["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}")
    pprint(hit["_source"])


=== Must: Title Contains 'Comet' (size=2) ===
_score=4.07
{'authors': 'Luc Perrot\n',
 'date': '2020-05-14',
 'explanation': 'Explanation: The pre-dawn hours of May 3rd were moonless as '
                'grains of cosmic dust streaked through southern skies above '
                'Reunion Island. Swept up as planet Earth plowed through dusty '
                'debris streams left behind periodic Comet 1/P Halley, the '
                'annual meteor shower is known as the Eta Aquarids. This '
                'inspired exposure captures a bright aquarid meteor flashing '
                'left to right over a sea of clouds. The meteor streak points '
                "back to the shower's radiant in the constellation Aquarius, "
                'well above the eastern horizon and off the top of the frame. '
                'Known for speed Eta Aquarid meteors move fast, entering the '
                'atmosphere at about 66 kilometers per second, visible at '
                'altitudes 

In the first query, every hit carries a non-zero _score, and hits are returned in descending score order, so the first document scores higher than the second. This shows that, unlike filter, the must clause contributes to relevance scoring.

## Must_not requests [<sup>7</sup>](#mustnot)

Although its name might imply the opposite of must, must_not actually behaves like a **negated filter**. Any document that matches a must_not clause is simply **removed** from the set of candidates, and, as with filter, the clause has no influence on the score of the remaining documents.

In [19]:
must_not_image = {
    "exists": {"field": "image_url"}
}

print("=== Example 1: must_not exists image_url (size=2) ===")
res1 = es.search(
    index=index_name,
    body={
        "size": 2,
        "query": {
            "bool": {
                "must_not": [must_not_image]
            }
        }
    }
)
for hit in res1["hits"]["hits"]:
    pprint(hit["_source"])


filter_date = {
    "range": {
        "date": {"gte": "2024-09-09", "lte": "2024-09-30"}
    }
}
must_not_comet = {
    "prefix": {"title.keyword": "Comet"}
}

print("\n=== Example 2: range on date AND must_not prefix 'Comet' (size=2) ===")
res2 = es.search(
    index=index_name,
    body={
        "size": 2,
        "query": {
            "bool": {
                "filter":   [filter_date],
                "must_not": [must_not_comet]
            }
        }
    }
)
for hit in res2["hits"]["hits"]:
    # Only docs dated 2024-09-09 to 2024-09-30 whose title does NOT start with "Comet"
    pprint(hit["_source"])


=== Example 1: must_not exists image_url (size=2) ===
{'authors': 'Bruno Rota Sargi\n',
 'date': '2024-11-22',
 'explanation': 'Explanation: Braided and serpentine filaments of glowing gas '
                "suggest this nebula's popular name, The Medusa Nebula. Also "
                'known as Abell 21, this Medusa is an old planetary nebula '
                'some 1,500 light-years away in the constellation Gemini. Like '
                'its mythological namesake, the nebula is associated with a '
                'dramatic transformation. The planetary nebula phase '
                'represents a final stage in the evolution of low mass stars '
                'like the sun as they transform themselves from red giants to '
                'hot white dwarf stars and in the process shrug off their '
                'outer layers. Ultraviolet radiation from the hot star powers '
                "the nebular glow. The Medusa's transforming star is the faint "
                'one near t

## Should requests

The should clause in a bool query implements a **logical OR** across its clauses:

1. **Standalone should (no must clauses)**  
   - A document only needs to match at least one should clause to be included.

2. **Combined must + should**  
   - All must clauses remain required: a document has to satisfy every one of them.  
   - Each should clause that matches simply boosts the document’s relevance score; non-matching should clauses do not exclude the document.
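The rules for must, filter, must_not, and should can be summarised in a small in-memory model. This is only a sketch: `toy_bool`, `title_has`, and `text_has` are hypothetical helpers, the sample documents are invented, and the score is a toy value (1.0 per matching scoring clause), not the BM25 score ElasticSearch actually computes.

```python
def toy_bool(doc, must=(), filters=(), must_not=(), should=()):
    """Return a toy score for doc, or None if the bool query rejects it.

    Each clause is a predicate doc -> bool. Assumption: every matching
    scoring clause adds 1.0; real ElasticSearch uses BM25 scores instead.
    """
    if not all(clause(doc) for clause in must):
        return None                      # must: logical AND
    if not all(clause(doc) for clause in filters):
        return None                      # filter: logical AND, never scored
    if any(clause(doc) for clause in must_not):
        return None                      # must_not: any match excludes
    should_hits = sum(1 for clause in should if clause(doc))
    if should and not (must or filters) and should_hits == 0:
        return None                      # standalone should: at least one match
    return float(len(must) + should_hits)


def title_has(word):
    return lambda doc: word.lower() in doc["title"].lower()


def text_has(word):
    return lambda doc: word.lower() in doc["explanation"].lower()


docs = [
    {"title": "Comet over the Alps", "explanation": "A bright comet in the evening sky."},
    {"title": "Orion Nebula", "explanation": "A stellar nursery inside a nebula."},
    {"title": "Helix Nebula", "explanation": "A comet-like knot of gas."},
    {"title": "Full Moon", "explanation": "The Moon rising at dusk."},
]

# Standalone should: at least one clause has to match.
standalone = [toy_bool(d, should=[title_has("comet"), text_has("nebula")]) for d in docs]
# must + should: must decides eligibility, should only adds to the score.
combined = [toy_bool(d, must=[title_has("nebula")], should=[text_has("comet")]) for d in docs]
print(standalone)  # [1.0, 1.0, None, None]
print(combined)    # [None, 1.0, 2.0, None]
```

In the combined case, both nebula titles are eligible, but only the one whose explanation also mentions a comet receives the extra score.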


In [20]:
# This query returns documents that match at least one of the specified conditions
query1 = {
    "size": 2,
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Comet" } },
                { "match": { "explanation": "nebula" } },
                { "match": { "authors": "David Martinez Delgado et al." } }
            ],
        }
    }
}

print("=== Query 1: Standalone should (title, explanation OR authors) ===")
res1 = es.search(index=index_name, body=query1)
for hit in res1["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}")
    pprint(hit["_source"])


# This query keeps only documents dated between 2024-09-20 and 2024-09-30 and boosts the score if the title or explanation matches
query2 = {
    "size": 2,
    "query": {
        "bool": {
            "must": [
                { "range": { "date": { "gte": "2024-09-20", "lte": "2024-09-30" } } }
            ],
            "should": [
                { "match": { "title": "Comet" } },
                { "match": { "explanation": "comet" } }
            ]
        }
    }
}

print("\n=== Query 2: must range 2024-09-20 to 2024-09-30 + should (boost if title/explanation) ===")
res2 = es.search(index=index_name, body=query2)
for hit in res2["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}")
    pprint(hit["_source"])


=== Query 1: Standalone should (title, explanation OR authors) ===
_score=24.17
{'authors': 'David Martinez Delgado et al.\n',
 'date': '2024-09-27',
 'explanation': 'Explanation: The twenty galaxies arrayed in these panels are '
                'part of an ambitious astronomical survey of tidal stellar '
                'streams. Each panel presents a composite view; a deep, '
                'inverted image taken from publicly available imaging surveys '
                'of a field that surrounds a nearby massive galaxy image. The '
                'inverted images reveal faint cosmic structures, star streams '
                'hundreds of thousands of light-years across, that result from '
                'the gravitational disruption and eventual merger of satellite '
                'galaxies in the local universe. Such surveys of mergers and '
                'gravitational tidal interactions between massive galaxies and '
                'their dwarf satellites are crucial guide

## Boosting request

When you want to increase the relevance of certain documents in ElasticSearch, you can apply a **boost factor** to a query clause. If the clause is satisfied (i.e., it matches a document), its score contribution will be multiplied by the boost factor, increasing the document’s overall relevance in the results.

However, if you want to decrease the relevance of certain documents (i.e., apply a *negative boost*), you need to use a boosting query. Boosting queries consist of two main components:

- **`positive`**: the main query whose matches should be scored normally (and optionally boosted).
- **`negative`**: a query whose matches should reduce the score of the document.

Additionally, the query includes a **`negative_boost`** parameter, which controls how much the negative matches reduce the document’s final score. This value must be between `0` and `1`, where:

- A value close to `0` significantly reduces the score of documents matching the negative clause.
- A value of `1` means no penalty is applied.

This mechanism is useful when you want to penalize certain documents without completely excluding them from the results.
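The arithmetic behind a boosting query is short enough to write down. The function below is a sketch of the scoring rule only (`boosting_score` is a hypothetical name, and the input scores are invented): a document that matches the positive query keeps its score, unless it also matches the negative query, in which case the score is multiplied by `negative_boost`.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Final score of a document that matched the positive query."""
    if not 0.0 <= negative_boost <= 1.0:
        raise ValueError("negative_boost must be between 0 and 1")
    if matches_negative:
        return positive_score * negative_boost  # demoted, but still returned
    return positive_score


print(boosting_score(10.0, matches_negative=False, negative_boost=0.5))  # 10.0
print(boosting_score(10.0, matches_negative=True, negative_boost=0.5))   # 5.0
print(boosting_score(10.0, matches_negative=True, negative_boost=1.0))   # 10.0
```

Documents that do not match the positive query are not returned at all, so the negative clause can only reorder results, never add or remove them.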

In [21]:
#----- THIS CODE WAS WRITTEN WITH THE HELP OF CHATGPT -----#

# 1️⃣ Positive boost: increase score when explanation contains "meteor"
#    Uses `boost` directly on a match query.
print("=== Positive Boost: explanation contains 'meteor' (boost=2.0) ===")
pos_query = {
    "size": 2,
    "query": {
        "match": {
            "explanation": {
                "query": "meteor",
                "boost": 2.0      # positive boost on match
            }
        }
    }
}
res = es.search(index=index_name, body=pos_query)
for hit in res["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}") 
    pprint(hit["_source"])

# 2️⃣ Negative boost: de-emphasize docs where explanation contains "comet"
#    Uses the boosting query with match_all as the positive clause.
print("\n=== Negative Boost: explanation contains 'comet' (negative_boost=0.5) ===")
neg_query = {
    "size": 2,
    "query": {
        "boosting": {
            "positive": { "match_all": {} },        # match everything
            "negative": {                           # demote these
                "match": { "explanation": "comet" }
            },
            "negative_boost": 0.5                   # reduce score by 50% on match
        }
    }
}
res = es.search(index=index_name, body=neg_query)
for hit in res["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}")
    pprint(hit["_source"])

# 3️⃣ Combined boost: promote "meteor" matches and demote "comet" matches
print("\n=== Combined Boost: x2.0 for 'meteor', x0.5 for 'comet' ===")
both_query = {
    "size": 2,
    "query": {
        "boosting": {
            "positive": {
                "match": {
                    "explanation": {
                        "query": "meteor",
                        "boost": 2.0
                    }
                }
            },
            "negative": {
                "match": {
                    "explanation": "comet"
                }
            },
            "negative_boost": 0.5
        }
    }
}
res = es.search(index=index_name, body=both_query)
for hit in res["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}")
    pprint(hit["_source"])


=== Positive Boost: explanation contains 'meteor' (boost=2.0) ===
_score=10.95
{'authors': 'Fritz Helmut Hemmerich\n',
 'date': '2016-08-17',
 'explanation': "Explanation: What's that green streak in front of the "
                'Andromeda galaxy? A meteor. While photographing the Andromeda '
                'galaxy last Friday, near the peak of the Perseid Meteor '
                'Shower, a sand-sized rock from deep space crossed right in '
                "front of our Milky Way Galaxy's far-distant companion. The "
                'small meteor took only a fraction of a second to pass through '
                'this 10-degree field. The meteor flared several times while '
                "braking violently upon entering Earth's atmosphere. The green "
                "color was created, at least in part, by the meteor's gas "
                'glowing as it vaporized. Although the exposure was timed to '
                'catch a Perseids meteor, the orientation of the imaged strea

# Advanced features

In this section, we will dive into a handful of ElasticSearch's most powerful advanced features. While ElasticSearch offers a wealth of capabilities beyond what we cover here, the three topics we will focus on are:

- **Aggregations**: Real-time analytics and data summarization  
- **Highlighting**: Extracting and emphasizing matching text snippets  
- **Autocomplete**: Instant-search experiences via suggesters and search-as-you-type  

Each feature can be mixed and matched or extended with dozens of other ElasticSearch tools to build rich, high-performance search applications tailored to your needs.


## Aggregations [<sup>8</sup>](#aggregations)

Aggregation queries compute analytics over the set of matched records rather than listing individual documents (setting size to 0 in the request suppresses the hits entirely). ElasticSearch supports three main aggregation types:

- **Bucket aggregations** group documents into “buckets” based on shared values (e.g., terms, date intervals, numeric ranges, or histograms).  
- **Metric aggregations** calculate statistics (such as count, sum, average, min/max) over those documents.  
- **Pipeline aggregations** take the output of one or more aggregations and run further calculations on it, enabling you to chain operations together.

In this tutorial, we will use those aggregations to show how you can both segment your data and then perform successive analyses on those segments.  
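Before running the real request, it can help to see the three aggregation types as plain Python steps. The sketch below emulates a monthly bucket, a per-bucket count, and a cumulative sum on a handful of invented dates. It is a simplified model of the idea, not what ElasticSearch executes internally.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample of document dates (not taken from the APOD index).
dates = ["2015-01-03", "2015-01-17", "2015-02-01",
         "2015-02-09", "2015-02-28", "2015-03-05"]

# Bucket step: group documents by month (the role of date_histogram).
# Metric step: count the documents in each bucket (the role of value_count).
monthly_count = Counter(date[:7] for date in dates)
months = sorted(monthly_count)

# Pipeline step: running total over the bucket metrics (the role of cumulative_sum).
cumulative = list(accumulate(monthly_count[month] for month in months))

for month, total in zip(months, cumulative):
    print(f"{month} -> count: {monthly_count[month]}, cumulative: {total}")
```

The pipeline step never looks at the documents themselves, only at the numbers produced by the metric step, which is what distinguishes pipeline aggregations from the other two types.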

In [22]:
# We want to get the number of documents per month
# To do so, we will use a date_histogram aggregation to group documents by month
# and a cumulative_sum aggregation to get the cumulative count of documents over time.
body = {
    "size": 0,
    "aggs": {
        "entries_per_month": {
            # Bucket agg that uses a date_histogram to group documents by month
            "date_histogram": {
                "field":             "date",
                "calendar_interval": "month",
                "format":            "yyyy-MM"
            },
            # Sub-aggregations computed inside each monthly bucket
            "aggs": {
                # Metric agg that counts the number of documents in each month
                "monthly_count": {
                    "value_count": { "field": "date" }
                },
                # Pipeline agg that calculates the cumulative sum of the monthly counts
                "cumulative_entries": {
                    "cumulative_sum": {
                        "buckets_path": "monthly_count"
                    }
                }
            }
        }
    }
}

response = es.search(index="apod", body=body)

for bucket in response["aggregations"]["entries_per_month"]["buckets"]:
    month = bucket["key_as_string"]
    count = bucket["monthly_count"]["value"]
    cum   = bucket["cumulative_entries"]["value"]
    print(f"{month} → count: {count}, cumulative: {cum}")


2015-01 → count: 30, cumulative: 30.0
2015-02 → count: 27, cumulative: 57.0
2015-03 → count: 30, cumulative: 87.0
2015-04 → count: 28, cumulative: 115.0
2015-05 → count: 30, cumulative: 145.0
2015-06 → count: 25, cumulative: 170.0
2015-07 → count: 29, cumulative: 199.0
2015-08 → count: 22, cumulative: 221.0
2015-09 → count: 29, cumulative: 250.0
2015-10 → count: 28, cumulative: 278.0
2015-11 → count: 28, cumulative: 306.0
2015-12 → count: 26, cumulative: 332.0
2016-01 → count: 31, cumulative: 363.0
2016-02 → count: 25, cumulative: 388.0
2016-03 → count: 28, cumulative: 416.0
2016-04 → count: 27, cumulative: 443.0
2016-05 → count: 27, cumulative: 470.0
2016-06 → count: 25, cumulative: 495.0
2016-07 → count: 30, cumulative: 525.0
2016-08 → count: 30, cumulative: 555.0
2016-09 → count: 28, cumulative: 583.0
2016-10 → count: 26, cumulative: 609.0
2016-11 → count: 28, cumulative: 637.0
2016-12 → count: 25, cumulative: 662.0
2017-01 → count: 30, cumulative: 692.0
2017-02 → count: 23, cumulat

## Highlighting [<sup>9</sup>](#highlighting)

Highlighting in ElasticSearch works by surrounding each occurrence of a query term in your document text with customizable tags (by default `<em>` and `</em>`, but you can use any HTML or marker you like). When you include a highlight section in your search request, ElasticSearch will:

1. Analyze the specified field(s) to find where your query terms fall.
2. Extract short snippets (fragments) around each match.
3. Wrap each matching term in your chosen pre_tags and post_tags.
4. Return those snippets alongside each hit, in a highlight block attached to that hit.

This makes it easy to show users exactly where, and in what context, their search terms appeared.
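The tagging step can be imitated with a regular expression. The function below is a toy sketch (`toy_highlight` is a hypothetical name and the sentence is invented): it matches whole words case-insensitively, whereas the real highlighter relies on the field's analyzer and also handles fragment extraction.

```python
import re


def toy_highlight(text, terms, pre_tag="<em>", post_tag="</em>"):
    """Wrap whole-word, case-insensitive occurrences of terms in tags."""
    if not terms:
        return text
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre_tag}{m.group(0)}{post_tag}", text)


text = "Comet Halley is a periodic comet visible from Earth."
default_tags = toy_highlight(text, ["comet"])
marked = toy_highlight(text, ["comet"], pre_tag="<mark>", post_tag="</mark>")
print(default_tags)
print(marked)
```

Returning the whole text with tags inserted corresponds to requesting zero fragments; with fragments enabled, only short windows around each match would be returned.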

In [None]:
query = {
  "query": {
    "match": {
      "explanation": "comet"
    }
  },
  "highlight": {
    # Customize the highlight tags
    "pre_tags":  ["<mark>"],
    "post_tags": ["</mark>"],
    # Specify which fields we want to highlight
    "fields": {
      "explanation": {
        # Skip fragmentation and return the whole text with each match highlighted
        "number_of_fragments": 0, 
      }
    }
  }
}

response = es.search(index="apod", body=query)

print("=== Highlighted Results ===")
for hit in response["hits"]["hits"]:
    print(f"Title: {hit['_source']['title']}")
    print("Highlighted explanation:")
    for fragment in hit["highlight"]["explanation"]:
        print(fragment)
    print("\n")


## Autocomplete [<sup>10</sup>](#autocomplete)

The final feature we will cover is **autocomplete**. ElasticSearch offers four different approaches, but we will choose the simplest one with the least setup: the search-as-you-type mechanism.

The first step is to reindex your data into a new index whose mapping declares the field type required for this feature (we could also have declared it directly in the mapping of our first index).


In [23]:
if es.indices.exists(index="apod_v2"):
    es.indices.delete(index="apod_v2")  

# New index mapping with search_as_you_type
mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type":"search_as_you_type", # Enable autocomplete search
                "max_shingle_size": 3
            },
            "date":        { "type": "date",   "format": "yyyy-MM-dd" },
            "explanation": { "type": "text" },
            "image_url":   { "type": "keyword" },
            "authors":     { "type": "text" }
        }
    }
}

es.indices.create(index="apod_v2", body=mapping)

es.reindex(
    body={
        "source": { "index": "apod" },
        "dest":   { "index": "apod_v2" }
    },
    wait_for_completion=True,
)

print("===Reindexing complete===")


===Reindexing complete===
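A `search_as_you_type` field indexes the title several times: once as plain terms, once per shingle size in the `._2gram` and `._3gram` subfields, and once more with edge n-grams in `._index_prefix`. The sketch below shows what shingles look like for one title. It assumes a simple lowercase, whitespace-based tokenizer, and `shingles` is a hypothetical helper, so the real analyzer output may differ in details.

```python
def shingles(text, size):
    """Return the word n-grams (shingles) of the given size."""
    tokens = text.lower().split()  # assumption: whitespace tokenizer + lowercase
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


title = "The Great Orion Nebula"
print(shingles(title, 1))  # ['the', 'great', 'orion', 'nebula']
print(shingles(title, 2))  # ['the great', 'great orion', 'orion nebula']
print(shingles(title, 3))  # ['the great orion', 'great orion nebula']
```

Indexing these word sequences ahead of time is what lets a partially typed phrase such as "great ori" match quickly at query time.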


To demonstrate search-as-you-type in this notebook, we will embed an ***ipywidgets.Combobox*** as our search bar. Under the hood, each time you type a character, a tiny Python function sends a **bool_prefix** query to our **search_as_you_type** index and updates the dropdown with the matching titles. Note that you might need to press the enter key once to activate the search bar. You may need to restart your kernel in order for the search bar to work.

In [24]:
%pip install ipywidgets jupyterlab_widgets                      
%pip install jupyterlab
!jupyter labextension install @jupyter-widgets/jupyterlab-manager

Note: you may need to restart the kernel to use updated packages.



(Deprecated) Installing extensions with the jupyter labextension install command is now deprecated and will be removed in a future major version of JupyterLab.

Users should manage prebuilt extensions with package managers like pip and conda, and extension authors are encouraged to distribute their extensions as prebuilt packages 


In [25]:
# Realized with the help of chatGPT
import ipywidgets as widgets
from IPython.display import display
import threading
import time

# Create the Search bar
combo = widgets.Combobox(
    placeholder='Type to search titles…',
    options=[],
    description='Search:',
    ensure_option=False,
    continuous_update=True
)
display(combo)

last_call = 0
lock = threading.Lock()

def fetch_suggestions(text):
    # ElasticSearch bool_prefix query against search-as-you-type
    body = {
        "query": {
            "multi_match": {
                "query": text,
                "type":  "bool_prefix",
                "fields": ["title","title._2gram","title._3gram"]
            }
        },
        "_source": ["title"],
        "size": 5
    }
    resp = es.search(index="apod_v2", body=body)
    return [hit["_source"]["title"] for hit in resp["hits"]["hits"]]

def on_value_change(change):
    global last_call
    value = change['new']
    now = time.time()
    with lock:
        if now - last_call < 0.3:
            return
        last_call = now
    if value:
        combo.options = fetch_suggestions(value)
    else:
        combo.options = []

combo.observe(on_value_change, names='value')

Combobox(value='', description='Search:', placeholder='Type to search titles…')
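The handler above throttles requests: any change arriving less than 0.3 seconds after the last accepted one is ignored, so the final keystroke of a fast burst may never trigger a query. A possible alternative is a debounce, which waits for a pause and then sends only the latest value. The sketch below is one way to do it with `threading.Timer`. `Debouncer` is a hypothetical class, and the example feeds it simulated keystrokes instead of a real widget.

```python
import threading
import time


class Debouncer:
    """Run `callback` with the latest value once input has paused for `delay` seconds."""

    def __init__(self, callback, delay=0.3):
        self.callback = callback
        self.delay = delay
        self._timer = None
        self._lock = threading.Lock()

    def submit(self, value):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()  # drop the pending, now outdated call
            self._timer = threading.Timer(self.delay, self.callback, args=(value,))
            self._timer.daemon = True
            self._timer.start()


received = []
debouncer = Debouncer(received.append, delay=0.05)
for prefix in ["c", "co", "com", "comet"]:
    debouncer.submit(prefix)  # simulated fast keystrokes
time.sleep(0.3)               # let the pause elapse
print(received)               # ['comet']
```

In the notebook, the callback could set `combo.options` from `fetch_suggestions(value)` and `on_value_change` could call `debouncer.submit(change['new'])`. Whether updating the widget from a timer thread behaves well in your Jupyter setup is an assumption to verify.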

# Limitations of ElasticSearch [<sup>15</sup>](#ESvsSQL1) [<sup>16</sup>](#ESvsSQL2) [<sup>17</sup>](#ESvsSQL3)
One of the most commonly mentioned limitations in forums and blogs is that ElasticSearch performance and data durability can be heavily constrained by the resources allocated to the cluster. To maintain good performance and ensure data survivability, it may be necessary to increase the number of nodes and replicate data across them. This is not a technical limitation per se, but rather an infrastructure constraint that depends on available resources.

From a technical standpoint, ElasticSearch is built on top of the **Lucene engine**, which is optimized for full-text search rather than relational queries such as joins or transactions. While ElasticSearch does offer limited support for join-like operations[<sup>11</sup>](#JoinQueries), they are far less powerful than those available in SQL databases and are typically very resource-intensive[<sup>12</sup>](#ressourcesLimit). As a result, these features are often disabled in production configurations.

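In terms of consistency, ElasticSearch follows an eventual consistency model rather than strong consistency. Search is near real-time: a newly indexed or updated document only becomes visible to queries after the next index refresh, which by default happens about once per second. As a result, a query issued immediately after a write may not yet reflect it.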

Although a portion of ElasticSearch is available for free, accessing the full feature set—particularly advanced security, monitoring, and machine learning capabilities—requires a paid license.

Nevertheless, ElasticSearch remains one of the most powerful and widely adopted solutions for full-text search, offering high performance, scalability, and a rich query language. Its ability to index and search large volumes of textual data in near real-time makes it a strong choice for use cases such as log analytics, product search, and document indexing.

Alternatives to ElasticSearch include **Apache Solr**, which is also built on Lucene and excels in traditional search applications, and **OpenSearch**, a community-driven fork of ElasticSearch that retains many of its features under a fully open-source license.  [<sup>13</sup>](#ESalternative1) [<sup>14</sup>](#ESalternative2)

### SQL for full-text search
While many relational databases offer built-in full-text extensions, they remain fundamentally optimized for structured, transactional workloads rather than rich, large-scale text retrieval. Below are three core reasons why SQL full-text search falls short compared to ElasticSearch:

#### 1. Limited Indexing Control & Analysis
- **Bolt-on inverted index**  
  SQL full-text search layers an inverted index onto its B-tree/row-store engine, but provides virtually no knobs for customizing tokenization, stemming, or character filters.  
- **Minimal analyzer pipelines**  
  Text is split on whitespace/punctuation with fixed stemmers and stop-words—no pluggable analyzers, synonym maps, or language-specific processing as in ElasticSearch.  

#### 2. Restricted Query & Search Features  
- **Limited query operators**  
  SQL full-text search (e.g., the `CONTAINS` and `FREETEXT` predicates in SQL Server) is largely limited to Boolean combinations of terms and basic proximity (`NEAR`) clauses; fuzzy, regex, or span-style queries are generally not built in.  
- **DIY fuzzy workarounds**  
  To approximate typo-tolerance you must implement custom edit-distance functions or join strategies, which are far more complex and slower than ElasticSearch's native fuzzy queries.  
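To make the notion of edit distance concrete, here is a short sketch of the classic Levenshtein distance, the kind of function one would have to write and call by hand in a SQL-based workaround. It counts insertions, deletions, and substitutions only. ElasticSearch's fuzzy queries additionally treat a swap of two adjacent characters as a single edit by default, which this plain version does not.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("nebula", "nebula"))  # 0
print(levenshtein("nebula", "nebla"))   # 1 (one deletion)
print(levenshtein("comet", "comte"))    # 2 (a swap costs two edits here)
```

Running such a function against every row is what makes the SQL workaround slow, whereas a search engine can use its term index to limit the candidates it has to compare.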

#### 3. Scalability & Distributed Architecture  
- **Monolithic extension**  
  Full-text runs in-process on your primary database server, limiting throughput under heavy search loads.  
- **No native sharding/replication**  
  ElasticSearch auto-shards, replicates, and rebalances indices across nodes; SQL Server or MySQL require manual partitioning or expensive clustering extensions to scale horizontally.

### Comparison between relational Databases and ElasticSearch [<sup>18</sup>](#TableGen)
As with most decisions in technology, selecting the appropriate tools hinges on your specific use case. It's essential to assess the features you require, understand your typical workflows, and determine the data handling properties that are most critical for your operations. The following table provides a comparative overview of two prominent technologies, aiding in an informed decision-making process.

| **Aspect**                | **ElasticSearch**                                                                                                                             | **MySQL**                                                                                                                       |
|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
| **Data Model**            | Document-oriented (NoSQL); stores data as JSON documents.                                                                                     | Relational (SQL); stores data in structured tables with predefined schemas.                                                     |
| **Primary Use Cases**     | Full-text search, log and event data analysis, real-time analytics, and applications requiring complex search capabilities.                   | Transactional applications, structured data storage, and scenarios requiring complex joins and ACID compliance.                 |
| **Query Language**        | ElasticSearch Query DSL (Domain Specific Language); designed for flexible and complex search operations.                                      | SQL (Structured Query Language); widely adopted for structured data querying and manipulation.                                  |
| **Joins & Relationships** | Limited support for joins; alternatives like nested documents and parent-child relationships exist but can be complex and resource-intensive. | Robust support for joins, foreign keys, and complex relational queries.                                                         |
| **Schema Flexibility**    | Schema-less; allows dynamic mapping, making it adaptable to varying data structures.                                                          | Schema-based; requires predefined schemas, offering strict data validation and integrity.                                       |
| **Consistency Model**     | Eventually consistent; suitable for scenarios where immediate consistency is not critical.                                                    | Strong consistency with ACID (Atomicity, Consistency, Isolation, Durability) properties, ensuring reliable transactions.        |
| **Scalability**           | Horizontally scalable; designed to handle large volumes of data across distributed systems.                                                   | Vertically scalable; can be scaled horizontally with additional configurations but is primarily optimized for vertical scaling. |
| **Performance**           | Optimized for search operations; excels in scenarios requiring rapid full-text search and analytics.                                          | Optimized for transactional operations; performs well in scenarios requiring complex transactions and data integrity.           |
| **Security Features**     | Authentication, TLS encryption, and role-based access control are free; field-/document-level security and single sign-on need a paid tier.   | Offers robust security features, including user authentication, SSL support, and role-based access control.                     |
| **Licensing**             | Open-source with a dual license model; some advanced features require a commercial license.                                                   | Open-source (GPL) with commercial support available; widely adopted and supported by a large community.                         |

# Conclusion

We began this tutorial by introducing the goals of ElasticSearch and how it leverages complex indexing mechanisms to deliver high performance and scalability. We also provided a brief recap of what indexes are, how they are used in relational databases, and the limitations and performance costs they can introduce.

As we progressed, we walked through setting up an ElasticSearch cluster using Docker, connecting to it, and uploading bulk data. We then explored the basic API operations for document manipulation by ID and highlighted the differences between Query DSL and SQL. From there, we reviewed a broad range of advanced query types and features specifically tailored for building powerful search engine experiences.

Using ElasticSearch proved to be quite intuitive. Even though we only covered a subset of its capabilities, it quickly became clear how the technology fits into real-world applications. Along the way, we encountered many configuration parameters that allow you to fine-tune ElasticSearch's behavior to meet specific needs.

While ElasticSearch excels at full-text search, it may require more resources than a traditional database system. We also noted that its eventual consistency model differs from conventional databases, making it less suitable for some workflows. Nevertheless, if your application requires a robust and flexible full-text search engine, ElasticSearch is one of the most powerful solutions available, thanks to its rich feature set and high configurability.

By following this tutorial, you should now have the knowledge and skills to build your first search engine application. With a thoughtfully designed search interface and parameterized queries, you can now deliver accurate and relevant results to your users efficiently and effectively.

# References

1. <a id="invertedindex"></a> [Lucene Query Principle](https://www.mo4tech.com/lucene-query-principle.html)

2. <a id="bm25"></a> [Understanding BM25, the Revolutionary Backbone of ElasticSearch](https://elkutils.com/blog/2161/understanding-bm25-the-revolutionary-backbone-of-elasticsearch)

3. <a id="elasticlab"></a> [Elastic Search Lab - Tutorials](https://www.elastic.co/search-labs/tutorials)

4. <a id="freecodecamp"></a> [ElasticSearch Course for Beginners - FreeCodeCamp](https://www.youtube.com/watch?v=a4HBKEda_F8&ab_channel=freeCodeCamp.org)

5. <a id="elasticlabBoolQueries"></a> [Elastic Search Lab - Tutorial - Filters](https://www.elastic.co/search-labs/tutorials/search-tutorial/full-text-search/filters)

6. <a id="elasticscore"></a> [Elastic Search Lab - Understanding ElasticSearch scoring and the Explain API](https://www.elastic.co/search-labs/blog/elasticsearch-scoring-and-explain-api)

7. <a id="mustnot"></a> [Soumendra - Stack Overflow - Difference between must_not and filter in elasticsearch](https://stackoverflow.com/questions/47226479/difference-between-must-not-and-filter-in-elasticsearch)

8. <a id="aggregations"></a> [logz.io - Daniel Berman - A Basic Guide To ElasticSearch Aggregations](https://logz.io/blog/elasticsearch-aggregations)

9. <a id="highlighting"></a> [Elastic Official Documentation - Highlighting](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/highlighting)

10. <a id="autocomplete"></a> [Opster - Amit Khandelwal - ElasticSearch Autocomplete Search](https://opster.com/guides/elasticsearch/how-tos/elasticsearch-auto-complete-guide/)

11. <a id="JoinQueries"></a> [Elastic Official Documentation - Joining Queries](https://www.elastic.co/docs/reference/query-languages/query-dsl/joining-queries)

12. <a id="ressourcesLimit"></a> [Reddit - What are the limits of elastic search?](https://www.reddit.com/r/elasticsearch/comments/6xm8wv/what_are_the_limits_of_elastic_search/)

13. <a id="ESalternative1"></a> [sematext - 11 Alternatives to ElasticSearch, OpenSearch, and Solr](https://sematext.com/blog/elasticsearch-opensearch-solr-alternatives/)

14. <a id="ESalternative2"></a> [BIGDATA - ElasticSearch Alternatives - The Ultimate Guide](https://bigdataboutique.com/blog/elasticsearch-alternatives-the-ultimate-guide-59ad00)

15. <a id="ESvsSQL1"></a> [Medium - ElasticSearch vs. Traditional Databases: Diving into Elastic search's Strengths](https://medium.com/@rajeevprasanna/elasticsearch-vs-traditional-databases-diving-into-elastic-searchs-strengths-c6f55b9b449f)

16. <a id="ESvsSQL2"></a> [knowi - ElasticSearch vs. MySQL: What to Choose?](https://www.knowi.com/blog/elasticsearch-vs-mysql-what-to-choose/)

17. <a id="ESvsSQL3"></a> [Airbyte - ElasticSearch vs SQL Server - Key Differences](https://airbyte.com/data-engineering-resources/elasticsearch-vs-sql-server)

18. <a id="TableGen"></a> [Table Generator](https://www.tablesgenerator.com/markdown_tables)

19. <a id="GPT"></a> [ChatGPT](https://chatgpt.com/) was used to improve some sentences and to write some pieces of code.

20. <a id="elasticdoc"></a> [Elastic Official Documentation](https://www.elastic.co/docs/get-started)

# Process and Work Distribution

First, Nicolas researched some topics related to ElasticSearch. After discussing with the rest of the group to agree on the approach, Mehdi distributed the tasks equitably:

- **Introduction, Real-world use cases and Explanation of ElasticSearch's workings**: Andreas
- **Installation, Importing data with the bulk API (Dataset), DSL vs SQL and Basic queries in ElasticSearch**: Mehdi
- **ElasticSearch as a search engine and Advanced features**: Nicolas
- **Limitations of ElasticSearch and Conclusion**: Maxim

After completing our respective sections, each member reviewed and improved on the work of others to ensure the overall correctness and coherence of the tutorial.