# Exploring Elasticsearch

**Mehdi Boustani** - S221594  
**Nicolas Schneiders** - S203005  
**Maxim Piron** - S211493  
**Andreas Stistrup** - S212891  

*Faculty of Applied Sciences, University of Liège*

April 28, 2025


# Introduction

# Installation & configuration

## Docker

### Installing Elasticsearch
If you don't have Docker installed yet, you can download and install it from the [official website](https://www.docker.com/). 

Once Docker is running on your machine, launch Elasticsearch using the following command:

In [None]:
!docker run -p 127.0.0.1:9200:9200 -d --name elasticsearch \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "xpack.license.self_generated.type=trial" \
  -v "elasticsearch-data:/usr/share/elasticsearch/data" \
  docker.elastic.co/elasticsearch/elasticsearch:8.15.0
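
The container usually needs a few seconds before it accepts connections, so a request sent right after `docker run` may fail. The snippet below is a minimal sketch of a readiness check, assuming the node is reachable at `http://localhost:9200` as configured above. The helper names `http_probe` and `wait_until_ready` are our own and are not part of Docker or Elasticsearch.

```python
import time
import urllib.error
import urllib.request


def http_probe(url="http://localhost:9200", timeout=2.0):
    """Return True if the node answers an HTTP request with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or reset: the node is not ready yet.
        return False


def wait_until_ready(probe=http_probe, attempts=30, delay=2.0):
    """Call `probe` until it succeeds or `attempts` runs out."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt      # number of tries that were needed
        time.sleep(delay)
    return None                 # the node never answered
```

Calling `wait_until_ready()` before the first request returns the number of attempts it took, or `None` if the node did not answer in time.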


## Dependencies  
Let's install all the necessary Python packages we'll be using throughout this tutorial.


In [15]:
# requests       → to interact with the Elasticsearch REST API
# elasticsearch  → official Elasticsearch Python client
# pandas         → for handling and analyzing tabular data (e.g., dataset exploration)
# matplotlib     → for optional data visualization (e.g., query stats or aggregations)

!pip install requests elasticsearch==8.15.0 pandas matplotlib

Defaulting to user installation because normal site-packages is not writeable


## Connection

In [16]:
from pprint import pprint
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch('http://localhost:9200')
info = es.info()

print('Connected to Elasticsearch!')
pprint(info.body)


Connected to Elasticsearch!
{'cluster_name': 'docker-cluster',
 'cluster_uuid': 'sY43eXx2R5C_MHq1iRpKKQ',
 'name': '6e8d9e503ed5',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2024-08-05T10:05:34.233336849Z',
             'build_flavor': 'default',
             'build_hash': '1a77947f34deddb41af25e6f0ddb8e830159c179',
             'build_snapshot': False,
             'build_type': 'docker',
             'lucene_version': '9.11.1',
             'minimum_index_compatibility_version': '7.0.0',
             'minimum_wire_compatibility_version': '7.17.0',
             'number': '8.15.0'}}


## Importing data with the Bulk API

To load a large dataset into Elasticsearch efficiently, we use the Bulk API. It inserts multiple documents in a single request, which is much faster than indexing documents one by one. In this example, we import the contents of our `apod.json` file, where each element is a document, into a new index called `apod`.

In [17]:
import json

with open("apod.json", "r") as f:
    data = json.load(f)

# Prepare the actions for the bulk request
actions = [
    {
        "_index": "apod",
        # The title serves as the document id. It must be unique:
        # documents sharing an id would overwrite one another.
        "_id": doc["title"],
        "_source": doc
    }
    for doc in data
]

# Import the data in bulk
try:
    helpers.bulk(es, actions)
    print("Bulk import finished!")
except Exception as e:
    print(e)

Bulk import finished!
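
Because the title is used as the document id, two entries with the same title would silently overwrite each other during the import. The following sketch checks for that before indexing and builds the actions lazily with a generator. The function names and the toy documents are made up for illustration and are not taken from `apod.json`.

```python
from collections import Counter


def find_duplicate_ids(docs, id_field="title"):
    """Return the id values that appear more than once in `docs`."""
    counts = Counter(doc[id_field] for doc in docs)
    return sorted(value for value, count in counts.items() if count > 1)


def generate_actions(docs, index="apod", id_field="title"):
    """Yield one bulk action per document instead of building a full list."""
    for doc in docs:
        yield {"_index": index, "_id": doc[id_field], "_source": doc}


# Toy documents (hypothetical, for demonstration only).
sample = [
    {"title": "A", "date": "2024-01-01"},
    {"title": "B", "date": "2024-01-02"},
    {"title": "A", "date": "2024-01-03"},
]
duplicates = find_duplicate_ids(sample)
actions = list(generate_actions(sample))
print("Duplicate ids:", duplicates)
print("Actions built:", len(actions))
```

Since `helpers.bulk` accepts any iterable of actions, the generator could be passed directly, for instance `helpers.bulk(es, generate_actions(data))`, once `find_duplicate_ids(data)` returns an empty list.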


## Basic queries in Elasticsearch

Elasticsearch exposes a RESTful API, which means you interact with it using standard HTTP methods. Here are the most common operations:

- **GET**: Read a document or perform a search
- **POST**: Add a new document
- **PUT**: Create or replace a document or an index
- **DELETE**: Remove a document or an index
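
To make the list above concrete, here is a small sketch of how each basic document operation maps to an HTTP method and a URL path. The operation labels (`"get"`, `"create_auto_id"`, and so on) and the function `rest_request` are our own naming for this illustration. The paths follow the document and search endpoints of the Elasticsearch REST API.

```python
from urllib.parse import quote


def rest_request(operation, index, doc_id=None):
    """Return the (HTTP method, path) pair for a basic document operation."""
    with_id = {"get": "GET", "create_or_replace": "PUT", "delete": "DELETE"}
    if operation in with_id:
        if doc_id is None:
            raise ValueError(f"{operation} requires a document id")
        # The id is part of the URL path, so it must be percent-encoded.
        encoded_id = quote(doc_id, safe="")
        return with_id[operation], f"/{index}/_doc/{encoded_id}"
    if operation == "create_auto_id":
        # Without an id, Elasticsearch generates one for the new document.
        return "POST", f"/{index}/_doc"
    if operation == "search":
        # The search endpoint accepts both GET and POST; POST is used here.
        return "POST", f"/{index}/_search"
    raise ValueError(f"unknown operation: {operation}")


print(rest_request("get", "apod", "A Hazy Harvest Moon"))
print(rest_request("create_auto_id", "apod"))
```

The Python client used in the rest of this section builds these requests for us, which is why the examples call methods such as `es.get` and `es.index` rather than writing URLs by hand.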

### GET method

The GET method retrieves a single document from an index by its id. If no document with that id exists, the Python client raises a `NotFoundError`.

In [18]:
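from elasticsearch import NotFoundError

try:
    doc = es.get(index="apod", id="A Hazy Harvest Moon")
    pprint(doc['_source'])

except NotFoundError:
    # Raised by the client when no document has this id
    print("A document with this id doesn't exist!")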

{'authors': 'Petr Horálek, Institute of Physics in Opava\n',
 'date': '2024-09-20',
 'explanation': "Explanation: For northern hemisphere dwellers, September's "
                'Full Moon was the Harvest Moon. On September 17/18 the sunlit '
                "lunar nearside passed into shadow, just grazing Earth's "
                "umbra, the planet's dark, central shadow cone, in a partial "
                'lunar eclipse. Over the two and a half hours before dawn a '
                'camera fixed to a tripod was used to record this series of '
                'exposures as the eclipsed Harvest Moon set behind Spiš Castle '
                'in the hazy morning sky over eastern Slovakia. Famed in '
                'festival, story, and song, Harvest Moon is just the '
                'traditional name of the full moon nearest the autumnal '
                'equinox. According to lore the name is a fitting one. Despite '
                'the diminishing daylight hours as the growing se

### POST method

The POST method creates a new document. When `index()` is called without an `id`, Elasticsearch generates one automatically and returns it in the `_id` field of the response, so this document's id is not its title, unlike the documents imported in bulk.

In [19]:
from datetime import datetime

new_id = "A New APOD"

new_doc = {
    "date": datetime.now().strftime("%Y-%m-%d"),
    "title": new_id,
    "explanation": "This is a new document added via POST.",
    "image_url": "https://apod.nasa.gov/apod/image/2410/new_apod.jpg",
    "authors": "Mehdi Boustani"
}

res = es.index(index="apod", document=new_doc)

print("Document added successfully")

Document added successfully


### PUT method

The PUT method is used to create or replace a document at a specified id. If a document with that id already exists, it will be overwritten.

In [20]:
replaced_id = "Replaced APOD"

doc = {
    "date": "2024-10-02",
    "title": replaced_id,
    "explanation": "This document replaces any previous one with the same ID.",
    "image_url": "https://apod.nasa.gov/apod/image/2410/new_apod.jpg",
    "authors": "Mehdi Boustani"
}

# Create the document under an explicit id (or overwrite it if that id already exists).
# The document added in the POST example is untouched: it lives under its auto-generated id.
es.index(index="apod", id=doc["title"], document=doc)

print(f"Document stored with id '{replaced_id}'")

Document stored with id 'Replaced APOD'


### DELETE method

This method is used to delete a document by its id. The id must be known and specified in the request.

In [21]:
from elasticsearch import NotFoundError

try:
    es.delete(index="apod", id=replaced_id)
    print(f"Document with ID {replaced_id} deleted.")

except NotFoundError:
    print("The specified document to delete doesn't exist")

# Delete the entire index (be careful, this command is irreversible)
# es.indices.delete(index="apod")
# print("Index 'apod' deleted.")

Document with ID Replaced APOD deleted.


## DSL vs SQL

Elasticsearch is not primarily queried with SQL. Instead, it uses the **Query DSL (Domain Specific Language)**, which expresses queries as JSON.

### Main differences

1. **Query Structure**
   - **SQL**: Uses a strict syntax with clauses like `SELECT`, `FROM`, `WHERE`
   - **DSL**: Uses a nested JSON format, offering more flexibility in how queries are expressed

2. **Search Types**
   - **SQL**: Focuses mainly on exact matches
   - **DSL**: Supports advanced search techniques like full-text search, fuzzy matching, and range queries

Let's explore practical examples comparing SQL concepts with Elasticsearch's DSL.

### Full-text search

In [22]:
query = {
    "query": {
        "match": {
            "title": "moon"
        }
    }
}

response = es.search(index="apod", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"])

{'date': '2016-11-13', 'title': 'Super Moon vs Micro Moon', 'explanation': "Explanation: What is so super about tomorrow's supermoon? Tomorrow, a full moon will occur that appears slightly larger and brighter than usual. The reason is that the Moon's fully illuminated phase occurs within a short time from perigee - when the Moon is its closest to the Earth in its elliptical orbit. Although the precise conditions that define a supermoon vary, tomorrow's supermoon will undoubtedly qualify because it will be the closest, largest, and brightest full moon in over 65 years. One reason supermoons are popular is because they are so easy to see -- just go outside at sunset and watch an impressive full moon rise! Since perigee actually occurs tomorrow morning, tonight's full moon, visible starting at sunset, should also be impressive. Pictured here, a supermoon from 2012 is compared to a micromoon -- when a full Moon occurs near the furthest part of the Moon's orbit -- so that it appears smaller

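**SQL equivalent (approximate):** `SELECT * FROM apod WHERE title LIKE '%moon%'`. Unlike `LIKE`, the `match` query compares analyzed tokens, so it is case-insensitive and matches whole words rather than substrings.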
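
The difference between token matching and substring matching can be illustrated without a running cluster. The sketch below uses a deliberately simplified stand-in for the standard analyzer (lowercasing and splitting on non-word characters). The real analyzer follows Unicode segmentation rules, so treat this as an approximation. The second title is invented for the example.

```python
import re


def simple_analyze(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def match_like(field_value, query):
    """True if any analyzed query token equals an analyzed field token."""
    field_tokens = set(simple_analyze(field_value))
    return any(token in field_tokens for token in simple_analyze(query))


def sql_like_contains(field_value, pattern):
    """Case-insensitive version of LIKE '%pattern%' (substring test)."""
    return pattern.lower() in field_value.lower()


titles = [
    "Super Moon vs Micro Moon",     # appears in the results above
    "Moonlight over the Alps",      # hypothetical title
]
for title in titles:
    print(title, "| match:", match_like(title, "moon"),
          "| LIKE:", sql_like_contains(title, "moon"))
```

Under these assumptions, the substring test accepts both titles, while the token-based test accepts only the first one, because "moonlight" is a different token from "moon".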

### Exact match with term

In [23]:
query = {
    "query": {
        "term": {
            "title.keyword": {
                "value": "A Hazy Harvest Moon"
            }
        }
    }
}

response = es.search(index="apod", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"])


{'date': '2024-09-20', 'title': 'A Hazy Harvest Moon', 'explanation': "Explanation: For northern hemisphere dwellers, September's Full Moon was the Harvest Moon. On September 17/18 the sunlit lunar nearside passed into shadow, just grazing Earth's umbra, the planet's dark, central shadow cone, in a partial lunar eclipse. Over the two and a half hours before dawn a camera fixed to a tripod was used to record this series of exposures as the eclipsed Harvest Moon set behind Spiš Castle in the hazy morning sky over eastern Slovakia. Famed in festival, story, and song, Harvest Moon is just the traditional name of the full moon nearest the autumnal equinox. According to lore the name is a fitting one. Despite the diminishing daylight hours as the growing season drew to a close, farmers could harvest crops by the light of a full moon shining on from dusk to dawn. This September's Harvest Moon was also known to some as a supermoon, a term becoming a traditional name for a full moon near perige

**SQL equivalent:** `SELECT * FROM apod WHERE title = 'A Hazy Harvest Moon'`

### Range Query (Numeric/date filtering)

In [26]:
# We filter documents by a date range
query = {
    "query": {
        "range": {
            "date": {
                "gte": "2020-01-01",
                "lte": "2020-01-15"
            }
        }
    }
}

response = es.search(index="apod", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"])


{'date': '2020-01-15', 'title': 'Iridescent Clouds over Sweden', 'explanation': 'Explanation: Why would these clouds multi-colored? A relatively rare phenomenon in clouds known as iridescence can bring up unusual colors vividly or even a whole spectrum of colors simultaneously. These polar stratospheric clouds clouds, also known as nacreous and mother-of-pearl clouds, are formed of small water droplets of nearly uniform size. When the Sun is in the right position and, typically, hidden from direct view, these thin clouds can be seen significantly diffracting sunlight in a nearly coherent manner, with different colors being deflected by different amounts. Therefore, different colors will come to the observer from slightly different directions. Many clouds start with uniform regions that could show iridescence but quickly become too thick, too mixed, or too angularly far from the Sun to exhibit striking colors. The featured image and an accompanying video were taken late last year over O

**SQL equivalent:** `SELECT * FROM apod WHERE date BETWEEN '2020-01-01' AND '2020-01-15'`

### Fuzzy Query (Typo-tolerant search)

In [28]:
# Typo-tolerant search with fuzzy
query = {
    "query": {
        "fuzzy": {
            "title": {
                "value": "Galaxi",
                "fuzziness": "AUTO"
            }
        }
    }
}

response = es.search(index="apod", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"])

{'date': '2020-01-25', 'title': "Rubin's Galaxy", 'explanation': "Explanation: In this Hubble Space Telescope image the bright, spiky stars lie in the foreground toward the heroic northern constellation Perseus and well within our own Milky Way galaxy. In sharp focus beyond is UGC 2885, a giant spiral galaxy about 232 million light-years distant. Some 800,000 light-years across compared to the Milky Way's diameter of 100,000 light-years or so, it has around 1 trillion stars. That's about 10 times as many stars as the Milky Way. Part of a current investigation to understand how galaxies can grow to such enormous sizes, UGC 2885 was also part of astronomer Vera Rubin's pioneering study of the rotation of spiral galaxies. Her work was the first to convincingly demonstrate the dominating presence of dark matter in our universe.", 'image_url': 'https://apod.nasa.gov/apod/image/2001/RubinsGalaxy_hst1024.jpg', 'authors': 'NASA, ESA, B. Holwerda (University of Louisville)\n'}
{'date': '2021-09

**SQL equivalent:** No direct equivalent, similar to a `LIKE` with typos
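
Fuzzy matching is based on edit distance: the number of single-character changes needed to turn one term into another. The sketch below implements the classic Levenshtein distance and the length thresholds that `"fuzziness": "AUTO"` applies by default (no edit for terms of 1 or 2 characters, one edit for 3 to 5 characters, two edits beyond that). It is an illustration, not Elasticsearch's implementation. In particular, Elasticsearch by default counts a swap of two adjacent characters as a single edit, which this plain version does not.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Maximum edits allowed by "AUTO" with its default thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


print(levenshtein("galaxi", "galaxy"))
print(auto_fuzziness("Galaxi"))
```

The misspelled term is one substitution away from "galaxy", and with six characters it is allowed up to two edits, which is consistent with the hits returned above.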

# Elasticsearch as a search engine

Elasticsearch lets client applications retrieve data from an index using powerful, flexible queries. To customize search behavior, it offers several fine-tuning parameters. In this section, we'll explore the `filter`, `must`, `must_not`, and `should` clauses of the `bool` query.

A key concept here is document scoring. When you run a query, Elasticsearch calculates a relevance score (`_score`) for each matching document and orders results by it, returning the top n documents (10 by default). To further control how scores influence ordering, you can use the `boost` parameter to adjust relevance and achieve custom ranking.
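
To give an intuition of where a relevance score comes from, the sketch below computes the contribution of a single query term under BM25, the default similarity in Elasticsearch. The formula and the default parameters `k1=1.2` and `b=0.75` reflect our understanding of recent Lucene versions. Real scores also depend on per-shard statistics and on how document lengths are stored, so the numbers are illustrative only.

```python
import math


def bm25_term_score(term_freq, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75, boost=1.0):
    """BM25 contribution of one query term to one document's score."""
    # Rare terms (low doc_freq) get a higher inverse document frequency.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Long documents are penalized relative to the average length.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Repeated occurrences help, but with diminishing returns.
    tf = term_freq / (term_freq + k1 * length_norm)
    return boost * idf * tf


# Toy statistics: 100 documents, 10 of which contain the term.
short_title = bm25_term_score(term_freq=1, doc_len=4, avg_doc_len=8,
                              doc_count=100, doc_freq=10)
long_title = bm25_term_score(term_freq=1, doc_len=16, avg_doc_len=8,
                             doc_count=100, doc_freq=10)
print(f"short field: {short_title:.3f}, long field: {long_title:.3f}")
```

With these toy numbers, the same term weighs more in the shorter field, and passing `boost=2.0` simply doubles the result.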

## Filter requests

When you apply a filter criterion to your query, you define one or more clauses that documents must satisfy to be included. Filters are score-neutral: they don't alter a document's relevance score, they only remove non-matching hits. Below we'll explore a selection of the most common filter clauses.

In [None]:
from pprint import pprint
index_name = "apod"

# This query will filter out documents that do not match the date "2024-09-27"
term_query = {
    "term": {"date": "2024-09-27"}
}

# This query will filter documents with a date between "2024-09-09" and "2024-09-30"
range_query = {
    "range": {
        "date": {"gte": "2024-09-09", "lte": "2024-09-30"}
    }
}

# This query will filter documents that have a non-null value for the field "note"
exists_query = {
    "exists": {"field": "note"}
}

# This query will filter documents that have the exact term "David Martinez Delgado et al." in the "authors" field
term_authors_query = {
    "term": {"authors.keyword": "David Martinez Delgado et al."}
}

# This query will filter documents that have a title starting with "Comet"
prefix_query = {
    "prefix": {"title.keyword": "Comet"}
}


print("=== Term Query on date ===")
res = es.search(index=index_name, body={"query": {"bool": {"filter": [term_query]}}})
for hit in res["hits"]["hits"]:
    pprint(hit["_source"])

print("\n=== Range Query on date ===")
res = es.search(index=index_name, body={"query": {"bool": {"filter": [range_query]}}})
for hit in res["hits"]["hits"]:
    pprint(hit["_source"])

print("\n=== Exists Query on note ===")
res = es.search(index=index_name, body={"query": {"bool": {"filter": [exists_query]}}})
for hit in res["hits"]["hits"]:
    pprint(hit["_source"])

print("\n=== Exact Term Query on authors ===")
res = es.search(index=index_name, body={"query": {"bool": {"filter": [term_authors_query]}}})
for hit in res["hits"]["hits"]:
    pprint(hit["_source"])

print("\n=== Prefix Query on title ===")
res = es.search(index=index_name, body={"query": {"bool": {"filter": [prefix_query]}}})
for hit in res["hits"]["hits"]:
    pprint(hit["_source"])
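
The five searches above all wrap a clause in the same `bool`/`filter` structure. As a possible simplification, the sketch below builds that request body with a small helper. The name `filter_body` and the dictionary `named_filters` are our own. Several clauses passed together are combined with a logical AND.

```python
def filter_body(*clauses, size=None):
    """Wrap one or more query clauses in a bool query's filter context."""
    body = {"query": {"bool": {"filter": list(clauses)}}}
    if size is not None:
        body["size"] = size     # optional limit on the number of hits
    return body


named_filters = {
    "Term on date": {"term": {"date": "2024-09-27"}},
    "Prefix on title": {"prefix": {"title.keyword": "Comet"}},
}

# One request body per clause, keyed by a readable label.
bodies = {label: filter_body(clause) for label, clause in named_filters.items()}

# Both clauses at once: a hit must satisfy the term AND the prefix filter.
both = filter_body(*named_filters.values(), size=5)
print(both)
```

Each body can then be sent with `es.search(index=index_name, body=bodies[label])`. Since a query made only of filters computes no relevance, every hit is returned with a `_score` of 0.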


## Must requests

The `must` criterion works like `filter` in that it determines which documents are eligible, with one key difference: a `must` clause runs in query context, so how well a document matches it contributes to the document's relevance score. You can include multiple `must` clauses; they are combined with a logical AND (a document must satisfy all of them).

In [None]:
must_title = {
    "match": {"title": "Comet"}
}

must_explanation = {
    "match": {"explanation": "nebula"}
}

print("=== Must: Title Contains 'Comet' (size=2) ===")
res = es.search(
    index=index_name,
    # The size parameter limits the number of results returned. Default is 10.
    body={
        "size": 2,
        "query": {
            "bool": {
                "must": [must_title]
            }
        }
    }
)
for hit in res["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}")
    pprint(hit["_source"])


print("\n=== Must: Title Contains 'Comet' AND Explanation Contains 'nebula' (size=2) ===")
res = es.search(
    index=index_name,
    body={
        "size": 2,
        "query": {
            "bool": {
                "must": [
                    must_title,
                    must_explanation
                ]
            }
        }
    }
)
for hit in res["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}")
    pprint(hit["_source"])


In the first query, the first document's `_score` is higher than the second's: `must` clauses contribute to the relevance score, and hits are returned in descending score order.

## Must_not requests

Despite its name, `must_not` is not the scoring opposite of `must`; it behaves like a negated filter. Any document that matches a `must_not` clause is removed from the set of candidates, and the clause has no effect on the scores of the remaining documents.

In [None]:
must_not_image = {
    "exists": {"field": "image_url"}
}

print("=== Example 1: must_not exists image_url (size=2) ===")
res1 = es.search(
    index=index_name,
    body={
        "size": 2,
        "query": {
            "bool": {
                "must_not": [must_not_image]
            }
        }
    }
)
for hit in res1["hits"]["hits"]:
    pprint(hit["_source"])


filter_date = {
    "range": {
        "date": {"gte": "2024-09-09", "lte": "2024-09-30"}
    }
}
must_not_comet = {
    "prefix": {"title.keyword": "Comet"}
}

print("\n=== Example 2: range on date AND must_not prefix 'Comet' (size=2) ===")
res2 = es.search(
    index=index_name,
    body={
        "size": 2,
        "query": {
            "bool": {
                "filter":   [filter_date],
                "must_not": [must_not_comet]
            }
        }
    }
)
for hit in res2["hits"]["hits"]:
    # Only docs dated 2024-09-09 to 2024-09-30 whose title does NOT start with "Comet"
    pprint(hit["_source"])


## Should requests

The should clause in a bool query implements a logical OR across its clauses:

1. **Standalone should (no must clauses)**  
   - A document only needs to match at least one should clause to be included.

2. **Combined must + should**  
   - Every must clause remains mandatory: a document has to satisfy all of them to be included.  
   - Each should clause that matches simply boosts the document’s relevance score; non-matching should clauses do not exclude the document.
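
The two cases above can be modeled in plain Python. The sketch below is a toy model of which documents a `bool` query keeps, with scores ignored. It is not how Elasticsearch evaluates queries internally. Clauses are represented as predicate functions, and the documents are invented for the example.

```python
def bool_matches(doc, must=(), filter_=(), must_not=(), should=()):
    """Toy model of which documents a bool query keeps (scores ignored)."""
    if not all(pred(doc) for pred in must):
        return False
    if not all(pred(doc) for pred in filter_):
        return False
    if any(pred(doc) for pred in must_not):
        return False
    # By default, at least one should clause is required only when the
    # query has neither must nor filter clauses.
    if should and not must and not filter_:
        return any(pred(doc) for pred in should)
    return True


def title_has(word):
    """Build a predicate testing for a whole word in the title."""
    return lambda doc: word in doc["title"].lower().split()


def in_september_2024(doc):
    """ISO dates compare correctly as plain strings."""
    return "2024-09-01" <= doc["date"] <= "2024-09-30"


docs = [  # hypothetical documents
    {"title": "Comet at dawn", "date": "2024-09-27"},
    {"title": "A quiet nebula", "date": "2024-09-28"},
    {"title": "Comet in spring", "date": "2024-03-01"},
]
standalone = [d["title"] for d in docs
              if bool_matches(d, should=[title_has("comet")])]
combined = [d["title"] for d in docs
            if bool_matches(d, must=[in_september_2024],
                            should=[title_has("comet")])]
print(standalone)
print(combined)
```

In the standalone case only the two comet documents are kept. In the combined case both September documents are kept, and the `should` clause would only influence their ranking.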


In [None]:
# This query returns documents that match at least one of the specified conditions
query1 = {
    "size": 2,
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Comet" } },
                { "match": { "explanation": "nebula" } },
                { "match": { "authors": "David Martinez Delgado et al." } }
            ],
        }
    }
}

print("=== Query 1: Standalone should (title OR explanation OR authors) ===")
res1 = es.search(index=index_name, body=query1)
for hit in res1["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}")
    pprint(hit["_source"])


# This query keeps documents dated between 2024-09-20 and 2024-09-30 and boosts the score if the title or explanation matches
query2 = {
    "size": 2,
    "query": {
        "bool": {
            "must": [
                { "range": { "date": { "gte": "2024-09-20", "lte": "2024-09-30" } } }
            ],
            "should": [
                { "match": { "title": "Comet" } },
                { "match": { "explanation": "comet" } }
            ]
        }
    }
}

print("\n=== Query 2: must range 2024-09-20 to 2024-09-30 + should (boost if title/explanation) ===")
res2 = es.search(index=index_name, body=query2)
for hit in res2["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}")
    pprint(hit["_source"])


## Boosting requests

To increase the relevance of certain documents, add a `boost` parameter to a query clause (for example, a `match` or `term` query); it multiplies that clause's contribution to the final `_score`. To softly penalize documents without filtering them out, use the `boosting` query: it takes a required `positive` query and a `negative` query, and multiplies the score of any document that matches the negative query by the `negative_boost` factor (a value between 0 and 1).

In [None]:
from pprint import pprint
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
index_name = "apod"

# 1️⃣ Positive boost: increase score when explanation contains "meteor"
#    Uses `boost` directly on a match query.
print("=== Positive Boost: explanation contains 'meteor' (boost=2.0) ===")
pos_query = {
    "size": 2,
    "query": {
        "match": {
            "explanation": {
                "query": "meteor",
                "boost": 2.0      # positive boost on match
            }
        }
    }
}
res = es.search(index=index_name, body=pos_query)
for hit in res["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}") 
    pprint(hit["_source"])

# 2️⃣ Negative boost: de-emphasize docs where explanation contains "comet"
#    Uses the boosting query with match_all as the positive clause.
print("\n=== Negative Boost: explanation contains 'comet' (negative_boost=0.5) ===")
neg_query = {
    "size": 2,
    "query": {
        "boosting": {
            "positive": { "match_all": {} },        # match everything
            "negative": {                           # demote these
                "match": { "explanation": "comet" }
            },
            "negative_boost": 0.5                   # reduce score by 50% on match
        }
    }
}
res = es.search(index=index_name, body=neg_query)
for hit in res["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}")
    pprint(hit["_source"])

# 3️⃣ Combined boost: promote "meteor" matches and demote "comet" matches
print("\n=== Combined Boost: x2.0 for 'meteor', x0.5 for 'comet' ===")
both_query = {
    "size": 2,
    "query": {
        "boosting": {
            "positive": {
                "match": {
                    "explanation": {
                        "query": "meteor",
                        "boost": 2.0
                    }
                }
            },
            "negative": {
                "match": {
                    "explanation": "comet"
                }
            },
            "negative_boost": 0.5
        }
    }
}
res = es.search(index=index_name, body=both_query)
for hit in res["hits"]["hits"]:
    print(f"_score={hit['_score']:.2f}")
    pprint(hit["_source"])
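
The arithmetic behind the `boosting` query is simple enough to reproduce by hand. The sketch below applies it to two invented hits to show how a demotion can change the ranking. The function name `boosting_score` and the scores are assumptions made for the illustration.

```python
def boosting_score(positive_score, matches_negative, negative_boost=0.5):
    """Final score of a document under a boosting query."""
    if not 0 <= negative_boost <= 1:
        raise ValueError("negative_boost must be between 0 and 1")
    if matches_negative:
        return positive_score * negative_boost   # demoted, not excluded
    return positive_score


hits = [  # hypothetical hits with their score from the positive query
    {"title": "Doc A", "positive_score": 3.0, "mentions_comet": True},
    {"title": "Doc B", "positive_score": 2.0, "mentions_comet": False},
]
ranked = sorted(
    hits,
    key=lambda h: boosting_score(h["positive_score"], h["mentions_comet"]),
    reverse=True,
)
print([h["title"] for h in ranked])
```

Doc A starts with the higher score but drops to 1.5 after the demotion, so Doc B ranks first. In the second example above, the positive query is `match_all`, which scores every document 1.0, so documents mentioning "comet" end up at 0.5.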


# Advanced features

# Conclusion

# References

1. <a id="freecodecamp"></a> [Elasticsearch Course for Beginners - FreeCodeCamp](https://www.youtube.com/watch?v=a4HBKEda_F8&ab_channel=freeCodeCamp.org)

2. <a id="elasticdoc"></a> [Elastic Official Documentation](https://www.elastic.co/docs/get-started)

3. <a id="elasticlab"></a> [Elastic Search Lab - Tutorials](https://www.elastic.co/search-labs/tutorials)

4. <a id="elasticlabBoolQueries"></a> [Elastic Search Lab - Tutorial - Filters](https://www.elastic.co/search-labs/tutorials/search-tutorial/full-text-search/filters)

5. <a id="elasticscore"></a> [Elastic Search Lab - Understanding Elasticsearch scoring and the Explain API](https://www.elastic.co/search-labs/blog/elasticsearch-scoring-and-explain-api)

6. <a id="mustnot"></a> [Soumendra - Stack Overflow - Difference between must_not and filter in elasticsearch](https://stackoverflow.com/questions/47226479/difference-between-must-not-and-filter-in-elasticsearch)

# Process and Work Distribution
