# Elasticsearch

## Basic Concepts

### Documents

* basically JSON objects that you search over
* each one has a unique ID
```
{
      "id": "XYZ123",
      "title": "The Great Gatsby",
      "author": "F. Scott Fitzgerald",
      "price": 10.99,
      "createdAt": "2024-01-01T00:00:00.000Z"
}
```

### Index

* a collection of documents
* searches are done against Indexes which return a list of documents

### Mappings and Fields

* Mapping = schema of the Index that defines the fields the Index can have and their data types
    - determines which field is searchable
* example of mapping:
    - keyword type = treats entire thing as a single value, a single token
        * if your id = 123, you can only search for it if your query = 123,
        * it would not return anything if your query = 12
        * think Hash map
    - text type = words or phrases of the text can be searched for
        * e.g. "the quick brown fox" can be searched for with "quick brown"
        * think Inverted Index

In [None]:
{
  "properties": {
    "id": { "type": "keyword" },
    "title": { "type": "text" },
    "author": { "type": "text" },
    "price": { "type": "float" },
    "createdAt": { "type": "date" }
  }
}
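
To make the keyword vs. text difference concrete, here is a rough Python sketch. It is not how Elasticsearch is implemented: `analyze`, `keyword_search`, and `text_search` are made-up helper names, and the tokenizer is a crude stand-in for a real analyzer.

```python
import re

def analyze(text):
    """Crude stand-in for a text analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

# keyword field: the whole value is a single token (think hash map)
keyword_index = {"123": {1}}  # exact value -> doc ids

# text field: every analyzed token points back to its documents (think inverted index)
text_index = {}
for doc_id, title in {1: "the quick brown fox"}.items():
    for token in analyze(title):
        text_index.setdefault(token, set()).add(doc_id)

def keyword_search(value):
    # only an exact match on the full value hits
    return keyword_index.get(value, set())

def text_search(query):
    # a doc matches if it contains any of the query's tokens
    hits = set()
    for token in analyze(query):
        hits |= text_index.get(token, set())
    return hits
```

With this toy data, `keyword_search("123")` finds document 1 while `keyword_search("12")` finds nothing, and `text_search("quick brown")` finds document 1.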


* Mappings can affect the performance of your cluster
    - too many fields in the Mapping that aren't actually searchable increases memory overhead of Index = wastes memory!!!
    - __you are allowed to not have every document's fields in your Mapping__
    - the `dynamic` setting determines how new fields are added to the Mapping
        * dynamic: true => adds a new field to the Mapping when it encounters one
        * dynamic: false => ignores fields that are not in the Mapping, i.e. doesn't add them to the Mapping or index them (they still remain in `_source`)
        * dynamic: strict => throws an error (rejects the document) if it encounters a field that is not in the Mapping

In [None]:
// PUT users_index
{
    "mappings": {
        "dynamic": false, // IMPORTANT
        "properties": {
            "name": {
                "type": "text"
            },
            "createdAt": {
                "type": "date"
            }
        }
    }
}


// POST users_index/_doc
{
    "name": "Alice",
    "createdAt": "2024-01-01T12:00:00Z",
    "occupation": "Engineer"
}
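
The three `dynamic` modes can be mimicked with a small Python function. This is only a toy model of the behavior described above: `index_document` is a hypothetical helper, and the `"inferred"` type is a placeholder for the type inference real Elasticsearch performs.

```python
def index_document(mapping, doc):
    """Return the set of fields that end up searchable for `doc`.

    `mapping` looks like {"dynamic": ..., "properties": {...}} and is
    updated in place when dynamic is True.
    """
    dynamic = mapping.get("dynamic", True)
    properties = mapping["properties"]
    unknown = [field for field in doc if field not in properties]

    if unknown and dynamic == "strict":
        # strict: reject the whole document
        raise ValueError(f"unmapped fields rejected: {unknown}")
    if dynamic is True:
        for field in unknown:
            # placeholder; the real engine infers a concrete type here
            properties[field] = {"type": "inferred"}
    # dynamic False: unknown fields stay in _source but are not searchable
    return {field for field in doc if field in properties}
```

For the `users_index` example, `dynamic: false` leaves only `name` and `createdAt` searchable, while `occupation` is kept in the stored document but cannot be queried.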

## Basic Use

### Create an Index

In [None]:
// PUT /books
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

### Set a Mapping

* if most of the fields in your data are not searchable, you can create a Mapping for the index without relying on the dynamic mapping
* you can see that one of the fields has a type of `nested`
    - this means that these are nested documents with their own fields
    - your decision on when to nest something is entirely dependent on its query patterns
        * if something is queried often but updated infrequently, you might want to nest it
        * this is similar to normalization/denormalization tradeoff with SQL databases

In [None]:
// PUT /books/_mapping
{
  "properties": {
    "title": { "type": "text" },
    "author": { "type": "keyword" },
    "description": { "type": "text" },
    "price": { "type": "float" },
    "publish_date": { "type": "date" },
    "categories": { "type": "keyword" },
    "reviews": {
      "type": "nested", // IMPORTANT!!!
      "properties": {
        "user": { "type": "keyword" },
        "rating": { "type": "integer" },
        "comment": { "type": "text" }
      }
    }
  }
}

### Add Documents

* simple POST request to the /_doc endpoint
* each request will return a document ID and data on how it persisted across the cluster
    - the `_version` field (together with `_seq_no` and `_primary_term`) can be used to update documents atomically

In [None]:
// POST /books/_doc
{
  "title": "The Great Gatsby",
  "author": "F. Scott Fitzgerald",
  "description": "A novel about the American Dream in the Jazz Age",
  "price": 9.99,
  "publish_date": "1925-04-10",
  "categories": ["Classic", "Fiction"],
  "reviews": [
    {
      "user": "reader1",
      "rating": 5,
      "comment": "A masterpiece!"
    },
    {
      "user": "reader2",
      "rating": 4,
      "comment": "Beautifully written, but a bit sad."
    }
  ]
}

// RESPONSE
{
  "_index": "books",
  "_id": "kLEHMYkBq7V9x4qGJOnh",
  "_version": 1, // IMPORTANT!!!
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

### Updating Documents

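* similar to creating a document but requires you to specify the document ID in the URL
* if you pass in `version` as a query parameter, you avoid overwriting someone else's changes
    - Elasticsearch compares the document's current version number with the one in the query parameter
    - if they match, it proceeds with the update
    - if not, it returns a version conflict error
    - it's a really simple example of __Optimistic Concurrency Control__
    - note: Elasticsearch 7+ does this check with the `if_seq_no` and `if_primary_term` query parameters (using the `_seq_no` and `_primary_term` values from the response); the `version` parameter is only accepted there with external versioning
* can use the `_update` endpoint to only update some fields and not the entire document at once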

In [None]:
// PUT /books/_doc/kLEHMYkBq7V9x4qGJOnh
{
  "title": "To Kill a Mockingbird",
  "author": "Harper Lee",
  "description": "A novel about racial injustice in the American South",
  "price": 13.99,
  "publish_date": "1960-07-11",
  "categories": ["Classic", "Fiction"],
  "reviews": [
    {
      "user": "reader3",
      "rating": 5,
      "comment": "Powerful and moving."
    }
  ]
}

// PUT /books/_doc/kLEHMYkBq7V9x4qGJOnh?version=1
...

// UPDATE ONLY PARTS OF THE DOCUMENT
// POST /books/_update/kLEHMYkBq7V9x4qGJOnh
{
  "doc": {
    "price": 14.99
  }
}
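
The version check behind Optimistic Concurrency Control can be sketched with a toy in-memory store. `TinyStore` and `VersionConflict` are invented names for illustration; this mirrors the idea, not Elasticsearch's actual implementation.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""

class TinyStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, body)

    def put(self, doc_id, body, expected_version=None):
        current_version, _ = self._docs.get(doc_id, (0, None))
        # only check when the caller asked for a conditional write
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, body)
        return new_version
```

Two writers that both read version 1 and both send `expected_version=1` cannot silently overwrite each other: the first write bumps the version to 2, so the second one raises `VersionConflict` and has to re-read and retry.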

## Search

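* queries are written in the JSON-based Query DSL; the ideas (filters, boolean logic, ranges) are similar to SQL
* some HTTP clients have issues sending a body in a GET request
    - you can put the query into the query string instead
    - or use POST on the same `_search` endpoint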

In [None]:
// GET /books/_search
{
  "query": {
    "match": {
      "title": "Great"
    }
  }
}

// GET /books/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Great" } },
        { "range": { "price": { "lte": 15 } } }
      ]
    }
  }
}

// GET /books/_search
{
  "query": {
    "nested": {
      "path": "reviews",
      "query": {
        "bool": {
          "must": [
            { "match": { "reviews.comment": "excellent" } },
            { "range": { "reviews.rating": { "gte": 4 } } }
          ]
        }
      }
    }
  }
}

* response will have:
    - document ids
    - scores based on relevance
    - source documents

In [None]:
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 2.1806526,
    "hits": [
      {
        "_index": "books",
        "_type": "_doc",
        "_id": "1",
        "_score": 2.1806526,
        "_source": {
          "title": "The Great Gatsby",
          "author": "F. Scott Fitzgerald",
          "price": 12.99
        }
      },
      {
        "_index": "books",
        "_type": "_doc",
        "_id": "2",
        "_score": 1.9876543,
        "_source": {
          "title": "Great Expectations",
          "author": "Charles Dickens",
          "price": 10.50
        }
      }
    ]
  }
}

## Sort

* can just add a sort parameter
    - can sort by multiple fields

In [None]:
// GET /books/_search
{
  "sort": [
    { "price": "asc" }
  ],
  "query": {
    "match_all": {}
  }
}

// GET /books/_search
{
  "sort": [
    { "price": "asc" },
    { "publish_date": "desc" }
  ],
  "query": {
    "match_all": {}
  }
}

### Sorting By Script

* allows sorting based on custom scripts using the "Painless" scripting language
* useful when you need to sort by a computed value

In [None]:
// GET /books/_search
{
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "source": "doc['price'].value * 0.9"
        },
        "order": "asc"
      }
    }
  ],
  "query": {
    "match_all": {}
  }
}

### Sorting On Nested Fields

* when sorting nested fields, you need to use a nested sort

In [None]:
// GET /books/_search
{
  "sort": [
    {
      "reviews.rating": {
        "order": "desc",
        "mode": "max",
        "nested": {
          "path": "reviews"
        }
      }
    }
  ],
  "query": {
    "match_all": {}
  }
}

### Relevance-Based Sorting

* if we don't specify sort order, Elasticsearch sorts results by relevance score (\_score)
* the default scoring algorithm is BM25, which is closely related to TF-IDF (Term Frequency-Inverse Document Frequency)
    - basically, it weighs frequency of a term heavily
    - while also reducing weight if that term appears in a lot of documents
    - for example: the term "the" has a high term frequency but also a very high document frequency
        * we want a term to be highly frequent in a document but only if it's in a small subset of documents
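
The intuition above can be checked with a few lines of Python. Note that this is the textbook TF-IDF with a smoothed IDF term, used here purely as an illustration; it is not the exact formula Elasticsearch uses, and `tf_idf` and the tiny corpus are made up.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    # term frequency: share of this document's tokens that are `term`
    tf = doc_tokens.count(term) / len(doc_tokens)
    # document frequency: how many documents contain `term` at all
    df = sum(1 for tokens in corpus if term in tokens)
    # smoothed inverse document frequency (common in many documents -> small)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf

corpus = [
    ["the", "great", "gatsby"],
    ["the", "old", "man", "and", "the", "sea"],
    ["the", "road"],
]
```

In the first document, "the" and "gatsby" each appear once, but "the" is in every document while "gatsby" is in only one, so `tf_idf("gatsby", corpus[0], corpus)` comes out higher than `tf_idf("the", corpus[0], corpus)`.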

## Pagination and Cursors

* pagination: allows you to retrieve a subset of search results, typically used to display results across multiple pages

### From/Size Pagination

* simplest form of pagination
    - from: starting index of results
    - size: # of results to return
* not efficient for deep pagination (10k+ results)
    - by default, requests where from + size exceeds 10,000 are rejected (the `index.max_result_window` setting)
    - this is due to the overhead of sorting/fetching all preceding documents
    - the cluster needs to retrieve and sort all these documents on each request, which can be quite expensive
        * Elasticsearch stores documents in shards
        * pagination happens at the shard level
        * then a global merge sort and reduce happens to return the # of documents needed for the query
        * the rest get discarded and only `size` documents are returned
    - e.g. from = 10,000 and size = 10
        - Elasticsearch has to sort 10,010 results on each shard
        - across the cluster, this adds up to: # of shards * 10,010 results
        - then the coordinating node must do a merge sort and reduce
            * think about the "merge K sorted arrays" LeetCode question using a priority queue
            * since the 10,010 results from each shard are already locally sorted, it just has to merge them
                - it does not naively combine 10,010 * N results into one list, sort it, then reduce down to 10,010
                - it never keeps more than 10,010 results during the merging phase
        - discard the first 10,000 results and only return 10, which is the size
    

In [None]:
// GET /my_index/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  }
}
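
The per-shard sort followed by a k-way merge can be sketched in Python with `heapq.merge`, which merges already-sorted inputs using a priority queue. The `paginate` function and the integer "sort keys" are assumptions for illustration; real shards return scored hits, not bare numbers.

```python
import heapq
from itertools import islice

def paginate(shards, from_, size):
    """shards: one list of sort keys per shard (unsorted)."""
    window = from_ + size
    # each shard sorts locally and returns only its top `from_ + size` hits
    per_shard = [sorted(hits)[:window] for hits in shards]
    # coordinating node: k-way merge of sorted lists, stopping after `window` items
    merged = islice(heapq.merge(*per_shard), window)
    # throw away the first `from_` results and keep the requested page
    return list(merged)[from_:]

shards = [[5, 1, 9], [2, 8, 3], [7, 4, 6]]
```

Here `paginate(shards, 4, 2)` returns `[5, 6]`: every shard still had to sort and ship up to 6 hits, and 4 of the merged results were computed only to be discarded, which is exactly the cost that grows with deep pages.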

### Search After

* __more efficient for deep pagination__
    - Elasticsearch knows exactly where to start for the next page
    - this also has the advantages of:
        * not missing documents that sort after your current position, even if new documents are added between requests
        * no duplicate results across pages
    - but also the cons of:
        * having to maintain state on the client side
            - i.e. remember sort values of last document
        * no random access to pages
            - must always move forward since you can only search_after
        * can also risk missing documents in previous pages if the underlying data was updated or deleted
* uses sort values of the last result as the starting point for the next page
* how it works:
    1. don't include `search_after` parameter in first query
    2. using the results of the first query, take the sort values of the last document
    3. the sort values become the `search_after` parameter for the next query
* in the example below, 1463538857 is the timestamp and 654323 is the \_id of the last document on the previous page
***
* `search_after` requires a deterministic sort
    - that's why we sort by date and \_id here
    - `search_after` then takes those 2 sort values and uses them as parameters
        * in the example below, we use date and \_id as the sort parameters
        * notice how we also pass in a date and an id into `search_after`'s array
    - __you usually need at least 2 fields: the first as a primary sort field and the second is the tie-breaker if you have duplicate values__
        * if you can guarantee uniqueness, then 1 field is sufficient
        * but \_id is usually used as the tie-breaker since it's always unique
* __the deterministic sort determines what `after` means__
    - e.g. if you sort in descending order and the last document you saw has timestamp 300:

| Document | Timestamp | Should shard scan?                |
| -------- | --------- | --------------------------------- |
| E        | 500       | ❌ (before 300 in descending sort) |
| D        | 400       | ❌ (before 300 in descending sort) |
| C        | 300       | ❌ (equal, already seen)           |
| B        | 200       | ✅                                 |
| A        | 100       | ✅                                 |

    - if this were ascending order for timestamp, you would only care about documents D and E since they have timestamps after 300

| Document | Timestamp | `_id` | Should shard scan? |
| -------- | --------- | ----- | ------------------ |
| A        | 100       | "A"   | ❌ too early        |
| B        | 200       | "B"   | ❌ too early        |
| C        | 300       | "C"   | ❌ equal, skip      |
| D        | 400       | "D"   | ✅ after            |
| E        | 500       | "E"   | ✅ after            |


In [None]:
// GET /my_index/_search
{
  "size": 10,
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  },
  "sort": [
    {"date": "desc"},
    {"_id": "desc"}
  ],
  "search_after": [1463538857, "654323"]
}
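
To see why the tie-breaker matters, here is a small in-memory imitation of `search_after` in Python. `search_page` is a hypothetical helper operating on `(timestamp, doc_id)` tuples sorted descending on both fields; it only models the cursor logic, not the distributed search.

```python
def search_page(docs, size, search_after=None):
    """docs: list of (timestamp, doc_id). Sort: timestamp desc, doc_id desc."""
    ordered = sorted(docs, reverse=True)
    if search_after is not None:
        # in a descending sort, "after" means strictly smaller sort values
        ordered = [d for d in ordered if d < tuple(search_after)]
    page = ordered[:size]
    # the client must remember the sort values of the last hit
    next_cursor = list(page[-1]) if page else None
    return page, next_cursor

docs = [(100, "A"), (200, "B"), (300, "C"), (300, "D"), (400, "E")]
```

The first page of size 2 ends on `(300, "D")`. Because the cursor includes the id, the second page starts at `(300, "C")`; with the timestamp alone, "after 300" would have skipped document C entirely.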

### Cursors

* cursors provide a stateful way to paginate through search results
    - solves problem of documents shifting underneath you
    - requires a lot more overhead than previous pagination methods
* it uses a `point in time (PIT)` API along with `search_after` for cursor-based pagination
* __it basically creates a snapshot of the data__
* how it works:
    1. create a `PIT` which returns an ID
    2. use the PIT ID in searches
    3. for subsequent paginated searches, use `search_after` with the PIT ID
    4. close the PIT when done

In [None]:
// POST /my_index/_pit?keep_alive=1m
// which returns us a PIT ID

// use the PIT for initial search
// GET /_search
{
  "size": 10,
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  },
  "pit": { // HERE!!!
    "id": "46To...",
    "keep_alive": "1m"
  },
  "sort": [
    {"_score": "desc"},
    {"_id": "asc"}
  ]
}

// for subsequent searches, add search_after
// GET /_search
{
  "size": 10,
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  },
  "pit": {
    "id": "46To...",
    "keep_alive": "1m"
  },
  "sort": [
    {"_score": "desc"},
    {"_id": "asc"}
  ],
  "search_after": [1.0, "1234"]
}

// close the PIT when done
// DELETE /_pit
{
  "id" : "46To..."
}
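
The PIT + `search_after` loop could be driven from client code roughly as follows. This is a sketch under assumptions: `send_search` is a hypothetical function you supply that performs the `/_search` request and returns the parsed JSON, and `build_body` and `iterate_hits` are invented helper names. Opening and closing the PIT are left out.

```python
def build_body(pit_id, query, sort, size=10, search_after=None, keep_alive="1m"):
    body = {
        "size": size,
        "query": query,
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        "sort": sort,
    }
    if search_after is not None:
        # omitted on the first request, set on every following one
        body["search_after"] = search_after
    return body

def iterate_hits(send_search, pit_id, query, sort, size=10):
    """Yield every hit, one page at a time, until a page comes back empty."""
    search_after = None
    while True:
        response = send_search(build_body(pit_id, query, sort, size, search_after))
        # keep using the most recent PIT id if the response carries one
        pit_id = response.get("pit_id", pit_id)
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # sort values of the last hit become the next cursor
        search_after = hits[-1]["sort"]
```

Writing it as a generator keeps the cursor state (the last sort values and the current PIT ID) in one place, so callers can simply loop over the hits.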

## How it Works

* ES built on top of Apache Lucene
    - Lucene = low-level search library
    - handles the searching aspect of Elastic
* ES handles the distributed systems part
    - cluster coordination
    - APIs
    - aggregations
    - real-time capabilities 

## Cluster Architecture

### Node Types

* Elasticsearch is a __distributed search engine__
    - you're actually creating multiple nodes when you create an Elasticsearch cluster
* there are 5 different node types:
    1. __Master Node__: coordinates the cluster (think Admin)
        - only node that can perform cluster-level operations
        - i.e. adding/removing nodes
        - or creating/deleting indices
    2. __Data Node__: stores the data
        - where your data is actually stored
        - will have a lot of these in a big cluster
    3. __Coordinating Node__: coordinates search requests across the cluster (frontend of your cluster)
        - receives search request from client and sends it to the appropriate nodes
    4. __Ingest Node__: responsible for data ingestion
        - i.e. transforms the data and prepares it for indexing
    5. __Machine Learning Node__: responsible for machine learning tasks
* every instance of Elasticsearch can be of multiple types depending on its configurations
    - e.g. an instance can be configured to be a master-eligible node and a coordinating node
* each node type may also have its own dedicated host
    - e.g. ingest node host might be CPU bound and have many processors
    - or data node host might have high disk I/O or more memory
* each node type also has specializations (Data tiers)
    - e.g. data nodes can be hot, warm, cold, or frozen depending on how likely data is to be queried (e.g. recent or not) and whether it can change
* when a cluster starts, you'll initially have a list of seed nodes that are master-eligible
    - they then perform a leader election algorithm process to choose a master for the cluster
    - only one node is allowed to be the active master while the other master-eligible nodes are on standby

## Data Nodes

* primary function: store documents and optimize search
    - think of it like a separate document database
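* a search request has 2 phases:
    1. query: each shard finds the relevant docs and returns only their document IDs and sort values/scores
    2. fetch: the full documents for the final results are then pulled from the nodes by ID (optional if you only need the IDs)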
* data nodes house our `Elasticsearch indexes`, i.e. collections of documents, not like a database index
    - each Index has `shards` of the data and their `replicas`
    - each shard/replica is a `Lucene index`
    - inside of a Lucene index are `Lucene segments`
* shards allow you to split up your data across hosts
    - i.e. across multiple nodes in the cluster, which improves performance and scalability
    - searches done on multiple nodes in parallel and merged/sorted by the Coordinating Node
* replica: exact copy of a shard
    - they serve 2 purposes:
        1. high availability => if one fails, you have the other to rely on
        2. increased throughput => Coordinating node can load balance requests to primary shard or its replica
* ES shards are 1:1 with Lucene indexes
    - you can think of the ES operations on shards as proxy operations on the Lucene indexes underneath

### Lucene Segment CRUD

* Lucene indexes made up of segments
* Lucene Segment: __immutable__ containers of indexed data
    - construct segments from multiple documents by batching writes together
    - Inserts: we batch inserts together into a segment and flush it to disk
    - when there are too many segments, Lucene merges them together to create a new segment
        * the old segments are deleted
    - Deletions: has a set of Deleted Identifiers
        * when we query for a deleted document on a segment, it's treated as not being there even though the data is still there
        * during segment merge operations, these deleted documents are cleaned up
    - Updates: insert a new document with the updated information and soft delete the old document in the previous segment
        * the old document will get cleaned up during segment merge operations
    - __`basically: updates and deletions may only create new segments and do not modify existing ones.`__
* __UPDATES HAVE _WORSE_ PERFORMANCE THAN INSERTIONS BECAUSE OF THE OVERHEAD OF SOFT DELETIONS. THIS IS PART OF WHY ELASTICSEARCH IS NOT A GREAT FIT FOR DATA THAT UPDATES FREQUENTLY__
* Pros of Immutable Segments:
    - Improved write performance: new documents can be quickly added to new segments without modifying existing ones
    - Efficient caching: since segments are immutable, they can be safely cached in memory or on SSD without worrying about consistency issues
    - Simplified concurrency: read operations don't need to worry about data changing mid-query
    - Easier recovery: in case of a crash, easier to recover from immutable segments as their state is known and consistent
    - Optimized compression: immutable data can be more effectively compressed, saving disk space
    - Faster searches: allows for optimized data structures and algorithms for searching
* Cons of Immutable Segments:
    - Requires periodic segment merges
    - Temporary increased storage requirements before cleanup merge operations

### Lucene Segment Features

* have data structures for search operations and the most important ones are:
    1. inverted index
    2. doc values
    
#### Inverted Index

* the inverted index is the heart of Lucene
* conceptually, it's a hash map where:
    - key = word/token
    - value = list of documents the word/token appears in
* this gives us fast, roughly O(1), lookups for things like finding all books that contain the word "great" in their title
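
As a rough illustration of the lookup, the Python sketch below builds term-to-docID lists for three book titles and answers a multi-term query by intersecting them. The helper names are made up, whitespace splitting stands in for real analysis, and a plain dict stands in for Lucene's on-disk term dictionary.

```python
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for token in set(docs[doc_id].lower().split()):
            index[token].append(doc_id)  # postings stay sorted by doc id
    return index

def search_all(index, *terms):
    """Doc ids that contain every term (AND of the postings lists)."""
    postings = [set(index.get(t.lower(), [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

titles = {
    101: "The Great Gatsby",
    102: "Great Expectations",
    103: "The Great Alone",
}
index = build_inverted_index(titles)
```

Looking up `index["great"]` is a single dictionary access that returns `[101, 102, 103]`, and `search_all(index, "the", "great")` narrows that to `[101, 103]` without scanning any title text.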

#### Doc Values

* on-disk, column-oriented data structure that lets us access a single field's values for a doc without having to load the entire document
    - if you only need the price of a book, why do you care about the title, author, reviews, etc
        * __this is a common problem for row-oriented databases like relational databases. even though I only need to access a single column, I need to read the entire row and index into it__
    - this is a waste of space/memory since you need to load up the entire row just for a specific value
    - what if you could just query the document ID and its corresponding price?
* so using Doc Values, we can easily sort the results using a field's value
* for example:
    - row-based:

| docId | Title                | Author              | Price |
| ----- | -------------------- | ------------------- | ----- |
| 101   | *The Great Gatsby*   | F. Scott Fitzgerald | \$14  |
| 102   | *Great Expectations* | Charles Dickens     | \$12  |
| 103   | *The Great Alone*    | Kristin Hannah      | \$16  |

    - column-based:
        * think of it like a contiguous chunk of memory where each cell implicitly represents the docID
            - keep in mind that doc values are on-disk data structures
        * the reason why this works is because Lucene assigns each document in a segment a __sequential docID based on when it was inserted__
            - the documents get assigned new docIDs when segments are merged
            - a docID is a 32-bit number
                * always starts at 0 per-segment
                * maximum docID = (# of documents in a segment) - 1, since numbering starts at 0
                * deleted documents maintain their docIDs until the merging process; they are just treated as being non-existent
                * docIDs are only used internally by Lucene and cannot be used externally
            - there are 2 types of docIDs: global and per-segment
                * global docID: (per-segment docID) + (segment's base docID offset)
    ```
    {
      "Title":    ["The Great Gatsby", "Great Expectations", "The Great Alone"],
      "Author":   ["F. Scott Fitzgerald", "Charles Dickens", "Kristin Hannah"],
      "Price":    [14, 12, 16]
    }
    ```
* __source__: [Lucene Index](https://lucene.apache.org/core/9_9_1/core/org/apache/lucene/index/package-summary.html)
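
A minimal Python sketch of the columnar idea, assuming one in-memory list per field where the list position is the per-segment docID. Real doc values live on disk in an encoded form, and `segment_base` is a made-up offset used only to show the global docID arithmetic.

```python
# one array per field; the position in the array is the per-segment docID
segment_base = 100  # hypothetical base docID offset of this segment
price_column = [14, 12, 16]
title_column = ["The Great Gatsby", "Great Expectations", "The Great Alone"]

def sort_by_price(matching_doc_ids):
    # only the price column is read; titles and authors are never touched
    return sorted(matching_doc_ids, key=lambda doc_id: price_column[doc_id])

def global_doc_id(local_doc_id):
    # global docID = per-segment docID + segment's base offset
    return segment_base + local_doc_id
```

Sorting the three matches by price only touches `price_column` and returns the per-segment docIDs `[1, 0, 2]`; the titles are looked up afterwards, and only for the hits that are actually returned.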