# Import and Client Connect

In [2]:
from dotenv import load_dotenv
import os
from elasticsearch import helpers  # Helper functions, used here for bulk data uploading
from elasticsearch import Elasticsearch  # Client class for interacting with Elasticsearch

load_dotenv()
client = Elasticsearch("http://localhost:9200/", api_key=os.getenv('apikey'))

# Test the connection
print(client.info())

ConnectionError: Connection error caused by: ConnectionError(Connection error caused by: ProtocolError(('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))))

# Elasticsearch Terminology

https://www.elastic.co/guide/en/elastic-stack-glossary/current/terms.html

## Index

An index is a data space that stores documents with similar characteristics.

## Document and Fields

A **document** is a single data entry, comparable to a row in SQL. It can also be thought of as a collection of **fields**.

**Fields** are comparable to columns in SQL: they hold the actual values that define a document. Because Elasticsearch is not a relational database with fixed tables, each **field** is stored as a key-value pair.

In Python terms, each **document** is a collection of key-value pairs, i.e. a dictionary.
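As a small illustration of the dictionary analogy, the sketch below writes one document as a plain Python dictionary and serializes it to JSON, which is the form in which it travels to Elasticsearch. The field values are made up for illustration.

```python
import json

# A hypothetical document: each key is a field name, each value is that field's value.
document = {
    "resolution": 256,
    "eccentricity": 0.3,
    "path to image": "/path/to/image_0.png",
}

# Dictionaries like this are serialized to JSON before being sent over HTTP.
as_json = json.dumps(document, sort_keys=True)
print(as_json)

# Round-tripping through JSON gives back an equal dictionary.
restored = json.loads(as_json)
print(restored == document)
```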

# Definition of an index

Defining an index lets you choose its settings and pre-determine the data types of specific **fields** by providing a mapping. If these are not set, Elasticsearch applies default settings and attempts to infer each **field**'s data type the first time it encounters that field.
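To make the auto-interpretation concrete, here is a rough, pure-Python approximation of the default dynamic mapping rules as I understand them from the Elasticsearch documentation. The function `guess_dynamic_type` is hypothetical (it is not part of any library), and it deliberately ignores date detection, arrays, and null values.

```python
def guess_dynamic_type(value):
    """Rough approximation of Elasticsearch's default dynamic field mapping."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        # Date detection is ignored in this sketch.
        return "text (with a keyword sub-field)"
    return "unknown"


# Made-up sample values for three fields.
sample = {"git hash": "8743b5", "resolution": 256, "time": 12.5}
guessed = {field: guess_dynamic_type(value) for field, value in sample.items()}
for field, es_type in guessed.items():
    print(f"{field!r} -> {es_type}")
```

Under these assumptions an integer field would be inferred as a long and a string as analysed text, which is one reason to declare types such as integer or keyword explicitly.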

The most important keys are probably:
- [**For the future**] settings: https://www.elastic.co/guide/en/elasticsearch/reference/8.15/index-modules.html. Two settings that may become important later:
    - A **shard** is a component that holds the data and carries out the indexing and searching operations. The number of primary **shards** can only be set at index creation.
    - Replica **shards** provide data redundancy and serve extra searches when multiple search requests hit the data in a given shard. Their number can be set dynamically, not just at index creation. Elasticsearch balances them across the nodes in a cluster.
- mappings: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
    - properties contains the actual fields of the documents: https://www.elastic.co/guide/en/elasticsearch/reference/current/explicit-mapping.html
        - Data types of **fields** can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html
        - **Fields** can be objects, or nested to contain their own fields: https://www.elastic.co/guide/en/elasticsearch/reference/current/properties.html
        - Some metadata can be added to fields, such as units: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-field-meta.html
        - For more mapping parameters: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-params.html

Replica shards not only provide data redundancy but also facilitate extra searches in case of multiple search requests. These shards can be set dynamically, and Elasticsearch will balance them across nodes in a cluster.

# Python definition of an index

Because of the simple interface provided by the elasticsearch Python package, we can define the mapping as a dictionary, which is then translated into JSON for us.
Here is an example:




In [57]:
'''
"settings" is technically not needed when working on a simple local host, but it can be tuned to optimise search performance on a database that is hosted on a cluster and searched by multiple users.
"mappings" is required if you wish to explicitly map fields to specific data types.
'''

index_definition = {
    "settings": {  
        "number_of_shards": 1,
    },
    "mappings": {
        "properties": {
            "git hash": {"type": "keyword"},
            "resolution": {"type": "integer"},
            "time": {
                "type": "float",
                "meta": {
                    "unit": "s",
                }
            },
            "path to image": {"type": "keyword"},
            "eccentricity": {"type": "float"},
            "semi-major axis": {"type": "float"},
            "primary mass": {"type": "float"},
            "primary racc": {"type": "float"},
            "primary Teff": {
                "type": "float",
                "meta": {
                    "unit": "K",
                }
            },
            "primary Reff": {
                "type": "float",
                "meta": {
                    "unit": "parsec",
                }
            },
            "secondary mass": {
                "type": "float",
                "meta": {
                    "unit": "g",
                }
            },
            "secondary racc": {"type": "float"},
            "secondary Teff": {
                "type": "float",
                "meta": {
                    "unit": "K",
                }
            },
            "secondary Reff": {
                "type": "float",
                "meta": {
                    "unit": "parsec",
                }
            },
            "adiabatic index": {"type": "float"},
            "mean molecular weight": {"type": "float"},
            "icooling": {"type": "integer"}
        }
    }
}
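The definition above keeps every field flat (for example "primary mass" and "primary Teff"). As noted in the mapping links, fields can also be objects with their own properties. The sketch below shows how such a grouping might look; the field layout is hypothetical and is not the one used in the rest of this notebook. The helper `field_paths` is also made up: it lists the dotted paths by which sub-fields of an object are addressed.

```python
# Hypothetical alternative layout: group the primary star's fields in one object.
nested_mappings = {
    "properties": {
        "git hash": {"type": "keyword"},
        # An object field declares its sub-fields in its own "properties".
        "primary": {
            "properties": {
                "mass": {"type": "float", "meta": {"unit": "g"}},
                "Teff": {"type": "float", "meta": {"unit": "K"}},
            }
        },
    }
}


def field_paths(properties, prefix=""):
    """Yield (dotted path, type) for every leaf field in a "properties" dict."""
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:
            # Recurse into object fields, extending the dotted path.
            yield from field_paths(spec["properties"], prefix=path + ".")
        else:
            yield path, spec["type"]


paths = dict(field_paths(nested_mappings["properties"]))
print(paths)
```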

# Creating an Index

Creating an index is simple. The most important arguments are likely to be:
- index: [str]: Name of the index to be created; must be lowercase.
- body: [dict]: Dictionary containing settings and mappings.
- settings: [Mapping]: Only the settings for the index to be created. (This is listed in the package documentation, but it does not currently work for me, so I used body instead.)
- mappings: [Mapping]: Only the mappings for the index to be created. (This is listed in the package documentation, but it does not currently work for me, so I used body instead.)

In [None]:
client.indices.create(index="trial_index", body=index_definition)
# You can delete an index with 
#client.indices.delete(index="trial_index", ignore_unavailable=True)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'trial_index'})
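Since index names must be lowercase, it can help to check a name before sending the request. The sketch below is a hypothetical helper (`check_index_name` is not part of the elasticsearch package) that covers the naming rules I am aware of from the Elasticsearch documentation; it may not be exhaustive.

```python
# Characters that Elasticsearch rejects in index names (assumed from its naming rules).
FORBIDDEN_CHARS = set('\\/*?"<>| ,#')


def check_index_name(name: str) -> list:
    """Return a list of problems with `name`; an empty list means it looks valid."""
    problems = []
    if name != name.lower():
        problems.append("must be lowercase")
    if any(ch in FORBIDDEN_CHARS for ch in name):
        problems.append("contains a forbidden character")
    if name.startswith(("-", "_", "+")):
        problems.append("cannot start with -, _ or +")
    if name in (".", ".."):
        problems.append("cannot be . or ..")
    if len(name.encode("utf-8")) > 255:
        problems.append("longer than 255 bytes")
    return problems


for candidate in ["trial_index", "Trial Index", "_hidden"]:
    print(candidate, "->", check_index_name(candidate) or "ok")
```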

# Kibana

Once the index is created in Elasticsearch, you can view it in Kibana: go to **Management**, scroll down the left menu bar to the **Kibana** subsection, and click **Data Views**. Then click **Create data view** at the top right to integrate the new index into the Kibana interface. This lets you see how Kibana interprets the index you have created.

# Adding documents

To add documents to an index, pass a single dictionary to the client.index() function, or a list of dictionaries to client.bulk() (the helpers.bulk() wrapper used below is usually more convenient). Adding "refresh=True" to either function makes the new **document(s)** searchable immediately; otherwise users have to wait for the shard holding the data to refresh in its normal cycle.
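One detail about the low-level client.bulk() call: as far as I understand the bulk API, its list is not simply one dictionary per document. Each document is preceded by an action dictionary that names the operation and the target index. The sketch below only builds such a list; `to_bulk_operations` is a hypothetical helper and the documents are made up.

```python
def to_bulk_operations(index_name, documents):
    """Interleave one action dictionary and one source dictionary per document."""
    operations = []
    for doc in documents:
        operations.append({"index": {"_index": index_name}})  # action metadata
        operations.append(doc)  # the document itself
    return operations


docs = [{"resolution": 64}, {"resolution": 128}]
ops = to_bulk_operations("trial_index", docs)
print(ops)

# The list could then be sent with something like the following
# (assumes a recent 8.x client, where the parameter is named "operations"):
#   client.bulk(operations=ops, refresh=True)
```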

In [59]:
# Code block to define function to randomly generate data
import random

def gen_data(image_number: int):
    return {
        "git hash": "8743b52063cd84097a65d1633f5c74f5",
        "resolution": random.choice([64, 128, 256, 512]),
        "time": random.uniform(2000.0, 50000.0),
        "path to image": f"/path/to/image_{image_number}.png",
        "eccentricity": random.uniform(0.0, 1.0),
        "semi-major axis": random.uniform(1.0, 10.0),
        "primary mass": random.uniform(1000.0, 100000.0),
        "primary racc": random.uniform(1.0, 10.0),
        "primary Teff": random.uniform(100.0, 1000.0),
        "primary Reff": random.uniform(1.0, 5.0),
        "secondary mass": random.uniform(10.0, 1000.0),
        "secondary racc": random.uniform(0.0, 1.0),
        "secondary Teff": random.uniform(1.0, 100.0),
        "secondary Reff": random.uniform(0.0, 1.0),
        "adiabatic index": random.uniform(1.0, 10.0),
        "mean molecular weight": random.uniform(1.0, 50.0),
        "icooling": int(random.uniform(20.0, 50.0)),
    }

# Single Document

In [65]:
data = gen_data(image_number=100)

client.index(index='trial_index', body=data)
del data

# Bulk / Multiple Documents

To load multiple items in bulk, we take advantage of helpers.bulk().

To set this up, we supply helpers.bulk() with a list of dictionaries. Each dictionary needs to name the index it belongs to and the type of operation we are performing with it. In this case the operation is **index**, which adds a document to a specific index. We could also bulk delete or update, but those operations require additional keys, as listed below.

For submitting a new document, each dictionary needs at least the "_index" key, but here are some others that could be of interest:
- "_op_type": Name of the operation to perform, as a string.
    - 'index': Add a new document to the named index [default value]
    - 'delete': Delete a specific document (**requires '_id'**)
    - 'update': Update a specific document (**requires '_id'**); according to the helpers documentation, the fields to change are supplied under a "doc" key
- "_index": Name of the index to add to, as a string
- "_id": ID of the document
- "_source": Dictionary to be added to the index as a document. (If not specified, all fields that are not Elasticsearch metadata are used as the "_source")
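The keys listed above can be combined into small builder functions, one per operation type. This is a sketch under assumptions: the function names and document IDs are hypothetical, and the update action follows the example in the helpers documentation, which places the changed fields under a "doc" key.

```python
def index_action(index_name, doc, doc_id=None):
    """Action that adds `doc` to `index_name`, optionally with a fixed ID."""
    action = {"_op_type": "index", "_index": index_name, "_source": doc}
    if doc_id is not None:
        action["_id"] = doc_id
    return action


def update_action(index_name, doc_id, changes):
    """Action that partially updates an existing document."""
    return {"_op_type": "update", "_index": index_name, "_id": doc_id, "doc": changes}


def delete_action(index_name, doc_id):
    """Action that deletes an existing document; no payload is needed."""
    return {"_op_type": "delete", "_index": index_name, "_id": doc_id}


# Hypothetical IDs, for illustration only.
actions = [
    index_action("trial_index", {"resolution": 64}),
    update_action("trial_index", "some-id", {"resolution": 128}),
    delete_action("trial_index", "another-id"),
]
for action in actions:
    print(action)

# The list could then be passed on as: helpers.bulk(client, actions)
```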

In [70]:
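# Keys shared by every action: the target index and the operation type
base_command = {
    "_index": 'trial_index',
    "_op_type": "index"
}
# Build one action per document; the dict union operator (|) requires Python 3.9+
operations = []
for i in range(100):
    data = gen_data(image_number=i)
    operations.append(base_command | {"_source": data})
    del data
print(operations)
# refresh=True makes the new documents searchable immediately
helpers.bulk(client, operations, refresh=True)
del operations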

[{'_index': 'trial_index', '_op_type': 'index', '_source': {'git hash': '8743b52063cd84097a65d1633f5c74f5', 'resolution': 64, 'time': 46731.8309969084, 'path to image': '/path/to/image_0.png', 'eccentricity': 0.021311017265269627, 'semi-major axis': 7.760456532013592, 'primary mass': 24628.116243496588, 'primary racc': 3.8284912267225883, 'primary Teff': 726.6522729196142, 'primary Reff': 2.868619449010579, 'secondary mass': 983.2194107488992, 'secondary racc': 0.9354971423304048, 'secondary Teff': 40.249319844601885, 'secondary Reff': 0.3945288955812304, 'adiabatic index': 1.097553242146389, 'mean molecular weight': 11.697197641730558, 'icooling': 44}}, {'_index': 'trial_index', '_op_type': 'index', '_source': {'git hash': '8743b52063cd84097a65d1633f5c74f5', 'resolution': 128, 'time': 11372.634024696366, 'path to image': '/path/to/image_1.png', 'eccentricity': 0.6935322363256596, 'semi-major axis': 9.356162441292232, 'primary mass': 24123.67365393174, 'primary racc': 2.862491998252523

# Updating a document

If you wish to update a single **document**, you can do so using the client.update() function. Not all fields have to be specified; it is enough to list only the fields to alter or add. For bulk updating, see above.

In [69]:
data = gen_data(1)
doc_id = "<INSERT DOCUMENT ID TO UPDATE HERE>"
client.update(index='trial_index', id=doc_id, doc=data)

NotFoundError: NotFoundError(404, 'document_missing_exception', '[<INSERT DOCUMENT ID TO UPDATE HERE>]: document missing')
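Because only the fields to alter or add need to be sent, one option is to compute that partial dictionary from the old and new versions of a document. The helper below is a hypothetical sketch (`changed_fields` is not part of the elasticsearch package) and the values are made up.

```python
def changed_fields(old, new):
    """Return only the entries of `new` that are absent from, or differ from, `old`."""
    return {key: value for key, value in new.items() if key not in old or old[key] != value}


stored = {"resolution": 64, "eccentricity": 0.2}
edited = {"resolution": 128, "eccentricity": 0.2, "icooling": 3}

partial_doc = changed_fields(stored, edited)
print(partial_doc)

# The result could then be passed on as:
#   client.update(index='trial_index', id=doc_id, doc=partial_doc)
```

Note that this only covers altered and added fields; fields missing from the new version are left untouched by a partial update.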

# Deleting an Index

The following command deletes a named index. It does not remove the Kibana Data View, which has to be deleted manually in the Kibana interface.

In [None]:
client.indices.delete(index='trial_index')

ObjectApiResponse({'acknowledged': True})

# Queries

To search an index, pass a query to client.search(). The helper function below prints a few fields from each hit in the response.

In [77]:
def pretty_response(response):
    hits = response["hits"]["hits"]
    if len(hits) == 0:
        print("Your search returned no results.")
    else:
        for hit in hits:
            doc_id = hit["_id"]  # named doc_id to avoid shadowing the built-in id()
            source = hit["_source"]
            git_hash = source["git hash"]
            time = source["time"]
            path_to_image = source["path to image"]
            eccentricity = source["eccentricity"]
            semimajor_axis = source["semi-major axis"]
            primarymass = source["primary mass"]
            pretty_output = f"\nID: {doc_id}\nGit hash: {git_hash}\ntime: {time}\n path to image: {path_to_image}\neccentricity: {eccentricity}\nsemi major axis: {semimajor_axis}\nprimary mass: {primarymass}"
            print(pretty_output)


In [78]:

# Specific query
response = client.search(
    index="trial_index",
    query={"match": {"path to image": {"query": "/path/to/image_1.png"}}}
)
pretty_response(response)




ID: kxoj_ZIB2m-pINCabY4h
Git hash: 8743b52063cd84097a65d1633f5c74f5
time: 26997.225199188404
 path to image: /path/to/image_1.png
eccentricity: 0.14867078912827159
semi major axis: 4.7135573696277
primary mass: 72419.07068018237

ID: 9xop_ZIB2m-pINCakI4S
Git hash: 8743b52063cd84097a65d1633f5c74f5
time: 30202.858845965897
 path to image: /path/to/image_1.png
eccentricity: 0.7436089083583735
semi major axis: 6.680331363555909
primary mass: 7988.655233781478

ID: XBop_ZIB2m-pINCaq4-d
Git hash: 8743b52063cd84097a65d1633f5c74f5
time: 11537.91539363978
 path to image: /path/to/image_1.png
eccentricity: 0.8965022019032595
semi major axis: 1.5030274830953234
primary mass: 82129.9734928058

ID: wBpC_ZIB2m-pINCa1I89
Git hash: 8743b52063cd84097a65d1633f5c74f5
time: 11372.634024696366
 path to image: /path/to/image_1.png
eccentricity: 0.6935322363256596
semi major axis: 9.356162441292232
primary mass: 24123.67365393174
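Queries are plain dictionaries, so they can be assembled programmatically. The sketch below builds a query that combines an exact match on the keyword field "path to image" with a numeric range on "eccentricity". The function `build_query` and its parameters are hypothetical; the "term", "range", "bool" and "match_all" query types are standard Elasticsearch query DSL.

```python
def build_query(image_path=None, min_ecc=None, max_ecc=None):
    """Build a query dictionary from optional filters; match everything if none are given."""
    filters = []
    if image_path is not None:
        # "term" matches the exact value, which suits keyword fields.
        filters.append({"term": {"path to image": image_path}})
    ecc_range = {}
    if min_ecc is not None:
        ecc_range["gte"] = min_ecc
    if max_ecc is not None:
        ecc_range["lte"] = max_ecc
    if ecc_range:
        filters.append({"range": {"eccentricity": ecc_range}})
    if not filters:
        return {"match_all": {}}
    # All clauses under "filter" must match.
    return {"bool": {"filter": filters}}


query = build_query(image_path="/path/to/image_1.png", max_ecc=0.5)
print(query)

# The result could then be passed on as:
#   response = client.search(index="trial_index", query=query)
```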


In [79]:

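# Query all: with no index or query given, the search matches every document
# in every index (only the first 10 hits are returned by default)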
response = client.search(
)
pretty_response(response)


ID: kRoj_ZIB2m-pINCabY4C
Git hash: 8743b52063cd84097a65d1633f5c74f5
time: 16964.27075460085
 path to image: /path/to/image_100.png
eccentricity: 0.6710235651354755
semi major axis: 7.607443895941159
primary mass: 90806.05516269672

ID: khoj_ZIB2m-pINCabY4h
Git hash: 8743b52063cd84097a65d1633f5c74f5
time: 34984.41159637372
 path to image: /path/to/image_0.png
eccentricity: 0.34049718145008046
semi major axis: 4.937621784322797
primary mass: 56460.23565818521

ID: kxoj_ZIB2m-pINCabY4h
Git hash: 8743b52063cd84097a65d1633f5c74f5
time: 26997.225199188404
 path to image: /path/to/image_1.png
eccentricity: 0.14867078912827159
semi major axis: 4.7135573696277
primary mass: 72419.07068018237

ID: lBoj_ZIB2m-pINCabY4h
Git hash: 8743b52063cd84097a65d1633f5c74f5
time: 33273.04307932191
 path to image: /path/to/image_2.png
eccentricity: 0.062241446395312705
semi major axis: 9.888259973299094
primary mass: 57109.6802337451

ID: lRoj_ZIB2m-pINCabY4h
Git hash: 8743b52063cd84097a65d1633f5c74f5
time: 3