# Elasticsearch: Basic Operations

In this notebook, I'll explore how to interact with Elasticsearch. I will cover:
- Setting up the connection to an Elasticsearch instance running locally.
- Creating an index and defining mappings.
- Indexing documents individually and in bulk.
- Performing basic searches and aggregations.
- Updating and deleting documents.


### 1. Setting Up the Connection

Start by connecting to the Elasticsearch instance running on `localhost:9200`. This connection will be used for all subsequent operations.


In [4]:
from elasticsearch import Elasticsearch

# Initialize the Elasticsearch client
db_client = Elasticsearch("http://localhost:9200")

# Verify the connection by pinging the Elasticsearch server
if db_client.ping():
    print("Connected to Elasticsearch")
else:
    print("Could not connect to Elasticsearch")


Connected to Elasticsearch


In [5]:
import json
res = db_client.info()
res_dict = res.body if hasattr(res, 'body') else dict(res)
print(json.dumps(res_dict, indent=4))

{
    "name": "53454524eded",
    "cluster_name": "docker-cluster",
    "cluster_uuid": "YKSJfR6lQ2GNjB4UiyoJMQ",
    "version": {
        "number": "8.4.3",
        "build_flavor": "default",
        "build_type": "docker",
        "build_hash": "42f05b9372a9a4a470db3b52817899b99a76ee73",
        "build_date": "2022-10-04T07:17:24.662462378Z",
        "build_snapshot": false,
        "lucene_version": "9.3.0",
        "minimum_wire_compatibility_version": "7.17.0",
        "minimum_index_compatibility_version": "7.0.0"
    },
    "tagline": "You Know, for Search"
}
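
The `res.body if hasattr(res, 'body') else dict(res)` check above covers both client generations: the 8.x Python client wraps responses in an object that exposes `.body`, while older clients return a plain dict. Since the same check recurs later in this notebook, it could be factored into a small helper. The following is a minimal sketch; the names `to_plain_dict` and `pretty` are hypothetical and not part of the Elasticsearch client.

```python
import json


def to_plain_dict(response):
    """Return the JSON payload of a client response as a plain dict.

    Hypothetical helper: handles response objects that expose `.body`
    as well as responses that are already mappings.
    """
    return response.body if hasattr(response, "body") else dict(response)


def pretty(response):
    """Serialize a response payload as indented JSON for display."""
    return json.dumps(to_plain_dict(response), indent=4)
```

With this in place, the cell above could be reduced to `print(pretty(db_client.info()))`.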


### 2. Creating an Index with Mappings

An index in Elasticsearch is loosely comparable to a database in an RDBMS. Here, I'll create an index called `doc_index` and define a simple mapping that specifies the data type of each field.


In [13]:
index_name = "doc_index"

In [6]:
# Define the settings and mappings for the index
index_body = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "content": {"type": "text"},
            "timestamp": {"type": "date"}
        }
    }
}


In [7]:
# Create the index
if not db_client.indices.exists(index=index_name):
    db_client.indices.create(index=index_name, body=index_body)
    print(f"Index '{index_name}' created successfully")
else:
    print(f"Index '{index_name}' already exists")

Index 'doc_index' already exists


### 3. Indexing a Single Document

In this step, we will index (insert) a single document into the `doc_index` index. Documents in Elasticsearch are stored in JSON format.


In [8]:
doc = {
    'title': 'Introduction to Elasticsearch',
    'content': 'Elasticsearch is a powerful search engine based on the Lucene library.',
    'timestamp': '2024-08-25T13:00:00'
}


In [9]:
# Index the document under an explicit document ID (re-running overwrites it, hence "updated")
res = db_client.index(index=index_name, id=1, document=doc)
print("Document indexed:", res['result'])

Document indexed: updated


### 4. Bulk Indexing Multiple Documents

Indexing documents one at a time can be inefficient for large datasets. 

> Elasticsearch provides a bulk API that allows us to index multiple documents in a single request.


In [10]:
# Multiple documents to be indexed in bulk
actions = [
    {
        "_index": index_name,
        "_id": 2,
        "_source": {
            "title": "ABC",
            "content": "Bla bla bla bla.",
            "timestamp": "2024-08-25T14:00:00"
        }
    },
    {
        "_index": index_name,
        "_id": 3,
        "_source": {
            "title": "DEF",
            "content": "omg omg",
            "timestamp": "2024-08-25T15:00:00"
        }
    }
]


In [11]:
from elasticsearch.helpers import bulk

res = bulk(db_client, actions)
print(f"Bulk indexing completed: {res[0]} documents indexed.")


Bulk indexing completed: 2 documents indexed.
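
The bulk helper is not limited to inserts. Each action may carry an `_op_type` key (`index` by default, or `create`, `update`, `delete`), so one request can mix writes, partial updates, and deletions. The sketch below only builds such actions; the function names and the sample document "GHI" are made up for illustration, and nothing here is executed against the cluster.

```python
def index_action(index_name, doc_id, source):
    """Action that inserts or overwrites a document (the default op type)."""
    return {"_op_type": "index", "_index": index_name, "_id": doc_id, "_source": source}


def update_action(index_name, doc_id, partial_doc):
    """Action that merges `partial_doc` into an existing document."""
    return {"_op_type": "update", "_index": index_name, "_id": doc_id, "doc": partial_doc}


def delete_action(index_name, doc_id):
    """Action that removes a document by ID."""
    return {"_op_type": "delete", "_index": index_name, "_id": doc_id}


# Illustrative values only; these are not part of the notebook's dataset
mixed_actions = [
    index_action("doc_index", 4, {
        "title": "GHI",
        "content": "placeholder text",
        "timestamp": "2024-08-25T16:00:00",
    }),
    update_action("doc_index", 2, {"content": "Updated content."}),
    delete_action("doc_index", 3),
]
```

Passing `mixed_actions` to `bulk(db_client, mixed_actions)` would apply all three operations in a single request. It is left out of the run here because it would change the contents of `doc_index` shown in the following sections.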


### 5. Listing All Indexed Documents

We can list the documents in an index by running a search with a `match_all` query, which matches every document in the index. By default a search returns only the first 10 hits, so for larger datasets set `size` and paginate with `from` to retrieve the documents in batches.

In [12]:
# List all documents in the index with a match-all query
list_query = {
    "query": {
        "match_all": {}
    }
}

# Execute the search to list documents
res = db_client.search(index=index_name, body=list_query)
print(f"Total documents found: {res['hits']['total']['value']}")
for hit in res['hits']['hits']:
    print(json.dumps(hit['_source'], indent=4))


Total documents found: 3
{
    "title": "Introduction to Elasticsearch",
    "content": "Elasticsearch is a powerful search engine based on the Lucene library.",
    "timestamp": "2024-08-25T13:00:00"
}
{
    "title": "ABC",
    "content": "Bla bla bla bla.",
    "timestamp": "2024-08-25T14:00:00"
}
{
    "title": "DEF",
    "content": "omg omg",
    "timestamp": "2024-08-25T15:00:00"
}


In [20]:
# Page through the index with from/size, printing every document
def paginate_through_documents(index_name, page_size=100):
    page = 0
    while True:
        query_body = {
            "query": {
                "match_all": {}
            },
            "size": page_size,
            "from": page * page_size
        }
        res = db_client.search(index=index_name, body=query_body)
        hits = res['hits']['hits']
        if not hits:
            break
        for hit in hits:
            print(json.dumps(hit['_source'], indent=4))
        page += 1


In [21]:
paginate_through_documents(index_name)

{
    "title": "Introduction to Elasticsearch",
    "content": "Elasticsearch is a powerful search engine based on the Lucene library.",
    "timestamp": "2024-08-25T13:00:00"
}
{
    "title": "ABC",
    "content": "Bla bla bla bla.",
    "timestamp": "2024-08-25T14:00:00"
}
{
    "title": "DEF",
    "content": "omg omg",
    "timestamp": "2024-08-25T15:00:00"
}
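
Paging with `from` and `size` works well for small indexes, but Elasticsearch rejects requests where `from + size` exceeds `index.max_result_window` (10,000 by default). The sketch below builds the body for one page and fails early when that limit would be crossed. `build_page_query` and `MAX_RESULT_WINDOW` are hypothetical names, and the optional sort assumes the chosen field (for example `timestamp` in `doc_index`) exists in the mapping.

```python
MAX_RESULT_WINDOW = 10_000  # Elasticsearch default for index.max_result_window


def build_page_query(page, page_size=100, sort_field=None):
    """Build a match_all search body for one zero-based page of results."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    offset = page * page_size
    if offset + page_size > MAX_RESULT_WINDOW:
        raise ValueError(
            f"from + size = {offset + page_size} exceeds the default result "
            f"window of {MAX_RESULT_WINDOW}; use search_after or scroll instead"
        )
    body = {"query": {"match_all": {}}, "from": offset, "size": page_size}
    if sort_field is not None:
        # An explicit sort makes the page order more predictable between requests
        body["sort"] = [{sort_field: "asc"}]
    return body
```

A call such as `db_client.search(index=index_name, body=build_page_query(0, 100, sort_field="timestamp"))` would fetch the first page. For indexes larger than the result window, `search_after` or the scroll API is the usual alternative.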


---

## Working with Documents from the Knowledge Base

### 6. Loading Knowledge Base Documents

Let's load the JSON documents from the `Knowledge_Base` directory. These documents will be indexed into Elasticsearch.


In [24]:
import os

In [64]:
knowledge_base_dir = 'Knowledge_Base'

def load_knowledge_base(directory):
    documents = []
    filename = 'parsed_chat.json'
    with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
        documents.append(json.load(file))
    return documents

In [65]:
knowledge_base_docs = load_knowledge_base(knowledge_base_dir)
print(f"Loaded {len(knowledge_base_docs)} documents from the knowledge base.")


Loaded 1 documents from the knowledge base.
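
`load_knowledge_base` reads a single hard-coded file, which is why it reports one loaded document regardless of how many records that file holds. If the directory later contains several JSON files, a loader along the following lines could pick them all up. This is a sketch under that assumption; `load_all_json` is a hypothetical name, and it returns one parsed object per file, the same shape the original function produces.

```python
import json
import os


def load_all_json(directory):
    """Load every .json file in `directory`, returning one parsed object per file."""
    loaded = []
    # Sort the filenames so the load order is reproducible
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".json"):
            continue  # Skip files that are not JSON
        path = os.path.join(directory, filename)
        with open(path, "r", encoding="utf-8") as file:
            loaded.append(json.load(file))
    return loaded
```

Calling `load_all_json(knowledge_base_dir)` on a directory containing only `parsed_chat.json` would give the same result as the cell above.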


### 7. Creating an Index for the Knowledge Base

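Next, we'll create an index in Elasticsearch where the knowledge base documents will be stored. Only shard and replica settings are specified here; no explicit mappings are defined, so Elasticsearch infers the types of fields such as `question` and `answer` dynamically when documents are indexed.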


In [78]:
kb_index = 'index_parsed_chat'

kb_index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}

if not db_client.indices.exists(index=kb_index):
    db_client.indices.create(index=kb_index, body=kb_index_settings)
    print(f"Index '{kb_index}' created successfully.")
else:
    print(f"Index '{kb_index}' already exists.")



Index 'index_parsed_chat' already exists.
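
The `kb_index_settings` body above contains only shard and replica settings. If explicit field types are wanted, a `mappings` section can be supplied when the index is created. The sketch below uses the field names that appear in the knowledge base records later in this notebook (`course`, `day`, `question`, `asked_by`, `answer`, `answered_by`); the type chosen for each field is an assumption, and `kb_index_body` is a hypothetical name.

```python
kb_index_body = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            # Full-text fields: analyzed so that match and multi_match can search them
            "question": {"type": "text"},
            "answer": {"type": "text"},
            # Exact-value fields: keyword suits term filters and aggregations
            "course": {"type": "keyword"},
            "day": {"type": "keyword"},
            "asked_by": {"type": "keyword"},
            "answered_by": {"type": "keyword"},
        }
    },
}
```

This body would be passed at creation time, for example `db_client.indices.create(index=kb_index, body=kb_index_body)`. The type of an existing field cannot be changed in place, so an index that already exists would have to be recreated or reindexed for these mappings to take effect.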


### 8. Indexing the Knowledge Base Documents

We'll now index the loaded documents into the newly created index. We'll use the bulk API for efficient indexing.


In [86]:
def prepare_bulk_indexing(docs, index_name):
    actions = []
    for i, doc_list in enumerate(docs):
        # Each entry is expected to be a list of records; only its first record is indexed
        if isinstance(doc_list, list) and doc_list:
            doc = doc_list[0]
            
            action = {
                "_index": index_name,
                "_id": i + 1,  # Use the document's position as the ID
                "_source": doc
            }
            actions.append(action)
    return actions


In [87]:
# Prepare the documents for bulk indexing
bulk_actions = prepare_bulk_indexing(knowledge_base_docs, kb_index)


In [88]:
res = bulk(db_client, bulk_actions)
print(f"Bulk indexing completed: {res[0]} documents indexed.")

Bulk indexing completed: 1 documents indexed.
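
Only one document is indexed because `prepare_bulk_indexing` keeps just the first record of each loaded list. If the goal is to index every record in the file (an assumption about intent), the nested lists can be flattened first. The following is a minimal sketch; `prepare_bulk_actions_flat` is a hypothetical name, and it assigns sequential IDs across all records.

```python
def prepare_bulk_actions_flat(docs, index_name):
    """Build one bulk action per record, flattening any nested lists."""
    actions = []
    next_id = 1
    for entry in docs:
        # A loaded file may hold a single record or a list of records
        records = entry if isinstance(entry, list) else [entry]
        for record in records:
            if not isinstance(record, dict):
                continue  # Skip anything that is not a JSON object
            actions.append({"_index": index_name, "_id": next_id, "_source": record})
            next_id += 1
    return actions
```

The result can be passed to `bulk(db_client, ...)` exactly like `bulk_actions` above.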


### 9. Searching the Knowledge Base

Now that the documents are indexed, we can search them to retrieve the documents most relevant to a query. We'll start with a basic search that looks for the query terms in the `question` and `answer` fields.



In [92]:
search_query = {
    "query": {
        "multi_match": {
            "query": "where can i find the notes?", 
            "fields": ["question", "answer"]  
        }
    }
}


In [93]:

res = db_client.search(index=kb_index, body=search_query)
print(f"Search returned {res['hits']['total']['value']} results:")

for hit in res['hits']['hits']:
    print(json.dumps(hit['_source'], indent=4))


Search returned 1 results:
{
    "course": "GCS Certification",
    "day": "2",
    "question": "Yesterday I'd completed 1st lab i.e. Exploring a BigQuery a puvlic dataset. without any problem. But the 2nd lab is lot of confusing. There is a mismatch between on screen instructions and actual lab. Some options are missing which are in Lab but not in instructions.",
    "asked_by": "Yogeshwar Dayal Gaju",
    "answer": "Hi, thanks for your feedback. Could you please share what part was the issue? The cloud industry is evolving so quickly that changes are often not reflected in the manual.",
    "answered_by": "00 Hwan-Tae Kim"
}
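
Besides `_source`, each hit carries its `_id` and a relevance `_score`, which help when judging how well a result matches the query. The sketch below pulls these out of a search response. `summarize_hits` is a hypothetical helper, and it assumes the standard response shape already used above (`res['hits']['hits']`).

```python
def summarize_hits(response, field="question", max_chars=60):
    """Return (id, score, snippet) tuples for the hits in a search response."""
    summaries = []
    for hit in response["hits"]["hits"]:
        text = str(hit["_source"].get(field, ""))
        # Truncate long field values so each summary stays on one line
        snippet = text if len(text) <= max_chars else text[:max_chars] + "..."
        summaries.append((hit["_id"], hit["_score"], snippet))
    return summaries
```

For example, `for doc_id, score, snippet in summarize_hits(res): print(doc_id, score, snippet)` would print one line per hit.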


### 10. Filtering Search Results
You can refine search results by adding a `filter` clause to a `bool` query. Here the full-text match is restricted to documents from a single course.

In [100]:
search_query_with_filter = {
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": "how do i access the labs",  
                    "fields": ["question", "answer"]
                }
            },
            "filter": [
                {"term": {"course": "GCS Certification"}} 
            ]
        }
    }
}



In [101]:
res = db_client.search(index=kb_index, body=search_query_with_filter)
print(f"Filtered search returned {res['hits']['total']['value']} results:")

for hit in res['hits']['hits']:
    print(json.dumps(hit['_source'], indent=4))

Filtered search returned 1 results:
{
    "course": "GCS Certification",
    "day": "2",
    "question": "Yesterday I'd completed 1st lab i.e. Exploring a BigQuery a puvlic dataset. without any problem. But the 2nd lab is lot of confusing. There is a mismatch between on screen instructions and actual lab. Some options are missing which are in Lab but not in instructions.",
    "asked_by": "Yogeshwar Dayal Gaju",
    "answer": "Hi, thanks for your feedback. Could you please share what part was the issue? The cloud industry is evolving so quickly that changes are often not reflected in the manual.",
    "answered_by": "00 Hwan-Tae Kim"
}
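
A `term` filter compares against the exact stored value without analysis, so it behaves as expected on `keyword` fields. If `course` were dynamically mapped as `text`, the exact value would typically be available under the `course.keyword` sub-field instead. The sketch below makes the filter field a parameter; `build_filtered_query` is a hypothetical helper.

```python
def build_filtered_query(text, course, course_field="course",
                         fields=("question", "answer")):
    """Build a bool query: scored full-text match plus an exact course filter."""
    return {
        "query": {
            "bool": {
                "must": {"multi_match": {"query": text, "fields": list(fields)}},
                # Filter clauses restrict results without affecting relevance scores
                "filter": [{"term": {course_field: course}}],
            }
        }
    }
```

With the defaults, `build_filtered_query("how do i access the labs", "GCS Certification")` reproduces the query above. Passing `course_field="course.keyword"` targets the sub-field variant.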


### 11. Listing All Indexes
To list all indexes in the Elasticsearch cluster, you can use the `cat.indices` API.

In [106]:
# List all indexes in the Elasticsearch database
all_indexes = db_client.cat.indices(format="json")
print("All Indexes:")
for index in all_indexes:
    print(f"Index Name: {index['index']}, Document Count: {index['docs.count']}, Size: {index['store.size']}")


All Indexes:
Index Name: parsed_chat, Document Count: 0, Size: 225b
Index Name: index_parsed_chat, Document Count: 1, Size: 9.9kb
Index Name: doc_index, Document Count: 3, Size: 4.3kb
Index Name: parsed_chat_index, Document Count: 329, Size: 265.7kb
Index Name: faq_documents_index, Document Count: 0, Size: 225b


In [105]:
# List all indexes together with their mappings and settings
all_indexes = db_client.cat.indices(format="json")

print("All Indexes:")
for index in all_indexes:
    # Print basic information about the index
    print(f"Index Name: {index['index']}, Document Count: {index['docs.count']}, Size: {index['store.size']}")
    
    # Get and print the mappings for the index
    mappings_response = db_client.indices.get_mapping(index=index['index'])
    mappings = mappings_response.body if hasattr(mappings_response, 'body') else mappings_response
    print("Mappings:")
    print(json.dumps(mappings, indent=4))
    
    # Get and print the settings for the index
    settings_response = db_client.indices.get_settings(index=index['index'])
    settings = settings_response.body if hasattr(settings_response, 'body') else settings_response
    print("Settings:")
    print(json.dumps(settings, indent=4))
    
    print("\n" + "="*80 + "\n")


All Indexes:
Index Name: parsed_chat, Document Count: 0, Size: 225b
Mappings:
{
    "parsed_chat": {
        "mappings": {
            "properties": {
                "answer": {
                    "type": "keyword"
                },
                "course": {
                    "type": "text"
                },
                "day": {
                    "type": "keyword"
                },
                "question": {
                    "type": "text"
                }
            }
        }
    }
}
Settings:
{
    "parsed_chat": {
        "settings": {
            "index": {
                "routing": {
                    "allocation": {
                        "include": {
                            "_tier_preference": "data_content"
                        }
                    }
                },
                "number_of_shards": "1",
                "provided_name": "parsed_chat",
                "creation_date": "1724576547599",
                "number_of_replicas"