# **Hybrid Retriever: Combining Dense and Sparse Retrievers**

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain.schema import Document

In [None]:
# Step 1: Sample documents
docs = [
    Document(page_content="LangChain helps build LLM applications."),
    Document(page_content="Pinecone is a vector database for semantic search."),
    Document(page_content="The Eiffel Tower is located in Paris."),
    Document(page_content="Langchain can be used to develop agentic ai application."),
    Document(page_content="Langchain has many types of retrievers.")
]

# Step 2: Dense Retriever (FAISS + HuggingFace)
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
dense_vectorstore = FAISS.from_documents(docs, embedding_model)
dense_retriever = dense_vectorstore.as_retriever()

In [None]:
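# Step 3: Sparse Retriever (BM25)
sparse_retriever = BM25Retriever.from_documents(docs)
sparse_retriever.k = 3  # top-k documents to retrieve

# Step 4: Combine with Ensemble Retriever
# The keyword is `weights` (plural). The repr shown below was recorded with the
# misspelling `weight=`, which left the default equal weights [0.5, 0.5] in place.
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, sparse_retriever],
    weights=[0.7, 0.3],
)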

In [5]:
hybrid_retriever

EnsembleRetriever(retrievers=[VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000002227A2DFA10>, search_kwargs={}), BM25Retriever(vectorizer=<rank_bm25.BM25Okapi object at 0x000002227A5A9A90>, k=3)], weights=[0.5, 0.5])

In [6]:
# Step 5: Query and get results
query = "How can I build an application using LLMs?"
results = hybrid_retriever.invoke(query)

# Step 6: Print results
for i, doc in enumerate(results):
    print(f"\n🔹 Document {i+1}:\n{doc.page_content}")


🔹 Document 1:
LangChain helps build LLM applications.

🔹 Document 2:
Langchain can be used to develop agentic ai application.

🔹 Document 3:
Langchain has many types of retrievers.

🔹 Document 4:
Pinecone is a vector database for semantic search.
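The result list above is the union of what the two retrievers returned, merged into one ranking. LangChain documents `EnsembleRetriever` as using weighted Reciprocal Rank Fusion (RRF) for this merge. The block below is a simplified sketch of that idea, not LangChain's actual code: the function name `weighted_rrf`, the document ids, and the rankings are hypothetical, and the constant `c=60` is assumed to match the library default.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more the higher it ranks in each list
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a dense and a sparse retriever
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([dense_ranking, sparse_ranking], weights=[0.7, 0.3])
print(fused)
```

Because RRF works on ranks rather than raw scores, the dense and sparse scores never have to be put on a common scale.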


# **RAG Pipeline with hybrid retriever**

In [8]:
from langchain.chat_models import init_chat_model
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain

In [12]:
# Step 7: Prompt template
prompt = PromptTemplate.from_template("""
Answer the question based on the context below.

Context:
{context}

Question: {input}
""")

# Step 8: LLM
llm = init_chat_model("openai:gpt-3.5-turbo", temperature=0.2)
llm

ChatOpenAI(client=<openai.resources.chat.completions.completions.Completions object at 0x000002237E1BE710>, async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x000002237E1BEC10>, root_client=<openai.OpenAI object at 0x000002237E1BDE50>, root_async_client=<openai.AsyncOpenAI object at 0x000002237E1BE5D0>, temperature=0.2, model_kwargs={}, openai_api_key=SecretStr('**********'))

In [13]:
# Create the stuff-documents chain (stuffs all retrieved documents into the prompt)
document_chain = create_stuff_documents_chain(llm=llm, prompt=prompt)

# Create the full RAG chain
rag_chain = create_retrieval_chain(retriever=hybrid_retriever, combine_docs_chain=document_chain)
rag_chain


RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableBinding(bound=RunnableLambda(lambda x: x['input'])
           | EnsembleRetriever(retrievers=[VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000002227A2DFA10>, search_kwargs={}), BM25Retriever(vectorizer=<rank_bm25.BM25Okapi object at 0x000002227A5A9A90>, k=3)], weights=[0.5, 0.5]), kwargs={}, config={'run_name': 'retrieve_documents'}, config_factories=[])
})
| RunnableAssign(mapper={
    answer: RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
              context: RunnableLambda(format_docs)
            }), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
            | PromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, template='\nAnswer the question based on the context below.\n\nContext:\n{context}\n\nQuestion: {input}\n')
            | ChatOpenAI(client=

In [14]:
# Step 9: Ask a question
query = {"input": "How can I build an app using LLMs?"}
response = rag_chain.invoke(query)

# Step 10: Output
print("✅ Answer:\n", response["answer"])

print("\n📄 Source Documents:")
for i, doc in enumerate(response["context"]):
    print(f"\nDoc {i+1}: {doc.page_content}")

✅ Answer:
 You can build an app using LLMs by utilizing LangChain, which helps in developing LLM applications. LangChain can be used to develop agentic AI applications, and it offers various types of retrievers to enhance the functionality of your app. Additionally, you can also consider using Pinecone, a vector database for semantic search, to further improve the performance of your LLM-based app.

📄 Source Documents:

Doc 1: LangChain helps build LLM applications.

Doc 2: Langchain can be used to develop agentic ai application.

Doc 3: Langchain has many types of retrievers.

Doc 4: Pinecone is a vector database for semantic search.


# Notes

## **Hybrid Retriever — Combining Dense and Sparse Retrieval**

**Overview**

A **Hybrid Retriever** blends the strengths of **dense (vector-based)** and **sparse (keyword-based)** retrieval systems to produce **more accurate, context-aware, and robust search results**.
This approach is ideal when you need both **semantic understanding** (via embeddings) and **exact keyword matching** (via term frequency relevance).

---

**1. Why Hybrid Retrieval?**

Each retrieval method has strengths and weaknesses:

| Retrieval Type       | Description                                                         | Strengths                                        | Weaknesses                                  |
| -------------------- | ------------------------------------------------------------------- | ------------------------------------------------ | ------------------------------------------- |
| **Dense (Vector)**   | Uses embeddings to represent semantic meaning of text               | Captures **context** and **semantic similarity** | May miss **exact keyword matches**          |
| **Sparse (Keyword)** | Uses token-based models (like TF-IDF, BM25) for exact term matching | Strong for **exact keyword relevance**           | Fails at understanding **semantic meaning** |

💡 **Hybrid retrieval** merges both, leveraging **semantic embeddings** (e.g., OpenAI, Hugging Face models) and **lexical scores** (e.g., BM25 as used by Elasticsearch), so that each method covers the other's blind spots.

---

**2. Core Concept**

Let:

* $S_d(q, x)$ = dense similarity score (cosine similarity between embeddings)
* $S_s(q, x)$ = sparse similarity score (BM25, TF-IDF, etc.)

Then the **hybrid score** can be defined as:

$$S_{hybrid}(q, x) = \alpha \times S_d(q, x) + (1 - \alpha) \times S_s(q, x)$$

where

* $\alpha \in [0,1]$ controls the weighting between dense and sparse results.
* Example: $\alpha = 0.7$ → 70% semantic, 30% keyword importance.
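The weighted sum only behaves as intended when both scores live on a comparable scale: cosine similarity is bounded, while BM25 scores are not. The sketch below assumes min-max normalization per result list before mixing. That normalization step, the function names, and the sample scores are illustrative assumptions, not part of the formula above.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(dense, sparse, alpha=0.7):
    """Per-document alpha * dense + (1 - alpha) * sparse on normalized scores."""
    d, s = min_max(dense), min_max(sparse)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]

# Hypothetical raw scores for three documents
dense_raw = [0.82, 0.40, 0.65]   # cosine similarities
sparse_raw = [1.2, 7.5, 3.1]     # BM25 scores (unbounded)

scores = hybrid_scores(dense_raw, sparse_raw, alpha=0.7)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

With `alpha=0.7` the first document wins on its dense score even though the second one has the strongest keyword match. Lowering `alpha` shifts the balance the other way.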

---

**3. Implementing Hybrid Search in Pinecone**

Pinecone provides **native support for hybrid retrieval**, allowing you to store **both dense and sparse vectors** in a **single index**.

**Step-by-Step Implementation**

**(1) Create a Hybrid Index**

```python
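# Note: this uses the legacy `pinecone.init` client interface; newer client
# releases replace it with a `Pinecone(api_key=...)` object.
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="gcp-starter")

index_name = "hybrid-search"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=384,        # must match the dense model (all-MiniLM-L6-v2 outputs 384 dims)
        metric="dotproduct"   # sparse-dense (hybrid) indexes require the dot product metric
    )

index = pinecone.Index(index_name)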
```

---

**(2) Prepare Dense and Sparse Representations**

You can use:

* **Dense** → Sentence Transformers, OpenAI embeddings, etc.
* **Sparse** → BM25 or SPLADE (sparse transformer-based encoder)

```python
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np

# Dense embedding model
dense_model = SentenceTransformer('all-MiniLM-L6-v2')

# Example corpus
corpus = ["Hybrid retrieval combines dense and sparse models",
          "Pinecone supports hybrid vector search",
          "BM25 is a sparse retrieval algorithm"]

# Sparse (BM25)
bm25 = BM25Okapi([doc.split() for doc in corpus])

# Dense embeddings
dense_vectors = dense_model.encode(corpus)
```

---

**(3) Insert Combined Representations into Pinecone**

Each vector includes **dense + sparse components**:

```python
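# Example: dense + sparse vector upload
# Sparse values belong in the record's `sparse_values` field (not in metadata),
# with one index per vocabulary term.
vocab = {term: i for i, term in enumerate(sorted(bm25.idf))}

def sparse_encode(text):
    """Simplified sparse encoding: BM25 IDF weight for each known term.
    A dedicated BM25 or SPLADE encoder would be used in practice."""
    terms = sorted({t for t in text.split() if t in vocab})
    return {"indices": [vocab[t] for t in terms],
            "values": [float(bm25.idf[t]) for t in terms]}

upserts = []
for i, doc in enumerate(corpus):
    upserts.append({
        "id": str(i),
        "values": dense_vectors[i].tolist(),
        "sparse_values": sparse_encode(doc),
        "metadata": {"text": doc},
    })

index.upsert(vectors=upserts)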
```

---

**(4) Hybrid Querying**

```python
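query = "semantic and keyword search"
alpha = 0.7  # weight on the dense component

# `index.query` has no `alpha` argument, so the weighting is applied
# client-side by scaling the two query vectors.
dense_q = (alpha * dense_model.encode(query)).tolist()

# Sparse query: known terms weighted by BM25 IDF, scaled by (1 - alpha)
vocab = {term: i for i, term in enumerate(sorted(bm25.idf))}
q_terms = sorted({t for t in query.split() if t in vocab})
sparse_q = {"indices": [vocab[t] for t in q_terms],
            "values": [(1 - alpha) * float(bm25.idf[t]) for t in q_terms]}

hybrid_results = index.query(
    vector=dense_q,
    sparse_vector=sparse_q,
    top_k=5,
    include_metadata=True,
)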
```

---

**4. Weighted Scoring Techniques**

The **balance parameter (α)** controls the influence of each retriever:

| α   | Retrieval Focus | Typical Use Case                                  |
| --- | --------------- | ------------------------------------------------- |
| 0.0 | Purely Sparse   | Keyword search, factual lookup                    |
| 0.5 | Balanced        | Hybrid QA, semantic retrieval with term grounding |
| 1.0 | Purely Dense    | Conversational AI, RAG, semantic reasoning        |

You can **tune α dynamically** based on query intent — for example:

* If query has **rare keywords**, emphasize sparse (α ↓).
* If query is **semantic**, emphasize dense (α ↑).
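One way to make the dynamic tuning above concrete is a rule based on document frequency: if any query term occurs in only a small share of the corpus, lean on the sparse retriever. The sketch below is a hypothetical heuristic. The function name `choose_alpha`, the 5% rarity threshold, and the two α values are assumptions picked for illustration, not recommended settings.

```python
import math

def choose_alpha(query_tokens, doc_freq, n_docs, rare_ratio=0.05,
                 alpha_semantic=0.8, alpha_keyword=0.3):
    """Pick a lower alpha (more sparse weight) when the query contains rare terms."""
    rare_cutoff = max(1, math.floor(rare_ratio * n_docs))
    has_rare_term = any(
        0 < doc_freq.get(tok, 0) <= rare_cutoff for tok in query_tokens
    )
    return alpha_keyword if has_rare_term else alpha_semantic

# Hypothetical document frequencies in a 1,000-document corpus
doc_freq = {"how": 900, "build": 300, "app": 250, "faiss": 12}

print(choose_alpha(["how", "build", "app"], doc_freq, n_docs=1000))  # common terms only
print(choose_alpha(["faiss", "build"], doc_freq, n_docs=1000))       # contains a rare term
```

Terms that do not occur in the corpus at all are ignored here, since the sparse retriever cannot match them anyway.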

---

**5. Real-World Applications**

| Use Case                                 | Description                                                      |
| ---------------------------------------- | ---------------------------------------------------------------- |
| **Retrieval-Augmented Generation (RAG)** | Combines semantic understanding with factual grounding for LLMs  |
| **Enterprise Search**                    | Merges semantic relevance with company-specific jargon           |
| **E-commerce**                           | Matches product descriptions semantically and by keywords        |
| **Legal or Medical Search**              | Ensures critical keywords are not missed while capturing context |

---

**6. Summary**

| Aspect       | Dense Retriever           | Sparse Retriever     | Hybrid Retriever  |
| ------------ | ------------------------- | -------------------- | ----------------- |
| **Basis**    | Semantic similarity       | Keyword matching     | Combined          |
| **Models**   | Embeddings (BERT, OpenAI) | TF-IDF, BM25, SPLADE | Both              |
| **Speed**    | High                      | High                 | Moderate          |
| **Accuracy** | Semantic                  | Lexical              | Best Overall      |
| **Use Case** | QA, Semantic Search       | Document Lookup      | RAG, Smart Search |

---

**In short:**
👉 A **Hybrid Retriever** in Pinecone merges **dense embeddings** and **sparse keyword vectors**, enabling **precise + meaningful search** — crucial for **RAG systems**, **enterprise knowledge bases**, and **AI search assistants**.

## **TF-IDF (Term Frequency–Inverse Document Frequency)**

**What Is TF-IDF?**

> **TF-IDF** stands for **Term Frequency–Inverse Document Frequency** —
a numerical statistic used to **measure the importance of a word** in a document relative to a collection (corpus).

It’s the foundation for **sparse vector representations**, where each document is represented as a **vector of term weights**.

---

**Formula**

The **TF-IDF score** of a term *t* in document *d* is calculated as:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

where:

1. **Term Frequency (TF):**

   $$\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in } d}{\text{Total number of terms in } d}$$

   > Measures how frequently a word occurs within a document.

2. **Inverse Document Frequency (IDF):**

   $$\text{IDF}(t) = \log \left( \frac{N}{1 + \text{DF}(t)} \right)$$

   where
   * $N$ = total number of documents
   * $\text{DF}(t)$ = number of documents containing the term *t*

   > Reduces the weight of **common terms** and increases the weight of **rare but important terms**.
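The two definitions translate almost directly into code. The following is a minimal sketch that implements the formulas exactly as written above, including the `1 + DF(t)` smoothing. The function names and the four-document toy corpus are made up for illustration.

```python
import math

def term_frequency(term, doc_tokens):
    """TF(t, d): occurrences of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, corpus_tokens):
    """IDF(t) = log(N / (1 + DF(t))), the smoothed form given above."""
    n_docs = len(corpus_tokens)
    doc_freq = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(n_docs / (1 + doc_freq))

def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF(t, d) = TF(t, d) * IDF(t)."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)

# Hypothetical pre-tokenized corpus of four short documents
corpus = [
    ["hybrid", "search", "engine"],
    ["search", "index"],
    ["keyword", "search"],
    ["dense", "vector", "index"],
]

print(tf_idf("hybrid", corpus[0], corpus))  # rare term: positive weight
print(tf_idf("search", corpus[0], corpus))  # common term: weight drops to zero
```

Note that with this smoothing a term that appears in every document gets a slightly negative IDF, which is one reason libraries use other smoothing variants.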

---

**Example**

Let’s say we have a corpus of 3 documents:

| Doc | Text                                            |
| --- | ----------------------------------------------- |
| D1  | "Pinecone provides vector database services"    |
| D2  | "Vector databases are used for semantic search" |
| D3  | "TF-IDF is a sparse retrieval technique"        |

> Step 1 — Tokenize

```
["pinecone", "provides", "vector", "database", "services"]
["vector", "databases", "used", "semantic", "search"]
["tf-idf", "sparse", "retrieval", "technique"]
```

> Step 2 — Compute TF

For D1:

* TF("vector") = 1/5 = 0.2
* TF("pinecone") = 1/5 = 0.2

> Step 3: Compute IDF

Using the natural logarithm and, to keep the numbers easy to follow, the unsmoothed form IDF(t) = ln(N / DF(t)):

| Term      | Appears In | IDF             |
| --------- | ---------- | --------------- |
| vector    | 2          | ln(3/2) ≈ 0.405 |
| pinecone  | 1          | ln(3/1) ≈ 1.099 |
| retrieval | 1          | ln(3/1) ≈ 1.099 |

> Step 4: Compute TF-IDF

For “vector” in D1 → 0.2 × 0.405 ≈ 0.081
For “pinecone” in D1 → 0.2 × 1.099 ≈ 0.220

→ “pinecone” is **more important** to D1.

---

**How TF-IDF Works in Vector Search**

Each document is represented as a **sparse high-dimensional vector**,
where each dimension corresponds to a **word** in the vocabulary, and the value is its **TF-IDF score**.

For example (D3 has only four tokens, so TF("retrieval") = 1/4 = 0.25):

| Term      | D1    | D2    | D3    |
| --------- | ----- | ----- | ----- |
| pinecone  | 0.220 | 0     | 0     |
| vector    | 0.081 | 0.081 | 0     |
| retrieval | 0     | 0     | 0.275 |

Similarity between two documents (or query & doc) is computed via **cosine similarity** between these sparse vectors.
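Because most entries are zero, sparse vectors are usually stored as term-to-weight mappings, and only the terms both vectors share contribute to the dot product. The sketch below shows cosine similarity on that representation. The function name and the sample weights are hypothetical.

```python
import math

def sparse_cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    shared = u.keys() & v.keys()            # only shared terms contribute
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights; zero entries are simply omitted
doc_a = {"pinecone": 0.5, "vector": 0.5}
doc_b = {"vector": 0.5, "search": 0.5}
doc_c = {"retrieval": 1.0}

print(sparse_cosine(doc_a, doc_b))  # one shared term out of two
print(sparse_cosine(doc_a, doc_c))  # no shared terms
```

Two documents with no terms in common always score exactly zero, which is the lexical blind spot that dense embeddings are meant to cover.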

---

**Python Implementation Example**

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Corpus
docs = [
    "Pinecone provides vector database services",
    "Vector databases are used for semantic search",
    "TF-IDF is a sparse retrieval technique"
]

# Create TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Query
query = ["semantic vector database"]

# Transform query to TF-IDF vector
query_vec = vectorizer.transform(query)

# Compute similarity
cosine_sim = cosine_similarity(query_vec, tfidf_matrix)

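print("Similarity Scores:", cosine_sim.round(2))
```

🧾 Output (approximate):

```
Similarity Scores: [[0.46 0.38 0.  ]]
```

→ The query matches best with **Document 1**, which shares the exact tokens "vector" and "database". **Document 2** shares "vector" and "semantic" but is longer, and its "databases" does not match "database" because TF-IDF does no stemming. **Document 3** shares no terms with the query and scores 0.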

---

**Integration in Hybrid Search**

In **hybrid retrieval**, TF-IDF acts as the **sparse retriever**.

For example:

* **Sparse vector** → TF-IDF or BM25 output
* **Dense vector** → BERT, OpenAI, or Hugging Face embeddings
* **Hybrid scoring** → Combine both (using weighted α)

Example hybrid score:

```python
hybrid_score = alpha * dense_similarity + (1 - alpha) * tfidf_similarity
```

---

**Advantages & Disadvantages**

| Aspect             | Pros                                                    | Cons                                     |
| ------------------ | ------------------------------------------------------- | ---------------------------------------- |
| **Speed**          | Fast and efficient for small-medium corpora             | Slower on large corpora without indexing |
| **Explainability** | Transparent scoring (easy to debug)                     | Lacks semantic understanding             |
| **Memory**         | Sparse representation saves space                       | Limited contextual relevance             |
| **Best Use Cases** | Keyword-based search, document filtering, hybrid setups | Semantic tasks (use embeddings instead)  |

---

**Real-World Applications**

* **Search engines** (e.g., Elasticsearch, whose default BM25 scoring builds on TF-IDF)
* **Document similarity & plagiarism detection**
* **Keyword-based filtering in hybrid systems**
* **Metadata matching in RAG pipelines**

---

**Summary**

| Concept               | Description                                               |
| --------------------- | --------------------------------------------------------- |
| **Goal**              | Measure how important a word is to a document in a corpus |
| **Formula**           | TF × IDF                                                  |
| **Similarity Metric** | Cosine similarity                                         |
| **Usage**             | Sparse retrieval, Hybrid RAG, Search systems              |
| **Integration**       | Works alongside embeddings in hybrid retrievers           |


## **BM25 — The Core Ranking Algorithm in Modern Search Engines**

**Introduction**

**BM25 (Best Match 25)** is a **ranking algorithm** used by modern search engines like **Elasticsearch**, **Lucene**, and **Whoosh** to measure how relevant a document is to a user’s search query.
It’s an **improved version of TF-IDF (Term Frequency–Inverse Document Frequency)** and is widely adopted because of its **robustness, interpretability, and effectiveness** in real-world retrieval tasks.

BM25 belongs to the **Okapi family** of probabilistic retrieval models, developed at **City University London** in the 1990s as part of the **Okapi IR System**.

---

**Why BM25?**

Traditional **TF-IDF** gives more importance to documents with high term frequency, but it doesn’t:

* Account for **document length differences** (longer documents unfairly score higher)
* **Cap term frequency influence**, which can over-boost repetitive words
* Provide easy tuning parameters for different datasets

**BM25** solves these issues by:
✅ Introducing **term saturation** — diminishing returns for high word counts
✅ Adding **document length normalization**
✅ Allowing tunable parameters **k₁** and **b** for fine control over ranking

---

**The BM25 Formula**

The BM25 score for a document *d* given a query *q* is calculated as:

$$\text{score}(q, d) = \sum_{t \in q} IDF(t) \times \frac{TF(t, d) \times (k_1 + 1)}{TF(t, d) + k_1 \times (1 - b + b \times \frac{|d|}{avg_d})}$$


Where:

| Symbol       | Meaning                                                      |
| :----------- | :----------------------------------------------------------- |
| **t**        | Term in the query                                            |
| **q**        | Query consisting of multiple terms                           |
| **d**        | Document being scored                                        |
| **TF(t, d)** | Term frequency of t in document d                            |
| **IDF(t)**   | Inverse document frequency of term t                         |
| **\|d\|**    | Length of the document (in words)                            |
| **avg_d**    | Average document length in the corpus                        |
| **k₁**       | Controls term frequency saturation (typical range: 1.2–2.0)  |
| **b**        | Controls document length normalization (default: 0.75)       |
---

> **Breaking Down the Components**

**1. Term Frequency (TF)**

Measures how many times a word appears in a document.

In BM25, this is **non-linear** — it saturates after a certain count.
So, if a word appears 10 times, it doesn’t make the document 10× more relevant than if it appeared once.

$$TF(t, d) \Rightarrow \frac{TF(t, d) \times (k_1 + 1)}{TF(t, d) + k_1 \times (1 - b + b \times \frac{|d|}{avg_d})}$$
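The saturation effect is easiest to see by evaluating this factor for growing term counts. The sketch below assumes a document of exactly average length and the parameters k₁ = 1.2, b = 0.75. The function name is hypothetical.

```python
def bm25_tf_component(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Saturating term-frequency factor from the BM25 formula."""
    length_norm = 1 - b + b * (doc_len / avg_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# Average-length document: the factor grows quickly, then flattens below k1 + 1
for tf in (1, 2, 5, 10, 100):
    print(tf, round(bm25_tf_component(tf, doc_len=100, avg_len=100), 3))
```

Under these settings ten occurrences score less than twice as high as a single one, and the factor never exceeds k₁ + 1 = 2.2 no matter how often the term repeats.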



**2. Inverse Document Frequency (IDF)**

$$IDF(t) = \log \left( \frac{N - n_t + 0.5}{n_t + 0.5} + 1 \right)$$

| Term   | Description                             |
| ------ | --------------------------------------- |
| **N**  | Total number of documents               |
| **nₜ** | Number of documents containing term *t* |

IDF ensures **rare terms** across the corpus are **weighted more**, while common terms like “the” or “is” contribute less.



**3. Document Length Normalization (b parameter)**

| Value of b   | Effect                      |
| ------------ | --------------------------- |
| **b = 0**    | Ignores document length     |
| **b = 1**    | Fully normalizes by length  |
| **b = 0.75** | Balanced approach (default) |

This adjustment ensures that **longer documents** don’t unfairly gain higher scores simply because they contain more words.

---

**Parameter Tuning**

| Parameter | Range   | Effect                                                                   |
| --------- | ------- | ------------------------------------------------------------------------ |
| **k₁**    | 1.2–2.0 | Controls term frequency saturation (higher = less saturation)            |
| **b**     | 0–1     | Controls document length normalization (higher = stronger normalization) |

**Typical Defaults:**

* `k₁ = 1.2`
* `b = 0.75`

You can tune these depending on your dataset:

* Short, concise texts (like tweets): lower `b`
* Long documents: higher `b`
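Putting the pieces together, the sketch below scores a small corpus with the BM25 formula and IDF definition given above, exposing `k1` and `b` so their effect can be tried out. It is a plain-Python illustration under simplifying assumptions (pre-tokenized text, a made-up corpus, a hypothetical function name), not a replacement for a tested library.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Score every document against the query with the BM25 formula above."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of documents each term appears in
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        counts = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = counts[term]
            if tf == 0:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical pre-tokenized corpus
corpus = [
    ["langchain", "helps", "build", "llm", "applications"],
    ["pinecone", "is", "a", "vector", "database"],
    ["langchain", "has", "many", "retrievers"],
]

scores = bm25_scores(["langchain", "retrievers"], corpus)
ranking = sorted(range(len(corpus)), key=scores.__getitem__, reverse=True)
print(scores, ranking)
```

The third document ranks first because it contains both query terms and is shorter than average. Setting `b=0` removes the length effect while keeping the term matching.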

---

**Example Calculation**

Let’s say we have:

* 10,000 documents in total (**N = 10,000**)
* Term “LangChain” appears in 100 documents (**nₜ = 100**)
* Document *d* contains “LangChain” 3 times (**TF = 3**)
* Document length = 500 words (**|d| = 500**)
* Average document length = 250 (**avg_d = 250**)
* **k₁ = 1.5**, **b = 0.75**

**Step 1: Compute IDF**

$$IDF(\text{LangChain}) = \ln \left( \frac{10{,}000 - 100 + 0.5}{100 + 0.5} + 1 \right) = \ln(99.51) \approx 4.60$$

**Step 2: Compute TF Normalization**

$$\frac{3 \times (1.5 + 1)}{3 + 1.5 \times (1 - 0.75 + 0.75 \times \frac{500}{250})} = \frac{7.5}{3 + 1.5 \times 1.75} = \frac{7.5}{5.625} \approx 1.333$$

**Step 3: Final Score**

$$Score = 4.60 \times 1.333 \approx 6.13$$

So, this document’s relevance score for “LangChain” is about **6.13** (using the natural logarithm).
Documents with higher scores will rank higher in search results.

---

**BM25 vs. TF-IDF**

| Feature                           | TF-IDF               | BM25                        |
| --------------------------------- | -------------------- | --------------------------- |
| **Term Saturation**               | Linear               | Non-linear                  |
| **Document Length Normalization** | No                   | Yes                         |
| **Parameters**                    | None                 | k₁, b                       |
| **Accuracy**                      | Moderate             | High                        |
| **Used In**                       | Older search engines | Elasticsearch, Lucene, Solr |

---

**BM25 in Elasticsearch**

Elasticsearch uses BM25 as its **default scoring algorithm**.
You can confirm or modify it in your index settings:

```bash
PUT /my_index
{
  "settings": {
    "similarity": {
      "default": {
        "type": "BM25",
        "k1": 1.2,
        "b": 0.75
      }
    }
  }
}
```

You can also experiment with custom parameters to optimize retrieval for your specific dataset.

---

**Applications of BM25**

* **Search Engines:** Ranking web pages and documents
* **E-commerce:** Product search relevance
* **Chatbots:** Keyword retrieval before vector-based matching
* **RAG Pipelines:** Combining BM25 (sparse) with embeddings (dense)
* **Recommender Systems:** Textual content ranking

---

**Hybrid Search with BM25 + Vectors**

BM25 can be **combined with dense vector embeddings** for **hybrid retrieval** — merging keyword relevance and semantic understanding.

```bash
POST /hybrid_search/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "text": "LangChain RAG" }},
        {
          "script_score": {
            "query": { "match_all": {} },
            "script": {
              "source": "cosineSimilarity(params.vector, 'embedding') + 1.0",
              "params": { "vector": [0.12, 0.45, ...] }
            }
          }
        }
      ]
    }
  }
}
```

This **BM25 + embedding synergy** is the foundation of **RAG pipelines** in systems like **Elasticsearch, Pinecone, and Weaviate**.

---

**Summary**

| Aspect                           | Description                                               |
| -------------------------------- | --------------------------------------------------------- |
| **Full Name**                    | Best Match 25                                             |
| **Type**                         | Probabilistic ranking model                               |
| **Core Improvement Over TF-IDF** | Term frequency saturation + document length normalization |
| **Key Parameters**               | k₁ (term frequency), b (document length)                  |
| **Used In**                      | Elasticsearch, Lucene, Solr, Vespa                        |
| **Ideal Use Cases**              | Text search, retrieval, hybrid RAG                        |
| **Advantages**                   | Accurate, tunable, scalable, interpretable                |



## **Elasticsearch**

**What is Elasticsearch?**

> **Elasticsearch** is a **distributed, RESTful search and analytics engine** built on top of **Apache Lucene**.
It is designed to **store, search, and analyze massive amounts of structured and unstructured data** in near real-time.

Developed by **Elastic NV**, it powers:

* Search engines
* Log analytics (via Elastic Stack / ELK)
* Recommendation systems
* Vector and hybrid retrieval (semantic + keyword)

---

**Key Features**

| Feature                   | Description                                              |
| ------------------------- | -------------------------------------------------------- |
| **Full-Text Search**      | Advanced text analysis and relevance ranking             |
| **Scalability**           | Handles petabytes of data using sharding and replication |
| **Real-Time Indexing**    | Data becomes searchable within seconds of ingestion      |
| **RESTful API**           | Accessible via HTTP/JSON APIs                            |
| **Aggregation Framework** | Enables complex data analytics                           |
| **Vector Search**         | Supports dense vector similarity for AI-powered search   |
| **Schema Flexibility**    | JSON-based documents, schema-free or schema-mapped       |

---

**Core Concepts**

Understanding the building blocks of Elasticsearch:

| Concept      | Description                                                            | Analogy                  |
| ------------ | ---------------------------------------------------------------------- | ------------------------ |
| **Cluster**  | A collection of one or more nodes (servers) that holds the entire data | A database system        |
| **Node**     | A single running instance of Elasticsearch                             | A single database server |
| **Index**    | A logical namespace for related documents                              | A database or table      |
| **Document** | A JSON object containing fields (data)                                 | A row in a table         |
| **Shard**    | A subset of an index that stores part of the data                      | A partition              |
| **Replica**  | A copy of a shard for fault tolerance                                  | A backup                 |

---

**How Elasticsearch Works**

> 1. **Indexing Phase**

Data is sent to Elasticsearch as a **JSON document** through its REST API.

Example:

```bash
POST /library/_doc/1
{
  "title": "Learning LangChain",
  "author": "Panduka Bandara",
  "tags": ["AI", "LLM", "RAG"],
  "year": 2025
}
```

Elasticsearch:

* Tokenizes the text
* Builds an **inverted index** (maps terms → documents)
* Stores metadata for fast lookups



> 2. **Searching Phase**

Queries are made using the **Query DSL** (Domain Specific Language):

```bash
GET /library/_search
{
  "query": {
    "match": {
      "title": "LangChain"
    }
  }
}
```

Results are ranked by **relevance score** using the **BM25 algorithm** (an improvement over TF-IDF).



**Inverted Index: The Heart of Elasticsearch**

An **inverted index** is similar to a dictionary:

* Words (terms) → list of documents that contain them.

Example:

| Term        | Documents |
| ----------- | --------- |
| "LangChain" | 1, 3      |
| "RAG"       | 1, 2      |
| "Python"    | 2, 4      |

When you search for “LangChain RAG,” Elasticsearch quickly finds documents 1, 2, 3 from the inverted index and ranks them.
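The lookup described above can be sketched in a few lines. This is a toy illustration only: the documents are hypothetical texts chosen to reproduce the table, tokenization is a plain lowercase split, and a real engine would also store positions and term frequencies for ranking.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return ids of documents containing at least one query term (OR semantics)."""
    hits = set()
    for term in query.lower().split():
        hits.update(index.get(term, []))
    return sorted(hits)

# Hypothetical documents chosen to reproduce the table above
docs = {
    1: "LangChain RAG",
    2: "RAG Python",
    3: "LangChain",
    4: "Python",
}

index = build_inverted_index(docs)
print(index["langchain"])              # documents containing "LangChain"
print(search(index, "LangChain RAG"))  # candidates for the two-term query
```

The search touches only the posting lists of the query terms, which is why lookups stay fast even when the corpus is large.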

---

**Architecture Overview**

```
+--------------------------------------------------------+
|                   Elasticsearch Cluster                |
|                                                        |
|  +---------------+      +---------------+              |
|  |   Node 1      |      |   Node 2      |              |
|  | (Primary)     |      | (Replica)     |              |
|  +---------------+      +---------------+              |
|         |                        |                    |
|       [Shards]  <----->       [Shards]                |
+--------------------------------------------------------+
```

Each index is divided into **shards**, distributed across nodes for scalability and fault tolerance.
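A rough way to picture how documents end up on different shards is hash-based routing: hash the document id and take the remainder modulo the number of primary shards. The sketch below illustrates that idea only. `zlib.crc32` is a stand-in hash chosen for the example, the function name is hypothetical, and Elasticsearch's real routing uses its own hash function and additional settings.

```python
import zlib

def pick_shard(doc_id, num_primary_shards):
    """Route a document id to a shard with hash(id) % number_of_shards."""
    # crc32 is a stand-in for illustration, not the hash Elasticsearch uses
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

# Distribute six hypothetical document ids over three primary shards
shards = {n: [] for n in range(3)}
for doc_id in ("1", "2", "3", "4", "5", "6"):
    shards[pick_shard(doc_id, 3)].append(doc_id)
print(shards)
```

Because the shard is derived from the id, the same document always maps to the same shard. This also suggests why the number of primary shards cannot simply be changed after documents have been routed.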

---

**Search Algorithms and Scoring**

Elasticsearch uses the **BM25** ranking function, which improves on TF-IDF:

$$\text{score}(q, d) = \sum_{t \in q} IDF(t) \times \frac{TF(t, d) \times (k_1 + 1)}{TF(t, d) + k_1 \times (1 - b + b \times \frac{|d|}{avg_d})}$$

* **TF** → Term frequency in document
* **IDF** → Inverse document frequency
* **k₁, b** → Tuning parameters controlling term saturation and length normalization

This results in **relevance scores** that rank the most relevant documents higher.

---

**Vector Search in Elasticsearch**

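Elasticsearch supports **dense vector fields** for **semantic and hybrid search**. The `dense_vector` field type dates back to the 7.x series, and since **v8.0** such fields can be indexed for approximate nearest-neighbor (kNN) search.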

> Example: Creating an Index with Vector Fields

```bash
PUT /semantic_index
{
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "embedding": { 
        "type": "dense_vector",
        "dims": 1536,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
```

> Inserting Data

```bash
POST /semantic_index/_doc/1
{
  "text": "LangChain integrates with vector databases like Pinecone.",
  "embedding": [0.12, 0.33, ...]  // vector from OpenAI or Hugging Face
}
```

> Querying by Vector Similarity

```bash
POST /semantic_index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
        "params": { "query_vector": [0.14, 0.31, ...] }
      }
    }
  }
}
```

This enables **semantic retrieval**, where documents are matched by meaning rather than exact keywords.
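To see what the script in the query computes, the sketch below reproduces the same arithmetic in plain Python: cosine similarity between the query vector and each stored embedding, shifted by `+ 1.0` because Elasticsearch does not allow negative scores. The three-dimensional vectors and document names are hypothetical. The real engine evaluates this inside the cluster.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def script_score(query_vector, embedding):
    """Mirror of the script above: cosine similarity shifted into [0, 2]."""
    return cosine_similarity(query_vector, embedding) + 1.0

# Hypothetical 3-dimensional embeddings (real ones have hundreds of dimensions)
stored = {
    "doc_1": [1.0, 0.0, 0.0],
    "doc_2": [0.0, 1.0, 0.0],
    "doc_3": [-1.0, 0.0, 0.0],
}
query_vector = [1.0, 0.0, 0.0]

ranked = sorted(stored, key=lambda d: script_score(query_vector, stored[d]), reverse=True)
print([(d, script_score(query_vector, stored[d])) for d in ranked])
```

A `match_all` query with `script_score` evaluates the script for every document, so it is a brute-force scan. That is fine for small indexes but is the reason indexed kNN search exists for larger ones.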

---

**Hybrid Search**

Elasticsearch can combine **sparse** (BM25/TF-IDF) and **dense** (vector) retrieval:

```bash
POST /hybrid_search/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "text": "LangChain embeddings" }},
        {
          "script_score": {
            "query": { "match_all": {} },
            "script": {
              "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
              "params": { "query_vector": [0.15, 0.42, ...] }
            }
          }
        }
      ]
    }
  }
}
```

→ This approach gives both **keyword precision** and **semantic understanding**, ideal for **RAG pipelines**.

---

**Deployment Options**

| Environment          | Description                                                |
| -------------------- | ---------------------------------------------------------- |
| **Self-Managed**     | Install Elasticsearch on local or cloud servers            |
| **Elastic Cloud**    | Managed service provided by Elastic.co                     |
| **Amazon OpenSearch Service** | AWS-managed service for OpenSearch, a fork of Elasticsearch |
| **Kubernetes (ECK)** | Elastic Cloud on Kubernetes for containerized environments |

---

**Integration with LangChain**

LangChain provides direct integration via `ElasticsearchStore`:

```python
from langchain_community.vectorstores import ElasticsearchStore
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = ElasticsearchStore(
    index_name="langchain_docs",
    embedding=embeddings,
    es_url="http://localhost:9200"
)

# Add documents
vectorstore.add_texts(["LangChain integrates with Elasticsearch"])
# Perform search
results = vectorstore.similarity_search("integration with vector stores")
```

This forms the foundation of **RAG (Retrieval-Augmented Generation)** pipelines.

---

**Security and Access Control**

Elasticsearch offers multiple security features:

* **API Key Authentication**
* **Role-Based Access Control (RBAC)**
* **TLS Encryption for communication**
* **Index-level and field-level access policies**
* **Audit logging and monitoring**

---

**Real-World Applications**

| Domain                  | Use Case                                        |
| ----------------------- | ----------------------------------------------- |
| **E-commerce**          | Product search and recommendations              |
| **Log Analytics (ELK)** | Centralized monitoring with Logstash and Kibana |
| **Chatbots**            | Context retrieval for LLMs                      |
| **Cybersecurity**       | Threat detection and log correlation            |
| **Knowledge Graphs**    | Semantic document retrieval                     |

---

**Summary**

| Aspect                | Description                                           |
| --------------------- | ----------------------------------------------------- |
| **Type**              | Distributed search & analytics engine                 |
| **Core Strength**     | Full-text + semantic + hybrid retrieval               |
| **Underlying Engine** | Apache Lucene                                         |
| **Supports**          | Sparse, dense, and hybrid vector search               |
| **Best Use Cases**    | Enterprise search, RAG, log analytics, recommendation |
| **Integrations**      | LangChain, OpenAI, Hugging Face, Pinecone, Kibana     |